# Through the Lens of Truth: Analyzing and Detecting Fake News

## About the organization

![TruthLens](https://drive.google.com/uc?export=view&id=1BSdTj6PVZwEnSCqucDa5DVHUGcPV_6jK)

The TruthLens Institute is a pioneering research organization dedicated to combating misinformation and fostering digital literacy worldwide. Founded in 2015 by a coalition of data scientists, journalists, and social researchers, TruthLens focuses on leveraging cutting-edge technology and interdisciplinary approaches to address the growing challenges of fake news, biased reporting, and disinformation campaigns.

### Mission:

To empower individuals and organizations with tools, insights, and strategies to identify and mitigate the spread of false or misleading information.

### Core Focus Areas:

- Data-Driven Research: Analyzing large datasets to uncover patterns and trends in misinformation.
- Technology Development: Creating AI-driven solutions to detect and counteract fake news in real time.
- Public Education: Offering workshops, webinars, and toolkits to enhance critical thinking and digital literacy.
- Policy Advocacy: Collaborating with governments and tech companies to implement ethical frameworks for content moderation.

### Impact:

Over the years, TruthLens has partnered with global organizations like the United Nations, educational institutions, and social media platforms to amplify its efforts. Its groundbreaking studies have shaped public discourse and influenced policymaking in the realm of digital ethics and media integrity.

**Why the Name "TruthLens"?**

The name reflects the organization’s mission to provide a clear, unbiased lens through which to view the information landscape. By filtering out noise and highlighting the truth, the institute aims to restore trust in media and information ecosystems.

## Project Introduction

As part of your commitment to making the world a better place, you volunteer 8-10 hours a week as a data scientist for TruthLens, a research organization, to help tackle misinformation and understand its viral nature. Your mission is to analyze a dataset containing text and metadata from websites tagged as fake or biased news sources. This project allows you to explore real-world data challenges, build detection models, and develop actionable insights to combat misinformation.

This project focuses on exploring, cleaning, and analyzing a dataset containing text and metadata scraped from 244 websites. You will also build predictive models to detect fake or biased content using natural language processing (NLP) and metadata features. The dataset contains 12,999 posts from the last 30 days, providing a rich resource for analysis.

In addition to technical skills, you will also reflect on the nuances of detecting misinformation, the ethical challenges of labeling data, and potential improvements for the dataset.

## Objectives

The main objectives of this project are:

- Data Exploration: Understand the structure, distribution, and nuances of the dataset.
- Data Cleaning: Handle missing or inconsistent labels and clean text data for analysis.
- Feature Engineering: Extract meaningful features from both text and metadata.
- Model Development: Build and evaluate machine learning models to detect fake or biased news.
- Insights and Recommendations: Provide actionable insights and propose potential improvements for misinformation detection systems.

## About the dataset

The dataset contains:

- Text Data: Articles or posts from websites.
- Metadata: Information such as timestamps, URLs, and labels (e.g., "bs").
- Labels: Predefined tags from the BS Detector extension indicating the type of fake or biased content.

**You can find the dataset here: "[fake.csv](https://drive.google.com/file/d/1VvlmdPSD8E3mNm-7O6CABGFM8qSAcZcT/view?usp=drive_link)"**

## Task

**Phase 1: Data Exploration and Cleaning**

- Load the dataset, examine its structure (e.g., columns, data types, missing values), and remove unnecessary columns.
- Preprocess the text data (e.g., remove stopwords and punctuation, and perform tokenization).
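The Phase 1 steps can be sketched with the standard library alone. This is a minimal illustration, not a prescribed method: the column names (`text`, `site_url`, `type`), the in-memory sample rows, and the tiny stopword list are all assumptions made for the example, so check them against the actual columns in `fake.csv` before reusing anything.

```python
import csv
import io
import string

# Assumption: a deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in",
             "is", "it", "that", "this", "for", "on", "with"}

# Map every ASCII punctuation character to a space.
# Note: string.punctuation does not cover curly quotes or other Unicode marks.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def missing_counts(rows, columns):
    """Count empty or whitespace-only cells per column."""
    return {c: sum(1 for r in rows if not (r.get(c) or "").strip())
            for c in columns}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stopwords."""
    if not text:
        return []
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    return [t for t in tokens if t not in STOPWORDS]


# Hypothetical sample standing in for the real file; column names are assumed.
SAMPLE_CSV = '''text,site_url,type
"BREAKING: The secret plan is exposed, and it is shocking!",example.com,bs
,example.org,bias
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = missing_counts(rows, ["text", "site_url", "type"])
tokens = preprocess(rows[0]["text"])
print(missing)
print(tokens)
```

For the full dataset, the same `csv.DictReader` call can wrap a file opened with `open(path, newline="", encoding="utf-8")`, where the path and encoding are again assumptions to verify.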

**Phase 2: Feature Engineering**

- Extract key features from the text, such as word count, sentiment, and term frequency. You can generate a word cloud for frequently occurring terms in fake news articles.
- Extract metadata-based features (e.g., domain, publication time patterns). Consider checking whether specific domains contribute more fake news than others.
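A rough sketch of a few Phase 2 features follows: word count and term frequency from tokens, the domain from a URL, the publication hour from a timestamp, and a per-domain post count. The field names (`site_url`, `published`), the sample posts, and the ISO 8601 timestamp format are assumptions for illustration. Sentiment scoring and word clouds need third-party tooling and are not covered here.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse


def text_features(tokens):
    """Basic text features from a list of tokens."""
    counts = Counter(tokens)
    return {
        "word_count": len(tokens),
        "unique_words": len(counts),
        "top_terms": counts.most_common(2),
    }


def extract_domain(url):
    """Return the host part of a URL, without a leading 'www.'."""
    # urlparse only fills netloc when a scheme or a leading '//' is present.
    if "://" not in url:
        url = "//" + url
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def publication_hour(timestamp):
    """Hour of day (0-23) from an ISO 8601 timestamp string (assumed format)."""
    return datetime.fromisoformat(timestamp).hour


# Hypothetical posts; field names and values are made up for the example.
posts = [
    {"site_url": "http://www.example.com/a", "published": "2016-10-26T21:41:00"},
    {"site_url": "example.com/b", "published": "2016-10-27T03:05:00"},
    {"site_url": "https://example.org/c", "published": "2016-10-27T21:10:00"},
]

features = text_features(["plan", "secret", "plan", "exposed", "plan", "secret"])
domain_counts = Counter(extract_domain(p["site_url"]) for p in posts)
hour_counts = Counter(publication_hour(p["published"]) for p in posts)
print(features)
print(domain_counts.most_common())
print(hour_counts.most_common())
```

Counting posts per domain, as `domain_counts` does, is one simple way to check whether a handful of sites account for most of the flagged content.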

**Phase 3: Model Development**

- Split the data into training and test sets, ensuring balanced distribution of labels.
- Use NLP techniques (e.g., TF-IDF, embeddings) to represent the text data, and compare the performance of different models.
- Evaluate model performance using appropriate metrics (e.g., accuracy, F1-score).
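To make the Phase 3 ideas concrete, here is a standard-library sketch of a stratified train/test split, a textbook TF-IDF weighting (`tf * log(N / df)`), and accuracy and F1 for one positive class. All function names, labels, and sample values are hypothetical. Libraries such as scikit-learn provide tested equivalents, and their TF-IDF uses a smoothed formula, so the numbers will not match this plain variant.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each label's share similar in both sets."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        # At least one test example per label; a label with a single
        # example therefore ends up only in the test set.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    """F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and predictions, for illustration only.
labels = ["bs"] * 10 + ["bias"] * 5
train_idx, test_idx = stratified_split(labels)
weights = tfidf([["fake", "news"], ["real", "news"]])
y_true = ["bs", "bs", "bias", "bs", "bias"]
y_pred = ["bs", "bias", "bias", "bs", "bs"]
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred, positive="bs")
print(len(train_idx), len(test_idx), acc, f1)
```

Because one label may dominate the dataset, accuracy alone can look good for a model that always predicts the majority label. Reporting F1 per label alongside accuracy gives a fairer comparison between models.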

**Phase 4: Insights and Recommendations**

- Analyze the results, discuss the model's strengths and weaknesses, and write a summary of key insights from the model and the dataset.
- Propose ethical considerations and improvements for detecting misinformation. Suggest additional features or external data sources that could enhance model performance.

## Deliverables

- Exploratory Data Analysis (EDA) notebook with visualizations and data cleaning steps. (3 weeks) --> Jupyter notebook
- An organized Jupyter Notebook detailing necessary project phases (2 weeks) --> Jupyter notebook
- Detailed documentation of the entire workflow, insights, and recommendations, including challenges faced and solutions implemented. (2 weeks) --> Microsoft Word document or PDF file

**Timeline = 7 weeks.**