<div align="left" style="background-color: #008080; padding: 20px 10px;">
<h3><b>IDEAS - Institute of Data Engineering, Analytics and Science Foundation</b></h3>
<p>Spring Internship Program 2026</p>
<hr style="width:100%;">
<h3><b>Project Title:</b> Fake News Detection and Evaluation with Confusion Matrix</h3>
<h4>Project Notebook</h4>

<blockquote style="border-left: 4px solid #4285F4; padding-left: 15px;">
  <strong>Created by:</strong> Suprava Das<br>
  <strong>Designation:</strong> Associate Software Developer
</blockquote>
<hr style="width:100%;">
</div>

## Project Goal: Automated Fake News Classification

The core task of this project is to develop and evaluate a machine learning system for identifying fake news.

**Objectives:**
*   Train a classification model on a labeled dataset of news articles.
*   Use the article text as the primary feature (the title, subject, and date columns are dropped during cleaning).
*   Evaluate the model's effectiveness using standard metrics, including a confusion matrix to analyze true vs. false positives and negatives.

---

## Dataset Overview

The project utilizes a publicly available dataset composed of two distinct classes of news content.

*   **Source of True Articles:**
    *   **Website:** `Reuters.com`
    *   **Description:** A reputable, mainstream news provider.

*   **Source of Fake Articles:**
    *   **Websites:** Various platforms identified by PolitiFact and Wikipedia.
    *   **Description:** Sources known for producing unreliable or intentionally false information.

*   **Content Focus:** The articles primarily cover topics related to politics and world news.

*   **Download Link:** The dataset is available on Kaggle at [www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets](https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets).

### Question 1: Import Libraries and Load Data (5 Marks)

Import `pandas` as `pd`. Load the `Fake.csv` and `True.csv` datasets into two separate DataFrames named `fake_df` and `true_df` respectively. Display the first 3 rows of `fake_df`.

**Hint:** Use `pd.read_csv()` for each file. The file paths are `/content/drive/My Drive/IDEAS-TIH/Internship_2025/Fake.csv` and `/content/drive/My Drive/IDEAS-TIH/Internship_2025/True.csv`. Use `.head(3)` to display the rows.

**Expected Output:** A table showing the first 3 rows of the fake news dataset.

In [None]:
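import pandas as pd

# Load each CSV into its own DataFrame. These paths differ from the Drive paths
# in the hint; adjust them to wherever the files are stored.
fake_df = pd.read_csv('/content/fake.csv')
true_df = pd.read_csv('/content/true.csv')

# Preview the first three fake-news rows.
fake_df.head(3)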


|  | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |


### Question 2: Create a 'class' Column (5 Marks)

To prepare for merging, add a new column named `class` to each DataFrame. Assign the integer `1` to this column for `fake_df` and `0` for `true_df`. Display the last 3 rows of `true_df` to verify.

**Hint:** You can create a new column with `df['column_name'] = value`. Use the `.tail(3)` method to see the last rows.

**Expected Output:** A table showing the last 3 rows of the true news dataset with the new 'class' column containing zeros.

In [None]:
# Label the rows before merging: 1 = fake, 0 = true.
fake_df['class'] = 1
true_df['class'] = 0

# Confirm the new column on the last three true-news rows.
true_df.tail(3)

|  | title | text | subject | date | class |
|---|---|---|---|---|---|
| 21414 | Minsk cultural hub becomes haven from authorities | MINSK (Reuters) - In the shadow of disused Sov... | worldnews | August 22, 2017 | 0 |
| 21415 | Vatican upbeat on possibility of Pope Francis ... | MOSCOW (Reuters) - Vatican Secretary of State ... | worldnews | August 22, 2017 | 0 |
| 21416 | Indonesia to buy $1.14 billion worth of Russia... | JAKARTA (Reuters) - Indonesia will buy 11 Sukh... | worldnews | August 22, 2017 | 0 |


### Question 3: Merge and Shuffle DataFrames (5 Marks)

Merge `fake_df` and `true_df` into a single DataFrame called `df`. Then, shuffle the rows of this new DataFrame to randomize the order of true and fake news articles. Display the first 10 rows of the shuffled DataFrame.

**Hint:** Use `pd.concat([df1, df2], axis=0)`. To shuffle, you can use `df.sample(frac=1)`.

**Expected Output:** A table showing 10 random rows from the combined dataset.

In [None]:
# Stack the two DataFrames vertically.
df = pd.concat([fake_df, true_df], axis=0)
# Shuffle all rows (frac=1 samples 100% of them) and rebuild a clean 0..n-1 index.
df = df.sample(frac=1).reset_index(drop=True)

df.head(10)

|  | title | text | subject | date | class |
|---|---|---|---|---|---|
| 0 | Leave It To Ellen DeGeneres To PERFECTLY Mock... | When thinking about who you d likely want as t... | News | March 8, 2016 | 1 |
| 1 | Clinton Campaign Attack Dog Warns Trump And Sa... | Hey, everything is fair game in this war to wi... | politics | Jan 11, 2016 | 1 |
| 2 | President Trump Announces Decision on Paris Cl... | FOX News is announcing that President Trump be... | Government News | Jun 1, 2017 | 1 |
| 3 | New law needed to allow torture victims to sue... | KABUL (Reuters) - Human rights activists are u... | worldnews | August 27, 2017 | 0 |
| 4 | WATCH: DIVISIVE FORMER First Lady MICHELLE OBA... | Michelle Obama is hands-down, the most divisiv... | politics | Oct 4, 2017 | 1 |
| 5 | Spain's King Felipe says committed to Spanish ... | MADRID (Reuters) - Spain s King Felipe VI said... | worldnews | October 3, 2017 | 0 |
| 6 | London's Angel underground station closed due ... | (Reuters) - London s Angel underground station... | worldnews | October 4, 2017 | 0 |
| 7 | New Jersey's Christie vetoes bill seeking Trum... | (Reuters) - New Jersey Governor Chris Christie... | politicsNews | May 1, 2017 | 0 |
| 8 | EU concerned over challenges to Romanian judic... | BUCHAREST (Reuters) - Justice reform has stagn... | worldnews | November 15, 2017 | 0 |
| 9 | BOLD! HOLLYWOOD ACTOR Speaks Up For Trump…Tell... |  | politics | Feb 2, 2017 | 1 |
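
Because `df.sample(frac=1)` draws a fresh random order on every run, the rows shown above will differ between executions. The sketch below shows one way to make the shuffle repeatable by passing `random_state`. The two toy DataFrames (`fake_toy`, `true_toy`) and the seed `42` are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Toy stand-ins for fake_df and true_df (illustrative rows, not from the dataset).
fake_toy = pd.DataFrame({"text": ["a", "b", "c"], "class": 1})
true_toy = pd.DataFrame({"text": ["d", "e", "f"], "class": 0})

combined = pd.concat([fake_toy, true_toy], axis=0)

# random_state pins the shuffle so reruns give the same order;
# reset_index(drop=True) discards the old, now-scrambled index.
shuffled_a = combined.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = combined.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))                 # same seed, same order
print(shuffled_a["class"].value_counts().to_dict())  # class balance is unchanged
```

Checking `value_counts()` after the merge is also a quick way to confirm that neither class was lost along the way.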


### Question 4: Data Cleaning (10 Marks)

Create a new DataFrame `df_clean` by dropping the `title`, `subject`, and `date` columns from `df`. Then, reset the index of `df_clean`. Print the first 5 rows of `df_clean`.

**Hint:** Use the `.drop(columns=['col1', 'col2'])` method. To reset the index, use `.reset_index(drop=True)`.

**Expected Output:** A table with 5 rows and 2 columns ('text' and 'class').

In [None]:
# Keep only the article body and its label.
df_clean = df.drop(columns=['title', 'subject', 'date'])
# Renumber the rows from 0 without keeping the old index as a column.
df_clean = df_clean.reset_index(drop=True)

df_clean.head(5)

|  | text | class |
|---|---|---|
| 0 | When thinking about who you d likely want as t... | 1 |
| 1 | Hey, everything is fair game in this war to wi... | 1 |
| 2 | FOX News is announcing that President Trump be... | 1 |
| 3 | KABUL (Reuters) - Human rights activists are u... | 0 |
| 4 | Michelle Obama is hands-down, the most divisiv... | 1 |


### Question 5: Define a Text Preprocessing Function (10 Marks)

Define a Python function named `wordopt(text)` that takes a string, converts it to lowercase, removes URLs, removes all non-alphanumeric characters (replacing them with spaces), and replaces multiple spaces with a single space. Apply this function to the 'text' column of `df_clean`.

**Hint:** You'll need the `re` library. Use `re.sub(pattern, replacement, text)`. The pattern for non-alphanumeric characters is `[^a-zA-Z0-9]`.

**Expected Output:** No direct output. The 'text' column in `df_clean` will be processed.

In [None]:
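import re

def wordopt(text):
    """Normalize one article: lowercase, strip URLs, keep only letters and digits."""
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)    # drop http:// and https:// URLs
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)   # non-alphanumeric characters become spaces
    text = re.sub(r'\s+', ' ', text)            # collapse runs of whitespace
    return text

# Clean every article in place.
df_clean['text'] = df_clean['text'].apply(wordopt)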
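
To see what each step does, it can help to run `wordopt` on a single string before applying it to the whole column. The function below is copied from the cell above so the example runs on its own. The `sample` string and its `example.com` URL are made up for illustration.

```python
import re

def wordopt(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# Hypothetical input with mixed case, a URL, punctuation, and repeated spaces.
sample = "BREAKING: Read more at https://example.com/story?id=7 -- it's   TRUE!"
cleaned = wordopt(sample)

print(repr(cleaned))  # 'breaking read more at it s true '
```

Two side effects are worth noting: the apostrophe splits `it's` into `it s`, and the final `!` leaves a trailing space. Calling `.strip()` on the result would remove the latter if it matters downstream.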

### Question 6: Feature and Target Split (10 Marks)

Define your feature `x` as the 'text' column from `df_clean` and your target `y` as the 'class' column. Then, use `train_test_split` to create `x_train`, `x_test`, `y_train`, and `y_test`. Use a `test_size` of 0.25.

**Hint:** Import `train_test_split` from `sklearn.model_selection`. `x` and `y` will be Pandas Series.

**Expected Output:** No direct output. The data split variables will be created.

In [None]:
from sklearn.model_selection import train_test_split

x = df_clean['text']    # feature: cleaned article text
y = df_clean['class']   # target: 1 = fake, 0 = true

# Hold out 25% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
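
The split above is random on every run, so the scores reported later will shift slightly between executions. As an optional variation, the sketch below fixes `random_state` and passes `stratify` so that both splits keep the same class ratio. The toy `texts` and `labels` lists and the seed `42` are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Toy data: 40 placeholder articles, half fake (1) and half true (0).
texts = [f"article {i}" for i in range(40)]
labels = [1] * 20 + [0] * 20

x_tr, x_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,     # 10 of the 40 rows go to the test set
    random_state=42,    # fixed seed makes the split repeatable
    stratify=labels,    # preserve the 50/50 class ratio in both splits
)

print(len(x_tr), len(x_te))              # 30 10
print(sum(y_te), len(y_te) - sum(y_te))  # 5 5
```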

### Question 7: Text to Vectors using TF-IDF (10 Marks)

Import `TfidfVectorizer` from `sklearn.feature_extraction.text`. Create an instance, fit it on `x_train`, and then transform both `x_train` and `x_test` into numerical vectors named `xv_train` and `xv_test`.

**Hint:** Create an instance `vectorizer = TfidfVectorizer()`. Use `.fit_transform()` on the training data and only `.transform()` on the test data.

**Expected Output:** No direct output. The vectorized training and testing data will be ready.

In [None]:
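from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text only...
xv_train = vectorizer.fit_transform(x_train)
# ...then reuse them on the test text so no test information leaks into the features.
xv_test = vectorizer.transform(x_test)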
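
The difference between `.fit_transform()` and `.transform()` is easier to see on a tiny corpus. In this sketch the vectorizer learns its vocabulary from two training sentences, and a word that appears only in the test sentence is silently ignored. The sentences and the names `train_docs`, `test_docs`, and `vec` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the senate passed the bill", "shocking video goes viral"]
test_docs = ["the senate video leaked"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; no refitting

print(sorted(vec.vocabulary_))               # 8 words, all from train_docs
print(train_matrix.shape, test_matrix.shape) # (2, 8) (1, 8)
print("leaked" in vec.vocabulary_)           # False: unseen at fit time
```

Both matrices share the same 8 columns, which is what lets a model trained on `train_matrix` score `test_matrix`.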

### Question 8: Train and Evaluate a Logistic Regression Model (15 Marks)

Import `LogisticRegression`. Create an instance, train it on the vectorized training data (`xv_train`, `y_train`), and make predictions on `xv_test`. Finally, print the `classification_report` for the model's performance.

**Hint:** Import `classification_report` from `sklearn.metrics`. Follow the instantiate, fit, predict, and report pattern.

**Expected Output:** A text-based classification report for the Logistic Regression model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Instantiate, fit on the training vectors, then predict the held-out set.
lr = LogisticRegression()
lr.fit(xv_train, y_train)
y_pred = lr.predict(xv_test)

# Per-class precision, recall, and F1 (class 1 = fake, class 0 = true).
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.98      5388
           1       0.99      0.98      0.99      5837

    accuracy                           0.98     11225
   macro avg       0.98      0.98      0.98     11225
weighted avg       0.98      0.98      0.98     11225
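
The project goal also calls for a confusion matrix, which breaks the report's summary numbers down into raw counts. A minimal sketch with scikit-learn's `confusion_matrix` follows. The toy label lists are made up and are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration: 1 = fake, 0 = true.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes, both ordered as in `labels`.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
# [[5 1]
#  [1 3]]
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=5 FP=1 FN=1 TP=3
```

Here a false positive is a true article flagged as fake, and a false negative is a fake article that slipped through. In the notebook, the same call should work with `y_test` and `y_pred` in place of the toy lists.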



### Question 9: Train and Evaluate a Decision Tree Model (15 Marks)

Import `DecisionTreeClassifier` from `sklearn.tree`. Create an instance, train it on `xv_train` and `y_train`, predict on `xv_test`, and print the `accuracy_score`.

**Hint:** Import `accuracy_score` from `sklearn.metrics`. The steps are the same as the previous question, but you'll use a different metric function at the end.

**Expected Output:** A single decimal number representing the accuracy of the Decision Tree model.

In [None]:
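from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Same instantiate, fit, predict pattern as the previous question.
dt = DecisionTreeClassifier()
dt.fit(xv_train, y_train)
y_pred = dt.predict(xv_test)   # note: overwrites the Logistic Regression predictions

# Fraction of test articles classified correctly.
print(accuracy_score(y_test, y_pred))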

0.9965256124721603


### Question 10: Hyperparameter Tuning with GridSearchCV (15 Marks)

Your `LogisticRegression` model used default parameters. Let's find better ones using `GridSearchCV`. Import it from `sklearn.model_selection`. Perform a grid search on a `LogisticRegression` model using the `xv_train` and `y_train` data.

Use the following parameter grid:
```python
param_grid = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
```
After fitting, print the `best_params_` found by the grid search.

**Hint:** Instantiate `GridSearchCV(LogisticRegression(), param_grid, cv=3)`. Then, `.fit()` it on the training data. The best parameters are stored in the `.best_params_` attribute of the fitted grid search object.

**Expected Output:** A dictionary showing the best 'C' and 'solver' values found.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Try every C/solver combination with 3-fold cross-validation on the training data.
param_grid = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=3)
grid.fit(xv_train, y_train)

# Combination with the best mean cross-validated score.
print(grid.best_params_)

{'C': 10, 'solver': 'lbfgs'}
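
`best_params_` is only one of the things a fitted grid search exposes. The sketch below, run on synthetic data from `make_classification` rather than the news dataset, shows how the cross-validated score and the refit model could be inspected as well. The synthetic data, the seeds, the name `search`, and `max_iter=1000` (raised here to reduce the chance of `lbfgs` convergence warnings) are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

param_grid = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
# Mean cross-validated accuracy of the winning combination.
print(round(search.best_score_, 3))
# With refit=True (the default), the search object holds the best model
# retrained on all of X_tr, so it can score held-out data directly.
print(round(search.score(X_te, y_te), 3))
```

Comparing `best_score_` with the held-out score gives a rough check that the tuned model generalizes rather than just fitting the cross-validation folds.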
