<a href="https://colab.research.google.com/github/Adi1exe/Fake-News-detection-ML/blob/main/fake_news_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📰 Fake News Detection using TF-IDF and Logistic Regression
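This notebook uses the **LIAR dataset** to classify political statements as **Fake** or **Real** from the statement text alone. The dataset's six truthfulness labels are collapsed into two classes, the text is converted to TF-IDF features, and a logistic regression classifier is trained on them.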

### 📦 Step 1: Import Libraries

In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score


### 📂 Step 2: Load and Combine Dataset (TSV Format)

In [4]:
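from google.colab import files

# Upload the three LIAR dataset files (TSV), one per prompt.
# Note: re-uploading a file whose name already exists makes Colab save it under
# a new name such as "train (2).tsv", so the reads below still use the earlier copy.
uploaded = files.upload()  # Upload train.tsv
uploaded = files.upload()  # Upload test.tsv
uploaded = files.upload()  # Upload valid.tsv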

train_df = pd.read_csv("train.tsv", sep='\t', header=None)
test_df = pd.read_csv("test.tsv", sep='\t', header=None)
valid_df = pd.read_csv("valid.tsv", sep='\t', header=None)

# Combine the three datasets
df = pd.concat([train_df, test_df, valid_df], ignore_index=True)

# Name all 14 columns (the TSV files have no header row)
df.columns = ['ID', 'label', 'statement', 'subject', 'speaker', 'job', 'state', 'party',
              'barely_true_counts', 'half_true_counts', 'mostly_true_counts', 'false_counts',
              'pants_on_fire_counts', 'context']

# Retain only statement and label columns
df = df[['statement', 'label']]
df.head()



| | statement | label |
|---|---|---|
| 0 | Says the Annies List political group supports ... | false |
| 1 | When did the decline of coal start? It started... | half-true |
| 2 | Hillary Clinton agrees with John McCain "by vo... | mostly-true |
| 3 | Health care reform legislation is likely to ma... | false |
| 4 | The economic turnaround started at the end of ... | half-true |
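
For readers without the dataset at hand, the loading step above can be tried on an in-memory sample. This is a minimal sketch, assuming the same 14-column headerless layout; the two rows in `sample_tsv` are invented placeholders, not real LIAR records, and `LIAR_COLUMNS` and `sample_df` are names introduced here for illustration. It passes the column names through `names=` at read time instead of assigning `df.columns` afterwards.

```python
import io

import pandas as pd

# Column names copied from the cell above.
LIAR_COLUMNS = [
    "ID", "label", "statement", "subject", "speaker", "job", "state", "party",
    "barely_true_counts", "half_true_counts", "mostly_true_counts",
    "false_counts", "pants_on_fire_counts", "context",
]

# Two made-up rows in the same tab-separated layout (NOT real LIAR data).
sample_tsv = (
    "1.json\tfalse\tExample statement A\ttaxes\tspeaker-a\tjob-a\tstate-a\t"
    "party-a\t0\t1\t0\t0\t0\ta speech\n"
    "2.json\thalf-true\tExample statement B\tenergy\tspeaker-b\tjob-b\tstate-b\t"
    "party-b\t1\t0\t2\t0\t0\tan interview\n"
)

# header=None tells pandas the first line is data; names= labels the columns.
sample_df = pd.read_csv(
    io.StringIO(sample_tsv), sep="\t", header=None, names=LIAR_COLUMNS
)

# Keep only the two columns the rest of the notebook uses.
sample_df = sample_df[["statement", "label"]]
print(sample_df)
```

Replacing `io.StringIO(sample_tsv)` with a file path gives the same call shape as the `pd.read_csv` lines in the cell above.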


### 🔁 Step 3: Preprocess Labels

In [5]:

# Simplify labels to binary: 0 = Fake, 1 = Real
fake_labels = ['pants-fire', 'false', 'barely-true']
real_labels = ['half-true', 'mostly-true', 'true']

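# Keep only rows with a known label; .copy() avoids pandas' SettingWithCopyWarning
# when the label column is overwritten on the next line
df = df[df['label'].isin(fake_labels + real_labels)].copy()
df['label'] = df['label'].apply(lambda x: 0 if x in fake_labels else 1)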

df['label'].value_counts()


| label | count |
|---|---|
| 1 | 7134 |
| 0 | 5657 |
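
The same six-to-two label collapse can also be written as an explicit lookup table. The sketch below is one possible alternative, not the notebook's method: `LABEL_TO_BINARY` and the `toy` frame are hypothetical names and data introduced for illustration. With `Series.map`, any label missing from the dictionary becomes `NaN`, so unexpected values show up explicitly before they are dropped.

```python
import pandas as pd

# Same grouping as the cell above: 0 = Fake, 1 = Real.
LABEL_TO_BINARY = {
    "pants-fire": 0, "false": 0, "barely-true": 0,
    "half-true": 1, "mostly-true": 1, "true": 1,
}

# Toy frame with one label that is not in the mapping (illustrative only).
toy = pd.DataFrame({
    "statement": ["a", "b", "c", "d"],
    "label": ["false", "true", "half-true", "unknown-label"],
})

# Unmapped labels become NaN, which makes them easy to count or inspect.
toy["binary"] = toy["label"].map(LABEL_TO_BINARY)
n_unmapped = int(toy["binary"].isna().sum())

# Drop the unmapped rows and restore an integer dtype.
toy = toy.dropna(subset=["binary"]).astype({"binary": int})
print(n_unmapped, toy["binary"].tolist())
```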


### 🧪 Step 4: Train/Test Split

In [6]:

# Hold out 20% of the statements for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    df['statement'], df['label'], test_size=0.2, random_state=42
)
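
Because the two classes are not the same size, one common variation is a stratified split, which keeps the class proportions equal in the train and test portions. The sketch below uses made-up toy data (`texts`, `labels`) and is not what the notebook ran; the results reported later come from the unstratified call above.

```python
from sklearn.model_selection import train_test_split

# Toy data: 8 examples of class 0 and 12 of class 1 (illustrative only).
texts = [f"statement {i}" for i in range(20)]
labels = [0] * 8 + [1] * 12

# stratify=labels asks for the 40/60 class ratio to be preserved in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(y_te), sum(y_te))  # test size and number of class-1 examples in it
```

Applied to the notebook's data, the equivalent change would be adding `stratify=df['label']` to the call above, which would alter the split and therefore the reported metrics.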


### 🧠 Step 5: TF-IDF Vectorization

In [7]:

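# Drop English stop words and any term that appears in more than 70% of documents
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
# Learn the vocabulary and IDF weights from the training split only
X_train_tfidf = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary for the test split so no test information leaks into the features
X_test_tfidf = vectorizer.transform(X_test)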
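
To see what the fit/transform distinction means in practice, here is a small sketch on an invented three-sentence corpus (`train_texts` and `test_texts` are placeholders, not LIAR statements). It uses the same vectorizer settings as above and shows that words unseen during fitting are silently ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, for illustration only.
train_texts = [
    "taxes will rise next year",
    "taxes fell last year",
    "the senator voted on energy",
]
test_texts = ["energy taxes and unseen words"]

vec = TfidfVectorizer(stop_words="english", max_df=0.7)
X_train = vec.fit_transform(train_texts)  # learns vocabulary + IDF weights
X_test = vec.transform(test_texts)        # reuses them; learns nothing new

print(sorted(vec.vocabulary_))
print(X_train.shape, X_test.shape)

# Only "energy" and "taxes" were seen during fitting, so only those two
# columns are non-zero for the test sentence.
print(X_test.nnz)
```

Both matrices share the same number of columns, which is what lets a model trained on `X_train` score `X_test`.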


### 🤖 Step 6: Train Logistic Regression Model

In [8]:

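# Fit a logistic regression classifier with scikit-learn's default settings
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
# Predict 0 (Fake) or 1 (Real) for each held-out statement
y_pred = model.predict(X_test_tfidf)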
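
The vectorizer and classifier can also be bundled into a single scikit-learn pipeline, which makes scoring new raw text a one-line call. This is a sketch under stated assumptions rather than the notebook's own code: the six training sentences and their labels are invented, and `clf` is a name introduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up sentences with arbitrary labels (0 = Fake, 1 = Real), illustration only.
texts = [
    "taxes were cut for every family",
    "the budget doubled overnight",
    "unemployment fell last quarter",
    "crime rose in every city",
    "the bill funds new schools",
    "the law bans all imports",
]
labels = [0, 0, 1, 0, 1, 1]

# The pipeline applies TF-IDF and then logistic regression in one object.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(texts, labels)

# Raw strings go in; the pipeline handles vectorization internally.
proba = clf.predict_proba(["taxes fell for new schools"])
print(proba)  # columns are ordered as clf.classes_, i.e. [P(0), P(1)]
```

With so little data the probabilities themselves are meaningless; the point is only the call pattern, which avoids having to keep a separately fitted `vectorizer` in sync with the model.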


### 📊 Step 7: Evaluate the Model

In [9]:

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))


Accuracy: 0.6326690113325518

Classification Report:

              precision    recall  f1-score   support

           0       0.60      0.48      0.53      1121
           1       0.65      0.75      0.70      1438

    accuracy                           0.63      2559
   macro avg       0.63      0.62      0.62      2559
weighted avg       0.63      0.63      0.63      2559
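
One way to put the accuracy in context is to compare it with a trivial classifier that always predicts the larger class. The sketch below only does arithmetic on the numbers printed in the report above (the `support` counts and the accuracy value); the variable names are introduced here for illustration.

```python
# Support per class and accuracy, copied from the report above.
support = {0: 1121, 1: 1438}
reported_accuracy = 0.6326690113325518

total = sum(support.values())

# Accuracy of always predicting the majority class (label 1).
baseline = max(support.values()) / total

# Number of test statements the model classified correctly.
correct = round(reported_accuracy * total)

print(f"test size:         {total}")
print(f"majority baseline: {baseline:.4f}")
print(f"model accuracy:    {reported_accuracy:.4f}")
print(f"gain over baseline: {reported_accuracy - baseline:.4f}")
print(f"correct predictions: {correct}")
```

On these figures the majority baseline is about 0.56, so the model's 0.63 is an improvement of roughly seven percentage points. The lower recall for class 0 (0.48) in the report also suggests the model leans toward predicting Real.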

