# **Fake News Detector: Real vs Misinformation Shield**

### **Project Objective**

Develop a machine-learning model that classifies news articles as real or fake using NLP techniques, to help combat misinformation on social media. The target is a test accuracy above 92% on detecting deceptive patterns in text.



## **Data Loading**

In [10]:
import pandas as pd
import numpy as np

News_ID= pd.read_csv("/content/Fake and Real News Dataset.csv")
News_ID

Unnamed: 0,News_ID,Title,Full_Text,Subject_ID,Date_ID,Source_ID,Author_ID,Country_ID,News_Type,Polarity_Score,Subjectivity_Score,Credibility_Score,Engagement_Count,Read_Count
0,1,Internet shutdown planned,False warnings cause concern,7,517,23,16,7,Fake,-0.68,0.88,18,38303,77077
1,2,Medical breakthrough in disease treatment,Research team publishes promising results,2,797,10,18,7,Real,0.68,0.24,93,22664,27433
2,3,Crops failing everywhere,Exaggerated reports of crisis,15,690,21,28,3,Fake,-0.76,0.89,20,18156,78655
3,4,Athlete breaks national record,Outstanding performance at championship,9,192,11,29,1,Real,0.70,0.30,96,14251,32579
4,5,Innocent person jailed,Unverified claims of injustice,11,149,19,3,7,Fake,-0.71,0.86,49,61246,90781
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,299996,Olympics preparations underway,Infrastructure development on track,9,488,8,23,4,Real,0.69,0.37,98,11868,55940
299996,299997,Art exhibition showcases talent,Contemporary works receive acclaim,10,757,12,35,7,Real,0.57,0.31,98,15844,57636
299997,299998,Cyber attack imminent,False warning causes panic,12,496,8,20,5,Fake,-0.66,0.94,19,17741,76775
299998,299999,Currency worthless soon,Unverified claims cause concern,8,336,16,20,4,Fake,-0.87,0.93,23,48870,81092


In [11]:
News_ID.head()

Unnamed: 0,News_ID,Title,Full_Text,Subject_ID,Date_ID,Source_ID,Author_ID,Country_ID,News_Type,Polarity_Score,Subjectivity_Score,Credibility_Score,Engagement_Count,Read_Count
0,1,Internet shutdown planned,False warnings cause concern,7,517,23,16,7,Fake,-0.68,0.88,18,38303,77077
1,2,Medical breakthrough in disease treatment,Research team publishes promising results,2,797,10,18,7,Real,0.68,0.24,93,22664,27433
2,3,Crops failing everywhere,Exaggerated reports of crisis,15,690,21,28,3,Fake,-0.76,0.89,20,18156,78655
3,4,Athlete breaks national record,Outstanding performance at championship,9,192,11,29,1,Real,0.7,0.3,96,14251,32579
4,5,Innocent person jailed,Unverified claims of injustice,11,149,19,3,7,Fake,-0.71,0.86,49,61246,90781


In [15]:
print("Shape:", News_ID.shape)

Shape: (300000, 14)


In [16]:
News_ID.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 14 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   News_ID             300000 non-null  int64  
 1   Title               300000 non-null  object 
 2   Full_Text           300000 non-null  object 
 3   Subject_ID          300000 non-null  int64  
 4   Date_ID             300000 non-null  int64  
 5   Source_ID           300000 non-null  int64  
 6   Author_ID           300000 non-null  int64  
 7   Country_ID          300000 non-null  int64  
 8   News_Type           300000 non-null  object 
 9   Polarity_Score      300000 non-null  float64
 10  Subjectivity_Score  300000 non-null  float64
 11  Credibility_Score   300000 non-null  int64  
 12  Engagement_Count    300000 non-null  int64  
 13  Read_Count          300000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 32.0+ MB


In [22]:
News_ID.describe()

Unnamed: 0,News_ID,Subject_ID,Date_ID,Source_ID,Author_ID,Country_ID,Polarity_Score,Subjectivity_Score,Credibility_Score,Engagement_Count,Read_Count,label
count,300000.0,300000.0,300000.0,300000.0,300000.0,300000.0,300000.0,300000.0,300000.0,300000.0,300000.0,300000.0
mean,150000.5,7.51282,466.545523,13.018987,19.536373,7.01077,0.105387,0.560877,68.230233,26987.937697,63498.93177,0.38847
std,86602.684716,4.368148,210.968904,7.150875,11.136961,3.74556,0.67958,0.28621,30.044461,18891.03521,26941.927489,0.487403
min,1.0,1.0,101.0,1.0,1.0,1.0,-0.89,0.18,15.0,5000.0,20000.0,0.0
25%,75000.75,4.0,284.0,7.0,10.0,4.0,-0.7,0.32,37.0,13301.0,41921.0,0.0
50%,150000.5,7.0,467.0,13.0,20.0,7.0,0.6,0.38,87.0,20434.0,60795.0,0.0
75%,225000.25,11.0,650.0,19.0,29.0,10.0,0.67,0.89,93.0,37404.25,79684.0,1.0
max,300000.0,15.0,831.0,25.0,38.0,13.0,0.83,1.0,99.0,75000.0,130000.0,1.0


In [24]:
# Binary target: 1 = fake, 0 = real, using a Credibility_Score threshold of 50
News_ID['label'] = (News_ID['Credibility_Score'] < 50).astype(int)
# Combine headline and body into a single text field for vectorization
News_ID['text'] = News_ID['Title'] + ' ' + News_ID['Full_Text']

print("Labels created for News_ID dataset!")
print(f"Total News: {len(News_ID)}")
print(f"Fake: {(News_ID['label'] == 1).sum()}")
print(f"Real: {(News_ID['label'] == 0).sum()}")

Labels created for News_ID dataset!
Total News: 300000
Fake: 116541
Real: 183459
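
The dataset already carries a `News_Type` column, so it may be worth checking that the threshold-derived `label` agrees with it. The following is a minimal sketch using only the five preview rows shown earlier; the `sample`, `expected`, `agreement`, and `table` names are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# The five preview rows from News_ID.head(), reduced to the two relevant columns
sample = pd.DataFrame({
    "News_Type": ["Fake", "Real", "Fake", "Real", "Fake"],
    "Credibility_Score": [18, 93, 20, 96, 49],
})

# Same rule as the notebook: a score below 50 is labelled fake (1)
sample["label"] = (sample["Credibility_Score"] < 50).astype(int)

# Hypothetical reference label taken directly from News_Type
sample["expected"] = (sample["News_Type"] == "Fake").astype(int)

# Fraction of rows where the two labellings agree
agreement = (sample["label"] == sample["expected"]).mean()

# Cross-tabulation of News_Type against the derived label
table = pd.crosstab(sample["News_Type"], sample["label"])
print(table)
print(f"Agreement: {agreement:.2f}")
```

On the full data, 183,459 of 300,000 rows are labelled real, so always predicting "real" would already score about 61% accuracy. That figure is a useful baseline when reading any accuracy reported for this dataset.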


In [26]:
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Fetch the NLTK resources used below (skipped if already present)
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Lowercase, strip punctuation and digits, drop stopwords, and lemmatize."""
    text = str(text).lower()
    # Replace every punctuation character with a space
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    # Remove digit runs
    text = re.sub(r'\d+', '', text)
    # split() also collapses repeated whitespace
    words = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return ' '.join(words)

News_ID['cleaned_text'] = News_ID['text'].apply(clean_text)
print("Text cleaning complete for 300K articles!")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Text cleaning complete for 300K articles!
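
To see what the cleaning steps do to a single string, here is a small sketch that mirrors the regular-expression part of `clean_text` without NLTK. The `clean_text_sketch` function and the five-word `STOP_WORDS` set are assumptions made for illustration; the notebook itself uses NLTK's full English stopword list and also lemmatizes each token.

```python
import re
import string

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"the", "of", "in", "a", "is"}

def clean_text_sketch(text):
    """Lowercase, strip punctuation and digits, and drop stopwords (no lemmatization)."""
    text = str(text).lower()
    # Punctuation becomes a space so that joined tokens are separated
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Digit runs are removed entirely
    text = re.sub(r"\d+", "", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

result = clean_text_sketch("The price of oil rose 5% in 2020!")
print(result)  # price oil rose
```

Because the input is passed through `str()`, a missing value would be turned into a literal token rather than raising an error, so it can help to confirm there are no nulls before cleaning (the `info()` output above shows none).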


## **Train Model**

In [28]:
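from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# TF-IDF features limited to the 5,000 most frequent terms.
# Note: the vectorizer is fitted on the full corpus before the split,
# so the test rows contribute to the vocabulary and IDF weights.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(News_ID['cleaned_text'])
y = News_ID['label']

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random forest with 100 trees (no seed set, so results can vary slightly between runs)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")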


Accuracy: 0.988
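
One way to keep the test rows out of the TF-IDF fitting step is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The sketch below is an illustration on a toy corpus, not the notebook's actual run: the eight `texts` are assumptions assembled from phrases in the preview rows, and `stratify`, `random_state` on the forest, and `classification_report` are additions that the original cell does not use.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus (assumption) built from phrases that appear in the preview rows
texts = [
    "internet shutdown planned false warning cause concern",
    "crop failing everywhere exaggerated report crisis",
    "innocent person jailed unverified claim injustice",
    "cyber attack imminent false warning cause panic",
    "medical breakthrough disease treatment research team publishes promising result",
    "athlete break national record outstanding performance championship",
    "olympics preparation underway infrastructure development track",
    "art exhibition showcase talent contemporary work receive acclaim",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = fake, 0 = real

# Split the raw text first, keeping the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the forest
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
# Per-class precision and recall are more informative than accuracy alone
# when the classes are imbalanced, as they are in the full dataset
print(classification_report(y_test, y_pred, zero_division=0))
```

With this layout the fitted pipeline is a single object, so the vectorizer and the model cannot drift apart when they are saved or reused.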


In [32]:
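import pickle

# Save the trained model and the fitted vectorizer; both are needed at prediction time.
# The context managers make sure each file is flushed and closed.
with open('News_ID_model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('News_ID_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
print("Models saved correctly!")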


Models saved correctly!
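
The saved files are only useful if they can be loaded back and applied to new text. The following sketch shows that round trip under stated assumptions: it trains a tiny stand-in model on four made-up cleaned strings and writes to a temporary directory, so the file names and data here are illustrative and are not the notebook's `News_ID_model.pkl` artefacts.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in training set (assumption), already in cleaned form
texts = [
    "cyber attack imminent false warning cause panic",
    "currency worthless soon unverified claim cause concern",
    "athlete break national record outstanding performance championship",
    "art exhibition showcase talent contemporary work receive acclaim",
]
labels = [1, 1, 0, 0]  # 1 = fake, 0 = real

vectorizer = TfidfVectorizer(max_features=5000)
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")

    # Save both artefacts; the context manager closes each file
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(vectorizer, f)

    # Load them back, as a separate prediction script would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)

# New text must be cleaned the same way and transformed with the loaded vectorizer
new_text = ["false warning cause panic"]
original_pred = model.predict(vectorizer.transform(new_text))
loaded_pred = loaded_model.predict(loaded_vectorizer.transform(new_text))
print(loaded_pred)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that produced them.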
