### Fake news classifier

### About the Dataset

The WELFake dataset contains 72,134 news articles, with 35,028 real and 37,106 fake news items. It was created by merging four popular news datasets (Kaggle, McIntire, Reuters, BuzzFeed Political) to reduce overfitting and provide more text data for better machine learning training.

The dataset includes four columns:

Serial number: starting from 0

Title: news headline

Text: news content

Label: 0 = fake, 1 = real

Out of 78,098 entries in the CSV file, only 72,134 were accessible and used for the analysis.

Published in:
IEEE Transactions on Computational Social Systems: pp. 1–13 (doi: 10.1109/TCSS.2021.3068519)

In [2]:
import pandas as pd
import textstat
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack
import numpy as np
from sklearn.preprocessing import StandardScaler
# Load the WELFake dataset from the CSV file
df = pd.read_csv('WELFake_Dataset.csv')
df.head()

,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [3]:
print(df.columns)

Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')


In [4]:
df=df.drop(columns=['Unnamed: 0'])
print(df.columns)

Index(['title', 'text', 'label'], dtype='object')


In [5]:
df["title"] = df["title"].fillna("no_title")
df = df.dropna(subset=["text"])
df.head()

,title,text,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,no_title,Did they post their votes for Hillary already?,1
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [6]:
df['label'].isna().sum()

0

In [7]:
# Make sure both text columns are strings before concatenating them
df["title"] = df["title"].astype(str)
df["text"] = df["text"].astype(str)

In [8]:
df["full_text"] = df["title"] + " " + df["text"]
df.head()

,title,text,label,full_text
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1,LAW ENFORCEMENT ON HIGH ALERT Following Threat...
1,no_title,Did they post their votes for Hillary already?,1,no_title Did they post their votes for Hillary...
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0,"Bobby Jindal, raised Hindu, uses story of Chri..."
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,SATAN 2: Russia unvelis an image of its terrif...


In [9]:
# Flesch Reading Ease: a higher score means the text is easier to read
df['readability'] = df['full_text'].apply(textstat.flesch_reading_ease)
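
The readability feature is based on the Flesch Reading Ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The sketch below applies that formula with a crude vowel-group syllable counter, which is an assumption made for illustration. `textstat` counts syllables and sentences differently, so its scores will not match these exactly.

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per group of consecutive vowels (not textstat's method)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

simple = "The cat sat on the mat. It was warm."
dense = "Institutional accountability necessitates comprehensive regulatory transparency."

# Short words and short sentences score higher (easier) than long, polysyllabic ones
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```

Long sentences and polysyllabic words both pull the score down, which is why the feature may help separate writing styles.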

In [10]:
# Count runs of two or more consecutive '!' or '?' characters in the article body
df['repeated_punct'] = df['text'].apply(lambda x: len(re.findall(r'[!?]{2,}', x)))
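
To make the pattern concrete, here is a small standalone sketch of the same regular expression on a few made-up strings. The helper name `count_repeated_punct` and the example sentences are illustrative, not part of the notebook.

```python
import re

# Same pattern as in the cell above: a run of at least two '!' or '?' characters
REPEATED_PUNCT = re.compile(r"[!?]{2,}")

def count_repeated_punct(text: str) -> int:
    """Number of runs of two or more consecutive '!' or '?' characters."""
    return len(REPEATED_PUNCT.findall(text))

examples = ["Is this real?", "What?! No way!!!", "Wow!! Really?? Yes!"]
counts = [count_repeated_punct(t) for t in examples]
print(counts)
```

Each run counts once regardless of its length, so `!!!` contributes 1, and mixed runs such as `?!` also match.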

In [11]:
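# Word counts of the title and the body
df['title_len'] = df['title'].apply(lambda x: len(str(x).split()))
df['text_len'] = df['text'].apply(lambda x: len(str(x).split()))
# Number of fully upper-case words in the title
df['all_caps'] = df['title'].apply(lambda x: sum(1 for w in x.split() if w.isupper()))
# The raw columns are no longer needed: full_text already contains both
df = df.drop(columns=['title', 'text'])
df.head()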

,label,full_text,readability,repeated_punct,title_len,text_len,all_caps
0,1,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,68.02845,2,18,871,7
1,1,no_title Did they post their votes for Hillary...,66.1,0,1,8,0
2,1,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,29.141154,0,18,34,11
3,0,"Bobby Jindal, raised Hindu, uses story of Chri...",46.211376,0,16,1321,0
4,1,SATAN 2: Russia unvelis an image of its terrif...,45.509972,0,16,329,2
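
The title features rely on `str.split()` and `str.isupper()`, whose edge cases are easy to overlook. The sketch below reproduces the two title features on made-up headlines. The helper `title_features` is a hypothetical name used only for this illustration.

```python
def title_features(title: str) -> dict:
    """Word count and number of fully upper-case words, as in the cell above."""
    words = title.split()
    return {
        "title_len": len(words),
        "all_caps": sum(1 for w in words if w.isupper()),
    }

a = title_features("BREAKING NEWS: Senate passes bill")
b = title_features("SATAN 2: missile test")
c = title_features("A quiet day")
print(a, b, c)
```

A token such as `2:` has no cased characters, so `isupper()` returns `False` for it, while a single capital letter such as `A` counts as an all-caps word. Trailing punctuation (`NEWS:`) does not prevent a match.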


In [12]:
X=df.drop(columns=['label'])
y=df[['label']]
print(X.columns)
         

Index(['full_text', 'readability', 'repeated_punct', 'title_len', 'text_len',
       'all_caps'],
      dtype='object')


In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% of the rows for testing
    random_state=42,  # fixed seed for a reproducible split
    stratify=y        # keep the same class proportions in train and test
)
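
Stratification means each class is split in the same ratio, so the test set mirrors the class balance of the full data. The following pure-Python sketch illustrates the idea on toy labels. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split in the same ratio.

    Illustrative only; scikit-learn's train_test_split works differently internally.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [0] * 40 + [1] * 60          # toy data with a 40/60 class balance
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both subsets keep the 40/60 ratio of the toy labels, which is the property `stratify=y` asks for in the real split.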


In [14]:
#scale numeric data
numeric_cols = ['readability', 'repeated_punct', 'title_len', 'text_len', 'all_caps']
scaler = StandardScaler()

# Fit on training numeric features
X_train_numeric = scaler.fit_transform(X_train[numeric_cols])

# Transform test numeric features
X_test_numeric = scaler.transform(X_test[numeric_cols])
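
Fitting the scaler on the training rows only, then reusing its statistics on the test rows, keeps information from the test set out of preprocessing. The sketch below shows the same arithmetic by hand for one toy column. `StandardScaler` uses the population standard deviation, which `statistics.pstdev` also computes. The helper names and values are illustrative.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Standardise values with statistics learned elsewhere: z = (x - mean) / std."""
    return [(v - mean) / std for v in values]

train_lengths = [10.0, 20.0, 30.0]   # toy training column
test_lengths = [40.0]                # toy test column

mean, std = fit_scaler(train_lengths)             # statistics come from train only
train_scaled = transform(train_lengths, mean, std)
test_scaled = transform(test_lengths, mean, std)  # test reuses the train statistics
print(train_scaled, test_scaled)
```

The scaled training column has mean 0 and unit variance. The test value is expressed relative to the training distribution, so it can fall well outside the range seen in training.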

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF over unigrams and bigrams, capped at the 10,000 most frequent terms
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)

# Fit on training text and transform
X_train_text =vectorizer.fit_transform(X_train['full_text'])

# Transform test text
X_test_text = vectorizer.transform(X_test['full_text'])
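
As a rough illustration of what the vectorizer computes, the sketch below builds unigram and bigram TF-IDF weights by hand, using the smoothed idf `ln((1 + N) / (1 + df)) + 1` followed by L2 normalisation, which are scikit-learn's documented defaults. The toy documents and helper names are assumptions, and the sketch omits `max_features` and sparse storage.

```python
import math
import re
from collections import Counter

def ngrams(text, n_max=2):
    """Lower-cased 1..n_max-grams; tokens are runs of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(docs):
    """Term count times smoothed idf, then L2-normalise each document's weights."""
    counts = [Counter(ngrams(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)  # documents containing each term
    n_docs = len(docs)
    rows = []
    for c in counts:
        weights = {
            t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [
    "breaking news shocking claim",
    "senate passes budget bill",
    "breaking news senate vote",
]
rows = tfidf(docs)
print(sorted(rows[0].items(), key=lambda kv: -kv[1])[:3])
```

Terms that appear in fewer documents receive a larger weight, so `shocking` outranks `breaking` in the first document, and bigrams such as `breaking news` become features in their own right.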

In [17]:
# Combine the sparse TF-IDF matrix with the scaled numeric columns
X_train_combined = hstack([X_train_text, X_train_numeric])
X_test_combined  = hstack([X_test_text, X_test_numeric])

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize model
lr_model = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
lr_model.fit(X_train_combined, y_train.values.ravel())  # .ravel() converts y to 1D

# Predict on test data
y_pred = lr_model.predict(X_test_combined)

# Evaluate
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Logistic Regression Accuracy: 0.96525417851446
              precision    recall  f1-score   support

           0       0.97      0.96      0.96      7006
           1       0.96      0.97      0.97      7413

    accuracy                           0.97     14419
   macro avg       0.97      0.97      0.97     14419
weighted avg       0.97      0.97      0.97     14419
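
The columns of the report derive from the confusion counts of each class. The sketch below shows the arithmetic on hypothetical counts. The numbers are made up for illustration and are not the confusion matrix of this model.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp)   # of the predicted positives, how many are correct
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, NOT taken from the model above
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```

In the report, `support` is the number of test rows in each class, and the macro average is the unweighted mean of the per-class scores.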



### Key Results & Observations

Initially, using only title and text, the model’s accuracy was around 50%, essentially no better than random guessing.

After feature engineering (a readability score, a repeated-punctuation count, title length, text length, and the number of all-caps words in the title), scaling of the numeric features, and TF-IDF vectorization of the text, the logistic regression model reached an accuracy of about 96.5% on the held-out test set of 14,419 articles.

On this dataset, the jump from about 50% to about 96.5% shows how strongly feature engineering and consistent preprocessing can affect model performance for fake news detection.