# **📰Fake News Detection using NLP and Logistic Regression**

About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: a label that marks whether the news article is real or fake:
           1: Real news
           0: Fake news





### **Importing the Dependencies**

In [2]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

### **Data Pre-processing**

In [75]:
import kagglehub
import os

# Download dataset
path = kagglehub.dataset_download("ruchi798/source-based-news-classification")

# Correct filename
csv_file_name = "news_articles.csv"

# Build full path
csv_file_path = os.path.join(path, csv_file_name)

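# Load dataset
news_dataset = pd.read_csv(csv_file_path)

# Encode the labels as integers: Fake -> 0, Real -> 1.
# Values that match neither key become NaN and are then treated as Fake (0).
news_dataset['label'] = news_dataset['label'].map({'Fake': 0, 'Real': 1})
news_dataset['label'] = news_dataset['label'].fillna(0)
news_dataset['label'] = news_dataset['label'].astype(int)

# Keep only the columns used below; the stopword-free text is renamed to 'text'
news_dataset = news_dataset[['author', 'title', 'text_without_stopwords', 'label']]
news_dataset = news_dataset.rename(columns={'text_without_stopwords': 'text'})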

In [76]:
news_dataset.shape

(2096, 4)

In [77]:
# print the first 5 rows of the dataframe
news_dataset.head()

| | author | title | text | label |
|---|---|---|---|---|
| 0 | Barracuda Brigade | muslims busted they stole millions in govt ben... | print pay back money plus interest entire fami... | 1 |
| 1 | reasoning with facts | re why did attorney general loretta lynch plea... | attorney general loretta lynch plead fifth bar... | 1 |
| 2 | Barracuda Brigade | breaking weiner cooperating with fbi on hillar... | red state fox news sunday reported morning ant... | 1 |
| 3 | Fed Up | pin drop speech by father of daughter kidnappe... | email kayla mueller prisoner tortured isis cha... | 1 |
| 4 | Fed Up | fantastic trumps point plan to reform healthc... | email healthcare reform make america great sin... | 1 |


In [78]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

| | 0 |
|---|---|
| author | 0 |
| title | 0 |
| text | 50 |
| label | 0 |


In [80]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [81]:
# merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [82]:
print(news_dataset['content'])

0       Barracuda Brigade muslims busted they stole mi...
1       reasoning with facts re why did attorney gener...
2       Barracuda Brigade breaking weiner cooperating ...
3       Fed Up pin drop speech by father of daughter k...
4       Fed Up fantastic trumps  point plan to reform ...
                              ...                        
2091    -NO AUTHOR- teens walk free after gangrape con...
2092    -NO AUTHOR- school named for munichmassacre ma...
2093            -NO AUTHOR- russia unveils satan  missile
2094    -NO AUTHOR- check out hillarythemed haunted house
2095    Eddy Lavine cannabis aficionados develop thca ...
Name: content, Length: 2096, dtype: object


In [84]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [85]:
print(X)
print(Y)

                    author                                              title  \
0        Barracuda Brigade  muslims busted they stole millions in govt ben...   
1     reasoning with facts  re why did attorney general loretta lynch plea...   
2        Barracuda Brigade  breaking weiner cooperating with fbi on hillar...   
3                   Fed Up  pin drop speech by father of daughter kidnappe...   
4                   Fed Up  fantastic trumps  point plan to reform healthc...   
...                    ...                                                ...   
2091           -NO AUTHOR-          teens walk free after gangrape conviction   
2092           -NO AUTHOR-         school named for munichmassacre mastermind   
2093           -NO AUTHOR-                      russia unveils satan  missile   
2094           -NO AUTHOR-              check out hillarythemed haunted house   
2095           Eddy Lavine  cannabis aficionados develop thca crystalline ...   

                           

### **Stemming**

Stemming is the process of reducing a word to its root form (its stem). The stem is not always a dictionary word.

Examples from this dataset:
unveils --> unveil, haunted --> haunt, cooperating --> cooper
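To see the stemmer in isolation, the short sketch below runs NLTK's `PorterStemmer` (the class imported at the top of the notebook) on a few inflected forms of one word. The word list is made up for illustration and is not taken from the dataset.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# illustrative word list (an assumption, not taken from the dataset)
words = ["connect", "connected", "connecting", "connection"]

# map every inflected form to its stem
stems = [stemmer.stem(word) for word in words]
print(stems)
```

All four forms collapse to the same stem, which is why stemming shrinks the vocabulary the vectorizer has to learn.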

In [86]:
port_stem = PorterStemmer()

In [87]:
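# build the stopword set once; looking words up in a set is much faster
# than re-reading the full list for every word
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content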

In [88]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [89]:
print(news_dataset['content'])

0       barracuda brigad muslim bust stole million gov...
1       reason fact attorney gener loretta lynch plead...
2       barracuda brigad break weiner cooper fbi hilla...
3       fed pin drop speech father daughter kidnap kil...
4       fed fantast trump point plan reform healthcar ...
                              ...                        
2091                author teen walk free gangrap convict
2092          author school name munichmassacr mastermind
2093                    author russia unveil satan missil
2094                  author check hillarythem haunt hous
2095    eddi lavin cannabi aficionado develop thca cry...
Name: content, Length: 2096, dtype: object


In [90]:
# separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [91]:
print(X)

['barracuda brigad muslim bust stole million govt benefit'
 'reason fact attorney gener loretta lynch plead fifth'
 'barracuda brigad break weiner cooper fbi hillari email investig' ...
 'author russia unveil satan missil' 'author check hillarythem haunt hous'
 'eddi lavin cannabi aficionado develop thca crystallin strongest hash world thc']


In [92]:
print(Y)

[1 1 1 ... 0 0 0]


In [93]:
Y.shape

(2096,)

In [94]:
# 'content' now holds the stemmed author + title text
X_text = news_dataset['content'].values  # numpy array of strings

# TF-IDF vectorizer: keep at most 3000 terms, and ignore terms that appear
# in fewer than 5 documents or in more than 70% of the documents
vectorizer = TfidfVectorizer(max_features=3000, min_df=5, max_df=0.7)

# learn the vocabulary and convert the text to a sparse feature matrix
X = vectorizer.fit_transform(X_text)
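To make the numbers in the TF-IDF matrix less opaque, here is a minimal hand-rolled sketch of the weighting on a toy corpus. It assumes the scheme scikit-learn documents as the `TfidfVectorizer` default: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three headlines are made up, and options such as `max_features`, `min_df` and `max_df` are ignored here.

```python
import math
from collections import Counter

# toy corpus of already-stemmed headlines (made up for illustration)
docs = [
    "trump plan reform healthcar",
    "russia unveil missil",
    "trump russia email",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

matrix = [tfidf_row(tokens) for tokens in tokenized]

# non-zero weights of the first headline
first_row = dict(zip(vocab, matrix[0]))
print({term: round(weight, 3) for term, weight in first_row.items() if weight > 0})
```

In the first headline, "trump" also occurs in another document, so it receives a lower weight than "plan", "reform" and "healthcar", which occur only once in the corpus.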

In [95]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 18425 stored elements and shape (2096, 5213)>
  Coords	Values
  (0, 385)	0.360186474515665
  (0, 440)	0.360186474515665
  (0, 589)	0.360186474515665
  (0, 639)	0.3697642072739326
  (0, 1918)	0.38148640479571866
  (0, 2934)	0.2756393343563781
  (0, 3040)	0.30866146215821777
  (0, 4409)	0.39659894583441546
  (1, 293)	0.35712952110668233
  (1, 1577)	0.35001440748294765
  (1, 1650)	0.40227711098282576
  (1, 1851)	0.31651949704469357
  (1, 2707)	0.35712952110668233
  (1, 2742)	0.3284095229839025
  (1, 3451)	0.40227711098282576
  (1, 3728)	0.3011905875442771
  (2, 385)	0.39151117947183195
  (2, 572)	0.31732120654919904
  (2, 589)	0.39151117947183195
  (2, 966)	0.4542427144371104
  (2, 1409)	0.2554614549889334
  (2, 1623)	0.2545896715839206
  (2, 2097)	0.1838024493567392
  (2, 2313)	0.2861131762188201
  (2, 5035)	0.3750843306435056
  :	:
  (2092, 2827)	0.5312937133594603
  (2092, 3030)	0.5312937133594603
  (2092, 3055)	0.5042141500

## **Splitting the dataset to training & test data**

In [96]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
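In the cells above, the vectorizer is fitted on the full dataset before the split, so the IDF weights have already seen the test headlines. One common way to avoid this is to split the text first and wrap the vectorizer and the classifier in a scikit-learn pipeline, so that the vocabulary is learned from the training part only. The sketch below shows the idea on a tiny made-up corpus; the texts and labels are assumptions for illustration (1 = real, 0 = fake, as in the loading cell).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# tiny made-up corpus of stemmed headlines; labels are illustrative only
texts = [
    "council approv new school budget",
    "scientist publish studi river pollut",
    "citi open new public librari",
    "report confirm bridg repair finish",
    "alien secretli run local bakeri",
    "miracl pill cure everi diseas overnight",
    "moon made chees claim insid",
    "celebr clone replac world leader",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# split the raw text first, so the test part stays unseen
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# the pipeline fits the vectorizer on the training text only
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts_train, y_train)

# at prediction time the fitted vocabulary is reused, never re-learned
predictions = pipeline.predict(texts_test)
print(predictions)
```

With a corpus this small the predictions themselves mean little; the point is the order of operations.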

## **Training the Model: Logistic Regression**

In [97]:
model = LogisticRegression()

In [98]:
model.fit(X_train, Y_train)
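After fitting, a binary `LogisticRegression` stores one weight per feature in `coef_` and a bias in `intercept_`. The probability of class 1 is the sigmoid of the weighted sum of the features, and the predicted label is 1 when that probability is at least 0.5. The sketch below reproduces this decision rule with the standard library only; the weights, bias and feature values are hypothetical numbers, not values learned from this dataset.

```python
import math

def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def probability_of_class_one(features, weights, bias):
    """Logistic regression score for a single example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# hypothetical model parameters and one hypothetical TF-IDF row
weights = [1.5, -2.0, 0.5]
bias = -0.25
features = [0.6, 0.0, 0.8]

probability = probability_of_class_one(features, weights, bias)
label = 1 if probability >= 0.5 else 0
print(round(probability, 3), label)
```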

## **Evaluation**

### **Accuracy score**

In [99]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # (true labels, predictions)

In [100]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9373508353221957


In [101]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # (true labels, predictions)

In [102]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.7880952380952381
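The training accuracy (about 0.94) is clearly higher than the test accuracy (about 0.79), and accuracy alone does not show which class the model gets wrong. A confusion matrix, precision and recall give a fuller picture. The sketch below uses small made-up label lists rather than the notebook's `Y_test`, so its numbers say nothing about this model.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# made-up true labels and predictions (1 = real, 0 = fake)
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)

# precision and recall for the positive class (label 1)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(accuracy)
print(matrix)
print(precision, recall)
```

On the real data the same calls would take `Y_test` and `X_test_prediction` as arguments.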


### **Making a Predictive System**

In [103]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

# labels follow the encoding used when loading the data: 0 = Fake, 1 = Real
if prediction[0] == 0:
    print('The news is Fake')
else:
    print('The news is Real')

[0]
The news is Fake


In [104]:
print(Y_test[3])

0
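The cell above predicts a row that is already vectorized. To classify a new headline, the text has to go through the same cleaning step and then through `transform` on the already fitted vectorizer (never `fit` again). The sketch below shows one way to wrap this up. The helper names `clean` and `classify` are hypothetical, `clean` is a simplified stand-in for the notebook's `stemming` function (no stopword removal or stemming), and the training headlines and labels are made up (1 = Real, 0 = Fake, as in the loading cell).

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def clean(text):
    # simplified stand-in for stemming(): letters only, lowercased
    return ' '.join(re.sub('[^a-zA-Z]', ' ', text).lower().split())

def classify(text, vectorizer, model):
    # reuse the fitted vocabulary; transform expects a list of documents
    features = vectorizer.transform([clean(text)])
    label = int(model.predict(features)[0])
    return 'Real' if label == 1 else 'Fake'

# made-up training data, for illustration only
train_texts = [
    "Council approves new school budget",
    "Scientists publish study on river pollution",
    "City opens new public library",
    "Aliens secretly run local bakery",
    "Miracle pill cures every disease overnight",
    "Moon is made of cheese, insider claims",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform([clean(t) for t in train_texts]), train_labels)

result = classify("City council opens new school", vectorizer, model)
print(result)
```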


This project builds a predictive system that classifies news articles as real or fake. From a labeled dataset of news articles, the author name and title of each article are merged and preprocessed with NLP techniques (stopword removal and stemming). The resulting text is converted into numerical features using TF-IDF vectorization, and a Logistic Regression model is trained on those features. The model reaches about 93.7% accuracy on the training data and about 78.8% on the held-out test data. The gap between the two suggests some overfitting, so the test accuracy is the better guide to how the model will perform on unseen news articles.

Thank You!