About the Dataset:

1. id: unique id for a news article
2. title: the title of the news article
3. text: the text of the article; could be incomplete
4. subject: the subject category of the article
5. date: the date associated with the article
6. type: a label that marks whether the news article is real or fake

1 : Fake news

0 : Real news

Importing the dependencies

In [None]:
import pandas as pd
import numpy as np
import re
# regular expressions, used here to clean the text
from nltk.corpus import stopwords
# stop word lists from the Natural Language Toolkit (NLTK)
from nltk.stem.porter import PorterStemmer
# reduces a word to its stem by stripping suffixes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


- pandas: loads, cleans, and manipulates the dataset (DataFrame operations)
- numpy: provides efficient numerical operations and array handling for data processing
- re: cleans text with regular expressions (here, replacing every non-letter character with a space)
- stopwords: provides a list of common words (like "the", "is", "in") that carry little signal and are removed from the text
- PorterStemmer: reduces words to their stem (e.g., "running" → "run") to normalize the text
- TfidfVectorizer: converts the cleaned text into numerical features using TF-IDF (Term Frequency–Inverse Document Frequency)
- train_test_split: splits the dataset into training and test sets to evaluate model performance
- LogisticRegression: the machine learning algorithm used to classify news as real or fake (binary classification)
- accuracy_score: measures the fraction of the model's predictions that match the actual labels


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

These stop words are removed from the text in the same pass as stemming.

**Data Pre-processing**

In [None]:
# loading dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/train.csv')

In [None]:
news_dataset.shape

(35000, 6)

In [None]:
news_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35000 entries, 0 to 34999
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       35000 non-null  int64 
 1   title    35000 non-null  object
 2   text     34896 non-null  object
 3   subject  35000 non-null  object
 4   date     35000 non-null  object
 5   type     35000 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 1.6+ MB


In [None]:
news_dataset.head()

Unnamed: 0,id,title,text,subject,date,type
0,42618,"Immigration, abortion, race rulings due at Sup...",WASHINGTON (Reuters) - The U.S. Supreme Court ...,politicsNews,"June 19, 2016",0
1,41734,Big drop in asylum seekers illegally crossing ...,TORONTO (Reuters) - The number of asylum seeke...,worldnews,"October 16, 2017",0
2,37258,Trump tells House leaders to cancel healthcare...,WASHINGTON (Reuters) - President Donald Trump ...,politicsNews,"March 24, 2017",0
3,16780,Hypocrite Republicans Refuse To Investigate F...,After years of investigating Hillary Clinton o...,News,"February 14, 2017",1
4,17338,BREAKING: HISPANIC MEN (Cowards) BEAT WOMAN IN...,Leftist pigs inspired by the hateful </b> rhet...,left-news,"Oct 1, 2016",1


**Handling Missing Values**

In [None]:
# counting number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0,0
id,0
title,0
text,104
subject,0
date,0
type,0


In [None]:
# share of rows with missing text
# only a small fraction of rows is affected, so a simple fix (the empty-string fill below) is enough
total_rows = len(news_dataset)
# number of rows whose text is missing
missing_texts = news_dataset['text'].isnull().sum()
print(f"Missing text: {missing_texts/total_rows:.2%}")

Missing text: 0.30%


In [None]:
# replacing the null values with empty string
news_dataset['text'] = news_dataset['text'].fillna('')

**Combine useful text features**
- Both the title and the subject carry information about the article
- So we combine them into one column, content

In [None]:
news_dataset['content'] = news_dataset['title']+" "+news_dataset['subject']

In [None]:
print(news_dataset['content'])

0        Immigration, abortion, race rulings due at Sup...
1        Big drop in asylum seekers illegally crossing ...
2        Trump tells House leaders to cancel healthcare...
3         Hypocrite Republicans Refuse To Investigate F...
4        BREAKING: HISPANIC MEN (Cowards) BEAT WOMAN IN...
                               ...                        
34995    Democrats want strong response to intel report...
34996     GOP Congress Just Delivered Trump The Biggest...
34997     The Phoenix Police Department Just Sent Trump...
34998    Out of Russian custody, Tatar leaders vow to r...
34999    Timeline: Milestones in legal fight over Texas...
Name: content, Length: 35000, dtype: object
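Since subject is concatenated into content, it can be worth checking how subject values line up with the label before training. The sketch below does this with only the five rows shown by news_dataset.head() above, so it says nothing definitive about the full dataset; the variable names are illustrative.

```python
from collections import Counter

# (subject, type) pairs copied from the five rows shown by news_dataset.head()
rows = [
  ("politicsNews", 0),
  ("worldnews", 0),
  ("politicsNews", 0),
  ("News", 1),
  ("left-news", 1),
]

# How often each (subject, label) combination occurs
pair_counts = Counter(rows)

# Which labels each subject value appears with
labels_per_subject = {}
for subject, label in rows:
  labels_per_subject.setdefault(subject, set()).add(label)

for subject, labels in sorted(labels_per_subject.items()):
  print(subject, sorted(labels))
```

On the full DataFrame, the same check could be done with pd.crosstab(news_dataset['subject'], news_dataset['type']). If every subject maps to a single label, the model can separate the classes from the subject token alone.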


In [None]:
# we are using this content data to make our predictions

In [None]:
# separating the features and the label (type)
X = news_dataset.drop(columns='type')  # drop the label column
Y = news_dataset['type']
print(X)
print(Y)

          id                                              title  \
0      42618  Immigration, abortion, race rulings due at Sup...   
1      41734  Big drop in asylum seekers illegally crossing ...   
2      37258  Trump tells House leaders to cancel healthcare...   
3      16780   Hypocrite Republicans Refuse To Investigate F...   
4      17338  BREAKING: HISPANIC MEN (Cowards) BEAT WOMAN IN...   
...      ...                                                ...   
34995   5037  Democrats want strong response to intel report...   
34996   2390   GOP Congress Just Delivered Trump The Biggest...   
34997  31314   The Phoenix Police Department Just Sent Trump...   
34998   7079  Out of Russian custody, Tatar leaders vow to r...   
34999  31848  Timeline: Milestones in legal fight over Texas...   

                                                    text       subject  \
0      WASHINGTON (Reuters) - The U.S. Supreme Court ...  politicsNews   
1      TORONTO (Reuters) - The number of asylum

**Stemming:**

It is the process of reducing a word to its root form (its stem).

example: connect, connected, connecting, connection --> connect

In [None]:
port_stem = PorterStemmer()
# build the stop word set once; set membership tests are fast
stop_words = set(stopwords.words('english'))

def stemming(content):
  # re.sub() substitutes every regex match with the replacement string
  # [^a-zA-Z] matches any character that is not a letter:
  #   [] defines a set of characters, a-z and A-Z are the lowercase and uppercase letters,
  #   and ^ (inside the brackets) negates the set
  # so numbers and punctuation are replaced with spaces
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # drop stop words and stem the remaining words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content
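
To make the cleaning steps concrete, here is a small sketch that runs the regex, lowercasing, splitting, and stop word filtering on one made-up headline. The headline and the tiny DEMO_STOP_WORDS set are assumptions for illustration only (not NLTK's full list), and the stemmer is left out so the snippet needs only the standard library.

```python
import re

# Tiny illustrative subset of stop words (assumption: NOT the full NLTK list)
DEMO_STOP_WORDS = {"in", "at", "to", "the", "of"}

def clean_tokens(text):
  # Replace every character that is not an ASCII letter with a space
  letters_only = re.sub('[^a-zA-Z]', ' ', text)
  # Lowercase, then split on whitespace
  tokens = letters_only.lower().split()
  # Drop the stop words
  return [token for token in tokens if token not in DEMO_STOP_WORDS]

# Hypothetical headline, not taken from the dataset
sample = "Breaking: 3 senators vote at 10pm in the House!"
print(clean_tokens(sample))
# ['breaking', 'senators', 'vote', 'pm', 'house']
```

Note how the digits in "10pm" disappear while the letters "pm" survive as their own token.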


Apply this function to our content column

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

0        immigr abort race rule due suprem court politi...
1        big drop asylum seeker illeg cross canada sept...
2        trump tell hous leader cancel healthcar bill v...
3        hypocrit republican refus investig flynn scand...
4        break hispan men coward beat woman front yard ...
                               ...                        
34995    democrat want strong respons intel report elec...
34996    gop congress deliv trump biggest insult possib...
34997    phoenix polic depart sent trump ceas desist le...
34998    russian custodi tatar leader vow return crimea...
34999    timelin mileston legal fight texa abort law po...
Name: content, Length: 35000, dtype: object


In [None]:
# separating the data and the label
X = news_dataset['content'].values
Y = news_dataset['type'].values
print(X)

['immigr abort race rule due suprem court politicsnew'
 'big drop asylum seeker illeg cross canada septemb worldnew'
 'trump tell hous leader cancel healthcar bill vote politicsnew' ...
 'phoenix polic depart sent trump ceas desist letter furiou video news'
 'russian custodi tatar leader vow return crimea worldnew'
 'timelin mileston legal fight texa abort law politicsnew']


In [None]:
print(Y)

[0 0 0 ... 1 0 0]


**Converting the textual data to numerical data**

In [None]:
vectorizer = TfidfVectorizer()
# tf - term frequency:
# counts how many times a word appears in a document
# idf - inverse document frequency:
# down-weights words that appear in many documents and so say little about any one of them
# each document becomes a feature vector of tf * idf weights
vectorizer.fit(X)
X = vectorizer.transform(X)
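
The weighting can be worked through by hand on a few of the stemmed strings printed earlier. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is my understanding of TfidfVectorizer's defaults; it splits on whitespace rather than using the vectorizer's own tokenizer, and tfidf_rows is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
  """Return (vocab, rows); each row is an L2-normalized TF-IDF vector."""
  tokenized = [doc.split() for doc in docs]
  vocab = sorted({token for tokens in tokenized for token in tokens})
  n_docs = len(docs)
  # Document frequency: number of documents that contain each term
  df = Counter(token for tokens in tokenized for token in set(tokens))
  # Smoothed inverse document frequency
  idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
  rows = []
  for tokens in tokenized:
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    rows.append([value / norm if norm else 0.0 for value in raw])
  return vocab, rows

# Three stemmed strings copied from the printed X above
docs = [
  'immigr abort race rule due suprem court politicsnew',
  'timelin mileston legal fight texa abort law politicsnew',
  'russian custodi tatar leader vow return crimea worldnew',
]
vocab, rows = tfidf_rows(docs)
for term in ('immigr', 'politicsnew'):
  print(term, round(rows[0][vocab.index(term)], 4))
```

In the first document, 'politicsnew' occurs in two of the three documents and so gets a smaller weight than 'immigr', which occurs in only one.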

In [None]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 360974 stored elements and shape (35000, 12332)>
  Coords	Values
  (0, 30)	0.41134707441289076
  (0, 2437)	0.30344433217381245
  (0, 3316)	0.46272204451955995
  (0, 5385)	0.3454471341324184
  (0, 8284)	0.1513849021296021
  (0, 8685)	0.36855342766078353
  (0, 9330)	0.33689498671940915
  (0, 10618)	0.36472244099987794
  (1, 597)	0.38126586067705814
  (1, 1031)	0.2971989646454156
  (1, 1607)	0.33782579295180953
  (1, 2542)	0.37005961825198525
  (1, 3290)	0.3059643132331325
  (1, 5362)	0.27952275082440603
  (1, 9638)	0.4032226856783021
  (1, 9684)	0.4065719185191879
  (1, 12162)	0.12490933943020247
  (2, 1045)	0.3222916495218398
  (2, 1611)	0.4784437591161827
  (2, 4999)	0.41468941622853034
  (2, 5228)	0.29080513300862804
  (2, 6191)	0.35466450051522724
  (2, 8284)	0.16857761427187357
  (2, 10844)	0.3600943936532504
  (2, 11223)	0.15145799762250878
  :	:
  (34997, 2919)	0.47617977917637005
  (34997, 4359)	0.3165281048960364
  (3

Splitting the data into training and test sets

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,stratify = Y,random_state=2)
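
The stratify = Y argument asks for the same class ratio in both parts. A rough sketch of that idea in plain Python follows; it is not scikit-learn's implementation, and stratified_split and the 60/40 label list are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
  """Return (train_idx, test_idx) with about the same class ratio in each."""
  by_class = defaultdict(list)
  for idx, label in enumerate(labels):
    by_class[label].append(idx)
  rng = random.Random(seed)
  train_idx, test_idx = [], []
  for indices in by_class.values():
    # Shuffle within each class, then cut off the test share of that class
    rng.shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx.extend(indices[:n_test])
    train_idx.extend(indices[n_test:])
  return sorted(train_idx), sorted(test_idx)

labels = [0] * 60 + [1] * 40  # hypothetical 60/40 class balance
train_idx, test_idx = stratified_split(labels)
test_fake_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_fake_share)
# 80 20 0.4
```

The share of label 1 in the test part matches the 40% share in the full list, which is the property stratification is meant to preserve.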

Training the Model : Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)
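
Once fitted, a logistic regression model scores a document as a weighted sum of its features, passes that score through the sigmoid function, and predicts label 1 when the resulting probability is at least 0.5. The sketch below illustrates that decision rule only; the weights, bias, and feature values are made-up numbers, not values learned by the model above.

```python
import math

def predict_proba_fake(weights, bias, features):
  """Sigmoid of the linear score w.x + b for one sparse feature dict."""
  score = bias + sum(weights.get(i, 0.0) * value for i, value in features.items())
  return 1.0 / (1.0 + math.exp(-score))

# Hypothetical learned parameters: feature index -> weight
weights = {0: 2.0, 1: -1.5}
bias = 0.0
# Hypothetical sparse document: feature index -> TF-IDF value
features = {0: 0.6, 1: 0.2}

p = predict_proba_fake(weights, bias, features)
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

Here the score is 2.0 * 0.6 - 1.5 * 0.2 = 0.9, so the probability is above 0.5 and the predicted label is 1 (fake).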

In [None]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
# comparing predictions made by our model with actual labels
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # (true labels, predictions)
print("Training Data Accuracy: ",training_data_accuracy)

Training Data Accuracy:  0.9998928571428571


In [None]:
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # (true labels, predictions)
print("Testing Data Accuracy: ",testing_data_accuracy)

Testing Data Accuracy:  0.9998571428571429


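The model was trained and evaluated successfully, reaching about 99.99% accuracy on both the training and the test data. Because content includes subject, and in the rows shown above each subject value appears with only one label, this score may partly reflect the subject field rather than the wording of the titles.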
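
Accuracy is simply the fraction of predictions that match the true labels. The test split holds 20% of the 35,000 rows, i.e. 7,000 articles, so the reported test score is consistent with a single misclassified article. The sketch below shows the arithmetic on hypothetical label lists; accuracy is a hand-written helper, not the scikit-learn function.

```python
def accuracy(y_true, y_pred):
  """Fraction of positions where the prediction equals the true label."""
  correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
  return correct / len(y_true)

# Hypothetical labels: 7,000 test articles with exactly one wrong prediction
y_true = [0] * 3500 + [1] * 3500
y_pred = list(y_true)
y_pred[0] = 1  # flip one prediction so that it is wrong

print(accuracy(y_true, y_pred))
```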

**Making Predictive System**

In [None]:
X_new = X_test[0]
prediction = model.predict(X_new)
print(prediction)
print(X_test[0])

[1]
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 7 stored elements and shape (1, 12332)>
  Coords	Values
  (0, 1195)	0.37551378060178636
  (0, 2443)	0.3946180686715474
  (0, 3392)	0.2838505059885791
  (0, 3602)	0.3849791050725543
  (0, 6600)	0.5203231965824306
  (0, 6903)	0.2828791887871301
  (0, 9263)	0.3517514676771524


In [None]:
if prediction[0] == 0:
  print("Real News")
else:
  print("Fake News!!!")

Fake News!!!


In [None]:
# if prediction[0] == Y_test[0]:
#   print("Correct Prediction")
# else:
#   print("False prediction")

In [None]:
import joblib

# Save trained model
joblib.dump(model, "fake_news_model.pkl")

# Save vectorizer
joblib.dump(vectorizer, "vectorizer.pkl")

print("Model and Vectorizer saved successfully")


Model and Vectorizer saved successfully
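
The saved files can later be loaded to classify a new article. The sketch below assumes the same preprocessing as in training (title and subject joined, then passed through the stemming function); predict_label is a hypothetical helper, and it takes the loaded objects as arguments so it does not depend on where the files live.

```python
def predict_label(title, subject, model, vectorizer, preprocess):
  """Classify one article using already-loaded artifacts.

  model, vectorizer: objects restored with joblib.load(...)
  preprocess: the cleaning function used at training time (e.g. stemming)
  """
  # Rebuild the content string exactly as it was built for training
  content = preprocess(title + " " + subject)
  # transform() expects an iterable of documents, hence the list
  features = vectorizer.transform([content])
  prediction = model.predict(features)
  return "Real News" if prediction[0] == 0 else "Fake News"

# Possible use in a fresh session (assumes both .pkl files are in the
# working directory and stemming() is defined as above):
# import joblib
# model = joblib.load("fake_news_model.pkl")
# vectorizer = joblib.load("vectorizer.pkl")
# print(predict_label("Some headline", "politicsNews", model, vectorizer, stemming))
```

The vectorizer must be the fitted one that was saved, not a new TfidfVectorizer, since the model's weights are tied to that vocabulary.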


In [None]:
from google.colab import files

files.download("fake_news_model.pkl")
files.download("vectorizer.pkl")

