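About the Dataset:

1. title: the title of a news article
2. text: the text of the article; could be incomplete
3. date: the date of the article (DD-MM-YYYY)
4. source: the news source of the article
5. author: author of the news article
6. category: topic category of the article
7. label: a text label that marks whether the news article is real or fake
8. label_new: the same label encoded as a number:
           1: Fake news
           0: Real news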





Importing the Dependencies

In [None]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Pre-processing

In [None]:
# loading the dataset to a pandas DataFrame
# news_dataset = pd.read_csv('/content/train.csv')
# news_dataset = pd.read_csv('WELFake_Dataset.csv', on_bad_lines='skip')
# news_dataset = pd.read_csv('fake.csv', encoding='ISO-8859-1', on_bad_lines='skip')
news_dataset = pd.read_csv('fake_news_dataset.csv', encoding='ISO-8859-1', on_bad_lines='skip', engine='python')



In [None]:
news_dataset.shape

(4257, 8)

In [None]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0,title,text,date,source,author,category,label,label_new
0,Foreign Democrat final.,more tax development both store agreement lawy...,10-03-2023,NY Times,Paula George,Politics,real,0.0
1,To offer down resource great point.,probably guess western behind likely next inve...,25-05-2022,Fox News,Joseph Hill,Politics,fake,1.0
2,Himself church myself carry.,them identify forward present success risk sev...,01-09-2022,CNN,Julia Robinson,Business,fake,1.0
3,You unit its should.,phone which item yard Republican safe where po...,07-02-2023,Reuters,Mr. David Foster DDS,Science,fake,1.0
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,03-04-2023,CNN,Austin Walker,Technology,fake,1.0


In [None]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

title,0
text,0
date,1
source,211
author,237
category,1
label,1
label_new,1


In [None]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')
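
Filling every column with an empty string also puts `''` into the numeric `label_new` column, which is why a later cell has to filter out empty labels and cast through `float`. One possible alternative ordering is sketched below on a made-up three-row frame (the `demo` frame and its values are illustrative assumptions, not rows from the dataset): drop rows with a missing label first, then fill only the text columns.

```python
import pandas as pd

# Made-up three-row frame standing in for news_dataset (values are illustrative only).
demo = pd.DataFrame({
    "title": ["First title.", "Second title.", "Third title."],
    "author": ["A. Writer", None, "C. Writer"],
    "label_new": [0.0, 1.0, None],
})

# Drop rows whose label is missing while the column is still numeric.
demo = demo.dropna(subset=["label_new"]).copy()

# Fill only the text columns, so the label column never holds strings.
text_columns = ["title", "author"]
demo[text_columns] = demo[text_columns].fillna("")

# The label column now converts directly to integers.
demo["label_new"] = demo["label_new"].astype(int)

print(demo)
```

With this ordering the label column stays numeric throughout, so no string comparison against `''` is needed.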

In [None]:
# merging the news title and the author name
# (no separator is added; the cleaning step later turns punctuation into spaces)
news_dataset['content'] = news_dataset['title'] + news_dataset['author']

In [None]:
print(news_dataset['content'])

0                     Foreign Democrat final.Paula George
1          To offer down resource great point.Joseph Hill
2              Himself church myself carry.Julia Robinson
3                You unit its should.Mr. David Foster DDS
4       Billion believe employee summer how.Austin Walker
                              ...                        
4252                 Hotel foreign toward.Kevin Valdez MD
4253                     Note young what dark.Amanda King
4254              Level history data lot.Elizabeth Hughes
4255          Environment may dark else field.Jaime Clark
4256                 Air star visit stand win civil blue.
Name: content, Length: 4257, dtype: object


In [None]:
# separating the data & label
X = news_dataset.drop(columns='label_new')
Y = news_dataset['label_new']

In [None]:
print(X)
print(Y)

                                     title  \
0                  Foreign Democrat final.   
1      To offer down resource great point.   
2             Himself church myself carry.   
3                     You unit its should.   
4     Billion believe employee summer how.   
...                                    ...   
4252                 Hotel foreign toward.   
4253                 Note young what dark.   
4254               Level history data lot.   
4255      Environment may dark else field.   
4256  Air star visit stand win civil blue.   

                                                   text        date  \
0     more tax development both store agreement lawy...  10-03-2023   
1     probably guess western behind likely next inve...  25-05-2022   
2     them identify forward present success risk sev...  01-09-2022   
3     phone which item yard Republican safe where po...  07-02-2023   
4     wonder myself fact difficult course forget exa...  03-04-2023   
...                  

Stemming:

Stemming is the process of reducing a word to its root form (its stem) by stripping suffixes. The stem does not have to be a dictionary word.

example:
acting, acted, acts --> act
history --> histori

In [None]:
port_stem = PorterStemmer()

In [None]:
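# build the stopword set once; a set lookup is much faster than
# calling stopwords.words('english') for every token
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into tokens
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower().split()
    # drop stopwords and stem the remaining tokens
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    return ' '.join(stemmed_content)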

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

0               foreign democrat final paula georg
1            offer resourc great point joseph hill
2                      church carri julia robinson
3                          unit mr david foster dd
4       billion believ employ summer austin walker
                           ...                    
4252          hotel foreign toward kevin valdez md
4253                   note young dark amanda king
4254         level histori data lot elizabeth hugh
4255          environ may dark el field jaim clark
4256           air star visit stand win civil blue
Name: content, Length: 4257, dtype: object
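
To see what happens before the stemmer runs, the cleaning steps can be sketched with the standard library alone. This is only an illustration: `clean_tokens` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's full English stopword list), and no stemming is applied here.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's full English list.
STOP_WORDS = {"to", "down", "its", "should", "you", "myself", "himself"}

def clean_tokens(text):
    # Replace every non-letter character (digits, punctuation) with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace.
    tokens = letters_only.lower().split()
    # Keep only tokens that are not stopwords.
    return [token for token in tokens if token not in STOP_WORDS]

# Row 3 of the merged content column, before stemming.
print(clean_tokens("You unit its should.Mr. David Foster DDS"))
# → ['unit', 'mr', 'david', 'foster', 'dds']
```

The period between the title and the author name is replaced by a space, so the two fields split into separate tokens even though they were concatenated without a separator.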


In [None]:
# remove rows with empty labels (copy to avoid pandas' SettingWithCopyWarning)
news_dataset = news_dataset[news_dataset['label_new'] != ''].copy()

# convert labels from float to integer type
news_dataset['label_new'] = news_dataset['label_new'].astype(float).astype(int)

# separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label_new'].values

In [None]:
print(X)

['foreign democrat final paula georg'
 'offer resourc great point joseph hill' 'church carri julia robinson' ...
 'note young dark amanda king' 'level histori data lot elizabeth hugh'
 'environ may dark el field jaim clark']


In [None]:
print(Y)

[0 1 1 ... 0 0 0]


In [None]:
Y.shape

(4256,)

In [None]:
# converting the textual data to numerical data (TF-IDF features)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)

In [None]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 29055 stored elements and shape (4256, 2133)>
  Coords	Values
  (0, 502)	0.4090806798871778
  (0, 663)	0.3890800362829856
  (0, 691)	0.4090806798871778
  (0, 743)	0.49544959221624424
  (0, 1461)	0.5181239442690105
  (1, 776)	0.3990588686306478
  (1, 859)	0.4636516772129826
  (1, 1005)	0.38465155500482484
  (1, 1401)	0.41166838506608955
  (1, 1501)	0.3884601173753097
  (1, 1604)	0.39679995055824313
  (2, 310)	0.4535473064762885
  (2, 366)	0.467433074540296
  (2, 1015)	0.5900092633271026
  (2, 1640)	0.4771689751043329
  (3, 481)	0.3760264021951083
  (3, 486)	0.48379769817348267
  (3, 696)	0.5277773012188438
  (3, 1338)	0.34568853376743475
  (3, 1990)	0.4759143725862011
  (4, 116)	0.4356856417650774
  (4, 166)	0.40804969873055164
  (4, 193)	0.39121828876451037
  (4, 593)	0.4148088739439509
  (4, 1883)	0.38646405265082723
  :	:
  (4251, 2084)	0.36482767824023865
  (4252, 691)	0.3756569497701742
  (4252, 884)	0.4096484489703889
 

In [None]:
print(set(Y))  # Show all unique label values

{np.int64(0), np.int64(1)}


Splitting the dataset into training & test data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)

Training the Model: Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)
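
In this notebook the vectorizer is fitted on all rows before the split, so the vocabulary and IDF weights are computed with the test rows included. One common way to keep the test rows unseen is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn pipeline. The sketch below uses made-up toy strings and labels (`texts`, `labels` are assumptions for illustration, not data from the notebook).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy data: ten short texts with alternating labels.
texts = [
    "foreign democrat final", "offer resource great point",
    "church carry story", "unit report market",
    "billion believe employ summer", "hotel foreign toward",
    "note young dark", "level history data lot",
    "environment may dark field", "air star visit stand",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text first, so the test rows stay untouched.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=2
)

# The pipeline fits the vectorizer on the training text only,
# then trains the classifier on the resulting features.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts_train, y_train)

# At prediction time the fitted vectorizer is applied to the test text.
predictions = pipeline.predict(texts_test)
print(predictions)
```

A fitted pipeline also accepts raw strings directly, which would simplify any later prediction function.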

Evaluation

accuracy score

In [None]:
# accuracy score on the training data (true labels first, then predictions)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.7917156286721504


In [None]:
# accuracy score on the test data (true labels first, then predictions)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.4788732394366197
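
A training accuracy near 0.79 next to a test accuracy near 0.48 means the model fits the training rows but does not generalize; on a two-class problem, 0.48 is about what guessing would give. A quick sanity check is to compare against a trivial baseline. The sketch below does this on synthetic noise (random features and random labels, generated here purely as an assumption for illustration), where a similar gap between training and test accuracy is expected.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: features carry no information about the labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 200))
y_demo = rng.integers(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Baseline that always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
# The same kind of model as in the notebook, fitted on the noise features.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))

print("baseline test accuracy:", baseline_acc)
print("model training accuracy:", train_acc)
print("model test accuracy:", test_acc)
```

If a model's test accuracy does not clearly beat such a baseline, its features likely carry little signal about the label, whatever the training accuracy says.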


Making a Predictive System

In [None]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('The news is Real')
else:
    print('The news is Fake')

[1]
The news is Fake


In [None]:
print(Y_test[3])

1


In [None]:
!pip install gradio




In [None]:
import gradio as gr

# Make sure these are already defined in earlier cells:
# stemming, vectorizer, model

# Define the prediction function
def fake_news_predict(input_text):
    # apply the same cleaning and stemming used for the training data
    processed_text = stemming(input_text)

    vectorized_input = vectorizer.transform([processed_text])
    prediction = model.predict(vectorized_input)

    if prediction[0] == 0:
        return "📰 Real News"
    else:
        return "🚨 Fake News"

# Create a web interface
interface = gr.Interface(
    fn=fake_news_predict,
    inputs=gr.Textbox(lines=10, placeholder="Paste your news article here..."),
    outputs="text",
    title="🧠 Fake News Detector",
    description="Enter any news article or headline to check if it's Fake or Real. This uses a Logistic Regression model trained on the titles and author names in fake_news_dataset.csv."
)

# Launch the web app
interface.launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://dcc1b5a4e0436e5259.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


