<a href="https://colab.research.google.com/github/JayanthiTanusha/Lamp/blob/main/Copy_of_Project_5_Fake_News_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

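About the Dataset:

The file loaded in this notebook (fakenews.csv) has the following columns:

1. id: unique id for a news article
2. context: the text of the article (Japanese in the sample rows); could be incomplete
3. isfake: a label that marks whether the news article is real or fake; judging from the sample rows:
           0: real news (nchar_fake is 0)
           1: partly fake news (nchar_real and nchar_fake are both non-zero)
           2: fake news (nchar_real is 0)
4. nchar_real: character count of the real part of the text (inferred from the column name)
5. nchar_fake: character count of the fake part of the text (inferred from the column name)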





Importing the Dependencies

In [None]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Pre-processing

In [None]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/sample_data/fakenews.csv')

In [None]:
news_dataset.shape

(13040, 5)

In [None]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0,id,context,isfake,nchar_real,nchar_fake
0,000128042337,朝日新聞など各社の報道によれば、宅配便最大手「ヤマト運輸」が日本郵政公社を相手取り、大手コン...,0,541,0
1,00012b7a8314,11月5日の各社報道によると、諫早湾干拓事業は諫早海人（諫早湾の「海」）に囲まれる大洋に位置...,2,0,385
2,0005fb48880b,産経新聞、中日新聞によると、2004年から2005年まで、この大会による3年おきの開催を、2...,2,0,255
3,00087f9e14ab,開催地のリオデジャネイロ市に対して、大会期間中のリオデジャネイロオリンピックに関する公式発表...,1,435,218
4,000c9ac3d552,毎日新聞・時事通信によると、2006年2月13日には、グッドウィル・グッゲンハイム・アン・ハ...,2,0,248
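
The five rows above hint at how isfake relates to the two character counts: 0 appears when only nchar_real is non-zero, 2 when only nchar_fake is non-zero, and 1 when both are. A minimal sketch that checks this reading against those rows; the rule in `expected_label` is an assumption inferred from the sample, not documented behaviour of the dataset:

```python
# (isfake, nchar_real, nchar_fake) copied from the five rows printed above
sample_rows = [
    (0, 541, 0),
    (2, 0, 385),
    (2, 0, 255),
    (1, 435, 218),
    (2, 0, 248),
]

def expected_label(nchar_real, nchar_fake):
    """Assumed rule: 0 = all real, 2 = all fake, 1 = a mix of both."""
    if nchar_fake == 0:
        return 0
    if nchar_real == 0:
        return 2
    return 1

# collect every sample row whose label disagrees with the assumed rule
mismatches = [row for row in sample_rows if expected_label(row[1], row[2]) != row[0]]
print("rows that contradict the assumed rule:", mismatches)
```

Running the same check over the whole DataFrame would be a reasonable step before relying on this interpretation of the label.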


In [None]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0,0
id,0
context,0
isfake,0
nchar_real,0
nchar_fake,0


In [None]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [None]:
# using the article text as the content column (this dataset has no author or title columns)
news_dataset['content'] = news_dataset['context']


In [None]:
print(news_dataset['content'])



In [None]:
# separating the data & label
X = news_dataset.drop(columns='isfake')
Y = news_dataset['isfake']


In [None]:
print(X)
print(Y)


Stemming:

Stemming is the process of reducing a word to its root form.

example:
acting, acted, acts --> act
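
To see what the Porter stemmer returns for a handful of related words, a quick sketch can help. The word list is an illustrative assumption and is not taken from the dataset:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# illustrative word list; not taken from the dataset
words = ["acting", "acted", "acts", "actor", "actress"]

# map each word to the stem the Porter algorithm produces for it
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The inflected verb forms should collapse to act. Nouns such as actor and actress are likely to come back unchanged, since the Porter rules strip common suffixes rather than look up a dictionary root.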

In [None]:
port_stem = PorterStemmer()

In [None]:
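# build the stop-word set once instead of re-reading the list for every word
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep ASCII letters only; note that this also removes all Japanese characters
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop English stop words and stem the remaining tokens
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content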

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

0                                                         
1                                                         
2                                                       st
3                                                         
4                                                      nhk
                               ...                        
13035                                                  utc
13036                                                     
13037                                                     
13038                                                     
13039    nova nova nova nova nova nova nova nova nova n...
Name: content, Length: 13040, dtype: object
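
The mostly empty strings above are what one would expect from the `[^a-zA-Z]` pattern in `stemming`: it replaces every character that is not an ASCII letter with a space, so the Japanese text disappears and only embedded Latin-script tokens such as nhk or nova survive. A minimal sketch of that effect, using a made-up sentence rather than a row from the dataset:

```python
import re

# hypothetical input mixing Japanese text with a Latin-script token
sample = "NHKによると、大会は2006年に開催された。"

# same substitution as in stemming(): keep ASCII letters only
ascii_only = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
print(ascii_only)

# for comparison: \w is Unicode-aware for str patterns in Python 3,
# so kana, kanji and digits are kept and only punctuation splits the text
unicode_words = re.findall(r'\w+', sample)
print(unicode_words)
```

Japanese is written without spaces between words, so even a Unicode-aware pattern only splits at punctuation. A Japanese tokenizer or character n-grams would be needed to get word-like features.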


In [None]:
#separating the data and label
X = news_dataset.drop(columns='isfake')
Y = news_dataset['isfake']


In [None]:
print(X)

                 id                                            context  \
0      000128042337  朝日新聞など各社の報道によれば、宅配便最大手「ヤマト運輸」が日本郵政公社を相手取り、大手コン...   
1      00012b7a8314  11月5日の各社報道によると、諫早湾干拓事業は諫早海人（諫早湾の「海」）に囲まれる大洋に位置...   
2      0005fb48880b  産経新聞、中日新聞によると、2004年から2005年まで、この大会による3年おきの開催を、2...   
3      00087f9e14ab  開催地のリオデジャネイロ市に対して、大会期間中のリオデジャネイロオリンピックに関する公式発表...   
4      000c9ac3d552  毎日新聞・時事通信によると、2006年2月13日には、グッドウィル・グッゲンハイム・アン・ハ...   
...             ...                                                ...   
13035  ffc1ab0492e3  広島市の健康福祉企画課の説明では11月1日から12月10日の間に、市内各区役所に22人の派遣...   
13036  ffc40591a6ae  日本経済新聞社によるとソフトバンクモバイルは5日、月額基本料金が980円（税込）の新料金プラ...   
13037  ffcabd663b9f  東京新聞によると※は日本生命所属のキャッチコピー・ロゴ。10日本生命（株）。本社には同社の主...   
13038  ffe993d53780  日刊スポーツによると、1996年の平塚市議会で木原さんは、平塚市内の病院に入院していた際、「...   
13039  fff20532e008  30日付の官報によると、NOVAの新学習センターは、1室4,300平方メートルで、NOVAの...   

       nchar_real  nchar_fake  \
0             541           0   
1               0         385   
2           

In [None]:
print(Y)

0        0
1        2
2        2
3        1
4        2
        ..
13035    1
13036    0
13037    2
13038    2
13039    2
Name: isfake, Length: 13040, dtype: int64


In [None]:
Y.shape

(13040,)

In [None]:
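# converting the textual data to numerical data
# caution: X is still a DataFrame here, and iterating a DataFrame yields its column
# names, so the vectorizer is fitted on the five column labels rather than on the
# articles; the text is vectorized again from the 'content' column after the split
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)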

In [None]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5 stored elements and shape (5, 5)>
  Coords	Values
  (0, 2)	1.0
  (1, 1)	1.0
  (2, 4)	1.0
  (3, 3)	1.0
  (4, 0)	1.0
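
The (5, 5) shape above is a sign that the vectorizer saw five documents, not 13,040. Iterating over a pandas DataFrame yields its column names, so fitting on a DataFrame builds the vocabulary from the column labels. A small sketch with a hypothetical two-row frame that mimics the column layout of X:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical two-row frame with the same column names as X above
frame = pd.DataFrame({
    'id': ['a1', 'b2'],
    'context': ['first article text', 'second article text'],
    'nchar_real': [10, 0],
    'nchar_fake': [0, 12],
    'content': ['first articl text', 'second articl text'],
})

# iterating a DataFrame yields column labels, not rows
print(list(frame))

# so the vectorizer treats each column name as one "document"
column_vectorizer = TfidfVectorizer()
by_columns = column_vectorizer.fit_transform(frame)
print(by_columns.shape, sorted(column_vectorizer.vocabulary_))

# passing the text column instead gives one row per article
by_rows = TfidfVectorizer().fit_transform(frame['content'])
print(by_rows.shape)
```

Passing a single text column, as the later training cell does with `news_dataset['content']`, avoids the problem.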


Splitting the dataset into training & test data

In [None]:
# use the preprocessed text as the feature and isfake as the target label
X = news_dataset['content']
Y = news_dataset['isfake']

# hold out 20% for testing; stratify=Y keeps the class proportions the same in both splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,
    stratify=Y,
    random_state=2
)

Training the Model: Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
# Step 1: Convert text to numeric features (the vectorizer is fitted on the training split only)
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Step 2: Train model on numeric data
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, Y_train)
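
Because the English stop-word and stemming steps leave little of the Japanese text, one possible alternative is to skip them and build TF-IDF features from character n-grams of the raw context column. This is an assumption, not what the notebook above does. A sketch on a made-up four-document corpus with hypothetical labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# made-up documents and labels, for illustration only
texts = [
    "朝日新聞によると、大会は東京で開催された。",
    "毎日新聞によると、新料金プランが発表された。",
    "官報によると、学習センターは火星に移転した。",
    "各社報道によると、市役所は月面に支所を開設した。",
]
labels = [0, 0, 2, 2]

# analyzer='char' needs no word boundaries; use 1- and 2-character n-grams
char_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 2))
features = char_vectorizer.fit_transform(texts)

# same classifier type as in the notebook, trained on the character features
classifier = LogisticRegression(max_iter=1000)
classifier.fit(features, labels)

predicted = classifier.predict(char_vectorizer.transform(texts[:1]))
print(features.shape)
print(predicted)
```

Whether character n-grams improve accuracy on the real dataset is untested here; the sketch only shows that the features are no longer empty.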


Evaluation

accuracy score

In [None]:
# accuracy score on the training and test data

# Predict on training set (use TF-IDF vectors, not raw text)
X_train_prediction = model.predict(X_train_tfidf)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

print("Training Accuracy:", training_data_accuracy)

# Predict on test set
X_test_prediction = model.predict(X_test_tfidf)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print("Test Accuracy:", test_data_accuracy)


Training Accuracy: 0.5773581288343558
Test Accuracy: 0.46088957055214724
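
With three classes, an accuracy of about 0.46 is easier to judge next to a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. A minimal sketch with hypothetical label lists (the real Y_test and predictions would take their place):

```python
from collections import Counter

# hypothetical true and predicted labels, for illustration only
y_true = [0, 2, 2, 1, 2, 0, 1, 2]
y_pred = [2, 2, 2, 2, 2, 0, 2, 2]

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# baseline: always predict the most common true label
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

model_accuracy = accuracy(y_true, y_pred)
print("model accuracy:", model_accuracy)
print("majority label:", majority_label, "baseline accuracy:", baseline_accuracy)
```

If the model's test accuracy is close to the baseline, the features probably carry little information about the label.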


In [None]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.5773581288343558



In [None]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.46088957055214724


Making a Predictive System

In [None]:
# Example: take the first news text
# (after the preprocessing above, the content of this row is an empty string)
X_new = news_dataset['content'].iloc[0]

# Vectorize the text first (using the same vectorizer used during training)
X_new_vec = vectorizer.transform([X_new])  # Ensures 2D input

# Make prediction
prediction = model.predict(X_new_vec)

print("Prediction:", prediction)
if prediction[0] == 0:
    print("The news is Real")
else:
    print("The news is Fake")



Prediction: [2]
The news is Fake
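
Since isfake takes three values, the `if prediction[0] == 0 ... else` branch above reports both 1 and 2 as Fake. A small sketch of a lookup table that keeps the classes apart; the label names are assumptions inferred from the sample rows, not taken from dataset documentation:

```python
# assumed meaning of each isfake value (inferred from the sample rows)
LABEL_NAMES = {
    0: "Real",
    1: "Partly fake",
    2: "Fake",
}

def describe(prediction):
    """Map a predicted class id to a readable name, flagging unknown ids."""
    return LABEL_NAMES.get(int(prediction), f"Unknown label {prediction}")

for predicted_label in (0, 1, 2):
    print("The news is", describe(predicted_label))
```

In the notebook this would be called as `describe(prediction[0])` on the array returned by `model.predict`.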


