<a href="https://colab.research.google.com/github/MariemMhadhbii/-Machine_Learning_Projects/blob/main/Project_3_Fake_News_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

About the Dataset:

1. Text: the text of a news article; could be incomplete
2. label: marks whether the news article is real or fake:
           Fake: fake news
           Real: real news





Importing the Dependencies

In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Pre-processing

In [5]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/train.csv')

In [6]:
news_dataset.shape

(9900, 2)

In [7]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real


In [8]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0,0
Text,0
label,0


In [9]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [24]:
news_dataset['content'] = news_dataset['Text']

In [14]:
print(news_dataset.columns)

Index(['Text', 'label'], dtype='object')


In [None]:
print(news_dataset['content'])

In [15]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [16]:
print(X)
print(Y)

                                                   Text
0      Top Trump Surrogate BRUTALLY Stabs Him In The...
1     U.S. conservative leader optimistic of common ...
2     Trump proposes U.S. tax overhaul, stirs concer...
3      Court Forces Ohio To Allow Millions Of Illega...
4     Democrats say Trump agrees to work on immigrat...
...                                                 ...
9895   Wikileaks Admits To Screwing Up IMMENSELY Wit...
9896  Trump consults Republican senators on Fed chie...
9897  Trump lawyers say judge lacks jurisdiction for...
9898   WATCH: Right-Wing Pastor Falsely Credits Trum...
9899   Sean Spicer HILARIOUSLY Branded As Chickensh*...

[9900 rows x 1 columns]
0       Fake
1       Real
2       Real
3       Fake
4       Real
        ... 
9895    Fake
9896    Real
9897    Real
9898    Fake
9899    Fake
Name: label, Length: 9900, dtype: object


Stemming:

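Stemming is the process of reducing a word to its root (stem) by stripping suffixes, so that inflected forms of a word map to the same feature. The stem does not have to be a dictionary word.

example:
acting, acted, acts --> act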
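
As a small illustration of the idea, the sketch below runs NLTK's PorterStemmer on a few words. The second word list is taken from the first headline of this dataset, lowercased; the variable names are only for this example.

```python
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Inflected forms of one verb collapse to a shared stem.
inflected = ['acting', 'acted', 'acts']
inflected_stems = [port_stem.stem(word) for word in inflected]
print(inflected_stems)

# Words from the first headline in the dataset; the stems are not always real words.
headline_words = ['surrogate', 'brutally', 'stabs']
headline_stems = [port_stem.stem(word) for word in headline_words]
print(headline_stems)
```

The stemmer itself needs no corpus download; only the stopword list does.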

In [17]:
port_stem = PorterStemmer()

In [18]:
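# build the stopword set once; calling stopwords.words() for every word is very slow
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into tokens
    stemmed_content = re.sub(r'[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining tokens
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content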

In [25]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [28]:
print(news_dataset['content'])

0       top trump surrog brutal stab back pathet video...
1       u conserv leader optimist common ground health...
2       trump propos u tax overhaul stir concern defic...
3       court forc ohio allow million illeg purg voter...
4       democrat say trump agre work immigr bill wall ...
                              ...                        
9895    wikileak admit screw immens twitter poll hilla...
9896    trump consult republican senat fed chief candi...
9897    trump lawyer say judg lack jurisdict defam law...
9898    watch right wing pastor fals credit trump save...
9899    sean spicer hilari brand chickensh bolt brief ...
Name: content, Length: 9900, dtype: object


In [29]:
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [30]:
print(X)

['top trump surrog brutal stab back pathet video look though republican presidenti candid donald trump lose support even within rank know thing get bad even top surrog start turn exactli happen fox news newt gingrich call trump pathet gingrich know trump need keep focu hillari clinton even remot want chanc defeat howev trump hurt feel mani republican support sexual assault women turn includ hous speaker paul ryan r wi made trump lash parti gingrich said fox news look first let say trump admir tri help much big trump littl trump littl trump frankli pathet mean mad get phone call trump refer fact paul ryan call congratul debat probabl win despit trump ego tell gingrich also ad donald trump one oppon name hillari clinton name paul ryan anybodi els trump seem realiz person mad truli worst enemi ultim lead defeat one blame watch via politico featur photo joe raedl getti imag'
 'u conserv leader optimist common ground healthcar washington reuter republican u hous repres could achiev common g

In [31]:
print(Y)

['Fake' 'Real' 'Real' ... 'Real' 'Fake' 'Fake']


In [32]:
Y.shape

(9900,)

In [33]:
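# converting the textual data to numerical data (TF-IDF features)
# note: fitting on all rows before the split lets test-set vocabulary and IDF statistics reach the model
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)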
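
To make the TF-IDF weights less opaque, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: raw term counts, smoothed IDF, and L2 normalisation per row. It assumes whitespace tokenisation, which is simpler than the vectorizer's own token pattern, and the function name and toy corpus are made up for this example.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per document plus the IDF table."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalise the row; an empty document keeps norm 1.0 to avoid dividing by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows, idf

# toy corpus of already stemmed text (illustrative only)
toy_corpus = ['trump tax overhaul', 'trump immigr bill', 'court forc ohio']
rows, idf = tfidf_rows(toy_corpus)
print(rows[0])
```

A term that occurs in many documents ('trump' here) gets a lower weight than a term that occurs in only one.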

In [34]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1550030 stored elements and shape (9900, 40295)>
  Coords	Values
  (0, 287)	0.04778447355034268
  (0, 363)	0.09682298048854796
  (0, 972)	0.03143972892540736
  (0, 1448)	0.08345934109394892
  (0, 1856)	0.07286371750620677
  (0, 2257)	0.039527598617236225
  (0, 2297)	0.05966411102820767
  (0, 3213)	0.056383490883613656
  (0, 3432)	0.06506791054573881
  (0, 4362)	0.08391465433975229
  (0, 4826)	0.10289134738400817
  (0, 4910)	0.056223303571082184
  (0, 5468)	0.06871518325327437
  (0, 6151)	0.09885764047479048
  (0, 6687)	0.09815225049215494
  (0, 8036)	0.06544078756938368
  (0, 8151)	0.1531963802975418
  (0, 8470)	0.05695591125723686
  (0, 9163)	0.04559368638204754
  (0, 10010)	0.08854881843203868
  (0, 10221)	0.0668700967402666
  (0, 10404)	0.08237005776544144
  (0, 10837)	0.10579196794059839
  (0, 10914)	0.06618300605342758
  (0, 11142)	0.045310736697916625
  :	:
  (9899, 37855)	0.03572656618752602
  (9899, 37948)	0.03063734

Splitting the dataset into training & test data

In [35]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
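
The vectorizer above is fitted on all 9900 rows before the split, so the test rows contribute to the vocabulary and IDF statistics. One possible way to avoid that is to split the text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so the vectorizer is fitted on the training split only. The sketch below uses a tiny made-up corpus, not the notebook's data, and is not the workflow used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# hypothetical toy data standing in for news_dataset['content'] and news_dataset['label']
texts = [
    'shock video leak scandal', 'senat vote tax bill',
    'watch hilari meltdown video', 'court rule immigr order',
    'brutal humili viral tweet', 'republican propos budget plan',
    'pathet rant goe viral', 'democrat agre healthcar reform',
]
labels = ['Fake', 'Real', 'Fake', 'Real', 'Fake', 'Real', 'Fake', 'Real']

# split the raw text first, with the same stratification idea as the notebook
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2)

# fit() fits the vectorizer on the training text only, then trains the classifier
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

# score() transforms the test text with the already fitted vectorizer
test_accuracy = pipeline.score(test_texts, test_labels)
print(test_accuracy)
```

With this layout the same pipeline object can also be passed to cross-validation helpers without refitting the vectorizer by hand.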

Training the Model: Logistic Regression

In [36]:
model = LogisticRegression()

In [37]:
model.fit(X_train, Y_train)

Evaluation

Accuracy score

In [38]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # (y_true, y_pred) order

In [39]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9962121212121212


In [40]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # (y_true, y_pred) order

In [41]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9868686868686869
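
Accuracy is the fraction of predictions that match the true label. With 9900 rows and test_size = 0.2 the test split has 1980 rows, so the score above corresponds to 1954 correct predictions. The sketch below computes the same quantity by hand on made-up labels and also counts each (true, predicted) pair, which shows the kind of error that accuracy alone hides. The helper names are only for this example.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# made-up labels, not taken from the notebook's data
y_true = ['Fake', 'Real', 'Real', 'Fake', 'Real']
y_pred = ['Fake', 'Real', 'Fake', 'Fake', 'Real']

toy_accuracy = accuracy(y_true, y_pred)
toy_counts = confusion_counts(y_true, y_pred)
print(toy_accuracy)
print(toy_counts)

# the reported test score as a plain ratio: 1954 correct out of 1980 test rows
reported_ratio = 1954 / 1980
print(reported_ratio)
```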


Making a Predictive System

In [45]:
X_new = X_test[3] # the test article to predict

prediction = model.predict(X_new)
print(prediction)

# the labels are the strings 'Real' and 'Fake', so compare against those
if prediction[0] == 'Real':
  print('The news is Real')
else:
  print('The news is Fake')

['Real']
The news is Real


In [46]:
print(Y_test[3])

Real
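
The cells above predict on a row of X_test, which is already vectorized. A new raw article would first need the same cleaning and then the fitted vectorizer's transform (not fit) before it reaches the model. The sketch below shows one way to wrap those steps. The helper names and toy training data are hypothetical, and clean_text is a simplified stand-in for stemming() that skips stopword removal and stemming.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def clean_text(text):
    # simplified stand-in for stemming(): letters only, lowercased, single spaces
    return ' '.join(re.sub(r'[^a-zA-Z]', ' ', text).lower().split())

def predict_label(raw_text, vectorizer, model, preprocess=clean_text):
    """Classify one raw article with an already fitted vectorizer and model."""
    # transform only: refitting here would change the feature space the model expects
    features = vectorizer.transform([preprocess(raw_text)])
    return model.predict(features)[0]

# hypothetical toy training data, not the notebook's dataset
toy_texts = [
    'SHOCKING video leaks!', 'Senate votes on tax bill',
    'WATCH: hilarious meltdown', 'Court rules on immigration order',
]
toy_labels = ['Fake', 'Real', 'Fake', 'Real']

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform([clean_text(t) for t in toy_texts])
toy_model = LogisticRegression()
toy_model.fit(toy_features, toy_labels)

result = predict_label('Senators DEBATE the tax bill!', toy_vectorizer, toy_model)
print(f'The news is {result}')
```

In the notebook itself, preprocess would be the stemming function, and vectorizer and model would be the objects fitted above.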
