
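Labels: `FAKE` = fake news; `REAL` = real news. The dataset stores the labels as these strings, and the model below is trained on them directly.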

Importing necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import re # Regular expressions, used below to strip non-alphabetic characters
from nltk.corpus import stopwords # NLTK's stop-word lists (common words such as "the" and "is")
from nltk.stem.porter import PorterStemmer # Reduces a word to its stem
from sklearn.feature_extraction.text import TfidfVectorizer # Converts text into TF-IDF feature vectors
from sklearn.model_selection import train_test_split # Splits data into training and testing sets
from sklearn.linear_model import LogisticRegression # Logistic Regression model for classification
from sklearn.metrics import accuracy_score # Metric to evaluate model performance

Using logistic regression as a model allows for binary classification, such as distinguishing between fake news and real news articles based on extracted features.
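As a rough illustration of how logistic regression produces a binary decision, the sketch below passes a weighted sum of features through the sigmoid function and thresholds the result at 0.5. The weights, bias, and feature values are hypothetical, and treating `FAKE` as the positive class is an arbitrary choice for this example only.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights for a two-feature example (not learned from the dataset).
weights = [2.0, -1.0]
bias = 0.0

p = predict_proba([1.0, 0.5], weights, bias)
# Threshold the probability at 0.5 to get a class label.
label = "FAKE" if p >= 0.5 else "REAL"
print(round(p, 3), label)
```

Training a real model amounts to finding the weights and bias that make these probabilities agree with the labelled articles.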

Natural Language Processing (NLP) libraries, such as NLTK, help in managing text data effectively by providing tools for tasks like stemming and removing stop words.

In [2]:
import nltk
nltk.download('stopwords') # Downloads the list of common English stop words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
print(stopwords.words('english')) # These words add little value for classification, so they are usually removed.

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [4]:
from google.colab import drive
drive.mount('/content/drive') # Mounts Google Drive to access files

Mounted at /content/drive


# Data Pre-processing.

Preprocessing textual data is necessary because computers cannot directly interpret text; it must be converted into meaningful numerical representations. Various functions will be applied in this crucial step.

In [5]:
#Loading the Dataset.
news_dataset = pd.read_csv('/content/drive/MyDrive/Brainwave ML Internship/Fake News Detection/fake_or_real_news.csv') # Loads the dataset from Google Drive

In [6]:
#checking no.of rows and columns.
news_dataset.shape # Displays the number of rows and columns in the dataset

(6335, 4)

In [7]:
news_dataset.rename(columns={'Unnamed: 0':'id'}, inplace=True) # Renames the 'Unnamed: 0' column to 'id'

In [8]:
# print the rows of the dataframe
(news_dataset.head()) # Displays the first 5 rows of the dataframe

|   | id | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [9]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum() # Counts the number of missing values in each column

id       0
title    0
text     0
label    0


Understanding missing values in a dataset is crucial for effective data analysis. The counts above show that no column in this news dataset has missing values, so the `fillna` call in the next cell is only a safeguard.
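To show why filling missing values matters before combining text columns, here is a minimal sketch on a toy DataFrame. The rows are hypothetical and only mimic the `title`, `text`, and `label` columns of the dataset.

```python
import pandas as pd

# Toy frame with hypothetical rows; None marks a missing value.
df = pd.DataFrame({
    "title": ["Headline A", None, "Headline C"],
    "text": ["Body A", "Body B", None],
    "label": ["FAKE", "REAL", "REAL"],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Replace missing values with empty strings so that string concatenation
# does not produce a missing value for the whole row.
filled = df.fillna("")
filled["content"] = filled["title"] + " " + filled["text"]
print(filled["content"].tolist())
```

Without the `fillna("")` step, a row with a missing title or text would end up with a missing `content` value as well.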

In [10]:
# replacing null values with empty strings
news_dataset = news_dataset.fillna('') # Replaces any missing values with empty strings

In [11]:
# combines the title and text columns into a new 'content' column
news_dataset['content'] = news_dataset['title'] + " " + news_dataset['text']

In [12]:
print(news_dataset['content']) # Prints the content of the 'content' column

0       You Can Smell Hillary’s Fear Daniel Greenfield...
1       Watch The Exact Moment Paul Ryan Committed Pol...
2       Kerry to go to Paris in gesture of sympathy U....
3       Bernie supporters on Twitter erupt in anger ag...
4       The Battle of New York: Why This Primary Matte...
                              ...                        
6330    State Department says it can't find emails fro...
6331    The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...
6332    Anti-Trump Protesters Are Tools of the Oligarc...
6333    In Ethiopia, Obama seeks progress on peace, se...
6334    Jeb Bush Is Suddenly Attacking Trump. Here's W...
Name: content, Length: 6335, dtype: object


In [13]:
# separating the data & label
X = news_dataset.drop(columns='label') # Creates a new dataframe X by dropping the 'label' column
Y = news_dataset['label'] # Creates a new series Y containing only the 'label' column

In [14]:
print(X) # Prints the dataframe X
print(Y) # Prints the series Y

         id                                              title  \
0      8476                       You Can Smell Hillary’s Fear   
1     10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2      3608        Kerry to go to Paris in gesture of sympathy   
3     10142  Bernie supporters on Twitter erupt in anger ag...   
4       875   The Battle of New York: Why This Primary Matters   
...     ...                                                ...   
6330   4490  State Department says it can't find emails fro...   
6331   8062  The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...   
6332   8622  Anti-Trump Protesters Are Tools of the Oligarc...   
6333   4021  In Ethiopia, Obama seeks progress on peace, se...   
6334   4330  Jeb Bush Is Suddenly Attacking Trump. Here's W...   

                                                   text  \
0     Daniel Greenfield, a Shillman Journalism Fello...   
1     Google Pinterest Digg Linkedin Reddit Stumbleu...   
2     U.S. Secretary of State 

Stemming: the process of reducing a word to its root form. For example, "running" reduces to "run". This shrinks the vocabulary, because different inflections of a word map to the same feature, which can help the model.

After stemming comes vectorization, which converts the text into numerical feature vectors that a machine learning model can take as input.

In [15]:
port_stem = PorterStemmer() # Initializes the Porter Stemmer

In [16]:
# Build the stop-word set once; calling stopwords.words('english') for every word is slow.
stop_words = set(stopwords.words('english'))

# creating a function.
def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]',' ',content) # Replaces non-alphabetic characters with spaces
    stemmed_content = stemmed_content.lower() # Converts text to lowercase
    stemmed_content = stemmed_content.split() # Splits the text into a list of words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words] # Stems words and removes stop words
    stemmed_content = ' '.join(stemmed_content) # Joins the stemmed words back into a string
    return stemmed_content # Returns the stemmed content

The function uses a regular expression to replace every non-alphabetic character (digits, punctuation) with a space, so that only alphabetic words remain for further processing.

All text is converted to lowercase so that, for example, "Trump" and "trump" are treated as the same word.

Note that the cells below vectorize the raw `content` column; `stemming` is defined here but is not applied to the data before vectorization.
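The cleaning steps described above can be tried on a single sentence. This is a minimal sketch: the small `STOP_WORDS` set and the sample sentence are assumptions made so the example runs without downloading the NLTK stop-word corpus, and `clean_and_stem` is a hypothetical name rather than the notebook's own function.

```python
import re
from nltk.stem.porter import PorterStemmer

# Small hand-written stop-word set (an assumption for this sketch; the
# notebook uses NLTK's full English list, which needs a download).
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "was", "he"}
stemmer = PorterStemmer()

def clean_and_stem(text: str) -> str:
    # Replace anything that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace.
    tokens = letters_only.lower().split()
    # Drop stop words, then stem what remains.
    stems = [stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS]
    return " ".join(stems)

sample = "Paul Ryan was RUNNING to the rally in 2016!"
result = clean_and_stem(sample)
print(result)
```

To use the notebook's `stemming` function on the whole dataset, it would need to be applied to the column before vectorizing, for example with `news_dataset['content'] = news_dataset['content'].apply(stemming)`.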


In [17]:
#separating the data and label
x = news_dataset['content'].values # Extracts the values from the 'content' column
y = news_dataset['label'].values # Extracts the values from the 'label' column

In [18]:
print(x) # Prints the values of x

 'Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO) Google Pinterest Digg Linkedin Reddit Stumbleupon Print Delicious Pocket Tumblr \nThere are two fundamental truths in this world: Paul Ryan desperately wants to be president. And Paul Ryan will never be president. Today proved it. \nIn a particularly staggering example of political cowardice, Paul Ryan re-re-re-reversed course and announced that he was back on the Trump Train after all. This was an aboutface from where he was a few weeks ago. He had previously declared he would not be supporting or defending Trump after a tape was made public in which Trump bragged about assaulting women. Suddenly, Ryan was appearing at a pro-Trump rally and boldly declaring that he already sent in his vote to make him President of the United States. It was a surreal moment. The figurehead of the Republican Party dosed himself in gasoline, got up on a stage on a chilly afternoon in Wisconsin, and lit a match. . @Spe

In [19]:
print(y) # Prints the values of y

['FAKE' 'FAKE' 'REAL' ... 'FAKE' 'REAL' 'REAL']


In [20]:
y.shape # Displays the shape of y

(6335,)

In [21]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer() # Initializes the TfidfVectorizer
vectorizer.fit(x) # Fits the vectorizer to the data

x = vectorizer.transform(x) # Transforms the text data into numerical feature vectors

`TfidfVectorizer` is a scikit-learn class that converts text into numerical feature vectors. It does this by computing the Term Frequency-Inverse Document Frequency (TF-IDF) weight of each word in each document.

Term Frequency (TF): This measures how often a word appears in a document.

Inverse Document Frequency (IDF): This measures how rare a word is across all documents. Words that appear in many documents get a lower IDF score, while words that appear in few documents get a higher one.


By multiplying the TF and IDF scores, TfidfVectorizer assigns a numerical weight to each word that reflects its relevance within a document and across the entire dataset. This allows machine learning models, which typically work with numerical data, to process and analyze text.

Feature vectors are numerical representations of data that are essential for machine learning models. In this context, they are used to convert text into a format suitable for analysis.
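The TF and IDF definitions above can be worked through by hand on a tiny corpus. The sketch below uses plain Python; the three documents are hypothetical, and the smoothed IDF formula and L2 normalization are assumptions chosen to resemble scikit-learn's documented defaults rather than a reproduction of its implementation.

```python
import math
from collections import Counter

# Hypothetical three-document corpus.
docs = [
    "trump rally news",
    "trump election news",
    "kerry paris visit",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term: str) -> float:
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Term frequency is the raw count of each term in the document.
    tf = Counter(tokens)
    weights = {t: tf[t] * idf(t) for t in tf}
    # L2-normalize so the vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "rally" appears in only one document of the corpus, so it receives a higher weight than "trump" and "news", which each appear in two.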

In [22]:
print(x)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2172758 stored elements and shape (6335, 67926)>
  Coords	Values
  (0, 1629)	0.01620067894685332
  (0, 2319)	0.02091617172019117
  (0, 2412)	0.012362968166478416
  (0, 2501)	0.024083334170135
  (0, 2651)	0.016015097660191803
  (0, 2655)	0.04332786565935788
  (0, 2764)	0.05119995295824521
  (0, 2994)	0.022388927309660397
  (0, 3051)	0.020332760645694843
  (0, 3232)	0.059975065138012595
  (0, 3252)	0.007110813516507457
  (0, 3274)	0.03356572352573969
  (0, 3283)	0.016211225437576967
  (0, 3292)	0.015386662303415337
  (0, 3296)	0.03629743475813936
  (0, 3348)	0.011742047788665014
  (0, 3394)	0.014483350138037688
  (0, 3768)	0.020198388339288345
  (0, 3780)	0.032391440169600466
  (0, 3792)	0.01929395208818379
  (0, 3831)	0.030721089197861658
  (0, 3853)	0.015232262979818117
  (0, 3919)	0.02182152114431929
  (0, 3924)	0.013725073522225474
  (0, 4085)	0.018099493273326585
  :	:
  (6334, 65088)	0.020695591830443796
  (6334, 65097)	

In [23]:
#converting the textual data to numerical dta
vectorizer = TfidfVectorizer() # Initializes the TfidfVectorizer
vectorizer.fit(y) # Fits the vectorizer to the data

y = vectorizer.transform(y) # Transforms the text data into numerical feature vectors

In [24]:
print(y)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6335 stored elements and shape (6335, 2)>
  Coords	Values
  (0, 0)	1.0
  (1, 0)	1.0
  (2, 1)	1.0
  (3, 0)	1.0
  (4, 1)	1.0
  (5, 0)	1.0
  (6, 0)	1.0
  (7, 1)	1.0
  (8, 1)	1.0
  (9, 1)	1.0
  (10, 1)	1.0
  (11, 1)	1.0
  (12, 0)	1.0
  (13, 0)	1.0
  (14, 1)	1.0
  (15, 1)	1.0
  (16, 0)	1.0
  (17, 0)	1.0
  (18, 1)	1.0
  (19, 1)	1.0
  (20, 1)	1.0
  (21, 0)	1.0
  (22, 1)	1.0
  (23, 1)	1.0
  (24, 0)	1.0
  :	:
  (6310, 1)	1.0
  (6311, 1)	1.0
  (6312, 0)	1.0
  (6313, 0)	1.0
  (6314, 0)	1.0
  (6315, 1)	1.0
  (6316, 1)	1.0
  (6317, 0)	1.0
  (6318, 0)	1.0
  (6319, 1)	1.0
  (6320, 0)	1.0
  (6321, 0)	1.0
  (6322, 0)	1.0
  (6323, 1)	1.0
  (6324, 1)	1.0
  (6325, 0)	1.0
  (6326, 0)	1.0
  (6327, 1)	1.0
  (6328, 0)	1.0
  (6329, 0)	1.0
  (6330, 1)	1.0
  (6331, 0)	1.0
  (6332, 0)	1.0
  (6333, 1)	1.0
  (6334, 1)	1.0


# Splitting the dataset to training and test data

In [37]:
x_train, x_test, y_train, y_test = train_test_split(x, news_dataset['label'], test_size = 0.2, stratify=news_dataset['label'], random_state=2)
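The `stratify` argument in the split above keeps the class proportions the same in the training and test sets. A minimal sketch of that effect, using hypothetical imbalanced labels rather than the news dataset:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 70 REAL, 30 FAKE.
labels = ["REAL"] * 70 + ["FAKE"] * 30
features = list(range(len(labels)))

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# Both splits keep the 70/30 ratio of the full label list.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```

Without `stratify`, a random split of a small or imbalanced dataset can leave one class under-represented in the test set.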

# Training the Model: Logistic Regression

In [27]:
model = LogisticRegression() # Initializes the Logistic Regression model

In [29]:
model.fit(x_train, y_train) # Trains the model using the training data

# Evaluation

Accuracy score

In [30]:
# accuracy score on the training data
x_train_prediction = model.predict(x_train) # Makes predictions on the training data
training_data_accuracy = accuracy_score(y_train, x_train_prediction) # accuracy_score expects (y_true, y_pred)

In [31]:
print('Accuracy score of the training data : ', training_data_accuracy) # Prints the training data accuracy

Accuracy score of the training data :  0.9516574585635359


In [32]:
# accuracy score on the test data
x_test_prediction = model.predict(x_test) # Makes predictions on the test data
test_data_accuracy = accuracy_score(y_test, x_test_prediction) # accuracy_score expects (y_true, y_pred)

In [33]:
print('Accuracy score of the testing data : ', test_data_accuracy) # Prints the testing data accuracy

Accuracy score of the testing data :  0.9100236779794791
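Accuracy alone does not show which kind of mistake the model makes. As a sketch of a complementary check, the example below computes a confusion matrix next to the accuracy. The true and predicted labels here are hypothetical, not outputs of the notebook's model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for six articles.
y_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
y_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]

# Both functions take the true labels first, then the predictions.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])

print(acc)
print(cm)
```

With the `labels` order used here, the top-right cell counts fake articles predicted as real, and the bottom-left cell counts real articles predicted as fake.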


# Make a Predictive System.

In [34]:
# Selecting the first data point from the test set
x_new = x_test[0]

# Making a prediction using the trained model
prediction = model.predict(x_new)
print(prediction)

# Interpreting the prediction
if prediction[0] == 'REAL':
    print("The news is Real")
else:
    print("The news is Fake")

['REAL']
The news is Real


In [39]:
# Printing the actual label for the first test data point
print(y_test.iloc[0])

REAL


In [40]:
# Printing the prediction and its data type
print(prediction[0], type(prediction[0]))

REAL <class 'str'>


In [41]:
# Selecting the fourth data point from the test set
x_new = x_test[3]

# Making a prediction using the trained model
prediction = model.predict(x_new)
print(prediction)

# Interpreting the prediction
if prediction[0] == 'REAL':
    print("The news is Real")
else:
    print("The news is Fake")

['FAKE']
The news is Fake


In [42]:
# Printing the actual label for the fourth test data point
print(y_test.iloc[3])

FAKE


In [43]:
# Selecting the ninth data point from the test set
x_new = x_test[8]

# Making a prediction using the trained model
prediction = model.predict(x_new)
print(prediction)

# Interpreting the prediction
if prediction[0] == 'REAL':
    print("The news is Real")
else:
    print("The news is Fake")

['REAL']
The news is Real


In [44]:
# Printing the actual label for the ninth test data point
print(y_test.iloc[8])

REAL


In [None]:
# Selecting the twelfth data point from the test set
x_new = x_test[11]

# Making a prediction using the trained model
prediction = model.predict(x_new)
print(prediction)

# Interpreting the prediction
if prediction[0] == 'REAL':
    print("The news is Real")
else:
    print("The news is Fake")

In [None]:
# Printing the actual label for the twelfth data point
print(y_test.iloc[11])

In [None]:
# Selecting the sixteenth data point from the test set
x_new = x_test[15]

# Making a prediction using the trained model
prediction = model.predict(x_new)
print(prediction)

# Interpreting the prediction
if prediction[0] == 'REAL':
    print("The news is Real")
else:
    print("The news is Fake")

In [None]:
# Printing the actual label for the sixteenth data point
print(y_test.iloc[15])

In [None]:
# Selecting the twentieth data point from the test set
x_new = x_test[19]

# Making a prediction using the trained model
prediction = model.predict(x_new)
print(prediction)

# Interpreting the prediction
if prediction[0] == 'REAL':
    print("The news is Real")
else:
    print("The news is Fake")

In [None]:
# Printing the actual label for the twentieth data point
print(y_test.iloc[19])
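The cells above repeat the same predict-and-interpret logic for different test indices. One way to avoid the repetition is a small helper called in a loop, sketched below. The toy corpus, the names `clf` and `features`, and the helper `describe_prediction` are all hypothetical stand-ins for the notebook's `model` and `x_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical corpus, only to make the sketch runnable.
texts = [
    "shocking secret they do not want you to know",
    "you will not believe this shocking miracle cure",
    "senate passes budget bill after lengthy debate",
    "secretary of state visits paris for talks",
]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

features = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression().fit(features, labels)

def describe_prediction(model, row):
    """Return the predicted label and a readable message for one sample."""
    predicted = model.predict(row)[0]
    message = "The news is Real" if predicted == "REAL" else "The news is Fake"
    return predicted, message

results = []
for i in (0, 2):
    # Slicing keeps the row two-dimensional, as predict() expects.
    predicted, message = describe_prediction(clf, features[i:i + 1])
    results.append((i, predicted, labels[i], message))
    print(i, predicted, labels[i], message)
```

Each printed line shows the index, the predicted label, the actual label, and the message, which makes mismatches easy to spot.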

In [45]:
import joblib
joblib.dump(model, 'model.pkl')


['model.pkl']

In [47]:
from google.colab import files
files.download('model.pkl')



In [48]:
joblib.dump(vectorizer, 'vectorizer.pkl')
files.download('vectorizer.pkl')


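The notebook saves the model and the vectorizer as two separate files. An alternative, sketched below under stated assumptions, is to bundle both steps into a scikit-learn `Pipeline` so that a single file can classify raw text after loading. The toy corpus, the file name `news_pipeline.pkl`, and the sample articles are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, only to make the sketch runnable.
texts = [
    "shocking secret they do not want you to know",
    "you will not believe this shocking miracle cure",
    "senate passes budget bill after lengthy debate",
    "secretary of state visits paris for talks",
]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

# The pipeline vectorizes raw text and then classifies it.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "news_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

new_articles = ["shocking miracle secret", "senate debate on budget"]
restored_predictions = list(restored.predict(new_articles))
print(restored_predictions)
```

Because the vectorizer travels with the model, the loaded object always transforms new text with the same vocabulary it was trained on.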