
About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: marks whether the news article is real or fake:
           1: fake news
           0: real news

Import the dependencies

In [23]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
print(stopwords.words('english'))

Data Pre-processing

In [31]:
# loading the dataset to a pandas dataframe
news_dataset = pd.read_csv('/content/train.csv')
# !ls /content

In [33]:
news_dataset.shape

(20800, 5)

In [None]:
# print the first 5 rows of the dataframe
news_dataset.head()

In [None]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

In [38]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [39]:
# merging the author and news title for better detection
news_dataset['content'] = news_dataset['author']+' '+ news_dataset['title']
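The order of the two steps above matters: string concatenation with a missing value yields a missing value, so the nulls are filled first. A minimal sketch with two made-up rows (the `toy` frame and its contents are illustrative and not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration); the second row has no author.
toy = pd.DataFrame({
    "author": ["Jane Doe", np.nan],
    "title": ["Budget vote delayed", "Storm closes schools"],
})

# Without filling, a missing author makes the whole merged value missing.
merged_raw = toy["author"] + " " + toy["title"]

# After fillna(''), the title survives on its own (with a leading space).
filled = toy.fillna("")
merged_clean = filled["author"] + " " + filled["title"]

print(merged_raw.isna().tolist())
print(merged_clean.tolist())
```

The leading space left in the second merged value is harmless here, because the later cleaning step splits on whitespace.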

In [None]:
print(news_dataset['content'])

In [72]:
# separate the features from the label column
# (for drop, axis=0 refers to rows and axis=1 to columns; columns= already targets columns)
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [None]:
print(X)
print(Y)

Stemming:
Reducing a word to its root form by stripping suffixes
Ex: acting, acted, acts --> act
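A quick check of what the Porter stemmer returns for a few inflected forms may help here. The word list is an arbitrary illustration, and the sketch assumes only that `nltk` is installed (the stemmer itself needs no downloaded corpus):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Arbitrary example words, chosen for illustration.
words = ["acting", "acted", "acts", "studies"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

The stemmer is rule-based, so its output is not always a dictionary word: "studies" becomes "studi". That is fine for this task, since the stems only need to be consistent across articles.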

In [43]:
port_stem = PorterStemmer()

In [46]:
# clean a string: keep letters only, lowercase, drop stopwords, stem the remaining words
stop_words = set(stopwords.words('english'))  # build once; set membership is fast

def stemming(content):
  stemmed_content = re.sub(r'[^a-zA-Z]', ' ', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content

In [47]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

In [82]:
# separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [None]:
print(X)

In [None]:
print(Y)

In [84]:
# converting the text to numerical TF-IDF feature vectors
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
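To make the shape of the resulting matrix concrete, here is a small sketch on three made-up, already-stemmed headlines. The names `docs`, `toy_vectorizer`, and `toy_matrix` are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up, already-stemmed headlines (illustrative only).
docs = [
    "senat vote budget",
    "storm close school",
    "senat delay budget vote",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)

# One row per document, one column per distinct token.
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

The result is a sparse matrix, which is why printing `X` in the notebook shows (row, column) index pairs with weights and not a dense table.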

In [None]:
print(X)

In [87]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
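The `stratify=Y` argument keeps the class proportions the same in both splits. A minimal sketch on made-up, deliberately imbalanced labels (80 zeros and 20 ones; the arrays are hypothetical and unrelated to the news data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 zeros and 20 ones (made up).
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# Share of class 1 in each split.
print(l_train.mean(), l_test.mean())
```

Both splits keep the 20% share of class 1: 16 of 80 training labels and 4 of 20 test labels.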

Training the Model: Logistic Regression

In [88]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)
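The cells above fit the vectorizer on the full dataset before splitting, so the test articles contribute to the vocabulary and IDF weights. One possible alternative, sketched here on made-up data, is to split the raw strings first and wrap both steps in a scikit-learn pipeline. The texts, labels, and the name `clf` are assumptions for illustration, not part of the original notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up, already-stemmed headlines with made-up labels (0 = real, 1 = fake).
texts = [
    "senat vote budget bill",
    "alien secret control govern",
    "court rule tax case",
    "miracl cure doctor hate",
    "citi council approv road plan",
    "shock truth moon land hoax",
    "school board set new calendar",
    "celebr clone secret lab",
]
toy_labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw strings first, then fit everything on the training part only.
t_train, t_test, y_train, y_test = train_test_split(
    texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=2
)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(t_train, y_train)  # vocabulary and IDF weights come from t_train only
preds = clf.predict(t_test)
print(preds)
```

With only eight samples the predictions themselves mean little; the point is the order of operations.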

Evaluation

Accuracy Score

In [91]:
# accuracy score on the training data (true labels first, predictions second)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [92]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9865985576923076


In [93]:
# accuracy score on the test data (true labels first, predictions second)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [94]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9790865384615385
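Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand and with `accuracy_score` on eight made-up labels, and adds a confusion matrix to show where the errors fall. The arrays `y_true` and `y_pred` are hypothetical:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (6 of 8 agree).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

manual = (y_true == y_pred).mean()        # matches / total
library = accuracy_score(y_true, y_pred)  # same value from scikit-learn

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

print(manual, library)
print(matrix)
```

Here both values are 0.75, and the confusion matrix shows one real article flagged as fake and one fake article missed.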


Making a predictive system

In [97]:
# take one sample from the test set; indexing a sparse matrix keeps it 2-D for predict
X_new = X_test[8]

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
  print('The news is Real')
else:
  print('The news is Fake')

[1]
The news is Fake


In [98]:
print(Y_test[8])

1
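The same check can be wrapped in a small helper that accepts a preprocessed string, which is closer to how a predictive system would be called. This is a sketch under stated assumptions: `describe_prediction` and `LABEL_NAMES` are hypothetical names, and the tiny training set is made up so the block runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Label meaning follows the dataset description: 0 = real, 1 = fake.
LABEL_NAMES = {0: "Real", 1: "Fake"}

def describe_prediction(model, vectorizer, text):
    """Vectorize one preprocessed string and return the predicted label name."""
    features = vectorizer.transform([text])  # a list keeps the input 2-D
    label = int(model.predict(features)[0])
    return LABEL_NAMES[label]

# Made-up, already-stemmed training strings so the example is self-contained.
toy_texts = [
    "senat vote budget bill",
    "alien secret control govern",
    "court rule tax case",
    "shock truth moon land hoax",
]
toy_labels = [0, 1, 0, 1]

toy_vec = TfidfVectorizer().fit(toy_texts)
toy_model = LogisticRegression().fit(toy_vec.transform(toy_texts), toy_labels)

result = describe_prediction(toy_model, toy_vec, "senat vote budget")
print("The news is", result)
```

In the notebook, the fitted `model` and `vectorizer` would be passed in, and the text would first go through the `stemming` function so that it matches the training input.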
