Dataset Description
train.csv: A full training dataset with the following attributes:

id: unique id for a news article
title: the title of a news article
author: author of the news article
text: the text of the article; could be incomplete
label: a label that marks the article as potentially unreliable
1: unreliable
0: reliable

Importing the dependencies

In [None]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score



In [None]:
# printing the English stopwords (the corpus must be downloaded once before first use)
import nltk
nltk.download('stopwords')
print(stopwords.words('english'))

Data Pre-Processing

In [None]:
news_dataset = pd.read_csv('train.csv')

In [None]:
news_dataset.shape

In [None]:
news_dataset.describe()

In [None]:
print(news_dataset.head())

In [None]:
# counting the number of missing values in each column of the dataset
news_dataset.isnull().sum()

In [None]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [None]:
# Merging the author name and news title
news_dataset['content'] = news_dataset['author'] + ' ' + news_dataset['title']
print(news_dataset['content'].head(10))

In [None]:
# Separating the data and label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [None]:
print(X)
print(Y)

Stemming

Stemming is the process of reducing words to their stem, base, or root form (for example, "books" becomes "book" and "looked" becomes "look"). The two main algorithms are the Porter stemming algorithm, which removes common morphological and inflexional endings from words, and the Lancaster stemming algorithm, which is more aggressive.
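To make the difference between the two stemmers concrete, the following minimal sketch runs both on a few sample words. The word list is hypothetical and chosen only for illustration; `PorterStemmer` and `LancasterStemmer` are the NLTK classes, and neither needs a corpus download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Hypothetical sample words, chosen only for illustration.
words = ["books", "looked", "running", "happiness"]

# Build (word, Porter stem, Lancaster stem) triples for a side-by-side view.
comparison = [(w, porter.stem(w), lancaster.stem(w)) for w in words]

for word, p_stem, l_stem in comparison:
    print(f"{word:>10}  porter={p_stem:<10} lancaster={l_stem}")
```

The notebook uses the Porter stemmer. Lancaster tends to produce shorter stems, which may merge unrelated words, so switching stemmers is something to evaluate rather than assume.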

In [None]:
port_stem = PorterStemmer()

In [None]:
# build the stop-word set once; set lookups are much faster than scanning a list on every call
stop_words = set(stopwords.words('english'))

def stemming(content: str) -> str:
    # keep letters only, lowercase the text, and split it into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # drop stop words and stem the remaining words with the Porter stemmer
    stemmed_words = [port_stem.stem(word) for word in words if word not in stop_words]
    return ' '.join(stemmed_words)

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming) 

In [None]:
print(news_dataset['content'])

In [None]:
# Separating the data and labels
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [None]:
# converting the textual data to numerical features using the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)
print(X)
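To show what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the scikit-learn defaults: raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The whitespace tokenization and the three sample documents are simplifications for illustration; `TfidfVectorizer` uses its own tokenizer and returns a sparse matrix.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized weight row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed idf, as in scikit-learn's default (smooth_idf=True).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Hypothetical, already-stemmed documents.
docs = ["trump win elect", "elect result fake", "weather report sunni"]
vocab, rows = tfidf(docs)

for tok, weight in zip(vocab, rows[0]):
    if weight:
        print(f"{tok}: {weight:.3f}")
```

In the first document, "elect" gets a lower weight than "trump" or "win" because it also appears in the second document. Terms shared across many articles contribute less to the feature vector.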

Splitting into training and test data

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=2, stratify=Y)
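The `stratify=Y` argument keeps the proportion of each label the same in the training and test sets. The sketch below checks this on hypothetical imbalanced labels (80 reliable, 20 unreliable); the placeholder features carry no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 reliable (0) and 20 unreliable (1).
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

_, _, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=2, stratify=labels
)

# The mean of a 0/1 label array is the share of label 1.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print("share of label 1 in train:", train_ratio)
print("share of label 1 in test: ", test_ratio)
```

Without `stratify`, a random split may leave the test set with a different class balance, which can make the test accuracy harder to interpret.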

Training the model: Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train,Y_train)

Evaluation: Accuracy score

In [None]:
# accuracy score on training data; accuracy_score expects (y_true, y_pred)
X_train_pred = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_pred)

In [None]:
print('Accuracy score of training data: ', training_data_accuracy)

In [None]:
# accuracy score on test data; accuracy_score expects (y_true, y_pred)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score of test data: ', test_data_accuracy)
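The accuracy score is the fraction of predictions that match the true labels. The following minimal sketch computes it by hand; the helper `accuracy` and the sample labels are hypothetical and only illustrate the definition.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels: 4 of the 5 predictions are correct.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

score = accuracy(y_true, y_pred)
print(score)
```

Accuracy is symmetric in its two arguments, so swapping them gives the same number. Other scikit-learn metrics such as precision and recall are not symmetric, so the `(y_true, y_pred)` order is a useful habit.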

Making a predictive system

In [None]:
X_new = X_test[0]
prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print("The news is Real")
else:
    print("The news is Fake")

In [None]:
print(Y_test[0])
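The cell above classifies a row that is already vectorized. To classify a new, raw article, the same preprocessing and the same fitted vectorizer must be applied first. The sketch below shows one way to do that. The helper names `clean` and `predict_article`, the toy corpus, and its labels are all hypothetical, and `clean` is a simplified stand-in for the notebook's `stemming` function so that the example runs without the NLTK stopword corpus.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def clean(text):
    """Simplified stand-in for stemming(): keep letters only and lowercase."""
    return " ".join(re.sub("[^a-zA-Z]", " ", text).lower().split())


def predict_article(author, title, vectorizer, model):
    """Return 0 (reliable) or 1 (unreliable) for a raw author/title pair."""
    content = clean(f"{author} {title}")
    # Use transform, not fit_transform: the vocabulary must stay the one seen in training.
    features = vectorizer.transform([content])
    return int(model.predict(features)[0])


# Hypothetical toy corpus, only to make the sketch runnable.
train_texts = [
    "Jane Doe Senate passes annual budget bill",
    "Anonymous Shocking secret cure doctors hide",
    "John Smith City council approves new park",
    "Anonymous Aliens secretly control the election",
]
train_labels = [0, 1, 0, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform([clean(t) for t in train_texts]), train_labels)

label = predict_article("Jane Doe", "Senate passes budget bill", toy_vectorizer, toy_model)
print("The news is Fake" if label == 1 else "The news is Real")
```

In the notebook itself, the fitted `vectorizer` and `model` from the earlier cells would take the place of the toy objects, and `stemming` would replace `clean`.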