### About the Dataset

- **id**: Unique ID for a news article
- **title**: Title of a news article
- **author**: Author of the news article
- **text**: The text of the article; could be incomplete
- **label**: Marks whether the news article is real or fake:
  - 1: Fake news
  - 0: Real news

### Importing the Dependencies

In [55]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import nltk
nltk.download('stopwords')

# Displaying English stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Loading and Inspecting the Dataset

In [56]:
news_dataset = pd.read_csv("/content/drive/MyDrive/Projects/Fake News Prediction/train.csv")

print("Dataset Shape:", news_dataset.shape)

news_dataset.head()

Dataset Shape: (20800, 5)


|   | id | title | author | text | label |
|---|----|-------|--------|------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


### Checking and Handling Missing Values

In [57]:
print("Missing values in each column:\n", news_dataset.isnull().sum())

# Replacing the null values with empty strings
news_dataset = news_dataset.fillna("")

print("Missing values after replacement:\n", news_dataset.isnull().sum())

Missing values in each column:
 id           0
title      558
author    1957
text        39
label        0
dtype: int64
Missing values after replacement:
 id        0
title     0
author    0
text      0
label     0
dtype: int64


### Merging Author and Title into a Single Column

In [58]:
news_dataset["content"] = news_dataset["author"] + " " + news_dataset["title"]

print(news_dataset["content"].head())

0    Darrell Lucus House Dem Aide: We Didn’t Even S...
1    Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2    Consortiumnews.com Why the Truth Might Get You...
3    Jessica Purkiss 15 Civilians Killed In Single ...
4    Howard Portnoy Iranian woman jailed for fictio...
Name: content, dtype: object


### Text Preprocessing: Stemming

In [59]:
port_stem = PorterStemmer()
english_stopwords = set(stopwords.words('english'))  # Build the stopword set once instead of reloading the list for every word

def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)  # Replace non-alphabetic characters with spaces
    stemmed_content = stemmed_content.lower()  # Convert to lowercase
    stemmed_content = stemmed_content.split()  # Split into words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]  # Remove stopwords, then stem the remaining words
    stemmed_content = ' '.join(stemmed_content)  # Join words back into a single string
    return stemmed_content

news_dataset['content'] = news_dataset['content'].apply(stemming)

print(news_dataset['content'].head())

0    darrel lucu hous dem aid even see comey letter...
1    daniel j flynn flynn hillari clinton big woman...
2               consortiumnew com truth might get fire
3    jessica purkiss civilian kill singl us airstri...
4    howard portnoy iranian woman jail fiction unpu...
Name: content, dtype: object
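
The cleaning steps that run before the stemmer can be tried on a single headline using only the standard library. This is a minimal sketch: `clean_text` and `toy_stopwords` are hypothetical names, and the four-word stopword set is a small stand-in for the NLTK English list, chosen to show why fragments such as `didn`, `t` and `s` disappear after the apostrophes are replaced by spaces. The Porter stemming step itself is left out.

```python
import re

def clean_text(content, stop_words):
    """Replace non-letters with spaces, lowercase, split, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', content)
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in stop_words]

# Assumption: a tiny stand-in for stopwords.words('english')
toy_stopwords = {"we", "didn", "t", "s"}

headline = "House Dem Aide: We Didn’t Even See Comey’s Letter"
tokens = clean_text(headline, toy_stopwords)
print(tokens)
```

Applying the Porter stemmer to each surviving token would then give forms like the `hous dem aid even see comey letter` prefix shown in the output above.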


### Defining Features and Labels

In [60]:
X = news_dataset['content'].values
y = news_dataset['label'].values

print("Features (X):", X[:5])
print("Labels (y):", y[:5])

Features (X): ['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire'
 'jessica purkiss civilian kill singl us airstrik identifi'
 'howard portnoy iranian woman jail fiction unpublish stori woman stone death adulteri']
Labels (y): [1 0 1 1 1]


### Converting Textual Data to Numerical Data

In [61]:
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(X)

print("Transformed Features Shape:", X.shape)

Transformed Features Shape: (20800, 17128)
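
To make the numbers behind this matrix less opaque, here is a rough pure-Python sketch of what `TfidfVectorizer` computes with its default settings: each term count is multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and every row is then scaled to unit L2 length. The `tfidf` helper and the three toy documents are illustrative assumptions. The sketch splits on whitespace, whereas the default vectorizer tokenizer keeps only tokens of two or more word characters.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, matching the scikit-learn default (smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy documents (assumption), written in the stemmed style used above
docs = ["truth might get fire", "iranian woman jail", "woman stone death"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second document, `woman` appears in two of the three documents and therefore receives a lower weight than `iranian`, which appears in only one.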


### Splitting the Dataset into Training and Testing Sets

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)
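
The `stratify=y` argument asks for a split in which both sets keep roughly the same share of fake and real articles. The idea can be sketched in plain Python as below. This is an illustration of the concept, not scikit-learn's actual implementation, and `stratified_split` and the toy label list are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx), sampling the test share per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumption): 60 fake (1) and 40 real (0) articles
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)
test_fake_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_fake_share)
```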

### Training the Model: Logistic Regression

In [63]:
model = LogisticRegression()

model.fit(X_train, y_train)
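
Because `vectorizer.fit_transform(X)` ran on the full dataset before the split, the vocabulary and IDF weights also reflect the test documents. One common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is fitted on the training portion only. The sketch below uses a handful of made-up headlines (an assumption for illustration, not rows from `train.csv`); with so little data the predictions themselves mean nothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = real, 1 = fake
texts = [
    "senate passes budget bill",
    "court upholds election ruling",
    "city council approves new park",
    "scientists publish climate study",
    "aliens secretly run the senate",
    "miracle pill cures everything overnight",
    "celebrity clone spotted in city",
    "moon landing staged by scientists",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, then fit the vectorizer inside the pipeline
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

# The fitted vocabulary contains training tokens only
fitted_vocab = set(pipeline.named_steps["tfidfvectorizer"].vocabulary_)
print(len(fitted_vocab), preds)
```

A fitted pipeline also accepts new raw strings directly through `pipeline.predict([...])`, which avoids having to call the vectorizer separately at prediction time.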

### Evaluating the Model

In [64]:
y_pred = model.predict(X_test)

test_data_accuracy = accuracy_score(y_test, y_pred)  # Signature is accuracy_score(y_true, y_pred)

print("Accuracy score of the test data:", test_data_accuracy)

Accuracy score of the test data: 0.9790865384615385
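
Accuracy is a single number and does not show which kind of mistake the model makes. A small sketch of the underlying counts, treating label 1 (fake) as the positive class, is given below. The `confusion_counts` helper and the eight toy labels are assumptions for illustration and are not results from this notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

# Toy labels (assumption): 1 = fake, 0 = real
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true_toy, y_pred_toy)
accuracy = (tp + tn) / len(y_true_toy)
precision = tp / (tp + fp)  # Share of predicted-fake articles that are fake
recall = tp / (tp + fn)     # Share of fake articles that were caught
print(tp, fp, tn, fn, accuracy, precision, recall)
```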
