#Fake News Detector


Notebook made by: __Hayder CHAKROUN__

E-mail: __hayderchakroun5@gmail.com__



<h2>Scope</h2>
This application labels news as fake or real, using a logistic regression model trained on the Kaggle Fake News competition training set.

#Preprocessing

In [84]:
import nltk
nltk.download('stopwords')
import pandas as pd
import numpy as np
import re
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import FunctionTransformer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [60]:
X_train_data=pd.read_csv('/content/drive/MyDrive/Projects/Machine Learning projects/Fake News Detector - Logisitc Regression/train.csv')
X_test_data=pd.read_csv('/content/drive/MyDrive/Projects/Machine Learning projects/Fake News Detector - Logisitc Regression/test.csv')

In [7]:
print(X_train_data.shape)

(20800, 5)


In [8]:
print(X_test_data.shape)

(5200, 4)


Okay so our data here is mostly textual.

In [9]:
X_train_data.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


Let's check whether we have columns with missing values.

In [11]:
X_train_data.isnull().sum()

Unnamed: 0,0
id,0
title,558
author,1957
text,39
label,0


As you can see, we have missing values. In particular, the author column has about 9% null values (1,957 of 20,800 rows). That's a lot; however, given the high volume of data, and for the sake of simplicity and speed, we're just going to replace all null values with an empty string, using a SimpleImputer that I'll set up later.
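
As a minimal sketch of that imputation step, the snippet below fills the gaps in a made-up two-row frame (the `toy` data is hypothetical, not taken from the competition set). Note that `SimpleImputer` returns a NumPy array by default, even when it is given a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing author and one missing title
toy = pd.DataFrame({
    "title": ["Some headline", np.nan],
    "author": [np.nan, "Jane Doe"],
})

# Replace every missing value with an empty string
imputer = SimpleImputer(strategy="constant", fill_value="")
filled = imputer.fit_transform(toy)

print(type(filled))  # a NumPy array, not a DataFrame
print(filled)
```

This is why any step placed after the imputer has to be able to work with NumPy arrays.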

We will use only the title of each news item and its author to predict whether it is fake or not. We will not use the text for now, unless this notebook is updated, as doing so would require a lot of processing time.

So we have textual features (title and author). Machine learning models don't work well with raw text; they work with numbers.

For that reason, we will convert our features into numerical vectors using TF-IDF vectorization, which gives high weight to words that are frequent in a given document but rare across the whole dataset, and low weight to words that are frequent everywhere.
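
To make the weighting concrete, here is a small sketch on three made-up headlines (the `docs` list is hypothetical). In the first headline, "wins" appears in no other document while "trump" appears in two, so "wins" should receive the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, for illustration only
docs = [
    "trump wins election",
    "trump loses election",
    "aliens land in new york",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)  # sparse matrix: one row per document
vocab = vec.vocabulary_           # maps each word to its column index

row0 = matrix[0].toarray().ravel()
print("wins :", row0[vocab["wins"]])
print("trump:", row0[vocab["trump"]])
```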

To fine-tune this text conversion even more, we're going to preprocess the data before transforming it into vectors. We will remove stopwords, which are words of generally low significance, like 'the', 'a' or 'or', and we will also stem words, meaning we will strip their suffixes to reduce them to a common root (acting, acted -> act).
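
A quick sketch of both ideas on one sentence. The stopword set here is a deliberately tiny, hypothetical stand-in so the snippet runs without downloading anything; the notebook itself uses NLTK's full English list.

```python
import re
from nltk.stem.porter import PorterStemmer

# Hypothetical mini stopword list, for illustration only
toy_stopwords = {"the", "a", "or", "are"}

stemmer = PorterStemmer()
sentence = "The 3 cats are running!"

# Keep letters only, lowercase, and split into words
words = re.sub("[^a-zA-Z]", " ", sentence).lower().split()
# Drop stopwords, then stem what is left
kept = [w for w in words if w not in toy_stopwords]
stems = [stemmer.stem(w) for w in kept]

print(stems)  # ['cat', 'run']
```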

We're going to do all of that inside a pipeline, so that once we deploy our model, the input data is carried through every step smoothly.

Let's begin by implementing the stemming function:

In [61]:

# Build the stemmer and the stopword set once, instead of on every word
port_stem = PorterStemmer()
english_stopwords = set(stopwords.words('english'))

# Define the stemming function
def stemming(content):
    # Replace every non-letter character with a space, then lowercase and split into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # Drop stopwords and stem the remaining words
    return ' '.join(port_stem.stem(word) for word in words if word not in english_stopwords)


This function takes the input sentence, replaces every character that is not a letter with a space, lowercases the result, and splits it into words. Stopwords are dropped, the remaining words are stemmed, and the stemmed words are joined back into our new filtered sentence.

However, since we want to implement a pipeline, we have to prepare every preprocessing step beforehand.

Here is a custom transformer class whose transform method takes the input data and keeps only the author and title, since those are the only features we will be using.

In [78]:
# Custom transformer to concatenate 'title' and 'author'
class TitleAuthorConcatenator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # No fitting necessary

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            # Handle DataFrame input
            return X['title'] + ' ' + X['author']
        elif isinstance(X, np.ndarray) and X.ndim == 2 and X.shape[1] == 4:
            # Handle NumPy array input assuming columns: id, title, author, text
            return X[:, 1] + ' ' + X[:, 2]  # Concatenate title and author (columns 1 and 2)
        else:
            raise ValueError("Input must be a pandas DataFrame or a NumPy array with 4 columns.")


In this class, calling fit just returns the transformer itself, since there is nothing to learn. Calling transform takes the data, whether it is a DataFrame or a four-column NumPy array, and returns a single string per row: the title and the author joined by a space. It does this the same way for the training data and for new raw data.
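
The two branches can be checked side by side on made-up rows (the `rows` data is hypothetical). The array is built with `dtype=object` on purpose: with object arrays, `+` concatenates the Python strings element by element.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the column order id, title, author, text
rows = [
    [1, "Title A", "Author A", "body"],
    [2, "Title B", "", "body"],  # empty author, as left by the imputer
]

as_array = np.array(rows, dtype=object)
as_frame = pd.DataFrame(rows, columns=["id", "title", "author", "text"])

# Same concatenation as in TitleAuthorConcatenator.transform
from_array = as_array[:, 1] + " " + as_array[:, 2]
from_frame = as_frame["title"] + " " + as_frame["author"]

print(list(from_array))
print(list(from_frame))
```

Both print the same two strings; a row with an empty author simply ends with a trailing space, which the later stemming step discards when it splits the text into words.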

In [None]:
# Stemming transformer that handles both DataFrame and NumPy array inputs
def stemming_transformer(X):
    if isinstance(X, pd.Series):
        return X.apply(stemming)  # Use apply for pandas Series
    elif isinstance(X, np.ndarray):
        return np.array([stemming(content) for content in X])  # Use list comprehension for NumPy array
    else:
        raise ValueError("Input must be a pandas Series or a NumPy array.")

Now let's separate the features from the labels, and split the training data into a training set and a held-out test set.

In [77]:
# Define features and labels (title and author are selected later, inside the pipeline)
X = X_train_data[['id', 'title', 'author', 'text']]  # Include the full four columns
Y = X_train_data['label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
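
The `stratify=Y` argument used above keeps the proportion of fake and real labels the same in both splits. A small sketch on synthetic labels (the 60/40 class mix is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 60 samples of class 0 and 40 samples of class 1
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Share of class 1 in each split
print(y_tr.mean(), y_te.mean())
```

Both splits keep the 40% share of class 1, so accuracy on the held-out part is measured on the same class balance the model was trained on.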


In [79]:
# Define the pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='')),  # Fill missing values with an empty string
    ('concatenate', TitleAuthorConcatenator()),  # Custom transformer to concatenate title and author
    ('stemmer', FunctionTransformer(stemming_transformer, validate=False)),  # Apply stemming
    ('tfidf', TfidfVectorizer()),  # Apply TF-IDF vectorization
    ('model', LogisticRegression())  # Logistic Regression model
])

That's our full pipeline. All data coming into it is first imputed by the SimpleImputer, which replaces nulls with an empty string. Then TitleAuthorConcatenator.transform() keeps only the title and author and concatenates them; it handles both the training data and new raw data automatically. Next we apply our stemming function, then TF-IDF vectorization, and the resulting vectors are fed to our model.
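
To see how data flows through chained steps, here is a stripped-down sketch with the same shape: a function step, TF-IDF, then logistic regression. The `lowercase` helper and the four headlines are hypothetical stand-ins, not the notebook's real steps or data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def lowercase(texts):
    # Hypothetical stand-in for the stemming step
    return np.array([t.lower() for t in texts])

toy_pipeline = Pipeline([
    ("lowercase", FunctionTransformer(lowercase, validate=False)),
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),
])

# Hypothetical headlines and labels (1 = fake, 0 = real)
texts = np.array([
    "Aliens invade city", "Aliens abduct cow",
    "Council approves budget", "Council repairs road",
], dtype=object)
labels = [1, 1, 0, 0]

# fit runs every step in order; predict reuses the fitted steps
toy_pipeline.fit(texts, labels)
preds = toy_pipeline.predict(texts)
print(list(toy_pipeline.named_steps), preds)
```

Because the preprocessing lives inside the pipeline, `predict` can be given raw text and applies exactly the transformations that were fitted on the training data.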

OK, now let's fit our model on the training data.

#Model Training

In [83]:

# Fit the pipeline
pipeline.fit(X_train, y_train)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [86]:
# accuracy score on the training data
X_train_prediction = pipeline.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction)

In [87]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9868990384615385


Very high training accuracy! On its own that could mean overfitting, so let's check the held-out test data.

In [89]:
# accuracy score on the test data
X_test_prediction = pipeline.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction)

In [90]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9802884615384615


We have excellent generalization accuracy: the test score is within one percentage point of the training score.
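
For reference, accuracy is just the share of correct predictions. The sketch below computes it on five made-up labels, then measures the gap between the two scores printed in this notebook.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
toy_accuracy = accuracy_score(y_true, y_pred)  # 4 correct out of 5

# Accuracies printed earlier in this notebook
train_acc = 0.9868990384615385
test_acc = 0.9802884615384615
gap = train_acc - test_acc

print(toy_accuracy)   # 0.8
print(round(gap, 4))  # 0.0066
```

A gap of well under one percentage point between training and test accuracy is what suggests the model is not overfitting badly.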

#Model Deployment

In [102]:
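# One new article with the same four columns as the training features: id, title, author, text
# dtype=object keeps the values as Python objects, so title and author can be joined with +
X_new = np.array([
    [1, "Aliens Land in New York", "Anonymous", "Reports of UFO sightings and alien invasions in New York have emerged, causing widespread panic."],
], dtype=object)

prediction = pipeline.predict(X_new)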
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Real
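
When deploying, it can help to wrap the input formatting in a small function. The `make_input` helper below is hypothetical (it is not part of the notebook); it only builds the one-row, four-column object array that the pipeline expects.

```python
import numpy as np

def make_input(title, author, text="", article_id=0):
    """Hypothetical helper: wrap one article as a (1, 4) array: id, title, author, text."""
    return np.array([[article_id, title, author, text]], dtype=object)

X_one = make_input("Aliens Land in New York", "Anonymous")
print(X_one.shape)  # (1, 4)
```

With the fitted pipeline from above, `pipeline.predict(make_input(title, author))` would then return a one-element array holding 0 (real) or 1 (fake).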


