
### **MODEL 1 - LOGISTIC REGRESSION**

Logistic regression is a statistical technique widely used for binary classification. It estimates the probability that a given input belongs to a particular category. Despite its name, logistic regression is a classification method: it is a linear model that passes a weighted sum of the input features through the logistic (sigmoid) function, which maps that sum to a probability between 0 and 1. This probabilistic output is easy to interpret, for example as the likelihood that an email is spam.

Despite its simplicity, logistic regression remains a useful tool because it is easy to interpret and serves as a reliable baseline in many machine learning tasks. It is applied to problems such as predicting customer churn, detecting fraud, medical diagnosis, and risk management. Because the model is linear, however, it can struggle to capture complex, non-linear relationships in the data.
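As a minimal sketch of the mapping described above, the snippet below passes a linear score through the sigmoid and thresholds the result at 0.5. The weights, bias, and feature values are made-up numbers for illustration, not values learned in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Linear score w.x + b passed through the sigmoid."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights and bias for a two-feature model
weights, bias = [1.5, -2.0], 0.25

# score = 1.5 * 1.0 + (-2.0) * 0.5 + 0.25 = 0.75
p = predict_proba(weights, bias, [1.0, 0.5])
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

A score of 0 corresponds to a probability of exactly 0.5, so the sign of the linear score decides the predicted class.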

## Importing Necessary Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import string
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

## Loading the data

In [None]:
data_fake=pd.read_csv('Fake.csv')
data_true=pd.read_csv('True.csv')

### Data Preview

In [None]:
data_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [None]:
data_true.tail()

Unnamed: 0,title,text,subject,date
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017"
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017"
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017"
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017"
21416,Indonesia to buy $1.14 billion worth of Russia...,JAKARTA (Reuters) - Indonesia will buy 11 Sukh...,worldnews,"August 22, 2017"


In [None]:
data_fake["class"]=0
data_true['class']=1

In [None]:
data_fake.shape, data_true.shape

((23481, 5), (21417, 5))

In [None]:
# Hold out the last 10 rows of each dataset for manual testing,
# then drop those rows from the data used for training
data_fake_manual_testing = data_fake.tail(10)
data_fake.drop(data_fake.index[-10:], axis = 0, inplace = True)

data_true_manual_testing = data_true.tail(10)
data_true.drop(data_true.index[-10:], axis = 0, inplace = True)



In [None]:
data_fake.shape, data_true.shape

((23471, 5), (21407, 5))

In [None]:
data_fake_manual_testing['class']=0
data_true_manual_testing['class']=1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_fake_manual_testing['class']=0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_true_manual_testing['class']=1


In [None]:
data_fake_manual_testing.head(10)

Unnamed: 0,title,text,subject,date,class
23471,Seven Iranians freed in the prisoner swap have...,"21st Century Wire says This week, the historic...",Middle-east,"January 20, 2016",0
23472,#Hashtag Hell & The Fake Left,By Dady Chery and Gilbert MercierAll writers ...,Middle-east,"January 19, 2016",0
23473,Astroturfing: Journalist Reveals Brainwashing ...,Vic Bishop Waking TimesOur reality is carefull...,Middle-east,"January 19, 2016",0
23474,The New American Century: An Era of Fraud,Paul Craig RobertsIn the last years of the 20t...,Middle-east,"January 19, 2016",0
23475,Hillary Clinton: ‘Israel First’ (and no peace ...,Robert Fantina CounterpunchAlthough the United...,Middle-east,"January 18, 2016",0
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0
23480,10 U.S. Navy Sailors Held by Iranian Military ...,21st Century Wire says As 21WIRE predicted in ...,Middle-east,"January 12, 2016",0


In [None]:
data_true_manual_testing.head(10)

Unnamed: 0,title,text,subject,date,class
21407,"Mata Pires, owner of embattled Brazil builder ...","SAO PAULO (Reuters) - Cesar Mata Pires, the ow...",worldnews,"August 22, 2017",1
21408,"U.S., North Korea clash at U.N. forum over nuc...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21409,"U.S., North Korea clash at U.N. arms forum on ...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21410,Headless torso could belong to submarine journ...,COPENHAGEN (Reuters) - Danish police said on T...,worldnews,"August 22, 2017",1
21411,North Korea shipments to Syria chemical arms a...,UNITED NATIONS (Reuters) - Two North Korean sh...,worldnews,"August 21, 2017",1
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",1
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",1
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",1
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",1
21416,Indonesia to buy $1.14 billion worth of Russia...,JAKARTA (Reuters) - Indonesia will buy 11 Sukh...,worldnews,"August 22, 2017",1


In [None]:
data_merge=pd.concat([data_fake, data_true], axis = 0)
data_merge.head(10)

Unnamed: 0,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017",0
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017",0
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017",0
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017",0
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017",0


#### The "title", "subject" and "date" columns are not required for detecting fake news, so I am going to drop them.

In [None]:
data_merge.columns

Index(['title', 'text', 'subject', 'date', 'class'], dtype='object')

In [None]:
data=data_merge.drop(['title','subject','date'], axis = 1)

In [None]:
#count of missing values
data.isnull().sum()

text     0
class    0
dtype: int64

#### Randomly shuffling the dataframe

In [None]:
data = data.sample(frac = 1)

In [None]:
data.head()

Unnamed: 0,text,class
11968,BRASILIA (Reuters) - Brazil s Finance Minister...,1
17396,SEOUL (Reuters) - South Korean police are seek...,1
17928,BUCHAREST (Reuters) - Romanian Prime Minister ...,1
10219,WASHINGTON (Reuters) - U.S. Republican preside...,1
7355,"When you try and fail at everything you do, tr...",0


In [None]:
# Reset the index after shuffling and discard the old one
data.reset_index(drop = True, inplace = True)

In [None]:
data.columns

Index(['text', 'class'], dtype='object')

In [None]:
data.head()

Unnamed: 0,text,class
0,BRASILIA (Reuters) - Brazil s Finance Minister...,1
1,SEOUL (Reuters) - South Korean police are seek...,1
2,BUCHAREST (Reuters) - Romanian Prime Minister ...,1
3,WASHINGTON (Reuters) - U.S. Republican preside...,1
4,"When you try and fail at everything you do, tr...",0


***Data Preprocessing***

Data preprocessing converts raw data into a clean, usable format by addressing missing values, outliers, and inconsistencies. It can also involve normalizing or standardizing data to ensure consistent scales, and feature extraction and selection to improve dataset quality. This step improves the efficiency and accuracy of data analysis and machine learning models.

Typical steps in text preprocessing are:

1. Lowercase the text.
2. Expand contractions.
3. Remove links, if any.
4. Remove punctuation and digits.
5. Tokenize the words.
6. Remove stop words.
7. Lemmatize the text.

Of these, the cleaning function in this notebook performs the lowercasing and the removal of links, punctuation, and digits. Tokenization is handled later by the TF-IDF vectorizer; contraction expansion, stop-word removal, and lemmatization are not applied.
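The listed steps that go beyond regex cleaning could be sketched in plain Python as follows. The `CONTRACTIONS` and `STOP_WORDS` tables are tiny, hypothetical stand-ins for the much larger lists an NLP library would supply, and lemmatization is left out because it needs such a library.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "of", "and"}

def preprocess(text):
    text = text.lower()                                  # 1. lowercase
    for short, full in CONTRACTIONS.items():             # 2. expand contractions
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # 3. remove links
    text = re.sub(r"[^a-z\s]", " ", text)                # 4. remove punctuation and digits
    tokens = text.split()                                # 5. tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]    # 6. remove stop words

print(preprocess("It's 2017: the Senate can't vote, see https://example.com/a1"))
```

Links are removed before punctuation on purpose: once `:` and `/` have been stripped, a URL no longer matches the link pattern.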

## Preprocessing Text

#### Creating a function to convert the text to lowercase and remove bracketed text, URLs, HTML tags, punctuation, and words containing digits.

In [None]:
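def wordopt(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                 # drop text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # drop URLs before punctuation is stripped
    text = re.sub(r'<.*?>+', '', text)                  # drop HTML tags (replacement must be str, not bytes)
    text = re.sub(r'\W', ' ', text)                     # replace remaining non-word characters with spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removes '_', which \W keeps
    text = re.sub(r'\w*\d\w*', '', text)                # drop tokens that contain digits
    return text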

In [None]:
data['text'] = data['text'].apply(wordopt)

#### Defining the independent and dependent variables as x and y

In [None]:
x = data['text']
y = data['class']

## Training the model

#### Splitting the dataset into training set and testing set.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.25)

### Extracting Features from the Text

#### Convert text to vectors

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
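To make the vectorization step less of a black box, here is a rough pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. The whitespace tokenizer and the two example documents are simplifying assumptions; the real vectorizer uses its own token pattern and returns a sparse matrix.

```python
import math
from collections import Counter

def tfidf(docs):
    tokenized = [d.split() for d in docs]            # simplified tokenizer (assumption)
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0   # L2 norm, guarding empty docs
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf(["trump said fake", "reuters said report"])
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

The word "said" occurs in both documents, so it receives a lower weight than "fake" or "trump", which occur in only one. This is also why the vectorizer is fitted on the training set only and merely applied to the test set: the document frequencies must come from training data.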

## Logistic Regression (Model Building)

Building a logistic regression model for binary classification generally involves preparing the data (handling missing values, encoding categorical variables, and scaling features), splitting it into training and testing sets, training the model on the training set, and evaluating it on the test set with metrics such as accuracy, precision, and recall. Hyperparameter tuning can then be used to optimize the model. Here the text has already been cleaned, split, and converted to TF-IDF vectors, so the model is trained with scikit-learn's default settings.
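For intuition about what "training" means here, the sketch below fits a one-feature logistic regression by plain gradient descent on the log loss. This is an illustrative assumption, not what scikit-learn does internally: `LogisticRegression` uses a more sophisticated solver and adds regularization by default. The toy data, learning rate, and epoch count are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Gradient descent on the average log loss for a single feature."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the log loss: (predicted probability - true label) * input
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: negative inputs belong to class 0, positive inputs to class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = fit_logistic(xs, ys)
preds = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(preds)
```

With TF-IDF input the same idea applies, except that there is one weight per vocabulary term instead of a single `w`.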



In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
LR = LogisticRegression()
LR.fit(xv_train, y_train)

In [None]:
pred_lr = LR.predict(xv_test)

In [None]:
LR.score(xv_test, y_test)

0.9841354723707665

In [None]:
print (classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

           0       0.99      0.98      0.98      5837
           1       0.98      0.99      0.98      5383

    accuracy                           0.98     11220
   macro avg       0.98      0.98      0.98     11220
weighted avg       0.98      0.98      0.98     11220



# ***Detailed Performance Metrics***

The logistic regression model achieved an overall accuracy of 0.9841, or approximately 98.41%, on the test set. This indicates that the model correctly classified 98.41% of the instances in the test data.

***Class 0 Performance***

Precision: 0.99,
Recall: 0.98,
F1-score: 0.98,
Support: 5837

***Class 1 Performance***

Precision: 0.98,
Recall: 0.99,
F1-score: 0.98,
Support: 5383

***Overall Performance***

Accuracy: 0.98,
Macro Average Precision: 0.98,
Macro Average Recall: 0.98,
Macro Average F1-score: 0.98,
Weighted Average Precision: 0.98,
Weighted Average Recall: 0.98,
Weighted Average F1-score: 0.98

The model's overall accuracy of about 98% indicates strong performance. The macro average is the unweighted mean of the per-class metrics, while the weighted average weights each class by its support.
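The figures in the classification report can be reproduced by hand from the true and predicted labels. The sketch below uses a small made-up label set (not this notebook's predictions) to show how precision, recall, F1, support, and the two kinds of average are derived.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn   # support = number of true instances

# Hypothetical labels: four articles of class 0, two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

stats = {c: per_class_metrics(y_true, y_pred, c) for c in (0, 1)}
total = sum(s[3] for s in stats.values())

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_precision = sum(s[0] for s in stats.values()) / len(stats)       # unweighted mean
weighted_precision = sum(s[0] * s[3] for s in stats.values()) / total  # weighted by support

print(stats, accuracy, macro_precision, weighted_precision)
```

When the classes are nearly balanced, as they are in this dataset, the macro and weighted averages end up almost identical.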

## Decision Tree Classifier

The Decision Tree Classifier is a straightforward yet effective machine learning model that partitions data into subsets based on the features that best separate the target variable classes. It operates by asking a series of yes/no questions about the data features at each node, ultimately forming a tree-like structure of decisions. Each path from the root to a leaf node represents a classification rule, making the model easy to interpret and visualize. Decision trees can handle both numerical and categorical data and are robust to outliers. However, they may overfit noisy data if not pruned or regularized properly. Despite this, decision tree classifiers are widely used in various applications due to their simplicity, interpretability, and ability to capture non-linear relationships in data effectively.
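A single node's yes/no question can be illustrated with a short sketch that scores candidate thresholds on one feature by weighted Gini impurity. The feature values are hypothetical TF-IDF weights of one term, and testing thresholds at the observed values is a simplification; scikit-learn's tree implementation differs in detail.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Find the question 'value <= thr?' that gives the purest two groups."""
    best = (None, float("inf"))
    for thr in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= thr]
        right = [y for v, y in zip(values, labels) if v > thr]
        if not left or not right:
            continue   # a split must send rows to both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (thr, score)
    return best

# Hypothetical TF-IDF weights of one term across six articles
values = [0.0, 0.0, 0.05, 0.30, 0.40, 0.55]
labels = [0, 0, 0, 1, 1, 1]

thr, impurity = best_threshold(values, labels)
print(thr, impurity)
```

A full tree repeats this search over every feature at every node, then recurses into the two resulting groups until they are pure or a stopping rule applies.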

In [None]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)

In [None]:
pred_dt = DT.predict(xv_test)

In [None]:
DT.score(xv_test, y_test)

0.996524064171123

In [None]:
# Report on the decision tree's predictions (pred_dt), not the logistic regression's
print (classification_report(y_test, pred_dt))



## Gradient Boosting Classifier

The Gradient Boosting Classifier is a powerful machine learning method that sequentially builds a series of models, typically decision trees, to correct errors of its predecessors. This iterative process focuses on improving prediction accuracy step by step. It is known for its effectiveness in both classification and regression tasks, handling various types of data well. Gradient boosting excels in scenarios where high predictive performance is crucial, but it can be sensitive to noisy data and requires careful parameter tuning for optimal results. Overall, it is valued for its ability to deliver accurate predictions and is widely used in data science applications.
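The "correct the errors of its predecessors" idea can be sketched with one-split regression stumps fitted to residuals under squared loss. This is a simplified stand-in: scikit-learn's `GradientBoostingClassifier` fits regression trees to the gradient of the log loss rather than to raw residuals, and the toy data and settings below are made up.

```python
def fit_stump(xs, residuals):
    """One-split regressor: predicts the mean residual on each side of a threshold."""
    best = None
    for thr in sorted(set(xs))[:-1]:          # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                  # start from the mean target
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # Each new stump is trained on what the ensemble still gets wrong
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(model(2), 4), round(model(5), 4))
```

The learning rate shrinks each stump's contribution, so the ensemble approaches the targets gradually over many rounds instead of in one step.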

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

GB = GradientBoostingClassifier(random_state = 0)
GB.fit(xv_train, y_train)

In [None]:
pred_gb = GB.predict(xv_test)

In [None]:
GB.score(xv_test, y_test)

0.9954545454545455

In [None]:
print(classification_report(y_test, pred_gb))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      5837
           1       0.99      1.00      1.00      5383

    accuracy                           1.00     11220
   macro avg       1.00      1.00      1.00     11220
weighted avg       1.00      1.00      1.00     11220



## Random Forest Classifier

Random Forest Classifiers are popular in machine learning because they combine multiple decision trees to give more accurate predictions. By averaging the results of many trees, they reduce overfitting and handle different types of data well. They're robust against outliers and noisy data and can show which features are most important for making predictions. Random forests work efficiently for both classification and regression tasks, making them widely used in various applications.
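The two ingredients mentioned above, training each tree on a different random sample and combining the trees by vote, can be sketched as follows. The one-split "trees", the single random feature per tree, and the toy data are simplifying assumptions; a real random forest grows deep trees and samples a subset of features at every split.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for thr in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= thr]
        right = [y for x, y in rows if x[feature] > thr]
        if not right:
            continue
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, thr, l_lab, r_lab)
    if best is None:   # the sample had a single distinct value for this feature
        label = majority([y for _, y in rows])
        return lambda x: label
    _, thr, l_lab, r_lab = best
    return lambda x: l_lab if x[feature] <= thr else r_lab

def fit_forest(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap: draw with replacement
        feature = rng.randrange(n_features)         # random feature for this tree
        trees.append(fit_stump(sample, feature))
    # Final prediction is the majority vote over all trees
    return lambda x: majority([tree(x) for tree in trees])

# Hypothetical two-feature rows: (features, label)
rows = [((0.10, 0.20), 0), ((0.20, 0.10), 0), ((0.15, 0.30), 0),
        ((0.80, 0.90), 1), ((0.90, 0.70), 1), ((0.70, 0.80), 1)]

forest = fit_forest(rows)
print(forest((0.05, 0.05)), forest((0.85, 0.85)))
```

Because each tree sees a slightly different sample, their individual mistakes tend to differ, and the vote averages them out.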

In [None]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(random_state = 0)
RF.fit(xv_train, y_train)

In [None]:
pred_rf = RF.predict(xv_test)

In [None]:
RF.score(xv_test, y_test)

0.989572192513369

In [None]:
print (classification_report(y_test, pred_rf))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5837
           1       0.99      0.99      0.99      5383

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220



## Testing the Model

In [None]:
def output_label(n):
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Not Fake News"

def manual_testing(news):
    # Wrap the raw string in a DataFrame so it gets the same cleaning as the training data
    testing_news = {"text":[news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test['text'] = new_def_test["text"].apply(wordopt)
    new_x_test = new_def_test["text"]
    # Reuse the already fitted vectorizer; it must not be refitted on new text
    new_xv_test = vectorization.transform(new_x_test)
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_GB = GB.predict(new_xv_test)
    pred_RF = RF.predict(new_xv_test)

    print("\n\nLR Prediction: {} \nDT Prediction: {} \nGBC Prediction: {} \nRFC Prediction: {}".format(
        output_label(pred_LR[0]),
        output_label(pred_DT[0]),
        output_label(pred_GB[0]),
        output_label(pred_RF[0])))

### Model Testing With Manual Entry
Model testing with manual entry means typing or pasting in a news article and checking what each of the four trained models predicts for it, to confirm that the models behave as expected on text they have not seen.

In [None]:
news = str(input())
manual_testing(news)

In [None]:
news=str(input())
manual_testing(news)