### 1. Introduction
Fake news has become a widespread issue in digital communication, especially on social media platforms. This project aims to build a machine learning model to distinguish between real and fake news articles using Natural Language Processing (NLP) techniques.

In [11]:
import pandas as pd
import numpy as np

#### 2. Dataset Description
We use two datasets:
true.csv: Contains real news articles.
fake.csv: Contains fake or misleading articles.

Each dataset includes the following columns:

title: Headline of the news article.
text: Full content of the news.
subject: Topic/category of the article.
date: Date of publication.

In [15]:
true = pd.read_csv("true.csv")
fake = pd.read_csv("fake.csv")

In [16]:
true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [19]:
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [21]:
true['label']=1

In [23]:
fake['label']=0

In [25]:
true.head()

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [27]:
fake.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


### 3. Data Preprocessing
Combine both datasets and assign labels: 
1 for real news and 0 for fake news.
Drop irrelevant columns such as date and subject.
Handle missing values, if any.
Perform basic text preprocessing (e.g., lowercasing, removing punctuation, stopwords, etc.) to clean the text for vectorization.

In [29]:
news = pd.concat([fake, true],axis=0)

In [31]:
news.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [33]:
news.tail()

Unnamed: 0,title,text,subject,date,label
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",1
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",1
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",1
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",1
21416,Indonesia to buy $1.14 billion worth of Russia...,JAKARTA (Reuters) - Indonesia will buy 11 Sukh...,worldnews,"August 22, 2017",1


In [35]:
news.isnull().sum()

title      0
text       0
subject    0
date       0
label      0
dtype: int64

In [37]:
news = news.drop(['title','subject','date'],axis=1)

In [39]:
news.head()

Unnamed: 0,text,label
0,Donald Trump just couldn t wish all Americans ...,0
1,House Intelligence Committee Chairman Devin Nu...,0
2,"On Friday, it was revealed that former Milwauk...",0
3,"On Christmas day, Donald Trump announced that ...",0
4,Pope Francis used his annual Christmas Day mes...,0


In [41]:
news = news.sample(frac=1)# reshuffling

In [43]:
news.head()

Unnamed: 0,text,label
20659,DUBAI (Reuters) - Bahrain condemned as inaccur...,1
20038,"Hang-on, hang-on, hang-on, hang-on Brooke Bal...",0
12513,The Clinton camp has been able to project a n...,0
4535,BRUSSELS (Reuters) - U.S. Secretary of State R...,1
14825,It only took what 14 years for this to happen?...,0


In [45]:
news.reset_index(inplace=True)

In [47]:
news.head()

Unnamed: 0,index,text,label
0,20659,DUBAI (Reuters) - Bahrain condemned as inaccur...,1
1,20038,"Hang-on, hang-on, hang-on, hang-on Brooke Bal...",0
2,12513,The Clinton camp has been able to project a n...,0
3,4535,BRUSSELS (Reuters) - U.S. Secretary of State R...,1
4,14825,It only took what 14 years for this to happen?...,0


In [49]:
news.drop(['index'],axis=1,inplace=True)

In [51]:
news.head()

Unnamed: 0,text,label
0,DUBAI (Reuters) - Bahrain condemned as inaccur...,1
1,"Hang-on, hang-on, hang-on, hang-on Brooke Bal...",0
2,The Clinton camp has been able to project a n...,0
3,BRUSSELS (Reuters) - U.S. Secretary of State R...,1
4,It only took what 14 years for this to happen?...,0


In [53]:
import re

In [55]:
def wordopt(text):
    #converting into lowercase
    text = text.lower()
    #remove URLs
    text = re.sub(r'https?://\S+|www\.\S+',' ',text)
    #remove html tags
    text = re.sub(r'[^\w\s]','',text)
    #remove digits
    tet = re.sub(r'\d','',text)
    #remove newline characters
    text = re.sub(r'\n','',text)
    return text

In [57]:
news['text'] = news['text'].apply(wordopt)

In [63]:
news['text']

0        dubai reuters  bahrain condemned as inaccurate...
1         hangon hangon hangon hangon brooke baldwin de...
2         the clinton camp has been able to project a n...
3        brussels reuters  us secretary of state rex ti...
4        it only took what 14 years for this to happen ...
                               ...                        
44893    reuters  the us defense security cooperation a...
44894    san francisco washington reuters  a trump admi...
44895    bogota reuters  colombian president juan manue...
44896    london reuters  britain has an incredibly stro...
44897    with donald trump winning the election albeit ...
Name: text, Length: 44898, dtype: object

In [65]:
x = news['text']
y = news['label']

In [67]:
x

0        dubai reuters  bahrain condemned as inaccurate...
1         hangon hangon hangon hangon brooke baldwin de...
2         the clinton camp has been able to project a n...
3        brussels reuters  us secretary of state rex ti...
4        it only took what 14 years for this to happen ...
                               ...                        
44893    reuters  the us defense security cooperation a...
44894    san francisco washington reuters  a trump admi...
44895    bogota reuters  colombian president juan manue...
44896    london reuters  britain has an incredibly stro...
44897    with donald trump winning the election albeit ...
Name: text, Length: 44898, dtype: object

In [69]:
from sklearn.model_selection import train_test_split

In [71]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3)

In [73]:
x_train.shape

(31428,)

In [75]:
x_test.shape

(13470,)

##### We use TfidfVectorizer to convert text data into numerical features. This helps the machine learning model understand the importance of each word in the context of the news article

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [79]:
vectorization = TfidfVectorizer()

##### We split the dataset into training and testing sets using train_test_split.

In [81]:
xv_train = vectorization.fit_transform(x_train)

In [83]:
xv_test = vectorization.transform(x_test)

In [85]:
xv_test

<13470x187708 sparse matrix of type '<class 'numpy.float64'>'
	with 2759977 stored elements in Compressed Sparse Row format>

##### we train a Logistic Regression model, which is well-suited for binary classification tasks like fake vs. real news.

In [87]:
#creating ml model
from sklearn.linear_model import LogisticRegression 

In [89]:
LR = LogisticRegression()

In [91]:
LR.fit(xv_train, y_train)

In [93]:
pred_lr = LR.predict(xv_test)

####  6. Model Evaluation
We evaluate our model using the following metrics:

Classification Report (Precision, Recall, F1-score)

Accuracy Score

Confusion Matrix

These metrics help us understand how well our model performs on unseen data.

In [95]:
LR.score(xv_test, y_test)

0.9893838158871566

In [97]:
from sklearn.metrics import classification_report

print(classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      7015
           1       0.99      0.99      0.99      6455

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



In [99]:
#decision tree classifier
from sklearn.tree import DecisionTreeClassifier

In [101]:
DTC = DecisionTreeClassifier()

In [103]:
DTC.fit(xv_train, y_train) 

In [105]:
pred_dtc = DTC.predict(xv_test)

In [107]:
DTC.score(xv_test, y_test)

0.9953229398663697

In [109]:
print(classification_report(y_test, pred_dtc))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7015
           1       1.00      1.00      1.00      6455

    accuracy                           1.00     13470
   macro avg       1.00      1.00      1.00     13470
weighted avg       1.00      1.00      1.00     13470



In [111]:
#random forest classifier
from sklearn.ensemble import RandomForestClassifier

In [113]:
rfc = RandomForestClassifier()

In [115]:
rfc.fit(xv_train, y_train)

In [122]:
predict_rfc = rfc.predict(xv_test)

In [124]:
rfc.score(xv_test, y_test)

0.989532293986637

In [126]:
print(classification_report(y_test, predict_rfc))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      7015
           1       0.99      0.99      0.99      6455

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



In [128]:
from sklearn.ensemble import GradientBoostingClassifier #GradientBoostingClassifier

In [130]:
gbc = GradientBoostingClassifier()

In [132]:
gbc.fit(xv_train, y_train)

In [134]:
predict_gbc = gbc.predict(xv_test)

In [136]:
rfc.score(xv_test, y_test)

0.989532293986637

In [137]:
print(classification_report(y_test, predict_gbc))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      7015
           1       0.99      1.00      0.99      6455

    accuracy                           1.00     13470
   macro avg       1.00      1.00      1.00     13470
weighted avg       1.00      1.00      1.00     13470



In [140]:
def output_label(n):
    if n==0:
        return "It is Fake News"
    elif n==1:
        return "It is Genuine News"

In [142]:
def manual_testing(news):
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)
    
    # Apply preprocessing correctly and overwrite 'text'
    new_def_test["text"] = new_def_test["text"].apply(wordopt)
    
    new_x_test = new_def_test["text"]
    new_xv_test = vectorization.transform(new_x_test)

    pred_lr = LR.predict(new_xv_test)
    pred_gbc = gbc.predict(new_xv_test)
    pred_rfc = rfc.predict(new_xv_test)

    return ("\n\nLR Prediction: {}  \nGBC Prediction: {}  \nRFC Prediction: {}".format(
        output_label(pred_lr[0]), output_label(pred_gbc[0]), output_label(pred_rfc[0])))
    

In [144]:
news_article = str(input())

 House will likely need to vote again on tax bill: Republican leader


In [146]:
print(manual_testing(""))



LR Prediction: It is Fake News  
GBC Prediction: It is Fake News  
RFC Prediction: It is Fake News
