## Fake news

### Goal: Build a model that can accurately detect whether a piece of news is fake or real.

**What is fake news?**  

False information disseminated with the aim of manipulating the public.

**TfidfVectorizer, PassiveAggressiveClassifier and machine learning classification algorithms**  
- TF (Term Frequency): the number of times a word appears in a document.  
- IDF (Inverse Document Frequency): a measure of how informative a term is across the entire corpus; terms that appear in many documents get a lower weight.  
- TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.  
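
The definitions above can be sketched by hand on a toy corpus. This is a minimal illustration using the textbook formula `tf * log(N / df)`; the three sentences are made up, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its values will differ from these.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), tokenized by whitespace.
docs = [
    "the senate passed the bill",
    "the president signed the bill",
    "aliens built the pyramids",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term)             # raw count of the term in this document
    idf = math.log(n_docs / df[term])   # 0 when the term occurs in every document
    return tf * idf

for term in ("the", "bill", "senate"):
    print(term, tf_idf(term, tokenized[0]))
```

A word such as "the", which occurs in every document, gets a weight of zero, while a word found in a single document gets the highest weight.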
 

In [98]:
import numpy as np
import pandas as pd
import itertools

import re
import string

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [165]:
#Read the data
df=pd.read_csv('news.csv')

#Get shape and head
df.shape


(6335, 4)

In [166]:
df.head(10)

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |
| 5 | 6903 | Tehran, USA | \nI’m not an immigrant, but my grandparents ... | FAKE |
| 6 | 7341 | Girl Horrified At What She Watches Boyfriend D... | Share This Baylee Luciani (left), Screenshot o... | FAKE |
| 7 | 95 | ‘Britain’s Schindler’ Dies at 106 | A Czech stockbroker who saved more than 650 Je... | REAL |
| 8 | 4869 | Fact check: Trump and Clinton at the 'commande... | Hillary Clinton and Donald Trump made some ina... | REAL |
| 9 | 2909 | Iran reportedly makes new push for uranium con... | Iranian negotiators reportedly have made a las... | REAL |


In [76]:
#Get the labels
labels=df.label
labels.head()

#Remove unused columns
df = df.drop(["Unnamed: 0","title"],axis=1)
#Shuffle the rows
df = df.sample(frac=1)
df.head(10)

#Count missing values per column
df.isna().sum()

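#Function to remove unnecessary characters
def word_drop(text):
    text = text.lower()
    #Drop text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    #Replace every non-word character with a space
    text = re.sub(r"\W", " ", text)
    #URL and HTML-tag patterns; only word characters and spaces remain after the step above, so these no longer match
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    #Drop remaining punctuation (only "_" at this point)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    #Drop words that contain digits
    text = re.sub(r'\w*\d\w*', '', text)
    return text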

df["text"] = df["text"].apply(word_drop)
df.head(10)

| | text | label |
|---|---|---|
| 27 | after a week of nonstop criticism from democra... | REAL |
| 3287 | the national security agency considered abando... | REAL |
| 3411 | islam not welcome obama just got terrible new... | FAKE |
| 3980 | throughout the republican party from new hamp... | REAL |
| 1228 | in mainstream media politics propaganda l... | FAKE |
| 3910 | comments megyn kelly seems to think that she... | FAKE |
| 5097 | income inequality is back in the news propell... | REAL |
| 5316 | in the time since justice antonin scalia s pas... | REAL |
| 1873 | november fort russ exclusive interview... | FAKE |
| 2282 | since last year americans have grown increasi... | REAL |
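
To see what the cleaning step does, the same function can be run on a single string. The sketch below repeats `word_drop` so that it runs on its own; the sample headline is made up. Because non-word characters are replaced with spaces early on, the URL is not removed as a unit but broken into plain word pieces.

```python
import re
import string

def word_drop(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                 # drop text in square brackets
    text = re.sub(r"\W", " ", text)                     # non-word characters become spaces
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URL pattern
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)                # words containing digits
    return text

# Made-up sample text, not taken from the dataset.
sample = "Breaking [VIDEO]: Visit https://example.com NOW!!! 100% true_story"
cleaned = word_drop(sample)
print(cleaned.split())
# ['breaking', 'visit', 'https', 'example', 'com', 'now', 'truestory']
```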


In [88]:
#Define dependent and independent variable as x and y
x = df["text"]
y = df["label"]

In [114]:
#Split the dataset into training and testing sets
x_train,x_test,y_train,y_test=train_test_split(x, y, test_size=0.2)
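
The split above is drawn at random on every run, so scores and class supports can shift between executions. A minimal sketch, assuming a fixed seed is acceptable: `random_state` pins the shuffle and `stratify` keeps the FAKE/REAL ratio the same in both sets. The seed value 42 and the toy data are arbitrary choices for illustration, not part of the notebook.

```python
from sklearn.model_selection import train_test_split

# Toy data (made up): ten texts with balanced labels.
texts = [f"article {i}" for i in range(10)]
labels = ["FAKE", "REAL"] * 5

# Same seed and same inputs give the same split every time.
split_a = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)
split_b = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)

x_train, x_test, y_train, y_test = split_a
print(x_test, y_test)
print(split_a == split_b)
```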

In [115]:
#Initialize a TfidfVectorizer to convert text to vectors
tfidf_vectorizer=TfidfVectorizer()
#A stop word is a word so common that indexing it or searching for it adds little value
#The max_df argument removes words that appear in more than the given share of documents (0.7 means 70%)
#Neither stop_words nor max_df is set here, so the defaults apply
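
The two options mentioned in the comments can be tried on a small corpus. This is a sketch only: the four sentences are made up, and the values `stop_words="english"` and `max_df=0.7` are illustrative settings rather than the ones used to produce the results in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "the" is an English stop word, "news" appears in every document.
docs = [
    "the news senate vote",
    "the news election result",
    "the news moon landing",
    "the news vote recount",
]

# stop_words drops common English words; max_df=0.7 drops terms found in more than 70% of documents.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
vectorizer.fit(docs)

vocab = set(vectorizer.vocabulary_)
print(sorted(vocab))
```

Both "the" and "news" are absent from the learned vocabulary, while "vote", which appears in half of the documents, is kept.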

In [117]:
#Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)

In [144]:
#Initialize a PassiveAggressiveClassifier
PAC=PassiveAggressiveClassifier(max_iter=50)
PAC.fit(tfidf_train,y_train)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)

In [145]:
#Predict on the test set and calculate accuracy
pred_PAC=PAC.predict(tfidf_test)
score=accuracy_score(y_test,pred_PAC)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 93.84%


In [93]:
#We got an accuracy of 93.84% with this model

In [146]:
#Build confusion matrix
confusion_matrix(y_test,pred_PAC, labels=['FAKE','REAL'])

array([[585,  39],
       [ 39, 604]])

In [157]:
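#Rows are true labels and columns are predicted labels, in the order FAKE, REAL.
#Treating FAKE as the positive class:
#585 true positives (fake news predicted as fake).
#604 true negatives (real news predicted as real).
#39 false positives (real news predicted as fake).
#39 false negatives (fake news predicted as real).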
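
The accuracy reported earlier can be recomputed from the four counts of the confusion matrix, along with precision and recall. The sketch assumes FAKE is treated as the positive class, which is a choice made here for illustration.

```python
# Confusion matrix from the cell above: rows = true label, columns = predicted label,
# both in the order ['FAKE', 'REAL'].
cm = [[585, 39],
      [39, 604]]

tp, fn = cm[0]   # true FAKE: predicted FAKE, predicted REAL
fp, tn = cm[1]   # true REAL: predicted FAKE, predicted REAL

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted FAKE that is really FAKE
recall = tp / (tp + fn)      # share of real FAKE that the model caught

print(f"Accuracy: {round(accuracy * 100, 2)}%")
print(f"Precision (FAKE): {precision:.4f}")
print(f"Recall (FAKE): {recall:.4f}")
```

The accuracy comes out at 93.84%, matching the score printed above, and precision and recall for FAKE are both 0.9375.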

In [137]:
print(classification_report(y_test, pred_PAC))

              precision    recall  f1-score   support

        FAKE       0.94      0.94      0.94       624
        REAL       0.94      0.94      0.94       643

    accuracy                           0.94      1267
   macro avg       0.94      0.94      0.94      1267
weighted avg       0.94      0.94      0.94      1267



#### Let's now use four machine learning algorithms to solve the fake news detection problem

### 1. Logistic Regression

In [121]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(tfidf_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [123]:
pred_lr=LR.predict(tfidf_test)
LR.score(tfidf_test, y_test)

0.9076558800315706

In [124]:
print(classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

        FAKE       0.90      0.92      0.91       624
        REAL       0.92      0.90      0.91       643

    accuracy                           0.91      1267
   macro avg       0.91      0.91      0.91      1267
weighted avg       0.91      0.91      0.91      1267



### 2. Decision Tree Classification

In [125]:
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()
DT.fit(tfidf_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [126]:
pred_dt=DT.predict(tfidf_test)
DT.score(tfidf_test, y_test)

In [107]:
print(classification_report(y_test, pred_dt))

              precision    recall  f1-score   support

        FAKE       0.82      0.81      0.82       642
        REAL       0.81      0.81      0.81       625

    accuracy                           0.81      1267
   macro avg       0.81      0.81      0.81      1267
weighted avg       0.81      0.81      0.81      1267



### 3. Gradient Boosting Classifier

In [127]:
from sklearn.ensemble import GradientBoostingClassifier
GBC = GradientBoostingClassifier(random_state=0)
GBC.fit(tfidf_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=0, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [128]:
pred_gbc = GBC.predict(tfidf_test)
GBC.score(tfidf_test, y_test)

0.89344909234412

In [129]:
print(classification_report(y_test, pred_gbc))

              precision    recall  f1-score   support

        FAKE       0.88      0.91      0.89       624
        REAL       0.91      0.87      0.89       643

    accuracy                           0.89      1267
   macro avg       0.89      0.89      0.89      1267
weighted avg       0.89      0.89      0.89      1267



### 4. Random Forest Classifier

In [130]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=0)
RFC.fit(tfidf_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [131]:
pred_rfc = RFC.predict(tfidf_test)
RFC.score(tfidf_test, y_test)

0.9068666140489345

In [132]:
print(classification_report(y_test, pred_rfc))

              precision    recall  f1-score   support

        FAKE       0.90      0.91      0.91       624
        REAL       0.91      0.90      0.91       643

    accuracy                           0.91      1267
   macro avg       0.91      0.91      0.91      1267
weighted avg       0.91      0.91      0.91      1267



## Model testing with manual entry

In [161]:
def output_label(n):
    if n == 'FAKE':
        return "Fake News"
    elif n == 'REAL':
        return "Not Fake News"
    
def manual_testing(news):
    testing_news = {"text":[news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(word_drop) 
    new_x_test = new_def_test["text"]
    new_tfidf_test = tfidf_vectorizer.transform(new_x_test)
    pred_PAC = PAC.predict(new_tfidf_test)
    pred_LR = LR.predict(new_tfidf_test)
    pred_DT = DT.predict(new_tfidf_test)
    pred_GBC = GBC.predict(new_tfidf_test)
    pred_RFC = RFC.predict(new_tfidf_test)
    
    #Print one prediction per model
    print("\nPAC Prediction: {} \nLR Prediction: {} \nDT Prediction: {} \nGBC Prediction: {} \nRFC Prediction: {}".format(
        output_label(pred_PAC[0]),
        output_label(pred_LR[0]),
        output_label(pred_DT[0]),
        output_label(pred_GBC[0]),
        output_label(pred_RFC[0])))
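
The function above depends on applying the same vectorizer to new text that was fitted on the training set. One possible alternative, sketched here under assumptions, is to bundle the vectorizer and a classifier in a scikit-learn `Pipeline` so that raw text goes in and a label comes out. The training sentences are made up, `predict_label` is a hypothetical helper, and Logistic Regression stands in for any of the models used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (made up) standing in for the cleaned news texts.
train_texts = [
    "aliens secretly control the government",
    "miracle cure doctors do not want you to know",
    "senate passes budget bill after debate",
    "central bank holds interest rates steady",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

# The pipeline fits the vectorizer and the classifier together and reuses both at prediction time.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

def predict_label(news):
    # Hypothetical helper: returns 'FAKE' or 'REAL' for one raw text.
    return model.predict([news])[0]

print(predict_label("senate debate on budget bill"))
```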



In [167]:
#Paste a news text when prompted
news = input()
manual_testing(news)

 Paul Craig RobertsIn the last years of the 20th century fraud entered US foreign policy in a new way.  On false pretenses Washington dismantled Yugoslavia and Serbia in order to advance an undeclared agenda. In the 21st century this fraud multiplied many times. Afghanistan, Iraq, Somalia, and Libya were destroyed, and Iran and Syria would also have been destroyed if the President of Russia had not prevented it.  Washington is also behind the current destruction of Yemen, and Washington has enabled and financed the Israeli destruction of Palestine.  Additionally, Washington operated militarily within Pakistan without declaring war, murdering many women, children, and village elders under the guise of  combating terrorism.  Washington s war crimes rival those of any country in history.I have documented these crimes in my columns and books (Clarity Press). Anyone who still believes in the purity of Washington s foreign policy is a lost soul  Russia and China now have a strategic alliance