# Fake news detection

Dataset: https://drive.google.com/file/d/1er9NJTLUA3qnRuyhfzuN0XUsoIC4a-_q/view

## Import

In [1]:
import itertools

import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

### Read Dataset

In [2]:
dataset = pd.read_csv("../Datasets/news.zip")

### Dataset exploration

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  6335 non-null   int64 
 1   title       6335 non-null   object
 2   text        6335 non-null   object
 3   label       6335 non-null   object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


In [4]:
dataset.describe()

        Unnamed: 0
count       6335.0
mean   5280.415627
std    3038.503953
min            2.0
25%         2674.5
50%         5271.0
75%         7901.0
max        10557.0


In [5]:
# np.object was removed in NumPy 1.24; the builtin object selects the same columns
dataset.describe(include=object)

                                title                                               text label
count                            6335                                               6335  6335
unique                           6256                                               6060     2
top     OnPolitics | 's politics blog  Killing Obama administration rules, dismantlin...  REAL
freq                                5                                                 58  3171


In [6]:
dataset["label"].value_counts()

REAL    3171
FAKE    3164
Name: label, dtype: int64

### Split dataset

In [7]:
x_train, x_test, y_train, y_test = train_test_split(dataset.text, dataset.label, test_size=0.2)

### Data processing

**TF (Term Frequency):** The number of times a word appears in a document is its term frequency. A higher value means the term appears more often in that document, so the document is a better match when the term is one of the search terms.

**IDF (Inverse Document Frequency):** Words that occur many times in a document but also occur in many other documents may be irrelevant. IDF measures how significant a term is across the entire corpus: the fewer documents a term appears in, the higher its IDF.

The **TfidfVectorizer** converts a collection of raw documents into a matrix of TF-IDF features.
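
To make the two definitions concrete, here is a minimal sketch on a toy corpus. The three sentences and the names `toy_docs`, `toy_vectorizer`, `toy_tfidf`, `vocab` and `idf` are made up for illustration and are not part of the news dataset. It uses the default `TfidfVectorizer` settings, which smooth the IDF and L2-normalise each row, so the numbers differ from a textbook TF × IDF product.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the news dataset.
toy_docs = [
    "the senate passed the bill",
    "the senate rejected the bill",
    "aliens endorsed the bill",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x terms

vocab = toy_vectorizer.vocabulary_  # term -> column index
idf = toy_vectorizer.idf_           # one IDF weight per column

# Terms found in every document ("the", "bill") get the lowest IDF;
# a term found in a single document ("aliens") gets the highest.
for term in ["the", "senate", "aliens"]:
    print(term, round(idf[vocab[term]], 3))

print(toy_tfidf.shape)
```

In this sketch "aliens" ends up with a larger weight than "the" in the third document, even though each appears once there, because "the" occurs in every document.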

In [8]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)
tfidf_train.shape

(5068, 61016)

### Training and prediction

**Passive Aggressive algorithms** are online learning algorithms: they update the model one sample at a time. Such an algorithm remains passive when a sample is classified correctly, and turns aggressive on a misclassification, updating and adjusting its weights. Unlike most batch algorithms, it is not trained towards a fixed point of convergence. Its purpose is to make updates that correct the loss on the current sample while changing the norm of the weight vector as little as possible.

https://www.bonaccorso.eu/2017/10/06/ml-algorithms-addendum-passive-aggressive-algorithms/
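
The passive and aggressive behaviour described above can be sketched in a few lines of NumPy. This is an illustrative version of the PA-I update for a binary problem, assuming labels encoded as -1/+1 and a hinge loss with margin 1. The function name `pa_update` and the parameter `C` (the cap on the step size) are chosen here for illustration; this is not the scikit-learn implementation.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One PA-I step for a single sample x with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this sample
    if loss == 0.0:
        return w  # passive: correct with enough margin, leave weights alone
    tau = min(C, loss / np.dot(x, x))  # aggressive: step size capped by C
    return w + tau * y * x

w0 = np.zeros(3)
x = np.array([1.0, 0.0, 2.0])

w1 = pa_update(w0, x, +1)  # sample violates the margin, weights move
w2 = pa_update(w1, x, +1)  # same sample again, margin is now satisfied
print(w1, w2)
```

The step size `tau` is the smallest one that brings the current sample back to the margin (unless `C` caps it), which is what keeps the change to the weight vector small.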

In [47]:
models = []
models.append(('Logistic Regression', LogisticRegression(solver='newton-cg', max_iter=50)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('Decision Tree', DecisionTreeClassifier()))
models.append(('Random Forest', RandomForestClassifier(n_estimators=50)))
models.append(('Gradient Boosting', GradientBoostingClassifier(n_estimators=50)))
models.append(('PassiveAggressiveClassifier', PassiveAggressiveClassifier(max_iter=50)))

# evaluate each model in turn with the same stratified 5-fold split
results = []
names = []
results_mean = []

kfold = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)

for name, model in models:
    # macro-averaged F1 over the five folds of the training set
    cv_results = cross_val_score(model, tfidf_train, y_train, cv=kfold, scoring='f1_macro')
    results.append(cv_results)
    names.append(name)
    results_mean.append(cv_results.mean())
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

Logistic Regression: 0.910534 (0.006992)
KNN: 0.447922 (0.008116)
Decision Tree: 0.806602 (0.003711)
Random Forest: 0.888113 (0.007804)
Gradient Boosting: 0.870042 (0.004860)
PassiveAggressiveClassifier: 0.934883 (0.007501)


In [52]:
pac=PassiveAggressiveClassifier(max_iter=100)
pac.fit(tfidf_train,y_train)

y_pred=pac.predict(tfidf_test)
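# classification_report expects (y_true, y_pred). The report recorded below was
# produced with the arguments swapped, so its precision and recall columns are
# interchanged and its support column counts predictions, not true labels.
print(classification_report(y_test, y_pred))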

              precision    recall  f1-score   support

        FAKE       0.94      0.95      0.95       636
        REAL       0.95      0.94      0.95       631

    accuracy                           0.95      1267
   macro avg       0.95      0.95      0.95      1267
weighted avg       0.95      0.95      0.95      1267

