<h1 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">Fake News Detection
</font>
</h1>

In this project, we classify news articles as real or fake. We work with two datasets: one holds the text of each article, already converted into numerical vectors, and the other holds general information about the same articles.

In [187]:
import numpy as np
import pandas as pd 
import scipy.sparse as ss

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, f1_score

In [188]:
train_data = pd.read_csv('../data/news_train.csv')
train_data_text_vectors = ss.load_npz('../data/news_train_text_vectors.npz')

test_data = pd.read_csv('../data/news_test.csv')
test_data_text_vectors = ss.load_npz('../data/news_test_text_vectors.npz')

In [189]:
# Convert the sparse text matrices to dense DataFrames
train_data_text_vectors = train_data_text_vectors.toarray()
train_data_text_vectors = pd.DataFrame(train_data_text_vectors)

test_data_text_vectors = test_data_text_vectors.toarray()
test_data_text_vectors = pd.DataFrame(test_data_text_vectors)
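
Densifying a large sparse matrix can use a lot of memory. As a possible alternative (not what this notebook does), `TruncatedSVD` from scikit-learn reduces dimensionality directly on sparse input. The sketch below uses a small synthetic sparse matrix as a stand-in for the real text vectors; the sizes and the number of components are assumptions chosen for illustration.

```python
import numpy as np
import scipy.sparse as ss
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the text vectors (assumption: the real data is a
# sparse document-term matrix like the one loaded with ss.load_npz).
rng = np.random.default_rng(0)
dense = rng.random((200, 1000))
dense[dense < 0.99] = 0.0          # keep roughly 1% of the entries
sparse_vectors = ss.csr_matrix(dense)

# TruncatedSVD accepts sparse input, so no .toarray() call is needed.
svd = TruncatedSVD(n_components=50, random_state=0)
reduced = svd.fit_transform(sparse_vectors)

print(reduced.shape)  # (200, 50)
```

Unlike PCA, `TruncatedSVD` does not center the data, so the two methods are related but not interchangeable.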

<h2 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Data Preparation and Feature Engineering
</font>
</h2>



In [192]:
from sklearn.preprocessing import OneHotEncoder

# Encode the target: Fake -> 0, Real -> 1
y = train_data.label
mapping = {'Fake': 0, 'Real': 1}
y = y.map(mapping)

# Drop the target and the 'published' column from the features
train_data.drop(['label', 'published'], axis=1, inplace=True)
test_data.drop('published', axis=1, inplace=True)

# One-hot encode every remaining column
ohe = OneHotEncoder(handle_unknown='ignore')
train_data = pd.DataFrame(ohe.fit_transform(train_data).toarray())
test_data = pd.DataFrame(ohe.transform(test_data).toarray())
y

0       0
1       0
2       0
3       0
4       1
       ..
1495    1
1496    0
1497    0
1498    1
1499    0
Name: label, Length: 1500, dtype: int64
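
The encoder above is created with `handle_unknown='ignore'`. The small sketch below illustrates what that option does when the test set contains a category that was never seen during fitting. The column name `site` and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column and values, for illustration only.
train = pd.DataFrame({"site": ["a.com", "b.com", "a.com"]})
test = pd.DataFrame({"site": ["b.com", "c.com"]})  # "c.com" is unseen

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# The unseen category is encoded as a row of zeros instead of raising an error.
encoded_test = ohe.transform(test).toarray()
print(encoded_test)
# [[0. 1.]
#  [0. 0.]]
```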

In [193]:
# Reduce the text vectors to 500 principal components.
# Fit PCA on the training vectors only, then apply it to the test vectors.
dr = PCA(n_components=500)
reduced_train_text_vectors = dr.fit_transform(train_data_text_vectors)
reduced_test_text_vectors = dr.transform(test_data_text_vectors)
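
The number of components is fixed at 500 here. One way to sanity-check such a choice is to look at the cumulative explained variance. The sketch below runs on synthetic data (an assumption, not the real text vectors) and lets PCA pick the number of components needed to retain 95% of the variance; the 95% threshold is also just an illustrative value.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data standing in for the text vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 60))

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative[-1])
```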

In [194]:
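# Train an SVM on the reduced text vectors and add its predictions as a
# new 'text' feature. Note that gamma has no effect with the linear kernel.
model = SVC(C=1,gamma=0.001,kernel='linear')
model.fit(reduced_train_text_vectors, y)
# These are in-sample predictions: the model predicts rows it was fitted on
y_pred = model.predict(reduced_train_text_vectors)
train_data['text'] = y_pred

y_pred2 = model.predict(reduced_test_text_vectors)
test_data['text'] = y_pred2

# scikit-learn expects all column names to share one type, so cast to str
train_data.columns = train_data.columns.astype(str)
test_data.columns = test_data.columns.astype(str)

train_data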

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,477,478,479,480,481,482,483,484,485,text
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
1496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1
1497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
1498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0
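
The `text` column above holds predictions that the text model made on the same rows it was trained on, which may make that feature look more reliable than it would be on unseen data. A common alternative, sketched here under the assumption of synthetic data and not used in this notebook, is to build the feature from out-of-fold predictions with `cross_val_predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for the reduced text vectors and the labels.
X_text, y = make_classification(n_samples=300, n_features=40, random_state=0)

text_model = SVC(C=1, kernel="linear")

# Each row is predicted by a model that did not see it during fitting.
oof_pred = cross_val_predict(text_model, X_text, y, cv=5)

# For unseen test rows, fit once on all training rows and predict as usual.
text_model.fit(X_text, y)

print(oof_pred.shape)  # (300,)
```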


<h2 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Learning Model
</font>
</h2>


In [None]:
# Hold out 33% of the training data for validation
X_train, X_test, y_train, y_test = train_test_split(train_data, y, test_size=0.33, random_state=42)
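
When the classes are imbalanced, passing `stratify=y` keeps the class ratio roughly the same in both splits. This is an optional variation, shown on toy data with a 70/30 class ratio that is assumed purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 70 of class 0 and 30 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# The share of class 1 is close to 0.30 in both splits.
print(y_tr.mean(), y_te.mean())
```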

In [196]:
# Find the best hyperparameters for the SVC with a grid search.

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [1, 10, 100, 1000], 
              'gamma': [0.001, 0.0001],
              'kernel': ['rbf', 'linear']} 

clf = GridSearchCV(SVC(), param_grid, verbose=3)
clf.fit(X_train, y_train)

print(clf.best_estimator_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV 1/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.692 total time=   0.0s
[CV 2/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.697 total time=   0.0s
[CV 3/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.687 total time=   0.0s
[CV 4/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.711 total time=   0.0s
[CV 5/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.682 total time=   0.0s
[CV 1/5] END ...C=1, gamma=0.001, kernel=linear;, score=1.000 total time=   0.0s
[CV 2/5] END ...C=1, gamma=0.001, kernel=linear;, score=1.000 total time=   0.0s
[CV 3/5] END ...C=1, gamma=0.001, kernel=linear;, score=1.000 total time=   0.0s
[CV 4/5] END ...C=1, gamma=0.001, kernel=linear;, score=1.000 total time=   0.0s
[CV 5/5] END ...C=1, gamma=0.001, kernel=linear;, score=1.000 total time=   0.0s
[CV 1/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.662 total time=   0.1s
[CV 2/5] END .....C=1, gamma=0.0001, kernel=rbf;
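
In the log above, the linear kernel is fitted once per `gamma` value even though `gamma` only affects the `rbf`, `poly`, and `sigmoid` kernels. A possible refinement, sketched on synthetic data, is to pass a list of grids so that `gamma` is searched only for `rbf`, and to score candidates with F1 so the search matches the evaluation metric. The grid values here are shortened for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_train and y_train.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# One grid per kernel: gamma is only searched where it has an effect.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.001, 0.0001]},
]

search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 6 candidates: 2 linear + 4 rbf
print(search.best_params_)
```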

In [197]:
# Fit the final SVC (linear kernel, C=1) on the training split
model = SVC(C=1,gamma=0.001,kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_train)

<h2 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">Evaluation
</font>
</h2>



We evaluate the model with the F1 score (`f1_score`), the harmonic mean of precision and recall.
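
As a quick illustration of the metric, the sketch below computes F1 by hand from precision and recall on a few made-up labels and checks that it agrees with `f1_score`. The label lists are invented for the example.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75

# F1 is the harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_f1, f1_score(y_true, y_pred))  # 0.75 0.75
```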

In [198]:
# F1 score on the training split
from sklearn.metrics import classification_report, f1_score

score = f1_score(y_true=y_train,y_pred=y_pred)
score

1.0

In [199]:
# F1 score on the held-out split
from sklearn.metrics import f1_score

y_pred = model.predict(X_test)
score = f1_score(y_true=y_test,y_pred=y_pred)
score

1.0
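
A single F1 value hides which kinds of errors the model makes. `classification_report` (imported earlier but not used) and `confusion_matrix` give a per-class view. The sketch below uses invented labels, not the notebook's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels for illustration (0 = Fake, 1 = Real).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]

# Per-class precision, recall, and F1.
report = classification_report(y_true, y_pred, target_names=["Fake", "Real"])
print(report)
```

In the notebook itself, the same calls could be applied to `y_test` and the predictions on `X_test`.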

<h2 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Prediction for Test Data
</font>
</h2>


In [200]:
# Predict labels for the test set and map them back to their names
y_pred = model.predict(test_data)
mapping = {0: 'Fake', 1: 'Real'}
submission = pd.DataFrame({'label': pd.Series(y_pred).map(mapping)})
submission

Unnamed: 0,label
0,Real
1,Fake
2,Real
3,Real
4,Real
...,...
341,Fake
342,Real
343,Fake
344,Fake


In [None]:
submission.to_csv('submission.csv', index=False)