In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')


In [2]:
! ls dataset

BuzzFeed_fake_news_content.csv	PolitiFact_fake_news_content.csv
BuzzFeedNews.txt		PolitiFactNews.txt
BuzzFeedNewsUser.txt		PolitiFactNewsUser.txt
BuzzFeed_real_news_content.csv	PolitiFact_real_news_content.csv
BuzzFeedUserFeature.mat		PolitiFactUserFeature.mat
BuzzFeedUser.txt		PolitiFactUser.txt
BuzzFeedUserUser.txt		PolitiFactUserUser.txt


* BuzzFeedNews.txt: list of news IDs, indexed by row number. For example, 'PolitiFactReal_1' is in the 1st row, so it corresponds to index 1.
* BuzzFeedNewsUser.txt: news-user relationships in BuzzFeed. For example, '240 1 1' means that news 240 was posted/spread by user 1 once.
* BuzzFeedUser.txt: list of user names, indexed by row number. For example, 'f4b46be21c2f553811cc8a73c4f0ff05' is in the 1st row, so it corresponds to index 1.
* BuzzFeedUserFeature.mat: latent representation of user features from the BuzzFeed dataset, stored as a MATLAB file.
* BuzzFeedUserUser.txt: user-user relationships. For example, '1589 1' means that user 1589 follows user 1.
* BuzzFeed_fake_news_content.csv: metadata for BuzzFeed fake news content, including news source, headline, image, body text, publish date, etc.
* BuzzFeed_real_news_content.csv: metadata for BuzzFeed real news content, including news source, headline, image, body text, publish date, etc.
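
The relationship files use 1-based row numbers, while Python lists start at 0. A minimal sketch of resolving one news-user line, using made-up in-memory data instead of the real files (the second user hash, the news IDs, and the helper `resolve` are hypothetical):

```python
# Hypothetical file contents, kept in memory so the sketch is self-contained.
news_ids = ["BuzzFeed_Real_1", "BuzzFeed_Real_2", "BuzzFeed_Fake_1"]
user_names = ["f4b46be21c2f553811cc8a73c4f0ff05", "0123456789abcdef0123456789abcdef"]

def resolve(line, news_ids, user_names):
    """Map a 'news user count' line with 1-based indices to (news_id, user_name, count)."""
    news_idx, user_idx, count = (int(tok) for tok in line.split())
    # Row numbers start at 1, Python lists at 0, hence the -1.
    return news_ids[news_idx - 1], user_names[user_idx - 1], count

print(resolve("3 1 2", news_ids, user_names))
```

The same shift by one appears later in the notebook, where the row and column indices are decremented before building the sparse matrix.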

In [3]:
path='dataset/'

BFNews=pd.read_csv(path+'BuzzFeedNews.txt', header=None)
BFNewsUser=pd.read_csv(path+'BuzzFeedNewsUser.txt', header=None, sep='\t')
BFreal_news=pd.read_csv(path+'BuzzFeed_real_news_content.csv')
BFUser=pd.read_csv(path+'BuzzFeedUser.txt', header=None)
BFUserUser=pd.read_csv(path+'BuzzFeedUserUser.txt', header=None, sep='\t')
BFfake_news=pd.read_csv(path+'BuzzFeed_fake_news_content.csv')

In [4]:
df=pd.concat([BFreal_news, BFfake_news],axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 0 to 90
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              182 non-null    object
 1   title           182 non-null    object
 2   text            182 non-null    object
 3   url             174 non-null    object
 4   top_img         172 non-null    object
 5   authors         141 non-null    object
 6   source          174 non-null    object
 7   publish_date    133 non-null    object
 8   movies          25 non-null     object
 9   images          172 non-null    object
 10  canonical_link  170 non-null    object
 11  meta_data       182 non-null    object
dtypes: object(12)
memory usage: 18.5+ KB


In [5]:
df['article']=df['title']+df['text']
df['news_type']=df['id'].apply(lambda x: 1 if x.split('_')[0]=='Real' else 0)
df.head()

Unnamed: 0,id,title,text,url,top_img,authors,source,publish_date,movies,images,canonical_link,meta_data,article,news_type
0,Real_1-Webpage,Another Terrorist Attack in NYC…Why Are we STI...,"On Saturday, September 17 at 8:30 pm EST, an e...",http://eaglerising.com/36942/another-terrorist...,http://eaglerising.com/wp-content/uploads/2016...,"View All Posts,Leonora Cravotta",http://eaglerising.com,{'$date': 1474528230000},,http://constitution.com/wp-content/uploads/201...,http://eaglerising.com/36942/another-terrorist...,"{""description"": ""\u201cWe believe at this poin...",Another Terrorist Attack in NYC…Why Are we STI...,1
1,Real_10-Webpage,"Donald Trump: Drugs a 'Very, Very Big Factor' ...",Less than a day after protests over the police...,http://abcn.ws/2d4lNn9,http://a.abcnews.com/images/Politics/AP_donald...,"More Candace,Adam Kelsey,Abc News,More Adam",http://abcn.ws,,,http://www.googleadservices.com/pagead/convers...,http://abcnews.go.com/Politics/donald-trump-dr...,"{""fb_title"": ""Trump: Drugs a 'Very, Very Big F...","Donald Trump: Drugs a 'Very, Very Big Factor' ...",1
2,Real_11-Webpage,"Obama To UN: ‘Giving Up Liberty, Enhances Secu...","Obama To UN: ‘Giving Up Liberty, Enhances Secu...",http://rightwingnews.com/barack-obama/obama-un...,http://rightwingnews.com/wp-content/uploads/20...,Cassy Fiano,http://rightwingnews.com,{'$date': 1474476044000},https://www.youtube.com/embed/ji6pl5Vwrvk,http://rightwingnews.com/wp-content/uploads/20...,http://rightwingnews.com/barack-obama/obama-un...,"{""googlebot"": ""noimageindex"", ""og"": {""site_nam...","Obama To UN: ‘Giving Up Liberty, Enhances Secu...",1
3,Real_12-Webpage,Trump vs. Clinton: A Fundamental Clash over Ho...,Getty Images Wealth Of Nations Trump vs. Clint...,http://politi.co/2de2qs0,http://static.politico.com/e9/11/6144cdc24e319...,"Jack Shafer,Erick Trickey,Zachary Karabell",http://politi.co,{'$date': 1474974420000},,https://static.politico.com/dims4/default/8a1c...,http://www.politico.com/magazine/story/2016/09...,"{""description"": ""He sees it as zero-sum. She b...",Trump vs. Clinton: A Fundamental Clash over Ho...,1
4,Real_13-Webpage,"President Obama Vetoes 9/11 Victims Bill, Sett...",President Obama today vetoed a bill that would...,http://abcn.ws/2dh2NFs,http://a.abcnews.com/images/US/AP_Obama_BM_201...,"John Parkinson,More John,Abc News,More Alexander",http://abcn.ws,,,http://www.googleadservices.com/pagead/convers...,http://abcnews.go.com/Politics/president-obama...,"{""fb_title"": ""President Obama Vetoes 9/11 Vict...","President Obama Vetoes 9/11 Victims Bill, Sett...",1
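
The label rule in In [5] depends only on the prefix of the id. A small sketch of the same rule as a named function; the id 'Fake_1-Webpage' is an assumed format for the fake-news rows, which are not shown in the output above:

```python
def news_type(news_id: str) -> int:
    """Return 1 when the id starts with 'Real_', else 0 (same rule as the lambda in In [5])."""
    return 1 if news_id.split('_')[0] == 'Real' else 0

print(news_type('Real_1-Webpage'))   # 1
print(news_type('Fake_1-Webpage'))   # 0 (assumed id format)
```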


# Data preparation

# N-gram count matrix

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
vectorizer=CountVectorizer()
N=vectorizer.fit_transform(df['article'])
N.shape

(182, 10288)
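
With its default settings, CountVectorizer counts single lowercased words (ngram_range=(1, 1)), so each column of N is one vocabulary word. A toy sketch on three made-up sentences, not on the dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fake news spreads fast", "real news spreads slowly", "news is news"]
vec = CountVectorizer()            # default: unigrams, lowercased
counts = vec.fit_transform(docs)   # sparse matrix, one row per document

print(counts.shape)                # (documents, vocabulary size)
print(sorted(vec.vocabulary_))     # column order is alphabetical
print(counts.toarray()[2])         # "news" occurs twice in the third document
```

Passing ngram_range=(1, 2) would add word pairs as extra columns; the notebook does not do this.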

# News-user engagement matrix

In [7]:
from scipy.sparse import csr_matrix, coo_matrix

row=BFNewsUser[0]-1
col=BFNewsUser[1]-1
data=BFNewsUser[2]
U=csr_matrix((data, (row, col)), shape=(max(row)+1, max(col)+1))
U.shape

(182, 15257)
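
The construction in In [7] turns (news, user, count) triples into a news-by-user matrix. A minimal sketch with three hypothetical triples shows the effect of the index shift:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (news, user, count) triples with 1-based indices.
triples = np.array([[1, 1, 2],
                    [1, 3, 1],
                    [2, 2, 4]])
row = triples[:, 0] - 1   # shift to 0-based row indices
col = triples[:, 1] - 1   # shift to 0-based column indices
data = triples[:, 2]

engagement = csr_matrix((data, (row, col)), shape=(row.max() + 1, col.max() + 1))
print(engagement.toarray())
```

Inferring the shape from the largest index works only if the last news item and the last user both appear in the triples; otherwise the shape has to be passed explicitly.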

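* The svds function from scipy.sparse.linalg returns the singular values in ascending order, so after the decomposition we reverse the order of the singular values and of the corresponding singular vectors. With the default k=6, each decomposition keeps 6 components.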

In [8]:
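from scipy.sparse.linalg import svds
from scipy.sparse import hstack, diags

# Truncated SVD of the word-count matrix; svds needs a floating-point matrix.
Un,Sn, VnT = svds(N.astype(float))
n=len(Sn)
# Reverse the component order so that the largest singular value comes first.
Un[:,:n] = Un[:, n-1::-1]
Sn = Sn[::-1]
VnT[:n, :] = VnT[n-1::-1, :]

# Same decomposition and reordering for the news-user engagement matrix.
Uu, Su, VuT = svds(U.astype(float))
n=len(Su)
Uu[:,:n] = Uu[:, n-1::-1]
Su = Su[::-1]
VuT[:n, :] = VuT[n-1::-1, :]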
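
The reordering can be checked on a small random matrix. The sketch below sorts the components explicitly with argsort rather than relying on the order svds returns, then compares them with the largest singular values from a dense SVD. The matrix and the choice k=4 are arbitrary:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Arbitrary sparse test matrix (about 30% non-zero entries).
rng = np.random.default_rng(0)
dense = rng.random((20, 30))
dense[dense < 0.7] = 0.0
A = csr_matrix(dense)

k = 4
U, s, Vt = svds(A, k=k)

# Sort explicitly so that the largest singular value comes first.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# The k largest singular values of a dense SVD should match.
s_dense = np.linalg.svd(dense, compute_uv=False)[:k]
print(np.allclose(s, s_dense))
```

Indexing with `order` reorders the columns of U and the rows of Vt together with s, so each singular triplet stays matched.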

In [9]:
News=csr_matrix(hstack((coo_matrix(Un), coo_matrix(Uu))))
SingValue=diags([*Sn, *Su])
Weighted_News=News*SingValue
Weighted_News.shape

(182, 12)
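
Multiplying the stacked factors by a diagonal matrix scales each column by its singular value, so stronger components carry more weight in the feature matrix. A toy sketch with hypothetical 3x2 factor blocks:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, hstack

# Two hypothetical factor blocks (3 items, 2 components each) and their singular values.
U_text = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
U_user = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]])
s_text = np.array([3.0, 2.0])
s_user = np.array([5.0, 1.0])

features = csr_matrix(hstack((csr_matrix(U_text), csr_matrix(U_user))))
weights = diags(np.concatenate([s_text, s_user]))

weighted = features @ weights   # scales column j by the j-th singular value
print(weighted.toarray())
```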

In [10]:
X=Weighted_News.copy()
Y=df['news_type']

In [11]:
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test=train_test_split(X, Y, test_size=.3, random_state=42, shuffle=True, stratify=Y)

In [12]:
cv=KFold(n_splits=5, shuffle=True, random_state=5)

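params={
    'booster': ['gbtree', 'gblinear', 'dart'],
    # Tuples are explicit candidate lists, not ranges: 100, 500 and 50 are tried.
    'n_estimators': (100,500, 50),
    # Not an XGBoost parameter; XGBoost warns that it might not be used.
    'max_features': ('log2', 'sqrt'),
    # Candidates are 10, 50 and 10 again, so depth 10 is fitted twice.
    'max_depth': (10,50,10),
    'n_jobs': [-1]
}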

gridXGB=GridSearchCV(XGBClassifier(), param_grid=params, cv=cv, verbose=0)
gridXGB.fit(X_train, y_train)
y_pred=gridXGB.predict(X_test)

Parameters: { "max_features" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "max_features" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


In [13]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score


print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))
print(gridXGB.best_params_)

[[22  6]
 [ 6 21]]


              precision    recall  f1-score   support

           0       0.79      0.79      0.79        28
           1       0.78      0.78      0.78        27

    accuracy                           0.78        55
   macro avg       0.78      0.78      0.78        55
weighted avg       0.78      0.78      0.78        55

{'booster': 'gblinear', 'max_depth': 50, 'max_features': 'sqrt', 'n_estimators': 50, 'n_jobs': -1}
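
The selected n_estimators of 50 shows that the tuple (100, 500, 50) in In [12] was read as three candidate values rather than as a start/stop/step range. A short sketch with scikit-learn's ParameterGrid makes the difference visible. Whether a range was actually intended is an assumption, and `grid_as_range` is a hypothetical alternative:

```python
from sklearn.model_selection import ParameterGrid

# The two numeric entries as written in In [12].
grid_as_written = {'n_estimators': (100, 500, 50), 'max_depth': (10, 50, 10)}
print(len(ParameterGrid(grid_as_written)))   # 3 * 3 combinations, some of them repeated

# Hypothetical alternative: spell the ranges out explicitly.
grid_as_range = {
    'n_estimators': list(range(100, 501, 50)),
    'max_depth': list(range(10, 51, 10)),
}
print(grid_as_range['n_estimators'])
print(len(ParameterGrid(grid_as_range)))
```

For feature subsampling, XGBoost exposes parameters such as colsample_bytree instead of max_features.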


In [17]:
from sklearn.linear_model import PassiveAggressiveClassifier

params={
    'C': np.array([100,10,1,0.1,0.01,0.001]),  # C must be positive
    'loss': ('hinge', 'squared_hinge'),
    'n_jobs': [-1]
}

gridPAC=GridSearchCV(PassiveAggressiveClassifier(), param_grid=params, cv=cv, verbose=0)
gridPAC.fit(X_train, y_train)
y_pred=gridPAC.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))
print(gridPAC.best_params_)

[[18 10]
 [ 1 26]]


              precision    recall  f1-score   support

           0       0.95      0.64      0.77        28
           1       0.72      0.96      0.83        27

    accuracy                           0.80        55
   macro avg       0.83      0.80      0.80        55
weighted avg       0.84      0.80      0.80        55

{'C': 0.01, 'loss': 'hinge', 'n_jobs': -1}
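
The figures in the report follow directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. A worked check for the matrix printed above:

```python
import numpy as np

cm = np.array([[18, 10],
               [1, 26]])   # rows: true class, columns: predicted class

precision = np.diag(cm) / cm.sum(axis=0)   # correct / predicted, per class
recall = np.diag(cm) / cm.sum(axis=1)      # correct / actual, per class
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), round(accuracy, 2))
```

Class 0 (fake) has high precision but low recall: 10 of the 28 fake articles are predicted as real.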


In [18]:
from sklearn.linear_model import LogisticRegression

params={
    'penalty':['l2', 'l1'],  # 'l1' needs solver='liblinear' or 'saga'; with the default solver those fits fail
    'C':np.array([100,10,1,0.1,0.01,0.001])  # C must be positive
}

gridLR=GridSearchCV(LogisticRegression(), param_grid=params, cv=cv, verbose=0)
gridLR.fit(X_train, y_train)
y_pred=gridLR.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))
print(gridLR.best_params_)

[[21  7]
 [ 6 21]]


              precision    recall  f1-score   support

           0       0.78      0.75      0.76        28
           1       0.75      0.78      0.76        27

    accuracy                           0.76        55
   macro avg       0.76      0.76      0.76        55
weighted avg       0.76      0.76      0.76        55



In [19]:
from sklearn.preprocessing import MaxAbsScaler
scaler=MaxAbsScaler()
scaler.fit(X_train)
X_trainN=scaler.transform(X_train)
X_testN=scaler.transform(X_test)
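
MaxAbsScaler divides each column by its largest absolute value in the training data and keeps sparse input sparse. A small sketch with made-up numbers:

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler

X_tr = csr_matrix(np.array([[2.0, -10.0], [4.0, 5.0]]))
X_te = csr_matrix(np.array([[8.0, 20.0]]))

scaler = MaxAbsScaler().fit(X_tr)      # learns per-column max |value| from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # reuses the training maxima

print(X_tr_scaled.toarray())
print(X_te_scaled.toarray())
print(issparse(X_te_scaled))
```

Because the maxima come from the training rows only, scaled test values can fall outside [-1, 1], as they do here.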

In [20]:
from sklearn.linear_model import LogisticRegression

params={
    'penalty':['l2', 'l1'],  # 'l1' needs solver='liblinear' or 'saga'; with the default solver those fits fail
    'C':np.array([100,10,1,0.1,0.01,0.001])  # C must be positive
}

gridLR=GridSearchCV(LogisticRegression(), param_grid=params, cv=cv, verbose=0)
gridLR.fit(X_trainN, y_train)
y_pred=gridLR.predict(X_testN)

print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))
print(gridLR.best_params_)

[[21  7]
 [ 3 24]]


              precision    recall  f1-score   support

           0       0.88      0.75      0.81        28
           1       0.77      0.89      0.83        27

    accuracy                           0.82        55
   macro avg       0.82      0.82      0.82        55
weighted avg       0.83      0.82      0.82        55

{'C': 100.0, 'penalty': 'l2'}


In [81]:
from sklearn.linear_model import PassiveAggressiveClassifier

params={'C': [1.0], 'loss': ['squared_hinge'], 'n_jobs': [-1]}

gridPAC=GridSearchCV(PassiveAggressiveClassifier(), param_grid=params, cv=cv, verbose=0)
gridPAC.fit(X_trainN, y_train)
y_pred=gridPAC.predict(X_testN)

print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))
print(gridPAC.best_params_)

[[23  5]
 [ 1 26]]


              precision    recall  f1-score   support

           0       0.96      0.82      0.88        28
           1       0.84      0.96      0.90        27

    accuracy                           0.89        55
   macro avg       0.90      0.89      0.89        55
weighted avg       0.90      0.89      0.89        55

{'C': 1.0, 'loss': 'squared_hinge', 'n_jobs': -1}


In [21]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras import regularizers

xtest, xval, ytest, yval=train_test_split(X_testN, y_test, test_size=.5, random_state=42, shuffle=True, stratify=y_test)

In [39]:
model = Sequential([
    Dense(12, activation='relu', input_shape=(12,)),
    Dropout(0.2),
    Dense(12, activation='relu'),
    Dropout(0.2),
    Dense(12, activation='relu'),
    Dropout(0.2),
    Dense(6, activation='linear'),
    Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
             )

hist = model.fit(X_trainN.toarray(), y_train,
        batch_size=15, epochs=120,
        validation_split=.2
        )
model.evaluate(X_testN.toarray(), y_test)[1]

Epoch 1/120
...
Epoch 120/120


0.8181818127632141

In [41]:
y_pred = (model.predict(X_testN.toarray()) > 0.5).astype("int32")
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.79      0.81        28
           1       0.79      0.85      0.82        27

    accuracy                           0.82        55
   macro avg       0.82      0.82      0.82        55
weighted avg       0.82      0.82      0.82        55



# Fake News

Let's test this technique on a different dataset.

In [42]:
path='dataset2/'

df=pd.read_csv(path+'train.csv')

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [50]:
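# Keep only rows that have body text; title and author may still be missing
X=df.dropna(how='any', subset=['text'])
Y=X['label']
# Return a new frame instead of mutating a slice of df in place
X=X.drop('label', axis=1)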

In [51]:
X=X.fillna('')
X['article']=X['title']+X['author']+X['text']
X=X['article']
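
Concatenating the columns with a bare `+` joins them without any separator, so the last word of one field fuses with the first word of the next before tokenization. The sketch below, on a hypothetical two-row frame (the `toy` data is invented for illustration), shows the difference a space separator makes. It is only an illustration; switching the notebook to this variant would change the vocabulary size and the scores reported here.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    'title': ['Breaking news', None],
    'author': ['Jane Doe', 'John Roe'],
    'text': ['Something happened.', 'Nothing happened.'],
})
toy = toy.fillna('')

# Plain concatenation, as in the cell above: fields run together.
plain = toy['title'] + toy['author'] + toy['text']

# Space-separated concatenation keeps the field boundaries as token boundaries.
spaced = (toy['title'] + ' ' + toy['author'] + ' ' + toy['text']).str.strip()

print(plain[0])
print(spaced[0])
```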

# Data preparation

## N-gram count matrix

In [52]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer= CountVectorizer()
N=vectorizer.fit_transform(X)
N.shape

(20761, 199270)

In [53]:
from scipy.sparse.linalg import svds
from scipy.sparse import hstack, diags



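# Truncated SVD of the count matrix. svds needs a floating-point matrix
# and returns k=6 singular triplets by default.
Un, Sn, VnT = svds(N.astype(np.float64))

# Reverse the order of the factors (svds usually returns the singular values
# in ascending order) so that the largest singular value comes first.
n = len(Sn)
Un[:, :n] = Un[:, n-1::-1]
Sn = Sn[::-1]
VnT[:n, :] = VnT[n-1::-1, :]

# Scale each left singular vector by its singular value:
# every article becomes a 6-dimensional feature row.
Weighted_News = Un*diags(Sn)
X = Weighted_News.copy()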
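
To make the factorization step easier to follow, here is a minimal sketch on a made-up 4x5 count matrix (the values, `k=2`, and all variable names are assumptions for illustration). It sorts the singular values explicitly rather than relying on the order `svds` returns, and checks that the weighted left singular vectors `U * S` equal the projection of the counts onto the right singular vectors.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy document-term count matrix (4 documents, 5 terms); illustrative values only.
counts = csr_matrix(np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float))

k = 2  # number of singular triplets to keep; must be < min(counts.shape)
U, S, Vt = svds(counts, k=k)

# Sort from largest to smallest singular value.
order = np.argsort(S)[::-1]
U, S, Vt = U[:, order], S[order], Vt[order, :]

# Broadcasting scales column j of U by S[j]: one k-dimensional row per document.
doc_features = U * S

# The same features, obtained by projecting the counts onto the right singular vectors.
projected = counts @ Vt.T

print(doc_features.shape)
print(np.allclose(doc_features, projected, atol=1e-6))
```

The second form is the one that matters for unseen articles: new count rows can be mapped into the same space with `new_counts @ Vt.T` without recomputing the factorization.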

In [54]:
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test=train_test_split(X, Y, test_size=.3, random_state=42, shuffle=True, stratify=Y)

In [55]:
cv=KFold(n_splits=5, shuffle=True, random_state=5)

In [57]:
from sklearn.linear_model import PassiveAggressiveClassifier

params={
    'C': np.array([100,10,1,0.1,0.01,0.001]),  # C must be strictly positive
    'loss': ('hinge', 'squared_hinge'),
    'n_jobs': [-1]
}

gridPAC=GridSearchCV(PassiveAggressiveClassifier(), param_grid=params, cv=cv, verbose=0)
gridPAC.fit(X_train, y_train)
y_pred=gridPAC.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))
print(gridPAC.best_params_)

[[2471  645]
 [1181 1932]]


              precision    recall  f1-score   support

           0       0.68      0.79      0.73      3116
           1       0.75      0.62      0.68      3113

    accuracy                           0.71      6229
   macro avg       0.71      0.71      0.70      6229
weighted avg       0.71      0.71      0.70      6229

{'C': 100.0, 'loss': 'hinge', 'n_jobs': -1}
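
As a worked check, the figures in the classification report above can be recomputed directly from the printed confusion matrix (rows are true labels, columns are predicted labels). The variable names below are just for this sketch.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[2471, 645],
               [1181, 1932]])

true_pos = np.diag(cm)                 # correctly classified items per class
recall = true_pos / cm.sum(axis=1)     # divide by the number of true items per class
precision = true_pos / cm.sum(axis=0)  # divide by the number of predicted items per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), round(accuracy, 2))
```

For class 1, for example, precision is 1932 / (645 + 1932) and recall is 1932 / (1181 + 1932), which round to the 0.75 and 0.62 shown in the report.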


In [58]:
from sklearn.linear_model import LogisticRegression

params={
    'penalty':['l2'],
    'C':np.array([100,10,1,0.1,0.01,0.001]),  # C must be strictly positive
}

gridLR=GridSearchCV(LogisticRegression(), param_grid=params, cv=cv, verbose=0)
gridLR.fit(X_train, y_train)
y_pred=gridLR.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))
print(gridLR.best_params_)

[[1926 1190]
 [ 501 2612]]


              precision    recall  f1-score   support

           0       0.79      0.62      0.69      3116
           1       0.69      0.84      0.76      3113

    accuracy                           0.73      6229
   macro avg       0.74      0.73      0.73      6229
weighted avg       0.74      0.73      0.73      6229

{'C': 0.001, 'penalty': 'l2'}
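
A grid of `C` values for a search like the one above can also be generated with `np.logspace`, which gives evenly log-spaced, strictly positive values without repeats. The sketch below runs on synthetic data from `make_classification`; the data and the `*_demo` names are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the 6 SVD features (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Six log-spaced values from 100 down to 0.001.
C_grid = np.logspace(2, -3, num=6)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=5)
search = GridSearchCV(LogisticRegression(), param_grid={'C': C_grid}, cv=cv_demo)
search.fit(X_demo, y_demo)

print(C_grid)
print(search.best_params_)
```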


In [59]:
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler
scaler=MaxAbsScaler()
scaler.fit(X_train)
X_trainN=scaler.transform(X_train)
X_testN=scaler.transform(X_test)

In [60]:
from sklearn.linear_model import LogisticRegression

params={
    'penalty':['l2'],
    'C':np.array([100,10,1,0.1,0.01,0.001]),  # C must be strictly positive
}

gridLR=GridSearchCV(LogisticRegression(), param_grid=params, cv=cv, verbose=0)
gridLR.fit(X_trainN, y_train)
y_pred=gridLR.predict(X_testN)

print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))
print(gridLR.best_params_)

[[1926 1190]
 [ 503 2610]]


              precision    recall  f1-score   support

           0       0.79      0.62      0.69      3116
           1       0.69      0.84      0.76      3113

    accuracy                           0.73      6229
   macro avg       0.74      0.73      0.72      6229
weighted avg       0.74      0.73      0.72      6229

{'C': 100.0, 'penalty': 'l2'}


In [61]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras import regularizers

xtest, xval, ytest, yval=train_test_split(X_testN, y_test, test_size=.5, random_state=42, shuffle=True, stratify=y_test)

In [62]:
model = Sequential([
    Dense(6, activation='relu', input_shape=(6,)),
    Dropout(0.2),
    Dense(6, activation='relu'),
    Dropout(0.2),
    Dense(6, activation='relu'),
    Dropout(0.2),
    Dense(6, activation='linear'),
    Dropout(0.2),
    Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
             )

hist = model.fit(X_trainN, y_train,
        batch_size=250, epochs=120,
        validation_split=.2
        )
model.evaluate(X_testN, y_test)[1]

Epoch 1/120
...
Epoch 120/120


0.7485952973365784

In [63]:
predictions = (model.predict(X_testN) > 0.5).astype("int32")
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.76      0.72      0.74      3116
           1       0.74      0.78      0.76      3113

    accuracy                           0.75      6229
   macro avg       0.75      0.75      0.75      6229
weighted avg       0.75      0.75      0.75      6229



# Summary

I tested many models; only the ones with the better results are shown here. The Fake News dataset is large, but unfortunately it contains no information about news-user or user-community relationships. That may be why the results on it are weaker. <br>
To sum up, applying SVD factorization to the dataset makes life easier: it gives good results without the time-consuming feature engineering that would otherwise be needed.
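
For reference, the count-matrix, SVD and classifier steps can also be chained in a scikit-learn `Pipeline`. This is an alternative sketch, not the code used in this notebook: it swaps in `TruncatedSVD` (whose transformed output is the same `U * S` weighting) for the manual `svds` call, and the toy `texts` and `labels` are invented for illustration. One difference in design is that the pipeline fits the vectorizer and the SVD on the training data only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy corpus; 1 = fake, 0 = real (labels are illustrative only).
texts = [
    "shocking secret cure doctors hate",
    "you will not believe this shocking secret",
    "miracle cure revealed in secret report",
    "senate passes budget bill after debate",
    "city council approves new budget",
    "report on senate debate published today",
]
labels = [1, 1, 1, 0, 0, 0]

pipe = make_pipeline(
    CountVectorizer(),                            # word counts
    TruncatedSVD(n_components=2, random_state=0), # low-rank features
    LogisticRegression(),                         # classifier on the SVD features
)
pipe.fit(texts, labels)

# Features produced by the first two steps, and a prediction for a new text.
features = pipe[:-1].transform(texts)
preds = pipe.predict(["secret cure revealed"])
print(features.shape, preds)
```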