# spaCy Natural Language Processing + Classification Using KNN and Naive Bayes

This notebook explores our Fake vs Real News dataset using spaCy's preprocessing tools, then trains two binary classifiers (K-Nearest Neighbors and Multinomial Naive Bayes) to predict whether a piece of text is 'Fake' or 'Real' news.

## Importing All Necessary Modules

In [1]:
import spacy
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report


## Data Loading + Cleaning

In [4]:
df = pd.read_csv('fake_real.csv')

In [5]:
df.head()

Unnamed: 0,title,text,subject,date,type
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",True
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",True
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",True
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",True
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",True


In [6]:
df.shape

(44898, 5)

In [7]:
df[df['type'] == 'True'].count()

title      21417
text       21417
subject    21417
date       21417
type       21417
dtype: int64

In [8]:
df[df['type'] == 'Fake'].count()

title      23481
text       23481
subject    23481
date       23481
type       23481
dtype: int64

In [9]:
# create a smaller sampled df
fake_df = df[df['type'] == 'Fake'].sample(400)


true_df = df[df['type'] == 'True'].sample(400)
true_df.shape

(400, 5)

In [10]:
sampled_df = pd.concat([fake_df, true_df])
sampled_df.shape

(800, 5)
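
The sampling above draws a different subset on every run because `sample` is not seeded. A minimal sketch of a reproducible, balanced draw, assuming a toy DataFrame in place of `fake_real.csv` and an arbitrary seed of 42 (both are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the full dataset (hypothetical data, for illustration only).
toy_df = pd.DataFrame({
    "text": [f"article {i}" for i in range(20)],
    "type": ["Fake"] * 12 + ["True"] * 8,
})

N_PER_CLASS = 5   # the notebook uses 400
SEED = 42         # arbitrary seed, assumed here for reproducibility

# Draw the same number of rows from each class, then combine.
fake_sample = toy_df[toy_df["type"] == "Fake"].sample(N_PER_CLASS, random_state=SEED)
true_sample = toy_df[toy_df["type"] == "True"].sample(N_PER_CLASS, random_state=SEED)
balanced = pd.concat([fake_sample, true_sample])

print(balanced["type"].value_counts())
```

Re-running the cell with the same seed returns the same rows, which keeps any downstream metrics comparable between runs.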

In [11]:
sampled_df['type'].nunique()

2

In [12]:
sampled_df.type.value_counts()

type
Fake    400
True    400
Name: count, dtype: int64

In [13]:
# encode the labels numerically: Fake -> 0, True -> 1
sampled_df['label_num'] = sampled_df['type'].map({'Fake': 0, 'True': 1})
sampled_df.head()

Unnamed: 0,title,text,subject,date,type,label_num
30748,GRAMMYS Nominate All-Black Artists…SNUB Countr...,"On Tuesday, the Grammys announced their nomina...",politics,"Nov 28, 2017",Fake,0
30618,JUST IN: FCC VOTES To Repeal Obama’s Net Neutr...,"Today, the FCC voted to repeal the net neutral...",politics,"Dec 14, 2017",Fake,0
29866,This One GIF Perfectly Sums Up How Much Every...,"Normally, it s not very nice to make fun of so...",News,"January 29, 2016",Fake,0
30631,JUST RELEASED: Texts Between Anti-Trump Muelle...,Fox News just revealed what we knew but it s o...,politics,"Dec 12, 2017",Fake,0
40172,BRAVO! TED CRUZ To Introduce Bill To Help Trum...,Now one unlikely Senator is about to put forth...,left-news,"Apr 25, 2017",Fake,0


In [14]:
sampled_df[sampled_df['type'] =="True"].sample()

Unnamed: 0,title,text,subject,date,type,label_num
2269,Few expect Trump's 15-percent corporate tax ra...,WASHINGTON (Reuters) - Only a small number of ...,politicsNews,"August 8, 2017",True,1


## Text Preprocessing

In [20]:
!python -m spacy download en_core_web_lg



Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)


In [21]:
# loading the spacy model
nlp = spacy.load('en_core_web_lg')



In [22]:
doc = nlp(sampled_df['text'].iloc[1])

Now we can create a vector for each article by running its text through the spaCy pipeline and taking `Doc.vector`. This vector will later be used to train our models.

In [23]:
# vectorize only the sampled rows instead of the full dataset
sampled_df['vector'] = sampled_df['text'].apply(lambda x: nlp(x).vector)
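
For spaCy pipelines that ship static word vectors, `Doc.vector` defaults to the average of the token vectors. A minimal numpy sketch of that averaging, using made-up 4-dimensional vectors (the real `en_core_web_lg` vectors are 300-dimensional) so it runs without downloading the model:

```python
import numpy as np

# Hypothetical token vectors for a three-token document (values are made up).
token_vectors = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
])

# Averaging over the token axis gives one fixed-length vector per document,
# regardless of how many tokens the document has.
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)  # [2. 2. 1. 0.]
```

Because every document collapses to the same length, articles of very different sizes can be fed to the same classifier.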

In [None]:
sampled_df.head()

## Models

### Splitting and Prepping Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(sampled_df.vector.values, sampled_df.label_num, test_size = 0.3, random_state=100)
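
With only 800 samples, a purely random split can leave the two classes slightly unbalanced between train and test. A sketch of a stratified split on synthetic data, assuming the `stratify` argument of `train_test_split` suits this workflow (the original notebook does not use it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples with 5 features, 50 per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 50 + [1] * 50)

# stratify=y keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y
)

print(np.bincount(y_tr), np.bincount(y_te))
```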

In [None]:
# convert to 2D numpy array
X_train2 = np.stack(X_train)
X_test2 = np.stack(X_test)
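
`sampled_df.vector.values` is a one-dimensional object array whose elements are themselves arrays, which scikit-learn estimators cannot consume directly. A small sketch of what `np.stack` does here, using made-up 3-dimensional vectors:

```python
import numpy as np
import pandas as pd

# A Series holding one small vector per row (made-up values).
vectors = pd.Series([np.array([1.0, 2.0, 3.0]),
                     np.array([4.0, 5.0, 6.0])])

as_objects = vectors.values        # 1-D array of array objects
as_matrix = np.stack(as_objects)   # proper 2-D numeric matrix

print(as_objects.shape, as_objects.dtype)
print(as_matrix.shape, as_matrix.dtype)
```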

### Min Max Scaling

In [None]:
scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(X_train2)
scaled_test = scaler.transform(X_test2)
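
`MultinomialNB` rejects negative feature values, and word-vector components are frequently negative, which is why the features are rescaled to the range [0, 1] first. A sketch on synthetic data (the arrays below are made up) showing the error on raw features and a successful fit after scaling:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic features containing negative values, like raw word vectors.
X = np.array([[-1.0, 0.5], [0.2, -0.3], [1.5, 0.8], [-0.4, 1.1]])
y = np.array([0, 0, 1, 1])

try:
    MultinomialNB().fit(X, y)
    raw_fit_failed = False
except ValueError as err:
    raw_fit_failed = True
    print("Raw features rejected:", err)

# After min-max scaling, every column lies in [0, 1] and the fit succeeds.
X_scaled = MinMaxScaler().fit_transform(X)
model = MultinomialNB().fit(X_scaled, y)
print(X_scaled.min(), X_scaled.max())
```

As in the cell above, the scaler should be fitted on the training split only and then applied to the test split with `transform`.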

### Multinomial Naive Bayes Classifier

In [None]:
# MultinomialNB requires non-negative features, so train on the min-max scaled vectors
nb = MultinomialNB()
nb.fit(scaled_train, y_train)

### KNN Classifier

In [None]:
# fit on the scaled features so training matches the scaled test data used for prediction
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(scaled_train, y_train)
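
Because k-nearest neighbors relies on distances, the model must be fitted and queried in the same feature space. One way to guarantee that is to bundle the scaler and classifier in a scikit-learn pipeline. The following is a sketch on synthetic data, not the notebook's own code:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, well-separated two-class data (made up for illustration).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])

# The pipeline scales inputs with the fitted scaler before KNN,
# both when fitting and when predicting.
knn_pipe = make_pipeline(
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
)
knn_pipe.fit(X_train, y_train)
print(knn_pipe.predict(X_test))
```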

## Evaluation

### Multinomial Naive Bayes Evaluation

In [None]:
nb_y_pred = nb.predict(scaled_test)

In [None]:
print(classification_report(y_test, nb_y_pred))

### K-Nearest Neighbors Evaluation

In [None]:
knn_y_pred = knn.predict(scaled_test)

In [None]:
print(classification_report(y_test, knn_y_pred))
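
The report rows are labeled `0` and `1` by default. A small sketch, using made-up labels and predictions, of passing `target_names` so the rows read `Fake` and `True`, alongside a confusion matrix:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up ground truth and predictions (0 = Fake, 1 = True).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

print(classification_report(y_true, y_pred, target_names=["Fake", "True"]))

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The confusion matrix shows which kind of error each model makes, which the averaged precision and recall figures can hide.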

## Conclusion