#### Text Classification Using spaCy Word Embeddings
* Problem Statement
Fake news is misinformation or disinformation that spreads by word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.

Fake news often spreads faster than real news and can create confusion and fear among groups and in society at large.

We will address this problem with classical NLP techniques by classifying a given message or text as Real or Fake.

We will use the pretrained word embeddings bundled with spaCy's `en_core_web_lg` model (referred to as GloVe embeddings in this notebook), which were trained on a large text corpus, to vectorize the text, and then apply different classification algorithms.

In [None]:
import pandas as pd
df = pd.read_csv('Fake_Real_Data.csv')
print(df.shape)
df.head()

In [None]:
df['Text'][0]

In [None]:
df.label.value_counts()

In [None]:
target = {'Fake':0,'Real':1}
df['label_num'] = df['label'].map(target)

In [None]:
df.head()

In [None]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [None]:
# doc.vector is the average of the document's token vectors (300 dimensions for en_core_web_lg)
df['vector'] = df['Text'].apply(lambda text: nlp(text).vector)
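
To make the averaging step concrete, here is a minimal sketch of how a document vector can be built from word vectors. The 4-dimensional `toy_vectors` table and the `document_vector` helper are hypothetical, introduced only for illustration; the real vectors have 300 dimensions. The sketch simply skips unknown tokens, which is a simplification and may differ from how spaCy treats out-of-vocabulary words.

```python
import numpy as np

# Hypothetical 4-dimensional embedding table (illustration only).
toy_vectors = {
    "fake": np.array([1.0, 0.0, 2.0, -1.0]),
    "news": np.array([0.0, 2.0, 0.0, 1.0]),
    "spreads": np.array([2.0, 1.0, 1.0, 0.0]),
}

def document_vector(tokens, table, dim=4):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_vec = document_vector(["fake", "news", "spreads"], toy_vectors)
print(doc_vec)
```

Note that the averaged vector can contain negative components, which matters later when choosing a classifier.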

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['vector'], df['label_num'], test_size=0.2, random_state=2022, stratify=df['label_num']
)
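
The `stratify` argument keeps the Fake/Real proportions the same in both splits. The sketch below illustrates the idea with NumPy only. The `stratified_split` helper and the 70/30 toy labels are hypothetical, and the helper does not reproduce the exact rows that `train_test_split` would select.

```python
import numpy as np

def stratified_split(labels, test_size=0.2, seed=2022):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # positions of this class
        rng.shuffle(idx)                      # shuffle within the class
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Hypothetical imbalanced labels: 70 Fake (0) and 30 Real (1).
toy_labels = np.array([0] * 70 + [1] * 30)
train_idx, test_idx = stratified_split(toy_labels)

# Class counts in each split keep the original 70:30 ratio.
print(np.bincount(toy_labels[train_idx]), np.bincount(toy_labels[test_idx]))
```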

In [None]:
import numpy as np

x_train_2d = np.stack(X_train)
x_test_2d = np.stack(X_test)

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

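# MultinomialNB rejects negative feature values, so rescale each feature to [0, 1].
# The scaler is fit on the training data only and then reused on the test data.
scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(x_train_2d)
scaled_test = scaler.transform(x_test_2d)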
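
As a rough illustration of what the scaling step does, the sketch below applies the min-max formula by hand with NumPy. The toy arrays are hypothetical stand-ins for the 300-dimensional vectors, and the sketch assumes every feature has a non-zero range in the training data.

```python
import numpy as np

# Toy data with negative values (illustration only).
train = np.array([[-2.0, 0.0], [0.0, 5.0], [2.0, 10.0]])
test = np.array([[-3.0, 5.0], [1.0, 12.0]])

# "Fit": learn the per-feature minimum and range from the training data only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "Transform": apply the same shift and scale to both splits.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

The training features land exactly in [0, 1]. Test values that lie outside the range seen during training can still fall below 0 or above 1, as the first and last entries of `test_scaled` show.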

In [None]:
nb = MultinomialNB()
nb.fit(scaled_train,y_train)

In [None]:
from sklearn.metrics import classification_report
y_pred = nb.predict(scaled_test)
print(classification_report(y_test,y_pred))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# KNN with default settings (n_neighbors=5, Euclidean distance), trained on the unscaled vectors
knn = KNeighborsClassifier()
knn.fit(x_train_2d,y_train)
y_pred = knn.predict(x_test_2d)
print(classification_report(y_test,y_pred))
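
For intuition about what the classifier does, here is a minimal NumPy sketch of k-nearest-neighbours prediction by majority vote. The `knn_predict` helper and the 2-D toy points are hypothetical; `KNeighborsClassifier` uses more efficient neighbour search, so this is only a sketch of the idea.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=5):
    """Predict labels by majority vote among the k nearest training points (Euclidean)."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    preds = []
    for q in np.atleast_2d(query):
        dists = np.linalg.norm(train_X - q, axis=1)  # distance to every training point
        nearest = np.argsort(dists)[:k]              # indices of the k closest points
        votes = np.bincount(train_y[nearest])        # count labels among the neighbours
        preds.append(int(np.argmax(votes)))
    return np.array(preds)

# Hypothetical 2-D points in two well-separated clusters (0 = Fake, 1 = Real).
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(toy_X, toy_y, [[0.05, 0.05], [1.0, 0.95]], k=3)
print(toy_pred)
```

Because the prediction depends only on distances, KNN benefits when texts with similar meaning end up close together in the embedding space.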

In [None]:
from sklearn.metrics import confusion_matrix
# Confusion matrix for the KNN predictions: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test,y_pred)
cm
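
The numbers in the classification report can be derived from the confusion matrix. The sketch below shows the arithmetic on a hypothetical matrix, `cm_example`; the counts are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true labels, columns are predicted labels (0 = Fake, 1 = Real).
cm_example = np.array([[90, 10],
                       [5, 95]])

true_pos = np.diag(cm_example)                 # correct predictions per class
precision = true_pos / cm_example.sum(axis=0)  # column sums = predicted counts per class
recall = true_pos / cm_example.sum(axis=1)     # row sums = true counts per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm_example.sum()

print(precision, recall, f1, accuracy)
```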

##### Key Takeaways
* The KNN model did not perform well with vectorization techniques such as Bag of Words and TF-IDF because of their very high-dimensional vector spaces, but it performed really well with GloVe vectors. These vectors have only 300 dimensions, and the embeddings suit the given text data well: similar and related words have similar embeddings.

* The MultinomialNB model performed decently but did not rank among the top models. The 300-dimensional vectors contain negative values, and MultinomialNB cannot be fit on data with negative values. To work around this, we used the Min-Max scaler to bring all values into the range 0 to 1. This rescaling can change the variance of the features and lose some information. Even so, we got decent recall and F1 scores.