<a href="https://colab.research.google.com/github/LukaLujan/word2vec/blob/main/gthb1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this Google Colab notebook, I show how you can train Gensim's Word2Vec model on your own corpus to get word embeddings fitted to your data. We will use those vectors later to train a classifier on the training set and, of course, evaluate it on the test set. Looking through code on the internet, I found plenty of examples that use Word2Vec; interestingly, however, the vast majority skip evaluating the model on real test data. I couldn't find a single complete example, so with some effort I worked out my own process. People on Kaggle often just report some "fancy" 96% accuracy on the validation set and that's it.
We are using the "fake news" data set from here:
https://www.kaggle.com/c/fake-news/data

There you will find train.csv, test.csv and submit.csv; the last one contains the labels for the test data, which are what we are trying to predict.
We will evaluate our model on the test set (and that includes preprocessing the test data). In the next notebooks we will compare it with pre-trained Word2Vec vectors and, even better, combine the pre-trained Word2Vec vectors with our own for better results.



In [None]:
#Let's import all the necessary tools


import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import gensim.downloader as api
from google.colab import drive

import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

from keras.preprocessing.text import Tokenizer


import re
STOPWORDS = set(stopwords.words('english'))

import seaborn as sns

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
#This is where I mount my Google Drive in Colab
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
#I am changing the working directory; you can organize your files differently.
%cd drive/MyDrive/

/content/drive/MyDrive


In [None]:
#Importing the training data, which also includes the labels. You can check this with df.head().
df =pd.read_csv("train.csv")

In [None]:
#I am copying the text column into a separate dataframe. I explain why in the next few cells.
df_text =df[['text']].copy()

In [None]:
df.head()

Unnamed: 0,index,id,title,author,text,label
0,0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [None]:
len(df), len(df_text)

(20800, 20800)

In [None]:
#This is where we drop all rows with missing values. If the title, author, text or label is missing, the whole row is dropped. This way we lose about 12% of the rows (20800 -> 18285).
#However, with 2 lines of simple code we get clean data. For learning purposes this is fair enough.

df=df.dropna()
df.reset_index(inplace=True)
print(len(df))

18285


In [None]:
#However, as you can see, I kept a separate dataframe with the text, and there I only drop the rows that have no text.
#This is valuable data that I will use to train the Word2Vec vocabulary.
df_text=df_text.dropna()
df_text.reset_index(inplace=True)
print(len(df_text))

20761


In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer=WordNetLemmatizer()

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In the next cell we do the preprocessing. We take all 20761 text documents (rows) from df_text["text"], each with hundreds of words. We strip punctuation and other symbols from each sentence, remove the stop words and single-character tokens, and lemmatize each word in each sentence of each text. Then we append the result to a corpus named "messages_text". This process can take a few minutes on your machine or on Google Colab.

In [None]:


messages_text = []

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
for par in df_text["text"].values:
    tmp = []
    sentences = nltk.sent_tokenize(par)
     
    for sent in sentences:
      
        sent = sent.lower()
        
        tokens = tokenizer.tokenize(sent)
         

        filtered_words = [w.strip() for w in tokens if w not in STOPWORDS and len(w) > 1]
        filtered_words2 = [lemmatizer.lemmatize(w) for w in filtered_words]
        tmp.extend(filtered_words2)
    messages_text.append(tmp)
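
The same cleaning loop is used again later for the training titles and the test titles, so it may be convenient to wrap it in one function. The following is only a sketch under some assumptions: `preprocess`, `DEMO_STOPWORDS` and the identity lemmatizer are names made up for this illustration so that the example runs without NLTK, and it skips the sentence-splitting step, which should not change the tokens produced by a `\w+` tokenizer.

```python
import re

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "a", "an", "is", "in", "on", "of", "to", "and", "we", "it"}

TOKEN_RE = re.compile(r"\w+")


def preprocess(text, stopwords=DEMO_STOPWORDS, lemmatize=lambda w: w):
    """Lowercase, tokenize on \\w+, drop stop words and 1-char tokens, then lemmatize."""
    tokens = TOKEN_RE.findall(text.lower())
    kept = [w for w in tokens if w not in stopwords and len(w) > 1]
    return [lemmatize(w) for w in kept]


print(preprocess("The Senate is voting on a new bill, and it is 2 pages long!"))
```

In the notebook itself you would presumably call it as `preprocess(par, stopwords=STOPWORDS, lemmatize=lemmatizer.lemmatize)` for every row, which keeps the train and test preprocessing identical by construction.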

In [None]:
#Dimension of the vectors we are generating
EMBEDDING_DIM = 300

#Creating word vectors with Word2Vec (takes time...)
#We train the model on the preprocessed texts. You can play with EMBEDDING_DIM, but we leave it at 300 because later we will compare it with the pre-trained 300-dimensional word2vec vectors.
#You can also play with the minimum number of occurrences of each word (min_count); I'll leave it at 3.
#Note: this call uses the Gensim 3.x API; in Gensim 4.0+ the `size` argument is named `vector_size`.
word2vec_model = gensim.models.Word2Vec(sentences=messages_text, size=EMBEDDING_DIM, window=5, min_count=3)

In [None]:
word2vec_model.wv.most_similar(positive=['iran'], topn = 10)

[('iranian', 0.6900563836097717),
 ('tehran', 0.6507840156555176),
 ('yemen', 0.6204209923744202),
 ('hizbollah', 0.6133781671524048),
 ('houthis', 0.6078612208366394),
 ('egypt', 0.604503870010376),
 ('turkey', 0.6000145673751831),
 ('sanction', 0.5976719856262207),
 ('libya', 0.5974670648574829),
 ('militarily', 0.588986873626709)]
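
The scores returned by `most_similar` are cosine similarities between word vectors. As a small illustration of what such a score measures, here is a minimal sketch with made-up 3-dimensional vectors (the real ones have 300 dimensions, and the values below are not taken from the trained model):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, for illustration only.
iran = [0.9, 0.1, 0.3]
tehran = [0.8, 0.2, 0.4]
banana = [-0.2, 0.9, -0.1]

print(cosine_similarity(iran, tehran))   # close to 1: vectors point the same way
print(cosine_similarity(iran, banana))   # negative: vectors point apart
```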

One little trick here. You could do all the preprocessing on df["title"] instead, like I did the first time. The titles will later be our main features for predicting the target value (whether a text is fake news or not). However, the titles alone contain too few words to train Word2Vec well.
So I trained the Word2Vec model on the text data.

In [None]:
#This is the size of our vocab; each word in the vocab has a 300-dimensional vector.
#(Gensim 3.x API; in Gensim 4.0+ use len(word2vec_model.wv.key_to_index) instead.)
len(word2vec_model.wv.vocab)

70967

In [None]:
df.head()

Unnamed: 0,index,id,title,author,text,label
0,0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


Now we will preprocess all the titles as well. Our goal is to predict whether a news item is fake or real just from its title. A computer cannot read words, but it can read numbers. To turn each word into numbers (a 300-dimensional vector), we first need to preprocess all the titles.

In [None]:
#First we do the standard cleaning.
messages_title = []

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
for par in df["title"].values:
    tmp = []
    sentences = nltk.sent_tokenize(par)
    for sent in sentences:
        sent = sent.lower()
        tokens = tokenizer.tokenize(sent)
        filtered_words = [w.strip() for w in tokens if w not in STOPWORDS and len(w) > 1]
        filtered_words2 = [lemmatizer.lemmatize(w) for w in filtered_words]
        tmp.extend(filtered_words2)
    messages_title.append(tmp)

In [None]:
#Now we join the separate words of each title back into one string, using Python's str.join.
for i in range(len(messages_title)):
 messages_title[i] = " ".join(messages_title[i])

In [None]:
messages_title[0]

'house dem aide even see comey letter jason chaffetz tweeted'

In [None]:
#This function generates a 300-dimensional vector containing only ones (1). We use it as the embedding for words that are not part of our vocab.
#To repeat: our vocab has around 70,000 words, and each word has its own 300-dimensional vector. An unknown word gets a 300-dimensional vector consisting only of ones.
def ones_vector(d):
  return np.ones(d)

In [None]:
#"lista_vectora" (list of vectors) is the list where we store the vectors for each row.
lista_vectora =[]
for row in messages_title:
  temp_ls_vec=[]

  #each row is split into tokens on spaces; if a token is in our vocab, it gets its corresponding vector. If it is not, it gets the 300-dimensional vector of ones.
  for token in row.split():
    if token in word2vec_model.wv.vocab:
      #look the vector up through .wv, the same way as for the test set below
      temp_ls_vec.append(word2vec_model.wv[token])
    else:
      temp_ls_vec.append(ones_vector(300))
  lista_vectora.append(temp_ls_vec)
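
The lookup-with-fallback step above can be sketched as a small reusable function. This is an illustration under assumptions, not the notebook's code: `embed_tokens` is a made-up helper, and `toy_vectors` is a plain dictionary standing in for `word2vec_model.wv`, with 3 dimensions instead of 300 so the output stays readable.

```python
import numpy as np


def embed_tokens(tokens, vectors, dim):
    """Return one vector per token; tokens missing from `vectors` get a vector of ones."""
    oov = np.ones(dim, dtype=np.float32)
    return [vectors[t] if t in vectors else oov for t in tokens]


# toy_vectors is a hypothetical stand-in for word2vec_model.wv.
toy_vectors = {
    "house": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "letter": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}

embedded = embed_tokens("house comey letter".split(), toy_vectors, dim=3)
print(len(embedded))   # one vector per token
print(embedded[1])     # "comey" is not in toy_vectors, so it gets all ones
```

Using the same function for both the training titles and the test titles would avoid keeping two copies of the loop in sync.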

  


In [None]:
#Titles have different lengths. Here the longest title has 47 tokens. I will use that length for padding, which means that in the end each row will consist of 47 vectors.


max_len =[]
for sent in lista_vectora:
  max_len.append(len(sent))
print(max(max_len))
print(len(max_len))
L = max(max_len)

47
18285


In [None]:
#We use these zero vectors to pad the "empty space". As I said before, each row will have length 47. Every word in our trained vocab gets its own vector.
#A word that is not part of our vocab gets the vector of ones. All the empty space is padded with vectors of zeros.
def null_vector(num):
  return np.zeros(num)

In [None]:
#Padding with zeros. Here we only create the lists of zero vectors. Each row in our corpus has its own length, and the maximum length is 47. Subtract the length of a row from 47 and you know how many zero vectors to create for it.
padded_embeddings = []
for row in lista_vectora:
  temp_zer=[]
  if len(row) <L:
    for i in range(L-len(row)):
      temp_zer.append(null_vector(300))
  padded_embeddings.append(temp_zer) 

In [None]:
#the easiest way to concatenate the word vectors (lista_vectora: a 300-dimensional vector for each vocab word, ones for unknown words) with the zero vectors (padded_embeddings)
padded_vectors= [ k+v for k,v in zip(lista_vectora ,padded_embeddings )]
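
The list concatenation above works as long as no row is longer than L. That always holds for the training titles, because L is their maximum length, but a test title with more tokens would be left unpadded and untruncated, and the rows would no longer stack into one array. A possible alternative, sketched here with a hypothetical helper `pad_or_truncate` and toy 3-dimensional vectors, fills a preallocated zero array and cuts off anything beyond `max_len`:

```python
import numpy as np


def pad_or_truncate(rows, max_len, dim):
    """Stack variable-length lists of vectors into an array of shape (len(rows), max_len, dim)."""
    out = np.zeros((len(rows), max_len, dim), dtype=np.float32)
    for i, row in enumerate(rows):
        row = row[:max_len]      # truncate rows longer than max_len
        if row:                  # rows with no tokens stay all zeros
            out[i, :len(row)] = np.asarray(row, dtype=np.float32)
    return out


rows = [
    [np.ones(3), np.ones(3)],    # 2 tokens: padded with 2 zero vectors
    [np.full(3, 0.5)] * 5,       # 5 tokens: truncated to 4
    [],                          # title with no tokens left after cleaning
]
X = pad_or_truncate(rows, max_len=4, dim=3)
print(X.shape)  # (3, 4, 3)
```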

In [None]:
#we have to convert everything into numpy arrays.
y_train = np.array(df["label"])
X_train = np.array(padded_vectors)

In [None]:
X_train.shape

(18285, 47, 300)

In [None]:
#every row that goes through our batches will have shape (47, 300). That will include the test set later.
input_shape =(X_train.shape[1], X_train.shape[2])

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#Splitting the data into train and validation sets. The test data comes later.
X1_train, X1_val, y1_train, y1_val  =train_test_split(X_train, y_train, test_size=0.33, random_state=42)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Input, Dense, LSTM, SpatialDropout1D, Dropout, TimeDistributed
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
model = Sequential()
# model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
    # here we do not need an Embedding layer because we built the embeddings ourselves
model.add(Input(shape=input_shape))
    # https://keras.io/api/layers/core_layers/input/

#This is a simple network that will yield some results. The focus here is on preprocessing, embedding and how to use Word2Vec.
#You can play with tuning it.

model.add(LSTM(64))
model.add(Dropout(0.4))
model.add(Dense(32))
model.add(Dropout(0.4))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 64)                93440     
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 32)                2080      
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 33        
                                                                 
Total params: 95,553
Trainable params: 95,553
Non-trainable params: 0
_________________________________________________________________
None
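
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights and a bias; a Dense layer has one weight per input-output pair plus a bias per unit. A small sketch of that arithmetic (the helper names are made up for this example):

```python
def lstm_params(input_dim, units):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # one weight per input/unit pair plus one bias per unit
    return input_dim * units + units


lstm = lstm_params(300, 64)    # 300-dimensional word vectors into 64 LSTM units
dense = dense_params(64, 32)
out = dense_params(32, 1)
print(lstm, dense, out, lstm + dense + out)  # 93440 2080 33 95553
```

Note that the sequence length (47) does not appear anywhere: the LSTM reuses the same weights at every time step.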


In [None]:

model.fit(X1_train,y1_train, validation_data=(X1_val, y1_val), epochs=5, batch_size=32, callbacks=[EarlyStopping(monitor='val_loss', patience=2,restore_best_weights=True)] )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5


<keras.callbacks.History at 0x7f9fc42ddd10>

In [None]:
#checking the folder where our files are
%ls

 1-s2.0-S1057521915001477-main.pdf       submit.csv         X_test.npy
'Colab Notebooks'/                       test.csv           X_train_extra.npy
 dario/                                  train.csv          y_array.npy
 GoogleNews-vectors-negative300.bin.gz   X_array.csv.npy    y_test_extra.npy
 iv.docx                                 X_array.npy        y_test.npy
 spam.csv                                X_test_extra.npy   y_train_extra.npy


In [None]:
#Importing the test data. Notice that the y-values are in a separate csv file ("submit.csv")
df_Xtest = pd.read_csv("test.csv")
df_ytest =pd.read_csv("submit.csv")

In [None]:
#concatenating the two test dataframes so that I can, for example, drop rows with missing values much more easily.
df_Xy = pd.concat([df_Xtest, df_ytest], axis=1)

In [None]:

df_Xy.dropna(inplace=True)
df_Xy.reset_index(inplace=True)
print(len(df_Xy))

4575


In [None]:
#the test data must go through the same process as the train data. Everything must be preprocessed and the tokens converted into 300-dimensional vectors.

messagesX = []

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
for par in df_Xy["title"].values:
    tmp = []
    sentences = nltk.sent_tokenize(par)
    for sent in sentences:
        sent = sent.lower()
        tokens = tokenizer.tokenize(sent)
        filtered_words = [w.strip() for w in tokens if w not in STOPWORDS and len(w) > 1]
        filtered_words2 = [lemmatizer.lemmatize(w) for w in filtered_words]
        tmp.extend(filtered_words2)
    messagesX.append(tmp)

In [None]:
messagesX[0]

['specter',
 'trump',
 'loosens',
 'tongue',
 'purse',
 'string',
 'silicon',
 'valley',
 'new',
 'york',
 'time']

In [None]:

for i in range(len(messagesX)):
 messagesX[i] =" ".join(word for word in messagesX[i])  

In [None]:
#if a word from the test set is part of our vocab (which was trained on the text data), it gets its 300-dimensional vector. Otherwise it gets the vector of ones.
lista_vectora2 =[]
for row in messagesX:
  temp_ls_vec=[]
  for token in row.split():
    if token in word2vec_model.wv.vocab:
      temp_ls_vec.append(word2vec_model.wv[token])
    else:
      temp_ls_vec.append(ones_vector(300))
  lista_vectora2.append(temp_ls_vec)

In [None]:
#Everything else that is "empty space" is padded with zeros.
padded_embeddings2 = []
for row in lista_vectora2:
  temp_zer=[]
  if len(row) <L:
    for i in range(L-len(row)):
      temp_zer.append(null_vector(300))
  padded_embeddings2.append(temp_zer) 

In [None]:
#we do the same with the test data as with the train data
padded_vectors2= [ k+v for k,v in zip(lista_vectora2 ,padded_embeddings2 )]

In [None]:
y_test = np.array(df_Xy["label"])
X_test = np.array(padded_vectors2)

In [None]:
#predicting results
y_preds = model.predict(X_test)

In [None]:
#because the output layer is a sigmoid, every prediction of 0.5 or above is classified as fake news (1), and everything below 0.5 as a reliable news source (0)
for i in range(len(y_preds)):
  if y_preds[i] >=0.5:
    y_preds[i]=1
  else:
    y_preds[i] =0
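
The same thresholding can be done in one vectorized step instead of a Python loop. A minimal sketch, with made-up probabilities in the (n, 1) shape that a single sigmoid output produces; `to_labels` is a hypothetical helper name:

```python
import numpy as np


def to_labels(probs, threshold=0.5):
    """Turn sigmoid outputs of shape (n, 1) into a flat array of 0/1 labels."""
    return (np.asarray(probs) >= threshold).astype(int).ravel()


probs = np.array([[0.91], [0.12], [0.50], [0.49]])  # made-up values for illustration
print(to_labels(probs))  # [1 0 1 0]
```

This also returns integer labels already flattened, so the separate flatten step is not needed.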

In [None]:
#let's see how our predictions are split
y_preds=y_preds.flatten()
pd.Series(y_preds).value_counts()

0.0    2473
1.0    2102
dtype: int64

In [None]:
from sklearn.metrics import accuracy_score, classification_report

This is our result. We wanted to see how well our model can predict whether something is fake news just by reading the titles. You can probably get a much higher score if you include the author together with the title as input features; the model will then likely learn which authors are "fake" and which are real. You can also try different types of models, different preprocessing, etc.

In [None]:

print(accuracy_score(y_test, y_preds))

0.6375956284153006
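
Accuracy alone does not show which class the model gets wrong; `classification_report` was imported above for that purpose but not used. As an illustration of what lies behind those numbers, here is a small sketch that computes accuracy together with the confusion counts. The labels below are made up for the example, and `accuracy_and_confusion` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np


def accuracy_and_confusion(y_true, y_pred):
    """Return accuracy and the counts of true/false positives and negatives (1 = fake)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    counts = {
        "tp": int(np.sum((y_true == 1) & (y_pred == 1))),
        "tn": int(np.sum((y_true == 0) & (y_pred == 0))),
        "fp": int(np.sum((y_true == 0) & (y_pred == 1))),
        "fn": int(np.sum((y_true == 1) & (y_pred == 0))),
    }
    accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
    return accuracy, counts


# Made-up labels, for illustration only.
demo_true = [1, 0, 1, 1, 0, 0]
demo_pred = [1, 0, 0, 1, 1, 0]
acc, counts = accuracy_and_confusion(demo_true, demo_pred)
print(acc, counts)
```

On the real data you would pass `y_test` and `y_preds`; calling `print(classification_report(y_test, y_preds))` should give the per-class precision and recall directly.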


In the next notebook, we will see how successful a Word2Vec model that has already been pre-trained on a huge text corpus will be.