<a href="https://colab.research.google.com/github/LeonardoGoncRibeiro/06_MachineLearning/blob/main/02_Advanced/08_Word2Vec_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word embedding using Word2Vec

In this course, we will work with word embedding techniques. To that end, we will build a model that classifies news articles. We will start by taking a dataset and representing it using one-hot encoding. Then, we will switch to a representation based on Word2Vec, which learns word vectors from the contexts in which words appear, so that words used in similar contexts end up with similar vectors. We will discuss different techniques using Word2Vec.

In this course, we will use the following packages:

In [50]:
import pandas as pd
import numpy as np
import nltk
import string

from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import KeyedVectors

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

Also, we will use the following dataset:

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df_train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning/Avançado/treino.csv')
df_test  = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning/Avançado/teste.csv')

Since our dataset is very large, we store it in Google Drive and import it directly from there. Let's look at our data:

In [4]:
df_train.head( )

Unnamed: 0,title,text,date,category,subcategory,link
0,"Após polêmica, Marine Le Pen diz que abomina n...",A candidata da direita nacionalista à Presidên...,2017-04-28,mundo,,http://www1.folha.uol.com.br/mundo/2017/04/187...
1,"Macron e Le Pen vão ao 2º turno na França, em ...",O centrista independente Emmanuel Macron e a d...,2017-04-23,mundo,,http://www1.folha.uol.com.br/mundo/2017/04/187...
2,"Apesar de larga vitória nas legislativas, Macr...",As eleições legislativas deste domingo (19) na...,2017-06-19,mundo,,http://www1.folha.uol.com.br/mundo/2017/06/189...
3,"Governo antecipa balanço, e Alckmin anuncia qu...",O número de ocorrências de homicídios dolosos ...,2015-07-24,cotidiano,,http://www1.folha.uol.com.br/cotidiano/2015/07...
4,"Após queda em maio, a atividade econômica sobe...","A economia cresceu 0,25% no segundo trimestre,...",2017-08-17,mercado,,http://www1.folha.uol.com.br/mercado/2017/08/1...


Our dataset stores news articles. For each one, we have a title, a text, a date, a category, and a subcategory. Also, we have the link to the article. Let's see how much data we have:

In [5]:
df_train.shape

(90000, 6)

In [6]:
df_test.shape

(20513, 6)

In our training set, we have 90,000 news articles, and, in our test set, we have 20,513.

In this course, we will create a model that receives a news article and predicts its category. However, we can't fit a model to the raw text itself. First, we have to create a numeric representation of the text that the model can work with. The simplest representation is one-hot encoding.

# One-hot encoding

With one-hot encoding, each word in our vocabulary becomes a feature, and a binary value states whether the text contains that word. We can do that using ```CountVectorizer( )```. By default, it stores how many times each word occurs, which, for texts with no repeated words, is the same as the binary encoding. For instance:

In [7]:
texts = ["Have a nice day", "Have a bad day", 'Have a great day']

vec   = CountVectorizer( )
vec.fit(texts)

CountVectorizer()

Now, let's look at the vector for each phrase:

In [8]:
vector_nice = vec.transform(["Have a nice day"])
print(vector_nice.toarray( ))

[[0 1 0 1 1]]


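Note that we transformed our text into an array representation. Here, our vector has 5 positions, because we have 5 words in our vocabulary (*bad*, *day*, *great*, *have*, and *nice*, in alphabetical order). The word *a* is not included because, by default, ```CountVectorizer( )``` ignores single-character tokens.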
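
To make the mechanics concrete, here is a minimal sketch of the same encoding written by hand, without scikit-learn. It assumes a tokenization rule that keeps only tokens with two or more word characters, which mirrors the default behavior described above. The helper names (`tokenize`, `one_hot`) are ours, chosen only for illustration.

```python
import re

def tokenize(text):
    # Lowercase and keep tokens with at least two word characters,
    # so the single-letter word "a" is dropped
    return re.findall(r"\b\w\w+\b", text.lower())

texts = ["Have a nice day", "Have a bad day", "Have a great day"]

# Vocabulary: every distinct token, in alphabetical order
vocabulary = sorted({token for text in texts for token in tokenize(text)})

def one_hot(text):
    # 1 if the vocabulary word occurs in the text, 0 otherwise
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in vocabulary]

print(vocabulary)
print(one_hot("Have a nice day"))
```

Each position in the resulting list corresponds to one vocabulary word, so the vector length grows with the vocabulary.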

The one-hot encoding representation is simple. However, with a big vocabulary (which is expected in our case, since we are working with more than 100,000 news articles), we end up with too many features, most of them zero for any given text. In these cases, other representations may become much more interesting, such as...

# Word2Vec

Using Word2Vec, we also make a vectorized representation for our text data. However, Word2Vec is a word embedding technique: it represents each word by a dense vector with a fixed, relatively small number of dimensions, instead of one dimension per vocabulary word. As a result, far fewer features are needed to represent our data.

With Word2Vec, words that are used in similar contexts end up "closer" to each other in the vector space. To make this transformation, we need an already fitted model, which has already learned the relations between such words.

The training of an embedding is similar to the training of a neural network: we adjust the weights (the numbers in each word's vector) until they represent the words as well as possible. We can train our Word2Vec model using continuous bag of words (CBOW), where we pass a context and train the model to predict the word that fits there, such as:

I live in \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

We can also do the opposite, where we pass a word and train the model to predict its context. This technique is known as skip-gram. We can try both techniques to see which one yields better results for our data.
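
The difference between the two techniques is easiest to see in the training examples each one builds from a sentence. The sketch below is only an illustration: the function name `training_pairs`, the window size, and the completed sentence ("i live in brazil") are assumptions of ours, and a real Word2Vec implementation does considerably more than this (sub-sampling, negative sampling, and so on).

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, skipgram_pairs) for a tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words within `window` positions of the target, excluding the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the whole context is used to predict the target word
        cbow.append((context, target))
        # Skip-gram: the target word is used to predict each context word
        skipgram.extend((target, word) for word in context)
    return cbow, skipgram

cbow_pairs, skipgram_pairs = training_pairs(["i", "live", "in", "brazil"], window=2)
print(cbow_pairs[3])        # (context, target) for the last word
print(skipgram_pairs[:2])   # (target, context word) pairs for the first word
```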

In this course, we will not train a word embedding ourselves; instead, we will use an already fitted model. This model is taken from:

http://www.nilc.icmc.usp.br/embeddings

So, let's unzip our model:

In [9]:
!unzip "/content/drive/MyDrive/Colab Notebooks/Machine Learning/Avançado/cbow_s300.zip"

Archive:  /content/drive/MyDrive/Colab Notebooks/Machine Learning/Avançado/cbow_s300.zip
  inflating: cbow_s300.txt           


The file that we have is a text file containing all words and their corresponding vectors. To load the Word2Vec model, we can do:

In [10]:
model = KeyedVectors.load_word2vec_format("cbow_s300.txt")

Nice! Finally, we can obtain the vector for a given word. For instance, let's get the vector for *china*:

In [11]:
model.get_vector("china")

array([-1.49033e-01,  1.26020e-01,  2.17628e-01,  1.82684e-01,
        1.65151e-01, -1.59660e-01, -2.34411e-01,  6.00570e-02,
        8.03680e-02,  2.87578e-01, -4.81100e-03, -5.68800e-02,
        2.15676e-01,  8.65540e-02,  1.25983e-01,  3.36157e-01,
       -1.83254e-01, -1.18499e-01,  1.13010e-02,  1.03814e-01,
        9.37640e-02,  2.90178e-01, -1.64395e-01, -1.13300e-02,
       -1.80676e-01, -1.15820e-02,  1.08728e-01,  1.65898e-01,
        9.37900e-02,  2.66767e-01, -1.29890e-02,  9.16030e-02,
        2.21292e-01, -1.36497e-01, -4.26350e-02, -1.30038e-01,
        2.17067e-01, -1.01963e-01, -3.70960e-02,  1.42155e-01,
        3.41109e-01,  2.46560e-01,  1.27458e-01,  5.72360e-02,
       -1.47962e-01, -1.60290e-02,  1.86533e-01,  7.71550e-02,
       -3.50024e-01, -4.06085e-01,  1.67131e-01, -4.75230e-02,
        5.13780e-02, -1.28224e-01,  1.06580e-02, -2.92652e-01,
        1.40540e-01, -4.57049e-01,  1.31094e-01,  2.03234e-01,
        2.94019e-01,  7.38370e-02,  1.11554e-01, -1.642

Since we downloaded the model with 300 dimensions, this vector has 300 positions:

In [12]:
len(model.get_vector("china"))

300

The Word2Vec representation can also help us search for the most similar words:

In [13]:
model.most_similar("china")

[('rússia', 0.7320704460144043),
 ('índia', 0.7241617441177368),
 ('tailândia', 0.701935887336731),
 ('indonésia', 0.6860769987106323),
 ('turquia', 0.6741335988044739),
 ('malásia', 0.6665689945220947),
 ('mongólia', 0.6593616008758545),
 ('manchúria', 0.6581847667694092),
 ('urss', 0.6581669449806213),
 ('grã-bretanha', 0.6568098068237305)]

Nice! Note that most of the similar words are countries, and many of them are Asian countries, which are actually close to China!
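
The scores returned by `most_similar` are cosine similarities: the closer the score is to 1, the more the two vectors point in the same direction. Here is a minimal sketch of that computation, using made-up 3-dimensional vectors (the real model has 300 dimensions, and the values below are invented for illustration only):

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, NOT taken from the real model
toy = {
    "china":  np.array([0.9, 0.1, 0.0]),
    "índia":  np.array([0.8, 0.3, 0.1]),
    "banana": np.array([0.0, 0.2, 0.9]),
}

print(cosine_similarity(toy["china"], toy["índia"]))    # vectors point in similar directions
print(cosine_similarity(toy["china"], toy["banana"]))   # vectors are almost orthogonal
```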

# Understanding Word2Vec vectors

Note that, now, each word is represented by a 300-dimensional array. To find which words are similar to others, we essentially measure how close two vectors are (gensim uses the cosine similarity between them). We can also take the most similar words considering multiple vectors:

In [14]:
model.most_similar(positive = ['brasil', 'argentina'])

[('chile', 0.6781662702560425),
 ('peru', 0.6348033547401428),
 ('venezuela', 0.6273865103721619),
 ('equador', 0.6037014722824097),
 ('bolívia', 0.6017140746116638),
 ('haiti', 0.5993807315826416),
 ('méxico', 0.5962306261062622),
 ('paraguai', 0.5957703590393066),
 ('uruguai', 0.5903672575950623),
 ('japão', 0.5893509984016418)]

Another interesting thing we can do with Word2Vec is to try to get the plural of a given word. We can do that using:

In [15]:
model.most_similar(positive = ["nuvens", "estrela"], negative = ["nuvem"])

[('estrelas', 0.5497429966926575),
 ('plêiades', 0.379197895526886),
 ('colinas', 0.3746805191040039),
 ('trovoadas', 0.373703271150589),
 ('sombras', 0.3734194040298462),
 ('pombas', 0.3726757764816284),
 ('corredoras', 0.3640727698802948),
 ('cigarras', 0.36065393686294556),
 ('galáxias', 0.35754913091659546),
 ('luas', 0.3575345277786255)]

Nice! Note that here we took a singular and plural pair (nuvem and nuvens) and used it to get the plural of another word (estrela, which gives estrelas). Let's use a similar idea to get other derivations:

In [16]:
model.most_similar(positive = ["professor", "mulher"], negative = ["homem"])

[('professora', 0.6192208528518677),
 ('aluna', 0.5449554324150085),
 ('esposa', 0.4978231191635132),
 ('ex-aluna', 0.4884248375892639),
 ('namorada', 0.4737858772277832),
 ('enfermeira', 0.4728144109249115),
 ('filha', 0.467373788356781),
 ('irmã', 0.45845913887023926),
 ('ex-namorada', 0.45824766159057617),
 ('ex-professora', 0.4510470926761627)]
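
To see why adding and subtracting words can produce results like these, consider the sketch below. It adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result. This is our simplified reading of what `most_similar` does, not gensim's actual code, and the 3-dimensional vectors (loosely "male", "female", "teaching") are invented for illustration.

```python
import numpy as np

def unit(v):
    # Scale a vector to length 1
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative, topn=1):
    # Add the unit vectors of the positive words, subtract the negative ones
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    # Rank every other word by cosine similarity to the query vector
    exclude = set(positive) | set(negative)
    scores = [(w, float(np.dot(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, NOT taken from the real model
toy = {
    "homem":      np.array([1.0, 0.0, 0.0]),
    "mulher":     np.array([0.0, 1.0, 0.0]),
    "professor":  np.array([1.0, 0.0, 1.0]),
    "professora": np.array([0.0, 1.0, 1.0]),
    "mesa":       np.array([1.0, 1.0, 1.0]),
}

print(analogy(toy, positive=["professor", "mulher"], negative=["homem"]))
```

Subtracting *homem* removes the "male" component of *professor*, and adding *mulher* puts a "female" component in its place, so the closest remaining vector is the one for *professora*.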

# Vectorization of our data

Ok, so, we have imported our Word2Vec model. Now, let's perform the vectorization of our data. Here, we will use the titles. For instance, let's get one of our titles:

In [17]:
df_train.title.iloc[0]

'Após polêmica, Marine Le Pen diz que abomina negacionistas do Holocausto'

In our title, we have many words. So, first, we have to tokenize our text, so that, later, we can use word embeddings to get the vectors for each word.

Let's create a function to tokenize our text:

In [18]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [19]:
def GetTokens(text):
  alphanumeric_list = []

  text_lower = text.lower( )

  tokens = nltk.word_tokenize(text_lower)

  for token in tokens:
    if token not in string.punctuation:
      alphanumeric_list.append(token)

  return alphanumeric_list

Nice! Now, let's test our function:

In [20]:
tokens_example = GetTokens(df_train.title.iloc[0])
tokens_example

['após',
 'polêmica',
 'marine',
 'le',
 'pen',
 'diz',
 'que',
 'abomina',
 'negacionistas',
 'do',
 'holocausto']

It worked!
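
One detail of this function is worth knowing: `token not in string.punctuation` is a substring test, since `string.punctuation` is a single string. It works for one-character tokens such as a comma, but a multi-character token such as an ellipsis would not be filtered out. A possible stricter check is sketched below; the helper name `is_punctuation` is ours.

```python
import string

def is_punctuation(token):
    # True only when every character of the token is a punctuation mark
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

print("," in string.punctuation)     # single mark: substring test finds it
print("..." in string.punctuation)   # ellipsis: not a substring, so it would be kept
print(is_punctuation("..."))         # character-wise check recognizes it
print(is_punctuation("le"))          # ordinary words are not punctuation
```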

## Combining our words to get the resulting vector

Our Word2Vec model gives us one vector per word. Now, we need a single resulting vector for the whole text. There are multiple methods to do this (https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/71778). Here, we will use the simplest of them, which combines the word vectors by summing them up:

In [21]:
def SumCombination(tokens):
  resulting_vector = np.zeros(300)               # Initializing our vector
  
  for token in tokens:
    resulting_vector += model.get_vector(token)

  return resulting_vector

Nice! Now, let's test it:

In [22]:
vec_example = SumCombination(tokens_example)
vec_example

array([ 0.52357099,  1.00529201,  0.26381801,  0.55167001,  1.29887401,
       -0.300037  ,  0.89378801,  0.11975599, -1.45775002, -0.452488  ,
        0.458223  , -0.21807001,  0.65952699,  0.57539101,  0.57576102,
        0.05649301, -1.28612098,  0.74008099,  0.26500801, -0.77916699,
       -1.012431  ,  0.140885  , -0.555113  ,  0.71819499, -0.20966199,
        0.64139499, -0.96246603,  0.13343704,  0.38299598,  0.83808199,
        0.12288502, -1.115068  , -0.00647301, -0.57349899, -0.78776202,
       -0.24487801, -0.10404397, -1.67126699, -0.08068499, -0.51521898,
        0.04381603,  0.68886297,  1.045672  , -0.23395702, -0.62220401,
        0.98942098,  0.43884596,  1.37832599, -0.183351  , -0.36253401,
        0.35007997, -0.75641701, -1.36119601,  0.143518  ,  0.00893804,
        0.49978199,  0.069786  , -0.513827  , -0.49270102,  0.86613399,
       -0.27306198,  0.228851  ,  0.12569697, -0.23421999, -0.20668898,
       -0.07144199, -1.13684901,  0.74301703, -0.15537495,  0.16

It seems to have worked out. 

Note that our function will raise an error (a ```KeyError```) if we pass a word that is not in the model's vocabulary. Also, in this model, all numeric values were normalized to zeros, that is, each digit was replaced by 0. These details can be found in the original paper for the model:

https://arxiv.org/abs/1708.06025

To fix these, we can catch the exception and replace the token with one that the model knows:

In [23]:
def SumCombination(tokens):
  resulting_vector = np.zeros(300)               # Initializing our vector
  
  for token in tokens:
    try:
      resulting_vector += model.get_vector(token)
    except KeyError:
      if token.isnumeric( ):
        token = "0"*len(token)                   # Treating numeric inputs
      else:
        token = "unknown"                        # Treating low frequency words
      resulting_vector += model.get_vector(token)

  return resulting_vector

Nice! Now, we have treated the possible errors in our code.
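
The replacement rule used inside the `except` block can be isolated and checked on its own. The sketch below uses a small hypothetical vocabulary (a plain Python set) in place of the real model, and the helper name `resolve_token` is ours.

```python
def resolve_token(token, vocabulary):
    # Return the vocabulary key whose vector should stand in for `token`
    if token in vocabulary:
        return token
    if token.isnumeric():
        # Numbers are looked up with every digit replaced by 0, e.g. "2017" -> "0000"
        return "0" * len(token)
    # Any other missing word falls back to the placeholder token
    return "unknown"

# Hypothetical vocabulary, for illustration only
toy_vocabulary = {"china", "0000", "unknown"}

print([resolve_token(t, toy_vocabulary) for t in ["china", "2017", "xyzzy"]])
```

Note that the fallback itself is only safe if the replacement token exists in the model. If, say, a string of zeros with that exact length were missing from the vocabulary, the second lookup would still raise a `KeyError`.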

# Using our new representation for classification

Finally, we can use the new representation of our text for classification purposes. First, let's obtain the vectors corresponding to all titles:

In [34]:
def GetAllResVectors(texts):
  x = len(texts)
  y = 300
  resulting_vectors = np.zeros((x, y))

  for i in range(x):
    tokens = GetTokens(texts.iloc[i])
    vector = SumCombination(tokens)
    resulting_vectors[i] = vector

  return resulting_vectors

In [36]:
X_train = GetAllResVectors(df_train.title)
X_test  = GetAllResVectors(df_test.title)

In [40]:
y_train = df_train.category
y_test  = df_test.category

Now, if we get the size of our training and test datasets:

In [37]:
X_train.shape

(90000, 300)

In [38]:
X_test.shape

(20513, 300)

We now have 300 features in each dataset. Now, we can fit our model using:

In [42]:
logreg = LogisticRegression(max_iter = 200)
logreg.fit(X_train, y_train)

LogisticRegression(max_iter=200)

To check the test accuracy, we can do:

In [44]:
acc = logreg.score(X_test, y_test)*100
print("Accuracy: {:.2f}%".format(acc))

Accuracy: 79.58%


Nice! We found an accuracy of almost 80% with our first model. However, we have six different categories. Let's see how our model performed for each label. To that end, we can use a classification report:

In [47]:
y_pred = logreg.predict(X_test)
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

     colunas       0.86      0.71      0.78      6103
   cotidiano       0.61      0.79      0.69      1698
     esporte       0.92      0.88      0.90      4663
   ilustrada       0.13      0.88      0.23       131
     mercado       0.84      0.79      0.81      5867
       mundo       0.74      0.86      0.79      2051

    accuracy                           0.80     20513
   macro avg       0.68      0.82      0.70     20513
weighted avg       0.83      0.80      0.81     20513



Nice! Now, we have the accuracy (80%) and other metrics, such as precision and recall. Precision says that, of everything we classified as $x$, $p$% really is $x$. Recall, on the other hand, says that, of all samples that really belong to class $x$, we classified $p$% correctly. Note that, in most cases, we got a high recall. However, for some classes (such as 'ilustrada') we got a very low precision: only 13% of the titles predicted as 'ilustrada' really belong to that class. That is likely because we have very few samples from this class in our test set (131), so even a small share of misclassified titles from the larger classes outnumbers the true ones. Note that the weighted averages (which take into consideration the number of samples in each class) are good, while the macro average precision (0.68), which gives every class the same weight, is pulled down by 'ilustrada'.
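
To make the two definitions concrete, the sketch below computes precision and recall by hand for one class. The labels are invented to reproduce the situation described above (a rare class that is predicted too often); they are not taken from our dataset, and the helper name `precision_recall` is ours.

```python
def precision_recall(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    true_pos  = sum(1 for t, p in pairs if t == label and p == label)
    predicted = sum(1 for _, p in pairs if p == label)   # everything we called `label`
    actual    = sum(1 for t, _ in pairs if t == label)   # everything that really is `label`
    precision = true_pos / predicted if predicted else 0.0
    recall    = true_pos / actual if actual else 0.0
    return precision, recall

# Invented labels: "rare" has 2 true samples, but the model predicts it 5 times
y_true = ["rare", "rare", "common", "common", "common", "common", "common", "common"]
y_pred = ["rare", "rare", "rare",   "rare",   "rare",   "common", "common", "common"]

print(precision_recall(y_true, y_pred, "rare"))
```

Both true "rare" samples are found (high recall), but most of the "rare" predictions are wrong (low precision), which is the same pattern we see for 'ilustrada'.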

## Using a dummy classifier

To claim that our Logistic Regression model has a good accuracy, we should compare it against a baseline model fitted to the same data. Our Logistic Regression should perform much better than this very simple model. Here, we will use a dummy classifier.

In [52]:
dummy = DummyClassifier( )
dummy.fit(X_train, y_train)
acc = dummy.score(X_test, y_test)*100
print("Accuracy: {:.2f}%".format(acc))

Accuracy: 29.75%


Our dummy classifier showed a very poor accuracy, as expected. Let's get the classification report:

In [53]:
y_pred = dummy.predict(X_test)
cr = classification_report(y_test, y_pred)
print(cr)


              precision    recall  f1-score   support

     colunas       0.30      1.00      0.46      6103
   cotidiano       0.00      0.00      0.00      1698
     esporte       0.00      0.00      0.00      4663
   ilustrada       0.00      0.00      0.00       131
     mercado       0.00      0.00      0.00      5867
       mundo       0.00      0.00      0.00      2051

    accuracy                           0.30     20513
   macro avg       0.05      0.17      0.08     20513
weighted avg       0.09      0.30      0.14     20513




Note that, indeed, our Logistic Regression model is much better than the Dummy Classifier, which, as the report shows, predicts the most frequent class ('colunas') for every sample and therefore has zero recall for all other classes. This shows that our model really does capture patterns in the data, helping us classify our news articles.
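
The baseline accuracy can be checked by hand from the support column of the report. The sketch below assumes that the dummy classifier predicts 'colunas' for every test sample, which is what the recall values in the report indicate.

```python
# Test-set support per category, copied from the classification report
support = {"colunas": 6103, "cotidiano": 1698, "esporte": 4663,
           "ilustrada": 131, "mercado": 5867, "mundo": 2051}

# Always predicting the largest class is right exactly for that class's samples
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(majority_class)
print("Accuracy: {:.2f}%".format(baseline_accuracy * 100))
```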

# Extra: What could we have done differently?

There are many things that we could have done to try to improve our model's predictions. First, we could use a CBOW model with even more dimensions. On NILC's website, we can find CBOW models with up to 1000 dimensions. Also, we could use skip-gram models and check whether the resulting accuracy is higher. Another thing we could try is to change how we combine the word vectors of a text (here, we simply sum them). Finally, we could use different, more complex models to perform our final classification.
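
As one example of a different combination, the sketch below averages the word vectors instead of summing them, so that the size of the resulting vector does not grow with the number of words in the title. It is only a sketch under stated assumptions: a plain dictionary with invented 2-dimensional vectors stands in for the real model, unknown words are simply skipped, and the function name `mean_combination` is ours. Whether this improves accuracy on our data would have to be tested.

```python
import numpy as np

def mean_combination(tokens, vectors, dim):
    # Collect the vectors of the tokens that exist in the embedding table
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        # No known word at all: fall back to a vector of zeros
        return np.zeros(dim)
    # Element-wise average of the collected vectors
    return np.mean(found, axis=0)

# Toy embedding table, NOT taken from the real model
toy_vectors = {"bom": np.array([1.0, 0.0]), "dia": np.array([0.0, 1.0])}

print(mean_combination(["bom", "dia", "xyzzy"], toy_vectors, dim=2))
```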