## Classification with word2vec 

-- Prof. Dorien Herremans

We will be tackling a classification problem by first creating word embeddings, and comparing this to alternative approaches. 

During this tutorial you will need the following libraries. Let's install them first if you don't have them already:

In [0]:
!pip install bs4 
!pip install scikit-learn
!pip install nltk
!pip install gensim
!pip install lxml

Now we can import some libraries that we will use:

In [0]:
import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
import lxml
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

%matplotlib inline

## TFIDF with logistic regression

### Preparing the dataset

The classification problem at hand is to predict the tag that belongs to a Stack Overflow post. The data, which comes from Google BigQuery, is publicly available at this Cloud Storage URL:

https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv

We can read it directly into a pandas dataframe.
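A minimal sketch of this loading step could look as follows. The column names `post` and `tags` are assumptions about the CSV layout (only the `post` column is named in this tutorial), and the helper `load_posts` is hypothetical. To keep the example runnable offline it reads a tiny made-up CSV from memory; in the lab you would pass the Cloud Storage URL instead.

```python
import io
import pandas as pd

def load_posts(source):
    """Read the posts CSV and keep only labeled rows.

    `source` can be a URL, a file path, or a file-like object.
    Assumes columns named 'post' and 'tags' (an assumption).
    """
    df = pd.read_csv(source)
    # only keep data that has a tag (is labeled)
    return df[pd.notnull(df["tags"])]

# Tiny stand-in for the real file (made-up rows, for illustration only).
sample_csv = io.StringIO(
    "post,tags\n"
    "how do i reverse a list in python,python\n"
    "what does this css selector match,\n"
    "why is my java loop not terminating,java\n"
)
df = load_posts(sample_csv)

# display the first ten rows
print(df.head(10))
```

The second row has no tag, so it is dropped by the filter.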


Let's start by having a look at our data: 

In [0]:
# only keep data that has a tag (is labeled): 


# display first ten rows:


The size of our model will depend on how many unique words are in the dataset (that is, in the text of the posts). Let's start by counting the words:

In [0]:
# Count the number of words: 

We have over 10 million words in the data. That's a lot! 
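One way to obtain such a count is sketched below on two made-up posts. It computes both the total number of whitespace-separated tokens and the number of distinct tokens; the variable names are illustrative only.

```python
import pandas as pd

# Two made-up posts standing in for the 'post' column.
posts = pd.Series([
    "how do i reverse a list in python",
    "why is my python loop not terminating",
])

# total number of tokens: split each post on whitespace and sum the lengths
total_words = int(posts.apply(lambda text: len(text.split())).sum())

# number of distinct tokens across all posts
unique_words = len(set(" ".join(posts).split()))

print(total_words, unique_words)
```

The two numbers differ as soon as a word occurs more than once, so it is worth being explicit about which one is being reported.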


Let's visualise our dataset: 



In [0]:
# visualising dataset


As you can see, the classes are very well balanced.
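A quick numerical check of the class balance, as a sketch on made-up tags (the series below stands in for the label column of the dataframe):

```python
import pandas as pd

# Made-up labels standing in for the tag column.
tags = pd.Series(["python", "java", "python", "c#", "java", "c#"])

# number of posts per tag
tag_counts = tags.value_counts()
print(tag_counts)

# In a notebook, tag_counts.plot(kind="bar") would draw these counts as a bar chart.
```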

Now let's have a look at the data of the posts (the 'post' column) in more detail: 

As you can see, the text needs to be cleaned up a bit. Below we define a function that uses BeautifulSoup and the nltk toolkit to strip HTML tags, lowercase the text, replace symbols such as / and | with spaces, delete other unwanted symbols, and remove stop words.

In [4]:
# note: slower students may wish to skip this step to finish the lab in class
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup

# load a list of stop words
nltk.download('stopwords')


# raw strings avoid invalid-escape warnings for the backslashes in the patterns
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string 
        return: modified initial string
    """
    text = BeautifulSoup(text, 'html.parser').text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Now we can apply the newly defined function to the 'post' column of df.

Let's check the results: 

This looks a lot better!

Now how many unique words do we have in this cleaned up dataset? 

Now we have over 3 million words to work with.

Before we start creating some classifiers, let's split our dataset into a test set (for evaluation) and a training set:
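A sketch of such a split with scikit-learn's `train_test_split` is shown below. The 30% test fraction, the random seed, and the `tags` column name are assumptions chosen for illustration, and the dataframe is a made-up stand-in.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the cleaned dataframe.
df = pd.DataFrame({
    "post": [f"toy post number {i}" for i in range(10)],
    "tags": ["python", "java"] * 5,
})

X = df["post"]
y = df["tags"]

# hold out 30% of the posts for evaluation; fixing the seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```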

### Logistic regression

Now that we have our features, we can train a classifier to try to predict the tag of a post. We will start with logistic regression and TFIDF representation which provides a nice baseline for this task. 

To make the vectorizer => transformer => classifier chain easier to work with, we will use the Pipeline class in scikit-learn, which behaves like a compound classifier.
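The following is a minimal sketch of such a pipeline on a handful of made-up posts. The step names, the `max_iter` setting, and the toy data are assumptions for illustration; the real lab would fit on the training split of the Stack Overflow data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Made-up training and test posts with their tags.
train_posts = [
    "python list comprehension syntax",
    "python dictionary iteration order",
    "java interface versus abstract class",
    "java null pointer exception",
]
train_tags = ["python", "python", "java", "java"]
test_posts = ["python list slicing", "java exception handling"]
test_tags = ["python", "java"]

# vectorizer => transformer => classifier, wrapped as one estimator
logreg = Pipeline([
    ("vect", CountVectorizer()),       # bag-of-words counts
    ("tfidf", TfidfTransformer()),     # reweight counts by TF-IDF
    ("clf", LogisticRegression(max_iter=1000)),
])

logreg.fit(train_posts, train_tags)
y_pred = logreg.predict(test_posts)
print("accuracy:", accuracy_score(test_tags, y_pred))
```

Because the pipeline exposes `fit` and `predict` like any other classifier, the raw text can be passed in directly and the vectorisation happens inside.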

How well does it work? 

In [5]:
%%time
# %%time shows the computation time; it must be the first line of the cell




That's quite a good accuracy. Now let's see if we can combine word2vec with logistic regression by feeding the new embedded representation to our logistic regression instead of the bag of words. 

## Word2vec embedding and Logistic Regression

Let's load a pretrained word2vec model, and use the embedding representation as input to a simple classifier (i.e. logistic regression). 

You can use the word2vec model you trained in Lab 10a, or load this (quite big, 1.5GB) pretrained word2vec model: https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

Note: it can take a while to load (about 2 minutes on my machine).

Once the file is on your system: 

If you are interested in how good these pretrained embeddings are, you could try some of the similarity tests we did in Lab 10a.

As we have multiple words for each post, we will need to somehow combine them. A common way to achieve this is by averaging the word vectors per document; summation or weighted addition are alternatives. The function below takes as input the w2v model wv and a list of words. It retrieves the vector embedding for each of the words and averages them.

In [0]:
def word_averaging(wv, words):
    # averages a list of words 'words' given their word vectors 'wv'
    # (written for gensim >= 4; older releases used wv.vocab and wv.syn0norm)

    mean = []

    # for each word in the list of words
    for word in words:
        # if the word is already a vector, then just append it
        if isinstance(word, np.ndarray):
            mean.append(word)
        # if not: first get the unit-normalised vector embedding for the word
        # (words that are not in the vocabulary are skipped)
        elif word in wv.key_to_index:
            mean.append(wv.get_vector(word, norm=True))

    if not mean:
        # error handling in case mean cannot be calculated
        logging.warning("cannot compute similarity with no input %s", words)
        return np.zeros(wv.vector_size,)

    # average all the vectors in the mean list, then use gensim's unitvec to scale the result to unit length
    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    # one averaged vector per post, stacked into a 2D array
    return np.vstack([word_averaging(wv, post) for post in text_list])
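To make the aggregation options concrete, here is a small NumPy-only sketch comparing averaging, summation, and weighted addition. The 2-dimensional vectors and the weights are made up for illustration; in practice the weights might come from something like TF-IDF scores.

```python
import numpy as np

# Made-up 2-dimensional "embeddings" for three words.
toy_vectors = {
    "python": np.array([1.0, 0.0]),
    "list": np.array([0.0, 1.0]),
    "the": np.array([1.0, 0.0]),
}
words = ["python", "list", "the"]
stacked = np.array([toy_vectors[w] for w in words])

mean_vec = stacked.mean(axis=0)  # plain averaging
sum_vec = stacked.sum(axis=0)    # summation

# weighted addition: give the uninformative word "the" a weight of zero
weights = np.array([1.0, 1.0, 0.0])
weighted_vec = (weights[:, None] * stacked).sum(axis=0) / weights.sum()

def unit(v):
    """Scale a nonzero vector to length 1."""
    return v / np.linalg.norm(v)

print(unit(mean_vec), unit(sum_vec), unit(weighted_vec))
```

Once the result is scaled to unit length, the mean and the sum point in the same direction, so they only differ if the classifier sees the unnormalised vectors. The weighted version does change the direction.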

Below, we explore a different way to create tokens out of sentences, by using the nltk toolkit. 

Let's split the dataset into a training and a test set like before, and tokenize each of them.
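A possible tokenizer is sketched below. It uses nltk's regex-based `wordpunct_tokenize`, chosen here because it needs no extra data download; this is a substitute, not necessarily the tokenizer intended for the lab. The function name and the rule of dropping single-character tokens are assumptions.

```python
from nltk.tokenize import wordpunct_tokenize

def w2v_tokenize_text(text):
    """Split a post into tokens, dropping single-character tokens
    (mostly punctuation and stray letters)."""
    return [tok for tok in wordpunct_tokenize(text) if len(tok) >= 2]

tokens = w2v_tokenize_text("How do I parse JSON in Python?")
print(tokens)
```

With a dataframe, the same function could be applied to each post of the training and test splits, for example with `.apply(w2v_tokenize_text)`.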

We can then average the word vectors per post in this new dataset, using the functions we defined above and our word2vec model wv.

Now we can feed this new representation into the logistic regression: 
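As a sketch of this step, the example below fits a logistic regression on dense vectors. The arrays are synthetic stand-ins for the averaged word vectors (two well-separated clusters drawn at random), so the resulting accuracy says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for averaged word vectors: one cluster per tag.
X_train_avg = np.vstack([
    rng.normal(-1.0, 0.1, size=(20, 5)),
    rng.normal(1.0, 0.1, size=(20, 5)),
])
y_train = ["java"] * 20 + ["python"] * 20
X_test_avg = np.vstack([
    rng.normal(-1.0, 0.1, size=(5, 5)),
    rng.normal(1.0, 0.1, size=(5, 5)),
])
y_test = ["java"] * 5 + ["python"] * 5

# the classifier now receives dense vectors instead of bag-of-words counts
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_avg, y_train)
y_pred = logreg.predict(X_test_avg)
print("accuracy:", accuracy_score(y_test, y_pred))
```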

How accurate is this averaged word2vec model with logistic regression?

Now you can see that the accuracy went down! Oh no! Why is that? Because we used a very naive approach: simply averaging our word vectors. One way around this is doc2vec, which learns representations of documents (posts in this case) instead of words. The accuracy could also improve by using a different classifier instead of logistic regression, or by changing the aggregation strategy.

## Doc2vec and Logistic Regression (advanced)

The idea of word2vec can be extended to documents: instead of learning feature representations for words, we learn them for sentences or documents. So far, we represented a document as the mathematical average of the word vector representations of all the words in it. However, words can only capture so much, and there are times when we need relationships between documents and not just words. Doc2vec extends the idea of word2vec to cover this case.

The way to train a doc2vec model for our Stack Overflow questions and tags data is very similar to how we trained the multi-class text classifier with word2vec and logistic regression above.

First, we label the sentences. Gensim’s Doc2Vec implementation requires each document/paragraph to have a label associated with it; we use this label to indicate whether the document is part of the test or training set. We do this by using the TaggedDocument class. The format will be “TRAIN_i” or “TEST_i”, where “i” is a dummy index of the post.

First let's import the necessary libraries. 


In [0]:
from tqdm import tqdm
from gensim.models import doc2vec
from sklearn import utils
import gensim
from gensim.models.doc2vec import TaggedDocument
import re

Let's start by defining a function that labels the documents in our corpus. We just give them dummy labels TRAIN_i or TEST_i for post i. Given a corpus and a label type, the function returns a list of documents, each tagged with a label indicating whether it is test or training data.

In [0]:
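def label_sentences(corpus, label_type):
    """
    Gensim's Doc2Vec implementation requires each document/paragraph to have a label associated with it.
    We do this by using the TaggedDocument class. The format will be "TRAIN_i" or "TEST_i" where "i" is
    a dummy index of the post.
    """
    labeled = []
    for i, post in enumerate(corpus):
        label = label_type + '_' + str(i)
        # assumption: each post is a string that can be split into tokens on whitespace
        labeled.append(TaggedDocument(post.split(), [label]))
    return labeled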

Just like above, we split our dataset into test and training data.

Let's have a look at what our data looks like at this point:

Gensim allows us to build a model very easily. We can vary the parameters to fit our data: 

*    dm=0: use the distributed bag of words (DBOW) training algorithm.
*    vector_size=300: use 300-dimensional feature vectors.
*    negative=5: use negative sampling and specify how many “noise words” should be drawn.
*    min_count=1: ignore all words with a total frequency lower than this.
*    alpha=0.065: the initial learning rate.

We initialize the model and train for 30 epochs (on slower computers you may want to train for fewer epochs). Note that gensim trains on the CPU, so switching the runtime to TPU/GPU hardware acceleration will not speed this step up. Try a lower number of epochs first to see how high you can go during class time!



Now let's define a function to get the vectors out of this trained model, so that we can feed them into the logistic regression:

In [0]:
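def get_vectors(model, corpus_size, vectors_size, vectors_type):
    """
    Get vectors from trained doc2vec model
    :param model: Trained Doc2Vec model
    :param corpus_size: Size of the data
    :param vectors_size: Size of the embedding vectors
    :param vectors_type: Training or Testing vectors (the tag prefix, e.g. 'TRAIN' or 'TEST')
    :return: array of vectors, one row per document
    """
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(corpus_size):
        # rebuild the dummy tag that was assigned when labeling the documents
        prefix = vectors_type + '_' + str(i)
        # gensim >= 4 stores document vectors in model.dv (older releases: model.docvecs)
        vectors[i] = model.dv[prefix]
    return vectors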

We can use this function to create a vectorised training and test set with 1 entry per document for the input in classification models such as logistic regression. 

We can now feed these vectors to the classifier again: 

80%: that is the best result so far! Remember, we can use any classifier with this method, so it is up to you to make your project as efficient as possible :)

Try using different classifiers, e.g. a decision tree or an SVM. Does that influence the results?
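As a starting point for this exercise, the sketch below swaps in a decision tree and a linear SVM from scikit-learn. The feature arrays are synthetic stand-ins for the document vectors (two random clusters), so the scores it prints are not results for the Stack Overflow data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for document vectors: one cluster per tag.
X_train = np.vstack([
    rng.normal(-1.0, 0.1, size=(20, 5)),
    rng.normal(1.0, 0.1, size=(20, 5)),
])
y_train = ["java"] * 20 + ["python"] * 20
X_test = np.vstack([
    rng.normal(-1.0, 0.1, size=(5, 5)),
    rng.normal(1.0, 0.1, size=(5, 5)),
])
y_test = ["java"] * 5 + ["python"] * 5

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "linear SVM": LinearSVC(),
}

# fit each classifier on the same vectors and compare test accuracy
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
print(scores)
```

Because every scikit-learn classifier shares the `fit`/`predict` interface, trying another model only requires adding an entry to the dictionary.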


    
New methods are coming out every day in the field of data science. Just at the end of August 2019, the first implementation of BERT for document classification was published: DocBERT, https://arxiv.org/abs/1904.08398

## References

* https://radimrehurek.com/gensim/models/word2vec.html
* https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568
* https://github.com/kavgan/nlp-text-mining-working-examples/tree/master/word2vec
* https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5