## Classification with word2vec 

-- Prof. Dorien Herremans

In this second part of the lab, we will tackle a classification problem by first loading word embeddings and feeding them into a simple classifier. We compare this to naive alternative approaches.

During this tutorial you will need the libraries below. Let's install them first if you don't have them:

In [None]:
#STUDENT NUMBER: 

#1004471

In [None]:
# Use this to install libraries if you find them missing on your system: 
# !pip install bs4 
# !pip install scikit-learn
# !pip install nltk
# !pip install gensim
# !pip install lxml

Now we can import some libraries that we will use:

In [None]:
import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
import lxml
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

%matplotlib inline

### TFIDF with logistic regression

#### Preparing the dataset

The classification problem at hand is to predict the tag that belongs to a Stack Overflow post. If you are not familiar with Stack Overflow, do check it out; it is a tremendous help when facing coding issues. The data, from Google BigQuery, is available at the GitHub link below. If the link does not work, you may have to download the file manually from GitHub and then upload it to Colab:

https://github.com/dorienh/class_materials/blob/main/datasets/stack-overflow-data.csv

We can read it directly into a pandas dataframe.


In [None]:
url = "https://github.com/dorienh/class_materials/blob/main/datasets/stack-overflow-data.csv?raw=true"

df = pd.read_csv(url, encoding = 'latin-1')

Let's start by having a look at our data: 

In [None]:
# only keep data that has a tag (is labeled): 
df = df[pd.notnull(df['tags'])]

# display first ten rows:
df.head(10)

Our task: predict the tag based on the post content. 

The size of the dataset matters when choosing an embedding, so let's count how many words (tokens, counted with repetition) the posts contain:

In [None]:
# Count the number of words: 
df['post'].apply(lambda x: len(x.split(' '))).sum()

We have over 10 million words in the data. That's a lot! 
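Note that splitting each post on spaces and summing the lengths counts every occurrence of every word (tokens), which is different from the number of distinct words (the vocabulary size). A minimal sketch of the difference, using two made-up posts (`toy_posts` is a hypothetical stand-in for `df['post']`):

```python
from collections import Counter

# Two made-up posts standing in for df['post'] (assumption, not real data)
toy_posts = ["how to sort a list in python", "how to reverse a list"]

# Total tokens: every occurrence counts, as in the cell above
total_tokens = sum(len(post.split(' ')) for post in toy_posts)

# Vocabulary: each distinct word counts once
word_counts = Counter(word for post in toy_posts for word in post.split(' '))
vocabulary_size = len(word_counts)

print(total_tokens, vocabulary_size)
```

The same `Counter` idea could be applied to the real posts if you want the vocabulary size rather than the token count.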


Let's visualise our dataset: 



In [None]:
# visualising dataset
plt.figure(figsize=(10,4))
df.tags.value_counts().plot(kind='bar');

As you can see, the classes are very well balanced.

Now let's have a look at the data of the posts ('post' columns) in more detail: 

In [None]:
print(df['post'].values[10])

As you can see, the text needs to be cleaned up a bit. Below we define a function that strips HTML tags with BeautifulSoup, lowercases the text, replaces symbols such as `/`, `(` and `@` with spaces, deletes any other unwanted characters, and removes English stop words using the list from the `nltk` toolkit.

In [None]:
# note: slower students may wish to skip this step to finish the lab in class
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup

# load a list of stop words
nltk.download('stopwords')


REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string 
        return: modified initial string
    """
    text = BeautifulSoup(text, 'html.parser').text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text
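To see what each regex step does without downloading any NLTK data, here is a minimal sketch of the same steps on one made-up sentence. It skips the HTML-stripping step, and `TOY_STOPWORDS` is a tiny hand-picked list, both assumptions made only to keep the example self-contained:

```python
import re

REPLACE_BY_SPACE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS = re.compile(r'[^0-9a-z #+_]')
TOY_STOPWORDS = {"how", "do", "i", "in", "the", "a"}  # hypothetical mini list

def clean_text_demo(text):
    text = text.lower()                      # lowercase
    text = REPLACE_BY_SPACE.sub(' ', text)   # separators become spaces
    text = BAD_SYMBOLS.sub('', text)         # everything else unwanted is dropped
    return ' '.join(w for w in text.split() if w not in TOY_STOPWORDS)

sample = "How do I parse JSON in C#? See foo/bar (docs)."
cleaned = clean_text_demo(sample)
print(cleaned)  # parse json c# see foo bar docs
```

Note how `#` survives (so tags like `c#` stay intact) while `?` and `.` are deleted. The full `clean_text` above additionally strips HTML and uses NLTK's complete English stop-word list.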

Now we can apply the newly defined function to the `post` column of `df`.

In [None]:
df['post'] = df['post'].apply(clean_text)

Let's check the results: 

In [None]:
print(df['post'].values[10])

This looks a lot better!

Now how many words do we have in this cleaned-up dataset?

In [None]:
df['post'].apply(lambda x: len(x.split(' '))).sum()

Now we have over 3 million words to work with; the clean-up removed about 7 million tokens (HTML remnants, symbols and stop words).

Before we start creating classifiers, let's split our dataset into a training set (70%) and a test set (30%, used for evaluation):

In [None]:
X = df.post
y = df.tags
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

#### Logistic regression

Now that we have our features, we can train a classifier to try to predict the tag of a post. We will start with logistic regression and TFIDF representation which provides a nice baseline for this task. 

To make the vectorizer => transformer => classifier easier to work with, we will use the `Pipeline` class in Scikit-Learn that behaves like a compound classifier.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

# we define a Pipeline, which first represents our features as TF-IDF
# and then performs logistic regression
logreg = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(n_jobs=1, C=1e5)),
               ])
logreg.fit(X_train, y_train)
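To make the `vect` and `tfidf` steps less of a black box, here is a hand-rolled sketch of the weighting on a made-up three-document corpus. It assumes scikit-learn's default settings for `TfidfTransformer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the corpus and all variable names are illustrative.

```python
import math
from collections import Counter

toy_docs = ["python list sort", "java list", "python pandas dataframe"]
tokenized = [doc.split() for doc in toy_docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in tokenized for word in set(doc))

# Smoothed inverse document frequency: rare words get a higher weight
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(tokens):
    # raw term counts times IDF, then scaled to unit L2 length
    term_counts = Counter(tokens)
    raw = [term_counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf_matrix = [tfidf_row(doc) for doc in tokenized]

for word, weight in zip(vocab, tfidf_matrix[1]):
    print(f"{word:10s} {weight:.3f}")
```

In the second document, `java` (which occurs in one document only) ends up with a larger weight than `list` (which occurs in two). The `logreg` pipeline above applies this kind of weighting to the real posts before the classifier sees them.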

How well does it work? 

In [None]:
%%time
# %%time must be the first line of the cell; it shows the computation time

y_pred = logreg.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred))

That's quite a good accuracy. Now let's see if we can combine **word2vec** with logistic regression by feeding the new embedded representation to our logistic regression instead of the bag of words representation of TFIDF. 
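As a reference for reading the classification reports in this lab, the sketch below computes accuracy, and the precision and recall of a single class, by hand. The labels are made up for illustration and are not output of the model above.

```python
# Made-up true and predicted tags for six posts (illustrative only)
toy_true = ["python", "java", "python", "sql", "java", "python"]
toy_pred = ["python", "python", "python", "sql", "java", "java"]

# Accuracy: fraction of posts whose predicted tag is correct
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

def precision_recall(label):
    # true positives: posts predicted as `label` that really are `label`
    true_pos = sum(t == label and p == label for t, p in zip(toy_true, toy_pred))
    predicted = sum(p == label for p in toy_pred)  # everything predicted as `label`
    actual = sum(t == label for t in toy_true)     # everything that is `label`
    return true_pos / predicted, true_pos / actual

python_precision, python_recall = precision_recall("python")
print(toy_accuracy, python_precision, python_recall)
```

Precision answers "of the posts tagged `python` by the model, how many were right?", while recall answers "of the real `python` posts, how many did the model find?".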

### Word2vec embedding and Logistic Regression

Let's load a pretrained word2vec model, and use the embedding representation as input to a simple classifier (i.e. logistic regression). 

You can use the word2vec model you trained in the first part of the lab (on the Shakespeare text), or load this (quite big, 1.5GB) pretrained word2vec model from Google trained on Google News data. 

If you load a model you trained yourself, use
`wv = gensim.models.KeyedVectors.load_word2vec_format("yourweights.bin.gz", binary=True)`. We will be loading pretrained weights available in gensim itself:

(This may take a while!)




In [None]:
%%time
import gensim.downloader
from gensim.models import Word2Vec

wv = gensim.downloader.load('word2vec-google-news-300')
wv.init_sims(replace=True)
print('Model loaded')


If you are interested in how good these pretrained embeddings are, you could try some of the similarity tests we did in part 1 of the lab on the Shakespeare text. Only now we have a larger vocabulary, e.g.:

In [None]:
wv.most_similar('twitter')

Gensim offers a number of pretrained models for you to choose from (convenient, right?). You can check the list of available models like this:

In [None]:
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

As we have multiple words for each post, we need to combine their vectors somehow. A common way to achieve this is to average the word vectors per document; a plain sum or a weighted sum are quick alternatives. In later classes you can feed the individual words to memory models like LSTM. The function below takes as input the word2vec model `wv` and a list of words. It retrieves the vector embedding for each of the words and averages them.

In [None]:
def word_averaging(wv, words):
    # averages a list of words 'words' given their word vectors 'wv'
    # note: uses the gensim >= 4.0 KeyedVectors API (key_to_index, get_vector);
    # the older wv.vocab and wv.syn0norm attributes were removed in gensim 4

    mean = []

    # for each word in the list of words
    for word in words:
        # if the words are already vectors, then just append them
        if isinstance(word, np.ndarray):
            mean.append(word)
        # if not: first get the unit-normalised vector embedding for the word
        elif word in wv.key_to_index:
            mean.append(wv.get_vector(word, norm=True))

    if not mean:
        # none of the words is in the vocabulary: fall back to a zero vector
        logging.warning("cannot compute average with no known input words %s", words)
        return np.zeros(wv.vector_size,)

    # average the collected vectors and use gensim's unitvec to scale the result to unit length
    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def  word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list ])
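The averaging step can be tried out without loading the 1.5GB model. The sketch below uses a made-up three-dimensional "embedding" stored in a plain dictionary (`toy_vectors` and `average_known_words` are hypothetical names, not part of gensim) and mirrors the logic above: skip unknown words, average the rest, scale to unit length.

```python
import numpy as np

# Made-up 3-dimensional word vectors (real word2vec vectors have 300 dimensions)
toy_vectors = {
    "python": np.array([1.0, 0.0, 0.0]),
    "list": np.array([0.0, 1.0, 0.0]),
    "sort": np.array([0.0, 0.0, 1.0]),
}

def average_known_words(vectors, words):
    dim = len(next(iter(vectors.values())))
    found = [vectors[w] for w in words if w in vectors]  # unknown words are skipped
    if not found:
        return np.zeros(dim)                             # no known word at all
    mean = np.mean(found, axis=0)
    return mean / np.linalg.norm(mean)                   # scale to unit length

doc_vec = average_known_words(toy_vectors, ["python", "list", "quicksort"])
print(doc_vec)
```

Here `quicksort` is not in the toy vocabulary, so the document vector is built from `python` and `list` only.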

Below, we explore a way (slightly different from the method used in part 1 of the lab) to create tokens out of sentences, by using the `nltk` toolkit. 

In [None]:
import nltk.data
nltk.download('punkt')

def w2v_tokenize_text(text):
    # create tokens, a list of words, for each post. This function will do some cleaning based on English language
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens

Let's also split the dataset into a training and a test set like before, and tokenize each of them using the function defined above.

In [None]:
train, test = train_test_split(df, test_size=0.3, random_state = 42)

test_tokenized = test.apply(lambda r: w2v_tokenize_text(r['post']), axis=1).values
train_tokenized = train.apply(lambda r: w2v_tokenize_text(r['post']), axis=1).values

Since we have multiple word vectors per post, there are several ways to combine them: a powerful LSTM approach (as we'll see later), doc2vec (see below), or, as a first attempt, naive averaging. We average the word vectors of each post in this new dataset using the functions we defined above and our word2vec model `wv`.

In [None]:
X_train_word_average = word_averaging_list(wv,train_tokenized)
X_test_word_average = word_averaging_list(wv,test_tokenized)

Now we have a way to represent our input! This can then be fed to any classifier, like logistic regression: 

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(X_train_word_average, train['tags'])
y_pred = logreg.predict(X_test_word_average)

Let's evaluate how accurate this averaged word2vec representation with logistic regression is:

In [None]:
print('accuracy %s' % accuracy_score(y_pred, test.tags))
print(classification_report(test.tags, y_pred))

Now you can see that the accuracy went down! Oh no! Why is that? Because we used a very naive approach: averaging our vectors. A better way to approach this would be doc2vec, which learns relationships between documents (posts in this case) instead of words. The accuracy could also improve by using a different classifier instead of logistic regression, or by feeding the individual word vectors to an LSTM/RNN model instead of averaging them.
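One reason averaging is considered naive is that it throws away word order. A minimal sketch with made-up two-dimensional vectors (purely illustrative, not taken from the word2vec model) shows two sentences with opposite meanings collapsing to the same average:

```python
# Made-up 2-dimensional word vectors (illustrative only)
toy_embedding = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

def mean_vector(words):
    vectors = [toy_embedding[w] for w in words]
    # average each dimension over all words in the sentence
    return [sum(column) / len(vectors) for column in zip(*vectors)]

avg_a = mean_vector("dog bites man".split())
avg_b = mean_vector("man bites dog".split())
print(avg_a, avg_b)
```

Both sentences get the same representation, so no classifier trained on these averages can tell them apart. Sequence models and doc2vec are two ways around this limitation.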

### Doc2vec and Logistic Regression (advanced)

The idea of word2vec can be extended to documents: instead of learning feature representations for words, we learn them for sentences or documents. Word vectors can only capture so much; there are times when we need relationships between documents and not just between words.

Training a doc2vec model on our Stack Overflow posts and tags is very similar to the multi-class text classification with word2vec and logistic regression above.

First, we label the sentences. Gensim's Doc2Vec implementation requires each document/paragraph to have a label (tag) associated with it. We create these labels with the `TaggedDocument` class and use them to record whether a post is part of the training or the test set. The format will be `Train_i` or `Test_i`, where `i` is a dummy index of the post.

First let's import the necessary libraries. 


In [None]:
from tqdm import tqdm
from gensim.models import doc2vec
from sklearn import utils
import gensim
from gensim.models.doc2vec import TaggedDocument
import re

Let's start by defining a function that labels the documents in our corpus. We just give them the dummy labels `Train_i` or `Test_i` for post `i`. Given a corpus and a label type, the function returns a list of tagged documents whose label indicates whether each one is test or training data.

In [None]:
def label_sentences(corpus, label_type):
    """
    Gensim's Doc2Vec implementation requires each document/paragraph to have a label associated with it.
    We do this by using the TaggedDocument class. The format will be "Train_i" or "Test_i" where "i" is
    a dummy index of the post.
    """
    labeled = []
    for i, v in enumerate(corpus):
        label = label_type + '_' + str(i)
        labeled.append(doc2vec.TaggedDocument(v.split(), [label]))
    return labeled
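To see the shape of the labelled data without installing gensim, the sketch below uses a stdlib named tuple as a stand-in. `ToyTaggedDocument` is a hypothetical substitute with the same two fields (`words`, `tags`) as gensim's `TaggedDocument`, and the two posts are made up.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: same two fields, words and tags)
ToyTaggedDocument = namedtuple("ToyTaggedDocument", "words tags")

def label_sentences_demo(corpus, label_type):
    # one tagged document per post, tagged '<label_type>_<position in corpus>'
    return [ToyTaggedDocument(text.split(), [f"{label_type}_{i}"])
            for i, text in enumerate(corpus)]

toy_labeled = label_sentences_demo(["python list sort", "java string"], "Train")
print(toy_labeled[1])
```

Because the index `i` is the position within the corpus passed in, the training and test sets each start counting from 0. The prefix is what keeps `Train_0` and `Test_0` apart.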

Just like above we split our dataset up in test and training data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.post, df.tags, random_state=0, 
                                                    test_size=0.3)
X_train = label_sentences(X_train, 'Train')
X_test = label_sentences(X_test, 'Test')
all_data = X_train + X_test

Let's have a look at what our data looks like at this point:

In [None]:
all_data[:10]

Gensim allows us to build a model very easily. You can vary the parameters to fit your data:

*    `dm=0`: distributed bag of words (DBOW) is used.
*    `vector_size=300`: 300-dimensional feature vectors.
*    `negative=5`: specifies how many “noise words” should be drawn.
*    `min_count=1`: ignores all words with total frequency lower than this.
*    `alpha=0.065`: the initial learning rate.

We initialize the model and train for 30 epochs. Those of you on slower computers may want to train for fewer epochs; try a lower number of epochs first to see how high you can go during class time. Note that gensim trains Doc2Vec on the CPU, so switching the runtime to GPU hardware acceleration will not speed this step up.


In [None]:
model_dbow = doc2vec.Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, 
                     min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_data)])

In [None]:
for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(all_data)]), 
                     total_examples=len(all_data), 
                     epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha
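The loop above lowers the learning rate by hand: each pass runs at a constant rate (because `min_alpha` is set equal to `alpha`), and the rate drops by 0.002 between passes. The arithmetic can be checked with a few lines of plain Python; the variable names below are illustrative and no model is trained.

```python
# Reproduce the learning-rate schedule of the training loop, without any training
demo_alpha, demo_step, demo_epochs = 0.065, 0.002, 30

schedule = []
for epoch in range(demo_epochs):
    schedule.append(demo_alpha)  # rate used during this pass
    demo_alpha -= demo_step      # lowered before the next pass

print(f"first pass: {schedule[0]:.3f}, last pass: {schedule[-1]:.3f}, "
      f"after loop: {demo_alpha:.3f}")
```

If you change the number of epochs, it is worth checking that the rate stays positive: with a step of 0.002 and a start of 0.065, it would reach zero after 32 to 33 passes.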

Now let's define a function to get the vectors of our documents from this trained model, so that we can feed them into the logistic regression:

In [None]:
def get_vectors(model, corpus_size, vectors_size, vectors_type):
    """
    Get vectors from trained doc2vec model
    :param model: Trained Doc2Vec model
    :param corpus_size: Size of the data
    :param vectors_size: Size of the embedding vectors
    :param vectors_type: Prefix of the document tags ('Train' or 'Test')
    :return: array of vectors with shape (corpus_size, vectors_size)
    """
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = vectors_type + '_' + str(i)
        vectors[i] = model.docvecs[prefix]
    return vectors

We can use this function to create a vectorised training and test set with 1 entry per document for the input in classification models such as logistic regression. 

In [None]:
train_vectors_dbow = get_vectors(model_dbow, len(X_train), 300, 'Train')
test_vectors_dbow = get_vectors(model_dbow, len(X_test), 300, 'Test')

We can now feed these vectors to the classifier again: 

In [None]:
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors_dbow, y_train)

y_pred = logreg.predict(test_vectors_dbow)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred))

We get 80%, that is the best result so far! Remember, we can actually use any classifier with this method! So up to you to make your project as efficient as possible :)
    
Try using different classifiers, e.g. a decision tree or an SVM. Does that influence the results?
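One possible way to set up such a comparison is sketched below. So that it runs on its own, it uses synthetic data from `make_classification` as a stand-in for the doc2vec vectors (an assumption), so the printed accuracies say nothing about the Stack Overflow task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the document vectors (assumption: 3 classes, 20 features)
toy_X, toy_y = make_classification(n_samples=300, n_features=20, n_informative=10,
                                   n_classes=3, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "linear SVM": LinearSVC(max_iter=5000),
}

# Fit every candidate on the same split and record its test accuracy
scores = {}
for name, clf in candidates.items():
    clf.fit(toy_X_train, toy_y_train)
    scores[name] = accuracy_score(toy_y_test, clf.predict(toy_X_test))
    print(name, round(scores[name], 3))
```

To try this on the real data, replace the synthetic arrays with `train_vectors_dbow`, `y_train`, `test_vectors_dbow` and `y_test`.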

New methods are coming out every day in the field of data science. Just at the end of August 2019, the first implementation of BERT for document classification was published, DocBERT: https://arxiv.org/abs/1904.08398

These embeddings can be loaded in a similar way. There are also specialised pretrained embeddings for, say, financial data, e.g. FinBERT.

## References

* https://radimrehurek.com/gensim/models/word2vec.html
* https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568
* https://github.com/kavgan/nlp-text-mining-working-examples/tree/master/word2vec
* https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5

## Exercise

Now over to you! 

Can you develop a doc2vec with SVM classifier for the following dataset? 

https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset?select=Fake.csv

You can also find this dataset here: https://dorienherremans.com/drop/CDS/word2vec/fake.zip

The task is to predict if news is fake or real. 

As input, use only the text for simplicity (possibly concatenated with title, but not necessary). 

Good luck! 

### Solution

In [1]:
# Imports
import re
import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
import lxml
from tqdm import tqdm
from gensim.models import doc2vec
from sklearn import utils
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [12]:
# Loading dataset and concatenating titles with their text
df_fake = pd.read_csv('Fake.csv')
# only keep data that has a text entry 
df_fake = df_fake[pd.notnull(df_fake['text'])]
df_fake['TitleAndText'] = df_fake['title'] + df_fake['text']
df_fake['label'] = 'Fake'
# df_fake.head(10)

df_true = pd.read_csv('True.csv')
# only keep data that has a text entry 
df_true = df_true[pd.notnull(df_true['text'])]
df_true['TitleAndText'] = df_true['title'] + df_true['text']
df_true['label'] = 'True'
df_true.info


<bound method DataFrame.info of                                                    title  ... label
0      As U.S. budget fight looms, Republicans flip t...  ...  True
1      U.S. military to accept transgender recruits o...  ...  True
2      Senior U.S. Republican senator: 'Let Mr. Muell...  ...  True
3      FBI Russia probe helped by Australian diplomat...  ...  True
4      Trump wants Postal Service to charge 'much mor...  ...  True
...                                                  ...  ...   ...
21412  'Fully committed' NATO backs new U.S. approach...  ...  True
21413  LexisNexis withdrew two products from Chinese ...  ...  True
21414  Minsk cultural hub becomes haven from authorities  ...  True
21415  Vatican upbeat on possibility of Pope Francis ...  ...  True
21416  Indonesia to buy $1.14 billion worth of Russia...  ...  True

[21417 rows x 6 columns]>

In [15]:
# Keeping only necessary columns
df_fake = df_fake.drop(columns=['title', 'text', 'date', 'subject'], axis = 1)
df_true = df_true.drop(columns = ['title', 'text', 'date', 'subject'], axis = 1)

df = pd.concat([df_fake, df_true])
df.head(5)

| | TitleAndText | label |
|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Fake |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | Fake |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | Fake |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | Fake |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Fake |


In [17]:
# Cleaning the dataset
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup

# load a list of stop words
nltk.download('stopwords')


REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string 
        return: modified initial string
    """
    text = BeautifulSoup(text, 'html.parser').text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [18]:
df['TitleAndText'] = df['TitleAndText'].apply(clean_text)



In [20]:
# Label documents with train/test labels
def label_sentences(corpus, label_type):
    """
    Gensim's Doc2Vec implementation requires each document/paragraph to have a label associated with it.
    We do this by using the TaggedDocument class. The format will be "Train_i" or "Test_i" where "i" is
    a dummy index of the post.
    """
    labeled = []
    for i, v in enumerate(corpus):
        label = label_type + '_' + str(i)
        labeled.append(doc2vec.TaggedDocument(v.split(), [label]))
    return labeled

In [21]:
# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(df.TitleAndText, df.label, random_state=0, 
                                                    test_size=0.3)
X_train = label_sentences(X_train, 'Train')
X_test = label_sentences(X_test, 'Test')
all_data = X_train + X_test

In [22]:
# Initializing doc2vec model
model_dbow = doc2vec.Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, 
                     min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_data)])

100%|██████████| 44898/44898 [00:00<00:00, 2398928.17it/s]


In [23]:
# Training doc2vec model
for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(all_data)]), 
                     total_examples=len(all_data), 
                     epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

100%|██████████| 44898/44898 [00:00<00:00, 2396333.41it/s]

In [24]:
# Getting vectors from trained model
def get_vectors(model, corpus_size, vectors_size, vectors_type):
    """
    Get vectors from trained doc2vec model
    :param doc2vec_model: Trained Doc2Vec model
    :param corpus_size: Size of the data
    :param vectors_size: Size of the embedding vectors
    :param vectors_type: Training or Testing vectors
    :return: list of vectors
    """
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = vectors_type + '_' + str(i)
        vectors[i] = model.docvecs[prefix]
    return vectors

In [25]:
# Vectorized train and test set
train_vectors_dbow = get_vectors(model_dbow, len(X_train), 300, 'Train')
test_vectors_dbow = get_vectors(model_dbow, len(X_test), 300, 'Test')

In [26]:
# Training SVM
svc = SVC()
svc.fit(train_vectors_dbow, y_train)

y_pred = svc.predict(test_vectors_dbow)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred))

accuracy 0.993615441722346
              precision    recall  f1-score   support

        Fake       1.00      0.99      0.99      7025
        True       0.99      1.00      0.99      6445

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470

