# TP : Word Embeddings for Classification

## Objectives:

Explore the various way to represent textual data by applying them to a relatively small classification dataset - **20NewsGroup** - and evaluate how they perform on the classification task. 
1. Using what we have previously seen, pre-process the data: clean it, obtain an appropriate vocabulary.
2. Obtain representations: any that will allow us to obtain a vector representation of each document is appropriate.
    - Symbolic: **BoW, TF-IDF**
    - Dense document representations: via **Topic Modeling: LSA, LDA**
    - Dense word representations: **SVD-reduced PPMI, Word2vec, GloVe**
        - For these, you will need to implement a **function aggregating word representations into document representations**
3. Perform classification: we can make things simple and only use a **logistic regression**

## Necessary dependancies

We will need the following packages:
- The Machine Learning API Scikit-learn : http://scikit-learn.org/stable/install.html
- The Natural Language Toolkit : http://www.nltk.org/install.html
- Gensim: https://radimrehurek.com/gensim/

These are available with Anaconda: https://anaconda.org/anaconda/nltk and https://anaconda.org/anaconda/scikit-learn

In [134]:
import os.path as op
import re 
import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint

## Loading data

We retrieve the textual data in the variable ```train_texts```.

The labels are retrieved in the variable ```train_labels``` - it contains ```len(train_texts)``` of them: $0$ indicates that the corresponding review is negative while $1$ indicates that it is positive.

In [135]:
from glob import glob
# We get the files from the path: ./aclImdb/train/neg for negative reviews, and ./aclImdb/train/pos for positive reviews
train_filenames_neg = sorted(glob(op.join('.', 'aclImdb', 'neg', '*.txt')))
train_filenames_pos = sorted(glob(op.join('.', 'aclImdb', 'pos', '*.txt')))

# Each files contains a review that consists in one line of text: we put this string in two lists, that we concatenate
train_texts_neg = [open(f, encoding="utf8").read() for f in train_filenames_neg]
train_texts_pos = [open(f, encoding="utf8").read() for f in train_filenames_pos]
train_texts = train_texts_neg + train_texts_pos

# The first half of the elements of the list are string of negative reviews, and the second half positive ones
# We create the labels, as an array of [1,len(texts)], filled with 1, and change the first half to 0
train_labels = np.ones(len(train_texts), dtype=int)
train_labels[:len(train_texts_neg)] = 0.

We have a total of $25.000$ documents, which may take a long time to process, especially on some computers. Since data is ordered, let's pick only one every ```k``` documents to accelerate things:

In [136]:
# This number of documents may be high for most computers: we can select a fraction of them (here, one in k)
# Use an even number to keep the same number of positive and negative reviews
k = 10
texts_reduced = train_texts[0::k]
labels_reduced = train_labels[0::k]
print('Number of documents:', len(train_texts_neg))
print('Number of documents:', len(train_texts_pos))

print('Number of documents:', len(texts_reduced))

Number of documents: 12500
Number of documents: 12500
Number of documents: 2500


Use the function ```train_test_split```from ```sklearn``` function to set aside test data that you will use during the lab. Make it one fifth of the data you have currently.

<div class='alert alert-block alert-info'>
            Code:</div>

In [137]:
from sklearn.model_selection import train_test_split

# Splitting the data into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts_reduced, labels_reduced, test_size=0.2, random_state=42)

print('Number of training documents:', len(train_texts))
print('Number of test documents:', len(test_texts))
print('Number of training labels:', len(train_labels))
print('Number of test labels:', len(test_labels))


Number of training documents: 2000
Number of test documents: 500
Number of training labels: 2000
Number of test labels: 500


## 1 - Document Preprocessing

You should use a pre-processing function you can apply to the raw text before any other processing (*i.e*, tokenization and obtaining representations). Some pre-processing can also be tied with the tokenization (*i.e*, removing stop words). Complete the following function, using the appropriate ```nltk``` tools. 
<div class='alert alert-block alert-info'>
            Code:</div>

In [138]:
# Imports
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

nltk.download('punkt')  
nltk.download('stopwords')  

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nicol\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nicol\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

<div class='alert alert-block alert-info'>
            Code:</div>

In [139]:

def clean_text(text: str,
               rm_numbers=True,
               rm_punct=True,
               rm_stop_words=False,
               rm_short_words=False):
    # Make lowercase
    text = text.lower()
    
    # Remove punctuation
    if rm_punct:
        text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove numbers
    if rm_numbers:
        text = ''.join([word for word in text if not word.isdigit()])
    
    # Remove stopwords
    if rm_stop_words:
        stop_words = set(stopwords.words('english'))
        words = word_tokenize(text)
        text = ' '.join([word for word in words if word not in stop_words])
    
    # Remove short words
    if rm_short_words:
        words = word_tokenize(text)
        text = ' '.join([word for word in words if len(word) > 1])
    
    return text

# Test the clean_text function
sample_text = "This is a sample text with numbers 1234, punctuation! and stopwords like the and is. Short words too a I."
cleaned_text = clean_text(sample_text)
print(cleaned_text)


this is a sample text with numbers  punctuation and stopwords like the and is short words too a i


In [140]:
train_texts_clean = [clean_text(train_texts[i]) for i in range(len(train_texts))]
test_texts_clean = [clean_text(test_texts[i]) for i in range(len(test_texts))]
print(len(train_texts_clean))
print(len(test_texts_clean))

2000
500


Now that the data is cleaned, the first step we will follow is to pick a common vocabulary that we will use for every representations we obtain in this lab. The following function can be used:

In [141]:
def vocabulary(corpus, voc_threshold=0):
    """    
    Function using word counts to build a vocabulary - can be improved with a second parameter for 
    setting a frequency threshold
    Params:
        corpus (list of list of strings): corpus of sentences
        voc_threshold (int): maximum size of the vocabulary (0 means no limit !)
    Returns:
        vocabulary (dictionary): keys: list of distinct words across the corpus
                                 values: indexes corresponding to each word sorted by frequency        
    """  
    word_counts = {}
    for sent in corpus:
        for word in word_tokenize(sent):
            if (word not in word_counts):
                word_counts[word] = 0
            word_counts[word] += 1           
    words = sorted(word_counts.keys(), key=word_counts.get, reverse=True)
    if voc_threshold > 0:
        words = words[:voc_threshold] + ['unk']   
    vocabulary = {words[i] : i for i in range(len(words))}
    return vocabulary, {word: word_counts.get(word, 0) for word in vocabulary}

Look at the word frequency distribution of every word in your current **training data**. Display enough information to help you pick a vocabulary size. It may be hard to judge what could be a good compromise: when you don't know, look at what is usually done... 

**Hint:** what is the default **minimum_count** for a word to appear in the vocabulary created by ```gensim.models.Word2Vec``` ? Look at the default argument given in the documentation. 
<div class='alert alert-block alert-info'>
            Code:</div>

In [142]:
# Calculate word frequencies in the training data
voc_threshold = 7000
vocab, word_counts = vocabulary(train_texts_clean, voc_threshold)

# Display word frequency distribution
print("Word frequencies:")
for word, count in word_counts.items():
    print(f"{word}: {count}")

# Display the size of the vocabulary after applying the threshold
print(f"Vocabulary size: {len(vocab)}")


Word frequencies:
the: 26729
and: 12783
a: 12661
of: 11660
to: 10390
is: 8395
in: 7285
it: 6117
i: 6040
this: 5991
that: 5463
br: 4385
was: 3831
as: 3727
with: 3430
movie: 3426
for: 3405
but: 3251
film: 2931
on: 2688
not: 2395
you: 2347
his: 2303
have: 2247
are: 2235
be: 2110
one: 2085
he: 2080
its: 1937
at: 1863
all: 1862
by: 1763
an: 1650
they: 1646
from: 1601
who: 1580
like: 1548
so: 1524
about: 1390
out: 1389
just: 1365
or: 1363
her: 1348
if: 1323
has: 1273
there: 1256
some: 1249
what: 1177
very: 1142
good: 1117
more: 1077
when: 1037
my: 970
story: 962
time: 961
would: 958
had: 948
even: 944
really: 941
she: 940
no: 938
only: 926
up: 925
can: 923
their: 903
which: 886
see: 879
were: 874
me: 824
we: 790
much: 786
been: 769
than: 763
well: 759
get: 743
will: 731
most: 711
bad: 709
because: 709
great: 707
people: 706
also: 697
other: 697
into: 696
how: 694
first: 668
do: 665
made: 651
movies: 644
dont: 636
him: 625
any: 619
make: 613
could: 610
too: 603
way: 596
films: 595
then: 595
t

## 2 - Symbolic text representations

We can use the ```CountVectorizer``` class from scikit-learn to obtain the first set of representations:
- Use the appropriate argument to get your own vocabulary
- Fit the vectorizer on your training data, transform your test data
- Create a ```LogisticRegression``` model and train it with these representations. Display the confusion matrix using functions from ```sklearn.metrics``` 

Then, re-execute the same pipeline with the ```TfidfVectorizer```.

<div class='alert alert-block alert-info'>
            Code:</div>

In [143]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

In [144]:
# Create a CountVectorizer with your own vocabulary
count_vectorizer = CountVectorizer(vocabulary=vocab)

# Fit the vectorizer on your training data and transform your test data
X_train_count = count_vectorizer.fit_transform(train_texts_clean)
X_test_count = count_vectorizer.transform(test_texts_clean)

# Create a logistic regression model and train it with CountVectorizer representations
logistic_regression_count = LogisticRegression(max_iter=1000)
logistic_regression_count.fit(X_train_count, train_labels)

# Predict on the test data
y_pred_count = logistic_regression_count.predict(X_test_count)

# Display the confusion matrix
confusion_matrix_count = confusion_matrix(test_labels, y_pred_count)
print("Confusion Matrix (CountVectorizer):")
print(confusion_matrix_count)

# Classification report
print("Classification Report (CountVectorizer):")
print(classification_report(test_labels, y_pred_count))


Confusion Matrix (CountVectorizer):
[[192  44]
 [ 50 214]]
Classification Report (CountVectorizer):
              precision    recall  f1-score   support

           0       0.79      0.81      0.80       236
           1       0.83      0.81      0.82       264

    accuracy                           0.81       500
   macro avg       0.81      0.81      0.81       500
weighted avg       0.81      0.81      0.81       500



In [145]:
# Repeat the same process with TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(vocabulary=vocab)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_texts_clean)
X_test_tfidf = tfidf_vectorizer.transform(test_texts_clean)

logistic_regression_tfidf = LogisticRegression()
logistic_regression_tfidf.fit(X_train_tfidf, train_labels)

y_pred_tfidf = logistic_regression_tfidf.predict(X_test_tfidf)

# Display the confusion matrix for TfidfVectorizer
confusion_matrix_tfidf = confusion_matrix(test_labels, y_pred_tfidf)
print("Confusion Matrix (TfidfVectorizer):")
print(confusion_matrix_tfidf)

# Classification report for TfidfVectorizer
print("Classification Report (TfidfVectorizer):")
print(classification_report(test_labels, y_pred_tfidf))


Confusion Matrix (TfidfVectorizer):
[[198  38]
 [ 43 221]]
Classification Report (TfidfVectorizer):
              precision    recall  f1-score   support

           0       0.82      0.84      0.83       236
           1       0.85      0.84      0.85       264

    accuracy                           0.84       500
   macro avg       0.84      0.84      0.84       500
weighted avg       0.84      0.84      0.84       500



## 3 - Dense Representations from Topic Modeling

Now, the goal is to re-use the bag-of-words representations we obtained earlier - but reduce their dimension through a **topic model**. Note that this allows to obtain reduced **document representations**, which we can again use directly to perform classification.
- Do this with two models: ```TruncatedSVD``` and ```LatentDirichletAllocation```
- Pick $300$ as the dimensionality of the latent representation (*i.e*, the number of topics)

<div class='alert alert-block alert-info'>
            Code:</div>

In [146]:
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

In [147]:
# Initialize TruncatedSVD with the desired number of topics (300)
svd_model = TruncatedSVD(n_components=300)

# Fit and transform your CountVectorizer representations
X_train_svd = svd_model.fit_transform(X_train_count)
X_test_svd = svd_model.transform(X_test_count)

In [148]:
# Initialize LatentDirichletAllocation with the desired number of topics (300)
lda_model = LatentDirichletAllocation(n_components=300)

# Fit and transform your CountVectorizer representations
X_train_lda = lda_model.fit_transform(X_train_count)
X_test_lda = lda_model.transform(X_test_count)


<div class='alert alert-block alert-warning'>
            Question:</div>
            
We picked $300$ as number of topics. What would be the procedure to follow if we wanted to choose this hyperparameter through the data ? 

**Anwser** :

1) **Split Data:** First, we should have a validation set in addition to our training and test sets. We will use the validation set for hyperparameter tuning.

2) **Hyperparameter Search:** We then should define a range of values for the number of topics (e.g., from 100 to 500 topics) that we want to test. We can use cross-validation or other methods to evaluate different values of this hyperparameter.

3) **Evaluate Models:** Afterwards we can evaluate topic models (e.g., Latent Dirichlet Allocation or Truncated SVD) with different numbers of topics on the validation set using an appropriate evaluation metric. Common metrics for topic modeling include perplexity or coherence. The goal is to find the number of topics that results in the best performance on the validation set.

4) **Select Optimal Hyperparameter:** Then, we can choose the number of topics that gives the best performance on the validation set as our final hyperparameter. Being still cautious about overfitting, and making sure our choice generalizes well to unseen data (test set).

5) **Final Model:** Finally, we can train the final topic model with the chosen number of topics on the combined training and validation sets. We can then use this model to transform both the training and test sets into lower-dimensional representations.

## 4 - Dense Count-based Representations

The following function allows to obtain very large-dimensional vectors for **words**. We will now follow a different procedure:
- Use ```TruncatedSVD```to obtain **word embeddings** of dimension $300$ from the output of the ```co_occurence_matrix```function, to which you can apply any intermediate transformation you see fit. 
- Complete the following ```sentence_representations``` matrix, which will allow you to obtain **document representations** from **word embeddings**. 
- Put the pipeline together and obtain document representations for both training and testing data, using word embeddings you got from the *training data co-occurence matrix*.
- Apply the same classification model as before, and display the results.

In [149]:
def co_occurence_matrix(corpus, vocabulary, window=0, distance_weighting=False):
    """
    Params:
        corpus (list of list of strings): corpus of sentences
        vocabulary (dictionary): words to use in the matrix
        window (int): size of the context window; when 0, the context is the whole sentence
        distance_weighting (bool): indicates if we use a weight depending on the distance between words for co-oc counts
    Returns:
        matrix (array of size (len(vocabulary), len(vocabulary))): the co-oc matrix, using the same ordering as the vocabulary given in input    
    """ 
    l = len(vocabulary)
    M = np.zeros((l,l))
    for sent in corpus:
        # Get the sentence
        sent = word_tokenize(sent)
        # Obtain the indexes of the words in the sentence from the vocabulary 
        sent_idx = [vocabulary.get(word, len(vocabulary)-1) for word in sent]
        # Avoid one-word sentences - can create issues in normalization:
        if len(sent_idx) == 1:
                sent_idx.append(len(vocabulary)-1)
        # Go through the indexes and add 1 / dist(i,j) to M[i,j] if words of index i and j appear in the same window
        for i, idx in enumerate(sent_idx):
            # If we consider a limited context:
            if window > 0:
                # Create a list containing the indexes that are on the left of the current index 'idx_i'
                l_ctx_idx = [sent_idx[j] for j in range(max(0,i-window),i)]                
            # If the context is the entire document:
            else:
                # The list containing the left context is easier to create
                l_ctx_idx = sent_idx[:i]
            # Go through the list and update M[i,j]:        
            for j, ctx_idx in enumerate(l_ctx_idx):
                if distance_weighting:
                    weight = 1.0 / (len(l_ctx_idx) - j)
                else:
                    weight = 1.0
                M[idx, ctx_idx] += weight * 1.0
                M[ctx_idx, idx] += weight * 1.0
    return M  

<div class='alert alert-block alert-info'>
            Code:</div>

In [160]:
# Obtain the co-occurence matrix, transform it as needed, reduce its dimension

co_occ_matrix = co_occurence_matrix(train_texts_clean, vocab, window=5, distance_weighting=False)

svd_model = TruncatedSVD(n_components=300)
word_embeddings = svd_model.fit_transform(co_occ_matrix)

sentence_representations = np.zeros((len(texts_reduced), 300))

for i, text in enumerate(texts_reduced):
    words = word_tokenize(text)
    word_indices = [vocab.get(word, len(vocab) - 1) for word in words]
    doc_embeddings = word_embeddings[word_indices]
    sentence_representations[i] = np.mean(doc_embeddings, axis=0)

logistic_regression = LogisticRegression(solver='liblinear')
logistic_regression.fit(sentence_representations, labels_reduced)

In [161]:
test_sentence_representations = np.zeros((len(test_texts), 300))

for i, text in enumerate(test_texts):
    words = word_tokenize(text)
    word_indices = [vocab.get(word, len(vocab) - 1) for word in words]
    doc_embeddings = word_embeddings[word_indices]
    test_sentence_representations[i] = np.mean(doc_embeddings, axis=0)

y_pred = logistic_regression.predict(test_sentence_representations)

In [162]:
confusion_matrix_svd = confusion_matrix(test_labels, y_pred)
print("Confusion Matrix (TruncatedSVD Word Embeddings):")
print(confusion_matrix_svd)

Confusion Matrix (TruncatedSVD Word Embeddings):
[[190  46]
 [ 40 224]]


<div class='alert alert-block alert-info'>
            Code:</div>

In [178]:
def sentence_representations(texts, vocabulary, embeddings, np_func=np.mean):
    """
    Represent the sentences as a combination of the vector of its words.
    Parameters
    ----------
    texts : a list of sentences   
    vocabulary : dict
        From words to indexes of vector.
    embeddings : Matrix containing word representations
    np_func : function (default: np.sum)
        A numpy matrix operation that can be applied columnwise, 
        like `np.mean`, `np.sum`, or `np.prod`. 
    Returns
    -------
    np.array, dimension `(len(texts), embeddings.shape[1])`            
    """
    representations = []
    for text in texts:
        indexes = np.array([vocabulary.get(w,len(vocabulary)-1) for w in word_tokenize(text)])
        sentrep = np_func(embeddings[indexes], axis=0)
        representations.append(sentrep)
    representations = np.array(representations)    
    return representations

<div class='alert alert-block alert-info'>
            Code:</div>

In [154]:
# Obtain document representations, apply the classifier

# Obtain document representations using the sentence_representations function
document_representations = sentence_representations(texts_reduced, vocab, word_embeddings)

# Create a logistic regression model
logistic_regression = LogisticRegression(solver='liblinear')

# Train the model with document representations
logistic_regression.fit(document_representations, labels_reduced)

# Transform the test data into document representations
test_document_representations = sentence_representations(test_texts, vocab, word_embeddings)

# Make predictions on the test data
y_pred = logistic_regression.predict(test_document_representations)

# Display the results
confusion_matrix_representations = confusion_matrix(test_labels, y_pred)
print("Confusion Matrix (Document Representations):")
print(confusion_matrix_representations)



Confusion Matrix (Document Representations):
[[191  45]
 [ 39 225]]


## 5 - Dense Prediction-based Representations

We will now use two types of word embeddings: 
1. From ```Word2Vec```: which we will train ourselves
2. From ```GloVe```: which we will import like in the demo

We will use the ```gensim``` library for its implementation of word2vec in python. Since we want to keep the same vocabulary as before: we'll first create the class, then get the vocabulary we generated above. 

In [166]:
from gensim.models import Word2Vec
import gensim.downloader as api

In [167]:
model = Word2Vec(vector_size=300,
                 window=5,
                 null_word=len(word_counts))
model.build_vocab_from_freq(word_counts)

<div class='alert alert-block alert-info'>
            Code:</div>

In [168]:
# The model is to be trained with one long list of words, containing the full training dataset.
#each sentence is a list of words
preprocessed_corpus = [word_tokenize(sent) for sent in texts_reduced]

In [169]:
model.train(preprocessed_corpus, total_examples=len(preprocessed_corpus), epochs=30, report_delay=1)

(10060990, 21044010)

Then, we can re-use the ```sentence_representations``` function like before to obtain document representations, and apply classification. 
<div class='alert alert-block alert-info'>
            Code:</div>

In [179]:
document_representations=sentence_representations(texts_reduced,vocab,embeddings=model.wv)
document_representations_test=sentence_representations(test_texts,vocab,embeddings=model.wv)

In [180]:
model_LR = LogisticRegression()
model_LR.fit(document_representations,labels_reduced)
predictions_val =model_LR.predict(document_representations_test)
confusion = confusion_matrix(test_labels, predictions_val)
print(confusion)

[[188  48]
 [ 47 217]]


Obtain the GloVe representations corresponding to the vocabulary and use them the same way !
<div class='alert alert-block alert-info'>
            Code:</div>

In [186]:
# Download the GloVe embeddings
loaded_glove_model = api.load("glove-wiki-gigaword-300")

In [187]:
loaded_glove_embeddings = loaded_glove_model.vectors
document_representations=sentence_representations(texts_reduced,vocab,embeddings=loaded_glove_embeddings)
document_representations_test=sentence_representations(test_texts,vocab,embeddings=loaded_glove_embeddings)

In [188]:
model_LR = LogisticRegression()
model_LR.fit(document_representations,labels_reduced)

predictions_val =model_LR.predict(document_representations_test)
confusion = confusion_matrix(test_labels, predictions_val)
print(confusion)

[[166  70]
 [ 65 199]]


<div class='alert alert-block alert-warning'>
            Question:</div>
            
Comment on the results. What do you expect would happen if you used more training data ? or less ? 

The results of Word2Vec are quite better than for Glove because we have more true positive as well as true nagative and thus a smaller error rate from the model.

**Ammount of Training Data:** With less training data, the quality of learned embeddings and the performance of dimensionality reduction may decrease. Word2Vec and GloVe might not perform as well, and the ability to capture nuanced word relationships may be limited. Obvisouly with, more data the models should be more precise and have smaller error rates. Gloves should be even worse with more data due it's predicting word co-occurences counts.