## Done by : Adnane El Bouhali

# TP : Word Embeddings for Classification

## Objectives:

Explore the various way to represent textual data by applying them to a relatively small French classification dataset based on professionnal certification titles - **RNCP** - and evaluate how they perform on the classification task.
1. Using what we have previously seen, pre-process the data: clean it, obtain an appropriate vocabulary.
2. Obtain representations: any that will allow us to obtain a vector representation of each document is appropriate.
    - Symbolic: **BoW, TF-IDF**
    - Dense document representations: via **Topic Modeling: LSA, LDA**
    - Dense word representations: **SVD-reduced PPMI, Word2vec, GloVe**
        - For these, you will need to implement a **function aggregating word representations into document representations**
3. Perform classification: we can make things simple and only use a **logistic regression**

## Necessary dependancies

We will need the following packages:
- The Machine Learning API Scikit-learn : http://scikit-learn.org/stable/install.html
- The Natural Language Toolkit : http://www.nltk.org/install.html
- Gensim: https://radimrehurek.com/gensim/

These are available with Anaconda: https://anaconda.org/anaconda/nltk and https://anaconda.org/anaconda/scikit-learn

In [1]:
import os.path as op
import re
import numpy as np
import matplotlib.pyplot as plt
import pprint
import pandas as pd
import gzip
pp = pprint.PrettyPrinter(indent=3)

## Loading data

Let's load the data: take a first look.

In [2]:
with open("rncp.csv", encoding='utf-8') as f:
    rncp = pd.read_csv(f, na_filter=False)

print(rncp.head())

   Categorie                                text_certifications
0          1  Responsable de chantiers de bûcheronnage manue...
1          1  Responsable de chantiers de bûcheronnage manue...
2          1                                 Travaux forestiers
3          1                                              Forêt
4          1                                              Forêt


In [3]:
print(rncp.columns.values)
texts = rncp.loc[:,'text_certifications'].astype('str').tolist()
labels = rncp.loc[:,'Categorie'].astype('str').tolist()

['Categorie' 'text_certifications']


You can see that the first column is the category, the second the title of the certification. Let's get the category names for clarity:

In [4]:
Categories = ["1-environnement",
              "2-defense",
              "3-patrimoine",
              "4-economie",
              "5-recherche",
              "6-nautisme",
              "7-aeronautique",
              "8-securite",
              "9-multimedia",
              "10-humanitaire",
              "11-nucleaire",
              "12-enfance",
              "13-saisonnier",
              "14-assistance",
              "15-sport",
              "16-ingenierie"]

In [5]:
pp.pprint(texts[:10])

[  'Responsable de chantiers de bûcheronnage manuel et de débardage',
   'Responsable de chantiers de bûcheronnage manuel et de sylviculture',
   'Travaux forestiers',
   'Forêt',
   'Forêt',
   'Responsable de chantiers forestiers',
   'Diagnostic et taille des arbres',
   'option Chef d’entreprise ou OHQ en travaux forestiers, spécialité '
   'abattage-façonnage',
   'option Chef d’entreprise ou OHQ en travaux forestiers, spécialité '
   'débardage',
   'Gestion et conduite de chantiers forestiers']


In [6]:
# This number of documents may be high for some computers: we can select a fraction of them (here, one in k)
# Use an even number to keep the same number of positive and negative reviews
k = 1
texts_reduced = texts[0::k]
labels_reduced = labels[0::k]

print('Number of documents:', len(texts_reduced))

Number of documents: 94312


Use the function ```train_test_split```from ```sklearn``` function to set aside test data that you will use during the lab. Make it one fifth of the data you have currently.

<div class='alert alert-block alert-info'>
            Code:</div>

In [7]:
from sklearn.model_selection import train_test_split

texts_reduced, test_texts, labels_reduced, test_labels = train_test_split(texts_reduced, labels_reduced, test_size=0.2, random_state=42)

## 1 - Document Preprocessing

You should use a pre-processing function you can apply to the raw text before any other processing (*i.e*, tokenization and obtaining representations). Some pre-processing can also be tied with the tokenization (*i.e*, removing stop words). Complete the following function, using the appropriate ```nltk``` tools.
<div class='alert alert-block alert-info'>
            Code:</div>

In [8]:
# Imports
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import string

<div class='alert alert-block alert-info'>
            Code:</div>

In [9]:
# Look at the data and apply the appropriate pre-processing
# Download necessary NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')

# Define a preprocessing function
def preprocess_text(text, language='french'):
    # Tokenize text
    tokens = word_tokenize(text, language=language)

    # Convert to lower case
    tokens = [word.lower() for word in tokens]

    # Remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [word.translate(table) for word in tokens]

    # Remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]

    # Filter out stop words
    stop_words = set(stopwords.words(language))
    words = [word for word in words if not word in stop_words]

    # Optionally: Perform stemming (we'll skip lemmatization for simplicity)
    # stemmer = SnowballStemmer(language)
    # stemmed = [stemmer.stem(word) for word in words]

    return words

# Test the preprocessing function on a sample text
sample_text = "Responsable de chantiers de bûcheronnage manuel et de débardage"
preprocessed_sample = preprocess_text(sample_text)
preprocessed_sample


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['responsable', 'chantiers', 'bûcheronnage', 'manuel', 'débardage']

Now that the data is cleaned, the first step we will follow is to pick a common vocabulary that we will use for every representations we obtain in this lab. **Use the code of the previous lab to create a vocabulary.**

<div class='alert alert-block alert-info'>
            Code:</div>

In [10]:
# Assuming texts_reduced is your list of texts that you want to process
preprocessed_texts = [preprocess_text(text) for text in texts_reduced]

# Flatten the list of lists into a single list containing all tokens
all_tokens = [token for sublist in preprocessed_texts for token in sublist]

# Create a set of unique words to form the vocabulary
vocabulary = set(all_tokens)

print(f"Vocabulary size: {len(vocabulary)}")
# Optionally, you might want to sort the vocabulary for consistency
sorted_vocabulary = sorted(list(vocabulary))

# If you want to look at some of the vocabulary words
print(sorted_vocabulary[:100])

Vocabulary size: 5953
['a', 'aapapd', 'aaqcb', 'abattagefaçonnage', 'ac', 'accessibilité', 'accessoires', 'accompagnant', 'accompagnateur', 'accompagnateure', 'accompagnement', 'accompagné', 'accordeur', 'accordéon', 'accu', 'accueil', 'accueillant', 'accès', 'achard', 'achat', 'achats', 'achatvente', 'acheteur', 'acheteurs', 'acoustique', 'acoustiques', 'acquisition', 'acquisitions', 'acrobatique', 'acrobatiques', 'acse', 'acsyon', 'acteer', 'acteur', 'acteurs', 'actifs', 'action', 'actions', 'activite', 'activites', 'activité', 'activités', 'actuaire', 'actuariat', 'actuariel', 'actuarielle', 'actuelles', 'actuels', 'adaptation', 'adaptee', 'adapté', 'adaptée', 'adaptées', 'adaptés', 'additifs', 'additive', 'adhérents', 'adhésifs', 'adjoint', 'admin', 'administrateur', 'administrateurproducteur', 'administrateurrice', 'administratif', 'administratifs', 'administration', 'administrationgestion', 'administrationmaintenance', 'administrations', 'administrative', 'administratives', 'admi

What do you think is the **appropriate vocabulary size here** ? Would any further pre-processing make sense ? Motivate your answer.

<div class='alert alert-block alert-warning'>
            Question:</div>

Determining the appropriate vocabulary size and deciding on further preprocessing steps for a text classification task, such as classifying professional certification titles (RNCP dataset), requires balancing several factors:

1. **Dataset and Project Objectives:** The nature of your dataset and the specific goals of your project play a crucial role. A complex, topic-diverse dataset or projects aiming to capture specialized terminology may benefit from a larger vocabulary.

2. **Dimensionality vs. Information Loss:**
   - A larger vocabulary increases the feature space's dimensionality, which can lead to longer processing times and potential overfitting issues.
   - Excessively reducing the vocabulary size might result in significant information loss, undermining the model's ability to differentiate between classes effectively.

3. **Computational Resources:** The available memory and processing power limit the feasible size of the vocabulary. A balance must be found between computational efficiency and model performance.

4. **Further Pre-processing Considerations:**
   - **Rare Words:** Removing infrequently appearing words can streamline the vocabulary without heavily impacting the content.
   - **Stemming/Lemmatization:** These can reduce the vocabulary size by normalizing word variations to their root form, though the impact varies by language.
   - **Domain-Specific Stopwords:** Identifying and removing common but uninformative words specific to the dataset's domain can improve model focus on relevant terms.

### Decision Guidelines:
- **Experimentation is key:** There's no one-size-fits-all answer. Experiment with different vocabulary sizes and preprocessing strategies, and evaluate their impact on model performance.
- **Use Evaluation Metrics:** Monitor changes in accuracy, F1-score, and other relevant metrics to guide the refinement of your vocabulary and preprocessing steps.
- **Incorporate Domain Expertise:** Leveraging knowledge specific to the domain can help identify non-informative words or phrases to be excluded from the analysis.

For the RNCP dataset, starting with a comprehensive vocabulary and iteratively refining it based on performance and computational constraints is advisable. This approach, coupled with strategic preprocessing, can help create effective text representations for classification tasks.

## 2 - Symbolic text representations

We can use the ```CountVectorizer``` class from scikit-learn to obtain the first set of representations:
- Use the appropriate argument to get your own vocabulary
- Fit the vectorizer on your training data, transform your test data
- Create a ```LogisticRegression``` model and train it with these representations. Display the confusion matrix using functions from ```sklearn.metrics```

Then, re-execute the same pipeline with the ```TfidfVectorizer```.

<div class='alert alert-block alert-info'>
            Code:</div>

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Assuming texts_reduced and test_texts are your training and test sets, respectively
# and labels_reduced and test_labels are the corresponding labels

# Initialize CountVectorizer with your vocabulary
vectorizer = CountVectorizer(vocabulary=sorted_vocabulary)  # Use your sorted_vocabulary

# Fit the vectorizer on the training data and transform both training and test data
X_train_counts = vectorizer.fit_transform(texts_reduced)
X_test_counts = vectorizer.transform(test_texts)

# Create and train a LogisticRegression model
lr_model_counts = LogisticRegression(max_iter=1000)  # Increase max_iter if needed
lr_model_counts.fit(X_train_counts, labels_reduced)

# Predict on test data and display the confusion matrix
predictions_counts = lr_model_counts.predict(X_test_counts)
print("Confusion Matrix (BoW):")
print(confusion_matrix(test_labels, predictions_counts))
print("\nClassification Report (BoW):")
print(classification_report(test_labels, predictions_counts))

Confusion Matrix (BoW):
[[ 492  102   82   17   18   17   29  382   18   66  833   75   54  118
    22   79]
 [ 259  147    1   12    7   21   15   61    2   27  604   15   24   23
     9   75]
 [ 158    5   79    0    1    1    0  215   18    1  166   11   11  103
     5    1]
 [  19    6    0   48   42   48   24    0    0   22   66    8   24    0
     0   22]
 [  29    2    1   46   53   50   24    0    2    5   33    1   27    2
     6   11]
 [  36    9   10   42   39  121   44    6   10   33  110    7   21   21
    27   32]
 [  42    4    0   34   36   54   36    1    7   25  332   13   51    2
     0  130]
 [ 316   51   69    0    0    4    2  523    9    1  701   62   34  415
     8  111]
 [  25    4   26    0    1    8    2   19   46    0   30    2   31   36
    14    2]
 [ 115    8    1   25   17   15   27   24    1  164  239   94   19    7
     0   38]
 [ 359  103   31    7    6   17   39  511    5   54 2095  102   31  154
    19  331]
 [ 160    6    6    2    2   12    8  187

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer with your vocabulary
tfidf_vectorizer = TfidfVectorizer(vocabulary=sorted_vocabulary)  # Use your sorted_vocabulary

# Fit the vectorizer on the training data and transform both training and test data
X_train_tfidf = tfidf_vectorizer.fit_transform(texts_reduced)
X_test_tfidf = tfidf_vectorizer.transform(test_texts)

# Reuse the LogisticRegression model with TF-IDF representations
lr_model_tfidf = LogisticRegression(max_iter=1000)  # Increase max_iter if needed
lr_model_tfidf.fit(X_train_tfidf, labels_reduced)

# Predict on test data and display the confusion matrix
predictions_tfidf = lr_model_tfidf.predict(X_test_tfidf)
print("Confusion Matrix (TF-IDF):")
print(confusion_matrix(test_labels, predictions_tfidf))
print("\nClassification Report (TF-IDF):")
print(classification_report(test_labels, predictions_tfidf))

Confusion Matrix (TF-IDF):
[[ 593   71   70   15   22   12   27  415   13   58  768   67   48  116
    18   91]
 [ 287  151    0   15    6   16   16   66    2   29  562   13   20   24
     7   88]
 [ 155    4   83    0    1    1    0  234   14    2  142   13    9  112
     3    2]
 [  23    6    0   56   38   42   26    1    0   29   64    6   24    0
     0   14]
 [  25    1    1   45   60   46   20    0    2    9   32    2   34    0
     6    9]
 [  38    7    8   38   30  136   42    8    7   31  114    6   29   18
    24   32]
 [  46    3    0   29   36   49   39    2    8   29  314   14   53    3
     0  142]
 [ 333   48   59    0    1    3    1  624    5    0  650   57   29  390
     5  101]
 [  31    3   24    0    0    6    0   19   39    0   34    1   32   40
    15    2]
 [ 126    6    0   21   14   17   30   24    1  170  238   80   15   10
     0   42]
 [ 382  101   30    5    4   17   39  570    2   54 2043  101   16  156
    16  328]
 [ 185    5    4    3    0   12   12  

## 3 - Dense Representations from Topic Modeling

Now, the goal is to re-use the bag-of-words representations we obtained earlier - but reduce their dimension through a **topic model**. Note that this allows to obtain reduced **document representations**, which we can again use directly to perform classification.
- Do this with two models: ```TruncatedSVD``` and ```LatentDirichletAllocation```
- Pick $300$ as the dimensionality of the latent representation (*i.e*, the number of topics)

<div class='alert alert-block alert-info'>
            Code:</div>

In [13]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Assuming X_train_counts is your BoW representation for the training data
# and 300 is the desired number of topics

# Initialize and fit TruncatedSVD
svd_model = TruncatedSVD(n_components=300, random_state=42)
lsa_transformed_train = svd_model.fit_transform(X_train_counts)

# Optionally, normalize the output (helpful for some types of analysis)
lsa_transformed_train = Normalizer(copy=False).fit_transform(lsa_transformed_train)

# Transform the test data
lsa_transformed_test = svd_model.transform(X_test_counts)
lsa_transformed_test = Normalizer(copy=False).transform(lsa_transformed_test)

# Train a logistic regression model on the LSA-transformed data
lr_model_lsa = LogisticRegression(max_iter=1000)
lr_model_lsa.fit(lsa_transformed_train, labels_reduced)

# Predict and display metrics
predictions_lsa = lr_model_lsa.predict(lsa_transformed_test)
print("Confusion Matrix (LSA):")
print(confusion_matrix(test_labels, predictions_lsa))
print("\nClassification Report (LSA):")
print(classification_report(test_labels, predictions_lsa))

Confusion Matrix (LSA):
[[ 583   52   73   13   19   28   27  383   13   62  833   57   26  124
    12   99]
 [ 257  121    1   13    7   20   21   65    3   25  614    8   17   30
     8   92]
 [ 154    2   79    0    7    2    0  200    4    4  181    9    7  117
     5    4]
 [  23    4    0   52   35   38   25    0    0   25   77    6   26    0
     2   16]
 [  26    1    1   38   54   54    8    1    1   10   52    1   30    0
     6    9]
 [  39    7    3   42   38  115   26   12    9   26  138    7   22   19
    26   39]
 [  34    2    1   38   31   42   37    3    4   25  338   11   56    2
     0  143]
 [ 326   43   53    1    0    6    0  680    5    5  684   49    9  342
     7   96]
 [  46    3   16    0    4    8    1   13   25    0   50    0   33   29
    14    4]
 [ 123    5    1   22   10   26   29   31    2  171  254   57   10   12
     0   41]
 [ 327   74   35    6    4   26   42  542    0   59 2181   80   14  165
    19  290]
 [ 188    6   10    3    2   10   18  198

In [14]:
from sklearn.decomposition import LatentDirichletAllocation

# Initialize and fit LatentDirichletAllocation
lda_model = LatentDirichletAllocation(n_components=300, random_state=42, learning_method='batch')
lda_transformed_train = lda_model.fit_transform(X_train_counts)

# Transform the test data
lda_transformed_test = lda_model.transform(X_test_counts)

# Train a logistic regression model on the LDA-transformed data
lr_model_lda = LogisticRegression(max_iter=1000)
lr_model_lda.fit(lda_transformed_train, labels_reduced)

# Predict and display metrics
predictions_lda = lr_model_lda.predict(lda_transformed_test)
print("Confusion Matrix (LDA):")
print(confusion_matrix(test_labels, predictions_lda))
print("\nClassification Report (LDA):")
print(classification_report(test_labels, predictions_lda))

Confusion Matrix (LDA):
[[ 606   36   40   11   15   11   16  409    2   52  953   35   38   95
    15   70]
 [ 267   80    1   12    4   14   17   72    1   21  721    8   11   15
     5   53]
 [ 165    2   43    0    3    1    0  242    4    3  200    5    8   86
     9    4]
 [  37    6    0   38   25   17   18    1    0   23  103    6   32    0
     1   22]
 [  40    1    1   29   26   20   11    1    0   13  105    0   30    6
     3    6]
 [  85    7    1   34   15   38   21   18    0   28  207   12   26   23
    23   30]
 [  50    4    0   33   19   26   37    9    3   26  387   14   45    8
     1  105]
 [ 333   31   36    1    1    3    1  767    2    4  724   23   18  250
     8  104]
 [  41    5   17    0    1    2    0   22   16    2   55    1   40   31
    11    2]
 [ 146    6    0   12   10   11   23   36    0  139  307   43   10   11
     0   40]
 [ 380   60   11    5    5   18   31  586    2   39 2282   49   19  119
     8  250]
 [ 183   14    1    3    2    9    8  215

<div class='alert alert-block alert-warning'>
            Question:</div>
            
We picked $300$ as number of topics. What would be the procedure to follow if we wanted to choose this hyperparameter through the data ?

To determine the optimal number of topics for topic modeling in text classification tasks, a structured, data-driven approach is recommended, blending quantitative evaluation with qualitative insights:

1. **Initial Setup:** Begin by selecting a range of potential numbers of topics. This range should be informed by your dataset's characteristics and the computational resources at your disposal.

2. **Quantitative Evaluation:** Use metrics to quantitatively evaluate the performance of your topic models across the range of topic numbers. While coherence scores and perplexity are standard metrics in topic modeling, you might rely on the performance of a downstream classification task, such as accuracy, as a proxy for topic model quality in environments where these standard metrics are not readily available.

3. **Cross-Validation:** Implement cross-validation to ensure the stability and robustness of your chosen hyperparameter across different data subsets. This step helps in verifying that the selected number of topics is not overfitted to a specific partition of your dataset.

4. **Iterative Testing:** Conduct iterative tests over your defined range of topic numbers, assessing each configuration's performance based on your chosen evaluation metric. This could be done through grid search techniques or more sophisticated optimization methods.

5. **Qualitative Evaluation:** Complement quantitative assessments with qualitative evaluations of the topics generated. Inspect the coherence, distinctiveness, and relevance of topics manually to ensure they align with analytical goals and exhibit meaningful patterns.

6. **Optimization and Final Selection:** Identify the optimal number of topics as the one that provides the best balance between quantitative performance (e.g., classification accuracy) and qualitative coherence. This optimal point is where the topics are not too broad to be meaningless nor too fine-grained to be overly specific.

Throughout this process, the goal is to find a sweet spot where the topics are meaningful and contribute positively to the performance of downstream tasks like classification, balancing between the granularity of topics and the manageability of model complexity and interpretability.

In [15]:
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Example range of topics to explore
num_topics_range = range(50, 351, 50)  # From 50 to 350, stepping by 50
performance_scores = []

for num_topics in num_topics_range:
    # Apply TruncatedSVD to reduce dimensions
    svd_model = TruncatedSVD(n_components=num_topics, random_state=42)
    X_train_reduced = svd_model.fit_transform(X_train_counts)
    X_test_reduced = svd_model.transform(X_test_counts)

    # Train logistic regression on the reduced dataset
    lr_model = LogisticRegression(max_iter=1000, random_state=42)
    lr_model.fit(X_train_reduced, labels_reduced)
    predictions = lr_model.predict(X_test_reduced)

    # Evaluate performance
    score = accuracy_score(test_labels, predictions)
    performance_scores.append(score)
    print(f"Num Topics: {num_topics}, Accuracy: {score}")

# Identify the number of topics with the best performance
optimal_num_topics = num_topics_range[np.argmax(performance_scores)]
print("Optimal number of topics:", optimal_num_topics)


Num Topics: 50, Accuracy: 0.23654773895986853
Num Topics: 100, Accuracy: 0.24715050628213964
Num Topics: 150, Accuracy: 0.25271695912633196
Num Topics: 200, Accuracy: 0.25451942957111806
Num Topics: 250, Accuracy: 0.253618194348725
Num Topics: 300, Accuracy: 0.25642792768912687
Num Topics: 350, Accuracy: 0.25785930127763346
Optimal number of topics: 350


## 4 - Dense Count-based Representations

The following function allows to obtain very large-dimensional vectors for **words**. We will now follow a different procedure:
- Step 1: Obtain the co-occurence matrix, based on the vocabulary, giving you a vector by word in the vocabulary.
- Step 2: Apply an SVD to obtain **word embeddings** of dimension $300$, for each word in the vocabulary.
- Step 3: Obtain document representations by aggregating embeddings associated to each word in the document.
- Step 4: Train a classifier on the (document representations, label) pairs.

Some instructions:
- In step 1, use the ```co_occurence_matrix``` function, which you need to complete.
- In step 2, use ```TruncatedSVD```to obtain word representations of dimension $300$ from the output of the ```co_occurence_matrix``` function.
- In step 3, use the ```sentence_representations``` function, which you will need to complete.
- In step 4, put the pipeline together by obtaining document representations for both training and testing data. Careful: the word embeddings must come from the *training data co-occurence matrix* only.

Lastly, add a **Step 1b**: transform the co-occurence matrix into the PPMI matrix, and compare the results.

In [16]:
def co_occurence_matrix(corpus, vocabulary, window=0):
    """
    Params:
        corpus (list of list of strings): corpus of sentences
        vocabulary (dictionary): word to index mapping for the vocabulary
        window (int): size of the context window; when 0, the context is the whole sentence
    Returns:
        matrix (np.array of size (len(vocabulary), len(vocabulary))): the co-oc matrix, using the same ordering as the vocabulary given in input
    """
    vocab_size = len(vocabulary)
    M = np.zeros((vocab_size, vocab_size))
    for sent in corpus:
        # Convert sentence words to indexes, based on the vocabulary
        sent_idx = [vocabulary[word] for word in sent if word in vocabulary]

        # Iterate through each word in the sentence
        for i, word_idx in enumerate(sent_idx):
            # Determine context based on window size
            if window > 0:
                # Limited context window
                start = max(i - window, 0)
                end = min(i + window + 1, len(sent_idx))
            else:
                # Whole sentence as context
                start, end = 0, len(sent_idx)

            # Update the co-occurrence matrix for words within the context
            for j in range(start, end):
                if i != j:  # Skip the word itself
                    context_word_idx = sent_idx[j]
                    M[word_idx, context_word_idx] += 1
                    # The matrix is symmetric, but we fill it in both directions anyway
                    M[context_word_idx, word_idx] += 1
    return M


<div class='alert alert-block alert-info'>
            Code:</div>

In [17]:
# Convert your vocabulary dictionary to a word:index format
word_to_index = {word: i for i, word in enumerate(sorted(vocabulary))}

# Obtain the co-occurrence matrix
co_occ_matrix = co_occurence_matrix(preprocessed_texts, word_to_index)

# Function to transform co-occurrence matrix into PPMI
def compute_ppmi(co_occ_matrix):
    total_count = np.sum(co_occ_matrix)
    sum_over_rows = np.sum(co_occ_matrix, axis=1)
    sum_over_cols = np.sum(co_occ_matrix, axis=0)
    expected = np.outer(sum_over_rows, sum_over_cols) / total_count
    with np.errstate(divide='ignore', invalid='ignore'):
        ppmi_matrix = np.log2(co_occ_matrix * total_count / (expected + 1e-8))
        ppmi_matrix[np.isinf(ppmi_matrix) | np.isnan(ppmi_matrix)] = 0
        ppmi_matrix = np.maximum(ppmi_matrix, 0)
    return ppmi_matrix


# Optionally transform the co-occurrence matrix into PPMI
ppmi_matrix = compute_ppmi(co_occ_matrix)

# Apply TruncatedSVD to reduce dimensions
svd = TruncatedSVD(n_components=300, random_state=42)
word_embeddings = svd.fit_transform(ppmi_matrix)  # Use co_occ_matrix directly if not using PPMI

# 'word_embeddings' now contains the 300-dimensional embeddings for each word in your vocabulary


<div class='alert alert-block alert-info'>
            Code:</div>

In [43]:
def sentence_representations(texts, vocabulary, embeddings, np_func=np.mean):
    """
    Represent the sentences as a combination of the vector of its words.
    Parameters
    ----------
    texts : a list of sentences
    vocabulary : dict
        From words to indexes of vector.
    embeddings : Matrix containing word representations
    np_func : function (default: np.sum)
        A numpy matrix operation that can be applied columnwise,
        like `np.mean`, `np.sum`, or `np.prod`.
    Returns
    -------
    np.array, dimension `(len(texts), embeddings.shape[1])`
    """
    representations = []
    for text in texts:
        # Retrieve indexes of words in the sentence from the vocabulary
        indexes = [vocabulary[word] for word in text if word in vocabulary]

        if not indexes:
            # Handle sentences with no known vocabulary words
            sent_rep = np.zeros(embeddings.shape[1])
        else:
            # Retrieve embeddings for these words
            word_embeddings = embeddings[indexes]

            # Aggregate embeddings of words in the sentence using np_func
            sent_rep = np_func(word_embeddings, axis=0)

        representations.append(sent_rep)

    representations = np.array(representations)
    return representations


<div class='alert alert-block alert-info'>
            Code:</div>

In [44]:
# Step 1: Obtain Document Representations
# Convert your training and test texts into their vector representations
# First, preprocess the texts to tokenize them, similar to how 'preprocessed_texts' was obtained
preprocessed_train_texts = [preprocess_text(text) for text in texts_reduced]
preprocessed_test_texts = [preprocess_text(text) for text in test_texts]

# Now, use the sentence representations function
X_train = sentence_representations(preprocessed_train_texts, word_to_index, word_embeddings, np_func=np.mean)
X_test = sentence_representations(preprocessed_test_texts, word_to_index, word_embeddings, np_func=np.mean)

# Step 2: Apply the Classifier
# Initialize the logistic regression classifier
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)

# Train the classifier on the training data
lr_classifier.fit(X_train, labels_reduced)

# Predict the labels for the test set
predictions = lr_classifier.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(test_labels, predictions)
conf_matrix = confusion_matrix(test_labels, predictions)
class_report = classification_report(test_labels, predictions)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.27328632773153794
Confusion Matrix:
 [[ 532   70   67   20   23   25   29  411   17   71  795   65   46  125
    15   93]
 [ 232  137    4   17    5   21   21   81    4   46  594   11   22   16
     6   85]
 [ 115    3   84    0    4    2    0  238    8    5  156   17   10  124
     7    2]
 [  18    3    0   55   43   40   36    0    0   33   56    7   18    0
     1   19]
 [  23    1    1   38   72   38   25    0    2   13   33    2   24    0
     5   15]
 [  37    8   10   39   58  115   29   10    7   29  108   10   21   17
    30   40]
 [  30    4    1   32   48   40   52    2   11   38  312   15   39    4
     0  139]
 [ 302   47   50    1    1    9    0  773    6    5  602   59   31  313
     4  103]
 [  32    6   23    0    2    5    2   15   46    0   42    0   22   34
    16    1]
 [ 109    5    1   27   14   30   30   29    2  189  227   60   17   15
     1   38]
 [ 289   87   28   13    4   19   40  619    2   57 2152   96   23  143
    17  275]
 [ 152    4   10

## 5 - Dense Prediction-based Representations

We will now use word embeddings from ```Word2Vec```: which we will train ourselves

We will use the ```gensim``` library for its implementation of word2vec in python. Since we want to keep the same vocabulary as before: we'll first create the model, then re-use the vocabulary we generated above.

In [45]:
from gensim.models import Word2Vec
from collections import Counter

<div class='alert alert-block alert-info'>
            Code:</div>

In [48]:
# The model is to be trained with a list of tokenized sentences, containing the full training dataset.
# Preprocess your texts
preprocessed_texts = [preprocess_text(text) for text in texts_reduced]

# Initialize the Word2Vec model
model = Word2Vec(vector_size=300, window=5, min_count=1)

# Build the vocabulary from your preprocessed texts
model.build_vocab(preprocessed_texts)

In [49]:
# Train the model on the preprocessed texts
model.train(preprocessed_texts, total_examples=len(preprocessed_texts), epochs=30, report_delay=1)

(11907371, 16946040)

Then, we can re-use the ```sentence_representations```function like before to obtain document representations, and apply classification.
<div class='alert alert-block alert-info'>
            Code:</div>

In [58]:
# Update the vocabulary and embeddings from the trained Word2Vec model
vocabulary = {word: idx for idx, word in enumerate(model.wv.index_to_key)}
embeddings = np.vstack([model.wv[word] for word in model.wv.index_to_key])

# Generate document vectors for training and testing sets
X_train_vec = sentence_representations(preprocessed_train_texts, vocabulary, embeddings)
X_test_vec = sentence_representations(preprocessed_test_texts, vocabulary, embeddings)

# Train the classifier on the document vectors
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)
lr_classifier.fit(X_train_vec, labels_reduced)

# Predict the labels for the test set
predictions_vec = lr_classifier.predict(X_test_vec)

# Evaluate the classifier
accuracy_vec = accuracy_score(test_labels, predictions_vec)
conf_matrix_vec = confusion_matrix(test_labels, predictions_vec)
class_report_vec = classification_report(test_labels, predictions_vec)

print("Accuracy:", accuracy_vec)
print("Confusion Matrix:\n", conf_matrix_vec)
print("Classification Report:\n", class_report_vec)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.27169591263319726
Confusion Matrix:
 [[ 587   41   63   15   21   31   28  394   17   74  803   52   41  130
    16   91]
 [ 267  108    0   14    9   27   20   72    3   41  615    8   13   20
     7   78]
 [ 133    3   76    0    1   10    0  235   14    4  149   10   14  115
     8    3]
 [  20    3    0   58   43   34   25    4    0   31   62    5   21    0
     1   22]
 [  22    1    1   37   78   42   13    1    4   16   40    1   26    1
     3    6]
 [  47    6    5   39   54  120   25   13   15   34  112    8   18   19
    23   30]
 [  37    2    0   34   45   43   48    3    8   35  308   13   37    4
     0  150]
 [ 313   35   49    0    1    8    0  766    6    4  643   45   25  315
     7   89]
 [  36    3   22    0    3   11    1   13   35    1   41    1   30   32
    12    5]
 [ 112    2    0   22   13   25   28   26    3  200  245   49   17   18
     1   33]
 [ 327   58   32    5    5   37   39  591    1   56 2154   78   28  151
    18  284]
 [ 161    3    7

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Comment on the results. What is the big issue with the dataset that using embeddings did not solve ?
**Given this type of data**, what would you propose if you needed solve this task (i.e, reach a reasonnable performance) in an industrial context ?

The performance metrics indicate challenges with the dataset and model, evidenced by a relatively low accuracy of around 27.2% and varied performance across different classes. This situation hints at underlying issues such as class imbalance, contextual ambiguity, and possibly noisy data, which the use of embeddings alone did not adequately address.

To tackle these challenges, especially in an industrial context where achieving reasonable performance is critical, a multifaceted approach is recommended:

1. **Address Class Imbalance**: Implement resampling techniques or cost-sensitive training to balance the influence of different classes on the training process.

2. **Enhance Model Capability**: Transition to more advanced NLP models like BERT, GPT, or RoBERTa that are better at capturing contextual nuances and have shown superior performance in a wide range of NLP tasks. Consider ensemble methods to leverage the strengths of multiple models.

3. **Improve Data Quality**: Engage in more sophisticated preprocessing and data augmentation strategies to clean the data further and increase its diversity, helping the model to generalize better.

4. **Experiment and Evaluate Thoroughly**: Employ cross-validation for more reliable performance evaluation and engage in systematic hyperparameter tuning to find the optimal model configuration.

Implementing these strategies involves balancing computational resources and model complexity, with a keen eye on the dataset's specific characteristics. In an industrial setting, it's also crucial to establish a continuous evaluation loop, where the model's real-world performance is regularly monitored, and the model is updated or retrained as needed to adapt to new data and evolving requirements.