# TP: Applying Arora's Method for Sentence Embeddings

* Master 1 – MIASHS, Université Lyon 2  
* December 2025  


## Objectives

**Global goal:** Build a model that takes an SMS message as input and predicts whether it is **spam** or **not spam**.  
To achieve this, you will compute sentence embeddings using the method proposed by Arora *et al.* (2017):  
https://openreview.net/forum?id=SyK00v5xx

**Steps of the practical session:**

1. Download the **SMS Spam Collection Dataset**:  
   https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

2. Extract **word embeddings** for each observation using **Gensim Word2Vec**, forming a dictionary `{token: embedding}`.

3. Compute the **sentence embedding** $v_s$ of each observation using the **Arora SIF method**.

4. Train a **RandomForestClassifier**, a **Logistic Regression**, a **Neural Network**, or any other model of your choice, using the sentence embeddings as features.

5. Evaluate your classifier (precision, recall, accuracy, F1-score, etc.).


### Targeted Model

* $N$: dimension of the embedding space.  
* $J$: number of tokens/words in the sentence.  
* $v_s$: sentence embedding computed using the Arora method from word embeddings $(w_1, \dots, w_J)$.  
* $\delta$: binary target (1 for spam, 0 for non-spam).

The global learning pipeline is:

$$
\text{sentence} \quad \xrightarrow[\text{embeddings}]{\text{extract word}} \quad \left(w_1, \dots, w_J \right) \in \mathbb{R}^{N \times J} \quad \xrightarrow{\text{Arora}} \quad v_s \in \mathbb{R}^{N} \quad \xrightarrow{\boxed{\text{model}}} \quad \delta^{\text{pred}} \in \{0,1\}
$$
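
For reference, the Arora step of the pipeline can be written out as a weighted average followed by the removal of a common component. The symbols $p(w_j)$ (estimated unigram probability of word $w_j$ in the corpus), $a$ (a small smoothing parameter), $\tilde v_s$ (intermediate weighted average) and $u$ (first singular vector of the matrix whose columns are the $\tilde v_s$ of all sentences) are notation introduced here and do not appear elsewhere in this notebook:

$$
\tilde v_s = \frac{1}{J} \sum_{j=1}^{J} \frac{a}{a + p(w_j)} \, w_j \in \mathbb{R}^{N},
\qquad
v_s = \tilde v_s - u u^{\top} \tilde v_s \in \mathbb{R}^{N}
$$

Frequent words have a large $p(w_j)$ and therefore a small weight, so they contribute little to $v_s$.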


### Python Methods for Arora et al. (2017)

The official GitHub repository of the method is available here:  
https://github.com/PrincetonML/SIF

## I) Import Required Packages and Data
### I)1) Packages and Data

In [1]:
from tp_arora_pkg import *

In [2]:
df = load_text_classification_dataset('./spam.csv')

Loaded successfully with encoding='ISO-8859-1'.


#### Display the first few rows of the DataFrame to inspect its structure and verify that it was loaded correctly.

In [None]:
...

### I)2) Descriptive Statistics

#### Use the descriptive statistics and visualization functions provided in the tp_arora_pkg package to explore the dataset by computing summary statistics and plotting the target distribution.

In [None]:
...
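
The helper functions of `tp_arora_pkg` are not shown in this notebook, so here is a minimal pandas-only sketch of the kind of summary you might compute. The column names `label` and `text` and the toy rows are assumptions made for illustration; adapt them to the columns of your own `df`.

```python
import pandas as pd

# Toy stand-in for the SMS dataset (hypothetical column names and rows)
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "text": ["see you at lunch", "WIN a free phone now", "ok", "call me later"],
})

# Share of each class: a quick way to spot class imbalance
class_share = toy["label"].value_counts(normalize=True)

# Message length in characters, averaged per class
toy["n_chars"] = toy["text"].str.len()
length_by_class = toy.groupby("label")["n_chars"].mean()

print(class_share)
print(length_by_class)
```

A strongly imbalanced target is worth keeping in mind later, since accuracy alone can then look good for a model that rarely predicts spam.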

## II) NLP Model encoding

#### Additional Resources

Here are a few useful references to help you explore Word2Vec and word embeddings more deeply:

* https://radimrehurek.com/gensim/models/word2vec.html  
  *Official Gensim documentation for the Word2Vec model.*

* https://datascientest.com/nlp-word-embedding-word2vec  
  *A clear introduction to Word2Vec and its role in NLP.*

Feel free to explore further resources (blog posts, tutorials, videos, research papers) to better understand how word embeddings, sentence embeddings, and the Arora SIF method fit into modern NLP workflows.


### II)1) Word embeddings extraction

#### Use the word-embedding extraction function from the package to transform each text entry into its corresponding set of word embeddings.  
Have a look at the first rows.

In [None]:
...

### II)2) Sentence embeddings computation

#### Apply the Arora sentence-embedding method to the word-embedding dictionary of each document, then display the first few resulting sentence embeddings to verify that the transformation worked.

In [None]:
...
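
To make the transformation concrete, here is a minimal NumPy sketch of SIF weighting followed by common-component removal. It is not the implementation used by `tp_arora_pkg`: the function name `sif_embeddings`, its parameters `a` and `n_remove`, the value `a=1e-3`, and the random toy vectors are all assumptions chosen for illustration. Word probabilities are estimated from the toy corpus itself.

```python
from collections import Counter

import numpy as np


def sif_embeddings(tokenized, word_vectors, a=1e-3, n_remove=1):
    """Sketch of SIF sentence embeddings (hypothetical helper)."""
    # Unigram probabilities p(w), estimated from the corpus itself
    counts = Counter(tok for sent in tokenized for tok in sent)
    total = sum(counts.values())
    dim = len(next(iter(word_vectors.values())))

    sentence_vectors = np.zeros((len(tokenized), dim))
    for i, sent in enumerate(tokenized):
        known = [tok for tok in sent if tok in word_vectors]
        if not known:
            continue  # no known token: keep the zero vector
        # SIF weight a / (a + p(w)): frequent words get small weights
        weights = np.array([a / (a + counts[tok] / total) for tok in known])
        vectors = np.stack([word_vectors[tok] for tok in known])
        sentence_vectors[i] = weights @ vectors / len(known)

    if n_remove > 0:
        # Leading right singular vectors of the (uncentered) sentence matrix
        _, _, vt = np.linalg.svd(sentence_vectors, full_matrices=False)
        u = vt[:n_remove]
        # Subtract the projection of every sentence onto these directions
        sentence_vectors = sentence_vectors - sentence_vectors @ u.T @ u
    return sentence_vectors


# Toy corpus with random 8-dimensional "word embeddings"
rng = np.random.default_rng(0)
corpus = [
    ["win", "a", "free", "phone"],
    ["see", "you", "at", "a", "lunch"],
    ["free", "prize", "win"],
]
vocab = sorted({tok for sent in corpus for tok in sent})
toy_vectors = {tok: rng.normal(size=8) for tok in vocab}

V = sif_embeddings(corpus, toy_vectors)
print(V.shape)  # (3, 8)
```

Setting `n_remove=0` in this sketch skips the component removal and returns the plain weighted averages.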

## III) Training a Classification Model

### Instructions

1. Using the sentence embeddings computed with Arora's method, build a model that predicts whether a message should be classified as **SPAM** or **not SPAM**.

   * First, **split** the dataset into a training set and a test set.  
     (What is the explanatory variable? What is the target variable?)

   * Import a classification model (RandomForest, Logistic Regression, Neural Network, or any other classifier).  
     Instantiate it and train it on the **training** portion of the dataset.

   * Evaluate the **quality** of your model by comparing predictions on the **test set** with the true labels.  
     (*Which metrics can be used for this?* Think about accuracy, precision, recall, and F1-score, but also the Brier score or other calibration metrics.)

   * Display the **confusion matrix**.


For a complete piece of work, you should compare **several competing models** (as exhaustively as possible) on both discrimination and calibration in order to reach a reasonably rigorous conclusion.

2. Suppose you have obtained a trained model `classif_spam` whose performance is satisfactory.  
   What happens when you apply it to an **external dataset**?

In [None]:
import numpy as np
import pandas as pd  # used below to build the DataFrame of new messages
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    brier_score_loss, confusion_matrix,
)


# Examples: you may use any classifier you like
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
...
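
The split, train, and evaluate steps from the instructions can be sketched as follows. The data here is synthetic (random vectors standing in for sentence embeddings, with the spam class shifted so that it is separable), and the variable names and hyperparameters are assumptions for illustration. Replace `X` and `y` with your own sentence embeddings and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, brier_score_loss, confusion_matrix,
    f1_score, precision_score, recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sentence embeddings (X) and spam labels (y)
rng = np.random.default_rng(0)
n_samples, dim = 400, 16
y = (rng.random(n_samples) < 0.2).astype(int)  # roughly 20% "spam"
X = rng.normal(size=(n_samples, dim)) + 1.5 * y[:, None]  # shift the spam class

# Stratified split keeps the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)              # hard labels, for discrimination metrics
y_prob = clf.predict_proba(X_test)[:, 1]  # P(spam), for calibration metrics

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "brier": brier_score_loss(y_test, y_prob),
}
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
print(cm)
```

The same pattern applies to any other scikit-learn classifier exposing `fit`, `predict`, and `predict_proba`, which makes it straightforward to compare several models on identical train and test sets.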

In [21]:
# Artificial dataset of new SMS messages
data = {
    'text': [
        'what about',
        'see the last new of our model',
        'win last phone',
        'WINNER!! As a valued network customer you have been selected to receive a price'
    ]
}

df_newdata = pd.DataFrame(data)

# --------------------------------------------------------------
# Extract embeddings and compute Arora sentence embeddings
# --------------------------------------------------------------

df_newdata = extract_word_embeddings(df_newdata)
df_newdata = arora_methods(df_newdata, remove_pc_nbr=0)

# Convert embeddings to numpy format
X_new = np.stack(df_newdata['sentence_embeddings'].values)

# --------------------------------------------------------------
# Predictions for each model
# --------------------------------------------------------------

print("\n--- Predictions on New Messages ---")

# 1. Neural Network (MLPClassifier)
y_pred_mlp = clf_nn.predict(X_new)
print("MLPClassifier predictions :", y_pred_mlp)

# 2. Random Forest
y_pred_rf = clf_rf.predict(X_new)
print("RandomForest predictions  :", y_pred_rf)

# 3. Logistic Regression with LASSO
y_pred_lr = clf_lr.predict(X_new)
print("Logistic Regression (LASSO) predictions :", y_pred_lr)

Computing word embeddings...: 100%|████████████| 4/4 [00:00<00:00, 34807.50it/s]
Computing sentence embeddings...: 100%|█████████| 4/4 [00:00<00:00, 4666.82it/s]


--- Predictions on New Messages ---
MLPClassifier predictions : [0 0 0 0]
RandomForest predictions  : [0 0 0 0]
Logistic Regression (LASSO) predictions : [0 0 0 0]



