# Loading the processed dataset


In [5]:
import pandas as pd

df = pd.read_csv("/opt/airflow/data/processed/us_airline_sentiment_processed.csv")
# Store ids as strings; Chroma expects string ids later on.
df["id"] = df["id"].astype(str)
df.head()

Unnamed: 0,id,label,text_clean
0,5.70306e+17,neutral,what dhepburn said
1,5.70301e+17,positive,plus youve added commercials to the experienc...
2,5.70301e+17,neutral,i didnt today must mean i need to take anothe...
3,5.70301e+17,negative,its really aggressive to blast obnoxious ente...
4,5.70301e+17,negative,and its a really big bad thing about it


Note: verifying the load
If the table is displayed, the processed dataset is available and ready for encoding. Otherwise, check the file path or run the cleaning step first (`03_text_cleaning.ipynb`).
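
Beyond eyeballing the table, a few programmatic checks can confirm that the frame is usable. The sketch below is a minimal illustration, assuming the three columns shown in the preview (`id`, `label`, `text_clean`). The helper name `summarize_processed_frame` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

REQUIRED_COLUMNS = ["id", "label", "text_clean"]  # columns seen in the preview


def summarize_processed_frame(frame: pd.DataFrame) -> dict:
    """Return basic health indicators for the processed dataset (illustrative helper)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return {
        "rows": len(frame),
        "empty_texts": int(frame["text_clean"].isna().sum()),
        "duplicate_ids": int(frame["id"].duplicated().sum()),
    }


# Toy frame standing in for the real CSV (made-up values).
toy = pd.DataFrame({
    "id": ["a", "a", "b"],
    "label": ["neutral", "positive", "negative"],
    "text_clean": ["hello", None, "bad flight"],
})
summary = summarize_processed_frame(toy)
print(summary)
```

On the real data, calling `summarize_processed_frame(df)` would give the same three indicators. A non-zero `duplicate_ids` count would be one reason to make the ids unique before indexing them.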

# Generating the embeddings


We chose the `paraphrase-MiniLM-L12-v2` model because it produces embeddings well suited to capturing semantic similarity between sentences, which fits our text analysis.
It is also lightweight and fast, so large volumes of data can be processed with modest resources (the run below encodes on CPU).
It therefore offers a good trade-off between accuracy and speed for this project.
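
To make "semantic similarity" concrete: a common way to compare two sentence embeddings is cosine similarity. The sketch below assumes that metric and uses made-up three-dimensional vectors rather than real model outputs; the variable names are purely illustrative.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; values near 1 mean same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 3-dimensional vectors; real embeddings from the model are much longer.
v_delay = np.array([1.0, 2.0, 0.0])
v_late = np.array([2.0, 4.0, 0.0])    # same direction as v_delay
v_thanks = np.array([0.0, 0.0, 3.0])  # orthogonal to v_delay

sim_close = cosine_similarity(v_delay, v_late)
sim_far = cosine_similarity(v_delay, v_thanks)
print(sim_close, sim_far)
```

Two sentences with close meanings should yield a score near 1, while unrelated sentences should score noticeably lower.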

In [6]:
from sentence_transformers import SentenceTransformer
import torch
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd

model_name="paraphrase-MiniLM-L12-v2"
device="cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

train_embedding_path="/opt/airflow/data/embeddings/train_embedding.npy"
test_embedding_path="/opt/airflow/data/embeddings/test_embedding.npy"
train_metadata_path="/opt/airflow/data/metadata/train_metadata.csv"
test_metadata_path="/opt/airflow/data/metadata/test_metadata.csv"

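x = df[["id", "text_clean"]]
y = df["label"]

# Stratified 80/20 split so both sets keep the same label proportions.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, random_state=42, test_size=0.2, stratify=y
)

# Pass the device by keyword: the second positional parameter is not the device.
model = SentenceTransformer(model_name, device=device)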
print("Encoding train embeddings...")

train_embedding=model.encode(
    X_train['text_clean'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
)
print("Train embeddings done!")

print("Encoding test embeddings...")

test_embedding=model.encode(
    X_test['text_clean'].tolist(),
    convert_to_numpy=True,
    show_progress_bar=True,
    batch_size=64
    )

print("Test embeddings done!")

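# Make ids unique by appending each row's original index.
# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning.
X_train = X_train.assign(id=X_train["id"].astype(str) + "_" + X_train.index.astype(str))
X_test = X_test.assign(id=X_test["id"].astype(str) + "_" + X_test.index.astype(str))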

train_metadata=pd.DataFrame({
    "id":X_train["id"].tolist(),
    "label":y_train.to_numpy()
})

test_metadata=pd.DataFrame({
    "id":X_test["id"].tolist(),
    "label":y_test.to_numpy()
})

train_metadata.to_csv(train_metadata_path,index=False)
test_metadata.to_csv(test_metadata_path,index=False)

np.save(train_embedding_path,train_embedding)
np.save(test_embedding_path,test_embedding)

print("Embeddings and metadata saved.")




Using device: cpu
Encoding train embeddings...


Batches: 100%|██████████| 183/183 [00:51<00:00,  3.54it/s]


Train embeddings done!
Encoding test embeddings...


Batches: 100%|██████████| 46/46 [00:12<00:00,  3.58it/s]


Test embeddings done!
Embeddings and metadata saved.


- `train_embedding.npy` and `test_embedding.npy` contain the embedding vectors in NumPy format.
- `train_metadata.csv` and `test_metadata.csv` contain the `id` and `label` associated with each embedding, in the same row order.

These files are used for training the classifier, for evaluation, and for indexing in a vector database.
Keep them in `data/embeddings` and `data/metadata` for the following steps.
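
A later step can reload a split and check that vectors and metadata still line up. The sketch below is one possible way to do that, not the notebook's own code: the helper `load_split` is hypothetical, and the demonstration writes throwaway files to a temporary directory instead of touching the real paths.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_split(embedding_path: str, metadata_path: str):
    """Load one split and verify that vectors and metadata rows line up."""
    embeddings = np.load(embedding_path)
    # Read ids as strings so they are not reinterpreted as numbers.
    metadata = pd.read_csv(metadata_path, dtype={"id": str})
    if embeddings.shape[0] != len(metadata):
        raise ValueError(
            f"{embeddings.shape[0]} vectors but {len(metadata)} metadata rows"
        )
    return embeddings, metadata


# Demonstration on throwaway files so the real data is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "demo_embedding.npy")
    meta_path = os.path.join(tmp, "demo_metadata.csv")
    np.save(emb_path, np.zeros((2, 4), dtype=np.float32))
    pd.DataFrame({"id": ["1_0", "2_1"], "label": ["neutral", "negative"]}).to_csv(
        meta_path, index=False
    )
    demo_embeddings, demo_metadata = load_split(emb_path, meta_path)

print(demo_embeddings.shape, list(demo_metadata["label"]))
```

In the pipeline, the same helper could be called with `train_embedding_path` and `train_metadata_path` as defined in the cell above.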

# Initializing the vector database (Chroma)
This step creates a persistent Chroma client pointing to `data/chroma_db`, which makes it possible to index and search embeddings locally.

In [7]:
import chromadb
from chromadb.config import Settings


In [8]:
client = chromadb.PersistentClient(path="/opt/airflow/data/chroma_db")
train_collection = client.get_or_create_collection(name="avis_train")
test_collection = client.get_or_create_collection(name="avis_test")


# Indexing the embeddings: train
Note: the loop below adds the training embeddings to the `avis_train` collection in batches.

In [None]:
batch_size = 1000
metadatas_full = [{"label": label, "split": "train"} for label in train_metadata["label"].tolist()]
n = len(train_metadata)

# Start from an empty collection so reruns do not hit duplicate ids.
# list_collections() returns names or Collection objects depending on the Chroma version.
existing = {getattr(c, "name", c) for c in client.list_collections()}
if "avis_train" in existing:
    client.delete_collection("avis_train")

train_collection = client.get_or_create_collection("avis_train")

for i in range(0, n, batch_size):
    # Positional slices keep ids, texts and vectors aligned row by row.
    ids = train_metadata["id"].iloc[i:i+batch_size].tolist()
    metadatas = metadatas_full[i:i+batch_size]
    documents = X_train["text_clean"].iloc[i:i+batch_size].tolist()
    batch_embeddings = train_embedding[i:i+batch_size].tolist()

    train_collection.add(
        ids=ids,
        embeddings=batch_embeddings,
        metadatas=metadatas,
        documents=documents,
    )
    print(f"Added train batch {i} to {i+len(ids)}")


Added train batch 0 to 1000
Added train batch 1000 to 2000
Added train batch 2000 to 3000
Added train batch 3000 to 4000
Added train batch 4000 to 5000
Added train batch 5000 to 6000
Added train batch 6000 to 7000
Added train batch 7000 to 8000
Added train batch 8000 to 9000
Added train batch 9000 to 10000
Added train batch 10000 to 11000
Added train batch 11000 to 11712


After execution, the `avis_train` collection contains the training embeddings along with their labels and cleaned texts.

# Indexing the embeddings: test
The loop below adds the test embeddings to the `avis_test` collection in batches. The printed messages show the progress.

In [10]:
batch_size = 1000
metadatas_full = [{"label": label, "split": "test"} for label in test_metadata["label"].tolist()]
n = len(test_metadata)

# Start from an empty collection so reruns do not hit duplicate ids.
# list_collections() returns names or Collection objects depending on the Chroma version.
existing = {getattr(c, "name", c) for c in client.list_collections()}
if "avis_test" in existing:
    client.delete_collection("avis_test")

test_collection = client.get_or_create_collection("avis_test")

for i in range(0, n, batch_size):
    # Positional slices keep ids, texts and vectors aligned row by row.
    ids = test_metadata["id"].iloc[i:i+batch_size].tolist()
    metadatas = metadatas_full[i:i+batch_size]
    documents = X_test["text_clean"].iloc[i:i+batch_size].tolist()
    batch_embeddings = test_embedding[i:i+batch_size].tolist()

    test_collection.add(
        ids=ids,
        embeddings=batch_embeddings,
        metadatas=metadatas,
        documents=documents,
    )
    print(f"Added test batch {i} to {i+len(ids)}")


Added test batch 0 to 1000
Added test batch 1000 to 2000
Added test batch 2000 to 2928
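
Once both collections are filled, a quick nearest-neighbour query is a reasonable sanity check that the index returns sensible results. The sketch below is an illustration under assumptions, not part of the original notebook: the helper `nearest_label_vote` is hypothetical, it relies on the `query` method of a Chroma collection with `query_embeddings`, `n_results` and `include`, and it is exercised here against a small stand-in object (`FakeCollection`) so that it runs without a database.

```python
from collections import Counter


def nearest_label_vote(collection, query_embedding, k=5):
    """Query the k nearest neighbours and return (majority_label, neighbour_labels)."""
    # Convert NumPy vectors to plain Python floats before sending them to the index.
    if hasattr(query_embedding, "tolist"):
        vector = query_embedding.tolist()
    else:
        vector = list(query_embedding)
    result = collection.query(
        query_embeddings=[vector],
        n_results=k,
        include=["metadatas", "distances"],
    )
    # One list of results per query; a single query was sent, hence [0].
    labels = [meta["label"] for meta in result["metadatas"][0]]
    majority_label = Counter(labels).most_common(1)[0][0]
    return majority_label, labels


class FakeCollection:
    """Stand-in that mimics the shape of a query result (for illustration only)."""

    def query(self, query_embeddings, n_results, include):
        metadatas = [{"label": "negative"}, {"label": "negative"}, {"label": "neutral"}]
        return {
            "ids": [["a", "b", "c"][:n_results]],
            "metadatas": [metadatas[:n_results]],
            "distances": [[0.1, 0.2, 0.4][:n_results]],
        }


vote, neighbours = nearest_label_vote(FakeCollection(), [0.0, 1.0], k=3)
print(vote, neighbours)
```

In the notebook, the same helper could be called as `nearest_label_vote(train_collection, test_embedding[0])` and the vote compared with the first entry of `test_metadata["label"]`.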
