## <center style="color:red">**News Classifier**</center>

### <center>**ML Pipeline with Sentence Transformers, ChromaDB & Airflow**</center>

This project builds a complete text classification pipeline that automatically assigns news articles to one of 4 classes:

- World
- Sports
- Business
- Sci/Tech

The pipeline uses NLP techniques, vector embeddings (Sentence Transformers), a vector database (ChromaDB), and orchestration with Apache Airflow.
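
The datasets in this notebook store the class as a number. A minimal sketch of a lookup table follows, assuming the label ids are 0-indexed and follow the order of the class list above. That assumption is consistent with the sample rows shown later, where finance headlines carry label 2 and science headlines label 3. `LABEL_NAMES` and `label_name` are illustrative names, not part of the project.

```python
# Assumption: label ids are 0-indexed and follow the order of the class list.
LABEL_NAMES = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def label_name(label_id: int) -> str:
    """Return the human-readable class name for a numeric label."""
    return LABEL_NAMES[label_id]


print(label_name(2))  # Business
```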

<br>

### <span style="color:green">**Data Preprocessing:**</span>

#### <span style="color:orange">**1. Load the Data:**</span>

In [1]:
import pandas as pd

df_train = pd.read_csv("../data/clean/train.csv")

df_test = pd.read_csv("../data/clean/test.csv")

print("Data loaded successfully!")

Data loaded successfully!


In [2]:
df_train.head()

Unnamed: 0,text,label
0,Wall St. Bears Claw Back Into the Black (Reute...,2
1,Carlyle Looks Toward Commercial Aerospace (Reu...,2
2,Oil and Economy Cloud Stocks' Outlook (Reuters...,2
3,Iraq Halts Oil Exports from Main Southern Pipe...,2
4,"Oil prices soar to all-time record, posing new...",2


In [3]:
df_test.head()

Unnamed: 0,text,label
0,Fears for T N pension after talks Unions repre...,2
1,The Race is On: Second Private Team Sets Launc...,3
2,Ky. Company Wins Grant to Study Peptides (AP) ...,3
3,Prediction Unit Helps Forecast Wildfires (AP) ...,3
4,Calif. Aims to Limit Farm-Related Smog (AP) AP...,3


#### <span style="color:orange">**2. Normalize the Dataset:**</span>

In [4]:
df_train["text"] = df_train["text"].apply(lambda row : str(row).lower())

df_train.head()

Unnamed: 0,text,label
0,wall st. bears claw back into the black (reute...,2
1,carlyle looks toward commercial aerospace (reu...,2
2,oil and economy cloud stocks' outlook (reuters...,2
3,iraq halts oil exports from main southern pipe...,2
4,"oil prices soar to all-time record, posing new...",2


In [5]:
df_test["text"] = df_test["text"].apply(lambda row : str(row).lower())

df_test.head()

Unnamed: 0,text,label
0,fears for t n pension after talks unions repre...,2
1,the race is on: second private team sets launc...,3
2,ky. company wins grant to study peptides (ap) ...,3
3,prediction unit helps forecast wildfires (ap) ...,3
4,calif. aims to limit farm-related smog (ap) ap...,3
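
One caveat with `str(row).lower()`: pandas reads empty CSV cells as the float `NaN`, and `str(NaN)` is the literal string `"nan"`, so a missing article would turn into the word "nan" instead of staying empty. A small sketch of a normalizer that maps missing values to an empty string follows. `normalize_text` is a hypothetical helper, not something the notebook defines.

```python
import math


def normalize_text(value) -> str:
    """Lowercase a text value, mapping missing values to an empty string."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value).lower()


print(normalize_text("Wall St. Bears Claw Back"))  # wall st. bears claw back
print(repr(str(float("nan")).lower()))             # 'nan'
print(repr(normalize_text(float("nan"))))          # ''
```

It could be applied with `df_train["text"].apply(normalize_text)` in place of the lambda, if missing rows are a concern in this dataset.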


#### <span style="color:orange">**3. Tokenization:**</span>

In [6]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("punkt_tab")

df_train["word_tokens"] = df_train["text"].apply(lambda x: word_tokenize(x) if isinstance(x, str) else [])

df_train.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abdel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\abdel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unnamed: 0,text,label,word_tokens
0,wall st. bears claw back into the black (reute...,2,"[wall, st., bears, claw, back, into, the, blac..."
1,carlyle looks toward commercial aerospace (reu...,2,"[carlyle, looks, toward, commercial, aerospace..."
2,oil and economy cloud stocks' outlook (reuters...,2,"[oil, and, economy, cloud, stocks, ', outlook,..."
3,iraq halts oil exports from main southern pipe...,2,"[iraq, halts, oil, exports, from, main, southe..."
4,"oil prices soar to all-time record, posing new...",2,"[oil, prices, soar, to, all-time, record, ,, p..."


In [7]:
df_test["word_tokens"] = df_test["text"].apply(lambda x: word_tokenize(x) if isinstance(x, str) else [])

df_test.head()

Unnamed: 0,text,label,word_tokens
0,fears for t n pension after talks unions repre...,2,"[fears, for, t, n, pension, after, talks, unio..."
1,the race is on: second private team sets launc...,3,"[the, race, is, on, :, second, private, team, ..."
2,ky. company wins grant to study peptides (ap) ...,3,"[ky., company, wins, grant, to, study, peptide..."
3,prediction unit helps forecast wildfires (ap) ...,3,"[prediction, unit, helps, forecast, wildfires,..."
4,calif. aims to limit farm-related smog (ap) ap...,3,"[calif., aims, to, limit, farm-related, smog, ..."
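
To make the idea of tokenization concrete without downloading NLTK data, here is a rough regex-based stand-in. It is not NLTK's algorithm: for instance, it splits an abbreviation such as `st.` into `st` and `.`, whereas `word_tokenize` keeps it together in the output above. `simple_tokenize` and `TOKEN_PATTERN` are illustrative names.

```python
import re

# A word (optionally joined by hyphens or apostrophes) or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text: str) -> list:
    """Split text into word and punctuation tokens (simplified sketch)."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("oil prices soar to all-time record, posing new menace"))
# ['oil', 'prices', 'soar', 'to', 'all-time', 'record', ',', 'posing', 'new', 'menace']
```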


#### <span style="color:orange">**4. Remove Stopwords:**</span>

In [8]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words('english'))

df_train["tokens"] = df_train["word_tokens"].apply(
    lambda tokens : [token for token in tokens if token not in stop_words]
)

df_train.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abdel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text,label,word_tokens,tokens
0,wall st. bears claw back into the black (reute...,2,"[wall, st., bears, claw, back, into, the, blac...","[wall, st., bears, claw, back, black, (, reute..."
1,carlyle looks toward commercial aerospace (reu...,2,"[carlyle, looks, toward, commercial, aerospace...","[carlyle, looks, toward, commercial, aerospace..."
2,oil and economy cloud stocks' outlook (reuters...,2,"[oil, and, economy, cloud, stocks, ', outlook,...","[oil, economy, cloud, stocks, ', outlook, (, r..."
3,iraq halts oil exports from main southern pipe...,2,"[iraq, halts, oil, exports, from, main, southe...","[iraq, halts, oil, exports, main, southern, pi..."
4,"oil prices soar to all-time record, posing new...",2,"[oil, prices, soar, to, all-time, record, ,, p...","[oil, prices, soar, all-time, record, ,, posin..."


In [9]:
df_test["tokens"] = df_test["word_tokens"].apply(
    lambda tokens : [token for token in tokens if token not in stop_words]
)

df_test.head()

Unnamed: 0,text,label,word_tokens,tokens
0,fears for t n pension after talks unions repre...,2,"[fears, for, t, n, pension, after, talks, unio...","[fears, n, pension, talks, unions, representin..."
1,the race is on: second private team sets launc...,3,"[the, race, is, on, :, second, private, team, ...","[race, :, second, private, team, sets, launch,..."
2,ky. company wins grant to study peptides (ap) ...,3,"[ky., company, wins, grant, to, study, peptide...","[ky., company, wins, grant, study, peptides, (..."
3,prediction unit helps forecast wildfires (ap) ...,3,"[prediction, unit, helps, forecast, wildfires,...","[prediction, unit, helps, forecast, wildfires,..."
4,calif. aims to limit farm-related smog (ap) ap...,3,"[calif., aims, to, limit, farm-related, smog, ...","[calif., aims, limit, farm-related, smog, (, a..."
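
The filter is a plain set-membership test, which is why the text is lowercased first: NLTK's English stop words are lowercase, so `The` would not be removed. The sketch below reproduces the first training row with a tiny hypothetical stop word set (`DEMO_STOP_WORDS`); the real NLTK list is much longer.

```python
# Hypothetical mini stop word list, for illustration only.
DEMO_STOP_WORDS = {"into", "the", "to", "and", "from", "for", "is", "on"}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]


tokens = ["wall", "st.", "bears", "claw", "back", "into", "the", "black"]
print(remove_stop_words(tokens, DEMO_STOP_WORDS))
# ['wall', 'st.', 'bears', 'claw', 'back', 'black']

# The comparison is case-sensitive, so an uppercase variant survives.
print(remove_stop_words(["The", "the"], DEMO_STOP_WORDS))  # ['The']
```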


#### <span style="color:orange">**5. Remove Punctuation & Special Characters:**</span>

In [10]:
import re

df_train["clean_tokens"] = df_train["tokens"].apply(
    lambda tokens: [token for token in tokens if re.match(r"^[a-z]+$", token)]
)

df_train.head()

Unnamed: 0,text,label,word_tokens,tokens,clean_tokens
0,wall st. bears claw back into the black (reute...,2,"[wall, st., bears, claw, back, into, the, blac...","[wall, st., bears, claw, back, black, (, reute...","[wall, bears, claw, back, black, reuters, reut..."
1,carlyle looks toward commercial aerospace (reu...,2,"[carlyle, looks, toward, commercial, aerospace...","[carlyle, looks, toward, commercial, aerospace...","[carlyle, looks, toward, commercial, aerospace..."
2,oil and economy cloud stocks' outlook (reuters...,2,"[oil, and, economy, cloud, stocks, ', outlook,...","[oil, economy, cloud, stocks, ', outlook, (, r...","[oil, economy, cloud, stocks, outlook, reuters..."
3,iraq halts oil exports from main southern pipe...,2,"[iraq, halts, oil, exports, from, main, southe...","[iraq, halts, oil, exports, main, southern, pi...","[iraq, halts, oil, exports, main, southern, pi..."
4,"oil prices soar to all-time record, posing new...",2,"[oil, prices, soar, to, all-time, record, ,, p...","[oil, prices, soar, all-time, record, ,, posin...","[oil, prices, soar, record, posing, new, menac..."


In [11]:
df_test["clean_tokens"] = df_test["tokens"].apply(
    lambda tokens: [token for token in tokens if re.match(r"^[a-z]+$", token)]
)

df_test.head()

Unnamed: 0,text,label,word_tokens,tokens,clean_tokens
0,fears for t n pension after talks unions repre...,2,"[fears, for, t, n, pension, after, talks, unio...","[fears, n, pension, talks, unions, representin...","[fears, n, pension, talks, unions, representin..."
1,the race is on: second private team sets launc...,3,"[the, race, is, on, :, second, private, team, ...","[race, :, second, private, team, sets, launch,...","[race, second, private, team, sets, launch, da..."
2,ky. company wins grant to study peptides (ap) ...,3,"[ky., company, wins, grant, to, study, peptide...","[ky., company, wins, grant, study, peptides, (...","[company, wins, grant, study, peptides, ap, ap..."
3,prediction unit helps forecast wildfires (ap) ...,3,"[prediction, unit, helps, forecast, wildfires,...","[prediction, unit, helps, forecast, wildfires,...","[prediction, unit, helps, forecast, wildfires,..."
4,calif. aims to limit farm-related smog (ap) ap...,3,"[calif., aims, to, limit, farm-related, smog, ...","[calif., aims, limit, farm-related, smog, (, a...","[aims, limit, smog, ap, ap, southern, californ..."
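
The pattern `^[a-z]+$` keeps only tokens made entirely of lowercase letters, so it also drops hyphenated words (`all-time`, `farm-related`), abbreviations with a period (`st.`, `ky.`), and numbers, as the outputs above show. The sketch below compares it with a hypothetical alternative that keeps hyphenated words. Whether that is desirable depends on the downstream embedding model, so treat it as an option rather than a recommendation.

```python
import re

LETTERS_ONLY = re.compile(r"[a-z]+")
# Hypothetical alternative: lowercase words optionally joined by hyphens.
LETTERS_WITH_HYPHENS = re.compile(r"[a-z]+(?:-[a-z]+)*")


def keep_matching(tokens, pattern):
    """Keep the tokens that match the pattern from start to end."""
    return [token for token in tokens if pattern.fullmatch(token)]


tokens = ["oil", "all-time", "record", ",", "st.", "2004", "farm-related"]
print(keep_matching(tokens, LETTERS_ONLY))
# ['oil', 'record']
print(keep_matching(tokens, LETTERS_WITH_HYPHENS))
# ['oil', 'all-time', 'record', 'farm-related']
```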


#### <span style="color:orange">**6. Rebuild the Text from the Tokens:**</span>

In [12]:
df_train["new_text"] = df_train["clean_tokens"].apply(
    lambda tokens: " ".join(tokens)
)

df_train.head()

Unnamed: 0,text,label,word_tokens,tokens,clean_tokens,new_text
0,wall st. bears claw back into the black (reute...,2,"[wall, st., bears, claw, back, into, the, blac...","[wall, st., bears, claw, back, black, (, reute...","[wall, bears, claw, back, black, reuters, reut...",wall bears claw back black reuters reuters wal...
1,carlyle looks toward commercial aerospace (reu...,2,"[carlyle, looks, toward, commercial, aerospace...","[carlyle, looks, toward, commercial, aerospace...","[carlyle, looks, toward, commercial, aerospace...",carlyle looks toward commercial aerospace reut...
2,oil and economy cloud stocks' outlook (reuters...,2,"[oil, and, economy, cloud, stocks, ', outlook,...","[oil, economy, cloud, stocks, ', outlook, (, r...","[oil, economy, cloud, stocks, outlook, reuters...",oil economy cloud stocks outlook reuters reute...
3,iraq halts oil exports from main southern pipe...,2,"[iraq, halts, oil, exports, from, main, southe...","[iraq, halts, oil, exports, main, southern, pi...","[iraq, halts, oil, exports, main, southern, pi...",iraq halts oil exports main southern pipeline ...
4,"oil prices soar to all-time record, posing new...",2,"[oil, prices, soar, to, all-time, record, ,, p...","[oil, prices, soar, all-time, record, ,, posin...","[oil, prices, soar, record, posing, new, menac...",oil prices soar record posing new menace us ec...


In [13]:
df_test["new_text"] = df_test["clean_tokens"].apply(
    lambda tokens: " ".join(tokens)
)

df_test.head()

Unnamed: 0,text,label,word_tokens,tokens,clean_tokens,new_text
0,fears for t n pension after talks unions repre...,2,"[fears, for, t, n, pension, after, talks, unio...","[fears, n, pension, talks, unions, representin...","[fears, n, pension, talks, unions, representin...",fears n pension talks unions representing work...
1,the race is on: second private team sets launc...,3,"[the, race, is, on, :, second, private, team, ...","[race, :, second, private, team, sets, launch,...","[race, second, private, team, sets, launch, da...",race second private team sets launch date huma...
2,ky. company wins grant to study peptides (ap) ...,3,"[ky., company, wins, grant, to, study, peptide...","[ky., company, wins, grant, study, peptides, (...","[company, wins, grant, study, peptides, ap, ap...",company wins grant study peptides ap ap compan...
3,prediction unit helps forecast wildfires (ap) ...,3,"[prediction, unit, helps, forecast, wildfires,...","[prediction, unit, helps, forecast, wildfires,...","[prediction, unit, helps, forecast, wildfires,...",prediction unit helps forecast wildfires ap ap...
4,calif. aims to limit farm-related smog (ap) ap...,3,"[calif., aims, to, limit, farm-related, smog, ...","[calif., aims, limit, farm-related, smog, (, a...","[aims, limit, smog, ap, ap, southern, californ...",aims limit smog ap ap southern california agen...
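
Steps 2 to 6 are applied twice, once per split. One way to avoid the duplication is to fold them into a single function, sketched below. `preprocess` is a hypothetical helper. Its default tokenizer is a plain whitespace split so the example runs without NLTK, and the stop word set is a small stand-in.

```python
import re

LETTERS_ONLY = re.compile(r"[a-z]+")


def preprocess(text, tokenize=str.split, stop_words=frozenset()):
    """Lowercase, tokenize, drop stop words and non-alphabetic tokens, rejoin."""
    tokens = tokenize(str(text).lower())
    kept = [
        token
        for token in tokens
        if token not in stop_words and LETTERS_ONLY.fullmatch(token)
    ]
    return " ".join(kept)


demo_stop_words = {"to", "the", "into"}  # hypothetical; NLTK's list is longer
print(preprocess("Oil prices soar to record , posing new menace",
                 stop_words=demo_stop_words))
# oil prices soar record posing new menace
```

In the notebook, passing `word_tokenize` and the NLTK `stop_words` set would look like `df["text"].apply(lambda t: preprocess(t, word_tokenize, stop_words))`, applied to both `df_train` and `df_test`.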


#### <span style="color:orange">**7. Save the Datasets:**</span>

In [14]:
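# Keep only the cleaned text and the label. Chaining rename() avoids
# modifying a column subset in place.
df_train = df_train[["new_text", "label"]].rename(columns={"new_text": "text"})

# index=False keeps the row index from being written as an extra unnamed column.
df_train.to_csv("../data/processed/train.csv", index=False)

print("Dataset saved successfully!")

Dataset saved successfully!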


In [15]:
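# Keep only the cleaned text and the label. Chaining rename() avoids
# modifying a column subset in place.
df_test = df_test[["new_text", "label"]].rename(columns={"new_text": "text"})

# index=False keeps the row index from being written as an extra unnamed column.
df_test.to_csv("../data/processed/test.csv", index=False)

print("Dataset saved successfully!")

Dataset saved successfully!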
