<h2 style="color:red;text-align:center;font-weight:bold;">AeroStream Analytics</h2>

AeroStream Analytics is an intelligent system that automatically classifies customer reviews of airlines. It analyzes user sentiment in real time to measure satisfaction and provide key performance indicators.

**Objectives:**

Develop a system that automatically classifies customer reviews in real time. The system should:

- Collect and preprocess customer reviews,

- Automatically analyze sentiment and satisfaction,

- Generate performance indicators per airline,

- Visualize the results in an interactive dashboard.

![Python](https://img.shields.io/badge/Python-3.9%2B-blue)
![Airflow](https://img.shields.io/badge/Apache%20Airflow-Orchestration-green)
![Streamlit](https://img.shields.io/badge/Streamlit-Dashboard-red)
![ChromaDB](https://img.shields.io/badge/ChromaDB-Vector%20Store-orange)

<br>

<h3 style="color:green;font-weight:bold;">Data Preprocessing:</h3>

<h4 style="color:orange;font-weight:bold;">1. Load the Dataset:</h4>

In [141]:
import pandas as pd

df = pd.read_csv("../data/raw/data.csv")

print("Data loaded successfully!")

df.head()

Data loaded successfully!


|  | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | name | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.70306e+17 | neutral | 1.0 |  |  | Virgin America | cairdin | 0 | @VirginAmerica What @dhepburn said. |  | 2/24/2015 11:35 |  | Eastern Time (US & Canada) |
| 1 | 5.70301e+17 | positive | 0.3486 |  | 0.0 | Virgin America | jnardino | 0 | @VirginAmerica plus you've added commercials t... |  | 2/24/2015 11:15 |  | Pacific Time (US & Canada) |
| 2 | 5.70301e+17 | neutral | 0.6837 |  |  | Virgin America | yvonnalynn | 0 | @VirginAmerica I didn't today... Must mean I n... |  | 2/24/2015 11:15 | Lets Play | Central Time (US & Canada) |
| 3 | 5.70301e+17 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | jnardino | 0 | @VirginAmerica it's really aggressive to blast... |  | 2/24/2015 11:15 |  | Pacific Time (US & Canada) |
| 4 | 5.70301e+17 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | jnardino | 0 | @VirginAmerica and it's a really big bad thing... |  | 2/24/2015 11:14 |  | Pacific Time (US & Canada) |
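
The preprocessing steps in this section use only three of the loaded columns: `text`, `airline_sentiment` and `airline_sentiment_confidence`. A minimal sketch, assuming the other columns are not needed at this stage, restricts the load with `usecols`. The inline CSV and the names `SAMPLE_CSV`, `NEEDED_COLUMNS` and `sample_df` are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for ../data/raw/data.csv (assumption)
SAMPLE_CSV = io.StringIO(
    "tweet_id,airline_sentiment,airline_sentiment_confidence,airline,text\n"
    "1,neutral,1.0,Virgin America,hello\n"
    "2,positive,0.3486,Virgin America,thanks\n"
)

# Only the columns used by the preprocessing steps in this section
NEEDED_COLUMNS = ["text", "airline_sentiment", "airline_sentiment_confidence"]

# usecols skips every other column while parsing
sample_df = pd.read_csv(SAMPLE_CSV, usecols=NEEDED_COLUMNS)

print(sample_df.shape)
```

If per-airline indicators are computed from the same frame later, the `airline` column would have to be added to the list.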


<h4 style="color:orange;font-weight:bold;">2. Remove Texts with airline_sentiment_confidence &lt; 0.5:</h4>

In [142]:
print(f"- Total number of rows: {df.shape[0]}")

# Keep only rows labeled with a confidence of at least 0.5
df = df[df["airline_sentiment_confidence"] >= 0.5]

- Total number of rows: 14640


In [143]:
print(f"- Remaining rows: {df.shape[0]}")

- Remaining rows: 14404
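
The filter above removes 236 rows (14640 - 14404). As a sketch of the same step with the threshold exposed as a parameter, the hypothetical helper below (`filter_by_confidence` is not part of the notebook) returns an independent copy of the kept rows and reports how many were dropped. The sample values are illustrative.

```python
import pandas as pd


def filter_by_confidence(frame: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Keep rows whose airline_sentiment_confidence is at least `threshold`."""
    mask = frame["airline_sentiment_confidence"] >= threshold
    # .copy() returns an independent frame, so later column assignments are safe
    return frame.loc[mask].copy()


# Made-up sample for illustration
sample = pd.DataFrame({
    "text": ["a", "b", "c"],
    "airline_sentiment_confidence": [1.0, 0.3486, 0.6837],
})

kept = filter_by_confidence(sample)
print(f"Removed {len(sample) - len(kept)} of {len(sample)} rows")
# Removed 1 of 3 rows
```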


In [144]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"- Minimum confidence: {df['airline_sentiment_confidence'].min()}")

- Minimum confidence: 0.5014


In [145]:
print(f"- Mean confidence: {df['airline_sentiment_confidence'].mean()}")

- Mean confidence: 0.90910301305193


In [146]:
df.groupby("airline_sentiment")["airline_sentiment_confidence"].agg(["mean"])

| airline_sentiment | mean |
|---|---|
| negative | 0.937348 |
| neutral | 0.839274 |
| positive | 0.888085 |
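
A mean per class is easier to interpret next to the number of rows behind it. The sketch below, on made-up sample data, asks `agg` for both statistics in one call. The column names follow the dataset; the values are illustrative only.

```python
import pandas as pd

# Made-up sample for illustration
sample = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "neutral", "positive"],
    "airline_sentiment_confidence": [1.0, 0.9, 0.6, 0.8],
})

# One row per class, with the mean confidence and the class size
summary = (
    sample.groupby("airline_sentiment")["airline_sentiment_confidence"]
    .agg(["mean", "count"])
)

print(summary)
```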


<h4 style="color:orange;font-weight:bold;">3. Select the Feature and Target Columns:</h4>

In [147]:
df = df[["text", "airline_sentiment"]]

df.head()

|  | text | airline_sentiment |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral |
| 3 | @VirginAmerica it's really aggressive to blast... | negative |
| 4 | @VirginAmerica and it's a really big bad thing... | negative |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative |
5,@VirginAmerica seriously would pay $30 a fligh...,negative


In [148]:
# Rename the target column; reassigning avoids mutating the frame in place
df = df.rename(columns={"airline_sentiment": "sentiment"})

df.head()

|  | text | sentiment |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral |
| 3 | @VirginAmerica it's really aggressive to blast... | negative |
| 4 | @VirginAmerica and it's a really big bad thing... | negative |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative |


<h4 style="color:orange;font-weight:bold;">4. Check for Duplicates:</h4>

In [149]:
print(f"- Number of duplicates: {df.duplicated().sum()}")

- Number of duplicates: 184


In [150]:
df = df.drop_duplicates()

print(f"- Number of duplicates: {df.duplicated().sum()}")

print(f"\n- Number of rows: {df.shape[0]}")

- Number of duplicates: 0

- Number of rows: 14220
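
`drop_duplicates()` removes rows that match on every column, so a text that appears with two different labels survives as two rows. Whether the dataset contains such cases is not shown here. The sketch below, on made-up data, is one way to check for them.

```python
import pandas as pd

# Made-up sample: "late again" carries two different labels
sample = pd.DataFrame({
    "text": ["late again", "late again", "great crew", "great crew"],
    "sentiment": ["negative", "neutral", "positive", "positive"],
})

# Exact duplicates (same text and same label) are removed first
deduped = sample.drop_duplicates()

# Count the distinct labels that remain for each text
labels_per_text = deduped.groupby("text")["sentiment"].nunique()
conflicting_texts = labels_per_text[labels_per_text > 1].index.tolist()

print(conflicting_texts)
# ['late again']
```

How to resolve a conflict (drop the text, or keep the label with the higher confidence) is a modeling decision that the notebook does not address.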


<h4 style="color:orange;font-weight:bold;">5. Handle Missing Values:</h4>

In [151]:
print(f"- Number of missing values: {df.isnull().sum().sum()}")

- Number of missing values: 0


<h4 style="color:orange;font-weight:bold;">6. Clean the Texts:</h4>

<h5 style="font-weight:bold;">6.1. Normalize the Texts:</h5>

In [152]:
df["clean_text"] = df["text"]

df.head()

|  | text | sentiment | clean_text |
|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | @VirginAmerica What @dhepburn said. |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | @VirginAmerica and it's a really big bad thing... |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | @VirginAmerica seriously would pay $30 a fligh... |


In [153]:
# Lowercase every tweet with the vectorized string accessor
df["clean_text"] = df["clean_text"].astype(str).str.lower()

df.head()

|  | text | sentiment | clean_text |
|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | @virginamerica what @dhepburn said. |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | @virginamerica i didn't today... must mean i n... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | @virginamerica it's really aggressive to blast... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | @virginamerica and it's a really big bad thing... |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | @virginamerica seriously would pay $30 a fligh... |


<h5 style="font-weight:bold;">6.2. Remove @ Mentions:</h5>

In [154]:
df["text_split"] = df["clean_text"].apply(lambda row : row.split())

df.head()

|  | text | sentiment | clean_text | text_split |
|---|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | @virginamerica what @dhepburn said. | [@virginamerica, what, @dhepburn, said.] |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | @virginamerica i didn't today... must mean i n... | [@virginamerica, i, didn't, today..., must, me... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | @virginamerica it's really aggressive to blast... | [@virginamerica, it's, really, aggressive, to,... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | @virginamerica and it's a really big bad thing... | [@virginamerica, and, it's, a, really, big, ba... |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | @virginamerica seriously would pay $30 a fligh... | [@virginamerica, seriously, would, pay, $30, a... |


In [155]:
df["text_split"] = df["text_split"].apply(lambda tokens : [token for token in tokens if not token.startswith("@")])

df.head()

|  | text | sentiment | clean_text | text_split |
|---|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | @virginamerica what @dhepburn said. | [what, said.] |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | @virginamerica i didn't today... must mean i n... | [i, didn't, today..., must, mean, i, need, to,... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | @virginamerica it's really aggressive to blast... | [it's, really, aggressive, to, blast, obnoxiou... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | @virginamerica and it's a really big bad thing... | [and, it's, a, really, big, bad, thing, about,... |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | @virginamerica seriously would pay $30 a fligh... | [seriously, would, pay, $30, a, flight, for, s... |


In [156]:
df["clean_text"] = df["text_split"].apply(lambda row : " ".join(row))

df.head()

|  | text | sentiment | clean_text | text_split |
|---|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | what said. | [what, said.] |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | i didn't today... must mean i need to take ano... | [i, didn't, today..., must, mean, i, need, to,... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | it's really aggressive to blast obnoxious "ent... | [it's, really, aggressive, to, blast, obnoxiou... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | and it's a really big bad thing about it | [and, it's, a, really, big, bad, thing, about,... |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | seriously would pay $30 a flight for seats tha... | [seriously, would, pay, $30, a, flight, for, s... |
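
The token filter above removes a mention only when the token starts with `@`, so a mention glued to leading punctuation (for example `.@united`) would be kept. A regular expression is one possible alternative. The sketch below is an illustration, not the notebook's method, and `MENTION_PATTERN` and `remove_mentions` are hypothetical names.

```python
import re

# "@" followed by one or more word characters (letters, digits, underscore)
MENTION_PATTERN = re.compile(r"@\w+")


def remove_mentions(text: str) -> str:
    """Delete @mentions anywhere in the text, then collapse leftover whitespace."""
    without_mentions = MENTION_PATTERN.sub("", text)
    return " ".join(without_mentions.split())


print(remove_mentions("@virginamerica what @dhepburn said."))
# what said.
print(remove_mentions(".@united thanks for the help"))
# . thanks for the help
```

One trade-off: the pattern also cuts the part after `@` in an email address, which the token filter would leave intact.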


<h5 style="font-weight:bold;">6.3. Remove URLs:</h5>

In [157]:
df["text_split"] = df["text_split"].apply(lambda tokens : [token for token in tokens if not token.startswith("http")])

df.head()

|  | text | sentiment | clean_text | text_split |
|---|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | what said. | [what, said.] |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | i didn't today... must mean i need to take ano... | [i, didn't, today..., must, mean, i, need, to,... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | it's really aggressive to blast obnoxious "ent... | [it's, really, aggressive, to, blast, obnoxiou... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | and it's a really big bad thing about it | [and, it's, a, really, big, bad, thing, about,... |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | seriously would pay $30 a flight for seats tha... | [seriously, would, pay, $30, a, flight, for, s... |


In [158]:
df["clean_text"] = df["text_split"].apply(lambda row : " ".join(row))

df = df.drop(columns=["text_split"])

df.head()

|  | text | sentiment | clean_text |
|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | what said. |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | i didn't today... must mean i need to take ano... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | it's really aggressive to blast obnoxious "ent... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | and it's a really big bad thing about it |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | seriously would pay $30 a flight for seats tha... |
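
The `startswith("http")` test above keeps links written without a scheme (such as `www.example.com`) and links preceded by an opening parenthesis. As a hedged alternative, the sketch below matches both `http(s)://` and `www.` prefixes anywhere in the text. `URL_PATTERN` and `remove_urls` are hypothetical names, and the pattern is deliberately simple rather than a full URL grammar.

```python
import re

# A scheme or "www." prefix followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text: str) -> str:
    """Delete URLs anywhere in the text, then collapse leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


print(remove_urls("see http://t.co/abc123 for details"))
# see for details
print(remove_urls("booked on www.example.com today"))
# booked on today
```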


<h5 style="font-weight:bold;">6.4. Remove Punctuation:</h5>

In [159]:
df["clean_text"] = df["clean_text"].str.replace(r'[^\w\s]', ' ', regex=True)

df.head()

|  | text | sentiment | clean_text |
|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | what said |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | i didn t today must mean i need to take ano... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | it s really aggressive to blast obnoxious ent... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | and it s a really big bad thing about it |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | seriously would pay 30 a flight for seats tha... |
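
Replacing every punctuation mark with a space splits contractions (`didn't` becomes `didn t`) and leaves runs of spaces where `...` used to be. The sketch below shows one possible variant: drop apostrophes first, replace the remaining punctuation with a space, then collapse whitespace. Whether joined contractions help the downstream model is an assumption that would need testing; the sample strings are shortened from the rows above.

```python
import pandas as pd

# Shortened sample rows for illustration
sample = pd.Series(["i didn't today... must mean", "it's a really big bad thing"])

cleaned = (
    sample
    .str.replace(r"['’]", "", regex=True)       # join contractions: didn't -> didnt
    .str.replace(r"[^\w\s]", " ", regex=True)   # other punctuation -> space
    .str.replace(r"\s+", " ", regex=True)       # collapse repeated whitespace
    .str.strip()                                # trim leading and trailing spaces
)

print(cleaned.tolist())
# ['i didnt today must mean', 'its a really big bad thing']
```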


<h5 style="font-weight:bold;">6.5. Remove Special Characters:</h5>

In [160]:
df["clean_text"] = df["clean_text"].str.replace(r'[^a-z0-9\s]', '', regex=True)

df.head()

|  | text | sentiment | clean_text |
|---|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral | what said |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral | i didn t today must mean i need to take ano... |
| 3 | @VirginAmerica it's really aggressive to blast... | negative | it s really aggressive to blast obnoxious ent... |
| 4 | @VirginAmerica and it's a really big bad thing... | negative | and it s a really big bad thing about it |
| 5 | @VirginAmerica seriously would pay $30 a fligh... | negative | seriously would pay 30 a flight for seats tha... |
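
The first five rows look the same before and after this step, because step 6.4 already replaced their punctuation. The two patterns still differ: in Python's `re` module, which pandas uses for `regex=True`, `\w` matches Unicode letters and the underscore, so step 6.4 keeps them and step 6.5 then deletes them. The sketch below illustrates this on a made-up string.

```python
import re

# Made-up example with accents, an underscore and a symbol
text = "café_bar ✈ déjà vu"

# Step 6.4 pattern: the symbol becomes a space; accents and "_" count as \w and stay
step_64 = re.sub(r"[^\w\s]", " ", text)

# Step 6.5 pattern: anything outside a-z, 0-9 and whitespace is deleted outright
step_65 = re.sub(r"[^a-z0-9\s]", "", step_64)

print(step_64.split())
# ['café_bar', 'déjà', 'vu']
print(step_65.split())
# ['cafbar', 'dj', 'vu']
```

Because step 6.5 deletes characters instead of replacing them with a space, accented words are shortened (`déjà` becomes `dj`) and words joined by an underscore are merged.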
