# <a id='toc1_'></a>[Project 7: Perform Sentiment Analysis with Deep Learning](#toc0_)
# <a id='toc2_'></a>[Exploration & Preprocessing](#toc0_)

[OpenClassrooms link](https://openclassrooms.com/fr/paths/795/projects/1516/1578-mission)

---

**Table of contents**<a id='toc0_'></a>    
- [Project 7: Perform Sentiment Analysis with Deep Learning](#toc1_)    
- [Exploration & Preprocessing](#toc2_)    
  - [Imports](#toc2_1_)    
  - [Exploratory analysis](#toc2_2_)    
    - [Loading the dataset](#toc2_2_1_)    
    - [Summary analysis](#toc2_2_2_)    
    - [Dropping unused columns](#toc2_2_3_)    
    - [Analysis of the '?' column](#toc2_2_4_)    
    - [Recoding the label column](#toc2_2_5_)    
    - [Simple statistical analysis](#toc2_2_6_)    
      - [Label distribution](#toc2_2_6_1_)    
      - [Tweet length](#toc2_2_6_2_)    
      - [Word count](#toc2_2_6_3_)    
    - [Advanced statistical analysis](#toc2_2_7_)    
      - [Basic cleaning for analysis](#toc2_2_7_1_)    
      - [Most frequent words](#toc2_2_7_2_)    
      - [Bigrams](#toc2_2_7_3_)    
      - [Punctuation](#toc2_2_7_4_)    
  - [Preprocessing](#toc2_3_)    
    - [Text preparation](#toc2_3_1_)    
    - [Removing tweets left empty after cleaning](#toc2_3_2_)    
    - [Side-by-side comparison](#toc2_3_3_)    
    - [Splitting the final dataset](#toc2_3_4_)    
    - [Saving the dataset](#toc2_3_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

---
---

## <a id='toc2_1_'></a>[Imports](#toc0_)

In [None]:
import pandas as pd
import sweetviz as sv
import skimpy as sk
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import re
import string
from collections import Counter
import nltk
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split


import contractions

nltk.download("wordnet")
nltk.download("punkt")
nltk.download("punkt_tab")  # tokenizer data expected by word_tokenize in recent NLTK releases
nltk.download("stopwords")

TEMPLATE = "plotly_dark"

---
---

## <a id='toc2_2_'></a>[Exploratory analysis](#toc0_)

---

### <a id='toc2_2_1_'></a>[Loading the dataset](#toc0_)

In [None]:
df = pd.read_csv(
    "./raw_dataset.csv",
    encoding="utf-8",
    encoding_errors="ignore",
    names=["?", "user_id", "datetime", "??", "username", "message"],
)

display(df.head(10))

---

### <a id='toc2_2_2_'></a>[Summary analysis](#toc0_)

In [None]:
# Quick descriptive summary with skimpy
sk.skim(df)

# Build an analysis report with sweetviz
report = sv.analyze(df)
# Save the sweetviz report as HTML
report.show_html("sweetviz_report.html")
# Show the sweetviz report inside the notebook (may not work in every environment)
report.show_notebook()

# Show DataFrame information (dtypes, non-null counts, etc.)
df.info()

# Count the columns of each dtype
df.dtypes.value_counts()

---

### <a id='toc2_2_3_'></a>[Dropping unused columns](#toc0_)

In [None]:
df.drop(["??", "datetime", "username", "user_id"], axis=1, inplace=True)

df

---

### <a id='toc2_2_4_'></a>[Analysis of the '?' column](#toc0_)

In [None]:
print("Messages where the unknown column = 0:")
for message in df[df["?"] == 0]["message"].head(10):
    print("\t" + message)

In [None]:
print("Messages where the unknown column = 4:")
for message in df[df["?"] == 4]["message"].head(10):
    print("\t" + message)

---
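### <a id='toc2_2_5_'></a>[Recoding the label column](#toc0_)

In [None]:
# Recode the numeric values as readable labels.
# Assigning the result avoids a chained inplace replace, which recent pandas versions warn about.
df["label"] = df["?"].replace({4: "positive", 0: "negative"})

df.drop("?", axis=1, inplace=True)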

df.sample(20)

---

### <a id='toc2_2_6_'></a>[Simple statistical analysis](#toc0_)

#### <a id='toc2_2_6_1_'></a>[Label distribution](#toc0_)

In [None]:
label_counts = df["label"].value_counts()
fig_pie = px.pie(
    label_counts,
    values=label_counts.values,
    names=label_counts.index,
    title="Sentiment distribution",
    hole=0.3,
    color_discrete_sequence=px.colors.qualitative.Pastel,
)
fig_pie.update_traces(textposition="inside", textinfo="percent+label")
fig_pie.show()

fig_bar_dist = px.bar(
    label_counts,
    x=label_counts.index,
    y=label_counts.values,
    title="Sentiment distribution",
    labels={"x": "Sentiment", "y": "Number of tweets"},
    color=label_counts.index,
    color_discrete_sequence=px.colors.qualitative.Pastel,
    text_auto=True,
)
fig_bar_dist.show()

#### <a id='toc2_2_6_2_'></a>[Tweet length](#toc0_)

In [None]:
df["tweet_length"] = df["message"].str.len()

# Histogram of tweet lengths
fig_hist_len = px.histogram(
    df,
    x="tweet_length",
    title="Distribution of tweet length",
    marginal="box",  # Add box plot on top
    labels={"tweet_length": "Tweet length (characters)"},
)
fig_hist_len.show()

fig_box_len = px.box(
    df,
    x="label",
    y="tweet_length",
    color="label",
    title="Distribution of tweet length by sentiment",
    labels={"label": "Sentiment", "tweet_length": "Tweet length (characters)"},
    color_discrete_sequence=px.colors.qualitative.Pastel,
)
fig_box_len.show()

#### <a id='toc2_2_6_3_'></a>[Word count](#toc0_)

In [None]:
# split() without an argument ignores repeated whitespace instead of counting empty strings
df["word_count"] = df["message"].apply(lambda x: len(x.split()))

# Histogram of word counts
fig_hist_wc = px.histogram(
    df,
    x="word_count",
    title="Distribution of the number of words per tweet",
    marginal="box",
    labels={"word_count": "Word Count"},
)
fig_hist_wc.show()

# Compare word counts by sentiment
fig_box_wc = px.box(
    df,
    x="label",
    y="word_count",
    color="label",
    title="Number of words by sentiment",
    labels={"label": "Sentiment", "word_count": "Word count"},
    color_discrete_sequence=px.colors.qualitative.Pastel,
)
fig_box_wc.show()
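
A small illustration of why the separator matters when counting words: `str.split(" ")` yields an empty string for every extra space, whereas `str.split()` collapses any run of whitespace. The sample tweet is made up for the example.

```python
# Hypothetical tweet with a double space and a trailing space.
sample = "so  tired today "

naive_count = len(sample.split(" "))  # empty strings are counted as words
robust_count = len(sample.split())    # any whitespace run is one separator

print(naive_count, robust_count)
```

On noisy tweet text the two counts can diverge noticeably, which would skew the word count histogram towards larger values.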

---

### <a id='toc2_2_7_'></a>[Advanced statistical analysis](#toc0_)

#### <a id='toc2_2_7_1_'></a>[Customizing the stop words](#toc0_)

In [None]:
# Keep negations ("not" and the "n't" forms) out of the stop list: they carry sentiment.
# A set keeps the membership tests in the later cells fast.
stop_words = {
    word
    for word in stopwords.words("english")
    if not word.endswith("n't") and word != "not"
}

stop_words
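
To make the motivation concrete, here is a minimal sketch of what happens to a negated sentence with and without this customization. The stop list below is a tiny hand-written stand-in, not NLTK's actual English list, and the variable names are only illustrative.

```python
# Tiny hand-written stop list standing in for NLTK's English list (assumption).
base_stop_words = {"i", "do", "this", "not", "don't", "the"}

# Same filter idea as in the notebook: drop "not" and the "n't" forms from the stop list.
custom_stop_words = {
    w for w in base_stop_words if w != "not" and not w.endswith("n't")
}

tokens = "i do not like this".split()
with_base = [t for t in tokens if t not in base_stop_words]
with_custom = [t for t in tokens if t not in custom_stop_words]

print(with_base)    # the negation is lost
print(with_custom)  # the negation survives
```

With the unmodified list, "i do not like this" reduces to "like", which reads as the opposite sentiment.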

#### <a id='toc2_2_7_1_'></a>[Basic cleaning for analysis](#toc0_)

In [None]:
def clean_text(text):
    """Light cleaning used only for the exploratory word and bigram counts."""
    text = text.lower()  # Lowercase
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    text = re.sub(r"@\w+|#", "", text)  # Remove mentions and the '#' symbol

    # Expand contractions while the apostrophes are still present
    text = re.sub(r"\bcan't\b", "can not", text)
    text = re.sub(r"\bwon't\b", "will not", text)
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"'re\b", " are", text)
    text = re.sub(r"'s\b", " is", text)
    text = re.sub(r"'d\b", " would", text)
    text = re.sub(r"'ll\b", " will", text)
    text = re.sub(r"'ve\b", " have", text)
    text = re.sub(r"'m\b", " am", text)

    text = text.translate(
        str.maketrans("", "", string.punctuation)
    )  # Remove punctuation
    text = re.sub(r"\d+", "", text)  # Remove numbers

    return text


df["cleaned_message"] = df["message"].apply(clean_text)
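
The order of the cleaning steps matters: an apostrophe-based contraction rule can only fire while the apostrophe is still in the text. The sketch below compares the two orderings on a made-up sentence; `expand_negation` is a hypothetical helper with a single simplified rule, not the notebook's full cleaning function.

```python
import re
import string

STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def expand_negation(text):
    # Simplified rule for illustration: "n't" at the end of a word becomes " not".
    return re.sub(r"n't\b", " not", text)


sample = "i don't like mondays!"

# Punctuation removed first: the apostrophe is gone, so nothing is expanded.
strip_then_expand = expand_negation(sample.translate(STRIP_PUNCT))
# Contraction expanded first: the negation becomes an explicit token.
expand_then_strip = expand_negation(sample).translate(STRIP_PUNCT)

print(strip_then_expand)
print(expand_then_strip)
```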

#### <a id='toc2_2_7_2_'></a>[Most frequent words](#toc0_)

In [None]:
def get_top_words(corpus, n=20):
    tokens = nltk.word_tokenize(" ".join(corpus))
    tokens = [word for word in tokens if word not in stop_words and word.isalpha()]
    count = Counter(tokens)
    most_common = count.most_common(n)
    df_common = pd.DataFrame(most_common, columns=["word", "count"])
    return df_common


top_words_overall = get_top_words(df["cleaned_message"], n=10)
fig_bar_overall = px.bar(
    top_words_overall,
    x="count",
    y="word",
    orientation="h",
    title="Top 10 most frequent words",
    labels={"count": "Frequency", "word": "Word"},
    color="count",
    color_continuous_scale=px.colors.sequential.Viridis,
)
fig_bar_overall.update_layout(
    yaxis={"categoryorder": "total ascending"}, template=TEMPLATE
)
fig_bar_overall.show()

# Top Words for Positive Tweets
top_words_pos = get_top_words(df[df["label"] == "positive"]["cleaned_message"], n=10)
fig_bar_pos = px.bar(
    top_words_pos,
    x="count",
    y="word",
    orientation="h",
    title="Top 10 most frequent words in positive tweets",
    labels={"count": "Frequency", "word": "Word"},
    color="count",
    color_continuous_scale=px.colors.sequential.Greens,
)
fig_bar_pos.update_layout(yaxis={"categoryorder": "total ascending"}, template=TEMPLATE)
fig_bar_pos.show()

# Top Words for Negative Tweets
top_words_neg = get_top_words(df[df["label"] == "negative"]["cleaned_message"], n=10)
fig_bar_neg = px.bar(
    top_words_neg,
    x="count",
    y="word",
    orientation="h",
    title="Top 10 most frequent words in negative tweets",
    labels={"count": "Frequency", "word": "Word"},
    color="count",
    color_continuous_scale=px.colors.sequential.Reds,
)
fig_bar_neg.update_layout(yaxis={"categoryorder": "total ascending"}, template=TEMPLATE)
fig_bar_neg.show()

#### <a id='toc2_2_7_3_'></a>[Bigrams](https://fr.wikipedia.org/wiki/N-gramme) [&#8593;](#toc0_)

In [None]:
def get_top_ngrams(corpus, n=20, ngram_range=(2, 2)):
    """Return the n most frequent n-grams; only ngram_range[0] is used as the n-gram size."""
    count = Counter()
    for text in corpus:
        # Tokenize tweet by tweet so that an n-gram never spans two tweets
        tokens = [
            word
            for word in nltk.word_tokenize(text)
            if word not in stop_words and word.isalpha()
        ]
        count.update(ngrams(tokens, ngram_range[0]))
    most_common = count.most_common(n)
    # Format ngrams for display
    most_common_formatted = [(" ".join(ngram), freq) for ngram, freq in most_common]
    df_common = pd.DataFrame(most_common_formatted, columns=["ngram", "count"])
    return df_common


# Top bigrams across all tweets
top_bigrams_overall = get_top_ngrams(df["cleaned_message"], n=10, ngram_range=(2, 2))
fig_bar_bi_overall = px.bar(
    top_bigrams_overall,
    x="count",
    y="ngram",
    orientation="h",
    title="Top 10 bigrams across all tweets",
    labels={"count": "Frequency", "ngram": "Bigram"},
    color="count",
    color_continuous_scale=px.colors.sequential.Greens,
)
fig_bar_bi_overall.update_layout(
    yaxis={"categoryorder": "total ascending"}, template=TEMPLATE
)
fig_bar_bi_overall.show()


# Top Bigrams for Positive Tweets
top_bigrams_pos = get_top_ngrams(
    df[df["label"] == "positive"]["cleaned_message"], n=10, ngram_range=(2, 2)
)
fig_bar_bi_pos = px.bar(
    top_bigrams_pos,
    x="count",
    y="ngram",
    orientation="h",
    title="Top 10 bigrams in positive tweets",
    labels={"count": "Frequency", "ngram": "Bigram"},
    color="count",
    color_continuous_scale=px.colors.sequential.Greens,
)
fig_bar_bi_pos.update_layout(
    yaxis={"categoryorder": "total ascending"}, template=TEMPLATE
)
fig_bar_bi_pos.show()

# Top Bigrams for Negative Tweets
top_bigrams_neg = get_top_ngrams(
    df[df["label"] == "negative"]["cleaned_message"], n=10, ngram_range=(2, 2)
)
fig_bar_bi_neg = px.bar(
    top_bigrams_neg,
    x="count",
    y="ngram",
    orientation="h",
    title="Top 10 bigrams in negative tweets",
    labels={"count": "Frequency", "ngram": "Bigram"},
    color="count",
    color_continuous_scale=px.colors.sequential.Reds,
)
fig_bar_bi_neg.update_layout(
    yaxis={"categoryorder": "total ascending"}, template=TEMPLATE
)
fig_bar_bi_neg.show()
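
As a rough illustration of what a bigram count measures, the sketch below counts adjacent word pairs document by document using only the standard library. `count_bigrams` and the three sample documents are hypothetical; whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def count_bigrams(docs):
    """Count adjacent word pairs within each document, never across documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # zip pairs each token with its successor inside the same document
        counts.update(zip(tokens, tokens[1:]))
    return counts


docs = ["good morning", "good morning twitter", "twitter good night"]
counts = count_bigrams(docs)

print(counts.most_common(1))
```

Counting per document avoids spurious pairs such as the last word of one tweet followed by the first word of the next, which would appear if all documents were joined into one long string first.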

#### <a id='toc2_2_7_4_'></a>[Punctuation](#toc0_)

In [None]:
df["exclamation_count"] = df["message"].str.count("!")
# Raw string: "?" is a regex metacharacter and must be escaped
df["question_mark_count"] = df["message"].str.count(r"\?")

# Compare counts by sentiment using box plots
fig_box_punct = make_subplots(
    rows=1, cols=2, subplot_titles=("Exclamation marks", "Question marks")
)

fig_box_punct.add_trace(
    go.Box(
        y=df[df["label"] == "positive"]["exclamation_count"],
        name="Positive",
        marker_color="lightgreen",
    ),
    row=1,
    col=1,
)
fig_box_punct.add_trace(
    go.Box(
        y=df[df["label"] == "negative"]["exclamation_count"],
        name="Negative",
        marker_color="lightcoral",
    ),
    row=1,
    col=1,
)

fig_box_punct.add_trace(
    go.Box(
        y=df[df["label"] == "positive"]["question_mark_count"],
        name="Positive",
        marker_color="lightgreen",
        showlegend=False,
    ),
    row=1,
    col=2,
)
fig_box_punct.add_trace(
    go.Box(
        y=df[df["label"] == "negative"]["question_mark_count"],
        name="Negative",
        marker_color="lightcoral",
        showlegend=False,
    ),
    row=1,
    col=2,
)

fig_box_punct.update_layout(
    title_text="Punctuation usage by tweet sentiment",
    height=400,
    template=TEMPLATE,
)
fig_box_punct.update_yaxes(title_text="Count per tweet", row=1, col=1)
fig_box_punct.update_yaxes(title_text="Count per tweet", row=1, col=2)
fig_box_punct.show()
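
pandas' `Series.str.count` interprets its pattern as a regular expression, which is why the question mark needs escaping while the exclamation mark does not. The same behavior can be checked with the standard `re` module on a made-up string:

```python
import re

sample = "really?! why?? ok!"

# "!" has no special meaning in a regex, so it can be used as is.
exclamations = len(re.findall("!", sample))
# "?" is a quantifier in regex syntax; escape it to match the literal character.
questions = len(re.findall(r"\?", sample))

print(exclamations, questions)
```

`re.escape("?")` produces the same escaped pattern and is a convenient option when the character to count is not known in advance.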

---
---

## <a id='toc2_3_'></a>[Preprocessing](#toc0_)

---

### <a id='toc2_3_1_'></a>[Text preparation](#toc0_)

In [None]:
lemmatizer = WordNetLemmatizer()


def preprocess_text(text):
    """Applies cleaning steps to a single text string."""
    if not isinstance(text, str):
        return ""  # Return empty string for non-string inputs

    # 1. Convert to lowercase
    text = text.lower()

    # 2. Expand contractions
    text = contractions.fix(text)

    # 3. Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)

    # 4. Remove mentions (@username)
    text = re.sub(r"@\w+", "", text)

    # 5. Remove hashtags (#topic) - removes the '#' symbol and the word
    text = re.sub(r"#\w+", "", text)

    # 6. Remove numbers
    text = re.sub(r"\d+", "", text)

    # 7. Remove special characters and punctuation (keeping spaces)
    text = re.sub(r"[^a-z\s]", "", text)

    # 8. Tokenization
    tokens = word_tokenize(text)

    # 9. Remove stop words and lemmatize
    cleaned_tokens = []
    for word in tokens:
        if len(word) > 1 and word not in stop_words:
            lemma = lemmatizer.lemmatize(word)
            cleaned_tokens.append(lemma)

    # 10. Join tokens back into a single string
    cleaned_text = " ".join(cleaned_tokens)

    return cleaned_text


df["cleaned_tweet"] = df["message"].apply(preprocess_text)
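
To see what the regex steps do on their own, here is a simplified sketch that applies only the lowercase, URL, mention, hashtag, number, and non-letter rules to a made-up tweet. `strip_noise` is a hypothetical helper; it leaves out contraction expansion, tokenization, stop word removal, and lemmatization, which depend on third-party packages and NLTK data.

```python
import re


def strip_noise(text):
    """Regex-only subset of the cleaning steps, for illustration."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)            # mentions
    text = re.sub(r"#\w+", "", text)            # hashtags (symbol and word)
    text = re.sub(r"\d+", "", text)             # numbers
    text = re.sub(r"[^a-z\s]", "", text)        # anything but letters and whitespace
    return " ".join(text.split())               # collapse leftover whitespace


sample = "@bob Loving the new phone!!! http://t.co/abc #happy 2day"
result = strip_noise(sample)

print(result)
```

Note that removing digits turns "2day" into "day", a side effect worth keeping in mind for tweets that use numbers inside words.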

---


### <a id='toc2_3_2_'></a>[Removing tweets left empty after cleaning](#toc0_)

In [None]:
empty_after_clean = df[df["cleaned_tweet"] == ""].shape[0]
print(
    f"\nNumber of tweets resulting in empty strings after cleaning: {empty_after_clean}"
)

if empty_after_clean > 0:
    df = df[df["cleaned_tweet"] != ""]

---

### <a id='toc2_3_3_'></a>[Side-by-side comparison](#toc0_)

In [None]:
print("\nExample Preprocessing:")
# Access columns by name rather than by position, which breaks when columns are added or removed
for row in df.sample(10).itertuples():
    print("-" * 30)
    print(f"Original:  {row.message}")
    print(f"Cleaned:   {row.cleaned_tweet}")
    print(f"Sentiment: {row.label}")
print("-" * 30)

---

### <a id='toc2_3_4_'></a>[Splitting the final dataset](#toc0_)

In [None]:
X = df["cleaned_tweet"]
y = df["label"]

# Split ratio
TEST_SIZE = 0.15
VALIDATION_SIZE = 0.15  # Relative to the original size

# First split: Train vs. (Validation + Test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X,
    y,
    test_size=(VALIDATION_SIZE + TEST_SIZE),
    random_state=42,  # for reproducibility
    stratify=y,  # Ensure distribution is similar across splits
)

# Calculate split size for validation relative to the 'temp' set
val_split_ratio = VALIDATION_SIZE / (VALIDATION_SIZE + TEST_SIZE)

# Second split: Validation vs. Test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp,
    y_temp,
    test_size=(1 - val_split_ratio),  # Test size is the remainder
    random_state=42,  # for reproducibility
    stratify=y_temp,  # Ensure distribution is similar across splits
)

print("Data Splitting Complete:")
print(f"Training set shape:   X={X_train.shape}, y={y_train.shape}")
print(f"Validation set shape: X={X_val.shape}, y={y_val.shape}")
print(f"Test set shape:       X={X_test.shape}, y={y_test.shape}")

# Verify stratification (optional)
print("\nSentiment distribution in splits:")
print("Train:\n", y_train.value_counts(normalize=True))
print("Validation:\n", y_val.value_counts(normalize=True))
print("Test:\n", y_test.value_counts(normalize=True))
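
The arithmetic behind the two-stage split can be checked by hand: 30% of the rows are held out first, then that held-out part is split in half. The sketch below works through it for a hypothetical dataset of 1,000 rows; scikit-learn's own rounding may differ by a row on sizes that do not divide evenly.

```python
TEST_SIZE = 0.15
VALIDATION_SIZE = 0.15

n_rows = 1_000  # hypothetical dataset size, chosen for round numbers

# First split: fraction of rows set aside for validation + test
temp_fraction = VALIDATION_SIZE + TEST_SIZE
# Second split: share of the held-out rows that goes to validation
val_share_of_temp = VALIDATION_SIZE / temp_fraction

n_temp = round(n_rows * temp_fraction)
n_train = n_rows - n_temp
n_val = round(n_temp * val_share_of_temp)
n_test = n_temp - n_val

print(n_train, n_val, n_test)
```

Expressing the second split relative to the held-out set is what keeps validation and test at 15% of the original data each, rather than 15% of the remaining 30%.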

---

### <a id='toc2_3_5_'></a>[Saving the dataset](#toc0_)

In [None]:
# Create DataFrames for easy saving
train_df = pd.DataFrame({"cleaned_text": X_train, "sentiment": y_train})
val_df = pd.DataFrame({"cleaned_text": X_val, "sentiment": y_val})
test_df = pd.DataFrame({"cleaned_text": X_test, "sentiment": y_test})

# Save to CSV
train_df.to_csv("train_data.csv", index=False)
val_df.to_csv("validation_data.csv", index=False)
test_df.to_csv("test_data.csv", index=False)
print("\nSplit data saved to train_data.csv, validation_data.csv, test_data.csv")