# **Mini Project 1**

0. Requirements:
   
   If any of the packages below is missing, uncomment its line and run the cell to install it.

In [15]:
# !pip install pandas
# !pip install numpy
# !pip install scikit-learn
# !pip install matplotlib
# !pip install seaborn
# !pip install nltk
# !pip install codecarbon
# !pip install shap
# !pip install beautifulsoup4

1. Data Preparation:
   
    Goal: Load and inspect the IMDb dataset of movie reviews labeled with positive or negative sentiment (https://ai.stanford.edu/%7Eamaas/data/sentiment/).
    
    Task: Read the dataset, store the reviews and their associated sentiments, and explore the dataset to understand its structure.

In [98]:
import os
import pandas as pd
import numpy as np
import re
import shap
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords  # Stop-word lists from NLTK (Natural Language Toolkit)
from nltk.stem import PorterStemmer  # Stemming reduces each word to its root (stem)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from codecarbon import EmissionsTracker
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

In [99]:
# --- Data loading function ---

def load_movie_reviews(data_folder):
    """
    Load movie reviews from a folder structure (pos/ and neg/) and
    return them as a pandas DataFrame.

    Args:
        data_folder: Path to the main folder containing the 'pos' and
                     'neg' subfolders.

    Returns:
        A pandas DataFrame with two columns: 'review' (review text)
        and 'sentiment' ('pos' or 'neg').
        Returns None if an error occurs.
    """
    reviews = []
    sentiments = []

    for sentiment in ['pos', 'neg']:
        folder_path = os.path.join(data_folder, sentiment)  # Full path to pos/ or neg/

        if not os.path.isdir(folder_path):
            print(f"Error: folder '{folder_path}' does not exist.")
            return None

        for filename in os.listdir(folder_path):
            if filename.endswith(".txt"):  # Only process .txt files
                file_path = os.path.join(folder_path, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:  # 'utf-8' handles accented characters
                        review_text = f.read()
                        reviews.append(review_text)
                        sentiments.append(sentiment)
                except FileNotFoundError:
                    print(f"Error: file '{file_path}' not found (unlikely).")
                    return None  # Stops here; skipping the file would also be an option
                except Exception as e:
                    print(f"Error while reading '{file_path}': {e}")
                    return None

    # Build the pandas DataFrame
    df = pd.DataFrame({'review': reviews, 'sentiment': sentiments})
    return df

# --- Display functions ---

def display_dataframe_info(df, num_reviews=1, example_index=0):
    """Print an overview of the DataFrame, including an example review.

    Args:
        df: The pandas DataFrame to display.
        num_reviews: Number of leading rows to show (head).
        example_index: Position of the example review to print.
    """
    if df is None or df.empty:
        print("The DataFrame is empty or None.")
        return

    print(df.head(num_reviews))  # First n rows
    print("-" * 20)
    df.info()  # General information (dtypes, columns, ...); info() prints by itself
    print("-" * 20)
    print(df['sentiment'].value_counts())  # Number of reviews per sentiment
    print("-" * 20)

    if example_index < len(df):
        print(f"\nExample review (index {example_index}):")
        print(df['review'].iloc[example_index])
        print("Associated sentiment:", df['sentiment'].iloc[example_index])
    else:
        print(f"Example index {example_index} is out of bounds for the DataFrame.")

def display_first_reviews(df, num_reviews=5):
    """Print the first rows (head) of the DataFrame.

    Args:
        df: The pandas DataFrame to display.
        num_reviews: Number of leading rows to show.
    """
    if df is None or df.empty:
        print("The DataFrame is empty or None.")
        return

    print(df.head(num_reviews))

print("\nData loading and display functions")


Data loading and display functions


In [100]:
# Load the data
data_directory = "database_full/train"
movie_reviews_df = load_movie_reviews(data_directory)

# Check the DataFrame contents
print("Data:")
display_dataframe_info(movie_reviews_df, num_reviews=5, example_index=0)

Data:
                                              review sentiment
0  Zentropa is the most original movie I've seen ...       pos
1  Busy is so amazing! I just loved every word sh...       pos
2  Another good Stooge short!Christine McIntyre i...       pos
3  This is a complex film that explores the effec...       pos
4  This film has a special place in my heart, as ...       pos
--------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     25000 non-null  object
 1   sentiment  25000 non-null  object
dtypes: object(2)
memory usage: 390.8+ KB
--------------------
sentiment
pos    12500
neg    12500
Name: count, dtype: int64
--------------------

Example review (index 0):
Zentropa is the most original movie I've seen in years. If you like unique thrillers that are influenced by film noir, then this is just the rig

In [101]:
def count_pos_strings(df):
    """
    Count the positive reviews ('pos') in the 'sentiment' column of a
    DataFrame, BEFORE the labels are converted to 0/1.

    Args:
        df: The DataFrame containing the 'sentiment' column.

    Returns:
        The number of 'pos' labels, or None if the 'sentiment' column is missing.
    """
    if 'sentiment' not in df.columns:
        print("Error: the 'sentiment' column is missing from the DataFrame.")
        return None

    return (df['sentiment'] == 'pos').sum()

def count_neg_strings(df):
    """
    Count the negative reviews ('neg') in the 'sentiment' column of a
    DataFrame, BEFORE the labels are converted to 0/1.

    Args:
        df: The DataFrame containing the 'sentiment' column.

    Returns:
        The number of 'neg' labels, or None if the 'sentiment' column is missing.
    """
    if 'sentiment' not in df.columns:
        print("Error: the 'sentiment' column is missing from the DataFrame.")
        return None

    return (df['sentiment'] == 'neg').sum()

num_pos = count_pos_strings(movie_reviews_df)
if num_pos is not None:
    print("Number of positive reviews (before conversion):", num_pos)

num_neg = count_neg_strings(movie_reviews_df)
if num_neg is not None:
    print("Number of negative reviews (before conversion):", num_neg)

Number of positive reviews (before conversion): 12500
Number of negative reviews (before conversion): 12500


2. Text Preprocessing:
   
    Goal: Clean and preprocess the text data to remove noise and prepare it for analysis.
    
    Task: Remove unnecessary characters (e.g., HTML tags, punctuation), convert text to lowercase, and process words by removing stop words and stemming/lemmatizing them.

In [102]:
# --- Data cleaning functions ---

# Built once here instead of once per review.
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

def remove_html_bs(text):
    """Remove HTML tags (with BeautifulSoup)."""
    try:
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text(separator=" ")
    except Exception as e:
        print(f"Error while stripping HTML: {e}")
        return ""

def remove_special_characters(text):
    """Remove special characters and punctuation."""
    pattern = r"[^a-zA-ZÀ-ÖØ-öø-ÿ0-9\s]"
    cleaned_text = re.sub(pattern, " ", text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text

def convert_to_lowercase(text):
    """Convert a string to lowercase."""
    return text.lower()

def remove_stopwords(text):
    """Remove English stop words using the NLTK list."""
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in STOP_WORDS]
    return " ".join(filtered_words)  # Rebuild the sentence

def apply_stemming(text):
    """Apply stemming with NLTK's PorterStemmer."""
    words = text.split()
    stemmed_words = [STEMMER.stem(word) for word in words]
    return " ".join(stemmed_words)

def apply_lemmatization(text):
    """Apply lemmatization with NLTK's WordNetLemmatizer."""
    words = text.split()
    lemmatized_words = [LEMMATIZER.lemmatize(word) for word in words]
    return " ".join(lemmatized_words)

def clean_reviews(df):
    """
    Clean a DataFrame of reviews (HTML, special characters, lowercase,
    stop words, stemming/lemmatization).
    Modifies the DataFrame in place.
    """
    # Check whether the DataFrame is empty
    if df.empty:
        print("Error: the DataFrame is empty and cannot be cleaned.")
        return

    # Check that the 'review' column exists
    if 'review' not in df.columns:
        print("Error: the 'review' column is missing from the DataFrame.")
        return

    # Apply the cleaning functions in sequence
    df['review'] = df['review'].apply(remove_html_bs)
    df['review'] = df['review'].apply(remove_special_characters)
    df['review'] = df['review'].apply(convert_to_lowercase)
    df['review'] = df['review'].apply(remove_stopwords)
    df['review'] = df['review'].apply(apply_stemming)  # Optional: stemming
    df['review'] = df['review'].apply(apply_lemmatization)  # Optional: lemmatization (runs on the already-stemmed tokens)

    # Nothing is returned because the DataFrame is modified directly

print("\nCleaning function for the training reviews")


Cleaning function for the training reviews


In [103]:
# === MAIN ===

if movie_reviews_df is not None:
    clean_reviews(movie_reviews_df)  # Clean the data
    print("Cleaned data:")
    display_first_reviews(movie_reviews_df, num_reviews=5)

Cleaned data:
                                              review sentiment
0  zentropa origin movi seen year like uniqu thri...       pos
1  busi amaz love everi word ever done freak geek...       pos
2  anoth good stoog short christin mcintyr love e...       pos
3  complex film explor effect fordist taylorist m...       pos
4  film special place heart caught first time tea...       pos


In [104]:
if movie_reviews_df is not None:
    # Convert the labels to 0 and 1. Run this cell once: re-running it after
    # the conversion would map the integer labels to NaN.
    movie_reviews_df['sentiment'] = movie_reviews_df['sentiment'].map({'pos': 1, 'neg': 0})

    print("Cleaned training reviews:")
    display_first_reviews(movie_reviews_df, num_reviews=5)

Cleaned training reviews:
                                              review  sentiment
0  zentropa origin movi seen year like uniqu thri...          1
1  busi amaz love everi word ever done freak geek...          1
2  anoth good stoog short christin mcintyr love e...          1
3  complex film explor effect fordist taylorist m...          1
4  film special place heart caught first time tea...          1


3. Feature Extraction:

    Goal: Transform the cleaned text into numerical features for machine learning.
   
    Task: Use a vectorization technique such as TF-IDF to convert the text into a numerical matrix that captures the importance of each word in the dataset.

In [107]:
# --- TF-IDF vectorization function ---

def vectorize_reviews(df, max_features=5000, ngram_range=(1, 2)):
    """
    Vectorize the reviews using TF-IDF.

    Args:
        df: The DataFrame containing the cleaned reviews ('review' column).
        max_features: Maximum number of features (words/n-grams) to keep.
        ngram_range: Range of n-grams to consider (unigrams and bigrams by default).

    Returns:
        A TF-IDF matrix (sparse matrix) and the fitted vectorizer.
    """
    if 'review' not in df.columns:
        print("Error: the 'review' column is missing from the DataFrame.")
        return None, None

    tfidf_vectorizer = TfidfVectorizer(
        max_features=max_features,  # Cap the number of features
        ngram_range=ngram_range,    # Use unigrams and bigrams
        # Other parameters could be set here; the defaults are kept for the rest
    )

    tfidf_matrix = tfidf_vectorizer.fit_transform(df['review'])  # Fit and apply the vectorization
    return tfidf_matrix, tfidf_vectorizer

In [108]:
if movie_reviews_df is not None:
    # Vectorization
    tfidf_matrix, vectorizer = vectorize_reviews(movie_reviews_df)

    if tfidf_matrix is not None:  # Check that the vectorization succeeded
        print("\nTF-IDF matrix (shape):", tfidf_matrix.shape)
        # Access the feature names (words/n-grams)
        feature_names = vectorizer.get_feature_names_out()
        print("Number of features:", len(feature_names))
        print("Some features (words/n-grams):", feature_names[1000:1020])  # An example slice

        # Densify only a small slice for display; converting the full
        # 25000 x 5000 matrix to a dense array would waste memory.
        dense_extract = tfidf_matrix[:2, :10].toarray()
        print("\nTF-IDF matrix (dense, extract):\n", dense_extract)

        # Transform a new review with the already-fitted vectorizer,
        # applying the same cleaning steps as clean_reviews (including stemming).
        new_review = "This movie was absolutely amazing! The acting was superb."
        cleaned_new_review = remove_html_bs(new_review)
        cleaned_new_review = remove_special_characters(cleaned_new_review)
        cleaned_new_review = convert_to_lowercase(cleaned_new_review)
        cleaned_new_review = remove_stopwords(cleaned_new_review)
        cleaned_new_review = apply_stemming(cleaned_new_review)
        cleaned_new_review = apply_lemmatization(cleaned_new_review)

        new_review_vectorized = vectorizer.transform([cleaned_new_review])  # transform, not fit_transform
        print("\nVectorized new review (shape):", new_review_vectorized.shape)


TF-IDF matrix (shape): (25000, 5000)
Number of features: 5000
Some features (words/n-grams): ['crimin' 'cring' 'crisi' 'critic' 'crocodil' 'crook' 'cross' 'crowd'
 'crucial' 'crude' 'cruel' 'cruis' 'crush' 'crystal' 'cuba' 'cube' 'cue'
 'cult' 'cult classic' 'cultur']

TF-IDF matrix (dense, extract):
 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Vectorized new review (shape): (1, 5000)


In [109]:
display_first_reviews(movie_reviews_df, num_reviews=5)

                                              review  sentiment
0  zentropa origin movi seen year like uniqu thri...          1
1  busi amaz love everi word ever done freak geek...          1
2  anoth good stoog short christin mcintyr love e...          1
3  complex film explor effect fordist taylorist m...          1
4  film special place heart caught first time tea...          1


4. Model Training:

    Goal: Train a machine learning model to classify reviews based on their sentiment.
    
    Task: Split the dataset into training and testing sets, train a Logistic Regression model, and evaluate its performance on the test data.
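
One caveat about combining this step with section 3: `vectorize_reviews` fits TF-IDF on all 25,000 reviews before any split, so the IDF statistics also see the documents that later end up in the test set. A possible alternative, sketched here on a tiny invented corpus, is to split the texts first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is fit on the training split only. The toy texts and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented, already-cleaned toy reviews (1 = positive, 0 = negative).
texts = [
    "great film love act", "wonder stori great cast", "love everi minut",
    "amaz visual great end", "bore film bad act", "dull stori wast time",
    "bad script terribl end", "wast money bore plot",
] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Split the texts first, so nothing from the test split reaches TF-IDF.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# fit() learns the vocabulary and IDF weights from the training split only.
pipeline.fit(texts_train, y_train)
test_accuracy = pipeline.score(texts_test, y_test)
print("Held-out accuracy:", test_accuracy)
```

The same pipeline object can be passed to cross-validation utilities, which then refit the vectorizer inside each fold.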

In [111]:
# TASK 4: Model Training

def train_logistic_regression(X, y):
    """
    Train a logistic regression model.

    Args:
        X: The TF-IDF matrix (features).
        y: The labels (sentiments, 0 or 1).

    Returns:
        A tuple (model, X_train, X_test, y_train, y_test): the trained
        model and the train/test splits used.
    """

    # 1. Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # test_size=0.2: 20% of the data for testing, 80% for training
    # random_state=42: fixes the random seed for reproducibility

    # 2. Create the logistic regression model
    model = LogisticRegression(max_iter=1000, random_state=42)  # Increase max_iter if the solver does not converge
    # max_iter: maximum number of solver iterations
    # random_state: for reproducibility

    # 3. Train the model
    model.fit(X_train, y_train)  # This is where the model "learns"

    return model, X_train, X_test, y_train, y_test


if tfidf_matrix is not None:
    # Train the model
    model, X_train, X_test, y_train, y_test = train_logistic_regression(tfidf_matrix, movie_reviews_df['sentiment'])

    # --- Model evaluation ---
    y_pred = model.predict(X_test)  # Predictions on the test set

    # print("\nAccuracy:", accuracy_score(y_test, y_pred))  # Needs accuracy_score from sklearn.metrics
    print("\nClassification Report:\n", classification_report(y_test, y_pred))  # Detailed report
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))  # Confusion matrix

# TASK 8: Track emissions during model training



Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.88      0.89      2485
           1       0.88      0.90      0.89      2515

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000


Confusion Matrix:
 [[2178  307]
 [ 254 2261]]
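
The report and the confusion matrix above carry the same information. As a short check, the sketch below recomputes precision, recall, F1 and accuracy directly from the four counts printed above, using only NumPy.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
cm = np.array([[2178, 307],
               [254, 2261]])

correct = np.diag(cm)
precision = correct / cm.sum(axis=0)  # column sums = number of predictions per class
recall = correct / cm.sum(axis=1)     # row sums = number of true samples per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The results round to the figures in the classification report. Read row by row, the matrix says that 307 negative reviews were predicted as positive and 254 positive reviews were predicted as negative. For a visual version, `sns.heatmap(cm, annot=True, fmt="d")` is one option given the seaborn import at the top of the notebook.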


5. Model Evaluation:

    Goal: Assess the performance of your model using appropriate metrics.
    
    Task: Evaluate precision, recall, and F1-score of the Logistic Regression model. Use these metrics to identify the strengths and weaknesses of your system. Visualize the Confusion Matrix to better understand how well the model classifies positive and negative reviews. Additionally, test the model with a new review, preprocess it, make a prediction, and display the result. Example: test it with a new review such as:
    "The movie had great visuals, but the storyline was dull and predictable." The expected output might be: Negative Sentiment.

In [None]:
# TASK 5: Model Evaluation 

# Classification Report

# Confusion Matrix

# Plot the Confusion Matrix

# Test with a new review
review = "The movie had great visuals but the storyline was dull and predictable."
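
One possible way to complete the last step is sketched below. To stay self-contained it trains a tiny stand-in model on invented reviews and uses a simplified cleaner (`simple_clean`, a hypothetical helper without NLTK). In the notebook itself, the fitted `vectorizer`, the trained `model` and the cleaning functions from section 2 would take their place.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def simple_clean(text):
    """Simplified stand-in for the notebook's cleaning chain (no NLTK)."""
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Tiny invented training set (1 = positive, 0 = negative), for illustration only.
train_texts = [
    "Great visuals and a wonderful story!", "I loved the acting, superb film.",
    "Amazing movie, great cast.", "Dull and predictable storyline.",
    "Boring plot, terrible acting.", "A predictable, dull waste of time.",
]
train_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer(stop_words="english")
X_toy = toy_vectorizer.fit_transform([simple_clean(t) for t in train_texts])
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, train_labels)

review = "The movie had great visuals but the storyline was dull and predictable."
# transform(), not fit_transform(): reuse the vocabulary learned during training.
review_vector = toy_vectorizer.transform([simple_clean(review)])

prediction = toy_model.predict(review_vector)[0]
probabilities = toy_model.predict_proba(review_vector)[0]
label = "Positive Sentiment" if prediction == 1 else "Negative Sentiment"
print(f"{label} (P(neg)={probabilities[0]:.2f}, P(pos)={probabilities[1]:.2f})")
```

Printing the class probabilities alongside the label is useful for mixed reviews like this one, where a prediction close to 0.5 signals low confidence.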


6. Hyperparameter Tuning:

    Goal: Optimize your Logistic Regression model by tuning its hyperparameters.
   
    Task: Use an optimization method to find the best parameters for your model and improve its accuracy.

In [None]:
# TASK 6: Hyperparameter Tuning 

# TASK 8: Track emissions during Hyperparameter Tuning
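
A minimal sketch of one optimization method, an exhaustive grid search with cross-validation, is given below. It runs on synthetic data from `make_classification` as a stand-in for the TF-IDF matrix, and the grid of `C` values is an illustrative assumption rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF features and sentiment labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# C is the inverse regularization strength: smaller values regularize more.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    cv=5,           # 5-fold cross-validation on the training split
    scoring="f1",
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validated F1:", round(grid_search.best_score_, 3))
# score() uses the refitted best estimator and the same F1 scoring.
print("F1 on the held-out test split:", round(grid_search.score(X_test, y_test), 3))
```

The test split is kept out of the search, so the final score is not biased by the parameter selection.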


7. Learning Curve Analysis:

    Goal: Diagnose your model's performance by plotting learning curves.
   
    Task: Analyze training and validation performance as a function of the training set size to identify underfitting or overfitting issues.


In [None]:
# TASK 7: Learning Curve Analysis
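
The analysis can be sketched with scikit-learn's `learning_curve`, as below. Synthetic data stands in for the TF-IDF features, and the output file name `learning_curve.png` is an assumption for the demo.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the TF-IDF features and sentiment labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Refit the model on growing subsets and score it with 5-fold cross-validation.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000, random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the folds (axis 1) to get one point per training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

plt.figure(figsize=(6, 4))
plt.plot(train_sizes, train_mean, "o-", label="Training score")
plt.plot(train_sizes, val_mean, "o-", label="Cross-validation score")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.title("Learning curve (synthetic data)")
plt.legend()
plt.tight_layout()
plt.savefig("learning_curve.png")
```

As a reading guide: a training score well above the validation score suggests overfitting, while two curves that converge at a low score suggest underfitting.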


9. Ethical Considerations and Explainability:

    Goal: Discuss the ethics in using and deploying your AI-based solution by investigating and implementing suitable explainability methods.
    
    Task: Understanding how a machine learning model makes predictions is crucial for ensuring transparency, fairness, and accountability in AI deployment. One of the widely used techniques for model explainability is SHAP (SHapley Additive exPlanations), which helps determine how much each feature (word) contributes to a prediction.
    In this task, you will use SHAP to analyze the impact of individual words on sentiment classification. This will allow you to visualize which words increase or decrease the probability of a positive or negative sentiment prediction. Additionally, discuss key aspects such as potential biases in the model, fairness in outcomes, and accountability in AI decision-making. You can find more information here: https://shap.readthedocs.io/en/latest/generated/shap.Explanation.html

In [None]:
# TASK 9: Ethical Considerations & Explainability

# Show SHAP summary plot with proper feature names


10. Deployment Considerations for Embedded Systems:

    Goal: Optimize and convert the trained logistic regression model for deployment on embedded systems like Arduino.
    
    Task: To deploy the trained logistic regression model on a resource-constrained embedded system like an Arduino, we must optimize and convert the model into a format suitable for execution in an environment with limited memory and processing power. Since embedded systems do not support direct execution of machine learning models trained in Python, we extract the model’s learned parameters—namely, the weights and bias—after training. These parameters are then quantized to fixed-point integers to eliminate the need for floating-point calculations, which are inefficient on microcontrollers.
    Once quantization is applied, we generate a C++ .h header file containing the model’s coefficients and bias, formatted in a way that allows direct use within an Arduino sketch. The final model is optimized to perform inference using integer arithmetic, making it both lightweight and efficient for deployment on microcontrollers. You can find more information here: https://medium.com/@thommaskevin/tinyml-binomial-logistic-regression-0fdbf00e6765

In [None]:
# TASK 10: Deployment Considerations (Model Quantization & Export for Arduino)
# Extract weights and bias from the trained logistic regression model

# Apply quantization (convert to fixed-point representation)

# Generate C++ header file for Arduino

# Save the header file
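
The steps listed above can be sketched as follows. The example trains a small model on synthetic data so that it runs on its own. The fixed-point format (8 fractional bits), the identifiers `MODEL_WEIGHTS`, `MODEL_BIAS`, `N_FEATURES` and the file name `sentiment_model.h` are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the trained sentiment classifier.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
model = LogisticRegression(max_iter=1000, random_state=42).fit(X, y)

# Extract the learned parameters.
weights = model.coef_[0]
bias = model.intercept_[0]

# Fixed-point format (assumption): 8 fractional bits, so 1.0 is stored as 256.
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS

def quantize(values, scale, dtype):
    """Scale, round and clip float values into the given integer type."""
    info = np.iinfo(dtype)
    scaled = np.round(np.asarray(values) * scale)
    return np.clip(scaled, info.min, info.max).astype(dtype)

weights_q = quantize(weights, SCALE, np.int16)
# Products of two scaled values carry SCALE**2, so the bias uses that scale too.
bias_q = int(quantize(bias, SCALE * SCALE, np.int32))

def predict_int(x):
    """Integer-only inference: the sign of the accumulator gives the class."""
    x_q = quantize(x, SCALE, np.int16).astype(np.int32)
    acc = int(np.dot(weights_q.astype(np.int32), x_q)) + bias_q
    return 1 if acc >= 0 else 0

# Compare integer inference with the original floating-point model.
int_pred = np.array([predict_int(x) for x in X])
agreement = (int_pred == model.predict(X)).mean()
print(f"Agreement with the float model: {agreement:.3f}")

# Generate the C++ header for the Arduino sketch.
header_lines = [
    "#ifndef SENTIMENT_MODEL_H",
    "#define SENTIMENT_MODEL_H",
    "",
    "#include <stdint.h>",
    "",
    f"#define N_FEATURES {len(weights_q)}",
    f"#define FRAC_BITS {FRAC_BITS}",
    "",
    "const int16_t MODEL_WEIGHTS[N_FEATURES] = {"
    + ", ".join(str(int(w)) for w in weights_q) + "};",
    f"const int32_t MODEL_BIAS = {bias_q};",
    "",
    "#endif  // SENTIMENT_MODEL_H",
]
header_text = "\n".join(header_lines) + "\n"

# Save the header file.
with open("sentiment_model.h", "w") as f:
    f.write(header_text)
```

Only the sign of the accumulator is needed for the class decision, so the sigmoid can be skipped on the device. With the notebook's 5,000 features, the `int16_t` weight array alone takes about 10 KB, so on boards with very little SRAM it would likely have to be stored in flash memory, and the text preprocessing and TF-IDF step would also need a counterpart on the device or on a host.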
