# **Mini Project 1**

0. Requirements:
   
   If you do not have the following packages installed, uncomment and run the commands below to install them.

In [9]:
# !pip install pandas
# !pip install numpy
# !pip install scikit-learn
# !pip install matplotlib
# !pip install seaborn
# !pip install nltk
# !pip install codecarbon
# !pip install shap

1. Data Preparation:
   
    Goal: Load and inspect the IMDb dataset of movie reviews labeled with positive or negative sentiment (https://ai.stanford.edu/%7Eamaas/data/sentiment/).
    
    Task: Read the dataset, store the reviews and their associated sentiments, and explore the dataset to understand its structure.
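
A first inspection pass could look like the sketch below. It uses a three-row toy DataFrame as a stand-in for the loaded data, and the helper name `summarize_reviews` is an assumption made for illustration. On the full IMDb training split the class counts should come out balanced (12,500 reviews per class).

```python
import pandas as pd

# Toy stand-in for the loaded dataset (assumption: columns 'review' and 'sentiment').
df = pd.DataFrame({
    "review": [
        "A wonderful, moving film.",
        "Dull plot and flat acting.<br />Avoid.",
        "Great cast, great score.",
    ],
    "sentiment": ["pos", "neg", "pos"],
})


def summarize_reviews(frame):
    """Return basic structural facts about a review DataFrame."""
    n_words = frame["review"].str.split().str.len()  # whitespace-separated tokens
    return {
        "n_rows": len(frame),
        "class_counts": frame["sentiment"].value_counts().to_dict(),
        "n_missing": int(frame.isna().sum().sum()),
        "n_duplicates": int(frame["review"].duplicated().sum()),
        "mean_words": float(n_words.mean()),
    }


summary = summarize_reviews(df)
print(df.head())
print(summary)
```

Checking class balance, missing values, and duplicates early helps decide whether stratified splitting or deduplication is needed later.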

In [1]:
import os
import pandas as pd
import numpy as np
import re
import shap
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords  # Stop-word lists from NLTK (Natural Language Toolkit)
from nltk.stem import PorterStemmer  # Porter stemmer: reduces words to their root form (stem)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from codecarbon import EmissionsTracker



In [3]:
# TASK 1: Data Preparation 
# Load the dataset from the local aclImdb folder

def load_movie_reviews(data_folder):
    """
    Load movie reviews from a folder structure (pos/ and neg/) and
    return them as a pandas DataFrame.

    Args:
        data_folder: Path to the main folder containing the 'pos' and
                     'neg' subfolders, e.g. "aclImdb/train".

    Returns:
        A pandas DataFrame with two columns: 'review' (the review text)
        and 'sentiment' ('pos' or 'neg').
        Returns None if an error occurs (e.g. a missing folder).
    """

    # 1. Initialize the lists that collect the data
    reviews = []     # Review texts
    sentiments = []  # Matching sentiment labels ('pos' or 'neg')

    # 2. Loop over the two sentiment classes: 'pos' and 'neg'
    for sentiment in ['pos', 'neg']:
        # 2.a. Build the full path to the subfolder (pos/ or neg/)
        folder_path = os.path.join(data_folder, sentiment)
        #   Example: if data_folder = "aclImdb/train" and sentiment = "pos",
        #   folder_path becomes "aclImdb/train/pos"

        # 2.b. Check that the folder exists
        if not os.path.isdir(folder_path):
            print(f"Error: the folder '{folder_path}' does not exist.")
            return None  # Stop and return None on error

        # 2.c. Loop over every file in the current subfolder (pos/ or neg/);
        #      sorted() makes the row order reproducible across runs
        for filename in sorted(os.listdir(folder_path)):
            # 2.c.i. Keep only files ending in ".txt"
            if filename.endswith(".txt"):
                # 2.c.ii. Build the full path to the text file
                file_path = os.path.join(folder_path, filename)
                #    Example: if folder_path = "aclImdb/train/pos" and filename = "123_9.txt",
                #    file_path becomes "aclImdb/train/pos/123_9.txt"

                # 2.c.iii. Open and read the file (with error handling)
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:  # Read mode, UTF-8 encoding
                        review_text = f.read()  # Read the whole file
                        reviews.append(review_text)    # Store the review text
                        sentiments.append(sentiment)   # Store its label ('pos' or 'neg')

                except FileNotFoundError:
                    # Should not normally happen, since os.listdir() just listed the file.
                    print(f"Error: file '{file_path}' not found (unlikely).")
                    return None  # Skipping the file with 'continue' would be an alternative.

                except Exception as e:
                    # Catch any other error raised while reading
                    print(f"Error while reading '{file_path}': {e}")
                    return None

    # 3. Build the pandas DataFrame (AFTER walking all folders and files)
    df = pd.DataFrame({'review': reviews, 'sentiment': sentiments})
    #   Creates a table with two columns:
    #     - 'review': the review texts (from the 'reviews' list)
    #     - 'sentiment': the labels ('pos' or 'neg') (from the 'sentiments' list)

    return df  # Return the resulting DataFrame


# Example usage (outside the function):
# data_directory = "aclImdb/train"  # OR "aclImdb/test", adapt as needed
# movie_reviews_df = load_movie_reviews(data_directory)
#
# if movie_reviews_df is not None:
#     print(movie_reviews_df.head())

# Specify the path to the train folder in the aclImdb dataset

# Inspect the dataset

2. Text Preprocessing:
   
    Goal: Clean and preprocess the text data to remove noise and prepare it for analysis.
    
    Task: Remove unnecessary characters (e.g., HTML tags, punctuation), convert text to lowercase, and process words by removing stop words and stemming/lemmatizing them.
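
The cleaning steps named above can be sketched with the standard library alone. The stop-word set here is a deliberately tiny illustration, and the function name `preprocess` and its `stem` hook are assumptions; in the notebook, NLTK's `stopwords.words("english")` and `PorterStemmer().stem` could be passed in instead.

```python
import re

# Tiny illustrative stop-word set (assumption); NLTK's English list is much larger.
STOP_WORDS = {"a", "an", "and", "but", "had", "is", "it", "of", "the", "this", "to", "was"}


def preprocess(text, stop_words=STOP_WORDS, stem=None):
    """Clean one review: strip HTML, drop non-letters, lowercase, filter, stem."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # keep only letters and whitespace
    tokens = text.lower().split()              # lowercase and tokenize on whitespace
    tokens = [t for t in tokens if t not in stop_words]
    if stem is not None:                       # optional stemming callable
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)


cleaned = preprocess("The movie had <br />GREAT visuals, but the storyline was dull!")
print(cleaned)  # movie great visuals storyline dull
```

Replacing tags and punctuation with a space rather than an empty string avoids gluing neighbouring words together, for example around `<br />`.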

In [12]:
# TASK 2: Text Preprocessing 
# Remove HTML tags

# Remove special characters

# Convert to lowercase
 

3. Feature Extraction:

    Goal: Transform the cleaned text into numerical features for machine learning.
   
    Task: Use a vectorization technique such as TF-IDF to convert the text into a numerical matrix that captures the importance of each word in the dataset.

In [13]:
# TASK 3: Feature Extraction 


4. Model Training:

    Goal: Train a machine learning model to classify reviews based on their sentiment.
    
    Task: Split the dataset into training and testing sets, train a Logistic Regression model, and evaluate its performance on the test data.
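
One way to arrange the split and training is sketched below on a small synthetic corpus (the phrases and the 75/25 split are assumptions for illustration). The vectorizer is fitted on the training texts only, so that no information from the test set leaks into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic toy corpus (assumption), 1 = positive, 0 = negative.
pos = ["great film", "wonderful acting", "loved the story", "brilliant and moving"]
neg = ["dull film", "terrible acting", "hated the story", "boring and predictable"]
texts = pos * 5 + neg * 5
labels = [1] * 20 + [0] * 20

# Stratify so both splits keep the same class balance.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)  # learn vocabulary on train only
X_test = vectorizer.transform(X_test_txt)        # reuse it for the test set

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```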

In [14]:
# TASK 4: Model Training 

# TASK 8: Track emissions during model training


5. Model Evaluation:

    Goal: Assess the performance of your model using appropriate metrics.
    
    Task: Evaluate precision, recall, and F1-score of the Logistic Regression model. Use these metrics to identify the strengths and weaknesses of your system. Visualize the Confusion Matrix to better understand how well the model classifies positive and negative reviews. Additionally, test the model with a new review, preprocess it, make a prediction, and display the result. Example: test it with a new review such as:
    "The movie had great visuals, but the storyline was dull and predictable." The expected output might be: Negative Sentiment.

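To make the metrics concrete, the sketch below computes a confusion matrix and derives precision, recall, and F1 for the positive class by hand. The label vectors are invented for illustration and do not come from the IMDb model.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented true and predicted labels (assumption), four reviews per class.
y_true = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"]
y_pred = ["neg", "neg", "neg", "pos", "pos", "pos", "neg", "neg"]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["neg", "pos"])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the reviews predicted positive, how many were
recall = tp / (tp + fn)      # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(classification_report(y_true, y_pred, labels=["neg", "pos"]))
```

Here the model misses half of the positive reviews (recall 0.5) while being right two times out of three when it does predict positive. For the plot, seaborn's `heatmap` called on `cm` with `annot=True` and `fmt="d"` is a common choice.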
In [15]:
# TASK 5: Model Evaluation 

# Classification Report

# Confusion Matrix

# Plot the Confusion Matrix

# Test with a new review
review = "The movie had great visuals, but the storyline was dull and predictable."


6. Hyperparameter Tuning:

    Goal: Optimize your Logistic Regression model by tuning its hyperparameters.
   
    Task: Use an optimization method to find the best parameters for your model and improve its accuracy.
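
Grid search is one such optimization method. The sketch below wraps the vectorizer and classifier in a `Pipeline` so that TF-IDF is refitted inside each cross-validation fold; the toy corpus and the candidate values for `C` are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic toy corpus (assumption), 1 = positive, 0 = negative.
pos = ["great film", "wonderful acting", "loved the story", "brilliant and moving"]
neg = ["dull film", "terrible acting", "hated the story", "boring and predictable"]
texts = pos * 5 + neg * 5
labels = [1] * 20 + [0] * 20

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "clf__C" addresses the C parameter of the step named "clf".
# Smaller C means stronger regularization.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all the data with the winning parameters.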

In [16]:
# TASK 6: Hyperparameter Tuning 

# TASK 8: Track emissions during Hyperparameter Tuning


7. Learning Curve Analysis:

    Goal: Diagnose your model's performance by plotting learning curves.
   
    Task: Analyze training and validation performance as a function of the training set size to identify underfitting or overfitting issues.
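
A possible way to obtain the curve data with scikit-learn's `learning_curve` is sketched below, again on a synthetic corpus (an assumption). The classes are interleaved so that every training subset contains both labels.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline

# Synthetic toy corpus (assumption), interleaved as pos, neg, pos, neg, ...
pos = ["great film", "wonderful acting", "loved the story", "brilliant and moving"]
neg = ["dull film", "terrible acting", "hated the story", "boring and predictable"]
texts, labels = [], []
for _ in range(5):
    for p, n in zip(pos, neg):
        texts += [p, n]
        labels += [1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

train_sizes, train_scores, val_scores = learning_curve(
    pipeline, texts, labels,
    train_sizes=np.linspace(0.2, 1.0, 5),  # 20% to 100% of each training fold
    cv=4, scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)  # average over the 4 folds
val_mean = val_scores.mean(axis=1)
for size, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={size:3d}  train={tr:.3f}  validation={va:.3f}")
```

A large, persistent gap between the two curves suggests overfitting, while two low curves that have converged suggest underfitting. Plotting `train_mean` and `val_mean` against `train_sizes` with matplotlib gives the usual figure.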


In [17]:
# TASK 7: Learning Curve Analysis


9. Ethical Considerations and Explainability:

    Goal: Discuss the ethics of using and deploying your AI-based solution by investigating and implementing suitable explainability methods.
    
    Task: Understanding how a machine learning model makes predictions is crucial for ensuring transparency, fairness, and accountability in AI deployment. One of the widely used techniques for model explainability is SHAP (SHapley Additive exPlanations), which helps determine how much each feature (word) contributes to a prediction.
    In this task, you will use SHAP to analyze the impact of individual words on sentiment classification. This will allow you to visualize which words increase or decrease the probability of a positive or negative sentiment prediction. Additionally, discuss key aspects such as potential biases in the model, fairness in outcomes, and accountability in AI decision-making. You can find more information here: https://shap.readthedocs.io/en/latest/generated/shap.Explanation.html

In [18]:
# TASK 9: Ethical Considerations & Explainability

# Show SHAP summary plot with proper feature names


10. Deployment Considerations for Embedded Systems:

    Goal: Optimize and convert the trained logistic regression model for deployment on embedded systems such as an Arduino.
    
    Task: To deploy the trained logistic regression model on a resource-constrained embedded system such as an Arduino, we must optimize and convert the model into a format that runs with limited memory and processing power. Embedded systems cannot directly execute machine learning models trained in Python, so after training we extract the model’s learned parameters, namely the weights and bias. These parameters are then quantized to fixed-point integers to avoid floating-point calculations, which are slow on microcontrollers that lack a hardware floating-point unit.
    Once quantization is applied, we generate a C++ .h header file containing the model’s coefficients and bias, formatted in a way that allows direct use within an Arduino sketch. The final model is optimized to perform inference using integer arithmetic, making it both lightweight and efficient for deployment on microcontrollers. You can find more information here: https://medium.com/@thommaskevin/tinyml-binomial-logistic-regression-0fdbf00e6765
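
The quantization and export steps could be sketched as follows. The weights are made-up numbers, and the function names, the 8 fractional bits, and the C identifiers (`WEIGHTS`, `BIAS`, `N_FEATURES`, `FRAC_BITS`) are all assumptions for illustration; with a trained model the inputs would be `model.coef_[0]` and `model.intercept_[0]`.

```python
import numpy as np


def quantize(values, frac_bits=8):
    """Convert floats to signed 16-bit fixed point with `frac_bits` fractional bits."""
    scale = 1 << frac_bits                             # 2**frac_bits
    q = np.round(np.asarray(values, dtype=np.float64) * scale)
    return np.clip(q, -32768, 32767).astype(np.int16)  # saturate to the int16 range


def to_header(weights_q, bias_q, frac_bits, name="sentiment_model"):
    """Render quantized parameters as the text of a C/C++ header."""
    guard = name.upper() + "_H"
    values = ", ".join(str(int(v)) for v in weights_q)
    lines = [
        f"#ifndef {guard}",
        f"#define {guard}",
        "",
        "#include <stdint.h>",
        "",
        f"#define N_FEATURES {len(weights_q)}",
        f"#define FRAC_BITS {frac_bits}",
        "",
        f"const int16_t WEIGHTS[N_FEATURES] = {{{values}}};",
        f"const int32_t BIAS = {int(bias_q)};",
        "",
        f"#endif  // {guard}",
    ]
    return "\n".join(lines) + "\n"


# Made-up parameters standing in for model.coef_[0] and model.intercept_[0].
weights = [0.5, -1.25, 0.003]
bias = -0.75
FRAC_BITS = 8

weights_q = quantize(weights, FRAC_BITS)
bias_q = int(round(bias * (1 << FRAC_BITS)))
header = to_header(weights_q, bias_q, FRAC_BITS)
print(header)
```

Writing `header` to a file (for example with `pathlib.Path(...).write_text(header)`) yields something an Arduino sketch can `#include`. Note that 0.003 becomes 1/256, about 0.0039: more fractional bits reduce this rounding error but narrow the representable range. The integer inference code on the device must apply the same scale to the inputs and the accumulator.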

In [19]:
# TASK 10: Deployment Considerations (Model Quantization & Export for Arduino)
# Extract weights and bias from the trained logistic regression model

# Apply quantization (convert to fixed-point representation)

# Generate C++ header file for Arduino

# Save the header file
