# ST2MLE: Machine Learning for IT Engineers

## Project Objectives

- Master the full lifecycle of a data project (collection, cleaning, preprocessing, modeling, evaluation).
- Apply techniques for text processing and numerical data analysis.
- Explore various text vectorization techniques (BoW, TF-IDF, Doc2Vec, BERT).
- Conduct analyses and provide recommendations based on real French data.

## Context

As part of this project, students will work on mixed data (numerical and textual) collected from French websites. The objective is to carry out a comprehensive analysis, from data collection to modeling and interpretation, with a focus on a French economic, social, or public context.

In [50]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import warnings

warnings.filterwarnings("ignore")

# Load data
df = pd.read_csv("docs/french_books_reviews.csv")
print(f"Dataset: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()

# Display configuration
plt.style.use("default")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 100)

print("Libraries imported successfully")


Dataset: (9658, 5)
Columns: ['book_title', 'author', 'reader_review', 'rating', 'label']
Libraries imported successfully
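
Before cleaning, it can be useful to check for missing values and duplicate rows. The sketch below is only an illustration: `toy` is a hypothetical frame with made-up values that borrows some of the column names printed above, not the project dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing some column names from the dataset;
# the values are made up for illustration.
toy = pd.DataFrame(
    {
        "book_title": ["Livre A", "Livre B", "Livre B", "Livre C"],
        "reader_review": ["Très bon", "Moyen", "Moyen", np.nan],
        "rating": [4.5, 3.0, 3.0, 1.0],
    }
)

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count fully duplicated rows (the first occurrence is not counted)
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")
```

The same two calls could be run on `df` directly; whether duplicates should be dropped depends on how the reviews were collected.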


### 3. Data Cleaning

In [52]:
# Text preprocessing functions
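def clean_text(text):
    """
    Lowercase French text, replace punctuation with spaces,
    and collapse repeated whitespace.
    """
    if pd.isna(text):
        return ""

    text = str(text).lower()

    # Replace special characters with spaces. In Python 3, \w is Unicode-aware,
    # so accented letters (é, à, ç, ...) are kept without listing them explicitly.
    # Note that \w also keeps digits and underscores.
    text = re.sub(r"[^\w\s]", " ", text)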

    return re.sub(r"\s+", " ", text).strip()


# Test cleaning functions
print("Testing cleaning functions")
test_text = (
    "Here is an EXAMPLE of text with special characters !@# and multiple   spaces."
)
print(f"Original text: {test_text}")
print(f"Cleaned text: {clean_text(test_text)}")

# Example with French characters
test_french = "C'est un très bon livre! Il m'a plu énormément... àâäéèêëîïôöùûüÿç"
print(f"\nOriginal French text: {test_french}")
print(f"Cleaned French text: {clean_text(test_french)}")


# Apply the cleaning function to the text columns and derive a length feature
df_cleaned = df.copy()
df_cleaned["book_title"] = df_cleaned["book_title"].apply(clean_text)
df_cleaned["author"] = df_cleaned["author"].apply(clean_text)
df_cleaned["reader_review"] = df_cleaned["reader_review"].apply(clean_text)
df_cleaned["review_length"] = df_cleaned["reader_review"].str.len()

# Export
df_cleaned.to_csv("french_books_reviews_cleaned.csv", index=False)
print(f"Data cleaned and exported: {df_cleaned.shape}")
df_cleaned.head()

Testing cleaning functions
Original text: Here is an EXAMPLE of text with special characters !@# and multiple   spaces.
Cleaned text: here is an example of text with special characters and multiple spaces

Original French text: C'est un très bon livre! Il m'a plu énormément... àâäéèêëîïôöùûüÿç
Cleaned French text: c est un très bon livre il m a plu énormément àâäéèêëîïôöùûüÿç
Data cleaned and exported: (9658, 6)


| | book_title | author | reader_review | rating | label | review_length |
|---|---|---|---|---|---|---|
| 0 | le démon de la colline aux loups | dimitri rouchon borie | ce n est pas le premier roman à aborder les thèmes lourds de l inceste et de l enfance martyre m... | 5.0 | 1 | 490 |
| 1 | simple | marie aude murail | simple alias barnabé est un jeune homme de 22 ans qui a l âge mental d un enfant de 3 ans kléber... | 4.0 | 1 | 608 |
| 2 | la plus secrète mémoire des hommes | mohamed mbougar sarr | pour écrire la plus secrète mémoire des hommes mohamed mbougar sarr s est inspiré du destin bris... | 4.0 | 1 | 296 |
| 3 | trancher | amélie cordonnier | la violence d aurélien est revenue par la fenêtre peut être bien c est une surprise qui te foudr... | 3.5 | 0 | 710 |
| 4 | la guerre d alan tome 2 | emmanuel guibert | dans ce second album de la guerre d alan emmanuel guibert m a fait suivre à nouveau les pas de c... | 5.0 | 1 | 183 |
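
The cleaning step relies on `\w` to keep accented letters. As a minimal sketch of why that works, the snippet below contrasts the default Unicode-aware behaviour of `\w` on `str` patterns with the `re.ASCII` flag; the sample sentence and variable names are illustrative only.

```python
import re

# In Python 3, \w on str patterns is Unicode-aware by default, so it
# matches accented letters (as well as digits and the underscore).
pattern = re.compile(r"[^\w\s]")

sample = "C'est un très bon livre! Ça m'a plu énormément..."
cleaned = re.sub(r"\s+", " ", pattern.sub(" ", sample.lower())).strip()
print(cleaned)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accents are stripped.
ascii_only = re.sub(r"[^\w\s]", " ", "très", flags=re.ASCII)
print(ascii_only)
```

Because apostrophes are replaced by spaces, elided forms such as `c'est` become two tokens (`c est`), which may matter for later tokenization or stop-word removal.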


### 4. Variable Labeling
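
Since the dataset carries both a `rating` and a `label` column, one way to see how the two relate before assigning roles is a cross-tabulation. The sketch below uses a hypothetical toy frame; its values are made up and do not describe the actual mapping in the dataset.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative, not taken from the dataset.
toy = pd.DataFrame(
    {
        "rating": [1.0, 2.0, 3.5, 4.0, 5.0, 5.0],
        "label": [-1, -1, 0, 1, 1, 1],
    }
)

# Rows: rating values, columns: label values, cells: number of rows
rating_vs_label = pd.crosstab(toy["rating"], toy["label"])
print(rating_vs_label)
```

Running the same call on `df_cleaned["rating"]` and `df_cleaned["label"]` would show whether `label` is a deterministic binning of `rating`, in which case only one of them should be used as a target.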

In [None]:
# Step 4: Variable Labeling and Analysis
print("Dataset structure analysis:")
print(f"Number of rows: {len(df_cleaned)}")
print(f"Number of columns: {len(df_cleaned.columns)}")
print(f"Columns: {list(df_cleaned.columns)}\n")

print("Data types:")
print(df_cleaned.dtypes)
print()

# Analyze numerical variables
print("Numerical variables analysis:")
numerical_columns = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
print(f"Detected numerical variables: {numerical_columns}\n")

for col in numerical_columns:
    print(f"{col}:")
    print(f"  Type: {df_cleaned[col].dtype}")
    print(f"  Missing values: {df_cleaned[col].isnull().sum()}")
    print(f"  Unique values: {df_cleaned[col].nunique()}")
    if df_cleaned[col].nunique() < 20:
        print(f"  Possible values: {sorted(df_cleaned[col].dropna().unique())}")
    else:
        print(f"  Range: {df_cleaned[col].min()} to {df_cleaned[col].max()}")
        print(
            f"  Mean: {df_cleaned[col].mean():.2f}, Median: {df_cleaned[col].median():.2f}"
        )
    print()

# Analyze textual variables
print("Textual variables analysis:")
textual_columns = df_cleaned.select_dtypes(include=["object"]).columns.tolist()
print(f"Detected textual variables: {textual_columns}\n")

for col in textual_columns:
    print(f"{col}:")
    print(f"  Type: {df_cleaned[col].dtype}")
    print(f"  Missing values: {df_cleaned[col].isnull().sum()}")
    print(f"  Unique values: {df_cleaned[col].nunique()}")

    lengths = df_cleaned[col].str.len()
    print(f"  Average length: {lengths.mean():.1f} characters")
    print(f"  Length range: {lengths.min()} to {lengths.max()} characters")

    sample_values = df_cleaned[col].dropna().head(3).tolist()
    print(f"  Examples: {sample_values}")
    print()

# Define variable roles for modeling
print("Variable role classification:")
variable_roles = {
    "numerical_features": [],
    "textual_features": [],
    "target_variable": [],
    "derived_features": [],
}

for col in df_cleaned.columns:
    if col == "review_length":
        variable_roles["derived_features"].append(col)
    elif col in ["reader_review", "book_title", "author"]:
        variable_roles["textual_features"].append(col)
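    elif pd.api.types.is_numeric_dtype(df_cleaned[col]):
        # Heuristic: numeric columns with fewer than 10 distinct values
        # are treated as candidate targets, the rest as numerical features.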
        if df_cleaned[col].nunique() < 10:
            variable_roles["target_variable"].append(col)
        else:
            variable_roles["numerical_features"].append(col)

for role, vars_list in variable_roles.items():
    if vars_list:
        print(f"{role.replace('_', ' ').title()}: {vars_list}")

print("\nTarget variable analysis:")
if "rating" in df_cleaned.columns:
    rating_stats = df_cleaned["rating"].value_counts().sort_index()
    print("Rating distribution:")
    print(rating_stats)

    if len(rating_stats) <= 5:
        print(
            f"Rating variable is suitable for classification with {len(rating_stats)} classes"
        )
        print(f"Classes: {list(rating_stats.index)}")
    else:
        print("Rating variable could be used for regression or grouped into categories")

print(f"\nTextual features ready for analysis: {variable_roles['textual_features']}")
print(
    f"Numerical features available: {variable_roles['numerical_features'] + variable_roles['derived_features']}"
)

print("\nConclusion:")
print(
    "No additional variables needed. The dataset contains sufficient information for:"
)
print("- Textual analysis of reviews, titles and authors")
print("- Predictive modeling with rating as target variable")
print("- Review length analysis as derived feature")

=== VARIABLE ANALYSIS AND LABELING ===

Dataset structure:
Number of rows: 9658
Number of columns: 6
Columns: ['book_title', 'author', 'reader_review', 'rating', 'label', 'review_length']

Data types:
book_title        object
author            object
reader_review     object
rating           float64
label              int64
review_length      int64
dtype: object

=== DETAILED VARIABLE ANALYSIS ===

1. NUMERICAL VARIABLES:
Detected numerical variables: ['rating', 'label', 'review_length']

• rating:
  - Type: float64
  - Missing values: 0
  - Unique values: 11
  - Possible values: [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]

• label:
  - Type: int64
  - Missing values: 0
  - Unique values: 3
  - Possible values: [-1, 0, 1]

• review_length:
  - Type: int64
  - Missing values: 0
  - Unique values: 1196
  - Min: 0, Max: 3641
  - Mean: 303.25, Median: 224.00

2. TEXTUAL VARIABLES:
Detected textual variables: ['book_t