# ST2MLE: Machine Learning for IT Engineers

## Project Objectives

- Master the full lifecycle of a data project (collection, cleaning, preprocessing, modeling, evaluation).
- Apply techniques for text processing and numerical data analysis.
- Explore various text vectorization techniques (BoW, TF-IDF, Doc2Vec, BERT).
- Conduct analyses and provide recommendations based on real French data.

## Context

In this project, students work with mixed numerical and textual data collected from French websites. The goal is an end-to-end analysis, from data collection through modeling and interpretation, focused on a French economic, social, or public context.

In [48]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import warnings

warnings.filterwarnings("ignore")

# Load data
df = pd.read_csv("docs/french_books_reviews.csv")
print(f"Dataset: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display configuration
plt.style.use("default")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 100)

print("Libraries imported successfully")

Dataset: (9658, 5)
Columns: ['book_title', 'author', 'reader_review', 'rating', 'label']
Libraries imported successfully
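
Before cleaning, it can help to look at missing values and at how `rating` and `label` are distributed. The sketch below is only an illustration: it uses a tiny hypothetical DataFrame that reuses the dataset's column names, and its rows are made up rather than taken from `french_books_reviews.csv`.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset
sample = pd.DataFrame(
    {
        "book_title": ["Livre A", "Livre B", "Livre C"],
        "author": ["Auteur 1", None, "Auteur 3"],
        "reader_review": ["Très bon roman.", "Décevant...", None],
        "rating": [5.0, 2.0, 4.0],
        "label": [1, 0, 1],
    }
)

# Number of missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# How many rows fall into each label value
label_counts = sample["label"].value_counts()
print(label_counts)

# Summary statistics of the numeric rating column
print(sample["rating"].describe())
```

On the real data, the same three calls can be run directly on `df`.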


### 3. Data Cleaning

In [49]:
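# Text preprocessing function
def clean_text(text):
    """
    Clean a French text string: lowercase it, replace punctuation and
    special characters with spaces, and collapse repeated whitespace.
    Missing values (NaN/None) become an empty string.
    """
    if pd.isna(text):
        return ""

    text = str(text).lower()

    # Replace anything that is not a word character or whitespace with a space.
    # On str patterns, \w is Unicode-aware, so accented letters (é, à, ç, ...)
    # are already kept without listing them explicitly.
    text = re.sub(r"[^\w\s]", " ", text)

    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()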


# Test cleaning functions
print("Testing cleaning functions")
test_text = (
    "Here is an EXAMPLE of text with special characters !@# and multiple   spaces."
)
print(f"Original text: {test_text}")
print(f"Cleaned text: {clean_text(test_text)}")

# Example with French characters
test_french = "C'est un très bon livre! Il m'a plu énormément... àâäéèêëîïôöùûüÿç"
print(f"\nOriginal French text: {test_french}")
print(f"Cleaned French text: {clean_text(test_french)}")


# Apply the cleaning to the text columns and add a review-length feature
# (review_length is measured in characters, on the cleaned review)
df_cleaned = df.copy()
df_cleaned["book_title"] = df_cleaned["book_title"].apply(clean_text)
df_cleaned["author"] = df_cleaned["author"].apply(clean_text)
df_cleaned["reader_review"] = df_cleaned["reader_review"].apply(clean_text)
df_cleaned["review_length"] = df_cleaned["reader_review"].str.len()

# Export
df_cleaned.to_csv("french_books_reviews_cleaned.csv", index=False)
print(f"Data cleaned and exported: {df_cleaned.shape}")
df_cleaned.head()

Testing cleaning functions
Original text: Here is an EXAMPLE of text with special characters !@# and multiple   spaces.
Cleaned text: here is an example of text with special characters and multiple spaces

Original French text: C'est un très bon livre! Il m'a plu énormément... àâäéèêëîïôöùûüÿç
Cleaned French text: c est un très bon livre il m a plu énormément àâäéèêëîïôöùûüÿç
Data cleaned and exported: (9658, 6)


|   | book_title | author | reader_review | rating | label | review_length |
|---|---|---|---|---|---|---|
| 0 | le démon de la colline aux loups | dimitri rouchon borie | ce n est pas le premier roman à aborder les thèmes lourds de l inceste et de l enfance martyre m... | 5.0 | 1 | 490 |
| 1 | simple | marie aude murail | simple alias barnabé est un jeune homme de 22 ans qui a l âge mental d un enfant de 3 ans kléber... | 4.0 | 1 | 608 |
| 2 | la plus secrète mémoire des hommes | mohamed mbougar sarr | pour écrire la plus secrète mémoire des hommes mohamed mbougar sarr s est inspiré du destin bris... | 4.0 | 1 | 296 |
| 3 | trancher | amélie cordonnier | la violence d aurélien est revenue par la fenêtre peut être bien c est une surprise qui te foudr... | 3.5 | 0 | 710 |
| 4 | la guerre d alan tome 2 | emmanuel guibert | dans ce second album de la guerre d alan emmanuel guibert m a fait suivre à nouveau les pas de c... | 5.0 | 1 | 183 |
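
The cleaned output shows that apostrophes are replaced by spaces, so elided forms such as `c'est` become `c est`. If keeping those forms as single tokens matters for later steps, one possible variant is sketched below. It is an assumption, not part of the notebook: `clean_text_keep_apostrophes` is a hypothetical name, and because it relies only on the standard library it handles `None` but not pandas `NaN`.

```python
import re


def clean_text_keep_apostrophes(text):
    """Hypothetical variant of clean_text that keeps apostrophes inside words."""
    if text is None:
        return ""

    # Lowercase and normalize the typographic apostrophe (U+2019) to the ASCII one
    text = str(text).lower().replace("\u2019", "'")

    # Replace everything except word characters, whitespace and apostrophes
    text = re.sub(r"[^\w\s']", " ", text)

    # Drop apostrophes that are not between two word characters
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)

    # Collapse repeated whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()


example = "C'est un très bon livre! Il m'a plu énormément..."
cleaned = clean_text_keep_apostrophes(example)
print(cleaned)
```

Whether to keep or drop apostrophes depends on the tokenizer used afterwards, so it is worth checking a few reviews with both versions before choosing.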
