# ST2MLE: Machine Learning for IT Engineers Project
## Machine Learning Project – Numerical and Textual Data (French Context)
### Context
As part of this project, students will work on mixed data (numerical and textual) collected from French websites.
The objective is to carry out a comprehensive analysis, from data collection to modeling
and interpretation, with a focus on a French economic, social, or public context.

### Learning Objectives
- Master the full lifecycle of a data project (collection, cleaning, preprocessing, modeling, evaluation).
- Apply techniques for text processing and numerical data analysis.
- Explore various text vectorization techniques (BoW, TF-IDF, Doc2Vec, BERT).
- Conduct analyses and provide recommendations based on real French data.

### Project Steps
1. Define a topic and its requirements, and identify relevant French sources.
2. Collect data (web scraping, APIs...).
3. Clean and preprocess both numerical and textual data.
4. Annotate (label) the data. Some websites already include categories or tags; these can be scraped alongside the text and used as labels. Otherwise, label the data manually.
5. Perform exploratory analysis and visualizations (distributions, word clouds, correlations...) to check for outliers, class imbalance, etc.
6. Apply under-sampling or over-sampling if the classes are imbalanced, and PCA for feature extraction if needed.
7. Apply predictive models:
  - **Numerical data**: Decision Trees, Random Forest, Boosting.
  - **Textual data**: Naive Bayes, Logistic Regression after vectorization.
8. Compare text vectorization methods: BoW, TF-IDF, Doc2Vec, BERT.
9. Provide business recommendations and submit a final report.
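
Steps 7 and 8 can be sketched as follows for the two simplest vectorizers, BoW and TF-IDF. This is a minimal illustration, not the expected solution: the sentences and labels are invented, and the choice of logistic regression with 2-fold cross-validation is an assumption made to keep the example small. Doc2Vec and BERT would need gensim and transformers and are not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; replace with the scraped French texts.
texts = [
    "école publique avec de très bons résultats",
    "formation excellente et enseignants disponibles",
    "très bonne insertion professionnelle des diplômés",
    "campus agréable et vie associative riche",
    "frais de scolarité trop élevés",
    "locaux vétustes et peu de suivi",
    "mauvaise organisation des cours",
    "insertion professionnelle décevante",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical labels: 1 = positive, 0 = negative

vectorizers = {"BoW": CountVectorizer(), "TF-IDF": TfidfVectorizer()}
scores = {}
for name, vectorizer in vectorizers.items():
    # The pipeline refits the vectorizer on each training fold only.
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(
        model, texts, labels, cv=2, scoring="accuracy"
    ).mean()

for name, score in scores.items():
    print(f"{name}: mean accuracy = {score:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary from being learned on the validation fold, which would otherwise inflate the comparison.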

### Technical Constraints
- Data must be exclusively from French sources.
- Texts must be in French only (use appropriate preprocessing: French lemmatization, French stopwords).
- Minimum of 500 data rows.
- Implementation in Python using scikit-learn, gensim, transformers, etc.
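
The French-specific preprocessing mentioned above could start from something like the sketch below. It uses only the standard library, and the stopword set is a tiny hand-written sample (an assumption, not a complete list); a real project would load a full French stopword list and add lemmatization with a dedicated NLP library.

```python
import re

# Tiny illustrative stopword set (assumption); not a complete French list.
FRENCH_STOPWORDS = {
    "le", "la", "les", "de", "des", "du", "un", "une",
    "et", "est", "en", "à", "au", "aux", "l", "d",
}


def tokenize_fr(text: str) -> list[str]:
    """Lowercase, keep runs of French letters, and drop stopwords."""
    tokens = re.findall(r"[a-zàâäçéèêëîïôöùûüÿœæ]+", text.lower())
    return [t for t in tokens if t not in FRENCH_STOPWORDS]


print(tokenize_fr("L'école est située à Palaiseau et dépend du Ministère des Armées."))
# → ['école', 'située', 'palaiseau', 'dépend', 'ministère', 'armées']
```

Accents are kept on purpose here, since they can distinguish French words; stripping them is a separate design decision.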

### Evaluation Criteria
- Quality and relevance of data collection and labeling (10%)
- Quality of data cleaning and preprocessing (10%)
- Relevance of visualizations and exploratory analysis (10%)
- Implementation of models (30%)
- Comparison and discussion of vectorization techniques (10%)
- Recommendations and critical thinking (10%)
- Quality of the report and code (5%)
- Quality of the presentation (5%)
- Q&A (10%)

In [20]:
import numpy as np
import pandas as pd

# Load the rankings data from the CSV file
df = pd.read_csv('data/classements_letudiant.csv')

# Display a summary of the dataset (columns, non-null counts, dtypes)
df.info()

df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10773 entries, 0 to 10772
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   École          10773 non-null  object
 1   Thématique     10773 non-null  object
 2   ID Thématique  10773 non-null  int64 
 3   Critère        10773 non-null  object
 4   Score /10      817 non-null    object
 5   Note brute     10771 non-null  object
dtypes: int64(1), object(5)
memory usage: 505.1+ KB


Unnamed: 0,École,Thématique,ID Thématique,Critère,Score /10,Note brute
0,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Présence sur Parcoursup,,Oui
1,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Villes d'implantation de l'école en France,,Palaiseau
2,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Statut de l'école,,Public
3,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Ministère de tutelle,,Ministère des Armées
4,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Concours,,X-ESPCI
...,...,...,...,...,...,...
10768,Polytech Sorbonne,Professionnalisation et emploi,431,"Autres industries (bois, imprimerie...)",,Non communiqué
10769,Polytech Sorbonne,Professionnalisation et emploi,431,Métiers de l'eau et gestion des déchets (achem...,,Non communiqué
10770,Polytech Sorbonne,Professionnalisation et emploi,431,Commerce,,Non communiqué
10771,Polytech Sorbonne,Professionnalisation et emploi,431,Télécommunications,,Non communiqué


In [21]:
df.columns = (
    df.columns.str.lower()
    .str.strip()
    .str.replace(" /10", "")
    .str.replace(" ", "_")
    .str.replace("é", "e")
    .str.replace("è", "e")
)

df

Unnamed: 0,ecole,thematique,id_thematique,critere,score,note_brute
0,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Présence sur Parcoursup,,Oui
1,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Villes d'implantation de l'école en France,,Palaiseau
2,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Statut de l'école,,Public
3,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Ministère de tutelle,,Ministère des Armées
4,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Concours,,X-ESPCI
...,...,...,...,...,...,...
10768,Polytech Sorbonne,Professionnalisation et emploi,431,"Autres industries (bois, imprimerie...)",,Non communiqué
10769,Polytech Sorbonne,Professionnalisation et emploi,431,Métiers de l'eau et gestion des déchets (achem...,,Non communiqué
10770,Polytech Sorbonne,Professionnalisation et emploi,431,Commerce,,Non communiqué
10771,Polytech Sorbonne,Professionnalisation et emploi,431,Télécommunications,,Non communiqué
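
The cell above replaces only "é" and "è" in the column names. A more general alternative, shown here as a hedged sketch, strips every accent through Unicode decomposition. The helper name `to_snake_ascii` is hypothetical and not part of the notebook.

```python
import re
import unicodedata


def to_snake_ascii(name: str) -> str:
    """Lowercase, strip accents, and replace non-alphanumeric runs with '_'."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


print([to_snake_ascii(c) for c in ["École", "ID Thématique", "Score /10", "Note brute"]])
# → ['ecole', 'id_thematique', 'score_10', 'note_brute']
```

Note that "Score /10" becomes `score_10` rather than `score`, so the " /10" suffix would still have to be removed first to reproduce the notebook's column names.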


In [22]:
# Strip the denominator ("/10", "/2") from scores and convert decimal commas to dots
df["score"] = (
    df["score"]
    .astype(str)
    .str.replace(r"/\d+", "", regex=True)
    .str.replace(",", ".", regex=False)
    .astype(float)
)

df

Unnamed: 0,ecole,thematique,id_thematique,critere,score,note_brute
0,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Présence sur Parcoursup,,Oui
1,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Villes d'implantation de l'école en France,,Palaiseau
2,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Statut de l'école,,Public
3,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Ministère de tutelle,,Ministère des Armées
4,École Polytechnique - Palaiseau,Mieux connaître l'école,425,Concours,,X-ESPCI
...,...,...,...,...,...,...
10768,Polytech Sorbonne,Professionnalisation et emploi,431,"Autres industries (bois, imprimerie...)",,Non communiqué
10769,Polytech Sorbonne,Professionnalisation et emploi,431,Métiers de l'eau et gestion des déchets (achem...,,Non communiqué
10770,Polytech Sorbonne,Professionnalisation et emploi,431,Commerce,,Non communiqué
10771,Polytech Sorbonne,Professionnalisation et emploi,431,Télécommunications,,Non communiqué
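
The effect of the score cleaning can be checked on a few sample values. The strings below are invented to match the formats named in the cell's comment, and `pd.to_numeric` with `errors="coerce"` is used here as an alternative to the `astype(str)` / `astype(float)` round trip.

```python
import pandas as pd

# Invented sample values in the "value/denominator" format with decimal commas.
raw = pd.Series(["8/10", "9,5/10", "1,5/2", None])

parsed = pd.to_numeric(
    raw.str.replace(r"/\d+$", "", regex=True).str.replace(",", ".", regex=False),
    errors="coerce",  # anything that is still not a number becomes NaN
)
print(parsed.tolist())
# → [8.0, 9.5, 1.5, nan]
```

The denominator is simply discarded, so a score out of 2 and a score out of 10 end up on different scales; rescaling them would be a separate step.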


In [23]:
# Pivot from long format (one row per school and criterion) to wide format (one row per school)
df = df.pivot_table(
    index="ecole", columns="critere", values=["score", "note_brute"], aggfunc="first"
).sort_index()

df

Unnamed: 0_level_0,note_brute,note_brute,note_brute,note_brute,note_brute,note_brute,note_brute,note_brute,note_brute,note_brute,...,score,score,score,score,score,score,score,score,score,score
critere,Accord de Grenoble,Activités informatiques et services d'information,"Administration d'Etat, Collectivité territoriale, Hospitalière","Agriculture, sylviculture, pêche","Autres industries (bois, imprimerie...)",Autres secteurs,"BTP, construction",Cellule de soutien psychologique,Commerce,Concours,...,Moyenne au bac des intégrés,Ouverture sociale,Parité au sein de la promotion (Hommes/Femmes),Part d'enseignants-chercheurs,Politique de chaires,Pourcentage d'étudiants internationaux,Pourcentage de double diplômés internationaux,Réputation internationale,Salaire à la sortie,Taux d'alternants
ecole,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
3iL ingénieurs,Non,"12,04%","0,00%","2,78%","0,93%","56,48%","0,93%",Oui,"1,85%","Puissance Alpha, CCINP",...,,,,,,9.5,9.5,,8.0,5.0
AgroParisTech,Oui,"0,46%","14,22%","28,44%","0,46%","8,26%","0,92%",Oui,"2,29%","Agro-véto, Centrale-Supélec",...,8.0,,2.0,9.0,8.0,,,10.0,,
Arts et Métiers Sciences et Technologies,Oui,"6,49%","1,18%","0,88%","7,37%","10,03%","12,68%",Oui,"1,47%","Centrale-Supélec, Concours ATS",...,,6.5,,,,,8.0,7.5,8.0,
Bordeaux Sciences Agro,Oui,"0,00%","6,02%","45,78%","1,20%","22,89%","1,20%",Oui,"2,41%","Agro-véto, CCINP",...,,6.5,2.0,,,,,,,
Builders École d’ingénieurs,Non,"0,00%","0,00%","0,00%","0,00%","0,00%","100,00%",Non,"0,00%","Avenir, Concours ATS",...,,,,,7.0,,,,8.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
École des Mines Paris - PSL,Non,"9,48%","6,90%","0,00%","2,59%","37,07%","0,00%",Oui,"0,00%","Mines-Ponts, Concours ATS",...,10.0,,,10.0,10.0,,7.0,9.0,10.0,
École des mines - Nancy,Non,"24,27%","0,97%","0,97%","0,00%","20,39%","5,83%",Oui,"0,97%",Mines-Ponts,...,10.0,,,9.5,8.0,,9.0,7.0,9.0,
École des mines - Saint-Étienne,Oui,Non communiqué,Non communiqué,Non communiqué,Non communiqué,Non communiqué,Non communiqué,Oui,Non communiqué,"Mines-Télécom, Mines-Ponts",...,10.0,,,9.5,,,8.0,,9.0,
École nationale des ponts et chaussées - Marne-la-vallée,Oui,"4,85%","4,85%","0,44%","1,32%","16,74%","17,18%",Oui,"2,20%",Mines-Ponts,...,10.0,,,9.5,9.0,7.0,,7.5,10.0,


In [24]:
df.columns = [f"{v}_{c}".lower().replace(" ", "_") for v, c in df.columns]

df.reset_index(inplace=True)

df

Unnamed: 0,ecole,note_brute_accord_de_grenoble,note_brute_activités_informatiques_et_services_d'information,"note_brute_administration_d'etat,_collectivité_territoriale,_hospitalière","note_brute_agriculture,_sylviculture,_pêche","note_brute_autres_industries_(bois,_imprimerie...)",note_brute_autres_secteurs,"note_brute_btp,_construction",note_brute_cellule_de_soutien_psychologique,note_brute_commerce,...,score_moyenne_au_bac_des_intégrés,score_ouverture_sociale,score_parité_au_sein_de_la_promotion_(hommes/femmes),score_part_d'enseignants-chercheurs,score_politique_de_chaires,score_pourcentage_d'étudiants_internationaux,score_pourcentage_de_double_diplômés_internationaux,score_réputation_internationale,score_salaire_à_la_sortie,score_taux_d'alternants
0,3iL ingénieurs,Non,"12,04%","0,00%","2,78%","0,93%","56,48%","0,93%",Oui,"1,85%",...,,,,,,9.5,9.5,,8.0,5.0
1,AgroParisTech,Oui,"0,46%","14,22%","28,44%","0,46%","8,26%","0,92%",Oui,"2,29%",...,8.0,,2.0,9.0,8.0,,,10.0,,
2,Arts et Métiers Sciences et Technologies,Oui,"6,49%","1,18%","0,88%","7,37%","10,03%","12,68%",Oui,"1,47%",...,,6.5,,,,,8.0,7.5,8.0,
3,Bordeaux Sciences Agro,Oui,"0,00%","6,02%","45,78%","1,20%","22,89%","1,20%",Oui,"2,41%",...,,6.5,2.0,,,,,,,
4,Builders École d’ingénieurs,Non,"0,00%","0,00%","0,00%","0,00%","0,00%","100,00%",Non,"0,00%",...,,,,,7.0,,,,8.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166,École des Mines Paris - PSL,Non,"9,48%","6,90%","0,00%","2,59%","37,07%","0,00%",Oui,"0,00%",...,10.0,,,10.0,10.0,,7.0,9.0,10.0,
167,École des mines - Nancy,Non,"24,27%","0,97%","0,97%","0,00%","20,39%","5,83%",Oui,"0,97%",...,10.0,,,9.5,8.0,,9.0,7.0,9.0,
168,École des mines - Saint-Étienne,Oui,Non communiqué,Non communiqué,Non communiqué,Non communiqué,Non communiqué,Non communiqué,Oui,Non communiqué,...,10.0,,,9.5,,,8.0,,9.0,
169,École nationale des ponts et chaussées - Marne...,Oui,"4,85%","4,85%","0,44%","1,32%","16,74%","17,18%",Oui,"2,20%",...,10.0,,,9.5,9.0,7.0,,7.5,10.0,
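
The pivot and column-flattening steps are easier to follow on a small example. The frame below is invented (school names "A" and "B", two criteria) and only mirrors the column layout of the real data.

```python
import pandas as pd

# Invented long-format data: one row per (school, criterion) pair.
long_df = pd.DataFrame(
    {
        "ecole": ["A", "A", "B", "B"],
        "critere": ["Statut", "Salaire", "Statut", "Salaire"],
        "score": [None, 8.0, None, 9.0],
        "note_brute": ["Public", "40000", "Privé", "42000"],
    }
)

wide = long_df.pivot_table(
    index="ecole", columns="critere", values=["score", "note_brute"], aggfunc="first"
)

# Flatten the (value, critere) MultiIndex into single-level column names.
wide.columns = [
    f"{value}_{critere}".lower().replace(" ", "_") for value, critere in wide.columns
]
wide = wide.reset_index()

print(wide.columns.tolist())
print(wide)
```

With `aggfunc="first"`, only the first non-null value of each (school, criterion) pair is kept, so duplicate rows in the source are silently ignored rather than reported.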


In [25]:
# Drop columns whose only value is the "Voir plus" placeholder
columns_to_drop = [
    "note_brute_origine_des_intégrés_en_cycle_ingénieur",
    "note_brute_nombre_d'intégrés_issus_de_bac_technologique",
    "note_brute_nombre_d'intégrés_issus_de_bac_général",
]

df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

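# For every note_brute column not listed below:
# - strip "%" signs
# - replace "Non communiqué" with NaN
# - replace "Oui" with True and "Non" with False
# - convert comma-decimal strings to floats
# The excluded columns are either handled separately further down or kept as text.
excluded_columns = [
    "note_brute_concours",
    "note_brute_label_dd&rs",
    "note_brute_ministère_de_tutelle",
    "note_brute_masse_et_encadrement_des_doctorants",
    "note_brute_niveau_d'anglais_exigé",
    "note_brute_ouverture_sociale",
]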

for col in df.columns:
    if "note_brute" in col and col not in excluded_columns:
        df[col] = (
            df[col]
            .replace("%", "", regex=True)
            .replace("Non communiqué", np.nan)
            .replace("Oui", True)
            .replace("Non", False)
            .replace({pd.NA: np.nan, None: np.nan})
        )

        # Convert comma-decimal strings such as "12,04" to floats
        df[col] = df[col].apply(
            lambda x: (
                float(x.replace(",", "."))
                if isinstance(x, str) and x.replace(",", ".").replace(".", "", 1).isdigit()
                else x
            )
        )

        if df[col].dtype == "object":
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass  # Leave columns that still contain text unchanged

# Label DD&RS -> True/False
df["note_brute_label_dd&rs"] = df["note_brute_label_dd&rs"].replace(
    {"Label DD&RS": True, "Pas de label": False}
)

# note_brute_masse_et_encadrement_des_doctorants & note_brute_ouverture_sociale -> Numerical values
level_mapping = {
    "Aucun": 0,
    "Faible": 1,
    "Correcte": 2,
    "Moyenne": 3,
    "Importante": 4,
    "Très importante": 5,
}

df["note_brute_masse_et_encadrement_des_doctorants"] = df[
    "note_brute_masse_et_encadrement_des_doctorants"
].map(level_mapping)

level_mapping = {"Faible": 1, "Passable": 2, "Correcte": 3, "Bonne": 4, "Excellente": 5}

df["note_brute_ouverture_sociale"] = df["note_brute_ouverture_sociale"].map(level_mapping)

# Gender parity: split the "men|women" shares into two numeric columns
df["note_brute_parité_au_sein_de_la_promotion_(hommes/femmes)"] = (
    df["note_brute_parité_au_sein_de_la_promotion_(hommes/femmes)"]
    .astype(str)
    .str.replace(",", ".", regex=False)
)
df[["note_brute_part_etudiant_hommes", "note_brute_part_etudiant_femmes"]] = (
    df["note_brute_parité_au_sein_de_la_promotion_(hommes/femmes)"]
    .str.split("|", expand=True)
    .apply(pd.to_numeric, errors="coerce")  # unparseable parts become NaN
)

df

Unnamed: 0,ecole,note_brute_accord_de_grenoble,note_brute_activités_informatiques_et_services_d'information,"note_brute_administration_d'etat,_collectivité_territoriale,_hospitalière","note_brute_agriculture,_sylviculture,_pêche","note_brute_autres_industries_(bois,_imprimerie...)",note_brute_autres_secteurs,"note_brute_btp,_construction",note_brute_cellule_de_soutien_psychologique,note_brute_commerce,...,score_parité_au_sein_de_la_promotion_(hommes/femmes),score_part_d'enseignants-chercheurs,score_politique_de_chaires,score_pourcentage_d'étudiants_internationaux,score_pourcentage_de_double_diplômés_internationaux,score_réputation_internationale,score_salaire_à_la_sortie,score_taux_d'alternants,note_brute_part_etudiant_hommes,note_brute_part_etudiant_femmes
0,3iL ingénieurs,False,12.04,0.00,2.78,0.93,56.48,0.93,True,1.85,...,,,,9.5,9.5,,8.0,5.0,84.24,15.76
1,AgroParisTech,True,0.46,14.22,28.44,0.46,8.26,0.92,True,2.29,...,2.0,9.0,8.0,,,10.0,,,36.45,63.55
2,Arts et Métiers Sciences et Technologies,True,6.49,1.18,0.88,7.37,10.03,12.68,True,1.47,...,,,,,8.0,7.5,8.0,,83.14,16.86
3,Bordeaux Sciences Agro,True,0.00,6.02,45.78,1.20,22.89,1.20,True,2.41,...,2.0,,,,,,,,40.87,59.13
4,Builders École d’ingénieurs,False,0.00,0.00,0.00,0.00,0.00,100.00,False,0.00,...,,,7.0,,,,8.0,,71.32,28.68
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166,École des Mines Paris - PSL,False,9.48,6.90,0.00,2.59,37.07,0.00,True,0.00,...,,10.0,10.0,,7.0,9.0,10.0,,73.78,26.22
167,École des mines - Nancy,False,24.27,0.97,0.97,0.00,20.39,5.83,True,0.97,...,,9.5,8.0,,9.0,7.0,9.0,,76.50,23.50
168,École des mines - Saint-Étienne,True,,,,,,,True,,...,,9.5,,,8.0,,9.0,,78.64,21.36
169,École nationale des ponts et chaussées - Marne...,True,4.85,4.85,0.44,1.32,16.74,17.18,True,2.20,...,,9.5,9.0,7.0,,7.5,10.0,,73.78,26.22
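
The chained `replace` calls in the cleaning cell can also be written as a single per-cell function, which avoids relying on implicit dtype changes. The sketch below is an alternative under stated assumptions, not the notebook's method: the helper name `clean_raw_value` is hypothetical, and it only strips a trailing "%" rather than every "%" in the string.

```python
import numpy as np
import pandas as pd


def clean_raw_value(value):
    """Convert one raw cell.

    'Oui'/'Non' -> bool, 'Non communiqué' -> NaN, '12,04%' -> 12.04;
    any other value is returned unchanged.
    """
    if not isinstance(value, str):
        return value
    text = value.strip()
    if text == "Non communiqué":
        return np.nan
    if text in ("Oui", "Non"):
        return text == "Oui"
    try:
        return float(text.rstrip("%").replace(",", "."))
    except ValueError:
        return text  # free text such as a competition name stays as-is


# Sample values taken from the formats visible in the tables above.
sample = pd.Series(["12,04%", "Oui", "Non", "Non communiqué", "Mines-Ponts"])
cleaned = sample.map(clean_raw_value)
print(cleaned.tolist())
# → [12.04, True, False, nan, 'Mines-Ponts']
```

It could be applied column by column with `df[col].map(clean_raw_value)` for the columns that are not in `excluded_columns`.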
