# ST2MLE: Machine Learning for IT Engineers Project
## Machine Learning Project – Numerical and Textual Data (French Context)
### Context
As part of this project, students will work on mixed data (numerical and textual) collected from French websites.
The objective is to carry out a comprehensive analysis, from data collection to modeling
and interpretation, with a focus on a French economic, social, or public context.

### Learning Objectives
- Master the full lifecycle of a data project (collection, cleaning, preprocessing, modeling, evaluation).
- Apply techniques for text processing and numerical data analysis.
- Explore various text vectorization techniques (BoW, TF-IDF, Doc2Vec, BERT).
- Conduct analyses and provide recommendations based on real French data.

### Project Steps
1. Define a topic and its requirements, and identify relevant French sources.
2. Collect data (web scraping, APIs...).
3. Clean and preprocess both numerical and textual data.
4. Annotate (label) the data. Some websites already include categories or tags; these can be scraped alongside the text and used as labels. Otherwise, label the data manually.
5. Perform exploratory analysis and visualizations (distributions, word clouds, correlations...) to check for outliers, class imbalance, etc.
6. If needed, apply under-sampling or over-sampling to correct class imbalance, and PCA for feature extraction.
7. Apply predictive models:
  - **Numerical data**: Decision Trees, Random Forest, Boosting.
  - **Textual data**: Naive Bayes, Logistic Regression after vectorization.
8. Compare text vectorization methods: BoW, TF-IDF, Doc2Vec, BERT.
9. Provide business recommendations and submit a final report.
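
Steps 7 and 8 depend on turning text into numbers. As a minimal sketch of how BoW and TF-IDF differ, the pure-Python example below computes both on a toy corpus. The three sentences are invented for illustration, the names `bow`, `tfidf`, and `doc_freq` are hypothetical, and the formula is the unsmoothed textbook variant tf × ln(N / df). Libraries such as scikit-learn apply smoothing and normalization by default, so their values will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not project data).
corpus = [
    "école publique ingénieur paris",
    "école privée commerce paris",
    "école publique agronomie bordeaux",
]
docs = [text.split() for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def bow(doc):
    """Bag of words: raw term counts for one document."""
    return Counter(doc)


def tfidf(doc):
    """Textbook TF-IDF: relative term frequency times ln(N / document frequency)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


counts = bow(docs[0])
weights = tfidf(docs[0])
for term in docs[0]:
    print(f"{term:10s} bow={counts[term]} tfidf={weights[term]:.3f}")
```

In BoW every term of the first document has the same count, whereas TF-IDF gives zero weight to a term that occurs in every document and the highest weight to the term unique to that document.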

### Technical Constraints
- Data must be exclusively from French sources.
- Texts must be in French only (use appropriate preprocessing: French lemmatization, French stopwords).
- Minimum of 500 data rows.
- Implementation in Python using scikit-learn, gensim, transformers, etc.
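
The French-only constraint mainly affects tokenization and stopword removal. The sketch below shows one possible approach using only the standard library. The stopword set is a tiny hand-picked sample and the `preprocess` helper is hypothetical; a real project would use a complete French stopword list. Lemmatization is not shown because it requires a dedicated French language resource.

```python
import re

# Tiny hand-picked sample for illustration only; NOT a complete French stopword list.
FRENCH_STOPWORDS = {
    "le", "la", "les", "de", "des", "du", "un", "une",
    "et", "à", "en", "l", "d",
}

# Matches runs of letters, including accented ones; apostrophes split tokens.
LETTERS_RE = re.compile(r"[^\W\d_]+")


def preprocess(text):
    """Lowercase, tokenize on letter runs, and drop the sample stopwords."""
    tokens = LETTERS_RE.findall(text.lower())
    return [token for token in tokens if token not in FRENCH_STOPWORDS]


tokens = preprocess("Villes d'implantation de l'école en France")
print(tokens)
```

Splitting on the apostrophe turns elided forms such as `d'` and `l'` into the single letters `d` and `l`, which is why both are included in the sample stopword set.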

### Evaluation Criteria
- Quality and relevance of data collection and labeling (10%)
- Quality of data cleaning and preprocessing (10%)
- Relevance of visualizations and exploratory analysis (10%)
- Implementation of models (30%)
- Comparison and discussion of vectorization techniques (10%)
- Recommendations and critical thinking (10%)
- Quality of the report and code (5%)
- Quality of the presentation (5%)
- Q&A (10%)

In [40]:
# Import data from CSV file

import pandas as pd

# Load the CSV file
df = pd.read_csv('data/classements_letudiant.csv')

# Display a summary of the dataset (columns, non-null counts, dtypes)
df.info()

df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10773 entries, 0 to 10772
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   École          10773 non-null  object
 1   Thématique     10773 non-null  object
 2   ID Thématique  10773 non-null  int64 
 3   Critère        10773 non-null  object
 4   Score /10      817 non-null    object
 5   Note brute     10771 non-null  object
dtypes: int64(1), object(5)
memory usage: 505.1+ KB


| | École | Thématique | ID Thématique | Critère | Score /10 | Note brute |
|---|---|---|---|---|---|---|
| 0 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Présence sur Parcoursup | | Oui |
| 1 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Villes d'implantation de l'école en France | | Palaiseau |
| 2 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Statut de l'école | | Public |
| 3 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Ministère de tutelle | | Ministère des Armées |
| 4 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Concours | | X-ESPCI |
| ... | ... | ... | ... | ... | ... | ... |
| 10768 | Polytech Sorbonne | Professionnalisation et emploi | 431 | Autres industries (bois, imprimerie...) | | Non communiqué |
| 10769 | Polytech Sorbonne | Professionnalisation et emploi | 431 | Métiers de l'eau et gestion des déchets (achem... | | Non communiqué |
| 10770 | Polytech Sorbonne | Professionnalisation et emploi | 431 | Commerce | | Non communiqué |
| 10771 | Polytech Sorbonne | Professionnalisation et emploi | 431 | Télécommunications | | Non communiqué |


In [41]:
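# Normalize column names: lowercase, trim, underscores for spaces, no accents.
# Note: "Score /10" becomes "score_score" because " /10" is replaced by "_score".
df.columns = (
    df.columns.str.lower()
    .str.strip()
    .str.replace(" /10", "_score", regex=False)
    .str.replace(" ", "_", regex=False)
    .str.replace("é", "e", regex=False)
    .str.replace("è", "e", regex=False)
)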

df

| | ecole | thematique | id_thematique | critere | score_score | note_brute |
|---|---|---|---|---|---|---|
| 0 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Présence sur Parcoursup | | Oui |
| 1 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Villes d'implantation de l'école en France | | Palaiseau |
| 2 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Statut de l'école | | Public |
| 3 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Ministère de tutelle | | Ministère des Armées |
| 4 | École Polytechnique - Palaiseau | Mieux connaître l'école | 425 | Concours | | X-ESPCI |
| ... | ... | ... | ... | ... | ... | ... |
| 10768 | Polytech Sorbonne | Professionnalisation et emploi | 431 | Autres industries (bois, imprimerie...) | | Non communiqué |
| 10769 | Polytech Sorbonne | Professionnalisation et emploi | 431 | Métiers de l'eau et gestion des déchets (achem... | | Non communiqué |
| 10770 | Polytech Sorbonne | Professionnalisation et emploi | 431 | Commerce | | Non communiqué |
| 10771 | Polytech Sorbonne | Professionnalisation et emploi | 431 | Télécommunications | | Non communiqué |
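
The chained replacements in the cell above only handle `é` and `è`. As a sketch of a more general alternative, the hypothetical helper `normalize_column` below strips any accent using Unicode decomposition from the standard library. It is an illustration, not the notebook's method, and it leaves the `/10` suffix untouched, so that column would still need its own rule.

```python
import unicodedata


def normalize_column(name):
    """Lowercase, trim, strip accents, and replace spaces with underscores."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    # Dropping the combining marks leaves only the base letters.
    without_accents = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    return without_accents.replace(" ", "_")


raw_columns = ["École", "Thématique", "ID Thématique", "Critère", "Note brute"]
clean_columns = [normalize_column(column) for column in raw_columns]
print(clean_columns)
```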


In [42]:
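# Reshape from long to wide: one row per school, one column per (value, criterion) pair.
# aggfunc="first" keeps the first non-null value when a (school, criterion) pair repeats.
df = df.pivot_table(
    index="ecole",
    columns="critere",
    values=["score_score", "note_brute"],
    aggfunc="first",
).sort_index()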

df



| ecole | note_brute: Accord de Grenoble | note_brute: Activités informatiques et services d'information | note_brute: Administration d'Etat, Collectivité territoriale, Hospitalière | note_brute: Agriculture, sylviculture, pêche | note_brute: Autres industries (bois, imprimerie...) | note_brute: Autres secteurs | note_brute: BTP, construction | note_brute: Cellule de soutien psychologique | note_brute: Commerce | note_brute: Concours | ... | score_score: Moyenne au bac des intégrés | score_score: Ouverture sociale | score_score: Parité au sein de la promotion (Hommes/Femmes) | score_score: Part d'enseignants-chercheurs | score_score: Politique de chaires | score_score: Pourcentage d'étudiants internationaux | score_score: Pourcentage de double diplômés internationaux | score_score: Réputation internationale | score_score: Salaire à la sortie | score_score: Taux d'alternants |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3iL ingénieurs | Non | 12,04% | 0,00% | 2,78% | 0,93% | 56,48% | 0,93% | Oui | 1,85% | Puissance Alpha, CCINP | ... | | | | | | 9,5/10 | 9,5/10 | | 8,0/10 | 5,0/5 |
| AgroParisTech | Oui | 0,46% | 14,22% | 28,44% | 0,46% | 8,26% | 0,92% | Oui | 2,29% | Agro-véto, Centrale-Supélec | ... | 8,0/10 | | 2,0/2 | 9,0/10 | 8,0/10 | | | 10,0/10 | | |
| Arts et Métiers Sciences et Technologies | Oui | 6,49% | 1,18% | 0,88% | 7,37% | 10,03% | 12,68% | Oui | 1,47% | Centrale-Supélec, Concours ATS | ... | | 6,5/10 | | | | | 8,0/10 | 7,5/10 | 8,0/10 | |
| Bordeaux Sciences Agro | Oui | 0,00% | 6,02% | 45,78% | 1,20% | 22,89% | 1,20% | Oui | 2,41% | Agro-véto, CCINP | ... | | 6,5/10 | 2,0/2 | | | | | | | |
| Builders École d’ingénieurs | Non | 0,00% | 0,00% | 0,00% | 0,00% | 0,00% | 100,00% | Non | 0,00% | Avenir, Concours ATS | ... | | | | | 7,0/10 | | | | 8,0/10 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| École des Mines Paris - PSL | Non | 9,48% | 6,90% | 0,00% | 2,59% | 37,07% | 0,00% | Oui | 0,00% | Mines-Ponts, Concours ATS | ... | 10,0/10 | | | 10,0/10 | 10,0/10 | | 7,0/10 | 9,0/10 | 10,0/10 | |
| École des mines - Nancy | Non | 24,27% | 0,97% | 0,97% | 0,00% | 20,39% | 5,83% | Oui | 0,97% | Mines-Ponts | ... | 10,0/10 | | | 9,5/10 | 8,0/10 | | 9,0/10 | 7,0/10 | 9,0/10 | |
| École des mines - Saint-Étienne | Oui | Non communiqué | Non communiqué | Non communiqué | Non communiqué | Non communiqué | Non communiqué | Oui | Non communiqué | Mines-Télécom, Mines-Ponts | ... | 10,0/10 | | | 9,5/10 | | | 8,0/10 | | 9,0/10 | |
| École nationale des ponts et chaussées - Marne-la-vallée | Oui | 4,85% | 4,85% | 0,44% | 1,32% | 16,74% | 17,18% | Oui | 2,20% | Mines-Ponts | ... | 10,0/10 | | | 9,5/10 | 9,0/10 | 7,0/10 | | 7,5/10 | 10,0/10 | |
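
To make the reshaping step easier to follow, here is a small sketch of the same `pivot_table` call on toy data. The schools `A` and `B` and their values are invented for illustration; only the column names mirror the notebook. School `A` has two rows for `Commerce`, which shows that `aggfunc="first"` keeps the first one.

```python
import pandas as pd

# Toy long-format data, invented for illustration (not the real dataset).
long_df = pd.DataFrame(
    {
        "ecole": ["A", "A", "A", "B"],
        "critere": ["Commerce", "Commerce", "Concours", "Commerce"],
        "note_brute": ["1,85%", "9,99%", "CCINP", "2,29%"],
    }
)

# One row per school, one column per (value, criterion) pair.
wide = long_df.pivot_table(
    index="ecole",
    columns="critere",
    values=["note_brute"],
    aggfunc="first",
)

first_value = wide.loc["A", ("note_brute", "Commerce")]
missing_value = wide.loc["B", ("note_brute", "Concours")]
print(wide)
```

School `B` has no `Concours` row, so that cell is left missing in the wide table. This is one reason the pivoted dataset contains so many empty cells.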


In [43]:
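# Flatten the two-level column index into single names such as "note_brute_commerce".
df.columns = [
    f"{v}_{c}".lower().replace(" ", "_")
    for v, c in df.columns
]

# Turn the "ecole" index back into a regular column.
df = df.reset_index()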

df

| | ecole | note_brute_accord_de_grenoble | note_brute_activités_informatiques_et_services_d'information | note_brute_administration_d'etat,_collectivité_territoriale,_hospitalière | note_brute_agriculture,_sylviculture,_pêche | note_brute_autres_industries_(bois,_imprimerie...) | note_brute_autres_secteurs | note_brute_btp,_construction | note_brute_cellule_de_soutien_psychologique | note_brute_commerce | ... | score_score_moyenne_au_bac_des_intégrés | score_score_ouverture_sociale | score_score_parité_au_sein_de_la_promotion_(hommes/femmes) | score_score_part_d'enseignants-chercheurs | score_score_politique_de_chaires | score_score_pourcentage_d'étudiants_internationaux | score_score_pourcentage_de_double_diplômés_internationaux | score_score_réputation_internationale | score_score_salaire_à_la_sortie | score_score_taux_d'alternants |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3iL ingénieurs | Non | 12,04% | 0,00% | 2,78% | 0,93% | 56,48% | 0,93% | Oui | 1,85% | ... | | | | | | 9,5/10 | 9,5/10 | | 8,0/10 | 5,0/5 |
| 1 | AgroParisTech | Oui | 0,46% | 14,22% | 28,44% | 0,46% | 8,26% | 0,92% | Oui | 2,29% | ... | 8,0/10 | | 2,0/2 | 9,0/10 | 8,0/10 | | | 10,0/10 | | |
| 2 | Arts et Métiers Sciences et Technologies | Oui | 6,49% | 1,18% | 0,88% | 7,37% | 10,03% | 12,68% | Oui | 1,47% | ... | | 6,5/10 | | | | | 8,0/10 | 7,5/10 | 8,0/10 | |
| 3 | Bordeaux Sciences Agro | Oui | 0,00% | 6,02% | 45,78% | 1,20% | 22,89% | 1,20% | Oui | 2,41% | ... | | 6,5/10 | 2,0/2 | | | | | | | |
| 4 | Builders École d’ingénieurs | Non | 0,00% | 0,00% | 0,00% | 0,00% | 0,00% | 100,00% | Non | 0,00% | ... | | | | | 7,0/10 | | | | 8,0/10 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 166 | École des Mines Paris - PSL | Non | 9,48% | 6,90% | 0,00% | 2,59% | 37,07% | 0,00% | Oui | 0,00% | ... | 10,0/10 | | | 10,0/10 | 10,0/10 | | 7,0/10 | 9,0/10 | 10,0/10 | |
| 167 | École des mines - Nancy | Non | 24,27% | 0,97% | 0,97% | 0,00% | 20,39% | 5,83% | Oui | 0,97% | ... | 10,0/10 | | | 9,5/10 | 8,0/10 | | 9,0/10 | 7,0/10 | 9,0/10 | |
| 168 | École des mines - Saint-Étienne | Oui | Non communiqué | Non communiqué | Non communiqué | Non communiqué | Non communiqué | Non communiqué | Oui | Non communiqué | ... | 10,0/10 | | | 9,5/10 | | | 8,0/10 | | 9,0/10 | |
| 169 | École nationale des ponts et chaussées - Marne... | Oui | 4,85% | 4,85% | 0,44% | 1,32% | 16,74% | 17,18% | Oui | 2,20% | ... | 10,0/10 | | | 9,5/10 | 9,0/10 | 7,0/10 | | 7,5/10 | 10,0/10 | |
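
The values in the wide table are still strings: percentages with a decimal comma (`12,04%`), scores with varying denominators (`9,5/10`, `5,0/5`, `2,0/2`), `Oui`/`Non` flags, and `Non communiqué` for missing data. A possible next step, sketched here with hypothetical helper names, is to parse them into numbers before modeling. Dividing each score by its denominator is an assumption made here so that scores on different scales become comparable.

```python
import re

# Markers treated as missing; "Non communiqué" appears in the data shown above.
MISSING = {"", "Non communiqué"}
SCORE_RE = re.compile(r"(\d+(?:,\d+)?)/(\d+(?:,\d+)?)")


def _to_float(text):
    """Convert a French-formatted number such as '12,04' to a float."""
    return float(text.replace(",", "."))


def parse_percent(value):
    """'12,04%' -> 12.04; missing markers and non-strings (e.g. NaN) -> None."""
    if not isinstance(value, str) or value.strip() in MISSING:
        return None
    return _to_float(value.strip().rstrip("%"))


def parse_score(value):
    """'9,5/10' -> 0.95, i.e. the score as a fraction of its maximum."""
    if not isinstance(value, str) or value.strip() in MISSING:
        return None
    match = SCORE_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"Unexpected score format: {value!r}")
    obtained, maximum = (_to_float(part) for part in match.groups())
    return obtained / maximum


def parse_yes_no(value):
    """'Oui' -> True, 'Non' -> False, anything else -> None."""
    return {"Oui": True, "Non": False}.get(value)


print(parse_percent("12,04%"), parse_score("9,5/10"), parse_yes_no("Oui"))
```

With pandas, each helper could be applied column by column, for example with `Series.map`, once the columns have been grouped by type.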
