#### Standardization

Now that we have cleaned the data, settled on the problem statement and the target variable, and determined the preprocessing to apply to the numerical, language, and genre variables, we have three files at our disposal:
* _9000plus_genres.csv_, which contains the binarized genres of the films,
* _9000plus_lg_reduced.csv_, which contains the binarized original languages of the films,
* _9000plus_nums.csv_, which contains the numerical variables of the films, to which we will now apply the corresponding transformations.

In [2]:
import pandas as pd

# Load the sub-datasets
df_genres = pd.read_csv("9000plus_genres.csv")
df_languages = pd.read_csv("9000plus_lg_reduced.csv")
df_nums = pd.read_csv("9000plus_nums.csv")

# Drop the leftover index columns
df_genres = df_genres.drop(columns=["Unnamed: 0"])
df_languages = df_languages.drop(columns=["Unnamed: 0"])
df_nums = df_nums.drop(columns=["Unnamed: 0"])

# Inspect df_nums (for the preprocessing below)
df_nums

Unnamed: 0,Popularity,Vote_Count,Vote_Average,Release_Year,Release_Month,Release_Day,Release_Weekday
0,5083.954,8940,8.3,2021,12,15,3
1,3827.658,1151,8.1,2022,3,1,2
2,2618.087,122,6.3,2022,2,25,5
3,2402.201,5076,7.7,2021,11,24,3
4,1895.511,1793,7.0,2021,12,22,3
...,...,...,...,...,...,...,...
9821,13.357,896,7.6,1973,10,15,1
9822,13.356,8,3.5,2020,10,1,4
9823,13.355,94,5.0,2016,5,6,5
9824,13.354,152,6.7,2021,3,31,3


Let us recap the preprocessing to apply to the numerical variables (excluding the target variable _Vote_Average_):
* for the temporal columns (except _Release_Year_, which is not periodic), we use **x:(sin(x)+1)/2** to map them into the interval \[0;1\]. Note that sin has period 2π, which does not match the natural periods of these columns (12 months, 7 weekdays), so their cyclic structure is only loosely preserved. Likewise, **y:arcsin(2y-1)** recovers x only for x in \[-π/2;π/2\], so it is not an inverse over the full range of values.
* for the other two (_Popularity_ and _Vote_Count_), we apply **x:log(x+1)** to reduce their skewness (the +1 avoids infinite values at x = 0) before standardizing them (inverse function: **y:exp(y)-1**).
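
A common alternative for periodic features, sketched here as an assumption and not as the transformation used in this notebook, is to divide by the period before taking the sine and cosine, so that the end of a cycle lands next to its start. The helper `cyclical_encode` and the toy months are hypothetical.

```python
import numpy as np
import pandas as pd

def cyclical_encode(values, period):
    """Map a periodic quantity onto the unit circle, rescaled to [0, 1]."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return (np.sin(angle) + 1) / 2, (np.cos(angle) + 1) / 2

# Toy months (1-12): January, June, December
months = pd.Series([1, 6, 12])
month_sin, month_cos = cyclical_encode(months, period=12)

# Compare distances between the encoded points
encoded = np.column_stack([month_sin, month_cos])
dist_dec_jan = np.linalg.norm(encoded[2] - encoded[0])
dist_jun_jan = np.linalg.norm(encoded[1] - encoded[0])

# December ends up closer to January than June does
print(dist_dec_jan < dist_jun_jan)
```

Using both the sine and the cosine component gives each position in the cycle a unique pair of values, which a single sine column cannot do.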

In [3]:
import numpy as np

### Transform the temporal variables (except "Release_Year")

# Transformation: map x to (sin(x) + 1) / 2, which lies in [0, 1]
def temp_std(x):
    return (np.sin(x) + 1) / 2

# Apply the transformation
df_nums["Release_Month"] = df_nums["Release_Month"].apply(temp_std)
df_nums["Release_Day"] = df_nums["Release_Day"].apply(temp_std)
df_nums["Release_Weekday"] = df_nums["Release_Weekday"].apply(temp_std)

### Transform "Popularity" and "Vote_Count"

# Apply the logarithm (np.log1p(x) computes log(x + 1))
df_nums["Popularity"] = np.log1p(df_nums["Popularity"])
df_nums["Vote_Count"] = np.log1p(df_nums["Vote_Count"])

df_nums

Unnamed: 0,Popularity,Vote_Count,Vote_Average,Release_Year,Release_Month,Release_Day,Release_Weekday
0,8.534041,9.098403,8.3,2021,0.231714,0.825144,0.570560
1,8.250270,7.049255,8.1,2022,0.570560,0.920735,0.954649
2,7.870581,4.812184,6.3,2022,0.954649,0.433824,0.020538
3,7.784557,8.532476,7.7,2021,0.000005,0.047211,0.570560
4,7.547771,7.492203,7.0,2021,0.231714,0.495574,0.570560
...,...,...,...,...,...,...,...
9821,2.664238,6.799056,7.6,1973,0.227989,0.825144,0.920735
9822,2.664168,2.197225,3.5,2020,0.227989,0.920735,0.121599
9823,2.664098,4.553877,5.0,2016,0.020538,0.360292,0.020538
9824,2.664029,5.030438,6.7,2021,0.570560,0.297981,0.570560
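
As a quick sanity check on the logarithmic step, the sketch below uses a small illustrative series (the values are chosen for the example, not a sample of the dataset) to compare skewness before and after the transformation and to confirm that `np.expm1` undoes `np.log1p`.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed counts, including a zero
votes = pd.Series([0, 8, 94, 152, 896, 8940], dtype=float)

log_votes = np.log1p(votes)      # same as np.log(votes + 1), finite at 0
restored = np.expm1(log_votes)   # inverse: exp(y) - 1

# Skewness before and after the transformation
print(votes.skew(), log_votes.skew())

# The round trip recovers the original values
print(np.allclose(restored, votes))
```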


Before standardizing the numerical columns, we must first split the data into a training set and a test set, so that the scaling statistics are computed on the training set only.

We take this opportunity to rebuild the full dataset by merging the DataFrames of:
* the numerical variables,
* the language variables,
* and the genre variables.

The test set is 15% of the total data.
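
The order of operations (split first, then scale) can be illustrated on toy data. This is a minimal sketch under assumptions: the feature values are randomly generated, and `random_state` is added here only to make the example reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(loc=5.0, scale=2.0, size=200)})

# random_state is an assumption added for reproducibility
X_tr, X_te = train_test_split(X_toy, test_size=0.15, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # statistics learned on train only
X_te_scaled = scaler.transform(X_te)       # same statistics reused on test

# Train is centred and scaled exactly; test only approximately
print(X_tr_scaled.mean(), X_tr_scaled.std())
print(X_te_scaled.mean(), X_te_scaled.std())
```

Because the test set never contributes to the mean and standard deviation, its scaled values are close to, but not exactly, zero mean and unit variance.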

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Build the final DataFrame
df_final = pd.concat([df_nums,
                      df_genres,
                      df_languages], axis=1)

# Separate X (predictors) and y (target variable)
X = df_final.drop(columns=["Vote_Average"])
y = df_final["Vote_Average"]

# Train/test split (no random_state is set, so the split changes on every run)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

# Work on explicit copies to avoid a possible SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Standardize: fit on the training set only, then reuse the same statistics on the test set
cols_to_scale = ["Vote_Count", "Popularity", "Release_Year"]
scaler = StandardScaler()
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])

X_train

Unnamed: 0,Popularity,Vote_Count,Release_Year,Release_Month,Release_Day,Release_Weekday,Action,Adventure,Animation,Comedy,...,de,en,es,fr,it,ja,ko,ru,zh,Other_languages
8991,-0.900585,0.115241,0.756302,0.828493,0.956473,0.020538,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
9567,-0.959126,-0.334707,-0.330976,0.570560,0.994679,0.121599,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
5416,-0.426763,-0.184609,0.884217,0.954649,0.168183,0.360292,0,0,1,1,...,0,1,0,0,0,0,0,0,0,0
2392,0.430960,1.673544,0.692345,0.000005,0.570560,0.020538,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
9061,-0.906447,-1.001075,-0.203061,0.231714,0.956473,0.360292,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3423,0.042304,-1.388523,0.692345,0.706059,0.706059,0.360292,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3196,0.116950,-1.195768,0.756302,0.227989,0.920735,0.920735,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
3292,0.084625,0.992792,0.820260,0.227989,0.000005,0.020538,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3344,0.065048,-0.820088,0.884217,0.000005,0.956473,0.020538,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0


Finally, we save the training and test sets so they can be used in the rest of the project.

In [7]:
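# Save the training and test sets
# reset_index() moves the original row labels into an "index" column;
# to_csv() then also writes the new positional index as an unnamed first column
X_train.reset_index().to_csv("X_train.csv")
X_test.reset_index().to_csv("X_test.csv")
y_train.reset_index().to_csv("y_train.csv")
y_test.reset_index().to_csv("y_test.csv")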
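
When these files are read back later, the saved layout matters. The sketch below reproduces the same `reset_index().to_csv(...)` pattern on a toy frame written to a temporary directory (the file name `toy_train.csv` and the values are hypothetical) and shows one possible way to restore the original row labels.

```python
import os
import tempfile
import pandas as pd

# Toy frame with non-contiguous row labels, as after a shuffled split
toy = pd.DataFrame({"Popularity": [0.1, -0.3]}, index=[8991, 9567])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_train.csv")   # hypothetical file name
    toy.reset_index().to_csv(path)              # same pattern as the save cell

    # Plain read: the positional index comes back as an extra column
    raw = pd.read_csv(path)

    # Use the first column as index, then restore the original labels
    clean = pd.read_csv(path, index_col=0).set_index("index")

print(list(raw.columns))
print(clean.index.tolist())
```

Reading with `index_col=0` avoids having to drop an `Unnamed: 0` column after loading.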

In [9]:
y_test

9679    6.2
8200    7.1
525     4.1
833     7.0
7414    7.7
       ... 
1591    5.4
85      6.5
6866    8.4
9612    7.8
8137    0.0
Name: Vote_Average, Length: 1474, dtype: float64