# The Porto Seguro Kaggle challenge

## 1. Data Description

In this competition, you will predict the probability that an auto insurance policy holder files a claim.

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., `ind`, `reg`, `car`, `calc`). In addition, feature names include the postfix `bin` to indicate binary features and `cat` to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The `target` column signifies whether or not a claim was filed for that policy holder.

## 2. File descriptions

- `train.csv` contains the training data, where each row corresponds to a policy holder, and the `target` column signifies whether a claim was filed.
- `test.csv` contains the test data.

## 3. Aim

- Build a classifier on the training dataset that yields good ROC and precision/recall curves on the testing set
- The notebook should describe your steps, explain what you do, and run from start to finish without errors. It should include some descriptive statistics and a quick exploratory study of the data.
- It must end with plots of the ROC and precision/recall curves obtained on the testing dataset
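
As a rough illustration of the last two points, the sketch below computes the points of both curves with scikit-learn. The synthetic dataset from `make_classification`, the logistic regression model, and all variable names are assumptions made for the example; they are not the challenge data or a recommended model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the insurance data (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

# Any classifier exposing predict_proba would do here.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Points of the two curves, ready to be passed to plt.plot.
fpr, tpr, _ = roc_curve(y_te, scores)
precision, recall, _ = precision_recall_curve(y_te, scores)

auc = roc_auc_score(y_te, scores)
ap = average_precision_score(y_te, scores)
print("ROC AUC:", auc)
print("Average precision:", ap)
```

Plotting `tpr` against `fpr` and `precision` against `recall` with `plt.plot` then gives the two required figures.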

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from collections import Counter
from sklearn.impute import SimpleImputer

# Use the path to the folder containing your data files

# Kenny's path
path = ''

# Mickaël's path
#path = '/home/chopin/Bureau/M2MOdata/machine_learning/tp2challenge'

df = pd.read_csv(os.path.join(path, 'train.csv'))
df.head(5)

In [None]:
df.describe()

In [None]:
lignes = df.shape[0]
colonnes = df.shape[1]
print("The training dataset contains {0} rows and {1} columns".format(lignes, colonnes))

# 1. Exploratory data analysis

## 1) Study of the raw data

### Missing data

In [None]:
#df.isnull()
Nombre_de_donnees_manquantes=df.isna().sum()
Nombre_de_donnees_manquantes

No missing values are detected here, so the only missing data are those encoded with the value -1. We therefore locate them by replacing every -1 with NaN and then testing with `isna()`.

In [None]:
donnees=df.replace(-1, np.nan)  # np.nan: the np.NaN alias was removed in NumPy 2.0

In [None]:
Nombre_de_donnees_manquantes=donnees.isna().sum()
Nombre_de_donnees_manquantes
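
To make the effect of the sentinel value concrete, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are invented for the illustration; only the column names mimic the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame where -1 plays the role of "missing" (values are made up).
toy = pd.DataFrame({
    "ps_car_11": [1, -1, 3, 2],
    "ps_ind_01": [0, 5, 2, 1],
    "ps_reg_03": [-1.0, 0.7, -1.0, 0.4],
})

# Before the replacement, pandas sees no missing value at all.
n_before = int(toy.isna().sum().sum())

# After the replacement, isna() finds the former -1 entries.
toy_nan = toy.replace(-1, np.nan)
missing_counts = toy_nan.isna().sum()

print(n_before)
print(missing_counts)
```

Note that an integer column that receives NaN is converted to float, which matters later when columns are selected by dtype.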

Let us list the features that have missing values:

In [None]:
val_manquantes=donnees.columns[donnees.isna().any()].tolist()
val_manquantes

Let us visualize the missing data:

In [None]:
import missingno as msno
msno.matrix(donnees[val_manquantes],width_ratios=(10,1),figsize=(20,10),color=(0.3,0.4,0.5),fontsize=18,\
            sparkline=True,labels=True)

Let us compute the percentage of missing values per feature:

In [None]:
donnees_copy = (Nombre_de_donnees_manquantes / len(donnees)) * 100 
donnees_copy = donnees_copy.drop(donnees_copy[donnees_copy == 0].index).sort_values(ascending=False)[:30]
# TODO: add a column with the number of NaN values using pd.concat
manquantes = pd.DataFrame({'Missing data in %' :donnees_copy})
manquantes
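
The TODO in the cell above (adding the raw NaN count next to the percentage) could be handled along these lines. This is a sketch on a toy frame; the names `toy`, `n_missing`, `pct_missing` and `summary` are assumptions of the example, not names used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with a known number of missing values per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

n_missing = toy.isna().sum()               # count per column
pct_missing = 100 * n_missing / len(toy)   # percentage per column

# Put both Series side by side, keep only affected columns, sort by percentage.
summary = pd.concat([n_missing, pct_missing], axis=1,
                    keys=["n_missing", "pct_missing"])
summary = summary[summary["n_missing"] > 0].sort_values("pct_missing",
                                                        ascending=False)
print(summary)
```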

### Data types

In [None]:
Counter(donnees.dtypes.values)

In [None]:
donnees.dtypes

In [None]:
target=donnees.pop("target")
X , y = donnees , target

**Binary features**
- ps_ind_06_bin 
- ps_ind_07_bin 
- ps_ind_08_bin 
- ps_ind_09_bin
- ps_ind_10_bin
- ps_ind_11_bin 
- ps_ind_12_bin 
- ps_ind_13_bin
- ps_ind_16_bin 
- ps_ind_17_bin 
- ps_ind_18_bin 
- ps_calc_15_bin
- ps_calc_16_bin 
- ps_calc_17_bin 
- ps_calc_18_bin 
- ps_calc_19_bin
- ps_calc_20_bin

In [None]:
#X.dtypes
#X.describe()
#X.corr()

bin_col=[col for col in X.columns if '_bin' in col]
X_bin=X.loc[:,bin_col]

# Cast the binary columns to bool (X and donnees refer to the same DataFrame)
for col in bin_col:
    X[col] = X[col].astype('bool')

**Categorical features**
- ps_ind_02_cat
- ps_ind_04_cat 
- ps_ind_05_cat 
- ps_car_01_cat
- ps_car_02_cat
- ps_car_03_cat
- ps_car_04_cat 
- ps_car_05_cat 
- ps_car_06_cat 
- ps_car_07_cat
- ps_car_08_cat 
- ps_car_09_cat 
- ps_car_10_cat 
- ps_car_11_cat

In [None]:
cat_col=[col for col in X.columns if '_cat' in col]
X_cat=X.loc[:,cat_col]

for col in cat_col:
    X[col] = X[col].astype('category')

**Continuous or ordinal features**
- ps_ind_01 
- ps_ind_03 
- ps_ind_14 
- ps_ind_15 
- ps_reg_01
- ps_reg_02
- ps_reg_03 
- ps_car_11 
- ps_car_12 
- ps_car_13 
- ps_car_14
- ps_car_15 
- ps_calc_01 
- ps_calc_02 
- ps_calc_03 
- ps_calc_04
- ps_calc_05 
- ps_calc_06 
- ps_calc_07 
- ps_calc_08 
- ps_calc_09
- ps_calc_10 
- ps_calc_11 
- ps_calc_12 
- ps_calc_13
- ps_calc_14

In [None]:
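# Every column without a bin/cat suffix; note that this also picks up 'id'
cont_col=[col for col in X.columns if col[-3:] not in ['bin', 'cat']]
# Explicit copy so that later in-place changes to X_cont do not trigger SettingWithCopyWarning
X_cont=X.loc[:,cont_col].copy()
X_cont2=X[cont_col]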
type(X_cont)
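
The three cells above all rely on the naming convention of the columns. The sketch below shows the same split on a handful of column names, using `str.endswith` rather than substring tests; the toy frame and the list names are assumptions of the example.

```python
import pandas as pd

# A few column names following the dataset's convention (values are made up).
cols = ["id", "ps_ind_01", "ps_ind_06_bin", "ps_car_01_cat",
        "ps_calc_15_bin", "ps_reg_01"]
toy = pd.DataFrame([[0, 1, 0, 3, 1, 0.5]], columns=cols)

bin_cols = [c for c in toy.columns if c.endswith("_bin")]
cat_cols = [c for c in toy.columns if c.endswith("_cat")]
# Everything else is continuous or ordinal, plus the 'id' column.
other_cols = [c for c in toy.columns if c not in bin_cols + cat_cols]

print(bin_cols)
print(cat_cols)
print(other_cols)
```

The last list still contains `id`, which is why that column has to be removed separately before training.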

## 2) Visualizing the dataset with `pandas` + `seaborn`

In [None]:
X_cont.dtypes

### Correlation of the continuous features

In [None]:
X_float = X_cont.select_dtypes(include=['float64'])
colormap = plt.cm.inferno
plt.figure(figsize=(16,12))
plt.title('Pearson correlation of the continuous features', y=1.05, size=15)
sns.heatmap(X_float.corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

### Correlation of the discrete features

In [None]:
X_int = X.select_dtypes(include=['int64'])
colormap = plt.cm.inferno
plt.figure(figsize=(16,12))
plt.title('Pearson correlation of the discrete features', y=1.05, size=15)
sns.heatmap(X_int.corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
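
A heatmap with many features can be hard to read, so it may help to also list the most correlated pairs numerically. The following is a sketch on synthetic data; the frame `toy` and the way the pairs are extracted are choices made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is nearly a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the strict upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort the pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```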

### Distribution of the features

#### Categorical

In [None]:
fig , axes = plt.subplots(nrows=5,ncols=3,figsize=(16,16))
for i , colname in enumerate(cat_col):
    sns.countplot(x=colname,data=X_cat,ax=fig.axes[i])  # recent seaborn versions require keyword arguments
plt.tight_layout()
fig.delaxes(axes[4][2])

#### Binary

In [None]:
fig , axes = plt.subplots(nrows=5,ncols=4,figsize=(13,13))
for i , colname in enumerate(bin_col):
    sns.countplot(x=colname,data=X_bin,ax=fig.axes[i])  # recent seaborn versions require keyword arguments
plt.tight_layout()
for i in range(1,4):
    fig.delaxes(axes[4][i])

#### Continuous

In [None]:
X_cont.describe(include='all')

In [None]:
# histograms
#g = sns.FacetGrid(X_cont, col=cont_col[0]) 
#g.map(sns.distplot, "you")

In [None]:
#sns.distplot(X_cont.(X_cont.columns[1]))

### Distribution of the target variable

In [None]:
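plt.style.use('ggplot')
# Bar chart of the class counts; y is already a Series, so no data= argument is needed
sns.countplot(x=y)
sns.despine(left=True)  # applied after plotting so that it acts on the chart's axes

plt.tight_layout()
#donnees.target
# TODO: annotate the bars with their values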
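
The values that the TODO above wants to display can be obtained directly with `value_counts`. Here is a sketch on a made-up target; the 96/4 split is an arbitrary example and does not reflect the actual class balance of the dataset.

```python
import pandas as pd

# Made-up binary target with a minority positive class.
y_toy = pd.Series([0] * 96 + [1] * 4, name="target")

counts = y_toy.value_counts()                # absolute counts per class
share = y_toy.value_counts(normalize=True)   # relative frequencies

# Both Series are indexed by class label, so they align automatically.
balance = pd.DataFrame({"count": counts, "share": share})
print(balance)
```

With a strongly imbalanced target, accuracy is not very informative, which is consistent with the ROC and precision/recall criteria given in the aim.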

In [None]:
X_cont.shape

In [None]:
X.columns

# 2. Preparing the data for classifier training

### Dropping features

The features ps_car_03_cat, ps_car_05_cat and ps_reg_03 have too many missing values, so we drop them:

In [None]:
cat1 = X_cat.shape[1]
cont1 = X_cont.shape[1]
dropfeat1 = X_cat.pop('ps_car_03_cat')
dropfeat2 = X_cat.pop('ps_car_05_cat')
dropfeat3 = X_cont.pop('ps_reg_03')
cont_col.remove('ps_reg_03')
cat2 = X_cat.shape[1]
cont2 = X_cont.shape[1]
print("Successfully dropped " + str(cat1-cat2+cont1-cont2)+ " features")
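
Rather than naming the columns by hand, the same decision can be driven by a missing-rate threshold. The sketch below assumes a cutoff of 40%; this value, like the toy data, is hypothetical, since the notebook does not state the threshold it used.

```python
import numpy as np
import pandas as pd

# Toy frame with missing rates of 75%, 50% and 25% (values are made up).
toy = pd.DataFrame({
    "ps_car_03_cat": [np.nan, np.nan, 1.0, np.nan],
    "ps_reg_03": [0.5, np.nan, np.nan, 0.7],
    "ps_car_11": [1.0, 2.0, np.nan, 3.0],
})

threshold = 0.4  # hypothetical cutoff, to be tuned

# The mean of a boolean mask is the fraction of missing entries.
missing_rate = toy.isna().mean()
to_drop = missing_rate[missing_rate > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```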

We also drop the id column, since it has no predictive value:

In [None]:
_ = X_cont.pop('id')
cont_col.remove('id')
X_cont.shape[1]

In [None]:
len(cont_col)

We fill in the missing values by replacing each NaN with the mean of its column:

In [None]:
remp = SimpleImputer(missing_values=np.nan, strategy="mean")

X_cont['ps_car_11']=remp.fit_transform(X_cont[['ps_car_11']]).ravel()
X_cont['ps_car_12']=remp.fit_transform(X_cont[['ps_car_12']]).ravel()
X_cont['ps_car_14']=remp.fit_transform(X_cont[['ps_car_14']]).ravel()
#donnees['ps_reg_03']=remp.fit_transform(donnees[['ps_reg_03']]).ravel()   FEATURE DROPPED
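
`SimpleImputer` computes one statistic per column, so the three calls above can also be written as a single call on several columns. A minimal sketch on toy values (the frame `toy` and the name `imputer` are assumptions of the example):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one NaN per column (values are made up).
toy = pd.DataFrame({
    "ps_car_11": [1.0, np.nan, 3.0],
    "ps_car_12": [0.2, 0.4, np.nan],
    "ps_car_14": [np.nan, 0.1, 0.3],
})
cols = list(toy.columns)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Each column is filled with its own mean, computed on the non-missing values.
toy[cols] = imputer.fit_transform(toy[cols])
print(toy)
```

Keeping the fitted `imputer` around also makes it possible to call `imputer.transform` on the test set later, so that the test data is filled with the means learned on the training data.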

Let us remove the remaining NaN values (this step is currently disabled: the calls below are commented out):

In [None]:
#X_cont = X_cont.dropna()
#X_cat = X_cat.dropna()
#X_bin = X_bin.dropna()

### Encoding the categorical data

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

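# X_cat still holds numeric codes, which get_dummies leaves untouched by default,
# so cast them to 'category' first
X_cat_bin = pd.get_dummies(X_cat.astype('category'), prefix_sep='#', drop_first=True)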

In [None]:
X_cat_bin.head(n=10)
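
The dtype of the input matters for `pd.get_dummies`: without an explicit `columns` argument it only encodes object, string and category columns. The sketch below contrasts the two behaviors on a toy frame (names and values are made up for the example).

```python
import numpy as np
import pandas as pd

# Categorical features stored as numeric codes, one of them with a NaN.
toy = pd.DataFrame({
    "ps_car_02_cat": [0.0, 1.0, np.nan, 1.0],
    "ps_ind_04_cat": [2, 0, 1, 2],
})

# Numeric columns pass through unchanged.
unchanged = pd.get_dummies(toy, prefix_sep="#", drop_first=True)

# After casting to 'category', each level (except the first) gets its own column.
encoded = pd.get_dummies(toy.astype("category"), prefix_sep="#", drop_first=True)

print(unchanged.columns.tolist())
print(encoded.columns.tolist())
```

Rows whose value is NaN receive zeros in every indicator column of that feature, because `dummy_na` is `False` by default.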

### Rescaling the continuous variables to [0, 1]

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = MinMaxScaler()
scaler.fit(X_cont)
X_cont = scaler.transform(X_cont)

In [None]:
X_cont

In [None]:
X_cont = pd.DataFrame(X_cont, columns = cont_col)
X_cont.describe()
X_cont.index
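
Because the scaler returns a NumPy array, the column names and the index have to be restored by hand. The sketch below shows one way of doing it that keeps the original index, which matters if rows have been filtered beforehand; the toy frame is an assumption of the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with a non-contiguous index.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]},
                   index=[5, 7, 9])

scaler = MinMaxScaler()
# Each column is mapped to [0, 1] via (x - min) / (max - min).
scaled = pd.DataFrame(scaler.fit_transform(toy),
                      columns=toy.columns, index=toy.index)
print(scaled)
```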

### Feature matrix

In [None]:
X = pd.concat((X_bin, X_cat_bin, X_cont), axis=1)
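
`pd.concat` with `axis=1` aligns rows on their index labels, so the three blocks must share the same index for the result to be meaningful. The following sketch, on two made-up frames, shows what happens when they do not and how resetting the index avoids it.

```python
import pandas as pd

# Two frames with the same number of rows but different index labels.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [10, 20, 30]}, index=[0, 2, 3])

# Alignment on labels: the result has the union of both indexes and some NaN.
misaligned = pd.concat((left, right), axis=1)

# Resetting both indexes first pairs the rows by position instead.
aligned = pd.concat((left.reset_index(drop=True),
                     right.reset_index(drop=True)), axis=1)

print(misaligned)
print(aligned)
```

Positional pairing is only correct when both frames hold the same observations in the same order.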

In [None]:
X.columns

In [None]:
X.describe(include='all')

In [None]:
X.head()

In [None]:
X.index

In [None]:
# If rows were removed earlier, the DataFrame index is no longer contiguous.
# We reset it:

In [None]:
X.reset_index(inplace=True, drop=True)

In [None]:
X.head()

In [None]:
X.shape

In [None]:
y.shape

### Saving the processed data

We use `pickle` to save the processed data

In [None]:
import pickle as pkl

with open('données_traitées.pkl', 'wb') as f:
    pkl.dump(X, f)
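
The cell above stores only the feature matrix. If the target should travel with it, one option is to pickle both objects in a dictionary and read them back later. This is a sketch with toy objects and a temporary file; the file name `processed_demo.pkl` and the dictionary keys are assumptions of the example.

```python
import os
import pickle as pkl
import tempfile

import pandas as pd

# Toy stand-ins for the processed features and the target.
X_toy = pd.DataFrame({"f1": [0.0, 0.5, 1.0], "f2": [1, 0, 1]})
y_toy = pd.Series([0, 1, 0], name="target")

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "processed_demo.pkl")

    # Save features and target together.
    with open(demo_path, "wb") as f:
        pkl.dump({"X": X_toy, "y": y_toy}, f)

    # Load them back, for instance in the training notebook.
    with open(demo_path, "rb") as f:
        loaded = pkl.load(f)

print(loaded["X"].shape, loaded["y"].shape)
```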