**Project goal**: Predict whether a customer is eligible for a loan.

**Project steps:**

**EDA (Exploratory Data Analysis):**
- getting familiar with the dataset, cleaning
- correlation analysis of the variables
- selection of the relevant variables

=> output: clean dataframe with the variables relevant to predicting loan eligibility

**SQL database:**

**Developing an artificial intelligence program:**
- choosing and training prediction models
- evaluating and comparing these models
- saving the best-performing model (pickle, ...)

=> output: keep the best-performing model.
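
As a rough illustration of the compare-then-save step, the sketch below scores two candidate classifiers with cross-validation and pickles the better one. The synthetic data, the two candidate models, and the names `candidates`, `scores`, and `best_model` are assumptions made for the example, not the models chosen in this project.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared loan data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hypothetical candidates; any scikit-learn classifier could be listed here.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)

# Serialize in memory; pickle.dump(best_model, file) would write to disk instead.
blob = pickle.dumps(best_model)
restored = pickle.loads(blob)
```

The restored object should predict exactly like the fitted one, which is what lets a separate application load the saved model later.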

**API development (Flask):**
- define a Flask application
- load the selected prediction model

=> output: test the model from a web page.

## EDA

Getting familiar with the dataset:
- data collection and cleaning
- checking/converting types (int, float, datetime, string, ...) + defining new variables (date? ...)
- checking the data: unique (distinct) values / nulls / missing entries / outliers? (boxplot) ...
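
The checks listed above can be gathered into a single per-column summary. This is a minimal sketch: `quality_report` is a hypothetical helper and the toy values are made up, so it only illustrates the idea.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count and share, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(dropna=True),
    })


# Toy frame standing in for the loan data (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "LoanAmount": [120.0, np.nan, 66.0, 120.0],
})
report = quality_report(toy)
print(report)
```

Calling `quality_report(train)` after loading the data would give the same overview for the real columns.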

### Importing the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import GridSearchCV

%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

### Loading the data

In [None]:
#loading the dataset
train = pd.read_csv(r'C:\Users\Asma\Documents\ExoSimplon\Sem15_Grand_Projet\Data\train_u6lujuX_CVtuZ9i.csv')
test = pd.read_csv(r'C:\Users\Asma\Documents\ExoSimplon\Sem15_Grand_Projet\Data\test_Y3wMUE5_7gLdaTN.csv')

In [None]:
train.head()

In [None]:
test.head()

=> The test set has no target column ('Loan_Status')

In [None]:
print('Train shape : ', train.shape, '\nTest  shape : ', test.shape)

- Concatenate the train and test dataframes (optional; the cells below are left commented out)

In [None]:
#df = pd.concat([train, test])

In [None]:
#df.shape
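
If the two sets are concatenated so that cleaning is applied once, it helps to tag each row with its origin so the sets can be separated again. The sketch below uses toy frames; the `source` column is an assumption introduced for the example.

```python
import pandas as pd

# Toy frames; the real test set likewise lacks the 'Loan_Status' column.
train_toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2"],
    "LoanAmount": [100, 150],
    "Loan_Status": ["Y", "N"],
})
test_toy = pd.DataFrame({"Loan_ID": ["LP3"], "LoanAmount": [80]})

# 'source' is a helper column (assumption) recording where each row came from.
df_all = pd.concat(
    [train_toy.assign(source="train"), test_toy.assign(source="test")],
    ignore_index=True,
)

# Rows from the test set have no target, so 'Loan_Status' is NaN there.
train_back = df_all[df_all["source"] == "train"].drop(columns="source")
test_back = df_all[df_all["source"] == "test"].drop(columns=["source", "Loan_Status"])
```

Without such a flag, the missing target is the only way to tell the rows apart after cleaning.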

In [None]:
# use .describe() to get more information on the dataset. 
train.describe().T

In [None]:
train.info()

### Checking for duplicates

In [None]:
train.duplicated().sum()

In [None]:
#train[train.duplicated()]

In [None]:
#train.drop_duplicates(inplace=True)

In [None]:
train.drop('Loan_ID', axis=1, inplace=True)

In [None]:
train.head()

### Checking and handling missing data (NaN)

In [None]:
train.isna().sum()

- Handling missing values in 'Gender'

In [None]:
train['Gender'].unique()

In [None]:
train['Gender'].value_counts()

In [None]:
train.dropna(subset=['Gender'],inplace=True)
train.head()

In [None]:
train['Gender'].isna().sum()

In [None]:
train = train.reset_index(drop=True)  # drop=True discards the old index and keeps only the new one

In [None]:
train['Gender'].isna().sum()

In [None]:
train.isna().sum()

- Handling missing values in 'Married'

In [None]:
train['Married'].unique()

In [None]:
train['Married'] = train['Married'].fillna('No')

In [None]:
train['Married'].unique()

In [None]:
train.isna().sum()

- Handling missing values in 'Dependents'

In [None]:
train['Dependents'].unique()

In [None]:
train.Dependents = train.Dependents.fillna('0')

rpl = {'0':'0', '1':'1', '2':'2', '3+':'3'}

train.Dependents = train.Dependents.replace(rpl).astype(int)

In [None]:
train.Dependents.unique()

In [None]:
train.isna().sum()

- Handling missing values in 'Self_Employed'

In [None]:
train['Self_Employed'].unique()

In [None]:
train['Self_Employed'].value_counts()

In [None]:
train['Self_Employed'] = train['Self_Employed'].fillna('No')

In [None]:
train['Self_Employed'].unique()

In [None]:
train['Self_Employed'].value_counts()

In [None]:
train.isna().sum()

- Handling missing values in 'Credit_History'

In [None]:
train['Credit_History'].unique()

In [None]:
train['Credit_History'].value_counts()

In [None]:
train['Credit_History'].isna().sum()

In [None]:
train.dropna(subset=['Credit_History'],inplace=True)

In [None]:
train = train.reset_index(drop=True)  # drop=True discards the old index and keeps only the new one

In [None]:
train['Credit_History'].isna().sum()

- Handling missing values in 'LoanAmount'

In [None]:
train['LoanAmount'].unique()

In [None]:
#train['LoanAmount'].fillna(train['LoanAmount'].mode()[0], inplace=True)

In [None]:
train.dropna(subset=['LoanAmount'],inplace=True)

In [None]:
train = train.reset_index(drop=True)  # drop=True discards the old index and keeps only the new one

In [None]:
train['LoanAmount'].isna().sum()

- Handling missing values in 'Loan_Amount_Term'

In [None]:
train['Loan_Amount_Term'].unique()

In [None]:
#train['Loan_Amount_Term'].fillna(train['Loan_Amount_Term'].mode()[0], inplace=True)

In [None]:
train.dropna(subset=['Loan_Amount_Term'],inplace=True)

In [None]:
train = train.reset_index(drop=True)  # drop=True discards the old index and keeps only the new one

In [None]:
train['Loan_Amount_Term'].isna().sum()

- Re-checking the NaNs

In [None]:
train.isna().sum()

In [None]:
print('Train shape :', train.shape)
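
Dropping rows shrinks the training set, which is why the commented-out `fillna` cells above are worth considering as an alternative. The sketch below contrasts the two strategies on toy data; choosing the median for the amount and the mode for the term is an illustrative assumption, not a decision taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in both columns (values are made up).
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 120.0, np.nan],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0, 360.0],
})

# Strategy 1: drop every row with a gap in either column.
dropped = toy.dropna(subset=["LoanAmount", "Loan_Amount_Term"])

# Strategy 2: impute instead, median for the continuous amount,
# most frequent value for the term, which takes few distinct values.
imputed = toy.copy()
imputed["LoanAmount"] = imputed["LoanAmount"].fillna(imputed["LoanAmount"].median())
imputed["Loan_Amount_Term"] = imputed["Loan_Amount_Term"].fillna(
    imputed["Loan_Amount_Term"].mode()[0]
)

print(len(toy), len(dropped), len(imputed))
```

Comparing the row counts of the two results on the real data shows how much is lost by dropping.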

### Changing the data types

In [None]:
train.info()

In [None]:
train['Credit_History'] = train['Credit_History'].astype(int)  # assign back, astype returns a new Series

### Converting categorical variables to numeric variables

In [None]:
train.head(3)

In [None]:
train['Married'] = train['Married'].map(dict(Yes=1, No=0))

In [None]:
train['Self_Employed'] = train['Self_Employed'].map(dict(Yes=1, No=0))

In [None]:
train['Education'].value_counts()

In [None]:
train['Education'] = train['Education'].map({'Graduate': 1, 'Not Graduate': 0})  # dict literal: the key contains a space

In [None]:
train['Education'].dtypes

### Checking for outliers
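
One common way to flag outliers numerically, matching what a boxplot draws, is the interquartile-range rule. The sketch below is a minimal version under stated assumptions: `iqr_outliers` is a hypothetical helper, and the column name and values are illustrative.

```python
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Made-up incomes with one extreme value.
income = pd.Series([2500, 3000, 3200, 3500, 4000, 4100, 81000], name="ApplicantIncome")
mask = iqr_outliers(income)
print(income[mask])
```

The factor `k = 1.5` is the usual boxplot convention; a larger value flags only the most extreme points.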

# DataViz

In [None]:
# nunique(): number of distinct values in each column (object-typed columns only)
for col in train.columns:
    if train[col].dtype == 'object':
        print(col, train[col].nunique())

In [None]:
sns.countplot(x=train['Gender']).set_title('Gender');

In [None]:
sns.countplot(x=train['Married']).set_title('Married?');

In [None]:
sns.countplot(x=train['Education']).set_title('Graduate?');

In [None]:
train['Dependents'].value_counts()

In [None]:
train['Dependents'].value_counts().index

In [None]:
fig = plt.figure(figsize =(10, 7));
plt.pie(train['Dependents'].value_counts(), 
        labels=train['Dependents'].value_counts().index, 
        autopct='%1.1f%%',  
        startangle=90,
        textprops={'size': 'x-large'});
plt.legend(title = "Dependents:", loc ="center right", bbox_to_anchor =(1.3, 0.7));

In [None]:
sns.countplot(x=train['Self_Employed']).set_title('Self-employed?');

In [None]:
sns.countplot(x=train['Credit_History']).set_title('Credit history?');

# Part 2

- Correlation analysis of the variables
- check multicollinearity with a heatmap (see how the independent variables may be correlated with one another)
- Identify the significant variables
- transform the continuous variables
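
For the last item, a possible transformation of a continuous variable is a log transform followed by min-max scaling, using the `MinMaxScaler` imported at the top of the notebook. The toy values and the new column names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up loan amounts with one large value.
toy = pd.DataFrame({"LoanAmount": [66.0, 120.0, 141.0, 267.0, 700.0]})

# log1p compresses large values before scaling.
toy["LoanAmount_log"] = np.log1p(toy["LoanAmount"])

# MinMaxScaler rescales each column linearly to the [0, 1] range.
scaler = MinMaxScaler()
toy["LoanAmount_scaled"] = scaler.fit_transform(toy[["LoanAmount_log"]]).ravel()
print(toy)
```

In practice the scaler would be fitted on the training set only and reused with `scaler.transform` on the test set, so both are scaled identically.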

In [None]:
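# Correlation matrix (numeric columns only)

corr_df = train.corr(numeric_only=True).abs()

# Mask the upper triangle (diagonal included) so each pair is shown once
upp_mat = np.triu(np.ones_like(corr_df, dtype=bool))

plt.figure(figsize=(20, 15))
sns.heatmap(corr_df, annot=True, mask=upp_mat)
#plt.savefig("Matrice de corrélation.png")
plt.show()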

### Multicollinearity

* Check the features for multicollinearity

https://datascience.eu/fr/apprentissage-automatique/multicollinearite-2/

https://datascience.eu/fr/mathematiques-et-statistiques/multicollinearite/
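
Besides pairwise correlations, multicollinearity is often quantified with the variance inflation factor (VIF). For standardized features it equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below is one way to compute it with NumPy; `vif` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd


def vif(df: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns, name="VIF")


# Synthetic features: 'a_copy' is almost a duplicate of 'a', 'b' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.05 * rng.normal(size=200)})

vif_values = vif(toy)
print(vif_values.round(1))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic collinearity. Here the near-duplicate pair should stand out while the independent column stays close to 1.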

In [None]:
def Multicollinear_Features(df, target):
    # The target must be numeric (e.g. encoded as 0/1) to appear in the correlation matrix.
    corr = df.corr(numeric_only=True).abs()
    features = []
    correlations = []
    # Features whose absolute correlation with the target is at least 0.3
    for idx, correlation in corr[target].items():
        if correlation >= .3 and idx != target:
            features.append(idx)
            correlations.append(correlation)
    corr_target_df = pd.DataFrame({'Correlations': correlations, 'Features': features})
    corr_target_df.sort_values(by='Correlations', ascending=False, inplace=True)
    corr_target_df.reset_index(drop=True, inplace=True)

    # Pairs correlated at 0.8 or more; 'idx' is the one less correlated with the target
    multicollinear = []
    for feature in corr:
        for idx, correlation in corr[feature].items():
            if correlation >= .8 and idx != feature and corr[target].loc[feature] >= corr[target].loc[idx]:
                multicollinear.append({'Correlations': correlation, 'Features': feature, 'idx': idx})
    if len(multicollinear) > 0:
        MC_df = pd.DataFrame(multicollinear)
    else:
        MC_df = pd.DataFrame(columns=['Correlations', 'Features', 'idx'])
    MC_df.sort_values(by='Correlations', ascending=False, inplace=True)
    MC_df.reset_index(drop=True, inplace=True)
    print('Multicollinear Features')
    display(MC_df)

    print('Correlations with Target')
    corr_target_df = corr_target_df.loc[~corr_target_df['Features'].isin(MC_df['idx'].to_list())]
    display(corr_target_df)
    return corr_target_df, MC_df

In [None]:
# 'Loan_Status' must already be numeric here, otherwise corr['Loan_Status'] raises a KeyError
corr_price_df, MC_df = Multicollinear_Features(train, 'Loan_Status')

* Which variables are the best, i.e. the most significant?

use stepwise selection to choose the best features

https://towardsdatascience.com/feature-selection-techniques-in-regression-model-26878fe0e24e

https://en.wikipedia.org/wiki/Stepwise_regression

https://towardsdatascience.com/stepwise-regression-tutorial-in-python-ebf7c782c922

https://bookdown.org/max/FES/greedy-stepwise-selection.html
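
As one concrete variant of stepwise selection, scikit-learn provides `SequentialFeatureSelector`, which greedily adds (or removes) features based on cross-validated score rather than on p-values as in classical stepwise regression. The sketch below runs it on synthetic data; the estimator, the number of features to keep, and the data are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.3 * rng.normal(size=300) > 0).astype(int)

# Forward selection: start empty, add the feature that most improves the CV score.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

# Indices of the columns that were kept.
selected = np.flatnonzero(selector.get_support())
print(selected)
```

Setting `direction="backward"` starts from all features and removes them one at a time instead.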

In [None]:
# New dataframe with the features we keep (built from train, since df is never defined above)
kept_features = corr_price_df['Features'].to_list()
df1 = train[kept_features]
df1

In [None]:
# Put the target column first, followed by the kept features
df1 = train[['Loan_Status'] + df1.columns.tolist()]
df1

# Saving the new clean dataframe
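
A minimal sketch of this step, assuming a CSV export is wanted: the toy frame, the file name `train_clean.csv`, and the temporary directory are placeholders, not the project's actual output location.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned dataframe (values are made up).
clean = pd.DataFrame({"Credit_History": [1, 0, 1], "Loan_Status": ["Y", "N", "Y"]})

# Hypothetical output location; replace with the project's data folder.
out_path = os.path.join(tempfile.mkdtemp(), "train_clean.csv")

# index=False avoids writing the row index as an extra unnamed column.
clean.to_csv(out_path, index=False)

# Reload to confirm the file round-trips.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

With the notebook's own data, the same call would be `df1.to_csv(..., index=False)`.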