# Classification project on the default of credit card clients in Taiwan

---------------------------------------------------------------------------------------------------------------------------
From UCI Machine Learning Repository                                                                                      -
http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients                                                     -
                                                                                                                          -
Source:                                                                                                                   -
Name: I-Cheng Yeh                                                                                                         -
email addresses: (1) icyeh '@' chu.edu.tw (2) 140910 '@' mail.tku.edu.tw                                                  -
institutions: (1) Department of Information Management, Chung Hua University, Taiwan. (2) Department of Civil Engineering, Tamkang University, Taiwan. -
other contact information: 886-2-26215656 ext. 3181                                                                       -
---------------------------------------------------------------------------------------------------------------------------


1. Load the data.
2. Perform exploratory analyses (for example, how the various features and the target variable are distributed).
3. Train a model to predict the target variable (default risk).
   - Use three different models (logistic regression, decision tree, and random forest).
   - Compare the performance of the models (for example, AUC).
   - Defend your choice of the best model (for example, what are the strengths and weaknesses of each model?).
4. What else would you do with this data? Anything that would help you find a better solution?
5. Put your solution on GitHub and send me the link.


---
# 1. Loading the data


In [1]:
# import pyspark
import pyspark

# import SparkSession
from pyspark.sql import SparkSession
import joblib

In [2]:
# create a SparkSession object and set the application name
spark = SparkSession.builder.master("local").appName("lab2_ilboudo").getOrCreate()

In [3]:
ccdefault = spark.read.format("csv").options(header=True, inferSchema=True).load("ccdefault.csv")

---
# 2. Exploratory analyses
We will explore the data through the following steps:
* Look at the schema and the dimensions of the dataset.
* Look at the dataset itself.
* Compute descriptive statistics of its attributes.
* Split the dataset according to the categorical variables.
* Find the correlation between the explanatory variables.
* Possibly, create derived variables.

## 2.1. Schema and dimension
Print the schema of the dataset

In [4]:
ccdefault.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- LIMIT_BAL: integer (nullable = true)
 |-- SEX: integer (nullable = true)
 |-- EDUCATION: integer (nullable = true)
 |-- MARRIAGE: integer (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- PAY_0: integer (nullable = true)
 |-- PAY_2: integer (nullable = true)
 |-- PAY_3: integer (nullable = true)
 |-- PAY_4: integer (nullable = true)
 |-- PAY_5: integer (nullable = true)
 |-- PAY_6: integer (nullable = true)
 |-- BILL_AMT1: integer (nullable = true)
 |-- BILL_AMT2: integer (nullable = true)
 |-- BILL_AMT3: integer (nullable = true)
 |-- BILL_AMT4: integer (nullable = true)
 |-- BILL_AMT5: integer (nullable = true)
 |-- BILL_AMT6: integer (nullable = true)
 |-- PAY_AMT1: integer (nullable = true)
 |-- PAY_AMT2: integer (nullable = true)
 |-- PAY_AMT3: integer (nullable = true)
 |-- PAY_AMT4: integer (nullable = true)
 |-- PAY_AMT5: integer (nullable = true)
 |-- PAY_AMT6: integer (nullable = true)
 |-- DEFAULT: integer (nullable = tru

Print the number of records in the dataset.

In [5]:
ccdefault.count()

30000

## 2.2. Look at the data
Print the first seven records of the dataset.

In [6]:
# Question: how can we get the same output format when using tail(7)?

ccdefault.show(7, vertical=True)

-RECORD 0-----------
 ID        | 1      
 LIMIT_BAL | 20000  
 SEX       | 2      
 EDUCATION | 2      
 MARRIAGE  | 1      
 AGE       | 24     
 PAY_0     | 2      
 PAY_2     | 2      
 PAY_3     | -1     
 PAY_4     | -1     
 PAY_5     | -2     
 PAY_6     | -2     
 BILL_AMT1 | 3913   
 BILL_AMT2 | 3102   
 BILL_AMT3 | 689    
 BILL_AMT4 | 0      
 BILL_AMT5 | 0      
 BILL_AMT6 | 0      
 PAY_AMT1  | 0      
 PAY_AMT2  | 689    
 PAY_AMT3  | 0      
 PAY_AMT4  | 0      
 PAY_AMT5  | 0      
 PAY_AMT6  | 0      
 DEFAULT   | 1      
-RECORD 1-----------
 ID        | 2      
 LIMIT_BAL | 120000 
 SEX       | 2      
 EDUCATION | 2      
 MARRIAGE  | 2      
 AGE       | 26     
 PAY_0     | -1     
 PAY_2     | 2      
 PAY_3     | 0      
 PAY_4     | 0      
 PAY_5     | 0      
 PAY_6     | 2      
 BILL_AMT1 | 2682   
 BILL_AMT2 | 1725   
 BILL_AMT3 | 2682   
 BILL_AMT4 | 3272   
 BILL_AMT5 | 3455   
 BILL_AMT6 | 3261   
 PAY_AMT1  | 0      
 PAY_AMT2  | 1000   
 PAY_AMT3  | 

## 2.3. Descriptive statistics

In [7]:
# Rename PAY_0 to PAY_1

ccdefaut1=ccdefault.withColumnRenamed("PAY_0", "PAY_1")
ccdefaut1.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- LIMIT_BAL: integer (nullable = true)
 |-- SEX: integer (nullable = true)
 |-- EDUCATION: integer (nullable = true)
 |-- MARRIAGE: integer (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- PAY_1: integer (nullable = true)
 |-- PAY_2: integer (nullable = true)
 |-- PAY_3: integer (nullable = true)
 |-- PAY_4: integer (nullable = true)
 |-- PAY_5: integer (nullable = true)
 |-- PAY_6: integer (nullable = true)
 |-- BILL_AMT1: integer (nullable = true)
 |-- BILL_AMT2: integer (nullable = true)
 |-- BILL_AMT3: integer (nullable = true)
 |-- BILL_AMT4: integer (nullable = true)
 |-- BILL_AMT5: integer (nullable = true)
 |-- BILL_AMT6: integer (nullable = true)
 |-- PAY_AMT1: integer (nullable = true)
 |-- PAY_AMT2: integer (nullable = true)
 |-- PAY_AMT3: integer (nullable = true)
 |-- PAY_AMT4: integer (nullable = true)
 |-- PAY_AMT5: integer (nullable = true)
 |-- PAY_AMT6: integer (nullable = true)
 |-- DEFAULT: integer (nullable = tru

In [8]:
# descriptive statistics of some continuous variables
ccdefaut1.describe("BILL_AMT1", "PAY_AMT1", "AGE").show()

+-------+-----------------+-----------------+-----------------+
|summary|        BILL_AMT1|         PAY_AMT1|              AGE|
+-------+-----------------+-----------------+-----------------+
|  count|            30000|            30000|            30000|
|   mean|       51223.3309|        5663.5805|          35.4855|
| stddev|73635.86057552966|16563.28035402577|9.217904068090155|
|    min|          -165580|                0|               21|
|    max|           964511|           873552|               79|
+-------+-----------------+-----------------+-----------------+



Print the maximum, the minimum, and the average of `BILL_AMT6`. Note that the wildcard import below replaces Python's built-in `max` and `min` with the Spark column functions of the same name.

In [9]:

from pyspark.sql.functions import *
ccdefaut1.select(max("BILL_AMT6"), min("BILL_AMT6"), avg("BILL_AMT6")).show()

+--------------+--------------+--------------+
|max(BILL_AMT6)|min(BILL_AMT6)|avg(BILL_AMT6)|
+--------------+--------------+--------------+
|        961664|       -339603|    38871.7604|
+--------------+--------------+--------------+



In [10]:
# Frequencies of the categorical variables

In [11]:
# variable of interest: DEFAULT

ccdefaut1.groupBy("DEFAULT").count().orderBy(desc("count")).show()

+-------+-----+
|DEFAULT|count|
+-------+-----+
|      0|23364|
|      1| 6636|
+-------+-----+



The classes are imbalanced (6,636 defaults out of 30,000 records, about 22%), so the evaluation metrics will have to be chosen with care later on.
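
To make the imbalance concrete, here is a minimal sketch in plain Python (no Spark needed) that computes the default rate and the accuracy of a trivial model that always predicts "no default". The counts are copied from the table above; the variable names are illustrative.

```python
# Class counts copied from the DEFAULT frequency table above.
counts = {0: 23364, 1: 6636}
total = sum(counts.values())

# Share of clients who defaulted.
default_rate = counts[1] / total

# Accuracy reached by always predicting the majority class (0).
baseline_accuracy = max(counts.values()) / total

print(f"default rate: {default_rate:.4f}")
print(f"majority-class accuracy: {baseline_accuracy:.4f}")
```

Any model whose accuracy is close to this baseline has learned little, which is one reason to look at AUC as well as accuracy.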

In [12]:
# variable of interest: EDUCATION

ccdefaut1.createOrReplaceTempView("df")
education = spark.sql("SELECT EDUCATION, count(*) as effectif \
                    FROM df \
                    GROUP BY EDUCATION \
                    ORDER BY effectif DESC")
education.show()


# number of observations
nb_obs = ccdefaut1.count()


education_freq = education.withColumn('pourcentage', 100*education.effectif/float(nb_obs))
education_freq.show()

# Question: how to round to two decimal places?



+---------+--------+
|EDUCATION|effectif|
+---------+--------+
|        2|   14030|
|        1|   10585|
|        3|    4917|
|        5|     280|
|        4|     123|
|        6|      51|
|        0|      14|
+---------+--------+

+---------+--------+-------------------+
|EDUCATION|effectif|        pourcentage|
+---------+--------+-------------------+
|        2|   14030| 46.766666666666666|
|        1|   10585|  35.28333333333333|
|        3|    4917|              16.39|
|        5|     280| 0.9333333333333333|
|        4|     123|               0.41|
|        6|      51|               0.17|
|        0|      14|0.04666666666666667|
+---------+--------+-------------------+
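
Regarding the rounding question in the cell above: a minimal sketch in plain Python, using the counts from the table, shows the intended result. In Spark, the analogous column function is `pyspark.sql.functions.round(column, 2)`; the dictionary names below simply mirror the column names and are illustrative.

```python
# Counts per EDUCATION code, copied from the frequency table above.
effectif = {2: 14030, 1: 10585, 3: 4917, 5: 280, 4: 123, 6: 51, 0: 14}
nb_obs = sum(effectif.values())

# Percentage of each code, rounded to two decimal places.
pourcentage = {code: round(100 * n / nb_obs, 2) for code, n in effectif.items()}

for code, pct in pourcentage.items():
    print(code, pct)
```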



---
# 3. Prepare the data for Machine Learning algorithms


We now define the label, i.e. the variable to predict (`DEFAULT`), and group the explanatory variables by type: continuous or categorical. We keep the column names of the two groups in two different lists. Moreover, since we don't want to apply the same transformations to the predictors (features) and the label, we remove the label attribute from the list of predictors.

In [13]:
# Rename the column 'DEFAULT' to 'label'
ccdefaut2 = ccdefaut1.withColumnRenamed("DEFAULT","label")
ccdefaut2.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- LIMIT_BAL: integer (nullable = true)
 |-- SEX: integer (nullable = true)
 |-- EDUCATION: integer (nullable = true)
 |-- MARRIAGE: integer (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- PAY_1: integer (nullable = true)
 |-- PAY_2: integer (nullable = true)
 |-- PAY_3: integer (nullable = true)
 |-- PAY_4: integer (nullable = true)
 |-- PAY_5: integer (nullable = true)
 |-- PAY_6: integer (nullable = true)
 |-- BILL_AMT1: integer (nullable = true)
 |-- BILL_AMT2: integer (nullable = true)
 |-- BILL_AMT3: integer (nullable = true)
 |-- BILL_AMT4: integer (nullable = true)
 |-- BILL_AMT5: integer (nullable = true)
 |-- BILL_AMT6: integer (nullable = true)
 |-- PAY_AMT1: integer (nullable = true)
 |-- PAY_AMT2: integer (nullable = true)
 |-- PAY_AMT3: integer (nullable = true)
 |-- PAY_AMT4: integer (nullable = true)
 |-- PAY_AMT5: integer (nullable = true)
 |-- PAY_AMT6: integer (nullable = true)
 |-- label: integer (nullable = true)

In [14]:
# label column
colLabel = "label"

# categorical columns
colCat = ["SEX","EDUCATION","MARRIAGE","PAY_1","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]


# Columns considered not useful and excluded from the analysis.
unused=["ID","LIMIT_BAL"]

# Is there a trick to select 'PAY_1' through 'PAY_6'?

# numerical columns (the loop variable is `c` so that it does not shadow pyspark.sql.functions.col)
colNum = [c for c in ccdefaut2.columns if (c not in colCat and c != colLabel and c not in unused)]



In [15]:
colNum

['AGE',
 'BILL_AMT1',
 'BILL_AMT2',
 'BILL_AMT3',
 'BILL_AMT4',
 'BILL_AMT5',
 'BILL_AMT6',
 'PAY_AMT1',
 'PAY_AMT2',
 'PAY_AMT3',
 'PAY_AMT4',
 'PAY_AMT5',
 'PAY_AMT6']
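
On the question of selecting `PAY_1` through `PAY_6` without typing each name: one simple option is to generate the names with a list comprehension. This is a plain-Python sketch; `pay_cols` is an illustrative name that does not appear elsewhere in the notebook.

```python
# Generate 'PAY_1' ... 'PAY_6' instead of writing them out by hand.
pay_cols = [f"PAY_{i}" for i in range(1, 7)]

# Rebuild the categorical column list from the fixed names plus the generated ones.
colCat = ["SEX", "EDUCATION", "MARRIAGE"] + pay_cols

print(colCat)
```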

## 3.1. Prepare continuous attributes
### Data cleaning
Most Machine Learning algorithms cannot work with missing features, so we should take care of them. As a first step, let's find the columns with missing values among the numerical attributes. To do so, we can print the number of missing values of each continuous attribute listed in `colNum`.

In [16]:
from pyspark.sql.functions import col

# Count NULL entries per column (integer columns cannot hold NaN).
for c in colNum:
    n_missing = ccdefaut2.filter(col(c).isNull()).count()
    print(c + " column has " + str(n_missing) + " missing values\n")


AGE column has 0 missing values

BILL_AMT1 column has 0 missing values

BILL_AMT2 column has 0 missing values

BILL_AMT3 column has 0 missing values

BILL_AMT4 column has 0 missing values

BILL_AMT5 column has 0 missing values

BILL_AMT6 column has 0 missing values

PAY_AMT1 column has 0 missing values

PAY_AMT2 column has 0 missing values

PAY_AMT3 column has 0 missing values

PAY_AMT4 column has 0 missing values

PAY_AMT5 column has 0 missing values

PAY_AMT6 column has 0 missing values
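
A caution when building such filter conditions as strings: Python's `or` between two non-empty strings returns the first one, so chaining several SQL fragments with `or` silently keeps only the first condition. The sketch below shows this in plain Python and, as one possible alternative, joins the fragments with the SQL keyword `OR` inside a single string. The fragments themselves are illustrative.

```python
c = "AGE"

# Python evaluates `or` itself: a non-empty string is truthy,
# so the expression collapses to its first operand.
condition = c + " is NULL" or c + " = ''"
print(condition)

# To combine several conditions, put the SQL `OR` inside the string.
combined = " OR ".join([f"{c} IS NULL", f"isnan({c})"])
print(combined)
```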



### Scaling
One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. This is the case for this dataset: `BILL_AMT1` ranges from -165,580 to 964,511, while `AGE` only ranges from 21 to 79. Note that scaling the label attribute is generally not required.

One way to get all attributes to have the same scale is to use standardization. In standardization, for each value, first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. To do this, we can use the `StandardScaler` Estimator. Note that, by default, Spark's `StandardScaler` only divides by the standard deviation (`withStd=True`) and does not subtract the mean (`withMean=False`). To use `StandardScaler`, we first need to convert all the numerical attributes into one big vector of features using `VectorAssembler`, and then call `StandardScaler` on that vector.
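
The difference between full standardization and scaling by the standard deviation alone can be sketched in plain Python. The list `ages` holds illustrative values, not rows of the dataset; the last two lines use the `AGE` standard deviation reported in the descriptive statistics, and the result appears consistent with the first component of the feature vectors printed later in this notebook (2.278...).

```python
import statistics

# Illustrative values only (not taken from the dataset).
ages = [24, 26, 34, 37, 57, 37, 29]

mean = statistics.mean(ages)
std = statistics.stdev(ages)  # corrected sample standard deviation

# Scaling only: unit standard deviation, mean not shifted to zero.
scaled_only = [a / std for a in ages]

# Full standardization: zero mean and unit standard deviation.
standardized = [(a - mean) / std for a in ages]

# Cross-check against figures reported in this notebook:
# minimum AGE (21) divided by the AGE standard deviation.
age_std = 9.217904068090155
first_feature = 21 / age_std
print(first_feature)
```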

In [17]:
from pyspark.ml.feature import VectorAssembler, StandardScaler

va =  VectorAssembler().setInputCols(colNum).setOutputCol("features")

featuredccdefaut2 = va.transform(ccdefaut2)

scaler =  StandardScaler().setInputCol("features").setOutputCol("scaledFeatures")
scaledccdefaut2 = scaler.fit(featuredccdefaut2).transform(featuredccdefaut2)

scaledccdefaut2.show(5)

+---+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-----+--------------------+--------------------+
| ID|LIMIT_BAL|SEX|EDUCATION|MARRIAGE|AGE|PAY_1|PAY_2|PAY_3|PAY_4|PAY_5|PAY_6|BILL_AMT1|BILL_AMT2|BILL_AMT3|BILL_AMT4|BILL_AMT5|BILL_AMT6|PAY_AMT1|PAY_AMT2|PAY_AMT3|PAY_AMT4|PAY_AMT5|PAY_AMT6|label|            features|      scaledFeatures|
+---+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-----+--------------------+--------------------+
|  1|    20000|  2|        2|       1| 24|    2|    2|   -1|   -1|   -2|   -2|     3913|     3102|      689|        0|        0|        0|       0|     689|       0|       0|       0|       0|    1|(13,[0,1,2,3,8],[...|(13,[0,1,2,3,8],[...|
|  2|   120000|  2|        2|       

## 3.2. Prepare categorical attributes
After scaling the continuous attributes, we should take care of the categorical attributes: `SEX`, `EDUCATION`, `MARRIAGE`, and `PAY_1` to `PAY_6`. Let's first print the number of distinct values of some of them.

In [18]:
ccdefaut2.select(countDistinct("SEX"), countDistinct("EDUCATION"),countDistinct("MARRIAGE") ).show()
ccdefaut2.select(countDistinct("PAY_1"),countDistinct("PAY_2"), countDistinct("PAY_3") ).show()


+-------------------+-------------------------+------------------------+
|count(DISTINCT SEX)|count(DISTINCT EDUCATION)|count(DISTINCT MARRIAGE)|
+-------------------+-------------------------+------------------------+
|                  2|                        7|                       4|
+-------------------+-------------------------+------------------------+

+---------------------+---------------------+---------------------+
|count(DISTINCT PAY_1)|count(DISTINCT PAY_2)|count(DISTINCT PAY_3)|
+---------------------+---------------------+---------------------+
|                   11|                   11|                   11|
+---------------------+---------------------+---------------------+



### String indexer
Most Machine Learning algorithms prefer to work with numbers. The categorical attributes here are already integer codes, but we re-index them with `StringIndexer` as a first step toward one-hot encoding. `StringIndexer` encodes a column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.

In [19]:
from pyspark.ml.feature import StringIndexer
# StringIndexer orders the indices by label frequency (0 for the most frequent, 1 for the second most frequent, ...)

indexer = StringIndexer().setInputCols(colCat).setOutputCols([c + "_index" for c in colCat])
idxccdefaut2 = indexer.fit(ccdefaut2).transform(ccdefaut2)

idxccdefaut2.show(5, vertical=True)

-RECORD 0-----------------
 ID              | 1      
 LIMIT_BAL       | 20000  
 SEX             | 2      
 EDUCATION       | 2      
 MARRIAGE        | 1      
 AGE             | 24     
 PAY_1           | 2      
 PAY_2           | 2      
 PAY_3           | -1     
 PAY_4           | -1     
 PAY_5           | -2     
 PAY_6           | -2     
 BILL_AMT1       | 3913   
 BILL_AMT2       | 3102   
 BILL_AMT3       | 689    
 BILL_AMT4       | 0      
 BILL_AMT5       | 0      
 BILL_AMT6       | 0      
 PAY_AMT1        | 0      
 PAY_AMT2        | 689    
 PAY_AMT3        | 0      
 PAY_AMT4        | 0      
 PAY_AMT5        | 0      
 PAY_AMT6        | 0      
 label           | 1      
 EDUCATION_index | 0.0    
 PAY_1_index     | 4.0    
 PAY_5_index     | 2.0    
 PAY_2_index     | 2.0    
 MARRIAGE_index  | 1.0    
 PAY_3_index     | 1.0    
 PAY_4_index     | 1.0    
 SEX_index       | 0.0    
 PAY_6_index     | 2.0    
-RECORD 1-----------------
 ID              | 2      
 

Now we can use this numerical data in any Machine Learning algorithm. You can look at the mapping that the indexer has learned through the `labelsArray` attribute of the fitted model: for `EDUCATION`, for example, "2" is mapped to 0, "1" is mapped to 1, etc.

In [20]:
indexer.fit(ccdefaut2).labelsArray

[('2', '1'),
 ('2', '1', '3', '5', '4', '6', '0'),
 ('2', '1', '3', '0'),
 ('0', '-1', '1', '-2', '2', '3', '4', '5', '8', '6', '7'),
 ('0', '-1', '2', '-2', '3', '4', '1', '5', '7', '6', '8'),
 ('0', '-1', '-2', '2', '3', '4', '7', '6', '5', '1', '8'),
 ('0', '-1', '-2', '2', '3', '4', '7', '5', '6', '1', '8'),
 ('0', '-1', '-2', '2', '3', '4', '7', '5', '6', '8'),
 ('0', '-1', '-2', '2', '3', '4', '7', '6', '5', '8')]
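
The frequency-based ordering can be reproduced by hand. The sketch below, in plain Python, sorts the `EDUCATION` codes by descending count (counts copied from section 2.3) and recovers the second tuple of the output above. Breaking ties alphabetically is an assumption about the indexer's behavior; there are no ties in these counts, so it does not affect the result.

```python
# EDUCATION counts from section 2.3, with the codes written as strings
# because the indexer treats the values as string labels.
effectif = {"2": 14030, "1": 10585, "3": 4917, "5": 280, "4": 123, "6": 51, "0": 14}

# Most frequent label first; ties (none here) broken alphabetically.
labels = sorted(effectif, key=lambda lab: (-effectif[lab], lab))

# Map each label to its index, as a float like the *_index columns.
index_of = {lab: float(i) for i, lab in enumerate(labels)}

print(labels)
print(index_of)
```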

### One-hot encoding
Now, convert the label indices built in the last step into one-hot vectors. To do this, you can take advantage of the `OneHotEncoder` Estimator (named `OneHotEncoderEstimator` in Spark 2.3 and 2.4).
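
As a rough illustration of what the encoder produces, here is a plain-Python sketch of one-hot encoding in which the last category is dropped, which is the encoder's default behavior. This explains why `MARRIAGE`, with 4 distinct values, yields vectors of size 3, such as the value `(3,[1],[1.0])` that appears later in this notebook. The function `one_hot` is a hypothetical helper written for this example, not part of Spark.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the last category is encoded as all zeros,
    so the vector has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# MARRIAGE has 4 distinct values; index 1 becomes a size-3 vector.
print(one_hot(1, 4))
# The last category (index 3) is represented by all zeros.
print(one_hot(3, 4))
```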

In [21]:
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder().setInputCols([c + "_index" for c in colCat]).setOutputCols([c + "_vec" for c in colCat])
ohccdefaut2 = encoder.fit(idxccdefaut2).transform(idxccdefaut2)

ohccdefaut2.show(5, vertical=True)

-RECORD 0-------------------------
 ID              | 1              
 LIMIT_BAL       | 20000          
 SEX             | 2              
 EDUCATION       | 2              
 MARRIAGE        | 1              
 AGE             | 24             
 PAY_1           | 2              
 PAY_2           | 2              
 PAY_3           | -1             
 PAY_4           | -1             
 PAY_5           | -2             
 PAY_6           | -2             
 BILL_AMT1       | 3913           
 BILL_AMT2       | 3102           
 BILL_AMT3       | 689            
 BILL_AMT4       | 0              
 BILL_AMT5       | 0              
 BILL_AMT6       | 0              
 PAY_AMT1        | 0              
 PAY_AMT2        | 689            
 PAY_AMT3        | 0              
 PAY_AMT4        | 0              
 PAY_AMT5        | 0              
 PAY_AMT6        | 0              
 label           | 1              
 EDUCATION_index | 0.0            
 PAY_1_index     | 4.0            
 PAY_5_index     | 2

---
# 4. Pipeline
As you can see, there are many data transformation steps that need to be executed in the right order. For example, you called `VectorAssembler` and then `StandardScaler`. However, we can use the `Pipeline` class to define a sequence of Transformers/Estimators, and run them in order. A `Pipeline` is an `Estimator`, thus, after a Pipeline's `fit()` method runs, it produces a `PipelineModel`, which is a `Transformer`.

Now, let's create a pipeline called `numPipeline` to call the numerical transformers you built above (`va` and `scaler`) in the right order, as well as a pipeline called `catPipeline` to call the categorical transformers (`indexer` and `encoder`). Then, put these two pipelines `numPipeline` and `catPipeline` into one pipeline.
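
The Estimator/Transformer relationship can be illustrated with a toy pipeline in plain Python. This is only a sketch of the idea, not Spark's implementation: all class names are hypothetical, and the data is a simple list of numbers rather than a DataFrame.

```python
class Doubler:
    """Toy transformer: needs no fitting."""
    def transform(self, data):
        return [2 * x for x in data]

class MeanCenterer:
    """Toy estimator: fit() learns the mean and returns a transformer."""
    def fit(self, data):
        return MeanCentererModel(sum(data) / len(data))

class MeanCentererModel:
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

class ToyPipeline:
    """Fits each stage in order on the output of the previous stage."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted; plain transformers are used as they are.
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            data = model.transform(data)
            fitted.append(model)
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    """The fitted pipeline is itself a transformer."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = ToyPipeline([Doubler(), MeanCenterer()]).fit([1.0, 2.0, 3.0])
result = model.transform([1.0, 2.0, 3.0])
print(result)
```

The fitted model remembers the mean learned during `fit()`, so applying it to new data reuses the training statistics instead of recomputing them.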

In [22]:
from pyspark.ml import Pipeline, PipelineModel

numPipeline =  Pipeline().setStages([va, scaler])
catPipeline =  Pipeline().setStages([indexer, encoder])
pipeline =  Pipeline().setStages([numPipeline, catPipeline])
pipfited = pipeline.fit(ccdefaut2)
newccdefaut2=pipfited.transform(ccdefaut2)

newccdefaut2.show(5, vertical=True)

-RECORD 0-------------------------------
 ID              | 1                    
 LIMIT_BAL       | 20000                
 SEX             | 2                    
 EDUCATION       | 2                    
 MARRIAGE        | 1                    
 AGE             | 24                   
 PAY_1           | 2                    
 PAY_2           | 2                    
 PAY_3           | -1                   
 PAY_4           | -1                   
 PAY_5           | -2                   
 PAY_6           | -2                   
 BILL_AMT1       | 3913                 
 BILL_AMT2       | 3102                 
 BILL_AMT3       | 689                  
 BILL_AMT4       | 0                    
 BILL_AMT5       | 0                    
 BILL_AMT6       | 0                    
 PAY_AMT1        | 0                    
 PAY_AMT2        | 689                  
 PAY_AMT3        | 0                    
 PAY_AMT4        | 0                    
 PAY_AMT5        | 0                    
 PAY_AMT6       

In [23]:
newccdefaut2.describe("AGE").show(5)

+-------+-----------------+
|summary|              AGE|
+-------+-----------------+
|  count|            30000|
|   mean|          35.4855|
| stddev|9.217904068090155|
|    min|               21|
|    max|               79|
+-------+-----------------+



Now, drop the intermediate columns of `newccdefaut2`, use `VectorAssembler` to put all remaining attributes into one big vector, scale it, and call the resulting column `features`.

In [24]:

# Drop the raw and indexed categorical columns and the intermediate feature vectors
finalccdefaut2 = newccdefaut2.drop(*colCat, *[c + "_index" for c in colCat], "features", "scaledFeatures")
finalccdefaut2.printSchema()

newColNum = [c for c in finalccdefaut2.columns if (c not in colCat and c != colLabel and c not in unused)]

va2 =  VectorAssembler().setInputCols(newColNum).setOutputCol("to_be_scaled_features")

featuredfinalccdefaut2 = va2.transform(finalccdefaut2)

scaler2 =  StandardScaler().setInputCol("to_be_scaled_features").setOutputCol("features")

dataset = scaler2.fit(featuredfinalccdefaut2).transform(featuredfinalccdefaut2).select("features", "label")

dataset.show(5)

root
 |-- ID: integer (nullable = true)
 |-- LIMIT_BAL: integer (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- BILL_AMT1: integer (nullable = true)
 |-- BILL_AMT2: integer (nullable = true)
 |-- BILL_AMT3: integer (nullable = true)
 |-- BILL_AMT4: integer (nullable = true)
 |-- BILL_AMT5: integer (nullable = true)
 |-- BILL_AMT6: integer (nullable = true)
 |-- PAY_AMT1: integer (nullable = true)
 |-- PAY_AMT2: integer (nullable = true)
 |-- PAY_AMT3: integer (nullable = true)
 |-- PAY_AMT4: integer (nullable = true)
 |-- PAY_AMT5: integer (nullable = true)
 |-- PAY_AMT6: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- PAY_3_vec: vector (nullable = true)
 |-- MARRIAGE_vec: vector (nullable = true)
 |-- PAY_5_vec: vector (nullable = true)
 |-- PAY_4_vec: vector (nullable = true)
 |-- EDUCATION_vec: vector (nullable = true)
 |-- SEX_vec: vector (nullable = true)
 |-- PAY_2_vec: vector (nullable = true)
 |-- PAY_1_vec: vector (nullable = true)
 |-- PAY_6_ve

In [25]:
featuredfinalccdefaut2.show(7, vertical=True)

-RECORD 0-------------------------------------
 ID                    | 1                    
 LIMIT_BAL             | 20000                
 AGE                   | 24                   
 BILL_AMT1             | 3913                 
 BILL_AMT2             | 3102                 
 BILL_AMT3             | 689                  
 BILL_AMT4             | 0                    
 BILL_AMT5             | 0                    
 BILL_AMT6             | 0                    
 PAY_AMT1              | 0                    
 PAY_AMT2              | 689                  
 PAY_AMT3              | 0                    
 PAY_AMT4              | 0                    
 PAY_AMT5              | 0                    
 PAY_AMT6              | 0                    
 label                 | 1                    
 PAY_3_vec             | (10,[1],[1.0])       
 MARRIAGE_vec          | (3,[1],[1.0])        
 PAY_5_vec             | (9,[2],[1.0])        
 PAY_4_vec             | (10,[1],[1.0])       
 EDUCATION_ve

---
# 5. Make a model
Here we are going to build three different models:
* Logistic regression model
* Decision tree regression
* Random forest regression


But, before giving the data to train a Machine Learning model, let's first split the data into training dataset (`trainSet`) with 80% of the whole data, and test dataset (`testSet`) with 20% of it.

In [26]:
trainSet, testSet = dataset.randomSplit([0.8,0.2])
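
Because the split is random, the numbers reported below change from run to run unless a seed is fixed; `randomSplit` accepts an optional `seed` argument for that purpose. The plain-Python sketch below illustrates the idea of a seeded random split. The helper `random_split` is hypothetical and only mimics the principle; note that the resulting proportions are approximate rather than exactly 80/20.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to train or test at random, reproducibly."""
    rng = random.Random(seed)
    train_share = weights[0] / sum(weights)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_share else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows, [0.8, 0.2], seed=42)
train_b, test_b = random_split(rows, [0.8, 0.2], seed=42)

# Same seed, same split; sizes are close to, but not exactly, 800 and 200.
print(len(train_a), len(test_a), train_a == train_b)
```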

In [33]:
# Print the schema
trainSet.printSchema()


# Show the first row without truncating the columns
trainSet.show(1,False)

root
 |-- features: vector (nullable = true)
 |-- label: integer (nullable = true)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|features                                                                                                                                                                                                                                                                                                                                                                                                                              

## 5.1. Logistic regression model
Now, train a logistic regression model using the `LogisticRegression` class.

In [53]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
#from pyspark.ml.evaluation import RegressionEvaluator

lr =LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
lrModel = lr.fit(trainSet)
predictions=lrModel.transform(testSet)
predictions.select('label','features','prediction','probability').toPandas().head(7)




Unnamed: 0,label,features,prediction,probability
0,0,"(2.2781751518434885, 0.556372936770088, 0.4597...",0.0,"[0.7783659858274281, 0.2216340141725719]"
1,0,"(2.386659682883655, 0.023656951740425522, 0.03...",0.0,"[0.7783659858274281, 0.2216340141725719]"
2,0,"(2.386659682883655, 0.034276777378205524, 0.07...",0.0,"[0.7783659858274281, 0.2216340141725719]"
3,0,"(2.386659682883655, 0.09682782199152352, 0.115...",0.0,"[0.7783659858274281, 0.2216340141725719]"
4,0,"(2.386659682883655, 0.21729901538378232, 0.177...",0.0,"[0.7783659858274281, 0.2216340141725719]"
5,0,"(2.386659682883655, 0.23072997133744527, 0.252...",0.0,"[0.7783659858274281, 0.2216340141725719]"
6,0,"(2.386659682883655, 0.24182510886438421, 0.260...",0.0,"[0.7783659858274281, 0.2216340141725719]"


In [54]:
#evaluation
evaluator=BinaryClassificationEvaluator()
print('Test area under ROC', evaluator.evaluate(predictions))

#accuracy

accuracy=predictions.filter(predictions.label==predictions.prediction).count()/float(predictions.count())
print("Accuracy : ",accuracy)

Test area under ROC 0.5
Accuracy :  0.7805324459234609
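
The probabilities printed above are identical for every row, which is what an area under the ROC curve of exactly 0.5 reflects: a constant score cannot rank defaulters above non-defaulters. One plausible reading, which is an assumption not verified in this notebook, is that the strong elastic-net penalty (`regParam=0.3`, `elasticNetParam=0.8`) shrank all coefficients to zero, leaving only the intercept. The plain-Python sketch below computes AUC by pairwise comparison on made-up labels and scores to show the effect; `auc` is a hypothetical helper.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels: 7 non-defaults and 3 defaults.
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# Every row gets the same score, as in the output above.
constant_scores = [0.2216340141725719] * len(labels)

# Made-up scores that carry some signal, for comparison.
informative_scores = [0.1, 0.2, 0.15, 0.8, 0.3, 0.6, 0.05, 0.25, 0.2, 0.1]

auc_constant = auc(labels, constant_scores)
auc_informative = auc(labels, informative_scores)

# Always predicting class 0 still scores the share of class 0 as accuracy.
accuracy_all_zero = labels.count(0) / len(labels)

print(auc_constant, auc_informative, accuracy_all_zero)
```

This is why the accuracy of about 0.78 above is not evidence of a useful model: it matches the share of non-defaulters.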


## 5.2. Decision tree regression
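Repeat what you have done for the logistic regression model to build a Decision Tree model. Use the `DecisionTreeRegressor` to make a model and then measure its RMSE on the test dataset. Note that a regressor predicts a continuous score rather than a class, so the accuracy computed below by exact match with the 0/1 label is close to zero and not meaningful; `DecisionTreeClassifier` from `pyspark.ml.classification` is the classification counterpart.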

In [61]:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

dt = DecisionTreeRegressor().setLabelCol("label").setFeaturesCol("features")

# train the model
dtModel = dt.fit(trainSet)

# make predictions on the test data
predictions = dtModel.transform(testSet)
# to compare with the other algorithms
predictions.select("prediction", "label", "features").toPandas().head(7)





Unnamed: 0,prediction,label,features
0,0.198531,0,"(2.2781751518434885, 0.556372936770088, 0.4597..."
1,0.101804,0,"(2.386659682883655, 0.023656951740425522, 0.03..."
2,0.101804,0,"(2.386659682883655, 0.034276777378205524, 0.07..."
3,0.198531,0,"(2.386659682883655, 0.09682782199152352, 0.115..."
4,0.198531,0,"(2.386659682883655, 0.21729901538378232, 0.177..."
5,0.101804,0,"(2.386659682883655, 0.23072997133744527, 0.252..."
6,0.101804,0,"(2.386659682883655, 0.24182510886438421, 0.260..."


In [62]:

# select (prediction, true label) and compute test error
evaluator1 =  RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("rmse")
rmse = evaluator1.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = "+ str(rmse))


# Select (prediction, true label) and compute the accuracy (exact match between prediction and label)
evaluator2 = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator2.evaluate(predictions)
print(accuracy)
print("Test Error = %g " % (1.0 - accuracy))

Root Mean Squared Error (RMSE) on test data = 0.3743126974182259
0.0004991680532445924
Test Error = 0.999501 
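
The near-zero accuracy above follows from comparing continuous predictions with 0/1 labels by exact equality. The plain-Python sketch below shows the effect and how thresholding the score at 0.5 would turn it into a class prediction. The first, second, and fourth scores are copied from the output above; the other scores and all labels are made up for illustration.

```python
# Labels are illustrative; 0.62 and 0.41 are made-up scores.
labels = [0, 0, 1, 0, 1]
scores = [0.198531, 0.101804, 0.62, 0.101804, 0.41]

# Exact-match accuracy: a continuous score almost never equals 0 or 1.
exact_match = sum(s == y for s, y in zip(scores, labels)) / len(labels)

# Thresholding at 0.5 converts each score into a class first.
predicted = [1 if s >= 0.5 else 0 for s in scores]
thresholded = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

print(exact_match, thresholded)
```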


## 5.3. Random forest regression
Let's measure the test error of a Random Forest model. You can use the `RandomForestRegressor` to make a Random Forest model.

In [63]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

rf =   RandomForestRegressor().setLabelCol("label").setFeaturesCol("features")

# train the model
rfModel = rf.fit(trainSet)

# make predictions on the test data
predictions = rfModel.transform(testSet)
predictions.select("prediction", "label", "features").show(5)

# select (prediction, true label) and compute test error
evaluator3 =  RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("rmse")
rmse = evaluator3.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = "+ str(rmse))

# Select (prediction, true label) and compute the accuracy (exact match between prediction and label)
evaluator4 = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator4.evaluate(predictions)
print(accuracy)
print("Test Error = %g" % (1.0 - accuracy))





+-------------------+-----+--------------------+
|         prediction|label|            features|
+-------------------+-----+--------------------+
|0.14627575024308326|    0|(81,[0,1,2,3,4,5,...|
|0.13244026177763302|    0|(81,[0,1,2,3,4,5,...|
|0.12949430812352608|    0|(81,[0,1,2,3,4,5,...|
| 0.1838565811852626|    0|(81,[0,1,2,3,4,5,...|
|0.16783072182642905|    0|(81,[0,1,2,3,4,5,...|
+-------------------+-----+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.37156609892244463
0.0
Test Error = 1
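
To put the RMSE values in context, it helps to compare them with a trivial predictor that outputs the overall default rate for every client. The plain-Python sketch below computes that baseline from the class counts of section 2.3. It uses the full dataset rather than the test split, whose exact composition is not shown here, so the comparison is only indicative; `rmse` is a hypothetical helper.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    squared = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(labels))

# Class counts from section 2.3 (whole dataset, not the test split).
n_default, n_total = 6636, 30000
p = n_default / n_total

labels = [1] * n_default + [0] * (n_total - n_default)

# Constant predictor: the default rate for everyone.
baseline = rmse(labels, [p] * n_total)

print(f"baseline RMSE: {baseline:.4f}")
```

The RMSE values reported above for the decision tree and the random forest are both below this baseline of roughly 0.415, which suggests the tree-based models capture some signal.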


In [56]:
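# to compare with the other algorithms
# Note: the cell number (In [56]) and the identical probabilities suggest this cell was run
# right after the logistic regression cells, so `predictions` here holds the logistic regression output.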
predictions.select("prediction", "rawPrediction", "probability", "label", "features").toPandas().head(7)

Unnamed: 0,prediction,rawPrediction,probability,label,features
0,0.0,"[1.2561693957147415, -1.2561693957147415]","[0.7783659858274281, 0.2216340141725719]",0,"(2.2781751518434885, 0.556372936770088, 0.4597..."
1,0.0,"[1.2561693957147415, -1.2561693957147415]","[0.7783659858274281, 0.2216340141725719]",0,"(2.386659682883655, 0.023656951740425522, 0.03..."
2,0.0,"[1.2561693957147415, -1.2561693957147415]","[0.7783659858274281, 0.2216340141725719]",0,"(2.386659682883655, 0.034276777378205524, 0.07..."
3,0.0,"[1.2561693957147415, -1.2561693957147415]","[0.7783659858274281, 0.2216340141725719]",0,"(2.386659682883655, 0.09682782199152352, 0.115..."
4,0.0,"[1.2561693957147415, -1.2561693957147415]","[0.7783659858274281, 0.2216340141725719]",0,"(2.386659682883655, 0.21729901538378232, 0.177..."
5,0.0,"[1.2561693957147415, -1.2561693957147415]","[0.7783659858274281, 0.2216340141725719]",0,"(2.386659682883655, 0.23072997133744527, 0.252..."
6,0.0,"[1.2561693957147415, -1.2561693957147415]","[0.7783659858274281, 0.2216340141725719]",0,"(2.386659682883655, 0.24182510886438421, 0.260..."
