# <font color='blue'>Data Science Academy</font>
# <font color='blue'>Big Data Real-Time Analytics com Python e Spark</font>

## <font color='blue'>Lab 5</font>

## <font color='blue'>Machine Learning with PySpark</font>

Read the PDF manuals with the supplementary material in Chapter 14 of the course.

![title](imagens/Lab5.png)

## <font color='blue'>Binary Classification</font>

We will use the RandomForest algorithm to build a predictive model that predicts whether or not a customer will repay a bank loan.

The data is a modified version of this dataset: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

In [1]:
# Python language version
from platform import python_version
print('Python version used in this Jupyter Notebook:', python_version())

Python version used in this Jupyter Notebook: 3.9.7


In [2]:
# To update a package, run the command below in the terminal or command prompt:
# pip install -U package_name

# To install an exact version of a package, run the command below in the terminal or command prompt:
#!pip install package_name==desired_version

# After installing or updating the package, restart the Jupyter notebook.

# Install the watermark package.
# This package records the versions of the other packages used in this Jupyter notebook.
#!pip install -q -U watermark

In [3]:
# Import findspark and initialize it
import findspark
findspark.init()

In [4]:
# Imports
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import PCA
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [5]:
# Versions of the packages used in this Jupyter notebook
%reload_ext watermark
%watermark -a "Data Science Academy" --iversions

Author: Data Science Academy

pyspark  : 3.3.0
findspark: 2.0.1



## Loading the Data

In [6]:
# Create the Spark Context
sc = SparkContext(appName = "Lab5")

22/08/24 17:53:14 WARN Utils: Your hostname, falcon.local resolves to a loopback address: 127.0.0.1; using 10.0.0.87 instead (on interface en0)
22/08/24 17:53:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/08/24 17:53:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/24 17:53:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/08/24 17:53:16 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [7]:
sc.setLogLevel("ERROR")

In [8]:
# Spark Session - used when working with DataFrames in Spark
spSession = SparkSession.builder.master("local").getOrCreate()

In [9]:
# Load the data as an RDD of text lines
bankRDD = sc.textFile("dados/dataset3.csv")

In [10]:
bankRDD.cache()

dados/dataset3.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [11]:
bankRDD.count()

542

In [12]:
bankRDD.take(5)

['"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"',
 '30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"',
 '33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"yes"',
 '35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"yes"',
 '30;"management";"married";"tertiary";"no";1476;"yes";"yes";"unknown";3;"jun";199;4;-1;0;"unknown";"yes"']

In [13]:
# Remove the first line of the file (header)
firstLine = bankRDD.first()
bankRDD2 = bankRDD.filter(lambda x: x != firstLine)
bankRDD2.count()

541
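
The file is semicolon-delimited with double-quoted text fields. As a cross-check outside Spark, the same layout can be parsed with Python's standard `csv` module, which also consumes the header. This is only a sketch on two records copied from the output above; the name `sample` is illustrative and not part of the lab.

```python
import csv
import io

# Header and two records copied from the bankRDD.take(5) output
sample = (
    '"age";"job";"marital";"education";"default";"balance";"housing";"loan";'
    '"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"\n'
    '30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"\n'
    '33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"yes"\n'
)

# DictReader takes the first line as the header and strips the quotes
reader = csv.DictReader(io.StringIO(sample), delimiter=";", quotechar='"')
rows = list(reader)

print(len(rows))  # number of data records, header excluded
print(rows[0]["age"], rows[0]["marital"], rows[0]["y"])
```

All values come back as strings, so numeric fields still need an explicit conversion, as the lab does with `float(...)`.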

## Data Cleaning and Transformation

We will keep only some of the original variables and then derive new ones from them. This choice can be guided by knowledge of the business domain and by the aim of avoiding possible bias. Other feature selection techniques could also be used.
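
The two encodings used in this section can be previewed on a single record without Spark. The sketch below is illustrative only: it takes one line copied from the output above, one-hot encodes the marital field, and label encodes the loan field. The names `marital_levels` and `marital_one_hot` are assumptions introduced for the example.

```python
# One record copied from the bankRDD.take(5) output
line = '33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"yes"'

# Strip the quotes and split on the semicolon delimiter
fields = line.replace('"', '').split(';')

# One-hot encoding: one 0/1 column per category of the marital field
marital_levels = ["single", "married", "divorced"]
marital_one_hot = {level.upper(): float(fields[2] == level) for level in marital_levels}

# Label encoding: a two-level field collapses into a single 0/1 column
loan = 0.0 if fields[7] == "no" else 1.0

print(marital_one_hot)
print(loan)
```

One-hot encoding suits categories with no natural order, while a yes/no field needs only one column.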

In [14]:
# Transform the data into numeric values
def transformToNumeric(inputStr) :

    # Strip the double quotes from the line and split it into columns
    attList = inputStr.replace("\"","").split(";")

    # Convert the numeric fields from string to float
    age = float(attList[0])
    balance = float(attList[5])

    # Apply one-hot encoding by creating dummy variables
    single = 1.0 if attList[2] == "single" else 0.0
    married = 1.0 if attList[2] == "married" else 0.0
    divorced = 1.0 if attList[2] == "divorced" else 0.0
    primary = 1.0 if attList[3] == "primary" else 0.0
    secondary = 1.0 if attList[3] == "secondary" else 0.0
    tertiary = 1.0 if attList[3] == "tertiary" else 0.0

    # Apply label encoding, mapping each yes/no variable to 1.0/0.0
    default = 0.0 if attList[4] == "no" else 1.0
    loan = 0.0 if attList[7] == "no" else 1.0
    outcome = 0.0 if attList[16] == "no" else 1.0

    # Build the Row with the transformed attributes
    linhas = Row(OUTCOME = outcome, 
                 AGE = age, 
                 SINGLE = single, 
                 MARRIED = married, 
                 DIVORCED = divorced,
                 PRIMARY = primary, 
                 SECONDARY = secondary, 
                 TERTIARY = tertiary, 
                 DEFAULT = default, 
                 BALANCE = balance,
                 LOAN = loan) 
    return linhas

In [15]:
# Apply the transformation function to the dataset
bankRDD3 = bankRDD2.map(transformToNumeric)

In [16]:
bankRDD3.collect()[:15]

[Row(OUTCOME=0.0, AGE=30.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=1.0, SECONDARY=0.0, TERTIARY=0.0, DEFAULT=0.0, BALANCE=1787.0, LOAN=0.0),
 Row(OUTCOME=1.0, AGE=33.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=1.0, TERTIARY=0.0, DEFAULT=0.0, BALANCE=4789.0, LOAN=1.0),
 Row(OUTCOME=1.0, AGE=35.0, SINGLE=1.0, MARRIED=0.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=0.0, TERTIARY=1.0, DEFAULT=0.0, BALANCE=1350.0, LOAN=0.0),
 Row(OUTCOME=1.0, AGE=30.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=0.0, TERTIARY=1.0, DEFAULT=0.0, BALANCE=1476.0, LOAN=1.0),
 Row(OUTCOME=0.0, AGE=59.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=1.0, TERTIARY=0.0, DEFAULT=0.0, BALANCE=0.0, LOAN=0.0),
 Row(OUTCOME=1.0, AGE=35.0, SINGLE=1.0, MARRIED=0.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=0.0, TERTIARY=1.0, DEFAULT=0.0, BALANCE=747.0, LOAN=0.0),
 Row(OUTCOME=1.0, AGE=36.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=0.0, TERTIARY=1.0, D

## Exploratory Data Analysis

In [17]:
# Convert to a DataFrame
bankDF = spSession.createDataFrame(bankRDD3)

In [18]:
# Descriptive statistics
bankDF.describe().show()

+-------+-------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+--------------------+------------------+-------------------+
|summary|            OUTCOME|               AGE|            SINGLE|           MARRIED|           DIVORCED|           PRIMARY|         SECONDARY|          TERTIARY|             DEFAULT|           BALANCE|               LOAN|
+-------+-------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+--------------------+------------------+-------------------+
|  count|                541|               541|               541|               541|                541|               541|               541|               541|                 541|               541|                541|
|   mean| 0.3974121996303142| 41.26987060998152|0.2754158964879852|0.6155268022181146|0.1090573012939001

In [19]:
# Correlation between OUTCOME and each numeric variable
for i in bankDF.columns:
    if not( isinstance(bankDF.select(i).take(1)[0][0], str)) :
        print( "Correlation of OUTCOME with:", i, bankDF.stat.corr('OUTCOME',i))

Correlation of OUTCOME with: OUTCOME 1.0
Correlation of OUTCOME with: AGE -0.1823210432736525
Correlation of OUTCOME with: SINGLE 0.46323284934360515
Correlation of OUTCOME with: MARRIED -0.3753241299133561
Correlation of OUTCOME with: DIVORCED -0.07812659940926987
Correlation of OUTCOME with: PRIMARY -0.12561548832677982
Correlation of OUTCOME with: SECONDARY 0.026392774894072973
Correlation of OUTCOME with: TERTIARY 0.08494840766635618
Correlation of OUTCOME with: DEFAULT -0.04536965206737378
Correlation of OUTCOME with: BALANCE 0.036574866119976804
Correlation of OUTCOME with: LOAN -0.030420586112717318
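
`DataFrame.stat.corr` computes the Pearson coefficient by default. As a rough illustration of what that number means, the sketch below implements the formula in plain Python and applies it to the first five OUTCOME and SINGLE values shown earlier. With only five rows, the result is not expected to match the full-data value above; the helper name `pearson` is an assumption of this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five OUTCOME and SINGLE values from the transformed rows
outcome = [0.0, 1.0, 1.0, 1.0, 0.0]
single = [0.0, 0.0, 1.0, 0.0, 0.0]

r = pearson(outcome, single)
print(round(r, 4))
```

The coefficient ranges from -1 to 1; values near zero, such as those for BALANCE and LOAN above, indicate little linear association with OUTCOME.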


## Data Preprocessing

In [20]:
# Build a (label, feature vector) tuple for each row
def transformaVar(row) :
    obj = (row["OUTCOME"], Vectors.dense([row["AGE"], 
                                          row["BALANCE"], 
                                          row["DEFAULT"], 
                                          row["DIVORCED"], 
                                          row["LOAN"], 
                                          row["MARRIED"], 
                                          row["PRIMARY"], 
                                          row["SECONDARY"], 
                                          row["SINGLE"], 
                                          row["TERTIARY"]]))
    return obj

In [21]:
# Apply the function
bankRDD4 = bankDF.rdd.map(transformaVar)

In [22]:
bankRDD4.collect()

[(0.0, DenseVector([30.0, 1787.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])),
 (1.0, DenseVector([33.0, 4789.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0])),
 (1.0, DenseVector([35.0, 1350.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0])),
 (1.0, DenseVector([30.0, 1476.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0])),
 (0.0, DenseVector([59.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])),
 (1.0, DenseVector([35.0, 747.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0])),
 (1.0, DenseVector([36.0, 307.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])),
 (0.0, DenseVector([39.0, 147.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])),
 (0.0, DenseVector([41.0, 221.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])),
 (1.0, DenseVector([43.0, -88.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])),
 (0.0, DenseVector([39.0, 9374.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])),
 (0.0, DenseVector([43.0, 264.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])),
 (0.0, DenseVector([36.0, 1109.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])),
 (1.0, D

In [23]:
# Convert the RDD to a DataFrame
bankDF = spSession.createDataFrame(bankRDD4,["label", "features"])

In [24]:
bankDF.select("features", "label").show(10)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[30.0,1787.0,0.0,...|  0.0|
|[33.0,4789.0,0.0,...|  1.0|
|[35.0,1350.0,0.0,...|  1.0|
|[30.0,1476.0,0.0,...|  1.0|
|[59.0,0.0,0.0,0.0...|  0.0|
|[35.0,747.0,0.0,0...|  1.0|
|[36.0,307.0,0.0,0...|  1.0|
|[39.0,147.0,0.0,0...|  0.0|
|[41.0,221.0,0.0,0...|  0.0|
|[43.0,-88.0,0.0,0...|  1.0|
+--------------------+-----+
only showing top 10 rows



## Dimensionality Reduction with PCA

Dimensionality reduction should be applied when the number of predictor variables is very high. Here the 10 predictor variables are reduced to 3 components.

In [25]:
# Create the PCA object with 3 components
bankPCA = PCA(k = 3, inputCol = "features", outputCol = "pcaFeatures")

In [26]:
# Fit the PCA model
pcaModel = bankPCA.fit(bankDF)

In [27]:
# Apply the PCA model to reduce dimensionality
pcaResult = pcaModel.transform(bankDF).select("label", "pcaFeatures")

In [28]:
# The information in the predictor variables is now condensed into 3 components per row.
pcaResult.show(truncate = False)

+-----+------------------------------------------------------------+
|label|pcaFeatures                                                 |
+-----+------------------------------------------------------------+
|0.0  |[-1787.018897197381,28.86209683775529,-0.06459982604876241] |
|1.0  |[-4789.020177138492,29.922562636341947,-0.9830243513096373] |
|1.0  |[-1350.022213163262,34.10110809796688,0.8951427168301704]   |
|1.0  |[-1476.0189517184556,29.051333993596703,0.3952723868021948] |
|0.0  |[-0.037889185366442445,58.9897182000177,-0.7290792383661886]|
|1.0  |[-747.0223377634923,34.48829198181773,0.9045654956970108]   |
|1.0  |[-307.0230691022593,35.799850539655225,0.5170631523785976]  |
|0.0  |[-147.0250121617634,38.90107856650329,-0.8069627548799397]  |
|0.0  |[-221.02629853487866,40.853633675694944,0.5373036365803221] |
|1.0  |[87.9723868768871,43.062659441151055,-0.0670164287117152]   |
|0.0  |[-9374.023105550941,32.97645883799288,-0.9511484606914431]  |
|0.0  |[-264.02755731528384,42.824
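
In the output above, the first component is close to the negated BALANCE value (for example -1787.02 for a balance of 1787) and the second is close to AGE. A plausible explanation is that PCA was fitted on unscaled features, so the column with the largest variance dominates. The sketch below illustrates the effect in two dimensions with plain Python, using only the first five AGE and BALANCE values, so the numbers are illustrative rather than a reproduction of the Spark result.

```python
import math

# First five AGE and BALANCE values from the rows shown earlier
age = [30.0, 33.0, 35.0, 30.0, 59.0]
balance = [1787.0, 4789.0, 1350.0, 1476.0, 0.0]

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

var_age = covariance(age, age)
var_balance = covariance(balance, balance)
cov_ab = covariance(age, balance)

# Angle of the largest-variance axis of the 2x2 covariance matrix
theta = 0.5 * math.atan2(2.0 * cov_ab, var_age - var_balance)
first_component = (math.cos(theta), math.sin(theta))  # (AGE weight, BALANCE weight)

print(var_age, var_balance)
print(first_component)
```

The BALANCE weight has magnitude close to 1, so the first component is essentially BALANCE. Standardizing the features before PCA is a common remedy; whether it would improve this model is not evaluated here.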

In [29]:
# Index the label column; tree-based classifiers in spark.ml expect class labels as indices 0.0, 1.0, ...
stringIndexer = StringIndexer(inputCol = "label", outputCol = "label_indexed")
si_model = stringIndexer.fit(pcaResult)
obj_final = si_model.transform(pcaResult)
obj_final.collect()

[Row(label=0.0, pcaFeatures=DenseVector([-1787.0189, 28.8621, -0.0646]), label_indexed=0.0),
 Row(label=1.0, pcaFeatures=DenseVector([-4789.0202, 29.9226, -0.983]), label_indexed=1.0),
 Row(label=1.0, pcaFeatures=DenseVector([-1350.0222, 34.1011, 0.8951]), label_indexed=1.0),
 Row(label=1.0, pcaFeatures=DenseVector([-1476.019, 29.0513, 0.3953]), label_indexed=1.0),
 Row(label=0.0, pcaFeatures=DenseVector([-0.0379, 58.9897, -0.7291]), label_indexed=0.0),
 Row(label=1.0, pcaFeatures=DenseVector([-747.0223, 34.4883, 0.9046]), label_indexed=1.0),
 Row(label=1.0, pcaFeatures=DenseVector([-307.0231, 35.7999, 0.5171]), label_indexed=1.0),
 Row(label=0.0, pcaFeatures=DenseVector([-147.025, 38.9011, -0.807]), label_indexed=0.0),
 Row(label=0.0, pcaFeatures=DenseVector([-221.0263, 40.8536, 0.5373]), label_indexed=0.0),
 Row(label=1.0, pcaFeatures=DenseVector([87.9724, 43.0627, -0.067]), label_indexed=1.0),
 Row(label=0.0, pcaFeatures=DenseVector([-9374.0231, 32.9765, -0.9511]), label_indexed=0.0
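
By default, `StringIndexer` casts numeric input to strings and assigns indices by descending frequency (`stringOrderType="frequencyDesc"`), so the most frequent label receives index 0.0. The plain-Python sketch below mimics that ordering with class counts derived from the `describe()` output (a mean OUTCOME of about 0.3974 over 541 rows implies 215 ones). The alphabetical tie-break in the sort key is an assumption of this sketch.

```python
from collections import Counter

# Class counts implied by describe(): 215 rows with 1.0 and 326 rows with 0.0
labels = [1.0] * 215 + [0.0] * 326

# Count the string form of each label, then rank by descending frequency
counts = Counter(str(value) for value in labels)
ordered = sorted(counts, key=lambda value: (-counts[value], value))
index_of = {value: float(position) for position, value in enumerate(ordered)}

print(index_of)
```

Because 0.0 is the majority class here, the indexed labels coincide with the original ones, which matches the `label_indexed` column above.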

## Machine Learning

In [30]:
# Training and test data (70/30 split)
(dados_treino, dados_teste) = obj_final.randomSplit([0.7, 0.3])

In [31]:
dados_treino.count()

379

In [32]:
dados_teste.count()

162
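
`randomSplit` samples rows at random, so the resulting sizes are only approximately 70/30, and it accepts an optional `seed` argument. Without a seed, the split and every downstream metric can change between runs. The sketch below is a rough plain-Python analogy, not Spark's implementation: each row is assigned by an independent random draw, and fixing the seed makes the split repeatable. The helper `random_split` and the seed value 42 are assumptions of this example.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(541))  # stand-in for the 541 records
train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```

In the notebook itself, the equivalent change would be passing `seed=` to `obj_final.randomSplit`.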

In [33]:
# Create the classifier object
rfClassifer = RandomForestClassifier(labelCol = "label_indexed", featuresCol = "pcaFeatures")

In [34]:
# Fit the classifier to produce the model
modelo = rfClassifer.fit(dados_treino)

In [35]:
# Predictions on the test data
predictions = modelo.transform(dados_teste)

In [36]:
predictions

DataFrame[label: double, pcaFeatures: vector, label_indexed: double, rawPrediction: vector, probability: vector, prediction: double]

In [37]:
predictions.select("label", "label_indexed", "pcaFeatures", "prediction").collect()

[Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-14093.0337, 47.9412, -0.9569]), prediction=1.0),
 Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-11494.0342, 49.61, -0.9162]), prediction=1.0),
 Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-9374.0231, 32.9765, -0.9511]), prediction=1.0),
 Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-8104.0336, 49.7873, -0.8708]), prediction=1.0),
 Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-7190.0255, 37.3733, 0.7344]), prediction=0.0),
 Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-6313.0372, 55.9407, -0.1054]), prediction=1.0),
 Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-5996.0302, 45.1426, -0.8606]), prediction=1.0),
 Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-5883.0251, 37.2181, 0.4488]), prediction=0.0),
 Row(label=0.0, label_indexed=0.0, pcaFeatures=DenseVector([-3762.0275, 41.5791, 0.4933]), prediction=0.0),
 Row(label=0.0, label_

In [38]:
# Evaluate accuracy
evaluator = MulticlassClassificationEvaluator(predictionCol = "prediction", 
                                              labelCol = "label_indexed", 
                                              metricName = "accuracy")

In [39]:
evaluator.evaluate(predictions)      

0.6358024691358025

In [40]:
# Confusion Matrix
predictions.groupBy("label_indexed", "prediction").count().show()

+-------------+----------+-----+
|label_indexed|prediction|count|
+-------------+----------+-----+
|          1.0|       1.0|   24|
|          0.0|       1.0|   18|
|          1.0|       0.0|   41|
|          0.0|       0.0|   79|
+-------------+----------+-----+
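
The counts in this table are enough to recompute the accuracy reported earlier and to derive precision and recall. The sketch below copies the four counts and treats 1.0 as the positive class, which is an assumption made for this example.

```python
# Counts copied from the confusion matrix above, with 1.0 as the positive class
tp = 24  # label 1.0, predicted 1.0
fp = 18  # label 0.0, predicted 1.0
fn = 41  # label 1.0, predicted 0.0
tn = 79  # label 0.0, predicted 0.0

total = tp + fp + fn + tn
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted positives that are correct
recall = tp / (tp + fn)        # share of actual positives that were found

print(total, accuracy)
print(round(precision, 4), round(recall, 4))
```

The total matches the 162 test rows and the accuracy matches the evaluator's result. Recall for class 1.0 is noticeably lower than the overall accuracy, which the single accuracy figure does not reveal.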



# End