# Logistic Regression - Classification


[Confusion Matrix](https://pt.wikipedia.org/wiki/Matriz_de_confus%C3%A3o)

[ROC Curve](https://pt.wikipedia.org/wiki/Caracter%C3%ADstica_de_Opera%C3%A7%C3%A3o_do_Receptor)

In [1]:
# Initializing Spark
import findspark
findspark.init('/home/macaubas/spark-3.2.1-bin-hadoop3.2')
import pyspark

from pyspark.sql import SparkSession 

spark = SparkSession.builder.appName("logisticReg").getOrCreate()

22/05/16 13:33:23 WARN Utils: Your hostname, macaubas-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
22/05/16 13:33:23 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/16 13:33:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Imports

In [39]:
# Logistic regression
from pyspark.ml.classification import LogisticRegression

# Features
from pyspark.ml.feature import (VectorAssembler, VectorIndexer,
                               OneHotEncoder, StringIndexer)

# Pipeline
from pyspark.ml import Pipeline

# Evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [27]:
# Load data
df = spark.read.csv('titanic.csv', inferSchema=True, header=True)

# Schema
df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [28]:
# Preview the first rows
for e in df.head(5):
    print(e)
    print('\n')

Row(PassengerId=1, Survived=0, Pclass=3, Name='Braund, Mr. Owen Harris', Sex='male', Age=22.0, SibSp=1, Parch=0, Ticket='A/5 21171', Fare=7.25, Cabin=None, Embarked='S')


Row(PassengerId=2, Survived=1, Pclass=1, Name='Cumings, Mrs. John Bradley (Florence Briggs Thayer)', Sex='female', Age=38.0, SibSp=1, Parch=0, Ticket='PC 17599', Fare=71.2833, Cabin='C85', Embarked='C')


Row(PassengerId=3, Survived=1, Pclass=3, Name='Heikkinen, Miss. Laina', Sex='female', Age=26.0, SibSp=0, Parch=0, Ticket='STON/O2. 3101282', Fare=7.925, Cabin=None, Embarked='S')


Row(PassengerId=4, Survived=1, Pclass=1, Name='Futrelle, Mrs. Jacques Heath (Lily May Peel)', Sex='female', Age=35.0, SibSp=1, Parch=0, Ticket='113803', Fare=53.1, Cabin='C123', Embarked='S')


Row(PassengerId=5, Survived=0, Pclass=3, Name='Allen, Mr. William Henry', Sex='male', Age=35.0, SibSp=0, Parch=0, Ticket='373450', Fare=8.05, Cabin=None, Embarked='S')




In [29]:
# Selecting the columns of interest
my_cols = df.select(['Survived','Pclass','Sex',
                     'Age','SibSp','Parch',
                     'Fare','Embarked'])

In [30]:
# Dropping rows with missing values
final_data = my_cols.na.drop()

# Preview the first rows
for e in final_data.head(5):
    print(e)
    print('\n')

Row(Survived=0, Pclass=3, Sex='male', Age=22.0, SibSp=1, Parch=0, Fare=7.25, Embarked='S')


Row(Survived=1, Pclass=1, Sex='female', Age=38.0, SibSp=1, Parch=0, Fare=71.2833, Embarked='C')


Row(Survived=1, Pclass=3, Sex='female', Age=26.0, SibSp=0, Parch=0, Fare=7.925, Embarked='S')


Row(Survived=1, Pclass=1, Sex='female', Age=35.0, SibSp=1, Parch=0, Fare=53.1, Embarked='S')


Row(Survived=0, Pclass=3, Sex='male', Age=35.0, SibSp=0, Parch=0, Fare=8.05, Embarked='S')




## StringIndexer and OneHotEncoder

In [31]:
## Sex

# StringIndexer: converts each string category into a numeric index
gender_indexer = StringIndexer(inputCol='Sex', outputCol='Sex_Index')

# OneHotEncoder: turns the numeric index into a one-hot vector
gender_encoder = OneHotEncoder(inputCol='Sex_Index', outputCol='Sex_Vec')

In [32]:
## Embarked

# StringIndexer
embark_indexer = StringIndexer(inputCol='Embarked', outputCol='Embarked_Index')

# OneHotEncoder
embark_encoder = OneHotEncoder(inputCol='Embarked_Index', outputCol='Embarked_Vec')
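
To make the two steps concrete, here is a plain-Python sketch of what an indexer followed by an encoder produces. It is not Spark's implementation: it assumes the default settings (`stringOrderType='frequencyDesc'` for `StringIndexer`, `dropLast=True` for `OneHotEncoder`), uses only the five `Sex` values from the preview rows shown earlier, and the helper names `fit_string_index` and `one_hot_drop_last` are made up for illustration. On the full dataset the frequencies, and therefore the indices, may differ.

```python
from collections import Counter

# The five 'Sex' values from the preview rows shown earlier.
sex_values = ['male', 'female', 'female', 'female', 'male']

def fit_string_index(values):
    """Map each label to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot_drop_last(index, n_categories):
    """One-hot vector of length n_categories - 1; the last category is all zeros."""
    vec = [0.0] * (n_categories - 1)
    if int(index) < n_categories - 1:
        vec[int(index)] = 1.0
    return vec

sex_index = fit_string_index(sex_values)
sex_vec = {label: one_hot_drop_last(i, len(sex_index))
           for label, i in sex_index.items()}

print(sex_index)  # {'female': 0.0, 'male': 1.0}
print(sex_vec)    # {'female': [1.0], 'male': [0.0]}
```

With two categories and the last one dropped, a single vector entry is enough to tell them apart. Spark stores the result as a sparse vector rather than a Python list.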

In [33]:
## Creating the assembler: combines all feature columns into a single vector column
assembler = VectorAssembler(inputCols=['Pclass', 'Sex_Vec', 'Embarked_Vec', 
                                       'Age','SibSp','Parch','Fare'], 
                            outputCol='features')

## Model

In [34]:
## Creating the model
log_reg_titanic = LogisticRegression(featuresCol = 'features', labelCol = 'Survived')


## Pipeline - stages

In [35]:
## Creating the pipeline: the stages run in the order listed
pipeline = Pipeline(stages=[gender_indexer, embark_indexer,
                           gender_encoder, embark_encoder,
                           assembler, log_reg_titanic])

In [36]:
# Splitting into training and test sets
train_data, test_data = final_data.randomSplit([0.7,0.3])
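
The split above is random, so the rows in each set (and the metrics computed later) change from run to run unless a seed is passed, for example `final_data.randomSplit([0.7, 0.3], seed=42)`. The plain-Python sketch below illustrates the idea of a weighted per-row random split. It is an assumption-laden illustration, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.7, 1.0].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftover.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # stand-in for DataFrame rows
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to 700/300, but not exactly
```

Because each row is assigned independently, the split sizes only approximate the requested 70/30 proportions; the same seed reproduces the same split.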

In [37]:
# Fitting the model
fitted_model = pipeline.fit(train_data)


22/05/16 14:13:41 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
22/05/16 14:13:41 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS


In [38]:
# Results: predictions on the test set
results = fitted_model.transform(test_data)

In [42]:
# Compare true labels with predictions
results.select('Survived', 'prediction').show()

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
+--------+----------+
only showing top 20 rows
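
As a small worked example of the confusion matrix linked at the top, the sketch below tallies the 20 `(Survived, prediction)` pairs displayed above. It covers only those displayed rows, all of which have `Survived = 0`, so it says nothing about how the model handles actual survivors.

```python
from collections import Counter

# (Survived, prediction) pairs copied from the 20 rows displayed above.
shown = [(0, 1.0)] * 4 + [(0, 0.0)] * 12 + [(0, 1.0)] * 3 + [(0, 0.0)]

counts = Counter(shown)
confusion = {
    'TN': counts[(0, 0.0)],  # did not survive, predicted 0
    'FP': counts[(0, 1.0)],  # did not survive, predicted 1
    'FN': counts[(1, 0.0)],  # survived, predicted 0
    'TP': counts[(1, 1.0)],  # survived, predicted 1
}
print(confusion)  # {'TN': 13, 'FP': 7, 'FN': 0, 'TP': 0}
```

For the whole test set, the same four counts could be obtained in Spark with something like `results.groupBy('Survived', 'prediction').count().show()`.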



## Evaluating the model

In [41]:
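# rawPredictionCol is set to 'prediction', the default name of the predicted-label column added by the fitted model.
# Note: that column holds hard 0/1 labels; the model also outputs a 'rawPrediction' column (the evaluator's default) with continuous scores.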
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Survived')

In [43]:
# Evaluating the results
AUC = my_eval.evaluate(results)  # area under the ROC curve (the evaluator's default metric)

In [44]:
AUC

0.7789419443490209

The closer the AUC is to 1, the better the classifier separates the two classes; a value of 0.5 corresponds to random guessing. Here we obtain about 0.78, which is better than a completely random prediction. Later on we will work with metrics other than the area under the curve.
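
One detail is worth illustrating. The evaluator above was given the `prediction` column, which holds hard 0/1 labels, so the ROC curve has a single interior point and its area reduces to (TPR + TNR) / 2, the balanced accuracy. The plain-Python sketch below shows this on hypothetical toy data (not taken from the Titanic results), using the pairwise-ranking definition of AUC; the `auc` helper is written for illustration and is not Spark's code.

```python
from itertools import product

def auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: true labels and continuous model scores.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7]
# Hard predictions obtained by thresholding the scores at 0.5.
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]

tpr = sum(1 for y, h in zip(labels, hard) if y == 1 and h == 1.0) / labels.count(1)
tnr = sum(1 for y, h in zip(labels, hard) if y == 0 and h == 0.0) / labels.count(0)
balanced_accuracy = (tpr + tnr) / 2

auc_scores = auc(labels, scores)
auc_hard = auc(labels, hard)
print(auc_scores)                   # 0.9375
print(auc_hard, balanced_accuracy)  # 0.75 0.75
```

In this toy case, thresholding discards ranking information and the AUC drops from 0.9375 to 0.75. Evaluating on a continuous score column such as `rawPrediction` would therefore be expected to give a different value from the one reported above.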