For this project, the dataset 'healthcare_dataset_stroke_data.csv', found on Kaggle.com, was selected in order to identify the relevant topics and answer the questions posed by the challenge.

In [0]:
from pyspark.sql import SparkSession
#imports the class that creates the Spark session

spark = SparkSession.builder \
                    .appName("desafio_IGTI") \
                    .getOrCreate() 
#creates the session if it does not exist, or gets the one already created

In [0]:
diretorio_dataset="/FileStore/tables/healthcare_dataset_stroke_data.csv"
#path of the file to be used

df = spark.read.format("csv") \
                .options(header="true", inferSchema="true") \
                .load(diretorio_dataset)
#reads the dataset

In [0]:
df.show()

+-----+------+----+------------+-------------+------------+-------------+--------------+-----------------+----+---------------+------+
|   id|gender| age|hypertension|heart_disease|ever_married|    work_type|Residence_type|avg_glucose_level| bmi| smoking_status|stroke|
+-----+------+----+------------+-------------+------------+-------------+--------------+-----------------+----+---------------+------+
| 9046|  Male|67.0|           0|            1|         Yes|      Private|         Urban|           228.69|36.6|formerly smoked|     1|
|51676|Female|61.0|           0|            0|         Yes|Self-employed|         Rural|           202.21| N/A|   never smoked|     1|
|31112|  Male|80.0|           0|            1|         Yes|      Private|         Rural|           105.92|32.5|   never smoked|     1|
|60182|Female|49.0|           0|            0|         Yes|      Private|         Urban|           171.23|34.4|         smokes|     1|
| 1665|Female|79.0|           1|            0|         

In [0]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: double (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- ever_married: string (nullable = true)
 |-- work_type: string (nullable = true)
 |-- Residence_type: string (nullable = true)
 |-- avg_glucose_level: double (nullable = true)
 |-- bmi: string (nullable = true)
 |-- smoking_status: string (nullable = true)
 |-- stroke: integer (nullable = true)



In [0]:
df.describe().show()

+-------+-----------------+------+------------------+------------------+-------------------+------------+---------+--------------+------------------+------------------+--------------+-------------------+
|summary|               id|gender|               age|      hypertension|      heart_disease|ever_married|work_type|Residence_type| avg_glucose_level|               bmi|smoking_status|             stroke|
+-------+-----------------+------+------------------+------------------+-------------------+------------+---------+--------------+------------------+------------------+--------------+-------------------+
|  count|             5110|  5110|              5110|              5110|               5110|        5110|     5110|          5110|              5110|              5110|          5110|               5110|
|   mean|36517.82935420744|  null|43.226614481409015|0.0974559686888454|0.05401174168297456|        null|     null|          null|106.14767710371804|28.893236911794673|          null| 

In [0]:
print(f' Number of columns :  {len(df.columns)}')
print(f' Number of rows :  {df.count()}')

 Number of columns :  12
 Number of rows :  5110


In [0]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: double (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- ever_married: string (nullable = true)
 |-- work_type: string (nullable = true)
 |-- Residence_type: string (nullable = true)
 |-- avg_glucose_level: double (nullable = true)
 |-- bmi: string (nullable = true)
 |-- smoking_status: string (nullable = true)
 |-- stroke: integer (nullable = true)



The schema shows that there are 6 variables of type STRING (gender, ever_married, work_type, Residence_type, bmi and smoking_status).

In [0]:
df.select('age').describe().show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|              5110|
|   mean|43.226614481409015|
| stddev| 22.61264672311348|
|    min|              0.08|
|    max|              82.0|
+-------+------------------+



In [0]:
display(df.groupby('gender','stroke').count().sort("count",ascending=True))

gender,stroke,count
Other,0,1
Male,1,108
Female,1,141
Male,0,2007
Female,0,2853


In [0]:
# checking for null values
# (missing bmi values are stored as the text 'N/A', so they are not counted as nulls here)
pandas_df = df.toPandas()
pandas_df.isnull().sum()

Out[37]: id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [0]:
# checking whether the bmi variable holds non-numeric values
df.select('bmi').dtypes

Out[38]: [('bmi', 'string')]

In [0]:
# Checking that the Residence_type variable has two classes
df.groupby('Residence_type').count().show()

+--------------+-----+
|Residence_type|count|
+--------------+-----+
|         Urban| 2596|
|         Rural| 2514|
+--------------+-----+



In [0]:
# checking whether the data are balanced between individuals who did and did not have a stroke
df.groupby('stroke').count().show()

+------+-----+
|stroke|count|
+------+-----+
|     1|  249|
|     0| 4861|
+------+-----+



The classes are not balanced: only 249 of the 5,110 individuals (about 4.9%) had a stroke, against 4,861 who did not.
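
The counts in the table above can be turned into class shares to check the balance directly. A minimal sketch in plain Python, with the two counts copied from that table (the variable names are illustrative only):

```python
# Counts copied from the stroke table above
stroke_counts = {1: 249, 0: 4861}

total = sum(stroke_counts.values())
shares = {label: count / total for label, count in stroke_counts.items()}

# How many individuals without stroke there are for each one with stroke
imbalance_ratio = stroke_counts[0] / stroke_counts[1]

for label, share in shares.items():
    print(f"stroke={label}: {share:.2%}")
print(f"negatives per positive: {imbalance_ratio:.1f}")
```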

In [0]:
display(df.select('avg_glucose_level'))

avg_glucose_level
228.69
202.21
105.92
171.23
174.12
186.21
70.09
94.39
76.15
58.57


From the box plot, the median glucose level is about 95; the mean reported by describe() above is higher, about 106.15.

There are outliers in this variable, but they should not be discarded, since they reflect the variable's actual behavior.

In [0]:
display(df.select('age'))

age
67.0
61.0
80.0
49.0
79.0
81.0
74.0
69.0
59.0
78.0


For this variable, the median is approximately 52 years.

In [0]:
df.groupby('work_type').count().show()

+-------------+-----+
|    work_type|count|
+-------------+-----+
| Never_worked|   22|
|Self-employed|  819|
|      Private| 2925|
|     children|  687|
|     Govt_job|  657|
+-------------+-----+



This variable has 5 classes.

The class with the most instances is Private, with 2,925.

In [0]:
print(f'BMI variable type : {df.select("bmi").dtypes}')
print(f'SMOKING_STATUS variable type : {df.select("smoking_status").dtypes}')

BMI variable type : [('bmi', 'string')]
SMOKING_STATUS variable type : [('smoking_status', 'string')]


In [0]:
df.groupby('smoking_status').count().show()

+---------------+-----+
| smoking_status|count|
+---------------+-----+
|         smokes|  789|
|        Unknown| 1544|
|   never smoked| 1892|
|formerly smoked|  885|
+---------------+-----+



This variable has 4 classes.

In [0]:
#grouping the data
df.groupby('smoking_status','stroke').count().sort("count",ascending=True).show()

+---------------+------+-----+
| smoking_status|stroke|count|
+---------------+------+-----+
|         smokes|     1|   42|
|        Unknown|     1|   47|
|formerly smoked|     1|   70|
|   never smoked|     1|   90|
|         smokes|     0|  747|
|formerly smoked|     0|  815|
|        Unknown|     0| 1497|
|   never smoked|     0| 1802|
+---------------+------+-----+



This breakdown shows that, in absolute numbers, the largest group of stroke cases is among individuals who never smoked (90 cases); this is also the largest group overall (1,892 individuals).
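
Because the groups differ in size, absolute counts and rates can rank the classes differently. A minimal sketch in plain Python, using the stroke cases from the table above and the class totals from the earlier smoking_status count (the dictionary layout is just an illustration):

```python
# (stroke cases, total individuals) per class, copied from the two smoking_status tables
smoking_counts = {
    "never smoked": (90, 1892),
    "formerly smoked": (70, 885),
    "smokes": (42, 789),
    "Unknown": (47, 1544),
}

# Share of each class that had a stroke
stroke_rate = {status: cases / total for status, (cases, total) in smoking_counts.items()}

# Print the classes from highest to lowest stroke rate
for status, rate in sorted(stroke_rate.items(), key=lambda item: item[1], reverse=True):
    print(f"{status:16s} {rate:.2%}")
```

By rate, the "formerly smoked" class comes first and "never smoked" falls to third place, even though it has the most cases.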

In [0]:
#agrupando os dados
df.groupby('stroke','hypertension').count().sort("count",ascending=True).show()


+------+------------+-----+
|stroke|hypertension|count|
+------+------------+-----+
|     1|           1|   66|
|     1|           0|  183|
|     0|           1|  432|
|     0|           0| 4429|
+------+------------+-----+



This table shows a higher incidence of stroke among individuals with hypertension: 66 of 498 (about 13.3%), against 183 of 4,612 (about 4.0%) among those without it.

#Transforming the categorical data for analysis

In [0]:
dataset_filtrado=df.filter((df['bmi'] != 'N/A') & (df['smoking_status'] != 'Unknown'))

In [0]:
# applying the transformation of the categorical data
from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer

In [0]:
#defines the transformation for the "gender" variable
stringIndexer_gender=StringIndexer(inputCol="gender", outputCol="gender_encoded")  #label encoding
encoder_gender = OneHotEncoder(dropLast=False, inputCol="gender_encoded", outputCol="genderVec") #one-hot encoding

In [0]:
#defines the transformation for the "ever_married" variable
stringIndexer_married=StringIndexer(inputCol="ever_married", outputCol="ever_married_encoded") #label encoding
encoder_married = OneHotEncoder(dropLast=False, inputCol="ever_married_encoded", outputCol="marriedVec") #one-hot encoding

In [0]:
#defines the transformation for the "work_type" variable
stringIndexer_work=StringIndexer(inputCol="work_type", outputCol="work_type_encoded")  #label encoding
encoder_work = OneHotEncoder(dropLast=False, inputCol="work_type_encoded", outputCol="workVec") #one-hot encoding

In [0]:
#defines the transformation for the "Residence_type" variable
stringIndexer_residence=StringIndexer(inputCol="Residence_type", outputCol="Residence_type_encoded")  #label encoding
encoder_residence = OneHotEncoder(dropLast=False, inputCol="Residence_type_encoded", outputCol="residenceVec") #one-hot encoding

In [0]:
#defines the transformation for the "smoking_status" variable
stringIndexer_smoking=StringIndexer(inputCol="smoking_status", outputCol="smoking_status_encoded")  #label encoding
encoder_smoking = OneHotEncoder(dropLast=False, inputCol="smoking_status_encoded", outputCol="smokingVec")#one-hot encoding

In [0]:
#defines how the input vector is built
colunas_entrada=['age','hypertension', 'heart_disease','avg_glucose_level','genderVec','marriedVec','workVec','residenceVec','smokingVec']
vetor_entrada = VectorAssembler(inputCols=colunas_entrada,outputCol='features')

In [0]:
#defines the sequence of transformations for the pipeline
sequencia_transformacoes=[stringIndexer_gender,stringIndexer_married,stringIndexer_work,stringIndexer_residence,stringIndexer_smoking,encoder_gender,encoder_married,encoder_work,encoder_residence,encoder_smoking,vetor_entrada]

In [0]:
from pyspark.ml import Pipeline
# Applying the pipeline
pipeline = Pipeline(stages=sequencia_transformacoes)
pipelineModel = pipeline.fit(dataset_filtrado)
model = pipelineModel.transform(dataset_filtrado)

In [0]:
#showing part of the input data
model.select('age','gender','genderVec','ever_married','marriedVec','features').show()

+----+------+-------------+------------+-------------+--------------------+
| age|gender|    genderVec|ever_married|   marriedVec|            features|
+----+------+-------------+------------+-------------+--------------------+
|67.0|  Male|(3,[1],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,2,3,5,7,9,...|
|80.0|  Male|(3,[1],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,2,3,5,7,9,...|
|49.0|Female|(3,[0],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,3,4,7,9,14...|
|79.0|Female|(3,[0],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,1,3,4,7,10...|
|81.0|  Male|(3,[1],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,3,5,7,9,14...|
|74.0|  Male|(3,[1],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,1,2,3,5,7,...|
|69.0|Female|(3,[0],[1.0])|          No|(2,[1],[1.0])|(19,[0,3,4,8,9,14...|
|81.0|Female|(3,[0],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,1,3,4,7,9,...|
|61.0|Female|(3,[0],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,2,3,4,7,11...|
|54.0|Female|(3,[0],[1.0])|         Yes|(2,[0],[1.0])|(19,[0,3,4,7,9,14...|
|79.0|Female

In [0]:
#splitting the dataset into training and test sets
train_data, test_data = model.randomSplit([.8,.2],seed=1)

In [0]:
#defines the logistic regression model
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

#instantiates the logistic regression estimator
lr = LogisticRegression(labelCol="stroke", featuresCol="features", maxIter=100, regParam=0.3)

# trains the model
linearModel = lr.fit(train_data)

In [0]:
#makes the predictions using the logistic regression model
previsao_regressao = linearModel.transform(test_data)

In [0]:
#evaluating the classification produced by the logistic regression
acc_evaluator = MulticlassClassificationEvaluator(labelCol="stroke", predictionCol="prediction", metricName="accuracy")
acuracia_regressao = acc_evaluator.evaluate(previsao_regressao)
print('Logistic Regression: {0:2.2f}%'.format(acuracia_regressao*100))

Logistic Regression: 94.31%
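
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes, in plain Python, the accuracy of a hypothetical classifier that always predicts stroke = 0, using the counts from the stroke table earlier in the notebook. Those counts come from the full, unfiltered dataset, while the model was evaluated on a test split of the filtered data, so the comparison is only indicative.

```python
# Counts copied from the stroke table earlier in the notebook (full, unfiltered dataset)
no_stroke, stroke = 4861, 249

# Accuracy of a hypothetical classifier that always predicts stroke = 0
baseline_accuracy = no_stroke / (no_stroke + stroke)
print(f"Always-0 baseline: {baseline_accuracy:.2%}")
```

Since this baseline is already about 95%, accuracy alone says little about how well the stroke cases themselves are identified; metrics such as recall on the positive class would be a natural complement.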
