<a href="https://colab.research.google.com/github/LeonardoCechellaVelho/machine-learning/blob/master/regressao_linear_logistica_seguro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

------------------LINEAR REGRESSION------------------------

In this challenge we walk through the whole process of preprocessing and building a machine learning model (linear regression) on the well-known health insurance dataset, which is used in Machine Learning competitions. The dataset contains 6 attributes about each person, together with the price of their health insurance plan.

**Objective:** Your job as a computer engineer is to predict the price of a person's insurance, that is, to build a model for it.

**Items you must complete:**

*   Present the dataset (info, describe, head)

*   Perform exploratory analysis, checking for missing data and any transformations needed.

*   Apply preprocessing techniques (replace missing values with the mean, transform categorical data)

*   Check the correlation between the variables

*   Create a linear regression model, train it, and make predictions.

*   Based on the predictions, check the "score, R2" to evaluate the model

*   Use the model to predict how much you would pay for insurance.


------------------LOGISTIC REGRESSION------------------------

In this challenge we walk through the whole process of preprocessing and building a machine learning model (logistic regression) on the well-known heart attack dataset, which is used in Machine Learning competitions. The dataset contains 14 attributes. The "target" field indicates the presence of heart disease in the patient: it is an integer where 0 = no or lower chance of heart attack and 1 = higher chance of heart attack.

Attribute information

1) age

2) sex

3) chest pain type (4 values)

4) resting blood pressure

5) serum cholesterol in mg/dl

6) fasting blood sugar > 120 mg/dl

7) resting electrocardiographic results (values 0, 1, 2)

8) maximum heart rate achieved

9) exercise-induced angina

10) oldpeak = ST depression induced by exercise relative to rest

11) the slope of the peak exercise ST segment

12) number of major vessels (0-3) colored by fluoroscopy

13) thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

14) target: 0 = lower chance of heart attack; 1 = higher chance of heart attack

**Objective:** Your job as a computer engineer is to predict whether a person is prone to a heart attack, that is, to build a model for it.


**Items you must complete:**

*   Present the dataset (info, describe, head)

*   Perform exploratory analysis, checking for missing data and any transformations needed.

*   Check the correlation between the variables

*   Create a logistic regression model, train it, and make predictions.

*   Based on the predictions, check the "score" to evaluate the model (for a classifier, `score` returns mean accuracy rather than R2)





In [None]:
import pandas as pd  # data handling with DataFrames
import numpy as np  # numerical arrays (vectors and matrices)
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plotting
from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.decomposition import PCA  # principal component analysis
from sklearn.feature_selection import RFE  # recursive feature elimination
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression  # linear regression model
from sklearn.metrics import r2_score  # R2 (coefficient of determination)
from sklearn import preprocessing


**Dataset insurance.csv ----- LINEAR REGRESSION ------------**

In [None]:
from google.colab import files  # used to upload the data to Google Colab
uploaded = files.upload()

Saving insurance.csv to insurance.csv


**Present information about the dataset (head, info, describe)**

In [None]:
nome_do_arquivo="insurance.csv" 
dataframe = pd.read_csv(nome_do_arquivo)

In [None]:
dataframe.head()

,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0,yes,southwest,16884.924
1,18.0,male,33.77,1,no,southeast,1725.5523
2,28.0,male,33.0,3,no,southeast,4449.462
3,33.0,male,22.705,0,no,northwest,21984.47061
4,32.0,male,28.88,0,no,northwest,3866.8552


In [None]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1321 non-null   float64
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(3), int64(1), object(3)
memory usage: 73.3+ KB


In [None]:
dataframe.describe()

,age,bmi,children,charges
count,1321.0,1338.0,1338.0,1338.0
mean,39.248297,30.663397,1.094918,13270.422265
std,14.017591,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


**Check for null values and categorical data**

In [None]:
dataframe.isnull().sum()

age         17
sex          0
bmi          0
children     0
smoker       0
region       0
charges      0
dtype: int64

In [None]:
m_age = dataframe['age'].mean()

In [None]:
m_age

39.24829674489023

**Replace null values and transform all categorical data**

In [None]:
dataframe['age'] = dataframe['age'].fillna(m_age)  # fill missing ages with the column mean computed above

In [None]:
dataframe.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
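
Mean imputation can also be written out by hand, which makes it clear what `fillna` with the column mean is doing. The following is a minimal sketch in plain Python; the `ages` list is made-up illustrative data, not rows from insurance.csv.

```python
# Hypothetical ages; None marks a missing value.
ages = [19.0, 18.0, None, 33.0, 32.0, None]

# The mean is computed over the known values only.
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)

# Replace each missing entry with that mean.
filled = [mean_age if a is None else a for a in ages]

print(mean_age)
print(filled)
```

Every filled row receives the same value, so the column mean is unchanged while its variance shrinks slightly.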

In [None]:
one_hot1 = pd.get_dummies(dataframe['sex'])

In [None]:
one_hot1

,female,male
0,1,0
1,0,1
2,0,1
3,0,1
4,0,1
...,...,...
1333,0,1
1334,1,0
1335,1,0
1336,1,0


In [None]:
dataframe = dataframe.join(one_hot1)

In [None]:
one_hot2 = pd.get_dummies(dataframe['smoker'])

In [None]:
one_hot2

,no,yes
0,0,1
1,1,0
2,1,0
3,1,0
4,1,0
...,...,...
1333,1,0
1334,1,0
1335,1,0
1336,1,0


In [None]:
dataframe = dataframe.join(one_hot2)


In [None]:
dataframe

,age,sex,bmi,children,smoker,region,charges,female,male,no,yes
0,19.000000,female,27.900,0,yes,southwest,16884.92400,1,0,0,1
1,18.000000,male,33.770,1,no,southeast,1725.55230,0,1,1,0
2,28.000000,male,33.000,3,no,southeast,4449.46200,0,1,1,0
3,33.000000,male,22.705,0,no,northwest,21984.47061,0,1,1,0
4,32.000000,male,28.880,0,no,northwest,3866.85520,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...
1333,50.000000,male,30.970,3,no,northwest,10600.54830,0,1,1,0
1334,18.000000,female,31.920,0,no,northeast,2205.98080,1,0,1,0
1335,39.248297,female,36.850,0,no,southeast,1629.83350,1,0,1,0
1336,21.000000,female,25.800,0,no,southwest,2007.94500,1,0,1,0
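
The one-hot columns above can be reproduced with a few lines of plain Python. This is only a sketch of the idea behind `get_dummies`; the helper name `one_hot` is an assumption made for illustration, and the input is the `sex` value of the first five rows shown in the table.

```python
def one_hot(values):
    """Return (categories, rows); each row has a 1 in the column of its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

sex = ["female", "male", "male", "male", "male"]  # first five rows of the table
categories, rows = one_hot(sex)

print(categories)
for row in rows:
    print(row)
```

Each row sums to 1, so either column fully determines the other. Keeping both `female` and `male` (or both `no` and `yes`) therefore gives the model redundant information.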


**Check the correlations between variables, using any method**

In [None]:
corr_pearson = dataframe.corr(method='pearson')

In [None]:
corr_pearson.style.background_gradient(cmap='coolwarm').set_precision(2)

,age,bmi,children,charges,female,male,no,yes
age,1.0,0.11,0.04,0.3,0.02,-0.02,0.02,-0.02
bmi,0.11,1.0,0.01,0.2,-0.05,0.05,-0.0,0.0
children,0.04,0.01,1.0,0.07,-0.02,0.02,-0.01,0.01
charges,0.3,0.2,0.07,1.0,-0.06,0.06,-0.79,0.79
female,0.02,-0.05,-0.02,-0.06,1.0,-1.0,0.08,-0.08
male,-0.02,0.05,0.02,0.06,-1.0,1.0,-0.08,0.08
no,0.02,-0.0,-0.01,-0.79,0.08,-0.08,1.0,-1.0
yes,-0.02,0.0,0.01,0.79,-0.08,0.08,-1.0,1.0


In [None]:
corr_spearman = dataframe.corr(method='spearman')

In [None]:
corr_spearman.style.background_gradient(cmap='coolwarm').set_precision(2)

,age,bmi,children,charges,female,male,no,yes
age,1.0,0.11,0.06,0.53,0.02,-0.02,0.02,-0.02
bmi,0.11,1.0,0.02,0.12,-0.04,0.04,-0.0,0.0
children,0.06,0.02,1.0,0.13,-0.02,0.02,-0.02,0.02
charges,0.53,0.12,0.13,1.0,-0.01,0.01,-0.66,0.66
female,0.02,-0.04,-0.02,-0.01,1.0,-1.0,0.08,-0.08
male,-0.02,0.04,0.02,0.01,-1.0,1.0,-0.08,0.08
no,0.02,-0.0,-0.02,-0.66,0.08,-0.08,1.0,-1.0
yes,-0.02,0.0,0.02,0.66,-0.08,0.08,-1.0,1.0
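
The two tables differ because Pearson measures linear association while Spearman applies the same formula to ranks, so it only asks whether the relationship is monotonic. The sketch below implements both by hand on made-up data (`xs`, `ys`); the function names are illustrative and are not pandas APIs.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up data: y grows with x, but not along a straight line.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(pearson(xs, ys), spearman(xs, ys))
```

On this toy data Spearman is essentially 1 while Pearson is slightly lower. A similar gap between `age` and `charges` in the tables above (0.3 versus 0.53) is consistent with a relationship that is monotonic but not strictly linear.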


**Define the X and Y variables**

Optional (bonus): split x and y into x_train, x_test, y_train and y_test. You can do this manually with vectors, or use train_test_split from sklearn.


In [None]:
x = dataframe[['yes','no','female','male','age']]

In [None]:
y = dataframe[['charges']]

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler(feature_range=(0,1))

In [None]:
x_scaled = scaler.fit_transform(x)  # note: the split below uses x, not x_scaled, so the models are trained on unscaled features
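
Min-max scaling maps each column onto a fixed range using x' = low + (x - min) * (high - low) / (max - min). The sketch below applies that formula to a single column in plain Python; `min_max_scale` is an illustrative helper, and the three ages are chosen to include the minimum (18) and maximum (64) reported by `describe()`.

```python
def min_max_scale(column, low=0.0, high=1.0):
    """Linearly rescale a list of numbers so its min maps to low and its max to high."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map everything to the lower bound.
        return [low for _ in column]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]

ages = [18, 41, 64]
scaled = min_max_scale(ages)
print(scaled)
```

The one-hot columns are already in [0, 1], so in this feature set only `age` changes. That matters for distance-based models such as KNN, where an unscaled `age` can dominate the distance.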

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
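
The manual alternative mentioned in the task can be sketched with the standard library alone. The helper below shuffles row indices with a fixed seed and cuts them into two parts. It is an assumption-laden illustration: it does not reproduce the exact rows that sklearn's `train_test_split` selects for `random_state=42`, and the toy `rows` and `targets` are made up.

```python
import math
import random

def manual_train_test_split(rows, targets, test_size=0.33, seed=42):
    """Shuffle indices reproducibly, then split features and targets the same way."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(rows))  # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))

# Toy data: 10 rows with two features each and one target per row.
rows = [[i, i * 2] for i in range(10)]
targets = [i * 100 for i in range(10)]
x_tr, x_te, y_tr, y_te = manual_train_test_split(rows, targets)
print(len(x_tr), len(x_te))
```

Splitting by index keeps each feature row paired with its own target, which is the property that matters most when doing this by hand.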

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
clf_tree = DecisionTreeRegressor()

In [None]:
clf_tree.fit(x_train,y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [None]:
pred_tree = clf_tree.predict(x_test)

In [None]:
r2_score(y_test, pred_tree)

0.609131650582838

In [None]:
from sklearn import tree

In [None]:
import pydotplus

In [None]:
from IPython.display import Image

In [None]:
dot = tree.export_graphviz(clf_tree)

In [None]:
graph = pydotplus.graph_from_dot_data(dot)

In [None]:
Image(graph.create_png())

In [None]:
from sklearn.ensemble import RandomForestRegressor  # charges is continuous, so a regressor is needed

In [None]:
clf_random = RandomForestRegressor()

In [None]:
clf_random.fit(x_train, y_train.values.ravel())  # ravel() passes y as a 1-D array

In [None]:
pred_Random = clf_random.predict(x_test)

In [None]:
r2_score(y_test, pred_Random)  # R2 replaces accuracy, which only applies to classification

In [None]:
# n_estimators must be at least 1; the loop variable is not named x, so the feature matrix x is preserved
for n_trees in range(1, 500):
  clf_random = RandomForestRegressor(n_estimators=n_trees)
  clf_random.fit(x_train, y_train.values.ravel())
  pred_Random = clf_random.predict(x_test)
  r2_forest = r2_score(y_test, pred_Random)
  print('N trees: {}  R2: {}'.format(n_trees, r2_forest))

In [None]:
clf_random.n_estimators

In [None]:
clf_random.estimators_[2]

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
model_KNN = KNeighborsRegressor(n_neighbors=13)

In [None]:
model_KNN.fit(x_train,y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=13, p=2,
                    weights='uniform')

In [None]:
pred_KNN = model_KNN.predict(x_test)

In [None]:
r2_score(y_test,pred_KNN)

0.5325680256176983

In [None]:
K_n = 20

In [None]:
# The loop variable is named k so that it does not overwrite the feature matrix x
for k in range(1, K_n):
  model_KNN = KNeighborsRegressor(n_neighbors=k)
  model_KNN.fit(x_train,y_train)
  pred_KNN = model_KNN.predict(x_test)
  R2 = r2_score(y_test,pred_KNN)
  print(' K:{} R2:{}'.format(k,R2))

 K:1 R2:0.4897344319621062
 K:2 R2:0.6184732136142583
 K:3 R2:0.6222313766071476
 K:4 R2:0.6425482393120805
 K:5 R2:0.6505494184956406
 K:6 R2:0.6493637999739255
 K:7 R2:0.6280728631896486
 K:8 R2:0.6113701373804885
 K:9 R2:0.5967538461461841
 K:10 R2:0.5784783355182078
 K:11 R2:0.5584514700062269
 K:12 R2:0.5475384621991797
 K:13 R2:0.5325680256176983
 K:14 R2:0.5231651699789037
 K:15 R2:0.5082239048644874
 K:16 R2:0.4896834667352332
 K:17 R2:0.48835327088179203
 K:18 R2:0.474199718229132
 K:19 R2:0.45613895709795527
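
The dependence on K is easier to read with the prediction rule in view: a KNN regressor averages the targets of the K closest training rows. The sketch below is a plain-Python version of that rule with uniform weights and Euclidean distance, on made-up one-dimensional data; `knn_predict` is an illustrative helper, not the sklearn implementation.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Predict the mean target of the k training points closest to query."""
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k

# Made-up training data: one feature per row.
train_x = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

print(knn_predict(train_x, train_y, (2.1,), k=3))  # averages the three closest targets
print(knn_predict(train_x, train_y, (2.1,), k=4))  # the distant point now pulls the mean up
```

With a small K the prediction follows individual neighbours and is noisy. With a large K distant rows are averaged in. That trade-off matches the output above, where R2 peaks around K=5 and then declines.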


**Create a linear regression model**

In [None]:
reg = LinearRegression().fit(x, y)

**Train the model**

In [None]:
from xgboost import XGBClassifier

In [None]:
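# Caution: a classifier treats every distinct value of charges as its own class;
# a regressor is the usual choice for a continuous target.
model_arv = XGBClassifier()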

In [None]:
model_arv.fit(x_train,y_train)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

**Print the equation that represents the model**

In [None]:
print('Y = {}X {}'.format(reg.coef_,reg.intercept_))

Y = [[ 11904.72504296 -11904.72504296    -40.06085711     40.06085711
     273.72193045]]X [9555.83181023]


**Print the model's R2 'score'**

In [None]:
previsao = reg.predict(x)

In [None]:
r2 = r2_score(y,previsao)

In [None]:
print("Coefficient of determination (R2):", r2)

Coefficient of determination (R2): 0.7188016743854042
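
R2 compares the model's squared error with that of a baseline that always predicts the mean: R2 = 1 - SS_res / SS_tot. The plain-Python sketch below shows the calculation on made-up numbers; `r2_by_hand` is an illustrative name.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_by_hand(y_true, [1.5, 2.0, 3.0, 3.5]))  # close predictions give a value near 1
print(r2_by_hand(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean gives 0
```

Note that the value printed above is computed on the same rows used to fit `reg`, so it is an in-sample score. The decision tree and KNN scores earlier were measured on the held-out `x_test`, which makes them not directly comparable.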


**Use the model to predict how much you would pay for insurance.**

model_reg.predict(np.array([[your data, in the order of the model's inputs]]))

In [None]:
reg.predict(np.array([[0,1,0,1,22]])) #df[['yes','no','female','male','age']]

array([[3713.05009435]])
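
The prediction above can be checked by evaluating the printed equation directly: multiply each input by its coefficient, sum the products, and add the intercept. The sketch below copies the coefficients and intercept from the model output, in the feature order `['yes','no','female','male','age']`.

```python
# Coefficients and intercept copied from the printed model equation,
# in the feature order ['yes', 'no', 'female', 'male', 'age'].
coefficients = [11904.72504296, -11904.72504296, -40.06085711, 40.06085711, 273.72193045]
intercept = 9555.83181023

person = [0, 1, 0, 1, 22]  # non-smoker, male, 22 years old

# Linear model: intercept plus the weighted sum of the inputs.
charge = intercept + sum(c * v for c, v in zip(coefficients, person))
print(round(charge, 2))  # prints 3713.05
```

The result agrees with `reg.predict` up to the rounding of the printed coefficients.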

**Dataset heart.csv ----- LOGISTIC REGRESSION ------------**

In [None]:
from google.colab import files  # used to upload the data to Google Colab
uploaded = files.upload()

Saving heart.csv to heart.csv


**Present information about the dataset (head, info, describe)**

In [None]:
nome_do_arquivo="heart.csv" 
dataframe_h = pd.read_csv(nome_do_arquivo)

In [None]:
dataframe_h.head()

In [None]:
dataframe_h.info()

In [None]:
dataframe_h.describe()

**Check for null values and categorical data**

In [None]:
dataframe_h.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

**Check the correlations between variables, using any method**

In [None]:
corr_pearson = dataframe_h.corr(method='pearson')

In [None]:
corr_pearson.style.background_gradient(cmap='coolwarm').set_precision(2)

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
age,1.0,-0.1,-0.07,0.28,0.21,0.12,-0.12,-0.4,0.1,0.21,-0.17,0.28,0.07,-0.23
sex,-0.1,1.0,-0.05,-0.06,-0.2,0.05,-0.06,-0.04,0.14,0.1,-0.03,0.12,0.21,-0.28
cp,-0.07,-0.05,1.0,0.05,-0.08,0.09,0.04,0.3,-0.39,-0.15,0.12,-0.18,-0.16,0.43
trestbps,0.28,-0.06,0.05,1.0,0.12,0.18,-0.11,-0.05,0.07,0.19,-0.12,0.1,0.06,-0.14
chol,0.21,-0.2,-0.08,0.12,1.0,0.01,-0.15,-0.01,0.07,0.05,-0.0,0.07,0.1,-0.09
fbs,0.12,0.05,0.09,0.18,0.01,1.0,-0.08,-0.01,0.03,0.01,-0.06,0.14,-0.03,-0.03
restecg,-0.12,-0.06,0.04,-0.11,-0.15,-0.08,1.0,0.04,-0.07,-0.06,0.09,-0.07,-0.01,0.14
thalach,-0.4,-0.04,0.3,-0.05,-0.01,-0.01,0.04,1.0,-0.38,-0.34,0.39,-0.21,-0.1,0.42
exang,0.1,0.14,-0.39,0.07,0.07,0.03,-0.07,-0.38,1.0,0.29,-0.26,0.12,0.21,-0.44
oldpeak,0.21,0.1,-0.15,0.19,0.05,0.01,-0.06,-0.34,0.29,1.0,-0.58,0.22,0.21,-0.43


In [None]:
corr_spearman = dataframe_h.corr(method='spearman')

In [None]:
corr_spearman.style.background_gradient(cmap='coolwarm').set_precision(2)

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
age,1.0,-0.1,-0.09,0.29,0.2,0.11,-0.13,-0.4,0.09,0.27,-0.18,0.34,0.09,-0.24
sex,-0.1,1.0,-0.06,-0.05,-0.15,0.05,-0.05,-0.04,0.14,0.1,-0.03,0.12,0.25,-0.28
cp,-0.09,-0.06,1.0,0.04,-0.09,0.09,0.07,0.32,-0.42,-0.16,0.16,-0.22,-0.21,0.46
trestbps,0.29,-0.05,0.04,1.0,0.13,0.15,-0.13,-0.04,0.05,0.15,-0.09,0.09,0.06,-0.12
chol,0.2,-0.15,-0.09,0.13,1.0,0.02,-0.16,-0.05,0.09,0.05,-0.01,0.11,0.08,-0.12
fbs,0.11,0.05,0.09,0.15,0.02,1.0,-0.08,-0.01,0.03,0.03,-0.05,0.13,-0.01,-0.03
restecg,-0.13,-0.05,0.07,-0.13,-0.16,-0.08,1.0,0.09,-0.08,-0.08,0.11,-0.1,-0.01,0.15
thalach,-0.4,-0.04,0.32,-0.04,-0.05,-0.01,0.09,1.0,-0.4,-0.43,0.44,-0.26,-0.16,0.43
exang,0.09,0.14,-0.42,0.05,0.09,0.03,-0.08,-0.4,1.0,0.3,-0.27,0.16,0.25,-0.44
oldpeak,0.27,0.1,-0.16,0.15,0.05,0.03,-0.08,-0.43,0.3,1.0,-0.59,0.22,0.26,-0.42


**Define the X and Y variables**

Optional (bonus): split x and y into x_train, x_test, y_train and y_test. You can do this manually with vectors, or use train_test_split from sklearn.

In [None]:
x = dataframe_h[['cp','thalach','slope','exang','oldpeak']]

In [None]:
y = dataframe_h[['target']]

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

**Create a LOGISTIC regression model**

In [None]:
modelo = LogisticRegression(random_state= 0).fit(x,y)


**Train the model**

In [None]:
model_arv = XGBClassifier()

In [None]:
model_arv.fit(x_train,y_train)

**Print the equation that represents the model**

In [None]:
# The logistic model is `modelo`; `reg` is the linear model from the insurance section
print('Y = {}X {}'.format(modelo.coef_,modelo.intercept_))
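
For a logistic model, the coefficients and intercept define a linear score z, which a sigmoid turns into a probability that is then thresholded to give the 0/1 target. The sketch below shows that chain in plain Python. The coefficient values are hypothetical placeholders for the features `['cp','thalach','slope','exang','oldpeak']`, not the fitted values of the notebook's model.

```python
import math

def sigmoid(z):
    """Map any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefficients, intercept):
    """Probability of class 1: sigmoid of the linear score."""
    z = intercept + sum(c * v for c, v in zip(coefficients, features))
    return sigmoid(z)

def predict_label(features, coefficients, intercept, threshold=0.5):
    """Class 1 when the probability exceeds the threshold, otherwise class 0."""
    return 1 if predict_proba(features, coefficients, intercept) > threshold else 0

# Hypothetical coefficients, chosen only for illustration.
coefficients = [0.8, 0.02, 0.5, -1.0, -0.6]
intercept = -3.0
patient = [2, 150, 1, 0, 1.0]  # cp, thalach, slope, exang, oldpeak (made-up patient)

probability = predict_proba(patient, coefficients, intercept)
label = predict_label(patient, coefficients, intercept)
print(probability, label)
```

A positive coefficient pushes the probability of `target = 1` up as the feature grows, and a negative one pushes it down. This is the sense in which the printed coefficients can be read.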

**Print the model's 'score' (for a classifier, `score` returns mean accuracy rather than R2)**

In [None]:
previsao = modelo.predict(x)

In [None]:
modelo.score(x,y)

0.7953795379537953

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
model_NB = GaussianNB()

In [None]:
model_NB.fit(x_train,y_train)


GaussianNB(priors=None, var_smoothing=1e-09)

In [None]:
pred_NB = model_NB.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test,pred_NB)

0.79
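
Accuracy is simply the fraction of predictions that match the true labels. The plain-Python sketch below computes it on made-up labels; `accuracy` is an illustrative helper.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # three of four labels match
```

The two scores in this section are measured differently: the logistic model is fitted and scored on all of `x` and `y`, while the Naive Bayes accuracy comes from the held-out `x_test`. The similar values therefore do not show that the two models generalise equally well.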