<a href="https://colab.research.google.com/github/DzakiMuhammad3/pipeline_sklearn/blob/main/ml_stroke_finder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Dataset from Kaggle

First we need to upload our `kaggle.json` token to Google Colab.

Next, we create a hidden directory named `.kaggle` in the Colab home directory.

In [None]:
!mkdir -p ~/.kaggle  # -p: do not fail if the directory already exists


After that, we list the files in the current working directory to check its contents.

In [None]:
!ls

sample_data


Next, we copy `kaggle.json` into the hidden `.kaggle` directory.

In [None]:
!cp kaggle.json ~/.kaggle/kaggle.json

Next, we set the permissions of `~/.kaggle/kaggle.json` to 600 so that only the owner can read and write the token.

In [None]:
!chmod 600 ~/.kaggle/kaggle.json
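
The shell steps above can also be sketched in plain Python. This is only an illustration under stated assumptions: `install_kaggle_token` is a hypothetical helper (not part of the Kaggle tooling), and the demonstration uses a temporary directory and a dummy token instead of the real `~/.kaggle`.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, kaggle_dir):
    """Hypothetical helper: copy a token into kaggle_dir, readable by the owner only."""
    kaggle_dir = Path(kaggle_dir)
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(token_path, target)            # like `cp`
    os.chmod(target, 0o600)                        # like `chmod 600`
    return target


# Demonstration with a throwaway directory and a dummy token file
workdir = Path(tempfile.mkdtemp())
dummy_token = workdir / "kaggle.json"
dummy_token.write_text('{"username": "example", "key": "not-a-real-key"}')

installed = install_kaggle_token(dummy_token, workdir / ".kaggle")
mode = stat.S_IMODE(installed.stat().st_mode)
print(installed, oct(mode))
```

Calling the helper a second time does not fail, because `exist_ok=True` tolerates an existing directory.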

Now we can finally download the dataset.

In [None]:
!kaggle competitions download -c playground-series-s3e2

Downloading playground-series-s3e2.zip to /content
  0% 0.00/321k [00:00<?, ?B/s]
100% 321k/321k [00:00<00:00, 90.3MB/s]


## Get the data

We need to unzip the data and load it into DataFrames.

Here is how to unzip the data:

In [None]:
!unzip playground-series-s3e2

Archive:  playground-series-s3e2.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               
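
As an alternative to the shell command, the archive could be extracted with Python's standard `zipfile` module. The sketch below builds a small stand-in archive in a temporary directory (the file names and contents are made up for the demonstration) and then extracts it.

```python
import tempfile
import zipfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
archive = workdir / "example.zip"

# Build a tiny stand-in archive so the example is self-contained
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("train.csv", "id,stroke\n0,0\n")
    zf.writestr("test.csv", "id\n1\n")

# List the members, then extract everything into a target directory
with zipfile.ZipFile(archive) as zf:
    names = sorted(zf.namelist())
    zf.extractall(workdir / "data")

extracted = sorted(p.name for p in (workdir / "data").iterdir())
print(names, extracted)
```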


Now load the train and test datasets from the Colab working directory.


In [None]:
import pandas as pd

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

len(df_train), len(df_test)

(15304, 10204)

Let's get an overview of the whole dataset with `info()`.

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15304 entries, 0 to 15303
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 15304 non-null  int64  
 1   gender             15304 non-null  object 
 2   age                15304 non-null  float64
 3   hypertension       15304 non-null  int64  
 4   heart_disease      15304 non-null  int64  
 5   ever_married       15304 non-null  object 
 6   work_type          15304 non-null  object 
 7   Residence_type     15304 non-null  object 
 8   avg_glucose_level  15304 non-null  float64
 9   bmi                15304 non-null  float64
 10  smoking_status     15304 non-null  object 
 11  stroke             15304 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 1.4+ MB


The data has no missing values, so we can proceed to analyze it.

The data has 1 id column, 10 feature columns and 1 target:
* 5 numerical features, `age, hypertension, heart_disease, avg_glucose_level, bmi` (`hypertension` and `heart_disease` are 0/1 flags stored as integers)
* 5 categorical features, `gender, ever_married, work_type, Residence_type, smoking_status`
* 1 target, `stroke`, a binary label (1 = stroke, 0 = no stroke)
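
One way to derive the two feature groups from the data itself is to split the columns by dtype. The sketch below uses a small made-up DataFrame (`toy` is not the competition data) and pandas' `select_dtypes`; note that 0/1 flags stored as integers land in the numeric group.

```python
import pandas as pd

# Made-up rows that mimic a few of the columns described above
toy = pd.DataFrame({
    "age": [28.0, 33.0],
    "hypertension": [0, 1],
    "gender": ["Male", "Female"],
    "smoking_status": ["never smoked", "Unknown"],
})

# Numeric dtypes (int, float) versus everything else (here: strings)
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```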

## Make X and y data

We need to separate the features from the target and drop the `id` column, because it is only a row identifier and the model could pick up spurious patterns from it.

In [None]:
data = df_train.drop(['id'], axis=1)  # axis=1 drops a column rather than a row
data.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,28.0,0,0,Yes,Private,Urban,79.53,31.1,never smoked,0
1,Male,33.0,0,0,Yes,Private,Rural,78.44,23.9,formerly smoked,0
2,Female,42.0,0,0,Yes,Private,Rural,103.0,40.3,Unknown,0
3,Male,56.0,0,0,Yes,Private,Urban,64.87,28.8,never smoked,0
4,Female,24.0,0,0,No,Private,Rural,73.36,28.8,never smoked,0


Now make X and y data, where X are the features and y is the target

In [None]:
X = data.drop(['stroke'], axis=1)
y = data.stroke

In [None]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15304 entries, 0 to 15303
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             15304 non-null  object 
 1   age                15304 non-null  float64
 2   hypertension       15304 non-null  int64  
 3   heart_disease      15304 non-null  int64  
 4   ever_married       15304 non-null  object 
 5   work_type          15304 non-null  object 
 6   Residence_type     15304 non-null  object 
 7   avg_glucose_level  15304 non-null  float64
 8   bmi                15304 non-null  float64
 9   smoking_status     15304 non-null  object 
dtypes: float64(3), int64(2), object(5)
memory usage: 1.2+ MB


In [None]:
y[0:5]

0    0
1    0
2    0
3    0
4    0
Name: stroke, dtype: int64

## Split the dataset

We need to split the training data into training and testing sets, because the competition's test data has no target column. We use the function `sklearn.model_selection.train_test_split`.

This is the documentation of train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
len(X_train), len(X_test), len(y_train), len(y_test)

(12243, 3061, 12243, 3061)
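
When the positive class is rare, a plain random split can leave the two parts with slightly different class ratios. As a hedged variation on the split above, `train_test_split` accepts a `stratify` argument that preserves the class proportions. The arrays below are synthetic and only illustrate the behaviour.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10% positive labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the 10% positive rate
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```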

## Pipeline Practice

We can use a scikit-learn `Pipeline` to chain preprocessing and training in one object. This example is adapted from the scikit-learn documentation: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

### Numerical Features

We build a pipeline to preprocess the numerical features:
* age
* bmi
* avg_glucose_level
* hypertension
* heart_disease


First, list the names of the numerical feature columns.

In [None]:
numerical_features = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi']

Next, build the pipeline. We need `sklearn.impute.SimpleImputer` and `sklearn.preprocessing.StandardScaler`.
* SimpleImputer is a univariate imputer that completes missing values with a simple strategy. It replaces each NaN with a statistic of the column that we choose, such as the mean (the default), the median, or the mode (`most_frequent`).
* StandardScaler standardizes features by removing the mean and scaling to unit variance.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

numerical_pipeline = Pipeline(
    steps=[
        ('imputer', SimpleImputer()),
        ('sclaer', StandardScaler())
    ]
)
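
To see what these two steps do, here is a minimal sketch on a made-up single column with one missing value. The names `toy_column` and `toy_pipeline` are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One numeric column with a missing value in the middle
toy_column = np.array([[1.0], [np.nan], [3.0]])

toy_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),  # NaN -> column mean (2.0)
        ("scaler", StandardScaler()),                 # zero mean, unit variance
    ]
)

scaled = toy_pipeline.fit_transform(toy_column)
print(scaled.ravel())
```

The imputed value equals the column mean, so after scaling it sits at zero, and the other two values are symmetric around it.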

### Categorical Features and Pipeline

Now let's handle the categorical features:
* gender
* ever_married
* work_type
* Residence_type
* smoking_status

Let's list the names of the categorical feature columns.

In [None]:
categorical_features = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

Most machine learning models need numeric input, so we encode the categorical features with `OneHotEncoder` and then keep only the most informative encoded columns with `SelectPercentile`.
* OneHotEncoder - Encode categorical features as a one-hot numeric array. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
* SelectPercentile - Select features according to a percentile of the highest scores. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectPercentile, chi2

categorical_pipeline = Pipeline(
    steps=[
        ('encoder',OneHotEncoder(handle_unknown='ignore')),
        ('selector', SelectPercentile(chi2, percentile=50))
    ]
)
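
The effect of these two steps can be sketched on made-up data. In the toy example below, the first column perfectly separates the labels and the second carries no information, so with `percentile=50` the chi-squared selector keeps only the two one-hot columns that come from the first feature. All names and values are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Column 0 ("x"/"y") determines the label; column 1 ("p"/"q") is unrelated to it
X_cat = np.array([
    ["x", "p"], ["x", "q"], ["x", "p"], ["x", "q"],
    ["y", "p"], ["y", "q"], ["y", "p"], ["y", "q"],
])
y_cat = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_cat_pipeline = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),  # 2 features -> 4 columns
        ("selector", SelectPercentile(chi2, percentile=50)),  # keep the top half
    ]
)

selected = toy_cat_pipeline.fit_transform(X_cat, y_cat)
print(selected.shape)
```

Note that the selector needs the labels at fit time, which is why `y_cat` is passed to `fit_transform`.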

### Preprocessor

Now let's combine the numerical and categorical pipelines into a single preprocessor using `sklearn.compose.ColumnTransformer`, which applies transformers to columns of an array or pandas DataFrame.


In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)
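
As a small illustration of how `ColumnTransformer` routes columns, the sketch below applies a scaler to one numeric column and an encoder to one string column of a made-up DataFrame. Columns that are not listed (here `id`) are dropped by default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [20.0, 40.0, 60.0],
    "gender": ["Male", "Female", "Male"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),                          # 1 output column
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]), # 2 output columns
    ]
)

transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
```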

Now put the preprocessor and the classifier together.

In [None]:
clf = Pipeline(
    steps=[
        ('preprocessing', preprocessor),
        ('classifier', LogisticRegression())
    ]
)

## Fitting the model with train data

In [None]:
clf.fit(X_train, y_train)

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('sclaer',
                                                                   StandardScaler())]),
                                                  ['age', 'hypertension',
                                                   'heart_disease',
                                                   'avg_glucose_level',
                                                   'bmi']),
                                                 ('cat',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore')),
                                                                

## Check the score with X_test, y_test

In [None]:
clf.score(X_test, y_test)

0.957203528258739
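
For a classifier, `score` returns the mean accuracy, that is, the fraction of correct predictions. When one class is rare, a model that always predicts the majority class can already reach a high accuracy, so it may be worth comparing the score against that baseline. The labels below are synthetic and only illustrate the arithmetic.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic labels: 4 positives out of 100
labels = [1] * 4 + [0] * 96
always_zero = [0] * len(labels)  # constant "no stroke" prediction

baseline_accuracy = accuracy(labels, always_zero)
print(baseline_accuracy)
```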

## Make the prediction using test data from competition

In [None]:
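data_test = df_test.drop(['id'], axis=1)
X_submission = data_test  # separate name so the training features in X are not overwritten

y_pred = clf.predict(X_submission)  # separate name so the hold-out labels in y_test are kept


In [None]:
y_pred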

array([0, 0, 0, ..., 0, 0, 0])
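
The predictions could then be written to a CSV file for submission. The sketch below is an assumption-laden illustration: the column names `id` and `stroke` are guessed from the dataset's columns and should be checked against `sample_submission.csv`, and the ids and predictions are made-up values.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up ids and predictions standing in for the real ones
toy_ids = [15304, 15305, 15306]
toy_predictions = [0, 1, 0]

# Assumed layout: one row per test id with its predicted label
submission = pd.DataFrame({"id": toy_ids, "stroke": toy_predictions})

out_path = Path(tempfile.mkdtemp()) / "submission.csv"
submission.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

first_line = out_path.read_text().splitlines()[0]
print(first_line)
```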

In [None]:
# Numerical Feature
numerical_feature = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi']
numerical_pipeline = Pipeline(
    steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler())]
)

categorical_feature = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
categorical_pipeline = Pipeline(
    steps=[
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ('selector',SelectPercentile(chi2, percentile=50))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_feature),
        ('cat', categorical_pipeline, categorical_feature)
    ]
)

In [None]:
clf.score(X_train, y_train)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
11264,Female,53.00,0,0,Yes,Private,Rural,69.34,40.2,smokes
9255,Male,71.00,1,0,Yes,Private,Urban,211.63,33.1,never smoked
4306,Female,32.00,0,0,Yes,Private,Urban,90.36,22.1,smokes
7516,Female,34.00,0,0,Yes,Private,Urban,77.67,34.1,never smoked
6174,Male,0.72,0,0,No,children,Rural,112.19,18.9,Unknown
...,...,...,...,...,...,...,...,...,...,...
620,Male,58.00,0,0,Yes,Govt_job,Rural,86.05,25.1,never smoked
2455,Male,8.00,0,0,No,children,Urban,90.22,18.8,Unknown
10710,Male,67.00,0,0,Yes,Private,Urban,94.98,33.4,smokes
6137,Male,60.00,1,0,Yes,Private,Rural,76.00,30.0,never smoked


Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,15304,Female,57.0,0,0,Yes,Private,Rural,82.54,33.4,Unknown
1,15305,Male,70.0,1,0,Yes,Private,Urban,72.06,28.5,Unknown
2,15306,Female,5.0,0,0,No,children,Urban,103.72,19.5,Unknown
3,15307,Female,56.0,0,0,Yes,Govt_job,Urban,69.24,41.4,smokes
4,15308,Male,32.0,0,0,Yes,Private,Rural,111.15,30.1,smokes
...,...,...,...,...,...,...,...,...,...,...,...
10199,25503,Female,27.0,0,0,No,Private,Urban,75.77,17.6,never smoked
10200,25504,Male,49.0,0,0,Yes,Private,Urban,102.91,26.7,Unknown
10201,25505,Female,3.0,0,0,No,children,Rural,104.04,18.3,Unknown
10202,25506,Male,31.0,0,0,Yes,Private,Urban,82.41,28.7,never smoked
