# scikit-learn Pipelines

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

---
### Machine Learning intro
0. [Dataset](#0.-Dataset)
1. [Preprocessing](#1.-Preprocessing)

    1. [Missing numerical data](#A.-Missing-numerical-data)
    2. [Missing categorical data](#B.-Missing-categorical-data)
    3. [Categorical data transformation](#C.-Categorical-data-transformation)
    
2. [Split test train](#2.-Split-test-train)
3. [Scaling](#3.-Scaling)
4. [Feature selection](#4.-Feature-selection)
5. [Training](#5.-Training)
6. [Predict](#6.-Predict)

### Pipeline approach
7. [Prepare](#7.-Prepare)
8. [Training](#8.-Training)
9. [Predict](#9.-Predict)
---
### Titanic problem
Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

| Variable | Definition | Key | Type |
| :- | :- | :- | :- |
| survived | Survived | 0 = No, 1 = Yes | Numerical |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | Numerical |
| name | Name | | String |
| sex | Sex | | String |
| age | Age in years | | Numerical |
| sibsp | # of siblings / spouses aboard the Titanic | | Numerical |
| parch | # of parents / children aboard the Titanic | | Numerical |
| ticket | Ticket number | | String |
| fare | Passenger fare | | Numerical |
| cabin | Cabin number | | String |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | String |
| boat | Lifeboat (if survived) | | String |
| body | Body number (if did not survive and body was recovered) | | Numerical |
| home.dest | Destination | | String |


## 0. Dataset

In [1]:
import pandas as pd

In [2]:
dataset = pd.read_csv('titanic.csv', encoding='utf-8', na_values='?')

In [3]:
dataset.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


In [4]:
X = dataset.loc[:, ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = dataset.loc[:, 'survived']

In [5]:
X

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
0,1,female,29.0000,0,0,211.3375,S
1,1,male,0.9167,1,2,151.5500,S
2,1,female,2.0000,1,2,151.5500,S
3,1,male,30.0000,1,2,151.5500,S
4,1,female,25.0000,1,2,151.5500,S
...,...,...,...,...,...,...,...
1304,3,female,14.5000,1,0,14.4542,C
1305,3,female,,1,0,14.4542,C
1306,3,male,26.5000,0,0,7.2250,C
1307,3,male,27.0000,0,0,7.2250,C


## 1. Preprocessing

### A. Missing numerical data

In [6]:
import numpy as np
from sklearn.impute import SimpleImputer

In [7]:
X.loc[X['age'].isnull()]

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
15,1,male,,0,0,25.9250,S
37,1,male,,0,0,26.5500,S
40,1,male,,0,0,39.6000,C
46,1,male,,0,0,31.0000,S
59,1,female,,0,0,27.7208,C
...,...,...,...,...,...,...,...
1293,3,male,,0,0,8.0500,S
1297,3,male,,0,0,7.2500,S
1302,3,male,,0,0,7.2250,C
1303,3,male,,0,0,14.4583,C


In [8]:
numerical_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
numerical_imputer.fit(X.iloc[:, [0, 2, 3, 4, 5]])
X.iloc[:, [0, 2, 3, 4, 5]] = numerical_imputer.transform(X.iloc[:, [0, 2, 3, 4, 5]])

In [9]:
X.iloc[[1, 15, 20, 37], 2]

1      0.916700
15    29.881135
20    37.000000
37    29.881135
Name: age, dtype: float64
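
The rows that had a missing age (15 and 37) now hold 29.881135, which is the mean age reported by `describe()`. The same mechanism on a tiny frame is sketched below; the values and the `toy_*` names are made up for illustration and do not come from `titanic.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({'age': [20.0, np.nan, 40.0],
                    'fare': [10.0, 30.0, np.nan]})

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
toy_filled = toy_imputer.fit_transform(toy)

# statistics_ stores the per-column means learned during fit
print(toy_imputer.statistics_)
# Each NaN is replaced by the mean of its own column
print(toy_filled)
```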

### B. Missing categorical data

In [10]:
X.loc[X['embarked'].isnull()]

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
168,1.0,female,38.0,0.0,0.0,80.0,
284,1.0,female,62.0,0.0,0.0,80.0,


In [11]:
categorical_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
categorical_imputer.fit(X.iloc[:, [1, 6]])
X.iloc[:, [1, 6]] = categorical_imputer.transform(X.iloc[:, [1, 6]])

In [12]:
X.loc[X['embarked'].isnull()]

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked


In [13]:
X.iloc[168, [1, 6]]

sex         female
embarked         S
Name: 168, dtype: object
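
Row 168 received `S` because it is the most common port in the column. A minimal sketch of `strategy='most_frequent'` on made-up string data follows; the frame and the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; object dtype so that np.nan marks the missing entry
toy = pd.DataFrame({'sex': ['male', 'female', 'male'],
                    'embarked': ['S', np.nan, 'S']}, dtype=object)

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
toy_filled = toy_imputer.fit_transform(toy)

# statistics_ stores the most frequent value of each column
print(toy_imputer.statistics_)
print(toy_filled)
```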

### C. Categorical data transformation

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [15]:
column_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1, 6])],
                                       remainder='passthrough')
column_transformer.fit(X)
X = np.array(column_transformer.transform(X))

In [16]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1, 6])])
ct.fit(X)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('encoder',
                                 OneHotEncoder(categories='auto', drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               sparse=True),
                                 [1, 6])],
                  verbose=False)

In [17]:
X[0]

array([  1.    ,   0.    ,   0.    ,   0.    ,   1.    ,   1.    ,
        29.    ,   0.    ,   0.    , 211.3375])

## 2. Split test train

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
# astype returns a new array, so assign the result to keep the conversion
X = X.astype('float64')
X

array([[  1.    ,   0.    ,   0.    , ...,   0.    ,   0.    , 211.3375],
       [  0.    ,   1.    ,   0.    , ...,   1.    ,   2.    , 151.55  ],
       [  1.    ,   0.    ,   0.    , ...,   1.    ,   2.    , 151.55  ],
       ...,
       [  0.    ,   1.    ,   1.    , ...,   0.    ,   0.    ,   7.225 ],
       [  0.    ,   1.    ,   1.    , ...,   0.    ,   0.    ,   7.225 ],
       [  0.    ,   1.    ,   0.    , ...,   0.    ,   0.    ,   7.875 ]])

In [20]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [21]:
print(len(X), len(y), len(x_train), len(y_train), len(x_test), len(y_test)) 

1309 1309 1047 1047 262 262
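
The 1047 / 262 split follows from `test_size=0.2`: scikit-learn rounds the test share up and gives the rest to the training set. A small check on a placeholder index array (not the real data) reproduces the sizes:

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1309
# The test set size is rounded up; the remainder goes to training
expected_test = math.ceil(0.2 * n_samples)
expected_train = n_samples - expected_test

# Placeholder data: only the number of rows matters here
idx_train, idx_test = train_test_split(np.arange(n_samples),
                                       test_size=0.2, random_state=1)
print(expected_train, expected_test, len(idx_train), len(idx_test))
```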


## 3. Scaling

In [22]:
from sklearn.preprocessing import StandardScaler

In [23]:
scaler = StandardScaler()
x_train[:, 5:] = scaler.fit_transform(x_train[:, 5:])
x_test[:, 5:] = scaler.transform(x_test[:, 5:])

In [24]:
pd.DataFrame(data=x_train[:5], 
             columns=['female', 'male', 'C', 'Q', 'S', 'pclass', 'age', 'sibsp', 'parch', 'fare'])

Unnamed: 0,female,male,C,Q,S,pclass,age,sibsp,parch,fare
0,0.0,1.0,0.0,0.0,1.0,-1.572054,1.255138,0.463734,-0.455474,0.525646
1,0.0,1.0,0.0,0.0,1.0,-0.368001,-0.852514,-0.473582,-0.455474,-0.441533
2,1.0,0.0,0.0,1.0,0.0,0.836052,0.006159,-0.473582,-0.455474,-0.509288
3,0.0,1.0,0.0,0.0,1.0,0.836052,-0.00312,-0.473582,-0.455474,-0.488293
4,0.0,1.0,0.0,0.0,1.0,0.836052,-2.257616,4.212998,1.934909,0.253195
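
Note that the scaler is fitted on the training rows only and the test rows are transformed with the training mean and standard deviation. A minimal sketch with made-up numbers (the `toy_*` names are hypothetical) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

toy_scaler = StandardScaler()
# fit_transform learns mean and standard deviation from the training rows
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# transform reuses those training statistics for the test rows
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_train_scaled.ravel(), toy_test_scaled.ravel())
```

The scaled training column has mean 0, while the test value lands outside the training range because it is measured against the training statistics.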


## 4. Feature selection

In [25]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

In [26]:
feature_selector = LinearSVC(C=0.01, penalty="l1", dual=False)
feature_selector.fit(x_train, y_train)
feature_model = SelectFromModel(feature_selector, prefit=True)
x_new = feature_model.transform(x_train)

In [27]:
x_train.shape

(1047, 10)

In [28]:
x_new.shape

(1047, 6)

In [29]:
feature_model.get_support()

array([ True,  True, False, False, False,  True,  True,  True, False,
        True])
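
The boolean mask is easier to read when mapped back to column names. The sketch below copies the mask printed above and the column order used in the scaling section; it only relabels that output and does not refit anything.

```python
import numpy as np

# Column order used when displaying x_train in the scaling section
feature_names = np.array(['female', 'male', 'C', 'Q', 'S',
                          'pclass', 'age', 'sibsp', 'parch', 'fare'])
# Mask copied from feature_model.get_support() above
support_mask = np.array([True, True, False, False, False,
                         True, True, True, False, True])

selected = feature_names[support_mask].tolist()
dropped = feature_names[~support_mask].tolist()
print('kept:   ', selected)
print('dropped:', dropped)
```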

## 5. Training

In [37]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#### Decision Tree Classifier

In [38]:
dtc_classifier = DecisionTreeClassifier(criterion='entropy')
dtc_classifier.fit(x_new, y_train)
y_pred = dtc_classifier.predict(feature_model.transform(x_test))
accuracy = accuracy_score(y_test, y_pred)
print('Decision tree {}'.format(accuracy))

Decision tree 0.7938931297709924


#### K-nearest Neighbours Classifier

In [34]:
knn_classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_classifier.fit(x_new, y_train)
y_pred = knn_classifier.predict(feature_model.transform(x_test))
accuracy = accuracy_score(y_test, y_pred)
print('K-NN {}'.format(accuracy))

K-NN 0.8015267175572519


#### Logistic regression

In [35]:
lr_classifier = LogisticRegression(C=1)
lr_classifier.fit(x_new, y_train)
y_pred = lr_classifier.predict(feature_model.transform(x_test))
accuracy = accuracy_score(y_test, y_pred)
print('Logistic Regression {}'.format(accuracy))

Logistic Regression 0.8244274809160306


## 6. Predict

In [None]:
from random import randrange

In [None]:
to_predict = dataset.iloc[[randrange(len(dataset))]]
to_predict

In [None]:
x_predict = to_predict.loc[:, ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
x_predict.iloc[:, [0, 2, 3, 4, 5]] = numerical_imputer.transform(x_predict.iloc[:, [0, 2, 3, 4, 5]])
x_predict.iloc[:, [1, 6]] = categorical_imputer.transform(x_predict.iloc[:, [1, 6]])
x_predict = np.array(column_transformer.transform(x_predict))
x_predict[:, 5:] = scaler.transform(x_predict[:, 5:])
x_predict = feature_model.transform(x_predict)

In [None]:
x_predict

In [None]:
# predict returns an array; take its single element before comparing
print('survived' if lr_classifier.predict(x_predict)[0] == 1 else 'not survived')

# Pipeline

## 7. Prepare

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='error'))
])

preprocessor = ColumnTransformer(transformers=[
    ('categorical', categorical_transformer, [1, 6]),
    ('numerical', numeric_transformer, [0, 2, 3, 4, 5])
], remainder='passthrough')

feature_selection = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False))),
])
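
To see that such a pipeline accepts raw rows with missing values, here is a reduced sketch on made-up data. It is an illustration under assumptions, not the notebook's model: the frame is hypothetical, columns are referenced by name instead of position, and the feature selection step is left out to keep the toy small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing ages
toy_X = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, 54.0, 2.0, np.nan, 27.0, 40.0],
    'fare': [7.25, 71.3, 8.05, 51.9, 21.1, 8.46, 11.1, 30.0],
    'sex': ['male', 'female', 'male', 'male',
            'female', 'male', 'female', 'female'],
}).astype({'sex': object})
toy_y = [0, 1, 0, 0, 1, 0, 1, 1]

toy_preprocessor = ColumnTransformer(transformers=[
    ('numerical', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())]), ['age', 'fare']),
    ('categorical', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder())]), ['sex']),
])

toy_model = Pipeline(steps=[
    ('preprocessor', toy_preprocessor),
    ('classification', LogisticRegression()),
])
toy_model.fit(toy_X, toy_y)

# A new raw row with a missing age goes straight into predict
toy_new = pd.DataFrame({'age': [np.nan], 'fare': [60.0],
                        'sex': ['female']}).astype({'sex': object})
toy_new_prepared = toy_model.named_steps['preprocessor'].transform(toy_new)
toy_pred = toy_model.predict(toy_new)
print(toy_new_prepared, toy_pred)
```

Every step is fitted inside `fit`, so imputation and scaling statistics come only from the rows passed to `fit`.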

## 8. Training

In [None]:
X = dataset.loc[:, ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = dataset.loc[:, 'survived']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
final_model = None
best_score = 0
for classifier in [DecisionTreeClassifier(criterion='entropy'),
                   KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
                   LogisticRegression(C=1)]:
    model = Pipeline(steps=[
        ('feature_selection', feature_selection),
        ('classification', classifier)
    ])
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)
    print(f"{classifier.__class__.__name__},\t model score: {score:.3f}")
    if score > best_score:
        # Track the best score so a later, weaker model cannot replace the winner
        best_score = score
        final_model = model

In [None]:
from joblib import dump

In [None]:
dump(final_model, 'our_awesome_model.joblib')
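
A quick way to check that persistence preserves behaviour is to reload the file and compare predictions. The sketch below does this for a stand-in model on made-up data, writing to a temporary directory; the `toy_*` names and the file name are hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the fitted pipeline
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_model.joblib')
    dump(toy_model, toy_path)
    restored_model = load(toy_path)

# The restored model should give the same predictions as the original
same_predictions = np.array_equal(toy_model.predict(toy_X),
                                  restored_model.predict(toy_X))
print(same_predictions)
```

A joblib file is generally expected to be loaded with the same library versions that wrote it.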

##### Restart kernel

## 9. Predict

In [None]:
import pandas as pd
from joblib import load
from random import randrange

In [None]:
loaded_model = load('our_awesome_model.joblib')

In [None]:
loaded_model

In [None]:
dataset = pd.read_csv('titanic.csv', encoding='utf-8', na_values='?')
to_predict = dataset.iloc[[randrange(len(dataset))]]
to_predict

In [None]:
to_predict = to_predict.loc[:, ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

In [None]:
# predict returns an array; take its single element before comparing
print('survived' if loaded_model.predict(to_predict)[0] == 1 else 'not survived')