# Model Building Process

### 1. Train-Test-Split
- Random Split (Representativeness)

#### Training Set: 
1. **Exploratory Analysis** (Descriptive, Visualization)
  - Understand your data, and get an idea of what to do with your data


2. **Pre-processing** (fit/transform)
  - Act on the findings of the Exploratory Analysis (e.g. scaling, standardizing, one-hot encoding, etc.)


3. **Modeling** (fit/predict/evaluate)
  - Modeling techniques (Linear Regression, Logistic Regression, Decision Trees, Neural Network)
  - Hyperparameter tuning


4. **K-Fold Cross-Validation**

---

***Avoid Data Leakage*** - no back-and-forth between training and testing set

---

#### Testing Set:
1. **Pre-processing** (transform)
  - Replicate the exact same process and transform your testing set


2. **Test Model** (predict/evaluate)
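
One way to follow the outline above without leaking information is to wrap pre-processing and model in a single `Pipeline`, so the scaler is re-fitted on the training part of every fold. The sketch below is only an illustration: the iris toy dataset, `StandardScaler`, and `LogisticRegression` are stand-ins chosen for this example, not choices made by the outline.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. split first, so the testing set is never seen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# 2. pre-processing + model as one estimator: fit() fits the scaler on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 3. K-fold cross-validation, run on the training set only
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv)
print(cv_scores.mean())

# 4. final check on the held-out testing set (transform + predict happen inside the pipeline)
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(test_score)
```

Because the scaler lives inside the pipeline, each cross-validation fold calls `fit` on its own training portion and only `transform` on its validation portion.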

#### Fit/Transform vs Transform
- train = [0, 50, 100, 150, 200]

min-max scale:
- (e - min)/(max - min)
- min = 0
- max = 200
- return : mmstrain = [0, 0.25, 0.5, 0.75, 1]

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# scikit-learn expects 2D input of shape (n_samples, n_features)
train = np.array([0, 50, 100, 150, 200]).reshape(-1, 1)

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()

# every scikit-learn estimator is a class and must be instantiated first

# under pre-processing (fit/transform)
mms.fit(train) # -> model fitting : extract parameters | min = 0, max = 200
mms.transform(train) # -> transform the data using the fitted parameters
# train -> [0, 0.25, 0.5, 0.75, 1]

# the testing set has its own range (min = 0, max = 2000),
# but we reuse the parameters fitted on the training set
test = np.array([0, 50, 100, 150, 2000]).reshape(-1, 1)
mms.transform(test) # -> [0, 0.25, 0.5, 0.75, 10] --> falls outside [0, 1], which does not align with the idea of MinMaxScaler
# .predict(test) -> ok
# .transform(test) -> ok
# .fit(test) -> NOT OKAY!!

# But the out-of-range value (outlier) is acceptable for 3 reasons
# # 1. Random Split - representative enough
# # 2. Not overfitting to our data - a robust model can take care of it
# # 3. Expected - it is the analyst's job to maintain and re-train the model (perhaps something has fundamentally shifted)

# SciKitLearn

- Built on SciPy, NumPy, Matplotlib
- Commercially Useable -> BSD License

Resources:
- https://scikit-learn.org/stable/index.html
- https://scikit-learn.org/stable/user_guide.html
- https://scikit-learn.org/stable/modules/classes.html

## Datasets

- [Toy Datasets](https://scikit-learn.org/stable/datasets/index.html#toy-datasets)
- [Real World Datasets](https://scikit-learn.org/stable/datasets/index.html#real-world-datasets)
- [Generated Datasets](https://scikit-learn.org/stable/datasets/index.html#generated-datasets)

### Load Toy Datasets:

In [None]:
import pandas as pd

# load iris data
from sklearn.datasets import load_iris

In [None]:
X, y = load_iris(return_X_y = True)

In [None]:
# print(X, y) # array, label

In [None]:
dataset = load_iris() # load

In [None]:
print(dataset.DESCR) # information about the dataset

In [None]:
dataset.feature_names

In [None]:
df = pd.DataFrame(data = dataset.data,
                 columns = dataset.feature_names)
df['label'] = dataset.target
df.head()

In [None]:
df.groupby('label').describe()

#### `load_wine` dataset

In [None]:
from sklearn.datasets import load_wine
dataset = load_wine()
df = pd.DataFrame(data = dataset.data,
                 columns = dataset.feature_names)
df['label'] = dataset.target
df.head()

#### `load_breast_cancer` dataset

In [None]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
df = pd.DataFrame(data = dataset.data,
                 columns = dataset.feature_names)
df['label'] = dataset.target
df.head()

#### `load_diabetes` dataset

In [None]:
from sklearn.datasets import load_diabetes
dataset = load_diabetes()
df = pd.DataFrame(data = dataset.data,
                 columns = dataset.feature_names)
df['label'] = dataset.target
df.head()

#### `load_boston` dataset

In [None]:
# note: load_boston was removed in scikit-learn 1.2; this cell needs an older version
from sklearn.datasets import load_boston
dataset = load_boston()
df = pd.DataFrame(data = dataset.data,
                 columns = dataset.feature_names)
df['label'] = dataset.target
df.head()

In [None]:
# generated dataset
# imports
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import make_regression

# load
X, y = make_regression(n_samples = 300,
                       n_features = 1,
                       noise = 5)
# plot
plt.scatter(X, y)

In [None]:
from sklearn.datasets import make_regression

# load
X, y = make_regression(n_samples = 300,
                       n_features = 1,
                       noise = 5)

# polynomial 
y = y*y

# plot
plt.scatter(X, y)

In [None]:
# Generate isotropic Gaussian blobs for clustering
from sklearn.datasets import make_blobs

# load
X, y = make_blobs(n_samples=100,
                 n_features=2,
                 centers=3,
                 cluster_std=1,
                 random_state=3)

# plot
plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], marker='o', c=y, s=25, edgecolor='k')

In [None]:
# make_circle
from sklearn.datasets import make_circles

# load
X, y = make_circles(noise=0.2,
                    factor=0.5,
                   random_state=1)

# plot
plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], marker='o', c=y, s=25, edgecolor='k')

In [None]:
# make_moon
from sklearn.datasets import make_moons

# load
X, y = make_moons(n_samples= 100, noise=0.1)

# plot
plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], marker='o', c=y, s=25, edgecolor='k')

In [None]:
# make_classification
from sklearn.datasets import make_classification

# load
X, y = make_classification(n_samples= 1000,
                           n_features = 2,
                           n_redundant = 0,
                           n_informative = 1, # -> 1
                           n_clusters_per_class = 1,
                           n_classes = 2)

# plot
plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], marker='o', c=y, s=25, edgecolor='k')

In [None]:
# make_classification
from sklearn.datasets import make_classification

# load
X, y = make_classification(n_samples= 1000,
                           n_features = 2,
                           n_redundant = 0,
                           n_informative = 2,
                           n_clusters_per_class = 2, # ->2
                           n_classes = 2)

# plot
plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], marker='o', c=y, s=25, edgecolor='k')

In [None]:
# make_classification
from sklearn.datasets import make_classification

# load
X, y = make_classification(n_samples= 1000,
                           n_features = 2,
                           n_redundant = 0,
                           n_informative = 2,
                           n_clusters_per_class = 1,
                           n_classes = 3) # -> 3

# plot
plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], marker='o', c=y, s=25, edgecolor='k')

In [None]:
# Generate isotropic Gaussian and label samples by quantile
from sklearn.datasets import make_gaussian_quantiles

# load
X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)

# plot
plt.figure(figsize=(4,4))
plt.scatter(X1[:,0], X1[:,1], marker='o', c=Y1, edgecolor='k')

# Preprocessing

- Documentation:
  - https://scikit-learn.org/stable/modules/preprocessing.html


- Encoding Categorical Variables:
  - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html


- Transforming Prediction Targets:
  - https://scikit-learn.org/stable/modules/preprocessing_targets.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html


- Standardization, Scaling, Normalization:
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html


- Discretization:
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html


- Missing Value Imputation
  - https://scikit-learn.org/stable/modules/impute.html


- Polynomial Features
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html


- Custom Transformers
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html


## Process(`OrdinalEncoder`)
- fit and transform training data
- transform testing data

### Fit and Transforming The Training Data

In [None]:
import pandas as pd
import numpy as np

In [None]:
# sample data
train = pd.DataFrame(
    data=[
    [1, 'Aaron', 'Third', 22, 7.25, 'no'], 
    [2, 'Beth', 'First', 38, 71.28, 'yes'], 
    [3, 'Cathy', 'Second', 26, 7.92, 'yes'], 
    [4, 'Dave', 'First', 60, 71.28, 'yes'], 
    [5, 'Erin', 'Second', 70, 71.92, 'no']], 
    columns=['Id', 'Name', 'Pclass', 'Age', 'Fare', 'Survived'])

# feature
train_X = train.drop('Survived', axis=1)

# target
train_Y = train['Survived']

train

In [None]:
# encode categorical variable - ordinal encoding
# (not one-hot-encoding)

# import
from sklearn.preprocessing import OrdinalEncoder

In [None]:
# instantiate the preprocess model
oe = OrdinalEncoder()

# fit model to training data w/ preprocessing model
# feed training data
oe.fit(train_X[['Pclass']])

In [None]:
# transform the training data
oe.transform(train_X[['Pclass']])

In [None]:
# put into a new column
train_X['Pclass_transformed'] = oe.transform(train_X[['Pclass']])

train_X

In [None]:
# drop the original data
train_X.drop('Pclass', axis=1, inplace=True)
train_X

### Transforming the Test Data

Now that we have a fitted encoder that maps the original categories to numeric codes,
we can call `transform` on our test data.

In [None]:
test = pd.DataFrame(
    data=[
    [6, 'Fiona', 'Second', 2, 50.25, 'yes'], 
    [7, 'Gina', 'Third', 25, 7.28, 'no'], 
    [8, 'Heather', 'First', 30, 71.92, 'no'], 
    [9, 'Ingrid', 'First', 54, 71.28, 'yes'], 
    [10, 'John', 'Third', 66, 7.92, 'yes']], 
    columns=['Id', 'Name', 'Pclass', 'Age', 'Fare', 'Survived'])

test_X = test.drop('Survived', axis=1)

test_Y = test['Survived']

test

In [None]:
test_X['Pclass_transformed'] = oe.transform(test_X[['Pclass']])

test_X.drop('Pclass', axis=1, inplace=True)

test_X

In [None]:
# we can also combine fit and transform into one step!

# sample data
train = pd.DataFrame(
    data=[
    [1, 'Aaron', 'Third', 22, 7.25, 'no'], 
    [2, 'Beth', 'First', 38, 71.28, 'yes'], 
    [3, 'Cathy', 'Second', 26, 7.92, 'yes'], 
    [4, 'Dave', 'First', 60, 71.28, 'yes'], 
    [5, 'Erin', 'Second', 70, 71.92, 'no']], 
    columns=['Id', 'Name', 'Pclass', 'Age', 'Fare', 'Survived'])

# feature
train_X = train.drop('Survived', axis=1)

# target
train_Y = train['Survived']


# import
from sklearn.preprocessing import OrdinalEncoder

# instantiate the preprocess model
oe = OrdinalEncoder()

# fit + transform
train_X['Pclass_transformed'] = oe.fit_transform(train_X[['Pclass']])

train_X

In [None]:
# now we have fit and transformed the training data, let's transform our testing data
test = pd.DataFrame(
    data=[
    [6, 'Fiona', 'Second', 2, 50.25, 'yes'], 
    [7, 'Gina', 'Third', 25, 7.28, 'no'], 
    [8, 'Heather', 'First', 30, 71.92, 'no'], 
    [9, 'Ingrid', 'First', 54, 71.28, 'yes'], 
    [10, 'John', 'Third', 66, 7.92, 'yes']], 
    columns=['Id', 'Name', 'Pclass', 'Age', 'Fare', 'Survived'])

test_X = test.drop('Survived', axis=1)

test_Y = test['Survived']


test_X['Pclass_transformed'] = oe.transform(test_X[['Pclass']])

test_X

## `OneHotEncoder`

In [None]:
df = pd.read_csv('train.csv')
df.head()

In [None]:
df.dtypes

In [None]:
df['Pclass'].unique()

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe = OneHotEncoder(categories='auto',
                   drop=None, # None: keep every encoded column; 'first' drops one column per feature
                   handle_unknown='error', # raise on a category present in testing data but not in training data; 'ignore' is a common alternative
                   sparse=False, # return a dense array (sparse by default); renamed to sparse_output in scikit-learn 1.2
                   dtype=int) # default float


ohe.fit(df[['Pclass']])

ohe.transform(df[['Pclass']])

In [None]:
ohe.get_feature_names() # in scikit-learn >= 1.0 use ohe.get_feature_names_out()
# can be used as column names when joining back to the original data

In [None]:
# could of course fit and transform in one step
dfcat =  pd.DataFrame(ohe.fit_transform(df[['Pclass']]),
                     columns = ohe.get_feature_names())
dfcat.head()

In [None]:
dfcat = pd.concat([df, dfcat], axis=1)
dfcat.head()

In [None]:
# dropping the first encoded column

ohe = OneHotEncoder(categories='auto',
                   drop='first', 
                   handle_unknown='error',
                   sparse=False,
                   dtype=int)

dfcat =  pd.DataFrame(ohe.fit_transform(df[['Pclass']]),
                     columns = ohe.get_feature_names())

dfcat = pd.concat([df, dfcat], axis=1)

dfcat = dfcat.drop('Pclass', axis=1)

dfcat

In [None]:
categorical_var = ['Pclass','Sex','Embarked']

ohe = OneHotEncoder(categories='auto',
                   drop='first',
                   handle_unknown='error',
                   sparse=False,
                   dtype=int)

df = df.fillna('X')

dfcat =  pd.DataFrame(ohe.fit_transform(df[categorical_var]),
                     columns = ohe.get_feature_names())

dfcat = pd.concat([df, dfcat], axis=1)

dfcat = dfcat.drop(categorical_var, axis=1)

dfcat.head()
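
The cells above all use `handle_unknown='error'`. As a hedged illustration of the alternative, the sketch below shows what `handle_unknown='ignore'` does with a category that appears only in the testing data. The two small frames are made up for this example, and the encoder is left at its default sparse output (converted with `.toarray()`) so the snippet does not depend on the `sparse` / `sparse_output` parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data: 'Q' appears only in the testing set
train_cat = pd.DataFrame({'Embarked': ['S', 'C', 'S']})
test_cat = pd.DataFrame({'Embarked': ['C', 'Q']})

ohe_ignore = OneHotEncoder(handle_unknown='ignore')  # default output is a sparse matrix
ohe_ignore.fit(train_cat)                            # learns the categories 'C' and 'S'

encoded = ohe_ignore.transform(test_cat).toarray()
print(ohe_ignore.categories_)
print(encoded)  # the row for the unseen 'Q' is all zeros
```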

## Standardizing, Scaling, Normalizing

### Standardization - Mean Removal and Variance Scaling
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
  - `xi_scale = (xi - xmean)/xsd`
  - resulting distribution has mean 0 and sd 1.
  - sensitive to outliers, and cannot guarantee balanced scales in the presence of outliers.
  - also, the outliers themselves are still present in the transformed data.

In [None]:
df = pd.read_csv('train.csv')
df.head()

In [None]:
numeric_vars = ['Age', 'Fare']

from sklearn.preprocessing import StandardScaler

ss = StandardScaler(with_mean=True,
                   with_std=True)

dfnumss = pd.DataFrame(ss.fit_transform(df[numeric_vars]),
                      columns = ['ss_'+x for x in numeric_vars])

dfnumss.head()

In [None]:
df = pd.concat([df, dfnumss], axis=1).drop(numeric_vars, axis=1)
df

In [None]:
df['ss_Age'].mean(), df['ss_Age'].std()

In [None]:
df['ss_Fare'].mean(), df['ss_Fare'].std()
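
The standard deviations checked above can come out slightly larger than 1. That is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). A small sketch on made-up numbers:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

x = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})  # made-up values
scaled = pd.Series(StandardScaler().fit_transform(x)[:, 0])

print(scaled.mean())       # approximately 0
print(scaled.std(ddof=0))  # population sd, approximately 1
print(scaled.std())        # pandas default (ddof=1): sqrt(5/4), about 1.118
```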

### Min Max Scale - scaling each feature to a given range (by default [0,1])
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
  - `xi_scale = (xi - min(x))/(max(x) - min(x))`
  - by default, resulting distribution is in [0, 1] range.
  - **sensitive to outliers**, and cannot guarantee balanced scales in the presence of outliers.
  - also, the outliers themselves are still present in the transformed data.

In [None]:
df = pd.read_csv('train.csv')

numeric_vars = ['Age', 'Fare']

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()

dfnummms = pd.DataFrame(mms.fit_transform(df[numeric_vars]),
                      columns = ['mms_'+x for x in numeric_vars])

df = pd.concat([df, dfnummms], axis=1).drop(numeric_vars, axis=1)

dfnummms.head()

In [None]:
df['mms_Age'].min() , df['mms_Age'].max()

In [None]:
df['mms_Fare'].min() , df['mms_Fare'].max()

In [None]:
# note: transformed testing values are not guaranteed to land exactly in [0, 1],
# because min and max come from the training set
# the whole point is to see how robust and general the model is

### MaxAbsScaler
  - MaxAbsScaler scales each feature by its maximum absolute value.
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html
  - `xi_scale = (xi)/(max(abs(x)))`
  - resulting distribution is in [-1, 1] range.
  - **sensitive to outliers**, and cannot guarantee balanced scales in the presence of outliers.
  - also, the outliers themselves are still present in the transformed data.

In [None]:
df = pd.read_csv('train.csv')

numeric_vars = ['Age', 'Fare']

from sklearn.preprocessing import MaxAbsScaler

mas = MaxAbsScaler()

dfnummas = pd.DataFrame(mas.fit_transform(df[numeric_vars]),
                      columns = ['mas_'+x for x in numeric_vars])

dfnummas = pd.concat([dfnummas, df], axis=1).drop(numeric_vars, axis=1)

dfnummas

### Robust Scaler
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
  - removes the median and scales the data according to the inter-quartile range (defaults to IQR (or Q3-Q1))
  - `xi_scale = (xi - Q2(x))/(Q3(x) - Q1(x))` where Q1, Q2, and Q3 are 25th, 50th and 75th quantiles
  - robust to outliers, but the outliers themselves are still present in the transformed data.

In [None]:
df = pd.read_csv('train.csv')

numeric_vars = ['Age', 'Fare']

from sklearn.preprocessing import RobustScaler

rs = RobustScaler()

dfnumrs = pd.DataFrame(rs.fit_transform(df[numeric_vars]),
                      columns = ['rs_'+x for x in numeric_vars])

dfnumrs = pd.concat([dfnumrs, df], axis=1).drop(numeric_vars, axis=1)

dfnumrs.head()
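
To make the formula concrete, the sketch below recomputes the scaling by hand with `np.percentile` and compares it to `RobustScaler` with its default settings. The five values, including the outlier, are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # made-up values with one outlier

# Q1 = 2, Q2 (median) = 3, Q3 = 4 for these values
q1, q2, q3 = np.percentile(x, [25, 50, 75])
manual = (x - q2) / (q3 - q1)

scaled = RobustScaler().fit_transform(x)
print(np.allclose(manual, scaled))
print(scaled.ravel())  # the outlier is still present, just rescaled
```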

### Power Transformer
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
  - makes data more **Gaussian-like**.
  - estimates the optimal power parameter (lambda) to stabilize variance and minimize skewness through maximum likelihood estimation. 
  - by default, PowerTransformer also applies zero-mean, unit variance normalization to the transformed output. 
  - supports the Box-Cox transform (can only be applied to strictly positive data) and the **Yeo-Johnson** transform (if there are negative values in data).

In [None]:
df = pd.read_csv('train.csv')
numeric_vars = ['Age', 'Fare']
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
dfnumpt = pd.DataFrame(pt.fit_transform(df[numeric_vars]), columns=['pt_'+x for x in numeric_vars])
dfnumpt = pd.concat([df, dfnumpt], axis=1).drop(numeric_vars, axis=1)
dfnumpt.head()
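
As a rough check of the "more Gaussian-like" claim, the sketch below measures skewness before and after the transform. The log-normal sample is synthetic and chosen only because it is strictly positive and right-skewed. `scipy.stats.skew` is used for the measurement, assuming SciPy is available (scikit-learn depends on it).

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# synthetic right-skewed, strictly positive data (log-normal draws)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

pt_demo = PowerTransformer(method='box-cox')  # Box-Cox is valid here because x > 0
x_pt = pt_demo.fit_transform(x)

print(skew(x[:, 0]), skew(x_pt[:, 0]))  # skewness before and after
print(x_pt.mean(), x_pt.std())          # standardized output: mean ~0, sd ~1
```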

### Normalization (Sample Vector)
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html
- Normalization is the process of scaling **individual *samples***( **not features** - i.e., operation is along **rows**!) to have unit norm. 
- This process can be useful if you plan to use a quadratic form such as the dot-product 
- or any other kernel to quantify the similarity of any pair of samples.
  - l1: sum of abs values is 1
  - **l2: sum of square of values is 1**

In [None]:
# load iris because all of its features are numeric
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(data = iris.data, columns = iris.feature_names)



from sklearn.preprocessing import Normalizer

norm = Normalizer(norm='l2')

dfnorm = pd.DataFrame(norm.fit_transform(df), columns=['norm_'+x for x in df.columns])

dfnorm.head()

In [None]:
# check
dfnorm.apply(lambda x: x**2).sum(axis=1)
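
The same row-wise check can be run for the `l1` norm. A minimal sketch on two made-up samples, showing both norms side by side:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_rows = np.array([[1.0, 3.0], [-2.0, 2.0]])  # made-up samples (one per row)

X_l1 = Normalizer(norm='l1').fit_transform(X_rows)
X_l2 = Normalizer(norm='l2').fit_transform(X_rows)

print(np.abs(X_l1).sum(axis=1))  # per row: sum of absolute values is 1
print((X_l2 ** 2).sum(axis=1))   # per row: sum of squares is 1
```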

### DISCRETIZATION (or quantization or binning)
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html


### KBinsDiscretizer: bin continuous data into intervals
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html


In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df.head()

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

# instantiate
kbd = KBinsDiscretizer(n_bins=5,            # number of bins
                       encode='ordinal',    # default : onehot
                       strategy='quantile') # read documentation for encode and strategy

# fit, transform
dfkbd = pd.DataFrame(kbd.fit_transform(df), columns=['kbd_'+x for x in df.columns])

dfkbd

In [None]:
# check out the bin boundaries 
kbd.bin_edges_

### Binarizer: 
  - binarize data (set feature values to 0 or 1) according to a threshold
  - Binarizer is similar to the KBinsDiscretizer when k = 2, and when the bin edge is at the value threshold.
  - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import Binarizer
bnr = Binarizer(threshold=4.9)
dfbnr = pd.DataFrame(bnr.fit_transform(df[['sepal length (cm)']]), columns=['bnr_sepal length'])
dfbnr.head()

## Missing Value Imputation

- Missing Value Imputation
  - https://scikit-learn.org/stable/modules/impute.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html


### Univariate Feature Imputation
  - https://scikit-learn.org/stable/modules/impute.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
  - https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
df = pd.read_csv('train.csv')

# impute Age missing value with mean
impage = SimpleImputer(missing_values=np.nan,
                       strategy='mean',
                       fill_value=None)

df_impage = pd.DataFrame(data=impage.fit_transform(df[['Age']]),
                         columns=['imp_Age'])
df_impage

In [None]:
# impute Cabin missing value with a constant
impcabin = SimpleImputer(missing_values=np.nan,
                       strategy='constant',
                       fill_value='MISSING')

df_impcabin = pd.DataFrame(data=impcabin.fit_transform(df[['Cabin']]),
                         columns=['imp_Cabin'])
df_impcabin

In [None]:
dfimp = pd.concat([df, df_impage,df_impcabin], axis=1)
dfimp

In [None]:
# check
dfimp[(df['Age'].isna()) | (df['Cabin'].isna())].head()

In [None]:
dfimp = dfimp.drop(['Age', 'Cabin'],axis=1)
dfimp

### Multivariate feature imputation
- https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html


#### NOTE!!! 
- This estimator is still **experimental for now**: the predictions and the API might change without any deprecation cycle. So to use it, **you need to explicitly** `import enable_iterative_imputer`.
- Each feature with missing values is modeled as a function of other features, and that estimate is then used for imputation. 
- It does this in an iterated round-robin fashion: 
- At each step, a feature column is designated as output y, and the other feature columns are treated as inputs X. 
- A regressor is fit on (X, y) for known y. Then, the regressor is used to **predict the missing values** of y. 
- This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. 
- The results of the final imputation round are returned.

In [None]:
df = pd.read_csv('train.csv')

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

numericvars = ['SibSp', 'Parch', 'Age']

imp = IterativeImputer(estimator = None, # which means using the default estimator : 'BayesianRidge()'
                      max_iter=10,
                      random_state=0)

dfimp = pd.DataFrame(data = imp.fit_transform(df[numericvars]),
                     columns = ['imp_'+x for x in numericvars])

dfimp = pd.concat([df,dfimp], axis=1)

dfimp

In [None]:
df = pd.read_csv('train.csv')

# use RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

numericvars = ['SibSp', 'Parch', 'Age']

imp = IterativeImputer(estimator = RandomForestRegressor(),
                      max_iter=10,
                      random_state=0)

dfimp = pd.DataFrame(data = imp.fit_transform(df[numericvars]),
                     columns = ['imp_'+x for x in numericvars])

dfimp = pd.concat([df,dfimp], axis=1)

dfimp

### Missing Indicator

- When using imputation, preserving the information about which values had been missing can be informative. 
- Can use MissingIndicator to transform a dataset into corresponding binary matrix indicating the presence of missing values in the dataset. 

In [None]:
from sklearn.impute import MissingIndicator

lst = [[1, 2, 20], [3, 6, 60], [4, 8, 80], [np.nan, 3, 30], [np.nan, np.nan, 70]]
dff = pd.DataFrame(lst)

mi = MissingIndicator(missing_values=np.nan)

mi.fit_transform(dff)

### POLYNOMIAL FEATURE GENERATION
- Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.
- For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df = df[['sepal length (cm)','sepal width (cm)']]
df.rename(columns={'sepal length (cm)':'a', 'sepal width (cm)':'b'}, inplace=True)
df.head()

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly = PolynomialFeatures(degree=2,
                         interaction_only=False, # If true, only interaction features are produced: features that are products
                         include_bias=True) # intercept

dfpoly = pd.DataFrame(data=poly.fit_transform(df),
                     columns= ['bias', 'a', 'b', 'a^2', 'ab', 'b^2'])

dfpoly


### CUSTOM TRANSFORMERS 
- A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. 
- This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df.head()

In [None]:
from sklearn.preprocessing import FunctionTransformer

ft = FunctionTransformer(np.log, # function to run
                        validate=True) # check that X is a 2D numeric array before calling the function

dfft = pd.DataFrame(data= ft.fit_transform(df),
                   columns= ['ft_'+ x for x in df.columns])

dfft

## TRANSFORMING PREDICTION TARGET
- These are transformers that are not intended to be used on features, but only on supervised learning targets.
- https://scikit-learn.org/stable/modules/preprocessing_targets.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html

In [None]:
df = pd.read_csv('iris.csv')
df.columns = ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)','iris species']
df.head()

In [None]:
df.sample(frac=0.05)

In [None]:
df['iris species'].unique()

### Label Encoding
- (similar to OrdinalEncoder for Categorical Features)
- use LabelEncoder to Encode labels with value between 0 and n_classes-1
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dfle = pd.DataFrame(le.fit_transform(df['iris species']), columns=['iris species LE'])
dfle = pd.concat([df, dfle],axis=1).drop('iris species',axis=1)
dfle.head()
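
The fitted encoder keeps the mapping between codes and labels, so encoded targets (or predictions) can be turned back into class names. A minimal sketch on a made-up list of labels:

```python
from sklearn.preprocessing import LabelEncoder

species = ['setosa', 'virginica', 'versicolor', 'setosa']  # made-up sample of labels

le_demo = LabelEncoder()
codes = le_demo.fit_transform(species)

print(le_demo.classes_)                  # classes are stored in sorted order
print(codes)                             # each code is the label's position in classes_
print(le_demo.inverse_transform(codes))  # back to the original labels
```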

### Label Binarization 
- similar to **OneHotEncoder** for Categorical Features
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html


In [None]:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
dflb = pd.DataFrame(lb.fit_transform(df['iris species']),
                    columns=[x+' LB' for x in lb.classes_]) # suffix with LB by the end of species names
dflb = pd.concat([df, dflb], axis=1).drop('iris species', axis=1)
dflb.sample(frac=0.05)

### Multilabel Binarizer 
- converts lists of sets or tuples into multilabel format
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
labels = [('sci-fi', 'thriller'), # first movie genre
          ('comedy',)]  # second movie genre
pd.DataFrame(mlb.fit_transform(labels), columns = mlb.classes_) # columns names

# Machine Learning

The sklearn ML API is very consistent:

0. read data
1. explore data
2. preprocess data
3. set up data for consumption by the ML model
4. choose the model by importing the appropriate estimator class from sklearn [from sklearn import model]
5. instantiate the model with desired parameter values [ml=model()]
6. fit the model to the training data [ml.fit(Xtrain, ytrain)]
7. apply the model to test data [ypred=ml.predict(Xtest) or ml.transform(Xtest)]
8. evaluate model
9. deploy/use model

In [None]:
# 0) read data
# note: load_boston was removed in scikit-learn 1.2; this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
df = pd.DataFrame(data= boston.data, columns= boston.feature_names)
df['label'] = boston.target
df.head()

In [None]:
# 1) explore data
# not demonstrating for this example

# 2) preprocess data
# not demonstrating for this example


In [None]:
# 3) set up for ML model
X = df.drop('label',axis=1)
y = df['label']

# model selection
from sklearn.model_selection import train_test_split

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2, # 20% in testing set
                                                    random_state=1) # fixed seed to reproduce the split; rarely needed in practice

In [None]:
# 4) choose the model by importing the appropriate estimator class from sklearn [from sklearn import model]
# Linear Regression

from sklearn.linear_model import LinearRegression # note this is a class

In [None]:
# 5) instantiate the model with desired parameter values [ml=model()]
lr = LinearRegression()

In [None]:
# 6) fit the model to the training data [ml.fit(Xtrain, ytrain)]
lr.fit(X_train, y_train)

In [None]:
# 7) apply the model to test data [ypred=ml.predict(Xtest) or ml.transform(Xtest)]
y_pred = lr.predict(X_test)

[Model Evaluation Documentation](https://scikit-learn.org/stable/modules/model_evaluation.html)

In [None]:
# 8) evaluate model
# https://scikit-learn.org/stable/modules/model_evaluation.html

from sklearn import metrics

print(np.sqrt( metrics.mean_squared_error( y_test, y_pred ) )) # -> rmse
print(metrics.r2_score( y_test, y_pred)) # rsquare
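
For reference, both metrics can be reproduced by hand. The sketch below uses four made-up target/prediction pairs (not the Boston data) and compares the manual formulas with the `sklearn.metrics` functions.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # made-up targets
y_hat = np.array([2.0, 5.0, 8.0, 9.0])   # made-up predictions

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, np.sqrt(metrics.mean_squared_error(y_true, y_hat)))
print(r2, metrics.r2_score(y_true, y_hat))
```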


In [None]:
# plot the predicted and actual result

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.jointplot(x=y_pred, y=y_test)
sns.jointplot(x=y_pred, y=(y_test-y_pred))

In [None]:
# 9) deploy/use model
lr.predict([[0.03237, 0.0, 2.18, 0.0, 0.458, 6.998, 45.8, 6.0622, 3.0, 222.0, 18.7, 394.63, 2.94]])
# sample from training set; expected label 33.4

In [None]:
# attribute
print(lr.coef_)
print(lr.intercept_)

## Other Linear Model

### Ridge Regression
- https://scikit-learn.org/stable/modules/linear_model.html
- Least squares with L2 regularization:
- minimize ||y - Xw||^2_2 + alpha * ||w||^2_2
- The L2 penalty limits the size of the coefficients: the larger the value of alpha, the greater the shrinkage, and the more robust the coefficients become to collinearity.

In [None]:
from sklearn.linear_model import Ridge
rr = Ridge(alpha=1.0)
rr.fit(X_train, y_train)
y_pred = rr.predict(X_test)  

from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred))) # rmse
print(metrics.r2_score(y_test, y_pred)) # r-square

### Lasso Regression
- https://scikit-learn.org/stable/modules/linear_model.html
- Least squares with L1 regularization:
- minimize (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
- The L1 norm causes the model to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features on which the solution depends.
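The sparsity effect can be checked directly. The sketch below is an illustration on synthetic data, not the notebook's dataset: it fits `Lasso` at three alpha values and counts the non-zero coefficients. The quantity `alpha_max` is computed by hand here (it is not a scikit-learn attribute) as the smallest alpha at which the objective above is minimized by all-zero coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# synthetic data: 10 features, only 3 of which actually drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, noise=1.0, random_state=0)

# center the data because Lasso fits an intercept by default
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
alpha_max = np.abs(Xc.T @ yc).max() / len(yc)

counts = []
for alpha in [0.01 * alpha_max, 0.5 * alpha_max, 1.01 * alpha_max]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    counts.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha={alpha:8.3f}: {counts[-1]} non-zero coefficients")
```

Just above `alpha_max` every coefficient is exactly zero; below it, features enter the model as the penalty weakens.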

In [None]:
from sklearn.linear_model import Lasso
lr = Lasso(alpha=1.0)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)  

from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred))) # rmse
print(metrics.r2_score(y_test, y_pred)) # r-square

### ElasticNet Regression
- https://scikit-learn.org/stable/modules/linear_model.html
- Least squares with L1 and L2 regularization:
- minimize 1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
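As a worked check of the formula, the sketch below evaluates the objective term by term for a fixed weight vector. The function `elastic_net_objective` is a hypothetical helper written for this note (it ignores the intercept); it shows how `l1_ratio` moves the penalty between the pure L1 form and the pure L2 form.

```python
import numpy as np

def elastic_net_objective(X, y, w, alpha, l1_ratio):
    """Value of the ElasticNet objective for a given weight vector w (no intercept)."""
    n_samples = X.shape[0]
    residual = y - X @ w
    data_term = residual @ residual / (2 * n_samples)
    l1_term = alpha * l1_ratio * np.abs(w).sum()
    l2_term = 0.5 * alpha * (1 - l1_ratio) * (w @ w)
    return data_term + l1_term + l2_term

X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = np.array([1.0, 2.0, 3.0])
w_demo = np.array([1.0, 2.0])   # fits y_demo exactly, so the data term is 0

pure_l1 = elastic_net_objective(X_demo, y_demo, w_demo, alpha=1.0, l1_ratio=1.0)
pure_l2 = elastic_net_objective(X_demo, y_demo, w_demo, alpha=1.0, l1_ratio=0.0)
mixed = elastic_net_objective(X_demo, y_demo, w_demo, alpha=1.0, l1_ratio=0.5)
print(pure_l1, pure_l2, mixed)   # 3.0 2.5 2.75
```

With `l1_ratio=1.0` only the L1 penalty remains (|1| + |2| = 3); with `l1_ratio=0.0` only the L2 penalty remains (0.5 * (1 + 4) = 2.5).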

In [None]:
from sklearn.linear_model import ElasticNet
en = ElasticNet()
en.fit(X_train, y_train)
y_pred = en.predict(X_test)  

from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred))) # rmse
print(metrics.r2_score(y_test, y_pred)) # r-square

### Polynomial Regression

In [None]:
# generate a dataset
from sklearn.datasets import make_regression
X, y = make_regression(n_samples = 300, n_features=1, noise=5)
y = y*y

# fit linear regression model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)
ypred = lr.predict(X)

# the fit should be poor: y is quadratic in X, but the model is linear


# now, lets fit polynomial regression model - by using polynomial features with linear regression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2) # n^2

Xpoly = poly.fit_transform(X)

pr = LinearRegression()

pr.fit(Xpoly, y)

yppred = pr.predict(Xpoly) # not sorted

sortedX, sortedyppred = zip(*sorted(zip(X, yppred))) # first zip, then sort, then unzip

# plotting
import matplotlib.pyplot as plt
# original
plt.scatter(X, y)
# ypred (lr)
plt.plot(X, ypred, 'r-')
# sorted yppred
plt.plot(sortedX, sortedyppred, 'g-')
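The two-step "polynomial features, then linear regression" recipe above can also be wrapped in a single estimator. The following is a minimal sketch on made-up quadratic data (the `*_demo` names are illustrative) that uses `make_pipeline` and compares training R² for the linear and quadratic fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.5, size=200)   # quadratic signal plus noise

linear = LinearRegression().fit(X_demo, y_demo)
quadratic = make_pipeline(PolynomialFeatures(degree=2),
                          LinearRegression()).fit(X_demo, y_demo)

r2_linear = linear.score(X_demo, y_demo)        # R^2 on the training data
r2_quadratic = quadratic.score(X_demo, y_demo)
print(f"linear R^2 = {r2_linear:.3f}, quadratic R^2 = {r2_quadratic:.3f}")
```

Because the pipeline carries its own feature expansion, `quadratic.predict` can be called on raw `X` without a separate `transform` step.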

## Classification Model

### Logistic Regression


In [None]:
##### Logistic Regression
#
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
#
 
# 0) read data
from sklearn.datasets import load_iris
dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['label'] = dataset.target
df.head()

# 1) explore data
# not demonstrating for this example

# 2) preprocess data
# not demonstrating for this example

# 3) setup data for ml model
X = df.drop(['label'], axis=1)
y = df['label']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
from sklearn.linear_model import LogisticRegression   # 4) choose the model
# multi_class='auto' and class_weight=None are defaults; multi_class is deprecated in recent scikit-learn
lr = LogisticRegression(C=1e5, solver='lbfgs')        # 5) instantiate the model (large C = weak regularization)
lr.fit(X_train, y_train)                              # 6) fit the model to train data
y_pred = lr.predict(X_test)                           # 7) apply model to test data

# 8) evaluate model
# https://scikit-learn.org/stable/modules/model_evaluation.html
from sklearn import metrics
print (metrics.accuracy_score(y_test, y_pred))
print (metrics.confusion_matrix(y_test, y_pred))
print (metrics.classification_report(y_test, y_pred))

### K-Neighbor Classification

- https://scikit-learn.org/stable/modules/neighbors.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# simple majority vote (weights='uniform') of 5 nearest neighbors (n_neighbors=5) based on euclidean distance (p=2, metric='minkowski')

knn = KNeighborsClassifier(n_neighbors=5,
                           weights='uniform',
                           p=2, metric='minkowski')

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

from sklearn import metrics
print (metrics.accuracy_score(y_test, y_pred))
print (metrics.confusion_matrix(y_test, y_pred))
print (metrics.classification_report(y_test, y_pred))

In [None]:
# Note, we can use the "elbow method" to pick an optimal k
# check: gradient descent

from sklearn import metrics
import matplotlib.pyplot as plt

records = []  # one {'k', 'error'} row per candidate k

for k in range(1,5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    error = 1 - metrics.accuracy_score(y_test, y_pred) # misclassification rate (labels are classes, not quantities)
    records.append({'k':k, 'error':error}) # record the k value and error value

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
errorlst = pd.DataFrame(records, columns=['k','error'])

# plot the record
plt.plot (errorlst['k'], errorlst['error'], 'o-')
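The loop above scores each k on the test set, which means the test set ends up influencing a modeling choice. One alternative, sketched below under the assumption that the Iris data is used, is to choose k by cross-validation on the training data and touch the test set only once at the end. The names `cv_accuracy`, `best_k`, and the `*_tr`/`*_te` splits are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=1, stratify=y_demo)

# mean 5-fold accuracy on the training data for each candidate k
cv_accuracy = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X_tr, y_tr, cv=5).mean()
               for k in range(1, 16)}
best_k = max(cv_accuracy, key=cv_accuracy.get)

# the test set is used once, only after k has been chosen
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final.score(X_te, y_te)
print(best_k, round(test_accuracy, 3))
```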

### SVM Classification

- https://scikit-learn.org/stable/modules/svm.html
- https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
- https://scikit-learn.org/stable/modules/svm.html#svm-kernels

In [None]:
from sklearn.svm import SVC

svc = SVC(C=1.0, # regularization parameter
          kernel='linear',
          class_weight=None)

# svc = SVC(C=1.0, kernel='rbf', gamma=0.7)
# svc = SVC(C=1.0, kernel='poly', degree=3)

svc.fit(X_train, y_train)

y_pred = svc.predict(X_test)

from sklearn import metrics
print (metrics.accuracy_score(y_test, y_pred))
print (metrics.confusion_matrix(y_test, y_pred))
print (metrics.classification_report(y_test, y_pred))

### Naive Bayes Classification

- https://scikit-learn.org/stable/modules/naive_bayes.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()

# from sklearn.naive_bayes import MultinomialNB
# nb = MultinomialNB(alpha=1.0,
#                    fit_prior=True,
#                    class_prior=None)

# from sklearn.naive_bayes import BernoulliNB
# nb = BernoulliNB()

nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

# 8) evaluate model
# https://scikit-learn.org/stable/modules/model_evaluation.html
from sklearn import metrics
print (metrics.accuracy_score(y_test, y_pred))
print (metrics.confusion_matrix(y_test, y_pred))
print (metrics.classification_report(y_test, y_pred))

### Decision Tree Classification

 - https://scikit-learn.org/stable/modules/tree.html
 - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [None]:
from sklearn.tree import DecisionTreeClassifier            
dt = DecisionTreeClassifier(max_depth=None,
                            min_samples_split=2,
                            min_samples_leaf=1,
                            class_weight=None)                                           
dt.fit(X_train, y_train)                                                       
y_pred = dt.predict(X_test)                                             

from sklearn import metrics
print (metrics.accuracy_score(y_test, y_pred))
print (metrics.confusion_matrix(y_test, y_pred))
print (metrics.classification_report(y_test, y_pred))

## Ensemble Methods - Bagging, Boosting & Stacking

https://scikit-learn.org/stable/modules/ensemble.html

- The goal of ensemble methods is to combine the predictions of several base estimators 
   built with a given learning algorithm in order to improve generalizability / robustness over 
   that of a single estimator.

- In Averaging Methods, the driving principle is to build several estimators independently 
  and then to average their predictions. On average, the combined estimator is usually better 
  than any single base estimator because its variance is reduced. <br>
  Examples include: 
   - Bagging (Bootstrap Aggregation) Methods 
   - Forests of Randomized Trees (Random Forest, and Extremely Randomized (Extra) Trees)

- By contrast, in Boosting Methods, base estimators are built sequentially and one tries to 
  reduce the bias of the combined estimator. The motivation is to combine several weak 
  models to produce a powerful ensemble.  <br>
  Examples include: 
   - AdaBoost (Adaptive Boosting)
   - Gradient Tree Boosting

- Bagging methods work best with strong and complex models (e.g., fully developed decision 
  trees), in contrast with Boosting methods which usually work best with weak models 
  (e.g., shallow decision trees).
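The contrast between the two families can be sketched as follows. This is an illustration on synthetic data, with arbitrary settings chosen for this note: a single fully grown tree, a bagged ensemble of fully grown trees, and a boosted ensemble of depth-1 trees ("stumps"), each scored with 5-fold cross-validation. Which one wins depends on the data, so no particular ranking is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)

models = {
    "single deep tree": DecisionTreeClassifier(random_state=0),
    # averaging: 50 fully grown trees, each fit on a bootstrap sample
    "bagged deep trees": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                           n_estimators=50, random_state=0),
    # boosting: 50 depth-1 trees fit sequentially
    "boosted stumps": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                         n_estimators=50, random_state=0),
}
cv_scores = {name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
             for name, model in models.items()}
for name, score in cv_scores.items():
    print(f"{name:18s} {score:.3f}")
```

The base estimator is passed positionally because its keyword name changed between scikit-learn releases.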

In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000,
                 n_features=10,
                 centers=100)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# bagging ensemble of KNeighborsClassifier base estimators, 
# each built on random subsets of 50% of the samples drawn with replacement,
# and 50% of the features drawn without replacement

bc = BaggingClassifier(KNeighborsClassifier(),  # base estimator, passed positionally ('base_estimator=' was renamed 'estimator=' in scikit-learn 1.2)
                      max_samples = 0.5, bootstrap=True,
                      max_features = 0.5, bootstrap_features=False)

bc.fit(X_train, y_train)

y_pred = bc.predict(X_test)

print(bc.score(X_test, y_test))


### Random Forest Classifier - Forests of Randomized Trees
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- The sklearn.ensemble module includes two averaging algorithms based on randomized 
  decision trees: the RandomForest algorithm and the Extra-Trees method.


In [None]:
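from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=10, centers=100)

# re-split so the model is trained and tested on the data generated above
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10,
                            max_features='sqrt', # 'auto' meant 'sqrt' here and was removed in scikit-learn 1.3
                            bootstrap=True,
                            max_depth=None,
                            min_samples_split=2,
                            min_samples_leaf=1,
                            class_weight=None)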
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

print(rfc.score(X_test, y_test))



### Extra Tree Classifier
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
- Randomized Threshold

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
# import, instantiate, train, test model
from sklearn.ensemble import ExtraTreesClassifier  
etc = ExtraTreesClassifier(n_estimators=10, 
                           max_features='sqrt', bootstrap=False, # 'auto' meant 'sqrt' here and was removed in scikit-learn 1.3
                           max_depth=None, min_samples_split=2, min_samples_leaf=1, class_weight=None)   

etc.fit(X_train, y_train)
y_pred = etc.predict(X_test)

print (etc.score(X_test, y_test))

### AdaBoost (Adaptive Boosting)
 - Uses a bunch of weak estimators
 - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html


In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


In [None]:
from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier(n_estimators=100, learning_rate=0.01)

abc.fit(X_train, y_train)
y_pred = abc.predict(X_test)
print (abc.score(X_test, y_test)) 

In [None]:
# from sklearn.svm import SVC
# svc = SVC(kernel='linear', C=1.0)

# from sklearn.ensemble import AdaBoostClassifier                                                                                                
# abc = AdaBoostClassifier(estimator=svc,  # named 'base_estimator' before scikit-learn 1.2
#                          n_estimators=100,
#                          learning_rate=0.01,
#                          algorithm='SAMME')

# abc.fit(X_train, y_train)
# y_pred = abc.predict(X_test)
# print (abc.score(X_test, y_test)) 


### Gradient Tree Boosting

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html


In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=10, centers=10)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.01,
                                                   max_depth=3, min_samples_split=2, min_samples_leaf=1)

gbc.fit(X_train, y_train)

y_pred = gbc.predict(X_test)

print (gbc.score(X_test, y_test))

### Voting Classifiers 
- https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=10, centers=10, random_state=0)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# import the classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# import the voting classifier
from sklearn.ensemble import VotingClassifier   

# instantiate
dtc = DecisionTreeClassifier(max_depth=4)

knn = KNeighborsClassifier(n_neighbors=7)

svc = SVC(gamma='scale', kernel='rbf', probability=True)


#eclf = VotingClassifier(estimators=[('dtc', dtc), ('knn', knn), ('svc', svc)], voting='hard')
eclf = VotingClassifier(estimators=[('dtc', dtc), ('knn', knn), ('svc', svc)], voting='soft', weights=[2, 1, 2])

eclf.fit(X_train, y_train)

y_pred = eclf.predict(X_test)

print (eclf.score(X_test, y_test))

# Unsupervised Methods

### KMeans Clustering

- https://scikit-learn.org/stable/modules/clustering.html
- https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=2, centers=4, cluster_std=0.5 , random_state=0)

# visualize
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y, s=25, edgecolor='k')

In [None]:
from sklearn.cluster import KMeans
from sklearn import metrics

In [None]:
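records = []  # one {'K', 'Error'} row per candidate K

for k in range(2,7):
    km = KMeans(n_clusters=k)
    km.fit(X)  # clustering is unsupervised, so no labels are needed
    error = km.inertia_  # within-cluster sum of squared distances
    records.append({'K':k, 'Error':error})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
err = pd.DataFrame(records, columns=['K', 'Error'])

plt.plot(err['K'], err['Error'], 'o-')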

In [None]:
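records = []  # one {'K', 'Silhouette'} row per candidate K

for k in range(2,7):
    km = KMeans(n_clusters=k)
    km.fit(X)  # clustering is unsupervised, so no labels are needed
    errorS = metrics.silhouette_score(X, km.labels_, metric='sqeuclidean')
    records.append({'K':k, 'Silhouette':errorS})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
err = pd.DataFrame(records, columns=['K', 'Silhouette'])

plt.plot(err['K'], err['Silhouette'], 'o-')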

In [None]:
# import, instantiate, train, test model
from sklearn.cluster import KMeans                                         # 4) choose the model 
km = KMeans(n_clusters=4, random_state=0)                                  # 5) instantiate the model 
km.fit(X, y)                                                               # 6) fit the model to the training data

error = km.inertia_                                                        # 8) evaluate the model
#error = metrics.silhouette_score(X, km.labels_, metric='sqeuclidean') 
print (error)

print (km.labels_)
# center of each cluster
print (km.cluster_centers_)

# predict a new point will be in which cluster
print (km.predict([[0,4]]))                                                # 9) deploy/use model

### Principal Component Analysis

- https://scikit-learn.org/stable/modules/decomposition.html
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- https://scikit-learn.org/stable/modules/unsupervised_reduction.html
- PCA is sensitive to the relative scaling of the original features, so it is generally a good idea to standardize the data before applying it
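A quick way to see the scaling sensitivity is to compare the explained variance ratios with and without standardization. The sketch below assumes the breast cancer dataset; the variable names `raw_ratio` and `scaled_ratio` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo, _ = load_breast_cancer(return_X_y=True)

raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X_demo)).explained_variance_ratio_

# without scaling, the features with the largest numeric range dominate PC1
print("raw:   ", raw_ratio.round(3))
print("scaled:", scaled_ratio.round(3))
```

On the raw data the first component absorbs almost all of the variance, simply because a few features are measured on much larger scales than the rest.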



In [None]:
# load data
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['label'] = dataset.target

X = df.drop(['label'], axis=1)
y = df['label']

df.head()

In [None]:
# preprocessing - standard scaler
from sklearn.preprocessing import StandardScaler
ss = StandardScaler(with_mean=True, with_std=True)
Xss = ss.fit_transform(X)

# pca
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # 2 principal components
Xpca = pca.fit_transform(Xss)
Xpca

In [None]:
dfXpca = pd.DataFrame(data=Xpca, columns=['PCA1', 'PCA2'])
dfXpca['Cancer'] = y
dfXpca.head()

In [None]:
print(pca.components_)

In [None]:
sns.lmplot(data=dfXpca,
          x='PCA1', y='PCA2',
          hue='Cancer', fit_reg=False)

---

# Exercise 5

### (I) Read data and pick relevant columns
- read data
- transform Cabin to Deck
- only retain columns required for analysis

#### Question 1
- Read kaggle train.csv data into a dataframe called df; Then check the top few rows of the df dataframe.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
titanic = pd.read_csv('train.csv')
titanic.head()  # top few rows, as the question asks

#### Question 2
- Extract a new column called “Deck” from “Cabin”.
- Hint: write a function called getDeck, and then use following:
  - `df['Deck'] = df['Cabin'].apply(getDeck)`

In [None]:
def getDeck(x):
    if pd.notna(x):
        return (x[0])
    else:
        return ('X')

titanic['Deck'] = titanic['Cabin'].apply(getDeck)
titanic.head()

### (II) Split data into Train and Test

In [None]:
# splitting data
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()


### (III) Fit/Transform on Training Data

- `'Age'`: impute missing values with median
  - impute


- `['Pclass', 'Sex', 'Deck']`: impute missing values with `'X'`
  - impute


- `['impPclass', 'impSex', 'impDeck']`: **One Hot Encoding**
  - OneHotEncoding()


- Only keep imputed numeric and ohe categorical features
  - .drop


- build Logistic Regression Model
  - LogisticRegression()



#### Question 4
- Set numeric_features = ['Age']
- Use SimpleImputer to fill the missing values in numeric_features with the median values, and prefix the imputed columns with `'imp'`

In [None]:
numeric_features = ['Age']

from sklearn.impute import SimpleImputer

si_num = SimpleImputer(missing_values=np.nan, # the np.NaN alias was removed in NumPy 2.0
                   strategy='median',
                  fill_value=None)

X_si_num = pd.DataFrame(si_num.fit_transform(X_train[numeric_features]),
                        columns=['imp'+x for x in numeric_features],
                        index=X_train.index)

X_train = pd.concat([X_train, X_si_num],axis=1)

X_train.head()

#### Question 5
- Set categorical_features = ['Pclass', 'Sex', 'Deck']
- Use SimpleImputer to fill the missing values in categorical_features with the constant value 'X', and prefix the imputed columns with 'imp'

In [None]:
categorical_vars = ['Pclass', 'Sex', 'Deck']

si_cat = SimpleImputer(missing_values=np.nan, # the np.NaN alias was removed in NumPy 2.0
                   strategy='constant',
                  fill_value='X')

df_imp_cat = pd.DataFrame(data = si_cat.fit_transform(X_train[categorical_vars]),
                         columns=['imp'+x for x in categorical_vars],
                         index=X_train.index)

X_train = pd.concat([X_train, df_imp_cat],axis=1)
X_train.head()

#### Question 6
- Set imputed_categorical_features = ['impPclass', 'impSex', 'impDeck']
- Use OneHotEncoder to one-hot-encode the imputed categorical variables

In [None]:
imputed_categorical_features = ['impPclass', 'impSex', 'impDeck']

from sklearn.preprocessing import OneHotEncoder

OHE = OneHotEncoder(categories='auto',
                   handle_unknown='ignore',
                   sparse_output=False, # named 'sparse' before scikit-learn 1.2
                   dtype=int)

df_cat =  pd.DataFrame(data = OHE.fit_transform(X_train[imputed_categorical_features]),
                       columns = OHE.get_feature_names_out(), # replaces the removed get_feature_names()
                       index= X_train.index)
X_train = pd.concat([X_train, df_cat], axis=1)

X_train.head()


#### Question 7
- Only keep imputed numeric features and one-hot-encoded categorical features

In [None]:
X_train.drop(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age',
              'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 
              'Deck', 'impPclass', 'impSex', 'impDeck',], axis=1,inplace=True)
X_train.head()

#### Question 8
Build Logistic Regression Model by fitting to the transformed training data

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logr = LogisticRegression(solver='liblinear')

logr.fit(X_train, y_train)


### (IV) Transform/Predict on Test Data
- 'Age': impute missing values with median
- ['Pclass', 'Sex', 'Deck']: impute missing values with 'X'
- ['impPclass', 'impSex', 'impDeck']: OHE
- Only keep imputed numeric and ohe categorical features
- predict using Logistic Regression Model
- evaluate model

In [None]:
# Numeric
numeric_features = ['Age']

X_si_num = pd.DataFrame(si_num.transform(X_test[numeric_features]),
                        columns=['imp'+x for x in numeric_features],
                        index=X_test.index)

X_test = pd.concat([X_test, X_si_num],axis=1)

X_test.head()

In [None]:
categorical_vars = ['Pclass', 'Sex', 'Deck']

df_imp_cat = pd.DataFrame(data = si_cat.transform(X_test[categorical_vars]),
                         columns=['imp'+x for x in categorical_vars],
                         index=X_test.index)

X_test = pd.concat([X_test, df_imp_cat],axis=1)
X_test.head()

In [None]:
imputed_categorical_features = ['impPclass', 'impSex', 'impDeck']

df_cat =  pd.DataFrame(data = OHE.transform(X_test[imputed_categorical_features]),
                       columns = OHE.get_feature_names_out(), # replaces the removed get_feature_names()
                       index= X_test.index)

X_test = pd.concat([X_test, df_cat], axis=1)

X_test.head()

In [None]:
X_test.drop(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age',
              'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 
              'Deck', 'impPclass', 'impSex', 'impDeck',], axis=1,inplace=True)
X_test.head()

In [None]:
y_pred = logr.predict(X_test)

In [None]:
from sklearn import metrics

In [None]:
print(metrics.accuracy_score(y_test, y_pred))

#### In this Exercise Set:

- We had to keep track of every preprocessing/transformation step applied with **"fit_transform"** to the training data (including remembering to drop the columns we no longer wanted), and then repeat exactly the same steps with **"transform" on the test data**.

- We also had to keep track of building the model with **"fit"** on the preprocessed/transformed *training data*, and then predicting with **"predict"** on the preprocessed/transformed *test data*.

As we'll see in the next exercise set, ColumnTransformers and Pipelines make this much easier.
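The core of the fit_transform/transform distinction can be seen on a toy column. In the sketch below the values are made up and the names `train_age` and `test_age` are illustrative: the median is learned from the training data only and then reused, unchanged, on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy stand-ins for the 'Age' column (values are made up)
train_age = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test_age = pd.DataFrame({"Age": [np.nan, 80.0]})

imputer = SimpleImputer(strategy="median")
train_imputed = imputer.fit_transform(train_age)   # learns median = 30 from the training data
test_imputed = imputer.transform(test_age)         # reuses 30; no statistic is computed from test data

print(imputer.statistics_)      # learned parameter, one value per column
print(test_imputed.ravel())     # [30. 80.]
```

Calling `fit` on the test data instead would replace the learned median with one computed from the test set, which is the leakage the notebook warns about.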

# Exercise 5a

## Pipelines with ColumnTransformers
- https://scikit-learn.org/stable/modules/compose.html#pipeline
- https://scikit-learn.org/stable/modules/compose.html#column-transformer


## Pipeline
“Pipelines can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification.


The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object.”

## ColumnTransformers

“The ColumnTransformer helps performing different transformations for different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.

- Warning: The compose.ColumnTransformer class is experimental and the API is subject to change.”


In [None]:
# import
import pandas as pd
import numpy as np
%matplotlib inline

# load data
titanic = pd.read_csv('train.csv')

# Get Useful Information while Handling the np.NaN 
def getDeck(x):
    if pd.notna(x):
        return (x[0])
    else:
        return (x)

titanic['Deck'] = titanic['Cabin'].apply(getDeck)


# splitting data into feature and target df
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']

# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

# create copy to deal with warning
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()



#### Set up preprocessing pipeline for numeric data
- impute missing values with median

In [None]:
# copy - first pipeline

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

numeric_features = ['Age']

# instantiate
# steps : a list of TUPLE: [(,) , (,), ...]
numeric_transformer = Pipeline(steps=[
    ('si', SimpleImputer(missing_values = np.nan, strategy='median'))]) # pipeline 1 - step 1

# note: nothing has run yet; we are only setting up the blueprint of what we want to do

#### Set up preprocessing pipeline for categorical data
- impute missing values with constant 'X'
- one-hot-encode imputed categorical values

In [None]:
# copy
from sklearn.preprocessing import OneHotEncoder

categorical_features = ['Pclass', 'Sex', 'Deck']

categorical_transformer = Pipeline(steps=[
    ('si', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='X')),          # pipeline 2 - step 1
    ('ohe', OneHotEncoder(categories='auto',handle_unknown='ignore',sparse_output=False,dtype=int))])  # pipeline 2 - step 2 ('sparse_output' was 'sparse' before scikit-learn 1.2)


#### Set up column transformer with preprocessing pipelines for numeric and categorical data
- only keep imputed numeric and one-hot-encoded categorical features

In [None]:
# experimental
# copy

from sklearn.compose import ColumnTransformer

# three item tuple ('name', #input, #pipeline)
preprocessor = ColumnTransformer(
                        transformers=[
                                ('num', numeric_transformer, numeric_features),
                                ('cat', categorical_transformer, categorical_features)],
                        remainder='drop') # only keep the preprocessed feature

### remainder='passthrough'

#### Set up the preprocessing->model pipeline

In [None]:
from sklearn.linear_model import LogisticRegression

# nest the preprocessing pipelines inside yet another pipeline
# a pipeline runs its steps sequentially:
# the output of each named step becomes the input of the next step

clf = Pipeline(steps=[
    ('pp', preprocessor),
    ('lr', LogisticRegression(solver='liblinear'))])


#### Invoke clf
- fit using the combined preprocessing and model pipeline on train data
- this will automatically run **"fit" and "transform"** for all pre-processing steps, and "fit" for model step on training data

In [None]:
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

#### Evaluate Model on Test Data

In [None]:
from sklearn import metrics
print (metrics.accuracy_score(y_test, y_pred))
print()
print (metrics.confusion_matrix(y_test, y_pred))
print()
print (metrics.classification_report(y_test, y_pred))

# Exercise 5b

## Cross-Validation
- Note: Cross-Validation is typically done in conjunction with Grid-Search, as we’ll see in (5d)
- https://scikit-learn.org/stable/modules/cross_validation.html

“A solution to this problem is a procedure called Cross-Validation. A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.

In the basic approach, called k-fold CV, the training set is split into k smaller sets. The following procedure is followed for each of the k “folds”:
- A model is trained using k-1 of the folds as training data;
- The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).


The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.”
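The procedure quoted above can be written out by hand. The sketch below is an illustration on the Iris data (not the Titanic pipeline): it loops over the folds explicitly with `KFold` and then checks that the averaged score matches what `cross_val_score` reports for the same splitter. The names `fold_scores`, `manual_mean`, and `library_mean` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X_demo):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])                    # train on k-1 folds
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))  # validate on the remaining fold

manual_mean = np.mean(fold_scores)
library_mean = cross_val_score(LogisticRegression(max_iter=1000),
                               X_demo, y_demo, cv=kf).mean()
print(manual_mean, library_mean)
```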



## Hyper-Parameter Tuning

“Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes.


When evaluating hyperparameters for estimators, there is a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance.


To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.


However, by partitioning the available data into three sets, we drastically reduce the number
of samples which can be used for learning the model, and the results can depend on a
particular random choice for the pair of (train, validation) sets.”
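For reference, the three-way split described in the quote can be produced with two calls to `train_test_split`. This is a minimal sketch on placeholder arrays; the 60/20/20 proportions are an assumption chosen for illustration, not something the source prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)   # 100 placeholder samples
y_demo = np.arange(100) % 2

# first hold out 20% as the final test set ...
X_rest, X_te, y_rest, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
# ... then carve a validation set out of the remainder (0.25 * 80% = 20% overall)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_tr), len(X_val), len(X_te))   # 60 20 20
```

Only 60 of the 100 samples remain for training, which is the cost the quote points out and the motivation for cross-validation.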

In [1]:
### Set up data and pipeline

# import
import pandas as pd
import numpy as np
%matplotlib inline

# load data
titanic = pd.read_csv('train.csv')

# Get Useful Information while Handling the np.NaN 
def getDeck(x):
    if pd.notna(x):
        return (x[0])
    else:
        return (x)

titanic['Deck'] = titanic['Cabin'].apply(getDeck)


# splitting data into feature and target df
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']

# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

# create copy to deal with warning
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()

# pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_features = ['Age']
numeric_transformer = Pipeline(steps=[
    ('si', SimpleImputer(missing_values = np.nan, strategy='median'))])

categorical_features = ['Pclass', 'Sex', 'Deck']
categorical_transformer = Pipeline(steps=[
    ('si', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='X')),
    ('ohe', OneHotEncoder(categories='auto',handle_unknown='ignore',sparse_output=False,dtype=int))])  # 'sparse_output' was 'sparse' before scikit-learn 1.2

# columns Transformer
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
                        transformers=[
                                ('num', numeric_transformer, numeric_features),
                                ('cat', categorical_transformer, categorical_features)],
                        remainder='drop')

# Pipeline
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[
    ('pp', preprocessor),
    ('lr', LogisticRegression(solver='liblinear'))])


#### Use 5-fold cross-validation:
- split the training set into 5 folds; each time, train on 4 folds and validate on the remaining one, then take the mean of the 5 scores

In [4]:
from sklearn.model_selection import cross_validate

# cross_validate (pipeline, training_features, training_target, return_train_score)
scores = cross_validate(clf,
                        X_train, y_train,
                        cv=5,
                        return_train_score=False)

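# cross_validate returns a dictionary; 'test_score' holds one score per fold
# (use scores['test_score'].mean() for the mean score across folds)
scores['test_score']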

# Hyper-Parameter Tuning

array([0.77622378, 0.78321678, 0.81690141, 0.80985915, 0.82394366])
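
To make explicit what `cross_validate` does with `cv=5`, here is a rough sketch of the same loop written by hand. It assumes synthetic data from `make_classification` and a bare `LogisticRegression` instead of the Titanic pipeline. For a classifier, an integer `cv` uses stratified folds, so `StratifiedKFold` is used here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption, not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
model = LogisticRegression(solver='liblinear')

# Manual 5-fold loop: fit on 4 folds, score on the held-out fold
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The helper used in the notebook should give the same per-fold scores
auto_scores = cross_validate(model, X_demo, y_demo, cv=5)['test_score']

print(manual_scores)
print(manual_scores.mean())
```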

# Exercise 5c

## Grid-Search with Cross-Validation
- https://scikit-learn.org/stable/modules/grid_search.html

## Grid Search

“It is possible and recommended to search the hyper-parameter space for the best cross validation score.


Any parameter provided when constructing an estimator may be optimized in this manner. <br>
A search consists of:
- an estimator (regressor or classifier such as sklearn.svm.SVC());
- a parameter space;
- a method for searching or sampling candidates;
- a cross-validation scheme; and
- a score function.


GridSearchCV exhaustively considers all parameter combinations.”
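
"Exhaustively" can be made concrete with `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` tries. This is only a sketch: the `lr__C` values are a hypothetical addition to show how the grid multiplies, and are not part of the exercise.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 2 penalties x 3 values of C
demo_grid = {'lr__penalty': ['l1', 'l2'],
             'lr__C': [0.1, 1.0, 10.0]}

candidates = list(ParameterGrid(demo_grid))
for params in candidates:
    print(params)

# Every candidate is fitted once per fold during the search
n_folds = 5
print(len(candidates), 'candidates x', n_folds, 'folds =', len(candidates) * n_folds, 'fits')
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the whole training set.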

In [1]:
### Set up data and pipeline

# import
import pandas as pd
import numpy as np
%matplotlib inline

# load data
titanic = pd.read_csv('train.csv')

# Extract the deck letter (first character of Cabin); missing cabins stay NaN
def getDeck(x):
    if pd.notna(x):
        return x[0]
    else:
        return x

titanic['Deck'] = titanic['Cabin'].apply(getDeck)


# splitting data into feature and target df
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']

# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

# create copy to deal with warning
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()

# pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_features = ['Age']
numeric_transformer = Pipeline(steps=[
    ('si', SimpleImputer(missing_values = np.nan, strategy='median'))])

categorical_features = ['Pclass', 'Sex', 'Deck']
categorical_transformer = Pipeline(steps=[
    ('si', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='X')),
    # note: scikit-learn 1.2+ renames sparse=False to sparse_output=False
    ('ohe', OneHotEncoder(categories='auto',handle_unknown='ignore',sparse=False,dtype=int))])

# columns Transformer
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
                        transformers=[
                                ('num', numeric_transformer, numeric_features),
                                ('cat', categorical_transformer, categorical_features)],
                        remainder='drop')

# Pipeline
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[
    ('pp', preprocessor),
    ('lr', LogisticRegression(solver='liblinear'))]) # the step name 'lr' becomes the prefix 'lr__' in param_grid keys


#### Set up grid search

In [5]:
# Set up grid search

from sklearn.model_selection import GridSearchCV

# param_grid: the hyperparameters to search over. Valid names are listed in each
# estimator's documentation (in Jupyter, press Shift+Tab inside the parentheses,
# for example LogisticRegression(**here**))

# dictionary structure:
## Key: step name + '__' + hyperparameter name
##### 'lr' is the name we gave the logistic regression step earlier; the double underscore is required
##### add as many hyperparameters as the model provides
## Value: list of candidate values to try for that hyperparameter

param_grid = {'lr__penalty': ['l1', 'l2']}

gscv = GridSearchCV(estimator = clf,
                    param_grid = param_grid,
                    cv=5,
                    return_train_score=False)

# still a blueprint for now: nothing is fitted until .fit() is called
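
One way to check which keys are valid in `param_grid` is to ask the pipeline itself. A small sketch, assuming a stand-in pipeline where a hypothetical `StandardScaler` step named `'sc'` takes the place of the preprocessor:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline (assumption): 'sc' replaces the notebook's 'pp' step
demo_clf = Pipeline(steps=[
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(solver='liblinear'))])

# get_params() lists every tunable name, already prefixed with '<step>__'
lr_params = sorted(k for k in demo_clf.get_params() if k.startswith('lr__'))
print(lr_params)
```

Any name in that list can be used as a key in `param_grid`; a key that is not in `get_params()` raises an error when the search is fitted.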

#### Search for best params

In [7]:
# Search for best params
gscv.fit(X_train, y_train)
print(gscv.best_estimator_)
print('-'*50)
print(gscv.best_score_)
print('-'*50)
print(gscv.best_params_)
print('-'*50)
print(gscv.cv_results_)

Pipeline(memory=None,
         steps=[('pp',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('si',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                            

#### Predict and Evaluate best_estimator_ on test data

In [10]:
# Predict and Evaluate best_estimator_ on test data
y_pred = gscv.best_estimator_.predict(X_test)

from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))
print('-'*50)
print(metrics.confusion_matrix(y_test, y_pred))
print('-'*50)
print(metrics.classification_report(y_test, y_pred))


0.7821229050279329
--------------------------------------------------
[[87 19]
 [20 53]]
--------------------------------------------------
              precision    recall  f1-score   support

           0       0.81      0.82      0.82       106
           1       0.74      0.73      0.73        73

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179
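
The numbers in the report can be reproduced by hand from the confusion matrix above (rows are true classes, columns are predicted classes). A short worked check for class 1, using only the four counts printed in the output:

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
cm = np.array([[87, 19],
               [20, 53]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # correct predictions / all predictions
precision_1 = tp / (tp + fp)          # of those predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)             # of those truly 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(accuracy)
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```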



What if we want to check whether `strategy='median'` in the `numeric_transformer` pipeline is the best choice? <br>
`numeric_transformer = Pipeline(steps=[('si', SimpleImputer(missing_values=np.nan, strategy='median'))])` <br>
Say we want to compare `'median'` and `'mean'`. How can we do it?


#### 1) Remove `strategy='median'` from the imputer

In [11]:
### Set up data and pipeline

# import
import pandas as pd
import numpy as np
%matplotlib inline

# load data
titanic = pd.read_csv('train.csv')

# Extract the deck letter (first character of Cabin); missing cabins stay NaN
def getDeck(x):
    if pd.notna(x):
        return x[0]
    else:
        return x

titanic['Deck'] = titanic['Cabin'].apply(getDeck)


# splitting data into feature and target df
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']

# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

# create copy to deal with warning
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()

# pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_features = ['Age']
numeric_transformer = Pipeline(steps=[
    ('si', SimpleImputer(missing_values=np.nan))])  ### <- HERE: strategy removed (defaults to 'mean'); the grid search sets it

categorical_features = ['Pclass', 'Sex', 'Deck']
categorical_transformer = Pipeline(steps=[
    ('si', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='X')),
    # note: scikit-learn 1.2+ renames sparse=False to sparse_output=False
    ('ohe', OneHotEncoder(categories='auto',handle_unknown='ignore',sparse=False,dtype=int))])

# columns Transformer
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
                        transformers=[
                                ('num', numeric_transformer, numeric_features),
                                ('cat', categorical_transformer, categorical_features)],
                        remainder='drop')

# Pipeline
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[
    ('pp', preprocessor),
    ('lr', LogisticRegression(solver='liblinear'))])


#### 2) Add the strategy to `param_grid`


In [12]:
from sklearn.model_selection import GridSearchCV

param_grid = {'lr__penalty': ['l1', 'l2'],
              'pp__num__si__strategy':['median','mean']} ### add it into the param_grid 
# following the naming convention (structure(level) of the pipelines we set up)
# under 'pp' (preprocessing)
#       >> 'num' (numeric_transformer)
#           >> 'si' (simple imputer)
#              >> 'strategy' (hyperparameter)

gscv = GridSearchCV(clf,
                    param_grid,
                    cv=5,
                    return_train_score=False)
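
Roughly speaking, for each candidate the search clones the pipeline and applies the candidate with `set_params`, which is why the key has to follow the nesting of the steps. A minimal sketch of that single step, assuming a reduced pipeline with one numeric column (the `demo_*` names are illustrative):

```python
import numpy as np
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same step names as the notebook ('pp' > 'num' > 'si'), reduced to one column
demo_num = Pipeline(steps=[('si', SimpleImputer(missing_values=np.nan))])
demo_pp = ColumnTransformer(transformers=[('num', demo_num, [0])],
                            remainder='drop')
demo_clf = Pipeline(steps=[('pp', demo_pp),
                           ('lr', LogisticRegression(solver='liblinear'))])

# Apply one candidate to a fresh copy: each '__' descends one level
candidate = clone(demo_clf).set_params(pp__num__si__strategy='median',
                                       lr__penalty='l1')

print(candidate.get_params()['pp__num__si__strategy'])  # value set on the clone
print(candidate.get_params()['lr__penalty'])
print(demo_clf.get_params()['pp__num__si__strategy'])   # original keeps the default
```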

In [13]:
# Search for best params
gscv.fit(X_train, y_train)
print(gscv.best_estimator_)
print('-'*50)
print(gscv.best_score_)
print('-'*50)
print(gscv.best_params_)
print('-'*50)
print(gscv.cv_results_)

Pipeline(memory=None,
         steps=[('pp',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('si',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                            

In [14]:
# Predict and Evaluate best_estimator_ on test data
y_pred = gscv.best_estimator_.predict(X_test)

from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))
print('-'*50)
print(metrics.confusion_matrix(y_test, y_pred))
print('-'*50)
print(metrics.classification_report(y_test, y_pred))


0.7821229050279329
--------------------------------------------------
[[87 19]
 [20 53]]
--------------------------------------------------
              precision    recall  f1-score   support

           0       0.81      0.82      0.82       106
           1       0.74      0.73      0.73        73

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179
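
The raw `cv_results_` dictionary printed by the search cells is hard to read. One option is to load it into a DataFrame and compare the candidates side by side. The sketch below assumes synthetic data and a stand-in pipeline (a hypothetical `StandardScaler` step instead of the Titanic preprocessor), so its scores are not those of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and pipeline (assumptions)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
demo_clf = Pipeline(steps=[
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(solver='liblinear'))])

demo_search = GridSearchCV(demo_clf, {'lr__penalty': ['l1', 'l2']}, cv=5)
demo_search.fit(X_demo, y_demo)

# One row per candidate; keep only the columns needed for comparison
results = pd.DataFrame(demo_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

The row with `rank_test_score == 1` corresponds to `best_params_`, and its `mean_test_score` is `best_score_`.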

