# Preprocessor Tuning

## The `tumors` Dataset

* 👩🏻‍⚕️ The following dataset describes tumors that are either <font color=red>malignant</font> or <font color=green>benign</font>. 
* 🎯 The task is to detect as many malignant tumors as possible.
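
Detecting as many malignant tumors as possible is a statement about recall rather than accuracy. The sketch below uses made-up labels (hypothetical, not taken from the dataset) to show how the two metrics can diverge when malignant cases are missed.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two malignant tumors are missed

accuracy = accuracy_score(y_true, y_pred)  # share of all predictions that are correct
recall = recall_score(y_true, y_pred)      # share of malignant tumors that are detected

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Here 8 of 10 predictions are correct, yet only 2 of the 4 malignant tumors are found, which is why recall is the more telling metric for this task.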

## Imports 

In [1]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector
import pandas as pd

In [2]:
pd.set_option('display.max_columns', None)
url = "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/tumors_dataset.csv"
data = pd.read_csv(url)
data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,malignant
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,,0.205,0.4,0.1625,0.2364,0.07678,1


In [3]:
round(data.malignant.value_counts(normalize=True), 2)

malignant
0    0.63
1    0.37
Name: proportion, dtype: float64
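
With classes at roughly 63/37, a purely random split can shift the balance slightly between train and test. One option, sketched here on toy data, is to pass `stratify` to `train_test_split`. The arrays `X_toy` and `y_toy` are hypothetical, and `random_state=0` is only there to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same 63/37 balance as the output above (hypothetical data)
y_toy = np.array([0] * 63 + [1] * 37)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions close to equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

print(f"malignant share: train={y_tr.mean():.2f}, test={y_te.mean():.2f}")
```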

In [4]:
X = data.drop(columns="malignant")
y = data["malignant"]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [6]:
X_train.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
305,11.6,24.49,74.23,417.2,0.07474,0.05688,0.01974,0.01313,0.1935,0.05878,0.2512,1.786,1.961,18.21,0.006122,0.02337,0.01596,0.006998,0.03194,0.002211,12.44,31.62,81.39,476.5,0.09545,0.1361,0.07239,0.04815,0.3244,0.06745
428,11.13,16.62,70.47,381.1,0.08151,0.03834,0.01369,0.0137,0.1511,0.06148,0.1415,0.9671,0.968,9.704,0.005883,0.006263,0.009398,0.006189,0.02009,0.002377,11.68,20.29,74.35,421.1,0.103,0.06219,0.0458,0.04044,0.2383,0.07083
431,12.4,17.68,81.47,467.8,0.1054,0.1316,0.07741,0.02799,0.1811,0.07102,0.1767,1.46,2.204,15.43,0.01,0.03295,0.04861,0.01167,0.02187,0.006005,12.88,22.91,89.61,515.8,0.145,0.2629,0.2403,0.0737,0.2556,0.09359
511,14.81,14.7,94.66,680.7,0.08472,0.05016,0.03416,0.02541,0.1659,0.05348,0.2182,0.6232,1.677,20.72,0.006708,0.01197,0.01482,0.01056,0.0158,0.001779,15.61,17.58,101.7,760.2,0.1139,0.1011,0.1101,0.07955,0.2334,0.06142
356,13.05,18.59,85.09,512.0,0.1082,0.1304,0.09603,0.05603,0.2035,0.06501,0.3106,1.51,2.59,21.57,0.007807,0.03932,0.05112,0.01876,0.0286,0.005715,14.19,24.85,94.22,591.2,0.1343,0.2658,0.2573,0.1258,0.3113,0.08317


## Building a Pipeline

❓ **Question: Building a Pipeline** ❓

Combine the following steps in a **`Pipeline`** object named `pipeline`:

1. Impute missing values with a **`KNNImputer`**
2. Scale all the (numerical) features with a **`MinMaxScaler`**
3. Model a **`LogisticRegression`** with default parameters
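
As a rough sketch of the three steps above, the same chain can be written with explicitly named steps. This runs on synthetic data rather than the tumors dataset. The names `X_demo`, `y_demo`, `demo_pipeline`, and the step labels are hypothetical, and every feature is assumed to be numeric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the tumors data (assumption: all features are numeric)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # blank out roughly 5% of the values

demo_pipeline = Pipeline([
    ("imputer", KNNImputer()),        # 1. fill missing values from nearest neighbours
    ("scaler", MinMaxScaler()),       # 2. rescale every feature to [0, 1]
    ("model", LogisticRegression()),  # 3. classifier with default parameters
])

demo_pipeline.fit(X_demo, y_demo)
demo_predictions = demo_pipeline.predict(X_demo)
```

Naming the steps by hand is a design choice: `make_pipeline` produces the same chain but derives step names from the lowercased class names.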

In [7]:
number_transformer = make_pipeline(KNNImputer(), MinMaxScaler())
number_transformer

In [8]:
preproc_basic = make_column_transformer(
    (
        number_transformer,
        X.columns
    ),
    remainder="passthrough"
)
preproc_basic

In [9]:
num_col = make_column_selector(dtype_include=["float64"])

In [10]:
pipeline = make_pipeline(preproc_basic, LogisticRegression())
pipeline

In [11]:
pipeline.fit(X_train, y_train)

In [12]:
pipeline.score(X_test, y_test)

0.9649122807017544

In [13]:
pipeline.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('pipeline',
                                    Pipeline(steps=[('knnimputer', KNNImputer()),
                                                    ('minmaxscaler',
                                                     MinMaxScaler())]),
                                    Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
          'mean smoothness', 'mean compactness', 'mean concavity',
          'mean concave points', 'mean symmetry', 'mean fractal dimension',
          'radius error', 'texture error', 'perimeter error', 'area error',
          'smoothness error', 'compactness error', 'concavity error',
          'concave points error', 'symmetry error', 'fractal dimension error',
          'worst radius', 'worst texture', 'worst perimeter', 'worst area',
          'worst smoothness', 'worst compactness', 'worst concavity',
          'wor
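
The keys returned by `get_params()` are the names a grid search uses to address nested hyperparameters: each level of nesting is joined with a double underscore. The sketch below rebuilds a similar pipeline on two hypothetical column positions (`[0, 1]`) and filters the keys that belong to the imputer. The variable names are illustrative only.

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

inner = make_pipeline(KNNImputer(), MinMaxScaler())
preproc = make_column_transformer((inner, [0, 1]), remainder="passthrough")
demo = make_pipeline(preproc, LogisticRegression())

# Keep only the parameters that belong to the KNNImputer step
knn_params = sorted(name for name in demo.get_params() if "knnimputer__" in name)
for name in knn_params:
    print(name)

# The same key can be used to change the value directly
demo.set_params(columntransformer__pipeline__knnimputer__n_neighbors=3)
```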

## Optimizing a pipelined model

In [14]:
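grid_search = GridSearchCV(
    pipeline,
    param_grid={
        "columntransformer__pipeline__knnimputer__n_neighbors": [2, 5, 10]
    },
    cv=5,
    # Recall matches the goal of catching as many malignant tumors as possible
    scoring="recall",
)

# Tune on the training set only so the test set stays unseen
grid_search.fit(X_train, y_train)
grid_search.best_params_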

{'columntransformer__pipeline__knnimputer__n_neighbors': 5}

In [15]:
n_best = grid_search.best_params_["columntransformer__pipeline__knnimputer__n_neighbors"]

In [16]:
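# best_estimator_ holds the pipeline configured with the best n_neighbors
pipeline_tuned = grid_search.best_estimator_
pipeline_tuned = pipeline_tuned.fit(X_train, y_train)
pipeline_tuned.predict(X_test)
# For a classifier, score() reports accuracy on the held-out test set
pipeline_tuned.score(X_test, y_test)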

0.9649122807017544
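
Beyond `best_params_`, a fitted grid search also exposes `cv_results_`, which helps judge whether the winning value is clearly better or only marginally so. The sketch below assumes recall as the metric (in line with the stated task) and runs on synthetic data. `X_demo`, `y_demo`, and `demo_search` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with roughly 5% missing values (stand-in for the tumors data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

demo_search = GridSearchCV(
    make_pipeline(KNNImputer(), MinMaxScaler(), LogisticRegression()),
    param_grid={"knnimputer__n_neighbors": [2, 5, 10]},
    cv=5,
    scoring="recall",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate value, with its mean and spread across the 5 folds
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_knnimputer__n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)
```

If the mean scores differ by less than their standard deviations, the choice of `n_neighbors` probably matters little for the data at hand.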

## Evaluating a pipeline

❓ **Question: What is the performance of the optimal pipeline?** ❓

- Make sure you cross-validate your optimal pipeline! 
- Store your result as a `float` number in a variable named `cv_score`

In [17]:
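# Mean 5-fold cross-validated recall on the training set
cv_score = cross_val_score(pipeline_tuned, X_train, y_train, cv=5, scoring="recall").mean()
# Accuracy on the held-out test set, for comparison
pipeline_tuned.score(X_test, y_test)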

0.9649122807017544

## Predicting using a fitted and pipelined model

👇 Here is a new tumor.

In [18]:
new_url = "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/new_tumor.csv"
new_data = pd.read_csv(new_url)
new_data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


❓ **Question: Using your optimal pipeline, predict whether the new tumor is malignant or not** ❓

In [19]:
pipeline_tuned.predict(new_data)

array([1], dtype=int64)
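
`predict` returns only the class label. When missing a malignant tumor is costly, it can also help to look at the predicted probability and, possibly, to lower the decision threshold. The sketch below illustrates this on synthetic data. The threshold of 0.3 and all variable names are illustrative assumptions, not values derived from the tumors dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
demo_pipeline = make_pipeline(MinMaxScaler(), LogisticRegression()).fit(X_demo, y_demo)

# Column 1 of predict_proba is the estimated probability of the positive class
proba_positive = demo_pipeline.predict_proba(X_demo)[:, 1]

default_pred = (proba_positive >= 0.5).astype(int)   # the usual cut-off
cautious_pred = (proba_positive >= 0.3).astype(int)  # flags more cases as positive

recall_default = recall_score(y_demo, default_pred)
recall_cautious = recall_score(y_demo, cautious_pred)
print(f"recall at 0.5: {recall_default:.2f}, recall at 0.3: {recall_cautious:.2f}")
```

Lowering the threshold can never reduce recall, but it usually produces more false alarms, so the trade-off should be checked on held-out data.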