# Preprocessor Tuning

👇 Consider the following dataset as your training set:

In [1]:
import pandas as pd

data = pd.read_csv("data.csv")

data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,malignant
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,,0.205,0.4,0.1625,0.2364,0.07678,1


The dataset describes tumors that are either malignant or benign. The task is to detect as many malignant tumors as possible.
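
Detecting as many malignant tumors as possible corresponds to maximizing recall on the malignant class, that is, the share of truly malignant tumors that the model flags. A minimal sketch with made-up labels (the arrays below are illustrative and not taken from the dataset) shows how recall can diverge from accuracy:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two malignant tumors are missed

accuracy = accuracy_score(y_true, y_pred)  # share of all predictions that are correct
recall = recall_score(y_true, y_pred)      # share of malignant tumors that are detected

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Here the model is right 8 times out of 10 yet catches only 2 of the 4 malignant tumors, which is why accuracy alone can hide the errors that matter most for this task.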

👇 Combine the following steps in a `Pipeline`:

- Impute missing values with a `KNNImputer`
- Scale all the features with a `MinMaxScaler`
- Model a `LogisticRegression` with default parameters
- Use the scoring metric relevant for the task

❓ With how many neighbors does the `KNNImputer` produce the optimal pipeline: 2, 5, or 10?

In [2]:
X = data.drop(columns='malignant')
y = data['malignant']

In [26]:
from sklearn import set_config
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

set_config(display='diagram')

# Create X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


# Impute then Scale for numerical variables: 
num_transformer = Pipeline([
    ('imputer', KNNImputer()),
    ('scaler', MinMaxScaler())])

# preprocessor = ColumnTransformer([
#     ('num_transformer', num_transformer, make_column_selector(dtype_include=['float64'])),
#     ])
#     #,remainder='passthrough')
    
# Combine preprocessing and the logistic regression model in a pipeline
final_pipe = Pipeline([
    ('preprocessing', num_transformer),
    ('linear_regression', LogisticRegression())])

final_pipe



In [27]:
# Train pipeline
final_pipe_trained = final_pipe.fit(X_train,y_train)

# Score model (for a classifier, score() returns accuracy on the given data)
final_pipe_trained.score(X_test,y_test)

0.9590643274853801

👇 What is the performance of the optimal pipeline? Make sure you cross-validate!

In [28]:
from sklearn.model_selection import cross_val_score

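# Cross-validate the pipeline with recall, the metric that matches the task
# (detecting as many malignant tumors as possible); 'r2' is a regression metric
cross_val_score(final_pipe, X_train, y_train, cv=5, scoring='recall').mean()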

In [29]:
final_pipe.get_params()

{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(transformers=[('num_transformer',
                                    Pipeline(steps=[('imputer', KNNImputer()),
                                                    ('scaler', MinMaxScaler())]),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7fd4a34fb8e0>)])),
  ('linear_regression', LogisticRegression())],
 'verbose': False,
 'preprocessing': ColumnTransformer(transformers=[('num_transformer',
                                  Pipeline(steps=[('imputer', KNNImputer()),
                                                  ('scaler', MinMaxScaler())]),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fd4a34fb8e0>)]),
 'linear_regression': LogisticRegression(),
 'preprocessing__n_jobs': None,
 'preprocessing__remainder': 'drop',
 'preprocessing__sparse_threshold': 0.3,
 'preprocessing__transformer_weights': None
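
The parameter names returned by `get_params()` are the keys a grid search uses to reach nested steps. Below is a minimal sketch of how the three candidate values for the imputer could be compared on recall. It assumes a pipeline shaped like the one built in the code cell above (a `preprocessing` step wrapping an `imputer` step), so the key would be `preprocessing__imputer__n_neighbors`. The names `X_demo`, `y_demo` and `demo_pipe` are illustrative, and scikit-learn's bundled breast cancer data with randomly blanked values stands in for `data.csv`, so the scores will not match the exercise data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): scikit-learn's breast cancer dataset, with about
# 2% of the values blanked out at random so the imputer has something to do.
X_demo, target = load_breast_cancer(return_X_y=True)
y_demo = (target == 0).astype(int)  # scikit-learn encodes malignant as 0; flip so 1 = malignant
rng = np.random.default_rng(0)
X_demo = X_demo.copy()
X_demo[rng.random(X_demo.shape) < 0.02] = np.nan

# Same structure as the exercise pipeline: impute, scale, then classify
demo_pipe = Pipeline([
    ("preprocessing", Pipeline([
        ("imputer", KNNImputer()),
        ("scaler", MinMaxScaler()),
    ])),
    ("logistic_regression", LogisticRegression()),
])

# Try each candidate number of neighbors with 5-fold cross-validated recall
search = GridSearchCV(
    demo_pipe,
    param_grid={"preprocessing__imputer__n_neighbors": [2, 5, 10]},
    scoring="recall",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # winning number of neighbors
print(search.best_score_)   # mean cross-validated recall of that candidate
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data passed to `fit`, so it can be used directly for prediction.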

👇 Using your optimal pipeline, predict whether the following tumor is malignant or not:

In [7]:
new_data = pd.read_csv("new_data.csv")
new_data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


In [36]:
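# A manual transform is not needed: the fitted pipeline imputes and scales inside predict().
# To inspect the preprocessed features anyway, reuse the pipeline's fitted preprocessing step.
new_data_transform = pd.DataFrame(final_pipe_trained['preprocessing'].transform(new_data))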

In [38]:
final_pipe_trained.predict(new_data)

array([1])
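
Calling `predict` on a fitted pipeline runs every preprocessing step before the model, which is why no separate transform is required. The sketch below checks this on synthetic data (the arrays and the names `toy_pipe` and `X_new` are assumptions made for illustration, not part of the exercise):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 100 samples, 4 features, some missing values
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

toy_pipe = Pipeline([
    ("imputer", KNNImputer(n_neighbors=5)),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
]).fit(X_toy, y_toy)

X_new = np.array([[0.5, np.nan, -1.0, 0.2]])  # one new sample with a missing value

# Route 1: let the pipeline preprocess and predict in one call
direct = toy_pipe.predict(X_new)

# Route 2: preprocess by hand with the fitted steps, then call the model
manual = toy_pipe[-1].predict(toy_pipe[:-1].transform(X_new))

print(direct, manual)
```

Both routes use the same fitted imputer and scaler, so they return the same prediction; the single `predict` call is simply less error-prone.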