## Imports

In [1]:
import numpy as np
import pandas as pd


## Data

In [2]:
df = pd.read_csv('gene_expression.csv')

In [3]:
df.head()

Unnamed: 0,Gene One,Gene Two,Cancer Present
0,4.3,3.9,1
1,2.5,6.3,0
2,5.7,3.9,1
3,6.1,6.2,0
4,7.4,3.4,1


## Train|Test Split and Scaling Data

In [4]:
X = df.drop('Cancer Present',axis=1)
y = df['Cancer Present']



from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


### Creating a Pipeline to find K value

**Follow along very carefully here! The string names we give each Pipeline step (such as 'scaler' and 'knn') are reused later as prefixes in the grid search parameter names, so they have to match exactly. This is not a case where you can easily swap out names for whatever you want!**

We'll use a Pipeline object to set up a workflow of operations:

1. Scale Data
2. Create Model on Scaled Data

----
**How does the Scaler work inside a Pipeline with CV? Is scikit-learn "smart" enough to understand .fit() on train vs .transform() on train and test?**

**Yes! Scikit-Learn's pipeline is well suited for this! [Full Info in Documentation](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)**

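When you use the StandardScaler as a step inside a Pipeline, scikit-learn fits it only on the data passed to .fit() and reuses those fitted statistics whenever it transforms new data, so nothing from the held-out data leaks into the scaler.

What happens can be described as follows: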

* Step 0: The data passed to .fit() are split into TRAINING data and TEST data (the held-out validation fold) according to the cv parameter that you specified in the GridSearchCV (5 folds by default).
* Step 1: The scaler is fitted on the TRAINING data only.
* Step 2: The scaler transforms the TRAINING data.
* Step 3: The models are fitted/trained using the transformed TRAINING data.
* Step 4: The already fitted scaler is used to transform the TEST data.
* Step 5: The trained models predict using the transformed TEST data.

These steps are repeated for every fold and for every candidate value in the parameter grid.
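
The steps above can be sketched by hand and compared against what a Pipeline does under cross-validation. This is only an illustration under stated assumptions: the data are a synthetic stand-in from `make_classification` (not the gene expression set), `n_neighbors=5` is an arbitrary choice, and the names `X_demo`, `y_demo`, `fold_scaler`, `fold_knn` and `pipe_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data (an assumption, used only for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Step 0: deterministic 5-fold split (no shuffling), reused for both approaches.
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in cv.split(X_demo, y_demo):
    fold_scaler = StandardScaler()
    X_tr = fold_scaler.fit_transform(X_demo[train_idx])  # Steps 1 and 2
    fold_knn = KNeighborsClassifier(n_neighbors=5)
    fold_knn.fit(X_tr, y_demo[train_idx])                # Step 3
    X_val = fold_scaler.transform(X_demo[val_idx])       # Step 4 (transform only)
    manual_scores.append(fold_knn.score(X_val, y_demo[val_idx]))  # Step 5

# The same workflow expressed as a Pipeline and scored on the same folds.
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pipe_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv, scoring='accuracy')

# Compare the per-fold accuracies of the two approaches.
print(np.allclose(manual_scores, pipe_scores))
```

Because both approaches use the same folds and the same settings, the per-fold accuracies are expected to agree, which is the point of letting the Pipeline handle the scaling.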

----

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [6]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

In [7]:
operations = [('scaler',scaler),('knn',knn)]

In [8]:
from sklearn.pipeline import Pipeline
pipe = Pipeline(operations)

## GridSearch

In [9]:
k_values = list(range(1,20))

param_grid = {'knn__n_neighbors': k_values}
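
The key `'knn__n_neighbors'` follows the pattern step name, double underscore, parameter name. A minimal sketch of where that key comes from, using a separately built pipeline with the hypothetical name `demo_pipe` so the notebook's own `pipe` is left untouched:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same step names as in the notebook: 'scaler' and 'knn'.
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Every parameter of a step is exposed as '<step name>__<parameter name>'.
knn_keys = [key for key in demo_pipe.get_params() if key.startswith('knn__')]
print(knn_keys)

# The key used in param_grid must be one of the names listed by get_params().
print('knn__n_neighbors' in demo_pipe.get_params())

# set_params accepts the same key and forwards the value to the 'knn' step.
demo_pipe.set_params(knn__n_neighbors=7)
print(demo_pipe.named_steps['knn'].n_neighbors)
```

If the step had been named something other than `'knn'`, the prefix in `param_grid` would have to change to match.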

In [10]:
from sklearn.model_selection import GridSearchCV
full_search = GridSearchCV(pipe,param_grid,scoring='accuracy')

In [None]:
full_search.fit(X_train,y_train)

In [None]:
full_search.best_estimator_.get_params()
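
`get_params()` returns every parameter of the pipeline, so the chosen K can be hard to spot. As a hedged sketch, the fitted search object also exposes `best_params_`, `best_score_` and `cv_results_`. The example below runs on synthetic stand-in data (an assumption, not the gene expression set) with the hypothetical names `demo_search` and `results`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data (an assumption, used only for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])
demo_search = GridSearchCV(
    demo_pipe, {'knn__n_neighbors': list(range(1, 20))}, scoring='accuracy'
)
demo_search.fit(X_demo, y_demo)

# Only the parameters that were searched, with their winning values.
print(demo_search.best_params_)

# Mean cross-validated accuracy of the winning candidate.
print(demo_search.best_score_)

# One row per candidate K, with its mean accuracy across the folds.
results = pd.DataFrame(demo_search.cv_results_)
print(results[['param_knn__n_neighbors', 'mean_test_score']].head())
```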

## Final Model

In [15]:
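# GridSearchCV refits the best pipeline on all of X_train by default (refit=True),
# so predict() scales X_test with the fitted scaler and then applies the best KNN.
pred = full_search.predict(X_test)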

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
confusion_matrix(y_test,pred)

In [None]:
print(classification_report(y_test,pred))
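
To show how the confusion matrix relates to the accuracy figure in the report, here is a small sketch on made-up labels. The lists `y_true_demo` and `y_pred_demo` are hypothetical and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only.
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[2 0]
#  [1 2]]

# Accuracy is the diagonal (correct predictions) divided by the total count.
manual_accuracy = np.trace(cm) / cm.sum()
print(manual_accuracy)                            # 0.8
print(accuracy_score(y_true_demo, y_pred_demo))   # 0.8
```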