You are given a dataset called ‘Diabetes.csv’.
- Load the dataset
- Print the first ten observations on the screen
- Check the shape of the dataset
- Print the column names of the dataset
- Split the dataset into the feature set (X) and the target variable (y). The target variable is ‘OnDiab’, indicating onset of diabetes within five years.
- Find the number of unique classes for the target variable.
- Split the dataset into training and test datasets
- Train a KNearestNeighbor classifier on your training dataset and print the score on the test dataset. Set the number of neighbors to 5.
- Import GridSearchCV from `sklearn.model_selection`
- Split your data into training and test datasets
- For neighbors=1 to 30, compute GridSearchCV for the training dataset with kfold=10.
- Print the best cross-validation score
- Print the best parameter
- Print the test score

In [1]:
import pandas as pd
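# The file has no header row, so read_csv uses the first record as column names
# (see df.columns below); passing header=None would keep that record as data.
df = pd.read_csv('diabetes.csv')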

In [2]:
df.shape

(767, 9)

In [3]:
df.columns

Index(['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1'], dtype='object')

In [4]:
cols = ['x1','x2','x3','x4','x5','x6','x7','x8','y']

In [5]:
df.columns=cols

In [6]:
df.columns

Index(['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'y'], dtype='object')
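The column labels can also be assigned while loading, which avoids losing the first record to the header. The following is a minimal sketch, assuming the file has no header row (as the `df.columns` output above suggests). The in-memory `raw` buffer and the name `df_demo` are illustrative stand-ins for the real file, built from three records that appear in the outputs of this notebook.

```python
import io
import pandas as pd

# Three records copied from the outputs in this notebook, standing in for the real file.
raw = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)
cols = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'y']

# header=None keeps the first record as data; names= assigns the column labels.
df_demo = pd.read_csv(raw, header=None, names=cols)

print(df_demo.shape)
print(df_demo.iloc[0].tolist())
```

With the real file, the same two arguments would be passed to `pd.read_csv('diabetes.csv', ...)`, and the shape would then include the record that is currently used as the header.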

In [7]:
df.head()

   x1   x2  x3  x4   x5    x6     x7  x8  y
0   1   85  66  29    0  26.6  0.351  31  0
1   8  183  64   0    0  23.3  0.672  32  1
2   1   89  66  23   94  28.1  0.167  21  0
3   0  137  40  35  168  43.1  2.288  33  1
4   5  116  74   0    0  25.6  0.201  30  0


In [8]:
# Features: every column except the target 'y'; axis=1 makes drop look for 'y' among the columns
X = df.drop('y', axis=1)

In [9]:
y = df['y']
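The exercise also asks for the number of unique classes in the target. A minimal sketch of that step, using a small hypothetical series `y_demo` in place of `df['y']`:

```python
import pandas as pd

# Hypothetical target values, standing in for df['y'].
y_demo = pd.Series([0, 1, 0, 1, 0], name='y')

# Number of distinct classes in the target.
n_classes = y_demo.nunique()

# How many observations fall into each class.
class_counts = y_demo.value_counts()

print(n_classes)
print(class_counts)
```

On the notebook's own data, the same calls would be `y.nunique()` and `y.value_counts()`.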

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state=0)
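The `stratify=y` argument keeps the class proportions roughly equal in the training and test sets. The sketch below illustrates this on made-up data; `X_demo` and `y_demo` are hypothetical arrays with a 70/30 class split and are not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 70 of class 0 and 30 of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# The default test_size is 0.25; stratify preserves the 70/30 ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# The mean of a 0/1 target is the share of class 1.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```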

In [12]:
from sklearn.preprocessing import StandardScaler

In [13]:
scaler = StandardScaler()

In [14]:
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

In [15]:
from sklearn.neighbors import KNeighborsClassifier

In [16]:
knn_nos = KNeighborsClassifier()
knn_s = KNeighborsClassifier()

In [17]:
knn_nos.fit(X_train, y_train)
knn_s.fit(X_train_scaled, y_train)

In [18]:
knn_nos.score(X_test,y_test)

0.75

In [19]:
X_test_scaled = scaler.transform(X_test)

In [20]:
knn_s.score(X_test_scaled, y_test)

0.7604166666666666

In [21]:
from sklearn.pipeline import make_pipeline

In [22]:
pipe1 = make_pipeline(StandardScaler(), KNeighborsClassifier())

In [23]:
pipe1.fit(X_train,y_train)
pipe1.score(X_test,y_test)

0.7604166666666666

In [24]:
# to see steps of pipeline
pipe1.steps

[('standardscaler', StandardScaler()),
 ('kneighborsclassifier', KNeighborsClassifier())]

In [25]:
# to see the first step of pipeline
pipe1.steps[0]

('standardscaler', StandardScaler())

In [26]:
from sklearn.model_selection import GridSearchCV

In [27]:
import numpy as np
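# Candidate n_neighbors values: the odd numbers 1, 3, ..., 29.
# The 'kneighborsclassifier__' prefix is the step name shown by pipe1.steps.
my_params = {'kneighborsclassifier__n_neighbors': np.arange(1,30,2)}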

In [28]:
my_grid = GridSearchCV(pipe1, param_grid=my_params, cv=5)

In [29]:
pipe1

In [33]:
my_grid.fit(X_train,y_train)

In [34]:
my_grid.score(X_test,y_test)

0.734375
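The exercise also asks for the best cross-validation score and the best parameter, which a fitted `GridSearchCV` exposes as `best_score_` and `best_params_`. The sketch below shows the full sequence with the settings from the exercise text (neighbors 1 to 30, 10 folds). It runs on synthetic data from `make_classification`, which is an assumption made so that the example is self-contained; the names `X_demo`, `y_demo`, `pipe`, `params` and `grid` are illustrative, and the printed numbers will differ from those obtained on the diabetes data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data: 8 features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# n_neighbors from 1 to 30 inclusive, evaluated with 10-fold cross-validation.
params = {'kneighborsclassifier__n_neighbors': np.arange(1, 31)}
grid = GridSearchCV(pipe, param_grid=params, cv=10)
grid.fit(X_tr, y_tr)

print(grid.best_score_)        # best mean cross-validation accuracy
print(grid.best_params_)       # n_neighbors value that achieved it
print(grid.score(X_te, y_te))  # accuracy of the refit best model on the test set
```

On the notebook's own objects, the corresponding attributes would be `my_grid.best_score_` and `my_grid.best_params_`.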