# Workflow

👇 Import the data

In [130]:
import pandas as pd

data = pd.read_csv('data.csv')

data.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,7.0,3.2,4.7,1.4,versicolor
1,6.4,3.2,4.5,1.5,versicolor
2,6.9,3.1,4.9,1.5,versicolor
3,5.5,2.0,4.0,1.0,versicolor
4,4.0,2.8,4.6,1.5,versicolor


The dataset describes two plant species (the target) through four sepal and petal measurements (the features).

## 1. Encoding

👇 Encode the target `species`

In [124]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

encoder.fit(data.species)

data["species"] = encoder.transform(data.species)

data.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,7.0,3.2,4.7,1.4,1
1,6.4,3.2,4.5,1.5,1
2,6.9,3.1,4.9,1.5,1
3,5.5,2.0,4.0,1.0,1
4,4.0,2.8,4.6,1.5,1
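To see which integer stands for which species, the fitted encoder can be inspected. The sketch below uses a small made-up label series rather than the notebook's `data.csv`; the label `setosa` is an assumption chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for the `species` column
species = pd.Series(["versicolor", "setosa", "versicolor", "setosa"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

# classes_ is sorted alphabetically; each label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original string labels from the codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted `encoder` around makes it possible to translate predictions back into species names later.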


## 2. Train/Test split

👇 Split the data into train and test sets.

In [125]:
from sklearn.model_selection import train_test_split

# An integer test_size is an absolute row count: 30 rows go to the test set
data_train, data_test = train_test_split(data, test_size=30)
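A minimal sketch of how `test_size` behaves, using a hypothetical 100-row frame in place of `data`. The `random_state` and `stratify` arguments are not used in the notebook; they are shown here as optional ways to make the split reproducible and class-balanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame standing in for `data`
toy = pd.DataFrame({"feature": np.arange(100), "species": [0, 1] * 50})

# Integer test_size: exact number of test rows (a float would be a fraction)
train, test = train_test_split(
    toy,
    test_size=30,
    random_state=42,          # assumption: fixed seed for reproducibility
    stratify=toy["species"],  # assumption: keep class proportions in both sets
)

print(len(train), len(test))
```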

## 3. Grid search

👇 Grid search the KNN hyperparameter K on the training data.
- Search k = [5, 10, 20, 30]
- Use 5-fold cross-validation
- Score with R²
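As a reminder of what the R² score measures, here is a small worked sketch with made-up targets and predictions (not taken from the dataset). It computes R² by hand as one minus the ratio of residual to total sum of squares and compares it with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical encoded targets and regressor outputs
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.0, 0.2, 0.8, 1.0])

# Residual sum of squares: squared errors of the predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared deviations from the mean target
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean target.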

In [126]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Instantiate model
model = KNeighborsRegressor()

# Hyperparameter Grid
k_grid = {'n_neighbors' : [5,10,20,30]}

# Instantiate Grid Search
grid = GridSearchCV(model, k_grid, n_jobs=-1, scoring='r2', cv=5)

# Select features
X = data_train[["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"]]

# Fit data to Grid Search
grid.fit(X, data_train.species)

GridSearchCV(cv=5, error_score=nan,
             estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30,
                                           metric='minkowski',
                                           metric_params=None, n_jobs=None,
                                           n_neighbors=5, p=2,
                                           weights='uniform'),
             iid='deprecated', n_jobs=-1,
             param_grid={'n_neighbors': [5, 10, 20, 30]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='r2', verbose=0)

❓ According to the grid search, what is the optimal K value?

In [127]:
grid.best_params_

{'n_neighbors': 5}

❓ What is the best score the optimal K value produced?

In [128]:
grid.best_score_

0.7439492063492062
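Beyond the single best score, the fitted grid search keeps the scores of every candidate in `cv_results_`. The sketch below is self-contained and therefore uses scikit-learn's bundled iris data as a stand-in for `data.csv`, with shuffled folds because that stand-in is ordered by class; the numbers it prints will differ from the notebook's.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Stand-in data (assumption): bundled iris set, not the notebook's data.csv
X, y = load_iris(return_X_y=True)

# Shuffled 5-fold split, since the stand-in rows are sorted by class
cv = KFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [5, 10, 20, 30]},
    scoring="r2",
    cv=cv,
)
grid.fit(X, y)

# One row per candidate K, with mean and spread of the fold scores
columns = ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(grid.cv_results_)[columns]
print(results.sort_values("rank_test_score"))
```

Looking at `std_test_score` alongside the mean helps judge whether the gap between two values of K is larger than the fold-to-fold noise.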

## 4. Generalisation

👇 Extract the best model from the grid search and score its performance on the test set.

In [129]:
# Extract best model from grid search
model = grid.best_estimator_

# Select features
X_test = data_test[["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"]]
y_test = data_test["species"]

# R2 of the best model on the held-out test set
model.score(X_test, y_test)

0.8888888888888888

❓ Would you consider the optimized model to generalize well?
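One way to reason about this question is to put the cross-validated score and the test score side by side. The sketch below does that on scikit-learn's bundled iris data (an assumption made to keep the example self-contained, not the notebook's `data.csv`), so its numbers are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Stand-in data (assumption): bundled iris set, not the notebook's data.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=0
)

grid = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [5, 10, 20, 30]},
    scoring="r2",
    cv=5,
)
grid.fit(X_train, y_train)

# Mean cross-validated R2 of the best K versus R2 on unseen test rows
cv_score = grid.best_score_
test_score = grid.best_estimator_.score(X_test, y_test)

# A large positive gap (CV much higher than test) would hint at overfitting
gap = cv_score - test_score
print(f"CV R2: {cv_score:.3f}, test R2: {test_score:.3f}, gap: {gap:.3f}")
```

A test score close to, or above, the cross-validated score is usually read as a sign of reasonable generalization, though with only 30 test rows the estimate is noisy.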

⚠️ Please push the exercise once completed. Thanks 🙃

🏁