# Random Search

👇 Import the data

In [3]:
import pandas as pd

data = pd.read_csv('data.csv')

data.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,7.0,3.2,4.7,1.4,1
1,6.4,3.2,4.5,1.5,1
2,6.9,3.1,4.9,1.5,1
3,5.5,2.0,4.0,1.0,1
4,4.0,2.8,4.6,1.5,1


The dataset describes two species of plants (the target) through four measurements (the features). It is the same dataset as in the previous exercise, except that the target has now been labeled.

## 1. Train/Test split

👇 Split the data into train and test sets.

In [5]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(data, test_size = 30)
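
A note on the split above: an integer `test_size` is read as an absolute number of rows, so `test_size = 30` holds out exactly 30 rows. The sketch below illustrates this on a made-up frame and adds two optional arguments, `random_state` and `stratify`, which are not used in the notebook. The `toy` frame and its 100-row size are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data.csv: 100 rows, two balanced classes
toy = pd.DataFrame({
    "feature": range(100),
    "species": [0] * 50 + [1] * 50,
})

# An integer test_size is an absolute row count, not a fraction
train_int, test_int = train_test_split(toy, test_size=30, random_state=0)

# stratify keeps the class ratio of the full frame in both subsets
train_strat, test_strat = train_test_split(
    toy, test_size=30, random_state=0, stratify=toy["species"]
)

print(len(train_int), len(test_int))
print(test_strat["species"].value_counts().to_dict())
```

Fixing `random_state` makes the split reproducible from one run to the next, which helps when comparing scores across exercises.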

## 2. Random search

👇 Is the default distance parameter of a KNeighborsClassifier (p, which defaults to 2, the Euclidean distance) optimal for this task? Run a random search to find out.

In [8]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import randint

# Instantiate model
model = KNeighborsClassifier()

# Hyperparameter search space (randint(1, 50) draws integers from 1 to 49)
search_space = {'n_neighbors' : randint(1,50), 'p':[1,2]}

# Instantiate random search
search = RandomizedSearchCV(model, param_distributions = search_space, n_jobs=-1, scoring = 'accuracy', cv = 5, n_iter = 50)

# Select features
X = data_train[["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"]]

# Fit the random search on the training data
search.fit(X, data_train.species)

RandomizedSearchCV(cv=5, estimator=KNeighborsClassifier(), n_iter=50, n_jobs=-1,
                   param_distributions={'n_neighbors': <scipy.stats._distn_infrastructure.rv_frozen object at 0x122e58d50>,
                                        'p': [1, 2]},
                   scoring='accuracy')
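
The `rv_frozen` object in the output above is the frozen `randint(1, 50)` distribution that the search samples `n_neighbors` from. As a small sketch of how it behaves (the sample size and seed are arbitrary choices for illustration), the upper bound is exclusive, so 50 is never drawn:

```python
from scipy.stats import randint

# Frozen discrete uniform distribution over the integers 1..49
dist = randint(1, 50)

# Draw a few candidates, as a random search would for n_neighbors
samples = dist.rvs(size=1000, random_state=0)
print(samples[:10])

# 49 has a non-zero probability, 50 has none
print(dist.pmf(49), dist.pmf(50))
```

Passing a distribution rather than a list lets the search draw a fresh value on every iteration instead of picking from a fixed grid.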

In [9]:
search.best_params_

{'n_neighbors': 3, 'p': 1}
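
Knowing the best parameters does not by itself say how much they improve on the defaults. One possible way to check, sketched below, is to cross-validate the default model and the tuned one on the same folds. Since `data.csv` is not available here, the sketch assumes that two classes of scikit-learn's bundled iris dataset are a reasonable stand-in; the scores it prints are therefore not the notebook's scores.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: two iris classes stand in for data.csv
iris = load_iris(as_frame=True)
mask = iris.target != 0
X, y = iris.data[mask], iris.target[mask]

# Default model: n_neighbors=5, p=2 (Euclidean distance)
default_scores = cross_val_score(
    KNeighborsClassifier(), X, y, cv=5, scoring="accuracy"
)

# Tuned model, using the parameters reported by the search above
tuned_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3, p=1), X, y, cv=5, scoring="accuracy"
)

print(f"default: {default_scores.mean():.3f} +/- {default_scores.std():.3f}")
print(f"tuned:   {tuned_scores.mean():.3f} +/- {tuned_scores.std():.3f}")
```

In the notebook itself, `search.best_score_` already holds the mean cross-validated accuracy of the best candidate, so only the default model would need to be scored separately.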

## 3. Generalisation

👇 Extract the best model from the random search and score its performance on the test set.

In [12]:
# Extract best model from random search
model = search.best_estimator_

# Select features and target (a 1-D Series is the expected shape for y)
X_test = data_test[["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"]]
y_test = data_test["species"]

model.score(X_test,y_test)

0.9333333333333333

You might have noticed that this model outperforms the one from the previous exercise, in which the distance hyperparameter had not been tuned.
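
With 30 test rows, accuracy moves in steps of 1/30 (about 0.033), and 0.9333 corresponds to 28 correct predictions out of 30, so a small gap between two models can come down to a single row. A rough way to put the test score in context is to read it next to the cross-validated score of the search. The sketch below does this end to end under the assumption that two classes of scikit-learn's iris dataset stand in for `data.csv`; `random_state`, `stratify` and `n_iter=20` are illustrative choices, not the notebook's settings.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: two iris classes stand in for data.csv
iris = load_iris(as_frame=True)
mask = iris.target != 0
X, y = iris.data[mask], iris.target[mask]

# Hold out 30 rows, as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=0, stratify=y
)

# Same search space as above, with a fixed seed for reproducibility
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={"n_neighbors": randint(1, 50), "p": [1, 2]},
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# Mean cross-validated accuracy on the training set vs. held-out accuracy
cv_score = search.best_score_
test_score = search.best_estimator_.score(X_test, y_test)
print(search.best_params_)
print(f"cv: {cv_score:.3f}  test: {test_score:.3f}")
```

A test score close to the cross-validated score suggests the tuned model generalises; a much lower one would hint that the search overfitted the training folds.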

⚠️ Please push the exercise once completed. Thanks 🙃

🏁