# SVM

👇 Import the data

In [1]:
import pandas as pd

data = pd.read_csv('data.csv')

data.head()

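| Unnamed: 0 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 7.0 | 3.2 | 4.7 | 1.4 | 1 |
| 1 | 6.4 | 3.2 | 4.5 | 1.5 | 1 |
| 2 | 6.9 | 3.1 | 4.9 | 1.5 | 1 |
| 3 | 5.5 | 2.0 | 4.0 | 1.0 | 1 |
| 4 | 4.0 | 2.8 | 4.6 | 1.5 | 1 |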


The dataset describes two species of plants (the target) and their measurements (the features). It is the same dataset as in the previous exercise, but here the target is labeled.

## 1. Scaling

👇 Scale the features

In [2]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Feature columns to scale (the target `species` is left untouched)
features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

data[features] = scaler.fit_transform(data[features])

data.head()


| Unnamed: 0 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.366667 | 0.74 | 0.448276 | 1 |
| 1 | 0.85 | 0.366667 | 0.7 | 0.482759 | 1 |
| 2 | 0.975 | 0.35 | 0.78 | 0.482759 | 1 |
| 3 | 0.625 | 0.166667 | 0.6 | 0.310345 | 1 |
| 4 | 0.25 | 0.3 | 0.72 | 0.482759 | 1 |
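
For reference, `MinMaxScaler` rescales each column independently with `(x - min) / (max - min)`. The sketch below applies that formula by hand to the first five sepal lengths. The column minimum and maximum (3.0 and 7.0) are an assumption: they are inferred from the scaled output above, not read from `data.csv`, and `min_max_scale` is a hypothetical helper, not a scikit-learn function.

```python
def min_max_scale(values, lo, hi):
    """Rescale each value to [0, 1] with (x - lo) / (hi - lo)."""
    span = hi - lo
    return [(x - lo) / span for x in values]


# First five sepal lengths, as shown by data.head() before scaling.
sepal_length = [7.0, 6.4, 6.9, 5.5, 4.0]

# Assumed column extremes, inferred from the scaled output (not read from data.csv).
col_min, col_max = 3.0, 7.0

scaled = min_max_scale(sepal_length, col_min, col_max)
print([round(v, 3) for v in scaled])  # [1.0, 0.85, 0.975, 0.625, 0.25]
```

Under that assumption, the values match the `sepal length (cm)` column of the scaled output.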


## 2. Train/Test split

👇 Split the data into train and test sets.

In [3]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(data, test_size = 0.3)
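
Because `train_test_split` shuffles the rows at random, the split (and therefore every score further down) changes from one run to the next. A minimal sketch of a reproducible, class-balanced split follows. It assumes a stand-in dataset built from scikit-learn's iris data (two species, 100 rows), since `data.csv` itself is not available here; the seed `42` is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for data.csv (assumption): two iris species, 50 rows each.
iris = load_iris(as_frame=True).frame.rename(columns={"target": "species"})
data = iris[iris["species"].isin([1, 2])].reset_index(drop=True)

data_train, data_test = train_test_split(
    data,
    test_size=0.3,
    stratify=data["species"],  # keep the same class ratio in both sets
    random_state=42,           # arbitrary seed, only there for reproducibility
)

print(data_train.shape, data_test.shape)
print(data_test["species"].value_counts().to_dict())
```

With `stratify`, both sets keep the class proportions of the full dataset, which matters more as the dataset gets smaller.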

## 3. Random Search

👇 Use a **Random search** to optimize the parameters `kernel` and `C` of an SVM classifier.

In [8]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Select features
X = data_train[["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"]]
y = data_train['species']

# Instantiate model
model = SVC()

# Hyperparameter search space
search_space = {'kernel': ['linear', 'poly', 'rbf'], 'C': uniform(0, 10)}

# Instantiate Random Search
search = RandomizedSearchCV(model, param_distributions=search_space, n_jobs=-1, scoring='accuracy', cv=5, n_iter=20)

# Fit the Random Search on the training data
search.fit(X, y)


RandomizedSearchCV(cv=5, estimator=SVC(), n_iter=20, n_jobs=-1,
                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1290a0310>,
                                        'kernel': ['linear', 'poly', 'rbf']},
                   scoring='accuracy')
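
In SciPy, `uniform(0, 10)` is parametrized by `loc` and `scale`, so it samples `C` evenly over the interval from 0 to 10: roughly 90% of the draws land between 1 and 10, and small values of `C` are rarely tried. A common alternative is a log-uniform distribution, which spreads the draws across orders of magnitude and stays strictly positive, as `SVC` requires. The sketch below shows that variant; it is not what the exercise uses. The bounds `1e-2` and `1e2`, the seed, and the stand-in iris data (unscaled) are all assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Stand-in for the training data (assumption): two iris species from scikit-learn.
X, y = load_iris(return_X_y=True, as_frame=True)
mask = y.isin([1, 2])
X, y = X[mask], y[mask]

# Log-uniform C between 0.01 and 100: strictly positive, spread across magnitudes.
search_space = {"kernel": ["linear", "poly", "rbf"], "C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    SVC(),
    param_distributions=search_space,
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,  # arbitrary seed so the sampled candidates are repeatable
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```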

❓ Which kernel best separates the data?

In [9]:
search.best_params_

{'C': 6.969266625304459, 'kernel': 'rbf'}

❓ What is the mean fit time across all parameter combinations?

In [10]:
search.cv_results_['mean_fit_time'].mean()

0.0036894917488098145
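
`cv_results_` holds one entry per sampled parameter combination, and each `mean_fit_time` is already an average over the 5 folds, so the figure above is a mean of means. Loading the results into a DataFrame makes it easier to rank the candidates or compare kernels. This is a sketch on stand-in iris data (an assumption, as is the seed), so the numbers will not match the ones above.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Stand-in for the training data (assumption): two iris species from scikit-learn.
X, y = load_iris(return_X_y=True, as_frame=True)
mask = y.isin([1, 2])
X, y = X[mask], y[mask]

search = RandomizedSearchCV(
    SVC(),
    param_distributions={"kernel": ["linear", "poly", "rbf"], "C": uniform(0, 10)},
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,  # arbitrary seed for repeatable sampling
)
search.fit(X, y)

# One row per sampled parameter combination.
results = pd.DataFrame(search.cv_results_)
columns = ["param_kernel", "param_C", "mean_test_score", "mean_fit_time", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").head())

# Average fit time per kernel (each value is itself a mean over the 5 folds).
print(results.groupby(results["param_kernel"].astype(str))["mean_fit_time"].mean())
```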

## 4. Generalisation

👇 Extract the best model from the random search and score its performance on the test set.

In [11]:
# Extract best model from random search
model = search.best_estimator_

# Select features
X_test = data_test[["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"]]
y_test = data_test["species"]

model.score(X_test,y_test)

0.9666666666666667
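
For a classifier, `model.score` returns the accuracy, that is, the fraction of test rows predicted correctly. A single accuracy figure does not show which species the errors fall on; a confusion matrix does. The sketch below uses stand-in iris data and a hypothetical `SVC(kernel="rbf", C=1.0)` in place of `search.best_estimator_`, so its numbers are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): two iris species from scikit-learn.
X, y = load_iris(return_X_y=True, as_frame=True)
mask = y.isin([1, 2])
X_train, X_test, y_train, y_test = train_test_split(
    X[mask], y[mask], test_size=0.3, random_state=0
)

# Hypothetical stand-in for search.best_estimator_.
model = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true species, columns are predicted species.
cm = confusion_matrix(y_test, y_pred, labels=[1, 2])
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total.
print(accuracy_score(y_test, y_pred), cm.trace() / cm.sum())
```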

⚠️ Please push the exercise once completed. Thanks 🙃

🏁