# `scikit-learn` Overview



In [None]:
!pip install --upgrade scikit-learn



### Imports



In [None]:
# We'll use numpy to generate the data
import numpy as np

# matplotlib for figures
from matplotlib import pyplot as plt
%matplotlib inline

# The scikit-learn package is imported under the name 'sklearn'
import sklearn



In practice we rarely import the whole `sklearn` package like this, since it is large. Instead, we import only the specific modules and classes we need, as we do in the cells below.



## Classification example
### Generating data to work with



In [None]:
num_samples = 1000
num_features = 3
X = np.random.uniform(low=0.0, high=10.0, size=(num_samples,num_features))

# Checking that the array is shaped as we wanted
X.shape



In [None]:
# Let's set y to 1 for samples whose mean feature value is above 6
y = np.zeros(num_samples, dtype=int)
y[X.mean(axis=1) > 6] = 1

# Checking that we actually have 2 classes
unique, counts = np.unique(y, return_counts=True)
dict(zip(unique, counts))



Notice that our features are written as an uppercase `X` and the labels as a lowercase `y`.

This is a naming convention: `X` is a 2D array (a matrix, hence uppercase) and `y` is a 1D array (a vector, hence lowercase). A single sample, which is a 1D array of features, would be written as a lowercase `x`.
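
The shape convention matters in practice, because sklearn estimators expect a 2D array even when predicting a single sample. The following is a minimal sketch on throwaway data; the `_demo` names and the seed are assumptions made for illustration and are not part of the notebook's own variables.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(100, 3))   # 2D: (n_samples, n_features)
y_demo = (X_demo.mean(axis=1) > 6).astype(int)   # 1D: (n_samples,)

demo_model = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

x_demo = np.array([9.0, 9.5, 8.0])               # a single sample, 1D: (n_features,)
# predict() expects a 2D array, so turn the sample into a one-row matrix first
x_as_row = x_demo.reshape(1, -1)
prediction = demo_model.predict(x_as_row)

print(x_demo.shape, x_as_row.shape, prediction.shape)
```

Passing `x_demo` directly to `predict()` would raise an error asking for a 2D array, which is why the reshape step is there.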



### Train / Test Split
Sklearn can do a lot of heavy lifting for you, including splitting your data sets.



In [None]:
from sklearn.model_selection import train_test_split

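# Train/validation/test split (60% / 20% / 20%):
# the second split takes 0.25 of the remaining 80%, which is 20% of the full data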
tmp_X_train, X_test, tmp_y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(tmp_X_train, tmp_y_train, test_size=0.25, random_state=42)

print('X_train shape: ', X_train.shape)
print('X_validation shape: ', X_validation.shape)
print('X_test shape: ', X_test.shape)
print('y_train shape: ', y_train.shape)
print('y_validation shape: ', y_validation.shape)
print('y_test shape: ', y_test.shape)



### Baseline Model: kNN
A model should never be chosen directly on the test set! Instead, we divide our data into train/validation/test sets and choose the best model using the validation set. This avoids a phenomenon called *test set overfitting*, where the performance on the test set no longer represents generalization.



In [None]:
# Import
from sklearn.neighbors import KNeighborsClassifier

# Create the model object
kNN_model = KNeighborsClassifier(n_neighbors=5)

# Fit the training data
kNN_model.fit(X_train, y_train)

# Predict on the validation data
y_pred = kNN_model.predict(X_validation)



### Score our prediction
Sklearn has many metrics you can use out of the box. Check out the list [here](https://scikit-learn.org/stable/modules/model_evaluation.html).



In [None]:
from sklearn.metrics import accuracy_score

print('kNN baseline score: ', accuracy_score(y_validation, y_pred))



Looks good, but let's try to improve it.
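
A word of caution on accuracy: our two classes are not balanced (fewer samples have a mean above 6 than below it), and on imbalanced data accuracy alone can look better than the model really is. The sketch below uses made-up labels, not the notebook's data, to show how a classifier that always predicts the majority class still gets a decent accuracy while recall exposes the problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 8 negatives and 2 positives (an imbalanced toy case)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A "lazy" classifier that always predicts the majority class
y_lazy_demo = np.zeros(10, dtype=int)

acc = accuracy_score(y_true_demo, y_lazy_demo)
rec = recall_score(y_true_demo, y_lazy_demo)
# zero_division=0 avoids a warning when no positives are predicted at all
prec = precision_score(y_true_demo, y_lazy_demo, zero_division=0)
f1 = f1_score(y_true_demo, y_lazy_demo, zero_division=0)

print('accuracy: ', acc)   # 0.8, although no positive sample was found
print('recall:   ', rec)   # 0.0
print('precision:', prec)  # 0.0
print('f1:       ', f1)    # 0.0
```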

### Finding a better model
To find a better model, we'll need to test the performance of several algorithms on our toy dataset. 



In [None]:
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# Let's create some classifiers
names = ["Nearest Neighbors", "Linear SVM", "Decision Tree", "Random Forest", "AdaBoost",  "Naive Bayes"]
classifiers = [KNeighborsClassifier(5),
               SVC(kernel="linear", C=0.025),
               DecisionTreeClassifier(max_depth=5),
               RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
               AdaBoostClassifier(),
               GaussianNB()]

for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)                # sklearn has the same API for all algorithms:  .fit()
    #clf.predict(X_test)                     # .predict()
    val_score = clf.score(X_validation, y_validation)   # and even .score() which runs predict inside
    print(name, '(validation):', val_score)



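Now, let's see how our chosen model performs on the real test set. We pick the linear SVM here, assuming it scored best on the validation set. The data is random, so the ranking in your run may differ.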



In [None]:
model = SVC(kernel="linear", C=0.025)
model.fit(X_train, y_train)

print('Linear SVM test score: ', model.score(X_test, y_test))



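As you can see, the performance on the test set is not identical to the performance on the validation set. This is expected, since each score is computed on a different, finite sample of the data.

**Note:** If the difference between the validation and test performance is large, it may indicate a problem, for example that the model selection overfit the validation set.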
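
To get a feel for how much a score can move between splits, one option is k-fold cross-validation, which scores the same model on several different held-out folds. This is a rough sketch on freshly generated toy data (the `_demo` names, the seed, and the choice of 5 folds are assumptions for illustration), not a replacement for the train/validation/test procedure above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(500, 3))
y_demo = (X_demo.mean(axis=1) > 6).astype(int)

# Fit and score the same model on 5 different train/held-out partitions
cv_scores = cross_val_score(SVC(kernel="linear", C=0.025), X_demo, y_demo, cv=5)

print('fold accuracies:', cv_scores)
print('mean +/- std: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))
```

If the gap between validation and test scores is of the same order as the spread across folds, it is probably just sampling noise.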



### Confusion Matrix
Don't be confused! Sklearn can help with confusion matrices!
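
Before plotting, it helps to know how sklearn lays out the matrix: rows are the true classes and columns are the predicted classes. The tiny hand-made example below (made-up labels, not the notebook's data) shows the layout for a binary problem.

```python
from sklearn.metrics import confusion_matrix

y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
# [[2 1]    row 0: true class 0 -> 2 predicted as 0, 1 predicted as 1
#  [0 3]]   row 1: true class 1 -> 0 predicted as 0, 3 predicted as 1

# For binary labels, ravel() unpacks the cells in this order
tn, fp, fn, tp = cm_demo.ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```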



In [None]:
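# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay (available since 1.0) replaces it
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix
# Sklearn uses matplotlib behind the scenes
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test);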



In [None]:
# We can also generate the textual version
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)

conf_mat = confusion_matrix(y_test, y_pred)
conf_mat



In [None]:
# ... and visualize it however we want
import seaborn as sns

# fmt='d' prints the counts as integers instead of scientific notation
sns.heatmap(conf_mat, annot=True, fmt='d');



## Regression example
The steps are the same as in the classification example above:
1. Generate toy data / load real data
2. Split data into train/validation/test
3. Start with a simple baseline model
4. Try to improve performance with additional algorithms
5. Check test set performance with selected model
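
Step 2 repeats the same two-stage split used in the classification example. The sketch below wraps it in a small helper; `train_val_test_split` is a hypothetical name introduced here, not an sklearn function, and the default fractions are assumptions matching the 60/20/20 split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=42):
    """Hypothetical helper: split into train/validation/test in two stages."""
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    # The second split is taken from what is left, so rescale the fraction:
    # 0.2 of the full data is 0.2 / (1 - 0.2) = 0.25 of the remainder
    relative_val_size = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=relative_val_size, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Dummy data, only used to check the resulting sizes
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)
parts = train_val_test_split(X_demo, y_demo)
sizes = [len(p) for p in parts[:3]]
print('train / validation / test sizes:', sizes)
```

Expressing the validation fraction relative to the full data set keeps the intent readable, and the rescaling explains where the `test_size=0.25` in the second split comes from.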



In [None]:
# In this example, we'll use one of sklearn's built-in datasets.
# Note: load_boston was removed in scikit-learn 1.2, so we use the
# California housing dataset instead (it is downloaded on first use).
from sklearn.datasets import fetch_california_housing

# 1. Load data
X, y = fetch_california_housing(return_X_y=True)

# 2. Split data
tmp_X_train, X_test, tmp_y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(tmp_X_train, tmp_y_train, test_size=0.25, random_state=42)

print('X_train shape: ', X_train.shape)
print('X_validation shape: ', X_validation.shape)
print('X_test shape: ', X_test.shape)
print('y_train shape: ', y_train.shape)
print('y_validation shape: ', y_validation.shape)
print('y_test shape: ', y_test.shape)



In [None]:
# Let's look at the y distribution (just so we know some background before we start modeling)

plt.figure(figsize=(6.5,6))
plt.hist(y, alpha=0.7, color='firebrick', edgecolor='k', label='y (California housing)')
# In a histogram, the x axis holds the values and the y axis holds the counts
plt.xlabel('y value')
plt.ylabel('# samples')
plt.title("Distribution of y (California housing)")
plt.tight_layout()



In [None]:
# 3. Baseline model
from sklearn import linear_model

reg_model = linear_model.LinearRegression()
reg_model.fit(X_train, y_train)  # Same fit/predict API!

print('Learned regression weights: \n', reg_model.coef_)



In [None]:
# Mean Squared Error (MSE) of baseline model
from sklearn.metrics import mean_squared_error

y_pred = reg_model.predict(X_validation)
mse = mean_squared_error(y_validation, y_pred)

print('Baseline Linear Regression MSE (validation): ', mse)



Let's see if we can improve this!
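
Before comparing models, it may help to recall what the MSE measures: the average of the squared differences between true and predicted values, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The sketch below uses made-up numbers to check a manual computation against `mean_squared_error`. Taking the square root gives the RMSE, which is in the same units as `y` and is often easier to interpret.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

# Squared errors are 1, 0, 4, 1, so the mean is 6 / 4 = 1.5
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse_sklearn)

print('manual MSE: ', mse_manual)
print('sklearn MSE:', mse_sklearn)
print('RMSE:       ', rmse)
```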



In [None]:
# 4. Find better algorithms

names = ["Linear Regression", "Ridge Regression", "Stochastic Gradient Descent Regression", "Bayesian Ridge Regression"]
regressors = [linear_model.LinearRegression(),
              linear_model.Ridge(),
              linear_model.SGDRegressor(),
              linear_model.BayesianRidge()]

for name, reg_model in zip(names, regressors):
    reg_model.fit(X_train, y_train)
    y_pred = reg_model.predict(X_validation)
    mse = mean_squared_error(y_validation, y_pred)
    print(name, 'MSE (validation):', mse)



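Looks like our baseline plain-vanilla linear regression performs comparably to Ridge and Bayesian Ridge, with no real improvement. This can also happen :) If the SGD regressor's MSE is much larger than the others, that is most likely because SGD is sensitive to feature scaling and we did not standardize the features.

In cases like these (similar performance), we'll prefer the simplest model of the bunch, since it is the easiest to interpret.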
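
A side note on the SGD regressor: gradient-based estimators generally behave much better when the features are standardized first. The sketch below shows one common way to do that, chaining `StandardScaler` and `SGDRegressor` in a pipeline. The data is synthetic, with two features on deliberately different scales, and the `_demo` names, coefficients, and seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: one feature in [0, 1], the other in [0, 1000]
X_demo = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)])
y_demo = 3.0 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(0, 0.1, 500)

# The scaler is fit together with the regressor, so the same
# transformation is applied automatically at prediction time
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_demo, y_demo)

mse_scaled = mean_squared_error(y_demo, scaled_sgd.predict(X_demo))
print('SGD with standardized features, MSE (train):', mse_scaled)
```

In the notebook, the same pipeline object could be dropped into the `regressors` list in place of the plain `SGDRegressor()`, since it exposes the same `.fit()` / `.predict()` API.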

Let's see how it performs on the real test set.



In [None]:
lin_reg_model = linear_model.LinearRegression()
lin_reg_model.fit(X_train, y_train)

y_pred = lin_reg_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Linear Regression MSE (test):', mse)

