# Say hello to machine learning in Python

## Basic machine learning tutorial using the scikit-learn library

### @ PyConPL 2018

# Table of Contents

[1 Preparation of the development environment](#1-Preparation-of-the-development-environment)<br/>
[2 Introduction to Machine Learning](#2-Introduction-to-Machine-Learning)<br/>
[3 Introduction to classification](#3-Introduction-to-classification)<br/>
[4 Introduction to regression](#4-Introduction-to-regression)<br/>
[5 Project time](#5-Project-time)

# 1 Preparation of the development environment

### Dependencies installation

Since you may have several versions of Python installed on your system, we use a slightly convoluted way of installing packages: it makes sure they are added to the same Python interpreter that is running this notebook.

In [None]:
import sys
!{sys.executable} -m pip install numpy pandas scipy scikit-learn matplotlib

### Imports

A few libraries that will be used during the workshop. <br/>
**Please confirm that you can import all of them successfully.**

In [None]:
import urllib.request
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline

### Download datasets

In [None]:
_ = urllib.request.urlretrieve('https://s3.eu-central-1.amazonaws.com/ml-workshop-pycon-2018/startups.csv', 'startups.csv')
_ = urllib.request.urlretrieve('https://s3.eu-central-1.amazonaws.com/ml-workshop-pycon-2018/orthopedic_patients_3C.csv', 'orthopedic_patients_3C.csv')

### Some "global" constants

In [None]:
RANDOM_SEED = 42
STARTUPS_DATASET_PATH = './startups.csv'
ORTHOPEDIC_PATIENTS_DATASET_PATH = './orthopedic_patients_3C.csv'

### Utility (visualisation) code

In [None]:
def _make_meshgrid(x, y, h=.02):
    """
    Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy


def _plot_contours(ax, clf, xx, yy, **params):
    """
    Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out


def plot_decision_regions(clf, X, y, title='', x_label='', y_label='', figsize=(6, 6)):
    """
    Plot decision regions for the given classifier and data.
    
    Parameters
    ----------
    clf: a (fitted) classifier
    X: matrix of samples used as classifier input (note that only the first 2 features are used)
    y: class labels of the samples in X, used to colour the scatter points
    title: title for a generated plot
    x_label: label for x axis of generated plot
    y_label: label for y axis of generated plot
    figsize: size of generated plot in inches, 2 element tuple (width, height)
    """
    plt.figure(figsize=figsize)
    X0, X1 = X[:, 0], X[:, 1]
    xx, yy = _make_meshgrid(X0, X1)
    
    _plot_contours(plt.gca(), clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    plt.show()

# 2 Introduction to Machine Learning

# 3 Introduction to classification

<b>Note:</b> all concepts presented here will be described in more detail during the workshop and on the slides.

### Dataset loading

In [None]:
from sklearn.datasets import load_iris


iris_dataset = load_iris()
print(iris_dataset['DESCR'])

In [None]:
X = iris_dataset.data
y = iris_dataset.target

X = X[:, :2]

In [None]:
X[: 5]

In [None]:
y[: 5]

### Train/test split

In [None]:
from sklearn.model_selection import train_test_split

Default, completely random split with 25% of data assigned to test set...

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_SEED)


print('Class labels:', np.unique(y_train))
print('Class counts:', np.unique(y_train, return_counts=True)[1])

... vs stratified (preserving class distribution) split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_SEED,
                                                   stratify=y)


print('Class labels:', np.unique(y_train))
print('Class counts:', np.unique(y_train, return_counts=True)[1])
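To make the difference between the two splits easier to see, here is a small self-contained sketch that compares the class counts of the *test* part under both strategies. It reloads Iris on its own, and the names `X_demo`, `y_demo`, `random_counts` and `strat_counts` are introduced only for this illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Illustrative copy of the data, so the notebook's X and y stay untouched.
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# Plain random split: class counts in the test set may be uneven.
_, _, _, y_test_random = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Stratified split: class proportions of y_demo are preserved.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

random_counts = Counter(y_test_random.tolist())
strat_counts = Counter(y_test_strat.tolist())
print('Random split test counts:    ', sorted(random_counts.items()))
print('Stratified split test counts:', sorted(strat_counts.items()))
```

Iris has 50 samples per class, so with stratification the per-class counts in the test set should differ by at most one.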

### Model Fitting

In [None]:
from sklearn.neighbors import KNeighborsClassifier


knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

### Model evaluation

In [None]:
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)


y_train_pred = knn_clf.predict(X_train)
y_test_pred = knn_clf.predict(X_test)
print('Train acc:', accuracy(y_train, y_train_pred))
print('Test acc:', accuracy(y_test, y_test_pred))

Instead of writing this code ourselves, we can rely on scikit-learn here too: it contains implementations of the most popular metrics (some of which are much more complex and harder to code than the accuracy used above).

In [None]:
from sklearn.metrics import accuracy_score


y_test_pred = knn_clf.predict(X_test)
accuracy_score(y_test, y_test_pred)
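As an example of a metric that is more work to code by hand, the sketch below computes a confusion matrix next to the accuracy. The labels are made up for illustration (`y_true_demo` and `y_pred_demo` are not taken from the workshop data); rows of the matrix correspond to true classes and columns to predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels, for illustration only.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print('Accuracy:', acc_demo)
print('Confusion matrix (rows: true class, columns: predicted class):')
print(cm_demo)
```

Unlike a single accuracy number, the matrix shows which classes get confused with which.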

As we are dealing with only two features, we can also check what the algorithm's decision regions look like.

In [None]:
plot_decision_regions(knn_clf, X_test, y_test, 'KNN decision regions', 'sepal length [cm]', 'sepal width [cm]')

The nearest neighbors classifier is only one of many ML algorithms. Let's use another one, logistic regression, which is a linear classifier, to see how well it tackles the problem and what its decision boundaries look like!

In [None]:
from sklearn.linear_model import LogisticRegression


log_reg_clf = LogisticRegression()
log_reg_clf.fit(X_train, y_train)
y_train_pred = log_reg_clf.predict(X_train)
y_test_pred = log_reg_clf.predict(X_test)
print('Train acc:', accuracy_score(y_train, y_train_pred))
print('Test acc:', accuracy_score(y_test, y_test_pred))

In [None]:
plot_decision_regions(log_reg_clf, X_test, y_test, 'Logistic regression decision regions', 'sepal length [cm]', 'sepal width [cm]')

### Model hyperparameters

The features of the data we use for model training greatly influence its performance. Besides them, the algorithms themselves have another set of parameters, called hyperparameters, that affect how they work and how they optimize a given problem. We will test this in one of the following exercises.

In [None]:
knn_clf = KNeighborsClassifier(n_neighbors=1)  # default number of neighbors is 5
knn_clf.fit(X_train, y_train)

y_train_pred = knn_clf.predict(X_train)
y_test_pred = knn_clf.predict(X_test)
print('Train acc:', accuracy(y_train, y_train_pred))
print('Test acc:', accuracy(y_test, y_test_pred))

### Exercises
- Explore more exhaustively how changing the *k* parameter of the KNN classifier affects the achieved accuracy. Fit the classifier with every odd *k* value from 1 to 51 (1, 3, 5, ..., 51) and check the accuracy. <br/> **Bonus**: Plot the results you obtained.
- Fit the data using yet another classifier, a Decision Tree (located in *sklearn.tree*). Do you need to change much of the code to test another algorithm?
- Remember how we selected only two features of the Iris dataset after loading? It was done for visualisation purposes, but more features can definitely help us improve our score. Investigate how different models perform using data with all features. <br/>**Bonus**: After that, you can also play with the hyperparameters and try to obtain as high an accuracy on the test set as possible.

# 4 Introduction to regression

### Dataset loading

The Iris dataset we used earlier is available directly in the library and is "ready to go". Reality is often not so pleasant: data comes with many problems, like missing or inconsistent values, that must be handled first. In fact, data preparation is often said to take as much as 80-90% of the time spent on a project. Fortunately, there are plenty of libraries that simplify working with data, e.g. pandas, which we will use.

In [None]:
data = pd.read_csv(STARTUPS_DATASET_PATH)

CSV is one of the simplest formats to handle. Sometimes datasets contain data stored in various binary formats that are not as straightforward to load.

Remember to always look for a ready-to-use solution before you start writing your own data parser. It will save you a lot of time.

Startups Database
=================

Notes
-----
Dataset Characteristics:
    :Number of Instances: 50
    :Number of Attributes: 5 (4 numeric and 1 textual)
    :Attribute Information:
        - Research & Development Spend in USD
        - Administration Spend in USD
        - Marketing Spend in USD
        - State in which the startup is located (3 possible values)
        - Profit in USD

This dataset describes some attributes of 50 startups. The task is to perform
a regression that will predict the overall profit of the startup using all
other information.

Please note that since this dataset is very small, it can only be used with
simple regression methods.

### Data preparation - data cleaning

Now that we have loaded the data, let's take a peek at what it looks like. We can also generate some summary statistics.

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.info()

We can clearly see some inconsistencies. R&D and Administration costs have fewer non-null values than the other properties.

In order to perform the training, we will get rid of the rows that lack some data.

In [None]:
pd.isnull(data).any(axis=1)
# data[pd.isnull(data).any(axis=1)]
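Besides the row-wise mask above, it is often handy to count missing values per column and to look at the affected rows directly. The sketch below does this on a tiny hand-made frame (`toy` and its values are invented for illustration and only borrow a few column names from the startups data).

```python
import numpy as np
import pandas as pd

# Hand-made frame with a few gaps, for illustration only.
toy = pd.DataFrame({
    'R&D Spend': [100.0, np.nan, 300.0],
    'Administration': [50.0, 60.0, np.nan],
    'Profit': [10.0, 20.0, 30.0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Only the rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)

# What would remain after dropping incomplete rows.
print('Rows left after dropna():', len(toy.dropna()))
```

The same three expressions should work unchanged on `data`, assuming it was loaded as in the cell above.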

Let's check the length (number of rows) of the data.

In [None]:
print(len(data))

Now, let's drop the rows with missing information and check the length again.

In [None]:
data = data.dropna()
print(len(data))

### Data preparation - encoding

Problem: we got rid of the samples with missing data, but we also need our features to form a numeric matrix, a condition that is not fulfilled, judging by values like 'California' in the State column...

In [None]:
data.columns

In [None]:
X = data[['R&D Spend', 'Administration', 'Marketing Spend', 'State']].values
y = data['Profit'].values

In [None]:
from sklearn.preprocessing import LabelEncoder

state_le = LabelEncoder()
state_le.fit(X[:, 3])
X[:, 3] = state_le.transform(X[:, 3])

In [None]:
X[: 5]

In [None]:
state_le.classes_
state_le.inverse_transform([0, 1, 2])  # we can invert the performed transformation at any time

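We got our data, but there is a problem: should one state, say New York, be numerically greater than another, say Florida? What would that even mean? Label encoding imposes an arbitrary order on the states, which a linear model would treat as meaningful, so we encode the column as separate 0/1 indicator columns instead.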

In [None]:
from sklearn.preprocessing import OneHotEncoder

state_ohe = OneHotEncoder()
state_ohe.fit(X[:, 3].reshape(-1,1))  # OneHotEncoder requires 2D (matrix) input - hence the reshape
encoded = state_ohe.transform(X[:, 3].reshape(-1,1)).toarray()
encoded = np.delete(encoded, 0, axis=1)  # we drop one column as it is redundant and can be inferred from remaining ones
X = np.delete(X, 3, axis=1) 
X = np.hstack([X, encoded])
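To see what one-hot encoding produces, here is a minimal sketch on a hand-made column of state names (the array `states` is invented for illustration; the real column may contain different values). Each category becomes its own 0/1 column, so no artificial order is introduced.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hand-made state column, for illustration only; OneHotEncoder expects 2D input.
states = np.array(['California', 'Florida', 'New York', 'Florida']).reshape(-1, 1)

ohe = OneHotEncoder()
one_hot = ohe.fit_transform(states).toarray()  # sparse matrix -> dense array

print(ohe.categories_[0])  # column order of the encoded matrix
print(one_hot)

# Dropping the first column loses no information:
# a row of all zeros then stands for the dropped category.
reduced = one_hot[:, 1:]
print(reduced)
```

Dropping one column mirrors the `np.delete(encoded, 0, axis=1)` step above: the remaining columns still identify every state uniquely.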

### Data preparation - feature scaling

In [None]:
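# Feature scaling brings all features to a comparable range, so that features with large numeric values
# do not dominate the others. StandardScaler rescales each feature to zero mean and unit variance.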

In [None]:
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_SEED)

sc = StandardScaler()
sc.fit(X_train)  # Note that we only fit scaler on train data - same as with model fitting.
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

Done, but did something actually happen? Do feature scalers do something? (Do they know something?) Let's find out!

In [None]:
plt.scatter(X_train[:, 0], X_train[:, 1])
plt.show()
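A scatter plot of scaled data looks the same as before apart from the axis values, so a numeric check can be more convincing. The sketch below uses synthetic data (`toy_X`, with two features of very different magnitudes, is invented for illustration) and compares per-feature mean and standard deviation before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, for illustration only.
rng = np.random.RandomState(42)
toy_X = np.column_stack([
    rng.uniform(0, 150000, size=40),  # e.g. a spend-like feature
    rng.uniform(0, 1, size=40),       # e.g. a 0/1-range feature
])

toy_scaled = StandardScaler().fit_transform(toy_X)

print('Before: mean', toy_X.mean(axis=0), 'std', toy_X.std(axis=0))
print('After:  mean', toy_scaled.mean(axis=0), 'std', toy_scaled.std(axis=0))
```

After scaling, each column of the data the scaler was fitted on has a mean of approximately 0 and a standard deviation of approximately 1. Test data transformed with the same scaler will usually be close to, but not exactly at, these values.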

Ok, now our data is finally ready to be passed to the model, weee :) <br/> By the way, you will probably agree that even for such simple data, quite some work had to be done first.

### Model fitting

In [None]:
from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

### Model evaluation

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Quick exercise - test all of the metrics imported above. Look up what they do in the documentation if the names
# aren't intuitive enough. Which one of them do you think is the best and why?

some_metric = mean_squared_error  # placeholder - swap in mean_absolute_error or r2_score


y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_test)
print('Train metric:', some_metric(y_train, y_train_pred))
print('Test metric:', some_metric(y_test, y_test_pred))
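To get a feel for how the three metrics behave, it may help to try them first on a handful of made-up numbers where the errors are easy to compute by hand. The values in `y_true_demo` and `y_pred_demo` below are invented for illustration.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; the errors are -1, 0 and +2.
y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)   # mean of squared errors
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)  # mean of absolute errors
r2_demo = r2_score(y_true_demo, y_pred_demo)              # 1 - SS_res / SS_tot

print('MSE:', mse_demo)
print('MAE:', mae_demo)
print('R^2:', r2_demo)
```

Note that MSE and MAE are expressed in (squared) units of the target and lower is better, while R^2 is unitless and higher is better.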

This time we can't plot the results in 2D, but we can check the weights assigned to the features and the intercept, which will give us an idea of what is important to our model.

In [None]:
print(lin_reg.coef_)
print(lin_reg.intercept_)
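Raw coefficient arrays are hard to read, so pairing each weight with a feature name helps. The sketch below does this on synthetic data with a known linear relationship (`toy_X`, `toy_y` and the names in `feature_names` are invented for illustration), so the recovered weights can be checked against the true ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 3 * a - 2 * b + 5.
rng = np.random.RandomState(42)
toy_X = rng.normal(size=(30, 2))
toy_y = 3.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1] + 5.0
feature_names = ['feature_a', 'feature_b']  # hypothetical names

model = LinearRegression().fit(toy_X, toy_y)

# Pair each weight with its feature and list the largest (by magnitude) first.
weights = dict(zip(feature_names, model.coef_))
for name, weight in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
    print(f'{name}: {weight:.3f}')
print(f'intercept: {model.intercept_:.3f}')
```

Comparing weight magnitudes like this is only meaningful when the features are on comparable scales, which is one more reason for the scaling step earlier.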

### Exercises

- What other ways of dealing with missing data, besides sample deletion, can you come up with? What are their potential pros and cons? (Tip: some methods were already mentioned while walking through the slides.)
- Analyse the parameters of the trained linear regression model. As some of them are less significant than the others, try to simplify the model to use fewer features. (If possible, try to visualise the fitted line/points after retraining.)

# 5 Project time

And now is the time when you are in the post-credits scene saying: "Fine, I'll do it myself".

TODO - task description.

A few tips to start with: identify the kind of task, then load and analyse the data. Don't be afraid to consult the previous parts of the notebook. Good luck!