Supervised Learning 
- Making predictions using the data
- There is an outcome we are trying to predict
- Example: Predicting whether a given email is spam or not.

Unsupervised Learning
- Extracting structure from the data
- There is no right or wrong answer
- Patterns are observed based on the data
- Example: Identify shopping behavior across age-groups

<b>How does machine learning work?</b>

Supervised Learning
- Train the machine learning model using labelled data
- Machine learning model learns the relationship between attributes of the data and its outcome
- Then, make predictions on the new data for which the label is unknown

The primary goal of supervised learning is to build a model that 'generalizes': it should accurately predict new, unseen data (the future) rather than only the data it was trained on (the past).



<b>The IRIS dataset</b>
- 50 samples of 3 different species of iris (150 samples total)
- Measurements - sepal length, sepal width, petal length, petal width

This is framed as a 'Supervised Learning' problem. The aim is to predict the species of an iris using its measurements as input.

In [2]:
from IPython.display import IFrame
IFrame('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', width=300, height=200)

Loading the iris dataset into scikit-learn

In [3]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [4]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)

sklearn.utils.Bunch

In [5]:
# print the iris data
print(iris.data)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

Machine Learning Terminology
- Each row is an observation, also known as a sample, example, instance, or record.
- Each column is a feature, also known as a predictor, attribute, independent variable, input, regressor, or covariate.
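
As a small illustration of this terminology, the sketch below indexes the feature matrix once by row and once by column. It assumes the same `load_iris()` data used throughout this notebook; the names `first_observation` and `first_feature` are only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# one row = one observation (the four measurements of a single flower)
first_observation = iris.data[0]
print(first_observation)

# one column = one feature (the sepal length of every flower)
first_feature = iris.data[:, 0]
print(first_feature.shape)
```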

In [7]:
# print the names of the four features
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [8]:
# print integers representing the species of each observation
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [9]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

['setosa' 'versicolor' 'virginica']
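
Because `iris.target_names` is a NumPy array, the integer codes can be translated back to species names in one step by using the codes as indices. The sketch below shows this; the variable name `species` is an assumption made for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# index the array of names with the array of integer codes
species = iris.target_names[iris.target]

print(species[:3])
print(species[-3:])
```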


<b>Requirements for working with data in scikit-learn</b>
- Features and responses are <b>separate objects</b>
- Features and responses should be <b>numeric</b>
- Features and responses should be <b>NumPy arrays</b>
- Features and responses should have <b>specific shapes</b>
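
The iris data already meets these requirements. For a dataset whose responses are stored as strings, one possible way to obtain a numeric NumPy array is scikit-learn's `LabelEncoder`. This is only a sketch: the list `raw_labels` is hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# hypothetical string labels, for illustration only
raw_labels = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(raw_labels)

print(type(y_encoded))    # the encoded response is a NumPy array
print(y_encoded)          # integer codes, assigned in sorted order of the class names
print(encoder.classes_)   # lookup from integer code back to class name
```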

In [10]:
# check the types of features and response
print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [11]:
# check the shape of features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

(150, 4)


In [12]:
# check the shape of response (single dimension matching the number of observations)
print(iris.target.shape)

(150,)


In [13]:
# store the feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

<b>Implementing KNN on the Iris dataset</b>
- Pick a value for K.
- Search for the K observations in the training data that are nearest to the measurements of the unknown iris.
- Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
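
The three steps above can be sketched directly in NumPy. This is a minimal illustration, not scikit-learn's implementation: the helper `knn_predict` is hypothetical, it assumes Euclidean distance, and it may break ties differently from `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target


def knn_predict(X_train, y_train, x_new, k):
    """Predict the class of one observation by majority vote of its k nearest neighbors."""
    # Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - np.asarray(x_new)) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # most common response value among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# measurements of the first flower in the dataset (a setosa)
print(knn_predict(X, y, [5.1, 3.5, 1.4, 0.2], k=5))
```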

In [14]:
# print the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


# scikit-learn 4-step modeling pattern

<b>Step1:</b> Import the class you plan to use

In [16]:
from sklearn.neighbors import KNeighborsClassifier

<b>Step2:</b> Instantiate the estimator
- 'Estimator' is scikit-learn's term for a model
- 'Instantiate' means 'make an instance of'

In [17]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka 'hyperparameters') during this step
- All parameters not specified are set to their defaults

In [18]:
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
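
Depending on the installed scikit-learn version, `print(knn)` may show only the parameters that differ from their defaults. As a sketch of a version-independent way to inspect every parameter, `get_params()` returns them as a dictionary:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

# get_params() returns a dict of every parameter, including the defaults
params = knn.get_params()

print(params["n_neighbors"])   # the value set explicitly above
print(params["weights"])       # a default that was not specified
```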


<b>Step3:</b> Fit the model with the data (aka 'model training')
- Model is learning the relationship between X and y
- Occurs in-place

In [19]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
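
To illustrate what 'in-place' means here, the sketch below checks that `fit` returns the very same estimator object it was called on, so the result does not need to be assigned to a new variable. The name `returned` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1)

# fit() updates knn itself and returns that same object
returned = knn.fit(X, y)
print(returned is knn)

# attributes learned during fitting end with an underscore
print(knn.classes_)
```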

<b>Step4:</b> Predict the response for a new observation
- New observations are called 'out-of-sample' data
- Uses the information it learned during the model training process

In [45]:
# predict the species for a single new observation
# (predict returns an array, so take its first element before converting to int)
target_idx = int(knn.predict([[3, 5, 4, 2]])[0])
target_val = iris.target_names[target_idx]
print(target_idx, ':', target_val)

2 : virginica


- Returns a NumPy array
- Can predict for multiple observations at once

In [44]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# predict the species index for each new observation
target_idx = knn.predict(X_new)

# pair each predicted index with its species name
target_val = [str(idx) + ': ' + str(iris.target_names[idx]) for idx in target_idx]
print(target_val)


['2: virginica', '1: versicolor']


<b>Using a different value for K</b>

In [47]:
# Step1: import the class to be used
from sklearn.neighbors import KNeighborsClassifier

# Step2: instantiate the KNN classifier, this time with K=5
knn = KNeighborsClassifier(n_neighbors=5)

# Step3: Fit the model with the data (X and y)
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [48]:
# Step4: Make predictions for the single observation
target_idx = int(knn.predict([[3, 5, 4, 2]])[0])
target_val = iris.target_names[target_idx]
print(target_idx, ':', target_val)

1 : versicolor


In [49]:
# Step4: Make predictions for multiple values
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
target_idx = knn.predict(X_new)
# pair each predicted index with its species name
target_val = [str(idx) + ': ' + str(iris.target_names[idx]) for idx in target_idx]
print(target_val)

['1: versicolor', '1: versicolor']
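
One way to see why the prediction can change with K is to look at the neighbors that take part in the vote. The sketch below uses the estimator's `kneighbors` method for the same new observation; the names `neighbor_labels` and `vote_counts` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# distances to, and row indices of, the 5 nearest training observations
distances, indices = knn.kneighbors([[3, 5, 4, 2]])

# species of each neighbor, and the number of votes each species receives
neighbor_labels = y[indices[0]]
vote_counts = np.bincount(neighbor_labels, minlength=3)

print(iris.target_names[neighbor_labels])
print(vote_counts)
print(knn.predict([[3, 5, 4, 2]]))
```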


<b>Using a different classification model</b>
- Logistic Regression

In [50]:
# Step1: import the class
from sklearn.linear_model import LogisticRegression

# Step2: instantiate the model (using the default parameters)
logreg = LogisticRegression()

# Step3: fit the model with data
logreg.fit(X, y)

# Step4: predict the response for new observations
logreg.predict(X_new)

array([2, 0])

# Comparing the machine learning models within scikit-learn

<b>Evaluation procedure #1: Train and test on the entire dataset.</b>
- Train the model on the entire dataset
- Test the model on the same dataset and evaluate how well we did by comparing the predicted response values with the true response values.

In [51]:
# read the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# create X(features) and y(response)
X = iris.data
y = iris.target

<b>1. Logistic Regression</b>

In [52]:
# Step1: import the class
from sklearn.linear_model import LogisticRegression

# Step2: instantiate the model using the default parameters
logreg = LogisticRegression()

# Step3: fit the model with the data
logreg.fit(X, y)

# Step4: predict the response values for the observations
logreg.predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [53]:
# Step5: store the predicted response values
y_pred = logreg.predict(X)

# check how many predictions were generated
len(y_pred)

150

<b>Classification Accuracy</b>
- Proportion of correct predictions
- Common evaluation metric for classification problems

In [55]:
# compute the classification accuracy for the logistic regression model
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))

0.96
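
To make 'proportion of correct predictions' concrete, the sketch below computes the accuracy by hand and compares it with `metrics.accuracy_score`. The arrays `y_true` and `y_hat` are hypothetical values chosen for illustration, not results from this notebook.

```python
import numpy as np
from sklearn import metrics

# hypothetical true and predicted labels, for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

# proportion of positions where the prediction matches the true label
manual_accuracy = np.mean(y_true == y_hat)

print(manual_accuracy)
print(metrics.accuracy_score(y_true, y_hat))
```

Five of the six hypothetical predictions match, so both lines report the same proportion.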


This is also known as 'training accuracy', because we train and test the model on the same data.

<b>2. KNN (K=5)</b>

In [56]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
y_pred=knn.predict(X)
print(metrics.accuracy_score(y, y_pred))

0.9666666666666667


<b>3. KNN (K=1)</b>

In [57]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_pred=knn.predict(X)
print(metrics.accuracy_score(y, y_pred))

1.0
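
A perfect score is expected here and says little about the model. With K=1, the nearest neighbor of every training observation is the observation itself (or an exact duplicate), so the model simply looks up the label it was given. The sketch below checks this with `kneighbors`; the tolerance `atol=1e-6` is an assumption to allow for floating-point rounding.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# distance from each training observation to its single nearest neighbor
distances, indices = knn.kneighbors(X)

# every distance is (numerically) zero: each observation finds itself
print(np.allclose(distances, 0, atol=1e-6))
```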


Problems with training and testing the model on the same dataset
- Our goal is to estimate the likely performance of the model on 'out-of-sample' data
- Maximizing training accuracy rewards overly complex models that won't necessarily generalize
- Unnecessarily complex models overfit the training data

<b>Evaluation procedure #2: Train/test split.</b>
- Split the dataset into two pieces - a training set and a testing set
- Train the model on the training set
- Test the model on the testing set and evaluate

In [58]:
# print the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [61]:
# Step1: split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

In [67]:
# print the shapes of the new objects
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(90, 4)
(60, 4)
(90,)
(60,)
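
The `random_state` argument makes the random split reproducible. As a sketch, splitting twice with the same `random_state` should select exactly the same rows; the `_a` and `_b` suffixes are only illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# two splits with the same random_state select the same rows
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.4, random_state=4)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.4, random_state=4)
print(np.array_equal(X_test_a, X_test_b))

# test_size=0.4 assigns 40% of the 150 observations to the testing set
print(X_train_a.shape, X_test_a.shape)
```

Without a fixed `random_state`, each run would produce a different split, and the accuracy values would vary from run to run.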


<b>1. Logistic Regression</b>

In [64]:
# Step2: train the model on the training set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [66]:
# Step3: make predictions on the testing set
y_pred = logreg.predict(X_test)

# Step4: compare actual response values (y_test) with the predicted response values (y_pred)
print(metrics.accuracy_score(y_test, y_pred))

0.95


<b>2. KNN (K=5)</b>

In [69]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.9666666666666667


<b>3. KNN (K=1)</b>

In [70]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.95
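
Since testing accuracy changes with K, a natural follow-up is to compute it for a range of K values on the same split. The sketch below does this; the range 1 to 25 and the names `k_range`, `scores`, and `best_k` are assumptions made for illustration.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

# testing accuracy for K = 1 through 25 (an arbitrary range)
k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

# report the K with the highest testing accuracy (the first one, if several tie)
best_k = k_range[scores.index(max(scores))]
print(best_k, max(scores))
```

The result depends on this particular split, so it is an estimate rather than a definitive best value of K.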
