# Scikit-learn

<img src='ml3.png' width=700/>

## Machine Learning Process

### The machine learning process is much more involved than the high-level workflow depicted in the previous cell.

- The stages can be broadly defined as follows:

- 1. Data exploration
    - Understand the feature data (data types, missing values, outliers, etc.)
    - Visualization (boxplots, scatter plots, correlation matrix, etc.)
    - Correlations between features and the target
    - Feature engineering (aggregate features)
- 2. Data preparation
    - Dealing with outliers
    - Dealing with missing values
    - Feature encoding (encoding categorical features)
    - Feature selection
    - Feature scaling
    - Handling imbalance
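
Three of the data preparation steps listed above (missing values, feature encoding and feature scaling) can be sketched with scikit-learn's preprocessing tools. This is only a minimal illustration on a made-up toy array; the values, the variable names and the choice of mean imputation are assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric features (one value missing)
# and one categorical feature.
numeric = np.array([[1.0, 200.0],
                    [2.0, np.nan],
                    [3.0, 400.0]])
colours = np.array([["red"], ["green"], ["red"]])

# Dealing with missing values: replace NaN with the column mean.
imputed = SimpleImputer(strategy="mean").fit_transform(numeric)

# Feature scaling: zero mean and unit variance per column.
scaled = StandardScaler().fit_transform(imputed)

# Feature encoding: one binary column per category.
encoded = OneHotEncoder().fit_transform(colours).toarray()

# Combine into a single 2D feature array.
features = np.hstack([scaled, encoded])
print(features.shape)
```

In practice the imputer, scaler and encoder should be fitted on the training data only and then applied to the test data.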

<img src='ml4.png' width=700/>

__The machine learning process__

### The remaining stages can be broadly defined as follows:

- Building and Evaluating Models
    - Train many models from different categories (e.g., linear, kNN, etc.) using standard parameters.
    - Measure and compare their performance.
    - Debug ML models and analyse the types of errors the models make.

- Fine Tuning and Optimization
    - Perform hyper-parameter optimization (e.g. tuning value of k in kNN) 
    - Finally assess the generalization capability of your model on the test set. 
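
The fine-tuning stage above can be sketched as a small grid search over k for kNN, followed by a single final check on the held-out test set. This is a minimal sketch, assuming the iris data, 5-fold cross-validation and a hand-picked list of candidate k values; none of these choices come from the original notes.

```python
from sklearn import datasets, model_selection, neighbors

iris = datasets.load_iris()

# Hold out a test set that is only used for the final assessment.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target)

# Try several values of k using cross-validation on the training data only.
search = model_selection.GridSearchCV(
    neighbors.KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]

# Final step: generalization performance of the tuned model on the test set.
test_accuracy = search.score(X_test, y_test)
print(best_k, test_accuracy)
```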

# Outline of Today's topics

- Today, I will briefly introduce scikit-learn and show how to use it for basic classification and regression modelling.

- The structure of what we cover will be:
    - Introduction to scikit-learn
    - Data preparation (train and test data)
    - Building a classification model
    - Building a regression model
    - Evaluating models

## Introduction to SciKit Learn

- Scikit-learn provides a range of supervised and unsupervised learning algorithms in Python.

- The library is built upon the following:
    - __NumPy__: Base n-dimensional array package
    - __SciPy__: Fundamental library for scientific computing
    - __Matplotlib__: Comprehensive 2D/3D plotting

- The library is focused on modelling data. 


### Scikit-learn is well organized, and there is a wealth of tutorial and API pages available in its online documentation.

- The functionality offered by scikit-learn can be broken into the following:

    - __Classification__: a large collection of learning algorithms such as naive Bayes, support vector machines, decision trees, ensembles, etc.
    - __Clustering__: algorithms for grouping unlabelled data, such as k-means, DBSCAN, etc.
    - __Regression__: algorithms for predicting real-valued attributes, such as multiple linear regression, ridge regression, etc.
    - __Pre-processing__: outlier detection, normalization, encoding categorical features.
    - __Dimensionality Reduction__: reduces the number of features that you need to consider in your dataset.
    - __Model Selection__: comparing, validating and choosing parameters and models; metrics.

<img src='ml5.png' width=900/>
<img src='ml6.png' width=900/>

## A few important notes about Scikit Learn

- The following are some important requirements that you should keep in mind when working with scikit-learn.

    - Features and classes/target values are separate objects (data structures).
    - Features and classes should be numerical.
    - Features and classes should be NumPy arrays.
    - Features and classes should have a specific shape.
    - Features should be 2D (rows correspond to data instances and columns correspond to features).
    - The class array should be one-dimensional, with one entry for each data instance (row) in the features array.
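
The shape requirements above can be checked directly with NumPy. The sketch below uses a few made-up measurements (the values and variable names are illustrative assumptions) and also shows how a single example can be reshaped into the 2D form that scikit-learn estimators expect.

```python
import numpy as np

# Features: 2D array, one row per data instance, one column per feature.
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.3, 3.3, 6.0, 2.5]])

# Classes: 1D array with one numerical label per row of X.
y = np.array([0, 0, 2])

print(X.shape)  # (3, 4)
print(y.shape)  # (3,)

# A single example is 1D, so reshape it to one row before passing it to predict.
single = np.array([5.0, 3.4, 1.5, 0.2])
single_2d = single.reshape(1, -1)
print(single_2d.shape)  # (1, 4)
```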

- Scikit-learn comes with a number of standard example datasets. These are broken into [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) and [real-world datasets](https://scikit-learn.org/stable/datasets/real_world.html). 
    - Toy datasets include the __iris__ and __digits__ datasets for classification and the __diabetes__ dataset for regression. The __Boston house prices__ dataset used later in these notes was removed in scikit-learn 1.2.
    - Real-world datasets include Olivetti faces dataset, newsgroups, California housing dataset, etc. 

- These datasets are dictionary-like objects holding at least two items: 

    - A NumPy array of shape (n_samples, n_features) with the key __data__
    - A NumPy array of length n_samples, containing the class values, with the key __target__

# Iris dataset

In [2]:
from sklearn import datasets


iris = datasets.load_iris()  # load the iris dataset into a dataset object

print(iris.data.shape)  # dimensions of the data, in this case (150, 4)

print(iris.target_names)  # names of the three classes

print(iris.feature_names)  # names of the four features

# print(iris.DESCR)  # full description of the dataset

print(iris.data)  # the feature data stored in the dataset object (2D NumPy array)

# print(iris.data[0:1, 2:4])  # first row of the dataset, but only the columns with index 2 and 3

print(iris.target)  # the class associated with each data item




(150, 4)
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 ...

<pre>



{'data': array([[5.1, 3.5, 1.4, 0.2],
                [4.9, 3. , 1.4, 0.2],
                [4.7, 3.2, 1.3, 0.2],
                [4.6, 3.1, 1.5, 0.2],...
'target': array([0, 0, 0, ... 1, 1, 1, ... 2, 2, 2, ...
'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 
...}
    
</pre>

<img src='ml7.png' width=900/>

# Wine dataset

<img src='ml8.png' width=900/>

In [18]:
import numpy as np

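# wine.data is assumed to be a comma-separated file in the working directory
df = np.genfromtxt("wine.data", delimiter=",")

print(df.shape)

# the first column holds the class label (1, 2 or 3)
target = df[:, 0]

# note: 1:13 selects columns 1 to 12, i.e. only 12 of the 13 feature columns;
# use df[:, 1:] to keep all 13 features
data = df[:, 1:13]
print(target)
data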

(178, 14)
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]


array([[14.23,  1.71,  2.43, ...,  5.64,  1.04,  3.92],
       [13.2 ,  1.78,  2.14, ...,  4.38,  1.05,  3.4 ],
       [13.16,  2.36,  2.67, ...,  5.68,  1.03,  3.17],
       ...,
       [13.27,  4.28,  2.26, ..., 10.2 ,  0.59,  1.56],
       [13.17,  2.59,  2.37, ...,  9.3 ,  0.6 ,  1.62],
       [14.13,  4.1 ,  2.74, ...,  9.2 ,  0.61,  1.6 ]])
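
Later cells in these notes load the wine data with datasets.load_wine() rather than from a file. A minimal sketch of that alternative is shown below, on the assumption that the bundled copy is an acceptable substitute for the wine.data file. Note that the bundled dataset keeps all 13 feature columns and labels the classes 0, 1 and 2 rather than 1, 2 and 3.

```python
import numpy as np
from sklearn import datasets

# Load the copy of the wine data that ships with scikit-learn.
wine = datasets.load_wine()

# Features are already separated from the class labels.
print(wine.data.shape)
print(wine.target.shape)

# The distinct class labels used by the bundled dataset.
class_labels = np.unique(wine.target)
print(class_labels)
```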

# Classification

<img src='ml.png' width=700/>

## Training and Test data

<img src='ml2.png' width=700/>

# Training: Using a Logistic Regression Classifier in Scikit-learn


- Logistic regression is a supervised learning method (estimator) for classification.

- The class we will be using is sklearn.linear_model.LogisticRegression.

- Its fit method takes two arrays as input:
    - An array X of size [n_samples, n_features] holding the training samples,
    - An array Y of integer values, size [n_samples], holding the class labels for the training samples.

- The __LogisticRegression__ constructor, as you can see from the API, accepts many more optional arguments.

In [1]:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

clf = LogisticRegression(solver='liblinear')

# train the classifier: pass it the training data and classes
clf.fit(iris.data, iris.target)

# predict the class for an unseen example
prediction = clf.predict([[4.9, 3.0, 1.4, 0.2]])

# class 0 corresponds to setosa
print(prediction)

# predict the class for a second unseen example
prediction = clf.predict([[4.5, 0.9, 13.0, 2.1]])

# class 2 corresponds to virginica
print(prediction)


[0]
[2]
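
The predictions above are class indices. A minimal sketch of turning them back into class names with iris.target_names follows. It uses the default solver with a raised max_iter instead of solver='liblinear'; that is an assumption made here to keep the example quiet on recent scikit-learn versions, not the setting used in the cell above.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

# Default solver; max_iter is raised (an assumption) to avoid convergence warnings.
clf = LogisticRegression(max_iter=1000)
clf.fit(iris.data, iris.target)

# The same two unseen examples as in the cell above.
samples = [[4.9, 3.0, 1.4, 0.2], [4.5, 0.9, 13.0, 2.1]]
predicted = clf.predict(samples)

# Index the array of class names with the predicted class indices.
predicted_names = iris.target_names[predicted]
print(predicted)
print(predicted_names)
```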


# Using kNN on Iris Dataset

- In the next example we will apply k-nearest neighbour to the iris dataset. 

In [10]:
from sklearn import datasets
from sklearn import neighbors

iris = datasets.load_iris()


knn = neighbors.KNeighborsClassifier(n_neighbors=8)
# We can pass many different parameters to machine learning algorithms in scikit-learn.
# However, most have default values, which allows us to get started very quickly.

knn.fit(iris.data, iris.target)

print (knn.predict([[3, 5, 4, 2]]))
print (knn.predict([[0.3, 5.0, 4.0, 0.2]]))

print (knn.predict([[0.3, 4.1, 4.0, 0.2]]))


[1]
[0]
[0]


### We can pass many different parameters to machine learning algorithms in scikit-learn. However, most have default values, which allows us to get started very quickly.

In [19]:
from sklearn import datasets
from sklearn import neighbors

iris = datasets.load_iris()


knn = neighbors.KNeighborsClassifier(n_neighbors = 5, algorithm = "kd_tree")

knn.fit(iris.data, iris.target)

print (knn.predict([[3, 5, 4, 2]]))

[1]
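
The default values mentioned above can be inspected with the get_params method that every scikit-learn estimator provides. A short sketch, printing just three of the kNN parameters (the selection of parameters is arbitrary):

```python
from sklearn import neighbors

# Create the classifier without passing any arguments.
knn = neighbors.KNeighborsClassifier()

# get_params returns a dictionary of parameter names and their current values.
params = knn.get_params()

for name in ("n_neighbors", "weights", "algorithm"):
    print(name, "=", params[name])
```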


# Below we look at two separate scenarios for simple evaluation (more advanced and realistic techniques are covered separately):

- There is a separate __training__ (left) and __test__ (right) dataset that can be used. 
- A __single__ dataset split into a training and test data (holdout method). 

- There is a separate training and test dataset that can be used

<img src='ml9.png' width=900/>


# Assessing Accuracy

- Assuming I have separate training and test data I might have the following arrays
    - features_train, labels_train
    - features_test, labels_test


In [37]:
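from sklearn import metrics
from sklearn import tree

# features_train, labels_train, features_test and labels_test are placeholders:
# they must be defined (loaded from your own training and test data) before this cell runs

# create a new decision tree classifier object
clf = tree.DecisionTreeClassifier()

# train the classifier: pass it the training data and classes
clf = clf.fit(features_train, labels_train)

# predict the classes for the unseen test examples
results = clf.predict(features_test)

# accuracy_score expects the true labels first, then the predictions
print(metrics.accuracy_score(labels_test, results))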

NameError: name 'features_train' is not defined

- Notice that in this example, when we call predict, we pass it a __2D NumPy__ array.
- The accuracy_score function (available in the metrics module) counts the number of classes we correctly predicted and expresses that as a fraction of the total number (a value between 0 and 1).
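
To make the accuracy calculation concrete, the sketch below computes it by hand and compares it with accuracy_score. The two small label arrays are made up for illustration; they stand in for labels_test and the output of predict.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions for five test examples.
labels_test = np.array([0, 1, 2, 2, 1])
results = np.array([0, 1, 1, 2, 1])

# Fraction of positions where the prediction matches the true label.
manual_accuracy = np.mean(results == labels_test)

# The same quantity computed by scikit-learn (true labels first).
library_accuracy = metrics.accuracy_score(labels_test, results)

print(manual_accuracy, library_accuracy)
```

Four of the five predictions match, so both values are 0.8.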

# test/train split

<img src='ml10.png' width=900/>

# Assessing Accuracy (Splitting Training Data)

- In scikit-learn, a random split into training and test sets can be quickly computed with the train_test_split helper function.
- As arguments we pass it the original data and target, as well as the fraction of the original data we want to hold out for testing (test_size) or keep for training (train_size).

In [20]:
from sklearn import tree
from sklearn import datasets
from sklearn import model_selection


wine = datasets.load_wine()

train_features, test_features, train_labels, test_labels = model_selection.train_test_split( wine.data, wine.target, test_size=0.2, random_state=0)


print (wine.data.shape, wine.target.shape)
print (train_features.shape, train_labels.shape)

print (test_features.shape, test_labels.shape)

(178, 13) (178,)
(142, 13) (142,)
(36, 13) (36,)
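
The sizes in the output above follow from test_size. The sketch below works them out by hand, assuming that a fractional test_size is rounded up to a whole number of test samples (which is consistent with the 36 test rows shown above), and compares the result with what train_test_split returns.

```python
import math
from sklearn import datasets, model_selection

wine = datasets.load_wine()
n_samples = wine.data.shape[0]
test_size = 0.2

# 0.2 * 178 = 35.6, rounded up to 36 test samples; the rest are used for training.
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

train_features, test_features, train_labels, test_labels = model_selection.train_test_split(
    wine.data, wine.target, test_size=test_size, random_state=0)

print(expected_train, expected_test)
print(train_features.shape[0], test_features.shape[0])
```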


In [25]:
from sklearn import tree
from sklearn import datasets
from sklearn import model_selection
from sklearn import neighbors
from sklearn import metrics
from sklearn.linear_model import LogisticRegression


wine = datasets.load_wine()
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0)

# decision tree: train on the training data and classes
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train_features, train_labels)

# predict the classes for the unseen test examples and measure accuracy
results = clf.predict(test_features)
print(metrics.accuracy_score(test_labels, results))


# logistic regression
lrclf = LogisticRegression(solver='liblinear')
lrclf = lrclf.fit(train_features, train_labels)

# predict with lrclf; the run shown below called clf.predict at this step,
# so its second accuracy repeats the decision tree result
results = lrclf.predict(test_features)
print(metrics.accuracy_score(test_labels, results))


# kNN with k = 2
knn = neighbors.KNeighborsClassifier(n_neighbors=2)
knn = knn.fit(train_features, train_labels)

results = knn.predict(test_features)
print(metrics.accuracy_score(test_labels, results))


# kNN with k = 5
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(train_features, train_labels)

results = knn.predict(test_features)
print(metrics.accuracy_score(test_labels, results))




0.9722222222222222
0.9722222222222222
0.75
0.8055555555555556


### We take in the data. We split it into a training set and a test set. We then train each model on the training set and assess its accuracy on the test set.


# Feed-Forward Neural Network

In [20]:
import numpy as np
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics
from sklearn.neural_network import MLPClassifier


wine = datasets.load_wine()
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0)


clf = MLPClassifier(random_state=1, max_iter=3000).fit(train_features, train_labels)

# class membership probabilities: one row per test example, one column per class
results = clf.predict_proba(test_features)

print(results)

# index of the most probable class in each row; the wine class labels are 0, 1 and 2,
# so the column index is also the class label
pred = np.argmax(results, axis=1)

print(metrics.accuracy_score(test_labels, pred))


[[9.99946921e-01 3.96829653e-05 1.33961632e-05]
 [2.18055785e-09 1.24830618e-09 9.99999997e-01]
 [1.88590347e-05 9.99981130e-01 1.14318547e-08]
 [9.99537224e-01 4.49282186e-04 1.34934725e-05]
 [2.51007147e-04 9.99748746e-01 2.46384206e-07]
 [1.77526777e-03 9.98220913e-01 3.81924159e-06]
 [9.99999592e-01 1.63946274e-07 2.43776141e-07]
 [1.55913142e-06 5.05752038e-07 9.99997935e-01]
 [1.00566401e-03 9.98993600e-01 7.35948458e-07]
 [5.24727421e-05 9.99946548e-01 9.78953094e-07]
 [1.39912898e-03 4.21297924e-04 9.98179573e-01]
 [1.25219841e-06 1.38023156e-04 9.99860725e-01]
 [9.99999976e-01 9.60247566e-09 1.44318761e-08]
 [5.17674155e-02 9.48232569e-01 1.56729812e-08]
 [1.64312701e-07 1.01472481e-07 9.99999734e-01]
 [4.32540461e-07 9.99999560e-01 7.15263502e-09]
 [9.99977663e-01 8.91570063e-06 1.34208320e-05]
 [9.99999961e-01 1.95790916e-10 3.91953765e-08]
 [3.98939166e-05 4.75650032e-02 9.52395103e-01]
 [9.99992587e-01 7.41005719e-06 2.64742161e-09]
 [1.12953806e-03 9.98870447e-01 1.490972
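
Each row of the predict_proba output above holds one probability per class, and the predicted class is the one with the highest probability. The sketch below checks both points for the same model. Mapping the column index through clf.classes_ is a small generalization added here so that it would also work if the labels were not 0, 1 and 2.

```python
import numpy as np
from sklearn import datasets, model_selection
from sklearn.neural_network import MLPClassifier

wine = datasets.load_wine()
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0)

clf = MLPClassifier(random_state=1, max_iter=3000).fit(train_features, train_labels)

# One row per test example, one column per class.
proba = clf.predict_proba(test_features)

# Column index of the largest probability, mapped to the actual class label.
from_proba = clf.classes_[np.argmax(proba, axis=1)]

# Class labels as returned directly by predict.
direct = clf.predict(test_features)

print(np.allclose(proba.sum(axis=1), 1.0))  # do the probabilities in each row sum to 1?
print(np.array_equal(from_proba, direct))   # do the two routes give the same labels?
```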

# Regression

<img src='reg1.png' width=700/>

In [47]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
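from sklearn import linear_model
from sklearn import metrics
from sklearn.metrics import mean_squared_error


# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions
boston = load_boston()
print(boston.keys())


#print(boston.DESCR) # prints the description of the data
#print(boston.data.shape)
#print(boston.target.shape)
#print(boston.feature_names)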


X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size = 0.2, random_state = 5)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

regression = linear_model.LinearRegression()
regression.fit(X_train, Y_train)


print("R-Squared:", regression.score(X_test,Y_test))

Y_pred = regression.predict(X_test)

print('MAE:', metrics.mean_absolute_error(Y_test, Y_pred))
print('MSE:', metrics.mean_squared_error(Y_test, Y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))


dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])
(404, 13)
(102, 13)
(404,)
(102,)
R-Squared: 0.7334492147453135
MAE: 3.2132704958423437
MSE: 20.869292183770348
RMSE: 4.568292042303157
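
The four measures printed above can also be computed by hand, which shows what each one means. The sketch below does this on the bundled diabetes dataset rather than Boston; swapping the dataset is a choice made here so that the code runs on current scikit-learn versions, so the numbers will not match the output above.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_test, Y_train, Y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=5)

regression = linear_model.LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)

# Differences between the true and predicted target values.
errors = Y_test - Y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error
# R-squared: 1 minus (residual sum of squares / total sum of squares).
r_squared = 1 - np.sum(errors ** 2) / np.sum((Y_test - Y_test.mean()) ** 2)

print('MAE:', mae, metrics.mean_absolute_error(Y_test, Y_pred))
print('MSE:', mse, metrics.mean_squared_error(Y_test, Y_pred))
print('RMSE:', rmse)
print('R-Squared:', r_squared, regression.score(X_test, Y_test))
```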



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this case special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows:

        from sklearn.datasets import fetch_californi