# Data Science at UCSB

# Python for Data Science: ML Crash Course

## Jason Freeberg, Fall 2016 

Okay! Today will be a crash course in machine learning. I'll explain things at a high level and use scikit learn to show real examples. Feature engineering is the creation or collection of predictors for a machine learning pipeline, so we're covering ML first.

To make an oversimplification, let's assume we have some set of *p* predictors, **X<sub>1</sub>, X<sub>2</sub>, X<sub>3</sub> ... X<sub>p</sub> ** for *n* observations. Then we have a corresponding set of dependent *n* variables, **Y**. **X** could be a set of 100 people (n=100), each with 10 variables (p=10) like height, weight, sex, location, education level, and so on. And, in the same example, **Y** is that person's salary. What I just described is a *regression* problem. Where we have **X** and **Y**, and **Y** is a continuous variable. Now, from **X** and **Y** we can *learn* **F**, the mapping from **X** to **Y**... **F**(**X**) = **Y**. Or in matrix notation...

$$ F \left(
\begin{matrix}
X_{1,1} & ... & X_{1,p} \\
\vdots & \ddots & \vdots \\
X_{n,1} & ... & X_{n,p} \\
\end{matrix}
\right) 
= 
\begin{bmatrix}
Y_1 \\
\vdots \\
Y_n
\end{bmatrix}
$$

There are two main branches of machine learning...

### Supervised Learning
Like the example above, supervised learning involves using a set of *n* inputs, **X**, and *n* crorresponding outputs, **Y**, to build a statistical model that can then give predicted outputs from new, unseen inputs. As you might expect, this type of learning has broad applications to business, healthcare, and physics.


#### Sci Kit Lean Example
Similar to the problem above, imagine we have both **X** and **Y** and try to learn the mapping between them. But what if our dependant variable, **Y**, doesn't span the real numbers? There are many casses where we're trying to *classify* our outcomes... good or bad, alive or dead, pay or default... and these are **classification** problems. 

Moreover, we can have multiple classes in **Y**. Think of tax brackets, image recognition, or types of crime. Food for thought: we can turn a regression problem into a classification problem simply by *binning* our outcomes. Salary in dollars would become income brackets. Then we could use classification algorithms instead.

In [1]:
# Iris is a classic dataset. It holds various measurements of flowers and their species.
# If we want to predict species from the measurements, what kind of problem are we 
# working with?

import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import cross_validation, metrics
import urllib2
import os


def read_csv_from_url(URL, columnNames):
    response = urllib2.urlopen(URL)
    lines = pd.read_csv(response,
                        header=None,
                        index_col=False)
    dataframe = pd.DataFrame(lines)
    dataframe.columns = columnNames
    return dataframe

# Seeds make our random methods reproducible.
seed = 123

# Load the Iris dataset
irisURL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
irisNames = ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'class']
iris = read_csv_from_url(irisURL, irisNames)
print 'N rows =', iris.shape[0], '\n', 'N cols =', iris.shape[1]
print 'Classes in the dependent var. =', set(iris['class'])
print iris.head()


N rows = 150 
N cols = 5
Classes in the dependent var. = set(['Iris-virginica', 'Iris-setosa', 'Iris-versicolor'])
   sepalLength  sepalWidth  petalLength  petalWidth        class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa


#### Training and Test Sets

When you're studying for an exam at school you have your notes, review packets, and maybe some recorded lectures. You study, or *train*, with these materials and then you take the exam to see how well you understand the material you studied. The test has questions *similar* to what you studied, but not equal. So if you studied hard, you should be able to **generalize** well to these new "inputs".

Likewise in machine learning, we need some data with the recorded inputs *and* outputs. Then when a model is trained on that data, we need to evaulate the model's perfomance by giving it new, unseen data. This is a common paradigm in machine and statistical learning, but it can be easy to mess up.

In [2]:
# Split the data into train and test sets
train, test = cross_validation.train_test_split(iris,
                                                test_size=0.3,
                                               random_state=seed)

# Coerce the independent and dependent variable of the training set to NumPy arrays.
predictors = np.array(train.ix[:, 0:-1])
variable = np.array(train.ix[:, -1])

print 'Predictors:', '\n', predictors[:5]
print 'Dependent variable:', '\n', variable[:5]


Predictors: 
[[ 5.8  2.8  5.1  2.4]
 [ 6.3  3.4  5.6  2.4]
 [ 5.5  2.3  4.   1.3]
 [ 5.1  3.8  1.5  0.3]
 [ 4.4  3.   1.3  0.2]]
Dependent variable: 
['Iris-virginica' 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa'
 'Iris-setosa']


#### k-NN Algorithm

Since Iris is a classic dataset, we're going to use a classic classification algorithm. You'll probably see this example a lot in books and online, it's *classifying the Iris dataset with **k nearest neighbors***. In kNN, we put all of our predictors in a *feature space*, stored with their classifications. Then when a new observation needs to be classified, we compare it to the nearest **k** observations and give it the classification of the majority nearest **k** observations.

In the diagram below, you can see that the predicted class for the star can change depending on the number of neighbors used.

![image](http://bdewilde.github.io/assets/images/2012-10-26-knn-concept.png)

*Food for thought*: kNN can also be used for regression problems! In a regression context, we predict based on the mean target of the k neighbors. For a more detailed explanation on kNN classification, [click here](https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/). Now for the code!


In [3]:
# KNN object and fit
knn = KNeighborsClassifier()
knn.fit(X=predictors,
        y=variable)

# Make predictions on test set
testPredictors = test.ix[:, :-1]
actual = test.ix[:, -1]
predictions = knn.predict(testPredictors)

# Merge the predicted and actual classifications
predictionsDF = pd.DataFrame(predictions, columns=['prediction'])
results = pd.concat([test.reset_index(drop=True), predictionsDF], axis=1)

# Model evaluation
matrix = metrics.confusion_matrix(y_true=results['class'],
                                  y_pred=results['prediction'])

print '------------------ Confusion Matrix ------------------'
print matrix

incorrect = results.ix[results['class'] != results['prediction'], ['class', 'prediction']]
if incorrect.shape[0] == 0:
    print 'No incorrect classifications!'
else:
    print '-------------- Incorrect Classifications --------------'
    print incorrect

------------------ Confusion Matrix ------------------
[[18  0  0]
 [ 0  9  1]
 [ 0  0 17]]
-------------- Incorrect Classifications --------------
             class      prediction
0  Iris-versicolor  Iris-virginica


### Unsupervised Learning

Supervised learning is fairly straightforward, as we saw in the example. However, unsupervised learning requires some abstraction. Essentially, in an unsupervised exercise we are trying to either uncover hidden structure, find similarities, or reduce dimensionality in the data.

A common unsupervised example is clustering. If I were to hand you a bucket of rocks and ask you to put them into groups you may look at features like weight, volume, color and texture. Then you can group them by those characteristics. 

In a machine learning context, we can use unsupervised clustering as a pre-processing step. After our observations are clustered, we can easily add their cluster numbers as a column of predictors in the training data. Now we can use that new variable as a predictor in a regression or classification problem!

Eventually you may come across the problem of having *too many predictors*. If you have a set of 10 million observations and 200 predictors, then building a model on all predictors will be very expensive and time consuming. Thankfully, using techniques like [Principle Component Analysis](http://colah.github.io/posts/2014-10-Visualizing-MNIST/), we can reduce the number of predictors to only those with a *high variability* and save time in the long run.



#### Dimensionality Reduction using PCA

We're going to exemplify unsupervised learning with another classic dataset: the MNIST collection of handwritten digits. Each digit is a 28x28 pixel image, for 784 total pixels... 

\begin{matrix}
000 & 001 & 002 & 003 & ... & 026 & 027 \\
028 & 029 & 030 & 031 & ... & 054 & 055 \\
056 & 057 & 058 & 059 & ... & 082 & 083 \\
 \vdots &  \vdots &  \vdots &  \vdots & ... & \vdots & \vdots \\
728 & 729 & 730 & 731 & ... & 754 & 755 \\
756 & 757 & 758 & 759 & ... & 782 & 783 \\
\end{matrix}

The .csv contains the level of darkness for each pixel, organized as a table with 785 columns and 42,000 rows. This translates to the 784 pixel values *plus* the actual digit labels, and 42,000 digits...

\begin{matrix}
'1' & 001 & 002 & 003 & 004 & ... & 783 & 784 \\
'7' & 001 & 002 & 003 & 004 & ... & 783 & 784 \\
'3' & 001 & 002 & 003 & 004 & ... & 783 & 784 \\
 \vdots & \vdots &  \vdots &  \vdots &  \vdots & ... & \vdots & \vdots \\
'8' & 001 & 002 & 003 & 004 & ... & 783 & 784 \\
'6' & 001 & 002 & 003 & 004 & ... & 783 & 784 \\
\end{matrix}
 

In [4]:
# Read data from the repo you downloaded
location = os.path.realpath(os.path.join(os.getcwd(), "digits.csv"))
digits = pd.read_csv(location)

# Let's peak at the data
print "N rows =", digits.shape[0], '\n', 'N cols =', digits.shape[1]
print '------------------ Head of Data ------------------'
print digits.head()

N rows = 42000 
N cols = 785
------------------ Head of Data ------------------
   label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  \
0      1       0       0       0       0       0       0       0       0   
1      0       0       0       0       0       0       0       0       0   
2      1       0       0       0       0       0       0       0       0   
3      4       0       0       0       0       0       0       0       0   
4      0       0       0       0       0       0       0       0       0   

   pixel8    ...     pixel774  pixel775  pixel776  pixel777  pixel778  \
0       0    ...            0         0         0         0         0   
1       0    ...            0         0         0         0         0   
2       0    ...            0         0         0         0         0   
3       0    ...            0         0         0         0         0   
4       0    ...            0         0         0         0         0   

   pixel779  pixel780  p

In [35]:
from sklearn.decomposition import PCA

trainDigits, testDigits = cross_validation.train_test_split(digits,
                                                            test_size=0.3,
                                                            random_state=seed)
# Coerce training data to np arrays
trainLabel = trainDigits['label']
trainPixels = trainDigits.ix[:, 1:]

# Coerce testing data to np arrays
testLabel = testDigits['label']
testPixels = testDigits.ix[:, 1:]

# Fit a PCA model
pca = PCA(n_components = 10)
pca.fit(X=trainPixels)

PCA(copy=True, n_components=10, whiten=False)

In [53]:
# Let's train a KMeans model with the 10 principle components.
# First we need to transform the training set using the PCA parameters...
trainPCAPixels = pca.transform(trainPixels)

PCAdigitsKNN = KNeighborsClassifier()
PCAdigitsKNN.fit(X=trainPCAPixels,
              y=trainLabel)

# And now make predictions on the test set.
# First we need to transform the test set using the PCA parameters...
testPCAPixels = pca.transform(testPixels)

PCAdigitPredictions = PCAdigitsKNN.predict(testPCAPixels)
PCAdigitPredictionsDF = pd.DataFrame(PCAdigitPredictions, columns=['prediction'])
PCAdigitComparison = pd.concat([testLabel.reset_index(drop=True), PCAdigitPredictionsDF], axis=1)
PCAdigitsIncorrect = results.ix[PCAdigitComparison['label'] != \
                             PCAdigitComparison['prediction'], ['class', 'prediction']].shape[0]

PCAdigitsCorrect = PCAdigitComparison.shape[0] - PCAdigitsIncorrect
print 'Digits correctly classified =', PCAdigitsCorrect 
print 'Digits incorrectly classified =', PCAdigitsIncorrect
print 'Raw accuracy =', round(PCAdigitsCorrect / float(PCAdigitsCorrect + PCAdigitsIncorrect), 4)

Digits correctly classified = 12597
Digits incorrectly classified = 3
Raw accuracy = 0.9998


Now let's compare our accuracy of 99.98% to a baseline kNN using the full set of 784 features.

In [None]:


digitsKNN = KNeighborsClassifier()
digitsKNN.fit(X=trainPixels,
             y=trainLabel)

digitPredictions = digitsKNN.predict(testPixels)

digitPredictionsDF = pd.DataFrame(digitPredictions, columns=['prediction'])
digitComparison = pd.concat([testLabel.reset_index(drop=True), digitPredictionsDF], axis=1)
digitsIncorrect = results.ix[digitComparison['label'] != \
                             digitComparison['prediction'], ['class', 'prediction']].shape[0]

digitsCorrect = digitComparison.shape[0] - digitsIncorrect
print 'Digits correctly classified =', digitsCorrect 
print 'Digits incorrectly classified =', digitsIncorrect
print 'Raw accuracy =', round(digitsCorrect / float(digitsCorrect + digitsIncorrect), 4)

##### Thanks to...

- [The MNIST digits dataset from Kaggle]()
- [The UCI Iris dataset]()
- [This **great** blog post on PCA](http://colah.github.io/posts/2014-10-Visualizing-MNIST/)
- [This Git Blog](http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/) for the kNN illustration

In [None]:
print 'asd'