# Week 1: Meeting Date: 12/13/2017
## Goals:
1. Get introduced to machine learning terminology.
2. Learn the steps involved in building a machine learning model.
3. Understand the problems involved in applying machine learning.

Study Resources:
1. [Machine Learning - A Probabilistic Perspective](http://www.cs.ubc.ca/~murphyk/MLbook/pml-intro-5nov11.pdf)
2. [Hands On Machine Learning](https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/ch01.html) (Requires access to safari books online)


## Goal of Machine Learning
The goal of machine learning is to develop methods that
can automatically detect patterns in data, and then to use the uncovered patterns to predict
future data or other outcomes of interest.

## Applications
Fields ripe for applying machine learning (including but not limited to):
1. Computer Vision
2. Robotics
3. Text Processing (Natural Language Processing)

Almost any field you can think of has machine learning applied to it in one form or another, although the complexity of the algorithms involved varies.

## What is needed?
Machine Learning is less fun without understanding the math behind the algorithms. Although it is not the objective of this study group to delve into the math, it can certainly be a helpful resource for motivated people.

In that vein, here are the math topics that you might want to start learning.
1. Multivariate Calculus
2. Probability
3. Linear Algebra
4. Basic Computer Programming

## Let's look at some of the Machine Learning terms by building a machine learning model
1. Dataset
2. Features
3. Labels
4. Classification vs Regression

In [1]:
## Load Sample dataset
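# Dataset - D

## First time users run this.
## Note: fetch_mldata was removed in scikit-learn 0.22, so this path only works on older versions.
# from sklearn.datasets import fetch_mldata
# D = fetch_mldata("MNIST original", data_home="../data") #Takes a long time.
# X = D["data"]
# y = D["target"]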

## Returning users run this
import scipy.io
D = scipy.io.loadmat("../data/mldata/mnist-original.mat")
X = D["data"]
y = D["label"]
X = X.T
y = y.T.ravel()  # flatten the labels to a 1-D array, the form scikit-learn expects

1. Features - Properties of the input data (aka attributes or covariates)
2. Labels - Target (aka result or ground truth)
3. The common convention is to call the features X and the labels y

In [None]:
print("Feature Shape", X.shape)
print("Target Matrix shape", y.shape)

In [None]:
print("Viewing a single input of shape (784,).\n", X[0])

## Question
1. What can you infer from this sample datapoint? Try to infer as many patterns/characteristics/attributes as possible

In [None]:
print("Label for the same input", y[0])
print("Labels for some more input", y[:10000])

### 1. Numbers are hard for humans to interpret, but machines prefer numbers
### 2. A very common, important, and often the most time-consuming first step in building any machine learning model is analyzing the data to make reasonable inferences about it. Visual analysis is often preferred as a first step because humans can interpret pictures quickly.

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

def display_image(img):
    some_digit = img
    some_digit_image = some_digit.reshape(28, 28)
    plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
           interpolation="nearest")
    plt.axis("off")
    plt.show()

display_image(X[0])

## Training Set
Usually a portion of the dataset(D) that is reserved for training the ML model.
## Test Set
Usually a portion of the dataset(D) that is reserved for testing the ML model.

Note: 
1. Training Set and Test set should be mutually exclusive.
2. There are different ways to prepare the training set and test set. A common approach is to split the data into cross-validation folds. We will discuss this later.
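
To make the "mutually exclusive" requirement concrete, here is a minimal sketch in plain Python. The ten-sample dataset, the seed, and the 80/20 ratio are assumptions chosen for illustration; they are not the split used for MNIST in this notebook.

```python
import random

# Hypothetical toy dataset: 10 sample indices standing in for rows of X.
indices = list(range(10))

# Shuffle with a fixed seed so the split is reproducible.
rng = random.Random(13)
rng.shuffle(indices)

# Hold out the last 20% of the shuffled indices for testing.
split_at = int(len(indices) * 0.8)
train_idx, test_idx = indices[:split_at], indices[split_at:]

# The two sets must not share any sample.
overlap = set(train_idx) & set(test_idx)
print("train:", sorted(train_idx))
print("test:", sorted(test_idx))
print("overlap:", overlap)
```

An empty overlap means no test sample was seen during training, so the test score is an honest estimate of how the model behaves on new data.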

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

### Question: 
1. What is the size of this training data and target?
2. What is the size of the test data and target?

## Classification vs Regression
1. Output variables can in principle be anything, but if they fall within a finite set of classes, we call the output a categorical (or nominal) variable and the problem Classification.
2. If the output variable is real-valued, the problem is called Regression.

In [None]:
import numpy as np
np.unique(y)

## Question:
1. What do we have here? Classification Problem or Regression Problem?

In [None]:
#Earlier we had mentioned that there are multiple ways to prepare the train data and test data.
#What we have below is the most naive way. 
#A random shuffle.
shuffle_index = np.random.permutation(60000) #Shuffle numbers 0 to 59999
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index] #Shuffle the train data

## Questions
1. Why do we have to shuffle the data?
2. You see the training data shuffled here. Should we shuffle the test_data too?

### Let's learn the following by building a Machine Learning Model

1. Classifier
2. Binary Classifier
3. MultiClass Classifier
4. MultiLabel Classifier
5. MultiOutput Classifier

## Question
1. Just by looking at the names above, can you guess what's the difference in the output produced by these classifiers?

## Machine Learning model
The term "model" is one of the overused and least described terms in ML. A Model is a system that has been "trained" to detect patterns from a dataset and is ready to make predictions on any input.

Let's look at a very simple, hand-wavy version of a model.

You might have seen posts like the one below on so many useless social media sites:

If f(9, 3) = 21, f(6,9) = 21, f(3, 8) = 14, What is f(4,5)?
The human mind is quick to recognize the pattern in the three data points and their three target values and to predict the answer 13. It also correctly "learns" that the underlying function producing the pattern is "f(x,y) = 2x + y".

Think of an ML model as something that can learn this kind of pattern from a really huge dataset.
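
As a rough sketch of what "learning" the puzzle could look like in code, the snippet below assumes the function has the form f(x, y) = a*x + b*y and recovers a and b from the three examples with ordinary least squares. The linear form and the least-squares method are assumptions made for illustration; this is not what SGDClassifier does internally.

```python
# The three "training" examples from the puzzle: inputs (x, y) and targets.
inputs = [(9, 3), (6, 9), (3, 8)]
targets = [21, 21, 14]

# Assume f(x, y) = a*x + b*y and solve for (a, b) with least squares
# by building and solving the 2x2 normal equations.
sxx = sum(x * x for x, _ in inputs)
sxy = sum(x * y for x, y in inputs)
syy = sum(y * y for _, y in inputs)
sxt = sum(x * t for (x, _), t in zip(inputs, targets))
syt = sum(y * t for (_, y), t in zip(inputs, targets))

det = sxx * syy - sxy * sxy
a = (sxt * syy - syt * sxy) / det
b = (syt * sxx - sxt * sxy) / det

def f(x, y):
    """Learned function: uses the fitted coefficients a and b."""
    return a * x + b * y

print("a =", a, "b =", b)
print("f(4, 5) =", f(4, 5))
```

The fitted coefficients match the pattern your mind found (a close to 2, b close to 1), and the prediction for the unseen input (4, 5) comes out as 13.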

In [None]:
from sklearn.linear_model import SGDClassifier #Linear classifier trained with Stochastic Gradient Descent. Don't worry about this now.
from sklearn.preprocessing import StandardScaler

In [None]:
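scaler = StandardScaler()
sgd = SGDClassifier(random_state=13, verbose=1)
# Standardized copy of the training data. It is not used below: the model is fit on the
# raw pixel values, so later cells can call predict() directly on unscaled test images.
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
sgd.fit(X_train, y_train)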

## Let's look at all the parameters to a typical ML model
The step above spits out a bunch of words that are new and alien.

They are listed in order of importance. Items marked with * require a deeper understanding of what happens inside the model as it learns.
1. Loss - A measure of "how far off" you are from the ground truth. A model can learn nothing if it has no way to "measure" itself. Go back to the earlier analogy with f(x,y) = 2x + y: your mind arrived at this solution after trying a couple of other functions, such as f(x, y) = 7y. That function works for the first datapoint, f(9,3) = 21, but once you look at the second datapoint, f(6,9) = 21, your mind excludes it from consideration because 7(9) = 63. The loss, 63 - 21 = 42, is too large.
2. Norm* - The norm (size) of the learned weight vector. The regularization term penalizes large values of it.
3. Bias* - The bias term (intercept) of the learned function.
4. Epoch - One full pass over the training set
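
The loss idea from item 1 can be checked with a few lines of plain Python. This sketch uses the total absolute error as the loss, which is an assumption made to match the arithmetic in the text; SGDClassifier itself uses a different loss (hinge loss by default).

```python
# The three puzzle examples: ((x, y), target).
data = [((9, 3), 21), ((6, 9), 21), ((3, 8), 14)]

def candidate_a(x, y):
    """First guess: works on the first datapoint only."""
    return 7 * y

def candidate_b(x, y):
    """Second guess: the function the puzzle is built on."""
    return 2 * x + y

def total_absolute_error(func, examples):
    """Sum of |prediction - target| over all examples."""
    return sum(abs(func(x, y) - target) for (x, y), target in examples)

loss_a = total_absolute_error(candidate_a, data)
loss_b = total_absolute_error(candidate_b, data)
print("loss of 7y:", loss_a)
print("loss of 2x + y:", loss_b)
```

The first candidate is off by 42 on each of the last two datapoints, while the second has zero loss on all three, which is why it wins.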

## What happened behind the scenes?
1. A model got trained. How?
2. It took the training dataset (60000 samples).
3. It divided them into batches of some fixed size (batch_size is usually a power of 2, e.g. 128).
4. It looked at a batch of data (say 128 examples and their corresponding ground truth values).
5. It tried to learn a function (aka decision function), just like how your mind learned a function.
6. It looked at the next batch of 128 samples and tried to apply the function that it learned. It evaluated itself using the loss function that you passed to it. Based on the loss, it corrected the learned function by adjusting its parameters (the weights of the function). These are different from hyperparameters, which you choose before training.
7. It moved to the next batch until it finished the entire training dataset.
8. This process of iterating through the entire training set in batches is called one epoch.
9. After an epoch, you evaluate the model to see if it meets the terminating condition.
10. If it meets the terminating condition, exit. If not, shuffle the data and repeat.
11. The terminating condition can be something as simple as a fixed number of epochs (say 50), or you can use an evaluation function like RMSE (Root Mean Squared Error).
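
The steps above can be sketched as a small mini-batch training loop in plain Python. Everything here is an illustrative assumption: the synthetic dataset generated from f(x, y) = 2x + y, the squared-error loss, the learning rate, and the batch size. It shows the shape of the loop, not the internals of SGDClassifier.

```python
import random

rng = random.Random(13)

# Hypothetical training set generated from the target function f(x, y) = 2x + y.
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(256)]
dataset = [((x, y), 2 * x + y) for x, y in points]

a, b = 0.0, 0.0        # the values the model learns
learning_rate = 0.1    # fixed before training starts
batch_size = 32        # fixed before training starts
max_epochs = 200       # terminating condition 1: epoch budget
tolerance = 1e-6       # terminating condition 2: error threshold

def mean_squared_error(examples):
    """Average squared difference between prediction and target."""
    return sum((a * x + b * y - t) ** 2 for (x, y), t in examples) / len(examples)

for epoch in range(max_epochs):
    rng.shuffle(dataset)  # reshuffle before every epoch
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        # Gradient of the batch's mean squared error with respect to a and b.
        grad_a = sum(2 * (a * x + b * y - t) * x for (x, y), t in batch) / len(batch)
        grad_b = sum(2 * (a * x + b * y - t) * y for (x, y), t in batch) / len(batch)
        # Correct the learned function a little in the direction that lowers the loss.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    # Evaluated on the training data here only to keep the sketch short.
    if mean_squared_error(dataset) < tolerance:
        break

print("learned a, b:", round(a, 2), round(b, 2))
```

The loop stops either when the error falls below the threshold or when the epoch budget runs out, which mirrors the two kinds of terminating condition in step 11.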

## Question
1. What is a hyperparameter? Try to think of this using the simpler example f(x,y) = 2x + y
2. Why is it important to shuffle the data after an epoch?
3. Step 9 says we should evaluate our model before calling the epoch done. On what data should we evaluate our model? Try to think why training/test can/can't be used for the evaluation purposes?

## Let's see if our model learned anything


In [None]:
display_image(X_test[5000]) #Pick an arbitrary image from the dataset and visualize it so we know what to expect

In [None]:
# Alright, my trained ML model, can you make a prediction for me?
sgd.predict(X_test[5000].reshape(1,-1)) #An arbitrarily chosen test image

Me: Alright, my ML model you say that it's a 4 but if you meet my professors who evaluate my exam papers they'd say that I write my 9 that way. #HumansVsAI

AI: No. It's a 4. Have a look at the ground truth values if you have any doubts.

Me: Alright.

In [None]:
y_test[5000]

Me: &lt;Insert your favorite man beating machine with a baseball bat meme>

## Let's try to evaluate our ML model

In [None]:
## Evaluation Functions and Validation Schemes
from sklearn.model_selection import cross_val_score, cross_val_predict ## Cross Validation. See below
y_train_pred = cross_val_predict(sgd, X_train, y_train, cv=5) ## 5 Fold Cross Validation

## K-Fold Cross Validation
![alt K-fold Cross Validation](https://raw.githubusercontent.com/ArulselvanMadhavan/ml-study-group/master/resources/images/k-fold-diagram.png "K-Fold cross validation")

Note: The test fold is also called the validation set in some literature.

[Source:http://karlrosaen.com/ml/learning-log/2016-06-20/](http://karlrosaen.com/ml/learning-log/2016-06-20/)
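
The diagram can be turned into a few lines of plain Python. This is a simplified sketch: the helper `k_fold_indices` is a hypothetical name, it splits indices in order without shuffling, and it ignores the stratification that scikit-learn's own splitters can apply.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(num_samples))
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when num_samples is not divisible by k.
        stop = num_samples if fold == k - 1 else start + fold_size
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

folds = list(k_fold_indices(num_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample lands in the validation fold exactly once and in the training folds the other k - 1 times, so every sample gets a prediction from a model that never saw it.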

## Question
1. Why is Cross Validation important?

In [None]:
y_train_pred[:10] ## Let's look at the predictions for the first 10 images.

## Question
1. Can you define the notion of accuracy for our problem? When do you say a prediction is correct for a single image? How do you accumulate that notion for the entire data set?

In [None]:
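## Accuracy
def accuracy(truth, predictions):
    # Flatten both arrays first: comparing an (n, 1) column with an (n,) vector
    # would broadcast to an (n, n) matrix and give a meaningless count.
    truth, predictions = np.ravel(truth), np.ravel(predictions)
    return np.mean(truth == predictions)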

In [None]:
accuracy(y_train, y_train_pred) #Train Accuracy of around 84%

## Question
It should be fairly trivial to define what they mean but can you reason about where they should be used in the lifecycle of a ML model and what reasonable inferences can be made from them?
1. Training Set Accuracy
2. Validation Set Accuracy
3. Test Set Accuracy

## How can we improve accuracy?

1. As mentioned earlier, machines interpret numbers well, while humans interpret aggregated data better (in most cases aggregated data is presented visually). This is far more effective than just looking at raw numbers.
2. One of the tools practitioners use to understand model performance and make reasonable tweaks to an ML model is the confusion matrix.
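
Before running it on MNIST, here is a tiny hand-rolled version to show what a confusion matrix counts. The 3-class labels below are made up for illustration, and `confusion_counts` is a hypothetical helper, not a scikit-learn function. Rows are true labels and columns are predicted labels.

```python
def confusion_counts(truth, predictions, num_classes):
    """Return a num_classes x num_classes table: rows are true labels, columns are predicted labels."""
    table = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(truth, predictions):
        table[actual][predicted] += 1
    return table

# Hypothetical labels for a 3-class problem.
truth =       [0, 0, 1, 1, 2, 2, 2]
predictions = [0, 1, 1, 1, 2, 0, 2]

table = confusion_counts(truth, predictions, num_classes=3)
for row in table:
    print(row)
```

Correct predictions sit on the diagonal. Every off-diagonal cell tells you which class was mistaken for which other class.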

In [None]:
from sklearn.metrics import confusion_matrix
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

In [None]:
conf_mx.shape #Confusion matrix is a num_of_classes x num_of_classes matrix.

In [None]:
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

## Question
1. By just looking at the picture, can you reason why this is a handy tool to understand and tweak an ML model?

## Types of MultiClass Classification 
1. One Vs One Classifier
2. One Vs Rest Classifier
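
Both strategies build a multiclass classifier out of binary ones. As a rough sketch, the snippet below only counts how many binary problems each strategy creates for the ten MNIST digits. The task lists are illustrative and no model is trained.

```python
from itertools import combinations

classes = list(range(10))  # the ten MNIST digits

# One-vs-rest: one binary classifier per class ("is it this digit or not?").
one_vs_rest_tasks = [(c, "rest") for c in classes]

# One-vs-one: one binary classifier per unordered pair of classes.
one_vs_one_tasks = list(combinations(classes, 2))

print("one-vs-rest classifiers:", len(one_vs_rest_tasks))
print("one-vs-one classifiers:", len(one_vs_one_tasks))
```

One-vs-one needs more classifiers, but each one trains only on the samples of its two classes. One-vs-rest needs fewer classifiers, each trained on the full dataset.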

## Topics for Next week?