# Intro to Machine Learning
___
___

## Supervised Learning

* Supervised learning algorithms are trained using **labeled** examples, such as an input where the desired output is known
* Historical data with category labels assigned
* e.g.
    * Spam vs. legitimate email
    * positive vs negative reviews
* Supervised learning is commonly used in applications where historical data predicts likely future events
    
___
 **For Neural Networks**
 * Network receives a set of inputs along with the corresponding correct output
 * The algorithm learns by comparing its actual output to the correct label to find errors
 * It then modifies the model accordingly, e.g. by adjusting weight and bias values
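
The compare-and-adjust loop described above can be sketched with a single artificial neuron trained by gradient descent. This is a minimal illustration, not a full neural network: the toy data, the sigmoid activation, the log loss, and the learning rate are all assumptions chosen for the example.

```python
import numpy as np

# Toy labeled data (hypothetical): the label is 1 when x1 + x2 > 1, else 0.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b):
    # Mean log loss of the neuron's predictions against the correct labels.
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.5  # assumed value, picked for this toy problem

initial_loss = log_loss(weights, bias)

for _ in range(500):
    predictions = sigmoid(X @ weights + bias)  # the model's actual output
    errors = predictions - y                   # compare to the correct labels
    # Adjust weights and bias against the gradient of the mean log loss.
    weights -= learning_rate * (X.T @ errors) / len(y)
    bias -= learning_rate * errors.mean()

final_loss = log_loss(weights, bias)
train_accuracy = float(np.mean((sigmoid(X @ weights + bias) > 0.5) == (y == 1.0)))
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}, accuracy: {train_accuracy:.2f}")
```

Each pass repeats the same three steps from the list: produce an output, measure the error against the label, and nudge the weights and bias to reduce it.
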
___

**ML process for supervised learning**

1. Data acquisition (customers, IoT sensors)
2. Data cleaning (using Pandas)
3. Split data into training and testing datasets (30% test, 70% train)
4. Model training and building (fit model to training data)
5. Model testing (evaluate model using test data) - adjust model params if needed
6. Model deployment

* This is a simplified approach
* Is it fair to use our single split of the data to evaluate model performance?
* After all, we were given the chance to update the model params again and again
* To fix this, often for Neural Networks/ Deep Learning, data is split into 3 sets
    * Training data - to train model params
    * Validation data - to determine what model hyperparams to adjust
    * Test data - data the model has never seen before used to get a final real-world performance metric
        * don't go back to adjust the model params after this stage
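
One way to get the three sets is to call scikit-learn's `train_test_split` twice. The sketch below assumes a 60/20/20 split and made-up data; the proportions and the `random_state` value are illustrative choices, not something the notes prescribe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with 2 features each, one label per row.
X = np.arange(200).reshape((100, 2))
y = np.arange(100)

# First hold out 20% as the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ...then carve a validation set out of the remainder (25% of 80% = 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation set is used while tuning hyperparameters; the test set is touched only once, at the very end.
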
___
___

## Evaluation Performance

### Classification

* Key classification metrics:
    * Accuracy
    * Recall
    * Precision
    * F1-Score
* Typically in any classification task, the model can only achieve 2 results: a correct or an incorrect prediction
* Correct vs. incorrect still applies when you have multiple classes (more than one category value)
* Binary example
    * predict if image is dog or cat
    * supervised learning problem - fit/train a model on training data, then test on testing data
    * compare model's predictions from the **X_test** data to the **true y values** (correct labels)
    * in the end you have a count of correct matches and incorrect matches
    * in real world,
        * not all incorrect or correct matches hold equal value
        * a single metric won't tell the complete story
        * we could organize our predicted values compared to real values in a confusion matrix
        * **Accuracy:**
            * Number of correct predictions as % of total predictions made
            * Accuracy is useful when the target classes are well balanced (similar number of each class, in this case dogs and cats)
            * Accuracy is not a good choice with unbalanced classes (99 dog images, 1 cat image) - if our model was simply a line that always predicted dog, we'd get 99% accuracy
        * **Recall**
            * Ability of a model to find all the relevant cases within a dataset
            * \begin{equation*} Recall = \frac{TP}{TP + FN} \end{equation*}
            * i.e. how many correct "yes" predictions out of all the actual "yes" cases
        * **Precision**
            * Ability of a classification model to identify only the relevant data points
            * \begin{equation*} Precision = \frac{TP}{TP + FP} \end{equation*}
            * i.e. how many correct "yes" predictions out of the total "yes" predictions made
        * **Recall & Precision**
            * there is a trade-off between Recall and Precision
            * Recall expresses the ability to find all relevant instances; Precision expresses the proportion of the data points our model said were relevant that actually were relevant
        * **F1-Score**
            * Harmonic mean of precision and recall
            * \begin{equation*} F_{1} = 2 * \frac{(Precision * Recall)}{(Precision + Recall)} \end{equation*}
            * use the harmonic mean rather than the simple mean, as the harmonic mean punishes extreme values,
            * so the harmonic mean gives a fairer assessment
            * A classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1 score of 0
        * The main point to remember with confusion matrix and various calculated metrics - they're all fundamentally ways of comparing the predicted values versus the true values
        * What constitutes "good" metrics will depend on the specific situation and context - this requires domain knowledge
    * **What's a good enough Accuracy**
        * depends on context of situation
        * is the model for predicting presence of a disease?
        * is the disease presence well balanced in the general population? (unlikely)
        * therefore we have a precision/recall tradeoff
        * Often models are used as a quick diagnostic test before more invasive test - need to consider what's at stake
        * e.g. Urine test before biopsy for prostate cancer
        * often have precision/recall tradeoff - we need to decide if the model should focus on fixing False Positives vs. False Negatives
        * In disease diagnosis, it is probably better to focus on minimizing FN at the cost of increased FP, so we make sure we correctly classify as many cases of the disease as possible
            * don't want to turn away people that actually have the disease but tested -ve (FN)
            * the FP can be found out in the more invasive tests
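
The precision/recall trade-off can be made concrete by moving the decision threshold of a classifier that outputs scores. The sketch below uses invented screening data (ten people, three of whom have the disease) and a hypothetical helper `precision_recall`; none of these numbers come from a real test.

```python
import numpy as np

# Hypothetical screening data: 1 = has the disease, 0 = healthy.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
# Hypothetical model scores (estimated probability of disease) for the same people.
scores = np.array([0.05, 0.10, 0.20, 0.25, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90])

def precision_recall(y_true, scores, threshold):
    """Return (precision, recall, FN count, FP count) at a given threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, fn, fp

for threshold in (0.3, 0.5, 0.8):
    precision, recall, fn, fp = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} "
          f"recall={recall:.2f} FN={fn} FP={fp}")
```

In this toy data, lowering the threshold from 0.8 to 0.3 removes all false negatives but adds three false positives, which is the direction a screening test would usually lean.
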
            


<table align='center'>
    <tr>
        <th colspan=4 style="color:darkblue; text-align:center; font-size:18px">Confusion Matrix</th>    
    </tr>
    <tr>
        <th></th>
        <th></th>
        <th colspan=2 style="text-align:center; font-size:15px">Predicted</th>
    </tr>
    <tr>        
        <th></th>
        <th></th>
        <th><p style="color:gray; text-align:center">Yes</p><p style="color:gray; text-align:center">(+ve)</p></th>
        <th><p style="color:gray; text-align:center">No</p><p style="color:gray; text-align:center">(-ve)</p></th>
    </tr>
    <tr>
        <th rowspan=2 style="text-align:center; font-size:15px">Actual</th>
        <th><p style="color:gray; text-align:center">Yes</p><p style="color:gray; text-align:center">(+ve)</p></th>
        <td><p style="color:green; text-align:center">True Positives (TP)</p></td>
        <td><p style="color:red; text-align:center">False Negatives (FN)</p><p style="color:red; text-align:center">(Type II error)</p></td>
    </tr>
    <tr>
        <th><p style="color:gray; text-align:center">No</p><p style="color:gray; text-align:center">(-ve)</p></th>
        <td><p style="color:red; text-align:center">False Positives (FP)</p><p style="color:red; text-align:center">(Type I error)</p></td>
        <td><p style="color:green; text-align:center">True Negatives (TN)</p></td>
    </tr>

</table>
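
The four cells of the confusion matrix are enough to compute every classification metric listed earlier. The helper below, `classification_metrics`, is a hypothetical name for this sketch; returning 0.0 when a denominator is zero is a convention assumed here, and the "balanced" counts are invented.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# Unbalanced example from the notes: 99 dog images, 1 cat image, and a model that
# always predicts "dog". Treating "cat" as the positive class is an assumption.
always_dog = classification_metrics(tp=0, fn=1, fp=0, tn=99)
print(always_dog)  # accuracy is 0.99, but recall, precision and F1 are all 0.0

# Hypothetical balanced example: 50 actual positives, 50 actual negatives.
balanced = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(balanced)
```

The always-dog model shows why accuracy alone misleads on unbalanced classes: it scores 99% while never finding the single cat.
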

![image.png](attachment:image.png)
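
scikit-learn can build the same matrix from label arrays. A small sketch with invented labels follows; note that `confusion_matrix` orders rows and columns by sorted label value by default (so 0 comes before 1), and passing `labels=[1, 0]` is one way to match the Yes-then-No layout of the table above.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = "yes" (+ve), 0 = "no" (-ve).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes.
# With labels=[1, 0] the layout is:
# [[TP, FN],
#  [FP, TN]]
matrix = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = matrix
print(matrix)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(accuracy, precision, recall, f1)
```
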

___

## Evaluation Performance

### Regression

* Regression is predicting **continuous** values
* Need metrics designed for **continuous** values (not accuracy, precision, recall, F1-score)
* For example
    * predicting house price given its features is a regression task
    * predicting the country a house is in is a classification task
* Common metrics
    * Mean Absolute Error: **MAE**
        * the mean of the absolute value of error
        * sum of absolute(true value minus predicted value), divided by total values
        * \begin{equation*} MAE = \frac{1}{n} \sum_{i=1}^n |y_{i} - \hat{y}_{i}| \end{equation*}
        * problem with MAE: every error counts in proportion to its size, so a few large errors (outliers) are not punished more heavily
    * Mean Squared Error: **MSE**
        * the mean of the squared errors
        * large errors (outliers) are accounted for more
        * sum of squared(true value minus predicted value), divided by total values
        * \begin{equation*} MSE = \frac{1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2 \end{equation*}
        * problem: squaring the errors also squares the units, which is difficult to interpret, so fix by using RMSE
    * Root Mean Square Error: **RMSE**
        * the root of the mean of the squared errors
        * large errors (outliers) are accounted for, and the result has the same units as the predicted value
        * sqrt(sum of squared(true value minus predicted value), divided by total values)
        * \begin{equation*} RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2} \end{equation*}
        * "is this RMSE good?"
            * context and domain knowledge is everything
            * RMSE of $10 is good for predicting a house price, but bad for predicting candy bar price
            * Best to compare the error metric to the average (mean) value of the label in the dataset to get intuition of overall performance
    * MAE, MSE, RMSE are **loss functions**, so the aim is to **minimize** them
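
The three formulas can be checked by hand on a tiny example. The house prices below are invented for illustration; the last line applies the suggestion of comparing RMSE to the mean label value.

```python
import numpy as np

# Hypothetical house prices (in $1000s): true values and model predictions.
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_pred = np.array([210.0, 240.0, 300.0, 310.0])

errors = y_true - y_pred        # [-10, 10, 0, 40]

mae = np.mean(np.abs(errors))   # (10 + 10 + 0 + 40) / 4 = 15.0
mse = np.mean(errors ** 2)      # (100 + 100 + 0 + 1600) / 4 = 450.0
rmse = np.sqrt(mse)             # about 21.21, back in $1000s

# RMSE relative to the average label, as a rough sense of scale (about 0.077).
relative_rmse = rmse / np.mean(y_true)

print(mae, mse, rmse, relative_rmse)
```

The single 40-unit miss is what pushes RMSE (about 21.2) well above MAE (15.0), which is the outlier sensitivity described in the list.
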
            

___

## ML with Python

* Scikit Learn package
* most popular ML package for Python, has many algorithms built in
* install: conda install -c anaconda scikit-learn
* Every model in scikit-learn is exposed via an "Estimator"
* First you import the model:
    * general form: **from sklearn.family import Model**
    * example: from sklearn.linear_model import LinearRegression
* Params: all the params of an estimator can be set when it is instantiated and have suitable default values
    * example:
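        * **model = LinearRegression(normalize=True)**
        * **print(model)** # print to see the params the model was created with
            * > LinearRegression(copy_X=True, fit_intercept=True, normalize=True,)
        * note: this output is from an older scikit-learn release; the **normalize** param was deprecated in version 1.0 and removed in 1.2, and recent versions print only non-default params (use **model.get_params()** to list them all)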
* Once model is created, next step is to fit the model on training data
    * See below how to split data into train and test
* Once data is split, we can train/fit our model on training data
    * supervised learning: **model.fit(X_train, y_train)**
    * unsupervised learning: **model.fit(X_train)**
* Next, model is ready to predict values (labels for supervised learning) based on test set
    * supervised learning:
        * **predictions = model.predict(X_test)**
        * **model.predict_proba()**
            * for some classification problems, it returns the probability that the test/new observation has a particular label (the label with the highest probability is returned by model.predict())
        * **model.score()** for classification and regression: classifiers return mean accuracy (between 0 and 1), regressors return R^2 (at most 1, can be negative for a poor fit); larger score = better fit
    * works slightly differently for unsupervised learning (unlabelled data)
    * unsupervised learning:
        * **model.transform(X_test)**, returns new representation of data based on unsupervised model 
        * **model.fit_transform()**, some estimators have this method, which efficiently performs a fit and transform on the same input data
* Next, evaluate model by comparing to y_test labels
    * evaluation method depends on what ML algo we're using (regressions, classification, clustering, etc..)
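
The estimator methods listed above can be seen together in one short run. This is a sketch on invented data: `LogisticRegression` stands in for the supervised case and `PCA` for the unsupervised case, both chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 points, label is 1 when the two features sum to more than 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Supervised: instantiate, fit on training data, predict on test data.
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)  # one column per class, rows sum to 1
accuracy = model.score(X_test, y_test)       # mean accuracy for classifiers

# Unsupervised: fit without labels, then transform to a new representation.
pca = PCA(n_components=1)
X_train_reduced = pca.fit_transform(X_train)  # fit and transform in one call
X_test_reduced = pca.transform(X_test)        # reuse the fitted model on new data

print(predictions[:5], probabilities[:5].round(2), accuracy)
print(X_train_reduced.shape, X_test_reduced.shape)
```
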
    


### Scikit-learn Algorithm cheat sheet

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html


![image.png](attachment:image.png)

___
### Splitting data into training and test sets

In [2]:
# Splitting data into training and test sets
import numpy as np
import sklearn as sk

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
# fake data
# X = features, y = labels for rows
X, y = np.arange(10).reshape((5,2)), range(5)

In [6]:
X

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [8]:
list(y)

[0, 1, 2, 3, 4]

In [10]:
# split 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [11]:
X_train

array([[2, 3],
       [6, 7],
       [8, 9]])

In [12]:
y_train

[1, 3, 4]

In [13]:
X_test

array([[0, 1],
       [4, 5]])

In [14]:
y_test

[0, 2]
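
Because `train_test_split` shuffles the rows, the split above will generally differ from run to run. A minimal sketch of a reproducible version passes the optional `random_state` parameter; the seed value 101 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same fake data as above: X = features, y = labels for rows.
X, y = np.arange(10).reshape((5, 2)), list(range(5))

# A fixed seed makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X, y, test_size=0.3, random_state=101)
split_b = train_test_split(X, y, test_size=0.3, random_state=101)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)  # (3, 2) (2, 2)
print(y_train, y_test)
```

With 5 rows and `test_size=0.3`, scikit-learn rounds the test set up to 2 rows, leaving 3 for training, which matches the output shown in the cells above.
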