# Machine Learning Intro II
## August 31, 2020

# Project Organization

Forbes magazine released this graphic after surveying 80 data scientists. 

![DataScientistTime_Forbes.jpg](figures/DataScientistTime_Forbes.jpg)

## File Structure

Project file structures can vary depending on the type of project, but most examples of an organized structure include folders for:

- Data - unprocessed and processed data files
- Scripts - code to clean, process, build the ML model, and analyze data
- Results - analysis results from the model
- Figures - ways to present results
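
As a minimal sketch, the folders above could be created with Python's standard library. The exact folder names (including the `raw`/`processed` split under `data`) and the `scaffold_project` helper are illustrative assumptions, not a required layout.

```python
from pathlib import Path

# Folder names mirror the list above; adjust them to suit the project.
FOLDERS = ["data/raw", "data/processed", "scripts", "results", "figures"]

def scaffold_project(root):
    """Create the standard folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in FOLDERS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created
```

Calling `scaffold_project("my_project")` would create the five folders under `my_project/` without touching any files already there.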


## Category File Structure Example

![Structure_Example.png](figures/Structure_Example.png)

- src - source code
- eda - exploratory data analysis
- poc - proof of concept

## Project Documentation

Most data projects are built to be shared. To give meaning to the project, most data scientists include documentation on:

- Research questions being posed
- Type of data being used
- Any necessary data cleaning or processing
- Type of Machine Learning model being applied
- Any parameters or assumptions incorporated with the data or the model
- Expected format of results




![Screen%20Shot%202020-08-26%20at%207.36.41%20PM.png](figures/Screen%20Shot%202020-08-26%20at%207.36.41%20PM.png)


## Data Lifecycle

Working with data types or pulling data from databases

![Data-Lifecycle.png](figures/Data-Lifecycle.png)

## Code Development and Operations
![devops.png](figures/devops.png)

## Collaboration and Version Control

Data science is built around community, and community means that more than one set of hands may touch your code. 

![VersionControl.png](figures/VersionControl.png)

# Collaboration Tools

Version Control
 - Git - version control system, used from the command line or through tools integrated into your development environment
 - GitHub - website that projects are published to, facilitates multiple people working on the same project
 
Communication
- Slack - workplace organized chat

### Performance Metrics

- We need some way of evaluating performance
- This depends on the problem
- Regression 
    - SSE, RMSE, MAE, etc.
- Classification
    - Precision, Recall
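
The metrics listed above can be sketched in plain Python as follows. The function names are illustrative, and `precision_recall` assumes a binary problem in which one label is treated as the positive class.

```python
import math

def sse(actual, predicted):
    """Sum of squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sse(actual, predicted) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def precision_recall(actual, predicted, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, `sse([1, 2, 3], [1, 2, 5])` is 4, because only the last prediction is off, by 2.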

### Train-Test Splits

- Want some portion of the data to "train" the model
- Want some portion of the data to "test" the model
- Common splits are 75/25, 80/20, 90/10
- Assume we want an 80/20 split
    - 80% of the data for training 
    - 20% for testing
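
A random 80/20 split can be sketched with the standard library alone. The helper name `split_rows` and the fixed seed are assumptions made for illustration.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out `test_fraction` of them for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_rows(range(100))
print(len(train), len(test))  # 80 20
```

Shuffling before cutting matters: if the rows are sorted (say, by price), taking the last 20% as-is would give a test set that looks nothing like the training set.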

![diamond_data](figures/train_test_5cv.PNG)


### Train-Test Splits, example
- For example, we have a dataset with 100 rows
- We perform linear regression with 80 randomly selected rows (train set)
- Performance is measured on the 20 remaining rows of data (test set)

- To generalize, we want to repeat this process many times
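
The example above can be sketched end to end. The 100 rows here are synthetic (a made-up line plus noise), and `fit_line` and `holdout_rmse` are hypothetical helper names. The point is only to show the fit-on-80, score-on-20 loop being repeated.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def holdout_rmse(rows, rng, n_train=80):
    """Fit on a random train set, return RMSE on the remaining rows."""
    rows = rows[:]  # copy so the caller's list is not reordered
    rng.shuffle(rows)
    train, test = rows[:n_train], rows[n_train:]
    intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in test]
    return math.sqrt(sum(squared) / len(squared))

rng = random.Random(42)
# Synthetic data: y = 5 + 3x plus Gaussian noise (an assumption for the demo).
rows = [(x, 5 + 3 * x + rng.gauss(0, 2)) for x in range(100)]

scores = [holdout_rmse(rows, rng) for _ in range(50)]
print(f"mean test RMSE over {len(scores)} splits: {sum(scores) / len(scores):.2f}")
```

Any single split gives a noisy estimate. Averaging over many splits gives a steadier picture of how the model performs on unseen rows.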

![diamond_data](figures/train_test_5cv1.PNG)


### Cross Validation 

- k-fold Cross Validation
- An 80/20 train-test split induces 5 folds: the data is cut into 5 equal portions, and each portion serves as the test set exactly once
- For the first fold
    - Use portions 1-4 to train, test on 5
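
A minimal sketch of how the folds can be generated as row indices. The function name `k_fold_indices` is hypothetical, and the rows are assumed to have been shuffled beforehand. The first fold holds out the last portion, matching the description above.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs, holding out one portion per fold."""
    bounds = [round(i * n_rows / k) for i in range(k + 1)]
    portions = [list(range(bounds[i], bounds[i + 1])) for i in range(k)]
    # Start with the last portion as the test set, then work backwards.
    for held_out in reversed(range(k)):
        test_idx = portions[held_out]
        train_idx = [i for j, portion in enumerate(portions)
                     if j != held_out for i in portion]
        yield train_idx, test_idx

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(100), start=1):
    print(fold, len(train_idx), len(test_idx), test_idx[0], test_idx[-1])
```

With 100 rows and 5 folds, every fold trains on 80 rows and tests on 20, and each row is used for testing exactly once.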

![diamond_data](figures/train_test_5cv1.PNG)

### Cross Validation

- The remaining 4 folds

|                                       |                                  |
|---------------------------------------|----------------------------------|
|  ![diamond_data](figures/train_test_5cv2.PNG) | ![diamond_data](figures/train_test_5cv3.PNG)  |
|  ![diamond_data](figures/train_test_5cv4.PNG) | ![diamond_data](figures/train_test_5cv5.PNG)  |



# Example Project: Diamonds Dataset

* Research Question: Find a model which minimizes how much we underprice each diamond.

* The data come from previous diamond sales and include price, carat, and grade of cut.

# Visualize the Data

![diamond_data](figures/diamond_data.png)

# Linear Regression - Model 1

* We will first consider a linear regression model with varying intercepts but the same slope

* Varying intercepts mean that, at any given carat, the predicted price differs by cut

* Same slope means that the predicted increase in price per unit increase in carat is the same for each cut
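
As a hedged sketch of what "varying intercepts, same slope" means computationally: the least-squares fit pools the within-cut relationship between carat and price into one slope, then gives each cut its own intercept. The function name and the tiny dataset below are made up for illustration. They are not the diamonds data or the code used to produce the figures.

```python
from collections import defaultdict

def fit_varying_intercepts(xs, ys, groups):
    """Least squares for y = intercept[group] + slope * x (one shared slope)."""
    by_group = defaultdict(list)
    for x, y, g in zip(xs, ys, groups):
        by_group[g].append((x, y))
    means = {
        g: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        for g, pts in by_group.items()
    }
    # Shared slope: pooled within-group covariation of x and y.
    sxy = sum((x - means[g][0]) * (y - means[g][1])
              for x, y, g in zip(xs, ys, groups))
    sxx = sum((x - means[g][0]) ** 2 for x, g in zip(xs, groups))
    slope = sxy / sxx
    intercepts = {g: y_bar - slope * x_bar for g, (x_bar, y_bar) in means.items()}
    return intercepts, slope

# Made-up example: two cuts, same price increase per carat, different baselines.
carat = [0.5, 1.0, 1.5, 0.5, 1.0, 1.5]
price = [2100, 4100, 6100, 2600, 4600, 6600]
cut = ["Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"]

intercepts, slope = fit_varying_intercepts(carat, price, cut)
print(intercepts, slope)  # intercepts: Fair 100, Ideal 600; slope 4000
```

In the fitted picture this produces parallel lines, one per cut, shifted up or down by the cut's intercept.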

# Visualising Model 1 Fit

![diamond_data_model1](figures/diamond_data_model1.png)

# Loss Functions

* Mean Squared Loss for this Model: 2,284,251 $\$^2$

* Mean Absolute Loss for this Model: 988.46 $\$$

# Custom Loss Function

* Since our research question is specifically interested in _underpricing_ the diamonds, consider this custom loss function:

$$
L(\text{actual}, \text{estimated}) =
\begin{cases}
    \text{actual} - \text{estimated} & \text{if } \text{actual} > \text{estimated} \\
    0 & \text{otherwise}
\end{cases}
$$

* Using this loss function, the sum total is $\$$ 26,658,675

* Using this loss function, the average underpricing is $\$$ 494.23
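
The custom loss translates directly into code. In this sketch the function names and the three example prices are made up, and the average is assumed to be taken over all diamonds, including those with zero loss.

```python
def underpricing_loss(actual, estimated):
    """Dollars lost when the estimate is below the actual price, else 0."""
    return actual - estimated if actual > estimated else 0.0

def total_and_average_underpricing(actuals, estimates):
    """Sum and mean of the underpricing loss over all rows."""
    losses = [underpricing_loss(a, e) for a, e in zip(actuals, estimates)]
    total = sum(losses)
    return total, total / len(losses)

# Made-up prices: the second diamond is overpriced, so it contributes 0.
actuals = [1000, 2000, 3000]
estimates = [900, 2500, 2800]
print(total_and_average_underpricing(actuals, estimates))  # (300.0, 100.0)
```

Unlike squared or absolute error, this loss ignores overpricing entirely, so a model that always guesses high would score a perfect 0. It is best read alongside MSE or MAE rather than on its own.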

# Iterating the Model: Varying Slope, Varying Intercept

![diamond_data_model2](figures/diamond_data_model2.png)

# Performance Evaluation

* Mean Squared Error for this model: 2,243,036 $\$^2$ (previous model: 2,284,251 $\$^2$)

* Mean Absolute Error for this model: 987.40 $\$$ (previous model: 988.46 $\$$)

* Sum total of underpricing:  $\$$ 26,630,089 (previous model: $\$$ 26,658,675)

* Average underpricing: $\$$ 493.70 (previous model: $\$$ 494.23)

# Keep Iterating: Quadratic Model

![diamond_data_model3](figures/diamond_data_model3.png)

# Fit Curves by Cut

![fit curves by cut](figures/individual_fit.png)

# Model Evaluation

| Model Index | Model                            | MSE       | MAE    | Total Underpricing | Average Underpricing |
|-------------|----------------------------------|-----------|--------|--------------------|----------------------|
|      1      | Varying Intercept                | 2,284,251 | 988.46 | 26,658,675         | 494.23               |
|      2      | Varying Intercept, Varying Slope | 2,243,036 | 987.40 | 26,630,089         | 493.70               |
|      3      | Varying Quadratic                |_2,180,508_|_908.12_|_24,491,946_        |_454.06_              |

### Example Project: Classify species of Iris

* Research Question: Build a model to classify species of iris (versicolor and virginica) using sepal length and width 

* The data were collected by Edgar Anderson and published in 1935
* The full dataset includes 3 species of iris (*versicolor*, *virginica*, and *setosa*)
* The full dataset includes 4 features/covariates (sepal length, sepal width, petal length, petal width)
* A famous dataset: R.A. Fisher used it in his 1936 paper on discriminant analysis (one of the earliest classification models)

![iris_raw_data](figures/iris_raw_data.png)

# K nearest neighbors classification/regression
- For each point in the dataset, find the point's k nearest neighbors
- For classification, return the plurality vote for those k neighbors
- For regression, return the mean of those k neighbors
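
A minimal k-nearest-neighbors sketch using only the standard library. Euclidean distance is assumed, the function names are illustrative, and the six training points are made-up sepal measurements rather than rows from the iris data.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Indices of the k training points closest to `query` (Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Plurality vote among the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Mean response of the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# Made-up (sepal length, sepal width) points for illustration.
train_X = [(5.5, 2.4), (5.7, 2.8), (6.0, 2.7), (6.5, 3.0), (6.7, 3.1), (7.0, 3.2)]
train_y = ["versicolor"] * 3 + ["virginica"] * 3

print(knn_classify(train_X, train_y, (5.8, 2.7), k=3))  # versicolor
```

With two classes, an odd k avoids tied votes. In this sketch a tie would be broken in favor of whichever label appears first among the nearest neighbors.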

#### (Aside:) Performance in generalizable classification will be driven by separability and sample size

![linpred](figures/linpred.png)

#### Generalizability as a function of k

![iris_k_train_test](figures/iris_k_train_test.png)

# Tuning k with cross-validation
![cv_plot](figures/cv_plot.png)
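
One way to tune k is to compute the cross-validated accuracy for a grid of candidate values and keep the best. The sketch below is self-contained and uses randomly generated two-class data as a stand-in for the iris measurements. The helper names, the grid of odd k values, and the data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Plurality vote among the k nearest (features, label) pairs."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(rows, k, n_folds=5):
    """Mean held-out accuracy of k-NN across n_folds folds."""
    accuracies = []
    for fold in range(n_folds):
        test = rows[fold::n_folds]  # every n_folds-th row is held out
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / n_folds

# Made-up two-class data standing in for sepal length and width.
rng = random.Random(0)
rows = [((rng.gauss(5.9, 0.5), rng.gauss(2.8, 0.3)), "versicolor") for _ in range(50)]
rows += [((rng.gauss(6.6, 0.6), rng.gauss(3.0, 0.3)), "virginica") for _ in range(50)]
rng.shuffle(rows)

scores = {k: cv_accuracy(rows, k) for k in range(1, 26, 2)}  # odd k only
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Small k tends to chase noise in the training set, while large k smooths over real structure. The cross-validated score is what lets us pick a value between the two extremes.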

# Best k predicted probabilities
![irisk-best-k](figures/irisk-best-k.png)

# Presenting Results (Confusion Matrices)

| | **Predicted Ve** | **Predicted Vi** |
|---|---|---|
| **True Ve** | 38 | 12 |
| **True Vi** | 16 | 34 |

Ve = versicolor, Vi = virginica.
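
The counts in the confusion matrix above are enough to compute accuracy, precision, and recall. This sketch copies the four counts and treats one species at a time as the positive class. The `summarize` helper is a hypothetical name.

```python
# Counts copied from the confusion matrix; keys are (true, predicted).
confusion = {
    ("Ve", "Ve"): 38, ("Ve", "Vi"): 12,
    ("Vi", "Ve"): 16, ("Vi", "Vi"): 34,
}

def summarize(confusion, positive):
    """Accuracy, plus precision and recall for the `positive` class."""
    total = sum(confusion.values())
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    tp = confusion[(positive, positive)]
    predicted_pos = sum(n for (t, p), n in confusion.items() if p == positive)
    actual_pos = sum(n for (t, p), n in confusion.items() if t == positive)
    return {
        "accuracy": correct / total,
        "precision": tp / predicted_pos,
        "recall": tp / actual_pos,
    }

print(summarize(confusion, positive="Vi"))
# accuracy 0.72, precision about 0.739 (34/46), recall 0.68 (34/50)
```

Accuracy is the same whichever species is called positive. With versicolor as the positive class, precision is 38/54 and recall is 38/50.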


# Presenting Results (ROC curves and AUC)
![roc_curve](figures/roc_curve.png)
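
A hedged sketch of how an ROC curve and its AUC can be computed from predicted probabilities: sort the cases by score, sweep the threshold from high to low, and record the false and true positive rates at each step. The function names and the six example scores are made up. Labels are assumed to be 1 for the positive class and 0 otherwise.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs as the decision threshold sweeps from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all cases tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 3))  # 0.889
```

The AUC here is 8/9: of the 9 possible (positive, negative) pairs, the positive case gets the higher score in 8 of them.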

# Questions?