# Machine Learning Framework

<img style='width:800px' src='ml-framework-img'>

## 1. Problem Definition
* What problem are we trying to solve?
* When shouldn’t you use ML?
  * Will a simple hand-coded, rule-based system work? If so, ML is probably unnecessary
* Main types of ML
  * Supervised Learning → I know my inputs and outputs
  * Unsupervised Learning → I’m not sure of the outputs but I have the inputs
  * Transfer Learning → I think my problem may be similar to something else
  * Reinforcement Learning → The computer learns by trial and error to maximize the reward (points) it receives

## 2. Data
* What kind of data do we have?
* Static Data = data that doesn’t change over time
  * Record of patients who got or didn’t get heart disease
* Streaming data = data that constantly changes over time
  * Stock market
* Structured data
  * Rows and columns
  * Excel/CSV file
* Unstructured data
  * Images
  * Video/Audio
  * Text
* **“The more data, the better”** (as a rule of thumb, provided the data is relevant and of good quality)

## 3. Evaluation
* What defines success for us?
* 100% accuracy is rarely feasible, so what threshold will we accept as good enough?
* Different types of metrics
  * Accuracy
  * Precision
  * Recall
* Finding people with heart disease
  * May want a model with accuracy > 99%, because a missed case is very costly
* Finding spam emails
  * A model with 80% accuracy may suffice
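
The three metrics listed above can be computed from counts of correct and incorrect predictions. The snippet below is a minimal sketch in plain Python; the labels are made up for illustration (1 = has heart disease, 0 = healthy), and the "model" is a hypothetical one that predicts "healthy" for everyone. It shows why accuracy alone can mislead when one class is rare.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true, y_pred):
    """Of the samples predicted positive, the fraction that really are."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Of the truly positive samples, the fraction the model found."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels: 10 patients, 2 of whom have heart disease.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical model that always predicts "healthy".
y_pred = [0] * 10

print(accuracy(y_true, y_pred))   # 0.8
print(precision(y_true, y_pred))  # 0.0
print(recall(y_true, y_pred))     # 0.0
```

Here the model reaches 80% accuracy while finding none of the sick patients, which is why recall matters for a problem like heart disease.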

## 4. Features
* What do we already know about the data?
* How can we use the data to create more useful indicators?
* We want to use the feature variables to predict the target variable(s)
* Numerical Feature Variables
  * Weight/Height
* Categorical Feature Variables
  * Sex/Address
  * These must later be converted into numerical values
* Derived Feature Variables (Feature engineering)
  * Using existing data to create new features
  * Finding new data and incorporating it into existing data
* Feature Coverage
  * Ideally, more than 10% of the samples should have a (non-missing) value for a feature
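
The feature ideas above can be illustrated with a small sketch in plain Python. The patient records, column names, and helper functions (`coverage`, `one_hot`, `add_bmi`) are all made up for this example. One-hot encoding is only one common way to turn a categorical variable into numbers, and BMI is just one possible derived feature.

```python
# Made-up patient records; None marks a missing value.
rows = [
    {"weight_kg": 70.0, "height_m": 1.75, "sex": "F", "cholesterol": 190},
    {"weight_kg": 85.0, "height_m": 1.80, "sex": "M", "cholesterol": None},
    {"weight_kg": 60.0, "height_m": 1.60, "sex": "F", "cholesterol": None},
    {"weight_kg": 95.0, "height_m": 1.90, "sex": "M", "cholesterol": 240},
]


def coverage(rows, feature):
    """Feature coverage: fraction of rows where the feature is not missing."""
    present = sum(1 for row in rows if row.get(feature) is not None)
    return present / len(rows)


def one_hot(rows, feature):
    """Replace a categorical feature with one 0/1 column per category."""
    categories = sorted({row[feature] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != feature}
        for category in categories:
            new_row[f"{feature}_{category}"] = 1 if row[feature] == category else 0
        encoded.append(new_row)
    return encoded


def add_bmi(rows):
    """Derived feature: body mass index = weight / height squared."""
    return [{**row, "bmi": row["weight_kg"] / row["height_m"] ** 2} for row in rows]


print(coverage(rows, "cholesterol"))  # 0.5

features = add_bmi(one_hot(rows, "sex"))
print(features[0])
```

After these steps every row contains only numbers (or missing values), with `sex` replaced by the columns `sex_F` and `sex_M` and a new `bmi` column added.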

## 5. Modelling
* Based on our problem and data, which model should we use?
* <u>The most important concept in ML: 3 sets</u>
    * Training (70-80%)
        * Train your model on this
    * Validation (10-15%)
        * Tune your model on this
    * Test (10-15%)
        * Test and compare on this
        * Never use the test set to tune the model. If you do, the model may end up fitted to the test set (overfitting to the test set) instead of learning general patterns, so it can show great test accuracy yet generalize poorly to future data
    * Generalization → ability of an ML model to perform well on data it hasn’t seen before
* Choosing a model - Training set
    * Structured data → decision trees, random forest
    * Unstructured data → deep learning, transfer learning
    * Start with a smaller subset of the data, because some models (e.g. neural networks) take a long time to train
* Tuning - Training and/or Validation set
    * Hyperparameters can be tweaked for a model
        * Random Forest → # of trees
        * Neural Networks → # of layers
* Model Comparison - Test set
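    * Overfitting → the model performs much better on the training set than on the test set
        * Data leakage - some of the test set leaks into the training data, so test results look better than they should
        * Fixes
            * Collect more data
            * Try a less advanced model
            * Reduce the number of features
    * Underfitting → the model performs poorly even on the training set
        * Data mismatch - the test set doesn’t match the format of the training set, which can also cause poor test results
        * Fixes
            * Try a more advanced model
            * Increase model capacity through its hyperparameters (e.g. more trees or layers)
            * Train longer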
    * When comparing models, make sure they have been trained and tested on the same data
    * Accuracy is not the only criterion; other factors (like training time) can also influence which model you choose
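
The three-set workflow above can be sketched in plain Python. Everything here is an assumption made for illustration: the synthetic one-dimensional data, the 70/15/15 split, the toy k-nearest-neighbours classifier, and the candidate values of its hyperparameter `k`. The point is only the order of operations: fit on the training set, pick the hyperparameter on the validation set, and touch the test set once at the end.

```python
import random


def three_way_split(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def knn_accuracy(train, data, k):
    """Accuracy on `data` of a k-NN model that uses `train` as its memory."""
    correct = sum(1 for x, y in data if knn_predict(train, x, k) == y)
    return correct / len(data)


# Synthetic data: label is 1 when x > 5, with roughly 10% of labels flipped.
rng = random.Random(42)
samples = []
for _ in range(200):
    x = rng.uniform(0, 10)
    label = 1 if x > 5 else 0
    if rng.random() < 0.1:
        label = 1 - label
    samples.append((x, label))

train, val, test = three_way_split(samples)

# Tuning: choose the hyperparameter using the validation set only.
candidate_ks = [1, 5, 15]
best_k = max(candidate_ks, key=lambda k: knn_accuracy(train, val, k))

# Model comparison: report the final score on the untouched test set.
test_accuracy = knn_accuracy(train, test, best_k)
print("best k:", best_k)
print("test accuracy:", test_accuracy)
```

Because the test set plays no part in choosing `k`, the final number is a fairer estimate of how the model would do on unseen data.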

## 6. Experimentation
* How could we improve/what can we try next?
* You will need to cycle through all the steps above to achieve success