# Machine Learning Recipe

1. **Importing**
    - load the dataset using pandas, which can read many file formats (e.g. CSV, Excel, JSON) and query SQL databases
2. **Exploratory data analysis**
    - visualise the data using
        - ```plt.hist()```
        - ```plt.scatter(c=, cmap=, s=)```
            - ```plt.colorbar()``` <i style="color:green"># adds a legend for the color scale</i>
        - ```plt.plot()```
        - ```sns.boxplot()```
        - ```sns.jointplot(kind=)```
        - ```sns.pairplot(hue=)```
    - interpret the visual patterns
        - look for correlated pairs in scatter plots. In general, we seek to predict ```Y``` using ```X```'s
            1. if X appears to be correlated with Y, X might be a good feature to use to predict Y
            2. correlation among the ```X```'s might suggest redundancy in trying to predict ```Y```
        - use color and hue in scatter plots and pair plots to analyse categorical variables
            1. if ```Y``` is categorical, any ```X```'s that separate the different ```Y``` categories into clean groups are good candidates for predicting ```Y```
    - fit a **decision tree**
        - features that appear in the first few layers of the tree are likely to be useful in predicting ```Y```
3. **Transform and clean your data**
    - deal with missing data or **```NaN```** values
        1. you can drop the rows of data with NaN values but this is in general not recommended
        2. fill in the missing values with sensible values
            - if the feature is continuous, some possibilities for filling in the data are
                - use the mean of the observations where there is data
                - use regression to backfill the missing values
            - if the feature is categorical, some possibilities are
                - use the mode
                - create a new category to represent the missing values
    - apply **one-hot encoding** to transform categorical variables into binary indicator columns
    - use appropriate data transformations - some examples are
        - presence of a certain string in a name
        - length of a string
        - reducing the number of categories for a categorical variable
    - combine with data from other sources
    - standardise/normalise your data where appropriate
4. **Choose appropriate learning models**
    - base this on the insights gleaned from exploring and transforming the data. Are you trying to predict
        1. a **categorical** feature (i.e. use **classification**) or
        2. a **continuous** feature (i.e. use **regression**)
5. **Model Selection**
    1. **create training and test datasets**
        - create a **train-test** split on the whole dataset (recommended 70%-30% or 80%-20%)
    2. **Choose the best model within each model class**
        - this is also known as **tuning** the model **hyperparameters**. Examples include
            - finding the best ```alpha``` for ```Ridge```
            - finding the best ```k``` for ```KNeighborsClassifier```
        - run **n-fold cross-validation** on the **train set** using **```GridSearchCV()```** to select the optimal **parameters** for each of the candidate model classes
        - within each model class, this selects the best model based on the **cross-validated score** (use **```accuracy```** for classification; use **```neg_mean_squared_error```** for regression, which is negated so that a higher score is always better)
    3. **Select the best overall model**
        - compare the best cross-validated model from each model class in the previous step (say the best ```LinearRegression```, ```Ridge```, ```Lasso```, ```ElasticNet```, etc.)
        - the final model is the one with the lowest ***MSE/classification error*** on the **test set**
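
The cleaning and transformation ideas in step 3 can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the toy DataFrame and its column names (`name`, `age`, `port`) are hypothetical, and the choice of mean imputation and a `"missing"` category is just one of the options listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "name": ["Smith, Mr. John", "Jones, Mrs. Ann", "Brown, Miss. Eve", "White, Mr. Tom"],
    "age": [22.0, np.nan, 30.0, 40.0],
    "port": ["S", "C", None, "S"],
})

# Continuous feature: fill missing values with the mean of the observed values.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical feature: create a new category to represent the missing values
# (df["port"].mode()[0] would fill with the mode instead).
df["port"] = df["port"].fillna("missing")

# Derived features: presence of a string in a name, and the length of a string.
df["is_mr"] = df["name"].str.contains("Mr.", regex=False).astype(int)
df["name_len"] = df["name"].str.len()

# One-hot encode the categorical column into binary indicator columns.
encoded = pd.get_dummies(df.drop(columns="name"), columns=["port"])

# Standardise the continuous column to zero mean and unit variance.
encoded["age"] = (encoded["age"] - encoded["age"].mean()) / encoded["age"].std()
```

In a real workflow, the fill values and the standardisation statistics would typically be computed on the training split only and then applied to the test split, so that no information leaks from the test set.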

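The model selection procedure in step 5 might look like the sketch below for a regression problem. The synthetic data, the `alpha` grid, the 5-fold setting, and the two candidate model classes are all assumptions made for illustration; a classification problem would use classifiers and `scoring="accuracy"` instead.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption made for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# 1. Train-test split (80%-20%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Tune each model class with 5-fold cross-validation on the train set only.
alphas = {"alpha": [0.01, 0.1, 1.0, 10.0]}
candidates = {"Ridge": Ridge(), "Lasso": Lasso()}
tuned = {}
for name, model in candidates.items():
    search = GridSearchCV(model, alphas, cv=5, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    tuned[name] = search  # refit=True (the default) retrains on the full train set

# 3. Compare the tuned models on the held-out test set; the lowest MSE wins.
test_mse = {
    name: mean_squared_error(y_test, search.predict(X_test))
    for name, search in tuned.items()
}
best_name = min(test_mse, key=test_mse.get)
print(best_name, tuned[best_name].best_params_, test_mse[best_name])
```

The test set is touched only once, in the final comparison, so the reported error is not biased by the tuning in the previous step.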