# Machine Learning Roadmap

## The Stages of Machine Learning Analysis

The vast majority of analyses that use machine learning can be broken down into three stages:

1. **Data Preparation**: Importing the data, cleaning it, and identifying the features and target
2. **Model Creation**: Selecting which type of model to use and training it with the data available
3. **System Evaluation**: Testing the model to confirm that it is performing well in the ways we want it to

Below is a template for each of these steps that you can fill in to build up your machine learning analysis. Links to potential tools for each step are also provided for your reference; most of these links include example code, though you may need to scroll a bit to find it. A number of pre-built code snippets are also available on D2L for you to copy and paste as a starting point.

Don't stress about the massive block of possible arguments listed in the documentation linked here. They are there to help programmers who want to integrate their code into a larger system, not the average user. For the vast majority of your use cases, the usage shown in the example code (from either the documentation or ourselves) will be sufficient; we link the full documentation simply to give context on these tools' full scope of applications, should you need to do more detailed analyses in the future.

## Step 1: Data Preparation

### Importing our Data

The first step of preparing the data is to import it into our program. Pandas (`import pandas as pd`) is generally advised for this, and depending on the format of the data, different commands can be used:

* `read_csv`: Imports comma-separated-value style data (usually with `.csv` or `.tsv` extensions; for tab-separated files, pass `sep='\t'`). Often the default format for data, so if all else fails, try this. [Documentation Here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
* `read_excel`: Imports data created and saved using Microsoft Excel or similar. Has a large number of extensions depending on the Excel version used (`.XLS`, `.XLSX`, `.XLSM`, `.XLTX` and `.XLTM`), but pandas will usually figure out how to import it for you regardless. If your computer defaults to opening the data in Excel without asking you anything, this is probably the command to use. [Documentation Here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
* `read_json`: Imports data stored in JavaScript Object Notation (JSON) format. These files are pretty rare outside of direct database or website queries, and are denoted with a `.json` extension. This is most likely the format used if the prior two options fail. [Documentation Here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)

Other formats exist as well, though they are much more uncommon. If none of the above work for you, please contact us and we will figure out the formatting and convert the data into one of the above formats for your use instead.

Place the code importing your data into the code block below. Place it within the `df` variable to keep consistent with subsequent blocks:
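
As a starting point, the import step might look like the minimal sketch below. The CSV text, its column names, and the `csv_text` variable are made up for illustration so the snippet runs on its own; with a real dataset you would pass the path to your file instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file on disk. With a real file you would
# pass its path instead, e.g. pd.read_csv("my_data.csv").
csv_text = io.StringIO(
    "id,sepal length (cm),species\n"
    "3,5.1,setosa\n"
    "7,6.4,versicolor\n"
    "9,5.9,virginica\n"
)

# Store the result in `df` to stay consistent with the later steps
df = pd.read_csv(csv_text)

# Peek at the first few rows to confirm the import worked
print(df.head())
```

The same pattern applies to `read_excel` and `read_json`: swap the function name and point it at your file.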

### Cleaning the Data

Before we can decide what needs to change, we should inspect the data. We start with the distribution of each column; pandas DataFrames let us do this with `df.describe()`:

```python
# Check the distribution of our features
df.describe()
```

| | id | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
| --- | --- | --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 424.173333 | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 251.099642 | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 3.0 | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 217.25 | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 420.5 | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 624.5 | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 867.0 | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


We should also check whether there are null (non-existent) values that we will need to deal with. This can be done with `df.isnull().sum()`:

```python
# Check whether null values are present
df.isnull().sum()
```

```text
id                   0
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64
```

If there are any null values in the dataset, we should deal with them now before proceeding. We can do this in one of two ways:
* Deleting entries with null values. This is good if null values are few and far between, so that deleting them has a negligible effect on the dataset as a whole. This can be accomplished with `df = df.dropna()`. The documentation for the method can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)
* Imputing missing entries, "filling them in" with data generated from the rest of the dataset. For this `sklearn.impute.SimpleImputer` is generally sufficient; by default it fills in missing values with the mean of the rest of the column. The documentation for this method can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

If you wish to apply either of these to your dataset, do so below. Make sure the result remains stored within the `df` variable:
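
The two options above can be sketched as follows. The small DataFrame and the names `df_dropped`, `df_imputed`, and `numeric_cols` are illustrative assumptions, not part of the original template; in practice you would pick one option and apply it to your own `df`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with two missing values, for illustration only
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sepal length (cm)": [5.1, np.nan, 6.3, 5.8],
    "sepal width (cm)": [3.5, 3.0, np.nan, 2.7],
})

# Option 1: drop every row that contains at least one null value
df_dropped = df.dropna()

# Option 2: fill each null with the mean of its column
numeric_cols = ["sepal length (cm)", "sepal width (cm)"]
imputer = SimpleImputer()  # strategy="mean" is the default
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Keep whichever result you prefer in `df`; imputation is chosen here
df = df_imputed
print(df.isnull().sum())
```

Dropping rows shrinks the dataset (here from four rows to two), which is why imputation is often preferred when nulls are common.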

### Preparing the Data

Now that we have an idea of what our dataset contains, and are sure that there are no missing values, we generally want to split it into three different subsets:
* _Metadata_: Elements which are useful for data management, but are probably not informative in any way (e.g. patient IDs, software versions, protocol names)
* _Features_: Elements we expect to have available and intend to use when we want to make predictions using the final model (e.g. blood pressure, gender, genotype)
* _Target(s)_: The element(s) we want the machine learning model to predict for us, using the features designated prior (e.g. disease severity, surgical complication risk, blood sugar)

Use the dataframe view we have above and decide which rows/columns fit into each of these categories, and then group them accordingly. Once you have those elements, you can use `loc` or `iloc` to separate those elements into their respective groups, saving them as unique variables. If you don't feel comfortable with Python's querying format yet, lists (created using `[]`) can be used for this.

For example, the command `df_meta = df.loc[:, ['feature 1', 'feature 2']]` will create a new subset dataframe named `df_meta` which contains all rows (`:`) and only columns `feature 1` and `feature 2` (`['feature 1', 'feature 2']`). For the sake of consistency with future commands, name these subsets `df_meta`, `df_feature`, and `df_target`:
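
A minimal sketch of this split is shown below. The DataFrame contents and the helper names `meta_cols`, `target_col`, and `feature_cols` are assumptions made for illustration; substitute the column names from your own dataset.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "id": [3, 7, 9],
    "sepal length (cm)": [5.1, 6.4, 5.9],
    "petal length (cm)": [1.4, 4.5, 5.1],
    "species": ["setosa", "versicolor", "virginica"],
})

# Decide which columns belong to each group
meta_cols = ["id"]
target_col = "species"
# Everything that is neither metadata nor target is treated as a feature
feature_cols = [c for c in df.columns if c not in meta_cols + [target_col]]

# `:` selects all rows; the list selects the columns for each subset
df_meta = df.loc[:, meta_cols]
df_feature = df.loc[:, feature_cols]
df_target = df.loc[:, target_col]

print(df_feature.columns.tolist())
```

Selecting the target with a single column name (rather than a list) returns a pandas Series, which is the shape most scikit-learn models expect for a single target.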

Finally, depending how the features are structured, we may want to make some final changes on the _features_ of the dataset to prepare it for use in our machine learning tool:

* If the scales differ between features, you may want to rescale their values so that they all fall in roughly the same range (certain models rely on this to function properly). This can be done with `sklearn.preprocessing.normalize`, which rescales values to unit norm so that every value falls between -1 and +1. Note that by default it rescales each sample (row); pass `axis=0` to rescale each feature (column) instead. If you need each feature to share the same mean and variance rather than just a similar range, `sklearn.preprocessing.StandardScaler` is the more usual choice. The documentation for `normalize` can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html)
* If you suspect certain features may not be useful, and want to drop them, feature elimination may also be warranted. This can be done using `sklearn.feature_selection.VarianceThreshold`, `sklearn.feature_selection.SelectKBest`, or `sklearn.feature_selection.SelectPercentile`, depending on your preference and needs. Documentation for each is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html), [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html), and [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html), respectively.
* Finally, if you just have a ton of features and wish to reduce their number (either to reduce the chance of overfitting, or because you suspect some features may be redundant), feature transformation can be used. While others exist, we recommend `sklearn.decomposition.PCA` (documented [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)) or `sklearn.discriminant_analysis.LinearDiscriminantAnalysis` (documented [here](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)) for unsupervised and supervised dimensionality reduction, respectively.

If you want to employ any of these techniques, do so in the following code block. Just make sure that you are only applying them to your features (not metadata or target) and that the results remain saved to `df_feature`:
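
As one possible sketch of the rescaling option, the snippet below applies `normalize` per feature. The data and the `scaled` variable are made up for illustration, and the choice of `axis=0` is an assumption about what is wanted here (each column rescaled on its own), so the values will differ from any output shown for the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import normalize

# Made-up feature data, for illustration only
df_feature = pd.DataFrame({
    "sepal length (cm)": [5.1, 6.4, 5.9, 4.7],
    "petal width (cm)": [0.2, 1.5, 1.8, 0.2],
})

# axis=0 rescales each column to unit (L2) norm; the default axis=1
# would rescale each row instead
scaled = normalize(df_feature, axis=0)

# normalize returns a NumPy array, so rebuild the DataFrame to keep
# the column names and row index
df_feature = pd.DataFrame(
    scaled, columns=df_feature.columns, index=df_feature.index
)

print(df_feature)
```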

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | --- | --- | --- | --- |
| 0 | 0.092700 | 0.081966 | 0.092483 | 0.086268 |
| 1 | 0.078864 | 0.074034 | 0.080676 | 0.074766 |
| 2 | 0.084398 | 0.079322 | 0.090515 | 0.080517 |
| 3 | 0.092700 | 0.087254 | 0.112160 | 0.120775 |
| 4 | 0.074713 | 0.079322 | 0.088547 | 0.086268 |
| ... | ... | ... | ... | ... |
| 145 | 0.063645 | 0.089898 | 0.027548 | 0.017254 |
| 146 | 0.088549 | 0.071390 | 0.104289 | 0.109273 |
| 147 | 0.070563 | 0.100474 | 0.037387 | 0.023005 |
| 148 | 0.078864 | 0.100474 | 0.033451 | 0.017254 |

Finally, we need to make sure the target is ready for analysis. This is much simpler; if it's in a text-based form, we need to convert it to a numerical form instead. This can be done using `sklearn.preprocessing.LabelEncoder`, which simply assigns a unique integer to each unique value in the column. Its documentation is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

If your data needs this done, do so in the following code block:
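
A minimal sketch of the encoding step, assuming a text-based target: the species labels below and the `encoder` variable are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up text target, for illustration only
df_target = pd.Series(
    ["versicolor", "setosa", "virginica", "setosa"], name="species"
)

# Fit the encoder and replace each label with its integer code
encoder = LabelEncoder()
df_target = pd.Series(
    encoder.fit_transform(df_target),
    index=df_target.index,
    name=df_target.name,
)

# classes_ lists the original labels; a label's position is its code
print(encoder.classes_)
print(df_target)
```

Keeping the fitted `encoder` around is useful, since `encoder.inverse_transform` can turn the model's numeric predictions back into the original labels later.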

```text
0      1
1      1
2      1
3      2
4      1
      ..
145    0
146    2
147    0
148    0
149    1
Name: species, Length: 150, dtype: int64
```

## Step 2: Model Selection and Training

### Train-Test Split

Now that our dataset is ready, let's actually start building a model that makes use of it. Before we do so, it is common practice to split our data into two subsets:

* _Training_: The data we will train the model on
* _Testing_: The data the model will be evaluated using

This is done so that we can catch a model that has simply "memorized" the data it was trained on, which would make it appear extremely effective when in reality it performs poorly on new data. Thankfully this is very simple to do in scikit-learn with the `sklearn.model_selection.train_test_split` function; by default it returns a training dataset containing 75% of your provided data, and a testing dataset with the remaining 25%. The documentation for its use can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

Apply this to your features and target in the block below. For the variable names, append `_train` to the end of the training set data and `_test` for the testing set data:
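
The split might look like the sketch below. To keep the snippet self-contained it loads the iris data bundled with scikit-learn as a stand-in for your own `df_feature` and `df_target`; the fixed `random_state=0` is likewise an assumption, added only so the split is reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data: the iris dataset that ships with scikit-learn
iris = load_iris(as_frame=True)
df_feature = iris.data
df_target = iris.target

# Default split: 75% training, 25% testing. Features and target are
# passed together so their rows stay aligned.
df_feature_train, df_feature_test, df_target_train, df_target_test = train_test_split(
    df_feature, df_target, random_state=0
)

print(len(df_feature_train), len(df_feature_test))
```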

Now that we have our training set, we can look into which model to use. The biggest distinction here is whether your target is categorical or continuous. Categorical data can only take one of several distinct values (e.g. species, diabetes type, eye color). In contrast, continuous data can take any value within a range (e.g. height, blood sugar, disease severity).

Note that this distinction is occasionally ambiguous; for example, one could measure eye color using hue, which would make it continuous rather than categorical. Likewise, continuous values can be "binned" to turn them into categorical values, should that better fit the use case.
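
Binning can be sketched with pandas' `cut` function. The values, bin edges, and labels below are hypothetical and chosen purely for illustration; they are not clinical thresholds.

```python
import pandas as pd

# Made-up continuous measurements, for illustration only
blood_sugar = pd.Series([4.2, 5.6, 7.8, 11.3], name="blood sugar")

# Hypothetical bin edges and labels; each bin includes its right edge
bins = [0, 5.5, 7.0, float("inf")]
labels = ["low", "medium", "high"]

# Each value is replaced by the label of the bin it falls into
binned = pd.cut(blood_sugar, bins=bins, labels=labels)

print(binned)
```

The result is a categorical Series, which can then be treated like any other categorical target.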

### Continuous Models

Common models for working with continuous targets are as follows:
* `sklearn.linear_model.LinearRegression`: Provides a simple linear regression based prediction of your target, based on relations between the target and each of the features you provided. Very easy to interpret and extremely efficient to run, though its simplicity can lead to it missing trends other more advanced models would catch. Its full documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* `sklearn.svm.SVR`: A continuous regression implementation of the Support Vector Machine algorithm. Reasonably efficient and much more resistant to outliers than a simple linear regression, but its default `rbf` kernel makes it prone to overfitting if you are not careful. Other kernels can be used via the `kernel` argument to mitigate this somewhat, but this only helps so much. The full documentation for this tool can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)
* `sklearn.ensemble.RandomForestRegressor`: A continuous regression implementation of the Random Forest method. As an ensemble machine learning system, it is generally more resistant to overfitting than a single model, but less efficient and more difficult to interpret than the prior two. Its documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

### Categorical Models

Common models for working with categorical targets are as follows:
* `sklearn.linear_model.LogisticRegression`: The categorical equivalent to `LinearRegression` above, with the same benefits and drawbacks. Very easy to quickly train and test with, but extremely simplistic. Documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* `sklearn.svm.SVC`: A categorical-targeting implementation of the Support Vector Machine algorithm. Much like `SVR` above, it is very effective when outliers are a problem, but it can overfit with its default kernel if left unchecked. Its documentation is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* `sklearn.ensemble.RandomForestClassifier`: A Random Forest built to predict categorical targets. Being an ensemble method like `RandomForestRegressor` above, it shares the same benefits and drawbacks: resistant to overfitting, but less efficient and somewhat difficult to interpret. The documentation for this tool can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

Once you have selected a model from the list prior, we can fit it to our training data. Do so in the code block below, using the training datasets we created prior and saving the resulting model to the `model` variable:
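
Fitting follows the same pattern for every model listed above: create the model, then call `fit` with the training features and target. The sketch below uses `RandomForestClassifier` on the iris data bundled with scikit-learn as a stand-in for your own training sets; the `random_state=0` values and the `predictions` variable are assumptions added for reproducibility and illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data and split, so the snippet runs on its own
iris = load_iris(as_frame=True)
df_feature_train, df_feature_test, df_target_train, df_target_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# Create the model and fit it to the training data only
model = RandomForestClassifier(random_state=0)
model.fit(df_feature_train, df_target_train)

# The fitted model can now predict targets for unseen features
predictions = model.predict(df_feature_test)
print(predictions[:5])
```

Swapping in another model only changes the import and the line that creates `model`; the `fit` and `predict` calls stay the same.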

RandomForestClassifier()

## Step 3: Model Evaluation

We have a working model! Now all that's left is to see how it performs. Depending on whether your target is continuous or categorical, this can be done in a number of ways, many of which you may want to perform together to get an accurate understanding of the model's performance:

### Continuous Evaluations

* `sklearn.metrics.mean_squared_error`: A good general use evaluation which calculates the error of the model as the mean of the squared errors (each error being the difference between the true value and the value predicted by the model). Penalizes large errors much more heavily than small errors, making it very useful when extreme mistakes would be more hazardous to the model's use than small mistakes. A value of 0 is a perfect model, with greater values representing worse model performance. Its documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).
* `sklearn.metrics.mean_absolute_error`: Another general use evaluation, though less common than the prior, which evaluates error as the mean absolute deviation between the true and model predicted values. This results in extreme errors being penalized less severely than with the prior method, making it a better fit when every error should count in proportion to its size, or when a few outliers should not dominate the score. A value of 0 is a perfect score, with increasing values representing worsening model performance. Its documentation is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html)
* `sklearn.metrics.r2_score`: Useful in nearly all circumstances, calculates the _coefficient of determination_ (or $R^2$) of the model's predictions. This is closely related to mean squared error, but scales the error by the variance of the target, so the score does not depend on the target's units. Unlike the prior two, higher values are better, with a score of 1.0 being a perfect score and a score of 0.0 representing a 'naive' model that ignores the features and always predicts the mean of the target. The score can also be negative; if your model scores 0.0 or worse, it is not a very good model. Documentation for its use is provided [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)
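
The three metrics above can be compared on a small worked example. The `y_true` and `y_pred` values are made up for illustration; in your own analysis they would be the testing target and the output of `model.predict` on the testing features.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

# Errors are -0.5, 0.0, 1.0 and 0.5
mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - (squared error / target variance)

print(f"MSE: {mse:.3f}, MAE: {mae:.3f}, R^2: {r2:.3f}")
# → MSE: 0.375, MAE: 0.500, R^2: 0.925
```

Every scikit-learn metric takes the true values first and the predictions second, so it is worth keeping that order consistent.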

### Categorical Evaluations

* `sklearn.metrics.accuracy_score`: Arguably the simplest measure of model performance, this simply reports how often the model predicts a sample's class correctly. Naturally this means a score of 1.0 is perfect (100% accuracy) and 0 is awful (0% accuracy). Its simplicity means it does not account for the circumstances of the data or the model's use, however; for example, it can look deceptively high when one class is far more common than the others. Its documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
* `sklearn.metrics.balanced_accuracy_score`: Calculates accuracy, similar to the prior method, but accounts for target classes having different prevalence within the dataset by averaging the recall of each class. This makes its report more "fair", as mistakes on classes which are rarer in the dataset affect the final score just as much as mistakes on classes which are more common. You can view the documentation for its use [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html)
* `sklearn.metrics.roc_auc_score`: Very common in medical applications, the ROC AUC evaluates how well the model's scores separate positive samples from negative ones across every possible decision threshold, by weighing the true positive rate against the false positive rate. A score of 1.0 is perfect, while 0.5 is no better than random guessing. It is best computed from predicted probabilities or decision scores (e.g. from `model.predict_proba`) rather than hard class labels. For data with more than two classes, you must pass per-class probabilities and set the `multi_class` argument to `'ovr'` or `'ovo'`; the results are then averaged into a single score by default. Documentation for its use can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)
* `sklearn.metrics.recall_score`: Calculates the recall of the model, which measures how reliably the model finds the positive samples in the dataset (true positives divided by all actual positives). This results in the score ignoring true negatives and false positives. For data with more than two classes, pass `average=None` to get one score per class, or `average='macro'` or `average='weighted'` for a single combined score; the default `average='binary'` only works for two-class targets. Its use is documented [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)
* `sklearn.metrics.precision_score`: Calculates the precision of the model, which measures how often a sample the model labels as positive really is positive (true positives divided by all predicted positives). This results in false negatives and true negatives being ignored. For data with more than two classes, the same `average` argument as for `recall_score` applies. The documentation for its use is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)
* `sklearn.metrics.classification_report`: Provides a generalized report of several of the metrics above for convenience's sake. By default it reports the precision, recall, F1-score, and support (number of samples) for each class, along with the overall accuracy and the macro and weighted averages. Very useful if you just want a generalized overview of a model's performance, without needing to compare multiple models automatically within the code. Its documentation is provided [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

Place the code for the evaluation(s) you wish to perform in the code block below. If you think the model performed sufficiently, congrats! If not, you can use this to inform how you want to change the prior code blocks to account for the issues instead:
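
For a categorical target, an evaluation might look like the sketch below. The `y_true` and `y_pred` lists are made up for illustration; in your own analysis they would be `df_target_test` and the output of `model.predict(df_feature_test)`.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
)

# Made-up true classes and predictions, for illustration only
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 1]

# Fraction of samples predicted correctly (6 of 8 here)
accuracy = accuracy_score(y_true, y_pred)

# Mean of the per-class recalls (1, 2/3 and 2/3 here)
balanced = balanced_accuracy_score(y_true, y_pred)

# Text summary of precision, recall, F1-score and support per class
report = classification_report(y_true, y_pred)

print(f"Accuracy: {accuracy:.3f}, balanced accuracy: {balanced:.3f}")
print(report)
```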

```text
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.87      1.00      0.93        13
           2       1.00      0.86      0.92        14

    accuracy                           0.95        38
   macro avg       0.96      0.95      0.95        38
weighted avg       0.95      0.95      0.95        38
```

