# Simplified 4-Step Process for Modeling using Scikit-Learn
1. Instantiate Model
> `model = Classifier()`
2. Fit model to training data
> `model.fit(X_train, y_train)`
3. Predict on test data with the fitted model
> `pred_test = model.predict(X_test)`
4. Score the model using a metric to evaluate how well it performs
> `fbeta_score(y_test, pred_test, beta=0.5)`
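As a minimal sketch of the four steps end to end, the snippet below uses a synthetic dataset and `LogisticRegression` as a stand-in for `Classifier`. The dataset, the split size, and the choice of estimator are assumptions made for illustration; any scikit-learn classifier should follow the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 1. Instantiate the model (LogisticRegression stands in for Classifier).
model = LogisticRegression(max_iter=1000)

# 2. Fit the model to the training data.
model.fit(X_train, y_train)

# 3. Predict on the test data with the fitted model.
pred_test = model.predict(X_test)

# 4. Score the predictions; beta=0.5 weights precision more than recall.
score = fbeta_score(y_test, pred_test, beta=0.5)
print(f"F0.5 score: {score:.3f}")
```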

# Working With Missing Data
1. Remove
> We can remove (or “drop”) the rows or columns holding the missing values
2. Impute
> Replace missing values with the mean, median, mode (most frequent value), a univariate linear regression prediction, etc.
3. Work Around
> We can build models that work around the missing values and use only the information that is present

Resource: [How to Handle Missing Data](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4)

## Option 1: Removing
1. Ask "Why are the values missing?"
> Removing data can lead to biased models

Ex: If the data comes from a survey, the types of questions left unanswered may indicate different types of respondents.

It may be valuable to record, for each observation, how many values are missing or which questions have missing values:

| Q1 | Q2  | Q3  | Missing |
|----|-----|-----|---------|
| 1  | NaN | 1   | 1       |
| 4  | 4   | NaN | 1       |
| 1  | 2   | 1   | 0       |
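One way to build such a `Missing` column is sketched below with pandas. The per-question indicator columns (`Q1_missing`, etc.) are hypothetical names added for illustration and are not part of the original table.

```python
import numpy as np
import pandas as pd

# The survey answers from the table above; NaN marks an unanswered question.
df = pd.DataFrame({
    "Q1": [1, 4, 1],
    "Q2": [np.nan, 4, 2],
    "Q3": [1, np.nan, 1],
})

question_cols = ["Q1", "Q2", "Q3"]

# Number of unanswered questions per respondent.
df["Missing"] = df[question_cols].isnull().sum(axis=1)

# Optional per-question indicators recording which answers are missing.
for col in question_cols:
    df[f"{col}_missing"] = df[col].isnull().astype(int)

print(df)
```

Keeping these columns lets a model use the pattern of missingness itself as a feature, even if the original values are later dropped or imputed.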

### When is it ok to remove missing values?
1. Data entry errors
2. Mechanical errors
3. Didn't need the data
4. The missing data is in the column to be predicted
5. There is no variability in the observations

### Other Considerations
1. Drop observations
2. Drop columns
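Both options can be sketched with pandas `dropna`, assuming the data lives in a `DataFrame`. The small frame below is hypothetical and only illustrates how much data each choice discards.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in Q2 and one in Q3.
df = pd.DataFrame({
    "Q1": [1, 4, 1],
    "Q2": [np.nan, 4, 2],
    "Q3": [1, np.nan, 1],
})

# Drop observations: keep only rows with no missing values.
rows_dropped = df.dropna(axis=0)

# Drop columns: keep only columns with no missing values.
cols_dropped = df.dropna(axis=1)

print(rows_dropped.shape)  # (1, 3): two of the three rows are lost
print(cols_dropped.shape)  # (3, 1): two of the three columns are lost
```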

## Option 2: Imputing

Be very cautious of the BIAS you introduce into any model that uses imputed values.

Though imputing values is very common and often leads to better predictive power in machine learning models, it can lead to **overgeneralization**.

* By imputing, you are diluting the importance of that feature
* Variability in features is what allows you to use them to predict another variable well

**Common Methods:**
1. Mean
2. Median
3. Mode
> Especially with categorical data
4. Impute 0 or $\infty$
> Small or large value to differentiate missing values from the others
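The common methods above might look like this with pandas `fillna`. The column names and values are hypothetical, chosen only to make each result easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one categorical column with gaps.
df = pd.DataFrame({
    "height": [60.0, 62.0, np.nan, 70.0, 62.0],
    "color": ["red", None, "blue", "red", "red"],
})

# Mean and median imputation for a numeric column.
height_mean = df["height"].fillna(df["height"].mean())
height_median = df["height"].fillna(df["height"].median())

# Mode imputation, often used for categorical data.
color_mode = df["color"].fillna(df["color"].mode()[0])

# Sentinel imputation: 0 (or np.inf) marks values that were missing.
height_zero = df["height"].fillna(0)
```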

**More advanced:**
1. [K-Nearest Neighbors](https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf)
> Impute values based on features that are most similar
2. [AMELIA](https://cran.r-project.org/web/packages/Amelia/Amelia.pdf)
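For K-Nearest Neighbors imputation, scikit-learn provides `KNNImputer`. The sketch below uses a small hypothetical array and `n_neighbors=2`; both are assumptions for illustration, and the right number of neighbors depends on the data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data; np.nan marks the missing entry.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [9.0, 8.0, 7.0],
    [1.2, 1.9, 3.2],
])

# Fill each missing value with the mean of that feature over the
# 2 nearest rows, where distance uses only the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Here the second row is close to the first and fourth rows, so its missing value is filled from those two rather than from the distant third row.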

**PROS OF IMPUTING**
* You **ARE NOT** removing rows

**CONS OF IMPUTING**
* You **ARE** diluting the power of your features to predict well by reducing variability in those features
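The loss of variability can be checked directly. In this sketch (the values are hypothetical), mean imputation leaves the mean unchanged but shrinks the variance, because every imputed entry sits exactly at the center of the distribution.

```python
from statistics import mean, pvariance

# Hypothetical feature with two missing entries (None).
values = [50.0, None, 60.0, 70.0, None, 80.0]

observed = [v for v in values if v is not None]
fill = mean(observed)

# Mean imputation: every missing entry gets the same value.
imputed = [fill if v is None else v for v in values]

var_observed = pvariance(observed)
var_imputed = pvariance(imputed)

print(f"variance before imputing: {var_observed:.2f}")
print(f"variance after imputing:  {var_imputed:.2f}")
```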

Note that in the image below, the pink values were missing and were replaced with the mean of their column. However, `child height` at `42` and at `57` are vastly different. Because of imputation, these two observations are now identical apart from the value to be predicted.

<img src='impute_0.png'>

In general, you should take care to understand the real-world implications of missing data and the reasons why the values are missing. At the same time, these solutions are very quick, and they enable you to get models off the ground. You can then iterate on your feature engineering to be more careful as time permits.

# Encoding of Categorical Variables
**One Hot Encoding (dummy variables, indicator variables)**

Rule of thumb:
* For each factor added (p -> p+1), there should be at least 10 observations per factor (i.e. if you add two factors by including dummy variables, n should be at least 10x the number of variables, p, that the data has after including the new variables)

Ex.
* n=100, p = 3
> currently 33 1/3 observations per factor<br>
* if a variable with 5 categories/levels is added -> n=100, p = 8 (assuming the base level is not dropped, which it should be), no issue
> each factor now has 100 / 8 = 12.5 observations<br>
* if a variable with 8 categories/levels is added -> n=100, p = 11 (assuming the base level is not dropped, which it should be), there may be an issue
> each factor now has 100 / 11 ≈ 9.1 observations; even if the base case is dropped, this is still a potential issue

Another rule of thumb:
* There should be at least $\sqrt{n}$ observations for any factor
> this leads to an equivalent number of observations for each factor
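Both rules of thumb reduce to simple arithmetic on n and p. The helper names below (`observations_per_factor`, `passes_rules`) are hypothetical, and the check reads each rule as "observations per factor must be at least the threshold", which is one interpretation of the notes above.

```python
import math


def observations_per_factor(n: int, p: int) -> float:
    """Average number of observations available per factor."""
    return n / p


def passes_rules(n: int, p: int, min_per_factor: int = 10) -> dict:
    """Check n / p against the 10-per-factor and sqrt(n) thresholds."""
    per_factor = observations_per_factor(n, p)
    return {
        "per_factor": per_factor,
        "at_least_10": per_factor >= min_per_factor,
        "at_least_sqrt_n": per_factor >= math.sqrt(n),
    }


n = 100
for p in (3, 8, 11):
    print(p, passes_rules(n, p))
```

For n = 100 the two thresholds coincide, since the square root of 100 is 10; for other sample sizes the two rules give different cutoffs.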


# Overfitting
* Overfitting occurs when we build a model that performs well on data it has seen before but does not predict well in new situations
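A rough way to see this gap is to compare accuracy on the training data with accuracy on held-out data. The sketch below assumes a synthetic noisy dataset and an unrestricted `DecisionTreeClassifier`, both chosen for illustration because such a tree can memorize its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data where 20% of samples get a randomly assigned label.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# An unrestricted tree can memorize the training set, noise included.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large drop from training accuracy to test accuracy is the usual symptom; limiting model complexity (for example with `max_depth`) is one common way to narrow it.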