This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises. Both are equally important for preparing you for the sprint challenge and for speaking about these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there on your sprint challenge (SC) or during an interview. That being said, don't hesitate to ask for help if you are truly stuck.

Have fun studying!

# Resources

[Category Encoders](https://contrib.scikit-learn.org/categorical-encoding/)

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

[Hyperparameter Tuning](https://scikit-learn.org/stable/modules/grid_search.html)

[Confusion Matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)

[Scoring Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)

In [1]:
import pandas as pd

Use the dataframe below for all questions unless otherwise stated.

In [2]:
# https://www.kaggle.com/ronitf/heart-disease-uci
df = pd.read_csv('https://raw.githubusercontent.com/bundickm/Study-Guides/master/data/hearts.csv')
df.head()

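| Unnamed: 0.1 | Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 63 | Male | D | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 1 | 37 | Male | C | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 2 | 41 | Female | B | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 3 | 56 | Male | B | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 4 | 57 | Female | A | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |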


# Basics and Data Preparation

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

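**Logistic Regression:** `A linear model whose output is passed through a sigmoid to give the probability of a class. It is used for classification, not for predicting continuous values.`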

**Imbalanced Classes:** `When one class of the target has far more observations than another.`

**Leakage:** `When the model is trained with information it would not have at prediction time, such as test-set data or a feature derived from the target.`

**Categorical Encoding:** `A way to transform categorical values into numerical values a model can use.`

**Skew:** `Asymmetry in a distribution of values, and the direction of that asymmetry: a long tail to the right (positive skew) or to the left (negative skew).`

**Log Transformation:** `Replacing each value with its logarithm, which compresses large values and makes a right-skewed distribution more symmetric.`

**Outliers:** `Extreme values, far from the rest of the data, that can distort summary statistics and throw off a model's accuracy.`

Answer the following questions in plain English as much as possible.
<br/><br/>

What are some ways to deal with imbalanced classes?
```
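Resampling (oversampling the minority class or undersampling the majority class), changing the weight of either the majority or minority class, and scoring with metrics other than accuracy, such as precision, recall, or ROC-AUC.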
```
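As a minimal sketch of the class-weight approach, the example below compares a plain logistic regression with one that uses `class_weight="balanced"`. The data is synthetic (generated with `make_classification`, roughly 95% class 0), an assumption made so the block runs on its own; it is not the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): about 95% class 0 and 5% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=42
)

# Same model twice; the second reweights classes inversely to their frequency.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

# Recall on the minority class shows how many positives each model finds.
recall_plain = recall_score(y_val, plain.predict(X_val))
recall_weighted = recall_score(y_val, weighted.predict(X_val))
print(f"recall without weights: {recall_plain:.2f}")
print(f"recall with balanced weights: {recall_weighted:.2f}")
```

Reweighting usually trades some precision for recall on the minority class, so it is worth checking both metrics before deciding.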

What are some possible sources of data leakage?
```
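Exploring or preprocessing the combined data (for example, fitting an encoder or scaler) before a train_test_split, or including a feature that is derived from the target or recorded after the outcome is known.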
```
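One common way to avoid preprocessing leakage is to split first and put the preprocessing inside a pipeline, so it is fit on the training rows only. The sketch below assumes synthetic data from `make_classification` and a `StandardScaler` as the preprocessing step; both are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the heart dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Split BEFORE any fitting so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler and the model on the training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# The scaler's learned means match the training data, not the full dataset.
scaler = pipe.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))

# At scoring time the test rows are only transformed, never fit.
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```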

What are some indicators or methods for detecting data leakage?
```
Your Answer Here
```

What is the relationship between skew and log transformation?
```
Your Answer Here
```
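To explore the question above, you can measure skew before and after a log transformation. This sketch assumes a made-up, right-skewed series drawn from a log-normal distribution and uses `np.log1p` (the log of 1 + x), which also handles zeros.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values (an assumption): most are small, a few are large.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# pandas reports sample skewness; positive means a long right tail.
skew_before = values.skew()

# log1p compresses large values much more than small ones.
skew_after = np.log1p(values).skew()

print(f"skew before: {skew_before:.2f}")
print(f"skew after log1p: {skew_after:.2f}")
```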

Using the dataset above, complete the following:
- Train/Test/Validation Split
- Get a baseline
- Perform EDA with visuals
- Clean up any nulls, duplicate columns, or outliers you might find
- Engineer at least 2 features
- Use One Hot or Ordinal Encoding on one feature
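A few of the steps above (the three-way split, the baseline, and one-hot encoding) might be sketched as follows. The `demo` dataframe is a hypothetical, randomly generated stand-in that borrows some column names from the dataset; swap in `df` for the real exercise. scikit-learn's `OneHotEncoder` is used here as one option; the Category Encoders library linked above is another.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data (random values, not the real heart dataset).
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "age": rng.integers(29, 78, size=n),
    "sex": rng.choice(["Male", "Female"], size=n),
    "cp": rng.choice(["A", "B", "C", "D"], size=n),
    "chol": rng.integers(126, 565, size=n),
    "target": rng.integers(0, 2, size=n),
})

# 60/20/20 split: hold out the test set first, then carve validation from train.
train, test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["target"]
)
train, val = train_test_split(
    train, test_size=0.25, random_state=42, stratify=train["target"]
)

# Baseline: always predict the majority class seen in the training set.
majority_class = train["target"].mode()[0]
baseline_accuracy = (val["target"] == majority_class).mean()
print(f"baseline validation accuracy: {baseline_accuracy:.2f}")

# One-hot encode "cp": fit on train only, then reuse the encoder on validation.
encoder = OneHotEncoder(handle_unknown="ignore")
cp_train = encoder.fit_transform(train[["cp"]]).toarray()
cp_val = encoder.transform(val[["cp"]]).toarray()
print(encoder.categories_[0], cp_train.shape)
```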

# Model Building

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Decision Tree:** `Your Answer Here`

**Ensemble Methods (Ensemble Models):** `Your Answer Here`

**Gradient Descent:** `Your Answer Here`

**Bagging:** `Your Answer Here`

**Boosting:** `Your Answer Here`

**Hyperparameters:** `Your Answer Here`

Build a random forest classifier using the dataset you cleaned and prepped above.

Graph your model's feature importances.
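One possible shape for these two steps is sketched below. It assumes synthetic data from `make_classification` with hypothetical column names (`feature_0` to `feature_5`) in place of your cleaned dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical column names (an assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=1
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

# Fit the forest on the training rows only.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
val_accuracy = forest.score(X_val, y_val)

# Impurity-based importances, labeled by column and sorted for plotting.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values()
print(importances)
print(f"validation accuracy: {val_accuracy:.2f}")
```

With matplotlib installed, `importances.plot.barh()` turns the sorted series into a horizontal bar chart.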

In 2-3 sentences, explain how to interpret and use the feature importances to further refine or help explain your model.

```
Your Answer Here
```

How does feature importance differ from drop-column importances and permutation importances?

```
Your Answer Here
```
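To compare the two kinds of importance yourself, scikit-learn provides `permutation_importance` in `sklearn.inspection`. The sketch below assumes synthetic data and a small random forest; it shuffles each validation column in turn and records how much the score drops.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical column names (an assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=1
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# Shuffle each column of the validation set 10 times and average the score drop.
result = permutation_importance(
    forest, X_val, y_val, n_repeats=10, random_state=0
)

# Put both measures side by side for comparison.
comparison = pd.DataFrame({
    "impurity_importance": forest.feature_importances_,
    "permutation_importance": result.importances_mean,
}, index=X.columns)
print(comparison)
```

Unlike the impurity-based values, permutation importances are computed on held-out data and can be zero or slightly negative for features the model does not rely on.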

Build a logistic regression model using the dataset you cleaned and prepped above.

Plot the coefficients of your model.
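A rough sketch of these two steps follows, again on synthetic data with hypothetical column names. Scaling the features first is an assumption made here so that coefficient sizes are comparable across features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names (an assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=2
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

# Standardize, then fit the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# For a binary target, coef_ has one row: one coefficient per feature.
log_reg = model.named_steps["logisticregression"]
coefficients = pd.Series(log_reg.coef_[0], index=X.columns).sort_values()
print(coefficients)
```

A positive coefficient raises the predicted log-odds of the positive class as the feature increases, and a negative one lowers it. With matplotlib installed, `coefficients.plot.barh()` draws the plot.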

In 2-3 sentences, explain how to interpret and use the coefficients to further refine or help explain your model.

```
Your Answer Here
```

What is an example of an ensemble method?

```
Your Answer Here
```

What do we mean by hyperparameter tuning and how can we automate the tuning process?

```
Your Answer Here
```
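One way to automate tuning is a cross-validated grid search. The sketch below assumes synthetic data and a small, arbitrary grid over two `DecisionTreeClassifier` hyperparameters; `RandomizedSearchCV` is a similar tool that samples from the grid instead of trying every combination.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the heart dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=3)

# Arbitrary example grid: 3 x 3 = 9 combinations.
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 5, 20],
}

# Each combination is scored with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```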

# Metrics and Model Evaluation

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**ROC:** `Your Answer Here`

**ROC-AUC:** `Your Answer Here`

**Discrimination Threshold:** `Your Answer Here`

**Precision:** `Your Answer Here`

**Recall:** `Your Answer Here`

**F1 Score:** `Your Answer Here`

**Confusion Matrix:** `Your Answer Here`

Choose one of your models above to complete the following:
- Get your model's validation accuracy (This may be done multiple times if you are refining your model)
- Get your model's test accuracy
- Create a confusion matrix for your model
- Calculate the Accuracy, F1 Score, Precision, and Recall by hand
- Use scikit-learn to calculate accuracy, F1 score, precision, and recall to confirm your work.
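The last three steps above can be sketched on a tiny example. The labels and predictions below are made up for illustration; with your own model you would use your validation or test labels and `model.predict(...)` instead.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels and predictions, made up for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# The four metrics "by hand" from the confusion matrix counts.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Confirm against scikit-learn's implementations.
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```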

Give an example of when we would use precision to score our model and explain why precision is the best metric for that situation.

```
Your Answer Here
```

Give an example of when we would use recall to score our model and explain why recall is the best metric for that situation.

```
Your Answer Here
```

Find your model's ROC-AUC Score.

Plot your model's ROC Curve.
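A minimal sketch of both steps, assuming synthetic data and a logistic regression in place of your own model. Note that ROC metrics are computed from predicted probabilities, not from the hard 0/1 predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model (assumptions, not the heart dataset).
X, y = make_classification(n_samples=600, n_features=8, random_state=4)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=4)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class for each validation row.
probs = model.predict_proba(X_val)[:, 1]

# Single-number summary: the area under the ROC curve.
roc_auc = roc_auc_score(y_val, probs)

# Points on the curve: one (fpr, tpr) pair per discrimination threshold.
fpr, tpr, thresholds = roc_curve(y_val, probs)
area_from_curve = auc(fpr, tpr)

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"area from curve points: {area_from_curve:.3f}")
```

To draw the curve, plot `fpr` on the x-axis against `tpr` on the y-axis with your plotting library of choice, for example `plt.plot(fpr, tpr)` in matplotlib.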