## Introduction

Welcome to the Guided Project for the Logistic Regression Modeling in Python course! We have learned a lot about logistic regression and classification in the past four lessons, and it's about time that we use this knowledge on a real-world dataset.

As with the linear regression guided project, we'll also be looking at a real-life dataset: the [Heart Disease Data Set](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) from the UCI Machine Learning Repository. This dataset comes from the famous Cleveland Clinic Foundation, which recorded information on various patient characteristics, including age and chest pain, to try to classify the presence of heart disease in an individual. This is a prime example of how machine learning can help solve problems that have a real impact on people's lives.

We'll practice going through the machine learning pipeline, starting from examining the dataset itself to creating a polished classification model. Classification problems are much more common than regression problems, so it'll be good to get some practice.

Let's get started!

## Instructions
You may work on this guided project in a personal Jupyter notebook or Python coding environment of your choice, but feel free to use the Dataquest interface. Going through a guided project on your own local machine makes it more convenient to share and iterate on. The heart_disease.csv file is currently loaded into the interface, and you can download it here to have on your own machine.

- Load in the pandas library with `pd` as the alias.
- To get started, let's load the heart disease dataset in so that we can start examining it:
  - Use the `read_csv()` function to read in the data, and name the result `heart`.

Note: we have partially cleaned the dataset so that the task is binary classification. The original dataset has multiple classes, so you can download it from the UCI site if you'd like to attempt the guided project on multiple classes instead. For this guided project, we will focus on the binary classification case.
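
As a minimal sketch of the loading step, the snippet below reads a CSV into a DataFrame named `heart`. So that it runs without the course file, it reads a few made-up rows from an in-memory buffer. The column names shown, including the outcome column `present`, are assumptions for illustration and may not match the real file.

```python
import io

import pandas as pd

# Stand-in for heart_disease.csv: three made-up rows so the sketch runs anywhere.
# The column names (including the outcome column "present") are assumptions.
csv_text = io.StringIO(
    "age,sex,chol,present\n"
    "63,1,233,0\n"
    "67,1,286,1\n"
    "41,0,204,0\n"
)

# In the project itself, pass the file path instead:
# heart = pd.read_csv("heart_disease.csv")
heart = pd.read_csv(csv_text)

print(heart.shape)   # (number of rows, number of columns)
print(heart.head())  # first few rows for a quick look
```

Checking `heart.shape` and `heart.head()` right after loading is a quick way to confirm that the file was parsed as expected.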

## Exploring the Dataset

Before we build any model, we should explore the dataset and perform any adjustments we might need before actually fitting the model. This may include converting categorical variables into dummy variables or centering and scaling variables. We'll also want to check for predictors that are distributed differently based on the outcome, since they could be informative for classification.

Take some time to explore the `heart` dataset. We'll ultimately want to pick out some predictors to potentially use in a logistic regression model.

## Instructions
- What are the columns that are present in the dataset?
- Consult the official page on the dataset to read about it a bit more.
- What are the data types for each of the columns? Are any of them worth transforming or converting into dummy variables?
- Check the relationships between the potential predictor variables and the outcome via plots (e.g., histograms). See if stratifying by heart disease shows a meaningful difference in the distribution of the predictors:
  - The `boxplot()` and `hist()` methods may come in handy here.
- Select a set of predictors to use in your predictive model.
- Summarize these findings in a short paragraph before moving on to the next screen.
  - Why did you choose to include your predictors in the model?
  - Were there data-specific reasons? Domain knowledge?
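
The exploration steps above can be sketched numerically as follows. The data here is synthetic and generated inside the snippet, and the column names `age`, `cp`, and `present` are assumptions. The point is only to show the pandas calls, not to suggest what the real data looks like.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the heart data; names and values are hypothetical.
rng = np.random.default_rng(0)
n = 200
present = rng.integers(0, 2, n)  # 0 = non-case, 1 = case
heart = pd.DataFrame({
    "age": rng.normal(52, 9, n) + 4 * present,  # cases made slightly older
    "cp": rng.integers(1, 5, n),                # categorical code 1-4
    "present": present,
})

# Column names and data types.
print(heart.dtypes)

# Numeric predictor: compare its distribution between cases and non-cases.
by_outcome = heart.groupby("present")["age"].describe()
print(by_outcome)

# Categorical predictor: share of each category within each outcome group.
cp_table = pd.crosstab(heart["cp"], heart["present"], normalize="columns")
print(cp_table)

# Convert the categorical code into dummy variables, dropping one level
# to serve as the reference category.
heart_dummies = pd.get_dummies(heart, columns=["cp"], drop_first=True)
print(heart_dummies.columns.tolist())
```

The same comparison can be made graphically with `heart.boxplot(column="age", by="present")` or `heart.hist(column="age", by="present")`, both of which require matplotlib.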

## Dividing the Data

Now that we have some predictors, we need to set aside some data to act as a final assessment for our model. We'll need the following:

- A training set that will be used to estimate the regression coefficients
- A test set that will be used to assess the predictive ability of the model

The model will be fit to the training set, and predictive ability will be assessed on the test set. We'll need to make sure that both sets contain both cases and non-cases.

Let's take the time to divide up the data properly.

## Instructions
- Decide what percentage of the `heart` dataset will be used for the training set. The remaining proportion will be used for the test set.
- Import the `train_test_split()` function from the `model_selection` submodule in `sklearn`.
- Using this proportion, divide up the `heart` data into a training set and a test set:
  - Make sure to set a random seed to make your results reproducible.
- Check that both the training and test datasets have cases and non-cases. If not, then select a new seed until this is the case.
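
A minimal sketch of the split, assuming a 75/25 proportion and hypothetical columns `age`, `thalach`, and `present` on synthetic data. It also passes `stratify=y`, an optional addition not required by the instructions, which keeps the case proportion similar in both sets so that re-drawing seeds is rarely needed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data; names and values are hypothetical.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(55, 9, n)
thalach = rng.normal(150, 20, n)
log_odds = 0.08 * (age - 55) - 0.04 * (thalach - 150)
present = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
heart = pd.DataFrame({"age": age, "thalach": thalach, "present": present})

X = heart[["age", "thalach"]]
y = heart["present"]

# 75% training, 25% test; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Confirm that both sets contain cases and non-cases.
print(y_train.value_counts())
print(y_test.value_counts())
```

Without `stratify`, the split is purely random, and the check with `value_counts()` is what tells you whether a different seed is needed.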

## Building the Model

With our `heart` dataset divided up, let's build the classification model and do some initial assessments. These are some guiding questions that you should think about:

- What is the overall training accuracy? Sensitivity and specificity?
- Does the model perform better on cases or non-cases? Or does it perform equally well?

These training metrics are optimistic estimates of how the model performs, because they are computed on the same data that was used to fit it. We should expect slightly worse metrics on new data, even if the model generalizes well. If the training metrics look too high, it might be a sign that our model is starting to overfit.

## Instructions
- Construct the logistic regression model using only the training set.
- Calculate the accuracy, sensitivity and specificity of the model.
- Write some notes about what you observe from these measures of model quality.
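
One way to carry out these steps is sketched below with scikit-learn's `LogisticRegression` and `confusion_matrix`. The data is synthetic, and the predictor names `age` and `thalach` and the outcome name `present` are assumptions. Sensitivity is the proportion of true cases predicted as cases, and specificity is the proportion of true non-cases predicted as non-cases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data; names and values are hypothetical.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(55, 9, n)
thalach = rng.normal(150, 20, n)
log_odds = 0.08 * (age - 55) - 0.04 * (thalach - 150)
present = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
heart = pd.DataFrame({"age": age, "thalach": thalach, "present": present})

X_train, X_test, y_train, y_test = train_test_split(
    heart[["age", "thalach"]], heart["present"],
    test_size=0.25, random_state=42, stratify=heart["present"],
)

# Fit on the training set only; max_iter is raised because the
# predictors are not scaled and the solver may need more iterations.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Training-set confusion matrix, unpacked as TN, FP, FN, TP.
train_pred = model.predict(X_train)
tn, fp, fn, tp = confusion_matrix(y_train, train_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f"accuracy={accuracy:.3f}")
print(f"sensitivity={sensitivity:.3f}")
print(f"specificity={specificity:.3f}")
```

Comparing sensitivity with specificity answers the question of whether the model does better on cases or non-cases.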

## Interpreting the Model Coefficients

Now that we've created our model, let's look at the coefficients to see if they make sense, given the problem. Recall that the logistic regression relates the binary outcome to the linear combination of predictors via the link function:

$\log\left(\frac{E[Y]}{1-E[Y]}\right)=\beta_0+\beta_1 X$

The predictors affect the outcome on the log-odds scale. The non-intercept coefficients represent the log-odds ratio for a unit increase in a predictor:

$\log\left(\frac{O_1}{O_0}\right)=\beta_1$

Here, $O_0$ represents the odds of the outcome when the predictor is `0`, and $O_1$ represents the odds when the predictor is `1`. However, we're usually interested in examining these effects on the odds scale, so we exponentiate both sides to get the following:

$O_1=e^{\beta_1}O_0$
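
To make the odds-scale relationship concrete, here is a small numeric check. The coefficient values are hypothetical and chosen only for illustration. The snippet computes the probability at predictor values 0 and 1, converts each to odds, and confirms that their ratio equals the exponentiated slope.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
beta_0 = -1.0
beta_1 = 0.7


def probability(x):
    """Probability of the outcome from the logistic model at predictor value x."""
    return 1 / (1 + math.exp(-(beta_0 + beta_1 * x)))


def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)


odds_0 = odds(probability(0))  # odds when the predictor is 0
odds_1 = odds(probability(1))  # odds when the predictor is 1

# The ratio of the two odds should match exp(beta_1).
print(odds_1 / odds_0)
print(math.exp(beta_1))
```

A positive coefficient therefore multiplies the odds by a factor greater than 1, and a negative coefficient by a factor between 0 and 1.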

Let's see what our chosen predictors suggest about their relationship with heart disease.

## Instructions
- Examine the coefficients of your logistic regression model on both the log-odds and odds scales.
- Make some notes on what the coefficients suggest about the effects of the predictors. Do these coefficients seem to make sense?
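
A sketch of how the coefficients might be inspected with scikit-learn, assuming synthetic data and hypothetical predictor names. The fitted `coef_` attribute holds the coefficients on the log-odds scale, and `np.exp` converts them to the odds scale. Note that `LogisticRegression` applies L2 regularization by default, so its coefficients are shrunk somewhat toward zero compared with an unpenalized fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data; names and values are hypothetical.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(55, 9, n)
thalach = rng.normal(150, 20, n)
log_odds = 0.08 * (age - 55) - 0.04 * (thalach - 150)
present = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
X = pd.DataFrame({"age": age, "thalach": thalach})
y = pd.Series(present, name="present")

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# coef_ has shape (1, n_features) for a binary outcome.
coef_table = pd.DataFrame(
    {
        "log_odds": model.coef_[0],
        "odds_ratio": np.exp(model.coef_[0]),
    },
    index=X.columns,
)
print(coef_table)
print("intercept (log-odds):", model.intercept_[0])
```

An odds ratio above 1 suggests that higher values of the predictor are associated with higher odds of heart disease, holding the other predictors fixed. A value below 1 suggests the opposite.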

## Final Model Evaluation

Finally, we can assess the predictive ability of our logistic regression model.

## Instructions
- Use the model to calculate the test predictions.
- Calculate the accuracy, sensitivity, and specificity of these predictions.
- How do these values compare to what was calculated from the training set?
- Write down some conclusions on what you observe.
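
The test-set evaluation can be sketched as below, again on synthetic data with hypothetical column names. Here sensitivity and specificity are computed with `recall_score`, using the case label and the non-case label as `pos_label` respectively. The helper `summarize` is introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data; names and values are hypothetical.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(55, 9, n)
thalach = rng.normal(150, 20, n)
log_odds = 0.08 * (age - 55) - 0.04 * (thalach - 150)
present = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
heart = pd.DataFrame({"age": age, "thalach": thalach, "present": present})

X_train, X_test, y_train, y_test = train_test_split(
    heart[["age", "thalach"]], heart["present"],
    test_size=0.25, random_state=42, stratify=heart["present"],
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def summarize(y_true, y_pred):
    """Hypothetical helper: accuracy, sensitivity, and specificity as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred, pos_label=1),
        "specificity": recall_score(y_true, y_pred, pos_label=0),
    }


# Put training and test metrics side by side for comparison.
metrics = pd.DataFrame({
    "train": summarize(y_train, model.predict(X_train)),
    "test": summarize(y_test, model.predict(X_test)),
})
print(metrics.round(3))
```

Placing the two columns next to each other makes it easy to see whether the test metrics fall noticeably below the training metrics.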

## Drawing Conclusions

We hope that the process of going from dataset to model or set of models is starting to feel more natural. As you learn more machine learning, continue to refine your own personal process. Take some time to review all of the notes that you've made during the predictive modeling process.

## Instructions
Answer the following questions in your write-up:

- Does the model make sense when considering its interpretation? Does it seem to match up with what you might expect?
- Does the model seem to predict the cases or non-cases better than the other? Why might this be the case, based on your model?
- How would you interpret the accuracy for the model? Does this accuracy seem acceptable for use in an actual clinical setting?