# Capstone Regression Project

## Introduction 

In this notebook you'll perform a linear regression analysis and report the findings of your model, including both predictive model performance metrics and interpretation of fitted model parameters.

## Objectives

You will be able to:

* Describe the business use case solved by your analysis
* Complete data cleaning and exploratory data analysis using Pandas and/or SQL
* Perform a linear regression analysis starting with a baseline model and then making improvements
* Evaluate your models and interpret their predictive performance metrics
* Apply an inferential lens to interpret relationships between variables identified by the top model

## Business Understanding

*Task*: (Five to ten sentences) Explain the context that someone with little to no background in your topic needs in order to understand the business goals of your project. Discuss the current issues that your analysis can address.

*YOUR ANSWER HERE*

## Data Understanding

*Task*: (Five to ten sentences) Where does your data come from? What are you trying to predict? What limitations does this data have? 

*YOUR ANSWER HERE*


## Loading the Data

*Task*: (Three to five sentences) Explain how to load the data using Python. Clearly address which packages you are using and include the code to `import` those packages below.

*YOUR ANSWER HERE*

In [194]:
# *YOUR ANSWER HERE for importing packages*
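
As a rough sketch of what loading tabular data can look like, the cell below reads CSV text with `pandas.read_csv`. The column names (`price`, `sqft`, `bedrooms`) and the in-memory CSV are made up for illustration; in your own project you would pass the path to your file instead.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (illustrative data only). In a real project
# you would pass a file path, e.g. a hypothetical "data/my_data.csv".
csv_text = io.StringIO(
    "price,sqft,bedrooms\n"
    "250000,1400,3\n"
    "310000,1850,4\n"
    "199000,1100,2\n"
)

df = pd.read_csv(csv_text)
print(df.shape)
print(df.head())
```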

## Analysis Requirements

### 1. Data Exploration and Cleaning

During the data exploration phase, the datatypes of columns should be checked, the distribution of the target variable should be inspected, null values should be either removed or replaced, and duplicates (in most cases) should be dropped. 
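
The checks described above might look something like the following on a small toy dataframe. The data and column names are invented, and filling nulls with the median is only one possible choice; the right cleaning decisions depend on your dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value and one duplicated row (illustrative only).
df = pd.DataFrame({
    "price": [250000.0, 310000.0, 199000.0, 310000.0, 275000.0],
    "sqft": [1400.0, 1850.0, np.nan, 1850.0, 1600.0],
})

print(df.dtypes)               # datatype of each column
print(df["price"].describe())  # numeric summary of the target's distribution

null_counts = df.isnull().sum()            # null values per column
n_duplicates = int(df.duplicated().sum())  # fully duplicated rows

# One possible cleaning choice: drop duplicates, then fill nulls with the median.
clean = df.drop_duplicates().copy()
clean["sqft"] = clean["sqft"].fillna(clean["sqft"].median())
```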

### 2. Create a Baseline Model

To judge how well a simple linear regression model captures the dependent variable, you will begin by creating a model that predicts the mean of the dependent variable for every observation. Predicting the mean of your target variable is a highly naive model. If a simple linear regression model performs worse than this naive approach, you can safely say that it is not a very good model. The same can be said of multiple linear regression models.

### 3. Interpret a Correlation Heatmap

To develop a simple linear regression model, you will identify the independent variable that is most correlated with your dependent variable. To do this, you will plot a correlation heatmap and read off the variable most correlated with your target.
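
Alongside the heatmap, the same information can be read numerically. The sketch below ranks features by the absolute value of their correlation with the target; the dataframe and the target name `price` are assumptions made for illustration.

```python
import pandas as pd

# Illustrative data; replace with your own dataframe and target column.
df = pd.DataFrame({
    "price": [200, 250, 310, 330, 400, 420],
    "sqft": [1000, 1200, 1500, 1600, 2000, 2100],
    "age": [30, 12, 25, 8, 15, 20],
})
target = "price"

# Correlation of every other column with the target, strongest first.
corr_with_target = df.corr()[target].drop(target)
ranked = corr_with_target.abs().sort_values(ascending=False)
most_correlated = ranked.index[0]

print(ranked)
print("Most correlated feature:", most_correlated)
```

Taking the absolute value matters because a strong negative correlation is just as informative as a strong positive one.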

### 4. Build a Simple Linear Regression Model

Now, create a linear regression model where the most correlated feature is used as the independent variable and the dependent variable is properly set. 

### 5. Interpret the Simple Linear Regression Model

Once the model has been fit, the coefficient for your independent variable, its p-value, and the coefficient confidence interval should be interpreted. You should ask yourself whether or not the relationship your model is finding seems plausible. 

### 6. Evaluate the Simple Linear Regression Model

Before you can make a final assessment of your model, you need to compare its metrics with the baseline model created in step two, and you need to check the assumptions of linear regression.

### 7. Build a Multiple Linear Regression Model

Now, create a multiple linear regression model by adding at least one feature to the model beyond the most correlated one.

### 8. Interpret this Multiple Linear Regression Model

Interpret the coefficients for your independent variables, their p-values, and the coefficient confidence intervals. As before, you should ask yourself whether or not the relationships your model is finding seem plausible. 

### 9. Evaluate the Multiple Linear Regression Model

Compare the metrics with the baseline model created in step two and the simple linear regression model built in step four.

### 10. Decide on how next to proceed

Explore other multiple linear regression models with a variety of features. Use inferential statistics and your business understanding to talk about reasons for including particular features and compare your final model back to the (1) baseline, (2) simple linear regression, and (3) initial multiple linear regression model built in step seven.



# 1. Data Exploration and Cleaning

**Note:** Some steps have been given here to get you going with your analysis. Please add in any additional cleaning steps, data visualizations, and exploratory analysis you've done in Python here.

Inspect the dataframe by outputting the first five rows.

In [198]:
# Replace None with your code

None

Produce high-level descriptive information about your training data

In [200]:
# Replace None with your code

None

Display the number of null values for each column

In [202]:
# Replace None with your code

None

Check the datatypes of the columns in the dataframe. 
> Remember, the target column and any columns you use as independent variables *must* have a numeric datatype. After inspecting the datatypes of the columns, convert columns to numeric where necessary. 
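
A minimal sketch of one way to convert a text column to numeric is shown here, assuming the values contain thousands separators and the occasional unparseable entry. The data is invented; `errors="coerce"` turns anything that cannot be parsed into `NaN`, which you would then handle as a null value.

```python
import pandas as pd

# Illustrative data: "price" was read in as text.
df = pd.DataFrame({
    "price": ["250000", "310,000", "unknown"],
    "sqft": [1400, 1850, 1100],
})

# Strip thousands separators, then convert; unparseable entries become NaN.
df["price"] = pd.to_numeric(
    df["price"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(df.dtypes)
print(df)
```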

In [209]:
# Replace None with your code

None

In the cell below, output the number of duplicate rows in the dataframe. If duplicates are found, drop them.

In [211]:
# Replace None with your code

None

Visualize the distribution of the dependent variable

In [213]:
# Replace None with your code

None

### What additional stories can you tell by exploring your data? Add them below.

# 2. Create a Baseline Model

Below, create a baseline model by using the mean of the response.

Now that you have baseline predictions, you can use the predictions to calculate metrics about the model's performance. 
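
As a hedged illustration, a mean-only baseline and its metrics could be computed as follows. The target values are made up; the names `baseline_r2` and `baseline_rmse` match the ones printed later in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative target values; use your own target column instead.
y = np.array([200.0, 250.0, 310.0, 330.0, 400.0])

# The baseline predicts the mean of the target for every observation.
baseline_preds = np.full_like(y, y.mean())

baseline_r2 = r2_score(y, baseline_preds)
baseline_rmse = np.sqrt(mean_squared_error(y, baseline_preds))

print("Baseline R^2: ", baseline_r2)
print("Baseline RMSE:", baseline_rmse)
```

When the mean is computed from the same data it is evaluated on, the baseline's $R^2$ is 0 by construction, and its RMSE equals the (population) standard deviation of the target.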

In [None]:
from sklearn.metrics import r2_score, mean_squared_error


**Interpret the resulting metrics for the baseline model.**

- How much of the variance in the dependent variable does the model explain?
- On average, how far off are the model's predictions (for example, how many dollars off)?

*YOUR ANSWER HERE*

# 3. Interpret a Correlation Heatmap

## Correlation Heatmap

Produce a heatmap showing the correlations between all of the numeric columns in the data. The x and y axis labels should indicate the pair of columns being compared, and the color and the number in each cell should represent their correlation. 

The most important column or row shows the correlations between the target and other attributes.
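
One possible way to draw such a heatmap with seaborn is sketched below. The dataframe is invented for illustration, and the styling arguments (`cmap`, `fmt`, figure size) are arbitrary choices rather than requirements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data with one non-numeric column.
df = pd.DataFrame({
    "price": [200, 250, 310, 330, 400, 420],
    "sqft": [1000, 1200, 1500, 1600, 2000, 2100],
    "age": [30, 12, 25, 8, 15, 20],
    "city": ["A", "B", "A", "C", "B", "C"],
})

# Keep only numeric columns when computing pairwise correlations.
corr = df.corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale symmetric, so positive and negative correlations of the same strength appear equally intense.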

In [218]:
# Run this cell without changes

import seaborn as sns
import numpy as np

In [220]:
# Replace None with your code

None

**Task**: (Five to ten sentences) What are the major findings of the correlation analysis?

*YOUR ANSWER HERE*

# 4. Build a Simple Linear Regression Model

Now, you'll build a linear regression model using just the most correlated feature. 

In the cell below, fit a `statsmodels` linear regression model to the data and output a summary for the model. 
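
For reference, fitting a model with the `statsmodels` formula API generally follows the pattern below. The synthetic data and the column names `price` and `sqft` are assumptions for illustration; substitute your target and your most correlated feature.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: price depends roughly linearly on sqft (illustrative only).
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=50)
df = pd.DataFrame({
    "sqft": sqft,
    "price": 50 + 0.2 * sqft + rng.normal(0, 20, size=50),
})

# The formula reads "target ~ feature"; an intercept is included by default.
model = smf.ols("price ~ sqft", data=df).fit()
print(model.summary())
```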

In [226]:
import statsmodels.formula.api as smf

# Replace None with your code

model = None

# 5. Interpret the Simple Linear Regression Model

Now that the model has been fit, you should interpret the model parameters. 

Specifically:
- What do the coefficients for the intercept and independent variable suggest about the dependent variable?
- Are the coefficients found to be statistically significant?
- What are the confidence intervals for the coefficients?
- Do the relationships found by the model seem plausible? 
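
The quantities listed above can also be pulled out of a fitted `statsmodels` result directly, which can make them easier to quote in your answer. This is a sketch on synthetic data with assumed column names `price` and `sqft`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=50)
df = pd.DataFrame({
    "sqft": sqft,
    "price": 50 + 0.2 * sqft + rng.normal(0, 20, size=50),
})
model = smf.ols("price ~ sqft", data=df).fit()

# Collect coefficients, p-values, and 95% confidence intervals in one table.
ci = model.conf_int(alpha=0.05)
table = pd.DataFrame({
    "coef": model.params,
    "p_value": model.pvalues,
    "ci_lower": ci[0],
    "ci_upper": ci[1],
})
print(table)
```

In this toy setup the `sqft` coefficient is read as the expected change in `price` for a one-unit increase in `sqft`, and the intercept as the expected `price` when `sqft` is zero.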

*YOUR ANSWER HERE*

# 6. Evaluate the Simple Linear Regression Model

Now that the model parameters have been interpreted, the model must be assessed based on predictive metrics and whether or not the model is meeting the assumptions of linear regression. 

### Compare the $R^2$ and the Root Mean Squared Error of the simple linear regression model with the baseline model. 

In [None]:
# Replace None with your code
model_r2 = None

model_rmse = None

print('Baseline R^2: ', baseline_r2)
print('Baseline RMSE:', baseline_rmse)
print('----------------------------')
print('Regression R^2: ', model_r2)
print('Regression RMSE:', model_rmse)
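
As one possible way to fill in these values, the sketch below computes both sets of metrics with scikit-learn on synthetic data. The data and column names are assumptions; with a `statsmodels` result you could equally read $R^2$ from `model.rsquared`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=50)
df = pd.DataFrame({
    "sqft": sqft,
    "price": 50 + 0.2 * sqft + rng.normal(0, 20, size=50),
})
y = df["price"]

# Baseline: predict the mean for every observation.
baseline_preds = np.full(len(y), y.mean())
baseline_r2 = r2_score(y, baseline_preds)
baseline_rmse = np.sqrt(mean_squared_error(y, baseline_preds))

# Simple linear regression: use the model's fitted values as predictions.
model = smf.ols("price ~ sqft", data=df).fit()
model_r2 = r2_score(y, model.fittedvalues)
model_rmse = np.sqrt(mean_squared_error(y, model.fittedvalues))

print("Baseline R^2:", baseline_r2, "RMSE:", baseline_rmse)
print("Regression R^2:", model_r2, "RMSE:", model_rmse)
```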

### Interpret the model metrics

*YOUR ANSWER HERE*

### Check the assumptions of simple linear regression

#### Investigating Linearity

First, let's check whether the linearity assumption holds.
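
One option, among several, is the rainbow test from `statsmodels`; a residuals-versus-fitted plot is a common visual alternative. The sketch below uses synthetic data with assumed column names and sorts the rows by the feature first, because the test compares a fit on the middle portion of the data (in row order) with the fit on the full sample.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import linear_rainbow

# Synthetic data (illustrative only), sorted by the feature.
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=50)
df = pd.DataFrame({
    "sqft": sqft,
    "price": 50 + 0.2 * sqft + rng.normal(0, 20, size=50),
}).sort_values("sqft").reset_index(drop=True)

model = smf.ols("price ~ sqft", data=df).fit()

# Null hypothesis: the linear specification is adequate.
rainbow_stat, rainbow_p = linear_rainbow(model)
print("Rainbow statistic:", rainbow_stat)
print("Rainbow p-value:  ", rainbow_p)
```

A small p-value would be evidence against linearity; a large one means the test found no evidence of a violation, which is not the same as proving the relationship is linear.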

In [None]:
# YOUR CODE HERE

Are you violating the linearity assumption?

*YOUR ANSWER HERE*

#### Investigating Normality

Now let's check whether the normality assumption holds for our model.
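
The normality assumption concerns the model's residuals, not the raw variables. A minimal sketch using the Shapiro-Wilk test from SciPy is shown below on synthetic data with assumed column names; a Q-Q plot of the residuals is a common visual complement.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=50)
df = pd.DataFrame({
    "sqft": sqft,
    "price": 50 + 0.2 * sqft + rng.normal(0, 20, size=50),
})
model = smf.ols("price ~ sqft", data=df).fit()

# Null hypothesis: the residuals are drawn from a normal distribution.
shapiro_stat, shapiro_p = stats.shapiro(model.resid)
print("Shapiro-Wilk statistic:", shapiro_stat)
print("Shapiro-Wilk p-value:  ", shapiro_p)
```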

In [None]:
# YOUR CODE HERE

Are you violating the normality assumption?

*YOUR ANSWER HERE*

#### Investigating Homoscedasticity

Now let's check whether the model's errors are indeed homoscedastic or if they violate this principle and display heteroscedasticity.
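
One way to test this is the Breusch-Pagan test from `statsmodels`, sketched below on synthetic data with assumed column names. Plotting residuals against fitted values and looking for a funnel shape is a common visual check for the same property.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=50)
df = pd.DataFrame({
    "sqft": sqft,
    "price": 50 + 0.2 * sqft + rng.normal(0, 20, size=50),
})
model = smf.ols("price ~ sqft", data=df).fit()

# Null hypothesis: the residual variance is constant (homoscedasticity).
# model.model.exog is the design matrix, including the intercept column.
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM p-value:", lm_p)
print("Breusch-Pagan F p-value: ", f_p)
```

A small p-value would suggest heteroscedasticity, in which case the reported standard errors and confidence intervals should be treated with caution.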

In [None]:
# YOUR CODE HERE

Are you violating the homoscedasticity assumption?

*YOUR ANSWER HERE*

### Linear Regression Assumptions Conclusion

Given your answers above, how should you interpret your model's coefficients? Do you have a model that can be used for inferential as well as predictive purposes?

*YOUR ANSWER HERE*

# STEPS 7-10 here

Feel free to reuse the scaffolding provided above as you work through steps seven to ten here.
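
As a starting point, a multiple linear regression can be fit with the same formula API by adding terms on the right-hand side. The sketch below uses synthetic data and the hypothetical columns `price`, `sqft`, and `bedrooms`, and compares the two models using adjusted $R^2$, which penalizes additional features.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data where price depends on both features (illustrative only).
rng = np.random.default_rng(1)
n = 80
sqft = rng.uniform(800, 3000, size=n)
bedrooms = rng.integers(1, 6, size=n)
df = pd.DataFrame({
    "sqft": sqft,
    "bedrooms": bedrooms,
    "price": 50 + 0.2 * sqft + 15 * bedrooms + rng.normal(0, 20, size=n),
})

simple_model = smf.ols("price ~ sqft", data=df).fit()
multi_model = smf.ols("price ~ sqft + bedrooms", data=df).fit()

print("Simple adjusted R^2:  ", simple_model.rsquared_adj)
print("Multiple adjusted R^2:", multi_model.rsquared_adj)
print(multi_model.params)
```

Plain $R^2$ never decreases when a feature is added to an ordinary least squares model, so adjusted $R^2$ or RMSE is usually the fairer comparison.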

# Level Up: Project Enhancements

After completing the project, you could consider the following enhancements if you have time:

* Identify and remove outliers, then redo the analysis
* Compile the data cleaning code into a function
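
Both enhancements can be combined in a single helper. The function below is a hypothetical sketch: the name `clean_data`, the 1.5 × IQR rule, and the toy data are all illustrative choices, and other outlier definitions may suit your data better.

```python
import numpy as np
import pandas as pd


def clean_data(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Drop duplicates, rows with a missing target, and target outliers."""
    cleaned = df.drop_duplicates().dropna(subset=[target])

    # Keep rows whose target lies within 1.5 * IQR of the quartiles.
    q1, q3 = cleaned[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = cleaned[target].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return cleaned[in_range].reset_index(drop=True)


# Toy data with a duplicate row, a missing target, and one extreme value.
raw = pd.DataFrame({
    "price": [200.0, 250.0, 310.0, 330.0, 400.0, 5000.0, 250.0, np.nan],
    "sqft": [1000, 1250, 1500, 1700, 2000, 2100, 1250, 1800],
})
cleaned = clean_data(raw, target="price")
print(cleaned)
```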

## Summary

**Task**: (Ten to fifteen sentences) What has been determined by your analysis? What would you do next if you had more time? What things do you wish you had known before you started this capstone? Which steps did you find most useful in your analysis? Why?

*YOUR ANSWER HERE*