
# Regression Project

## Project Overview

### Background: 

Zillow wants to improve its Zestimate, the company's estimate of a home's value. Zillow theorizes that more information can be extracted to improve its existing model, so it wants you to develop a model that predicts the error between the Zestimate and the sales price of a home. In predicting the error, you will discover features that can help improve the Zestimate itself. Your goal in this project is to develop a linear regression model that best predicts the log error of the Zestimate. The error is the difference between the Zestimate and the sales price, and the log error is that difference measured on a log scale: log(Zestimate) - log(sales price). You don't need to worry about the fact that the error is on a logarithmic scale; treat it as a continuous number that represents an error rate.

### Your deliverables:

1. A report (in the form of a presentation, both verbal and through slides) that summarizes your findings about the drivers of the Zestimate error. These findings will come from the analysis you do during the exploration phase of the pipeline. In the report, include charts that visually tell the story of what is driving the errors.

2. A Jupyter notebook, titled 'Regression_Proj_YourName', that contains a clearly labeled section and clearly documented code for each of the stages below (project planning, data acquisition, data prep, exploration, and modeling). All of the work will take place in your Jupyter notebook.


## Project Planning

You will use the available data dictionary to understand the fields. During the Project Planning stage, document the meaning of the fields you will be using (see the list of fields below). Brainstorm how you plan to approach the project: your thoughts about the analysis, any hypotheses you have, and ideas about what might matter and what might not. In the real world, this part is for you and maybe your co-workers only. Grammar, technicalities, etc. do not matter here. It is about creating a plan so that you know when you go off course and get lost in the weeds. You will likely not follow your plan exactly, which is why it is important to clearly state your goal. That way, when you get a bit lost, you can quickly refer to the goal and ask, "Is what I'm doing now the best use of my time to help me achieve my goal?"

Also in this step, prepare your environment with the Python libraries you will need throughout your project. You may need to go back and add more as you go.

Your plan should include the following:


1. Your goal clearly stated.  Why?  So that when you get a bit lost, you can quickly refer to the goal and ask, "Is what I'm doing now the best use of my time to help me achieve my goal?"

2. Your deliverables clearly stated, so you know when you are done!

3. Data dictionary of fields you will use. Why? So that you and others can refer back to the meanings as you develop your model. This is about gaining knowledge in the domain space so that you will understand when data doesn't look right, develop hypotheses more effectively, and use that domain knowledge to build a more robust model (among other reasons).

4. Brainstormed ideas and hypotheses about how variables might impact or relate to each other, both among the independent variables and between the independent variables and the dependent variable, along with any ideas for new features that occur to you while first looking at the existing variables and the challenge ahead of you.
 

## Data Acquisition

Using the SQL zillow schema, write a query via Python to generate a cohesive data set that includes the following fields:

- `logerror`
- `bathroomcnt`
- `bedroomcnt`
- `calculatedfinishedsquarefeet`
- `fullbathcnt`
- `garagecarcnt`
- `roomcnt`
- `yearbuilt`
- `taxvaluedollarcnt`
- `taxamount`
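
As a rough sketch of how the query could be built and run from Python: the table names `properties_2016` and `predictions_2016`, the join key `parcelid`, and the helper names `build_query` and `acquire` are all assumptions for illustration, so check them against the actual schema before using this.

```python
import pandas as pd

# Fields requested in the list above.
FIELDS = [
    "logerror",
    "bathroomcnt",
    "bedroomcnt",
    "calculatedfinishedsquarefeet",
    "fullbathcnt",
    "garagecarcnt",
    "roomcnt",
    "yearbuilt",
    "taxvaluedollarcnt",
    "taxamount",
]


def build_query(properties_table="properties_2016",
                predictions_table="predictions_2016",
                key="parcelid"):
    """Return a SELECT that joins property attributes to their log errors.

    The default table names and join key are assumptions about the schema.
    """
    # Assumed layout: logerror sits in the predictions table,
    # every other field in the properties table.
    columns = ", ".join(
        f"pred.{col}" if col == "logerror" else f"prop.{col}" for col in FIELDS
    )
    return (
        f"SELECT {columns} "
        f"FROM {properties_table} AS prop "
        f"JOIN {predictions_table} AS pred ON prop.{key} = pred.{key}"
    )


def acquire(connection):
    """Run the query on an already open database connection."""
    return pd.read_sql(build_query(), connection)
```

Calling `acquire(connection)` with whatever connection object you use for the database (for example a SQLAlchemy connection) should return one row per property that has a recorded log error, with the columns in the order listed above.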

## Data Preparation

1. Sample the data.  Why?  So you can confirm the data look like what you would expect.
2. Create a variable, `colnames`, that is a list of the column names.  Why?  You will likely reference this variable later. 
3. Identify the data type of each variable. Why? You need to make sure each type makes sense for the data stored in that variable and for what that data means. If it does not, make the necessary changes.
4. Compute the summary statistics for the variables. Why? To get a glimpse into outliers, skewness, spread, and central tendency.
5. Identify the columns that have missing values and the number of missing values in each column. Why? Missing values are going to cause issues down the line, so you will need to handle them appropriately. For each variable with missing values, if it makes sense to replace the missing values with 0, do so. Where that doesn't make sense, decide whether to drop the observations (rows) that contain the missing values or to drop the entire variable (column).
6. Create a list of the independent variable names (aka attributes) and assign it to the variable `attributes`. Why? During exploration, you will likely use this list to refer to the attribute names. 
7. Clearly identify your dependent (target) variable.  What is the name of the variable? Is it discrete or continuous?
8. Plot a histogram and box plot of each variable.  Why?  To see the distribution, skewness, outliers, and unit scales.  You will use this information in your decision of whether to normalize, standardize or neither. 
9. Bonus: Create a new data frame that is the min-max normalization of the independent variables in the original data frame (+ the original dependent variable). You will normalize each of the independent variables independently, i.e. using the min and max of each variable, not the min/max of the whole dataframe. Why? Regression coefficients are very sensitive to differences in units. With such extreme differences in scale, it is very hard to compare or interpret the coefficients of a linear regression model. For more context, see: https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc
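
Several of the steps above can be sketched with pandas as follows. The tiny data frame is made up for illustration, the helper names `missing_counts` and `min_max_normalize` are hypothetical, and filling a missing `garagecarcnt` with 0 is an assumption you would need to justify for the real data.

```python
import pandas as pd


def missing_counts(df):
    """Number of missing values per column, only for columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]


def min_max_normalize(df, attributes, target):
    """Scale each attribute to [0, 1] using that column's own min and max."""
    features = df[attributes]
    col_min = features.min()
    col_range = features.max() - col_min
    # A constant column has a range of 0 and would come out as NaN here.
    normalized = (features - col_min) / col_range
    normalized[target] = df[target]  # the target is carried over unscaled
    return normalized


# Made-up rows for illustration only.
df = pd.DataFrame({
    "logerror": [0.02, -0.01, 0.05, 0.00],
    "bedroomcnt": [2.0, 3.0, 4.0, 3.0],
    "garagecarcnt": [None, 2.0, None, 1.0],
    "taxamount": [4000.0, 6500.0, 9000.0, 5200.0],
})

colnames = list(df.columns)   # step 2
print(df.dtypes)              # step 3
print(df.describe())          # step 4
print(missing_counts(df))     # step 5

# Assumption for this sketch: a missing garage count means "no garage".
df["garagecarcnt"] = df["garagecarcnt"].fillna(0)

target = "logerror"                                     # step 7
attributes = [c for c in colnames if c != target]       # step 6
df_norm = min_max_normalize(df, attributes, target)     # step 9
print(df_norm)
```

Because the min and max are computed per column, every attribute in `df_norm` spans exactly 0 to 1 regardless of its original units.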


## Data Exploration

1. Split the data into training and test datasets.
2. Address each of the questions you posed in your planning & brainstorming through visual or statistical analysis.  
3. Create a jointplot for each independent variable (normalized version) with the dependent variable. Use the for loop you created in the exercises to plot each independent variable in turn. Be sure you have Pearson's r and the p-value annotated on each plot.
4. Create a feature plot using seaborn's PairGrid() of the interaction between each variable (dependent + independent).  You will want to use the normalized dataframe so you can more clearly view the interactions.  
5. Create a heatmap of the correlation between each variable pair.  
6. Summarize your conclusions from these steps. 
7. Is the logerror significantly different for homes with 3 bedrooms vs those with 5 or more bedrooms?  Run a t-test to test this difference.  
8. Do the same for another 2 samples you are interested in comparing (e.g. those with 1 bath vs. x baths).
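
The statistical side of these steps might look like the sketch below. It uses synthetic data, hypothetical helper names, and SciPy for the tests (an assumption, since no library is prescribed here). The t-test is run with `equal_var=False`, i.e. Welch's variant, which does not assume the two groups share a variance.

```python
import numpy as np
import pandas as pd
from scipy import stats


def split_train_test(df, train_frac=0.8, seed=123):
    """Randomly split rows into train and test sets (index must be unique)."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test


def pearson_table(df, attributes, target):
    """Pearson's r and p-value of each attribute against the target."""
    rows = []
    for col in attributes:
        r, p = stats.pearsonr(df[col], df[target])
        rows.append({"attribute": col, "r": r, "p_value": p})
    return pd.DataFrame(rows).set_index("attribute")


def compare_groups(df, mask_a, mask_b, target="logerror"):
    """Welch's two-sample t-test on the target for two groups of rows."""
    return stats.ttest_ind(df.loc[mask_a, target], df.loc[mask_b, target],
                           equal_var=False)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "bedroomcnt": rng.integers(1, 7, size=n).astype(float),
    "bathroomcnt": rng.integers(1, 5, size=n).astype(float),
})
df["logerror"] = 0.01 * df["bedroomcnt"] + rng.normal(0, 0.05, size=n)

train, test = split_train_test(df)
print(pearson_table(train, ["bedroomcnt", "bathroomcnt"], "logerror"))

# Pairwise correlations: the matrix a heatmap would display.
corr = train.corr()

# 3 bedrooms vs. 5 or more bedrooms.
t_stat, p_value = compare_groups(
    train, train["bedroomcnt"] == 3, train["bedroomcnt"] >= 5
)
print(t_stat, p_value)
```

The exploration is done on `train` only so that the test set stays untouched until the final evaluation.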

## Data Modeling

### Feature Engineering & Selection

1. Are there new features you could create based on existing features that might be helpful?  Come up with at least one possible new feature that is a calculation from 2+ existing variables.  Add that feature and update the normalized dataframe with the min-max normalization of that feature. 
2. Use statsmodels ordinary least squares to assess the importance of each feature with respect to the target (using the normalized dataframe).
3. Summarize your conclusions and next steps from your analysis in step 2.  What will you try when developing your model?  (which features to use/not use/etc)

### Train & Test Model

1. Fit, predict (in-sample) & evaluate multiple linear regression models to find the best one.
2. Make any changes as necessary to improve your model.
3. Identify the best model after all training and predict & evaluate on out-of-sample data.  
4. Plot the residuals from your out-of-sample predictions.  
5. Summarize how you expect this model to perform in production.
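
One way to sketch this workflow is shown below, assuming scikit-learn for the models and metrics (the library choice, the synthetic data, the candidate feature sets, and the helper name `evaluate` are all assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, data, features, target="logerror"):
    """Mean squared error and R^2 of a fitted model on the given rows."""
    pred = model.predict(data[features])
    return {"mse": mean_squared_error(data[target], pred),
            "r2": r2_score(data[target], pred)}


# Synthetic, already-normalized data for illustration only.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "bedroomcnt": rng.uniform(0, 1, size=n),
    "taxamount": rng.uniform(0, 1, size=n),
})
df["logerror"] = 0.03 * df["bedroomcnt"] + rng.normal(0, 0.05, size=n)

target = "logerror"
train, test = train_test_split(df, test_size=0.2, random_state=123)

# Hypothetical candidate feature sets to compare.
candidates = {
    "bedrooms_only": ["bedroomcnt"],
    "bedrooms_and_tax": ["bedroomcnt", "taxamount"],
}

# Steps 1-2: fit each candidate and evaluate it in-sample.
fitted = {}
for name, cols in candidates.items():
    model = LinearRegression().fit(train[cols], train[target])
    fitted[name] = (model, evaluate(model, train, cols))
    print(name, fitted[name][1])

# Step 3: pick the best in-sample model, then score it out-of-sample.
best_name = min(fitted, key=lambda name: fitted[name][1]["mse"])
best_model = fitted[best_name][0]
best_cols = candidates[best_name]
test_scores = evaluate(best_model, test, best_cols)
print(best_name, test_scores)

# Step 4: out-of-sample residuals, ready to plot against the predictions.
residuals = test[target] - best_model.predict(test[best_cols])
```

Note that in-sample MSE can only stay the same or improve when features are added to a nested linear model, so the gap between the train and test scores is the more useful signal when judging how the model might behave on new data.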