## Final Project Submission

* Student name: Greg Osborne
* Student pace: self paced / part time
* Scheduled project review date/time: 6/28/22
* Instructor name: Claude Fried
* Blog post URL: https://medium.com/@gregosborne

## Project Overview

For this project, you will use multiple linear regression modeling to analyze house sales in a northwestern county.

### Business Problem

It is up to you to define a stakeholder and business problem appropriate to this dataset.

If you are struggling to define a stakeholder, we recommend you complete a project for a real estate agency that helps homeowners buy and/or sell homes. A business problem you could focus on for this stakeholder is the need to provide advice to homeowners about how home renovations might increase the estimated value of their homes, and by what amount.

### The Data

This project uses the King County House Sales dataset, which can be found in `kc_house_data.csv` in the data folder in this assignment's GitHub repository. The description of the column names can be found in `column_names.md` in the same folder. As with most real world data sets, the column names are not perfectly described, so you'll have to do some research or use your best judgment if you have questions about what the data means.

It is up to you to decide what data from this dataset to use and how to use it. If you are feeling overwhelmed or behind, we recommend you **ignore** some or all of the following features:

* `date`
* `view`
* `sqft_above`
* `sqft_basement`
* `yr_renovated`
* `zipcode`
* `lat`
* `long`
* `sqft_living15`
* `sqft_lot15`

### Key Points

* **Your goal in regression modeling is to yield findings to support relevant recommendations. Those findings should include a metric describing overall model performance as well as at least two regression model coefficients.** As you explore the data and refine your stakeholder and business problem definitions, make sure you are also thinking about how a linear regression model adds value to your analysis. "The assignment was to use linear regression" is not an acceptable answer! You can also use additional statistical techniques other than linear regression, so long as you clearly explain why you are using each technique.

* **You should demonstrate an iterative approach to modeling.** This means that you must build multiple models. Begin with a basic model, evaluate it, and then provide justification for and proceed to a new model. After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model.

* **Data visualization and analysis are no longer explicit project requirements, but they are still very important.** In Phase 1, your project stopped earlier in the CRISP-DM process. Now you are going a step further, to modeling. Data visualization and analysis will help you build better models and tell a better story to your stakeholders.

### Statistical Communication

Recall that communication is one of the key data science "soft skills". In Phase 2, we are specifically focused on Statistical Communication. We define Statistical Communication as:

> Communicating **results of statistical analyses** to diverse audiences via writing and live presentation

Note that this is the same as in Phase 1, except we are replacing "basic data analysis" with "statistical analyses".

High-quality Statistical Communication includes rationale, results, limitations, and recommendations:

* **Rationale:** Explaining why you are using statistical analyses rather than basic data analysis
  * For example, why are you using regression coefficients rather than just a graph?
  * What about the problem or data is suitable for this form of analysis?
  * For a data science audience, this includes your reasoning for the changes you applied while iterating between models.
* **Results:** Describing the overall model metrics and feature coefficients
  * You need at least one overall model metric (e.g. R² or RMSE) and at least two feature coefficients.
  * For a business audience, make sure you connect any metrics to real-world implications. You do not need to get into the details of how linear regression works.
  * For a data science audience, you don't need to explain what a metric is, but make sure you explain why you chose that particular one.
* **Limitations:** Identifying the limitations and/or uncertainty present in your analysis
  * This could include p-values/alpha values, confidence intervals, assumptions of linear regression, missing data, etc.
  * In general, this should be more in-depth for a data science audience and more surface-level for a business audience.
* **Recommendations:** Interpreting the model results and limitations in the context of the business problem
  * What should stakeholders _do_ with this information?

### Data Preparation Fundamentals

We define this objective as:

> Applying appropriate **preprocessing** and feature engineering steps to tabular data in preparation for statistical modeling

The two most important components of preprocessing for the Phase 2 project are:

* **Handling Missing Values:** Missing values may be present in the features you want to use, either encoded as `NaN` or as some other value such as `"?"`. Before you can build a linear regression model, make sure you identify and address any missing values using techniques such as dropping or replacing data.
* **Handling Non-Numeric Data:** A linear regression model needs all of the features to be numeric, not categorical. For this project, ***be sure to pick at least one non-numeric feature and try including it in a model.*** You can identify that a feature is currently non-numeric if the type is `object` when you run `.info()` on your dataframe. Once you have identified the non-numeric features, address them using techniques such as ordinal or one-hot (dummy) encoding.

There is no single correct way to handle either of these situations! Use your best judgement to decide what to do, and be sure to explain your rationale in the Markdown of your notebook.
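As a minimal sketch of the two preprocessing steps above, the snippet below cleans a tiny made-up table. The rows, the `"?"` placeholder in `sqft_basement`, the `"NO"`/`"YES"` labels in `waterfront`, and the text labels in `condition` are all assumptions for illustration; the real columns may be encoded differently, and filling with the median is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration only.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "sqft_basement": ["0.0", "400.0", "?", "910.0"],  # "?" stands in for a missing value
    "waterfront": ["NO", np.nan, "NO", "YES"],
    "condition": ["Average", "Good", "Average", "Very Good"],
})

# 1. Missing values: turn the "?" placeholder into NaN, then replace it.
df["sqft_basement"] = pd.to_numeric(df["sqft_basement"], errors="coerce")
df["sqft_basement"] = df["sqft_basement"].fillna(df["sqft_basement"].median())
df["waterfront"] = df["waterfront"].fillna("NO")  # assumption: missing means not on the water

# 2a. Non-numeric data: ordinal encoding for a category with a natural order.
condition_order = {"Poor": 1, "Fair": 2, "Average": 3, "Good": 4, "Very Good": 5}
df["condition"] = df["condition"].map(condition_order)

# 2b. One-hot (dummy) encoding, dropping the first level as the reference category.
df = pd.get_dummies(df, columns=["waterfront"], drop_first=True, dtype=int)

print(df)
```

Whatever choices are made here (dropping vs. filling, ordinal vs. one-hot), the rationale still belongs in the notebook's Markdown.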

Feature engineering is encouraged but not required for this project.

### Linear Modeling

According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), linear and logistic regression are the most popular machine learning algorithms, used by 83.7% of data scientists. They are small, fast models compared to some of the models you will learn later, but have limitations in the kinds of relationships they are able to learn.

In this project you are required to use linear regression as the primary statistical analysis, although you are free to use additional statistical techniques as appropriate.

# Column Names and Descriptions for King County Data Set
* `id` - Unique identifier for a house
* `date` - Date house was sold
* `price` - Sale price (prediction target)
* `bedrooms` - Number of bedrooms
* `bathrooms` - Number of bathrooms
* `sqft_living` - Square footage of living space in the home
* `sqft_lot` - Square footage of the lot
* `floors` - Number of floors (levels) in house
* `waterfront` - Whether the house is on a waterfront
  * Includes Duwamish, Elliott Bay, Puget Sound, Lake Union, Ship Canal, Lake Washington, Lake Sammamish, other lake, and river/slough waterfronts
* `view` - Quality of view from house
  * Includes views of Mt. Rainier, Olympics, Cascades, Territorial, Seattle Skyline, Puget Sound, Lake Washington, Lake Sammamish, small lake / river / creek, and other
* `condition` - How good the overall condition of the house is. Related to maintenance of house.
  * From the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r): 
    * Relative to age and grade. Coded 1-5.

    * 1 = Poor- Worn out. Repair and overhaul needed on painted surfaces, roofing, plumbing, heating and numerous functional inadequacies. Excessive deferred maintenance and abuse, limited value-in-use, approaching abandonment or major reconstruction; reuse or change in occupancy is imminent. Effective age is near the end of the scale regardless of the actual chronological age.

    * 2 = Fair- Badly worn. Much repair needed. Many items need refinishing or overhauling, deferred maintenance obvious, inadequate building utility and systems all shortening the life expectancy and increasing the effective age.

    * 3 = Average- Some evidence of deferred maintenance and normal obsolescence with age in that a few minor repairs are needed, along with some refinishing. All major components still functional and contributing toward an extended life expectancy. Effective age and utility is standard for like properties of its class and usage.

    * 4 = Good- No obvious maintenance required but neither is everything new. Appearance and utility are above the standard and the overall effective age will be lower than the typical property.

    * 5 = Very Good- All items well maintained, many having been overhauled and repaired as they have shown signs of wear, increasing the life expectancy and lowering the effective age with little deterioration or obsolescence evident with a high degree of utility.
* `grade` - Overall grade of the house. Related to the construction and design of the house.
  * From the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r) :
      * Represents the construction quality of improvements. Grades run from grade 1 to 13. Generally defined as:

      * 1-3 Falls short of minimum building standards. Normally cabin or inferior structure.

      * 4 Generally older, low quality construction. Does not meet code.

      * 5 Low construction costs and workmanship. Small, simple design.

      * 6 Lowest grade currently meeting building code. Low quality materials and simple designs.

      * 7 Average grade of construction and design. Commonly seen in plats and older sub-divisions.

      * 8 Just above average in construction and design. Usually better materials in both the exterior and interior finish work.

      * 9 Better architectural design with extra interior and exterior design and quality.

      * 10 Homes of this quality generally have high quality features. Finish work is better and more design quality is seen in the floor plans. Generally have a larger square footage.

      * 11 Custom design and higher quality finish work with added amenities of solid woods, bathroom fixtures and more luxurious options.

      * 12 Custom design and excellent builders. All materials are of the highest quality and all conveniences are present.

      * 13 Generally custom designed and built. Mansion level. Large amount of highest quality cabinet work, wood trim, marble, entry ways etc.
* `sqft_above` - Square footage of house apart from basement
* `sqft_basement` - Square footage of the basement
* `yr_built` - Year when house was built
* `yr_renovated` - Year when house was renovated
* `zipcode` - ZIP Code used by the United States Postal Service
* `lat` - Latitude coordinate
* `long` - Longitude coordinate
* `sqft_living15` - The square footage of interior housing living space for the nearest 15 neighbors
* `sqft_lot15` - The square footage of the land lots of the nearest 15 neighbors


# The assignment must include:
* Explain why I'm using multiple linear regression models at all
* Explain why I'm using statistical analysis rather than just basic data analysis
* I need one overall Model Metric, R² or RMSE 
  * (Explain it to data science audience)
* Need two feature Coefficients
  * One must be categorical, use Ordinal or One-hot (dummy) encoding
* Explain to business audience and data scientists
* Identify limitations:
  * p-values/alpha values, confidence intervals, assumptions of linear regression, missing data, etc...
* Make recommendations

Seattle properties instructed me not to use the following columns listed below in my analysis. I dropped them from the data.

# Notes from my meeting with Claude Fried:

Figure out your audience.
Figure out your question.
Three actionable insights based on your model with three coefficients.
All the data prep to do. 

Baseline model. R² does not have to be high.
One predictive feature and one target feature. Target would be price, predictive would be variables.
Iterate through the modeling process.
Look at the P-values for the features in the model.
Add or drop features to get a better performing model.

Check the four assumptions for linear regression models.
1. There's a linear relationship between the target and predictors.
2. Homoscedasticity
3. Independence or Multicollinearity
4. Normality

Two or three iterations. R² increases through iterations. P values low, below the alpha. Assumptions are improving.

Improving assumptions may be removing outliers.

At the end, we'd have our final model. Pick three coefficients. Explain them, and their impact. Give business recommendations based on three coefficients.

Grade, condition, bathrooms, bedrooms. Grade and condition can be a continuous variable.

Maybe check the county website for what they designate as grade.

### Statsmodels, the package I want to work with.
I can use OLS, or formula ols, lowercase.

Don't use scikit-learn for this.
(I did use scikit for the polynomial stuff, but I also built my own simple regression formula so I could find the biggest outliers)

predictors = features = independent variables = dimensions

turn scatter matrix into scatter plots and distributions
Find outliers for price and individual predictors 
(I removed one outlier, anything further than that caused problems)

plot the valuecounts at the top of the data.

Put visualizations before the code.

When we build a regression model, we need to choose a y-intercept. If we have conditional variables, and all have to have a one.

Inferential cannot have multicollinearity.
Predictive can have collinearity.

Phase II walkthrough

Three part series - Walkthrough on regression.

Part 1: 4 Eastern

# Questions before I submit this.

2. Is the fact that the data fails all the assumptions of linear regression a problem? Does it actually fail all the assumptions?
4. Why did dropping outliers that seemed to have the largest distance from the regression lines make my R² go down?

Answered:
1. How do I draw conclusions from this model I built? 
* The coefficients answer the questions. I didn't understand what these meant before.
3. Is it a problem that all my P-Values are zero? 
* No it's not. That's what we want to see.


# Notes from Morgan's Linear Regression Part 2

The y - y-hat is called the error, or residual. When it's squared, it's called square error.
My R² value of .595 is too low. :-( 
model.params tells us the coefficients of the model.
Statsmodels showing us 0.00 for the P-value is good. That's what we want to see.
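To make the vocabulary above concrete, here is a small worked example of residuals, squared error, R², and RMSE. The four `y` and `y_hat` values are made up for illustration and are not from my model.

```python
import numpy as np

# Hypothetical true values (y) and model predictions (y_hat).
y = np.array([3.0, 5.0, 7.0, 10.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

residuals = y - y_hat             # the error, or residual, for each row
squared_errors = residuals ** 2   # the squared error for each row

ss_res = squared_errors.sum()            # sum of squared residuals
ss_tot = ((y - y.mean()) ** 2).sum()     # total sum of squares around the mean
r_squared = 1 - ss_res / ss_tot          # share of variance explained
rmse = np.sqrt(squared_errors.mean())    # typical error size, in the units of y

print(residuals, ss_res, r_squared, rmse)
```

With a fitted statsmodels model, the same quantities are available as `model.resid` and `model.rsquared`.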


# Claude Fried Office Hours.
We could remove modes with price.

Normality: The residuals of the model need to be normally distributed.
`model.resid` = Will create a pandas Series of all residuals
Homoscedasticity = Constant variance of the residuals.
Variance inflation factor → Please be sure to add a constant (a column of ones)


Rule of thumb for VIF: above five is too high.
10 or above: must do something
5-10: yellow light
Below 5: green light

Encourages me to remove the features with a high correlation coefficient.

Phase II project - No requirement to do a train test split.

Train test split happens before we do any transformations on the data.


Two approaches to improving the model.
A model doesn't need us to transform its data. 

Don't drop any more than 5% of the data.

Or you can perform a log transformation, square root, squaring them.

Claude would build four models at once.
raw values
log values
raw scaled → Relative coefficient.
log scaled values

scaling: Min/Max, Z-score (standard scaler)
Scaling will not impact the model's fit (R² and the features' p-values stay the same); it only changes the size of the coefficients.

Feature engineering vs. including polynomials

This is an inferential model, not predictive.

Shoot for an R² value of .6

`.fit_transform()` is `.fit()` followed by `.transform()`.


# Notes on Office Hours with Claude II:

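When testing the normality assumption:
Assume that perfect normality is not possible
Skewness should be as close to zero as possible, and Kurtosis as close to 3 as possible (statsmodels reports plain kurtosis, which is 3 for a normal distribution, not excess kurtosis)
Jarque-Bera and Omnibus should be as low as possible. Lower shows improvement.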

Don't do a train-test split for this project.

Homoscedasticity is checked with a scatterplot of residuals vs. predicted (fitted) values.
`x=model.fittedvalues, y=model.resid`

R-squared does not need to be above 0.6.

Questions:
1. Is it ok to interpret a binary variable as continuous? → Continuous is the only way to do it.
2. Normality is residuals. Is homoscedasticity also a test of residuals or just the scatterplot?
3. What would you tell me as I'm now moving forward with model interpretation?
4. Removing outliers lowered the R-squared, should I avoid it entirely then? Should I do this to improve the assumptions?

(If time) Am I doing normality correct?

# Notes from 1on1 with Claude:

Whatever data we look at, shouldn't matter. 

In Data Science in general, it is impossible to find the "right answer."

1. Don't remove what I've done.
2. Add in a section in the notebook where I start with one model with one variable. Check all assumptions, and R-squared.
3. Make a second model and build it out. Add 3-6 other features. Observe what happens. Did the score go up? P-Values? Residuals improved? Multicollinearity? Get your hands dirty instead of automating it (with those tests).
4. If assumptions are still a problem, try log transforming or removing outliers. (Remove the outliers a few at a time.)
5. Come up with a pre-algorithm final model. It should have some validity in and of itself. Try to interpret the pre-algorithm model. Use visualizations to show the difference between waterfront and non-waterfront homes. Three recommendations based on three coefficients.

# Conclusions
As my contract specified, here are three conclusions from three variables I used in my model, and an observation on the fourth.

### 1) Each one-point increase in a home's grade rating is worth nearly \$200,000.

Of the four factors I included in my model, Grade had a stronger effect on the price than any other. This is interesting because Grade is also a slightly subjective number based on King County criteria. While it may be difficult for an individual to guess how King County would grade a home from the criteria, it is very easy to look up how King County actually graded it.

For a salesman to optimize their efforts, they could look up homes at a specific grade and then target those homes to represent.

### 2) A home's location on a waterfront adds \$844K to its sale price.

Perhaps because there are so few waterfront properties in King County, those that have waterfront status are worth \$0.84 million more. That's quite a bump for a single additional feature of a home.

A salesman could target waterfront properties specifically, and offer their services as someone with waterfront sales experience to generate more business. This could have a profound snowball effect, leading to higher commissions again and again. However, as I mentioned, there are a limited number of homes in King County that are on the waterfront, so whenever they are available, salesmen should lobby their owners for the opportunity to represent them.

### 3) Homes with more bathrooms sell at higher prices: \$148K per full bathroom, \$74K per half bath, or \$37K per quarter bath.
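The per-bathroom figures above follow directly from how a linear coefficient works: the estimated price change is the coefficient times the change in the feature, holding the other features constant. A short sketch of that arithmetic, using the rounded bathroom and waterfront coefficients quoted in these conclusions (the function name is my own, for illustration):

```python
# Rounded coefficients as quoted in the conclusions above (dollars per unit).
BATHROOM_COEF = 148_000    # per 1.0 bathroom
WATERFRONT_COEF = 844_000  # waterfront = 1 vs. waterfront = 0

def price_change(bathrooms=0.0, waterfront=0):
    """Estimated change in sale price, holding the model's other features constant."""
    return BATHROOM_COEF * bathrooms + WATERFRONT_COEF * waterfront

for label, delta in [("full bath", 1.0), ("half bath", 0.5), ("quarter bath", 0.25)]:
    print(f"{label}: ${price_change(bathrooms=delta):,.0f}")
```

These are average effects within the range of the data, not a guarantee for any one home.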