# Project 2:

## Use multiple linear regression modeling 
## Analyze house sales in a northwestern county.

______________

# Business Understanding

## Who: Real Estate Agency that
## What: Helps Homeowners 
## How: Buy and/or Sell homes.

#### Business problem you could focus on for this stakeholder:
- How home renovations MIGHT increase the estimated value of their homes 
- By WHAT amount

___________

# Data


1. This project uses the **King County House Sales dataset**, stored in `data/kc_house_data.csv`.
2. The column names are described in `data/column_names.md`, in the same folder.

- The column names are not perfectly described, so you'll have to do some research or use your best judgment if you have questions about what the data means.

- It is up to you to decide what data from this dataset to use and how to use it. **If you are feeling overwhelmed or behind**, we recommend you **ignore** some or all of the following features:

* `date`
* `view`
* `sqft_above`
* `sqft_basement`
* `yr_renovated`
* `address`
* `lat`
* `long`
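
As a rough sketch of how the optional features above could be set aside at load time, the helper below reads the CSV and drops the listed columns. The function name `load_sales` and its `keep` parameter are hypothetical, and the two sample rows are made up for illustration. Because the business question here concerns renovations, `yr_renovated` is probably worth keeping, so the example retains it.

```python
import io

import pandas as pd

# Columns the brief lists as safe to ignore when time is short.
OPTIONAL_COLUMNS = ("date", "view", "sqft_above", "sqft_basement",
                    "yr_renovated", "address", "lat", "long")


def load_sales(source, keep=()):
    """Read the sales CSV and drop the optional columns not named in `keep`.

    `source` can be a path such as "data/kc_house_data.csv" or any
    file-like object that pandas.read_csv accepts.
    """
    df = pd.read_csv(source)
    to_drop = [col for col in OPTIONAL_COLUMNS if col not in keep]
    # errors="ignore" skips any listed column that is absent from the file.
    return df.drop(columns=to_drop, errors="ignore")


# Tiny in-memory stand-in for the real file (made-up values).
sample_csv = io.StringIO(
    "id,date,price,view,yr_renovated\n"
    "1,5/24/2022,675000.0,NONE,0\n"
    "2,12/13/2021,920000.0,AVERAGE,1995\n"
)
sample = load_sales(sample_csv, keep=("yr_renovated",))
print(sample.columns.tolist())
```

With the real file, the call would be `load_sales("data/kc_house_data.csv", keep=("yr_renovated",))`.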

# Key Points

* **Your goal in regression modeling is to yield findings to support relevant recommendations. Those findings should include a metric describing overall model performance as well as at least two regression model coefficients.** As you explore the data and refine your stakeholder and business problem definitions, make sure you are also thinking about how a linear regression model adds value to your analysis. "The assignment was to use linear regression" is not an acceptable answer! You can also use additional statistical techniques other than linear regression, so long as you clearly explain why you are using each technique.

* **You should demonstrate an iterative approach to modeling.** This means that you must build multiple models. Begin with a basic model, evaluate it, and then provide justification for and proceed to a new model. After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model.

* **Data visualization and analysis are still very important.** In Phase 1, your project stopped earlier in the CRISP-DM process. Now you are going a step further, to modeling. Data visualization and analysis will help you build better models and tell a better story to your stakeholders.

### My approach
We will use regression analysis to estimate how much renovations might increase home values for the Real Estate Agency. We will use the King County dataset, which includes variables such as the number of bedrooms, square footage of living space and lot, overall condition of the house, view quality, grade, and heat source, to predict the sale price of a house.

We will first load the dataset into our code and perform data cleaning and feature selection to ensure relevant data is present. We can then visualize and analyze trends between the selected features and sale price to understand the relationship between them better.

Next, we will create a simple linear regression model to estimate the relationship between one independent variable (e.g., square footage of living space) and the dependent variable (sale price). We will evaluate the model's performance, and continue refining our model by adding more independent variables, transforming variables or even using non-linear regression techniques, while keeping in mind the business problem for our stakeholder.

Finally, we will interpret the final model and its regression coefficients to assess how home renovations affect estimated home value. Before drawing conclusions, we will check that the assumptions of linear regression (linearity, independent errors, constant error variance, and approximately normal residuals) are reasonably met. We will close with detailed findings and recommendations for our stakeholder, based on the insights from the model.

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics
from random import gauss
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats

%matplotlib inline



# %load data/column_names.md
# Column Names and Descriptions for King County Data Set

* `id` - Unique identifier for a house
* `date` - Date house was sold
* `price` - Sale price (prediction target)
* `bedrooms` - Number of bedrooms
* `bathrooms` - Number of bathrooms
* `sqft_living` - Square footage of living space in the home
* `sqft_lot` - Square footage of the lot
* `floors` - Number of floors (levels) in house
* `waterfront` - Whether the house is on a waterfront
  * Includes Duwamish, Elliott Bay, Puget Sound, Lake Union, Ship Canal, Lake Washington, Lake Sammamish, other lake, and river/slough waterfronts
* `greenbelt` - Whether the house is adjacent to a green belt
* `nuisance` - Whether the house has traffic noise or other recorded nuisances
* `view` - Quality of view from house
  * Includes views of Mt. Rainier, Olympics, Cascades, Territorial, Seattle Skyline, Puget Sound, Lake Washington, Lake Sammamish, small lake / river / creek, and other
* `condition` - How good the overall condition of the house is. Related to maintenance of house.
  * See the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r) for further explanation of each condition code
* `grade` - Overall grade of the house. Related to the construction and design of the house.
  * See the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r) for further explanation of each building grade code
* `heat_source` - Heat source for the house
* `sewer_system` - Sewer system for the house
* `sqft_above` - Square footage of house apart from basement
* `sqft_basement` - Square footage of the basement
* `sqft_garage` - Square footage of garage space
* `sqft_patio` - Square footage of outdoor porch or deck space
* `yr_built` - Year when house was built
* `yr_renovated` - Year when house was renovated
* `address` - The street address
* `lat` - Latitude coordinate
* `long` - Longitude coordinate

Most fields were pulled from the [King County Assessor Data Download](https://info.kingcounty.gov/assessor/DataDownload/default.aspx).

The `address`, `lat`, and `long` fields have been retrieved using a third-party [geocoding API](https://docs.mapbox.com/api/search/geocoding/). In some cases due to missing or incorrectly-entered data from the King County Assessor, this API returned locations outside of King County, WA. If you plan to use the `address`, `lat`, or `long` fields in your modeling, consider identifying outliers prior to including the values in your model.
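
A simple way to identify such outliers is to flag rows whose coordinates fall outside a bounding box around the county. The bounds below are a rough approximation assumed for illustration, not official county limits, and `flag_outside_county` is a hypothetical helper name; the two sample rows are made up.

```python
import pandas as pd

# Assumption: a loose rectangular box around King County, WA.
# Adjust these values after inspecting the real coordinates.
KING_COUNTY_BOUNDS = {"lat": (47.0, 47.9), "long": (-122.6, -121.0)}


def flag_outside_county(df, bounds=KING_COUNTY_BOUNDS):
    """Return a boolean Series that is True for rows outside the box."""
    lat_ok = df["lat"].between(*bounds["lat"])
    long_ok = df["long"].between(*bounds["long"])
    return ~(lat_ok & long_ok)


# Made-up rows: one point in Seattle, one far outside Washington.
points = pd.DataFrame({
    "lat": [47.61, 41.25],
    "long": [-122.33, -96.00],
})
flags = flag_outside_county(points)
print(points[~flags])
```

A rectangular box will still admit some points in neighboring counties, so checking the `address` text as well may be worthwhile.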


In [10]:
kc_sales.head()

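| Unnamed: 0 | ExciseTaxNbr | Major | Minor | DocumentDate | SalePrice | RecordingNbr | Volume | Page | PlatNbr | PlatType | ... | PropertyType | PrincipalUse | SaleInstrument | AFForestLand | AFCurrentUseLand | AFNonProfitUse | AFHistoricProperty | SaleReason | PropertyClass | SaleWarning |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1541702 | 4000 | 228 | 04/29/1997 | 103500 | 199705010641 | 11.0 | 31.0 | 4000.0 | P | ... | 3 | 0 | 3 | N | N | N | N | 1 | 8 | |
| 1 | 3000111 | 799671 | 190 | 06/26/2019 | 0 | 20190718000977 | | | | | ... | 11 | 6 | 15 | N | N | N | N | 1 | 8 | |
| 2 | 3068518 | 327620 | 100 | 09/01/2020 | 430000 | 20200909002684 | | | | | ... | 11 | 6 | 2 | N | N | N | N | 1 | 8 | |
| 3 | 1606088 | 375380 | 40 | 04/10/1998 | 130000 | 199804171192 | 26.0 | 91.0 | 375380.0 | C | ... | 3 | 2 | 3 | N | N | N | N | 1 | 3 | |
| 4 | 3188451 | 152900 | 260 | 04/05/2022 | 0 | 20220509000741 | | | | | ... | 3 | 2 | 3 | N | N | N | N | 5 | 3 | 12 31 51 |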


The house sales data loads into a pandas DataFrame with 25 columns and 30,155 rows. The columns describe houses in King County: the number of bedrooms and bathrooms; the square footage of the living space, lot, and patio/deck; the years the house was built and renovated; and others. The target variable is `price`, the sale price of the house.

To approach this data for the business problem of how home renovations might increase the estimated value of homes and by what amount, we can start by performing exploratory data analysis to understand the relationships between the different features and the target variable. We can also use visualizations such as scatter plots, heat maps, histograms, and correlation matrices to gain insights from the data.
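
As a starting point for that exploratory analysis, the sketch below ranks numeric features by their correlation with `price`. The helper name `price_correlations` is hypothetical and the four-row table is made up for illustration; on the real data the same call would take the loaded DataFrame.

```python
import pandas as pd


def price_correlations(df, target="price"):
    """Pearson correlation of each numeric column with the target,
    sorted from most positive to most negative."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Made-up rows; non-numeric columns such as `address` are skipped.
toy = pd.DataFrame({
    "price": [300000.0, 450000.0, 500000.0, 800000.0],
    "sqft_living": [900, 1500, 1600, 2800],
    "yr_built": [1990, 1975, 2005, 1960],
    "address": ["a", "b", "c", "d"],
})
corr = price_correlations(toy)
print(corr)
```

The resulting Series can be passed to a bar plot, or the full `numeric.corr()` matrix can be passed to `sns.heatmap` to produce the heat map mentioned above. Correlation only captures linear association, so scatter plots remain worth inspecting.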

Afterward, we can apply regression modeling to estimate the effect of features such as condition and renovation year on the target variable (house price). We can iterate through multiple models, starting with a baseline model, evaluating its performance, and refining it by selecting appropriate features and applying regularization techniques as needed.

Overall, we want to build a model that accurately predicts the sale price of homes based on relevant features that include information on home renovations. With this information, we can provide recommendations to the real estate agency on how they can increase the estimated value of homes through renovations and by what amount.

In [None]:
major, minor, bldgnbr
major, minor, tax #