## Project Overview

For this project, you will use multiple linear regression modeling to analyze house sales in a northwestern county.

### Business Problem

It is up to you to define a stakeholder and business problem appropriate to this dataset.

If you are struggling to define a stakeholder, we recommend you complete a project for a real estate agency that helps homeowners buy and/or sell homes. A business problem you could focus on for this stakeholder is the need to provide advice to homeowners about how home renovations might increase the estimated value of their homes, and by what amount.


# Data


1. This project uses the **King County House Sales dataset**, found in `data/kc_house_data.csv`.
2. The column names are described in `data/column_names.md`, in the same folder.

- The column names are not perfectly described, so you'll have to do some research or use your best judgment if you have questions about what the data means.

- It is up to you to decide what data from this dataset to use and how to use it. **If you are feeling overwhelmed or behind**, we recommend you **ignore** some or all of the following features:

* `date`
* `view`
* `sqft_above`
* `sqft_basement`
* `yr_renovated`
* `address`
* `lat`
* `long`

# Key Points

* **Your goal in regression modeling is to yield findings to support relevant recommendations. Those findings should include a metric describing overall model performance as well as at least two regression model coefficients.** As you explore the data and refine your stakeholder and business problem definitions, make sure you are also thinking about how a linear regression model adds value to your analysis. "The assignment was to use linear regression" is not an acceptable answer! You can also use additional statistical techniques other than linear regression, so long as you clearly explain why you are using each technique.

* **You should demonstrate an iterative approach to modeling.** This means that you must build multiple models. Begin with a basic model, evaluate it, and then provide justification for and proceed to a new model. After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model.

* **Data visualization and analysis are still very important.** In Phase 1, your project stopped earlier in the CRISP-DM process. Now you are going a step further, to modeling. Data visualization and analysis will help you build better models and tell a better story to your stakeholders.
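The iterative approach described above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the data is synthetic (generated in the code, not taken from the King County dataset), `fit_ols` is a hypothetical helper, and the R² reported is in-sample only. It fits a one-feature baseline, then a second model with an added feature, and compares the two.

```python
import numpy as np

# Synthetic data for illustration only; these are NOT values from the dataset.
rng = np.random.default_rng(0)
n = 200
sqft_living = rng.uniform(500, 4000, n)
bathrooms = rng.integers(1, 5, n).astype(float)
price = 50_000 + 200 * sqft_living + 30_000 * bathrooms + rng.normal(0, 40_000, n)


def fit_ols(X, y):
    """Fit ordinary least squares with an intercept; return (coefficients, R^2)."""
    X_design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ coefs
    ss_res = (residuals ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return coefs, 1 - ss_res / ss_tot


# Model 1: baseline with a single predictor.
baseline_coefs, baseline_r2 = fit_ols(sqft_living, price)

# Model 2: add a second predictor and compare against the baseline.
second_coefs, second_r2 = fit_ols(np.column_stack([sqft_living, bathrooms]), price)

print(f"Baseline R^2: {baseline_r2:.3f}, coefficients: {baseline_coefs}")
print(f"Second   R^2: {second_r2:.3f}, coefficients: {second_coefs}")
```

In-sample R² cannot decrease when a predictor is added, so a fairer comparison would use adjusted R² or a held-out test set. Both are left out here to keep the sketch short.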

## Getting Started

Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP.

Next, you will need to complete the [***Project Proposal***](#project_proposal) which must be reviewed by your instructor before you can continue with the project.

Here are some suggestions for creating your GitHub repository:

1. Fork the [Phase 2 Project Repository](https://github.com/learn-co-curriculum/dsc-phase-2-project-v2-5), clone it locally, and work in the `student.ipynb` file. Make sure to also add and commit a PDF of your presentation to your repository with a file name of `presentation.pdf`.
2. Or, create a new repository from scratch by going to [github.com/new](https://github.com/new) and copying the data files from the Phase 2 Project Repository into your new repository.
   - Recall that you can refer to the [Phase 1 Project Template](https://github.com/learn-co-curriculum/dsc-project-template) as an example structure
   - This option will result in the most professional-looking portfolio repository, but can be more complicated to use. So if you are getting stuck with this option, try forking the project repository instead


# CRISP-DM

![crisp-dm-pic](./images/CRISP-DM.jpg)

# CRISP-DM Part 1: Business Understanding

### The real estate agency will advise on one of the following:
1. Advise homeowners on types of homes to buy and/or sell
2. Advise homeowners on renovations to increase the value of their home

#### Since we have some freedom in defining the business understanding, we will get a better picture of the data before settling on a business problem and stakeholder.

___

# CRISP-DM Part 2: Data Understanding

# %load data/column_names.md
# Column Names and Descriptions for King County Data Set

* `id` - Unique identifier for a house
* `date` - Date house was sold
* `price` - Sale price (prediction target)
* `bedrooms` - Number of bedrooms
* `bathrooms` - Number of bathrooms
* `sqft_living` - Square footage of living space in the home
* `sqft_lot` - Square footage of the lot
* `floors` - Number of floors (levels) in house
* `waterfront` - Whether the house is on a waterfront
  * Includes Duwamish, Elliott Bay, Puget Sound, Lake Union, Ship Canal, Lake Washington, Lake Sammamish, other lake, and river/slough waterfronts
* `greenbelt` - Whether the house is adjacent to a green belt
* `nuisance` - Whether the house has traffic noise or other recorded nuisances
* `view` - Quality of view from house
  * Includes views of Mt. Rainier, Olympics, Cascades, Territorial, Seattle Skyline, Puget Sound, Lake Washington, Lake Sammamish, small lake / river / creek, and other
* `condition` - How good the overall condition of the house is. Related to maintenance of house.
  * See the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r) for further explanation of each condition code
* `grade` - Overall grade of the house. Related to the construction and design of the house.
  * See the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r) for further explanation of each building grade code
* `heat_source` - Heat source for the house
* `sewer_system` - Sewer system for the house
* `sqft_above` - Square footage of house apart from basement
* `sqft_basement` - Square footage of the basement
* `sqft_garage` - Square footage of garage space
* `sqft_patio` - Square footage of outdoor porch or deck space
* `yr_built` - Year when house was built
* `yr_renovated` - Year when house was renovated
* `address` - The street address
* `lat` - Latitude coordinate
* `long` - Longitude coordinate

Most fields were pulled from the [King County Assessor Data Download](https://info.kingcounty.gov/assessor/DataDownload/default.aspx).

The `address`, `lat`, and `long` fields have been retrieved using a third-party [geocoding API](https://docs.mapbox.com/api/search/geocoding/). In some cases due to missing or incorrectly-entered data from the King County Assessor, this API returned locations outside of King County, WA. If you plan to use the `address`, `lat`, or `long` fields in your modeling, consider identifying outliers prior to including the values in your model.
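One possible way to identify the out-of-county locations mentioned above is a bounding-box filter on `lat` and `long`. The sketch below makes two assumptions: the rows in `sample` are made up for illustration, and the box limits are only rough approximations of King County's extent that should be verified before use.

```python
import pandas as pd

# Hypothetical rows for illustration; the coordinates are made up.
sample = pd.DataFrame({
    "address": ["Seattle, Washington", "Nebraska City, Nebraska", "Kirkland, Washington"],
    "lat": [47.53, 40.68, 47.71],
    "long": [-122.36, -95.86, -122.20],
})

# Approximate King County bounding box (an assumption; verify before relying on it).
LAT_MIN, LAT_MAX = 47.0, 47.9
LONG_MIN, LONG_MAX = -122.6, -121.0

# True where a row falls inside the box on both axes.
in_county = (
    sample["lat"].between(LAT_MIN, LAT_MAX)
    & sample["long"].between(LONG_MIN, LONG_MAX)
)
filtered = sample[in_county]

print(f"Dropped {len(sample) - len(filtered)} of {len(sample)} rows outside the box")
```

A rectangular box is a coarse test: it can keep points in neighboring counties that fall inside the rectangle, so it is best treated as a first pass.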


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
df = pd.read_csv('data/kc_house_data.csv')

In [14]:
#df.head(3)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30155 entries, 0 to 30154
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             30155 non-null  int64  
 1   date           30155 non-null  object 
 2   price          30155 non-null  float64
 3   bedrooms       30155 non-null  int64  
 4   bathrooms      30155 non-null  float64
 5   sqft_living    30155 non-null  int64  
 6   sqft_lot       30155 non-null  int64  
 7   floors         30155 non-null  float64
 8   waterfront     30155 non-null  object 
 9   greenbelt      30155 non-null  object 
 10  nuisance       30155 non-null  object 
 11  view           30155 non-null  object 
 12  condition      30155 non-null  object 
 13  grade          30155 non-null  object 
 14  heat_source    30123 non-null  object 
 15  sewer_system   30141 non-null  object 
 16  sqft_above     30155 non-null  int64  
 17  sqft_basement  30155 non-null  int64  
 18  sqft_g

#### `heat_source` and `sewer_system` have null values.
#### The object-type columns need exploring: encode them as dummy variables and check whether they correlate with price.
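As a rough sketch of the two notes above, the following counts nulls, drops the affected rows, and one-hot encodes the categorical columns with `pd.get_dummies` before correlating with price. The `toy` frame is made-up data, and dropping null rows is only one option (imputing the most common category is another).

```python
import pandas as pd

# Made-up rows for illustration; not taken from kc_house_data.csv.
toy = pd.DataFrame({
    "price": [500_000.0, 750_000.0, 620_000.0, 410_000.0, 680_000.0, 530_000.0],
    "heat_source": ["Gas", "Electricity", None, "Gas", "Gas", "Oil"],
    "sewer_system": ["PUBLIC", "PRIVATE", "PUBLIC", None, "PUBLIC", "PUBLIC"],
})

# Count missing values per column.
null_counts = toy.isna().sum()

# Drop rows with a missing category (an assumption; imputation is an alternative).
cleaned = toy.dropna(subset=["heat_source", "sewer_system"])

# One-hot encode, dropping the first level of each column to avoid redundant dummies.
dummies = pd.get_dummies(
    cleaned, columns=["heat_source", "sewer_system"], drop_first=True, dtype=int
)

# Pearson correlation of each dummy column with price.
correlations = dummies.corr()["price"].drop("price")
print(null_counts)
print(correlations)
```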

### Column Exploration:

#### 1. Waterfront 

In [17]:
df['waterfront'].value_counts()

NO     29636
YES      519
Name: waterfront, dtype: int64

In [20]:
df['greenbelt'].value_counts()

NO     29382
YES      773
Name: greenbelt, dtype: int64

In [21]:
df['nuisance'].value_counts()

NO     24893
YES     5262
Name: nuisance, dtype: int64

In [22]:
df['view'].value_counts()

NONE         26589
AVERAGE       1915
GOOD           878
EXCELLENT      553
FAIR           220
Name: view, dtype: int64

#### Need to check whether `view` correlates with price. If it does, need to work out how the rating scale is ordered.
#### Will also want to check that the latitude and longitude of these views make sense. Houses whose views fall into the same category should be relatively close to each other, and so have similar latitudes and longitudes.
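A minimal sketch of checking `view` against price, assuming the ratings are ordered NONE < FAIR < AVERAGE < GOOD < EXCELLENT. That ordering is a guess that should be confirmed against the assessor's glossary, and the `toy` rows are made up.

```python
import pandas as pd

# Assumed ordering of the view ratings, lowest to highest.
VIEW_ORDER = ["NONE", "FAIR", "AVERAGE", "GOOD", "EXCELLENT"]

# Made-up rows for illustration.
toy = pd.DataFrame({
    "view": ["NONE", "GOOD", "FAIR", "EXCELLENT", "NONE", "AVERAGE"],
    "price": [450_000.0, 900_000.0, 600_000.0, 1_500_000.0, 500_000.0, 700_000.0],
})

# Map each label to its rank in the assumed ordering.
view_rank = {label: rank for rank, label in enumerate(VIEW_ORDER)}
toy["view_rank"] = toy["view"].map(view_rank)

# Mean price per rank shows whether price rises with the rating.
mean_price_by_view = toy.groupby("view_rank")["price"].mean()

# Pearson correlation between the rank and price.
corr_with_price = toy["view_rank"].corr(toy["price"])
print(mean_price_by_view)
print(f"Correlation with price: {corr_with_price:.2f}")
```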

In [23]:
df['condition'].value_counts()

Average      18547
Good          8054
Very Good     3259
Fair           230
Poor            65
Name: condition, dtype: int64

#### Need to check whether `condition` correlates with price. If it does, need to work out how the scale is ordered.

In [24]:
df['grade'].value_counts()

7 Average        11697
8 Good            9410
9 Better          3806
6 Low Average     2858
10 Very Good      1371
11 Excellent       406
5 Fair             393
12 Luxury          122
4 Low               51
13 Mansion          24
3 Poor              13
2 Substandard        2
1 Cabin              2
Name: grade, dtype: int64

#### Need to check whether `grade` correlates with price. If it does, need to work out how the scale is ordered.
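Each `grade` label in the output above starts with a number, so one way to get a usable scale is to extract that leading number. The sketch below runs on a few labels copied from that output; treating the number as an ordinal score is an assumption.

```python
import pandas as pd

# A few grade labels as they appear in the value_counts() output.
grades = pd.Series(
    ["7 Average", "8 Good", "10 Very Good", "6 Low Average", "13 Mansion"],
    name="grade",
)

# Pull the leading digits out of each label and convert them to integers.
grade_num = grades.str.extract(r"^(\d+)", expand=False).astype(int)
print(grade_num.tolist())
```

The numeric column can then be correlated with `price` directly, or used as a single ordinal predictor in place of a set of dummy variables.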

In [25]:
df['heat_source'].value_counts()

Gas                  20583
Electricity           6465
Oil                   2899
Gas/Solar               93
Electricity/Solar       59
Other                   20
Oil/Solar                4
Name: heat_source, dtype: int64

#### Need to check whether `heat_source` correlates with price. If it does, need to work out how to encode the categories, since they have no obvious order.

In [26]:
df['sewer_system'].value_counts()

PUBLIC                25777
PRIVATE                4355
PRIVATE RESTRICTED        6
PUBLIC RESTRICTED         3
Name: sewer_system, dtype: int64

#### Need to check whether `sewer_system` correlates with price. If it does, need to work out how to encode the categories, since they have no obvious order.

In [27]:
df['address'].value_counts()

Avenue, 108 Foothill Blvd, Rancho Cucamonga, California 91730, United States    38
Delridge Way Southwest, Seattle, Washington 98106, United States                24
9th Ave, Nebraska City, Nebraska 68410, United States                           21
South 35th Avenue, Bellevue, Nebraska 68123, United States                      20
15th Avenue, Plattsmouth, Nebraska 68048, United States                         17
                                                                                ..
12518 98th Avenue Northeast, Kirkland, Washington 98034, United States           1
2815 31st Avenue West, Seattle, Washington 98199, United States                  1
32821 11th Avenue Southwest, Federal Way, Washington 98023, United States        1
17950 50th Avenue South, SeaTac, Washington 98188, United States                 1
10838 26th Avenue Southwest, Seattle, Washington 98146, United States            1
Name: address, Length: 29560, dtype: int64

#### Will probably split this column into street and zip code. Need to make sure the result matches up with `lat` and `long`.
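A possible way to split the address, sketched on three strings copied from the output above. The regular expression assumes every address ends in `<city>, <state> <5-digit zip>, United States`; rows that do not fit come back as `NaN` and would need separate handling.

```python
import pandas as pd

# Example addresses copied from the value_counts() output.
addresses = pd.Series([
    "12518 98th Avenue Northeast, Kirkland, Washington 98034, United States",
    "9th Ave, Nebraska City, Nebraska 68410, United States",
    "Delridge Way Southwest, Seattle, Washington 98106, United States",
], name="address")

# Named groups become the column names of the resulting DataFrame.
parts = addresses.str.extract(
    r"^(?P<street>.+), (?P<city>[^,]+), (?P<state>[A-Za-z ]+) (?P<zip>\d{5}), United States$"
)

# Flag rows geocoded outside Washington so they can be checked against lat/long.
parts["in_washington"] = parts["state"].eq("Washington")
print(parts)
```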

### Column Types to be converted:
- date: object ---> datetime
- waterfront: object ---> integer (0 = NO, 1 = YES)


In [18]:
# Parse 'date' and keep only the calendar date; .dt.date yields Python date objects, so the dtype stays object
df['date'] = pd.to_datetime(df['date']).dt.date
df['date']

0        2022-05-24
1        2021-12-13
2        2021-09-29
3        2021-12-14
4        2021-08-24
            ...    
30150    2021-11-30
30151    2021-06-16
30152    2022-05-27
30153    2022-02-24
30154    2022-04-29
Name: date, Length: 30155, dtype: object

In [19]:
# Replace the 'NO' values with 0 and 'YES' values with 1
df['waterfront'] = df['waterfront'].replace({'NO': 0, 'YES': 1})
df['waterfront']

0        0
1        0
2        0
3        0
4        0
        ..
30150    0
30151    0
30152    0
30153    0
30154    0
Name: waterfront, Length: 30155, dtype: int64