## Project Title: House Sales Analysis in King County

### 1. Defining the Question

#### a) Specifying the Data Analytic Question

**Data Analytic Question:** How can home renovations increase the estimated value of homes in King County, and by how much, so that we can advise homeowners accordingly?

#### b) Defining the Metric for Success

To determine the success of our analysis, we will consider the following metrics:

1. **Model Performance:** We will assess the overall performance of our regression models using appropriate evaluation metrics, such as mean squared error (MSE), root mean squared error (RMSE), or R-squared (R^2) value. The lower the MSE or RMSE and the higher the R^2 value, the better the model's predictive accuracy.

2. **Regression Model Coefficients:** We aim to identify at least two regression model coefficients that are statistically significant and have a meaningful impact on the estimated value of homes. These coefficients will help us provide specific recommendations to homeowners regarding the potential value increase associated with certain home renovations.
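
As a rough illustration of the first metric, the sketch below computes MSE, RMSE, and R^2 by hand. The helper name `regression_metrics` and the price figures are hypothetical and are not taken from the dataset.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 for paired lists of actual and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared gaps between actual and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared gaps between actual values and their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}

# Hypothetical sale prices (in $1000s) and model predictions, for illustration only
actual = [200.0, 300.0, 400.0, 500.0]
predicted = [210.0, 290.0, 410.0, 490.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE is in the same units as the target (here, thousands of dollars), which usually makes it easier to explain to homeowners than MSE.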

#### c) Recording the Experimental Design

The experimental design for this analysis will involve the following steps:

1. **Data Collection:** We will use the King County House Sales dataset, specifically the `kc_house_data.csv` file, which contains information about house sales in King County. The dataset includes various features such as the number of bedrooms, bathrooms, square footage, condition, and other relevant information.

2. **Data Cleaning and Exploration:** We will perform data cleaning tasks, handle missing values, and explore the dataset to gain a better understanding of the variables and their relationships. This will involve visualizations, summary statistics, and correlation analysis.

3. **Feature Selection:** Based on our domain knowledge and exploratory analysis, we will select relevant features that are likely to impact the estimated value of homes. We will consider factors such as location, size, condition, and other characteristics that influence home prices.

4. **Model Building:** We will build multiple linear regression models using different combinations of features and techniques. We will start with a basic model and iteratively refine it by adding or removing features, transforming variables, or using regularization techniques as necessary.

5. **Model Evaluation and Selection:** We will evaluate the performance of each model using appropriate evaluation metrics, such as MSE, RMSE, or R^2. Based on the results, we will select the best-performing model that provides accurate predictions and meaningful insights.

6. **Interpretation and Recommendations:** Once we have the final model, we will interpret the regression coefficients and provide recommendations to homeowners about how specific home renovations can potentially increase the estimated value of their homes.
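
The model building and evaluation steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis itself: the data-generating numbers are made up, and plain NumPy least squares stands in for the scikit-learn and statsmodels tools used later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): price driven by living area and bathrooms
n = 200
sqft_living = rng.uniform(500, 4000, n)
bathrooms = rng.integers(1, 5, n).astype(float)
price = 50_000 + 200 * sqft_living + 15_000 * bathrooms + rng.normal(0, 20_000, n)

# Design matrix with an intercept column first
X = np.column_stack([np.ones(n), sqft_living, bathrooms])

# Hold out 20% of the rows for evaluation
idx = rng.permutation(n)
train, test = idx[:160], idx[160:]

# Fit ordinary least squares on the training rows only
coef, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)

# Evaluate on the held-out rows
pred = X[test] @ coef
ss_res = float(np.sum((price[test] - pred) ** 2))
ss_tot = float(np.sum((price[test] - price[test].mean()) ** 2))
rmse = float(np.sqrt(ss_res / len(test)))
r2 = 1 - ss_res / ss_tot

print("coefficients (intercept, sqft_living, bathrooms):", coef)
print("test RMSE:", rmse, "test R^2:", r2)
```

Each fitted coefficient estimates the change in price for a one-unit change in that feature with the others held fixed, which is the reading the recommendations step relies on.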


#### d) Exploring the Dataset

Let's explore the King County House Sales dataset to get familiar with its structure and content:

- `id`: Unique identifier for a house
- `date`: Date house was sold
- `price`: Sale price (prediction target)
- `bedrooms`: Number of bedrooms
- `bathrooms`: Number of bathrooms
- `sqft_living`: Square footage of living space in the home
- `sqft_lot`: Square footage of the lot
- `floors`: Number of floors (levels) in the house
- `waterfront`: Whether the house is on a waterfront
- `view`: Quality of view from the house
- `condition`: Overall condition of the house
- `grade`: Overall grade of the house
- `sqft_above`: Square footage of the house apart from the basement
- `sqft_basement`: Square footage of the basement
- `yr_built`: Year when the house was built
- `yr_renovated`: Year when the house was renovated
- `zipcode`: ZIP Code used by the United States Postal Service
- `lat`: Latitude coordinate
- `long`: Longitude coordinate
- `sqft_living15`: The square footage of interior housing living space for the nearest 15 neighbors
- `sqft_lot15`: The square footage of the land lots of the nearest 15 neighbors

Please note that some features, such as `date`, `view`, `sqft_above`, `sqft_basement`, `yr_renovated`, `zipcode`, `lat`, `long`, `sqft_living15`, and `sqft_lot15`, may not be used in the analysis, depending on the project requirements. We will focus on the features most relevant to our data analytic question.

#### e) Data Relevance

Based on our data analytic question, the dataset provides relevant information to address the problem at hand. The dataset includes key features such as the number of bedrooms, bathrooms, square footage, condition, and grade of the houses, which can help us analyze how these factors impact the estimated value of homes in King County.



### 2. Reading the Dataset

In [3]:
#Importing necessary libraries
import math
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

#Suppress all warnings (including FutureWarning) to keep the notebook output readable
warnings.filterwarnings('ignore')

In [4]:
#load and preview the data
data = pd.read_csv("Data/kc_house_data.csv")
data.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/2014 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | NaN | NONE | ... | 7 Average | 1180 | 0.0 | 1955 | 0.0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 12/9/2014 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | NO | NONE | ... | 7 Average | 2170 | 400.0 | 1951 | 1991.0 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 2/25/2015 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | NO | NONE | ... | 6 Low Average | 770 | 0.0 | 1933 | NaN | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 12/9/2014 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | NO | NONE | ... | 7 Average | 1050 | 910.0 | 1965 | 0.0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 2/18/2015 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | NO | NONE | ... | 8 Good | 1680 | 0.0 | 1987 | 0.0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [5]:
# check the last 5 rows
data.tail()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21592 | 263000018 | 5/21/2014 | 360000.0 | 3 | 2.5 | 1530 | 1131 | 3.0 | NO | NONE | ... | 8 Good | 1530 | 0.0 | 2009 | 0.0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 |
| 21593 | 6600060120 | 2/23/2015 | 400000.0 | 4 | 2.5 | 2310 | 5813 | 2.0 | NO | NONE | ... | 8 Good | 2310 | 0.0 | 2014 | 0.0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 |
| 21594 | 1523300141 | 6/23/2014 | 402101.0 | 2 | 0.75 | 1020 | 1350 | 2.0 | NO | NONE | ... | 7 Average | 1020 | 0.0 | 2009 | 0.0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 |
| 21595 | 291310100 | 1/16/2015 | 400000.0 | 3 | 2.5 | 1600 | 2388 | 2.0 | NaN | NONE | ... | 8 Good | 1600 | 0.0 | 2004 | 0.0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 |
| 21596 | 1523300157 | 10/15/2014 | 325000.0 | 2 | 0.75 | 1020 | 1076 | 2.0 | NO | NONE | ... | 7 Average | 1020 | 0.0 | 2008 | 0.0 | 98144 | 47.5941 | -122.299 | 1020 | 1357 |


In [6]:
#checking data types and shape
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

The dataset is extensive: 21,597 entries across 21 columns. Most of the columns hold numerical data, which suits linear regression, while the text-valued (`object`) columns `date`, `waterfront`, `view`, `condition`, `grade`, and `sqft_basement` need conversion before modelling.
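
A quick way to see the same split between numeric and text columns, along with the missing values, is sketched below. It runs on a three-row sample copied from the preview above and restricted to a few columns, so the counts are illustrative rather than those of the full dataset.

```python
import pandas as pd

# Three rows copied from the head() preview, restricted to a few columns
sample = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0],
    "bedrooms": [3, 3, 2],
    "waterfront": [None, "NO", "NO"],
    "grade": ["7 Average", "7 Average", "6 Low Average"],
    "yr_renovated": [0.0, 1991.0, None],
})

# Columns that can go into a regression as-is versus those needing conversion
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column
missing_counts = sample.isna().sum()

print(numeric_cols, text_cols)
print(missing_counts)
```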

### 3. Cleaning the Dataset
In this section, we will prepare the data for the analysis by converting the categorical features into numeric ones. This will allow us to apply mathematical operations and statistical methods on the data.

We will use label encoding to convert the `view` and `condition` features from text values to numeric values that represent the quality of the view from the house and the condition of the house.
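
For ordered categories, the numeric codes should follow the quality ranking. The sketch below uses an explicit mapping for `condition`. The five labels and their worst-to-best order are assumptions made for illustration, so check them against `data['condition'].unique()` before reusing the mapping.

```python
import pandas as pd

# Hypothetical condition labels, assumed to run from worst to best in this order
condition_order = ["Poor", "Fair", "Average", "Good", "Very Good"]
condition_codes = {label: code for code, label in enumerate(condition_order)}

sample = pd.Series(["Average", "Very Good", "Fair", "Good"], name="condition")
encoded = sample.map(condition_codes)

# For comparison: codes assigned in alphabetical order, which is how
# scikit-learn's LabelEncoder numbers its classes
alphabetical = {label: code for code, label in enumerate(sorted(condition_order))}

print(encoded.tolist())
print(alphabetical)
```

With alphabetical codes, "Poor" would rank above "Good", which is why an explicit mapping may be preferable when the encoded value feeds a linear model.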

We will also use pandas functions to convert the `date` feature from string to datetime format and extract only the year of sale. We will store the year in a new column and drop the original `date` column.
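
A minimal sketch of that date handling is shown below, using three date strings from the preview. The column name `sell_year` is a placeholder chosen for this example.

```python
import pandas as pd

# Date strings copied from the head() preview (month/day/year)
sample = pd.DataFrame({"date": ["10/13/2014", "12/9/2014", "2/25/2015"]})

# Parse the strings as datetimes, keep only the year, then drop the original column
sample["sell_year"] = pd.to_datetime(sample["date"], format="%m/%d/%Y").dt.year
sample = sample.drop(columns="date")

print(sample)
```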

Finally, we will convert the `sqft_basement` feature from object to int type, handling any values that cannot be converted.
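
One possible way to do this conversion is sketched below. The `"?"` placeholder is an assumption about what prevents a direct cast, and replacing unparseable entries with 0 is only one option (dropping those rows is another).

```python
import pandas as pd

# Hypothetical raw values; "?" stands in for whatever entry blocks the conversion
raw = pd.Series(["0.0", "400.0", "?", "910.0"], name="sqft_basement")

# errors="coerce" turns anything that is not a number into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# Assumption: treat unparseable entries as "no basement" (0), then cast to int
cleaned = numeric.fillna(0).astype(int)

print(cleaned.tolist())
```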

In [7]:
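#'View' column transformation: map the ordinal labels to integers, worst to best
#(assumes FAIR ranks below AVERAGE); assign back instead of using inplace on a column
view_order = {'NONE': 0, 'FAIR': 1, 'AVERAGE': 2, 'GOOD': 3, 'EXCELLENT': 4}
data['view'] = data['view'].replace(view_order)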