## Final Project Submission

Please fill out:
* Student name: Alice Wanjiru Wamuyu
* Student pace:  part time 
* Scheduled project review date/time: 
* Instructor name: Asha deen
* Blog post URL:


# House Data Analysis

## Overview

This project analyzes the King County house sales dataset for Kings' real estate agency, which helps homeowners buy and sell homes. I use regression modelling in the analysis.

## Business Problem

Kings' real estate agency helps homeowners buy and/or sell homes. There is a need to provide advice to homeowners about how home renovations might increase the estimated value of their homes, and by what amount.

## Data Understanding
For this analysis, I used the King County house sales dataset. This dataset contains information on houses built between 1900 and 2015.
My predictor variables are: bedrooms, bathrooms, sqft_living, sqft_lot, floors, condition, grade and yr_built. The target variable is price.

## Importing relevant libraries

In [None]:
# importing pandas for data wrangling and manipulation
import pandas as pd

# importing matplotlib and seaborn for data visualization
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import seaborn as sns
sns.set_context('notebook')

# numpy for numerical operation and arrays
import numpy as np

# importing libraries needed for the linear regression model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, ShuffleSplit
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error
from statsmodels.stats.outliers_influence import variance_inflation_factor

## Loading the data

In [None]:
#importing data
Data = pd.read_csv("C:/Users/This PC/Downloads/kc_house_data.csv")

In [None]:
#checking the first 5 rows
Data.head()

## Data Cleaning

1) Removing columns that aren't required

In [None]:
irrelevant_columns = ['date', 'view', 'sqft_above', 'sqft_basement', 
'yr_renovated', 'zipcode','lat','long','sqft_living15','sqft_lot15','waterfront']
Data.drop(irrelevant_columns, axis=1, inplace = True)

In [None]:
# getting more information about the columns
Data.info()

In [None]:
# converting bathrooms and floors from float to int
# note: astype(int) truncates any fractional part (e.g. 2.5 becomes 2)
convert_dict = {'bathrooms': int,'floors':int }
Data = Data.astype(convert_dict)

In [None]:
Data.dtypes

From my analysis, waterfront has no relationship with any of the other variables. Since it also has missing values, I dropped the waterfront column.
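
A decision like this can be backed up by counting missing values per column before anything is dropped. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the King County data) to show the pattern; on the real data the same calls would be applied to `Data` before the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "price": [200000.0, 500000.0, 150000.0, 600000.0],
    "waterfront": [np.nan, 0.0, np.nan, 0.0],
    "sqft_living": [1200, 2500, 800, 2000],
})

# Count and share of missing values in each column
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Columns with a large share of missing values and little relationship to the target are the usual candidates for dropping.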

## Checking for duplicates and dropping duplicated values

In [None]:
Data.duplicated().value_counts()

In [None]:
##checking for duplicates
Data.duplicated().sum()

In [None]:
Data.drop_duplicates(inplace = True)

In [None]:
Data.duplicated().sum()

In [None]:
Data

# Exploratory Data Analysis 

In [None]:
# Description of the data using summary statistics
Data.describe()

## Correlation between variables


In [None]:
dependent = ['bedrooms', 'bathrooms', 'sqft_living','sqft_lot','floors','condition','grade','yr_built']

In [None]:
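plt.figure(figsize=(15,8))
# one scatter plot of price against each predictor
for idx, col in enumerate(dependent):
    plt.subplot(3, 3, idx + 1)
    plt.scatter(x=col, y='price', data=Data)
    plt.xlabel(col)
    plt.ylabel('price')
plt.tight_layout()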

From the scatter plots above, we confirm that condition and grade are categorical (ordinal) variables.
Condition ranges from 1 to 5: 1 represents Poor (worn out), 2 represents Fair (badly worn), 3 represents Average, 4 represents Good and 5 represents Very Good.

In the data, grade ranges from 3 to 13. The full grading scale runs from 1 to 13 and is generally defined as:

* 1-3: Falls short of minimum building standards.
* 4: Generally older, low quality construction. Does not meet code.
* 5: Low construction costs and workmanship. Small, simple design.
* 6: Lowest grade currently meeting building code. Low quality materials and simple designs.
* 7: Average grade of construction and design. Commonly seen in plats and older sub-divisions.
* 8: Just above average in construction and design. Usually better materials in both the exterior and interior finish work.
* 9: Better architectural design with extra interior and exterior design and quality.
* 10: Homes of this quality generally have high quality features. Finish work is better and more design quality is seen in the floor plans. Generally have a larger square footage.
* 11: Custom design and higher quality finish work with added amenities of solid woods, bathroom fixtures and more luxurious options.
* 12: Custom design and excellent builders. All materials are of the highest quality and all conveniences are present.
* 13: Generally custom designed and built. Mansion level. Large amount of highest quality cabinet work, wood trim, marble and entry ways.

Source: https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r



The grade of a house and the area of its living space each have a linear relationship with the price of the house.

In [None]:
datacorr = Data.corr()
datacorr

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(datacorr)

From the heat map above, we see that the grade of a house and the area of its living space are the features most strongly correlated with the price of the house.

## Checking for Multicollinearity
Predictors with a high pairwise correlation (r > 0.65) are very likely to produce multicollinearity in a model. To check for this, I computed the pairwise (Pearson) correlation coefficients of the predictive features and visualized them as a heatmap.

In [None]:
dependent = ['bedrooms', 'bathrooms', 'sqft_living','sqft_lot','floors','condition','grade','yr_built']
corr = Data[dependent].corr()
corr

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(corr, center=0, annot=True);

No pairwise correlation exceeds 0.65, so by this threshold there is no sign of multicollinearity.

## Using pandas to plot histograms for all the numeric variables in the dataset

In [None]:
Data.hist(figsize = (18,18));

These variables have skewed distributions and are therefore not normally distributed.
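
Skewness can also be quantified rather than judged by eye. The sketch below uses pandas' `skew()` on synthetic columns (the frame `toy` is hypothetical, with a right-skewed "price" and a roughly symmetric "yr_built") and shows how a log transform pulls the skew of a right-skewed column toward zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical data: right-skewed "price", roughly symmetric "yr_built"
toy = pd.DataFrame({
    "price": rng.lognormal(mean=13, sigma=0.5, size=2000),
    "yr_built": rng.normal(1970, 25, size=2000),
})

# Sample skewness per column: about 0 for symmetric data, positive for a long right tail
skew_before = toy.skew()
# Skewness of the same price column after a log transform
skew_after = np.log(toy["price"]).skew()

print(skew_before)
print("price skew after log:", skew_after)
```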

## Using Log Transformations to normalize the non-normal data

In [None]:
# np.log is only defined for positive inputs, so the illustration uses a positive range
x = np.linspace(start=1, stop=100, num=10**3)
y = np.log(x)

In [None]:
x_cols = ['price','bedrooms', 'bathrooms', 'sqft_living','sqft_lot','floors','condition','grade','yr_built']

In [None]:
# log-transform the skewed columns in place
non_normal = ['sqft_living', 'price', 'grade']
for t in non_normal:
    Data[t] = np.log(Data[t])
pd.plotting.scatter_matrix(Data[x_cols], figsize=(10,12));

## Check for Linearity

Testing the linearity assumption of linear regression modeling. The dependent variable should be linearly related to the independent variables.


In [None]:
# Testing whether there is a linear relationship between the price of a house and sqft_living
# (x and y are passed as keywords, which recent seaborn versions require)
sns.jointplot(x='sqft_living', y='price', data=Data, kind='reg');

There seems to be a somewhat linear relationship between sqft_living and the price of the house.

In [None]:
# Testing whether there is a linear relationship between the price of a house and grade
# (x and y are passed as keywords, which recent seaborn versions require)
sns.jointplot(x='grade', y='price', data=Data, kind='reg');

There is also a linear relationship between the grade and the price of the house.

## Modelling the data

In [None]:
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import scipy.stats as stats

import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', RuntimeWarning)
warnings.simplefilter('ignore', UserWarning)

## Building a Baseline Model

As a baseline, I built a linear regression model using only sqft_living, the feature most correlated with price.

In [None]:
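# 'id' is only a row identifier, so it is excluded from the features along with the target
X = Data.drop(columns=['price', 'id'], errors='ignore')
y = Data[['price']]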
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
#Instantiate Linear regression model
most_correlated_feature = 'sqft_living'
baseline_model = LinearRegression()
splitter = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)

baseline_scores = cross_validate(
    estimator=baseline_model,
    X=X_train[[most_correlated_feature]],
    y=y_train,
    return_train_score=True,
    cv=splitter
)

print("Train r-squared score:", baseline_scores["train_score"].mean())
print("Test r-squared score:", baseline_scores["test_score"].mean())

Observations:

* R-squared is 0.49, which shows that the baseline model is weak.
* The test subset performs slightly better than the training subset.
* Performance is nearly identical on the training and test subsets, with both explaining around 43% of the variance.
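
For reference, the R-squared score reported by `cross_validate` for a regressor is the share of the target's variation that the predictions account for. A minimal sketch of the calculation, using made-up numbers (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

A score near 1 means the predictions track the target closely, while a score near 0 means the model does little better than always predicting the mean.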

## Second model with all features.

In [None]:
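# features other than sqft_living (not used below; the second model is fit on every column of X_train)
second_model_features = X_train.drop('sqft_living', axis=1)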

In [None]:
second_model = LinearRegression()

second_model_scores = cross_validate(
    estimator=second_model,
    X=X_train,
    y=y_train,
    return_train_score=True,
    cv=splitter
)

print("Second Model")
print("Train r-squared score:", second_model_scores["train_score"].mean())
print("Test r-squared score: ", second_model_scores["test_score"].mean())
print()
print("Baseline Model")
print("Train r-squared score:", baseline_scores["train_score"].mean())
print("Test r-squared score: ", baseline_scores["test_score"].mean())

The second model performs better than the baseline model on both the training and test subsets.
The test subset of the second model performs better than its training subset.
The difference between the R-squared values of the training and test subsets is small, meaning the model should perform well on unseen data.
The second model will be the final model.
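
The cross-validation scores above come from splits of the training data only, so the `X_test` and `y_test` produced by `train_test_split` remain available for a final check. The sketch below shows one way such a check could look, using synthetic data in place of the housing data (the arrays and `final_model` are hypothetical names); it reports both R-squared and mean squared error on each subset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: y depends linearly on two features plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training subset only, then score both subsets
final_model = LinearRegression().fit(X_train, y_train)
train_pred = final_model.predict(X_train)
test_pred = final_model.predict(X_test)

train_r2 = r2_score(y_train, train_pred)
test_r2 = r2_score(y_test, test_pred)
train_mse = mean_squared_error(y_train, train_pred)
test_mse = mean_squared_error(y_test, test_pred)

print("Train R-squared:", train_r2, "MSE:", train_mse)
print("Test R-squared: ", test_r2, "MSE:", test_mse)
```

Similar scores on the two subsets suggest the model is not overfitting the training data.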

In [None]:
outcome = 'price'
variables2 =['bedrooms', 'bathrooms', 'sqft_living','sqft_lot','floors','condition','grade','yr_built']
predictors = '+'.join(variables2)
formula = outcome + '~' + predictors
model = ols(formula=formula, data=Data).fit()
model.summary()


All features have statistically significant p-values, so none of them is dropped from the model.
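
Because price, sqft_living and grade were log-transformed before fitting, the coefficients on the two logged predictors can be read as approximate percentage effects: a coefficient b on log(sqft_living) implies that multiplying living space by some factor multiplies the predicted price by that factor raised to the power b. The sketch below works through the arithmetic with a made-up coefficient (`beta_log_sqft` is hypothetical and not taken from the model summary above).

```python
# Hypothetical coefficient on log(sqft_living) in a model of log(price);
# NOT taken from the fitted model in this notebook
beta_log_sqft = 0.5


def pct_price_change(beta, pct_feature_change):
    """Percent change in price implied by a log-log coefficient."""
    ratio = (1 + pct_feature_change / 100) ** beta
    return (ratio - 1) * 100


# Implied price change for a 10% increase in living space
change = pct_price_change(beta_log_sqft, 10)
print(round(change, 2))
```

Substituting the fitted coefficient from `model.params` for the placeholder gives an estimate of "by what amount" a given increase in living space is associated with a higher price.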

## Checking for the Normality Assumption

In [None]:
fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)

## Conclusion

The square footage of the interior living space and the grade of the house each have a somewhat positive linear relationship with price. This means that an increase in interior living space is associated with an increase in the price of the house, and similarly for grade.

The R-squared scores for the train and test subsets are nearly identical, which suggests that the model will perform similarly on unseen data.
