# Exploratory Data Analysis: King County (Washington) real estate sales, May 2014 to May 2015

## List of features in the data set
<details>
    <summary>Features</summary>


| column name  	|  description          |
|---	        |---	                |
| date  	    | date of the sale  	|
| price         | prediction target   	| 
| house_id  	| unique identifier of the house  	|
| id          	| unique identifier of the sale   	|
| bedrooms  	| number of bedrooms  	|
| bathrooms  	| number of bathrooms  	|
| sqft_living   | square footage of the home                    	|
| sqft_lot          	| square footage of the lot                    	|
| floors          	| number of floors (levels) in the house                    	|
| waterfront          	| whether the house has a view of a waterfront                    	|
| view          	| grading of the view outside the house               	|
| condition          	| overall condition of the home                    	|
| grade          	| overall grade given to the housing unit, based on King County grading system                    	|
| sqft_above          	| square footage of house apart from basement                    	|
| sqft_basement          	| square footage of the basement                    	|
| yr_built          	| Built Year                    	|
| yr_renovated          	| Year when house was renovated, 0 means no renovation                     	|
| zipcode          	|  zipcode of the house                   	|
| lat          	| Latitude coordinate                    	|
| long          	| Longitude coordinate                    	|
| sqft_living15          	| The square footage of interior housing living space for the nearest 15 neighbors                    	|
| sqft_lot15          	| The square footage of the land lots of the nearest 15 neighbors                    	|

</details>

## Stakeholder
The target of our analysis is Timothy Stevens, a seller who owns expensive houses in the centre of Seattle. He needs to sell them quickly and is interested in the best time to sell within a year. He is open to renovating if it increases his profit.

## Data Analysis

#### import statements and setting options

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

#telling pandas to always show all columns
pd.set_option('display.max_columns', None)


In [None]:
#importing dataframe from pickle:
df = pd.read_pickle("data/dataframe_housesales.pkl")

### First look at the data

##### correlation matrix for quick overview

In [None]:
# @hidden_cell
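# Compute the correlation matrix, restricted to numeric columns
# (this was the default behaviour before pandas 2.0)
corr = df.corr(numeric_only=True)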

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

findings: price correlates with size, number of bathrooms and bedrooms, view and grade. There also seems to be a slight correlation between latitude and price (maybe the further north, the more expensive?). There is no correlation between price and construction year or renovation year.
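
The heatmap gives a visual impression only. As a minimal sketch, the correlations with `price` can also be ranked numerically. The helper name `price_correlations` and the toy frame are assumptions made for illustration and are not part of the original notebook; in the notebook the function would be called on `df`.

```python
import pandas as pd


def price_correlations(frame: pd.DataFrame, target: str = "price") -> pd.Series:
    """Pearson correlation of every numeric column with `target`, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    # order by absolute strength but keep the sign of each correlation
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# toy data for illustration only (NOT taken from the King County data set)
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "sqft_living": [10, 20, 30, 40],
    "yr_built": [1990, 1950, 2000, 1960],
})
ranked = price_correlations(toy)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of hiding them at the end of the list.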

#### first look at distribution of sales over the time period

In [None]:
# @hidden_cell
#histogram of the sales over time
fig = px.histogram(df, x="date", nbins=58, title= "histogram of the sales over time")
fig.show()

hypothesis: sales slump between the end of December (Christmas holidays) and February, and also at the end of November (Thanksgiving).
Why the slump in April 2014 and May 2015? Maybe because of the data gathering method (such as self-reporting) at the edges of the date range?
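
To check the seasonal hypothesis with numbers rather than histogram bins, the sales can be counted per calendar month. This is a sketch under assumptions: the helper `sales_per_month` and the toy dates are hypothetical, and the `date` column is assumed to be parseable by `pd.to_datetime`.

```python
import pandas as pd


def sales_per_month(frame: pd.DataFrame, date_col: str = "date") -> pd.Series:
    """Number of sales per calendar month, in chronological order."""
    months = pd.to_datetime(frame[date_col]).dt.to_period("M")
    return months.value_counts().sort_index()


# toy data for illustration only (NOT taken from the King County data set)
toy = pd.DataFrame({
    "date": ["2014-05-02", "2014-05-20", "2014-12-24",
             "2015-01-05", "2015-01-15", "2015-01-30"],
})
monthly = sales_per_month(toy)
print(monthly)
```

Months at the edges of the date range are only partially covered, so their counts are not directly comparable with full months.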

#### distribution of the years the houses were built in

In [None]:
# @hidden_cell
fig = px.histogram(df, x="yr_built", title="histogram of the construction years")
fig.show()

observation: construction slumps during economic crises (Great Depression in the 1930s, oil crisis in the 1970s, financial crisis of 2008, etc.)

#### plotting prices against the year the house was built in

In [None]:
# @hidden_cell
fig = px.scatter(x=df["yr_built"], y=df["price"], title="scatterplot: price versus the construction year", labels={"x":"construction year", "y":"price in $"})
fig.show()

observation: slight price increase in newer buildings, but no clear trend; lots of outliers?

### houses that were sold multiple times in timeframe

In [None]:
# creating a list of house ids that show up multiple times in the dataframe
duplicate_id= df[df.duplicated(subset=["house_id"])]["house_id"].tolist()

In [None]:
# create a dataframe with the sales of these duplicate house ids
df_duplicate = df[df["house_id"].isin(duplicate_id)]
#df_duplicate.sort_values(by=["house_id", "yr_renovated"], ascending=False)

let's see if there are houses which were renovated between the sales in the dataset

In [None]:
#check for unique values in the year of renovation
df_duplicate["yr_renovated"].nunique()


only 5 unique values in the renovation year column, that's not much

In [None]:
# keep one row per (house_id, yr_renovated) pair: a house id that still shows up more than once
# afterwards has different renovation years across its sales
df_drop = df_duplicate.drop_duplicates(subset=["house_id", "yr_renovated"])
duplicate_id2 = df_drop[df_drop.duplicated(subset=["house_id"])]["house_id"].tolist()

# dataframe containing only the sales of houses whose renovation year differs between sales
df_drop_dup = df_drop[df_drop["house_id"].isin(duplicate_id2)]
df_drop_dup


insight: no houses were renovated between two sales in the dataset
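
The same check can be written more directly by counting distinct renovation years per house. This is only a sketch: the helper `houses_renovated_between_sales` and the toy frame are assumptions for illustration, while the column names `house_id` and `yr_renovated` come from the feature list.

```python
import pandas as pd


def houses_renovated_between_sales(frame: pd.DataFrame) -> list:
    """House ids whose sales carry more than one distinct renovation year."""
    n_years = frame.groupby("house_id")["yr_renovated"].nunique()
    return n_years[n_years > 1].index.tolist()


# toy data for illustration only (NOT taken from the King County data set)
toy = pd.DataFrame({
    "house_id": [1, 1, 2, 2, 3],
    "yr_renovated": [0, 0, 0, 2014, 1990],
})
changed = houses_renovated_between_sales(toy)
print(changed)
```

An empty list for the real data would correspond to the insight that no house was renovated between two sales.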

### create subselection for further analysis based on location and price range

#### check the distribution of the houses on a geoplot and plot our subselection of data

In [None]:
#zip codes of Seattle proper
zip_seattle = [98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98110, 98111, 98112, 98114, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98124, 98125, 98126, 98129, 98131, 98132, 98133, 98134, 98136, 98138, 98144, 98145, 98146, 98148, 98151, 98154, 98155, 98158, 98160, 98161, 98164, 98166, 98168, 98170, 98171, 98174, 98177, 98178, 98181, 98184, 98185, 98188, 98190, 98191, 98195, 98198, 98199]

In [None]:
#dataframe for geoplotting
df_geo = df.loc[:, ["id", "lat", "long", "zipcode"]]

In [None]:
#boolean category if house sale was in Seattle
df_geo["in_city"] = df["zipcode"].isin(zip_seattle)

In [None]:
#geoplot of the datapoints
fig = px.scatter_mapbox(
    df_geo,
    lat="lat",
    lon="long",
    color="in_city",
    title="Subselection of the data based on geography",
    labels={"in_city": "is the datapoint in Seattle"},
)
fig.update_layout(mapbox_style="open-street-map", height=700, width=750)
fig.update_traces(marker={"size": 3})
fig.update_mapboxes(zoom=8.5)
fig.show()

#### creating geographic subselection

In [None]:
#dataframe with sales in Seattle proper
df_city = df[df["zipcode"].isin(zip_seattle)]

In [None]:
df_city_saledate= df_city.groupby("date", as_index=False).count()[["date","id"]]

##### check if distribution of sales over time is different in Seattle compared to entire dataset

In [None]:
fig = px.histogram(df_city, x="date", nbins=58, title="distribution of house sales between May 2014 and May 2015")
fig.show()

In [None]:
fig = px.scatter(df_city_saledate, x="date", y="id", title='Amount of Sales in Seattle May 2014 - May 2015', trendline="lowess")
fig.show()

observation: both the entire dataset and the geographic subselection seem to have the same distribution over time

#### Creating further subselection based on price

In [None]:
fig = px.histogram(df_city, x="price", title="histogram of house prices in Seattle", labels={"price": "price in $"})
fig.show()

our stakeholder owns high-end properties, so let's look only at the properties priced at or above the 75th percentile:

In [None]:
df_city["price"].quantile(q=0.75)

In [None]:
#dataframe with sales in Seattle priced at or above the 75th percentile
df_city_over75 = df_city[df_city["price"] >= 630000]

In [None]:
df.shape

In [None]:
df_city.shape

In [None]:
df_city_over75.shape

we reduced our dataset from 21597 to 2255 data points! 
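
The price threshold above is written as a literal number. As a hedged alternative, it can be derived from the data so the filter stays correct if the subselection changes. The helper `top_quartile` and the toy prices are assumptions for illustration; in the notebook it would be applied to `df_city`.

```python
import pandas as pd


def top_quartile(frame: pd.DataFrame, column: str = "price", q: float = 0.75) -> pd.DataFrame:
    """Rows whose `column` value is at or above the q-quantile of that column."""
    threshold = frame[column].quantile(q)
    # .copy() so that new columns can be added later without a SettingWithCopyWarning
    return frame[frame[column] >= threshold].copy()


# toy data for illustration only (NOT taken from the King County data set)
toy = pd.DataFrame({"price": [100, 200, 300, 400, 500]})
expensive = top_quartile(toy)
print(expensive)
```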

#### creating new features in subselection

split up the subselection based on whether the property has been renovated or not

features:

| column name | description |
|---|---|
| build_to_renovation_time | time between construction and renovation in years |
| renovation_time | time between renovation and sale in years |
| build_time | time between construction and sale in years, if no renovation took place |


In [None]:
# dataframe with only renovated houses and new features
# (.copy() avoids pandas' SettingWithCopyWarning when adding columns to a filtered frame)
df_city_over75_renovated = df_city_over75[df_city_over75["yr_renovated"] != 0].copy()
df_city_over75_renovated["build_to_renovation_time"] = df_city_over75_renovated["yr_renovated"] - df_city_over75_renovated["yr_built"]
df_city_over75_renovated["yr_renovated"] = pd.to_datetime(df_city_over75_renovated["yr_renovated"], format="%Y")
df_city_over75_renovated["yr_built"] = pd.to_datetime(df_city_over75_renovated["yr_built"], format="%Y")

In [None]:
# new feature: renovation_time, converted to (fractional) years for easier plotting
df_city_over75_renovated["renovation_time"] = (
    df_city_over75_renovated["date"] - df_city_over75_renovated["yr_renovated"]
) / pd.Timedelta(days=365)

In [None]:
# dataframe with only unrenovated houses
# (.copy() avoids pandas' SettingWithCopyWarning when adding columns to a filtered frame)
df_city_over75_unrenovated = df_city_over75[df_city_over75["yr_renovated"] == 0].copy()
df_city_over75_unrenovated["yr_built"] = pd.to_datetime(df_city_over75_unrenovated["yr_built"], format="%Y")

In [None]:
#new feature: build_time, converted to (fractional) years for easier plotting
df_city_over75_unrenovated["build_time"] = (
    df_city_over75_unrenovated["date"] - df_city_over75_unrenovated["yr_built"]
) / pd.Timedelta(days=365)

### plotting new features


In [None]:
df_city_over75_unrenovated["grade"].unique()

#### price plotted against time since construction and grading in unrenovated houses

In [None]:
fig = px.scatter(x = df_city_over75_unrenovated["build_time"],
y = df_city_over75_unrenovated["price"],
color=df_city_over75_unrenovated["grade"],
title = "price plotted against time since construction in unrenovated houses",
labels= {"x":"time since construction in years", "y":"price in $", "color":"grading"},
color_continuous_scale = px.colors.sequential.RdBu,
)
fig.update_layout(legend_traceorder="reversed")
fig.show()

observation: houses with grading 9+ have higher prices  
hypothesis: renovating houses to achieve higher grades should also increase sale prices
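
The observation from the scatter plot can be backed by a simple aggregate, for example the median price per grade. This is a sketch only: the helper `median_price_by_grade` and the toy values are assumptions for illustration; in the notebook it would be applied to `df_city_over75_unrenovated`.

```python
import pandas as pd


def median_price_by_grade(frame: pd.DataFrame) -> pd.Series:
    """Median sale price for each grade, ordered by grade."""
    return frame.groupby("grade")["price"].median().sort_index()


# toy data for illustration only (NOT taken from the King County data set)
toy = pd.DataFrame({
    "grade": [8, 8, 9, 9, 10],
    "price": [650_000, 700_000, 900_000, 1_100_000, 1_500_000],
})
medians = median_price_by_grade(toy)
print(medians)
```

The median is used instead of the mean because a few very expensive houses would otherwise dominate the result.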

#### price plotted against time since last renovation and grading

In [None]:
fig = px.scatter(x = df_city_over75_renovated["renovation_time"],
y = df_city_over75_renovated["price"],
color=df_city_over75_renovated["grade"],
title = "price plotted against time since last renovation",
labels= {"x":"time since last renovation in years", "y":"price in $", "color":"grading"},
color_continuous_scale = px.colors.sequential.RdBu,
)
fig.show()


observation: recently renovated houses do not show clearly higher grades  
the hypothesis that renovation increases prices could not be confirmed
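
A rank correlation is one way to quantify whether grade changes with the time since renovation. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, which needs only pandas. The helper `rank_correlation` and the toy values are assumptions for illustration and not results from the data set.

```python
import pandas as pd


def rank_correlation(a: pd.Series, b: pd.Series) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return a.rank().corr(b.rank())


# toy data for illustration only (NOT taken from the King County data set)
toy = pd.DataFrame({
    "renovation_time": [1.0, 5.0, 12.0, 30.0],
    "grade": [11, 10, 9, 8],
})
rho = rank_correlation(toy["renovation_time"], toy["grade"])
print(rho)
```

A value near zero on the real data would be in line with the observation above, while a clearly negative value would mean that more recently renovated houses tend to have higher grades.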

#### distribution of time between construction and renovation

In [None]:
fig = px.histogram(df_city_over75_renovated,
x = "build_to_renovation_time",
title="histogram of the time between construction and renovation",
labels= {"build_to_renovation_time":"time between construction and renovation", "count":"number of renovations"}
)
fig.show()

observation: for most houses, the most recent renovation took place at least 30 years after construction

### conclusions
- best time to sell is spring to summer; try to sell the property before November
- higher grading is correlated with higher sale prices
- BUT: recent renovation is not correlated with higher grading
- no datapoints in the data set which can show direct influence of renovations on price

### to do
find ways to normalize the influence of view and waterfront on the price, so that the influence of renovation on the price becomes clearer