## Addressing the client's needs, filtering the data accordingly, and deriving proposals for the client

The client (Jacob Phillips) is a buyer and was described with the following paragraph:  
"Unlimited Budget, 4+ bathrooms or smaller house nearby, big lot (tennis court & pool), golf, historic, no waterfront"  

From this the following criteria were derived:
* money is not an issue --> no limitations
* number of bathrooms >= 4
* lot size >= 20000 sqft (a tennis court needs about 7200 sqft and a nicely sized pool about 3000 sqft; the rest is for the house, garden, garage, pool house, walkways, etc.)
* historic house --> built before 1975
* no waterfront
  
The part "or smaller house nearby" was not addressed, because there were enough properties fulfilling the other criteria.  
The solution to the "golf wish" is documented in detail further below.

In [2]:
# import all libraries which can be helpful down the road
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import altair as alt
import missingno as msno

In [3]:
# The nullable integer dtype has to be set again, because it is "lost" when the data is saved to and re-imported from the csv file
df = pd.read_csv('data/cleaned_realestate.csv', sep=";")
df['yr_renovated'] = df['yr_renovated'].astype(pd.Int64Dtype())
df.describe()

Unnamed: 0,prop_id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,...,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price,transaction_id
count,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,...,21597.0,21597.0,744.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0
mean,4580474000.0,3.3732,2.115826,2080.32185,15099.41,1.494096,0.00676,0.233181,3.409825,7.657915,...,285.748993,1970.999676,1995.928763,98077.951845,47.560093,-122.213983,1986.620318,12758.283512,540296.6,10799.0
std,2876736000.0,0.926299,0.768984,918.106125,41412.64,0.539683,0.081944,0.764673,0.650546,1.1732,...,439.824566,29.375234,15.599946,53.513072,0.138552,0.140724,685.230472,27274.44195,367368.1,6234.661218
min,1000102.0,1.0,0.5,370.0,520.0,1.0,0.0,0.0,1.0,3.0,...,0.0,1900.0,1934.0,98001.0,47.1559,-122.519,399.0,651.0,78000.0,1.0
25%,2123049000.0,3.0,1.75,1430.0,5040.0,1.0,0.0,0.0,3.0,7.0,...,0.0,1951.0,1987.0,98033.0,47.4711,-122.328,1490.0,5100.0,322000.0,5400.0
50%,3904930000.0,3.0,2.25,1910.0,7618.0,1.5,0.0,0.0,3.0,7.0,...,0.0,1975.0,2000.0,98065.0,47.5718,-122.231,1840.0,7620.0,450000.0,10799.0
75%,7308900000.0,4.0,2.5,2550.0,10685.0,2.0,0.0,0.0,4.0,8.0,...,550.0,1997.0,2007.25,98118.0,47.678,-122.125,2360.0,10083.0,645000.0,16198.0
max,9900000000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,...,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0,7700000.0,21597.0
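
As an alternative to converting the column after loading, the nullable integer dtype can also be requested while reading the file. The following is only a minimal sketch on a made-up toy CSV (the values and the name `toy` are illustrative, not taken from the real file), and it assumes the years are stored as whole numbers in the csv file:

```python
import io
import pandas as pd

# Toy CSV standing in for the cleaned file; column names mirror the notebook,
# the values are made up for illustration.
csv_text = "prop_id;yr_built;yr_renovated\n1;1950;1991\n2;1939;\n3;1971;\n"

# Declaring the nullable integer dtype at read time avoids the float64 fallback
# that pandas uses for integer columns containing missing values.
toy = pd.read_csv(io.StringIO(csv_text), sep=";", dtype={"yr_renovated": "Int64"})

print(toy.dtypes)
print(toy["yr_renovated"].isna().sum())
```

Whether this works on the real file depends on how the years were written out; if they were saved as floats (e.g. `1991.0`), the conversion after loading, as done in the cell above, is the safer route.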


##### Creation of a smaller dataframe that only contains properties which fulfill the client's wishes. I added the criterion "condition >= 4", because I assumed that a client with such exclusive and expensive wishes would not want to move into a shabby house or spend much time renovating it first.

In [124]:
df_jacob_wide = df.query("bathrooms >=4 and yr_built < 1975 and waterfront == 0 and sqft_lot >= 20000 and condition >= 4")

df_jacob_wide

Unnamed: 0,prop_id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,...,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,date,price,transaction_id
3018,3377900195,4.0,5.5,6930,45100,1.0,0,0,4,11,...,1950,1991.0,98006,47.5547,-122.144,2560,37766,2014-09-29,2530000,3019
3314,2821049048,4.0,4.25,2360,57514,2.0,0,0,4,8,...,1939,,98003,47.2843,-122.294,2037,35733,2014-06-03,590000,3315
5844,3585900500,4.0,4.25,4720,21000,3.0,0,4,5,11,...,1971,,98177,47.7591,-122.376,3010,20000,2015-04-02,1530000,5845
5961,5249800010,4.0,4.25,6410,43838,2.5,0,2,4,12,...,1906,,98144,47.5703,-122.28,2270,6630,2014-12-03,2730000,5962
7245,6762700020,6.0,8.0,12050,27600,2.5,0,3,4,13,...,1910,1987.0,98102,47.6298,-122.323,3940,8800,2014-10-13,7700000,7246
7304,6072800170,4.0,4.0,3330,24354,1.0,0,0,4,10,...,1961,,98006,47.5708,-122.192,3880,25493,2015-04-28,2500000,7305
14172,1333300145,3.0,4.0,4200,30120,2.0,0,2,4,11,...,1933,,98112,47.6379,-122.311,2760,12200,2015-03-04,2230000,14173
14926,3627800050,5.0,4.0,3760,22763,1.0,0,3,4,11,...,1969,,98040,47.5333,-122.22,3730,11201,2014-07-15,1380000,14927
15152,3304700130,4.0,4.0,3860,67953,2.0,0,2,4,12,...,1927,,98177,47.7469,-122.378,4410,128066,2015-01-28,1760000,15153
17665,3585901085,6.0,4.5,3810,28176,1.0,0,4,5,10,...,1969,,98177,47.7612,-122.381,3810,26400,2014-06-04,2010000,17666


##### Two properties had to be removed manually. Although they are flagged with waterfront = 0, both are situated on the shoreline.

In [125]:
# prop_id 6072800170 is falsely annotated and lies on the shoreline --> drop it
# prop_id 239000155 is falsely annotated and lies on a lake --> drop it
# reassigning (instead of inplace=True) avoids pandas' SettingWithCopyWarning
df_jacob_wide = df_jacob_wide.drop([7304, 18711])
df_jacob_wide

Unnamed: 0,prop_id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,...,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,date,price,transaction_id
3018,3377900195,4.0,5.5,6930,45100,1.0,0,0,4,11,...,1950,1991.0,98006,47.5547,-122.144,2560,37766,2014-09-29,2530000,3019
3314,2821049048,4.0,4.25,2360,57514,2.0,0,0,4,8,...,1939,,98003,47.2843,-122.294,2037,35733,2014-06-03,590000,3315
5844,3585900500,4.0,4.25,4720,21000,3.0,0,4,5,11,...,1971,,98177,47.7591,-122.376,3010,20000,2015-04-02,1530000,5845
5961,5249800010,4.0,4.25,6410,43838,2.5,0,2,4,12,...,1906,,98144,47.5703,-122.28,2270,6630,2014-12-03,2730000,5962
7245,6762700020,6.0,8.0,12050,27600,2.5,0,3,4,13,...,1910,1987.0,98102,47.6298,-122.323,3940,8800,2014-10-13,7700000,7246
14172,1333300145,3.0,4.0,4200,30120,2.0,0,2,4,11,...,1933,,98112,47.6379,-122.311,2760,12200,2015-03-04,2230000,14173
14926,3627800050,5.0,4.0,3760,22763,1.0,0,3,4,11,...,1969,,98040,47.5333,-122.22,3730,11201,2014-07-15,1380000,14927
15152,3304700130,4.0,4.0,3860,67953,2.0,0,2,4,12,...,1927,,98177,47.7469,-122.378,4410,128066,2015-01-28,1760000,15153
17665,3585901085,6.0,4.5,3810,28176,1.0,0,4,5,10,...,1969,,98177,47.7612,-122.381,3810,26400,2014-06-04,2010000,17666
18314,5317100750,4.0,4.75,4575,24085,2.5,0,2,5,10,...,1926,,98112,47.6263,-122.284,3900,9687,2014-07-11,2920000,18315
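
The drop above relies on index labels, which only stay valid as long as the dataframe is not re-indexed. A possible alternative is to exclude the properties by their `prop_id`. The sketch below uses a small toy dataframe; the names `toy`, `misannotated_ids` and `kept` are hypothetical, only the two excluded prop_ids are taken from the comments in the cell above:

```python
import pandas as pd

# Toy stand-in for the filtered dataframe; the index labels are arbitrary.
toy = pd.DataFrame(
    {"prop_id": [3377900195, 6072800170, 239000155, 5317100750],
     "waterfront": [0, 0, 0, 0]},
    index=[3018, 7304, 18711, 18314],
)

# prop_ids judged (by manual map inspection) to sit on a shoreline
misannotated_ids = [6072800170, 239000155]

# The boolean mask keeps every row whose prop_id is not in the exclusion list;
# the result is a new dataframe, so nothing is modified in place.
kept = toy[~toy["prop_id"].isin(misannotated_ids)]
print(kept)
```

Filtering by `prop_id` also documents in the code itself which properties were excluded.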


##### The locations of the ten remaining properties were plotted on a map, which will be presented to the client.

In [147]:
fig = px.scatter_map(df_jacob_wide, 
                        lat="lat", 
                        lon="long", 
                        hover_name="prop_id", 
                        color_discrete_sequence=['blue'],
                        zoom=9, 
                        height=800,
                        width=800)

fig.update_traces(marker=dict(size=9),
                  selector=dict(mode='markers'))

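# px.scatter_map draws MapLibre-based traces, whose style is set via map_style
# (mapbox_style only affects the older px.scatter_mapbox)
fig.update_layout(map_style="open-street-map")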
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

##### With the help of two websites, I identified 19 private(!) golf courses in the Seattle area and converted their addresses to GPS coordinates. This data was stored in a csv file, so that it can be readily accessed.
---
##### Then the dataframe of the ten remaining properties was combined with the dataframe of the golf courses, and the distance between each property and every golf course was calculated (straight-line distance, not road distance, which is a drawback!). Finally, the distances between a property and its five closest golf courses were averaged and used as the final ranking criterion.  
---  
The formula for calculating the distances was adapted from Stack Overflow. By now there may be a library (H3?) that works with pandas and does the same. A few distances were checked for accuracy against Google Maps.  
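
Besides the manual comparison with Google Maps, the formula can also be sanity-checked against a value that is known in closed form: one degree of latitude along a meridian corresponds to radius × π / 180. The sketch below is a stand-alone version of the haversine formula; the names `haversine_km` and `radius_km` are illustrative, and the earth radius of 6371 km is the same approximation used in the distance cell further below:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian should equal radius * (pi / 180).
one_degree = haversine_km(47.0, -122.0, 48.0, -122.0)
expected = 6371.0 * np.pi / 180
print(round(float(one_degree), 2), round(float(expected), 2))

# The function also accepts arrays, so whole columns can be passed at once.
lats = np.array([47.5547, 47.5333])
lons = np.array([-122.144, -122.22])
print(haversine_km(lats, lons, lats, lons))  # identical points
```

The distance of a point to itself should come out as zero, which is a second cheap check.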

---
But first, a dataframe combining the locations of the properties and the golf courses was created, so that both can be shown to the client in one figure.


In [168]:
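df_golf_courses_temp = pd.read_csv('data/golf_clubs.csv', sep=";")
# align the coordinate column names with those of the property dataframe
df_golf_courses_temp = df_golf_courses_temp.rename(columns={'g_lat': 'lat', 'g_long': 'long'})
# only id and coordinates of the properties are needed for the plot
# (the same frame is built again in the distance cell further below)
df_jacob_wide_shrunk = df_jacob_wide[['prop_id', 'lat', 'long']]
df_dist_golf_plot = pd.concat([df_jacob_wide_shrunk, df_golf_courses_temp], ignore_index=True, sort=False)
# rows with a prop_id are properties, rows without one are golf courses
df_dist_golf_plot.loc[df_dist_golf_plot.prop_id > 0, 'Location'] = "Property"
df_dist_golf_plot.loc[df_dist_golf_plot.prop_id.isna(), 'Location'] = "Golf course"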

In [171]:
fig = px.scatter_map(df_dist_golf_plot, 
                        lat="lat", 
                        lon="long", 
                        hover_name="prop_id", 
                        #hover_data=["Address", "Listed"],
                        color="Location",
                        #color_continuous_scale=color_scale,
                        #size="Listed",
                        zoom=9, 
                        height=800,
                        width=900)

fig.update_traces(marker=dict(size=9),
                  selector=dict(mode='markers'))

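# px.scatter_map draws MapLibre-based traces, whose style is set via map_style
# (mapbox_style only affects the older px.scatter_mapbox)
fig.update_layout(map_style="open-street-map")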
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [149]:
# import of golf data and combining it with real estate data
df_golf_courses = pd.read_csv('data/golf_clubs.csv', sep=";")
df_jacob_wide_shrunk = df_jacob_wide[['prop_id', 'lat', 'long']] 
df_dist_golf = pd.merge(df_jacob_wide_shrunk, df_golf_courses, how="cross")

# function to calculate the distances
def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])

    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r

# haversine only uses numpy functions, so whole columns can be passed at once
df_dist_golf['distance'] = haversine(df_dist_golf['lat'], df_dist_golf['long'], df_dist_golf['g_lat'], df_dist_golf['g_long'])
# sort the distances per property and select the five shortest of them
df_dist_golf = df_dist_golf.sort_values('distance').groupby('prop_id').head(5).sort_values(['prop_id', 'distance'])
# calculate the average of the five distances
df_dist_golf = df_dist_golf.groupby(['prop_id'])['distance'].mean().reset_index(name='g_average_distance')
# join the property and the distance data
df_jacob_wide_golf = pd.merge(df_jacob_wide, df_dist_golf, on='prop_id')
# select the top five properties (by average distance to the golf courses, in km)
df_jacob_wide_golf.sort_values('g_average_distance').head(5)

Unnamed: 0,prop_id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,...,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,date,price,transaction_id,g_average_distance
0,3377900195,4.0,5.5,6930,45100,1.0,0,0,4,11,...,1991.0,98006,47.5547,-122.144,2560,37766,2014-09-29,2530000,3019,8.869581
6,3627800050,5.0,4.0,3760,22763,1.0,0,3,4,11,...,,98040,47.5333,-122.22,3730,11201,2014-07-15,1380000,14927,9.194866
3,5249800010,4.0,4.25,6410,43838,2.5,0,2,4,12,...,,98144,47.5703,-122.28,2270,6630,2014-12-03,2730000,5962,9.301835
9,5317100750,4.0,4.75,4575,24085,2.5,0,2,5,10,...,,98112,47.6263,-122.284,3900,9687,2014-07-11,2920000,18315,9.816267
5,1333300145,3.0,4.0,4200,30120,2.0,0,2,4,11,...,,98112,47.6379,-122.311,2760,12200,2015-03-04,2230000,14173,10.134726
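
The ranking step (average distance to the k closest golf courses per property) can be illustrated in isolation. This is only a sketch with invented distances and k = 2 instead of 5; the names `pairs`, `k` and `mean_k_nearest` are hypothetical, and `nsmallest` is used here as an alternative to the sort/head combination in the cell above:

```python
import pandas as pd

# Toy long-format table: one row per (property, golf course) pair.
# The distances are invented for illustration only.
pairs = pd.DataFrame({
    "prop_id":  [1, 1, 1, 2, 2, 2],
    "course":   ["A", "B", "C", "A", "B", "C"],
    "distance": [4.0, 2.0, 9.0, 1.0, 8.0, 3.0],
})

k = 2  # the notebook uses the five closest courses

# For each property, take the k smallest distances and average them.
mean_k_nearest = (
    pairs.groupby("prop_id")["distance"]
         .apply(lambda s: s.nsmallest(k).mean())
         .reset_index(name="g_average_distance")
         .sort_values("g_average_distance")
)
print(mean_k_nearest)
```

In this toy example property 2 ranks first, because its two closest courses (1.0 and 3.0) average 2.0, compared with 3.0 for property 1.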


##### Finally, the coordinates of the top three properties were entered into Google Maps to create images from high above (to show where each property is situated) and from closer range (to give an overview of how the property looks). Where good pictures were available from Google Street View, those were added as well.  
---
##### Data from the dataframe was combined with the Google images to present the top three properties to the client. Together they make up the final slides of the presentation. Now, one can only hope he likes the proposals ;)