# Capstone Project - San Francisco Housing Sales Price 
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project, I'll analyze each neighborhood in San Francisco and try to find a relationship between **San Francisco Housing Sales Price** and **nearby venues**. 

Specifically, this report is aimed at stakeholders interested in investing in **real estate in San Francisco**, helping them choose neighborhoods with their preferred venues or lower real estate costs.

I'll create a map with each neighborhood in San Francisco segmented and clustered according to housing sales prices and nearby venues.

## Data <a name="data"></a>

To address this problem, I used the following data sources:
* **Median value per square foot** for each neighborhood in San Francisco, from Apr 1996 to Jul 2019. The CSV file contains the **RegionName, City, CountyName and SizeRank** columns, plus the median value per square foot in USD for each month in this period.
* The **Foursquare API**, to get the most common venues in a given San Francisco neighborhood.
* **Geocoding** (the Nominatim geocoder from geopy in the code below), to get the center coordinates of each neighborhood.
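
The Foursquare step listed above is not shown in the cells that follow, so here is a minimal sketch of how a venue query could be assembled. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `client_id`, `client_secret`, `v`, `ll`, `radius` and `limit` parameters; that endpoint may since have been deprecated, so check it against the current Foursquare documentation. The function name `build_explore_url` and the credential strings are hypothetical placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 endpoint; verify against current docs.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20190701", radius=500, limit=100):
    """Build (but do not send) a venue search URL around a coordinate.

    Hypothetical helper: radius is in meters, limit caps the venue count.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,                      # API version date, YYYYMMDD
        "ll": "{},{}".format(lat, lng),    # "latitude,longitude"
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)


# Coordinates of the Mission neighborhood as geocoded later in this notebook.
url = build_explore_url(37.752498, -122.412826, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The returned URL could then be fetched once per neighborhood, using the latitude and longitude columns built below.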

In [171]:
import pandas as pd
import numpy as np
import folium 
from geopy.geocoders import Nominatim

In [182]:
df = pd.read_csv('Neighborhood_MedianValuePerSqft_AllHomes.csv')
df_sf = df[df['City']=='San Francisco'].reset_index(drop=True)
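# Drop metadata columns that are not needed; use the columns= keyword,
# since a positional axis argument is rejected by recent pandas versions.
df_sf = df_sf.drop(columns=['City', 'State', 'Metro', 'CountyName', 'SizeRank'])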
df_sf.head()

Unnamed: 0,RegionID,RegionName,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,1996-11,...,2018-10,2018-11,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07
0,268384,Outer Sunset,192.0,193.0,194.0,195.0,196.0,196.0,198.0,199.0,...,1018,1021,1022,1025,1024,1019,1018,1019,1021,1024
1,274552,Mission,193.0,193.0,193.0,193.0,193.0,193.0,194.0,194.0,...,1154,1153,1152,1153,1149,1144,1143,1146,1147,1148
2,268383,Outer Richmond,181.0,182.0,183.0,184.0,185.0,186.0,188.0,189.0,...,944,943,944,945,941,932,923,919,920,922
3,268219,Inner Richmond,197.0,198.0,199.0,201.0,202.0,203.0,205.0,206.0,...,1065,1064,1063,1063,1064,1063,1061,1060,1062,1064
4,268396,Parkside,187.0,188.0,189.0,190.0,190.0,191.0,193.0,195.0,...,947,944,942,942,939,938,940,944,949,954


In [183]:
col_2019 = [col for col in df_sf.columns if '2019' in col]
mean_2019 = df_sf[col_2019].mean(axis=1).to_frame(name='avgPrice_2019')
mean_2019.head()

Unnamed: 0,avgPrice_2019
0,1021.428571
1,1147.142857
2,928.857143
3,1062.428571
4,943.714286


In [184]:
# Keep only RegionID and RegionName; drop every monthly price column.
df_sf.drop(columns=df_sf.columns[2:], inplace=True)
df_sf['avgPrice_2019'] = mean_2019['avgPrice_2019']
df_sf.head()

Unnamed: 0,RegionID,RegionName,avgPrice_2019
0,268384,Outer Sunset,1021.428571
1,274552,Mission,1147.142857
2,268383,Outer Richmond,928.857143
3,268219,Inner Richmond,1062.428571
4,268396,Parkside,943.714286


In [185]:
def find_coordinates(address):
    """Return [latitude, longitude] for a place name in San Francisco."""
    query = address + ', San Francisco, California'
    geolocator = Nominatim(user_agent="ca_explorer")
    location = geolocator.geocode(query)
    if location is None:
        # geocode() returns None when the query has no match
        raise ValueError('No geocoding result for {}'.format(query))
    return [location.latitude, location.longitude]


In [186]:
coordinates = []
for item in df_sf['RegionName']:
    try:
        coordinates.append(find_coordinates(item))
    except Exception:
        # Lookup failed or returned nothing: record NaN so the row can be dropped later
        coordinates.append([np.nan, np.nan])
        print('No coordinates found for {}.'.format(item))
coor = pd.DataFrame(coordinates,columns = ['latitude' , 'longitude'])
df_sf[['latitude' , 'longitude']] = coor[['latitude' , 'longitude']]
df_sf.head()

No coordinates found for Golden Gate Heights.
No coordinates found for Ingleside Heights.
No coordinates found for Upper Market.
No coordinates found for Miraloma Park.
No coordinates found for Diamond Heights.
No coordinates found for Westwood Park.
No coordinates found for Midtown Terrace.
No coordinates found for Little Hollywood.
No coordinates found for Mount Davidson Manor.
No coordinates found for Westwood Highlands.


Unnamed: 0,RegionID,RegionName,avgPrice_2019,latitude,longitude
0,268384,Outer Sunset,1021.428571,37.753303,-122.495159
1,274552,Mission,1147.142857,37.752498,-122.412826
2,268383,Outer Richmond,928.857143,37.780643,-122.472596
3,268219,Inner Richmond,1062.428571,37.769825,-122.466087
4,268396,Parkside,943.714286,37.738364,-122.483982
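
Ten neighborhoods returned no coordinates with the single query format used above. One possible way to recover some of them is to retry with alternative query strings before giving up. The sketch below is an assumption, not part of the original analysis: `geocode_with_fallback` and `fake_lookup` are hypothetical names, the lookup is a stand-in dictionary rather than a real geocoder, and the coordinates in it are placeholder values, not real locations.

```python
def geocode_with_fallback(name, lookup,
                          suffixes=(", San Francisco, California",
                                    ", San Francisco")):
    """Try each query variant in turn; return [lat, lng] or None.

    `lookup` is any callable that maps a query string to a
    (latitude, longitude) tuple, or to None when nothing matches.
    """
    for suffix in suffixes:
        result = lookup(name + suffix)
        if result is not None:
            return list(result)
    return None


# Stand-in for a real geocoder; the coordinates are placeholders.
FAKE_INDEX = {"Upper Market, San Francisco": (1.0, 2.0)}


def fake_lookup(query):
    return FAKE_INDEX.get(query)


print(geocode_with_fallback("Upper Market", fake_lookup))  # [1.0, 2.0]
print(geocode_with_fallback("Nowhere", fake_lookup))       # None
```

With geopy, `lookup` could be a thin wrapper that calls `geolocator.geocode(query)` and returns `(location.latitude, location.longitude)` when the result is not `None`.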


In [187]:
df_sf_cleaned = df_sf.dropna()
df_sf_cleaned.shape

(51, 5)

In [190]:
sf_coor = find_coordinates('San Francisco')
# create map of San Francisco using latitude and longitude values
map_sf = folium.Map(location=sf_coor, zoom_start=12)

# add markers to map
for lat, lng, neighborhood, price in zip(df_sf_cleaned['latitude'], df_sf_cleaned['longitude'], df_sf_cleaned['RegionName'], df_sf_cleaned['avgPrice_2019']):
    # show the price rounded to whole dollars instead of the raw float
    label = '{}: ${:,.0f} per sq ft'.format(neighborhood, price)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sf)

map_sf
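
The map above draws every marker in the same blue, so price differences are only visible in the popups. As a rough sketch of how markers could be colored by price, the helper below ranks the prices and splits them into equal-sized tiers. The function name `price_tier_colors` and the three-color scheme are assumptions made for illustration; the sample values are the first five `avgPrice_2019` figures from the table above, rounded to one decimal.

```python
def price_tier_colors(prices, colors=("green", "orange", "red")):
    """Assign a color to each price by rank: cheapest tier gets colors[0]."""
    # Indices of prices sorted from cheapest to most expensive
    order = sorted(range(len(prices)), key=lambda i: prices[i])
    tiers = [None] * len(prices)
    for rank, i in enumerate(order):
        # Integer arithmetic splits the ranks into len(colors) groups
        tiers[i] = colors[rank * len(colors) // len(prices)]
    return tiers


sample = [1021.4, 1147.1, 928.9, 1062.4, 943.7]
print(price_tier_colors(sample))
# ['orange', 'red', 'green', 'orange', 'green']
```

Each resulting color could be passed as the `color` argument of `folium.CircleMarker` in the loop above, in place of the fixed `'blue'`.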

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>