# IBM Data Science Professional Specialization - Coursera

This notebook is for the capstone project, fulfilling the requirement for the __[IBM Data Science Professional Specialization](https://www.coursera.org/specializations/ibm-data-science-professional-certificate)__. 


This analysis covers the following:
- Web scraping: obtain apartment rental prices by neighborhood
- Data wrangling: clean up the data using Python pandas
- Data analysis:
  - obtain more neighborhood venue data from the Foursquare database
  - explore the data set with pandas, matplotlib, etc.
  - predict rental prices using machine learning
  
  
  
### Author: Estella Yu ###
email: estellayyu@gmail.com || 
LinkedIn: __[Yingxian Estella Yu](https://www.linkedin.com/in/estella-yingxian-y-65b3b883/)__



Keywords:  *`web scraping`*, *`data wrangling`*, *`python`*, *`pandas`*, *`folium`*, *`Foursquare`*


# I. Introduction 

### 1. Background

New York is a unique place: a financial capital, a fashion trendsetter, and a city with an artistic and historic atmosphere that words simply can't capture. The only way to know life in New York is to experience it. Yet city life comes with a lot of dollar signs _--$$--_, especially in Manhattan. According to the recent National Rent Report (__[Feb 2019](https://www.zumper.com/blog/2019/01/zumper-national-rent-report-february-2019/)__), the rent for a 1-bedroom apartment in New York ( **$2,780** ) ranks 2nd in the nation, right behind the crazy San Francisco. What's more, based on the data shown in __[businessinsider](https://www.businessinsider.com/manhattan-rent-by-neighborhood-ranked-from-lowest-to-highest-2018-5)__, the asking rent increased drastically, by **33%**, between Dec 2009 and July 2017 (in less than 8 years)!

Therefore, it is especially worthwhile to analyze and understand the housing trend in New York. A simple search shows that rent in New York is highly correlated with location. For example, the rent near Soho (average `$5,000 - $6,000+`) is 52% higher than the Manhattan average, and is certainly pricier than the rent near East Harlem (average `$2,000 - $3,000+`).

So to what extent can we predict the rent in Manhattan based on nearby venues, an important component of a neighborhood's vibe? Is it easier to spot an Italian restaurant than a pizza place in a pricey neighborhood? How much might a school, a mall, or a supermarket contribute to the housing price? We are going to find out in this report!

### 2. Project Description

This project uses data to answer the following questions:
 - *`Why`* do we want to analyze the housing price in __New York__?
 - *`How`* does the apartment rental price vary by __neighborhood__?
 - If you are planning to move to a new neighborhood, *`what`* typical __venues__ will you be looking for?
 - Do popular venues and higher-end apartment prices align?

### 3. Potential Target Readers
The results and analysis in this project are most relevant to:
 - People involved in rental activities in New York (landlords, tenants, real estate agents, etc.)
 - Business owners: if you plan to open a new business, which neighborhoods are more appropriate, and do the target neighborhoods already have similar venues?
 - Anyone who's curious about data (like us! :))

# II. Data Description 
### 1. Data Source:
- __[Zumper](https://www.zumper.com/blog/2019/01/zumper-national-rent-report-february-2019/)__: National Rent Report: February 2019
- __[Rentcafe](https://www.rentcafe.com/average-rent-market-trends/us/ny/manhattan/)__: Manhattan, NY Rental Market Trends
- __[School locations in Manhattan](http://www.lat-long.com/Search.cfm?q=+school&State=NY&County=New+York&FeatureType=school)__
- __[Foursquare API](https://developer.foursquare.com/)__: venue data around each neighborhood

### 2. Data Collection:
 1. Data from the National Rent Report and the Manhattan rent market trends will be scraped from the web
 2. For each neighborhood, its latitude and longitude will be collected using a geocoder
 3. For each neighborhood, call the Foursquare API to obtain detailed data on surrounding venues
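
As a rough sketch of step 3, the query for one neighborhood could be assembled as below. This assumes the legacy Foursquare v2 `venues/explore` endpoint and its usual query parameters; the helper name `build_explore_params`, the radius and limit defaults, and the placeholder credentials are illustrative only, so check the current Foursquare documentation before relying on any of them.

```python
from datetime import date

# Assumed: the legacy Foursquare v2 "explore" endpoint.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_params(lat, lon, client_id, client_secret,
                         radius_m=500, limit=100, version=None):
    """Return the query parameters for one venue search around (lat, lon).

    Hypothetical helper: the defaults for radius_m and limit are placeholders.
    """
    if version is None:
        # The v2 API expects a version date in YYYYMMDD form.
        version = date.today().strftime("%Y%m%d")
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),  # "latitude,longitude"
        "radius": radius_m,              # search radius in meters
        "limit": limit,                  # maximum number of venues returned
    }


# Illustrative coordinates and placeholder credentials.
params = build_explore_params(40.8762983, -73.9104292,
                              "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET",
                              version="20190301")
print(params["ll"])
```

The resulting dictionary could then be passed to `requests.get(EXPLORE_URL, params=params)`; the request itself is left out here because it needs valid credentials and network access.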
 
### 3. Use data to solve the problem:
1. Visualize rent prices across the nation (Matplotlib, bubble plot)
2. Cluster the neighborhoods based on rental price, nearby venues, etc.
3. Plot rental price vs. popular venues across all neighborhoods, and explore correlations
4. Analyze one special venue type, schools: investigate how the distance to each school may contribute to the housing price
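
Step 4 needs a distance between a neighborhood and each school. One common choice for latitude/longitude pairs is the haversine (great-circle) distance, sketched below. The function names and the school coordinates are hypothetical and only for illustration; the notebook may well compute distances differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_distance_km(point, others):
    """Distance from point to the closest entry in others (all (lat, lon) tuples)."""
    return min(haversine_km(point[0], point[1], o[0], o[1]) for o in others)


# Illustrative neighborhood coordinates and made-up school locations.
neighborhood = (40.8692579, -73.9204949)
schools = [(40.8700, -73.9150), (40.8500, -73.9400)]
print(round(nearest_distance_km(neighborhood, schools), 2), "km to the nearest school")
```

The distance to the nearest school (or the number of schools within some radius) could then be added as one more feature next to the venue counts.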
 


# III. Import Data

In [1]:
from bs4 import BeautifulSoup # for obtaining clean html content

import numpy as np
import pandas as pd

import requests # for getting html content
import folium
from folium import plugins
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import datetime
from pandas import json_normalize # transform a JSON file into a pandas dataframe (pandas >= 1.0)
from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

print("Libraries Imported!")

Libraries Imported!


### 1. Extract & plot rental price by district in Manhattan

    1) use BeautifulSoup to extract website content

In [2]:
############1. obtain data from website (using BeautifulSoup)####################
web_link = 'https://www.rentcafe.com/average-rent-market-trends/us/ny/manhattan/'
web_neighborhood = requests.get(web_link)

soup = BeautifulSoup(web_neighborhood.content, 'html.parser')

    2) clean up the text & extract content in the table

In [7]:
geolocator = Nominatim(user_agent="loc_locator")
priceTable = soup.find("table", id = "MarketTrendsAverageRentTable")
content = priceTable.find_all('tr')

# a is a flat list built row by row:
#   a[0]   first cell of the first row (not used below)
#   a[1:3] the two cells of the second row, used as column titles
#   a[3:]  neighborhood, rent, latitude, longitude for every later row
a = []
for con in content:
    pricepair = con.text.strip().split("\n")

    # some region names are compound (e.g. Theatre District - Times Square);
    # keep only the first half (before the " - ")
    if " - " in pricepair[0]:
        pricepair[0] = pricepair[0].split(' - ')[0]

    a.append(pricepair[0])
    if 1 < len(a) < 3: a.append(pricepair[1])
    if len(a) > 3:
        a.append(float(pricepair[1].lstrip('$').replace(",","")))
        # geocode() returns None for names it cannot resolve, which would raise here
        location = geolocator.geocode("{0}, Manhattan, NY".format(pricepair[0]))
        a.append(float(location.latitude))
        a.append(float(location.longitude))

dfTitle = a[1:3] + ['loc_lat', 'loc_lon']
dfValue = a[3:]
# np.array stores this mixed str/float list as strings, so the numeric
# columns are cast with astype(float) later
df = pd.DataFrame(np.array(dfValue).reshape(len(dfValue) // 4, 4), columns = dfTitle)
df.head()

| | Neighborhood | Average Rent | loc_lat | loc_lon |
|---|---|---|---|---|
| 0 | Marble Hill | 1694.0 | 40.8762983 | -73.9104292 |
| 1 | Inwood | 2225.0 | 40.8692579 | -73.9204949 |
| 2 | Washington Heights | 2243.0 | 40.8401984 | -73.9402214 |
| 3 | Randalls and Wards Islands | 2336.0 | 40.79144785 | -73.921023713881 |
| 4 | East Harlem | 3334.0 | 40.7947222 | -73.9425 |


In [8]:
df.shape

(51, 4)
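
The text cleanup in step 2) can also be written as small helper functions that are easy to check on their own. The sketch below is one possible arrangement, not the notebook's actual code: the helper names are hypothetical, and the rent in the second sample row is a made-up placeholder.

```python
def clean_neighborhood(name):
    """Keep only the part before ' - ' (e.g. 'Theatre District - Times Square')."""
    return name.split(" - ")[0].strip()


def parse_rent(text):
    """Turn a price string such as '$2,780' into a float."""
    return float(text.strip().lstrip("$").replace(",", ""))


def rows_to_records(rows):
    """Convert (name, price_text) pairs into a list of dicts, one per neighborhood."""
    return [
        {"Neighborhood": clean_neighborhood(name), "Average Rent": parse_rent(price)}
        for name, price in rows
    ]


# The second rent value is a placeholder, not scraped data.
sample_rows = [
    ("Marble Hill", "$1,694"),
    ("Theatre District - Times Square", "$4,500"),
]
records = rows_to_records(sample_rows)
print(records)
```

Passing `records` to `pd.DataFrame(records)` keeps `Average Rent` as a numeric column, which would avoid the repeated `astype(float)` conversions later on.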

    3) folium visualization 

In [9]:
ny_location = geolocator.geocode("Manhattan, NY")
ny_lat = ny_location.latitude
ny_lon = ny_location.longitude
print('The coordinates of Manhattan, NY are ({}, {})'.format(ny_lat, ny_lon))

The coordinates of Manhattan, NY are (40.7900869, -73.9598295)
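
One caveat about geocoding: `geolocator.geocode` returns `None` when it cannot resolve a query, so reading `.latitude` directly can fail with an `AttributeError`. Below is a minimal guard, assuming we simply want to skip such names. `safe_geocode` is a hypothetical helper, and the stub geocoder only stands in for Nominatim so that the example runs without network access.

```python
from collections import namedtuple


def safe_geocode(geocode, query):
    """Return (lat, lon) for query, or None when the geocoder finds nothing.

    geocode is any callable with the same contract as geolocator.geocode:
    it returns an object with .latitude/.longitude, or None.
    """
    location = geocode(query)
    if location is None:
        return None
    return float(location.latitude), float(location.longitude)


# Stub geocoder for illustration only (not part of geopy).
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def fake_geocode(query):
    known = {"Inwood, Manhattan, NY": FakeLocation(40.8692579, -73.9204949)}
    return known.get(query)


print(safe_geocode(fake_geocode, "Inwood, Manhattan, NY"))
print(safe_geocode(fake_geocode, "Nowhere, Manhattan, NY"))
```

With the real geocoder this would be called as `safe_geocode(geolocator.geocode, "Inwood, Manhattan, NY")`. Since the scraping loop sends one request per neighborhood, it may also be worth wrapping the call in geopy's `RateLimiter` (from `geopy.extra.rate_limiter`) to space the requests out.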


In [1]:
# create folium map
manhattan_map = folium.Map(location = [ny_lat, ny_lon], zoom_start = 11.48)

# plot average rental price at each district
for lat, lon, neighborhood, price in zip(df['loc_lat'], df['loc_lon'], df['Neighborhood'], df['Average Rent']):
    label = '{}, ${:.2f}'.format(neighborhood, float(price))
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker([float(lat), float(lon)],
                        radius = 1, 
                        popup = label, 
                        color = 'red', 
                        #fill = True, 
                        fill_color = '#a72920', 
                        fill_opacity = 0.5, 
                        parse_html = False).add_to(manhattan_map)
 
# add heat map 
rent = df['Average Rent'].astype(float)
df['heat_map_weights_col'] = (rent - rent.min()) / (rent.max() - rent.min())  # min-max scale to [0, 1]
df['double_lat'] = df['loc_lat'].astype(float)
df['double_lon'] = df['loc_lon'].astype(float)
cols_to_pull = ['double_lat', 'double_lon', 'heat_map_weights_col']

PriceList = df[cols_to_pull].values.tolist()

manhattan_map.add_child(plugins.HeatMap(PriceList, radius = 15))
manhattan_map

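
The heat map weights above are a min-max scaling of the average rent: each rent is mapped to `(rent - min) / (max - min)`, so the cheapest neighborhood gets 0 and the most expensive gets 1. The sketch below shows the same arithmetic in plain Python on the first five rents from the table; `min_max_scale` is a hypothetical helper, and the guard for identical values is an added assumption.

```python
def min_max_scale(values):
    """Scale values linearly to [0, 1]; accepts numbers or numeric strings."""
    values = [float(v) for v in values]
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# First five average rents from the table above (stored as strings in df).
rents = ["1694.0", "2225.0", "2243.0", "2336.0", "3334.0"]
weights = min_max_scale(rents)
print([round(w, 3) for w in weights])
```

Because the lowest rent maps to a weight of 0, that neighborhood may barely register in the heat layer; adding a small offset to the weights is one way around this if every neighborhood should be visible.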