# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera
### By: Matias Garib

This Jupyter Notebook contains all the code, with brief comments, for the Coursera Capstone project. The full report is available in the following GitHub repository: https://github.com/MatiasGarib/Coursera_Capstone

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Datasets](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction <a name="introduction"></a>

People want to start going out to restaurants, but they want to visit places with the best hygiene practices. For the city of San Francisco, this project asks two questions: which are the cleanest restaurants in each neighborhood, and which neighborhoods are the safest to eat out in?

## Datasets <a name="data"></a>

1. A **GeoJSON file** with the names and boundaries of 41 San Francisco neighborhoods
2. The **Foursquare API (URI)**
3. The City of San Francisco Health Department’s **hygiene inspection program (CSV)**


In [5]:
pip install sodapy

Collecting sodapy
  Downloading sodapy-2.1.0-py2.py3-none-any.whl (14 kB)
Installing collected packages: sodapy
Successfully installed sodapy-2.1.0
Note: you may need to restart the kernel to use updated packages.


In [17]:
import pandas as pd
import numpy as np
import requests
import folium
from sodapy import Socrata



<h3>Importing the Datasets</h3>

The neighborhood and hygiene inspection datasets are easily accessible through the Socrata API provided by the San Francisco government.

In [101]:
client = Socrata("data.sfgov.org", None)
results = client.get("pyih-qa8i", limit=60000)
hygiene_df=pd.DataFrame.from_records(results)

nhoods=client.get("743h-p4bq", limit=60000) # We will use this JSON file later on to map out San Francisco's neighborhoods
nhoods_df=pd.DataFrame.from_records(nhoods)



In [102]:
print(hygiene_df.shape)
print(nhoods_df.shape)

(53973, 23)
(92, 4)


In [103]:
hygiene_df.head()

| | business_id | business_name | business_address | business_city | business_state | business_postal_code | inspection_id | inspection_date | inspection_type | violation_id | ... | inspection_score | business_latitude | business_longitude | business_location | :@computed_region_fyvs_ahh9 | :@computed_region_p5aj_wyqh | :@computed_region_rxqg_mtj9 | :@computed_region_yftq_j783 | :@computed_region_bh8s_q3mv | :@computed_region_ajp5_b2md |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69618 | Fancy Wheatfield Bakery | 1362 Stockton St | San Francisco | CA | 94133 | 69618_20190304 | 2019-03-04T00:00:00.000 | Complaint | 69618_20190304_103130 | ... | | | | | | | | | | |
| 1 | 97975 | BREADBELLY | 1408 Clement St | San Francisco | CA | 94118 | 97975_20190725 | 2019-07-25T00:00:00.000 | Routine - Unscheduled | 97975_20190725_103124 | ... | 96.0 | | | | | | | | | |
| 2 | 69487 | Hakkasan San Francisco | 1 Kearny St | San Francisco | CA | 94108 | 69487_20180418 | 2018-04-18T00:00:00.000 | Routine - Unscheduled | 69487_20180418_103119 | ... | 88.0 | | | | | | | | | |
| 3 | 91044 | Chopsticks Restaurant | 4615 Mission St | San Francisco | CA | 94112 | 91044_20170818 | 2017-08-18T00:00:00.000 | Non-inspection site visit | | ... | | | | | | | | | | |
| 4 | 85987 | Tselogs | 552 Jones St | San Francisco | CA | 94102 | 85987_20180412 | 2018-04-12T00:00:00.000 | Routine - Unscheduled | 85987_20180412_103132 | ... | 94.0 | | | | | | | | | |
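Depending on how the API serializes them, columns such as `inspection_score` and `inspection_date` may arrive as strings rather than numbers and dates. The following is a minimal sketch of a type-coercion step, assuming that is the case; the helper name `coerce_inspection_types` and the small `sample` frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def coerce_inspection_types(df):
    """Return a copy with the score as float and the date as datetime.

    Values that cannot be parsed become NaN / NaT instead of raising.
    """
    out = df.copy()
    out['inspection_score'] = pd.to_numeric(out['inspection_score'], errors='coerce')
    out['inspection_date'] = pd.to_datetime(out['inspection_date'], errors='coerce')
    return out

# Illustrative sample only; the real data comes from hygiene_df
sample = pd.DataFrame({
    'inspection_score': ['96', None, '88'],
    'inspection_date': ['2019-07-25T00:00:00.000',
                        '2019-03-04T00:00:00.000',
                        '2018-04-18T00:00:00.000'],
})
typed = coerce_inspection_types(sample)
print(typed.dtypes)
```

Applied to the full dataset, this would be `hygiene_df = coerce_inspection_types(hygiene_df)`.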


We will now use the Foursquare API to search for restaurants in each neighborhood.

In [112]:
client_id = 'U0BHFR2CGBOER0NS2E3LDULEVT032SXA3KVWLR2U1RTQBJCV' # your Foursquare ID
client_secret = 'WRRQIHUGH45BSIKD4HCNE5ZXRNAK3E1JJNIXVNRVBNYLZYEC' # your Foursquare Secret
version = '20180605' # Foursquare API version
category= '4d4b7105d754a06374d81259' #Food Category


print('Your credentials:')
print('CLIENT_ID: ' + client_id)
print('CLIENT_SECRET: ' + client_secret)

Your credentials:
CLIENT_ID: U0BHFR2CGBOER0NS2E3LDULEVT032SXA3KVWLR2U1RTQBJCV
CLIENT_SECRET: WRRQIHUGH45BSIKD4HCNE5ZXRNAK3E1JJNIXVNRVBNYLZYEC


In [114]:
def getVenuesLoc(names, radius=600, limit=50):
    # limit: maximum number of venues requested per neighborhood (assumed default)
    
    venues_list=[]
    for name in names:
        name = name.strip()  # some split names carry stray spaces
        print(name)
            
        # create the API request; requests URL-encodes the parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': client_id,
            'client_secret': client_secret,
            'v': version,
            'near': '{}, San Francisco, CA'.format(name),
            'categoryId': category,
            'radius': radius,
            'limit': limit}
            
        # make the GET request
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single DataFrame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
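The function above indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for any venue returned without a category. As a hedged sketch, the flattening step could be pulled into a small helper that tolerates an empty category list. The helper name `flatten_explore_items` and the `fake_items` payload are hypothetical; the payload only mimics the fields the function already reads.

```python
def flatten_explore_items(neighborhood, items):
    """Turn a list of 'explore' items into (neighborhood, name, lat, lng, category) tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        rows.append((
            neighborhood,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            # fall back to None when the venue has no category
            categories[0]['name'] if categories else None,
        ))
    return rows

# Hypothetical payload with the same shape the function above expects
fake_items = [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 37.77, 'lng': -122.42},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Example Diner',
               'location': {'lat': 37.78, 'lng': -122.41},
               'categories': []}},
]
rows = flatten_explore_items('Alamo Square', fake_items)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to check the column logic without calling the API.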

Because some neighborhoods are grouped under a single name separated by '/', we split them so the function can be applied to each one.

In [162]:
nhood_names=[]
for name in nhoods_df['nbrhood']:
    if '/' in name:
        # grouped entry: print it and keep both halves as separate names
        print(name)
        split_name=name.split('/',1)
        nhood_names.append(split_name[0])
        nhood_names.append(split_name[1])
    else:
        nhood_names.append(name)

Buena Vista Park/Ashbury Heights
Eureka Valley / Dolores Heights
Financial District/Barbary Coast
Jordan Park / Laurel Heights
Cole Valley/Parnassus Heights
Van Ness/Civic Center
Central Waterfront/Dogpatch


In [163]:
nhood_names

['Alamo Square',
 'Anza Vista',
 'Balboa Terrace',
 'Bayview',
 'Bernal Heights',
 'Buena Vista Park',
 'Ashbury Heights',
 'Central Richmond',
 'Central Sunset',
 'Clarendon Heights',
 'Corona Heights',
 'Cow Hollow',
 'Crocker Amazon',
 'Diamond Heights',
 'Downtown',
 'Duboce Triangle',
 'Eureka Valley ',
 ' Dolores Heights',
 'Excelsior',
 'Financial District',
 'Barbary Coast',
 'Yerba Buena',
 'Forest Hill',
 'Forest Hills Extension',
 'Forest Knolls',
 'Glen Park',
 'Golden Gate Heights',
 'Golden Gate Park',
 'Haight Ashbury',
 'Hayes Valley',
 'Hunters Point',
 'Ingleside',
 'Ingleside Heights',
 'Ingleside Terrace',
 'Inner Mission',
 'Inner Parkside',
 'Inner Richmond',
 'Inner Sunset',
 'Jordan Park ',
 ' Laurel Heights',
 'Lake Street',
 'Monterey Heights',
 'Lake Shore',
 'Lakeside',
 'Lone Mountain',
 'Lower Pacific Heights',
 'Marina',
 'Merced Heights',
 'Merced Manor',
 'Midtown Terrace',
 'South Beach',
 'Miraloma Park',
 'Mission Bay',
 'Mission Dolores',
 'Mission 

In [135]:
example = 'Buena Vista Park/Ashbury Heights'
'/' in example
split_name=example.split('/',1)
split_name[0]
split_name[1]

'Ashbury Heights'
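Some of the split names in the list above keep stray spaces (for example `'Eureka Valley '` and `' Dolores Heights'`), because a few source entries use `' / '` as the separator. A minimal sketch of a splitting helper that also strips whitespace is shown below; the name `split_grouped_names` is illustrative and not part of the original notebook.

```python
def split_grouped_names(names, sep='/'):
    """Split grouped names on the first separator and strip surrounding whitespace."""
    result = []
    for name in names:
        for part in name.split(sep, 1):
            part = part.strip()
            if part:  # skip empty fragments
                result.append(part)
    return result

cleaned = split_grouped_names([
    'Alamo Square',
    'Eureka Valley / Dolores Heights',
    'Van Ness/Civic Center',
])
print(cleaned)
```

With the notebook's data this would be called as `split_grouped_names(nhoods_df['nbrhood'])`.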

In [118]:
nhoods_df

| | sfar_distr | the_geom | nbrhood | nid |
|---|---|---|---|---|
| 0 | District 6 - Central North | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Alamo Square | 6e |
| 1 | District 6 - Central North | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Anza Vista | 6a |
| 2 | District 4 - Twin Peaks West | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Balboa Terrace | 4a |
| 3 | District 10 - Southeast | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Bayview | 10a |
| 4 | District 9 - Central East | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Bernal Heights | 9a |
| ... | ... | ... | ... | ... |
| 87 | District 9 - Central East | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Central Waterfront/Dogpatch | 9j |
| 88 | District 10 - Southeast | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Candlestick Point | 10m |
| 89 | District 10 - Southeast | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Bayview Heights | 10k |
| 90 | District 10 - Southeast | {'type': 'MultiPolygon', 'coordinates': [[[[-1... | Little Hollywood | 10n |


In [104]:
print(hygiene_df['business_latitude'].isna().sum())
print(hygiene_df['business_longitude'].isna().sum())
print(hygiene_df['business_location'].isna().sum())
print('Approximately',round((hygiene_df['business_latitude'].isna().sum())/len(hygiene_df.index)*100), '% of data points are not georeferenced')

26498
26498
26498
Approximately 49.0 % of data points are not georeferenced
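With roughly half of the rows lacking coordinates, one option before mapping is to separate the georeferenced rows from the rest. The sketch below assumes that approach; the helper name `split_georeferenced` and the four-row `demo` frame are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

def split_georeferenced(df, lat_col='business_latitude', lon_col='business_longitude'):
    """Return (rows with both coordinates, percentage of rows missing a coordinate)."""
    has_coords = df[lat_col].notna() & df[lon_col].notna()
    missing_pct = round((~has_coords).mean() * 100, 1)
    return df[has_coords].copy(), missing_pct

# Made-up rows for illustration only
demo = pd.DataFrame({
    'business_name': ['A', 'B', 'C', 'D'],
    'business_latitude': ['37.77', None, '37.78', None],
    'business_longitude': ['-122.42', None, '-122.41', None],
})
located, missing_pct = split_georeferenced(demo)
print(len(located), 'rows kept;', missing_pct, '% missing coordinates')
```

Rows without coordinates are not necessarily unusable: they still carry `business_address`, which could be geocoded separately.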
