# Capstone Project - The Battle of Neighborhoods
#### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)

<!-- * [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion) -->

## Introduction: Business Problem <a name="introduction"></a>

__Is New York City more like Toronto?__

Specifically, this project targets people and businesses that are new to a city and want to find the neighborhoods that have a specific venue or set of venues, and to compare them with other neighborhoods. It will also help analyse the businesses within a city.

The primary objective of this project is to set up a process for comparing neighborhoods across geographical locations, and to produce results that show how the selected cities and their neighborhoods compare with each other in terms of neighborhood similarity and the kinds of venues they have. For example, one can compare the cities by the number of restaurants serving a favourite cuisine (number of venues within a city), or find the neighborhoods that have venues of one's choice (categories of venues within a neighborhood).
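
The two comparisons described above can be sketched on a small, made-up table. The rows below are toy data, not real Foursquare output, and the helper names `count_venues_by_city` and `neighborhoods_with` are hypothetical; only the column names follow the ones used later in this notebook.

```python
import pandas as pd

# Toy venue records (made up for illustration, not real Foursquare output)
venues = pd.DataFrame({
    "City": ["Toronto", "Toronto", "Toronto", "New York", "New York"],
    "Neighborhood": ["Parkwoods", "Parkwoods", "Regent Park", "Wakefield", "Co-op City"],
    "Venue Category": ["Indian Restaurant", "Park", "Coffee Shop",
                       "Indian Restaurant", "Indian Restaurant"],
})

def count_venues_by_city(df, category):
    """Count venues per city whose category contains the given text."""
    mask = df["Venue Category"].str.contains(category, case=False, regex=False, na=False)
    return df[mask].groupby("City").size()

def neighborhoods_with(df, category):
    """List the (City, Neighborhood) pairs that have at least one matching venue."""
    mask = df["Venue Category"].str.contains(category, case=False, regex=False, na=False)
    return df.loc[mask, ["City", "Neighborhood"]].drop_duplicates().reset_index(drop=True)

indian_by_city = count_venues_by_city(venues, "Indian Restaurant")
indian_neighborhoods = neighborhoods_with(venues, "Indian Restaurant")
print(indian_by_city)
print(indian_neighborhoods)
```

Matching on a substring keeps the helpers usable for broader questions too, for example passing `"Restaurant"` to count every kind of restaurant at once.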

We will use our data science powers to clean, analyse and classify the data, and then prepare a final report with a few graphs and folium maps.

## Data <a name="data"></a>

We will consider all neighborhoods of Toronto, Canada, and New York City, USA, for this project, in order to compare neighborhoods across both cities and to analyze neighborhoods that have similar kinds of venues within the same city or in different cities.

We will use the Foursquare API to find the top 10 venues for each neighborhood. To explore a neighborhood and find nearby venues with Foursquare, we need geographical location data (address, longitude, latitude) for each neighborhood. For this project we will use the data I compiled earlier while working on the previous capstone project and lab assignments.

We will then explore, segment and cluster the neighborhoods based on the characteristics of the top 10 venues surrounding (i.e. within a 500-meter radius of) each neighborhood's geographical location.
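
One possible way to turn venue lists into per-neighborhood characteristics is sketched below, assuming a one-hot encoding of the venue categories followed by a per-neighborhood mean. The data is made up, and the helper name `top_categories` is hypothetical. The resulting frequency table is the kind of numeric input a clustering step could work on.

```python
import pandas as pd

# Toy venue records (made up for illustration)
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Pizza Place", "Cafe"],
})

def top_categories(df, top_n=10):
    """Return category shares per neighborhood and the top_n categories of each."""
    # One-hot encode the categories: one column per category, one row per venue
    onehot = pd.get_dummies(df["Venue Category"], dtype=float)
    # Mean of the one-hot columns = share of each category within a neighborhood
    freq = onehot.groupby(df["Neighborhood"]).mean()
    # Keep the names of the top_n categories that actually occur
    top = {
        name: row[row > 0].sort_values(ascending=False).head(top_n).index.tolist()
        for name, row in freq.iterrows()
    }
    return freq, top

freq, top = top_categories(venues, top_n=2)
print(freq)
print(top)
```

Grouping by the `Neighborhood` series directly, rather than inserting it as a column of the one-hot table, avoids a name clash if a venue category itself happens to be called "Neighborhood".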


----------------------------------------------- End of week 1 assignment -----------------------------------------------
<hr style="height:1px;border-width:0;color:white;background-color:brown">

## Week 2

*Methodology* - Week 2 Assignment in progress...
<!-- ## Methodology
In last Module, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. I liked an idea suggested for the project is to compare neighborhoods across cities and determine how similar or dissimilar they are. Is New York City more like Toronto?. 
We will be working on same idea with following Project objectives: 
* Set up the process to compare Cities, Neighborhood and Venues surrounding geographical location
* Collect the essential data for project and use Data Science skills for data cleansing and standardization.
  * We will be using __Foursquare API__ to explore a neighborhood and find out a nearby venues.
  * we will need geographical location (Address, Longitude, Latitude) for each neighborhood, for this project we will be using the data we have already compiled earlier while working on previous capstone project and lab assignments.*
* Analyze how the selected Cities and its Neighborhoods compare to each other in terms of the likelihood of neighborhoods and kind of venues(categories) they have.
* To start with analysis we will follow the usecase similar to the below given example:
  * Compare the city in terms of the number of restaurants serving your favourite cuisine (# venues with in City)
  * Compare neighborhoods which has venues of your choice (kind of venues with in neighborhood)
  * Cluster neighborhoods based on their top 10 venues, regardless of their cities to recommend similar neighboorhoods within and across city 
__Data:__ we will take all of the neighborhoods from Toronto,Canada and New York City, USA to 
then we will segment and cluster them based on characteristics of their top 10 venues. in order to find out top 10 venues for each neighborhood we will be using Foursquare API.*
__Notes__ *in order to explore a neighborhood and find out a nearby venues using foursquare, we will need geographical location (Address, Longitude, Latitude) for each neighborhood, for this project we will be using the data we have already compiled earlier while working on previous capstone project and lab assignments.*
-->

<li Style='font-size:110%;color:brown'> Let us start by importing a few essential libraries! </li>

In [None]:
## Import required libraries
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np # library to handle data in a vectorized manner
print("Libraries Imported!")

<li Style='font-size:110%;color:brown'> Let's download the neighborhoods data from ... </li>

In [None]:
## Let's download the data and save it as a CSV
#!wget -q -O 'Toronto_df.csv' ## http://Tbd
#!wget -q -O 'Newyork_df.csv' ## http://Tbd

In [None]:
## Now that the data for both cities is downloaded, let's read it into pandas dataframes.
Toronto_df = pd.read_csv("Toronto_df.csv")
Newyork_df = pd.read_csv("Newyork_df.csv")

## Clean dataframe
Toronto_df = Toronto_df.drop('PostalCode',axis=1)
Toronto_df.insert(0,'City','Toronto')
Newyork_df.insert(0,'City','New York')

## Print 
print("Toronto_df:\n",Toronto_df.head(3),"\n\nNewyork_df:\n",Newyork_df.head(3))


<li Style='font-size:110%;color:brown'> Let's take a detailed view of Toronto and New York data </li>

In [None]:
## Set df to a city's dataframe to view its details

df = Toronto_df.copy() ## provide the name of dataframe
city = 'Toronto'       ## provide the name of City

## print the number of Boroughs and Neighborhoods within City. 
print('\nThe', city, 'city has {} boroughs and {} neighborhoods.'.format(len(df['Borough'].unique())
                                                                        ,df.shape[0]))
## Print the details
print('\n # of Neighborhoods in each Borough:\n')
print(df[['Borough','Neighborhood']].groupby(
    'Borough',as_index=False).count().sort_values(by='Neighborhood',ascending=False),'\n')

## Print dataframe
df.head()

<li Style='font-size:110%;color:brown'> Let's append the Toronto and New York data, so that we can fetch the venue details for every neighborhood in a single pass </li>

In [None]:
## Check the columns before appending the tables, so that the data lines up properly
#Toronto_df.columns, Newyork_df.columns

## Append the tables (ignore_index=True renumbers the rows from 0)
Nghbr_df = pd.concat([Newyork_df, Toronto_df], ignore_index=True)

## Print the number of cities, boroughs and neighborhoods in the combined data.
print('\nThe data includes {} cities, {} boroughs and {} neighborhoods.'.format(len(Nghbr_df['City'].unique()),
                                                                               len(Nghbr_df['Borough'].unique()),
                                                                               Nghbr_df.shape[0]))
## Print the details
print('\n # of Neighborhoods in each Borough:\n')
print(Nghbr_df[['City','Borough','Neighborhood']].groupby(
    ['City','Borough'],as_index=False).count().sort_values(by='Neighborhood',ascending=False),'\n')

## Print dataframe
Nghbr_df.head()

<li Style='font-size:110%;color:brown'> 
    Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.</li>

Make sure that you have created a Foursquare developer account and have your credentials handy.

Before we start exploring, let's get all the libraries that we will need.

In [None]:
# Importing libraries
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

<li Style='font-size:110%;color:brown'>
    Now, define the credentials and version (date) for Foursquare </li>

In [None]:
# @hidden_cell
CLIENT_ID = '0XUWQYJ51LOM4MNDEUUOJ1XPHCV13TQ4PIUE4SW1MADEN2U2' # your Foursquare ID
CLIENT_SECRET = 'F5ZG3S4U0DVS2D0OI1YDYWYM54FBO4X4VXSAXALU2I4KSS0M' # your Foursquare Secret

In [None]:
#CLIENT_ID = 'XXXXXXX' # your Foursquare ID
#CLIENT_SECRET = 'XXXXXXX' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version (date); the request URLs below need it
#print('Your credentials:'); print('CLIENT_ID: ' + CLIENT_ID); print('CLIENT_SECRET:' + CLIENT_SECRET)

<li Style='font-size:110%;color:brown'> Let's see how to fetch venues for any neighborhood using the Foursquare API  </li>

Let's pick one neighborhood from our dataframe and get its name, city and coordinates.

In [None]:
neighborhood_city = Nghbr_df.loc[0, 'City'] # city name
neighborhood_name = Nghbr_df.loc[0, 'Neighborhood'] # neighborhood name
neighborhood_latitude = Nghbr_df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = Nghbr_df.loc[0, 'Longitude'] # neighborhood longitude value

print('\n Neighborhood is {}, {}, and its Lat, Long values are {}, {}.'.format(
    neighborhood_name, neighborhood_city, round(neighborhood_latitude,6), round(neighborhood_longitude,6)))

Now, let's get the top 100 venues that are within a radius of 500 meters of the selected neighborhood.

First, let's create the GET request URL.

In [None]:
## Define LIMIT query and radius 
radius = 500; LIMIT = 100

## Define the corresponding URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
#url
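
As an alternative sketch, the query string can be assembled with `urllib.parse.urlencode` from the standard library, which percent-encodes the values and keeps the parameter list readable. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build a Foursquare v2 'explore' URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # the comma is percent-encoded as %2C
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials and coordinates, for illustration only
example_url = build_explore_url("MY_ID", "MY_SECRET", "20180604", 40.894705, -73.847201)
print(example_url)
```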

Send the GET request and examine the results.

In [None]:
results = requests.get(url).json()
#results.keys()
#results['response'].keys()
#results['response']['groups'][0].keys()
results['response']['groups'][0]['items']
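
The nested lookup above raises a `KeyError` or `IndexError` when a request fails or returns no groups. A more defensive lookup might look like the sketch below. The helper name `extract_items` and both sample dictionaries are hypothetical; they only mimic the nesting used above.

```python
def extract_items(results):
    """Return the list of venue items from an explore response, or [] if absent."""
    groups = results.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])

# Hypothetical responses, shaped like the structure used above
ok_response = {"response": {"groups": [{"items": [{"venue": {"name": "Sample Cafe"}}]}]}}
error_response = {"meta": {"code": 429}, "response": {}}

print(len(extract_items(ok_response)))   # → 1
print(extract_items(error_response))     # → []
```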

Now we know that all the information is in the *items* key.    
Before we proceed, let's define a function, **get_category_type**, that extracts the category of a venue from the items result.

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # fall back to the flattened column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

<li Style='font-size:110%;color:brown'>
    Let's clean the JSON output provided by Foursquare and structure it into a *pandas* dataframe.</li>

In [None]:
## Get Venues detail
venues = results['response']['groups'][0]['items']  
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# View the nearby venues
print('\n Nearby_venues dataframe for venues of {}, {}:\n'.format(neighborhood_name, neighborhood_city))
nearby_venues

<li Style='font-size:110%;color:brown'> 
    Let's create a function to process all the neighborhoods in our dataset </li>

In [None]:
def getNearbyVenues(city, names, latitudes, longitudes, radius=500):
    """Fetch nearby venues for every neighborhood and return them as one dataframe."""
    
    venues_list=[]
    # city_name avoids shadowing the `city` argument inside the loop
    for city_name, name, lat, lng in zip(city, names, latitudes, longitudes):
        #print(name)            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        # (a venue without any category gets None instead of raising an IndexError)
        venues_list.append([
            (city_name, name, lat, lng, v['venue']['name'], v['venue']['location']['lat'],
              v['venue']['location']['lng'],
              v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) 
            for v in results])
    
    # passing the column names here also works when no venue was returned at all
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list],
                                 columns=['City','Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
                                          'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'])
    print('New dataframe for nearby venues is created.')
    return nearby_venues
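
Because the function above sends one request per neighborhood, a single failed call stops the whole loop. A retry wrapper is one way to soften that, sketched here under assumptions: `call_with_retries` is a hypothetical helper, and `flaky` only simulates a request that fails twice before succeeding.

```python
import time

def call_with_retries(func, attempts=3, pause_seconds=1.0):
    """Call func() up to `attempts` times (at least 1), pausing between failed tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as err:  # broad on purpose in this sketch
            last_error = err
            if attempt < attempts:
                time.sleep(pause_seconds)
    raise last_error

# Simulated request that fails twice and then succeeds
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return {"response": {"groups": [{"items": []}]}}

result = call_with_retries(flaky, attempts=3, pause_seconds=0)
print(state["calls"], result)
```

In the notebook, the wrapped call could be a small lambda around `requests.get(url).json()`, with a pause long enough to stay within the API's rate limits.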

<li Style='font-size:110%;color:brown'> 
    Now, let's use the above function on each neighborhood and create a new dataframe called <b>Venues_df</b>.</li>

In [None]:
## Create new Dataframe using getNearbyVenues function
Venues_df = getNearbyVenues( city = Nghbr_df['City'], names=Nghbr_df['Neighborhood'],
                            latitudes=Nghbr_df['Latitude'], longitudes=Nghbr_df['Longitude']
                           )

## Print size of the resulting dataframe
print('This dataframe of Venues has {} rows and {} columns:'.format(Venues_df.shape[0]
                                                                           ,Venues_df.shape[1]),'\n')
## Save the venues dataframe to a CSV file
Venues_df.to_csv('Venues_df.csv', index=False) ## Saving the final dataframe as CSV
print('Venues data for all neighborhoods is saved to the csv file named: Venues_df.csv')

## Take a look at the dataframe
Venues_df.head()

*__Please note:__ You can now use the CSV file as a local copy of the venues for further analysis. It saves the time and resources needed to send requests to the Foursquare API and to transform the response data.*
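
The caching idea in the note can be sketched as a small helper that reads the CSV when it exists and only calls the API otherwise. The name `load_or_fetch` is hypothetical, and `fake_fetch` stands in for a real call such as `getNearbyVenues(...)`.

```python
import os
import tempfile
import pandas as pd

def load_or_fetch(csv_path, fetch):
    """Read csv_path if it exists; otherwise call fetch() and save its result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = fetch()
    df.to_csv(csv_path, index=False)
    return df

# Stand-in for the expensive API call; counts how often it is invoked
calls = []

def fake_fetch():
    calls.append(1)
    return pd.DataFrame({"Venue": ["Sample Cafe"], "Venue Category": ["Cafe"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Venues_df.csv")
    first = load_or_fetch(path, fake_fetch)   # fetches and writes the file
    second = load_or_fetch(path, fake_fetch)  # reads the cached file instead

print(len(calls))
```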

Let's check how many venues were returned for each neighborhood.

In [None]:
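## Reload the saved venues data and attach the Borough of each neighborhood.
Venues_df = pd.read_csv("Venues_df.csv")
## Match on City and Neighborhood; repeated (City, Neighborhood) pairs are dropped first,
## so that the merge cannot multiply the venue rows.
borough_lookup = Nghbr_df[['City','Neighborhood','Borough']].drop_duplicates(subset=['City','Neighborhood'])
Venues_df = Venues_df.merge(borough_lookup, on=['City','Neighborhood'], how='left')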
Venues_df = Venues_df.reindex(['City','Borough','Neighborhood', 'Neighborhood Latitude','Neighborhood Longitude', 
                  'Venue', 'Venue Latitude', 'Venue Longitude','Venue Category'], axis=1)

__Let's find out how many unique categories can be curated from all the returned venues__

In [None]:
print('There are {} unique categories of venues.'.format(len(Venues_df['Venue Category'].unique())))

In [None]:
Venues_df['Venue Category'].unique().astype(str)

In [None]:
Venues_df['Venue Category'].unique()
#df.replace(regex=[r'^ba.$', 'ffa'], value='new')

In [None]:
#print('# of Venues in each Neighborhood:')
Venues_df.head()
Venues_df.groupby(['Neighborhood'])[['Venue']].count()
#Venues_df.groupby('Neighborhood').count()

In [None]:
help(Venues_df.filter) # show the documentation of DataFrame.filter (items, like and regex arguments)

In [None]:
# Venues_df.set_index('Venue Category').filter(like='Restaurant', axis=0)
# Venues_df[(Venues_df['Venue Category'].str.contains("Ice Cream Shop") | 
#            Venues_df['Venue Category'].str.contains("Donut Shop"))]

In [None]:
# Scratch pattern for excluding rows by a text match (our dataframes have no 'ids' column):
# df[~df['ids'].str.contains("ball")]