<h1 align=center><font size = 5>Final Exam - Battle of the Neighborhoods
                     – Week 1 Submission
</font></h1>

## Introduction

This report provides the four key requirements specified for the Week 1 submission. In addition, it includes some of the data related import and processing modules, as recommended by the Week 1 submission guidelines.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Business Problem Introduction</a>

2. <a href="#item2">Data to Be Used</a>

3. <a href="#item3">Methodology and Analysis</a>

4. <a href="#item4">Results</a>

5. <a href="#item5">Discussion</a>    

6. <a href="#item5">Conclusion</a>    
</font>
</div>

# 1. Business Problem Introduction

Companies opening up new store fronts, have to deal with a myriad of issues: business licenses, regulatory requirements, sales and business taxes, competitors, and of course the biggest item: location, location, location. Positioning a store front in an advantageous location is one of, if not the, most important, key factor in a business’s ability to survive and thrive. The business problem this analysis attempts to solve, is to assist business owners with finding a set of potential locations to maximize their chances of success in opening a new store front. The specific “client” that this study will address, is a business owner that wants to open a new upscale (Steakhouse) restaurant in the Seattle area.

The overall strategy will be to find the highest revenue generating areas, in combination with the lowest number of potential competitors, within a given area. In addition, the total number of other nearby (non-restaurant) venues will be factored in, which act as additional “magnets” to entice more customers and foot traffic into the area that the business operates in.

Both a printed text summary of the best potential neighborhoods, as well as a Folium map showing the city of Seattle and the selected potential neighborhoods in graphic form, will be produced by the project.

## Who Would be Interested

The typical client for this kind of service would be any business that wanted to open a new storefront in any of Seattle’s neighborhoods. The customer would be able to get a list of the top 10 neighborhoods that would have the best desired “draw” characteristics, as well as good potential (under-served areas) for the target business. 

While this specific project focuses on selecting restaurant locations, any type of store front business (not just restaurants) could be evaluated, by changing the target venue. 

# 2. Data To Be Used Section

For this analysis, the following data is needed, in the form of CSV files that will be downloaded and generated as part of the reproducible source input data for the project:

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">•	File containing text names of Seattle neighborhoods and associated (parent) Districts</a>

2. <a href="#item2">•	File containing zip codes of the Seattle neighborhoods</a>

3. <a href="#item3">•	Files containing recent Real Estate Housing sales and rents of Seattle neighborhoods</a>

4. <a href="#item4">•	Four-Square data of current restaurant venues, including number of direct competitors, in the various Seattle neighborhoodss</a>

5. <a href="#item5">•	Four-Square data of containing counts of current general venues in a given neighborhood to ascertain additional “magnets” and points of interest that would help increase foot traffic and potential “draw in” business in the area</a>    

</font>
</div>

In an ideal world, a breakdown of overall retails sales or sales tax revenues by district, would be most valuable, in order to target the highest revenue generating areas, which would be most favorable for a high end upscale restaurant. While the city of Seattle and King County have very good data portals, they currently do not provide this level of detail. Instead, relevant sales revenue and sales taxes are currently only available as aggregate summaries by city, not by district or neighborhood. So as an approximation, housing prices and rent data (by neighborhood) will be used to select those districts and neighborhoods with the highest median values. These will be used to derive the best potential for an upscale restaurant.

The collected data will be used to determine which neighborhoods exist in the Seattle area, what are the income characteristics of each neighborhood, what competitors exist in each neighborhood, which neighborhoods might be being “under-served” (very few offerings or competitors in the area), and what “magnets” does the respective neighborhood have to draw people into the area. This information will be used to rank and select the top ranked 20 % of the neighborhoods. From there, additional analysis will be performed to whittle do the recommended locations to a “top 10” list.

## Data Sources

The following data sources will be used by the project:

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">•	Text Names of Seattle neighborhoods and districts – screen scraped via Python BeautifulSoup package from the following URL: https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Seattle.html
      The results will be saved in a CSV file, which will later be loaded into a Pandas “Neighborhoods” DataFrame
</a>

2. <a href="#item2">•	Zip Codes of Seattle neighborhoods – the Nominatim geocode() API will be used to obtain the Zip code for each Seattle neighborhood, using the Neighborhood names contained in the Pandas “Neighborhoods” DataFrame. 
Nominatim was selected because of Google’s recently increased pricing for obtaining geo-coded data. Because Nominatim has a policy of no more than 1 request per second (which out getting throttled or rejected), a 2 second timer will be used between invocations. The results will be saved in a CSV file, that will later be merged into the Pandas “Neighborhoods”  DataFrame.
</a>

3. <a href="#item3">•	Real Estate Housing sales and rents in the Seattle neighborhoods – will be extracted from Zillow’s median Housing sales price and Rents Index data by neighborhood, which is downloaded from the following “Zillow Data” URL:
          https://www.zillow.com/research/data/
      The results will be saved in a CSV file, that will later be merged into the Pandas “Neighborhoods”  DataFrame
</a>

4. <a href="#item4">•	Four-Square data of current restaurant venues – invoked via FourSquare API and counted, and merged into the “Neighborhoods”  DataFrame</a>

5. <a href="#item5">•	Four-Square data with counts of current general venues in a given neighborhood – invoked via FourSquare API and counted, and merged into the “Neighborhoods”  DataFrame</a>    

</font>
</div>

# 3. Methodology and Analysis Section

## Methodology Details

In this section, we will do a detailed walk though of all the steps used in the analysis. This 
will walk through each step of the analysis, so that this report can be used to reproduce and verify the results of the analysis.

This section includes some of the data related import and processing modules, as recommended by the Week 1 submission guidelines.

#### First, setup all the python libraries dependencies that we will need.

In [4]:
import numpy as np    # library to handle data in a vectorized manner

import pandas as pd                     # library for data analsysis
pd.set_option ('display.max_columns', None)
pd.set_option ('display.max_rows', None)

import json                             # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim   # convert an address into latitude and longitude values

import requests                           # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium                             # map rendering library

# !pip install beautifulsoup4
from bs4 import BeautifulSoup             # pull in HTML screen scraper support

import csv

import time                               # for sleep function

print ('python Libraries have been imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

python Libraries have been imported.


<a id='item1'></a>

## Download and Explore Datasets

In order to segement the neighborhoods and explore them, we need to create a dataset that contains all the Seattle neighborhoods and their associated districts.

So we need to download the following HTML web page, and then use the BeautifulSoup screen scraper to extract the data we need.

In [5]:
# Retreive List of neighborhoods from Seattle Wikipedia html

url  = requests.get ('https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Seattle')
soup = BeautifulSoup (url.text, 'lxml')

#print (soup.prettify())

print ('Seattle HTML Table data downloaded!')

Seattle HTML Table data downloaded!


#### Load and explore the data

In [6]:
#  Generate a CSV file from the Web screen scraped data that we parsed from BeautifulSoup

#with open ('toronto_postal.csv', 'w', newline='') as csv_file:
csv_output_file = open ('seattle_neighborhoods.csv', 'w')
csv_writer      = csv.writer (csv_output_file)
csv_writer.writerow (['Neighborhood', 'District'])   # write out header line

   # An examination of soup.prettify output shows that the data to be scraped
   # is laid out as an HTML table, using:
   #    <tr> (row begin) <td>xxx</td> <td>yyy</td> (detail columns) </tr> (row end)
   # e.g. one of the table entries is shown below: 
   #    <tr> <th><small>2</small> <td><b>Central Seattle</b></td> <td>Seattle[45]</td> ... </tr>
   #
   # Some Districts also have a / designator, as well as a few Neightnbors
   # Anything after the slashes and the citation [nn] need to be removed

   # Scrape the data from the HTML table, using the <tr> and <td> tags
for tr_row in soup.find_all('tr')[1:]:
      td_col = tr_row.find_all ('td')
      if len(td_col) >= 3:
          Neighborhood = td_col[0].text
          District     = td_col[1].text
          # strip off any [] citations attached to District or neighborhood, and any slashes / or (
          idx = District.find ('[')
          if (idx > 0):
             District = District[:idx]
          idx = District.find ('/')
          if (idx > 0):
             District = District[:idx]
          idx = Neighborhood.find ('[')
          if (idx > 0):
             Neighborhood = Neighborhood[:idx]
          idx = Neighborhood.find ('/')
          if (idx > 0):
             Neighborhood = Neighborhood[:idx]
          idx = Neighborhood.find ('(')
          if (idx > 0):
             Neighborhood = Neighborhood[:idx]
          # And strip any trailing \r or \n characters
          idx = District.find ('\r')
          if (idx > 0):
             District = District[:idx]
          idx = Neighborhood.find ('\r')
          if (idx > 0):
             Neighborhood = Neighborhood[:idx]
          idx = District.find ('\n')
          if (idx > 0):
             District = District[:idx]
          idx = Neighborhood.find ('\n')
          if (idx > 0):
             Neighborhood = Neighborhood[:idx]
          # strip any Neighbor Names that have commas on the tail
          idx = Neighborhood.find (',')
          if (idx > 0):
             Neighborhood = Neighborhood[:idx]
          # strip any Neighbor Names that have a " at the front
          idx = Neighborhood.find ('"')
          if (idx > 0):
             Neighborhood = Neighborhood[1:]     
          # do not write out any annotation entries of "West Seattle is further divided into ..."
          if (Neighborhood.find('West Seattle is further divided') == -1):
             csv_writer.writerow ([Neighborhood, District])
             print (Neighborhood, District)

csv_output_file.close()

North Seattle Seattle
Broadview North Seattle
Bitter Lake North Seattle
North Beach  North Seattle
Crown Hill North Seattle
Greenwood North Seattle
Northgate North Seattle
Haller Lake Northgate
Pinehurst Northgate
North College Park Northgate
Maple Leaf Northgate
Lake City North Seattle
Cedar Park Lake City
Matthews Beach Lake City
Meadowbrook Lake City
Olympic Hills Lake City
Victory Heights Lake City
Wedgwood North Seattle
View Ridge North Seattle
Sand Point North Seattle
Roosevelt North Seattle
Ravenna North Seattle
Bryant North Seattle
Windermere North Seattle
Hawthorne Hills Windermere
Laurelhurst North Seattle
University District  North Seattle
University Village Ravenna
Wallingford North Seattle
Northlake Lake Union 
Green Lake North Seattle
Fremont North Seattle
Phinney Ridge North Seattle
Ballard North Seattle
West Woodland Ballard
Whittier Heights Ballard
Adams Ballard
Sunset Hill Ballard
Loyal Heights Ballard
Central Seattle Seattle
Magnolia Central Seattle
Lawton Park Magno

In [7]:
# convert the CSV file into a pandas dataframe, then print the first few entries

df_seattle_raw = pd.read_csv ('seattle_neighborhoods.csv')

print ('Raw Seattle Neighborhoods shape is: ', df_seattle_raw.shape)

df_seattle_raw.head (10)

Raw Seattle Neighborhoods shape is:  (127, 2)


Unnamed: 0,Neighborhood,District
0,North Seattle,Seattle
1,Broadview,North Seattle
2,Bitter Lake,North Seattle
3,North Beach,North Seattle
4,Crown Hill,North Seattle
5,Greenwood,North Seattle
6,Northgate,North Seattle
7,Haller Lake,Northgate
8,Pinehurst,Northgate
9,North College Park,Northgate


## Data Wrangling/Cleaning

As per class instructions, clean up the data we received.

In [8]:
# There are a few Neighborhoods on the Wiki table that are duplicated. Remove all but one

df_seattle = df_seattle_raw.drop_duplicates (subset=None, keep='first', inplace=False)

df_seattle.sort_values (by=['District'],ascending=True,inplace=True)  # sort into order and reset index to 0,1,2, ...
df_seattle = df_seattle.reset_index (drop=True)

print ('Cleaned Seattle Neighborhoods shape is: ', df_seattle.shape)

df_seattle.head (15)

Cleaned Seattle Neighborhoods shape is:  (127, 2)


Unnamed: 0,Neighborhood,District
0,Judkins Park,Atlantic
1,Whittier Heights,Ballard
2,Adams,Ballard
3,Sunset Hill,Ballard
4,Loyal Heights,Ballard
5,West Woodland,Ballard
6,Holly Park,Beacon Hill
7,Mid Beacon Hill,Beacon Hill
8,North Beacon Hill,Beacon Hill
9,South Beacon Hill,Beacon Hill


In [10]:
# TBD Week 2 - Check if the Seatlle geo dataset has been created. If not, create it now

# NOTE: Nominatim limits stuff to maximum of 1 request per second !!!  (we need to process 127 hoods)

# skip this if the dataset already exists


geolocator = Nominatim()     # instantiate the lookup object
Nominatim (user_agent="my-application")   # set specific application name to avoid default user agent complaint


csv_geo_file = open ('Seattle_Geospatial_Coordinates.csv', 'w')
csv_writer   = csv.writer (csv_geo_file)
csv_writer.writerow (['Neighborhood', 'Latitude', 'Longitude'])   # write out header line

# loop thru the above pandas dataset and append the Neighborhood name to Seattle
# and then do a lookup on that, and save it into a csv file by neighbor hood name
#   for row in csv_f:   # using a read_csv
#       print (row)
for i in range(0, len(df_seattle)):
        Neighborhood = df_seattle.iloc[i]['Neighborhood']
        # a bit brute force, but we only have to do it once (for 127 Neighborhoods)
        address    = Neighborhood + ', Seattle, USA'
        location   = geolocator.geocode (address)
        if (location != None):
           Latitude   = location.latitude
           Longitude  = location.longitude
        # if null, will just use the previous Lat/Long as default
        csv_writer.writerow ([Neighborhood, Latitude, Longitude])
        #print ('Address coords to {} are {}, {}.'.format(address, latitude, longitude))
        time.sleep (2)         # Delays for 2 seconds to make Novatim happy. Can also use a float value.
    
csv_geo_file.close()
print ('Seattle Geo Coordinates file write complete')

  


Seattle Geo Coordinates file write complete


In [12]:
# import the CSV file Seattle_Geospatial_Coordinates.csv that we created

df_geospatial = pd.read_csv ('Seattle_Geospatial_Coordinates.csv')

print ('Shape of Seattle Geo coordinate data is ', df_geospatial.shape)

df_geospatial.head (10)

Shape of Seattle Geo coordinate data is  (127, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Judkins Park,47.59147,-122.304069
1,Whittier Heights,47.683297,-122.371449
2,Adams,47.5653,-122.272014
3,Sunset Hill,47.675217,-122.398448
4,Loyal Heights,47.688709,-122.392907
5,West Woodland,47.675973,-122.347499
6,Holly Park,47.54165,-122.291929
7,Mid Beacon Hill,47.54165,-122.291929
8,North Beacon Hill,47.577586,-122.30996
9,South Beacon Hill,47.577586,-122.30996


In [14]:
# now merge the geospatial data with the toronto postal data, using
# Neighborhood as common key. This is equivalent to a SQL Join
# resulting column_names = ['Neighborhood', 'District', 'Latitude', 'Longitude'] 

df_seattle_data = pd.merge (df_seattle, df_geospatial, left_on='Neighborhood', right_on='Neighborhood')

print ("Total Neighborhoods: " + str(df_seattle_data['Neighborhood'].nunique()))
print ("Total Districts: " + str(df_seattle_data['District'].nunique()))
print ('Shape of merged dataframe is: ', df_seattle_data.shape)

df_seattle_data.head (10)          # and check the results

Total Neighborhoods: 127
Total Districts: 31
Shape of merged dataframe is:  (127, 4)


Unnamed: 0,Neighborhood,District,Latitude,Longitude
0,Judkins Park,Atlantic,47.59147,-122.304069
1,Whittier Heights,Ballard,47.683297,-122.371449
2,Adams,Ballard,47.5653,-122.272014
3,Sunset Hill,Ballard,47.675217,-122.398448
4,Loyal Heights,Ballard,47.688709,-122.392907
5,West Woodland,Ballard,47.675973,-122.347499
6,Holly Park,Beacon Hill,47.54165,-122.291929
7,Mid Beacon Hill,Beacon Hill,47.54165,-122.291929
8,North Beacon Hill,Beacon Hill,47.577586,-122.30996
9,South Beacon Hill,Beacon Hill,47.577586,-122.30996


In [15]:
# import the Housing median sales CSV file that we downloaded from Zillow

df_zillow_housing_raw = pd.read_csv ('Zillow_Housing_Sale_Prices_Neighborhood.csv')

# then select just the rows containing Seattle
# then rename the Neighborhood column
# then sum up the 3 prior years house prices

#df_zillow_housing_raw.head (10)

In [16]:
# merge the above results into the df_seattle_data dataset

# TBD Week 2

In [17]:
# import the Rental Index values CSV file that we downloaded from Zillow

df_zillow_rental_raw = pd.read_csv ('Zillow_Neighborhood_Rental_IndexZri_AllHomesPlusMultifamily_Summary.csv')

# then select just the rows containing Seattle
# then rename the Neighborhood column
# then sum up the 3 prior years rental prices

#df_zillow_rental_raw.head (10)

In [18]:
# merge the above results into the df_seattle_data dataset
# TBD Week 2

## Walk thru all the neighborhoods, and select the 20 % highest ones

TBD Week 2  based on the df_seattle_data

================================================================
## Setup Four Square credentials
================================================================

In [19]:
# use the 4-Square segmenting/clustering thing
CLIENT_ID = 'xxxxxxxxxxxxxxxxxx' # my Foursquare ID
CLIENT_SECRET = 'xxxxxxxxxxxxxxxxx' # my Foursquare Secret
VERSION = '20180605' # Foursquare API version or '20180604'
# CAUTION LIMIT must be same as shape (103). If set to 100 (default) later
# processing dies because the frame shapes don't match (103 vs 100)
LIMIT = 100
print ('Your credentials:')
print ('CLIENT_ID: ' + CLIENT_ID)
print ('CLIENT_SECRET: ' + CLIENT_SECRET)
print ('LIMIT: ' + LIMIT)

Your credentials:
CLIENT_ID: xxxxxxxxxxxxxxxxxx
CLIENT_SECRET: xxxxxxxxxxxxxxxxx


TypeError: must be str, not int

## Explore a Competitive Businesses in each neighborhood using Four-Square


In [20]:
# scan thru the top 20% Pandas Neighborhoods frame, and see how many competing restaraunts
# there are in each of the selected neighborhoods

# TBD week 2 

In [21]:
radius = 1000 # define radius of the area to search
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
            
# make the GET request to 4-Square, and get the returned results
#results = requests.get(url).json()
#results

NameError: name 'lat' is not defined

From the Foursquare lab in Week 2, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [22]:
# function that extracts the category of the venue
def get_category_type (row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [23]:
# read in, and clean, the received JSON data, and put it in a DataFrame

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize (venues)     # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues    = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

NameError: name 'results' is not defined

And how many venues were returned by Foursquare?

In [31]:
print ('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

2 venues were returned by Foursquare.


<a id='item2'></a>

## 2. Explore Suite of Venues in the the Neighborhoods

#### Create a function to repeat the same process to all the neighborhoods in Seattle

In [33]:
def  getNearbyVenues (names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print (name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append ([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame ([item for venue_list in venues_list for item in venue_list])
    
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return (nearby_venues)

#### Now write the code to run the above function on each selected neighborhood and create a new dataframe called *seattle_venues*.

In [24]:
# invoke the 4-Square functions for each of the neighborhoods,
# and save the results in a data frame

seattle_venues = getNearbyVenues (names = df_seattle_data['Neighbourhood'],
                                  latitudes = df_seattle_data['Latitude'],
                                  longitudes = df_seattle_data['Longitude']
                                 )

NameError: name 'getNearbyVenues' is not defined

#### Check the size and shape of the resulting dataframe

In [25]:
print ("Total Unique Neighbourhoods returned with Venues: " + str(seattle_venues['Neighborhood'].nunique()))
print ('Total number and shape of venues results returned=', seattle_venues.shape)
seattleo_venues.head (10)

NameError: name 'seattle_venues' is not defined

### Check how many venues were returned for each neighborhood

In [26]:
seattle_venues.groupby ('Neighborhood').count()

NameError: name 'seattle_venues' is not defined

#### Let's find out how many unique categories can be curated from all the returned venues

In [37]:
print ('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 279 unique categories.


<a id='item3'></a>

## 3. Analyze Each Neighborhood

In [27]:
#######################
#  one hot encoding
#######################
seattle_onehot = pd.get_dummies (seattle_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
seattle_onehot['Neighborhood'] = seattle_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns  = [seattle_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
seattle_onehot = seattle_onehot[fixed_columns]

seattle_onehot.head()

NameError: name 'seattle_venues' is not defined

And let's examine the new dataframe size.

In [39]:
seattle_onehot.shape

(2233, 279)

#### Group the rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [28]:
seattle_grouped = seattle_onehot.groupby('Neighborhood').mean().reset_index()

seattle_grouped

NameError: name 'seattle_onehot' is not defined

#### Confirm the new size

In [42]:
seattle_grouped.shape

(101, 279)

#### Print each neighborhood along with the top 5 most common venues

In [29]:
num_top_venues = 5

for hood in seattle_grouped['Neighborhood']:
    print ("----"+hood+"----")
    temp = seattle_grouped[seattle_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round ({'freq': 2})
    print (temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print ('\n')

NameError: name 'seattle_grouped' is not defined

#### Put that venues data into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [46]:
def return_most_common_venues (row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

### Create the new dataframe and display the top 10 venues for each neighborhood.

In [30]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append ('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append ('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame (columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = seattle_grouped['Neighborhood']

for ind in np.arange(seattle_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(seattle_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

NameError: name 'seattle_grouped' is not defined

<a id='item4'></a>

## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 5 clusters.

In [51]:
# set number of clusters
kclusters = 5

seattle_grouped_clustering = seattle_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans (n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
#kmeans.labels_[0:10] 
kmeans.labels_

array([2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)

In [52]:
seattle_grouped_clustering.shape

(101, 278)

In [53]:
df_seattle_data.shape

(103, 6)

In [31]:
# Do an inner join, to eliminate any df_toronto_data neighborhoods that did not get results on the 4-square query
# Otherwise, we get blown out in the next step because the shapes don't match: e.g. 103 rows vs 101 rows

df_seattle_data_join = df_seattle_data.rename(columns={'Neighborhood':'Neighborhood'}) # ensure correct spelling

df_joined_result = seattle_grouped.join (df_seattle_data_join.set_index('Neighborhood'), on='Neighborhood')

df_joined_result.head (10)

NameError: name 'seattle_grouped' is not defined

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [32]:
seattle_merged = df_joined_result

# add clustering labels
seattle_merged['Cluster Labels'] = kmeans.labels_

################################################################
# merge toronto_grouped with toronto_data PANDAs to add latitude/longitude for each neighborhood
################################################################
seattle_merged = seattle_merged.join (neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

seattle_merged.head()

NameError: name 'df_joined_result' is not defined

Finally, let's visualize the resulting clusters

In [57]:
# Week 2 TBD

# create map
map_clusters = folium.Map (location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange (kclusters)

ys = [i+x+(i*x)**2 for i in range(kclusters)]

colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(seattle_merged['Latitude'], seattle_merged['Longitude'], seattle_merged['Neighborhood'], seattle_merged['Cluster Labels']):
    label = folium.Popup (str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker (
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

## 5. Examine Clusters

Week 2 TBD

Examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.

#### Cluster 1

In [33]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 0, seattle_merged.columns[[1] + list(range(5, seattle_merged.shape[1]))]]

NameError: name 'seattle_merged' is not defined

#### Cluster 2

In [34]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 1, seattle_merged.columns[[1] + list(range(5, seattle_merged.shape[1]))]]

NameError: name 'seattle_merged' is not defined

#### Cluster 3

In [35]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 2, seattle_merged.columns[[1] + list(range(5, seattle_merged.shape[1]))]]

NameError: name 'seattle_merged' is not defined

#### Cluster 4

In [36]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 3, seattle_merged.columns[[1] + list(range(5, seattle_merged.shape[1]))]]

NameError: name 'seattle_merged' is not defined

#### Cluster 5

In [37]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 4, seattle_merged.columns[[1] + list(range(5, seattle_merged.shape[1]))]]

NameError: name 'seattle_merged' is not defined

TDB Week 2

Additional aalysis and final data rollup

# 4. Results Section

tbd Week 2

# 5. Discussion Section

tbd Week 2

# 6. Conclusion Section

tbd Week 2