# Capstone Project - The Battle of the Neighborhoods (Week 1)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)



## Introduction: Business Problem <a name="introduction"></a>

In this project we compare South Africa's two biggest cities, Johannesburg and Cape Town.
This report is aimed at tourism businesses, but it can also inform a variety of other business or lifestyle decisions.

We will attempt to **classify all neighborhoods for both cities collectively** by analysing their **venue composition**.
Once the neighborhoods have been classified we can **review the distribution of these classes across the two cities**.

Reviewing the distribution of neighborhood classes should reveal an additional layer of insight that can inform how tourism packages are structured. It also gives a general idea of how lifestyle in these areas is similar or differs between the two cities.

## Data <a name="data"></a>

This section describes the data and how it will be used to solve the problem.

The following data sources are needed to extract or generate the required information:
- List of neighborhoods for each city (collected from Wikipedia)
- List of coordinates for each neighborhood (collected using the Nominatim geocoder from geopy)
- List of venues for each neighborhood, including their associated category (collected using the Foursquare API)

Once the data is obtained we will need to perform the following analysis:
- Classify all neighborhoods, collectively, in terms of venue composition using k-means clustering
- Analyse the distribution of the neighborhood classes across the two cities to gauge similarity
- Plot the neighborhoods on a map to view the geographical distribution of the neighborhood classes
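
The clustering step listed above can be sketched on toy data. This is a minimal illustration and not the notebook's final analysis: the rows in `venues` are made up, the choice of two clusters is arbitrary, and the use of `pd.crosstab` to build the composition table is an assumption about one reasonable way to do it.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue data (hypothetical rows, for illustration only)
venues = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Cafe", "Park", "Bar", "Bar", "Hotel"],
})

# One row per neighborhood, one column per category,
# values = share of that neighborhood's venues in the category
composition = pd.crosstab(venues["ID"], venues["Venue Category"], normalize="index")

# Cluster the neighborhoods on their venue composition (k=2 is an arbitrary choice here)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(composition)
clusters = pd.Series(kmeans.labels_, index=composition.index, name="Cluster")
print(clusters)
```

Normalizing each row to shares keeps neighborhoods with many venues from dominating the distance calculation purely because of their venue count.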

In [33]:
# Foursquare API information

api_secretspath = r'C:\Temp\FourSquare_AppSecret.txt'

with open(api_secretspath, "r") as secret_content:
    lines = secret_content.readlines()
    
CLIENT_ID = lines[0].rstrip() # Foursquare ID
CLIENT_SECRET = lines[1].rstrip() # Foursquare secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [2]:
# Module Imports
import requests # library to handle requests
import numpy as np # library for scientific computing in Python
import pandas as pd # primary data structure library

from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

from sklearn.cluster import KMeans # k-means clustering, used in the clustering stage

# Mapping and plotting modules
import folium # Geographic plotting library
from matplotlib import pyplot as plt # Plotting library
import matplotlib.cm as cm # Module used for generating colors
import matplotlib.colors as colors # Module used for generating colors



### Getting a list of neighborhoods per city

The links below contain the required information.
- https://en.wikipedia.org/wiki/List_of_Cape_Town_suburbs
- https://en.wikipedia.org/wiki/Suburbs_of_Johannesburg

Unfortunately, the data on these pages is not well structured, so the neighborhood lists were extracted manually and saved to text files that can be imported. Each neighborhood name is followed by a comma and the city name to make the coordinate lookups easier.

In [7]:
# Reading in list of neighborhoods from text files

# Creating empty list
cpt_suburbs = []

# File Paths
cpt_listpath = r"Data\CapeTown_neighbourhood.txt"

# Open the files and read in the data
with open(cpt_listpath, "r") as cpt_content:
    lines = cpt_content.readlines()
    for line in lines:
        cpt_suburbs.append(line)

# Display sample of the lists
print('Cape Town - First 5 Neighborhoods of {}'.format(len(cpt_suburbs)))
cpt_suburbs[0:5]

Cape Town - First 5 Neighborhoods of 135


['Athlone, Cape Town\n',
 'Atlantis, Cape Town\n',
 'Bakoven, Cape Town\n',
 'Bantry Bay, Cape Town\n',
 'Belhar, Cape Town\n']

In [6]:
# Reading in list of neighborhoods from text files

# Creating empty list
jhb_suburbs = []

# File Paths
jhb_listpath = r"Data\Johannesburg_neighbourhood.txt"

with open(jhb_listpath, "r") as jhb_content:
    lines = jhb_content.readlines()
    for line in lines:
        jhb_suburbs.append(line)

print('Johannesburg - First 5 Neighborhoods of {}'.format(len(jhb_suburbs)))
jhb_suburbs[0:5]

Johannesburg - First 5 Neighborhoods of 358


['Abbotsford, Johannesburg\n',
 'Aeroton, Johannesburg\n',
 'Airdlin, Johannesburg\n',
 'Alan Manor, Johannesburg\n',
 'Albertskroon, Johannesburg\n']
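
The entries shown above still carry their trailing newline characters. As a minimal sketch, assuming the same "name, city" line format, the lines could be stripped (and blank lines skipped) while reading. The `io.StringIO` object is only a stand-in for a real file so that the example runs without the data files.

```python
import io

# Stand-in for a neighborhood text file (hypothetical content in the "name, city" format)
sample_file = io.StringIO("Athlone, Cape Town\nAtlantis, Cape Town\n\nBakoven, Cape Town\n")

# Strip surrounding whitespace from each line and skip lines that are blank
suburbs = [line.strip() for line in sample_file if line.strip()]
print(suburbs)
```

With a real file, the same comprehension can be used inside the `with open(...)` block by iterating over the file object.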

### Adding longitude and latitude coordinates for all the neighborhoods

Now that we have a list of all the neighborhoods, we need to find their coordinates.
We will use the Nominatim geocoder from geopy to do this.

Because the geocoding results may not always be accurate, we also collect the resolved address for each lookup along with its coordinates.
The addresses will be used to validate the results before performing any analysis.
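
The public Nominatim service asks clients to send at most one request per second, and geopy provides a `RateLimiter` helper in `geopy.extra.rate_limiter` for this purpose. The sketch below shows the same idea using only the standard library. The names `throttled` and `fake_geocode` are hypothetical, and the fake geocoder exists only so the example runs offline.

```python
import time

def throttled(func, min_delay_seconds=1.0):
    """Wrap func so that successive calls start at least min_delay_seconds apart."""
    last_call = [None]  # mutable cell holding the start time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Hypothetical stand-in for a real geocoder call, so the sketch needs no network access
def fake_geocode(query):
    return {"searchstring": query, "latitude": None, "longitude": None}

# A short delay keeps the demonstration quick; a real lookup would use about 1 second
geocode = throttled(fake_geocode, min_delay_seconds=0.2)

start = time.monotonic()
results = [geocode(q) for q in ["Athlone, Cape Town", "Bakoven, Cape Town", "Belhar, Cape Town"]]
elapsed = time.monotonic() - start
print(len(results), "lookups completed")
```

With several hundred neighborhoods to look up, a one-second delay adds a few minutes to the run but reduces the chance of requests being rejected.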

In [11]:
# function for getting location information and coordinates
def get_locdata(searchstring):
    
    result = {}
       
    print("Searching : {}".format(searchstring)) # print information to keep track of progress
    
    try:
        geolocator = Nominatim(user_agent="sa_explorer")
        location = geolocator.geocode(searchstring)
        
        result['searchstring'] = searchstring
        result['address'] = location.address
        result['longitude'] = location.longitude
        result['latitude'] = location.latitude
    
    except Exception: # lookup failed or returned no match (location is None), store None values
        
        result['searchstring'] = searchstring
        result['address'] = None
        result['longitude'] = None
        result['latitude'] = None
        
    return result


In [12]:
# Gathering Location information

# Merge list of neighborhoods to be searched  
all_suburbs = cpt_suburbs + jhb_suburbs

# Create empty list to capture results of the searches
locdata = []

for loc in all_suburbs:
    locdata.append(get_locdata(loc.rstrip()))

Searching : Athlone, Cape Town
Searching : Atlantis, Cape Town
Searching : Bakoven, Cape Town
Searching : Bantry Bay, Cape Town
Searching : Belhar, Cape Town
Searching : Bellville, Cape Town
Searching : Bergvliet, Cape Town
Searching : Bishop Lavis, Cape Town
Searching : Bishopscourt, Cape Town
Searching : Bloubergstrand, Cape Town
Searching : Bo-Kaap, Cape Town
Searching : Bonteheuwel, Cape Town
Searching : Bothasig, Cape Town
Searching : Brackenfell, Cape Town
Searching : Brooklyn, Cape Town
Searching : Camps Bay, Cape Town
Searching : Capri Village, Cape Town
Searching : Claremont, Cape Town
Searching : Clifton, Cape Town
Searching : Clovelly, Cape Town
Searching : Constantia, Cape Town
Searching : Crawford, Cape Town
Searching : Crossroads, Cape Town
Searching : Darling, Cape Town
Searching : De Waterkant, Cape Town
Searching : Delft, Cape Town
Searching : Devil's Peak Estate, Cape Town
Searching : Diep River, Cape Town
Searching : Durbanville, Cape Town
Searching : Edgemead, Cape 

Searching : Emmarentia, Johannesburg
Searching : Ennerdale, Johannesburg
Searching : Epsom Downs, Johannesburg
Searching : Erand, Johannesburg
Searching : Evans Park, Johannesburg
Searching : Fairland, Johannesburg
Searching : Fairview, Johannesburg
Searching : Fairway, Johannesburg
Searching : Fairwood, Johannesburg
Searching : Farmall, Johannesburg
Searching : Fellside, Johannesburg
Searching : Ferndale, Johannesburg
Searching : Ferreirasdorp, Johannesburg
Searching : Florida, Johannesburg
Searching : Florida Glen, Johannesburg
Searching : Florida Hills, Johannesburg
Searching : Fontainebleau, Johannesburg
Searching : Forbesdale, Johannesburg
Searching : Fordsburg, Johannesburg
Searching : Forest Hill, Johannesburg
Searching : Forest Town, Johannesburg
Searching : Fourways, Johannesburg
Searching : Framton, Johannesburg
Searching : Gallo Manor, Johannesburg
Searching : Gillview, Johannesburg
Searching : Glen Athol, Johannesburg
Searching : Glen Austin, Johannesburg
Searching : Glenad

Searching : Steeledale, Johannesburg
Searching : Strathavon, Johannesburg
Searching : Strijdompark, Johannesburg
Searching : Suideroord, Johannesburg
Searching : Sunningdale, Johannesburg
Searching : Sunningdale Ridge, Johannesburg
Searching : Sunninghill, Johannesburg
Searching : Sunrella, Johannesburg
Searching : Sunset Acres, Johannesburg
Searching : Sydenham, Johannesburg
Searching : The Gables, Johannesburg
Searching : The Gardens, Johannesburg
Searching : The Hill, Johannesburg
Searching : Theta, Johannesburg
Searching : Towerby, Johannesburg
Searching : Townsview, Johannesburg
Searching : Trevallyn, Johannesburg
Searching : Trojan, Johannesburg
Searching : Troyeville, Johannesburg
Searching : Tulisa Park, Johannesburg
Searching : Turf Club, Johannesburg
Searching : Turffontein, Johannesburg
Searching : Vandia Grove, Johannesburg
Searching : Victoria, Johannesburg
Searching : Village Main, Johannesburg
Searching : Vorna Valley, Johannesburg
Searching : Vrededorp, Johannesburg
Sea

In [21]:
# Convert the list of dictionaries to a dataframe
loc_df = pd.DataFrame(locdata)

# Display some high level information of the dataframe
print('dataframe shape')
loc_df.shape

dataframe shape


(493, 4)

In [22]:
print('dataframe head')
loc_df.head()

dataframe head


| | searchstring | address | longitude | latitude |
|---|---|---|---|---|
| 0 | Athlone, Cape Town | Athlone, Cape Town Ward 49, Cape Town, City of... | 18.505 | -33.966667 |
| 1 | Atlantis, Cape Town | Atlantis, City of Cape Town, Western Cape, Sou... | 18.500278 | -33.567222 |
| 2 | Bakoven, Cape Town | Bakoven, Camps Bay, Cape Town, City of Cape To... | 18.382778 | -33.96 |
| 3 | Bantry Bay, Cape Town | Bantry Bay, Cape Town, City of Cape Town, West... | 18.37897 | -33.928151 |
| 4 | Belhar, Cape Town | Belhar, Cape Town Ward 22, City of Cape Town, ... | 18.625833 | -33.944722 |


### Cleaning the location dataframe

Unfortunately, the dataset has a few issues that need to be cleaned up:
- some locations could not be found
- some locations were found but are not the intended ones (they are in other countries)
- the searchstring and address columns are no longer needed after the cleanup
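
Before filtering, it can help to count how many rows fall under each issue. The following is a minimal sketch on a toy dataframe with the same columns as the location dataframe. The three rows, including the addresses and coordinates, are hypothetical and chosen only to trigger each case.

```python
import pandas as pd

# Toy geocoding results (hypothetical rows, same columns as the location dataframe)
sample = pd.DataFrame({
    "searchstring": ["Athlone, Cape Town", "Nowhere, Cape Town", "Victoria, Johannesburg"],
    "address": [
        "Athlone, Cape Town, Western Cape, South Africa",
        None,
        "Victoria, British Columbia, Canada",
    ],
    "longitude": [18.505, None, -123.36],
    "latitude": [-33.97, None, 48.43],
})

# Lookups that returned nothing
not_found = sample["address"].isna()

# Lookups that resolved to a place outside South Africa
wrong_country = ~not_found & ~sample["address"].str.contains("South Africa", na=False)

print("not found:", not_found.sum())
print("outside South Africa:", wrong_country.sum())

# Rows that survive both checks
clean = sample[~not_found & ~wrong_country]
```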

In [25]:
# Cleaning location dataframe

# Filter out locations that could not be found
loc_df = loc_df.dropna()

# Filter out results that are not in South Africa
# (.copy() makes an independent dataframe, so adding columns below does not raise SettingWithCopyWarning)
loc_df = loc_df[ loc_df.address.str.contains("South Africa") ].copy()

# Add two new columns, neighborhood and city, by splitting the searchstring
loc_df[['neighborhood','city']] = loc_df.searchstring.str.split(", ",expand=True)

# Remove searchstring and address columns as they are no longer needed.
# This will be the baseline location data that we will use for analysis.
loc_df = loc_df.drop(columns=['address', 'searchstring'])

# reset the index
loc_df.reset_index(inplace=True,drop=True)

In [26]:
# Display some high level information of the dataframe
print('dataframe shape')
loc_df.shape

dataframe shape


(408, 4)

In [27]:
print('dataframe head')
loc_df.head()

dataframe head


| | longitude | latitude | neighborhood | city |
|---|---|---|---|---|
| 0 | 18.505 | -33.966667 | Athlone | Cape Town |
| 1 | 18.500278 | -33.567222 | Atlantis | Cape Town |
| 2 | 18.382778 | -33.96 | Bakoven | Cape Town |
| 3 | 18.37897 | -33.928151 | Bantry Bay | Cape Town |
| 4 | 18.625833 | -33.944722 | Belhar | Cape Town |


### Exploring the Venues in all the neighborhoods

Now that we have a dataframe of neighborhoods and their coordinates, we can gather the nearby venues for each one.
We will use the Foursquare API to collect this information.
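
A venue search response can be missing pieces, for example a venue with no category or a response with no groups. The sketch below shows one defensive way to pull out the fields of interest. It assumes the payload is nested as response, groups, items, venue, which is the shape the venue-gathering code in this section reads. The function name `extract_venues` and the sample payloads are hypothetical.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item["venue"]
        # Some venues may have an empty category list; record None in that case
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Hypothetical payloads used only to exercise the function
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": -33.96, "lng": 18.51},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": -33.97, "lng": 18.48},
               "categories": []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({"response": {}}))
```

Returning an empty list for a response without groups lets a long run over hundreds of neighborhoods continue instead of stopping on the first neighborhood that has no venues.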

In [29]:
# Define function for exploring a location for all nearby venues
def getNearbyVenues(neighborhoods, cities, latitudes, longitudes):

    venues_list=[]
    for neighborhood, city, lat, lng in zip(neighborhoods, cities, latitudes, longitudes):
        print("Exploring : {} - {}".format(city, neighborhood).rstrip()) # Print current search to track progress
            
        # create the API request (requests URL-encodes the query parameters)
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'limit': LIMIT}

        # make the GET request
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            neighborhood,
            city,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'City', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues


In [37]:
# Gather all venues into a dataframe
all_venues = getNearbyVenues(neighborhoods = loc_df['neighborhood'],
                             cities = loc_df['city'], 
                             latitudes = loc_df['latitude'],
                             longitudes = loc_df['longitude']
                             )

Exploring : Cape Town - Athlone
Exploring : Cape Town - Atlantis
Exploring : Cape Town - Bakoven
Exploring : Cape Town - Bantry Bay
Exploring : Cape Town - Belhar
Exploring : Cape Town - Bellville
Exploring : Cape Town - Bergvliet
Exploring : Cape Town - Bishop Lavis
Exploring : Cape Town - Bishopscourt
Exploring : Cape Town - Bloubergstrand
Exploring : Cape Town - Bo-Kaap
Exploring : Cape Town - Bonteheuwel
Exploring : Cape Town - Bothasig
Exploring : Cape Town - Brackenfell
Exploring : Cape Town - Brooklyn
Exploring : Cape Town - Camps Bay
Exploring : Cape Town - Capri Village
Exploring : Cape Town - Claremont
Exploring : Cape Town - Clifton
Exploring : Cape Town - Clovelly
Exploring : Cape Town - Constantia
Exploring : Cape Town - Crawford
Exploring : Cape Town - Crossroads
Exploring : Cape Town - Darling
Exploring : Cape Town - De Waterkant
Exploring : Cape Town - Delft
Exploring : Cape Town - Devil's Peak Estate
Exploring : Cape Town - Diep River
Exploring : Cape Town - Durbanvill

Exploring : Johannesburg - Glenhazel
Exploring : Johannesburg - Glenvista
Exploring : Johannesburg - Greenside
Exploring : Johannesburg - Greenstone Hill
Exploring : Johannesburg - Gresswold
Exploring : Johannesburg - Greymont
Exploring : Johannesburg - Haddon
Exploring : Johannesburg - Headway Hill
Exploring : Johannesburg - Heriotdale
Exploring : Johannesburg - Highlands
Exploring : Johannesburg - Highlands North
Exploring : Johannesburg - Hillbrow
Exploring : Johannesburg - Homestead Park
Exploring : Johannesburg - Hurlingham
Exploring : Johannesburg - Hurlingham Gardens
Exploring : Johannesburg - Hyde Park
Exploring : Johannesburg - Illovo
Exploring : Johannesburg - Inanda
Exploring : Johannesburg - Ivory Park
Exploring : Johannesburg - Jeppestown
Exploring : Johannesburg - Jeppestown South
Exploring : Johannesburg - Johannesburg North
Exploring : Johannesburg - Joubert Park
Exploring : Johannesburg - Judith's Paarl
Exploring : Johannesburg - Jukskei Park
Exploring : Johannesburg -

In [38]:
# Create a key combining city and neighborhood to identify each neighborhood
all_venues['ID'] = all_venues['City'] + '_' + all_venues['Neighborhood']

In [39]:
print('dataframe head')
all_venues.head()

dataframe head


| | Neighborhood | City | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | ID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Athlone | Cape Town | -33.966667 | 18.505 | Wembley Roadhouse | -33.965429 | 18.51466 | Burger Joint | Cape Town_Athlone |
| 1 | Athlone | Cape Town | -33.966667 | 18.505 | Common Ground Cafe | -33.958918 | 18.485833 | Coffee Shop | Cape Town_Athlone |
| 2 | Athlone | Cape Town | -33.966667 | 18.505 | Western Province Cricket Club | -33.973314 | 18.48235 | Athletics & Sports | Cape Town_Athlone |
| 3 | Athlone | Cape Town | -33.966667 | 18.505 | Rondebosch Common | -33.959344 | 18.48484 | Park | Cape Town_Athlone |
| 4 | Athlone | Cape Town | -33.966667 | 18.505 | Starlings | -33.9801 | 18.481972 | Café | Cape Town_Athlone |
