# Data Science Capstone Report

## Introduction/Business Problem

#### The idea
The idea is to analyse neighborhoods of Australia and cluster related neighborhoods together. Neighborhoods in the same cluster can then be compared by their most common venues to support decisions such as: Which neighborhood is better suited to a new restaurant? Which neighborhoods prefer cafes, cinemas or even nightclubs?

This will help a business decide where to set up shop, because it shows which neighborhoods are likely to favour that kind of business. It can also reveal gaps: for example, if a neighborhood has only restaurants and cafes but no nightclubs, the owner of a nightclub chain could open one there.

## Data Section

### The data used for the project comes from the page https://en.wikipedia.org/wiki/Postcodes_in_Australia
### It contains two tables:
    1. A table that contains the states/territories along with their abbreviations
    2. A table containing the postcodes along with locality and state/territory abbreviations
Both tables will be joined on their common column, the abbreviation, so that it is easier for the user to see which state/territory each locality belongs to.

### This section will contain the following parts:
    1. Extracting the tables from the Wikipedia page
    2. Cleaning of extracted data
    3. Converting the data into dataframes
    4. Combining dataframes to get a final dataframe
    5. Using the geopy geocoders package to get the latitude and longitude for each postcode
    6. Adding the coordinates to the dataframe

#### The first three steps are combined

### 1, 2 and 3. Extracting, Cleaning and Converting data from Wikipedia page

In [2]:
#getting the wikipedia page
import requests
html = requests.get('https://en.wikipedia.org/wiki/Postcodes_in_Australia').text

#Using Beautiful Soup to parse the HTML of the page
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")

In [3]:
#extract the tables from the wikipedia page
table = soup.find_all('table',{'class':'wikitable'})

In [4]:
#Table one: contains abbreviations along with their expanded forms
print(table[0])

<table class="wikitable">
<tbody><tr>
<th>State/Territory</th>
<th>Abbreviation</th>
<th>Postcode range
</th></tr>
<tr>
<td><a href="/wiki/New_South_Wales" title="New South Wales">New South Wales</a></td>
<td>NSW</td>
<td>1000—1999 <i>(LVRs and PO Boxes only)</i><br/>2000—2599<br/>2619—2899<br/>2921—2999
</td></tr>
<tr>
<td><a href="/wiki/Australian_Capital_Territory" title="Australian Capital Territory">Australian Capital Territory</a>
</td>
<td>ACT</td>
<td>0200—0299 <i>(LVRs and PO Boxes only)</i><br/>2600—2618<br/>2900—2920
</td></tr>
<tr>
<td><a href="/wiki/Victoria_(Australia)" title="Victoria (Australia)">Victoria</a>
</td>
<td>VIC</td>
<td>3000—3999<br/>8000—8999 <i>(LVRs and PO Boxes only)</i>
</td></tr>
<tr>
<td><a href="/wiki/Queensland" title="Queensland">Queensland</a></td>
<td>QLD</td>
<td>4000—4999<br/>9000—9999 <i>(LVRs and PO Boxes only)</i>
</td></tr>
<tr>
<td><a href="/wiki/South_Australia" title="South Australia">South Australia</a></td>
<td>SA</td>
<td>5000—5799<br

In [5]:
#extract the contents of the table one
table1 = table[0].find_all('td')

In [6]:
# creating empty dataframe
import pandas as pd
column_name = ['State/Territory','Abbreviation']
df = pd.DataFrame(columns=column_name)
df

Unnamed: 0,State/Territory,Abbreviation


In [7]:
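#populating the first dataframe
import re
i = 0
while i < len(table1):
    # state/territory name; some cells end with a newline before </td>
    st = re.findall(r'<td>(.*?)</td>',str(table1[i]))
    if len(st) == 0:
        st = re.findall(r'<td>(.*?\n)</td>',str(table1[i]))
    # the name is usually wrapped in a link, so take the link text
    if '<a href' in st[0]:
        st1 = re.findall(r'>(.*?)</a>',st[0])
        if len(st1) == 0:
            st1 = re.findall(r'>(.*?\n)</a>',st[0])
        st2 = st1[0]
    else:
        st2 = st[0]

    # the abbreviation is the second cell of the row
    abb = re.findall(r'<td>(.*?)</td>',str(table1[i+1]))
    a = abb[0]

    row = [st2,a]
    df.loc[len(df)] = row
    i+=3  # each table row has three cells; the postcode range is skipped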

In [8]:
#Checking the dataframe
df

Unnamed: 0,State/Territory,Abbreviation
0,New South Wales,NSW
1,Australian Capital Territory,ACT
2,Victoria,VIC
3,Queensland,QLD
4,South Australia,SA
5,Western Australia,WA
6,Tasmania,TAS
7,Northern Territory,NT
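
The regular expressions above work on the raw HTML of each cell. As an alternative sketch, BeautifulSoup's `get_text` returns the visible text of a cell directly, whether or not the cell contains a link or a trailing newline. The HTML snippet and the helper name `rows_from_table` are illustrative assumptions and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# A small stand-in for the first Wikipedia table (illustrative only).
sample_html = """
<table class="wikitable">
<tr><th>State/Territory</th><th>Abbreviation</th><th>Postcode range</th></tr>
<tr><td><a href="/wiki/New_South_Wales">New South Wales</a></td><td>NSW</td><td>2000-2599</td></tr>
<tr><td><a href="/wiki/Victoria_(Australia)">Victoria</a>
</td><td>VIC</td><td>3000-3999</td></tr>
</table>
"""

def rows_from_table(table, n_columns):
    """Return the text of the first n_columns cells of every data row."""
    rows = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        if len(cells) < n_columns:
            continue  # header rows contain <th> cells only
        rows.append([cell.get_text(strip=True) for cell in cells[:n_columns]])
    return rows

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table', {'class': 'wikitable'})
state_rows = rows_from_table(sample_table, 2)
print(state_rows)
```

The resulting list of rows could be passed to `pd.DataFrame(state_rows, columns=column_name)` instead of appending one row at a time.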


In [9]:
#Table two: contains postcode, Locality and abbreviations
print(table[1])

<table class="wikitable">
<tbody><tr>
<th>Postcode</th>
<th>Locality</th>
<th>State derived from<br/>Postcode ranges</th>
<th>Actual State<br/>for this locality
</th></tr>
<tr>
<td>4825</td>
<td>ALPURRURULAM</td>
<td>QLD</td>
<td>NT
</td></tr>
<tr>
<td>0872</td>
<td>ERNABELLA</td>
<td>NT</td>
<td>SA
</td></tr>
<tr>
<td>0872</td>
<td>FREGON</td>
<td>NT</td>
<td>SA
</td></tr>
<tr>
<td>0872</td>
<td>INDULKANA</td>
<td>NT</td>
<td>SA
</td></tr>
<tr>
<td>0872</td>
<td>MIMILI</td>
<td>NT</td>
<td>SA
</td></tr>
<tr>
<td>0872</td>
<td>NGAANYATJARRA-GILES</td>
<td>NT</td>
<td>WA
</td></tr>
<tr>
<td>0872</td>
<td>GIBSON DESERT NORTH</td>
<td>NT</td>
<td>WA
</td></tr>
<tr>
<td>0872</td>
<td>GIBSON DESERT SOUTH</td>
<td>NT</td>
<td>WA
</td></tr>
<tr>
<td>2406</td>
<td>MUNGINDI</td>
<td>NSW</td>
<td>QLD
</td></tr>
<tr>
<td>2540</td>
<td>HMAS CRESWELL</td>
<td>NSW</td>
<td><a href="/wiki/Jervis_Bay_Territory" title="Jervis Bay Territory">Jervis Bay Territory</a>
</td></tr>
<tr>
<td>2540</td>
<td>JER

In [10]:
#extract the contents of the table two
table2 = table[1].find_all('td')

In [11]:
# creating empty dataframe
column_names = ['PostCode','Neighborhood','Abbreviation']
df1 = pd.DataFrame(columns=column_names)
df1

Unnamed: 0,PostCode,Neighborhood,Abbreviation


In [12]:
#populating the second dataframe
i = 0
while i < len(table2):
    postcode = re.findall(r'<td>(.*?)</td>',str(table2[i]))
    pc = postcode[0]
    locality = re.findall(r'<td>(.*?)</td>',str(table2[i+1]))
    loc = locality[0]
    # third cell: state derived from the postcode range
    abbrev = re.findall(r'<td>(.*?)</td>',str(table2[i+2]))
    abb = abbrev[0]

    row = [pc,loc,abb]
    df1.loc[len(df1)] = row
    i+=4  # each table row has four cells; the fourth (actual state) is skipped

In [13]:
#Checking the dataframe
df1

Unnamed: 0,PostCode,Neighborhood,Abbreviation
0,4825,ALPURRURULAM,QLD
1,872,ERNABELLA,NT
2,872,FREGON,NT
3,872,INDULKANA,NT
4,872,MIMILI,NT
5,872,NGAANYATJARRA-GILES,NT
6,872,GIBSON DESERT NORTH,NT
7,872,GIBSON DESERT SOUTH,NT
8,2406,MUNGINDI,NSW
9,2540,HMAS CRESWELL,NSW
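
Australian postcodes have four digits, and the Wikipedia table lists 0872 where the dataframe output above shows 872. Whether the notebook's own column is affected depends on how it is stored, but if the postcode is ever held or re-read as a number the leading zero disappears. A minimal sketch of keeping it as zero-padded text, using a toy dataframe (the name `postcodes` is an assumption for illustration):

```python
import pandas as pd

# Toy rows in the same shape as df1; the values are copied from the table above.
postcodes = pd.DataFrame({
    'PostCode': [872, 2406, 4825],  # as they look once a leading zero is lost
    'Neighborhood': ['ERNABELLA', 'MUNGINDI', 'ALPURRURULAM'],
})

# Store the postcode as text and left-pad it to four digits.
postcodes['PostCode'] = postcodes['PostCode'].astype(str).str.zfill(4)
print(postcodes['PostCode'].tolist())
```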


#### Note: for the second table we have ignored the fourth column, which also contains abbreviations. This avoids confusion, and it also keeps the rows whose fourth column reads 'Jervis Bay Territory': that value is not present in the first dataframe, so those rows would be lost when the dataframes are merged.
#### We therefore use the 3rd column only, so that all the rows are present and no data is lost
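
The effect described in the note can be checked before merging. The sketch below uses two toy dataframes (`states` and `localities` are illustrative names, with values taken from the tables above) and a left join with `indicator=True`, which marks rows that found no matching abbreviation instead of silently dropping them.

```python
import pandas as pd

states = pd.DataFrame({
    'State/Territory': ['New South Wales', 'Queensland'],
    'Abbreviation': ['NSW', 'QLD'],
})
localities = pd.DataFrame({
    'PostCode': ['2406', '2540'],
    'Neighborhood': ['MUNGINDI', 'HMAS CRESWELL'],
    # fourth-column values from the Wikipedia table; the second has no match in `states`
    'Abbreviation': ['QLD', 'Jervis Bay Territory'],
})

# A left join keeps every locality; `_merge` records whether a state was found.
checked = pd.merge(localities, states, on='Abbreviation', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['PostCode', 'Neighborhood', 'Abbreviation']])
```

With the default inner join used in this report, the unmatched row would simply be absent from the result.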

### 4. Combining dataframes into a final dataframe

In [14]:
final_df = pd.merge(df,df1,on="Abbreviation")
del final_df['Abbreviation']
final_df.head()

Unnamed: 0,State/Territory,PostCode,Neighborhood
0,New South Wales,2406,MUNGINDI
1,New South Wales,2540,HMAS CRESWELL
2,New South Wales,2540,JERVIS BAY
3,New South Wales,2620,HUME
4,New South Wales,2620,KOWEN FOREST


In [15]:
# Grouping the Neighborhoods according to Postal Code
final_df = final_df.groupby(['PostCode','State/Territory'])['Neighborhood'].apply(', '.join).reset_index()
final_df

Unnamed: 0,PostCode,State/Territory,Neighborhood
0,872,Northern Territory,"ERNABELLA, FREGON, INDULKANA, MIMILI, NGAANYAT..."
1,2406,New South Wales,MUNGINDI
2,2540,New South Wales,"HMAS CRESWELL, JERVIS BAY"
3,2611,Australian Capital Territory,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA"
4,2620,New South Wales,"HUME, KOWEN FOREST, OAKS ESTATE, THARWA, TOP NAAS"
5,3500,Victoria,PARINGI
6,3585,Victoria,MURRAY DOWNS
7,3586,Victoria,MALLAN
8,3644,Victoria,"BAROOGA, LALALTY"
9,3691,Victoria,LAKE HUME VILLAGE


### 5. Using the geopy geocoders package to get the latitude and longitude of each postcode

In [16]:
#Importing necessary packages
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
import geopy.geocoders 


In [17]:
#Calculating latitude and longitude of each row
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="aus_explorer")  # user_agent is an arbitrary application name
place = 'Australia'
latitude = []
longitude = []
for index, rows in final_df.iterrows():
    pc = str(rows['PostCode'])
    addr = pc+','+place
    location = geolocator.geocode(addr)
    lat = location.latitude
    lng = location.longitude
    latitude.append(lat)
    longitude.append(lng)
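
The loop above assumes every postcode can be geocoded. geopy geocoders return `None` when nothing is found, and the public Nominatim service asks for no more than one request per second. The sketch below shows one way to handle both; the helper name `geocode_postcodes` and the stand-in `fake_geocode` are assumptions made so that the example runs without network access.

```python
import time
from collections import namedtuple

def geocode_postcodes(postcodes, geocode, country='Australia', pause_seconds=1.0):
    """Return (latitudes, longitudes); entries are None when nothing is found.

    `geocode` is any callable that takes an address string and returns an
    object with `.latitude` and `.longitude`, or None.
    """
    latitudes, longitudes = [], []
    for postcode in postcodes:
        location = geocode('{},{}'.format(postcode, country))
        if location is None:
            latitudes.append(None)
            longitudes.append(None)
        else:
            latitudes.append(location.latitude)
            longitudes.append(location.longitude)
        time.sleep(pause_seconds)  # pause between requests to respect the rate limit
    return latitudes, longitudes

# Stand-in geocoder for demonstration; it knows a single postcode.
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def fake_geocode(address):
    known = {'2406,Australia': FakeLocation(-29.030752, 149.188191)}
    return known.get(address)

lats, lngs = geocode_postcodes(['2406', '9999'], fake_geocode, pause_seconds=0)
print(lats, lngs)
```

In the notebook, the same helper could be called with `final_df['PostCode']` and `geolocator.geocode` in place of the stand-ins.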



### 6. Adding coordinates to the dataframe

In [18]:
#assigning the latitude and longitude to the dataframe
final_df['Latitude'] = latitude
final_df['Longitude'] = longitude

In [19]:
#Final dataframe
final_df

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude
0,872,Northern Territory,"ERNABELLA, FREGON, INDULKANA, MIMILI, NGAANYAT...",-25.719898,131.957835
1,2406,New South Wales,MUNGINDI,-29.030752,149.188191
2,2540,New South Wales,"HMAS CRESWELL, JERVIS BAY",-34.983059,150.603134
3,2611,Australian Capital Territory,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",-35.316175,149.010503
4,2620,New South Wales,"HUME, KOWEN FOREST, OAKS ESTATE, THARWA, TOP NAAS",-35.277574,149.236242
5,3500,Victoria,PARINGI,-34.19609,142.14254
6,3585,Victoria,MURRAY DOWNS,-35.342298,143.558281
7,3586,Victoria,MALLAN,-35.402222,143.654334
8,3644,Victoria,"BAROOGA, LALALTY",-35.911559,145.671953
9,3691,Victoria,LAKE HUME VILLAGE,-36.162044,146.961706


#### This will be the final dataset that we will use to analyse the neighborhoods

## Methodology Section

#### In this section, we will analyse our dataset and perform clustering analysis in order to answer the questions posed in the Introduction/Business Problem section.

In [20]:
#Getting the dataset used in Data section
aus_data = final_df
aus_data

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude
0,872,Northern Territory,"ERNABELLA, FREGON, INDULKANA, MIMILI, NGAANYAT...",-25.719898,131.957835
1,2406,New South Wales,MUNGINDI,-29.030752,149.188191
2,2540,New South Wales,"HMAS CRESWELL, JERVIS BAY",-34.983059,150.603134
3,2611,Australian Capital Territory,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",-35.316175,149.010503
4,2620,New South Wales,"HUME, KOWEN FOREST, OAKS ESTATE, THARWA, TOP NAAS",-35.277574,149.236242
5,3500,Victoria,PARINGI,-34.19609,142.14254
6,3585,Victoria,MURRAY DOWNS,-35.342298,143.558281
7,3586,Victoria,MALLAN,-35.402222,143.654334
8,3644,Victoria,"BAROOGA, LALALTY",-35.911559,145.671953
9,3691,Victoria,LAKE HUME VILLAGE,-36.162044,146.961706


In [21]:
aus_data['State/Territory'].unique()

array(['Northern Territory', 'New South Wales',
       'Australian Capital Territory', 'Victoria', 'Queensland'], dtype=object)

In [22]:
#Getting the coordinates of Australia
address = 'Australia'
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="aus_explorer")  # user_agent is an arbitrary application name
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Australia are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Australia are -24.7761086, 134.755.


In [23]:
#Getting folium package
!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium


In [25]:
# create map of Australia using latitude and longitude values
map_aus = folium.Map(location=[latitude, longitude], zoom_start=4)

# add markers to map
for lat, lng, state, neighborhood in zip(aus_data['Latitude'], aus_data['Longitude'], aus_data['State/Territory'], aus_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, state)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_aus)  
    
map_aus

In [26]:
# The code was removed by Watson Studio for sharing.
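
The hidden cell presumably defines `CLIENT_ID`, `CLIENT_SECRET` and `VERSION`, since the venue request below uses those names. One way to keep such values out of a shared notebook is to read them from environment variables. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET` and `FOURSQUARE_VERSION`, the helper name, and the default version date are all assumptions.

```python
import os

def load_foursquare_credentials(environ=os.environ):
    """Read the Foursquare settings from environment variables."""
    required = ('FOURSQUARE_CLIENT_ID', 'FOURSQUARE_CLIENT_SECRET')
    missing = [key for key in required if not environ.get(key)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return (environ['FOURSQUARE_CLIENT_ID'],
            environ['FOURSQUARE_CLIENT_SECRET'],
            # placeholder date in YYYYMMDD form, used when no version is set
            environ.get('FOURSQUARE_VERSION', '20180605'))

# Demonstration with a plain dictionary instead of the real environment.
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials(demo_env)
print(CLIENT_ID, VERSION)
```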

In [27]:
def get_category_type(row):
    # the category list is stored under one of two keys, depending on the response shape
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [28]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
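
`getNearbyVenues` reads `v['venue']['categories'][0]['name']`, which raises an `IndexError` if a venue comes back with an empty category list. The sketch below builds the same seven-field rows but records `None` in that case. The helper name `venue_rows` and the sample items are hypothetical; the items are hand-written in the shape the function above reads and are not real API output.

```python
def venue_rows(name, lat, lng, items):
    """Build one row per venue, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written items for illustration only.
sample_items = [
    {'venue': {'name': 'Scope', 'location': {'lat': -35.318698, 'lng': 149.009088},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled venue', 'location': {'lat': -35.3, 'lng': 149.0},
               'categories': []}},
]
rows = venue_rows('URIARRA', -35.316175, 149.010503, sample_items)
print([row[-1] for row in rows])
```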

In [29]:
import requests
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # search radius in metres around each postcode's coordinates
aus_venues = getNearbyVenues(names=aus_data['Neighborhood'],
                                   latitudes=aus_data['Latitude'],
                                   longitudes=aus_data['Longitude'],
                                   radius=radius
                                  )

ERNABELLA, FREGON, INDULKANA, MIMILI, NGAANYATJARRA-GILES, GIBSON DESERT NORTH, GIBSON DESERT SOUTH
MUNGINDI
HMAS CRESWELL, JERVIS BAY
COOLEMAN, BIMBERI, BRINDABELLA, URIARRA
HUME, KOWEN FOREST, OAKS ESTATE, THARWA, TOP NAAS
PARINGI
MURRAY DOWNS
MALLAN
BAROOGA, LALALTY
LAKE HUME VILLAGE
BRINGENBRONG
MARYLAND
MINGOOLA
ALPURRURULAM


In [30]:
print(aus_venues.shape)
aus_venues

(13, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",-35.316175,149.010503,Mt Stromlo Observatory,-35.31872,149.009041,Planetarium
1,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",-35.316175,149.010503,Scope,-35.318698,149.009088,Café
2,PARINGI,-34.19609,142.14254,Subway,-34.197785,142.146798,Sandwich Place
3,PARINGI,-34.19609,142.14254,Takeaway,-34.192816,142.142744,Deli / Bodega
4,PARINGI,-34.19609,142.14254,Pinno's Pizza Pasta Bar,-34.197463,142.14695,Pizza Place
5,PARINGI,-34.19609,142.14254,Coral Sea Fish And Chips,-34.197536,142.147181,Fish & Chips Shop
6,MURRAY DOWNS,-35.342298,143.558281,Jilarty Cafe,-35.341274,143.560255,Café
7,MURRAY DOWNS,-35.342298,143.558281,KFC,-35.343087,143.560441,Fast Food Restaurant
8,MURRAY DOWNS,-35.342298,143.558281,Quo Vadis,-35.341676,143.560437,Italian Restaurant
9,MURRAY DOWNS,-35.342298,143.558281,The 202 Cafe,-35.340252,143.558282,Café


In [31]:
aus_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"BAROOGA, LALALTY",2,2,2,2,2,2
"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",2,2,2,2,2,2
MURRAY DOWNS,5,5,5,5,5,5
PARINGI,4,4,4,4,4,4


In [32]:
print('There are {} unique categories.'.format(len(aus_venues['Venue Category'].unique())))

There are 10 unique categories.


In [33]:
# one hot encoding
aus_onehot = pd.get_dummies(aus_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
aus_onehot['Neighborhood'] = aus_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [aus_onehot.columns[-1]] + list(aus_onehot.columns[:-1])
aus_onehot = aus_onehot[fixed_columns]

aus_onehot

Unnamed: 0,Neighborhood,Beach,Café,Deli / Bodega,Discount Store,Fast Food Restaurant,Fish & Chips Shop,Italian Restaurant,Pizza Place,Planetarium,Sandwich Place
0,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",0,0,0,0,0,0,0,0,1,0
1,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",0,1,0,0,0,0,0,0,0,0
2,PARINGI,0,0,0,0,0,0,0,0,0,1
3,PARINGI,0,0,1,0,0,0,0,0,0,0
4,PARINGI,0,0,0,0,0,0,0,1,0,0
5,PARINGI,0,0,0,0,0,1,0,0,0,0
6,MURRAY DOWNS,0,1,0,0,0,0,0,0,0,0
7,MURRAY DOWNS,0,0,0,0,1,0,0,0,0,0
8,MURRAY DOWNS,0,0,0,0,0,0,1,0,0,0
9,MURRAY DOWNS,0,1,0,0,0,0,0,0,0,0


In [34]:
aus_grouped = aus_onehot.groupby('Neighborhood').mean().reset_index()
aus_grouped

Unnamed: 0,Neighborhood,Beach,Café,Deli / Bodega,Discount Store,Fast Food Restaurant,Fish & Chips Shop,Italian Restaurant,Pizza Place,Planetarium,Sandwich Place
0,"BAROOGA, LALALTY",0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0
2,MURRAY DOWNS,0.0,0.4,0.0,0.2,0.2,0.0,0.2,0.0,0.0,0.0
3,PARINGI,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.25,0.0,0.25
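
Each value in the table above is the share of a venue category among the venues found for that neighborhood. As a cross-check, the same shares can be computed in one step with `pd.crosstab` and `normalize='index'`. This is an alternative sketch on a few rows copied from `aus_venues`, not the notebook's own method; the dataframe name `venues` is an assumption.

```python
import pandas as pd

# Six rows copied from aus_venues (neighborhood names shortened for readability).
venues = pd.DataFrame({
    'Neighborhood': ['COOLEMAN', 'COOLEMAN', 'PARINGI', 'PARINGI', 'PARINGI', 'PARINGI'],
    'Venue Category': ['Planetarium', 'Café', 'Sandwich Place', 'Deli / Bodega',
                       'Pizza Place', 'Fish & Chips Shop'],
})

# Count venues per (neighborhood, category) and divide each row by its total.
shares = pd.crosstab(venues['Neighborhood'], venues['Venue Category'], normalize='index')
print(shares.round(2))
```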


In [35]:
num_top_venues = 5

for hood in aus_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = aus_grouped[aus_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----BAROOGA, LALALTY----
                  venue  freq
0                 Beach   0.5
1                  Café   0.5
2         Deli / Bodega   0.0
3        Discount Store   0.0
4  Fast Food Restaurant   0.0


----COOLEMAN, BIMBERI, BRINDABELLA, URIARRA----
            venue  freq
0            Café   0.5
1     Planetarium   0.5
2           Beach   0.0
3   Deli / Bodega   0.0
4  Discount Store   0.0


----MURRAY DOWNS----
                  venue  freq
0                  Café   0.4
1        Discount Store   0.2
2  Fast Food Restaurant   0.2
3    Italian Restaurant   0.2
4                 Beach   0.0


----PARINGI----
               venue  freq
0      Deli / Bodega  0.25
1  Fish & Chips Shop  0.25
2        Pizza Place  0.25
3     Sandwich Place  0.25
4              Beach  0.00




In [36]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [37]:
import numpy as np

In [38]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = aus_grouped['Neighborhood']

for ind in np.arange(aus_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(aus_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"BAROOGA, LALALTY",Café,Beach,Sandwich Place,Planetarium,Pizza Place,Italian Restaurant,Fish & Chips Shop,Fast Food Restaurant,Discount Store,Deli / Bodega
1,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",Planetarium,Café,Sandwich Place,Pizza Place,Italian Restaurant,Fish & Chips Shop,Fast Food Restaurant,Discount Store,Deli / Bodega,Beach
2,MURRAY DOWNS,Café,Italian Restaurant,Fast Food Restaurant,Discount Store,Sandwich Place,Planetarium,Pizza Place,Fish & Chips Shop,Deli / Bodega,Beach
3,PARINGI,Sandwich Place,Pizza Place,Fish & Chips Shop,Deli / Bodega,Planetarium,Italian Restaurant,Fast Food Restaurant,Discount Store,Café,Beach
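
The ranking above always fills ten positions, so categories with a frequency of 0.0 still appear (for example, Sandwich Place is listed third for BAROOGA, LALALTY although its frequency is 0.0), and the order among tied frequencies is left to the sort. A hedged sketch of a stricter ranking that drops absent categories and breaks ties alphabetically follows; the helper name `top_categories` is an assumption.

```python
import pandas as pd

def top_categories(freq_row, num_top_venues):
    """Rank categories by frequency, dropping those that never occur."""
    present = freq_row[freq_row > 0]
    # Sort by descending frequency, then by name so that ties are deterministic.
    ranked = sorted(present.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked][:num_top_venues]

# Frequencies copied from the BAROOGA, LALALTY row of aus_grouped (five categories shown).
barooga = pd.Series({'Beach': 0.5, 'Café': 0.5, 'Deli / Bodega': 0.0,
                     'Discount Store': 0.0, 'Planetarium': 0.0})
print(top_categories(barooga, 10))
```

A neighborhood with few venues then gets a short list rather than ten entries, which makes it clearer how much evidence each ranking rests on.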


#### Creating clusters of the neighborhoods

In [42]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 3

aus_grouped_clustering = aus_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(aus_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 2, 2, 1], dtype=int32)

In [43]:
aus_merged = aus_data

# merge aus_data with neighborhoods_venues_sorted to add the most common venues for each neighborhood
aus_merged = pd.merge(aus_merged,neighborhoods_venues_sorted,on='Neighborhood')

# add clustering labels
aus_merged['Cluster Labels'] = kmeans.labels_

aus_merged 

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,2611,Australian Capital Territory,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",-35.316175,149.010503,Planetarium,Café,Sandwich Place,Pizza Place,Italian Restaurant,Fish & Chips Shop,Fast Food Restaurant,Discount Store,Deli / Bodega,Beach,0
1,3500,Victoria,PARINGI,-34.19609,142.14254,Sandwich Place,Pizza Place,Fish & Chips Shop,Deli / Bodega,Planetarium,Italian Restaurant,Fast Food Restaurant,Discount Store,Café,Beach,2
2,3585,Victoria,MURRAY DOWNS,-35.342298,143.558281,Café,Italian Restaurant,Fast Food Restaurant,Discount Store,Sandwich Place,Planetarium,Pizza Place,Fish & Chips Shop,Deli / Bodega,Beach,2
3,3644,Victoria,"BAROOGA, LALALTY",-35.911559,145.671953,Café,Beach,Sandwich Place,Planetarium,Pizza Place,Italian Restaurant,Fish & Chips Shop,Fast Food Restaurant,Discount Store,Deli / Bodega,1
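
One point to check in this step: `kmeans.labels_` follows the row order of the data passed to `fit`, here `aus_grouped`, which is sorted by neighborhood name. The merged frame follows the order of `aus_data`, and in the outputs above the two orders differ, so assigning the labels by position after the merge may pair a label with a different neighborhood. A minimal sketch of attaching the labels by name before merging, using toy data (all names and label values below are hypothetical):

```python
import pandas as pd

# Toy stand-ins: `grouped` is sorted by neighborhood, `data` keeps another order.
grouped = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = [0, 2, 1]  # hypothetical cluster labels, in the row order of `grouped`
data = pd.DataFrame({'Neighborhood': ['C', 'A', 'B'],
                     'PostCode': ['3644', '2611', '3500']})

# Attach each label to its neighborhood first, then merge on the name.
labelled = grouped.assign(**{'Cluster Labels': labels})
merged = pd.merge(data, labelled, on='Neighborhood')
print(merged)
```

Because the join is made on `Neighborhood`, each row keeps the label that was computed for it regardless of row order.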


### Visualising the clusters

In [46]:
# create map
import matplotlib.cm as cm
import matplotlib.colors as colors
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(aus_merged['Latitude'], aus_merged['Longitude'], aus_merged['Neighborhood'], aus_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Analysing the clusters

In [47]:
#Cluster one
aus_merged[aus_merged['Cluster Labels'] == 0]

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,2611,Australian Capital Territory,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",-35.316175,149.010503,Planetarium,Café,Sandwich Place,Pizza Place,Italian Restaurant,Fish & Chips Shop,Fast Food Restaurant,Discount Store,Deli / Bodega,Beach,0


In [48]:
#Cluster two
aus_merged[aus_merged['Cluster Labels'] == 1]

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
3,3644,Victoria,"BAROOGA, LALALTY",-35.911559,145.671953,Café,Beach,Sandwich Place,Planetarium,Pizza Place,Italian Restaurant,Fish & Chips Shop,Fast Food Restaurant,Discount Store,Deli / Bodega,1


In [49]:
#Cluster three
aus_merged[aus_merged['Cluster Labels'] == 2]

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
1,3500,Victoria,PARINGI,-34.19609,142.14254,Sandwich Place,Pizza Place,Fish & Chips Shop,Deli / Bodega,Planetarium,Italian Restaurant,Fast Food Restaurant,Discount Store,Café,Beach,2
2,3585,Victoria,MURRAY DOWNS,-35.342298,143.558281,Café,Italian Restaurant,Fast Food Restaurant,Discount Store,Sandwich Place,Planetarium,Pizza Place,Fish & Chips Shop,Deli / Bodega,Beach,2


#### We have used K-means clustering to group similar neighborhoods together. The venue categories played a role in determining which neighborhoods are similar. We can now use this to make decisions.

## Result Section

#### Let us take a look at Cluster 1

In [50]:
#Cluster one
aus_merged[aus_merged['Cluster Labels'] == 0]

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,2611,Australian Capital Territory,"COOLEMAN, BIMBERI, BRINDABELLA, URIARRA",-35.316175,149.010503,Planetarium,Café,Sandwich Place,Pizza Place,Italian Restaurant,Fish & Chips Shop,Fast Food Restaurant,Discount Store,Deli / Bodega,Beach,0


##### The planetarium ranks as the most common venue in this neighborhood, followed by food places

#### Let us take a look at Cluster 2

In [51]:
#Cluster two
aus_merged[aus_merged['Cluster Labels'] == 1]

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
3,3644,Victoria,"BAROOGA, LALALTY",-35.911559,145.671953,Café,Beach,Sandwich Place,Planetarium,Pizza Place,Italian Restaurant,Fish & Chips Shop,Fast Food Restaurant,Discount Store,Deli / Bodega,1


##### This neighborhood features the beach and has a mix of activity areas and food places in its top 5

#### Let us take a look at Cluster 3

In [52]:
#Cluster three
aus_merged[aus_merged['Cluster Labels'] == 2]

Unnamed: 0,PostCode,State/Territory,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
1,3500,Victoria,PARINGI,-34.19609,142.14254,Sandwich Place,Pizza Place,Fish & Chips Shop,Deli / Bodega,Planetarium,Italian Restaurant,Fast Food Restaurant,Discount Store,Café,Beach,2
2,3585,Victoria,MURRAY DOWNS,-35.342298,143.558281,Café,Italian Restaurant,Fast Food Restaurant,Discount Store,Sandwich Place,Planetarium,Pizza Place,Fish & Chips Shop,Deli / Bodega,Beach,2


##### These neighborhoods favour food: food places fill the top 3 positions in both

## Discussion Section

Since the Foursquare API did not return venues for all the neighborhoods, we were left with limited data: only 4 of the 14 postcode groups queried returned any venues, 13 venues in total.
The project can be improved by gathering more location data, which should also lead to more venues being returned by Foursquare.

Still, with the data available, we have managed to identify the food-oriented places and the activity-oriented places. With more data, we would be able to cover more venue categories such as banks, markets and fairs.

## Conclusion

After examining the clusters, we can conclude the following:
  1. Paringi and Murray Downs are appropriate for foodies, as their top three venues are all food places
  2. The neighborhoods in the Australian Capital Territory suit anyone who loves the planets and also likes to visit food places
  3. The Barooga, Lalalty neighborhood is for beach lovers

Decisions can be made based on these conclusions.