# Coursera IBM Data Science Capstone

## Analysis of the districts of Osnabrück
Segmenting the city of Osnabrück into districts using the geographical coordinates of the center of each district, and then clustering the districts with a combination of location data and machine learning.

## Introduction/Data/Methodology

Osnabrück is a city in the federal state of Lower Saxony in north-west Germany. It is situated in a valley penned between the Wiehen Hills and the northern tip of the Teutoburg Forest. With a population of 168,145, Osnabrück is one of the four largest cities in Lower Saxony. The city is divided into 23 districts.
This work gives an overview of the similarities between the districts, so that business owners get more information on where to start a business and private individuals get help when they are looking for a new place to live.

The districts and their postal codes are obtained from the internet by web scraping and associated with the venues for each area, which can be queried via the Foursquare API. Based on the available data, the city districts are grouped into postcode districts. The pgeocode library provides the latitude and longitude of each postcode district, so that this information can be sent to the Foursquare API. The responses from Foursquare contain the venues for every postcode district.

In terms of methodology, this data analysis differs significantly from the traditional statistical approach of experimental design, in which an experiment is designed first and the data is collected as its result. Here, I start from the data that is already available from Foursquare, and the objective is to extract new information from this existing dataset. The cells below show how I cleaned and preprocessed the data to obtain a usable dataset.

I decided to use the unsupervised machine learning algorithm k-means for the further analysis. This method groups similar data points together and reveals underlying patterns. A cluster generated this way is a collection of data points aggregated because of certain similarities. In this case, districts are grouped into a cluster because they have similar venues. The insights gained from this algorithm offer the users of this program an additional basis for their decisions.
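
To make the idea concrete, here is a minimal plain-Python sketch of the k-means loop (Lloyd's algorithm) on four made-up 2-D points. It is an illustration only: the notebook itself uses scikit-learn's `KMeans`, and the function name `kmeans_sketch`, the toy points and the fixed starting centroids are assumptions chosen for readability.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_sketch(points, centroids, iterations=10):
    """Tiny k-means loop: assign points to the nearest centroid, then re-average."""
    labels = []
    for _ in range(iterations):
        # assignment step: index of the closest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# toy data (hypothetical): two obvious groups
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans_sketch(toy_points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

In the analysis below, each "point" is a postcode district and each coordinate is the relative frequency of one venue category.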

In [33]:
# importing libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Get the names and postcodes of the districts of Osnabrück

In [34]:
source = requests.get('https://www.suche-postleitzahl.org/plz-gebiet/490').text
soup = BeautifulSoup(source, 'lxml')
# print(soup.prettify())
table = soup.find('table')
# print(table)
table_elements = table.find_all('td')
# print(table_elements)

In [35]:
# building lists with list comprehensions

# add every second element to the list postcode, starting from element 0
postcode = [table_elements[x].text for x in range(0, len(table_elements), 2)]

# add every second element to the list district, starting from element 1
district = [table_elements[x].text for x in range(1, len(table_elements), 2)]
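
The same split can be written with slice steps, which also makes it easy to check that the cells really pair up. The sketch below works on a hand-written list of cell texts rather than on the scraped page; the list `cells` is a hypothetical stand-in for `[td.text for td in table_elements]`.

```python
# hypothetical flattened cell texts, alternating postcode / district
cells = ["49074", "Innenstadt", "49076", "Atter", "49076", "Westerberg"]

postcodes = cells[0::2]  # elements 0, 2, 4, ...
districts = cells[1::2]  # elements 1, 3, 5, ...

# a malformed table would leave the two lists with different lengths
assert len(postcodes) == len(districts), "table cells do not pair up"

pairs = list(zip(postcodes, districts))
print(pairs)
# [('49074', 'Innenstadt'), ('49076', 'Atter'), ('49076', 'Westerberg')]
```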

In [36]:
#checking the length of the lists
len_post = len(postcode)
len_dis = len(district)
print(f'len postcode= {len_post}, len district= {len_dis}')

# according to Wikipedia there are 23 districts in Osnabrück

# first let's create a pandas DataFrame
# for that we need a dictionary:
data = {'Postcode': postcode, 'District': district}
# and here comes the dataframe:
df = pd.DataFrame(data)
print('number of unique postcodes: ', len(df.Postcode.unique()))
print('number of unique districts: ', len(df.District.unique()))
print()
df.info()
print()
df.sort_values(by=['District']).head()

len postcode= 26, len district= 26
number of unique postcodes:  9
number of unique districts:  23

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 2 columns):
Postcode    26 non-null object
District    26 non-null object
dtypes: object(2)
memory usage: 496.0+ bytes



Unnamed: 0,Postcode,District
2,49076,Atter
15,49086,Darum/Gretesch/Lüstringen
19,49088,Dodesheide
22,49090,Eversburg
13,49084,Fledder


In [37]:
# drop the duplicates in the district column
df = df.drop_duplicates(subset='District', keep='first')

print('number of unique postcodes in df: ', len(df.Postcode.unique()))
print('number of unique districts in df: ', len(df.District.unique()))
print()
# for the Foursquare query we will need the names of the districts
# for the analysis of the data we can later group the data by postcode
df.info()
df.head()

number of unique postcodes in df:  9
number of unique districts in df:  23

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 0 to 25
Data columns (total 2 columns):
Postcode    23 non-null object
District    23 non-null object
dtypes: object(2)
memory usage: 552.0+ bytes


Unnamed: 0,Postcode,District
0,49074,Gartlage
1,49074,Innenstadt
2,49076,Atter
3,49076,Westerberg
4,49076,Weststadt


### Get the venues for each district from the Foursquare API

In [38]:
# @hidden_cell
# Define Foursquare Credentials and Version
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # Foursquare ID (placeholder, insert your own)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # Foursquare Secret (placeholder, insert your own)
VERSION = '20180605'

In [39]:
# group the df dataframe by postcode to get the input required by the
# pgeocode library, which returns the latitude and longitude for every postcode
df_postcode = df.groupby(['Postcode'])['District'].apply(', '.join).reset_index()
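
For readers less familiar with `groupby(...).apply(', '.join)`, the following plain-Python sketch spells out what the line above does: it collects all district names that share a postcode and joins them into one string. The `rows` list is a small subset of the table shown earlier, and the variable names are illustrative.

```python
from collections import defaultdict

# (postcode, district) rows; a subset of the table above
rows = [
    ("49074", "Gartlage"),
    ("49074", "Innenstadt"),
    ("49076", "Atter"),
    ("49076", "Westerberg"),
    ("49076", "Weststadt"),
]

# collect the district names per postcode, keeping their original order
districts_by_postcode = defaultdict(list)
for postcode, district in rows:
    districts_by_postcode[postcode].append(district)

# one comma-separated string per postcode, like ', '.join in the groupby
joined = {pc: ", ".join(names) for pc, names in districts_by_postcode.items()}
print(joined)
# {'49074': 'Gartlage, Innenstadt', '49076': 'Atter, Westerberg, Weststadt'}
```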

In [40]:
# get the geodata needed for the Foursquare API
import pgeocode

postcode_list = df_postcode['Postcode'].values.tolist()

nomi = pgeocode.Nominatim('DE')

geodata = nomi.query_postal_code(postcode_list)
df_geodata = geodata[['postal_code', 'latitude', 'longitude']]
df_geodata.info()
df_geodata.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 3 columns):
postal_code    9 non-null object
latitude       9 non-null float64
longitude      9 non-null float64
dtypes: float64(2), object(1)
memory usage: 288.0+ bytes


Unnamed: 0,postal_code,latitude,longitude
0,49074,52.2738,8.0521
1,49076,52.2832,7.9485
2,49078,52.2651,8.0096
3,49080,52.2491,8.0367
4,49082,52.244,8.0613


In [41]:
# combining the dataframes
# rename returns a new dataframe, which avoids pandas' SettingWithCopyWarning
df_geodata = df_geodata.rename(columns={'postal_code': 'Postcode'})
merged_data = pd.merge(df_postcode, df_geodata, on='Postcode')
merged_data.info()
print()
merged_data.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 4 columns):
Postcode     9 non-null object
District     9 non-null object
latitude     9 non-null float64
longitude    9 non-null float64
dtypes: float64(2), object(2)
memory usage: 360.0+ bytes


Unnamed: 0,Postcode,District,latitude,longitude
0,49074,"Gartlage, Innenstadt",52.2738,8.0521
1,49076,"Atter, Westerberg, Weststadt",52.2832,7.9485
2,49078,Hellern,52.2651,8.0096
3,49080,"Kalkhügel, Wüste",52.2491,8.0367
4,49082,"Nahne, Schölerberg, Sutthausen",52.244,8.0613


In [42]:
# a function to get the top 100 venues for each District within a radius of 2500 meters

LIMIT = 100 # limit of number of venues returned by Foursquare API

def getNearbyVenues(names, latitudes, longitudes, radius=2500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['District', 
                  'District Latitude', 
                  'District Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
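
The two fragile spots in a function like this are the hand-formatted URL and the deep indexing into the response. The sketch below shows one possible way to isolate both steps so they can be tried without network access. The helper names `build_explore_url` and `extract_venues`, the credentials `MY_ID` / `MY_SECRET` and the `sample` payload are hypothetical; the payload only imitates the fields the function above reads, and the case of a venue without categories is an assumption about what the API may return.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"  # as in the notebook


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=2500, limit=100):
    """Assemble the request URL from named parameters instead of str.format."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


def extract_venues(district, payload):
    """Flatten a response payload into (district, name, lat, lng, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # assumption: a venue may arrive without any category
        category = categories[0]["name"] if categories else None
        rows.append((district, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# hand-written payload that imitates the structure used in the function above
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Culina",
               "location": {"lat": 52.274368, "lng": 8.050286},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 52.27, "lng": 8.05},
               "categories": []}},
]}]}}

print(build_explore_url(52.2738, 8.0521, "MY_ID", "MY_SECRET"))
print(extract_venues("Gartlage, Innenstadt", sample))
```

Keeping the parsing separate from the HTTP call means an empty or unexpected response yields an empty list instead of an `IndexError` in the middle of the loop.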

In [43]:
osnabrueck_venues = getNearbyVenues(merged_data.District, 
                                    merged_data.latitude, 
                                    merged_data.longitude)

Gartlage, Innenstadt
Atter, Westerberg, Weststadt
Hellern
Kalkhügel, Wüste
Nahne, Schölerberg, Sutthausen
Fledder, Schinkel
Darum/Gretesch/Lüstringen, Lüstringen, Voxtrup, Widukindland
Dodesheide, Sonnenhügel
Eversburg, Hafen, Haste, Pye


In [44]:
print(osnabrueck_venues.shape)
osnabrueck_venues.head()

(423, 7)


Unnamed: 0,District,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Gartlage, Innenstadt",52.2738,8.0521,Culina,52.274368,8.050286,Café
1,"Gartlage, Innenstadt",52.2738,8.0521,Schlossgarten,52.270614,8.044237,Park
2,"Gartlage, Innenstadt",52.2738,8.0521,Tiefenrausch,52.273691,8.044015,Bar
3,"Gartlage, Innenstadt",52.2738,8.0521,L+T Kaufhaus,52.27488,8.046627,Clothing Store
4,"Gartlage, Innenstadt",52.2738,8.0521,Fontanella Eis Cafe,52.276506,8.04209,Ice Cream Shop


In [45]:
# check how many venues are returned for each district
osnabrueck_venues.groupby('District').count()

District,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Atter, Westerberg, Weststadt",10,10,10,10,10,10
"Darum/Gretesch/Lüstringen, Lüstringen, Voxtrup, Widukindland",24,24,24,24,24,24
"Dodesheide, Sonnenhügel",93,93,93,93,93,93
"Eversburg, Hafen, Haste, Pye",28,28,28,28,28,28
"Fledder, Schinkel",76,76,76,76,76,76
"Gartlage, Innenstadt",92,92,92,92,92,92
Hellern,45,45,45,45,45,45
"Kalkhügel, Wüste",27,27,27,27,27,27
"Nahne, Schölerberg, Sutthausen",28,28,28,28,28,28


In [46]:
# Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(osnabrueck_venues['Venue Category'].unique())))

There are 77 unique categories.


### Analyze Each District

In [47]:
# one hot encoding
osnabrueck_onehot = pd.get_dummies(osnabrueck_venues[['Venue Category']], prefix="", prefix_sep="")

# add District column back to dataframe
osnabrueck_onehot['District'] = osnabrueck_venues['District'] 

# move District column to the first column
fixed_columns = [osnabrueck_onehot.columns[-1]] + list(osnabrueck_onehot.columns[:-1])
osnabrueck_onehot = osnabrueck_onehot[fixed_columns]

osnabrueck_onehot.head()

Unnamed: 0,District,Art Gallery,Asian Restaurant,Auto Dealership,BBQ Joint,Bakery,Bar,Big Box Store,Border Crossing,Bowling Alley,...,Soccer Stadium,Spanish Restaurant,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Train Station,Trattoria/Osteria,Wine Bar,Zoo
0,"Gartlage, Innenstadt",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Gartlage, Innenstadt",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Gartlage, Innenstadt",0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Gartlage, Innenstadt",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Gartlage, Innenstadt",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [48]:
osnabrueck_onehot.shape

(423, 78)

In [49]:
# grouping rows by District and by taking the mean of 
# the frequency of occurrence of each category

osnabrueck_grouped = osnabrueck_onehot.groupby('District').mean().reset_index()
osnabrueck_grouped

Unnamed: 0,District,Art Gallery,Asian Restaurant,Auto Dealership,BBQ Joint,Bakery,Bar,Big Box Store,Border Crossing,Bowling Alley,...,Soccer Stadium,Spanish Restaurant,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Train Station,Trattoria/Osteria,Wine Bar,Zoo
0,"Atter, Westerberg, Weststadt",0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.1,0.0,...,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Darum/Gretesch/Lüstringen, Lüstringen, Voxtrup...",0.0,0.041667,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,...,0.041667,0.0,0.0,0.416667,0.0,0.0,0.041667,0.0,0.0,0.0
2,"Dodesheide, Sonnenhügel",0.010753,0.010753,0.0,0.0,0.010753,0.075269,0.010753,0.0,0.0,...,0.010753,0.032258,0.010753,0.064516,0.010753,0.010753,0.010753,0.021505,0.010753,0.010753
3,"Eversburg, Hafen, Haste, Pye",0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.178571,0.0,0.0,0.0,0.0,0.0,0.0
4,"Fledder, Schinkel",0.0,0.013158,0.0,0.0,0.0,0.092105,0.013158,0.0,0.013158,...,0.013158,0.026316,0.013158,0.105263,0.013158,0.013158,0.013158,0.026316,0.0,0.013158
5,"Gartlage, Innenstadt",0.01087,0.01087,0.0,0.0,0.021739,0.076087,0.01087,0.0,0.0,...,0.01087,0.032609,0.01087,0.076087,0.01087,0.01087,0.01087,0.021739,0.01087,0.0
6,Hellern,0.0,0.022222,0.0,0.0,0.066667,0.044444,0.0,0.0,0.0,...,0.0,0.0,0.022222,0.133333,0.022222,0.0,0.0,0.0,0.022222,0.0
7,"Kalkhügel, Wüste",0.0,0.037037,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,...,0.0,0.0,0.0,0.148148,0.0,0.0,0.0,0.0,0.0,0.037037
8,"Nahne, Schölerberg, Sutthausen",0.0,0.071429,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,...,0.0,0.0,0.0,0.178571,0.0,0.0,0.0,0.0,0.0,0.035714
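
The mean of a one-hot column within a district is simply the share of that district's venues that fall into the category. The plain-Python sketch below makes this explicit with a few made-up `(district, category)` pairs; the data and variable names are illustrative and not taken from the Foursquare result.

```python
from collections import Counter, defaultdict

# hypothetical (district, category) pairs standing in for osnabrueck_venues
venues = [
    ("Hellern", "Supermarket"),
    ("Hellern", "Supermarket"),
    ("Hellern", "Bakery"),
    ("Hellern", "Hotel"),
    ("Atter", "Lake"),
    ("Atter", "German Restaurant"),
]

# count the venues per category within each district
counts = defaultdict(Counter)
for district, category in venues:
    counts[district][category] += 1

# share of each category among a district's venues
frequencies = {
    district: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for district, counter in counts.items()
}
print(frequencies["Hellern"])
# {'Supermarket': 0.5, 'Bakery': 0.25, 'Hotel': 0.25}
```

Because each row of the grouped table is a set of shares, a district with few venues gives every single venue a large weight.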


In [50]:
# each District with the top 5 most common venues

num_top_venues = 5

for hood in osnabrueck_grouped['District']:
    print("----"+hood+"----")
    temp = osnabrueck_grouped[osnabrueck_grouped['District'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Atter, Westerberg, Weststadt----
               venue  freq
0  German Restaurant   0.2
1       Intersection   0.1
2          BBQ Joint   0.1
3               Lake   0.1
4    Border Crossing   0.1


----Darum/Gretesch/Lüstringen, Lüstringen, Voxtrup, Widukindland----
           venue  freq
0    Supermarket  0.42
1          Hotel  0.08
2      Drugstore  0.04
3      Gastropub  0.04
4  Shopping Mall  0.04


----Dodesheide, Sonnenhügel----
         venue  freq
0          Bar  0.08
1         Café  0.06
2  Supermarket  0.06
3    Drugstore  0.04
4    Nightclub  0.04


----Eversburg, Hafen, Haste, Pye----
                  venue  freq
0           Supermarket  0.18
1           Gas Station  0.11
2             Drugstore  0.07
3             Nightclub  0.07
4  Fast Food Restaurant  0.07


----Fledder, Schinkel----
         venue  freq
0  Supermarket  0.11
1          Bar  0.09
2         Café  0.07
3        Hotel  0.04
4    Drugstore  0.04


----Gartlage, Innenstadt----
         venue  freq
0      

In [51]:
# putting that data into a pandas df
# first: function to sort the venues

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
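
Several categories often share the same frequency, and the order among tied values after sorting is not something this function controls, which may be why tied categories can appear in a different order in different listings. One possible remedy, sketched here in plain Python, is to break ties alphabetically. The function name `top_categories` is hypothetical; the frequencies are the ones printed above for "Atter, Westerberg, Weststadt".

```python
def top_categories(freqs, n):
    """Return the n most frequent categories; ties are broken alphabetically."""
    ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n]]


# frequencies as printed for "Atter, Westerberg, Weststadt" (subset)
atter = {
    "German Restaurant": 0.2,
    "Intersection": 0.1,
    "BBQ Joint": 0.1,
    "Lake": 0.1,
    "Border Crossing": 0.1,
}
print(top_categories(atter, 3))
# ['German Restaurant', 'BBQ Joint', 'Border Crossing']
```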

In [52]:
# creating the new df and displaying the top 5 venues

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['District']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
District_venues_sorted = pd.DataFrame(columns=columns)
District_venues_sorted['District'] = osnabrueck_grouped['District']

for ind in np.arange(osnabrueck_grouped.shape[0]):
    District_venues_sorted.iloc[ind, 1:] = return_most_common_venues(osnabrueck_grouped.iloc[ind, :], num_top_venues)

District_venues_sorted.head(20)

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Atter, Westerberg, Weststadt",German Restaurant,Border Crossing,Lake,Intersection,Soccer Field
1,"Darum/Gretesch/Lüstringen, Lüstringen, Voxtrup...",Supermarket,Hotel,Big Box Store,Construction & Landscaping,Drugstore
2,"Dodesheide, Sonnenhügel",Bar,Supermarket,Café,Drugstore,Nightclub
3,"Eversburg, Hafen, Haste, Pye",Supermarket,Gas Station,Fast Food Restaurant,Nightclub,Drugstore
4,"Fledder, Schinkel",Supermarket,Bar,Café,Drugstore,Hotel
5,"Gartlage, Innenstadt",Bar,Supermarket,Café,Drugstore,Nightclub
6,Hellern,Supermarket,Hotel,Bakery,Bar,Drugstore
7,"Kalkhügel, Wüste",Supermarket,Drugstore,Pizza Place,Zoo,Pub
8,"Nahne, Schölerberg, Sutthausen",Supermarket,Fast Food Restaurant,Gas Station,Asian Restaurant,Museum
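
The column labels above are built by looking up a suffix list and falling back to "th". That is sufficient for five columns; if `num_top_venues` were ever raised past 20, a small helper such as the hypothetical `ordinal` below would also handle cases like 21st and keep 11th to 13th correct.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th, ... are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["District"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 6)]
print(columns)
```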


## Clustering Districts

In [53]:
# Run k-means to cluster the districts into 5 clusters.
# set number of clusters
kclusters = 5

osnabrueck_grouped_clustering = osnabrueck_grouped.drop(columns='District')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(osnabrueck_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 3, 4, 3, 3, 1, 1, 1])
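
A quick way to read the label array is to count how many districts fall into each cluster. The snippet below does this for the labels printed above, using only the standard library.

```python
from collections import Counter

labels = [0, 2, 3, 4, 3, 3, 1, 1, 1]  # kmeans.labels_ as printed above

sizes = Counter(labels)
# clusters that contain only a single postcode district
singletons = sorted(label for label, size in sizes.items() if size == 1)

print(dict(sorted(sizes.items())))  # {0: 1, 1: 3, 2: 1, 3: 3, 4: 1}
print(singletons)                   # [0, 2, 4]
```

With nine postcode districts and five clusters, three of the clusters consist of a single district each.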

In [54]:
# create a new dataframe that includes the cluster 
# as well as the top 5 venues for each District

# adjust the District column in the merged_data dataframe
merged_data.District = [word.replace('\n', '') for word in merged_data.District]

# add clustering labels
District_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

osnabrueck_merged = merged_data

# join merged_data with District_venues_sorted to add the cluster label and top venues for each District
osnabrueck_merged = osnabrueck_merged.join(District_venues_sorted.set_index('District'), on='District')

osnabrueck_merged.head(200) # check the last columns!

Unnamed: 0,Postcode,District,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,49074,"Gartlage, Innenstadt",52.2738,8.0521,3,Bar,Supermarket,Café,Drugstore,Nightclub
1,49076,"Atter, Westerberg, Weststadt",52.2832,7.9485,0,German Restaurant,Border Crossing,Lake,Intersection,Soccer Field
2,49078,Hellern,52.2651,8.0096,1,Supermarket,Hotel,Bakery,Bar,Drugstore
3,49080,"Kalkhügel, Wüste",52.2491,8.0367,1,Supermarket,Drugstore,Pizza Place,Zoo,Pub
4,49082,"Nahne, Schölerberg, Sutthausen",52.244,8.0613,1,Supermarket,Fast Food Restaurant,Gas Station,Asian Restaurant,Museum
5,49084,"Fledder, Schinkel",52.2667,8.0725,3,Supermarket,Bar,Café,Drugstore,Hotel
6,49086,"Darum/Gretesch/Lüstringen, Lüstringen, Voxtrup...",52.2887,8.0892,2,Supermarket,Hotel,Big Box Store,Construction & Landscaping,Drugstore
7,49088,"Dodesheide, Sonnenhügel",52.2667,8.05,3,Bar,Supermarket,Café,Drugstore,Nightclub
8,49090,"Eversburg, Hafen, Haste, Pye",52.2983,8.0131,4,Supermarket,Gas Station,Fast Food Restaurant,Nightclub,Drugstore


### Visualization of the Clusters

In [55]:
# Use geopy library to get the latitude and longitude values of Osnabrueck
address = 'Osnabrueck'

geolocator = Nominatim(user_agent="osnabrueck_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The geographical coordinates of Osnabrueck are: latitude= {latitude}, longitude= {longitude}.')

The geographical coordinates of Osnabrueck are: latitude= 52.266837, longitude= 8.049741.


In [56]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(osnabrueck_merged['latitude'], osnabrueck_merged['longitude'], osnabrueck_merged['District'], osnabrueck_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
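
The markers take their color from a list with one entry per cluster. As a dependency-free alternative to the matplotlib colormap, the following sketch builds such a list with the standard library's `colorsys` module. The function name `cluster_colors` and the saturation and brightness values are assumptions made for illustration.

```python
import colorsys


def cluster_colors(k):
    """Return k evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.9, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_colors(5)
# one color per cluster label, so that label 0 maps to the first color
color_of = {label: palette[label] for label in range(5)}
print(color_of)
```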

## Results 

### Examine Clusters

For the presentation of the results, I give each cluster a descriptive name based on the venue categories that distinguish it from the other clusters.

#### Restaurant Cluster

In [57]:
osnabrueck_merged.loc[osnabrueck_merged['Cluster Labels'] == 0, osnabrueck_merged.columns[[1] + list(range(5, osnabrueck_merged.shape[1]))]]

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,"Atter, Westerberg, Weststadt",German Restaurant,Border Crossing,Lake,Intersection,Soccer Field


#### Supermarket and Drugstore Cluster

In [58]:
osnabrueck_merged.loc[osnabrueck_merged['Cluster Labels'] == 1, osnabrueck_merged.columns[[1] + list(range(5, osnabrueck_merged.shape[1]))]]

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Hellern,Supermarket,Hotel,Bakery,Bar,Drugstore
3,"Kalkhügel, Wüste",Supermarket,Drugstore,Pizza Place,Zoo,Pub
4,"Nahne, Schölerberg, Sutthausen",Supermarket,Fast Food Restaurant,Gas Station,Asian Restaurant,Museum


#### Industrial Estate

In [59]:
osnabrueck_merged.loc[osnabrueck_merged['Cluster Labels'] == 2, osnabrueck_merged.columns[[1] + list(range(5, osnabrueck_merged.shape[1]))]]

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
6,"Darum/Gretesch/Lüstringen, Lüstringen, Voxtrup...",Supermarket,Hotel,Big Box Store,Construction & Landscaping,Drugstore


#### Bar and Café Cluster

In [60]:
osnabrueck_merged.loc[osnabrueck_merged['Cluster Labels'] == 3, osnabrueck_merged.columns[[1] + list(range(5, osnabrueck_merged.shape[1]))]]

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Gartlage, Innenstadt",Bar,Supermarket,Café,Drugstore,Nightclub
5,"Fledder, Schinkel",Supermarket,Bar,Café,Drugstore,Hotel
7,"Dodesheide, Sonnenhügel",Bar,Supermarket,Café,Drugstore,Nightclub


#### Supermarket and Gas Station Cluster

In [61]:
osnabrueck_merged.loc[osnabrueck_merged['Cluster Labels'] == 4, osnabrueck_merged.columns[[1] + list(range(5, osnabrueck_merged.shape[1]))]]

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
8,"Eversburg, Hafen, Haste, Pye",Supermarket,Gas Station,Fast Food Restaurant,Nightclub,Drugstore


## Discussion
The separation into clusters shows significant differences between some clusters, while others seem to be almost identical. I varied the number of clusters to get clearer results, but the number of venues and the limited amount of data I received from Foursquare do not allow an unambiguous result. In particular, the large variation in the number of venues per district limits the relevance of the separation. For example, there were about nine times as many venues in the district "Dodesheide, Sonnenhügel" as in the district "Atter, Westerberg, Weststadt". The cause of this inequality lies in the incompleteness of the data from Foursquare and not in real-world differences between the districts.
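
The imbalance can be checked directly from the venue counts reported earlier. The short calculation below copies those counts into a dictionary and derives the total and the ratio between the largest and the smallest district.

```python
# venue counts per postcode district, copied from the groupby output above
venue_counts = {
    "Atter, Westerberg, Weststadt": 10,
    "Darum/Gretesch/Lüstringen, Lüstringen, Voxtrup, Widukindland": 24,
    "Dodesheide, Sonnenhügel": 93,
    "Eversburg, Hafen, Haste, Pye": 28,
    "Fledder, Schinkel": 76,
    "Gartlage, Innenstadt": 92,
    "Hellern": 45,
    "Kalkhügel, Wüste": 27,
    "Nahne, Schölerberg, Sutthausen": 28,
}

total = sum(venue_counts.values())
largest = max(venue_counts, key=venue_counts.get)
smallest = min(venue_counts, key=venue_counts.get)
ratio = venue_counts[largest] / venue_counts[smallest]

print(total)            # 423, the number of rows in osnabrueck_venues
print(round(ratio, 1))  # 9.3
```

With only 10 venues, a single venue in "Atter, Westerberg, Weststadt" accounts for a frequency of 0.1, whereas one venue among 93 contributes roughly 0.011.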

## Conclusion
The approach used in this analysis makes it possible to see some differences between the districts of Osnabrück. In particular, the most common venues should give the reader an impression of the structure of a district. This knowledge can help with the decision on a new place of residence or a business opening. Because of the lack of data, this analysis should only be used as one of many indicators.