#  IBM Data Science Professional Certificate Capstone Project

### Clusters of Essential Businesses in Michigan Cities

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction <a name="introduction"></a>

Essential businesses such as grocery stores, markets, pharmacies, and hospitals can aid resource distribution during the fluid conditions of the COVID-19 pandemic.

This project examines clusters of essential businesses in the state of Michigan. The scope is to determine which cities and counties have better access to these businesses (measured by proximity to dense clusters of them) and which do not (measured by the lack of such clusters nearby).

## Data <a name="data"></a>

Two sets of Excel worksheets will be loaded:

1) A set of Zipcode, City, and County data compiled from the following website: https://www.zipcodestogo.com/Michigan/
2) A set of Zipcode, Latitude, and Longitude data compiled from the following GitHub gist: https://gist.github.com/erichurst/7882666

These two data sets are the foundation of the final table, which is linked to the Foursquare API venue data and then used as the input to K-means clustering.

## Methodology <a name="methodology"></a>

1) Defining initial DataFrame of Zipcode, City and County.

In [None]:
#importing required libraries
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
from pandas import DataFrame #to convert the list type into dataframe type
import json # library to handle JSON files

# one install of geopy is enough; the conda install duplicated the pip install
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library

print('Libraries imported.')

In [None]:
# raw string so the backslash in the Windows path is not read as an escape sequence
df = pd.read_excel(r'C:\Michigancitieszipcodes.xls')
print(df)

In [None]:
df_table=pd.DataFrame(df)
df_table

2) Importing the Latitude and Longitude of each zip code.

In [None]:
#importing lat/long from xls file (raw string keeps the backslash literal)
df_geo = pd.read_excel(r'C:\Ziplatlong.xls')
df_geo.head()

In [None]:
df_geo.shape #checking the row and column counts to ensure the lat/long were picked up for all the rows

In [None]:
df_table.columns #verifying text of column to merge on

In [None]:
df_geo.columns #verifying text of column to merge on

3) Merging the latitudes/longitudes of each zip code to the original table into one final table.

In [None]:
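#merging the two data frames on ZIP code
#note: this rename maps 'Zip Code' to itself, so it only has an effect if the left-hand name is changed to match df_geo's actual column name
df_geo.rename(columns={'Zip Code':'Zip Code'},inplace=True)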
df_final = pd.merge(df_table,df_geo,on='Zip Code')
df_final.head()
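
An inner merge silently drops any ZIP code that has no match in the coordinate table, so it can be worth checking for unmatched rows before moving on. The following is a minimal sketch on made-up data (the ZIP codes, city names, and coordinates are placeholders, not rows from the worksheets), assuming both tables share a `Zip Code` column as in the merge above.

```python
import pandas as pd

# Illustrative stand-ins for df_table and df_geo (all values are made up)
zip_city = pd.DataFrame({
    'Zip Code': [48001, 48002, 48003],
    'City': ['City A', 'City B', 'City C'],
    'County': ['County X', 'County X', 'County Y'],
})
zip_geo = pd.DataFrame({
    'Zip Code': [48001, 48003],
    'Latitude': [42.6, 42.9],
    'Longitude': [-82.6, -83.0],
})

# A left merge keeps every ZIP code from the first table;
# indicator=True adds a '_merge' column recording where each row matched
checked = pd.merge(zip_city, zip_geo, on='Zip Code', how='left', indicator=True)

# ZIP codes without coordinates are the ones an inner merge would drop
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Zip Code'].tolist()
print(unmatched)
```

If the list is empty for the real tables, the inner merge used in this notebook loses nothing.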

4) Exploring the county regions of Michigan.

In [None]:
#checking the unique counties in the final table
df_final.County.unique() 

In [None]:
#resetting index to 0 and getting a view of the table
#reset_index returns a new DataFrame, so the result must be assigned back
df_final = df_final.reset_index(drop=True)
df_final

In [None]:
#getting Michigan latitude and longitude

address = 'michigan'

geolocator = Nominatim(user_agent="Michigan")
location = geolocator.geocode(address)
lat_Michigan = location.latitude
long_Michigan = location.longitude
print('The geographical coordinates of Michigan are {}, {}.'.format(lat_Michigan, long_Michigan))


5) Visualizing a map of Michigan with all the data from the final table zipcode latitudes and longitudes labeled by City, County.

In [None]:
#visualize map of Michigan ZIP codes

# hard-coded coordinates for Michigan (not used below; the map uses the geocoded values)
latitude = 43.6211955
longitude = -84.6824346

# create map of Michigan using the geocoded latitude and longitude values
map_Michigan = folium.Map(location=[lat_Michigan, long_Michigan], zoom_start=7)

# add markers to map
for lat, lng, city, county in zip(df_final['Latitude'], df_final['Longitude'], df_final['City'], df_final['County']):
    label = '{}, {}'.format(city, county)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Michigan)
    
map_Michigan

6) Connecting to Foursquare API and pulling categories of essential businesses

In [None]:
#Connecting to the Foursquare API with the Foursquare Credentials
#content removed for purposes of privacy

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
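
To illustrate what the helper above returns, here is a small sketch that applies it to hypothetical rows shaped like flattened venue records. The venue names and category lists are invented for illustration and are not real API output; the function body repeats the helper so the example runs on its own.

```python
import pandas as pd

def get_category_type(row):
    # Same logic as the helper above: look under 'categories' first,
    # then fall back to the flattened 'venue.categories' key
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows (not real Foursquare data)
sample = pd.DataFrame({
    'venue.name': ['Store A', 'Clinic B', 'Unlabeled C'],
    'venue.categories': [
        [{'name': 'Grocery Store'}],
        [{'name': 'Pharmacy'}, {'name': 'Medical Center'}],
        [],
    ],
})

# Only the first category of each venue is kept; venues without one get a missing value
sample['category'] = sample.apply(get_category_type, axis=1)
print(sample[['venue.name', 'category']])
```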

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # the analysis below groups and merges on 'City', so the rows are labeled by city
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['City', 
                 'City Latitude', 
                 'City Longitude', 
                 'Venue', 
                 'Venue Latitude', 
                 'Venue Longitude', 
                 'Venue Category'])
    
    return nearby_venues

In [None]:
LIMIT=50
# pass city names (not counties) so that the 'City' column matches the later steps
michigan_venues = getNearbyVenues(names=df_final['City'],
                                   latitudes=df_final['Latitude'],
                                   longitudes=df_final['Longitude']
                                  )

In [None]:
michigan_venues.shape
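
The venue request above returns every kind of venue near each ZIP code, not only essential businesses. One way to narrow the table to the project's scope is to filter on the venue category. The sketch below is only an illustration: the set of "essential" labels is an assumption and should be adjusted to the labels that actually appear in the `Venue Category` column, and the rows are made up.

```python
import pandas as pd

# Assumed set of category labels treated as "essential" (hypothetical choice)
ESSENTIAL_CATEGORIES = {'Grocery Store', 'Supermarket', 'Market', 'Pharmacy', 'Hospital'}

# Made-up rows; column names follow the venue table where possible
venues = pd.DataFrame({
    'City': ['City A', 'City A', 'City B', 'City B'],
    'Venue': ['Fresh Foods', 'Pizza Place 1', 'Corner Drugs', 'Bowling Lanes'],
    'Venue Category': ['Grocery Store', 'Pizza Place', 'Pharmacy', 'Bowling Alley'],
})

# Keep only the venues whose category is in the essential set
essential = venues[venues['Venue Category'].isin(ESSENTIAL_CATEGORIES)]

# Count the essential venues found for each city
per_city = essential.groupby('City').size()
print(per_city)
```

Cities that drop out of `per_city` entirely are candidates for the "limited access" group described in the introduction.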

## Analysis <a name="analysis"></a>

In [None]:
# one hot encoding
michigan_onehot = pd.get_dummies(michigan_venues[['Venue Category']], prefix="", prefix_sep="")

# add city column back to dataframe
michigan_onehot['City'] = michigan_venues['City'] 

# move city column to the first column
fixed_columns = [michigan_onehot.columns[-1]] + list(michigan_onehot.columns[:-1])
michigan_onehot = michigan_onehot[fixed_columns]

michigan_onehot.head()

In [None]:
#group the one-hot rows by city; the mean of each column is the share of that city's venues in the category
michigan_grouped = michigan_onehot.groupby('City').mean().reset_index()
michigan_grouped
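
To make the effect of one-hot encoding followed by a grouped mean concrete, here is a small sketch on made-up venue rows (the city names and rows are placeholders). Each value in the result is the fraction of a city's venues that fall in a category.

```python
import pandas as pd

# Made-up venue rows: City A has three venues, City B has one
venues = pd.DataFrame({
    'City': ['City A', 'City A', 'City A', 'City B'],
    'Venue Category': ['Pharmacy', 'Pharmacy', 'Grocery Store', 'Grocery Store'],
})

# One indicator column per category (dtype=float so the mean is a plain fraction)
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, 'City', venues['City'])

# Averaging the indicators per city gives the category frequencies
grouped = onehot.groupby('City').mean().reset_index()
print(grouped)
```

In this toy table, City A ends up with two thirds of its venues in `Pharmacy` and one third in `Grocery Store`, while City B has all of its venues in `Grocery Store`.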

In [None]:
michigan_grouped.shape #one row per city; the columns are 'City' plus one column per venue category from the one-hot encoding

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' the suffix falls back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['City'] = michigan_grouped['City']

for ind in np.arange(michigan_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(michigan_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted.head()
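
As a quick check of how the ranking helper behaves, the sketch below runs it on a tiny made-up frequency table (the cities and frequencies are placeholders). The helper is repeated so the example runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the city name) and sort the frequencies, highest first
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up per-city category frequencies
grouped = pd.DataFrame({
    'City': ['City A', 'City B'],
    'Grocery Store': [0.2, 0.5],
    'Hospital': [0.1, 0.4],
    'Pharmacy': [0.7, 0.1],
})

# Top two categories for each city, in descending order of frequency
top2 = [list(return_most_common_venues(grouped.iloc[i, :], 2)) for i in range(len(grouped))]
print(top2)
```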

In [None]:
city_venues_sorted.shape # cities with their top 10 most common venues

K-Means Clustering

In [None]:
#K - means

kclusters = 5

michigan_grouped_clustering = michigan_grouped.drop(columns='City')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(michigan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
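
The number of clusters above is fixed at 5. One common way to sanity-check such a choice is the elbow method: fit K-means for several values of k and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below uses synthetic points as a stand-in for the per-city frequency table, so the numbers it prints say nothing about the Michigan data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in data: three loose groups of 2-D points
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Fit K-means for k = 1..5 and record the inertia of each fit
inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 6), inertias):
    print(k, round(inertia, 2))
```

On the real table, the same loop could be run over `michigan_grouped_clustering` in place of `X`.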

In [None]:
# add clustering labels
city_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
michigan_merged = df_final

# join city_venues_sorted onto df_final so each city keeps its latitude/longitude alongside its cluster label
michigan_merged = michigan_merged.join(city_venues_sorted.set_index('City'), on='City')

michigan_merged.head() # check the last columns!

In [None]:
michigan_merged.shape

In [None]:
# create map
map_clusters = folium.Map(location=[lat_Michigan, long_Michigan], zoom_start=7)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

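# add markers to the map
# cities with no venues have no cluster label after the join, so they are skipped here
clustered = michigan_merged.dropna(subset=['Cluster Labels'])
markers_colors = []
for lat, lon, poi, cluster in zip(clustered['Latitude'], clustered['Longitude'], clustered['City'], clustered['Cluster Labels']):
    cluster = int(cluster) # labels become floats when the join introduces missing values
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)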
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examining each cluster

In [None]:
#cities assigned to cluster 0
michigan_merged.loc[michigan_merged['Cluster Labels'] == 0, michigan_merged.columns[[1] + list(range(5, michigan_merged.shape[1]))]]

In [None]:
#cities assigned to cluster 1
michigan_merged.loc[michigan_merged['Cluster Labels'] == 1, michigan_merged.columns[[1] + list(range(5, michigan_merged.shape[1]))]]

In [None]:
#cities assigned to cluster 2
michigan_merged.loc[michigan_merged['Cluster Labels'] == 2, michigan_merged.columns[[1] + list(range(5, michigan_merged.shape[1]))]]
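
Besides listing the members of each cluster one at a time, a compact summary can show how the clusters are distributed across counties, which is what the project's question about access ultimately needs. The sketch below uses a made-up table with the same `City`, `County`, and `Cluster Labels` columns; the values are placeholders.

```python
import pandas as pd

# Made-up stand-in for the merged table
merged = pd.DataFrame({
    'City': ['City A', 'City B', 'City C', 'City D', 'City E'],
    'County': ['County X', 'County X', 'County Y', 'County Y', 'County Y'],
    'Cluster Labels': [0, 0, 1, 2, 1],
})

# Number of rows assigned to each cluster
cluster_sizes = merged['Cluster Labels'].value_counts().sort_index()

# Rows per county (index) and cluster (columns)
county_by_cluster = pd.crosstab(merged['County'], merged['Cluster Labels'])

print(cluster_sizes)
print(county_by_cluster)
```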
