Coursera Capstone Project - Business Problem, Methodology and Data Source

Introduction: 

The "Laver Cup" is a Tennis tournament held each year in September: a 3-day showdown between the world's top Tennis talents. It pits six top European players against six of their counterparts from the rest of the world.
One of the main reasons for introducing the "Laver Cup" is to promote and market the sport of Tennis worldwide, including in lesser-known Tennis destinations where the sport is not a headline event for the city.
In a similar vein, the location of the "Laver Cup" alternates each year between European cities and cities in the rest of the world.
The "Laver Cup" was brought to life in 2017 and has been a mega-success since its inception. 
The Cup was inaugurated in Prague, Czech Republic, in 2017. It travelled to the US for the first time in 2018 (Chicago) and returned to Europe (Geneva, Switzerland) in 2019. In summary:
1. 2017 - Prague, Czech Republic, Europe
2. 2018 - Chicago, Illinois, US
3. 2019 - Geneva, Switzerland, Europe

Business Problem: 

The Laver Cup committee has been tasked with hosting the next Laver Cup, scheduled for September 2020, in one of the major US cities. A number of major US cities entered the draw, and 6 of them were shortlisted by random draw. The committee must choose one of these 6 cities to host the Cup. It must ensure that the city will be ready from an infrastructure standpoint, so that the facilities in and around the main stadium can support an influx of spectators from around the world. The committee must also ensure that the neighborhood around the stadium supports a strong sports culture and can captivate the Tennis audience for the ultimate fan experience.

Methodology:

To help choose the best US city to host "Laver Cup - 2020", the Tennis stadiums of the 4 Grand Slams will be explored, along with the 3 venues that have hosted the Laver Cup in the past.
A master data set of all the 7 major stadiums (4 Grand Slam stadiums + 3 Laver Cup stadiums) will be created to cluster the stadiums and explore the similarity between the 7 stadiums. 

Next, the 6 candidate city stadiums, in the US, will be explored in a dataframe (DF).
The venues around these stadiums will be categorized and the frequency of the venue categories will be explored around each of the 6 city stadiums and compared against the 7 major world stadiums (found earlier). 

To compare the 6 candidate US cities with the 7 major stadiums, a concatenated "Master Tennis Dataset" will be created which will include the frequency of venue categories against each of 13 major city stadiums (6 candidates + 7 historical stadiums).
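
As a rough illustration of the "Master Tennis Dataset" idea, the sketch below turns a list of venue categories per stadium into one row of category frequencies per stadium, then combines the historical and candidate rows into a single table. It uses only the standard library so that it runs anywhere; in the notebook the same step would more naturally be done on a DF. The function name `category_frequencies` and the venue categories are hypothetical and are not Foursquare results.

```python
from collections import Counter

def category_frequencies(venues_by_stadium):
    """Return {stadium: {category: share of that stadium's venues}}."""
    # Collect every category seen anywhere so that all rows share the same columns.
    all_categories = sorted(
        {cat for cats in venues_by_stadium.values() for cat in cats}
    )
    table = {}
    for stadium, cats in venues_by_stadium.items():
        counts = Counter(cats)
        total = len(cats)
        table[stadium] = {
            cat: (counts[cat] / total if total else 0.0) for cat in all_categories
        }
    return table

# Hypothetical venue categories, for illustration only.
historical = {"US Open": ["Tennis Court", "Park", "Tennis Court", "Cafe"]}
candidates = {"Boston": ["Sports Bar", "Cafe"]}

# One table holding both the historical rows and the candidate rows.
master = category_frequencies({**historical, **candidates})
```

Each row sums to 1, and a category that never appears near a stadium gets a frequency of 0, which keeps the rows comparable when they are clustered.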

ML model(s):
A "KMeans" clustering algorithm will be used to separate the 6 candidate city stadiums into different clusters. A pattern-based (unsupervised) algorithm was chosen because the dataset contains historical, static data, on which KMeans is known to perform acceptably.
To decide the number of clusters for each of the problems/datasets, 2 techniques were used:
1. The Elbow Method - This computes the "WSS" or 'within-cluster sum of squares' error: the sum of the squared Euclidean distances between each data point and the center of its own cluster. The point (number of clusters) where the "WSS" loss starts to flatten out (creating an elbow point) is taken as the optimal number of clusters for the given dataset.
2. The Silhouette Method - This method measures how similar an object is to the other objects in its own cluster (cohesion) and compares this with how dissimilar it is to the objects in other clusters (separation). The Silhouette score ranges from -1 to +1. A high value implies an appropriate clustering configuration, while low values imply that there are too many or too few clusters (i.e. the object might be better assigned to a different cluster). In the graph, the highest point (the number of clusters with the highest score) is considered the optimal number of clusters.
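
To make the two measures concrete, here is a minimal standard-library sketch that computes the WSS and the mean Silhouette score for a fixed assignment of points to clusters. It is not the notebook's implementation (which would normally rely on a clustering library); the function names and the four toy points are assumptions made for illustration. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters

def wss(points, labels):
    """Sum of squared distances from each point to its own cluster center."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        center = centroid(members)
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total

def mean_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least 2 clusters."""
    clusters = group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least 2 clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a single-member cluster
            continue
        # The point's distance to itself is 0, so divide by the other members only.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (hypothetical): two tight pairs of points that are far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
wss_one_cluster = wss(points, [0, 0, 0, 0])
wss_two_clusters = wss(points, [0, 0, 1, 1])
sil_good = mean_silhouette(points, [0, 0, 1, 1])
sil_mixed = mean_silhouette(points, [0, 1, 0, 1])
```

On this toy data the WSS drops sharply when going from one cluster to two, which is the kind of bend the Elbow Method looks for, and the natural grouping scores far higher on the Silhouette measure than the deliberately mixed one.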

Analysis:
After running the "KMeans" algorithm on the "Master Tennis Dataset", the candidate cities (out of the 6) that are most similar to the US Open and Chicago Laver Cup venues (i.e. those that fall in the same cluster) were considered the best cities to host the next Laver Cup.
The metric used is 'Neighborhood Similarity'. The US Open was taken as the baseline because it is considered a strong attraction for sports enthusiasts and Tennis fans alike, being one of the 4 major Tennis destinations of the world.
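
The selection step can be sketched as a simple filter over the cluster labels. The sketch assumes that a candidate qualifies if it shares a cluster with at least one of the baseline venues; the source does not say whether matching one baseline or both is required. The function name, the candidate names and the cluster labels below are all hypothetical and are not results of the analysis.

```python
def candidates_matching_baselines(cluster_of, candidates, baselines):
    """Return the candidates whose cluster label equals that of any baseline."""
    baseline_clusters = {cluster_of[name] for name in baselines}
    return [name for name in candidates if cluster_of[name] in baseline_clusters]

# Hypothetical cluster labels, for illustration only.
cluster_of = {
    "US Open": 0,
    "Chicago-Laver-Cup": 0,
    "Candidate A": 0,
    "Candidate B": 2,
    "Candidate C": 0,
}

shortlist = candidates_matching_baselines(
    cluster_of,
    candidates=["Candidate A", "Candidate B", "Candidate C"],
    baselines=["US Open", "Chicago-Laver-Cup"],
)
```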

City Stadium exploration: 
Each of the candidate cities that falls in the same cluster as "US Open" and "Chicago-Laver-Cup" was explored on a map to visualize the sports, tourism, entertainment and food infrastructure around the stadium.
This is important for understanding whether the city is ready to host a world event within a 10-month time period.
Candidate cities whose neighborhood was not developed enough to support a large influx of sports tourism were discarded (e.g. Miami, FL was found to be under-developed in this respect and was discarded).

Final Analysis:
For the candidate cities that were left, a final clustering run was performed to find the city neighborhood closest to the US Open.

Data Source: 

The Foursquare API was used to extract the raw JSON, filter it, structure it into a DF and analyze the datasets around the major Tennis stadiums of the world.
For the historical datasets, 7 major Tennis stadiums (4 Grand Slam stadiums + 3 past Laver Cup stadiums) were explored using the 'search' criteria around the stadium locations using the Foursquare API (Account). The addresses used for venue exploration were as follows:
1. Australian Open --> 'Rod Laver Arena, Victoria, Australia'
2. US Open --> 'Flushing Meadows, Queens, NY'
3. Wimbledon --> 'Wimbledon, London, UK'
4. French Open --> '2 Avenue Gordon Bennett, Paris, France'
5. Laver Cup 2017 --> 'O2 Arena, Prague, Czech Republic'
6. Laver Cup 2018 --> 'United Center, Chicago, Illinois'
7. Laver Cup 2019 --> 'Palexpo, Geneva, Switzerland'
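
A 'search' query of this kind could be assembled as in the sketch below, once an address has been converted to latitude and longitude (the notebook imports geopy's Nominatim for that purpose). The sketch assumes the legacy Foursquare v2 `venues/search` endpoint; the helper name `build_search_url`, the credentials, the coordinates, the version date, the radius and the limit are all placeholders and not the values used in the analysis. No request is sent here.

```python
from urllib.parse import urlencode

# Assumed legacy v2 endpoint; newer Foursquare API versions use different URLs.
SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

def build_search_url(lat, lng, client_id, client_secret,
                     version="20180605", radius=500, limit=100):
    """Build the query URL for venues around one stadium location."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",    # latitude,longitude of the stadium
        "radius": radius,        # search radius in metres
        "limit": limit,          # maximum number of venues returned
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)

# Placeholder coordinates and credentials, for illustration only.
url = build_search_url(40.0, -73.0, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
```

The resulting URL would then be fetched with `requests.get(url).json()` to obtain the raw JSON.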

Similarly, the 'search' criteria were used for the 6 candidate cities, to explore the venues around each of the city stadiums. The addresses used were as follows:
1. Miami --> 'Crandon Park, Miami, FL'
2. Cincinnati --> 'Lindner Family Tennis Center, OH'
3. Palm Springs --> 'Indian Wells, Coachella Valley, CA'
4. Boston --> 'The Garden, Boston, MA'
5. Arizona --> 'Arizona Veterans Memorial, AZ'
6. Dallas --> 'American Airlines Center, Victory Park, Dallas, TX'
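
Once the raw JSON for an address has been retrieved, it has to be filtered down to the fields needed for the analysis. The sketch below flattens a response into one row per venue, assuming the legacy v2 'search' response layout in which venues sit under `response` -> `venues`. The function name `flatten_venues` and the sample payload are hypothetical; the payload is not real API output.

```python
def flatten_venues(payload):
    """Turn a raw search response into a list of flat venue rows."""
    rows = []
    for venue in payload.get("response", {}).get("venues", []):
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        rows.append({
            "name": venue.get("name"),
            # Keep the first listed category; None if the venue has none.
            "category": categories[0]["name"] if categories else None,
            "lat": location.get("lat"),
            "lng": location.get("lng"),
        })
    return rows

# Hypothetical payload in the assumed response layout, for illustration only.
sample = {
    "response": {
        "venues": [
            {"name": "Example Court", "categories": [{"name": "Tennis Court"}],
             "location": {"lat": 1.0, "lng": 2.0}},
            {"name": "Example Kiosk", "categories": [], "location": {}},
        ]
    }
}

rows = flatten_venues(sample)
```

A list of rows like this can be passed directly to `pd.DataFrame(rows)` to obtain the DF used in the later steps.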



Execution:

Installing and Importing the necessary packages for reading raw json files, filtering raw data into Pandas DF, running clustering algorithms and other ML models and visualizing results:

In [1]:
import numpy as np 
import pandas as pd 

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import folium 

import matplotlib.pyplot as plt # Matplotlib, to plot the graphs used to gauge the number of clusters required

from pandas.io.json import json_normalize # Convert the json file into pandas dataframe

!conda install -c conda-forge geopy --yes # Installing geopy; comment this line out if geopy is already installed
           
#!conda update -n base -c defaults conda

from geopy.geocoders import Nominatim #convert an address into latitude and longitude values

!conda install -c conda-forge beautifulsoup4 --yes #Installing BeautifulSoup
!conda install -c conda-forge lxml --yes  #Installing Parser library

#from bs4 import BeautifulSoup

import json
import requests

import matplotlib.colors as colors
import matplotlib.cm as cm 

print('All Libraries Installed')

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

All Libraries Installed
