# Coursera IBM Data Science Certificate
## Capstone Project week 4+5

#### Introduction/Business Problem
My client is a family that has lived in New York for the last 10 years. Husband and wife have a good reputation in the local job market and were able to transition fully to working from home. They have grown tired of New York's crowds, but they appreciate its cultural offerings, bars, parks and general character. They plan to have children soon and want to move to a smaller city. They therefore asked me to analyze the top 50 US cities by population and find one that has the same flair and characteristics as New York, but with fewer people.

#### Data that will be used
For the problem at hand, I first need the names of the top 50 US cities by population, which I take from a table on Wikipedia. With the city names, I can look up each city's coordinates and query the Foursquare API for venues around them.

I will then pull the currently trending venues for each city via the Foursquare API and compute how frequently each venue category occurs. These category frequencies serve as the basis for a clustering algorithm that groups the US cities according to their most common venue categories. The result is shown in a folium map, which can be used to identify cities that share New York's characteristics.

### Please note that the Foursquare explore endpoint returns different trending venues depending on the day, time, etc. The results may therefore differ if you run this notebook at different times.
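One way to make a run reproducible despite the changing API results is to store a snapshot of the fetched data and reuse it on later runs. The following is only a minimal sketch: `load_or_fetch`, `fake_fetch` and the file name are hypothetical names that do not appear in this notebook, and the stand-in data is made up.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(cache_path, fetch_fn):
    """Return the cached dataframe if the file exists, otherwise fetch and cache it."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    frame = fetch_fn()
    frame.to_csv(cache_path, index=False)
    return frame


def fake_fetch():
    # Stand-in for a real API call (made-up data, for illustration only)
    return pd.DataFrame({"City": ["New York"], "Venue_category": ["Park"]})


cache_file = os.path.join(tempfile.mkdtemp(), "venues_snapshot.csv")
first = load_or_fetch(cache_file, fake_fetch)   # fetches and writes the snapshot
second = load_or_fetch(cache_file, fake_fetch)  # reads the snapshot from disk
```

In this notebook, the function that retrieves the venue data could be passed as `fetch_fn`, so that later runs reuse the same venues.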

In [39]:
%matplotlib inline
import datetime

import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

In [40]:
!pip -q install folium==0.5.0
import folium

In [41]:
!pip -q install geocoder
import geocoder

In [42]:
!pip -q install requests
import requests

## Pulling and cleaning of initial dataframe

In [43]:
df = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population")[4][["City", "2018estimate"]].iloc[:50]
df

Unnamed: 0,City,2018estimate
0,New York[d],8398748
1,Los Angeles,3990456
2,Chicago,2705994
3,Houston[3],2325502
4,Phoenix,1660272
5,Philadelphia[e],1584138
6,San Antonio,1532233
7,San Diego,1425976
8,Dallas,1345047
9,San Jose,1030119


In [44]:
# Cleaning up the dataframe and adding a size_folium column for the size of marker circles later in the process
df["City"] = df["City"].apply(lambda x: x.split("[")[0])
df["Size_Folium"] = df["2018estimate"]/ 1000000
df = df.drop("2018estimate", axis=1)
df

Unnamed: 0,City,Size_Folium
0,New York,8.398748
1,Los Angeles,3.990456
2,Chicago,2.705994
3,Houston,2.325502
4,Phoenix,1.660272
5,Philadelphia,1.584138
6,San Antonio,1.532233
7,San Diego,1.425976
8,Dallas,1.345047
9,San Jose,1.030119
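The same cleanup can also be written with vectorized string methods instead of `apply`. This is a sketch of an alternative, not the code used above: it works on three rows copied from the output, and it removes only the bracketed footnote markers rather than everything after the first bracket.

```python
import pandas as pd

# Three rows that mimic the raw Wikipedia table shown above
raw = pd.DataFrame({
    "City": ["New York[d]", "Los Angeles", "Houston[3]"],
    "2018estimate": [8398748, 3990456, 2325502],
})

clean = raw.copy()
# Drop bracketed footnote markers such as "[d]" or "[3]"
clean["City"] = clean["City"].str.replace(r"\[.*?\]", "", regex=True).str.strip()
# Population in millions, used later as marker size
clean["Size_Folium"] = clean["2018estimate"] / 1_000_000
clean = clean.drop(columns="2018estimate")
```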


#### Coordinate search for all cities

In [45]:
def location_search(city):
    # Geocode a single city name with OpenStreetMap and return its coordinates
    location = geocoder.osm(city)
    lat, lng = location.latlng
    return pd.Series({"Latitude": lat, "Longitude": lng})


# Apply the lookup to the City column only; the result has one column per coordinate
df[["Latitude", "Longitude"]] = df["City"].apply(location_search)
df.head(15)

Unnamed: 0,City,Size_Folium,Latitude,Longitude
0,New York,8.398748,40.7127281,-74.0060152
1,Los Angeles,3.990456,34.0536909,-118.2427666
2,Chicago,2.705994,41.8755616,-87.6244212
3,Houston,2.325502,29.7589382,-95.3676974
4,Phoenix,1.660272,33.4485866,-112.0773456
5,Philadelphia,1.584138,39.9527237,-75.1635262
6,San Antonio,1.532233,29.4246002,-98.4951405
7,San Diego,1.425976,32.7174209,-117.1627714
8,Dallas,1.345047,32.7762719,-96.7968559
9,San Jose,1.030119,37.3361905,-121.8905833


## Defining a function that uses the city locations to retrieve data from Foursquare API
Please insert your own Foursquare API key and secret in the cell below (as `api_key` and `api_secret`).

In [46]:
# The code was removed by Watson Studio for sharing.

In [47]:
def api_searcher(city_name, latitude, longitude):
    """Query the Foursquare explore endpoint for each city and return one row per venue."""
    url = "https://api.foursquare.com/v2/venues/explore"
    venues = []

    for city, lat, lng in zip(city_name, latitude, longitude):
        params = dict(
            client_id=api_key,
            client_secret=api_secret,
            v=datetime.datetime.now().strftime("%Y%m%d"),  # API version date
            ll="{},{}".format(lat, lng),
            limit=20,     # venues per city
            radius=5000,  # search radius in meters
        )

        result = requests.get(url=url, params=params).json()
        items = result["response"]["groups"][0]["items"]

        # Keep name, coordinates and first listed category of each venue
        for item in items:
            venue = item["venue"]
            venues.append((
                city,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                venue["categories"][0]["name"],
            ))

    return pd.DataFrame(venues,
                        columns=["City", "Venue_name",
                                 "Venue_latitude", "Venue_longitude", "Venue_category"])
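Since a request to the API can fail temporarily, rerunning by hand could be replaced by a small retry wrapper. The following is a minimal sketch under assumptions: `retry` and `flaky_call` are hypothetical helpers that are not part of this notebook, and the linear backoff is just one possible choice.

```python
import time


def retry(func, attempts=3, delay_seconds=1.0):
    """Call `func` until it succeeds or `attempts` calls have failed."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds * (attempt + 1))  # linear backoff
    raise last_error


# Stand-in for a flaky API call: fails twice, then succeeds
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry(flaky_call, attempts=3, delay_seconds=0.01)
```

In this notebook it could wrap the venue lookup, for example `retry(lambda: api_searcher(df["City"], df["Latitude"], df["Longitude"]))`.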

## If the following cell produces an error, please try running it again

In [48]:
# Creation of dataframe with venue data
venue_df = api_searcher(df["City"], df["Latitude"], df["Longitude"])
venue_df

Unnamed: 0,City,Venue_name,Venue_latitude,Venue_longitude,Venue_category
0,New York,The Bar Room at Temple Court,40.711448,-74.006802,Hotel Bar
1,New York,Four Seasons Hotel New York Downtown,40.712612,-74.009380,Hotel
2,New York,Korin,40.714824,-74.009404,Furniture / Home Store
3,New York,Aire Ancient Baths,40.718141,-74.004941,Spa
4,New York,One World Trade Center,40.713069,-74.013133,Building
5,New York,9/11 Memorial North Pool,40.712077,-74.013187,Memorial Site
6,New York,Washington Market Park,40.717046,-74.011095,Playground
7,New York,National September 11 Memorial & Museum,40.711451,-74.013433,Memorial Site
8,New York,Crown Shy,40.706187,-74.007490,Restaurant
9,New York,Liberty Park,40.710384,-74.013868,Park


## Folium map of all the venues in all the cities that were used as a basis for the analysis

In [49]:
# Named venue_map to avoid shadowing Python's built-in map()
venue_map = folium.Map(location=[39.381266, -97.922211], zoom_start=3)
for name, lat, lng in zip(venue_df["Venue_name"], venue_df["Venue_latitude"], venue_df["Venue_longitude"]):
    name = name.replace("'", "")  # remove apostrophes from the popup text
    folium.CircleMarker(
        location=[lat, lng],
        fill=True,
        fill_color="black",
        radius=10,
        popup=name
    ).add_to(venue_map)
venue_map

### Creation of dataframe --- pre processing for machine learning

In [50]:
global_onehot = pd.get_dummies(venue_df["Venue_category"])
global_onehot.insert(loc=0, column="City", value= venue_df["City"])
global_grouped = global_onehot.groupby(by="City").mean().reset_index()
global_grouped.head()

Unnamed: 0,City,Accessories Store,American Restaurant,Amphitheater,Aquarium,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Used Bookstore,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Waterfront,Wine Bar,Wine Shop,Yoga Studio,Zoo
0,Albuquerque,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05
1,Arlington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.05,0.05,0.0,0.0,0.05,0.0,0.0
2,Atlanta,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0
3,Austin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Baltimore,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
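To make the numbers in this table easier to read: each value is the share of a city's venues that fall into a category, so with `limit=20` a single venue contributes 0.05. The toy example below uses made-up cities and categories to show the same one-hot and group-mean steps on a scale that can be checked by hand.

```python
import pandas as pd

# Made-up venue list: city A has 4 venues, city B has 2
toy = pd.DataFrame({
    "City": ["A", "A", "A", "A", "B", "B"],
    "Venue_category": ["Park", "Park", "Bar", "Hotel", "Bar", "Bar"],
})

# One indicator column per category, then the mean per city
onehot = pd.get_dummies(toy["Venue_category"], dtype=float)
onehot.insert(loc=0, column="City", value=toy["City"])
grouped = onehot.groupby("City").mean().reset_index()

shares = grouped.set_index("City")
```

Here `shares.loc["A", "Park"]` is 0.5 because two of the four venues in city A are parks, and each row sums to 1.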


### Creation of summary dataframe that includes most common categories

In [51]:
top_venue_categories = 10
appendices = ["st", "nd", "rd"]
columns = []

# Create columns of new dataframe ("1st", "2nd", "3rd", then "4th" onwards)
for i in range(top_venue_categories):
    try:
        columns.append("{}{} most common category".format(i+1, appendices[i]))
    except IndexError:
        columns.append("{}th most common category".format(i+1))

# Create dataframe
summary_df = pd.DataFrame(columns= columns)
summary_df.insert(loc=0, column="City", value=global_grouped["City"])

def sort_rows_by_frequency(row, top_values):
    row_categories_sorted = row.sort_values(ascending=False)
    return row_categories_sorted.index.values[:top_values]

# Populate dataframe
for i in np.arange(global_grouped.shape[0]):
    summary_df.iloc[i, 1:] = sort_rows_by_frequency(global_grouped.iloc[i, 1:], top_venue_categories)
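The ranking step can be checked on a small example. The sketch below uses made-up frequencies and a hypothetical helper `top_categories` that mirrors `sort_rows_by_frequency`.

```python
import pandas as pd

# Made-up category frequencies for two cities
freq = pd.DataFrame(
    {"Bar": [0.50, 0.10], "Hotel": [0.20, 0.60], "Park": [0.30, 0.30]},
    index=["A", "B"],
)


def top_categories(row, n):
    # Sort one city's frequencies in descending order and keep the n largest labels
    return row.sort_values(ascending=False).index[:n].tolist()


top2 = {city: top_categories(freq.loc[city], 2) for city in freq.index}
```

Note that in the real data many categories tie at 0 or 0.05, so the order among tied categories carries no meaning.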

# Machine learning

In [52]:
cluster_df = global_grouped.drop("City", axis=1)
num_clusters = 4

#Train model
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(cluster_df)
summary_df.insert(loc=1, column= "Cluster_Label", value=kmeans.labels_)

# Join two dataframes for better overview
summary_df = summary_df.join(df.set_index("City"), on="City")
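The number of clusters is fixed at 4 above. One common way to sanity-check such a choice is to compare silhouette scores for several values of k. The sketch below illustrates the idea on synthetic data from `make_blobs`, not on the venue data, and is not part of the original analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (for illustration only)
features, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit k-means for several k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

# The k with the highest score separates the data most cleanly
best_k = max(scores, key=scores.get)
```

Applied to `cluster_df`, the same loop would show whether 4 clusters is a reasonable choice for the venue frequencies.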

## Create folium map with clustered cities

In [53]:
# Pick one evenly spaced color per cluster from the jet colormap
colors_array = cm.jet(np.linspace(0, 1, num_clusters))
colormap = [colors.rgb2hex(i) for i in colors_array]

# Create folium map
cluster_map = folium.Map(location=[39.381266, -97.922211], zoom_start=4)

for city, lat, lng, cluster, size in zip(summary_df["City"], summary_df["Latitude"], summary_df["Longitude"], summary_df["Cluster_Label"], summary_df["Size_Folium"]):
    lat = float(lat)
    lng = float(lng)
    folium.CircleMarker(location=[lat, lng],
                        radius=size + 10,
                        popup="{} ---> Cluster {}".format(city, cluster),
                        fill= True,
                        fill_color= colormap[cluster],
                        fill_opacity= 1
    ).add_to(cluster_map)

cluster_map


# To solve the family's problem, we would recommend a city that has the same cluster label as New York but a smaller population
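The selection can also be done in code rather than by reading the map. The sketch below uses a toy dataframe whose cluster labels and sizes are invented for illustration. Only the column names match `summary_df`.

```python
import pandas as pd

# Toy stand-in for summary_df; labels and sizes are made up
toy_summary = pd.DataFrame({
    "City": ["New York", "Chicago", "Austin", "Boston"],
    "Cluster_Label": [2, 2, 0, 2],
    "Size_Folium": [8.4, 2.7, 0.96, 0.69],
})

# Look up New York's cluster label
ny_cluster = toy_summary.loc[toy_summary["City"] == "New York", "Cluster_Label"].iloc[0]

# Keep the other cities in that cluster, smallest first
same_cluster = toy_summary["Cluster_Label"] == ny_cluster
not_ny = toy_summary["City"] != "New York"
candidates = (
    toy_summary[same_cluster & not_ny]
    .sort_values("Size_Folium")
    .reset_index(drop=True)
)
```

Run against `summary_df`, the same filter would list the candidate cities for the family, ordered from smallest to largest.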