# Introduction

### Background

As globalization advances, it is increasingly common for people to relocate between cities and countries.

Eating out and entertainment are a major part of daily life, so it matters that people have access to places they enjoy. Weather and culture also differ from city to city and are important factors. A tool that compares how similar cities are would help people who are considering a move.

I will therefore use data such as venue types, restaurant counts, climate data, and demographics to build a tool that compares cities directly and clusters similar ones into groups. I hope it will be useful to people considering a move.

In particular, I will cluster cities in North America into several groups.

#### Stakeholder/Target Audience

The target audience is people who are currently looking for a new city to move to, people who want to compare the living environment of different cities or neighborhoods, and anyone who is curious about the similarities among cities.

# Data

- Foursquare data
- Geolocation data
- Climate data
- House prices: median house price
- Demographics

### Load Packages

In [97]:
import pandas as pd
import requests
from pathlib import Path
from bs4 import BeautifulSoup 

### Example: Fetch climate data for different cities from Wikipedia for later clustering

In [114]:
cities = {
    "New_York_City": "https://en.wikipedia.org/wiki/New_York_City",
    "Toronto": "https://en.wikipedia.org/wiki/Toronto",
    "Vancouver": "https://en.wikipedia.org/wiki/Vancouver",
    "Boston": "https://en.wikipedia.org/wiki/Boston",
    "Montreal": "https://en.wikipedia.org/wiki/Montreal",
    "San_Francisco": "https://en.wikipedia.org/wiki/San_Francisco",
    "Seattle": "https://en.wikipedia.org/wiki/Seattle",
    "Edmonton": "https://en.wikipedia.org/wiki/Edmonton",
    "Calgary": "https://en.wikipedia.org/wiki/Calgary",
    "Los_Angeles": "https://en.wikipedia.org/wiki/Los_Angeles",
    "Chicago": "https://en.wikipedia.org/wiki/Chicago",
    "Houston": "https://en.wikipedia.org/wiki/Houston",
}

In [99]:
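def read_file(file):
    # Read as UTF-8: the pages contain non-ASCII characters such as "°" and "−"
    with open(file, encoding="utf-8") as f:
        content = f.read()
    return content

def write_file(file, content):
    # Write as UTF-8 so the cached copy is readable on every platform
    with open(file, 'w', encoding="utf-8") as f:
        f.write(content)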
    

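def get_page_content(name, url):
    """
    Get web page content and cache it locally
    """
    storage_dir = Path("./data")
    # Create the cache directory on first use so the write below cannot fail
    storage_dir.mkdir(parents=True, exist_ok=True)
    file = storage_dir / (name + ".html")

    if file.exists():
        # Reuse the cached copy instead of requesting the page again
        content = read_file(file)
    else:
        response = requests.get(url, timeout=30)
        # Fail loudly rather than caching an error page
        response.raise_for_status()
        content = response.text
        write_file(file, content)

    return content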

In [109]:
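def parse_climate_data(page_content):
    """
    Parse Wikipedia climate data
    """
    # Imported here so this cell is self-contained
    from io import StringIO

    # Parse HTML file
    soup = BeautifulSoup(page_content, 'lxml')

    for table in soup.find_all('table'):
        first_row = table.find('tr')
        # Locate the table whose first row contains "Climate data";
        # tables without any row are skipped instead of raising
        if first_row is not None and 'Climate data' in first_row.text:
            # Newer pandas versions deprecate passing literal HTML strings
            return pd.read_html(StringIO(str(table)))[0]

    # No climate table found on this page
    return None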

In [112]:
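def process_climate_table(page_content):
    """
    Process Wikipedia climate table
    """
    # Get climate data table
    df = parse_climate_data(page_content)
    if df is None:
        raise ValueError("No climate data table found in page content")

    # Rename the columns using the row that holds the month names
    header = df.iloc[1]
    df.rename(columns=header, inplace=True)

    # Use the first column as index
    df.set_index("Month", inplace=True)

    # Drop the two leading header rows and the two trailing footer rows
    df = df.iloc[2:-2, :]
    return df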
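
The table returned by `process_climate_table` holds strings rather than numbers: a cell can combine two units, as in `62.0(16.7)`, use the Unicode minus sign, or carry thousands separators. Before clustering, these cells need to become floats. The helper below is a minimal sketch of that step; `parse_climate_cell` is a hypothetical name, not part of the original notebook, and it assumes every number in a cell is written in plain decimal notation.

```python
import re

# Matches an optional sign followed by an integer or decimal number
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def parse_climate_cell(cell):
    """Return every number found in a climate table cell as a list of floats.

    Hypothetical helper: the first element is the value in the row's
    primary unit, the second (if present) is the converted value.
    """
    text = str(cell)
    # Replace the Unicode minus sign (U+2212) with an ASCII hyphen
    text = text.replace("\u2212", "-")
    # Drop thousands separators such as the comma in "1,268"
    text = text.replace(",", "")
    return [float(number) for number in NUMBER_PATTERN.findall(text)]


print(parse_climate_cell("62.0(16.7)"))          # [62.0, 16.7]
print(parse_climate_cell("\u221215(\u221226)"))  # [-15.0, -26.0]
print(parse_climate_cell("49.94(1,268)"))        # [49.94, 1268.0]
```

The primary unit differs between pages (for example `°F (°C)` for New York City but `°C (°F)` for Toronto), so the row label should decide which element to keep in order to compare cities in a common unit.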

In [115]:
for city, url in cities.items():
    city_page_content = get_page_content(city, url)
    df = process_climate_table(city_page_content)
    print(f"\n==== {city} ====\n")
    # The last column holds the yearly summary
    print(df.iloc[:, -1])


==== New_York_City ====

Month
Record high °F (°C)                            106(41)
Mean maximum °F (°C)                        97.0(36.1)
Average high °F (°C)                        62.0(16.7)
Average low °F (°C)                          48.0(8.9)
Mean minimum °F (°C)                        7.0(−13.9)
Record low °F (°C)                            −15(−26)
Average precipitation inches (mm)         49.94(1,268)
Average snowfall inches (cm)                  25.8(66)
Average precipitation days (≥ 0.01 in)           122.0
Average snowy days (≥ 0.1 in)                     11.4
Average relative humidity (%)                     63.0
Mean monthly sunshine hours                     2534.7
Percent possible sunshine                           57
Average ultraviolet index                            5
Name: Year, dtype: object

==== Toronto ====

Month
Record high humidex                              44.5
Record high °C (°F)                       40.6(105.1)
Average high °C (°F)                  