**CAPSTONE PROJECT: BATTLE OF THE NEIGHBORHOODS**

Tokyo, Japan: Cinemas and Expatriates Location Recommendation

**PURPOSE**

This document provides the details of my final peer-reviewed assignment for the IBM Data Science Professional Certificate program.

**INTRODUCTION**

The cinema of Japan (日本映画 Nihon eiga, also known domestically as 邦画 hōga, "domestic cinema") has a history that spans more than 100 years. Japan has one of the oldest and largest film industries in the world; as of 2010, it was the fourth largest by number of feature films produced. In 2011 Japan produced 411 feature films that earned 54.9% of a box office total of US$2.338 billion. Films have been produced in Japan since 1897, when the first foreign cameramen arrived.

The sample recommender in this notebook addresses the following use case:

* A person is planning to open a new cinema.
* The user wants location recommendations for opening a new cinema as a company's new business, in close proximity to places of interest or to a chosen search category.
* The recommendation should not only present the most viable option but also include a comparison table of all candidate locations.

For this demonstration, this notebook makes use of the following data:
* A list of cinemas in Tokyo.
* Popular cinema locations in the vicinity (sample category selection).

Note: While this demo uses the cinema location category, the same implementation can be applied to other categories, such as:
* Outdoors and Recreation
* Nightlife
* Nearby Schools, etc.

I will limit the scope of this search because the FourSquare API allows only 50 free venue queries per day with a free user account.

**DATA ACQUISITION**

This demonstration will make use of the following data sources:

A list of Tokyo cinemas and their geographic coordinates.
Data will be retrieved from an open dataset at 'https://hkmovie6.com/cinema'.

The original data source contains a list of Tokyo cinemas and their geographic coordinates. I will retrieve the most recent records from this data source, as they are the most relevant location data available at this time. For this demonstration, I will simplify the analysis by using the average rental prices of all available flat types.

Tokyo cinema location data retrieved using the Google Maps API.
Coordinates of cinemas will be retrieved using the Google Maps API. I also use MRT station coordinates as a more important center for all towns included in the venue recommendations.

Tokyo Cinema location Recommendations from FourSquare API
(FourSquare website: www.foursquare.com)

I will use the FourSquare API to explore the neighborhoods around selected cinemas in Tokyo. The FourSquare explore function will be used to get the most common venue categories at each location, and this feature will then be used to group the locations into clusters. The following information is retrieved by the first query:
* Venue ID
* Venue Name
* Coordinates : Latitude and Longitude
* Category Name

Another venue query will be performed to retrieve venue ratings for each location. Note that rating information is a paid service from FourSquare, and we are limited to 50 queries per day. Given this constraint, the category analysis is limited to a single category type for this demo. I will try to retrieve as many ratings as possible for each retrieved venue ID.

**METHODOLOGY**

A list of Tokyo cinemas and their geographic coordinates.
The source data contains cinemas in Tokyo. For this demonstration, I will retrieve the most recent records from this data source, as they are the most relevant cinema locations available at this time.

**Data Cleanup and re-grouping.** The retrieved table contains some unwanted entries and needs some cleanup.

The following tasks will be performed:
* Drop/ignore cells with missing data.
* Use most current data record.
* Fix data types.
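
The three cleanup tasks above can be sketched with pandas as follows. This is only an illustration on a made-up frame: the column names `Name`, `Updated` and `Latitude`, and all sample values, are assumptions and are not taken from the cinema dataset.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
raw = pd.DataFrame({
    'Name': ['Cinema A', 'Cinema A', 'Cinema B', None],
    'Updated': ['2019-01-01', '2019-06-01', '2019-03-01', '2019-04-01'],
    'Latitude': ['35.1', '35.2', '35.3', '35.4'],
})

# 1. Drop rows with missing data.
clean = raw.dropna()

# 2. Keep only the most recent record per name.
clean = clean.assign(Updated=pd.to_datetime(clean['Updated']))
clean = clean.sort_values('Updated').drop_duplicates('Name', keep='last')

# 3. Fix data types (coordinates arrive as strings in this sample).
clean = clean.assign(Latitude=clean['Latitude'].astype(float))
clean = clean.reset_index(drop=True)
```

Sorting before `drop_duplicates(keep='last')` is what makes "most current record" well defined; without the sort, the surviving row would depend on the input order.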

**Importing Python Libraries**

This section imports the Python libraries required for processing data. <br>
While this first part of the notebook is for data acquisition, some of the libraries will also be used for data visualization.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files


import requests # library to handle requests
try:
    from pandas import json_normalize # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize # older pandas; transforms JSON into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [18]:
# Import necessary library
import json
import pandas as pd

In [19]:
# Download the cinema list
!wget -O hk_cinema_list.json https://hkmovie6.com/api/cinemas/lists

--2019-10-01 20:39:29--  https://hkmovie6.com/api/cinemas/lists
Resolving hkmovie6.com (hkmovie6.com)... 104.31.67.1, 104.31.66.1, 2606:4700:30::681f:4301, ...
Connecting to hkmovie6.com (hkmovie6.com)|104.31.67.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘hk_cinema_list.json’

hk_cinema_list.json     [ <=>                ]  54.95K   316KB/s    in 0.2s    

2019-10-01 20:39:30 (316 KB/s) - ‘hk_cinema_list.json’ saved [56269]



In [22]:
# Convert the JSON data into DataFrame
cinemas_json = None
with open('hk_cinema_list.json', 'r', encoding='utf-8') as f:
    cinemas_json = json.load(f)
    
cinemas = []
for data in cinemas_json['data']:    
    cinemas.append({
        'Name': data['name'],
        'ChiName': data['chiName'],
        'Address': data['address']
    })
df_cinemas = pd.DataFrame(cinemas, columns=['Name','ChiName','Address','Latitude','Longitude'])

In [23]:
print('There are {} cinemas in Tokyo'.format(len(df_cinemas)))

There are 72 cinemas in Tokyo


In [4]:
df_cinemas.head()

Unnamed: 0,Name,ChiName,Address,Latitude,Longitude
0,Emperor Cinemas - Entertainment Building,英皇戲院 - 娛樂行,"3/F, Emperor Cinemas Entertainment Building, 3...",,
1,Emperor Cinemas - Ma On Shan,英皇戲院 - 馬鞍山新港城中心,"L2, MOSTown, Sai Sha Road, Ma On Shan, N.T.",,
2,Emperor Cinemas - Tuen Mun,英皇戲院 - 屯門新都商場,"3/F, New Town Commercial Arcade, 2 Tuen Lee St...",,
3,The Coronet @ Emperor Cinemas - Entertainment ...,The Coronet @ 英皇戲院 - 娛樂行,"3/F, Emperor Cinemas Entertainment Building, 3...",,
4,Festival Grand Cinema,Festival Grand Cinema,"Level UG, Festival Walk, 80 Tat Chee Avenue, K...",,


**Geographic coordinates of 5 possible cinema addresses**

Geographic coordinates of the 5 possible cinema locations are required; I can use the Google Maps API to find this information.

In [5]:
possible_locations = [
    { 'Location': 'L1', 'Address': 'Sau Mau Ping Shopping Centre, Sau Mau Ping'},
    { 'Location': 'L2', 'Address': 'Tuen Mun Ferry, Tuen Mun'},
    { 'Location': 'L3', 'Address': 'Un Chau Shopping Centre, Cheung Sha Wan'},
    { 'Location': 'L4', 'Address': 'Prosperity Millennia Plaza, North Point'},
    { 'Location': 'L5', 'Address': 'Tsuen Fung Centre Shopping Arcade, Tsuen Wan'},
]

In [6]:

# install the google map api client library
!pip install -U googlemaps

Collecting googlemaps
  Downloading https://files.pythonhosted.org/packages/9b/33/b93685916130c07325645d06a765dae23f4655b7aeb79c8a96fe9f552e26/googlemaps-3.1.3-py3-none-any.whl
Installing collected packages: googlemaps
Successfully installed googlemaps-3.1.3



Retrieve a DataFrame of the 5 target locations with their geographic coordinates

In [None]:
google_act = None
with open('google_map_act.json', 'r') as f:
    google_act = json.load(f)
    
GOOGLE_MAP_API_KEY = google_act['api_key']    

import googlemaps
gmaps = googlemaps.Client(key=GOOGLE_MAP_API_KEY)

In [None]:

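# Retrieve geolocation and create the dataframe of pending cinema addresses
def getLatLng(address):
    # Geocode the address (the query string appends ', Hong Kong')
    results = gmaps.geocode('{}, Hong Kong'.format(address))
    if not results:
        # No match: return missing coordinates instead of raising IndexError
        return (None, None)
    location = results[0]['geometry']['location']
    return (location['lat'], location['lng'])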

In [None]:
for loc in possible_locations:        
    (lat, lng) = getLatLng(loc['Address'])
    loc['Latitude'] = lat
    loc['Longitude'] = lng
    
df_possible_locations = pd.DataFrame(possible_locations, columns=['Location', 'Address', 'Latitude', 'Longitude'])
df_possible_locations

The stakeholder's most preferred cinemas

In [9]:
favorite = [
    {'Name': 'Broadway Circuit - MONGKOK', 'Rating': 4.5},
    {'Name': 'Broadway Circuit - The ONE', 'Rating': 4.5},
    {'Name': 'Grand Ocean', 'Rating': 4.3},
    {'Name': 'The Grand Cinema', 'Rating': 3.4},
    {'Name': 'AMC Pacific Place', 'Rating': 2.3},
    {'Name': 'UA IMAX @ Airport', 'Rating': 1.5},
]

df_favorite = pd.DataFrame(favorite, columns=['Name','Rating'])
df_favorite

Unnamed: 0,Name,Rating
0,Broadway Circuit - MONGKOK,4.5
1,Broadway Circuit - The ONE,4.5
2,Grand Ocean,4.3
3,The Grand Cinema,3.4
4,AMC Pacific Place,2.3
5,UA IMAX @ Airport,1.5


**Eating, Shopping and Public transportation facilities around the cinema**

The recommended cinema location needs to have many eating and shopping venues nearby. Convenient public transport is also required.
These data can be obtained by using the FourSquare API to search for such venues around the location. The exploration radius is set to 500 meters, which is about 5 minutes' walking distance.
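
As a rough check on that radius, the sketch below converts a distance into walking time and tests whether a venue lies within the radius using the haversine great-circle formula. The helper names `haversine_m`, `walking_minutes` and `within_radius`, and the assumed walking speed of 5 km/h, are illustrative choices and not part of the FourSquare client.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def walking_minutes(distance_m, speed_kmh=5.0):
    """Walking time in minutes at an assumed constant speed."""
    return distance_m / (speed_kmh * 1000 / 60)

def within_radius(origin, venue, radius_m=500):
    """True if `venue` (lat, lng) is at most `radius_m` metres from `origin`."""
    return haversine_m(*origin, *venue) <= radius_m
```

At the assumed 5 km/h, `walking_minutes(500)` gives 6 minutes, which is in line with the "about 5 minutes" estimate; a slightly brisker pace of 6 km/h gives exactly 5.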

The following venue categories will be used in the search:

In [11]:
fs_categories = {
    'Food': '4d4b7105d754a06374d81259',
    'Shop & Service': '4d4b7105d754a06378d81259',
    'Bus Stop': '52f2ab2ebcbc57f1066b8b4f',
    'Metro Station': '4bf58dd8d48988d1fd931735',
    'Nightlife Spot': '4d4b7105d754a06376d81259',
    'Arts & Entertainment': '4d4b7104d754a06370d81259'
}

In [12]:

', '.join([ cat for cat in fs_categories])

'Food, Shop & Service, Bus Stop, Metro Station, Nightlife Spot, Arts & Entertainment'

In [13]:
cinema = df_cinemas.loc[0]

In [14]:
print('Use the first cinema "{}" in the list as example to explore venues nearby'.format(cinema['Name']))

Use the first cinema "Emperor Cinemas - Entertainment Building" in the list as example to explore venues nearby


In [15]:
# Install FourSquare client library
!pip install foursquare

Collecting foursquare
  Downloading https://files.pythonhosted.org/packages/0b/e7/02438dddc98f19f998e1d4b962ab6bb8c37b90fa37e33a6678ce18b85f56/foursquare-1%212019.9.11.tar.gz
Building wheels for collected packages: foursquare
  Building wheel for foursquare (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/53/6c/d9/0810f42ef7521037af97032caab9411144ab0efab2aed8300f
Successfully built foursquare
Installing collected packages: foursquare
Successfully installed foursquare-1!2019.9.11


In [24]:
fs_act = None
with open('fs_act.json') as json_data:
    fs_act = json.load(json_data)

FileNotFoundError: [Errno 2] No such file or directory: 'fs_act.json'

In [21]:

import foursquare
# json_normalize is already imported in the first cell
fs = foursquare.Foursquare(client_id=fs_act['client_id'], client_secret=fs_act['client_secret'])

TypeError: 'NoneType' object is not subscriptable

In [25]:
RADIUS = 500 # 500m, around 5 minutes walking time

In [None]:
# Define a function to search nearby information and convert the result as dataframe
def venues_nearby(latitude, longitude, category, verbose=True):    
    results = fs.venues.search(
        params = {
            'query': category, 
            'll': '{},{}'.format(latitude, longitude),
            'radius': RADIUS,
            'categoryId': fs_categories[category]
        }
    )    
    df = json_normalize(results['venues'])
    cols = ['Name','Latitude','Longitude','Tips','Users','Visits']    
    if( len(df) == 0 ):        
        df = pd.DataFrame(columns=cols)
    else:        
        df = df[['name','location.lat','location.lng','stats.tipCount','stats.usersCount','stats.visitsCount']]
        df.columns = cols
    if( verbose ):
        print('{} "{}" venues are found within {}m of location'.format(len(df), category, RADIUS))
    return df

Find Metro Station around the cinema

In [26]:
venues_nearby(cinema['Latitude'], cinema['Longitude'], 'Metro Station').head()


NameError: name 'venues_nearby' is not defined


Find Bus Stop around the cinema

In [None]:
venues_nearby(cinema['Latitude'], cinema['Longitude'], 'Bus Stop').head()

Find eating places around the cinema

In [None]:
venues_nearby(cinema['Latitude'], cinema['Longitude'], 'Food').head()

In [None]:
venues_nearby(cinema['Latitude'], cinema['Longitude'], 'Arts & Entertainment').head()

<h1>Data Cleansing and Preparation</h1>

In [29]:
duplicated = df_cinemas.duplicated('Address', keep=False)
df_cinemas[duplicated].sort_values('Address')

Unnamed: 0,Name,ChiName,Address,Latitude,Longitude
18,Cinema City VICTORIA (Causeway Bay),Cinema City VICTORIA (銅鑼灣),"2-8 Sugar Street, Causeway Bay, Hong Kong",,
19,Diamond Suite VIP House @ Cinema City VICTORIA...,Diamond Suite VIP House @ Cinema City VICTORIA...,"2-8 Sugar Street, Causeway Bay, Hong Kong",,
0,Emperor Cinemas - Entertainment Building,英皇戲院 - 娛樂行,"3/F, Emperor Cinemas Entertainment Building, 3...",,
3,The Coronet @ Emperor Cinemas - Entertainment ...,The Coronet @ 英皇戲院 - 娛樂行,"3/F, Emperor Cinemas Entertainment Building, 3...",,
47,IMAX @ UA Cine Moko,IMAX @ UA Cine Moko,"L4, MOKO, 193 Prince Edward Road West, Mongkok...",,
51,UA Cine Moko,UA Cine Moko,"L4, MOKO, 193 Prince Edward Road West, Mongkok...",,
48,IMAX @ UA MegaBox,IMAX @ UA MegaBox,"Level 11, MegaBox, Enterprise Square 5, 38 Wan...",,
49,Oscars Club @ UA MegaBox,Oscars Club @ UA MegaBox,"Level 11, MegaBox, Enterprise Square 5, 38 Wan...",,
53,UA MegaBox,UA MegaBox,"Level 11, MegaBox, Enterprise Square 5, 38 Wan...",,
14,Blackbox @ K11 Art House,Blackbox @ K11 Art House,"Shop 415, 4/F, Victoria Dockside K11 MUSEA, 18...",,



Some "special houses" within a cinema are listed as separate cinemas on www.hkmovie6.com.
These records are duplicates for my purposes and should be corrected.

In [30]:
# The Grand SC Starsuite -> The Grand Cinema
df_cinemas.loc[29, 'Name'] = 'The Grand Cinema'

# XXX @ UA MegaBox -> UA MegaBox
df_cinemas.loc[44, 'Name'] = 'UA MegaBox'
df_cinemas.loc[45, 'Name'] = 'UA MegaBox'

# BEA IMAX @ UA Cine Moko -> UA Cine Moko
df_cinemas.loc[42, 'Name'] = 'UA Cine Moko'

# XXX @ UA iSQUARE -> iSQUARE
df_cinemas.loc[43, 'Name'] = 'UA iSQUARE'
df_cinemas.loc[46, 'Name'] = 'UA iSQUARE'

# Emperor Cinemas - Entertainment Building
df_cinemas.loc[1, 'Name'] = 'Emperor Cinemas - Entertainment Building'

# Cinema City VICTORIA (Causeway Bay)
df_cinemas.loc[6, 'Name'] = 'Cinema City VICTORIA (Causeway Bay)'
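
The renames above address rows by index label, so they hit the wrong row if the source list changes order. A label-independent alternative is sketched below, assuming that a special house is named `<house> @ <parent cinema>` and that its parent cinema is listed at the same address. The helper `parent_cinema` and the sample frame (with shortened, made-up addresses) are hypothetical.

```python
import pandas as pd

def parent_cinema(name, names_at_address):
    """Return the parent cinema if `name` is '<house> @ <parent>' and the
    parent is also listed at the same address; otherwise return `name`."""
    if ' @ ' in name:
        parent = name.split(' @ ', 1)[1]
        if parent in names_at_address:
            return parent
    return name

# Illustrative rows only; addresses are shortened placeholders.
sample = pd.DataFrame({
    'Name': ['UA MegaBox', 'IMAX @ UA MegaBox', 'UA IMAX @ Airport'],
    'Address': ['Level 11, MegaBox', 'Level 11, MegaBox', 'Terminal 2'],
})

# Collect the set of names registered at each address.
names_by_address = {
    address: set(group['Name'])
    for address, group in sample.groupby('Address')
}

sample['Name'] = [
    parent_cinema(name, names_by_address[address])
    for name, address in zip(sample['Name'], sample['Address'])
]
deduplicated = sample.drop_duplicates(['Name', 'Address'])
```

Checking that the parent is present at the same address keeps stand-alone names such as 'UA IMAX @ Airport' unchanged, which a plain split on `@` would not.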

In [31]:
df_cinemas[duplicated]


Unnamed: 0,Name,ChiName,Address,Latitude,Longitude
0,Emperor Cinemas - Entertainment Building,英皇戲院 - 娛樂行,"3/F, Emperor Cinemas Entertainment Building, 3...",,
3,The Coronet @ Emperor Cinemas - Entertainment ...,The Coronet @ 英皇戲院 - 娛樂行,"3/F, Emperor Cinemas Entertainment Building, 3...",,
14,Blackbox @ K11 Art House,Blackbox @ K11 Art House,"Shop 415, 4/F, Victoria Dockside K11 MUSEA, 18...",,
15,IMAX @ K11 Art House,IMAX @ K11 Art House,"Shop 415, 4/F, Victoria Dockside K11 MUSEA, 18...",,
16,K11 Art House,K11 Art House,"Shop 415, 4/F, Victoria Dockside K11 MUSEA, 18...",,
18,Cinema City VICTORIA (Causeway Bay),Cinema City VICTORIA (銅鑼灣),"2-8 Sugar Street, Causeway Bay, Hong Kong",,
19,Diamond Suite VIP House @ Cinema City VICTORIA...,Diamond Suite VIP House @ Cinema City VICTORIA...,"2-8 Sugar Street, Causeway Bay, Hong Kong",,
47,IMAX @ UA Cine Moko,IMAX @ UA Cine Moko,"L4, MOKO, 193 Prince Edward Road West, Mongkok...",,
48,IMAX @ UA MegaBox,IMAX @ UA MegaBox,"Level 11, MegaBox, Enterprise Square 5, 38 Wan...",,
49,Oscars Club @ UA MegaBox,Oscars Club @ UA MegaBox,"Level 11, MegaBox, Enterprise Square 5, 38 Wan...",,


In [32]:

df_cinemas.drop_duplicates('Address', inplace=True, keep='first')

In [33]:
df_cinemas[df_cinemas.duplicated('Name')]

Unnamed: 0,Name,ChiName,Address,Latitude,Longitude
1,Emperor Cinemas - Entertainment Building,英皇戲院 - 馬鞍山新港城中心,"L2, MOSTown, Sai Sha Road, Ma On Shan, N.T.",,
18,Cinema City VICTORIA (Causeway Bay),Cinema City VICTORIA (銅鑼灣),"2-8 Sugar Street, Causeway Bay, Hong Kong",,
45,UA MegaBox,Cinema City 朗豪坊,"Level 8-11, Langham Place, 8 Argyle Street, Mo...",,
46,UA iSQUARE,UA Cine Times,"13/F, Times Square, 1 Matheson St., Causeway Bay",,


In [34]:
df_cinemas.head()

Unnamed: 0,Name,ChiName,Address,Latitude,Longitude
0,Emperor Cinemas - Entertainment Building,英皇戲院 - 娛樂行,"3/F, Emperor Cinemas Entertainment Building, 3...",,
1,Emperor Cinemas - Entertainment Building,英皇戲院 - 馬鞍山新港城中心,"L2, MOSTown, Sai Sha Road, Ma On Shan, N.T.",,
2,Emperor Cinemas - Tuen Mun,英皇戲院 - 屯門新都商場,"3/F, New Town Commercial Arcade, 2 Tuen Lee St...",,
4,Festival Grand Cinema,Festival Grand Cinema,"Level UG, Festival Walk, 80 Tat Chee Avenue, K...",,
5,Grand Kornhill Cinema,康怡戲院,"4/F, Kornhill Plaza South, 2 Kornhill Road, Qu...",,


In [35]:
df_cinemas['ChiName'].to_frame()

Unnamed: 0,ChiName
0,英皇戲院 - 娛樂行
1,英皇戲院 - 馬鞍山新港城中心
2,英皇戲院 - 屯門新都商場
4,Festival Grand Cinema
5,康怡戲院
6,皇室戲院
7,MCL 長沙灣戲院
8,MCL 粉嶺戲院
9,MCL 新都城戲院
10,MCL 海怡戲院


The venues '新光戲院大劇場' and '大館' should not be considered cinemas in Tokyo. These records must be removed.

In [36]:
df_cinemas.drop(index=[65,67], inplace=True)

In [37]:
df_cinemas.drop(axis=1, columns=['ChiName'], inplace=True)

In [38]:
df_cinemas.head()

Unnamed: 0,Name,Address,Latitude,Longitude
0,Emperor Cinemas - Entertainment Building,"3/F, Emperor Cinemas Entertainment Building, 3...",,
1,Emperor Cinemas - Entertainment Building,"L2, MOSTown, Sai Sha Road, Ma On Shan, N.T.",,
2,Emperor Cinemas - Tuen Mun,"3/F, New Town Commercial Arcade, 2 Tuen Lee St...",,
4,Festival Grand Cinema,"Level UG, Festival Walk, 80 Tat Chee Avenue, K...",,
5,Grand Kornhill Cinema,"4/F, Kornhill Plaza South, 2 Kornhill Road, Qu...",,



Check the shape of cinemas dataset

In [39]:
df_cinemas.shape

(63, 4)


Now I can use the FourSquare API to explore nearby venues of Hong Kong cinemas

In [None]:
from pathlib import Path

venues_csv = Path('./cinemas_venues.csv')
df_venues = None

# reuse previously downloaded venue data if it exists
if venues_csv.exists():
    df_venues = pd.read_csv(venues_csv)
else:
    # collect one dataframe per (cinema, category) query and combine them at the end
    frames = []
    for (name, address, latitude, longitude) in df_cinemas.itertuples(index=False):
        for cat in fs_categories:
            df = venues_nearby(latitude, longitude, cat, verbose=False)
            df['Cinema Name'] = name
            df['Category'] = cat
            frames.append(df)
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
    # ignore_index gives unique row labels, matching what read_csv returns.
    df_venues = pd.concat(frames, sort=True, ignore_index=True)
    df_venues.to_csv(venues_csv, index=False)

In [None]:
print('Total {} of venues are found'.format(len(df_venues)))

In [None]:
# check the shape of data
df_venues.shape

In [None]:
# check some data
df_venues.head()

Number of venues in each category

In [None]:

df_venues['Category'].value_counts().to_frame(name='Count')

In [None]:

df_venues[(df_venues.Tips > 0)|(df_venues.Users > 0)|(df_venues.Visits > 0)]

In [None]:
df_venues.drop(columns=['Tips','Users','Visits'], inplace=True)

In [None]:
df_venues[df_venues.Category=='Nightlife Spot']

In [None]:

df_venues.drop(index=87, inplace=True)

Compared with the other categories, there is only one 'Nightlife Spot' venue, so this category is removed.

In [None]:
df_venues.shape

Explore nearby venues of 5 possible/target locations
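
`df_target_venues` is assumed to be built in the same way as `df_venues`, with a `Location` column in place of `Cinema Name`. A minimal sketch of that step follows. The function `collect_target_venues`, its `search` parameter, the stand-in `fake_search` and the demo coordinates are assumptions introduced for illustration.

```python
import pandas as pd

def collect_target_venues(locations, categories, search):
    """Call `search(lat, lng, category)` for every location/category pair
    and stack the results, tagging each row with its origin."""
    frames = []
    for row in locations.itertuples(index=False):
        for category in categories:
            found = search(row.Latitude, row.Longitude, category).copy()
            found['Location'] = row.Location
            found['Category'] = category
            frames.append(found)
    return pd.concat(frames, ignore_index=True, sort=False)

# Stand-in for venues_nearby so the sketch runs without API credentials.
def fake_search(latitude, longitude, category):
    return pd.DataFrame({'Name': [category + ' venue'],
                         'Latitude': [latitude], 'Longitude': [longitude]})

# Made-up coordinates for two hypothetical target locations.
demo_locations = pd.DataFrame({'Location': ['L1', 'L2'],
                               'Latitude': [22.32, 22.37],
                               'Longitude': [114.23, 113.97]})
demo = collect_target_venues(demo_locations, ['Food', 'Bus Stop'], fake_search)
```

In this notebook the call would presumably look like `collect_target_venues(df_possible_locations, fs_categories, lambda lat, lng, cat: venues_nearby(lat, lng, cat, verbose=False))`.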

In [None]:
df_target_venues.head()

In [None]:
df_target_venues[(df_target_venues.Tips > 0)|(df_target_venues.Users > 0)|(df_target_venues.Visits > 0)]

In [None]:

df_target_venues.drop(columns=['Tips','Users','Visits'], inplace=True)

In [None]:

df_target_venues['Category'].value_counts().to_frame(name='Count')

No venue is found for 'Nightlife Spot' category

In [None]:
df_target_venues.shape

I am only interested in the number of venues in each category.

In [None]:
df_venues_count = df_venues.groupby(['Cinema Name','Category'], as_index=False).count()
df_venues_count.drop(columns=['Latitude','Longitude'], inplace=True)
df_venues_count.rename(columns={'Name':'Count'}, inplace=True)
df_venues_count.head()

In [None]:
df_venues_count = df_venues_count.pivot(index='Cinema Name', columns='Category', values='Count').fillna(0)
df_venues_count.head()

In [None]:
# Do the same process on target locations
df_target_venues_count = df_target_venues.groupby(['Location','Category']).size().reset_index(name='Count')
df_target_venues_count = df_target_venues_count.pivot(index='Location', columns='Category', values='Count').fillna(0)

In [None]:

df_target_venues_count

Check the stakeholder's most preferred cinema list

In [42]:
favorite

[{'Name': 'Broadway Circuit - MONGKOK', 'Rating': 4.5},
 {'Name': 'Broadway Circuit - The ONE', 'Rating': 4.5},
 {'Name': 'Grand Ocean', 'Rating': 4.3},
 {'Name': 'The Grand Cinema', 'Rating': 3.4},
 {'Name': 'AMC Pacific Place', 'Rating': 2.3},
 {'Name': 'UA IMAX @ Airport', 'Rating': 1.5}]

Check that the Tokyo cinema list contains all of the stakeholder's most preferred cinemas

In [43]:
names = [ cinema['Name'] for cinema in favorite ]
df_cinemas[df_cinemas.Name.isin(names)]

Unnamed: 0,Name,Address,Latitude,Longitude
23,Broadway Circuit - MONGKOK,"6-12 Sai Yeung Choi Street, Mongkok, Kowloon",,
24,Broadway Circuit - The ONE,"6-11/F, The ONE, No. 100 Nathan Road, Tsim Sha...",,
29,The Grand Cinema,"L1-L4 Metroplaza, 223 Hing Fong Road, Kwai Fon...",,
32,AMC Pacific Place,"Level 1, Pacific Place, 88 Queensway Road, Hon...",,
40,Grand Ocean,"Ocean Centre, 3 Canton Road, Kowloon",,
55,UA IMAX @ Airport,"6P059, Level 6, Terminal 2, 1 Sky Plaza Road, ...",,


Stakeholder's favorite cinema list

In [44]:
df_favorite = pd.DataFrame(favorite, columns=['Name','Rating'])
df_favorite

Unnamed: 0,Name,Rating
0,Broadway Circuit - MONGKOK,4.5
1,Broadway Circuit - The ONE,4.5
2,Grand Ocean,4.3
3,The Grand Cinema,3.4
4,AMC Pacific Place,2.3
5,UA IMAX @ Airport,1.5


<h2>Data Analysis</h2>

In [45]:

!conda install seaborn=0.9 --yes

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - seaborn=0.9


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    matplotlib-2.2.2           |   py36hb69df0a_2         6.6 MB
    seaborn-0.9.0              |           py36_0         379 KB
    openssl-1.1.1d             |       h7b6447c_2         3.7 MB
    certifi-2019.9.11          |           py36_0         154 KB
    sip-4.18.1                 |   py36hf484d3e_2         278 KB
    qt-5.6.3                   |       h8bf5577_3        45.7 MB
    pyqt-5.6.0                 |   py36h22d08a2_6         5.4 MB
    ------------------------------------------------------------
                                           Total: 

In [46]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The mkdirs function was deprecated in Matplotlib 3.0 and will be removed in 3.2.
  import matplotlib.texmanager as texmanager


AttributeError: module 'matplotlib' has no attribute '_get_configdir'

In [None]:
df_venues_count.dtypes.to_frame(name='Data Type')

All data types are numeric.

Generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset's distribution.

In [None]:
df_venues_count.describe()


Cinemas do have many 'Bus Stop', 'Food' and 'Shop & Service' venues around them. However, it is unusual for a cinema to have 4 metro stations nearby (within 500 meters).

In [None]:
df_venues_count['Metro Station'].value_counts().sort_index().to_frame('Cinema Count')

One cinema has 4 metro stations around it

In [None]:

df_venues_count[df_venues_count['Metro Station'] > 2]

In [None]:
metro_over_2 = df_venues_count[df_venues_count['Metro Station'] > 2].index.tolist()
df_venues[(df_venues['Cinema Name'].isin(metro_over_2)) & (df_venues.Category == 'Metro Station')]

The venue 'Mtr Hung Hom Station Platform 4' is a duplicate and should be removed.

In [None]:
df_venues.loc[2182, 'Name'] = 'MTR Hung Hom Station'

In [None]:
df_venues.drop(index=2183, inplace=True)

Reconstruct the count dataframe

In [None]:
df_venues_count = df_venues.groupby(['Cinema Name','Category'], as_index=False).count()
df_venues_count.drop(columns=['Latitude','Longitude'], inplace=True)
df_venues_count.rename(columns={'Name':'Count'}, inplace=True)
df_venues_count = df_venues_count.pivot(index='Cinema Name', columns='Category', values='Count').fillna(0)
df_venues_count.head()

Plot the distribution of other variables

In [None]:
f, axes = plt.subplots(2, 2, figsize=(10, 10))
sns.distplot(df_venues_count['Arts & Entertainment'] , color="skyblue", ax=axes[0, 0], kde=False)
sns.distplot(df_venues_count['Bus Stop'] , color="olive", ax=axes[0, 1], kde=False)
sns.distplot(df_venues_count['Food'] , color="gold", ax=axes[1, 0], kde=False)
sns.distplot(df_venues_count['Shop & Service'] , color="teal", ax=axes[1, 1], kde=False)

The distributions of the other variables are quite similar. Now check their Pearson correlation.

In [None]:
df_venues_count.corr()


It seems that the 'Bus Stop', 'Shop & Service' and 'Food' categories are highly correlated.

Find the p-value of the variables.

By convention, when the p-value is:

* < 0.001, there is strong evidence that the correlation is significant;
* < 0.05, there is moderate evidence that the correlation is significant;
* < 0.1, there is weak evidence that the correlation is significant;
* > 0.1, there is no evidence that the correlation is significant.

In [47]:
from scipy import stats

In [None]:

p_value_data = []
for left in df_venues_count.columns:
    p_values = [left]
    for right in df_venues_count.columns:        
        pearson_coef, p_value = stats.pearsonr(df_venues_count[left], df_venues_count[right])
        if(p_value < 0.001):
            p_values.append('strong')
        elif(p_value < 0.05):
            p_values.append('moderate')
        elif(p_value < 0.1):
            p_values.append('weak')
        else:
            p_values.append('no')            
    p_value_data.append(p_values)

In [None]:
df_p_values = pd.DataFrame(p_value_data, columns=['Category'] + df_venues_count.columns.tolist())

In [None]:
df_p_values


The correlations between 'Bus Stop', 'Food', 'Metro Station' and 'Shop & Service' are statistically significant, and the coefficients of > 0.5 show that the relationships are positive.

In [48]:
df_favorite

Unnamed: 0,Name,Rating
0,Broadway Circuit - MONGKOK,4.5
1,Broadway Circuit - The ONE,4.5
2,Grand Ocean,4.3
3,The Grand Cinema,3.4
4,AMC Pacific Place,2.3
5,UA IMAX @ Airport,1.5


In [49]:
!conda install -c conda-forge folium=0.5 --yes
import folium

print('Folium installed and imported!')

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - folium=0.5


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.9.11          |           py36_0         147 KB  conda-forge

The following packages will be UPDATED:

    certifi: 2019.9.11-py36_0  --> 2019.9.11-py36_0  conda-forge

The following packages will be DOWNGRADED:

    openssl: 1.1.1d-h7b6447c_2 --> 1.1.1c-h516909a_0 conda-forge


Downloading and Extracting Packages
certifi-2019.9.11    | 147 KB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Folium installed and imported!


In [51]:
hk_coords = getLatLng('Tokyo')

NameError: name 'getLatLng' is not defined

Visualize the locations of the cinemas, the target locations and the stakeholder's favorite cinemas on the map

In [None]:
hk_map = folium.Map(location=hk_coords, zoom_start=12, tiles='Stamen Toner')

cinemas_fg = folium.FeatureGroup()
targets_fg = folium.FeatureGroup()

for(location, address, latitude, longitude) in df_possible_locations.itertuples(index=False):
    targets_fg.add_child(
        folium.features.CircleMarker(
            location=(latitude, longitude),
            popup=location,
            radius=5,
            fill=True,
            color='yellow',
            fill_opacity=1.
        )
    )

boss_ratings = df_favorite.set_index('Name')    
name_list = boss_ratings.index.tolist() # names of the stakeholder's favorite cinemas

for (name, address, latitude, longitude ) in df_cinemas.itertuples(index=False):    
    
    color = 'blue'        
    popup = name
    
    if( name in name_list ):
        color = 'red'    
        popup = '{} - Rating: {}'.format(name, boss_ratings.loc[name,'Rating'])
        
    cinemas_fg.add_child(        
        folium.features.CircleMarker(
            location=(latitude, longitude),
            popup=popup,
            radius=5,
            fill=True,
            color=color,
            fill_opacity=1.
        )
    )

hk_map.add_child(cinemas_fg)
hk_map.add_child(targets_fg)

Most Tokyo cinemas (blue circles) and the stakeholder's favorite cinemas (red circles) are located near main roads and concentrated in the urban area of Tokyo. The target locations (yellow circles) for the new cinema are not near main roads.

Machine Learning
Now, let's use a content-based, or item-item, recommendation system. In this case, I am going to try to figure out the boss's preferred new cinema location from the number of nearby venues and the ratings given.

Normalize the values of the venues dataframe using the MinMaxScaler method
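
Min-max scaling maps each column to the range [0, 1] using x' = (x - min) / (max - min), where min and max are taken per column from the data the scaler was fitted on. The NumPy sketch below reproduces that arithmetic by hand on invented venue counts; the helper names `fit_min_max` and `apply_min_max` are illustrative and are not scikit-learn functions.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range learned from the training data."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale `data` with previously learned statistics."""
    return (data - col_min) / col_range

# Made-up venue counts: rows are cinemas, columns are categories.
train = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 50.0]])
col_min, col_range = fit_min_max(train)
scaled_train = apply_min_max(train, col_min, col_range)

# A new location scaled with the *training* statistics may fall outside [0, 1].
new_location = np.array([[8.0, 20.0]])
scaled_new = apply_min_max(new_location, col_min, col_range)
```

The last step matters for the target locations: they are transformed with the statistics fitted on the existing cinemas, so a target with more venues than any existing cinema would score above 1 in that category.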

In [None]:
df_venues_count.head()

In [52]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:

venues_normalized = scaler.fit_transform(df_venues_count)

In [None]:

df_venues_normalized = pd.DataFrame(
    venues_normalized,
    index=df_venues_count.index,
    columns=df_venues_count.columns
)

In [None]:
df_venues_normalized.head()

Merge the data with the favorite list

In [None]:
boss_rating_table = pd.merge(
    df_favorite,
    df_venues_normalized,
    how='inner',
    left_on='Name',
    right_index=True
)
# keep only the normalized venue counts; the ratings stay in boss_rating_table
rating_table = boss_rating_table.drop(['Name','Rating'], axis=1)
rating_table


Take the dot product to get the weight of each category according to the favorite ratings
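
To make the dot product concrete, here is a small worked example with invented numbers: two favorite cinemas, three categories. Each category weight is the sum, over the favorite cinemas, of the cinema's rating times its normalized venue count in that category. The variable names are illustrative only.

```python
import numpy as np

# Rows: favorite cinemas; columns: normalized venue counts per category.
# All numbers are invented for illustration.
categories = ['Bus Stop', 'Food', 'Metro Station']
features = np.array([
    [1.0, 0.5, 0.0],   # favorite cinema A
    [0.0, 1.0, 1.0],   # favorite cinema B
])
ratings = np.array([4.0, 2.0])

# profile[j] = sum over i of ratings[i] * features[i, j]
profile = features.T @ ratings
profile_by_category = dict(zip(categories, profile.tolist()))
```

Here 'Food' receives 4 * 0.5 + 2 * 1.0 = 4, so a category can weigh heavily either because a highly rated cinema has some of it or because a lower rated cinema has a lot of it.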

In [53]:
boss_profile = rating_table.transpose().dot(boss_rating_table['Rating'])

NameError: name 'rating_table' is not defined

In [None]:
boss_profile

Normalize the values of target venues

In [None]:

df_targets_normalized = pd.DataFrame(
    scaler.transform(df_target_venues_count),
    index=df_target_venues_count.index,
    columns=df_target_venues_count.columns
)

In [None]:
df_targets_normalized


<h2>Results</h2>

With the boss's profile and the complete list of cinemas and their venue counts in hand, I am going to take the weighted average of every location based on the profile and recommend the top location that best satisfies it.

In [None]:
df_recommend = (df_targets_normalized*boss_profile).sum(axis=1)/boss_profile.sum()
df_recommend = df_recommend.reset_index(name='Rating')
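
The weighted average computed above can be checked by hand on a tiny example. The sketch below uses invented numbers and hypothetical names (`score`, `targets`); it follows the same formula, the sum of feature times weight divided by the sum of weights.

```python
import numpy as np

profile = np.array([4.0, 4.0, 2.0])          # weight per category (invented)
targets = {
    'L1': np.array([0.5, 0.5, 0.0]),         # normalized venue counts (invented)
    'L2': np.array([1.0, 0.0, 1.0]),
}

def score(features, weights):
    """Weighted average of the features, using the profile as weights."""
    return float((features * weights).sum() / weights.sum())

scores = {name: score(vec, profile) for name, vec in targets.items()}
best = max(scores, key=scores.get)
```

With these numbers L1 scores (2 + 2 + 0) / 10 = 0.4 and L2 scores (4 + 0 + 2) / 10 = 0.6. The result lives on the scale of the normalized counts, so it is useful for ranking locations rather than as a predicted star rating.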

In [None]:
df_possible_locations

In [None]:
df_final = pd.merge(
    df_possible_locations,
    df_recommend,
    left_on='Location',
    right_on='Location'
)
df_final.sort_values('Rating', ascending=False, inplace=True)

In [None]:
df_final

In [None]:
print('I should recommend the location "{}" of address "{}" to the stakeholder'.format(df_final.iat[0,0], df_final.iat[0,1]))

The result is reasonable. Location "L5" has the largest number of venues in the categories "Bus Stop", "Food", "Metro Station" and "Shop & Service".

In [None]:
df_target_venues_count.head()

Moreover, these are the categories the stakeholder cares about most, according to the profile ratings.

In [None]:
boss_profile.sort_values(ascending=False)

Therefore, location "L5" should be recommended to the stakeholder.

<h2>Discussion</h2>


The number of venues around the 5 target locations is actually below the average.

In [None]:
df_venues_count.mean().to_frame(name='Average Count')

In [None]:
df_target_venues_count.mean().to_frame('Average Count')


I should contact local commercial property agents to find more suitable locations. Moreover, FourSquare is not popular in Hong Kong, so the data may be outdated or unreliable; the report should gather more data from other location data sources, such as the Google Places API.

<h2>Conclusion</h2>

The stakeholder's problem is resolved. The stakeholder wants to find the best place to build a new cinema in Hong Kong, where the "best location" is determined by the number of eating, shopping and transportation venues around the location. The stakeholder also provided a list of favorite cinemas to further explain what the "best location" is. Content-based filtering is the most suitable machine learning technique for this problem: it combines the stakeholder's preferences with each cinema's profile to produce the recommendation.

The 5 target locations for the new cinema may not be good choices. Since the weighting profile has been developed, I can quickly pick other locations and run the recommendation again.