# Capstone Project Notebook

This notebook will be used for the Capstone Project for the Data Science Certification offered by IBM through Coursera.

# Outline

- [Introduction: Business Problem](#introduction)
- [Data](#data)
- [Analysis](#analysis)
- [Results and Discussion](#results)
- [Conclusion](#conclusion)

# Introduction: Business Problem <a name="introduction"></a>

Entertainment companies depend on the box-office success of their movies, and movie theaters depend on the revenue from screening them. The location of a cinema, together with the tastes of local residents and their ability to afford a night at the movies, therefore contributes to the success of the cinema as a business and of the entertainment industry as a whole.

For this project, I will scout the ideal location for a new **movie theater** in Toronto, Canada, and explore the question using several approaches.

The target audience is **movie theater owners and companies.** I will determine the best location for a new cinema, whether the theater is a branch of a chain or a stand-alone small business.

# Data <a name="data"></a>

For the scope of my project, I will need to know:

- The number of theaters in the city
- Their distribution across the city by neighborhood
- The income levels and demographics of each neighborhood

I will also explore other data, such as income and demographics, that can contribute to the success of the business, and see how they have affected existing establishments. I am aware that pricing, movie selection and establishment policies also affect the success of a movie theater. However, for the scope of my study, I will be focusing on geographic location and the associated socioeconomic factors that can be accounted for.

I will start by importing the required modules and libraries for the analysis:

In [1]:
# For plotting, clustering and preprocessing of data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
%matplotlib inline

import requests  # used later for the Foursquare API calls
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler



## Importing and Cleaning Datasets:

### Income dataset:

Retrieving the data from the csv file then converting it to a Pandas dataframe:

In [2]:
data = pd.read_csv('neighbourhood-profiles-2016-csv.csv')
data.head()

Unnamed: 0,_id,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,1,Neighbourhood Information,Neighbourhood Information,City of Toronto,Neighbourhood Number,,129,128,20,95,...,37,7,137,64,60,94,100,97,27,31
1,2,Neighbourhood Information,Neighbourhood Information,City of Toronto,TSNS2020 Designation,,No Designation,No Designation,No Designation,No Designation,...,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,Emerging Neighbourhood
2,3,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2016",2731571,29113,23757,12054,30526,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,4,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2011",2615060,30279,21988,11904,29177,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,5,Population,Population and dwellings,Census Profile 98-316-X2016001,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%


In [3]:
data.drop(1, axis = 0, inplace=True)
data.drop(['_id','Data Source'], axis=1, inplace=True)
data

Unnamed: 0,Category,Topic,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,Neighbourhood Information,Neighbourhood Information,Neighbourhood Number,,129,128,20,95,42,34,...,37,7,137,64,60,94,100,97,27,31
2,Population,Population and dwellings,"Population, 2016",2731571,29113,23757,12054,30526,27695,15873,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,Population,Population and dwellings,"Population, 2011",2615060,30279,21988,11904,29177,26918,15434,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,Population,Population and dwellings,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,2.90%,2.80%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%
5,Population,Population and dwellings,Total private dwellings,1179057,9371,8535,4732,18109,12473,6418,...,8054,8721,19098,5620,3604,6185,6103,7475,11051,5847
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2378,Mobility,Mobility status - Place of residence 5 years ago,Migrants,400950,3170,3145,925,6390,3140,2235,...,3765,2270,7260,985,620,1350,2425,2310,4965,1345
2379,Mobility,Mobility status - Place of residence 5 years ago,Internal migrants,184120,880,980,680,3930,1405,915,...,1545,1110,1720,610,395,780,1260,1355,1700,580
2380,Mobility,Mobility status - Place of residence 5 years ago,Intraprovincial migrants,141135,735,760,615,2630,1190,745,...,1070,960,1400,350,320,570,970,1025,1490,445
2381,Mobility,Mobility status - Place of residence 5 years ago,Interprovincial migrants,42985,135,220,70,1310,220,170,...,475,150,335,250,85,210,290,325,195,135


In [4]:
data.set_index('Category', inplace=True)
data.head()

Unnamed: 0_level_0,Topic,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Neighbourhood Information,Neighbourhood Information,Neighbourhood Number,,129,128,20,95,42,34,76,...,37,7,137,64,60,94,100,97,27,31
Population,Population and dwellings,"Population, 2016",2731571,29113,23757,12054,30526,27695,15873,25797,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
Population,Population and dwellings,"Population, 2011",2615060,30279,21988,11904,29177,26918,15434,19348,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
Population,Population and dwellings,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,2.90%,2.80%,33.30%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%
Population,Population and dwellings,Total private dwellings,1179057,9371,8535,4732,18109,12473,6418,18436,...,8054,8721,19098,5620,3604,6185,6103,7475,11051,5847


In [5]:
income_total = data.loc[['Neighbourhood Information','Income']]
income_total.set_index('Topic', inplace=True)
income_total.head()

Unnamed: 0_level_0,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
Topic,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Neighbourhood Information,Neighbourhood Number,,129,128,20,95,42,34,76,52,...,37,7,137,64,60,94,100,97,27,31
Income of households in 2015,"$150,000 to $199,999",77810.0,595,475,560,1130,935,505,750,665,...,565,760,855,625,365,460,430,465,445,380
Income of individuals in 2015,"$60,000 to $69,999",114460.0,865,825,690,1460,1425,685,1250,1070,...,735,1055,1665,600,420,620,595,720,895,530
Income of individuals in 2015,"$70,000 to $79,999",89645.0,655,570,530,1290,1220,535,1090,870,...,605,790,1230,540,400,505,520,585,585,380
Income of individuals in 2015,"$80,000 to $89,999",69990.0,435,435,395,1000,960,405,925,675,...,515,585,800,465,315,345,415,495,405,320


In [6]:
income = income_total.loc[['Neighbourhood Information', 'Income of individuals in 2015']]
income = income.transpose()
income.head()

Topic,Neighbourhood Information,Income of individuals in 2015,Income of individuals in 2015.1,Income of individuals in 2015.2,Income of individuals in 2015.3,Income of individuals in 2015.4,Income of individuals in 2015.5,Income of individuals in 2015.6,Income of individuals in 2015.7,Income of individuals in 2015.8,...,Income of individuals in 2015.9,Income of individuals in 2015.10,Income of individuals in 2015.11,Income of individuals in 2015.12,Income of individuals in 2015.13,Income of individuals in 2015.14,Income of individuals in 2015.15,Income of individuals in 2015.16,Income of individuals in 2015.17,Income of individuals in 2015.18
Characteristic,Neighbourhood Number,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","$150,000 and over","$100,000 and over","$100,000 and over","$100,000 to $149,999",Total - Income statistics in 2015 for the popu...,...,"$20,000 to $29,999","$30,000 to $39,999","$40,000 to $49,999","$50,000 to $59,999","$60,000 to $69,999","$70,000 to $79,999","$80,000 and over","$80,000 to $89,999","$90,000 to $99,999","$100,000 and over"
City of Toronto,,114460,89645,69990,58210,89770,116680,209580,119810,2294785,...,176835,157115,141850,113280,90310,72190,273760,57585,50845,165330
Agincourt North,129,865,655,435,365,135,245,665,530,25005,...,2120,1720,1360,985,670,460,1160,365,295,505
Agincourt South-Malvern West,128,825,570,435,315,165,265,685,525,20400,...,1730,1365,1160,860,660,460,1135,355,275,515
Alderwood,20,690,530,395,370,225,325,845,620,10265,...,735,715,695,660,555,455,1415,360,330,730


In [7]:
income = income.drop('City of Toronto')
income.head()

Topic,Neighbourhood Information,Income of individuals in 2015,Income of individuals in 2015.1,Income of individuals in 2015.2,Income of individuals in 2015.3,Income of individuals in 2015.4,Income of individuals in 2015.5,Income of individuals in 2015.6,Income of individuals in 2015.7,Income of individuals in 2015.8,...,Income of individuals in 2015.9,Income of individuals in 2015.10,Income of individuals in 2015.11,Income of individuals in 2015.12,Income of individuals in 2015.13,Income of individuals in 2015.14,Income of individuals in 2015.15,Income of individuals in 2015.16,Income of individuals in 2015.17,Income of individuals in 2015.18
Characteristic,Neighbourhood Number,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","$150,000 and over","$100,000 and over","$100,000 and over","$100,000 to $149,999",Total - Income statistics in 2015 for the popu...,...,"$20,000 to $29,999","$30,000 to $39,999","$40,000 to $49,999","$50,000 to $59,999","$60,000 to $69,999","$70,000 to $79,999","$80,000 and over","$80,000 to $89,999","$90,000 to $99,999","$100,000 and over"
Agincourt North,129,865,655,435,365,135,245,665,530,25005,...,2120,1720,1360,985,670,460,1160,365,295,505
Agincourt South-Malvern West,128,825,570,435,315,165,265,685,525,20400,...,1730,1365,1160,860,660,460,1135,355,275,515
Alderwood,20,690,530,395,370,225,325,845,620,10265,...,735,715,695,660,555,455,1415,360,330,730
Annex,95,1460,1290,1000,830,3055,3660,5255,2190,26295,...,1800,1580,1425,1225,1155,915,5095,770,630,3715


In [8]:
new_header = income.iloc[0] 
income = income[1:] 
income.columns = new_header 
income.head()

Characteristic,Neighbourhood Number,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","$150,000 and over","$100,000 and over","$100,000 and over.1","$100,000 to $149,999",Total - Income statistics in 2015 for the population aged 15 years and over in private households,...,"$20,000 to $29,999","$30,000 to $39,999","$40,000 to $49,999","$50,000 to $59,999","$60,000 to $69,999.1","$70,000 to $79,999.1","$80,000 and over","$80,000 to $89,999.1","$90,000 to $99,999.1","$100,000 and over.2"
Agincourt North,129,865,655,435,365,135,245,665,530,25005,...,2120,1720,1360,985,670,460,1160,365,295,505
Agincourt South-Malvern West,128,825,570,435,315,165,265,685,525,20400,...,1730,1365,1160,860,660,460,1135,355,275,515
Alderwood,20,690,530,395,370,225,325,845,620,10265,...,735,715,695,660,555,455,1415,360,330,730
Annex,95,1460,1290,1000,830,3055,3660,5255,2190,26295,...,1800,1580,1425,1225,1155,915,5095,770,630,3715
Banbury-Don Mills,42,1425,1220,960,820,1635,2150,3670,2035,23410,...,1395,1180,1245,1055,910,795,3890,690,640,2570
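
The index changes in cells 4 to 8 can also be written as a single filter followed by a transpose. The following is a minimal sketch on a tiny hand-made frame that mimics the layout of the census file; the rows and the two neighbourhood columns are copied from the outputs above, and the variable names are illustrative. It also converts the counts to numbers, on the assumption that the real columns are read as text because they mix counts with entries such as percentages.

```python
import pandas as pd

# Tiny stand-in for the census file: one row per characteristic,
# one column per neighbourhood (values copied from the outputs above).
raw = pd.DataFrame({
    "Category": ["Neighbourhood Information", "Population", "Income", "Income"],
    "Topic": ["Neighbourhood Information", "Population and dwellings",
              "Income of individuals in 2015", "Income of individuals in 2015"],
    "Characteristic": ["Neighbourhood Number", "Population Change 2011-2016",
                       "$60,000 to $69,999", "$70,000 to $79,999"],
    "Agincourt North": ["129", "-3.90%", "865", "655"],
    "Alderwood": ["20", "1.30%", "690", "530"],
})

# Keep the neighbourhood number plus the individual-income rows.
keep = raw["Topic"].isin(["Neighbourhood Information",
                          "Income of individuals in 2015"])
subset = raw.loc[keep].drop(columns=["Category", "Topic"])

# Characteristics become columns, neighbourhoods become rows.
tidy = subset.set_index("Characteristic").T
tidy.index.name = "Neighbourhood Name"
tidy.columns.name = None

# Convert the text counts to numbers so they can be used in arithmetic.
tidy = tidy.apply(pd.to_numeric).reset_index()
print(tidy)
```

In the full file some characteristic labels repeat across sections (the `.1` and `.2` suffixes in the output above), so a real version would need to disambiguate them before using them as column names.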


In [9]:
income.reset_index(inplace=True)

In [10]:
income.rename(columns={"index": "Neighbourhood Name"},inplace=True)

The census income statistics used here cover the population in private households only, and the brackets I keep start at $60,000, so low-income individuals are not represented in this dataset.

This is a limitation, but the data is sufficient to examine how income relates to location for these income brackets. A separate study could cover low-income residents should that data be available.

In [11]:
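# Save the "$100,000 to $149,999" column (position 9) before slicing it away
column = list(income.iloc[:,9])
# Keep the name, the number, the four $60,000-$99,999 brackets and "$150,000 and over"
income = income.iloc[:,:7]
# Reinsert the saved bracket just before "$150,000 and over"
income.insert(6,'100,000 to 149,000',column)
income.head()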

Characteristic,Neighbourhood Name,Neighbourhood Number,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","100,000 to 149,000","$150,000 and over"
0,Agincourt North,129,865,655,435,365,530,135
1,Agincourt South-Malvern West,128,825,570,435,315,525,165
2,Alderwood,20,690,530,395,370,620,225
3,Annex,95,1460,1290,1000,830,2190,3055
4,Banbury-Don Mills,42,1425,1220,960,820,2035,1635


Now that I have a much smaller dataset, the remaining tidy-up is to reorder the columns: I pop the `Neighbourhood Number` column and reinsert it as the first column.

In [12]:
col = income.pop('Neighbourhood Number')

In [13]:
income.head()

Characteristic,Neighbourhood Name,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","100,000 to 149,000","$150,000 and over"
0,Agincourt North,865,655,435,365,530,135
1,Agincourt South-Malvern West,825,570,435,315,525,165
2,Alderwood,690,530,395,370,620,225
3,Annex,1460,1290,1000,830,2190,3055
4,Banbury-Don Mills,1425,1220,960,820,2035,1635


In [14]:
income.insert(0,'Neighbourhood Number',col)
income.head()

Characteristic,Neighbourhood Number,Neighbourhood Name,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","100,000 to 149,000","$150,000 and over"
0,129,Agincourt North,865,655,435,365,530,135
1,128,Agincourt South-Malvern West,825,570,435,315,525,165
2,20,Alderwood,690,530,395,370,620,225
3,95,Annex,1460,1290,1000,830,2190,3055
4,42,Banbury-Don Mills,1425,1220,960,820,2035,1635
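
The pop-and-insert steps above can also be expressed as a single column selection. A small sketch on an illustrative frame (two rows copied from the table above; `key` and `df` are names chosen for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "Neighbourhood Name": ["Agincourt North", "Alderwood"],
    "$60,000 to $69,999": [865, 690],
    "Neighbourhood Number": ["129", "20"],
})

# Put the key column first and keep every other column in its current order.
key = "Neighbourhood Number"
df = df[[key] + [c for c in df.columns if c != key]]
print(df.columns.tolist())
```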


Then I save the cleaned dataframe to a new CSV file for later retrieval. The file is included in the GitHub repository as part of the resources.

In [15]:
income.to_csv('income_data.csv')
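
With its default settings, `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. A small sketch of the round trip, using an in-memory buffer instead of a file and two rows copied from the table above. Passing `index=False` and reading the key column as text are suggestions, not part of the original workflow.

```python
import io
import pandas as pd

income_demo = pd.DataFrame({
    "Neighbourhood Number": ["129", "20"],
    "Neighbourhood Name": ["Agincourt North", "Alderwood"],
    "$60,000 to $69,999": [865, 690],
})

# index=False keeps the 0..n-1 row index out of the file.
buffer = io.StringIO()
income_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Read the key column back as text so it still matches string keys in a merge.
restored = pd.read_csv(buffer, dtype={"Neighbourhood Number": str})
print(restored.dtypes)
```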

Now the data is complete. Next, I check the shape of the dataframe after the cleaning:

In [16]:
income.shape

(140, 8)

### Importing Geo data for neighborhoods:

Here I will use the neighbourhoods GeoJSON file, which provides the latitude, longitude and boundary polygon of each neighbourhood:

In [17]:
geo = r'Neighbourhoods.geojson'

Now I read the GeoJSON file into a GeoDataFrame, which will later be merged with the income dataframe on the neighbourhood number:

In [18]:
toronto = gpd.read_file(geo) 
toronto.head() 

Unnamed: 0,_id,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,X,Y,LONGITUDE,LATITUDE,OBJECTID,Shape__Area,Shape__Length,geometry
0,8261,25886861,25926662,49885,94,94,Wychwood (94),Wychwood (94),,,-79.425515,43.676919,16491505,3217960.0,7515.779658,"POLYGON ((-79.43592 43.68015, -79.43492 43.680..."
1,8262,25886820,25926663,49885,100,100,Yonge-Eglinton (100),Yonge-Eglinton (100),,,-79.40359,43.704689,16491521,3160334.0,7872.021074,"POLYGON ((-79.41096 43.70408, -79.40962 43.704..."
2,8263,25886834,25926664,49885,97,97,Yonge-St.Clair (97),Yonge-St.Clair (97),,,-79.397871,43.687859,16491537,2222464.0,8130.411276,"POLYGON ((-79.39119 43.68108, -79.39141 43.680..."
3,8264,25886593,25926665,49885,27,27,York University Heights (27),York University Heights (27),,,-79.488883,43.765736,16491553,25418210.0,25632.335242,"POLYGON ((-79.50529 43.75987, -79.50488 43.759..."
4,8265,25886688,25926666,49885,31,31,Yorkdale-Glen Park (31),Yorkdale-Glen Park (31),,,-79.457108,43.714672,16491569,11566690.0,13953.408098,"POLYGON ((-79.43969 43.70561, -79.44011 43.705..."


Now I will slice the dataframe to only include the necessary data for my analysis:

In [19]:
toronto = toronto.iloc[:, 5:]  
toronto.rename(columns={'AREA_LONG_CODE': 'Neighbourhood Number'},
                   inplace=True)  
toronto.drop(labels=['AREA_DESC', 'OBJECTID', 'X', 'Y'],
                 axis=1, inplace=True) 

toronto.head() 

Unnamed: 0,Neighbourhood Number,AREA_NAME,LONGITUDE,LATITUDE,Shape__Area,Shape__Length,geometry
0,94,Wychwood (94),-79.425515,43.676919,3217960.0,7515.779658,"POLYGON ((-79.43592 43.68015, -79.43492 43.680..."
1,100,Yonge-Eglinton (100),-79.40359,43.704689,3160334.0,7872.021074,"POLYGON ((-79.41096 43.70408, -79.40962 43.704..."
2,97,Yonge-St.Clair (97),-79.397871,43.687859,2222464.0,8130.411276,"POLYGON ((-79.39119 43.68108, -79.39141 43.680..."
3,27,York University Heights (27),-79.488883,43.765736,25418210.0,25632.335242,"POLYGON ((-79.50529 43.75987, -79.50488 43.759..."
4,31,Yorkdale-Glen Park (31),-79.457108,43.714672,11566690.0,13953.408098,"POLYGON ((-79.43969 43.70561, -79.44011 43.705..."


Then I will merge the datasets on the "Neighbourhood Number" column, after casting it to a string so that the key types match:

In [20]:
toronto['Neighbourhood Number'] = toronto['Neighbourhood Number'].astype(str).astype(object)

In [21]:
toronto.sort_values('AREA_NAME',ascending=False)

Unnamed: 0,Neighbourhood Number,AREA_NAME,LONGITUDE,LATITUDE,Shape__Area,Shape__Length,geometry
4,31,Yorkdale-Glen Park (31),-79.457108,43.714672,1.156669e+07,13953.408098,"POLYGON ((-79.43969 43.70561, -79.44011 43.705..."
3,27,York University Heights (27),-79.488883,43.765736,2.541821e+07,25632.335242,"POLYGON ((-79.50529 43.75987, -79.50488 43.759..."
2,97,Yonge-St.Clair (97),-79.397871,43.687859,2.222464e+06,8130.411276,"POLYGON ((-79.39119 43.68108, -79.39141 43.680..."
1,100,Yonge-Eglinton (100),-79.403590,43.704689,3.160334e+06,7872.021074,"POLYGON ((-79.41096 43.70408, -79.40962 43.704..."
0,94,Wychwood (94),-79.425515,43.676919,3.217960e+06,7515.779658,"POLYGON ((-79.43592 43.68015, -79.43492 43.680..."
...,...,...,...,...,...,...,...
78,42,Banbury-Don Mills (42),-79.349718,43.737657,1.924897e+07,25141.572290,"POLYGON ((-79.33055 43.73979, -79.33044 43.739..."
77,95,Annex (95),-79.404001,43.671585,5.337192e+06,10513.883143,"POLYGON ((-79.39414 43.66872, -79.39588 43.668..."
76,20,Alderwood (20),-79.541611,43.604937,9.502180e+06,12667.013917,"POLYGON ((-79.54866 43.59022, -79.54876 43.590..."
75,128,Agincourt South-Malvern West (128),-79.265612,43.788658,1.511736e+07,21320.849547,"POLYGON ((-79.25498 43.78122, -79.25797 43.780..."


In [22]:
income.sort_values('Neighbourhood Name',ascending=False)

Characteristic,Neighbourhood Number,Neighbourhood Name,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","100,000 to 149,000","$150,000 and over"
139,31,Yorkdale-Glen Park,530,380,320,245,450,195
138,27,York University Heights,895,585,405,250,380,80
137,97,Yonge-St.Clair,720,585,495,425,1075,1645
136,100,Yonge-Eglinton,595,520,415,360,870,1230
135,94,Wychwood,620,505,345,335,710,575
...,...,...,...,...,...,...,...,...
4,42,Banbury-Don Mills,1425,1220,960,820,2035,1635
3,95,Annex,1460,1290,1000,830,2190,3055
2,20,Alderwood,690,530,395,370,620,225
1,128,Agincourt South-Malvern West,825,570,435,315,525,165


In [23]:
data = pd.merge(left=income, right=toronto, left_on='Neighbourhood Number', right_on='Neighbourhood Number')
data.head()

Unnamed: 0,Neighbourhood Number,Neighbourhood Name,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","100,000 to 149,000","$150,000 and over",AREA_NAME,LONGITUDE,LATITUDE,Shape__Area,Shape__Length,geometry
0,129,Agincourt North,865,655,435,365,530,135,Agincourt North (129),-79.266712,43.805441,13951450.0,17159.740667,"POLYGON ((-79.24213 43.80247, -79.24319 43.802..."
1,128,Agincourt South-Malvern West,825,570,435,315,525,165,Agincourt South-Malvern West (128),-79.265612,43.788658,15117360.0,21320.849547,"POLYGON ((-79.25498 43.78122, -79.25797 43.780..."
2,20,Alderwood,690,530,395,370,620,225,Alderwood (20),-79.541611,43.604937,9502180.0,12667.013917,"POLYGON ((-79.54866 43.59022, -79.54876 43.590..."
3,95,Annex,1460,1290,1000,830,2190,3055,Annex (95),-79.404001,43.671585,5337192.0,10513.883143,"POLYGON ((-79.39414 43.66872, -79.39588 43.668..."
4,42,Banbury-Don Mills,1425,1220,960,820,2035,1635,Banbury-Don Mills (42),-79.349718,43.737657,19248970.0,25141.57229,"POLYGON ((-79.33055 43.73979, -79.33044 43.739..."
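
An inner merge keeps only the rows whose keys match on both sides, so a difference in key formatting can shrink the result without any warning. A hedged sketch of how the merge could be checked: the two small frames are illustrative (names and latitudes copied from the output above), and `validate` and `indicator` are standard `pd.merge` options rather than something the original cell uses.

```python
import pandas as pd

income_demo = pd.DataFrame({
    "Neighbourhood Number": ["129", "128", "20"],
    "Neighbourhood Name": ["Agincourt North",
                           "Agincourt South-Malvern West", "Alderwood"],
})
geo_demo = pd.DataFrame({
    "Neighbourhood Number": [129, 128, 20],   # integers here, for illustration
    "LATITUDE": [43.805441, 43.788658, 43.604937],
})

# Make both keys plain strings before merging.
geo_demo["Neighbourhood Number"] = geo_demo["Neighbourhood Number"].astype(str)

merged = pd.merge(
    income_demo, geo_demo,
    on="Neighbourhood Number",
    how="outer",            # keep unmatched rows so they can be inspected
    validate="one_to_one",  # raise if a key is duplicated on either side
    indicator=True,         # adds a "_merge" column: both / left_only / right_only
)

unmatched = merged[merged["_merge"] != "both"]
print(f"{len(unmatched)} unmatched rows")
```

If `unmatched` is empty, switching back to the default inner merge gives the same rows.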


In [24]:
data.drop(labels=['AREA_NAME'], axis=1, inplace=True) 
data.head()

Unnamed: 0,Neighbourhood Number,Neighbourhood Name,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","100,000 to 149,000","$150,000 and over",LONGITUDE,LATITUDE,Shape__Area,Shape__Length,geometry
0,129,Agincourt North,865,655,435,365,530,135,-79.266712,43.805441,13951450.0,17159.740667,"POLYGON ((-79.24213 43.80247, -79.24319 43.802..."
1,128,Agincourt South-Malvern West,825,570,435,315,525,165,-79.265612,43.788658,15117360.0,21320.849547,"POLYGON ((-79.25498 43.78122, -79.25797 43.780..."
2,20,Alderwood,690,530,395,370,620,225,-79.541611,43.604937,9502180.0,12667.013917,"POLYGON ((-79.54866 43.59022, -79.54876 43.590..."
3,95,Annex,1460,1290,1000,830,2190,3055,-79.404001,43.671585,5337192.0,10513.883143,"POLYGON ((-79.39414 43.66872, -79.39588 43.668..."
4,42,Banbury-Don Mills,1425,1220,960,820,2035,1635,-79.349718,43.737657,19248970.0,25141.57229,"POLYGON ((-79.33055 43.73979, -79.33044 43.739..."


# Analysis <a name="analysis"></a>

I will search for all cinemas, then look for patterns between income and theater density in the neighbourhoods:

In [25]:
from IPython.display import Image, HTML

# transforming a JSON file into a pandas dataframe
from pandas import json_normalize

import matplotlib.cm as cm
import matplotlib.colors as colors


from geopy.geocoders import Nominatim  # converts an address into latitude and longitude
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


In [26]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [27]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(data['LATITUDE'], data['LONGITUDE'], data['Neighbourhood Name']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The code cell below sets the Foursquare client ID, client secret and access token, as well as the API version and the result limit. The credential values are replaced with placeholders for security purposes.

In [28]:

CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 30
print('Credentials set.')

Credentials set.
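
Credentials do not need to appear in the notebook at all. One common approach, sketched here under the assumption that they are stored in environment variables, is to read them at run time. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical, chosen for this example.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping such as os.environ."""
    # FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are hypothetical names.
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the environment variable {missing} first") from None

# Example with a stand-in mapping instead of the real environment:
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id",
            "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)

# Confirm that something was loaded without printing the secret itself.
print("CLIENT_ID loaded:", bool(client_id))
```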


Now I explore the neighborhoods in Toronto, using the helper functions below to request the venues near each neighborhood's centre point from Foursquare.

In [29]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [30]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
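
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` if a venue has an empty category list. A minimal sketch of a more defensive parsing step, kept separate from the network call so it can be tried offline. The helper name `venue_rows` and the sample items are illustrative; the items only imitate the nesting that the function above reads and are not real API output.

```python
def venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of failing when a venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Hand-made items in the shape the function above expects (not real API output).
sample_items = [
    {"venue": {"name": "Cineplex Cinemas",
               "location": {"lat": 43.734279, "lng": -79.346164},
               "categories": [{"name": "Movie Theater"}]}},
    {"venue": {"name": "Uncategorised Spot",
               "location": {"lat": 43.7, "lng": -79.3},
               "categories": []}},
]
rows = venue_rows("Banbury-Don Mills", 43.737657, -79.349718, sample_items)
print(rows)
```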

In [31]:
toronto_venues = getNearbyVenues(names=data['Neighbourhood Name'],
                                   latitudes=data['LATITUDE'],
                                   longitudes=data['LONGITUDE']
                                  )

Agincourt North
Agincourt South-Malvern West
Alderwood
Annex
Banbury-Don Mills
Bathurst Manor
Bay Street Corridor
Bayview Village
Bayview Woods-Steeles
Bedford Park-Nortown
Beechborough-Greenbrook
Bendale
Birchcliffe-Cliffside
Black Creek
Blake-Jones
Briar Hill-Belgravia
Bridle Path-Sunnybrook-York Mills
Broadview North
Brookhaven-Amesbury
Cabbagetown-South St. James Town
Caledonia-Fairbank
Casa Loma
Centennial Scarborough
Church-Yonge Corridor
Clairlea-Birchmount
Clanton Park
Cliffcrest
Corso Italia-Davenport
Danforth
Danforth East York
Don Valley Village
Dorset Park
Dovercourt-Wallace Emerson-Junction
Downsview-Roding-CFB
Dufferin Grove
East End-Danforth
Edenbridge-Humber Valley
Eglinton East
Elms-Old Rexdale
Englemount-Lawrence
Eringate-Centennial-West Deane
Etobicoke West Mall
Flemingdon Park
Forest Hill North
Forest Hill South
Glenfield-Jane Heights
Greenwood-Coxwell
Guildwood
Henry Farm
High Park North
High Park-Swansea
Highland Creek
Hillcrest Village
Humber Heights-Westmount
Hu

In [32]:
print(toronto_venues.shape)
toronto_venues.head()

(1632, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Agincourt North,43.805441,-79.266712,Menchie's,43.808338,-79.268288,Frozen Yogurt Shop
1,Agincourt North,43.805441,-79.266712,Congee Town 太皇名粥,43.809035,-79.267634,Chinese Restaurant
2,Agincourt North,43.805441,-79.266712,Shoppers Drug Mart,43.808894,-79.269854,Pharmacy
3,Agincourt North,43.805441,-79.266712,Dollarama,43.808894,-79.269854,Discount Store
4,Agincourt North,43.805441,-79.266712,Popeyes Louisiana Kitchen,43.808652,-79.267929,Fried Chicken Joint


In [33]:
toronto_theater = toronto_venues[toronto_venues['Venue Category']=='Movie Theater']
toronto_theater

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
67,Banbury-Don Mills,43.737657,-79.349718,Cineplex Cinemas,43.734279,-79.346164,Movie Theater
1155,Rexdale-Kipling,43.723725,-79.566228,SilverCity,43.726223,-79.569855,Movie Theater
1549,Yonge-Eglinton,43.704689,-79.40359,Cineplex VIP Yonge & Eglinton,43.706515,-79.39895,Movie Theater
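
The equality test above only matches venues whose category is exactly `Movie Theater`. If the venue data also uses more specific labels (for example `Indie Movie Theater` or `Multiplex`, which are assumed here for illustration and should be checked against the categories actually returned), a pattern match would catch them too. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up venues; only the first name and category come from the results above.
venues_demo = pd.DataFrame({
    "Venue": ["Cineplex Cinemas", "Small Screen House", "Corner Pizza", "Big Plex"],
    "Venue Category": ["Movie Theater", "Indie Movie Theater",
                       "Pizza Place", "Multiplex"],
})

# Case-insensitive match on any category mentioning a movie theater or multiplex.
pattern = r"movie theater|multiplex"
is_cinema = venues_demo["Venue Category"].str.contains(
    pattern, case=False, regex=True, na=False)
cinemas = venues_demo[is_cinema]
print(cinemas)
```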


# Results and Discussion <a name="results"></a>

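The search found only **3** movie theaters within 500 m of the neighbourhood centre points in Toronto. This suggests that there is a lot of room for new cinemas in the city, although the 500 m radius and the limit of 30 venues per neighbourhood mean that some existing theaters may have been missed.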

Because three theaters are too few to calculate a meaningful theater density for each neighborhood, I will consider a different metric: **income.**

This will help determine the likelihood of success of a theater in that area in addition to the similarities between the areas that contain movie theaters.

I will also look at the **most common venues** to see if there are patterns there. I will apply this to the whole set first, then look at the 3 neighborhoods separately afterwards.
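
One way to put the two metrics side by side is to count the theaters per neighborhood and attach that count to the income table. The following is a sketch rather than the analysis itself: the rows are copied from the income table and the theater results above, while the `Theaters` and `High income` columns are names introduced for this example.

```python
import pandas as pd

# Income rows copied from the cleaned table above.
income_demo = pd.DataFrame({
    "Neighbourhood Name": ["Alderwood", "Annex",
                           "Banbury-Don Mills", "Yonge-Eglinton"],
    "100,000 to 149,000": [620, 2190, 2035, 870],
    "$150,000 and over": [225, 3055, 1635, 1230],
})
# Theater rows copied from the Foursquare result above.
theaters_demo = pd.DataFrame({
    "Neighborhood": ["Banbury-Don Mills", "Yonge-Eglinton"],
    "Venue": ["Cineplex Cinemas", "Cineplex VIP Yonge & Eglinton"],
})

# Count theaters per neighbourhood; neighbourhoods without one get 0.
counts = theaters_demo.groupby("Neighborhood").size()
income_demo["Theaters"] = (
    income_demo["Neighbourhood Name"].map(counts).fillna(0).astype(int)
)

# Combine the two top brackets into one "high income" count for comparison.
income_demo["High income"] = (
    income_demo["100,000 to 149,000"] + income_demo["$150,000 and over"]
)
print(income_demo[["Neighbourhood Name", "Theaters", "High income"]])
```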

In [34]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt North,19,19,19,19,19,19
Agincourt South-Malvern West,17,17,17,17,17,17
Alderwood,4,4,4,4,4,4
Annex,23,23,23,23,23,23
Banbury-Don Mills,22,22,22,22,22,22
...,...,...,...,...,...,...
Wychwood,4,4,4,4,4,4
Yonge-Eglinton,30,30,30,30,30,30
Yonge-St.Clair,30,30,30,30,30,30
York University Heights,10,10,10,10,10,10


In [35]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 261 unique categories.


In [36]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + toronto_onehot.columns[:-1].tolist()
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Zoo Exhibit,ATM,African Restaurant,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
toronto_onehot.shape

(1632, 261)
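
The one-hot table has 261 columns, the same as the number of unique categories, even though a `Neighborhood` column was added to it. One possible explanation, not verified here, is that the venue categories include one that is itself called `Neighborhood`, in which case the assignment in cell 36 overwrites that category column instead of adding a new one. A sketch on made-up rows that sidesteps such a clash by keeping the neighborhood name in the index:

```python
import pandas as pd

# Made-up venues, including a category literally named "Neighborhood".
venues_demo = pd.DataFrame({
    "Neighborhood": ["Annex", "Annex", "Alderwood"],
    "Venue Category": ["Café", "Neighborhood", "Pizza Place"],
})

# One indicator column per category; dtype=int gives 0/1 rather than booleans.
onehot = pd.get_dummies(venues_demo["Venue Category"], dtype=int)

# Keep the neighbourhood as the index so it cannot clash with a category name.
onehot.index = pd.Index(venues_demo["Neighborhood"], name="Neighbourhood Name")

print(onehot.shape)
```

Grouping can then be done on the index, for example with `onehot.groupby(level=0).mean()`.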

In [38]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Zoo Exhibit,ATM,African Restaurant,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Argentinian Restaurant,Art Gallery,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Agincourt North,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.052632,0.0,0.0,0.0,0.052632,0.0,0.000000,0.0
1,Agincourt South-Malvern West,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
2,Alderwood,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
3,Annex,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
4,Banbury-Don Mills,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133,Wychwood,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
134,Yonge-Eglinton,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
135,Yonge-St.Clair,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.033333,0.0
136,York University Heights,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0


In [39]:
toronto_grouped.shape

(138, 261)
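
Why the mean is the right aggregate here: each row of the one-hot table is a single venue, so averaging a 0/1 column within a neighborhood gives the share of that neighborhood's venues that fall in the category. A toy illustration with made-up rows for a neighborhood with 4 venues:

```python
import pandas as pd

# Four hypothetical venues in one neighborhood, already one-hot encoded.
toy_onehot = pd.DataFrame({
    "Neighborhood": ["Alderwood"] * 4,
    "Pizza Place": [1, 1, 0, 0],
    "Coffee Shop": [0, 0, 1, 0],
    "Convenience Store": [0, 0, 0, 1],
})

# Mean of each 0/1 column per neighborhood = frequency of that category.
toy_grouped = toy_onehot.groupby("Neighborhood").mean().reset_index()
print(toy_grouped)
```

Two pizza places out of four venues give a frequency of 0.5, and each row of frequencies sums to 1.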

In [40]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    # transpose the neighborhood's row so that each venue category becomes a row
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    # drop the first row, which holds the neighborhood name rather than a frequency
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt North----
                 venue  freq
0   Chinese Restaurant  0.11
1  Japanese Restaurant  0.05
2         Liquor Store  0.05
3           Beer Store  0.05
4       Sandwich Place  0.05


----Agincourt South-Malvern West----
                  venue  freq
0    Chinese Restaurant  0.35
1                  Café  0.06
2  Cantonese Restaurant  0.06
3    Seafood Restaurant  0.06
4      Malay Restaurant  0.06


----Alderwood----
               venue  freq
0        Pizza Place  0.50
1  Convenience Store  0.25
2        Coffee Shop  0.25
3        Zoo Exhibit  0.00
4         Nail Salon  0.00


----Annex----
            venue  freq
0            Café  0.13
1            Park  0.13
2             Pub  0.09
3  Sandwich Place  0.09
4     Coffee Shop  0.04


----Banbury-Don Mills----
                  venue  freq
0           Coffee Shop  0.09
1          Gourmet Shop  0.09
2           Pizza Place  0.09
3         Shopping Mall  0.05
4  Cantonese Restaurant  0.05


----Bathurst Manor----
        

In [41]:
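def return_most_common_venues(row, num_top_venues):
    # skip the leading 'Neighborhood' entry and keep only the venue frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    # names of the most frequent venue categories, most common first
    return row_categories_sorted.index.values[0:num_top_venues]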
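
A quick sanity check of this helper on a made-up row (the neighborhood name `Toyville` and its frequencies are invented for illustration):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the notebook cell: skip the leading 'Neighborhood'
    # entry, sort the remaining frequencies, keep the top category names.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row shaped like one row of toronto_grouped.
toy_row = pd.Series(
    {"Neighborhood": "Toyville", "Café": 0.2, "Park": 0.5, "Pub": 0.3, "Zoo": 0.0}
)

top3 = return_most_common_venues(toy_row, 3)
print(list(top3))
```

Categories tied on frequency, such as the many 0.0 entries in a neighborhood with only a handful of venues, come back in no meaningful order, so the lower ranks for such neighborhoods should not be over-read.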

In [42]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the 3rd all take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt North,Chinese Restaurant,Liquor Store,Pharmacy,Beer Store,Japanese Restaurant,Bank,Bakery,Discount Store,Sandwich Place,Fried Chicken Joint
1,Agincourt South-Malvern West,Chinese Restaurant,Restaurant,Seafood Restaurant,Bank,Mediterranean Restaurant,Noodle House,Café,Asian Restaurant,Cantonese Restaurant,Pool Hall
2,Alderwood,Pizza Place,Convenience Store,Coffee Shop,Zoo,Ethiopian Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store
3,Annex,Café,Park,Pub,Sandwich Place,Burger Joint,BBQ Joint,Indian Restaurant,Middle Eastern Restaurant,French Restaurant,Liquor Store
4,Banbury-Don Mills,Pizza Place,Gourmet Shop,Coffee Shop,Pharmacy,Clothing Store,Bubble Tea Shop,Sandwich Place,Liquor Store,Sporting Goods Shop,Movie Theater
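
The `indicators` lookup used for the column names is enough for ranks 1 to 10. If `num_top_venues` were ever raised past 20, a rank such as 21 would come out as `21th`. A small hypothetical helper (`ordinal` is not part of the notebook) that covers the general case:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # Numbers ending in 11, 12 or 13 are irregular and always take 'th'.
    if 10 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Build the same column labels as the notebook for the top 10 venues.
columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[1], "...", columns[-1])
```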


#### Theater Neighborhood Patterns: 

Although there are only 3 movie theaters in Toronto, I want to explore whether the neighborhoods that host them share significant similarities, judging by their most common venues.

I will then look for the neighborhoods that sit closest to these three in the clusters, which should point to the next best possible locations.
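
One way to make "closest" concrete, sketched under assumptions: treat each neighborhood's venue-frequency row as a vector and rank the other neighborhoods by cosine similarity to a theater neighborhood. This is a swapped-in similarity ranking rather than the clustering itself, and the frame below is made up. In the notebook, the same idea would apply to the frequency columns of `toronto_grouped`.

```python
import numpy as np
import pandas as pd

# Made-up venue frequencies; each row sums to 1 like a row of toronto_grouped.
toy_grouped = pd.DataFrame(
    {
        "Coffee Shop": [0.5, 0.4, 0.0, 0.1],
        "Pizza Place": [0.3, 0.4, 0.1, 0.0],
        "Park": [0.2, 0.2, 0.9, 0.9],
    },
    index=["Theater Hood", "Hood A", "Hood B", "Hood C"],
)

def most_similar(freqs, target, top_n=2):
    """Rank the other rows of `freqs` by cosine similarity to row `target`.

    Assumes every row has at least one non-zero frequency.
    """
    values = freqs.to_numpy(dtype=float)
    unit = values / np.linalg.norm(values, axis=1, keepdims=True)
    sims = pd.Series(unit @ unit[freqs.index.get_loc(target)], index=freqs.index)
    return sims.drop(target).sort_values(ascending=False).head(top_n)

closest = most_similar(toy_grouped, "Theater Hood")
print(closest)
```

With only 3 theater neighborhoods to compare against, any such ranking is exploratory at best.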

In [43]:
cond1 = (neighborhoods_venues_sorted["Neighborhood"] == 'Banbury-Don Mills')
cond2 = (neighborhoods_venues_sorted["Neighborhood"] == 'Rexdale-Kipling')
cond3 = (neighborhoods_venues_sorted["Neighborhood"] == 'Yonge-Eglinton')
row1 = neighborhoods_venues_sorted.loc[cond1,:]
row2 = neighborhoods_venues_sorted.loc[cond2,:]
row3 = neighborhoods_venues_sorted.loc[cond3,:]

In [44]:
theater_common_venues = pd.concat([row1,row2,row3])
theater_common_venues

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Banbury-Don Mills,Pizza Place,Gourmet Shop,Coffee Shop,Pharmacy,Clothing Store,Bubble Tea Shop,Sandwich Place,Liquor Store,Sporting Goods Shop,Movie Theater
101,Rexdale-Kipling,Movie Theater,Playground,Flower Shop,Zoo,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Event Service
134,Yonge-Eglinton,Coffee Shop,Gym,Fast Food Restaurant,Restaurant,Gym / Fitness Center,Plaza,Pizza Place,Poutine Place,Movie Theater,Caribbean Restaurant
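
The three boolean masks and the `pd.concat` call can also be written as a single `isin` filter. A sketch on a cut-down frame (`toy_sorted` is a hypothetical stand-in for `neighborhoods_venues_sorted`, keeping only the first venue column):

```python
import pandas as pd

# Cut-down stand-in for neighborhoods_venues_sorted.
toy_sorted = pd.DataFrame({
    "Neighborhood": ["Annex", "Banbury-Don Mills", "Rexdale-Kipling", "Yonge-Eglinton"],
    "1st Most Common Venue": ["Café", "Pizza Place", "Movie Theater", "Coffee Shop"],
})

theater_hoods = ["Banbury-Don Mills", "Rexdale-Kipling", "Yonge-Eglinton"]

# One mask selects all theater neighborhoods and keeps the original index.
theater_rows = toy_sorted[toy_sorted["Neighborhood"].isin(theater_hoods)]
print(theater_rows)
```

Keeping the names in one list also means the same list can be reused when filtering the income table.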


Next, I will look at income in these neighborhoods to see whether a pattern emerges:

In [45]:
cond_1 = (data["Neighbourhood Name"] == 'Banbury-Don Mills')
cond_2 = (data["Neighbourhood Name"] == 'Rexdale-Kipling')
cond_3 = (data["Neighbourhood Name"] == 'Yonge-Eglinton')
row_1 = data.loc[cond_1,:]
row_2 = data.loc[cond_2,:]
row_3 = data.loc[cond_3,:]
income_theater_neigh = pd.concat([row_1,row_2,row_3])
income_theater_neigh.rename(columns={'Neighbourhood Name':'Neighborhood'}, inplace=True)
income_theater_neigh

,Neighbourhood Number,Neighborhood,"$60,000 to $69,999","$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","100,000 to 149,000","$150,000 and over",LONGITUDE,LATITUDE,Shape__Area,Shape__Length,geometry
4,42,Banbury-Don Mills,1425,1220,960,820,2035,1635,-79.349718,43.737657,19248970.0,25141.57229,"POLYGON ((-79.33055 43.73979, -79.33044 43.739..."
101,4,Rexdale-Kipling,430,310,220,185,205,45,-79.566228,43.723725,4801397.0,9788.586534,"POLYGON ((-79.55512 43.71510, -79.55504 43.714..."
136,100,Yonge-Eglinton,595,520,415,360,870,1230,-79.40359,43.704689,3160334.0,7872.021074,"POLYGON ((-79.41096 43.70408, -79.40962 43.704..."


In [46]:
merged = pd.merge(left=theater_common_venues, right=income_theater_neigh,on='Neighborhood')
merged

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,...,"$70,000 to $79,999","$80,000 to $89,999","$90,000 to $99,999","100,000 to 149,000","$150,000 and over",LONGITUDE,LATITUDE,Shape__Area,Shape__Length,geometry
0,Banbury-Don Mills,Pizza Place,Gourmet Shop,Coffee Shop,Pharmacy,Clothing Store,Bubble Tea Shop,Sandwich Place,Liquor Store,Sporting Goods Shop,...,1220,960,820,2035,1635,-79.349718,43.737657,19248970.0,25141.57229,"POLYGON ((-79.33055 43.73979, -79.33044 43.739..."
1,Rexdale-Kipling,Movie Theater,Playground,Flower Shop,Zoo,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,...,310,220,185,205,45,-79.566228,43.723725,4801397.0,9788.586534,"POLYGON ((-79.55512 43.71510, -79.55504 43.714..."
2,Yonge-Eglinton,Coffee Shop,Gym,Fast Food Restaurant,Restaurant,Gym / Fitness Center,Plaza,Pizza Place,Poutine Place,Movie Theater,...,520,415,360,870,1230,-79.40359,43.704689,3160334.0,7872.021074,"POLYGON ((-79.41096 43.70408, -79.40962 43.704..."
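
To make the income comparison easier to read, the bracket counts can be turned into shares. The sketch below re-types the six bracket counts shown for the three theater neighborhoods and computes the fraction that falls in the two top brackets. One assumption to keep in mind: the denominator covers only the brackets listed here ($60,000 and up), not every household in the neighborhood.

```python
import pandas as pd

# Column labels copied as they appear in the income table.
brackets = [
    "$60,000 to $69,999", "$70,000 to $79,999", "$80,000 to $89,999",
    "$90,000 to $99,999", "100,000 to 149,000", "$150,000 and over",
]

# Counts re-typed from the income table for the three theater neighborhoods.
income = pd.DataFrame(
    [
        [1425, 1220, 960, 820, 2035, 1635],
        [430, 310, 220, 185, 205, 45],
        [595, 520, 415, 360, 870, 1230],
    ],
    index=["Banbury-Don Mills", "Rexdale-Kipling", "Yonge-Eglinton"],
    columns=brackets,
)

# Share of the listed brackets that sits in the two highest ones.
top_share = income[brackets[-2:]].sum(axis=1) / income.sum(axis=1)
print(top_share.round(2))
```

On these numbers the shares come out at roughly 0.45, 0.18 and 0.53. That wide spread is consistent with income not lining up neatly with where the theaters are.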


## Conclusion: <a name="conclusion"></a>

Looking at this data, I can make the following recommendations:
 - Toronto is a prime location for theaters in general, as there aren't many in the area.
 - Based on the similarity in income between certain neighborhoods, the center of Toronto looks like an ideal location because of its proximity to people and the income bracket of that region, although income did not show a significant relationship with where the existing theaters are located.

Because the number of venues is so limited, the sample is too small to support any analytical inference. Any study of what makes a movie theater successful in Toronto will need additional indicators, such as market research covering demographics and population.

Such research would offer insight into whether the lack of theaters reflects a lack of business or whether watching movies in theaters simply isn't part of the entertainment culture in Toronto.

**Thank you for getting this far, and I hope you liked my approach to this study!**
