<h1><center>Opening an upscale hotel in Los Angeles</center></h1>

## Final Report

### Tasks:
* Scrape metadata for each Los Angeles neighborhood from the LA Times to build a dataframe.
* Obtain the coordinates of each neighborhood so that it can later be plotted on a map.
* Acquire nearby venues using the FourSquare API.
* Clean the data.
* Segment each neighborhood into a cluster.
* Form an opinion on the neighborhoods in regard to opening an upscale hotel there.

# Import necessary libraries

In [483]:
# Uncomment the lines below if these packages are not yet installed on your machine.
#! pip install folium
#! pip install geocoder

# Libraries for creating dataframes.
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Libraries for web-scraping
from bs4 import BeautifulSoup
import requests
import json
# json_normalize is available at the top level of pandas (1.0 and later).
from pandas import json_normalize

# Method for clustering.
from sklearn.cluster import KMeans

# Libraries for plotting maps.
import geocoder
from geopy.geocoders import Nominatim
import folium
print("Finished!")

Finished!


# Scraping LA Times for Data

In [15]:
url = "https://maps.latimes.com/neighborhoods/income/median/neighborhood/list/"
data = requests.get(url).text
soup = BeautifulSoup(data,"html.parser")
table = soup.find(id = "sortable_table")

In [125]:
# Scrape each neighborhood name from the table.

# List of neighborhood names (hyphens in the anchor names become spaces).
anchors = table.find_all('a')
neighborhoodList = [anchor['name'].replace("-", " ") for anchor in anchors]

# List of median household incomes by neighborhood in LA.
incomeTd = table.find_all(style="font-size:17px; padding-left:5px; line-height:160%; text-align:right;")
incomeList = []

# Convert each HTML table cell into a plain income string.
for line in incomeTd:
    cell = str(line)

    # The income figure sits between the '$' sign and the closing '</td>' tag.
    startIndex = cell.find("$")
    endIndex = cell.find("</td>")
    income = cell[startIndex + 1:endIndex]
    incomeList.append(income)

# Remove commas from each number in list.
incomeList = [num.replace(',','') for num in incomeList]

# Convert each string object to float for ease of use during analysis.
incomeList = [float(num) for num in incomeList]

# Confirm that lists are equal in length.
print(len(incomeList) == len(neighborhoodList))

True
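
The slicing above assumes that every cell contains a `$` and a closing `</td>`. As a hedged alternative, a regular expression can pull the amount out of the cell text and signal when nothing was found. `parse_income` is a hypothetical helper that is not part of the original notebook, and the sample strings are illustrative.

```python
import re

# Hypothetical helper (not in the original notebook): extract a dollar amount
# such as "$207,938" from a cell's text and return it as a float.
_AMOUNT = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")

def parse_income(cell_text):
    match = _AMOUNT.search(cell_text)
    if match is None:
        return None  # No dollar amount found; the caller decides how to handle it.
    return float(match.group(1).replace(",", ""))

# Illustrative inputs modelled on the scraped table cells.
print(parse_income('<td style="text-align:right;">$207,938</td>'))
print(parse_income("no amount here"))
```

Returning `None` instead of slicing blindly makes a malformed cell visible before the `float` conversion step.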


# Creating a dataframe for our data

In [169]:
laData = pd.DataFrame()
laData["Neighborhood"] = neighborhoodList
laData["Median-Income"] = incomeList

# Show the first 20 rows.
laData.head(20)

                  Neighborhood  Median-Income
0                      bel air       207938.0
1                 hidden hills       203199.0
2                rolling hills       184777.0
3                beverly crest       169282.0
4            pacific palisades       168008.0
5         palos verdes estates       167344.0
6                   san marino       158855.0
7         la canada flintridge       148996.0
8        rolling hills estates       145628.0
9                       malibu       138215.0
10            la habra heights       137034.0
11             manhattan beach       136481.0
12      santa monica mountains       132997.0
13         rancho palos verdes       128321.0
14            westlake village       126550.0
15                   calabasas       126178.0
16              west san dimas       125984.0
17                    bradbury       123773.0
18             stevenson ranch       122833.0
19                porter ranch       121428.0


# Getting coordinates for each neighborhood.

In [221]:
# Lists containing column values for each neighborhood.
lats = []
longs = []
hoodNames = []

# Create coordinates dataframe.
coordinatesDf = pd.DataFrame(columns = ["Neighborhood","Latitude","Longitude"])

# Geocoder used to look up each neighborhood's coordinates.
geolocator = Nominatim(user_agent="http")

In [222]:
# Get coordinates for each neighborhood, skipping any that fail to geocode.
for ind, hood in enumerate(laData["Neighborhood"]):
    print(hood,ind)
    try:
        address = geolocator.geocode("{} , Los Angeles".format(hood))
    except Exception:
        # Skip this neighborhood if the geocoding request raises an error.
        continue
    if address is None:
        continue
    lats.append(address.latitude)
    longs.append(address.longitude)
    hoodNames.append(hood)

bel air 0
hidden hills 1
rolling hills 2
beverly crest 3
pacific palisades 4
palos verdes estates 5
san marino 6
la canada flintridge 7
rolling hills estates 8
malibu 9
la habra heights 10
manhattan beach 11
santa monica mountains 12
rancho palos verdes 13
westlake village 14
calabasas 15
west san dimas 16
bradbury 17
stevenson ranch 18
porter ranch 19
topanga 20
ladera heights 21
agoura hills 22
leona valley 23
brentwood 24
cheviot hills 25
hermosa beach 26
castaic 27
hollywood hills west 28
walnut 29
hasley canyon 30
agua dulce 31
beverlywood 32
northwest palmdale 33
west hills 34
cerritos 35
beverly hills 36
century city 37
north whittier 38
santa susana mountains 39
castaic canyons 40
san pasqual 41
ridge route 42
marina del rey 43
diamond bar 44
redondo beach 45
playa del rey 46
woodland hills 47
claremont 48
santa clarita 49
sierra madre 50
west los angeles 51
ramona 52
tujunga canyons 53
hancock park 54
san dimas 55
chatsworth 56
acton 57
el segundo 58
granada hills 59
la mirada
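
The loop above silently skips any neighborhood whose lookup fails, so it is hard to tell afterwards which ones are missing. The sketch below retries failed lookups and records the names that never resolved. The names `geocode_with_retry`, `collect_coordinates`, and `FakeLocation` are hypothetical and not part of the original notebook; the stub geocoder stands in for `geolocator.geocode` so the example runs without network access.

```python
import time
from collections import namedtuple

def geocode_with_retry(geocode_fn, query, retries=3, delay_seconds=1.0):
    """Call geocode_fn(query), retrying on exceptions; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return geocode_fn(query)
        except Exception:
            if attempt < retries - 1:
                time.sleep(delay_seconds)
    return None

def collect_coordinates(names, geocode_fn, **retry_options):
    """Return (found, missed): coordinate tuples, and names with no geocoding result."""
    found, missed = [], []
    for name in names:
        location = geocode_with_retry(
            geocode_fn, "{}, Los Angeles".format(name), **retry_options)
        if location is None:
            missed.append(name)
        else:
            found.append((name, location.latitude, location.longitude))
    return found, missed

# Stub geocoder for illustration only; in the notebook this would be geolocator.geocode.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def fake_geocode(query):
    known = {"bel air, Los Angeles": FakeLocation(34.098883, -118.459881)}
    return known.get(query)

found, missed = collect_coordinates(["bel air", "nowhere"], fake_geocode, delay_seconds=0)
print(found)
print(missed)
```

Keeping the `missed` list makes it easy to report how many neighborhoods dropped out before the merge step.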

# Creating a dataframe with coordinates

In [234]:
# Creating the dataframe.
coordinateDf = pd.DataFrame()

# Filling the dataframe with data.
coordinateDf["Neighborhood"] = hoodNames
coordinateDf["Latitude"] = lats
coordinateDf["Longitude"] = longs

coordinateDf.head()

,Neighborhood,Latitude,Longitude
0,bel air,34.098883,-118.459881
1,hidden hills,34.164091,-118.657837
2,rolling hills,33.766804,-118.349662
3,beverly crest,34.11677,-118.432261
4,pacific palisades,34.048064,-118.526471


# Merging the two dataframes

In [544]:
totalLaDf = coordinateDf.merge(laData, on = "Neighborhood")

# Making sure the size of the dataframe is correct.
totalLaDf.shape

# Saving the dataframe.
totalLaDf.to_csv("totalDf.csv",index = False)

totalLaDf.head()

,Neighborhood,Latitude,Longitude,Median-Income
0,bel air,34.098883,-118.459881,207938.0
1,hidden hills,34.164091,-118.657837,203199.0
2,rolling hills,33.766804,-118.349662,184777.0
3,beverly crest,34.11677,-118.432261,169282.0
4,pacific palisades,34.048064,-118.526471,168008.0
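
By default, `merge` performs an inner join, so any neighborhood that has an income figure but no coordinates (or the reverse) disappears from `totalLaDf` without warning. A minimal sketch of how the dropped names could be listed using plain sets; `dropped_by_inner_merge` is a hypothetical helper and the sample names are illustrative.

```python
def dropped_by_inner_merge(left_keys, right_keys):
    """Return the keys found only on the left and only on the right.

    An inner merge on these keys would discard both groups.
    """
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)

# Illustrative data: "example hood" has an income figure but was never geocoded.
income_names = ["bel air", "hidden hills", "example hood"]
coordinate_names = ["bel air", "hidden hills"]

only_income, only_coordinates = dropped_by_inner_merge(income_names, coordinate_names)
print(only_income)
print(only_coordinates)
```

In the notebook, the two inputs would be `laData["Neighborhood"]` and `coordinateDf["Neighborhood"]`.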


# Creating a map of all the neighborhoods.

In [240]:
address = "Los Angeles, California"

# Getting coordinates of LA.
geolocator = Nominatim(user_agent="html")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("The geographical coordinates of Los Angeles, California are {}, {}.".format(latitude, longitude))

The geographical coordinates of Los Angeles, California are 34.0536909, -118.2427666.


In [615]:
laMap = folium.Map(location = [latitude,longitude], 
                  zoom_start = 11, tiles = "stamentoner")
# Plot each neighborhood as a circle marker with its name and income in the popup.
for name,lat,long,income in zip(totalLaDf["Neighborhood"],totalLaDf["Latitude"],totalLaDf["Longitude"],totalLaDf["Median-Income"]):
    folium.CircleMarker(
        radius=15,
        location=[lat,long],
        popup='Name : {}\n Income : ${}'.format(name,income),
        color="magenta",
        fill_color = "magenta",
        fill=True,
        fill_opacity = .5,
    ).add_to(laMap)
    
laMap

In [253]:
# Save map.
laMap.save("laMap.html")

# Using FourSquare to retrieve nearby hotels for each neighborhood

In [263]:
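# @hidden_cell
# Credentials are read from environment variables rather than hard-coded;
# the variable names below are placeholders chosen for this notebook.
import os
CLIENT_ID = os.environ["FOURSQUARE_CLIENT_ID"]
CLIENT_SECRET = os.environ["FOURSQUARE_CLIENT_SECRET"]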
VERSION = "20180605"
LIMIT = 200

In [266]:
venues = []
# Get nearby venues for each neighborhood.
for lat,long,hood in zip(totalLaDf["Latitude"],totalLaDf["Longitude"],totalLaDf["Neighborhood"]):
    
    # URL to get data from FourSquare.
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        2000, 
        LIMIT)
    
    # Data received from the FourSquare API.
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # Add each venue, together with its neighborhood, to the list.
    for venue in results:
        venues.append((hood,
                       lat,
                       long,
                       venue['venue']['name'],
                       venue['venue']['location']['lat'],
                       venue['venue']['location']['lng'],
                       venue['venue']['categories'][0]['name']))
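
The request above builds the query string by hand and assumes every venue has at least one category. The sketch below shows one way to build the same URL with `urllib.parse.urlencode` and to extract a venue row defensively. `build_explore_url` and `venue_row` are hypothetical helpers, the credentials are placeholders, and the sample item only mimics the fields the notebook reads from the response.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as used in the notebook.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret, version, radius=2000, limit=200):
    """Build the explore URL, letting urlencode escape the parameter values."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

def venue_row(hood, lat, lng, item):
    """Flatten one response item into a tuple; the category is None when none is listed."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None
    return (hood, lat, lng, venue["name"],
            venue["location"]["lat"], venue["location"]["lng"], category)

# Placeholder credentials and a made-up item, for illustration only.
url = build_explore_url(34.098883, -118.459881, "MY_ID", "MY_SECRET", "20180605")
print(parse_qs(urlsplit(url).query)["ll"])

sample_item = {"venue": {"name": "Example Venue",
                         "location": {"lat": 34.0, "lng": -118.0},
                         "categories": []}}
print(venue_row("bel air", 34.098883, -118.459881, sample_item))
```

Handling the empty-category case avoids an `IndexError` that would otherwise stop the loop partway through the neighborhoods.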

In [307]:
venuesDf = pd.DataFrame(venues)
venuesDf.columns = ["Neighborhood", "Latitude", "Longitude", "Venue Name", "Venue Latitude", "Venue Longitude", "Venue Category"]
venuesDf.head()

,Neighborhood,Latitude,Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
0,bel air,34.098883,-118.459881,Hotel Bel Air,34.086611,-118.446362,Hotel
1,bel air,34.098883,-118.459881,Getty Sculpture Garden,34.08756,-118.475748,Art Museum
2,bel air,34.098883,-118.459881,Wolfgang Puck,34.086594,-118.446351,Restaurant
3,bel air,34.098883,-118.459881,Oak Bar at Hotel Bel Air,34.086209,-118.446144,Hotel Bar
4,bel air,34.098883,-118.459881,Bel Air Foods,34.116383,-118.464182,Grocery Store


# Cleaning the data

In [311]:
print("There are {} venue types in the dataset, yet we only need one.".format(len(venuesDf["Venue Category"].unique())))

There are 463 venue types in the dataset, yet we only need one.


In [545]:
# Only keep the venues that are hotels.
hotelDf = venuesDf[venuesDf["Venue Category"] == 'Hotel']

print("There are {} hotels in LA.".format(hotelDf.shape[0]))
hotelDf.head()

There are 240 hotels in LA.


,Neighborhood,Latitude,Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
0,bel air,34.098883,-118.459881,Hotel Bel Air,34.086611,-118.446362,Hotel
47,hidden hills,34.164091,-118.657837,Hilton Garden Inn,34.151491,-118.648628,Hotel
380,malibu,34.035591,-118.689423,Malibu Beach Inn,34.038075,-118.67432,Hotel
390,malibu,34.035591,-118.689423,Malibu Inn,34.037568,-118.677005,Hotel
391,malibu,34.035591,-118.689423,The Surfrider Malibu,34.03711,-118.67793,Hotel


In [546]:
# Create a dataframe with the number of hotels in each neighborhood.
hotelFreqDf = hotelDf.groupby('Neighborhood').count()
hotelFreqDf.reset_index(inplace=True)

# Keep only the frequency and name.
hotelFreqDf = hotelFreqDf[["Neighborhood","Venue Category"]].astype(str)
hotelFreqDf.head()

,Neighborhood,Venue Category
0,agoura hills,2
1,arcadia,6
2,artesia,1
3,avalon,9
4,azusa,1
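
Because `hotelDf` only contains rows for hotels, grouping it by neighborhood yields only neighborhoods with at least one hotel; neighborhoods with no hotels never reach the clustering step. A minimal sketch of a count that keeps the zeros, using `collections.Counter`. `hotel_counts` is a hypothetical helper and the sample lists are illustrative.

```python
from collections import Counter

def hotel_counts(all_neighborhoods, hotel_neighborhoods):
    """Count hotels per neighborhood, reporting 0 for neighborhoods without any."""
    counts = Counter(hotel_neighborhoods)
    return {name: counts.get(name, 0) for name in all_neighborhoods}

# Illustrative data: one entry per hotel row, plus a neighborhood with no hotels.
all_names = ["agoura hills", "arcadia", "bel air", "example hood"]
hotel_rows = ["agoura hills", "agoura hills", "arcadia", "bel air"]

print(hotel_counts(all_names, hotel_rows))
```

Whether zero-hotel neighborhoods should be included is a modelling decision; the notebook's clustering, as written, considers only neighborhoods where FourSquare returned at least one hotel.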


# Segmenting each neighborhood to a cluster

In [547]:
# Creating the dataframe for building our k-means model.
kMeansDf = hotelFreqDf.merge(totalLaDf, on = "Neighborhood")

# Dropping unnecessary columns (copying so later assignments do not modify a slice).
kMeansDf = kMeansDf[["Venue Category","Median-Income"]].copy()
kMeansDf.columns = ["Num of Hotels","Median-Income"]

# Casting Num of Hotels column to float.
kMeansDf["Num of Hotels"] = kMeansDf["Num of Hotels"].astype(float)

# Normalizing the data by dividing each column by its maximum.
kMeansDf["Num of Hotels"] = kMeansDf["Num of Hotels"] / kMeansDf["Num of Hotels"].max()
kMeansDf["Median-Income"] = kMeansDf["Median-Income"] / kMeansDf["Median-Income"].max()

kMeansDf.head()

,Num of Hotels,Median-Income
0,0.222222,0.565592
1,0.666667,0.36457
2,0.111111,0.291135
3,1.0,0.255845
4,0.111111,0.256322
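
The normalization above divides each column by its maximum, so the largest value maps to 1 while the smallest does not necessarily map to 0. The sketch below reproduces that scaling in plain Python for the first five hotel counts (2, 6, 1, 9, 1) and contrasts it with min-max scaling, which is an alternative and not what the notebook uses. Both function names are hypothetical.

```python
def max_scale(values):
    """Divide every value by the maximum (the scaling used in the notebook)."""
    top = max(values)
    return [v / top for v in values]

def min_max_scale(values):
    """Alternative: map the minimum to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # All values equal; avoid dividing by zero.
    return [(v - low) / (high - low) for v in values]

hotel_counts_sample = [2, 6, 1, 9, 1]  # First five rows of the hotel frequency table.

print([round(x, 6) for x in max_scale(hotel_counts_sample)])
print(min_max_scale(hotel_counts_sample))
```

The first line of output matches the `Num of Hotels` column shown above (0.222222, 0.666667, 0.111111, 1.0, 0.111111).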


In [548]:
k = 4

# Creating the k-means model.
kMeans = KMeans(n_clusters = k,random_state = 0).fit(kMeansDf)

kMeans.labels_[:10]

array([3, 2, 0, 2, 0, 0, 3, 0, 0, 0], dtype=int32)
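
To make explicit what the clustering step does, here is a plain-Python sketch of the basic k-means iteration (Lloyd's algorithm): assign each point to its nearest center, move each center to the mean of its points, and repeat until the assignments stop changing. This is an illustration only, not scikit-learn's implementation, which by default uses a more careful initialization (k-means++). The function names and the sample points are hypothetical.

```python
def assign(points, centers):
    """Return, for each point, the index of its nearest center (squared Euclidean distance)."""
    labels = []
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels

def update(points, labels, centers):
    """Move each center to the mean of its assigned points; leave it in place if it has none."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def lloyd_kmeans(points, initial_centers, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centers = list(initial_centers)
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Illustrative (hotel share, income share) pairs: two low-hotel/high-income
# points and two high-hotel/low-income points.
sample_points = [(0.1, 0.9), (0.15, 0.95), (0.9, 0.3), (1.0, 0.25)]
labels, centers = lloyd_kmeans(sample_points, [sample_points[0], sample_points[2]])
print(labels)
print(centers)
```

Because both features were scaled to at most 1 beforehand, neither hotel count nor income dominates the distance computation.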

In [577]:
# Adding the cluster label for each neighborhood on the dataframe.
hotelFreqDf["Cluster"] = kMeans.labels_

# Getting full dataset.
finalDf = hotelFreqDf.merge(totalLaDf.copy(),on = "Neighborhood")

# Re-naming columns.
finalDf.columns = ["Neighborhood","Hotel-Count","Cluster","Latitude","Longitude","Median-Income"]

# Setting columns to numeric values for analysis ease.
finalDf["Hotel-Count"] = finalDf["Hotel-Count"].astype(float)

finalDf.head()

,Neighborhood,Hotel-Count,Cluster,Latitude,Longitude,Median-Income
0,agoura hills,2.0,3,34.14791,-118.765704,117608.0
1,arcadia,6.0,2,-0.193964,-78.492941,75808.0
2,artesia,1.0,0,33.86902,-118.07962,60538.0
3,avalon,9.0,2,33.34221,-118.327261,53200.0
4,azusa,1.0,0,34.133875,-117.905605,53299.0
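
Note that arcadia's coordinates in the table above (latitude -0.193964, longitude -78.492941) lie close to the equator, far outside Los Angeles, which suggests the geocoder matched a different place with a similar name. A simple bounding-box check can flag such rows before they are used. The box below is a rough, assumed rectangle around Los Angeles County, not an official boundary, and `in_la_bounds` is a hypothetical helper.

```python
# Rough, assumed bounding box around Los Angeles County (illustrative values only).
LAT_MIN, LAT_MAX = 32.7, 34.9
LON_MIN, LON_MAX = -119.0, -117.6

def in_la_bounds(lat, lon):
    """Return True when the coordinate falls inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

# Rows taken from the table above: (neighborhood, latitude, longitude).
rows = [
    ("agoura hills", 34.14791, -118.765704),
    ("arcadia", -0.193964, -78.492941),
    ("avalon", 33.34221, -118.327261),
]

suspect = [name for name, lat, lon in rows if not in_la_bounds(lat, lon)]
print(suspect)  # ['arcadia']
```

Rows flagged this way could be re-geocoded with a more specific query or excluded, since their venue lookups were made around the wrong location.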


# Analyzing the data

In [578]:
# Cluster 0
finalDf[finalDf["Cluster"] == 0]

,Neighborhood,Hotel-Count,Cluster,Latitude,Longitude,Median-Income
2,artesia,1.0,0,33.86902,-118.07962,60538.0
4,azusa,1.0,0,34.133875,-117.905605,53299.0
5,baldwin park,2.0,0,34.085474,-117.961176,56585.0
7,bell,1.0,0,33.974781,-118.186636,40556.0
8,bell gardens,2.0,0,33.969456,-118.150395,41532.0
9,bellflower,1.0,0,33.896347,-118.117083,53325.0
10,beverly grove,2.0,0,34.076034,-118.369972,63039.0
13,burbank,2.0,0,34.181648,-118.325855,64416.0
15,carson,2.0,0,33.832204,-118.251755,70645.0
16,carthay,2.0,0,34.061121,-118.3673,71398.0


In [579]:
# Cluster 1
finalDf[finalDf["Cluster"] == 1]

,Neighborhood,Hotel-Count,Cluster,Latitude,Longitude,Median-Income
12,beverlywood,5.0,1,34.045933,-118.39492,105253.0
28,desert view highlands,4.0,1,34.589978,-118.153456,80867.0
30,downtown,4.0,1,34.042849,-118.247673,15003.0
33,harbor city,3.0,1,33.797282,-118.300472,55454.0
37,hermosa beach,4.0,1,33.86428,-118.39591,109509.0
42,hollywood hills west,5.0,1,34.110485,-118.373388,108199.0
49,lawndale,3.0,1,33.888522,-118.353199,53150.0
56,marina del rey,5.0,1,33.977685,-118.448648,92763.0
59,montebello,4.0,1,24.048652,-104.608102,52623.0
60,palmdale,3.0,1,34.579313,-118.117111,63317.0


In [580]:
# Cluster 2
finalDf[finalDf["Cluster"] == 2]

,Neighborhood,Hotel-Count,Cluster,Latitude,Longitude,Median-Income
1,arcadia,6.0,2,-0.193964,-78.492941,75808.0
3,avalon,9.0,2,33.34221,-118.327261,53200.0
11,beverly hills,6.0,2,34.06965,-118.396306,96312.0
17,catalina island,9.0,2,-14.060373,-75.740348,56295.0
18,century city,6.0,2,34.057426,-118.414727,95135.0
27,del aire,7.0,2,33.923274,-118.374702,66442.0
32,glendale,6.0,2,-0.193964,-78.492941,57112.0
39,highland park,6.0,2,-0.193964,-78.492941,45478.0
40,hollywood,6.0,2,34.098003,-118.329523,33694.0
52,long beach,6.0,2,33.769016,-118.191605,50985.0


In [581]:
# Cluster 3
finalDf[finalDf["Cluster"] == 3]

,Neighborhood,Hotel-Count,Cluster,Latitude,Longitude,Median-Income
0,agoura hills,2.0,3,34.14791,-118.765704,117608.0
6,bel air,1.0,3,34.098883,-118.459881,207938.0
14,calabasas,1.0,3,34.144664,-118.644097,126178.0
19,cerritos,1.0,3,33.864429,-118.053932,98212.0
21,cheviot hills,3.0,3,34.040588,-118.409887,111813.0
38,hidden hills,1.0,3,34.164091,-118.657837,203199.0
46,ladera heights,1.0,3,33.994179,-118.375354,117925.0
54,malibu,3.0,3,34.035591,-118.689423,138215.0
55,manhattan beach,1.0,3,33.891599,-118.395124,136481.0
66,porter ranch,1.0,3,34.281816,-118.561271,121428.0


# What can we see?

In [593]:
finalDf.groupby("Cluster").mean()[["Hotel-Count","Median-Income"]]

Cluster,Hotel-Count,Median-Income
0,1.352941,57538.588235
1,3.941176,74677.235294
2,6.538462,58540.769231
3,1.583333,135698.333333


### After segmenting the neighborhoods into four clusters, clear differences between the clusters emerge.

### Cluster 0 offers little competition from other hotels (about 1.4 per neighborhood on average), but its median household income (about $57,500) suggests a middle-class area, which makes it less than ideal for an upscale hotel.

### Clusters 1 and 2 already contain several hotels per neighborhood (about 3.9 and 6.5 on average), which would mean unwanted competition for a new hotel and makes both clusters unsuitable for such a business.

### Cluster 3 neighborhoods are by far the most suitable for opening an upscale hotel. Their median household income (about $135,700 on average) suggests that these areas are wealthy, and they face little competition from nearby hotels (about 1.6 per neighborhood on average).


# Visualizing the data

In [627]:
# Create map object.
clusteredMap = folium.Map([latitude,longitude], 
                  zoom_start = 11, tiles = "stamentoner")
colors = ["coral","yellow","red","lime"]

# Plot each neighborhood to map and color it according to cluster.
for nHood,lat,long,clus in zip(finalDf["Neighborhood"],finalDf["Latitude"],finalDf["Longitude"],finalDf["Cluster"]):
    folium.CircleMarker(
    [lat,long],
    radius = 15,
    color = colors[clus],
    popup = "{}, cluster {}".format(nHood,clus),
    fill = True,
    fill_color = colors[clus],
    fill_opacity = .5).add_to(clusteredMap)
clusteredMap

# Conclusions
### By looking at the map, you can see which neighborhoods are suitable for opening a luxury hotel. If I were to rank the clusters in order, I would choose Cluster 3 > Cluster 1 > Cluster 0 > Cluster 2.

##### I hope that you found this presentation to be insightful. For any feedback, contact imrond2@gmail.com.