# Capstone Project
## Finding a possible location and type for a restaurant in Los Angeles County, CA

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction: Business Problem <a name="introduction"></a>


In this project I will try to give a recommendation on where to open a
restaurant in Los Angeles County, CA. In addition, I will recommend which
type of restaurant could be opened, based on the existing restaurants in the
area and the generally popular restaurants in the whole county.

The decision on where to open a restaurant can be based on many factors,
depending on the target group. For example, one could look for very densely
populated areas, or areas with many wealthy citizens.

I will make a decision based on the following conditions:
- The area should offer a good balance between the number of possible
customers and a high median income
- The type of restaurant will be determined by the most recommended
categories of food venue in Los Angeles County, CA and the number of already
existing venues in the area


## Data <a name="data"></a>

I used three different datasets as the basis for my analysis. Using these
datasets I am able to work with the following features, among others:
- A list of areas in Los Angeles County, CA based on the ZIP code
- The number of citizens and households in every area
- The estimated median income of every area
- The latitude and longitude for every zip code in the county

<br />

I combined the following datasets for this:
- 2010 Los Angeles Census Data
    - https://www.kaggle.com/cityofLA/los-angeles-census-data
- Median Household Income by Zip Code in 2019
    - http://www.laalmanac.com/employment/em12c.php
- US Zip Code Latitude and Longitude
    - https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/information/

To analyse the existing food venues in the county, the **Foursquare API** is
used. With this API we can obtain the data to answer the following two
questions:
- What are the most recommended types of restaurants in the county?
- Which food venue categories already exist in the area where we want to
open a restaurant?

<br />

#### Merging the three datasets as the basis of the analysis
First, let's merge all three datasets and get rid of unnecessary
information. I will continue to use a single dataframe with the combined
datasets as the basis for further analysis.<br />

The census data and the geodata for the US zip codes are available as CSV
files that I read directly into a dataframe. The records for the median
household income are only available on a website, so I downloaded the page
as an HTML file, which is then used to create a dataframe. Because the
median income values contain a dollar sign and thousands separators, they
cannot be converted into numeric values automatically, so we have to do some
data preparation.
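The cleaning described above can also be sketched in isolation. This is a
minimal example on made-up values, not the real income data; the regular
expression and the use of `errors="coerce"` are my own choices for
illustration and may differ from the cell that follows.

```python
import pandas as pd

# Made-up values in the formats described above (not the real data)
raw = pd.Series(["$43,360", "$37,285", "---"])

# Remove '$' and ',' in one pass, then convert to numbers; anything that
# is still not a number (such as '---') becomes NaN instead of an error
income = pd.to_numeric(
    raw.str.replace(r"[$,]", "", regex=True), errors="coerce")

print(income.dropna().tolist())
```

Rows that end up as NaN can then be removed with `dropna()`, in the same way
as rows with a missing income.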

In [6]:
import pandas as pd

# Dataset: 2010 Los Angeles Census Data
df_census = pd.read_csv("2010-census-populations-by-zip-code.csv")
# Dataset: Median Household Income by Zip Code in 2019
url = "Median Household Income By Zip Code in Los Angeles County, California.html"
dfs = pd.read_html(url)
df_income_all = dfs[0]
# drop areas where the median income is missing
df_income_na = df_income_all[df_income_all['Estimated Median Income'].notna()]
# also drop areas where the income is given as the placeholder '---';
# .copy() avoids pandas' SettingWithCopyWarning in the assignment below
df_income = df_income_na[
    df_income_na['Estimated Median Income'] != '---'].copy()
# strip the dollar sign and the thousands separators, then convert the
# values to numbers that can be used for clustering / calculation
df_income['Estimated Median Income'] = pd.to_numeric(
    df_income['Estimated Median Income']
    .str.lstrip('$')
    .str.replace(',', '', regex=False))
# Dataset: US Zip Code Latitude and Longitude
df_geodata = pd.read_csv("us-zip-code-latitude-and-longitude.csv", sep=';')
# Let's start by joining the income dataset on the census data via the zip code
df_income_geo = df_census.join(df_income.set_index('Zip Code'), on='Zip Code')
# Now join the geodata on the combined census and income data
dataset_geo = df_income_geo.join(df_geodata.set_index('Zip'), on='Zip Code',
                           how='left')
# Let us only use the columns we need for the further analysis and ignore
# the rest
prepared_ds = dataset_geo[[
  "Zip Code", "City", "Community", "Estimated Median Income",
  "Longitude", "Latitude", "Total Population", "Median Age",
  "Total Males", "Total Females", "Total Households",
  "Average Household Size"]]
# last step: let's drop records with missing data
final_ds = prepared_ds.dropna(axis=0)
final_ds.shape

(279, 12)

So the final dataframe now contains 279 areas with 12 features. <br />
Let's get an overview of how the records look now:

In [8]:
final_ds.head(5)

| | Zip Code | City | Community | Estimated Median Income | Longitude | Latitude | Total Population | Median Age | Total Males | Total Females | Total Households | Average Household Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 90001 | Los Angeles | Los Angeles (South Los Angeles), Florence-Graham | 43360.0 | -118.24878 | 33.972914 | 57110 | 26.6 | 28468 | 28642 | 12971 | 4.4 |
| 2 | 90002 | Los Angeles | Los Angeles (Southeast Los Angeles, Watts) | 37285.0 | -118.24845 | 33.948315 | 51223 | 25.5 | 24876 | 26347 | 11731 | 4.36 |
| 3 | 90003 | Los Angeles | Los Angeles (South Los Angeles, Southeast Los ... | 40598.0 | -118.276 | 33.962714 | 66266 | 26.3 | 32631 | 33635 | 15642 | 4.22 |
| 4 | 90004 | Los Angeles | Los Angeles (Hancock Park, Rampart Village, Vi... | 49675.0 | -118.30755 | 34.07711 | 62180 | 34.8 | 31302 | 30878 | 22547 | 2.73 |
| 5 | 90005 | Los Angeles | Los Angeles (Hancock Park, Koreatown, Wilshire... | 38491.0 | -118.30848 | 34.058911 | 37681 | 33.9 | 19299 | 18382 | 15044 | 2.5 |
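Because rows with missing data are dropped after the joins, it can be useful
to check which zip codes did not find a partner in another dataset. The
following sketch shows one way to do that with `merge` and `indicator=True`.
It uses two tiny frames with a few values borrowed from the table above; the
missing income for 90003 is made up for illustration, and this check is not
part of the original notebook.

```python
import pandas as pd

# Tiny stand-ins for the census and income dataframes
census = pd.DataFrame({"Zip Code": [90001, 90002, 90003],
                       "Total Population": [57110, 51223, 66266]})
income = pd.DataFrame({"Zip Code": [90001, 90002],
                       "Estimated Median Income": [43360, 37285]})

# indicator=True adds a "_merge" column that says where each row was found
merged = census.merge(income, on="Zip Code", how="left", indicator=True)

# Zip codes that exist in the census data but not in the income data
unmatched = merged.loc[merged["_merge"] == "left_only", "Zip Code"].tolist()
print(unmatched)
```

Zip codes listed in `unmatched` are the ones a later `dropna` would remove
because their income is missing.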


#### Finding the most recommended food venue categories in the county
Now we will use the Foursquare API to find out which food venues are the
most recommended in the county. For this we will go through every single
area and get the recommended food venues in the vicinity of the area center.

In [18]:
# first, let's import the necessary libraries and the credentials for the API
import foursquare
import requests
CLIENT_ID = foursquare.CLIENT_ID
CLIENT_SECRET = foursquare.CLIENT_SECRET
ACCESS_TOKEN = foursquare.ACCESS_TOKEN
VERSION = '20210514' # Foursquare API version
LIMIT = 100

In [12]:
# first, build a list of the areas without the columns we don't need
areas = final_ds.drop(columns=['Estimated Median Income', 'Total Population',
                               'Median Age', 'Total Males', 'Total Females',
                               'Total Households', 'Average Household Size'])
areas.head()

| | Zip Code | City | Community | Longitude | Latitude |
|---|---|---|---|---|---|
| 1 | 90001 | Los Angeles | Los Angeles (South Los Angeles), Florence-Graham | -118.24878 | 33.972914 |
| 2 | 90002 | Los Angeles | Los Angeles (Southeast Los Angeles, Watts) | -118.24845 | 33.948315 |
| 3 | 90003 | Los Angeles | Los Angeles (South Los Angeles, Southeast Los ... | -118.276 | 33.962714 |
| 4 | 90004 | Los Angeles | Los Angeles (Hancock Park, Rampart Village, Vi... | -118.30755 | 34.07711 |
| 5 | 90005 | Los Angeles | Los Angeles (Hancock Park, Koreatown, Wilshire... | -118.30848 | 34.058911 |


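In the next code block, we will iterate through every area in the dataframe
and get up to 50 recommendations per area within a radius of 1,000 meters
around the area center. For that I use the "explore" endpoint of the
Foursquare API. I use the *categoryId* and *sortByPopularity* parameters to
only request food venues, sorted by popularity in descending order. A venue
is only kept if its postal code matches the zip code of the area, or if the
response contains no postal code for it.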

In [19]:
venues_list = []
for index, area in areas.iterrows():
    # query parameters for the "explore" endpoint
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'll': '{},{}'.format(area['Latitude'], area['Longitude']),
        'radius': 1000,  # search radius in meters around the area center
        'limit': LIMIT,
        'offset': 0,
        'categoryId': '4d4b7105d754a06374d81259',  # only food venues
        'sortByPopularity': 1,
    }
    response = requests.get('https://api.foursquare.com/v2/venues/explore',
                            params=params)
    results = response.json()["response"]['groups'][0]['items']
    for v in results:
        location = v['venue'].get('location', {})
        # use the city and zip code of the venue if the response contains
        # them, otherwise fall back to the values of the area
        city = location.get('city', area['City'])
        postalCode = str(location.get('postalCode', area['Zip Code']))
        # keep only venues whose zip code matches the area and build a
        # list with all the columns I want to use
        if postalCode == str(area['Zip Code']):
            venues_list.append((
                        area['Zip Code'],
                        area['Community'],
                        area['Latitude'],
                        area['Longitude'],
                        v['venue']['name'],
                        v['venue']['categories'][0]['name'],
                        city ))
# create a dataframe from the results of the request
la_venues = pd.DataFrame(venues_list, columns=['Zip Code', 'Community',
              'Zip Code Latitude', 'Zip Code Longitude', 'Venue',
              'Venue Category', 'City'])

In [25]:
la_venues.head()

| | Zip Code | Community | Zip Code Latitude | Zip Code Longitude | Venue | Venue Category | City |
|---|---|---|---|---|---|---|---|
| 0 | 90001 | Los Angeles (South Los Angeles), Florence-Graham | 33.972914 | -118.24878 | Mi Lindo Nayarit Mariscos | Mexican Restaurant | Los Angeles |
| 1 | 90001 | Los Angeles (South Los Angeles), Florence-Graham | 33.972914 | -118.24878 | Happy Donut | Donut Shop | Los Angeles |
| 2 | 90001 | Los Angeles (South Los Angeles), Florence-Graham | 33.972914 | -118.24878 | Jack in the Box | Fast Food Restaurant | Los Angeles |
| 3 | 90001 | Los Angeles (South Los Angeles), Florence-Graham | 33.972914 | -118.24878 | Birrieria Tlaquepaque | Mexican Restaurant | Los Angeles |
| 4 | 90001 | Los Angeles (South Los Angeles), Florence-Graham | 33.972914 | -118.24878 | Birrieria Jalisco | Mexican Restaurant | Los Angeles |
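The fallback logic used when reading a venue from the response (use the
venue's own city and postal code if present, otherwise the values of the
area) can be tried out without calling the API. The helper `venue_row` and
the sample item below are hypothetical: the item is hand-written in the
shape the loop above reads and is not a real Foursquare response.

```python
def venue_row(item, area):
    """Return (name, category, city, postal code) for one response item."""
    venue = item["venue"]
    location = venue.get("location", {})
    categories = venue.get("categories", [])
    return (
        venue["name"],
        # a venue without categories yields None instead of an IndexError
        categories[0]["name"] if categories else None,
        # fall back to the area's values when the venue lacks them
        location.get("city", area["City"]),
        str(location.get("postalCode", area["Zip Code"])),
    )


# Hand-written sample item without a city (hypothetical data)
item = {"venue": {"name": "Example Taqueria",
                  "location": {"postalCode": "90001"},
                  "categories": [{"name": "Mexican Restaurant"}]}}
area = {"City": "Los Angeles", "Zip Code": 90001}

print(venue_row(item, area))
```

Keeping this logic in a small function makes it easy to test the fallbacks
separately from the network request.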


In [26]:
la_venues["Venue"].count()

8619

So we have found 8,619 recommendations for our 279 areas in Los Angeles
County, CA. That is about 31 recommendations per area on average. Let's
extract the city and venue category and group the data by category to find
out how the recommended food venue categories are distributed.

In [30]:
venues = la_venues[['City', 'Venue Category']]
categories = venues.groupby('Venue Category').size().to_frame('Count').reset_index()
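# sort by count, descending (note: the name `sorted` shadows Python's built-in)
sorted = categories.sort_values(by='Count', ascending=False)
# iloc[0:9] selects the first nine rows; iloc[0:10] would give a full top 10
top9 = sorted.iloc[0:9]
top9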

| | Venue Category | Count |
|---|---|---|
| 78 | Mexican Restaurant | 828 |
| 90 | Pizza Place | 575 |
| 43 | Fast Food Restaurant | 500 |
| 24 | Chinese Restaurant | 407 |
| 9 | Bakery | 349 |
| 100 | Sandwich Place | 339 |
| 2 | American Restaurant | 319 |
| 18 | Café | 294 |
| 112 | Sushi Restaurant | 263 |
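As an aside, the same kind of ranking can be produced with `value_counts`,
which counts and sorts in one step. The sketch below uses a handful of
made-up categories standing in for `la_venues['Venue Category']`; it is an
alternative formulation, not the code that produced the table above.

```python
import pandas as pd

# Made-up venue categories, standing in for la_venues['Venue Category']
cats = pd.Series(["Mexican Restaurant", "Pizza Place", "Mexican Restaurant",
                  "Bakery", "Pizza Place", "Mexican Restaurant"],
                 name="Venue Category")

# value_counts() counts each category and sorts by count, descending
counts = (cats.value_counts()
              .rename_axis("Venue Category")
              .reset_index(name="Count"))

# head(n) returns exactly n rows, which avoids off-by-one slicing mistakes
top2 = counts.head(2)
print(top2)
```

Using `head(n)` instead of `iloc[0:n]` makes the intended number of rows
explicit.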


After that, we will visualize the distribution.

In [38]:
import plotly.express as px
top15 = sorted.iloc[0:15]  # the end of the slice is exclusive, so this yields 15 rows
fig = px.bar(top15, x="Venue Category", y="Count",
             title='Distribution of recommended food venue categories')
fig.show()

So with this data we can tell which food venue categories are recommended
the most throughout Los Angeles County, CA. <br />

We have prepared the following data, which we will use for further analysis:
- a dataframe with areas in LA County, enriched with geodata, median income
and census data
- a dataframe with the most recommended food venue categories in the county

## Methodology <a name="methodology"></a>

In this project we will focus on finding a suitable area for a new
restaurant in Los Angeles County, CA. The areas are defined by their US zip
code. In addition, we will look at the most recommended food venue
categories throughout the county to suggest which type of restaurant could
be opened. No specific location within the chosen area will be recommended.

In the first step we merged three different datasets that provide data on
the different areas in Los Angeles County. With this data it is possible to
cluster the areas using information like median income, number of households
and number of inhabitants. In addition, using Foursquare, we identified the
most recommended food venue categories in the county.

The second step in the analysis is to cluster the areas in the county (using
k-means clustering) and to describe the individual clusters. This supports
the process of finding a single area that looks promising for a new
restaurant.
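To illustrate the idea of this step, here is a minimal sketch of k-means
clustering on made-up areas, assuming scikit-learn is available. The
standardization of the features, the choice of two clusters and all values
are assumptions for illustration only and do not reflect the settings used
in the actual analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up areas: three with high income, three with a large population
toy = pd.DataFrame({
    "Estimated Median Income": [120000, 115000, 125000, 40000, 42000, 38000],
    "Total Population": [20000, 22000, 18000, 60000, 65000, 58000],
    "Total Households": [8000, 9000, 7500, 15000, 16000, 14500],
})

# Standardize so that no feature dominates the distance just by its scale
scaled = StandardScaler().fit_transform(toy)

# k=2 is an arbitrary choice for this toy data, not the project's setting
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["Cluster"] = kmeans.fit_predict(scaled)

# Describe each cluster by the mean of its features
print(toy.groupby("Cluster").mean())
```

The per-cluster means give a compact description of each cluster, which is
what the comparison of clusters in the next step relies on.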

The third step is to pick the cluster that best fits the chosen criteria.
The area shall be chosen under the premise of finding a good balance between
estimated median income and number of potential customers. So the target is
to find an area that has as many citizens as possible with the highest
income possible. After an area has been chosen, the distribution of local
restaurant types in this area will be analysed. Combining this information
with the categories of food venues that are popular throughout the county, a
recommendation of the restaurant to open can be given.
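One possible way to combine the two views is to compare the share of each
category in the county with its share in the chosen area. The sketch below
is only an illustration of that idea: the gap measure is my own assumption
and not necessarily the method used later, the county counts are taken from
the top categories listed earlier, and the local counts are invented.

```python
import pandas as pd

# County counts from the top categories above; local counts are invented
county = pd.Series({"Mexican Restaurant": 828, "Pizza Place": 575,
                    "Sushi Restaurant": 263})
local = pd.Series({"Mexican Restaurant": 6, "Pizza Place": 1})

# Share of each category county-wide and in the chosen area
# (categories that do not exist in the area get a share of 0)
county_share = county / county.sum()
local_share = (local / local.sum()).reindex(county.index, fill_value=0.0)

# Positive gap: popular in the county but under-represented locally
gap = (county_share - local_share).sort_values(ascending=False)
print(gap.index[0])
```

A category with a large positive gap is popular in the county but rare in
the area, which makes it a candidate for the recommendation.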

## Analysis <a name="analysis"></a>

First of all, let's have a look at our dataset describing the areas.