# Capstone Project
## Finding a possible location and type for a restaurant in Los Angeles County, CA

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction: Business Problem <a name="introduction"></a>


In this project I try to recommend where to open a restaurant in Los Angeles
County, CA. I also recommend which type of restaurant to open, based on the
restaurants that already exist in the area and on the restaurant types that
are popular across the whole county.

The decision on where to open a restaurant can be based on many factors,
depending on the target group. For example, one could look for very densely
populated areas, or for areas with many wealthy residents.

I will make the decision based on the following conditions:
- The area should offer a good balance between the number of potential
  customers and a high median income.
- The type of restaurant is determined by the most recommended categories
  of food venue in Los Angeles County, CA and by the number of venues that
  already exist in the area.
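
The first condition can be made concrete in many ways. The sketch below is one possible reading, not necessarily the method used later in this project: it scales population and median income to the range 0 to 1 and averages them, so that neither factor dominates. The helper `balance_score`, the equal weighting, and the toy numbers are all assumptions made for illustration.

```python
import pandas as pd


def balance_score(df, pop_col="Total Population",
                  income_col="Estimated Median Income"):
    """Return the mean of min-max scaled population and income per row."""
    def min_max(s):
        span = s.max() - s.min()
        # Avoid division by zero when all values are identical
        return (s - s.min()) / span if span else s * 0.0
    return (min_max(df[pop_col]) + min_max(df[income_col])) / 2


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Zip Code": [1, 2, 3],
    "Total Population": [10000, 50000, 30000],
    "Estimated Median Income": [90000, 30000, 70000],
})
toy["score"] = balance_score(toy)

# The area with the highest combined score is the best "balanced" candidate
best_zip = toy.loc[toy["score"].idxmax(), "Zip Code"]
```

In this toy example the third area wins because it is neither the most populous nor the wealthiest, but does reasonably well on both.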


## Data <a name="data"></a>

I used three different datasets as the basis for my analysis. Together they
provide the following features, among others:
- A list of areas in Los Angeles County, CA based on the ZIP code
- The number of residents and households in every area
- The estimated median income of every area
- The latitude and longitude for every ZIP code in the county

<br />

I combined the following datasets for this:
- 2010 Los Angeles Census Data
    - https://www.kaggle.com/cityofLA/los-angeles-census-data
- Median Household Income by Zip Code in 2019
    - http://www.laalmanac.com/employment/em12c.php
- US Zip Code Latitude and Longitude
    - https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/information/

To analyse the existing food venues in the county, I use the **Foursquare
API**. It provides the data needed to answer the following two questions:
- What are the most recommended types of restaurants in the county?
- Which categories of food venue already exist in the area where we want to
  open a restaurant?
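
As a rough sketch of what such a query could look like, the snippet below only builds a request URL and does not send it. It assumes the legacy Foursquare v2 `venues/explore` endpoint with its `section=food` filter, which may differ from the endpoint actually used in this project. The helper `build_explore_url`, the placeholder credentials, and the radius, limit and version values are assumptions for illustration.

```python
from urllib.parse import urlencode

# Legacy Foursquare v2 endpoint (assumption: the project may use another one)
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius=1000, limit=50, version="20200101"):
    """Build a request URL for recommended food venues around a point."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, format YYYYMMDD
        "ll": f"{lat},{lng}",    # latitude,longitude of the search centre
        "section": "food",       # restrict recommendations to food venues
        "radius": radius,        # search radius in metres
        "limit": limit,          # maximum number of results
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder credentials; real ones come from a Foursquare developer account
url = build_explore_url(33.972914, -118.24878,
                        "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
```

Sending this URL with an HTTP client would require valid credentials. The returned venues could then be grouped by category to answer both questions above.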

<br />

#### Merging the three datasets as the basis for the analysis
First, let's merge all three datasets and get rid of unnecessary
information. From then on I use a single dataframe with the combined
datasets as the basis for further analysis.
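
The income table stores its values as text such as `$43,360` and uses `---` where no estimate exists, so it has to be cleaned before it can be joined on the ZIP code. The following minimal sketch shows that idea on toy data. All values are invented, and it uses `errors="coerce"` plus `merge` as one possible variant rather than the exact calls in the cell below.

```python
import pandas as pd

# Toy frames, values invented for illustration
income = pd.DataFrame({
    "Zip Code": [90001, 90002, 90003],
    "Estimated Median Income": ["$50,000", "---", "$72,500"],
})
census = pd.DataFrame({
    "Zip Code": [90001, 90002, 90003],
    "Total Population": [1000, 2000, 3000],
})

# Remove '$' and ',' and convert to numbers;
# errors="coerce" turns anything non-numeric (here '---') into NaN
income["Estimated Median Income"] = pd.to_numeric(
    income["Estimated Median Income"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

# Left join on the ZIP code, then drop rows without an income value
merged = census.merge(income, on="Zip Code", how="left").dropna()
```

Only the two ZIP codes with a usable income value remain in `merged`.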

In [1]:
import pandas as pd

# Dataset: 2010 Los Angeles Census Data
df_census = pd.read_csv("2010-census-populations-by-zip-code.csv")
# Dataset: Median Household Income by Zip Code in 2019
# read_html returns one DataFrame per HTML table; the income table is the first
income_html = "Median Household Income By Zip Code in Los Angeles County, California.html"
dfs = pd.read_html(income_html)
df_income_all = dfs[0]
# Drop areas where the median income is missing (NaN or the placeholder '---')
income_col = 'Estimated Median Income'
has_income = (df_income_all[income_col].notna()
              & (df_income_all[income_col] != '---'))
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
df_income = df_income_all[has_income].copy()
# Strip the '$' prefix and the thousands separators, then convert to numbers
df_income[income_col] = pd.to_numeric(
    df_income[income_col].str.lstrip('$').str.replace(',', '', regex=False))
# Dataset: US Zip Code Latitude and Longitude
df_geodata = pd.read_csv("us-zip-code-latitude-and-longitude.csv", sep=';')
# Let's start by joining the income dataset onto the census data via the zip code
df_income_geo = df_census.join(df_income.set_index('Zip Code'), on='Zip Code')
# Now add the geodata (latitude and longitude) to the combined dataset
dataset_geo = df_income_geo.join(df_geodata.set_index('Zip'), on='Zip Code',
                                 how='left')
# Keep only the columns that are of interest for my further analysis
prepared_ds = dataset_geo[[
    "Zip Code", "City", "Community", "Estimated Median Income",
    "Longitude", "Latitude", "Total Population", "Median Age",
    "Total Males", "Total Females", "Total Households",
    "Average Household Size"]]

# Drop every zip code that lacks a value in any of the remaining columns
final_ds = prepared_ds.dropna(axis=0)

final_ds.head(10)

|  | Zip Code | City | Community | Estimated Median Income | Longitude | Latitude | Total Population | Median Age | Total Males | Total Females | Total Households | Average Household Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 90001 | Los Angeles | Los Angeles (South Los Angeles), Florence-Graham | 43360.0 | -118.24878 | 33.972914 | 57110 | 26.6 | 28468 | 28642 | 12971 | 4.4 |
| 2 | 90002 | Los Angeles | Los Angeles (Southeast Los Angeles, Watts) | 37285.0 | -118.24845 | 33.948315 | 51223 | 25.5 | 24876 | 26347 | 11731 | 4.36 |
| 3 | 90003 | Los Angeles | Los Angeles (South Los Angeles, Southeast Los ... | 40598.0 | -118.276 | 33.962714 | 66266 | 26.3 | 32631 | 33635 | 15642 | 4.22 |
| 4 | 90004 | Los Angeles | Los Angeles (Hancock Park, Rampart Village, Vi... | 49675.0 | -118.30755 | 34.07711 | 62180 | 34.8 | 31302 | 30878 | 22547 | 2.73 |
| 5 | 90005 | Los Angeles | Los Angeles (Hancock Park, Koreatown, Wilshire... | 38491.0 | -118.30848 | 34.058911 | 37681 | 33.9 | 19299 | 18382 | 15044 | 2.5 |
| 6 | 90006 | Los Angeles | Los Angeles (Byzantine-Latino Quarter, Harvard... | 37072.0 | -118.2943 | 34.048351 | 59185 | 32.4 | 30254 | 28931 | 18617 | 3.13 |
| 7 | 90007 | Los Angeles | Los Angeles (Southeast Los Angeles, Univerity ... | 27406.0 | -118.2829 | 34.026448 | 40920 | 24.0 | 20915 | 20005 | 11944 | 3.0 |
| 8 | 90008 | Los Angeles | Los Angeles (Baldwin Hills, Crenshaw, Leimert ... | 43364.0 | -118.33705 | 34.009754 | 32327 | 39.7 | 14477 | 17850 | 13841 | 2.33 |
| 9 | 90010 | Los Angeles | Los Angeles (Hancock Park, Wilshire Center, Wi... | 63112.0 | -118.31481 | 34.062709 | 3800 | 37.8 | 1874 | 1926 | 2014 | 1.87 |
| 10 | 90011 | Los Angeles | Los Angeles (Southeast Los Angeles) | 40940.0 | -118.25868 | 34.007063 | 103892 | 26.2 | 52794 | 51098 | 22168 | 4.67 |


In [3]:
final_ds.shape

(279, 12)
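
Because `dropna` removes every ZIP code that lacks an income or coordinate match, it can be useful to see which rows fail to match before they are dropped. The sketch below shows one way to do that with the `indicator` option of `DataFrame.merge`. The frames and values are toy data invented for illustration, and this check is an optional addition, not part of the pipeline above.

```python
import pandas as pd

# Toy frames, values invented for illustration
census = pd.DataFrame({"Zip Code": [90001, 90002, 90003],
                       "Total Population": [100, 200, 300]})
income = pd.DataFrame({"Zip Code": [90001, 90003, 90004],
                       "Estimated Median Income": [50000, 60000, 70000]})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
check = census.merge(income, on="Zip Code", how="left", indicator=True)

# ZIP codes present in the census data but without an income entry
unmatched = check.loc[check["_merge"] == "left_only", "Zip Code"].tolist()
```

Here `unmatched` lists the census ZIP codes that a subsequent `dropna` would silently remove.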