# Introduction

You are a Starbucks big data analyst ([that’s a real job!](https://www.forbes.com/sites/bernardmarr/2018/05/28/starbucks-using-big-data-analytics-and-artificial-intelligence-to-boost-performance/#130c7d765cdc)) looking for the next store to convert into a [Starbucks Reserve Roastery](https://www.businessinsider.com/starbucks-reserve-roastery-compared-regular-starbucks-2018-12#also-on-the-first-floor-was-the-main-coffee-bar-five-hourglass-like-units-hold-the-freshly-roasted-coffee-beans-that-are-used-in-each-order-the-selection-rotates-seasonally-5).  These roasteries are much larger than a typical Starbucks store and have several additional features, including various food and wine options, along with upscale lounge areas.  You'll investigate the demographics of various counties in the state of California to determine potentially suitable locations.

<center>
<img src="https://i.imgur.com/BIyE6kR.png" width="450"><br/>
</center>

Before you get started, run the code cell below to set everything up.

In [None]:
import math
import pandas as pd
import geopandas as gpd
from geopandas.tools import geocode

import folium
from folium.plugins import MarkerCluster

You'll use the `embed_map()` function from the previous exercise to visualize your maps.

In [None]:
def embed_map(m, file_name):
    from IPython.display import IFrame
    m.save(file_name)
    return IFrame(file_name, width='100%', height='500px')

# Exercises

### 1) Geocode the missing locations.

Run the next code cell to create a DataFrame `starbucks` containing Starbucks locations in the state of California.

In [None]:
# Load and preview Starbucks locations in California
starbucks = pd.read_csv("../input/geospatial-course-data/starbucks_locations.csv")
starbucks.head()

Most of the stores have known (latitude, longitude) locations, but the coordinates for every store in the city of Berkeley are missing.

In [None]:
# How many rows in each column have missing values?
print(starbucks.isnull().sum())

# View rows with missing locations
rows_with_missing = starbucks[starbucks["City"]=="Berkeley"]
rows_with_missing

Use the code cell below to fill in these values by geocoding each store's address.

In [None]:
# Your code here
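def geo_locate(address):
    # Geocode a single address string and take the first result
    point = geocode(address, provider='nominatim').geometry[0]
    return pd.Series({'Longitude': point.x, 'Latitude': point.y})

# Geocode each Berkeley address; the result keeps the original row index
berkeley_locations = rows_with_missing.apply(lambda x: geo_locate(x['Address']), axis=1)
# update() aligns on that index and fills in the coordinates in place
starbucks.update(berkeley_locations)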

# Check your answer
#q_1.check()

In [None]:
#q_1.hint()
#q_1.solution()
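
The fill step above relies on `DataFrame.update()`, which aligns the two tables on index and column labels and overwrites values in place. The following is a minimal sketch of that behavior on a toy table; the rows and coordinates are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `starbucks`: the two "Berkeley" rows lack coordinates (hypothetical data)
stores = pd.DataFrame({
    "City": ["Oakland", "Berkeley", "Berkeley"],
    "Longitude": [-122.27, np.nan, np.nan],
    "Latitude": [37.80, np.nan, np.nan],
})

# Pretend these values came back from a geocoder; the index names the rows to fill
found = pd.DataFrame(
    {"Longitude": [-122.26, -122.25], "Latitude": [37.87, 37.86]},
    index=[1, 2],
)

# update() modifies `stores` in place, matching on index and column names
stores.update(found)

# Count the coordinate cells that are still missing
missing_after = int(stores[["Longitude", "Latitude"]].isnull().sum().sum())
print(missing_after)
```

Because the match is by index label, the row that already had coordinates is left untouched.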

### 2) View Berkeley locations.

Let's take a look at the locations you just found.  Visualize the (latitude, longitude) locations in Berkeley in the OpenStreetMap style. 

In [None]:
# Your code here
m_1 = folium.Map(location=[37.88,-122.26], tiles='openstreetmap', zoom_start=13)

for idx, row in starbucks[starbucks["City"]=='Berkeley'].iterrows():
    folium.Marker([row['Latitude'], row['Longitude']]).add_to(m_1)

embed_map(m_1, 'm_1.html')

Considering only the five locations in Berkeley, how many of the (latitude, longitude) locations seem potentially correct (are located in the correct city)?

In [None]:
# Fill in your answer 
num_locations = ____

# Check your answer
#q_2.check()

In [None]:
#q_2.hint()
#q_2.solution()

### 3) Consolidate your information.

Run the code below to load a GeoDataFrame `CA_counties` containing the name, area (in square kilometers), and a unique id (in the "GEOID" column) for each county in the state of California.

In [None]:
CA_counties = gpd.read_file("../input/geospatial-course-data/CA_county_boundaries/CA_county_boundaries.shp")
CA_counties.set_index("GEOID", inplace=True)
CA_counties.head()

Next, we create three DataFrames:
- `CA_pop` contains an estimate of the population of each county.
- `CA_high_earners` contains the number of households with an income of at least $150,000 per year.
- `CA_median_age` contains the median age for each county.

In [None]:
CA_pop = pd.read_csv("../input/geospatial-course-data/CA_county_population.csv", index_col="GEOID")
CA_high_earners = pd.read_csv("../input/geospatial-course-data/CA_county_high_earners.csv", index_col="GEOID")
CA_median_age = pd.read_csv("../input/geospatial-course-data/CA_county_median_age.csv", index_col="GEOID")

Use the next code cell to join the `CA_counties` GeoDataFrame with `CA_pop`, `CA_high_earners`, and `CA_median_age`.  Name the resultant GeoDataFrame `CA_stats`, and make sure it is indexed by the "GEOID" column.  

In [None]:
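# Join the demographic tables onto the county boundaries (all indexed by GEOID)
CA_stats = CA_counties.join([CA_pop, CA_high_earners, CA_median_age])
# Population density, in inhabitants per square kilometer
CA_stats["density"] = CA_stats["population"] / CA_stats["area_sqkm"]
# Label the coordinates as WGS 84 (this sets the CRS; it does not reproject)
CA_stats = CA_stats.set_crs(epsg=4326, allow_override=True)
CA_stats.head()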

# Check your answer
#q_3.check()

In [None]:
#q_3.hint()
#q_3.solution()
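
As a rough illustration of how an index-based join lines up rows, here is a small sketch with plain pandas DataFrames. The table names mirror the exercise, but every value is made up, and the column names `population`, `area_sqkm`, and `median_age` are assumed to match the course files.

```python
import pandas as pd

# Toy tables sharing a "GEOID" index (all values are hypothetical)
counties = pd.DataFrame(
    {"name": ["A", "B"], "area_sqkm": [100.0, 400.0]},
    index=pd.Index(["001", "002"], name="GEOID"),
)
pop = pd.DataFrame(
    {"population": [50000, 20000]},
    index=pd.Index(["001", "002"], name="GEOID"),
)
age = pd.DataFrame(
    {"median_age": [34.0, 41.5]},
    index=pd.Index(["002", "001"], name="GEOID"),  # rows deliberately in a different order
)

# join() matches rows by index label, so row order in each table does not matter
stats = counties.join([pop, age])

# Derived column: inhabitants per square kilometer
stats["density"] = stats["population"] / stats["area_sqkm"]
print(stats)
```

Joining on the index is what makes "GEOID" useful here: each statistic lands on the right county even if the source files list counties in different orders.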

### 4) Which counties look promising?

Now that we have all of the demographic data in one place, it's much easier to select counties that meet our criteria.

Create a GeoDataFrame `sel_counties` that contains a subset of the rows (and all of the columns) from the `CA_stats` GeoDataFrame.  In particular, you should select counties where:
- there are at least 100,000 households making at least \$150,000 per year,
- the median age is less than 38.5, and
- the density of inhabitants is at least 285 (per square kilometer).

Additionally, selected counties should satisfy at least one of the following criteria:
- there are at least 500,000 households making at least \$150,000 per year,
- the median age is less than 35.5, or
- the density of inhabitants is at least 1400 (per square kilometer).

In [None]:
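# Baseline: all three conditions must hold ("at least" means >=)
baseline = ((CA_stats.high_earners >= 100000) &
            (CA_stats.median_age < 38.5) &
            (CA_stats.density >= 285))
# Stricter tier: at least one of these must also hold
standout = ((CA_stats.high_earners >= 500000) |
            (CA_stats.median_age < 35.5) |
            (CA_stats.density >= 1400))
sel_counties = CA_stats[baseline & standout]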

# Check your answer
#q_4.check()

In [None]:
#q_4.hint()
#q_4.solution()
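
To sanity-check the two-tier rule, it can help to write it as a plain function and try it on a few counties by hand. This is only a sketch: `meets_criteria` is a hypothetical helper, and the example counties are invented rather than drawn from `CA_stats`.

```python
def meets_criteria(high_earners, median_age, density):
    """Return True if a county passes both tiers of the selection rule."""
    # Tier 1: every baseline condition must hold
    baseline = high_earners >= 100_000 and median_age < 38.5 and density >= 285
    # Tier 2: at least one stricter condition must also hold
    standout = high_earners >= 500_000 or median_age < 35.5 or density >= 1400
    return baseline and standout

# Hypothetical counties, not taken from the dataset
only_baseline = meets_criteria(120_000, 37.0, 300)     # meets tier 1 but no tier 2 condition
young_county = meets_criteria(120_000, 35.0, 300)      # meets tier 1, and median age < 35.5
too_old = meets_criteria(600_000, 39.0, 2000)          # fails tier 1 on median age
print(only_baseline, young_county, too_old)
```

A county has to clear every baseline threshold before any of the stricter thresholds count, which is why the third example is rejected despite its high density and income figures.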

### 5) How many stores did you identify?

When looking for the next Starbucks Reserve Roastery location, you'd like to consider all of the stores within the counties that you selected.  So, how many stores are within the selected counties?

In [None]:
# Build point geometries from the coordinate columns (WGS 84)
starbucks_gdf = gpd.GeoDataFrame(starbucks, geometry=gpd.points_from_xy(starbucks.Longitude, starbucks.Latitude), crs="EPSG:4326")
# Keep stores within a selected county (`predicate` replaces the deprecated `op`)
locations_of_interest = gpd.sjoin(starbucks_gdf, sel_counties, predicate="within")
num_stores = len(locations_of_interest)

# Check your answer
#q_5.check()

In [None]:
#q_5.hint()
#q_5.solution()
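
A spatial join with the "within" predicate keeps each point that lies inside a polygon from the other table. The sketch below shows this on invented geometries (one square "county" and three "stores"); it assumes geopandas 0.10 or later, where `sjoin` accepts the `predicate` argument.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# One square "county" (hypothetical geometry)
county = gpd.GeoDataFrame(
    {"name": ["Square"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])],
    crs="EPSG:4326",
)

# Three "stores": two inside the square, one outside (hypothetical points)
stores = gpd.GeoDataFrame(
    {"store": ["a", "b", "c"]},
    geometry=[Point(1, 1), Point(1.5, 0.5), Point(3, 3)],
    crs="EPSG:4326",
)

# Keep only the stores whose point lies within the polygon
inside = gpd.sjoin(stores, county, predicate="within")
print(len(inside))
```

Both tables use the same CRS here; when the CRS differs between the two inputs, the comparison of coordinates is not meaningful.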

### 6) Visualize the store locations.

Create a map that shows the locations of the stores that you identified in the previous question.

In [None]:
m_6 = folium.Map(location=[37,-120], tiles='cartodbpositron', zoom_start=6)

mc = MarkerCluster()

for idx, row in locations_of_interest.iterrows():
    if not math.isnan(row['Longitude']) and not math.isnan(row['Latitude']):
        mc.add_child(folium.Marker([row['Latitude'], row['Longitude']]))
        
m_6.add_child(mc)

embed_map(m_6, 'm_6.html')

In [None]:
#q_6.hint()
#q_6.solution()

# Keep going

...