# Project : Starbucks Reserve Roastery Analysis
## Tasks : 
- Role: Act as a Starbucks big data analyst looking for the next store to convert into a Starbucks Reserve Roastery.
- These roasteries are much larger than a typical Starbucks store and have several additional features, including various food and wine options, along with upscale lounge areas. Investigate the demographics of various counties in the state of California to determine potentially suitable locations.

In [1]:
# Import libraries
import math
import pandas as pd
import geopandas as gpd
# Nominatim is the geocoder normally used for this kind of analysis
from geopy.geocoders import Nominatim
import folium
from folium import Marker
from folium.plugins import MarkerCluster


In [2]:
def embed_map(m, file_name):
    """Save the folium map to an HTML file and display it inline."""
    from IPython.display import IFrame
    m.save(file_name)
    return IFrame(file_name, width='100%', height='500px')

In [3]:
# Load and preview the Starbucks locations in the state of California.
# Some stores are missing coordinates; these are geocoded further down.
starbucks = pd.read_csv("starbucks_locations.csv")
starbucks.head()

Unnamed: 0,Store Number,Store Name,Address,City,Longitude,Latitude
0,10429-100710,Palmdale & Hwy 395,14136 US Hwy 395 Adelanto CA,Adelanto,-117.4,34.51
1,635-352,Kanan & Thousand Oaks,5827 Kanan Road Agoura CA,Agoura,-118.76,34.16
2,74510-27669,Vons-Agoura Hills #2001,5671 Kanan Rd. Agoura Hills CA,Agoura Hills,-118.76,34.15
3,29839-255026,Target Anaheim T-0677,8148 E SANTA ANA CANYON ROAD AHAHEIM CA,AHAHEIM,-117.75,33.87
4,23463-230284,Safeway - Alameda 3281,2600 5th Street Alameda CA,Alameda,-122.28,37.79


Most of the stores have known (latitude, longitude) locations, but the coordinates of all five locations in the city of Berkeley are missing.

In [4]:
# How many rows in each column have missing values?
print(starbucks.isnull().sum())

# View rows with missing locations
rows_with_missing = starbucks[starbucks["City"]=="Berkeley"]
rows_with_missing

Store Number    0
Store Name      0
Address         0
City            0
Longitude       5
Latitude        5
dtype: int64


Unnamed: 0,Store Number,Store Name,Address,City,Longitude,Latitude
153,5406-945,2224 Shattuck - Berkeley,2224 Shattuck Avenue Berkeley CA,Berkeley,,
154,570-512,Solano Ave,1799 Solano Avenue Berkeley CA,Berkeley,,
155,17877-164526,Safeway - Berkeley #691,1444 Shattuck Place Berkeley CA,Berkeley,,
156,19864-202264,Telegraph & Ashby,3001 Telegraph Avenue Berkeley CA,Berkeley,,
157,9217-9253,2128 Oxford St.,2128 Oxford Street Berkeley CA,Berkeley,,


# Nominatim geocoder
- Use Nominatim() (from geopy.geocoders) to geocode the missing coordinates from each store's address.


In [5]:
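# Create the geocoder
geolocator = Nominatim(user_agent="Kaggle_learn")

def my_geocoder(address):
    """Return the latitude/longitude of an address, or NaN if no match is found."""
    location = geolocator.geocode(address)
    if location is None:  # geocode() returns None when the address cannot be matched
        return pd.Series({'Latitude': math.nan, 'Longitude': math.nan})
    return pd.Series({'Latitude': location.latitude, 'Longitude': location.longitude})

# Geocode the Berkeley addresses and fill in the missing coordinates in place
berkeley_locations = rows_with_missing.apply(lambda x: my_geocoder(x['Address']), axis=1)
starbucks.update(berkeley_locations)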
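
Geocoding goes through a remote service, so individual requests can fail or be throttled (the public Nominatim service asks clients to stay at or below one request per second). The following is a minimal sketch of a retry wrapper, not part of the original notebook. The name `geocode_with_retry` and its parameters are hypothetical, and it accepts any callable, so `geolocator.geocode` could be passed in.

```python
import time

def geocode_with_retry(geocode, address, attempts=3, delay_seconds=1.0):
    """Call geocode(address), retrying after errors.

    Returns whatever geocode returns (which may be None when there is
    no match), or None if every attempt raised an exception.
    """
    for attempt in range(attempts):
        try:
            return geocode(address)
        except Exception:
            # Wait before the next attempt, but not after the last one
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return None

# Assumed usage inside the notebook:
# location = geocode_with_retry(geolocator.geocode, "2224 Shattuck Avenue Berkeley CA")
```

Catching every exception keeps the sketch short; a real run would likely narrow this to the geocoder's timeout and service errors.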

In [6]:
# Create a base map
m_2 = folium.Map(location=[37.88,-122.26], zoom_start=13)

# Add points to the map
for idx, row in starbucks[starbucks["City"] == "Berkeley"].iterrows():
    Marker([row['Latitude'], row['Longitude']]).add_to(m_2)

# Display the map
m_2

How many of the locations above seem potentially correct, i.e. are located in the city of Berkeley?
All five locations appear to be correct!
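
The visual check can be complemented with a numeric one. This is a rough sketch that measures the great-circle distance from each geocoded point to the map centre used above, `[37.88, -122.26]`. The helper names and the 5 km radius are assumptions chosen for illustration, not values from the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

BERKELEY_CENTER = (37.88, -122.26)  # map centre used for the folium map

def within_radius(lat, lon, center=BERKELEY_CENTER, radius_km=5.0):
    """True if the point lies within radius_km of center (radius is an assumed threshold)."""
    return haversine_km(lat, lon, center[0], center[1]) <= radius_km

# Assumed usage inside the notebook:
# berkeley = starbucks[starbucks["City"] == "Berkeley"]
# berkeley.apply(lambda r: within_radius(r["Latitude"], r["Longitude"]), axis=1)
```

A point that falls well outside the radius would suggest that the geocoder matched the address to a different city.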

# Polygon :
- **Vector** data represents geographic data symbolized as points, lines, or polygons. 
- **Raster** data represents geographic data as a matrix of cells that each contain an attribute value. While the area of different polygon shapes in a data set can differ, every cell in a raster data set is the same size.
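
The distinction can be made concrete with a small plain-Python sketch. The shapes, the grid, and the helper names below are made up for illustration: polygons are vertex lists whose areas vary with their shape, while a raster is a grid of equally sized cells.

```python
# Vector: a polygon is an ordered list of (x, y) vertices.
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

square = [(0, 0), (2, 0), (2, 2), (0, 2)]   # hypothetical shapes
triangle = [(0, 0), (4, 0), (0, 3)]

# Raster: a matrix of cells, each holding one attribute value.
CELL_SIZE = 1.0  # every cell covers CELL_SIZE x CELL_SIZE units
raster = [
    [0, 1, 1],
    [0, 1, 0],
]

def raster_area(grid, value, cell_size=CELL_SIZE):
    """Total area of the cells whose attribute equals value."""
    count = sum(cell == value for row in grid for cell in row)
    return count * cell_size ** 2

print(polygon_area(square), polygon_area(triangle))  # areas differ per polygon
print(raster_area(raster, 1))                        # cell count times a fixed cell area
```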

# Consolidate the data
- Load a GeoDataFrame CA_counties containing the name, area (in square kilometers), and a unique id (in the "GEOID" column) for each county in the state of California.
- The geometry column contains a polygon with county boundaries.

In [7]:
# Load the GeoDataFrame of county boundaries
CA_counties = gpd.read_file("CA_county_boundaries")
# Declare the CRS as EPSG:4326; the older {'init': 'epsg:4326'} form triggers a deprecation warning
CA_counties = CA_counties.set_crs("EPSG:4326", allow_override=True)
CA_counties.head()


Unnamed: 0,GEOID,name,area_sqkm,geometry
0,6091,Sierra County,2491.995494,"POLYGON ((-120.65560 39.69357, -120.65554 39.6..."
1,6067,Sacramento County,2575.258262,"POLYGON ((-121.18858 38.71431, -121.18732 38.7..."
2,6083,Santa Barbara County,9813.817958,"MULTIPOLYGON (((-120.58191 34.09856, -120.5822..."
3,6009,Calaveras County,2685.626726,"POLYGON ((-120.63095 38.34111, -120.63058 38.3..."
4,6111,Ventura County,5719.321379,"MULTIPOLYGON (((-119.63631 33.27304, -119.6360..."


# The DataFrames contain:
- CA_pop contains an estimate of the population of each county.
- CA_high_earners contains the number of households with an income of at least $150,000 per year.
- CA_median_age contains the median age for each county.

In [8]:
CA_pop = pd.read_csv("CA_county_population.csv", index_col="GEOID")
CA_high_earners = pd.read_csv("CA_county_high_earners.csv", index_col="GEOID")
CA_median_age = pd.read_csv("CA_county_median_age.csv", index_col="GEOID")

Join the CA_counties GeoDataFrame with CA_pop, CA_high_earners, and CA_median_age.

In [9]:
# Keep all of the data in one place: this makes it easier to calculate statistics that use a combination of columns.
cols_to_add = CA_pop.join([CA_high_earners, CA_median_age]).reset_index()
CA_stats = CA_counties.merge(cols_to_add, on="GEOID")
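
A default `merge` is an inner join, so a county whose GEOID has no match in the statistics table is dropped silently. The sketch below uses hypothetical toy tables, not the real CSV data, to show one way to surface such rows with the `how`, `indicator`, and `validate` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical toy tables; the real ones come from the files loaded above.
counties = pd.DataFrame({"GEOID": [1, 2, 3], "name": ["A", "B", "C"]})
stats = pd.DataFrame({"GEOID": [1, 2, 4], "population": [100, 200, 400]})

# how="left" keeps every county; indicator=True adds a "_merge" column that
# records whether each row found a match; validate checks that GEOID is
# unique on both sides.
checked = counties.merge(stats, on="GEOID", how="left",
                         indicator=True, validate="one_to_one")

# Counties with no matching statistics row
unmatched = checked.loc[checked["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```

If the list is non-empty for the real data, a likely cause is a dtype mismatch between the two GEOID columns (for example, strings versus integers).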


In [10]:
# Add a "density" column with the population density (inhabitants per square kilometer)
CA_stats["density"] = CA_stats["population"] / CA_stats["area_sqkm"]

# Analysis Insights:
## 1. Counties which look promising:
- Collapsing all of the information into a single GeoDataFrame also makes it much easier to select counties that meet specific criteria.
- Create a GeoDataFrame sel_counties that contains a subset of the rows (and all of the columns) from the CA_stats GeoDataFrame. In particular, select counties where:

1. There are at least 100,000 households making at least $150,000 per year.
2. The median age is less than 38.5.
3. The density of inhabitants is at least 285 (per square kilometer).
- Additionally, selected counties should satisfy at least one of the following criteria:

1. There are at least 500,000 households making at least $150,000 per year,
2. The median age is less than 35.5, or
3. The density of inhabitants is at least 1400 (per square kilometer).

In [11]:
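# Counties must meet all three base criteria and at least one of the stricter ones.
# ">=" matches the "at least" wording; the stricter age threshold is 35.5, not 38.5.
sel_counties = CA_stats[((CA_stats.high_earners >= 100000) &
                         (CA_stats.median_age < 38.5) &
                         (CA_stats.density >= 285) &
                         ((CA_stats.median_age < 35.5) |
                          (CA_stats.density >= 1400) |
                          (CA_stats.high_earners >= 500000)))]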
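
The selection criteria listed above can also be written as a plain predicate, which is handy for spot-checking the vectorised filter on individual counties. This is a sketch: the function name is hypothetical and the sample values are invented, not taken from the data.

```python
def is_promising(high_earners, median_age, density):
    """Apply the stated selection criteria to one county's statistics."""
    # All three base criteria must hold
    meets_all = high_earners >= 100_000 and median_age < 38.5 and density >= 285
    # At least one of the stricter criteria must hold as well
    meets_any = high_earners >= 500_000 or median_age < 35.5 or density >= 1400
    return meets_all and meets_any

# Invented sample values: (high_earners, median_age, density)
print(is_promising(120_000, 37.0, 300))   # base criteria only
print(is_promising(120_000, 34.0, 300))   # base criteria plus younger median age
print(is_promising(90_000, 30.0, 2000))   # too few high-earning households
```

The first call shows why the second group matters: a county can pass every base threshold and still be excluded when it meets none of the stricter ones.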

## 2. Number of stores within the selected counties
For the next Starbucks Reserve Roastery location, consider all of the stores within the counties that have been selected.


In [12]:
# Create a GeoDataFrame with all of the Starbucks locations.
# Passing crs="EPSG:4326" avoids the deprecated {'init': 'epsg:4326'} form and its warning.
starbucks_gdf = gpd.GeoDataFrame(starbucks,
                                 geometry=gpd.points_from_xy(starbucks.Longitude, starbucks.Latitude),
                                 crs="EPSG:4326")


In [13]:
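# Spatial join: keep only the stores that fall inside one of the selected counties
locations_of_interest = gpd.sjoin(starbucks_gdf, sel_counties)
# Number of candidate stores
num_stores = len(locations_of_interest)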
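
The spatial join matches each store point to the county polygon that contains it. The underlying idea, a point-in-polygon test, can be sketched in plain Python with ray casting. This illustrates the concept only and is not how geopandas implements `sjoin`; the rectangle and the store coordinates are hypothetical.

```python
def point_in_polygon(x, y, vertices):
    """Ray casting: count how often a ray going right from (x, y) crosses the boundary."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside

county = [(0, 0), (4, 0), (4, 3), (0, 3)]   # hypothetical rectangular "county"
stores = [(1, 1), (5, 1), (3.5, 2.9)]       # hypothetical store coordinates

num_inside = sum(point_in_polygon(x, y, county) for x, y in stores)
print(num_inside)
```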

## 3. Store locations on the map (visualization)

In [14]:
m_6 = folium.Map(location=[37,-120], zoom_start=6)
mc = MarkerCluster()

locations_of_interest = gpd.sjoin(starbucks_gdf, sel_counties)
for idx, row in locations_of_interest.iterrows():
    if not math.isnan(row['Longitude']) and not math.isnan(row['Latitude']):
        mc.add_child(folium.Marker([row['Latitude'], row['Longitude']]))
        
m_6.add_child(mc)