In [52]:
import pandas as pd
from config import gkey
import numpy as np
import requests
import json
import geopy.distance as gp

I created a csv of school names, addresses, and poverty level from a PDF provided by the U.S. Department of Education. Obtaining and cleaning that data took roughly six hours. Using the csv I made, I used Google's geolocation API to get each school's latitude and longitude. 
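The geocoding step isn't shown in this notebook. As a rough sketch, assuming the lookup went through the Geocoding API's JSON endpoint, building the request and reading the coordinates out of the response could look like the following. The helper names geocode_params and extract_lat_lng are hypothetical and not part of the original code.

```python
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_params(address, api_key):
    """Build the query parameters for one address lookup (hypothetical helper)."""
    return {"address": address, "key": api_key}


def extract_lat_lng(payload):
    """Return (lat, lng) from a parsed Geocoding response, or None if nothing matched."""
    results = payload.get("results", [])
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# A hand-made response in the shape the API returns, used here instead of a live call
sample = {"results": [{"geometry": {"location": {"lat": 30.168207, "lng": -97.81776}}}],
          "status": "OK"}
print(extract_lat_lng(sample))
```

In the notebook this would be wired up with something like requests.get(GEOCODE_URL, params=geocode_params(address, gkey)).json(), with the parsed result passed to extract_lat_lng.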

Next I needed to find nearby grocery stores. I decided to search for places of type 'supermarket' within a 10-kilometer radius of the center of Austin. My rationale was that I could then use geopy to compare school coordinates to grocery store coordinates and find the closest pairs. It worked, but it turned out not to be the most efficient route.

Using the nearby search API, I pulled the data on grocery stores.

In [44]:
Austin_lat = 30.267153
Austin_long = -97.7430608 
lat_long = [Austin_lat,Austin_long]
base_url = "https://maps.googleapis.com/maps/api/place/nearbysearch/json?"

params = {"key": gkey, 
          "location": f"{Austin_lat},{Austin_long}", 
          "radius": 10000, 
          "type": "supermarket"}

In [45]:
data = requests.get(base_url, params = params).json()

I created a series of lists to hold the output. The loop below runs through the first 20 Google results (page 1).

In [46]:
name = []
lat = []
long = []
address = []

# Walk the first page of results (at most 20 places) and collect each field.
# Slicing avoids an IndexError when Google returns fewer than 20 places.
for place in data["results"][:20]:
    
    name.append(place["name"])
    lat.append(place["geometry"]["location"]["lat"])
    long.append(place["geometry"]["location"]["lng"])
    address.append(place["vicinity"])

A quick look at the first 10 entries in the list 'name' shows that HEB and Randalls are both missing. Moreover, Family Dollar, a store that doesn't meet our criterion of having fresh fruits and vegetables available, needed to be cleaned from the dataset.

In [48]:
name[0:10]

["Trader Joe's",
 'Walmart Supercenter',
 'Wheatsville Food Co-Op',
 'Walmart Supercenter',
 'Fiesta Mart',
 'Walmart Supercenter',
 'Fiesta Mart',
 'Family Dollar',
 'Family Dollar',
 'Walmart Supercenter']

I used next_page_token to continue the Google search (not shown). To find HEB and, later, Randalls stores in Austin, I used the same procedure as above, but changed the type from supermarket and instead chose to use a name to match.
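The paging step isn't shown, so here is a minimal sketch of how it might look. It assumes each response carries a next_page_token when more results exist and that the follow-up request passes it back as pagetoken. The function collect_all_pages and its fetch_page argument are hypothetical names; fetch_page stands in for whatever makes the HTTP call and returns the parsed JSON.

```python
import time


def collect_all_pages(fetch_page, first_params, delay_seconds=2.0, max_pages=3):
    """Gather results across pages by following next_page_token.

    fetch_page is a hypothetical callable: it takes a params dict and
    returns the parsed JSON response as a dict.
    """
    results = []
    params = dict(first_params)
    for _ in range(max_pages):
        data = fetch_page(params)
        results.extend(data.get("results", []))
        token = data.get("next_page_token")
        if not token:
            break
        # A freshly issued token is not usable right away, so pause before the next call
        time.sleep(delay_seconds)
        params = {"key": first_params["key"], "pagetoken": token}
    return results
```

In the notebook, fetch_page could be as small as lambda p: requests.get(base_url, params=p).json(), and the returned list would take the place of data["results"] in the loop above. The two-second pause is a guess at a safe delay, not a documented value.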

In [4]:
params = {"key": gkey, 
          "location": f"{Austin_lat},{Austin_long}", 
          "radius": 10000, 
          "name": "HEB"}


Checking the data assured me that I was on the right track.

In [17]:
data["results"][0]["name"]

'H-E-B'

Again, I created a series of lists to hold the output and dropped them into a dataframe.

In [34]:
df = pd.DataFrame()
df["Store Name"] = name
df["Store Address"] = address
df["Lat"] = lat
df["Long"] = long

The head looks good, but toward the tail some stores show up multiple times: the bakeries inside them are listed separately, with slightly different latitudes and longitudes. A fuel station also appeared in the dataset.

In [36]:
df.tail(2)

Unnamed: 0,Store Name,Store Address,Lat,Long
18,H-E-B Bakery,West Lake Hills,30.291743,-97.824836
19,H-E-B Bakery,"2400 S Congress Ave, Austin",30.238694,-97.754756


In [37]:
df = df.drop_duplicates(subset = ["Store Address"])

Getting rid of duplicates by address was helpful, but not all bakeries had the full address listed.

In [38]:
df.tail(3)

Unnamed: 0,Store Name,Store Address,Lat,Long
13,H-E-B,"2110 W Slaughter Ln, Austin",30.175379,-97.825078
16,H-E-B Bakery,Austin,30.21623,-97.831086
18,H-E-B Bakery,West Lake Hills,30.291743,-97.824836


A quick look at the entire dataframe showed that stores 11, 16, and 18 are duplicates. I dropped those by index.

In [39]:
df

Unnamed: 0,Store Name,Store Address,Lat,Long
0,H-E-B,"6900 Brodie Ln, Austin",30.216283,-97.830987
1,H-E-B,"600 W William Cannon Dr, Austin",30.197923,-97.786481
2,H-E-B,"2400 S Congress Ave, Austin",30.238729,-97.755227
3,H-E-B,"6607 S IH 35 Frontage Rd, Austin",30.188724,-97.768718
4,H-E-B,"701 N Capital of Texas Hwy Bld C, West Lake Hills",30.291796,-97.82519
5,H-E-B,"1000 E 41st St, Austin",30.300637,-97.719957
6,H-E-B plus!,"2508 E Riverside Dr, Austin",30.236523,-97.722073
7,H-E-B,"2701 E 7th St, Austin",30.259124,-97.711619
8,H-E-B,"5808 Burnet Rd, Austin",30.334406,-97.741126
9,H-E-B,"1801 E 51st St, Austin",30.301166,-97.698715


In [40]:
# Drop the three remaining duplicate rows by index label in one call
df = df.drop([11, 16, 18])
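Dropping rows by index works for one pull, but the index labels change whenever the search is rerun. As an alternative sketch, the department listings could be filtered by name instead. The keyword pattern below is an assumption based on the bakery and fuel station entries mentioned above, and is_main_store is a hypothetical helper.

```python
import re

# Assumed keywords: the department listings described in this notebook
NON_GROCERY = re.compile(r"\b(bakery|fuel)\b", re.IGNORECASE)


def is_main_store(name):
    """True when the listing name contains none of the department keywords."""
    return NON_GROCERY.search(name) is None


names = ["H-E-B", "H-E-B Bakery", "H-E-B plus!", "H-E-B Fuel"]
kept = [n for n in names if is_main_store(n)]
print(kept)  # ['H-E-B', 'H-E-B plus!']
```

Applied to the dataframe, this would read df[df["Store Name"].map(is_main_store)]. It only catches listings whose names give them away, so a check by eye is still worthwhile.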

The HEB data is ready to go. Later I combined it with Randalls and with the cleaned data from other grocery stores.

In [43]:
df

Unnamed: 0,Store Name,Store Address,Lat,Long
0,H-E-B,"6900 Brodie Ln, Austin",30.216283,-97.830987
1,H-E-B,"600 W William Cannon Dr, Austin",30.197923,-97.786481
2,H-E-B,"2400 S Congress Ave, Austin",30.238729,-97.755227
3,H-E-B,"6607 S IH 35 Frontage Rd, Austin",30.188724,-97.768718
4,H-E-B,"701 N Capital of Texas Hwy Bld C, West Lake Hills",30.291796,-97.82519
5,H-E-B,"1000 E 41st St, Austin",30.300637,-97.719957
6,H-E-B plus!,"2508 E Riverside Dr, Austin",30.236523,-97.722073
7,H-E-B,"2701 E 7th St, Austin",30.259124,-97.711619
8,H-E-B,"5808 Burnet Rd, Austin",30.334406,-97.741126
9,H-E-B,"1801 E 51st St, Austin",30.301166,-97.698715


For the following example, I've imported previously cleaned school data and grocery store data.

In [134]:
# read_csv already returns a DataFrame, so no extra conversion is needed
school_df = pd.read_csv("Austin_Coords.csv")
school_df.head(2)

Unnamed: 0,School Name,Location,Percent in Poverty,Lat,Long,Google Place ID
0,Allison Elementary,Austin,92.0,30.168207,-97.81776,ChIJ04RBdWa3RIYRgJ7aGK5dgXM
1,Andrews Elementary School,Austin,91.02,30.317554,-97.679663,ChIJ6YaMBenJRIYRhrPJHKXhGBU


In [118]:
# read_csv already returns a DataFrame, so no extra conversion is needed
grocery_df = pd.read_csv("Austin_groceries.csv")
grocery_df.head(2)

Unnamed: 0,Store Name,Location,Lat,Long,Vicinity
0,H-E-B,Austin,30.216284,-97.830988,"6900 Brodie Ln, Austin"
1,H-E-B,Austin,30.197923,-97.786481,"600 W William Cannon Dr, Austin"


I needed the latitude and longitude columns on their own, so I pulled them out of each dataframe.

In [65]:
store_lat, store_long = grocery_df["Lat"], grocery_df["Long"]
school_lat, school_long = school_df["Lat"], school_df["Long"]

I tested the function on the first school and the first store to see if it worked. It does.

In [66]:
gp.distance(f'{school_lat[0]},{school_long[0]}', f'{store_lat[0]},{store_long[0]}').miles

3.4049041064242647
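As a sanity check on that figure, the same pair of points can be run through the haversine formula, which treats the Earth as a sphere. This is a cross-check sketch and not what geopy does internally (geopy's distance is geodesic on an ellipsoid); the radius constant and the function name haversine_miles are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, long) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# First school and first store from the tables above
d = haversine_miles(30.168207, -97.81776, 30.216284, -97.830988)
print(d)
```

The result should land near the 3.40 miles reported by geopy. A small gap between the two is expected because the sphere is only an approximation of the ellipsoid.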

Finding the closest store to each school took a little bit of setup. 

First I needed lists of the school and store names, both to run the loops and to compare against the indices the loops would create.

I created a list to hold the distance to each school's closest store, and made a variable called 'closest' that (ironically) starts out as the farthest distance I would accept between a school and a grocery store. I set it to 10 miles so that the first store inside that range becomes the current closest and every later store is compared against it. A double loop checks one school against all the stores, then the next school against all the stores, and so on. After a school completes its run through the stores, its closest distance is appended to the closest_store_distance list and 'closest' is reset to 10.

To keep track of which school matched which store, I needed the matching store index. The school index is simply a sequential run from 0 through the length of the school list, which already matches the index of the school dataframe. The store index is recorded every time a store becomes the new 'closest', and is appended to closest_store_index_list once the inner loop finishes. I made lists to hold the school names (to ensure we have the right school) and the store names, and initialized variables for the indices.

In [137]:
schools = school_df["School Name"]
stores = grocery_df["Store Name"]

closest_store_distance = []
closest_store_index_list = []

for i in range(len(schools)):
    
    # Start each school at the 10-mile cap so the first store inside it wins.
    # This assumes every school has at least one store within 10 miles.
    closest = 10
    store_index = 0
    
    for j in range(len(stores)):
        
        distance = gp.distance(f'{school_lat[i]},{school_long[i]}', 
                               f'{store_lat[j]},{store_long[j]}').miles
        
        # Keep the nearest store seen so far and remember its index
        if distance < closest:
            closest = distance
            store_index = j
    
    # After checking every store, record this school's closest distance and store index
    closest_store_distance.append(closest)
    closest_store_index_list.append(store_index)
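The same search can be written without a starting cap by taking the minimum over all candidate distances. The sketch below is an alternative formulation, not the code used for the results in this notebook. It uses toy planar points with math.dist standing in for a real distance function, and nearest is a hypothetical helper name.

```python
import math


def nearest(point, candidates, distance_fn):
    """Return (index, distance) of the candidate closest to point."""
    distances = [distance_fn(point, c) for c in candidates]
    best_index = min(range(len(distances)), key=distances.__getitem__)
    return best_index, distances[best_index]


# Toy coordinates on a flat plane, only to show the mechanics
toy_schools = [(0.0, 0.0), (5.0, 5.0)]
toy_stores = [(3.0, 4.0), (6.0, 5.0), (0.0, 1.0)]

matches = [nearest(s, toy_stores, math.dist) for s in toy_schools]
print(matches)  # [(2, 1.0), (1, 1.0)]
```

With (lat, long) tuples, distance_fn could be lambda a, b: gp.distance(a, b).miles. Because there is no cap, a school with no store nearby still gets its true nearest store, and an empty store list raises a ValueError from min.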
    

Checking the first five entries of each list confirms that they look as expected.

In [95]:
closest_store_index_list[:5]

[12, 11, 2, 8, 7]

In [138]:
closest_store_distance[:5]

[0.6602052745641682,
 0.9677309053194558,
 0.8419731374081209,
 0.8305586639033838,
 0.6254233778464495]

To prepare a new dataframe that matches schools with their closest stores, I'll create lists to hold the appropriate store information.

In [96]:
close_store = []
store_address = []

Now I'll loop through each store index in closest_store_index_list to pull that store's info out of the grocery_df dataframe.

In [119]:
for store in closest_store_index_list:
    
    close_store.append(grocery_df.iloc[store]["Store Name"])
    store_address.append(grocery_df.iloc[store]["Vicinity"])

Because I'd already combined some datasets before putting the example into this notebook, I had to truncate the close_store list below.

In [128]:
len(close_store)

309

In [147]:
# Every list column must have one entry per school (61 rows here)
n_schools = len(school_df)

updated_df = pd.DataFrame()
updated_df["School Name"] = school_df["School Name"]
updated_df["Location"] = school_df["Location"]
updated_df["Closest Store"] = close_store[:n_schools]
updated_df["Store Address"] = store_address[:n_schools]
updated_df["Distance in Miles"] = closest_store_distance
updated_df["Percent in Poverty"] = school_df["Percent in Poverty"]

In [148]:
updated_df.head(3)

Unnamed: 0,School Name,Location,Closest Store,Store Address,Distance in Miles,Percent in Poverty
0,Allison Elementary,Austin,H-E-B,"2110 W Slaughter Ln, Austin",0.660205,92.0
1,Andrews Elementary School,Austin,H-E-B,"7112 Ed Bluestein Blvd #125, Austin",0.967731,91.02
2,Becker Elementary School,Austin,H-E-B,"2400 S Congress Ave, Austin",0.841973,64.52


To ensure that the data was accurate, we hand-checked a few randomly chosen schools against the corresponding stores to make sure they matched. 

Doing all of the data processing above proved to be so time-consuming that Kellye and I decided to pull the Dallas and Laredo data by hand out of Google Maps. That brought its own problems. (quote from writeup)