# Quality of Google's Abortion Search Results by County Income Level
**CS 234 Data Visualizations (Fall 2022) Final Project - Data Collection**
- **Name:** Sofia Kobayashi
- **Date:** Dec. 21, 2022
- **Description:** Data Collection, Cleaning, & Formatting for final project of CS 234
    - Data only collected for 37 states where abortion is legal (as of 11/15/2022) because there aren't any abortion clinics in the other states to collect data on

### **<u>Table of Contents</u>**
1. [Creating County Information dataset](#sec1)
2. [Collecting Google Places Web-searches](#sec2)
3. [Adding Google Places data to County dataset](#sec3)

In [22]:
import pandas as pd
import re
import warnings
warnings.filterwarnings('ignore')

In [23]:
# states used in this analysis; where abortion is currently legal (and thus could have abortion clinics)
statesUsing = ["Georgia","New York",'Hawaii','Alaska','New Mexico','Pennsylvania','Arizona','Illinois','Connecticut',
      'Rhode Island','North Dakota','South Carolina','Michigan','North Carolina','Colorado','Nebraska','Oregon',
      'Iowa','New Hampshire','Wyoming','Washington','Ohio','Virginia','Utah','Montana','Delaware','Massachusetts',
      'Kansas','New Jersey','Wisconsin','Maine','Maryland','California','Florida','Vermont','Indiana','Minnesota'
      'Nevada']

<a id="sec1"></a>
# 1. Creating County Information dataframe
- Reads in CSV files for three different county data sets, picks out desired data, cleans/standarizes it across data sets, merges into one big DF 
1. [Helper Functions & Variables](#sec1.1)
1. [Read in county coordinate data](#sec1.2)
1. [Read in county population and race data](#sec1.3)
1. [Read in county income data](#sec1.4)
1. [Merge above 3 data sets](#sec1.5)
1. [Add County Political Alignment based on their State](#sec1.6)

<a id="sec1.1"></a>
### 1.1 Helper Functions & Variables

In [24]:
def cleanCounties(county):
    """Takes a county name and cleans/standardizes it to better match across data sets."""
    content = county.strip()
    r1 = re.sub("\sMunicipality$", "", content)
    r2 = re.sub("\sCensus Area$", "", r1)
    r3 = re.sub("\sCounty$", "", r2)
    r4 = re.sub("\sCity and Borough$", "", r3)
    r5 = re.sub("ʻi", "i", r4)
    r6 = re.sub("\sBorough$", "", r5)
    return r6

In [25]:
us_state_to_abbv = {"Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", 
                      "California": "CA", "Colorado": "CO", "Connecticut": "CT", 
                      "Delaware": "DE", "Florida": "FL", "Georgia": "GA", "Hawaii": "HI", 
                      "Idaho": "ID", "Illinois": "IL", "Indiana": "IN", "Iowa": "IA", 
                      "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA", "Maine": "ME", 
                      "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI", 
                      "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", 
                      "Montana": "MT", "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH", 
                      "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", 
                      "North Carolina": "NC", "North Dakota": "ND", "Ohio": "OH", 
                      "Oklahoma": "OK", "Oregon": "OR", "Pennsylvania": "PA", 
                      "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", 
                      "Tennessee": "TN", "Texas": "TX", "Utah": "UT", "Vermont": "VT", 
                      "Virginia": "VA", "Washington": "WA", "West Virginia": "WV", 
                      "Wisconsin": "WI", "Wyoming": "WY", "District of Columbia": "DC", 
                      "American Samoa": "AS", "Guam": "GU", "Northern Mariana Islands": "MP", 
                      "Puerto Rico": "PR", "United States Minor Outlying Islands": "UM", 
                      "U.S. Virgin Islands": "VI", }

us_abbr_to_state = {v: k for k, v in us_state_to_abbv.items()}

<a id="sec1.2"></a>
### 1.2 Read in county coordinates data
- From Wikipedia (which got the information from the US Census Bureau)
- Found at this [Wikipedia link](https://en.wikipedia.org/wiki/User:Michael_J/County_table)

In [26]:
# read in CSV file
coordsData = pd.read_csv("countyData/coords.csv")

# get only desired information
df1 = coordsData[["State","County","Latitude","Longitude"]] \
            .rename(columns={"County": "county",
                            "Latitude":"latitude",
                            "Longitude":"longitude"})

# add & clean columns
df1["county"] = df1["county"].apply(cleanCounties)
df1["latitude"] = df1["latitude"].apply(lambda val: float(val.replace("°","")))
df1["longitude"] = df1["longitude"].apply(lambda val: float(val.replace("°","").replace("–","-")))
df1["state"] = df1["State"].apply(lambda abbv: us_abbr_to_state[abbv].strip())

# reorder columns
df1 = df1[["state", "county","latitude","longitude"]]

# print report
print(f"Shape: {df1.shape}")
df1.head()

Shape: (3143, 4)


Unnamed: 0,state,county,latitude,longitude
0,Alabama,Autauga,32.536382,-86.64449
1,Alabama,Baldwin,30.659218,-87.746067
2,Alabama,Barbour,31.87067,-85.405456
3,Alabama,Bibb,33.015893,-87.127148
4,Alabama,Blount,33.977448,-86.567246


<a id="sec1.3"></a>
### 1.3 Read in county 2021 population data
- From the US Census Bureau (for 2021), total population estimate for each United States county
- Found at this [US Census Bureau page](https://data.census.gov/table?q=county,+population+2021)

In [27]:
# read in CSV file
popData = pd.read_csv("countyData/population_race_2021.csv", low_memory=False)

# get only desired information
df2 = popData[["NAME","DP05_0001E","DP05_0002E","DP05_0003E","DP05_0037E"]] \
        .rename(columns={"DP05_0001E":"total_population",
                         "DP05_0002E":"total_male",
                         "DP05_0003E":"total_female",
                         "DP05_0037E":"total_one_race_white"}) \
        .drop([0]) \
        .reset_index().drop(columns=["index"])

# add & clean columns
df2["county"] = df2["NAME"].apply(lambda name: name.split(",")[0]).apply(cleanCounties)
df2["state"] = df2["NAME"].apply(lambda name: name.split(",")[1])
df2["percent_female"] = df2.apply(lambda row: int(row["total_female"])/int(row["total_population"]), axis=1)
df2["percent_white"] = df2.apply(lambda row: int(row["total_one_race_white"])/int(row["total_population"]), axis=1)
df2["state"] = df2["state"].apply(lambda state: state.strip())

# reorder columns
df2 = df2[["state","county","total_population","total_female","total_male","total_one_race_white","percent_female","percent_white"]]

# print report
print(f"Shape: {df2.shape}")
df2.head()

Shape: (3221, 8)


Unnamed: 0,state,county,total_population,total_female,total_male,total_one_race_white,percent_female,percent_white
0,Alabama,Autauga,58239,30033,28206,43755,0.515685,0.751301
1,Alabama,Baldwin,227131,116350,110781,192034,0.512259,0.845477
2,Alabama,Barbour,25259,11898,13361,11495,0.47104,0.455085
3,Alabama,Bibb,22412,10112,12300,17020,0.451187,0.759415
4,Alabama,Blount,58884,29354,29530,54439,0.498506,0.924513


<a id="sec1.4"></a>
### 1.4 Read in 2021 county income data
- From the US Census Bureau (for 2021), estimated median household income for each United States county
- Found at this [US Census Bureau page](https://data.census.gov/table?q=county,+median+household+income&tid=ACSST1Y2021.S1903&moe=false)

In [28]:
# read in CSV file
incomeData = pd.read_csv("countyData/income_2021.csv", low_memory=False)

# get only desired information
df3 = incomeData[["NAME","S1903_C01_001E","S1903_C03_001E"]] \
        .rename(columns={"S1903_C01_001E":"num_households",
                         "S1903_C03_001E":"dollar_household_median_income"}) \
        .drop([0]) \
        .reset_index().drop(columns=["index"])

# add & clean columns
df3["county"] = df3["NAME"].apply(lambda name: name.split(",")[0].strip()).apply(cleanCounties)
df3["state"] = df3["NAME"].apply(lambda name: name.split(",")[1].strip())

# reorder information
df3 = df3[["state","county","num_households","dollar_household_median_income"]]

# print report
print(f"Shape: {df3.shape}")
df3.head()

Shape: (3221, 4)


Unnamed: 0,state,county,num_households,dollar_household_median_income
0,Alabama,Autauga,21856,62660
1,Alabama,Baldwin,87190,64346
2,Alabama,Barbour,9088,36422
3,Alabama,Bibb,7083,54277
4,Alabama,Blount,21300,52830


<a id="sec1.5"></a>
### 1.5 Merge 3 DFs to make first County dataset
- Even though all three datasets being merged are from the US Census Bureau and sets 2 and 3 have the same number of rows, there were differences in county names
- to remedy this, I wrote the helper function cleanCounties() to standardize county names as much as possible and handle:
    - mismatched county names that were clearly the same county but some had titles added to the end
        - ex. "Anchorage" vs. "Anchorage Municipality", "Aleutians West" vs. "Aleutians West Census Area", "Hawaii" vs "Hawai'i", etc.
    - one data set also had trailing whitespace in so many cells, my first attempt to merge them by matching county names only resulted in 2 columns
- the others unmatched counties had no apparent match across data sets, so unforutnately the final merged counties_1 DF lost ~200 counties
    - ex. 'Wade Hampton' county in data set 1, but there is no "Wade" included in anyway in any county names in data set 2
    - even looking up official list of US counties from other sources didn't help–they often had a different numbers of counties or different names for the unmatched counties and no way to reliably identify which counties were the same

In [29]:
# merge coords + population data (data sets 1 and 2)
dfCounties_1 = df2.merge(df1, how='inner', left_on=["state","county"], right_on=["state","county"]) \
                .drop_duplicates(subset=["state","county"]) \
                .reset_index() \
                .drop(columns=["index"])

# merge coords/population data + income data (adding data set 3)
dfCounties_1 = dfCounties_1.merge(df3, how='inner', left_on=["state","county"], right_on=["state","county"]) \
                .drop_duplicates(subset=["state","county"]) \
                .reset_index() \
                .drop(columns=["index"])

# filter out the states not used in analysis (where abortion is illegal and thus have no clinic to get data from)
dfCounties_1 = dfCounties_1[dfCounties_1.state.isin(statesUsing)==True].reset_index().drop(columns=["index"])

# print report
dfCounties_1.to_csv("counties_1.csv")
print(f"Shape: {dfCounties_1.shape}")
dfCounties_1.head()

Shape: (1880, 12)


Unnamed: 0,state,county,total_population,total_female,total_male,total_one_race_white,percent_female,percent_white,latitude,longitude,num_households,dollar_household_median_income
0,Alaska,Aleutians East,3409,1395,2014,497,0.409211,0.145791,55.243722,-161.950749,914,72258
1,Alaska,Aleutians West,5251,2256,2995,1286,0.429632,0.244906,51.959447,-178.338813,1004,90708
2,Alaska,Anchorage,292545,142897,149648,176279,0.488462,0.602571,61.177549,-149.274354,106695,88871
3,Alaska,Bethel,18514,8790,9724,1769,0.474776,0.095549,60.928916,-160.15335,4520,57460
4,Alaska,Bristol Bay,849,369,480,380,0.434629,0.447585,58.731373,-156.986612,315,81563


<a id="sec1.6"></a>
### 1.6 Add County Political Alignment based on their State
- this data is from my group, based on the 2020 political election

In [30]:
redStates = ["georgia", "alaska", "south carolina", "north carolina", "nebraska", "iowa", "wyoming", 
             "ohio", "utah", "montana", "kansas", "florida", "indiana"]
blueStates = ["new york", "hawaii", "new mexico", "illinois", "connecticut", "rhode island", "colorado", "oregon", 
              "new hampshire", "washington", "michigan", "pennsylvania", "arizona", "virginia", "delaware", 
              "massachusetts", "new jersey", "wisconsin", "maine", "maryland", "california", "nevada", "vermont", "minnesota"]

In [31]:
def alignment(state):
    if state.lower() in redStates: return "republican"
    else: return "democrat"

dfCounties_1["poli_align"] = dfCounties_1["state"].apply(alignment)
dfCounties_1.head()

Unnamed: 0,state,county,total_population,total_female,total_male,total_one_race_white,percent_female,percent_white,latitude,longitude,num_households,dollar_household_median_income,poli_align
0,Alaska,Aleutians East,3409,1395,2014,497,0.409211,0.145791,55.243722,-161.950749,914,72258,republican
1,Alaska,Aleutians West,5251,2256,2995,1286,0.429632,0.244906,51.959447,-178.338813,1004,90708,republican
2,Alaska,Anchorage,292545,142897,149648,176279,0.488462,0.602571,61.177549,-149.274354,106695,88871,republican
3,Alaska,Bethel,18514,8790,9724,1769,0.474776,0.095549,60.928916,-160.15335,4520,57460,republican
4,Alaska,Bristol Bay,849,369,480,380,0.434629,0.447585,58.731373,-156.986612,315,81563,republican


<a id="sec2"></a>
# 2. Collecting Google Places Web-searches
- Uses Selenium to automate searching
- For all 1880 counties in dfCounties_1, search for Google for 'abortion near me' and save the HTML file for each search
1. [Imports & Dependancies](#sec2.1)
1. [Helper Functions & Variables](#sec2.2)
1. [Scraping the Web: Collecting Google Places search results](#sec2.3)

<a id="sec2.1"></a>
### 2.1 Imports & Dependancies

In [32]:
import selenium
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

from time import sleep

In [33]:
# Set the driver path
driverpath ='driver/chromedriver'
chrome_options = webdriver.ChromeOptions()

# This option is what will allow to change the geolocation
chrome_options.add_experimental_option("prefs", { "profile.default_content_settings.geolocation": 1})

# Create the driver instance
service = Service(driverpath)

<a id="sec2.2"></a>
### 2.2 Helper Functions & Variables
- both helper functions below were <mark>written by Malika Parkhomchuk</mark>, and provided by Eni Mustafaraj, although I did modify `click_ul_element` so it wouldn't get stuck not moving a page

In [34]:
from selenium.webdriver.common.keys import Keys

def click_ul_element(driver, footerID):
    """
    *** Written by Malika Parkhomchuk, provided by Eni Mustafaraj, modified by Sofia Kobayashi***
    Scrolls to bottom of page to click Footer. If scrolls too long without moving, 
    keys upward to 'reset' scrolling.
    """
    footer = driver.find_elements(By.ID, footerID)
    
    # initialize page height & number of scrolls, for 'reset' later
    saveHeight = 0
    numScroll = 0
    
    wait = WebDriverWait(driver, 5)
    while len(footer) == 0:
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        sleep(2)
        footer = driver.find_elements(By.ID, footerID)
        
    ul = footer[0].find_element(By.TAG_NAME, "update-location")
        
    while ul.is_displayed() is False:
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        sleep(2)
        ul = footer[0].find_element(By.TAG_NAME, "update-location")
        
        # get current height
        height = driver.execute_script("return document.body.scrollHeight")
        
        # every 3 scrolls, check if page height has changed, if not key upwards
        # (because sometimes scroll will be called, but page won't move and code gets caught in an infinite 
        # loop, keying upwards 'resets' like
        if (numScroll % 3) == 0:
            if height == saveHeight:
                elm = driver.find_element(By.TAG_NAME, "html")
                for i in range(2): elm.send_keys(Keys.ARROW_UP)
            saveHeight = height
        numScroll+= 1
                
    ul.click()

In [35]:
import time, os
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

def search_geolocation(query, ind, coordinatesDict, locationName):
    """
    *** Written by Malika Parkhomchuk, provided by Eni Mustafaraj ***
    This function can search Google by changing the location for 
    the search. Parameters:
    query - a string that contains the phrase that will be searched
    locationName - a string that is used to save the search results page
    coordinatesDict - a dictionary with the latitude, longitude, and accuracy
    """
    # Create a new instance of the driver for every search
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    # setup the new coordinates
    driver.execute_cdp_cmd("Emulation.setGeolocationOverride", 
                           coordinatesDict)
    
    # perform the search, because we need the location link to show
    url = f"https://google.com/search?q={query}"
    driver.get(url)

    # NEW LINE OF CODE #
    click_ul_element(driver, 'footcnt')
    sleep(3)
    ####################
    
    # Access the content of the page
    htmlPage = driver.page_source
    
    # if a folder with the name of the query doesn't exist, create it, then save the file
    if not os.path.isdir(query):
        os.mkdir(query)
    with open(f"{query}/{ind}_{locationName}.html", 'w', encoding='utf-8') as output:
        output.write(htmlPage)        
    # close the instance
    driver.close()

<a id="sec2.3"></a>
### 2.3 Scraping the Web: Collecting Google Places search results
- This code searches for 'abortion near me' at all 1880 US counties (excluding the states in which abortion is illegal), then saves their HTML files in the 'searchData' directory

#### Computer Labor Division
- Split among 6 computers over 2 hours
- Range of indices that represent all 1880 counties to get the Google Places pages for
    - C1: 0-143
    - C2: 144-287
    - C3: 288-431
    - C4: 432-575

    - C5: 576-719
    - C6: 720-863
    - C7: 864-1007
    - C8: 1008-1151

    - C9: 1152-1295
    - C10: 1296-1439
    - C11: 1440-1583
    - C12: 1583-1727

    - C13: 1728-1879


In [36]:
import time

# specifies range of county indexes from dfCounties_1 to search in
startIndex = 0
endIndex_inclusive = -1

# iterate through all rows in specified range
counter = 1
for ind in range(startIndex, endIndex_inclusive+1):
    # try/except so entire range doesn't stop if one page throws an error
    try: 
        # pull needed data from each row
        row = dfCounties_1.loc[ind].to_dict()
        fileName = f"{row['state']}_{row['county']}"
        locationsDct = {'latitude': row['latitude'], 
                        'longitude': row['longitude'], 
                        'accuracy': 100}

        # initialize timer and print statements to make it clear if there is a problem
        print(f"{counter} - Searching ({ind}) {fileName}...")
        before = time.perf_counter()
        
        # search & save Google page to html file
        search_geolocation("abortion near me", ind, locationsDct, fileName) 
        after = time.perf_counter()

        counter += 1
        print(f"--- Wrote successfully, took {round(after-before, 2)} second(s)!")
    except: 
        counter += 1
        print(f"--- ({ind}) FAILED, ELEPHANT")
    print()

<a id="sec3"></a>
# 3. Adding Google Places data to County dataset
- Uses BeautifulSoup to find information on pages
- Adds average star rating and number of reviews for clinics suggested by Google Places and distance from the county to the nearest one

1. [Helper Functions](#sec3.1)
1. [Function to Find, Format, & Return Google Places data](#sec3.2)
1. [Add Google Places data to county DF](#sec3.3)

<a id="sec3.1"></a>
### 3.1 Helper Functions
- the <mark>`getGooglePlace` function was written by Maddie Moon</mark>, but modified by me

In [37]:
def get_all_files(dirName): 
    """Takes a string - name of directory. Returns list of ALL files within that directory minus the .DS_Store"""
    from os import listdir
    allFiles = [f for f in listdir(dirName)]
    if ".DS_Store" in allFiles: allFiles.remove(".DS_Store")
    return allFiles

# get_all_files("abortion near me")

In [38]:
import requests
import json
YOUR_API_KEY = 'AIzaSyCdbEE11g4-o02glyquHQo825Csh9l7410'

def getGooglePlace(place):
    '''*** Written by Maddie Moon, modified by Sofia Kobayashi. ***
    Given the name of a place, uses Google Distance Matrix API to find & return it's coordiantes.
    '''
    url = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json?input=" + place + "&inputtype=textquery&fields=formatted_address%2Cname%2Crating%2Cplace_id%2Cgeometry%2Cplus_code&key=" + YOUR_API_KEY

    payload={}
    headers = {}
    response = requests.request("GET", url, headers=headers, data=payload)
    loc = json.loads(response.text)

    # some place names get no results 
    if loc["candidates"] == []: return None
    else:
        coods = loc["candidates"][0]["geometry"]["location"]
        return coods

# getGooglePlace("Lilith Clinic")


In [39]:
import pandas as pd
import googlemaps
from itertools import tee
MILE_CONVERSION = 0.621371

def getDistance(originCoords, destCoords,gmaps):
    """
    Takes two sets of coordinates (lat, long), returns distance between them in 
    miles using Google Distance API.
    """
    # use Google Distance Matrix API to get the distance between two sets of coordinates
    result = gmaps.distance_matrix(originCoords, destCoords, mode='driving')
    
    # finds distance information (in km) and converts to miles
    mi = float(result["rows"][0]["elements"][0]["distance"]["text"].replace(" km","").replace(",",""))* MILE_CONVERSION
    return mi


<a id="sec3.2"></a>
### 3.2 Function to Find, Format, & Return Google Places data
- Uses BeautifulSoup

In [40]:
from bs4 import BeautifulSoup as BS
import requests
import numpy as np

def getGoogleData(fileName):
    """
    Takes the HTML file from an 'abortion near me' search for one of the US counties. 
    Returns a dictionary of the distance to closest clinic, number of Google Places suggestions, and
    average star rating and number of reviews from abortion clinics in that county as suggested by 
    Google Places.
    """
    # get state name, county name & coords
    index = int(fileName.split("_")[0])
    state = fileName.split("_")[1]
    county = fileName.split("_")[2].replace(".html","")
    
    row = dfCounties_1.iloc[[index]]
    countyCoords = (row.iloc[0]["latitude"], row.iloc[0]["longitude"])
#     print(f"{county}, {state} (state), county coords: {countyCoords}")
    
    # read in & parse Google Search page file
    with open(f"searchData/{fileName}","r") as f:
        contents = f.read()
        soup = BS(contents, 'lxml')
    
    # find all Google Places suggestions (ranging from 0-3)
    placesDiv = soup.select("div[class='kuydt']")
    if placesDiv == []: 
        return {"starAvg": None, "reviewAvg": None, "closestClinic_mi": None, "numPlaces":0}
    placesDivs = soup.find_all("div", {"jsname": "jXK9ad"})
    
    # initialize storage & Distance API 
    ratings = []
    reviews = []
    distances = []
    YOUR_API_KEY = 'AIzaSyCdbEE11g4-o02glyquHQo825Csh9l7410'
    gmaps = googlemaps.Client(key=YOUR_API_KEY)
    
    # get info on all Google Places suggested clinics
    for div in placesDivs:
        # get clinic name
        name = div.select("span[class='OSrXXb']")[0].text
        
        # get clinic star rating
        starSpan  = div.select("span[class='yi40Hd YrbPuc']")
        if starSpan != []: starRating = float(starSpan[0].text)
        else: starRating = np.nan
        
        # get clinic number of reviews
        revSpan  = div.select("span[class='RDApEe YrbPuc']")
        if revSpan != []: numReviews = int(revSpan[0].text[1:-1])
        else: numReviews = np.nan
            
        # find distance from county
        coords = getGooglePlace(name)
        if coords == None: distanceBetween = np.nan
        else: 
            clinicCoords = (coords["lat"], coords["lng"])
            distanceBetween = getDistance(countyCoords,clinicCoords,gmaps)
        
        # store rating, numReviews, and distance, to be used later
        ratings.append(starRating)
        reviews.append(numReviews)
        distances.append(distanceBetween)        
        
    # find averages or closest distance
    starAvg = round(sum(ratings)/len(ratings), 2)
    revAvg = round(sum(reviews)/len(reviews),2)
    closestClinic = round(min(distances), 2)
    numPlaces = len(placesDivs)

    return {"starAvg": starAvg, "reviewAvg": revAvg, "closestClinic_mi": closestClinic, "numPlaces":numPlaces}
    
# c1 = '0_Alaska_Aleutians East.html'
# c2 = '107_Colorado_Broomfield.html'
# c3 = '1157_New York_Chautauqua.html'
    
# getGoogleData("1486_Oregon_Wasco.html")

<a id="sec3.3"></a>
### 3.3 Add Google Places data to county DF
- the `addGoogleData` function didn't quite work as planned, filling the counties DF with string formatted like pd.Series with the values needed inside them, but the helper function `convert` fixed this within seconds 
- if I was using `addGoogleData` again, I would fix it, but for now, it's working with `convert`'s help
- some of the Google-search-related counties rows are filled with NaN's for one reason or another:
    - they didn't have any places results
    - the Places results didn't have any information other than the name 
    - the names of the clinics in Places could not be found by the Google Distance Matric API 

In [41]:
def convert(x):
    """To return to normal how the Google Data got read into the counties DF."""
    if type(x) is float: return(x)
    elif x.split()[1] == "None": return(np.nan)
    else: 
        return(float(x.split()[1]))

# dfCounties_1["avgNumReviews"].apply(convert)

In [42]:
# add columns for Google Places Data
import numpy as np

def addGPcols():
    dfCounties_1["avgStarRating"] = np.nan
    dfCounties_1["avgNumReviews"] = np.nan
    dfCounties_1["closestClinic_mi"] = np.nan
    dfCounties_1["numGooglePlaces"] = np.nan

addGPcols()
dfCounties_1.head()

Unnamed: 0,state,county,total_population,total_female,total_male,total_one_race_white,percent_female,percent_white,latitude,longitude,num_households,dollar_household_median_income,poli_align,avgStarRating,avgNumReviews,closestClinic_mi,numGooglePlaces
0,Alaska,Aleutians East,3409,1395,2014,497,0.409211,0.145791,55.243722,-161.950749,914,72258,republican,,,,
1,Alaska,Aleutians West,5251,2256,2995,1286,0.429632,0.244906,51.959447,-178.338813,1004,90708,republican,,,,
2,Alaska,Anchorage,292545,142897,149648,176279,0.488462,0.602571,61.177549,-149.274354,106695,88871,republican,,,,
3,Alaska,Bethel,18514,8790,9724,1769,0.474776,0.095549,60.928916,-160.15335,4520,57460,republican,,,,
4,Alaska,Bristol Bay,849,369,480,380,0.434629,0.447585,58.731373,-156.986612,315,81563,republican,,,,


In [43]:
def addSearchData(dirName):
    """
    Adds all the Google Search data retreived above (by `getGoogleData`) to counties DF.
    Returns a list of all the files that failed to add their info.
    """
    # Get all files in searchData directory
    searchFiles = get_all_files(dirName)
    # to keep track of those that failed
    failed = [] 

    for file in searchFiles:
        print(f"Starting {file}...")
        try:
            # get county, state, index from file
            index = int(file.split("_")[0])
            state = file.split("_")[1]
            county = file.split("_")[2].replace(".html","")

            # get Google Places data
            searchRes = getGoogleData(file)

            # make rows with search data (DOES NOT WORK AS INTENDED, but well enough fixed very quickly)
            row = dfCounties_1.iloc[[index]]
            row["avgStarRating"] = searchRes["starAvg"]
            row["avgNumReviews"] = searchRes["reviewAvg"]
            row["closestClinic_mi"] = searchRes["closestClinic_mi"]
            row["numGooglePlaces"] = searchRes["numPlaces"]
            
            # add to dfCounties_1
            dfCounties_1.iloc[index] = row

        except:
            # if file failed, add to failed to be printed out
            failed.append(file)     

#     dfCounties_1.to_csv("counties_2.csv")
    return failed


#addSearchData("searchData")

## Now `counties_2.csv` file has all the needed information! Onto the data analysis!
- see next notebook