## Data Acquisition

In this project, my goal is to create an interactive tool that recommends places to live in the continental United States based on a user's preferences. I was inspired to build it because I am currently relocating and wished there were a single aggregated source where I could visualize the factors that matter to me when choosing a state and city to live and thrive in.

### First source: The Municipal Equality Index by HRC (2020)

https://www.hrc.org/resources/municipalities

I'll try to use web scraping on the database results from a blank query on their webpage.

In [3]:
import numpy as np
import pandas as pd

from requests import get
import re
from bs4 import BeautifulSoup
import os

In [2]:
url = 'https://www.hrc.org/resources/municipalities'
headers = {'User-Agent': 'Kwame'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [3]:
print(response.text[:400])

<!DOCTYPE html>
<html dir="ltr" lang="en-US" class="no-js no-touch">
<head>
  <meta charset="utf-8">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

  <link rel="home" href="https://www.hrc.org/">
  <link rel="shortcut icon" href="/favicon.ico">
  <link rel="icon" sizes="16x16 32x32 64x64" href="/favi


In [4]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

# Look at the website and identify parts of results
title = soup.find('h2').text
content = soup.find('p').text

In [5]:
title

'\n            Cookies in use\n          '

In [6]:
content

'Where does your city stand in the movement for equality?'

In [7]:
soup.find_all('h2')

[<h2 class="heading-24 mb-12">
             Cookies in use
           </h2>,
 <h2 class="type-eyebrow text-theme-headline-default-current">
               Related Resources
             </h2>,
 <h2 class="heading-48">
             Love conquers hate.
           </h2>,
 <h2 class="font-bold leading-squeeze text-32">
             Wear your pride this year.
           </h2>,
 <h2 class="heading-48">
                 Love conquers hate.
               </h2>,
 <h2 class="font-bold leading-squeeze text-32">
                 Wear your pride this year.
               </h2>,
 <h2 class="heading-20 mb-32">Join millions of supporters by signing up for the HRC newsletter.</h2>,
 <h2 class="heading-60 mb-32">You are leaving HRC.org</h2>]

In [8]:
soup.find_all("span", class_="align-middle")

[<span class="align-middle">Accept</span>,
 <span class="align-middle">Shop</span>,
 <span class="align-middle">Donate</span>,
 <span class="align-middle">Shop</span>,
 <span class="align-middle">Donate</span>,
 <span class="align-middle">View All</span>,
 <span class="align-middle">Donate Today</span>,
 <span class="align-middle">Shop Now</span>,
 <span class="align-middle">Donate Today</span>,
 <span class="align-middle">Shop Now</span>,
 <span class="align-middle">Sign Me Up</span>]

In [9]:
soup.find_all('$0')

[]

In [10]:
soup.find_all('article', {'data-label': 'component-score-card-index'})

[]

In [11]:
for item in soup:
    print(soup.div.article['aria-label'])

Congressional Scorecard
Congressional Scorecard
Congressional Scorecard
Congressional Scorecard


In [12]:
soup.find_all('/html/body/div[1]/main/div/section/div/article[1]/div[1]/div[1]/h2/a/span[1]')

[]

I think the results are rendered with JavaScript, so the static HTML does not contain them. I'm going to try scraping with Selenium instead.
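
One quick way to check that suspicion is sketched below on a made-up HTML snippet rather than the real HRC page. If the result cards are injected by a script, the static markup that `requests` receives contains the empty container but no `article` elements. The snippet and the `results` id are assumptions for illustration only. The sketch also shows why the earlier `find_all` calls with `'$0'` or an XPath string returned `[]`: `find_all` matches tag names, not XPath, while `select` accepts CSS selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, standing in for what requests receives
# before any JavaScript has run (not the real HRC markup).
static_html = """
<html><body>
  <main>
    <section id="results"></section>
    <script src="/assets/app.js"></script>
  </main>
</body></html>
"""

sample_soup = BeautifulSoup(static_html, "html.parser")

# find_all matches tag names, so an XPath string never matches anything.
xpath_matches = sample_soup.find_all("/html/body/main/section/article")

# CSS selectors go through select(); the container exists but is empty.
containers = sample_soup.select("main > section#results")
cards = sample_soup.select("main > section#results article")

print(len(xpath_matches), len(containers), len(cards))
```

If the container is present but holds no cards, a browser-driven tool such as Selenium is a reasonable next step.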

In [13]:
# import libraries
import urllib.request
from selenium import webdriver
import time

# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search?sort=score-desc' 
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search?sort=score-desc


In [25]:
# import libraries
import urllib.request
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search?sort=score-desc' 
print(urlpage)
# run firefox webdriver from executable path of your choice
# headless mode is left disabled here; to enable it, uncomment the two
# options lines and pass options=options to webdriver.Firefox
#options = Options()
#options.headless = True
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search?sort=score-desc


In [26]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)
# driver.quit()

In [27]:
results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [30]:
# create empty df and arrays to store data
df = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df['city'] = pd.Series(city_list)
df['state'] = pd.Series(state_list)
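
The counts printed earlier (20 city spans, 10 state paragraphs) suggest that some of the matched spans are empty, which is why the loop skips empty strings. Below is a minimal sketch of the same pairing on plain strings, with a length check added so a mismatch fails loudly instead of leaving NaN cells. The sample values and the helper name `pair_locations` are illustrative and not part of the scraper.

```python
import pandas as pd

def pair_locations(city_texts, state_texts):
    """Drop empty city strings, then pair cities with states row by row."""
    cities = [text for text in city_texts if text != ""]
    states = list(state_texts)
    # Building the frame from Series pads with NaN on a length mismatch;
    # raising here makes a bad page obvious instead.
    if len(cities) != len(states):
        raise ValueError(f"{len(cities)} cities but {len(states)} states")
    return pd.DataFrame({"city": cities, "state": states})

# Illustrative input with empty spans mixed in.
sample_cities = ["Albany", "", "Alexandria", "", "Allentown", ""]
sample_states = ["New York", "Virginia", "Pennsylvania"]
sample_df = pair_locations(sample_cities, sample_states)
print(sample_df)
```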

I'll make this a function so I can easily use it later.

In [28]:
def get_locations(target_page_url):
    # print the url
    print(target_page_url)
    # run firefox webdriver from executable path of your choice
    # headless mode is left disabled here; to enable it, uncomment the two
    # options lines and pass options=options to webdriver.Firefox
    #options = Options()
    #options.headless = True
    driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')
    
    # get web page
    driver.get(target_page_url)
    # execute script to scroll down the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    # sleep for 30s
    time.sleep(30)

    results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
    results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
    print('Number of results (cities):', len(results_cities))
    print('Number of results (states):', len(results_states))

    # create empty df and arrays to store data from this page
    df = pd.DataFrame(columns = ['city', 'state'])
    city_list = []
    state_list = []
    # loop over this page's results and store in the lists
    for result_city in results_cities:
        if result_city.text != "":
            city_list.append(result_city.text)
    for result_state in results_states:
        state_list.append(result_state.text)
    # add this page's city and states to df
    df['city'] = pd.Series(city_list)
    df['state'] = pd.Series(state_list)

    # close the browser only after the element text has been read;
    # elements cannot be queried once the session has ended
    driver.quit()

    return df

In [31]:
df

Unnamed: 0,city,state
0,Albany,New York
1,Alexandria,Virginia
2,Allentown,Pennsylvania
3,Ann Arbor,Michigan
4,Arlington,Massachusetts
5,Arlington,Virginia
6,Atlanta,Georgia
7,Austin,Texas
8,Baltimore,Maryland
9,Bellevue,Washington


Now that I have retrieved the first page successfully, it is time to go through the rest of the pages.

https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc

In [20]:
#start_url = 'https://www.hrc.org/resources/municipalities/search/p'
#page_nums = [2, 3, 4, 5, 6, 7, 8, 9, 10]
#end_url = '?sort=score-desc'
#locations = pd.DataFrame(['city', 'state'])

#for page in page_nums:
#    target_page_url = start_url + str(page) + end_url
#    locations = pd.concat(locations, get_locations(target_page_url))
    
#locations

I got an error, which I think is caused by the need for Selenium to open a browser window for each page. I am going to try to get around this using headless mode with geckodriver.

I actually don't have the Selenium knowledge to solve this loop right now, so I am just going to iterate through the (10) pages by hand.
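
For reference, here is a sketch of how the loop could be written once a per-page function works. Two details matter regardless of Selenium: `pd.concat` takes a list of frames as its first argument, and an empty frame with named columns is declared with `columns=`. The `fake_get_locations` stand-in and the `collect_pages` helper are hypothetical; the stand-in returns canned rows so the sketch runs without a browser.

```python
import pandas as pd

START_URL = "https://www.hrc.org/resources/municipalities/search/p"
END_URL = "?sort=score-desc"

def fake_get_locations(target_page_url):
    """Hypothetical stand-in for a per-page scraper; returns canned rows."""
    page = target_page_url[len(START_URL):-len(END_URL)]
    return pd.DataFrame({"city": [f"City {page}"], "state": [f"State {page}"]})

def collect_pages(page_nums, fetch=fake_get_locations):
    """Fetch each page and stack the results into one DataFrame."""
    frames = []
    for page in page_nums:
        target_page_url = START_URL + str(page) + END_URL
        frames.append(fetch(target_page_url))
    if not frames:
        return pd.DataFrame(columns=["city", "state"])
    # concat takes a list of frames; ignore_index renumbers rows 0..n-1
    return pd.concat(frames, ignore_index=True)

locations_sketch = collect_pages([2, 3, 4])
print(locations_sketch)
```

Passing the real per-page function as `fetch` would reuse the same loop unchanged.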

In [21]:
start_url = 'https://www.hrc.org/resources/municipalities/search/p'
page_nums = [2, 3, 4, 5, 6, 7, 8, 9, 10]
end_url = '?sort=score-desc'
locations = pd.DataFrame(columns=['city', 'state'])

for page in page_nums:
    target_page_url = start_url + str(page) + end_url
    print(target_page_url)

https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p3?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p4?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p5?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p6?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p7?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p8?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p9?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p10?sort=score-desc


In [32]:
df1 = df
df1

Unnamed: 0,city,state
0,Albany,New York
1,Alexandria,Virginia
2,Allentown,Pennsylvania
3,Ann Arbor,Michigan
4,Arlington,Massachusetts
5,Arlington,Virginia
6,Atlanta,Georgia
7,Austin,Texas
8,Baltimore,Maryland
9,Bellevue,Washington


In [44]:
#df2 = get_locations('https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc')

# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc


In [45]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [46]:
# create empty df and arrays to store data
df2 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df2['city'] = pd.Series(city_list)
df2['state'] = pd.Series(state_list)

driver.quit()

df2

Unnamed: 0,city,state
0,Berkeley,California
1,Birmingham,Alabama
2,Bloomington,Indiana
3,Boston,Massachusetts
4,Boulder,Colorado
5,Brookings,South Dakota
6,Cambridge,Massachusetts
7,Cathedral City,California
8,Cedar Rapids,Iowa
9,Chicago,Illinois


In [47]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p3?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p3?sort=score-desc


In [48]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [49]:
# create empty df and arrays to store data
df3 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df3['city'] = pd.Series(city_list)
df3['state'] = pd.Series(state_list)

driver.quit()

df3

Unnamed: 0,city,state
0,Chula Vista,California
1,Cincinatti,Ohio
2,Cleveland,Ohio
3,College Park,Maryland
4,Columbia,Maryland
5,Columbia,Missouri
6,Columbus,Ohio
7,Dallas,Texas
8,Dayton,Ohio
9,Denver,Colorado


In [50]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p4?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p4?sort=score-desc


In [51]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [52]:
# create empty df and arrays to store data
df4 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df4['city'] = pd.Series(city_list)
df4['state'] = pd.Series(state_list)

driver.quit()

df4

Unnamed: 0,city,state
0,Detroit,Michigan
1,Dubuque,Iowa
2,East Lansing,Michigan
3,Enterprise,Nevada
4,Eugene,Oregon
5,Ferndale,Michigan
6,Fort Lauderdale,Florida
7,Fort Worth,Texas
8,Frederick,Maryland
9,Hoboken,New Jersey


In [53]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p5?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p5?sort=score-desc


In [54]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [55]:
# create empty df and arrays to store data
df5 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df5['city'] = pd.Series(city_list)
df5['state'] = pd.Series(state_list)

driver.quit()

df5

Unnamed: 0,city,state
0,Huntington,West Virginia
1,Iowa City,Iowa
2,Jersey City,New Jersey
3,Las Vegas,Nevada
4,Long Beach,California
5,Los Angeles,California
6,Louisville,Kentucky
7,Madison,Wisconsin
8,Miami Beach,Florida
9,Milwaukee,Wisconsin


In [56]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p6?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p6?sort=score-desc


In [57]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [58]:
# create empty df and arrays to store data
df6 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df6['city'] = pd.Series(city_list)
df6['state'] = pd.Series(state_list)

driver.quit()

df6

Unnamed: 0,city,state
0,Minneapolis,Minnesota
1,New Orleans,Louisiana
2,New Rochelle,New York
3,New York,New York
4,Northampton,Massachusetts
5,Norwalk,Connecticut
6,Oakland,California
7,Oceanside,California
8,Olympia,Washington
9,Orlando,Florida


In [59]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p7?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p7?sort=score-desc


In [60]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [61]:
# create empty df and arrays to store data
df7 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df7['city'] = pd.Series(city_list)
df7['state'] = pd.Series(state_list)

driver.quit()

df7

Unnamed: 0,city,state
0,Palm Springs,California
1,Paradise,Nevada
2,Philadelphia,Pennsylvania
3,Phoenix,Arizona
4,Pittsburgh,Pennsylvania
5,Portland,Oregon
6,Princeton,New Jersey
7,Providence,Rhode Island
8,Provincetown,Massachusetts
9,Rancho Mirage,California


In [62]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p8?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p8?sort=score-desc


In [63]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [64]:
# create empty df and arrays to store data
df8 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df8['city'] = pd.Series(city_list)
df8['state'] = pd.Series(state_list)

driver.quit()

df8

Unnamed: 0,city,state
0,Reno,Nevada
1,Richmond,Virginia
2,Riverside,California
3,Rochester,New York
4,Rockville,Maryland
5,Sacramento,California
6,Saint Paul,Minnesota
7,Salem,Massachusetts
8,San Antonio,Texas
9,San Diego,California


In [65]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p9?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p9?sort=score-desc


In [66]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [67]:
# create empty df and arrays to store data
df9 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df9['city'] = pd.Series(city_list)
df9['state'] = pd.Series(state_list)

driver.quit()

df9

Unnamed: 0,city,state
0,San Francisco,California
1,Santa Monica,California
2,Seattle,Washington
3,St. Louis,Missouri
4,St. Petersburg,Florida
5,Stamford,Connecticut
6,State College,Pennsylvania
7,Tallahassee,Florida
8,Tampa,Florida
9,Tempe,Arizona


In [68]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p10?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p10?sort=score-desc


In [69]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [70]:
# create empty df and arrays to store data
df10 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df10['city'] = pd.Series(city_list)
df10['state'] = pd.Series(state_list)

driver.quit()

df10

Unnamed: 0,city,state
0,Tucson,Arizona
1,Virginia Beach,Virginia
2,West Hollywood,California
3,West Palm Beach,Florida
4,Wilton Manors,Florida
5,Worcester,Massachusetts
6,Yonkers,New York
7,Akron,Ohio
8,Hartford,Connecticut
9,Missoula,Montana


In [71]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p11?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p11?sort=score-desc


In [72]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [73]:
# create empty df and arrays to store data
df11 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df11['city'] = pd.Series(city_list)
df11['state'] = pd.Series(state_list)

driver.quit()

df11

Unnamed: 0,city,state
0,Oakland Park,Florida
1,West Des Moines,Iowa
2,Burlington,Vermont
3,Ithaca,New York
4,Lambertville,New Jersey
5,Lawrence,Kansas
6,Toledo,Ohio
7,San Jose,California
8,Asbury Park,New Jersey
9,Covington,Kentucky


In [74]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p12?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p12?sort=score-desc


In [75]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [76]:
# create empty df and arrays to store data
df12 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df12['city'] = pd.Series(city_list)
df12['state'] = pd.Series(state_list)

driver.quit()

df12

Unnamed: 0,city,state
0,Signal Hill,California
1,Tacoma,Washington
2,White Plains,New York
3,Fort Collins,Colorado
4,Gainesville,Florida
5,Lexington,Kentucky
6,Trenton,New Jersey
7,Guerneville,California
8,Henderson,Nevada
9,Kansas City,Missouri


In [77]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p13?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p13?sort=score-desc


In [78]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [79]:
# create empty df and arrays to store data
df13 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df13['city'] = pd.Series(city_list)
df13['state'] = pd.Series(state_list)

driver.quit()

df13

Unnamed: 0,city,state
0,Palm Desert,California
1,Portland,Maine
2,Richmond,California
3,Vashon,Washington
4,Gaithersburg,Maryland
5,Irvine,California
6,Laguna Beach,California
7,Overland Park,Kansas
8,Pasadena,California
9,Anchorage,Alaska


In [80]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p14?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p14?sort=score-desc


In [81]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [82]:
# create empty df and arrays to store data
df14 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df14['city'] = pd.Series(city_list)
df14['state'] = pd.Series(state_list)

driver.quit()

df14

Unnamed: 0,city,state
0,Charleston,West Virginia
1,Grand Rapids,Michigan
2,Norman,Oklahoma
3,Albuquerque,New Mexico
4,Buffalo,New York
5,Coral Gables,Florida
6,Norfolk,Virginia
7,Reading,Pennsylvania
8,Duluth,Minnesota
9,Durham,New Hampshire
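
The XPath returns 20 city spans but only 10 state paragraphs, which is why the loop above skips empty strings. A small sketch of the same filtering with an added length check is shown below. The sample values come from the page-14 output; the interleaved empty strings are an assumption about what the extra spans contain.

```python
# Sample values copied from the page-14 output above. The empty strings stand
# in for the extra spans, which is an assumption about what the page returns.
city_texts = ['Charleston', '', 'Grand Rapids', '', 'Norman', '']
state_texts = ['West Virginia', 'Michigan', 'Oklahoma']

# Same filter as the notebook loop: keep only non-empty city names.
cities = [text for text in city_texts if text != '']

# Fail loudly if the two lists cannot be paired one to one.
if len(cities) != len(state_texts):
    raise ValueError(f'{len(cities)} cities but {len(state_texts)} states')

pairs = list(zip(cities, state_texts))
print(pairs)
```

The check matters because `pd.Series` would silently pad a shorter list with NaN, so a mismatch could go unnoticed.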


In [83]:
# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search/p15?sort=score-desc'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search/p15?sort=score-desc
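
Since only the page number changes between these URLs, the list of pages could be generated instead of typed out cell by cell. This is a rough sketch: `page_urls` is a hypothetical helper, and the assumption that pages 1 through 15 all follow the `p{n}` pattern is based only on the two URLs shown here and the `df1`..`df15` naming.

```python
# Base pattern copied from the URLs above; the page range 1-15 is an
# assumption based on the df1..df15 naming in this notebook.
BASE = 'https://www.hrc.org/resources/municipalities/search/p{page}?sort=score-desc'

def page_urls(first=1, last=15):
    """Return the search-result URL for every page from first to last."""
    return [BASE.format(page=n) for n in range(first, last + 1)]

urls = page_urls()
print(len(urls))
print(urls[13])
```

Each URL could then be passed to `driver.get` inside one loop, reusing a single driver rather than starting a new one per page.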


In [84]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [85]:
# create empty df and arrays to store data
df15 = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df15['city'] = pd.Series(city_list)
df15['state'] = pd.Series(state_list)

driver.quit()

df15

Unnamed: 0,city,state
0,Fremont,California
1,Salem,Oregon
2,Terre Haute,Indiana
3,Wilkes-Barre,Pennsylvania
4,Charlottesville,Virginia
5,Indianapolis,Indiana
6,New Hope,Pennsylvania
7,Ocean Grove (Neptune),New Jersey
8,Sunnyvale,California
9,Ames,Iowa


For now I only want cities with an MEI score of 90 or above, so I'll cut off everything after Wilkes-Barre, PA.

In [90]:
df15 = df15[:4]
df15

Unnamed: 0,city,state
0,Fremont,California
1,Salem,Oregon
2,Terre Haute,Indiana
3,Wilkes-Barre,Pennsylvania


Now I'll concatenate all the DataFrames so that I have a single list of every city that received a score of at least 90 on the MEI.

In [94]:
hrc_mei_90plus_cities = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15])
hrc_mei_90plus_cities

Unnamed: 0,city,state
0,Albany,New York
1,Alexandria,Virginia
2,Allentown,Pennsylvania
3,Ann Arbor,Michigan
4,Arlington,Massachusetts
...,...,...
9,Durham,New Hampshire
0,Fremont,California
1,Salem,Oregon
2,Terre Haute,Indiana
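
The index in this output restarts for every page (9, then 0, 1, 2), because each per-page frame keeps its own labels. A minimal sketch of one way to get a single running index, using two small stand-in frames built from rows shown above (`page_a` and `page_b` are made-up names):

```python
import pandas as pd

# Two small stand-ins for the per-page frames, using rows from the outputs above.
page_a = pd.DataFrame({'city': ['Duluth', 'Durham'],
                       'state': ['Minnesota', 'New Hampshire']})
page_b = pd.DataFrame({'city': ['Fremont', 'Salem'],
                       'state': ['California', 'Oregon']})

# ignore_index=True replaces the per-page labels with a fresh 0..n-1 index.
combined = pd.concat([page_a, page_b], ignore_index=True)
print(combined)
```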


Success. Now I'll export it to csv.

In [95]:
hrc_mei_90plus_cities.to_csv('hrc_mei_90plus_cities.csv')
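
By default `to_csv` also writes the index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below compares the default with `index=False`. It writes to in-memory buffers rather than real files, and the two-row frame is just a sample built from rows above.

```python
import io
import pandas as pd

cities = pd.DataFrame({'city': ['Fremont', 'Salem'],
                       'state': ['California', 'Oregon']})

# Write to in-memory buffers so the sketch leaves nothing on disk.
with_index = io.StringIO()
cities.to_csv(with_index)
without_index = io.StringIO()
cities.to_csv(without_index, index=False)

# Reading back shows the extra column that the default index produces.
cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)
print(cols_without)
```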

# Second source: Cat-friendly and Dog-friendly locations

https://www.thrillist.com/news/nation/most-cat-friendly-us-cities-right-now

https://www.thrillist.com/news/nation/pet-friendly-us-cities-2020-list-wallethub

In [None]:
cats = pd.DataFrame(columns = ['city', 'state'])
dogs = pd.DataFrame(columns = ['city', 'state'])

In [None]:
dog_cities = ['Tampa', 'Austin', 'Las Vegas', 'Orlando', 'Seattle',
              'St. Louis', 'Atlanta', 'New Orleans', 'Birmingham', 'San Diego',
              'Cincinnati', 'Scottsdale', 'Boise', 'Portland', 'Lexington-Fayette',
              'Miami', 'Nashville', 'Houston', 'Corpus Christi', 'Oklahoma City']

dog_states = ['Florida', 'Texas', 'Nevada', 'Florida', 'Washington', 'Missouri', 'Georgia',
             'Louisiana', 'Alabama', 'California', 'Ohio', 'Arizona', 'Idaho', 'Oregon',
             'Kentucky', 'Florida', 'Tennessee', 'Texas', 'Texas', 'Oklahoma']

dogs['city'] = pd.Series(dog_cities)
dogs['state'] = pd.Series(dog_states)

dogs

In [None]:
# truncate the cat list at 20 locations (for now at least) so that dogs and cats have the same number of locations

cat_cities = ['Birmingham', 'Portland', 'Madison', 'Richmond', 'Minneapolis', 'St. Louis', 'Tampa',
              'Orlando', 'Greensboro', 'Denver', 'Fort Wayne', 'Baton Rouge', 'Seattle', 'Omaha', 'Tulsa',
              'St. Paul', 'Sacramento', 'St. Petersburg', 'Reno', 'Cincinnati']

cat_states = ['Alabama', 'Oregon', 'Wisconsin', 'Virginia', 'Minnesota', 'Missouri', 'Florida', 'Florida',
             'North Carolina', 'Colorado', 'Indiana', 'Louisiana', 'Washington', 'Nebraska',
             'Oklahoma', 'Minnesota', 'California', 'Florida', 'Nevada', 'Ohio']

cats['city'] = pd.Series(cat_cities)
cats['state'] = pd.Series(cat_states)

cats
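
Typing cities and states as two separate lists makes it easy for them to drift out of alignment. One possible alternative is to keep each city next to its state as a pair. The sketch uses the first three entries of the dog list; `dog_pairs` and `dogs_sketch` are illustrative names only.

```python
import pandas as pd

# First three entries of the dog list above, kept as (city, state) pairs.
dog_pairs = [
    ('Tampa', 'Florida'),
    ('Austin', 'Texas'),
    ('Las Vegas', 'Nevada'),
]

# Each tuple becomes one row, so a city can never end up beside the wrong state.
dogs_sketch = pd.DataFrame(dog_pairs, columns=['city', 'state'])
print(dogs_sketch)
```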

Now I'll export the data I have so far and do some visualizations in an EDA notebook.

In [None]:
dogs.to_csv('dogs.csv')
cats.to_csv('cats.csv')

# Third source: Natural Environment Rankings

https://www.usnews.com/news/best-states/rankings/natural-environment

In [52]:
# import libraries
import urllib.request
from selenium import webdriver
import time

# specify the url
urlpage = 'https://www.usnews.com/news/best-states/rankings/natural-environment' 
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.usnews.com/news/best-states/rankings/natural-environment


In [53]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)

results_rank = driver.find_elements_by_xpath("/html/body/main/div/div/div/div/div/div/div/div/div/div/table/tbody/tr/td/span")
print('Number of results (rank):', len(results_rank))

Number of results (rank): 200


In [54]:
for result in results_rank:
    print(result.text)

1
Hawaii
1
13
2
New Hampshire
10
2
3
South Dakota
11
3
4
Massachusetts
2
17
5
New York
7
11
6
Nebraska
6
12
7
Rhode Island
8
7
8
North Dakota
3
19
9
Vermont
28
1
10
Minnesota
14
14
11
Maryland
16
16
12
Idaho
27
10
13
Wyoming
36
5
14
Maine
31
8
15
Washington
17
18
16
Kansas
9
23
17
Wisconsin
15
20
18
Florida
5
27
19
Virginia
4
32
20
Iowa
18
25
21
Missouri
26
24
22
Mississippi
13
28
23
Colorado
41
9
24
Montana
39
15
25
Georgia
32
26
26
South Carolina
12
33
27
North Carolina
24
31
28
Connecticut
30
30
29
Kentucky
22
35
30
Arkansas
20
36
31
New Mexico
48
4
32
Michigan
35
34
33
New Jersey
25
38
34
Oklahoma
45
22
35
California
47
6
36
West Virginia
37
37
37
Alabama
33
41
38
Pennsylvania
40
39
39
Tennessee
23
43
40
Texas
42
40
41
Arizona
49
21
42
Oregon
21
44
43
Illinois
44
42
44
Ohio
34
45
45
Delaware
29
47
46
Alaska
50
29
47
Utah
43
46
48
Indiana
38
48
49
Louisiana
19
50
50
Nevada
46
49
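
The 200 spans for 50 states suggest four values per state: the overall rank, the state name, and two further numbers that this notebook does not use. Assuming that layout holds for every row, the list could be cut into groups of four without any regex. `chunk` is a hypothetical helper and `texts` holds the first eight printed values.

```python
# First eight values of the printed output above (two states).
texts = ['1', 'Hawaii', '1', '13', '2', 'New Hampshire', '10', '2']

def chunk(items, size):
    """Split a list into consecutive pieces of length `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

rows = chunk(texts, 4)

# Keep only the overall rank and the state name from each group of four.
ranked = [(int(row[0]), row[1]) for row in rows]
print(ranked)
```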


I'll use regex to grab the numbers (and states) that I need.

In [55]:
s = ""
for result in results_rank:
    s += result.text
    s += " "
    
s

'1 Hawaii 1 13 2 New Hampshire 10 2 3 South Dakota 11 3 4 Massachusetts 2 17 5 New York 7 11 6 Nebraska 6 12 7 Rhode Island 8 7 8 North Dakota 3 19 9 Vermont 28 1 10 Minnesota 14 14 11 Maryland 16 16 12 Idaho 27 10 13 Wyoming 36 5 14 Maine 31 8 15 Washington 17 18 16 Kansas 9 23 17 Wisconsin 15 20 18 Florida 5 27 19 Virginia 4 32 20 Iowa 18 25 21 Missouri 26 24 22 Mississippi 13 28 23 Colorado 41 9 24 Montana 39 15 25 Georgia 32 26 26 South Carolina 12 33 27 North Carolina 24 31 28 Connecticut 30 30 29 Kentucky 22 35 30 Arkansas 20 36 31 New Mexico 48 4 32 Michigan 35 34 33 New Jersey 25 38 34 Oklahoma 45 22 35 California 47 6 36 West Virginia 37 37 37 Alabama 33 41 38 Pennsylvania 40 39 39 Tennessee 23 43 40 Texas 42 40 41 Arizona 49 21 42 Oregon 21 44 43 Illinois 44 42 44 Ohio 34 45 45 Delaware 29 47 46 Alaska 50 29 47 Utah 43 46 48 Indiana 38 48 49 Louisiana 19 50 50 Nevada 46 49 '

In [56]:
r1 = re.findall(r"\d+ [A-Z]\w+ ?\D*", s)
print(r1)

['1 Hawaii ', '2 New Hampshire ', '3 South Dakota ', '4 Massachusetts ', '5 New York ', '6 Nebraska ', '7 Rhode Island ', '8 North Dakota ', '9 Vermont ', '10 Minnesota ', '11 Maryland ', '12 Idaho ', '13 Wyoming ', '14 Maine ', '15 Washington ', '16 Kansas ', '17 Wisconsin ', '18 Florida ', '19 Virginia ', '20 Iowa ', '21 Missouri ', '22 Mississippi ', '23 Colorado ', '24 Montana ', '25 Georgia ', '26 South Carolina ', '27 North Carolina ', '28 Connecticut ', '29 Kentucky ', '30 Arkansas ', '31 New Mexico ', '32 Michigan ', '33 New Jersey ', '34 Oklahoma ', '35 California ', '36 West Virginia ', '37 Alabama ', '38 Pennsylvania ', '39 Tennessee ', '40 Texas ', '41 Arizona ', '42 Oregon ', '43 Illinois ', '44 Ohio ', '45 Delaware ', '46 Alaska ', '47 Utah ', '48 Indiana ', '49 Louisiana ', '50 Nevada ']


In [76]:
dfr1 = pd.DataFrame(r1)
dfr1.head()

Unnamed: 0,0
0,1 Hawaii
1,2 New Hampshire
2,3 South Dakota
3,4 Massachusetts
4,5 New York


In [86]:
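# strip the trailing space left by the regex, then split once into rank and state
# (newer pandas versions require n to be passed as a keyword)
green_states = pd.DataFrame(dfr1[0].str.strip().str.split(' ', n=1).tolist())
green_states = green_states.rename(columns={0: 'environmental_rank', 1: 'state'})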

green_states

Unnamed: 0,environmental_rank,state
0,1,Hawaii
1,2,New Hampshire
2,3,South Dakota
3,4,Massachusetts
4,5,New York
5,6,Nebraska
6,7,Rhode Island
7,8,North Dakota
8,9,Vermont
9,10,Minnesota
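
The rank and state could also be separated inside the regex itself by using capture groups, which skips the later string split. This is a sketch on a shortened copy of the scraped string. The pattern assumes every state is followed by exactly two numbers, as in the output above.

```python
import re

# Shortened sample of the scraped text, copied from the string `s` above.
sample = "1 Hawaii 1 13 2 New Hampshire 10 2 3 South Dakota 11 3 "

# One match per state: overall rank, state name, then the two trailing numbers.
pattern = re.compile(r"(\d+) ([A-Z][A-Za-z ]*?) (\d+) (\d+)")

# Keep the rank (as an int) and the state name; ignore the other two numbers.
rows = [(int(rank), state) for rank, state, _, _ in pattern.findall(sample)]
print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=['environmental_rank', 'state'])` would give the same two columns, with the rank already numeric.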


In [87]:
driver.quit()

Now I'll export it to csv.

In [89]:
green_states.to_csv('green_states.csv')

# Fourth source: Marijuana legalization laws by state

http://pdaps.org/datasets/recreational-marijuana-laws

In [4]:
# load raw data
marijuana_states = pd.read_excel('marijuana_states.xlsx')
marijuana_states.head()

Unnamed: 0,Jurisdictions,Effective Date,Valid Through Date,rm-rmlaw_Yes,rm-rmlaw_No,rm-age_21,rm-regulatoryagency_Liquor Control Board,rm-regulatoryagency_Department of Revenue,rm-regulatoryagency_Marijuana Control Board,rm-regulatoryagency_Department of Consumer Affairs,...,rm-excisetaxrate_37%,rm-excisetaxrate_$35 per ounce of usable marijuana,rm-excisetaxrate_$50 per ounce of usable marijuana,rm-excisetaxrate_$10 per ounce of marijuana leaves,rm-excisetaxrate_$5 per immature marijuana plant,rm-salestax_Yes,rm-salestax_No,rm-salestaxrate_3.75%,rm-salestaxrate_10%,rm-salestaxrate_17%
0,Alabama,2014-10-01,2017-02-01,.,1,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.
1,Alaska,2014-10-01,2015-02-23,.,1,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.
2,Alaska,2015-02-24,2015-05-04,1,.,1,1,.,.,.,...,.,.,1,.,.,.,1,.,.,.
3,Alaska,2015-05-05,2016-02-20,1,.,1,.,.,1,.,...,.,.,1,.,.,.,1,.,.,.
4,Alaska,2016-07-29,2016-10-03,1,.,1,.,.,1,.,...,.,.,1,.,.,.,1,.,.,.


Now I'll prepare the data (with the help of its data dictionary) so it's clean enough to join with all the other data I've collected thus far.

In [10]:
# quick check for any nulls (missing values in this file appear as '.', so they are not counted here)
marijuana_states.isnull().any().sum()

0
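
The zero here is a little misleading, since the `head()` output shows many cells holding a literal `.` rather than a real null. A minimal sketch of converting those placeholders to NaN follows. It uses three rows and three columns copied from that output, and treating `.` as "missing" is my reading of the file, not something the data dictionary is quoted on here.

```python
import numpy as np
import pandas as pd

# Three rows and three columns copied from the head() output above.
sample = pd.DataFrame({
    'Jurisdictions': ['Alabama', 'Alaska', 'Alaska'],
    'rm-rmlaw_Yes': ['.', '.', 1],
    'rm-rmlaw_No': [1, 1, '.'],
})

# The "." strings are not counted as nulls.
nulls_before = sample.isnull().any().sum()
print(nulls_before)

# Treat "." as missing, then count again per column.
cleaned = sample.replace('.', np.nan)
print(cleaned.isnull().sum())
```

Another option would be to pass `na_values='.'` to `pd.read_excel` so the placeholders are converted at load time.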

In [13]:
# are the jurisdictions just states or also cities?
marijuana_states.Jurisdictions.value_counts()

Colorado                16
Washington              15
Oregon                  11
Alaska                   8
Nevada                   2
District of Columbia     2
Massachusetts            2
Maine                    2
California               2
Montana                  1
Florida                  1
North Dakota             1
Arizona                  1
Tennessee                1
Iowa                     1
New Jersey               1
Nebraska                 1
Vermont                  1
Ohio                     1
Wyoming                  1
Delaware                 1
Wisconsin                1
Kentucky                 1
North Carolina           1
Mississippi              1
Illinois                 1
Louisiana                1
Rhode Island             1
New Hampshire            1
Michigan                 1
Oklahoma                 1
Maryland                 1
Missouri                 1
Texas                    1
Minnesota                1
South Carolina           1
Georgia                  1
N

In [16]:
# how many jurisdictions are there?
marijuana_states.Jurisdictions.nunique()

51

In [None]:
# set up columns I ideally want to have
#temp = pd.DataFrame(columns=[''])

### Does the state have a law authorizing recreational marijuana? -- `rm-rmlaw`

In [32]:
marijuana_states[marijuana_states['rm-rmlaw_Yes'] == 1].Jurisdictions.unique()

array(['Alaska', 'California', 'Colorado', 'District of Columbia',
       'Maine', 'Massachusetts', 'Nevada', 'Oregon', 'Washington'],
      dtype=object)
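
Because each jurisdiction can have several rows (one per effective-date range), joining with the other tables will need one row per state. One way to get that is to keep only the most recent row for each jurisdiction. The sketch uses rows copied from the `head()` output. The column name `recreational_legal` is made up for illustration, and it assumes the latest row reflects the current law.

```python
import pandas as pd

# Rows copied from the head() output above, limited to three columns.
sample = pd.DataFrame({
    'Jurisdictions': ['Alabama', 'Alaska', 'Alaska', 'Alaska', 'Alaska'],
    'Effective Date': pd.to_datetime(['2014-10-01', '2014-10-01', '2015-02-24',
                                      '2015-05-05', '2016-07-29']),
    'rm-rmlaw_Yes': ['.', '.', 1, 1, 1],
})

# Sort by date, then keep the last (most recent) row for each jurisdiction.
latest = (sample.sort_values('Effective Date')
                .drop_duplicates('Jurisdictions', keep='last')
                .reset_index(drop=True))

# Hypothetical boolean column: True where the latest row has rm-rmlaw_Yes == 1.
latest['recreational_legal'] = latest['rm-rmlaw_Yes'] == 1
print(latest[['Jurisdictions', 'recreational_legal']])
```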

# Fifth source: 