## Data Acquisition

In this project, my goal is to create an interactive tool that recommends the top places to live in the continental United States based on a user's preferences. I was inspired to build it because I am currently relocating and wished there were an aggregated source where I could visualize the factors that matter to me when choosing a state and city to live and thrive in.

### First source: The Municipal Equality Index by HRC (2020)

https://www.hrc.org/resources/municipalities

I'll try to use web scraping on the database results from a blank query on their webpage.

In [11]:
import numpy as np
import pandas as pd

from requests import get
import re
from bs4 import BeautifulSoup
import os

In [None]:
url = 'https://www.hrc.org/resources/municipalities'
headers = {'User-Agent': 'Kwame'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [None]:
print(response.text[:400])

In [None]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

# Look at the website and identify parts of results
title = soup.find('h2').text
content = soup.find('p').text

In [None]:
title

In [None]:
content

In [None]:
soup.find_all('h2')

In [None]:
soup.find_all("span", class_="align-middle")

In [None]:
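# '$0' is the browser DevTools alias for the selected element, not a tag name, so this returns an empty list
soup.find_all('$0')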

In [None]:
soup.find_all('article', {'data-label': 'component-score-card-index'})

In [None]:
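# Print the aria-label of each article (the original loop printed the first article's label on every pass)
for article in soup.find_all('article'):
    print(article.get('aria-label'))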

In [None]:
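# BeautifulSoup does not evaluate XPath: find_all() treats this string as a tag name and returns an empty list
soup.find_all('/html/body/div[1]/main/div/section/div/article[1]/div[1]/div[1]/h2/a/span[1]')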

I think the results are rendered with JavaScript, so I'm going to try scraping with Selenium.
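A rough way to check that suspicion is to count the result elements in the raw HTML and compare that with what the browser shows. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup, and both HTML strings are made-up stand-ins for the real page, so treat it as an illustration only.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of <tag> start tags found in html."""
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


# Hypothetical examples: a page shell that fills itself in with JavaScript,
# and the same page after a browser has rendered the results.
static_html = "<html><body><main><div id='app'></div><script src='app.js'></script></main></body></html>"
rendered_html = "<html><body><main><article><h2>Albany</h2></article><article><h2>Austin</h2></article></main></body></html>"

print(count_tags(static_html, "article"))
print(count_tags(rendered_html, "article"))
```

If the count for `response.text` is zero while the browser clearly shows result cards, the cards are probably added after the page loads, which is the case Selenium is meant to handle.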

In [6]:
# import libraries
import urllib.request
from selenium import webdriver
import time

# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search?sort=score-desc' 
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search?sort=score-desc


In [31]:
# import libraries
import urllib.request
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search?sort=score-desc' 
print(urlpage)
# run firefox webdriver from executable path of your choice
# i'm using headless mode with geckodriver
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search?sort=score-desc


In [7]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)
# driver.quit()

In [8]:
results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [21]:
# create empty df and arrays to store data
df = pd.DataFrame(columns = ['city', 'state'])
city_list = []
state_list = []
# loop over results and store in the lists
for result_city in results_cities:
    if result_city.text != "":
        city_list.append(result_city.text)
for result_state in results_states:
    state_list.append(result_state.text)
# add lists to df
df['city'] = pd.Series(city_list)
df['state'] = pd.Series(state_list)
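Because the city XPath matched 20 elements and the state XPath matched 10, the empty-text filter is what keeps the two lists aligned. A minimal sketch of making that assumption explicit is below; `pair_locations` is a hypothetical helper name, and it works on plain strings (for example the `.text` values collected above) rather than on Selenium elements.

```python
def pair_locations(city_texts, state_texts):
    """Drop blank entries, then pair each city with the state at the same position."""
    cities = [text.strip() for text in city_texts if text.strip()]
    states = [text.strip() for text in state_texts if text.strip()]
    # Fail loudly instead of silently misaligning cities and states
    if len(cities) != len(states):
        raise ValueError(
            f"Found {len(cities)} cities but {len(states)} states; the selectors may have changed"
        )
    return list(zip(cities, states))


pairs = pair_locations(["Albany", "", "Alexandria", ""], ["New York", "Virginia"])
print(pairs)
```

Assigning two `pd.Series` of different lengths to a DataFrame pads the shorter one with `NaN`, so a check like this would surface a mismatch earlier.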

I'll make this a function so I can easily use it later.

In [45]:
def get_locations(target_page_url):
    # print the url
    print(target_page_url)
    # run firefox webdriver from executable path of your choice
    # i'm using headless mode with geckodriver
    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options, executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

    # get web page
    driver.get(target_page_url)
    # execute script to scroll down the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    # sleep for 45s
    time.sleep(45)
    driver.quit()
    
    results_cities = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/div/h2/a/span")
    results_states = driver.find_elements_by_xpath("/html/body/div/main/div/section/div/article/div/p")
    print('Number of results (cities):', len(results_cities))
    print('Number of results (states):', len(results_states))

    # create empty df and arrays to store data from this page
    df = pd.DataFrame(columns = ['city', 'state'])
    city_list = []
    state_list = []
    # loop over this page's results and store in the lists
    for result_city in results_cities:
        if result_city.text != "":
            city_list.append(result_city.text)
    for result_state in results_states:
        state_list.append(result_state.text)
    # add this page's city and states to df
    df['city'] = pd.Series(city_list)
    df['state'] = pd.Series(state_list)
    
    return df

In [22]:
df

| | city | state |
|---|---|---|
| 0 | Albany | New York |
| 1 | Alexandria | Virginia |
| 2 | Allentown | Pennsylvania |
| 3 | Ann Arbor | Michigan |
| 4 | Arlington | Massachusetts |
| 5 | Arlington | Virginia |
| 6 | Atlanta | Georgia |
| 7 | Austin | Texas |
| 8 | Baltimore | Maryland |
| 9 | Bellevue | Washington |


Now that I have retrieved the first page successfully, it is time to go through the rest of the pages.

https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc

In [46]:
start_url = 'https://www.hrc.org/resources/municipalities/search/p'
page_nums = [2, 3, 4, 5, 6, 7, 8, 9, 10]
end_url = '?sort=score-desc'
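# start from an empty frame with the expected columns
locations = pd.DataFrame(columns=['city', 'state'])

for page in page_nums:
    target_page_url = start_url + str(page) + end_url
    # pd.concat expects a list of frames as its first argument
    locations = pd.concat([locations, get_locations(target_page_url)], ignore_index=True)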
    
locations

https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc


MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=55000): Max retries exceeded with url: /session/98271832-edf9-5b4b-a2b9-0832e5074c8f/elements (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff5ab480430>: Failed to establish a new connection: [Errno 61] Connection refused'))

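I got an error. My first thought was that Selenium needs to open a browser window for each page, so I tried headless mode with geckodriver. Note, though, that the refused connection is to 127.0.0.1 on a `/session/.../elements` path, which is the local geckodriver session rather than hrc.org. Since `get_locations` calls `driver.quit()` before `find_elements_by_xpath`, the session is most likely already closed by the time the elements are requested.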

I actually don't have the Selenium knowledge to solve this loop right now, so I am just going to iterate through the (10) pages by hand.
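For later reference, here is a minimal sketch of one way the function could be reordered so the elements are read before the browser is closed. It assumes `driver` is a Selenium WebDriver created the same way as in the cells above, and it uses the generic `find_elements("xpath", ...)` form of the locator call. `scrape_page` is a hypothetical name, and I have not run this against the live site.

```python
import time

# XPaths copied from the cells above
CITY_XPATH = "/html/body/div/main/div/section/div/article/div/div/h2/a/span"
STATE_XPATH = "/html/body/div/main/div/section/div/article/div/p"


def scrape_page(driver, url, wait_seconds=30):
    """Load url, read the city and state texts, and only then close the driver."""
    try:
        driver.get(url)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(wait_seconds)
        # Read the text while the session is still open:
        # elements cannot be queried once quit() has been called.
        cities = [el.text for el in driver.find_elements("xpath", CITY_XPATH) if el.text != ""]
        states = [el.text for el in driver.find_elements("xpath", STATE_XPATH)]
    finally:
        # Always release the browser, even if the page fails to load
        driver.quit()
    return list(zip(cities, states))
```

Each call still needs its own fresh driver, because `quit()` ends the session. Reusing one driver for all pages and quitting once after the loop would be another option.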

In [41]:
start_url = 'https://www.hrc.org/resources/municipalities/search/p'
page_nums = [2, 3, 4, 5, 6, 7, 8, 9, 10]
end_url = '?sort=score-desc'
locations = pd.DataFrame(columns=['city', 'state'])

for page in page_nums:
    target_page_url = start_url + str(page) + end_url
    print(target_page_url)

https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p3?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p4?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p5?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p6?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p7?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p8?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p9?sort=score-desc
https://www.hrc.org/resources/municipalities/search/p10?sort=score-desc


In [42]:
df1 = df
df1

| | city | state |
|---|---|---|
| 0 | Albany | New York |
| 1 | Alexandria | Virginia |
| 2 | Allentown | Pennsylvania |
| 3 | Ann Arbor | Michigan |
| 4 | Arlington | Massachusetts |
| 5 | Arlington | Virginia |
| 6 | Atlanta | Georgia |
| 7 | Austin | Texas |
| 8 | Baltimore | Maryland |
| 9 | Bellevue | Washington |


In [75]:
df2 = get_locations('https://www.hrc.org/resources/municipalities/search/p2?sort=score-desc')
df2

https://www.hrc.org/resources/municipalities/search/p3?sort=score-desc


MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=60209): Max retries exceeded with url: /session/f1010589-e880-f145-9730-4e183973671e/elements (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff5ab4bdc40>: Failed to establish a new connection: [Errno 61] Connection refused'))

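This is the same error as in the loop above. The refused connection is to the local geckodriver at 127.0.0.1, not to hrc.org, so it is probably not a rate limit on the site. I will save my progress and try again later.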

### Second source: Cat-friendly and Dog-friendly locations

https://www.thrillist.com/news/nation/most-cat-friendly-us-cities-right-now

https://www.thrillist.com/news/nation/pet-friendly-us-cities-2020-list-wallethub

In [63]:
cats = pd.DataFrame(columns = ['city', 'state'])
dogs = pd.DataFrame(columns = ['city', 'state'])

In [72]:
dog_cities = ['Tampa', 'Austin', 'Las Vegas', 'Orlando', 'Seattle',
              'St. Louis', 'Atlanta', 'New Orleans', 'Birmingham', 'San Diego',
              'Cincinnati', 'Scottsdale', 'Boise', 'Portland', 'Lexington-Fayette',
              'Miami', 'Nashville', 'Houston', 'Corpus Christi', 'Oklahoma City']

dog_states = ['Florida', 'Texas', 'Nevada', 'Florida', 'Washington', 'Missouri', 'Georgia',
             'Louisiana', 'Alabama', 'California', 'Ohio', 'Arizona', 'Idaho', 'Oregon',
             'Kentucky', 'Florida', 'Tennessee', 'Texas', 'Texas', 'Oklahoma']

dogs['city'] = pd.Series(dog_cities)
dogs['state'] = pd.Series(dog_states)

dogs

| | city | state |
|---|---|---|
| 0 | Tampa | Florida |
| 1 | Austin | Texas |
| 2 | Las Vegas | Nevada |
| 3 | Orlando | Florida |
| 4 | Seattle | Washington |
| 5 | St. Louis | Missouri |
| 6 | Atlanta | Georgia |
| 7 | New Orleans | Louisiana |
| 8 | Birmingham | Alabama |
| 9 | San Diego | California |


In [73]:
# going to truncate at 20 locations for cats (for now at least) so that dogs and cats have equal locations

cat_cities = ['Birmingham', 'Portland', 'Madison', 'Richmond', 'Minneapolis', 'St. Louis', 'Tampa',
              'Orlando', 'Greensboro', 'Denver', 'Fort Wayne', 'Baton Rouge', 'Seattle', 'Omaha', 'Tulsa',
              'St. Paul', 'Sacramento', 'St. Petersburg', 'Reno', 'Cincinnati']

cat_states = ['Alabama', 'Oregon', 'Wisconsin', 'Virginia', 'Minnesota', 'Missouri', 'Florida', 'Florida',
             'North Carolina', 'Colorado', 'Indiana', 'Louisiana', 'Washington', 'Nebraska',
             'Oklahoma', 'Minnesota', 'California', 'Florida', 'Nevada', 'Ohio']

cats['city'] = pd.Series(cat_cities)
cats['state'] = pd.Series(cat_states)

cats

| | city | state |
|---|---|---|
| 0 | Birmingham | Alabama |
| 1 | Portland | Oregon |
| 2 | Madison | Wisconsin |
| 3 | Richmond | Virginia |
| 4 | Minneapolis | Minnesota |
| 5 | St. Louis | Missouri |
| 6 | Tampa | Florida |
| 7 | Orlando | Florida |
| 8 | Greensboro | North Carolina |
| 9 | Denver | Colorado |


Now I'll export the data I have so far and do some visualizations in an EDA notebook.

### Export data

In [76]:
dogs.to_csv('dogs.csv')
cats.to_csv('cats.csv')
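One thing to keep in mind for the EDA notebook: `to_csv` writes the row index by default, and reading such a file back with `pd.read_csv` produces an extra `Unnamed: 0` column. The sketch below shows the difference using a small made-up frame and in-memory buffers instead of the real files; passing `index=False` is one way to avoid the extra column.

```python
import io

import pandas as pd

# Small stand-in for the dogs DataFrame
sample = pd.DataFrame({'city': ['Tampa', 'Austin'], 'state': ['Florida', 'Texas']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```

Alternatively, the files can be left as they are and read back with `pd.read_csv('dogs.csv', index_col=0)`.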