## Data Acquisition

In this project, my goal is to build an interactive tool that recommends the best places to live in the continental United States based on a user's preferences. I was inspired to create it because I am currently relocating and wished there were a single aggregated source for visualizing the factors that matter to me when choosing a state and city to live and thrive in.

### First source: The Municipal Equality Index by HRC (2020)

https://www.hrc.org/resources/municipalities

I'll try scraping the results that their webpage returns for a blank query.

In [11]:
import numpy as np
import pandas as pd

from requests import get
import re
from bs4 import BeautifulSoup
import os

In [None]:
url = 'https://www.hrc.org/resources/municipalities'
headers = {'User-Agent': 'Kwame'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [None]:
print(response.text[:400])

In [None]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

# Look at the website and identify parts of results
title = soup.find('h2').text
content = soup.find('p').text

In [None]:
title

In [None]:
content

In [None]:
soup.find_all('h2')

In [None]:
soup.find_all("span", class_="align-middle")

In [None]:
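# '$0' is the browser DevTools alias for the selected element, not a tag name,
# so BeautifulSoup finds nothing for it
soup.find_all('$0')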

In [None]:
soup.find_all('article', {'data-label': 'component-score-card-index'})

In [None]:
# Print the aria-label of every article element in the parsed page
for article in soup.find_all('article'):
    print(article.get('aria-label'))

In [None]:
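# BeautifulSoup does not support XPath (find_all treats the string as a tag name),
# so use a CSS selector that roughly matches the same spans instead
soup.select('main section article h2 a span')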

I think the results are rendered with JavaScript, so I'm going to try scraping with Selenium instead.
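One quick way to test that suspicion is to count the result tags in the raw HTML that `requests` returned. The sketch below uses only the standard library so it runs without extra dependencies; `TagCounter`, `count_tags`, and both sample HTML strings are made up for illustration and are not HRC's actual markup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times one tag name opens in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of opening `tag` elements found in `html`."""
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


# Hypothetical examples: a JavaScript shell versus a fully rendered page
static_shell = "<html><body><main id='app'></main><script src='app.js'></script></body></html>"
rendered = "<html><body><main><article><h2>Albany</h2></article></main></body></html>"

print(count_tags(static_shell, "article"))
print(count_tags(rendered, "article"))
```

If `count_tags(response.text, "article")` comes back as zero while the browser shows result cards, the cards are probably being added after the page loads.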

In [6]:
# import libraries
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
import time

# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search?sort=score-desc'
print(urlpage)
# run the Firefox webdriver from the geckodriver path of your choice
# (Selenium 4 takes the path through a Service object; Selenium 3 used executable_path=)
service = Service(executable_path='/Users/a666/codeup-data-science/flask-project/geckodriver')
driver = webdriver.Firefox(service=service)

https://www.hrc.org/resources/municipalities/search?sort=score-desc


In [7]:
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 30s
time.sleep(30)
# driver.quit()
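A fixed 30-second sleep works, but it always waits the full time even when the results show up sooner. Below is a minimal sketch of the alternative: poll a condition until it is true or a timeout passes. `wait_for` and `results_loaded` are hypothetical names of mine; Selenium also ships its own `WebDriverWait` class for this purpose.

```python
import time


def wait_for(condition, timeout=30.0, interval=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page whose results only appear on the third check
checks = {"count": 0}


def results_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


wait_for(results_loaded, timeout=5, interval=0.01)
```

With the driver, the condition could be a lambda that returns the list of matched elements, which becomes truthy as soon as at least one result has rendered.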

In [8]:
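# the find_elements_by_* helpers are gone in newer Selenium 4 releases, so pass By.XPATH instead
from selenium.webdriver.common.by import By

results_cities = driver.find_elements(By.XPATH, "/html/body/div/main/div/section/div/article/div/div/h2/a/span")
results_states = driver.find_elements(By.XPATH, "/html/body/div/main/div/section/div/article/div/p")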
print('Number of results (cities):', len(results_cities))
print('Number of results (states):', len(results_states))

Number of results (cities): 20
Number of results (states): 10


In [21]:
# collect the scraped text, skipping the empty city spans
city_list = [result.text for result in results_cities if result.text != ""]
state_list = [result.text for result in results_states]

# build the DataFrame from the two lists (aligned by position)
df = pd.DataFrame({'city': pd.Series(city_list), 'state': pd.Series(state_list)})
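Because the XPath returned 20 city spans but only 10 states, the pairing depends on dropping the empty spans first. Here is a small sketch of a guard that fails loudly if the two lists ever stop lining up; `pair_cities_states` is a hypothetical helper, and the sample input only mimics the pattern of blank spans seen above.

```python
def pair_cities_states(city_texts, state_texts):
    """Drop blank city strings, then pair each city with its state by position."""
    cities = [text.strip() for text in city_texts if text.strip()]
    states = [text.strip() for text in state_texts]
    if len(cities) != len(states):
        raise ValueError(
            f"{len(cities)} cities but {len(states)} states; rows would be misaligned"
        )
    return [{"city": city, "state": state} for city, state in zip(cities, states)]


# Sample input with blank spans mixed in between the real city names
rows = pair_cities_states(["Albany", "", "Alexandria", ""], ["New York", "Virginia"])
print(rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`.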

In [22]:
df

|   | city | state |
|---|------|-------|
| 0 | Albany | New York |
| 1 | Alexandria | Virginia |
| 2 | Allentown | Pennsylvania |
| 3 | Ann Arbor | Michigan |
| 4 | Arlington | Massachusetts |
| 5 | Arlington | Virginia |
| 6 | Atlanta | Georgia |
| 7 | Austin | Texas |
| 8 | Baltimore | Maryland |
| 9 | Bellevue | Washington |


Now that I have retrieved the first page successfully, it is time to go through the rest of the pages.
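As a starting point, here is a sketch of how the page URLs could be generated. It assumes the site paginates through a query parameter named `page`, which I have not confirmed; `page_url` is a hypothetical helper, and the parameter name should be checked against the URL the browser shows after clicking to the next page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = "https://www.hrc.org/resources/municipalities/search?sort=score-desc"


def page_url(base_url, page, param="page"):
    """Return base_url with a page-number query parameter added or replaced.

    ASSUMPTION: the site paginates with a query parameter called `page`.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Build the URLs for the first three pages
urls = [page_url(BASE_URL, number) for number in range(1, 4)]
for url in urls:
    print(url)
```

Each URL could then be passed to `driver.get` in a loop, repeating the same scraping steps used for the first page.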