## Data Acquisition

In this project, my goal is to build an interactive tool that recommends places to live in the continental United States based on a user's preferences. I was inspired to create it because I am currently relocating and wished there were a single aggregated source where I could visualize the factors that matter to me when choosing a state and city to live and thrive in.

### First source: The Municipal Equality Index by HRC (2020)

https://www.hrc.org/resources/municipalities

I'll try scraping the database results that their webpage returns for a blank query.

In [1]:
import numpy as np
import pandas as pd

from requests import get
import re
from bs4 import BeautifulSoup
import os

In [2]:
url = 'https://www.hrc.org/resources/municipalities'
headers = {'User-Agent': 'Kwame'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [3]:
print(response.text[:400])

<!DOCTYPE html>
<html dir="ltr" lang="en-US" class="no-js no-touch">
<head>
  <meta charset="utf-8">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

  <link rel="home" href="https://www.hrc.org/">
  <link rel="shortcut icon" href="/favicon.ico">
  <link rel="icon" sizes="16x16 32x32 64x64" href="/favi


In [4]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

# Look at the website and identify parts of results
title = soup.find('h2').text
content = soup.find('p').text

In [5]:
title

'\n            Cookies in use\n          '

In [6]:
content

'Where does your city stand in the movement for equality?'

In [7]:
soup.find_all('h2')

[<h2 class="heading-24 mb-12">
             Cookies in use
           </h2>,
 <h2 class="type-eyebrow text-theme-headline-default-current">
               Related Resources
             </h2>,
 <h2 class="heading-48">
             Love conquers hate.
           </h2>,
 <h2 class="font-bold leading-squeeze text-32">
             Wear your pride this year.
           </h2>,
 <h2 class="heading-48">
                 Love conquers hate.
               </h2>,
 <h2 class="font-bold leading-squeeze text-32">
                 Wear your pride this year.
               </h2>,
 <h2 class="heading-20 mb-32">Join millions of supporters by signing up for the HRC newsletter.</h2>,
 <h2 class="heading-60 mb-32">You are leaving HRC.org</h2>]

In [11]:
soup.find_all("span", class_="align-middle")

[<span class="align-middle">Accept</span>,
 <span class="align-middle">Shop</span>,
 <span class="align-middle">Donate</span>,
 <span class="align-middle">Shop</span>,
 <span class="align-middle">Donate</span>,
 <span class="align-middle">View All</span>,
 <span class="align-middle">Donate Today</span>,
 <span class="align-middle">Shop Now</span>,
 <span class="align-middle">Donate Today</span>,
 <span class="align-middle">Shop Now</span>,
 <span class="align-middle">Sign Me Up</span>]

In [13]:
soup.find_all('$0')

[]

In [14]:
soup.find_all('article', {'data-label': 'component-score-card-index'})

[]

In [15]:
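# Note: the loop body never uses `item`, so this prints the first article's
# aria-label once per top-level node in the soup rather than one label per article
for item in soup:
    print(soup.div.article['aria-label'])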

Congressional Scorecard
Congressional Scorecard
Congressional Scorecard
Congressional Scorecard


In [16]:
# find_all() matches tag names, not XPath expressions, so this finds nothing
soup.find_all('/html/body/div[1]/main/div/section/div/article[1]/div[1]/div[1]/h2/a/span[1]')

[]

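None of these lookups find the municipality results in the static HTML. I think the results are rendered by JavaScript after the page loads, so I'm going to try scraping with Selenium, which drives a real browser.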
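
One way to test that suspicion is to count the target elements in the static HTML and compare against what the browser shows. Below is a minimal sketch using only the standard library. The `TagCounter` class and `count_matching_tags` helper are hypothetical names, both HTML strings are made-up stand-ins rather than real HRC markup, and the `data-label` value is simply the one tried in the cell above.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags of one name that carry a given attribute/value pair."""

    def __init__(self, tag, attr_name, attr_value):
        super().__init__()
        self.tag = tag
        self.attr_name = attr_name
        self.attr_value = attr_value
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == self.tag and (self.attr_name, self.attr_value) in attrs:
            self.count += 1


def count_matching_tags(html, tag, attr_name, attr_value):
    """Hypothetical helper: return how many matching tags the HTML contains."""
    parser = TagCounter(tag, attr_name, attr_value)
    parser.feed(html)
    return parser.count


# Made-up example: what a plain HTTP request might return (an empty shell plus a script)
static_html = '<main><div id="app"></div><script src="app.js"></script></main>'
# Made-up example: what the browser might hold after the script has run
rendered_html = (
    '<main><article data-label="component-score-card-index">'
    '<h2><a><span>Albany</span></a></h2></article></main>'
)

static_count = count_matching_tags(
    static_html, "article", "data-label", "component-score-card-index"
)
rendered_count = count_matching_tags(
    rendered_html, "article", "data-label", "component-score-card-index"
)
print(static_count, rendered_count)
```

A count of zero in the static HTML alongside visible results in the browser is consistent with client-side rendering, though it could also mean the guessed attribute is wrong.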

In [26]:
# import libraries
import urllib.request
from selenium import webdriver
import time

# specify the url
urlpage = 'https://www.hrc.org/resources/municipalities/search?sort=score-desc'
print(urlpage)
# run the Firefox webdriver from an executable path of your choice
# (executable_path is the Selenium 3 style; Selenium 4 takes a Service object instead)
driver = webdriver.Firefox(executable_path = '/Users/a666/codeup-data-science/flask-project/geckodriver')

https://www.hrc.org/resources/municipalities/search?sort=score-desc


In [27]:
# get web page
driver.get(urlpage)
# scroll to the bottom of the page (the script also returns the page height)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# wait 30 seconds so the JavaScript-rendered results have time to load
time.sleep(30)
# driver.quit()
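
A fixed 30-second sleep waits the full time even when the results load sooner. Selenium provides `WebDriverWait` for this purpose; the sketch below only illustrates the underlying polling idea in plain Python so it can run without a browser. `wait_until` and `fake_results` are hypothetical names, and `fake_results` stands in for a real element lookup.

```python
import time


def wait_until(predicate, timeout=30.0, poll_interval=0.5):
    """Call predicate until it returns a truthy value or the timeout elapses.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = predicate()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose results only appear on the third check
attempts = []


def fake_results():
    attempts.append(1)
    return ["Albany", "Alexandria"] if len(attempts) >= 3 else []


found = wait_until(fake_results, timeout=5.0, poll_interval=0.01)
print(found, len(attempts))
```

With a live driver, the predicate could be a lambda that performs the element lookup and returns the (possibly empty) list of matches.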

In [32]:
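# find_elements_by_xpath was removed in newer Selenium 4 releases;
# find_elements("xpath", ...) works in both Selenium 3 and 4
results = driver.find_elements("xpath", "/html/body/div/main/div/section/div/article/div/div/h2/a/span")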
print('Number of results', len(results))

Number of results 20


In [37]:
for result in results:
    print(result.text)

Albany

Alexandria

Allentown

Ann Arbor

Arlington

Arlington

Atlanta

Austin

Baltimore

Bellevue



In [43]:
# create an empty list to store the data
data = []
# loop over results, skipping the empty spans
for result in results:
    if result.text != "":
        city = result.text
        # append a dict to the list
        data.append({"city" : city})

In [44]:
data

[{'city': 'Albany'},
 {'city': 'Alexandria'},
 {'city': 'Allentown'},
 {'city': 'Ann Arbor'},
 {'city': 'Arlington'},
 {'city': 'Arlington'},
 {'city': 'Atlanta'},
 {'city': 'Austin'},
 {'city': 'Baltimore'},
 {'city': 'Bellevue'}]
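
Note that `Arlington` appears twice. These are presumably two different municipalities (an assumption, since the scraped spans carry no state), so the city name alone cannot serve as a unique key. A minimal sketch for spotting such collisions, with the list copied from the output above:

```python
from collections import Counter

# Same structure and values as the scraped `data` list above
data = [
    {"city": "Albany"},
    {"city": "Alexandria"},
    {"city": "Allentown"},
    {"city": "Ann Arbor"},
    {"city": "Arlington"},
    {"city": "Arlington"},
    {"city": "Atlanta"},
    {"city": "Austin"},
    {"city": "Baltimore"},
    {"city": "Bellevue"},
]

# Count how often each city name occurs and keep the names seen more than once
name_counts = Counter(row["city"] for row in data)
duplicates = [name for name, n in name_counts.items() if n > 1]
print(duplicates)
```

Any name reported here needs a second field, such as the state, scraped alongside it before the rows can be joined to other data sources.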