<a href="https://colab.research.google.com/github/Location-Artistry/GEO-DEV-NOTEBOOKS/blob/main/30_DAY_MAP_2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **30 DAY MAP CHALLENGE**
## Scripts & Scrapers for Map Data
## Stay Calm and Map ON
Several automated scripts that use Beautiful Soup and search libraries to programmatically fetch data from websites.
Uses Beautiful Soup 4 and Google search libraries to search state websites.


In [None]:
!pip install search-engine-parser
!pip install "search-engine-parser[cli]"
!pip install beautifulsoup4 #- already installed with Colab
!pip install git+https://github.com/abenassi/Google-Search-API

import pandas as pd
import requests
import nest_asyncio
from bs4 import BeautifulSoup
from search_engine_parser import GoogleSearch
from googlesearch import search 
from googleapi import google
from IPython.display import IFrame

nest_asyncio.apply()

## Scrape Wikipedia with Beautiful Soup for the list of largest US cities

In [None]:
# Scrape page contents
URL = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
# get all tables from page, and print number of tables
wikiTables = soup.find_all("table", attrs={"class": "wikitable sortable"})
len(wikiTables)

4

In [None]:
# inspect the header row of the first table to check whether it is the target
table = wikiTables[0]
body = table.find_all("tr")
head = body[0]
head

<tr>
<th>2020<br/>rank
</th>
<th>City
</th>
<th>State<sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[c]</a></sup>
</th>
<th>2020<br/>census
</th>
<th>2010<br/>census
</th>
<th>Change
</th>
<th colspan="2">2020 land area
</th>
<th colspan="2">2020 population density
</th>
<th>Location
</th></tr>

#### We've got the correct table, listing the largest US cities by population   
Now let's load the table rows into a pandas dataframe

In [None]:
# Create the header row and strip the '\n' from the text
headings = []
for item in head.find_all("th"): # loop through all th elements
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)

['2020rank', 'City', 'State[c]', '2020census', '2010census', 'Change', '2020 land area', '2020 population density', 'Location']


In [None]:
# get all rows except the header to load into dataframe
body_rows = body[1:] # All other items becomes the rest of the rows

In [None]:
all_rows = [] 
for row_num in range(len(body_rows)): # A row at a time
    row = [] # this will hold entries for one row
    for row_item in body_rows[row_num].find_all("td"): #loop through all row entries
        row.append(row_item.text)
    all_rows.append(row)
# print(all_rows)

In [None]:
df = pd.DataFrame(data=all_rows,columns=headings)
df.describe()

This raises `AssertionError: 9 columns passed, passed data had 11 columns`.
Some of the rows have 11 columns, but only 9 headings were collected.


In [None]:
# Aha! 9 headings but 11 columns, because land area and density each have separate sq mi and sq km columns
for i, heads in enumerate(headings):
  display(f'{i} - {heads}')

'0 - 2020rank'

'1 - City'

'2 - State[c]'

'3 - 2020census'

'4 - 2010census'

'5 - Change'

'6 - 2020 land area'

'7 - 2020 land area km'

'8 - 2020 population density'

'9 - 2020 pop density sq km'

'10 - Location'

In [None]:
# add two headings so the header matches the 11 data columns
headings.insert(7, '2020 land area km')
headings.insert(9, '2020 pop density sq km')
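
The two headings above were inserted by hand. As a rough sketch of an alternative, the header could be expanded from each cell's `colspan` value, which Beautiful Soup exposes through `th.get("colspan", 1)`. The function name `expand_headings` and the numbered `(2)` suffix are illustrative assumptions, not part of the original notebook.

```python
def expand_headings(cells):
    """Expand (text, colspan) pairs into one heading per data column.

    `cells` is assumed to be built from the header row, for example:
    [(th.text.strip(), int(th.get("colspan", 1))) for th in head.find_all("th")]
    """
    headings = []
    for text, colspan in cells:
        headings.append(text)
        # Extra columns covered by the same header cell get a numbered suffix.
        for extra in range(2, colspan + 1):
            headings.append(f"{text} ({extra})")
    return headings


# Sample input shaped like part of the header row shown earlier.
cells = [("2020rank", 1), ("City", 1), ("2020 land area", 2), ("Location", 1)]
print(expand_headings(cells))
# ['2020rank', 'City', '2020 land area', '2020 land area (2)', 'Location']
```

This keeps the heading count in step with the data columns if the table layout changes, at the cost of less descriptive names than the hand-written ones.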

### Now load the rows into a dataframe with the correct number of columns!

In [None]:
import re

all_rows = []
for tr in body_rows:  # one table row at a time
    row = []  # this will hold entries for one row
    for td in tr.find_all("td"):  # loop through all row entries
        rowText = td.text.rstrip("\n")
        # remove a trailing footnote marker such as "[d]" as an exact pattern;
        # rstrip("[d]") strips a set of characters and can eat a final "d"
        rowText = re.sub(r"\[[a-z]\]$", "", rowText)
        row.append(rowText.rstrip())
    all_rows.append(row)

In [None]:
# Success!
df = pd.DataFrame(data=all_rows,columns=headings)
df.describe()

Unnamed: 0,2020rank,City,State[c],2020census,2010census,Change,2020 land area,2020 land area km,2020 population density,2020 pop density sq km,Location
count,326,326,326,326,326,326,326,326,326,326,326
unique,326,316,46,326,322,306,296,296,315,306,326
top,201,Springfiel,California,150227,197899,+2.38%,23.5 sq mi,60.9 km2,"4,888/sq mi","1,887/km2",34°56′N 120°26′W﻿ / ﻿34.93°N 120.44°W﻿ / 34.93...
freq,1,3,75,1,2,2,3,3,3,3,1


In [None]:
df.head().T

Unnamed: 0,0,1,2,3,4
2020rank,1,2,3,4,5
City,New York,Los Angeles,Chicago,Houston,Phoenix
State[c],New York,California,Illinois,Texas,Arizona
2020census,8804190,3898747,2746388,2304580,1608139
2010census,8175133,3792621,2695598,2099451,1445632
Change,+7.69%,+2.80%,+1.88%,+9.77%,+11.24%
2020 land area,300.5 sq mi,469.5 sq mi,227.7 sq mi,640.4 sq mi,518.0 sq mi
2020 land area km,778.3 km2,"1,216.0 km2",589.7 km2,"1,658.6 km2","1,341.6 km2"
2020 population density,"29,298/sq mi","8,304/sq mi","12,061/sq mi","3,599/sq mi","3,105/sq mi"
2020 pop density sq km,"11,312/km2","3,206/km2","4,657/km2","1,390/km2","1,199/km2"


#### Adding rstrip to row processing eliminated unwanted characters
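
One caveat worth knowing: `str.rstrip` treats its argument as a set of characters rather than a suffix, which would explain truncated names such as `Springfiel` and `Portlan` in the outputs. The snippet below is a minimal illustration; `strip_footnote` is a hypothetical helper name, and the pattern assumes footnote markers are a single lowercase letter in square brackets.

```python
import re

# rstrip removes any trailing '[', 'd' or ']' characters, not the suffix "[d]".
print("Portland[d]".rstrip("[d]"))   # Portlan
print("New York[e]".rstrip("[d]"))   # New York[e


def strip_footnote(text):
    """Remove one trailing single-letter footnote marker such as '[d]'."""
    return re.sub(r"\[[a-z]\]$", "", text.rstrip())


print(strip_footnote("Portland[d]"))  # Portland
print(strip_footnote("New York[e]"))  # New York
print(strip_footnote("Springfield"))  # Springfield
```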

In [None]:
# Remove any footnote marker still attached to a city name,
# whether complete ("[e]") or partial ("[e").
# The same approach works for any remaining "\n" characters.
df['City'] = df['City'].str.replace(r'\[[a-z]\]?$', '', regex=True)

In [None]:
# keep only the top 30 cities (rows are already ordered by rank)
df30 = df.head(30)
df30.count()

2020rank                   30
City                       30
State[c]                   30
2020census                 30
2010census                 30
Change                     30
2020 land area             30
2020 land area km          30
2020 population density    30
2020 pop density sq km     30
Location                   30
dtype: int64
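
Every column in the dataframe is still text at this point. If numeric values are needed later, a small helper along these lines could pull the number out of each cell. `parse_number` is a hypothetical name, and the handling of the Unicode minus sign is an assumption about how negative changes may be written on the page.

```python
import re


def parse_number(text):
    """Return the first number found in a scraped cell as a float, or None."""
    # Drop thousands separators and normalise a Unicode minus to ASCII.
    cleaned = text.replace(",", "").replace("\u2212", "-")
    match = re.search(r"[-+]?\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None


print(parse_number("1,216.0 km2"))   # 1216.0
print(parse_number("29,298/sq mi"))  # 29298.0
print(parse_number("+7.69%"))        # 7.69
```

It could be applied to a column with something like `df30['2020 land area'].map(parse_number)`.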

## Scraping for each of the 30 largest cities

In [None]:
# use the search API to search for query= '{name of city} city open data'
# return top 3 results and save as dataframe
q = "open data"
dfRes = pd.DataFrame(columns = ['CITY','RANK','URL'])
for x, city in enumerate(df30['City']):
  print(f'{x} - {city}')
  for z, i in enumerate(search((f'{city} city {q}'), tld="com", num=3, stop=3, pause=2)):
    dfRes.loc[len(dfRes.index)] = [city,z,i]

0 - New York
1 - Los Angeles
2 - Chicago
3 - Houston
4 - Phoenix
5 - Philadelphia
6 - San Antonio
7 - San Diego
8 - Dallas
9 - San Jose
10 - Austin
11 - Jacksonville
12 - Fort Worth
13 - Columbus
14 - Indianapolis
15 - Charlotte
16 - San Francisco
17 - Seattle
18 - Denver
19 - Washington
20 - Nashville
21 - Oklahoma City
22 - El Paso
23 - Boston
24 - Portlan
25 - Las Vegas
26 - Detroit
27 - Memphis
28 - Louisville
29 - Baltimore
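
The loop above appends to the dataframe one row at a time and is tied to one search library. A possible variation, sketched here under assumptions, collects plain dicts first and takes the search call as a parameter. `collect_results` and `fake_search` are hypothetical names, and the URLs returned by `fake_search` are made up for illustration.

```python
def collect_results(cities, search_fn, query="open data", top_n=3):
    """Return one dict per hit with CITY, RANK (0-based) and URL."""
    rows = []
    for city in cities:
        urls = search_fn(f"{city} city {query}")
        for rank, url in enumerate(urls):
            if rank >= top_n:
                break
            rows.append({"CITY": city, "RANK": rank, "URL": url})
    return rows


def fake_search(term):
    # Stand-in for a real search call; returns made-up URLs only.
    slug = term.split(" city ")[0].lower().replace(" ", "")
    return [f"https://example.com/{slug}/{n}" for n in range(5)]


rows = collect_results(["New York", "Chicago"], fake_search)
print(len(rows))  # 6
print(rows[0])
```

In the notebook, `search_fn` could be `lambda q: search(q, tld="com", num=3, stop=3, pause=2)`, and `pd.DataFrame(rows)` would then build the dataframe in one step.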


In [None]:
# Check resulting dataframe for search results
dfRes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 89
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   CITY    90 non-null     object
 1   RANK    90 non-null     object
 2   URL     90 non-null     object
dtypes: object(3)
memory usage: 2.8+ KB


In [None]:
# everything looks good!
dfRes.head(12)

Unnamed: 0,CITY,RANK,URL
0,New York,0,https://opendata.cityofnewyork.us/
1,New York,1,https://data.ny.gov/
2,New York,2,http://www.nyc.gov/html/data/about.html
3,Los Angeles,0,https://data.lacity.org/
4,Los Angeles,1,https://www.lacity.org/residents/popular-infor...
5,Los Angeles,2,https://data.lacounty.gov/
6,Chicago,0,https://data.cityofchicago.org/
7,Chicago,1,https://data.cityofchicago.org/browse
8,Chicago,2,https://data.cityofchicago.org/browse?category...
9,Houston,0,https://data.houstontx.gov/


In [None]:
# save to CSV file
dfRes.to_csv('MapChalData2021.csv',encoding='utf-8')

### First Phase Completed
Scraped the Wikipedia list of the largest US cities by population with Beautiful Soup, parsed it into a dataframe, cleaned extraneous text, and reduced the list from 326 cities to the top 30, one city per day of the 30-day challenge.   
Automated a search for each of the 30 cities with the 'open data' query to get the top 3 result URLs per city, loaded them into a dataframe, and exported it as CSV -> /content/drive/MyDrive/CODE-2022/30DayMap2021/MapChalData2021.csv

In [None]:
# let's read that CSV back into a dataframe
df = pd.read_csv('/content/drive/MyDrive/CODE-2022/30DayMap2021/MapChalData2021.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,CITY,RANK,URL
0,0,New York,0,https://opendata.cityofnewyork.us/
1,1,New York,1,https://data.ny.gov/
2,2,New York,2,http://www.nyc.gov/html/data/about.html
3,3,Los Angeles,0,https://data.lacity.org/
4,4,Los Angeles,1,https://www.lacity.org/residents/popular-infor...
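
The extra `Unnamed: 0` column appears because `to_csv` writes the dataframe index by default and `read_csv` then reads it back as an unnamed data column. The sketch below shows two common ways around this, using an in-memory buffer and a two-row sample instead of the real file.

```python
import io

import pandas as pd

results = pd.DataFrame({
    "CITY": ["New York", "New York"],
    "RANK": [0, 1],
    "URL": ["https://opendata.cityofnewyork.us/", "https://data.ny.gov/"],
})

# Option 1: do not write the index at all.
no_index = io.StringIO()
results.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)
print(cols_no_index)

# Option 2: the file already contains the index, so read it back as the index.
with_index = io.StringIO()
results.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index, index_col=0).columns)
print(cols_with_index)
```

Both variants leave only the `CITY`, `RANK` and `URL` columns.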


### Learning a new method for opening URLs programmatically within the notebook
Perhaps opening a page in a notebook cell could use cloud resources instead of my machine's limited memory...

In [None]:
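dfSlice = df[0:3]
for url in dfSlice['URL']:
  # pass the variable url (not the string 'url') and display explicitly,
  # since only the last expression of a cell is rendered automatically
  display(IFrame(src=url, width='100%', height='800px'))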

In [None]:
print(dfSlice['URL'])
#IFrame(src='url', width='100%', height='800px')

0         https://opendata.cityofnewyork.us/
1                       https://data.ny.gov/
2    http://www.nyc.gov/html/data/about.html
Name: URL, dtype: object


In [None]:
IFrame(src=dfSlice['URL'][0], width='100%', height='800px')
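
An `IFrame` may render as an empty box, because sites can refuse to be embedded through the `X-Frame-Options` header or a `Content-Security-Policy` `frame-ancestors` directive. The function below is a rough, conservative check on a response's headers; `may_block_iframe` is a hypothetical name and the sample headers are made up.

```python
def may_block_iframe(headers):
    """Return True if the response headers suggest the page refuses framing."""
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    if lowered.get("x-frame-options", "").strip() in ("deny", "sameorigin"):
        return True
    # Any frame-ancestors directive is treated as a possible restriction.
    return "frame-ancestors" in lowered.get("content-security-policy", "")


print(may_block_iframe({"X-Frame-Options": "SAMEORIGIN"}))  # True
print(may_block_iframe({"Content-Type": "text/html"}))      # False
```

In the notebook it could be called as `may_block_iframe(requests.head(url, timeout=10).headers)` before trying to display a page.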

In [None]:
# Site with nice pre-made lists of all US states separated by commas
# https://sceptermarketing.com/comma-separated-lists-of-us-states-abbreviations-select-options-etc/
# Open Data Network, Socrata compilation of states and regions!
# https://www.opendatanetwork.com/
# Links to all state Open Data Portals!
# http://www.harker.com/OpenData/socrata-data-portals.html