# Python Scraping Demo

A notebook showing how to scrape table-based data from a city website.
The page we're looking at is the City of Wilmington's Civic and Neighborhood Organizations listing.
We're using an HTML copy of the page to avoid sending unnecessary traffic to the city's web servers.
Here's [the copied page](https://davidginzberg.github.io/web-scraping-with-python/practice-sites/Wilmington-Civic-Associations.html), and the original can be found on the [City of Wilmington's website](https://www.wilmingtonde.gov/government/city-offices/constituent-services/civic-and-neighborhood-organizations).

Let's start by importing all the modules we'll be using. This should normally produce no output. On your own system you might need to `pip install requests bs4` before this works properly. On mybinder.org this happens automatically because of the `environment.yml` file in the GitHub repository.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
target_url = "https://davidginzberg.github.io/web-scraping-with-python/practice-sites/Wilmington-Civic-Associations.html"

In [None]:
response = requests.get(target_url)
if response.status_code != 200:  # compare by value; `is` tests object identity
    print("Response code was not 200. Exiting")
    exit(response.status_code)
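
As an alternative to checking the status code by hand, `requests` can raise an exception for 4xx and 5xx responses through `Response.raise_for_status()`. The sketch below wraps that in a small helper. The name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes.

    Hypothetical helper: the name and the default timeout are
    illustrative choices, not taken from the notebook.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(target_url)` should return the same bytes as `response.content`. In a notebook, an uncaught exception stops the current cell but leaves the kernel running.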

In [None]:
print(response.status_code)

In [None]:
webpage_soup = BeautifulSoup(response.content, "html.parser")
print(webpage_soup)

That's the whole page, and it's a lot! Let's filter it down to just the tables we saw online.

In [None]:
tables = webpage_soup('table')
print(tables)
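
Calling a `BeautifulSoup` object like a function, as in `webpage_soup('table')`, is shorthand for `find_all('table')`. The sketch below checks that equivalence on a small inline document. The HTML snippet and the names `sample_html` and `sample_soup` are made up for illustration and are not taken from the Wilmington page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, used only to illustrate the two call styles.
sample_html = """
<table><thead><tr><th>First Association</th></tr></thead>
<tbody><tr><td>President</td><td>Jane Doe</td></tr></tbody></table>
<table><tbody><tr><td>Phone</td><td>555-0100</td></tr></tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

shorthand = sample_soup("table")          # calling the soup object
explicit = sample_soup.find_all("table")  # the equivalent explicit call

print(len(shorthand), len(explicit))
print(list(shorthand) == list(explicit))
```

Either form works; the explicit `find_all` call is arguably easier to read for someone new to Beautiful Soup.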

In [None]:
def list_table_headers(table_list):
    headers = list()
    for table in table_list:
        header = table.find("thead")
        if header is not None:  # Tables without headers cause problems without this check
            headers.append(header.get_text(strip=True))
    return headers

In [None]:
print("Found tables with the following headers:")

for th in list_table_headers(tables):
    print(th)

In [None]:
def get_header_from_table(table):
    header = table.find("thead")
    if header is not None:
        return header.get_text(strip=True)
    else:
        return "<No table header>"

def list_of_fields_from_rows(row_list):
    field_list = list()
    for row in row_list:
        cells = row.find_all('td')
        # This assumes a 2-column table. Really only designed for the page we're working on
        if len(cells) == 2:
            field_list.append(
                (cells[0].get_text(), cells[1].get_text())
            )
    return field_list

def build_table_dicts(table_list):
    table_dicts = list()
    for table in table_list:
        t_dict = dict()
        # From each table we want the header and all the fields (and their values)
        t_dict["title"] = get_header_from_table(table)
        # Pass the table's rows (not the table itself) to collect the fields
        t_dict["fields"] = list_of_fields_from_rows(table.find_all('tr'))
        # Add the dictionary of the current table to the list
        table_dicts.append(t_dict)
    return table_dicts
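
For reference, `build_table_dicts` is meant to produce a list with one dictionary per table, each holding a `title` string and a `fields` list of `(label, value)` tuples. The sketch below uses hand-written placeholder data in that shape (not real scraped values) and flattens it to CSV with the standard library. The function name `table_dicts_to_csv` is hypothetical.

```python
import csv
import io

# Placeholder data in the shape build_table_dicts is meant to return.
# These names and numbers are invented for illustration.
example_table_dicts = [
    {"title": "Example Civic Association",
     "fields": [("President", "Jane Doe"), ("Phone", "555-0100")]},
    {"title": "<No table header>",
     "fields": [("Meeting", "First Monday")]},
]


def table_dicts_to_csv(table_dicts):
    """Flatten the list of table dictionaries into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "field", "value"])
    for t_dict in table_dicts:
        # One CSV row per (label, value) pair, repeating the table title.
        for field, value in t_dict["fields"]:
            writer.writerow([t_dict["title"], field, value])
    return buffer.getvalue()


csv_text = table_dicts_to_csv(example_table_dicts)
print(csv_text)
```

Repeating the title on every row keeps the output flat, which tends to be easier to load into a spreadsheet than nested data.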

Want to see the object types you're dealing with in Python? Try this next snippet. You don't need this to be able to scrape a page, but I found it useful during debugging.

In [None]:
for table in tables:
    print(f"Table of type: {type(table)}")
    for row in table.find_all('tr'):
        print(f"Row of type: {type(row)}")
