### Scraping Wikipedia

In this notebook, I will scrape a Wikipedia page using the requests and BeautifulSoup libraries.

#### Getting Started with requests library
The Python library requests retrieves information from the web.

In [1]:
# importing requests library
import requests

In [3]:
url = 'https://en.wikipedia.org/wiki/Pennsylvania'
response = requests.get(url)

In [14]:
response.status_code # success

200

In [15]:
print(response.text[:200])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Pennsylvania - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wg


In [16]:
page = response.text

In [17]:
type(page)

str

#### Using requests with BeautifulSoup

In [18]:
from bs4 import BeautifulSoup as bs

In [19]:
soup = bs(page)

In [20]:
soup.find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">Pennsylvania</h1>

In [21]:
soup.find('h1').text

'Pennsylvania'

Find the disambiguation link for Pennsylvania

In [22]:
soup.find('a')

<a id="top"></a>

In [24]:
soup.find(class_="mw-disambig")['href']

'/wiki/Pennsylvania_(disambiguation)'

In [25]:
soup.find(class_ = 'geo-dec').text

'41°N 77.5°W'

The prettify() method prints HTML with clean, nested indentation

In [26]:
print(soup.find('table').prettify())

<table class="infobox geography vcard" style="width:22em;width:23em">
 <tbody>
  <tr>
   <th colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap">
    <div class="fn org" style="display:inline">
     Pennsylvania
    </div>
   </th>
  </tr>
  <tr>
   <td colspan="2" style="text-align:center;background-color:#cddeff; font-weight:bold;">
    <div class="category">
     <a href="/wiki/U.S._state" title="U.S. state">
      State
     </a>
    </div>
   </td>
  </tr>
  <tr class="mergedtoprow">
   <td colspan="2" style="text-align:center;font-weight:bold;">
    Commonwealth of Pennsylvania
   </td>
  </tr>
  <tr class="mergedtoprow">
   <td class="maptable" colspan="2" style="text-align:center">
    <div style="display:table; width:100%; background:none;">
     <div style="display:table-row">
      <div style="display:table-cell;vertical-align:middle; text-align:center;">
       <a class="image" href="/wiki/File:Flag_of_Pennsylvania.svg"

In [27]:
first_table = soup.find('table')

In [28]:
type(first_table)

bs4.element.Tag

In [29]:
# find first table header 
first_table.find('th').text

'Pennsylvania'

In [30]:
# find first table data
first_table.find('td').text

'State'

In [31]:
# find first table row
first_table.find('tr').text

'Pennsylvania'

Chaining methods and saving data in Python variables

In [32]:
state = soup.find('table').find('th').text
state

'Pennsylvania'

In [33]:
for row in soup.find('table').find_all('tr')[:10]:
    print(row.text)    

Pennsylvania
State
Commonwealth of Pennsylvania

FlagSeal
Nickname(s): Keystone State;[1] Quaker State
Motto(s): Virtue, Liberty and Independence
Anthem: "Pennsylvania"
Map of the United States with Pennsylvania highlighted
CountryUnited States
Before statehoodProvince of Pennsylvania


#### Locating information by position

In [34]:
soup.find(text='Admitted') # returns None: a plain-string search must match the whole text exactly

In [35]:
soup.find(text='Admitted to the Union')

'Admitted to the Union'

Alternatively, we could use a regular expression, which matches any text containing the pattern

In [36]:
import re
admitted_regex = re.compile('Admitted')
soup.find(text=admitted_regex)

'Admitted to the Union'

In [37]:
admitted = soup.find(text=admitted_regex)
type(admitted) # because it is a BeautifulSoup element, we can navigate the DOM

bs4.element.NavigableString

In [38]:
admitted.next

<td>December 12, 1787 (2nd)</td>

In [39]:
admitted.next.next

'December 12, 1787 (2nd)'

#### Find the Capital of Pennsylvania

In [40]:
capital_regex = "Capital"
soup.find(text=capital_regex)

'Capital'

In [41]:
capital = soup.find(text=capital_regex)
capital.next

<td><a href="/wiki/Harrisburg,_Pennsylvania" title="Harrisburg, Pennsylvania">Harrisburg</a></td>

In [42]:
capital.next.text

'Harrisburg'
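
The label-then-value lookup used above for the admission date and the capital can be tried without a network request. What follows is a minimal sketch, assuming a small hand-written HTML fragment (not the real Wikipedia markup) and the built-in html.parser. The helper name value_after_label is hypothetical. It uses find_next('td'), which jumps straight to the next td element rather than stepping with .next, and string=, which is the current name for the text= argument used in this notebook.

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for an infobox; the real page has far more markup.
html = """
<table>
  <tr><th>Capital</th>
      <td><a href="/wiki/Harrisburg,_Pennsylvania">Harrisburg</a></td></tr>
  <tr><th>Admitted to the Union</th>
      <td>December 12, 1787 (2nd)</td></tr>
</table>
"""

sample_soup = BeautifulSoup(html, "html.parser")


def value_after_label(soup, label_pattern):
    """Return the text of the first <td> after a matching label, or None."""
    label = soup.find(string=re.compile(label_pattern))
    if label is None:
        return None  # label not present on this page
    cell = label.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


capital_name = value_after_label(sample_soup, "Capital")
admitted_text = value_after_label(sample_soup, "Admitted")
missing = value_after_label(sample_soup, "Governor")
print(capital_name, "|", admitted_text, "|", missing)
```

Returning None for a missing label keeps the failure visible to the caller instead of raising an AttributeError halfway through a chain of attribute lookups.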

#### Print out the first three references (at the bottom of the page).

In [43]:
ref3 = soup.find(class_ = 'references').find_all('cite')[:3]

In [44]:
# print first 3 references
for ref in ref3:
    print(ref.text)       

"Symbols of Pennsylvania". Portal.state.pa.us. Archived from the original on October 14, 2007. Retrieved May 4, 2014.
"Elevations and Distances in the United States". United States Geological Survey. 2001. Archived from the original on October 15, 2011. Retrieved October 24, 2011.
"Median Annual Household Income". The Henry J. Kaiser Family Foundation. Archived from the original on December 20, 2016. Retrieved December 9, 2016.


In [45]:
# print the external links found in the first 3 references
for ref in ref3:
    for link in ref.find_all('a', class_ = 'external'):
        print(link['href'])

http://www.portal.state.pa.us/portal/server.pt/community/things/4280/symbols_of_pennsylvania/478690
https://web.archive.org/web/20071014215922/http://www.phmc.state.pa.us/bah/pahist/symbols.asp?secid=31
https://web.archive.org/web/20111015012701/http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
http://kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0
https://web.archive.org/web/20161220091007/http://kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0


## Data Preparation
Now that we know how to gather information from the web, what do we do with it?

This data can be

* aggregated to look for trends
* visualized to understand patterns
* leveraged with machine learning algorithms

But first we need to

* convert several strings into numerical or datetime values
* collect and store data from multiple pages (next section)

**Tip:** Most web scraping projects rely on multiple pages of information, with each page serving as one data observation. In this case, we might collect data about Pennsylvania and then collect the same kinds of information for all 50 US states before analyzing or visualizing the data.

### Data processing
#### Date Admitted
In the last section, we collected the date that Pennsylvania was admitted to the union.

In [46]:
admitted_date = admitted.next.text
admitted_date

'December 12, 1787 (2nd)'

In [47]:
# Split the string
admitted_date_list = admitted_date.split(' ')[:-1]
admitted_date_list

['December', '12,', '1787']

In [48]:
# Join the list
admitted_date_str = ' '.join(admitted_date_list)
admitted_date_str

'December 12, 1787'

Now we will convert this string into a datetime data type.

In [49]:
import dateutil.parser

In [50]:
date_admitted = dateutil.parser.parse(admitted_date_str)
date_admitted

datetime.datetime(1787, 12, 12, 0, 0)

In [51]:
type(date_admitted)

datetime.datetime

In [52]:
date_admitted.year

1787

#### Population and Area

In [53]:
soup.find(text=re.compile('Total'))

'\xa0•\xa0Total'

In [54]:
soup.find(text=re.compile('Total')).next

<td>46,055 sq mi (119,283 km<sup>2</sup>)</td>

In [55]:
# Save area text in a variable
area_text = soup.find(text=re.compile('Total')).next.text
area_text

'46,055\xa0sq\xa0mi (119,283\xa0km2)'

In [56]:
soup.find(text=re.compile('Population')).parent.parent

<tr class="mergedtoprow"><th colspan="2" style="text-align:center;text-align:left">Population<div style="font-weight:normal;display:inline;"><span class="nowrap"> </span>(2019)</div></th></tr>

In [57]:
soup.find(text=re.compile('Population')).parent.parent.next_sibling

<tr class="mergedrow"><th scope="row"> • Total</th><td>12,801,989</td></tr>

In [58]:
soup.find(text=re.compile('Population')).parent.parent.next_sibling.find('td')

<td>12,801,989</td>

In [59]:
population_text = soup.find(text=re.compile('Population')).parent.parent.next_sibling.find('td').text
population_text

'12,801,989'

#### Convert strings into integers

In [60]:
population = int(population_text.replace(',', ''))

population

12801989

In [61]:
# Create converter functions

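def to_date(date_str):
    # keep the leading run of word characters, spaces, and commas
    # (drops a trailing note such as '(2nd)')
    date_str = re.match(r'[\w\s,]+', date_str)[0]
    return dateutil.parser.parse(date_str)

def to_int(number_str):
    # keep the leading run of digits, commas, and dollar signs
    # (drops units and footnote markers such as '[4]')
    number_str = re.match(r'[\d,$]+', number_str)[0]
    number_str = number_str.replace('$', '').replace(',', '')
    return int(number_str)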

In [62]:
area = to_int(area_text)
area

46055
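
To see how these converters behave on the raw strings collected so far, here is a standard-library sketch. It is an illustration rather than the notebook's implementation: it swaps dateutil for datetime.strptime, assumes dates look like 'December 12, 1787' (English month names, default locale), and returns None instead of raising when nothing matches. The names parse_int and parse_date are hypothetical.

```python
import re
from datetime import datetime


def parse_int(text):
    """Return the leading number in text as an int, or None if there is none."""
    match = re.match(r'[\d,$]+', text)
    if match is None:
        return None
    digits = match[0].replace('$', '').replace(',', '')
    return int(digits) if digits else None


def parse_date(text):
    """Return a datetime for a leading 'Month day, year' date, or None."""
    match = re.match(r'[A-Za-z]+ \d{1,2}, \d{4}', text)
    if match is None:
        return None
    # Assumption: month names are in English and the locale is the default.
    return datetime.strptime(match[0], '%B %d, %Y')


area_value = parse_int('46,055\xa0sq\xa0mi (119,283\xa0km2)')
income_value = parse_int('$59,195[4]')
no_number = parse_int('n/a')
admitted_value = parse_date('December 12, 1787 (2nd)')
print(area_value, income_value, no_number, admitted_value)
```

The non-breaking space (\xa0) after the first number is what stops the match, so only the square-mile figure is converted.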

### Data storage
Now let's put all the information we have about Pennsylvania together.

In [63]:
penn_dict = {
    'state': state, 
    'date_admitted': date_admitted,
    'population': population,
    'area_sq_mi': area
}
penn_dict

{'state': 'Pennsylvania',
 'date_admitted': datetime.datetime(1787, 12, 12, 0, 0),
 'population': 12801989,
 'area_sq_mi': 46055}

Once we have this information in dictionary form, we can build a pandas dataframe with it and eventually perform further analyses or save it to our computer.

In [64]:
import pandas as pd

In [65]:
penn_info = [penn_dict]

In [66]:
penn_df = pd.DataFrame(penn_info)
penn_df

| | state | date_admitted | population | area_sq_mi |
|---|---|---|---|---|
| 0 | Pennsylvania | 1787-12-12 | 12801989 | 46055 |


In [67]:
# saving to a csv
penn_df.to_csv('Penn_State_Information.csv')

In [68]:
# find median household income
mhi_text = soup.find(text='Median household income').next.next.text
mhi_text

'$59,195[4]'

In [69]:
mhi = to_int(mhi_text)
mhi

59195

In [70]:
# add mhi to state dict
penn_dict['median_household_income'] = mhi
penn_dict

{'state': 'Pennsylvania',
 'date_admitted': datetime.datetime(1787, 12, 12, 0, 0),
 'population': 12801989,
 'area_sq_mi': 46055,
 'median_household_income': 59195}

In [71]:
# recreate the dataframe
state_df = pd.DataFrame([penn_dict])
state_df

| | state | date_admitted | population | area_sq_mi | median_household_income |
|---|---|---|---|---|---|
| 0 | Pennsylvania | 1787-12-12 | 12801989 | 46055 | 59195 |


### Pipeline Considerations
Now that we can extract numerical data from this page about Pennsylvania, how would we build out a full analytic or data science project?

The next step is to systematically retrieve this information from the Wikipedia page of each US state. First, let's build reusable functions to find the state's

* name
* date admitted
* population
* area
* median household income

Note: all of this info can be found in the table on the right side of the page.

In [72]:
def get_name(table):
    raw_name = table.find('th').text
    # [A-Za-z] matches letters only; the range A-z would also match [ \ ] ^ _ and `
    return re.match(r'[A-Za-z\s]+', raw_name)[0]

def get_date_admitted(table):
    raw_date = table.find(text='Admitted to the Union').next.text
    return to_date(raw_date)

def get_population(table):
    raw_population = table.find(text='Population')\
        .parent.parent.next_sibling.find('td').text
    return to_int(raw_population)

def get_area(table):
    raw_area = table.find(text=re.compile('Total')).next.text
    return to_int(raw_area)

def get_income(table):
    raw_income = table.find(text='Median household income').next.next.text
    return to_int(raw_income)

These functions should extract information from most Wikipedia state tables we pass into them, as long as the table uses the same labels as the Pennsylvania page. For example, let's try parsing the page for New York.

In [73]:
ny_url = 'https://en.wikipedia.org/wiki/New_York_(state)'

In [74]:
ny_page = requests.get(ny_url).text
ny_soup = bs(ny_page)

In [75]:
ny_table = ny_soup.find('table')

In [76]:
get_name(ny_table)

'New York'

In [77]:
get_population(ny_table)

19453561

In [78]:
get_area(ny_table)

54555

In [79]:
get_date_admitted(ny_table)

datetime.datetime(1788, 7, 26, 0, 0)

In [80]:
get_income(ny_table)

64894

Let's also make a function to gather all five values from a given state Wiki page and return the information as a dictionary.

In [81]:
def parse_url(url):
    page = requests.get(url).text
    return bs(page)

def get_state_info(state_url):
    
    #parse url with above function and get table
    state_soup = parse_url(state_url)
    state_table = state_soup.find('table')
    
    state_info = {}
    
    #get info with pre-defined functions
    state_info['state'] = get_name(state_table)
    state_info['date_admitted'] = get_date_admitted(state_table)
    state_info['population'] = get_population(state_table)
    state_info['area'] = get_area(state_table)
    state_info['median_household_income'] = get_income(state_table)
    
    return state_info

In [82]:
ny_info = get_state_info(ny_url)
ny_info

{'state': 'New York',
 'date_admitted': datetime.datetime(1788, 7, 26, 0, 0),
 'population': 19453561,
 'area': 54555,
 'median_household_income': 64894}

### Lists of links
The next step in our process will require us to use our get_state_info() function on the URLs of each of the 50 US states. But how do we know which URLs to visit? We might be able to guess that the page for Rhode Island is https://en.wikipedia.org/wiki/Rhode_Island but not all pages follow this convention.

Instead of guessing, let's first gather these links from the ["List of states and territories of the United States" article](https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States).

Click on this link and inspect the page to develop a plan for doing this.

It looks like all of the states are listed in a single table on the page, which is the first table element that BeautifulSoup finds (index 0 in the code below). Each state name and link is contained within a table header tag (th) that has the additional attribute scope="row".

In [83]:
list_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
list_page = requests.get(list_url).text
list_soup = bs(list_page)

In [84]:
state_rows = list_soup.find_all('table')[0].find_all('th', scope='row')
state_rows[:5]

[<th scope="row"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="400" data-file-width="600" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Alabama.svg/23px-Flag_of_Alabama.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Alabama.svg/35px-Flag_of_Alabama.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Alabama.svg/45px-Flag_of_Alabama.svg.png 2x" width="23"/> </span><a href="/wiki/Alabama" title="Alabama">Alabama</a>
 </th>,
 <th scope="row"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="1000" data-file-width="1416" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Flag_of_Alaska.svg/21px-Flag_of_Alaska.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Flag_of_Alaska.svg/33px-Flag_of_Alaska.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Flag_of_Alaska.svg/43px-Flag_

In [85]:
state_rows[0].find('a')

<a href="/wiki/Alabama" title="Alabama">Alabama</a>

In [86]:
state_rows[0].find('a')['href']

'/wiki/Alabama'

In [87]:
state_links = [row.find('a')['href'] for row in state_rows]
state_links[:5]

['/wiki/Alabama',
 '/wiki/Alaska',
 '/wiki/Arizona',
 '/wiki/Arkansas',
 '/wiki/California']
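
The list comprehension above assumes every th with scope="row" contains a link. Here is a hedged sketch of a more defensive variant, run on a hand-written table (an assumption, not the real article markup) so it works offline. Rows without an a tag are skipped instead of raising a TypeError.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the list-of-states table.
html = """
<table>
  <tr><th scope="col">State</th><th scope="col">Code</th></tr>
  <tr><th scope="row"><span class="flagicon"></span><a href="/wiki/Alabama" title="Alabama">Alabama</a></th><td>AL</td></tr>
  <tr><th scope="row"><a href="/wiki/Alaska" title="Alaska">Alaska</a></th><td>AK</td></tr>
  <tr><th scope="row">Row without a link</th><td></td></tr>
</table>
"""

list_soup_sample = BeautifulSoup(html, "html.parser")

sample_links = []
for header in list_soup_sample.find_all("th", scope="row"):
    anchor = header.find("a", href=True)  # None when the cell has no link
    if anchor is not None:
        sample_links.append(anchor["href"])

print(sample_links)
```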

Each of these links points to a place within Wikipedia, but if we want the full URLs, we have to prepend 'https://en.wikipedia.org' to each of them.

In [88]:
base_url = 'https://en.wikipedia.org'
state_urls = [base_url + link for link in state_links]
state_urls[:5]

['https://en.wikipedia.org/wiki/Alabama',
 'https://en.wikipedia.org/wiki/Alaska',
 'https://en.wikipedia.org/wiki/Arizona',
 'https://en.wikipedia.org/wiki/Arkansas',
 'https://en.wikipedia.org/wiki/California']

In [89]:
state_urls[-5:]

['https://en.wikipedia.org/wiki/Virginia',
 'https://en.wikipedia.org/wiki/Washington_(state)',
 'https://en.wikipedia.org/wiki/West_Virginia',
 'https://en.wikipedia.org/wiki/Wisconsin',
 'https://en.wikipedia.org/wiki/Wyoming']

In [90]:
len(state_urls)

50
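
String concatenation works here because every link is a site-relative path. As an alternative sketch, urljoin from the standard library also copes with links that are already absolute or protocol-relative. The example links below are illustrative, and the example.org address is a placeholder rather than anything scraped from the page.

```python
from urllib.parse import urljoin

wiki_base = 'https://en.wikipedia.org'

example_links = [
    '/wiki/Alabama',                      # site-relative path
    '/wiki/Washington_(state)',           # parentheses are kept as-is
    '//upload.wikimedia.org/a.png',       # protocol-relative link
    'https://example.org/elsewhere',      # already absolute (placeholder)
]

# urljoin resolves each link against the base, like a browser would.
full_urls = [urljoin(wiki_base, link) for link in example_links]
for full_url in full_urls:
    print(full_url)
```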

### Handling missing values
We will eventually be cycling through these state links to collect and store information about every state. But what happens when certain information is unavailable? That is, what if the Georgia page is missing area information or the median household income isn't listed for Nevada?

We can make our code more robust by including instructions for handling missing information. One way to do this is to include try/except statements.

In [91]:
def get_state_info_robust(state_url):
    # if the page can't be fetched or parsed, print the url and exit the function
    try:
        state_soup = parse_url(state_url)
        state_table = state_soup.find('table')
    except Exception:
        print(f"Cannot parse table: {state_url}")
        return None

    state_info = {}

    # grab info with pre-defined functions
    # if any values can't be found, fill with None
    values = ['state', 'date_admitted', 'population', 'area_sq_mi', 'median_household_income']
    functions = [get_name, get_date_admitted, get_population, get_area, get_income]

    for val, func in zip(values, functions):
        try:
            state_info[val] = func(state_table)
        except Exception:
            state_info[val] = None

    return state_info

In [92]:
ny_dict = get_state_info_robust(ny_url)
ny_dict

{'state': 'New York',
 'date_admitted': datetime.datetime(1788, 7, 26, 0, 0),
 'population': 19453561,
 'area_sq_mi': 54555,
 'median_household_income': 64894}

In [93]:
get_state_info_robust('https://en.wikipedia.org/wiki/Python_Conference')

{'state': 'Year\n',
 'date_admitted': None,
 'population': None,
 'area_sq_mi': None,
 'median_household_income': None}

In [94]:
get_state_info_robust('https://notawebsiteatleastihopenot.net')

Cannot parse table: https://notawebsiteatleastihopenot.net
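
Filling with None hides why a value is missing. As a possible refinement, the sketch below catches only the exception types these lookups typically raise (AttributeError when a search returns None, TypeError when a regex does not match, ValueError from int(), KeyError from a missing attribute) and records the error name per field. The function extract_fields and the plain-dict source are hypothetical stand-ins, used so the example runs without fetching a page.

```python
def extract_fields(source, extractors):
    """Apply each extractor to source; return (record, errors)."""
    record = {}
    errors = {}
    for field, func in extractors.items():
        try:
            record[field] = func(source)
        except (AttributeError, TypeError, KeyError, ValueError) as exc:
            record[field] = None                 # same fill value as above
            errors[field] = type(exc).__name__   # remember what went wrong
    return record, errors


# A plain dict stands in for a parsed table in this illustration.
fake_table = {'name': 'Pennsylvania'}
fake_extractors = {
    'state': lambda t: t['name'],
    'population': lambda t: int(t['population']),   # KeyError: key is absent
    'area_sq_mi': lambda t: t.get('area').strip(),  # AttributeError: None.strip
}

record, errors = extract_fields(fake_table, fake_extractors)
print(record)
print(errors)
```

Any other exception type still propagates, which helps surface genuine bugs instead of silently turning them into None.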


### Adding pauses
We have just one final consideration before we cycle through the state links to scrape information. Scraping at a fast rate (that is, many pages per second) is frowned upon by many websites, Wikipedia included. We will add artificial pauses so we don't overwhelm the Wikipedia server.

In [95]:
import time

In [96]:
time.sleep(3)

In [97]:
a = 5

print(f"Pausing for {a} seconds")
time.sleep(a)

b = a + 1
print(f"b equals {b}")

Pausing for 5 seconds
b equals 6


To scrape websites responsibly, you should know what the site's rate limit is and respect it! Many sites publish their crawling rules in a robots.txt file, and some state a crawl delay there. More on this later.

Wikipedia [asks for at least a one-second pause per page request](https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler). We will pause 1 second between each page scrape.
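
Python's standard library can read robots.txt rules. The following is a minimal sketch, assuming a made-up three-line robots.txt for example.org (these are not Wikipedia's actual rules). For a live site, RobotFileParser's set_url() and read() methods would load the real file instead of the hard-coded lines.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written out here so the example runs offline.
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

robots = RobotFileParser()
robots.parse(robots_lines)

allowed = robots.can_fetch("*", "https://example.org/wiki/Pennsylvania")
blocked = robots.can_fetch("*", "https://example.org/private/data")

# crawl_delay returns None when the file does not state one,
# so fall back to a one-second pause in that case.
delay = robots.crawl_delay("*")
pause_seconds = delay if delay is not None else 1

print(allowed, blocked, pause_seconds)
```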

### Data storage revisited
We now have a function to extract information for each state as a dictionary. We can convert this information into a pandas dataframe and store it to an Excel or .csv file if we pass in a list of dictionaries, all with the same keys.

In [98]:
pd.DataFrame([penn_dict, ny_dict])

| | state | date_admitted | population | area_sq_mi | median_household_income |
|---|---|---|---|---|---|
| 0 | Pennsylvania | 1787-12-12 | 12801989 | 46055 | 59195 |
| 1 | New York | 1788-07-26 | 19453561 | 54555 | 64894 |


Now let's build out our full pipeline:

1. Gather a list of links to each state. (DONE)
2. For each state link, gather state information as a dictionary.
3. Append each state dictionary to a list.
4. Convert list of dictionaries to dataframe.
5. Save dataframe as a .csv or an Excel file.

In [106]:
state_info_list = []

for link in state_urls:
    # step 2
    state_info = get_state_info_robust(link)
    
    # step 3
    if state_info:
        state_info_list.append(state_info)
        
    # pause
    time.sleep(1)
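
The loop above silently drops any URL that returns None. One possible variation, sketched below, keeps a list of the failed URLs for a later retry and pauses only between requests (not after the last one). The function collect and its fetch_info and sleep parameters are hypothetical. The demonstration passes in a fake fetcher and a fake sleep so it runs instantly and offline. In the notebook, get_state_info_robust and state_urls would take their place.

```python
import time


def collect(urls, fetch_info, pause_seconds=1.0, sleep=time.sleep):
    """Fetch every URL in turn, pausing between requests.

    Returns (results, failed), where failed lists the URLs that gave nothing.
    """
    results = []
    failed = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(pause_seconds)  # pause before every request except the first
        info = fetch_info(url)
        if info:
            results.append(info)
        else:
            failed.append(url)
    return results, failed


# Offline demonstration with stand-ins for the real fetcher and time.sleep.
recorded_pauses = []


def fake_fetch(url):
    return None if url.endswith('missing') else {'url': url}


demo_results, demo_failed = collect(
    ['page-a', 'page-missing', 'page-c'],
    fake_fetch,
    pause_seconds=1.0,
    sleep=recorded_pauses.append,
)
print(demo_results, demo_failed, recorded_pauses)
```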

In [144]:
state_info_list[:5]

[{'state': 'Alabama',
  'date_admitted': datetime.datetime(1819, 12, 14, 0, 0),
  'population': 4903185,
  'area_sq_mi': 52419,
  'median_household_income': 48123},
 {'state': 'AlaskaAlax',
  'date_admitted': datetime.datetime(1959, 1, 3, 0, 0),
  'population': 710249,
  'area_sq_mi': 663268,
  'median_household_income': 73181},
 {'state': 'Arizona',
  'date_admitted': datetime.datetime(1912, 2, 14, 0, 0),
  'population': 7278717,
  'area_sq_mi': 113990,
  'median_household_income': 56581},
 {'state': 'Arkansas',
  'date_admitted': datetime.datetime(1836, 6, 15, 0, 0),
  'population': 3017804,
  'area_sq_mi': 53179,
  'median_household_income': 45869},
 {'state': 'California',
  'date_admitted': datetime.datetime(1850, 9, 9, 0, 0),
  'population': 39512223,
  'area_sq_mi': 163696,
  'median_household_income': 71228}]

In [139]:
# step 4
state_data = pd.DataFrame(state_info_list)
state_data

| | state | date_admitted | population | area_sq_mi | median_household_income |
|---|---|---|---|---|---|
| 0 | Alabama | 1819-12-14 | 4903185 | 52419 | 48123 |
| 1 | AlaskaAlax | 1959-01-03 | 710249 | 663268 | 73181 |
| 2 | Arizona | 1912-02-14 | 7278717 | 113990 | 56581 |
| 3 | Arkansas | 1836-06-15 | 3017804 | 53179 | 45869 |
| 4 | California | 1850-09-09 | 39512223 | 163696 | 71228 |
| 5 | Colorado | 1876-08-01 | 5758736 | 104094 | 69117 |
| 6 | Connecticut | 1788-01-09 | 3565287 | 5567 | 76106 |
| 7 | Delaware | 1787-12-07 | 982895 | 1982 | 62852 |
| 8 | Florida | 1845-03-03 | 21477737 | 65757 | 53267 |
| 9 | Georgia | 1788-01-02 | 10617423 | 59425 | 56183 |


In [141]:
# the date_admitted value for Kansas could not be parsed, so it is missing (NaT)
state_data.loc[(state_data.state == 'Kansas'),'date_admitted']

15   NaT
Name: date_admitted, dtype: datetime64[ns]

In [142]:
# fill in the correct admission date in the pandas dataframe
state_data.loc[(state_data.state == 'Kansas'),'date_admitted'] = to_date('January 29, 1861')
state_data.loc[(state_data.state == 'Kansas'),'date_admitted']

15   1861-01-29
Name: date_admitted, dtype: datetime64[ns]

In [149]:
# Alaska's name is incorrect
state_data.iloc[1].state

'AlaskaAlax'

In [None]:
# correcting state name for Alaska
state_data.loc[(state_data.state == 'AlaskaAlax'),'state'] = 'Alaska'

In [151]:
state_data.iloc[1].state

'Alaska'

In [152]:
state_data

| | state | date_admitted | population | area_sq_mi | median_household_income |
|---|---|---|---|---|---|
| 0 | Alabama | 1819-12-14 | 4903185 | 52419 | 48123 |
| 1 | Alaska | 1959-01-03 | 710249 | 663268 | 73181 |
| 2 | Arizona | 1912-02-14 | 7278717 | 113990 | 56581 |
| 3 | Arkansas | 1836-06-15 | 3017804 | 53179 | 45869 |
| 4 | California | 1850-09-09 | 39512223 | 163696 | 71228 |
| 5 | Colorado | 1876-08-01 | 5758736 | 104094 | 69117 |
| 6 | Connecticut | 1788-01-09 | 3565287 | 5567 | 76106 |
| 7 | Delaware | 1787-12-07 | 982895 | 1982 | 62852 |
| 8 | Florida | 1845-03-03 | 21477737 | 65757 | 53267 |
| 9 | Georgia | 1788-01-02 | 10617423 | 59425 | 56183 |


In [153]:
# step 5
state_data.to_csv('state_data.csv', index=False)
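
A CSV file stores everything as text, so dates and integers need to be converted again when the file is read back. The sketch below illustrates that round trip with the standard-library csv module and an in-memory buffer, as a stand-in for the pandas call above. It reuses the Pennsylvania and New York values from this notebook and is an illustration under those assumptions, not a replacement for the pandas workflow.

```python
import csv
import io
from datetime import datetime

rows = [
    {"state": "Pennsylvania", "date_admitted": datetime(1787, 12, 12), "population": 12801989},
    {"state": "New York", "date_admitted": datetime(1788, 7, 26), "population": 19453561},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["state", "date_admitted", "population"])
writer.writeheader()
for row in rows:
    # Store the date as ISO text, for example 1787-12-12.
    writer.writerow({**row, "date_admitted": row["date_admitted"].date().isoformat()})

csv_text = buffer.getvalue()

# Read the text back; every value arrives as a string and must be converted.
restored = []
for raw in csv.DictReader(io.StringIO(csv_text)):
    restored.append({
        "state": raw["state"],
        "date_admitted": datetime.strptime(raw["date_admitted"], "%Y-%m-%d"),
        "population": int(raw["population"]),
    })

print(csv_text)
```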