### Scraping Wikipedia

In this notebook, I scrape a single Wikipedia page from the web.
I use two libraries: requests to download the page and BeautifulSoup to parse it.

#### Getting Started with the requests library
The Python library requests sends HTTP requests and returns the server's response, including the page's HTML.

In [1]:
# importing requests library
import requests

In [2]:
url = 'https://en.wikipedia.org/wiki/Pennsylvania'
response = requests.get(url)

In [3]:
response.status_code # success

200
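
A status code of 200 means the request succeeded. As a minimal sketch, the download can be wrapped in a function so that a failed request raises an error instead of quietly returning an error page. The helper name `fetch_html`, the timeout value, and the User-Agent string are my own assumptions and not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as a string (hypothetical helper)."""
    # A descriptive User-Agent is polite when scraping; this value is a placeholder.
    headers = {"User-Agent": "scraping-notebook-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of checking by hand.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the URL above should give the same string as `response.text`.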

In [5]:
print(response.text[:200])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Pennsylvania - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wg


In [6]:
page = response.text

In [7]:
type(page)

str

#### Using requests with BeautifulSoup

In [8]:
from bs4 import BeautifulSoup as bs

In [9]:
soup = bs(page, 'html.parser')  # name the parser explicitly to avoid a warning

In [10]:
soup.find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">Pennsylvania</h1>

In [11]:
soup.find('h1').text

'Pennsylvania'
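
To make the difference between `find` and `find_all` concrete, here is a small sketch that runs on a hand-written HTML string rather than the live page. The snippet and the variable names (`html`, `mini_soup`) are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document standing in for the downloaded page.
html = """
<html><body>
  <h1 id="firstHeading">Pennsylvania</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

mini_soup = BeautifulSoup(html, "html.parser")

heading = mini_soup.find("h1")        # first match, or None if there is none
paragraphs = mini_soup.find_all("p")  # every match, in document order

print(heading.text)                   # Pennsylvania
print([p.text for p in paragraphs])   # ['First paragraph.', 'Second paragraph.']
print(mini_soup.find("h2"))           # None
```

Because `find` returns None when nothing matches, chaining `.text` onto a failed search raises an AttributeError, so it is worth checking the result first on pages you have not inspected.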

Find the disambiguation link for Pennsylvania

In [13]:
soup.find('a')

<a id="top"></a>

In [14]:
# link.href would search for a child tag named <href>; use .get('href') to read the attribute
a_tags = [link.get('href') for link in soup.find_all('a')]
a_tags

In [23]:
soup.find(class_="mw-disambig")['href']

'/wiki/Pennsylvania_(disambiguation)'
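
Attributes are read with dictionary-style access on a tag. The sketch below, again on a made-up HTML string, shows two ways to collect every `href` while skipping anchors that have none (such as `<a id="top">`).

```python
from bs4 import BeautifulSoup

html = """
<a id="top"></a>
<a class="mw-disambig" href="/wiki/Pennsylvania_(disambiguation)">Pennsylvania (disambiguation)</a>
<a href="/wiki/Harrisburg,_Pennsylvania">Harrisburg</a>
"""
mini_soup = BeautifulSoup(html, "html.parser")

# link.get('href') returns None when the attribute is missing,
# whereas link['href'] would raise KeyError; drop the missing ones.
hrefs = [link.get("href") for link in mini_soup.find_all("a") if link.get("href")]

# href=True restricts the search to tags that carry the attribute at all.
hrefs_alt = [link["href"] for link in mini_soup.find_all("a", href=True)]

print(hrefs)
# ['/wiki/Pennsylvania_(disambiguation)', '/wiki/Harrisburg,_Pennsylvania']
```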

In [25]:
soup.find(class_ = 'geo-dec').text

'41°N 77.5°W'

The prettify() method returns the HTML as a neatly indented string

In [None]:
print(soup.find('table').prettify())

In [30]:
first_table = soup.find('table')

In [31]:
type(first_table)

bs4.element.Tag

In [32]:
# find first table header 
first_table.find('th').text

'Pennsylvania'

In [33]:
# find first table data
first_table.find('td').text

'State'

In [34]:
# find first table row
first_table.find('tr').text

'Pennsylvania'

Chaining methods and saving data in Python variables

In [35]:
state = soup.find('table').find('th').text
state

'Pennsylvania'

In [36]:
for row in soup.find('table').find_all('tr')[:10]:
    print(row.text)    

Pennsylvania
State
Commonwealth of Pennsylvania

FlagSeal
Nickname(s): Keystone State;[1] Quaker State
Motto(s): Virtue, Liberty and Independence
Anthem: "Pennsylvania"
Map of the United States with Pennsylvania highlighted
CountryUnited States
Before statehoodProvince of Pennsylvania
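
Printing `row.text` runs the label and the value together (for example `CountryUnited States`). One way to keep them apart, sketched here on a simplified table that I wrote by hand, is to read the `th` and `td` of each row separately and store them in a dictionary.

```python
from bs4 import BeautifulSoup

html = """
<table class="infobox">
  <tr><th colspan="2">Pennsylvania</th></tr>
  <tr><th>Country</th><td>United States</td></tr>
  <tr><th>Before statehood</th><td>Province of Pennsylvania</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

facts = {}
for row in table.find_all("tr"):
    header, cell = row.find("th"), row.find("td")
    # Skip rows that lack either a label or a value (titles, images, ...).
    if header is None or cell is None:
        continue
    facts[header.get_text(strip=True)] = cell.get_text(strip=True)

print(facts)
# {'Country': 'United States', 'Before statehood': 'Province of Pennsylvania'}
```

The real infobox has more irregular rows than this sample, so treat the loop as a starting point rather than a complete parser.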


#### Locating information by position

In [39]:
soup.find(text='Admitted') # returns None: a plain-string search must match the whole text exactly

In [40]:
soup.find(text='Admitted to the Union')

'Admitted to the Union'

Alternatively, we can use a regular expression

In [41]:
import re
admitted_regex = re.compile('Admitted')
soup.find(text=admitted_regex)

'Admitted to the Union'

In [42]:
admitted = soup.find(text=admitted_regex)
type(admitted) # because it is a BeautifulSoup element, we can navigate the DOM

bs4.element.NavigableString

In [44]:
admitted.next

<td>December 12, 1787 (2nd)</td>

In [45]:
admitted.next.next

'December 12, 1787 (2nd)'
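
Counting `.next` hops depends on exactly how the page is nested. As a hedged alternative, `find_next` walks forward from the label until it reaches the requested tag. The sketch below uses a one-row table I wrote for illustration, and the `string=` argument, which is the current name for `text=` in recent BeautifulSoup versions.

```python
import re
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Admitted to the Union</th><td>December 12, 1787 (2nd)</td></tr>
</table>
"""
mini_soup = BeautifulSoup(html, "html.parser")

# A compiled regex matches any string that contains the pattern.
label = mini_soup.find(string=re.compile("Admitted"))

# find_next('td') moves forward through the document until it meets a <td>,
# no matter how many elements sit in between.
value_cell = label.find_next("td")

print(label)            # Admitted to the Union
print(value_cell.text)  # December 12, 1787 (2nd)
```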

#### Find the Capital of Pennsylvania

In [49]:
capital_regex = "Capital"  # a plain string (not a compiled regex), so the match is exact
soup.find(text=capital_regex)

'Capital'

In [50]:
capital = soup.find(text=capital_regex)
capital.next

<td><a href="/wiki/Harrisburg,_Pennsylvania" title="Harrisburg, Pennsylvania">Harrisburg</a></td>

In [51]:
capital.next.text

'Harrisburg'
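
The admission date and the capital are found the same way: locate a label, then read the next cell. A small helper can capture that pattern. The name `infobox_value` is hypothetical, and the sketch assumes the label text matches exactly.

```python
from bs4 import BeautifulSoup


def infobox_value(soup, label):
    """Return the text of the first <td> after `label`, or None if not found.

    Hypothetical helper: `label` must equal the label text exactly.
    """
    node = soup.find(string=label)
    if node is None:
        return None
    cell = node.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


html = """
<table>
  <tr><th>Capital</th><td><a href="/wiki/Harrisburg,_Pennsylvania">Harrisburg</a></td></tr>
</table>
"""
mini_soup = BeautifulSoup(html, "html.parser")

print(infobox_value(mini_soup, "Capital"))       # Harrisburg
print(infobox_value(mini_soup, "Largest city"))  # None
```

Returning None for a missing label keeps a multi-page scrape running when one page lacks a field.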

#### Print out the first three references (at the bottom of the page).

In [108]:
ref3 = soup.find(class_ = 'references').find_all('cite')[:3]

In [109]:
# print first 3 references
for ref in ref3:
    print(ref.text)       

"Symbols of Pennsylvania". Portal.state.pa.us. Archived from the original on October 14, 2007. Retrieved May 4, 2014.
"Elevations and Distances in the United States". United States Geological Survey. 2001. Archived from the original on October 15, 2011. Retrieved October 24, 2011.
"Median Annual Household Income". The Henry J. Kaiser Family Foundation. Archived from the original on December 20, 2016. Retrieved December 9, 2016.


In [111]:
# print every external link found in the first 3 references
for ref in ref3:
    for link in ref.find_all('a', class_ = 'external'):
        print(link['href'])

http://www.portal.state.pa.us/portal/server.pt/community/things/4280/symbols_of_pennsylvania/478690
https://web.archive.org/web/20071014215922/http://www.phmc.state.pa.us/bah/pahist/symbols.asp?secid=31
https://web.archive.org/web/20111015012701/http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
http://kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0
https://web.archive.org/web/20161220091007/http://kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0
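
Each reference can carry more than one link (an original and an archived copy), which is why three references produced six URLs. A sketch that keeps each citation together with its own links, using a made-up reference list and placeholder example.org URLs:

```python
from bs4 import BeautifulSoup

html = """
<ol class="references">
  <li><cite>"Symbols of Pennsylvania".
    <a class="external text" href="http://example.org/original">Original</a>
    <a class="external text" href="http://example.org/archived">Archived</a>
  </cite></li>
  <li><cite>"A source without links".</cite></li>
</ol>
"""
mini_soup = BeautifulSoup(html, "html.parser")

references = []
for cite in mini_soup.find(class_="references").find_all("cite"):
    # class_="external" matches tags whose class list includes "external".
    links = [a["href"] for a in cite.find_all("a", class_="external")]
    references.append({"text": cite.get_text(" ", strip=True), "links": links})

for ref in references:
    print(len(ref["links"]), ref["links"])
```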


## Data Preparation
Now that we know how to gather information from the web, what do we do with it?

This data can be

* aggregated to look for trends
* visualized to understand patterns
* leveraged with machine learning algorithms

But first we need to

* convert several strings into numerical or datetime values
* collect and store data from multiple pages (next section)

**Tip:** Most web scraping projects rely on multiple pages of information, each of which serves as one data observation. In this case, we might collect data about Pennsylvania and then collect the same kinds of information for all 50 U.S. states before analyzing or visualizing the data.
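
A rough sketch of that multi-page idea follows. The function names (`state_url`, `collect_states`, `fake_fetch`) are hypothetical, and the downloader is passed in as an argument so the example runs without network access. In a real run, `fetch` would download the page. Note that some state articles use a different title (for example "Georgia (U.S. state)"), so the URL pattern is an assumption that needs checking per state.

```python
from bs4 import BeautifulSoup


def state_url(state):
    """Build a Wikipedia URL from a state name (assumes the /wiki/Name pattern)."""
    return "https://en.wikipedia.org/wiki/" + state.replace(" ", "_")


def collect_states(states, fetch):
    """Return one dict per state; `fetch` is any callable mapping a URL to HTML."""
    rows = []
    for state in states:
        soup = BeautifulSoup(fetch(state_url(state)), "html.parser")
        heading = soup.find("h1")
        rows.append({
            "state": state,
            "title": heading.get_text(strip=True) if heading else None,
        })
    return rows


# Stand-in for a real downloader so the sketch runs offline.
def fake_fetch(url):
    name = url.rsplit("/", 1)[-1].replace("_", " ")
    return "<html><body><h1>" + name + "</h1></body></html>"


print(collect_states(["Pennsylvania", "New York"], fake_fetch))
```

Each dictionary is one observation, so the resulting list can later be turned into a table for analysis.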

### Data processing
#### Date Admitted
In the last section, we collected the date that Pennsylvania was admitted to the union.

In [117]:
admitted_date = admitted.next.text
admitted_date

'December 12, 1787 (2nd)'

In [118]:
# Split the string
admitted_date_list = admitted_date.split(' ')[:-1]
admitted_date_list

['December', '12,', '1787']

In [119]:
# Join the list
admitted_date_str = ' '.join(admitted_date_list)
admitted_date_str

'December 12, 1787'

Now we will convert this string into a datetime data type.

In [120]:
import dateutil.parser

In [121]:
date_admitted = dateutil.parser.parse(admitted_date_str)
date_admitted

datetime.datetime(1787, 12, 12, 0, 0)

In [122]:
type(date_admitted)

datetime.datetime

In [123]:
date_admitted.year

1787
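
If the date format is known in advance, the standard library can do the same conversion without dateutil. This is a sketch under two assumptions of mine: the text always starts with "Month day, year", and the month names are in English (the `%B` directive follows the current locale).

```python
import re
from datetime import datetime

raw = "December 12, 1787 (2nd)"

# Keep only the "Month day, year" part; the trailing "(2nd)" is the admission order.
match = re.match(r"[A-Za-z]+ \d{1,2}, \d{4}", raw)
date_admitted = datetime.strptime(match.group(0), "%B %d, %Y")

print(date_admitted)       # 1787-12-12 00:00:00
print(date_admitted.year)  # 1787
```

`strptime` is stricter than `dateutil.parser.parse`: it raises ValueError on any other layout, which can be useful for catching pages that format the date differently.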

#### Population and Area

In [126]:
soup.find(text=re.compile('Total'))

'\xa0•\xa0Total'

In [127]:
soup.find(text=re.compile('Total')).next

<td>46,055 sq mi (119,283 km<sup>2</sup>)</td>

In [128]:
# Save area text in a variable
area_text = soup.find(text=re.compile('Total')).next.text
area_text

'46,055\xa0sq\xa0mi (119,283\xa0km2)'
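
The area string holds two numbers separated by non-breaking spaces (`\xa0`). As a sketch, a regular expression can pull out both values at once. The pattern below is my own and assumes the "sq mi (… km" layout shown above.

```python
import re

area_text = '46,055\xa0sq\xa0mi (119,283\xa0km2)'

# In Python 3 string patterns, \s also matches non-breaking spaces.
pattern = re.compile(r"([\d,]+)\s*sq\s*mi\s*\(([\d,]+)\s*km")
match = pattern.search(area_text)

area_sq_mi = int(match.group(1).replace(",", ""))
area_sq_km = int(match.group(2).replace(",", ""))

print(area_sq_mi, area_sq_km)  # 46055 119283
```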

In [131]:
soup.find(text=re.compile('Population')).parent.parent

<tr class="mergedtoprow"><th colspan="2" style="text-align:center;text-align:left">Population<div style="font-weight:normal;display:inline;"><span class="nowrap"> </span>(2019)</div></th></tr>

In [133]:
soup.find(text=re.compile('Population')).parent.parent.next_sibling

<tr class="mergedrow"><th scope="row"> • Total</th><td>12,801,989</td></tr>

In [134]:
soup.find(text=re.compile('Population')).parent.parent.next_sibling.find('td')

<td>12,801,989</td>

In [138]:
population_text = soup.find(text=re.compile('Population')).parent.parent.next_sibling.find('td').text
population_text

'12,801,989'

#### Convert strings into integers

In [139]:
population = int(population_text.replace(',', ''))

population

12801989

In [140]:
# Create converter functions

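def to_date(date_str):
    """Parse the leading date in a string such as 'December 12, 1787 (2nd)'."""
    # Keep only the leading run of word characters, whitespace, and commas
    date_str = re.match(r'[\w\s,]+', date_str)[0]
    return dateutil.parser.parse(date_str)

def to_int(number_str):
    """Convert the leading number in a string such as '46,055 sq mi' to an int."""
    # Keep only the leading run of digits, commas, and dollar signs
    number_str = re.match(r'[\d,$]+', number_str)[0]
    number_str = number_str.replace('$', '').replace(',', '')
    return int(number_str)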

In [141]:
area = to_int(area_text)  # store the number under its own name instead of overwriting the text
area

46055
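
`to_int` raises a TypeError when the text does not start with a number, because `re.match` returns None in that case. When scraping many pages it may be more convenient to get None back instead. The variant below is a hypothetical sketch (the name `to_int_or_none` is mine).

```python
import re


def to_int_or_none(number_str):
    """Like to_int, but return None when the text has no leading number."""
    match = re.match(r"[\d,$]+", number_str)
    if match is None:
        return None
    digits = match.group(0).replace("$", "").replace(",", "")
    # A match such as "$" or "," alone leaves no digits to convert.
    return int(digits) if digits else None


print(to_int_or_none("12,801,989"))          # 12801989
print(to_int_or_none("46,055\xa0sq\xa0mi"))  # 46055
print(to_int_or_none("N/A"))                 # None
```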