### Scraping Wikipedia

In this notebook, I scrape a Wikipedia page from the web.
I use two libraries: requests to download the page and BeautifulSoup to parse it.

#### Getting Started with the requests library
The Python library requests retrieves content from the web over HTTP.

In [1]:
# importing requests library
import requests

In [2]:
url = 'https://en.wikipedia.org/wiki/Pennsylvania'
response = requests.get(url)

In [3]:
response.status_code # 200 means the request succeeded

200
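
The status check above can be folded into a small helper. This is only a minimal sketch: `fetch_html` and its `timeout` default are hypothetical names chosen for illustration and are not part of the notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on HTTP errors.

    Hypothetical helper for illustration; the timeout default is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the `url` defined earlier should give the same string as `response.text`, while failing loudly on a bad response instead of returning an error page.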

In [5]:
print(response.text[:200])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Pennsylvania - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wg


In [6]:
page = response.text

In [7]:
type(page)

str

#### Using requests with BeautifulSoup

In [8]:
from bs4 import BeautifulSoup as bs

In [9]:
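# name the parser explicitly so the result does not depend on which parsers are installed
soup = bs(page, 'html.parser')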

In [10]:
soup.find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">Pennsylvania</h1>

In [11]:
soup.find('h1').text

'Pennsylvania'

Find the disambiguation link for Pennsylvania

In [13]:
soup.find('a')

<a id="top"></a>

In [14]:
# collect the href of every <a> tag; .get() returns None when the attribute is missing
a_tags = [link.get('href') for link in soup.find_all('a')]
a_tags[:5]
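
Collecting link targets from every anchor needs a loop variable in the comprehension and a dictionary-style attribute lookup. Attribute-style access such as `link.href` searches for a child tag named `href` rather than reading the attribute. A minimal sketch on a small made-up HTML snippet (`soup_demo` and the snippet are assumptions, not the real page):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page
html = """
<a id="top"></a>
<a href="/wiki/Pennsylvania_(disambiguation)" class="mw-disambig">Pennsylvania (disambiguation)</a>
<a href="/wiki/U.S._state">State</a>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# link.get("href") returns None when the attribute is missing,
# whereas link["href"] would raise KeyError for <a id="top">
hrefs = [link.get("href") for link in soup_demo.find_all("a")]

# href=True keeps only the anchors that actually carry an href attribute
hrefs_present = [link["href"] for link in soup_demo.find_all("a", href=True)]

print(hrefs)
print(hrefs_present)
```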

In [23]:
soup.find(class_="mw-disambig")['href']

'/wiki/Pennsylvania_(disambiguation)'

In [25]:
soup.find(class_ = 'geo-dec').text

'41°N 77.5°W'
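
Searching by class works the same way on any HTML. The sketch below uses a made-up snippet (an assumption, not the real page) to show the `class_` keyword, the equivalent CSS selector, and a guard for the case where nothing matches:

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the two elements looked up above
html = """
<div>
  <a class="mw-disambig" href="/wiki/Pennsylvania_(disambiguation)">Pennsylvania (disambiguation)</a>
  <span class="geo-dec">41°N 77.5°W</span>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because `class` is a reserved word in Python
coords = soup_demo.find(class_="geo-dec").text

# The same lookup written as a CSS selector
coords_css = soup_demo.select_one("span.geo-dec").text

# find() returns None when nothing matches, so guard before using the result
missing = soup_demo.find(class_="no-such-class")
missing_text = missing.text if missing is not None else None
```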

The prettify() method renders the HTML as a nicely indented structure

In [28]:
print(soup.find('table').prettify())

<table class="infobox geography vcard" style="width:22em;width:23em">
 <tbody>
  <tr>
   <th colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap">
    <div class="fn org" style="display:inline">
     Pennsylvania
    </div>
   </th>
  </tr>
  <tr>
   <td colspan="2" style="text-align:center;background-color:#cddeff; font-weight:bold;">
    <div class="category">
     <a href="/wiki/U.S._state" title="U.S. state">
      State
     </a>
    </div>
   </td>
  </tr>
  <tr class="mergedtoprow">
   <td colspan="2" style="text-align:center;font-weight:bold;">
    Commonwealth of Pennsylvania
   </td>
  </tr>
  <tr class="mergedtoprow">
   <td class="maptable" colspan="2" style="text-align:center">
    <div style="display:table; width:100%; background:none;">
     <div style="display:table-row">
      <div style="display:table-cell;vertical-align:middle; text-align:center;">
       <a class="image" href="/wiki/File:Flag_of_Pennsylvania.svg"

In [30]:
first_table = soup.find('table')

In [31]:
type(first_table)

bs4.element.Tag

In [32]:
# find first table header 
first_table.find('th').text

'Pennsylvania'

In [33]:
# find first table data
first_table.find('td').text

'State'

In [34]:
# find first table row
first_table.find('tr').text

'Pennsylvania'

Chaining methods and saving data in Python variables

In [35]:
state = soup.find('table').find('th').text
state

'Pennsylvania'

In [36]:
for row in soup.find('table').find_all('tr')[:10]:
    print(row.text)    

Pennsylvania
State
Commonwealth of Pennsylvania

FlagSeal
Nickname(s): Keystone State;[1] Quaker State
Motto(s): Virtue, Liberty and Independence
Anthem: "Pennsylvania"
Map of the United States with Pennsylvania highlighted
CountryUnited States
Before statehoodProvince of Pennsylvania
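
Printing `row.text` runs the label and the value together (for example `CountryUnited States`). One way to keep them apart is to read the `th` and `td` of each row separately. A minimal sketch on a made-up table, where the `facts` dictionary is a hypothetical name:

```python
from bs4 import BeautifulSoup

# Made-up table that mimics a few infobox rows
html = """
<table>
  <tr><th colspan="2">Pennsylvania</th></tr>
  <tr><th>Country</th><td>United States</td></tr>
  <tr><th>Before statehood</th><td>Province of Pennsylvania</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

facts = {}
for row in table.find_all("tr"):
    header = row.find("th")
    value = row.find("td")
    # Skip rows that lack either a label or a value (e.g. the title row)
    if header is None or value is None:
        continue
    facts[header.get_text(strip=True)] = value.get_text(strip=True)

print(facts)
```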


#### Locating information by position

In [39]:
soup.find(text='Admitted') # returns None: a plain string must match the whole text node exactly

In [40]:
soup.find(text='Admitted to the Union')

'Admitted to the Union'

Alternatively, we can use a regular expression to match part of the text

In [41]:
import re
admitted_regex = re.compile('Admitted')
soup.find(text=admitted_regex)

'Admitted to the Union'
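
The difference between exact and regular-expression matching can be checked on a small made-up snippet. The sketch uses the `string=` keyword, which is the newer name for `text=` in Beautiful Soup 4.4.0 and later; the snippet and variable names are assumptions.

```python
import re

from bs4 import BeautifulSoup

# Made-up row that mimics the infobox entry
html = "<table><tr><th>Admitted to the Union</th><td>December 12, 1787 (2nd)</td></tr></table>"
soup_demo = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so a partial string finds nothing
exact_miss = soup_demo.find(string="Admitted")

# The full text of the node matches
exact_hit = soup_demo.find(string="Admitted to the Union")

# A compiled pattern matches any text node that contains it
regex_hit = soup_demo.find(string=re.compile("Admitted"))
```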

In [42]:
admitted = soup.find(text=admitted_regex)
type(admitted) # a NavigableString is part of the parse tree, so we can navigate from it

bs4.element.NavigableString

In [44]:
admitted.next

<td>December 12, 1787 (2nd)</td>

In [45]:
admitted.next.next

'December 12, 1787 (2nd)'
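
Stepping through the parse order works here because the value cell directly follows the label. As a hedged alternative, the target tag can be named explicitly, which does not depend on how many elements lie in between. The snippet and variable names below are assumptions for illustration:

```python
import re

from bs4 import BeautifulSoup

# Made-up row that mimics the infobox entry
html = "<table><tr><th>Admitted to the Union</th><td>December 12, 1787 (2nd)</td></tr></table>"
soup_demo = BeautifulSoup(html, "html.parser")

label = soup_demo.find(string=re.compile("Admitted"))

# .next_element follows parse order: the <td> tag comes first, then its text
td_by_order = label.next_element

# find_next("td") names the target tag explicitly
td = label.find_next("td")

# The same cell reached through the enclosing <th> and its sibling
td_sibling = label.parent.find_next_sibling("td")

admitted_date = td.get_text(strip=True)
print(admitted_date)
```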