# Lecture: Web Data Acquisition B

This lecture adapts content from the Lab for Data Mining with Pandas on Wikipedia data by [Brian Keegan](https://www.brianckeegan.com), [Department of Information Science, CU Boulder](https://www.colorado.edu/cmci/academics/information-science), as well as the [PyCon 2015 Pandas tutorial](https://github.com/brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes.


This notebook is copyrighted and made available under the [Apache License v2.0](https://creativecommons.org/licenses/by-sa/4.0/) license.


## Import modules and set up environment

In [1]:
import pandas as pd 

# Package to query and download web resources. Alternatives: urllib, urllib3

import requests

# If you want HTML to make sense, you need soup

from bs4 import BeautifulSoup

# Webscraping (in the sense of "Screen Scraping")
Credit for part of this tutorial goes to https://www.dataquest.io/blog/web-scraping-tutorial-python/

Beautiful Soup Doc: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

When a browser loads a web page, it typically downloads several kinds of files:

* HTML – contains the main content of the page.
* CSS – adds styling to make the page look nicer.
* JS – JavaScript files add interactivity to web pages.
* Images – image formats such as JPG and PNG allow web pages to show pictures.

Open http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html in your browser.

## Starting example

In [2]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
type(page)

requests.models.Response

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [4]:
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>



In [5]:
list(soup.html.body.div.children)

['\n',
 <p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 '\n',
 <p class="inner-text">
                 Second paragraph.
             </p>,
 '\n']

In [6]:
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [7]:
len(soup.find_all('p'))

4

In [8]:
soup.find_all('p')[0]

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

In [9]:
soup.find('p')

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

In [10]:
soup.p

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

In [11]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [12]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [13]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]

In [14]:
soup.select("p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [15]:
soup.select("p.first-item")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [16]:
soup.select("div p.first-item#second")

[]

In [17]:
soup.select("#second")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [18]:
soup.select("p#second")


[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [19]:
soup.select("p#second")[0].get_text().strip()

'First outer paragraph.'
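
The cells above mix two ways of querying the soup: `find_all()` with keyword filters and `select()` with CSS selectors. The following is a minimal offline sketch of how they relate, assuming an inline HTML string that mirrors the example page (the names `html`, `demo` and the result variables are made up for this illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet mirroring the structure of the example page.
html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
demo = BeautifulSoup(html, "html.parser")

# find_all() filters by tag name and attributes ...
ids_find = [p["id"] for p in demo.find_all("p", class_="first-item")]
# ... while select() takes a CSS selector string: tag.class
ids_select = [p["id"] for p in demo.select("p.first-item")]

# "div p" matches only <p> tags nested somewhere inside a <div>.
inner = [p.get_text().strip() for p in demo.select("div p")]

# select_one() returns the first match, or None if nothing matches.
second = demo.select_one("p#second").get_text().strip()
missing = demo.select_one("div p#second")  # #second is not inside the <div>

print(ids_find, ids_select, inner, second, missing)
```

Both query styles return the same tags here; which one to use is mostly a matter of taste, although nested conditions such as "a `p` inside a `div`" tend to be shorter as CSS selectors.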

## HTML tables

How do HTML tables work? See https://www.w3schools.com/html/html_tables.asp

In [20]:
html_string = '''
<html>
<body>

<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Alfreds Futterkiste</td>
    <td>Maria Anders</td>
    <td>Germany</td>
  </tr>
  <tr>
    <td>Centro comercial Moctezuma</td>
    <td>Francisco Chang</td>
    <td>Mexico</td>
  </tr>
  <tr>
    <td>Ernst Handel</td>
    <td>Roland Mendel</td>
    <td>Austria</td>
  </tr>
  <tr>
    <td>Island Trading</td>
    <td>Helen Bennett</td>
    <td>UK</td>
  </tr>
  <tr>
    <td>Laughing Bacchus Winecellars</td>
    <td>Yoshi Tannamuri</td>
    <td>Canada</td>
  </tr>
  <tr>
    <td>Magazzini Alimentari Riuniti</td>
    <td>Helen Bennett</td>
    <td>Italy</td>
  </tr>

</table>


</body>

</html>
'''

In [21]:
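from io import StringIO  # newer pandas versions expect literal HTML wrapped in StringIO
html_dataframe_object = pd.read_html(StringIO(html_string))  # returns a list with one DataFrame per <table>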

In [22]:
html_dataframe_object[0]

| | Company | Contact | Country |
|---|---|---|---|
| 0 | Alfreds Futterkiste | Maria Anders | Germany |
| 1 | Centro comercial Moctezuma | Francisco Chang | Mexico |
| 2 | Ernst Handel | Roland Mendel | Austria |
| 3 | Island Trading | Helen Bennett | UK |
| 4 | Laughing Bacchus Winecellars | Yoshi Tannamuri | Canada |
| 5 | Magazzini Alimentari Riuniti | Helen Bennett | Italy |


In [23]:
soup2 = BeautifulSoup(html_string, 'html.parser')

In [24]:
print(soup2.prettify())

<html>
 <body>
  <table>
   <tr>
    <th>
     Company
    </th>
    <th>
     Contact
    </th>
    <th>
     Country
    </th>
   </tr>
   <tr>
    <td>
     Alfreds Futterkiste
    </td>
    <td>
     Maria Anders
    </td>
    <td>
     Germany
    </td>
   </tr>
   <tr>
    <td>
     Centro comercial Moctezuma
    </td>
    <td>
     Francisco Chang
    </td>
    <td>
     Mexico
    </td>
   </tr>
   <tr>
    <td>
     Ernst Handel
    </td>
    <td>
     Roland Mendel
    </td>
    <td>
     Austria
    </td>
   </tr>
   <tr>
    <td>
     Island Trading
    </td>
    <td>
     Helen Bennett
    </td>
    <td>
     UK
    </td>
   </tr>
   <tr>
    <td>
     Laughing Bacchus Winecellars
    </td>
    <td>
     Yoshi Tannamuri
    </td>
    <td>
     Canada
    </td>
   </tr>
   <tr>
    <td>
     Magazzini Alimentari Riuniti
    </td>
    <td>
     Helen Bennett
    </td>
    <td>
     Italy
    </td>
   </tr>
  </table>
 </body>
</html>



In [25]:
soup2.select('td')

[<td>Alfreds Futterkiste</td>,
 <td>Maria Anders</td>,
 <td>Germany</td>,
 <td>Centro comercial Moctezuma</td>,
 <td>Francisco Chang</td>,
 <td>Mexico</td>,
 <td>Ernst Handel</td>,
 <td>Roland Mendel</td>,
 <td>Austria</td>,
 <td>Island Trading</td>,
 <td>Helen Bennett</td>,
 <td>UK</td>,
 <td>Laughing Bacchus Winecellars</td>,
 <td>Yoshi Tannamuri</td>,
 <td>Canada</td>,
 <td>Magazzini Alimentari Riuniti</td>,
 <td>Helen Bennett</td>,
 <td>Italy</td>]

In [26]:
soup2.select('.x')

[]
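
`pd.read_html` walks the rows and cells of a table for us. Doing the same by hand with Beautiful Soup shows how `<tr>`, `<th>` and `<td>` fit together, which helps when a table is too irregular for `read_html`. This is only a minimal sketch, assuming a shortened copy of the table above; `table_html`, `rows` and `records` are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table above, used here only for illustration.
table_html = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Island Trading</td><td>Helen Bennett</td><td>UK</td></tr>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in table_soup.find_all("tr"):
    # Each <tr> is one row; <th> cells hold headers, <td> cells hold data.
    cells = [cell.get_text().strip() for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# Assumption: the first row is the header row, as in the table above.
header, data = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in data]

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)` to get the same kind of DataFrame that `read_html` produced.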

## A more complicated webpage

* open https://www.uni-konstanz.de/ in your browser

* right-click anywhere on the page and choose "Inspect" to open the HTML "Elements" view in Chrome (or the equivalent in Firefox)
* explore the structure

## Scraping Wikipedia user pages
* open https://en.wikipedia.org/wiki/User:Rjwilmsi in your browser

* right-click anywhere on the page and choose "Inspect" to open the HTML "Elements" view in Chrome (or the equivalent in Firefox)
* explore the structure

In [27]:
page = requests.get("https://en.wikipedia.org/wiki/User:Rjwilmsi")
page

<Response [200]>

In [28]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   User:Rjwilmsi - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-fea

Getting all DOM (Document Object Model) elements that are of class "wikipediauserbox"

In [31]:
userboxes = soup.find_all(class_='wikipediauserbox')
userboxes

[<div class="wikipediauserbox" style="float:left;border:1px solid #6EF7A7;margin:1px;width:238px"><table role="presentation" style="border-collapse:collapse;width:238px;margin-bottom:0;margin-top:0;background:#C5FCDC"><tbody><tr><td style="border:0;width:45px;height:45px;background:#6EF7A7;text-align:center;font-size:14pt;font-weight:bold;color:black;padding:0 1px 0 0;line-height:1.25em;vertical-align:middle;white-space:nowrap;"><a href="/wiki/English_language" title="English language">en</a></td><td style="border:0;text-align:left;font-size:8pt;padding:0 4px 0 4px;height:45px;line-height:1.25;color:black;vertical-align:middle">This user is a <b><a href="/wiki/Category:User_en-N" title="Category:User en-N">native speaker</a></b> of the <b><a href="/wiki/English_language" title="English language">English language</a></b>.</td></tr></tbody></table></div>,
 <div class="wikipediauserbox" style="float:left;border:1px solid #99B3FF;margin:1px;width:238px"><table role="presentation" style="bo

In [32]:
print(userboxes[0].prettify())

<div class="wikipediauserbox" style="float:left;border:1px solid #6EF7A7;margin:1px;width:238px">
 <table role="presentation" style="border-collapse:collapse;width:238px;margin-bottom:0;margin-top:0;background:#C5FCDC">
  <tbody>
   <tr>
    <td style="border:0;width:45px;height:45px;background:#6EF7A7;text-align:center;font-size:14pt;font-weight:bold;color:black;padding:0 1px 0 0;line-height:1.25em;vertical-align:middle;white-space:nowrap;">
     <a href="/wiki/English_language" title="English language">
      en
     </a>
    </td>
    <td style="border:0;text-align:left;font-size:8pt;padding:0 4px 0 4px;height:45px;line-height:1.25;color:black;vertical-align:middle">
     This user is a
     <b>
      <a href="/wiki/Category:User_en-N" title="Category:User en-N">
       native speaker
      </a>
     </b>
     of the
     <b>
      <a href="/wiki/English_language" title="English language">
       English language
      </a>
     </b>
     .
    </td>
   </tr>
  </tbody>
 </table>
</

In [33]:
%%html

<div class="wikipediauserbox" style="float:left;border:1px solid #6EF7A7;margin:1px;width:238px">
 <table role="presentation" style="border-collapse:collapse;width:238px;margin-bottom:0;margin-top:0;background:#C5FCDC">
  <tbody>
   <tr>
    <td style="border:0;width:45px;height:45px;background:#6EF7A7;text-align:center;font-size:14pt;font-weight:bold;color:black;padding:0 1px 0 0;line-height:1.25em;vertical-align:middle;white-space:nowrap;">
     <a href="https://en.wikipedia.org/wiki/English_language" title="English language">
      en
     </a>
    </td>
    <td style="border:0;text-align:left;font-size:8pt;padding:0 4px 0 4px;height:45px;line-height:1.25;color:black;vertical-align:middle">
     This user is a
     <b>
      <a href="https://en.wikipedia.org/wiki/Category:User_en-N" title="Category:User en-N">
       native
      </a>
     </b>
     speaker of
     <b>
      <a href="https://en.wikipedia.org/wiki/Category:User_en" title="Category:User en">
       English
      </a>
     </b>
     .
    </td>
   </tr>
  </tbody>
 </table>
</div>


0,1
en,This user is a  native  speaker of  English  .


-------------------------------------------------------
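
Note that the links in the scraped userbox are relative (`/wiki/English_language`), whereas the `%%html` cell uses absolute URLs so that the links still work outside Wikipedia. A minimal sketch of this conversion with the standard library's `urljoin`, assuming the page URL as the base (the variable names are made up for this illustration):

```python
from urllib.parse import urljoin

# URL of the page the userbox was scraped from.
base = "https://en.wikipedia.org/wiki/User:Rjwilmsi"

# Relative hrefs as they appear in the scraped userbox.
relative_links = ["/wiki/English_language", "/wiki/Category:User_en-N"]

# urljoin resolves each href against the base, like a browser would.
absolute_links = [urljoin(base, href) for href in relative_links]

for link in absolute_links:
    print(link)
```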

In [34]:
for u in userboxes:
    print(u.get_text().strip())

enThis user is a native speaker of the English language.
fr-3Cet utilisateur peut contribuer avec un niveau avancé de français.
This user has been on Wikipedia for 18 years, 7 months and 2 days.
501,000+This user has made  more than 501,000 contributions to Wikipedia.
“,;:’This user is a punctuation stickler.
. TheThis user does not put two spaces after a full stop.
A, B and CThis user prefers not to use the serial comma.
,This user fixes comma-splices; they are annoying.
itsThis user understands the difference between its (of it) and it's (it is or it has).
’sThi's user know's that not every word that end's with s need's an apostrophe and will remove misused apostrophe's from Wikipedia with extreme prejudice.
UKThis user uses British English.
This user's time zone is BST.
This user edits with the gadget wikEd.
This user uses Google as a primary search engine.
This user contributes using Firefox.
This user contributes with openSUSE.
64This user uses an x86-64 processor.
This user uses 

### Userboxes to dataframe

In [35]:
pd.options.display.max_colwidth = 600  # widens the displayed columns so long userbox texts are not cut off; just execute and ignore

In [36]:
userboxes[0].find_all('td')

[<td style="border:0;width:45px;height:45px;background:#6EF7A7;text-align:center;font-size:14pt;font-weight:bold;color:black;padding:0 1px 0 0;line-height:1.25em;vertical-align:middle;white-space:nowrap;"><a href="/wiki/English_language" title="English language">en</a></td>,
 <td style="border:0;text-align:left;font-size:8pt;padding:0 4px 0 4px;height:45px;line-height:1.25;color:black;vertical-align:middle">This user is a <b><a href="/wiki/Category:User_en-N" title="Category:User en-N">native speaker</a></b> of the <b><a href="/wiki/English_language" title="English language">English language</a></b>.</td>]

In [37]:
for u in userboxes:
    second_td = u.find_all('td')[1]  # the second cell holds the descriptive text
    print(second_td.get_text())

This user is a native speaker of the English language.
Cet utilisateur peut contribuer avec un niveau avancé de français.
 This user has been on Wikipedia for 18 years, 7 months and 2 days.
This user has made  more than 501,000 contributions to Wikipedia.
This user is a punctuation stickler.
This user does not put two spaces after a full stop.
This user prefers not to use the serial comma.
This user fixes comma-splices; they are annoying.
This user understands the difference between its (of it) and it's (it is or it has).
Thi's user know's that not every word that end's with s need's an apostrophe and will remove misused apostrophe's from Wikipedia with extreme prejudice.
This user uses British English.
This user's time zone is BST.
This user edits with the gadget wikEd.
This user uses Google as a primary search engine.
This user contributes using Firefox.
This user contributes with openSUSE.
This user uses an x86-64 processor.
This user uses Azureus.
This user enjoys electronic music.

In [38]:
col1 = []
col2 = []

for u in userboxes:
    tds = u.find_all('td')
    col1.append(tds[0].get_text())
    col2.append(tds[1].get_text())

In [39]:
len(col1) == len(col2)

True

**Note: this already makes several assumptions: 1. all userboxes have the same class, 2. their structure is always the same (e.g. always two TDs).**
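
If the second assumption fails, `tds[1]` raises an `IndexError`. One way to guard against this is sketched below; it is only an illustration on hypothetical markup (the `sample` string and the function name `extract_boxes` are made up here), not the code used in the rest of this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one regular userbox, one with a single cell,
# and one that contains no table cells at all.
sample = """
<div class="wikipediauserbox"><table><tr><td>en</td><td>Native speaker.</td></tr></table></div>
<div class="wikipediauserbox"><table><tr><td>Only one cell.</td></tr></table></div>
<div class="wikipediauserbox">No cells here.</div>
"""

def extract_boxes(soup):
    """Return (names, notes, skipped) without assuming two <td> per box."""
    names, notes, skipped = [], [], 0
    for box in soup.find_all(class_="wikipediauserbox"):
        tds = box.find_all("td")
        if len(tds) < 2:
            skipped += 1  # count irregular boxes instead of crashing on them
            continue
        names.append(tds[0].get_text().strip())
        notes.append(tds[1].get_text().strip())
    return names, notes, skipped

names, notes, skipped = extract_boxes(BeautifulSoup(sample, "html.parser"))
print(names, notes, skipped)
```

Counting the skipped boxes, rather than dropping them silently, gives a rough idea of how often the assumption is violated.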

In [40]:
pd.DataFrame({'box_name':col1, 'box_note':col2})

| | box_name | box_note |
|---|---|---|
| 0 | en | This user is a native speaker of the English language. |
| 1 | fr-3 | Cet utilisateur peut contribuer avec un niveau avancé de français. |
| 2 | | This user has been on Wikipedia for 18 years, 7 months and 2 days. |
| 3 | 501,000+ | This user has made more than 501,000 contributions to Wikipedia. |
| 4 | “,;:’ | This user is a punctuation stickler. |
| 5 | . The | This user does not put two spaces after a full stop. |
| 6 | A, B and C | This user prefers not to use the serial comma. |
| 7 | , | This user fixes comma-splices; they are annoying. |
| 8 | its | This user understands the difference between its (of it) and it's (it is or it has). |
| 9 | ’s | Thi's user know's that not every word that end's with s need's an apostrophe and will remove misused apostrophe's from Wikipedia with extreme prejudice. |


Open https://en.wikipedia.org/wiki/User:Werdum2 and use "inspect" to open the HTML "Elements" view.

In [41]:
page = requests.get("https://en.wikipedia.org/w/index.php?title=User:Werdum2")  #Some example users: Carlossuarez46 Bearcat
soup = BeautifulSoup(page.content, 'html.parser')

userboxes = soup.find_all(class_='wikipediauserbox' )

col1 = []
col2 = []

for u in userboxes:
    tds = u.find_all('td')
    col1.append(tds[0].get_text())
    col2.append(tds[1].get_text())


In [42]:
pd.DataFrame({'box_name':col1, 'box_note':col2})

| | box_name | box_note |
|---|---|---|
| 0 | en-4 | This user can contribute with a near-native level of English. |
| 1 | de | Dieser Benutzer spricht Deutsch als Muttersprache. |
| 2 | | This user is an atheist. |
| 3 | | This user has autopatrolled rights on the English Wikipedia. (verify) |
| 4 | | This user has extended confirmed rights on the English Wikipedia. (verify) |
| 5 | | This user has pending changes reviewer rights on the English Wikipedia. (verify) |
| 6 | | This user has rollback rights on the English Wikipedia. (verify) |
| 7 | 69 | As of last count, this user is #69 of active Wikipedians. |
| 8 | | This user is a member ofWikiProject Football. |
| 9 | | This user is male. |


In [43]:
def get_boxes(user):
    page = requests.get("https://en.wikipedia.org/wiki/User:"+user)
    soup = BeautifulSoup(page.content, 'html.parser')

    userboxes = soup.find_all(class_='wikipediauserbox' )

    col1 = []
    col2 = []

    for u in userboxes:
        tds = u.find_all('td')
        col1.append(tds[0].get_text())
        col2.append(tds[1].get_text())


    return col1, col2

In [44]:
get_boxes('Bearcat')

(['en',
  'fr-2',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'CBC',
  'r3',
  '',
  "Your company's logo goes here.",
  '',
  'LJ'],
 ['This user is a native speaker of the English\xa0language.',
  'Cet utilisateur peut contribuer avec un niveau intermédiaire en français.',
  'This user is an administrator on the English Wikipedia. (verify)',
  'This user is a  Wikipedian  in Canada.',
  'This user identifies as gay.',
  'This user is a bear. Grr!',
  'This user is male.',
  'This user has depression.',
  'This user enjoys indie rock.',
  'This user is a CBC Radio fanatic.',
  'This user breaks new ground by breaking new sound.',
  'This user thinks that registration should be required to edit articles.',
  'This user has decided to sell out to corporate sponsorship. Sorry!',
  'This user contributes using Firefox.',
  'This user maintains a LiveJournal.'])

**Be *very* careful with what you infer from mass-extracted data based on heuristics.**

Wikipedia is NOT *structured data*, and neither are most websites.

### Extracting EN proficiency

In [47]:
page = requests.get("https://en.wikipedia.org/wiki/User:Werdum2")
soup = BeautifulSoup(page.content, 'html.parser')
userboxes = soup.find_all(class_='wikipediauserbox')

In [48]:
import re

In [49]:
for u in userboxes:
    td1 = u.find('td')
    #print(td1)
    one = re.search(r'^en$', td1.get_text().strip())  # raw strings keep \d from being read as a string escape
    #print(one)
    more = re.search(r'^en-(\d)$', td1.get_text().strip())
    #print(more)

In [50]:
en_lvl=''

for u in userboxes:
    td1 = u.find('td')
    one = re.search(r'^en$', td1.get_text().strip())
    more = re.search(r'^en-(\d)$', td1.get_text().strip())

    if en_lvl == '':  # keep only the first English box that is found
        if one:
            en_lvl = '6'  # plain "en" (native speaker) is coded as level 6
        elif more:
            en_lvl = more.group(1)

print(en_lvl)

4
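
The two regular expressions can be tried out without any web request. The sketch below wraps them in a small helper and runs it on a few labels; the function name `parse_en_level` and the test labels are made up for this illustration, and the value `'6'` for a plain `en` follows the convention of the loop above.

```python
import re

# Anchored patterns: ^ and $ force the whole label to match,
# so "ben-3" or "en-4 extra" are not accepted by accident.
NATIVE = re.compile(r"^en$")
LEVEL = re.compile(r"^en-(\d)$")  # the parentheses capture the digit

def parse_en_level(label):
    """Map a userbox label to a level string; '' if it is not an English box."""
    label = label.strip()
    if NATIVE.search(label):
        return "6"  # same convention as above: plain "en" is coded as 6
    match = LEVEL.search(label)
    if match:
        return match.group(1)  # the captured digit, e.g. "4" for "en-4"
    return ""

labels = ["en", "en-4", " en-2 ", "fr-3", "en-N", "ben-3"]
levels = {label: parse_en_level(label) for label in labels}
print(levels)
```

Labels that are not exactly `en` or `en-` followed by one digit (here `fr-3`, `en-N` and `ben-3`) map to the empty string.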


In [51]:
def get_en_lvl(user):
    page = requests.get("https://en.wikipedia.org/wiki/User:"+user)
    soup = BeautifulSoup(page.content, 'html.parser')
    userboxes = soup.find_all(class_='wikipediauserbox')
    en_lvl=''

    for u in userboxes:
        td1 = u.find('td')  # find() returns None (it does not raise) if there is no td at all
        if td1 is None:
            continue  # jump to the next userbox
        one = re.search(r'^en$', td1.get_text().strip())
        more = re.search(r'^en-(\d)$', td1.get_text().strip())
        if en_lvl == '':
            if one:
                en_lvl = '6'  # plain "en" (native speaker) is coded as level 6
            elif more:
                en_lvl = more.group(1) # --> THIS gets the *first* sub-match of the regular expression (here, the part in parentheses). group(0) gets everything, e.g. en-4 in the example of Werdum2

    return en_lvl

In [52]:
get_en_lvl('Werdum2')

'4'

I prepared a list of users. Let's get their English level.

In [53]:
reg_u_df = pd.read_pickle('data/wiki_users.p')

In [54]:
reg_u_df

| | reg_users |
|---|---|
| 0 | KolbertBot |
| 1 | Elissabulkin |
| 2 | Snooganssnoogans |
| 3 | Robudor |
| 4 | BrownHairedGirl |
| ... | ... |
| 3406 | TimShell |
| 3407 | WojPob |
| 3408 | KoyaanisQatsi |
| 3409 | Fw-us-hou-8.bmc.com |


In [62]:
for u in reg_u_df.reg_users.head(50):
    x = get_en_lvl(u)
    if x != '':
        print(u, '--> has an English lvl of', x)

BrownHairedGirl --> has an English lvl of 6
Eyer --> has an English lvl of 5
Mandruss --> has an English lvl of 6
Chumash11 --> has an English lvl of 6
Rfl0216 --> has an English lvl of 6
Timrollpickering --> has an English lvl of 6
Haribanshnp --> has an English lvl of 1


In [64]:
reg_u_df['en_lvl'] = reg_u_df.reg_users.apply(get_en_lvl)

ConnectionError: ('Connection aborted.', ConnectionAbortedError(10053, 'An established connection was aborted by the software in your host machine', None, 10053, None))
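
A single failed request aborts the whole `apply`, and nothing is stored in the new column. One possible remedy is to wrap each call so that it is retried a few times and falls back to a default value. The helper below is only a sketch under that assumption: `call_with_retries` and `flaky_lookup` are names made up for this illustration, and `flaky_lookup` merely simulates a dropped connection so that the example runs without network access.

```python
import time

def call_with_retries(func, arg, retries=3, delay=1.0, default=''):
    """Call func(arg); if it raises, wait and try again, then fall back to default."""
    for attempt in range(retries):
        try:
            return func(arg)
        except Exception:
            # Broad on purpose for this sketch; with requests you would
            # rather catch requests.exceptions.RequestException.
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))  # wait a bit longer each time
    return default

# Stand-in for get_en_lvl that fails on its first call, to show the retry.
calls = {'n': 0}

def flaky_lookup(user):
    calls['n'] += 1
    if calls['n'] == 1:
        raise ConnectionError('simulated dropped connection')
    return '4'

result = call_with_retries(flaky_lookup, 'Werdum2', delay=0)
print(result, calls['n'])
```

With the real function, this could be used as `reg_u_df.reg_users.apply(lambda u: call_with_retries(get_en_lvl, u))`. For several thousand pages it is also polite to pause briefly between requests.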

In [None]:
reg_u_df

In [None]:
reg_u_df.en_lvl.value_counts()