# Week 5 Day 3: Data Wrangling with Web Scraping

## Wikifunctions

Dr. Brian Keegan in our INFO department also made a very helpful package, available on GitHub, that makes it easy to query the Wikipedia API. Here is the wikifunctions repository: [https://github.com/brianckeegan/wikifunctions](https://github.com/brianckeegan/wikifunctions)

- Download `wikifunctions.py` and save it in the same folder as this notebook, so that `import wikifunctions` can find it.

Our code will pull the functions and classes from this file to work with the API.

In [1]:
# Import wikifunctions under the shorter alias "wf"

import wikifunctions as wf

In [2]:
# Get all page revisions

rev_df = wf.get_all_page_revisions("Buffalo")
rev_df.head()

Unnamed: 0,revid,parentid,user,anon,userid,timestamp,size,sha1,comment,page,date,diff,lag,age
0,240464,0,12.30.225.xxx,True,0,2001-12-15 21:46:10+00:00,2271,e655c8c7e20feb83d25017cec32d08eee713d130,*,Buffalo,2001-12-15,,,0.0
1,240465,240464,The Epopt,,30,2001-12-15 22:09:54+00:00,2761,27164baa6abb807c7f20e5dc1323c43503a79348,added a stub about buffalo,Buffalo,2001-12-15,490.0,1424.0,0.016481
2,240466,240465,The Epopt,,30,2001-12-15 22:10:27+00:00,2762,32c20c60c410376c96d9b1da4570f477483d2d77,*,Buffalo,2001-12-15,1.0,33.0,0.016863
3,240467,240466,Paul Drye,,6,2001-12-16 01:16:17+00:00,478,0409290a0ab940c21c2bd3d75963d1b1a7d71e31,"Slice out Buffalo, New York and move to [[Buff...",Buffalo,2001-12-16,-2284.0,11150.0,0.145914
4,240468,240467,The Epopt,,30,2001-12-16 01:55:32+00:00,479,5703f32d66c0e61ca2391db4045514faffdef655,*,Buffalo,2001-12-16,1.0,2355.0,0.173171


In [3]:
# Get pageviews
pvs1 = wf.get_pageviews("Denver Broncos")
pvs1.head()

timestamp
2015-07-01    937
2015-07-02    932
2015-07-03    825
2015-07-04    824
2015-07-05    852
Name: views, dtype: int64

In [7]:
# Get current page content
page_content = wf.get_page_raw_content("Will Smith")
page_content.find('shortdescription nomobile noexcerpt noprint searchaux')



77

In [8]:
# Get interlanguage links

ill_df = wf.get_interlanguage_links("Will Smith")
ill_df

{'en': 'Will Smith',
 'af': 'Will Smith',
 'am': 'ዊል ስሚዝ',
 'an': 'Will Smith',
 'ar': 'ويل سميث',
 'arz': 'ويل سميث',
 'ast': 'Will Smith',
 'az': 'Vill Smit',
 'azb': 'ویل اسمیت',
 'be': 'Уіл Сміт',
 'bg': 'Уил Смит',
 'bh': 'विल स्मिथ',
 'bn': 'উইল স্মিথ',
 'br': 'Will Smith',
 'bs': 'Will Smith',
 'ca': 'Will Smith',
 'ceb': 'Will Smith',
 'ckb': 'ویڵ سمیت',
 'co': 'Will Smith',
 'cs': 'Will Smith',
 'cv': 'Уилл Смит',
 'cy': 'Will Smith',
 'da': 'Will Smith',
 'de': 'Will Smith',
 'el': 'Γουίλ Σμιθ',
 'eo': 'Will Smith',
 'es': 'Will Smith',
 'et': 'Will Smith',
 'eu': 'Will Smith',
 'fa': 'ویل اسمیت',
 'fi': 'Will Smith',
 'fo': 'Will Smith',
 'fr': 'Will Smith',
 'frp': 'Will Smith',
 'fy': 'Will Smith',
 'ga': 'Will Smith',
 'gd': 'Will Smith',
 'gl': 'Will Smith',
 'got': '𐍅𐌹𐌻𐌻 𐍃𐌼𐌹𐌸',
 'gv': 'Will Smith',
 'ha': 'Will Smith',
 'he': "ויל סמית'",
 'hi': 'विल स्मिथ',
 'hr': 'Will Smith',
 'ht': 'Will Smith',
 'hu': 'Will Smith',
 'hy': 'Ուիլ Սմիթ',
 'hyw': 'Ուիլլ Սմիթ',
 'id': '

# Save data sources

#### Save the page revisions DataFrame to a CSV file

In [9]:
rev_df.to_csv("class_data/page_revisions_buffalo.csv")
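
The `Unnamed: 0` column in the revisions preview earlier is what a saved index looks like when a CSV is read back without telling `pandas` about it. The sketch below shows this round trip, assuming a small made-up DataFrame (`demo_df`) and an in-memory buffer in place of `class_data/page_revisions_buffalo.csv`; both are stand-ins for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for rev_df (two rows, two columns)
demo_df = pd.DataFrame({"revid": [240464, 240465],
                        "user": ["12.30.225.xxx", "The Epopt"]})

# By default, to_csv writes the row index as an unnamed first column
csv_text = demo_df.to_csv()

# Reading it back naively turns that index into an "Unnamed: 0" column
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to treat the first column as the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Alternatively, skip writing the index in the first place
no_index_text = demo_df.to_csv(index=False)

print(naive.columns.tolist())
print(restored.columns.tolist())
```

Either approach works; which one to use depends on whether the row index carries information worth keeping.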

In [7]:
# Most straightforward way to import a library in Python
import requests

# BeautifulSoup is a class inside the "bs4" library; we only import that class
from bs4 import BeautifulSoup

# We import pandas but give the library the shortcut alias "pd" since we will call its functions so often
import pandas as pd

### Reading an HTML table into Python

[The Numbers](http://www.the-numbers.com) is a popular source of data about movies' box office revenue. Their daily domestic charts are HTML tables with the top-grossing movies for each day of the year, going back several years. This [table](https://www.the-numbers.com/box-office-chart/daily/2018/12/25) for Christmas Day in 2018 has columns for the current week's ranking, previous week's ranking, name of movie, distributor, gross, change over the previous week, number of theaters, revenue per theater, total gross, and number of days since release. This looks like a fairly straightforward table that could be read directly into a data frame-like structure.

Using the Inspect tool, we can see the table exists as a `<table border="0" ... align="CENTER">` element with child tags like `<tbody>` and `<tr>` (table row). Each `<tr>` contains `<td>` tags, which define the individual cells and their content. For more on how HTML defines tables, check out [this tutorial](https://www.w3schools.com/html/html_tables.asp).
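
To make the `<table>` / `<tr>` / `<td>` nesting concrete, here is a minimal sketch that walks a tiny made-up table using only Python's standard library. The `SAMPLE_HTML` string and the `TableRowCollector` class are invented for illustration; this is not how the rest of the notebook parses the page.

```python
from html.parser import HTMLParser

# A made-up two-row table, shaped like the structure described above
SAMPLE_HTML = """
<table border="0" align="CENTER">
  <tbody>
    <tr><td>1</td><td>Movie A</td></tr>
    <tr><td>2</td><td>Movie B</td></tr>
  </tbody>
</table>
"""

class TableRowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # a new row starts
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")    # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data   # text inside the current cell

collector = TableRowCollector()
collector.feed(SAMPLE_HTML)
print(collector.rows)
```

Each inner list corresponds to one `<tr>`, and each string in it to one `<td>`.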

Using `requests` and `BeautifulSoup` we would get this webpage's HTML, turn it into soup, and then find the table (`<table>`) or the table rows (`<tr>`) and pull out their content.

In [8]:
# Make the request

# The User-Agent header tells the website who is making the request.
# Its value is free-form; here it names the course and gives a contact email.

user_agent = {'user-agent':'info-2201/0.0 Web Data Science, laurie.jones@colorado.edu'}

xmas_bo_raw = requests.get('https://www.the-numbers.com/box-office-chart/daily/2018/12/25', headers=user_agent).text
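
To check what the `headers` argument does without touching the network, one option is to build the request and inspect it before sending. The sketch below uses `requests.Request(...).prepare()` for that; the variable name `prepared` is an illustrative choice.

```python
import requests

# Same header dictionary as in the cell above
user_agent = {'user-agent': 'info-2201/0.0 Web Data Science, laurie.jones@colorado.edu'}

# Build the request without sending it, so it can be inspected offline
prepared = requests.Request(
    'GET',
    'https://www.the-numbers.com/box-office-chart/daily/2018/12/25',
    headers=user_agent,
).prepare()

# Header names are case-insensitive, so 'User-Agent' finds our 'user-agent' entry
print(prepared.headers['User-Agent'])
print(prepared.method, prepared.url)
```
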


In [9]:
# Turn into soup, specify the HTML parser


In [11]:
# Use .find_all to retrieve all the tables in the page


It turns out there are two tables on the page. The first is a small navigation table consisting of the "Previous Chart", "Chart Index", and "Next Chart" links at the top. We want the second table with all the data: `xmas_bo_tables[1]` returns the second table (remember that Python is 0-indexed, so the first table is at `xmas_bo_tables[0]`). With this table identified, we can do a second `find_all` to get the table rows inside it, which we save as `xmas_bo_trs`.
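
The steps described here can be sketched on a small made-up page so they run without a network connection. The `sample_html` string and the `sample_soup` and `rows` names are assumptions for illustration; `xmas_bo_tables` and `xmas_bo_trs` follow the names used in the text.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for xmas_bo_raw: a small navigation table, then a data table
sample_html = """
<table><tr><td>Previous Chart</td><td>Chart Index</td><td>Next Chart</td></tr></table>
<table>
  <tr><td>Rank</td><td>Movie</td></tr>
  <tr><td>1</td><td>Movie A</td></tr>
  <tr><td>2</td><td>Movie B</td></tr>
</table>
"""

# Turn the string into soup, using Python's built-in HTML parser
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# All <table> elements on the page, in document order
xmas_bo_tables = sample_soup.find_all('table')

# Rows of the second table (index 1), which holds the data
xmas_bo_trs = xmas_bo_tables[1].find_all('tr')

# Pull the text out of each cell, row by row
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in xmas_bo_trs]

print(len(xmas_bo_tables))
print(rows)
```

On the real page the same calls would be applied to the soup made from `xmas_bo_raw`, and the resulting list of rows could then be passed to `pd.DataFrame`.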


### `pandas`'s `read_html`
That was a good amount of work just to get this simple HTML table into Python. But it was important to cover how table elements move from a string in `requests`, into a soup object from `BeautifulSoup`, into a list of data, and finally into `pandas`.

`pandas` also has powerful functionality for reading tables directly from HTML. If we convert the soup of the data table (`xmas_bo_tables[1]`, the second table on the page) back into a string, `pandas` can read it directly into a DataFrame.

There are a few idiosyncrasies here. The result is a list of DataFrames, even if there's only a single table, so we need to take the first (and only) element of this list. This is why there's a `[0]` at the end; the `.head()` is just to show the first five rows.
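
A minimal sketch of this behavior, assuming a made-up table string (`table_html`) standing in for `str(xmas_bo_tables[1])`. Note that `pd.read_html` needs an HTML parser library installed (`lxml`, or `bs4` together with `html5lib`).

```python
import io
import pandas as pd

# Made-up table string, standing in for str(xmas_bo_tables[1])
table_html = """
<table>
  <tr><td>Rank</td><td>Movie</td><td>Gross</td></tr>
  <tr><td>1</td><td>Movie A</td><td>1000</td></tr>
  <tr><td>2</td><td>Movie B</td><td>500</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
# Wrapping the string in StringIO avoids a deprecation warning in recent pandas.
frames = pd.read_html(io.StringIO(table_html))

# Only one table here, so take element 0 of the list
first_df = frames[0]

print(type(frames), len(frames))
print(first_df.head())
```
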

In [12]:
# Read the string of the data table (xmas_bo_tables[1]) into pd.read_html



The column names got lumped in as a row of data, but we can fix this with the `read_html` function by passing the index of the row where the column names live. In this case, it is the first row, so we pass `header=0`.
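
The effect of `header=0` can be seen on a small made-up table whose header row uses plain `<td>` cells, so `pandas` cannot tell it apart from the data. The `table_html`, `raw_df`, and `named_df` names are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up table whose header row is written with <td>, not <th>
table_html = """
<table>
  <tr><td>Rank</td><td>Movie</td><td>Gross</td></tr>
  <tr><td>1</td><td>Movie A</td><td>1000</td></tr>
  <tr><td>2</td><td>Movie B</td><td>500</td></tr>
</table>
"""

# Without header=, the column names end up as the first row of data
raw_df = pd.read_html(io.StringIO(table_html))[0]

# With header=0, the first row is used as the column names instead
named_df = pd.read_html(io.StringIO(table_html), header=0)[0]

print(raw_df.columns.tolist())
print(named_df.columns.tolist())
print(named_df.head())
```
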

In [13]:
# Just get the head of the df
