## Data Collection - Web Scraping - Data Parsing



---

In [1]:
from IPython.display import HTML
style = "<style>div.exercise { background-color: #ffcccc;border-color: #E9967A; border-left: 5px solid #800080; padding: 0.5em;}</style>"
HTML(style)

In [2]:
dog_dict = {} # initialize a dictionary

# Populate the dictionary
dog_dict["jack"] = "border collie"
dog_dict["sophi"] = "beagle"
dog_dict["betty"] = "irish wolfhound"

print(dog_dict["jack"]) # Access one element of the dictionary
print("\n") # Just make a space for convenience

print(dog_dict) # Print the entire dictionary

border collie


{'jack': 'border collie', 'sophi': 'beagle', 'betty': 'irish wolfhound'}


You can do much more with dictionaries, but that example encapsulates the basics.
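
As a small illustration of what else is possible, the sketch below rebuilds the same dictionary and shows membership tests, `.get()` with a default, and a dictionary comprehension over `.items()`. The key `"rex"` and the names `has_jack`, `rex_breed`, and `name_lengths` are made up for this example.

```python
# Same dictionary as in the cell above, rebuilt so the example is self-contained.
dog_dict = {"jack": "border collie", "sophi": "beagle", "betty": "irish wolfhound"}

# Membership tests check the keys, not the values.
has_jack = "jack" in dog_dict

# .get() returns a default instead of raising KeyError for a missing key.
rex_breed = dog_dict.get("rex", "unknown")

# .items() yields (key, value) pairs, which is handy in loops and comprehensions.
name_lengths = {name: len(breed) for name, breed in dog_dict.items()}

print(has_jack)
print(rex_breed)
print(name_lengths)
```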

In [3]:
def bad_func(x: float, y: float) -> float:
    try:
        result = x/y
    except ZeroDivisionError:
        print("WARNING:")
        print("You set y = 0 but y must be non-zero.")
        print("We are setting y = 1.  This may drastically change your results.")
        y = 1.0
        result = x/y
    return result

x, y = 1.0, 0.0
important_quantity = bad_func(x, y)

print("\n Your important_quantity has a value of {0:3.6f}".format(important_quantity))

WARNING:
You set y = 0 but y must be non-zero.
We are setting y = 1.  This may drastically change your results.

 Your important_quantity has a value of 1.000000
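
`bad_func` silently replaces the caller's input, which is presumably why it carries that name. One possible alternative, sketched here under the assumption that the caller should decide what to do about a zero denominator, is to raise an exception instead. The helper `strict_divide` and its error message are hypothetical.

```python
def strict_divide(x: float, y: float) -> float:
    """Divide x by y, refusing to guess when y is zero (hypothetical helper)."""
    if y == 0:
        raise ValueError("y must be non-zero")
    return x / y


try:
    strict_divide(1.0, 0.0)
except ValueError as err:
    # The caller sees the problem and chooses how to recover.
    message = str(err)
    print("Could not compute important_quantity:", message)
```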


## Part 2:  I/O and Preprocessing


In [4]:
# Avoid this approach: if an error occurs before f.close() runs, the file is left open.
f = open("data/brief_comments.txt", "r") # Open the file for reading
dogs = f.read() # Read the file
f.close() # Remember to close the file!

In [5]:
# Prefer this approach: the with statement closes the file automatically, even if an error occurs.
with open("data/.txt", "r") as f:
    dogs = f.read()
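
The difference between the two approaches shows up when something goes wrong while the file is open. The sketch below writes a throwaway file (its name and contents are made up for the example), raises an error inside the `with` block, and then checks that the file was closed anyway.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on the data directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("Dogs have been with humans for a long time.\n")

try:
    with open(path, "r") as f:
        first_line = f.readline()
        raise RuntimeError("simulated failure while processing the file")
except RuntimeError:
    pass

# The context manager closed the file even though an exception was raised.
print(f.closed)
```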

### Part 2.2:  Preprocessing


In [6]:
print(dogs) # What are the contents of the object we just read in?

Dogs have been with humans for millenia.  Although they do not speak human languages (e.g. English or Chinese), they have been watching and observing
us for all that time.  Their emotional intelligence is prodigious.  We are only just beginning to scrape the surface of the mind of dogs.  The nascent
field of dog cognition is beginning to shed light on ways to form meaningful communication with dogs.



In [7]:
type(dogs) # What kind of data are we dealing with?

str

In [8]:
l = len(dogs) # How many characters are in this string?
print(l)

403


In [9]:
dogs[10] # Let's access the 11th item

'b'

In [10]:
words = dogs.split()
print(words)

['Dogs', 'have', 'been', 'with', 'humans', 'for', 'millenia.', 'Although', 'they', 'do', 'not', 'speak', 'human', 'languages', '(e.g.', 'English', 'or', 'Chinese),', 'they', 'have', 'been', 'watching', 'and', 'observing', 'us', 'for', 'all', 'that', 'time.', 'Their', 'emotional', 'intelligence', 'is', 'prodigious.', 'We', 'are', 'only', 'just', 'beginning', 'to', 'scrape', 'the', 'surface', 'of', 'the', 'mind', 'of', 'dogs.', 'The', 'nascent', 'field', 'of', 'dog', 'cognition', 'is', 'beginning', 'to', 'shed', 'light', 'on', 'ways', 'to', 'form', 'meaningful', 'communication', 'with', 'dogs.']


In [11]:
type(words)

list

In [12]:
words[10]

'not'

In [13]:
N = len(words) # Number of words
print("There are {0} words in our brief comments.".format(N))

There are 67 words in our brief comments.


In [14]:
words.count("dogs")

0

In [15]:
more_words = [word.split('.')[0] for word in words] # List comprehension
more_words.count("dogs")

2

In [16]:
# We can write the list comprehension as a for loop as follows:
more_words1 = []
for word in words:
    inter = word.split('.')
    inter1 = inter[0]
    more_words1.append(inter1)
more_words1.count("dogs")

2

In [17]:
# your code here
my_ints = [i for i in range(-5, 6)]
my_ints2 = [i*i for i in my_ints]
print(my_ints2)

[25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25]


In [18]:
my_str = 'HELLO Bonnie'
my_str.lower()

'hello bonnie'

In [19]:
# Your code here
lower_words = [word.lower() for word in more_words]
lower_words.count("dogs")

3
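
Splitting on `'.'` only handles periods, and it also truncates tokens such as `'(e.g.'` to `'(e'`. A more general sketch, which is not the approach used in the cells above, strips punctuation from both ends of each token and counts everything at once with `collections.Counter`. The sample sentence is made up for the example.

```python
from collections import Counter
import string

text = "Dogs are great. We are only beginning to understand dogs. (Really, dogs!)"

# Lower-case each token and strip punctuation from both ends only.
tokens = [word.strip(string.punctuation).lower() for word in text.split()]

# Counter maps each distinct token to the number of times it occurs.
counts = Counter(tokens)
print(counts["dogs"])
print(counts.most_common(2))
```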

<div class="exercise"><b> Exercise </b> </div>
* `hamlet.txt` is in the `data` directory.  Open and read it into a variable called `hamlettext`.
* What is the type of `hamlettext`?  What is its length?  Print the first $500$ items of `hamlettext`.
* Create a list called `hamletwords` where the items are the words of the play.
  * Confirm that the list you created is really a list
  * Confirm that each element of the list is a string
  * Print the first 10 items in the list.  
  * Print "There are $N$ total words in Hamlet.",  where $N$ is the total number of words in Hamlet.
* Using a *list comprehension*, create `hamletwords_lc` which converts the items in `hamletwords` to lower-case. 
* Count the number of occurrences of the word "thou".
* Use `set` to determine the set of unique words in `hamletwords_lc`.  Here's documentation on the `set` datatype:  [Sets](https://docs.python.org/3/tutorial/datastructures.html#sets).
  * Print "There are $M$ unique words in Hamlet.", where $M$ is the number of unique words.  As a sanity check, verify that $M < N$.
  * Your output should be 
  ```
  "There are 7456 unique words in Hamlet."
  ```

In [20]:
# Your code here

# Open file
with open("data/hamlet.txt", "r") as f:
    hamlettext = f.read()

print(type(hamlettext), "\n") # data type
print(len(hamlettext), "\n") # length
print(hamlettext[:500], "\n") # first 500 items

hamletwords = hamlettext.split() # get list of words in hamlet
print(type(hamletwords), "\n") # confirm that it's a list
print(hamletwords[0:10], "\n") # first 10 items
print("There are %d total words in Hamlet.\n" %len(hamletwords)) # Total words in hamlet

hamletwords_lc = [word.lower() for word in hamletwords] # convert to lowercase
print(hamletwords_lc.count("thou"), "\n") # occurrences of thou

uniquewords_lc = set(hamletwords_lc) # unique words
print(len(uniquewords_lc), len(hamletwords_lc), "\n")
print("There are {0} unique words in Hamlet.".format(len(uniquewords_lc)))

<class 'str'> 

173946 

﻿XXXX
HAMLET, PRINCE OF DENMARK

by William Shakespeare




PERSONS REPRESENTED.

Claudius, King of Denmark.
Hamlet, Son to the former, and Nephew to the present King.
Polonius, Lord Chamberlain.
Horatio, Friend to Hamlet.
Laertes, Son to Polonius.
Voltimand, Courtier.
Cornelius, Courtier.
Rosencrantz, Courtier.
Guildenstern, Courtier.
Osric, Courtier.
A Gentleman, Courtier.
A Priest.
Marcellus, Officer.
Bernardo, Officer.
Francisco, a Soldier
Reynaldo, Servant to Polonius.
Players.
Two Clowns,  

<class 'list'> 

['\ufeffXXXX', 'HAMLET,', 'PRINCE', 'OF', 'DENMARK', 'by', 'William', 'Shakespeare', 'PERSONS', 'REPRESENTED.'] 

There are 31659 total words in Hamlet.

95 

7456 31659 

There are 7456 unique words in Hamlet.
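
The `'\ufeff'` at the start of the first word in the output above is a byte order mark (BOM). Assuming the file is UTF-8 with a BOM, which the output suggests but does not prove, reading it with `encoding="utf-8-sig"` would drop the mark. The sketch below demonstrates the effect on a throwaway file rather than on `hamlet.txt` itself.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bom_example.txt")  # hypothetical file

# "utf-8-sig" writes a byte order mark at the start of the file.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("HAMLET, PRINCE OF DENMARK")

# Decoding as plain UTF-8 keeps the mark as the first character.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# Decoding as "utf-8-sig" drops it.
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom.split()[0]))
print(repr(without_bom.split()[0]))
```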


### Part 2.3:  Writing Files



In [4]:
my_ints = [i for i in range(-5, 6)]
my_ints2 = [i*i for i in my_ints]
print("Our list is {}.".format(my_ints2))

Our list is [25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25].


In [6]:
with open("datafile.txt", "w") as dataf:
    # header
    dataf.write("Here is a list of squared ints.\n\n")
    # Columns
    dataf.write("n")
    dataf.write(", ")
    dataf.write("n^2" + "\n")
    # Data
    for i, i2 in zip(my_ints, my_ints2):
        dataf.write("{}, {}\n".format(str(i), str(i2)))
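
For comma-separated output, the standard library's `csv` module can take care of separators and line endings. This is a sketch of the same table written that way, not a replacement for the cell above; the file name `squares.csv` and the temporary directory are assumptions made for the example.

```python
import csv
import os
import tempfile

my_ints = list(range(-5, 6))
my_ints2 = [i * i for i in my_ints]

path = os.path.join(tempfile.mkdtemp(), "squares.csv")  # hypothetical file name

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="") as dataf:
    writer = csv.writer(dataf)
    writer.writerow(["n", "n^2"])              # column names
    writer.writerows(zip(my_ints, my_ints2))   # one row per (n, n^2) pair

# Read the file back; every field comes back as a string.
with open(path, "r", newline="") as dataf:
    rows = list(csv.reader(dataf))

print(rows[0])
print(rows[1])
```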

### `json`

In [23]:
import json # import the json library

In [24]:
dog_shelter = {} # Initialize dictionary

# Set up dictionary elements
dog_shelter['dog1'] = {'name': 'Cloe', 'age': 3, 'breed': 'Border Collie', 'playgroup': 'Yes'}
dog_shelter['dog2'] = {'name': 'Karl', 'age': 7, 'breed': 'Beagle', 'playgroup': 'Yes'}

dog_shelter

{'dog1': {'age': 3,
  'breed': 'Border Collie',
  'name': 'Cloe',
  'playgroup': 'Yes'},
 'dog2': {'age': 7, 'breed': 'Beagle', 'name': 'Karl', 'playgroup': 'Yes'}}

In [25]:
dog_shelter['dog1']

{'age': 3, 'breed': 'Border Collie', 'name': 'Cloe', 'playgroup': 'Yes'}

In [26]:
dog_shelter['dog2']['name']

'Karl'

#### Writing to `json` file

In [27]:
with open('dog_shelter_info.txt', 'w') as output:  
    json.dump(dog_shelter, output)

In [29]:
!cat dog_shelter_info.txt #cat stands for concatenate

{"dog1": {"name": "Cloe", "age": 3, "breed": "Border Collie", "playgroup": "Yes"}, "dog2": {"name": "Karl", "age": 7, "breed": "Beagle", "playgroup": "Yes"}}

#### Reading from `json` file

In [30]:
with open('dog_shelter_info.txt', 'r') as f:
    dog_data = json.load(f)

In [31]:
print(dog_data)

{'dog1': {'name': 'Cloe', 'age': 3, 'breed': 'Border Collie', 'playgroup': 'Yes'}, 'dog2': {'name': 'Karl', 'age': 7, 'breed': 'Beagle', 'playgroup': 'Yes'}}
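
`json.dump` and `json.load` work with files, while `json.dumps` and `json.loads` work with strings. The sketch below uses the string versions to show a round trip without touching the disk, and uses `indent` and `sort_keys` to make the text easier to read. The names `pretty` and `restored` are chosen for the example.

```python
import json

dog_shelter = {
    "dog1": {"name": "Cloe", "age": 3, "breed": "Border Collie", "playgroup": "Yes"},
    "dog2": {"name": "Karl", "age": 7, "breed": "Beagle", "playgroup": "Yes"},
}

# dumps returns a string; indent and sort_keys only affect formatting.
pretty = json.dumps(dog_shelter, indent=2, sort_keys=True)
print(pretty)

# loads parses the string back into equal Python objects.
restored = json.loads(pretty)
print(restored == dog_shelter)
print(type(restored["dog1"]["age"]))
```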


In [32]:
for dogid, info in dog_data.items():
    print(dogid)
    print("{0} is a {1} year old {2}.".format(info['name'], info['age'], info['breed']))
    if info['playgroup'].lower() == 'yes':
        print("{0} can attend playgroup.".format(info['name']))
    else:
        print("{0} is not permitted at playgroup.".format(info['name']))
    print("======================================\n")

dog1
Cloe is a 3 year old Border Collie.
Cloe can attend playgroup.
======================================

dog2
Karl is a 7 year old Beagle.
Karl can attend playgroup.
======================================


##  Part 2: Regular Expressions



In [33]:
birthday = "June 11"

In [34]:
birth_month = birthday.strip()[:-3]
print(birth_month)

June


In [35]:
regex = r"\w+" # A first regular expression

In [36]:
import re # Regular expression module
months = re.search(regex, birthday) # Search string for regex
print(months)

<_sre.SRE_Match object; span=(0, 4), match='June'>


In [37]:
print("The matched pattern starts at index {0} and ends at index {1}.".format(months.start(), months.end()))

The matched pattern starts at index 0 and ends at index 4.


In [38]:
regex = r"June"
re.search(regex, birthday)

<_sre.SRE_Match object; span=(0, 4), match='June'>

In [39]:
re.search(r"Oct", birthday) # nothing prints out

In [40]:
months = re.search(r"Oct", birthday)
print(months) # printing the match object shows us the result, even if no match was found.

None


In [41]:
statement = "June is a lovely month."
regex = r"June"
fragment = statement[re.search(regex, statement).end():]
print(fragment)

 is a lovely month.
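
Because `re.search` returns `None` when nothing matches, calling `.end()` directly on its result raises an `AttributeError` for a pattern such as `r"Oct"`. A minimal sketch of a guard is shown below; the helper `text_after` is hypothetical and returns an empty string when there is no match.

```python
import re


def text_after(pattern: str, text: str) -> str:
    """Return the text following the first match, or "" if there is none (hypothetical helper)."""
    match = re.search(pattern, text)
    if match is None:
        return ""
    return text[match.end():]


print(repr(text_after(r"June", "June is a lovely month.")))
print(repr(text_after(r"Oct", "June is a lovely month.")))
```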


In [42]:
regex = r"\d+"
re.search(regex, birthday)

<_sre.SRE_Match object; span=(5, 7), match='11'>

In [43]:
regex = r"[A-Za-z]+"
re.search(regex, birthday)

<_sre.SRE_Match object; span=(0, 4), match='June'>

In [44]:
regex = r"[0-9]"
re.search(regex, birthday)

<_sre.SRE_Match object; span=(5, 6), match='1'>

In [45]:
regex_month = r"[A-Za-z]+"
month = re.findall(regex_month, birthday)
print(month)

regex_day = r"\d+"
day = re.findall(regex_day, birthday)
print(day)

['June']
['11']


In [46]:
birthdays = "June 11th, December 13th, September 21st, May 12th"

In [47]:
regex = r"[A-Za-z]+"
bdays = re.findall(regex, birthdays)
print(bdays)

['June', 'th', 'December', 'th', 'September', 'st', 'May', 'th']


In [48]:
regex = r"([A-Za-z]+) (\d+\w+)"
bdays = re.findall(regex, birthdays)
print(bdays)

[('June', '11th'), ('December', '13th'), ('September', '21st'), ('May', '12th')]


In [49]:
regex = r"[A-Za-z]+ \d+"
bdays = re.findall(regex, birthdays)
for bday in bdays:
    print(bday)

June 11
December 13
September 21
May 12
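
The last two cells either keep the ordinal suffix or lose the grouping into month and day. One way to get both, sketched here as an assumption about what a cleaner result could look like, is to use named groups with a non-capturing group for the suffix and iterate with `finditer`.

```python
import re

birthdays = "June 11th, December 13th, September 21st, May 12th"

# Named groups label each part; the ordinal suffix is matched but not captured.
pattern = re.compile(r"(?P<month>[A-Za-z]+) (?P<day>\d+)(?:st|nd|rd|th)")

# finditer yields one match object per birthday, in order of appearance.
parsed = [(m.group("month"), int(m.group("day"))) for m in pattern.finditer(birthdays)]
print(parsed)
```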


In [97]:
# your code here

# Open file
with open("data/shelterdogs.xml", "r") as f:
    dogs = f.read()

regex = r"<name> (.*) </name>" # regex to get the dog names
names = re.findall(regex, dogs) # find the names according to the regex
names

['Cloe', 'Karl']

In [98]:
# your code here

regex = r"<\?xml version=\"1.0\" encoding=\"UTF-8\"\?>[\n]+"
start = re.search(regex, dogs).end()
dogs = dogs[start:]
print(dogs)

<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
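
Regular expressions work for this small file, but they depend on the exact spacing inside each tag. As a swapped-in alternative, not the technique the exercise asks for, the standard library's `xml.etree.ElementTree` parses the structure itself. The sketch uses an inline, shortened copy of the XML printed above so that it does not need the data file.

```python
import xml.etree.ElementTree as ET

# Inline, shortened copy of the structure printed above, so the example needs no file.
xml_text = """<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
    </dog>
    <dog id="dog2">
        <name> Karl </name>
        <age> 7 </age>
    </dog>
</dogshelter>"""

root = ET.fromstring(xml_text)

# Element text keeps the surrounding spaces, so strip the name; int() ignores them.
dogs_by_id = {
    dog.get("id"): {"name": dog.findtext("name").strip(), "age": int(dog.findtext("age"))}
    for dog in root.findall("dog")
}
print(dogs_by_id)
```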



## Pandas



In [52]:
import pandas as pd

### Importing data



In [53]:
# Read in the csv files
dfcars=pd.read_csv("data/mtcars.csv")

# Display the header and the first five rows of data
dfcars.head()

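| | Unnamed: 0 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |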


Initial data exploration is as simple as a one-liner.

In [54]:
dfcars.describe()

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 |
| mean | 20.090625 | 6.1875 | 230.721875 | 146.6875 | 3.596563 | 3.21725 | 17.84875 | 0.4375 | 0.40625 | 3.6875 | 2.8125 |
| std | 6.026948 | 1.785922 | 123.938694 | 68.562868 | 0.534679 | 0.978457 | 1.786943 | 0.504016 | 0.498991 | 0.737804 | 1.6152 |
| min | 10.4 | 4.0 | 71.1 | 52.0 | 2.76 | 1.513 | 14.5 | 0.0 | 0.0 | 3.0 | 1.0 |
| 25% | 15.425 | 4.0 | 120.825 | 96.5 | 3.08 | 2.58125 | 16.8925 | 0.0 | 0.0 | 3.0 | 2.0 |
| 50% | 19.2 | 6.0 | 196.3 | 123.0 | 3.695 | 3.325 | 17.71 | 0.0 | 0.0 | 4.0 | 2.0 |
| 75% | 22.8 | 8.0 | 326.0 | 180.0 | 3.92 | 3.61 | 18.9 | 1.0 | 1.0 | 4.0 | 4.0 |
| max | 33.9 | 8.0 | 472.0 | 335.0 | 4.93 | 5.424 | 22.9 | 1.0 | 1.0 | 5.0 | 8.0 |


In [55]:
dfcars=dfcars.rename(columns={"Unnamed: 0":"car name"})
dfcars.head()

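| | car name | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |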


### Dataframes and Series

In [56]:
print(dfcars.shape)     # 12 columns, each of length 32
print(len(dfcars))      # the number of rows in the dataframe, also the length of a series
print(len(dfcars.mpg))  # the length of a series

(32, 12)
32
32


In [57]:
for ele in dfcars: # iterating over a dataframe yields its column names, like iterating over a dictionary's keys
    print(ele)

car name
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb


In [58]:
dfcars.columns

Index(['car name', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs',
       'am', 'gear', 'carb'],
      dtype='object')

In [59]:
for ele in dfcars.cyl:
    print(ele)

6
6
4
6
8
6
8
4
4
6
6
8
8
8
8
8
8
4
4
4
4
8
8
8
8
4
4
4
8
6
8
4


In [60]:
dfcars.head()

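| | car name | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |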


In [61]:
# index for the dataframe
print(list(dfcars.index))

# index for the cyl series
dfcars.cyl.index

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]


RangeIndex(start=0, stop=32, step=1)

In [62]:
# create values from 5 to 36
new_index = [i+5 for i in range(32)]

# new dataframe with indexed rows from 5 to 36
dfcars_reindex = dfcars.reindex(new_index)
dfcars_reindex.head()

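| | car name | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Valiant | 18.1 | 6.0 | 225.0 | 105.0 | 2.76 | 3.46 | 20.22 | 1.0 | 0.0 | 3.0 | 1.0 |
| 6 | Duster 360 | 14.3 | 8.0 | 360.0 | 245.0 | 3.21 | 3.57 | 15.84 | 0.0 | 0.0 | 3.0 | 4.0 |
| 7 | Merc 240D | 24.4 | 4.0 | 146.7 | 62.0 | 3.69 | 3.19 | 20.0 | 1.0 | 0.0 | 4.0 | 2.0 |
| 8 | Merc 230 | 22.8 | 4.0 | 140.8 | 95.0 | 3.92 | 3.15 | 22.9 | 1.0 | 0.0 | 4.0 | 2.0 |
| 9 | Merc 280 | 19.2 | 6.0 | 167.6 | 123.0 | 3.92 | 3.44 | 18.3 | 1.0 | 0.0 | 4.0 | 4.0 |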


In [63]:
dfcars_reindex.iloc[0:3]

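| | car name | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Valiant | 18.1 | 6.0 | 225.0 | 105.0 | 2.76 | 3.46 | 20.22 | 1.0 | 0.0 | 3.0 | 1.0 |
| 6 | Duster 360 | 14.3 | 8.0 | 360.0 | 245.0 | 3.21 | 3.57 | 15.84 | 0.0 | 0.0 | 3.0 | 4.0 |
| 7 | Merc 240D | 24.4 | 4.0 | 146.7 | 62.0 | 3.69 | 3.19 | 20.0 | 1.0 | 0.0 | 4.0 | 2.0 |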


In [64]:
dfcars_reindex.loc[0:7] # or dfcars_reindex.loc[5:7]

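| | car name | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Valiant | 18.1 | 6.0 | 225.0 | 105.0 | 2.76 | 3.46 | 20.22 | 1.0 | 0.0 | 3.0 | 1.0 |
| 6 | Duster 360 | 14.3 | 8.0 | 360.0 | 245.0 | 3.21 | 3.57 | 15.84 | 0.0 | 0.0 | 3.0 | 4.0 |
| 7 | Merc 240D | 24.4 | 4.0 | 146.7 | 62.0 | 3.69 | 3.19 | 20.0 | 1.0 | 0.0 | 4.0 | 2.0 |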


In [65]:
dfcars_reindex.iloc[2:5, 1:4]

| | mpg | cyl | disp |
|---|---|---|---|
| 7 | 24.4 | 4.0 | 146.7 |
| 8 | 22.8 | 4.0 | 140.8 |
| 9 | 19.2 | 6.0 | 167.6 |


In [66]:
dfcars_reindex.loc[7:9, ['mpg', 'cyl', 'disp']]

| | mpg | cyl | disp |
|---|---|---|---|
| 7 | 24.4 | 4.0 | 146.7 |
| 8 | 22.8 | 4.0 | 140.8 |
| 9 | 19.2 | 6.0 | 167.6 |
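
The cells above select the same rows in two ways: `iloc` counts positions and excludes the end of a slice, while `loc` uses index labels and includes it. The sketch below repeats the comparison on a small made-up frame whose index starts at 5, so the difference between position and label is visible without the car data.

```python
import pandas as pd

# Small hypothetical frame whose index labels start at 5, like dfcars_reindex.
df = pd.DataFrame(
    {"mpg": [18.1, 14.3, 24.4, 22.8], "cyl": [6, 8, 4, 4]},
    index=[5, 6, 7, 8],
)

by_position = df.iloc[0:2]   # rows at positions 0 and 1; the end is excluded
by_label = df.loc[5:7]       # rows labelled 5 through 7; the end is included

print(list(by_position.index))
print(list(by_label.index))
```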


In [99]:
# your code here

column_1 = pd.Series(range(4)) # Q1
column_2 = pd.Series(range(4,8)) # Q2
table = pd.DataFrame({'col_1': column_1, 'col_2': column_2}) # Q3
table = table.rename(columns={"col_1": "Col_1", "col_2":"Col_2"}) # Q4

table = table.rename({0: "zero", 1: "one", 2: "two", 3: "three"})

table

| | Col_1 | Col_2 |
|---|---|---|
| zero | 0 | 4 |
| one | 1 | 5 |
| two | 2 | 6 |
| three | 3 | 7 |


### Reading `json` into `pandas` dataframe

In [68]:
from io import StringIO

# Load dog shelter data
with open('dog_shelter_info.txt', 'r') as f:
    dog_data = json.load(f)

dog_data_json_str = json.dumps(dog_data) # Convert data to json string
# Wrap the string in StringIO: recent pandas versions deprecate passing raw JSON text
df = pd.read_json(StringIO(dog_data_json_str)) # Convert to pandas dataframe
df.head() # Look at data

| | dog1 | dog2 |
|---|---|---|
| age | 3 | 7 |
| breed | Border Collie | Beagle |
| name | Cloe | Karl |
| playgroup | Yes | Yes |


##  Beautiful Soup 

### `requests`:  Retrieving Data from the Web

In [1]:
# You tell Python that you want to use a library with the import statement.
import requests

In [2]:
# Get the Harvard University Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

In [3]:
req

<Response [200]>

In [4]:
type(req)

requests.models.Response

In [5]:
dir(req)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [6]:
page = req.text
page[20000:30000]

's in the world.<sup id="cite_ref-7" class="reference"><a href="#cite_note-7">&#91;7&#93;</a></sup>\n</p><p>The Massachusetts colonial legislature, the <a href="/wiki/Massachusetts_General_Court" title="Massachusetts General Court">General Court</a>, authorized Harvard\'s founding. In its early years, <a href="/wiki/Harvard_College" title="Harvard College">Harvard College</a> primarily trained <a href="/wiki/Congregationalism_in_the_United_States" title="Congregationalism in the United States">Congregational</a> and <a href="/wiki/Unitarianism" title="Unitarianism">Unitarian</a> clergy, although it has never been formally affiliated with any <a href="/wiki/Religious_denomination" title="Religious denomination">denomination</a>. Its curriculum and student body were gradually secularized during the 18th century, and by the 19th century, Harvard had emerged as the central cultural establishment among <a href="/wiki/Boston_Brahmin" title="Boston Brahmin">the Boston elite</a>.<sup id="cite_

In [7]:
from bs4 import BeautifulSoup

In [8]:
soup = BeautifulSoup(page, 'html.parser')

In [9]:
type(soup)

bs4.BeautifulSoup

In [10]:
type(page)

str

In [11]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Harvard University - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"2fa708d3-0de4-4ea7-b767-5ff616858654","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1017838724,"wgRevisionId":1017838724,"wgArticleId":18426501,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","Articles with short description","Short

In [12]:
soup.title

<title>Harvard University - Wikipedia</title>

In [13]:
# Be careful with elements that show up multiple times.
soup.p

<p class="mw-empty-elt">
</p>

In [82]:
len(soup.find_all("p"))

75

In [83]:
soup.table["class"]

['infobox', 'vcard']

In [84]:
# the classes of all tables that have a class attribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]

[['infobox', 'vcard'],
 ['toccolours'],
 ['plainlinks', 'metadata', 'ambox', 'mbox-small-left', 'ambox-content'],
 ['multicol'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed'],
 ['wikitable'],
 ['nowraplinks', 'collapsible', 'collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'collapsible', 'collapsed', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'hlist', 'collapsible', 'autocollapse', 'navbox-inner'],
 [

In [85]:
table_demographics = soup.find_all("table", "wikitable")[2]

In [86]:
from IPython.display import HTML
HTML(str(table_demographics))

| | Undergrad | Graduate | U.S. census |
|---|---|---|---|
| Asian/Pacific Islander | 17% | 11% | 5% |
| Black/non-Hispanic | 6% | 4% | 12% |
| Hispanics of any race | 9% | 5% | 16% |
| White/non-Hispanic | 46% | 43% | 64% |
| Mixed race/other | 10% | 8% | 9% |
| International students | 11% | 27% | N/A |


In [87]:
rows = [row for row in table_demographics.find_all("tr")]
print(rows)

[<tr>
<th></th>
<th>Undergrad</th>
<th>Graduate</th>
<th>U.S. census</th>
</tr>, <tr>
<th>Asian/Pacific Islander</th>
<td>17%</td>
<td>11%</td>
<td>5%</td>
</tr>, <tr>
<th>Black/non-Hispanic</th>
<td>6%</td>
<td>4%</td>
<td>12%</td>
</tr>, <tr>
<th>Hispanics of any race</th>
<td>9%</td>
<td>5%</td>
<td>16%</td>
</tr>, <tr>
<th>White/non-Hispanic</th>
<td>46%</td>
<td>43%</td>
<td>64%</td>
</tr>, <tr>
<th>Mixed race/other</th>
<td>10%</td>
<td>8%</td>
<td>9%</td>
</tr>, <tr>
<th>International students</th>
<td>11%</td>
<td>27%</td>
<td>N/A</td>
</tr>]


In [88]:
header_row = rows[0]
HTML(str(header_row))

In [89]:
# A lambda expression returns the value of the expression in its body.
# In this case, it returns a string with newline characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")

#### Splitting the data


In [90]:
# the `if col.get_text()` filter skips the empty header cell in the upper left
columns = [rem_nl(col.get_text()) for col in header_row.find_all("th") if col.get_text()]
columns

['Undergrad', 'Graduate', 'U.S. census']

In [91]:
indexes = [row.find("th").get_text() for row in rows[1:]]
indexes

['Asian/Pacific Islander',
 'Black/non-Hispanic',
 'Hispanics of any race',
 'White/non-Hispanic',
 'Mixed race/other',
 'International students']

In [92]:
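def to_num(s):
    """Convert a percentage string such as '17%' to an int; return None otherwise."""
    s = s.strip()
    if s.endswith("%"): # endswith also copes with an empty string, unlike s[-1]
        return int(s[:-1])
    return None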

In [93]:
values = [to_num(value.get_text()) for row in rows[1:] for value in row.find_all("td")]
values

[17, 11, 5, 6, 4, 12, 9, 5, 16, 46, 43, 64, 10, 8, 9, 11, 27, None]

In [94]:
stacked_values_lists = [values[i::3] for i in range(len(columns))]
stacked_values_lists

[[17, 6, 9, 46, 10, 11], [11, 4, 5, 43, 8, 27], [5, 12, 16, 64, 9, None]]

In [95]:
stacked_values = zip(*stacked_values_lists)
list(stacked_values)

[(17, 11, 5), (6, 4, 12), (9, 5, 16), (46, 43, 64), (10, 8, 9), (11, 27, None)]
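
The two cells above first slice the flat list into columns and then use `zip(*...)` to turn those columns back into rows. The sketch below repeats both steps on the same values and checks that the result matches cutting the flat list into consecutive chunks. The names `n_cols`, `column_lists`, `row_tuples`, and `row_chunks` are chosen for the example.

```python
# Flat list of cell values in row-major order, copied from the output above.
values = [17, 11, 5, 6, 4, 12, 9, 5, 16, 46, 43, 64, 10, 8, 9, 11, 27, None]
n_cols = 3

# values[i::n_cols] picks every n_cols-th item starting at i: one column each.
column_lists = [values[i::n_cols] for i in range(n_cols)]

# zip(*column_lists) pairs up the i-th items of every column: one row each.
row_tuples = list(zip(*column_lists))

# The same rows can be cut directly from the flat list in chunks of n_cols.
row_chunks = [tuple(values[i:i + n_cols]) for i in range(0, len(values), n_cols)]

print(column_lists[0])
print(row_tuples[0])
print(row_tuples == row_chunks)
```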

In [96]:
# Here's the original HTML table for visual understanding
HTML(str(table_demographics))

| | Undergrad | Graduate | U.S. census |
|---|---|---|---|
| Asian/Pacific Islander | 17% | 11% | 5% |
| Black/non-Hispanic | 6% | 4% | 12% |
| Hispanics of any race | 9% | 5% | 16% |
| White/non-Hispanic | 46% | 43% | 64% |
| Mixed race/other | 10% | 8% | 9% |
| International students | 11% | 27% | N/A |
