# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2018-CS109A/blob/master/content/styles/iacs.png?raw=true"> CS-S109A Introduction to Data Science 

## Lecture 2 (Pandas + Beautiful Soup)

**Harvard University**<br>
**Summer 2020**<br>
**Instructors:** Kevin Rader<br>
**Authors:** Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, Kevin Rader

---

In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Table of Contents 
<ol start="0">
<li> Learning Goals </li>
<li> Data without Pandas </li>
<li> Loading and Cleaning with Pandas  </li>
<li> Combining Data Sources </li>
<li> Basic Scraping with Beautiful Soup </li>
</ol>

## Learning Goals

This Jupyter notebook accompanies Lecture 2. By the end of this lecture, you should be able to:

- Appreciate that base Python is not great for most data handling.
- Understand why and how Pandas can be useful. 
- Use Pandas to:
    - Load data into a DataFrame
    - Access subsets of data based on column and row values
    - Address missing values (e.g., `NaN`)
    - Use `groupby()` to select sections of data.
    - Plot DataFrames (e.g., barplot())
- Use Beautiful Soup to download a webpage and all of its links, and begin to learn to parse out tables, links, etc.

## Part 1: Processing Data without Pandas 

`../data/top50.csv` is a dataset found online (Kaggle.com) that contains information about the 50 most popular songs on Spotify in 2019.

Each row represents a distinct song.
The columns (in order) are:
```
ID: a unique ID (i.e., 1-50)
TrackName: Name of the Track
ArtistName: Name of the Artist
Genre: the genre of the track
BeatsPerMinute: The tempo of the song.
Energy: The energy of a song - the higher the value, the more energetic the song.
Danceability: The higher the value, the easier it is to dance to this song.
Loudness: The higher the value, the louder the song.
Liveness: The higher the value, the more likely the song is a live recording.
Valence: The higher the value, the more positive mood for the song.
Length: The duration of the song (in seconds).
Acousticness: The higher the value, the more acoustic the song is.
Speechiness: The higher the value, the more spoken words the song contains.
Popularity: The higher the value, the more popular the song is.
```

In [3]:
from PIL import Image

In [None]:
Image.open("fig/top50_screenshot.png") # sample of the data

### Read and store `../data/top50.csv`

**Q1.1:** Read in the `../data/top50.csv` file and store all of its contents into any data structure(s) that make the most sense to you, keeping in mind that you'd want to easily access any row or column.  Why does a dictionary make the most sense to use for data storage?

In [None]:
f = open("../data/top50.csv")
column_names = f.readline().strip().split(",")[1:] # puts names in a list
cleaned_column_names = [name.strip('"') for name in column_names] # removes the extraneous quotes
cleaned_column_names.insert(0, "ID")

dataset = []

# iterates through each line of the .csv file
for line in f:
    attributes = line.strip().split(",")
    
    # constructs a new dictionary for each line, and
    # appends this dictionary to the `dataset`;
    # thus, the dataset is a list of dictionaries (1 dictionary per song)
    dataset.append(dict(zip(cleaned_column_names, attributes)))
    
# dataset[0:2]
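
One caveat about the cell above: splitting each line on `","` breaks as soon as a field itself contains a comma, which song titles often do. As a minimal sketch, Python's standard `csv` module handles quoted fields for us. The two-song sample below is made up for illustration and is not taken from `top50.csv`.

```python
import csv
import io

# Hypothetical two-row sample; the second title contains a comma inside quotes.
sample_csv = io.StringIO(
    'ID,TrackName,ArtistName,Length\n'
    '1,"Song A",Artist X,200\n'
    '2,"Hello, Goodbye",Artist Y,250\n'
)

# DictReader uses the header row as keys and yields one dict per data row.
sample_dataset = list(csv.DictReader(sample_csv))

for song in sample_dataset:
    print(song["ArtistName"], "-", song["TrackName"], "-", song["Length"])
```

Note that every value is still a string, so numeric columns need an explicit `int(...)` before comparing them.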

**Q1.2:** Write code to print all songs (Artist and Track name) that are longer than 4 minutes (240 seconds):

In [None]:

########
# your code below: uncomment and fill in the ****
########

#for song in ****:
     #if int(song[****]) > ****:
        # print(****, "-", ****, "is", **** ,"seconds long")
        


**Q1.3:** Write code to print the most popular song (or song(s) if there is a tie):

In [None]:
########
# your code below: uncomment and fill in the ****
########

max_score = -1
most_populars = set()
# for ***:
#    if int(song["Popularity"]) > max_score:
#        most_populars = set([str(song["ArtistName"] + "-" + song["TrackName"])])
#        max_score = int(song["Popularity"])
    # elif ****:
    #    most_populars.add(****)
# print(most_populars)




**Q1.4:** How would you print the songs (and their attributes) in sorted order by their popularity (highest scoring ones first)?

*your answer here*

**Q1.5**: How could you check for null/empty entries?

*your answer here*

Often, one dataset doesn't contain all of the information you are interested in, in which case you need to combine data from multiple files.

**Q1.6:** Imagine we had another table (i.e., .csv file) below. How could we combine its data with our already-existing *dataset*?

*your answer here*

## Part 2: Processing Data _with_ Pandas 

**Pandas** is an _open-source_ Python library designed for **data analysis and processing.** Being _open-source_ means that anyone can contribute to it (don't worry, a team of people vets all official updates to the library). Pandas provides high-performance, easy-to-use data structures. Namely, instead of N-dimensional arrays like NumPy's (which are extremely fast, though), Pandas provides a 2D-table object called a **DataFrame**.

As a very gross simplification: **NumPy** is great for performing math operations with matrices, whereas **Pandas** is excellent for wrangling, processing, and understanding 2D data like spreadsheets (2D data like spreadsheets is very common and great).

Let's get started with simple examples of how to use Pandas. We will continue with our ``top50.csv`` Spotify music data.

First, we need to import pandas so that we have access to it. For typing convenience, we choose to rename it as ``pd``, which is common practice.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

### Reading in the data
Pandas allows us to [read in various structured files](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) (e.g., .csv, .json, .html, etc) with just one line:

In [None]:
# some files need an explicit encoding to handle special characters;
# if this call raises a UnicodeDecodeError, pass an `encoding=` argument
# (e.g., encoding="ISO-8859-1")
top50 = pd.read_csv("../data/top50.csv")

### High-level view of the data

We can view the data frame by simply printing it:

In [None]:
top50

Recall that we can also inspect the file by looking at just the first N rows or last N rows (instead of printing the entire dataframe).

In [None]:
# top50.head(5) # first 5 rows
top50.tail(3) # last 3 rows

**Q2.1:** That's cool, but we can't see all of the columns too well. Write code to print out the columns of 'top50' and the code to calculate number of columns in it.

In [None]:
######
# your code here
######


Fortunately, many of the features in our dataset are numeric. Conveniently, Pandas' `describe()` function calculates basic statistics for our columns. It's pretty amazing, as it gives us a very coarse-grained way to understand our data and check for errors. That is, if we notice any summary statistics that are drastically different from what we deem reasonable, we should dive deeper and figure out why the values are what they are.

In [None]:
top50.describe()

**Q2.2:** Which of the variables above appears to be the most skewed?  Investigate its skew with a histogram.

In [None]:
######
# your code here
######



*your answer here* 



Notice, it calculated statistics only for the columns that are of numeric data types. What about the textual ones (e.g., Track name and Artist)? Pandas is smart enough to infer the data types. **Don't forget to inspect the columns that are text-based though, as we need to ensure they are sound, too.**

To view the data type of each column:

In [None]:
top50.dtypes

**Q2.3:** Write code to obtain the table of frequencies for any categorical variables in the dataset.

In [None]:
#######
# your code here
#######


### Exploring the data

Pandas' inferred data types (see `top50.dtypes` above) look reasonable. If any column contained floating point numbers, we would expect to see `float64` there, too.

Now that we've viewed our dataset at a high-level, let's actually use and explore it.

Recall: we can **access a column of data** the same way we access a dictionary's values by key:

In [None]:
top50["Length"]

We could have also used this syntax (identical results):

In [None]:
top50.Length

If we want just the highest or lowest **value** of a given column, we can use the functions ``max()`` and ``min()``, respectively.

In [None]:
top50['Length'].max()

In [None]:
top50['Length'].min()

If we want the **row index** that corresponds to a column's max or min value, we can use ``idxmax()`` and ``idxmin()``, respectively.

In [None]:
top50['Length'].idxmax()

In [None]:
top50['Length'].idxmin()

We can also add `conditional statements` (e.g., >, <, ==) for columns, which yields a boolean vector:

In [None]:
top50['Length'] > 240

This is useful, as it allows us to process only the rows with the True values.
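
As a small sketch of what that boolean vector lets us do, the following uses a made-up three-song table (the name `songs` and its values are assumptions for illustration, not part of `top50`):

```python
import pandas as pd

# Hypothetical mini-table standing in for top50 (values are made up).
songs = pd.DataFrame({
    "TrackName": ["Song A", "Song B", "Song C"],
    "Length": [200, 250, 300],
})

mask = songs["Length"] > 240   # boolean Series, one entry per row
print(mask.sum())              # True counts as 1, so this counts the matching rows
print(songs[mask])             # keeps only the rows where the mask is True
```

Summing a mask is a quick way to count how many rows satisfy a condition before deciding what to do with them.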

The **`loc`** indexer (used with square brackets, not parentheses) allows us to access data via labels:
- A single scalar label
- A list of labels
- A slice object
- A Boolean array

A single scalar:

In [None]:
# single scalar label
top50.loc[0] # prints the (unnamed) row that has a label of 0 (the 1st row)

In [None]:
# list of labels
top50.loc[[0,2]] # prints the (unnamed) rows that have the labels of 0 and 2 (the 1st and 3rd rows)

In [None]:
# a slice of the dataframe, based on the passed-in booleans;
# picture it's like a filter overlaying the DataFrame, and the filter
# dictates which values will be emitted/make it through to us

top50.loc[top50['Length'] > 240] # prints all rows that have Length > 240

Note, this returns a *DataFrame*. Everything we've learned so far concerns how to use DataFrames, so we can tack on additional syntax to this command if we wish to do further processing.

For example, if we want to index just select columns (e.g., ArtistName, TrackName, and Length) of this returned DataFrame:

In [None]:
top50.loc[top50['Length'] > 240][['ArtistName', 'TrackName', 'Length']]

Note, the above solves our original **Q1.2:** _(Write code to print all songs (Artist and Track name) that are longer than 4 minutes (240 seconds))_
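
The row filter and the column selection can also be combined into a single `.loc[rows, columns]` call instead of chaining two sets of brackets. Here is a minimal sketch on made-up data (the `songs` table is hypothetical):

```python
import pandas as pd

# Hypothetical mini-table standing in for top50 (values are made up).
songs = pd.DataFrame({
    "ArtistName": ["Artist X", "Artist Y", "Artist Z"],
    "TrackName": ["Song A", "Song B", "Song C"],
    "Length": [200, 250, 300],
    "Popularity": [90, 85, 88],
})

# First argument: which rows (a boolean mask); second: which columns (a list of labels).
long_songs = songs.loc[songs["Length"] > 240, ["ArtistName", "TrackName", "Length"]]
print(long_songs)
```

The single-call form is also the one to prefer when assigning values, since chained indexing may operate on a temporary copy.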

**Q2.4:** Write code to print the most popular song (or song(s) if there is a tie):

In [None]:
#######
# your code here
#######

We can also sort our data by a single column! This pertains to our original **Q1.4**!

**Q2.5:** Write code to print the songs (and their attributes), if we sorted by their popularity (highest scoring ones first).

In [None]:
# use top50.sort_values() to answer this question

#######
# your code here
#######



While ``.loc[]`` allows us to index based on passed-in labels, ``.iloc[]`` allows us to **access data based on 0-based indices.**

The syntax is ``.iloc[<row selection>, <column selection>]``, where `<row selection>` and `<column selection>` can be scalars, lists, or slices of indices.

In [None]:
top50.iloc[5:6] # prints all columns for the 6th row

In [None]:
top50.iloc[:,2] # prints all rows for the 3rd column

In [None]:
top50.iloc[[0,2,3], [2,1]] # prints the 1st, 3rd, and 4th rows of the 3rd and 2nd columns (artist and track)
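
With the default row labels 0, 1, 2, ... the difference between labels and positions is easy to miss. The sketch below uses a made-up table whose row labels are 10, 20, 30 (an assumption chosen purely to make the difference visible):

```python
import pandas as pd

# Hypothetical data with non-default row labels.
songs = pd.DataFrame(
    {"TrackName": ["Song A", "Song B", "Song C"], "Length": [200, 250, 300]},
    index=[10, 20, 30],
)

by_label = songs.loc[20, "TrackName"]   # the row whose label is 20
by_position = songs.iloc[0, 0]          # first row, first column
label_slice = songs.loc[10:20]          # label slices include the end label
position_slice = songs.iloc[0:1]        # position slices exclude the end position

print(by_label, by_position, len(label_slice), len(position_slice))
```

The differing slice behavior is a common source of off-by-one surprises when switching between the two indexers.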

### Inspecting/cleaning the data

As mentioned, it is imperative to ensure the data is sound to use:
1. Did it come from a trustworthy, authoritative source?
2. Is the data a complete sample?
3. Does the data seem correct?
4. **(optional)** Is the data stored efficiently or does it have redundancies?

Let's walk through each of these points now:

1. Did it come from a trustworthy, authoritative source?

The data came from Kaggle.com, which anyone can publish to. However, the author claims that he/she used Spotify.com's official API to query songs in 2019. There are no public comments for it so far. It's potentially credible.

2. Is the data a complete sample?

Pandas has functions named ``isnull()`` and ``notnull()``, which return boolean masks marking the null or non-null entries, respectively. Indexing a DataFrame with such a mask returns the matching rows.

For example:

In [None]:
top50[top50.ArtistName.isnull()] # returns an empty DataFrame

In [None]:
top50[top50.ArtistName.notnull()] # returns the complete DataFrame since there are no null Artists

If we run this for all of our features/columns, we will see there are no nulls. Since this dataset is manageable in size, you can also just scroll through it and notice no nulls.

This answers our original **Q1.5**: How could you check for null/empty entries?
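
Rather than checking one column at a time, the null check can be run over every column at once. A minimal sketch, using a made-up table with two deliberately missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing title and one missing length.
songs = pd.DataFrame({
    "TrackName": ["Song A", None, "Song C"],
    "Length": [200, 250, np.nan],
})

null_counts = songs.isnull().sum()                    # missing entries per column
rows_with_nulls = songs[songs.isnull().any(axis=1)]   # rows with at least one null

print(null_counts)
print(rows_with_nulls)
```

If every entry of `null_counts` is zero, the table has no nulls and no scrolling is needed.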

Continuing with our data sanity check list:

3. Does the data seem correct?

A quick scroll through the data, and we see a song by _Maluma_ titled _0.95833333_. This is possibly a song about probability, but I think the chances are slim. The song is 176 seconds long (2m56s). Looking on Spotify, we see **Maluma's most popular song is currently _11PM_ which is 2m56s in length!** Somehow, during the creation of the dataset, 11PM became 0.95833333. _Bonus points if you can figure out where this floating point number could have come from._

In [None]:
Image.open("fig/maluma.png") # screenshot of the Maluma entry

Since only one song seems obviously wrong, we can manually fix it. It's worth noting the fix for ourselves and for anyone else who sees our results or receives a copy of our data. If there were many more wrong values, we would probably not fix them by hand and would explore other options instead.

In [None]:
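# a single .loc call selects the rows and the column together; the chained form
# top50['TrackName'][mask] = ... triggers a SettingWithCopyWarning and may not update top50
top50.loc[top50['ArtistName'] == "Maluma", 'TrackName'] = "11PM"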

## Part 3: Grouping and Combining Multiple Data Frames

As mentioned, one dataset often doesn't contain all of the information you are interested in, in which case you need to combine data from multiple files. This also means you need to verify the accuracy (per above) of each dataset.

Pandas' ``groupby()`` function splits the DataFrame into different groups, depending on the passed-in variable. For example, we can group our data by the genres:

In [None]:
grouped_df = top50.groupby('Genre')
#for key, item in grouped_df:
#    print("Genre:", key, "(", len(grouped_df.get_group(key)), "items):", grouped_df.get_group(key), "\n\n")
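
Grouping is usually followed by a per-group summary. As a minimal sketch on made-up data (the genres and scores below are placeholders, not values from `top50`):

```python
import pandas as pd

# Hypothetical mini-table (values are made up).
songs = pd.DataFrame({
    "Genre": ["pop", "pop", "latin"],
    "Popularity": [90, 80, 88],
})

genre_sizes = songs.groupby("Genre").size()                     # rows per genre
genre_mean_pop = songs.groupby("Genre")["Popularity"].mean()    # mean score per genre

print(genre_sizes)
print(genre_mean_pop)
```

Both results are Series indexed by the group key, so `genre_sizes["pop"]` looks up a single group.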

``../data/spotify_aux.csv`` contains the same 50 songs as ``top50.csv``; however, it only contains 3 columns:
- Track Name
- Artist Name
- Explicit Language (boolean valued)

Note, that 3rd column is just random values, but pretend as if it's correct. The point of this section is to demonstrate how to merge columns together.

Let's load ``../data/spotify_aux.csv`` into a DataFrame:

In [None]:
explicit_lyrics = pd.read_csv("../data/spotify_aux.csv")
#explicit_lyrics

Let's merge it with our ``top50`` DataFrame.

``.merge()`` is a Pandas function that joins DataFrames side by side, matching rows on the values of one or more shared key columns.

``.concat()`` is a Pandas function that stacks DataFrames along an axis: by rows by default, or by columns if you pass `axis=1`.

In [None]:
# 'on=' specifies the column used as the shared key
df_combined = pd.merge(explicit_lyrics, top50, on='TrackName')
#df_combined

We see that all columns from both DataFrames have been added. That's nice, but having a duplicate ArtistName column (one copy from each DataFrame) is unnecessary. Since ``merge()`` takes DataFrames as arguments, we can simply pass merge() a stripped-down copy of `explicit_lyrics` containing only _TrackName_ and _ExplicitLanguage_, which keeps merge() from adding any redundant fields. 

In [None]:
df_combined = pd.merge(explicit_lyrics[['TrackName', 'ExplicitLanguage']], top50, on='TrackName')
#df_combined

This answers our original **Q1.6:** Imagine we had another table (i.e., .csv file) below. How could we combine its data with our already-existing *dataset*?
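
Another option, sketched here on made-up data, is to merge on both shared columns at once. This avoids duplicated columns and also guards against two different artists sharing a track title. The tables `tracks` and `aux` are hypothetical stand-ins for `top50` and `explicit_lyrics`.

```python
import pandas as pd

# Hypothetical stand-ins for top50 and explicit_lyrics (values are made up).
tracks = pd.DataFrame({
    "TrackName": ["Song A", "Song B"],
    "ArtistName": ["Artist X", "Artist Y"],
    "Length": [200, 250],
})
aux = pd.DataFrame({
    "TrackName": ["Song B", "Song A"],
    "ArtistName": ["Artist Y", "Artist X"],
    "ExplicitLanguage": [True, False],
})

# Passing a list to `on` requires a match on every listed column.
combined = pd.merge(aux, tracks, on=["TrackName", "ArtistName"])
print(combined)
```

The row order of the two inputs does not need to agree; rows are matched by key values, not by position.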

While we do not exhaustively illustrate Pandas' joining/splitting functionality, you may find the following functions useful:
- ``merge()``
- ``concat()``
- ``aggregate()``
- ``append()`` (removed in pandas 2.0; use ``concat()`` instead)
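
As a brief sketch of two of these on made-up data, `concat()` can stack tables by rows or by columns, and `aggregate()` can compute several summaries in one call. All names and values below are placeholders.

```python
import pandas as pd

# Hypothetical one-row tables (values are made up).
first = pd.DataFrame({"TrackName": ["Song A"], "Length": [200]})
second = pd.DataFrame({"TrackName": ["Song B"], "Length": [250]})
extra = pd.DataFrame({"Popularity": [90]})

stacked = pd.concat([first, second], ignore_index=True)   # row-wise, with a fresh 0..n-1 index
side_by_side = pd.concat([first, extra], axis=1)          # column-wise, aligned on the row index
summary = stacked["Length"].aggregate(["min", "max"])     # several statistics at once

print(stacked)
print(side_by_side)
print(summary)
```

Row-wise `concat()` covers what `append()` did in older pandas versions.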

### Plotting DataFrames
As a very simple example of how one can plot elements of a DataFrame, we turn to Pandas' built-in plotting:

In [None]:
scatter_plot = top50.plot.scatter(x='Danceability', y='Popularity', c='DarkBlue')

**Q3.1:** Alternatively, use `plt.scatter` to recreate the scatterplot above.


In [None]:
######
# your code here
######



This shows the lack of a correlation between the Danceability of a song and its popularity, based on just the top 50 songs, of course.

Please feel free to experiment with plotting other items of interest, and we recommend using Seaborn.

## Practice Problems 

**Q3.2:** Print the shortest song (all features):

In [None]:
######
# your code here
######



**Q3.3:** Print the 5 shortest songs (all features):

In [None]:
######
# your code here
######



**Q3.4:**  What is the average length of the 5 shortest songs?

In [None]:
######
# your code here
######



**Q3.5:**  How many distinct genres are present in the top 50 songs?

In [None]:
######
# your code here
######



**Q3.6:**  Print the songs that have a Danceability score above 80 and a popularity above 86. HINT: you can combine conditional statements with the & operator, and each condition must be wrapped in ( ) parentheses.

In [None]:
######
# your code here
######


**Q3.7:**  Plot a histogram of the Genre counts (x-axis is the Genres, y-axis is the # of songs with that Genre)

In [None]:
######
# your code here
######

**Q3.8 (open ended):** Think of a _subset_ of the data that you're interested in. Think of an interesting plot that could be shown to illustrate that data. With a partner, discuss whose would be easier to create. Together, create that plot. Then, try to create the harder plot.

In [None]:
######
# your code here
######

##  Part 4: Beautiful Soup 
Data Engineering, the process of gathering and preparing data for analysis, is a very big part of Data Science.

Datasets might not be formatted in the way you need (e.g. you have categorical features but your algorithm requires numerical features); or you might need to cross-reference some dataset to another that has a different format; or you might be dealing with a dataset that contains missing or invalid data.

These are just a few examples of why data retrieval and cleaning are so important.

---

### `requests`:  Retrieving Data from the Web

In HW1, you will be asked to retrieve some data from the Internet. `Python`'s standard library includes modules for exactly that (e.g. `urllib`, or `urllib2` in Python 2), and there are lower-level third-party packages such as `urllib3`.

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.

Luckily, as with most tasks in `Python`, someone has developed a library that simplifies these tasks. In reality, the requests made both in this lab and in HW1 are fairly simple, and could easily be done using one of the built-in libraries. However, it is better to get acquainted with `requests` as soon as possible, since you will probably need it in the future.

In [None]:
# You tell Python that you want to use a library with the import statement.
import requests

Now that the requests library has been imported into our namespace, we can use the functions offered by it.

In this case we'll use the appropriately named `get` function to issue a *GET* request. This is equivalent to typing a URL into your browser and hitting enter.

In [None]:
# Get the HU Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

Python is an object-oriented language, and everything in it is an object. Even built-in functions such as `len` largely delegate to special methods of the object (`len(x)` calls the object's `__len__()` method).

We will not dwell too long on OO concepts, but some of Python's idiosyncrasies will be easier to understand if we spend a few minutes on this subject.

When you evaluate an object itself, such as the `req` object we created above, Python will automatically call the `__str__()` or `__repr__()` method of that object. The default values for these methods are usually very simple and boring. The `req` object however has a custom implementation that shows the object type (i.e. `Response`) and the HTTP status code (200 means the request was successful).
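
To make these special methods concrete, here is a minimal sketch of a made-up class (the name `Playlist` is hypothetical and has nothing to do with `requests`) that customizes both `len` and its displayed representation, mimicking the `<Response [200]>` style:

```python
class Playlist:
    """Hypothetical class used only to illustrate special methods."""

    def __init__(self, songs):
        self.songs = list(songs)

    def __len__(self):
        # len(playlist) ends up calling this method
        return len(self.songs)

    def __repr__(self):
        # shown when the object is evaluated in a notebook cell or the REPL
        return f"<Playlist [{len(self)} songs]>"


playlist = Playlist(["Song A", "Song B"])
print(len(playlist))
print(repr(playlist))
```

Without the custom `__repr__`, evaluating `playlist` would show only the default class name and memory address.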

In [None]:
req

Just to confirm, we will call the `type` function on the object to make sure it agrees with the value above.

In [None]:
type(req)

Another very nifty Python function is `dir`. You can use it to list all the properties of an object.

By the way, properties whose names start with a single or double underscore are usually not meant to be called directly.

In [None]:
dir(req)

Right now `req` holds a reference to a *Response* object; but we are interested in the text associated with the web page, not the object itself.

So the next step is to assign the value of the `text` property of this `Response` object to a variable.

In [None]:
page = req.text
page[20000:30000]

Great! Now we have the text of the Harvard University Wikipedia page. But this mess of HTML tags would be a pain to parse manually. Which is why we will use another very cool Python library called `BeautifulSoup`.

### `BeautifulSoup`

Parsing data would be a breeze if we could always use well formatted data sources, such as CSV, JSON, or XML; but some formats such as HTML are both very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as unclosed tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.

You'll notice that the `import` statement below is different from what we used for `requests`. The _from library import thing_ pattern is useful when you don't want to reference a function by its full name (like we did with `requests.get`), but you also don't want to import every single thing from that library into your namespace.

In [None]:
from bs4 import BeautifulSoup

`BeautifulSoup` can deal with `HTML` or `XML` data, so the next line parses the contents of the `page` variable using its `HTML` parser, and assigns the result of that to the `soup` variable.

In [None]:
soup = BeautifulSoup(page, 'html.parser')

In [None]:
type(soup)

Doesn't look much different from the `page` object representation. Let's make sure the two are different types.

In [None]:
type(page)

Looks like they are indeed different.

`BeautifulSoup` objects have a cool little method that allows you to see the `HTML` content in a nice, indented way.

In [None]:
print(soup.prettify()[:1000])

Looks like it's our page!

We can now reference elements of the `HTML` document in different ways. One very convenient way is by using the dot notation, which allows us to access the elements as if they were properties of the object.

In [None]:
soup.title

This is nice for `HTML` elements that only appear once per page, such as the `title` tag. But what about elements that can appear multiple times?

In [None]:
# Be careful with elements that show up multiple times.
soup.p

Uh oh: `soup.p` returned only the first `<p>` element. It turns out the attribute syntax in `BeautifulSoup` is what is called *syntactic sugar*: it is shorthand for finding the first matching tag. That's why it is safer to use the explicit methods behind that syntactic sugar. These are:
* `BeautifulSoup.find` for getting single elements, and 
* `BeautifulSoup.find_all` for retrieving multiple elements.

In [None]:
len(soup.find_all("p"))
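
The difference is easier to see on a tiny document. The sketch below parses a made-up HTML fragment (not taken from the Wikipedia page) and compares the three access styles:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for illustration.
html = "<html><body><p>first</p><p>second</p></body></html>"
mini_soup = BeautifulSoup(html, "html.parser")

first_p = mini_soup.p               # dot notation: first matching tag only
same_p = mini_soup.find("p")        # explicit equivalent of the line above
all_p = mini_soup.find_all("p")     # every matching tag, in document order
missing = mini_soup.find("h1")      # None when nothing matches

print(first_p.get_text(), len(all_p), missing)
```

Because `find` returns `None` for a missing tag, it is worth checking the result before calling methods such as `get_text()` on it.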

If you look at the Wikipedia page in your browser, you'll notice that it has a couple of tables in it. We will be working with the "Demographics" table, but first we need to find it.

One of the `HTML` attributes that will be very useful to us is the `class` attribute.

Getting the class of a single element is easy!

In [None]:
soup.table["class"]

Next we will use a *list comprehension* to see all the tables that have a `class` attribute. 

In [None]:
# the classes of all tables that have a class attribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]

As already mentioned, we will be using the Demographics table for this lab. The next cell contains the `HTML` elements of said table. We will render it in different parts of the notebook to make it easier to follow along the parsing steps.

In [None]:
table_demographics = soup.find_all("table", "wikitable")[2]

In [None]:
from IPython.core.display import HTML
HTML(str(table_demographics))

First we'll use a list comprehension to extract the row (*tr*) elements.

In [None]:
rows = [row for row in table_demographics.find_all("tr")]
print(rows)

In [None]:
header_row = rows[0]
HTML(str(header_row))

We will then use a `lambda` expression to replace new line characters with spaces. `Lambda` expressions are to functions what list comprehensions are to lists: namely a more concise way to achieve the same thing.

In reality, both lambda expressions and list comprehensions are a little different from their function and loop counterparts. But for the purposes of this class we can ignore those differences.

In [None]:
# Lambda expressions return the value of the expression inside it.
# In this case, it will return a string with new line characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")
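
For comparison, here is a minimal sketch of the same helper written as a regular function; `rem_nl_def` and the sample string are names made up for this illustration:

```python
# The lambda from the cell above.
rem_nl = lambda s: s.replace("\n", " ")

# The same behavior as a named function.
def rem_nl_def(s):
    """Replace new line characters with spaces."""
    return s.replace("\n", " ")

sample = "Undergraduate\nstudents"   # hypothetical header text
print(rem_nl(sample))
print(rem_nl_def(sample))
```

Both calls return the same string; the `def` form simply leaves room for a docstring and multiple statements.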

#### Splitting the data
Next we extract the text value of the columns. If you look at the table above, you'll see that we have three columns and six rows.

Here we're doing the following:
* Taking the first row (`Python` indices start at zero)
* Iterating over the *th* elements inside it
* Taking the text value of those elements

We should end up with a list of column names.

But there is one little caveat: the first column of the table is actually an empty string (look at the cell right above the row names). We could add it to our list and then remove it afterwards; but instead we will use the `if` statement inside the list comprehension to filter that out.

In the following cell, `get_text` will return an empty string for the first cell of the table, which means that the test will fail and the value will not be added to the list.

In [None]:
# the if col.get_text() takes care of no-text in the upper left
columns = [rem_nl(col.get_text()) for col in header_row.find_all("th") if col.get_text()]
columns

Now let's do the same for the rows. Notice that since we have already parsed the header row, we will continue from the second row.

In [None]:
indexes = [row.find("th").get_text() for row in rows[1:]]
indexes

Now we want to transform the strings in the cells to integers.  To do this, we follow a very common `Python` pattern:
1. Check if the last character of the string is a percent sign
2. If it is, then convert the characters before the percent sign to integers
3. If one of the prior checks fails, return a value of `None`

These steps can be conveniently packaged into a function using `if-else` statements.

In [None]:
def to_num(s):
    if s[-1] == "%":
        return int(s[:-1])
    else:
        return None

Notice the `Python` slices are open on the upper bound. So the `[:-1]` construct will return all elements of the string, except for the last.

Another nice way to write our `to_num` function would be
```python
def to_num(s):
    return int(s[:-1]) if s[-1] == "%" else None
```
Notice that we only had to write `return` one time and everything conveniently fits on one line.  I'll leave it up to you to decide if it's readable or not.
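
Both versions assume a non-empty string whose very last character is the percent sign. Scraped cell text often carries trailing whitespace or a newline, and an empty string would raise an `IndexError` at `s[-1]`. A more defensive variant might look like the sketch below; the name `to_num_safe` is hypothetical and not part of the lab.

```python
def to_num_safe(s):
    """Return the integer before a trailing '%', or None if s has no such form."""
    s = s.strip()                 # drop surrounding whitespace and newlines
    if not s.endswith("%"):       # also covers the empty string
        return None
    try:
        return int(s[:-1])
    except ValueError:            # e.g. "12.5%" or "n/a%"
        return None


print(to_num_safe("46%\n"))
print(to_num_safe(""))
```

Returning `None` for anything unexpected keeps the list comprehension running, at the cost of hiding malformed cells, so it is worth counting the `None` values afterwards.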

Now we use the `to_num` function in a list comprehension to parse the table values.

Notice that we have two `for ... in ...` in this list comprehension. That is perfectly valid and somewhat common.

Although there is no real limit to how many iterations you can perform at once, having more than two can be visually unpleasant, at which point either regular nested loops or saving intermediate comprehensions might be a better solution.

In [None]:
values = [to_num(value.get_text()) for row in rows[1:] for value in row.find_all("td")]
values
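
To see how the two `for` clauses are read, the sketch below flattens a made-up list of rows (a stand-in for the parsed table, not real values) both ways:

```python
# Hypothetical stand-in for the parsed table: each inner list is one row's cells.
table_rows = [["46%", "53%"], ["20%", "17%"]]

# Comprehension with two `for` clauses, read left to right.
flat = [cell for row in table_rows for cell in row]

# Equivalent nested loops: the first `for` clause is the outer loop.
flat_loop = []
for row in table_rows:
    for cell in row:
        flat_loop.append(cell)

print(flat)
print(flat_loop)
```

The clauses appear in the same order as the nested loops, outermost first.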

The problem with the list above is that the values lost their grouping.

The `zip` function is used to combine two sequences element-wise. In Python 3 it returns a lazy iterator, so we wrap it in `list` to see the result. So 
```python
list(zip([1,2,3], [4,5,6]))
```
would return
```python
[(1, 4), (2, 5), (3, 6)]
```

Next we create three lists corresponding to the three columns by taking every third value, starting at offsets 0, 1, and 2.

In [None]:
stacked_values_lists = [values[i::3] for i in range(len(columns))]
stacked_values_lists

We then use `zip`. 

In [None]:
stacked_values = zip(*stacked_values_lists)
list(stacked_values)

Notice the use of the `*` in front: it unpacks the list of lists into separate arguments to `zip`. See the ASIDE below.
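
As a minimal sketch of what the `*` does, the two calls below are equivalent; the list `columns_as_lists` is made up for illustration:

```python
# Hypothetical list of column lists.
columns_as_lists = [[1, 2, 3], [4, 5, 6]]

# zip(*x) passes each inner list as its own argument ...
unpacked = list(zip(*columns_as_lists))

# ... which is the same as spelling the arguments out by hand.
explicit = list(zip(columns_as_lists[0], columns_as_lists[1]))

print(unpacked)
print(explicit)
```

In effect, `zip(*...)` transposes the data: a list of columns becomes a list of rows.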

In [None]:
# Here's the original HTML table for visual understanding
HTML(str(table_demographics))
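
A natural last step would be to collect the parsed pieces into a DataFrame. The sketch below shows one way to do that, using placeholder names and numbers (`demo_columns`, `demo_indexes`, and `demo_values` are assumptions, not the real table contents):

```python
import pandas as pd

# Placeholder data shaped like the parsed pieces: column names, row names,
# and a flat list of cell values in row-by-row order.
demo_columns = ["Col 1", "Col 2", "Col 3"]
demo_indexes = ["Row 1", "Row 2"]
demo_values = [1, 2, 3, 4, 5, 6]

# Regroup the flat list into rows of len(demo_columns) values each.
n_cols = len(demo_columns)
demo_rows = [demo_values[i:i + n_cols] for i in range(0, len(demo_values), n_cols)]

demo_df = pd.DataFrame(demo_rows, columns=demo_columns, index=demo_indexes)
print(demo_df)
```

With the real `columns`, `indexes`, and `values` in place of the placeholders, the result can be explored with the Pandas tools from Part 2.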

**Q4.1:** Use the tables in `soup` to determine how Harvard's Computer Science program ranks both Nationally and Globally.

In [None]:
######
# your code here
######

