# Lab-P12:  Web Requests, Caching, DataFrames and Scraping
Version: 4/20, 11:40PM

## Segment 1: Web Requests and File Downloads

Import the `time`, `requests`, `os`, `json` and `pandas` modules, and import `BeautifulSoup` from the `bs4` module.

For `pandas`, import it as `pd` - as was done in lecture. You can refer to the [lecture material](https://github.com/tylerharter/caraza-harter-com/blob/master/tyler/meena/cs220/s22/materials/readings/pandas-intro.ipynb) for help.

In [1]:
#Write import statements here

import time
import requests
import os
import json
import pandas as pd

from bs4 import BeautifulSoup


### Task 1.1 Fetch `rankings.json` from an internet URL

Use the `requests` library to fetch the file at this URL: `https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/rankings.json`. Make sure to call the appropriate method to raise an `HTTPError` if the request did not succeed (for example, a 404 status code instead of 200).

Recall that you can call the `.json()` method on the response object to convert the response content into the appropriate Python data structure. Store the result in a variable called `data_json`.

In [2]:
# Write your code here

url = "https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/rankings.json"

try:
    r = requests.get(url)
    r.raise_for_status()
    data_json = r.json()
    
except requests.HTTPError as e:
    print("oops!!", e)


In [3]:
assert(data_json[25]["Institution"] == "New York University")
assert(data_json[-10]["Score"] == 65.8)
assert(data_json[5]["National Rank"] == 4)

### Task 1.2 Measure the time taken to fetch `rankings.json` from an Internet URL

The `time.time()` function returns the current time, in seconds, as a float. By calling it once before and once after some code and subtracting the two values, we can measure how long that code takes to run. Try it for the code in Task 1.1.

In [4]:
t1 = time.time()

# Copy and paste your Task 1.1 code here

url = "https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/rankings.json"

try:
    r = requests.get(url)
    r.raise_for_status()
    data_json = r.json()
    
except requests.HTTPError as e:
    print("oops!!", e)

t2 = time.time()

duration = (t2-t1) * 1000 #converting seconds to milliseconds
print("Fetching the data from a URL took {:.2f} milliseconds".format(duration))

Fetching the data from a URL took 345.98 milliseconds


### Task 1.3 Save `rankings.json` as a file

Save the `data_json` variable defined in the previous task to a file `rankings.json`. 

**Hint**: Recall the use of the `json.dump` function from [lecture](https://www.msyamkumar.com/cs220/s22/materials/lecture_ppts/lec_19_S22.pdf)

In [5]:
# Your code here

def write_json(path, data):
    with open(path, 'w', encoding="utf-8") as f:
        json.dump(data, f, indent=2)

write_json("rankings.json", data_json)


In [6]:
assert(os.path.exists("rankings.json"))

Check your `lab12` directory in Finder (Mac) / Explorer (Windows). It should have a file `rankings.json`. 

### Task 1.4 Measure the time taken to fetch `rankings.json` from a saved file

Read the contents of the file saved in Task 1.3 into a variable called `read_data` and measure how long it takes. Use the `read_json` function from [lecture](https://www.msyamkumar.com/cs220/s22/materials/lecture_ppts/lec_19_S22.pdf) to read JSON data from a file.

In [7]:
t1 = time.time()

# Write your code here

def read_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f) 

read_data = read_json("rankings.json")   

t2 = time.time()

duration = (t2-t1) * 1000

print("Fetching the data from a file took {:.2f} milliseconds".format(duration))

Fetching the data from a file took 15.06 milliseconds


In [8]:
assert(read_data[25]["Institution"] == "New York University")
assert(read_data[-10]["Score"] == 65.8)
assert(read_data[5]["National Rank"] == 4)

Fetching the file from your computer should have been much faster than fetching it from a URL. If this was not the case, ask a TA!

Web browsers use a similar technique to make browsing faster. The first time you visit a page, the web browser will download the content, and also save it on your computer. If you need to view the same page again soon, your browser may use the file on your computer instead of re-fetching the original. This technique is called **caching**.
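
To make the idea concrete, here is a minimal sketch of file-based caching that does not need a network connection. The names `slow_fetch`, `cached_fetch` and `cache.json` are hypothetical and only stand in for a real web request and a real cache file.

```python
import json
import os
import tempfile

fetch_count = 0  # how many times the "slow" source was contacted

def slow_fetch():
    """Hypothetical stand-in for a slow network request."""
    global fetch_count
    fetch_count += 1
    return {"Institution": "Harvard University", "World Rank": 1}

def cached_fetch(path):
    """Return the data saved at path if it exists; otherwise fetch and save it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = slow_fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "cache.json")
    first = cached_fetch(cache_path)   # cache miss: calls slow_fetch
    second = cached_fetch(cache_path)  # cache hit: reads the saved file

print(fetch_count)
```

Both calls return the same data, but the slow source is only contacted on the first one.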

### Task 1.5 Implement caching via the `download` function

Now, implement a function `download` to download data from the internet and save it to a file.

This function takes two arguments, `filename` and `url`. The contents found at `url` should be saved into the file whose path is given by `filename`.

This function will implement caching. Before downloading the file from the internet, the function should check whether the file has already been downloaded. If it has, the function should not send a request to the URL and should instead return the message `"<filename> already exists!"`. After a successful download, it should return `"<filename> created!"`.

In [9]:
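def download(filename, url):
    # Caching: skip the request entirely if the file is already on disk
    if os.path.exists(str(filename)):
        return (str(filename) + " already exists!")

    try:
        r = requests.get(url)
        r.raise_for_status()
    except requests.HTTPError as e:
        # Nothing was saved, so do not report the file as created
        print("oops!!", e)
        return (str(filename) + " could not be downloaded!")

    try:
        # r.json() is evaluated before write_json opens the file
        write_json(str(filename), r.json())
    except ValueError:
        # The content is not JSON (e.g. an HTML page): save the raw text
        with open(str(filename), "w", encoding="utf-8") as f:
            f.write(r.text)

    return (str(filename) + " created!")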



### Task 1.6 Test the `download` function

Run the cell below to test your function. Think about why the test code is written in this way. Ask a TA if you're not sure.

In [10]:
rankings_url = "https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/rankings.json"
os.remove("rankings.json") ## delete the existing file
download("rankings.json", rankings_url)

assert(os.path.exists("rankings.json"))
assert(os.path.getsize("rankings.json") > 1600000 and os.path.getsize("rankings.json") < 2500000)
assert (download("rankings.json",rankings_url) == "rankings.json already exists!" )


You will have to use this `download` function to download files during p12. This will ensure that you do not download the files each time you 'Restart & Run All'.


## Segment 2:  Creating DataFrames

For this project, we will be analyzing statistics about world university rankings adapted from
[here](https://cwur.org/). The `rankings.json` file was created by scraping content from pages on the linked website. 

We are going to use `pandas` throughout the lab and project to analyze this dataset.

### Task 2.1 Load data from `rankings.json` into a dataframe

In lecture, you reviewed different ways to create pandas DataFrames. For this task, create a DataFrame `rankings` by reading the JSON data saved in `rankings.json`. 

We covered the `read_csv` method of pandas in lecture to read CSV files into a DataFrame. Now, we are going to use a similar method `read_json` to read a JSON file into a dataframe. Try this below, and seek help from a TA if you face any trouble.

Remember to cast the return value explicitly into a DataFrame object. You must do this throughout the lab and project. 
Sometimes, the `read_json` function's returned DataFrame has type issues on Windows laptops. Hence the need for explicit type conversion.

In [11]:
# Use the read_json method of pandas to create a DataFrame by reading from a file
# Cast the return value of read_json to a DataFrame explicitly

rankings = pd.DataFrame(pd.read_json("rankings.json"))

rankings.head()


|   | World Rank | Year | Institution | Country | National Rank | Quality of Education Rank | Alumni Employment Rank | Quality of Faculty Rank | Research Performance Rank | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019-2020 | Harvard University | USA | 1 | 2.0 | 1.0 | 1.0 | 1.0 | 100.0 |
| 1 | 2 | 2019-2020 | Massachusetts Institute of Technology | USA | 2 | 1.0 | 10.0 | 2.0 | 5.0 | 96.7 |
| 2 | 3 | 2019-2020 | Stanford University | USA | 3 | 9.0 | 3.0 | 3.0 | 2.0 | 95.2 |
| 3 | 4 | 2019-2020 | University of Cambridge | United Kingdom | 1 | 4.0 | 19.0 | 5.0 | 11.0 | 94.1 |
| 4 | 5 | 2019-2020 | University of Oxford | United Kingdom | 2 | 10.0 | 24.0 | 10.0 | 4.0 | 93.3 |


In [12]:
assert(type(rankings) == pd.DataFrame)
assert(rankings.iloc[0]["Institution"] == 'Harvard University')
assert(rankings.iloc[1]["Score"]== 96.7)

### Task 2.2 Find the unique universities in the dataset

As the dataset contains rankings for three different years, the same university may have featured multiple times. Find the names of the unique universities that are represented in the dataset.

First, extract just the names of the institutions as a pandas Series. Then, make a Series of unique names called `institutions`. Think about which data structure you have been using to extract unique values from a list. A Series can easily be converted into that data structure, and that data structure can be converted back into a Series.

In [13]:
# Create a pandas `Series` of just the institution names in the dataset. 

institutions = pd.Series(rankings['Institution'].unique())

university_count = len(institutions)

print(university_count)
print(type(institutions))


2156
<class 'pandas.core.series.Series'>


In [14]:
assert(type(institutions) == pd.Series)
assert(len(institutions) == 2156)
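
As a small illustration of the two routes to unique values, here is a sketch on made-up data (the names in `names` are placeholders, not taken from `rankings.json`). It compares a round trip through a `set` with the `unique` method of a Series.

```python
import pandas as pd

names = pd.Series(["MIT", "Harvard", "MIT", "Stanford", "Harvard"])

# A set has no guaranteed order, so sort it to get a repeatable result
unique_via_set = pd.Series(sorted(set(names)))

# Series.unique keeps the order in which values first appear
unique_via_pandas = pd.Series(names.unique())

print(list(unique_via_set))
print(list(unique_via_pandas))
```

Both results contain the same three names; only the order differs, which does not matter when you just need the count.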

### Task 2.3 Use `value_counts` to count instances in a dataframe

Now, let's find the country that is the 5th most represented in the dataframe, and the number of times it features. Recall that `value_counts` enables us to count number of occurrences of unique values in a pandas Series.

#### Task 2.3a Obtain the counts for all countries

First, use the `value_counts` function to return a pandas series called `country_counts`. This series contains each country in the dataset and the number of times it occurs.

In [15]:
country_counts = rankings['Country'].value_counts()

In [16]:
assert(type(country_counts) == pd.Series)
assert(country_counts["USA"] == 1062)
assert(len(country_counts) == 103)

#### Task 2.3b Find the 5th most represented country

Use the `.index` attribute of the `Series` `country_counts` to fetch the name of the 5th most represented country. Use `loc` or `iloc` to fetch the count of this country. Make sure to use the pandas series defined in Task 2.3a.

**Hint**: The pandas `Series.index` attribute works differently from the `.index` method you are familiar with for Python lists. `Series.index` is a sequence of labels: indexing into it with an integer position (for example, `country_counts.index[0]`) gives you the label at that position, which you can then pass to `.loc`.

In [17]:
country = country_counts.index[4]
count = country_counts.iloc[4]

print(country, count)


France 256


In [18]:
assert(country == "France")
assert(count == 256)
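
The relationship between `value_counts`, `.index`, `.iloc` and `.loc` can be sketched on a tiny made-up Series (the data below is illustrative only):

```python
import pandas as pd

countries = pd.Series(["USA", "China", "USA", "Japan", "China", "USA"])

# value_counts sorts by count, largest first; the labels are the countries
counts = countries.value_counts()

label = counts.index[1]        # label at integer position 1
by_position = counts.iloc[1]   # count at integer position 1
by_label = counts.loc[label]   # same count, looked up by its label

print(label, by_position, by_label)
```

Here `counts.index[1]` turns a position into a label, so `.iloc[1]` and `.loc[label]` refer to the same entry.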

### Task 2.4 `loc` vs `iloc`

In this lab and project, you must only use `iloc`. Using `loc` will be considered hardcoding. This is because `iloc` selects rows and columns by integer position, while `loc` selects them by pandas index label.

Intuition: Recall that row indices can be meaningful labels, such as strings. Consider a scenario where rows are added to the beginning of the DataFrame. If you hardcode a label with `.loc`, your answer may become incorrect when the data changes, whereas `.iloc[0]` will still return the first row.
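
A minimal sketch of this intuition, using a made-up three-row DataFrame: after sorting, each row keeps its original label, so position 0 and label 0 no longer refer to the same row.

```python
import pandas as pd

df = pd.DataFrame({
    "Institution": ["A", "B", "C"],
    "Score": [70.0, 95.0, 80.0],
})

# Sorting moves the rows but each row keeps its original index label
ranked = df.sort_values("Score", ascending=False)

top_by_position = ranked.iloc[0]["Institution"]  # first row after sorting
top_by_label = ranked.loc[0]["Institution"]      # row whose label is 0

print(top_by_position, top_by_label)
```

Only `iloc[0]` gives the highest-scoring institution here; `loc[0]` still returns the row that was first before sorting.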

This distinction may not be as intuitive for the current `rankings` dataframe. As an example, use both `loc` and `iloc` to fetch the first row in `rankings`.

In [19]:
first_row_iloc = rankings.iloc[0]
print(first_row_iloc)

print()

first_row_loc = rankings.loc[0]
print(first_row_loc)

World Rank                                    1
Year                                  2019-2020
Institution                  Harvard University
Country                                     USA
National Rank                                 1
Quality of Education Rank                   2.0
Alumni Employment Rank                      1.0
Quality of Faculty Rank                     1.0
Research Performance Rank                   1.0
Score                                     100.0
Name: 0, dtype: object

World Rank                                    1
Year                                  2019-2020
Institution                  Harvard University
Country                                     USA
National Rank                                 1
Quality of Education Rank                   2.0
Alumni Employment Rank                      1.0
Quality of Faculty Rank                     1.0
Research Performance Rank                   1.0
Score                                     100.0
Name: 0, dtype: object

The results are exactly the same! This happens since the integer positions correspond to the pandas indices in the `rankings` dataframe. However, this will not always hold true - as we see in the next task.

### Task 2.5 Use boolean indexing to filter data

Now, use boolean indexing to extract data from the dataframe. Recall boolean indexing from [lecture](https://github.com/tylerharter/caraza-harter-com/blob/master/tyler/meena/cs220/s22/materials/meena_lec_notes/lec-28/lec_28_pandas2.ipynb)

Create a dataframe `rankings_arg_bra` that only consists of rankings of universities from Argentina and Brazil. Then extract the first row of this new dataframe. As you'll see, using `loc` will not work the same way it did before: the commented-out `.loc[0]` lines in the cell after next raise a `KeyError` if you uncomment them.

**Hint**: When implementing boolean indexing in pandas, the `or` operator is represented by `|` and the `and` operator is represented by `&`.
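
Here is a short sketch of both operators on made-up data. Each comparison needs its own parentheses because `|` and `&` bind more tightly than `==`. The `isin` call at the end is an alternative way to express the same "or" filter.

```python
import pandas as pd

df = pd.DataFrame({
    "Institution": ["A", "B", "C", "D"],
    "Country": ["Argentina", "USA", "Brazil", "Brazil"],
    "National Rank": [1, 1, 1, 2],
})

# "or": rows from either country
either = df[(df["Country"] == "Argentina") | (df["Country"] == "Brazil")]

# "and": rows that satisfy both conditions at once
both = df[(df["Country"] == "Brazil") & (df["National Rank"] == 1)]

# isin checks membership in a list of allowed values
same_as_either = df[df["Country"].isin(["Argentina", "Brazil"])]

print(list(either["Institution"]), list(both["Institution"]))
```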

In [20]:
rankings_arg_bra = rankings[(rankings['Country'] == 'Argentina') | (rankings['Country'] == 'Brazil')]

rankings_arg_bra


|   | World Rank | Year | Institution | Country | National Rank | Quality of Education Rank | Alumni Employment Rank | Quality of Faculty Rank | Research Performance Rank | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 127 | 128 | 2019-2020 | University of São Paulo | Brazil | 1 | 457.0 | 264.0 | 219.0 | 89.0 | 80.7 |
| 343 | 344 | 2019-2020 | University of Buenos Aires | Argentina | 1 | 238.0 | 1222.0 | NaN | 320.0 | 76.1 |
| 348 | 349 | 2019-2020 | Federal University of Rio de Janeiro | Brazil | 2 | 378.0 | 408.0 | NaN | 335.0 | 76.0 |
| 352 | 353 | 2019-2020 | University of Campinas | Brazil | 3 | NaN | NaN | NaN | 324.0 | 76.0 |
| 443 | 444 | 2019-2020 | São Paulo State University | Brazil | 4 | NaN | NaN | NaN | 416.0 | 74.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5870 | 1871 | 2021-2022 | Federal Rural University of Rio de Janeiro | Brazil | 53 | NaN | NaN | NaN | 1793.0 | 66.2 |
| 5908 | 1909 | 2021-2022 | Federal University of Piauí | Brazil | 54 | NaN | NaN | NaN | 1834.0 | 66.1 |
| 5944 | 1945 | 2021-2022 | Federal University of Amazonas | Brazil | 55 | NaN | NaN | NaN | 1870.0 | 65.9 |
| 5975 | 1976 | 2021-2022 | National University of Tucumán | Argentina | 10 | NaN | NaN | NaN | 1901.0 | 65.8 |


In [21]:
first_row_iloc = rankings_arg_bra.iloc[0]
print(first_row_iloc)

# first_row_loc = rankings_arg_bra.loc[0]
# print(first_row_loc)

World Rank                                       128
Year                                       2019-2020
Institution                  University of São Paulo
Country                                       Brazil
National Rank                                      1
Quality of Education Rank                      457.0
Alumni Employment Rank                         264.0
Quality of Faculty Rank                        219.0
Research Performance Rank                       89.0
Score                                           80.7
Name: 127, dtype: object


Oops! If you uncomment the `.loc[0]` lines in the cell above, you will see that `.loc` now raises a `KeyError`.

`.loc[0]` tries to find the row with the *labeled* index 0. Run the cell below and notice how `rankings_arg_bra` starts at the labeled index 127. There is no 0. Hence the KeyError.

In [22]:
rankings_arg_bra.head()

|   | World Rank | Year | Institution | Country | National Rank | Quality of Education Rank | Alumni Employment Rank | Quality of Faculty Rank | Research Performance Rank | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 127 | 128 | 2019-2020 | University of São Paulo | Brazil | 1 | 457.0 | 264.0 | 219.0 | 89.0 | 80.7 |
| 343 | 344 | 2019-2020 | University of Buenos Aires | Argentina | 1 | 238.0 | 1222.0 | NaN | 320.0 | 76.1 |
| 348 | 349 | 2019-2020 | Federal University of Rio de Janeiro | Brazil | 2 | 378.0 | 408.0 | NaN | 335.0 | 76.0 |
| 352 | 353 | 2019-2020 | University of Campinas | Brazil | 3 | NaN | NaN | NaN | 324.0 | 76.0 |
| 443 | 444 | 2019-2020 | São Paulo State University | Brazil | 4 | NaN | NaN | NaN | 416.0 | 74.8 |


### Task 2.6 Sort the dataframe

The dataframe in Task 2.5 is sorted by World Rank, with the result that universities from Argentina and Brazil are interleaved throughout the data. Re-sort the data to sort by country so that all universities from Argentina appear first followed by universities from Brazil. Within each country, the universities should be sorted by their National Rank. 

Use the `sort_values` function of `pandas`. Remember - by default, `pandas` returns a new sorted DataFrame and does not modify the existing one.

Recall that `sort_values` takes an argument for the parameter `by` as the column name, based on which you want to do the sorting. If you want to use one column for primary sorting and another for secondary sorting, you can specify a list of column names.

In [23]:
sorted_rankings_arg_bra = rankings_arg_bra.sort_values(["Country", "National Rank"])

sorted_rankings_arg_bra.head()


|   | World Rank | Year | Institution | Country | National Rank | Quality of Education Rank | Alumni Employment Rank | Quality of Faculty Rank | Research Performance Rank | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 343 | 344 | 2019-2020 | University of Buenos Aires | Argentina | 1 | 238.0 | 1222.0 | NaN | 320.0 | 76.1 |
| 2353 | 354 | 2020-2021 | University of Buenos Aires | Argentina | 1 | 327.0 | 1281.0 | NaN | 321.0 | 75.9 |
| 4355 | 356 | 2021-2022 | University of Buenos Aires | Argentina | 1 | 319.0 | 1347.0 | NaN | 324.0 | 75.9 |
| 595 | 596 | 2019-2020 | National University of La Plata | Argentina | 2 | 495.0 | 1337.0 | NaN | 557.0 | 73.2 |
| 2618 | 619 | 2020-2021 | National University of La Plata | Argentina | 2 | NaN | 1404.0 | NaN | 583.0 | 73.0 |


In [24]:
assert(sorted_rankings_arg_bra.iloc[0]["Institution"] == "University of Buenos Aires")
assert(sorted_rankings_arg_bra.iloc[-1]["World Rank"] == 1997)
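
The behavior of `sort_values` with a list of columns can be checked on a small made-up DataFrame. The sketch below also shows the `ascending` parameter, which accepts one boolean per column when the directions should differ.

```python
import pandas as pd

df = pd.DataFrame({
    "Institution": ["P", "Q", "R", "S"],
    "Country": ["Brazil", "Argentina", "Brazil", "Argentina"],
    "National Rank": [2, 2, 1, 1],
})

# Primary sort by Country, secondary sort by National Rank
by_country_then_rank = df.sort_values(by=["Country", "National Rank"])

# Country ascending, National Rank descending within each country
mixed = df.sort_values(by=["Country", "National Rank"], ascending=[True, False])

print(list(by_country_then_rank["Institution"]))
print(list(mixed["Institution"]))
print(list(df["Institution"]))  # the original DataFrame is not modified
```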

### Task 2.7 Create a new, simplified dataframe to track changes in rankings

As we have seen, universities that have featured in rankings of multiple years are featured repeatedly. To simplify comparisons, we want to feature each university once and remove all other metrics. 

This time - instead of simply ranking universities, we want to find the absolute change in universities' rankings between the year 2019-2020 and 2020-2021. We are only interested in the absolute change and not whether the rank improved or declined.  

First, let's attempt to measure the change for one particular university.

**Hint**: The `abs` function can be used to find the absolute value.

#### Task 2.7a Find the absolute difference in World Rank for "University of Madras" between 2019-2020 and 2020-2021

Store the difference in a variable `absolute_diff_madras`

In [25]:
# First find the ranking of "University of Madras" in the year "2019-2020"
# Then find the ranking of "University of Madras" in the year "2020-2021"
# Remember to use .iloc[0] to extract the value

univ_of_madras_19_20 = rankings[(rankings['Institution'] == "University of Madras") \
                          & (rankings['Year'] == "2019-2020")]

univ_of_madras_20_21 = rankings[(rankings['Institution'] == "University of Madras") \
                          & (rankings['Year'] == "2020-2021")]

absolute_diff_madras = abs(univ_of_madras_19_20.iloc[0]['World Rank'] - univ_of_madras_20_21.iloc[0]['World Rank'])

absolute_diff_madras


108

In [26]:
assert(absolute_diff_madras == 108)

#### Task 2.7b Create a Series with the absolute difference in ranks for "University of Madras" between 2019-2020 and 2020-2021

First, create a dictionary with the keys as "Institution" and "Absolute Change". The values should be the relevant values for "University of Madras". Then, convert this dictionary to a Series called `madras_series`.

In [28]:
madras_dict = {}

madras_dict['Institution'] = univ_of_madras_19_20['Institution'].iloc[0]
madras_dict['Absolute Change'] = absolute_diff_madras

madras_series = pd.Series(madras_dict)

madras_series


Institution        University of Madras
Absolute Change                     108
dtype: object

In [29]:
assert(madras_series["Institution"] == "University of Madras")
assert(madras_series["Absolute Change"] == 108)

#### Task 2.7c Create the `change_in_rankings` dataframe

Now, create a dataframe `change_in_rankings` with just 2 columns, "Institution" and "Absolute Change", where each university is featured only once. For this task, we are interested in universities in all countries. If an institution is missing from the rankings of either of the two years, ignore it.

The institutions should be sorted in increasing order of their absolute change. For institutions with the same absolute change, sort them alphabetically by their names.

Note: this cell may take a few seconds to run.

In [30]:
y1 = "2019-2020"
y2 = "2020-2021"

rank_list = []
   
for inst in institutions:
    tr = rankings[rankings['Institution'] == inst]

    years = tr.Year.tolist()

    if y1 not in years or y2 not in years:
        continue
           
    y1_rank = int(tr[tr['Year'] == y1].iloc[0]['World Rank'])
    y2_rank = int(tr[tr['Year'] == y2].iloc[0]['World Rank'])
    
    abs_diff = abs(y2_rank - y1_rank)
    
    inst_dict = {'Institution': inst, 'Absolute Change': abs_diff}

    rank_list.append(inst_dict)

change_in_rankings = pd.DataFrame(rank_list)

change_in_rankings = change_in_rankings.sort_values(['Absolute Change', 'Institution'])

change_in_rankings


|   | Institution | Absolute Change |
|---|---|---|
| 10 | California Institute of Technology | 0 |
| 229 | Charles University in Prague | 0 |
| 5 | Columbia University | 0 |
| 28 | ETH Zurich | 0 |
| 0 | Harvard University | 0 |
| ... | ... | ... |
| 1306 | Trinity College | 620 |
| 990 | Franklin & Marshall College | 725 |
| 383 | École des Hautes Études en Sciences Sociales | 810 |
| 1875 | Antioch College | 1046 |
| 1832 | USI - University of Italian Speaking Switzerland | 1081 |


Check your dataframe below.

In [31]:
change_in_rankings.iloc[-1]

Institution        USI - University of Italian Speaking Switzerland
Absolute Change                                                1081
Name: 1832, dtype: object

In [32]:
assert(change_in_rankings.iloc[100]["Institution"] == "Vrije Universiteit Brussel")
assert(change_in_rankings.iloc[-1]["Absolute Change"] == 1081)
assert(change_in_rankings.shape[1] == 2)
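
The loop in Task 2.7c filters the whole dataframe once per institution, which is why it takes a few seconds. As an alternative, here is a sketch of a merge-based approach on made-up data. It assumes each institution appears at most once per year, and the suffixes `_y1` and `_y2` are arbitrary names chosen for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Institution": ["A", "A", "B", "B", "C"],
    "Year": ["2019-2020", "2020-2021", "2019-2020", "2020-2021", "2019-2020"],
    "World Rank": [10, 7, 3, 8, 5],
})

first = toy[toy["Year"] == "2019-2020"][["Institution", "World Rank"]]
second = toy[toy["Year"] == "2020-2021"][["Institution", "World Rank"]]

# The default inner merge keeps only institutions present in both years,
# so "C" (ranked in one year only) is dropped
paired = first.merge(second, on="Institution", suffixes=("_y1", "_y2"))
paired["Absolute Change"] = (paired["World Rank_y2"] - paired["World Rank_y1"]).abs()

change = paired[["Institution", "Absolute Change"]]
change = change.sort_values(["Absolute Change", "Institution"])

print(change)
```

The row labels of the result will differ from those produced by the loop, so compare values with `iloc` instead of relying on labels.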

## Segment 3: Lint

The p12 autograder introduces lint checks to detect bad coding style. 
"Lint" refers to bad code that is not necessarily buggy (though "bad" coding style often leads to bugs).  A linter helps warn you about common issues. If you are interested in finding out about the origins of this term, check out the [Wikipedia page](https://en.wikipedia.org/wiki/Lint_(software)).

For project p12, we're adding a linter as part of `test.py`. It will notify you of code that is bad style, deducting 1% per issue (for a max of a 10% penalty).  

### Task 3.1 Install the pylint module

For the linter to run properly, install the `pylint` module by running this command in your terminal.

`pip install pylint`

Verify that the installation worked by simply running the `pylint` command in your terminal. You should see text explaining the various `pylint` options available. If you see a `command not found` error, ask a TA!

### Task 3.2 Run the pylint module

In a new notebook (e.g., named `lint_nb.ipynb`), paste the following code and save the notebook.

In [33]:
def abs(list):
    # Objective: return a new list, which contains absolute values of 
    #            items from the original list
    list = list[:] # copy it
    for i in range(len(list)):
        if list[i] < 0:
            list[i] = -list[i]
    return list

abs([-1, -3, 5, -4, 8])

[1, 3, 5, 4, 8]

Now open your terminal (Windows: PowerShell, Mac: Terminal), navigate to the directory you are currently working in (the folder that contains both `lint_nb.ipynb` and `lint.py`), and run the linter:

`python lint.py -v lint_nb.ipynb`

The command above assumes your code is in a notebook called `lint_nb.ipynb`. If you want to test some other code you've written in a different notebook, simply substitute `lint_nb.ipynb` with the name of your notebook (e.g. `main.ipynb`)

Consider why the linter is complaining, then write a better version of the function to make the linter happy. Recall that any word with green syntax highlighting in Jupyter Notebook is a Python keyword or a built-in name such as `abs` or `list`. You should not use such words as variable names or function names.

You can find extensive documentation for the file lint.py [here](https://github.com/msyamkumar/cs220-s22-projects/tree/main/linter). If you find the linter confusing, please read the full documentation there!
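
One possible rewrite of the function from Task 3.2 is sketched below. The names `absolute_values` and `numbers` are only suggestions. The point is that neither the function nor its parameter reuses a built-in name, so the built-in `abs` is still available inside the function. Whether this clears every message depends on how `lint.py` is configured.

```python
def absolute_values(numbers):
    """Return a new list with the absolute value of each item in numbers."""
    # The built-in abs is usable here because the function no longer shadows it
    return [abs(number) for number in numbers]

result = absolute_values([-1, -3, 5, -4, 8])
print(result)
```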

## Segment 4: BeautifulSoup

As mentioned in Segment 2, the `rankings.json` file was created by parsing HTML content on the Web, specifically the web pages listed below.
* https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2019-2020.html
* https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2020-2021.html
* https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2021-2022.html

Now, let's write a function to do this ourselves. We will use the `BeautifulSoup` library to scrape web pages and extract information.

### Task 4.1 Download the HTML files
Use the `download` function you previously created to download the contents of each of the URLs above and save them into files. Name the files `2019-2020.html`, `2020-2021.html` and `2021-2022.html` based on the respective URL.

In [34]:
# Your code here

url_1 = "https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2019-2020.html"
url_2 = "https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2020-2021.html"
url_3 = "https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2021-2022.html"

download('2019-2020.html', url_1)
download('2020-2021.html', url_2)
download('2021-2022.html', url_3)


'2021-2022.html already exists!'

### Task 4.2 Read `2019-2020.html` content into a variable

**Note:** If you get a `UnicodeDecodeError`, make sure all your calls to `open()` have the keyword argument `encoding="utf-8"`. Delete the downloaded files and run the cell above again.

In [37]:
# Your code here

# Pass encoding="utf-8" explicitly to avoid a UnicodeDecodeError
with open("2019-2020.html", "r", encoding="utf-8") as file_obj:
    content = file_obj.read()

print(content)

<!DOCTYPE html>
<html lang="en">
<head>

<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->

<meta name="description" content="The Center for World University Rankings (CWUR) is a leading consulting organization and publisher of the largest academic ranking of global universities.">

<meta name="keywords" content="ranking, rankings, university, universities, college, colleges, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, world, top, best, global, Ranking universitario mundial, Classement mondial des universités , Weltweites Universitätsranking, Zentrum für weltweite Universitätsrankings , דירוג האוניברסיטאות העולמי, המרכז לדירוג האוניברסיטאות העולמי, 세계 대학순위, が世界の大学トップ, 世界大學排名中心, 세계대학랭킹센터,世界大学ランキングセンター, Ranking mundial universitário, Рейтинг университетов мира , раз

### Task 4.3 Initialize BeautifulSoup object instance

Use the variable defined in Task 4.2. 

In [38]:
# Your code here

bs_obj = BeautifulSoup(content, "html.parser")
bs_obj


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta content="The Center for World University Rankings (CWUR) is a leading consulting organization and publisher of the largest academic ranking of global universities." name="description"/>
<meta content="ranking, rankings, university, universities, college, colleges, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, world, top, best, global, Ranking universitario mundial, Classement mondial des universités , Weltweites Universitätsranking, Zentrum für weltweite Universitätsrankings , דירוג האוניברסיטאות העולמי, המרכז לדירוג האוניברסיטאות העולמי, 세계 대학순위, が世界の大学トップ, 世界大學排名中心, 세계대학랭킹센터,世界大学ランキングセンター, Ranking mundial universitário, Рейтинг университетов мира , разработки рейтин

### Task 4.4 Find the table element

The webpage has a table containing all the data we're trying to extract. Write the code to find this element and store it in a variable. Use the BeautifulSoup object instance defined in Task 4.3.

In [39]:
# Write your code here

table = bs_obj.find('table')

table


<table class="table">
<style>
          table {
                 border-collapse: collapse;
                 width: 100%;
                }
         th, td {
                  text-align: left;
                  padding: 0px;
                  border-style:hidden;
                 }
          tr:nth-child(even){background-color: #f2f2f2}

          th {
              background-color: #222222;
              color: #ffffff;
              font-weight: lighter;
              font-color: white
             }

         </style>
<thead>
<tr>
<th style="vertical-align:middle">World Rank</th>
<th style="vertical-align:middle">Institution</th>
<th style="vertical-align:middle">Country</th>
<th style="vertical-align:middle">National Rank</th>
<th style="vertical-align:middle">Quality of Education Rank</th>
<th style="vertical-align:middle">Alumni Employment Rank</th>
<th style="vertical-align:middle">Quality of Faculty Rank</th>
<th style="vertical-align:middle">Research Performance Rank</th>
<t

In [40]:
# Pulling out the year data 

table_rows = table.find_all("tr")

data = table_rows[1]

row_data = data.find_all('td')

year_data = data.find('a')

"".join([v[:9] for (k,v) in year_data.attrs.items()])


'2019-2020'

### Task 4.5 Find all th tags, to parse the table header

Use the variable defined in Task 4.4. Save your answer to a variable named `header` in order to pass the asserts.

**Hint**: The header should be a list of elements, that can be obtained by using the `get_text()` method for each `th` element in the table. List comprehension may be useful here.

In [42]:
# Write your code here

header = [th.get_text() for th in table.find_all("th")]

header


['World Rank',
 'Institution',
 'Country',
 'National Rank',
 'Quality of Education Rank',
 'Alumni Employment Rank',
 'Quality of Faculty Rank',
 'Research Performance Rank',
 'Score']

In [43]:
assert(len(header) == 9)
assert(type(header) == list)
assert(header[0] == "World Rank")
assert(header[-1] == "Score")

Great work! The next tasks are optional. You may choose to skip them and start the project! You can revisit this section when you are solving the relevant portion of P12.

### Task 4.6 (Optional) Build row dictionary for one row

Scrape the second row (the first one is the header!), convert data to the appropriate types, and populate the data into a row dictionary. The keys of the dictionary are the columns in the dataframe. Avoid hardcoding these keys - instead, use the variable obtained in the previous task.

**Hint**: Rows can be found by locating the `tr` elements in the table.

- "World Rank", "National Rank", "Quality of Education Rank", "Alumni Employment Rank", "Quality of Faculty Rank", "Research Performance Rank": `int` conversion
- "Score"  : `float` conversion

You can compare your parsing output to `rankings.json` file contents, to confirm your result.


In [44]:
table_rows = table.find_all("tr")

data = table_rows[1]

row_data = data.find_all('td')

row_data[0].get_text()



'1'

In [45]:
# Write your code here

table_rows = table.find_all("tr")

# table_rows[0] is the header row, so the first data row is at position 1
row_data = table_rows[1].find_all('td')

rankings_dict = {}

# Use the header list for the keys instead of hardcoding column names
for idx in range(len(header)):
    column = header[idx]
    text = row_data[idx].get_text()

    if column == "Score":
        rankings_dict[column] = float(text)
    elif column.endswith("Rank"):
        # All six rank columns have names ending in "Rank"
        rankings_dict[column] = int(text)
    else:
        rankings_dict[column] = text

rankings_dict

{'World Rank': 1,
 'Institution': 'Harvard University',
 'Country': 'USA',
 'National Rank': 1,
 'Quality of Education Rank': 2,
 'Alumni Employment Rank': 1,
 'Quality of Faculty Rank': 1,
 'Research Performance Rank': 1,
 'Score': 100.0}
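
The task asks you to avoid hardcoding the dictionary keys. One possible way to do that is sketched below: pair each header name with the matching cell text and pick a conversion based on the column name. The `cell_texts` list is a stand-in for the strings returned by `get_text()` on each `td`, and the `converters` dictionary is a hypothetical helper, not part of the lab's starter code.

```python
# Column names, as obtained from the th elements in Task 4.5.
header = ["World Rank", "Institution", "Country", "National Rank",
          "Quality of Education Rank", "Alumni Employment Rank",
          "Quality of Faculty Rank", "Research Performance Rank", "Score"]

# Stand-in for [td.get_text() for td in row_data]; values match the row shown above.
cell_texts = ["1", "Harvard University", "USA", "1", "2", "1", "1", "1", "100"]

# Hypothetical lookup from column name to conversion function.
# Every column whose name ends in "Rank" is converted to int, "Score" to float.
converters = {"Score": float}
for column in header:
    if column.endswith("Rank"):
        converters[column] = int

row_dict = {}
for column, text in zip(header, cell_texts):
    # Columns without an entry in converters are kept as strings.
    convert = converters.get(column, str)
    row_dict[column] = convert(text)

print(row_dict)
```

Because the keys come from `header`, the same loop keeps working if the column order in the table changes.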

In [46]:
read_data = read_json("rankings.json")
len(read_data)

6000

### Task 4.7 (Optional) Build list of all row dictionaries

Scrape all rows, convert the data to the appropriate types, store each row's data in a row dictionary, and append each row dictionary to a list.

This is a natural extension of Task 4.6. You can use a loop to extract all rows and populate the list.

**Important**:
* Some fields in the dataset have missing values, represented simply as `-`.
* The "Year" value isn't present in the dataset. Think of a different way to populate this field.
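
A minimal sketch of the two points above, using made-up rows rather than the real table. The helper name `to_number` is hypothetical, and mapping `-` to `None` is an assumption made for illustration; use whatever representation P12 asks for. The `year` string stands in for the value extracted from the page in Task 4.4.

```python
def to_number(text, convert):
    """Return convert(text), or None when the cell holds the placeholder "-"."""
    text = text.strip()
    if text == "-":
        return None  # assumption: None marks a missing value
    return convert(text)

header = ["World Rank", "Institution", "Score"]

# Made-up rows for illustration only; each inner list mimics the cell texts of one tr.
rows = [["1", "Example University A", "100"],
        ["2", "Example University B", "-"]]

# The year is not stored in the cells, so it is supplied separately for every row.
year = "2019-2020"

rankings = []
for cells in rows:
    row_dict = {"Year": year}
    for column, text in zip(header, cells):
        if column.endswith("Rank"):
            row_dict[column] = to_number(text, int)
        elif column == "Score":
            row_dict[column] = to_number(text, float)
        else:
            row_dict[column] = text
    rankings.append(row_dict)

print(rankings)
```

Note that the converted value is stored directly in the dictionary. Assigning to a loop variable (for example `item = int(item)` inside `for item in some_list`) only rebinds that variable and leaves the original values unconverted.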

In [47]:
# Write your code here

table_rows = table.find_all("tr")

data = table_rows[1:]

rankings_list = []
rankings_dict = {}

world_rank_idx = header.index('World Rank')
institution_idx = header.index('Institution')
country_idx = header.index('Country')
national_rank_idx = header.index('National Rank')
qual_edu_rank_idx = header.index('Quality of Education Rank')
alumni_emp_rank_idx = header.index('Alumni Employment Rank')
qual_faculty_rank_idx = header.index('Quality of Faculty Rank')
research_perf_rank_idx = header.index('Research Performance Rank')
score_idx = header.index('Score')

for row in data:
    try:
        row_data = row.find_all('td')
        
        year_data = row.find('a')
        year = "".join([v[:9] for (k,v) in year_data.attrs.items()])

        world_rank = row_data[world_rank_idx].get_text()
        institution = row_data[institution_idx].get_text()
        country = row_data[country_idx].get_text()
        national_rank = row_data[national_rank_idx].get_text()
        qual_edu_rank = row_data[qual_edu_rank_idx].get_text()
        alumni_emp_rank = row_data[alumni_emp_rank_idx].get_text()
        qual_faculty_rank = row_data[qual_faculty_rank_idx].get_text()
        research_perf_rank = row_data[research_perf_rank_idx].get_text()
        score = row_data[score_idx].get_text()

        int_list = [world_rank, national_rank, qual_edu_rank, alumni_emp_rank, \
                    qual_faculty_rank, research_perf_rank]
        
        # Reassigning a loop variable does not change the original variables,
        # so build a converted list and unpack it back into the variables.
        # Missing values ("-") are left unchanged.
        int_list = [int(item) if item != "-" else item for item in int_list]
        (world_rank, national_rank, qual_edu_rank, alumni_emp_rank,
         qual_faculty_rank, research_perf_rank) = int_list
        
        if score != "-":
            score = float(score)
                
        rankings_dict = { 'World Rank': world_rank,
                          'Year': year,
                          'Institution': institution, 
                          'Country': country, 
                          'National Rank': national_rank, 
                          'Quality of Education Rank': qual_edu_rank,
                          'Alumni Employment Rank': alumni_emp_rank,
                          'Quality of Faculty Rank': qual_faculty_rank,
                          'Research Performance Rank': research_perf_rank,
                          'Score': score
                         }
            
        rankings_list.append(rankings_dict)
            
    except ValueError:
        continue
    
rankings_list


[{'World Rank': 1,
  'Year': '2019-2020',
  'Institution': 'Harvard University',
  'Country': 'USA',
  'National Rank': 1,
  'Quality of Education Rank': 2,
  'Alumni Employment Rank': 1,
  'Quality of Faculty Rank': 1,
  'Research Performance Rank': 1,
  'Score': 100.0},
 {'World Rank': 2,
  'Year': '2019-2020',
  'Institution': 'Massachusetts Institute of Technology',
  'Country': 'USA',
  'National Rank': 2,
  'Quality of Education Rank': 1,
  'Alumni Employment Rank': 10,
  'Quality of Faculty Rank': 2,
  'Research Performance Rank': 5,
  'Score': 96.7},
 {'World Rank': 3,
  'Year': '2019-2020',
  'Institution': 'Stanford University',
  'Country': 'USA',
  'National Rank': 3,
  'Quality of Education Rank': 9,
  'Alumni Employment Rank': 3,
  'Quality of Faculty Rank': 3,
  'Research Performance Rank': 2,
  'Score': 95.2},
 {'World Rank': 4,
  'Year': '2019-2020',
  'Institution': 'University of Cambridge',
  'Country': 'United Kingdom',
  '

### Task 4.8 (Optional) Write the parse_html function

Combine Tasks 4.2 to 4.7 into a single function. The function should take a `filename` as input and return a list of dictionaries, where each dictionary represents one row of the dataset.

In [48]:
def parse_html(filename):
    '''This function parses an HTML file and returns a list of dictionaries containing the tabular data'''
    #TODO: Write your code here
    
    # The with statement closes the file automatically, even if reading fails
    with open(filename, "r") as file_obj:
        content = file_obj.read()

    bs_obj = BeautifulSoup(content, "html.parser")
    
    table = bs_obj.find('table')
    
    header = [th.get_text() for th in table.find_all("th")]

    table_rows = table.find_all("tr")

    data = table_rows[1:]

    rankings_list = []
    rankings_dict = {}

    world_rank_idx = header.index('World Rank')
    institution_idx = header.index('Institution')
    country_idx = header.index('Country')
    national_rank_idx = header.index('National Rank')
    qual_edu_rank_idx = header.index('Quality of Education Rank')
    alumni_emp_rank_idx = header.index('Alumni Employment Rank')
    qual_faculty_rank_idx = header.index('Quality of Faculty Rank')
    research_perf_rank_idx = header.index('Research Performance Rank')
    score_idx = header.index('Score')

    for row in data:
        try:
            row_data = row.find_all('td')
            
            year_data = row.find('a')
            year = "".join([v[:9] for (k,v) in year_data.attrs.items()])

            world_rank = row_data[world_rank_idx].get_text()
            institution = row_data[institution_idx].get_text()
            country = row_data[country_idx].get_text()
            national_rank = row_data[national_rank_idx].get_text()
            qual_edu_rank = row_data[qual_edu_rank_idx].get_text()
            alumni_emp_rank = row_data[alumni_emp_rank_idx].get_text()
            qual_faculty_rank = row_data[qual_faculty_rank_idx].get_text()
            research_perf_rank = row_data[research_perf_rank_idx].get_text()
            score = row_data[score_idx].get_text()

            int_list = [world_rank, national_rank, qual_edu_rank, alumni_emp_rank, \
                        qual_faculty_rank, research_perf_rank]

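            # Reassigning a loop variable does not change the original variables,
            # so build a converted list and unpack it back into the variables.
            # Missing values ("-") are left unchanged.
            int_list = [int(item) if item != "-" else item for item in int_list]
            (world_rank, national_rank, qual_edu_rank, alumni_emp_rank,
             qual_faculty_rank, research_perf_rank) = int_list

            if score != "-":
                score = float(score)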

            rankings_dict = { 'World Rank': world_rank,
                              'Year': year,
                              'Institution': institution, 
                              'Country': country, 
                              'National Rank': national_rank, 
                              'Quality of Education Rank': qual_edu_rank,
                              'Alumni Employment Rank': alumni_emp_rank,
                              'Quality of Faculty Rank': qual_faculty_rank,
                              'Research Performance Rank': research_perf_rank,
                              'Score': score
                             }

            rankings_list.append(rankings_dict)

        except ValueError:
            continue

    return rankings_list
    
    

Finally, test your code below.

In [49]:
file1 = parse_html("2019-2020.html")
print(len(file1))

file2 = parse_html("2020-2021.html")
print(len(file2))

file3 = parse_html("2021-2022.html")
print(len(file3))


2000
2000
2000


In [50]:
print(file1[-1]["Institution"])
print(file1[50]["Quality of Faculty Rank"])
print(file1[0]["Year"])

print(file2[15]["Score"])
print(file2[-5]["National Rank"])
print(file2[40]["Research Performance Rank"])

print(file3[87]["Alumni Employment Rank"])
print(file3[100]["Country"])
print(file3[25]["World Rank"])


Government College University Faisalabad
78
2019-2020
89.0
15
398
464
United Kingdom
26


In [64]:
assert(file1[-1]["Institution"] == 'Government College University Faisalabad')
assert(file1[50]["Quality of Faculty Rank"] == 78)
assert(file1[0]["Year"] == '2019-2020')

assert(file2[15]["Score"] == 89.0)
assert(file2[-5]["National Rank"] == 15)
assert(file2[40]["Research Performance Rank"] == 398)

assert(file3[87]["Alumni Employment Rank"] == 464)
assert(file3[100]["Country"] == 'United Kingdom')
assert(file3[25]["World Rank"] == 26)
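
Printed output looks the same for the string `'78'` and the integer `78`, so the `print` cell above cannot confirm that the conversions worked; only the asserts can. As a complementary check, the sketch below scans a list of row dictionaries and reports values with an unexpected type. The function name `find_type_problems` and the sample rows are illustrative assumptions, not part of the lab.

```python
INT_COLUMNS = ["World Rank", "National Rank", "Quality of Education Rank",
               "Alumni Employment Rank", "Quality of Faculty Rank",
               "Research Performance Rank"]
FLOAT_COLUMNS = ["Score"]

def find_type_problems(rows, missing=("-", None)):
    """Return (row index, column, value) for every value with an unexpected type."""
    expected_types = [(column, int) for column in INT_COLUMNS]
    expected_types += [(column, float) for column in FLOAT_COLUMNS]

    problems = []
    for index, row in enumerate(rows):
        for column, expected in expected_types:
            if column not in row:
                continue
            value = row[column]
            if value in missing:
                continue  # missing values are allowed to stay unconverted
            if type(value) != expected:
                problems.append((index, column, value))
    return problems

# Illustrative rows: the second one was left as strings by mistake.
sample_rows = [
    {"World Rank": 1, "Score": 100.0},
    {"World Rank": "2", "Score": "96.7"},
    {"World Rank": 3, "Score": "-"},
]

print(find_type_problems(sample_rows))
```

Calling `find_type_problems(file1)` on the output of `parse_html` should return an empty list when every rank is an `int` and every score is a `float`.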

### Congratulations, you are now ready to start P12!