To extract data about the universities, you first need the URLs of all the university pages. To get them:

1. Go to [the US News list of national universities](https://www.usnews.com/best-colleges/rankings/national-universities). 
2. Open the JavaScript console (`Ctrl` `Shift` `I`, then click on the 'Console' tab).
3. Paste in the following code and press Enter: `a = document.querySelector('button[class*="LoadMoreWrapper"]'); window.setInterval(() => a.click(), 1000)`. This clicks the load-more button once a second, so the page keeps loading more universities.
4. Wait for a minute or so, until the page stops loading new universities.
5. Save the web page (`Ctrl` `S`) into this folder.

I've already done these steps and saved it here.

Now, we need to extract the URLs from this page. To do this, we'll use BeautifulSoup.

In [11]:
from bs4 import BeautifulSoup

# Use a context manager so the file is closed once it has been parsed
with open("2022 Best National Universities US News Rankings.htm", encoding="utf8") as html_doc:
    soup = BeautifulSoup(html_doc, 'html.parser')

In [12]:
anchors = soup.select('a[class*="card-name"]')
links = [a.get('href') for a in anchors]
print(links[:10])

['https://www.usnews.com/best-colleges/princeton-university-2627', 'https://www.usnews.com/best-colleges/columbia-university-2707', 'https://www.usnews.com/best-colleges/harvard-university-2155', 'https://www.usnews.com/best-colleges/massachusetts-institute-of-technology-2178', 'https://www.usnews.com/best-colleges/yale-university-1426', 'https://www.usnews.com/best-colleges/stanford-university-1305', 'https://www.usnews.com/best-colleges/university-of-chicago-1774', 'https://www.usnews.com/best-colleges/university-of-pennsylvania-3378', 'https://www.usnews.com/best-colleges/california-institute-of-technology-1131', 'https://www.usnews.com/best-colleges/duke-university-2920']


Now that we have all of the university links, we need to go through them one by one, request each university's sub-pages from the server, and extract the information into a dataframe.
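A minimal sketch of that loop, assuming the extraction functions are changed to return `(label, value)` pairs instead of printing them. The names `scrape_all`, `fetch`, and `extract`, and the one-second delay, are hypothetical. Only the `/academics` sub-page appears in this notebook, so any other sub-page names would need to be checked against the site.

```python
import time


def scrape_all(links, subpages, fetch, extract, delay=1.0):
    """Return one dict per university, mapping "subpage.label" to its value.

    fetch(url) -> HTML text and extract(html) -> iterable of (label, value)
    are supplied by the caller, e.g. thin wrappers around requests.get and
    the BeautifulSoup functions in this notebook (hypothetical wiring).
    """
    records = []
    for link in links:
        record = {"url": link}
        for sub in subpages:
            html = fetch(link.rstrip("/") + "/" + sub)
            for label, value in extract(html):
                record[sub + "." + label] = value
            time.sleep(delay)  # pause between requests to go easy on the server
        records.append(record)
    return records


# Usage with stand-in functions, so no network access is needed
def fake_fetch(url):
    return "Student-faculty ratio=4:1"


def fake_extract(html):
    return [tuple(html.split("="))]


rows = scrape_all(
    ["https://www.usnews.com/best-colleges/princeton-university-2627"],
    ["academics"],
    fake_fetch,
    fake_extract,
    delay=0,
)
print(rows)
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)`, which creates one column per distinct key.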

In [100]:
import requests

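url = links[0] + "/academics"
# Send a browser-like User-Agent header with the request
agent = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=agent)
response.raise_for_status()  # stop here if the server returned an error status
soup = BeautifulSoup(response.text, 'html.parser')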

In [144]:
def get_data_rows(soup_obj):
    # Exclude rows with class 'pie-chart' or 'datarow-table', which are dealt with in different functions
    datarows = soup_obj.select('div[class*="DataRow"]:not([class*="datarow-table"]):not([class*="pie-chart"])')
    for d in datarows:
        # Skip rows with a 'span' parent: those are list elements
        # and are captured in get_truncated_data_tables
        if len(d.contents) == 2 and d.parent.name != "span":
            print(d.contents[0].text + " ::: " + d.contents[1].text)

get_data_rows(soup)

Alumni starting salaries by major ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
Minors ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
Degrees offered ::: Bachelor's, Master's, Doctorate - research/scholarship
Combined-degree programs ::: N/A
Online bachelor's degree program offered ::: No
Student participation in special study options ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
Qualified undergraduate students may take graduate-level classes ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
Student-faculty ratio ::: 4:1
General education/core curriculum required ::: Yes
Total faculty ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
Minority ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
International student retention rate ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
Graduation rates ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
Enrolled in continuing education ::: UNLOCK WITH COMPASS UNLOCK WITH COMPASS
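Many of the values above read `UNLOCK WITH COMPASS`, which appears to be a placeholder for content that is not available without a subscription. Here is a minimal sketch of filtering those rows out, assuming `get_data_rows` is changed to return `(label, value)` pairs rather than print them. The names `PAYWALL_MARKER` and `drop_locked` are hypothetical.

```python
PAYWALL_MARKER = "UNLOCK WITH COMPASS"


def drop_locked(pairs, marker=PAYWALL_MARKER):
    """Keep only (label, value) pairs whose value is not a paywall placeholder."""
    return [(label, value) for label, value in pairs if marker not in value]


# Sample pairs copied from the output above
sample = [
    ("Minors", "UNLOCK WITH COMPASS UNLOCK WITH COMPASS"),
    ("Student-faculty ratio", "4:1"),
    ("Combined-degree programs", "N/A"),
]
print(drop_locked(sample))
```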


In [119]:
def get_rankings(soup_obj):
    # Each badge list item holds the ranking details in its second child div
    rank_rows = soup_obj.select('li[class*="BadgeList"] > div:nth-child(2)')
    for r in rank_rows:
        x = r.select('div:nth-child(1) > a > strong')
        if len(x) >= 2:
            print(x[0].text + ":::" + x[1].text)
            
get_rankings(soup)

In [117]:
def get_truncated_data_tables(soup_obj):
    tables = soup_obj.select('div[class*="datarow-table truncated"]')
    for t in tables:
        table_name = t.select('p')[0].text
        for s in t.select('div[class*="DataRow__Row"]'):
            x = s.contents
            if len(x) == 2:
                print(table_name + "." + x[0].text + " ::: " + x[1].text)

get_truncated_data_tables(soup)

Ten most popular majors for 2020 graduates.Social Sciences :::  20%
Ten most popular majors for 2020 graduates.Engineering :::  15%
Ten most popular majors for 2020 graduates.Computer and Information Sciences and Support Services :::  12%
Ten most popular majors for 2020 graduates.Biological and Biomedical Sciences :::  10%
Ten most popular majors for 2020 graduates.Public Administration and Social Service Professions :::  9%
Ten most popular majors for 2020 graduates.Physical Sciences :::  7%
Ten most popular majors for 2020 graduates.History :::  6%
Ten most popular majors for 2020 graduates.Foreign Languages, Literatures, and Linguistics :::  4%
Ten most popular majors for 2020 graduates.English Language and Literature/Letters :::  3%
Ten most popular majors for 2020 graduates.Philosophy and Religious Studies :::  3%
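The percentages come back as strings with a leading space and a trailing `%`. Here is a minimal sketch of converting them to numbers and grouping them by table, assuming `get_truncated_data_tables` is changed to return `(table, label, value)` triples instead of printing. The names `parse_percent` and `group_by_table` are hypothetical.

```python
def parse_percent(text):
    """Convert a string such as ' 20%' to the float 20.0."""
    return float(text.strip().rstrip("%"))


def group_by_table(triples):
    """Build {table name: {label: numeric value}} from (table, label, value) triples."""
    tables = {}
    for table, label, value in triples:
        tables.setdefault(table, {})[label] = parse_percent(value)
    return tables


# Sample triples copied from the output above
sample = [
    ("Ten most popular majors for 2020 graduates", "Social Sciences", " 20%"),
    ("Ten most popular majors for 2020 graduates", "Engineering", " 15%"),
]
print(group_by_table(sample))
```

Keeping the table name and the label as separate fields avoids having to split the printed `table.label` string later, which would be ambiguous if either part contained a period.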


In [140]:
def get_bar_charts(soup_obj):
    bar_charts = soup_obj.select('div[class*="pie-chart"]')  # Yes, all the bar charts are labelled as pie charts
    for chart in bar_charts:
        bar_chart_title = chart.select('p')[0].text
        legend_items = chart.select('div[class*="BarChartStacked__LegendWrapper"] > div[class*="BarChartStacked__Legend"] > div')
        for item in legend_items:
            # contents[1] is the legend label; the <b> tag holds the value
            print(bar_chart_title + "." + item.contents[1] + " ::: " + item.b.contents[0])
        print("---")

get_bar_charts(soup)

Full-time faculty gender distribution.Male ::: 63.5
Full-time faculty gender distribution.Female ::: 36.5
---
Part-time faculty gender distribution.Male ::: 54.8
Part-time faculty gender distribution.Female ::: 45.2
---
Class sizes.Classes with fewer than 20 students ::: 77.6
Class sizes.20-49 ::: 13.5
Class sizes.50 or more ::: 9
---


In [None]:
def get_full_tables(soup_obj):
    pass