With the importance placed on getting a college degree nowadays ([65 percent of workers have at least some postsecondary education](https://cew.georgetown.edu/cew-reports/americas-divided-recovery/)), students more than ever need to focus on getting into the colleges that will best help them achieve their goals. Sites like [US News](https://www.usnews.com/best-colleges) have become the de facto standard for students searching through hundreds of universities to decide which is the best fit. In particular, the US News ranking system carries enormous weight, influencing [how schools are perceived by the general public](https://www.forbes.com/sites/robertzafft/2021/12/07/us-news-business-school-rankings-crucial-but-meaningless/).



## Part 1: Web Scraping
After about 8 hours spent writing a web scraper to pull every component from the website, it turned out that each page already provides JSON objects containing all the data in a format that is much easier to extract.

In [132]:
from bs4 import BeautifulSoup

# Parse the saved rankings page; the with block closes the file afterwards
with open("2022 Best National Universities US News Rankings.htm", encoding="utf8") as html_doc:
    soup = BeautifulSoup(html_doc, 'html.parser')

In [133]:
anchors = soup.select('a[class*="card-name"]')
links = [anchor.get('href') for anchor in anchors]
print(links[:10])
print(len(links))

['https://www.usnews.com/best-colleges/princeton-university-2627', 'https://www.usnews.com/best-colleges/columbia-university-2707', 'https://www.usnews.com/best-colleges/harvard-university-2155', 'https://www.usnews.com/best-colleges/massachusetts-institute-of-technology-2178', 'https://www.usnews.com/best-colleges/yale-university-1426', 'https://www.usnews.com/best-colleges/stanford-university-1305', 'https://www.usnews.com/best-colleges/university-of-chicago-1774', 'https://www.usnews.com/best-colleges/university-of-pennsylvania-3378', 'https://www.usnews.com/best-colleges/california-institute-of-technology-1131', 'https://www.usnews.com/best-colleges/duke-university-2920']
392


In [65]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

link = "https://www.usnews.com/best-colleges/princeton-university-2627"

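def get_soup_obj(link, subpage):
    """Fetch one subpage of a school's profile and return it as parsed HTML."""
    url = link + "/" + subpage
    # Send a browser-like User-Agent header with the request
    agent = {"User-Agent":'Mozilla/5.0'}
    # The timeout keeps a stalled request from hanging the whole scrape
    response = requests.get(url, headers = agent, timeout = 30)
    return BeautifulSoup(response.text, 'html.parser')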

soup = get_soup_obj(link, "student-life")

Next, we inspect the structure of each HTML page, find the line in the script tag that contains the JSON object, and extract it:

In [140]:
def get_json_from_soup(soup_obj):
    # The page data sits on the fifth line of the second-to-last <script> tag
    script_json_line = soup_obj.find_all("script")[-2].contents[0].split("\n")[4]
    # Keep the text from the first "{" onward, drop the final character, and
    # replace the JavaScript-only literal undefined, which is not valid JSON
    json_obj_str = script_json_line[script_json_line.find("{"):-1].replace("undefined","null")
    json_obj = json.loads(json_obj_str)

    # The profile data is stored under one of three keys, depending on the subpage
    base_key = 'src/containers/pages/education/higher-education/colleges/profile'
    json_data = json_obj.get(f'{base_key}/overview.js') or \
                json_obj.get(f'{base_key}/generic.js') or \
                json_obj.get(f'{base_key}/rankings.js')
    if not json_data:
        raise KeyError(f"No profile data found under {base_key}")

    return json_data['data']['context']['data']['page']

#print(get_json_from_soup(get_soup_obj(link,"")))
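The fixed script index and line number used above depend on the page layout at the time of scraping. As a minimal sketch of a more tolerant extraction step, assuming the JSON object is the first `{...}` value in the script text, the standard library's `json.JSONDecoder.raw_decode` can parse the object and ignore whatever follows it. The helper name `extract_embedded_json` and the sample script string are hypothetical and only serve as illustration.

```python
import json

def extract_embedded_json(script_text):
    """Return the first JSON object found in a block of JavaScript text."""
    # JavaScript's bare `undefined` is not valid JSON, so map it to null first.
    # Note: this naive replace would also alter the word inside string values.
    cleaned = script_text.replace("undefined", "null")
    start = cleaned.find("{")
    if start == -1:
        raise ValueError("no JSON object found in script text")
    # raw_decode parses one JSON value starting at `start` and ignores
    # anything after it, such as a trailing semicolon.
    obj, _end = json.JSONDecoder().raw_decode(cleaned, start)
    return obj

# Hypothetical script content, made up for illustration
sample_script = 'window.__STATE__ = {"page": {"rank": 1, "note": undefined}};'
state = extract_embedded_json(sample_script)
print(state["page"])
```

Because the parser stops at the end of the first complete object, this avoids slicing off the last character by hand, though it still assumes the data object comes first in the script text.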

The 'fields' entry describes how the data is structured, 'schoolData' holds the values, and 'schoolDetails' holds basic information that is not in 'schoolData'.
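To make the relationship between the field definitions and the values concrete, here is a small sketch on a hand-built record. The keys `label`, `fieldName`, and `rawValue` are the ones used elsewhere in this notebook; the record itself, its values, and the helper `labelled_value` are made up for illustration and are not scraped data.

```python
# Toy record mirroring the layout used in this notebook: 'fields' describes
# each entry and 'data' holds its value. All values here are invented.
record = {
    "fields": {
        "vClasses1": {
            "label": "Classes with fewer than 20 students",
            "fieldName": "vClasses1",
        },
    },
    "data": {
        "vClasses1": {"fieldName": "vClasses1", "rawValue": 50.0},
    },
}

def labelled_value(record, field_name):
    """Pair a field's human-readable label with its raw value."""
    label = record["fields"][field_name]["label"]
    raw_value = record["data"][field_name]["rawValue"]
    return label, raw_value

print(labelled_value(record, "vClasses1"))
```

The same field name indexes both dictionaries, which is what lets the analysis code look up a label in one and the matching value in the other.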

We can use these methods to get all the data from every university and put it in a single JSON object, with the university name as the key:

In [None]:
all_univ_json_data = {}
pages = ["", "overall-rankings", "applying", "academics", "student-life", "paying", "campus-info"]
for link in links:
    json_out = {'fields': {}, 'data': {}}
    for page in pages:
        json_data_fields = get_json_from_soup(get_soup_obj(link,page))
        # Merge this subpage's values and field definitions into the school's record
        json_out['data'].update(json_data_fields['schoolData'])
        json_out['fields'].update(json_data_fields['fields'])
        if page == "":
            # The overview page also carries the basic school details,
            # so reuse it here instead of requesting it a second time
            json_out['details'] = json_data_fields['schoolDetails']
    name = json_out['details']['displayName']
    all_univ_json_data[name] = json_out
    print(name)

In [155]:
with open('all_univ.json', 'w') as outfile:
    json.dump(all_univ_json_data, outfile)

## Part 2: Analysis

In [157]:
# Read in the university data
with open('all_univ.json', 'r') as infile:
    all_univ_json_data = json.load(infile)

In [191]:
univ_name = 'Princeton University'
nested_fields = []
for field in all_univ_json_data[univ_name]['fields'].values():
    if field['isPublic']: 
        if 'subFields' in field:
            print(f"{field['label']} ({field['primaryKey']})")
            for subfield in field['subFields']:
                if "fields" in subfield:
                    for subfield_field in subfield['fields']:
                        nested_fields.append(subfield_field)
                        subfield_label = all_univ_json_data[univ_name]['fields'][subfield_field]['label']
                        subfield_data = all_univ_json_data[univ_name]['data'][subfield_field]
                        print(f"\t\t{subfield_label} ({subfield_data['fieldName']}): {subfield_data['rawValue']}")
                elif "field" in subfield:
                    nested_fields.append(subfield['field'])
                    subfield_name = all_univ_json_data[univ_name]['fields'][subfield['field']]['label']
                    subfield_data = all_univ_json_data[univ_name]['data'][subfield['field']]
                    print(f"\t{subfield_name} ({subfield_data['fieldName']}): {subfield_data['rawValue']}")
        else:
            if field['fieldName'] not in nested_fields:
                field_label = field['label']
                field_data = all_univ_json_data[univ_name]['data'][field['fieldName']]
                print(f"{field_label} ({field_data['fieldName']}): {field_data['rawValue']}")
        

Median starting salary of alumni (payscaleOverallStarting): None
By major (topMajors): [['Social Sciences', 20], ['Engineering', 15], ['Computer and Information Sciences and Support Services', 12], ['Biological and Biomedical Sciences', 10], ['Public Administration and Social Service Professions', 9], ['Physical Sciences', 7], ['History', 6], ['Foreign Languages, Literatures, and Linguistics', 4], ['English Language and Literature/Letters', 3], ['Philosophy and Religious Studies', 3]]
Selectivity (cSelectClass): Most selective
Fall 2020 acceptance rate (rCAcceptRate): 6
Application deadline (applicationDeadline): January 1
SAT/ACT scores must be received by (actSatiLatestDate): January 1
Class sizes  (gClassSizes)
	Classes with fewer than 20 students (vClasses1): 77.6
	20-49 (vClasses2): 13.5
	50 or more (vClasses3): 9
Student-faculty ratio (vStudentFacultyRatio): 4:1
4-year graduation rate (gradRate4Year): 90
Student gender distribution (gStudentGenderDistribution)
	Male (vPctUnderMen