# Top Colleges Exploration

This notebook scrapes the top 150 college towns listed in the article [The 150 Best College Towns in America (2021 Ranking)](https://listwithclever.com/research/best-college-towns-2021/).

The article provides a table giving **Ranking**, **City (city, state)**, and **Colleges** associated with the city.

## Scraping the Data

In [1]:
# import libraries
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

In [2]:
# scraping information
url = 'https://listwithclever.com/research/best-college-towns-2021/'
url_headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
# the site responds with 403 without a User-Agent header and 200 with one
page = requests.get(url, headers=url_headers)
soup = BeautifulSoup(page.text, 'lxml')
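
The request above passes whatever comes back straight to the parser. A minimal sketch of a fetch helper that fails loudly on an error status, assuming the same `requests` library; the helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, user_agent, timeout=10):
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx response."""
    response = requests.get(url, headers={'User-Agent': user_agent}, timeout=timeout)
    # raise_for_status() turns a 403 (e.g. a rejected User-Agent) into an exception
    # instead of letting an error page flow into BeautifulSoup
    response.raise_for_status()
    return response.text
```

With this helper, the parsing step would become `BeautifulSoup(fetch_html(url, url_headers['User-Agent']), 'lxml')`.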

In [3]:
# get table featuring the top 150
college_table = soup.find('div', class_ = 'clever-table border-top border-bot col-1-center col-2-left col-3-left')

In [4]:
# identify table rows
college_rows = college_table.find_all('tr')

In [5]:
# the header row uses table header ('th') cells
# data rows use table data ('td') cells

# header row
col_headers = [header.text for header in college_rows[0].find_all('th')]

# data rows
top_colleges_dict = {'ranking': [], 'city': [], 'colleges': []}
for row in college_rows:
    ranked_row = row.find_all('td')
    if len(ranked_row) == 3:
        top_colleges_dict['ranking'].append(ranked_row[0].text)
        top_colleges_dict['city'].append(ranked_row[1].text)
        top_colleges_dict['colleges'].append(ranked_row[2].text)
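
To see why the `len(ranked_row) == 3` check skips the header row, here is a small self-contained sketch on a made-up table. The HTML snippet is hypothetical and is not the article's actual markup, and the built-in `html.parser` backend is used in place of `lxml` so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# hypothetical stand-in for the article's table
sample_html = """
<table>
  <tr><th>Ranking</th><th>City</th><th>Colleges</th></tr>
  <tr><td>1</td><td>Stanford, Calif.</td><td>Stanford University</td></tr>
  <tr><td>2</td><td>Williamsburg, Va.</td><td>William &amp; Mary</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.find_all('tr')

# the header row holds only <th> cells, so find_all('td') finds nothing there
header_cells = sample_rows[0].find_all('td')

parsed = []
for row in sample_rows:
    cells = row.find_all('td')
    if len(cells) == 3:
        # get_text(strip=True) also trims stray whitespace around each cell
        parsed.append([cell.get_text(strip=True) for cell in cells])
```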

In [6]:
# put data into pandas dataframe
colleges_df = pd.DataFrame(top_colleges_dict)

In [8]:
# before
colleges_df.head(10)

| | ranking | city | colleges |
|---|---|---|---|
| 0 | 1 | Stanford, Calif. | Stanford University |
| 1 | 2 | Williamsburg, Va. | William & Mary |
| 2 | 3 | Pasadena, Calif. | California Institute of Technology |
| 3 | 4 | Princeton, N.J. | Princeton University |
| 4 | 5 | Charlottesville, Va. | University of Virginia |
| 5 | 6 | Ann Arbor, Mich. | University of Michigan-Ann Arbor |
| 6 | 7 | Cambridge, Mass. | Harvard University, Massachusetts Institute of... |
| 7 | 8 | Berkeley, Calif. | University of California-Berkeley |
| 8 | 9 | Champaign, Ill. | University of Illinois Urbana-Champaign |
| 9 | 10 | Gainesville, Fla. | University of Florida, Santa Fe College |


## Cleaning the Data

In [9]:
# split city and state
colleges_df[['city', 'state']] = colleges_df['city'].str.split(', ', expand=True)

# reorder columns; ranking is dropped (the row index preserves rank order)
colleges_df = colleges_df[['city', 'state', 'colleges']]

# lowercase city, state, and colleges
colleges_df['city'] = colleges_df['city'].str.lower()
colleges_df['state'] = colleges_df['state'].str.lower()
colleges_df['colleges'] = colleges_df['colleges'].str.lower()

# expand state abbreviations, step 1: remove periods
# regex=False treats '.' as a literal period rather than a regex wildcard
colleges_df['state'] = colleges_df['state'].str.replace('.', '', regex=False)
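
A period is a wildcard when it is interpreted as a regular expression, so it is worth being explicit that the replacement is meant literally. The sketch below uses only the standard library (`re` and `str.replace`) to show the difference on a single sample value; it illustrates the general pitfall and makes no claim about the behavior of any particular pandas version.

```python
import re

state = 'calif.'

# treated as a regular expression, '.' matches every character
as_regex = re.sub('.', '', state)

# escaped, it only matches a literal period
as_escaped = re.sub(re.escape('.'), '', state)

# plain string replacement is always literal
as_literal = state.replace('.', '')
```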


In [10]:
# get a look at the values - number of unique states
colleges_df['state'].nunique()

44

In [11]:
# get a look at the values - states and state counts
colleges_df['state'].value_counts()

calif    15
nc        9
wis       8
ny        7
pa        7
texas     6
mich      6
va        6
wash      5
tenn      4
colo      4
mo        4
ind       4
ill       4
miss      3
wva       3
ohio      3
iowa      3
nj        3
md        3
mass      3
ariz      3
la        3
sc        3
mont      2
kan       2
conn      2
ala       2
ky        2
minn      2
utah      2
okla      2
fla       2
ore       2
ga        2
ark       1
wyo       1
nm        1
del       1
nev       1
ri        1
maine     1
vt        1
idaho     1
Name: state, dtype: int64

In [12]:
# create a dictionary to turn "abbreviated" states to their respective full names
# abbreviated states
abbreviated = ['calif','va','nj','mich','mass','ill','fla','ind','wva','nc','pa',
               'miss','texas','colo','ny','ga','ore','iowa','del','wis','md','okla',
               'mo','utah','minn','ariz','la','sc','ky','wash','ark','ala','wyo',
               'conn','kan','nm','tenn','ohio','mont','nev','ri','maine','vt','idaho']

# full states
full = ['california','virginia','new jersey','michigan','massachusetts','illinois',
        'florida','indiana','west virginia','north carolina','pennsylvania',
        'mississippi','texas','colorado','new york','georgia','oregon','iowa',
        'delaware','wisconsin','maryland','oklahoma','missouri','utah','minnesota',
        'arizona','louisiana','south carolina','kentucky','washington','arkansas',
        'alabama','wyoming','connecticut','kansas','new mexico','tennessee','ohio',
        'montana','nevada','rhode island','maine','vermont','idaho']

# dictionary mapping each abbreviation to its full state name
# the article's table uses non-standard abbreviations, so a custom dictionary is needed
abbreviated_states_custom = {abbr_st: full_st for abbr_st, full_st in zip(abbreviated, full)}
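
Because `zip` stops at the shorter input without warning, an entry missing from either list would shift every later pair or drop trailing ones. A minimal sketch of a sanity check, shown on a three-entry subset of the lists above (the `_sample` names are hypothetical):

```python
abbreviated_sample = ['calif', 'va', 'nj']
full_sample = ['california', 'virginia', 'new jersey']

# zip() truncates to the shorter list, so check the lengths first
assert len(abbreviated_sample) == len(full_sample), 'lists are out of sync'

# duplicate keys would silently overwrite each other in the dict
assert len(set(abbreviated_sample)) == len(abbreviated_sample), 'duplicate abbreviation'

sample_lookup = dict(zip(abbreviated_sample, full_sample))
```

Writing the mapping as a single dict literal would keep each abbreviation next to its full name and avoid the alignment issue altogether.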

In [13]:
# apply the dictionary
colleges_df['state'] = colleges_df['state'].apply(lambda state: abbreviated_states_custom[state])
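
The lambda above raises a `KeyError` on the first abbreviation missing from the dictionary. An alternative sketch uses pandas' `Series.map`, which leaves `NaN` for values without a key so that every gap can be listed at once. The sample data and the unknown value `'xyz'` are made up for illustration.

```python
import pandas as pd

sample_states = pd.Series(['calif', 'va', 'xyz'])
sample_lookup = {'calif': 'california', 'va': 'virginia'}

# map() looks each value up in the dict; values without a key become NaN
mapped = sample_states.map(sample_lookup)

# collect the abbreviations that had no match
unmapped = sample_states[mapped.isna()].unique().tolist()
```

An empty `unmapped` list would confirm that the custom dictionary covers every state in the column.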

In [15]:
# after
colleges_df.head(10)

| | city | state | colleges |
|---|---|---|---|
| 0 | stanford | california | stanford university |
| 1 | williamsburg | virginia | william & mary |
| 2 | pasadena | california | california institute of technology |
| 3 | princeton | new jersey | princeton university |
| 4 | charlottesville | virginia | university of virginia |
| 5 | ann arbor | michigan | university of michigan-ann arbor |
| 6 | cambridge | massachusetts | harvard university, massachusetts institute of... |
| 7 | berkeley | california | university of california-berkeley |
| 8 | champaign | illinois | university of illinois urbana-champaign |
| 9 | gainesville | florida | university of florida, santa fe college |


## Visualizing the Data

For future exploration, use `data/top-colleges.csv`, which is pulled and cleaned by `scripts/top-colleges-extractor.py`.