# Data Engineering with Beautiful Soup

(created along with Nelson Santos for cs109)

Data Engineering, the process of gathering and preparing data for analysis, is a very big part of Data Science.

Datasets might not be formatted in the way you need (e.g. you have categorical features but your algorithm requires numerical features); or you might need to cross-reference some dataset to another that has a different format; or you might be dealing with a dataset that contains missing or invalid data.

These are just a few examples of why data retrieval and cleaning are so important.

## Retrieving data from the web

### requests

You might need to retrieve some data from the Internet. Python has many built-in libraries that were developed over the years to do exactly that (e.g. urllib, urllib2, urllib3).

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.

Luckily, as with most tasks in Python, someone has developed a library that simplifies these tasks. Get acquainted to `requests` as soon as possible, since you will probably need it in the future.

In [39]:
import requests

Now that the requests library was imported into our namespace, we can use the functions offered by it.

In this case we'll use the appropriately named `get` function to issue a *GET* request. This is equivalent to typing a URL into your browser and hitting enter.

In [40]:
# Get the HU Wikipedia page
localreq = requests.get("https://en.wikipedia.org/wiki/Kabarak_University")
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

Python is an Object Oriented language, and everything on it is an object. Even built-in functions such as `len` are just syntactic sugar for acting on object properties.

We will not dwell too long on OO concepts, but some of Python's idiosyncrasies will be easier to understand if we spend a few minutes on this subject.

When you evaluate an object itself, such as the `req` object we created above, Python will automatically call the `__str__()` or `__repr__()` method of that object. The default values for these methods are usually very simple and boring. The `req` object however has a custom implementation that shows the object type (i.e. `Response`) and the HTTP status number (200 means the request was successful).

In [3]:
req

<Response [200]>

Just to confirm, we will call the `type` function on the object to make sure it agrees with the value above.

In [4]:
type(req)

requests.models.Response

Right now `req` holds a reference to a *Request* object; but we are interested in the text associated with the web page, not the object itself.

So the next step is to assign the value of the `text` property of this `Request` object to a variable.

In [42]:
localpage = localreq.text
localpage[:1000]

page = req.text
page[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Harvard University - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"f4393be0-80af-4b51-9cee-dfecc37c53ee","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1069140536,"wgRevisionId":1069140536,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","CS1 errors: generic name","CS1 maint:

In [43]:
from IPython.display import IFrame, HTML
#IFrame(HTML(page), 1024, 768)
HTML(localpage)
HTML(page)

0,1
,It has been suggested that The Harvard Gazette be merged into this article. (Discuss) Proposed since January 2022.

0,1
Coat of arms,Coat of arms
Latin: Universitas Harvardiana,Latin: Universitas Harvardiana
Former names,Harvard College
Motto,Veritas (Latin)[1]
Motto in English,Truth
Type,Private research university
Established,1636; 386 years ago[2]
Founder,Massachusetts General Court
Accreditation,NECHE
Academic affiliations,NAICU AICUM AAU URA Space-grant

0,1
School,Founded
Harvard College,1636
Medicine,1782
Divinity,1816
Law,1817
Dental Medicine,1867
Arts and Sciences,1872
Business,1908
Extension,1910
Design,1914

Academic rankings,Academic rankings
National,National.1
ARWU[83],1
Forbes[84],7
THE/WSJ[85],1
U.S. News & World Report[86],2
Washington Monthly[87],5
Global,Global
ARWU[88],1
QS[89],5
THE[90],2
U.S. News & World Report[91],1

National Graduate Rankings[92],National Graduate Rankings[92],National Graduate Rankings[92],National Graduate Rankings[92]
Program,Ranking,Unnamed: 2_level_1,Unnamed: 3_level_1
Biological Sciences,4,,
Business,6,,
Chemistry,2,,
Clinical Psychology,10,,
Computer Science,16,,
Earth Sciences,8,,
Economics,1,,
Education,1,,
Engineering,22,,
English,8,,

Global Subject Rankings[93],Global Subject Rankings[93],Global Subject Rankings[93],Global Subject Rankings[93]
Program,Ranking,Unnamed: 2_level_1,Unnamed: 3_level_1
Agricultural Sciences,22,,
Arts & Humanities,2,,
Biology & Biochemistry,1,,
Cardiac & Cardiovascular Systems,1,,
Chemistry,15,,
Clinical Medicine,1,,
Computer Science,47,,
Economics & Business,1,,
Electrical & Electronic Engineering,136,,
Engineering,27,,

School,Founded,Enrollment,U.S. News & World Report
Harvard College,1636,6755,2[104]
Medicine,1782,660,1[105]
Divinity,1816,377,
Law,1817,1990,3[106]
Dental Medicine,1867,280,
Arts and Sciences,1872,4824,
Business,1908,2011,5[107]
Extension,1910,3428,
Design,1914,878,
Education,1920,876,1[108]

Unnamed: 0,Undergrad,Grad/prof
Asian,21%,13%
Black,9%,5%
Hispanic or Latino,11%,7%
White,37%,38%
Two or more races,8%,3%
International,12%,32%

0,1
,Scholia has an organization profile for Harvard University.

vteHarvard University,vteHarvard University.1
History John Harvard statue President Lawrence Bacow Board of Overseers President and Fellows of Harvard College Provost Alan Garber Harvard Library,History John Harvard statue President Lawrence Bacow Board of Overseers President and Fellows of Harvard College Provost Alan Garber Harvard Library
Arts and  Sciences,"Dean Claudine Gay College Dean Rakesh Khurana Radcliffe College Freshman dormitories Upperclass houses Adams Cabot Currier Dudley Dunster Eliot Kirkland Leverett Lowell Mather Pforzheimer Quincy Winthrop Undergraduate organizations The Harvard Crimson The Harvard Lampoon The Harvard Advocate The Harvard Independent Hasty Pudding Theatricals Athletics Baseball Men's basketball Women's basketball Fencing Football Men's ice hockey Women's ice hockey Men's lacrosse‎ Rugby Men's soccer Men's squash Men's volleyball Women's volleyball Ivy League Harvard Stadium Yale football rivalry Lavietes Pavilion Bright-Landry Hockey Center Cornell hockey rivalry Beanpot Jordan Field Ohiri Field Malkin Athletic Center Newell Boathouse Weld Boathouse Continuing Education Dean Nancy Coleman Harvard Extension School Harvard Summer School History of Harvard Extension School Engineering and Applied Sciences Dean Francis J. Doyle III Lyman Laboratory of Physics Graduate School Dean Emma Dench Libraries Cabot Harvard-Yenching Houghton Harvard Review Lamont Pusey Widener Harry Elkins Widener Eleanor Elkins Widener Grossman Centers, Institutes & Societies Asia Center Carpenter Center for the Visual Arts Hellenic Studies Dumbarton Oaks Edmond J. Safra Center for Ethics Fairbank Center for Chinese Studies Harvard Forest Harvard–Smithsonian Center for Astrophysics Hutchins Center for African and African American Research W. E. B. Du Bois Institute Rowland Institute for Science Ukrainian Research Institute"
Dean Claudine Gay,Dean Claudine Gay
College,Dean Rakesh Khurana Radcliffe College Freshman dormitories Upperclass houses Adams Cabot Currier Dudley Dunster Eliot Kirkland Leverett Lowell Mather Pforzheimer Quincy Winthrop Undergraduate organizations The Harvard Crimson The Harvard Lampoon The Harvard Advocate The Harvard Independent Hasty Pudding Theatricals Athletics Baseball Men's basketball Women's basketball Fencing Football Men's ice hockey Women's ice hockey Men's lacrosse‎ Rugby Men's soccer Men's squash Men's volleyball Women's volleyball Ivy League Harvard Stadium Yale football rivalry Lavietes Pavilion Bright-Landry Hockey Center Cornell hockey rivalry Beanpot Jordan Field Ohiri Field Malkin Athletic Center Newell Boathouse Weld Boathouse
Continuing Education,Dean Nancy Coleman Harvard Extension School Harvard Summer School History of Harvard Extension School
Engineering and Applied Sciences,Dean Francis J. Doyle III Lyman Laboratory of Physics
Graduate School,Dean Emma Dench
Libraries,Cabot Harvard-Yenching Houghton Harvard Review Lamont Pusey Widener Harry Elkins Widener Eleanor Elkins Widener Grossman
"Centers, Institutes & Societies",Asia Center Carpenter Center for the Visual Arts Hellenic Studies Dumbarton Oaks Edmond J. Safra Center for Ethics Fairbank Center for Chinese Studies Harvard Forest Harvard–Smithsonian Center for Astrophysics Hutchins Center for African and African American Research W. E. B. Du Bois Institute Rowland Institute for Science Ukrainian Research Institute
Business,Dean Srikant Datar Harvard Business Publishing Harvard Business Review Baker Library/Bloomberg Center Spangler Center

0,1
Dean Claudine Gay,Dean Claudine Gay
College,Dean Rakesh Khurana Radcliffe College Freshman dormitories Upperclass houses Adams Cabot Currier Dudley Dunster Eliot Kirkland Leverett Lowell Mather Pforzheimer Quincy Winthrop Undergraduate organizations The Harvard Crimson The Harvard Lampoon The Harvard Advocate The Harvard Independent Hasty Pudding Theatricals Athletics Baseball Men's basketball Women's basketball Fencing Football Men's ice hockey Women's ice hockey Men's lacrosse‎ Rugby Men's soccer Men's squash Men's volleyball Women's volleyball Ivy League Harvard Stadium Yale football rivalry Lavietes Pavilion Bright-Landry Hockey Center Cornell hockey rivalry Beanpot Jordan Field Ohiri Field Malkin Athletic Center Newell Boathouse Weld Boathouse
Continuing Education,Dean Nancy Coleman Harvard Extension School Harvard Summer School History of Harvard Extension School
Engineering and Applied Sciences,Dean Francis J. Doyle III Lyman Laboratory of Physics
Graduate School,Dean Emma Dench
Libraries,Cabot Harvard-Yenching Houghton Harvard Review Lamont Pusey Widener Harry Elkins Widener Eleanor Elkins Widener Grossman
"Centers, Institutes & Societies",Asia Center Carpenter Center for the Visual Arts Hellenic Studies Dumbarton Oaks Edmond J. Safra Center for Ethics Fairbank Center for Chinese Studies Harvard Forest Harvard–Smithsonian Center for Astrophysics Hutchins Center for African and African American Research W. E. B. Du Bois Institute Rowland Institute for Science Ukrainian Research Institute

Links to related articles,Links to related articles.1
"vteIvy League Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Harvard Crimson Penn Quakers Princeton Tigers Yale Bulldogs vteCambridge, MassachusettsHistory Timeline Squares Central Square Harvard Square Inman Square Kendall Square Lechmere Square Porter Square Neighborhoods East Cambridge (Area 1) MIT Campus (Area 2) Wellington-Harrington (Area 3) The Port (Area 4) Cambridgeport (Area 5) Mid-Cambridge (Area 6) Riverside (Area 7) Agassiz (Area 8) Peabody (Area 9) West Cambridge (Area 10) North Cambridge (Area 11) Cambridge Highlands (Area 12) Strawberry Hill (Area 13) Education Cambridge PSD Amigos School Rindge and Latin Community Charter Prospect Hill Academy Charter St. Paul's Choir School Harvard University template Massachusetts Institute of Technology template Cambridge Public Library Landmarks List of tallest buildings and structures City Hall Harvard Book Store Mount Auburn Cemetery Transportation Bus routes MBTA Green Line (Lechmere) MBTA Red Line (Alewife Central Harvard Porter Kendall/MIT) MA-2/MA-2A, MA-16, MA-28, MA-60, US-3 This list is incomplete. vteColonial colleges Brown Columbia Dartmouth Harvard Penn Princeton Rutgers William & Mary Yale vteColleges and universities in metropolitan Boston Babson College Bay State College Benjamin Franklin Institute of Technology Bentley University Berklee College of Music Boston Architectural College Boston Baptist College Boston College Boston Conservatory Boston Graduate School of Psychoanalysis Boston University Brandeis University Bunker Hill Community College Cambridge College Curry College Eastern Nazarene College Emerson College Emmanuel College Fisher College Harvard University Hebrew College Hellenic College Hult International Business School Labouré College Lasell College Lesley University Longy School of Music Massachusetts Bay Community College Massachusetts College of Art and Design Massachusetts College of Pharmacy and Health Sciences Massachusetts Institute of Technology MGH Institute New England College of Optometry New England Conservatory New England School of Law Northeastern University North Shore Community College Pine Manor College Quincy College Roxbury Community College St. John's Seminary School of the Museum of Fine Arts at Tufts Simmons College Suffolk University Tufts University University of Massachusetts Boston Urban College of Boston Wentworth Institute of Technology William James College vteAssociation of American UniversitiesPublic Arizona California Berkeley Davis Irvine Los Angeles San Diego Santa Barbara Santa Cruz Colorado Florida Georgia Tech Illinois Indiana Iowa Iowa State Kansas Maryland Michigan Michigan State Minnesota Missouri New York Buffalo Stony Brook North Carolina Ohio State Oregon Penn State Pittsburgh Purdue Rutgers Texas Texas A&M Utah Virginia Washington Wisconsin Private Boston U Brandeis Brown Caltech Carnegie Mellon Case Western Reserve Chicago Columbia Cornell Dartmouth Duke Emory Harvard Johns Hopkins MIT Northwestern NYU Penn Princeton Rice Rochester USC Stanford Tufts Tulane Vanderbilt Wash U Yale Canadian (public) McGill Toronto vteAssociation of Independent Colleges and Universities in Massachusetts (AICUM) Amherst Anna Maria Assumption Babson Bay Path Becker Bentley Berklee Boston Architectural Boston Baptist Boston College Boston U Brandeis Cambridge College Clark College of the Holy Cross Curry Dean Eastern Nazarene Elms Emerson Emmanuel Endicott Fisher Gordon Hampshire Harvard Hebrew Labouré Lasell Lesley Marian Court MCPHS MIT Merrimack MGH Institute Mount Holyoke Mount Ida NECO New England Conservatory Newbury Nichols Northeastern Olin Pine Manor Regis Simmons Smith Springfield Stonehill Suffolk Tufts Urban College of Boston Wellesley WIT Western New England Wheaton Wheelock Williams WPI vteHCP Research Network California Berkeley Los Angeles Chieti–Pescara D'Annunzio Ernst Strüngmann Institute Harvard Indiana Minnesota Nijmegen Radboud Oxford Saint Louis Warwick Washington vteECAC HockeyTeams Brown Bears men women Clarkson Golden Knights men women Colgate Raiders men women Cornell Big Red men women Dartmouth Big Green men women Harvard Crimson men women Princeton Tigers men women Quinnipiac Bobcats men women Rensselaer Engineers men women St. Lawrence Saints men women Union Dutchmen men women Yale Bulldogs men women Venues Meehan Auditorium (Brown) Cheel Arena (Clarkson) Class of 1965 Arena (Colgate) Lynah Rink (Cornell) Thompson Arena (Dartmouth) Bright Hockey Center (Harvard) Hobey Baker Memorial Rink (Princeton) People's United Center (Quinnipiac) Houston Field House (Rensselaer) Appleton Arena (St. Lawrence) Achilles Rink (Union) Ingalls Rink (Yale) Herb Brooks Arena (Men's tournament) Men's awards All-ECAC Hockey Team Player of the Year Rookie of the Year Tim Taylor Award (Coach of the Year) Best Defensive Defenseman Best Defensive Forward Ken Dryden Award (Best Goaltender) Student-Athlete of the Year Most Outstanding Player in Tournament All-Tournament Team Women's awards Women's champions Men's seasons 1961–62 1962–63 1963–64 1964–65 1965–66 1966–67 1967–68 1968–69 1969–70 1970–71 1971–72 1972–73 1973–74 1974–75 1975–76 1976–77 1977–78 1978–79 1979–80 1980–81 1981–82 1982–83 1983–84 1984–85 1985–86 1986–87 1987–88 1988–89 1989–90 1990–91 1991–92 1992–93 1993–94 1994–95 1995–96 1996–97 1997–98 1998–99 1999–00 2000–01 2001–02 2002–03 2003–04 2004–05 2005–06 2006–07 2007–08 2008–09 2009–10 2010–11 2011–12 2012–13 2013–14 2014–15 2015–16 2016–17 2017–18 2018–19 2019–20 2020–21 2021–22 vteEastern Association of Rowing Colleges BU Terriers Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Georgetown Hoyas Harvard Crimson Holy Cross Crusaders MIT Engineers Navy Midshipmen Northeastern Huskies Penn Quakers Princeton Tigers Rutgers Scarlet Knights Syracuse Orange Wisconsin Badgers Yale Bulldogs vteEastern Intercollegiate Volleyball AssociationCurrent members Charleston Golden Eagles George Mason Patriots Harvard Crimson NJIT Highlanders Penn State Nittany Lions Princeton Tigers Sacred Heart Pioneers (leaving in July 2022) St. Francis Brooklyn Terriers (leaving in July 2022) Saint Francis Red Flash (leaving in July 2022) Former members Concordia Clippers East Stroudsburg Warriors Juniata Eagles New Haven Chargers New Paltz Hawks NYU Violets Queens Knights Rutgers–Newark Scarlet Raiders Springfield Pride Vassar Brewers vteNational Intercollegiate Rugby AssociationDivision 1 Army Black Knights Brown Bears Dartmouth Big Green Harvard Crimson Long Island Sharks Mount St. Mary's Mountaineers Quinnipiac Bobcats Sacred Heart Pioneers Division 2 Alderson Broaddus Battlers American International Yellow Jackets Molloy Lions Notre Dame College Falcons Post Eagles Queens Royals West Chester Golden Rams Division 3 Bowdoin Polar Bears Castleton Spartans Colby–Sawyer Chargers Guilford Quakers Manhattanville Valiants New England College Pilgrims Norwich Cadets University of New England Nor'easters Future Teams Adrian Bulldogs vte Sports teams based in MassachusettsAustralian rules football USAFL Boston Demons Baseball MLB Boston Red Sox TAE Worcester Red Sox CCBL Bourne Braves Brewster Whitecaps Chatham Anglers Cotuit Kettleers Falmouth Commodores Harwich Mariners Hyannis Harbor Hawks Orleans Firebirds Wareham Gatemen Yarmouth–Dennis Red Sox FCBL Brockton Rox Pittsfield Suns Westfield Starfires Worcester Bravehearts NECBL Martha's Vineyard Sharks North Adams SteepleCats North Shore Navigators Valley Blue Sox Basketball NBA Boston Celtics Esports CDL Boston Breach OWL Boston Uprising Football NFL New England Patriots IFL Massachusetts Pirates WFA Boston Renegades Hockey NHL Boston Bruins AHL Springfield Thunderbirds ECHL Worcester Railers PHF Boston Pride Lacrosse MLL Boston Cannons WPLL New England Command UWLX Boston Storm Roller derby WFTDA Bay State Brawlers Roller Derby Boston Roller Derby MRDA Pioneer Valley Roller Derby Rugby league NARL Boston Thirteens USA Super Rugby League Oneida FC Rugby union MLR New England Free Jacks NERFU Boston RFC Boston Ironsides RFC Boston Irish Wolfhounds Boston Maccabi Rugby Charles River Rats Mystic River Old Gold Rugby South Shore Anchors Springfield RFC Worcester RFC Soccer MLS New England Revolution USL1 New England Revolution II USL2 Black Rock FC Boston Bolts Western Mass Pioneers NPSL Boston City FC Greater Lowell Rough Diamonds Valeo FC UWS New England Mutiny Ultimate AUDL Boston Glory College athleticsNCAA Division I American International Yellow Jackets (men's ice hockey and men's volleyball) Babson Beavers (men's and women's alpine skiing) Bentley Falcons (men's ice hockey) Boston College Eagles Brandeis Judges (men's and women's fencing) Boston University Terriers Franklin Pierce Ravens (women's ice hockey (team plays at the Jason Ritchie Ice Arena, a part of The Winchendon School in Winchendon, Massachusetts)). Harvard Crimson Holy Cross Crusaders MIT Engineers (men's and women's fencing, men's and women's rifle, and men's and women's rowing (crew)) Merrimack Warriors Northeastern Huskies Springfield Pride (men's and women's gymnastics) Stonehill Skyhawks (future women's hockey member in 2022-23) Tufts Jumbos (women's fencing) UMass Amherst Minutemen and Minutewomen UMass Lowell River Hawks Wellesley Blue (women's fencing) Williams Ephs (men's and women's rowing (crew) and men's and women's alpine and nordic skiing) NCAA Division II American International Yellow Jackets Assumption Greyhounds Bentley Falcons Franklin Pierce Ravens (men's ice hockey (team plays at the Jason Ritchie Ice Arena, a part of The Winchendon School in Winchendon, Massachusetts)). Stonehill Skyhawks NCAA Division III Amherst Mammoths Anna Maria Amcats Babson Beavers Brandeis Judges Bridgewater State Bears Clark Cougars Curry Colonels Dean Bulldogs Eastern Nazarene Lions Elms Blazers Emerson Lions Emmanuel Saints Endicott Gulls Fitchburg State Falcons Framingham State Rams Gordon Fighting Scots Lasell Lasers Lesley Lynx Mount Holyoke Lyons MCLA Trailblazers Massachusetts Maritime Buccaneers MIT Engineers Nichols Bison Regis Pride Salem State Vikings Simmons Sharks Smith Pioneers Springfield Pride Suffolk Rams Tufts Jumbos UMass Boston Beacons UMass Dartmouth Corsairs Wellesley Blue Wentworth Leopards Western New England Golden Bears Westfield State Owls Wheaton Lyons Williams Ephs WPI Engineers Worcester State Lancers NAIA Fisher Falcons USCAA Bay Path Wildcats Hampshire Black Sheep NJCAA Division II Massasoit Community College Warriors NJCAA Division III Benjamin Franklin Institute of Technology Shockers Bristol Community College Bayhawks Bunker Hill Community College Bulldogs Holyoke Community College Cougars Massachusetts Bay Community College Buccaneers Northern Essex Community College Knights Quinsigamond Community College Chiefs Roxbury Community College Tigers Springfield Technical Community College Rams","vteIvy League Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Harvard Crimson Penn Quakers Princeton Tigers Yale Bulldogs vteCambridge, MassachusettsHistory Timeline Squares Central Square Harvard Square Inman Square Kendall Square Lechmere Square Porter Square Neighborhoods East Cambridge (Area 1) MIT Campus (Area 2) Wellington-Harrington (Area 3) The Port (Area 4) Cambridgeport (Area 5) Mid-Cambridge (Area 6) Riverside (Area 7) Agassiz (Area 8) Peabody (Area 9) West Cambridge (Area 10) North Cambridge (Area 11) Cambridge Highlands (Area 12) Strawberry Hill (Area 13) Education Cambridge PSD Amigos School Rindge and Latin Community Charter Prospect Hill Academy Charter St. Paul's Choir School Harvard University template Massachusetts Institute of Technology template Cambridge Public Library Landmarks List of tallest buildings and structures City Hall Harvard Book Store Mount Auburn Cemetery Transportation Bus routes MBTA Green Line (Lechmere) MBTA Red Line (Alewife Central Harvard Porter Kendall/MIT) MA-2/MA-2A, MA-16, MA-28, MA-60, US-3 This list is incomplete. vteColonial colleges Brown Columbia Dartmouth Harvard Penn Princeton Rutgers William & Mary Yale vteColleges and universities in metropolitan Boston Babson College Bay State College Benjamin Franklin Institute of Technology Bentley University Berklee College of Music Boston Architectural College Boston Baptist College Boston College Boston Conservatory Boston Graduate School of Psychoanalysis Boston University Brandeis University Bunker Hill Community College Cambridge College Curry College Eastern Nazarene College Emerson College Emmanuel College Fisher College Harvard University Hebrew College Hellenic College Hult International Business School Labouré College Lasell College Lesley University Longy School of Music Massachusetts Bay Community College Massachusetts College of Art and Design Massachusetts College of Pharmacy and Health Sciences Massachusetts Institute of Technology MGH Institute New England College of Optometry New England Conservatory New England School of Law Northeastern University North Shore Community College Pine Manor College Quincy College Roxbury Community College St. John's Seminary School of the Museum of Fine Arts at Tufts Simmons College Suffolk University Tufts University University of Massachusetts Boston Urban College of Boston Wentworth Institute of Technology William James College vteAssociation of American UniversitiesPublic Arizona California Berkeley Davis Irvine Los Angeles San Diego Santa Barbara Santa Cruz Colorado Florida Georgia Tech Illinois Indiana Iowa Iowa State Kansas Maryland Michigan Michigan State Minnesota Missouri New York Buffalo Stony Brook North Carolina Ohio State Oregon Penn State Pittsburgh Purdue Rutgers Texas Texas A&M Utah Virginia Washington Wisconsin Private Boston U Brandeis Brown Caltech Carnegie Mellon Case Western Reserve Chicago Columbia Cornell Dartmouth Duke Emory Harvard Johns Hopkins MIT Northwestern NYU Penn Princeton Rice Rochester USC Stanford Tufts Tulane Vanderbilt Wash U Yale Canadian (public) McGill Toronto vteAssociation of Independent Colleges and Universities in Massachusetts (AICUM) Amherst Anna Maria Assumption Babson Bay Path Becker Bentley Berklee Boston Architectural Boston Baptist Boston College Boston U Brandeis Cambridge College Clark College of the Holy Cross Curry Dean Eastern Nazarene Elms Emerson Emmanuel Endicott Fisher Gordon Hampshire Harvard Hebrew Labouré Lasell Lesley Marian Court MCPHS MIT Merrimack MGH Institute Mount Holyoke Mount Ida NECO New England Conservatory Newbury Nichols Northeastern Olin Pine Manor Regis Simmons Smith Springfield Stonehill Suffolk Tufts Urban College of Boston Wellesley WIT Western New England Wheaton Wheelock Williams WPI vteHCP Research Network California Berkeley Los Angeles Chieti–Pescara D'Annunzio Ernst Strüngmann Institute Harvard Indiana Minnesota Nijmegen Radboud Oxford Saint Louis Warwick Washington vteECAC HockeyTeams Brown Bears men women Clarkson Golden Knights men women Colgate Raiders men women Cornell Big Red men women Dartmouth Big Green men women Harvard Crimson men women Princeton Tigers men women Quinnipiac Bobcats men women Rensselaer Engineers men women St. Lawrence Saints men women Union Dutchmen men women Yale Bulldogs men women Venues Meehan Auditorium (Brown) Cheel Arena (Clarkson) Class of 1965 Arena (Colgate) Lynah Rink (Cornell) Thompson Arena (Dartmouth) Bright Hockey Center (Harvard) Hobey Baker Memorial Rink (Princeton) People's United Center (Quinnipiac) Houston Field House (Rensselaer) Appleton Arena (St. Lawrence) Achilles Rink (Union) Ingalls Rink (Yale) Herb Brooks Arena (Men's tournament) Men's awards All-ECAC Hockey Team Player of the Year Rookie of the Year Tim Taylor Award (Coach of the Year) Best Defensive Defenseman Best Defensive Forward Ken Dryden Award (Best Goaltender) Student-Athlete of the Year Most Outstanding Player in Tournament All-Tournament Team Women's awards Women's champions Men's seasons 1961–62 1962–63 1963–64 1964–65 1965–66 1966–67 1967–68 1968–69 1969–70 1970–71 1971–72 1972–73 1973–74 1974–75 1975–76 1976–77 1977–78 1978–79 1979–80 1980–81 1981–82 1982–83 1983–84 1984–85 1985–86 1986–87 1987–88 1988–89 1989–90 1990–91 1991–92 1992–93 1993–94 1994–95 1995–96 1996–97 1997–98 1998–99 1999–00 2000–01 2001–02 2002–03 2003–04 2004–05 2005–06 2006–07 2007–08 2008–09 2009–10 2010–11 2011–12 2012–13 2013–14 2014–15 2015–16 2016–17 2017–18 2018–19 2019–20 2020–21 2021–22 vteEastern Association of Rowing Colleges BU Terriers Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Georgetown Hoyas Harvard Crimson Holy Cross Crusaders MIT Engineers Navy Midshipmen Northeastern Huskies Penn Quakers Princeton Tigers Rutgers Scarlet Knights Syracuse Orange Wisconsin Badgers Yale Bulldogs vteEastern Intercollegiate Volleyball AssociationCurrent members Charleston Golden Eagles George Mason Patriots Harvard Crimson NJIT Highlanders Penn State Nittany Lions Princeton Tigers Sacred Heart Pioneers (leaving in July 2022) St. Francis Brooklyn Terriers (leaving in July 2022) Saint Francis Red Flash (leaving in July 2022) Former members Concordia Clippers East Stroudsburg Warriors Juniata Eagles New Haven Chargers New Paltz Hawks NYU Violets Queens Knights Rutgers–Newark Scarlet Raiders Springfield Pride Vassar Brewers vteNational Intercollegiate Rugby AssociationDivision 1 Army Black Knights Brown Bears Dartmouth Big Green Harvard Crimson Long Island Sharks Mount St. Mary's Mountaineers Quinnipiac Bobcats Sacred Heart Pioneers Division 2 Alderson Broaddus Battlers American International Yellow Jackets Molloy Lions Notre Dame College Falcons Post Eagles Queens Royals West Chester Golden Rams Division 3 Bowdoin Polar Bears Castleton Spartans Colby–Sawyer Chargers Guilford Quakers Manhattanville Valiants New England College Pilgrims Norwich Cadets University of New England Nor'easters Future Teams Adrian Bulldogs vte Sports teams based in MassachusettsAustralian rules football USAFL Boston Demons Baseball MLB Boston Red Sox TAE Worcester Red Sox CCBL Bourne Braves Brewster Whitecaps Chatham Anglers Cotuit Kettleers Falmouth Commodores Harwich Mariners Hyannis Harbor Hawks Orleans Firebirds Wareham Gatemen Yarmouth–Dennis Red Sox FCBL Brockton Rox Pittsfield Suns Westfield Starfires Worcester Bravehearts NECBL Martha's Vineyard Sharks North Adams SteepleCats North Shore Navigators Valley Blue Sox Basketball NBA Boston Celtics Esports CDL Boston Breach OWL Boston Uprising Football NFL New England Patriots IFL Massachusetts Pirates WFA Boston Renegades Hockey NHL Boston Bruins AHL Springfield Thunderbirds ECHL Worcester Railers PHF Boston Pride Lacrosse MLL Boston Cannons WPLL New England Command UWLX Boston Storm Roller derby WFTDA Bay State Brawlers Roller Derby Boston Roller Derby MRDA Pioneer Valley Roller Derby Rugby league NARL Boston Thirteens USA Super Rugby League Oneida FC Rugby union MLR New England Free Jacks NERFU Boston RFC Boston Ironsides RFC Boston Irish Wolfhounds Boston Maccabi Rugby Charles River Rats Mystic River Old Gold Rugby South Shore Anchors Springfield RFC Worcester RFC Soccer MLS New England Revolution USL1 New England Revolution II USL2 Black Rock FC Boston Bolts Western Mass Pioneers NPSL Boston City FC Greater Lowell Rough Diamonds Valeo FC UWS New England Mutiny Ultimate AUDL Boston Glory College athleticsNCAA Division I American International Yellow Jackets (men's ice hockey and men's volleyball) Babson Beavers (men's and women's alpine skiing) Bentley Falcons (men's ice hockey) Boston College Eagles Brandeis Judges (men's and women's fencing) Boston University Terriers Franklin Pierce Ravens (women's ice hockey (team plays at the Jason Ritchie Ice Arena, a part of The Winchendon School in Winchendon, Massachusetts)). Harvard Crimson Holy Cross Crusaders MIT Engineers (men's and women's fencing, men's and women's rifle, and men's and women's rowing (crew)) Merrimack Warriors Northeastern Huskies Springfield Pride (men's and women's gymnastics) Stonehill Skyhawks (future women's hockey member in 2022-23) Tufts Jumbos (women's fencing) UMass Amherst Minutemen and Minutewomen UMass Lowell River Hawks Wellesley Blue (women's fencing) Williams Ephs (men's and women's rowing (crew) and men's and women's alpine and nordic skiing) NCAA Division II American International Yellow Jackets Assumption Greyhounds Bentley Falcons Franklin Pierce Ravens (men's ice hockey (team plays at the Jason Ritchie Ice Arena, a part of The Winchendon School in Winchendon, Massachusetts)). Stonehill Skyhawks NCAA Division III Amherst Mammoths Anna Maria Amcats Babson Beavers Brandeis Judges Bridgewater State Bears Clark Cougars Curry Colonels Dean Bulldogs Eastern Nazarene Lions Elms Blazers Emerson Lions Emmanuel Saints Endicott Gulls Fitchburg State Falcons Framingham State Rams Gordon Fighting Scots Lasell Lasers Lesley Lynx Mount Holyoke Lyons MCLA Trailblazers Massachusetts Maritime Buccaneers MIT Engineers Nichols Bison Regis Pride Salem State Vikings Simmons Sharks Smith Pioneers Springfield Pride Suffolk Rams Tufts Jumbos UMass Boston Beacons UMass Dartmouth Corsairs Wellesley Blue Wentworth Leopards Western New England Golden Bears Westfield State Owls Wheaton Lyons Williams Ephs WPI Engineers Worcester State Lancers NAIA Fisher Falcons USCAA Bay Path Wildcats Hampshire Black Sheep NJCAA Division II Massasoit Community College Warriors NJCAA Division III Benjamin Franklin Institute of Technology Shockers Bristol Community College Bayhawks Bunker Hill Community College Bulldogs Holyoke Community College Cougars Massachusetts Bay Community College Buccaneers Northern Essex Community College Knights Quinsigamond Community College Chiefs Roxbury Community College Tigers Springfield Technical Community College Rams"
vteIvy League,vteIvy League
Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Harvard Crimson Penn Quakers Princeton Tigers Yale Bulldogs,Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Harvard Crimson Penn Quakers Princeton Tigers Yale Bulldogs
"vteCambridge, Massachusetts","vteCambridge, Massachusetts"
History,Timeline
Squares,Central Square Harvard Square Inman Square Kendall Square Lechmere Square Porter Square
Neighborhoods,East Cambridge (Area 1) MIT Campus (Area 2) Wellington-Harrington (Area 3) The Port (Area 4) Cambridgeport (Area 5) Mid-Cambridge (Area 6) Riverside (Area 7) Agassiz (Area 8) Peabody (Area 9) West Cambridge (Area 10) North Cambridge (Area 11) Cambridge Highlands (Area 12) Strawberry Hill (Area 13)
Education,Cambridge PSD Amigos School Rindge and Latin Community Charter Prospect Hill Academy Charter St. Paul's Choir School Harvard University template Massachusetts Institute of Technology template Cambridge Public Library
Landmarks,List of tallest buildings and structures City Hall Harvard Book Store Mount Auburn Cemetery
Transportation,"Bus routes MBTA Green Line (Lechmere) MBTA Red Line (Alewife Central Harvard Porter Kendall/MIT) MA-2/MA-2A, MA-16, MA-28, MA-60, US-3"

vteIvy League,vteIvy League.1
Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Harvard Crimson Penn Quakers Princeton Tigers Yale Bulldogs,Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Harvard Crimson Penn Quakers Princeton Tigers Yale Bulldogs

"vteCambridge, Massachusetts","vteCambridge, Massachusetts.1"
History,Timeline
Squares,Central Square Harvard Square Inman Square Kendall Square Lechmere Square Porter Square
Neighborhoods,East Cambridge (Area 1) MIT Campus (Area 2) Wellington-Harrington (Area 3) The Port (Area 4) Cambridgeport (Area 5) Mid-Cambridge (Area 6) Riverside (Area 7) Agassiz (Area 8) Peabody (Area 9) West Cambridge (Area 10) North Cambridge (Area 11) Cambridge Highlands (Area 12) Strawberry Hill (Area 13)
Education,Cambridge PSD Amigos School Rindge and Latin Community Charter Prospect Hill Academy Charter St. Paul's Choir School Harvard University template Massachusetts Institute of Technology template Cambridge Public Library
Landmarks,List of tallest buildings and structures City Hall Harvard Book Store Mount Auburn Cemetery
Transportation,"Bus routes MBTA Green Line (Lechmere) MBTA Red Line (Alewife Central Harvard Porter Kendall/MIT) MA-2/MA-2A, MA-16, MA-28, MA-60, US-3"
This list is incomplete.,This list is incomplete.

vteColonial colleges,vteColonial colleges.1
Brown Columbia Dartmouth Harvard Penn Princeton Rutgers William & Mary Yale,Brown Columbia Dartmouth Harvard Penn Princeton Rutgers William & Mary Yale

vteColleges and universities in metropolitan Boston,vteColleges and universities in metropolitan Boston.1
Babson College Bay State College Benjamin Franklin Institute of Technology Bentley University Berklee College of Music Boston Architectural College Boston Baptist College Boston College Boston Conservatory Boston Graduate School of Psychoanalysis Boston University Brandeis University Bunker Hill Community College Cambridge College Curry College Eastern Nazarene College Emerson College Emmanuel College Fisher College Harvard University Hebrew College Hellenic College Hult International Business School Labouré College Lasell College Lesley University Longy School of Music Massachusetts Bay Community College Massachusetts College of Art and Design Massachusetts College of Pharmacy and Health Sciences Massachusetts Institute of Technology MGH Institute New England College of Optometry New England Conservatory New England School of Law Northeastern University North Shore Community College Pine Manor College Quincy College Roxbury Community College St. John's Seminary School of the Museum of Fine Arts at Tufts Simmons College Suffolk University Tufts University University of Massachusetts Boston Urban College of Boston Wentworth Institute of Technology William James College,Babson College Bay State College Benjamin Franklin Institute of Technology Bentley University Berklee College of Music Boston Architectural College Boston Baptist College Boston College Boston Conservatory Boston Graduate School of Psychoanalysis Boston University Brandeis University Bunker Hill Community College Cambridge College Curry College Eastern Nazarene College Emerson College Emmanuel College Fisher College Harvard University Hebrew College Hellenic College Hult International Business School Labouré College Lasell College Lesley University Longy School of Music Massachusetts Bay Community College Massachusetts College of Art and Design Massachusetts College of Pharmacy and Health Sciences Massachusetts Institute of Technology MGH Institute New England College of Optometry New England Conservatory New England School of Law Northeastern University North Shore Community College Pine Manor College Quincy College Roxbury Community College St. John's Seminary School of the Museum of Fine Arts at Tufts Simmons College Suffolk University Tufts University University of Massachusetts Boston Urban College of Boston Wentworth Institute of Technology William James College

vteAssociation of American Universities,vteAssociation of American Universities.1
Public,Arizona California Berkeley Davis Irvine Los Angeles San Diego Santa Barbara Santa Cruz Colorado Florida Georgia Tech Illinois Indiana Iowa Iowa State Kansas Maryland Michigan Michigan State Minnesota Missouri New York Buffalo Stony Brook North Carolina Ohio State Oregon Penn State Pittsburgh Purdue Rutgers Texas Texas A&M Utah Virginia Washington Wisconsin
Private,Boston U Brandeis Brown Caltech Carnegie Mellon Case Western Reserve Chicago Columbia Cornell Dartmouth Duke Emory Harvard Johns Hopkins MIT Northwestern NYU Penn Princeton Rice Rochester USC Stanford Tufts Tulane Vanderbilt Wash U Yale
Canadian (public),McGill Toronto

vteAssociation of Independent Colleges and Universities in Massachusetts (AICUM),vteAssociation of Independent Colleges and Universities in Massachusetts (AICUM).1
Amherst Anna Maria Assumption Babson Bay Path Becker Bentley Berklee Boston Architectural Boston Baptist Boston College Boston U Brandeis Cambridge College Clark College of the Holy Cross Curry Dean Eastern Nazarene Elms Emerson Emmanuel Endicott Fisher Gordon Hampshire Harvard Hebrew Labouré Lasell Lesley Marian Court MCPHS MIT Merrimack MGH Institute Mount Holyoke Mount Ida NECO New England Conservatory Newbury Nichols Northeastern Olin Pine Manor Regis Simmons Smith Springfield Stonehill Suffolk Tufts Urban College of Boston Wellesley WIT Western New England Wheaton Wheelock Williams WPI,Amherst Anna Maria Assumption Babson Bay Path Becker Bentley Berklee Boston Architectural Boston Baptist Boston College Boston U Brandeis Cambridge College Clark College of the Holy Cross Curry Dean Eastern Nazarene Elms Emerson Emmanuel Endicott Fisher Gordon Hampshire Harvard Hebrew Labouré Lasell Lesley Marian Court MCPHS MIT Merrimack MGH Institute Mount Holyoke Mount Ida NECO New England Conservatory Newbury Nichols Northeastern Olin Pine Manor Regis Simmons Smith Springfield Stonehill Suffolk Tufts Urban College of Boston Wellesley WIT Western New England Wheaton Wheelock Williams WPI

vteHCP Research Network,vteHCP Research Network.1
California Berkeley Los Angeles Chieti–Pescara D'Annunzio Ernst Strüngmann Institute Harvard Indiana Minnesota Nijmegen Radboud Oxford Saint Louis Warwick Washington,California Berkeley Los Angeles Chieti–Pescara D'Annunzio Ernst Strüngmann Institute Harvard Indiana Minnesota Nijmegen Radboud Oxford Saint Louis Warwick Washington

vteECAC Hockey,vteECAC Hockey.1
Teams,Brown Bears men women Clarkson Golden Knights men women Colgate Raiders men women Cornell Big Red men women Dartmouth Big Green men women Harvard Crimson men women Princeton Tigers men women Quinnipiac Bobcats men women Rensselaer Engineers men women St. Lawrence Saints men women Union Dutchmen men women Yale Bulldogs men women
Venues,Meehan Auditorium (Brown) Cheel Arena (Clarkson) Class of 1965 Arena (Colgate) Lynah Rink (Cornell) Thompson Arena (Dartmouth) Bright Hockey Center (Harvard) Hobey Baker Memorial Rink (Princeton) People's United Center (Quinnipiac) Houston Field House (Rensselaer) Appleton Arena (St. Lawrence) Achilles Rink (Union) Ingalls Rink (Yale) Herb Brooks Arena (Men's tournament)
Men's awards,All-ECAC Hockey Team Player of the Year Rookie of the Year Tim Taylor Award (Coach of the Year) Best Defensive Defenseman Best Defensive Forward Ken Dryden Award (Best Goaltender) Student-Athlete of the Year Most Outstanding Player in Tournament All-Tournament Team
Women's awards,Women's champions
Men's seasons,1961–62 1962–63 1963–64 1964–65 1965–66 1966–67 1967–68 1968–69 1969–70 1970–71 1971–72 1972–73 1973–74 1974–75 1975–76 1976–77 1977–78 1978–79 1979–80 1980–81 1981–82 1982–83 1983–84 1984–85 1985–86 1986–87 1987–88 1988–89 1989–90 1990–91 1991–92 1992–93 1993–94 1994–95 1995–96 1996–97 1997–98 1998–99 1999–00 2000–01 2001–02 2002–03 2003–04 2004–05 2005–06 2006–07 2007–08 2008–09 2009–10 2010–11 2011–12 2012–13 2013–14 2014–15 2015–16 2016–17 2017–18 2018–19 2019–20 2020–21 2021–22

vteEastern Association of Rowing Colleges,vteEastern Association of Rowing Colleges.1
BU Terriers Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Georgetown Hoyas Harvard Crimson Holy Cross Crusaders MIT Engineers Navy Midshipmen Northeastern Huskies Penn Quakers Princeton Tigers Rutgers Scarlet Knights Syracuse Orange Wisconsin Badgers Yale Bulldogs,BU Terriers Brown Bears Columbia Lions Cornell Big Red Dartmouth Big Green Georgetown Hoyas Harvard Crimson Holy Cross Crusaders MIT Engineers Navy Midshipmen Northeastern Huskies Penn Quakers Princeton Tigers Rutgers Scarlet Knights Syracuse Orange Wisconsin Badgers Yale Bulldogs

vteEastern Intercollegiate Volleyball Association,vteEastern Intercollegiate Volleyball Association.1
Current members,Charleston Golden Eagles George Mason Patriots Harvard Crimson NJIT Highlanders Penn State Nittany Lions Princeton Tigers Sacred Heart Pioneers (leaving in July 2022) St. Francis Brooklyn Terriers (leaving in July 2022) Saint Francis Red Flash (leaving in July 2022)
Former members,Concordia Clippers East Stroudsburg Warriors Juniata Eagles New Haven Chargers New Paltz Hawks NYU Violets Queens Knights Rutgers–Newark Scarlet Raiders Springfield Pride Vassar Brewers

vteNational Intercollegiate Rugby Association,vteNational Intercollegiate Rugby Association.1
Division 1,Army Black Knights Brown Bears Dartmouth Big Green Harvard Crimson Long Island Sharks Mount St. Mary's Mountaineers Quinnipiac Bobcats Sacred Heart Pioneers
Division 2,Alderson Broaddus Battlers American International Yellow Jackets Molloy Lions Notre Dame College Falcons Post Eagles Queens Royals West Chester Golden Rams
Division 3,Bowdoin Polar Bears Castleton Spartans Colby–Sawyer Chargers Guilford Quakers Manhattanville Valiants New England College Pilgrims Norwich Cadets University of New England Nor'easters
Future Teams,Adrian Bulldogs

vte Sports teams based in Massachusetts,vte Sports teams based in Massachusetts.1
Australian rules football,USAFL Boston Demons
Baseball,MLB Boston Red Sox TAE Worcester Red Sox CCBL Bourne Braves Brewster Whitecaps Chatham Anglers Cotuit Kettleers Falmouth Commodores Harwich Mariners Hyannis Harbor Hawks Orleans Firebirds Wareham Gatemen Yarmouth–Dennis Red Sox FCBL Brockton Rox Pittsfield Suns Westfield Starfires Worcester Bravehearts NECBL Martha's Vineyard Sharks North Adams SteepleCats North Shore Navigators Valley Blue Sox
Basketball,NBA Boston Celtics
Esports,CDL Boston Breach OWL Boston Uprising
Football,NFL New England Patriots IFL Massachusetts Pirates WFA Boston Renegades
Hockey,NHL Boston Bruins AHL Springfield Thunderbirds ECHL Worcester Railers PHF Boston Pride
Lacrosse,MLL Boston Cannons WPLL New England Command UWLX Boston Storm
Roller derby,WFTDA Bay State Brawlers Roller Derby Boston Roller Derby MRDA Pioneer Valley Roller Derby
Rugby league,NARL Boston Thirteens USA Super Rugby League Oneida FC
Rugby union,MLR New England Free Jacks NERFU Boston RFC Boston Ironsides RFC Boston Irish Wolfhounds Boston Maccabi Rugby Charles River Rats Mystic River Old Gold Rugby South Shore Anchors Springfield RFC Worcester RFC

0,1
NCAA Division I,"American International Yellow Jackets (men's ice hockey and men's volleyball) Babson Beavers (men's and women's alpine skiing) Bentley Falcons (men's ice hockey) Boston College Eagles Brandeis Judges (men's and women's fencing) Boston University Terriers Franklin Pierce Ravens (women's ice hockey (team plays at the Jason Ritchie Ice Arena, a part of The Winchendon School in Winchendon, Massachusetts)). Harvard Crimson Holy Cross Crusaders MIT Engineers (men's and women's fencing, men's and women's rifle, and men's and women's rowing (crew)) Merrimack Warriors Northeastern Huskies Springfield Pride (men's and women's gymnastics) Stonehill Skyhawks (future women's hockey member in 2022-23) Tufts Jumbos (women's fencing) UMass Amherst Minutemen and Minutewomen UMass Lowell River Hawks Wellesley Blue (women's fencing) Williams Ephs (men's and women's rowing (crew) and men's and women's alpine and nordic skiing)"
NCAA Division II,"American International Yellow Jackets Assumption Greyhounds Bentley Falcons Franklin Pierce Ravens (men's ice hockey (team plays at the Jason Ritchie Ice Arena, a part of The Winchendon School in Winchendon, Massachusetts)). Stonehill Skyhawks"
NCAA Division III,Amherst Mammoths Anna Maria Amcats Babson Beavers Brandeis Judges Bridgewater State Bears Clark Cougars Curry Colonels Dean Bulldogs Eastern Nazarene Lions Elms Blazers Emerson Lions Emmanuel Saints Endicott Gulls Fitchburg State Falcons Framingham State Rams Gordon Fighting Scots Lasell Lasers Lesley Lynx Mount Holyoke Lyons MCLA Trailblazers Massachusetts Maritime Buccaneers MIT Engineers Nichols Bison Regis Pride Salem State Vikings Simmons Sharks Smith Pioneers Springfield Pride Suffolk Rams Tufts Jumbos UMass Boston Beacons UMass Dartmouth Corsairs Wellesley Blue Wentworth Leopards Western New England Golden Bears Westfield State Owls Wheaton Lyons Williams Ephs WPI Engineers Worcester State Lancers
NAIA,Fisher Falcons
USCAA,Bay Path Wildcats Hampshire Black Sheep
NJCAA Division II,Massasoit Community College Warriors
NJCAA Division III,Benjamin Franklin Institute of Technology Shockers Bristol Community College Bayhawks Bunker Hill Community College Bulldogs Holyoke Community College Cougars Massachusetts Bay Community College Buccaneers Northern Essex Community College Knights Quinsigamond Community College Chiefs Roxbury Community College Tigers Springfield Technical Community College Rams

Authority control,Authority control.1
General,Integrated Authority File (Germany) ISNI 1 Online PWN VIAF 1 WorldCat
National libraries,Norway France (data) United States Latvia Czech Republic Australia Greece Israel Korea Croatia Poland Vatican
Art research institutes,Artist Names (Getty)
Scientific databases,CiNii (Japan)
Other,MusicBrainz place RERO (Switzerland) 1 Social Networks and Archival Context SUDOC (France) 1 Trove (Australia) 1


Great! Now we have the text of the HU Wikipedia page. But this mess of HTML tags would be a pain to parse manually. Which is why we will use another very cool Python library called BeautifulSoup.

### BeautifulSoup

Parsing data would be a breeze if we could always use well formatted data sources, such as CSV, JSON, or XML; but some formats such as HTML are at the same time a very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as open tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.

You'll notice that the `import` statement bellow is different from what we used for `requests`. The _from library import thing_ pattern is useful when you don't want to reference a function byt its full name (like we did with `requests.get`), but you also don't want to import every single thing on that library into your namespace.

In [7]:
from bs4 import BeautifulSoup

BeautifulSoup can deal with HTML or XML data, so the next line parser the contents of the `page` variable using its HTML parser, and assigns the result of that to the `soup` variable.

In [8]:
soup = BeautifulSoup(page, 'html.parser')

In [9]:
type(soup)

bs4.BeautifulSoup

Doesn't look much different from the `page` object representation. Let's make sure the two are different types.

In [10]:
type(page)

str

Looks like they are indeed different.

`BeautifulSoup` objects have a cool little method that allows you to see the HTML content in a nice, indented way.

In [11]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Harvard University - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"f4393be0-80af-4b51-9cee-dfecc37c53ee","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1069140536,"wgRevisionId":1069140536,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","CS1 errors: generic name"

Looks like it's our page!

We can now reference elements of the HTML document in different ways. One very convenient way is by using the dot notation, which allows us to access the elements as if they were properties of the object.

In [12]:
soup.title

<title>Harvard University - Wikipedia</title>

This is nice for HTML elements that only appear once per page, such the the `title` tag. But what about elements that can appear multiple times?

In [13]:
# Be careful with elements that show up multiple times.
soup.p

<p class="mw-empty-elt">
</p>

Uh Oh. Turns out the attribute syntax in Beautiful soup is what is called syntactic sugar. That's why it is safer to use the explicit commands behind that syntactic sugar I mentioned. These are `BeautifulSoup.find` for getting single elements, and `BeautifulSoup.find_all` for retrieving multiple elements.

In [14]:
len(soup.find_all("p"))

106

---

If you look at the Wikipedia page on your browser, you'll notice that it has a couple of tables in it. We will be working with the "Demographics" table, but first we need to find it.

One of the HTML attributes that will be very useful to us is the "class" attribute.

Getting the class of a single element is easy...

In [15]:
soup.table["class"]

['box-Merge_from', 'plainlinks', 'metadata', 'ambox', 'ambox-move']

Next we will use a list comprehension to see all the tables that have a "class" attribute. 

In [16]:
#the classes of all tables that have a class sttribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]

[['box-Merge_from', 'plainlinks', 'metadata', 'ambox', 'ambox-move'],
 ['infobox', 'vcard'],
 ['toccolours'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable'],
 ['wikitable'],
 ['metadata', 'mbox-small'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks

As mentioned, we will be using the Demographics table. To find this, we notice that it is the only table with just the class `wikitable` on it, whereas there are 3 tables with the class `wikitable`, with the other  two having multiple classes on them. This is why `find_all` below returns 3 results.

In [17]:
tables_wikitable = soup.find_all("table", "wikitable")

In [18]:
len(tables_wikitable)

4

Below we use a **matching** lambda function to find the table with just the class wikitable. Note that we have asked for a list with just `wikitable` in it. That ensures its the only class

In [19]:
dfinder = lambda tag: tag.name=='table' and tag.get('class') == ['wikitable']
table_demographics = soup.find_all(dfinder)

By contrast a simple find would give us just the first match. The below would be a great way to do things if we were guaranteed uniqueness. But since we are not, we use the full power of passing in a matching function.

In [20]:
soup.find("table", "wikitable")

<table class="wikitable sortable collapsible collapsed floatright">
<tbody><tr>
<th colspan="4" style="background-color:#A31F36;color:white;box-shadow: inset 2px 2px 0 #2C2A29, inset -2px -2px 0 #2C2A29;">National Graduate Rankings<sup class="reference" id="cite_ref-92"><a href="#cite_note-92">[92]</a></sup>
</th></tr>
<tr>
<th>Program
</th>
<th>Ranking
</th></tr>
<tr>
<td>Biological Sciences</td>
<td>4
</td></tr>
<tr>
<td>Business</td>
<td>6
</td></tr>
<tr>
<td>Chemistry</td>
<td>2
</td></tr>
<tr>
<td>Clinical Psychology</td>
<td>10
</td></tr>
<tr>
<td>Computer Science</td>
<td>16
</td></tr>
<tr>
<td>Earth Sciences</td>
<td>8
</td></tr>
<tr>
<td>Economics</td>
<td>1
</td></tr>
<tr>
<td>Education</td>
<td>1
</td></tr>
<tr>
<td>Engineering</td>
<td>22
</td></tr>
<tr>
<td>English</td>
<td>8
</td></tr>
<tr>
<td>History</td>
<td>4
</td></tr>
<tr>
<td>Law</td>
<td>3
</td></tr>
<tr>
<td>Mathematics</td>
<td>2
</td></tr>
<tr>
<td>Medicine: Primary Care</td>
<td>10
</td></tr>
<tr>
<td>Medicine

Since we used `find_all` we get back a list:

In [21]:
HTML(str(table_demographics[0]))

Unnamed: 0,Undergrad,Grad/prof
Asian,21%,13%
Black,9%,5%
Hispanic or Latino,11%,7%
White,37%,38%
Two or more races,8%,3%
International,12%,32%


First we'll use a list comprehension to extract the rows (*tr*) elements.

In [22]:
rows = [row for row in table_demographics[0].find_all("tr")]
rows

[<tr>
 <th></th>
 <th>Undergrad</th>
 <th>Grad/prof
 </th></tr>,
 <tr>
 <th>Asian
 </th>
 <td>21%</td>
 <td>13%
 </td></tr>,
 <tr>
 <th>Black
 </th>
 <td>9%</td>
 <td>5%
 </td></tr>,
 <tr>
 <th>Hispanic or Latino
 </th>
 <td>11%</td>
 <td>7%
 </td></tr>,
 <tr>
 <th>White
 </th>
 <td>37%</td>
 <td>38%
 </td></tr>,
 <tr>
 <th>Two or more races
 </th>
 <td>8%</td>
 <td>3%
 </td></tr>,
 <tr>
 <th>International
 </th>
 <td>12%</td>
 <td>32%
 </td></tr>]

In [23]:
header_row = rows[0]
header_row

<tr>
<th></th>
<th>Undergrad</th>
<th>Grad/prof
</th></tr>

### Splitting the data

Next we extract the text value of the columns. If you look at the table above, you'll see that we have three columns and six rows.

Here we're taking the first element (Python indexes start at zero), iterating over the *th* elements inside it, and taking the text value of those elements. We should end up with a list of column names.

But there is one little caveat: the first column of the table is actually an empty string (look at the cell right above the row names). We could add it to our list and then remove it afterwards; but instead we will use the `if` statement inside the list comprehension to filter that out.

Here the `get_text` will return an empty string for the first cell of the table, which means that the test will fail and the value will not be added to the list.

In [24]:
#the if col.get_text() takes care of no-text in the upper left
columns = [col.get_text() for col in header_row.find_all("th") if col.get_text()]
columns

['Undergrad', 'Grad/prof\n']

In [25]:
# Lambda expressions return the value of the expression inside it.
# In this case, it will return a string with new line characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")

In [26]:
columns = [rem_nl(c) for c in columns]
columns

['Undergrad', 'Grad/prof ']

Now let's do the same for the rows. Notice that since we have already parsed the header row, we will continue from the second row. The `[1:]` is a slice notation and in this case it means we want all values starting from the second position.

In [27]:
indexes = [row.find("th").get_text() for row in rows[1:]]
indexes

['Asian\n',
 'Black\n',
 'Hispanic or Latino\n',
 'White\n',
 'Two or more races\n',
 'International\n']

We need to transform the string on the "data" cells to integers. We start by checking if the last character of the string (Python allows for negative indexes) is a percent sign. If that is true, then we convert the characters before the sign to integers. Lastly, if one of the prior checks fails, we return a value of None.

In [28]:
def to_num(s):
    if s[-1] == "%":
        return int(s[:-1])
    else:
        return None

In [29]:
values = []
for row in rows[1:]:
    for value in row.find_all("td"):
        values.append(to_num(value.get_text()))
values

[17, 11, 5, 6, 4, 12, 9, 5, 16, 46, 43, 64, 10, 8, 9, 11, 27, None]

The problem with the list above is that the values lost their grouping.

The `zip` function is used to combine two sequences element wise. So `zip([1,2,3], [4,5,6])` would return `[(1, 4), (2, 5), (3, 6)]`.

Here we create 3 arrays corresponding to the 3 columns by putting every 3 values in each list

In [30]:
stacked_values_lists = [values[i::3] for i in range(len(columns))]
stacked_values_lists

[[17, 6, 9, 46, 10, 11], [11, 4, 5, 43, 8, 27], [5, 12, 16, 64, 9, None]]

We then use `zip`. Notice the use of the `*` in front: that converts the list of lists to a set of arguments to `zip`. 

In [31]:
def print_them(a, b, c):
    print("a", a, "b", b, "c", c)
print_them(1, 2, 3)

a 1 b 2 c 3


In [32]:
print_them(*[1, 2, 3])

a 1 b 2 c 3


In [33]:
stacked_values=zip(*stacked_values_lists)
list(stacked_values)

[(17, 11, 5), (6, 4, 12), (9, 5, 16), (46, 43, 64), (10, 8, 9), (11, 27, None)]

In [34]:
# Here's the original HTML table for visual understanding
HTML(str(table_demographics))

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17%,11%,5%
Black/non-Hispanic,6%,4%,12%
Hispanics of any race,9%,5%,16%
White/non-Hispanic,46%,43%,64%
Mixed race/other,10%,8%,9%
International students,11%,27%,


---

##  Putting things into Pandas

### Dataframes

To recap, we now have three data structures holding our column names, our row (index) names, and our values grouped by index.

We will now load this data into a Pandas Dataframe. The loading process is pretty straightforward, and all we need to do is tell Pandas which container goes where.


In [35]:
import pandas as pd

In [36]:
list(stacked_values)

[]

Wait! What happened?

Remember that `stacked_values` waz a zip object. We ran a `list(stacked_values)` to print it. But this had an unfortunate side effect. It **exhausted the iterator**, by iterating over the zip. Nothing was left. So we'll need to redefine the zip first. And we'll name it a bit better

In [37]:
stacked_values_iterator = zip(*stacked_values_lists)

Labeling variables like this follows the philosophy of [Hungarian Notation](https://en.wikipedia.org/wiki/Hungarian_notation). Use sparingly, when its critical to the understanding of your code, like here

In [38]:
df = pd.DataFrame(list(stacked_values_iterator), columns=columns, index=indexes)
df

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17,11,5.0
Black/non-Hispanic,6,4,12.0
Hispanics of any race,9,5,16.0
White/non-Hispanic,46,43,64.0
Mixed race/other,10,8,9.0
International students,11,27,


---

#### Other ways to create the Dataframe

That was one of many ways to construct a dataframe. Here is another that uses a list of dictionaries:

First we combine the list and dictionary comprehensions to get a list of dictionaries representing each row in the data.

In [40]:
stacked_values_iterator = zip(*stacked_values_lists)
data_dicts = [{col: val for col, val in zip(columns, col_values)} for col_values in stacked_values_iterator]
data_dicts

[{'Graduate and professional': 11, 'U.S. census': 5, 'Undergraduate': 17},
 {'Graduate and professional': 4, 'U.S. census': 12, 'Undergraduate': 6},
 {'Graduate and professional': 5, 'U.S. census': 16, 'Undergraduate': 9},
 {'Graduate and professional': 43, 'U.S. census': 64, 'Undergraduate': 46},
 {'Graduate and professional': 8, 'U.S. census': 9, 'Undergraduate': 10},
 {'Graduate and professional': 27, 'U.S. census': None, 'Undergraduate': 11}]

In [41]:
pd.DataFrame(data_dicts, index=indexes)

Unnamed: 0,Graduate and professional,U.S. census,Undergraduate
Asian/Pacific Islander,11,5.0,17
Black/non-Hispanic,4,12.0,6
Hispanics of any race,5,16.0,9
White/non-Hispanic,43,64.0,46
Mixed race/other,8,9.0,10
International students,27,,11


And yet another that uses a dictionary of lists:

To achieve this we group the values columnwise...

In [42]:
stacked_by_col = [values[i::3] for i in range(len(columns))]
stacked_by_col

[[17, 6, 9, 46, 10, 11], [11, 4, 5, 43, 8, 27], [5, 12, 16, 64, 9, None]]

and then revert the pattern we used to create a list of dictionaries.

In [43]:
data_lists = {col: val for col, val in zip(columns, stacked_by_col)}
data_lists

{'Graduate and professional': [11, 4, 5, 43, 8, 27],
 'U.S. census': [5, 12, 16, 64, 9, None],
 'Undergraduate': [17, 6, 9, 46, 10, 11]}

In [44]:
pd.DataFrame(data_lists, index=indexes)

Unnamed: 0,Graduate and professional,U.S. census,Undergraduate
Asian/Pacific Islander,11,5.0,17
Black/non-Hispanic,4,12.0,6
Hispanics of any race,5,16.0,9
White/non-Hispanic,43,64.0,46
Mixed race/other,8,9.0,10
International students,27,,11


---

### DataFrame cleanup

Our DataFrame looks nice; but does it have the right data types?

In [45]:
df.dtypes

Undergraduate                  int64
Graduate and professional      int64
U.S. census                  float64
dtype: object

The `U.S Census` looks a little strange. It should have been evaluated as an integer, but instead it came in as a float. It probably has something to do with the `NaN` value...

In fact, missing values can mess up a lot of our calculations, and some function don't work at all when `NaN` are present. So we should probably clean this up.

One way to do that is by dropping the rows that have missing values:

In [46]:
df.dropna()

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17,11,5.0
Black/non-Hispanic,6,4,12.0
Hispanics of any race,9,5,16.0
White/non-Hispanic,46,43,64.0
Mixed race/other,10,8,9.0


Or the columns that have missing values:

In [47]:
df.dropna(axis=1)

Unnamed: 0,Undergraduate,Graduate and professional
Asian/Pacific Islander,17,11
Black/non-Hispanic,6,4
Hispanics of any race,9,5
White/non-Hispanic,46,43
Mixed race/other,10,8
International students,11,27


But we will take a less radical approach and replace the missing value with a zero. In this case this solution makes sense, since 0% value meaningful in this context. We will also transform all the values to integers at the same time.

In [48]:
df_clean = df.fillna(0).astype(int)
df_clean

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17,11,5
Black/non-Hispanic,6,4,12
Hispanics of any race,9,5,16
White/non-Hispanic,46,43,64
Mixed race/other,10,8,9
International students,11,27,0


In [49]:
df_clean.dtypes

Undergraduate                int64
Graduate and professional    int64
U.S. census                  int64
dtype: object

Now our table looks good!

Let's see some basic statistics about it.

In [50]:
df_clean.describe()

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
count,6.0,6.0,6.0
mean,16.5,16.333333,17.666667
std,14.896308,15.513435,23.36379
min,6.0,4.0,0.0
25%,9.25,5.75,6.0
50%,10.5,9.5,10.5
75%,15.5,23.0,15.0
max,46.0,43.0,64.0
