# Wikipedia Collection script for R1/R2 Universities

## Setup

Install the `wikipedia` API library. This only needs to be run once.

In [1]:
%pip install wikipedia # run once then comment this out

Note: you may need to restart the kernel to use updated packages.


## Imports

In [2]:
import wikipedia
import pandas as pd

## Code

Get a list of links from the page "List of research universities in the United States".

In [3]:
list_page = wikipedia.page("List_of_research_universities_in_the_United_States")
list_page.links[:10]

['Aggregate data',
 'Air Force Institute of Technology',
 'Alabama',
 'Alaska',
 'Albany, New York',
 'Albert Einstein College of Medicine',
 'Albuquerque',
 'American Council on Education',
 'American University',
 'Ames, Iowa']

### Links pertaining to R1/R2 Universities

The list contains links to things other than universities. Rather than keep all of those extra links, we visit each linked page, look at its categories, and make sure at least one of them starts with the stem `universit` (there is no category named exactly "University", but plenty of categories start with "Universit"). This still isn't perfect, but it should be good enough for the assignment.
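The category test can be tried on its own before running the full loop. The sketch below uses made-up category lists (illustrative assumptions, not copied from Wikipedia), and the helper name `looks_like_university` is hypothetical. The second case shows one way the filter can miss a page: the category mentions universities, but not at the start.

```python
def looks_like_university(categories):
    """Return True if any category name starts with the stem 'universit'."""
    return any(c.lower().startswith("universit") for c in categories)

# Hypothetical category lists, for illustration only
matching_page = ["Universities and colleges in Exampleland", "Educational institutions"]
missed_page = ["Public universities and colleges in Exampleland"]
unrelated_page = ["Cities in Exampleland"]

print(looks_like_university(matching_page))   # True
print(looks_like_university(missed_page))     # False (stem is not at the start)
print(looks_like_university(unrelated_page))  # False
```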

In [5]:
dataset = []
for college in list_page.links:
    try:
        print(college, end="")
        page = wikipedia.page(college)
        # Keep the page only if at least one category starts with "universit"
        if any(c.lower().startswith("universit") for c in page.categories):
            dataset.append({"college": college,
                            "content": page.content,
                            "nrefs": len(page.references),
                            "nlinks": len(page.links),
                            #"lat": round(page.coordinates[0], 10), # nice idea but not defined for all pages
                            #"long": round(page.coordinates[1], 10)
                           })
            print(" |  okay")
        else:
            print(" |  other")
    except wikipedia.PageError:
        # No page was found for this title
        print(" |  error (page)")
    except wikipedia.DisambiguationError:
        # The title points to a disambiguation page
        print(" |  error (double)")
# Convert the list of records into a DataFrame
dataset = pd.DataFrame(dataset)

Aggregate data |  other
Air Force Institute of Technology |  okay
Alabama |  other
Alaska |  other
Albany, New York |  other
Albert Einstein College of Medicine |  okay
Albuquerque |  other
American Council on Education |  other
American University |  okay
Ames, Iowa |  other
Amherst, Massachusetts |  other
Ann Arbor, Michigan |  other
Arizona |  other
Arizona State University |  okay
Arkansas |  other
Arkansas State University |  okay
Arlington, Texas |  other
Association of American Universities |  other
Athens, Georgia |  other
Athens, Ohio |  other
Atlanta |  other
Auburn, Alabama |  other
Auburn University |  okay
Augusta University |  okay
Austin, Texas |  other
Azusa Pacific University |  okay
Ball State University |  okay
Baltimore |  other
Baton Rouge |  other
Baylor College of Medicine |  okay
Baylor University |  okay
Berkeley, California |  other
Binghamton University |  okay
Birmingham, Alabama |  other
Blacksburg, Virginia |  other
Bloomington, Indiana |  other
Boise Stat
...
 |  error (double)
Indiana University Bloomington |  okay
Indiana University of Pennsylvania |  okay
Indiana University – Purdue University Indianapolis |  okay
Iowa |  other
Iowa City |  other
Iowa State University |  error (page)
Irvine, California |  other
Ithaca, New York |  other
Jackson State University |  okay
James Madison University |  okay
Johns Hopkins University |  okay
Kansas |  other
Kansas State University |  okay
Kennesaw State University |  okay
Kent, Ohio |  other
Kent State University |  okay
Kentucky |  other
Knoxville, Tennessee |  other
LIU Post |  error (page)
Lafayette, Louisiana |  other
Las Vegas |  other
Lawrence, Kansas |  other
Lehigh University |  okay
Lexington, Kentucky |  other
Lincoln, Nebraska |  other
Logan, Utah |  other
Loma Linda University |  okay
Los Angeles |  other
Louisiana |  other
Louisiana State University |  okay
Louisiana Tech University |  okay
Louisville |  other
Loyola Marymount University |  okay
Loyola University Chicago |  okay
Lub
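
The log above shows a few university titles ending in `error (page)` or `error (double)`, and those rows are dropped without a record. One possible refinement, sketched here under assumptions, is to store the outcome for every title so the failures can be reviewed or retried later. The function name `collect`, its `fetch` parameter, and the stand-in `FakePage` and `LookupFailed` classes are hypothetical; they let the sketch run without network access.

```python
from collections import Counter

class LookupFailed(Exception):
    """Stand-in for lookup errors such as a missing or ambiguous page."""

class FakePage:
    """Minimal stand-in exposing only the `categories` attribute."""
    def __init__(self, categories):
        self.categories = categories

def collect(titles, fetch):
    """Classify each title as 'okay', 'other', or 'error' using `fetch`."""
    status = {}
    for title in titles:
        try:
            page = fetch(title)
        except LookupFailed:
            status[title] = "error"
            continue
        is_university = any(c.lower().startswith("universit") for c in page.categories)
        status[title] = "okay" if is_university else "other"
    return status

# Hypothetical lookup table standing in for Wikipedia
fake_wiki = {
    "Example University": FakePage(["Universities and colleges in Exampleland"]),
    "Example City": FakePage(["Cities in Exampleland"]),
}

def fake_fetch(title):
    """Return the fake page for `title`, or raise LookupFailed if it is unknown."""
    if title not in fake_wiki:
        raise LookupFailed(title)
    return fake_wiki[title]

status = collect(["Example University", "Example City", "Missing Page"], fake_fetch)
failed = [title for title, outcome in status.items() if outcome == "error"]
print(Counter(status.values()))
print(failed)
```

With the real library, `fetch` could be `wikipedia.page`, and the `except` clause would name `wikipedia.PageError` and `wikipedia.DisambiguationError` as in the loop above. Passing `auto_suggest=False` to `wikipedia.page` may help with some `error (page)` cases, though that has not been checked here.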

### Verify  

Double-check that the collected articles are mostly universities.


In [6]:
dataset.head()

|   | college | content | nrefs | nlinks |
|---|---|---|---|---|
| 0 | Air Force Institute of Technology | The Air Force Institute of Technology (AFIT) i... | 142 | 253 |
| 1 | Albert Einstein College of Medicine | The Albert Einstein College of Medicine is a p... | 88 | 234 |
| 2 | American University | The American University (AU or American) is a ... | 148 | 423 |
| 3 | Arizona State University | Arizona State University (Arizona State or ASU... | 515 | 880 |
| 4 | Arkansas State University | Arkansas State University (A-State or ASU) is ... | 109 | 281 |
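
Looking at the first five rows only goes so far. A rough complementary check, sketched below, measures what share of the collected names contain an institution word. The keyword list and the helper name `share_institution_names` are assumptions; the sample names are the five shown in the output above.

```python
def share_institution_names(names, keywords=("university", "college", "institute")):
    """Return the fraction of names that contain at least one keyword."""
    if not names:
        return 0.0
    hits = sum(any(k in name.lower() for k in keywords) for name in names)
    return hits / len(names)

sample = [
    "Air Force Institute of Technology",
    "Albert Einstein College of Medicine",
    "American University",
    "Arizona State University",
    "Arkansas State University",
]
print(share_institution_names(sample))  # 1.0
```

On the full frame this could be called as `share_institution_names(dataset["college"].tolist())`; a value well below 1 would suggest inspecting the rows that do not match.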


### Save

In [7]:
dataset.to_json("universities.json")

### Open (for your homework)

In [None]:
import pandas as pd
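
For the homework, the saved file has to be read back, and `pd.read_json("universities.json")` is the usual pandas route. The sketch below uses only the standard library to show the layout that `DataFrame.to_json` writes by default (one object per column, keyed by the row index as a string) and how to turn it back into rows. The small JSON string is a made-up stand-in for the real file, and `columns_to_rows` is a hypothetical helper name.

```python
import json

# Made-up stand-in for the contents of universities.json (column-oriented layout)
raw = '{"college": {"0": "Example University", "1": "Sample College"}, "nrefs": {"0": 12, "1": 7}}'

def columns_to_rows(columns):
    """Convert {column: {index: value}} into a list of row dicts ordered by index."""
    index = sorted(next(iter(columns.values())), key=int)
    return [{name: values[i] for name, values in columns.items()} for i in index]

rows = columns_to_rows(json.loads(raw))
print(rows[0]["college"])  # Example University
print(len(rows))           # 2
```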
