Questions:
1: What is the most common set of prerequisite courses a freshman would need to take at UC Davis by college?
2: What would be the best six courses to take at UC Davis as a freshman who does not know what they want to study?

We'll start with the College of Biological Sciences.

The first thing to do is to import all of the packages that we need to use for the code. I did this at the beginning to keep them organized so that I could reference what I already had instead of accidentally importing one package multiple times.

In [2]:
# We'll start by importing all the packages we need.
# BeautifulSoup is used for webscraping
# urllib.request is used to open the URLs
# requests fetches the pages, and requests_cache caches the web results so we don't have to retrieve them from the server each time we run the program.
# re provides regular expressions.

from bs4 import BeautifulSoup
import urllib.request as urllib2

import requests
import requests_cache
import re
requests_cache.install_cache('demo_cache')
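
To illustrate why caching helps when a notebook is re-run many times, here is a minimal sketch of the general idea. It uses `functools.lru_cache` from the standard library as a stand-in, not `requests_cache` itself, and `fetch_page`, `server_hits`, and the example URL are hypothetical names made up for the illustration.

```python
import functools

# Hypothetical counter standing in for "how many times the server was contacted"
server_hits = 0

@functools.lru_cache(maxsize=None)
def fetch_page(url):
    """Pretend to download `url`; the body only runs on a cache miss."""
    global server_hits
    server_hits += 1
    return "<html>page for " + url + "</html>"

first = fetch_page("https://example.edu/majors")
second = fetch_page("https://example.edu/majors")  # answered from the in-memory cache

print(server_hits)
```

The second call returns the stored result without running the function body again. `requests_cache.install_cache` applies the same idea to `requests`, except that the responses are stored on disk so they survive between runs.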

The next step is to define where exactly in the webpage the lower-division pre-requisite courses are located. This was tricky because the majors were in slightly different formats, but I was eventually able to come up with a general function to properly extract all the necessary data. A list of UC Davis classes will be returned at the end for the major whose URL was used for the function.
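
The start/stop idea described above can be sketched independently of the HTML. This is a simplified illustration rather than the function used in this notebook: `rows_between` and the sample rows are made up, and the rows stand in for the text of the first cell of each table row.

```python
def rows_between(rows, start_headers, stop_headers):
    """Return the rows after a start header and before the next stop header."""
    collecting = False
    collected = []
    for text in rows:
        if collecting and text in stop_headers:
            break  # reached the upper-division section
        if collecting:
            collected.append(text)
        elif text in start_headers:
            collecting = True  # the lower-division section starts on the next row
    return collected

# Made-up rows standing in for the first <td> of each table row
sample_rows = [
    'Units',
    'Preparatory Subject Matter',
    'Chemistry 2A-2B',
    'Mathematics 16A-16B',
    'Depth Subject Matter',
    'Chemistry 118A',
]

lower_div = rows_between(
    sample_rows,
    {'Preparatory Subject Matter', 'Lower Division Required Courses'},
    {'Depth Subject Matter', 'Upper Division Required Courses'},
)
print(lower_div)
```

Accepting a set of start headers and a set of stop headers is what lets one function handle colleges that label their sections differently.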

In [18]:
"""
This function takes a URL from a UC Davis major page and returns the list of prerequisite courses for that major.

Inputs:
    majorUrl: This is the URL that we will be text scraping from
    
Outputs:
    classList: This is the list of lower-division courses required for any given major.

"""

def getLowerDivRequirements(majorUrl):
    page = requests.get(majorUrl)

# This extracts the HTML from the page, then runs it through Beautiful Soup to parse it
    soup = BeautifulSoup(page.text,'html.parser')
    if 'registrar' in majorUrl:       
        page = requests.get(soup.a.get('href'))
        soup = BeautifulSoup(page.text,'html.parser')

# This defines the start and stop locations for extracting text. We're only interested in the lower div courses
# Two start strings and two stop strings were used as different colleges phrased the start and stop areas slightly differently
    startString1 = 'Preparatory Subject Matter'
    startString2 = 'Lower Division Required Courses'
    stopString1 = 'Depth Subject Matter'
    stopString2 = 'Upper Division Required Courses'
    stopChar = '('
    prepMatter = False
    i = 0

# "tr" and "td" are html tags containing the text that we need.    
    classList = []
    for row in soup.findAll('tr'):
        
        data = row.findAll('td')
        # Skip rows that contain no cells
        if len(data) == 0:
            continue
        # Stop once the header that follows the lower-division section is reached
        if prepMatter and (data[0].text == stopString1 or data[0].text == stopString2):
            break

# This extracts the prerequisite courses                
        if prepMatter:
            classStrings = data[0].text.split()
            if len(classStrings) < 2 or len(classStrings[0]) < 2:
                continue
            classNumbers = False
            className = ''
            
# This searches for the end of the prerequisite courses            
            if classStrings[0] == 'Depth' or (classStrings[0] == 'Total' and classStrings[1] == 'Depth') or (classStrings[0] == 'Upper' and classStrings[1] == 'Division'):
                break

          
            for string in classStrings:

# Here, we're looking for the correct strings that represent courses.            
                if string == 'Students' or string == 'Prerequisites' or string == 'Select' or string == 'Choose':
                    break

# If two courses are separated by an "and" we skip it and go to the next number.
                if classNumbers == True and string == 'and':
                    continue
# This removes ampersands                    
                if len(string) > 1 and string[0] == '&':
                    string = string.replace('&','')

# If an "either" appears, we reset the class name and then skip it. For "either CHE 2A or BIS 2A", both CHE 2A and BIS 2A are included.
                if string == 'either' or ':' in string or '(' in string:
                    className = ''
                    continue

# If we come across an "or" then we skip it
                if string == 'or' or (string == '&' and classNumbers == True):
                    continue

# This finds the start of new classes if we find an uppercase character immediately after a number            
                if classNumbers == True and string[0].isupper() == True:
                    className = ''
                    classNumbers = False

# If the string does not start with a digit, it is part of the course subject area, so we add it to the name.
                if not string[0].isdigit():
                    className += string + ' '

# This puts the course letters and numbers in the right format, taking into account the differences in how they are initially listed (e.g. commas, semicolons, dashes, etc.).
                else:
                    classNumbers = True                   
                    classNum = ''.join([i for i in string if i.isdigit()])                                       
                    if len(classNum) > 1 and '-' in string: 
                        
                        if classNum[0] == classNum[1]:
                            classSubtitleList = string.split('-')
                            for sub in classSubtitleList:
                                sub = sub.replace(',','')
                                sub = sub.replace(';','')
                                classList.append(className + sub)
                        elif len(classNum) > 3:
                            if classNum[0] == classNum[2]:
                                classSubtitleList = string.split('-')
                                for sub in classSubtitleList:
                                    sub = sub.replace(',','')
                                    sub = sub.replace(';','')
                                    classList.append(className + sub)
                    else:
                        classLetter = ''.join([i for i in string if (not i.isdigit() and i != ',' and i != ';')])
                        classLetters = []
                        

                        classLetters = classLetter.split('&')
                            
                        if int(classNum) > 99:
                            continue
                        for letter in classLetters:
                            if len(className) > 1:
                                classList.append(className + classNum + letter)
        if data and (startString1 == data[0].text or startString2 == data[0].text):
            prepMatter = True
    print(classList)
    return classList
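
As a point of comparison, the step that turns an entry such as "Mathematics 16A-16B-16C" into individual courses can also be sketched with the `re` module. This is an alternative technique, not the approach used in the function above, and it assumes each entry is a subject name followed by one or more course numbers. `expand_courses` and `COURSE_TOKEN` are hypothetical names.

```python
import re

# A course token is 1-3 digits followed by optional uppercase letters, e.g. 2A, 16B, 9HA
COURSE_TOKEN = re.compile(r'\d{1,3}[A-Z]*')

def expand_courses(entry):
    """Split 'Subject 16A-16B-16C' into one 'Subject number' string per course."""
    match = re.match(r'\s*(\D+?)\s+(\d.*)$', entry)
    if match is None:
        return []  # no course number found, e.g. an instruction line
    subject, numbers = match.group(1), match.group(2)
    return [
        subject + ' ' + token
        for token in COURSE_TOKEN.findall(numbers)
        # keep lower-division courses only (numbers below 100)
        if int(re.match(r'\d+', token).group()) < 100
    ]

print(expand_courses('Mathematics 16A-16B-16C'))
print(expand_courses('Biological Sciences 2A, 2B'))
```

The sketch handles dashes and commas with one pattern, but it does not cover every format the catalog pages use, such as shared suffixes joined by an ampersand.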

The next step is to start extracting the list of major links. I took all of the links from the same URL. I created an empty list for each college that was then populated with a list of all the majors from that undergraduate college at UC Davis.

In [4]:
# This goes to the website with links to all of the majors.
page = requests.get('https://www.ucdavis.edu/majors/college')
soup = BeautifulSoup(page.text,'html.parser')

# Initializing lists that will contain all the college major links
collegeCount = 0
prevStartChar = 'Z'
agList = []
bioList = []
engList = []
lsList = []

# This goes through everything within the tag 'a'
for row in soup.findAll('a'):

    # Only consider anchors that actually have a link target
    if row.get('href'):

        # Keep only the links that point to a major page
        if '/majors/' in row.get('href'):
            if row.text[0] < prevStartChar:
                collegeCount = collegeCount + 1

            # Below, we use the college counter to file each major link under its undergraduate college
            if row.text[0] == 'A' or (collegeCount > 4 and collegeCount < 9):
                if collegeCount == 5:
                    agList.append((row.text,'https://www.ucdavis.edu' + row.get('href')))
                    
                elif collegeCount == 6:
                    bioList.append((row.text,'https://www.ucdavis.edu' + row.get('href')))
                    
                elif collegeCount == 7:
                    engList.append((row.text,'https://www.ucdavis.edu' + row.get('href')))
                    
                elif collegeCount == 8:
                    lsList.append((row.text,'https://www.ucdavis.edu' + row.get('href')))

                prevStartChar = row.text[0]  
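
The cell above relies on each college's majors being listed alphabetically: when the first letter of a link goes backwards, a new college has started. A minimal sketch of that idea on its own, with a made-up helper `split_on_alphabet_reset` and made-up sample names:

```python
def split_on_alphabet_reset(names):
    """Group a concatenation of alphabetised lists wherever the first letter goes backwards."""
    groups = []
    previous_initial = None
    for name in names:
        initial = name[0]
        if previous_initial is None or initial < previous_initial:
            groups.append([])  # the alphabet restarted, so a new group begins
        groups[-1].append(name)
        previous_initial = initial
    return groups

# Illustrative names only, not the real list of majors
sample = ['Animal Science', 'Plant Sciences', 'Biochemistry', 'Genetics', 'Aerospace Engineering']
grouped = split_on_alphabet_reset(sample)
print(grouped)
```

The approach needs no markup beyond the link text, but it would merge two groups if one list happened to end with a letter that comes before the first letter of the next list.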

The next step was to feed in the requirements. I used a for loop that iterated through the list and ran each URL through the function automatically instead of copying and pasting by hand, which is less efficient and introduces the possibility of human error. The major prerequisites were automatically sorted by undergraduate college. The last step in obtaining my data was to export each college's prerequisites to a CSV file. I did this so that any website outages would not affect my ability to complete my project. I also plan on providing these requirements to other students if they want to use this data for any further analysis.
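
A small sketch of the export step, to show that rows of different lengths and course names containing commas survive a round trip through the `csv` module. The rows are shortened samples, and `io.StringIO` is used in place of a real file so the example leaves nothing on disk.

```python
import csv
import io

# Shortened sample rows: one list of courses per major, with different lengths
rows = [
    ['Economics 1A', 'Economics 1B', 'Statistics 13'],
    ['History 9A', 'Religious Studies, 75'],
]

# io.StringIO stands in for a file opened with open(path, 'w', newline='')
buffer = io.StringIO(newline='')
csv.writer(buffer, dialect='excel').writerows(rows)

# Read the same data back, as a later analysis step would
buffer.seek(0)
restored = [row for row in csv.reader(buffer, dialect='excel')]
print(restored == rows)
```

The writer quotes any field that contains a comma, so 'Religious Studies, 75' comes back as a single course instead of being split in two.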

In [5]:
import csv

"""
This function lists all of the major requirements for each major in a given undergraduate UC Davis college.

Inputs:
    majorList: The list of (major name, URL) pairs for each major in an undergraduate UC Davis college, built in the cell above.
    csvTitle: The name of the CSV file that the output will be saved to.
    
Outputs:
    requirementsList: A list of all of the requirements sorted by major within college along with their lower-division prerequisite courses.

"""
def getAllRequirements(majorList,csvTitle):

    # Creating an empty list that will be filled with the requirements
    requirementsList = []

    # Looping through each major in the given list and extracting the prerequisite information
    for major in majorList:
        page = requests.get(major[1])
        soup = BeautifulSoup(page.text,'html.parser')


        for row in soup.findAll('a'):
            if row.text:
                if row.text == 'Detailed Major Requirements':
                    print(major[0])
                    print(row.get('href'))
                    requirementsList.append(getLowerDivRequirements(row.get('href')))
                    print('------------------')

    # This saves the list to a csv and exports it in the same directory.
    # newline='' keeps the csv module from writing blank rows between records on Windows.
    with open(csvTitle,'w',newline='') as resultFile:
        wr = csv.writer(resultFile, dialect='excel')
        wr.writerows(requirementsList)
    
    return requirementsList

# Putting each of the colleges through the function defined above
print("Lower-division prerequisite list for the College of Letters and Science:")
getAllRequirements(lsList,'lsreqs.csv')

print("Lower-division prerequisite list for the College of Agricultural and Environmental Sciences:")
getAllRequirements(agList,'agreqs.csv')

print("Lower-division prerequisite list for the College of Engineering:")
getAllRequirements(engList,'engreqs.csv')

print("Lower-division prerequisite list for the College of Biological Sciences:")
getAllRequirements(bioList,'bioreqs.csv')


Lower-division prerequisite list for the College of Letters and Science:
African American and African Studies Major
http://catalog.ucdavis.edu/programs/AAS/AASreqt.html
['African American and African Studies 10', 'African American and African Studies 12', 'African American and African Studies 15', 'African American and African Studies 17', 'African American and African Studies 18', 'African American and African Studies 50', 'African American and African Studies 51', 'African American and African Studies 52', 'African American and African Studies 80', 'Anthropology 2', 'Economics 1A', 'Economics 1B', 'Geography 2', 'Sociology 1', 'Political Science 1', 'Political Science 2', 'Psychology 1', 'Chicana/o Studies 10', 'Native American Studies 1', 'Native American Studies 10', 'Women & Gender Studies 50', 'American Studies 10', 'Asian American Studies 1', 'Asian American Studies 2', 'History 15', 'History 17A', 'History 17B', 'African American and African Studies 16', 'African American and A

East Asian Studies Major
http://catalog.ucdavis.edu/programs/EAS/EASreqt.html
['History 9A', 'History 9B', 'Art History 1D', 'Chinese 7', 'Chinese 10', 'Chinese 11', 'Comparative Literature 53A', 'East Asian Studies 88', 'Japanese 10', 'Japanese 25', 'Japanese 50', 'Religious Studies, 75']
------------------
Economics Major
http://catalog.ucdavis.edu/programs/ECN/ECNreqt.html
['Economics 1A', 'Economics 1B', 'Statistics 13', 'Statistics 32', 'Mathematics 16A', 'Mathematics 16B', 'Mathematics 21A', 'Mathematics 21B']
------------------
English Major
http://catalog.ucdavis.edu/programs/ENL/ENLreqt.html
['English 3', 'University Writing Program 1', 'English 40', 'English 43', 'English 44', 'English 45', 'English 10A', 'English 10B', 'English 10C']
------------------
French Major
http://registrar.ucdavis.edu/UCDWebCatalog/Programs/FRE/FREreqt.html
['French 1', 'French 2', 'French 3', 'French 21', 'French 22', 'French 23', 'Linguistics 1', 'Linguistics 4']
------------------
Gender, Sexuali

['Philosophy 1', 'Philosophy 21', 'Philosophy 22', 'Philosophy 13G', 'Philosophy 14', 'Philosophy 15', 'Philosophy 24', 'Philosophy 30', 'Philosophy 31', 'Philosophy 32', 'Philosophy 38', 'Philosophy 17', 'Philosophy 12']
------------------
Physics Major
http://catalog.ucdavis.edu/Programs/PHY/PHYreqt.html
['Physics 9A', 'Physics 9B', 'Physics 9C', 'Physics 9D', 'Physics 9HA', 'Physics 9HB', 'Physics 9HC', 'Physics 9HD', 'Physics 9HE', 'Mathematics 21A', 'Mathematics 21B', 'Mathematics 21C', 'Mathematics 21D', 'Mathematics 22A', 'Mathematics 22B']
------------------
Political Science Major
http://catalog.ucdavis.edu/Programs/POL/POLreqt.html#Political
['1', '2', '3', '4', '5', 'Political Science 51', 'Statistics 13', 'Statistics 32']
------------------
Political Science – Public Service  Major
http://catalog.ucdavis.edu/Programs/POL/POLreqt.html#Public
['1', '2', '3', '4', '5', 'Political Science 51', 'Statistics 13', 'Statistics 32']
------------------
Psychology Major
http://catalog.

------------------
Ecological Management and Restoration Major
http://catalog.ucdavis.edu/programs/EMR/EMRreqt.html
['Biological Sciences 2A', 'Biological Sciences 2B', 'Biological Sciences 2C', 'Chemistry 2A', 'Chemistry 2B', 'Physics 1A', 'Physics 1B', 'Physics 7A', 'Physics 7B', 'Physics 7C', 'Mathematics 16A', 'Mathematics 16B', 'Mathematics 17A', 'Mathematics 17B', 'Mathematics 21A', 'Mathematics 21B', 'Environmental Science and Policy 1']
------------------
Entomology Major
http://catalog.ucdavis.edu/programs/ENT/ENTreqt.html
['Biological Sciences 2A', 'Biological Sciences 2B', 'Biological Sciences 2C', 'Chemistry 2A', 'Chemistry 2B', 'Chemistry 8A', 'Chemistry 8B', 'Mathematics 16A', 'Mathematics 16B', 'Mathematics 16C', 'Mathematics 17A', 'Mathematics 17B', 'Mathematics 17C', 'Mathematics 21A', 'Mathematics 21B', 'Mathematics 21C', 'Physics 1A', 'Physics 1B', 'Statistics 13', 'Statistics 32', 'Plant Sciences 21', 'Engineering 5']
------------------
Environmental Horticulture an

Sustainable Agriculture and Food Systems Major
http://catalog.ucdavis.edu/programs/SAFS/SAFSreqt.html
['Mathematics 16A', 'Mathematics 16B', 'Chemistry 2A', 'Chemistry 2B', 'Physics 1A', 'Biological Sciences 2A', 'Biological Sciences 2B', 'Plant Sciences 2', 'Animal Sciences 1', 'Animal Sciences 2', 'Food Science 1', 'Economics 1A', 'Community and Regional Development 1', 'Philosophy 14', 'Philosophy 15', 'Philosophy 24', 'Anthropology 2', 'Political Science 4', 'Sociology 1', 'Sociology 3']
------------------
Sustainable Environmental Design Major
http://catalog.ucdavis.edu/programs/SED/SEDreqt.html
['Biological Sciences 2A', 'Biological Sciences 2B', 'Landscape Architecture 1', 'Landscape Architecture 2', 'Landscape Architecture 3', 'Landscape Architecture 21', 'Landscape Architecture 30', 'Landscape Architecture 50', 'Landscape Architecture 70']
------------------
Textiles and Clothing Major (suspended 2018-20)
http://catalog.ucdavis.edu/programs/TXC/TXCreqt.html
['Plant Sciences 21

Materials Science and Engineering Major
http://catalog.ucdavis.edu/programs/ECH/ECHreqt.html#Materials1
['Mathematics 21A', 'Mathematics 21B', 'Mathematics 21C', 'Mathematics 21D', 'Mathematics 22A', 'Mathematics 22B', 'Physics 9A', 'Physics 9B', 'Physics 9C', 'Chemistry 2A', 'Chemistry 2B', 'Chemistry 2C', 'Chemistry 2AH', 'Chemistry 2BH', 'Chemistry 2CH', 'Chemical Engineering and Materials Science 5', 'Chemical Engineering and Materials Science 6', 'Chemical Engineering and Materials Science 51', 'Chemical Engineering and Materials Science 80', 'Engineering 45', 'Engineering 45Y', 'Biotechnology 1', 'Biotechnology 1Y', 'Biological Sciences 2A', 'English 3', 'University Writing Program 1', 'University Writing Program 1V', 'University Writing Program 1Y', 'Comparative Literature 1', 'Comparative Literature 2', 'Comparative Literature 3', 'Comparative Literature 4', 'Native American Studies 5']
------------------
Mechanical Engineering Major
http://registrar.ucdavis.edu/UCDWebCatalog/p

[['Biological Sciences 2A',
  'Biological Sciences 2B',
  'Biological Sciences 2C',
  'Chemistry 2A',
  'Chemistry 2B',
  'Chemistry 2C',
  'Chemistry 2AH',
  'Chemistry 2BH',
  'Chemistry 2CH',
  'Mathematics 17A',
  'Mathematics 17B',
  'Mathematics 17C',
  'Mathematics 21A',
  'Mathematics 21B',
  'Physics 7A',
  'Physics 7B',
  'Physics 7C'],
 ['Biological Sciences 2A',
  'Biological Sciences 2B',
  'Biological Sciences 2C',
  'Chemistry 2A',
  'Chemistry 2B',
  'Chemistry 8A',
  'Chemistry 8B',
  'Chemistry 118A',
  'Chemistry 118B',
  'Chemistry 118C',
  'Mathematics 17A',
  'Mathematics 17B',
  'Mathematics 21A',
  'Mathematics 21B',
  'Physics 1A',
  'Physics 1B',
  'Physics 7A',
  'Physics 7B',
  'Physics 7C',
  'Chemistry 2C',
  'Math 17C',
  'Math 21C)'],
 ['Biological Sciences 2A',
  'Biological Sciences 2B',
  'Biological Sciences 2C',
  'Chemistry 2A',
  'Chemistry 2B',
  'Chemistry 2C',
  'Chemistry 2AH',
  'Chemistry 2BH',
  'Chemistry 2CH',
  'Mathematics 17A',
  'Math
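
Given a nested list like the one returned above, with one inner list per major, one possible way to approach Question 1 is to count how many majors require each course. This is only a sketch under that assumption, using shortened, illustrative data rather than the scraped results.

```python
from collections import Counter

# Shortened, illustrative requirement lists (one inner list per major)
requirements = [
    ['Biological Sciences 2A', 'Chemistry 2A', 'Mathematics 17A'],
    ['Biological Sciences 2A', 'Chemistry 2A', 'Physics 7A'],
    ['Biological Sciences 2A', 'Mathematics 17A', 'Mathematics 17A'],
]

# set() makes each major count a course at most once, even if it was scraped twice
course_counts = Counter(course for major in requirements for course in set(major))

most_common_course = course_counts.most_common(1)
print(most_common_course)
```

Deduplicating within each major matters because the scraper can return the same course more than once for a single major.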

The next step is to sort all of the majors in the College of Letters and Science into one of four "undeclared" categories. This is because entering "undeclared" students in the College of Letters and Science must be in one of four general subject areas: Fine Arts, Humanities, Social Sciences, or Physical Sciences. The suggested courses will be different depending on which area a student has chosen to further explore.
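
The sorting rule can be sketched as a small function in which the first category with a matching keyword wins and everything else falls through to a default. `classify_major` and `rules` are hypothetical names; the keyword lists are the ones used in the cell that follows.

```python
def classify_major(name, rules, default):
    """Return the label of the first rule whose keyword list matches `name`."""
    for label, keywords in rules:
        if any(keyword in name for keyword in keywords):
            return label
    return default

# Rules are checked in order, so earlier categories win when several could match
rules = [
    ('Fine Arts', ['Art', 'Theater', 'Cinema', 'Theatre', 'Music']),
    ('Social Sciences', ['Econ', 'Political', 'International', 'Soci', 'Anthro', 'Psych']),
    ('Physical Sciences', ['Math', 'Physic', 'Stat', 'Chem', 'Computer', 'Geology', 'Marine', 'Tech']),
]

for major in ['Art History Major', 'Economics Major', 'Science and Technology Studies Major', 'English Major']:
    print(major, '->', classify_major(major, rules, 'Humanities'))
```

Because the match is a case-sensitive substring test, a short keyword can catch majors it was not intended for. For example, 'Tech' places Science and Technology Studies under Physical Sciences.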

In [19]:
# Create lists of keywords for each subject area in Letters and Sciences

fineArtsKeywordList = ['Art','Theater','Cinema','Theatre','Music']
socialSciencesKeywordList = ['Econ','Political','International','Soci','Anthro','Psych']
mathPhysicalSciencesKeywordList = ['Math','Physic','Stat','Chem','Computer','Geology','Marine','Tech']

# This sorts each of the majors into one of the four undeclared categories based on keywords defined above
fineArtsMajorList = []
socialSciencesMajorList = []
mathPhysicalSciencesMajorList = []
humanitiesMajorList = []
for major in lsList:
    if any(keyword in major[0] for keyword in fineArtsKeywordList):
        fineArtsMajorList.append(major)
    elif any(keyword in major[0] for keyword in socialSciencesKeywordList):
        socialSciencesMajorList.append(major)
    elif any(keyword in major[0] for keyword in mathPhysicalSciencesKeywordList):
        mathPhysicalSciencesMajorList.append(major)
    else:
        humanitiesMajorList.append(major)

# The final list of "Fine Arts"-based majors
print('Fine Arts Majors:')
for major in fineArtsMajorList:
    print(major[0])
print('------------------------')

# The final list of "Social Sciences"-based majors
print('Social Sciences Majors:')
for major in socialSciencesMajorList:
    print(major[0])
print('------------------------')


# The final list of "Physical Sciences"-based majors
print('Physical Sciences Majors:')
for major in mathPhysicalSciencesMajorList:
    print(major[0])

print('------------------------')

# The final list of "Humanities"-based majors
print('Humanities Majors:')
for major in humanitiesMajorList:
    print(major[0])
    
print('------------------------')


# Exporting all of the major groups
getAllRequirements(fineArtsMajorList,'fareqs.csv')
getAllRequirements(socialSciencesMajorList,'ssreqs.csv')
getAllRequirements(mathPhysicalSciencesMajorList,'mpreqs.csv')
getAllRequirements(humanitiesMajorList,'hmreqs.csv')

Fine Arts Majors:
Art History Major
Art Studio Major
Cinema and Digital Media Major
Music Major
Theatre and Dance Major
Undeclared—Fine Arts
------------------------
Social Sciences Majors:
Anthropology Major
Economics Major
International Relations Major
Political Science Major
Political Science – Public Service  Major
Psychology Major
Sociology Major
Sociology – Organizational Studies Major
Undeclared—Social Sciences
------------------------
Physical Sciences Majors:
Applied Mathematics Major
Applied Physics Major
Chemical Physics Major
Chemistry Major
Computer Science Major
Geology Major
Marine and Coastal Science—Oceans and the Earth System
Mathematical Analytics and Operations Research Major
Mathematical and Scientific Computation Major
Mathematics Major
Pharmaceutical Chemistry Major
Physics Major
Science and Technology Studies Major
Statistics Major
Undeclared—Physical Sciences
------------------------
Humanities Majors:
African American and African Studies Major
American Studies 

Marine and Coastal Science—Oceans and the Earth System
http://catalog.ucdavis.edu/programs/MCS/MCSreqt.html
['Biological Sciences 2A', 'Biological Sciences 2B', 'Biological Sciences 2C', 'Chemistry 2A', 'Chemistry 2B', 'Chemistry 2C', 'Mathematics 16A', 'Mathematics 16B', 'Mathematics 16C', 'Mathematics 17A', 'Mathematics 17B', 'Mathematics 17C', 'Mathematics 21A', 'Mathematics 21B', 'Mathematics 21C', 'Physics 7A', 'Physics 7B', 'Physics 7C', 'Physics 9A', 'Physics 9B', 'Physics 9C', 'Chemistry 8A', 'Chemistry 8B', 'Evolution & Ecology 12', 'Geology 16']
------------------
Mathematical Analytics and Operations Research Major
http://catalog.ucdavis.edu/Programs/MAT/MATreqt.html
['Mathematics 12', 'Mathematics 21A', 'Mathematics 21B', 'Mathematics 21C', 'Mathematics 21D', 'Mathematics 22B', 'Mathematics 25', 'Mathematics 22A', 'Mathematics 67', 'Computer Science 30', 'Engineering 6', 'Mathematics 22AL', 'Basic knowledge of MATLAB is required for both Mathematics 67', 'Basic knowledge of

East Asian Studies Major
http://catalog.ucdavis.edu/programs/EAS/EASreqt.html
['History 9A', 'History 9B', 'Art History 1D', 'Chinese 7', 'Chinese 10', 'Chinese 11', 'Comparative Literature 53A', 'East Asian Studies 88', 'Japanese 10', 'Japanese 25', 'Japanese 50', 'Religious Studies, 75']
------------------
English Major
http://catalog.ucdavis.edu/programs/ENL/ENLreqt.html
['English 3', 'University Writing Program 1', 'English 40', 'English 43', 'English 44', 'English 45', 'English 10A', 'English 10B', 'English 10C']
------------------
French Major
http://registrar.ucdavis.edu/UCDWebCatalog/Programs/FRE/FREreqt.html
['French 1', 'French 2', 'French 3', 'French 21', 'French 22', 'French 23', 'Linguistics 1', 'Linguistics 4']
------------------
Gender, Sexuality and Women's Studies Major
http://catalog.ucdavis.edu/programs/WMS/WMSreqt.html
['Women\x92s Studies, 50', 'Women\x92s Studies, 60', 'Women\x92s Studies, 70', 'African American and African Studies 10', 'African American and Afric

[['African American and African Studies 10',
  'African American and African Studies 12',
  'African American and African Studies 15',
  'African American and African Studies 17',
  'African American and African Studies 18',
  'African American and African Studies 50',
  'African American and African Studies 51',
  'African American and African Studies 52',
  'African American and African Studies 80',
  'Anthropology 2',
  'Economics 1A',
  'Economics 1B',
  'Geography 2',
  'Sociology 1',
  'Political Science 1',
  'Political Science 2',
  'Psychology 1',
  'Chicana/o Studies 10',
  'Native American Studies 1',
  'Native American Studies 10',
  'Women & Gender Studies 50',
  'American Studies 10',
  'Asian American Studies 1',
  'Asian American Studies 2',
  'History 15',
  'History 17A',
  'History 17B',
  'African American and African Studies 16',
  'African American and African Studies 51',
  'African American and African Studies 54',
  'Dramatic Art 41A',
  'Dramatic Art 41B',
  '

Sources used:
https://stackoverflow.com/questions/7055048/how-to-extract-certain-parts-of-a-web-page-in-python
https://stackoverflow.com/questions/10393157/splitting-a-string-with-multiple-delimiters-in-python
https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup
https://stackoverflow.com/questions/43043437/wordcloud-python-with-generate-from-frequencies
https://stackoverflow.com/questions/40444821/convert-csv-to-a-string-variable
https://stackoverflow.com/questions/3271478/check-list-of-words-in-another-string
https://www.tutorialspoint.com/python/list_remove.htm    
https://pythonspot.com/reading-csv-files-in-python/
http://amueller.github.io/word_cloud/auto_examples/simple.html#sphx-glr-auto-examples-simple-py
    