# How high of an incentive (measured through differences in overall GPA) do students face by enrolling in classes with relatively higher grade distributions?


## Diego Saldonid, James Mata, Lucas Hsu, Roger Ruan


## Introduction and Background

With grade inflation becoming a growing problem across public and private schools, we were interested in taking a closer look at what may be happening within a single school. As students, we often hear advice on which professor to take for a certain class because the chances of getting a good grade are higher. For many students at UCSD who plan to pursue further education or careers in their subject's job market, GPA is an important factor in looking competitive in these fields. This leads students to wait for a class to be taught by a certain professor in order to boost their transcripts and improve their chances of success. This brings us to our question: what is the difference in GPA between students who happen to take professors with lower grade distributions and students who take their higher-grading colleagues? If we are able to find a large enough disparity between these students, what would be our call to action? Furthermore, what might this reflect about the way classes are being taught, not only at UCSD but at other universities?

As our references allude to, higher grades do not imply that students are performing better or learning more. The growing competition to get into medical school, grad school, or other competitive career fields incentivizes students to pursue a higher grade over leaving college with technical skills and knowledge. Finally, do classes with higher grade distributions challenge students less, or do they provide a better teaching environment?



## Data Description

The dataset used for our project is the UCSD CAPE evaluations, found at https://cape.ucsd.edu/responses/Results.aspx. This dataset has grade distributions for classes across multiple disciplines at UCSD from Spring quarter 2007 to Spring quarter 2017. We will be focusing on select majors: Cogs, Bio, Biochem, MAE, Chem, and CS. From these majors, we will be analyzing required major courses.

Each data row has the following columns: Instructor, Course, Term, Enroll, Evals Made, Rcmnd Class, Rcmnd Instr, Study Hrs/wk, Avg Grade Expected, and Avg Grade Received. After cleaning our data, we will be focusing on Instructor, Course, Term, Rcmnd Instr, Study Hrs/wk, and Avg Grade Received.

Comparing classes taught by different professors at different times lets us analyze differences between professors and pick up on trends within a department. By tracking these changes across multiple departments (to get an accurate picture of what is happening campus-wide), we can compare how changes at UCSD agree or disagree with national trends. We will also compare study hours between courses taught by professors of the same department. This serves as an indicator of how rigorous each class may be, which may help explain differences in grade distributions.

After scraping the data, we discard rows with missing (NaN) values and classes where fewer than 40% of enrolled students submitted evaluations.

Loading modules. We use the Beautiful Soup library to scrape the different categories reported with the evaluations.

In [1]:
import ssl
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

Here, we set up the dataframe for the data we are going to scrape. We create a function called scrapData that goes through the online dataset for a specified course and extracts the data into a dataframe. Inside the function, we collect the values for each column and then create a dataframe from those columns.

In [2]:
#all courses
def scrapData(links, i, dfT):
    gcontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
    req = urllib.request.Request(url="https://cape.ucsd.edu/responses/Results.aspx?Name=&CourseNumber=" + links[i],
        data=b'None',headers={'User-Agent':' Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'})
    handler = urllib.request.urlopen(req, context=gcontext)
    htmltext = handler.read()
    soup = BeautifulSoup(htmltext,'lxml')

    # Create one list per column to store the scraped data in
    Name = []
    course = []
    rcmndClass = []
    term = []
    enroll = []
    evalMade = []
    rcmndInstructor = []
    studyHrs = []
    avgGradeExpected = []
    avgGradeRecieved = []

    # Find the first element with class 'styled' (the results table)
    table = soup.find(class_='styled')

    # Loop over every <tr> in the table, skipping the header row
    for row in table.find_all('tr')[1:]:

        spans = row.find_all('span')
        a = row.find_all('a')

        # Collect all the <td> cells in this row
        col = row.find_all('td')

        # Instructor name: string inside the 1st <td>
        column_1 = col[0].string
        Name.append(column_1)

        # Course name: text of the first link in the row
        column_10 = a[0].text
        course.append(column_10)

        # Percentage recommending the class: 2nd <span>
        column_2 = spans[1].text
        rcmndClass.append(column_2)

        # Term: string inside the 3rd <td>
        column_3 = col[2].string
        term.append(column_3)

        # Enrollment: text of the 4th <td>
        column_4 = col[3].text
        enroll.append(column_4)

        # Number of evaluations made: 1st <span>
        column_5 = spans[0].text
        evalMade.append(column_5)

        # Percentage recommending the instructor: 3rd <span>
        column_6 = spans[2].text
        rcmndInstructor.append(column_6)

        # Study hours per week: 4th <span>
        column_7 = spans[3].text
        studyHrs.append(column_7)

        # Average grade expected: 5th <span>
        column_8 = spans[4].text
        avgGradeExpected.append(column_8)

        # Average grade received: 6th <span>
        column_9 = spans[5].text
        avgGradeRecieved.append(column_9)

    # Map each output column name to its list of scraped values
    columns = {'Name': Name, 'rcmndClass': rcmndClass, 'term': term, 'enroll': enroll, 'evalMade': evalMade,
               'rcmndInstructor': rcmndInstructor, 'studyHrs': studyHrs, 'avgGradeExpected': avgGradeExpected,
               'avgGradeRecieved': avgGradeRecieved, 'course': course}

    # Create a dataframe from the columns and put them in display order
    df = pd.DataFrame(columns)
    df = df[['Name', 'course', 'term', 'enroll', 'evalMade', 'rcmndClass', 'rcmndInstructor', 'studyHrs',
             'avgGradeExpected', 'avgGradeRecieved']]
    return df

For now, we focus on the Cogs major and its required core classes. We extract the data for the Cogs major by scraping the results for each required course, which cover the different professors who teach it. We fill our dataframe using the scrapData function created above to give us raw data.

In [3]:
# Required courses for the Cogs major, written as CAPE query strings
cogsMajor = ['MATH+20A', 'MATH+20B','MATH+20C','MATH+20F','COGS+1','COGS+14A',
             'COGS+101A','COGS+102A','COGS+107A','CSE+7']
dfT = pd.DataFrame()
# Scrape each course, then stack the results into one dataframe.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
frames = [scrapData(cogsMajor, i, dfT) for i in range(len(cogsMajor))]
dfT = pd.concat(frames)

In [4]:
dfT


Unnamed: 0,Name,course,term,enroll,evalMade,rcmndClass,rcmndInstructor,studyHrs,avgGradeExpected,avgGradeRecieved
0,"Stevens, Laura Jeanne",MATH 20A - Calculus/Science & Engineering (A),WI17,150,85,97.5 %,100.0 %,5.48,B (3.30),C+ (2.51)
1,"Um, Ko Woon",MATH 20A - Calculus/Science & Engineering (B),WI17,110,49,70.5 %,65.9 %,6.45,B- (2.84),C+ (2.36)
2,"Dewey, Edward Harold",MATH 20A - Calculus/Science & Engineering (A),FA16,262,137,93.9 %,85.6 %,6.12,B+ (3.50),B- (2.87)
3,"Bowers, Adam R.",MATH 20A - Calculus/Science & Engineering (B),FA16,191,89,96.5 %,97.6 %,6.70,B (3.23),C+ (2.38)
4,"Quarfoot, David James",MATH 20A - Calculus/Science & Engineering (C),FA16,214,121,99.1 %,97.4 %,5.75,B+ (3.49),C+ (2.50)
5,"Um, Ko Woon",MATH 20A - Calculus/Science & Engineering (D),FA16,198,94,94.3 %,67.0 %,5.55,B+ (3.49),C+ (2.51)
6,"Bach, Quang Tran",MATH 20A - Calculus/Science & Engineering (A),S216,27,18,94.4 %,83.3 %,7.39,B+ (3.33),B (3.29)
7,"Chow, Bennett",MATH 20A - Calculus/Science & Engineering (A),S116,28,9,100.0 %,100.0 %,8.21,A- (3.71),B+ (3.43)
8,"Rhoades, Brendon Patrick",MATH 20A - Calculus/Science & Engineering (A),SP16,97,61,93.1 %,93.1 %,5.57,B (3.19),B- (2.83)
9,"Wang, Xu",MATH 20A - Calculus/Science & Engineering (A),WI16,72,30,78.6 %,14.3 %,5.50,C+ (2.54),B- (2.93)
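The `term` column holds codes such as WI17, FA16, and S216, which do not sort chronologically as plain strings. A minimal sketch of a sort key follows. It is not part of the original notebook: the `TERM_ORDER` mapping and the `term_key` helper are hypothetical names, and the sketch assumes that WI, SP, S1, S2, and FA stand for winter, spring, summer session 1, summer session 2, and fall, and that every two-digit year falls in the 2000s.

```python
# Assumed position of each term prefix within one calendar year.
TERM_ORDER = {"WI": 0, "SP": 1, "S1": 2, "S2": 3, "FA": 4}


def term_key(term):
    """Map a code like 'FA16' to a sortable (year, position) tuple."""
    prefix, year = term[:2], int(term[2:])
    # Assumption: all years in the dataset are 20xx.
    return (2000 + year, TERM_ORDER[prefix])


# Term codes taken from the rows shown above.
terms = ["WI17", "FA16", "S216", "S116", "SP16", "WI16"]
chronological = sorted(terms, key=term_key)
```

A key like this could be passed to a sort when tracking grades over time. A prefix outside the assumed mapping raises a `KeyError`, which makes unexpected term codes visible rather than silently misplacing them.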


## Data Cleaning/Pre-processing

The data we got above is raw data. We now clean it by omitting rows where the grade is not available. To make sure each data point is meaningful, we also remove rows where the number of evaluations made is 40% or less of enrollment. This gives a more accurate representation of the grade distribution among students.



In [6]:
# Convert the enroll and evalMade strings to numbers
dfT.enroll = pd.to_numeric(dfT.enroll)
dfT.evalMade = pd.to_numeric(dfT.evalMade)
# Drop classes whose received grade is 'N/A'
dfT = dfT[dfT.avgGradeRecieved != 'N/A']
# Keep only classes where more than 40% of enrolled students made evaluations
dfT = dfT[dfT.evalMade/dfT.enroll > .40]




In [7]:
dfT

Unnamed: 0,Name,course,term,enroll,evalMade,rcmndClass,rcmndInstructor,studyHrs,avgGradeExpected,avgGradeRecieved
0,"Stevens, Laura Jeanne",MATH 20A - Calculus/Science & Engineering (A),WI17,150,85,97.5 %,100.0 %,5.48,B (3.30),C+ (2.51)
1,"Um, Ko Woon",MATH 20A - Calculus/Science & Engineering (B),WI17,110,49,70.5 %,65.9 %,6.45,B- (2.84),C+ (2.36)
2,"Dewey, Edward Harold",MATH 20A - Calculus/Science & Engineering (A),FA16,262,137,93.9 %,85.6 %,6.12,B+ (3.50),B- (2.87)
3,"Bowers, Adam R.",MATH 20A - Calculus/Science & Engineering (B),FA16,191,89,96.5 %,97.6 %,6.70,B (3.23),C+ (2.38)
4,"Quarfoot, David James",MATH 20A - Calculus/Science & Engineering (C),FA16,214,121,99.1 %,97.4 %,5.75,B+ (3.49),C+ (2.50)
5,"Um, Ko Woon",MATH 20A - Calculus/Science & Engineering (D),FA16,198,94,94.3 %,67.0 %,5.55,B+ (3.49),C+ (2.51)
6,"Bach, Quang Tran",MATH 20A - Calculus/Science & Engineering (A),S216,27,18,94.4 %,83.3 %,7.39,B+ (3.33),B (3.29)
8,"Rhoades, Brendon Patrick",MATH 20A - Calculus/Science & Engineering (A),SP16,97,61,93.1 %,93.1 %,5.57,B (3.19),B- (2.83)
9,"Wang, Xu",MATH 20A - Calculus/Science & Engineering (A),WI16,72,30,78.6 %,14.3 %,5.50,C+ (2.54),B- (2.93)
10,"Bowers, Adam R.",MATH 20A - Calculus/Science & Engineering (B),WI16,144,71,95.5 %,95.5 %,6.02,B+ (3.35),C+ (2.40)
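The grade and percentage columns in the cleaned table are still strings such as `C+ (2.51)` and `97.5 %`, so they need to be converted to numbers before any averaging. The sketch below shows one possible way to do this with the standard library. The helper names `grade_points` and `percent` are hypothetical and not part of the original notebook, and the sketch assumes every grade cell is either `N/A` or carries its grade points in parentheses.

```python
import re

# Matches the number in parentheses, e.g. the 2.51 in "C+ (2.51)".
_GRADE_POINTS = re.compile(r"\((\d+(?:\.\d+)?)\)")


def grade_points(cell):
    """Return the grade points from a cell such as 'C+ (2.51)', or None."""
    match = _GRADE_POINTS.search(cell or "")
    return float(match.group(1)) if match else None


def percent(cell):
    """Return a percentage cell such as '97.5 %' as a float, or None."""
    try:
        return float(cell.replace("%", "").strip())
    except (AttributeError, ValueError):
        # Covers missing cells (None) and non-numeric text such as 'N/A'.
        return None
```

With pandas, these helpers could be applied column by column, for example `dfT.avgGradeRecieved.map(grade_points)`.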


## Data Analysis and Results

With this data we will be able to track changes in average grades over the last ten years. By comparing changes in GPA across departments, we can see whether UCSD follows national trends while keeping our data pool campus-wide. Comparing study hours between courses taught by professors of the same department lets us examine study time as a possible source of the discrepancy. Ultimately, we want to compare students who take professors with higher grade distributions against students who take professors with lower grade distributions, and to see how much of an impact that difference in professors has on students' overall GPA.
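One way the comparison between higher-grading and lower-grading professors could be sketched is shown below. This is an illustration under stated assumptions rather than the analysis itself: the `instructor_gap` helper is a hypothetical name, the rows are the ten cleaned MATH 20A rows shown earlier with the section letter dropped from the course name, and each class counts equally (the averages are not weighted by enrollment).

```python
from collections import defaultdict
from statistics import mean

# (instructor, course, average grade received) from the cleaned MATH 20A rows.
rows = [
    ("Stevens, Laura Jeanne", "MATH 20A", 2.51),
    ("Um, Ko Woon", "MATH 20A", 2.36),
    ("Dewey, Edward Harold", "MATH 20A", 2.87),
    ("Bowers, Adam R.", "MATH 20A", 2.38),
    ("Quarfoot, David James", "MATH 20A", 2.50),
    ("Um, Ko Woon", "MATH 20A", 2.51),
    ("Bach, Quang Tran", "MATH 20A", 3.29),
    ("Rhoades, Brendon Patrick", "MATH 20A", 2.83),
    ("Wang, Xu", "MATH 20A", 2.93),
    ("Bowers, Adam R.", "MATH 20A", 2.40),
]


def instructor_gap(rows):
    """Per course: mean grade by instructor, plus the highest-minus-lowest gap."""
    by_course = defaultdict(lambda: defaultdict(list))
    for instructor, course, grade in rows:
        by_course[course][instructor].append(grade)

    result = {}
    for course, instructors in by_course.items():
        # Unweighted mean over the classes each instructor taught.
        means = {name: mean(grades) for name, grades in instructors.items()}
        result[course] = (means, max(means.values()) - min(means.values()))
    return result


means, gap = instructor_gap(rows)["MATH 20A"]
```

For these rows the gap between the highest and lowest instructor averages comes out to roughly 0.9 grade points. Weighting by enrollment, or restricting the comparison to the same term, would be natural refinements.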

## Conclusions/Discussion