## COGS 108 - Final Project

### Important
- ONE, and only one, member of your group should upload this notebook to TritonED.
- Each member of the group will receive the same grade on this assignment.
- Keep the file name the same: submit the file 'FinalProject.ipynb'.
- Only upload the .ipynb file to TED; do not upload any associated data. Make sure that 
  the cells whose output you want graders to see have been executed.

### Group Members: Fill in the Student IDs of each group member here
Replace the lines below to list each person's full student ID, UCSD email, and full name.

- 
- 
- 


#### Imports
Below are specific third-party libraries we will be using

In [None]:
import sys
!{sys.executable} -m pip install textblob
!{sys.executable} -m pip install gender-guesser
!{sys.executable} -m pip install selenium
!{sys.executable} -m pip install pillow
!{sys.executable} -m pip install wordcloud

In [1]:
import requests
import json
import math
import re
import os
import time
import getpass
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import patsy
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
%matplotlib inline

##These Dependencies Need To be Downloaded
from textblob import TextBlob 
import gender_guesser.detector as gender
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

### Rate My Professor (RMP) Scraping
#### Scraping Rate My Professor for All UCSD Professors
To scrape data from RMP, we first need to figure out the GET request URLs that the site issues in the background when we use it. Once we have those, we can get a list of all professors at UCSD (<em>getProfessorList</em>) and extract their ids. Using the teacher ids (tids), we can then query all the reviews for a professor (<em>getProfessorInformation</em>) and save them to a dataframe. <em>generateAllProfInformation</em> does this for every professor and saves the result to a dataframe, along with some extra professor metadata that was returned with the list of all professors.

In [None]:
def getProfessorList(schoolID):
    page_id = 1
    professorList = []
    while True:
        page = requests.get("http://www.ratemyprofessors.com/filter/professor/?&page=" 
                            + str(page_id) + "&filter=teacherlastname_sort_s+asc&query=*%3A*&queryoption=TEACHER&queryBy=schoolId&sid=" 
                            + str(schoolID))
        
        jsonpage = json.loads(page.content)
        professors = jsonpage['professors']
        professorList.extend(professors)
        
        if(int(jsonpage['remaining']) == 0):
            break
        else:
            page_id += 1
             
    
    df = pd.DataFrame(professorList)
    df = df.drop(df[df['tNumRatings'] == 0].index) #drop rows without responses
    df.to_json('professors.json')

    '''
    #save to json file
    with open('professors.json', 'w') as outfile:
        json.dump(professorList, outfile)
    '''
    
    return df
    
def getProfessorInformation(tid):
    page_id=1
    pages = []
    while True:
        url = 'http://www.ratemyprofessors.com/paginate/professors/ratings?tid='+ str(tid)+'&filter=&courseCode=&page='+str(page_id)
        page = requests.get(url);
        r_json = json.loads(page.content)
        #page_of_comments = pd.DataFrame.from_dict(r_json['ratings'], orient='columns')
        pages.extend(r_json['ratings'])
        
        if(int(r_json['remaining']) == 0):
            break
        else:
            page_id += 1
        
    df = pd.DataFrame(pages)
    prof = pList.loc[pList['tid'] == tid]
    df.insert(0,'tDept',prof['tDept'].values[0])
    df.insert(0,'tFname',prof['tFname'].values[0])
    df.insert(0,'tLname',prof['tLname'].values[0])
    df.insert(0,'tid',tid)
    
    return df

def generateAllProfInformation():
    data = []
    tids = pList['tid'].values
    
    for i in tids:
        data.append(getProfessorInformation(i))
    
    data = pd.concat(data)
    data.to_csv("profData.csv",index=False)
    
    return data
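
Both scraping functions above follow the same pattern: request a page, collect its records, and stop once the response reports nothing `remaining`. A minimal sketch of that loop, assuming a hypothetical `fetch_page` callable that stands in for the `requests.get` + `json.loads` step, and a response shaped like `{'professors': [...], 'remaining': n}` as in the code above:

```python
def collect_all_pages(fetch_page, items_key):
    """Gather records from a paginated endpoint until 'remaining' hits 0.

    fetch_page is a hypothetical callable: page number -> decoded JSON dict.
    """
    items = []
    page_id = 1
    while True:
        payload = fetch_page(page_id)
        items.extend(payload[items_key])
        if int(payload['remaining']) == 0:
            break
        page_id += 1
    return items


# Fake two-page endpoint, used only to exercise the loop offline.
fake_pages = {
    1: {'professors': [{'tid': 1}, {'tid': 2}], 'remaining': 1},
    2: {'professors': [{'tid': 3}], 'remaining': 0},
}
all_professors = collect_all_pages(lambda page: fake_pages[page], 'professors')
print(len(all_professors))
```

Passing the fetch step in as a function keeps the loop testable without hitting the live site.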

#### First We Grab All The Professors Metadata

In [None]:
ucsdID = 1079
df_professors = getProfessorList(ucsdID)
pList = df_professors  # global read by getProfessorInformation and generateAllProfInformation

#### Next We Scrape All of Their Reviews
** Note This Will Take A While

In [None]:
df_responses = generateAllProfInformation()

### Now Let's Augment this Data for Better Analysis
#### Adding Gender
Since Rate My Professor doesn't provide any information on gender, we needed a way to obtain it some other way. Fortunately we do have name data, and for the most part we can deduce gender from common names. With this intuition we found a Python package (gender-guesser) that does exactly this, and the following code attempts to classify each professor by gender using their first name.

#### Create id to gender dict

In [None]:
names = df_professors['tFname'].values
tids = df_professors['tid'].values

gender_model = gender.Detector(case_sensitive=False)
genders = {}
u = 0

for i,v in enumerate(names):
    name = v.split(' ')[0]
    g = gender_model.get_gender(name)
    if g == 'male' or g == 'mostly_male':
        genders[tids[i]] = 'M'
    elif g == 'female' or g =='mostly_female':
        genders[tids[i]] = 'F'
    elif g == 'unknown' or g == 'andy':
        genders[tids[i]] = 'U'
        u+=1

** An important note is that the gender data produced is only as good as the model, and we are aware that this may affect our overall analysis
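
The if/elif chain above collapses the detector's labels into three codes. The same mapping can be sketched as a lookup table. `GENDER_CODES` and `label_to_code` are hypothetical names, and the labels listed are only the ones handled in the cell above:

```python
# Labels as handled in the cell above; 'andy' is grouped with 'unknown' there.
GENDER_CODES = {
    'male': 'M', 'mostly_male': 'M',
    'female': 'F', 'mostly_female': 'F',
    'unknown': 'U', 'andy': 'U',
}

def label_to_code(label):
    """Collapse a detector label to 'M', 'F', or 'U' (the fallback for anything unexpected)."""
    return GENDER_CODES.get(label, 'U')

print(label_to_code('mostly_female'))
```

Falling back to `'U'` means an unexpected label is treated as unknown rather than being skipped.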

#### Add genders to dataframe

In [None]:
# insert placeholder column; default to 'U' so unmatched rows are not silently labeled male
df_responses.insert(4,'gender','U')

for k,v in genders.items():
    index = df_responses[df_responses['tid'] == str(k)].index
    df_responses.loc[index,'gender'] = v

#### Getting Sentiment from RMP Comments
The majority of the data from RMP comes in the form of comments, which are free-text strings. To analyze this data numerically, we computed a sentiment value for each comment using TextBlob, a common Python text-processing library that lets us simply plug in comments to generate sentiments.

Comments are generally noisy, containing punctuation, @-mentions, and URLs that don't help in determining sentiment. The following function cleans up the comments.

In [None]:
def clean_comment(comment): 
    # raw string so the regex escapes (\w, \S, \/) are not treated as string escapes
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", comment).split()) 
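
To see what the cleaning step does in practice, here is a small self-contained run of the same pattern on two made-up comments (the sample strings are illustrative, not taken from the dataset):

```python
import re

def clean_comment(comment):
    # Same pattern as the cell above, written as a raw string.
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
    return ' '.join(re.sub(pattern, " ", comment).split())

print(clean_comment("Great prof!!! See http://x.com @bob"))  # Great prof See
print(clean_comment("He didn't curve: 10/10"))               # He didn t curve 10 10
```

Note that digits survive the cleaning and that apostrophes split contractions ("didn't" becomes "didn t").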

The sentiment values take on values between -1 and 1, which refer to negative and positive sentiments, respectively

In [None]:
def get_comment_sentiment(comment): 
    analysis = TextBlob(clean_comment(comment)) 
    # set sentiment 
    if analysis.sentiment.polarity > 0: 
        return 'positive', analysis.sentiment.polarity
    elif analysis.sentiment.polarity == 0: 
        return 'neutral', analysis.sentiment.polarity
    else: 
        return 'negative', analysis.sentiment.polarity

In [None]:
# insert placeholder columns (0.0 keeps sentimentValue a float column)
df_responses.insert(18,'sentimentValue',0.0)
df_responses.insert(18,'sentiment','positive')

In [None]:
for i in df_responses.index:
    comment = df_responses.loc[i,'rComments']
    if(not pd.isna(comment)):
        sentiment,polarity = get_comment_sentiment(comment)
        df_responses.loc[i,'sentiment'] = sentiment
        df_responses.loc[i,'sentimentValue'] = polarity
    else:
        df_responses.loc[i,'sentiment'] = 'N/A'
        df_responses.loc[i,'sentimentValue'] = 0

#### Now that we have some useful data, let's clean up what we don't need and standardize the columns we want to keep
 Remove Columns that are not Useful

In [None]:
dropColumns = ['rOverallString', 'onlineClass', 'rErrorMsg', 'rStatus', 'teacher', 'unUsefulGrouping', 'usefulGrouping', 'easyColor', 'helpColor', 'clarityColor']
df_responses.drop(columns=dropColumns,inplace=True)

Standardize Useful Columns

In [None]:
def yesNoToInt(str_in):
    if(str_in == "Yes"):
        return 1
    elif (str_in == "No"):
        return 0
    return str_in

In [None]:
def letterToGPA(str_in):
    gpaDict = {
        'A+': 4.0,
        'A' : 4.0,
        'A-': 3.7,
        'B+': 3.3,
        'B' : 3.0,
        'B-': 2.7,
        'C+': 2.3,
        'C' : 2.0,
        'C-': 1.7,
        'D+': 1.3,
        'D' : 1.0,
        'D-': 0.7,
        'F' : 0.0
    }
    return gpaDict.get(str_in, np.nan)

In [None]:
def interestToInt(str_in):
    if(str_in == "Low"):
        return 1
    elif (str_in == "Meh"):
        return 2
    elif (str_in == "Sorta interested"):
        return 3
    elif (str_in == "Really into it"):
        return 4
    elif (str_in == "It's my life"):
        return 5
    return str_in

In [None]:
def genderToInt(str_in):
    if(str_in == 'M'):
        return 1
    elif(str_in == 'F'):
        return -1
    else:
        return np.nan

In [None]:
df_responses["rTextBookUse"] = df_responses["rTextBookUse"].apply(yesNoToInt)
df_responses["rWouldTakeAgain"] = df_responses["rWouldTakeAgain"].apply(yesNoToInt)
df_responses["takenForCredit"] = df_responses["takenForCredit"].apply(yesNoToInt)
df_responses["rInterest"] = df_responses["rInterest"].apply(interestToInt)
df_responses["gender"] = df_responses["gender"].apply(genderToInt)
df_responses["teacherGrade"] = df_responses["teacherGrade"].apply(letterToGPA)

In [None]:
df_responses.to_csv("modifiedProfInfo.csv", index=False)

### Let's have a little fun with our new dataset Before Moving On
#### We're going to generate a wordcloud of all the comments in RMP

In [None]:
text = " ".join(comment for comment in df_responses.rComments if pd.notnull(comment))

In [None]:
stopwords = set(STOPWORDS)
wordcloud = WordCloud(max_font_size=50, max_words=100,stopwords=stopwords, background_color="white").generate(text)

plt.figure(figsize=[20,10])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

### CAPES Scraping
##### Instructions: Download the Appropriate Web Driver from the ChromeDrivers Folder and Add the Executable to Path
https://chromedriver.storage.googleapis.com/index.html?path=74.0.3729.6/

Scraping data from CAPES is much more tedious than RMP because CAPES data is only accessible to UCSD students. For this reason we need to perform an automated login to get to the data. We used Selenium and a Chrome driver to programmatically log in to CAPES.

#### Let's Login To Capes
The following code requires you to fill in a username, and provides a secure input field for the password when run. The code then opens a Google Chrome browser and performs an automated login to CAPES.

** Don't worry, the password is never echoed to the screen, and both credentials are cleared at the end of the cell

In [None]:
#enter credentials
username = ''
password = getpass.getpass()


# fail fast if either credential is missing
assert len(username) > 0, "Fill in your username above"
assert len(password) > 0, "Password must not be empty"

##init chrome driver
driver = webdriver.Chrome()
driver.get("https://cape.ucsd.edu/responses/Results.aspx")

#fill in username
elem = driver.find_element(By.NAME, "urn:mace:ucsd.edu:sso:username")
elem.clear()
elem.send_keys(username)

#fill in password
elem = driver.find_element(By.NAME, "urn:mace:ucsd.edu:sso:password")
elem.clear()
elem.send_keys(password)

#login!
driver.find_element(By.NAME, "_eventId_proceed").click()

#reset username & password for safety
username = ''
password = ''

#### Get all the Departments
To be able to query CAPES, we need all the department names. Thankfully, Selenium has a helper for reading the dropdown menu that lists all the department names.

In [None]:
select = Select(driver.find_element(By.NAME, 'ctl00$ContentPlaceHolder1$ddlDepartments'))
options = select.options
del options[0] #remove "select department tag"
departments = [o.get_attribute('value') for o in options]

#### Download All Page Sources For Each Department
Using Selenium, we download the page source for each department so that we can parse it with Beautiful Soup. Note that we add a one-second delay after each request, since each CAPES query takes time to load.

In [None]:
basecape = 'http://cape.ucsd.edu/responses/Results.aspx?Name=&CourseNumber='
page_sources = []
for depts in departments:
    req = basecape + depts
    driver.get(req)
    time.sleep(1)
    page_sources.append(driver.page_source)
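
The fixed `time.sleep(1)` above assumes every page finishes loading within a second. An alternative is to poll for a condition up to a timeout, which is the idea behind Selenium's `WebDriverWait`. The sketch below is stdlib-only and just illustrates that polling idea; `wait_until` and the `page_ready` predicate are hypothetical stand-ins, not part of the original notebook:

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "the results table is present": becomes true on the third check.
checks = {'count': 0}

def page_ready():
    checks['count'] += 1
    return checks['count'] >= 3

loaded = wait_until(page_ready, timeout=2.0, interval=0.01)
print(loaded, checks['count'])
```

In the scraping loop, the predicate would be something like "the `styled` table is in `driver.page_source`", so fast pages are not delayed and slow pages are not cut off.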

#### Build Headers For DataFrame
This cell builds the table headers that describe each column in the CAPES tables, using the first downloaded page (every department page shares the same header row).

In [None]:
headers = []
soup = BeautifulSoup(page_sources[0], 'html.parser')
table = soup.find('table', attrs={'class':'styled'})
for th in table.find('tr').findAll('th'):
    headers.append(th.text.strip())

#### Parse Table Containing Reviews Into DataFrame
Now we use Beautiful Soup to extract the data in each department's table into a dataframe.

In [None]:
dataframes = []
for source in page_sources:
    soup = BeautifulSoup(source, 'html.parser')
    table = soup.find('table', attrs={'class':'styled'})
    data = []
    for i,row in enumerate(table.findAll('tr')):
        if i==0:
            continue
        else:
            col_data = []
            for td in row.findAll('td'): 
                col_data.append(td.text.strip())
        data.append(col_data)
    dataframes.append(pd.DataFrame(data))

#### Concat DataFrame, Add Headers


In [None]:
df = pd.concat(dataframes)
df.columns = headers
df.reset_index(drop=True,inplace=True)

#### Standardizing Columns

Replacing percent with decimal float values

In [None]:
def cleanPercentage(percent):
    percent = percent.strip()
    percent = percent.split(' ')[0]
    percent = float(percent) / 100.0
    return percent

Replacing letter grades with purely numerical values for analysis

In [None]:
def cleanGrades(grades):
    if(pd.notnull(grades)):
        grades = grades.strip()
        grades = grades.split(' ')[1]
        grades = grades.strip('()')
        return float(grades)
    else:
        return grades
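
As a quick illustration of the two cleaning steps above, here is a self-contained sketch. It assumes raw CAPES cells look like `'85.7 %'` and `'B+ (3.30)'`, which is what the splitting logic implies. The function names are hypothetical:

```python
def parse_percentage(text):
    """Turn a cell such as '85.7 %' into a fraction of 1."""
    return float(text.strip().split(' ')[0]) / 100.0

def parse_grade(text):
    """Turn a cell such as 'B+ (3.30)' into the grade points inside the parentheses."""
    return float(text.strip().split(' ')[1].strip('()'))

print(parse_percentage(' 85.7 % '))
print(parse_grade('B+ (3.30)'))
```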

Here we encode the terms by quarter and year so that we can perform time series analysis further on

In [None]:
def cleanTerms(terms):
    semester = terms[:2]
    year     = terms[2:4]
    
    if(semester == "WI"):
        return (int)(year+"0")
    
    if(semester == "SP"):
        return (int)(year+"1")
    
    if(semester == "S1" or semester == "S2" or semester == "S3" or semester == "SU"):
        return (int)(year+"2")
    
    if(semester == "FA"):
        return (int)(year+"3")
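
The encoding above makes terms sort chronologically: the two-digit year comes first and the quarter's position within the year comes last. A minimal sketch of the same idea using a lookup table (`QUARTER_ORDER` and `encode_term` are hypothetical names; unlike `cleanTerms`, an unrecognized quarter raises a `KeyError` instead of returning `None`):

```python
# Quarter order within a year, matching the codes used above.
QUARTER_ORDER = {'WI': 0, 'SP': 1, 'S1': 2, 'S2': 2, 'S3': 2, 'SU': 2, 'FA': 3}

def encode_term(term):
    """'FA18' -> 183: two-digit year followed by the quarter's position."""
    quarter, year = term[:2], term[2:4]
    return int(year) * 10 + QUARTER_ORDER[quarter]

terms = ['FA18', 'WI19', 'SP18', 'S118']
print(sorted(terms, key=encode_term))  # ['SP18', 'S118', 'FA18', 'WI19']
```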
    

#### The following functions are used to make it possible to combine both RMP and CAPES datasets into one dataframe

We extract the department name from the course description, since the department is the only course-level information that RMP provides reliably

In [None]:
def splitDepartment(course):
    course = course.strip()
    course = course.split(" ")[0]
    return course

We need to split professors' names into first and last to match the convention in RMP

In [None]:
def splitFirstName(inst):
    if(pd.notnull(inst)):
        inst = inst.strip()
        inst = inst.split(",")[1].strip()
        inst = inst.split(" ")[0].strip()
        return inst
    else:
        return inst

In [None]:
def splitLastName(inst):
    if(pd.notnull(inst)):
        inst = inst.strip()
        inst = inst.split(",")[0].strip()
        return inst
    else:
        return inst
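
The two helpers above can be sketched as one function. It assumes CAPES lists instructors as `'Last, First Middle'`, which is what the comma split implies. The name `split_instructor` and the sample names are hypothetical:

```python
def split_instructor(name):
    """'Smith, John A' -> ('John', 'Smith')."""
    last, _, rest = name.strip().partition(',')
    parts = rest.split()
    first = parts[0] if parts else ''
    return first, last.strip()

print(split_instructor('Smith, John A'))
```

Using `partition` means a name without a comma yields an empty first name instead of raising an `IndexError`.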

### Eval / Enroll
You will notice we are using Eval / Enroll as one of the augmented columns. This ratio is what we need in order to compare RMP and CAPES data: the number of evaluations made is meaningless by itself without knowing the enrollment, i.e. the total number of evaluations that could have been made.

In [None]:
df["Rcmnd Class"] = df["Rcmnd Class"].apply(cleanPercentage)
df["Rcmnd Instr"] = df["Rcmnd Instr"].apply(cleanPercentage)
df["Avg Grade Expected"] = df["Avg Grade Expected"].apply(cleanGrades)
df["Avg Grade Received"] = df["Avg Grade Received"].apply(cleanGrades)
df["Term"] = df["Term"].apply(cleanTerms)
df["tDept"] = df["Course"].apply(splitDepartment)
df["tLname"] = df["Instructor"].apply(splitLastName)
df["tFname"] = df["Instructor"].apply(splitFirstName)
# scraped cells are strings, so convert to numbers before dividing
df["Eval / Enroll"] = pd.to_numeric(df["Evals Made"]) / pd.to_numeric(df["Enroll"])

In [None]:
df = df[['Instructor','tLname','tFname','tDept','Course','Term','Enroll','Evals Made', 'Eval / Enroll',
'Rcmnd Class', 'Rcmnd Instr', 'Study Hrs/wk', 'Avg Grade Expected', 'Avg Grade Received']]

df.head()

#### Let's Find Duplicate Names as This Could Cause Issues

In [None]:
fn = [(splitFirstName(s) + " " + splitLastName(s)) for s in df['Instructor'].unique() if pd.notnull(s)]

In [None]:
seen = {}
dupes = []

for x in fn:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1
        
print("There are {} professors with the same first and last name, however they are in different departments after further analysis".format(len(dupes))) 
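
The same duplicate check can be sketched with `collections.Counter`. The function name and the sample names are made up for illustration:

```python
from collections import Counter

def find_duplicates(names):
    """Return names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]

sample = ['Ann Lee', 'Bo Chen', 'Ann Lee', 'Cy Diaz', 'Bo Chen', 'Ann Lee']
print(find_duplicates(sample))  # ['Ann Lee', 'Bo Chen']
```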

#### Below Are the Number of Duplicate Professors In Different Departments
Through Individual Analysis we Deduced that the Rest of the 22 Professors were Not Actually Duplicates but Inputted Wrong in CAPES

In [None]:
dps = []
for d in dupes:
    ff = d.split(' ')[0]
    ll = d.split(' ')[1]
    if(len(df[(df['tLname'] == ll) & (df['tFname'] == ff)]['tDept'].unique()) > 1):
        dps.append((ff,ll))
display(dps)

### Goal:  Combine RMP and CAPES Data into one Dataframe of All Professors Common to Both Datasets
To do this we need a way to keep only the professors that have reviews in RMP and attach their tids (teacher ids) to the CAPES dataframe. First let's take a look at those duplicates and import our RMP dataframe.

In [None]:
df_rmp = pd.read_csv('modifiedProfInfo.csv', header=0)

Find the duplicates that are in RMP

In [None]:
trim_dps = []
for d in dps:
    if(len(df_rmp[(df_rmp['tLname'] == d[1]) & (df_rmp['tFname'] == d[0])]) > 0):
        trim_dps.append(d)
        
trim_dps

At this point, due to the complexity of adding them in and the low number of duplicates, we decided not to use these professors.

#### Now We Will Try to Append the tid's from RMP to CAPES

In [None]:
tids_nf = []
tids = df_rmp['tid'].unique()
for tid in tids:
    if(pd.notnull(tid)):
        rmp_fname = df_rmp[df_rmp['tid'] == tid]['tFname'].unique()[0]
        rmp_lname = df_rmp[df_rmp['tid'] == tid]['tLname'].unique()[0]
        rmp_fl = (rmp_fname,rmp_lname)
        if(rmp_fl not in dps):
            indices = df[(df['tLname'] == rmp_lname) & (df['tFname'] == rmp_fname)].index
            if(len(indices) > 0):
                for i in indices:
                    df.loc[i,"tid"] = tid
            else:
                tids_nf.append(tid)       

Replace NaN tids with -1 (a sentinel meaning "no RMP match") and convert to int

In [None]:
df['tid'] = df['tid'].fillna(-1)
df['tid'] = df['tid'].astype(int)

In [None]:
# professors without an RMP match carry the sentinel tid -1
profs_not_on_RMP = len(df[df['tid'] == -1]['Instructor'].unique())
total_profs_on_CAPES = len(df['Instructor'].unique())
print("There are {} professors without reviews on RMP, out of the total {} professors on CAPES".format(profs_not_on_RMP,total_profs_on_CAPES))

Save the augmented CAPES dataframe

In [None]:
df.to_csv('capeReviewsCleaned2.csv')

### It's Finally Time to Consolidate RMP and CAPES into One DataFrame!
Now we want to put together our two dataframes into one, where each row represents a professor. Professors without a tid on RateMyProfessor will have to be excluded.

In [None]:
cape = pd.read_csv("capeReviewsCleaned2.csv")
rmp = pd.read_csv("modifiedProfInfo.csv")

Below is the list of columns that our new dataframe will contain. It omits information such as first and last name to follow <strong>safe harbour methods</strong>.

In [None]:
columnList = ['tid', 'gender', 'tDept', 'Enroll', 'Evals Made', 'Eval/Enroll', 'Rcmnd Class', 
           'Rcmnd Instr', 'Study Hrs/wk', 'Avg Grade Expected', 'Avg Grade Received', 
           'rEasy', 'rHelpful', 'rInterest' , 'rOverall', 'rWouldTakeAgain', 'rmp Grade', 
           'sentimentValue', 'teacherRatingTags', 'rmp Evals/Enroll']

df = pd.DataFrame(columns=columnList)

#### Generating tid list

In [None]:
tidList = cape['tid'].unique() # get array of unique tid's
tidList = np.setdiff1d(tidList,-1) # remove -1 from array and put in numerical order
df['tid'] = tidList #generate rows for tid column

Our final dataframe will contain rows of professors, therefore there are a few things to consider during the consolidation process

1) RMP data will need to be averaged per professor, as each row in the RMP dataframe corresponds to one review of one professor

2) CAPE data will need to be averaged by weight, where the weight represents the number of evaluations per class over the total number of evaluations of all classes for that professor (eval/total eval). This is because each row in the CAPE dataframe corresponds to n number of evaluations and should be weighted accordingly


#### This method handles calculating the weighted (eval/total eval) average per column

In [None]:
#gets weighted average of values of a column
def getWeightedAvg(col, tid, totEvals):
    totEvals = totEvals.values[0]
    
    for index, row in cape[(pd.isnull(cape[col])) & (cape['tid'] == tid)].iterrows(): #remove evals from total for null entries in col
        totEvals -= row['Evals Made']
    avg = 0.0
    evalsCheck = 0
    for index, row in cape[cape['tid'] == tid].iterrows():
        if(not pd.isnull(row[col])):
            evalsCheck += row['Evals Made']
            avg += row[col] * row['Evals Made'] / totEvals # add to weighted avg
    
    if(evalsCheck != totEvals): #check to see if evals were added correctly
        print("EVALS CALCULATION ERROR: evalsCheck == {} , totEvals == {}".format(evalsCheck, totEvals))
        
    if(avg == 0): #return NaN if nothing was added to avg
        return np.nan
    
    return avg
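
To make the weighting concrete, here is a minimal stdlib sketch of an evaluation-weighted average that skips missing values. It is an illustration of the idea rather than the notebook's implementation. `weighted_average` and the sample numbers are hypothetical:

```python
def weighted_average(rows):
    """rows: iterable of (value, evals) pairs; value may be None when missing."""
    usable = [(value, evals) for value, evals in rows if value is not None]
    total_evals = sum(evals for _, evals in usable)
    if total_evals == 0:
        return None  # nothing to average
    return sum(value * evals for value, evals in usable) / total_evals

# Three classes: 90% recommend from 10 evals, 50% from 30 evals, one with no value.
print(weighted_average([(0.9, 10), (0.5, 30), (None, 20)]))
```

The class with 30 evaluations pulls the result toward 0.5, and the class with a missing value contributes neither to the sum nor to the total weight.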

#### This method handles parsing all the teacher tags from RMP and generating a unique csv string

In [None]:
#get all unique rating tags for a tid
def getRatingTags(tid):
    tags = []
    for index, row in rmp[rmp['tid'] == tid].iterrows():
        tg = row['teacherRatingTags']
        if(not pd.isnull(tg)):
            tg = tg.strip('[ ]')
            tg = tg.replace('\'', ' ')
            tg = tg.split(',')
            tags.extend([a.strip() for a in tg if (a.strip() not in tags)])
    if('' in tags):    
        tags.remove('')
        
    tgstr = ""
    for tg in tags:
        tgstr += tg
        tgstr += ','
        
    return tgstr
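
A compact sketch of the same tag merging, assuming the `teacherRatingTags` column holds stringified Python lists (which is what the strip/replace logic above suggests). `unique_tags` and the sample tags are hypothetical:

```python
import ast

def unique_tags(tag_strings):
    """Merge stringified tag lists into one comma-separated string, order preserved."""
    seen = {}
    for raw in tag_strings:
        for tag in ast.literal_eval(raw):
            tag = tag.strip()
            if tag:
                seen.setdefault(tag, None)
    return ','.join(seen)

rows = ["['Tough grader', 'Caring']", "[]", "['Caring', 'Lots of homework']"]
print(unique_tags(rows))  # Tough grader,Caring,Lots of homework
```

Unlike `getRatingTags`, this version also removes repeats within a single row and does not leave a trailing comma.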

Now we will iterate through each row and add values to each column from either rmp or capes. Since each row represents a unique professor, we must do some data processing on capes and rmp to consolidate multiple entries for each professor, using the functions above

In [None]:
counts = rmp.tid.value_counts()

for tid in tidList:
    index = df[df['tid'] == tid].index
    #print(tid)
    df.loc[index, 'gender'] = rmp[rmp['tid'] == str(tid)]['gender'].values[0]
    df.loc[index, 'tDept'] = cape[cape['tid'] == tid]['tDept'].values[0]
    df.loc[index, 'Enroll'] = cape[cape['tid'] == tid]['Enroll'].sum()
    df.loc[index, 'Evals Made'] = cape[cape['tid'] == tid]['Evals Made'].sum()
    df.loc[index, 'Eval/Enroll'] = df[df['tid'] == tid]['Evals Made'] / df[df['tid'] == tid]['Enroll']
    df.loc[index, 'Rcmnd Class'] = getWeightedAvg('Rcmnd Class', tid, df[df['tid'] == tid]['Evals Made'])
    df.loc[index, 'Rcmnd Instr'] = getWeightedAvg('Rcmnd Instr', tid, df[df['tid'] == tid]['Evals Made'])
    df.loc[index, 'Study Hrs/wk'] = getWeightedAvg('Study Hrs/wk', tid, df[df['tid'] == tid]['Evals Made'])
    df.loc[index, 'Avg Grade Expected'] = getWeightedAvg('Avg Grade Expected', tid, df[df['tid'] == tid]['Evals Made'])
    df.loc[index, 'Avg Grade Received'] = getWeightedAvg('Avg Grade Received', tid, df[df['tid'] == tid]['Evals Made'])
    df.loc[index, 'rEasy'] = rmp[rmp['tid'] == str(tid)]['rEasyString'].mean()
    df.loc[index, 'rHelpful'] = rmp[rmp['tid'] == str(tid)]['rHelpful'].mean()
    df.loc[index, 'rInterest'] = rmp[rmp['tid'] == str(tid)]['rInterest'].mean()
    df.loc[index, 'rOverall'] = rmp[rmp['tid'] == str(tid)]['rOverall'].mean()
    df.loc[index, 'rWouldTakeAgain'] = rmp[rmp['tid'] == str(tid)]['rWouldTakeAgain'].mean()
    df.loc[index, 'rmp Grade'] = rmp[rmp['tid'] == str(tid)]['teacherGrade'].mean()
    df.loc[index, 'sentimentValue'] = rmp[rmp['tid'] == str(tid)]['sentimentValue'].mean()
    df.loc[index, 'teacherRatingTags'] = getRatingTags(str(tid)) 
    df.loc[index, 'rmp Evals/Enroll'] = counts.get(str(tid)) / df[df['tid'] == tid]['Enroll']

### Finally we have the final dataframe we need to perform analysis

In [None]:
df.to_csv("FullData.csv", index=False)