# In this step, we'll process graduation data from the federal files
## In most cases, this is a straight "pull" from the data, but there are a few possible modifications:

- If the sample is too small from the most recent year, use 3 years of data
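- For HBCUs, boost the African American/Hispanic rate by 15 percentage points (or by half the distance to 100%, if that is less)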
- For a handful of schools, adjust down to reflect the true Noble rate of success
- Add in a handful of estimates

In [1]:
import pandas as pd
import numpy as np
import os

# Edit these to reflect any changes
work_location = 'inputs'
directory_file = 'hd2018.csv'
base_dir = 'base_dir.csv'
noble_attending = '../../raw_inputs/noble_attending.csv'
gr_output = 'grad_rates.csv'
gr_files = {'latest':'gr2018.csv',
            'one_removed':'gr2017.csv',
            'two_removed':'gr2016.csv'}
output_files = {'latest':'grad2018.csv',
            'one_removed':'grad2017.csv',
            'two_removed':'grad2016.csv'}

In [2]:
os.chdir(work_location)

In [3]:
# We'll use a dict to keep track of each grad rate file, reading in each one
years=['latest','one_removed','two_removed']
gr_dfs = {}
for year in years:
    gr_dfs[year] = pd.read_csv(gr_files[year], index_col=['UNITID'],
                     usecols=['UNITID', 'GRTYPE', 'GRTOTLT','GRBKAAT','GRHISPT'],
                     na_values='.',
                     dtype={'GRTOTLT':float,'GRBKAAT':float,'GRHISPT':float},
                     encoding='latin-1')
    gr_dfs[year].rename(columns={'GRTOTLT':'Total','GRBKAAT':'Black','GRHISPT':'Hisp'}, inplace=True)
    gr_dfs[year]['AA_H']=gr_dfs[year].Black+gr_dfs[year].Hisp
gr_dfs['latest'].head()

UNITID,GRTYPE,Total,Black,Hisp,AA_H
100654,2,756.0,731.0,4.0,735.0
100654,3,203.0,196.0,1.0,197.0
100654,4,318.0,302.0,3.0,305.0
100654,6,757.0,732.0,4.0,736.0
100654,7,1.0,1.0,0.0,1.0


In [4]:
# We now have to sort through these GRTYPES:
# 8 is the adjusted cohort for bachelor's seeking students (completions: 12=6yr, 13=4yr, 14=5yr; transfers=16)
# 29 for associate's seeking (completions: 30=3yr 35=2yr; transfers=33)
# We'll build a list of unitids that have both starting cohorts and completions for either one
valid_unitids = {}
for year in years:
    df = gr_dfs[year]
    valid_unitids[year] = list( (set(df[df['GRTYPE']==8].index) & set(df[df['GRTYPE']==12].index)) |
                                (set(df[df['GRTYPE']==29].index) & set(df[df['GRTYPE']==30].index)) )
print('%d, %d' % (len(gr_dfs['latest']), len(valid_unitids['latest'])))

51868, 3669


In [5]:
# We'll use the basic "hd" directory to form the base of the final year output
def create_year_df(df, source_df1, source_df2):
    """Apply function to pull the appropriate data into a single row per college"""
    ix = df.name
    if ix in source_df1.index:
        return source_df1.loc[ix][['Total','Black','Hisp','AA_H']]
    elif ix in source_df2.index:
        return source_df2.loc[ix][['Total','Black','Hisp','AA_H']]
    else:
        return [np.nan,np.nan,np.nan,np.nan]

year_dfs = {}
for year in years:
    dir_df = pd.read_csv(directory_file, index_col=['UNITID'],
                     usecols=['UNITID','INSTNM'],encoding='latin-1')
    dir_df = dir_df[dir_df.index.isin(valid_unitids[year])]
    
    # First do the completions (GRTYPE 12 for bachelor's, 30 for associate's)
    start1 = gr_dfs[year][gr_dfs[year].GRTYPE == 12]
    start2 = gr_dfs[year][gr_dfs[year].GRTYPE == 30]
    dir_df[['Cl_Total','Cl_Black','Cl_Hisp','Cl_AA_H']]=dir_df.apply(create_year_df,axis=1,result_type="expand",
                                                                    args=(start1,start2))
    # Then do the starting cohorts (GRTYPE 8 for bachelor's, 29 for associate's)
    start1 = gr_dfs[year][gr_dfs[year].GRTYPE == 8]
    start2 = gr_dfs[year][gr_dfs[year].GRTYPE == 29]
    dir_df[['St_Total','St_Black','St_Hisp','St_AA_H']]=dir_df.apply(create_year_df,axis=1,result_type="expand",
                                                                    args=(start1,start2))
    # Next the transfers
    start1 = gr_dfs[year][gr_dfs[year].GRTYPE == 16]
    start2 = gr_dfs[year][gr_dfs[year].GRTYPE == 33]
    dir_df[['Xf_Total','Xf_Black','Xf_Hisp','Xf_AA_H']]=dir_df.apply(create_year_df,axis=1,result_type="expand",
                                                                    args=(start1,start2))
    
    # Finally, calculate within-year stats
    for group in ['Total','Black','Hisp','AA_H']:
        dir_df['GR_'+group]=dir_df['Cl_'+group]/dir_df['St_'+group]
        dir_df['Xfr_'+group]=dir_df['Xf_'+group]/dir_df['St_'+group]
        dir_df['CI_'+group]=np.sqrt(dir_df['GR_'+group]*(1-dir_df['GR_'+group])/dir_df['St_'+group])
    # Dividing by an empty starting cohort can give inf; store NaN instead
    # (replace returns a new DataFrame, so the result must be assigned)
    dir_df = dir_df.replace(np.inf,np.nan)
    
    year_dfs[year]=dir_df.copy()
year_dfs['latest'].head()

UNITID,INSTNM,Cl_Total,Cl_Black,Cl_Hisp,Cl_AA_H,St_Total,St_Black,St_Hisp,St_AA_H,Xf_Total,...,CI_Total,GR_Black,Xfr_Black,CI_Black,GR_Hisp,Xfr_Hisp,CI_Hisp,GR_AA_H,Xfr_AA_H,CI_AA_H
100654,Alabama A & M University,203.0,196.0,1.0,197.0,756.0,731.0,4.0,735.0,318.0,...,0.016119,0.268126,0.413133,0.016384,0.25,0.75,0.216506,0.268027,0.414966,0.016338
100663,University of Alabama at Birmingham,962.0,196.0,31.0,227.0,1652.0,373.0,50.0,423.0,369.0,...,0.012134,0.525469,0.254692,0.025855,0.62,0.2,0.068644,0.536643,0.248227,0.024245
100690,Amridge University,4.0,0.0,0.0,0.0,10.0,3.0,0.0,3.0,6.0,...,0.154919,0.0,1.0,0.0,,,,0.0,1.0,0.0
100706,University of Alabama in Huntsville,319.0,36.0,16.0,52.0,615.0,91.0,28.0,119.0,182.0,...,0.020148,0.395604,0.406593,0.051259,0.571429,0.321429,0.093522,0.436975,0.386555,0.045469
100724,Alabama State University,431.0,399.0,6.0,405.0,1436.0,1352.0,19.0,1371.0,530.0,...,0.012095,0.295118,0.370562,0.012404,0.315789,0.473684,0.106639,0.295405,0.371991,0.012321
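
The `CI_` columns use sqrt(p(1-p)/n), which is the standard error of a proportion. The later step compares these values against fixed thresholds to decide whether one year of data is enough. As a quick check, the sketch below recomputes the Total rate and its `CI_Total` for the first row of the table above (203 completers out of 756 starters). The helper name `grad_rate_and_se` is illustrative and not part of the notebook.

```python
import math

def grad_rate_and_se(completers, starters):
    """Return (rate, standard error) for one cohort; (nan, nan) if the cohort is empty."""
    if starters <= 0:
        return float("nan"), float("nan")
    rate = completers / starters
    se = math.sqrt(rate * (1 - rate) / starters)
    return rate, se

# Alabama A & M University, latest year: Cl_Total=203, St_Total=756
rate, se = grad_rate_and_se(203, 756)
print(round(rate, 6), round(se, 6))
```

The standard error shrinks as the starting cohort grows, which is why small cohorts are the ones that trigger the multi-year fallback.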


In [6]:
# Here, we're just saving the one year files locally for reference
for yr in ['latest', 'one_removed', 'two_removed']:
    year_dfs[yr].to_csv(output_files[yr], na_rep="N/A")

## The above code created three DFs for the most recent three years
## Each DF has the in year counting stats and rates for graduation
### Now we need to create a final set of statistics based on these:
- Adj6yrGrad (overall number after adjustments)
- Adj6yrAAH (African American/Hispanic number after adjustments)
- 6yrGrad (overall number, no adjustments)
- 6yrAAH (AA/H no adjustments)
- 6yrAA
- 6yrH
- Xfer
- XferAAH
- XferAA
- XferH


In [7]:
# We'll start with reading some of the rows from the 'base_dir' created in the last step
dir_df = pd.read_csv(base_dir, index_col=['UNITID'],
                     usecols=['UNITID','INSTNM','Type','HBCU'],encoding='latin-1')
dir_df.head()

UNITID,INSTNM,HBCU,Type
100654,Alabama A & M University,Yes,4 year
100663,University of Alabama at Birmingham,No,4 year
100690,Amridge University,No,4 year
100706,University of Alabama in Huntsville,No,4 year
100724,Alabama State University,Yes,4 year


In [8]:
# NOTE THAT THERE ARE YEAR REFERENCES IN THIS CODE THAT NEED TO BE UPDATED TOO
def bump15(x):
    """Helper function to increase by 15 percentage points or half the distance to 100%, whichever is less"""
    if x > .7:
        return x + (1-x)*.5
    else:
        return x + .15
    
def set_gradrates(df, year_dfs):
    """Apply function to decide how to set the specific values specified above"""
    ix = df.name
    
    # First we see if there is actual data for the latest year
    if ix in year_dfs['latest'].index:
        ty = year_dfs['latest'].loc[ix]
        gr_source = '2018'
        gr_6yr,gr_6yr_aah,gr_6yr_aa,gr_6yr_h,xf,xf_aah,xf_aa,xf_h = ty.reindex(
            ['GR_Total','GR_AA_H','GR_Black','GR_Hisp','Xfr_Total','Xfr_AA_H','Xfr_Black','Xfr_Hisp'])
        
        # If there's data in the latest year, we'll check how robust and add in prior years if necessary
        ci, ci_aah = ty.reindex(['CI_Total','CI_AA_H'])
        # For HBCUs, we bump by the lesser of 15% or half the distance to 100%
        if (df.HBCU == 'Yes') and (ci_aah <= 0.04):
            adj_6yr = gr_6yr
            adj_6yr_aah = bump15(gr_6yr_aah)
        # Otherwise, add more years if the confidence intervals are too wide
        elif (ci >0.015) or (ci_aah >0.05):
            calc_fields = ['Cl_Total','Cl_Black','Cl_Hisp','Cl_AA_H',
                           'St_Total','St_Black','St_Hisp','St_AA_H',
                           'Xf_Total','Xf_Black','Xf_Hisp','Xf_AA_H']
            calc_data = ty.reindex(calc_fields)
            
            if ix in year_dfs['one_removed'].index:
                gr_source = '2017-2018'
                ty=year_dfs['one_removed'].loc[ix]
                calc_data = calc_data+ty.reindex(calc_fields)
                
                if ix in year_dfs['two_removed'].index:
                    gr_source = '2016-2018'
                    ty=year_dfs['two_removed'].loc[ix]
                    calc_data = calc_data+ty.reindex(calc_fields)
                    
                    
            gr_6yr = calc_data['Cl_Total']/calc_data['St_Total'] if calc_data['St_Total']>0 else np.nan
            gr_6yr_aah = calc_data['Cl_AA_H']/calc_data['St_AA_H'] if calc_data['St_AA_H']>0 else np.nan
            gr_6yr_aa = calc_data['Cl_Black']/calc_data['St_Black'] if calc_data['St_Black']>0 else np.nan
            gr_6yr_h = calc_data['Cl_Hisp']/calc_data['St_Hisp'] if calc_data['St_Hisp']>0 else np.nan
            xf = calc_data['Xf_Total']/calc_data['St_Total'] if calc_data['St_Total']>0 else np.nan
            xf_aah = calc_data['Xf_AA_H']/calc_data['St_AA_H'] if calc_data['St_AA_H']>0 else np.nan
            xf_aa = calc_data['Xf_Black']/calc_data['St_Black'] if calc_data['St_Black']>0 else np.nan
            xf_h = calc_data['Xf_Hisp']/calc_data['St_Hisp'] if calc_data['St_Hisp']>0 else np.nan
            adj_6yr = gr_6yr
            adj_6yr_aah = gr_6yr_aah
    
        else:
            adj_6yr = gr_6yr
            adj_6yr_aah = gr_6yr_aah
            
    # If there was no data in the most recent year, we go to the prior (and stick--no need to add prior prior)
    elif ix in year_dfs['one_removed'].index:
        ty = year_dfs['one_removed'].loc[ix]
        gr_source = '2017'
        gr_6yr,gr_6yr_aah,gr_6yr_aa,gr_6yr_h,xf,xf_aah,xf_aa,xf_h = ty.reindex(
            ['GR_Total','GR_AA_H','GR_Black','GR_Hisp','Xfr_Total','Xfr_AA_H','Xfr_Black','Xfr_Hisp'])
        adj_6yr = gr_6yr
        adj_6yr_aah = gr_6yr_aah
    
    # If no data in the last two years, we'll go to prior prior (and stick--no need to check CI)
    elif ix in year_dfs['two_removed'].index:
        ty = year_dfs['two_removed'].loc[ix]
        gr_source = '2016'
        gr_6yr,gr_6yr_aah,gr_6yr_aa,gr_6yr_h,xf,xf_aah,xf_aa,xf_h = ty.reindex(
            ['GR_Total','GR_AA_H','GR_Black','GR_Hisp','Xfr_Total','Xfr_AA_H','Xfr_Black','Xfr_Hisp'])
        adj_6yr = gr_6yr
        adj_6yr_aah = gr_6yr_aah
    
    # No data in any of the last 3 years
    else:
        gr_source,adj_6yr,adj_6yr_aah,gr_6yr,gr_6yr_aah,gr_6yr_aa,gr_6yr_h,xf,xf_aah,xf_aa,xf_h=['N/A']+[np.nan]*10
        
    # 2 year schools are given credit for half of their transfer rate
    if df['Type'] == '2 year':
        adj_6yr = adj_6yr+0.5*xf
        adj_6yr_aah = adj_6yr_aah+0.5*xf_aah
        
    return [gr_source,
            np.round(adj_6yr,decimals=2),np.round(adj_6yr_aah,decimals=2),
            np.round(gr_6yr,decimals=2),np.round(gr_6yr_aah,decimals=2),
            np.round(gr_6yr_aa,decimals=2),np.round(gr_6yr_h,decimals=2),
            np.round(xf,decimals=2),np.round(xf_aah,decimals=2),
            np.round(xf_aa,decimals=2),np.round(xf_h,decimals=2)]

new_columns = ['GR_Source','Adj6yrGrad','Adj6yrAAH','6yrGrad',
               '6yrAAH','6yrAA','6yrH','Xfer','XferAAH','XferAA','XferH']
dir_df[new_columns] = dir_df.apply(set_gradrates,axis=1,args=(year_dfs,),result_type="expand")
dir_df.head()

UNITID,INSTNM,HBCU,Type,GR_Source,Adj6yrGrad,Adj6yrAAH,6yrGrad,6yrAAH,6yrAA,6yrH,Xfer,XferAAH,XferAA,XferH
100654,Alabama A & M University,Yes,4 year,2018,0.27,0.42,0.27,0.27,0.27,0.25,0.42,0.41,0.41,0.75
100663,University of Alabama at Birmingham,No,4 year,2018,0.58,0.54,0.58,0.54,0.53,0.62,0.22,0.25,0.25,0.2
100690,Amridge University,No,4 year,2016-2018,0.27,0.22,0.27,0.22,0.12,1.0,0.73,0.78,0.88,0.0
100706,University of Alabama in Huntsville,No,4 year,2016-2018,0.5,0.37,0.5,0.37,0.35,0.44,0.29,0.39,0.39,0.38
100724,Alabama State University,Yes,4 year,2018,0.3,0.45,0.3,0.3,0.3,0.32,0.37,0.37,0.37,0.47
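
To see where the HBCU values in `Adj6yrAAH` come from, the sketch below restates the `bump15` rule from the cell above and applies it to a few rates. The first input is the AA/H rate for Alabama A & M (197 completers out of 735 starters, from the single-year table). The other inputs are made-up values chosen to show the switch at 0.7.

```python
def bump15(x):
    """Same rule as the notebook helper: +0.15, or half the distance to 1.0 once x exceeds 0.7."""
    if x > .7:
        return x + (1 - x) * .5
    return x + .15

# 197/735 comes from the table above; 0.70, 0.80 and 0.95 are illustrative inputs
for rate in (197 / 735, 0.70, 0.80, 0.95):
    print(round(rate, 2), '->', round(bump15(rate), 3))
```

At exactly 0.7 both branches give 0.85, so the adjusted rate is continuous and can never exceed 100%.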


In [9]:
dir_df.to_csv(gr_output,na_rep='N/A')
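
When `set_gradrates` falls back to multiple years, it adds the raw counts across years and divides once, rather than averaging the yearly rates. A minimal sketch of the difference, using made-up counts for a hypothetical school (`pooled_rate` is an illustrative name, not a notebook function):

```python
def pooled_rate(completers_by_year, starters_by_year):
    """Sum counts across years, then divide once (NaN if no starters at all)."""
    total_starters = sum(starters_by_year)
    if total_starters <= 0:
        return float('nan')
    return sum(completers_by_year) / total_starters

# Hypothetical counts, ordered latest, one_removed, two_removed
completers = [4, 30, 20]
starters = [10, 60, 50]

pooled = pooled_rate(completers, starters)  # 54 / 120
mean_of_rates = sum(c / s for c, s in zip(completers, starters)) / len(starters)
print(round(pooled, 3), round(mean_of_rates, 3))
```

Pooling weights each year by its cohort size, so a tiny latest-year cohort cannot dominate the result.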

# A few more manual steps
## These should eventually be moved to code, but they should be modified in a number of cases (discussed in more detail below):
1. Add a correction for schools where we have a lot of historic results. Historically, this has meant reducing grad rates for schools by 1/3 of the difference between Noble retention and university retention (typically at only 3-4 schools)
2. Increase grad rates for partner colleges (15%)
3. Double check schools known to report oddly: Robert Morris University-Illinois specifically
4. Look for major shifts in grad rate at schools many Noble students attend and consider shifting to a 3-year average

In all of these cases, we will change the grad rates and the "GR_Source" to designate that a non-standard practice was followed

You can see all of this work in the "manual_grad_rates_corrections_2020.xlsx" file in the raw_inputs folder. This file was created by importing columns from the prior year directory and then applying a process against them. Specifically:
1. Start with "grad_rates.csv" (saved above) and insert 6 columns between columns B&C:
   - count: # of students (you can grab this from financial_aid_analysis_output.xlsx from the archive-analysis)
   - Adj6yr2019: from "manual_grad_rates_corrections_2019.xlsx" (these will be in the first few columns)
   - Adj6yrAAH2019: same
   - 2019note: same
   - 2019src: same
   - 2020-2019AAH: calculated from the above and what's in the file

2. Create a column after Type (will be Column K) with "2020 note". This is where you'll disposition each row.
3. Create columns X-AG as copies of columns M-V. This is where formula-modified values will go. First we'll fill in values for the modified entries. Second, we'll fill those columns in for the (vast majority) of rows with no corrections.

Then look at the notes below for specific steps. The main thing to keep in mind is that rows that were "special" in prior years are likely "special" in the current year, so be sure to check those. The vast majority will end up "stet", meaning no manual adjustment.

The sections below describe how to do each of the changes listed above.

_After work is completed in this file, the extra columns are removed and the result is saved as "grad_rates.csv" in the raw_inputs folder._

## For Case #1 in the above, see the "Noble history bump analysis2020.xlsx" file in raw_inputs

This file was taken from the post-NSC process, looking at the "College and GPA" tab of the "snapshot" report. To create it, take the last version of that file and perform the following steps:

1. Save the "College and GPA" tab alone in a new workbook.
2. Remove the columns at the right for all but the two most recent years and the "All" columns for the # starting and # remaining one year later sections. (In this case, we keep 2017 and 2018.)
3. Filter on "All" GPAs
4. Calculate the columns shown if there were 50 Noble students in 2018 OR if there were 200+ in prior years and 50+ in 2017+2018. (These thresholds are arbitrary. Be sure to use the # of starting students for your filter.) Note: only keep the "Adjustment" columns if the result has a magnitude greater than 1% and the school is not a two-year college.
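
Since these steps are meant to move into code eventually, here is a rough sketch of the screen and the 1/3-gap adjustment for a single school. Every name and input value is hypothetical. It also makes three assumptions that the spreadsheet may not share: "50" means "at least 50", "prior years" means the "All" column, and the 1% test is applied before rounding.

```python
def noble_gap_adjustment(noble_retention, university_retention, is_two_year,
                         starters_latest, starters_last_two, starters_all_years):
    """Return the grad rate adjustment as a fraction, or 0.0 if none applies."""
    # Volume screen (assumed reading of the rule in step 4)
    enough_students = starters_latest >= 50 or (
        starters_all_years >= 200 and starters_last_two >= 50)
    if is_two_year or not enough_students:
        return 0.0
    # One third of the gap between Noble retention and university retention
    adjustment = (noble_retention - university_retention) / 3
    # Keep only adjustments with a magnitude greater than 1%
    if abs(adjustment) <= 0.01:
        return 0.0
    return round(adjustment, 2)

# Made-up inputs: Noble students retain at 70% vs. 82% for the university overall
print(noble_gap_adjustment(0.70, 0.82, False, 60, 110, 400))
# A 2-point gap is too small to keep
print(noble_gap_adjustment(0.80, 0.82, False, 60, 110, 400))
```

A negative result means the grad rate is reduced; a positive one corresponds to the "(or +)" case.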


For the schools that have an adjustment, edit their rows in the "manual_grad_rates_corrections_2020.xlsx" file:
1. Change the "GR_Source" to "2018-1/3 Noble gap" (or +) and "2020 note" to "reduce by x%" (or increase)
2. Change the values in X-AC based on the rounded adjustment. AD-AG should just equal the original values.
3. Finally, eyeball how the AA/H value changes compared to the prior year. If there is a 5+% drop, change the note to "stet, big natural drop" and change the source to 2018.

## For Case #2 in the above:
1. Filter on the 2019 note for the word "partner".
2. Mirror that increase in columns X-AG unless the partnership has ended. (Also mirror the language in the note and source.)

## For Case #3 in the above:
1. Filter the 2019 note for anything not "stet" or "N/A".
2. For the ones with no 2020 note yet, look at the details and determine (with college counseling guidance) whether a change should be made. A few more notes:
3. "minimum value: .25" is added for 4-year schools with N/A for grad rate if Noble students attend. Apply this rule to all such schools (even if 2019 note was stet or N/A).
4. "floor CCC at 20%" means the rate for City Colleges should be at least 20%. Apply this rule to all City colleges (regardless of 2019 note).
5. Again for any of these, update the 2020 note, source, and then make the actual changes in X-AG.
6. Note that you might want to refer back to the manual grad rate corrections file from the prior year and look at X-AG there specifically to see the formulas used to apply the changes for any non-standard rows.
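
The two floor rules in this list could be expressed roughly as follows. This is a sketch only: the function name and the `is_city_college` flag are hypothetical, and it assumes a City College with a missing rate is left alone, which the notes above do not spell out.

```python
import math

def apply_floor_rules(adj_rate, school_type, noble_count, is_city_college):
    """Return (rate, note); note is None when neither floor applies."""
    missing = adj_rate is None or math.isnan(adj_rate)
    # 4-year schools with no grad rate get a minimum value if Noble students attend
    if missing and school_type == '4 year' and noble_count > 0:
        return 0.25, 'minimum value: .25'
    # City Colleges are floored at 20% (assumption: only when a rate exists)
    if is_city_college and not missing and adj_rate < 0.20:
        return 0.20, 'floor CCC at 20%'
    return adj_rate, None

# Made-up rows
print(apply_floor_rules(float('nan'), '4 year', 3, False))
print(apply_floor_rules(0.12, '2 year', 40, True))
print(apply_floor_rules(0.55, '4 year', 10, False))
```

The returned note matches the wording used in the note column, so the same string can be written to "2020 note".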

## For Case #4 in the above:
1. Filter for rows with "2020 note" still blank. (The 2019 note for all of these should be "N/A" or "stet".)
2. Filter for rows with "count" >= 5.
3. Filter for rows with declines bigger than 5%. If the school is using 2018 as the source, switch to the 3yr average. (You'll need to do this by manually grabbing the source files.)
4. Filter for rows with increases bigger than 10%. You probably won't change these, but discuss the increases with a college counselor to see if they pass the "smell test".
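
The Case #4 filters amount to a small decision rule per row. The sketch below assumes `aah_change` holds the "2020-2019AAH" value as a fraction (current minus prior) and that the latest source label is '2018'. The function name and the returned strings are illustrative.

```python
def screen_shift(count, aah_change, gr_source, latest_year='2018'):
    """Suggest a disposition for a row whose note is still blank, or None for no action."""
    if count < 5:
        return None  # too few Noble students to review
    if aah_change < -0.05:
        # Only single-year rows can be switched; multi-year rows are already averaged
        return 'switch to 3yr average' if gr_source == latest_year else None
    if aah_change > 0.10:
        return 'discuss increase with counselor'
    return None

# Made-up rows: (count, change in AA/H rate, GR_Source)
for row in [(12, -0.08, '2018'), (12, -0.08, '2016-2018'), (8, 0.12, '2018'), (3, -0.20, '2018')]:
    print(row, '->', screen_shift(*row))
```

Rows that come back as `None` would fall through to "stet" in the final disposition.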

## Final disposition:
1. Change the remaining 2020 note values that were blank to "stet"
2. Populate X-AG for all blank rows with the values in M-V (just use an assignment formula in Excel on the filtered selection).
3. Save this file for reference.
4. Copy and paste values for X-AG.
5. Delete columns B-K. (The file will start with UNITID and GR_Source.)
6. Delete the remaining columns until your next column is the old column X.
7. Save in raw_inputs as grad_rates.csv