# About

This notebook contains data cleaning routines for loading academic data: year of study (yr_sch), enrollment status (enroll), GPA (gpa_sr or gr_{A,B,C,D,F,none,dk}), degree type (degree_\*), field type (field_\*), impact of mental health on studies (aca_impa), and ability to persist in the studies (persist). Null values and irregular situations (more than 2 degrees at once, more than 3 fields at once) are flagged for removal.

## Import necessary packages

In [None]:
import numpy as np
import pandas as pd

## Load the file 


### Release notes:

Note that 2020-2021 is very different since textual answers were used instead of numerical ones. From 2020 onward, gr_{A,B,C,D,F,none,dk} are used instead of gpa_sr. Some years (2021 and 2022) have lower-case gr_{a,b,c,d,f,none,dk} instead.

In [None]:
year=2024
field_cols=["field_hum","field_nat","field_soc","field_arc","field_art","field_bus","field_den","field_ed","field_eng","field_law","field_med","field_mus","field_nur","field_pharm","field_prep","field_ph","field_pp","field_sw","field_und","field_other"]
degree_cols=["degree_ass","degree_bach","degree_ma","degree_jd","degree_phd","degree_other","degree_nd"]
if year==2021 or year==2022:
    gpa_cols=["gr_a","gr_b","gr_c","gr_d","gr_f","gr_none","gr_dk"]
elif year==2020 or year > 2022:
    gpa_cols=["gr_A","gr_B","gr_C","gr_D","gr_F","gr_none","gr_dk"]
else:
    gpa_cols=["gpa_sr"]
other_cols=["responseid","yr_sch","enroll","aca_impa","persist"]
load_cols = field_cols + degree_cols + gpa_cols + other_cols
df = pd.read_csv("HMS_"+str(year)+"-"+str(year+1)+"_PUBLIC_instchars.csv",usecols=load_cols)
dfNew = df[field_cols + degree_cols + other_cols + gpa_cols].copy()
dfNew["will_remove"] = False
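
The year-dependent choices in the cell above can be collected into two small helpers, which makes them easy to check for a given survey wave. This is only a sketch: the names `gpa_columns_for` and `input_filename` are hypothetical and not part of the notebook, and the rules are copied from the cell above.

```python
def gpa_columns_for(year):
    """Return the GPA column names used in the wave that starts in `year`."""
    letters = ["a", "b", "c", "d", "f"]
    if year in (2021, 2022):
        # lower-case letter suffixes in these two waves
        return ["gr_" + s for s in letters] + ["gr_none", "gr_dk"]
    if year == 2020 or year > 2022:
        # upper-case letter suffixes
        return ["gr_" + s.upper() for s in letters] + ["gr_none", "gr_dk"]
    # before 2020 a single numerical column was used
    return ["gpa_sr"]


def input_filename(year):
    """Build the file name following the HMS_<year>-<year+1> pattern."""
    return f"HMS_{year}-{year + 1}_PUBLIC_instchars.csv"


print(gpa_columns_for(2021))
print(input_filename(2024))
```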

## Some basic checks

In [None]:
df.info()

Ideally, we would set the column types and verify the schema with the pandera package. Next time, maybe.

Check for duplicates (I didn't find any; all files are clean):

In [None]:
dup_rows = df.duplicated(keep = False)
print('Duplicate rows:', dup_rows.sum())

## Year of study (new *year* column)

Usually a numerical value from 1 to 7 as per the codebook, or NaN. In 2020, textual data were used, so we have to treat this year separately.

In [None]:
df["yr_sch"].value_counts()

In [None]:
# 2020 conversion
if year != 2020:
    dfNew["year"] = df["yr_sch"]
else:
    dfNew["year"] = 1*(df["yr_sch"] == '1st year') + 2*(df["yr_sch"] == '2nd year') + 3*(df["yr_sch"] == '3rd year') + 4*(df["yr_sch"] == '4th year') + 5*(df["yr_sch"] == '5th year') + 6*(df["yr_sch"] == '6th year') + 7*(df["yr_sch"] == '7th+ year')
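    # unmatched answers sum to 0: mark only the year column as missing, not the whole row
    dfNew["year"] = dfNew["year"].astype(np.float64).replace(0.0, np.nan)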
    print("Updated value counts for 2020-2021:")
    print(dfNew["year"].value_counts())
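
The 2020 conversion can also be written as an explicit lookup table, which may be easier to audit than a sum of comparisons. A minimal sketch, assuming the seven labels listed in the cell above are the only valid answers; `YEAR_LABELS` and `year_code` are hypothetical names.

```python
import math

# textual 2020 answers -> numerical codes used in the other years
YEAR_LABELS = {
    "1st year": 1.0,
    "2nd year": 2.0,
    "3rd year": 3.0,
    "4th year": 4.0,
    "5th year": 5.0,
    "6th year": 6.0,
    "7th+ year": 7.0,
}


def year_code(label):
    """Return the numerical year, or NaN for a missing or unknown answer."""
    return YEAR_LABELS.get(label, math.nan)


print(year_code("3rd year"))   # 3.0
print(year_code("7th+ year"))  # 7.0
print(year_code(None))         # nan
```

With pandas, the same table could be applied as `df["yr_sch"].map(YEAR_LABELS)`, which leaves NaN wherever the answer is not a key of the dictionary.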

In [None]:
print("NaNs: " + str(np.sum(np.isnan(dfNew["year"]))))

In [None]:
dfNew["year"].unique()

Let's filter out the NaNs. Some of them are non-degree students:

In [None]:
df["degree_nd"].value_counts()

In [None]:
if year == 2020:
    dfNew.loc[df["degree_nd"] == "Non-degree student","degree_nd"] = 1.0
    dfNew.loc[df["degree_nd"] != "Non-degree student","degree_nd"] = np.nan
else:
    dfNew["degree_nd"] = df["degree_nd"]
dfNew["degree_nd"] = dfNew["degree_nd"].astype(np.float64)

In [None]:
dfNew["degree_nd"].unique()

In [None]:
dfNew["degree_nd"].value_counts(),np.sum(np.isnan(dfNew["degree_nd"]))

In [None]:
np.sum(np.isnan(dfNew["year"]) & ~np.isnan(dfNew["degree_nd"]))

So, let's remove degree students who did not specify their year of study:

In [None]:
print("Flagging %d entries for removal" % np.sum(np.isnan(dfNew["year"]) & np.isnan(dfNew["degree_nd"])))
dfNew["will_remove"] |= ((np.isnan(dfNew["year"])) & (np.isnan(dfNew["degree_nd"])))
print("Total to be removed: %d\n" % np.sum(dfNew["will_remove"]))

And check how many are flagged for removal, so far:

In [None]:
dfNew["will_remove"].value_counts()

## Enrollment (enroll)

Non-degree students are not enrolled. Other NaNs are to be filtered out.

In [None]:
df["enroll"].value_counts()

In [None]:
if year == 2020:
    dfNew.loc[df["enroll"] == "Full-time student","enroll"] = 1.0
    dfNew.loc[df["enroll"] == "Part-time student","enroll"] = 2.0
    dfNew.loc[df["enroll"] == "Other (please specify)","enroll"] = 3.0
    dfNew.loc[(dfNew["enroll"] != 1.0) & (dfNew["enroll"] != 2.0) & (dfNew["enroll"] != 3.0),"enroll"] = np.nan
else:
    dfNew["enroll"] = df["enroll"]
dfNew["enroll"] = dfNew["enroll"].astype(np.float64)

In [None]:
dfNew["enroll"].unique()

In [None]:
print("NaNs: " + str(np.sum(np.isnan(dfNew["enroll"]))))

Non-degree students are not enrolled:

In [None]:
np.sum(np.isnan(dfNew["enroll"]) & np.isnan(dfNew["degree_nd"]))

So, filter the others out:

In [None]:
print("Flagging %d entries for removal" % np.sum((np.isnan(dfNew["enroll"]) & ~np.isnan(dfNew["degree_nd"]))))
dfNew["will_remove"] |= (np.isnan(dfNew["enroll"]) & ~np.isnan(dfNew["degree_nd"]))
print("Total to be removed: %d\n" % np.sum(dfNew["will_remove"]))

## GPA, gpa_sr

Same routine here, except that since 2020 a different convention is used. I convert the old convention to the new one, since the latter is less specific. The final scheme is as follows:
1. Mostly A's; 2. Mostly B's; 3. Mostly C's; 4. Mostly D's; 5. Mostly F's; 6. None of these; 7. No grade or don't know

The gr_\* columns use the format 1.0 (True) and NaN (False).

In [None]:
if year>=2020:
    # an expression inside an if block is not displayed by the notebook, so print it
    print(df[gpa_cols[0]].value_counts(),df[gpa_cols[0]].unique())
else:
    dfNew["gpa_sr"] = df["gpa_sr"]
    print(df["gpa_sr"].value_counts(),df["gpa_sr"].unique())

If year is 2020, convert textual information into 1 (selected) or NaN (not selected), as in other years.

In [None]:
if year == 2020:
    for gr in gpa_cols:
        gr_name=list(df[gr].value_counts().to_dict().keys())[0]
        # then make it 1 in dfNew, and NaN otherwise
        dfNew.loc[df[gr] == gr_name,gr] = 1.0
        dfNew.loc[dfNew[gr] != 1.0,gr] = np.nan
        dfNew[gr] = dfNew[gr].astype(np.float64)

In [None]:
if year >= 2020:
    for gr in gpa_cols:
        print(dfNew[gr].value_counts(),np.sum(np.isnan(dfNew[gr])))
else:
    print(dfNew["gpa_sr"].value_counts(),np.sum(np.isnan(dfNew["gpa_sr"])))

In [None]:
if year>=2020:
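    # gpa_sr accumulates the 1-based codes of the selected boxes, gpa_check counts them;
    # gpa_sr starts as a float so that NaN can be stored in it later
    dfNew["gpa_sr"] = 0.0
    dfNew["gpa_check"] = 0
    for j, gr in enumerate(gpa_cols, start=1):
        selected = ~np.isnan(dfNew[gr])
        dfNew["gpa_sr"] += j*selected
        dfNew["gpa_check"] += 1*selected
    # if more than two boxes are selected, or none at all, mark the GPA as missing
    dfNew.loc[dfNew["gpa_check"]>2,"gpa_sr"] = np.nan
    dfNew.loc[dfNew["gpa_check"]==0,"gpa_sr"] = np.nan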
else:
    dfNew.loc[(df["gpa_sr"] > -1) & (df["gpa_sr"] < 2.5),"gpa_sr"] = 1
    dfNew.loc[(df["gpa_sr"] > 2.5) & (df["gpa_sr"] < 5.5),"gpa_sr"] = 2
    dfNew.loc[(df["gpa_sr"] > 5.5) & (df["gpa_sr"] < 8.5),"gpa_sr"] = 3
    dfNew.loc[(df["gpa_sr"] > 8.5) & (df["gpa_sr"] < 9.5),"gpa_sr"] = 4
    dfNew.loc[(df["gpa_sr"] > 9.5) & (df["gpa_sr"] < 10.5),"gpa_sr"] = 7
    dfNew["gpa_check"] = 1
    dfNew.loc[np.isnan(df["gpa_sr"]),"gpa_sr"] = np.nan
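
For the years before 2020, the interval rules in the cell above amount to a small lookup from the old numerical code to the new scheme. The sketch below reproduces only those thresholds; the function name `old_to_new_gpa` is hypothetical, and the meaning of the individual old codes is not restated here because it comes from the codebook.

```python
import math

# (lower bound, upper bound, new code); both bounds are exclusive,
# exactly as in the comparisons used in the notebook cell
OLD_TO_NEW = [
    (-1, 2.5, 1.0),
    (2.5, 5.5, 2.0),
    (5.5, 8.5, 3.0),
    (8.5, 9.5, 4.0),
    (9.5, 10.5, 7.0),
]


def old_to_new_gpa(code):
    """Map an old gpa_sr value to the new scheme; NaN stays NaN."""
    if code is None or math.isnan(code):
        return math.nan
    for low, high, new in OLD_TO_NEW:
        if low < code < high:
            return new
    # values outside every interval are left unchanged, as in the cell above
    return float(code)


print([old_to_new_gpa(c) for c in (1, 3, 8, 9, 10)])  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Note that new codes 5 and 6 are never produced by this mapping.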

In [None]:
dfNew["gpa_sr"].value_counts(),np.sum(np.isnan(dfNew["gpa_sr"]))

In [None]:
dfNew["gpa_check"].value_counts(),np.sum(np.isnan(dfNew["gpa_check"]))

I think I know what's going on: some people marked both A's and C's, or both B's and D's. What do we do about it? I'm thinking we can compute an average of the two.

In [None]:
dfNew.loc[dfNew["gpa_check"]==2,gpa_cols]

In [None]:
if year>=2020:
    two_boxes = dfNew["gpa_check"]==2
    # reset the GPA of those selecting two checkboxes, then find the average
    dfNew.loc[two_boxes,"gpa_sr"] = 0
    # only the five letter grades enter the average;
    # "none of these" and "no grade" contribute nothing
    for j, gr in enumerate(gpa_cols[:5], start=1):
        dfNew.loc[two_boxes,"gpa_sr"] += j*(~np.isnan(dfNew[gr]))*0.5
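
The rule applied to the checkbox columns can be summarised per respondent: no box or more than two boxes gives NaN, one box gives its code, and two boxes give half the sum of the selected letter-grade codes. Below is a plain-Python sketch of that rule for a single row; `combine_gpa` is a hypothetical helper, and the flags are assumed to be ordered as A, B, C, D, F, none, don't know.

```python
import math


def combine_gpa(selected):
    """selected: seven truthy/falsy flags ordered A, B, C, D, F, none, don't know."""
    count = sum(1 for s in selected if s)
    if count == 0 or count > 2:
        return math.nan
    if count == 1:
        # single box: return its 1-based position
        return float(next(j for j, s in enumerate(selected, start=1) if s))
    # two boxes: half the sum of the selected letter-grade codes (A..F only)
    return 0.5 * sum(j for j, s in enumerate(selected[:5], start=1) if s)


print(combine_gpa([1, 0, 1, 0, 0, 0, 0]))  # A's and C's -> 2.0
print(combine_gpa([0, 1, 0, 1, 0, 0, 0]))  # B's and D's -> 3.0
print(combine_gpa([1, 0, 0, 0, 0, 1, 0]))  # A's and "none of these" -> 0.5
```

The last line shows a side effect of the rule: a letter grade combined with "none of these" or "no grade" yields half the letter code, so such rows may deserve a separate look if they occur in the data.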

In [None]:
#dfNew.loc[(dfNew["gpa_check"]<=2) & (dfNew["gpa_check"]>0),"gpa_sr"].value_counts()
dfNew["gpa_sr"].value_counts(),np.sum(np.isnan(dfNew["gpa_sr"]))

**Finally**, filter out NaNs.

In [None]:
print("Flagging %d entries for removal" % np.sum(np.isnan(dfNew["gpa_sr"])))
dfNew["will_remove"] |= np.isnan(dfNew["gpa_sr"])
print("Total to be removed: %d\n" % np.sum(dfNew["will_remove"]))

## Academic impact, aca_impa

For 2020, a textual format was used. Since removing NaNs gets rid of too many items, I replace them with the value corresponding to no academic impact on their studies.

In [None]:
if year != 2020:
    print(df["aca_impa"].value_counts(),df["aca_impa"].unique(),np.sum(np.isnan(df["aca_impa"])))
else:
    print(df["aca_impa"].value_counts(),df["aca_impa"].unique())

In [None]:
if year == 2020:
    dfNew.loc[df["aca_impa"] == "1-2 days","aca_impa"] = 2.0
    dfNew.loc[df["aca_impa"] == "3-5 days","aca_impa"] = 3.0
    dfNew.loc[df["aca_impa"] == "6 or more days","aca_impa"] = 4.0
    dfNew.loc[(dfNew["aca_impa"] != 2.0) & (dfNew["aca_impa"] != 3.0) & (dfNew["aca_impa"] != 4.0),"aca_impa"] = 1.0
else:
    dfNew["aca_impa"] = df["aca_impa"]
dfNew["aca_impa"] = dfNew["aca_impa"].astype(np.float64)
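
Unlike the other columns, the 2020 conversion of aca_impa falls back to a default instead of NaN. A minimal sketch of that rule as a lookup with a default; `ACA_IMPA_LABELS`, `NO_IMPACT` and `aca_impa_code` are hypothetical names, and the three labels are the ones used in the cell above.

```python
# textual 2020 answers that indicate some academic impact
ACA_IMPA_LABELS = {
    "1-2 days": 2.0,
    "3-5 days": 3.0,
    "6 or more days": 4.0,
}
# every other answer, including a missing one, is treated as "no impact"
NO_IMPACT = 1.0


def aca_impa_code(label):
    """Return the numerical aca_impa code for a 2020 textual answer."""
    return ACA_IMPA_LABELS.get(label, NO_IMPACT)


print(aca_impa_code("3-5 days"))  # 3.0
print(aca_impa_code(None))        # 1.0
```

In pandas the same idea could be expressed as `df["aca_impa"].map(ACA_IMPA_LABELS).fillna(NO_IMPACT)`. Keep in mind that this imputation makes missing answers indistinguishable from genuine "no impact" answers.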

In [None]:
print("Flagging %d entries for removal" % np.sum(np.isnan(dfNew["aca_impa"])))
dfNew["will_remove"] |= np.isnan(dfNew["aca_impa"])
print("Total to be removed: %d\n" % np.sum(dfNew["will_remove"]))

## Persistence (persist)

Values run from 1 to 6 (Strongly agree to Strongly disagree).

In [None]:
print(df["persist"].value_counts(),df["persist"].unique(),np.sum(np.isnan(df["persist"])))

Fortunately, in 2020 everything is numerical. Filter out entries that are NaN or aren't between 1 and 6.

In [None]:
dfNew["persist"] = df["persist"]
if year == 2020:
    dfNew.loc[(df["persist"] < 1) | (df["persist"] > 6),"persist"] = np.nan

In [None]:
print("Flagging %d entries for removal" % np.sum(np.isnan(dfNew["persist"])))
dfNew["will_remove"] |= np.isnan(dfNew["persist"])
print("Total to be removed: %d\n" % np.sum(dfNew["will_remove"]))

## Degree (new column *degree*)

Note: 2020 degree_nd has been processed already.

In [None]:
if year != 2020:
    print(df["degree_other"].value_counts(),df["degree_other"].unique(),np.sum(np.isnan(df["degree_other"])))
else:
    print(df["degree_other"].value_counts(),df["degree_other"].unique())

In [None]:
# textual labels used in the 2020 file for each selected degree box
degree_labels = {
    "degree_ass": "Associate's",
    "degree_bach": "Bachelor's",
    "degree_ma": "Master's",
    "degree_jd": "jd",
    "degree_phd": "PhD (or equivalent doctoral program)",
    "degree_other": "Other (please specify)",
}
for col, label in degree_labels.items():
    if year == 2020:
        # 1.0 where the box was selected, NaN otherwise
        dfNew.loc[df[col] == label, col] = 1.0
        dfNew.loc[dfNew[col] != 1.0, col] = np.nan
    else:
        dfNew[col] = df[col]
    dfNew[col] = dfNew[col].astype(np.float64)

Create a new column "degree" and perform a few sanity checks on it:

In [None]:
# degree_cols is ordered ass, bach, ma, jd, phd, other, nd, so the weights run from 1 to 7
degree_selected = dfNew[degree_cols].notna()
dfNew["degree"] = sum(j*degree_selected[col] for j, col in enumerate(degree_cols, start=1))
# number of degree boxes selected by each respondent
dfNew["degreeCheck"] = degree_selected.sum(axis=1)

In [None]:
dfNew["degreeCheck"].value_counts()

Filter out those doing zero degrees or more than two:

In [None]:
print("Flagging %d entries for removal" % (np.sum((dfNew["degreeCheck"]>2) | (dfNew["degreeCheck"] < 1))))
dfNew["will_remove"] |= (dfNew["degreeCheck"]>2) | (dfNew["degreeCheck"] < 1)
print("Total to be removed: %d\n" % np.sum(dfNew["will_remove"]))

## Field (new column *field*)

First off, we have to sort out the 2020 mess. Since we have 20 different columns, I will have to automate this.

In [None]:
if year != 2020:
    print(df["field_hum"].value_counts(),df["field_hum"].unique(),np.sum(np.isnan(df["field_hum"])))
else:
    print(df["field_hum"].value_counts(),df["field_hum"].unique())

In [None]:
list(df["field_hum"].value_counts().to_dict().keys())[0]

In [None]:
for field in field_cols:
    if year == 2020:
        # select most populous entry
        field_name=list(df[field].value_counts().to_dict().keys())[0]
        # then make it 1 in dfNew, and NaN otherwise
        dfNew.loc[df[field] == field_name,field] = 1.0
        dfNew.loc[dfNew[field] != 1.0,field] = np.nan
    else:
        dfNew[field] = df[field]
    dfNew[field] = dfNew[field].astype(np.float64)

In [None]:
dfNew["field_nat"].value_counts(),dfNew["field_nat"].unique(),np.sum(np.isnan(dfNew["field_nat"]))

Now we are finally ready to create a field column. This is a preliminary version; one-hot encoding will be required later.

In [None]:
# field_cols is ordered hum, nat, soc, ..., other, so the weights run from 1 to 20
dfNew["field"] = sum(j*(~np.isnan(dfNew[field])) for j, field in enumerate(field_cols, start=1))

In [None]:
dfNew["field"].value_counts()
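
One reason the summed field code is only preliminary is that different selections can collide on the same number. The sketch below illustrates this with the column order used in this notebook; `field_code` and `one_hot` are hypothetical helpers written in plain Python for a single respondent.

```python
FIELD_COLS = ["field_hum", "field_nat", "field_soc", "field_arc", "field_art",
              "field_bus", "field_den", "field_ed", "field_eng", "field_law",
              "field_med", "field_mus", "field_nur", "field_pharm", "field_prep",
              "field_ph", "field_pp", "field_sw", "field_und", "field_other"]


def field_code(selected):
    """Sum of the 1-based positions of the selected fields (as in the notebook)."""
    return sum(j for j, name in enumerate(FIELD_COLS, start=1) if name in selected)


def one_hot(selected):
    """One 0/1 entry per field; unambiguous for any combination."""
    return tuple(int(name in selected) for name in FIELD_COLS)


double_major = {"field_hum", "field_nat"}
single_major = {"field_soc"}
print(field_code(double_major), field_code(single_major))  # 3 3
print(one_hot(double_major) == one_hot(single_major))      # False
```

A respondent in humanities and natural sciences gets the same code as one in social sciences, which is why the value is only reliable for rows where a single field is selected.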

Some are bogus, of course. Let's filter out those doing more than 3 fields:

In [None]:
# count how many field boxes each respondent selected
dfNew["fieldCheck"]=0
for field in field_cols:
    dfNew["fieldCheck"] += 1*(~np.isnan(dfNew[field]))

In [None]:
dfNew["fieldCheck"].value_counts()

In [None]:
np.sum(dfNew["fieldCheck"] > 3)

In [None]:
print("Flagging %d entries for removal" % (np.sum(dfNew["fieldCheck"]>3)))
dfNew["will_remove"] |= dfNew["fieldCheck"]>3
print("Total to be removed: %d\n" % np.sum(dfNew["will_remove"]))

## Saving data into a file

In [None]:
print("Warning: Total data to be removed: %.2f percent \n" % (np.sum(dfNew["will_remove"])/len(dfNew)*100))

In [None]:
dfNew.columns

In [None]:
output_cols=['responseid', 'will_remove', 'year', 'enroll', 'gpa_sr', 'aca_impa', 'persist', 'degree', 'field'] + degree_cols + field_cols

In [None]:
dfNew.to_csv(str(year) + '-' + str(year+1) + '_Alexandr.csv',columns=output_cols,index=False)