In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Open up GPA data

In [2]:
gpa_data_raw = pd.read_csv("data/FR_GPA_by_Inst_data_converted.csv")
gpa_data_raw.head()

Unnamed: 0,Calculation1,Campus,City,County,Fall Term,Measure Names,School,Measure Values
0,21ST CENTURY EXPERIMENTAL SCH694223,Santa Cruz,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,
1,21ST CENTURY EXPERIMENTAL SCH694223,Santa Barbara,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,
2,21ST CENTURY EXPERIMENTAL SCH694223,San Diego,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,
3,21ST CENTURY EXPERIMENTAL SCH694223,Los Angeles,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,
4,21ST CENTURY EXPERIMENTAL SCH694223,Irvine,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,


Clean up the GPA data.
Rename many of the columns, replace string NaNs with literal NaN values, unify the "applied", "admitted", and "enrolled" codes.

In [3]:
# "School" repeats the name already embedded in "Calculation1", so drop it.
gpa_data = gpa_data_raw.drop(columns="School")

# Map the raw column names to short, lowercase names.
renaming = {"Calculation1": "school",
            "Campus": "campus",
            "City": "city",
            "County": "county",
            "Fall Term": "year",
            "Measure Names": "type",
            "Uad Uc Ethn 6 Cat": "ethnicity",
            "Measure Values": "gpa"}
gpa_data = gpa_data.rename(columns=renaming)

# Replace placeholder strings with NaN (assign back rather than using inplace=True on a column).
gpa_data['city'] = gpa_data['city'].replace('n/a ', np.nan)  # TODO maybe not use nan since not a number
gpa_data['county'] = gpa_data['county'].replace('Not Applicable', np.nan)

# Shorten the measure codes to "enr", "adm", and "app".
type_renaming = {"Enrl GPA": "enr",
                 "Adm GPA": "adm",
                 "App GPA": "app"}
gpa_data['type'] = gpa_data['type'].replace(type_renaming)
gpa_data.head()

Unnamed: 0,school,campus,city,county,year,type,gpa
0,21ST CENTURY EXPERIMENTAL SCH694223,Santa Cruz,,,2017,enr,
1,21ST CENTURY EXPERIMENTAL SCH694223,Santa Barbara,,,2017,enr,
2,21ST CENTURY EXPERIMENTAL SCH694223,San Diego,,,2017,enr,
3,21ST CENTURY EXPERIMENTAL SCH694223,Los Angeles,,,2017,enr,
4,21ST CENTURY EXPERIMENTAL SCH694223,Irvine,,,2017,enr,
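
As an alternative to replacing the placeholder strings after loading, `pd.read_csv` can convert them to NaN while reading, via its `na_values` parameter. The sketch below is only an illustration: the CSV text and its rows are made up, and only the two placeholder strings (`'n/a '` and `'Not Applicable'`) are taken from the cleaning step above.

```python
import io
import pandas as pd

# Toy CSV with made-up rows; only the placeholder strings mirror the real file.
csv_text = (
    "Campus,City,County,Measure Values\n"
    "Irvine,n/a ,Not Applicable,3.9\n"
    "Davis,Fresno,Fresno,4.1\n"
)

# Per-column placeholders: each listed string becomes NaN only in that column.
toy = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"City": ["n/a "], "County": ["Not Applicable"]},
)
print(toy)
```

Passing a dict keeps the rule scoped to one column, so a school that happened to be named "Not Applicable" in some other column would be left alone.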


The "school" represents a high school; "campus" is which UC campus ("Universitywide" represents the total of all campuses); "city" is which city the high school is in, or NaN if the HS is outside California; "county" is which county the high school is in, or NaN if the HS is outside California; "year" is the year of the fall term in which the students started; "type" is which metric we are measuring: the GPA of applied, admitted, or enrolled students; "gpa" is the actual GPA of this group of students.

## Open up count data

In [4]:
hs_data_raw = pd.read_csv("data/HS_by_Year_data_converted.csv")
print(hs_data_raw.columns)
hs_data_raw.head()

Index(['Calculation1', 'Campus', 'City', 'County/State/ Territory',
       'Fall Term', 'Measure Names', 'Uad Uc Ethn 6 Cat', 'Measure Values'],
      dtype='object')


Unnamed: 0,Calculation1,Campus,City,County/State/ Territory,Fall Term,Measure Names,Uad Uc Ethn 6 Cat,Measure Values
0,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0
1,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0
2,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0
3,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0
4,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,Inter- national,12.0


Clean up this data. Make the column names consistent with the GPA table so we can merge the two.

In [5]:
# Map the raw column names to the same short names used for the GPA table.
renaming = {"Calculation1": "school",
            "Campus": "campus",
            "City": "city",
            "County/State/ Territory": "region",
            "Fall Term": "year",
            "Measure Names": "type",
            "Uad Uc Ethn 6 Cat": "ethnicity",
            "Measure Values": "num"}
hs_data = hs_data_raw.rename(columns=renaming)
hs_data.head()

Unnamed: 0,school,campus,city,region,year,type,ethnicity,num
0,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0
1,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0
2,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0
3,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0
4,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,Inter- national,12.0
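
One quick way to confirm that the two tables now share column names is to list the columns they have in common. This is a minimal sketch on empty toy frames that carry only the renamed column names shown in the outputs above; `toy_gpa` and `toy_hs` are hypothetical stand-ins for `gpa_data` and `hs_data`.

```python
import pandas as pd

# Empty toy frames that carry only the renamed column names.
gpa_cols = ["school", "campus", "city", "county", "year", "type", "gpa"]
hs_cols = ["school", "campus", "city", "region", "year", "type", "ethnicity", "num"]
toy_gpa = pd.DataFrame(columns=gpa_cols)
toy_hs = pd.DataFrame(columns=hs_cols)

# Columns present in both tables are the candidate merge keys.
shared = [c for c in toy_hs.columns if c in toy_gpa.columns]
print(shared)  # ['school', 'campus', 'city', 'year', 'type']
```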


The "school" represents a high school; "campus" is which UC campus ("Universitywide" represents the total of all campuses); "city" is which city the high school is in, or NaN if the HS is outside California; "region" is either the name of a US state or the name of the country that the HS is in; "year" is the year of the fall term in which the students started; "type" is which group we are counting: applied, admitted, or enrolled students; "ethnicity" is the ethnicity of this group of students, or "All" for the union of these groups; "num" is how many students belong to this group.

## Merge the two datasets
When two rows match on every one of the columns ['school', 'campus', 'city', 'year', 'type'], merge them into one row. The GPA data contains two columns that the HS data does not: "county" and "gpa". The HS data contains three columns that the GPA data does not: "region", "ethnicity", and "num". Because `pd.merge` defaults to an inner join, only rows with a match in both tables are kept. With `how="outer"`, unmatched rows would be kept as well, with the other table's columns filled with NaN: for instance, a record that exists in the HS data but not in the GPA data would have NaN in the "county" and "gpa" columns. Since the GPA data has no "ethnicity" column, one GPA row can match several HS rows, one per ethnicity.

In [6]:
merged = pd.merge(hs_data, gpa_data, on=['school', 'campus', 'city', 'year', 'type'])
merged.head()

Unnamed: 0,school,campus,city,region,year,type,ethnicity,num,county,gpa
0,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0,,3.986667
1,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0,,4.020417
2,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0,,3.864324
3,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0,,3.92
4,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,All,5.0,,3.92
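
The join behavior described above can be sketched on toy data. Everything in this example is made up for illustration (schools "A", "B", "C", the counts, and the GPAs). It compares the default inner join with an outer join, and it relies on the documented pandas behavior that NaN values in key columns are matched against each other, which is why rows with a missing "city" still merge.

```python
import numpy as np
import pandas as pd

keys = ["school", "city", "year", "type"]

# Toy count table: school "A" appears once per ethnicity and has no city.
toy_hs = pd.DataFrame({
    "school": ["A", "A", "B"],
    "city": [np.nan, np.nan, "Fresno"],
    "year": [2017, 2017, 2017],
    "type": ["enr", "enr", "enr"],
    "ethnicity": ["All", "Inter- national", "All"],
    "num": [12.0, 5.0, 7.0],
})

# Toy GPA table: no ethnicity column, and school "C" has no count rows.
toy_gpa = pd.DataFrame({
    "school": ["A", "C"],
    "city": [np.nan, "Davis"],
    "year": [2017, 2017],
    "type": ["enr", "enr"],
    "gpa": [3.9, 3.5],
})

# Default how="inner": only keys found in both tables survive.
inner = pd.merge(toy_hs, toy_gpa, on=keys)

# how="outer" keeps unmatched rows; indicator=True records where each row came from.
outer = pd.merge(toy_hs, toy_gpa, on=keys, how="outer", indicator=True)

print(len(inner), len(outer))  # 2 4
print(outer[["school", "num", "gpa", "_merge"]])
```

In the inner result, the single GPA row for school "A" is repeated for both ethnicities; schools "B" and "C" only appear in the outer result, with NaN in "gpa" and "num" respectively.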


## A few other cleaning steps
Some of the rows have NaN in the "gpa" or "num" field and are therefore not useful, so drop them.
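
Dropping rows where either field is missing is not the same as dropping rows where both are missing. A minimal sketch on a made-up toy frame, using the `how` parameter of `DataFrame.dropna`:

```python
import numpy as np
import pandas as pd

# Toy frame covering all four combinations of present/missing values.
toy = pd.DataFrame({
    "num": [12.0, np.nan, 5.0, np.nan],
    "gpa": [3.9, 3.7, np.nan, np.nan],
})

# how="any" (the default): drop a row if either field is NaN.
either_missing_dropped = toy.dropna(subset=["num", "gpa"], how="any")

# how="all": drop a row only if both fields are NaN.
both_missing_dropped = toy.dropna(subset=["num", "gpa"], how="all")

print(len(either_missing_dropped), len(both_missing_dropped))  # 1 3
```

The cell below keeps only rows where both fields are present, which corresponds to the `how="any"` case.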

In [7]:
print("number of rows before dropping NaN rows:", merged.shape[0])
# Keep only rows where both 'num' and 'gpa' are present.
cleaned = merged.dropna(subset=['num', 'gpa'])
print("number of rows after dropping NaN rows:", cleaned.shape[0])
cleaned.head()

number of rows before dropping NaN rows: 146424
number of rows after dropping NaN rows: 69869


Unnamed: 0,school,campus,city,region,year,type,ethnicity,num,county,gpa
0,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0,,3.986667
1,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0,,4.020417
2,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0,,3.864324
3,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0,,3.92
4,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,All,5.0,,3.92


The "region" field is NaN whenever the school is in California, so fill those entries with "California". The "region" field is also fully capitalized when it is not a US state, so add a new boolean column that marks whether the high school is international.

In [8]:
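# Fill missing regions (California schools), then flag all-uppercase regions as international.
merged['region'] = merged['region'].fillna('California')
merged['is_international'] = merged['region'] == merged['region'].str.upper()
# TODO: add a 'school_id' column derived from 'school' (not implemented yet).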
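
The "school" values shown above end in a run of digits (for example `694223`), which looks like a school identifier appended to the name. Assuming that is what a `school_id` column should hold, one way to split the two parts is `Series.str.extract` with named groups. This is a sketch under that assumption, not a confirmed description of the data format; the column names `school_name` and `school_id` are hypothetical.

```python
import pandas as pd

# One real-looking value from the table above.
toy = pd.DataFrame({"school": ["21ST CENTURY EXPERIMENTAL SCH694223"]})

# Lazy name part, then the trailing digit run anchored at the end of the string.
parts = toy["school"].str.extract(r"^(?P<school_name>.*?)(?P<school_id>\d+)$")
toy = toy.join(parts)

print(toy[["school_name", "school_id"]])
```

The extracted `school_id` is a string; rows whose "school" value does not end in digits would get NaN in both new columns.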

In [9]:
merged

Unnamed: 0,school,campus,city,region,year,type,ethnicity,num,county,gpa,is_international
0,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0,,3.986667,True
1,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0,,4.020417,True
2,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0,,3.864324,True
3,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0,,3.920000,True
4,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,All,5.0,,3.920000,True
5,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,Inter- national,12.0,,3.755833,True
6,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,All,12.0,,3.755833,True
7,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,app,Inter- national,18.0,,3.680000,True
8,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,app,All,19.0,,3.680000,True
9,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2015,enr,All,3.0,,,True
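
The fill-and-flag logic for "region" can be checked on a small toy frame. The "Nevada" entry is a hypothetical example of a US state; the sketch assumes, as described above, that state names are not written in all capitals while country names are.

```python
import numpy as np
import pandas as pd

# Toy regions: missing (California school), a US state, and a country.
toy = pd.DataFrame({"region": [np.nan, "Nevada", "CHINA, PEOPLES REPUBLIC"]})

# Missing regions are California schools.
toy["region"] = toy["region"].fillna("California")

# A region equal to its own uppercase form is treated as a country name.
toy["is_international"] = toy["region"] == toy["region"].str.upper()

print(toy)
```

The flag depends entirely on capitalization, so a US state that happened to be stored in all capitals would be misclassified as international.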
