# 1a - Determination of Dataset(s) and Variables
## Introduction
In this part of the project, we aim to:
* Determine which dataset or datasets we will use
* Determine the variables that we will use
* Handle data problems (e.g. missing data)
* Do feature engineering

At the moment, the main datasets of interest are the IPEDS data and the College Scorecard data.

The IPEDS data can be found at [this page](https://nces.ed.gov/ipeds/use-the-data/download-access-database). We are using the final 2019-20 Access database and the accompanying 2019-20 Excel documentation. The data are stored as multiple tables in a Microsoft Access database file; Excel can be used to retrieve each table and save it as a .csv file.

The College Scorecard data are available on [this page](https://data.ed.gov/dataset/college-scorecard-all-data-files/resources).

## Comparison of IPEDS data and College Scorecard data
### Size comparison
IPEDS has 2212 variables (according to the vartable19 sheet of the documentation file). The HD table has records for 6559 colleges, while the ADM table has records for only 2011 colleges.

College Scorecard has 2989 variables and 6694 colleges.
### What colleges are only in one of the datasets?

In [1]:
import numpy as np
import pandas as pd

pd.options.display.max_columns = None

# read the HD table of the ipeds data, which contains basic school information
ipeds_hd = pd.read_csv('data/ipeds/HD.csv', index_col="UNITID")

In [2]:
# read the entire scorecard data
sc = pd.read_csv('data/scorecard/Most-Recent-Cohorts-All-Data-Elements.csv', index_col="UNITID")


First, we check whether the two datasets use the same IDs to identify colleges. To do this, we find colleges where the IDs are the same but the names are different:

In [3]:
# outer-join the institution names from both datasets on UNITID
combined = pd.merge(ipeds_hd.INSTNM, sc.INSTNM, left_index=True, right_index=True, how='outer', suffixes=('_ipeds', '_sc'))
# keep only the IDs that appear in both datasets
combined_dropna = combined.dropna()
# of those, keep the rows where the two names differ
same_id_diff_name = combined_dropna.loc[combined_dropna.INSTNM_ipeds != combined_dropna.INSTNM_sc, :]
print(f'{same_id_diff_name.shape[0]} rows have the same ID value but have different names:\n')
print(same_id_diff_name)

150 rows have the same ID value but have different names:

                                    INSTNM_ipeds  \
UNITID                                             
104151            Arizona State University-Tempe   
104665        School of Architecture at Taliesin   
106360    Arthur's Beauty College Inc-Fort Smith   
106458     Arkansas State University-Main Campus   
106494  Arthur's Beauty College Inc-Jacksonville   
...                                          ...   
491826   Avenue Academy, A Cosmetology Institute   
492209               Reiss-Davis Graduate Center   
493549                 McAllen Careers Institute   
494171                     Arizona College-Tempe   
494588      Pima Medical Institution-San Antonio   

                                        INSTNM_sc  
UNITID                                             
104151  Arizona State University Campus Immersion  
104665                 The School of Architecture  
106360                    Arthur's Beauty College  
1064

When an ID appears in both datasets, it appears to refer to the same college: the naming differences in the rows shown are minor (for example, a campus suffix or a leading "The").
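
The judgement that the name differences are minor can be checked roughly by scoring each pair of names and reviewing the lowest scores by hand. The following is a minimal sketch, not part of the original analysis: it uses `difflib.SequenceMatcher` from the standard library, the helper `name_similarity` is a hypothetical name, and the small `pairs` frame is a stand-in for `same_id_diff_name` built from three rows of the output above.

```python
from difflib import SequenceMatcher

import pandas as pd

# Toy stand-in for same_id_diff_name, using three name pairs from the output above.
pairs = pd.DataFrame(
    {
        "INSTNM_ipeds": [
            "Arizona State University-Tempe",
            "School of Architecture at Taliesin",
            "Arthur's Beauty College Inc-Fort Smith",
        ],
        "INSTNM_sc": [
            "Arizona State University Campus Immersion",
            "The School of Architecture",
            "Arthur's Beauty College",
        ],
    },
    index=pd.Index([104151, 104665, 106360], name="UNITID"),
)


def name_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Score every pair of names row by row.
pairs["similarity"] = [
    name_similarity(a, b) for a, b in zip(pairs.INSTNM_ipeds, pairs.INSTNM_sc)
]

# The lowest-scoring pairs are the ones most worth checking by hand.
print(pairs.sort_values("similarity"))
```

On the real data, `pairs` would be replaced by `same_id_diff_name`. Any cutoff for a "suspicious" score is a judgement call and would need to be chosen by inspecting the sorted rows.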

Next, we check which IDs are only in the IPEDS data:

In [4]:
id_only_in_ipeds = combined.loc[pd.isnull(combined.INSTNM_sc), :]
print(f'{id_only_in_ipeds.shape[0]} college IDs exist in IPEDS which don\'t exist in scorecard:\n')
print(id_only_in_ipeds)

370 college IDs exist in IPEDS which don't exist in scorecard:

                                             INSTNM_ipeds INSTNM_sc
UNITID                                                             
100733                University of Alabama System Office       NaN
103529    University of Alaska System of Higher Education       NaN
103909                            Carrington College-Mesa       NaN
103927                          Carrington College-Tucson       NaN
104504                           Cortiva Institute-Tucson       NaN
...                                                   ...       ...
494861  CUNY Brooklyn College - Feirstein Graduate Sch...       NaN
494870    Rabbinical Seminary of America - Ma'yan HaTorah       NaN
494889                              Baker College - Flint       NaN
494913          Franciscan School of Theology - San Diego       NaN
494922  University of Montana (The) - Bitterroot Colle...       NaN

[370 rows x 2 columns]


Now, we look to see which IDs are only in the Scorecard data:

In [5]:
id_only_in_sc = combined.loc[pd.isnull(combined.INSTNM_ipeds), :]
print(f'{id_only_in_sc.shape[0]} college ids exist in scorecard which don\'t exist in IPEDS:\n')
print(id_only_in_sc)

505 college ids exist in scorecard which don't exist in IPEDS:

         INSTNM_ipeds                                 INSTNM_sc
UNITID                                                         
10236801          NaN        Troy University-Phenix City Campus
10236802          NaN         Troy University-Montgomery Campus
10236803          NaN             Troy University-Dothan Campus
10236808          NaN                    Troy University-Online
10236809          NaN             Troy University-Support Sites
...               ...                                       ...
48511113          NaN        Georgia Military College - Eastman
48616901          NaN  American College of Barbering - Florence
49005401          NaN      HCI College - Fort Lauderdale Campus
49146401          NaN          ABC Adult School - Cabrillo Lane
49175601          NaN           Urban Barber College - San Jose

[505 rows x 2 columns]
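
The three groups (IDs in both datasets, only in IPEDS, only in Scorecard) can also be obtained in a single step. The sketch below is one possible way to do this, using the `indicator=True` option of `pd.merge`; the two small Series are illustrative stand-ins for `ipeds_hd.INSTNM` and `sc.INSTNM`, and the names `ipeds_names`, `sc_names`, `membership` and `counts` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for ipeds_hd.INSTNM and sc.INSTNM (values are illustrative).
ipeds_names = pd.Series(
    ["University of Alabama System Office", "Arizona State University-Tempe"],
    index=pd.Index([100733, 104151], name="UNITID"),
    name="INSTNM",
)
sc_names = pd.Series(
    ["Arizona State University Campus Immersion", "Troy University-Online"],
    index=pd.Index([104151, 10236808], name="UNITID"),
    name="INSTNM",
)

membership = pd.merge(
    ipeds_names,
    sc_names,
    left_index=True,
    right_index=True,
    how="outer",
    suffixes=("_ipeds", "_sc"),
    indicator=True,  # adds a _merge column: left_only / right_only / both
)

# Count how many IDs fall into each group.
counts = membership["_merge"].value_counts()
print(counts)
```

Here `left_only` corresponds to IDs found only in IPEDS and `right_only` to IDs found only in Scorecard, so the counts should match the row counts printed by the cells above when run on the full data.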


### Comparison of missing data in an SAT score column (25th percentile for math SAT score)

In [6]:
# read the ADM table of the ipeds data, which contains admissions information
ipeds_adm = pd.read_csv('data/ipeds/ADM.csv', index_col="UNITID")
# boolean mask: True where IPEDS has no SATMT25 value
ipeds_satmt25_na = pd.isnull(ipeds_adm.SATMT25)
print(f'IPEDS has {sum(~ipeds_satmt25_na)} actual values and {sum(ipeds_satmt25_na)} missing values for the SATMT25 (SAT math 25th percentile) column:')
# boolean mask: True where scorecard has no SATMT25 value
sc_satmt25_na = pd.isnull(sc.SATMT25)
print(f'scorecard has {sum(~sc_satmt25_na)} actual values and {sum(sc_satmt25_na)} missing values for the SATMT25 (SAT math 25th percentile) column:')

IPEDS has 1220 actual values and 791 missing values for the SATMT25 (SAT math 25th percentile) column:
scorecard has 1218 actual values and 5476 missing values for the SATMT25 (SAT math 25th percentile) column:


In [7]:
print(f'ipeds has {sum(~ipeds_satmt25_na & sc_satmt25_na)} values for SATMT25 that scorecard doesn\'t have:')
print(f'scorecard has {sum(ipeds_satmt25_na & ~sc_satmt25_na)} values for SATMT25 that ipeds doesn\'t have:')

ipeds has 1 values for SATMT25 that scorecard doesn't have:
scorecard has 0 values for SATMT25 that ipeds doesn't have:
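
The cell above combines two boolean Series whose indexes differ (the ADM table covers far fewer colleges than Scorecard), so the result depends on how pandas aligns them. A more explicit alternative is to restrict the comparison to the UNITIDs present in both tables first. The sketch below assumes that approach; the toy Series stand in for `ipeds_adm.SATMT25` and `sc.SATMT25`, and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for ipeds_adm.SATMT25 and sc.SATMT25 (values are illustrative).
ipeds_sat = pd.Series([470.0, 500.0, np.nan], index=[1, 2, 3], name="ipeds")
sc_sat = pd.Series([470.0, np.nan, np.nan, 520.0], index=[1, 2, 3, 4], name="sc")

# Inner join keeps only the IDs present in both tables (ID 4 is dropped).
both = pd.concat([ipeds_sat, sc_sat], axis=1, join="inner")

# Values that one source has and the other lacks.
only_ipeds = int((both["ipeds"].notna() & both["sc"].isna()).sum())
only_sc = int((both["ipeds"].isna() & both["sc"].notna()).sum())

# Values present in both sources that disagree.
present_in_both = both["ipeds"].notna() & both["sc"].notna()
disagree = int((present_in_both & (both["ipeds"] != both["sc"])).sum())

print(f"only in ipeds: {only_ipeds}, only in scorecard: {only_sc}, disagreeing: {disagree}")
```

The extra `disagree` count is not something the original cell reports; it is included because two sources can both be non-missing for a college and still hold different values.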


### Comparison of missing data for all shared columns between Scorecard and the HD/ADM IPEDS tables

In [9]:
# columns that appear in Scorecard and in either the HD or the ADM table
ipeds_hd_adm_cols = set(ipeds_hd.columns).union(set(ipeds_adm.columns))
sc_cols = set(sc.columns)
shared_columns = ipeds_hd_adm_cols.intersection(sc_cols)
for col in shared_columns:
    print(f'column: {col}')
    # take the column from HD if it is there, otherwise from ADM
    ipeds_col = ipeds_hd.loc[:, col] if col in ipeds_hd.columns else ipeds_adm.loc[:, col]
    n1 = len(ipeds_col)                   # total IPEDS rows
    n2 = sum(pd.isnull(ipeds_col))        # missing IPEDS values
    n3 = sc.shape[0]                      # total Scorecard rows
    n4 = sum(pd.isnull(sc.loc[:, col]))   # missing Scorecard values
    print(f'IPEDS: {n1-n2} actual values, with {n2} missing')
    print(f'Scorecard: {n3-n4} actual values, with {n4} missing')
    print('-------------------------------------------')

column: CONTROL
IPEDS: 6559 actual values, with 0 missing
Scorecard: 6694 actual values, with 0 missing
-------------------------------------------
column: ACTMT75
IPEDS: 1171 actual values, with 840 missing
Scorecard: 1170 actual values, with 5524 missing
-------------------------------------------
column: SATMT75
IPEDS: 1220 actual values, with 791 missing
Scorecard: 1218 actual values, with 5476 missing
-------------------------------------------
column: SATVR75
IPEDS: 1220 actual values, with 791 missing
Scorecard: 1218 actual values, with 5476 missing
-------------------------------------------
column: ACTCM25
IPEDS: 1254 actual values, with 757 missing
Scorecard: 1252 actual values, with 5442 missing
-------------------------------------------
column: ACTEN25
IPEDS: 1171 actual values, with 840 missing
Scorecard: 1170 actual values, with 5524 missing
-------------------------------------------
column: ICLEVEL
IPEDS: 6559 actual values, with 0 missing
Scorecard: 6694 actual values