# <font color=teal>University Admission Modelling: Initial cleaning & EDA</font>

**BrainStation Bootcamp (12 September to 1 December 2023)**

*Capstone Project,     
Deadline: 30 November 2023   
Author: Reema Sipra*

---

##  <font color=teal>Table of Contents</font>

1. **[Introduction](#Introduction)**
2. **[Data](#Data)**
3. **[Part 1: Basic Analysis](#Part1)**

---

##  <font color=teal>Introduction</font><a id = "Introduction"></a>

One of the biggest decision points for students and their parents is picking a suitable university and career. People spend considerable resources to pick and get into the right university. 

*What are the best predictors for students getting picked by top universities?*

**Objectives:**
- Support decision making of high school students & parents.
- Inform universities on improvements in their student intake & diversity.

[Back to the top](#Table-of-Contents)

## <font color=teal>Data</font><a id = "Data"></a>

**Data Source:** 
- US Department of Education, College Scorecard (https://collegescorecard.ed.gov/data/)

**College Scorecard:** An online tool created by the US government that lets users compare the cost and value of higher education institutions. It includes data on Title IV universities (institutions that receive federal funding to make tuition affordable).

Initiated by President Obama so that users would "be able to see how much each school's graduates earn, how much debt they graduate with, and what percentage of a school's students can pay back their loans."

**Three main components:** 
 - Institution-level data (6,543 universities, 15 categories, 3,232 columns) over 26 years.
 - Fields of study at the universities (233,979 study areas, 160 columns) over 6 years.
 - Crosswalks: a map of the differing institutional data definitions between universities and the federal government (25,004 records, 21 columns) over 20 years.



[Back to the top](#Table-of-Contents)

## <font color=teal>Part 1: Basic Analysis</font><a id = "Part1"></a>

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import norm
import yaml
import re

In [2]:
# Read in one institution csv file to get an idea of the data for the year 2021-2022. 
institution_df = pd.read_csv("data/MERGED2021_22_PP.csv", low_memory = False)

# A look at the top 5 records in the dataframe.
institution_df.head()

| | UNITID | OPEID | OPEID6 | INSTNM | CITY | STABBR | ZIP | ACCREDAGENCY | INSTURL | NPCURL | ... | OMAWDP8_NOPELL_FIRSTTIME | OMENRUP_NOPELL_FIRSTTIME | OMENRYP_NOPELL_NOTFIRSTTIME | OMENRAP_NOPELL_NOTFIRSTTIME | OMAWDP8_NOPELL_NOTFIRSTTIME | OMENRUP_NOPELL_NOTFIRSTTIME | OMACHT8_NOPELL_ALL | OMACHT8_NOPELL_FIRSTTIME | OMACHT8_NOPELL_NOTFIRSTTIME | ADDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100654 | 100200.0 | 1002.0 | Alabama A & M University | Normal | AL | 35762 | Southern Association of Colleges and Schools C... | www.aamu.edu/ | www.aamu.edu/admissions-aid/tuition-fees/net-p... | ... | 0.3187 | 0.2709 | 0.0128 | 0.2949 | 0.4744 | 0.2179 | 329.0 | 251.0 | 78.0 | 4900 Meridian Street |
| 1 | 100663 | 105200.0 | 1052.0 | University of Alabama at Birmingham | Birmingham | AL | 35294-0110 | Southern Association of Colleges and Schools C... | https://www.uab.edu/ | https://tcc.ruffalonl.com/University of Alabam... | ... | 0.6937 | 0.066 | 0.0111 | 0.2636 | 0.5136 | 0.2117 | 2358.0 | 1182.0 | 1176.0 | Administration Bldg Suite 1070 |
| 2 | 100690 | 2503400.0 | 25034.0 | Amridge University | Montgomery | AL | 36117-3553 | Southern Association of Colleges and Schools C... | https://www.amridgeuniversity.edu/ | https://www2.amridgeuniversity.edu:9091/ | ... | 0.0 | 0.5 | 0.0 | 0.3333 | 0.4583 | 0.2083 | 26.0 | 2.0 | 24.0 | 1200 Taylor Rd |
| 3 | 100706 | 105500.0 | 1055.0 | University of Alabama in Huntsville | Huntsville | AL | 35899 | Southern Association of Colleges and Schools C... | www.uah.edu/ | finaid.uah.edu/ | ... | 0.6471 | 0.0941 | 0.0082 | 0.2647 | 0.5948 | 0.1324 | 1122.0 | 510.0 | 612.0 | 301 Sparkman Dr |
| 4 | 100724 | 100500.0 | 1005.0 | Alabama State University | Montgomery | AL | 36104-0271 | Southern Association of Colleges and Schools C... | www.alasu.edu/ | www.alasu.edu/cost-aid/tuition-costs/net-price... | ... | 0.4381 | 0.2167 | 0.0 | 0.1444 | 0.3667 | 0.4889 | 510.0 | 420.0 | 90.0 | 915 S Jackson Street |


In [3]:
# Determine the shape of the csv file.
institution_df.shape

(6543, 3232)

A single .csv datafile has 6,543 rows and 3,232 columns.

In [4]:
# Get the basic info on the dataframe.
institution_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6543 entries, 0 to 6542
Columns: 3232 entries, UNITID to ADDR
dtypes: float64(2693), int64(13), object(526)
memory usage: 161.3+ MB


In [5]:
institution_df.columns.tolist()

['UNITID',
 'OPEID',
 'OPEID6',
 'INSTNM',
 'CITY',
 'STABBR',
 'ZIP',
 'ACCREDAGENCY',
 'INSTURL',
 'NPCURL',
 'SCH_DEG',
 'HCM2',
 'MAIN',
 'NUMBRANCH',
 'PREDDEG',
 'HIGHDEG',
 'CONTROL',
 'ST_FIPS',
 'REGION',
 'LOCALE',
 'LOCALE2',
 'LATITUDE',
 'LONGITUDE',
 'CCBASIC',
 'CCUGPROF',
 'CCSIZSET',
 'HBCU',
 'PBI',
 'ANNHI',
 'TRIBAL',
 'AANAPII',
 'HSI',
 'NANTI',
 'MENONLY',
 'WOMENONLY',
 'RELAFFIL',
 'ADM_RATE',
 'ADM_RATE_ALL',
 'SATVR25',
 'SATVR75',
 'SATMT25',
 'SATMT75',
 'SATWR25',
 'SATWR75',
 'SATVRMID',
 'SATMTMID',
 'SATWRMID',
 'ACTCM25',
 'ACTCM75',
 'ACTEN25',
 'ACTEN75',
 'ACTMT25',
 'ACTMT75',
 'ACTWR25',
 'ACTWR75',
 'ACTCMMID',
 'ACTENMID',
 'ACTMTMID',
 'ACTWRMID',
 'SAT_AVG',
 'SAT_AVG_ALL',
 'PCIP01',
 'PCIP03',
 'PCIP04',
 'PCIP05',
 'PCIP09',
 'PCIP10',
 'PCIP11',
 'PCIP12',
 'PCIP13',
 'PCIP14',
 'PCIP15',
 'PCIP16',
 'PCIP19',
 'PCIP22',
 'PCIP23',
 'PCIP24',
 'PCIP25',
 'PCIP26',
 'PCIP27',
 'PCIP29',
 'PCIP30',
 'PCIP31',
 'PCIP38',
 'PCIP39',
 'PCIP40',

The columns are difficult to comprehend without the data dictionary. The dataset came with a data dictionary in the `data.yaml` file included in the data folder. 

The file contains nested dictionaries. The information relevant to the institutional data columns is stored in the dictionary titled `dictionary`. Each entry has the following general form:

*< main_category_name.column_name >:*
- *source: < the source of the data >*
- *type: < data type>*
- *description: < text description of the columns >*
- *index: < varchar length for text fields >* 
    
The `.yaml` file can be used to extract relevant information to understand the columns and the data they contain. 
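
As a minimal sketch of how this structure can be used, the snippet below builds a lookup from the CSV column name (the `source` field) to its description. The `sample_dictionary` literal is an illustrative stand-in that copies three entries from `data.yaml`, and `build_source_lookup` is a hypothetical helper, not part of the dataset tooling.

```python
# Illustrative stand-in for data_dictionary['dictionary'] (three entries copied from data.yaml).
sample_dictionary = {
    'id': {'source': 'UNITID', 'type': 'integer',
           'description': 'Unit ID for institution'},
    'school.name': {'source': 'INSTNM', 'type': 'autocomplete',
                    'description': 'Institution name', 'index': 'fulltext'},
    'school.city': {'source': 'CITY', 'type': 'autocomplete',
                    'description': 'City', 'index': 'varchar(200)'},
}


def build_source_lookup(dictionary):
    """Map each CSV column name (the 'source' field) to its description."""
    lookup = {}
    for entry in dictionary.values():
        # Skip entries that are not mappings or that lack a 'source' field.
        if isinstance(entry, dict) and 'source' in entry:
            lookup[entry['source']] = entry.get('description', '')
    return lookup


source_lookup = build_source_lookup(sample_dictionary)
print(source_lookup['INSTNM'])
```

With the full dictionary loaded, the same lookup could be used to decode cryptic column names in `institution_df`, assuming every entry of interest carries a `source` field.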

In [8]:
# Read the data dictionary from the .yaml file into a python dictionary:

with open('data/data.yaml', 'r') as file:
    data_dictionary = yaml.safe_load(file)

In [9]:
# view the dictionary content
data_dictionary['dictionary']

{'id': {'source': 'UNITID',
  'type': 'integer',
  'description': 'Unit ID for institution'},
 'ope8_id': {'source': 'OPEID',
  'description': '8-digit OPE ID for institution',
  'index': 'varchar(10)'},
 'ope6_id': {'source': 'OPEID6',
  'description': '6-digit OPE ID for institution',
  'index': 'varchar(10)'},
 'school.name': {'source': 'INSTNM',
  'type': 'autocomplete',
  'description': 'Institution name',
  'index': 'fulltext'},
 'school.city': {'source': 'CITY',
  'type': 'autocomplete',
  'description': 'City',
  'index': 'varchar(200)'},
 'school.state': {'source': 'STABBR',
  'description': 'State postcode',
  'index': 'varchar(50)'},
 'school.zip': {'source': 'ZIP',
  'description': 'ZIP code',
  'index': 'varchar(20)'},
 'school.accreditor': {'source': 'ACCREDAGENCY',
  'description': 'Accreditor for institution'},
 'school.school_url': {'source': 'INSTURL',
  'description': "URL for institution's homepage",
  'index': 'varchar(200)'},
 'school.price_calculator_url': {'sour

The first step is to extract the names of the main categories, which match the accompanying technical documentation provided with the dataset. This is done by looping through the `dictionary` keys and extracting the tag component before the period.

In [18]:
# Create a list of the main categories:
categories = []
for name in data_dictionary['dictionary'].keys():
    category = name.split('.')
    categories.append(category[0])
main_categories = set(categories)
main_categories = sorted(main_categories)
main_categories      

['academics',
 'admissions',
 'aid',
 'completion',
 'cost',
 'earnings',
 'fed_sch_cd',
 'id',
 'location',
 'ope6_id',
 'ope8_id',
 'programs',
 'repayment',
 'school',
 'student']
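
To get a feel for how the columns are distributed across these categories, one option is to count the keys per category. The sketch below uses `collections.Counter` on a small illustrative subset of keys copied from the data dictionary; `sample_keys` and `category_counts` are names introduced here for illustration only.

```python
from collections import Counter

# Illustrative subset of keys copied from the data dictionary output.
sample_keys = ['id', 'ope8_id', 'school.name', 'school.city',
               'school.degrees_awarded.highest',
               'academics.program_percentage.agriculture']

# The category is the text before the first period (or the whole key if there is none).
category_counts = Counter(key.split('.', 1)[0] for key in sample_keys)

for category, count in sorted(category_counts.items()):
    print(f'{category}: {count}')
```

Replacing `sample_keys` with `data_dictionary['dictionary'].keys()` would give the counts for the full dictionary.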

Now we can use the sorted `main_categories` list to loop through the dictionary and group the columns by category using a regular expression:

#### Steps for the loop:
1. For each main category, build a regex that matches its keys.
2. For each matching key:
    - slice off the component after the period to get the column name.
    - append the category to the main category list.
    - append the column name to the column name list.
    - append the description to the column description list.
3. Combine the lists into a dataframe.
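
One detail worth checking in the regex step: an unescaped period matches any character, so a pattern such as `id.*` also matches keys that merely start with `id`. The sketch below shows the difference, assuming a hypothetical helper `category_pattern` and a made-up key `identifier.code` that is not in the dataset.

```python
import re


def category_pattern(category_name):
    """Compile a pattern matching the category alone or any key nested under it."""
    # re.escape guards against special characters; (\.|$) requires a literal
    # period or the end of the key right after the category name.
    return re.compile(re.escape(category_name) + r'(\.|$)')


# 'identifier.code' is a made-up key used only to show the false positive.
keys = ['id', 'identifier.code', 'school.name', 'school.city']

loose_matches = [k for k in keys if re.match('id.*', k)]
strict_matches = [k for k in keys if category_pattern('id').match(k)]
school_matches = [k for k in keys if category_pattern('school').match(k)]

print(loose_matches)
print(strict_matches)
print(school_matches)
```

Whether the looser pattern causes problems depends on the actual keys in `data.yaml`; the stricter pattern simply removes the risk.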

In [14]:
# Test the outer loop using the school category
# (the period is escaped so that it matches a literal '.'):
expression = r'school\.'
category1 = []
for k in data_dictionary['dictionary'].keys():
    if re.match(expression, k):
        category1.append(k)
category1

['school.name',
 'school.city',
 'school.state',
 'school.zip',
 'school.accreditor',
 'school.school_url',
 'school.price_calculator_url',
 'school.degrees_awarded.predominant_recoded',
 'school.under_investigation',
 'school.main_campus',
 'school.branches',
 'school.degrees_awarded.predominant',
 'school.degrees_awarded.highest',
 'school.ownership',
 'school.state_fips',
 'school.region_id',
 'school.locale',
 'school.degree_urbanization',
 'school.carnegie_basic',
 'school.carnegie_undergrad',
 'school.carnegie_size_setting',
 'school.minority_serving.historically_black',
 'school.minority_serving.predominantly_black',
 'school.minority_serving.annh',
 'school.minority_serving.tribal',
 'school.minority_serving.aanipi',
 'school.minority_serving.hispanic',
 'school.minority_serving.nant',
 'school.men_only',
 'school.women_only',
 'school.religious_affiliation',
 'school.online_only',
 'school.operating',
 'school.tuition_revenue_per_fte',
 'school.instructional_expenditure_per_ft

In [24]:
# Test for the inner loop
category1 = []
column_names = []
column_desc = []

for category_name in main_categories:
    # Match the category itself (e.g. 'id') or any key nested under it (e.g. 'school.name').
    expression = re.escape(category_name) + r'(\.|$)'
    for k in data_dictionary['dictionary'].keys():
        if re.match(expression, k):
            columns = k.split('.', 1)
            # Keys without a period (e.g. 'id') have no sub-name, so fall back to the key itself.
            column_names.append(columns[1] if len(columns) > 1 else k)
            category1.append(k)

print(category1)

['academics.program_percentage.agriculture', 'academics.program_percentage.resources', 'academics.program_percentage.architecture', 'academics.program_percentage.ethnic_cultural_gender', 'academics.program_percentage.communication', 'academics.program_percentage.communications_technology', 'academics.program_percentage.computer', 'academics.program_percentage.personal_culinary', 'academics.program_percentage.education', 'academics.program_percentage.engineering', 'academics.program_percentage.engineering_technology', 'academics.program_percentage.language', 'academics.program_percentage.family_consumer_science', 'academics.program_percentage.legal', 'academics.program_percentage.english', 'academics.program_percentage.humanities', 'academics.program_percentage.library', 'academics.program_percentage.biological', 'academics.program_percentage.mathematics', 'academics.program_percentage.military', 'academics.program_percentage.multidiscipline', 'academics.program_percentage.parks_recreat

In [25]:
# Full loop: build a lookup table of category, column name and description.

# Initialization:
main_category_name = []
column_names = []
column_desc = []

# Outer loop:
for category_name in main_categories:
    # Match the category itself (e.g. 'id') or any key nested under it (e.g. 'school.name').
    expression = re.escape(category_name) + r'(\.|$)'
    # Inner loop to extract the column name and description corresponding to the main category:
    for index in data_dictionary['dictionary'].keys():
        if re.match(expression, index):
            # Keys have a varying number of dot-separated parts, so a fixed position such as
            # columns[2] raises IndexError; split once and keep everything after the first period.
            columns = index.split('.', 1)
            column_names.append(columns[1] if len(columns) > 1 else index)
            # Obtain the column description (None if the entry has no description).
            column_desc.append(data_dictionary['dictionary'][index].get('description'))
            main_category_name.append(category_name)

# Create a dataframe using the lists:
column_data = pd.DataFrame()
column_data['category'] = main_category_name
column_data['column_names'] = column_names
column_data['column_description'] = column_desc

column_data.head()
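
Keys in the data dictionary contain a varying number of periods (`id`, `school.name`, `school.degrees_awarded.highest`), so indexing a fixed position after `split('.')` can raise `IndexError`. A minimal sketch that sidesteps this with `str.partition`, and needs no regex, is shown below. `sample_dictionary` is an illustrative stand-in for `data_dictionary['dictionary']`, and `build_column_rows` is a hypothetical helper introduced here.

```python
# Illustrative stand-in for data_dictionary['dictionary']; descriptions copied
# from data.yaml where shown, fields omitted for the last entry.
sample_dictionary = {
    'id': {'source': 'UNITID', 'description': 'Unit ID for institution'},
    'school.name': {'source': 'INSTNM', 'description': 'Institution name'},
    'school.zip': {'source': 'ZIP', 'description': 'ZIP code'},
    'school.degrees_awarded.highest': {},  # fields omitted in this sketch
}


def build_column_rows(dictionary):
    """Return one (category, column_name, description) tuple per dictionary key."""
    rows = []
    for key, entry in dictionary.items():
        # partition splits on the first period only and never raises:
        # 'id' -> ('id', '', ''), 'school.name' -> ('school', '.', 'name').
        category, _, column_name = key.partition('.')
        # Keys without a period reuse the key as the column name;
        # a missing description becomes None.
        rows.append((category, column_name or key, entry.get('description')))
    return rows


rows = build_column_rows(sample_dictionary)
for row in rows:
    print(row)
```

The resulting list of tuples can be passed to `pd.DataFrame(rows, columns=['category', 'column_names', 'column_description'])` to obtain the same table layout as the loop above.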