# <center>SIADS 591/592</center>
# <center>Milestone I Project Report</center>

#### <b>Project Title:</b>
#### Understanding demographic dynamics of high school aged students in the United States

#### <b>Team members:</b>
- Leonardo Cedeno (lcedeno)
- Yin-Chen Hsu (yinchenh)

#### <b>Motivation:</b>
##### The main motivation for this project is the need of higher-education institutions for a database of granular demographic information to inform their recruitment strategies, in the context of increasing student diversity, supporting student academic success, and meeting financial stewardship goals. We believe that one of the foundations of a strong egalitarian society is access to quality education for students of all races, and in particular the recruitment of minority students who might otherwise not apply to or enroll in college.
##### Our goal is to curate a collection of tables and visualizations that helps college and university recruiters implement a recruitment strategy built around an inclusive diversity agenda.


#### <b>Acquiring Data</b>
##### Because the two main datasets, Census and CRDC, came from different sources and had different variables, we describe how each was acquired separately.

##### <b>Census Data</b>
##### CensusData is a Python library for accessing the official US Census Bureau API. It is available at https://pypi.org/project/CensusData/ and supports pulling data from the American Community Survey (ACS) series. We used the 5-year estimates from 2015 through 2019.

##### The CensusData library supports downloading detailed demographic and geographic data. To align with our research objectives, we chose the following variables by zip code for the entire nation:
- Number of teens ages 15 to 17 by gender and race
- Number of households with incomes between $150k and $200k and above $200k, by zip code, for each year
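
##### The variables above map to ACS variable codes that follow a regular pattern: a table ID (for example B01001, sex by age), an optional race-iteration letter, a three-digit line number, and the suffix E for "estimate". The sketch below assembles such codes; the helper `acs_variable` and the `RACE_SUFFIXES` mapping are hypothetical names introduced only for illustration and are not part of the CensusData library.

```python
def acs_variable(table, line, race_suffix=""):
    """Build an ACS estimate variable code such as 'B01001A_006E'.

    table:       base table ID, e.g. 'B01001' (sex by age)
    line:        line number within the table, e.g. 6 (males 15 to 17)
    race_suffix: optional race-iteration letter, e.g. 'A' (White alone)
    """
    return f"{table}{race_suffix}_{line:03d}E"


# Race-iteration letters that appear in this report's variable list
# (hypothetical helper mapping, for illustration only).
RACE_SUFFIXES = {"White": "A", "Black": "B", "Hispanic": "I"}

# Line 6 of table B01001 and its race iterations is "males 15 to 17".
male_teen_vars = {
    f"{race} Males 15-17": acs_variable("B01001", 6, suffix)
    for race, suffix in RACE_SUFFIXES.items()
}

for label, code in male_teen_vars.items():
    print(label, code)
```

##### Generating the codes this way can make it harder to mistype a single character in a long list of variables, although the line numbers still have to be checked against the ACS table documentation.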

In [None]:
import censusdata
import pandas as pd

def import_censusdata0(year, variables, geo_agg):
    """Download ACS 5-year estimates for every state at the geography given by geo_agg."""
    census_data = censusdata.download(
        'acs5', year, censusdata.censusgeo([('state', '*'), geo_agg]), variables)

    # Readable column names for the ACS variable codes.
    # B19001_017E is the top ACS household-income bracket ($200,000 or more).
    dictionary = {
        'B02001_001E': 'Total Pop Estimate ' + str(year),
        'B19001_001E': 'HHI', 'B19001_016E': 'HHI 150K-200K', 'B19001_017E': 'HHI 200K+',
        'B01001_006E': 'Males 15-17', 'B01001A_006E': 'White Males 15-17',
        'B01001B_006E': 'Black Males 15-17', 'B01001I_006E': 'Hispanic Males 15-17',
        'B01001_030E': 'Females 15-17', 'B01001A_021E': 'White Females 15-17',
        'B01001B_021E': 'Black Females 15-17', 'B01001I_021E': 'Hispanic Females 15-17',
    }
    census_data.rename(columns=dictionary, inplace=True)

    # Each index entry is a censusgeo object: .name is the place label and
    # .params()[0][1] is the state FIPS code.
    census_data['County'] = [geo.name for geo in census_data.index]
    census_data['State'] = [geo.params()[0][1] for geo in census_data.index]
    return census_data

male_vars = ['B02001_001E', 'B01001_006E', 'B01001A_006E', 'B01001B_006E', 'B01001I_006E',
             'B19001_001E', 'B19001_016E', 'B19001_017E']

female_vars = ['B02001_001E', 'B01001_030E', 'B01001A_021E', 'B01001B_021E', 'B01001I_021E']

males_census_2019df = import_censusdata0(2019, male_vars, ('county', '*'))
males_census_2018df = import_censusdata0(2018, male_vars, ('county', '*'))

# Year-over-year percent change in total population, aligned on geography.
growth_trend = pd.concat([males_census_2019df['Total Pop Estimate 2019'], males_census_2018df['Total Pop Estimate 2018']], axis=1)
growth_trend['% delta'] = (growth_trend['Total Pop Estimate 2019'] / growth_trend['Total Pop Estimate 2018'] - 1) * 100
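
##### As a small worked example of the growth calculation, the sketch below computes a year-over-year percent change, (new / old - 1) * 100, on a toy table. The county names and population figures are made up for illustration and are not Census estimates; `percent_change` is a hypothetical helper name.

```python
import pandas as pd

# Made-up population estimates for three hypothetical counties.
estimates = pd.DataFrame(
    {
        "Total Pop Estimate 2018": [1000, 2000, 500],
        "Total Pop Estimate 2019": [1100, 1900, 500],
    },
    index=["County A", "County B", "County C"],
)


def percent_change(new, old):
    """Return the percent change from old to new (positive means growth)."""
    return (new / old - 1) * 100


estimates["% delta"] = percent_change(
    estimates["Total Pop Estimate 2019"], estimates["Total Pop Estimate 2018"]
)
print(estimates)
```

##### A plain ratio of the two columns would instead center on 1.0, so the choice between a ratio and a percent change should match the column label used downstream.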

In [None]:
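# This variant requests a whole variable group (B02001) by ZIP Code Tabulation Area.
import requests


class census_api:
    baseurl = "https://api.census.gov/data/2019/acs/acs5?get=NAME,group(B02001)&for=zip%20code%20tabulation%20area:*&in=state:*"

    def __init__(self):
        # Fetch once at construction and fail loudly on an HTTP error
        # instead of trying to parse an error page as JSON.
        response = requests.get(self.baseurl, timeout=120)
        response.raise_for_status()
        self.resp = response.json()

    def output(self):
        # The API returns a list of rows; the first row holds the column names.
        return self.resp


census_header = census_api().output()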
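
##### The Census API returns its JSON as a list of rows in which the first row holds the column names and every value is a string. A minimal sketch of turning that shape into a DataFrame follows; `census_rows_to_frame` is a hypothetical helper, and the sample rows are placeholders rather than real estimates.

```python
import pandas as pd


def census_rows_to_frame(rows):
    """Convert a Census API response (header row + data rows) to a DataFrame."""
    header, *records = rows
    return pd.DataFrame(records, columns=header)


# Tiny made-up response in the same shape as the API output.
sample = [
    ["NAME", "B02001_001E", "zip code tabulation area"],
    ["ZCTA5 00001", "100", "00001"],
    ["ZCTA5 00002", "250", "00002"],
]

frame = census_rows_to_frame(sample)
# Estimates arrive as strings, so convert them before doing arithmetic.
frame["B02001_001E"] = pd.to_numeric(frame["B02001_001E"])
print(frame)
```

##### The same helper could be applied to `census_header`, assuming the request above succeeded and returned the usual header-first layout.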

##### <b>CRDC Data</b>
##### The federal government releases large amounts of data on U.S. schools every year, but this information is scattered across multiple datasets that are often difficult to access. The Urban Institute’s Education Data Portal (https://educationdata.urban.org/documentation/) offers an application programming interface (API) that lets users access the data through a single portal. All data are returned in JSON format and can be viewed directly in a web browser by requesting a URL endpoint.

##### To make a specific API call, we add modifiers to the URL, including the topic, source, endpoint, year, any additional specifiers, and filters. The URL follows this format:
<blockquote>https://educationdata.urban.org/api/v1/{topic}/{source}/{endpoint}/{year}/[additional_specifiers_or_disaggregators]/[optional filters]</blockquote>

##### To align with our research goal, we requested the following data:
- Directory of high schools (source: CCD)
- Enrollment data of high schools (source: CCD)
- Enrollment data by grade of high schools (source: CRDC)
- SAT and ACT participation record (source: CRDC)
- Retention record by grade of high schools (source: CRDC)
- Finance data of high schools (source: CRDC)
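
##### Each of the requests listed above is a URL assembled from the topic, source, endpoint, year, optional specifiers, and filters. The sketch below shows one way to build such URLs; `build_url` is a hypothetical helper written for illustration, and the `fips` filter (state FIPS code) and path segments are the ones used elsewhere in this report.

```python
from urllib.parse import urlencode

BASE_URL = "https://educationdata.urban.org/api/v1"


def build_url(topic, source, endpoint, year, specifiers=(), filters=None):
    """Assemble an Education Data Portal URL from its path parts and filters."""
    parts = [BASE_URL, topic, source, endpoint, str(year), *specifiers]
    url = "/".join(parts) + "/"
    if filters:
        # Filters are passed as a query string, e.g. ?fips=45
        url += "?" + urlencode(filters)
    return url


# Grade-12 CCD enrollment by race and sex for the state with FIPS code 45.
example = build_url(
    "schools", "ccd", "enrollment", 2015,
    specifiers=("grade-12", "race", "sex"),
    filters={"fips": 45},
)
print(example)
```

##### Keeping the specifiers as a sequence means endpoints with no disaggregation, such as the directory, can simply omit them.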

In [None]:
import json
import time
import urllib.request

import pandas as pd

BASE_URL = "https://educationdata.urban.org/api/v1/schools/"

def fetch_page(url, max_retries=5):
    """Fetch one page of API results as a dict, retrying failed requests."""
    for attempt in range(1, max_retries + 1):
        try:
            # urlopen raises an HTTPError for non-200 responses.
            with urllib.request.urlopen(url) as response:
                return json.loads(response.read())
        except Exception as err:
            print(f"Attempt {attempt} failed: {err}")
            time.sleep(attempt)  # wait a little longer after each failure
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")

def school_data(year, dataset, var, grade, fips):
    """Download every page of one endpoint for one state (FIPS code) and year."""
    prefix = f"{BASE_URL}{dataset}/{var}/{year}/"
    if (dataset, var) in {('ccd', 'directory'), ('crdc', 'directory'), ('crdc', 'school-finance')}:
        url = f"{prefix}?fips={fips}"
    elif (dataset, var) in {('ccd', 'enrollment'), ('crdc', 'retention')}:
        url = f"{prefix}{grade}/race/sex/?fips={fips}"
    elif (dataset, var) in {('crdc', 'enrollment'), ('crdc', 'sat-act-participation')}:
        url = f"{prefix}race/sex/?fips={fips}"
    else:
        raise ValueError(f"Unsupported dataset/endpoint pair: {(dataset, var)}")

    # Follow the 'next' links until the last page of results.
    pages = []
    while url is not None:
        print(url)
        page = fetch_page(url)
        pages.append(pd.json_normalize(page, record_path=['results']))
        url = page['next']
    return pd.concat(pages)
# school_data(2015, 'ccd', 'enrollment', 'grade-12', 45)

In [None]:
def get_ccd_multiple(lst, var, grade, fips):
    df_loop=[]
    for y in lst:
        d = school_data(y, 'ccd', var, grade, fips)
        df_loop.append(d)
    return df_loop
years = [2015, 2016, 2017, 2018, 2019]
# dfs_ccd_directory = get_ccd_multiple(years, 'directory', False, 45)
# dfs_ccd_enrollment = get_ccd_multiple(years, 'enrollment', 'grade-12', 45)

In [None]:
# CRDC data are published biennially (the most recent years are 2013, 2015, and 2017).
def get_crdc_multiple(lst, var, grade, fips):
    df_loop=[]
    for y in lst:
        d = school_data(y, 'crdc', var, grade, fips)
        df_loop.append(d)
    return df_loop
years = [2013, 2015, 2017]
# dfs_crdc_directory = get_crdc_multiple(years, 'directory', False, 45)
# dfs_crdc_enrollment = get_crdc_multiple(years, 'enrollment', 'grade-12', 45)
# dfs_crdc_sat_act_part = get_crdc_multiple(years, 'sat-act-participation', 'grade-12', 45)
# dfs_school_finance = get_crdc_multiple(years, 'school-finance', False, 45)
# dfs_retention = get_crdc_multiple(years, 'retention', 'grade-12', 45)
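
##### The helper functions above return one DataFrame per year. A minimal sketch of stacking such a list into a single table is shown below. The toy frames and their values are made up, and the `requested_year` column is a hypothetical label added for illustration (the API results may already carry their own year field).

```python
import pandas as pd

# Toy stand-ins for the per-year frames returned by the helpers above.
frames = [
    pd.DataFrame({"ncessch": ["A1", "B2"], "enrollment": [100, 80]}),
    pd.DataFrame({"ncessch": ["A1"], "enrollment": [110]}),
]
requested_years = [2015, 2016]


def combine_years(frames, years):
    """Stack per-year frames and record which request each row came from."""
    tagged = [frame.assign(requested_year=year) for frame, year in zip(frames, years)]
    return pd.concat(tagged, ignore_index=True)


combined = combine_years(frames, requested_years)
print(combined)
```

##### Tagging rows before concatenation keeps the provenance of each record visible after the frames are merged, which can help when a later step compares years.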


##### In summary, acquiring the data for this project was a mix of convenience and learning from scratch. The CensusData Python library was very convenient and sped up the process dramatically. The US Department of Education data, on the other hand, was a challenge: it required writing an API wrapper and produced large files (331 megabytes in total across all years).



#### <b>Pre-processing Data</b>
##### Once the data had been downloaded through the APIs described above, we created a PostgreSQL database to reduce our reliance on in-memory calculations in Python. The database was originally developed locally on Leo’s machine and then migrated to an Azure database server. This step was crucial in allowing both team members to keep working on a single consolidated dataset.
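
##### The sketch below illustrates the general idea of moving a DataFrame into a SQL database and letting the database do the aggregation. It is not the project's actual loading code: it uses an in-memory SQLite database from the standard library so that it runs without a server, whereas with PostgreSQL one would typically pass a SQLAlchemy engine to `to_sql` instead. The table name and values are made up.

```python
import sqlite3

import pandas as pd

# Made-up enrollment rows standing in for the downloaded data.
enrollment = pd.DataFrame(
    {"fips": [45, 45], "year": [2015, 2016], "enrollment": [100, 120]}
)

# In-memory SQLite is an assumption made so the example is self-contained.
conn = sqlite3.connect(":memory:")

# Write the frame to a table, replacing it if it already exists.
enrollment.to_sql("enrollment", conn, index=False, if_exists="replace")

# Aggregate inside the database rather than in pandas.
totals = pd.read_sql_query(
    "SELECT fips, SUM(enrollment) AS total FROM enrollment GROUP BY fips", conn
)
conn.close()
print(totals)
```

##### Pushing aggregation into SQL in this way means only the summarized result has to be held in memory on the Python side.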