# Acquire Data - Collect Education Data from Various Endpoints

**Table of Contents**
1. [Intro](#1.-Intro)
2. [Download Data by Different Sources](#2.-Download_Data_by_Different_Sources)
3. [Download Data in a Range of Years](#3.-Download_Data_in_a_Range_of_Years)

## 1. Intro
The US Department of Education releases numerous datasets with educational information every year, and they can be found at the [Urban Institute’s Education Data Portal](https://educationdata.urban.org/documentation/). We'll write a Python API wrapper to access two of them. The Common Core of Data (CCD) provides school-level information such as location, status (open, closed, etc.), and type of school (grade school, high school, etc.), among others. The Civil Rights Data Collection (CRDC) provides race, sex, and enrollment data.

The URL follows this format: 
> h<span>ttps</span>://educationdata.urban.org/api/v1/{topic}/{source}/{endpoint}/{year}/[additional_specifiers_or_disaggregators]/[optional filters]

For example, this URL requests the 2013 CCD directory:<br>
https://educationdata.urban.org/api/v1/schools/ccd/directory/2013/
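
As a small illustration of the URL format above, the path can be assembled from its parts. This is only a sketch: the helper name `build_path_url` and its parameters are hypothetical and are not part of the API.

```python
def build_path_url(topic, source, endpoint, year, specifiers=()):
    """Join the path parts of a request URL (hypothetical helper for illustration)."""
    base = "https://educationdata.urban.org/api/v1"
    # Specifiers such as "grade-3" or "race" follow the year, in order
    parts = [base, topic, source, endpoint, str(year), *specifiers]
    return "/".join(parts) + "/"


print(build_path_url("schools", "ccd", "directory", 2013))
print(build_path_url("schools", "ccd", "enrollment", 2013, ["grade-3"]))
```

The first call reproduces the example URL above; the second shows where an additional specifier would go.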


In [None]:
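import json
import urllib.request  # import the submodule explicitly; `import urllib` alone does not guarantee it is loaded

import pandas as pd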

## 2. Download Data by Different Sources
Each data source needs a slightly different URL, so the function below handles each case separately. It generally follows these steps:
1. Define the URL pattern that filters for exactly the data we need
2. Send the request and receive the response as JSON
3. Extract the data from the JSON
4. Concatenate the data from multiple pages
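
Steps 3 and 4 can be tried offline before making any request. The sketch below uses two hand-made pages that mirror the `results` and `next` keys used by the code in this section; the field names `school_id` and `enrollment` and all values are invented for illustration.

```python
import pandas as pd

# Two fake pages shaped like an API response: a "results" list of records
# plus a "next" link, which is None on the last page. Values are invented.
pages = [
    {"next": "page-2", "results": [{"school_id": "A", "enrollment": 120},
                                   {"school_id": "B", "enrollment": 95}]},
    {"next": None, "results": [{"school_id": "C", "enrollment": 210}]},
]

# Step 3: turn the records of each page into a dataframe
frames = [pd.json_normalize(page, record_path=["results"]) for page in pages]

# Step 4: stack the pages into one dataframe with a fresh index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Passing `ignore_index=True` is a choice made for this sketch; without it, each page keeps its own row numbers starting from 0.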

Certain variables within the documentation site can be used to subset or filter the data. These variables are found under the “Filter” drop-down menu on the right side of the page. To apply a filter variable to your API call, either select a variable within the drop-down menu or add a query string to the end of the URL. The query string will take the form of:
> ?filter_variable=filter_value

To include multiple filters, separate them with an &. For instance, if you were interested only in the enrollment of charter schools in the District of Columbia, you could make the following call:<br>
https://educationdata.urban.org/api/v1/schools/ccd/enrollment/2013/grade-3/?charter=1&fips=11
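
The query string can also be built from a dictionary with the standard library's `urllib.parse.urlencode`, which joins the filters with `&`. The helper name `add_filters` is hypothetical and only serves this sketch.

```python
from urllib.parse import urlencode


def add_filters(url, filters):
    """Append filters as a query string (hypothetical helper for illustration)."""
    if not filters:
        return url  # nothing to append
    return url + "?" + urlencode(filters)


url = add_filters(
    "https://educationdata.urban.org/api/v1/schools/ccd/enrollment/2013/grade-3/",
    {"charter": 1, "fips": 11},
)
print(url)
```

This reproduces the District of Columbia charter school call shown above.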

In [3]:
def school_data(year, dataset, var, grade):
    # Define the URL pattern that filters for exactly the data we need
    base_url = "https://educationdata.urban.org/api/v1/schools/"
    prefix = base_url + dataset + "/" + var + "/" + str(year) + "/"
    if (dataset, var) == ('ccd', 'directory'):
        url = prefix + "?school_level=3&school_status=1&school_type=1"
    elif (dataset, var) == ('ccd', 'enrollment'):
        url = prefix + "grade-" + str(grade) + "/"
    elif (dataset, var) == ('crdc', 'directory'):
        url = prefix
    elif (dataset, var) == ('crdc', 'enrollment'):
        url = prefix + "race/sex/?race=99&sex=99"
    elif (dataset, var) == ('crdc', 'school-finance'):
        url = prefix
    elif (dataset, var) == ('crdc', 'retention'):
        url = prefix + "grade-" + str(grade) + "/race/sex/?race=99&sex=99"
    elif (dataset, var) == ('crdc', 'sat-act-participation'):
        url = prefix + "race/sex/?race=99&sex=99"
    else:
        # Fail early instead of reaching an undefined `url` below
        raise ValueError("Unsupported dataset/var combination: " + dataset + "/" + var)

    # Request each page, following the 'next' link until there are no more pages
    df_lst = []
    while url is not None:
        print(url)
        # Retry a failed request a few times instead of looping forever
        for attempt in range(5):
            try:
                response = urllib.request.urlopen(url)
                page = json.loads(response.read())
                break
            except Exception as inst:
                print(inst)
        else:
            raise RuntimeError("Request failed after 5 attempts: " + url)
        # Extract the records from the JSON
        df_lst.append(pd.json_normalize(page, record_path=['results']))
        url = page['next']
    # Concatenate the data from multiple pages
    return pd.concat(df_lst)
    
# Uncomment to download the files
# ccd_directory_2015 = school_data(2015, 'ccd', 'directory', 12)
# ccd_enrollment_2015 = school_data(2015, 'ccd', 'enrollment', 12)
# crdc_directory_2015 = school_data(2015, 'crdc', 'directory', 12)
# crdc_enrollment_2015 = school_data(2015, 'crdc', 'enrollment', 12)
# crdc_school_finance_2015 = school_data(2015, 'crdc', 'school-finance', 12)
# crdc_retention_2015 = school_data(2015, 'crdc', 'retention', 12)
# crdc_sat_act_participation_2015 = school_data(2015, 'crdc', 'sat-act-participation', 12)
# print(ccd_directory_2015.shape, ccd_enrollment_2015.shape, crdc_directory_2015.shape, crdc_enrollment_2015.shape, crdc_school_finance_2015.shape, crdc_retention_2015.shape, crdc_sat_act_participation_2015.shape)

## 3. Download Data in a Range of Years
We'll write functions that download data for a range of years. The idea is to call `school_data` once per year and store the results in a list of dataframes.
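
The pattern can be sketched without touching the network by swapping in a stand-in for the download. Here `fake_fetch` and `collect_years` are hypothetical names, and the tiny dataframe is invented; only the loop-and-collect idea carries over.

```python
import pandas as pd


def fake_fetch(year):
    """Stand-in for a real download; returns a tiny invented dataframe."""
    return pd.DataFrame({"year": [year, year], "enrollment": [1, 2]})


def collect_years(fetch, years):
    """Call fetch once per year and return the dataframes in the order of years."""
    return [fetch(y) for y in years]


years = [2013, 2015, 2017]
dfs = collect_years(fake_fetch, years)

# Pairing each dataframe with its year makes later lookups explicit
by_year = dict(zip(years, dfs))
print(by_year[2015])
```

Keeping the list in the same order as `years` matters, because the dataframes themselves may not record which year they came from.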

In [4]:
def get_ccd_multiple(lst, var, grade):
    df_loop=[]
    for y in lst:
        d = school_data(y, 'ccd', var, grade)
        df_loop.append(d)
    return df_loop
years = [2013, 2015, 2017]

# Uncomment to download the files
# dfs_ccd_directory = get_ccd_multiple(years, 'directory', 12)
# dfs_g9_enrollment = get_ccd_multiple(years, 'enrollment', 9)
# dfs_g10_enrollment = get_ccd_multiple(years, 'enrollment', 10)
# dfs_g11_enrollment = get_ccd_multiple(years, 'enrollment', 11)
# dfs_g12_enrollment = get_ccd_multiple(years, 'enrollment', 12)

https://educationdata.urban.org/api/v1/schools/ccd/directory/2013/?school_level=3&school_status=1&school_type=1
https://educationdata.urban.org/api/v1/schools/ccd/directory/2013/?page=2&school_level=3&school_status=1&school_type=1
https://educationdata.urban.org/api/v1/schools/ccd/directory/2013/?page=3&school_level=3&school_status=1&school_type=1
https://educationdata.urban.org/api/v1/schools/ccd/directory/2013/?page=4&school_level=3&school_status=1&school_type=1
https://educationdata.urban.org/api/v1/schools/ccd/directory/2013/?page=5&school_level=3&school_status=1&school_type=1
https://educationdata.urban.org/api/v1/schools/ccd/directory/2013/?page=6&school_level=3&school_status=1&school_type=1
https://educationdata.urban.org/api/v1/schools/ccd/directory/2015/?school_level=3&school_status=1&school_type=1
https://educationdata.urban.org/api/v1/schools/ccd/directory/2015/?page=2&school_level=3&school_status=1&school_type=1
https://educationdata.urban.org/api/v1/schools/ccd/directory/2

In [7]:
def get_crdc_multiple(lst, var, grade):
    df_loop=[]
    for y in lst:
        d = school_data(y, 'crdc', var, grade)
        df_loop.append(d)
    return df_loop
years = [2013, 2015, 2017]

# Uncomment to download the files
# dfs_crdc_sat_act_part = get_crdc_multiple(years, 'sat-act-participation', 12)
# dfs_crdc_school_finance = get_crdc_multiple(years, 'school-finance', 12)
# dfs_crdc_retention = get_crdc_multiple(years, 'retention', 12)

https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?race=99&sex=99
https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?page=2&race=99&sex=99
https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?page=3&race=99&sex=99
https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?page=4&race=99&sex=99
https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?page=5&race=99&sex=99
https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?page=6&race=99&sex=99
https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?page=7&race=99&sex=99
https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?page=8&race=99&sex=99
https://educationdata.urban.org/api/v1/schools/crdc/sat-act-participation/2013/race/sex/?page=9&race=99&sex=99
https://

### Finally, we'll save the data as CSV files

In [8]:
i = 2013
for df in dfs_ccd_directory:
    print(df.shape)
    df.to_csv("./educationdata/ccd_directory_" + str(i) + ".csv")
    i+=2

(16310, 52)
(16532, 52)
(17922, 52)


In [12]:
i = 2013
for df in dfs_crdc_sat_act_part:
    print(df.shape)
    df.to_csv("./educationdata/crdc_sat_act_part_" + str(i) + ".csv")
    i+=2

(95507, 10)
(96360, 10)
(97632, 10)


In [14]:
i = 2013
for df in dfs_crdc_school_finance:
    print(df.shape)
    df.to_csv("./educationdata/crdc_school_finance_" + str(i) + ".csv")
    i+=2

(95507, 15)
(96360, 15)
(97632, 15)


In [15]:
i = 2013
for df in dfs_crdc_retention:
    print(df.shape)
    df.to_csv("./educationdata/crdc_retention_" + str(i) + ".csv")
    i+=2

(95507, 11)
(96360, 11)
(97632, 11)


In [None]:
i = 2013
for df in dfs_g9_enrollment:
    print(df.shape)
    df.to_csv("./educationdata/ccd_g9_enrollment_" + str(i) + ".csv")
    i+=2

In [None]:
i = 2013
for df in dfs_g10_enrollment:
    print(df.shape)
    df.to_csv("./educationdata/ccd_g10_enrollment_" + str(i) + ".csv")
    i+=2

In [None]:
i = 2013
for df in dfs_g11_enrollment:
    print(df.shape)
    df.to_csv("./educationdata/ccd_g11_enrollment_" + str(i) + ".csv")
    i+=2

In [9]:
i = 2013
for df in dfs_g12_enrollment:
    print(df.shape)
    df.to_csv("./educationdata/ccd_g12_enrollment_" + str(i) + ".csv")
    i+=2

(26555, 9)
(26670, 9)
(32139, 9)
