# Create 2017-2018 School Datasets
### This program uses all flattened raw datasets to create the school dataset files within the NCEA repository.

1. This notebook reads raw dataset .csv files directly from the Raw Datasets folder for the school year being processed (for example, \EducationDataNC\2018\Raw Datasets).
2. Each raw dataset is transformed to contain only one record per public school campus or unique agency_code.
3. Many raw datasets have more than one record per campus, per year. In these instances, table pivots create new columns from row-level entries and reduce each dataset to one record per school. This adds many new columns to the flattened dataset (see the code below for more details).
4. School datasets merge all flattened files into one dataset with one record per agency_code.
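
The pivot described in step 3 can be sketched as follows. This is a minimal illustration, not the notebook's actual flattening code: the `subgroup` and `pct` columns, the agency codes, and the `_demo` suffix are hypothetical and stand in for whatever row-level fields a given raw file contains.

```python
import pandas as pd

# Hypothetical raw table: several rows per school, one row per subgroup.
raw = pd.DataFrame({
    "agency_code": ["100302", "100302", "100304", "100304"],
    "subgroup": ["ALL", "EDS", "ALL", "EDS"],
    "pct": [0.91, 0.85, 0.78, 0.70],
})

# Pivot the subgroup values into columns so each agency_code appears once.
flat = raw.pivot_table(index="agency_code", columns="subgroup",
                       values="pct", aggfunc="first")

# Give the new columns descriptive names, then restore agency_code as a column.
flat.columns = ["pct_" + str(c) + "_demo" for c in flat.columns]
flat = flat.reset_index()

print(flat)
```

Each distinct value in the pivoted column becomes its own column, which is why the flattened datasets grow wide as more raw files are processed.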

In [20]:
#import required Libraries
import pandas as pd
import numpy as np
import os
import string

#**********************************************************************************
# Set the following variables before running this code!!!
#**********************************************************************************

#'C:/Users/Jake/Documents/GitHub/EducationDataNC/2018/
dirPath = 'D:/BenepactLLC/Belk/NC_Report_Card_Data/2019/April 2019/2018/'

#Location where copies of the raw data files will be read in from csv files.
# 'C:/Users/Jake/Documents/GitHub/EducationDataNC/2018/Raw Datasets/'
dataDir = dirPath + 'Raw Datasets/'

#Location where the new school datasets will be created.
# 'C:/Users/Jake/Documents/GitHub/EducationDataNC/2018/School Datasets/'
outputDir = dirPath + 'School Datasets/'

#All raw data files are processed for the year below
schoolYear = 2018

# Read in the Raw Data Files
### This section reads raw data files directly from the Raw Datasets folder.

* The file input location is specified at the dataDir parameter.
* The file output location is specified at the outputDir parameter.
* The schoolYear parameter is used to specify the correct school year to process.
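
Because every later cell depends on these locations, it can help to check them before any processing starts. The sketch below is an optional addition, not part of the original notebook: `prepare_dirs` is a hypothetical helper, and the demonstration runs against a temporary folder rather than a real data directory.

```python
import os
import tempfile

def prepare_dirs(dir_path):
    """Return (data_dir, output_dir); require data_dir, create output_dir."""
    # A trailing "" makes os.path.join end the path with a separator,
    # so file names can be appended by simple string concatenation.
    data_dir = os.path.join(dir_path, "Raw Datasets", "")
    output_dir = os.path.join(dir_path, "School Datasets", "")
    if not os.path.isdir(data_dir):
        raise FileNotFoundError("Raw data folder not found: " + data_dir)
    os.makedirs(output_dir, exist_ok=True)
    return data_dir, output_dir

# Demonstration against a throwaway folder (assumption: not a real data path).
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "Raw Datasets"))
dataDir, outputDir = prepare_dirs(base)
print(os.path.isdir(outputDir))
```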

# A List of All Files Processed

In [21]:
#Use wildcards to find files in a directory
import glob
#Use ntpath.basename to get a filename from a filepath
import ntpath

#Get and display a list of all .csv file names for 2018 download
rcdFiles = glob.glob(dataDir + 'rcd*.csv')

rcdFileNames = [ntpath.basename(x)[:-4] for x in rcdFiles]

print('A List of File Names and Record Counts for Processing:\n')

#Create dataframes for each file
for fileName in rcdFileNames:
    #create one dataframe for each .csv file in rcdFileNames
    exec(fileName + ' = pd.read_csv("' + dataDir + '" + "' + fileName + '" + ".csv", low_memory=False, dtype={"agency_code": object})')  
    print(fileName + ', ' + str(len(eval(fileName).index)) )
    

A List of File Names and Record Counts for Processing:

rcd_161, 644
rcd_acc_aapart, 53
rcd_acc_act, 742
rcd_acc_awa, 688
rcd_acc_cgr, 738
rcd_acc_eds, 2761
rcd_acc_elp, 1809
rcd_acc_essa_desig, 2645
rcd_acc_gp, 658
rcd_acc_irm, 1276
rcd_acc_lowperf, 2760
rcd_acc_ltg, 2461
rcd_acc_ltg_detail, 2631
rcd_acc_mcr, 717
rcd_acc_part, 2527
rcd_acc_part_detail, 2527
rcd_acc_pc, 2697
rcd_acc_rta, 1576
rcd_acc_spg1, 2584
rcd_acc_spg2, 2538
rcd_acc_wk, 517
rcd_adm, 3197
rcd_ap, 563
rcd_arts, 2509
rcd_att, 3115
rcd_charter, 241
rcd_chronic_absent, 2719
rcd_college, 690
rcd_courses1, 773
rcd_courses2, 636
rcd_cte_concentrators, 492
rcd_cte_credentials, 436
rcd_cte_endorsement, 537
rcd_cte_enrollment, 1184
rcd_dlmi, 2723
rcd_effectiveness, 2724
rcd_esea_att, 2700
rcd_experience, 2758
rcd_funds, 292
rcd_hqt, 2697
rcd_ib, 51
rcd_improvement, 19
rcd_inc1, 3097
rcd_inc2, 2792
rcd_licenses, 3121
rcd_location, 2759
rcd_naep, 2
rcd_nbpts, 3124
rcd_pk_enroll, 988
rcd_prin_demo, 116
rcd_readiness, 1323
rcd_s
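
The cell above creates one global variable per file through `exec` and reads it back with `eval`. A possible alternative, sketched here under the assumption that later cells would look frames up by name (for example `frames[fileName]` instead of `eval(fileName)`), is to keep the dataframes in a dictionary. `load_rcd_files` and the demo file `rcd_demo.csv` are hypothetical names used only for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

def load_rcd_files(data_dir):
    """Read every rcd*.csv file in data_dir into a dict keyed by file name."""
    frames = {}
    for path in sorted(glob.glob(os.path.join(data_dir, "rcd*.csv"))):
        name = os.path.splitext(os.path.basename(path))[0]
        # Reading agency_code as object keeps it as text (no numeric coercion).
        frames[name] = pd.read_csv(path, low_memory=False,
                                   dtype={"agency_code": object})
    return frames

# Demonstration with a small throwaway file (assumption: not real report card data).
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"agency_code": ["010303", "010304"], "value": [1, 2]}).to_csv(
    os.path.join(demo_dir, "rcd_demo.csv"), index=False)

frames = load_rcd_files(demo_dir)
for name, df in frames.items():
    print(name + ", " + str(len(df.index)))
```

Reading `agency_code` as `object`, as the original cell also does, preserves leading zeros that a numeric dtype would drop.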

# Merge all datasets to one master dataset with one record per school
* **Starting with the location table, we left outer join on agency_code, merging data from each reshaped table into one master record.**
* **The report below ensures that merges by location result in one unique record per public school campus.**
* **This report also shows changes to the final dataset's column and row counts as each flattened raw dataset is merged into the final Public School Datasets.**
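
The merge pattern described above can be sketched on toy data. The frames and values below are hypothetical. The `validate="one_to_one"` argument is an addition that the notebook itself does not use: it makes pandas raise an error if either side has more than one row per `agency_code`, instead of relying on the printed row counts.

```python
import pandas as pd

# Hypothetical location table: one row per school.
location = pd.DataFrame({
    "agency_code": ["A1", "A2", "A3"],
    "name": ["School X", "School Y", "School Z"],
    "category_code": ["E", "M", "H"],
})

# Hypothetical flattened table that repeats category_code and lacks school A2.
flat = pd.DataFrame({
    "agency_code": ["A1", "A3"],
    "category_code": ["E", "H"],
    "pct_demo": [0.5, 0.7],
})

# Left join keeps every location row; overlapping columns from the
# right-hand table receive the "_Drop" suffix.
merged = location.merge(flat, how="left", on="agency_code",
                        suffixes=("", "_Drop"), validate="one_to_one")

# Remove the duplicated columns that came from the right-hand table.
merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_Drop")])

print(merged)
```

Schools missing from a flattened table keep their row and receive missing values in the new columns, so the row count stays fixed while the column count grows.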

In [23]:
#Make a copy of a variable (by value) using copy() or deepcopy()
import copy 

#Remove state and district level location records before performing campus level merges
rcd_location = rcd_location[(rcd_location['agency_code'] != 'NC-SEA') & 
                            (rcd_location['agency_code'].str.contains("LEA") == False)]

#Do not merge file: rcd_acc_pc
mergeFileNames = copy.deepcopy(rcdFileNames)
mergeFileNames.remove('rcd_acc_pc')

print('*********************************Start: RCD Location Data*********************************')
rcd_location.info(verbose=False)

for fileName in mergeFileNames:
    rcd_location = rcd_location.merge(eval(fileName),how='left',on='agency_code', suffixes=('', '_Drop'))
    print('*********************************After: ' + fileName + '**************************')
    rcd_location.info(verbose=False)
    

#Rename final merged rcd file! 
PublicSchools = rcd_location

#Delete all of the duplicate / overlapping columns 
#i.e. When two tables have columns with identical names, the column from the table inside the merge() is deleted.
dropCols = [x for x in PublicSchools.columns if x.endswith('_Drop')]
PublicSchools = PublicSchools.drop(dropCols, axis=1)

#Delete any masking columns that were missed. 
dropCols = [x for x in PublicSchools.columns if x.endswith('_masking')]
PublicSchools = PublicSchools.drop(dropCols, axis=1)

print('*********************************After: Deleting Duplicated Columns*********')
PublicSchools.info(verbose=False)

*********************************Start: RCD Location Data*********************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2757
Columns: 21 entries, agency_code to stem
dtypes: float64(3), object(18)
memory usage: 454.3+ KB
*********************************After: rcd_161**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Columns: 56 entries, agency_code to ccc_pct_NOINFO_WH7_161
dtypes: float64(38), object(18)
memory usage: 1.1+ MB
*********************************After: rcd_acc_aapart**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Columns: 76 entries, agency_code to pct_SC_HS_AAPART
dtypes: float64(58), object(18)
memory usage: 1.6+ MB
*********************************After: rcd_acc_act**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Columns: 188 entries, agency_code to pct_ACWR_WH7_ACT
dtypes: fl

*********************************After: rcd_cte_endorsement**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Columns: 1334 entries, agency_code to college_only
dtypes: float64(1294), object(40)
memory usage: 26.9+ MB
*********************************After: rcd_cte_enrollment**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Columns: 1337 entries, agency_code to cte_enrollment_pct
dtypes: float64(1297), object(40)
memory usage: 27.0+ MB
*********************************After: rcd_dlmi**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Columns: 1350 entries, agency_code to category_code_Drop
dtypes: float64(1302), object(48)
memory usage: 27.2+ MB
*********************************After: rcd_effectiveness**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Columns: 1416 entries, agency_code to 

In [24]:
#Save the master file to disk
PublicSchools.to_csv(outputDir + 'PublicSchools' + str(schoolYear) + '.csv', sep=',', index=False)

print('*********************************All Public Schools****************************')
PublicSchools.info(verbose=False)

#Filter regular public high schools
HighSchools = PublicSchools[((PublicSchools.category_code == 'H') | 
                             (PublicSchools.category_code == 'T') | 
                             (PublicSchools.category_code == 'A')) 
                            ]

#Save the file to disk
HighSchools.to_csv(outputDir + 'PublicHighSchools' + str(schoolYear) + '.csv', sep=',', index=False)

print('*********************************Regular Public High Schools*******************')
HighSchools.info(verbose=False)

#Filter regular public middle schools
MiddleSchools = PublicSchools[((PublicSchools.category_code == 'M') | 
                               (PublicSchools.category_code == 'T') | 
                               (PublicSchools.category_code == 'A') |
                               (PublicSchools.category_code == 'I'))
                             ]

#Save the file to disk
MiddleSchools.to_csv(outputDir + 'PublicMiddleSchools' + str(schoolYear) + '.csv', sep=',', index=False)

print('*********************************Regular Public Middle Schools******************')
MiddleSchools.info(verbose=False)


#Filter regular public elementary schools
ElementarySchools = PublicSchools[((PublicSchools.category_code == 'E') | 
                                   (PublicSchools.category_code == 'I') | 
                                   (PublicSchools.category_code == 'A')) 
                                 ]

#Save the file to disk
ElementarySchools.to_csv(outputDir + 'PublicElementarySchools' + str(schoolYear) + '.csv', sep=',', index=False)

print('*********************************Regular Public Elementary Schools**************')
ElementarySchools.info(verbose=False)

*********************************All Public Schools****************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Columns: 1790 entries, agency_code to welcome_url
dtypes: float64(1741), object(49)
memory usage: 36.1+ MB
*********************************Regular Public High Schools*******************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 679 entries, 0 to 2640
Columns: 1790 entries, agency_code to welcome_url
dtypes: float64(1741), object(49)
memory usage: 9.3+ MB
*********************************Regular Public Middle Schools******************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 806 entries, 0 to 2640
Columns: 1790 entries, agency_code to welcome_url
dtypes: float64(1741), object(49)
memory usage: 11.0+ MB
*********************************Regular Public Elementary Schools**************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1575 entries, 0 to 2641
Columns: 1790 entries, agency_code to welcome_url
dtypes: float6
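
The three subsets overlap: their row counts above (679 + 806 + 1575 = 3060) exceed the 2643 schools in the master file, because some `category_code` values are included in more than one filter. The sketch below reproduces the same filters with `isin` on hypothetical data. The code lists are copied from the cell above, while the school records are invented for illustration.

```python
import pandas as pd

# Hypothetical schools, one per category_code value.
schools = pd.DataFrame({
    "agency_code": ["S1", "S2", "S3", "S4", "S5"],
    "category_code": ["E", "M", "H", "A", "T"],
})

# Category codes used by each filter in the cell above.
levels = {
    "High": ["H", "T", "A"],
    "Middle": ["M", "T", "A", "I"],
    "Elementary": ["E", "I", "A"],
}

# Build one subset per level; a school can appear in several subsets.
subsets = {level: schools[schools["category_code"].isin(codes)]
           for level, codes in levels.items()}

for level, subset in subsets.items():
    print(level, len(subset))

total = sum(len(subset) for subset in subsets.values())
```

Anyone concatenating the three output files should therefore deduplicate on `agency_code` first.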

# Data Columns Available in Each Public School Dataset

In [25]:
PublicSchools.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643 entries, 0 to 2642
Data columns (total 1790 columns):
agency_code                                           object
category_code                                         object
agency_level                                          object
lea_code                                              object
designation_type                                      object
name                                                  object
county                                                object
street_addr                                           object
stree_addr2                                           float64
city                                                  object
state                                                 object
zip                                                   float64
phone                                                 object
grade_span                                            object
school_type                        