# This program prepares the selected data from the 8th Grade Cohort Longitudinal Study for mapping
The 8th Grade Cohort Longitudinal Study is a data product of the Texas Higher Education Coordinating Board (THECB).

### Begin by downloading the 2007 Cohort Workbook from [the THECB website](http://www.txhighereddata.org/index.cfm?objectId=F2CBE4A0-C90B-11E5-8D610050560100A9). 

The selected data focus on enrollment and completion rates in higher education. In addition to the overall cohort, the data describe the target populations from the Texas Higher Education Strategic Plan: African American, Hispanic, economically disadvantaged, and male students.

### Statistics are presented for each of the 20 regions defined by the Texas Education Agency. These are also known as the Education Service Center (ESC) Regions.


In [1]:
import pandas as pd
import requests
import zipfile
import arcpy
import io
import os

arcpy.env.overwriteOutput = True
pd.options.display.max_rows = 10

### Start by downloading the 2007 8th grade cohort workbook from the THECB website

The cohort workbooks are available at: http://www.txhighereddata.org/index.cfm?objectId=F2CBE4A0-C90B-11E5-8D610050560100A9

Save the workbook in 'Data\8th Grade FY2007 Cohort Workbook.xlsx'
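
A small check before running the later cells can catch a missing or misnamed workbook early. The sketch below is an assumption about how one might do this, not part of the original workflow; the helper names `workbook_path` and `check_workbook` are hypothetical. Building the path with `pathlib` also sidesteps backslash-escape surprises in Windows-style path strings.

```python
from pathlib import Path

# Folder and file name taken from the instructions above.
DATA_DIR = Path("Data")
WORKBOOK_NAME = "8th Grade FY2007 Cohort Workbook.xlsx"


def workbook_path(data_dir=DATA_DIR, name=WORKBOOK_NAME):
    """Return the expected workbook location (hypothetical helper)."""
    return Path(data_dir) / name


def check_workbook(data_dir=DATA_DIR, name=WORKBOOK_NAME):
    """Create the data folder if needed and report whether the workbook is there."""
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    return workbook_path(data_dir, name).is_file()
```

Calling `check_workbook()` returns `False` until the workbook has been saved in the `Data` folder, so it can guard the `pd.read_excel` calls that follow.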

In [2]:
#We'll also make a file geodatabase in the new folder
arcpy.CreateFileGDB_management("Data","Cohort.gdb")

<Result 'Data\\Cohort.gdb'>

### Now we'll extract data from the workbook starting with the Gender by Ethnicity Data.

In [3]:
#Raw string so the backslash in the Windows path is not read as an escape sequence
xl = pd.read_excel(r'Data\8th Grade FY2007 Cohort Workbook.xlsx', sheet_name='TEA by Gender by Ethnicity', header=None, index_col=None, skiprows=6)

#Keep the columns I need
xl2=xl[[0,1,2,3,4,17,18,21,22]]

#Drop the rows I don't need
GenEth=xl2[:160]
GenEth.columns=['TEAReg','RegName','Gender','Eth', 'CohoN', 'nEnr', 'pEnr', 'nComp', 'pComp']

#Make dataset of just African American males (60x30TX target population - this wasn't used for the maps)
AAmales=GenEth.loc[(GenEth['Eth']=='African American') & (GenEth['Gender']=='Male')].copy() #copy to avoid chained indexing
AAmales=AAmales.drop(['Gender','Eth'], axis=1)
AAmales.columns=['TEAReg','RegName','AAmCoho', 'AAmnEnr', 'AAmpEnr', 'AAmnComp', 'AAmpComp']
AAmales['AAmpEnr']=100*AAmales['AAmpEnr']
AAmales['AAmpComp']=100*AAmales['AAmpComp']
#AAmales.to_csv('AAmales.csv', index=False)

print(GenEth)

    TEAReg      RegName  Gender               Eth    CohoN    nEnr      pEnr  \
0        1     Edinburg  Female  African American     33.0    19.0  0.575758   
1        1     Edinburg  Female          Hispanic  12607.0  7559.0  0.599588   
2        1     Edinburg  Female             White    383.0   285.0  0.744125   
3        1     Edinburg  Female            Others     70.0    44.0  0.628571   
4        1     Edinburg    Male  African American     27.0    16.0  0.592593   
..     ...          ...     ...               ...      ...     ...       ...   
155     20  San Antonio  Female            Others    220.0   160.0  0.727273   
156     20  San Antonio    Male  African American   1070.0   513.0  0.479439   
157     20  San Antonio    Male          Hispanic   9266.0  4004.0  0.432117   
158     20  San Antonio    Male             White   3583.0  2014.0  0.562099   
159     20  San Antonio    Male            Others    269.0   180.0  0.669145   

      nComp     pComp  
0       8.0  0.

### Now get African American and Hispanic totals by region. 
We'll collapse on ethnicity to remove gender.

In [4]:
#Keep Hispanic and African American counts, collapse to remove gender, and then recalculate percents 
EthCounts=GenEth.drop(GenEth.columns[[2,6,8]], axis=1) #axis=0 for rows, axis=1 for columns

#Make African American Group
AAtemp=EthCounts.loc[EthCounts['Eth']=='African American'].copy() #copy to avoid chained indexing
AA=AAtemp.groupby(["TEAReg", "RegName","Eth"], as_index=False).sum()
AA['AApEnr']=100*AA['nEnr']/AA['CohoN']
AA['AApComp']=100*AA['nComp']/AA['CohoN']
AA=AA.drop(['Eth'], axis=1) #Keep the columns I need
AA.columns=['TEAReg','RegName','AACoho', 'AAnEnr','AAnComp','AApEnr','AApComp']

#Make Hispanic Group
Hisptemp=EthCounts.loc[EthCounts['Eth']=='Hispanic'].copy() #copy to avoid chained indexing
Hisp=Hisptemp.groupby(["TEAReg", "RegName","Eth"], as_index=False).sum()
Hisp['HispEnr']=100*Hisp['nEnr']/Hisp['CohoN']
Hisp['HispComp']=100*Hisp['nComp']/Hisp['CohoN']
Hisp=Hisp.drop(['Eth'], axis=1) #Keep the columns I need
Hisp.columns=['TEAReg','RegName','HisCoho', 'HisnEnr','HisnComp','HispEnr','HispComp']

print(AA)
print(Hisp)

    TEAReg         RegName   AACoho  AAnEnr  AAnComp     AApEnr    AApComp
0        1        Edinburg     60.0    35.0     12.0  58.333333  20.000000
1        2  Corpus Christi    306.0   151.0     40.0  49.346405  13.071895
2        3        Victoria    447.0   222.0     71.0  49.664430  15.883669
3        4         Houston  17218.0  9734.0   2546.0  56.533860  14.786851
4        5        Beaumont   1813.0   957.0    204.0  52.785438  11.252068
..     ...             ...      ...     ...      ...        ...        ...
15      16        Amarillo    364.0   184.0     38.0  50.549451  10.439560
16      17         Lubbock    480.0   202.0     31.0  42.083333   6.458333
17      18         Midland    343.0   143.0     39.0  41.690962  11.370262
18      19         El Paso    380.0   170.0     42.0  44.736842  11.052632
19      20     San Antonio   2074.0  1095.0    342.0  52.796528  16.489875

[20 rows x 7 columns]
    TEAReg         RegName  HisCoho  HisnEnr  HisnComp    HispEnr   HispComp
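
Percentages are recalculated from the summed counts rather than averaged, because averaging the female and male rates would ignore the different cohort sizes. Here is a minimal sketch using the Region 1 (Edinburg) African American rows from the `GenEth` printout; the function name `pooled_rate` is only illustrative.

```python
def pooled_rate(counts):
    """Percent enrolled after summing (cohort, enrolled) pairs across groups."""
    total_cohort = sum(cohort for cohort, _ in counts)
    total_enrolled = sum(enrolled for _, enrolled in counts)
    return 100 * total_enrolled / total_cohort


# (cohort size, number enrolled) for Edinburg, African American students,
# taken from the GenEth printout: female row, then male row.
edinburg_aa = [(33, 19), (27, 16)]

pooled = pooled_rate(edinburg_aa)
simple_mean = sum(100 * e / c for c, e in edinburg_aa) / len(edinburg_aa)

print(round(pooled, 6))       # agrees with AApEnr for Edinburg in the AA table
print(round(simple_mean, 6))  # slightly different: ignores cohort sizes
```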


### Male enrollment and completion by region. 
For this, we'll collapse on gender and remove ethnicity.

In [5]:
#Get total male counts by region, collapse on gender, counts only.
GenCounts=GenEth.drop(GenEth.columns[[3,6,8]], axis=1) #axis=0 for rows, axis=1 for columns
Allmalestemp=GenCounts.loc[GenCounts['Gender']=='Male'].copy() #copy to avoid chained indexing
Allmales=Allmalestemp.drop(['Gender'], axis=1).groupby(["TEAReg", "RegName"], as_index=False).sum() #drop Gender first so only the count columns are summed
Allmales['AllmpEnr']=100*Allmales['nEnr']/Allmales['CohoN']
Allmales['AllmpComp']=100*Allmales['nComp']/Allmales['CohoN']
Allmales.columns=['TEAReg', 'RegName','TotmCoho', 'TotmnEnr','TotmnComp','TotmpEnr','TotmpComp']

print(Allmales)

    TEAReg         RegName  TotmCoho  TotmnEnr  TotmnComp   TotmpEnr  \
0        1        Edinburg   13488.0    7135.0     2411.0  52.898873   
1        2  Corpus Christi    4113.0    1857.0      601.0  45.149526   
2        3        Victoria    2144.0    1008.0      435.0  47.014925   
3        4         Houston   37900.0   19325.0     6852.0  50.989446   
4        5        Beaumont    3199.0    1503.0      525.0  46.983432   
..     ...             ...       ...       ...        ...        ...   
15      16        Amarillo    2987.0    1443.0      515.0  48.309340   
16      17         Lubbock    2899.0    1361.0      498.0  46.947223   
17      18         Midland    3005.0    1311.0      433.0  43.627288   
18      19         El Paso    6775.0    3673.0     1041.0  54.214022   
19      20     San Antonio   14188.0    6711.0     2407.0  47.300536   

    TotmpComp  
0   17.875148  
1   14.612205  
2   20.289179  
3   18.079156  
4   16.411379  
..        ...  
15  17.241379  
16  17.

### Get economically disadvantaged student data by region.

In [6]:
#Raw string so the backslash in the Windows path is not read as an escape sequence
xlEcon = pd.read_excel(r'Data\8th Grade FY2007 Cohort Workbook.xlsx', sheet_name='TEA Region by Eco', header=None, index_col=None, skiprows=6)

#Keep the columns I need
xlEcon2=xlEcon[[0,1,2,3,16,17,20,21]]
EconTemp=xlEcon2.loc[xlEcon2[2]=='Economically Disadvantaged'].copy()

EconTemp2=EconTemp.drop([2], axis=1).copy()

#Get Region Totals and drop the rows I don't need
Econ=EconTemp2[:20].copy()
Econ.columns=['TEAReg','RegName','EcoCoho', 'EconEnr', 'EcopEnr', 'EconComp', 'EcopComp']

Econ['EcopEnr']=100*Econ['EcopEnr']
Econ['EcopComp']=100*Econ['EcopComp']
print(Econ)

   TEAReg         RegName  EcoCoho  EconEnr    EcopEnr  EconComp   EcopComp
1       1        Edinburg  22752.0  12222.0  53.718354    4377.0  19.237869
3       2  Corpus Christi   4635.0   1816.0  39.180151     460.0   9.924488
5       3        Victoria   2153.0    833.0  38.690200     250.0  11.611705
7       4         Houston  37985.0  16583.0  43.656707    4718.0  12.420692
9       5        Beaumont   3056.0   1260.0  41.230366     304.0   9.947644
..    ...             ...      ...      ...        ...       ...        ...
31     16        Amarillo   3005.0   1280.0  42.595674     393.0  13.078203
33     17         Lubbock   3198.0   1211.0  37.867417     329.0  10.287680
35     18         Midland   3039.0   1109.0  36.492267     296.0   9.740046
37     19         El Paso   9956.0   5500.0  55.243070    1683.0  16.904379
39     20     San Antonio  16482.0   7128.0  43.247179    2075.0  12.589492

[20 rows x 7 columns]


### Here we get overall totals by region for comparison

In [7]:
#Raw string so the backslash in the Windows path is not read as an escape sequence
xl = pd.read_excel(r'Data\8th Grade FY2007 Cohort Workbook.xlsx', sheet_name='Summary', header=None, index_col=None, skiprows=16)

#Keep the columns I need
xl2=xl[[0,1,2,15,16,19,20]]

#Get Region Totals and drop the rows I don't need
RegTotals=xl2[:20].copy()
RegTotals.columns=['TEAReg','RegName','TotCoho', 'TotnEnr', 'TotpEnr', 'TotnComp', 'TotpComp']

RegTotals['TotpEnr']=100*RegTotals['TotpEnr']
RegTotals['TotpComp']=100*RegTotals['TotpComp']

print(RegTotals)

   TEAReg         RegName  TotCoho  TotnEnr    TotpEnr  TotnComp   TotpComp
0       1        Edinburg  26581.0  15042.0  56.589293    5754.0  21.647041
1       2  Corpus Christi   7912.0   3996.0  50.505561    1429.0  18.061173
2       3        Victoria   4109.0   2197.0  53.467997     995.0  24.215138
3       4         Houston  74398.0  40848.0  54.904702   16498.0  22.175327
4       5        Beaumont   6094.0   3224.0  52.904496    1235.0  20.265835
..    ...             ...      ...      ...        ...       ...        ...
15     16        Amarillo   5830.0   3157.0  54.150943    1278.0  21.921098
16     17         Lubbock   5639.0   2855.0  50.629544    1150.0  20.393687
17     18         Midland   5880.0   2837.0  48.248299    1021.0  17.363946
18     19         El Paso  13214.0   7716.0  58.392614    2605.0  19.713940
19     20     San Antonio  27421.0  14194.0  51.763247    5645.0  20.586412

[20 rows x 7 columns]


### And now merge the tables
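
Because every table should hold exactly one row per region, the merge can be asked to verify that. The sketch below uses two toy tables with values copied from the printouts above. `validate="one_to_one"` is a standard `pd.merge` option, while the row-count comparison is an added assumption about what a sanity check might look like.

```python
import pandas as pd

# Two toy region tables (first two regions only, values from the printouts above).
aa = pd.DataFrame({
    "TEAReg": [1, 2],
    "RegName": ["Edinburg", "Corpus Christi"],
    "AACoho": [60.0, 306.0],
})
totals = pd.DataFrame({
    "TEAReg": [1, 2],
    "RegName": ["Edinburg", "Corpus Christi"],
    "TotCoho": [26581.0, 7912.0],
})

# validate raises pandas.errors.MergeError if a key pair is duplicated on either side.
merged = pd.merge(aa, totals, on=["TEAReg", "RegName"], validate="one_to_one")

# An inner merge silently drops regions whose keys do not match
# (for example a RegName spelled differently), so compare row counts.
assert len(merged) == len(aa) == len(totals)
print(merged)
```

With the full tables, the same comparison would be against 20 rows, one per region.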

In [8]:
#Combine into one table
All=pd.merge(AA, Hisp,on=['TEAReg', 'RegName']).copy()
All=pd.merge(All, Allmales,on=['TEAReg', 'RegName']).copy()
All=pd.merge(All, RegTotals,on=['TEAReg', 'RegName']).copy()
All=pd.merge(All, Econ,on=['TEAReg', 'RegName']).copy()

#Calculate % point differences for AA/Hisp/Males/Eco enrollment and completion rates from total cohort by region
All['AAEnrpDi']=All['AApEnr']-All['TotpEnr']
All['HisEnrpDi']=All['HispEnr']-All['TotpEnr']
All['MaleEnrpDi']=All['TotmpEnr']-All['TotpEnr'] #all males
All['EcoEnrpDi']=All['EcopEnr']-All['TotpEnr']
All['AAComppDi']=All['AApComp']-All['TotpComp']
All['HisComppDi']=All['HispComp']-All['TotpComp']
All['MaleCpDi']=All['TotmpComp']-All['TotpComp'] #all males
All['EcoComppDi']=All['EcopComp']-All['TotpComp']

Final=All

#Make perc of total for AA, Hisp, and Eco
Final['AApCoho']=100*All['AACoho']/All['TotCoho']
Final['HispCoho']=100*All['HisCoho']/All['TotCoho']
Final['EcopCoho']=100*All['EcoCoho']/All['TotCoho']

#Make variables with "_" suffix. They will have zero decimals and be used as symbol layers
Final['TotpEnr_']=Final['TotpEnr']
Final['TotpComp_']=Final['TotpComp'] 
Final['TotmpComp_']=Final['TotmpComp']
Final['AApComp_']=Final['AApComp']
Final['HispComp_']=Final['HispComp']
Final['EcopComp_']=Final['EcopComp']


Final['AApCoho_']=Final['AApCoho']
Final['HispCoho_']=Final['HispCoho']
Final['EcopCoho_']=Final['EcopCoho']
Final['AAComppD_']=Final['AAComppDi']
Final['HisComppD_']=Final['HisComppDi']
Final['EcoComppD_']=Final['EcoComppDi']
Final['AAEnrpD_']=Final['AAEnrpDi']
Final['HisEnrpD_']=Final['HisEnrpDi']
Final['EcoEnrpD_']=Final['EcoEnrpDi']
Final['MaleEnrpD_']=Final['MaleEnrpDi']
Final['MaleCpD_']=Final['MaleCpDi']


#set percentages to have just one decimal place
Processed = Final.round({'AApEnr': 1, 'AApComp': 1, 
             'HispEnr': 1, 'HispComp': 1, 
             'TotmpEnr': 1, 'TotmpComp': 1, 
             'TotpEnr': 1, 'TotpComp': 1, 
            'AAEnrpDi': 1,  'AAComppDi': 1,
            'HisEnrpDi': 1, 'HisComppDi': 1, 
             'AApCoho': 1, 'HispCoho': 1, 'EcopCoho':1,  
             'EcopEnr': 1, 'EcopComp': 1, 
            'EcoEnrpDi': 1, 'EcoComppDi': 1, 
            'MaleEnrpDi':1, 'MaleCpDi':1,
            'TotpEnr_':0, 'TotpComp_':0, 'TotmpComp_': 0,
            'AApComp_': 0, 'HispComp_': 0, 'EcopComp_': 0, 
            'AApCoho_':0, 'HispCoho_':0, 'EcopCoho_':0, 
            'AAComppD_':0, 'HisComppD_':0, 'EcoComppD_':0, 'MaleCpD_':0,
            'AAEnrpD_':0, 'HisEnrpD_':0, 'EcoEnrpD_':0, 'MaleEnrpD_':0}).copy()

Processed.to_csv('Data/ProcessedData.csv', index=False)
print(Processed)

   TEAReg         RegName   AACoho  AAnEnr  AAnComp  AApEnr  AApComp  HisCoho  \
0       1        Edinburg     60.0    35.0     12.0    58.3     20.0  25584.0   
1       2  Corpus Christi    306.0   151.0     40.0    49.3     13.1   5423.0   
2       3        Victoria    447.0   222.0     71.0    49.7     15.9   1821.0   
3       4         Houston  17218.0  9734.0   2546.0    56.5     14.8  30026.0   
4       5        Beaumont   1813.0   957.0    204.0    52.8     11.3    652.0   
..    ...             ...      ...     ...      ...     ...      ...      ...   
15     16        Amarillo    364.0   184.0     38.0    50.5     10.4   2213.0   
16     17         Lubbock    480.0   202.0     31.0    42.1      6.5   2819.0   
17     18         Midland    343.0   143.0     39.0    41.7     11.4   3393.0   
18     19         El Paso    380.0   170.0     42.0    44.7     11.1  11690.0   
19     20     San Antonio   2074.0  1095.0    342.0    52.8     16.5  18037.0   

    HisnEnr  HisnComp    ..
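
The difference columns are percentage-point gaps between a group's rate and the regional rate, and the `_` copies differ only in how they are rounded. Here is a worked example for Region 1 (Edinburg) African American enrollment, using the counts printed above; the variable names are illustrative.

```python
# Counts for Region 1 (Edinburg) from the tables printed above.
aa_cohort, aa_enrolled = 60, 35          # AACoho, AAnEnr
tot_cohort, tot_enrolled = 26581, 15042  # TotCoho, TotnEnr

aa_p_enr = 100 * aa_enrolled / aa_cohort      # AApEnr
tot_p_enr = 100 * tot_enrolled / tot_cohort   # TotpEnr

# Percentage-point difference (AAEnrpDi): a simple subtraction of two rates.
aa_enr_p_di = aa_p_enr - tot_p_enr

# One decimal for the display value, zero decimals for the "_" symbol value.
print(round(aa_enr_p_di, 1), round(aa_enr_p_di, 0))
```

A positive value means the group's rate is above the regional rate; a negative value means it is below.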

# The rest of the code prepares the shapefiles for mapping.

### We'll need:
    
* Polygons for TEA Regions [available from TEA](http://schoolsdata2-tea-texas.opendata.arcgis.com)
* Centroids (points) for TEA Regions
* An outline of the State of Texas
    
    

In [9]:
#get TEARegion file and unzip
URL=requests.get('http://opendata.arcgis.com/datasets/12142ff8beec4a1797334c9c41ba7b18_0.zip')
zippedRegions=zipfile.ZipFile(io.BytesIO(URL.content))
zippedRegions.extractall('Data/rawESC_Regions')
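
The download-and-unzip pattern can be wrapped in a small function that fails loudly if the response is not a valid archive. This is a sketch under assumptions: `extract_zip_bytes` is a hypothetical helper, and the demonstration builds a tiny archive in memory instead of contacting the server.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(content, dest):
    """Extract a zip archive held in memory (e.g. response.content) into dest.

    Returns the list of member names; raises zipfile.BadZipFile if the bytes
    are not a zip archive (for instance, an HTML error page).
    """
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Demonstration with an in-memory archive standing in for the download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ESC_Regions.prj", "placeholder")

with tempfile.TemporaryDirectory() as tmp:
    names = extract_zip_bytes(buffer.getvalue(), tmp)
    extracted_ok = os.path.isfile(os.path.join(tmp, "ESC_Regions.prj"))

print(names, extracted_ok)
```

With `requests`, calling `raise_for_status()` on the response before passing its `content` to the helper would surface HTTP errors even earlier.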

In [10]:
#get State of Texas file and unzip
URLtexas=requests.get('http://www2.census.gov/geo/tiger/GENZ2016/shp/cb_2016_us_state_5m.zip')
zippedState=zipfile.ZipFile(io.BytesIO(URLtexas.content))
zippedState.extractall('Data/TexasOutline')

#Delete unnecessary fields
arcpy.DeleteField_management("Data/TexasOutline/cb_2016_us_state_5m.shp", 
                             ["STATENS", "AFFGEOID", "STUSPS", 'NAME', 'LSAD', 'ALAND', 'AWATER'])

arcpy.MakeFeatureLayer_management ("Data/TexasOutline/cb_2016_us_state_5m.shp", "TexasOutline", "STATEFP='48'")


<Result 'TexasOutline'>

In [11]:
#copy shapefiles to geodatabase
arcpy.FeatureClassToGeodatabase_conversion('Data/rawESC_Regions/ESC_Regions.shp', 'Data/Cohort.gdb')


#List fields in dataset
fields = arcpy.ListFields('Data/Cohort.gdb/ESC_Regions')

for field in fields:
    print("{0} is a type of {1} with a length of {2}"
          .format(field.name, field.type, field.length))

OBJECTID_1 is a type of OID with a length of 4
Shape is a type of Geometry with a length of 0
FID_1 is a type of Integer with a length of 4
OBJECTID is a type of Integer with a length of 4
CITY is a type of String with a length of 80
REGION is a type of String with a length of 80
ORG_E_ID is a type of Integer with a length of 4
WEBSITE is a type of String with a length of 80
SHAPE_Leng is a type of Double with a length of 8
Shape_Length is a type of Double with a length of 8
Shape_Area is a type of Double with a length of 8


In [12]:
#Delete unnecessary fields
arcpy.DeleteField_management("Data/Cohort.gdb/ESC_Regions", ["FID_1", "OBJECTID", "CITY", 'REGION', 'ORG_E_ID', 'WEBSITE', 'SHAPE_Leng'])                            

#Add Cohort data to GeoDataBase
arcpy.TableToTable_conversion('Data/ProcessedData.csv', 'Data/Cohort.gdb', 'CohortData')

#Merge Cohort Data to TEA Region Polygons
arcpy.JoinField_management('Data/Cohort.gdb/ESC_Regions', 'OBJECTID_1','Data/Cohort.gdb/CohortData', 'TEAReg')

<Result 'Data/Cohort.gdb/ESC_Regions'>

In [13]:
#Make folder if it doesn't exist
if not os.path.exists('Data/FinalShapefiles'):
    os.makedirs('Data/FinalShapefiles')
    
#Export merged TEARegions with Cohort data to shapefile
arcpy.FeatureClassToShapefile_conversion ('Data/Cohort.gdb/ESC_Regions', 'Data/FinalShapefiles')

#Export TexasOutline to shapefile
arcpy.FeatureClassToShapefile_conversion ('TexasOutline', 'Data/FinalShapefiles')

<Result 'Data\\FinalShapefiles'>

### Now make the centroids for the TEA Regions

(Requires the advanced license)
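
A centroid is the area-weighted center of a polygon, and for concave shapes it can fall outside the polygon itself. The sketch below is a plain-Python illustration of that idea using the shoelace formula on a made-up U-shaped polygon. It is not how arcpy computes its points, and both function names are hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0  # area2 accumulates twice the signed area
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        area2 += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return cx / (3 * area2), cy / (3 * area2)


def point_in_polygon(x, y, vertices):
    """Ray-casting test: count edge crossings to the right of the point."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


# Made-up U-shaped polygon: a 3x3 square with a notch cut from the top.
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]

cx, cy = polygon_centroid(u_shape)
print(cx, cy, point_in_polygon(cx, cy, u_shape))
```

For this shape the centroid lands in the notch, outside the polygon. If points must sit within each region, the Feature To Point tool's optional `point_location` parameter (`"INSIDE"`) is meant for that case.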

In [14]:
#  Set local variables
inFeatures = "Data/Cohort.gdb/ESC_Regions"
outFeatureClass = "Data/Cohort.gdb/ESC_Points"

# Use FeatureToPoint to create a centroid point for each TEA region
arcpy.FeatureToPoint_management(inFeatures, outFeatureClass)

<Result 'Data\\Cohort.gdb\\ESC_Points'>

In [15]:
#Export merged TEARegion Points to shapefile
arcpy.FeatureClassToShapefile_conversion ('Data/Cohort.gdb/ESC_Points', 'Data/FinalShapefiles')

<Result 'Data\\FinalShapefiles'>

### Now, go to Linux and use the [GDAL](https://www.gdal.org/) ogr2ogr tool to convert the shapefiles to GeoJSON. Then use the [Tippecanoe](https://github.com/mapbox/tippecanoe) tool to make the .mbtiles file

I used the following commands:

* ogr2ogr -f GeoJSON CohortTEARegionPolys.json Data/FinalShapefiles/ESC_Regions.shp -progress
* ogr2ogr -f GeoJSON TexasOutline.json Data/FinalShapefiles/TexasOutline.shp -progress
* ogr2ogr -f GeoJSON CohortTEARegionPoints.json Data/FinalShapefiles/ESC_Points.shp -progress
* tippecanoe --output=8thGradeCohort2007TEARegionData.mbtiles CohortTEARegionPoints.json CohortTEARegionPolys.json TexasOutline.json -r1 --drop-fraction-as-needed  --simplification=9 --maximum-zoom=15 --minimum-zoom=3 --exclude=OBJECTID_1 --detect-shared-borders
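
If the conversions need to be repeated, the three ogr2ogr calls can be generated from one mapping. This is a minimal sketch, assuming `ogr2ogr` is on the PATH; `ogr2ogr_command` is a hypothetical helper, and nothing is executed unless the `subprocess.run` line is uncommented.

```python
import subprocess

# Output GeoJSON name -> source shapefile, as in the commands above.
CONVERSIONS = {
    "CohortTEARegionPolys.json": "Data/FinalShapefiles/ESC_Regions.shp",
    "TexasOutline.json": "Data/FinalShapefiles/TexasOutline.shp",
    "CohortTEARegionPoints.json": "Data/FinalShapefiles/ESC_Points.shp",
}


def ogr2ogr_command(output_json, source_shp):
    """Build the argument list for one shapefile-to-GeoJSON conversion."""
    return ["ogr2ogr", "-f", "GeoJSON", output_json, source_shp, "-progress"]


commands = [ogr2ogr_command(out, src) for out, src in CONVERSIONS.items()]
for command in commands:
    print(" ".join(command))
    # subprocess.run(command, check=True)  # uncomment to actually run ogr2ogr
```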

Finally, we uploaded the custom .mbtiles file to Mapbox Studio and served the tiles from there. You could also set up your own vector tile server using TileServer GL.