# Data Storytelling Project: SAT New York

**Summary of results:**

In [11]:
import pandas as pd
import numpy as np

## Understanding the data

### Read data into dataframe, put each dataframe into a dictionary

In [31]:
files = ["AP_results.csv", "Class_size.csv", "Demographics.csv", "Graduation_outcomes.csv", "hsdirectory.csv", "Math_test_results.csv", "SAT_results.csv"]
data = {}
for f in files:
    print(f)
    d = pd.read_csv("data/{}".format(f))
    data[f.replace(".csv", "")] = d

AP_results.csv
Class_size.csv
Demographics.csv
Graduation_outcomes.csv
hsdirectory.csv
Math_test_results.csv
SAT_results.csv


### Check dictionary

In [33]:
for k,v in data.items():
    print("\n" + k + "\n")
    print(v.head())


Math_test_results

      DBN Grade  Year      Category  Number Tested Mean Scale Score Level 1 #  \
0  01M015     3  2006  All Students             39              667         2   
1  01M015     3  2007  All Students             31              672         2   
2  01M015     3  2008  All Students             37              668         0   
3  01M015     3  2009  All Students             33              668         0   
4  01M015     3  2010  All Students             26              677         6   

  Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 # Level 4 %  \
0      5.1%        11     28.2%        20     51.3%         6     15.4%   
1      6.5%         3      9.7%        22       71%         4     12.9%   
2        0%         6     16.2%        29     78.4%         2      5.4%   
3        0%         4     12.1%        28     84.8%         1        3%   
4     23.1%        12     46.2%         6     23.1%         2      7.7%   

  Level 3+4 # Level 3+4 %  
0          26 

**Patterns/preliminary analysis:**

* District Borough Number (DBN): Math_test_results, SAT_results, Demographics, AP_results, Graduation_outcomes, hsdirectory
* School name: SAT_results, AP_results, Graduation_outcomes, Class_size, hsdirectory
* Location1 field in hsdirectory can be used for maps
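
The column overlap listed above can also be checked programmatically. This is a minimal sketch, not part of the original notebook: `datasets_with_column` is a hypothetical helper, and the small frames below stand in for the real `data` dictionary. The `ignore_case` option is there because hsdirectory stores its key as lowercase `dbn`.

```python
import pandas as pd

def datasets_with_column(frames, column, ignore_case=False):
    """Return sorted names of the DataFrames that contain `column`."""
    matches = []
    for name, df in frames.items():
        columns = [str(c) for c in df.columns]
        if ignore_case:
            found = column.lower() in [c.lower() for c in columns]
        else:
            found = column in columns
        if found:
            matches.append(name)
    return sorted(matches)

# Tiny stand-in frames; in the notebook, pass the `data` dictionary instead.
sample = {
    "AP_results": pd.DataFrame({"DBN": ["01M448"]}),
    "Class_size": pd.DataFrame({"CSD": [1], "SCHOOL CODE": ["M015"]}),
    "hsdirectory": pd.DataFrame({"dbn": ["01M448"]}),  # placeholder value
}

exact = datasets_with_column(sample, "DBN")
loose = datasets_with_column(sample, "DBN", ignore_case=True)
print(exact)
print(loose)
```

Calling `datasets_with_column(data, "DBN")` on the real dictionary would list every dataset that can be joined on DBN without renaming a column first.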

### Data unification  

Joining the datasets on a shared key makes them easier to work with.
DBN is the common column; it is missing only from Class_size, unless SCHOOL CODE is the DBN.

In [39]:
data["AP_results"]["DBN"].head()

0    01M448
1    01M450
2    01M515
3    01M539
4    02M296
Name: DBN, dtype: object

In [40]:
data["Class_size"].head()

Unnamed: 0,CSD,BOROUGH,SCHOOL CODE,SCHOOL NAME,GRADE,PROGRAM TYPE,CORE SUBJECT (MS CORE and 9-12 ONLY),CORE COURSE (MS CORE and 9-12 ONLY),SERVICE CATEGORY(K-9* ONLY),NUMBER OF STUDENTS / SEATS FILLED,NUMBER OF SECTIONS,AVERAGE CLASS SIZE,SIZE OF SMALLEST CLASS,SIZE OF LARGEST CLASS,DATA SOURCE,SCHOOLWIDE PUPIL-TEACHER RATIO
0,1,M,M015,P.S. 015 Roberto Clemente,0K,GEN ED,-,-,-,19.0,1.0,19.0,19.0,19.0,ATS,
1,1,M,M015,P.S. 015 Roberto Clemente,0K,CTT,-,-,-,21.0,1.0,21.0,21.0,21.0,ATS,
2,1,M,M015,P.S. 015 Roberto Clemente,01,GEN ED,-,-,-,17.0,1.0,17.0,17.0,17.0,ATS,
3,1,M,M015,P.S. 015 Roberto Clemente,01,CTT,-,-,-,17.0,1.0,17.0,17.0,17.0,ATS,
4,1,M,M015,P.S. 015 Roberto Clemente,02,GEN ED,-,-,-,15.0,1.0,15.0,15.0,15.0,ATS,


It looks like DBN is a combination of CSD (zero-padded to two digits) and SCHOOL CODE, which already starts with the BOROUGH letter.

From the [string format](https://docs.python.org/2/library/stdtypes.html#str.format) [documentation](https://docs.python.org/3/library/string.html#format-string-syntax):

{:02d}

02d formats an integer (d) in a field of minimum width 2 (2), zero-padded on the left (leading 0).
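
As a quick illustration of the format spec, here is a small sketch that uses the CSD and SCHOOL CODE values from the Class_size preview above. The variable names are only for illustration.

```python
# Zero-pad a district number to two digits, then append the school code.
csd = 1
school_code = "M015"

padded = "{:02d}".format(csd)
dbn = "{0:02d}{1}".format(csd, school_code)

print(padded)  # 01
print(dbn)     # 01M015

# A value that already has two digits is left unchanged.
two_digit = "{:02d}".format(12)
print(two_digit)  # 12
```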


In [59]:
data["Class_size"]["DBN"] = data["Class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
data["hsdirectory"]["DBN"] = data["hsdirectory"]["dbn"]
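
The same DBN column could probably be built without a row-wise `apply`. The sketch below is an alternative, not the approach the notebook uses: it converts CSD to a string, pads it with `str.zfill`, and concatenates the school code. The frame is a stand-in for `data["Class_size"]`, and the third row (CSD 12, code X999) is made up to show a two-digit district.

```python
import pandas as pd

# Stand-in for data["Class_size"]; the last row is hypothetical.
class_size = pd.DataFrame({
    "CSD": [1, 1, 12],
    "SCHOOL CODE": ["M015", "M015", "X999"],
})

# Vectorized alternative: pad CSD as a string, then concatenate the school code.
class_size["DBN"] = class_size["CSD"].astype(str).str.zfill(2) + class_size["SCHOOL CODE"]

# Sanity check: the DBN values seen in the previews all have six characters.
all_six = bool((class_size["DBN"].str.len() == 6).all())
dbn_values = class_size["DBN"].tolist()
print(dbn_values)
print(all_six)
```

On a large frame the vectorized form usually runs faster than `apply` with `axis=1`, and the length check is a cheap way to catch a malformed key before merging.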

In [62]:
data["Class_size"].head(3)

Unnamed: 0,CSD,BOROUGH,SCHOOL CODE,SCHOOL NAME,GRADE,PROGRAM TYPE,CORE SUBJECT (MS CORE and 9-12 ONLY),CORE COURSE (MS CORE and 9-12 ONLY),SERVICE CATEGORY(K-9* ONLY),NUMBER OF STUDENTS / SEATS FILLED,NUMBER OF SECTIONS,AVERAGE CLASS SIZE,SIZE OF SMALLEST CLASS,SIZE OF LARGEST CLASS,DATA SOURCE,SCHOOLWIDE PUPIL-TEACHER RATIO,DBN
0,1,M,M015,P.S. 015 Roberto Clemente,0K,GEN ED,-,-,-,19.0,1.0,19.0,19.0,19.0,ATS,,01M015
1,1,M,M015,P.S. 015 Roberto Clemente,0K,CTT,-,-,-,21.0,1.0,21.0,21.0,21.0,ATS,,01M015
2,1,M,M015,P.S. 015 Roberto Clemente,01,GEN ED,-,-,-,17.0,1.0,17.0,17.0,17.0,ATS,,01M015
