**Introduction**  

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests, and whether they're unfair to certain groups. The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's fairly important to perform well on it.

The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the data set for our project is based on 2,400 total points). Organizations often rank high schools by their average SAT scores. The scores are also considered a measure of overall school district quality.

**Some interrelated datasets we'll be using for our analysis**:  

[SAT scores by school](https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4) - SAT scores for each high school in New York City  
[School attendance](https://data.cityofnewyork.us/Education/School-Attendance-and-Enrollment-Statistics-by-Dis/7z8d-msnt) - Attendance information for each school in New York City  
[Class size](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3) - Information on class size for each school  
[AP test results](https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e) - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)  
[Graduation outcomes](https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a) - The percentage of students who graduated, and other outcome information  
[Demographics](https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j) - Demographic information for each school  
[School survey](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) - Surveys of parents, teachers, and students at each school  

**Notable aspects about the datasets**:  
* Only high school students take the SAT, so we'll want to focus on high schools.  
* New York City is made up of five boroughs, which are essentially distinct regions.  
* New York City schools fall within several different school districts, each of which can contain dozens of schools.  
* Our data sets include several different types of schools. We'll need to clean them so that we can focus on high schools only.  
* Each school in New York City has a unique code called a DBN, or district borough number.  
* Aggregating data by district will allow us to use the district mapping data to plot district-by-district differences.  

## Import libraries and data

In [None]:
%autosave 10
import pandas as pd
import numpy as np
import re
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore') 
%matplotlib inline

Mapping datafile names to location

In [None]:
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]
data_loc = {}
# Capture the file name without its extension (\w already matches underscores)
pattern = r'(\w+)\.csv'

for x in data_files:
    y = re.findall(pattern, x)[0]
    data_loc[y] = '/kaggle/input/nyc-high-school-data/nyc_highschool_data/schools/'+x
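
The same name-to-path mapping can be built without a regular expression. The sketch below is only an illustration: `base_dir` is a hypothetical placeholder directory, not the path used in this notebook, and the file list is shortened.

```python
from pathlib import Path

# Hypothetical base directory; substitute the real location of the CSV files.
base_dir = Path("schools")
data_files = ["ap_2010.csv", "class_size.csv", "sat_results.csv"]

# Path.stem drops the final extension, so "ap_2010.csv" becomes "ap_2010".
data_loc = {Path(name).stem: str(base_dir / name) for name in data_files}
print(sorted(data_loc))
```

Using `Path.stem` avoids having to escape the dot in the pattern by hand.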

Loading the data into dictionary 

In [None]:
data = {}
for key,value in data_loc.items():
    data[key] = pd.read_csv(value)

Stripping white-spaces from column names of each dataset

In [None]:
for key in data.keys():
    data[key].columns = data[key].columns.str.strip()

In [None]:
data['sat_results'].head(5)

**Observations**:

* The DBN appears to be a unique ID for each school.  
* We can tell from the first few rows of names that we only have data about high schools.  
* There's only a single row for each high school, so each DBN is unique in the SAT data.    

We may eventually want to combine the three columns that contain SAT scores -- SAT Critical Reading Avg. Score, SAT Math Avg. Score, and SAT Writing Avg. Score -- into a single column to make the scores easier to analyze.

Inspecting all datasets available

In [None]:
for key,val in data.items():
    print(data[key].head(5))

**Observations**:

* Most of the data sets have a DBN column that we can use to relate them to one another  
* Some of the data sets appear to contain multiple rows for each school (because the rows have duplicate DBN values). That means we’ll have to do some preprocessing to ensure that each DBN is unique within each data set. 
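
Rather than judging uniqueness from the first five rows, duplicate DBN values can be counted directly. This is a minimal sketch on made-up toy frames; the dictionary `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the real data sets (values are made up for illustration).
toy = {
    "sat_results": pd.DataFrame({"DBN": ["01M292", "01M448", "01M450"]}),
    "graduation": pd.DataFrame({"DBN": ["01M292", "01M292", "01M448"]}),
}

# For each data set, count rows whose DBN already appeared in an earlier row.
dup_counts = {name: int(df["DBN"].duplicated().sum()) for name, df in toy.items()}
print(dup_counts)
```

A count of zero means every DBN is unique in that data set; anything larger means the data set needs condensing before a merge.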

## Reading the survey information

In [None]:
all_survey = pd.read_csv("/kaggle/input/nyc-high-school-data/nyc_highschool_data/schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("/kaggle/input/nyc-high-school-data/nyc_highschool_data/schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')

In [None]:
survey = pd.concat([all_survey, d75_survey], axis=0)
print(survey.head())

**Observations**:  

* There are over 2000 columns, nearly all of which we don't need. We'll have to filter the data to remove the unnecessary ones. Working with fewer columns will make it easier to print the dataframe out and find correlations within it.
* The survey data has a dbn column that we'll want to convert to uppercase (DBN). The conversion will make the column name consistent with the other data sets. 

Using the publicly available [data dictionary](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) to filter out the unnecessary columns

In [None]:
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p",
                 "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11",
                 "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", 
                 "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", 
                 "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]

Before we filter columns out, we'll want to copy the data from the dbn column into a new column called DBN

In [None]:
survey["DBN"] = survey["dbn"]

Filtering the dataset

In [None]:
survey = survey.loc[:,survey_fields]
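
As a possible alternative to copying `dbn` into a new column, the column can be renamed and the wanted fields selected in one pass. The frame `toy_survey` and the column `unused_q1` below are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy survey frame with a lowercase "dbn" column and one column we do not need.
toy_survey = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "rr_s": [90, 85],
    "unused_q1": [1, 2],
})
wanted = ["DBN", "rr_s"]

# rename() returns a new frame in which "dbn" is replaced by "DBN".
filtered = toy_survey.rename(columns={"dbn": "DBN"})[wanted]
print(filtered)
```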

Assign the dataframe survey to the key survey in the dictionary data

In [None]:
data['survey'] = survey

## Fixing interrelation

When we explored all of the data sets, we noticed that some of them, like class_size and hs_directory, don't have a DBN column. hs_directory does have a dbn column, though, so we can just rename it.

However, class_size doesn't appear to have the column at all. Here are the first few rows of the data set:

In [None]:
data['class_size'].head(5)

From looking at these rows, we can tell that the DBN in the sat_results data is just a combination of the CSD and SCHOOL CODE columns in the class_size data.

In [None]:
data['hs_directory']['DBN'] = data['hs_directory']['dbn']

Function to zero-pad a single CSD value to two characters

In [None]:
def pad_csd(row):
    item = str(row)
    if(len(item)==2):
        return item
    else:
        return '0'+item

In [None]:
padded_csd= data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]['DBN'] = padded_csd+data["class_size"]['SCHOOL CODE']
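
The padding step can also be written with the vectorized string method `str.zfill`, which left-pads with zeros to a given width. A minimal sketch on a toy frame (the values are made up):

```python
import pandas as pd

# Toy class_size frame: CSD is an integer, SCHOOL CODE is a string.
toy_class_size = pd.DataFrame({"CSD": [1, 12], "SCHOOL CODE": ["M015", "X080"]})

# Convert CSD to text, pad it to two digits, then append the school code.
padded = toy_class_size["CSD"].astype(str).str.zfill(2)
toy_class_size["DBN"] = padded + toy_class_size["SCHOOL CODE"]
print(toy_class_size["DBN"].tolist())
```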

Checking progress

In [None]:
data["class_size"].head()

## Generating column sat_score
Convert the SAT Math Avg. Score, SAT Critical Reading Avg. Score, and SAT Writing Avg. Score columns into the sat_score column for ease of analysis

In [None]:
data['sat_results']['SAT Math Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Math Avg. Score'],errors = "coerce")
data['sat_results']['SAT Critical Reading Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Critical Reading Avg. Score'],errors = "coerce")
data['sat_results']['SAT Writing Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Writing Avg. Score'],errors = "coerce")

Adding up scores to sat_score column

In [None]:
data['sat_results']['sat_score'] = data['sat_results']['SAT Math Avg. Score']+data['sat_results']['SAT Writing Avg. Score']+data['sat_results']['SAT Critical Reading Avg. Score']
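
Because `errors="coerce"` turns unparseable entries into NaN, a school with any missing section score ends up with a missing `sat_score`. The sketch below shows this on a toy frame; the placeholder value `"s"` is an assumption used to stand for a non-numeric entry.

```python
import pandas as pd

# Toy SAT frame: the second school has non-numeric placeholders.
toy_sat = pd.DataFrame({
    "SAT Math Avg. Score": ["404", "s"],
    "SAT Critical Reading Avg. Score": ["355", "s"],
    "SAT Writing Avg. Score": ["363", "s"],
})
score_cols = list(toy_sat.columns)

# Convert every score column at once; bad values become NaN.
numeric = toy_sat[score_cols].apply(pd.to_numeric, errors="coerce")

# min_count=3 keeps the total missing unless all three sections are present.
toy_sat["sat_score"] = numeric.sum(axis=1, min_count=3)
print(toy_sat["sat_score"])
```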

## Parsing the location fields
Parsing the latitude and longitude coordinates for each school to enable us to map the schools and uncover any geographic patterns in the data.

Function to parse a string and return the latitude from it

In [None]:
def find_lat(loc):
    # Raw string so the backslashes reach the regex engine unchanged
    coords = re.findall(r"\(.+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

Function to parse a string and return the longitude from it

In [None]:
def find_lon(loc):
    # Raw string so the backslashes reach the regex engine unchanged
    coords = re.findall(r"\(.+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

In [None]:
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

In [None]:
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"])
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"])
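
Both coordinates can also be pulled out in one step with `Series.str.extract` and named groups. This is a sketch under the assumption that the field ends in a `(lat, lon)` pair; the address in the toy row is illustrative only.

```python
import pandas as pd

# Toy directory row; the coordinate pair sits in parentheses at the end.
toy_dir = pd.DataFrame({
    "Location 1": ["883 Classon Avenue\nBrooklyn, NY 11225\n(40.670299, -73.961648)"]
})

# Named groups become the column names of the extracted frame.
pattern = r"\((?P<lat>-?\d+\.?\d*),\s*(?P<lon>-?\d+\.?\d*)\)"
coords = toy_dir["Location 1"].str.extract(pattern).astype(float)
print(coords)
```

Rows without a coordinate pair yield NaN instead of raising an `IndexError`, which can be easier to handle later.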

Progress check

In [None]:
data["hs_directory"].head(5)

## Condensation of datasets

class_size, graduation, and demographics data sets are condensed so that each DBN is unique in each, to avoid problems during merging of the datasets.

Looking at unique values in 'GRADE' column in the class_size dataset 

In [None]:
data['class_size']['GRADE'].unique()

Because we're dealing with high schools, we're only concerned with grades 9 through 12. That means we only want to pick rows where the value in the GRADE column is 09-12

Looking at unique values in 'PROGRAM TYPE' column in the class_size dataset 

In [None]:
data['class_size']['PROGRAM TYPE'].unique()

Each school can have multiple program types. Because GEN ED is the largest category by far, let's only select rows where PROGRAM TYPE is GEN ED

Activity:

In [None]:
class_size = data["class_size"]

class_size = class_size.loc[class_size['GRADE'] == '09-12',:]
class_size = class_size.loc[class_size['PROGRAM TYPE'] == 'GEN ED',:]

Progress check

In [None]:
class_size.loc[:,["GRADE","PROGRAM TYPE"]].describe()

In [None]:
class_size.loc[:,"DBN"].value_counts().sort_values(ascending = False).head(5)

As the value counts above show, DBN still isn't completely unique. This is due to the CORE COURSE (MS CORE and 9-12 ONLY) and CORE SUBJECT (MS CORE and 9-12 ONLY) columns

Looking at unique values in 'CORE COURSE (MS CORE and 9-12 ONLY)' column in the class_size dataset 

In [None]:
data['class_size']['CORE COURSE (MS CORE and 9-12 ONLY)'].unique()

Looking at unique values in 'CORE SUBJECT (MS CORE and 9-12 ONLY)' column in the class_size dataset 

In [None]:
data['class_size']['CORE SUBJECT (MS CORE and 9-12 ONLY)'].unique()

Activity:

In [None]:
# Average every numeric column within each DBN so that each school has one row
class_size = class_size.groupby('DBN').mean(numeric_only=True)
class_size.reset_index(inplace = True)
data['class_size'] = class_size
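
To make the effect of the aggregation concrete, here is a minimal sketch on a toy frame with made-up values: two rows for the same school collapse into one row holding the mean.

```python
import pandas as pd

# Toy class-size rows: school 01M292 appears once per core subject.
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "CORE SUBJECT (MS CORE and 9-12 ONLY)": ["ENGLISH", "MATH", "ENGLISH"],
    "AVERAGE CLASS SIZE": [20.0, 30.0, 25.0],
})

# Text columns are dropped; numeric columns are averaged per DBN.
condensed = toy.groupby("DBN").mean(numeric_only=True).reset_index()
print(condensed)
```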

In case of the demographics dataset, the only column that prevents a given DBN from being unique is schoolyear. We only want to select rows where schoolyear is 20112012. This will give us the most recent year of data, and also match our SAT results data.

In [None]:
data["demographics"] = data["demographics"][data["demographics"]['schoolyear'] == 20112012]

The Demographic and Cohort columns are what prevent DBN from being unique in the graduation dataset.  

A Cohort appears to refer to the year the data represents, and the Demographic appears to refer to a specific demographic group. In this case, we want to pick data from the most recent Cohort available, which is 2006. We also want data from the full cohort, so we'll only pick rows where Demographic is Total Cohort

In [None]:
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

Convert the Advanced Placement (AP) test scores from strings to numeric values

In [None]:
cols = ['AP Test Takers', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for item in cols:
    data['ap_2010'][item] = pd.to_numeric(data['ap_2010'][item], errors = "coerce")

## Merging the datasets

We'll merge two data sets at a time. Because this project is concerned with determining demographic factors that correlate with SAT score, we'll want to preserve as many rows as possible from sat_results while minimizing null values.

Some of the data sets have a lot of missing DBN values. This makes a left join more appropriate, because we don't want to lose too many rows when we merge.

In [None]:
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
combined.shape

Now that we've performed the left joins, we still have to merge class_size, demographics, survey, and hs_directory into combined. Because these files contain information that's more valuable to our analysis and also have fewer missing DBN values, we'll use the inner join type.

In [None]:
data['hs_directory'].describe()

In [None]:
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
combined.shape
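
The difference between the two join types can be seen on a small example. The frames below are toy data with made-up values; `indicator=True` adds a `_merge` column that shows which rows found a match.

```python
import pandas as pd

# Toy frames: school "B" has SAT results but no AP data.
sat = pd.DataFrame({"DBN": ["A", "B", "C"], "sat_score": [1200, 1300, 1100]})
ap = pd.DataFrame({"DBN": ["A", "C"], "AP Test Takers": [30, 12]})

# Left join keeps every SAT row; unmatched rows get NaN for the AP columns.
left = sat.merge(ap, on="DBN", how="left", indicator=True)

# Inner join keeps only the schools present in both frames.
inner = sat.merge(ap, on="DBN", how="inner")

n_unmatched = int((left["_merge"] == "left_only").sum())
print(len(left), len(inner), n_unmatched)
```

Checking the count of `left_only` rows before choosing a join type shows how many schools an inner join would discard.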

## Missing value imputation

Filling missing values in numeric columns with the column mean,  
then filling any remaining missing values with 0

In [None]:
combined = combined.fillna(combined.mean(numeric_only=True))
combined = combined.fillna(0)
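
A small sketch of the two-step fill on toy data (column names and values are made up): a partly missing column receives its mean, while a column with no values at all has no mean and falls through to 0.

```python
import pandas as pd

nan = float("nan")

# Toy frame: one partly missing column and one entirely missing column.
toy = pd.DataFrame({
    "sat_score": [1000.0, nan, 1400.0],
    "empty_col": [nan, nan, nan],
})

# Step 1: fill with column means (the mean of empty_col is itself NaN).
filled = toy.fillna(toy.mean(numeric_only=True))

# Step 2: whatever is still missing becomes 0.
filled = filled.fillna(0)
print(filled)
```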

The school district is just the first two characters of the DBN. We can apply a function over the DBN column of combined that pulls out the first two characters and places the substring in a new column named 'school_dist'

In [None]:
def get_first_two_chars(string):
    return(string[0:2])

In [None]:
combined['school_dist']= combined['DBN'].apply(get_first_two_chars)

## Exploratory Analysis

Using correlations to infer how closely each column is related to sat_score

In [None]:
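# Restrict to numeric columns before computing pairwise correlations
correlations = combined.select_dtypes(include="number").corr()['sat_score']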
correlations

**Observations**:

* total_enrollment has a strong positive correlation with sat_score. This is surprising because we'd expect smaller schools where students receive more attention to have higher scores. However, it looks like the opposite is true -- larger schools tend to do better on the SAT.  
* The percentage of females at a school (female_per) correlates positively with SAT score, whereas the percentage of males (male_per) correlates negatively. This could indicate that women do better on the SAT than men.  
* Teacher and student ratings of school safety correlate with sat_score  
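* The racial composition of a school (white_per, asian_per, black_per, hispanic_per) correlates with sat_score, which points to significant racial inequality in SAT scores  
* The percentage of English language learners (ell_percent) and the percentage of students eligible for free or reduced-price lunch (frl_percent) both have a strong negative correlation with SAT scores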
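
Since the full correlation series is long, it can help to rank columns by the absolute strength of their correlation with sat_score. The sketch below uses a toy frame with made-up values.

```python
import pandas as pd

# Toy frame with made-up values for three of the columns discussed above.
toy = pd.DataFrame({
    "sat_score": [1000, 1200, 1400, 1600],
    "ell_percent": [40, 30, 20, 10],
    "total_enrollment": [300, 500, 400, 900],
})

# Correlation of every other column with sat_score.
corr = toy.corr()["sat_score"].drop("sat_score")

# Order by absolute value so strong negative correlations rank high too.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```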

Analyzing total_enrollment vs. sat_score

**Enrollment vs. SAT score**

In [None]:
combined.plot(x = "total_enrollment", y = "sat_score", kind = "scatter",
              xlabel = "Total enrollments", ylabel = "SAT score",
              title = "SAT score vs. Enrollment", alpha = 0.3)

Judging from the plot we just created, it doesn't appear that there's an extremely strong correlation between the two columns.

Filtering the data to look at schools with a total enrollment below 1000 and an SAT score below 1000

In [None]:
low_enrollment = combined[combined['total_enrollment']<1000]
low_enrollment = low_enrollment[low_enrollment['sat_score']<1000]
low_enrollment['School Name']

Most of the schools with both low enrollment and low SAT scores have a high share of English language learners. This suggests that ell_percent, rather than total_enrollment, is the field that correlates strongly with the SAT score

**English learners vs. SAT score**

In [None]:
combined.plot(x = 'ell_percent', y = 'sat_score', kind = 'scatter',
              xlabel = "English learners (%)", ylabel = "SAT score",
              title = "English learners vs. SAT score", alpha = 0.3)

Aggregating the combined dataset by district, which will enable us to understand how ell_percent varies district-by-district instead of the unintelligibly granular school-by-school variation.

In [None]:
districts = combined.groupby('school_dist').mean(numeric_only=True)
districts.reset_index(inplace = True)

Remove DBN since it's a unique identifier, not a useful numerical value for correlation.

In [None]:
survey_fields.remove("DBN")

**SAT score vs. Survey fields**

In [None]:
fig = plt.figure()
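# Each bar is the correlation coefficient between sat_score and one survey field
combined.select_dtypes(include="number").corr().loc['sat_score', survey_fields].plot(
    kind = "bar", xlabel = "Survey fields", ylabel = "Correlation with sat_score",
    title = "SAT score vs. Survey fields")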

**SAT score vs. Survey field: saf_s_11**  
Students' safety rating of their respective schools

In [None]:
combined.plot(x='saf_s_11', y='sat_score', kind = "scatter", xlabel = "saf_s_11", 
              ylabel = "SAT score", title = "SAT score vs. Survey field: saf_s_11", alpha = 0.3) 

### Mapping out average score by school district

Calculating average score by school district

In [None]:
avg_school_dist = combined.groupby('school_dist').mean(numeric_only=True)

In [None]:
longitudes = avg_school_dist['lon'].tolist()
latitudes = avg_school_dist['lat'].tolist()

In [None]:
# AREA LEFT FOR MAP

### Investigating racial difference in SAT scores

**SAT score vs. Racial background**

In [None]:
races = ['white_per', 'asian_per', 'black_per', 'hispanic_per']

In [None]:
fig = plt.figure()
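# Each bar is the correlation coefficient between sat_score and one race column
combined.select_dtypes(include="number").corr().loc['sat_score', races].plot(
    kind = "bar", xlabel = "Racial background", ylabel = "Correlation with sat_score",
    title = "SAT score vs. Racial background")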

The percentages of white and Asian students correlate positively with SAT score, whereas the percentages of Black and Hispanic students correlate negatively  
  
Examining SAT scores across all schools with varying percentages of Hispanic students  
  
  
**SAT score vs. Hispanic student percentage**

In [None]:
combined.plot(x='hispanic_per', y='sat_score', kind = "scatter", 
              xlabel = "Hispanic background", ylabel = "SAT score", 
              title = "SAT score vs. Hispanic student percentage", alpha = 0.3)

Most of the points sit in the lower part of the plot regardless of the Hispanic student percentage, so low SAT scores occur across the whole range. hispanic_per is therefore unlikely to be the only field responsible for low SAT scores

Examining schools that have a high percentage of Hispanic students 

In [None]:
high_hispanic = combined[combined['hispanic_per']>95]
high_hispanic['SCHOOL NAME_x']

### Examining the significance of gender in SAT scores

In [None]:
gender = ['male_per', 'female_per']

In [None]:
fig = plt.figure()
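# Each bar is the correlation coefficient between sat_score and one gender column
combined.select_dtypes(include="number").corr().loc['sat_score', gender].plot(
    kind = 'bar', xlabel = "Gender", ylabel = "Correlation with sat_score",
    title = "SAT score vs. Gender")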

Although both correlations are very weak, female_per correlates positively with sat_score whereas male_per correlates negatively

**SAT score vs. Female Percentages**

In [None]:
combined.plot(x='female_per', y='sat_score', kind = "scatter", xlabel = "Female percentage", ylabel = "SAT score", 
              title = "SAT score vs. Female Percentages", alpha = 0.3)

Schools where more than 60% of students are female and the SAT score is above 1700

In [None]:
high_female = combined[combined['female_per']>60]
high_SAT_female = high_female[high_female['sat_score']>1700]
high_SAT_female['SCHOOL NAME_x']

Examining how the share of AP test takers relates to a school's SAT score

In [None]:
combined['ap_taker_per'] = combined['AP Test Takers'] / combined['total_enrollment']

**SAT score vs. AP Test Takers Percentage**

In [None]:
combined.plot(x='ap_taker_per', y='sat_score', kind = "scatter", xlabel = "AP Test Takers Percentage", ylabel = "SAT score", 
              title = "SAT score vs. AP Test Takers Percentage", alpha = 0.3)

In [None]:
combined['sat_score'].corr(combined['ap_taker_per'])

Judging by the correlation coefficient, the share of AP test takers at a school does not appear to be strongly related to the school's average SAT score
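
For reference, the coefficient computed above is Pearson's r, the covariance of the two columns divided by the product of their standard deviations. The sketch below reproduces it by hand on made-up numbers and compares the result with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for ap_taker_per and sat_score.
x = pd.Series([0.1, 0.2, 0.3, 0.4])
y = pd.Series([1100.0, 1250.0, 1200.0, 1400.0])

# Deviations from the mean of each series.
dx, dy = x - x.mean(), y - y.mean()

# Pearson's r: summed cross-products over the root of the summed squares.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = x.corr(y)
print(r_manual, r_pandas)
```

A value near 0 indicates little linear relationship, and values near 1 or -1 indicate a strong one. Either way the coefficient describes association, not causation.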

# To-do

* Determining whether there's a correlation between class size and SAT scores  
* Figuring out which neighborhoods have the best schools  
* If we combine this information with a dataset containing property values, we could find the least expensive neighborhoods that have good schools.  
* Investigating the differences between parent, teacher, and student responses to surveys.  
* Assigning scores to schools based on sat_score and other attributes.  