# High Performance Factors in NYC High Schools
### Essence Carson; Jarrell Cooper; Bryan Garcia; Noah Morton
### Los Angeles, Washington D.C., Indianapolis, New York
#### Team #23 

__Sunday, January 31st, 2021__

In [2]:
# Load relevant packages
import re 
import plotly
import plotly.express          as px
import numpy                   as np
import pandas                  as pd
import matplotlib.pyplot       as plt
import matplotlib.ticker       as tick
import seaborn                 as sns
from   sklearn                 import linear_model
from   sklearn.linear_model    import LinearRegression
from   sklearn.impute          import SimpleImputer
from   sklearn.metrics         import r2_score
from scipy                     import stats
from sklearn.decomposition     import PCA
%matplotlib inline
sns.set()
# To be able to see full output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Suppress all warnings
import warnings
warnings.filterwarnings("ignore")  

# School Quality Report (SQR) -  Summary and Student Achievement Data
### demograph_df = sqr_summary + sqr_sa

New York City is the largest public school district in the United States, with 1.1 million students enrolled in K-12 schools across the city. Of these, over **300,000 students (CONFIRM NUMBER)** attend one of the 486 public high schools across the five boroughs. Our project investigates the factors that influence high performance in New York City high schools. All of our project data comes from the New York City Department of Education. We measure high performance by the four-year high school graduation rate. Thus, we begin with our question: **what factors influence high performance in New York City public high schools?** 

####  2018-19 School Quality Reports:
https://infohub.nyced.org/reports/school-quality/school-quality-reports-and-resources/school-quality-report-citywide-data

In [7]:
sqr_xlsx = pd.ExcelFile('data/201819_hs_sqr_results.xlsx')

#read in quality report excel file - summary 
sqr_summary = pd.read_excel(sqr_xlsx,'Summary', header=3, usecols='D:G,AF:AI,AK,AM:AW')
sqr_summary.drop(0, inplace = True)
#read in quality report excel file - student achievement
sqr_sa = pd.read_excel(sqr_xlsx,'Student Achievement', header=3, usecols='D:F,BF:BG,BW,CU,EJ:EK,ER:ES,EZ:FA,FH:FI,FW')
sqr_sa.drop(0, inplace = True)

In [8]:
#merge summary & student achievement from quality report excel file
demograph_df = pd.merge(sqr_summary,sqr_sa, on=['DBN','School Name'], how='inner')
print('sqr_summary dimensions: {}'.format(sqr_summary.shape))
print('sqr_sa dimensions: {}'.format(sqr_sa.shape))
print('demograph_df dimensions: {}'.format(demograph_df.shape))

sqr_summary dimensions: (486, 20)
sqr_sa dimensions: (486, 16)
demograph_df dimensions: (486, 34)
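
The inner merge above keeps all 486 rows, which suggests the two sheets share the same keys. A minimal sketch of how that assumption could be checked explicitly, using made-up toy frames (`summary`, `achievement`, and their values are hypothetical, not taken from the report):

```python
import pandas as pd

# Toy stand-ins for the two report sheets (values are invented for illustration).
summary = pd.DataFrame({
    "DBN": ["01M292", "01M448", "02M047"],
    "School Name": ["A", "B", "C"],
    "Enrollment": [197, 483, 310],
})
achievement = pd.DataFrame({
    "DBN": ["01M292", "01M448", "02M047"],
    "School Name": ["A", "B", "C"],
    "Grad Rate": [0.84, 0.91, 0.77],
})

# validate="one_to_one" makes pandas raise a MergeError if a key pair
# is duplicated on either side, instead of silently multiplying rows.
merged = pd.merge(summary, achievement, on=["DBN", "School Name"],
                  how="inner", validate="one_to_one")

# If the inner merge keeps every row, no school was dropped for lack of a match.
print(merged.shape, len(merged) == len(summary))
```

The `validate` argument is optional; it only adds a safety check and does not change the merged result.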


In [9]:
# Delete unnecessary/duplicate columns
demograph_df = demograph_df.drop(columns = ['School Type_y'])
demograph_df.rename(columns={"School Type_x":"School Type"}, inplace = True)

demograph_df.head(1)

Unnamed: 0,DBN,School Name,School Type,Enrollment,Average Grade 8 English Proficiency,Average Grade 8 Math Proficiency,Percent English Language Learners,Percent Students with Disabilities,Economic Need Index,Percent in Temp Housing,...,Metric Value - Average Regents Score - Algebra I (Common Core),N count - Postsecondary Enrollment Rate - 18 Months,Metric Value - Postsecondary Enrollment Rate - 18 Months,N count - Postsecondary Enrollment Rate - 6 Months,Metric Value - Postsecondary Enrollment Rate - 6 Months,N count - College and Career Preparatory Course Index,Metric Value - College and Career Preparatory Course Index,N count - 4-Year College Readiness Index,Metric Value - 4-Year College Readiness Index,Percentage of Students with 90%+ Attendance
0,01M292,Orchard Collegiate Academy,High School,197.0,2.701,2.325,0.117,0.289,0.871,0.183,...,68.9,38,0.579,29,0.724,28,0.196,28,0.5,0.711


#  School Financial Funding
#### FY 2019 New York State School Funding Transparency Form:
https://infohub.nyced.org/reports/financial/financial-data-and-reports/new-york-state-school-funding-transparency-forms

In [10]:
# read in Financial data
financial_xlsx = pd.ExcelFile('data/NewYorkCitySchoolTransparency201819 A-E web.xlsx')
financial_df = pd.read_excel(financial_xlsx,'Part-C',header=6, usecols = ['School Name','Local School Code',
                       'State & Local Funding',
                       'Federal \nFunding',
                       'Total Funding Source by School', 
                       'Total School Funding per Pupil'])
# Convert to millions
financial_df['Total Funding Source by School']= financial_df['Total Funding Source by School'].div(1000000)
financial_df['State & Local Funding']= financial_df['State & Local Funding'].div(1000000)
financial_df['Federal \nFunding']= financial_df['Federal \nFunding'].div(1000000)
# Convert to thousands
financial_df['Total School Funding per Pupil']= financial_df['Total School Funding per Pupil'].div(1000)

#change local school code to BN and Renaming Columns
financial_df.rename(columns={"Local School Code":"BN",
                             "State & Local Funding":"State & Local Funding (M)",
                             "Federal \nFunding":"Federal Funding (M)", 
                             "Total Funding Source by School":"Total Funding (M)",
                             "Total School Funding per Pupil":"Total Funding per Pupil (K)" 
                            }, inplace = True)
financial_df.head(2)

Unnamed: 0,School Name,BN,State & Local Funding (M),Federal Funding (M),Total Funding (M),Total Funding per Pupil (K)
0,P.S. 001 The Bergen,K001,19.124833,1.702345,20.827178,23.66191
1,Parkside Preparatory Academy,K002,7.642887,0.552985,8.195872,23.231317


# Creating a final merged dataframe  
## df = demograph_df + financial_df

In [12]:
# Take the last 4 characters of DBN (borough letter + school number) to merge with local school code
demograph_df['BN'] = demograph_df['DBN'].str[-4:]

#Merging Financial data and quality report
df = pd.merge(demograph_df, financial_df, on='BN', how='left')

# Removing/Editing duplicate columns
del df['School Name_y']
df.rename(columns={'School Name_x': 'School Name'}, inplace = True)

print('df dimensions: {}'.format(df.shape))

df dimensions: (486, 38)
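
Because this is a left merge, schools without a matching `BN` in the funding file stay in `df` with empty funding columns. A small sketch of how such schools could be listed, assuming toy data (the frames `quality` and `funding` and their values are hypothetical):

```python
import pandas as pd

# Invented example rows; the third school has no funding record.
quality = pd.DataFrame({"DBN": ["01M292", "01M448", "84K704"]})
funding = pd.DataFrame({"BN": ["M292", "M448"],
                        "Total Funding (M)": [4.29, 7.46]})

# Same key construction as in the notebook: last 4 characters of the DBN.
quality["BN"] = quality["DBN"].str[-4:]

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge.
checked = pd.merge(quality, funding, on="BN", how="left", indicator=True)

# Schools that found no partner in the funding table.
unmatched = checked.loc[checked["_merge"] == "left_only", "DBN"].tolist()
print(unmatched)
```

Listing the unmatched keys this way can help decide whether the gaps come from formatting differences in the codes or from schools that are truly absent from the funding file.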


In [13]:
# create function to assign borough based on DBN
def borough_finder (value):
    if value[2] == 'M' :
      return 'Manhattan'
    if value[2] == 'K' :
      return 'Brooklyn'
    if value[2] == 'Q' :
      return 'Queens'
    if value[2] == 'R' :
      return 'Staten Island'
    if value[2] == 'X' :
      return 'Bronx'
    return 'Other'
# create column for Borough
df['Borough'] = df['DBN'].apply (lambda value: borough_finder(value))
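
The same lookup can also be written as a dictionary mapping on the third character of the DBN. This is only an alternative sketch with made-up DBN values; `BOROUGH_BY_CODE` is a name introduced here for illustration:

```python
import pandas as pd

# Borough letter (third character of the DBN) to borough name.
BOROUGH_BY_CODE = {
    "M": "Manhattan",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",
    "X": "Bronx",
}

# Example DBNs; the last one uses an unknown letter on purpose.
dbn = pd.Series(["01M292", "15K001", "08X519", "99Z000"])

# .str[2] extracts the borough letter; unmapped letters become NaN, then "Other".
borough = dbn.str[2].map(BOROUGH_BY_CODE).fillna("Other")
print(borough.tolist())
```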

In [14]:
# Creating datapoints of # of students with an economic need
df['N count - Economic Need Index'] = df['Economic Need Index'] * df['Enrollment']
# Creating datapoints of % of unrepresented race
df['Percent Unrepresented Race'] = df['Percent Asian'] + df['Percent Black'] + df['Percent Hispanic'] + df['Percent White']
df['Percent Unrepresented Race'] = 1 - df['Percent Unrepresented Race']

df[['N count - Economic Need Index','Enrollment','Economic Need Index','Percent Unrepresented Race']].head(2)

Unnamed: 0,N count - Economic Need Index,Enrollment,Economic Need Index,Percent Unrepresented Race
0,171.587,197.0,0.871,0.036
1,400.89,483.0,0.83,0.007


In [15]:
# Creating datapoints of # of students of each race
df['White'] = df['Enrollment']*df['Percent White']
df['Hispanic'] = df['Enrollment']*df['Percent Hispanic']
df['Black'] = df['Enrollment']*df['Percent Black']
df['Asian'] = df['Enrollment']*df['Percent Asian']
df['Unrepresented Race'] =  df['Enrollment'] * df['Percent Unrepresented Race']

# Full Data Frame

In [16]:
df.head(1)

Unnamed: 0,DBN,School Name,School Type,Enrollment,Average Grade 8 English Proficiency,Average Grade 8 Math Proficiency,Percent English Language Learners,Percent Students with Disabilities,Economic Need Index,Percent in Temp Housing,...,Total Funding (M),Total Funding per Pupil (K),Borough,N count - Economic Need Index,Percent Unrepresented Race,White,Hispanic,Black,Asian,Unrepresented Race
0,01M292,Orchard Collegiate Academy,High School,197.0,2.701,2.325,0.117,0.289,0.871,0.183,...,4.293751,28.806738,Manhattan,171.587,0.036,5.91,112.093,52.993,18.912,7.092


**READ ME** To identify factors that may contribute to high performance (read: graduation rate) in New York City public high schools, we searched the New York City Department of Education (NYC DOE) database for datasets that could be useful for our project. We focused on two primary datasets to build a dataframe suited to our analysis.

Our first dataset came from the School Quality Reports, which provide enrollment, school name, DBN (the school's unique identifier), and demographic data for New York City public schools. NYC DOE only collects racial data in the following categories: % White, % Asian, % Black, and % Hispanic. We realized that these categories do not capture mixed racial identities or other identities, so we grouped the remainder as "Unrepresented Race". This dataset also includes the Economic Need Index (ENI), an index created by NYC DOE to measure economic need based on the social situations of students and their families. The four criteria are: 

(1) Is eligible for Human Resources Administration (HRA) benefits (public assistance); 
(2) Has lived in temporary housing in the last four years; 
(3) Has entered the NYC DOE for the first time within the last four years and is enrolled in high school; 
(4) Is an English Language Learner (has a home language other than English).

Thus, a high ENI could be a factor that influences performance in New York City public high schools, because it reflects the social conditions of students and their families. 

However, our first dataset did not include school funding, which may provide more context on how the amount a school receives relates to its performance. As a result, we merged in the School Transparency Dataset from the NYC DOE, matching the last four characters of each school's District Borough Number (DBN) to the local school code, to create a more comprehensive dataframe for our analysis.

# Imputing NaN Values

In [17]:
# Finding only the columns that have NaN values
print('Columns with NaN values')
print()
for columns in df.columns:
    if df[columns].isnull().any() == True:
        print('[',columns,']')
        print('     NaN count:', df[columns].isnull().sum(axis = 0))
      # print('     1st val: ', df[columns].iloc[0]) # Checking what type of values NaN columns have
        print('-------------------------------------------')
# Funding features are the only columns with NaN values.

Columns with NaN values

[ State & Local Funding (M) ]
     NaN count: 65
-------------------------------------------
[ Federal Funding (M) ]
     NaN count: 65
-------------------------------------------
[ Total Funding (M) ]
     NaN count: 65
-------------------------------------------
[ Total Funding per Pupil (K) ]
     NaN count: 65
-------------------------------------------


In [18]:
# changing funding columns into floats
df = df.astype({'State & Local Funding (M)': 'float64',
                'Federal Funding (M)': 'float64',
                'Total Funding (M)': 'float64', 
                'Total Funding per Pupil (K)': 'float64',
              })
pd.options.display.float_format = "{:.3f}".format
df[['State & Local Funding (M)',
    'Federal Funding (M)',
    'Total Funding (M)',
    'Total Funding per Pupil (K)']].describe()

# The mean is greater than the median for State & Local Funding (M), 
# Federal Funding (M), and Total Funding (M).

# replace Nan values with the AVERAGE for Total Funding per Pupil (K)

# replace Nan values with the MEDIAN for State & Local Funding (M), 
# Federal Funding (M), and Total Funding (M)

Unnamed: 0,State & Local Funding (M),Federal Funding (M),Total Funding (M),Total Funding per Pupil (K)
count,421.0,421.0,421.0,421.0
mean,9.625,0.537,10.162,22.732
std,8.622,0.535,9.044,3.631
min,2.528,0.002,2.673,16.068
25%,5.526,0.265,5.837,20.455
50%,6.772,0.405,7.194,22.314
75%,8.978,0.575,9.582,24.229
max,57.611,4.113,58.096,52.241


In [19]:
# Imputing NaN values with median/average
df['State & Local Funding (M)'].fillna((df['State & Local Funding (M)'].median()), inplace=True)
df['Federal Funding (M)'].fillna((df['Federal Funding (M)'].median()), inplace=True)
df['Total Funding (M)'].fillna((df['Total Funding (M)'].median()), inplace=True)
df['Total Funding per Pupil (K)'].fillna((df['Total Funding per Pupil (K)'].mean()), inplace=True)

df[['State & Local Funding (M)','Federal Funding (M)',
    'Total Funding (M)', 'Total Funding per Pupil (K)']].tail(2)

Unnamed: 0,State & Local Funding (M),Federal Funding (M),Total Funding (M),Total Funding per Pupil (K)
484,6.772,0.405,7.194,22.732
485,6.772,0.405,7.194,22.732


After merging the datasets to create a comprehensive dataframe, we noticed some NaN values. In particular, of the 486 New York City public high schools, **65 schools were missing Total Funding data (and therefore "Total Funding per Pupil"). Because about 13% of schools lacked funding data** and the funding distribution is right-skewed (the mean is well above the median), we imputed the NaN values for "State & Local Funding", "Federal Funding", and "Total Funding" with the median, which is less sensitive to outliers. For "Total Funding per Pupil", whose mean and median are close (22.7 vs. 22.3 thousand dollars), we imputed the NaN values with the mean. **It is important to note that ENI and demographic data did not have any NaN values.**
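
To illustrate why the median is the safer fill value for a skewed column, here is a minimal sketch on invented numbers (the series `funding` is hypothetical and only mimics the shape of the funding data, with one very large school):

```python
import numpy as np
import pandas as pd

# Invented funding values in millions: one large outlier and two missing entries.
funding = pd.Series([5.5, 6.8, 7.2, 9.0, 57.6, np.nan, np.nan])

mean_value = funding.mean()      # pulled upward by the 57.6 outlier
median_value = funding.median()  # stays near the typical school

# Share of rows that are missing (in the notebook this is 65 / 486, about 13%).
missing_share = funding.isna().mean()

# Median imputation, as done for the funding totals above.
filled = funding.fillna(median_value)

print(round(mean_value, 2), median_value, round(missing_share, 3))
```

With these toy numbers the mean is more than twice the median, so filling with the mean would assign the missing schools an untypically large budget.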

# Imputing "N<15" Data

In [20]:
# Replace values with the mean
for columns in df.columns:
     if len(df[df[columns] == 'N<15']):
        print('N<15 count: ', df[columns].str.count('N<15').sum())
        # Replacing "N<15" values with NaN
        df[columns] = df[columns].replace('N<15', np.nan)
        print('average:',df[columns].mean())
        # Imputing NaN with the mean.
        df[columns] = df[columns].fillna(df[columns].mean())
        print(df[columns].value_counts().head(3))
        print('------------------------------------------------------------')

N<15 count:  21.0
average: 0.8360559139784945
0.836    21
1.000    20
0.800     7
Name: Metric Value - 4-Year Graduation Rate - All Students, dtype: int64
------------------------------------------------------------
N<15 count:  8.0
average: 73.3673640167364
73.367    8
67.900    6
67.100    6
Name: Metric Value - Average Regents Score - English (Common Core), dtype: int64
------------------------------------------------------------
N<15 count:  30.0
average: 66.48991228070174
66.490    30
62.700     8
60.600     7
Name: Metric Value - Average Regents Score - Algebra I (Common Core), dtype: int64
------------------------------------------------------------
N<15 count:  41.0
average: 0.6978224719101124
0.698    41
0.500     6
0.814     4
Name: Metric Value - Postsecondary Enrollment Rate - 18 Months, dtype: int64
------------------------------------------------------------
N<15 count:  28.0
average: 0.6498668122270742
0.650    28
0.667     6
0.606     5
Name: Metric Value - Postsecondar
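
The output above shows that the imputed mean becomes the most frequent value in each column. A hedged sketch of one way to keep track of which rows were suppressed, using `pd.to_numeric` on a made-up column (`rates` and the `suppressed` flag are names introduced for illustration):

```python
import pandas as pd

# Invented metric column mixing numbers with the "N<15" suppression marker.
rates = pd.Series([0.80, "N<15", 1.00, "N<15", 0.90], dtype=object)

# errors="coerce" turns anything non-numeric (here "N<15") into NaN.
numeric = pd.to_numeric(rates, errors="coerce")

# Remember which rows were suppressed before filling them in.
suppressed = numeric.isna()

# Mean imputation, matching the approach used in the notebook.
imputed = numeric.fillna(numeric.mean())

print(int(suppressed.sum()), imputed.round(3).tolist())
```

Keeping the boolean flag makes it possible to exclude or highlight imputed schools later, for example when plotting graduation rates.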

# Imputing "." Values

In [21]:
# Replace values with the mean
for columns in df.columns:
     if len(df[df[columns] == '.']):
        print('"." count: ', df[columns].str.count(r'\.').sum())
        # Replacing "." values with NaN
        df[columns] = df[columns].replace('.', np.nan)
        print('average:',df[columns].mean())
        # Imputing NaN with the mean.
        df[columns] = df[columns].fillna(df[columns].mean())
        print(df[columns].value_counts().head(3))
        print('------------------------------------------------------------')

"." count:  66.0
average: 6.735476190476191
6.735    66
6.100    16
7.900    15
Name: Years of principal experience at this school, dtype: int64
------------------------------------------------------------
"." count:  65.0
average: 0.7534513064133016
0.753    65
0.750    11
0.800     8
Name: Percent of teachers with 3 or more years of experience, dtype: int64
------------------------------------------------------------
"." count:  9.0
average: 0.8749371069182389
0.875    9
0.911    8
0.849    8
Name: Student Attendance Rate, dtype: int64
------------------------------------------------------------
"." count:  10.0
average: 0.3579579831932773
0.358    10
0.513     6
0.472     5
Name: Percent of Students Chronically Absent, dtype: int64
------------------------------------------------------------
"." count:  65.0
average: 0.9663016627078385
0.966    65
0.967    29
0.965    22
Name: Teacher Attendance Rate, dtype: int64
------------------------------------------------------------
"." coun

In [22]:
# Renaming Columns
df.rename(columns = {
'Economic Need Index':'Economic Need Index (%)',
'Metric Value - 4-Year Graduation Rate - All Students':'Graduation Rate (%)',
'Metric Value - 4-Year College Readiness Index':'College Readiness Index (%)',  
'Metric Value - College and Career Preparatory Course Index':'College & Career Preparatory Course Index (%)',       
'Metric Value - Average Regents Score - English (Common Core)':'Avg Regents Score - English (%)',   
'Metric Value - Average Regents Score - Algebra I (Common Core)': 'Avg Regents Score - Algebra I (%)',
'Metric Value - Postsecondary Enrollment Rate - 6 Months':'Postsecondary Enrollment - 6 Months (%)',
'Metric Value - Postsecondary Enrollment Rate - 18 Months':'Postsecondary Enrollment - 18 Months (%)',         
'Percent Asian':'Asian (%)','Percent Black':'Black (%)','Percent Hispanic':'Hispanic (%)','Percent White':'White (%)',
'Percent Unrepresented Race':'Unrepresented Race (%)',
'N count - 4-Year Graduation Rate - All Students':'Students Graduated',
'N count - 4-Year College Readiness Index': 'College Readiness Index',
'N count - College and Career Preparatory Course Index': 'College & Career Preparatory Course Index',
'N count - Postsecondary Enrollment Rate - 6 Months': 'Postsecondary Enrollment - 6 Months',
'N count - Postsecondary Enrollment Rate - 18 Months': 'Postsecondary Enrollment - 18 Months',
'N count - Economic Need Index': 'Economic Need Index',
'Years of principal experience at this school':'Years of principal experience', 
'Average Grade 8 English Proficiency':'Avg Grade 8 English Proficiency',
'Average Grade 8 Math Proficiency':'Avg Grade 8 Math Proficiency',    
'Teacher Attendance Rate':'Teacher Attendance (%)',
'Student Attendance Rate':'Student Attendance (%)',
'Percentage of Students with 90%+ Attendance': 'Students with 90%+ Attendance (%)',
'Percent English Language Learners': 'English Language Learners (%)',
'Percent Students with Disabilities': 'Students with Disabilities (%)',
'Percent in Temp Housing':'Temp Housing (%)',
'Percent HRA Eligible': 'HRA Eligible (%)',
'Percent of teachers with 3 or more years of experience':'Teachers w/ 3+ years of experience (%)',
'Percent of Students Chronically Absent': 'Students Chronically Absent (%)'},inplace = True)

In [23]:
# Creating a classifying feature: High (>= 95%), Low (<= 80%), or Medium graduation
grad = []  
for value in df['Graduation Rate (%)']:
    if value >= .95:
        grad.append('High Graduation')
    elif value <= 0.80:
        grad.append('Low Graduation')
    else:
        grad.append('Medium Graduation')
df['Graduation'] = grad 
df['Graduation'].value_counts()

Medium Graduation    215
Low Graduation       172
High Graduation       99
Name: Graduation, dtype: int64
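
The same three-way labeling can be expressed without an explicit loop. The following is an alternative sketch using `np.select` on invented rates; it uses the same thresholds as the cell above (high at 0.95 or more, low at 0.80 or less):

```python
import numpy as np
import pandas as pd

# Invented graduation rates, including values exactly on both thresholds.
rate = pd.Series([0.95, 0.80, 0.836, 1.00, 0.50])

# Conditions are checked in order; rows matching none get the default label.
labels = np.select(
    [rate >= 0.95, rate <= 0.80],
    ["High Graduation", "Low Graduation"],
    default="Medium Graduation",
)

print(list(labels))
```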

In [24]:
# Reordering all columns
df = df[['DBN','BN','School Name','School Type','Borough','Enrollment',
'State & Local Funding (M)','Federal Funding (M)','Total Funding (M)',
'Total Funding per Pupil (K)',
'Economic Need Index (%)',
'Graduation Rate (%)',
'College Readiness Index (%)',  
'College & Career Preparatory Course Index (%)',       
'Avg Regents Score - English (%)',    
'Avg Regents Score - Algebra I (%)',
'Postsecondary Enrollment - 6 Months (%)',
'Postsecondary Enrollment - 18 Months (%)',
'Asian (%)','Black (%)','Hispanic (%)','White (%)','Unrepresented Race (%)',         
'Students Graduated',
'College Readiness Index',
'College & Career Preparatory Course Index',
'Postsecondary Enrollment - 6 Months',
'Postsecondary Enrollment - 18 Months',
'Economic Need Index',
'White','Hispanic','Black','Asian','Unrepresented Race',
'Years of principal experience',
'Avg Grade 8 English Proficiency',
'Avg Grade 8 Math Proficiency',
'Teacher Attendance (%)','Student Attendance (%)',
'Students with 90%+ Attendance (%)',         
'English Language Learners (%)',
'Students with Disabilities (%)',
'Temp Housing (%)',
'HRA Eligible (%)',
'Teachers w/ 3+ years of experience (%)',
'Students Chronically Absent (%)',
'Graduation']]

# Establishing Dtype

In [25]:
# Rounding and changing data types to INT
int_vars =['Enrollment','Students Graduated','College Readiness Index',                          
'College & Career Preparatory Course Index','White','Hispanic',
'Black','Asian','Unrepresented Race','Postsecondary Enrollment - 6 Months',
'Postsecondary Enrollment - 18 Months','Economic Need Index']
for column in df[int_vars].columns:
    df[column] = df[column].round(decimals=0)
    df[column] = df[column].astype(int)   
# Change borough data type as category
df['Borough'] = df['Borough'].astype('category')

In [26]:
# Round all floats
for col in df.columns:
    if df[col].dtype == 'float':
        df[col] = df[col].round(3)

# Completion of Data Cleaning

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 486 entries, 0 to 485
Data columns (total 47 columns):
 #   Column                                         Non-Null Count  Dtype   
---  ------                                         --------------  -----   
 0   DBN                                            486 non-null    object  
 1   BN                                             486 non-null    object  
 2   School Name                                    486 non-null    object  
 3   School Type                                    486 non-null    object  
 4   Borough                                        486 non-null    category
 5   Enrollment                                     486 non-null    int64   
 6   State & Local Funding (M)                      486 non-null    float64 
 7   Federal Funding (M)                            486 non-null    float64 
 8   Total Funding (M)                              486 non-null    float64 
 9   Total Funding per Pupil (K)                

In [28]:
df.to_csv('data/clean_data.csv',index=False, header=True)

#### NYC Geography Questions
- In what way can we see location visualized in the data?
  - Use the DBN to group high schools by borough.
  - Then use the DBN to focus on schools by district.
- How does the graduation rate differ among racial categories?
- How is the "average Regents completion rate" distributed across racial categories?

# Merging completed dataframe with school locations data

#### Kaggle:
https://www.kaggle.com/new-york-city/ny-2010-2016-school-safety-report/home

In [30]:
geo_csv = pd.read_csv('data/2010-2016-School-Safety-Report.csv', header=0, 
                           usecols=[0,2,3,5,25,26,27,33])                  
geo_csv

Unnamed: 0,School Year,DBN,Location Name,Address,Postcode,Latitude,Longitude,NTA
0,2013-14,15K001,P.S. 001 The Bergen,309 47 STREET,11220.000,40.649,-74.012,Sunset Park West ...
1,2013-14,17K002,Parkside Preparatory Academy,655 PARKSIDE AVENUE,11226.000,40.656,-73.952,Prospect Lefferts Gardens-Wingate ...
2,2013-14,75K141,P.S. K141,655 PARKSIDE AVENUE,11226.000,40.656,-73.952,Prospect Lefferts Gardens-Wingate ...
3,2013-14,84K704,Explore Charter School,655 PARKSIDE AVENUE,11226.000,40.656,-73.952,Prospect Lefferts Gardens-Wingate ...
4,2013-14,,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,655 PARKSIDE AVENUE,11226.000,40.656,-73.952,Prospect Lefferts Gardens-Wingate ...
...,...,...,...,...,...,...,...,...
6305,2015-16,08X519,Felisa Rincon de Gautier Institute for Law and...,1440 STORY AVENUE,10473.000,40.821,-73.881,Soundview-Castle Hill-Clason Point-Harding Par...
6306,2015-16,08X537,Bronx Arena High School,1440 STORY AVENUE,10473.000,40.821,-73.881,Soundview-Castle Hill-Clason Point-Harding Par...
6307,2015-16,,1440 STORY AVENUE CONSOLIDATED LOCATION,1440 STORY AVENUE,10473.000,40.821,-73.881,Soundview-Castle Hill-Clason Point-Harding Par...
6308,2015-16,12X271,East Bronx Academy for the Future,1716 SOUTHERN BOULEVARD,10460.000,40.836,-73.888,Crotona Park East ...


In [33]:
# Merging quality report and locations data
geo_df = pd.merge(df, geo_csv, on='DBN', how='left')
# Keeping only rows from the most recent school year (2015-16); schools without a 2015-16 record are dropped
geo_df = geo_df[geo_df['School Year'].astype(str).str.contains('2015-16')]
# Seeing how many rows have missing location data
print('Rows with missing location data:', geo_df.isnull().any(axis = 1).sum())
print()
print('Original Completed Data frame:', df.shape)
print('Final Completed Data frame', geo_df.shape)
print('```````````````````````````````````````````````````')
print('[Around 50 rows(Schools) has missing location data]')
geo_df.dropna(axis='rows',inplace=True)
geo_df['Postcode'] = geo_df['Postcode'].round(decimals=0)
geo_df['Postcode'] = geo_df['Postcode'].astype('int64')
geo_df.head(2)

Rows with missing location data: 1

Original Completed Data frame: (486, 47)
Final Completed Data frame (437, 54)
```````````````````````````````````````````````````
[Around 50 rows(Schools) has missing location data]


Unnamed: 0,DBN,BN,School Name,School Type,Borough,Enrollment,State & Local Funding (M),Federal Funding (M),Total Funding (M),Total Funding per Pupil (K),...,Teachers w/ 3+ years of experience (%),Students Chronically Absent (%),Graduation,School Year,Location Name,Address,Postcode,Latitude,Longitude,NTA
2,01M292,M292,Orchard Collegiate Academy,High School,Manhattan,197,4.139,0.155,4.294,28.807,...,0.75,0.289,Medium Graduation,2015-16,Henry Street School for International Studies,220 HENRY STREET,10002,40.714,-73.986,Lower East Side ...
5,01M448,M448,University Neighborhood High School,High School,Manhattan,483,6.908,0.548,7.456,21.152,...,0.6,0.274,Medium Graduation,2015-16,University Neighborhood High School,200 MONROE STREET,10002,40.712,-73.984,Lower East Side ...
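
Filtering on `'2015-16'` removes any school that has no record for that year, and the row count above falls from 486 to 437. One possible alternative, sketched here on invented rows, is to keep each school's most recent available record instead (the frame `locations` and its values are hypothetical):

```python
import pandas as pd

# Invented location records: the second school has no 2015-16 row.
locations = pd.DataFrame({
    "School Year": ["2013-14", "2014-15", "2015-16", "2013-14", "2014-15"],
    "DBN": ["01M292", "01M292", "01M292", "01M448", "01M448"],
    "Latitude": [40.714, 40.714, 40.714, 40.712, 40.712],
})

# Sort by year (the "YYYY-YY" strings sort chronologically), then keep
# the last row per DBN, which is that school's most recent record.
latest = (
    locations.dropna(subset=["DBN"])
    .sort_values("School Year")
    .drop_duplicates("DBN", keep="last")
)

print(dict(zip(latest["DBN"], latest["School Year"])))
```

Whether an older address is acceptable depends on the analysis; for mapping schools by neighborhood it may be preferable to losing the school entirely.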


In [34]:
geo_df.to_csv('data/clean_geo_data.csv',index=False, header=True)

In [35]:
#geo_df['Postcode']
#final_df.drop(columns=['Graduation'],inplace=True)

2       10002
5       10002
8       10009
11      10002
14      10002
        ...  
1285    10029
1292    10030
1300    10456
1311    10463
1312    10463
Name: Postcode, Length: 436, dtype: int64

NameError: name 'final_df' is not defined