# Import NYC Dataset



In [25]:
import pandas as pd
import re


data_files = [
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/ap_2010.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/class_size.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/demographics.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/graduation.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/hs_directory.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/sat_results.csv",
]
data = {}

for d in data_files:
    file = pd.read_csv(d)
    #get "name" without .csv
    name = d.rsplit('data/')[1]
    name = name.replace(".csv", "")
    data[name]=file
    
#survey files    
data_files = [
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/survey_all.txt",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/survey_d75.txt",
]

all_survey = pd.read_csv(data_files[0],delimiter="\t",encoding="windows-1252")
d75_survey = pd.read_csv(data_files[1],delimiter="\t",encoding="windows-1252")    
    
survey = pd.concat([all_survey,d75_survey],axis=0)

survey = survey.copy()
survey['DBN'] = survey['dbn']

cols = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]

#Filter survey so it only contains the columns we listed above
survey = survey[cols]

#Assign the dataframe survey to the key survey in the dictionary data.
data['survey'] = survey

data['hs_directory']['DBN'] = data['hs_directory']['dbn']

#if string is 2 digits long - return the string
#if string is less than 2 digits - fills with 0 in the front
data['class_size']['padded_csd'] = data['class_size']['CSD'].apply(lambda x: str(x).zfill(2))

data['class_size']['DBN'] =  data['class_size']['padded_csd'] + data['class_size']['SCHOOL CODE']

#convert to numeric data
data['sat_results']['SAT Math Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Math Avg. Score'],errors="coerce")

data['sat_results']['SAT Critical Reading Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Critical Reading Avg. Score'],errors="coerce")

data['sat_results']['SAT Writing Avg. Score']= pd.to_numeric(data['sat_results']['SAT Writing Avg. Score'],errors="coerce")


#sum up all the three columns
data['sat_results']['sat_score'] = data['sat_results']['SAT Math Avg. Score'] + data['sat_results']['SAT Critical Reading Avg. Score'] + data['sat_results']['SAT Writing Avg. Score']



def get_lat(x):
    #extract raw coordinates
    y = re.findall(r"\(.+\)", x)
    
    #split lat and lon. remove '(' for latitude
    lat = y[0].split(',')[0].replace("(","")
    return lat

def find_lon(x):
    #extract raw coordinates
    y = re.findall(r"\(.+\)", x)
    
    #split lat and lon. remove ')' for longitude
    lon = y[0].split(',')[1].replace(")","").strip()
    return lon

data['hs_directory']['lon'] = data['hs_directory']['Location 1'].apply(find_lon)
data['hs_directory']['lat'] = data['hs_directory']['Location 1'].apply(get_lat)


#convert coordinates to numeric
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")

data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")






# 1) Introduction

In this lesson, we'll continue cleaning the NYC data and then combine the datasets. Finally, we'll compute correlations and perform some analysis.

The first thing we'll need to do in preparation for the merge is condense some of the datasets. In the last lesson, we noticed that the values in the `DBN` column were unique in the `sat_results` dataset. Other datasets, like `class_size`, had duplicate `DBN` values.

While the main dataset we want to analyze, `sat_results`, has a unique `DBN` value for every high school in New York City, other datasets aren't as clean. A single row in the `sat_results` dataset may match multiple rows in the `class_size` dataset, for example. This creates a problem, because we don't know which of the multiple entries in the `class_size` dataset we should combine with the single matching entry in `sat_results`. Here's a diagram that illustrates the problem:

![alt text](image-1.png)

In the diagram above, we can't combine the rows from both datasets, because there are several cases where multiple rows in `class_size` match a single row in `sat_results`.
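To make the one-to-many problem concrete, here is a minimal sketch on toy data. The two small dataframes and their values are made up for illustration and only borrow the column names from the real datasets.

```python
import pandas as pd

# Toy stand-ins for the real datasets (values are invented for illustration).
sat_results = pd.DataFrame({
    "DBN": ["01M292", "01M448"],
    "sat_score": [1100.0, 1200.0],
})
class_size = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "AVERAGE CLASS SIZE": [21.0, 26.0, 22.0],
})

# DBN is unique in one table but repeated in the other.
print(sat_results["DBN"].is_unique)
print(class_size["DBN"].is_unique)

# A naive merge repeats each sat_results row once per matching class_size row,
# so the merged table has more rows than sat_results.
merged = sat_results.merge(class_size, on="DBN", how="left")
print(len(sat_results), len(merged))
```

In this sketch the school `01M292` ends up with two rows after the merge, which is the ambiguity that condensing each dataset to one row per `DBN` avoids.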

To resolve this issue, we'll condense the `class_size`, `graduation`, and `demographics` datasets so that each `DBN` is unique.

# 2 - 3) Condensing the Class Size Dataset

The first dataset that we'll condense is `class_size`. The first few rows of `class_size` look like this:

In [20]:
data['class_size'].head()

Unnamed: 0,CSD,BOROUGH,SCHOOL CODE,SCHOOL NAME,GRADE,PROGRAM TYPE,CORE SUBJECT (MS CORE and 9-12 ONLY),CORE COURSE (MS CORE and 9-12 ONLY),SERVICE CATEGORY(K-9* ONLY),NUMBER OF STUDENTS / SEATS FILLED,NUMBER OF SECTIONS,AVERAGE CLASS SIZE,SIZE OF SMALLEST CLASS,SIZE OF LARGEST CLASS,DATA SOURCE,SCHOOLWIDE PUPIL-TEACHER RATIO,padded_csd,DBN
0,1,M,M015,P.S. 015 Roberto Clemente,0K,GEN ED,-,-,-,19.0,1.0,19.0,19.0,19.0,ATS,,1,01M015
1,1,M,M015,P.S. 015 Roberto Clemente,0K,CTT,-,-,-,21.0,1.0,21.0,21.0,21.0,ATS,,1,01M015
2,1,M,M015,P.S. 015 Roberto Clemente,01,GEN ED,-,-,-,17.0,1.0,17.0,17.0,17.0,ATS,,1,01M015
3,1,M,M015,P.S. 015 Roberto Clemente,01,CTT,-,-,-,17.0,1.0,17.0,17.0,17.0,ATS,,1,01M015
4,1,M,M015,P.S. 015 Roberto Clemente,02,GEN ED,-,-,-,15.0,1.0,15.0,15.0,15.0,ATS,,1,01M015


As you can see, the first few rows all pertain to the same school, which is why the `DBN` appears more than once. It looks like each school has multiple values for `GRADE`, `PROGRAM TYPE`, `CORE SUBJECT (MS CORE and 9-12 ONLY)`, and `CORE COURSE (MS CORE and 9-12 ONLY)`.

If we look at the unique values for `GRADE`, we get the following:

```
array(['0K', '01', '02', '03', '04', '05', '0K-09', nan, '06', '07', '08',
       'MS Core', '09-12', '09'], dtype=object)
```

Since we're dealing with high schools, we're only concerned with grades 9 through 12. That means we only want to pick rows where the value in the `GRADE` column is `09-12`.

If we look at the unique values for `PROGRAM TYPE`, we get the following:

```
array(['GEN ED', 'CTT', 'SPEC ED', nan, 'G&T'], dtype=object)
```

Each school can have multiple program types. Since `GEN ED` is the largest category by far, let's only select rows where `PROGRAM TYPE` = `GEN ED`.
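The unique-value listings above, and the claim about which category is largest, can be checked with `Series.unique()` and `Series.value_counts()`. The sketch below runs on a small made-up dataframe (the `toy` name and its rows are assumptions for illustration); on the real data you would apply the same calls to `data['class_size']`.

```python
import pandas as pd

# Made-up rows that mimic the two columns discussed above.
# Note the trailing space in "GRADE ", as in the real dataset.
toy = pd.DataFrame({
    "GRADE ": ["0K", "09-12", "09-12", "09-12", "MS Core"],
    "PROGRAM TYPE": ["GEN ED", "GEN ED", "CTT", "GEN ED", "SPEC ED"],
})

# Distinct values, in order of first appearance.
grades = toy["GRADE "].unique()
print(grades)

# Row counts per category, sorted from most to least frequent.
program_counts = toy["PROGRAM TYPE"].value_counts()
print(program_counts)
```

`value_counts()` is a quick way to confirm which program type dominates before deciding to keep only that one.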


# 3) Instructions

1. Create a new variable called `class_size` and assign the value of `data["class_size"]` to it.

1. Filter `class_size` so the `GRADE ` column only contains the value `09-12`. Note that the name of the `GRADE ` column has a space at the end; you'll get a `KeyError` if you don't include it.

1. Filter `class_size` so that the `PROGRAM TYPE` column only contains the value `GEN ED`.

1. Display the first five rows of `class_size` to verify.
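Trailing whitespace in a column name is easy to miss in printed output. One way to spot it, sketched here on a made-up two-row dataframe (the `toy` name and its values are assumptions for illustration), is to print the `repr()` of each column name or to compare each name with its stripped form:

```python
import pandas as pd

# Made-up frame whose first column name ends with a space.
toy = pd.DataFrame({
    "GRADE ": ["09-12", "0K"],
    "PROGRAM TYPE": ["GEN ED", "CTT"],
})

# repr() puts quotes around each name, which makes trailing spaces visible.
print([repr(c) for c in toy.columns])

# Names that change when stripped contain leading or trailing whitespace.
padded = [c for c in toy.columns if c != c.strip()]
print(padded)

# Selecting with the exact name (including the space) works.
high_school = toy[toy["GRADE "] == "09-12"]
print(high_school)
```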

In [21]:
class_size = data["class_size"]

#Filter class_size so the GRADE = 09-12
#Filter class_size so that the PROGRAM = GEN ED
class_size = class_size[ (class_size["GRADE "]=="09-12") & (class_size["PROGRAM TYPE"]=="GEN ED")]

class_size.head()

Unnamed: 0,CSD,BOROUGH,SCHOOL CODE,SCHOOL NAME,GRADE,PROGRAM TYPE,CORE SUBJECT (MS CORE and 9-12 ONLY),CORE COURSE (MS CORE and 9-12 ONLY),SERVICE CATEGORY(K-9* ONLY),NUMBER OF STUDENTS / SEATS FILLED,NUMBER OF SECTIONS,AVERAGE CLASS SIZE,SIZE OF SMALLEST CLASS,SIZE OF LARGEST CLASS,DATA SOURCE,SCHOOLWIDE PUPIL-TEACHER RATIO,padded_csd,DBN
225,1,M,M292,Henry Street School for International Studies,09-12,GEN ED,ENGLISH,English 9,-,63.0,3.0,21.0,19.0,25.0,STARS,,1,01M292
226,1,M,M292,Henry Street School for International Studies,09-12,GEN ED,ENGLISH,English 10,-,79.0,3.0,26.3,24.0,31.0,STARS,,1,01M292
227,1,M,M292,Henry Street School for International Studies,09-12,GEN ED,ENGLISH,English 11,-,38.0,2.0,19.0,16.0,22.0,STARS,,1,01M292
228,1,M,M292,Henry Street School for International Studies,09-12,GEN ED,ENGLISH,English 12,-,69.0,3.0,23.0,13.0,30.0,STARS,,1,01M292
229,1,M,M292,Henry Street School for International Studies,09-12,GEN ED,MATH,Integrated Algebra,-,53.0,3.0,17.7,16.0,21.0,STARS,,1,01M292


# 4 - 5) Computing Average Class Sizes

As we saw when we displayed `class_size` in the last exercise, `DBN` still isn't completely unique. This is due to the `CORE COURSE (MS CORE and 9-12 ONLY)` and `CORE SUBJECT (MS CORE and 9-12 ONLY)` columns.

`CORE COURSE (MS CORE and 9-12 ONLY)` and `CORE SUBJECT (MS CORE and 9-12 ONLY)` seem to pertain to different kinds of classes. For example, here are the unique values for `CORE SUBJECT (MS CORE and 9-12 ONLY)`:

```
array(['ENGLISH', 'MATH', 'SCIENCE', 'SOCIAL STUDIES'], dtype=object)
```

This column only seems to include **certain subjects**. We want our class size data to **include every single class a school offers** -- not just a subset of them. What we can do is take the average across all of the classes a school offers. This gives us unique `DBN` values, while also incorporating as much data as possible into the average.

Fortunately, we can use the `pandas.DataFrame.groupby()` method to help us with this. The `DataFrame.groupby()` method splits a dataframe into groups based on the unique values in a given column. We can then use the `agg()` method on the resulting `DataFrameGroupBy` object to find the **mean** of each column.

Using the `groupby()` method, we'll split this dataframe into four separate groups -- one with the `DBN` `01M292`, one with the `DBN` `01M332`, one with the `DBN` `01M378`, and one with the `DBN` `01M448`:

![alt text](image-2.png)

![alt text](image-3.png)

Then, we can compute the averages for the `AVERAGE CLASS SIZE` column in each of the four groups using the `agg()` method:

![alt text](image-4.png)

After we group a dataframe and aggregate data based on it, the column we performed the grouping on (in this case `DBN`) **becomes the index** and no longer appears as a column in the data itself. To undo this change and keep `DBN` as a column, we'll need to use `pandas.DataFrame.reset_index()`. This method **resets the index** to a list of integers and makes `DBN` a column again.
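The group, aggregate, and reset steps described above can be sketched on a small made-up dataframe. The `toy` frame, its shortened `CORE SUBJECT` column name, and its values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Made-up rows: two schools, two subjects each.
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M332", "01M332"],
    "CORE SUBJECT": ["ENGLISH", "MATH", "ENGLISH", "MATH"],
    "AVERAGE CLASS SIZE": [20.0, 24.0, 21.0, 23.0],
})

# Group by DBN and average the numeric column within each group.
grouped = toy.groupby("DBN")[["AVERAGE CLASS SIZE"]].agg("mean")

# After aggregation, DBN is the index rather than a regular column.
print(grouped.index.name)
print("DBN" in grouped.columns)

# reset_index() restores DBN as a column with a fresh integer index.
flat = grouped.reset_index()
print(flat)
```

Each `DBN` now appears exactly once in `flat`, which is the shape needed before merging with `sat_results`.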

In [22]:
class_size = class_size.groupby("DBN").agg('mean',numeric_only=True)
class_size.reset_index(inplace=True)

data['class_size']=class_size

data['class_size']

Unnamed: 0,DBN,CSD,NUMBER OF STUDENTS / SEATS FILLED,NUMBER OF SECTIONS,AVERAGE CLASS SIZE,SIZE OF SMALLEST CLASS,SIZE OF LARGEST CLASS,SCHOOLWIDE PUPIL-TEACHER RATIO
0,01M292,1.0,88.000000,4.000000,22.564286,18.500000,26.571429,
1,01M332,1.0,46.000000,2.000000,22.000000,21.000000,23.500000,
2,01M378,1.0,33.000000,1.000000,33.000000,33.000000,33.000000,
3,01M448,1.0,105.687500,4.750000,22.231250,18.250000,27.062500,
4,01M450,1.0,57.600000,2.733333,21.200000,19.400000,22.866667,
...,...,...,...,...,...,...,...,...
578,32K549,32.0,71.066667,3.266667,22.760000,19.866667,25.866667,
579,32K552,32.0,102.375000,4.312500,23.900000,19.937500,28.000000,
580,32K554,32.0,66.937500,3.812500,17.793750,14.750000,21.625000,
581,32K556,32.0,132.333333,5.400000,25.060000,18.333333,30.000000,
