# 1) Introduction

You'll find the ability to create data science projects useful in several different contexts:

* Projects help you build a portfolio, which is critical to finding a job as a data analyst or scientist.
* Working on projects helps you learn new skills and reinforce existing concepts.
* Most "real-world" data science and analysis work consists of developing internal projects.
* Projects allow you to investigate interesting phenomena and satisfy your curiosity.

In this lesson, we'll walk through the first part of a complete data science project, including how to acquire the raw data. The project focuses on exploring and analyzing a dataset. Along the way, we'll develop the data cleaning and storytelling skills that enable us to build complete projects on our own.

The first step in creating a project is to *decide on a topic*. You want the topic to be something you're interested in and motivated to explore. 

Here are two ways to find a good topic:

* Think about what sectors or angles you're really interested in, then find datasets relating to those sectors.
* Review several datasets and find one that seems interesting enough to explore.

Whichever approach you take, you can start your search at these sites:

* [Data.gov](https://data.gov/) - A directory of U.S. government data downloads

* [/r/datasets](https://www.reddit.com/r/datasets/) - A subreddit that has hundreds of interesting datasets

* [Awesome datasets](https://github.com/awesomedata/awesome-public-datasets) - A list of datasets hosted on GitHub

* [rs.io](https://rs.io/100-interesting-data-sets-for-statistics/) - A great blog post with hundreds of interesting datasets

For the purposes of this project, we'll be using data about **New York City public schools**, which can be found [here](https://data.cityofnewyork.us/browse?category=Education).

# 2) Finding All of the Relevant Datasets

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests and whether they're unfair to certain groups. Given our prior knowledge of this topic, investigating the correlations between **SAT scores** and demographics might be an interesting angle to take. We could correlate SAT scores with factors like race, gender, income, and more.

The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's important to perform well.

The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the dataset for our project is based on 2,400 total points). Organizations often rank high schools by their average **SAT scores**. The scores are also considered a measure of overall school district quality.

New York City makes its data on high school [SAT scores available online](https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4/about_data), as well as the [demographics for each high school](https://data.cityofnewyork.us/Education/2014-2015-DOE-High-School-Directory/n3p6-zve2/about_data). The first few rows of the SAT data look like this:

![Alt text](https://s3.amazonaws.com/dq-content/136/sat.png)

Unfortunately, combining both of the datasets won't give us all of the demographic information we want to use. We'll need to supplement our data with other sources to do our full analysis.

The same website has several related datasets covering demographic information and test scores. Here are the links to all of the datasets we'll be using:

* [SAT scores by school](https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4) - SAT scores for each high school in New York City

* [School attendance](https://data.cityofnewyork.us/Education/2010-2011-School-Attendance-and-Enrollment-Statist/7z8d-msnt/about_data) - Attendance information for each school in New York City

* [Class size](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3/about_data) - Information on class size for each school

* [AP test results](https://data.cityofnewyork.us/Education/2010-AP-College-Board-School-Level-Results/itfs-ms3e) - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)

* [Graduation outcomes](https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a) - The percentage of students who graduated, and other outcome information

* [Demographics](https://data.cityofnewyork.us/Education/2006-2012-School-Demographics-and-Accountability-S/ihfw-zy9j/about_data) - Demographic information for each school

* [School survey](https://data.cityofnewyork.us/Education/2011-NYC-School-Survey/mnz3-dyi8/about_data) - Surveys of parents, teachers, and students at each school

All of these datasets are interrelated. We'll need to combine them into a single dataset before we can find correlations.

# 3) Finding Background Information

Before we move into coding, we'll need to do some background research. A thorough understanding of the data helps us avoid costly mistakes, such as thinking that a column represents something other than what it does. Background research gives us a better understanding of how to combine and analyze the data.

In this case, we'll want to research:

* New York City
* The SAT
* Schools in New York City
* Our data

We can learn a few different things from these resources. For example:

* Only high school students take the SAT, so we'll want to focus on **high schools**.
* New York City is made up of **five boroughs**, which are essentially distinct regions.
* New York City schools fall within several different school districts, each of which can contain dozens of schools.
* Our datasets include several different types of schools. We'll need to clean them so that we can focus on high schools only.
* Each school in New York City has a unique code called a DBN, or district borough number.
* Aggregating data by district allows us to use the district mapping data to plot district-by-district differences.

# 4) Reading in the Data

Here are all of the files in the folder:

* ap_2010.csv - Data on [AP test results](https://data.cityofnewyork.us/Education/2010-AP-College-Board-School-Level-Results/itfs-ms3e/about_data)

* class_size.csv - Data on [class size](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3/about_data)

* demographics.csv - Data on [demographics](https://data.cityofnewyork.us/Education/2006-2012-School-Demographics-and-Accountability-S/ihfw-zy9j/about_data)

* graduation.csv - Data on [graduation outcomes](https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a)

* hs_directory.csv - A directory of [High schools](https://data.cityofnewyork.us/Education/DOE-High-School-Directory-2014-2015/n3p6-zve2)

* sat_results.csv - Data on [SAT scores](https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4)

* survey_all.txt - Data on [surveys](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) from all schools

* survey_d75.txt - Data on [surveys](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) from New York City district 75

## Instructions

* Read each of the files in the list `data_files` into a pandas dataframe using the `pandas.read_csv()` function.

* Add each of the dataframes to the dictionary `data`, using the base of the filename as the key. For example, you'd enter `ap_2010` for the file `ap_2010.csv`.

* Afterwards, `data` should have the following keys:

    * `ap_2010`
    * `class_size`
    * `demographics`
    * `graduation`
    * `hs_directory`
    * `sat_results`

In addition, each key in `data` should have the corresponding dataframe as its value.
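One way to derive the dictionary key is to take the last component of the path or URL and drop its extension. The following is a minimal sketch using only the standard library; `dataset_key` is a hypothetical helper name, and the example URL is a placeholder rather than one of the lesson's files.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def dataset_key(path_or_url):
    """Return the file name without its extension, e.g. 'ap_2010'."""
    # urlparse isolates the path part, so a query string would not end up in the key
    path = urlparse(path_or_url).path
    return PurePosixPath(path).stem


print(dataset_key("https://example.com/data/ap_2010.csv"))  # ap_2010
print(dataset_key("data/class_size.csv"))                   # class_size
```

This works the same way for local paths and URLs, which can be convenient if the files move from a folder to a remote location.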

In [1]:
import pandas as pd



data_files = [
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/ap_2010.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/class_size.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/demographics.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/graduation.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/hs_directory.csv",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/sat_results.csv",
]
data = {}

for url in data_files:
    df = pd.read_csv(url)
    # Use the file name without its ".csv" extension as the key, e.g. "ap_2010"
    name = url.rsplit("/", 1)[1].replace(".csv", "")
    data[name] = df
    
data

{'ap_2010':         DBN                                         SchoolName  \
 0    01M448                       UNIVERSITY NEIGHBORHOOD H.S.   
 1    01M450                             EAST SIDE COMMUNITY HS   
 2    01M515                                LOWER EASTSIDE PREP   
 3    01M539                     NEW EXPLORATIONS SCI,TECH,MATH   
 4    02M296              High School of Hospitality Management   
 ..      ...                                                ...   
 253  31R605                         STATEN ISLAND TECHNICAL HS   
 254  32K545                      EBC-HS FOR PUB SERVICE (BUSH)   
 255  32K552                          Academy of Urban Planning   
 256  32K554               All City Leadership Secondary School   
 257  32K556  Bushwick Leaders High School for Academic Exce...   
 
      AP Test Takers   Total Exams Taken  Number of Exams with scores 3 4 or 5  
 0               39.0               49.0                                  10.0  
 1               19.0

# 5) Exploring the SAT Data

Let's explore `sat_results` to see what we can discover. Exploring the dataframe helps us understand the structure of the data and makes it easier to analyze.

Display the first five rows of the SAT scores data.

Use the key `sat_results` to access the SAT scores dataframe stored in the dictionary `data`.



In [2]:
data['sat_results'].head()

| | DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score |
|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355 | 404 | 363 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383 | 423 | 366 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377 | 402 | 370 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414 | 401 | 359 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390 | 433 | 384 |


# 6) Exploring the Remaining Data

We can make a few observations based on this output:

* The `DBN` appears to be a unique ID for each school.

* We can tell from the first **few rows of names** that we only have data about high schools.

* There's only a single row for each high school, so each **DBN is unique** in the SAT data.

* We may eventually want to combine the three columns that contain SAT scores -- `SAT Critical Reading Avg. Score`, `SAT Math Avg. Score`, and `SAT Writing Avg. Score` -- into a **single column** to make the scores easier to analyze.

Given these observations, let's explore the other datasets to see if we can gain any insight into how to combine them.

Loop through each key in `data`. For each key, display the **first five rows** of the dataframe associated with it.

In [3]:
for key in data:
    print("Key: ", key)
    print(data[key].head())

Key:  ap_2010
      DBN                             SchoolName  AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.             39.0   
1  01M450                 EAST SIDE COMMUNITY HS             19.0   
2  01M515                    LOWER EASTSIDE PREP             24.0   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH            255.0   
4  02M296  High School of Hospitality Management              NaN   

   Total Exams Taken  Number of Exams with scores 3 4 or 5  
0               49.0                                  10.0  
1               21.0                                   NaN  
2               26.0                                  24.0  
3              377.0                                 191.0  
4                NaN                                   NaN  
Key:  class_size
   CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 0

# 7-8) Reading in the Survey Data

We can make some observations based on the first few rows of each dataset:

* Each dataset appears to either have a `DBN` column or the information we need to create one. That means we can use a `DBN` column to combine the datasets. First we'll pinpoint matching rows from different datasets by looking for identical `DBNs`, then group all of their columns together in a single dataset.

* Some fields look interesting for mapping -- particularly `Location 1`, which contains coordinates inside a larger string.

* Some of the datasets appear to contain multiple rows for each school (because the rows have duplicate `DBN` values). That means we’ll have to do some preprocessing to ensure that each `DBN` is **unique within each dataset**. If we don't do this, we'll run into problems when we combine the datasets, because we might be merging two rows in one data set with one row in another dataset.
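The duplicate-key problem described in the last bullet can be checked directly. This is a minimal sketch, assuming two toy dataframes; `sat_demo` and `class_demo` are hypothetical names with illustrative values, not the lesson's full data.

```python
import pandas as pd

# Toy stand-ins for a dataset with unique DBNs and one with repeated DBNs
sat_demo = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1122, 1172]})
class_demo = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "GRADE": ["09", "10", "09"],
})

# Count rows whose DBN already appeared earlier in the same dataframe
print(sat_demo["DBN"].duplicated().sum())    # 0
print(class_demo["DBN"].duplicated().sum())  # 1

# Merging on a non-unique key repeats the matching row from the other side
merged = sat_demo.merge(class_demo, on="DBN", how="inner")
print(len(merged))  # 3 rows, although sat_demo only has 2 schools
```

The merged result has more rows than there are schools, which is why each `DBN` should be made unique within a dataset before combining.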

The `survey_all.txt` and `survey_d75.txt` files are *tab delimited* and encoded with `Windows-1252` encoding. An encoding defines how a computer stores the contents of a file in binary. The most common encodings are `UTF-8` and `ASCII`. `Windows-1252` is less common, and reading such a file without specifying the encoding can cause errors.

We'll need to pass the encoding and delimiter to the `pandas.read_csv()` function to ensure it reads the surveys in properly.
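To see why the encoding matters, here is a small sketch that builds a tab-delimited, Windows-1252 encoded file in memory. The school name "Café Demo School" is made up for illustration, and `io.BytesIO` stands in for a file on disk.

```python
import io

import pandas as pd

# A tiny tab-delimited "file" encoded as Windows-1252 (hypothetical content)
raw = "dbn\tschoolname\n01M015\tCaf\u00e9 Demo School\n".encode("windows-1252")

# With the matching encoding and delimiter, pandas reads the text correctly
df = pd.read_csv(io.BytesIO(raw), delimiter="\t", encoding="windows-1252")
print(df.loc[0, "schoolname"])

# The same bytes are not valid UTF-8: the byte 0xE9 cannot stand on its own there
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)  # True
```

Files that contain only plain ASCII characters read the same under either encoding, so the problem only shows up once accented or special characters are present.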

After we read in the survey data, we'll want to combine it into a single dataframe. We can do this by calling the `pandas.concat()` function:

```Python
z = pd.concat([x,y], axis=0)
```
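As a quick illustration of what `axis=0` does, the sketch below stacks two toy dataframes. The names `x` and `y` follow the snippet above, and the rows are a small illustrative subset of columns rather than the full survey files.

```python
import pandas as pd

# Miniature stand-ins for the two survey files
x = pd.DataFrame({"dbn": ["01M015", "01M019"], "rr_t": [88, 100]})
y = pd.DataFrame({"dbn": ["75X352"], "rr_t": [58]})

# axis=0 appends the rows of y below the rows of x
z = pd.concat([x, y], axis=0)
print(z.shape)        # (3, 2)
print(list(z.index))  # [0, 1, 0]

# ignore_index=True renumbers the rows from 0 if a clean index is preferred
z_reset = pd.concat([x, y], axis=0, ignore_index=True)
print(list(z_reset.index))  # [0, 1, 2]
```

By default each input keeps its own row labels, so index values can repeat in the combined dataframe.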

## Instructions

1. Read in `survey_all.txt`.

    * Use the `pandas.read_csv()` function to read `data/survey_all.txt` into the variable `all_survey`. Recall that this file is located in the `data` folder.

    * Specify the keyword argument `delimiter="\t"`.

    * Specify the keyword argument `encoding="windows-1252"`.

1. Read in `survey_d75.txt`.
    
    * Use the `pandas.read_csv()` function to read `data/survey_d75.txt` into the variable `d75_survey`. Recall that this file is located in the `data` folder.

    * Specify the keyword argument `delimiter="\t"`.
    
    * Specify the keyword argument `encoding="windows-1252"`.

1. Combine `d75_survey` and `all_survey` into a **single dataframe**.

    * Use the pandas `concat()` function with the keyword argument `axis=0` to combine `d75_survey` and `all_survey` into the dataframe `survey`.

    * Pass in `all_survey` first, then `d75_survey` when calling the `pandas.concat()` function.

1. Display the first five rows of `survey` using the `pandas.DataFrame.head()` function.

In [4]:
data_files = [
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/survey_all.txt",
"https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/refs/heads/main/3%20-%20Data_Cleaning/03_Data_Cleaning_Walktrough/data/survey_d75.txt",
]

all_survey = pd.read_csv(data_files[0],delimiter="\t",encoding="windows-1252")
d75_survey = pd.read_csv(data_files[1],delimiter="\t",encoding="windows-1252")    
    
survey = pd.concat([all_survey,d75_survey],axis=0)

print(survey.head())

      dbn    bn                      schoolname  d75 studentssurveyed  \
0  01M015  M015       P.S. 015 Roberto Clemente    0               No   
1  01M019  M019             P.S. 019 Asher Levy    0               No   
2  01M020  M020            P.S. 020 Anna Silver    0               No   
3  01M034  M034  P.S. 034 Franklin D. Roosevelt    0              Yes   
4  01M063  M063       P.S. 063 William McKinley    0               No   

   highschool                  schooltype  rr_s  rr_t  rr_p  ...  s_q14_2  \
0         0.0           Elementary School   NaN    88    60  ...      NaN   
1         0.0           Elementary School   NaN   100    60  ...      NaN   
2         0.0           Elementary School   NaN    88    73  ...      NaN   
3         0.0  Elementary / Middle School  89.0    73    50  ...      NaN   
4         0.0           Elementary School   NaN   100    60  ...      NaN   

   s_q14_3  s_q14_4  s_q14_5  s_q14_6  s_q14_7  s_q14_8  s_q14_9  s_q14_10  \
0      NaN      NaN 

# 9) Cleaning Up the Surveys

There are two immediate facts that we can see in the data:

1. There are over 2000 columns, nearly all of which we don't need. We'll have to filter the data to remove the unnecessary ones. **Working with fewer columns** makes it easier to print the dataframe out and find correlations within it.

1. The survey data has a `dbn` column that we'll want to convert to uppercase (`DBN`). The conversion makes the column name consistent with the other datasets.

First, we'll need to filter the columns to remove the ones we don't need. Luckily, there's a **data dictionary** at the original data download location. Based on the dictionary, it looks like these are the relevant columns:

```
["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
```

These columns give us aggregate survey data about how parents, teachers, and students feel about school safety, academic performance, and more. They also give us the `DBN`, which allows us to uniquely identify each school.

Before we filter columns out, we'll want to copy the data from the `dbn` column into a new column called `DBN`.



## Instructions

1. Use the `pandas.DataFrame.copy()` method to create a copy of the `survey` dataframe and assign it to the variable `survey`.

1. Copy the data from the `dbn` column of `survey` into a new column in `survey` called `DBN`.

1. Filter `survey` so it only contains the columns we listed above. You can do this using pandas column selection with a list of column names, or using `pandas.DataFrame.loc[]`.

1. Assign the dataframe `survey` to the **key** `survey` in the dictionary `data`.

1. When you're finished, the value in `data["survey"]` should be a dataframe with 23 columns and 1702 rows.
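The two column-selection styles mentioned in the third step give the same result. A minimal sketch on a toy dataframe, where `survey_demo` and the `unused_q` column are hypothetical names used only for illustration:

```python
import pandas as pd

survey_demo = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "rr_t": [88, 100],
    "unused_q": [1, 2],  # stands in for the many columns we want to drop
})

# Copy the lowercase column into an uppercase one, as in step 2
survey_demo["DBN"] = survey_demo["dbn"]

keep = ["DBN", "rr_t"]
by_list = survey_demo[keep]         # selection with a list of names
by_loc = survey_demo.loc[:, keep]   # the same selection with .loc

print(by_list.equals(by_loc))  # True
print(list(by_list.columns))   # ['DBN', 'rr_t']
```

Both forms return the columns in the order given by the list, so the list also controls the final column order.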


In [5]:
survey = survey.copy()
survey['DBN'] = survey['dbn']

cols = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]

#Filter survey so it only contains the columns we listed above
survey = survey[cols]

#Assign the dataframe survey to the key survey in the dictionary data.
data['survey'] = survey

data['survey']

Unnamed: 0,DBN,rr_s,rr_t,rr_p,N_s,N_t,N_p,saf_p_11,com_p_11,eng_p_11,...,eng_t_11,aca_t_11,saf_s_11,com_s_11,eng_s_11,aca_s_11,saf_tot_11,com_tot_11,eng_tot_11,aca_tot_11
0,01M015,,88,60,,22.0,90.0,8.5,7.6,7.5,...,7.6,7.9,,,,,8.0,7.7,7.5,7.9
1,01M019,,100,60,,34.0,161.0,8.4,7.6,7.6,...,8.9,9.1,,,,,8.5,8.1,8.2,8.4
2,01M020,,88,73,,42.0,367.0,8.9,8.3,8.3,...,6.8,7.5,,,,,8.2,7.3,7.5,8.0
3,01M034,89.0,73,50,145.0,29.0,151.0,8.8,8.2,8.0,...,6.8,7.8,6.2,5.9,6.5,7.4,7.3,6.7,7.1,7.9
4,01M063,,100,60,,23.0,90.0,8.7,7.9,8.1,...,7.8,8.1,,,,,8.5,7.6,7.9,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51,75X352,90.0,58,48,38.0,46.0,160.0,8.9,8.3,7.9,...,5.7,5.8,6.8,6.0,7.8,7.6,7.4,6.6,7.1,7.2
52,75X721,84.0,90,48,237.0,82.0,239.0,8.6,7.6,7.5,...,6.7,7.0,7.8,7.2,7.8,7.9,8.0,7.1,7.3,7.6
53,75X723,77.0,74,20,103.0,69.0,74.0,8.4,7.8,7.8,...,6.7,7.6,6.7,7.2,7.7,7.7,7.6,7.4,7.4,7.7
54,75X754,63.0,93,22,336.0,82.0,124.0,8.3,7.5,7.5,...,6.6,7.1,6.8,6.6,7.6,7.7,7.2,6.9,7.3,7.5


# 10-11) Inserting DBN Fields

When we explored all of the datasets, we noticed a couple of issues:

1. `class_size` and `hs_directory` don't have a `DBN` column.
1. `hs_directory` does have a `dbn` column, though, so we can just rename it.

`class_size` doesn't appear to have a `DBN` column at all. Here are the first few rows of the dataset:

```
CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED
1    1       M        M015  P.S. 015 Roberto Clemente     0K          CTT
2    1       M        M015  P.S. 015 Roberto Clemente     01       GEN ED
3    1       M        M015  P.S. 015 Roberto Clemente     01          CTT
4    1       M        M015  P.S. 015 Roberto Clemente     02       GEN ED
```

Here are the first few rows of the `sat_results` data, which does have a `DBN` column:

```
DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL
2  01M450                     EAST SIDE COMMUNITY SCHOOL
3  01M458                      FORSYTH SATELLITE ACADEMY
4  01M509                        MARTA VALLE HIGH SCHOOL
```

From looking at these rows, we can tell that the `DBN` in the `sat_results` data is just a combination of the `CSD` and `SCHOOL CODE` columns in the `class_size` data. The main difference is that the `DBN` is padded, so that the `CSD` portion of it always consists of **two digits**. That means we'll need to add a leading `0` to the `CSD` if the CSD is less than two digits long. 

We can accomplish this using the `pandas.Series.apply()` method, along with a custom function that:

1. Takes in a number.
1. Converts the number to a string using the `str()` function.
1. Checks the length of the string using the `len()` function.
    * If the string is two digits long, returns the string.
    * If the string is one digit long, adds a `0` to the front of the string, then returns it.
        * You can use the string method `zfill()` to do this.
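The steps above can be sketched as a small function. `pad_csd` is a hypothetical name chosen for this example; the logic simply follows the list.

```python
def pad_csd(num):
    """Return the district number as a two-character string."""
    string_representation = str(num)
    if len(string_representation) > 1:
        # Already two digits long, nothing to pad
        return string_representation
    # zfill left-pads with zeros up to the requested width
    return string_representation.zfill(2)


print(pad_csd(1))   # 01
print(pad_csd(12))  # 12
```

Because `zfill(2)` leaves longer strings unchanged, the whole function can also be collapsed to `str(num).zfill(2)`.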

Once we've padded the `CSD`, we can use the addition operator (`+`) to combine the values in the `CSD` and `SCHOOL CODE` columns. Here's an example of how we would do this:

```Python

dataframe["new_column"] = dataframe["column_one"] + dataframe["column_two"]
```
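Here is the same idea on a toy dataframe, as a sketch. `class_size_demo` is a hypothetical name, and the second row (`10`, `M123`) is invented to show a district that needs no padding.

```python
import pandas as pd

class_size_demo = pd.DataFrame({
    "CSD": [1, 10],
    "SCHOOL CODE": ["M015", "M123"],
})

# Both operands must hold strings for + to concatenate them,
# so the numeric CSD is converted and padded first
padded = class_size_demo["CSD"].apply(lambda x: str(x).zfill(2))
class_size_demo["DBN"] = padded + class_size_demo["SCHOOL CODE"]

print(class_size_demo["DBN"].tolist())  # ['01M015', '10M123']
```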

And here's a diagram illustrating the basic concept:

![alt text](image.png)

## Instructions

1. Copy the `dbn` column in `hs_directory` into a new column called `DBN`.

1. Create a new column called `padded_csd` in the `class_size` dataset.

    * Use the `pandas.Series.apply()` method along with a custom function to generate this column.
    * Make sure to apply the function along the `data["class_size"]["CSD"]` column.

1. Use the addition operator `(+)` along with the `padded_csd` and `SCHOOL CODE` columns of `class_size`, then assign the result to the `DBN` column of `class_size`.

1. Display the first few rows of `class_size` to double check the `DBN` column.

In [6]:
data['hs_directory']['DBN'] = data['hs_directory']['dbn']

#if string is 2 digits long - return the string
#if string is less than 2 digits - fills with 0 in the front
data['class_size']['padded_csd'] = data['class_size']['CSD'].apply(lambda x: str(x).zfill(2))

data['class_size']['DBN'] =  data['class_size']['padded_csd'] + data['class_size']['SCHOOL CODE']

data['class_size'].head()

Unnamed: 0,CSD,BOROUGH,SCHOOL CODE,SCHOOL NAME,GRADE,PROGRAM TYPE,CORE SUBJECT (MS CORE and 9-12 ONLY),CORE COURSE (MS CORE and 9-12 ONLY),SERVICE CATEGORY(K-9* ONLY),NUMBER OF STUDENTS / SEATS FILLED,NUMBER OF SECTIONS,AVERAGE CLASS SIZE,SIZE OF SMALLEST CLASS,SIZE OF LARGEST CLASS,DATA SOURCE,SCHOOLWIDE PUPIL-TEACHER RATIO,padded_csd,DBN
0,1,M,M015,P.S. 015 Roberto Clemente,0K,GEN ED,-,-,-,19.0,1.0,19.0,19.0,19.0,ATS,,01,01M015
1,1,M,M015,P.S. 015 Roberto Clemente,0K,CTT,-,-,-,21.0,1.0,21.0,21.0,21.0,ATS,,01,01M015
2,1,M,M015,P.S. 015 Roberto Clemente,01,GEN ED,-,-,-,17.0,1.0,17.0,17.0,17.0,ATS,,01,01M015
3,1,M,M015,P.S. 015 Roberto Clemente,01,CTT,-,-,-,17.0,1.0,17.0,17.0,17.0,ATS,,01,01M015
4,1,M,M015,P.S. 015 Roberto Clemente,02,GEN ED,-,-,-,15.0,1.0,15.0,15.0,15.0,ATS,,01,01M015


# 12) Combining the SAT Scores

Before we combine the datasets, it's worth computing a few variables that will be useful in our analysis. We've already discussed one such variable: a column that **totals up the SAT scores** for the different sections of the exam. This makes it much easier to correlate scores with demographic factors because we'll be working with a single number, rather than three different ones.

Before we can generate this column, we'll need to convert the following columns in the `sat_results` dataset from the object (string) data type to a **numeric data type**:

* `SAT Math Avg. Score`
* `SAT Critical Reading Avg. Score`
* `SAT Writing Avg. Score`

We can use the `pandas.to_numeric()` function for the conversion. If we don't convert the values, we won't be able to add the scores together numerically.

It's important to pass the keyword argument `errors="coerce"` when we call `pandas.to_numeric()`, so that pandas treats any strings it can't convert to numbers as missing values instead of raising an error.

After we perform the conversion, we can use the addition operator (`+`) to add all three columns together.
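To make the effect of `errors="coerce"` concrete, here is a small sketch on a toy dataframe. The first row reuses the scores of the first school shown earlier; the `"s"` entries in the second row are a hypothetical placeholder for values that can't be parsed as numbers.

```python
import pandas as pd

scores = pd.DataFrame({
    "SAT Math Avg. Score": ["404", "s"],
    "SAT Critical Reading Avg. Score": ["355", "s"],
    "SAT Writing Avg. Score": ["363", "s"],
})

# Convert each column; strings that aren't numbers become NaN
for col in list(scores.columns):
    scores[col] = pd.to_numeric(scores[col], errors="coerce")

# NaN propagates through +, so schools with missing sections get a missing total
scores["sat_score"] = (
    scores["SAT Math Avg. Score"]
    + scores["SAT Critical Reading Avg. Score"]
    + scores["SAT Writing Avg. Score"]
)

print(scores["sat_score"].tolist())  # [1122.0, nan]
```

The converted columns end up as floats because `NaN` is a floating-point value, which is why the totals print with a decimal point.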


## Instructions

1. Convert the `SAT Math Avg. Score`, `SAT Critical Reading Avg. Score`, and `SAT Writing Avg. Score` columns in the `sat_results` dataset from the object (string) data type to a numeric data type.

    * Use the `pandas.to_numeric()` function on each of the columns, and assign the result back to the same column.
    * Pass in the keyword argument `errors="coerce"`.

1. Create a column called `sat_score` in `sat_results` that holds the **combined SAT score** for each school.

    * Add up `SAT Math Avg. Score`, `SAT Critical Reading Avg. Score`, and `SAT Writing Avg. Score`, and assign the total to the `sat_score` column of `sat_results`.

1. Display the first five rows of the `sat_score` column of `sat_results` to verify that everything went okay.

In [7]:
#convert to numeric data
data['sat_results']['SAT Math Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Math Avg. Score'],errors="coerce")

data['sat_results']['SAT Critical Reading Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Critical Reading Avg. Score'],errors="coerce")

data['sat_results']['SAT Writing Avg. Score']= pd.to_numeric(data['sat_results']['SAT Writing Avg. Score'],errors="coerce")


#sum up all the three columns
data['sat_results']['sat_score'] = data['sat_results']['SAT Math Avg. Score'] + data['sat_results']['SAT Critical Reading Avg. Score'] + data['sat_results']['SAT Writing Avg. Score']

print(data['sat_results']['sat_score'].head())

0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64


# 13) Parsing Geographic Coordinates 

Next, we'll want to parse the latitude and longitude coordinates for each school. This enables us to map the schools and uncover any geographic patterns in the data. The coordinates are currently in the text field `Location 1` in the `hs_directory` dataset.

```
0    883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1    1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2    1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
3    411 Pearl Street\nNew York, NY 10038\n(40.7106...
4    160-20 Goethals Avenue\nJamaica, NY 11432\n(40...
```

As you can see, this field contains a lot of information we don't need. We want to extract the **coordinates**, which are in **parentheses** at the end of the field. Here's an example:

```
1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)
```

We can do the extraction with a regular expression. The following expression pulls out everything inside the parentheses:

In [8]:
import re
re.findall(r"\(.+\)", "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)")


['(40.8276026690005, -73.90447525699966)']
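As a variation on the expression above, capture groups can return the two numbers directly, without the surrounding parentheses. This is a sketch of an alternative pattern, not the approach the exercises below ask for; the pattern and variable names are assumptions made for this example.

```python
import re

location = "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)"

# Group 1: everything up to the comma; group 2: everything up to the closing parenthesis
pattern = re.compile(r"\(([^,]+),\s*([^)]+)\)")

match = pattern.search(location)
if match:
    lat, lon = (float(part) for part in match.groups())
    print(lat, lon)
```

The `r` prefix makes the pattern a raw string, so backslashes such as `\(` reach the regex engine unchanged and Python doesn't warn about an invalid escape sequence.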

## Instructions

1. Write a function that:
    * Takes in a string
    * Uses the regular expression above to extract the coordinates
    * Uses string manipulation functions to pull out the latitude
    * Returns the latitude

1. Use the `Series.apply()` method to apply the function across the `Location 1` column of `hs_directory`. Assign the result to the `lat` column of `hs_directory`.

1. Display the first five rows of `hs_directory` to verify the results.

In [9]:
import re

def get_lat(x):
    # Extract the raw "(lat, lon)" text; the raw string avoids an invalid-escape warning
    y = re.findall(r"\(.+\)", x)

    # Keep the part before the comma and drop the opening parenthesis
    lat = y[0].split(',')[0].replace("(", "")
    return lat

data['hs_directory']['lat'] = data['hs_directory']['Location 1'].apply(get_lat)

data['hs_directory'].head()


Unnamed: 0,dbn,school_name,borough,building_code,phone_number,fax_number,grade_span_min,grade_span_max,expgrade_span_min,expgrade_span_max,...,priority10,Location 1,Community Board,Council District,Census Tract,BIN,BBL,NTA,DBN,lat
0,27Q260,Frederick Douglass Academy VI High School,Queens,Q465,718-471-2154,718-471-2890,9.0,12,,,...,,"8 21 Bay 25 Street\nFar Rockaway, NY 11691\n(4...",14.0,31.0,100802.0,4300730.0,4157360000.0,Far Rockaway-Bayswater ...,27Q260,40.601989336
1,21K559,Life Academy High School for Film and Music,Brooklyn,K400,718-333-7750,718-333-7775,9.0,12,,,...,,"2630 Benson Avenue\nBrooklyn, NY 11214\n(40.59...",13.0,47.0,306.0,3186454.0,3068830000.0,Gravesend ...,21K559,40.593593811
2,16K393,Frederick Douglass Academy IV Secondary School,Brooklyn,K026,718-574-2820,718-574-2821,9.0,12,,,...,,"1014 Lafayette Avenue\nBrooklyn, NY 11221\n(40...",3.0,36.0,291.0,3393805.0,3016160000.0,Stuyvesant Heights ...,16K393,40.692133704
3,08X305,Pablo Neruda Academy,Bronx,X450,718-824-1682,718-824-1663,9.0,12,,,...,,"1980 Lafayette Avenue\nBronx, NY 10473\n(40.82...",9.0,18.0,16.0,2022205.0,2036040000.0,Soundview-Castle Hill-Clason Point-Harding Par...,08X305,40.822303765
4,03M485,Fiorello H. LaGuardia High School of Music & A...,Manhattan,M485,212-496-0700,212-724-5748,9.0,12,,,...,,"100 Amsterdam Avenue\nNew York, NY 10023\n(40....",7.0,6.0,151.0,1030341.0,1011560000.0,Lincoln Square ...,03M485,40.773670507


# 14) Extracting the Longitude

In the last exercise, we parsed the latitude from the `Location 1` column. Now we need to do the same for the longitude.

Once we have both coordinates, we'll need to convert them to **numeric** values. We can use the `pandas.to_numeric()` function to convert them from strings to numbers.
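For comparison, pandas can also extract both coordinates in one vectorized step with `Series.str.extract()`. This is a sketch of an alternative to the `apply()`-based approach used in this lesson; the `locations` series is toy data, and its second entry is a made-up string with no coordinates.

```python
import pandas as pd

locations = pd.Series([
    "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)",
    "No coordinates here",
])

# One capture group per coordinate; rows without a match get NaN in both columns
coords = locations.str.extract(r"\(([^,]+),\s*([^)]+)\)")
coords.columns = ["lat", "lon"]

# Convert the extracted strings to floats, turning anything unparseable into NaN
coords = coords.apply(pd.to_numeric, errors="coerce")

print(coords)
```

Rows that lack a coordinate pair simply end up as missing values instead of raising an error, which is one reason this style can be more forgiving than indexing into the result of `re.findall()`.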

## Instructions


1. Write a function that:
    * Takes in a string
    * Uses the regular expression above to extract the coordinates
    * Uses string manipulation functions to pull out the longitude
    * Returns the longitude

1. Use the `Series.apply()` method to apply the function across the `Location 1` column of `hs_directory`. Assign the result to the `lon` column of `hs_directory`.

1. Use the `pandas.to_numeric()` function to convert the `lat` and `lon` columns of `hs_directory` to numbers. Specify the `errors="coerce"` keyword argument so that values that can't be converted become missing values.

1. Display the first five rows of `hs_directory` to verify the results.

In [11]:
import re

def find_lon(x):
    # Extract the raw "(lat, lon)" text; the raw string avoids an invalid-escape warning
    y = re.findall(r"\(.+\)", x)

    # Keep the part after the comma, then drop the closing parenthesis and spaces
    lon = y[0].split(',')[1].replace(")", "").strip()
    return lon

data['hs_directory']['lon'] = data['hs_directory']['Location 1'].apply(find_lon)

#convert coordinates to numeric
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")

data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

data['hs_directory'].head()



Unnamed: 0,dbn,school_name,borough,building_code,phone_number,fax_number,grade_span_min,grade_span_max,expgrade_span_min,expgrade_span_max,...,Location 1,Community Board,Council District,Census Tract,BIN,BBL,NTA,DBN,lat,lon
0,27Q260,Frederick Douglass Academy VI High School,Queens,Q465,718-471-2154,718-471-2890,9.0,12,,,...,"8 21 Bay 25 Street\nFar Rockaway, NY 11691\n(4...",14.0,31.0,100802.0,4300730.0,4157360000.0,Far Rockaway-Bayswater ...,27Q260,40.601989,-73.762834
1,21K559,Life Academy High School for Film and Music,Brooklyn,K400,718-333-7750,718-333-7775,9.0,12,,,...,"2630 Benson Avenue\nBrooklyn, NY 11214\n(40.59...",13.0,47.0,306.0,3186454.0,3068830000.0,Gravesend ...,21K559,40.593594,-73.984729
2,16K393,Frederick Douglass Academy IV Secondary School,Brooklyn,K026,718-574-2820,718-574-2821,9.0,12,,,...,"1014 Lafayette Avenue\nBrooklyn, NY 11221\n(40...",3.0,36.0,291.0,3393805.0,3016160000.0,Stuyvesant Heights ...,16K393,40.692134,-73.931503
3,08X305,Pablo Neruda Academy,Bronx,X450,718-824-1682,718-824-1663,9.0,12,,,...,"1980 Lafayette Avenue\nBronx, NY 10473\n(40.82...",9.0,18.0,16.0,2022205.0,2036040000.0,Soundview-Castle Hill-Clason Point-Harding Par...,08X305,40.822304,-73.855961
4,03M485,Fiorello H. LaGuardia High School of Music & A...,Manhattan,M485,212-496-0700,212-724-5748,9.0,12,,,...,"100 Amsterdam Avenue\nNew York, NY 10023\n(40....",7.0,6.0,151.0,1030341.0,1011560000.0,Lincoln Square ...,03M485,40.773671,-73.985269
