# 1) Introduction

You'll find the ability to create data science projects useful in several different contexts:

* Projects help you build a portfolio, which is critical to finding a job as a data analyst or scientist.
* Working on projects helps you learn new skills and reinforce existing concepts.
* Most "real-world" data science and analysis work consists of developing internal projects.
* Projects allow you to investigate interesting phenomena and satisfy your curiosity.

In this lesson, we'll walk through the first part of a complete data science project, including how to acquire the raw data. The project focuses on exploring and analyzing a dataset. We'll develop our data cleaning and storytelling skills, which will enable us to build complete projects on our own.

The first step in creating a project is to *decide on a topic*. You want the topic to be something you're interested in and motivated to explore. 

Here are two ways to find a good topic:

* Think about what sectors or angles you're really interested in, then find datasets relating to those sectors.
* Review several datasets and find one that seems interesting enough to explore.

Whichever approach you take, you can start your search at these sites:

* [Data.gov](https://data.gov/) - A directory of U.S. government data downloads

* [/r/datasets](https://www.reddit.com/r/datasets/) - A subreddit that has hundreds of interesting datasets

* [Awesome datasets](https://github.com/awesomedata/awesome-public-datasets) - A list of datasets hosted on GitHub

* [rs.io](https://rs.io/100-interesting-data-sets-for-statistics/) - A great blog post with hundreds of interesting datasets

For the purposes of this project, we'll be using data about **New York City public schools**, which can be found [here](https://data.cityofnewyork.us/browse?category=Education).

# 2) Finding All of the Relevant Datasets

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests and whether they're unfair to certain groups. Given our prior knowledge of this topic, investigating the correlations between **SAT scores** and demographics might be an interesting angle to take. We could correlate SAT scores with factors like race, gender, income, and more.

The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's important to perform well.

The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the dataset for our project is based on 2,400 total points). Organizations often rank high schools by their average **SAT scores**. The scores are also considered a measure of overall school district quality.
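The scoring arithmetic above can be sketched in a few lines. This is only an illustration: the section names (`reading`, `math`, `writing`) and the helper `combined_sat_score` are assumptions made for the example, and no minimum section score is assumed beyond zero.

```python
# Scoring scale described above: three sections, 800 points each.
SECTION_MAX = 800
NUM_SECTIONS = 3
TOTAL_MAX = SECTION_MAX * NUM_SECTIONS  # 2400


def combined_sat_score(reading, math, writing):
    """Sum three section scores into a combined score (hypothetical helper)."""
    for score in (reading, math, writing):
        if not 0 <= score <= SECTION_MAX:
            raise ValueError(f"section score out of range: {score}")
    return reading + math + writing


# Made-up section scores, for illustration only.
example_total = combined_sat_score(500, 520, 480)
print(example_total, "out of", TOTAL_MAX)
```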

New York City makes its data on high school [SAT scores available online](https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4/about_data), as well as the [demographics for each high school](https://data.cityofnewyork.us/Education/2014-2015-DOE-High-School-Directory/n3p6-zve2/about_data). The first few rows of the SAT data look like this:

![First few rows of the SAT data](https://s3.amazonaws.com/dq-content/136/sat.png)

Unfortunately, combining these two datasets won't give us all of the demographic information we want to use. We'll need to supplement our data with other sources to do our full analysis.

The same website has several related datasets covering demographic information and test scores. Here are the links to all of the datasets we'll be using:

* [SAT scores by school](https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4) - SAT scores for each high school in New York City

* [School attendance](https://data.cityofnewyork.us/Education/2010-2011-School-Attendance-and-Enrollment-Statist/7z8d-msnt/about_data) - Attendance information for each school in New York City

* [Class size](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3/about_data) - Information on class size for each school

* [AP test results](https://data.cityofnewyork.us/Education/2010-AP-College-Board-School-Level-Results/itfs-ms3e) - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)

* [Graduation outcomes](https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a) - The percentage of students who graduated, and other outcome information

* [Demographics](https://data.cityofnewyork.us/Education/2006-2012-School-Demographics-and-Accountability-S/ihfw-zy9j/about_data) - Demographic information for each school

* [School survey](https://data.cityofnewyork.us/Education/2011-NYC-School-Survey/mnz3-dyi8/about_data) - Surveys of parents, teachers, and students at each school

All of these datasets are interrelated. We'll need to combine them into a single dataset before we can find correlations.
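As a rough sketch of what "combining into a single dataset" could look like, the example below merges two toy tables on a shared school identifier and then computes a correlation. The column names (`DBN`, `sat_score`, `frl_percent`) and all values are made up for illustration; the real datasets will need cleaning before a merge like this works.

```python
import pandas as pd

# Toy stand-ins for two of the datasets; every value here is invented.
sat = pd.DataFrame({
    "DBN": ["01M001", "02M002", "03K003", "05X005"],
    "sat_score": [1200, 1450, 1100, 1300],
})
demographics = pd.DataFrame({
    "DBN": ["01M001", "02M002", "03K003", "04Q004"],
    "frl_percent": [80.0, 35.0, 90.0, 60.0],
})

# An inner merge keeps only schools that appear in both tables.
combined = sat.merge(demographics, on="DBN", how="inner")

# With everything in one dataframe, column-to-column correlations are one call away.
correlation = combined["sat_score"].corr(combined["frl_percent"])
print(combined)
print(correlation)
```

An inner join is only one option here. A left join would keep every school from the first table and leave missing values where the second table has no match.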

# 3) Finding Background Information

Before we move into coding, we'll need to do some background research. A thorough understanding of the data helps us avoid costly mistakes, such as thinking that a column represents something other than what it does. Background research gives us a better understanding of how to combine and analyze the data.

In this case, we'll want to research:

* New York City
* The SAT
* Schools in New York City
* Our data

We can learn a few different things from these resources. For example:

* Only high school students take the SAT, so we'll want to focus on **high schools**.
* New York City is made up of **five boroughs**, which are essentially distinct regions.
* New York City schools fall within several different school districts, each of which can contain dozens of schools.
* Our datasets include several different types of schools. We'll need to clean them so that we can focus on high schools only.
* Each school in New York City has a unique code called a DBN, or district borough number.
* Aggregating data by district allows us to use the district mapping data to plot district-by-district differences.
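The last two points can be sketched with a toy example. It assumes a DBN is laid out as a two-digit district, a one-letter borough code, and a school number; that layout, the column names, and the values below are assumptions for illustration and should be checked against the actual data.

```python
import pandas as pd

# Toy data with made-up DBN codes and scores.
schools = pd.DataFrame({
    "DBN": ["01M001", "01M002", "02M003", "02M004"],
    "sat_score": [1100, 1200, 1300, 1500],
})

# Assumed layout: the first two characters are the district,
# and the third character is the borough code.
schools["school_dist"] = schools["DBN"].str[:2]
schools["borough_code"] = schools["DBN"].str[2]

# Aggregate to one row per district, ready for a district-level comparison or map.
district_means = schools.groupby("school_dist")["sat_score"].mean()
print(district_means)
```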

# 4) Reading in the Data

Here are all of the files in the folder:

* ap_2010.csv - Data on [AP test results](https://data.cityofnewyork.us/Education/2010-AP-College-Board-School-Level-Results/itfs-ms3e/about_data)

* class_size.csv - Data on [class size](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3/about_data)

* demographics.csv - Data on [demographics](https://data.cityofnewyork.us/Education/2006-2012-School-Demographics-and-Accountability-S/ihfw-zy9j/about_data)

* graduation.csv - Data on [graduation outcomes](https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a)

* hs_directory.csv - A directory of [high schools](https://data.cityofnewyork.us/Education/DOE-High-School-Directory-2014-2015/n3p6-zve2)

* sat_results.csv - Data on [SAT scores](https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4)

* survey_all.txt - Data on [surveys](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) from all schools

* survey_d75.txt - Data on [surveys](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) from New York City District 75

## Instructions

* Read each of the files in the list `data_files` into a pandas dataframe using the `pandas.read_csv()` function.

* Add each of the dataframes to the dictionary `data`, using the base of the filename as the key. For example, you'd enter `ap_2010` for the file `ap_2010.csv`.

* Afterwards, `data` should have the following keys:

      ap_2010
      class_size
      demographics
      graduation
      hs_directory
      sat_results

In addition, each key in `data` should have the corresponding dataframe as its value.

    import pandas as pd

    # Assumption: the CSV files are hosted in this repository folder, following
    # the raw-URL pattern used elsewhere in the repository. Adjust the base URL
    # (or point it at a local folder) if the files live somewhere else.
    base_url = (
        "https://raw.githubusercontent.com/TomazFilgueira/Dq_Datascientist/"
        "refs/heads/main/04_Data_Cleaning/03_Data_Cleaning_Walktrough/data/"
    )

    data_files = [
        "ap_2010.csv",
        "class_size.csv",
        "demographics.csv",
        "graduation.csv",
        "hs_directory.csv",
        "sat_results.csv",
    ]
    data = {}

    for f in data_files:
        # Use the filename without its ".csv" extension as the dictionary key
        name = f.rsplit(".", 1)[0]
        data[name] = pd.read_csv(base_url + f)
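One detail worth checking when the entries in `data_files` are full paths or URLs rather than bare file names: splitting on the dot alone keeps the directory part in the key, so the key would not match `ap_2010`. A minimal sketch of one way to get the base name, using the standard library's `pathlib`, is shown below. The helper name `key_from_path` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def key_from_path(path_or_url):
    """Return the file name without its directory or extension (hypothetical helper)."""
    return PurePosixPath(path_or_url).stem


# Placeholder paths, for illustration only.
print(key_from_path("some/folder/ap_2010.csv"))
print(key_from_path("https://example.com/data/class_size.csv"))
```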