# 3. Advanced ETL Techniques

Supercharge your workflow with advanced data pipelining techniques, such as working with non-tabular data and persisting DataFrames to SQL databases. Discover tooling to tackle advanced transformations with pandas, and uncover best practices for working with complex data.

## Libraries

In [43]:
# Common
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Json
import json

# Parquet Files
import parquet as pq
import fastparquet

# SQL
import sqlalchemy
import psycopg2

# Pandas SQL
import pandasql as ps

# Logging
import logging

## User Variables

In [44]:
testing_scores_json_path = "../Datasets/testing_scores.json"
raw_testing_scores = pd.read_json(testing_scores_json_path)

raw_testing_scores_path = "../Datasets/nested_school_scores.json"
raw_testing_scores = pd.read_json(raw_testing_scores_path)

# Exercises

## 1. Ingesting JSON data with pandas

### Description

When developing a data pipeline, you may have to work with non-tabular data and data sources, such as APIs or JSON files. In this exercise, we'll practice extracting data from a JSON file using ``pandas``.

``pandas`` has been imported as ``pd``, and the JSON file you'll ingest is stored at the path ``"testing_scores.json"``.

### Instructions

* Update the ``extract()`` function to read a JSON file into a ``pandas`` DataFrame, orienting by records.
* Pass the path ``testing_scores.json`` to the ``extract()`` function, and store the output to a variable called ``raw_testing_scores``.
* Print the head of the ``raw_testing_scores`` DataFrame.

In [45]:
def extract(file_path):
  # Read the JSON file into a DataFrame
  return pd.read_json(file_path, orient="records")

# Call the extract function with the appropriate path, assign to raw_testing_scores
raw_testing_scores = extract("../Datasets/testing_scores.json")

# Output the head of the DataFrame
print(raw_testing_scores.head())

              street_address       city  math_score  reading_score  \
02M260  425 West 33rd Street  Manhattan         NaN            NaN   
06M211    650 Academy Street  Manhattan         NaN            NaN   
01M539   111 Columbia Street  Manhattan       657.0          601.0   
02M294      350 Grand Street  Manhattan       395.0          411.0   
02M308      350 Grand Street  Manhattan       418.0          428.0   

        writing_score  
02M260            NaN  
06M211            NaN  
01M539          601.0  
02M294          387.0  
02M308          415.0  


Excellent data extracting! Ingesting JSON files into pandas DataFrames is the first step in preparing non-tabular data for further transformation.
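
To see what the ``orient`` argument changes, here is a minimal sketch that uses small in-memory JSON strings instead of the exercise file. The sample records and variable names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data, not taken from testing_scores.json
records_json = '[{"school_id": "01M539", "math_score": 657}, {"school_id": "02M294", "math_score": 395}]'
index_json = '{"01M539": {"math_score": 657}, "02M294": {"math_score": 395}}'

# orient="records": a list of row objects; row labels default to 0, 1, ...
records_df = pd.read_json(StringIO(records_json), orient="records")

# orient="index": the outer keys become the row labels
index_df = pd.read_json(StringIO(index_json), orient="index")

print(records_df)
print(index_df)
```

Which value fits depends on how the file was written, so it is worth inspecting the first few characters of a JSON file before choosing one.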

## 2. Reading JSON data into memory

### Description

When data is stored in JSON format, it's not always easy to load into a DataFrame. This is the case for the "nested_testing_scores.json" file. Here, the data will have to be manually manipulated before it can be stored in a DataFrame.

To help get you started, ``pandas`` has been loaded into the workspace as ``pd``.

### Instructions

* Use ``pandas`` to read a JSON file into a DataFrame. Pass the ``"nested_scores.json"`` file path to the ``extract()`` function.
* Import the ``json`` library. Use the ``json`` library to load the ``"nested_scores.json"`` file into memory.

In [46]:
# Import the json library
import json

def extract(file_path):
    # Open the JSON file and load its contents into a dictionary
    with open(file_path, "r") as json_file:
        # Load the data from the JSON file
        raw_data = json.load(json_file)

    # return pd.read_json(file_path, orient="index")
    return raw_data

# Call the extract function, pass in the desired file_path
raw_testing_scores = extract("../Datasets/nested_scores.json")

# print(raw_testing_scores.head())
print(raw_testing_scores)

{'02M260': {'street_address': '425 West 33rd Street', 'city': 'Manhattan', 'scores': {'math': None, 'reading': None, 'writing': None}}, '06M211': {'street_address': '650 Academy Street', 'city': 'Manhattan', 'scores': {'math': None, 'reading': None, 'writing': None}}, '01M539': {'street_address': '111 Columbia Street', 'city': 'Manhattan', 'scores': {'math': 657.0, 'reading': 601.0, 'writing': 601.0}}, '02M294': {'street_address': '350 Grand Street', 'city': 'Manhattan', 'scores': {'math': 395.0, 'reading': 411.0, 'writing': 387.0}}, '02M308': {'street_address': '350 Grand Street', 'city': 'Manhattan', 'scores': {'math': 418.0, 'reading': 428.0, 'writing': 415.0}}, '02M545': {'street_address': '350 Grand Street', 'city': 'Manhattan', 'scores': {'math': 613.0, 'reading': 453.0, 'writing': 463.0}}, '01M292': {'street_address': '220 Henry Street', 'city': 'Manhattan', 'scores': {'math': 410.0, 'reading': 406.0, 'writing': 381.0}}, '01M696': {'street_address': '525 East Houston Street', 'c

You're off to a great start! The data from the JSON file has been loaded into a dictionary in-memory.
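
The output above shows ``None`` for schools without scores. As a small sketch of why, the snippet below parses a JSON string with ``json.loads()`` (the string-based counterpart of ``json.load()``) and shows that JSON ``null`` becomes Python ``None``. The JSON text is a hypothetical, shortened stand-in for the nested file.

```python
import json

# Hypothetical JSON text shaped like the nested file above
json_text = (
    '{"02M260": {"city": "Manhattan", "scores": {"math": null}}, '
    '"01M539": {"city": "Manhattan", "scores": {"math": 657.0}}}'
)

# json.loads parses a string; json.load reads from an open file object
parsed = json.loads(json_text)

print(type(parsed).__name__)               # dict
print(parsed["02M260"]["scores"]["math"])  # None (JSON null)
print(parsed["01M539"]["scores"]["math"])  # 657.0
```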

## 3. Iterating over dictionaries

### Description

Once JSON data is loaded into a dictionary, you can leverage Python's built-in tools to iterate over its keys and values.

The ``"nested_school_scores.json"`` file has been read into a dictionary stored in the ``raw_testing_scores`` variable, which takes the following form:

```json
{
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {
              "math": 657,
              "reading": 601,
              "writing": 601
        }
  }, ...
}
```

### Instructions

* Loop through the keys of the ``raw_testing_scores`` dictionary. Add each key to the ``raw_testing_scores_keys`` list.
* Now, loop through a list of values from the ``raw_testing_scores`` dictionary.
* Finally, loop through both the keys and values of the ``raw_testing_scores`` dictionary, simultaneously.

In [47]:
raw_testing_scores_keys = []

# Iterate through the keys of the raw_testing_scores dictionary
for school_id in raw_testing_scores.keys():
    # Append each key to the raw_testing_scores_keys list
    raw_testing_scores_keys.append(school_id)

print(raw_testing_scores_keys[0:3])

['02M260', '06M211', '01M539']


In [48]:
raw_testing_scores_values = []

# Iterate through the values of the raw_testing_scores dictionary
for school_info in raw_testing_scores.values():
	raw_testing_scores_values.append(school_info)
    
print(raw_testing_scores_values[0:3])


[{'street_address': '425 West 33rd Street', 'city': 'Manhattan', 'scores': {'math': None, 'reading': None, 'writing': None}}, {'street_address': '650 Academy Street', 'city': 'Manhattan', 'scores': {'math': None, 'reading': None, 'writing': None}}, {'street_address': '111 Columbia Street', 'city': 'Manhattan', 'scores': {'math': 657.0, 'reading': 601.0, 'writing': 601.0}}]


In [49]:
raw_testing_scores_keys = []
raw_testing_scores_values = []

# Iterate through both the keys and values of the raw_testing_scores dictionary
for school_id, school_info in raw_testing_scores.items():
	raw_testing_scores_keys.append(school_id)
	raw_testing_scores_values.append(school_info)

print(raw_testing_scores_keys[0:3])
print(raw_testing_scores_values[0:3])


['02M260', '06M211', '01M539']
[{'street_address': '425 West 33rd Street', 'city': 'Manhattan', 'scores': {'math': None, 'reading': None, 'writing': None}}, {'street_address': '650 Academy Street', 'city': 'Manhattan', 'scores': {'math': None, 'reading': None, 'writing': None}}, {'street_address': '111 Columbia Street', 'city': 'Manhattan', 'scores': {'math': 657.0, 'reading': 601.0, 'writing': 601.0}}]


Great iteration! Iterating through both the keys and values of dictionaries lays a foundation for working with non-tabular JSON data. Keep up the good work!
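
The same three patterns can be written more compactly. The sketch below uses a small hypothetical dictionary (``sample_scores``) in place of ``raw_testing_scores``.

```python
# Hypothetical two-school sample shaped like raw_testing_scores
sample_scores = {
    "01M539": {"city": "Manhattan", "scores": {"math": 657}},
    "27Q302": {"city": "Far Rockaway", "scores": {"math": 372}},
}

# Iterating over a dict yields its keys, so list() collects them directly
school_ids = list(sample_scores)

# .values() yields the nested dictionaries
school_infos = list(sample_scores.values())

# .items() yields (key, value) pairs, which suits a dict comprehension
city_by_school = {school_id: info["city"] for school_id, info in sample_scores.items()}

print(school_ids)
print(city_by_school)
```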

## 4. Parsing data from dictionaries

### Description

When JSON data is loaded into memory, the resulting dictionary can be complicated. A value in a key-value pair may itself be a dictionary; these are called nested dictionaries. Nested dictionaries are frequently encountered when dealing with APIs or other JSON data. In this exercise, you will practice extracting data from nested dictionaries and handling missing values.

The dictionary below is stored in the ``school`` variable. Good luck!

```json
{
    "street_address": "111 Columbia Street",
    "city": "Manhattan",
    "scores": {
        "math": 657,
        "reading": 601
    }
}
```

### Instructions

* Parse the value stored at the ``"street_address"`` key from the ``school`` dictionary.
* Parse the value stored at the ``"scores"`` key from the ``school`` dictionary.
* Parse the values stored at the ``"math"``, ``"reading"``, and ``"writing"`` keys from the ``scores`` dictionary, and set the default value to 0.

In [50]:
school = {'street_address': '111 Columbia Street',
 'city': 'Manhattan',
 'scores': {'math': 657, 'reading': 601}}

In [51]:
# Parse the street_address from the dictionary
street_address = school.get("street_address")

# Parse the scores dictionary
scores = school.get("scores")

# Try to parse the math, reading and writing values from scores
math_score = scores.get("math", 0)
reading_score = scores.get("reading", 0)
writing_score = scores.get("writing", 0)

print(f"Street Address: {street_address}")
print(f"Math: {math_score}, Reading: {reading_score}, Writing: {writing_score}")

Street Address: 111 Columbia Street
Math: 657, Reading: 601, Writing: 0


Great work! Understanding how to pull data from nested dictionaries is a valuable skill when working with non-tabular data.
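
One case the exercise does not cover is a record with no ``"scores"`` key at all. The sketch below assumes such a record (the ``school_without_scores`` dictionary is hypothetical) and passes an empty dictionary as the default so the chained lookup still works.

```python
# Hypothetical record that is missing the "scores" key entirely
school_without_scores = {
    "street_address": "111 Columbia Street",
    "city": "Manhattan",
}

# .get("scores") alone would return None, and None.get(...) raises AttributeError.
# Defaulting to {} keeps the second lookup safe.
scores = school_without_scores.get("scores", {})
math_score = scores.get("math", 0)

print(f"Math: {math_score}")
```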

## 5. Transforming JSON data

### Description

Chances are, when reading data from JSON format into a dictionary, you'll probably have to apply some level of manual transformation to the data before it can be stored in a DataFrame. This is common when working with nested dictionaries, which you'll have the opportunity to explore in this exercise.

The ``"nested_school_scores.json"`` file has been read into a dictionary available in the ``raw_testing_scores`` variable, which takes the following form:

```json
{
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {
              "math": 657,
              "reading": 601,
              "writing": 601
        }
  }, ...
}
```

TIP: You can use the following to serialize a ``dict`` to a JSON string and parse it back into a ``dict``:

```py
import json
json_str = json.dumps(dict_name)
json_var = json.loads(json_str)
```

### Instructions

* Loop through both the keys and values of the ``raw_testing_scores`` dictionary.
* Extract the ``"street_address"`` from each dictionary nested in the ``raw_testing_scores`` object. 

In [52]:
normalized_testing_scores = []

# Loop through each of the dictionary key-value pairs
for school_id, school_info in raw_testing_scores.items():
    normalized_testing_scores.append([
        school_id,
        school_info.get("street_address"),  # Pull the "street_address"
        school_info.get("city"),
        school_info.get("scores").get("math", 0),
        school_info.get("scores").get("reading", 0),
        school_info.get("scores").get("writing", 0),
    ])

print(normalized_testing_scores)

[['02M260', '425 West 33rd Street', 'Manhattan', None, None, None], ['06M211', '650 Academy Street', 'Manhattan', None, None, None], ['01M539', '111 Columbia Street', 'Manhattan', 657.0, 601.0, 601.0], ['02M294', '350 Grand Street', 'Manhattan', 395.0, 411.0, 387.0], ['02M308', '350 Grand Street', 'Manhattan', 418.0, 428.0, 415.0], ['02M545', '350 Grand Street', 'Manhattan', 613.0, 453.0, 463.0], ['01M292', '220 Henry Street', 'Manhattan', 410.0, 406.0, 381.0], ['01M696', '525 East Houston Street', 'Manhattan', 634.0, 641.0, 639.0], ['02M305', '350 Grand Street', 'Manhattan', 389.0, 395.0, 381.0], ['01M509', '145 Stanton Street', 'Manhattan', 438.0, 413.0, 394.0], ['01M448', '200 Monroe Street', 'Manhattan', 437.0, 355.0, 352.0], ['02M543', '350 Grand Street', 'Manhattan', 381.0, 396.0, 372.0], ['02M298', '100 Hester Street', 'Manhattan', 430.0, 435.0, 427.0], ['02M420', '345 East 15th Street', 'Manhattan', 452.0, 445.0, 430.0], ['02M399', '40 Irving Place', 'Manhattan', 446.0, 433.0, 

Outstanding! Using the ``json`` library and native Python, you've extracted the JSON file into a list of lists.
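
Note that the output above still contains ``None`` even though a default of ``0`` was passed: the default of ``dict.get()`` is only used when the key is absent, not when the key holds ``None``. If a replacement value is wanted at this stage, a small helper can handle both cases. The ``get_or_default()`` name is hypothetical, and whether ``0`` is an appropriate stand-in for a missing score is a modeling decision.

```python
def get_or_default(mapping, key, default=0):
    """Return mapping[key], or default when the key is missing or holds None."""
    value = mapping.get(key, default)
    return default if value is None else value


# Hypothetical scores dictionary with one None value and one missing key
sample = {"math": None, "reading": 601.0}

print(sample.get("math", 0))              # None: the key exists, so the default is ignored
print(get_or_default(sample, "math"))     # 0
print(get_or_default(sample, "reading"))  # 601.0
print(get_or_default(sample, "writing"))  # 0
```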

## 6. Transforming and cleaning DataFrames

### Description

Once data has been curated into a cleaned Python data structure, such as a list of lists, it's easy to convert this into a ``pandas`` DataFrame. You'll practice doing just this with the data that was curated in the last exercise.

Per usual, ``pandas`` has been imported as ``pd``, and the ``normalized_testing_scores`` variable stores the list of each school's testing data, as shown below.

```py
[
    ['01M539', '111 Columbia Street', 'Manhattan', 657.0, 601.0, 601.0],
    ...
]  
```

### Instructions

* Create a ``pandas`` DataFrame from the list of lists stored in the ``normalized_testing_scores`` variable.
* Set the column names for the ``normalized_data`` DataFrame.

In [53]:
# Alternate way to get back `normalized_testing_scores`

# df = raw_testing_scores.T
# scores_df = pd.json_normalize(df['scores'])
# df_normalized = pd.concat([df.drop(columns=['scores']), scores_df], axis=1)
# normalized_testing_scores = [[idx] + row.tolist() for idx, row in df_normalized.iterrows()]

In [54]:
# Create a DataFrame from the normalized_testing_scores list
normalized_data = pd.DataFrame(normalized_testing_scores)

# Set the column names
normalized_data.columns = ["school_id", "street_address", "city", "avg_score_math", "avg_score_reading", "avg_score_writing"]

normalized_data = normalized_data.set_index("school_id")
print(normalized_data.head())

                 street_address       city  avg_score_math  avg_score_reading  \
school_id                                                                       
02M260     425 West 33rd Street  Manhattan             NaN                NaN   
06M211       650 Academy Street  Manhattan             NaN                NaN   
01M539      111 Columbia Street  Manhattan           657.0              601.0   
02M294         350 Grand Street  Manhattan           395.0              411.0   
02M308         350 Grand Street  Manhattan           418.0              428.0   

           avg_score_writing  
school_id                     
02M260                   NaN  
06M211                   NaN  
01M539                 601.0  
02M294                 387.0  
02M308                 415.0  


Congrats! You've extracted a JSON file, and manipulated it such that it can be stored in a pandas DataFrame for downstream transformation.
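
As an alternative to building the list of lists by hand, ``pd.json_normalize()`` can flatten nested dictionaries directly. This is a sketch on a two-school sample (``sample_scores`` is hypothetical), not the approach the exercise uses; note that the flattened column names come out as ``scores_math`` and so on rather than ``avg_score_math``.

```python
import pandas as pd

# Hypothetical sample shaped like the nested dictionary from the previous exercises
sample_scores = {
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {"math": 657, "reading": 601, "writing": 601},
    },
    "02M260": {
        "street_address": "425 West 33rd Street",
        "city": "Manhattan",
        "scores": {"math": None, "reading": None, "writing": None},
    },
}

# Move each school ID into its record so it survives the flattening step
records = [{"school_id": school_id, **info} for school_id, info in sample_scores.items()]

# Nested keys are joined with the separator: scores -> scores_math, scores_reading, ...
flat = pd.json_normalize(records, sep="_").set_index("school_id")

print(flat)
```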

## 7. Filling missing values with pandas

### Description

When building data pipelines, it's inevitable that you'll stumble upon missing data. In some cases, you may want to remove these records from the dataset. But in others, you'll need to impute values for the missing information. In this exercise, you'll practice using ``pandas`` to impute missing test scores.

Data from the file ``"testing_scores.json"`` has been read into a DataFrame, and is stored in the variable ``raw_testing_scores``. In addition to this, ``pandas`` has been loaded as ``pd``.

### Instructions

* Print the head of the ``raw_testing_scores`` DataFrame, and observe the NaN values.
* Use the average of the ``"math_score"`` column to fill the ``NaN`` values in the ``"math_score"`` column.
* Print the head of the updated DataFrame.
* For the ``"math_score"``, ``"reading_score"`` and ``"writing_score"`` columns, update the ``transform()`` function to fill NaN values with the mean of the respective columns, in place.
* Print the head of the cleaned DataFrame.

In [62]:
raw_testing_scores = pd.read_json("../Datasets/testing_scores.json")
raw_testing_scores

Unnamed: 0,street_address,city,math_score,reading_score,writing_score
02M260,425 West 33rd Street,Manhattan,,,
06M211,650 Academy Street,Manhattan,,,
01M539,111 Columbia Street,Manhattan,657.0,601.0,601.0
02M294,350 Grand Street,Manhattan,395.0,411.0,387.0
02M308,350 Grand Street,Manhattan,418.0,428.0,415.0
...,...,...,...,...,...
27Q302,8-21 Bay 25th Street,Far Rockaway,372.0,362.0,352.0
27Q324,100-00 Beach Channel Drive,Rockaway Park,357.0,381.0,376.0
27Q262,100-00 Beach Channel Drive,Rockaway Park,427.0,430.0,423.0
27Q351,100-00 Beach Channel Drive,Rockaway Park,399.0,403.0,405.0


In [60]:
# Print the head of the `raw_testing_scores` DataFrame
print(raw_testing_scores.head())

# Fill NaN values with the average from that column
raw_testing_scores["math_score"] = raw_testing_scores["math_score"].fillna(raw_testing_scores["math_score"].mean())

# Print the head of the raw_testing_scores DataFrame
print(raw_testing_scores.head())

def transform(raw_data):
    raw_data.fillna(
        value={
            # Fill NaN values with column mean
            "math_score": raw_data["math_score"].mean(),
            "reading_score": raw_data["reading_score"].mean(),
            "writing_score": raw_data["writing_score"].mean()
        }, inplace=True
    )
    return raw_data

clean_testing_scores = transform(raw_testing_scores)

# Print the head of the clean_testing_scores DataFrame
print(clean_testing_scores.head())

              street_address       city  math_score  reading_score  \
02M260  425 West 33rd Street  Manhattan         NaN            NaN   
06M211    650 Academy Street  Manhattan         NaN            NaN   
01M539   111 Columbia Street  Manhattan       657.0          601.0   
02M294      350 Grand Street  Manhattan       395.0          411.0   
02M308      350 Grand Street  Manhattan       418.0          428.0   

        writing_score  
02M260            NaN  
06M211            NaN  
01M539          601.0  
02M294          387.0  
02M308          415.0  
              street_address       city  math_score  reading_score  \
02M260  425 West 33rd Street  Manhattan     432.944            NaN   
06M211    650 Academy Street  Manhattan     432.944            NaN   
01M539   111 Columbia Street  Manhattan     657.000          601.0   
02M294      350 Grand Street  Manhattan     395.000          411.0   
02M308      350 Grand Street  Manhattan     418.000          428.0   

        writin

Nicely done! Working with missing values is something that takes practice, and an understanding of the problem at hand. Thanks to pandas, it's easy to implement a wide variety of logic using the .fillna() method. Keep up the great work!
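
Because ``inplace=True`` modifies the DataFrame passed in, ``raw_testing_scores`` no longer contains NaN values after ``transform()`` runs. When the original should stay untouched, one option is to pass a Series of column means to ``.fillna()`` and keep the returned copy. The sketch below uses a small made-up DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical three-school sample with one gap per score column
scores = pd.DataFrame(
    {
        "city": ["Manhattan", "Manhattan", "Bronx"],
        "math_score": [657.0, np.nan, 395.0],
        "reading_score": [601.0, 411.0, np.nan],
    },
    index=["01M539", "02M294", "02M308"],
)

# Mean of every numeric column, as a Series keyed by column name
column_means = scores.mean(numeric_only=True)

# fillna matches the Series index against column names and returns a new DataFrame
filled = scores.fillna(column_means)

print(filled)
```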

## 8. Grouping data with pandas

### Description

The output of a data pipeline is typically a "modeled" dataset. This dataset provides data consumers easy access to information, without having to perform much manipulation. Grouping data with ``pandas`` helps to build modeled datasets.

``pandas`` has been imported as ``pd``, and the ``raw_testing_scores`` DataFrame contains data in the following form:

```bash
              street_address       city  math_score  reading_score  writing_score
01M539   111 Columbia Street  Manhattan       657.0          601.0          601.0
02M294      350 Grand Street  Manhattan       395.0          411.0          387.0
02M308      350 Grand Street  Manhattan       418.0          428.0          415.0
```

### Instructions

* Use ``.loc[]`` to only keep the ``"city"``, ``"math_score"``, ``"reading_score"``, and ``"writing_score"`` columns.
* Group the DataFrame by the ``"city"`` column, and find the mean of each city's math, reading, and writing scores.
* Use the ``transform()`` function to create a grouped DataFrame.

In [63]:
raw_testing_scores

Unnamed: 0,street_address,city,math_score,reading_score,writing_score
02M260,425 West 33rd Street,Manhattan,,,
06M211,650 Academy Street,Manhattan,,,
01M539,111 Columbia Street,Manhattan,657.0,601.0,601.0
02M294,350 Grand Street,Manhattan,395.0,411.0,387.0
02M308,350 Grand Street,Manhattan,418.0,428.0,415.0
...,...,...,...,...,...
27Q302,8-21 Bay 25th Street,Far Rockaway,372.0,362.0,352.0
27Q324,100-00 Beach Channel Drive,Rockaway Park,357.0,381.0,376.0
27Q262,100-00 Beach Channel Drive,Rockaway Park,427.0,430.0,423.0
27Q351,100-00 Beach Channel Drive,Rockaway Park,399.0,403.0,405.0


In [61]:
def transform(raw_data):
    # Use .loc[] to only return the needed columns
    raw_data = raw_data.loc[:, ["city", "math_score", "reading_score", "writing_score"]]

    # Group the data by city, return the grouped DataFrame
    grouped_data = raw_data.groupby(by=["city"]).mean()
    return grouped_data

# Transform the data, print the head of the DataFrame
grouped_testing_scores = transform(raw_testing_scores)
print(grouped_testing_scores.head())

           math_score  reading_score  writing_score
city                                               
Astoria    496.824000     483.084000     484.076444
Bayside    523.000000     479.000000     485.000000
Bellerose  453.000000     434.000000     439.000000
Bronx      409.202373     406.246441     399.679435
Brooklyn   418.044033     412.124364     404.615736




Great grouping! Leveraging pandas' aggregation capabilities helps to create report-ready datasets for downstream data consumers.
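
When a modeled dataset needs more than one statistic per group, named aggregation with ``.agg()`` is one way to produce descriptive column names in a single step. The sketch below runs on a made-up three-row DataFrame; the output column names (``avg_math``, ``avg_reading``, ``schools``) are illustrative choices.

```python
import pandas as pd

# Hypothetical sample data
scores = pd.DataFrame(
    {
        "city": ["Manhattan", "Manhattan", "Bronx"],
        "math_score": [657.0, 395.0, 418.0],
        "reading_score": [601.0, 411.0, 428.0],
    }
)

# Each keyword is an output column: (source column, aggregation)
city_summary = scores.groupby("city").agg(
    avg_math=("math_score", "mean"),
    avg_reading=("reading_score", "mean"),
    schools=("math_score", "size"),
)

print(city_summary)
```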

## 9. Applying advanced transformations to DataFrames

### Description

``pandas`` has a plethora of built-in transformation tools, but sometimes, more advanced logic needs to be used in a transformation. The ``apply`` function lets you apply a user-defined function to a row or column of a DataFrame, opening the door for advanced transformation and feature generation.

The ``find_street_name()`` function parses the street name from the ``"street_address"``, dropping the street number from the string. This function has been loaded into memory, and is ready to be applied to the ``raw_testing_scores`` DataFrame.

### Instructions

* In the definition of the ``transform()`` function, use the ``find_street_name()`` function to create a new column with the name ``"street_name"``.
* Use the ``transform()`` function to clean the ``raw_testing_scores`` DataFrame.
* Print the head of the ``cleaned_testing_scores`` DataFrame, observing the new ``"street_name"`` column.

In [65]:
def find_street_name(row):
    # Split the street_address by spaces
    split_street_address = row["street_address"].split(" ")
    # Remove the number
    street_number = split_street_address[0]
    try:
        int(street_number)
    except ValueError:
        return row["street_address"]
    
    return " ".join(split_street_address[1:])

In [66]:
def transform(raw_data):
    # Use the apply function to extract the street_name from the street_address
    raw_data["street_name"] = raw_data.apply(
        # Pass the correct function to the apply method
        find_street_name,
        axis=1
    )
    return raw_data

# Transform the raw_testing_scores DataFrame
cleaned_testing_scores = transform(raw_testing_scores)

# Print the head of the cleaned_testing_scores DataFrame
print(cleaned_testing_scores.head())

              street_address       city  math_score  reading_score  \
02M260  425 West 33rd Street  Manhattan         NaN            NaN   
06M211    650 Academy Street  Manhattan         NaN            NaN   
01M539   111 Columbia Street  Manhattan       657.0          601.0   
02M294      350 Grand Street  Manhattan       395.0          411.0   
02M308      350 Grand Street  Manhattan       418.0          428.0   

        writing_score       street_name  
02M260            NaN  West 33rd Street  
06M211            NaN    Academy Street  
01M539          601.0   Columbia Street  
02M294          387.0      Grand Street  
02M308          415.0      Grand Street  


Amazing stuff! Being able to 'apply' functions to a DataFrame can really help streamline data transformation, especially when the logic in a transformation is more complex than pandas can handle with its built-in functionality.
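
For simple string logic like this, a vectorized string method can sometimes stand in for a row-wise ``apply``. The sketch below strips a leading all-digit token with a regular expression. It is an approximation of ``find_street_name()``, not an exact replacement: edge cases such as repeated spaces or signed numbers may behave differently.

```python
import pandas as pd

# Hypothetical sample addresses, including one whose first token is not a plain integer
addresses = pd.DataFrame(
    {
        "street_address": [
            "425 West 33rd Street",
            "8-21 Bay 25th Street",
            "350 Grand Street",
        ]
    }
)

# Remove a leading run of digits followed by whitespace; other addresses are left as-is
addresses["street_name"] = addresses["street_address"].str.replace(
    r"^\d+\s+", "", regex=True
)

print(addresses)
```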

## 10. Loading data to a Postgres database

### Description

After data has been extracted from a source system and transformed to align with analytics or reporting use cases, it's time to load the data to a final storage medium. Storing cleaned data in a SQL database makes it simple for data consumers to access and run queries against. In this example, you'll practice loading cleaned data to a Postgres database.

``sqlalchemy`` has been imported, and ``pandas`` is available as ``pd``. The first few rows of the ``cleaned_testing_scores`` DataFrame are shown below:

```bash
             street_address       city  math_score  ... best_score
01M539  111 Columbia Street  Manhattan       657.0      Math
02M545     350 Grand Street  Manhattan       613.0      Math
01M292     220 Henry Street  Manhattan       410.0      Math
```

### Instructions

* Update the connection string to write to the ``schools`` database and create a connection object using ``sqlalchemy``.
* Use pandas to write the ``cleaned_testing_scores`` DataFrame to the ``scores`` table in the ``schools`` database.
* If the table is already populated with data, make sure to replace the values with the current DataFrame.

In [67]:
# Update the connection string, create the connection object to the schools database
# db_engine = sqlalchemy.create_engine("postgresql+psycopg2://repl:password@localhost:5432/schools")

# Write the DataFrame to the scores table
# cleaned_testing_scores.to_sql(
#     name="scores",
#     con=db_engine,
#     index=False,
#     if_exists="replace"
# )

Lovely loading! The ``.to_sql()`` method is a powerful tool that helps to simplify the ``'load'`` component of ETL pipelines.
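
If no Postgres instance is available, the effect of ``if_exists`` can still be tried locally. The sketch below swaps in an in-memory SQLite database from Python's standard library, which is an assumption made for illustration and not the database used in the exercise. The sample DataFrame is made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the Postgres "schools" database (assumption)
conn = sqlite3.connect(":memory:")

# Hypothetical two-row sample
scores = pd.DataFrame(
    {"city": ["Manhattan", "Bronx"], "math_score": [657.0, 418.0]}
)

# Writing twice with "replace" drops and recreates the table each time
scores.to_sql("scores", con=conn, if_exists="replace", index=False)
scores.to_sql("scores", con=conn, if_exists="replace", index=False)
rows_after_replace = int(
    pd.read_sql("SELECT COUNT(*) AS n FROM scores", con=conn)["n"].iloc[0]
)

# "append" adds the rows to whatever is already in the table
scores.to_sql("scores", con=conn, if_exists="append", index=False)
rows_after_append = int(
    pd.read_sql("SELECT COUNT(*) AS n FROM scores", con=conn)["n"].iloc[0]
)

conn.close()
print(rows_after_replace, rows_after_append)
```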

## 11. Validating data loaded to a Postgres Database

### Description

In this exercise, you'll finally get to build a data pipeline from end-to-end. This pipeline will extract school testing scores from a JSON file and transform the data to drop rows with missing scores. In addition to this, each school will be ranked within its city, based on its total score. Finally, the transformed dataset will be stored in a Postgres database.

To give you a head start, the ``extract()`` and ``transform()`` functions have been built and used as shown below. In addition to this, ``pandas`` has been imported as ``pd``. Best of luck!

```py
# Extract and clean the testing scores.
raw_testing_scores = extract("testing_scores.json")
cleaned_testing_scores = transform(raw_testing_scores)
```

### Instructions

* Update the ``load()`` function to write the ``clean_data`` DataFrame to the ``scores_by_city`` table in the ``schools`` database.
* If data exists in the ``scores_by_city`` table, make sure to replace it with the updated data.
* Load the data from the ``cleaned_testing_scores``, using the ``db_engine`` that has already been defined.
* Use ``pandas`` to read data from the ``scores_by_city`` table, and print the first few rows of the DataFrame to validate that data was persisted.

In [69]:
def load(clean_data, con_engine):
    # Write the DataFrame to the scores_by_city table
    clean_data.to_sql(
        name="scores_by_city",
        con=con_engine,
        if_exists="replace",  # Make sure to replace existing data
        index=True,
        index_label="school_id"
    )
 
# Call the load function, passing in the cleaned DataFrame
# load(cleaned_testing_scores, db_engine)

# Query the data in the scores_by_city table, check the head of the DataFrame
# to_validate = pd.read_sql("SELECT * FROM scores_by_city", con=db_engine)
# print(to_validate.head())


Take a second to enjoy this! You just built a data pipeline that extracts data from a JSON file, transforms it, and stores it in a Postgres database for easy downstream access. Congrats!
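
Printing the head is a quick visual check. A slightly stricter validation compares what was persisted against the DataFrame that was loaded. The sketch below again uses an in-memory SQLite connection in place of Postgres, and the ``validate()`` helper and sample data are hypothetical.

```python
import sqlite3

import pandas as pd


def load(clean_data, con):
    # Same call shape as the exercise's load(), with the table name fixed
    clean_data.to_sql(
        name="scores_by_city",
        con=con,
        if_exists="replace",
        index=True,
        index_label="school_id",
    )


def validate(clean_data, con):
    # Read the table back and compare row count and column names
    persisted = pd.read_sql(
        "SELECT * FROM scores_by_city", con=con, index_col="school_id"
    )
    same_rows = len(persisted) == len(clean_data)
    same_columns = set(persisted.columns) == set(clean_data.columns)
    return same_rows and same_columns


# Hypothetical cleaned data
cleaned = pd.DataFrame(
    {"city": ["Manhattan", "Manhattan"], "math_score": [657.0, 395.0]},
    index=["01M539", "02M294"],
)

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres engine (assumption)
load(cleaned, conn)
is_valid = validate(cleaned, conn)
conn.close()

print(is_valid)
```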