
---

## 🧩 Extracting Non-Tabular Data — Detailed Notes

### 1. 📦 Transitioning from Tabular to Non-Tabular Data
- Previous work focused on **tabular data** (structured: rows and columns).
- In this lesson, the focus shifts to **non-tabular data** — the majority in real-world scenarios.

---

### 2. 🌐 Understanding Non-Tabular Data

> **Fact**: Over 80% of data produced today is **unstructured** (source: MIT Sloan).

#### Types of Non-Tabular Data:
| Type       | Examples |
|------------|----------|
| Text       | Emails, Social Media, Logs |
| Audio      | Voice commands, Podcasts |
| Image      | Photos, Diagrams |
| Video      | Surveillance, Tutorials |
| Spatial    | GIS data, Maps |
| IoT Data   | Sensor readings, Device logs |

🔍 **Goal of the Data Engineer**: Convert raw unstructured data into a **structured/tabular format** for analysis.

---

### 3. 🔌 Working with APIs and JSON Data

#### What is an API?
- **API (Application Programming Interface)** = Interface to interact with software/data without direct DB access.
- Similar to a **bank teller** – you can deposit or withdraw, but not access the vault.

#### Key Features:
- Common in third-party data extraction.
- Ensures **security** (no direct DB access).
- Most APIs return data in **JSON format**.
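
As a rough sketch of the last point, the snippet below parses a JSON response body into Python objects and then into a DataFrame. The payload and its field names are made up for illustration, and `json.loads()` stands in for the network call; with the `requests` library, `requests.get(url).json()` performs the same parsing step on a real response.

```python
import json

import pandas as pd

# Hypothetical response body, as an API might return it (a JSON string).
response_body = '[{"ticker": "AAA", "price": 10.5}, {"ticker": "BBB", "price": 20.0}]'

# Parse the JSON text into Python objects (here: a list of dicts).
records = json.loads(response_body)

# A list of flat dicts maps directly onto rows and columns.
api_df = pd.DataFrame(records)
print(api_df)
```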

---

### 4. 📁 What is JSON?

- **JSON (JavaScript Object Notation)** = Schema-less, key-value structure.
- Resembles a **Python dictionary**.
- Can store:
  - Simple key-value pairs.
  - Lists.
  - Nested dictionaries (can get complex!).
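
To make the dictionary comparison concrete, here is a small sketch that parses a made-up JSON document containing all three shapes and checks which Python types come back. The field names are invented for illustration.

```python
import json

# Hypothetical JSON document with a simple pair, a list, and a nested object.
json_text = """
{
    "name": "sensor-1",
    "readings": [20.5, 21.0, 19.8],
    "location": {"city": "Manhattan", "floor": 3}
}
"""

parsed = json.loads(json_text)

print(type(parsed))                # a JSON object becomes a dict
print(type(parsed["readings"]))    # a JSON array becomes a list
print(parsed["location"]["city"])  # nested objects become nested dicts
```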

---

### 5. 📚 Reading JSON with `pandas.read_json()`

Use this when JSON is **already in a structured format**.

```python
import pandas as pd
df = pd.read_json("path_to_file.json", orient="columns")
```

#### `orient` Parameter:
| Orient Value | JSON Structure                                      |
|--------------|-----------------------------------------------------|
| `"columns"`  | `{col1: [..], col2: [..]}` (default)               |
| `"records"`  | `[{col1: val1, col2: val2}, {...}]`               |
| `"index"`    | `{index1: {col1: val, col2: val}, ...}`           |

🔎 Use the right `orient` value based on how your JSON file is structured.
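
A minimal sketch of the three layouts, using made-up score data held in strings rather than files. `StringIO` wraps each string so that `read_json` treats it like a file; the row labels `s1` and `s2` are hypothetical.

```python
from io import StringIO

import pandas as pd

# The same hypothetical data written in three layouts.
columns_json = '{"math": [657, 395], "reading": [601, 411]}'
records_json = '[{"math": 657, "reading": 601}, {"math": 395, "reading": 411}]'
index_json = '{"s1": {"math": 657, "reading": 601}, "s2": {"math": 395, "reading": 411}}'

df_columns = pd.read_json(StringIO(columns_json), orient="columns")
df_records = pd.read_json(StringIO(records_json), orient="records")
df_index = pd.read_json(StringIO(index_json), orient="index")

# All three hold the same values; only the "index" layout carries row labels.
print(df_columns)
print(df_records)
print(df_index)
```

If the chosen `orient` does not match the file's layout, the result is usually a transposed or oddly shaped DataFrame, so inspecting `.head()` after loading is a cheap check.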

---

### 6. 🧬 Dealing with Nested or Complex JSON

- If JSON is **not flat** or **DataFrame-ready**:
  - Load it as a **dictionary** first using Python's `json` module.
  - Then manually transform into DataFrame.

---

### 7. 🐍 Reading JSON Using `json.load()`

```python
import json

with open("file.json") as file:
    raw_data = json.load(file)  # returns dict or list
```

- Returns Python dict (or list).
- You can inspect with `type(raw_data)` and start transforming:
  - Extract nested fields.
  - Flatten structures.
  - Convert list of dicts to DataFrame.

Example transformation:
```python
df = pd.DataFrame(raw_data["data"])  # if "data" key holds records
```
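
When the records themselves contain nested dictionaries, pandas also offers `pd.json_normalize()`, which flattens them into dotted column names. This is an alternative to the manual approach these notes follow, sketched here on made-up data; the `"data"` key and field names are assumptions.

```python
import pandas as pd

# Hypothetical parsed JSON: a "data" key holding records with a nested "scores" dict.
raw_data = {
    "data": [
        {"school_id": "01M539", "city": "Manhattan", "scores": {"math": 657, "reading": 601}},
        {"school_id": "02M294", "city": "Manhattan", "scores": {"math": 395, "reading": 411}},
    ]
}

# Nested keys become columns such as "scores.math" and "scores.reading".
flat_df = pd.json_normalize(raw_data["data"])
print(flat_df.columns.tolist())
```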

---

### 8. ✅ Summary of Best Practices

| Task | Tool | Notes |
|------|------|-------|
| Structured JSON | `pd.read_json()` | Use `orient` wisely |
| Unstructured/Nested JSON | `json.load()` | Flatten to dict → transform to DataFrame |
| Real-time data/API | Requests + JSON | Use `.json()` method to parse response |
| Goal | Transform non-tabular → tabular | Enables analysis |

---


### **Ingesting JSON data with pandas**
When developing a data pipeline, you may have to work with non-tabular data and data sources, such as APIs or JSON files. In this exercise, we'll practice extracting data from a JSON file using pandas.

pandas has been imported as pd, and the JSON file you'll ingest is stored at the path "testing_scores.json".

```python
def extract(file_path):
    # Read the JSON file into a DataFrame
    return pd.read_json(file_path, orient="records")

# Call the extract function with the appropriate path, assign to raw_testing_scores
raw_testing_scores = extract("testing_scores.json")

# Output the head of the DataFrame
print(raw_testing_scores.head())
```


### **Reading JSON data into memory**
When data is stored in JSON format, it's not always easy to load into a DataFrame. This is the case for the "nested_testing_scores.json" file. Here, the data will have to be manually manipulated before it can be stored in a DataFrame.

To help get you started, pandas has been loaded into the workspace as pd.

First, use pandas to read the JSON file into a DataFrame by passing the "nested_scores.json" file path to the extract() function:

```python
def extract(file_path):
    # Read the JSON file into a DataFrame, orient by index
    return pd.read_json(file_path, orient="index")

# Call the extract function, pass in the desired file_path
raw_testing_scores = extract("nested_scores.json")
print(raw_testing_scores.head())
```

Next, import the json library and use it to load the "nested_scores.json" file into memory:

```python
# Import the json library
import json

def extract(file_path):
    with open(file_path, "r") as json_file:
        # Load the data from the JSON file
        raw_data = json.load(json_file)
    return raw_data

raw_testing_scores = extract("nested_scores.json")

# Print the raw_testing_scores
print(raw_testing_scores)
```



---

## 🔄 Transforming Non-Tabular Data — Detailed Notes

### 1. 🗺️ Recap: From JSON to Dictionary
- **Previous step**: Parsed complex JSON using `json.load()` into a Python dictionary.
- ✅ Now the task is to **transform this dictionary into a tabular structure (DataFrame)**.

---

### 2. 🧱 Dictionary Structure Recap

Typical structure of the loaded dictionary:
```python
{
  "2023-01-01": {
    "price": {"open": 150, "close": 158},
    "volume": 30000
  },
  "2023-01-02": {
    "price": {"open": 160, "close": 165},
    "volume": 32000
  }
}
```

- 📌 Keys: Timestamps (or unique IDs).
- 📌 Values: Nested dictionaries (with price, volume, etc.).
- 📌 Target: Convert this structure into a **list of lists** → DataFrame.

---

### 3. 🔁 Iterating Over Dictionary Components

#### Dictionary methods:
| Method     | Output Type  | Description |
|------------|--------------|-------------|
| `.keys()`  | `dict_keys` view   | All keys (e.g. timestamps); wrap in `list()` for a list |
| `.values()`| `dict_values` view | All values (e.g. nested dicts) |
| `.items()` | `dict_items` view  | `(key, value)` tuples, one per entry |

✅ Use `.items()` when both key and value are needed during iteration.
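
A short sketch of the three methods on a made-up dictionary in the same shape as the stock data above. In Python 3 these methods return view objects rather than lists, so the example wraps them in `list()` where an actual list is wanted.

```python
# Hypothetical dictionary in the same shape as the stock data above.
raw_stock_data = {
    "2023-01-01": {"price": {"open": 150, "close": 158}, "volume": 30000},
    "2023-01-02": {"price": {"open": 160, "close": 165}, "volume": 32000},
}

# Views reflect the dictionary; list() materializes them.
timestamps = list(raw_stock_data.keys())
entries = list(raw_stock_data.values())

# .items() yields (key, value) tuples that unpack directly in a for loop.
volumes = {}
for timestamp, entry in raw_stock_data.items():
    volumes[timestamp] = entry["volume"]

print(timestamps)
print(volumes)
```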

---

### 4. 🔍 Accessing Dictionary Values with `.get()`

```python
value = dictionary.get("key", default_value)
```

#### Benefits:
- Avoids errors if a key is missing.
- You can specify a fallback value (`None` or something else).
- ✅ Use `.get()` **twice** for nested dictionaries.

Example (nested access):
```python
open_price = value.get("price", {}).get("open", None)
```
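
To see why the `{}` default matters, the sketch below uses a made-up record with no `"price"` key at all and compares square-bracket access with chained `.get()` calls.

```python
# Hypothetical record that is missing the "price" key entirely.
value = {"volume": 30000}

# Square brackets raise KeyError for a missing key.
try:
    open_price = value["price"]["open"]
except KeyError:
    open_price = None

# Chained .get() calls: the {} default keeps the second .get() valid.
safe_open = value.get("price", {}).get("open")
safe_volume = value.get("volume", 0)

print(open_price, safe_open, safe_volume)
```

Without the `{}` default, `value.get("price")` would return `None`, and calling `.get("open")` on `None` would raise an `AttributeError`.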

---

### 5. 🧱 Transforming into List of Lists

Create a list where each element is a **row**:
```python
parsed_stock_data = []

for timestamp, data in raw_stock_data.items():
    open_price = data.get("price", {}).get("open", None)
    close_price = data.get("price", {}).get("close", None)
    volume = data.get("volume", None)
    
    parsed_stock_data.append([timestamp, open_price, close_price, volume])
```

---

### 6. 📊 Creating a DataFrame

```python
import pandas as pd

df = pd.DataFrame(parsed_stock_data, columns=["timestamp", "open", "close", "volume"])
df.set_index("timestamp", inplace=True)
```

> 🧠 Now your non-tabular JSON data has become a clean, tabular DataFrame!

---

### 7. ✅ Summary Workflow

| Step | Action |
|------|--------|
| 1️⃣ | Load JSON using `json.load()` |
| 2️⃣ | Use `.items()` to loop through key-value pairs |
| 3️⃣ | Use `.get()` for safe value access |
| 4️⃣ | Build a `list of lists` with the desired fields |
| 5️⃣ | Convert to DataFrame using `pd.DataFrame()` |
| 6️⃣ | Set meaningful column names and index |

---

### 🧪 Example Code (All Together)
```python
import json
import pandas as pd

# Step 1: Load JSON data
with open("stock_data.json") as file:
    raw_stock_data = json.load(file)

# Step 2: Transform dictionary into list of lists
parsed_stock_data = []
for timestamp, data in raw_stock_data.items():
    open_price = data.get("price", {}).get("open", None)
    close_price = data.get("price", {}).get("close", None)
    volume = data.get("volume", None)
    parsed_stock_data.append([timestamp, open_price, close_price, volume])

# Step 3: Create DataFrame
df = pd.DataFrame(parsed_stock_data, columns=["timestamp", "open", "close", "volume"])
df.set_index("timestamp", inplace=True)
```

---


### **Iterating over dictionaries**
Once JSON data is loaded into a dictionary, you can leverage Python's built-in tools to iterate over its keys and values.

The "nested_school_scores.json" file has been read into a dictionary stored in the raw_testing_scores variable, which takes the following form:
```
{
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {
              "math": 657,
              "reading": 601,
              "writing": 601
        }
  }, ...
}
```

```python
raw_testing_scores_keys = []
raw_testing_scores_values = []

# Iterate through the key-value pairs of the raw_testing_scores dictionary
for school_id, school_info in raw_testing_scores.items():
    raw_testing_scores_keys.append(school_id)
    raw_testing_scores_values.append(school_info)

print(raw_testing_scores_keys[0:3])
print(raw_testing_scores_values[0:3])
```

To collect only the values, iterate over `.values()` instead:

```python
raw_testing_scores_values = []

# Iterate through the values of the raw_testing_scores dictionary
for school_info in raw_testing_scores.values():
    raw_testing_scores_values.append(school_info)

print(raw_testing_scores_values[0:3])
```

### **Parsing data from dictionaries**
When JSON data is loaded into memory, the resulting dictionary can be complicated. The value in a key-value pair may itself be a dictionary; these are called nested dictionaries. Nested dictionaries are frequently encountered when dealing with APIs or other JSON data. In this exercise, you will practice extracting data from nested dictionaries and handling missing values.

The dictionary below is stored in the school variable. Good luck!
```
{
    "street_address": "111 Columbia Street",
    "city": "Manhattan",
    "scores": {
        "math": 657,
        "reading": 601
    }
}
```

```python
# Parse the street_address from the dictionary
street_address = school.get("street_address")

# Parse the scores dictionary
scores = school.get("scores")

# Try to parse the math, reading and writing values from scores
math_score = scores.get("math", 0)
reading_score = scores.get("reading", 0)
writing_score = scores.get("writing", 0)

print(f"Street Address: {street_address}")
print(f"Math: {math_score}, Reading: {reading_score}, Writing: {writing_score}")
```


### **Transforming JSON data**
Chances are, when reading data from JSON format into a dictionary, you'll probably have to apply some level of manual transformation to the data before it can be stored in a DataFrame. This is common when working with nested dictionaries, which you'll have the opportunity to explore in this exercise.

The "nested_school_scores.json" file has been read into a dictionary available in the raw_testing_scores variable, which takes the following form:

```
{
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {
              "math": 657,
              "reading": 601,
              "writing": 601
        }
  }, ...
}
```

```python
normalized_testing_scores = []

# Loop through each of the dictionary key-value pairs
for school_id, school_info in raw_testing_scores.items():
    # Fall back to an empty dict so a missing "scores" key does not raise
    scores = school_info.get("scores", {})
    normalized_testing_scores.append([
        school_id,
        school_info.get("street_address"),  # Pull the "street_address"
        school_info.get("city"),
        scores.get("math", 0),
        scores.get("reading", 0),
        scores.get("writing", 0),
    ])

print(normalized_testing_scores)
```


### **Transforming and cleaning DataFrames**
Once data has been curated into a cleaned Python data structure, such as a list of lists, it's easy to convert this into a pandas DataFrame. You'll practice doing just this with the data that was curated in the last exercise.

As usual, pandas has been imported as pd, and the normalized_testing_scores variable stores a list with each school's testing data, as shown below.
```
[
    ['01M539', '111 Columbia Street', 'Manhattan', 657.0, 601.0, 601.0],
    ...
]   
```


```python
# Create a DataFrame from the normalized_testing_scores list
normalized_data = pd.DataFrame(normalized_testing_scores)

# Set the column names
normalized_data.columns = ["school_id", "street_address", "city", "avg_score_math", "avg_score_reading", "avg_score_writing"]

normalized_data = normalized_data.set_index("school_id")
print(normalized_data.head())
```



---

## 🧠 Advanced Data Transformation with pandas — Notes

After converting non-tabular data to a DataFrame, the next step is **cleaning, organizing, and enriching the data** using pandas’ advanced functionalities.

---

### 1. 🩹 Handling Missing Values (`.fillna()`)

#### 🔹 Basic usage:
```python
df.fillna(0)
```
Replaces **all NaN values** in the DataFrame with `0`.

---

#### 🔹 Column-specific replacement using a dictionary:
```python
df.fillna(value={"open": 0, "close": 0.5})  # dict keys are column names; no axis argument needed
```
Fills missing:
- `open` values with `0`
- `close` values with `0.5`

---

#### 🔹 Use another column to fill missing values:
```python
df["open"] = df["open"].fillna(df["close"])  # assign back instead of inplace=True on a selected column
```
Fills NaNs in `open` with the corresponding `close` values.
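
The three variants can be tried side by side on a small frame. The data below is made up; the column names follow the stock example used in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical price data with gaps.
df = pd.DataFrame({
    "open": [150.0, np.nan, 170.0],
    "close": [158.0, 165.0, np.nan],
})

# 1) One value for every NaN.
filled_all = df.fillna(0)

# 2) A different value per column.
filled_by_column = df.fillna(value={"open": 0, "close": 0.5})

# 3) Fill "open" from "close" on the same row; assign the result back.
filled_from_close = df.copy()
filled_from_close["open"] = filled_from_close["open"].fillna(filled_from_close["close"])

print(filled_from_close)
```

Each call returns a new object, so the original `df` still contains its NaN values afterwards.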

---

### 2. 📊 Grouping Data (`.groupby()`)

#### Similar to SQL’s `GROUP BY`:

SQL Example:
```sql
SELECT ticker, AVG(open), AVG(close)
FROM stock_data
GROUP BY ticker;
```

#### pandas equivalent:
```python
# Select the numeric columns first, matching the SQL query above
grouped_df = df.groupby("ticker")[["open", "close"]].mean()
```

Other aggregations:
```python
df.groupby("ticker").min()
df.groupby("ticker").max()
df.groupby("ticker").sum()
```
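
A small worked example on made-up data. It selects the numeric columns before aggregating and also shows `.agg()`, which applies a different aggregation to each column.

```python
import pandas as pd

# Hypothetical stock data with a "ticker" column to group on.
df = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "open": [10.0, 12.0, 20.0, 24.0],
    "close": [11.0, 13.0, 22.0, 26.0],
})

# Average open and close per ticker, like the SQL query above.
averages = df.groupby("ticker")[["open", "close"]].mean()

# .agg() takes a mapping of column name to aggregation.
summary = df.groupby("ticker").agg({"open": "min", "close": "max"})

print(averages)
print(summary)
```

The grouping column becomes the index of the result; call `.reset_index()` if it should be a regular column again.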

---

### 3. 🛠️ Custom Transformations (`.apply()`)

When built-in methods aren’t enough, define a function and use `.apply()`.

#### Example: Classify stock price movement
```python
def classify_change(row):
    if row["close"] > row["open"]:
        return "Increase"
    else:
        return "Decrease"

df["change"] = df.apply(classify_change, axis=1)
```

✅ Setting `axis=1` ensures that the function is applied **row-wise**.
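
Here is the same function run on a tiny made-up frame, plus a lambda for short one-off logic. Note that under this rule an unchanged price (open equal to close) is labeled "Decrease"; the `spread` column is an illustrative addition.

```python
import pandas as pd

# Hypothetical prices: one rise, one fall, one unchanged.
df = pd.DataFrame({"open": [150, 160, 170], "close": [158, 155, 170]})

def classify_change(row):
    # Anything that is not a rise counts as "Decrease".
    if row["close"] > row["open"]:
        return "Increase"
    else:
        return "Decrease"

# axis=1 passes each row to the function as a Series.
df["change"] = df.apply(classify_change, axis=1)

# A lambda works for short, one-off logic.
df["spread"] = df.apply(lambda row: row["close"] - row["open"], axis=1)
print(df)
```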

---

### ✅ Summary Table

| Task                        | Method              | Description |
|-----------------------------|---------------------|-------------|
| Fill all NaNs               | `df.fillna(0)`      | Replaces all missing values |
| Fill column-specific NaNs   | `df.fillna({...})`  | Dictionary of column-value pairs |
| Fill using another column   | `df[col1].fillna(df[col2])` | Row-wise filling |
| Group by and aggregate      | `df.groupby("col")` | Similar to SQL GROUP BY |
| Custom logic per row        | `df.apply(func, axis=1)` | Apply complex functions |

---

### 🧪 Practice Tip:
Try combining all three techniques (filling, grouping, and applying custom logic) on a small dataset to simulate a **mini ETL pipeline**.


### **Filling missing values with pandas**
When building data pipelines, it's inevitable that you'll stumble upon missing data. In some cases, you may want to remove these records from the dataset. But in others, you'll need to impute values for the missing information. In this exercise, you'll practice using pandas to impute missing test scores.

Data from the file "testing_scores.json" has been read into a DataFrame, and is stored in the variable raw_testing_scores. In addition to this, pandas has been loaded as pd.

```python
# Fill NaN values with the average from that column
raw_testing_scores["math_score"] = raw_testing_scores["math_score"].fillna(raw_testing_scores["math_score"].mean())

# Print the head of the raw_testing_scores DataFrame
print(raw_testing_scores.head())
```

```python
def transform(raw_data):
    raw_data.fillna(
        value={
            # Fill NaN values with column mean
            "math_score": raw_data["math_score"].mean(),
            "reading_score": raw_data["reading_score"].mean(),
            "writing_score": raw_data["writing_score"].mean(),
        }, inplace=True
    )
    return raw_data

clean_testing_scores = transform(raw_testing_scores)

# Print the head of the clean_testing_scores DataFrame
print(clean_testing_scores.head())
```

### **Grouping data with pandas**
The output of a data pipeline is typically a "modeled" dataset. This dataset gives data consumers easy access to information without having to perform much manipulation. Grouping data with pandas helps to build modeled datasets.

pandas has been imported as pd, and the raw_testing_scores DataFrame includes the city, math_score, reading_score, and writing_score columns.



```python
def transform(raw_data):
    # Use .loc[] to only return the needed columns
    raw_data = raw_data.loc[:, ["city", "math_score", "reading_score", "writing_score"]]

    # Group the data by city, return the grouped DataFrame
    grouped_data = raw_data.groupby(by=["city"]).mean()
    return grouped_data

# Transform the data, print the head of the DataFrame
grouped_testing_scores = transform(raw_testing_scores)
print(grouped_testing_scores.head())
```


### **Applying advanced transformations to DataFrames**
pandas has a plethora of built-in transformation tools, but sometimes, more advanced logic needs to be used in a transformation. The apply function lets you apply a user-defined function to a row or column of a DataFrame, opening the door for advanced transformation and feature generation.

The find_street_name() function parses the street name from the "street_address", dropping the street number from the string. This function has been loaded into memory, and is ready to be applied to the raw_testing_scores DataFrame.
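
The exercise does not show find_street_name() itself. One way such a function might look is sketched below, assuming every address starts with a street number followed by a space, and that the function receives a full row because it is applied with `axis=1`. The body and the sample rows are hypothetical.

```python
import pandas as pd

def find_street_name(row):
    # Hypothetical implementation: split once on the first space and
    # keep everything after the street number.
    parts = row["street_address"].split(" ", 1)
    return parts[1] if len(parts) == 2 else row["street_address"]

# Made-up sample rows in the shape of the testing scores data.
sample = pd.DataFrame({
    "street_address": ["111 Columbia Street", "220 Henry Street"],
    "city": ["Manhattan", "Manhattan"],
})
sample["street_name"] = sample.apply(find_street_name, axis=1)
print(sample["street_name"].tolist())
```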

```python
def transform(raw_data):
    # Use the apply function to extract the street_name from the street_address
    raw_data["street_name"] = raw_data.apply(
        # Pass the correct function to the apply method
        find_street_name,
        axis=1
    )
    return raw_data

# Transform the raw_testing_scores DataFrame
cleaned_testing_scores = transform(raw_testing_scores)

# Print the head of the cleaned_testing_scores DataFrame
print(cleaned_testing_scores.head())
```



---

## 🧠 Loading Data to SQL with pandas — Notes

After extracting and transforming data, the final ETL step is **loading the cleaned data** into a SQL database for downstream use in analytics and reporting.

---

### 1. 📥 `.to_sql()` — Load DataFrame to SQL

```python
df.to_sql(
    name="table_name",       # SQL table name
    con=engine,              # SQLAlchemy engine (connection object)
    if_exists="append",      # Options: 'fail', 'replace', 'append'
    index=True,              # Include the DataFrame index
    index_label="id_column"  # Label for index column in SQL
)
```

---

### 2. 🔗 Create SQLAlchemy Engine

To connect to a SQL database (e.g., **PostgreSQL**), use SQLAlchemy:

```python
from sqlalchemy import create_engine

db_uri = "postgresql://username:password@host:port/database"
engine = create_engine(db_uri)
```

📌 **Example**:
```python
engine = create_engine("postgresql://user:pass@localhost:5432/market")
```

---

### 3. 🧾 Example: Persisting Data

```python
clean_stock_data.to_sql(
    name="filtered_stock_data",
    con=engine,
    if_exists="append",
    index=True,
    index_label="timestamps"
)
```

- `name`: **filtered_stock_data** — Table name
- `con`: **engine** — Connection to the Postgres database
- `if_exists`: **append** — Adds to existing table
- `index`: **True** — Writes index to SQL
- `index_label`: **timestamps** — SQL column for index

---

### 4. ✅ Validating Data Load

Once data is loaded, validation is critical:

```python
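# Read from SQL to validate; index_col restores the index that .to_sql() wrote as a column
sql_df = pd.read_sql("SELECT * FROM filtered_stock_data", con=engine, index_col="timestamps")

# Compare with original DataFrame (assumes the table was empty before this load)
assert len(sql_df) == len(clean_stock_data)
# .equals() also requires matching dtypes, which can shift in a database round trip
assert sql_df.equals(clean_stock_data)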
```

#### Best Practices:
- ✅ Row counts should match
- ✅ Each row’s values should be identical
- ✅ Perform manual data quality checks
- ✅ Add this validation to monitoring pipelines to **instill trust**
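
The round trip can be tried locally without a Postgres server. The sketch below swaps in an in-memory SQLite connection from the standard library, which pandas accepts in place of a SQLAlchemy engine; the data is made up, and comparing `to_dict("index")` output is one simple way to check values without tripping over dtype differences.

```python
import sqlite3

import pandas as pd

# Made-up cleaned data, indexed by timestamp.
clean_stock_data = pd.DataFrame(
    {"open": [150.0, 160.0], "close": [158.0, 165.0], "volume": [30000, 32000]},
    index=pd.Index(["2023-01-01", "2023-01-02"], name="timestamps"),
)

# In-memory SQLite connection, standing in for the Postgres engine.
con = sqlite3.connect(":memory:")

clean_stock_data.to_sql(
    name="filtered_stock_data",
    con=con,
    if_exists="replace",
    index=True,
    index_label="timestamps",
)

# Read the table back and restore the index column.
sql_df = pd.read_sql(
    "SELECT * FROM filtered_stock_data", con=con, index_col="timestamps"
)

# Row counts and values should match the DataFrame that was written.
assert len(sql_df) == len(clean_stock_data)
assert sql_df.to_dict("index") == clean_stock_data.to_dict("index")
con.close()
```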

---

### ✅ Summary Table

| Task                            | Tool/Method           | Example |
|---------------------------------|------------------------|---------|
| Connect to SQL DB               | `create_engine()`      | `engine = create_engine(...)` |
| Load DataFrame to SQL           | `.to_sql()`            | `df.to_sql(name="table", con=engine)` |
| Validate data in SQL            | `pd.read_sql()`        | `pd.read_sql("SELECT * ...", con=engine)` |
| Compare with original DataFrame | `.equals()`, `len()`   | `assert df1.equals(df2)` |

---

### 🚀 Real-World Application
Once persisted in SQL, your data is now:
- Ready for **visualization tools** (e.g., Power BI, Tableau)
- Queryable using standard **SQL**
- Accessible for other **data consumers** and **systems**

---


### **Loading data to a Postgres database**
After data has been extracted from a source system and transformed to align with analytics or reporting use cases, it's time to load the data to a final storage medium. Storing cleaned data in a SQL database makes it simple for data consumers to access and run queries against. In this example, you'll practice loading cleaned data to a Postgres database.

sqlalchemy has been imported, and pandas is available as pd. The first few rows of the cleaned_testing_scores DataFrame are shown below:

```python
# Update the connection string, create the connection object to the schools database
db_engine = sqlalchemy.create_engine("postgresql+psycopg2://repl:password@localhost:5432/schools")

# Write the DataFrame to the scores table
cleaned_testing_scores.to_sql(
    name="scores",
    con=db_engine,
    index=False,
    if_exists="replace"
)
```


### **Validating data loaded to a Postgres Database**
In this exercise, you'll finally get to build a data pipeline from end to end. This pipeline will extract school testing scores from a JSON file and transform the data to drop rows with missing scores. In addition, each school will be ranked within its city, based on its total score. Finally, the transformed dataset will be stored in a Postgres database.

To give you a head start, the extract() and transform() functions have been built and used as shown below. In addition to this, pandas has been imported as pd. Best of luck!
``` python
# Extract and clean the testing scores.
raw_testing_scores = extract("testing_scores.json")
cleaned_testing_scores = transform(raw_testing_scores)
```
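
The transform() function itself is not shown in the exercise. A minimal sketch of the steps described above might look like this, assuming the column names from the earlier exercises (`school_id`, `city`, `math_score`, `reading_score`, `writing_score`). The `total_score` and `city_rank` names and the sample rows are invented here.

```python
import numpy as np
import pandas as pd

def transform(raw_data):
    score_columns = ["math_score", "reading_score", "writing_score"]

    # Drop rows where any score is missing.
    clean_data = raw_data.dropna(subset=score_columns).copy()

    # Total score per school.
    clean_data["total_score"] = clean_data[score_columns].sum(axis=1)

    # Rank schools within each city, highest total first.
    clean_data["city_rank"] = (
        clean_data.groupby("city")["total_score"].rank(ascending=False, method="dense")
    )
    return clean_data.set_index("school_id")

# Made-up sample data; school C is missing a math score.
raw = pd.DataFrame({
    "school_id": ["A", "B", "C", "D"],
    "city": ["Manhattan", "Manhattan", "Bronx", "Bronx"],
    "math_score": [657.0, 395.0, np.nan, 500.0],
    "reading_score": [601.0, 411.0, 450.0, 480.0],
    "writing_score": [601.0, 387.0, 440.0, 470.0],
})
print(transform(raw))
```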

Update the load() function to write the clean_data DataFrame to the scores_by_city table in the schools database. If data already exists in the scores_by_city table, make sure to replace it with the updated data.

```python
def load(clean_data, con_engine):
    # Store the data in the schools database
    clean_data.to_sql(
        name="scores_by_city",
        con=con_engine,
        if_exists="replace",  # Make sure to replace existing data
        index=True,
        index_label="school_id"
    )
```

Then load the data from cleaned_testing_scores using the db_engine that has already been defined, and use pandas to read from the scores_by_city table to validate that the data was persisted.

```python
# Call the load function, passing in the cleaned DataFrame
load(cleaned_testing_scores, db_engine)

# Query the data in the scores_by_city table, check the head of the DataFrame
to_validate = pd.read_sql("SELECT * FROM scores_by_city", con=db_engine)
print(to_validate.head())
```


### **Creating fixtures with pytest**
When building unit tests, you'll sometimes have to do a bit of setup before testing can begin. Doing this setup within a unit test can make the tests more difficult to read, and may have to be repeated several times. Luckily, pytest offers a way to solve these problems, with fixtures.

For this exercise, pandas has been imported as pd, and the extract() function shown below is available for use!
```python
def extract(file_path):
    return pd.read_csv(file_path)
```

```python
# Import pytest
import pytest

# Create a pytest fixture
@pytest.fixture()
def raw_tax_data():
    raw_data = extract("raw_tax_data.csv")

    # Return the raw DataFrame
    return raw_data
```


### **Unit testing a data pipeline with fixtures**
You've learned in the last video that unit testing can help to instill more trust in your data pipeline, and can even help to catch bugs throughout development. In this exercise, you'll practice writing both fixtures and unit tests, using the pytest library and assert.

The transform function that you'll be building unit tests around is shown below for reference. pandas has been imported as pd, and the pytest library is loaded and ready for use.

```python
def transform(raw_data):
    raw_data["tax_rate"] = raw_data["total_taxes_paid"] / raw_data["total_taxable_income"]
    raw_data.set_index("industry_name", inplace=True)
    return raw_data
```

```python
# Define a pytest fixture
@pytest.fixture()
def clean_tax_data():
    raw_data = pd.read_csv("raw_tax_data.csv")

    # Transform the raw_data, store in clean_data DataFrame, and return the variable
    clean_data = transform(raw_data)
    return clean_data

# Pass the fixture to the function
def test_tax_rate(clean_tax_data):
    # Assert values are within the expected range
    assert clean_tax_data["tax_rate"].max() <= 1 and clean_tax_data["tax_rate"].min() >= 0
```
