In [33]:
import duckdb as ddb
import pandas as pd

In [34]:
con = ddb.connect("../air_quality.db")
print(con)



<duckdb.duckdb.DuckDBPyConnection object at 0x11ffbad30>


In [35]:
# Run the query and convert the result into a pandas DataFrame
# df = con.query("SELECT * FROM raw.air_quality_data WHERE parameter in ('so2', 'pm10', 'pm25')").to_df()
df = con.query("SELECT * FROM raw.air_quality_data").to_df()

df.head()
df.tail()


Unnamed: 0,location_id,sensors_id,location,datetime,lat,lon,parameter,units,value,month,year,ingestion_datetime
3307,2009,3569,San Francisco-2009,2024-01-26 10:00:00,37.7658,-122.3978,pm25,µg/m³,14.0,1,2024,2025-01-10 16:28:50.084
3308,2009,4272468,San Francisco-2009,2024-01-26 09:00:00,37.7658,-122.3978,no,ppm,0.0019,1,2024,2025-01-10 16:28:50.084
3309,2009,4272468,San Francisco-2009,2024-01-26 10:00:00,37.7658,-122.3978,no,ppm,0.0013,1,2024,2025-01-10 16:28:50.084
3310,2009,4272198,San Francisco-2009,2024-01-26 09:00:00,37.7658,-122.3978,nox,ppm,0.0209,1,2024,2025-01-10 16:28:50.084
3311,2009,4272198,San Francisco-2009,2024-01-26 10:00:00,37.7658,-122.3978,nox,ppm,0.0145,1,2024,2025-01-10 16:28:50.084


In [36]:
# Basic inspection of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3312 entries, 0 to 3311
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   location_id         3312 non-null   int64         
 1   sensors_id          3312 non-null   int64         
 2   location            3312 non-null   object        
 3   datetime            3312 non-null   datetime64[us]
 4   lat                 3312 non-null   float64       
 5   lon                 3312 non-null   float64       
 6   parameter           3312 non-null   object        
 7   units               3312 non-null   object        
 8   value               3312 non-null   float64       
 9   month               3312 non-null   object        
 10  year                3312 non-null   int64         
 11  ingestion_datetime  3312 non-null   datetime64[us]
dtypes: datetime64[us](2), float64(3), int64(3), object(4)
memory usage: 310.6+ KB


In [37]:
# High level distribution 
df.describe()

Unnamed: 0,location_id,sensors_id,datetime,lat,lon,value,year,ingestion_datetime
count,3312.0,3312.0,3312,3312.0,3312.0,3312.0,3312.0,3312
mean,2009.0,1423494.0,2024-01-13 21:27:21.304347,37.7658,-122.3978,1.064233,2024.0,2025-01-10 16:28:50.084000
min,2009.0,3569.0,2024-01-01 09:00:00,37.7658,-122.3978,-3.0,2024.0,2025-01-10 16:28:50.084000
25%,2009.0,3570.0,2024-01-07 14:45:00,37.7658,-122.3978,0.0064,2024.0,2025-01-10 16:28:50.084000
50%,2009.0,25672.0,2024-01-13 21:00:00,37.7658,-122.3978,0.0204,2024.0,2025-01-10 16:28:50.084000
75%,2009.0,4272198.0,2024-01-20 04:00:00,37.7658,-122.3978,0.4,2024.0,2025-01-10 16:28:50.084000
max,2009.0,4272468.0,2024-01-26 10:00:00,37.7658,-122.3978,102.0,2024.0,2025-01-10 16:28:50.084000
std,0.0,2003818.0,,0.0,1.4213e-14,3.146158,0.0,


In [38]:
# Describe the string-like (object dtype) columns. Note: include='O' uses the letter O, not a zero
df.describe(include='O')

Unnamed: 0,location,parameter,units,month
count,3312,3312,3312,3312
unique,1,6,2,1
top,San Francisco-2009,pm25,ppm,1
freq,3312,572,2740,3312


In [39]:
df.columns

Index(['location_id', 'sensors_id', 'location', 'datetime', 'lat', 'lon',
       'parameter', 'units', 'value', 'month', 'year', 'ingestion_datetime'],
      dtype='object')

In [40]:
# Check for duplicate measurements using the columns that identify a reading.
# Columns such as 'location' are left out because they repeat on every row.
# Note: this returned an empty DataFrame, which means there are no duplicates.
df[df.duplicated(subset=['location_id', 'datetime', 'parameter', 'units', 'value'])]

Unnamed: 0,location_id,sensors_id,location,datetime,lat,lon,parameter,units,value,month,year,ingestion_datetime


In [41]:
# Check whether the different parameters have the same number of measurements.
# Note: only pm25 differs (572 rows); the other five parameters have 548 rows each.
df.groupby(by='parameter', as_index=False).count()

Unnamed: 0,parameter,location_id,sensors_id,location,datetime,lat,lon,units,value,month,year,ingestion_datetime
0,co,548,548,548,548,548,548,548,548,548,548,548
1,no,548,548,548,548,548,548,548,548,548,548,548
2,no2,548,548,548,548,548,548,548,548,548,548,548
3,nox,548,548,548,548,548,548,548,548,548,548,548
4,o3,548,548,548,548,548,548,548,548,548,548,548
5,pm25,572,572,572,572,572,572,572,572,572,572,572
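
One way to follow up on a count mismatch like this is to compare the timestamps each parameter covers. The sketch below is a minimal illustration on made-up data, not the real table: the `toy` frame and its values are assumptions, and `co` stands in for any of the parameters with fewer rows.

```python
import pandas as pd

# Hypothetical toy data: pm25 has one more hourly reading than co.
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 10:00", "2024-01-01 11:00",
        "2024-01-01 09:00", "2024-01-01 10:00",
    ]),
    "parameter": ["pm25", "pm25", "pm25", "co", "co"],
    "value": [14.0, 12.0, 13.0, 0.3, 0.4],
})

# Row count per parameter (same information as the groupby/count above).
counts = toy["parameter"].value_counts()

# Timestamps that appear for pm25 but not for co.
pm25_times = set(toy.loc[toy["parameter"] == "pm25", "datetime"])
co_times = set(toy.loc[toy["parameter"] == "co", "datetime"])
only_pm25 = sorted(pm25_times - co_times)

print(counts)
print(only_pm25)
```

Applied to the real `df`, the same set difference would list the hours that pm25 reported and another parameter did not.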


In [42]:
con.close()

### Explanation for the ROW_NUMBER() function 

Think of the `ROW_NUMBER()` function as a way to assign a sequential number to each row within a group of data, based on a specific sort order. Here’s an easy way to break it down:

### Imagine a Stack of Papers
- You have a stack of papers for different locations, sensors, and dates.
- You organize the papers by location, then by sensor, and then by date and pollutant (this is the **PARTITION BY** part—it’s how the rows are grouped together).
- Within each group, you sort the papers by the time they were received, starting with the most recent at the top (this is the **ORDER BY ingestion_datetime DESC** part).

### Now Assign Numbers
- After organizing, you go through each group and number the papers: 1 for the first, 2 for the second, and so on. 
- The first paper in each group (the one with the most recent `ingestion_datetime`) gets the number `1`.
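
The stack-of-papers procedure can be sketched in plain Python. This is only an illustration of the idea, not how a database implements it. The `rows` list and the helper name `add_row_numbers` are made up for this example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows; timestamps are ISO strings, so string order is time order.
rows = [
    {"location_id": 101, "sensors_id": "A", "datetime": "2025-01-01",
     "parameter": "co", "ingestion_datetime": "2025-01-02 08:00:00"},
    {"location_id": 102, "sensors_id": "B", "datetime": "2025-01-01",
     "parameter": "no2", "ingestion_datetime": "2025-01-03 09:00:00"},
    {"location_id": 101, "sensors_id": "A", "datetime": "2025-01-01",
     "parameter": "co", "ingestion_datetime": "2025-01-02 10:00:00"},
]

# The PARTITION BY columns: rows with the same key belong to the same group.
partition_key = itemgetter("location_id", "sensors_id", "datetime", "parameter")


def add_row_numbers(rows):
    """Emulate ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ingestion_datetime DESC)."""
    # Most recent first, then a stable sort by group keeps that order inside each group.
    by_recency = sorted(rows, key=itemgetter("ingestion_datetime"), reverse=True)
    by_group = sorted(by_recency, key=partition_key)
    numbered = []
    for _, group in groupby(by_group, key=partition_key):
        # Numbering restarts at 1 for every group.
        for rn, row in enumerate(group, start=1):
            numbered.append({**row, "rn": rn})
    return numbered


numbered = add_row_numbers(rows)
for row in numbered:
    print(row["location_id"], row["parameter"], row["ingestion_datetime"], row["rn"])
```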

### What Does This Look Like in SQL?
Here’s an example:  
If you have a table with data like this:

| location_id | sensors_id | datetime   | parameter | ingestion_datetime    |
|-------------|------------|------------|-----------|-----------------------|
| 101         | A          | 2025-01-01 | co        | 2025-01-02 10:00:00  |
| 101         | A          | 2025-01-01 | co        | 2025-01-02 08:00:00  |
| 102         | B          | 2025-01-01 | no2       | 2025-01-03 09:00:00  |

After applying `ROW_NUMBER()`:

| location_id | sensors_id | datetime   | parameter | ingestion_datetime    | rn |
|-------------|------------|------------|-----------|-----------------------|----|
| 101         | A          | 2025-01-01 | co        | 2025-01-02 10:00:00  |  1 |
| 101         | A          | 2025-01-01 | co        | 2025-01-02 08:00:00  |  2 |
| 102         | B          | 2025-01-01 | no2       | 2025-01-03 09:00:00  |  1 |

### Why is This Useful?
Once rows are numbered, you can easily pick the "best" or "latest" row (like `rn = 1`) within each group, leaving behind duplicates or older data. 
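
A minimal sketch of that "keep only `rn = 1`" step, again using `sqlite3` as an assumed stand-in for DuckDB and a toy in-memory table. The CTE name `ranked` is made up for this example.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("""
    CREATE TABLE air_quality_data (
        location_id INTEGER,
        sensors_id TEXT,
        datetime TEXT,
        parameter TEXT,
        ingestion_datetime TEXT
    )
""")
demo_con.executemany(
    "INSERT INTO air_quality_data VALUES (?, ?, ?, ?, ?)",
    [
        (101, "A", "2025-01-01", "co", "2025-01-02 10:00:00"),
        (101, "A", "2025-01-01", "co", "2025-01-02 08:00:00"),
        (102, "B", "2025-01-01", "no2", "2025-01-03 09:00:00"),
    ],
)

# Number the rows in a CTE, then keep only the newest row of each group.
latest = demo_con.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY location_id, sensors_id, datetime, parameter
                   ORDER BY ingestion_datetime DESC
               ) AS rn
        FROM air_quality_data
    )
    SELECT location_id, sensors_id, datetime, parameter, ingestion_datetime
    FROM ranked
    WHERE rn = 1
    ORDER BY location_id
""").fetchall()
demo_con.close()

for row in latest:
    print(row)
```

The older `08:00:00` copy of the `co` reading is dropped, and each group contributes exactly one row.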

Does this make it easier to understand? 😊

* This does help, but if several rows share the same rn (row number), how can rn be used as an identifier for, say, no2? We would now have two groups of no2 with the same rn.

Great observation! Here's the key idea: **`ROW_NUMBER()` ensures there are no duplicate row numbers within a single group**. However, there can be multiple groups (e.g., different `datetime` values for `no2`), and each group's rows will have their own `rn` starting from `1`.

Let’s dive into why this is not a problem:

### Unique Row Numbers Are Within Each Group
The `ROW_NUMBER()` function assigns `rn` **separately for each group** (as defined by the `PARTITION BY` clause). So, if you have two different groups of `no2` (say, one for `datetime = 2025-01-01` and another for `datetime = 2025-01-02`), each group starts counting its rows independently. This means:

- The **row numbers are unique within a group**.
- You can safely use `rn = 1` to identify the most recent record *within that specific group*.

### Why Duplicates Across Groups Aren’t an Issue
When you query the data, the combination of the `PARTITION BY` fields (`location_id`, `sensors_id`, `datetime`, and `parameter`) ensures that each group is unique. Let’s revisit our example:

| location_id | sensors_id | datetime   | parameter | ingestion_datetime    | rn |
|-------------|------------|------------|-----------|-----------------------|----|
| 102         | B          | 2025-01-01 | no2       | 2025-01-03 09:00:00  |  1 |
| 102         | B          | 2025-01-02 | no2       | 2025-01-04 10:00:00  |  1 |

Here’s what happens:
1. `location_id = 102`, `sensors_id = B`, `datetime = 2025-01-01`, `parameter = no2`:  
   This is one group. The most recent record here has `rn = 1`.

2. `location_id = 102`, `sensors_id = B`, `datetime = 2025-01-02`, `parameter = no2`:  
   This is a **different group**, so the row with the most recent `ingestion_datetime` in this group also has `rn = 1`.

When querying for `rn = 1`, you’re selecting **the most recent record for each unique group**. The groups are already distinct because of `location_id`, `sensors_id`, `datetime`, and `parameter`.
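
The point can be checked with a few lines of Python. The sketch below copies the two example rows into a list called `numbered_rows` (an illustrative name) and counts how often each `rn` and each combination of group fields plus `rn` occurs.

```python
from collections import Counter

# The two example rows: (location_id, sensors_id, datetime, parameter,
# ingestion_datetime, rn).
numbered_rows = [
    (102, "B", "2025-01-01", "no2", "2025-01-03 09:00:00", 1),
    (102, "B", "2025-01-02", "no2", "2025-01-04 10:00:00", 1),
]

# rn on its own repeats across groups...
rn_counts = Counter(row[5] for row in numbered_rows)

# ...but the group fields together with rn identify exactly one row.
key_counts = Counter(
    (row[0], row[1], row[2], row[3], row[5]) for row in numbered_rows
)

print(rn_counts[1])              # how many rows have rn = 1
print(max(key_counts.values()))  # largest number of rows sharing one full key
```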

### Summary
While there can be multiple `rn = 1` rows across groups, **they belong to separate groups**, making it easy to distinguish them. When you query the view, the unique combination of fields like `location_id`, `sensors_id`, `datetime`, and `parameter` ensures there’s no confusion. 

This means `rn` is effectively a local identifier within each group, not a global one, and that’s why duplicates across groups are perfectly fine.