# Exploring the ozone dataset

At first glance, the one-hour and eight-hour data look the same, but comparing the two dataframes element by element shows that they are not _exactly_ identical. Here we work out what the differences are. It is a good exercise in slicing and in data wrangling with NumPy and pandas.

In the end, we find that 101 rows differ, of which:
- 99 have a different value for the target parameter `y`
- 2 are present in the one-hour data but not in the eight-hour data
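The two kinds of difference listed above can be separated with index set operations and an element-wise comparison of the shared rows. The following is a minimal sketch on toy data: the frames `one` and `eight` and their values are made up for illustration and are not the real ozone data.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values, not the real ozone data).
one = pd.DataFrame(
    {"WSR0": [0.4, 0.0, 1.8, 0.5], "y": [0.0, 1.0, 0.0, 0.0]},
    index=["4/5/1998", "4/11/1998", "4/20/1998", "4/23/1998"],
)
eight = pd.DataFrame(
    {"WSR0": [0.4, 0.0, 1.8], "y": [0.0, 0.0, 1.0]},
    index=["4/5/1998", "4/11/1998", "4/20/1998"],
)

# Rows whose index label exists in `one` but not in `eight`.
only_in_one = one.index.difference(eight.index)

# Compare the rows that both frames share, column by column.
common = one.index.intersection(eight.index)
mismatch = one.loc[common].ne(eight.loc[common])

rows_with_diff = mismatch.any(axis=1)  # True where at least one column differs
cols_with_diff = mismatch.any(axis=0)  # True for columns that differ anywhere

print(len(only_in_one), int(rows_with_diff.sum()), list(cols_with_diff[cols_with_diff].index))
```

On the toy frames this reports one row missing from `eight`, two shared rows that differ, and `y` as the only differing column. Applied to the real data, the same two counts should add up to the total number of differing rows.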

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd

In [2]:
ozone = Path("discover-projects/ozone-level-detection")
def load(name):
    """Read <name>.data.csv, taking the column names from <name>.names.txt."""
    with open(ozone / f"{name}.names.txt") as f:
        # keep the Date line and every "continuous" feature line, then append the target y
        names = [line.split(":")[0] for line in f if "continuous" in line or "Date" in line] + ["y"]
    # names[1:] skips "Date": the first CSV column (the date) becomes the index
    return pd.read_csv(ozone / f"{name}.data.csv", header=None, names=names[1:], index_col=0)

onehr, eighthr = load("onehr"), load("eighthr")

FileNotFoundError: [Errno 2] No such file or directory: 'discover-projects/ozone-level-detection/onehr.names.txt'

In [None]:
# check whether the one-hour and eight-hour data are identical
onehr.equals(eighthr)

False
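`DataFrame.equals` is an all-or-nothing test: it returns `False` as soon as the shapes, the values, or the column dtypes differ, and it does not say where the difference is. A small sketch with made-up frames (the names `base`, `extra_row` and `as_float` are illustrative only):

```python
import pandas as pd

base = pd.DataFrame({"x": [1, 2], "y": [0, 1]})
same = base.copy()

# One additional row, as when one dataset contains dates the other lacks.
extra_row = pd.concat([base, pd.DataFrame({"x": [3], "y": [0]})], ignore_index=True)

# Same values, but stored as float instead of int.
as_float = base.astype(float)

print(base.equals(same))       # identical shape, values and dtypes
print(base.equals(extra_row))  # shapes differ
print(base.equals(as_float))   # dtypes differ
```

Only the first comparison returns `True`, which is why a `False` result needs a follow-up, element-wise comparison to locate the actual differences.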

In [None]:
# compare element-wise and flag the rows in which at least one column differs
diff_rows = onehr.eq(eighthr).sum(axis=1) < onehr.shape[1]
diff = onehr.loc[diff_rows, :]
diff

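| Date | WSR0 | WSR1 | WSR2 | WSR3 | WSR4 | WSR5 | WSR6 | WSR7 | WSR8 | WSR9 | ... | RH50 | U50 | V50 | HT50 | KI | TT | SLP | SLP_ | Precp | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4/5/1998 | 0.4 | 0.5 | 2.1 | 2.2 | 2.5 | 2.4 | 2.1 | 2.9 | 3.6 | 3.4 | ... | 0.1 | 20.91 | -3.9 | 5755 | -15.9 | 19.4 | 10140 | 20 | 0 | 0.0 |
| 4/11/1998 | 0 | 0.6 | 0.4 | 0.3 | 0.1 | 0.3 | 0.2 | 1.4 | 2.6 | 3.8 | ... | 0.15 | 17.27 | -12.27 | 5795 | -12.6 | 24.2 | 10220 | 45 | 0 | 0.0 |
| 4/20/1998 | 1.8 | 0.3 | 0.1 | 0.1 | 0.1 | 0.2 | 0.2 | 0.7 | 0.9 | 2 | ... | 0.31 | 20.36 | 2.61 | 5740 | -3.5 | 30.6 | 10180 | 35 | 0 | 0.0 |
| 4/23/1998 | 0.5 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.3 | 0.8 | 1.2 | 1.3 | ... | 0.14 | 16.78 | -17.99 | 5680 | -2.4 | 37.6 | 10195 | -10 | 0 | 0.0 |
| 4/25/1998 | 3.1 | 2.4 | 2.4 | 3 | 3.4 | 3.4 | 3.9 | 4.5 | 5.5 | 5.5 | ... | 0.13 | 9.22 | -5.96 | 5790 | 7.1 | 35.4 | 10165 | -25 | 0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8/4/2004 | 0.6 | 1.2 | 0.8 | 0.4 | 0.4 | 0.4 | 0.5 | 2.4 | 2.7 | 2.3 | ... | 0.13 | -8.99 | -8.24 | 5895 | 20.55 | 33.3 | 10115 | -10 | 0 | 0.0 |
| 8/17/2004 | 0.4 | 0.1 | 0.3 | 0.5 | 0.5 | 0.5 | 0.5 | 1.3 | 1.6 | 1.6 | ... | 0.31 | ? | ? | 5895 | 5.3 | 35 | 10195 | -10 | 0 | 0.0 |
| 9/2/2004 | 0.4 | 0.8 | 0.8 | 0.8 | 0.8 | 1.2 | 1.2 | 1.5 | 1.9 | 1.8 | ... | 0.44 | 4.42 | 2.42 | 5880 | 8.9 | 42.4 | 10155 | -25 | 1.19 | 0.0 |
| 9/29/2004 | 0.1 | 0.3 | 1 | 0.5 | 0.4 | 0.4 | 0.5 | 1 | 1.1 | 1.1 | ... | 0.07 | 14.12 | 6.61 | 5835 | 20.2 | 36.4 | 10150 | 25 | 0 | 0.0 |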
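The row selection above relies on `DataFrame.eq` aligning both frames on their labels before comparing. A row that exists in only one frame is compared against missing values, so none of its columns match and it is flagged too. A minimal sketch with hypothetical frames `a` and `b`:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3], "y": [0, 0, 1]}, index=["r1", "r2", "r3"])
b = pd.DataFrame({"x": [1, 2], "y": [0, 1]}, index=["r1", "r2"])  # "r3" is missing

matches = a.eq(b)               # aligned on the union of both indexes
n_equal = matches.sum(axis=1)   # number of matching columns per row
flagged = n_equal < a.shape[1]  # rows in which at least one column does not match

print(a.loc[flagged])
```

Here `r2` is flagged because its `y` differs, and `r3` is flagged because it has no counterpart in `b`. Telling these two cases apart takes a separate lookup per row.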


In [None]:
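# collect the positions of all columns that differ in at least one row
diff_elements = set()
for row in diff.index:
    try:
        _sample = onehr.loc[row, :].eq(eighthr.loc[row, :])
        # record every mismatching column of this row, not just the first one
        diff_elements.update(np.flatnonzero(~_sample.to_numpy()))
    except KeyError:
        # the row label does not exist in the eight-hour data
        print(f"Row {row} not in eighthr data")
print(f"Column(s) containing different values: {onehr.columns[sorted(diff_elements)]}")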

Row 6/10/1998 not in eighthr data
Row 6/30/1999 not in eighthr data
Column(s) containing different values: Index(['y'], dtype='object')

