## UCI Ozone Level Detection Dataset

- **UCI ID**: 172
- **Name**: Ozone Level Detection
- **Repository URL**: [UCI Repository Link](https://archive.ics.uci.edu/dataset/172/ozone+level+detection)
- **Data URL**: [Download Data](https://archive.ics.uci.edu/static/public/172/data.csv)
- **Abstract**: Contains two ground ozone level datasets: an eight-hour peak set (`eighthr.data`) and a one-hour peak set (`onehr.data`). Data were collected from 1998 to 2004 in the Houston, Galveston, and Brazoria areas.

### General Information

- **Area**: Climate and Environment
- **Tasks**: Classification
- **Characteristics**: Multivariate, Sequential, Time-Series
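- **Number of Instances**: 2536 (as reported in the metadata; the frame loaded below has 5,070 rows, presumably because the eight-hour and one-hour sets are returned together)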
- **Number of Features**: 72
- **Feature Types**: Real
- **Target Column**: `Class`
- **Index Columns**: `Dataset`, `Date`
- **Has Missing Values**: Yes
- **Missing Values Symbol**: `NaN`
- **Year of Dataset Creation**: 2008
- **Last Updated**: Fri, Mar 29, 2024
- **Dataset DOI**: [10.24432/C5NG6W](https://doi.org/10.24432/C5NG6W)
- **Creators**: Kun Zhang, Wei Fan, XiaoJing Yuan

### Additional Information

- **Attributes Summary**: Temperature (`T`), wind speed (`WS`), relative humidity, geopotential height, K-Index, T-Totals, sea level pressure (`SLP`), and precipitation (`Precp`). Attributes for key metrics include:
  - **WSR_PK**: Peak wind speed resultant (average of wind vector)
  - **WSR_AV**: Average wind speed
  - **T_PK**: Peak temperature
  - **T_AV**: Average temperature
  - **T85, RH85, U85, V85, HT85**: Metrics at 850 hPa (approx. 1500 m)
  - **T70, RH70, U70, V70, HT70**: Metrics at 700 hPa (approx. 3100 m)
  - **T50, RH50, U50, V50, HT50**: Metrics at 500 hPa (approx. 5500 m)
  - **SLP_**: Sea level pressure change from previous day
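
The naming scheme above lends itself to grouping columns programmatically. The following is a minimal sketch, assuming the 72 column names reported by `X.columns` in this notebook; the helper `group_columns` and the group labels are hypothetical and not part of `ucimlrepo`.

```python
import re
from collections import defaultdict

# Column names as reported by X.columns in this notebook (72 in total).
COLUMNS = (
    [f"WSR{h}" for h in range(24)] + ["WSR_PK", "WSR_AV"]
    + [f"T{h}" for h in range(24)] + ["T_PK", "T_AV"]
    + [f"{m}{lvl}" for lvl in (85, 70, 50) for m in ("T", "RH", "U", "V", "HT")]
    + ["KI", "TT", "SLP", "SLP_", "Precp"]
)


def group_columns(columns):
    """Hypothetical helper: bucket column names by family."""
    groups = defaultdict(list)
    for name in columns:
        if re.fullmatch(r"WSR\d+", name):
            groups["hourly_wind"].append(name)
        # Pressure-level names are checked before hourly temperature,
        # because T85/T70/T50 would otherwise match the T<number> pattern.
        elif re.fullmatch(r"(T|RH|U|V|HT)(85|70|50)", name):
            groups[f"level_{name[-2:]}0hPa"].append(name)
        elif re.fullmatch(r"T\d+", name):
            groups["hourly_temp"].append(name)
        else:
            groups["summary"].append(name)
    return dict(groups)


groups = group_columns(COLUMNS)
for label, names in groups.items():
    print(label, len(names))
```

Grouping this way makes it easier to treat the hourly series and the pressure-level readings separately, for example when scaling or imputing.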

### Features Overview

| Name      | Role     | Type        | Description                                  | Units | Missing Values |
|-----------|----------|-------------|----------------------------------------------|-------|----------------|
| Dataset   | ID       | Categorical | None                                         | None  | No             |
| Date      | ID       | Date        | None                                         | None  | No             |
| WSR0      | Feature  | Continuous  | Wind Speed at reference time                 | None  | Yes            |
| WSR1      | Feature  | Continuous  | Wind Speed at next time interval             | None  | Yes            |
| ...       | ...      | ...         | ...                                          | ...   | ...            |
| TT        | Feature  | Continuous  | T-Totals                                     | None  | Yes            |
| SLP       | Feature  | Integer     | Sea Level Pressure                           | None  | Yes            |
| SLP_      | Feature  | Integer     | Change in Sea Level Pressure from previous day | None | Yes            |
| Precp     | Feature  | Continuous  | Precipitation                                | None  | No             |
| Class     | Target   | Binary      | Ozone level category                         | None  | No             |


In [2]:
# !pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
ozone_level_detection = fetch_ucirepo(id=172) 
  

In [5]:
# data (as pandas dataframes) 
X = ozone_level_detection.data.features 
y = ozone_level_detection.data.targets 
  

In [6]:
# metadata 
print(ozone_level_detection.metadata) 

{'uci_id': 172, 'name': 'Ozone Level Detection', 'repository_url': 'https://archive.ics.uci.edu/dataset/172/ozone+level+detection', 'data_url': 'https://archive.ics.uci.edu/static/public/172/data.csv', 'abstract': 'Two ground ozone level data sets are included in this collection. One is the eight hour peak set (eighthr.data), the other is the one hour peak set (onehr.data). Those data were collected from 1998 to 2004 at the Houston, Galveston and Brazoria area.', 'area': 'Climate and Environment', 'tasks': ['Classification'], 'characteristics': ['Multivariate', 'Sequential', 'Time-Series'], 'num_instances': 2536, 'num_features': 72, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': ['Dataset', 'Date'], 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2008, 'last_updated': 'Fri Mar 29 2024', 'dataset_doi': '10.24432/C5NG6W', 'creators': ['Kun Zhang', 'Wei Fan', 'XiaoJing Yuan'], 'intro_paper': None, 'additional_i

In [None]:
# variable information 
print(ozone_level_detection.variables) 


In [8]:
X.shape, y.shape

((5070, 72), (5070, 1))

In [10]:
X.head()

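|   | WSR0 | WSR1 | WSR2 | WSR3 | WSR4 | WSR5 | WSR6 | WSR7 | WSR8 | WSR9 | ... | T50 | RH50 | U50 | V50 | HT50 | KI | TT | SLP | SLP_ | Precp |
|---|------|------|------|------|------|------|------|------|------|------|-----|-----|------|-----|-----|------|----|----|-----|------|-------|
| 0 | 0.8 | 1.8 | 2.4 | 2.1 | 2.0 | 2.1 | 1.5 | 1.7 | 1.9 | 2.3 | ... | -15.5 | 0.15 | 10.67 | -1.56 | 5795.0 | -12.1 | 17.9 | 10330.0 | -55.0 | 0.0 |
| 1 | 2.8 | 3.2 | 3.3 | 2.7 | 3.3 | 3.2 | 2.9 | 2.8 | 3.1 | 3.4 | ... | -14.5 | 0.48 | 8.39 | 3.84 | 5805.0 | 14.05 | 29.0 | 10275.0 | -55.0 | 0.0 |
| 2 | 2.9 | 2.8 | 2.6 | 2.1 | 2.2 | 2.5 | 2.5 | 2.7 | 2.2 | 2.5 | ... | -15.9 | 0.6 | 6.94 | 9.8 | 5790.0 | 17.9 | 41.3 | 10235.0 | -40.0 | 0.0 |
| 3 | 4.7 | 3.8 | 3.7 | 3.8 | 2.9 | 3.1 | 2.8 | 2.5 | 2.4 | 3.1 | ... | -16.8 | 0.49 | 8.73 | 10.54 | 5775.0 | 31.15 | 51.7 | 10195.0 | -40.0 | 2.08 |
| 4 | 2.6 | 2.1 | 1.6 | 1.4 | 0.9 | 1.5 | 1.2 | 1.4 | 1.3 | 1.4 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.58 |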


In [11]:
y.head()

|   | Class |
|---|-------|
| 0 | 0     |
| 1 | 0     |
| 2 | 0     |
| 3 | 0     |
| 4 | 0     |
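
Since `X` and `y` come back as separate frames, a common next step is to join them and inspect the class balance. The sketch below uses small toy frames in place of the real data; the `Class` values in particular are made up for illustration and say nothing about the actual distribution.

```python
import pandas as pd

# Toy stand-ins for X and y (illustrative values only).
X_toy = pd.DataFrame({"WSR0": [0.8, 2.8, 2.9, 4.7], "Precp": [0.0, 0.0, 0.0, 2.08]})
y_toy = pd.DataFrame({"Class": [0, 0, 1, 0]})

# Both frames share the same row index, so a column-wise concat aligns them.
if not X_toy.index.equals(y_toy.index):
    raise ValueError("features and targets are not aligned")
combined = pd.concat([X_toy, y_toy], axis=1)

# Absolute counts and relative shares per class.
class_counts = combined["Class"].value_counts().sort_index()
class_share = combined["Class"].value_counts(normalize=True).sort_index()
print(class_counts.to_dict())
print(class_share.round(2).to_dict())
```

With the real data, replacing `X_toy` and `y_toy` by `X` and `y` should give the actual class counts.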


In [12]:
X.columns

Index(['WSR0', 'WSR1', 'WSR2', 'WSR3', 'WSR4', 'WSR5', 'WSR6', 'WSR7', 'WSR8',
       'WSR9', 'WSR10', 'WSR11', 'WSR12', 'WSR13', 'WSR14', 'WSR15', 'WSR16',
       'WSR17', 'WSR18', 'WSR19', 'WSR20', 'WSR21', 'WSR22', 'WSR23', 'WSR_PK',
       'WSR_AV', 'T0', 'T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9',
       'T10', 'T11', 'T12', 'T13', 'T14', 'T15', 'T16', 'T17', 'T18', 'T19',
       'T20', 'T21', 'T22', 'T23', 'T_PK', 'T_AV', 'T85', 'RH85', 'U85', 'V85',
       'HT85', 'T70', 'RH70', 'U70', 'V70', 'HT70', 'T50', 'RH50', 'U50',
       'V50', 'HT50', 'KI', 'TT', 'SLP', 'SLP_', 'Precp'],
      dtype='object')

In [13]:
X.isnull().sum()

WSR0     598
WSR1     584
WSR2     588
WSR3     584
WSR4     586
        ... 
KI       272
TT       250
SLP      190
SLP_     317
Precp      4
Length: 72, dtype: int64
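
Raw counts are easier to compare as shares of the total row count. The following sketch, on a toy frame with made-up values, computes the missing share per column and then fills gaps with the column median. Median imputation is only one simple option and ignores the temporal order of the records, so treat it as an assumption rather than a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (illustrative values only).
df = pd.DataFrame({
    "WSR0": [0.8, np.nan, 2.9, 4.7],
    "KI": [-12.1, 14.05, np.nan, np.nan],
    "Precp": [0.0, 0.0, 0.0, 2.08],
})

# Share of missing values per column, largest first.
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Fill each gap with the median of its column.
df_imputed = df.fillna(df.median(numeric_only=True))
print(df_imputed)
```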

In [14]:
X.dtypes

WSR0     float64
WSR1     float64
WSR2     float64
WSR3     float64
WSR4     float64
          ...   
KI       float64
TT       float64
SLP      float64
SLP_     float64
Precp    float64
Length: 72, dtype: object

In [18]:
numerical_features = X.select_dtypes(include=['int64', 'float64'])
numerical_features

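|      | WSR0 | WSR1 | WSR2 | WSR3 | WSR4 | WSR5 | WSR6 | WSR7 | WSR8 | WSR9 | ... | T50 | RH50 | U50 | V50 | HT50 | KI | TT | SLP | SLP_ | Precp |
|------|------|------|------|------|------|------|------|------|------|------|-----|-----|------|-----|-----|------|----|----|-----|------|-------|
| 0 | 0.8 | 1.8 | 2.4 | 2.1 | 2.0 | 2.1 | 1.5 | 1.7 | 1.9 | 2.3 | ... | -15.5 | 0.15 | 10.67 | -1.56 | 5795.0 | -12.10 | 17.90 | 10330.0 | -55.0 | 0.00 |
| 1 | 2.8 | 3.2 | 3.3 | 2.7 | 3.3 | 3.2 | 2.9 | 2.8 | 3.1 | 3.4 | ... | -14.5 | 0.48 | 8.39 | 3.84 | 5805.0 | 14.05 | 29.00 | 10275.0 | -55.0 | 0.00 |
| 2 | 2.9 | 2.8 | 2.6 | 2.1 | 2.2 | 2.5 | 2.5 | 2.7 | 2.2 | 2.5 | ... | -15.9 | 0.60 | 6.94 | 9.80 | 5790.0 | 17.90 | 41.30 | 10235.0 | -40.0 | 0.00 |
| 3 | 4.7 | 3.8 | 3.7 | 3.8 | 2.9 | 3.1 | 2.8 | 2.5 | 2.4 | 3.1 | ... | -16.8 | 0.49 | 8.73 | 10.54 | 5775.0 | 31.15 | 51.70 | 10195.0 | -40.0 | 2.08 |
| 4 | 2.6 | 2.1 | 1.6 | 1.4 | 0.9 | 1.5 | 1.2 | 1.4 | 1.3 | 1.4 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.58 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5065 | 0.3 | 0.4 | 0.5 | 0.5 | 0.2 | 0.3 | 0.4 | 0.4 | 1.3 | 2.2 | ... | -12.4 | 0.07 | 7.93 | -4.41 | 5800.0 | -25.60 | 21.80 | 10295.0 | 65.0 | 0.00 |
| 5066 | 1.0 | 1.4 | 1.1 | 1.7 | 1.5 | 1.7 | 1.8 | 1.5 | 2.1 | 2.4 | ... | -12.0 | 0.04 | 5.95 | -1.14 | 5845.0 | -19.40 | 19.10 | 10310.0 | 15.0 | 0.00 |
| 5067 | 0.8 | 0.8 | 1.2 | 0.9 | 0.4 | 0.6 | 0.8 | 1.1 | 1.5 | 1.5 | ... | -11.8 | 0.06 | 7.80 | -0.64 | 5845.0 | -9.60 | 35.20 | 10275.0 | -35.0 | 0.00 |
| 5068 | 1.3 | 0.9 | 1.5 | 1.2 | 1.6 | 1.8 | 1.1 | 1.0 | 1.9 | 2.0 | ... | -10.8 | 0.25 | 7.72 | -0.89 | 5845.0 | -19.60 | 34.20 | 10245.0 | -30.0 | 0.05 |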


In [22]:
# Count distinct values per column (nunique is a method, so it needs parentheses)
X.nunique()
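
`DataFrame.nunique` is a method, so it has to be called with parentheses to return the per-column counts of distinct values. A minimal sketch on a toy frame (illustrative values only) shows the call and one possible use, flagging columns that never vary:

```python
import pandas as pd

# Toy frame (illustrative values only).
df = pd.DataFrame({"WSR0": [0.8, 2.8, 2.8], "Precp": [0.0, 0.0, 0.0]})

# Series with the number of distinct non-null values per column.
unique_counts = df.nunique()

# Columns with at most one distinct value carry no information for a classifier.
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
print(unique_counts.to_dict())
print(constant_cols)
```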

In [24]:
# Separate numerical features
numerical_features = X.select_dtypes(include=['int64', 'float64'])

# Separate categorical features
categorical_features = X.select_dtypes(include=['object', 'category'])


In [27]:
# Initial separation by data types
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Adjust if needed
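# Adjust if needed: add known categorical columns, skipping names absent from X
# ('ZipCode' and 'ID' do not appear in X.columns for this dataset)
categorical_features.extend(col for col in ['ZipCode', 'ID'] if col in X.columns)
numerical_features = [col for col in numerical_features if col not in categorical_features]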
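
The same separation can be sketched on a toy frame that actually contains a categorical column. The column names `StationCode` and `Season` below are hypothetical and do not exist in the ozone dataset; they only illustrate how manually flagged columns can be added while names missing from the frame are skipped.

```python
import pandas as pd

# Toy frame; StationCode and Season are hypothetical columns.
df = pd.DataFrame({
    "WSR0": [0.8, 2.8],
    "SLP": [10330.0, 10275.0],
    "StationCode": [1, 2],  # numeric code that is really categorical
    "Season": pd.Categorical(["winter", "summer"]),
})

# Initial separation by dtype.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Manually flagged categorical columns; names absent from the frame are skipped.
known_categorical = ["StationCode", "ZipCode"]
categorical += [c for c in known_categorical if c in df.columns and c not in categorical]
numerical = [c for c in numerical if c not in categorical]

print(numerical)
print(categorical)
```

Filtering against `df.columns` keeps the lists consistent with the frame, so later indexing with them should not fail on a missing name.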


In [29]:
numerical_features

['WSR0',
 'WSR1',
 'WSR2',
 'WSR3',
 'WSR4',
 'WSR5',
 'WSR6',
 'WSR7',
 'WSR8',
 'WSR9',
 'WSR10',
 'WSR11',
 'WSR12',
 'WSR13',
 'WSR14',
 'WSR15',
 'WSR16',
 'WSR17',
 'WSR18',
 'WSR19',
 'WSR20',
 'WSR21',
 'WSR22',
 'WSR23',
 'WSR_PK',
 'WSR_AV',
 'T0',
 'T1',
 'T2',
 'T3',
 'T4',
 'T5',
 'T6',
 'T7',
 'T8',
 'T9',
 'T10',
 'T11',
 'T12',
 'T13',
 'T14',
 'T15',
 'T16',
 'T17',
 'T18',
 'T19',
 'T20',
 'T21',
 'T22',
 'T23',
 'T_PK',
 'T_AV',
 'T85',
 'RH85',
 'U85',
 'V85',
 'HT85',
 'T70',
 'RH70',
 'U70',
 'V70',
 'HT70',
 'T50',
 'RH50',
 'U50',
 'V50',
 'HT50',
 'KI',
 'TT',
 'SLP',
 'SLP_',
 'Precp']

In [30]:
# Check data type
first_col = X.iloc[:, 0]
print(first_col.dtype)


float64


In [31]:
# Note: X.columns[0] is 'WSR0', a real feature, not a leftover index column.
# Reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning.
X = X.drop(columns=X.columns[0])
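
pandas raises `SettingWithCopyWarning` when a frame derived from another frame is modified in place. Two common ways around it are sketched below on a toy frame (illustrative values only): take an explicit copy first, or use the non-inplace form of `drop`, which returns a new frame.

```python
import pandas as pd

# Toy frame standing in for the full dataset (illustrative values only).
full = pd.DataFrame({
    "Date": ["1/1/1998", "1/2/1998"],
    "WSR0": [0.8, 2.8],
    "WSR1": [1.8, 3.2],
})

# A column selection like this is derived from `full`.
features = full[["WSR0", "WSR1"]]

# Option 1: take an explicit copy before mutating in place.
features_copy = features.copy()
features_copy.drop(columns="WSR0", inplace=True)

# Option 2: use the non-inplace form, which returns a new frame.
features_new = features.drop(columns="WSR0")

print(features_copy.columns.tolist())
print(features_new.columns.tolist())
```

Either way the original frame keeps all of its columns, which makes it easier to revisit the decision to drop a feature.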

