## Ozone Level Detection

In this project we will be using the [Ozone Level Detection dataset from UCI](https://archive.ics.uci.edu/dataset/172/ozone+level+detection).

This is a **binary classification** problem with `70+` numerical features, mostly *temperature* and *wind speed* measurements taken at various times of day, recorded during `1998-2004` in the *Houston*, *Galveston* and *Brazoria* area.


The steps are as follows:
1. Data pre-processing
    - Loading
    - Cleaning
    - Exploring
2. Building the model
    - Logistic Regression
    - Naive Bayes
    - SVM
    - Keras Linear + Sigmoid
    - PyTorch Linear + Sigmoid

In [1]:
## first the imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 1.1. Loading

In [2]:
data = pd.read_csv('https://archive.ics.uci.edu/static/public/172/data.csv', index_col='Date')
data.head()

| Date | Dataset | WSR0 | WSR1 | WSR2 | WSR3 | WSR4 | WSR5 | WSR6 | WSR7 | WSR8 | ... | RH50 | U50 | V50 | HT50 | KI | TT | SLP | SLP_ | Precp | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1/1/1998 | 8hr | 0.8 | 1.8 | 2.4 | 2.1 | 2.0 | 2.1 | 1.5 | 1.7 | 1.9 | ... | 0.15 | 10.67 | -1.56 | 5795.0 | -12.1 | 17.9 | 10330.0 | -55.0 | 0.0 | 0 |
| 1/2/1998 | 8hr | 2.8 | 3.2 | 3.3 | 2.7 | 3.3 | 3.2 | 2.9 | 2.8 | 3.1 | ... | 0.48 | 8.39 | 3.84 | 5805.0 | 14.05 | 29.0 | 10275.0 | -55.0 | 0.0 | 0 |
| 1/3/1998 | 8hr | 2.9 | 2.8 | 2.6 | 2.1 | 2.2 | 2.5 | 2.5 | 2.7 | 2.2 | ... | 0.6 | 6.94 | 9.8 | 5790.0 | 17.9 | 41.3 | 10235.0 | -40.0 | 0.0 | 0 |
| 1/4/1998 | 8hr | 4.7 | 3.8 | 3.7 | 3.8 | 2.9 | 3.1 | 2.8 | 2.5 | 2.4 | ... | 0.49 | 8.73 | 10.54 | 5775.0 | 31.15 | 51.7 | 10195.0 | -40.0 | 2.08 | 0 |
| 1/5/1998 | 8hr | 2.6 | 2.1 | 1.6 | 1.4 | 0.9 | 1.5 | 1.2 | 1.4 | 1.3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.58 | 0 |


In [8]:
## the overall info on the loaded data
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5070 entries, 1/1/1998 to 12/31/2004
Data columns (total 74 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Dataset  5070 non-null   object 
 1   WSR0     4472 non-null   float64
 2   WSR1     4486 non-null   float64
 3   WSR2     4482 non-null   float64
 4   WSR3     4486 non-null   float64
 5   WSR4     4484 non-null   float64
 6   WSR5     4486 non-null   float64
 7   WSR6     4488 non-null   float64
 8   WSR7     4492 non-null   float64
 9   WSR8     4490 non-null   float64
 10  WSR9     4496 non-null   float64
 11  WSR10    4494 non-null   float64
 12  WSR11    4486 non-null   float64
 13  WSR12    4496 non-null   float64
 14  WSR13    4494 non-null   float64
 15  WSR14    4494 non-null   float64
 16  WSR15    4498 non-null   float64
 17  WSR16    4502 non-null   float64
 18  WSR17    4504 non-null   float64
 19  WSR18    4498 non-null   float64
 20  WSR19    4486 non-null   float64
 21  WSR20 

A few issues stand out:
- Missing values in many columns (for example, `WSR0` has 4472 non-null values out of 5070 rows)
- `float64` and `int64` dtypes where smaller types would suffice
- Column names that are not self-explanatory
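
To put numbers on the first two points, a minimal sketch along these lines could be used. It runs on a small made-up frame (`toy`, a hypothetical stand-in for `data`) so that it is self-contained; the column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    "WSR0": [0.8, 2.8, np.nan, 4.7],
    "T0": [5.2, np.nan, np.nan, 9.1],
    "Class": [0, 0, 1, 0],
})

# Number of missing values per column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)

# Share of rows that contain at least one missing value.
rows_with_missing = toy.isna().any(axis=1).mean()

# Memory used by each column, in bytes (index excluded).
memory_bytes = toy.memory_usage(index=False, deep=True)

print(missing_counts)
print(f"rows with a missing value: {rows_with_missing:.0%}")
print(memory_bytes)
```

Running the same three expressions on `data` instead of `toy` should give the corresponding figures for the real dataset.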

Our approach will be to:
- Fill the missing temperature and wind speed values with averages, and drop whatever missing values remain
- Optionally downcast the numeric columns to smaller dtypes
- Optionally rename the columns to something more descriptive

In [4]:
def 

['WSR0',
 'WSR1',
 'WSR2',
 'WSR3',
 'WSR4',
 'WSR5',
 'WSR6',
 'WSR7',
 'WSR8',
 'WSR9',
 'WSR10',
 'WSR11',
 'WSR12',
 'WSR13',
 'WSR14',
 'WSR15',
 'WSR16',
 'WSR17',
 'WSR18',
 'WSR19',
 'WSR20',
 'WSR21',
 'WSR22',
 'WSR23',
 'WSR_PK',
 'WSR_AV']
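
The cell that produced this list is truncated in the source, so its actual definition is unknown. One way to obtain a list like this is to filter the column names by prefix. The helper name `columns_with_prefix` and the `toy` frame below are hypothetical and only serve as an illustration.

```python
import pandas as pd


def columns_with_prefix(frame: pd.DataFrame, prefix: str) -> list:
    """Return the column names of `frame` that start with `prefix`, in order."""
    return [col for col in frame.columns if col.startswith(prefix)]


# Empty toy frame that only carries a few of the dataset's column names.
toy = pd.DataFrame(
    columns=["Dataset", "WSR0", "WSR1", "WSR_PK", "WSR_AV", "T0", "Class"]
)

wind_cols = columns_with_prefix(toy, "WSR")
print(wind_cols)
```

Called as `columns_with_prefix(data, 'WSR')`, this should return the wind speed columns of the real frame.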

In [3]:
pd.melt(data, id_vars = [])

TypeError: melt() missing 1 required positional argument: 'frame'
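
The `TypeError` above is what `melt` raises when it is called without a frame. Since the call shown does pass `data`, the traceback likely comes from an earlier run of the cell. As a rough sketch of what a working call could look like, the example below reshapes two hourly wind columns into long format. The `toy` frame and the `measurement` / `value` column names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for `data`, with Date as a regular column.
toy = pd.DataFrame({
    "Date": ["1/1/1998", "1/2/1998"],
    "WSR0": [0.8, 2.8],
    "WSR1": [1.8, 3.2],
    "Class": [0, 0],
})

# Keep Date and Class as identifiers; stack the hourly columns into rows.
long_form = pd.melt(
    toy,
    id_vars=["Date", "Class"],
    value_vars=["WSR0", "WSR1"],
    var_name="measurement",
    value_name="value",
)
print(long_form)
```

In the notebook `Date` is the index rather than a column, so `data.reset_index()` would be needed before melting in order to keep it as an identifier.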

In [5]:
data[['WSR_PK','WSR_AV']]

| Date | WSR_PK | WSR_AV |
|---|---|---|
| 1/1/1998 | 5.5 | 3.1 |
| 1/2/1998 | 5.5 | 3.4 |
| 1/3/1998 | 5.6 | 3.5 |
| 1/4/1998 | 4.7 | 3.2 |
| 1/5/1998 | 3.7 | 2.3 |
| ... | ... | ... |
| 12/27/2004 | 3.9 | 1.6 |
| 12/28/2004 | 5.0 | 2.6 |
| 12/29/2004 | 3.9 | 1.9 |
| 12/30/2004 | 4.0 | 2.1 |


In [6]:
data.columns

Index(['Dataset', 'WSR0', 'WSR1', 'WSR2', 'WSR3', 'WSR4', 'WSR5', 'WSR6',
       'WSR7', 'WSR8', 'WSR9', 'WSR10', 'WSR11', 'WSR12', 'WSR13', 'WSR14',
       'WSR15', 'WSR16', 'WSR17', 'WSR18', 'WSR19', 'WSR20', 'WSR21', 'WSR22',
       'WSR23', 'WSR_PK', 'WSR_AV', 'T0', 'T1', 'T2', 'T3', 'T4', 'T5', 'T6',
       'T7', 'T8', 'T9', 'T10', 'T11', 'T12', 'T13', 'T14', 'T15', 'T16',
       'T17', 'T18', 'T19', 'T20', 'T21', 'T22', 'T23', 'T_PK', 'T_AV', 'T85',
       'RH85', 'U85', 'V85', 'HT85', 'T70', 'RH70', 'U70', 'V70', 'HT70',
       'T50', 'RH50', 'U50', 'V50', 'HT50', 'KI', 'TT', 'SLP', 'SLP_', 'Precp',
       'Class'],
      dtype='object')