# Polynomial Regression Assignment (Please do not remove the output cells)

## The objective is to apply polynomial regression on the provided data using 3 regularization techniques

Tasks are defined below:
1-  Read the dataset, and get acquainted with its features and labels. Check the link below for more details about the data.  
https://archive.ics.uci.edu/ml/datasets/Air+Quality  

2-  This dataset has 5 different output labels. For this assignment we only need the NO2 output. Please drop other outputs.  

3-  Handle missing data similar to what was covered in tutorial 3.  

4-  Replace the date feature with 3 separate features (Day, Month and Year).  

5-  Replace the time feature with 3 separate features (Hours, Minutes and Seconds).  

6-  Apply feature scaling.  

7-  Evaluate different degrees of lasso CV polynomial regression.  

8-  Choose the best degree and print the errors on the test data, model coefficients and the lasso parameters.  

9-  Repeat steps 7 and 8 using Ridge CV polynomial regression.  

10- Repeat steps 7 and 8 using ElasticNet CV polynomial regression.  

11- Compare the errors from the 3 regularization techniques, and save the best model.  

12- Load the best model and test it on a sample that you created manually.  


## Enter your IDs and Names below

1-   
  
2-


### Imports

In [11]:
import pandas as pd
import numpy as np

### 1- Read the data

In [12]:
df = pd.read_csv("AirQualityUCI.csv")
print(df)

           Date      Time  CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  \
0     3/10/2004  18:00:00     2.6       1360.0     150.0      11.9   
1     3/10/2004  19:00:00     2.0       1292.0     112.0       9.4   
2     3/10/2004  20:00:00     2.2       1402.0      88.0       9.0   
3     3/10/2004  21:00:00     2.2       1376.0      80.0       9.2   
4     3/10/2004  22:00:00     1.6       1272.0      51.0       6.5   
...         ...       ...     ...          ...       ...       ...   
9352   4/4/2005  10:00:00     3.1       1314.0       NaN      13.5   
9353   4/4/2005  11:00:00     2.4       1163.0       NaN      11.4   
9354   4/4/2005  12:00:00     2.4       1142.0       NaN      12.4   
9355   4/4/2005  13:00:00     2.1       1003.0       NaN       9.5   
9356   4/4/2005  14:00:00     2.2       1071.0       NaN      11.9   

      PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  \
0            1046.0    166.0        1056.0    113.0        1692.0   
1             955.0  

### 2- Drop unwanted labels

In [13]:
df = df.drop(['CO(GT)','NMHC(GT)', 'C6H6(GT)', 'NOx(GT)'], axis=1)
list(df.columns.values)


['Date',
 'Time',
 'PT08.S1(CO)',
 'PT08.S2(NMHC)',
 'PT08.S3(NOx)',
 'NO2(GT)',
 'PT08.S4(NO2)',
 'PT08.S5(O3)',
 'T',
 'RH',
 'AH']

### 3- Handle missing data

In [14]:
df.isna().sum()

Date                0
Time                0
PT08.S1(CO)       366
PT08.S2(NMHC)     366
PT08.S3(NOx)      366
NO2(GT)          1642
PT08.S4(NO2)      366
PT08.S5(O3)       366
T                 366
RH                366
AH                366
dtype: int64

In [15]:
df.drop(df[df['PT08.S4(NO2)'].isna()].index, inplace=True)
df

Unnamed: 0,Date,Time,PT08.S1(CO),PT08.S2(NMHC),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,3/10/2004,18:00:00,1360.0,1046.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578
1,3/10/2004,19:00:00,1292.0,955.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255
2,3/10/2004,20:00:00,1402.0,939.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502
3,3/10/2004,21:00:00,1376.0,948.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1272.0,836.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888
...,...,...,...,...,...,...,...,...,...,...,...
9352,4/4/2005,10:00:00,1314.0,1101.0,539.0,190.0,1374.0,1729.0,21.9,29.3,0.7568
9353,4/4/2005,11:00:00,1163.0,1027.0,604.0,179.0,1264.0,1269.0,24.3,23.7,0.7119
9354,4/4/2005,12:00:00,1142.0,1063.0,603.0,175.0,1241.0,1092.0,26.9,18.3,0.6406
9355,4/4/2005,13:00:00,1003.0,961.0,702.0,156.0,1041.0,770.0,28.3,13.5,0.5139


In [16]:
df.isna().sum()

Date                0
Time                0
PT08.S1(CO)         0
PT08.S2(NMHC)       0
PT08.S3(NOx)        0
NO2(GT)          1598
PT08.S4(NO2)        0
PT08.S5(O3)         0
T                   0
RH                  0
AH                  0
dtype: int64

### 4- Replace date feature

In [23]:
df[["Month", "Day", "Year"]] = df["Date"].str.split("/", expand = True)
df = df.drop('Date',axis=1)
df

Unnamed: 0,Time,PT08.S1(CO),PT08.S2(NMHC),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Month,Day,Year
0,18:00:00,1360.0,1046.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,3,10,2004
1,19:00:00,1292.0,955.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,3,10,2004
2,20:00:00,1402.0,939.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,3,10,2004
3,21:00:00,1376.0,948.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,3,10,2004
4,22:00:00,1272.0,836.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,3,10,2004
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9352,10:00:00,1314.0,1101.0,539.0,190.0,1374.0,1729.0,21.9,29.3,0.7568,4,4,2005
9353,11:00:00,1163.0,1027.0,604.0,179.0,1264.0,1269.0,24.3,23.7,0.7119,4,4,2005
9354,12:00:00,1142.0,1063.0,603.0,175.0,1241.0,1092.0,26.9,18.3,0.6406,4,4,2005
9355,13:00:00,1003.0,961.0,702.0,156.0,1041.0,770.0,28.3,13.5,0.5139,4,4,2005
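
The split above leaves the three new columns as strings. As a minimal sketch of an alternative, the date can be parsed with `pd.to_datetime` so that the parts come out as integers. The toy frame and the month/day/year format string are assumptions chosen to match the column order used in the split above; they are not taken from the assignment's reference solution.

```python
import pandas as pd

# Toy frame standing in for the real data (two dates copied from the output above).
toy = pd.DataFrame({"Date": ["3/10/2004", "4/4/2005"]})

# Assumption: dates are written month/day/year, as in the split above.
parsed = pd.to_datetime(toy["Date"], format="%m/%d/%Y")

# The .dt accessor returns integer columns, so no extra cast is needed.
toy["Day"] = parsed.dt.day
toy["Month"] = parsed.dt.month
toy["Year"] = parsed.dt.year
toy = toy.drop("Date", axis=1)

print(toy)
print(toy.dtypes)
```

Parsing with an explicit format also fails loudly on a malformed date, whereas a plain string split would silently produce a wrong column.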


### 5- Replace time feature

In [24]:
df[["Hour", "Minute", "Second"]] = df["Time"].str.split(":", expand = True)
df = df.drop('Time',axis=1)
df

Unnamed: 0,PT08.S1(CO),PT08.S2(NMHC),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Month,Day,Year,Hour,Minute,Second
0,1360.0,1046.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,3,10,2004,18,00,00
1,1292.0,955.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,3,10,2004,19,00,00
2,1402.0,939.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,3,10,2004,20,00,00
3,1376.0,948.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,3,10,2004,21,00,00
4,1272.0,836.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,3,10,2004,22,00,00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9352,1314.0,1101.0,539.0,190.0,1374.0,1729.0,21.9,29.3,0.7568,4,4,2005,10,00,00
9353,1163.0,1027.0,604.0,179.0,1264.0,1269.0,24.3,23.7,0.7119,4,4,2005,11,00,00
9354,1142.0,1063.0,603.0,175.0,1241.0,1092.0,26.9,18.3,0.6406,4,4,2005,12,00,00
9355,1003.0,961.0,702.0,156.0,1041.0,770.0,28.3,13.5,0.5139,4,4,2005,13,00,00
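
The `00` values in the output above show that the split produces strings rather than numbers. The sketch below casts the pieces to integers on a toy column. The sample times are made up for illustration, and the third one is deliberately not on the hour.

```python
import pandas as pd

# Toy column standing in for the real "Time" feature (values are illustrative).
toy = pd.DataFrame({"Time": ["18:00:00", "19:00:00", "9:30:15"]})

# str.split returns strings, so cast every piece to int in one step.
parts = toy["Time"].str.split(":", expand=True).astype(int)
parts.columns = ["Hour", "Minute", "Second"]

# Replace the original column with the three numeric ones.
toy = pd.concat([toy.drop("Time", axis=1), parts], axis=1)

print(toy)
```

In the rows displayed above, `Minute` and `Second` are always zero. If that holds for the whole dataset, these two columns are constant and give the model no information, although the assignment still asks for them.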


### 6- Apply feature scaling

In [16]:
from sklearn.model_selection import train_test_split

In [None]:
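# NO2(GT) still contains missing values (see step 3), and a regression target cannot be NaN.
df = df.dropna(subset=['NO2(GT)'])
X = df.drop('NO2(GT)', axis=1)
y = df['NO2(GT)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)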

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### 7- Evaluate different degrees of lasso CV polynomial regression
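
One possible way to approach this step is to loop over a few polynomial degrees, fit a `LassoCV` model for each, and record the test error. The sketch below uses synthetic data in place of the air-quality features, and the degree range, `cv=5` and `max_iter` are illustrative assumptions rather than values prescribed by the assignment.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the features and the NO2 target (not the real data).
rng = np.random.default_rng(101)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

rmse_by_degree = {}
for degree in (1, 2, 3):
    # Expand to polynomial terms, rescale them, then let LassoCV pick alpha.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LassoCV(cv=5, max_iter=100_000),
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse_by_degree[degree] = float(np.sqrt(mean_squared_error(y_test, pred)))

for degree, rmse in rmse_by_degree.items():
    print(f"degree={degree}  test RMSE={rmse:.4f}")
```

Picking the degree from test-set error, as done here for brevity, makes the reported error optimistic. A separate validation split is the more careful choice.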

### 8- Choose the best degree and print the errors, model coefficients and the lasso parameters.
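
As a rough sketch of the reporting part, a fitted `LassoCV` exposes the selected regularization strength as `alpha_` and the weights as `coef_`. The data is synthetic again, and `best_degree = 2` is a hypothetical outcome of the previous step, not a result from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the features and the NO2 target (not the real data).
rng = np.random.default_rng(101)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

best_degree = 2  # hypothetical choice carried over from the degree sweep

# Fit the transformers on the training split only, then reuse them on the test split.
poly = PolynomialFeatures(degree=best_degree, include_bias=False)
scaler = StandardScaler()
X_train_poly = scaler.fit_transform(poly.fit_transform(X_train))
X_test_poly = scaler.transform(poly.transform(X_test))

lasso = LassoCV(cv=5, max_iter=100_000)
lasso.fit(X_train_poly, y_train)
pred = lasso.predict(X_test_poly)

mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print("selected alpha:", lasso.alpha_)
print("coefficients:", np.round(lasso.coef_, 3))
print("non-zero coefficients:", int(np.count_nonzero(lasso.coef_)))
```

Counting the non-zero coefficients shows how many polynomial terms Lasso kept, which is the main practical difference from Ridge.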

### 9a- Evaluate different degrees of Ridge CV polynomial regression
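
The same degree sweep can be sketched with `RidgeCV`, which chooses `alpha` from an explicit grid. The grid `(0.1, 1.0, 10.0)`, the scoring metric and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the features and the NO2 target (not the real data).
rng = np.random.default_rng(101)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

alpha_grid = (0.1, 1.0, 10.0)  # illustrative grid, not a recommended one
results = {}
for degree in (1, 2, 3):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        RidgeCV(alphas=alpha_grid, scoring="neg_mean_squared_error"),
    )
    model.fit(X_train, y_train)
    rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    results[degree] = (rmse, model)

# Keep the degree with the lowest test RMSE and inspect its Ridge step.
best_degree = min(results, key=lambda d: results[d][0])
best_rmse, best_model = results[best_degree]
ridge = best_model[-1]  # last pipeline step is the fitted RidgeCV

print(f"best degree: {best_degree}  test RMSE: {best_rmse:.4f}")
print("selected alpha:", ridge.alpha_)
print("coefficients:", np.round(ridge.coef_, 3))
```

Unlike Lasso, Ridge shrinks coefficients without setting them exactly to zero, so every polynomial term normally keeps a non-zero weight.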

### 9b- Choose the best degree and print the errors, model coefficients and the Ridge parameters.

### 10a- Evaluate different degrees of ElasticNet CV polynomial regression
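
For ElasticNet, `ElasticNetCV` searches over both `alpha` and the L1/L2 mixing weight `l1_ratio`. The following is a sketch under the same assumptions as before: synthetic data and an illustrative `l1_ratio` grid.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the features and the NO2 target (not the real data).
rng = np.random.default_rng(101)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

l1_grid = [0.1, 0.5, 0.9, 1.0]  # illustrative mixing weights; 1.0 is pure Lasso
rmse_by_degree = {}
params_by_degree = {}
for degree in (1, 2, 3):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        ElasticNetCV(l1_ratio=l1_grid, cv=5, max_iter=100_000),
    )
    model.fit(X_train, y_train)
    enet = model[-1]  # fitted ElasticNetCV step
    rmse_by_degree[degree] = float(
        np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    )
    params_by_degree[degree] = (enet.alpha_, enet.l1_ratio_)

for degree in rmse_by_degree:
    alpha, l1_ratio = params_by_degree[degree]
    print(
        f"degree={degree}  RMSE={rmse_by_degree[degree]:.4f}  "
        f"alpha={alpha:.5f}  l1_ratio={l1_ratio}"
    )
```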

### 10b- Choose the best degree and print the errors, model coefficients and ElasticNet parameters.

### 11- Compare the errors from the 3 regularization techniques, and save the best model.
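
A minimal sketch of the comparison: fit the three regularized models at one degree, collect their test RMSE, and persist the best one with `joblib.dump`. The fixed degree of 2, the synthetic data and the file name `best_poly_model.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the features and the NO2 target (not the real data).
rng = np.random.default_rng(101)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

candidates = {
    "lasso": LassoCV(cv=5, max_iter=100_000),
    "ridge": RidgeCV(alphas=(0.1, 1.0, 10.0)),
    "elasticnet": ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=100_000),
}

rmse = {}
fitted = {}
for name, estimator in candidates.items():
    pipe = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        estimator,
    )
    pipe.fit(X_train, y_train)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pipe.predict(X_test))))
    fitted[name] = pipe

best_name = min(rmse, key=rmse.get)
print(rmse)
print("best model:", best_name)

# Save the whole pipeline so the preprocessing travels with the model.
model_path = os.path.join(tempfile.mkdtemp(), "best_poly_model.joblib")
joblib.dump(fitted[best_name], model_path)
```

Saving the pipeline rather than the bare estimator means the polynomial expansion and scaling are reapplied automatically when the model is loaded later.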

### 12- Load the best model and test it on a sample that you created manually.
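
The sketch below shows the load-and-predict round trip with `joblib.load`. To stay self-contained it first trains and saves a small stand-in pipeline on synthetic data. Only three of the real column names (`T`, `RH`, `AH`) are used, and the file name and sample values are made up.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic training frame; a real run would use all features from step 6.
rng = np.random.default_rng(101)
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["T", "RH", "AH"])
target = 2.0 * train["T"] + train["RH"] ** 2 + rng.normal(scale=0.1, size=200)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=(0.1, 1.0, 10.0)),
)
model.fit(train, target)

# Stand-in for the file written in step 11 (hypothetical name).
model_path = os.path.join(tempfile.mkdtemp(), "best_poly_model.joblib")
joblib.dump(model, model_path)

# Load the model back and predict on one hand-written sample.
loaded = joblib.load(model_path)
sample = pd.DataFrame([{"T": 0.5, "RH": -1.0, "AH": 0.2}])
prediction = loaded.predict(sample)
print("prediction:", prediction)
```

The manual sample has to provide the same columns, in the same order, as the training data. Building it as a one-row `DataFrame` with named columns makes that easier to verify.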

## Great work!
----