Problem:
Humans are very sensitive to humidity, as the skin relies on the air to get rid of moisture. Sweating is your body's attempt to keep cool and maintain its current temperature. If the air is at 100 percent relative humidity, sweat will not evaporate into the air. As a result, we feel much hotter than the actual temperature when the relative humidity is high. If the relative humidity is low, we can feel much cooler than the actual temperature because our sweat evaporates easily, cooling us off. For example, if the air temperature is 24 C and the relative humidity is zero percent, the air temperature feels like 21 C to our bodies. If the air temperature is 24 C and the relative humidity is 100 percent, it feels like 27 C (80 degrees Fahrenheit).


Create a model that will predict Relative Humidity from the other variables.

---

Data description:

0. Date (DD/MM/YYYY)
1. Time (HH.MM.SS)
2. True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3. PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4. True hourly averaged overall Non-Methane Hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
5. True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
6. PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7. True hourly averaged NOx concentration in ppb (reference analyzer)
8. PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9. True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
10. PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11. PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12. Temperature in °C
13. Relative Humidity (%)
14. AH Absolute Humidity

---

Load the data and sort out missing values (fill them in, or remove the affected columns/rows).
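
One possible way to handle the missing values is sketched below on a small toy frame. The notebook does not say how missing values are marked; the UCI documentation for this dataset tags them with -200, which is the assumption made here. The toy values, the 50 percent threshold, and the median fill are illustrative choices, not the required solution.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "co_gt": [2.6, -200.0, 2.2, 1.6],
    "nmhc_gt": [-200.0, -200.0, -200.0, 88.0],
    "t": [13.6, 13.3, -200.0, 11.0],
    "rh": [48.9, 47.7, 54.0, -200.0],
})

MISSING_MARKER = -200  # assumption: sentinel used for missing readings

# Turn the sentinel into real NaN values so pandas can reason about them.
cleaned = toy.replace(MISSING_MARKER, np.nan)

# Drop columns where more than half of the values are missing.
mostly_empty = cleaned.columns[cleaned.isna().mean() > 0.5]
cleaned = cleaned.drop(columns=mostly_empty)

# Rows without a target value cannot be used for training or evaluation.
cleaned = cleaned.dropna(subset=["rh"])

# Fill the remaining gaps in the features with the column median.
cleaned = cleaned.fillna(cleaned.median())

print(cleaned)
```

Whether to fill or drop depends on how much of each column is missing, so it is worth checking `df.isna().mean()` on the real data before deciding.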

For the ML model to be able to work with Date, convert it to a month (int) variable. Remove the original Date column afterwards.

Similarly, convert Time to an hour (int) variable. Drop the original Time column afterwards.
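
The two conversions above could look roughly like this. It is a minimal sketch on a toy frame: it assumes `date` has already been parsed as a datetime column and that `time` renders as an `HH:MM:SS` string, which may differ from how the real file loads.

```python
import pandas as pd

# Toy frame; the real column contents may be typed differently after loading.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2004-03-10", "2004-03-11", "2004-04-01"]),
    "time": ["18:00:00", "19:00:00", "00:00:00"],
})

# Month of the year as an integer (1-12).
toy["month"] = toy["date"].dt.month

# Hour of the day as an integer (0-23), taken from the part before the first colon.
toy["hour"] = toy["time"].astype(str).str.split(":").str[0].astype(int)

# The original columns are no longer needed once the features exist.
toy = toy.drop(columns=["date", "time"])

print(toy)
```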

Once all of that is done and the data is ready, create a baseline using `LinearRegression()`. No preprocessing needed, no Optuna, just a simple baseline to see if it is working. For evaluation use `mean_absolute_error()`.

---

Once this is working, take the Optuna code you made during the last lesson and adjust it to work with regression.

0. Change the function that loads the data to the one you created earlier.
1. You will need to change the part where it chooses `classifiers` into `regressors`. Use the following: `LinearRegression`, `DecisionTreeRegressor`, `RandomForestRegressor`, `SVR`. Look these up in the scikit-learn documentation and add some parameters you think might be impactful for Optuna to optimize.
2. You will need to change `stratifiedKfoldSplit`, which works with classification examples only, to a cross-validation method that can work with regression. ([Here are](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#visualize-cross-validation-indices-for-many-cv-objects) the plots if you need help choosing.)
3. Change the evaluation metric to whatever you chose to use for the baseline. Make sure Optuna is set to optimize in the right direction!
4. Apart from the existing numerical-data scalers Optuna chooses from, add `Normalizer` to the choice. Find it in scikit-learn, import it, and add it.

In the end, Optuna should be able to run 100 trials.
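
Steps 2 to 4 can be tried out without Optuna first. The sketch below uses synthetic data (`X_demo` and `y_demo` are made-up names, not part of the homework) and wraps the scaler and the regressor in a scikit-learn pipeline, so the scaler is refit on the training part of every fold. That is one way to do it, not necessarily how the lesson code is structured.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Synthetic regression problem: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# KFold does not need class labels, so it works for regression targets.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, scaler in {"StandardScaler": StandardScaler(), "Normalizer": Normalizer()}.items():
    # The pipeline fits the scaler inside each fold, on training rows only.
    pipeline = make_pipeline(scaler, LinearRegression())
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv,
                             scoring="neg_mean_absolute_error")
    # scikit-learn reports the negated MAE (higher is better), so flip the sign.
    results[name] = -scores.mean()

for name, mae in results.items():
    print(f"{name}: MAE = {mae:.3f}")
```

Because `neg_mean_absolute_error` is already negated, an Optuna study that returns the raw score should use `direction="maximize"`; returning the positive MAE instead would call for `direction="minimize"`. Note also that `Normalizer` rescales each row to unit norm rather than each column, so it behaves quite differently from the other scalers.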

In [None]:
!pip install -q xlrd
!pip install -q optuna
!pip install -q openpyxl

print("-------------- Necessary packages installed --------------")

from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, Normalizer, PowerTransformer, RobustScaler, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
import pandas as pd
import datetime as dt
import optuna


print("-------------- Packages loaded --------------")

-------------- Necessary packages installed --------------
-------------- Packages loaded --------------


In [None]:
data_path = "/work/data/homework 22/AirQualityUCI.xlsx"
col_names = ['date', 'time', 'co_gt', 'pt08_s1_co', 'nmhc_gt', 'c6h6_gt', 'pt08_s2_nmhc', 'nox_gt', 'pt08_s3_nox', 'no2_gt', 'pt08_s4_no2', 'pt08_s5_o3', 't', 'rh', 'ah']

# read_excel picks up two extra, completely empty columns, so usecols limits the load to the 15 real ones.
df = pd.read_excel(data_path, names=col_names, usecols=range(15))
# Date becomes seconds since the Unix epoch; time becomes minutes since midnight.
df["date"] = df["date"].astype("int64") / 10**9
df["time"] = df["time"].astype(str).str.split(":").apply(lambda x: int(x[0]) * 60 + int(x[1]))
X = df.drop(["rh"], axis=1)
y = df["rh"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True)

scaler = MinMaxScaler()
scaler.fit(X_train)

model=LinearRegression()
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)
model.fit(scaled_X_train,y_train)
prediction = model.predict(scaled_X_test)
print('MAE:', metrics.mean_absolute_error(y_test, prediction))




MAE: 5.697231904290597


In [None]:
def load_data():
    """Load the air quality data and return the features X and the target y (relative humidity)."""
    data_path = "/work/data/homework 22/AirQualityUCI.xlsx"
    col_names = ['date', 'time', 'co_gt', 'pt08_s1_co', 'nmhc_gt', 'c6h6_gt', 'pt08_s2_nmhc', 'nox_gt', 'pt08_s3_nox', 'no2_gt', 'pt08_s4_no2', 'pt08_s5_o3', 't', 'rh', 'ah']
    df = pd.read_excel(data_path, names=col_names, usecols=range(15))
    # Date becomes seconds since the Unix epoch; time becomes minutes since midnight.
    df["date"] = df["date"].astype("int64") / 10**9
    df["time"] = df["time"].astype(str).str.split(":").apply(lambda x: int(x[0]) * 60 + int(x[1]))
    X = df.drop(["rh"], axis=1)
    y = df["rh"]
    return X, y


def objective(trial):
    # Pick a regressor and, where applicable, its hyperparameters.
    regressor_name = trial.suggest_categorical("regressors", ["SVR", "RandomForestRegressor", "DecisionTreeRegressor", "LinearRegression"])
    if regressor_name == "SVR":
        svr_c = trial.suggest_float("svr_c", 1e-3, 1e3, log=True)
        model = SVR(C=svr_c)
    elif regressor_name == "RandomForestRegressor":
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 12, log=True)
        model = RandomForestRegressor(max_depth=rf_max_depth, n_estimators=10)
    elif regressor_name == "DecisionTreeRegressor":
        # 'mse' was renamed to 'squared_error' in newer scikit-learn releases.
        dt_criteria = trial.suggest_categorical('criterion', ['squared_error', 'friedman_mse'])
        dt_max_depth = trial.suggest_int("dt_max_depth", 2, 12)
        model = DecisionTreeRegressor(criterion=dt_criteria, max_depth=dt_max_depth)
    elif regressor_name == 'LinearRegression':
        model = LinearRegression()

    # Map scaler names to classes explicitly instead of calling eval() on a string.
    scalers = {
        "StandardScaler": StandardScaler,
        "RobustScaler": RobustScaler,
        "MinMaxScaler": MinMaxScaler,
        "MaxAbsScaler": MaxAbsScaler,
        "PowerTransformer": PowerTransformer,
        "Normalizer": Normalizer,
    }
    scaler_string = trial.suggest_categorical("------------------------------------_scaler", ["no_scaler"] + list(scalers))
    if scaler_string == "no_scaler":
        scaled_X = X
    else:
        # Note: the scaler is fit on the full data set before cross-validation.
        scaler = scalers[scaler_string]()
        scaled_X = scaler.fit_transform(X)

    # Plain KFold works for regression; the score is the negated MAE, so higher is better.
    cv = KFold(shuffle=True, random_state=42)
    score = cross_val_score(model, scaled_X, y, cv=cv, scoring="neg_mean_absolute_error")
    trial_score = score.mean()

    return trial_score

X,y = load_data()
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_trial)


        

[I 2021-10-03 17:49:02,845] A new study created in memory with name: no-name-d693a011-1c39-41a8-a8e0-0280d11ff3aa
[I 2021-10-03 17:49:03,679] Trial 0 finished with value: -11.310037343770166 and parameters: {'regressors': 'RandomForestRegressor', 'rf_max_depth': 2, '------------------------------------_scaler': 'RobustScaler'}. Best is trial 0 with value: -11.310037343770166.
[I 2021-10-03 17:49:31,662] Trial 1 finished with value: -10.849885498063445 and parameters: {'regressors': 'SVR', 'svr_c': 0.8296090641500931, '------------------------------------_scaler': 'StandardScaler'}. Best is trial 1 with value: -10.849885498063445.
[I 2021-10-03 17:49:31,756] Trial 2 finished with value: -5.713765012476422 and parameters: {'regressors': 'LinearRegression', '------------------------------------_scaler': 'MinMaxScaler'}. Best is trial 2 with value: -5.713765012476422.
[I 2021-10-03 17:49:32,042] Trial 3 finished with value: -13.8
