In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Homework: Random Forest Regression

In [None]:
os.system("wget https://archive.ics.uci.edu/ml/machine-learning-databases/00514/Bias_correction_ucl.csv")

## Bias correction of numerical prediction model temperature forecast Data Set 

Reference: Cho, D., Yoo, C., Im, J., & Cha, D. (2020). <a href='https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019EA000740'>Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas</a>. Earth and Space Science.

<u>Description (source)</u>:

This data is for the purpose of bias correction of next-day maximum and minimum air temperatures forecast of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea. This data consists of summer data from 2013 to 2017. The input data is largely composed of the LDAPS model's next-day forecast data, in-situ maximum and minimum temperatures of present-day, and geographic auxiliary variables. There are two outputs (i.e. next-day maximum and minimum air temperatures) in this data. Hindcast validation was conducted for the period from 2015 to 2017.

<u>Attribute Description</u>:

1. station - used weather station number: 1 to 25
2. Date - Present day: yyyy-mm-dd ('2013-06-30' to '2017-08-30')
3. Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
4. Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
5. LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
6. LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
7. LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
8. LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
9. LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
10. LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4
11. LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97
12. LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97
13. LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98
14. LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97
15. LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7
16. LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6
17. LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8
18. LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7
19. lat - Latitude (°): 37.456 to 37.645
20. lon - Longitude (°): 126.826 to 127.135
21. DEM - Elevation (m): 12.4 to 212.3
22. Slope - Slope (°): 0.1 to 5.2
23. Solar radiation - Daily incoming solar radiation (Wh/m2): 4329.5 to 5992.9
24. Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9
25. Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8

## Question 1: Dataset Construction

Follow the reference above to find out how to construct the dataset. Specifically, read section "3.1 Methods - Data Processing" to determine which columns of the data frame are features and which are labels/tasks (also see Table 1 and Figure 3). You will also need to drop rows containing 'NaN' values (see: <a href='https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html'>pandas.DataFrame.dropna</a> and the examples it contains).

In the paper, the data is split into training and test datasets based on the acquisition date (training on pre-2015 data, testing on 2015-2017 data). Split your training and test data the same way. See <a href='https://stackoverflow.com/questions/37532098/split-dataframe-into-two-on-the-basis-of-date'>here</a> for a discussion of ways to achieve this.

Hint:

```
X_train.shape = (3057, 21)
y_train.shape = (3057, 2)
X_test.shape = (4531, 21)
y_test.shape = (4531, 2)
```
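The steps above (drop incomplete rows, pick feature and label columns, split by date) could be sketched as follows. This is a minimal illustration on a small synthetic frame, not the real file: the values are made up and only a few of the real columns are included. Excluding `station` and `Date` from the features is an assumption inferred from the 21-column feature shape in the hint; check it against the paper.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Bias_correction_ucl.csv (values are made up,
# and only a handful of the real columns are included).
rng = np.random.default_rng(0)
dates = pd.to_datetime(["2013-07-01", "2014-07-01", "2015-07-01", "2016-07-01"])
df = pd.DataFrame({
    "station": np.repeat([1.0, 2.0], len(dates)),
    "Date": np.tile(dates, 2),
    "Present_Tmax": rng.uniform(20, 37, 8),
    "LDAPS_Tmax_lapse": rng.uniform(17, 38, 8),
    "Next_Tmax": rng.uniform(17, 38, 8),
    "Next_Tmin": rng.uniform(11, 29, 8),
})
df.loc[0, "Present_Tmax"] = np.nan  # one incomplete row to exercise dropna

# Drop rows with any missing value, then make sure Date is a datetime column
# (when read from CSV it arrives as strings).
df = df.dropna().reset_index(drop=True)
df["Date"] = pd.to_datetime(df["Date"])

# Labels are the two next-day temperatures; everything else except the
# identifier columns is treated as a feature (assumption, see above).
label_cols = ["Next_Tmax", "Next_Tmin"]
feature_cols = [c for c in df.columns if c not in label_cols + ["station", "Date"]]

# Date-based split: training before 2015, testing from 2015 onwards.
is_train = df["Date"] < pd.Timestamp("2015-01-01")
X_train, y_train = df.loc[is_train, feature_cols], df.loc[is_train, label_cols]
X_test, y_test = df.loc[~is_train, feature_cols], df.loc[~is_train, label_cols]

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

With the real file, the same boolean mask applies after `pd.read_csv`; `X_train` and `y_train` should end up with the same number of rows, as should `X_test` and `y_test`.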

## Question 2: Data Visualization

Choose a specific weather station and plot the next-day maximum air temperature as measured ("Next_Tmax") and as predicted by the model ("LDAPS_Tmax_lapse") over the time period covered by the dataset. Do the same for the minimum temperature ("Next_Tmin" and "LDAPS_Tmin_lapse"). Note any qualitative differences you see between measurements and predictions. Title your plot and label its axes appropriately.

Example:
```
df = pd.read_csv('Bias_correction_ucl.csv')
# Keep only the rows for station 1
df1 = df[df['station'] == 1.0]
y_model = df1['LDAPS_Tmax_lapse'].values  # LDAPS forecast
y_meas = df1['Next_Tmax'].values          # observation
# Plot both series against the row index (one row per day)
plt.plot(y_model, label='LDAPS model')
plt.plot(y_meas, label='Obs.')
plt.title("Station 1 Max Air Temperature")
plt.legend()
plt.xlabel('Day')
plt.ylabel('Max Temp. (degC)')
plt.show()
```
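To back up the visual comparison with numbers, one option is to summarize the forecast error for the chosen station. The sketch below uses made-up values in a tiny frame that reuses the dataset's column names; `mean_bias` and `rmse` are illustrative names, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names follow the real dataset.
df = pd.DataFrame({
    "station": [1.0, 1.0, 1.0, 2.0, 2.0],
    "LDAPS_Tmax_lapse": [28.0, 30.0, 31.0, 27.0, 29.0],
    "Next_Tmax": [29.0, 32.0, 31.0, 27.5, 30.5],
})

df1 = df[df["station"] == 1.0]

# Forecast error: model minus observation, one value per day.
diff = df1["LDAPS_Tmax_lapse"] - df1["Next_Tmax"]

mean_bias = diff.mean()               # negative means the model runs cold
rmse = np.sqrt((diff ** 2).mean())    # typical error magnitude

print(f"mean bias = {mean_bias:.2f} degC, RMSE = {rmse:.2f} degC")
```

A mean bias far from zero suggests a systematic offset, which is the kind of error a bias-correction model is meant to remove.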

## Question 3: Random Forest Regression

Train a Random Forest Regression model to predict next-day minimum and maximum air temperature, one separate model for each task. Try splitting 10% off from the training set to produce a validation set. Can you improve performance on the validation set by adjusting some hyperparameters such as `n_estimators` or `max_features`?

To get started:
```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
regr = RandomForestRegressor()
X_train_1, X_valid, y_train_1, y_valid = train_test_split(X_train, y_train, test_size=0.1)
regr.fit(...)
```
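A small grid over `n_estimators` and `max_features` is one way to approach the tuning question. The sketch below runs on synthetic data from `make_regression` so that it is self-contained; the grid values, the `random_state` settings, and the use of validation RMSE as the selection criterion are all assumptions to adapt, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic single-target regression problem standing in for one task
# (e.g. Next_Tmax); replace with the real X_train / y_train column.
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# Hold out 10% of the training data for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

results = {}
for n_estimators in (20, 50):
    for max_features in (0.5, 1.0):  # fraction of features tried at each split
        regr = RandomForestRegressor(
            n_estimators=n_estimators,
            max_features=max_features,
            random_state=0,
        )
        regr.fit(X_tr, y_tr)
        # Validation RMSE, computed as the square root of the MSE.
        rmse = np.sqrt(mean_squared_error(y_val, regr.predict(X_val)))
        results[(n_estimators, max_features)] = rmse

# Pick the setting with the lowest validation RMSE.
best = min(results, key=results.get)
print("best (n_estimators, max_features):", best, "RMSE:", results[best])
```

Because a single 10% split is small, differences between settings may partly reflect noise; repeating the split with different seeds is one way to check.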

## Question 4: Hindcast Validation

Re-train the best RF model from Question 3 on the entire training set and predict on the test data (future forecasting). Calculate the R2 score and RMSE (root-mean-squared error) of your predictions against the observations. How do your results compare with those from the original paper? (see: Figure 3 and Figure 4)

To get started:
```
from sklearn.metrics import mean_squared_error, r2_score
regr = RandomForestRegressor(...)
regr.fit(...)
preds = regr.predict(...)
r2 = r2_score(...)
# RMSE as sqrt(MSE); the `squared=False` keyword is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(...))
```
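As a sanity check on the two metrics, the sketch below evaluates them on three made-up values and compares the scikit-learn results with the textbook formulas. RMSE is computed as the square root of `mean_squared_error`, which avoids relying on the `squared` keyword, whose availability depends on the scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observations and predictions, chosen so the arithmetic is easy.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])

# Library versions.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# Manual versions: RMSE = sqrt(mean(residual^2)), R2 = 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)         # 1.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 2.0
rmse_manual = np.sqrt(ss_res / len(y_true))
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse:.4f}, R2 = {r2:.2f}")
```

For the homework, the same calls would be applied once per task, for example to the `Next_Tmax` column of `y_test` and the corresponding model's predictions.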