# Requirements

In [None]:
import pandas as pd

In [None]:
# Add as many imports as you need.

# Laboratory Exercise - Run Mode (8 points)

## Introduction
This laboratory assignment focuses on time series forecasting: predicting the current **average sea-level pressure** in the city of Skopje. You will use bagging and boosting methods, with features taken from the preceding three days (average, minimum, and maximum temperature, precipitation, wind direction, and wind speed) together with the current season. The goal is to see how well these ensemble methods capture the temporal dynamics of sea-level pressure from the given meteorological variables.

**Note: You are required to perform this laboratory assignment on your local machine.**

## The Weather Dataset

## Downloading the Weather Dataset

In [None]:
!gdown 1F8hSJgpOTdoe9rhFj6DYiwmPZ5hJ-ubI # Download the dataset.

Traceback (most recent call last):
  File "/usr/local/bin/gdown", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/gdown/cli.py", line 151, in main
    filename = download(
  File "/usr/local/lib/python3.10/dist-packages/gdown/download.py", line 203, in download
    filename_from_url = m.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'


## Exploring the Weather Dataset
This dataset consists of daily weather records for the city of Skopje from January 1, 2021, to August 1, 2023. Each entry includes a unique station ID, city name, date, corresponding season (e.g., summer, winter), and various meteorological parameters: average, minimum, and maximum temperatures in Celsius, precipitation in millimeters, average wind direction in degrees, average wind speed in kilometers per hour, and average sea-level pressure in hectopascals. These records support analysis of weather patterns in Skopje over that period.

The dataset comprises the following columns:
- station_id - unique ID for the weather station,
- city_name - name of the city where the station is located,
- date - date of the weather record,
- season - season corresponding to the date (e.g., summer, winter),
- avg_temp_c - average temperature in Celsius,
- min_temp_c - minimum temperature in Celsius,
- max_temp_c - maximum temperature in Celsius,
- precipitation_mm - precipitation in millimeters,
- avg_wind_dir_deg - average wind direction in degrees,
- avg_wind_speed_kmh - average wind speed in kilometers per hour, and
- avg_sea_level_pres_hpa - average sea-level pressure in hectopascals.

*Note: The dataset is complete, with no missing values in any of its entries.*

Load the dataset into a `pandas` data frame.

In [None]:
# Write your code here. Add as many boxes as you need.
df = pd.read_csv('/content/weather.csv')

In [None]:
df.head()

Unnamed: 0,station_id,city_name,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,avg_wind_dir_deg,avg_wind_speed_kmh,avg_sea_level_pres_hpa
0,13588,Skopje,2021-01-01,Winter,5.1,0.5,13.2,0.0,330.0,5.9,1021.2
1,13588,Skopje,2021-01-02,Winter,3.0,-2.6,11.2,0.0,330.0,5.9,1021.2
2,13588,Skopje,2021-01-03,Winter,6.8,3.5,12.5,1.3,339.0,8.0,1017.8
3,13588,Skopje,2021-01-04,Winter,6.6,6.1,7.2,3.6,298.0,5.3,1011.3
4,13588,Skopje,2021-01-05,Winter,4.3,2.3,6.7,4.6,11.0,5.1,1014.5


Explore the dataset using visualizations of your choice.

In [None]:
# Write your code here. Add as many boxes as you need.
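# Restrict the correlation matrix to numeric columns; recent pandas versions
# raise an error when string columns such as city_name are included.
df.corr(numeric_only=True)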


Unnamed: 0,station_id,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,avg_wind_dir_deg,avg_wind_speed_kmh,avg_sea_level_pres_hpa
station_id,,,,,,,,
avg_temp_c,,1.0,0.955178,0.976177,-0.056634,0.025137,-0.065397,-0.393837
min_temp_c,,0.955178,1.0,0.898194,0.04543,0.016872,-0.028296,-0.451426
max_temp_c,,0.976177,0.898194,1.0,-0.117093,0.037166,-0.099911,-0.339306
precipitation_mm,,-0.056634,0.04543,-0.117093,1.0,-0.01124,0.050021,-0.193343
avg_wind_dir_deg,,0.025137,0.016872,0.037166,-0.01124,1.0,0.099502,0.088692
avg_wind_speed_kmh,,-0.065397,-0.028296,-0.099911,0.050021,0.099502,1.0,-0.157845
avg_sea_level_pres_hpa,,-0.393837,-0.451426,-0.339306,-0.193343,0.088692,-0.157845,1.0


Remove the highly correlated features.

In [None]:
df.columns

Index(['station_id', 'city_name', 'date', 'season', 'avg_temp_c', 'min_temp_c',
       'max_temp_c', 'precipitation_mm', 'avg_wind_dir_deg',
       'avg_wind_speed_kmh', 'avg_sea_level_pres_hpa'],
      dtype='object')

In [None]:
# Write your code here. Add as many boxes as you need.
df.drop(columns=['min_temp_c', 'max_temp_c'], inplace=True)

In [None]:
df.drop(columns='city_name', inplace=True)

In [None]:
df.drop(columns='station_id', inplace=True)
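
The choice of which columns to drop can also be derived from the correlation matrix instead of being read off by eye. The following is a minimal sketch on synthetic data; the helper name `highly_correlated` and the 0.9 threshold are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (values are made up).
rng = np.random.default_rng(0)
avg = rng.normal(15, 8, size=200)
toy = pd.DataFrame({
    "avg_temp_c": avg,
    "min_temp_c": avg - 5 + rng.normal(0, 1, size=200),
    "max_temp_c": avg + 5 + rng.normal(0, 1, size=200),
    "precipitation_mm": rng.exponential(2, size=200),
})


def highly_correlated(frame, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


to_drop = highly_correlated(toy, threshold=0.9)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Because the helper keeps the first column of each correlated pair, the column order decides which feature survives; here `avg_temp_c` is kept, as in the cells above.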

Encode the categorical features.

In [None]:
df.dtypes

date                       object
season                     object
avg_temp_c                float64
precipitation_mm          float64
avg_wind_dir_deg          float64
avg_wind_speed_kmh        float64
avg_sea_level_pres_hpa    float64
dtype: object

In [None]:
# Write your code here. Add as many boxes as you need.
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder()

In [None]:
df['season'] = encoder.fit_transform(df['season'])

In [None]:
df['date']=pd.to_datetime(df['date'])
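
`LabelEncoder` maps the seasons to integers in alphabetical order, which suggests an ordering that the seasons do not really have. Tree-based models usually tolerate this, but one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a toy frame; the season labels other than `Winter` are assumed, since the notebook output only shows that one.

```python
import pandas as pd

# Toy frame; the season labels are assumptions for illustration.
toy = pd.DataFrame({"season": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# One indicator column per season, so no artificial ordering is introduced.
encoded = pd.get_dummies(toy, columns=["season"], prefix="season", dtype=int)
print(encoded.columns.tolist())
```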

# Feature Extraction
Select the relevant features for prediction and apply a lag of one, two, and three days to each chosen feature (except `season`), creating a set of features representing the meteorological conditions from the previous three days. To maintain dataset integrity, eliminate any resulting missing values at the beginning of the dataset.

Hint: Use `df['column_name'].shift(period)`. Check the documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html.

In [None]:
df.columns

Index(['date', 'season', 'avg_temp_c', 'precipitation_mm', 'avg_wind_dir_deg',
       'avg_wind_speed_kmh', 'avg_sea_level_pres_hpa'],
      dtype='object')

In [None]:
# Write your code here. Add as many boxes as you need.
for i in range(3,0,-1):
  for col in ['avg_temp_c','precipitation_mm', 'avg_wind_dir_deg','avg_wind_speed_kmh', 'avg_sea_level_pres_hpa']:
    df[f'{col} lag_{i}'] = df[col].shift(i)
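
To see what `shift` does, here is a small sketch using the first five pressure readings from the dataset preview. The column names `pressure` and `pressure_lag_*` are shortened for illustration only.

```python
import pandas as pd

# First five pressure values from the dataset preview.
toy = pd.DataFrame({"pressure": [1021.2, 1021.2, 1017.8, 1011.3, 1014.5]})

# shift(k) moves values down by k rows, so row t holds the value from day t - k.
for lag in (1, 2, 3):
    toy[f"pressure_lag_{lag}"] = toy["pressure"].shift(lag)

# The first three rows lack a full three-day history and are dropped.
lagged = toy.dropna().reset_index(drop=True)
print(lagged)
```

Only two of the five rows survive, because each remaining row needs values from all three preceding days.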

In [None]:
df

Unnamed: 0,date,season,avg_temp_c,precipitation_mm,avg_wind_dir_deg,avg_wind_speed_kmh,avg_sea_level_pres_hpa,avg_temp_c lag_3,precipitation_mm lag_3,avg_wind_dir_deg lag_3,...,avg_temp_c lag_2,precipitation_mm lag_2,avg_wind_dir_deg lag_2,avg_wind_speed_kmh lag_2,avg_sea_level_pres_hpa lag_2,avg_temp_c lag_1,precipitation_mm lag_1,avg_wind_dir_deg lag_1,avg_wind_speed_kmh lag_1,avg_sea_level_pres_hpa lag_1
0,2021-01-01,3,5.1,0.0,330.0,5.9,1021.2,,,,...,,,,,,,,,,
1,2021-01-02,3,3.0,0.0,330.0,5.9,1021.2,,,,...,,,,,,5.1,0.0,330.0,5.9,1021.2
2,2021-01-03,3,6.8,1.3,339.0,8.0,1017.8,,,,...,5.1,0.0,330.0,5.9,1021.2,3.0,0.0,330.0,5.9,1021.2
3,2021-01-04,3,6.6,3.6,298.0,5.3,1011.3,5.1,0.0,330.0,...,3.0,0.0,330.0,5.9,1021.2,6.8,1.3,339.0,8.0,1017.8
4,2021-01-05,3,4.3,4.6,11.0,5.1,1014.5,3.0,0.0,330.0,...,6.8,1.3,339.0,8.0,1017.8,6.6,3.6,298.0,5.3,1011.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
938,2023-07-28,2,22.8,0.0,2.0,6.8,1014.6,29.2,0.0,275.0,...,28.2,0.0,242.0,13.4,1002.9,20.7,0.0,316.0,16.0,1012.6
939,2023-07-29,2,26.3,0.0,261.0,6.2,1011.7,28.2,0.0,242.0,...,20.7,0.0,316.0,16.0,1012.6,22.8,0.0,2.0,6.8,1014.6
940,2023-07-30,2,28.2,0.0,317.0,8.0,1009.8,20.7,0.0,316.0,...,22.8,0.0,2.0,6.8,1014.6,26.3,0.0,261.0,6.2,1011.7
941,2023-07-31,2,25.8,0.0,307.0,12.3,1010.9,22.8,0.0,2.0,...,26.3,0.0,261.0,6.2,1011.7,28.2,0.0,317.0,8.0,1009.8


In [None]:
df.dropna(inplace=True)
df.set_index(df['date'])

date,date,season,avg_temp_c,precipitation_mm,avg_wind_dir_deg,avg_wind_speed_kmh,avg_sea_level_pres_hpa,avg_temp_c lag_3,precipitation_mm lag_3,avg_wind_dir_deg lag_3,...,avg_temp_c lag_2,precipitation_mm lag_2,avg_wind_dir_deg lag_2,avg_wind_speed_kmh lag_2,avg_sea_level_pres_hpa lag_2,avg_temp_c lag_1,precipitation_mm lag_1,avg_wind_dir_deg lag_1,avg_wind_speed_kmh lag_1,avg_sea_level_pres_hpa lag_1
2021-01-04,2021-01-04,3,6.6,3.6,298.0,5.3,1011.3,5.1,0.0,330.0,...,3.0,0.0,330.0,5.9,1021.2,6.8,1.3,339.0,8.0,1017.8
2021-01-05,2021-01-05,3,4.3,4.6,11.0,5.1,1014.5,3.0,0.0,330.0,...,6.8,1.3,339.0,8.0,1017.8,6.6,3.6,298.0,5.3,1011.3
2021-01-06,2021-01-06,3,6.2,0.0,18.0,6.7,1017.2,6.8,1.3,339.0,...,6.6,3.6,298.0,5.3,1011.3,4.3,4.6,11.0,5.1,1014.5
2021-01-07,2021-01-07,3,7.3,0.5,0.0,4.5,1015.2,6.6,3.6,298.0,...,4.3,4.6,11.0,5.1,1014.5,6.2,0.0,18.0,6.7,1017.2
2021-01-08,2021-01-08,3,5.7,7.9,346.0,6.5,1009.7,4.3,4.6,11.0,...,6.2,0.0,18.0,6.7,1017.2,7.3,0.5,0.0,4.5,1015.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-07-28,2023-07-28,2,22.8,0.0,2.0,6.8,1014.6,29.2,0.0,275.0,...,28.2,0.0,242.0,13.4,1002.9,20.7,0.0,316.0,16.0,1012.6
2023-07-29,2023-07-29,2,26.3,0.0,261.0,6.2,1011.7,28.2,0.0,242.0,...,20.7,0.0,316.0,16.0,1012.6,22.8,0.0,2.0,6.8,1014.6
2023-07-30,2023-07-30,2,28.2,0.0,317.0,8.0,1009.8,20.7,0.0,316.0,...,22.8,0.0,2.0,6.8,1014.6,26.3,0.0,261.0,6.2,1011.7
2023-07-31,2023-07-31,2,25.8,0.0,307.0,12.3,1010.9,22.8,0.0,2.0,...,26.3,0.0,261.0,6.2,1011.7,28.2,0.0,317.0,8.0,1009.8


In [None]:
df.drop(columns=['date', 'season', 'avg_temp_c', 'precipitation_mm', 'avg_wind_dir_deg','avg_wind_speed_kmh'], axis=1, inplace=True)

In [None]:
df

Unnamed: 0,avg_sea_level_pres_hpa,avg_temp_c lag_3,precipitation_mm lag_3,avg_wind_dir_deg lag_3,avg_wind_speed_kmh lag_3,avg_sea_level_pres_hpa lag_3,avg_temp_c lag_2,precipitation_mm lag_2,avg_wind_dir_deg lag_2,avg_wind_speed_kmh lag_2,avg_sea_level_pres_hpa lag_2,avg_temp_c lag_1,precipitation_mm lag_1,avg_wind_dir_deg lag_1,avg_wind_speed_kmh lag_1,avg_sea_level_pres_hpa lag_1
3,1011.3,5.1,0.0,330.0,5.9,1021.2,3.0,0.0,330.0,5.9,1021.2,6.8,1.3,339.0,8.0,1017.8
4,1014.5,3.0,0.0,330.0,5.9,1021.2,6.8,1.3,339.0,8.0,1017.8,6.6,3.6,298.0,5.3,1011.3
5,1017.2,6.8,1.3,339.0,8.0,1017.8,6.6,3.6,298.0,5.3,1011.3,4.3,4.6,11.0,5.1,1014.5
6,1015.2,6.6,3.6,298.0,5.3,1011.3,4.3,4.6,11.0,5.1,1014.5,6.2,0.0,18.0,6.7,1017.2
7,1009.7,4.3,4.6,11.0,5.1,1014.5,6.2,0.0,18.0,6.7,1017.2,7.3,0.5,0.0,4.5,1015.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
938,1014.6,29.2,0.0,275.0,4.7,1007.7,28.2,0.0,242.0,13.4,1002.9,20.7,0.0,316.0,16.0,1012.6
939,1011.7,28.2,0.0,242.0,13.4,1002.9,20.7,0.0,316.0,16.0,1012.6,22.8,0.0,2.0,6.8,1014.6
940,1009.8,20.7,0.0,316.0,16.0,1012.6,22.8,0.0,2.0,6.8,1014.6,26.3,0.0,261.0,6.2,1011.7
941,1010.9,22.8,0.0,2.0,6.8,1014.6,26.3,0.0,261.0,6.2,1011.7,28.2,0.0,317.0,8.0,1009.8


## Dataset Splitting
Partition the dataset into training and testing sets with an 80:20 ratio.

**WARNING: DO NOT SHUFFLE THE DATASET.**



In [None]:
# Write your code here. Add as many boxes as you need.
from sklearn.model_selection import train_test_split

In [None]:
X=df.drop(columns=['avg_sea_level_pres_hpa'],axis=1)
y=df['avg_sea_level_pres_hpa']

In [None]:
# 80:20 chronological split: the first 80% of the rows train the model, the last 20% test it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
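
An unshuffled split is just a cut at a fixed row position, so it can also be written with plain slicing. The sketch below shows this on toy data; the helper name `chronological_split` is hypothetical.

```python
import numpy as np
import pandas as pd


def chronological_split(X, y, test_fraction=0.2):
    """Split without shuffling: the earliest rows train, the latest rows test."""
    n_test = int(round(len(X) * test_fraction))
    split = len(X) - n_test
    return X.iloc[:split], X.iloc[split:], y.iloc[:split], y.iloc[split:]


toy_X = pd.DataFrame({"feature": np.arange(10)})
toy_y = pd.Series(np.arange(10) * 2.0)
X_tr, X_te, y_tr, y_te = chronological_split(toy_X, toy_y, test_fraction=0.2)
print(len(X_tr), len(X_te))
```

Every training row precedes every test row, which is the property the warning above is protecting.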

## Ensemble Learning Methods

### Bagging

Create an instance of a Random Forest model and train it using the `fit` function.

In [None]:
# Write your code here. Add as many boxes as you need.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1500, criterion='squared_error', max_depth=10)

Use the trained model to make predictions for the test set.

In [None]:
# Write your code here. Add as many boxes as you need.
rf.fit(X_train,y_train)

In [None]:
y_pred=rf.predict(X_test)

Assess the performance of the model by using different metrics provided by the `scikit-learn` library.

In [None]:
# Write your code here. Add as many boxes as you need.
from sklearn.metrics import mean_squared_error, r2_score,mean_absolute_error

In [None]:
print('MSE: ',mean_squared_error(y_test, y_pred))
print('R2: ',r2_score(y_test, y_pred))

MSE:  14.763431754161509
R2:  0.6959568067725376
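
Metrics such as these are easier to interpret next to a naive baseline. A common one for time series is the persistence forecast, which predicts that today's value equals yesterday's. The sketch below computes it on a synthetic random-walk series; the data are made up and only illustrate the calculation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic random-walk pressure series standing in for the real target.
rng = np.random.default_rng(42)
pressure = 1015 + np.cumsum(rng.normal(0, 2, size=300))

y_true = pressure[1:]    # today's value
y_naive = pressure[:-1]  # yesterday's value used as the forecast

baseline_mse = mean_squared_error(y_true, y_naive)
baseline_r2 = r2_score(y_true, y_naive)
print("Baseline MSE:", baseline_mse)
print("Baseline R2: ", baseline_r2)
```

In this notebook the equivalent forecast would be the `avg_sea_level_pres_hpa lag_1` column of `X_test`, compared against `y_test`.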


### Boosting

Create an instance of an XGBoost model and train it using the `fit` function.

In [None]:
# Write your code here. Add as many boxes as you need.
from xgboost import XGBRegressor

Use the trained model to make predictions for the test set.

In [None]:
# Write your code here. Add as many boxes as you need.
model = XGBRegressor(
    objective='reg:squarederror',  # 'reg:linear' is the deprecated name for this objective
    colsample_bytree=0.3,
    learning_rate=0.1,
    max_depth=5,
    alpha=2,
    n_estimators=10,
)

Assess the performance of the model by using different metrics provided by the `scikit-learn` library.

In [None]:
# Write your code here. Add as many boxes as you need.
model.fit(X_train,y_train)



In [None]:
y_pred2 = model.predict(X_test)

In [None]:
print('MSE: ',mean_squared_error(y_test, y_pred2))
print('R2: ',r2_score(y_test, y_pred2))

MSE:  31.667408229303042
R2:  0.3478304990598876
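
With a learning rate of 0.1, ten boosting rounds may be too few for the ensemble to converge. The sketch below illustrates how held-out error changes with the number of rounds. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, because its `staged_predict` method exposes the prediction after every round; the data are synthetic, and the numbers say nothing about the weather dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression problem with a pressure-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 1015 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=400)
X_tr, X_held, y_tr, y_held = X[:320], X[320:], y[:320], y[320:]

gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
gbr.fit(X_tr, y_tr)

# staged_predict yields the held-out predictions after each boosting round.
held_mse = [mean_squared_error(y_held, pred) for pred in gbr.staged_predict(X_held)]
best_rounds = int(np.argmin(held_mse)) + 1

print("MSE after 10 rounds: ", held_mse[9])
print("MSE after 200 rounds:", held_mse[-1])
print("Lowest held-out MSE at round", best_rounds)
```

In practice the number of rounds should be chosen on a validation split carved out of the training data, not on the final test set.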


# Laboratory Exercise - Bonus Task (+ 2 points)

In this bonus task, your objective is to fine-tune the maximum tree depth (`max_depth`) of the Random Forest model using grid search with cross-validation based on a time series split. This involves systematically trying several values of `max_depth` and evaluating each one with cross-validation. Once you have determined the most suitable `max_depth` value, evaluate the model's performance on a test set for final assessment.

Hints:
- For grid search use the `GridSearchCV` from the `scikit-learn` library. Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.
- For cross-validation use the `TimeSeriesSplit` from the `scikit-learn` library. Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html.
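
To see how `TimeSeriesSplit` differs from ordinary k-fold splitting, the following sketch prints the folds for twelve consecutive samples. The sample count and `n_splits=3` are chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(12)  # twelve consecutive time steps
tscv = TimeSeriesSplit(n_splits=3)

folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in tscv.split(samples)
]
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each fold trains on an expanding window of past samples and validates on the block that immediately follows, so the model is never evaluated on data older than its training set.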

## Dataset Splitting
Partition the dataset into training and testing sets with a 90:10 ratio.

**WARNING: DO NOT SHUFFLE THE DATASET.**

In [None]:
# Write your code here. Add as many boxes as you need.
X_train,X_test, y_train, y_test = train_test_split(X,y, test_size=0.1, random_state=42, shuffle=False)

## Fine-tuning the Random Forest Hyperparameter
Experiment with various values for `max_depth` and evaluate the model's performance using cross-validation.

In [None]:
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

In [None]:
# Write your code here. Add as many boxes as you need.
model = RandomForestRegressor()

In [None]:
params = {'max_depth':[1,5,10,20]}

In [None]:
time_series = TimeSeriesSplit(n_splits=5)

In [None]:
gridSearch = GridSearchCV(estimator=model, param_grid=params, cv=time_series, scoring='r2')
gridSearch.fit(X_train,y_train)

In [None]:
best_params = gridSearch.best_params_
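
Beyond `best_params_`, the fitted search object stores the score of every candidate in `cv_results_`, which shows how sensitive the model is to `max_depth`. The sketch below runs a small search on synthetic data; the grid, the data, and the reduced `n_estimators` are assumptions chosen to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic regression problem standing in for the lagged weather features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 1015 + 2 * X[:, 0] + rng.normal(0, 0.5, size=200)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [1, 5, 10]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="r2",
)
search.fit(X, y)

# One row per candidate, with the mean and spread of its validation scores.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print("Best max_depth:", search.best_params_["max_depth"])
```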

In [None]:
best_model = gridSearch.best_estimator_

In [None]:
test_score = best_model.score(X_test,y_test)

In [None]:
test_score

0.5716533221929545

In [None]:
# best_estimator_ was already refit on the full training set
# (refit=True is the GridSearchCV default), so it can predict directly.
y_pred = best_model.predict(X_test)

## Final Assessment of the Model Performance
Upon determining the most suitable `max_depth` value, evaluate the model's performance on a test set for final assessment.

In [None]:
import numpy as np

In [None]:
# scikit-learn metrics take the true values first, then the predictions.
print("Mean Absolute Error : " + str(mean_absolute_error(y_test, y_pred)))
print("Mean Squared Error : " + str(mean_squared_error(y_test, y_pred)))
print("Root Mean Squared Error : " + str(np.sqrt(mean_squared_error(y_test, y_pred))))

Mean Absolute Error : 1.6011547620307713
Mean Squared Error : 4.602086220158189
Root Mean Squared Error : 2.1452473564039622


In [None]:
# Write your code here. Add as many boxes as you need.