# Data Preprocessing

## 1. Load the Data

We will start by loading the dataset and examining its structure.

In [142]:
import pandas as pd

# Load the data
df = pd.read_csv('berlin_weather_2014_2023.csv')

# Check the first few rows of the dataframe
df.head()

Unnamed: 0,dt,dt_iso,timezone,city_name,lat,lon,temp,visibility,dew_point,feels_like,...,wind_gust,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1407628800,2014-08-10 00:00:00 +0000 UTC,7200,Custom location,52.497492,13.349695,14.97,10000.0,13.0,14.82,...,,,,,,0,741,Fog,fog,50n
1,1407632400,2014-08-10 01:00:00 +0000 UTC,7200,Custom location,52.497492,13.349695,14.97,10000.0,13.0,14.82,...,,,,,,0,741,Fog,fog,50n
2,1407636000,2014-08-10 02:00:00 +0000 UTC,7200,Custom location,52.497492,13.349695,13.97,10000.0,13.02,13.88,...,,,,,,0,741,Fog,fog,50n
3,1407639600,2014-08-10 03:00:00 +0000 UTC,7200,Custom location,52.497492,13.349695,13.97,10000.0,13.02,13.88,...,,,,,,0,741,Fog,fog,50n
4,1407643200,2014-08-10 04:00:00 +0000 UTC,7200,Custom location,52.497492,13.349695,14.97,10000.0,14.01,14.98,...,,,,,,0,800,Clear,sky is clear,01d


#### 1.1 Splitting and Cleaning Data Columns:

To enhance the dataset's organization and clarity, we will perform the following steps to split and clean specific columns.



In [143]:
# Split the 'dt_iso' column into separate columns
df[['date', 'time', 'timezone','zone_sign']] = df['dt_iso'].str.split(' ', expand=True)

#### 1.2 Removing Redundant Columns:
We will remove the following columns as they are no longer needed for analysis
city_name, lat, lon, timezone, sea_level, grnd_level, snow_3h

they either have the same value or Null value for all the rows

In [144]:
# Drop the unnecessary and redundant columns
columns_to_drop = ['dt_iso','city_name', 'lat', 'lon', 'timezone', 'sea_level', 'grnd_level', 'snow_3h','zone_sign']
df = df.drop(columns=columns_to_drop)

#### 1.3 Handling Missing Values
Before using the data to train a machine learning model, it's essential to handle any missing values.

Filling the empty rows with zeros

In [145]:
df.fillna(0, inplace=True)

#### 1.4 Scaling Features
Many machine learning models perform better when numerical features have the same scale.

##### The main reasons for scaling are:

Algorithms Performance: Many machine learning algorithms, especially those that use gradient descent as an optimization technique, require data to be scaled. If features have vastly different scales, the algorithm might prioritize the feature with a larger scale over the smaller one, even if they are equally important. This can lead to sub-optimal performance.

- Convergence: Algorithms might converge (i.e., find a solution) faster if data is scaled.

- Interpretability: It can help in comparing the importance of different features in some models.

For our purposes, we'll use Standard Scaling.

This scales the data so that it has a mean of 0 and a standard deviation of 1.

In [146]:
from sklearn.preprocessing import StandardScaler

# Initialize a scaler
scaler = StandardScaler()

# Fit the scaler and transform the temperature and humidity columns
df['scaled_temp'] = scaler.fit_transform(df[['temp']])
df['scaled_humidity'] = scaler.fit_transform(df[['humidity']])


#### 1.5 Feature Engineering 
Given time series data, we can generate some derived features to help the model capture patterns:

Feature engineering is the process of using domain knowledge to extract additional features from raw data. These features can improve the performance of machine learning models. The process might include:

Feature Transformation: Creating new features from the existing ones. This can be done using various mathematical operations like logarithms, polynomial features, etc.

Encoding Categorical Data: Many machine learning models require inputs to be numeric. If your data includes categorical variables (like "red", "blue", "green"), you need to encode them into numbers. Common methods include one-hot encoding and label encoding.

In [147]:
from sklearn.preprocessing import LabelEncoder

# Create a rolling average of temperature and humidity for the past 7 days
df['rolling_avg_temp'] = df['scaled_temp'].rolling(window=7).mean()
df['rolling_avg_humidity'] = df['scaled_humidity'].rolling(window=7).mean()

# Create lag features for temperature and humidity (previous day's values)
df['lag_temp'] = df['scaled_temp'].shift(1)
df['lag_humidity'] = df['scaled_humidity'].shift(1)


##### Handling NaN values in Rolling Averages

In [148]:
df['rolling_avg_temp'].fillna(df['rolling_avg_temp'].mean(), inplace=True)
df['rolling_avg_humidity'].fillna(df['rolling_avg_humidity'].mean(), inplace=True)
df['lag_temp'].fillna(df['lag_temp'].mean(), inplace=True)
df['lag_humidity'].fillna(df['lag_humidity'].mean(), inplace=True)

##### dropping unneccesary string columns
dropping:
- weather_icon
- weather_main
- weather_description

as weather description is a further explanation of weather_main.
weather icon does not give important information.
and weather ID maps weather description

In [149]:
# Drop the unnecessary and redundant columns
columns_to_drop = ['weather_icon','weather_main', 'weather_description']
df = df.drop(columns=columns_to_drop)

##### Date and Time Feature Extraction:

To capture the cyclical nature of weather patterns and ensure our model recognizes the seasonal, daily, and hourly variations in the data, we'll be extracting the following time-related features:

- Month: This will help in recognizing monthly and seasonal trends. For instance, August might be generally warmer than January.
- Day: Captures the day of the month.
- Hour: Helps the model identify daily patterns. For example, midday might be warmer than early morning or late evening.
- Day of the Week: Some patterns might be more prevalent on specific days, like weekends versus weekdays.

In [150]:
# Convert the 'date' column to a datetime format
df['date'] = pd.to_datetime(df['date'])

# Extract month, day, and day of the week
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6

# Extract hour from the 'time' column
df['hour'] = pd.to_datetime(df['time'], format='%H:%M:%S').dt.hour

# Display the first few rows to verify the changes
print(df.head())


           dt   temp  visibility  dew_point  feels_like  temp_min  temp_max  \
0  1407628800  14.97     10000.0      13.00       14.82     14.95     14.99   
1  1407632400  14.97     10000.0      13.00       14.82     14.95     14.99   
2  1407636000  13.97     10000.0      13.02       13.88     13.95     14.39   
3  1407639600  13.97     10000.0      13.02       13.88     13.95     14.39   
4  1407643200  14.97     10000.0      14.01       14.98     14.95     15.51   

   pressure  humidity  wind_speed  ...  scaled_temp  scaled_humidity  \
0      1016        88        0.51  ...     0.105661         0.832511   
1      1015        88        0.51  ...     0.105661         0.832511   
2      1015        94        1.03  ...     0.078008         1.161113   
3      1015        94        1.03  ...     0.078008         1.161113   
4      1015        94        1.54  ...     0.105661         1.161113   

   rolling_avg_temp  rolling_avg_humidity  lag_temp  lag_humidity  month day  \
0         -0

# 2. Create Yearly Samples
We'll extract yearly samples from the dataset as you described, 
i.e., from 10th of August of one year to the 9th of August of the next year.

## 2.1 Extract Yearly Data Samples

In [151]:
# Create an empty dictionary to store yearly samples
yearly_samples = {}

# Define the range of years we have in our dataset
years = list(range(2014, 2023))

# Extract samples for each year
for year in years:
    start_date = f"{year}-08-10"
    end_date = f"{year+1}-08-09"
    sample = df[(df['date'] >= start_date) & (df['date'] <= end_date)]
    yearly_samples[year] = sample


#### 2.2 Quick Inspection of the Yearly Samples

In [152]:
# Check the first few rows of the 2014 sample
print(yearly_samples[2022].head())

# Check the last few rows of the 2014 sample
print(yearly_samples[2022].tail())


               dt   temp  visibility  dew_point  feels_like  temp_min  \
71842  1660089600  18.19         0.0      12.19       17.84     17.42   
71843  1660093200  17.07         0.0      11.35       16.63     15.57   
71844  1660096800  16.71         0.0      11.22       16.26     14.45   
71845  1660100400  16.27         0.0      11.22       15.83     13.53   
71846  1660104000  15.52         0.0      10.91       15.06     12.98   

       temp_max  pressure  humidity  wind_speed  ...  scaled_temp  \
71842     18.95      1026        68        3.58  ...     0.194703   
71843     18.39      1026        69        3.13  ...     0.163732   
71844     18.91      1025        70        3.13  ...     0.153777   
71845     18.35      1025        72        3.13  ...     0.141610   
71846     17.24      1025        74        1.79  ...     0.120870   

       scaled_humidity  rolling_avg_temp  rolling_avg_humidity  lag_temp  \
71842        -0.262829          0.287735             -1.045215  0.2215

In [153]:
df.to_csv('file_name4.csv')

# Model Selection & Training:
## 1. Multi-Target Linear Regression

In multi-target regression, we aim to predict multiple dependent variables using the same set of features. For our weather prediction task, this means predicting both temperature and humidity using our selected features.

Using k-fold cross-validation, we will train our multi-target Linear Regression model on different yearly samples to ensure robustness and avoid overfitting.

In [170]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Features and Targets
features = [ 'rolling_avg_temp', 'rolling_avg_humidity', 'lag_temp', 'lag_humidity', 'month', 'day', 'day_of_week', 'hour']
targets = ['temp','humidity','scaled_temp', 'scaled_humidity', 'weather_id']


##### Model Training with Cross-Validation:
We'll train the model using 8 out of 9 samples and test on the remaining one, iterating through all possible combinations.

In [172]:
# List to store evaluation metrics
all_mae_temp, all_mse_temp, all_rmse_temp = [], [], []
all_mae_humidity, all_mse_humidity, all_rmse_humidity = [], [], []
all_mae_weather, all_mse_weather, all_rmse_weather = [], [], []
all_mae_tempr, all_mse_tempr, all_rmse_tempr = [], [], []
all_mae_humidityr, all_mse_humidityr, all_rmse_humidityr = [], [], []

# Loop through each year for validation
for year in yearly_samples.keys():
    # Use all samples except the current year for training
    train = pd.concat([yearly_samples[y] for y in yearly_samples.keys() if y != year])
    test = yearly_samples[year]
    
    # Splitting data
    X_train = train[features].dropna()
    y_train = train.loc[X_train.index][targets]
    X_test = test[features].dropna()
    y_test = test.loc[X_test.index][targets]

    # Model training
    multi_target_model = LinearRegression()
    multi_target_model.fit(X_train, y_train)

    # Predicting on test set
    predictions = multi_target_model.predict(X_test)

    # Evaluation
    all_mae_temp.append(mean_absolute_error(y_test['scaled_temp'], predictions[:, 0]))
    all_mse_temp.append(mean_squared_error(y_test['scaled_temp'], predictions[:, 0]))
    all_rmse_temp.append(np.sqrt(all_mse_temp[-1]))
    all_mae_humidity.append(mean_absolute_error(y_test['scaled_humidity'], predictions[:, 1]))
    all_mse_humidity.append(mean_squared_error(y_test['scaled_humidity'], predictions[:, 1]))
    all_rmse_humidity.append(np.sqrt(all_mse_humidity[-1]))
    all_mae_weather.append(mean_absolute_error(y_test['weather_id'], predictions[:, 2]))
    all_mse_weather.append(mean_squared_error(y_test['weather_id'], predictions[:, 2]))
    all_rmse_weather.append(np.sqrt(all_mse_weather[-1]))
    all_mae_tempr.append(mean_absolute_error(y_test['temp'], predictions[:, 2]))
    all_mse_tempr.append(mean_squared_error(y_test['temp'], predictions[:, 2]))
    all_rmse_tempr.append(np.sqrt(all_mse_tempr[-1]))
    all_mae_humidityr.append(mean_absolute_error(y_test['humidity'], predictions[:, 2]))
    all_mse_humidityr.append(mean_squared_error(y_test['humidity'], predictions[:, 2]))
    all_rmse_humidityr.append(np.sqrt(all_mse_humidityr[-1]))


##### Results:
After training and evaluating the model across all combinations, we can calculate the average performance.

In [173]:
avg_mae_temp = np.mean(all_mae_temp)
avg_mse_temp = np.mean(all_mse_temp)
avg_rmse_temp = np.mean(all_rmse_temp)
avg_mae_humidity = np.mean(all_mae_humidity)
avg_mse_humidity = np.mean(all_mse_humidity)
avg_rmse_humidity = np.mean(all_rmse_humidity)
avg_mae_weather = np.mean(all_mae_weather)
avg_mse_weather = np.mean(all_mse_weather)
avg_rmse_weather = np.mean(all_rmse_weather)
avg_mae_humidityr = np.mean(all_mae_humidityr)
avg_mse_humidityr = np.mean(all_mse_humidityr)
avg_rmse_humidityr = np.mean(all_rmse_humidityr)
avg_mae_tempr = np.mean(all_mae_tempr)
avg_mse_tempr = np.mean(all_mse_tempr)
avg_rmse_tempr = np.mean(all_rmse_tempr)

#avg_mae_temp, avg_mse_temp, avg_rmse_temp, avg_mae_humidity, avg_mse_humidity, avg_rmse_humidity, avg_mae_weather, avg_mse_weather, avg_rmse_weather


In [175]:
# Create a DataFrame to display metrics
metrics_data = {
    'Metric': ['Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Temperature': [avg_mae_temp, avg_mse_temp, avg_rmse_temp],
    'Humidity': [avg_mae_humidity, avg_mse_humidity, avg_rmse_humidity],
    'Weather Condition': [avg_mae_weather, avg_mse_weather, avg_rmse_weather],
    'r_Temperature': [avg_mae_tempr, avg_mse_tempr, avg_rmse_tempr],
    'r_Humidity': [avg_mae_humidityr, avg_mse_humidityr, avg_rmse_humidityr]
}

metrics_df = pd.DataFrame(metrics_data)
metrics_df.set_index('Metric', inplace=True)
display(metrics_df)

Unnamed: 0_level_0,Temperature,Humidity,Weather Condition,r_Temperature,r_Humidity
Metric,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Mean Absolute Error (MAE),11.739861,73.235991,738.192366,11.669853,72.80466
Mean Squared Error (MSE),2236.129038,10126.306779,561098.269921,1461.275447,5639.450456
Root Mean Squared Error (RMSE),27.288656,90.266453,749.034918,24.152677,75.079208


#### Result analysis:
from the table above we find that the model is okay for temperature but very bad for humidity in addition it is bad for weather id as it is a classification. which was expected.

### Classification Model for Weather Condition (`weather_id`)

To predict the specific weather condition represented by `weather_id`, we'll use a Random Forest classifier, a popular ensemble learning method known for its high accuracy, ability to handle large data sets with higher dimensionality, and its ability to handle missing values.


In [178]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Data Preparation
X_train = train[features].dropna()
y_train_weather = train.loc[X_train.index]['weather_id']
X_test = test[features].dropna()
y_test_weather = test.loc[X_test.index]['weather_id']

# Model Initialization and Training
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
clf.fit(X_train, y_train_weather)

# Predictions
y_pred = clf.predict(X_test)

# Evaluation
print(classification_report(y_test_weather, y_pred, zero_division=1))


              precision    recall  f1-score   support

         300       0.00      1.00      0.00         0
         310       0.00      1.00      0.00         0
         500       0.38      0.33      0.35      1464
         501       0.17      0.03      0.05       163
         502       0.00      0.00      1.00        17
         520       0.00      1.00      0.00         0
         600       0.29      0.03      0.06        62
         601       1.00      0.00      0.00         3
         701       0.00      1.00      0.00         0
         741       0.00      1.00      0.00         0
         800       0.18      0.84      0.30      1008
         801       0.03      0.00      0.01       505
         802       0.12      0.02      0.03       952
         803       0.14      0.20      0.16      1347
         804       0.56      0.08      0.14      3239

    accuracy                           0.22      8760
   macro avg       0.19      0.44      0.14      8760
weighted avg       0.33   

## 2. Random Forest Regressor for Temperature and Humidity
#### Introduction:

Random forests are a powerful ensemble learning technique that can capture complex patterns in the data without needing explicit feature interaction terms.
They are robust to outliers and can handle non-linear data.

##### Random Forest Regression Model for Temperature and Humidity

Now, we'll utilize the Random Forest Regression model to predict the temperature and humidity. Random forests are ensemble models that make predictions by combining the outputs of multiple decision trees.


In [176]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model for temperature
rf_model.fit(X_train, y_train['scaled_temp'])
y_pred_temp = rf_model.predict(X_test)

# Evaluate the model for temperature
mae_temp = mean_absolute_error(y_test['scaled_temp'], y_pred_temp)
mse_temp = mean_squared_error(y_test['scaled_temp'], y_pred_temp)
rmse_temp = np.sqrt(mse_temp)

# Train the model for humidity
rf_model.fit(X_train, y_train['scaled_humidity'])
y_pred_humidity = rf_model.predict(X_test)

# Evaluate the model for humidity
mae_humidity = mean_absolute_error(y_test['scaled_humidity'], y_pred_humidity)
mse_humidity = mean_squared_error(y_test['scaled_humidity'], y_pred_humidity)
rmse_humidity = np.sqrt(mse_humidity)

#mae_temp, mse_temp, rmse_temp, mae_humidity, mse_humidity, rmse_humidity


In [177]:
# Create a DataFrame to display metrics
metrics_data = {
    'Metric': ['Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Temperature': [mae_temp, mse_temp, rmse_temp],
    'Humidity': [mae_humidity, mse_humidity, rmse_humidity],
}

metrics_df = pd.DataFrame(metrics_data)
metrics_df.set_index('Metric', inplace=True)
display(metrics_df)

Unnamed: 0_level_0,Temperature,Humidity
Metric,Unnamed: 1_level_1,Unnamed: 2_level_1
Mean Absolute Error (MAE),0.043915,0.135819
Mean Squared Error (MSE),8.772526,0.040484
Root Mean Squared Error (RMSE),2.961845,0.201207


##### Random forest analysis
from the result it seems that random forest has performed way better than linear regression model

##### using K-fold Cross-Validation

In [179]:
from sklearn.ensemble import RandomForestRegressor

# Placeholder lists for metrics
maes_temp = []
mses_temp = []
rmses_temp = []
maes_humidity = []
mses_humidity = []
rmses_humidity = []

# Features and targets
#features = ['temp', 'visibility', 'dew_point', 'pressure', 'humidity', 'wind_speed', 'wind_deg', 'wind_gust', 'rain_1h', 'rain_3h', 'snow_1h', 'clouds_all', 'rolling_avg_temp', 'rolling_avg_humidity', 'lag_temp', 'lag_humidity', 'month', 'day', 'day_of_week', 'hour']
targets = ['temp','humidity','scaled_temp', 'scaled_humidity']

# Loop through each year for cross-validation
for year, test_data in yearly_samples.items():
    # Splitting the data
    train_data = pd.concat([data for k, data in yearly_samples.items() if k != year])
    
    X_train = train_data[features]
    y_train = train_data[targets]
    X_test = test_data[features]
    y_test = test_data[targets]
    
    # Training and predicting temperature
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
    rf_model.fit(X_train, y_train['scaled_temp'])
    y_pred_temp = rf_model.predict(X_test)
    
    # Metrics for temperature
    maes_temp.append(mean_absolute_error(y_test['scaled_temp'], y_pred_temp))
    mses_temp.append(mean_squared_error(y_test['scaled_temp'], y_pred_temp))
    rmses_temp.append(np.sqrt(mses_temp[-1]))
    
    # Training and predicting humidity
    rf_model.fit(X_train, y_train['scaled_humidity'])
    y_pred_humidity = rf_model.predict(X_test)
    
    # Metrics for humidity
    maes_humidity.append(mean_absolute_error(y_test['scaled_humidity'], y_pred_humidity))
    mses_humidity.append(mean_squared_error(y_test['scaled_humidity'], y_pred_humidity))
    rmses_humidity.append(np.sqrt(mses_humidity[-1]))

# Calculating average metrics
avg_mae_temp = np.mean(maes_temp)
avg_mse_temp = np.mean(mses_temp)
avg_rmse_temp = np.mean(rmses_temp)
avg_mae_humidity = np.mean(maes_humidity)
avg_mse_humidity = np.mean(mses_humidity)
avg_rmse_humidity = np.mean(rmses_humidity)

#avg_mae_temp, avg_mse_temp, avg_rmse_temp, avg_mae_humidity, avg_mse_humidity, avg_rmse_humidity


In [180]:
# Create a DataFrame to display metrics
metrics_data = {
    'Metric': ['Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Temperature': [avg_mae_temp, avg_mse_temp, avg_rmse_temp],
    'Humidity': [avg_mae_humidity, avg_mse_humidity, avg_rmse_humidity],
}

metrics_df = pd.DataFrame(metrics_data)
metrics_df.set_index('Metric', inplace=True)
display(metrics_df)

Unnamed: 0_level_0,Temperature,Humidity
Metric,Unnamed: 1_level_1,Unnamed: 2_level_1
Mean Absolute Error (MAE),0.015262,0.159993
Mean Squared Error (MSE),0.975006,0.052871
Root Mean Squared Error (RMSE),0.344768,0.226619


### Gradient Boosting Regression Model with K-fold Cross-validation

Gradient Boosting is a powerful ensemble technique that builds a model from the residuals of prior models, refining the predictions iteratively. Here, we implement it with k-fold cross-validation for a holistic evaluation.


In [138]:
list(df.columns) 

['dt',
 'temp',
 'visibility',
 'dew_point',
 'feels_like',
 'temp_min',
 'temp_max',
 'pressure',
 'humidity',
 'wind_speed',
 'wind_deg',
 'wind_gust',
 'rain_1h',
 'rain_3h',
 'snow_1h',
 'clouds_all',
 'weather_id',
 'date',
 'time',
 'scaled_temp',
 'scaled_humidity',
 'rolling_avg_temp',
 'rolling_avg_humidity',
 'lag_temp',
 'lag_humidity',
 'month',
 'day',
 'day_of_week',
 'hour']