# Time Series Forecasting with Linear Regression

This notebook builds and evaluates a linear regression model for forecasting weekly data across provinces and districts. Data from 2019 to 2022 serve as the training set, and 2023 serves as the test set. The goal is to predict weekly values for 2023 and compare them against the actual data to assess the model's performance.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

## Data Loading

Load the datasets for the years 2019 to 2023. Each dataset contains weekly data for different provinces and districts.

In [2]:
df_2019 = pd.read_csv('2019.csv')
df_2020 = pd.read_csv('2020.csv')
df_2021 = pd.read_csv('2021.csv')
df_2022 = pd.read_csv('2022.csv')
df_2023 = pd.read_csv('2023.csv')
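
The five `read_csv` calls can also be written as a loop. The sketch below assumes the files are named `<year>.csv` as in the cell above. `load_years` and `column_counts` are hypothetical helpers introduced here for illustration, not part of the original notebook. Comparing column counts per year is a quick way to spot a year with fewer week columns before any modelling.

```python
from pathlib import Path

import pandas as pd


def load_years(directory, years):
    """Read '<year>.csv' from `directory` for each year; return {year: DataFrame}."""
    frames = {}
    for year in years:
        frames[year] = pd.read_csv(Path(directory) / f"{year}.csv")
    return frames


def column_counts(frames):
    """Map each year to its number of columns, to reveal years with missing weeks."""
    return {year: df.shape[1] for year, df in frames.items()}
```

With the files in the working directory, `frames = load_years('.', range(2019, 2024))` followed by `column_counts(frames)` would list how many columns each year provides.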

## Data Preprocessing

Combine the datasets from 2019 to 2022 to form the training dataset. The dataset for 2023 will be used as the testing dataset. Handle any missing values as necessary.

In [3]:
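# Stack the four training years; ignore_index avoids duplicate row labels
df_train = pd.concat([df_2019, df_2020, df_2021, df_2022], ignore_index=True)
# Copy so that cleaning the test set does not modify df_2023 itself
df_test = df_2023.copy()
# Example preprocessing step: treat missing weekly values as 0
df_train = df_train.fillna(0)
df_test = df_test.fillna(0)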
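
To see what the concatenation and the zero fill do when the yearly files do not share exactly the same week columns, here is a minimal sketch on two tiny frames. The week column names (`"1"`, `"2"`, `"3"`) and the values are made up for illustration; the real column names in the CSV files may differ.

```python
import pandas as pd

# Hypothetical miniature frames: the first year has one extra week column.
df_a = pd.DataFrame({"Província": ["P1"], "Distrito": ["D1"], "1": [5], "2": [7], "3": [2]})
df_b = pd.DataFrame({"Província": ["P1"], "Distrito": ["D1"], "1": [4], "2": [6]})

# concat keeps the union of the columns; cells a frame does not have become NaN.
combined = pd.concat([df_a, df_b], ignore_index=True)
n_missing = int(combined.isna().sum().sum())

# Filling with 0 treats "no data for that week" as "zero for that week".
filled = combined.fillna(0)

print(n_missing)
print(filled)
```

Filling with 0 is only one possible choice. It keeps the matrix rectangular, but it cannot distinguish an unreported week from a week with a true value of zero.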

## Model Training

Train a linear regression model on the training data.

In [4]:
model = LinearRegression()
# Features: every column except the province and district labels
X_train = df_train.drop(columns=['Província', 'Distrito'])
# Assumed target: the row-wise sum of the weekly columns (or any other operation to prepare the target variable).
# Because this target is computed from the features themselves, the fit is exact by construction.
y_train = X_train.sum(axis=1)
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
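
When the target is the row-wise sum of the feature columns, as assumed in the cell above, ordinary least squares can reproduce it exactly. The sketch below illustrates this on synthetic data. The column names, the random values, and the name `demo_model` are illustrative assumptions, not taken from the notebook's datasets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the weekly feature matrix: 20 rows, 4 "week" columns.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(0, 50, size=(20, 4)), columns=["1", "2", "3", "4"])

# Same target construction as in the notebook: the sum across the week columns.
y_demo = X_demo.sum(axis=1)

demo_model = LinearRegression().fit(X_demo, y_demo)
coef = demo_model.coef_            # one coefficient per week column
intercept = demo_model.intercept_
train_r2 = demo_model.score(X_demo, y_demo)

print(coef, intercept, train_r2)
```

The fitted coefficients come out numerically equal to 1 and the intercept to 0, so any error metric computed on such a target measures the identity `y = sum(X)` rather than forecasting skill. A target that is not derived from the same row's inputs would be needed for a meaningful evaluation.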

## Prediction and Evaluation

Use the trained model to predict the 2023 data and evaluate the model's performance using metrics such as MAE, MSE, and R².
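
For reference, the three metrics can be computed by hand: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. A minimal sketch with made-up numbers (`y_true` and `y_pred` are illustrative, not results from this notebook):

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_pred                          # [-2, 2, -3, 3]

mae = np.mean(np.abs(errors))                     # (2 + 2 + 3 + 3) / 4 = 2.5
mse = np.mean(errors ** 2)                        # (4 + 4 + 9 + 9) / 4 = 6.5

ss_res = np.sum(errors ** 2)                      # residual sum of squares = 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares = 500
r2 = 1 - ss_res / ss_tot                          # 1 - 26 / 500 = 0.948

print(f"MAE: {mae}, MSE: {mse}, R²: {r2}")
```

These are the same quantities that `mean_absolute_error`, `mean_squared_error`, and `r2_score` return with their default arguments.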

In [8]:
# Build the test features the same way as the training features
X_test = df_test.drop(columns=['Província', 'Distrito'])

# Add any week columns that exist in training but not in the 2023 data, filled with 0
missing_weeks = set(X_train.columns) - set(X_test.columns)
for week in missing_weeks:
    X_test[week] = 0

# Ensure the columns are in the same order as the training set
X_test = X_test[X_train.columns]

In [9]:
# X_test must not be rebuilt from df_test here: doing so discards the alignment above and predict() fails with
# ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 53 is different from 46)
predictions = model.predict(X_test)
# Assuming y_test is prepared similarly to y_train
y_test = X_test.sum(axis=1)  # Adjust based on actual testing approach
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE: {mae}, MSE: {mse}, R²: {r2}")
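
The `ValueError` above is a feature-count mismatch: the matrix passed to `predict` must have the same columns, in the same order, as the matrix passed to `fit`. One compact way to enforce this is `DataFrame.reindex`, sketched below on toy data. The column names and values are hypothetical; in the notebook, `train_cols` would correspond to `X_train.columns`.

```python
import pandas as pd

# Hypothetical training column order.
train_cols = ["1", "2", "3", "4"]

# Toy test frame with fewer week columns, in a different order.
X_test_demo = pd.DataFrame({"2": [6, 1], "1": [4, 3]})

# reindex adds the absent columns (filled with 0), drops any columns
# not seen in training, and puts everything in the training order.
aligned = X_test_demo.reindex(columns=train_cols, fill_value=0)

print(aligned)
```

The result has exactly the training columns, so the matrix product inside `predict` has matching dimensions. As with the zero fill during preprocessing, the added columns encode "no data" as 0, which is an assumption worth revisiting if later weeks of 2023 are simply not yet available.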

## Visualization

Visualize the comparison between actual and predicted values for 2023.

In [None]:
plt.figure(figsize=(10, 5))
# Plot by row position so the actual and predicted series share the same x-axis
plt.plot(y_test.to_numpy(), label='Actual', color='blue')
plt.plot(predictions, label='Predicted', color='red', linestyle='--')
plt.title('Actual vs Predicted Values for 2023')
plt.xlabel('Index')
plt.ylabel('Value')
plt.legend()
plt.show()