# Forecasting Model for Time Series Data

This notebook builds a forecasting model for weekly time series data: it predicts the weekly values for 2023 from historical data covering 2019 to 2022. The dataset consists of one CSV file per year, and each file holds weekly measurements for different provinces and districts.

## Approach

1. **Data Loading**: Load the historical data (2019-2022) for training and the current year data (2023) for testing.
2. **Data Preprocessing**: Combine the datasets, handle missing values, and encode categorical variables.
3. **Feature Engineering**: Extract and select relevant features for the forecasting model.
4. **Model Selection**: Choose a suitable model based on the data's characteristics.
5. **Model Training**: Train the model using historical data.
6. **Model Evaluation**: Evaluate the model's performance using the 2023 data.
7. **Results Generation**: Forecast the weekly data for 2023 and save the results to a CSV file.


## Step 1: Data Loading

First, we load the CSV files for each year into separate Pandas DataFrames. This step involves importing necessary libraries and reading the data.

In [4]:
import pandas as pd
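from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import column_or_1d
from sklearn.utils.validation import check_is_fitted  # used by ExtendedLabelEncoder.inverse_transform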
import numpy as np

# Load the CSV files into DataFrames
df_2019 = pd.read_csv('2019.csv')
df_2020 = pd.read_csv('2020.csv')
df_2021 = pd.read_csv('2021.csv')
df_2022 = pd.read_csv('2022.csv')
df_2023 = pd.read_csv('2023.csv')
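
The five `read_csv` calls above can also be written as a loop. The following is a minimal sketch, assuming the files are named `<year>.csv` as in the cell above. The helper `load_yearly_frames` and the added `Year` column are illustrative assumptions and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_yearly_frames(data_dir, years):
    """Read '<year>.csv' for each year and return a dict of DataFrames.

    Hypothetical helper: the file naming follows the cell above; the
    'Year' column is an added assumption so that rows stay traceable
    after the yearly frames are concatenated.
    """
    frames = {}
    for year in years:
        path = Path(data_dir) / f"{year}.csv"
        if not path.exists():
            raise FileNotFoundError(f"Missing input file: {path}")
        frame = pd.read_csv(path)
        frame["Year"] = year
        frames[year] = frame
    return frames
```

A call such as `load_yearly_frames('.', range(2019, 2024))` would return one DataFrame per year, keyed by the year, and fail early with a clear message if a file is missing.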

## Step 2: Data Preprocessing

Combine the datasets from 2019 to 2022 into a single DataFrame for training, then forward-fill missing values in both the training data and the 2023 data. Categorical variables are handled in Step 3.

In [2]:
df_train = pd.concat([df_2019, df_2020, df_2021, df_2022], ignore_index=True)

# Example preprocessing: forward-fill missing values.
# fillna(method='ffill') is deprecated in recent pandas versions, so use ffill() directly.
df_train = df_train.ffill()
df_2023 = df_2023.ffill()
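
Forward fill copies the last valid value down each column, so a missing value in the first row has nothing to copy from and stays `NaN`. The sketch below illustrates this on made-up values. The back-fill and zero fallback are assumptions about how the remaining gaps might be handled, not the notebook's method. Note also that, if each row is a district, filling down the rows borrows a value from a different district.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one year of data (values are made up).
toy = pd.DataFrame({
    "W1": [np.nan, 4.0, 5.0],  # leading gap: ffill cannot fill it
    "W2": [1.0, np.nan, 3.0],  # interior gap: ffill fills it with 1.0
})

forward_filled = toy.ffill()
remaining = int(forward_filled.isna().sum().sum())
print(f"NaNs left after ffill: {remaining}")

# Assumed fallback: back-fill leading gaps, then use 0 for any column
# that is entirely empty.
cleaned = forward_filled.bfill().fillna(0)
```

Checking `cleaned.isna().sum().sum()` before training is a cheap way to confirm that no missing values reach the model.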

## Step 3: Feature Engineering

Extract features useful for forecasting, including encoding categorical variables and preparing time-based features.

In [7]:
class ExtendedLabelEncoder(BaseEstimator, TransformerMixin):
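    def __init__(self):
        # Fitted attributes such as classes_ are created in fit(), following
        # scikit-learn convention, so check_is_fitted can detect an unfitted encoder
        pass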

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        # Convert all input data to strings
        y = y.astype(str)
        self.classes_ = np.unique(y)
        return self

    def transform(self, y):
        y = column_or_1d(y, warn=True)
        # Convert all input data to strings
        y = y.astype(str)
        unseen = set(y) - set(self.classes_)
        
        # Handle unseen labels
        y_transformed = np.searchsorted(self.classes_, y)
        y_transformed[np.isin(y, list(unseen))] = -1
        return y_transformed

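    def inverse_transform(self, y):
        check_is_fitted(self, 'classes_')
        y = np.asarray(y)
        # Codes of -1 mark unseen labels and have no original class,
        # so they are returned as None instead of wrapping to the last class
        result = np.full(y.shape, None, dtype=object)
        known = y >= 0
        result[known] = self.classes_[y[known]]
        return result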
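
The encoder above maps labels that were not seen during `fit` to `-1`. The same behaviour can be illustrated with `pandas.Categorical`, shown here as an alternative sketch rather than the notebook's own method. The district names are made-up example values.

```python
import pandas as pd

train_labels = pd.Series(["Boane", "Matola", "Boane"])
test_labels = pd.Series(["Matola", "Namaacha"])  # 'Namaacha' is unseen in training

# Categories are learned from the training labels only, in sorted order.
categories = sorted(train_labels.unique())

# Values outside the known categories receive the code -1.
train_codes = pd.Categorical(train_labels, categories=categories).codes.tolist()
test_codes = pd.Categorical(test_labels, categories=categories).codes.tolist()

print(train_codes)
print(test_codes)
```

Because the category list comes from the training data alone, a district that first appears in 2023 is flagged with `-1` instead of silently receiving a valid code.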

In [8]:
df_train['Província'] = df_train['Província'].astype(str)
df_train['Distrito'] = df_train['Distrito'].astype(str)
df_2023['Província'] = df_2023['Província'].astype(str)
df_2023['Distrito'] = df_2023['Distrito'].astype(str)
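
The cells above prepare the categorical columns but do not yet build time-based features. One possible approach is sketched below, assuming the weekly measurements sit in columns named `W1`, `W2`, and so on (only `W1` appears in this notebook) and that a `Year` column is available. Both are assumptions, and the values are made up.

```python
import pandas as pd

# Toy wide frame; the real files are assumed to have one 'W<n>' column per week.
wide = pd.DataFrame({
    "Província": ["Maputo", "Gaza"],
    "Distrito": ["Matola", "Xai-Xai"],
    "Year": [2022, 2022],
    "W1": [10.0, 7.0],
    "W2": [12.0, 6.0],
})

week_cols = [c for c in wide.columns if c.startswith("W") and c[1:].isdigit()]

# Reshape to one row per district, year, and week.
long_df = wide.melt(
    id_vars=["Província", "Distrito", "Year"],
    value_vars=week_cols,
    var_name="week_label",
    value_name="value",
)
long_df["week"] = long_df["week_label"].str[1:].astype(int)
long_df = long_df.sort_values(["Distrito", "Year", "week"]).reset_index(drop=True)

# Previous week's value within each district as a simple lag feature.
long_df["lag_1"] = long_df.groupby("Distrito")["value"].shift(1)
```

In this long layout the week number and the lagged value become ordinary columns, which a regression model can use as features. The first week of each district has no lag and stays `NaN`.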

## Step 4: Model Selection and Training

Choose a suitable forecasting model. For time series data, models like ARIMA or SARIMA could be considered, but for simplicity, we'll use Linear Regression as a starting point.

In [15]:
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Prepare the features and target variable for training
# Caution: 'W1' is still a column of X_train, so the target is also one of the features
X_train = df_train.drop(columns=['Província', 'Distrito'])
y_train = df_train['W1']

# Fit the model
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
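
One way to keep the target column out of the feature matrix is sketched below on synthetic data. It assumes that the remaining week columns serve as predictors, which is an illustrative choice and not something the notebook specifies. The same `feature_cols` list would then have to be applied to the 2023 data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the training frame (random values, made-up names).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(40, 3)), columns=["W1", "W2", "W3"])
toy["Distrito"] = "Demo"

target_col = "W1"
excluded = {"Província", "Distrito", target_col}

# Every column except the identifiers and the target becomes a feature.
feature_cols = [c for c in toy.columns if c not in excluded]

X = toy[feature_cols]
y = toy[target_col]

model = LinearRegression().fit(X, y)
```

Keeping the column list in one variable makes it harder for the training and test matrices to drift apart.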

## Step 5: Model Evaluation

Evaluate the model's performance using the testing data from 2023. This involves predicting with the model and comparing the predictions to the actual data.

In [19]:
from sklearn.metrics import mean_absolute_error

# Prepare the features for testing
X_test = df_2023.drop(columns=['Província', 'Distrito'])
y_test = df_2023['W1']

# Predict on the testing data
predictions = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, predictions)
print(f'Mean Absolute Error: {mae}')

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
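
The error above is raised by scikit-learn's input validation when the data passed to it contains `NaN` or infinite values. A minimal diagnostic sketch follows. `report_non_finite` is a hypothetical helper, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def report_non_finite(frame):
    """Count NaN and infinite entries per numeric column (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    counts = pd.DataFrame({
        "nan": numeric.isna().sum(),
        "inf": np.isinf(numeric).sum(),
    })
    # Keep only the columns that actually contain a problem value.
    return counts[(counts["nan"] > 0) | (counts["inf"] > 0)]


toy = pd.DataFrame({
    "W1": [1.0, np.nan, 3.0],
    "W2": [np.inf, 2.0, 3.0],
    "W3": [1.0, 2.0, 3.0],
})
report = report_non_finite(toy)
print(report)
```

Running such a check on `X_test` before calling `model.predict` would show which columns still need cleaning.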

## Step 6: Save Predicted Results

Finally, save the predicted results for 2023 to a CSV file, providing a record of the model's predictions.

In [20]:
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': predictions
})

# Save to CSV
results_df.to_csv('predicted_results_2023.csv', index=False)
print('Predicted results saved to predicted_results_2023.csv')

NameError: name 'predictions' is not defined
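
The `NameError` above follows from the failed prediction in Step 5: `predictions` was never assigned. Once predictions exist, the results could be saved together with an identifier and a per-row error. The sketch below assumes one identifier column (`Distrito`). `save_results` and the `AbsError` column are hypothetical additions.

```python
import pandas as pd


def save_results(ids, actual, predicted, path):
    """Write actual vs. predicted values with absolute error to CSV (hypothetical helper)."""
    results = pd.DataFrame({
        "Distrito": list(ids),
        "Actual": list(actual),
        "Predicted": list(predicted),
    })
    # Absolute error per row; its mean equals the mean absolute error.
    results["AbsError"] = (results["Actual"] - results["Predicted"]).abs()
    results.to_csv(path, index=False)
    return results
```

With the notebook's variables this might be called as `save_results(df_2023['Distrito'], y_test, predictions, 'predicted_results_2023.csv')`.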

In [None]:
Now, build a Jupyter notebook file that I can download. Using the model built above, generate the results and evaluate them. Write complete code in which the data from the previous four years is used as training data and the 2023 data as test data, following the model you built and the data you predicted.

Build the code properly, in a recommended way, so that the notebook is run-and-go code. DO NOT USE ANY PLACEHOLDERS.

Use only generic code that follows the model built above.

NOTE: Do not use any placeholders. Write proper generic code for the dataset above, with proper output, and save a CSV of the predicted values.

Write each line of code properly with a proper explanation, and explain why this model is used and the approach or logic behind it. Also give each cell proper comments, write out every step the model takes to predict the data, and save a CSV file of the predicted results for each week, matching the results CSV produced above.

Also plot the 2023 data to compare the actual and predicted values, and plot the accuracy of the results.

Explain what you understand from this prompt, so that if you missed anything I can clarify it.