# 5.- Future Forecasting

> Important source: https://www.kaggle.com/code/ahmedabdulhamid/recursive-multistep-time-series-forecasting

## Sequence Length 4 and Prediction Length 3

| Model | MAE | RMSE | sMAPE | rRMSE |
|-------|-----|------|-------|-------|
| Transformer | 2.6083644115715288e-05 | 7.5391391874291e-05 | 56.07243347167969 | 14.047582626342773 |
| FEDformer | 2.497724381100852e-05 | 5.9200079704169184e-05 | 56.294803619384766 | 11.030675888061523 |
| Reformer | 2.5397184799658135e-05 | 6.633401062572375e-05 | 55.79774475097656 | 12.359931945800781 |

## Sequence Length 6 and Prediction Length 3

| Model | MAE | RMSE | sMAPE | rRMSE |
|-------|-----|------|-------|-------|
| Reformer | 3.162921348121017e-05 | 7.991403253981844e-05 | 57.381324768066406 | 14.890280723571777 |
| Autoformer | 3.828984699794091e-05 | 0.00010072392615256831 | 58.90266799926758 | 18.76776123046875 |
| Transformer | 3.055121487705037e-05 | 8.281940972665325e-05 | 57.19368362426758 | 15.431634902954102 |

## Prediction Analysis

In [37]:
import torch 
import os 
import torch
import numpy as np
import pandas as pd
from io import StringIO

In [38]:
normalized_data = pd.read_csv("../data/green_skill_classification/data_for_timeseries_normalized.csv")
# raw_data = pd.read_csv("../data/green_skill_classification/data_for_timeseries.csv")

real_months = [
    "2024-07", "2024-08", "2024-09", "2024-10", "2024-11", "2024-12",
    "2025-01", "2025-03", "2025-04", "2025-05", "2025-06", "2025-07"
]

forecast_months = [
    "2025-08", "2025-09", "2025-10", "2025-11", "2025-12", "2026-01"
]

total_months = real_months + forecast_months


In [40]:
def load_predictions(folder : str) -> dict:
    predictions = {}
    for file in os.listdir(folder):
        if file.endswith(".pt"):
            filepath = os.path.join(folder, file)
            data = torch.load(filepath)
            data_np = data.numpy()
            predictions[file] = data_np
            print(f"File: {file}, Shape: {data_np.shape}")
    return predictions

FOLDERS = [
    #"../models/predictions/absolute_predictions",
    "../models/predictions/normalized_predictions"
]

predictions = {}

for folder in FOLDERS:
    preds = load_predictions(folder)
    predictions.update(preds)

predictions

File: n_future_predictions_seq4_pred3_Reformer.pt, Shape: (1, 6, 274)
File: n_future_predictions_seq4_pred3_Informer.pt, Shape: (1, 6, 274)
File: n_future_predictions_seq4_pred3_FEDformer.pt, Shape: (1, 6, 274)


{'n_future_predictions_seq4_pred3_Reformer.pt': array([[[6.80392077e-06, 5.33249440e-05, 0.00000000e+00, ...,
          0.00000000e+00, 5.68135438e-05, 0.00000000e+00],
         [5.65342561e-05, 9.10131166e-06, 0.00000000e+00, ...,
          6.55207550e-05, 1.46099090e-04, 4.86975769e-05],
         [0.00000000e+00, 8.71094744e-05, 0.00000000e+00, ...,
          2.34805466e-05, 1.13980626e-04, 3.74025803e-05],
         [1.49655364e-06, 6.38646306e-05, 0.00000000e+00, ...,
          2.92301456e-06, 6.27678965e-05, 0.00000000e+00],
         [5.00510032e-05, 2.66587067e-05, 0.00000000e+00, ...,
          6.59008074e-05, 1.41435172e-04, 4.51187552e-05],
         [0.00000000e+00, 9.73666974e-05, 0.00000000e+00, ...,
          2.10270846e-05, 1.06719082e-04, 2.71768076e-05]]],
       shape=(1, 6, 274), dtype=float32),
 'n_future_predictions_seq4_pred3_Informer.pt': array([[[3.8511582e-05, 0.0000000e+00, 8.8737415e-06, ...,
          0.0000000e+00, 7.2288218e-05, 7.8789089e-05],
         [6.89

At this point, we have generated **6-month** predictions for the three best performing models (**Transformer**, **Autoformer**, and **Reformer**) trained with `seq_len` = 6 and `pred_len` = 3, as well as with `seq_len` = 4 and `pred_len` = 3. We have six **tensors** where each tensor has the shape `(batches, pred_len, num_features)`.


### Join dataset with predictions

In [41]:
dataset = pd.read_csv("../data/green_skill_classification/data_for_timeseries.csv")
dataset.shape

(274, 14)

In [42]:
map_predictions = {
    "absolute": {},
    "normalized": {}
}
for prediction in predictions:
    if not prediction.startswith("n_"):
        continue
    if prediction.startswith("n_"):
        map_predictions["normalized"][prediction] = predictions[prediction]
    else:
        map_predictions["absolute"][prediction] = predictions[prediction]

map_predictions.keys()

for key in map_predictions:
    for model in map_predictions[key]:
        print(f"Type: {key}, Model: {model}, Shape: {map_predictions[key][model].shape}")

Type: normalized, Model: n_future_predictions_seq4_pred3_Reformer.pt, Shape: (1, 6, 274)
Type: normalized, Model: n_future_predictions_seq4_pred3_Informer.pt, Shape: (1, 6, 274)
Type: normalized, Model: n_future_predictions_seq4_pred3_FEDformer.pt, Shape: (1, 6, 274)


In [47]:
print(map_predictions["normalized"]['n_future_predictions_seq4_pred3_Reformer.pt'])

[[[6.80392077e-06 5.33249440e-05 0.00000000e+00 ... 0.00000000e+00
   5.68135438e-05 0.00000000e+00]
  [5.65342561e-05 9.10131166e-06 0.00000000e+00 ... 6.55207550e-05
   1.46099090e-04 4.86975769e-05]
  [0.00000000e+00 8.71094744e-05 0.00000000e+00 ... 2.34805466e-05
   1.13980626e-04 3.74025803e-05]
  [1.49655364e-06 6.38646306e-05 0.00000000e+00 ... 2.92301456e-06
   6.27678965e-05 0.00000000e+00]
  [5.00510032e-05 2.66587067e-05 0.00000000e+00 ... 6.59008074e-05
   1.41435172e-04 4.51187552e-05]
  [0.00000000e+00 9.73666974e-05 0.00000000e+00 ... 2.10270846e-05
   1.06719082e-04 2.71768076e-05]]]


In [48]:
# region_id,skill_id,2024-07,2024-08,2024-09,2024-10,2024-11,2024-12,2025-01,2025-03,2025-04,2025-05,2025-06,2025-07
SAVE_ON = "../data/predictions/"

map_dataframes = {
    "absolute": pd.read_csv("../data/green_skill_classification/data_for_timeseries.csv"),
    "normalized": pd.read_csv("../data/green_skill_classification/data_for_timeseries_normalized.csv")
}

def create_future_dataframe() -> pd.DataFrame:
    new_dataframe = pd.DataFrame(columns=["region_id", "skill_id", "2024-07", "2024-08", "2024-09", "2024-10", "2024-11", "2024-12",
                                      "2025-01", "2025-03", "2025-04", "2025-05", "2025-06", "2025-07",
                                      "2025-08", "2025-09", "2025-10", "2025-11", "2025-12", "2026-01"])
    return new_dataframe

for key in map_predictions:
    for model in map_predictions[key]:
        if key != "normalized":
            continue
        data_array = map_predictions[key][model]
        future_df = create_future_dataframe()
        future_df[["region_id", "skill_id"]] = map_dataframes[key][["region_id", "skill_id"]]

        future_df[["2024-07", "2024-08", "2024-09", "2024-10", "2024-11", "2024-12",
                   "2025-01", "2025-03", "2025-04", "2025-05", "2025-06", "2025-07"]] = map_dataframes[key][["2024-07", "2024-08", "2024-09", "2024-10", "2024-11", "2024-12",
                                                                                                      "2025-01", "2025-03", "2025-04", "2025-05", "2025-06", "2025-07"]]
        for i in range(data_array.shape[1]):
            if i + 8 == 12:
                month = 2025
                month_str = f"{month}-12"
                future_df[month_str] = data_array[0, i, :]
                continue

            month = 2025 + (i + 8) // 12
            month_str = f"{month}-{(i + 8) % 12:02d}"
            future_df[month_str] = data_array[0, i, :]
            if key != "normalized":
                future_df[month_str] = future_df[month_str].round(2)

        filename = model.replace(".pt", ".csv")
        if key == "normalized":
            filepath = os.path.join(SAVE_ON, filename)
        else:
            filepath = os.path.join(SAVE_ON, filename)
        future_df.to_csv(filepath, index=False)

# Growth rate calculation

In [49]:
import pandas as pd
import numpy as np

def compute_and_merge_skill_growth(df: pd.DataFrame):
    real_months = [
        "2024-07", "2024-08", "2024-09", "2024-10", "2024-11", "2024-12",
        "2025-01", "2025-03", "2025-04", "2025-05", "2025-06", "2025-07"
    ]

    forecast_months = [
        "2025-08", "2025-09", "2025-10", "2025-11", "2025-12", "2026-01"
    ]

    df["R_avg"] = df[real_months].mean(axis=1)
    df["F_avg"] = df[forecast_months].mean(axis=1)

    df["G_abs"] = df["F_avg"] - df["R_avg"]
    epsilon = 1e-6
    df["G_rel"] = df["G_abs"] / df["R_avg"].replace(0, epsilon)

    tau_abs = df["G_abs"].quantile(0.75)
    tau_rel = df["G_rel"].quantile(0.75)

    def classify(row):
        abs_high = row["G_abs"] >= tau_abs
        rel_high = row["G_rel"] >= tau_rel

        if abs_high and rel_high:
            return "Star"
        elif (not abs_high) and rel_high:
            return "Emerging"
        elif abs_high and (not rel_high):
            return "Stable"
        else:
            return "Declining"

    df["quadrant"] = df.apply(classify, axis=1)

    ordered_cols = (
        ["region_id", "skill_id"] +
        real_months +
        forecast_months +
        ["R_avg", "F_avg", "G_abs", "G_rel", "quadrant"]
    )

    print(df[ordered_cols].head())
    return df[ordered_cols]


In [50]:
FOLDER = "../data/predictions/"
for file in os.listdir(FOLDER):
    if file.endswith(".csv"):
        filepath = os.path.join(FOLDER, file)
        df = pd.read_csv(filepath)
        
        result_df = compute_and_merge_skill_growth(df)
        print(result_df.head())
        output_filepath = os.path.join(FOLDER, f"skill_growth_{file}")
        result_df.to_csv(output_filepath, index=False)


   region_id  skill_id   2024-07   2024-08   2024-09   2024-10   2024-11  \
0          1         1  0.000059  0.000000  0.000154  0.000076  0.000127   
1          1         2  0.000118  0.000064  0.000051  0.000076  0.000064   
2          1         3  0.000000  0.000000  0.000000  0.000000  0.000064   
3          1         4  0.000000  0.000128  0.000051  0.000000  0.000064   
4          1         5  0.000118  0.000000  0.000000  0.000000  0.000064   

   2024-12   2025-01   2025-03  ...   2025-09   2025-10   2025-11   2025-12  \
0  0.00012  0.000050  0.000037  ...  0.000057  0.000000  0.000001  0.000050   
1  0.00006  0.000000  0.000000  ...  0.000009  0.000087  0.000064  0.000027   
2  0.00006  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000   
3  0.00030  0.000101  0.000257  ...  0.000229  0.000214  0.000096  0.000239   
4  0.00006  0.000000  0.000073  ...  0.000076  0.000042  0.000005  0.000081   

    2026-01     R_avg     F_avg         G_abs     G_rel   quadrant  

In [51]:
FOLDER = "../data/predictions/"

MD_FILE = {
    "Normalized_Data": "../doc/future_predictions_count_normalized.md",
    "Raw_Data": "../doc/future_predictions_count_raw.md"
}

for file in os.listdir(FOLDER):
    for key in MD_FILE:
        if file.startswith("skill_growth_") and file.endswith(".csv"):
            if (key == "Normalized_Data" and file.startswith("skill_growth_n_")) or (key == "Raw_Data" and not file.startswith("skill_growth_n_")):
                filepath = os.path.join(FOLDER, file)
                df = pd.read_csv(filepath)

                quadrant_counts = df['quadrant'].value_counts().to_dict()

                with open(MD_FILE[key], 'a') as md_file:
                    md_file.write(f"## {file}\n\n")
                    md_file.write("| Quadrant   | Count |\n")
                    md_file.write("|------------|-------|\n")
                    for quadrant in ["Star", "Emerging", "Stable", "Declining"]:
                        count = quadrant_counts.get(quadrant, 0)
                        md_file.write(f"| {quadrant} | {count} |\n")
                    md_file.write("\n")

In [None]:
NORMALIZED_DATA = "../data/predictions"
MAP_JSON = "../data/green_skill_classification/mapping/map_skills.json"
WRITE_TO = "../doc/top_skills_future_predictions.md"

import os
import json
import pandas as pd

if os.path.exists(WRITE_TO):
    os.remove(WRITE_TO)

with open(MAP_JSON, 'r') as f:
    skill_map = json.load(f)

data_frames = []

def getTopK(dataframe: pd.DataFrame, k: int = 10, val: str = "G_rel"):
    sorted_df = dataframe.sort_values(by=val, ascending=False)
    top_k = sorted_df.head(k)
    top_k["skill_name"] = top_k["skill_id"].map(lambda x: skill_map.get(str(x), "Unknown Skill"))
    return top_k[["region_id", "skill_name", "G_rel", "G_abs", "quadrant", "skill_id"]]

results_rel = []
results_abs = []

for file in os.listdir(NORMALIZED_DATA):
    if file.startswith("skill_growth_n"):
        print("========================================")
        print(f"Results for file: {file}")
        filepath = os.path.join(NORMALIZED_DATA, file)
        result_df = pd.read_csv(filepath)
        data_frames.append(result_df)
        top_rel = getTopK(result_df, k=10, val="G_rel")
        top_rel["source_file"] = file
        results_rel.append(top_rel)
        top_abs = getTopK(result_df, k=10, val="G_abs")
        top_abs["source_file"] = file
        results_abs.append(top_abs)

with open(WRITE_TO, 'a') as md_file:
    md_file.write("# Top Skills by Relative and Absolute Growth\n\n")
    md_file.write("## Relative Growth (G_rel)\n\n")
    for df in results_rel:
        file = df["source_file"].iloc[0]
        md_file.write(f"### {file}\n\n")
        md_file.write("| Region ID | Skill Name | G_rel | G_abs | Quadrant | Skill ID |\n")
        md_file.write("|-----------|-------------|-------|-------|-----------|----------|\n")
        for _, row in df.iterrows():
            md_file.write(
                f"| {row['region_id']} | {row['skill_name']} | "
                f"{row['G_rel']:.4f} | {row['G_abs']:.6f} | "
                f"{row['quadrant']} | {row['skill_id']} |\n"
            )
        md_file.write("\n")

    md_file.write("## Absolute Growth (G_abs)\n\n")
    for df in results_abs:
        file = df["source_file"].iloc[0]
        md_file.write(f"### {file}\n\n")
        md_file.write("| Region ID | Skill Name | G_rel | G_abs | Quadrant | Skill ID |\n")
        md_file.write("|-----------|-------------|-------|-------|-----------|----------|\n")
        for _, row in df.iterrows():
            md_file.write(
                f"| {row['region_id']} | {row['skill_name']} | "
                f"{row['G_rel']:.4f} | {row['G_abs']:.6f} | "
                f"{row['quadrant']} | {row['skill_id']} |\n"
            )
        md_file.write("\n")


Results for file: skill_growth_n_future_predictions_seq4_pred3_Reformer.csv
Results for file: skill_growth_n_future_predictions_seq4_pred3_FEDformer.csv
Results for file: skill_growth_n_future_predictions_seq4_pred3_Informer.csv


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_k["skill_name"] = top_k["skill_id"].map(lambda x: skill_map.get(str(x), "Unknown Skill"))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_k["skill_name"] = top_k["skill_id"].map(lambda x: skill_map.get(str(x), "Unknown Skill"))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_k["skill_name

In [65]:
INDEXES = [194, 106, 198]
MODELS = ["Reformer", "FEDformer", "Informer"]

for index in INDEXES:
    print("========================================")
    for idx, df in enumerate(data_frames):
        print(f"Results for dataframe {MODELS[idx]}:")
        row = df.iloc[index]
        print(f"Index: {index}, Region ID: {row['region_id']}, Skill ID: {row['skill_id']}, G_rel: {row['G_rel']}, G_abs: {row['G_abs']}, Quadrant: {row['quadrant']}")

Results for dataframe Reformer:
Index: 194, Region ID: 1, Skill ID: 195, G_rel: 3.0212122707608, G_abs: 9.225977105429535e-06, Quadrant: Emerging
Results for dataframe FEDformer:
Index: 194, Region ID: 1, Skill ID: 195, G_rel: 4.1047472744592, G_abs: 1.2534804238762874e-05, Quadrant: Emerging
Results for dataframe Informer:
Index: 194, Region ID: 1, Skill ID: 195, G_rel: 1.6904825075954002, G_abs: 5.16228305542954e-06, Quadrant: Emerging
Results for dataframe Reformer:
Index: 106, Region ID: 1, Skill ID: 107, G_rel: 0.8283773184545001, G_abs: 2.851949729582387e-06, Quadrant: Emerging
Results for dataframe FEDformer:
Index: 106, Region ID: 1, Skill ID: 107, G_rel: 6.127362667400001, G_abs: 2.1095375154582388e-05, Quadrant: Emerging
Results for dataframe Informer:
Index: 106, Region ID: 1, Skill ID: 107, G_rel: 0.326390939305, G_abs: 1.1237035712490534e-06, Quadrant: Declining
Results for dataframe Reformer:
Index: 198, Region ID: 1, Skill ID: 199, G_rel: 2.346572897789, G_abs: 7.1658082