## Import Required Libraries

In this section, we import the necessary libraries for our analysis:

- `pandas` for data manipulation and analysis.
- `StandardScaler` from `sklearn.preprocessing` for feature scaling.
- `LinearRegression` from `sklearn.linear_model` for building the regression model.
- `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics` for evaluating the model performance.
- `os` for interacting with the operating system, such as creating directories.


In [1]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import os

## Load and Preprocess Data

In this section, we load the training and testing datasets, keep only rows where `LocTypeName` is 'Country/Area', restrict the test set to the years 2022 to 2024, and convert the `AgeGrp` column to integers (mapping '100+' to 100).


In [2]:
# Load the training dataset
train_df = pd.read_csv("data/WPP2022_PopulationBySingleAgeSex_Medium_1950-2021.csv", low_memory=False)

# Load the testing dataset
test_df = pd.read_csv("data/WPP2022_PopulationBySingleAgeSex_Medium_2022-2100.csv", low_memory=False)

# Filter the datasets to include only rows where LocTypeName is 'Country/Area'
# (.copy() avoids SettingWithCopyWarning when the frame is modified in place later)
train_df = train_df[train_df['LocTypeName'] == 'Country/Area'].copy()
test_df = test_df[test_df['LocTypeName'] == 'Country/Area']

# Filter the test dataset for the years 2022 to 2024
test_df = test_df[(test_df['Time'] >= 2022) & (test_df['Time'] <= 2024)].copy()

# Change AgeGrp from '100+' to 100 in both datasets
train_df.loc[train_df['AgeGrp'] == '100+', 'AgeGrp'] = 100
test_df.loc[test_df['AgeGrp'] == '100+', 'AgeGrp'] = 100
train_df['AgeGrp'] = train_df['AgeGrp'].astype(int)
test_df['AgeGrp'] = test_df['AgeGrp'].astype(int)
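
The `AgeGrp` conversion above can also be written as a single chained expression. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are illustrative assumptions, not part of the WPP data); it strips the trailing `+` and casts to integers in one step.

```python
import pandas as pd

# Toy frame standing in for the WPP data; the values are made up for illustration.
toy = pd.DataFrame({"AgeGrp": ["0", "1", "99", "100+"]})

# Strip the trailing '+' (so '100+' becomes '100') and convert to integers.
toy["AgeGrp"] = toy["AgeGrp"].astype(str).str.rstrip("+").astype(int)

print(toy["AgeGrp"].tolist())  # [0, 1, 99, 100]
```

This form does not depend on the exact label of the open-ended group, as long as it is a number followed by `+`.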

## Scale Features

In this section, we standardize the two features, `Time` and `AgeGrp`. The `StandardScaler` is fitted on the training data only, and the same fitted transformation is then applied to both the training and testing data.


In [3]:
# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on the training data
scaler.fit(train_df[['Time', 'AgeGrp']])

# Transform the training and testing data
train_df[['Time', 'AgeGrp']] = scaler.transform(train_df[['Time', 'AgeGrp']])
test_df[['Time', 'AgeGrp']] = scaler.transform(test_df[['Time', 'AgeGrp']])
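
It may help to know what scaling does and does not change here. For ordinary least squares with an intercept, standardizing the features rescales the coefficients but should leave the predictions unchanged up to floating-point error. The sketch below checks this on synthetic data (the arrays and coefficients are assumptions made up for the illustration, not values from the dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features loosely shaped like (Time, AgeGrp); not real data.
X_train = np.column_stack([
    rng.integers(1950, 2022, 200),
    rng.integers(0, 101, 200),
]).astype(float)
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(0, 5, 200)
X_test = np.array([[2022.0, 30.0], [2024.0, 75.0]])

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler().fit(X_train)

raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)
scaled_pred = (
    LinearRegression()
    .fit(scaler.transform(X_train), y_train)
    .predict(scaler.transform(X_test))
)

# Predictions from raw and standardized features should agree.
print(np.allclose(raw_pred, scaled_pred))
```

Scaling still matters for interpreting coefficients on a common scale and for models that are sensitive to feature magnitude, such as regularized regressions.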

## Perform Regression

In this section, we define a function `perform_regression` to perform linear regression for each `LocID`, evaluate the model, and store the predictions and metrics.


In [4]:
def perform_regression(train_df, test_df):
    # Initialize empty DataFrames to store the combined predictions and metrics
    combined_predictions = pd.DataFrame()
    combined_metrics = pd.DataFrame()

    # Get unique LocIDs
    loc_ids = train_df['LocID'].unique()

    # Loop through each LocID
    for loc_id in loc_ids:
        # Filter data for the current LocID
        train_loc_df = train_df[train_df['LocID'] == loc_id]
        test_loc_df = test_df[test_df['LocID'] == loc_id]

        # Skip locations with no test rows; predict() raises on empty input
        if test_loc_df.empty:
            continue

        # Select features and target
        X_train = train_loc_df[['Time', 'AgeGrp']]
        Y_train = train_loc_df['PopTotal']
        X_test = test_loc_df[['Time', 'AgeGrp']]
        Y_test = test_loc_df['PopTotal']

        # Initialize the Linear Regression model
        model = LinearRegression()

        # Train the model
        model.fit(X_train, Y_train)

        # Make predictions
        Y_pred = model.predict(X_test)

        # Evaluate the model
        mae = mean_absolute_error(Y_test, Y_pred)
        mse = mean_squared_error(Y_test, Y_pred)
        r2 = r2_score(Y_test, Y_pred)

        print(f"LocID: {loc_id} - Mean Absolute Error: {mae}")
        print(f"LocID: {loc_id} - Mean Squared Error: {mse}")
        print(f"LocID: {loc_id} - R^2 Score: {r2}")

        # Collect the predictions for the current LocID in a DataFrame
        predictions_df = test_loc_df[['Time', 'AgeGrp', 'LocID']].copy()
        predictions_df['Predicted_PopTotal'] = Y_pred

        # Store metrics
        metrics_df = pd.DataFrame({
            'LocID': [loc_id],
            'mae': [mae],
            'mse': [mse],
            'r2': [r2]
        })
        combined_metrics = pd.concat([combined_metrics, metrics_df])

        # Store predictions
        combined_predictions = pd.concat([combined_predictions, predictions_df])

    return combined_predictions, combined_metrics
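
The per-location loop can also be organized around `groupby`, collecting results in a list and concatenating once at the end rather than growing a DataFrame on every iteration. This is a minimal sketch on toy data (the `train` and `test` frames, the single `Time` feature, and the two location IDs are hypothetical), not a drop-in replacement for `perform_regression`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: two hypothetical locations with exactly linear populations.
train = pd.DataFrame({
    "LocID": [1, 1, 1, 2, 2, 2],
    "Time": [2019, 2020, 2021, 2019, 2020, 2021],
    "PopTotal": [10.0, 12.0, 14.0, 50.0, 49.0, 48.0],
})
test = pd.DataFrame({"LocID": [1, 2], "Time": [2022, 2022]})

frames = []  # per-location results, concatenated once after the loop
for loc_id, train_loc in train.groupby("LocID"):
    test_loc = test[test["LocID"] == loc_id]
    if test_loc.empty:
        continue  # nothing to predict for this location

    # One independent model per location.
    model = LinearRegression().fit(train_loc[["Time"]], train_loc["PopTotal"])

    out = test_loc.copy()
    out["Predicted_PopTotal"] = model.predict(test_loc[["Time"]])
    frames.append(out)

# The original row index of `test` is preserved in the result.
predictions = pd.concat(frames)
print(predictions)
```

Concatenating once avoids repeatedly copying the accumulated frame, which can matter when there are a few hundred locations.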

## Execute Regression and Save Results

In this section, we execute the `perform_regression` function, ensure the output directory exists, and save the combined predictions and metrics to CSV files. We also evaluate the model on the combined predictions.
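
One point to watch in the combined evaluation: scikit-learn metrics compare values by position and ignore the pandas index, so the actual and predicted columns must be in the same row order. The sketch below uses made-up values (the `actual` and `predicted` Series are assumptions for illustration) to show how a perfect but reordered prediction is scored with and without index alignment.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical actuals and predictions: same rows, different order.
actual = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 2])
predicted = pd.Series([30.0, 10.0, 20.0], index=[2, 0, 1])  # perfect, but shuffled

# Compared by position, the shuffled rows look like errors.
mae_positional = mean_absolute_error(actual, predicted)

# Reordering the predictions by the index of the actuals aligns the rows.
mae_aligned = mean_absolute_error(actual, predicted.loc[actual.index])

print(mae_positional, mae_aligned)
```

Here the aligned error is zero, while the positional comparison reports a nonzero error for identical data.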


In [5]:
# Perform regression
combined_predictions, combined_metrics = perform_regression(train_df, test_df)

# Ensure the directory exists
os.makedirs("data/result", exist_ok=True)

# Save the combined predictions to a file
combined_predictions.to_csv("data/result/combined_predictions_2022_2024.csv", index=False)

# Save the combined metrics to a file
combined_metrics.to_csv("data/result/combined_metrics_2022_2024.csv", index=False)

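# Evaluate the model on the combined predictions.
# Align the actual values to the prediction rows by index: scikit-learn metrics
# compare values by position, and the two frames may differ in order or length.
actual_pop = test_df.loc[combined_predictions.index, 'PopTotal']
mae_combined = mean_absolute_error(actual_pop, combined_predictions['Predicted_PopTotal'])
mse_combined = mean_squared_error(actual_pop, combined_predictions['Predicted_PopTotal'])
r2_combined = r2_score(actual_pop, combined_predictions['Predicted_PopTotal'])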

print(f"Combined - Mean Absolute Error: {mae_combined}")
print(f"Combined - Mean Squared Error: {mse_combined}")
print(f"Combined - R^2 Score: {r2_combined}")

LocID: 108 - Mean Absolute Error: 65.2712806164248
LocID: 108 - Mean Squared Error: 8697.501819861423
LocID: 108 - R^2 Score: 0.5463398902799662
LocID: 174 - Mean Absolute Error: 3.079386204356903
LocID: 174 - Mean Squared Error: 15.031515692215901
LocID: 174 - R^2 Score: 0.726725829332483
LocID: 262 - Mean Absolute Error: 4.208525481720999
LocID: 262 - Mean Squared Error: 20.69116188445457
LocID: 262 - R^2 Score: 0.7259526831544008
LocID: 232 - Mean Absolute Error: 14.622738586295764
LocID: 232 - Mean Squared Error: 328.681166658586
LocID: 232 - R^2 Score: 0.7297876937591179
LocID: 231 - Mean Absolute Error: 588.4772097268603
LocID: 231 - Mean Squared Error: 586602.4429022857
LocID: 231 - R^2 Score: 0.5924601173206672
LocID: 404 - Mean Absolute Error: 242.68991538752772
LocID: 404 - Mean Squared Error: 86277.55232313465
LocID: 404 - R^2 Score: 0.6603826295365396
LocID: 450 - Mean Absolute Error: 131.92292954597627
LocID: 450 - Mean Squared Error: 29997.793037697877
LocID: 450 - R^2 Sc