# **<font color='white gray'>panData</font>**
## **<font color='white gray'>Data Science for Multivariate Data Analysis</font>**

### **<font color='white gray'>Predicting Multiple Macroeconomic Indicators with Multi-Target Regression</font>**


## **Installing and Loading the Packages**


In [None]:
!pip install -q -U watermark

In [None]:
# Imports
import sklearn
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')

In [None]:
%reload_ext watermark
%watermark -a "panData"

Author: panData



## **Loading and Understanding the Data**

In [None]:
# 1. Generate sample data
np.random.seed(42)
num_samples = 1000

In [None]:
# 2. Macroeconomic variables (input variables)
interest_rate = np.random.uniform(0, 15, num_samples)
exchange_rate = np.random.uniform(1, 5, num_samples)
industrial_production = np.random.uniform(50, 200, num_samples)

In [None]:
# 3. Economic indicators (output variables - target)
gdp = 2 * interest_rate + 3 * exchange_rate + 0.5 * industrial_production + np.random.normal(0, 5, num_samples)
inflation = 0.5 * interest_rate + 2 * exchange_rate + 0.2 * industrial_production + np.random.normal(0, 2, num_samples)
unemployment_rate = -0.1 * interest_rate + 0.3 * exchange_rate + 0.4 * industrial_production + np.random.normal(0, 1, num_samples)


In [None]:
# 4. Create the DataFrame
df = pd.DataFrame({'interest_rate': interest_rate,
                   'exchange_rate': exchange_rate,
                   'industrial_production': industrial_production,
                   'gdp': gdp,
                   'inflation': inflation,
                   'unemployment_rate': unemployment_rate})


In [None]:
df.head()

| | interest_rate | exchange_rate | industrial_production | gdp | inflation | unemployment_rate |
|---|---|---|---|---|---|---|
| 0 | 5.618102 | 1.740532 | 89.255853 | 66.385407 | 23.102492 | 33.62647 |
| 1 | 14.260715 | 3.167604 | 87.04682 | 84.63268 | 30.193113 | 34.390431 |
| 2 | 10.979909 | 4.491783 | 185.938187 | 131.822108 | 52.517792 | 74.375747 |
| 3 | 8.979877 | 3.9289 | 87.43193 | 66.63264 | 29.988437 | 33.58056 |
| 4 | 2.34028 | 4.226245 | 90.792459 | 68.815242 | 26.593498 | 37.755042 |



## **Separating Attributes and Targets**

In [None]:
# 5. Separate attributes and targets
X = df[['interest_rate', 'exchange_rate', 'industrial_production']]
y = df[['gdp', 'inflation', 'unemployment_rate']]

In [None]:
# 6. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## **Standardization of Attributes**

In [None]:
# 7. Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
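
The fit-on-train, transform-on-test pattern used in step 7 can be checked numerically. The sketch below is illustrative only: it uses hypothetical toy data (`X_demo`) rather than the notebook's DataFrame, and confirms that the training features end up with mean 0 and standard deviation 1, while the test features are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features (not the notebook's data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(50, 200, size=(500, 3))
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Learn mean and standard deviation from the training split only
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)

# Reuse the training statistics on the test split (no refitting)
X_te_s = scaler_demo.transform(X_te)

train_means = X_tr_s.mean(axis=0)
train_stds = X_tr_s.std(axis=0)
test_means = X_te_s.mean(axis=0)

print("train means:", train_means.round(6))
print("train stds: ", train_stds.round(6))
print("test means: ", test_means.round(3))
```

Tree-based models such as random forests are largely insensitive to feature scaling, so this step mainly matters if the base estimator is later swapped for a scale-sensitive one.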



## **Multi-Target Model Construction**

Multi-Target Regression, or Multi-Output Regression, predicts several dependent variables at once instead of a single output variable. When the output variables are correlated, a method that models them jointly can exploit these dependencies and may produce more accurate predictions; methods that fit each target independently do not benefit from this.

There are several ways to implement Multi-Target Regression:

**Independent Models**: Train a separate regression model for each output variable. This is simple but ignores potential correlations between the output variables.

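**MultiOutputRegressor**: Wraps a base regression model and extends it to multiple outputs by fitting one copy of that model per target. This is a convenient way to implement the independent-models strategy above: each output is treated as a separate regression, so correlations between outputs are still ignored, but the same algorithm and hyperparameters are used for all outputs. For example, `MultiOutputRegressor(RandomForestRegressor(...))`.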

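**Multivariate Models**: Some regression algorithms handle multiple outputs natively, such as neural networks and, in scikit-learn, decision trees and random forests (`RandomForestRegressor` accepts a two-dimensional `y` directly). Because a single model is fitted to all outputs, these models can better capture the dependencies between output variables.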

**Joint Regression Models**: Use an approach that handles all output variables simultaneously in a single model. For example, the PLS (Partial Least Squares) algorithm can be used for multi-target regression.


In [None]:
# 8. Build the model (training happens in step 9)
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=42))

The code above defines a multi-output regression model using a combination of `MultiOutputRegressor` and `RandomForestRegressor`.

**`RandomForestRegressor(n_estimators=100, random_state=42)`**: This is the base regression model that will be used. It consists of a forest of 100 decision trees (indicated by `n_estimators=100`). The `random_state=42` parameter is used to ensure the reproducibility of results by fixing the seed of the random number generator.

**`MultiOutputRegressor(...)`**: This class allows extending a regression model to handle multiple dependent variables (multi-output). Each output is treated as a separate regression.

Therefore, `model` fits one random forest per target variable and combines their outputs into a single prediction array with one column per target.
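
The per-target behavior of `MultiOutputRegressor` can be inspected through its fitted `estimators_` attribute. The sketch below uses hypothetical toy data (`X_demo`, `Y_demo`) and a smaller forest than the notebook, purely to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Hypothetical toy data: 3 features, 2 targets
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(100, 3))
Y_demo = np.column_stack([X_demo.sum(axis=1), X_demo[:, 0] - X_demo[:, 1]])

wrapper = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=42))
wrapper.fit(X_demo, Y_demo)

# The wrapper holds one fitted forest per target column
n_forests = len(wrapper.estimators_)
print("Number of fitted forests:", n_forests)

# Stacking the single-target predictions column by column
# gives the same array as the wrapper's own predict()
stacked = np.column_stack([est.predict(X_demo) for est in wrapper.estimators_])
same = np.allclose(stacked, wrapper.predict(X_demo))
print("Matches wrapper.predict:", same)
```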



Documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html
    
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html


## **Training and Evaluation of the Model**

In [None]:
# 9. Train the model
model.fit(X_train_scaled, y_train)


In [None]:
# 10. Make predictions
y_pred = model.predict(X_test_scaled)

In [None]:
# 11. Mean Squared Error
mse = mean_squared_error(y_test, y_pred, multioutput='raw_values')
print("Mean Squared Error for each target variable:", mse)

Mean Squared Error for each target variable: [38.86388322  4.93906348  1.20767905]




The **Mean Squared Error (MSE)** measures the difference between the values predicted by a model and the actual observed values. It is the average of the squared differences between the actual and predicted values, so a lower MSE indicates a more accurate model. Because MSE is expressed in squared units of the target, the three values above are not directly comparable across targets with different scales.
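
As a small worked example of this definition, the per-target MSE can be computed by hand with NumPy and compared with `mean_squared_error(..., multioutput='raw_values')`. The arrays below are made-up values, not results from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values for two targets (one column each)
y_true_demo = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0]])
y_pred_demo = np.array([[12.0, 1.5], [18.0, 2.0], [33.0, 2.5]])

# Mean of squared errors, computed separately for each column
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2, axis=0)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo, multioutput='raw_values')

# The square root (RMSE) is back in the original units of each target
rmse = np.sqrt(mse_manual)

print("manual MSE: ", mse_manual)
print("sklearn MSE:", mse_sklearn)
print("RMSE:       ", rmse)
```

For the first column the squared errors are 4, 4 and 9, giving an MSE of 17/3; for the second they are 0.25, 0 and 0.25, giving 0.5/3.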


In [None]:
# 12. Coefficient of determination (R²)
r2 = r2_score(y_test, y_pred, multioutput='raw_values')
print("R² for each target variable:", r2)

R² for each target variable: [0.94231126 0.94961794 0.99611608]


The coefficient of determination, also known as **R²**, is a metric that evaluates the proportion of variability in the dependent variable that is explained by the model. It is widely used to measure the quality of fit for regression models.

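**R²** is at most 1 and, on held-out data, can be negative:

- 1 indicates that the model explains all the variability in the target variable.
- 0 indicates that the model explains none of the variability, performing no better than always predicting the mean of the target.
- A negative value indicates that the model performs worse than always predicting the mean.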
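
The definition of R² can be written as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below, using made-up values and a hypothetical helper `r2_manual`, checks this against `r2_score` and also shows that predictions worse than the target mean give a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (illustrative helper, not part of scikit-learn)."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up values for a single target
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8])                # close to the actual values
mean_pred = np.full_like(y_true_demo, y_true_demo.mean())  # always predicts the mean
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])                 # reversed order

r2_good = r2_manual(y_true_demo, good_pred)
r2_mean = r2_manual(y_true_demo, mean_pred)
r2_bad = r2_manual(y_true_demo, bad_pred)

print("good predictions:", r2_good, r2_score(y_true_demo, good_pred))
print("mean predictions:", r2_mean, r2_score(y_true_demo, mean_pred))
print("bad predictions: ", r2_bad, r2_score(y_true_demo, bad_pred))
```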

## **Visualizing the Predictions**


In [None]:
# 13. Collect actual and predicted values in one DataFrame
# y_test is already a DataFrame, so copy it and keep its original index
results = y_test.copy()
results['gdp_pred'] = y_pred[:, 0]
results['inflation_pred'] = y_pred[:, 1]
results['unemployment_rate_pred'] = y_pred[:, 2]

In [None]:
# 14. Display the first few results
results.head()

| | gdp | inflation | unemployment_rate | gdp_pred | inflation_pred | unemployment_rate_pred |
|---|---|---|---|---|---|---|
| 521 | 63.014526 | 26.680746 | 30.276211 | 54.654841 | 25.079532 | 28.152093 |
| 737 | 101.764482 | 36.175549 | 52.482609 | 96.803324 | 35.607149 | 51.887139 |
| 740 | 70.147008 | 30.863583 | 40.71911 | 78.477116 | 31.76931 | 41.675455 |
| 660 | 80.842988 | 33.689196 | 50.421798 | 83.755337 | 34.33291 | 49.269336 |
| 411 | 64.364788 | 24.215227 | 28.111229 | 73.112515 | 23.484329 | 28.844566 |


In [None]:
%watermark -a "panData"

Author: panData



In [None]:
%watermark -v -m

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.85+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [None]:
%watermark --iversions

sklearn: 1.5.2
numpy  : 1.26.4
pandas : 2.2.2



# **The End**