# Usage Example

This notebook shows several ways to use this repository.
- To let you test the code quickly, a fake dataset is provided in the folder **BACI_HS12_V202501_test**. It is based on the real BACI dataset but has been reduced in size. This fake dataset is used for every demonstration in this notebook.


## Preparation

Before using the library, complete the following preparation steps. 

1. Download and prepare the BACI dataset required for your project.
  
    - Open the official website: [BACI](https://www.cepii.fr/CEPII/en/bdd_modele/bdd_modele_item.asp?id=37)
    - In the "DOWNLOAD" section, datasets covering different time ranges are listed, for instance "HS92 (1995-2023)". Click the one you need and the download starts automatically (it might take a while).
    - Move the dataset into the repository folder (the same directory that contains the fake dataset, ".../Revealed-Comparative-Advantage-Calculator").
    - Unzip your downloaded dataset.

2. Use `pip` to install the library.  
- Reminder: this command only needs to be run once. Once the installation succeeds, the library stays available in your current environment.

In [1]:
# ! pip install rca_batch_calc  # Uncomment this line if you need to install the library.

3. Import the **rca_batch_calc** library along with the other required libraries. 
- Reminder: each time you restart the kernel, you need to import the library and create the class instances again (that is, rerun steps 3 and 4 before running any other cells).

In [2]:
from rca_batch_calc.parallel_calc import Parallel_Calculator
from rca_batch_calc.data_extract import DataExtract

from functools import partial
from concurrent.futures import ThreadPoolExecutor, as_completed
import pandas as pd
import os

4. Create instances of the `Parallel_Calculator` and `DataExtract` classes. 
- `Parallel_Calculator`: calculates RCA in parallel.
- `DataExtract`: filters the required products out of the BACI dataset.

In [3]:
parallel = Parallel_Calculator()
extractor = DataExtract()

## Product filter

Because the BACI dataset includes trade data for about 5000 products, it is difficult to open in Excel, let alone work with. Extracting only the products of interest makes further manipulation much easier.  
  
So, let's do it.

1. Find the codes of the products you want to analyze, then define them as constants in the code.  
(If you don't know a product code, look it up in the comparison table included in the dataset: **product_codes_HS12_V202501.csv**.)
- Option 1: Define them directly in this notebook, which is convenient for quick use.
- Option 2: Define them in `constants.py`, which is easier to manage and modify when you have many constants.

    In this notebook, all constants are defined in [constants.py](./constants.py) (click the name to open the script), located in the same directory as this notebook. The script explains what each constant means and what value to assign to it.  
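
As a rough illustration, part of a `constants.py` might look like the sketch below. Only the constant names and the value of `PROD` are taken from this notebook. The values of `FOLDER_PATH` and `VAL` are assumptions made for the example, and the remaining constants (`XIN_NAMES`, `XWJ_NAMES`, `XWN_NAMES`) are left out, so rely on the comments in the actual script.

```python
# constants.py (illustrative sketch, not the real script)

# Product codes to extract; this value matches the output printed in this notebook.
PROD = [121221, 121229]

# Assumption: relative path to the unzipped BACI dataset folder.
FOLDER_PATH = "./BACI_HS12_V202501_test"

# Assumption: the BACI columns to compute RCA for (v = trade value, q = quantity).
VAL = ["v", "q"]
```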

In [4]:
# Let's import the products and all other constants.
from constants import *

# Print the product codes to verify that they are correct. (A constant is referenced directly by the name it has in constants.py.)
print(PROD)

[121221, 121229]


2. Iterate over all the BACI data files and keep only the rows for the required products.

In [5]:
product_data = extractor.find_product(FOLDER_PATH, PROD)

Extracting file: BACI_HS12_Y2014_V202501.csv
Extracting finished: BACI_HS12_Y2014_V202501.csv
Extracting file: BACI_HS12_Y2012_V202501.csv
Extracting finished: BACI_HS12_Y2012_V202501.csv
Extracting file: BACI_HS12_Y2013_V202501.csv
Extracting finished: BACI_HS12_Y2013_V202501.csv


3. Observe the extracted data.

In [6]:
product_data.head()

Unnamed: 0,t,i,j,k,v,q
0,2014,32,36,121221,379.273,31.615
1,2014,32,36,121229,348.991,30.081
2,2014,32,70,121221,0.045,0.001
3,2014,32,156,121229,116.567,70.003
4,2014,32,218,121221,46.287,10.0
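
The columns follow the BACI layout: `t` is the year, `i` the exporter, `j` the importer, `k` the product code, `v` the trade value and `q` the quantity. To give a rough idea of what the filtering amounts to, the sketch below keeps the rows whose `k` is in a list of product codes using plain pandas. It is not the library's implementation of `find_product`. The helper name `filter_products` and the toy rows are made up for illustration.

```python
import pandas as pd


def filter_products(df: pd.DataFrame, codes: list[int]) -> pd.DataFrame:
    """Return only the rows whose product code `k` is in `codes`."""
    return df[df["k"].isin(codes)].reset_index(drop=True)


# Toy rows in the BACI column layout (values are invented).
sample = pd.DataFrame({
    "t": [2014, 2014, 2014],
    "i": [32, 32, 32],
    "j": [36, 36, 70],
    "k": [121221, 100110, 121229],
    "v": [379.273, 12.5, 0.045],
    "q": [31.615, 4.0, 0.001],
})

filtered = filter_products(sample, [121221, 121229])
print(filtered)
```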


4. Save the output in the BACI dataset folder as **output.csv**.

In [7]:
extractor.save_csv(product_data, FOLDER_PATH)

Extracted data saved.
---------------------


5. 🎁 **Bonus**: To improve readability, the library provides a `convert_countries` function that converts country codes into country names.  
- The function takes two file paths, in this order: the file containing the country codes to convert, and the country comparison table.

In [8]:
extractor.convert_countries(f"{FOLDER_PATH}/output.csv", f"{FOLDER_PATH}/country_codes_V202501.csv")  # The converted file is saved in your dataset folder as "output_countries.csv".
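
Conceptually, this conversion is a lookup from numeric code to name. The sketch below shows one way to do such a lookup with pandas `Series.map`. It is not how `convert_countries` is implemented, and the column names `country_code` and `country_name` are assumptions about the comparison table, so check the header of your own file.

```python
import pandas as pd

# Toy comparison table; the column names are assumptions for this sketch.
countries = pd.DataFrame({
    "country_code": [32, 36, 156],
    "country_name": ["Argentina", "Australia", "China"],
})

# Toy trade rows using the BACI exporter (i) and importer (j) columns.
trade = pd.DataFrame({"i": [32, 32], "j": [36, 156], "v": [379.273, 116.567]})

# Build a code -> name lookup and apply it to both country columns.
code_to_name = countries.set_index("country_code")["country_name"]
named = trade.assign(
    i=trade["i"].map(code_to_name),
    j=trade["j"].map(code_to_name),
)
print(named)
```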

## RCA calculator

The strategy is to first calculate each component of the formula ($X^i_j$, $X^i_n$, $X^w_j$, $X^w_n$) and save the results for each component in a separate file. These files are then used as inputs to calculate the RCA results ($RCA^i_j = \left( \frac{X^i_j}{X^i_n} \middle/ \frac{X^w_j}{X^w_n} \right)$).
- Implementation logic: 
  - All components are calculated in a similar way. The `partial` function from Python's standard library module `functools` fixes some arguments of a function and returns a new function. This lets us pass functions that need different numbers of arguments, one per component, to `parallel_run`, which runs them in multiple threads to speed up the data processing.  
(In short, `partial` lets the single `parallel_run` function process multiple files at the same time for every component.)
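
The pattern described above can be sketched with the standard library alone. In this minimal example, `count_matches` and `run_all` are hypothetical stand-ins for a per-file worker and for `parallel_run`. They are not the library's functions, and the file names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def count_matches(path: str, prod: list[int]) -> str:
    """Hypothetical worker: pretend to process one file for the given products."""
    return f"{path}: {len(prod)} product(s)"


def run_all(paths: list[str], worker) -> list[str]:
    """Call worker(path) for every path in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(worker, paths))


# partial fixes `prod`, so run_all only has to supply the file path.
worker = partial(count_matches, prod=[121221, 121229])
results = run_all(["Y2012.csv", "Y2013.csv", "Y2014.csv"], worker)
print(results)
```

Because `run_all` only ever calls `worker(path)`, a worker that needs more fixed arguments can be passed in the same way after wrapping it with `partial`.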

1. Define the constants. All constants in the `constants.py` file must be defined to calculate RCA.  
(If you don't know a product code, look it up in the comparison table included in the dataset: **product_codes_HS12_V202501.csv**.)
    - Option 1: Define them directly in this notebook, which is convenient for quick use.
    - Option 2: Define them in `constants.py`, which is easier to manage and modify when you have many constants.

    In this notebook, all constants are defined in [constants.py](./constants.py) (click the name to open the script), located in the same directory as this notebook. The script explains what each constant means and what value to assign to it.  

2. Calculate "xij" values for all files in the BACI_HS12_V202501_test folder.  
(According to the RCA formula, "xij" represents the export value of commodity i from a country to country j.)

In [9]:
xij_process_file = partial(parallel.run_xij, prod=PROD)
parallel.parallel_run(FOLDER_PATH, xij_process_file, "xij")

Processing BACI_HS12_Y2014_V202501.csv in thread: 123145549996032
Processing BACI_HS12_Y2012_V202501.csv in thread: 123145566785536
BACI_HS12_Y2014_V202501.csv is done.
Processing BACI_HS12_Y2013_V202501.csv in thread: 123145549996032
BACI_HS12_Y2012_V202501.csv is done.
BACI_HS12_Y2013_V202501.csv is done.
Total execution time: 0.02 seconds


3. Calculate "xin" values for all files in the BACI_HS12_V202501_test folder.  
(According to the RCA formula, "xin" represents the total export value of commodity i from all exporting countries to country j.)

In [10]:
xin_process_file = partial(parallel.run_xin, val=VAL, prod=PROD)
parallel.parallel_run(FOLDER_PATH, xin_process_file, "xin", XIN_NAMES)

Processing BACI_HS12_Y2014_V202501.csv in thread: 123145549996032
Processing BACI_HS12_Y2012_V202501.csv in thread: 123145566785536
Processing BACI_HS12_Y2013_V202501.csv in thread: 123145583575040
BACI_HS12_Y2012_V202501.csv is done.
BACI_HS12_Y2013_V202501.csv is done.
BACI_HS12_Y2014_V202501.csv is done.
Total execution time: 0.68 seconds


4. Calculate "xwj" values for all files in the BACI_HS12_V202501_test folder.  
(According to the RCA formula, "xwj" represents the total export value of all commodities from a country to country j.)

In [11]:
xwj_process_file = partial(parallel.run_xwj, val=VAL)
parallel.parallel_run(FOLDER_PATH, xwj_process_file, "xwj", XWJ_NAMES)

Processing BACI_HS12_Y2014_V202501.csv in thread: 123145549996032
Processing BACI_HS12_Y2012_V202501.csv in thread: 123145566785536
Processing BACI_HS12_Y2013_V202501.csv in thread: 123145583575040
Handling exporter 4.
Handling exporter 32.
Handling exporter 32.
Handling exporter 100.
Handling exporter 84.
Handling exporter 108.
BACI_HS12_Y2012_V202501.csv is done.
BACI_HS12_Y2014_V202501.csv is done.
BACI_HS12_Y2013_V202501.csv is done.
Total execution time: 5.47 seconds


5. Calculate "xwn" values for all files in the BACI_HS12_V202501_test folder.  
(According to the RCA formula, "xwn" represents the total export value of all commodities from all exporting countries to country j.)

In [12]:
xwn_process_file = partial(parallel.run_xwn, val=VAL)  
parallel.parallel_run(FOLDER_PATH, xwn_process_file, "xwn", XWN_NAMES)

Processing BACI_HS12_Y2014_V202501.csv in thread: 123145549996032
Processing BACI_HS12_Y2012_V202501.csv in thread: 123145566785536
Processing BACI_HS12_Y2013_V202501.csv in thread: 123145583575040
BACI_HS12_Y2012_V202501.csv is done.
BACI_HS12_Y2014_V202501.csv is done.
BACI_HS12_Y2013_V202501.csv is done.
Total execution time: 0.34 seconds
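
Read literally, the four component definitions in steps 2 to 5 are totals over different groups of rows. The sketch below computes them with pandas `groupby(...).transform("sum")` on invented rows that use the BACI columns `t`, `i`, `j`, `k` and `v`. It is only one reading of those definitions, and it is not the code behind `run_xij`, `run_xin`, `run_xwj` or `run_xwn`.

```python
import pandas as pd

# Invented trade rows: year, exporter, importer, product, value.
trade = pd.DataFrame({
    "t": [2014] * 6,
    "i": [32, 32, 156, 156, 32, 156],
    "j": [36, 36, 36, 36, 70, 70],
    "k": [121221, 999999, 121221, 999999, 121221, 121221],  # 999999 is a placeholder code
    "v": [10.0, 30.0, 20.0, 140.0, 5.0, 15.0],
})

components = trade.assign(
    # One commodity, one exporter, one importer: the row value itself.
    xij=trade["v"],
    # One commodity, all exporters, one importer.
    xin=trade.groupby(["t", "j", "k"])["v"].transform("sum"),
    # All commodities, one exporter, one importer.
    xwj=trade.groupby(["t", "i", "j"])["v"].transform("sum"),
    # All commodities, all exporters, one importer.
    xwn=trade.groupby(["t", "j"])["v"].transform("sum"),
)
print(components)
```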


6. Calculate RCA values for all dataset files in the BACI_HS12_V202501_test folder, using the formula $RCA^i_j = \left( \frac{X^i_j}{X^i_n} \middle/ \frac{X^w_j}{X^w_n} \right)$.  
The output is written to the same folder as this notebook.

In [13]:
# Build the list of intermediate files. 
# Note: do not change the order of the files in the list.
file_path_list = [
    os.path.join(os.getcwd(), "xij.csv"),
    os.path.join(os.getcwd(), "xin.csv"),
    os.path.join(os.getcwd(), "xwj.csv"),
    os.path.join(os.getcwd(), "xwn.csv")
]

# Use the standard library module `concurrent.futures` to calculate RCA in parallel.
max_workers = os.cpu_count() * 2 if os.cpu_count() else 4
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = {executor.submit(parallel.run_rca, val, file_path_list): val for val in VAL}

    dfs = [pd.read_csv(file_path_list[0], dtype={'Year': int, 'Importer': int, 'Exporter': int}).iloc[:, :4]]
    for future in as_completed(futures):
        df = future.result()
        dfs.append(df)

final_df = pd.concat(dfs, axis=1)
final_df.to_csv("rca.csv", index=False)

Processing in thread: 123145549996032
Processing in thread: 123145566785536
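
To sanity-check a value in the RCA results by hand, the formula can be applied to the four components directly. The helper below is a minimal sketch with invented numbers. The function name `rca` is hypothetical and is not part of the library.

```python
def rca(xij: float, xin: float, xwj: float, xwn: float) -> float:
    """Return (xij / xin) / (xwj / xwn), the ratio used in this notebook."""
    if xin == 0 or xwj == 0 or xwn == 0:
        raise ValueError("xin, xwj and xwn must be non-zero")
    return (xij / xin) / (xwj / xwn)


# Invented numbers: a share of 20/100 = 0.2 against a reference share of 200/4000 = 0.05.
example = rca(xij=20.0, xin=100.0, xwj=200.0, xwn=4000.0)
print(example)  # 0.2 / 0.05 = 4.0
```

A result above 1 is conventionally read as a revealed comparative advantage, and a result below 1 as the absence of one.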


7. Let's convert country codes into country names for our RCA results.

In [14]:
extractor.convert_countries("./rca.csv", f"{FOLDER_PATH}/country_codes_V202501.csv")