## STRI (Services Trade Restrictiveness Index) Dataset
The STRI is a tool that gives an overview of regulatory barriers to services trade across 22 major sectors and 51 countries. Composite indices, built from the qualitative information in the underlying regulatory database, quantify the identified restrictions across five standard policy categories and take values between zero and one. The five policy categories are restrictions on foreign entry, restrictions on the movement of people, other discriminatory measures, barriers to competition, and regulatory transparency. A score of zero means complete openness to trade and investment, while a score of one means a sector is completely closed to foreign service providers.

Information on columns:
1. COUNTRY (renamed from REF_AREA): 51 countries
2. ECONOMIC_ACTIVITY: 19 activities
3. TIME_PERIOD: 2014-2024
4. OBS_VALUE: The STRI value, which quantifies trade restrictiveness (0-1 scale)


In [1]:
import pandas as pd

# Read STRI.csv
df = pd.read_csv('./original/STRI.csv')

# Keep only the specified columns
df = df[['REF_AREA', 'Economic activity', 'TIME_PERIOD', 'OBS_VALUE']]

# Rename 'Economic activity' to 'ECONOMIC_ACTIVITY' and 'REF_AREA' to 'COUNTRY'
df = df.rename(columns={'Economic activity': 'ECONOMIC_ACTIVITY', 'REF_AREA': 'COUNTRY'})

# Save the new dataframe to 'STRI_cleaned.csv'
df.to_csv('./cleaned/STRI_cleaned.csv', index=False)

In [None]:
# Data exploration
print(df.head())
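
Since every STRI score should lie between zero and one, a quick range check on the cleaned data can catch parsing problems early. The following is a minimal sketch, not part of the original pipeline: the helper name `out_of_range_rows` and the toy rows are assumptions made for illustration, and the values are not real STRI scores.

```python
import pandas as pd

def out_of_range_rows(stri, lower=0.0, upper=1.0):
    """Return rows whose OBS_VALUE is missing or outside [lower, upper]."""
    values = pd.to_numeric(stri["OBS_VALUE"], errors="coerce")
    # between() is False for NaN, so missing scores are flagged as well
    return stri[~values.between(lower, upper)]

# Toy data for illustration only (hypothetical countries and scores)
sample = pd.DataFrame({
    "COUNTRY": ["AAA", "BBB", "CCC"],
    "ECONOMIC_ACTIVITY": ["Sector X", "Sector X", "Sector Y"],
    "TIME_PERIOD": [2020, 2020, 2020],
    "OBS_VALUE": [0.25, 1.4, None],
})

bad_rows = out_of_range_rows(sample)
print(bad_rows)
```

An empty result on the real data would suggest that the scores were read on the expected 0-1 scale.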

## NTM (Non-Tariff Measures) Dataset
TRAINS (Trade Analysis Information System) is a trade and market access information system that combines data on trade, customs tariffs, and non-tariff measures. It contains HS-based tariff data for over 170 countries across several years. The data covers all requirements that can potentially affect international trade for a specific product in a specific country and for a specific trading partner at one point in time. The TRAINS NTM database organizes this information by product, measure type, countries imposing the measure, affected countries, and several other variables.

Information on columns:
1. NTM_CODE: Refer to https://wits.worldbank.org/wits/wits/witshelp/content/data_retrieval/p/intro/C2.Non_Tariff_Measures.htm 
2. NTM_DESCRIPTION
3. COUNTRY_IMPOSING
4. IMPLEMENTATION_DATE
5. COUNTRY_AFFECTED
6. IS_UNILATERAL: Whether the measure is one-sided
7. REPEAL_DATE: Date the measure ends; measures without a repeal date are assigned the placeholder 9999-12-31T00:00:00

In [3]:
# Read NTM.csv 
df = pd.read_csv("./original/NTM.csv")

# Select only the relevant columns
relevant_columns = [
  "ntmCode",
  "ntmDescription",
  "countryImposingNTMs",
  "implementationDate",
  "affectedCountriesNames",
  "isUnilateral",
  "repealDate",
]

# Create a new dataframe with only the relevant columns
cleaned_df = df[relevant_columns].copy()

# Rename the columns to be more descriptive
cleaned_df = cleaned_df.rename(
  columns={
    "ntmCode": "NTM_CODE",
    "ntmDescription": "NTM_DESCRIPTION",
    "countryImposingNTMs": "COUNTRY_IMPOSING",
    "implementationDate": "IMPLEMENTATION_DATE",
    "affectedCountriesNames": "COUNTRY_AFFECTED",
    "isUnilateral": "IS_UNILATERAL",
    "repealDate": "REPEAL_DATE",
  }
)

# Fill missing repeal dates with the placeholder 9999-12-31T00:00:00 (no known repeal date)
cleaned_df["REPEAL_DATE"] = cleaned_df["REPEAL_DATE"].fillna("9999-12-31T00:00:00")

# Save the cleaned data to a new CSV file
cleaned_df.to_csv("./cleaned/NTM_cleaned.csv", index=False)

In [None]:
print(cleaned_df.head())
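
The placeholder repeal date 9999-12-31 lies outside the range of pandas' default nanosecond timestamps, so parsing it directly may fail or yield a missing value depending on the pandas version. The sketch below shows one way to work with it: treat the placeholder as "no repeal date" and derive a flag for measures in force on a given day. The helper `add_in_force_flag`, the `IN_FORCE` column, and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

SENTINEL = "9999-12-31T00:00:00"  # placeholder used for missing repeal dates

def add_in_force_flag(ntm, as_of):
    """Add an IN_FORCE column: implemented on or before as_of and not yet repealed."""
    out = ntm.copy()
    # Turn the placeholder back into a missing value before parsing
    repeal_raw = out["REPEAL_DATE"].mask(out["REPEAL_DATE"] == SENTINEL)
    repeal = pd.to_datetime(repeal_raw, errors="coerce")
    start = pd.to_datetime(out["IMPLEMENTATION_DATE"], errors="coerce")
    as_of = pd.Timestamp(as_of)
    out["IN_FORCE"] = (start <= as_of) & (repeal.isna() | (repeal > as_of))
    return out

# Toy rows for illustration only (hypothetical codes and dates)
sample = pd.DataFrame({
    "NTM_CODE": ["X1", "X2"],
    "IMPLEMENTATION_DATE": ["2010-01-01T00:00:00", "2012-06-01T00:00:00"],
    "REPEAL_DATE": ["2015-01-01T00:00:00", SENTINEL],
})

flagged = add_in_force_flag(sample, "2020-01-01")
print(flagged[["NTM_CODE", "IN_FORCE"]])
```

Note that this sketch also treats unparseable repeal dates as "not repealed", which may or may not be the desired behaviour.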

## WITS (World Integrated Trade Solution) Dataset
It captures trade volumes and values between countries, systematically categorized by sectors and product classifications such as the Harmonized System (HS). This database provides data across different time intervals, including quarterly and yearly trade records, ensuring sufficient temporal granularity for capturing trends and economic fluctuations.

Information on columns:
1. YEAR: 2014-2022
2. COUNTRY: The trading partner country name
3. EXPORT_USD: Value of exports in US dollars (converted from thousands of US dollars)
4. IMPORT_USD: Value of imports in US dollars (converted from thousands of US dollars)
5. EXPORT_SHARE: Partner's share of total exports, in percent
6. IMPORT_SHARE: Partner's share of total imports, in percent
7. EXPORT_PRODUCTS: Number of exported HS 6-digit products, a measure of trade diversity
8. IMPORT_PRODUCTS: Number of imported HS 6-digit products, a measure of trade diversity

In [2]:
import pandas as pd

# Combine the yearly files WITS_2014.csv through WITS_2022.csv into a single dataframe
yearly_frames = []
for year in range(2014, 2023):
    yearly_df = pd.read_csv(f"./original/WITS_{year}.csv", encoding="latin1")
    # Add a column to indicate the year
    yearly_df["Year"] = year
    yearly_frames.append(yearly_df)

# Combine the dataframes into a single dataframe
combined_df = pd.concat(yearly_frames, ignore_index=True)

# Save the combined dataframe to a new CSV file
# combined_df.to_csv("WITS_combined.csv", index=False)

In [4]:
# Rename columns for consistency
combined_df = combined_df.rename(columns={
    "Year": "YEAR",
    "Partner Name": "COUNTRY",
    "Export (US$ Thousand)": "EXPORT_USD",
    "Import (US$ Thousand)": "IMPORT_USD",
    "Export Partner Share (%)": "EXPORT_SHARE",
    "Import Partner Share (%)": "IMPORT_SHARE",
    "No Of exported HS6 digit Products": "EXPORT_PRODUCTS",
    "No Of imported HS6 digit Products": "IMPORT_PRODUCTS",
})

# Convert export and import values from thousands of US dollars to US dollars
combined_df["EXPORT_USD"] = combined_df["EXPORT_USD"] * 1000
combined_df["IMPORT_USD"] = combined_df["IMPORT_USD"] * 1000

relevant_columns = [
    "YEAR",
    "COUNTRY",
    "EXPORT_USD",
    "IMPORT_USD",
    "EXPORT_SHARE",
    "IMPORT_SHARE",
    "EXPORT_PRODUCTS",
    "IMPORT_PRODUCTS",
] 

# Change missing values in EXPORT_USD, IMPORT_USD, EXPORT_SHARE, and IMPORT_SHARE to 0
combined_df["EXPORT_USD"] = combined_df["EXPORT_USD"].fillna(0)
combined_df["IMPORT_USD"] = combined_df["IMPORT_USD"].fillna(0)
combined_df["EXPORT_SHARE"] = combined_df["EXPORT_SHARE"].fillna(0)
combined_df["IMPORT_SHARE"] = combined_df["IMPORT_SHARE"].fillna(0)

# Change missing values in EXPORT_PRODUCTS and IMPORT_PRODUCTS to 0
combined_df["EXPORT_PRODUCTS"] = combined_df["EXPORT_PRODUCTS"].fillna(0)
combined_df["IMPORT_PRODUCTS"] = combined_df["IMPORT_PRODUCTS"].fillna(0)

# Create a new dataframe with only the relevant columns
cleaned_df = combined_df[relevant_columns]

# Save the cleaned data to a new CSV file
cleaned_df.to_csv("./cleaned/WITS_combined_cleaned.csv", index=False)

In [None]:
print(cleaned_df.head())
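
Because the cleaned table holds one row per partner and year, year-over-year trends can be derived directly from it. The sketch below computes export growth per partner. It is an illustration only: the helper `add_export_growth`, the `EXPORT_GROWTH` column, and the toy figures are assumptions, not part of the original notebook.

```python
import pandas as pd

def add_export_growth(trade):
    """Add EXPORT_GROWTH: relative change in EXPORT_USD versus the previous row of the same partner."""
    out = trade.sort_values(["COUNTRY", "YEAR"]).copy()
    previous = out.groupby("COUNTRY")["EXPORT_USD"].shift(1)
    out["EXPORT_GROWTH"] = (out["EXPORT_USD"] - previous) / previous
    return out

# Toy figures for illustration only (hypothetical partners and values)
sample = pd.DataFrame({
    "YEAR": [2021, 2020, 2020, 2021],
    "COUNTRY": ["Partner A", "Partner A", "Partner B", "Partner B"],
    "EXPORT_USD": [150.0, 100.0, 200.0, 100.0],
})

growth = add_export_growth(sample)
print(growth)
```

Since missing trade values are filled with 0 during cleaning, a previous-year value of 0 produces an infinite or undefined growth rate here, so such rows would need separate handling.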

## GPR (Geopolitical Risk Index) Dataset
The GPR index measures the level of geopolitical risk worldwide at a given point in time and is published alongside 44 country-specific indexes. It is derived from an automated text search of the digital archives of 10 major newspapers and is calculated as the proportion of news articles each month that discuss adverse geopolitical events. The index groups these events into eight categories: war threats, peace threats, military buildups, nuclear threats, terror threats, beginning of war, escalation of war, and terror acts.

Information on columns:

Current risk
1. MONTH
2. COUNTRY
3. GPR_SCORE

Historical risk
1. MONTH
2. COUNTRY
3. GPR_SCORE

Global risk
1. MONTH: The time period for the measurement
2. GPR: The main Geopolitical Risk Index value
3. GPRT: "Geopolitical Risk Threats" component
4. GPRA: "Geopolitical Risk Acts" component
5. GPRH: Historical Geopolitical Risk Index
6. GPRHT: Historical Geopolitical Risk Threats
7. GPRHA: Historical Geopolitical Risk Acts
8. SHARE_GPR: Share or proportion of current geopolitical risk
9. SHARE_GPRH: Share or proportion of historical geopolitical risk
10. SHAREH_CAT_1 through SHAREH_CAT_8: Shares of the eight event categories listed above




In [10]:
from pathlib import Path

# Use the pandas import that's already available in the notebook

def read_excel_file(path, sheet_name="Sheet1"):
  try:
    xls = pd.ExcelFile(path)
    return xls.parse(sheet_name)
  except Exception as e:
    print(f"Error reading file {path}: {e}")
    raise

def melt_risk_data(df, prefix):
  # Select columns starting with the given prefix
  cols = [col for col in df.columns if col.startswith(prefix)]
  long_df = df.melt(
    id_vars=["month"],
    value_vars=cols,
    var_name="COUNTRY",
    value_name="GPR_SCORE",
  )
  # Remove the prefix from the country names
  long_df["COUNTRY"] = long_df["COUNTRY"].str.replace(prefix, "", regex=False)
  long_df.dropna(subset=["GPR_SCORE"], inplace=True)
  return long_df[["month", "COUNTRY", "GPR_SCORE"]]

# Set paths
raw_gpr_path = Path("./original/data_gpr_export.xlsx")
output_dir = Path("./cleaned")
output_dir.mkdir(exist_ok=True)
output_gpr_base = output_dir / "GPR_export"

# Read and preprocess the raw DataFrame
raw_gpr_df = read_excel_file(raw_gpr_path)

# Rename the first column to "month" if necessary and convert it to datetime
first_col = raw_gpr_df.columns[0]
if first_col != "month":
  raw_gpr_df.rename(columns={first_col: "month"}, inplace=True)
raw_gpr_df["month"] = pd.to_datetime(raw_gpr_df["month"], errors="coerce")

# Create long format DataFrames for current and historical risk data
current_risk_df = melt_risk_data(raw_gpr_df, "GPRC_")
historical_risk_df = melt_risk_data(raw_gpr_df, "GPRHC_")

# Define the global risk metric columns and extract them, dropping months with any missing value
global_risk_columns = [
  "month",
  "GPR",
  "GPRT",
  "GPRA",
  "GPRH",
  "GPRHT",
  "GPRHA",
  "SHARE_GPR",
  "SHARE_GPRH",
  "SHAREH_CAT_1",
  "SHAREH_CAT_2",
  "SHAREH_CAT_3",
  "SHAREH_CAT_4",
  "SHAREH_CAT_5",
  "SHAREH_CAT_6",
  "SHAREH_CAT_7",
  "SHAREH_CAT_8",
]
global_risk_df = raw_gpr_df[global_risk_columns].dropna(how="any")

# Rename month to MONTH
current_risk_df.rename(columns={"month": "MONTH"}, inplace=True)
global_risk_df.rename(columns={"month": "MONTH"}, inplace=True)
historical_risk_df.rename(columns={"month": "MONTH"}, inplace=True)

# Save the cleaned datasets as separate CSV files
global_risk_df.to_csv(f"{output_gpr_base}_global_risk.csv", index=False)
current_risk_df.to_csv(f"{output_gpr_base}_current_risk.csv", index=False)
historical_risk_df.to_csv(f"{output_gpr_base}_historical_risk.csv", index=False)


In [None]:
print(current_risk_df.head())

In [None]:
print(historical_risk_df.head())

In [None]:
print(global_risk_df.head())
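
The wide-to-long reshaping done by `melt_risk_data` is easier to follow on a small example. The sketch below repeats the same steps on a toy frame; the column values are made up for illustration and are not real GPR scores.

```python
import pandas as pd

# Toy wide-format frame: one column per country, prefixed by the series type
wide = pd.DataFrame({
    "month": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "GPRC_USA": [1.0, 2.0],
    "GPRC_CHN": [0.5, None],
    "GPRHC_USA": [3.0, 4.0],
})

prefix = "GPRC_"
# "GPRHC_USA" does not start with "GPRC_", so historical columns are excluded
cols = [c for c in wide.columns if c.startswith(prefix)]

long_df = wide.melt(
    id_vars=["month"],
    value_vars=cols,
    var_name="COUNTRY",
    value_name="GPR_SCORE",
)
# Strip the prefix so only the country code remains, then drop missing scores
long_df["COUNTRY"] = long_df["COUNTRY"].str.replace(prefix, "", regex=False)
long_df = long_df.dropna(subset=["GPR_SCORE"]).reset_index(drop=True)
print(long_df)
```

Each remaining row is one (month, country) observation, which is the shape written to the current and historical risk CSV files.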

In [1]:
import pandas as pd

df = pd.read_csv("./cleaned/FBIC_99-23.csv")

columns_to_keep = [
    # Country identifiers
    "iso3a",
    "iso3b",
    "year",
    # Trade volume metrics
    "exportsallgoodatob_alldata",
    "importsallgoodafromb_alldata",
    "totaltradeawithb",
    "totaltradeabgdpb",
    # Geopolitical relationship indicators
    "fbic",
    "bandwidth",
    "dependence",
    # Diplomatic relations
    "norm_lor_avg",
    # Economic agreements and institutions
    "tradeagreementindex",
    # Security and alliance metrics
    "norm_allianceindex",
    "securitybandwidth",
    "securitydependence",
]

df = df[columns_to_keep]

df.to_csv("./cleaned/FBIC_cleaned.csv", index=False)

In [2]:
relevant_countries = [
    "SGP",
    "CHN",
    "MYS",
    "USA",
    "HKG",
    "IDN",
    "KOR",
    "JPN",
    "THA",
    "AUS",
    "VNM",
    "IND",
    "ARE",
    "PHL",
    "DEU",
    "FRA",
    "CHE",
    "NLD",
]

# Only keep rows where both iso3a and iso3b are in relevant_countries
# (.copy() avoids a SettingWithCopyWarning when columns are added later)
df = df[df["iso3a"].isin(relevant_countries) & df["iso3b"].isin(relevant_countries)].copy()

df.to_csv("./cleaned/FBIC_cleaned.csv", index=False)


In [3]:
def get_country_pair(a, b, available_pairs):
    """Return the key "a|b" or "b|a", whichever is in available_pairs, else None."""
    pair1 = f"{a}|{b}"
    pair2 = f"{b}|{a}"
    if pair1 in available_pairs:
        return pair1
    if pair2 in available_pairs:
        return pair2
    return None

In [5]:
# Read the maritime index and set its index to Country_Pair
maritime_index = pd.read_csv("./cleaned/maritime_index.csv")
maritime_index = maritime_index.set_index("Country_Pair")

# Create a new column in your main dataframe that forms the country pair key
available_pairs = set(maritime_index.index)
df["Country_Pair"] = df.apply(
    lambda row: get_country_pair(row["iso3a"], row["iso3b"], available_pairs), axis=1
)

# Now, map the connectivity value from the maritime index using the new Country_Pair column
df["maritime_connectivity"] = df["Country_Pair"].map(
    maritime_index["Connectivity_Index"]
)

# Drop the Country_Pair column
df = df.drop(columns=["Country_Pair"])

# Save the result
df.to_csv("./cleaned/FBIC_cleaned.csv", index=False)
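
An alternative to checking both orderings row by row is to normalize every pair to a single canonical key on both sides of the lookup. The sketch below illustrates the idea. The helper `canonical_pair` and the toy values are assumptions for illustration, and the approach assumes each unordered pair appears at most once in the maritime index.

```python
import pandas as pd

def canonical_pair(a, b):
    """Order-independent key: ("SGP", "CHN") and ("CHN", "SGP") both give "CHN|SGP"."""
    return "|".join(sorted((a, b)))

# Toy data for illustration only (hypothetical connectivity values)
maritime = pd.DataFrame({
    "Country_Pair": ["SGP|CHN", "JPN|USA"],
    "Connectivity_Index": [0.8, 0.6],
})
pairs = pd.DataFrame({
    "iso3a": ["CHN", "USA", "SGP"],
    "iso3b": ["SGP", "JPN", "USA"],
})

# Re-key the maritime index by the canonical form of each pair
canonical_keys = maritime["Country_Pair"].map(lambda p: canonical_pair(*p.split("|")))
lookup = pd.Series(maritime["Connectivity_Index"].values, index=canonical_keys)

# Build the same canonical key for each row and map it to the connectivity value
row_keys = pd.Series(
    [canonical_pair(a, b) for a, b in zip(pairs["iso3a"], pairs["iso3b"])],
    index=pairs.index,
)
pairs["maritime_connectivity"] = row_keys.map(lookup)
print(pairs)
```

Pairs that are absent from the index end up with a missing connectivity value, matching the behaviour of the lookup above.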

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Read the sentiment index
sentiment_index = pd.read_csv("./original/sentiment.csv")

# Use MinMaxScaler to normalize the sentiment index
scaler = MinMaxScaler()
sentiment_index["sentiment_index"] = scaler.fit_transform(
    sentiment_index[["AvgTone_Avg"]]
)

# Drop the AvgTone_Avg column
sentiment_index = sentiment_index.drop(columns=["AvgTone_Avg"])

# Save the normalized sentiment index
sentiment_index.to_csv("./cleaned/sentiment_index_normalized.csv", index=False)
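
Min-max scaling maps the smallest value to 0 and the largest to 1 by computing (x - min) / (max - min). The sketch below reproduces this arithmetic in plain pandas on toy values. The helper `min_max` and the sample numbers are assumptions for illustration, and mapping a constant column to 0.0 is a choice made in this sketch.

```python
import pandas as pd

def min_max(series):
    """Rescale a numeric series to the range [0, 1]."""
    span = series.max() - series.min()
    if span == 0:
        # All values identical: avoid division by zero
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span

# Toy tone values for illustration only
tone = pd.Series([-2.0, 0.0, 2.0], name="AvgTone_Avg")
scaled = min_max(tone)
print(scaled.tolist())
```

Because the result depends on the observed minimum and maximum, adding new data with more extreme values changes every scaled score.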