# <a id='toc1_'></a>[**Seminar**: Topics in Sovereign Debt](#toc0_)
## <a id='toc1_1_'></a>[**Sovereign Debt and Income:** New Evidence for Low- and Middle-Income African Countries](#toc0_)

This project builds a dynamic threshold model to estimate the debt level at which a country's government debt starts to hamper GDP growth.


**Table of contents**<a id='toc0_'></a>    
- [**Seminar**: Topics in Sovereign Debt](#toc1_)    
  - [**Sovereign Debt and Income:** New Evidence for Low- and Middle-Income African Countries](#toc1_1_)    
    - [1. Exploratory Data Analysis (EDA)](#toc1_1_1_)    
      - [Descriptive Statistics](#toc1_1_1_1_)    
      - [Visualizations](#toc1_1_1_2_)    
      - [Dynamic Exploratory Data Analysis](#toc1_1_1_3_)    
    - [2. Baseline Model](#toc1_1_2_)    
    - [3. Nonlinearity Check](#toc1_1_3_)    
    - [4. Dynamic Threshold Model](#toc1_1_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [None]:
## Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # For advanced visualizations
import numpy as np
import statsmodels.api as sm

# load the data
data = pd.read_csv('fin_dataset_africa copy.csv', delimiter=';')

# replace the missing-value marker ".." in the dataset with NaN
data = data.replace('..', np.nan)

In [None]:
# create dataset without countries – drop the column
data = data.drop(columns=['Country_Name'])

In [None]:
# set year as index
data = data.set_index('Year')

In [21]:
# replace nan values with the mean of the column
data = data.fillna(data.mean())
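
Filling every gap with the pooled column mean assigns the same value to all countries, regardless of their own debt or growth levels. As a minimal sketch of an alternative, the toy example below compares pooled imputation with imputation by country mean. The frame `toy` and its values are hypothetical, and in this notebook the step would have to run before the country column is dropped.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical values): two countries with very different debt levels
toy = pd.DataFrame({
    "Country_Name": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "debt": [20.0, np.nan, 30.0, 90.0, 100.0, np.nan],
})

# Pooled imputation: every gap receives the mean over all countries
pooled = toy["debt"].fillna(toy["debt"].mean())

# Within-country imputation: each gap receives the mean of its own country
country_mean = toy.groupby("Country_Name")["debt"].transform("mean")
within = toy["debt"].fillna(country_mean)

print(pd.DataFrame({"pooled": pooled, "within": within}))
```

Either way, imputed values carry no new information, so it may be worth checking how the estimates change when incomplete rows are dropped instead.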

In [29]:
# print data column names
print(data.columns)

Index(['Government_Debt_(Percent_of_GDP)',
       'Foreign_direct_investment_net_inflows_(Percent_of_GDP)',
       'GDP_per_capita_growth_(annual_Percent)',
       'Gross_capital_formation_(Percent_of_GDP)',
       'Gross_national_expenditure_(Percent_of_GDP)',
       'Net_barter_terms_of_trade_index_(2015_=_100)',
       'Population_growth_(annual_Percent)', 'Trade_(Percent_of_GDP)',
       'fitted_values'],
      dtype='object')


In [32]:
# Specify variables 
dep_var = "GDP_per_capita_growth_(annual_Percent)"
endogenous_var = "Government_Debt_(Percent_of_GDP)"
exogenous_vars = [
                'Gross_capital_formation_(Percent_of_GDP)',
                'Gross_national_expenditure_(Percent_of_GDP)',
                'Net_barter_terms_of_trade_index_(2015_=_100)',
                'Population_growth_(annual_Percent)', 
                'Trade_(Percent_of_GDP)'] # Other control variables

instrument = "Foreign_direct_investment_net_inflows_(Percent_of_GDP)"  

In [36]:
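# First stage: regress the endogenous variable (debt) on the controls and the instrument.
# No constant is added, which is why the summary reports an uncentered R-squared.
first_stage_model = sm.OLS(data[endogenous_var], data[exogenous_vars + [instrument]]).fit()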
fitted_values = first_stage_model.fittedvalues

# Add fitted values to the dataset
data["fitted_values"] = fitted_values

print(first_stage_model.summary())

                                        OLS Regression Results                                       
Dep. Variable:     Government_Debt_(Percent_of_GDP)   R-squared (uncentered):                   0.570
Model:                                          OLS   Adj. R-squared (uncentered):              0.567
Method:                               Least Squares   F-statistic:                              200.2
Date:                              Tue, 16 Apr 2024   Prob (F-statistic):                   2.96e-162
Time:                                      08:43:39   Log-Likelihood:                         -4828.5
No. Observations:                               912   AIC:                                      9669.
Df Residuals:                                   906   BIC:                                      9698.
Df Model:                                         6                                                  
Covariance Type:                          nonrobust                               

In [37]:
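# Second stage: regress growth on the controls and the fitted debt values.
# Note: the point estimates match 2SLS, but the OLS standard errors reported below
# are not valid 2SLS standard errors because the regressor is a first-stage prediction.
second_stage_model = sm.OLS(data[dep_var], data[exogenous_vars + ['fitted_values']]).fit()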

# Print results
print(second_stage_model.summary())


                                           OLS Regression Results                                          
Dep. Variable:     GDP_per_capita_growth_(annual_Percent)   R-squared (uncentered):                   0.094
Model:                                                OLS   Adj. R-squared (uncentered):              0.088
Method:                                     Least Squares   F-statistic:                              15.67
Date:                                    Tue, 16 Apr 2024   Prob (F-statistic):                    3.55e-17
Time:                                            08:43:41   Log-Likelihood:                         -2720.9
No. Observations:                                     912   AIC:                                      5454.
Df Residuals:                                         906   BIC:                                      5483.
Df Model:                                               6                                                  
Covariance Type:            
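
The two cells above run two-stage least squares by hand. A manual second stage reproduces the 2SLS coefficients, but its standard errors are built from residuals that use the fitted regressor instead of the observed one. The sketch below illustrates the difference on synthetic data with NumPy only. All variable names, coefficients, and the data-generating process are assumptions made for illustration, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (illustrative assumption): u confounds debt and growth
z = rng.normal(size=n)                               # instrument
u = rng.normal(size=n)                               # unobserved confounder
debt = z + u + rng.normal(size=n)                    # endogenous regressor
growth = 2.0 - 0.5 * debt + u + rng.normal(size=n)   # true debt effect: -0.5

const = np.ones(n)
Z = np.column_stack([const, z])      # instruments, including the constant
X = np.column_stack([const, debt])   # regressors, including the constant

# First stage: project the regressors on the instruments
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

# Second stage: regress growth on the fitted regressors
beta_2sls = np.linalg.lstsq(X_hat, growth, rcond=None)[0]

# Plain OLS for comparison (biased here because debt is correlated with u)
beta_ols = np.linalg.lstsq(X, growth, rcond=None)[0]

def standard_errors(resid):
    """Classical standard errors for the second stage, given a residual vector."""
    sigma2 = resid @ resid / (n - X.shape[1])
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(X_hat.T @ X_hat)))

# Valid 2SLS residuals use the OBSERVED regressors; a manual second stage uses X_hat
se_correct = standard_errors(growth - X @ beta_2sls)
se_naive = standard_errors(growth - X_hat @ beta_2sls)

print("OLS slope:        ", beta_ols[1])
print("2SLS slope:       ", beta_2sls[1])
print("2SLS SE (correct):", se_correct[1])
print("2SLS SE (naive):  ", se_naive[1])
```

Dedicated IV estimators, such as `IV2SLS` from `linearmodels` (imported later in this notebook), apply the residual correction automatically.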

### <a id='toc1_1_1_'></a>[1. Exploratory Data Analysis (EDA)](#toc0_)
- Descriptive statistics
- Data visualization
- Dynamic exploratory data analysis

In [None]:
## Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # For advanced visualizations
import numpy as np

# load the data
data = pd.read_csv('fin_dataset_africa copy.csv', delimiter=';')

# replace the missing-value marker ".." in the dataset with NaN
data = data.replace('..', np.nan)


In [None]:
# create a new DataFrame from the data, with the columns 'Country Name' and 'Year' as index (currently disabled)

#data = data.set_index(['Country Name', 'Year'])

In [None]:
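# create a copy of the dataset without the country column;
# `data` keeps 'Country Name' because the plotting and panel cells below rely on it
data_no_country = data.drop(columns=['Country Name'])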

In [None]:
import seaborn as sns # type: ignore

# Plot a kernel density estimate (KDE) of government debt
sns.displot(data, x = 'Government Debt (% of GDP) (Percent of GDP)', kind="kde", height = 10, aspect = 3)

# Set the labels and title
plt.xlabel('Government Debt (% of GDP)', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.title('Kernel Density Estimate of Government Debt (% of GDP)', fontsize = 30)

# Show the plot
plt.show()



In [None]:
from linearmodels.iv import IVGMM

In [None]:
import pandas as pd
from linearmodels.iv import IVGMM

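# Define the dependent, endogenous, exogenous, and instrument variables
dep_var = 'GDP per capita growth (annual %)'
endog_vars = ['Government Debt (% of GDP) (Percent of GDP)']
exog_vars = ['Gross capital formation (% of GDP)', 'Gross national expenditure (% of GDP)',
             'Net barter terms of trade index (2015 = 100)', 'Population growth (annual %)',
             'Trade (% of GDP)']
# The instrument must differ from the endogenous regressor. FDI inflows are used here
# to mirror the 2SLS cells at the top of the notebook.
instrument_vars = ['Foreign direct investment net inflows (% of GDP)']

# Coerce the model columns to numeric and drop incomplete rows explicitly
iv_data = data[[dep_var] + endog_vars + exog_vars + instrument_vars]
iv_data = iv_data.apply(pd.to_numeric, errors='coerce').dropna()

# Perform IV estimation using IV-GMM
iv_model = IVGMM(dependent=iv_data[dep_var],
                 exog=iv_data[exog_vars],
                 endog=iv_data[endog_vars],
                 instruments=iv_data[instrument_vars])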

# Fit the model
iv_results = iv_model.fit()

# Print estimation results
print(iv_results)


#### <a id='toc1_1_1_1_'></a>[Descriptive Statistics](#toc0_)

In [None]:
def calculate_summary_stats(data):
    # Select the last seven columns and make sure they are numeric
    last_seven_cols = data.iloc[:, -7:].apply(pd.to_numeric, errors='coerce')

    # Basic statistics; the 'count' row already counts non-missing values only
    basic_stats = last_seven_cols.describe()

    # Calculate the percentage of missing values for the last seven columns
    missing_percentage = last_seven_cols.isnull().mean() * 100

    # Add the missing values percentage as a new row to the basic_stats DataFrame
    basic_stats.loc['missing %'] = missing_percentage
    
    # Convert the combined stats to LaTeX format
    latex_output = basic_stats.to_latex()

    # Saving the LaTeX output to a file
    with open('statistics_description.tex', 'w') as f:
        f.write(latex_output)

    # Optionally print the LaTeX output to check
    print(latex_output)

calculate_summary_stats(data)


#### <a id='toc1_1_1_2_'></a>[Visualizations](#toc0_)
These plots may need to be adjusted for individual countries.

In [None]:
def visualize_indicators_grid(country_name):
    """
    Visualizes all indicators from column 3 onwards for the specified country in a grid format.

    Parameters:
    - country_name: The name of the country.
    """
    # Filter the data for the specified country and ensure the 'Year' column is numeric
    country_data = data[data['Country Name'] == country_name].copy()
    country_data['Year'] = pd.to_numeric(country_data['Year'])
    
    # Get all indicators from column 3 onwards
    indicators = data.columns[2:]
    n_indicators = len(indicators)
    
    # Determine the grid size
    rows = int(n_indicators**0.5)
    cols = int(n_indicators / rows) + (n_indicators % rows > 0)
    
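    # squeeze=False always returns a 2D array of axes, so flatten() below also works for a 1x1 grid
    fig, axes = plt.subplots(rows, cols, figsize=(cols*5, rows*3), squeeze=False)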
    fig.suptitle(f'All Indicators for {country_name}', fontsize=16)
    
    # Flatten the axes array for easy indexing
    axes = axes.flatten()
    
    # Iterate through each indicator and plot in the grid
    for i, indicator in enumerate(indicators):
        ax = axes[i]
        country_data[indicator] = pd.to_numeric(country_data[indicator], errors='coerce')
        ax.plot(country_data['Year'], country_data[indicator], marker='o', linestyle='-')
        ax.set_title(indicator, fontsize=10)
        ax.set_xlabel('Year', fontsize=8)
        ax.set_ylabel(indicator, fontsize=8)
        ax.tick_params(axis='x', labelrotation=45)
        ax.tick_params(axis='both', labelsize=8)
        ax.grid(True)
    
    # Hide any unused axes if the number of indicators doesn't fill the last row
    for j in range(i + 1, rows * cols):
        axes[j].axis('off')

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjust the layout to make room for the main title
    plt.show()

# Visualize all indicators for Lesotho in a grid format
visualize_indicators_grid('Lesotho')


In [None]:
def visualize_country_data_fixed(country_name, indicator):
    """
    Visualizes the given indicator over the years for the specified country with fixed axes.

    Parameters:
    - country_name: The name of the country.
    - indicator: The specific indicator to visualize.
    """
    # Filter the data for the specified country and ensure numeric types
    country_data = data[data['Country Name'] == country_name].copy()
    country_data['Year'] = pd.to_numeric(country_data['Year'])
    country_data[indicator] = pd.to_numeric(country_data[indicator], errors='coerce')
    
    # Plot the data
    plt.figure(figsize=(10, 6))
    plt.plot(country_data['Year'], country_data[indicator], marker='o')
    
    # Adding title and labels with improved axes
    plt.title(f'{indicator} in {country_name} over the Years')
    plt.xlabel('Year')
    plt.xticks(country_data['Year'].unique(), rotation=45)  # Ensure all years are shown and rotated for readability
    plt.ylabel(indicator)
    
    # Display the plot
    plt.grid(True)
    plt.tight_layout()  # Adjust layout to not cut off labels
    plt.show()

# Visualize GDP per capita growth for Angola
visualize_country_data_fixed('Angola', 'GDP per capita growth (annual %)')


### <a id='toc1_1_2_'></a>[2. Baseline Model](#toc0_)
- Simple OLS regression
- Building blocks for a dynamic threshold model
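
One building block used further down is the country fixed effect (`entity_effects=True`). The sketch below, on synthetic data, shows why it matters: when country-specific factors are correlated with debt, pooled OLS can even get the sign of the debt coefficient wrong, while the within (demeaned) estimator recovers it. The panel, its column names, and the coefficients are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_countries, n_years = 30, 25

# Synthetic panel (illustrative assumption): country effects correlated with debt
country = np.repeat(np.arange(n_countries), n_years)
alpha = rng.normal(size=n_countries)[country]            # country fixed effect
debt = 50 + 10 * alpha + rng.normal(scale=5, size=country.size)
growth = 3 + alpha - 0.05 * debt + rng.normal(scale=0.5, size=country.size)
panel = pd.DataFrame({"country": country, "debt": debt, "growth": growth})

def ols_slope(y, x):
    """Slope of a simple OLS regression of y on x with an intercept."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Pooled OLS ignores the country effects
pooled_slope = ols_slope(panel["growth"].to_numpy(), panel["debt"].to_numpy())

# Within (fixed-effects) estimator: demean each variable by country first
cols = ["debt", "growth"]
demeaned = panel[cols] - panel.groupby("country")[cols].transform("mean")
within_slope = ols_slope(demeaned["growth"].to_numpy(), demeaned["debt"].to_numpy())

print("Pooled OLS slope:", pooled_slope)
print("Within slope:    ", within_slope)   # true effect in this simulation: -0.05
```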

In [None]:
from linearmodels.iv import IV2SLS

def estimate_iv_model(data, dependent, endog, instrument):
    """
    Estimates an Instrumental Variables (IV) model using the provided data and variables.

    Parameters:
    - data: The DataFrame containing all data.
    - dependent: Name of the dependent variable.
    - endog: Name of the endogenous regressor.
    - instrument: Name of the instrument for the endogenous regressor.

    Returns:
    - The estimated IV model results.
    """
    # Keep the relevant columns, coerce them to numeric, and drop incomplete rows
    cols = [dependent, endog, instrument]
    data_subset = data[cols].apply(pd.to_numeric, errors='coerce').dropna()

    # The only exogenous regressor is a constant; the endogenous regressor
    # must not be passed as `exog` as well
    const = pd.Series(1.0, index=data_subset.index, name='const')

    # Create the IV model
    iv_model = IV2SLS(dependent=data_subset[dependent],
                      exog=const,
                      endog=data_subset[endog],
                      instruments=data_subset[instrument])

    # Fit the model and return the results
    iv_results = iv_model.fit()
    return iv_results

# Estimate the IV model for the specified variables
iv_results = estimate_iv_model(data, 'GDP per capita growth (annual %)', 'GDP per capita, PPP (current international $)', 'GDP per capita, PPP (constant 2017 international $)')
print(iv_results)


In [None]:
from linearmodels.panel import PanelOLS
import numpy as np
import statsmodels.api as sm

# Since we are going to use lagged values, we need to set the dataframe to have a multi-index of Country and Year
data = data.set_index(['Country Name', 'Year'])

# Convert all data columns to numeric, errors='coerce' will set non-convertible values to NaN
data = data.apply(pd.to_numeric, errors='coerce')

# Sort the index to ensure that lagged values are computed correctly
data = data.sort_index()

# Lagged variables as potential instruments
data['Lagged_GDP_growth'] = data.groupby(level=0)['GDP per capita growth (annual %)'].shift(1)
data['Lagged_Government_Debt'] = data.groupby(level=0)['Government Debt (% of GDP) (Percent of GDP)'].shift(1)
data['Lagged_Terms_of_Trade'] = data.groupby(level=0)['Net barter terms of trade index (2015 = 100)'].shift(1)
data['Lagged_Trade'] = data.groupby(level=0)['Trade (% of GDP)'].shift(1)

# Drop rows with NaN values that were a result of lagging
data = data.dropna()

# Define the endogenous variable and instruments
endog = data['Government Debt (% of GDP) (Percent of GDP)']
exog = sm.add_constant(data[['Lagged_GDP_growth', 'Lagged_Government_Debt', 'Lagged_Terms_of_Trade', 'Lagged_Trade']])
instr = data[['Lagged_GDP_growth', 'Lagged_Government_Debt', 'Lagged_Terms_of_Trade', 'Lagged_Trade']]

# First-stage regression using PanelOLS (this is a simple OLS, as a placeholder for actual 2SLS that will be done next)
# We are regressing the endogenous variable on the instruments
mod = PanelOLS(endog, exog, entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)

# Let's display the results of the first-stage regression
res.summary.tables[1]
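
The lagged instruments above depend on `groupby(level=0).shift(1)`, which shifts within each country. A plain `shift(1)` would instead carry the last observation of one country into the first year of the next. A toy illustration with hypothetical values:

```python
import pandas as pd

# Toy panel (hypothetical values) indexed like the notebook's data
toy = pd.DataFrame({
    "Country Name": ["Angola"] * 3 + ["Lesotho"] * 3,
    "Year": [2000, 2001, 2002] * 2,
    "debt": [10.0, 20.0, 30.0, 70.0, 80.0, 90.0],
}).set_index(["Country Name", "Year"]).sort_index()

# Lag within each country: the first year of every country has no lag
toy["lag_by_country"] = toy.groupby(level=0)["debt"].shift(1)

# Naive lag: Lesotho's first year wrongly receives Angola's last value
toy["lag_naive"] = toy["debt"].shift(1)

print(toy)
```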


### <a id='toc1_1_3_'></a>[3. Nonlinearity Check](#toc0_)
- Exogeneity check
- Multicollinearity check
- Heteroskedasticity check
- Autocorrelation check
- Nonlinearity check

Tests used: 
1. Durbin-Watson (autocorrelation)
2. Variance inflation factor, VIF (multicollinearity)
3. White test (heteroskedasticity)
4. Breusch-Pagan test (heteroskedasticity)
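
To make the four tests concrete, the sketch below computes each statistic by hand with NumPy on synthetic data. It is only an illustration: the helper names and the data are assumptions, and the Breusch-Pagan statistic is the studentized (Koenker) variant. `statsmodels` ships ready-made versions (`durbin_watson`, `variance_inflation_factor`, `het_white`, `het_breuschpagan`) that also return p-values.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients, residuals and R-squared (X must include a constant)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, resid, r2

def durbin_watson_stat(resid):
    """Durbin-Watson statistic: close to 2 when residuals are not autocorrelated."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def vif(X_no_const):
    """Variance inflation factor of each column: 1 / (1 - R^2_j)."""
    n, k = X_no_const.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X_no_const, j, axis=1)])
        out.append(1.0 / (1.0 - ols(X_no_const[:, j], others)[2]))
    return np.array(out)

def breusch_pagan_lm(resid, X):
    """Studentized Breusch-Pagan LM statistic: n * R^2 of resid^2 on X."""
    return len(resid) * ols(resid ** 2, X)[2]

def white_lm(resid, X_no_const):
    """White LM statistic: n * R^2 of resid^2 on levels, squares and cross products."""
    n, k = X_no_const.shape
    cols = [np.ones(n)] + [X_no_const[:, j] for j in range(k)]
    cols += [X_no_const[:, i] * X_no_const[:, j] for i in range(k) for j in range(i, k)]
    return n * ols(resid ** 2, np.column_stack(cols))[2]

# Synthetic example (illustrative assumption): collinear regressors and
# an error variance that grows with x1
rng = np.random.default_rng(2)
n = 1000
x1 = rng.uniform(1, 5, size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)
y = 1 + 0.5 * x1 - 0.2 * x2 + rng.normal(size=n) * x1
X_nc = np.column_stack([x1, x2])
X = np.column_stack([np.ones(n), X_nc])
_, resid, _ = ols(y, X)

dw = durbin_watson_stat(resid)
vifs = vif(X_nc)
bp = breusch_pagan_lm(resid, X)
white = white_lm(resid, X_nc)

print("Durbin-Watson:", dw)
print("VIF:", vifs)
print("Breusch-Pagan LM:", bp, "(5% critical value, chi2 with 2 df: 5.99)")
print("White LM:", white, "(5% critical value, chi2 with 5 df: 11.07)")
```

Note that the Durbin-Watson statistic assumes an ordering of the residuals; in a panel it is only meaningful within each country's time series.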

In [None]:
import matplotlib.pyplot as plt

# `res` is the fitted first-stage PanelOLS model from the baseline section above

# plot the residuals
plt.figure(figsize=(10, 6))
plt.scatter(data.index.get_level_values(1), res.resids)
plt.title('Residuals of the First-Stage Regression')
plt.xlabel('Year')
plt.ylabel('Residuals')
plt.grid(True)
plt.tight_layout()
plt.show()



In [None]:
# `res` holds the fitted first-stage model from the baseline section

# Second-stage regression of a manual 2SLS
# The dependent variable (GDP growth) is regressed on the predicted values of
# government debt from the first-stage regression
# Standard errors are clustered at the entity level; because the fitted values are
# generated regressors, these are not valid 2SLS standard errors
exog_2sls = sm.add_constant(res.fitted_values)
mod_2sls = PanelOLS(data['GDP per capita growth (annual %)'], exog_2sls, entity_effects=True)
res_2sls = mod_2sls.fit(cov_type='clustered', cluster_entity=True)

# Display the results of the second-stage regression
res_2sls.summary


### <a id='toc1_1_4_'></a>[4. Dynamic Threshold Model](#toc0_)
- Building the model
- Estimation
- Interpretation

First define a grid of candidate threshold values. Then estimate the model for each candidate and keep the best-fitting one.
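
The notebook does not spell out the specification. One common way to write a dynamic panel threshold model, assumed here for illustration only, uses growth $y_{it}$, the debt ratio $d_{it}$ as both regressor and threshold variable, a threshold $\gamma$, controls $x_{it}$, and a country effect $\mu_i$:

$$
y_{it} = \mu_i + \phi\, y_{i,t-1} + \beta_1\, d_{it}\,\mathbb{1}(d_{it} \le \gamma) + \beta_2\, d_{it}\,\mathbb{1}(d_{it} > \gamma) + \theta' x_{it} + \varepsilon_{it}
$$

Here $\beta_1$ and $\beta_2$ are the debt effects below and above the threshold. For a fixed $\gamma$ the model is linear in the remaining parameters, so $\gamma$ can be estimated by searching over candidate values and keeping the one with the smallest sum of squared residuals. Because the lagged dependent variable $y_{i,t-1}$ is correlated with the country effect, dynamic versions are usually estimated with instrumental-variable or GMM methods instead of plain OLS.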

In [None]:
potential_threshold = 60  

In [None]:
import numpy as np
import statsmodels.formula.api as smf

# NOTE: the column names below (Debt_Ratio_to_GDP, Real_GDP_Growth, ...) are placeholders
# and must be mapped to the columns of the dataset loaded above
formula = 'Real_GDP_Growth ~ Debt_Ratio_to_GDP + Trade_openess + Inflation + Government_Effectiveness'
min_obs = 30  # assumed minimum number of observations required in each regime

def split_data(data, threshold):
    """Split the sample into the regimes at or below and above the debt threshold."""
    data_below = data[data['Debt_Ratio_to_GDP'] <= threshold]
    data_above = data[data['Debt_Ratio_to_GDP'] > threshold]
    return data_below, data_above

threshold_values = np.arange(20, 90, 0.1)  # Range to explore
best_ssr = np.inf
best_threshold = None

for threshold in threshold_values:
    data_below, data_above = split_data(data, threshold)

    # Skip thresholds that leave too few observations in either regime
    if len(data_below) < min_obs or len(data_above) < min_obs:
        continue

    model_below = smf.ols(formula=formula, data=data_below).fit()
    model_above = smf.ols(formula=formula, data=data_above).fit()

    # Minimize the combined sum of squared residuals; a sum of R-squared values
    # is not comparable across different sample splits
    current_ssr = model_below.ssr + model_above.ssr

    if current_ssr < best_ssr:
        best_ssr = current_ssr
        best_threshold = threshold

print(f"Best Threshold: {best_threshold}")
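
A quick way to sanity-check a threshold grid search is to run it on simulated data with a known break. The sketch below does this with NumPy only, selecting the candidate that minimizes the combined sum of squared residuals of the two regime regressions. The data-generating process, the helper names, and the `min_obs` trimming value are assumptions for illustration.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

def grid_search_threshold(y, debt, grid, min_obs=30):
    """Return the candidate threshold with the smallest combined SSR."""
    best_gamma, best_total = None, np.inf
    for gamma in grid:
        below = debt <= gamma
        # Skip candidates that leave too few observations in either regime
        if below.sum() < min_obs or (~below).sum() < min_obs:
            continue
        total = 0.0
        for mask in (below, ~below):
            X = np.column_stack([np.ones(mask.sum()), debt[mask]])
            total += ssr(y[mask], X)
        if total < best_total:
            best_gamma, best_total = gamma, total
    return best_gamma, best_total

# Simulated data (illustrative assumption): the debt effect turns negative above 60
rng = np.random.default_rng(3)
n = 1500
debt = rng.uniform(10, 120, size=n)
true_gamma = 60.0
slope = np.where(debt <= true_gamma, 0.02, -0.05)
growth = 2.0 + slope * debt + rng.normal(scale=0.5, size=n)

best_gamma, best_total = grid_search_threshold(growth, debt, np.arange(20.0, 100.0, 0.5))
print("Estimated threshold:", best_gamma)
```

If the search does not recover the simulated break, the problem lies in the search code and not in the data. Real data will show a much flatter objective than this simulation, so the estimated threshold should be reported with a confidence interval.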


