# Analysis of Growth Determinants in BRICS Countries

This notebook performs an exploratory data analysis and prepares a dataset to investigate the determinants of economic growth in BRICS countries (Brazil, Russia, India, China, and South Africa). The analysis includes:

1.  **Data Loading and Inspection**: Loading the dataset containing macroeconomic variables for the selected countries and years.
2.  **Stationarity Testing**: Assessing the stationarity of the time series data for each variable and country using the Augmented Dickey-Fuller (ADF) test.
3.  **Data Preparation**: Transforming non-stationary variables through differencing to ensure suitability for panel data regression models.
4.  **Panel Data Modeling (Next Steps)**: Preparing the transformed data for Fixed Effects (FE) and Random Effects (RE) models, followed by a Hausman test to select the appropriate model.

The dataset used in this analysis was sourced from [mention your data source here, e.g., World Bank Data].

## Import Statements and Data Upload


In [3]:
# Install required packages (Colab only)
!pip install linearmodels arch openpyxl wbdata --quiet

# Core data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Econometrics: unit-root tests, VIF, and panel estimators
from arch.unitroot import ADF
from linearmodels.panel import PanelOLS, RandomEffects
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tsa.stattools import adfuller  # used for the stationarity tests below

# Colab-only helper for uploading the CSV files
from google.colab import files



df_final = pd.read_csv('df_growth.csv')
print("Data Preview:", df_final.head(5))


print("Data Types: ", df_final.dtypes)

Data Preview:   country  year       gdp_pc       fdi       gfcf  inflation      trade  \
0  Brazil  1995  6596.335727  0.631586  20.286298  66.007034  16.984460   
1  Brazil  1996  6640.727007  1.475965  18.640654  15.757666  15.635591   
2  Brazil  1997  6764.858421  2.150453  19.122901   6.926713  16.576209   
3  Brazil  1998  6687.495236  3.340888  18.542348   3.195076  16.438585   
4  Brazil  1999  6621.636762  4.733770  17.016294   4.858447  20.982166   

     domcred  gdp_growth  
0  43.494525    2.632056  
1  40.778494    0.672969  
2  40.852237    1.869244  
3  29.532261   -1.143604  
4  29.826910   -0.984800  
Data Types:  country        object
year            int64
gdp_pc        float64
fdi           float64
gfcf          float64
inflation     float64
trade         float64
domcred       float64
gdp_growth    float64
dtype: object


## Merging 1994 Data

In [7]:
df_1994 = pd.read_csv('BRICS_1994_data.csv')
df_1994.head(5)

,Country Name,Country Code,Series Name,Series Code,1994 [YR1994]
0,Brazil,BRA,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,6427.16901943704
1,Brazil,BRA,GDP per capita growth (annual %),NY.GDP.PCAP.KD.ZG,4.21094030332965
2,Brazil,BRA,"Foreign direct investment, net inflows (% of GDP)",BX.KLT.DINV.WD.GD.ZS,0.584731354072149
3,Brazil,BRA,Gross fixed capital formation (% of GDP),NE.GDI.FTOT.ZS,20.748084535259
4,Brazil,BRA,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,2075.88839757325


## Formatting the Dataframe to Long Format

In [20]:
# df_1994 holds the 1994 extract: one row per country and series, one column per year
# Step 1: Melt year columns into long format
df_melted = df_1994.melt(
    id_vars=["Country Name", "Country Code", "Series Name", "Series Code"],
    var_name="Year",
    value_name="Value"
)

# Step 2: Clean up the Year column (e.g., "1994 [YR1994]" → 1994)
df_melted["Year"] = df_melted["Year"].str.extract(r"(\d{4})", expand=False).astype(int)
df_melted["Value"] = pd.to_numeric(df_melted["Value"], errors="coerce")

# Step 3: Pivot to wide format
df_wide = df_melted.pivot_table(
    index=["Country Name", "Country Code", "Year"],
    columns="Series Name",
    values="Value",
    aggfunc="first"   # or "mean" if you expect duplicates
).reset_index()

# Step 4: Rename, drop the redundant country code, and sort by country
df_wide = df_wide.rename(columns={"Country Name": "Country"})
df_wide = df_wide.drop(columns=["Country Code"]).sort_values(by=["Country"])
df_wide

Series Name,Country,Year,Domestic credit to private sector (% of GDP),"Foreign direct investment, net inflows (% of GDP)",GDP per capita (constant 2015 US$),GDP per capita growth (annual %),Gross fixed capital formation (% of GDP),"Inflation, consumer prices (annual %)",Trade (% of GDP)
0,Brazil,1994,69.523199,0.584731,6427.169019,4.21094,20.748085,2075.888398,18.67476
1,China,1994,85.487555,5.959647,1406.858457,11.801888,34.694508,24.25699,35.605464
2,India,1994,23.647341,0.297387,588.727892,4.500575,23.358888,10.247936,20.078144
3,Russian Federation,1994,,0.174541,5071.742676,-12.53979,21.810888,307.722636,50.953713
4,South Africa,1994,101.058603,0.243895,4258.261847,1.541959,14.988913,8.938564,37.114609
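
The melt-then-pivot reshaping above can be tried on a toy frame. This is a minimal sketch, not the notebook's data: the countries `A` and `B`, the values, and the use of `".."` as a missing-value marker are all assumptions made for illustration.

```python
import pandas as pd

# Toy frame in a "one row per country and series" layout (all values are made up).
raw = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Series Name": ["Trade (% of GDP)", "Inflation", "Trade (% of GDP)", "Inflation"],
    "1994 [YR1994]": ["10.5", "3.2", "..", "7.1"],  # ".." is an assumed missing marker
})

# Wide years -> long rows, then clean the year label and coerce values to numbers.
long = raw.melt(id_vars=["Country Name", "Series Name"], var_name="Year", value_name="Value")
long["Year"] = long["Year"].str.extract(r"(\d{4})", expand=False).astype(int)
long["Value"] = pd.to_numeric(long["Value"], errors="coerce")  # ".." becomes NaN

# Long rows -> one column per series, one row per country and year.
wide = long.pivot_table(
    index=["Country Name", "Year"],
    columns="Series Name",
    values="Value",
    aggfunc="first",
).reset_index()
wide.columns.name = None  # drop the leftover "Series Name" axis label

print(wide)
```

A value that cannot be parsed ends up as `NaN` in the wide table, which is how a gap such as the missing 1994 `Domestic credit` entry for the Russian Federation surfaces after the pivot.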


## Data Cleaning


In [39]:
# Map long names to short ones used in the script:
rename_map = {
    'Country': 'country',
    'Year': 'year',
    'Foreign direct investment, net inflows (% of GDP)': 'fdi',
    'GDP per capita (constant 2015 US$)': 'gdp_pc',      # level series; growth comes from 'gdp_growth' below
    'Gross fixed capital formation (% of GDP)': 'gfcf',
    'Inflation, consumer prices (annual %)': 'inflation',
    'Trade (% of GDP)': 'trade',
    'Domestic credit to private sector (% of GDP)': 'domcred',
    'GDP per capita growth (annual %)': 'gdp_growth'
}

# Rename the columns
df_wide.rename(columns=rename_map, inplace=True)

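# Keep only the modelling columns; .copy() avoids SettingWithCopyWarning on later assignments
df_wide = df_wide[['country', 'year', 'gdp_pc', 'fdi', 'gfcf',
                   'inflation', 'trade', 'domcred', 'gdp_growth']].copy()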

df_wide.head(5)

#  Type conversions & cleaning

# Ensure year is integer and country is string
df_wide['year'] = df_wide['year'].astype(int)
df_wide['country'] = df_wide['country'].astype(str)

# Convert numeric-like columns to numeric; entries that cannot be parsed become NaN
num_cols = ['gdp_pc', 'fdi', 'gfcf', 'inflation', 'trade', 'domcred']
for col in num_cols:
    if col in df_wide.columns:
        df_wide[col] = pd.to_numeric(df_wide[col], errors='coerce')

# Sort by country and year (interpolation is applied later, once the 1994 rows are merged)
df_wide = df_wide.sort_values(['country', 'year'])
df_wide.head(5)




Series Name,country,year,gdp_pc,fdi,gfcf,inflation,trade,domcred,gdp_growth
0,Brazil,1994,6427.169019,0.584731,20.748085,2075.888398,18.67476,69.523199,4.21094
1,China,1994,1406.858457,5.959647,34.694508,24.25699,35.605464,85.487555,11.801888
2,India,1994,588.727892,0.297387,23.358888,10.247936,20.078144,23.647341,4.500575
3,Russian Federation,1994,5071.742676,0.174541,21.810888,307.722636,50.953713,,-12.53979
4,South Africa,1994,4258.261847,0.243895,14.988913,8.938564,37.114609,101.058603,1.541959


In [41]:
# Adding 1994 data
df_combined = pd.concat([df_wide, df_final], ignore_index=True)
df_combined.sort_values(by=['country', 'year'], inplace=True)
df_combined.head(5)

,country,year,gdp_pc,fdi,gfcf,inflation,trade,domcred,gdp_growth
0,Brazil,1994,6427.169019,0.584731,20.748085,2075.888398,18.67476,69.523199,4.21094
5,Brazil,1995,6596.335727,0.631586,20.286298,66.007034,16.98446,43.494525,2.632056
6,Brazil,1996,6640.727007,1.475965,18.640654,15.757666,15.635591,40.778494,0.672969
7,Brazil,1997,6764.858421,2.150453,19.122901,6.926713,16.576209,40.852237,1.869244
8,Brazil,1998,6687.495236,3.340888,18.542348,3.195076,16.438585,29.532261,-1.143604
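
After stacking two extracts like this, it can be worth confirming that the country labels match and that no country-year pair appears twice. The following is a hedged sketch on made-up frames; `early` and `later` are hypothetical stand-ins for `df_wide` and `df_final`.

```python
import pandas as pd

# Made-up stand-ins for the 1994 extract and the main panel.
early = pd.DataFrame({"country": ["A", "B"], "year": [1994, 1994], "trade": [10.0, 30.0]})
later = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1995, 1996, 1995, 1996],
    "trade": [12.0, 15.0, 29.0, 33.0],
})

combined = (
    pd.concat([early, later], ignore_index=True)
    .sort_values(["country", "year"])
    .reset_index(drop=True)
)

# Country labels must be spelled identically in both sources, or the panel splits.
has_same_names = set(early["country"]) == set(later["country"])

# Each (country, year) pair should occur exactly once.
dup_count = int(combined.duplicated(["country", "year"]).sum())

# Number of distinct years per country; equal counts indicate a balanced panel.
years_per_country = combined.groupby("country")["year"].nunique()

print(has_same_names, dup_count)
print(years_per_country)
```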


In [43]:
# Type conversions & cleaning
# ---------------------------
# Ensure year is integer and country is string
df_combined['year'] = df_combined['year'].astype(int)
df_combined['country'] = df_combined['country'].astype(str)

# Convert numeric-like columns to numeric; entries that cannot be parsed become NaN
num_cols = ['gdp_pc', 'fdi', 'gfcf', 'inflation', 'trade', 'domcred']
for col in num_cols:
    if col in df_combined.columns:
        df_combined[col] = pd.to_numeric(df_combined[col], errors='coerce')

# Interpolate missing values within each country along the time axis.
# limit_direction='both' also fills gaps at the start or end of a series with the
# nearest observed value (this is how Russia's missing 1994 domcred gets filled).
df_combined = df_combined.sort_values(['country', 'year'])
df_combined[num_cols] = df_combined.groupby('country')[num_cols].transform(
    lambda g: g.interpolate(method='linear', limit_direction='both')
)

# Confirm that no missing values remain
df_combined.isna().sum()

country       0
year          0
gdp_pc        0
fdi           0
gfcf          0
inflation     0
trade         0
domcred       0
gdp_growth    0
dtype: int64
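
To see what the within-country interpolation does at the edges of a series, here is a small sketch on invented numbers (the countries `A` and `B` and the `domcred` values are assumptions, not data from this notebook).

```python
import numpy as np
import pandas as pd

# Toy panel with a missing first year for A and interior gaps for both countries.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B"],
    "year": [1994, 1995, 1996, 1997, 1994, 1995, 1996],
    "domcred": [np.nan, 20.0, np.nan, 30.0, 5.0, np.nan, 9.0],
}).sort_values(["country", "year"])

# Interpolate inside each country so values never leak across countries.
toy["domcred_filled"] = toy.groupby("country")["domcred"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)

print(toy)
```

Interior gaps are filled on a straight line between neighbours, while the leading gap for `A` simply copies the first observed value. A filled edge value therefore produces a first difference of exactly zero in the following year, which is worth keeping in mind once the series are differenced.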


## Stationarity Tests

In [44]:


# List of numeric variables to test
variables = ['gdp_pc', 'fdi', 'gfcf', 'inflation', 'trade', 'domcred', 'gdp_growth']

# Create a results list
results = []

# Loop through variables and countries
for var in variables:
    for country in df_combined['country'].unique():
        series = df_combined.loc[df_combined['country'] == country, var].dropna()

        if len(series) > 1:  # Skip empty or single-observation series (ADF needs far more points to be reliable)
            adf_result = adfuller(series, autolag='AIC')
            results.append({
                'Variable': var,
                'Country': country,
                'ADF Statistic': adf_result[0],
                'p-value': adf_result[1],
                'Stationary?': 'Yes' if adf_result[1] < 0.05 else 'No'
            })

# Convert to DataFrame
results_df = pd.DataFrame(results)

# Summary: % of countries where each variable is stationary
summary_df = results_df.groupby('Variable')['Stationary?'].apply(lambda x: (x == 'Yes').mean() * 100).reset_index()
summary_df.columns = ['Variable', '% Stationary (ADF)']

# Show results
print("Detailed results:")
print(results_df)

print("\nSummary of stationarity by variable:")
print(summary_df)


Detailed results:
      Variable             Country  ADF Statistic       p-value Stationary?
0       gdp_pc              Brazil      -0.629364  8.642079e-01          No
1       gdp_pc               China       4.108644  1.000000e+00          No
2       gdp_pc               India       2.686389  9.990862e-01          No
3       gdp_pc  Russian Federation      -0.209091  9.374293e-01          No
4       gdp_pc        South Africa      -2.151552  2.242966e-01          No
5          fdi              Brazil      -1.372232  5.955138e-01          No
6          fdi               China      -1.114708  7.091712e-01          No
7          fdi               India      -2.368727  1.507581e-01          No
8          fdi  Russian Federation      -1.396121  5.841342e-01          No
9          fdi        South Africa      -5.258504  6.670178e-06         Yes
10        gfcf              Brazil      -2.502054  1.149753e-01          No
11        gfcf               China      -1.851216  3.553466e-01          No
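
For reference, the regression behind the ADF test with a constant and no trend (the default `regression='c'` in `adfuller`) can be written as follows. The symbols are introduced here for illustration: $y_t$ is one variable for one country, $p$ is the lag length chosen by `autolag='AIC'`, and $\varepsilon_t$ is the error term.

```latex
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad H_0:\ \gamma = 0 \ \text{(unit root)} \quad \text{vs.} \quad H_1:\ \gamma < 0
```

A series is labelled `Stationary? = Yes` in the table above when the p-value for $H_0$ is below 0.05, that is, when the unit-root null is rejected at the 5% level.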

## Data Preparation for Modeling: Differencing the Non-Stationary Variables

In [49]:


# --- Step 1: Inputs ---
# df_combined: cleaned panel dataset with columns country, year, and all variables
# results_df: ADF results from the stationarity tests above
# The 'Stationary?' column must contain exactly 'Yes' or 'No'

# Drop GDP per capita from the ADF results; it is not transformed and stays in levels
adf_filtered = results_df[results_df["Variable"] != "gdp_pc"]

# --- Step 2: Build a transformation map ---
# Map: { (variable, country) : True if stationary, False if not }
stationarity_map = {
    (row["Variable"], row["Country"]): (row["Stationary?"] == "Yes")
    for _, row in adf_filtered.iterrows()
}

# --- Step 3: Transform the dataset ---
df = df_combined.sort_values(["country", "year"]).copy()

transformed_df = []
for country, group in df.groupby("country"):
    group = group.copy()
    for var in adf_filtered["Variable"].unique():
        if (var, country) in stationarity_map:
            if stationarity_map[(var, country)]:
                # Keep in levels if stationary
                pass  # leave as is
            else:
                # Difference if non-stationary
                group[var] = group[var].diff()
    transformed_df.append(group)

df_transformed = pd.concat(transformed_df)

# --- Step 4: Drop rows left as NaN by differencing (the first year of each differenced series) ---
df_model = df_transformed.dropna()

df_model.head(5)

# df_model is now ready for FE/RE regression


,country,year,gdp_pc,fdi,gfcf,inflation,trade,domcred,gdp_growth
5,Brazil,1995,6596.335727,0.046855,-0.461787,66.007034,-1.6903,-26.028675,2.632056
6,Brazil,1996,6640.727007,0.844379,-1.645644,15.757666,-1.348869,-2.716031,0.672969
7,Brazil,1997,6764.858421,0.674488,0.482247,6.926713,0.940618,0.073743,1.869244
8,Brazil,1998,6687.495236,1.190435,-0.580553,3.195076,-0.137624,-11.319977,-1.143604
9,Brazil,1999,6621.636762,1.392882,-1.526054,4.858447,4.543582,0.29465,-0.9848
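
The selective differencing above can be illustrated on a toy panel. This is a sketch under stated assumptions: the countries, values, and the `is_stationary` verdicts are invented and only mimic the shape of `stationarity_map`.

```python
import pandas as pd

# Toy panel, already sorted by country and year (all values are made up).
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1994, 1995, 1996, 1994, 1995, 1996],
    "trade": [10.0, 12.0, 15.0, 30.0, 29.0, 33.0],
    "inflation": [5.0, 4.0, 6.0, 8.0, 7.0, 9.0],
}).sort_values(["country", "year"])

# Hypothetical ADF verdicts: True means "stationary, keep in levels".
is_stationary = {
    ("trade", "A"): False,
    ("trade", "B"): True,
    ("inflation", "A"): True,
    ("inflation", "B"): True,
}

out = toy.copy()
for (var, country), stationary in is_stationary.items():
    if not stationary:
        mask = out["country"] == country
        # First-difference only this country's series; its first year becomes NaN.
        out.loc[mask, var] = out.loc[mask, var].diff()

# Dropping NaN rows removes the first year of every country that had a differenced series.
model = out.dropna().reset_index(drop=True)
print(model)
```

In this toy output the `trade` column holds year-on-year changes for `A` but levels for `B`, and `A` loses its 1994 row while `B` keeps it. The same mixing occurs in `df_model`, so a pooled coefficient on such a column combines two different scales.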


## Conclusion and Next Steps

Stationarity, assessed with the Augmented Dickey-Fuller (ADF) test at the 5% significance level, varies across countries and variables. For GDP per capita, trade, and domestic credit, the test fails to reject a unit root in any of the five countries. FDI and GFCF are each stationary in one of the five countries (20%), inflation in three (60%), and GDP growth in four (80%).

To prepare the data for panel data modeling, each variable was first-differenced in the countries where it tested non-stationary and kept in levels elsewhere. GDP per capita was excluded from this step and remains in levels. After the rows lost to differencing were dropped, the resulting `df_model` DataFrame contains the transformed data, ready for further analysis.

The next steps will involve performing Fixed Effects (FE) and Random Effects (RE) regressions using the `df_model` dataset to analyze the relationships between the variables. Following the regressions, a Hausman test will be conducted to determine the most appropriate model (FE or RE) for this dataset.
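
As a rough guide to the final step, the Hausman statistic can be computed from the coefficient vectors and covariance matrices of the two fitted models. The sketch below is one possible implementation, not part of this notebook: the `hausman` helper is hypothetical, the intercept name `const` is an assumption, and the numbers are invented purely to exercise the function. With `linearmodels`, the inputs would come from the `params` and `cov` attributes of the fitted `PanelOLS` and `RandomEffects` results.

```python
import numpy as np
import pandas as pd
from scipy import stats


def hausman(b_fe, b_re, cov_fe, cov_re):
    """Hausman statistic comparing FE and RE slope estimates.

    b_fe, b_re: pandas Series of coefficients indexed by regressor name.
    cov_fe, cov_re: pandas DataFrames with the matching covariance matrices.
    """
    # Compare only slopes present in both models; "const" is an assumed intercept name.
    common = [name for name in b_fe.index if name in b_re.index and name != "const"]
    diff = (b_fe[common] - b_re[common]).to_numpy()
    v_diff = (cov_fe.loc[common, common] - cov_re.loc[common, common]).to_numpy()
    # pinv guards against a numerically singular difference matrix.
    stat = float(diff @ np.linalg.pinv(v_diff) @ diff)
    dof = len(common)
    p_value = float(stats.chi2.sf(stat, dof))
    return stat, dof, p_value


# Made-up inputs; these are NOT results from this dataset.
names = ["fdi", "gfcf"]
b_fe = pd.Series([0.50, 0.20], index=names)
b_re = pd.Series([0.40, 0.25], index=names)
cov_fe = pd.DataFrame(np.diag([0.010, 0.008]), index=names, columns=names)
cov_re = pd.DataFrame(np.diag([0.006, 0.005]), index=names, columns=names)

stat, dof, p_value = hausman(b_fe, b_re, cov_fe, cov_re)
print(f"Hausman chi2({dof}) = {stat:.3f}, p = {p_value:.3f}")
```

A small p-value suggests the RE assumption (effects uncorrelated with the regressors) is doubtful and favours FE; a large one gives no evidence against RE. In finite samples the covariance difference may fail to be positive definite, in which case the statistic should be read with caution.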