In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Load the QoG Basic time-series dataset (January 2024) from the web; pd.read_excel needs the openpyxl package for .xlsx files
url = 'https://www.qogdata.pol.gu.se/data/qog_bas_ts_jan24.xlsx'
df = pd.read_excel(url)

In [2]:
df.head()

Unnamed: 0,ccode,cname,year,ccode_qog,cname_qog,ccodealp,ccodecow,version,cname_year,ccodealp_year,...,wdi_trade,wdi_unempfilo,wdi_unempilo,wdi_unempmilo,wdi_unempyfilo,wdi_unempyilo,wdi_unempymilo,wdi_wip,who_sanittot,whr_hap
0,4,Afghanistan,1946,4,Afghanistan,AFG,700.0,QoGBasTSjan24,Afghanistan 1946,AFG46,...,,,,,,,,,,
1,4,Afghanistan,1947,4,Afghanistan,AFG,700.0,QoGBasTSjan24,Afghanistan 1947,AFG47,...,,,,,,,,,,
2,4,Afghanistan,1948,4,Afghanistan,AFG,700.0,QoGBasTSjan24,Afghanistan 1948,AFG48,...,,,,,,,,,,
3,4,Afghanistan,1949,4,Afghanistan,AFG,700.0,QoGBasTSjan24,Afghanistan 1949,AFG49,...,,,,,,,,,,
4,4,Afghanistan,1950,4,Afghanistan,AFG,700.0,QoGBasTSjan24,Afghanistan 1950,AFG50,...,,,,,,,,,,


In [11]:
print('Q1')
print(df.columns)
# Check the description of the dependent variable (Life Expectancy)
print("\nSummary Statistics for Life Expectancy (wdi_lifexp):")
print(df[['wdi_lifexp']].describe())



Q1
Index(['ccode', 'cname', 'year', 'ccode_qog', 'cname_qog', 'ccodealp',
       'ccodecow', 'version', 'cname_year', 'ccodealp_year',
       ...
       'wdi_trade', 'wdi_unempfilo', 'wdi_unempilo', 'wdi_unempmilo',
       'wdi_unempyfilo', 'wdi_unempyilo', 'wdi_unempymilo', 'wdi_wip',
       'who_sanittot', 'whr_hap'],
      dtype='object', length=251)

Summary Statistics for Life Expectancy (wdi_lifexp):
         wdi_lifexp
count  10045.000000
mean      64.562192
std       11.230050
min       11.995000
25%       56.782000
50%       67.230000
75%       73.102000
max       84.560000


The 10,045 non-missing observations show that life expectancy is available for a large share of the country-year rows in the dataset.

The lowest life expectancy is 12.0 years (11.995), an extreme value that more plausibly reflects a crisis year, such as war, famine, or an epidemic, than ordinary conditions in a poor country. The highest life expectancy is 84.56 years, likely in a highly developed country with advanced healthcare and living standards.

The average life expectancy is approximately 64.56 years. Because the observations are country-years spanning several decades, this figure averages over time as well as across countries and does not describe the world in any single year.

The standard deviation of 11.23 years shows moderate variability, with some observations significantly above or below the mean.

25% of the observations have a life expectancy below 56.78 years, reflecting lower-performing health systems. The median life expectancy is 67.23 years, meaning half of the country-year observations lie above this value. 25% of the observations have a life expectancy above 73.10 years, typically in more developed countries with robust healthcare systems.
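To see which country and year produce the extremes discussed above, one option is to look up the rows at the minimum and maximum. The sketch below uses a small hypothetical DataFrame named toy with made-up values; only the column names cname, year, and wdi_lifexp come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for the QoG data; values are illustrative only.
toy = pd.DataFrame({
    "cname": ["A", "A", "B", "B"],
    "year": [1990, 1991, 1990, 1991],
    "wdi_lifexp": [61.2, 30.5, 74.0, None],
})

# idxmin/idxmax skip missing values and return the row label of the extreme value.
lowest = toy.loc[toy["wdi_lifexp"].idxmin(), ["cname", "year", "wdi_lifexp"]]
highest = toy.loc[toy["wdi_lifexp"].idxmax(), ["cname", "year", "wdi_lifexp"]]

print(lowest.to_dict())
print(highest.to_dict())
```

Applied to df instead of toy, the same two lookups would show whether the 12.0-year minimum belongs to a known crisis year.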


In [12]:
# Check the description of independent variables
print("\nSummary Statistics for Healthcare Expenditure (wdi_expedu):")
print(df[['wdi_expedu']].describe())




Summary Statistics for Healthcare Expenditure (wdi_expedu):
        wdi_expedu
count  4731.000000
mean      4.368319
std       1.954843
min       0.000000
25%       3.072646
50%       4.224880
75%       5.387615
max      44.333981


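The 4,731 non-missing observations are fewer than half of the 10,045 available for life expectancy, which will limit the sample size in our analysis. The variable name wdi_expedu suggests education rather than health expenditure, so the label used here should be confirmed against the QoG codebook.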

The average healthcare expenditure is approximately 4.37% of GDP, suggesting that, on average, countries allocate a modest portion of their GDP to healthcare.

The standard deviation of 1.95 percentage points reflects variability in how much countries spend on healthcare. Some observations are significantly above or below the average.

The minimum expenditure is 0.0%, which may indicate spending that is unreported or negligible relative to GDP rather than a true zero. The maximum expenditure is 44.33%, roughly ten times the mean, and most likely reflects an outlier or a data error. 25% of the observations are below 3.07% of GDP, indicating low investment in public health systems. The median healthcare expenditure is 4.22%, meaning half of the observations are below this amount and half are above. 25% of the observations are above 5.39% of GDP, typically countries with strong public healthcare funding.
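A quick way to check whether the maximum is an outlier is Tukey's interquartile-range rule. The sketch below uses only the quartiles and maximum printed in the summary above; the 1.5 multiplier is the usual convention, not something taken from the dataset.

```python
# Quartiles and maximum copied from the describe() output above.
q1, q3 = 3.072646, 5.387615
max_value = 44.333981

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # Tukey's rule; 1.5 is a conventional multiplier
lower_fence = q1 - 1.5 * iqr

print(f"IQR = {iqr:.3f}, fences = [{lower_fence:.3f}, {upper_fence:.3f}]")
print("maximum flagged as outlier:", max_value > upper_fence)
```

The upper fence is about 8.86% of GDP, so the maximum of 44.33% lies far outside it.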

In [13]:
# Check the summary statistics for sanitation
print("\nSummary Statistics for Sanitation Total (who_sanittot):")
print(df[['who_sanittot']].describe())


Summary Statistics for Sanitation Total (who_sanittot):
       who_sanittot
count   2934.000000
mean      53.954669
std       31.107651
min        1.000000
25%       25.000000
50%       51.000000
75%       85.000000
max      100.000000


The 2,934 non-missing observations make sanitation the sparsest of the three variables, so it will constrain the regression sample the most.

The average sanitation access is 53.95%, so in the average country-year about half the population has access to improved sanitation facilities. This is an unweighted average over observations, not a share of the world population. A standard deviation of 31.11 percentage points suggests substantial variability in sanitation access across countries and years.

The lowest sanitation access is 1%, reflecting extreme cases where almost no one has access, likely in the least developed regions. The maximum is 100%, representing countries where the entire population has access to improved sanitation facilities, likely highly developed countries.

The relatively low mean and median (51%) suggest that sanitation access is still a global challenge, with many countries struggling to provide adequate facilities. Improved sanitation is critical for preventing diseases and improving overall health. It is expected to have a positive relationship with life expectancy in the regression analysis.
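Because each variable is missing in different rows, the number of rows usable in a regression can be smaller than the smallest individual count. The sketch below counts complete cases on a hypothetical toy DataFrame; the column names match the dataset but the values are made up.

```python
import pandas as pd

cols = ["wdi_lifexp", "wdi_expedu", "who_sanittot"]

# Hypothetical toy rows; None marks a missing value.
toy = pd.DataFrame({
    "wdi_lifexp":   [60.0, 61.0, None, 70.0, 71.0],
    "wdi_expedu":   [4.0, None, 3.5, 5.0, None],
    "who_sanittot": [50.0, 52.0, 40.0, None, 90.0],
})

# Non-missing count per column, as reported by describe().
non_missing = toy[cols].notna().sum()

# Rows where all three variables are present, which is what dropna() keeps.
complete = int(toy[cols].notna().all(axis=1).sum())

print(non_missing.to_dict())
print("complete cases:", complete)
```

In the toy data every column has at least three values, yet only one row is complete. The same effect explains why the regression below uses 2,107 observations rather than 2,934.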


In [14]:
# Clean the data by dropping missing values
df_clean = df[['wdi_lifexp', 'wdi_expedu', 'who_sanittot']].dropna()

# Run the regression
lifeexp_model = smf.ols(formula='wdi_lifexp ~ wdi_expedu + who_sanittot', data=df_clean).fit()

# Print the regression summary
print("\nOLS Regression Results:")
print(lifeexp_model.summary())


OLS Regression Results:
                            OLS Regression Results                            
Dep. Variable:             wdi_lifexp   R-squared:                       0.576
Model:                            OLS   Adj. R-squared:                  0.576
Method:                 Least Squares   F-statistic:                     1429.
Date:                Thu, 05 Dec 2024   Prob (F-statistic):               0.00
Time:                        22:35:22   Log-Likelihood:                -6665.1
No. Observations:                2107   AIC:                         1.334e+04
Df Residuals:                    2104   BIC:                         1.335e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       59.4928

R-squared: 0.576. The model explains 57.6% of the variation in life expectancy. This is a reasonably good fit for a model with only two predictors.

Intercept: 59.49. When both wdi_expedu and who_sanittot are 0, the predicted life expectancy is 59.49 years. This value acts as a baseline for the fitted equation and is not meaningful in real-world terms, as such extreme values for both predictors at once are unlikely.

*** Healthcare Expenditure (wdi_expedu): coefficient 0.1406. A one-percentage-point increase in expenditure as a share of GDP is associated with an increase of 0.14 years in life expectancy, holding sanitation constant.

The p-value (0.058) is slightly above the conventional threshold of 0.05, so this coefficient is not statistically significant at the 5% level, although it would be at the 10% level.
While healthcare expenditure plausibly affects life expectancy, its estimated effect may be weakened by factors such as inefficient spending or unequal access.

*** Sanitation Total (who_sanittot): coefficient 0.2098. A one-percentage-point increase in access to improved sanitation is associated with an increase of 0.21 years in life expectancy, holding healthcare expenditure constant.

The p-value < 0.001 indicates that this variable is highly statistically significant. This is consistent with the critical role of sanitation in improving health outcomes and reducing waterborne diseases, although a pooled regression cannot by itself establish causation.

F-statistic: 1429 (p-value < 0.001):
The overall model is statistically significant, meaning that the predictors together significantly explain variations in life expectancy.

Overall, sanitation is a strong predictor and healthcare expenditure is less influential. The model explains a substantial share of the variation in life expectancy using just two predictors.
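To make the coefficient sizes concrete, the fitted equation can be evaluated by hand. The sketch below uses the intercept and coefficients reported above. The predictor values are the full-sample means and quartiles from the earlier summaries, which is an assumption because the 2,107-row regression sample may have different means. The helper name predict_lifexp is hypothetical.

```python
# Coefficients as reported in the interpretation above.
intercept = 59.4928
b_expedu = 0.1406   # years per percentage point of GDP
b_sanit = 0.2098    # years per percentage point of sanitation access


def predict_lifexp(expedu, sanit):
    """Fitted life expectancy from the pooled OLS equation."""
    return intercept + b_expedu * expedu + b_sanit * sanit


# Full-sample means from the describe() outputs (assumption: the regression
# sample may have different means).
at_means = predict_lifexp(4.368319, 53.954669)

# Move sanitation from its 25th percentile (25) to its 75th (85), spending fixed.
sanit_gap = predict_lifexp(4.368319, 85.0) - predict_lifexp(4.368319, 25.0)

print(f"prediction at full-sample means: {at_means:.2f} years")
print(f"25th -> 75th percentile of sanitation: {sanit_gap:.2f} years")
```

The interquartile move in sanitation corresponds to about 12.6 years of predicted life expectancy, which is larger than the 11.23-year standard deviation of the outcome.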


*** The conclusion does not change in the first-difference model (Q2 below). Sanitation access remains a strong and significant predictor of life expectancy, both across country-years in levels and within countries from year to year.
Healthcare expenditure shows little short-term effect, though it may still contribute to long-term improvements.

In [18]:
print('Q2')
#!pip install linearmodels
from linearmodels.panel import FirstDifferenceOLS

# Select the necessary columns and drop missing values
columns = ['ccode', 'year', 'wdi_lifexp', 'wdi_expedu', 'who_sanittot']
df1 = df[columns].dropna()

# Set the MultiIndex for panel data (country code and year)
df1 = df1.set_index(['ccode', 'year'])

# Define the dependent and independent variables
y = df1['wdi_lifexp']  # Dependent variable: Life Expectancy
X = df1[['wdi_expedu', 'who_sanittot']]  # Independent variables: Healthcare Expenditure and Sanitation Total

# Fit the first-differenced panel data model
fdmodel = FirstDifferenceOLS(y, X)
results = fdmodel.fit(cov_type='clustered', cluster_entity=True)

# Print the regression results
print(results.summary)

Q2
                     FirstDifferenceOLS Estimation Summary                      
Dep. Variable:             wdi_lifexp   R-squared:                        0.0997
Estimator:         FirstDifferenceOLS   R-squared (Between):              0.1515
No. Observations:                1882   R-squared (Within):               0.1940
Date:                Thu, Dec 05 2024   R-squared (Overall):              0.1555
Time:                        22:46:23   Log-likelihood                   -2092.6
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      104.13
Entities:                         120   P-value                           0.0000
Avg Obs:                       17.558   Distribution:                  F(2,1880)
Min Obs:                       1.0000                                           
Max Obs:                       22.000   F-statistic (robust):             11.628
                         

R-squared (0.0997): only 9.97% of the variance in changes in life expectancy (Δwdi_lifexp) is explained by changes in healthcare expenditure (Δwdi_expedu) and sanitation access (Δwho_sanittot).
A relatively low R-squared is expected in first-difference models, because differencing removes time-invariant country characteristics and leaves only within-country changes to explain.
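The way differencing removes time-invariant factors can be illustrated on simulated data. In the sketch below every name and number is hypothetical: each country gets its own fixed level, y depends on x with a slope of 0.1, and no noise is added so the result is easy to check. The differenced regression is fitted through the origin, matching the absence of a constant in the model above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
true_slope = 0.1  # hypothetical effect of x on y

rows = []
for country, fixed_effect in [("A", 50.0), ("B", 70.0), ("C", 60.0)]:
    x = np.cumsum(rng.uniform(0.0, 2.0, size=10))  # x trends upward over time
    y = fixed_effect + true_slope * x               # no noise, for clarity
    rows += [
        {"country": country, "year": 2000 + t, "x": x[t], "y": y[t]}
        for t in range(10)
    ]
panel = pd.DataFrame(rows)

# Differencing within each country cancels the fixed effect: Δy = true_slope * Δx.
diffs = panel.groupby("country")[["x", "y"]].diff().dropna()

# OLS through the origin on the differenced data.
fd_slope = float((diffs["x"] * diffs["y"]).sum() / (diffs["x"] ** 2).sum())
print(round(fd_slope, 6))
```

The country-specific levels (50, 70, 60) never enter the differenced data, so the slope is recovered regardless of how they relate to x.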

*** Healthcare Expenditure (Δwdi_expedu): coefficient 0.0617. A one-percentage-point change in expenditure as a share of GDP is associated with a 0.0617-year change in life expectancy.

However, this coefficient is not statistically significant (p = 0.4214), suggesting that year-to-year changes in healthcare expenditure do not have an immediate measurable effect on life expectancy.

*** Sanitation Access (Δwho_sanittot): coefficient 0.0966. A one-percentage-point increase in sanitation access is associated with a 0.0966-year increase in life expectancy.

This result is highly statistically significant (p < 0.001), indicating that changes in sanitation access are strongly linked to changes in life expectancy.

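F-statistic (104.13): the model as a whole is statistically significant (p < 0.001), meaning at least one predictor significantly explains the changes in life expectancy. Because the standard errors are clustered by country, the robust F-statistic (11.628) is the more appropriate joint test; it is much smaller but still large.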

Net of changes in healthcare expenditure, a one-percentage-point increase in sanitation access is associated with a 0.0966-year increase in life expectancy. This is a substantively meaningful result, as sanitation improvements reduce disease prevalence and mortality rates.

Changes in healthcare expenditure, however, do not significantly predict changes in life expectancy in the short term. This could be due to lag effects, or because spending does not translate directly into effective health interventions.
The coefficients are smaller than in the levels model, which is expected: first differencing removes stable country characteristics that are correlated with both the predictors and life expectancy, so what remains is the association between year-to-year changes rather than between overall levels.
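One way to probe the lag explanation would be to add last year's change in spending as an extra predictor. The sketch below only shows how such a lagged column could be built within each country; the toy values and the column names expedu_diff and expedu_diff_lag1 are hypothetical, and no model is fitted.

```python
import pandas as pd

# Hypothetical toy panel; only ccode, year, and wdi_expedu mirror the dataset.
toy = pd.DataFrame({
    "ccode": [1, 1, 1, 1, 2, 2, 2],
    "year": [2000, 2001, 2002, 2003, 2000, 2001, 2002],
    "wdi_expedu": [3.0, 3.2, 3.1, 3.5, 5.0, 5.4, 5.3],
}).sort_values(["ccode", "year"])

# Year-to-year change in spending within each country.
toy["expedu_diff"] = toy.groupby("ccode")["wdi_expedu"].diff()

# Previous year's change; shifting within country keeps values from
# leaking across country boundaries.
toy["expedu_diff_lag1"] = toy.groupby("ccode")["expedu_diff"].shift(1)

print(toy)
```

Each lag costs one more observation per country, so the usable sample shrinks as lags are added.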

In [19]:
# Sort by country and year so that diff() compares consecutive observations
df1 = df1.sort_index()
# Group by 'ccode' and calculate the first differences for the relevant columns
df1['wdi_lifexp_diff'] = df1.groupby('ccode')['wdi_lifexp'].diff()
df1['wdi_expedu_diff'] = df1.groupby('ccode')['wdi_expedu'].diff()
df1['who_sanittot_diff'] = df1.groupby('ccode')['who_sanittot'].diff()

# Examine descriptive statistics of the first differences
print("Summary Statistics for First Differences:")
print(df1[['wdi_lifexp_diff', 'wdi_expedu_diff', 'who_sanittot_diff']].describe())

Summary Statistics for First Differences:
       wdi_lifexp_diff  wdi_expedu_diff  who_sanittot_diff
count      1987.000000      1987.000000        1987.000000
mean          0.245572         0.012508           0.544036
std           0.592644         0.604725           1.034378
min          -5.144000        -5.020820          -3.000000
25%           0.097561        -0.195092           0.000000
50%           0.248780        -0.002420           0.000000
75%           0.451220         0.224939           1.000000
max           7.591000         5.123410          13.000000


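The range of changes in healthcare expenditure (-5.02 to 5.12 percentage points of GDP) implies that some countries sharply reduced spending from one observation to the next, while others made large increases. Most changes are small, however: the middle 50% lie between -0.20 and 0.22.

Overall, life expectancy shows consistent improvement, with a mean gain of about 0.25 years and positive changes in at least 75% of cases, though some countries experience setbacks (the largest drop is 5.14 years). Sanitation access also improves on average (0.54 percentage points), but the median change is 0, apparently because the variable is recorded in whole percentage points. A few countries show rapid development in sanitation infrastructure, with increases of up to 13 points.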
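Because rows with missing values were dropped before differencing, some differences may span more than one year. The sketch below shows one way to detect such gaps and keep only one-year changes; the toy values and the columns year_gap and lifexp_diff_1yr are hypothetical.

```python
import pandas as pd

# Hypothetical toy panel; country 1 has no rows for 2002 and 2003.
toy = pd.DataFrame({
    "ccode": [1, 1, 1, 2, 2],
    "year": [2000, 2001, 2004, 2000, 2001],
    "wdi_lifexp": [60.0, 60.3, 61.5, 70.0, 70.2],
}).sort_values(["ccode", "year"])

# Years elapsed since the previous observation of the same country.
toy["year_gap"] = toy.groupby("ccode")["year"].diff()
toy["lifexp_diff"] = toy.groupby("ccode")["wdi_lifexp"].diff()

# Keep only differences between adjacent years; others become missing.
toy["lifexp_diff_1yr"] = toy["lifexp_diff"].where(toy["year_gap"] == 1)

n_gaps = int((toy["year_gap"] > 1).sum())
print(toy)
print("differences spanning more than one year:", n_gaps)
```

If the real data contain many such gaps, the largest changes in the summary above may partly reflect multi-year jumps rather than single-year shocks.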