<b>Data Source:</b><br>

<b>Life expectancy & Socio-Economic (world bank)</b>, Shritej Shrikant Chavan<br>

<b>Retrieved from</b> https://www.kaggle.com/datasets/mjshri23/life-expectancy-and-socio-economic-world-bank/data

In this analysis, I used a linear regression model to predict a country’s life expectancy based on several factors: income group, CO2 emissions, health expenditure, unemployment rate, and the burden of communicable and non-communicable diseases. Missing values were filled using the mean to ensure the data was complete and ready for modeling.

# Importing Libraries

In [26]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
# We can override the default matplotlib styles with those of Seaborn
sns.set()

In [28]:
data = pd.read_csv('life-expectancy-2019-fix.csv')
data.head()

Unnamed: 0,Country Name,Country Code,Region,IncomeGroup,Year,Life Expectancy World Bank,Prevelance of Undernourishment,CO2,Health Expenditure %,Education Expenditure %,Unemployment,Corruption,Sanitation,Injuries,Communicable,NonCommunicable
0,Afghanistan,AFG,South Asia,Low income,2019,64.833,26.9,6079.999924,13.242202,3.21378,11.217,2.0,,3481166.42,6034434.86,7601757.82
1,Angola,AGO,Sub-Saharan Africa,Lower middle income,2019,61.147,17.9,25209.99908,2.53336,1.927457,7.421,,,1168866.0,7237433.13,4176568.27
2,Albania,ALB,Europe & Central Asia,Upper middle income,2019,78.573,4.3,4829.999924,,3.91665,11.47,,47.577141,82288.1,51797.42,631629.88
3,Andorra,AND,Europe & Central Asia,High income,2019,,,500.0,6.711585,3.15061,,,100.000004,2124.01,800.5,19002.03
4,United Arab Emirates,ARE,Middle East & North Africa,High income,2019,77.972,6.0,188860.0006,4.275049,3.86737,2.23,,99.1477,382562.41,120204.51,1637717.4


## Data selection
I cleaned the dataset by dropping the identifier columns (Country Name, Country Code, Region, and Year) and keeping only the variables that are likely to contribute meaningfully to predicting life expectancy. This feature selection was based on relevance, data availability, and potential correlation with the target variable. The goal was to simplify the model and improve its predictive accuracy by focusing on the most impactful indicators.

In [30]:
data = data[['IncomeGroup','Life Expectancy World Bank','Prevelance of Undernourishment','CO2','Health Expenditure %','Education Expenditure %','Unemployment','Corruption','Sanitation','Injuries','Communicable','NonCommunicable']]
data.head()

Unnamed: 0,IncomeGroup,Life Expectancy World Bank,Prevelance of Undernourishment,CO2,Health Expenditure %,Education Expenditure %,Unemployment,Corruption,Sanitation,Injuries,Communicable,NonCommunicable
0,Low income,64.833,26.9,6079.999924,13.242202,3.21378,11.217,2.0,,3481166.42,6034434.86,7601757.82
1,Lower middle income,61.147,17.9,25209.99908,2.53336,1.927457,7.421,,,1168866.0,7237433.13,4176568.27
2,Upper middle income,78.573,4.3,4829.999924,,3.91665,11.47,,47.577141,82288.1,51797.42,631629.88
3,High income,,,500.0,6.711585,3.15061,,,100.000004,2124.01,800.5,19002.03
4,High income,77.972,6.0,188860.0006,4.275049,3.86737,2.23,,99.1477,382562.41,120204.51,1637717.4


## Checking for missing values
I checked the dataset for missing (null) values and removed the variables with a large number of missing entries (Corruption, Sanitation, Education Expenditure %, and Prevelance of Undernourishment), as they could negatively affect the model's performance. This step left a cleaner, more reliable set of variables for building the predictive model.

In [32]:
Error = data.isna().sum()
print(Error)

IncomeGroup                         0
Life Expectancy World Bank         10
Prevelance of Undernourishment     36
CO2                                 8
Health Expenditure %                9
Education Expenditure %            69
Unemployment                       16
Corruption                        111
Sanitation                         69
Injuries                            0
Communicable                        0
NonCommunicable                     0
dtype: int64
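As a rough sketch of how the drop decision could be made reproducible, the counts printed above can be converted to missing-value shares. The 174-row total is taken from the `describe()` output further down in this notebook. The 20% cutoff is an assumption chosen purely for illustration, not a rule stated in this analysis.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output above
missing_counts = pd.Series({
    'IncomeGroup': 0,
    'Life Expectancy World Bank': 10,
    'Prevelance of Undernourishment': 36,
    'CO2': 8,
    'Health Expenditure %': 9,
    'Education Expenditure %': 69,
    'Unemployment': 16,
    'Corruption': 111,
    'Sanitation': 69,
    'Injuries': 0,
    'Communicable': 0,
    'NonCommunicable': 0,
})

n_rows = 174       # row count, taken from the describe() output
threshold = 0.20   # hypothetical cutoff, chosen only for illustration

# Share of rows missing in each column
missing_share = missing_counts / n_rows

# Columns whose missing share exceeds the cutoff
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share.round(3).sort_values(ascending=False))
print(cols_to_drop)
```

With this particular cutoff, the flagged columns coincide with the four columns dropped in the next cell. A different cutoff would give a different list.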


In [36]:
# Drop the columns with many missing values
data = data.drop(columns=['Corruption', 'Sanitation','Prevelance of Undernourishment','Education Expenditure %'])
data

Unnamed: 0,IncomeGroup,Life Expectancy World Bank,CO2,Health Expenditure %,Unemployment,Injuries,Communicable,NonCommunicable
0,Low income,64.833,6079.999924,13.242202,11.217000,3481166.42,6034434.86,7601757.82
1,Lower middle income,61.147,25209.999080,2.533360,7.421000,1168866.00,7237433.13,4176568.27
2,Upper middle income,78.573,4829.999924,,11.470000,82288.10,51797.42,631629.88
3,High income,,500.000000,6.711585,,2124.01,800.50,19002.03
4,High income,77.972,188860.000600,4.275049,2.230000,382562.41,120204.51,1637717.40
...,...,...,...,...,...,...,...,...
169,Lower middle income,70.474,209.999993,3.360347,1.801000,12484.18,26032.56,69213.56
170,Lower middle income,73.321,300.000012,6.363094,8.406000,6652.84,9095.19,43798.62
171,Upper middle income,64.131,439640.014600,9.109355,28.469999,3174676.10,13198944.71,10214261.89
172,Low income,63.886,6800.000191,5.312203,12.520000,510982.75,4837094.00,2649687.82


In [40]:
Error2 = data.isna().sum()
print(Error2)

IncomeGroup                    0
Life Expectancy World Bank    10
CO2                            8
Health Expenditure %           9
Unemployment                  16
Injuries                       0
Communicable                   0
NonCommunicable                0
dtype: int64


## Imputation by the mean
Missing values were filled using the mean of each column. While this method may reduce data variability and introduce slight bias, it helps retain most of the dataset and ensures the model can be trained effectively without losing too much information.
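The effect on variability mentioned above can be seen in a minimal sketch. The values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (made-up values, for illustration only)
s = pd.Series([60.0, 65.0, np.nan, 75.0, 80.0, np.nan])

# Mean imputation: every missing entry is replaced by the column mean
filled = s.fillna(s.mean())

print(s.mean(), filled.mean())  # the mean is unchanged by mean imputation
print(s.std(), filled.std())    # the standard deviation shrinks after imputation
```

Because every imputed value sits exactly at the mean, it adds nothing to the spread while increasing the number of observations, which is why the standard deviation falls.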

In [49]:
data['Life Expectancy World Bank'] = data['Life Expectancy World Bank'].fillna(data['Life Expectancy World Bank'].mean())
data['CO2'] = data['CO2'].fillna(data['CO2'].mean())
data['Health Expenditure %'] = data['Health Expenditure %'].fillna(data['Health Expenditure %'].mean())
data['Unemployment'] = data['Unemployment'].fillna(data['Unemployment'].mean())


In [51]:
Error3 = data.isna().sum()
print(Error3)

IncomeGroup                   0
Life Expectancy World Bank    0
CO2                           0
Health Expenditure %          0
Unemployment                  0
Injuries                      0
Communicable                  0
NonCommunicable               0
dtype: int64


In [53]:
data.dtypes

IncomeGroup                    object
Life Expectancy World Bank    float64
CO2                           float64
Health Expenditure %          float64
Unemployment                  float64
Injuries                      float64
Communicable                  float64
NonCommunicable               float64
dtype: object

## Turn categorical data into numerical
In this step, I created a copy of the original dataset and converted the categorical variable IncomeGroup into numerical values using a mapping. This transformation is necessary because the OLS regression requires numerical input. Encoding the groups as 0 to 3 treats them as evenly spaced, so the model estimates a single effect per step between adjacent income groups.

In [57]:
datax = data.copy()
datax['IncomeGroup'] = data['IncomeGroup'].map({"Low income": 0, "Lower middle income": 1, "Upper middle income": 2, "High income": 3})
datax

Unnamed: 0,IncomeGroup,Life Expectancy World Bank,CO2,Health Expenditure %,Unemployment,Injuries,Communicable,NonCommunicable
0,0,64.833000,6079.999924,13.242202,11.217000,3481166.42,6034434.86,7601757.82
1,1,61.147000,25209.999080,2.533360,7.421000,1168866.00,7237433.13,4176568.27
2,2,78.573000,4829.999924,6.754494,11.470000,82288.10,51797.42,631629.88
3,3,72.589112,500.000000,6.711585,6.980652,2124.01,800.50,19002.03
4,3,77.972000,188860.000600,4.275049,2.230000,382562.41,120204.51,1637717.40
...,...,...,...,...,...,...,...,...
169,1,70.474000,209.999993,3.360347,1.801000,12484.18,26032.56,69213.56
170,1,73.321000,300.000012,6.363094,8.406000,6652.84,9095.19,43798.62
171,2,64.131000,439640.014600,9.109355,28.469999,3174676.10,13198944.71,10214261.89
172,0,63.886000,6800.000191,5.312203,12.520000,510982.75,4837094.00,2649687.82
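One thing worth checking with a dictionary-based mapping is whether any label failed to match, since `Series.map` returns NaN for keys that are not in the dictionary. The sketch below uses a toy frame with a deliberately misspelled label; the frame and the variable names `demo`, `encoded`, and `unmapped` are illustrative and not part of the original analysis.

```python
import pandas as pd

# Same ordinal mapping as used for datax
income_order = {
    "Low income": 0,
    "Lower middle income": 1,
    "Upper middle income": 2,
    "High income": 3,
}

# Toy frame; the last label is deliberately misspelled to show the failure mode
demo = pd.DataFrame(
    {"IncomeGroup": ["Low income", "High income", "Upper-middle income"]}
)

# Labels missing from the dictionary become NaN
encoded = demo["IncomeGroup"].map(income_order)

# Collect the original labels that could not be mapped
unmapped = demo.loc[encoded.isna(), "IncomeGroup"].unique().tolist()
print(unmapped)
```

In this notebook the encoded column has dtype `int64`, which indicates that every label was matched; a column containing NaN would have been stored as `float64`.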


## Note:
    "Low income": 0,
    "Lower middle income": 1,
    "Upper middle income": 2,
    "High income": 3

In [62]:
datax.dtypes

IncomeGroup                     int64
Life Expectancy World Bank    float64
CO2                           float64
Health Expenditure %          float64
Unemployment                  float64
Injuries                      float64
Communicable                  float64
NonCommunicable               float64
dtype: object

In [64]:
datax.describe()

Unnamed: 0,IncomeGroup,Life Expectancy World Bank,CO2,Health Expenditure %,Unemployment,Injuries,Communicable,NonCommunicable
count,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0
mean,1.804598,72.589112,180160.0,6.754494,6.980652,1273799.0,3593261.0,8397128.0
std,1.040665,7.480514,909075.4,3.003141,5.266034,4993871.0,13207430.0,33229660.0
min,0.0,53.283,10.0,1.525117,0.1,474.37,357.19,2498.8
25%,1.0,67.31575,3950.0,4.578131,3.62,64531.38,49194.92,369407.1
50%,2.0,73.414,16855.0,6.617792,5.6065,248678.5,274998.3,1621168.0
75%,3.0,77.90947,88315.0,8.344779,8.2725,882174.8,2044520.0,4383044.0
max,3.0,84.356341,10707220.0,23.961813,28.469999,53563910.0,143214500.0,324637800.0


## Data explanation
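CO2 emissions have a wide range, from 10 to over 10 million, with a mean of approximately 180,160 but a median of only 16,855. The gap between the mean and the median shows that the distribution is strongly right-skewed: a few very large emitters pull the average up. This large variation suggests significant differences in industrial activity and environmental impact across countries.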

Health Expenditure % ranges from 1.53% to 23.96% of GDP, with an average of 6.75%. This indicates varying levels of national investment in healthcare.

Unemployment rates vary from 0.1% to 28.47%, with an average of 6.98%. This shows considerable differences in labor market conditions across countries.

Injuries range from around 474 to over 53 million, with a mean of approximately 1.27 million, indicating that some countries carry an extremely high injury-related burden.

Communicable diseases have a mean of about 3.59 million, ranging from just over 350 to more than 143 million, showing large disparities in disease burden.

NonCommunicable diseases show the largest spread, with values ranging from around 2,500 to over 324 million and a mean of about 8.40 million, reflecting significant differences in chronic disease burden among nations.

Note: for more information, including the unit of each variable, see the source data linked above.
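One simple way to quantify the spread described in this section is to compare each variable's mean with its median. The sketch below reuses the `mean` and `50%` rows from the `describe()` output; the ratio column is an illustrative helper, and a ratio far above 1 is only a rough indication of right skew, not a formal test.

```python
import pandas as pd

# Mean and median (50%) rows copied from the describe() output above
stats = pd.DataFrame(
    {
        "mean":   [180160.0, 6.754494, 6.980652, 1273799.0, 3593261.0, 8397128.0],
        "median": [16855.0, 6.617792, 5.6065, 248678.5, 274998.3, 1621168.0],
    },
    index=["CO2", "Health Expenditure %", "Unemployment",
           "Injuries", "Communicable", "NonCommunicable"],
)

# A ratio well above 1 means a few large values pull the mean above the median
stats["mean_to_median"] = stats["mean"] / stats["median"]
print(stats.round(2))
```

The percentage variables (Health Expenditure % and Unemployment) stay close to 1, while CO2 and the three disease and injury variables have means several times their medians.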

## Regression

In [85]:
y = datax['Life Expectancy World Bank']
x1 = datax[['IncomeGroup','CO2','Health Expenditure %','Unemployment','Injuries','Communicable','NonCommunicable']]

In [87]:
# Add a constant. Essentially, we are adding a new column (equal in length to x1) that consists only of 1s
x = sm.add_constant(x1)
# Fit the model using the OLS (ordinary least squares) method, with dependent variable y and independent variables x
results = sm.OLS(y,x).fit()
# Print a summary of the regression
results.summary()

0,1,2,3
Dep. Variable:,Life Expectancy World Bank,R-squared:,0.709
Model:,OLS,Adj. R-squared:,0.697
Method:,Least Squares,F-statistic:,57.86
Date:,"Thu, 17 Apr 2025",Prob (F-statistic):,2.51e-41
Time:,19:47:50,Log-Likelihood:,-489.06
No. Observations:,174,AIC:,994.1
Df Residuals:,166,BIC:,1019.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,62.4296,1.004,62.162,0.000,60.447,64.412
IncomeGroup,5.2057,0.325,16.000,0.000,4.563,5.848
CO2,-6.492e-06,1.58e-06,-4.119,0.000,-9.6e-06,-3.38e-06
Health Expenditure %,0.3191,0.113,2.817,0.005,0.095,0.543
Unemployment,-0.1732,0.061,-2.858,0.005,-0.293,-0.054
Injuries,-5.564e-07,4.43e-07,-1.256,0.211,-1.43e-06,3.18e-07
Communicable,-2.535e-07,5.32e-08,-4.763,0.000,-3.59e-07,-1.48e-07
NonCommunicable,3.106e-07,9.49e-08,3.273,0.001,1.23e-07,4.98e-07

0,1,2,3
Omnibus:,9.758,Durbin-Watson:,2.222
Prob(Omnibus):,0.008,Jarque-Bera (JB):,10.146
Skew:,-0.486,Prob(JB):,0.00626
Kurtosis:,3.674,Cond. No.,117000000.0


## Analysis
<b>R-squared (0.709)</b>: This indicates that approximately <b>71% of the variation in life expectancy</b> is explained by the model. This suggests a good fit, but there’s still some unexplained variance.

<b>Adjusted R-squared (0.697)</b>: This value adjusts R-squared for the number of predictors, indicating that the model still explains a substantial amount of variance even after accounting for the number of features.

<b>F-statistic (57.86)</b> and <b>Prob (F-statistic) (2.51e-41)</b>: The high F-statistic and the very low p-value show that the model as a whole is <b>statistically significant</b>, meaning at least one of the predictors is significantly related to life expectancy.

<b>Coefficients (coef)</b>: These represent the <b>estimated change in life expectancy</b> for a one-unit change in each independent variable, holding the other variables constant. For example:
<ul> <li><b>Income Group</b>: Moving up one income group (for example, from "Low income" to "Lower middle income") is associated with an <b>increase of 5.21 years</b> in life expectancy.</li> <li><b>CO2</b>: Each additional unit of CO2 emissions is associated with a <b>very small decrease in life expectancy (-6.49e-06 years)</b>, or about 0.65 years per 100,000 units.</li> <li><b>Health Expenditure</b>: A one-percentage-point increase in health expenditure is associated with an <b>increase of 0.32 years</b> in life expectancy.</li> </ul>

<b>P-values (P>|t|)</b>: The p-values indicate whether each coefficient is statistically distinguishable from zero. Variables with <b>p-values less than 0.05 are considered significant</b>. A value printed as 0.000 means p < 0.001, not exactly zero. For instance:
<ul> <li><b>Income Group (p < 0.001)</b>, <b>CO2 (p < 0.001)</b>, <b>Health Expenditure (p = 0.005)</b>, <b>Unemployment (p = 0.005)</b>, <b>Communicable (p < 0.001)</b>, and <b>Non Communicable (p = 0.001)</b> are all <b>statistically significant predictors</b>.</li> <li><b>Injuries (p = 0.211)</b> is <b>not statistically significant</b>, meaning the data do not provide enough evidence that its coefficient differs from zero once the other predictors are included.</li> </ul>

<b>Durbin-Watson (2.222)</b>: This statistic tests for autocorrelation in the residuals. A value <b>close to 2 suggests no autocorrelation</b>, which is ideal.

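<b>Omnibus and Jarque-Bera tests</b>: Both tests indicate that the <b>residuals are not perfectly normally distributed (p-values < 0.05)</b>. The skew of -0.486 and kurtosis of 3.674 point to a mild left skew and slightly heavy tails, so the p-values and confidence intervals above should be read as approximate.

<b>Condition number (1.17e+08)</b>: The very large condition number likely reflects the very different scales of the predictors (IncomeGroup runs from 0 to 3, while the disease variables reach hundreds of millions) and may also indicate multicollinearity among them.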

<b>Overall</b>, the model explains a <b>significant portion of the variation in life expectancy</b>, with key predictors such as <b>income, CO2 emissions, and health expenditure</b> showing significant relationships with life expectancy. However, some variables, like <b>injuries</b>, are <b>not statistically significant</b> in this model.
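The significance calls above can be cross-checked by hand, since each t-statistic is the coefficient divided by its standard error. The sketch below copies the rounded values from the summary table, so the ratios differ slightly from the printed t column. The cutoff of 1.97 is an approximation of the two-sided 5% critical value for 166 residual degrees of freedom, used here as an assumption instead of an exact quantile.

```python
# Coefficients and standard errors copied (rounded) from the summary table above
table = {
    "const":                (62.4296,    1.004),
    "IncomeGroup":          (5.2057,     0.325),
    "CO2":                  (-6.492e-06, 1.58e-06),
    "Health Expenditure %": (0.3191,     0.113),
    "Unemployment":         (-0.1732,    0.061),
    "Injuries":             (-5.564e-07, 4.43e-07),
    "Communicable":         (-2.535e-07, 5.32e-08),
    "NonCommunicable":      (3.106e-07,  9.49e-08),
}

# Approximate two-sided 5% critical value for 166 residual degrees of freedom
T_CRIT = 1.97

# t-statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in table.items()}

for name, t in t_stats.items():
    print(f"{name:22s} t = {t:8.3f}")

# Predictors whose |t| falls below the cutoff
not_significant = [name for name, t in t_stats.items() if abs(t) < T_CRIT]
print(not_significant)
```

Only "Injuries" falls below the cutoff, which agrees with the p-values reported in the summary.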

## Dropping the Insignificant Variable

In [90]:
y = datax['Life Expectancy World Bank']
x1 = datax[['IncomeGroup','CO2','Health Expenditure %','Unemployment','Communicable','NonCommunicable']]

In [95]:
# Add a constant. Essentially, we are adding a new column (equal in length to x1) that consists only of 1s
x = sm.add_constant(x1)
# Fit the model using the OLS (ordinary least squares) method, with dependent variable y and independent variables x
results = sm.OLS(y,x).fit()
# Print a summary of the regression
results.summary()

0,1,2,3
Dep. Variable:,Life Expectancy World Bank,R-squared:,0.707
Model:,OLS,Adj. R-squared:,0.696
Method:,Least Squares,F-statistic:,67.0
Date:,"Thu, 17 Apr 2025",Prob (F-statistic):,6.36e-42
Time:,19:49:57,Log-Likelihood:,-489.88
No. Observations:,174,AIC:,993.8
Df Residuals:,167,BIC:,1016.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,62.7166,0.980,64.021,0.000,60.783,64.651
IncomeGroup,5.1692,0.325,15.924,0.000,4.528,5.810
CO2,-5.117e-06,1.14e-06,-4.506,0.000,-7.36e-06,-2.88e-06
Health Expenditure %,0.2972,0.112,2.650,0.009,0.076,0.519
Unemployment,-0.1818,0.060,-3.015,0.003,-0.301,-0.063
Communicable,-2.697e-07,5.17e-08,-5.216,0.000,-3.72e-07,-1.68e-07
NonCommunicable,2.021e-07,3.93e-08,5.137,0.000,1.24e-07,2.8e-07

0,1,2,3
Omnibus:,9.309,Durbin-Watson:,2.239
Prob(Omnibus):,0.01,Jarque-Bera (JB):,9.511
Skew:,-0.48,Prob(JB):,0.00861
Kurtosis:,3.625,Cond. No.,113000000.0


## Analysis 2
After dropping the "Injuries" variable, the key changes are:

- **R-squared (0.707)**: The model still explains a high portion of the variance in life expectancy, but slightly decreased from the previous value (still showing a good fit).
  
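- **Adjusted R-squared (0.696)**: After removing "Injuries," the adjusted R-squared dropped marginally (from 0.697). Adjusted R-squared falls whenever the removed variable had |t| > 1, and "Injuries" had t = -1.256, so a small decline is expected. AIC (994.1 to 993.8) and BIC (1019.0 to 1016.0) both improved slightly, which favors the simpler model.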

- **Coefficients**: 
  - The coefficient for **IncomeGroup** decreased slightly to **5.1692** (from 5.2057), but remains a significant predictor of life expectancy.
  - **CO2** continues to have a negative relationship with life expectancy, now with a coefficient of **-5.117e-06** (smaller in magnitude than the previous -6.492e-06).
  - **Health Expenditure %** (0.2972), **Unemployment** (-0.1818), and **Communicable** (-2.697e-07) remain significant with similar values. **NonCommunicable** also remains significant, but its coefficient fell from 3.106e-07 to 2.021e-07 and its standard error shrank from 9.49e-08 to 3.93e-08, which suggests it was correlated with "Injuries" in the first model.

- **Durbin-Watson (2.239)**: The model's residuals still exhibit no significant autocorrelation, similar to before.

**Conclusion**: Dropping "Injuries" barely changed the explanatory power of the model (R-squared went from 0.709 to 0.707), and all remaining predictors stay significant with consistent signs. The model continues to provide a solid explanation of life expectancy based on the available predictors. The simpler model is preferable because it achieves nearly the same fit with one fewer predictor.
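The adjusted R-squared values of the two models can be reproduced from the reported R-squared, the number of observations, and the number of predictors. The sketch below uses the standard formula for a model with an intercept; the function name `adjusted_r2` is illustrative, and the inputs are the rounded figures from the two summaries.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# First model: 7 predictors, R-squared 0.709, 174 observations
with_injuries = adjusted_r2(0.709, 174, 7)

# Second model: "Injuries" removed, 6 predictors, R-squared 0.707
without_injuries = adjusted_r2(0.707, 174, 6)

print(round(with_injuries, 3), round(without_injuries, 3))
```

The results round to 0.697 and 0.696, matching the two summaries. The penalty for the extra predictor is small here because 174 observations is large relative to 6 or 7 predictors.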

## Predictions of life expectancy

In [100]:
x

Unnamed: 0,const,IncomeGroup,CO2,Health Expenditure %,Unemployment,Communicable,NonCommunicable
0,1.0,0,6079.999924,13.242202,11.217000,6034434.86,7601757.82
1,1.0,1,25209.999080,2.533360,7.421000,7237433.13,4176568.27
2,1.0,2,4829.999924,6.754494,11.470000,51797.42,631629.88
3,1.0,3,500.000000,6.711585,6.980652,800.50,19002.03
4,1.0,3,188860.000600,4.275049,2.230000,120204.51,1637717.40
...,...,...,...,...,...,...,...
169,1.0,1,209.999993,3.360347,1.801000,26032.56,69213.56
170,1.0,1,300.000012,6.363094,8.406000,9095.19,43798.62
171,1.0,2,439640.014600,9.109355,28.469999,13198944.71,10214261.89
172,1.0,0,6800.000191,5.312203,12.520000,4837094.00,2649687.82


## Let's make predictions using dummy data from three fictional nations.

In [182]:
new_data = pd.DataFrame({'const': 1,'IncomeGroup': [0, 1, 3], 'CO2': [500, 20000,850000],'Health Expenditure %':[3.5 , 6, 11],'Unemployment':[12, 7, 3], 'Communicable':[500000, 300000, 20000],'NonCommunicable':[600000, 1200000,4500000]})
new_data = new_data[['const','IncomeGroup','CO2','Health Expenditure %','Unemployment','Communicable','NonCommunicable']]
new_data

Unnamed: 0,const,IncomeGroup,CO2,Health Expenditure %,Unemployment,Communicable,NonCommunicable
0,1,0,500,3.5,12,500000,600000
1,1,1,20000,6.0,7,300000,1200000
2,1,3,850000,11.0,3,20000,4500000


In [184]:
new_data.rename(index={0:'Nation1',1:'Nation2',2:'Nation3'})

Unnamed: 0,const,IncomeGroup,CO2,Health Expenditure %,Unemployment,Communicable,NonCommunicable
Nation1,1,0,500,3.5,12,500000,600000
Nation2,1,1,20000,6.0,7,300000,1200000
Nation3,1,3,850000,11.0,3,20000,4500000


In [186]:
# Use the predict method on the regression with the new data as a single argument
predictions = results.predict(new_data)
# The result
predictions

0    61.558705
1    68.455362
2    77.502240
dtype: float64
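To see what `predict` does here, the same numbers can be approximated by hand: each prediction is the dot product of a row of inputs with the fitted coefficients. The sketch below copies the rounded coefficients from the second regression summary, so the results agree with the output above only to about two decimal places.

```python
import numpy as np

# Coefficients copied (rounded) from the second regression summary, in the order:
# const, IncomeGroup, CO2, Health Expenditure %, Unemployment, Communicable, NonCommunicable
coefs = np.array([62.7166, 5.1692, -5.117e-06, 0.2972, -0.1818, -2.697e-07, 2.021e-07])

# The three fictional nations, one row each, in the same column order
X_new = np.array([
    [1, 0,    500,  3.5, 12, 500000,  600000],
    [1, 1,  20000,  6.0,  7, 300000, 1200000],
    [1, 3, 850000, 11.0,  3,  20000, 4500000],
], dtype=float)

# Linear prediction: intercept plus the sum of coefficient * value for each row
manual = X_new @ coefs
print(manual.round(2))
```

For Nation3, the income group term alone contributes 3 x 5.1692, about 15.5 years, while its CO2 value subtracts about 4.3 years, which shows how strongly the income encoding drives the predictions.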

In [188]:
# If we want we can create a data frame, including everything
predictionsdf = pd.DataFrame({'Predictions':predictions})
# Join the two data frames
joined = new_data.join(predictionsdf)
# Rename the indices as before (not a good practice in general) 
joined.rename(index={0:'Nation1',1:'Nation2',2:'Nation3'})

Unnamed: 0,const,IncomeGroup,CO2,Health Expenditure %,Unemployment,Communicable,NonCommunicable,Predictions
Nation1,1,0,500,3.5,12,500000,600000,61.558705
Nation2,1,1,20000,6.0,7,300000,1200000,68.455362
Nation3,1,3,850000,11.0,3,20000,4500000,77.50224


## Conclusion
The model illustrates how socioeconomic and health-related indicators are associated with life expectancy. Among the three hypothetical nations, **Nation3**, which has the highest income group, the greatest health expenditure, the lowest unemployment, and the highest noncommunicable disease burden, yields the **highest predicted life expectancy of 77.5 years**. In contrast, **Nation1**, with the lowest health expenditure and the highest communicable disease burden, has the **lowest predicted life expectancy at 61.6 years**. Because the nations are fictional, this upward trend illustrates the fitted coefficients and is not independent evidence: within the model, **higher income levels, higher health spending, and lower unemployment are associated with longer predicted life expectancy**.