# 1. Introduction
This study explores how life expectancy relates to education across countries. I am particularly interested in how the two are connected and how other factors might affect that relationship. I look at whether differences in schooling between men and women, a country's wealth measured by GNI per capita, the share of people who use the internet, and public health expenditure in 2014 play a role.

### Research Question
In this analysis, I focus on the link between how long people live (life expectancy) and how long they go to school (mean years of schooling). I want to see whether this link changes once I account for how schooling differs between men and women, how rich the country is, how many people use the internet, and how much the country spends on public health. By understanding these connections, I hope to provide useful information for policy.

# 2. Data Description
### Data Overview
This study uses country-level data on several aspects of people's lives. The main variable is life expectancy, which is how long people live on average in each country. It is the dependent variable, meaning it is what I am trying to explain in the analysis.

### Key Variables
- Life Expectancy (Dependent Variable): The average number of years a person is expected to live in each country.
- Mean Years of Schooling (Independent Variable): The average number of years people in each country go to school.
- Other Important Factors (Control Variables):
    - Gender Differences in Schooling: The average years of schooling, recorded separately for men and women.
    - Gross National Income (GNI) Per Capita: A measure of a country's income, which indicates how wealthy or poor the country is.
    - Internet Usage: The percentage of people in the country who use the internet.
    - Public Health Expenditure (% of GDP) for 2014: How much each country spends on health care relative to the size of its economy.

### Data Source and Collection
This analysis uses data from the "Human Development Reports" on Kaggle, originally sourced from the United Nations Development Programme. This dataset includes global human development indicators, focusing on life expectancy, education, and socio-economic factors. It is reliable and comprehensive, covering a wide range of countries. Available under the CC0: Public Domain license, it allows unrestricted use for any purpose.



In [16]:
# import libraries
import pandas as pd
import numpy as np
import json
import requests
from stargazer.stargazer import Stargazer
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/Jk33033/Data_Analysis2_Final_Project/main/Final_Project/HDI.csv')


In [12]:
# select important columns
selected_columns = [
    'Life expectancy', 
    'Mean years of schooling', 
    'Gross national income (GNI) per capita', 
    'Mean years of schooling Female', 
    'Mean years of schooling Male', 
    'Internet users', 
    'Public health expenditure (% of GDP) 2014'
]


# make new dataset with only important columns
df_selected = df[selected_columns]

# check
df_selected

Unnamed: 0,Life expectancy,Mean years of schooling,Gross national income (GNI) per capita,Mean years of schooling Female,Mean years of schooling Male,Internet users,Public health expenditure (% of GDP) 2014
0,81.7,12.7,67614.0,12.8,12.7,96.8,8.3
1,82.5,13.2,42822.0,13.4,13.0,84.6,6.3
2,83.1,13.4,56364.0,13.3,13.5,88.0,7.7
3,81.1,13.2,45000.0,12.9,13.6,87.6,8.7
4,80.4,12.7,44519.0,12.6,12.9,96.3,9.2
...,...,...,...,...,...,...,...
190,,,,,,93.4,3.8
191,,,12058.0,,,,2.9
192,,,50063.0,,,,5.7
193,55.7,,294.0,,,1.8,


In [34]:
# clean dataset: drop rows with any NaN; copy so the column assignments below act on an independent DataFrame
df_cleaned = df_selected.dropna().copy()
#replace ' '  with '_'
df_cleaned.columns = df_cleaned.columns.str.replace(' ', '_')
#rename
df_cleaned = df_cleaned.rename(columns={'Gross_national_income_(GNI)_per_capita': 'GNI_per_capita', 'Mean_years_of_schooling' : 'Years_of_schooling', 'Mean_years_of_schooling_Female' : 'Years_of_schooling_Female', 'Mean_years_of_schooling_Male' :  'Years_of_schooling_Male', 'Public_health_expenditure_(%_of_GDP)_2014' : 'Public_health_expenditure'})
#consider log
df_cleaned['ln_Years_of_schooling'] = np.log(df_cleaned['Years_of_schooling'])
df_cleaned['ln_Years_of_schooling_Female'] = np.log(df_cleaned['Years_of_schooling_Female'])
df_cleaned['ln_Years_of_schooling_Male'] = np.log(df_cleaned['Years_of_schooling_Male'])
df_cleaned['ln_GNI_per_capita'] = np.log(df_cleaned['GNI_per_capita'])

In [35]:
df_cleaned.describe()

Unnamed: 0,Life_expectancy,Years_of_schooling,GNI_per_capita,Years_of_schooling_Female,Years_of_schooling_Male,Internet_users,Public_health_expenditure,ln_Years_of_schooling,ln_Years_of_schooling_Female,ln_Years_of_schooling_Male,ln_GNI_per_capita
count,167.0,167.0,167.0,167.0,167.0,167.0,167.0,167.0,167.0,167.0,167.0
mean,71.454491,8.419162,17721.562874,8.072455,8.811976,47.935329,4.023952,2.038931,1.952372,2.111652,9.182883
std,8.400607,3.140887,19156.479671,3.469951,2.862403,28.623234,2.2632,0.467503,0.589485,0.383684,1.205373
min,48.9,1.4,587.0,1.0,2.0,2.2,0.8,0.336472,0.0,0.693147,6.375025
25%,65.5,6.1,3842.5,4.95,6.45,21.0,2.35,1.808289,1.599337,1.86405,8.253878
50%,74.2,8.8,10789.0,8.5,8.9,48.9,3.5,2.174752,2.140066,2.186051,9.286282
75%,77.25,11.2,24714.0,10.95,11.4,72.25,5.2,2.415914,2.393329,2.433613,10.115118
max,83.7,13.4,129916.0,13.4,13.6,98.2,10.8,2.595255,2.595255,2.61007,11.774643


# 3. Model

To understand the various factors associated with life expectancy, I estimated six linear regression models. Each model explores a different aspect:

1.  Life Expectancy vs log of Mean Years of Schooling and log of GNI per Capita.

2.  Life Expectancy vs log of Mean Years of Schooling (Female).

3. Life Expectancy vs log of Mean Years of Schooling (Male).

4. Life Expectancy vs Internet Usage.

5. Life Expectancy vs Public Health Expenditure.

6. Life Expectancy vs log of Mean Years of Schooling 

These models will allow us to separately assess the impact of each factor on life expectancy.

### Why These Models?
- Comprehensive Analysis: Using different models lets us compare how strongly each factor is associated with life expectancy.
- Gender-Specific Insights: Models 2 and 3 show how life expectancy relates to female and male schooling separately.
- Economic and Technological Factors: Models 1, 4, and 5 help in understanding the role of economic status, technology (internet usage), and health investment.

## Result and Interpretation
The subsections below explain the finding from each regression. All models are linear regressions estimated by OLS with the HC1 covariance type, so the standard errors are robust to heteroskedasticity. The full regression table can be found in Appendix 1.
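To make the HC1 choice concrete, here is a minimal sketch of what the robust standard error does for a one-regressor model. It is written in plain Python for illustration and is not the code used for the results; the function name `ols_slope_with_hc1` and the toy data are hypothetical.

```python
import math

def ols_slope_with_hc1(x, y):
    """Fit y = a + b*x by OLS; return (a, b, classical SE of b, HC1 SE of b)."""
    n = len(x)
    k = 2  # estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical SE: assumes one common error variance for every observation.
    sigma2 = sum(e ** 2 for e in resid) / (n - k)
    classical_se = math.sqrt(sigma2 / sxx)

    # HC1 SE: each squared residual is weighted by its own (x - x_bar)^2,
    # then the n / (n - k) small-sample correction is applied.
    meat = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, resid))
    hc1_se = math.sqrt(n / (n - k) * meat / sxx ** 2)
    return intercept, slope, classical_se, hc1_se

# Hypothetical toy data whose scatter widens as x grows.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_y = [2.0, 4.1, 5.7, 8.6, 9.1, 13.0]
a, b, se_classical, se_hc1 = ols_slope_with_hc1(toy_x, toy_y)
print(f"slope={b:.3f}  classical SE={se_classical:.3f}  HC1 SE={se_hc1:.3f}")
```

The point estimates are identical under both covariance types; only the standard errors, and therefore the significance stars, can differ.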

### 1.  Life Expectancy vs log of Mean Years of Schooling and log of GNI per Capita.
Based on the results of Model 1, where the coefficient for log(Years_of_schooling) is 4.56 and the coefficient for log(GNI_per_capita) is 4.22, and both are statistically significant at the 1% level, the interpretations are as follows:

Impact of Education Years (Coefficient of log(Years_of_schooling) = 4.56):

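Holding GNI per capita constant, a one-unit increase in the natural log of the average years of schooling is associated with an approximate increase of 4.56 years in life expectancy. Because a one-unit increase in the log means multiplying schooling by about 2.72, a more practical reading is that 1% more schooling is associated with roughly 0.046 more years of life expectancy.
The statistical significance of this coefficient suggests that the association between the duration of education and life expectancy is unlikely to be due to chance.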

Impact of National Income (Coefficient of log(GNI_per_capita) = 4.22):

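Holding years of schooling constant, a one-unit increase in the natural log of Gross National Income per capita is associated with an approximate increase of 4.22 years in life expectancy. Equivalently, 1% higher GNI per capita is associated with roughly 0.042 more years of life expectancy.

The statistical significance of this coefficient indicates that the association between national income and life expectancy is unlikely to be due to chance.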
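Because both regressors in Model 1 enter in logs while life expectancy stays in years, the coefficients are easier to read as the effect of a percentage change. The short sketch below works through that arithmetic with the reported coefficients; the helper name `level_log_change` is hypothetical.

```python
import math

# Model 1 coefficients as reported in the text.
beta_schooling = 4.56  # on ln(mean years of schooling)
beta_gni = 4.22        # on ln(GNI per capita)

def level_log_change(beta, pct_increase):
    """Change in y when x rises by pct_increase percent in y = a + beta * ln(x)."""
    return beta * math.log(1 + pct_increase / 100)

for label, beta in (("schooling", beta_schooling), ("GNI per capita", beta_gni)):
    for pct in (1, 10, 100):
        change = level_log_change(beta, pct)
        print(f"{pct:>3}% more {label}: {change:+.3f} years of life expectancy")
```

For small changes the result is close to `beta / 100` per 1%, while a doubling (100%) corresponds to `beta * ln(2)`, which is noticeably less than `beta` itself.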


### 2.  Life Expectancy vs log of Mean Years of Schooling (Female).
In Model 2, the coefficient for log(Years_of_schooling_Female) is 10.42 and is statistically significant at the 1% level. The interpretation is as follows:

Impact of Female Education Years (Coefficient of log(Years_of_schooling_Female) = 10.42):
A one-unit increase in the natural log of the average years of schooling for females is associated with an approximate increase of 10.42 years in life expectancy, or roughly 0.10 years for each 1% increase in female schooling.
The coefficient is large and statistically significant at the 1% level, which indicates a strong positive association between the duration of female education and life expectancy.
This suggests that improvements in female education could be particularly relevant for increasing life expectancy, although this model includes no control variables.


### 3.  Life Expectancy vs log of Mean Years of Schooling (Male).
Based on the results of Model 3, where the coefficient for log(Years_of_schooling_Male) is 15.48 and is statistically significant at the 1% level, the interpretation is as follows:

Impact of Male Education Years (Coefficient of log(Years_of_schooling_Male) = 15.48):
A one-unit increase in the natural log of the average years of schooling for males is associated with an approximate increase of 15.48 years in life expectancy, or roughly 0.15 years for each 1% increase in male schooling.
The large magnitude of the coefficient, along with its statistical significance at the 1% level, indicates a strong positive association between the duration of male education and life expectancy.
This suggests that male education is closely tied to life expectancy. Because this model includes no control variables, the coefficient should be compared with those of the other models only with caution.

### 4. Life Expectancy vs Internet Usage.
Based on the results of Model 4, where the coefficient for internet usage is 0.239 and is statistically significant at the 1% level, the interpretation is as follows:

Impact of Internet Usage (Coefficient of Internet Usage = 0.239):
A one-percentage-point increase in internet usage (measured as a percentage of the population) is associated with an approximate increase of 0.239 years (about 2.9 months) in life expectancy.
The statistical significance of this coefficient at the 1% level indicates a positive association between internet usage and life expectancy. The coefficient looks small next to the education coefficients, but the units differ: it is per percentage point of internet usage, while the education coefficients are per one-unit change in a log.
This association might reflect the broader implications of technological access and connectivity for health and well-being.
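The unit conversion behind the internet usage coefficient can be checked with a few lines of arithmetic. This is a small sketch using only the reported coefficient; the variable names are illustrative.

```python
# Model 4 slope: years of life expectancy per percentage point of internet users.
internet_coef = 0.239

months_per_point = internet_coef * 12      # one percentage point, in months
years_per_ten_points = internet_coef * 10  # ten percentage points, in years

print(f"+1 percentage point  -> {months_per_point:.2f} months")
print(f"+10 percentage points -> {years_per_ten_points:.2f} years")
```

One percentage point therefore corresponds to just under three months, and a ten-point difference to about 2.4 years.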

### 5. Life Expectancy vs Public Health Expenditure.
Based on the results of Model 5, where the coefficient for public health expenditure is 1.97 and is statistically significant at the 1% level, the interpretation is as follows:

Impact of Public Health Expenditure (Coefficient of Public Health Expenditure = 1.97):
A one-percentage-point increase in public health expenditure (measured as a percentage of GDP) is associated with an approximate increase of 1.97 years in life expectancy.
The coefficient is large and statistically significant at the 1% level, which indicates a strong positive association between investment in public health and life expectancy.
This is consistent with the idea that spending on health care and related public health services contributes to a longer average lifespan, although the model includes no control variables.

### 6. Life Expectancy vs log of Mean Years of Schooling 

Based on the results of Model 6, where the coefficient for log(Years_of_schooling) is 13.11 and is statistically significant at the 1% level, the interpretation is as follows:


Impact of Education Years (Coefficient of log(Years_of_schooling) = 13.11):
A one-unit increase in the natural log of the average years of schooling is associated with an approximate increase of 13.11 years in life expectancy, or roughly 0.13 years for each 1% increase in schooling.
The large magnitude of the coefficient, combined with its statistical significance at the 1% level, indicates a strong positive association between the duration of education and life expectancy.
The coefficient is much larger than the 4.56 found in Model 1, which shows that part of this association is shared with national income. Even so, the result highlights the potential importance of educational policies and investments for public health and longevity.
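As a quick consistency check on the reported numbers: an OLS line with an intercept passes through the sample means, so for each one-regressor model (Models 2 to 6 as coded), intercept plus slope times the regressor's mean should reproduce mean life expectancy. The sketch below uses the intercepts from the Appendix table, the slopes quoted in the text, and the means from `describe()`; small gaps are expected because the inputs are rounded.

```python
# Sample mean of the dependent variable, from describe().
mean_life_expectancy = 71.454491

# (intercept, slope, sample mean of the regressor) for each one-regressor model.
simple_models = {
    "Model 2 (ln female schooling)": (51.118, 10.42, 1.952372),
    "Model 3 (ln male schooling)": (38.762, 15.48, 2.111652),
    "Model 4 (internet users)": (59.979, 0.239, 47.935329),
    "Model 5 (public health expenditure)": (63.532, 1.97, 4.023952),
    "Model 6 (ln schooling)": (44.735, 13.105, 2.038931),
}

def fitted_at_mean(intercept, slope, regressor_mean):
    """Predicted life expectancy at the regressor's sample mean."""
    return intercept + slope * regressor_mean

gaps = {}
for name, (intercept, slope, x_mean) in simple_models.items():
    fitted = fitted_at_mean(intercept, slope, x_mean)
    gaps[name] = fitted - mean_life_expectancy
    print(f"{name}: fitted {fitted:.3f} vs mean {mean_life_expectancy:.3f}")
```

The same check does not apply directly to Model 1 without also including the mean of its second regressor.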

In [36]:
# create the regression model 1 
reg1 = smf.ols(formula="Life_expectancy ~ ln_Years_of_schooling + ln_GNI_per_capita", data=df_cleaned).fit(cov_type="HC1")

In [28]:
# create the regression model 2
reg2 = smf.ols(formula="Life_expectancy ~ ln_Years_of_schooling_Female", data=df_cleaned).fit(cov_type="HC1")

In [29]:
# create the regression model 3
reg3 = smf.ols(formula="Life_expectancy ~ ln_Years_of_schooling_Male", data=df_cleaned).fit(cov_type="HC1")

In [30]:
# create the regression model 4
reg4 = smf.ols(formula="Life_expectancy ~ Internet_users", data=df_cleaned).fit(cov_type="HC1")

In [31]:
# create the regression model 5
reg5 = smf.ols(formula="Life_expectancy ~ Public_health_expenditure", data=df_cleaned).fit(cov_type="HC1")

In [39]:
# create the regression model 6
reg6 = smf.ols(formula="Life_expectancy ~ ln_Years_of_schooling", data=df_cleaned).fit(cov_type="HC1")

# 4. Conclusion
In conclusion, Model 1, which has the highest R-squared value, is the most informative of the six models for understanding life expectancy. It shows that both education and economic factors, measured by the logs of Mean Years of Schooling and GNI per Capita, are strongly associated with life expectancy. These findings suggest that enhancing educational opportunities and economic growth may be important for improving public health outcomes.

The results emphasize the need for integrated policy approaches that focus on both education and economic development. Despite the strong correlations identified, it's important to remember that these do not imply causation. Future studies are needed to further explore these relationships. This analysis serves as a valuable guide for policymakers aiming to boost life expectancy through multifaceted strategies.

However, the analysis primarily faces the challenge of distinguishing correlation from causation, a common limitation in regression models. This means that while relationships between variables like education, economic status, and life expectancy are evident, the directionality and causative factors behind these associations remain uncertain. Additionally, the potential for omitted variable bias exists, where excluding key variables could skew the results. The linear nature of the models may also oversimplify the complex, potentially non-linear relationships inherent in socioeconomic data.
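The omitted variable concern can be illustrated with the reported coefficients themselves. For OLS on the same sample, the schooling coefficient without GNI (Model 6) equals the coefficient with GNI (Model 1) plus the GNI coefficient times the slope from regressing ln GNI per capita on ln schooling. The sketch below backs out that implied slope; it is derived from rounded table values and was not estimated directly in this analysis.

```python
# Coefficients from the Appendix regression table.
beta_short = 13.105  # ln schooling, Model 6 (GNI omitted)
beta_long = 4.564    # ln schooling, Model 1 (GNI included)
gamma_gni = 4.221    # ln GNI per capita, Model 1

# Omitted variable identity: beta_short = beta_long + gamma_gni * delta,
# where delta is the slope of ln GNI per capita on ln schooling.
bias = beta_short - beta_long
implied_delta = bias / gamma_gni
share_of_short = bias / beta_short

print(f"coefficient gap when GNI is omitted: {bias:.3f}")
print(f"implied slope of ln GNI on ln schooling: {implied_delta:.3f}")
print(f"share of the Model 6 coefficient attributable to GNI: {share_of_short:.1%}")
```

Read this way, roughly two thirds of the schooling coefficient in Model 6 reflects the fact that countries with more schooling also tend to have higher income.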

# Appendix
### 1. Regression Results

In [40]:
# create regression table for all models
stargazer = Stargazer([reg1, reg2, reg3, reg4, reg5, reg6])
stargazer.show_model_numbers(True)
stargazer.covariate_order(
    [
        'Intercept',
        'ln_Years_of_schooling',
        'ln_GNI_per_capita',
        'ln_Years_of_schooling_Female',
        'ln_Years_of_schooling_Male',
        'Internet_users',
        'Public_health_expenditure'
    ]
)

stargazer

0,1,2,3,4,5,6
,,,,,,
,Dependent variable: Life_expectancy,Dependent variable: Life_expectancy,Dependent variable: Life_expectancy,Dependent variable: Life_expectancy,Dependent variable: Life_expectancy,Dependent variable: Life_expectancy
,,,,,,
,(1),(2),(3),(4),(5),(6)
,,,,,,
Intercept,23.386***,51.118***,38.762***,59.979***,63.532***,44.735***
,(2.741),(1.721),(2.807),(0.809),(1.144),(2.264)
ln_Years_of_schooling,4.564***,,,,,13.105***
,(1.281),,,,,(1.029)
ln_GNI_per_capita,4.221***,,,,,
