### START OF HEALTH SECTION (ELIAS) ###

In [None]:
# !pip install country_converter

In [None]:
# !pip install geopandas

In [2]:
import numpy as np
import pandas as pd
import geopandas as gpd

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression 
from sklearn import metrics

import country_converter as coco

# Elias Project Outline

### By: Elias Nicolas

## Abstract 

**This project asks what makes a country happy. To answer that, we consider factors such as the economy, the environment, and citizens' health. Our goal is to find correlations between these factors and a society's happiness score, and to determine whether any factor outweighs the others.**

## Introduction and Background 

**Everyone wants to be happy, plain and simple. However, the road there is arduous and complicated, and many different factors can affect a person's happiness. Through this analysis, we hope to explore those factors and the impact they have on a society's overall happiness. We will analyze categories such as a country's economy, citizens' general health, and the governmental systems in place, using data from the Gallup World Poll and the World Health Organization.**

## Data Used

**I got this data from Kaggle (user jainaru); it contains the 2024 World Happiness Report scores. The report is publicly available on the World Happiness Report website, so I concluded that it is ethical to use this data, since it was intended to be shared.**

**Link to the Dataset: https://www.kaggle.com/datasets/jainaru/world-happiness-report-2024-yearly-updated**

**Country name: Name of the country**

**Regional indicator: Region to which the country belongs**

**Ladder score: The happiness score for each country, based on responses to the Cantril Ladder question that asks respondents to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0**

**Upper whisker: Upper bound of the happiness score**

**Lower whisker: Lower bound of the happiness score**

**Log GDP per capita: The natural logarithm of the country's GDP per capita, adjusted for purchasing power parity (PPP) to account for differences in the cost of living between countries**

**Social support: The national average of binary responses (either 0 or 1, representing No/Yes) to the question about having relatives or friends to count on in times of trouble**

**Healthy life expectancy: The average number of years a newborn infant would live in good health, based on mortality rates and life expectancy at different ages**

**Freedom to make life choices: The national average of responses to the question about satisfaction with freedom to choose what to do with one's life**

**Generosity: The residual of regressing the national average of responses to the question about donating money to charity on GDP per capita**

**Perceptions of corruption: The national average of survey responses to questions about the perceived extent of corruption in the government and businesses**

**Dystopia + residual: Dystopia is an imaginary country with the world's least-happy people, used as a benchmark for comparison. The dystopia + residual score is a combination of the Dystopia score and the unexplained residual for each country, ensuring that the combined score is always positive. Each of these factors contributes to the overall happiness score, but the Dystopia + residual value is a benchmark that ensures no country has a lower score than the hypothetical Dystopia**

**Positive affect: The national average of responses to questions about positive emotions experienced yesterday**

**Negative affect: The national average of responses to questions about negative emotions experienced yesterday**
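
**The Generosity definition above describes a regression residual. As a rough sketch of that idea (assuming a simple one-variable least-squares fit, which may differ from the report's actual procedure), the residual is what is left of the donation rate after removing the part predicted by log GDP per capita. The numbers below are made up for illustration and are not taken from the dataset.**

```python
import numpy as np

# Made-up illustrative values, not taken from the World Happiness Report.
log_gdp = np.array([7.5, 8.2, 9.0, 9.8, 10.5, 11.1])
donated_share = np.array([0.20, 0.31, 0.28, 0.45, 0.40, 0.62])

# Fit donated_share = slope * log_gdp + intercept by ordinary least squares.
slope, intercept = np.polyfit(log_gdp, donated_share, deg=1)

# The residual is the part of donating behaviour that log GDP does not explain.
generosity_residual = donated_share - (slope * log_gdp + intercept)
print(generosity_residual.round(3))
```

**A positive residual means a country donates more than its income level alone would suggest; a negative residual means it donates less.**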

In [3]:
df_2024 = pd.read_csv('./Data/happy_2024.csv')
cc = coco.CountryConverter()
df_2024['ISO_A3'] = df_2024['Country name'].apply(lambda x: cc.convert(x, to='ISO3'))
df_2024 = df_2024.drop(['Regional indicator'], axis=1)
df_2024.head()

Unnamed: 0,Country name,Ladder score,upperwhisker,lowerwhisker,Log GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Dystopia + residual,ISO_A3
0,Finland,7.741,7.815,7.667,1.844,1.572,0.695,0.859,0.142,0.546,2.082,FIN
1,Denmark,7.583,7.665,7.5,1.908,1.52,0.699,0.823,0.204,0.548,1.881,DNK
2,Iceland,7.525,7.618,7.433,1.881,1.617,0.718,0.819,0.258,0.182,2.05,ISL
3,Sweden,7.344,7.422,7.267,1.878,1.501,0.724,0.838,0.221,0.524,1.658,SWE
4,Israel,7.341,7.405,7.277,1.803,1.513,0.74,0.641,0.153,0.193,2.298,ISR


**This is a GeoJSON file of country boundaries from around the world. I merge it with my dataset so that I can plot everything on a map.**
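
**A merge like this silently drops any country whose code does not appear in both tables. One way to check for that, sketched here with small made-up tables rather than the real data, is an outer merge with pandas' `indicator=True` option, which labels each row by where it was found.**

```python
import pandas as pd

# Toy stand-ins for the happiness table and the GeoJSON attribute table.
# The rows and codes are placeholders chosen for illustration.
happiness = pd.DataFrame({
    "Country name": ["Finland", "Denmark", "Kosovo"],
    "ISO_A3": ["FIN", "DNK", "XKX"],
})
shapes = pd.DataFrame({
    "name": ["Finland", "Denmark", "Iceland"],
    "ISO_A3": ["FIN", "DNK", "ISL"],
})

# how="outer" keeps unmatched rows; indicator=True adds a "_merge" column
# with the values "left_only", "right_only", or "both".
check = happiness.merge(shapes, on="ISO_A3", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["ISO_A3", "_merge"]])
```

**Any rows listed as unmatched would be missing from the map, so they are worth reviewing before plotting.**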

In [4]:
geojson_url = "https://datahub.io/core/geo-countries/r/countries.geojson"
gdf = gpd.read_file(geojson_url)
gdf = gdf.rename(columns = {'ISO3166-1-Alpha-3':'ISO_A3'})
merge_df = gdf.merge(df_2024, left_on='ISO_A3',right_on='ISO_A3')
merge_df = merge_df.drop(columns = 'ISO3166-1-Alpha-2')
merge_df.head()

Unnamed: 0,name,ISO_A3,geometry,Country name,Ladder score,upperwhisker,lowerwhisker,Log GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Dystopia + residual
0,Indonesia,IDN,"MULTIPOLYGON (((117.70361 4.16342, 117.70361 4...",Indonesia,5.568,5.67,5.466,1.361,1.184,0.472,0.779,0.399,0.055,1.318
1,Malaysia,MYS,"MULTIPOLYGON (((117.70361 4.16342, 117.69711 4...",Malaysia,5.975,6.078,5.872,1.646,1.143,0.54,0.829,0.226,0.119,1.473
2,Chile,CHL,"MULTIPOLYGON (((-69.51009 -17.50659, -69.50611...",Chile,6.36,6.448,6.273,1.616,1.369,0.673,0.651,0.117,0.075,1.858
3,Bolivia,BOL,"POLYGON ((-69.51009 -17.50659, -69.51009 -17.5...",Bolivia,5.784,5.895,5.674,1.217,1.179,0.488,0.719,0.1,0.061,2.02
4,Peru,PER,"MULTIPOLYGON (((-69.51009 -17.50659, -69.63832...",Peru,5.841,5.946,5.736,1.371,1.18,0.662,0.615,0.078,0.029,1.907


**This is data from the World Health Organization showing the 2020 obesity rate per 100 people (how many people out of 100 have obesity). To my knowledge this data is ethically sourced, and it is publicly available for download on the World Health Organization website.**

**Link to the Dataset: https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight**

In [None]:
df_obesity = pd.read_csv('./Data/obesity_2022.csv')
df_obesity = df_obesity.rename(columns={'GEO_NAME_SHORT':'name','RATE_PER_100_N':'Rate per 100 people'})
main_mask = ((df_obesity['DIM_TIME'] == 2020) & (df_obesity['DIM_SEX'] == 'TOTAL') & (df_obesity['name'] != 'World'))
df_obesity = df_obesity[main_mask]

final_mask = ['name','Rate per 100 people']

df_obesity = df_obesity[final_mask].reset_index(drop =True)
df_obesity.head()

In [None]:
obesity_merge_df = df_obesity.merge(merge_df, left_on='name',right_on='name')
obesity_merge_df = obesity_merge_df.drop(columns='Country name')
obesity_merge_df = gpd.GeoDataFrame(obesity_merge_df, geometry='geometry')
obesity_merge_df.head()

## Exploratory Data Analysis ##

In [None]:
df_2024.shape

In [None]:
df_2024.dtypes

In [None]:
df_2024 = df_2024.sort_values(by='Ladder score', ascending = False).reset_index(drop=True)
df_2024

In [None]:
df_obesity.shape

In [None]:
df_obesity.dtypes

In [None]:
df_obesity = df_obesity.sort_values(by='Rate per 100 people', ascending = False).reset_index(drop=True)
df_obesity

## Proposed Question ##

**The specific questions I am going to focus on are:**
**What areas tend to have the highest/lowest happiness scores, and are there specific reasons for this?**
**Does a country's health, specifically obesity, have any effect on its happiness score?**

In [None]:
## Plots a geographical map, colored by world happiness scores
fig = px.choropleth(
    merge_df,
    geojson=merge_df.geometry,
    locations=merge_df.index,
    color="Ladder score",        
    hover_name="Country name",      
    color_continuous_scale="RdYlBu",
    title="World Happiness Scores"
)

fig.update_layout(
    margin={"r":0,"l":0,"b":0},
    geo=dict(showframe=False, showcoastlines=False),
    width = 900,
    height = 700
)

fig.show(config={
    'scrollZoom': False,          
    'displayModeBar': False       
})

**From this plot, I can infer that countries in Africa and the Middle East tend to have lower happiness scores. This could be due to lack of resources, political oppression, and ramifications of warfare.**
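
**A regional pattern read off a map can also be checked numerically by averaging scores per region. The sketch below uses made-up values and assumes the `Regional indicator` column is still available (this notebook drops it when loading the data).**

```python
import pandas as pd

# Made-up scores for two placeholder regions, for illustration only.
toy = pd.DataFrame({
    "Regional indicator": ["Region A", "Region A", "Region B", "Region B"],
    "Ladder score": [7.0, 6.0, 4.0, 5.0],
})

# Mean score and number of countries per region, lowest mean first.
regional_mean = (
    toy.groupby("Regional indicator")["Ladder score"]
       .agg(["mean", "count"])
       .sort_values("mean")
)
print(regional_mean)
```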

In [None]:
## Plots a geographical map, colored by obesity rate per country
fig_2 = px.choropleth(
    obesity_merge_df,
    geojson=obesity_merge_df.geometry.__geo_interface__,
    locations=obesity_merge_df.index,
    color="Rate per 100 people",        
    hover_name="name",      
    color_continuous_scale="Inferno",
    title="Obesity Rate for Various Countries (How many people out of 100 for that country are obese?)"
)

fig_2.update_layout(
    margin={"r":0,"l":0,"b":0},
    geo=dict(showframe=False, showcoastlines=False),
    width = 900,
    height = 700
)

fig_2.show(config={
    'scrollZoom': False,          
    'displayModeBar': False       
})

In [None]:
## Creating the dataframe to compare obesity and happiness scores
mask = ['name','Healthy life expectancy','Rate per 100 people', 'Ladder score']
df_compare = obesity_merge_df.copy()
df_compare = df_compare[mask]
df_compare.sort_values(by='Ladder score',ascending=True).reset_index(drop=True).head(30)

**Countries with a low obesity rate (potentially due to starvation, though this is only an inference) also tend to fall in the lower percentiles of happiness scores, so there seems to be a relationship between the two. This could help explain why Africa and the Middle East tend to have the lowest happiness scores.**
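
**The relationship described above can be quantified with a correlation coefficient. This is a minimal sketch on made-up numbers (not the real merged dataframe): Pearson correlation measures linear association, and applying it to ranks gives a Spearman-style rank correlation, which is less sensitive to outliers.**

```python
import pandas as pd

# Made-up values for illustration; the real columns live in obesity_merge_df.
toy = pd.DataFrame({
    "Rate per 100 people": [5.0, 8.0, 12.0, 20.0, 25.0, 30.0],
    "Ladder score": [3.9, 4.4, 4.1, 5.8, 6.5, 6.2],
})

# Pearson correlation: linear association between the two columns.
pearson = toy["Rate per 100 people"].corr(toy["Ladder score"])

# Rank correlation: Pearson correlation computed on the ranks.
ranks = toy.rank()
spearman = ranks["Rate per 100 people"].corr(ranks["Ladder score"])

print(round(pearson, 3), round(spearman, 3))
```

**A value near +1 or -1 indicates a strong association, but a correlation alone does not show that one variable causes the other.**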

In [None]:
## Creating the dataframe for my linear regression model 
mask = ['Healthy life expectancy','Rate per 100 people', 'Ladder score']
df_model = obesity_merge_df.copy()
df_model = df_model[mask]
df_model

In [None]:
## Drop all the na values
df_model = df_model.dropna()
df_model.isna().sum()

In [None]:
## Train the linear regression model 
features = ['Healthy life expectancy','Rate per 100 people']
X = df_model[features].values.reshape(-1,2)
y = (df_model['Ladder score'])

In [None]:
LM = LinearRegression()
LM.fit(X,y)

In [None]:
LM.coef_

In [None]:
LM.intercept_

In [None]:
LM.score(X,y)

In [None]:
## Display the predicted scores and the absolute error between the predicted and actual scores
x_values = 'Healthy life expectancy'  
y_values = 'Rate per 100 people' 
z_values = 'Ladder score' 

prediction_df = df_model.copy()
prediction_df['Predicted_Score'] = LM.predict(df_model[[x_values, y_values]].values)
prediction_df['Error'] = abs(prediction_df[z_values] - prediction_df['Predicted_Score'])

merged_columns = ['Healthy life expectancy', 'Ladder score', 'Rate per 100 people']
prediction_df = prediction_df.merge(obesity_merge_df,left_on = merged_columns, right_on = merged_columns)
final_mask = ['name','Healthy life expectancy','Ladder score', 'Rate per 100 people', 'Predicted_Score','Error']

prediction_df = prediction_df[final_mask].sort_values(by='Error',ascending=True).reset_index(drop=True)
prediction_df

In [None]:
prediction_df.sort_values(by='Error',ascending=False)

In [None]:
## These two lines generate a sequence of 100 evenly spaced numbers ranging from the lowest to the highest values found in the X and Y columns 
## Thus creating the boundaries and resolution for the plot
x_range = np.linspace(prediction_df[x_values].min(), prediction_df[x_values].max(), 100)
y_range = np.linspace(prediction_df[y_values].min(), prediction_df[y_values].max(), 100)

## This block generates a grid of coordinates and flattens them so the model can predict a z_value for every point
## It then reshapes those predictions back into a matrix format
xx, yy = np.meshgrid(x_range, y_range)
grid_data = np.c_[xx.ravel(), yy.ravel()]
zz_flat = LM.predict(grid_data)
zz = zz_flat.reshape(xx.shape)

## Plot the 3D scatter plot with the linear regression model 
fig_3 = px.scatter_3d(prediction_df, x=x_values, y=y_values, z=z_values)

## Add the linear regression to the 3D plane 
fig_3.add_trace(
    go.Surface(
        x=xx,
        y=yy,
        z=zz,
        name='Regression Plane',
        opacity=0.5,          
        colorscale='hot',
        showscale=False       
    )
)

fig_3.update_layout(
    title='3D Linear Regression Analysis',
    scene=dict(
        xaxis_title=x_values,
        yaxis_title=y_values,
        zaxis_title=z_values
    ),
    margin=dict(l=0, r=0, b=0, t=40)
)

fig_3.show()

**Given the errors in my linear regression model, I hypothesize that the inaccuracy of my predictions comes from the fact that happiness depends on far more than health alone, so adding more features may give better predictions. The results were not terrible, however: health does show some correlation with a country's happiness score. By adding more features (more columns from our 2024 happiness report dataframe), I believe we can get a more accurate prediction.**
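
**The `score` method of scikit-learn's `LinearRegression` returns R², and above it is evaluated on the same rows the model was fitted on. As a hedged sketch of what that number means, the block below computes R² by hand on synthetic data, using NumPy least squares instead of scikit-learn so that it is self-contained, and compares the training value with a value on held-out rows. The data, the split sizes, and the helper names are assumptions for illustration.**

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two features and a noisy linear target (not the real dataset).
X = rng.normal(size=(120, 2))
y = 5.0 + 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=120)

# Hold out the last 30 rows so the score is not measured on training data.
X_train, X_test = X[:90], X[90:]
y_train, y_test = y[:90], y[90:]

def add_intercept(features):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(features)), features])

# Ordinary least squares: coef[0] is the intercept, coef[1:] are the slopes.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

train_r2 = r_squared(y_train, add_intercept(X_train) @ coef)
test_r2 = r_squared(y_test, add_intercept(X_test) @ coef)
print(round(train_r2, 3), round(test_r2, 3))
```

**If the held-out value is much lower than the training value, the model is likely fitting noise rather than a pattern that generalizes.**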

In [None]:
## Add more features to the linear regression model to train it and hopefully improve it
mask_2 = ['Healthy life expectancy','Rate per 100 people', 'Social support', 'Generosity','Ladder score']
df_model_2 = obesity_merge_df.copy()
df_model_2 = df_model_2[mask_2]
df_model_2.head()

In [None]:
## Drop all the na values 
df_model_2 = df_model_2.dropna()
df_model_2.isna().sum()

In [None]:
features_2 = ['Healthy life expectancy','Rate per 100 people','Social support','Generosity']
X2 = df_model_2[features_2].values.reshape(-1,4)
y2 = (df_model_2['Ladder score'])

In [None]:
## Train the new linear regression model with the updated features
LM2 = LinearRegression()
LM2.fit(X2,y2)

In [None]:
LM2.coef_

In [None]:
LM2.intercept_

In [None]:
LM2.score(X2,y2)

In [None]:
## Display the predicted scores and the absolute error between the predicted and actual scores for the second model
a_values = 'Healthy life expectancy' 
b_values = 'Rate per 100 people'  
c_values = 'Social support'
d_values ='Generosity'

e_values = 'Ladder score'

prediction_df_2 = df_model_2.copy()
prediction_df_2['Predicted_Score'] = LM2.predict(df_model_2[[a_values, b_values, c_values, d_values]].values)
prediction_df_2['Error'] = abs(prediction_df_2[e_values] - prediction_df_2['Predicted_Score'])

merged_columns_2 = [a_values,b_values,c_values,d_values,e_values]
## Merge on all five columns of the second model (merged_columns_2, not the first model's merged_columns)
prediction_df_2 = prediction_df_2.merge(obesity_merge_df,left_on = merged_columns_2, right_on = merged_columns_2)
final_mask_2 = ['name','Healthy life expectancy','Ladder score', 'Rate per 100 people', 'Predicted_Score','Error']

prediction_df_2 = prediction_df_2[final_mask_2].sort_values(by='Error',ascending=True).reset_index(drop=True)
prediction_df_2

In [None]:
prediction_df_2.sort_values(by='Error',ascending = False)

**When we added more features that are not necessarily related to health, the model's predictions became more accurate. This means we can take data from other group members' sections to improve our model. It also shows that happiness cannot be attributed to a single factor; it is a combination of factors that creates that environment.**
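
**One caveat when comparing the two models: for ordinary least squares with an intercept, R² measured on the training data never decreases when features are added, so some improvement is expected regardless. Adjusted R² is a common way to account for this. The sketch below shows the formula; the sample sizes and R² values passed in are hypothetical and are not results from this notebook.**

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 penalises each extra feature.

    Requires n_samples > n_features + 1.
    """
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers for illustration only, not results from this notebook.
two_feature = adjusted_r2(r2=0.60, n_samples=130, n_features=2)
four_feature = adjusted_r2(r2=0.62, n_samples=130, n_features=4)
print(round(two_feature, 4), round(four_feature, 4))
```

**If the adjusted value rises along with the plain R², the extra features are more likely to be adding real explanatory power.**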

## Conclusion

**From this analysis I was able to identify a correlation between obesity rates and happiness: countries in the lower percentiles of obesity rates tended to be in the lower percentiles of happiness scores. I suspect this is due to malnutrition and starvation, but I cannot come to a firm conclusion based on this data alone. The only health-related information in these datasets is the obesity rate per country, the healthy life expectancy, and the happiness score for each country. I am not ethically concerned about the origins of the data, as the sources are two respectable organizations, but I do believe the analysis could be taken further with more information. If I had the time and funds to continue this project, I would try to find more data on each country's starvation rate (or something along those lines) to see if my hypothesis holds up. I would also add this data to my linear regression model to make it more accurate.**

### END OF HEALTH SECTION (ELIAS) ###

In [None]:
file_location = '../Final_project/Main Data/2019.csv'
file_name = '../Final_project/Main Data/iceland_benefits.xlsx'
file_name2 = '../Final_project/Main Data/iceland_income_support.xlsx'
file_name3 = '../Final_project/Main Data/GDP%.xlsx'
file_name4 = '../Final_project/Main Data/social spending.csv'
DF_SS = pd.read_csv(file_name4)
DF_GDP = pd.read_excel(file_name3)
DF_inc = pd.read_excel(file_name2)
DF_ben = pd.read_excel(file_name)
DF = pd.read_csv(file_location)

In [None]:
DF.sort_values('Social support', ascending=False).head(10)

In [None]:
mask = DF_SS['Year'] == 2019
DF_SS[mask].sort_values(by='Public social expenditure as a share of GDP')

A strong welfare program is usually a good indicator of a healthy country: one with an economy stable enough to afford supportive government programs for its population. According to the website it was pulled from, the DF_SS dataset covers, among others, health, old age, incapacity-related benefits, family, active labor market programmes, unemployment, and housing. A few notable countries we want to look at, which sit at the top of the Social support category in the happiness dataset and appear in the following datasets, are Finland, Denmark, and Norway.

In [None]:
col = ['Country Name','Indicator Name',2019]
m = DF_GDP[2019].notnull()
DF_GDP = DF_GDP[col][m]
DF_GDP.sort_values(2019, ascending=False).tail(11)

In [None]:
DF.sort_values('Social support', ascending=False).tail(10)

We made a few assumptions going into this project, namely that Social Support and GDP per Capita were big contributors to happiness in a country. Our reasoning is that the more financial support and access to welfare a population has, the lower its chance of falling into poverty. Each value in the %GDP dataset is the share of a country's GDP spent on domestic general government health expenditure, otherwise known as healthcare, so a lower value means a smaller share of GDP goes to healthcare. As we can see in the DF_GDP dataset, some of the lowest values coincide with some of the lowest Social Support scores in the world happiness dataset, mainly Chad, Haiti, Afghanistan, and Benin.

In [None]:
DF_temp = pd.merge(DF, DF_SS[mask], 
                  left_on='Country or region', 
                  right_on='Entity', 
                  how="left")

# DF_GDP was already filtered to rows with a 2019 value in an earlier cell,
# so it can be merged directly without re-applying the mask m
DF_new = pd.merge(DF_temp, DF_GDP,
                 left_on='Country or region',
                 right_on='Country Name',
                 how="left")

In [None]:
DF_new1 = DF_new.sort_values(by=2019)
mask1 = DF_new1['Social support'] >= 1
fig = px.scatter(DF_new1[mask1],
                 x='Social support',
                 y='Public social expenditure as a share of GDP',
                 color=2019,
                 hover_data=['Country or region'],
                 trendline='ols')
fig.update_layout(
    xaxis_title='Country Score "Social Support"',
    yaxis_title='Social Expenditure as %GDP',
    coloraxis_colorbar_title_text='General Health Expenditure %GDP')
fig.show()

Now to explain some outliers: these numbers are percentages of GDP, meaning that smaller economies that spend about the same absolute amount as larger ones will show higher percentages. For instance, Greece has a much smaller population than a lot of other countries, but it spends a large portion of its GDP on welfare and social support. High scores here don't always equate to happiness, as Greece shows, but there is a general positive relationship between a government's social expenditure, its health expenditure, and how much a population believes it receives support from its government.
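
As a small worked example of the point about percentages, the sketch below compares two hypothetical economies. The function name and all figures are made up for illustration and do not come from the datasets above.

```python
def share_of_gdp(spending: float, gdp: float) -> float:
    """Return spending as a percentage of GDP."""
    return 100.0 * spending / gdp

# Hypothetical figures in the same currency units (e.g. billions).
small_economy = share_of_gdp(spending=50.0, gdp=200.0)
large_economy = share_of_gdp(spending=400.0, gdp=4000.0)

print(small_economy, large_economy)
```

Here the larger economy spends eight times as much in absolute terms, yet its share of GDP is lower, which is why a high percentage on its own says little about the total resources available.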