An end-to-end machine learning project that moves from exploratory data analysis (EDA) to preprocessing, modeling, and model evaluation.

Name: Selsabeel A.



Date created: 21-04-2023


Dataset: https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021


## **World Happiness Report**

# Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load Data

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings("ignore")  # suppress library warnings to keep cell output readable

In [None]:
data = pd.read_csv("/kaggle/input/world-happiness-report-2021/world-happiness-report.csv")


## Exploratory Data Analysis

In [None]:
data.head()
# show the first five rows of our data

In [None]:
# look at the number of rows and columns
print(f'Rows: {data.shape[0]}, Columns: {data.shape[1]}')

In [None]:
# look at the columns
print(data.columns)

# Attributes

**Country:** Name of the country.

**Year:** Year of the survey.

**Life Ladder:** The happiness score, based on a Gallup survey asking individuals to rate their current life on a scale of 0-10.

**Log GDP per capita:** The natural log of the country's GDP per capita.

**Social support:** The level of social support in the country, based on the answer to the question "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"

**Healthy life expectancy at birth:** The average number of years a person can expect to live in good health at birth.

**Freedom to make life choices:** The level of perceived freedom to make life choices, based on the answer to the question "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"

**Generosity:** The residual from regressing the national average of responses to the question "Have you donated money to a charity in the past month?" on GDP per capita. Because it is a residual, it can be negative.

**Perceptions of corruption:** The level of perceived corruption in the country, based on the answer to the question "Is corruption widespread throughout the government and businesses in your country, or not?" If they think there's no corruption, this value would be 0.

**Positive affect:** The extent to which individuals experience positive emotions, rating it from 0 to 1, with higher values indicating a greater degree of positive affect.

**Negative affect:** The extent to which individuals experience negative emotions, rating it from 0 to 1, with higher values indicating a greater degree of negative affect.

In [None]:
# normalize the column names: lowercase, with spaces replaced by underscores
data.columns = data.columns.str.lower().str.replace(' ', '_')

In [None]:
print(data.columns)
# lowercase, underscore-separated names are easier to work with and more consistent (clean code)

In [None]:
data.dtypes

In [None]:
data.isna().sum()

In [None]:
data.describe()
# basic descriptive statistics to give an overview of the data

Some interesting notes about the data:


*   What does a negative generosity value indicate? The underlying survey question asks whether someone donated to charity in the past month, which cannot be negative, and the Kaggle dataset description says the value ranges from 0 to 1. Upon further research, the column is not the raw survey average: the World Happiness Report defines generosity as the residual from regressing that average on GDP per capita, so a negative value means a country donates less than its GDP per capita would predict.



In [None]:
# Visualize the distribution of the target variable
sns.histplot(data['life_ladder'])
plt.show()

In [None]:
# Get the maximum recorded year
max_year = data['year'].max()
print(max_year)

# Filter the data to keep only the rows with the maximum recorded year
data_max_year = data[data['year'] == max_year]


In [None]:
# Sort the data by the life_ladder column in descending order
data_max_year_sorted = data_max_year.sort_values(by=['life_ladder'], ascending=False)

# Print the top happiest countries in the maximum recorded year
print(data_max_year_sorted[['country_name', 'life_ladder']].head())

In [None]:
# Sort in ascending order
data_sorted = data_max_year.sort_values(by=['life_ladder'])

# Select the top 5 rows from the sorted DataFrame
bottom_5 = data_sorted.head(5)

# Print the bottom 5 countries
print(bottom_5[['country_name','life_ladder']])
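The sort-then-`head` pattern used in the two cells above can also be written with `DataFrame.nlargest` and `DataFrame.nsmallest`. A minimal sketch on a made-up four-row frame (the country labels and scores are placeholders, not values from the dataset):

```python
import pandas as pd

# Toy stand-in for data_max_year; values are invented for illustration.
toy = pd.DataFrame({
    "country_name": ["A", "B", "C", "D"],
    "life_ladder": [7.8, 3.2, 5.5, 6.1],
})

# Rows with the two highest and two lowest happiness scores.
top_2 = toy.nlargest(2, "life_ladder")
bottom_2 = toy.nsmallest(2, "life_ladder")

print(top_2)
print(bottom_2)
```

Both calls return the rows already ordered, so no separate `sort_values` step is needed.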


In [None]:
# Get the minimum and maximum values of the life_ladder column
min_val = data['life_ladder'].min()
max_val = data['life_ladder'].max()

In [None]:
# Create a DataFrame to store the min and max values
df = pd.DataFrame({'value': [min_val, max_val], 'type': ['Minimum', 'Maximum']})

In [None]:
# Create a bar chart to visualize the min and max values
plt.bar(x=df['type'], height=df['value'])
plt.ylabel('life_ladder')
plt.show()

## Preprocessing

Interpretation of null values:



*   Log GDP per capita and healthy life expectancy at birth cannot be 0, so a null in these columns must be a genuinely missing value.
*   A null in positive or negative affect seems to represent the value 0, since survey participants were asked to rate from 0 to 1.
*   Social support, generosity, freedom to make life choices, and perceptions of corruption can plausibly be 0, so nulls in these columns are also filled with 0. For example, a value of 0 in perceptions of corruption would mean that respondents do not perceive any corruption in their country.
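Before filling with 0, it can be worth checking whether an exact 0 ever appears among the observed values, and how much the fill shifts the column mean. A minimal sketch on a made-up frame (the numbers are invented, only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two survey columns; values are invented.
toy = pd.DataFrame({
    "social_support": [0.81, np.nan, 0.64, 0.90],
    "positive_affect": [0.70, 0.55, np.nan, np.nan],
})

# Per column: how many values are missing, how many are exactly 0,
# and the smallest value that was actually observed.
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "n_zero": (toy == 0).sum(),
    "min_observed": toy.min(),
})
print(summary)

# Filling with 0 pulls the mean down compared with skipping the NaNs.
mean_skipping_nan = toy["positive_affect"].mean()
mean_after_fill = toy["positive_affect"].fillna(0.0).mean()
print(mean_skipping_nan, mean_after_fill)
```

If `n_zero` is 0 and `min_observed` is well above 0, a filled 0 sits outside the observed range, which is a sign that the null may not really mean 0.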

In [None]:
data['social_support'] = data['social_support'].fillna(0.0)
data['generosity'] = data['generosity'].fillna(0.0)
data['freedom_to_make_life_choices'] = data['freedom_to_make_life_choices'].fillna(0.0)
data['perceptions_of_corruption'] = data['perceptions_of_corruption'].fillna(0.0)
data['positive_affect'] = data['positive_affect'].fillna(0.0)
data['negative_affect'] = data['negative_affect'].fillna(0.0)


In [None]:
data.isna().sum()

For log GDP per capita and healthy life expectancy, we first need to understand the data before deciding what to do with the null values. We look at each column's skewness, its relationship with other attributes, and its data-generating process.

In [None]:
sns.set_palette("husl")
sns.displot(data=data, x="log_gdp_per_capita", kind="kde")

In [None]:
sns.pairplot(data=data, vars=["log_gdp_per_capita", "social_support", "healthy_life_expectancy_at_birth"])
plt.show()

The relationship seems to be strongest between log_gdp_per_capita and healthy_life_expectancy_at_birth. We can use linear regression models to replace the null values of log_gdp_per_capita. However, these two both have null values so it doesn't seem to be a good idea. Let's try something else.

In [None]:
sns.displot(data=data, x="log_gdp_per_capita", kind="hist")
plt.show()


No indication of missing data like gaps or unusual shapes. What about outliers?

In [None]:
# create a boxplot for log_gdp_per_capita
sns.boxplot(x=data['log_gdp_per_capita'])
plt.show()

No outliers. All previous plots indicate a normal distribution. That means that we can replace the null values with the mean.

In [None]:
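# assign the result back instead of calling fillna(inplace=True) on a column selection
data["log_gdp_per_capita"] = data["log_gdp_per_capita"].fillna(data["log_gdp_per_capita"].mean())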

OK, great! Now, what about the null values in healthy life expectancy?

In [None]:
sns.set_palette("husl")
sns.displot(data=data, x="healthy_life_expectancy_at_birth", kind="kde")

In [None]:
sns.displot(data=data, x="healthy_life_expectancy_at_birth", kind="hist")
plt.show()


The data is positively skewed.

In [None]:
sns.pairplot(data=data, vars=["log_gdp_per_capita", "social_support", "healthy_life_expectancy_at_birth"])
plt.show()

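Once again, the strongest correlation is between healthy_life_expectancy_at_birth and log_gdp_per_capita, so regression-based imputation would lean on a column whose own missing values had to be imputed.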

In [None]:
sns.displot(data=data, x="healthy_life_expectancy_at_birth", kind="hist")
plt.show()

No significant gaps or unusual shape in the distribution, so we can infer that there isn't missing data. But we notice once again it is a very positively skewed attribute. What about outliers?

In [None]:
# create a boxplot for healthy_life_expectancy_at_birth
sns.boxplot(x=data['healthy_life_expectancy_at_birth'])
plt.show()

All plots indicate a strong positive skew and the presence of outliers. Because the mean is sensitive to both, we replace the null values with the median.

In [None]:
# median imputation, as decided above; assign back rather than using inplace=True
data["healthy_life_expectancy_at_birth"] = data["healthy_life_expectancy_at_birth"].fillna(data["healthy_life_expectancy_at_birth"].median())
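The mean-versus-median decision made for the two columns above can be expressed as a small helper. This is a sketch, not the notebook's method: the function name `impute_by_skew` and the 0.5 skewness cutoff are assumptions (a common rule of thumb), and the two series are made up.

```python
import numpy as np
import pandas as pd

def impute_by_skew(series: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Fill NaNs with the mean if |skewness| < threshold, otherwise with the median."""
    skew = series.skew()  # NaNs are skipped by default
    fill_value = series.mean() if abs(skew) < threshold else series.median()
    return series.fillna(fill_value)

# Invented examples: one symmetric column, one with a large outlier.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 50.0, np.nan])

filled_symmetric = impute_by_skew(symmetric)
filled_skewed = impute_by_skew(skewed)
print(filled_symmetric.iloc[-1], filled_skewed.iloc[-1])
```

The symmetric series is filled with its mean, while the outlier in the second series pushes the helper to the median, so the filled value is not dragged toward 50.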

In [None]:
data.isna().sum()
#number of null values in each column

In [None]:
data
# display the DataFrame after imputation

In [None]:
# First 5 unique values and number of unique values for each column
for col in data.columns:
  print(col)
  print(f'First 5 unique values: {data[col].unique()[:5]}')
  print(f'Number of unique values: {data[col].nunique()}\n')

In [None]:
data.info()

In [None]:
# one regression plot per feature against the target, laid out on a 4x2 grid
features = [
    'log_gdp_per_capita', 'social_support',
    'healthy_life_expectancy_at_birth', 'freedom_to_make_life_choices',
    'generosity', 'perceptions_of_corruption',
    'positive_affect', 'negative_affect',
]
fig, axes = plt.subplots(4, 2, figsize=(16, 20))
for ax, feature in zip(axes.flat, features):
    sns.regplot(data=data, x=feature, y='life_ladder', ax=ax)
    ax.set_title(f'{feature} vs life_ladder')

plt.tight_layout()
plt.show()


In [None]:
data.describe()

In [None]:
from sklearn.preprocessing import MinMaxScaler

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size = 1000).reshape(-1, 1)

scaler = MinMaxScaler()
# scale the data to the range 0 to 1
scaled_out_data = scaler.fit_transform(original_data)

In [None]:
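# sns.distplot is deprecated; histplot with kde=True draws the same kind of plot
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data.ravel(), kde=True, ax=ax[0])
ax[0].set_title("Original data")
sns.histplot(scaled_out_data.ravel(), kde=True, ax=ax[1])
ax[1].set_title("Scaled data")
plt.show()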
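Min-max scaling maps each value `x` to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. A minimal sketch on three made-up values, checking that `MinMaxScaler` agrees with the formula computed by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three invented values in a single column.
values = np.array([[2.0], [4.0], [10.0]])

# The formula applied directly.
manual = (values - values.min()) / (values.max() - values.min())

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(scaled.ravel())
```

The shape of the distribution is unchanged by this transformation; only the range is.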

# The data doesn't contain any duplicates.

In [None]:
num_duplicates = data.duplicated().sum()
print("Number of duplicates in the data:", num_duplicates)

# Split the Data into the Target Variable (Life Ladder) and the Features

In [None]:
X = data.drop(['life_ladder'], axis=1)
y = data['life_ladder']


If there are categorical variables in the dataset, we need to encode them as numeric values so that the machine learning model can use them. In this dataset, we treat two columns as categorical: year and country_name.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode the "year" column
le = LabelEncoder()
X['year'] = le.fit_transform(X['year'])

# Encode the "country_name" column
le = LabelEncoder()
X['country_name'] = le.fit_transform(X['country_name'])
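`LabelEncoder` assigns each country an integer, and a linear model will read those integers as an ordering between countries. One alternative, shown here only as a sketch on a made-up frame, is one-hot encoding with `pd.get_dummies`; it is not what the notebook uses.

```python
import pandas as pd

# Toy frame; the GDP values are invented for illustration.
toy = pd.DataFrame({
    "country_name": ["Finland", "Denmark", "Finland", "Chad"],
    "log_gdp_per_capita": [10.8, 10.9, 10.7, 7.4],
})

# One indicator column per country, so no order is implied between countries.
encoded = pd.get_dummies(toy, columns=["country_name"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The trade-off is width: with many countries, one-hot encoding adds one column per country, whereas label encoding keeps a single column.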


It would be very useful for our data to also break down country_name data by region.

In [None]:
# Create a dictionary mapping each country to its region
region_dict = {
    'Western Europe': ['Iceland', 'Denmark', 'Switzerland', 'Netherlands', 'Norway', 'Sweden', 'Luxembourg', 'Austria', 'Ireland', 'Finland', 'Germany', 'Belgium', 'United Kingdom', 'France', 'Spain', 'Malta', 'Italy', 'Cyprus', 'Portugal', 'Greece'],
    'Central and Eastern Europe': ['Estonia', 'Czech Republic', 'Slovenia', 'Lithuania', 'Latvia', 'Slovakia', 'Poland', 'Romania', 'Hungary', 'Bulgaria', 'Croatia', 'Serbia', 'Montenegro', 'Bosnia and Herzegovina', 'North Macedonia', 'Kosovo'],
    'Southeast Asia': ['Singapore', 'Thailand', 'Philippines', 'Indonesia', 'Vietnam', 'Malaysia', 'Myanmar', 'Cambodia', 'Laos'],
    'East Asia': ['Japan', 'South Korea', 'Hong Kong S.A.R. of China', 'Taiwan Province of China', 'Mongolia', 'China'],
    'South Asia': ['Maldives', 'Nepal', 'Bangladesh', 'Pakistan', 'Sri Lanka', 'India', 'Afghanistan'],
    'Sub-Saharan Africa': ['Mauritius', 'South Africa', 'Sudan', 'Ghana', 'Nigeria', 'Sierra Leone', 'Zambia', 'Congo (Kinshasa)', 'Ethiopia', 'Uganda', 'Kenya', 'Mali', 'Senegal', 'Gabon', 'Niger', 'Burkina Faso', 'Ivory Coast', 'Cameroon', 'Angola', 'Madagascar', 'Zimbabwe', 'Botswana', 'Malawi', 'Liberia', 'Rwanda', 'Togo', 'Tanzania', 'Central African Republic', 'South Sudan', 'Chad', 'Lesotho', 'Burundi', 'Congo (Brazzaville)'],
    'Middle East and North Africa': ['Israel', 'United Arab Emirates', 'Bahrain', 'Saudi Arabia', 'Kuwait', 'Libya', 'Iraq', 'Morocco', 'Algeria', 'Palestinian Territories', 'Jordan', 'Lebanon', 'Tunisia', 'Turkey', 'Iran', 'Egypt', 'Yemen', 'Syria']
}


In [None]:
# Create a function to map country to region
def map_country_to_region(country):
    for region, countries in region_dict.items():
        if country in countries:
            return region
    return None

# Apply the function to create the 'region' column
data['region'] = data['country_name'].apply(map_country_to_region)
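The lookup above scans every region list for each row. An equivalent approach is to invert the dictionary once into a country-to-region mapping and pass it to `Series.map`. A minimal sketch, using a small hypothetical table (`toy_regions`) rather than the full `region_dict`:

```python
import pandas as pd

# Small hypothetical table with the same shape as region_dict.
toy_regions = {
    "Western Europe": ["Finland", "Denmark"],
    "South Asia": ["Nepal", "India"],
}

# Invert once to country -> region, so each lookup is a single dict access.
country_to_region = {
    country: region
    for region, countries in toy_regions.items()
    for country in countries
}

countries = pd.Series(["Finland", "India", "Atlantis"])
regions = countries.map(country_to_region)  # names without a mapping become NaN
print(regions.tolist())
```

Note that in an inverted dictionary a country listed under two regions keeps only the last one, so comparing `len(country_to_region)` with the total number of listed names is a quick way to detect duplicates.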


In [None]:
# Drop rows with null values in the "Region" column
data.dropna(subset=['region'], inplace=True)

# Check the unique values in the "Region" column
print(data['region'].unique())

# Scale the data

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)


# Split the Data into Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
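In the cells above, the scaler is fitted on all rows before the split, so the test rows influence the scaling statistics. One way to avoid that, sketched here on synthetic data (the arrays `X_demo` and `y_demo` are made up and are not the notebook's data), is to split first and put the scaler inside a scikit-learn pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaler's mean and variance from X_tr only;
# X_te is transformed with those training statistics at predict/score time.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(test_r2)
```

The same pipeline object can be passed to cross-validation utilities, which then refit the scaler inside each fold.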

# Trying out different Regression Models

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, r2_score


# **Linear Regression Model**

In [None]:
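# Train and evaluate a simple linear regression model (a single feature).
# Column 2 of X is log_gdp_per_capita (column order of `data` with life_ladder dropped);
# picking this feature is an illustrative choice.
GDP_COL = 2
lr_model = LinearRegression()
lr_model.fit(X_train[:, [GDP_COL]], y_train)
lr_pred = lr_model.predict(X_test[:, [GDP_COL]])
lr_mse = mean_squared_error(y_test, lr_pred)
print(f"Simple Linear Regression MSE: {lr_mse}")
r2 = r2_score(y_test, lr_pred)
print("R-squared:", r2)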

**Multiple Linear Regression**

In [None]:
# Train and evaluate a multiple regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_pred)
print(f"Multiple Regression MSE: {lr_mse}")
r2 = r2_score(y_test, lr_pred)
print("R-squared:", r2)

**Polynomial Regression**

In [None]:
# Fit a polynomial regression model
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

In [None]:

# Train and evaluate a polynomial regression model
poly_model = LinearRegression()
poly_model.fit(X_poly_train, y_train)
poly_pred = poly_model.predict(X_poly_test)
poly_mse = mean_squared_error(y_test, poly_pred)
print(f"Polynomial Regression MSE: {poly_mse}")
r2 = r2_score(y_test, poly_pred)
print("R-squared:", r2)

**Cluster Model**

In [None]:
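# Train and evaluate a KMeans clustering model
kmeans_model = KMeans(n_clusters=3, random_state=42)
kmeans_model.fit(X_train)
# KMeans is unsupervised: predict() returns cluster labels (0-2), not happiness scores.
# To get a comparable prediction, use the mean training life_ladder of each row's cluster.
y_train_values = y_train.to_numpy()
cluster_means = np.array([y_train_values[kmeans_model.labels_ == k].mean()
                          for k in range(kmeans_model.n_clusters)])
kmeans_pred = cluster_means[kmeans_model.predict(X_test)]
kmeans_mse = mean_squared_error(y_test, kmeans_pred)
print(f"KMeans Clustering MSE: {kmeans_mse}")
r2 = r2_score(y_test, kmeans_pred)
print("R-squared:", r2)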

**Decision Tree Regression**

In [None]:
# Train and evaluate a decision tree regression model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_pred)
r2 = r2_score(y_test, dt_pred)
print(f"Decision Tree Regression MSE: {dt_mse}")
print("R-squared:", r2)

*The Decision Tree Regression model had the lowest mean squared error, 0.2689, and the highest R-squared score, 0.7877.*
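The comparison above rests on a single train/test split, and a fully grown decision tree can vary a lot between splits. A sketch of comparing models with 5-fold cross-validation instead, on synthetic data (`X_demo` and `y_demo` are made up and deliberately nonlinear, so the numbers say nothing about the happiness data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data standing in for the real features and target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = (
    np.sin(X_demo[:, 0])
    + 0.5 * X_demo[:, 1] ** 2
    + rng.normal(scale=0.1, size=300)
)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
}

# scikit-learn reports negated MSE so that higher is better; flip the sign back.
cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()

print(cv_mse)
```

Averaging over folds gives a steadier estimate than one split, at the cost of fitting each model five times.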


## Model Evaluation

In this analysis, we explored the relationship between life satisfaction and various socio-economic factors. We started by examining the correlation between each feature and the target variable. We found that factors such as GDP per capita, social support, and healthy life expectancy were positively correlated with life satisfaction, while factors such as corruption and negative affect were negatively correlated.

We built several regression models to predict life satisfaction from these features. The Decision Tree Regression model fit best, with the lowest mean squared error and the highest R-squared on the test set.

Overall, our analysis suggests that GDP per capita, social support, and healthy life expectancy are among the strongest predictors of life satisfaction, and that a multiple or polynomial regression model can also predict life satisfaction with reasonable accuracy. Clustering algorithms can additionally provide insight into how countries group by life satisfaction and socio-economic factors.
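One way to back a claim about which features predict best is to inspect the fitted tree's `feature_importances_`. The sketch below uses synthetic data with hypothetical column names (`strong_feature`, `weak_feature`, `noise_feature`), so it only illustrates the mechanics, not the ranking in the happiness data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic features: one drives the target strongly, one weakly, one not at all.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "strong_feature": rng.normal(size=500),
    "weak_feature": rng.normal(size=500),
    "noise_feature": rng.normal(size=500),
})
target = (
    3.0 * demo["strong_feature"]
    + 0.3 * demo["weak_feature"]
    + rng.normal(scale=0.1, size=500)
)

tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(demo, target)

# Importances sum to 1; larger values mean the feature reduced more error in the splits.
importances = pd.Series(
    tree.feature_importances_, index=demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be misleading when features are strongly correlated, as GDP and life expectancy are here, so they are best read as a rough ranking.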




