# Project Understanding








## Project Objective


The main objective of analyzing the World Happiness Report 2024 dataset is to understand the factors that influence happiness levels across countries and to identify the key differences between the countries with the highest and lowest happiness scores. The analysis aims to provide insight into how economic, social, and well-being factors contribute to people's happiness, and how countries can improve their citizens' quality of life.

## Assess Situation


The world happiness index reflects quality of life across countries and is shaped by economic, social, and well-being factors. Countries with strong economies, good social support, and individual freedom tend to be happier, while countries with high corruption and limited access to basic services often score lower.

Although gaps exist, data-driven policy can help improve people's well-being. By analyzing this dataset, stakeholders can understand the main factors that influence happiness and design strategies to improve quality of life globally.

## Project Goals

This analysis aims to understand the factors that influence happiness across countries, identify the countries with the highest and lowest scores, and examine the key differences between them. It can also help reveal global trends and the main factors that contribute to people's well-being. With a data-driven approach, the results can offer useful insight for policymakers and stakeholders working to improve quality of life globally.

## Project Plan

This project analyzes global happiness levels, identifies the countries with the highest and lowest scores, and examines the factors that influence happiness. The process covers data collection and cleaning, exploration of historical trends, and analysis of the relationships between the main variables. The analysis uses Python and a visualization tool such as Tableau or Power BI. The final deliverables are a report on global happiness trends, insight into the main contributing factors, and strategic recommendations for policymakers and stakeholders.

# Data Understanding

The dataset I use comes from Kaggle (https://www.kaggle.com/) and is titled World Happiness Report 2024:
https://www.kaggle.com/datasets/jainaru/world-happiness-report-2024-yearly-updated?select=World-happiness-report-updated_2024.csv

In [176]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/AVD PRAKTIKUM/World-happiness-report-updated_2024.csv", encoding="latin1")

df


Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.350,0.451,50.500,0.718,0.164,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.800,0.679,0.187,0.850,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.100,0.600,0.118,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.400,0.496,0.160,0.731,0.480,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.700,0.531,0.234,0.776,0.614,0.268
...,...,...,...,...,...,...,...,...,...,...,...
2358,Zimbabwe,2019,2.694,7.698,0.759,53.100,0.632,-0.051,0.831,0.658,0.235
2359,Zimbabwe,2020,3.160,7.596,0.717,53.575,0.643,0.003,0.789,0.661,0.346
2360,Zimbabwe,2021,3.155,7.657,0.685,54.050,0.668,-0.079,0.757,0.610,0.242
2361,Zimbabwe,2022,3.296,7.670,0.666,54.525,0.652,-0.073,0.753,0.641,0.191


## Data Structure Inspection

1. df.info() is used to check the data type of each column. I use only df.info() here and not df.dtypes, because df.info() already reports the dtypes together with the non-null counts, which I consider sufficient.

In [177]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2363 entries, 0 to 2362
Data columns (total 11 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Country name                      2363 non-null   object 
 1   year                              2363 non-null   int64  
 2   Life Ladder                       2363 non-null   float64
 3   Log GDP per capita                2335 non-null   float64
 4   Social support                    2350 non-null   float64
 5   Healthy life expectancy at birth  2300 non-null   float64
 6   Freedom to make life choices      2327 non-null   float64
 7   Generosity                        2282 non-null   float64
 8   Perceptions of corruption         2238 non-null   float64
 9   Positive affect                   2339 non-null   float64
 10  Negative affect                   2347 non-null   float64
dtypes: float64(9), int64(1), object(1)
memory usage: 203.2+ KB


2. df.head() shows the first rows of the data.

In [178]:
df.head()

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.35,0.451,50.5,0.718,0.164,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.8,0.679,0.187,0.85,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.1,0.6,0.118,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.4,0.496,0.16,0.731,0.48,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.7,0.531,0.234,0.776,0.614,0.268


3. df.tail() shows the last rows of the data.

In [179]:
df.tail()

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
2358,Zimbabwe,2019,2.694,7.698,0.759,53.1,0.632,-0.051,0.831,0.658,0.235
2359,Zimbabwe,2020,3.16,7.596,0.717,53.575,0.643,0.003,0.789,0.661,0.346
2360,Zimbabwe,2021,3.155,7.657,0.685,54.05,0.668,-0.079,0.757,0.61,0.242
2361,Zimbabwe,2022,3.296,7.67,0.666,54.525,0.652,-0.073,0.753,0.641,0.191
2362,Zimbabwe,2023,3.572,7.679,0.694,55.0,0.735,-0.069,0.757,0.61,0.179


4. df.shape returns the number of rows and columns.



In [180]:
df.shape

(2363, 11)

5. df.describe() summarizes each numeric column of the dataset I want to use (count, mean, standard deviation, minimum, quartiles, and maximum).

In [181]:
df.describe()

Unnamed: 0,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
count,2363.0,2363.0,2335.0,2350.0,2300.0,2327.0,2282.0,2238.0,2339.0,2347.0
mean,2014.76386,5.483566,9.399671,0.809369,63.401828,0.750282,9.8e-05,0.743971,0.651882,0.273151
std,5.059436,1.125522,1.152069,0.121212,6.842644,0.139357,0.161388,0.184865,0.10624,0.087131
min,2005.0,1.281,5.527,0.228,6.72,0.228,-0.34,0.035,0.179,0.083
25%,2011.0,4.647,8.5065,0.744,59.195,0.661,-0.112,0.687,0.572,0.209
50%,2015.0,5.449,9.503,0.8345,65.1,0.771,-0.022,0.7985,0.663,0.262
75%,2019.0,6.3235,10.3925,0.904,68.5525,0.862,0.09375,0.86775,0.737,0.326
max,2023.0,8.019,11.676,0.987,74.6,0.985,0.7,0.983,0.884,0.705


6. Descriptive Statistics
Based on the df.describe() output, there are columns I plan to drop:
- Generosity, because its influence is too small
- Negative affect, because I consider Positive affect sufficient
- Perceptions of corruption, because I want to focus the analysis on economic and social factors.
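
A minimal sketch of how such a column drop could be written with `DataFrame.drop`. The frame `toy` and its values are hypothetical and only reuse the dataset's column names; the name `columns_to_drop` is also an assumption for illustration.

```python
import pandas as pd

# Hypothetical two-row frame that reuses the dataset's column names
toy = pd.DataFrame({
    "Country name": ["A", "B"],
    "Life Ladder": [7.1, 3.4],
    "Generosity": [0.10, -0.05],
    "Perceptions of corruption": [0.30, 0.85],
    "Positive affect": [0.75, 0.50],
    "Negative affect": [0.20, 0.35],
})

columns_to_drop = ["Generosity", "Negative affect", "Perceptions of corruption"]

# drop() returns a new frame; a misspelled name raises a KeyError by default
reduced = toy.drop(columns=columns_to_drop)
print(reduced.columns.tolist())
```

Because `drop` returns a copy, the original frame stays available for comparison until the result is assigned back.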

In [182]:
print(df.isnull().sum())


Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                   28
Social support                       13
Healthy life expectancy at birth     63
Freedom to make life choices         36
Generosity                           81
Perceptions of corruption           125
Positive affect                      24
Negative affect                      16
dtype: int64


# Data Preparation

The dataset I use comes from Kaggle (https://www.kaggle.com/) and is titled World Happiness Report 2024: https://www.kaggle.com/datasets/jainaru/world-happiness-report-2024-yearly-updated?select=World-happiness-report-updated_2024.csv

In [183]:
df

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.350,0.451,50.500,0.718,0.164,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.800,0.679,0.187,0.850,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.100,0.600,0.118,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.400,0.496,0.160,0.731,0.480,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.700,0.531,0.234,0.776,0.614,0.268
...,...,...,...,...,...,...,...,...,...,...,...
2358,Zimbabwe,2019,2.694,7.698,0.759,53.100,0.632,-0.051,0.831,0.658,0.235
2359,Zimbabwe,2020,3.160,7.596,0.717,53.575,0.643,0.003,0.789,0.661,0.346
2360,Zimbabwe,2021,3.155,7.657,0.685,54.050,0.668,-0.079,0.757,0.610,0.242
2361,Zimbabwe,2022,3.296,7.670,0.666,54.525,0.652,-0.073,0.753,0.641,0.191


## Data Checks

1. Missing values

In [184]:
print((df.isna().sum() / len(df)) * 100)

Country name                        0.000000
year                                0.000000
Life Ladder                         0.000000
Log GDP per capita                  1.184934
Social support                      0.550148
Healthy life expectancy at birth    2.666102
Freedom to make life choices        1.523487
Generosity                          3.427846
Perceptions of corruption           5.289886
Positive affect                     1.015658
Negative affect                     0.677105
dtype: float64


The command above computes the percentage of missing values in each column. The columns with the highest share of missing values are:
- Perceptions of corruption (about 5.29%)
- Generosity (about 3.43%)
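
To read the ranking directly instead of scanning the output, the same percentages can be sorted. This is a small sketch on a hypothetical frame `toy`; it relies on the fact that `isna().mean()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps
toy = pd.DataFrame({
    "Generosity": [0.1, np.nan, 0.2, np.nan],
    "Social support": [0.8, 0.9, np.nan, 0.7],
    "Life Ladder": [5.0, 6.0, 4.0, 7.0],
})

# Fraction of missing values per column, as a percentage, largest first
null_ratio = toy.isna().mean().mul(100).sort_values(ascending=False)
print(null_ratio)
```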


Imputation of Missing Values

In [185]:
print(df['Perceptions of corruption'].dropna().describe())
print(df['Generosity'].dropna().describe())
print(df['Log GDP per capita'].dropna().describe())
print(df['Social support'].dropna().describe())
print(df['Healthy life expectancy at birth'].dropna().describe())
print(df['Freedom to make life choices'].dropna().describe())
print(df['Positive affect'].dropna().describe())
print(df['Negative affect'].dropna().describe())


count    2238.000000
mean        0.743971
std         0.184865
min         0.035000
25%         0.687000
50%         0.798500
75%         0.867750
max         0.983000
Name: Perceptions of corruption, dtype: float64
count    2282.000000
mean        0.000098
std         0.161388
min        -0.340000
25%        -0.112000
50%        -0.022000
75%         0.093750
max         0.700000
Name: Generosity, dtype: float64
count    2335.000000
mean        9.399671
std         1.152069
min         5.527000
25%         8.506500
50%         9.503000
75%        10.392500
max        11.676000
Name: Log GDP per capita, dtype: float64
count    2350.000000
mean        0.809369
std         0.121212
min         0.228000
25%         0.744000
50%         0.834500
75%         0.904000
max         0.987000
Name: Social support, dtype: float64
count    2300.000000
mean       63.401828
std         6.842644
min         6.720000
25%        59.195000
50%        65.100000
75%        68.552500
max        74.600000
N

The output above shows the descriptive statistics of the non-missing values in each column that has gaps. Next, I will impute the missing values in all of these columns.

In [186]:
# Columns that contain missing values
cols_with_na = [
    'Generosity', 'Social support', 'Log GDP per capita',
    'Healthy life expectancy at birth', 'Perceptions of corruption',
    'Positive affect', 'Freedom to make life choices', 'Negative affect',
]

# Fill each gap with the column mean; Series.mean() already skips NaN
for col in cols_with_na:
    df[col] = df[col].fillna(df[col].mean())

The cell above imputes the missing values. Here I fill every affected column at once, using the mean of each column.
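
The notebook fills gaps with the global column mean. Since the data is a country-year panel, one possible alternative is to fill with each country's own mean and fall back to the global mean only when a country has no values at all. The sketch below compares the two on a hypothetical frame `toy`; it is an illustration of that alternative, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: two countries, one gap each
toy = pd.DataFrame({
    "Country name": ["A", "A", "A", "B", "B"],
    "Social support": [0.9, np.nan, 0.7, 0.4, np.nan],
})
col = "Social support"

# Global mean imputation, the approach used in the notebook cell
global_filled = toy[col].fillna(toy[col].mean())

# Alternative: per-country mean first, global mean as a fallback
country_mean = toy.groupby("Country name")[col].transform("mean")
group_filled = toy[col].fillna(country_mean).fillna(toy[col].mean())

print(pd.DataFrame({"global": global_filled, "by_country": group_filled}))
```

With the global mean, both gaps receive the same value; with the per-country mean, each gap stays close to the level of its own country.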

In [187]:
pd.DataFrame(df.isna().sum() / len(df) * 100, columns=['Null Ratio %'])

Unnamed: 0,Null Ratio %
Country name,0.0
year,0.0
Life Ladder,0.0
Log GDP per capita,0.0
Social support,0.0
Healthy life expectancy at birth,0.0
Freedom to make life choices,0.0
Generosity,0.0
Perceptions of corruption,0.0
Positive affect,0.0


The table above shows the result of the imputation: the listed columns no longer contain missing values.

2. Duplicated Values

In [188]:
df[df.duplicated()]

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect


The command above checks for rows that appear more than once. The result is empty, so this dataset has no duplicated rows.
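
`df.duplicated()` only flags rows that are identical in every column. For a country-year panel, it can also be useful to check whether the same country and year appear twice with different values. A minimal sketch on a hypothetical frame `toy`, using the `subset` and `keep` parameters of `DataFrame.duplicated`:

```python
import pandas as pd

# Hypothetical frame: country A appears twice for 2020 with different scores
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "A"],
    "year": [2020, 2021, 2020, 2020],
    "Life Ladder": [5.0, 5.2, 4.1, 5.1],
})

# Fully identical rows (the default check)
full_dupes = toy[toy.duplicated()]

# Rows that share the same country-year key; keep=False marks every member
key_dupes = toy[toy.duplicated(subset=["Country name", "year"], keep=False)]

print(len(full_dupes), key_dupes.index.tolist())
```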

3. Outliers

In [189]:
results = []

# Numeric columns only; the IQR rule does not apply to text columns
cols = df.select_dtypes(include=['float64', 'int64']).columns

for col in cols:
    # IQR rule: flag values more than 1.5 * IQR outside the quartiles
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    percent_outliers = (len(outliers) / len(df)) * 100
    results.append({'Kolom': col, 'Persentase Outliers': percent_outliers})

results_df = pd.DataFrame(results)
results_df.set_index('Kolom', inplace=True)
results_df = results_df.rename_axis(None, axis=0).rename_axis('Kolom', axis=1)

display(results_df)

Kolom,Persentase Outliers
year,0.0
Life Ladder,0.084638
Log GDP per capita,0.042319
Social support,2.031316
Healthy life expectancy at birth,1.142615
Freedom to make life choices,0.677105
Generosity,1.86204
Perceptions of corruption,9.521794
Positive affect,0.380872
Negative affect,1.311892


Based on the results above, I will impute the outliers in the columns where they exceed 1% of the rows.


Outlier Imputation

In [190]:
columns_to_impute = ["Social support", "Healthy life expectancy at birth","Generosity","Perceptions of corruption","Negative affect"]

for col in columns_to_impute:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR


    # Cap values outside the IQR bounds at the nearest bound
    df.loc[:, col] = df[col].clip(lower=lower_bound, upper=upper_bound)

In [191]:
results = []

# Numeric columns only; the IQR rule does not apply to text columns
cols = df.select_dtypes(include=['float64', 'int64']).columns

for col in cols:
    # IQR rule: flag values more than 1.5 * IQR outside the quartiles
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    percent_outliers = (len(outliers) / len(df)) * 100
    results.append({'Kolom': col, 'Persentase Outliers': percent_outliers})

results_df = pd.DataFrame(results)
results_df.set_index('Kolom', inplace=True)
results_df = results_df.rename_axis(None, axis=0).rename_axis('Kolom', axis=1)

display(results_df)

Kolom,Persentase Outliers
year,0.0
Life Ladder,0.084638
Log GDP per capita,0.042319
Social support,0.0
Healthy life expectancy at birth,0.0
Freedom to make life choices,0.677105
Generosity,0.0
Perceptions of corruption,0.0
Positive affect,0.380872
Negative affect,0.0


The table above shows the result after imputing the outliers: every column that was treated now has 0% outliers, while the untreated columns keep their original percentages.
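
The outlier check is run twice with the same loop, so the logic could be wrapped in small helpers. The sketch below does this for a single hypothetical series `toy`; the function names `iqr_bounds` and `outlier_percent` are assumptions for illustration and do not appear in the notebook. The bounds follow the same rule as the cells above: `q1 - 1.5 * iqr` and `q3 + 1.5 * iqr`.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the lower and upper bounds of the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_percent(series, k=1.5):
    """Percentage of values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(series, k)
    return ((series < lower) | (series > upper)).mean() * 100

# Hypothetical series with one extreme value
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

lower, upper = iqr_bounds(toy)
before = outlier_percent(toy)

# Cap values at the bounds, as the notebook does with clip()
capped = toy.clip(lower=lower, upper=upper)
after = outlier_percent(capped)

print(lower, upper, before, after)
```

Here the quartiles are 2 and 4, so the bounds are -1 and 7; the value 100 is capped to 7 and no longer counts as an outlier because the comparison is strict.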

In [192]:
df

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.350,0.504,50.500,0.718,0.164,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.800,0.679,0.187,0.850,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.100,0.600,0.118,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.400,0.496,0.160,0.731,0.480,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.700,0.531,0.234,0.776,0.614,0.268
...,...,...,...,...,...,...,...,...,...,...,...
2358,Zimbabwe,2019,2.694,7.698,0.759,53.100,0.632,-0.051,0.831,0.658,0.235
2359,Zimbabwe,2020,3.160,7.596,0.717,53.575,0.643,0.003,0.789,0.661,0.346
2360,Zimbabwe,2021,3.155,7.657,0.685,54.050,0.668,-0.079,0.757,0.610,0.242
2361,Zimbabwe,2022,3.296,7.670,0.666,54.525,0.652,-0.073,0.753,0.641,0.191


4. Inconsistent Values

I skip this step because the values in my data are already consistent.
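
One way such consistency could be verified for the country names is to compare the number of unique values before and after normalizing whitespace and letter case. This is a sketch on a hypothetical frame `toy`, not a check that was run in this notebook.

```python
import pandas as pd

# Hypothetical frame with one spelling variant of the same country
toy = pd.DataFrame({
    "Country name": ["Indonesia", "indonesia ", "Finland"],
    "year": [2022, 2023, 2023],
})

raw_unique = toy["Country name"].nunique()

# Normalize: trim surrounding whitespace and ignore letter case
normalized = toy["Country name"].str.strip().str.casefold()
normalized_unique = normalized.nunique()

# A lower count after normalizing signals spelling or spacing variants
has_variants = normalized_unique < raw_unique
print(raw_unique, normalized_unique, has_variants)
```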

In [193]:
df

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.350,0.504,50.500,0.718,0.164,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.800,0.679,0.187,0.850,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.100,0.600,0.118,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.400,0.496,0.160,0.731,0.480,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.700,0.531,0.234,0.776,0.614,0.268
...,...,...,...,...,...,...,...,...,...,...,...
2358,Zimbabwe,2019,2.694,7.698,0.759,53.100,0.632,-0.051,0.831,0.658,0.235
2359,Zimbabwe,2020,3.160,7.596,0.717,53.575,0.643,0.003,0.789,0.661,0.346
2360,Zimbabwe,2021,3.155,7.657,0.685,54.050,0.668,-0.079,0.757,0.610,0.242
2361,Zimbabwe,2022,3.296,7.670,0.666,54.525,0.652,-0.073,0.753,0.641,0.191


5. Construct Data

No data construction is needed here, because the dataset I have is already sufficient for the analysis.

In [194]:
df

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.350,0.504,50.500,0.718,0.164,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.800,0.679,0.187,0.850,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.100,0.600,0.118,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.400,0.496,0.160,0.731,0.480,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.700,0.531,0.234,0.776,0.614,0.268
...,...,...,...,...,...,...,...,...,...,...,...
2358,Zimbabwe,2019,2.694,7.698,0.759,53.100,0.632,-0.051,0.831,0.658,0.235
2359,Zimbabwe,2020,3.160,7.596,0.717,53.575,0.643,0.003,0.789,0.661,0.346
2360,Zimbabwe,2021,3.155,7.657,0.685,54.050,0.668,-0.079,0.757,0.610,0.242
2361,Zimbabwe,2022,3.296,7.670,0.666,54.525,0.652,-0.073,0.753,0.641,0.191


6. Data Reduction

I do not think data reduction is necessary here, because the dataset I have is already sufficient for the analysis. This fits the topic I chose, which compares happiness levels between the best- and worst-ranked countries based on the World Happiness Report data, without having to reduce the data.
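
Because the comparison targets the best- and worst-scoring countries, the relevant rows can be selected at analysis time without removing anything from the dataset. The sketch below assumes the comparison uses the most recent year and one country per group; both choices, and the frame `toy`, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical panel with three countries and two years
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B", "C", "C"],
    "year": [2022, 2023, 2022, 2023, 2022, 2023],
    "Life Ladder": [7.5, 7.7, 3.2, 3.0, 5.5, 5.6],
})

# Keep only the most recent year (assumed basis for the comparison)
latest = toy[toy["year"] == toy["year"].max()]

# Highest and lowest Life Ladder scores in that year
top = latest.nlargest(1, "Life Ladder")
bottom = latest.nsmallest(1, "Life Ladder")

print(top["Country name"].tolist(), bottom["Country name"].tolist())
```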

In [195]:
df

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.350,0.504,50.500,0.718,0.164,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.800,0.679,0.187,0.850,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.100,0.600,0.118,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.400,0.496,0.160,0.731,0.480,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.700,0.531,0.234,0.776,0.614,0.268
...,...,...,...,...,...,...,...,...,...,...,...
2358,Zimbabwe,2019,2.694,7.698,0.759,53.100,0.632,-0.051,0.831,0.658,0.235
2359,Zimbabwe,2020,3.160,7.596,0.717,53.575,0.643,0.003,0.789,0.661,0.346
2360,Zimbabwe,2021,3.155,7.657,0.685,54.050,0.668,-0.079,0.757,0.610,0.242
2361,Zimbabwe,2022,3.296,7.670,0.666,54.525,0.652,-0.073,0.753,0.641,0.191
