## Introduction

This notebook explores the World Happiness Report dataset, which includes data on happiness scores across various countries. The goal is to visualize and analyze the relationships between happiness scores and factors like GDP, social support, life expectancy, and perceptions of corruption.

We will perform exploratory data analysis (EDA) to understand key patterns in the dataset, create various visualizations, and build interactive components using Streamlit to display our findings.

Necessary steps:
1. Import the necessary libraries.
2. Load the dataset (World Happiness Report) into a Pandas DataFrame.
3. Perform basic exploratory data analysis (EDA).
4. Create visualizations like histograms and scatter plots using Plotly Express.
5. Use Streamlit components to create an interactive dashboard.

## Data Description
The World Happiness Report dataset contains several columns that describe various factors affecting global happiness. Below is a description of each column:

- **Country name**: The name of the country.
- **Ladder score**: The happiness score for each country (dependent variable).
- **upperwhisker** / **lowerwhisker**: The upper and lower bounds of the 95% confidence interval around the ladder score.
- **Explained by: Log GDP per capita**: The part of the ladder score attributed to log GDP per capita.
- **Explained by: Social support**: The part of the ladder score attributed to social support.
- **Explained by: Healthy life expectancy**: The part of the ladder score attributed to healthy life expectancy.
- **Explained by: Freedom to make life choices**: The part of the ladder score attributed to perceived freedom in making life choices.
- **Explained by: Generosity**: The part of the ladder score attributed to generosity.
- **Explained by: Perceptions of corruption**: The part of the ladder score attributed to how corruption is perceived in each country.
- **Dystopia + residual**: The baseline (Dystopia) score plus the part of the ladder score that the six factors above do not explain.
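
The `Explained by: ` prefix makes the column names long to type and crowds plot labels. The following is a minimal sketch of one way to shorten them, assuming the column names listed above. The two-row frame (values taken from the Finland and Denmark rows of the dataset) and the names `PREFIX`, `sample`, and `short` are illustrative only and are not part of the notebook.

```python
import pandas as pd

PREFIX = "Explained by: "

# Small stand-in frame; the real notebook would use the loaded DataFrame instead
sample = pd.DataFrame({
    "Country name": ["Finland", "Denmark"],
    "Ladder score": [7.741, 7.583],
    "Explained by: Log GDP per capita": [1.844, 1.908],
    "Explained by: Social support": [1.572, 1.520],
})

# Drop the prefix where present; other column names pass through unchanged
short = sample.rename(
    columns=lambda c: c[len(PREFIX):] if c.startswith(PREFIX) else c
)
print(list(short.columns))
```

`rename` returns a new DataFrame, so the original column names remain available on `sample` if later cells depend on them.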


### Importing Required Libraries

In this cell, we import the libraries needed to work with the data and create visualizations:
- **Pandas** (`import pandas as pd`): Used for data manipulation and loading the CSV file.
- **Plotly Express** (`import plotly.express as px`): Used to create visualizations like scatter plots and histograms.
- **OS** (`import os`): Used for interacting with the operating system, like changing the working directory if necessary.
- **Streamlit** (`import streamlit as st`): Used to build the interactive components of the dashboard.

These libraries will help us load, process, and visualize the data effectively.

In [24]:
import pandas as pd 

import plotly.express as px

import os

import streamlit as st


### Changing Working Directory
We will now change the working directory to where our data files are stored. This ensures that the dataset can be loaded correctly.

In [None]:
os.chdir('/Users/tamauriolive/Projects/software-development-tools-project')
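
Calling `os.chdir` with an absolute path ties the notebook to one machine. As a hedged alternative, the sketch below builds the file path explicitly with `pathlib` instead of changing directories. The `data_path` helper is an assumption introduced for illustration; it is not part of the original notebook.

```python
from pathlib import Path

def data_path(filename, base_dir=None):
    """Return the path to a data file, defaulting to the current directory."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return base / filename

csv_path = data_path("WHR2024.csv")
print(csv_path.name)
print(csv_path.exists())  # True only if the file sits in the current directory
```

The resulting `Path` can be passed directly to `pd.read_csv`, so the working directory never needs to change.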


### Loading the Dataset
In this step, we will load the **World Happiness Report** dataset into a Pandas DataFrame using the `read_csv` function. This will allow us to analyze and visualize the data. Ensure that the dataset file is in the correct path or adjust the file path accordingly.


In [6]:
import pandas as pd
df = pd.read_csv('WHR2024.csv')
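
If the file is missing or has unexpected columns, later cells fail with less obvious errors. A minimal sketch of an early check follows. The `load_happiness` function and the `REQUIRED` set are hypothetical names, and an in-memory string stands in for `WHR2024.csv` so the example runs without the file.

```python
import io
import pandas as pd

REQUIRED = {"Country name", "Ladder score"}

def load_happiness(source):
    """Read the CSV and fail early if a required column is missing."""
    frame = pd.read_csv(source)
    missing = REQUIRED - set(frame.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return frame

# Stand-in for 'WHR2024.csv' so the sketch runs without the file
csv_text = "Country name,Ladder score\nFinland,7.741\nDenmark,7.583\n"
demo = load_happiness(io.StringIO(csv_text))
print(demo.shape)
```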

### Data Overview
Now that the dataset is loaded, we display the first few rows to verify that it loaded correctly. This lets us check the structure of the dataset, the column names, and the first few records. Note that `df.head()` shows values only; data types are checked later with `df.info()`.


In [7]:
# Show the first few rows of the dataset to check if it's loaded properly
df.head()


|   | Country name | Ladder score | upperwhisker | lowerwhisker | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption | Dystopia + residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Finland | 7.741 | 7.815 | 7.667 | 1.844 | 1.572 | 0.695 | 0.859 | 0.142 | 0.546 | 2.082 |
| 1 | Denmark | 7.583 | 7.665 | 7.5 | 1.908 | 1.52 | 0.699 | 0.823 | 0.204 | 0.548 | 1.881 |
| 2 | Iceland | 7.525 | 7.618 | 7.433 | 1.881 | 1.617 | 0.718 | 0.819 | 0.258 | 0.182 | 2.05 |
| 3 | Sweden | 7.344 | 7.422 | 7.267 | 1.878 | 1.501 | 0.724 | 0.838 | 0.221 | 0.524 | 1.658 |
| 4 | Israel | 7.341 | 7.405 | 7.277 | 1.803 | 1.513 | 0.74 | 0.641 | 0.153 | 0.193 | 2.298 |


### Data Verification
By examining the first few rows, we can see that the dataset contains information on various factors like **GDP per capita**, **social support**, and **healthy life expectancy** for different countries. The columns and values appear to be loaded correctly, and we can now proceed with further exploration and analysis.


### Checking for Missing Values

In this step, we will check for any missing (NaN) values in the dataset. Missing values can impact data analysis, so identifying and addressing them early on is essential. We'll use the `isnull()` function to detect missing values and `sum()` to count them in each column.


In [10]:

# Check for any missing values in the dataset
df.isnull().sum()

Country name                                  0
Ladder score                                  0
upperwhisker                                  0
lowerwhisker                                  0
Explained by: Log GDP per capita              3
Explained by: Social support                  3
Explained by: Healthy life expectancy         3
Explained by: Freedom to make life choices    3
Explained by: Generosity                      3
Explained by: Perceptions of corruption       3
Dystopia + residual                           3
dtype: int64

### Results of Missing Value Check

The output shows that there are missing values (NaN) in the following columns:

- **Explained by: Log GDP per capita**: 3 missing values
- **Explained by: Social support**: 3 missing values
- **Explained by: Healthy life expectancy**: 3 missing values
- **Explained by: Freedom to make life choices**: 3 missing values
- **Explained by: Generosity**: 3 missing values
- **Explained by: Perceptions of corruption**: 3 missing values
- **Dystopia + residual**: 3 missing values

Missing values in these columns need to be handled, either by filling them in (e.g., using the mean, median, or a model-based approach) or removing the rows containing them, depending on the amount of missing data and the specific requirements of the analysis.
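
Before choosing between filling and dropping, it can help to see which rows carry the missing values and what share of the dataset they represent. The sketch below shows one way to do that on a made-up three-row frame; the country labels and values are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [6.0, 5.0, 4.0],
    "Explained by: Generosity": [0.2, np.nan, 0.1],
    "Dystopia + residual": [1.9, np.nan, 1.5],
})

# Rows with at least one NaN in any column
incomplete = toy[toy.isnull().any(axis=1)]
print(incomplete["Country name"].tolist())

# Share of rows affected, useful when weighing fill versus drop
share_missing = len(incomplete) / len(toy)
print(round(share_missing, 3))
```

If every incomplete row is missing the same set of columns, the gaps likely come from a few countries with no factor data at all, which is worth knowing before imputing.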


## Filling Missing Values with Mean

In this step, we will address the missing values in the dataset. The columns with missing values are:

- **Explained by: Log GDP per capita**
- **Explained by: Social support**
- **Explained by: Healthy life expectancy**
- **Explained by: Freedom to make life choices**
- **Explained by: Generosity**
- **Explained by: Perceptions of corruption**
- **Dystopia + residual**

For these columns, we will fill the missing values with the **mean** of each column. The mean is chosen because it is a standard technique for filling missing values when the data is roughly symmetric and doesn't have extreme outliers.
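
Whether the mean is a reasonable fill value depends on the shape of each column. A rough sketch of one check is shown here: compare the mean with the median and look at the sample skewness from pandas' `skew()`. The two series are made-up values chosen to contrast a symmetric column with a skewed one.

```python
import numpy as np
import pandas as pd

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 20.0, np.nan])

for name, s in [("symmetric", symmetric), ("skewed", skewed)]:
    # mean(), median() and skew() all skip NaN by default
    print(name, s.mean(), s.median(), round(s.skew(), 2))

# For the skewed column, the median is less affected by the extreme value
filled = skewed.fillna(skewed.median())
```

When the mean and median are close and the skewness is near zero, mean imputation and median imputation give nearly the same result.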


In [12]:
# Fill missing values in each affected column with that column's mean
cols_with_missing = [
    'Explained by: Log GDP per capita',
    'Explained by: Social support',
    'Explained by: Healthy life expectancy',
    'Explained by: Freedom to make life choices',
    'Explained by: Generosity',
    'Explained by: Perceptions of corruption',
    'Dystopia + residual',
]
# fillna with a Series of means fills each column with its own mean
df[cols_with_missing] = df[cols_with_missing].fillna(df[cols_with_missing].mean())

print(df)

         Country name  Ladder score  upperwhisker  lowerwhisker  \
0             Finland         7.741         7.815         7.667   
1             Denmark         7.583         7.665         7.500   
2             Iceland         7.525         7.618         7.433   
3              Sweden         7.344         7.422         7.267   
4              Israel         7.341         7.405         7.277   
..                ...           ...           ...           ...   
138  Congo (Kinshasa)         3.295         3.462         3.128   
139      Sierra Leone         3.245         3.366         3.124   
140           Lesotho         3.186         3.469         2.904   
141           Lebanon         2.707         2.797         2.616   
142       Afghanistan         1.721         1.775         1.667   

     Explained by: Log GDP per capita  Explained by: Social support  \
0                               1.844                         1.572   
1                               1.908                

## Results of Filling Missing Values

After filling the missing values with the mean, the dataset should no longer have any missing values in the specified columns. To confirm that the changes were successful, we can check the number of missing values again using the `isnull().sum()` method.

We expect that all the columns now have 0 missing values, and the dataset is ready for further analysis or visualization.


In [13]:
# Check for any missing values in the dataset
df.isnull().sum()


Country name                                  0
Ladder score                                  0
upperwhisker                                  0
lowerwhisker                                  0
Explained by: Log GDP per capita              0
Explained by: Social support                  0
Explained by: Healthy life expectancy         0
Explained by: Freedom to make life choices    0
Explained by: Generosity                      0
Explained by: Perceptions of corruption       0
Dystopia + residual                           0
dtype: int64

## Results of Checking Missing Values

All the columns now have 0 missing values, and the dataset is ready for further analysis or visualization.
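
Reading the counts by eye works for a one-off check. For a notebook that is re-run on new data, the same check can be expressed as an assertion that fails loudly. This is a small sketch under that assumption; `assert_no_missing` is a hypothetical helper and the two-column frame is made up.

```python
import numpy as np
import pandas as pd

def assert_no_missing(frame):
    """Raise AssertionError naming the offending columns if any NaN remains."""
    counts = frame.isnull().sum()
    bad = counts[counts > 0]
    assert bad.empty, f"Columns still containing NaN: {bad.to_dict()}"

toy = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
toy["a"] = toy["a"].fillna(toy["a"].mean())
assert_no_missing(toy)  # passes silently once the fill has been applied
```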


## Checking Dataset Info

The `df.info()` method provides a summary of the DataFrame, including the number of entries, column names, non-null counts, and data types for each column. This will help us confirm that there are no more missing values and verify the data types for each column.

We will use this to check if the missing values were filled correctly and ensure that all data types are correct before proceeding with further analysis or visualizations.



In [14]:
# Checking the DataFrame info after filling missing values
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 11 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country name                                143 non-null    object 
 1   Ladder score                                143 non-null    float64
 2   upperwhisker                                143 non-null    float64
 3   lowerwhisker                                143 non-null    float64
 4   Explained by: Log GDP per capita            143 non-null    float64
 5   Explained by: Social support                143 non-null    float64
 6   Explained by: Healthy life expectancy       143 non-null    float64
 7   Explained by: Freedom to make life choices  143 non-null    float64
 8   Explained by: Generosity                    143 non-null    float64
 9   Explained by: Perceptions of corruption     143 non-null    float64
 10  Dystopia + res

## Results of Dataset Info

The `df.info()` output reports 143 entries and 11 columns. Every column listed shows "143 non-null", so the columns that previously had missing values are now fully populated.

The data types are also as expected: `Country name` is stored as `object` (text), and the numerical columns shown are `float64`, which is appropriate for continuous data.
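
The dtype check can also be done in code rather than by reading the `info()` listing. The sketch below flags non-numeric columns and converts them, using a made-up frame in which one column is deliberately stored as text. `is_numeric_dtype` and `pd.to_numeric` are standard pandas functions; the frame itself is illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

toy = pd.DataFrame({
    "Country name": ["Finland", "Denmark"],
    "Ladder score": [7.741, 7.583],
    "upperwhisker": ["7.815", "7.665"],  # deliberately stored as text
})

# Every column except the country name is expected to be numeric
non_numeric = [
    c for c in toy.columns
    if c != "Country name" and not is_numeric_dtype(toy[c])
]
print(non_numeric)

# Convert offending columns; errors="coerce" turns unparseable text into NaN
for col in non_numeric:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")
```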



## Descriptive Statistics for Numerical Columns

The `df.describe()` method generates summary statistics for the numerical columns in the dataset, including:

- Count
- Mean
- Standard Deviation (std)
- Minimum value (min)
- 25th percentile (25%)
- Median (50%)
- 75th percentile (75%)
- Maximum value (max)

This is helpful to ensure that the data distribution is in the expected range and to check if the filled values are reasonable based on the rest of the data. It also helps in identifying any outliers or inconsistencies.


In [15]:
# Generate descriptive statistics for numerical columns
df.describe()


|   | Ladder score | upperwhisker | lowerwhisker | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption | Dystopia + residual |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 |
| mean | 5.52758 | 5.641175 | 5.413972 | 1.378807 | 1.134329 | 0.520886 | 0.620621 | 0.146271 | 0.154121 | 1.575914 |
| std | 1.170717 | 1.155008 | 1.187133 | 0.420584 | 0.329777 | 0.163171 | 0.160766 | 0.072661 | 0.124898 | 0.531751 |
| min | 1.721 | 1.775 | 1.667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.073 |
| 25% | 4.726 | 4.8455 | 4.606 | 1.079 | 0.9245 | 0.4 | 0.531 | 0.0925 | 0.069 | 1.317 |
| 50% | 5.785 | 5.895 | 5.674 | 1.403 | 1.217 | 0.549 | 0.632 | 0.138 | 0.122 | 1.64 |
| 75% | 6.416 | 6.5075 | 6.319 | 1.733 | 1.377 | 0.644 | 0.734 | 0.1915 | 0.191 | 1.8795 |
| max | 7.741 | 7.815 | 7.667 | 2.141 | 1.617 | 0.857 | 0.863 | 0.401 | 0.575 | 2.998 |
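
One common convention for flagging outliers from these statistics is the 1.5 × IQR rule. The sketch below applies it to `Ladder score` using the quartiles reported by `describe()` (4.726 and 6.416) and a handful of ladder scores that appear in the printed outputs of this notebook. The `iqr_fences` helper is hypothetical, and the rule is one convention among several, not a method the notebook itself uses.

```python
import pandas as pd

def iqr_fences(q1, q3, k=1.5):
    """Return the lower and upper fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of "Ladder score" as reported by df.describe()
low, high = iqr_fences(4.726, 6.416)
print(round(low, 3), round(high, 3))

# Ladder scores for Finland, Lesotho, Lebanon and Afghanistan from the printed output
scores = pd.Series([7.741, 3.186, 2.707, 1.721])
flagged = scores[(scores < low) | (scores > high)]
print(flagged.tolist())
```

On the full DataFrame the same filter would be applied to `df["Ladder score"]`, with the quartiles computed via `quantile(0.25)` and `quantile(0.75)`.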


### Explanation of Results:

- **Ladder Score**: The average happiness score is about 5.53, with values ranging from 1.72 to 7.74. Happiness levels vary widely across countries.

- **Upper Whisker**: The mean is 5.64. In the World Happiness Report data, `upperwhisker` is the upper bound of the 95% confidence interval around each country's ladder score, which is why it tracks the ladder score closely.

- **Lower Whisker**: The mean is 5.41. This is the corresponding lower bound of the confidence interval, not a box-plot whisker derived from the interquartile range.

- **Log GDP per capita**: These values are the part of each country's ladder score attributed to log GDP per capita, not GDP itself. The mean contribution is 1.38 (range 0.00 to 2.14), the largest average contribution of the six factors.

- **Social Support**: The mean contribution is 1.13 (range 0.00 to 1.62), the second largest of the six factors.

- **Healthy Life Expectancy**: The mean contribution is 0.52 (range 0.00 to 0.86). These are score contributions, not years of life.

- **Freedom to Make Life Choices**: The mean contribution is 0.62 (range 0.00 to 0.86).

- **Generosity**: The mean contribution is 0.15 (range 0.00 to 0.40), the smallest average contribution of the six factors.

- **Perceptions of Corruption**: The mean contribution is 0.15 (range 0.00 to 0.58). The median (0.12) sits below the mean, so a small number of countries have much larger contributions than the rest.

- **Dystopia + Residual**: The mean value is 1.58 (range -0.07 to 3.00). It combines the baseline Dystopia score with the part of each country's score that the six factors do not explain.
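
The factor columns and `Dystopia + residual` can be read as an additive breakdown of the ladder score. The sketch below checks this on the five rows shown by `df.head()`: summing the six `Explained by:` columns and `Dystopia + residual` should reproduce `Ladder score` up to rounding. The names `head`, `parts`, `Reconstructed`, and `Gap` are introduced here for illustration.

```python
import pandas as pd

# The five rows displayed by df.head() earlier in the notebook
head = pd.DataFrame({
    "Country name": ["Finland", "Denmark", "Iceland", "Sweden", "Israel"],
    "Ladder score": [7.741, 7.583, 7.525, 7.344, 7.341],
    "Explained by: Log GDP per capita": [1.844, 1.908, 1.881, 1.878, 1.803],
    "Explained by: Social support": [1.572, 1.520, 1.617, 1.501, 1.513],
    "Explained by: Healthy life expectancy": [0.695, 0.699, 0.718, 0.724, 0.740],
    "Explained by: Freedom to make life choices": [0.859, 0.823, 0.819, 0.838, 0.641],
    "Explained by: Generosity": [0.142, 0.204, 0.258, 0.221, 0.153],
    "Explained by: Perceptions of corruption": [0.546, 0.548, 0.182, 0.524, 0.193],
    "Dystopia + residual": [2.082, 1.881, 2.050, 1.658, 2.298],
})

# Six factor contributions plus the Dystopia + residual term
parts = [c for c in head.columns if c.startswith("Explained by: ")]
parts.append("Dystopia + residual")

head["Reconstructed"] = head[parts].sum(axis=1)
head["Gap"] = (head["Ladder score"] - head["Reconstructed"]).round(3)
print(head[["Country name", "Ladder score", "Reconstructed", "Gap"]])
```

One consequence is that rows whose factor columns were filled with column means will generally no longer sum to their ladder score, which is worth keeping in mind when interpreting those rows.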


### Checking for Duplicates

- The code checks whether any duplicate rows exist in the dataset.
- If duplicates are found, the number of duplicate entries is printed.
- If no duplicates are found, a message is displayed confirming that no duplicates exist.


In [16]:
# Check for duplicate rows
duplicates = df[df.duplicated()]
if len(duplicates) > 0:
    print(f"Found {len(duplicates)} duplicates")
else:
    print("No duplicates found")


No duplicates found


### Duplicates checked

The output confirms that the dataset contains no duplicate rows.
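
`df.duplicated()` only flags rows that match in every column. A country listed twice with slightly different values would pass that check, so a hedged complement is to look for repeats in `Country name` alone. The three-row frame below is made up to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country name": ["A", "B", "B"],
    "Ladder score": [6.0, 5.0, 5.1],
})

# Full-row duplicates: none, because the two "B" rows differ in score
full_row_dupes = int(toy.duplicated().sum())
print(full_row_dupes)

# Repeats on the country name only; keep=False marks every occurrence
repeated = toy[toy.duplicated(subset="Country name", keep=False)]
print(repeated["Country name"].unique().tolist())
```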

In [17]:
# A histogram shows how the happiness scores are distributed across countries:
# whether most countries score high or low, and whether there are extreme outliers.
# "Ladder score" is the happiness score of each country.

import plotly.express as px

# Create a histogram for the "Ladder score" (happiness score) column
fig = px.histogram(df, 
                   x="Ladder score", 
                   title="Distribution of Happiness Scores", 
                   labels={"Ladder score": "Happiness Score (Ladder Score)"}, 
                   nbins=20)

# Display the plot inline in the notebook
fig.show()


### Histogram: Distribution of Happiness Scores

This histogram displays the distribution of happiness scores across the countries in the dataset. 

The **x-axis** represents the happiness scores, and the **y-axis** shows how many countries fall into each score range. 

This chart will help us understand:
- If most countries are concentrated in a particular range of scores.
- If there are any extreme outliers in the data (e.g., countries with extremely low or high happiness scores).
- The general shape of the distribution (e.g., skewed, normal, etc.).

By adjusting the number of bins (`nbins`), we can control the granularity of the score intervals and get a clearer picture of how the scores are spread.
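
To make the effect of the bin count concrete, the sketch below counts the same values with two different bin settings. It uses `numpy.histogram` as a stand-in, since Plotly chooses its own bin edges when given `nbins`; the scores are made-up values, not taken from the dataset.

```python
import numpy as np

# Illustrative scores only
scores = np.array([1.7, 3.2, 4.1, 4.8, 5.2, 5.5, 5.9, 6.1, 6.4, 7.0, 7.3, 7.7])

results = {}
for n_bins in (3, 6):
    counts, edges = np.histogram(scores, bins=n_bins)
    width = edges[1] - edges[0]
    results[n_bins] = counts.tolist()
    print(n_bins, counts.tolist(), round(float(width), 2))
```

Fewer bins give a smoother but coarser picture, while more bins reveal finer structure at the cost of noisier counts per bin.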


### Interactive Histogram: Adjusting the Number of Bins
The next cell draws the same histogram of happiness scores (Ladder Score), but lets the user choose the number of bins with a Streamlit slider. The slider is only interactive when the code runs inside a Streamlit app started with `streamlit run`. When the cell is executed in a notebook, Streamlit prints a warning instead.


In [25]:
import streamlit as st

# Add a slider to control the number of bins for the histogram
bins = st.slider("Select number of bins", min_value=5, max_value=50, value=20)

# Create the histogram with the selected number of bins
fig = px.histogram(df, 
                   x="Ladder score", 
                   title="Distribution of Happiness Scores", 
                   labels={"Ladder score": "Happiness Score (Ladder Score)"}, 
                   nbins=bins)

# Render the plot inside the Streamlit app
st.plotly_chart(fig)


2025-01-21 19:57:12.079 
  command:

    streamlit run /Users/tamauriolive/Projects/software-development-tools-project/venv/lib/python3.13/site-packages/ipykernel_launcher.py [ARGUMENTS]



In [26]:
# A scatterplot is useful for examining the relationship between two variables. 
# We're exploring if there's any correlation between happiness scores and other factors such as GDP.

fig = px.scatter(df, x="Explained by: Log GDP per capita", y="Ladder score", title="Happiness vs GDP per Capita")
fig.show()

### Scatter Plot: Happiness vs GDP per Capita

This scatter plot examines the relationship between **happiness scores** (`Ladder score`) and **GDP per capita** (`Explained by: Log GDP per capita`). The plot displays how these two variables correlate across countries.

From the plot, we can see a general upward trend, suggesting that countries with higher GDP per capita tend to have higher happiness scores. This could indicate that economic factors, represented by GDP, play a significant role in the overall well-being of a country's population.

However, there is some scatter, which implies that other factors beyond GDP may also be influencing the happiness scores. It is essential to consider additional variables such as social support, freedom to make life choices, and other aspects when analyzing happiness scores.
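
The visual trend can be backed by a number. The sketch below computes the Pearson correlation coefficient between two columns with pandas' `Series.corr`. The `pearson_r` helper is hypothetical and the four data points are made up, so the printed value says nothing about the real dataset.

```python
import pandas as pd

def pearson_r(frame, x, y):
    """Pearson correlation between two columns; pairs containing NaN are ignored."""
    return frame[x].corr(frame[y])

# Illustrative values only
toy = pd.DataFrame({
    "Explained by: Log GDP per capita": [0.5, 1.0, 1.5, 2.0],
    "Ladder score": [3.9, 5.2, 5.8, 7.1],
})

r = pearson_r(toy, "Explained by: Log GDP per capita", "Ladder score")
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship. Because the GDP column is itself a component of the ladder score in this dataset, some positive correlation is expected by construction.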


### Overall Conclusion

In this project, we explored the World Happiness Report dataset to visualize and analyze the relationships between various factors influencing global happiness scores. We utilized interactive components in a Streamlit app, including histograms and scatter plots, to present the data in a more accessible and engaging way.

- We examined the distribution of happiness scores across countries, using a histogram to show how scores are spread.
- A scatter plot was used to explore the correlation between happiness scores and other factors such as GDP, providing insights into how these variables interact.
- Additional improvements, like filtering data based on GDP and adjusting the number of histogram bins, allowed for a more customized and insightful analysis.

Through this process, we gained a deeper understanding of the key drivers of happiness across countries, with a focus on GDP, social support, and health-related factors. The app provides users with an intuitive interface to explore the dataset and derive insights from the visualizations.
