## OBJECTIVES
Analyze a small, publicly available dataset using Python's NumPy or pandas libraries.

1. Explore the data:
   - Print the first few rows using `data.head()`.
   - Check the data type of each column using `data.info()`.
   - Get summary statistics (mean, median, standard deviation, etc.) using `data.describe()`.

2. Calculate basic statistics:
   - Calculate the mean, median, mode, and standard deviation for each numerical feature.
   - Calculate the correlation between different features using `data.corr()`.
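
The steps listed above can be sketched on a tiny hand-made frame. The `data` name follows the objectives; the column names and values below are made up for illustration and are not taken from the World Happiness file.

```python
import pandas as pd

# Toy data: hypothetical values, not from the real dataset
data = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "score": [7.1, 5.4, 4.2, 6.3],
        "gdp": [1.9, 1.2, 0.8, 1.5],
    }
)

print(data.head())       # first rows (up to five)
data.info()              # dtypes and non-null counts (prints directly)
print(data.describe())   # count, mean, std, min, quartiles, max

# Correlation is only defined for numeric columns, so select them first
corr = data.select_dtypes(include="number").corr()
print(corr)
```

Selecting the numeric columns before calling `corr()` mirrors what the notebook does later and avoids errors on text columns such as the country name.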

## WORLD HAPPINESS DATA

The dataset used in this notebook comes from the [World Happiness Report](https://worldhappiness.report/). The dataset consists of 2624 rows, where each row contains various happiness-related metrics for a country in a given year.

In [1]:
# import libraries
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
file_name = "World_Happiness.csv"

# Read file into pandas dataframe
df = pd.read_csv(file_name)

# Display the first five rows
df.head()

| | Year | Rank | Country name | Ladder score | upperwhisker | lowerwhisker | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | ... | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023.0 | 143.0 | Afghanistan | 1.721 | 1.775 | 1.667 | 0.628 | 0.0 | 0.242 | 0.0 | ... | | | | | | | | | | |
| 1 | 2022.0 | 137.0 | Afghanistan | 1.859 | 1.923 | 1.795 | 0.645 | 0.0 | 0.087 | 0.0 | ... | | | | | | | | | | |
| 2 | 2021.0 | 146.0 | Afghanistan | 2.404 | 2.469 | 2.339 | 0.758 | 0.0 | 0.289 | 0.0 | ... | | | | | | | | | | |
| 3 | 2020.0 | 150.0 | Afghanistan | 2.523 | 2.596 | 2.449 | 0.37 | 0.0 | 0.126 | 0.0 | ... | | | | | | | | | | |
| 4 | 2019.0 | 153.0 | Afghanistan | 2.567 | 2.628 | 2.506 | 0.301 | 0.356 | 0.266 | 0.0 | ... | | | | | | | | | | |


In [5]:
df.shape

(2624, 28)

In [7]:
df.columns

Index(['Year', 'Rank', 'Country name', 'Ladder score', 'upperwhisker',
       'lowerwhisker', 'Explained by: Log GDP per capita',
       'Explained by: Social support', 'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27'],
      dtype='object')

In [9]:
# Use loc[] to select the required columns
df = df.loc[0:, "Year":"Explained by: Freedom to make life choices"]
df.drop(columns="Rank", inplace=True)

# Recheck columns
df.columns

Index(['Year', 'Country name', 'Ladder score', 'upperwhisker', 'lowerwhisker',
       'Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices'],
      dtype='object')
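
Label slices with `.loc` include both endpoints, which is why `Explained by: Freedom to make life choices` survives the slice above. A minimal sketch on a made-up frame (the column names `a` to `e` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["a", "b", "c", "d", "e"])

# Label slice: both "b" and "d" are included
subset = toy.loc[:, "b":"d"]
print(list(subset.columns))

# drop() without inplace returns a new frame and leaves `subset` untouched
trimmed = subset.drop(columns="c")
print(list(trimmed.columns))
```

Assigning the result of `drop()` to a new name, rather than using `inplace=True`, keeps the intermediate frame available if a later step needs it.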

In [11]:
# Basic descriptive statistics of the dataframe
df.describe()

| | Year | Ladder score | upperwhisker | lowerwhisker | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices |
|---|---|---|---|---|---|---|---|---|
| count | 1969.0 | 1969.0 | 875.0 | 875.0 | 872.0 | 872.0 | 870.0 | 871.0 |
| mean | 2017.714068 | 5.451904 | 5.648687 | 5.418737 | 1.22028 | 1.078529 | 0.542917 | 0.563723 |
| std | 3.964913 | 1.121865 | 1.103935 | 1.139067 | 0.463448 | 0.355055 | 0.222949 | 0.180203 |
| min | 2011.0 | 1.364 | 1.427 | 1.301 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2015.0 | 4.596 | 4.885 | 4.638 | 0.90125 | 0.85075 | 0.383 | 0.4505 |
| 50% | 2018.0 | 5.456 | 5.775 | 5.529 | 1.2635 | 1.1065 | 0.555 | 0.571 |
| 75% | 2021.0 | 6.295 | 6.4585 | 6.254 | 1.567 | 1.361 | 0.70475 | 0.676 |
| max | 2024.0 | 7.856 | 7.904 | 7.78 | 2.209 | 1.84 | 1.138 | 1.018 |


In [13]:
# Basic info of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2624 entries, 0 to 2623
Data columns (total 9 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Year                                        1969 non-null   float64
 1   Country name                                1969 non-null   object 
 2   Ladder score                                1969 non-null   float64
 3   upperwhisker                                875 non-null    float64
 4   lowerwhisker                                875 non-null    float64
 5   Explained by: Log GDP per capita            872 non-null    float64
 6   Explained by: Social support                872 non-null    float64
 7   Explained by: Healthy life expectancy       870 non-null    float64
 8   Explained by: Freedom to make life choices  871 non-null    float64
dtypes: float64(8), object(1)
memory usage: 184.6+ KB


## DATA WRANGLING

### Data Type

In [15]:
df.dtypes

Year                                          float64
Country name                                   object
Ladder score                                  float64
upperwhisker                                  float64
lowerwhisker                                  float64
Explained by: Log GDP per capita              float64
Explained by: Social support                  float64
Explained by: Healthy life expectancy         float64
Explained by: Freedom to make life choices    float64
dtype: object

In [17]:
df["Year"] = df["Year"].astype(str)

### Handling Null Values

In [19]:
# Check total null values in each column
df.isnull().sum()

Year                                             0
Country name                                   655
Ladder score                                   655
upperwhisker                                  1749
lowerwhisker                                  1749
Explained by: Log GDP per capita              1752
Explained by: Social support                  1752
Explained by: Healthy life expectancy         1754
Explained by: Freedom to make life choices    1753
dtype: int64

In [21]:
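# Extract the rows where "Year" is null.
# Note: "Year" was cast to str in In [17], and the check above reports 0 nulls for it,
# so this selection is empty and the drop below removes nothing.
year_null = df[df["Year"].isnull()]

# Drop those rows by index
df.drop(year_null.index, inplace=True)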

In [23]:
# Rechecking the null values 
df.isnull().sum()

Year                                             0
Country name                                   655
Ladder score                                   655
upperwhisker                                  1749
lowerwhisker                                  1749
Explained by: Log GDP per capita              1752
Explained by: Social support                  1752
Explained by: Healthy life expectancy         1754
Explained by: Freedom to make life choices    1753
dtype: int64

In [25]:
df[df["upperwhisker"].isnull()]

| | Year | Country name | Ladder score | upperwhisker | lowerwhisker | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 2018.0 | Afghanistan | 3.203 | | | | | | |
| 6 | 2017.0 | Afghanistan | 3.632 | | | | | | |
| 7 | 2016.0 | Afghanistan | 3.794 | | | | | | |
| 8 | 2015.0 | Afghanistan | 3.360 | | | | | | |
| 9 | 2014.0 | Afghanistan | 3.575 | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2619 | | | | | | | | | |
| 2620 | | | | | | | | | |
| 2621 | | | | | | | | | |
| 2622 | | | | | | | | | |


In [27]:
# Using the column "upperwhisker" as subset to extract null values
upw_null = df[df["upperwhisker"].isnull()]

# Drop null values by index/rows
df.drop(upw_null.index, inplace=True)
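
The filter-then-drop pattern above can also be written with `DataFrame.dropna` and its `subset` parameter. A sketch on made-up rows, assuming the same intent (remove every row where `upperwhisker` is missing); the values are hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Country name": ["A", "B", "C"],
        "upperwhisker": [1.8, np.nan, 2.6],
        "lowerwhisker": [1.6, np.nan, np.nan],
    }
)

# Drop rows where "upperwhisker" is null; nulls in other columns are kept
cleaned = toy.dropna(subset=["upperwhisker"])
print(cleaned)
```

Because only the `subset` columns are checked, row `C` stays even though its `lowerwhisker` is missing, which matches the notebook leaving a few nulls for the imputation step.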

In [29]:
df.isnull().sum()

Year                                          0
Country name                                  0
Ladder score                                  0
upperwhisker                                  0
lowerwhisker                                  0
Explained by: Log GDP per capita              3
Explained by: Social support                  3
Explained by: Healthy life expectancy         5
Explained by: Freedom to make life choices    4
dtype: int64

In [31]:
# Impute the remaining null values with each column's mean
column_fill = df.loc[0:, "Explained by: Log GDP per capita":].columns

# Loop through the list of columns to calculate the mean and fill the null value with the mean
for col in column_fill:
    mean = df[col].mean()
    df[col] = df[col].fillna(mean)
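
Mean imputation can also be done without an explicit loop, because `fillna` accepts a Series of per-column fill values. A minimal sketch on a toy frame (column names and values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "gdp": [1.0, np.nan, 3.0],
        "support": [0.5, 1.5, np.nan],
    }
)

cols = ["gdp", "support"]
# toy[cols].mean() is a Series indexed by column name; mean() skips NaN
toy[cols] = toy[cols].fillna(toy[cols].mean())
print(toy)
```

Filling with the mean leaves each column's mean unchanged but slightly shrinks its spread; with at most five filled values per column here, the effect is small.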

In [33]:
df.isnull().sum()

Year                                          0
Country name                                  0
Ladder score                                  0
upperwhisker                                  0
lowerwhisker                                  0
Explained by: Log GDP per capita              0
Explained by: Social support                  0
Explained by: Healthy life expectancy         0
Explained by: Freedom to make life choices    0
dtype: int64

In [35]:
# Rechecking the dataframe shape
df.shape

(875, 9)

### Editing Column Names

In [37]:
df.columns

Index(['Year', 'Country name', 'Ladder score', 'upperwhisker', 'lowerwhisker',
       'Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices'],
      dtype='object')

In [39]:
column = []

# Strip the "Explained by:" prefix from the columns that have it
for col in df.columns:
    if "Explained by:" in col:
        col = col.replace("Explained by:", "").strip()
        column.append(col)
    else:
        column.append(col)

df.columns = column

# Replace whitespace with underscore and change to lower case
column_dict = {col: "_".join(col.split(" ")).lower() for col in df.columns}

# Rename the columns using the dict "column_dict"
df = df.rename(columns=column_dict)

In [41]:
df.columns

Index(['year', 'country_name', 'ladder_score', 'upperwhisker', 'lowerwhisker',
       'log_gdp_per_capita', 'social_support', 'healthy_life_expectancy',
       'freedom_to_make_life_choices'],
      dtype='object')
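
The same clean-up can be expressed with the vectorized string methods on `Index.str`. This is a sketch of an alternative, not the notebook's method; the toy frame reuses three of the original column names:

```python
import pandas as pd

toy = pd.DataFrame(
    columns=["Year", "Country name", "Explained by: Social support"]
)

toy.columns = (
    toy.columns
    .str.replace("Explained by:", "", regex=False)  # remove the literal prefix
    .str.strip()                                    # trim leftover whitespace
    .str.replace(" ", "_", regex=False)             # spaces to underscores
    .str.lower()                                    # lower case
)
print(list(toy.columns))
```

Passing `regex=False` treats the pattern as plain text, so characters such as `:` or `.` in a prefix are never interpreted as regular-expression syntax.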

## Mean, Median, Mode, and Standard Deviation of each column

In [43]:
# Function to calculate the mean
def get_mean(df, col):
    total = df[col].sum() # sum all values
    n = len(df[col])      # number of elements
    mean = total / n

    return mean

In [45]:
# Function to calculate the median
def get_median(df, col):
    sort_col = df[col].sort_values().reset_index(drop=True)  # Sort values in ascending order
    n = len(sort_col)  # number of elements
    
    if n % 2 == 1:     # odd number of elements: take the middle value
        median = sort_col[n // 2]
    else:              # even number of elements: average the two middle values
        median = (sort_col[n // 2 - 1] + sort_col[n // 2]) / 2

    return median

In [47]:
# Function to calculate the mode
def get_mode(df, col):
    mode_values = df[col].mode()  # get the mode 
    
    if len(mode_values) == 0:  # Check if there is no mode
        mode = None
        
    elif len(mode_values) == 1:  # Check if there is only one mode
        mode = mode_values[0]
        
    else:                       
        mode = list(mode_values)  # store the multiple modes in a list

    return mode

In [49]:
# Function to calculate the Standard Deviation
def get_std(df, col):
    mean = get_mean(df, col)   # gets the mean using the "get_mean()"
    values = df[col] 
    n = len(values)
    
    square_diff = [(x - mean)**2 for x in values]  # Squares the difference of each value and mean of the column
    variance = sum(square_diff) / (n - 1)         # the variance
    std_dev = math.sqrt(variance)      # gets the square root of the variance
    
    return std_dev
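
`get_std` divides by `n - 1`, which gives the sample standard deviation. The sketch below, on made-up values, contrasts that with the population form (divisor `n`) and shows which default each library uses: `Series.std()` uses `ddof=1`, while `np.std()` uses `ddof=0`.

```python
import math

import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)
mean = values.sum() / n
sum_sq = sum((x - mean) ** 2 for x in values)  # sum of squared deviations

sample_std = math.sqrt(sum_sq / (n - 1))   # divisor n - 1, as in get_std
population_std = math.sqrt(sum_sq / n)     # divisor n

print(sample_std, values.std())                   # pandas default: ddof=1
print(population_std, np.std(values.to_numpy()))  # NumPy default: ddof=0
```

This is why the hand-rolled result should be compared against `df[col].std()` rather than a bare `np.std(...)` call.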

In [51]:
numerical_columns = df.select_dtypes(include="number").columns

for col in numerical_columns:
    print(col)
    print(f"Mean: {get_mean(df, col)}, Median: {get_median(df, col)}, Mode: {get_mode(df, col)}, Standard Deviation: {get_std(df, col)}")
    print("")

ladder_score
Mean: 5.533725714285715, Median: 5.674, Mode: [4.289, 4.308, 5.411, 6.125, 6.172, 6.33, 6.455, 6.494], Standard Deviation: 1.1209253737693623

upperwhisker
Mean: 5.648686857142857, Median: 5.775, Mode: [4.454, 5.339, 5.406, 6.262, 6.424], Standard Deviation: 1.103934833643875

lowerwhisker
Mean: 5.418737142857143, Median: 5.529, Mode: 4.783, Standard Deviation: 1.1390667963403234

log_gdp_per_capita
Mean: 1.2202798165137616, Median: 1.261, Mode: 0.0, Standard Deviation: 0.46265161745734024

social_support
Mean: 1.0785286697247707, Median: 1.106, Mode: 0.0, Standard Deviation: 0.35444526432748974

healthy_life_expectancy
Mean: 0.5429172413793103, Median: 0.554, Mode: [0.0, 0.587], Standard Deviation: 0.22231039807175693

freedom_to_make_life_choices
Mean: 0.5637233065442021, Median: 0.571, Mode: [0.0, 0.647, 0.651], Standard Deviation: 0.17979026147151772
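
The hand-written helpers can be cross-checked against the pandas built-ins. A minimal sketch on a toy frame (values are made up; on the real data you would pass the cleaned `df` restricted to its numeric columns):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "score": [4.0, 5.0, 5.0, 6.0, 10.0],
        "gdp": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# One row per statistic, one column per feature
summary = toy.agg(["mean", "median", "std"])
print(summary)

# mode() returns a DataFrame with one row per mode; columns with fewer
# modes are padded with NaN
modes = toy.mode()
print(modes)
```

When no value repeats, as in the `gdp` column here, every value is reported as a mode, which is one reason continuous columns can return long mode lists.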



## Correlation

In [55]:
df.select_dtypes(include="number").corr()

| | ladder_score | upperwhisker | lowerwhisker | log_gdp_per_capita | social_support | healthy_life_expectancy | freedom_to_make_life_choices |
|---|---|---|---|---|---|---|---|
| ladder_score | 1.0 | 0.999456 | 0.999489 | 0.688491 | 0.68517 | 0.656844 | 0.541182 |
| upperwhisker | 0.999456 | 1.0 | 0.997891 | 0.682678 | 0.680963 | 0.646302 | 0.541902 |
| lowerwhisker | 0.999489 | 0.997891 | 1.0 | 0.693405 | 0.688525 | 0.666391 | 0.539925 |
| log_gdp_per_capita | 0.688491 | 0.682678 | 0.693405 | 1.0 | 0.616301 | 0.513473 | 0.449572 |
| social_support | 0.68517 | 0.680963 | 0.688525 | 0.616301 | 1.0 | 0.521331 | 0.528699 |
| healthy_life_expectancy | 0.656844 | 0.646302 | 0.666391 | 0.513473 | 0.521331 | 1.0 | 0.25962 |
| freedom_to_make_life_choices | 0.541182 | 0.541902 | 0.539925 | 0.449572 | 0.528699 | 0.25962 | 1.0 |
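
`corr()` computes the Pearson coefficient by default. In the same spirit as the hand-written statistics functions, the sketch below computes it manually for one pair of columns and compares the result with pandas. The `pearson` helper and the toy values are illustrative assumptions, not part of the original notebook.

```python
import math

import pandas as pd

toy = pd.DataFrame(
    {
        "ladder_score": [1.0, 2.0, 3.0, 4.0, 5.0],
        "log_gdp_per_capita": [2.0, 1.0, 4.0, 3.0, 5.0],
    }
)

def pearson(x, y):
    """Sum of cross-deviations over the root of the product of squared deviations."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_manual = pearson(toy["ladder_score"], toy["log_gdp_per_capita"])
r_pandas = toy["ladder_score"].corr(toy["log_gdp_per_capita"])
print(r_manual, r_pandas)
```

The coefficient ranges from -1 to 1, and the diagonal of the matrix is always 1 because each column is perfectly correlated with itself.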
