# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My dataset: https://www.kaggle.com/datasets/jainaru/world-happiness-report-2024-yearly-updated/data

Import the necessary libraries and create your dataframe(s).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#reading data
df = pd.read_csv(r"C:\Users\launchcode\data-analysis-projects\World-happiness-report-updated_2024.csv",encoding='latin1')
happy_df = pd.read_csv(r"C:\Users\launchcode\data-analysis-projects\World-happiness-report-2024.csv",encoding='latin1')
# Rename 2024 columns to match historical
happy_df = happy_df.rename(columns={
    "Ladder score": "Life Ladder",
    "Healthy life expectancy": "Healthy life expectancy at birth"
})

# The 2024 file has no year column, so add one to match the historical data
happy_df["year"] = 2024
# Summary information about both dataframes
df.info()
happy_df.info()

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [None]:
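df.isnull()
# The truncated preview shows no True values, but it only covers the first and last rows,
# so it does not rule out nulls elsewhere (see the per-column counts)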



In [None]:
df.isnull().sum()

Several columns contain null values: Log GDP per capita, Social support, Healthy life expectancy at birth, Freedom to make life choices, Generosity, Perceptions of corruption, Positive affect, and Negative affect.

In [None]:
# Find all numeric columns; missing values in them are filled below with each country's own mean
numeric_cols = df.select_dtypes(include=np.number).columns

# Not using the line below: it fills with the column-wide mean, which would distort each country's trend
# df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Fill numeric NaNs with the mean per country
df[numeric_cols] = df.groupby('Country name')[numeric_cols].transform(lambda x: x.fillna(x.mean()))

# count the number of missing (NaN, None, or NaT) values in each column of a DataFrame 
df.isnull().sum()
# Round all numeric columns to 3 decimal places
df[numeric_cols] = df[numeric_cols].round(3)

# Preview the first few rows
df.head(16)


Missing values in numeric columns were replaced with each country's own mean for that column rather than the overall column mean, so each gap is filled from the same country's other years. This avoids dropping rows. Non-numeric columns were left unchanged.
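
The difference between the two fill strategies can be sketched on a toy dataframe. The values below are made up for illustration and are not taken from the report. The sketch also shows one caveat: a country with no observed values in a column has no mean of its own, so its gap stays missing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the report)
toy = pd.DataFrame({
    "Country name": ["A", "A", "A", "B", "B", "C"],
    "Social support": [0.9, np.nan, 0.7, 0.3, np.nan, np.nan],
})

# Column-wide mean: every gap receives the same value, regardless of country
global_fill = toy["Social support"].fillna(toy["Social support"].mean())

# Per-country mean: each gap is filled from that country's own observations
country_fill = toy.groupby("Country name")["Social support"].transform(
    lambda s: s.fillna(s.mean())
)

# Country C has no observed values, so its per-country mean is NaN
# and the gap is left unfilled
still_missing = int(country_fill.isna().sum())

print(pd.DataFrame({"global": global_fill, "per_country": country_fill}))
print("Still missing after per-country fill:", still_missing)
```

If any gaps remain after the per-country fill, one option is to fall back to the column-wide mean for those rows only; whether that is appropriate depends on the analysis.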

In [None]:
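happy_df.isnull()
# The truncated preview shows no True values, but it only covers the first and last rows,
# so it does not rule out nulls elsewhere (see the per-column counts)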

In [None]:
happy_df.isnull().sum()

In [None]:
numeric_cols = happy_df.select_dtypes(include=np.number).columns
happy_df[numeric_cols] = happy_df[numeric_cols].fillna(happy_df[numeric_cols].mean())
happy_df[numeric_cols] = happy_df[numeric_cols].round(3)
happy_df.tail(50)



Missing values in the numeric columns of happy_df were replaced with the column mean. This keeps every country in the dataset and leaves each column's mean unchanged, although mean imputation narrows the spread of the filled columns somewhat. Non-numeric columns were left unchanged.

In [None]:
happy_df.isnull().sum()
happy_df.info()

In [None]:
print(happy_df.columns)

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [None]:
# target_col = "Life Ladder"


# plt.figure(figsize=(8, 4))
# df[target_col].plot(
#     kind="hist",
#     bins=25,
#     edgecolor="black"
# )

# plt.title("Happiness Score Distribution")
# plt.xlabel(target_col)
# plt.ylabel("Count")
# plt.show()

In [None]:
target_col = "Life Ladder"

top10 = (
    df[["Country name", target_col]]
    .sort_values(target_col, ascending=False)
    .head(10)
)
bottom10 = (
    df[["Country name", target_col]]
    .sort_values(target_col, ascending=True)
    .head(10)
)
top10,bottom10

In [None]:
latest_year = df["year"].max()
df_latest = df[df["year"] == latest_year].copy()
latest_year, df_latest.shape
# target_col = "Life Ladder"

top10_2023 = (
    df_latest[["Country name", target_col]]
    .sort_values(target_col, ascending=False)
    .head(10)
)
bottom10_2023 = (
    df_latest[["Country name", target_col]]
    .sort_values(target_col, ascending=True)
    .head(10)
)
top10_2023,bottom10_2023

In [None]:
# happy_df = happy_df.copy()
# happy_df = happy_df.select_dtypes(include=["float64", "int64"])
# happy_df.head()
numeric_cols = happy_df.select_dtypes(include=["float64", "int64"]).columns

# Fill any remaining NaNs in numeric columns with the column mean
# (nothing changes if the earlier fill cell has already run)
happy_df[numeric_cols] = happy_df[numeric_cols].fillna(happy_df[numeric_cols].mean())
happy_df

In [None]:
# Check number of unique values in key columns (optional sanity check)

# happy_df.dtypes
# columns = [['Ladder score', 'Log GDP per capita','Social support', 'Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption']]
# for c in columns:
#     print(happy_df[c].nunique())

In [None]:
happy_df

In [None]:
happy_df.isna().mean().sort_values(ascending=False) * 100
# percentage of missing values in each column of happy_df, sorted from highest to lowest.

In [None]:
numeric_cols = ['Life Ladder', 'Log GDP per capita','Social support']
plt.figure(figsize=(8, 6))
happy_df[numeric_cols].boxplot()
plt.show()

Social support also shows outliers.

In [None]:
numeric_cols = ['Healthy life expectancy at birth','Freedom to make life choices','Generosity','Perceptions of corruption']
plt.figure(figsize=(10, 6))
happy_df[numeric_cols].boxplot()
plt.show()

Perceptions of corruption has many outliers. These probably reflect genuine socio-economic extremes, so I have decided to keep them: removing them would also remove the countries themselves, which are required for this analysis.
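
To put a number on what the boxplots show without removing anything, the outliers can be flagged rather than dropped. The sketch below uses the 1.5 × IQR rule, which is also matplotlib's default for boxplot whiskers. The helper name `iqr_outlier_mask`, the flag column name, and the toy values are assumptions made for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data (hypothetical values, not taken from the report)
toy = pd.DataFrame(
    {"Perceptions of corruption": [0.10, 0.12, 0.11, 0.13, 0.12, 0.55]}
)

# Flag outliers in a new column instead of deleting the rows
mask = iqr_outlier_mask(toy["Perceptions of corruption"])
toy["corruption_outlier"] = mask
n_outliers = int(mask.sum())

print(toy)
print("Flagged outliers:", n_outliers)
```

Keeping the flag as a column means every country stays in the dataset, and the extreme rows can still be inspected or excluded later in a specific chart or model.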

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [None]:
df.info()
# I am using this sheet for reference, but if I end up using it for time-series analysis, I might drop Positive affect and Negative affect
happy_df.info()
# I can probably drop the upper whisker and lower whisker columns, as these are just the upper and lower bounds
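
If those columns are dropped later, a minimal sketch could look like the following. The spellings `upperwhisker` and `lowerwhisker` are assumptions about how the columns are named in the file, and the toy rows are made up. Passing `errors="ignore"` keeps the cell from failing when a listed column is absent.

```python
import pandas as pd

# Toy data (hypothetical values; column spellings are assumed, check df.columns first)
toy = pd.DataFrame({
    "Country name": ["A", "B", "B"],
    "Life Ladder": [7.1, 5.2, 5.2],
    "upperwhisker": [7.2, 5.3, 5.3],
    "lowerwhisker": [7.0, 5.1, 5.1],
})

# Columns considered unnecessary; names missing from the frame are skipped
cols_to_drop = ["upperwhisker", "lowerwhisker", "Positive affect"]

cleaned = (
    toy.drop(columns=cols_to_drop, errors="ignore")
    .drop_duplicates()           # remove fully repeated rows
    .reset_index(drop=True)
)

print(cleaned)
```

Duplicates are removed after the column drop here, so two rows that differed only in a dropped column would collapse into one; reverse the order if that is not wanted.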

## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [None]:
happy_df["Log GDP per capita"].value_counts()
# There are some low values, but they likely represent countries with weaker economies
happy_df["Social support"].value_counts()
# Some countries have low social support, so extreme values such as 0 (Afghanistan) seem valid
happy_df["Healthy life expectancy at birth"].value_counts()
# One country (Lesotho) has a value of 0; I am keeping it
happy_df["Freedom to make life choices"].value_counts()
# One country (Afghanistan) has a value of 0
happy_df["Generosity"].value_counts()
# No issues observed
happy_df["Perceptions of corruption"].value_counts()
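
Since there are two files, another consistency check worth considering is whether both spell country names the same way. The sketch below compares names after trimming whitespace and ignoring case. The helper `normalize_names` and the toy names are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-ins for the historical and 2024 frames (hypothetical rows)
historical = pd.DataFrame({"Country name": ["Finland", " Lesotho", "Czech Republic"]})
latest = pd.DataFrame({"Country name": ["finland", "Lesotho", "Czechia"]})


def normalize_names(names: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and ignore case for comparison."""
    return names.str.strip().str.casefold()


hist_names = set(normalize_names(historical["Country name"]))
latest_names = set(normalize_names(latest["Country name"]))

# Names that appear in only one file point to spelling or coverage differences
only_in_historical = sorted(hist_names - latest_names)
only_in_latest = sorted(latest_names - hist_names)

print("Only in historical:", only_in_historical)
print("Only in 2024:", only_in_latest)
```

Names that appear on only one side are not necessarily errors, since a country may simply be absent from one file, but they are a short list to review by hand.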




In [None]:
print(df.columns)
print(happy_df.columns)

In [None]:
df.dtypes
happy_df.dtypes
happy_df["year"].unique()

In [None]:
# df.to_csv('world_happiness_yearly_clean.csv', index=False)
# happy_df.to_csv('world_happiness_2024_clean.csv', index=False)