### **Merging the datasets**

**Objective**: We were provided with five CSV files, one per year from 2015 to 2019. This notebook documents how they are merged into a single dataset using the columns they have in common.

> **📌Important Note:** I performed the merge before the EDA because I find it more convenient to run the exploratory analysis on a single, unified dataset.

---


#### **First Step**: Exploration of dataset features

Task:
- Load all datasets
- Inspect the columns of each dataset
- Identify the columns they have in common

In [55]:
import pandas as pd
import matplotlib.pyplot as plt
import sys
import os
import logging

logging.basicConfig(level=logging.INFO)

# Remove the column display limit to show all columns in the DataFrame
pd.set_option('display.max_columns', None)

In [56]:
# Add the 'src' folder to sys.path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

import utils.analysis_functions as analysis_functions

In [57]:
df_2015 = pd.read_csv('../data/raw/2015.csv')
df_2016 = pd.read_csv('../data/raw/2016.csv')
df_2017 = pd.read_csv('../data/raw/2017.csv')
df_2018 = pd.read_csv('../data/raw/2018.csv')
df_2019 = pd.read_csv('../data/raw/2019.csv')

In [58]:
dict_of_df = {
    '2015': df_2015,
    '2016': df_2016,
    '2017': df_2017,
    '2018': df_2018,
    '2019': df_2019
}

In [59]:
for key, df in dict_of_df.items():
    print("-"*50)
    print(f"STATS FOR THE YEAR {key}")
    print(f'- The shape of the dataframe {df.shape}')
    print(f'- The columns in the dataframe {df.columns.to_list()}')
    print(f'- The number of missing values in the dataframe {df.isnull().sum().sum()}\n')

--------------------------------------------------
STATS FOR THE YEAR 2015
- The shape of the dataframe (158, 12)
- The columns in the dataframe ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Standard Error', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']
- The number of missing values in the dataframe 0

--------------------------------------------------
STATS FOR THE YEAR 2016
- The shape of the dataframe (157, 13)
- The columns in the dataframe ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']
- The number of missing values in the dataframe 0

--------------------------------------------------
STATS FOR THE YEAR 2017
- The shape of the dataframe (155, 12)
- The colu

In [60]:
df_columns = {
    'Year 2015': df_2015.columns.to_list(),
    'Year 2016': df_2016.columns.to_list(),
    'Year 2017': df_2017.columns.to_list(),
    'Year 2018': df_2018.columns.to_list(),
    'Year 2019': df_2019.columns.to_list()
}

df_columnas = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in df_columns.items()]))
df_columnas

Unnamed: 0,Year 2015,Year 2016,Year 2017,Year 2018,Year 2019
0,Country,Country,Country,Overall rank,Overall rank
1,Region,Region,Happiness.Rank,Country or region,Country or region
2,Happiness Rank,Happiness Rank,Happiness.Score,Score,Score
3,Happiness Score,Happiness Score,Whisker.high,GDP per capita,GDP per capita
4,Standard Error,Lower Confidence Interval,Whisker.low,Social support,Social support
5,Economy (GDP per Capita),Upper Confidence Interval,Economy..GDP.per.Capita.,Healthy life expectancy,Healthy life expectancy
6,Family,Economy (GDP per Capita),Family,Freedom to make life choices,Freedom to make life choices
7,Health (Life Expectancy),Family,Health..Life.Expectancy.,Generosity,Generosity
8,Freedom,Health (Life Expectancy),Freedom,Perceptions of corruption,Perceptions of corruption
9,Trust (Government Corruption),Freedom,Generosity,,


> **🔦Findings:**
>
>- The `2018` and `2019` datasets have fewer columns than the others, but fortunately each of those columns has an equivalent in every year.
>- To work with these columns, we must standardize their names so that the merge works without problems.
>- Additionally, to support a deeper EDA, I propose adding a column that identifies the year each record comes from.
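
The need for standardization can be checked directly from the headers listed above. The sketch below compares the raw 2015 and 2018 column names as plain sets; the variable names are illustrative and not part of the notebook.

```python
# Raw headers copied from the column listings shown above.
cols_2015 = ['Country', 'Region', 'Happiness Rank', 'Happiness Score',
             'Standard Error', 'Economy (GDP per Capita)', 'Family',
             'Health (Life Expectancy)', 'Freedom',
             'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']
cols_2018 = ['Overall rank', 'Country or region', 'Score', 'GDP per capita',
             'Social support', 'Healthy life expectancy',
             'Freedom to make life choices', 'Generosity',
             'Perceptions of corruption']

# Names that already match exactly, without any renaming.
shared_raw = set(cols_2015) & set(cols_2018)

# 2018 names that need to be mapped to a 2015 equivalent.
needs_mapping = sorted(set(cols_2018) - shared_raw)

print(f'Shared without renaming: {shared_raw}')
print(f'Need a mapping: {needs_mapping}')
```

Only one header matches exactly across these two years, which is why a rename dictionary is needed before selecting common columns.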


---

#### **Second Step**: Identification of compatibility between dataframes

Task:
- Identification of the unique countries in each dataset
- Normalization of column names
- Addition of the year identifier column

In [61]:
# Count the unique countries in each dataframe
countries_2015 = df_2015['Country'].unique()
countries_2016 = df_2016['Country'].unique()
countries_2017 = df_2017['Country'].unique()
countries_2018 = df_2018['Country or region'].unique()
countries_2019 = df_2019['Country or region'].unique()

list_of_countries = [countries_2015, countries_2016, countries_2017, countries_2018, countries_2019]

for i, countries in enumerate(list_of_countries):
    print("-"*50)
    print(f"COUNTRIES IN THE YEAR {2015 + i}")
    print(f'- The number of countries in the dataframe {len(countries)}')


--------------------------------------------------
COUNTRIES IN THE YEAR 2015
- The number of countries in the dataframe 158
--------------------------------------------------
COUNTRIES IN THE YEAR 2016
- The number of countries in the dataframe 157
--------------------------------------------------
COUNTRIES IN THE YEAR 2017
- The number of countries in the dataframe 155
--------------------------------------------------
COUNTRIES IN THE YEAR 2018
- The number of countries in the dataframe 156
--------------------------------------------------
COUNTRIES IN THE YEAR 2019
- The number of countries in the dataframe 156


In [62]:
# Do we have the same countries in all datasets?
# Note: ndarray.sort() sorts in place and returns None, so sets are compared instead
countries_2015 = set(df_2015['Country'].unique())
countries_2016 = set(df_2016['Country'].unique())
countries_2017 = set(df_2017['Country'].unique())
countries_2018 = set(df_2018['Country or region'].unique())
countries_2019 = set(df_2019['Country or region'].unique())

# Check if the countries are the same in all datasets (2015 included)
same_countries = countries_2015 == countries_2016 == countries_2017 == countries_2018 == countries_2019
print("-"*50)
print("Are the countries the same in all datasets?")
print(f'- The countries are the same in all datasets: {same_countries}')
print("-"*50)


--------------------------------------------------
Are the countries the same in all datasets?
- The countries are the same in all datasets: False
--------------------------------------------------
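
When two years do not contain the same countries, set differences show which ones are missing on each side. This is a minimal sketch on hypothetical miniature data; the frames and country values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature datasets; the country values are illustrative only.
df_a = pd.DataFrame({'Country': ['Norway', 'Chile', 'Kenya']})
df_b = pd.DataFrame({'Country or region': ['Norway', 'Chile', 'Japan']})

countries_a = set(df_a['Country'])
countries_b = set(df_b['Country or region'])

# Countries present in one dataset but not in the other.
only_in_a = sorted(countries_a - countries_b)
only_in_b = sorted(countries_b - countries_a)

print(f'Same countries: {countries_a == countries_b}')
print(f'Only in A: {only_in_a}')
print(f'Only in B: {only_in_b}')
```

Comparing sets also avoids depending on row order, which differs between years because each file is sorted by rank.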


In [63]:
df_2015.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

In [64]:
# Function to standardize columns and add the year to the dataframe
def standardize_columns_and_add_year(df, year, rename_columns):
    """
    Standardize column names and add a 'Year' column to the dataframe.
    
    Parameters:
    df (pd.DataFrame): The dataframe to process.
    year (int): The year to add to the 'Year' column.
    rename_columns (dict): Dictionary of columns to rename.
    
    Returns:
    pd.DataFrame: The processed dataframe.
    """
    try:
        df = df.rename(columns=rename_columns)
        df.columns = df.columns.str.replace(' ', '_')
        df['Year'] = year
        logging.info(f'🪄Standardized columns for year {year} dataset')
        return df
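    except Exception as e:
        # Log the failure and re-raise so the caller never receives None silently
        logging.error(f'Failed to standardize the {year} dataset: {e}')
        raise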

# Dictionary of columns to rename
rename_columns = {
    'Happiness.Rank': 'Happiness Rank',
    'Overall rank': 'Happiness Rank',
    'Country or region': 'Country',
    'Happiness.Score': 'Happiness Score',
    'Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'GDP per capita',
    'Economy..GDP.per.Capita.': 'GDP per capita',
    'Freedom to make life choices': 'Freedom',
    'Trust (Government Corruption)': 'Perceptions of corruption',
    'Trust..Government.Corruption.': 'Perceptions of corruption',
    'Health..Life.Expectancy.': 'Health (Life Expectancy)',
    'Healthy life expectancy': 'Health (Life Expectancy)',
    'Family': 'Social support'
}


In [65]:
# Standardize the column names
for year, df in dict_of_df.items():
    dict_of_df[year] = standardize_columns_and_add_year(df, int(year), rename_columns)

INFO:root:🪄Standardized columns for year 2015 dataset
INFO:root:🪄Standardized columns for year 2016 dataset
INFO:root:🪄Standardized columns for year 2017 dataset
INFO:root:🪄Standardized columns for year 2018 dataset
INFO:root:🪄Standardized columns for year 2019 dataset
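
Because the mapping sends several source names to the same target, a frame that happened to contain both an old and a new name would end up with duplicate column labels after renaming. The following sketch shows one way to detect that, using a hypothetical frame that is not one of the real datasets.

```python
import pandas as pd

# Hypothetical frame holding both the old and the new name of one feature.
df_demo = pd.DataFrame({'Family': [1.2], 'Social support': [1.3], 'Freedom': [0.6]})
renamed = df_demo.rename(columns={'Family': 'Social support'})

# Index.duplicated() flags the second and later occurrences of a label.
duplicated_names = renamed.columns[renamed.columns.duplicated()].tolist()
print(f'Duplicated column names: {duplicated_names}')
```

An empty list means the rename produced unique column names, which is what the later column selection assumes.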


In [66]:
# Extract the standardized dataframes
df_2015, df_2016, df_2017, df_2018, df_2019 = dict_of_df.values()

In [67]:
# Get the columns that are common to all datasets
columns_to_keep = df_2019.columns.intersection(df_2018.columns).intersection(
    df_2017.columns).intersection(df_2016.columns).intersection(df_2015.columns)

# Select only the common columns in each dataset
df_2015 = df_2015[columns_to_keep]
df_2016 = df_2016[columns_to_keep]
df_2017 = df_2017[columns_to_keep]
df_2018 = df_2018[columns_to_keep]
df_2019 = df_2019[columns_to_keep]
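
The same selection can be written for any number of dataframes with `functools.reduce`. This is only a sketch of an alternative, shown on hypothetical empty frames rather than on the real yearly data.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames standing in for the standardized yearly datasets.
frames = [
    pd.DataFrame(columns=['Country', 'Happiness_Score', 'Region', 'Year']),
    pd.DataFrame(columns=['Country', 'Happiness_Score', 'Year']),
    pd.DataFrame(columns=['Happiness_Score', 'Country', 'Year', 'Generosity']),
]

# Fold Index.intersection over all column indexes.
common = reduce(lambda left, right: left.intersection(right),
                (frame.columns for frame in frames))

# Keep only the shared columns, in the same order, in every frame.
trimmed = [frame[common] for frame in frames]
print(common.tolist())
```

Selecting with the same `common` index in every frame also aligns the column order, which keeps the later concatenation tidy.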


*👀 Let's take a look at what these common columns were in order to proceed with the merge.*

In [68]:
df_columns = {
    'Year 2015': df_2015.columns.to_list(),
    'Year 2016': df_2016.columns.to_list(),
    'Year 2017': df_2017.columns.to_list(),
    'Year 2018': df_2018.columns.to_list(),
    'Year 2019': df_2019.columns.to_list()
}

df_columns = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in df_columns.items()]))
df_columns

Unnamed: 0,Year 2015,Year 2016,Year 2017,Year 2018,Year 2019
0,Happiness_Rank,Happiness_Rank,Happiness_Rank,Happiness_Rank,Happiness_Rank
1,Country,Country,Country,Country,Country
2,Happiness_Score,Happiness_Score,Happiness_Score,Happiness_Score,Happiness_Score
3,GDP_per_capita,GDP_per_capita,GDP_per_capita,GDP_per_capita,GDP_per_capita
4,Social_support,Social_support,Social_support,Social_support,Social_support
5,Health_(Life_Expectancy),Health_(Life_Expectancy),Health_(Life_Expectancy),Health_(Life_Expectancy),Health_(Life_Expectancy)
6,Freedom,Freedom,Freedom,Freedom,Freedom
7,Generosity,Generosity,Generosity,Generosity,Generosity
8,Perceptions_of_corruption,Perceptions_of_corruption,Perceptions_of_corruption,Perceptions_of_corruption,Perceptions_of_corruption
9,Year,Year,Year,Year,Year


---

#### **Third Step**: Merging of standardized datasets

Task:
- Merge all dataframes by concatenating them row-wise.
- Use the `summary_by_columns` function from `analysis_functions.py` to inspect the resulting data types, the number of nulls, and other metrics.
- Export the merged data to a CSV file.

In [69]:
list_of_df = [df_2015, df_2016, df_2017, df_2018, df_2019]
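# ignore_index=True gives the merged dataframe a fresh index instead of repeating each year's index
df_merged = pd.concat(list_of_df, axis=0, ignore_index=True)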
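
A quick sanity check after concatenating is to confirm that no rows were lost and that each year contributes the expected number of records. The sketch below uses hypothetical frames and assumes that resetting the index with `ignore_index=True` is acceptable for the analysis.

```python
import pandas as pd

# Hypothetical yearly frames that share the same columns.
parts = [
    pd.DataFrame({'Country': ['A', 'B'], 'Year': 2015}),
    pd.DataFrame({'Country': ['A', 'C', 'D'], 'Year': 2016}),
]

stacked = pd.concat(parts, axis=0, ignore_index=True)

# Every input row should appear exactly once in the result.
expected_rows = sum(len(part) for part in parts)
rows_per_year = stacked.groupby('Year').size().to_dict()

print(f'Rows: {len(stacked)} (expected {expected_rows})')
print(f'Rows per year: {rows_per_year}')
```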

In [70]:
analysis_functions.summary_by_columns(df_merged)

Unnamed: 0,Column,Data Type,Missing Values,Unique Values,Duplicates,Missing Values (%)
0,Happiness_Rank,int64,0,158,624,0.0
1,Country,object,0,170,612,0.0
2,Happiness_Score,float64,0,716,66,0.0
3,GDP_per_capita,float64,0,742,40,0.0
4,Social_support,float64,0,732,50,0.0
5,Health_(Life_Expectancy),float64,0,705,77,0.0
6,Freedom,float64,0,697,85,0.0
7,Generosity,float64,0,664,118,0.0
8,Perceptions_of_corruption,float64,1,635,146,0.13
9,Year,int64,0,5,777,0.0


*As we can see, there is only one null value (in `Perceptions_of_corruption`), so I will drop the row that contains it and verify the results.*
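
Before dropping anything, it can help to look at the affected row so that we know which country and year are being removed. This is a minimal sketch on hypothetical data with a single missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df_demo = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Perceptions_of_corruption': [0.31, np.nan, 0.12],
    'Year': [2018, 2018, 2018],
})

# Rows that contain at least one null value.
rows_with_nulls = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_nulls)

# dropna() removes those rows and returns a new dataframe.
cleaned = df_demo.dropna()
print(f'Rows removed: {len(df_demo) - len(cleaned)}')
```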

In [71]:
df_merged.dropna(inplace=True)
analysis_functions.summary_by_columns(df_merged)

Unnamed: 0,Column,Data Type,Missing Values,Unique Values,Duplicates,Missing Values (%)
0,Happiness_Rank,int64,0,158,623,0.0
1,Country,object,0,170,611,0.0
2,Happiness_Score,float64,0,715,66,0.0
3,GDP_per_capita,float64,0,741,40,0.0
4,Social_support,float64,0,731,50,0.0
5,Health_(Life_Expectancy),float64,0,704,77,0.0
6,Freedom,float64,0,696,85,0.0
7,Generosity,float64,0,663,118,0.0
8,Perceptions_of_corruption,float64,0,635,146,0.0
9,Year,int64,0,5,776,0.0


In [72]:
df_merged.to_csv('../data/processed/merged_data.csv', index=False)
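
To confirm that an export can be read back as expected, the file can be reloaded and compared with the dataframe in memory. The sketch below uses an in-memory buffer and a hypothetical frame instead of the real file path.

```python
import io

import pandas as pd

# Hypothetical frame standing in for the merged dataset.
df_demo = pd.DataFrame({
    'Country': ['A', 'B'],
    'Happiness_Score': [7.5, 6.1],
    'Year': [2015, 2016],
})

# Write to an in-memory buffer with the same options used for the export.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.shape == df_demo.shape)
```

With `index=False`, the reloaded frame has the same columns as the original and no extra `Unnamed: 0` column.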

---