# Homework 4: Merging, Aggregating, and Reshaping

This homework covers three data-wrangling tasks in R: merging, aggregating, and reshaping.

## Exercise 1 - Merging

Merge the three files (infant_mortality_rate.csv, life_expectancy_at_birth.csv, maternal_mortality_ratio.csv) into a single file. Drop unneeded columns and rename columns as needed. Save the result as a CSV file.

In [15]:
# Load the three CSV files from GitHub
link_infant <- "https://raw.githubusercontent.com/FundamentalsChinmai/Hw04/refs/heads/main/data/infant_mortality_rate.csv" # URL : infant mortality data
link_life <- "https://raw.githubusercontent.com/FundamentalsChinmai/Hw04/refs/heads/main/data/life_expectancy_at_birth.csv" # URL : life expectancy data
link_maternal <- "https://raw.githubusercontent.com/FundamentalsChinmai/Hw04/refs/heads/main/data/maternal_mortality_ratio.csv" # URL : maternal mortality data

infant_mortality <- read.csv(link_infant) # read CSV file : infant_mortality from GitHub
life_expectancy <- read.csv(link_life) # read CSV file : life_expectancy from GitHub
maternal_mortality <- read.csv(link_maternal) # read CSV file : maternal_mortality from GitHub

In [16]:
# Merge the three files by country (merge() defaults to an inner join, so only countries present in all three files are kept)
merged_data <- merge(infant_mortality, life_expectancy, by = "country") # join infant mortality with life expectancy
merged_data <- merge(merged_data, maternal_mortality, by = "country") # add maternal mortality to the previous result
merged_data <- merged_data[, c("country", "region", "infant_mortality", "life_expectancy", "maternal_mortality")] # keep only country, region, and the three health metrics

write.csv(merged_data, "merged_health_data.csv", row.names = FALSE) # save merged data : merged_health_data.csv without row names

In [17]:
# Verify the file was created correctly
verify_merged <- read.csv("merged_health_data.csv") # read CSV file : verify merged_health_data.csv
cat("Number of rows:", nrow(verify_merged), "\n") # print number of rows : nrow function
cat("Number of columns:", ncol(verify_merged), "\n") # print number of columns : ncol function
cat("Column names:", paste(colnames(verify_merged), collapse = ", "), "\n") # print column names : paste with collapse
head(verify_merged) # display first few rows : verify structure

Number of rows: 196 
Number of columns: 5 
Column names: country, region, infant_mortality, life_expectancy, maternal_mortality 


| | country | region | infant_mortality | life_expectancy | maternal_mortality |
|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<int>` |
| 1 | AFGHANISTAN | South Asia | 42.0 | 54.4 | 521 |
| 2 | ALBANIA | Europe | 10.8 | 79.9 | 7 |
| 3 | ALGERIA | Africa | 18.6 | 77.9 | 62 |
| 4 | ANDORRA | Europe | 3.3 | 83.8 | 11 |
| 5 | ANGOLA | Africa | 46.1 | 62.9 | 183 |
| 6 | ANTIGUA AND BARBUDA | Central America and the Caribbean | 13.3 | 78.3 | 35 |
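
Because `merge()` performs an inner join by default, a country missing from any one file disappears from the result without a warning. The sketch below uses toy data frames (hypothetical values and country codes, not the homework files) to compare an inner join with a full outer join and to list the countries the inner join would drop. Chaining the merges with `Reduce()` is a stylistic choice made here, not the approach used in the cell above.

```r
# Toy stand-ins for the three input files (hypothetical values, not the real data)
infant   <- data.frame(country = c("A", "B", "C"), infant_mortality = c(42.0, 10.8, 18.6))
life     <- data.frame(country = c("A", "B", "D"), life_expectancy = c(54.4, 79.9, 83.8))
maternal <- data.frame(country = c("A", "B", "C"), maternal_mortality = c(521L, 7L, 62L))

tables <- list(infant, life, maternal)

# Inner join (merge's default): only countries present in all three tables survive
inner <- Reduce(function(x, y) merge(x, y, by = "country"), tables)

# Full outer join: every country is kept, with NA where a table has no match
outer <- Reduce(function(x, y) merge(x, y, by = "country", all = TRUE), tables)

# Countries that the inner join dropped
dropped <- setdiff(outer$country, inner$country)

print(inner)
print(outer)
print(dropped)
```

Running the same comparison on the real files would show whether the 196 rows reported above include every country in the inputs.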


## Exercise 2 - Aggregating

Aggregate the merged file by region. Choose ONE suitable function to aggregate the variables. Save the result as a CSV file.

In [18]:
# Aggregate by region using mean
merged_data <- read.csv("merged_health_data.csv") # load merged file : merged_health_data.csv
aggregated_data <- aggregate(cbind(infant_mortality, life_expectancy, maternal_mortality) ~ region,
    data = merged_data, # source data : merged_data
    FUN = mean) # aggregation function : mean for all three health metrics by region

write.csv(aggregated_data, "aggregated_by_region.csv", row.names = FALSE) # save aggregated data : aggregated_by_region.csv without row names

In [19]:
# Verify the file was created correctly
verify_aggregated <- read.csv("aggregated_by_region.csv") # read CSV file : verify aggregated_by_region.csv
cat("Number of rows:", nrow(verify_aggregated), "\n") # print number of rows : nrow function
cat("Number of columns:", ncol(verify_aggregated), "\n") # print number of columns : ncol function
cat("Column names:", paste(colnames(verify_aggregated), collapse = ", "), "\n") # print column names : paste with collapse
verify_aggregated # display aggregated data : verify structure

Number of rows: 10 
Number of columns: 4 
Column names: region, infant_mortality, life_expectancy, maternal_mortality 


| region | infant_mortality | life_expectancy | maternal_mortality |
|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Africa | 39.568519 | 67.06852 | 292.12963 |
| Australia and Oceania | 14.842857 | 75.49286 | 94.71429 |
| Central America and the Caribbean | 13.22381 | 76.54286 | 69.2381 |
| Central Asia | 17.122222 | 73.82222 | 18.11111 |
| East and Southeast Asia | 16.076471 | 75.58235 | 77.64706 |
| Europe | 4.019048 | 80.12381 | 7.0 |
| Middle East | 12.93125 | 77.1625 | 22.9375 |
| North America | 7.033333 | 79.9 | 23.66667 |
| South America | 14.325 | 75.325 | 73.33333 |
| South Asia | 27.7125 | 71.125 | 138.75 |
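
One detail of the formula interface of `aggregate()` is worth checking: its default `na.action` is `na.omit`, so a row with a missing value in any of the aggregated columns is removed entirely before the means are computed. Whether the merged file contains missing values is not shown above, so the following is only a sketch on toy data (hypothetical values) comparing the formula interface with the data-frame interface plus `na.rm = TRUE`.

```r
# Toy data with one missing maternal_mortality value (hypothetical numbers)
toy <- data.frame(
  region = c("X", "X", "Y", "Y"),
  infant_mortality = c(10, 20, 30, 40),
  maternal_mortality = c(100, NA, 300, 500)
)

# Formula interface: na.omit removes the whole second row, so the
# infant_mortality mean for region X is based on a single country
by_formula <- aggregate(cbind(infant_mortality, maternal_mortality) ~ region,
                        data = toy, FUN = mean)

# Data-frame interface: all rows are kept and na.rm = TRUE is applied per column
by_column <- aggregate(toy[, c("infant_mortality", "maternal_mortality")],
                       by = list(region = toy$region),
                       FUN = mean, na.rm = TRUE)

print(by_formula)
print(by_column)
```

If the two approaches give different results on the real data, at least one country has a missing metric, and the choice between them should be made deliberately.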


## Exercise 3 - Reshaping

The merged file created in Exercise 1 is in wide format. Convert it to the correct long format. Consider omitting the `region` column; you may keep it, as long as it is used properly. Save the result as a CSV file.

In [20]:
# Reshape from wide to long format
library(tidyr) # load package : tidyr for pivot_longer function
merged_data <- read.csv("merged_health_data.csv") # load merged file : merged_health_data.csv
long_data <- pivot_longer(data = merged_data, # source data : merged_data in wide format
    cols = c(infant_mortality, life_expectancy, maternal_mortality), # columns to pivot : three health metrics
    names_to = "metric", # new column name : metric for column names
    values_to = "value") # new column name : value for metric values

write.csv(long_data, "reshaped_long_format.csv", row.names = FALSE) # save reshaped data : reshaped_long_format.csv without row names


In [21]:
# Verify the file was created correctly
verify_long <- read.csv("reshaped_long_format.csv") # read CSV file : verify reshaped_long_format.csv
cat("Number of rows:", nrow(verify_long), "\n") # print number of rows : nrow function
cat("Number of columns:", ncol(verify_long), "\n") # print number of columns : ncol function
cat("Column names:", paste(colnames(verify_long), collapse = ", "), "\n") # print column names : paste with collapse
cat("Unique metrics:", paste(unique(verify_long$metric), collapse = ", "), "\n") # print unique metrics : paste unique metric names
head(verify_long, 20) # display first 20 rows : verify structure

Number of rows: 588 
Number of columns: 4 
Column names: country, region, metric, value 
Unique metrics: infant_mortality, life_expectancy, maternal_mortality 


| | country | region | metric | value |
|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<dbl>` |
| 1 | AFGHANISTAN | South Asia | infant_mortality | 42.0 |
| 2 | AFGHANISTAN | South Asia | life_expectancy | 54.4 |
| 3 | AFGHANISTAN | South Asia | maternal_mortality | 521.0 |
| 4 | ALBANIA | Europe | infant_mortality | 10.8 |
| 5 | ALBANIA | Europe | life_expectancy | 79.9 |
| 6 | ALBANIA | Europe | maternal_mortality | 7.0 |
| 7 | ALGERIA | Africa | infant_mortality | 18.6 |
| 8 | ALGERIA | Africa | life_expectancy | 77.9 |
| 9 | ALGERIA | Africa | maternal_mortality | 62.0 |
| 10 | ANDORRA | Europe | infant_mortality | 3.3 |
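
A quick sanity check for a wide-to-long reshape is that the long table has one row per country per metric, which matches the output above: 196 countries × 3 metrics = 588 rows. The sketch below illustrates that check on toy data (hypothetical values). It uses base R's `reshape()` as an alternative to `pivot_longer()` so that it runs without tidyr; it is not the method used in the cell above.

```r
# Toy wide table (hypothetical values), one row per country
wide <- data.frame(
  country = c("A", "B"),
  infant_mortality = c(42.0, 10.8),
  life_expectancy = c(54.4, 79.9),
  maternal_mortality = c(521, 7)
)
metrics <- c("infant_mortality", "life_expectancy", "maternal_mortality")

# Stack the three metric columns into a single value column,
# recording the original column name in metric
long <- reshape(wide,
                direction = "long",
                varying = list(metrics),
                v.names = "value",
                timevar = "metric",
                times = metrics,
                idvar = "country")

# Order rows by country, then metric, and reset the row names
long <- long[order(long$country, long$metric), ]
rownames(long) <- NULL

# Each country should appear once per metric
stopifnot(nrow(long) == nrow(wide) * length(metrics))
stopifnot(all(table(long$country) == length(metrics)))

print(long)
```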
