# Merging and preparation of data sets

In this Jupyter notebook we merge and prepare all of the collected data sets. Because we aim to cover as many countries and parameters as possible, we build one general data set spanning every category we could collect. That data set is then used for further analysis and exploration.

Parameters: <u>safety</u>, tuition, <u>groceries</u>, social & sports activities, <u>rent</u>, public transport, ratio of foreign and domestic students and health & health insurance

In [12]:
# Importing modules and libraries
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings("ignore")

In [13]:
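# Directory searching and collecting data set paths
cwd = os.getcwd()
#print(cwd)
datasets = list()
for file in os.listdir(cwd):
    # splitext copes with names that have no extension (e.g. folders) or several dots,
    # where file.split('.')[1] would raise IndexError or pick the wrong part
    if os.path.splitext(file)[1] == ".csv":
        datasets.append(file)
#print(datasets)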

## Loading safety index data sets

We store the average safety index of each country over the last 5 years and use that average for the rest of the analysis and calculations. <br>
Read about the safety index here: <a href="https://www.numbeo.com/crime/indices_explained.jsp">About Crime Indexes</a> or find the reference in the <a href="./../docs/docs.docx">UniMatch documentation</a>.

In [14]:
# Merging and preparing safety index data sets
safetyConcat = []
years = [2020, 2021, 2022, 2023, 2024]

# Collect every yearly crime ranking file found in the working directory
for file in datasets:
    if file.startswith("crime_rankings"):
        tempDF = pd.read_csv(file)
        safetyConcat.append(tempDF)

# Stack the yearly tables and average the safety index per country
mergedDF = pd.concat(safetyConcat, ignore_index=True)
df = mergedDF.groupby("country")["safety_index"].mean().reset_index()
df = df.rename(columns={"safety_index": "avgSafetyIndex"})
#df.head()
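
The averaging step can be illustrated on a toy example. This is only a sketch: the two small tables and their values are hypothetical stand-ins for the yearly `crime_rankings` files, and only the column names `country` and `safety_index` are taken from the cell above.

```python
import pandas as pd

# Hypothetical yearly tables standing in for two crime_rankings files
safety_a = pd.DataFrame({"country": ["Albania", "Algeria"], "safety_index": [55.0, 47.0]})
safety_b = pd.DataFrame({"country": ["Albania", "Algeria"], "safety_index": [57.0, 49.0]})

# Stack the yearly tables on top of each other, then average per country
stacked = pd.concat([safety_a, safety_b], ignore_index=True)
avg_safety = stacked.groupby("country")["safety_index"].mean().reset_index()
avg_safety = avg_safety.rename(columns={"safety_index": "avgSafetyIndex"})

print(avg_safety)
```

Each country ends up with a single row holding the mean of its yearly values, which is the shape the later merges rely on.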

## Loading cost of living data sets

We merge a data set with general cost of living information for each country into our general data set. Of all the indices available we keep three: cost of living index, rent index and groceries index.<br>
Read about the cost of living indices here: <a href="https://www.numbeo.com/cost-of-living/cpi_explained.jsp">Understanding Cost of Living Indexes</a> or find the reference in the <a href="./../docs/docs.docx">UniMatch documentation</a>.

In [15]:
# Merging and preparing cost of living data sets
costLiving = pd.read_csv("./cli_2024.csv")
costLiving.columns = costLiving.columns.str.lower()
#costLiving.head()
df = pd.merge(df, costLiving[["country", "cost of living index", "rent index", "groceries index"]], on="country", how ="inner")
#df.head()
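
Because the merge above uses `how="inner"`, any country whose name is spelled differently in the two sources is dropped without warning. One possible way to spot such cases, sketched here on hypothetical rows (the "Czech Republic" / "Czechia" mismatch and its values are invented for illustration), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Hypothetical rows: one country matches, one is spelled differently in each source
left = pd.DataFrame({"country": ["Albania", "Czech Republic"], "avgSafetyIndex": [56.3, 74.0]})
right = pd.DataFrame({"country": ["Albania", "Czechia"], "rent index": [10.6, 25.0]})

# indicator=True adds a "_merge" column saying where each row came from
check = pd.merge(left, right, on="country", how="outer", indicator=True)

# Rows present in only one of the two tables would be lost by an inner merge
unmatched = check.loc[check["_merge"] != "both", ["country", "_merge"]]
print(unmatched)
```

Running a check like this before the real merge gives a list of names to harmonise by hand.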


## Adding continent factor to data set

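We add each country's region to the general data set so that the analysis can also be broken down by continent. The Americas are split into their sub-regions, and "Latin America and the Caribbean" is relabelled as "Southern America".<br>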

In [16]:
# Adding continent factor
continent = pd.read_csv("./continent.csv")
#continent.head()
continent.loc[continent["region"] == "Americas", "region"] = continent["sub-region"]
continent.loc[continent["region"] == "Latin America and the Caribbean", "region"] = "Southern America"
continent= continent.rename(columns={"name": "country"})
df = pd.merge(df, continent[["country", "region"]], on="country", how ="left")
#df.head()
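
The relabelling of the Americas can be sketched on a three-row table. The rows below are hypothetical; the column names `name`, `region` and `sub-region` follow the cell above, and the "Northern America" label is an assumption about how the source file names that sub-region.

```python
import pandas as pd

# Hypothetical excerpt of the continent table
regions = pd.DataFrame({
    "name": ["Argentina", "Canada", "Albania"],
    "region": ["Americas", "Americas", "Europe"],
    "sub-region": ["Latin America and the Caribbean", "Northern America", "Southern Europe"],
})

# Replace the coarse "Americas" label with the finer sub-region
is_americas = regions["region"] == "Americas"
regions.loc[is_americas, "region"] = regions.loc[is_americas, "sub-region"]

# Shorten the long sub-region name
regions["region"] = regions["region"].replace(
    {"Latin America and the Caribbean": "Southern America"}
)

print(regions[["name", "region"]])
```

Countries outside the Americas keep their original region, so only the first two rows change.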


## Loading healthcare price data sets

We merge another data set into our general data set, giving the healthcare prices for each country. We use the newest data available, which is the 2021 column. <br>
Read about healthcare prices here: <a href="https://databank.worldbank.org/source/health-nutrition-and-population-statistics/preview/on">Health Statistics</a> or find the reference in the <a href="./../docs/docs.docx">UniMatch documentation</a>.

In [17]:
# Merging healthcare prices
health = pd.read_excel("health-care.xlsx")
#health.head()
health["healthcare price"] = health["2021 [YR2021]"]
health= health.rename(columns={"Country Name": "country"})
df = pd.merge(df, health[["country", "healthcare price"]], on="country", how ="left")
df["healthcare price"] = pd.to_numeric(df["healthcare price"], errors='coerce')
#df.head()
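
The `pd.to_numeric(..., errors='coerce')` call above turns anything that cannot be parsed as a number into `NaN`. A minimal sketch, assuming the spreadsheet marks missing values with a text placeholder such as `".."` (an assumption about the export, not something shown in this notebook):

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text plus one placeholder for a missing value
raw = pd.Series(["277.47", "..", "482"], name="healthcare price")

# Unparseable entries become NaN instead of raising an error
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print("missing values:", clean.isna().sum())
```

Counting the `NaN` values afterwards shows how many countries have no usable healthcare price.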

## Loading average monthly cost of public transport

In [18]:
# Merging monthly cost of public transport
monthlyPass = pd.read_csv("public_transport.csv")
#monthlyPass.head()
df = pd.merge(df, monthlyPass[["country", "avgMntTransportCost"]], on="country", how ="left")
df.head()

| | country | avgSafetyIndex | cost of living index | rent index | groceries index | region | healthcare price | avgMntTransportCost |
|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 56.3 | 42.1 | 10.6 | 42.0 | Europe | 277.471168 | 17.5 |
| 1 | Algeria | 47.94 | 28.9 | 3.8 | 36.8 | Africa | 78.820208 | 10.09 |
| 2 | Argentina | 36.46 | 29.4 | 7.6 | 29.7 | Southern America | 233.676745 | 11.0 |
| 3 | Armenia | 77.72 | 41.0 | 19.0 | 36.0 | Asia | 482.0 | 10.35 |
| 4 | Australia | 55.96 | 70.2 | 33.4 | 77.3 | Oceania | 975.402664 | 114.375 |


## Saving general data set to csv, parquet and xlsx

In [20]:
df.to_csv('GENERAL.csv', index=False)
df.to_excel('GENERAL.xlsx', index=False)
df.to_parquet('GENERAL.parquet')
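
A quick way to confirm that a saved file can be read back unchanged is a round trip through a temporary directory. This is a sketch on a hypothetical two-row sample (the file name `GENERAL_sample.csv` is invented), and it covers the CSV format only:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for the general data set
sample = pd.DataFrame({"country": ["Albania", "Algeria"], "avgSafetyIndex": [56.3, 47.94]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "GENERAL_sample.csv")
    # index=False keeps the row index out of the file, as in the cell above
    sample.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Raises an AssertionError if columns, dtypes or values differ
pd.testing.assert_frame_equal(sample, restored)
print(restored.shape)
```

Saving with `index=False` is what prevents an extra unnamed index column from appearing when the file is loaded again.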