# Merging and preparation of data sets

In this Jupyter notebook we merge and prepare all of the collected data sets. Because we aim to cover as many countries and parameters as possible, we build one general data set spanning every category we can collect. That data set is then used for further analysis and exploration.

Parameters: <u>safety</u>, tuition, <u>groceries</u>, social & sports activities, <u>rent</u>, public transport, ratio of foreign to domestic students, and health & health insurance

In [18]:
# Importing modules and libraries
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings("ignore")

In [19]:
# Directory searching and collecting data set paths
cwd = os.getcwd()
#print(cwd)
datasets = list()
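for file in os.listdir():
    # endswith() avoids an IndexError on names without a dot
    # and still matches names that contain several dots
    if file.endswith(".csv"):
        datasets.append(file)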
#print(datasets)
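
As an alternative to the `os.listdir()` loop, the file discovery step could also be sketched with `pathlib`. The helper name `find_csv_files` and the file names in the demo are hypothetical and exist only for illustration; the demo runs in a temporary directory so it does not touch the notebook's working directory.

```python
from pathlib import Path
import tempfile


def find_csv_files(directory):
    """Return the sorted names of the CSV files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.csv") if p.is_file())


# Demo in a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["crime_rankings_2020.csv", "cli_2024.csv", "README", "notes.v2.txt"]:
        (Path(tmp) / name).touch()
    found = find_csv_files(tmp)
    print(found)
```

Matching on the `*.csv` pattern means that files without an extension, or with several dots in their name, are simply skipped rather than causing an error.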

## Loading safety index data sets

We store the average safety index of each country over the last five years (2020 to 2024) and use it for the rest of the analysis and calculations. <br>
Read about the safety index here: <a href="https://www.numbeo.com/crime/indices_explained.jsp">About Crime Indexes</a> or see the <a href="./../docs/docs.docx">UniMatch documentation</a>.

In [20]:
# Merging and preparing safety index data sets
safetyConcat = []
years = [2020, 2021, 2022, 2023, 2024]

for file in datasets:
    if file.startswith("crime_rankings"):
        tempDF = pd.read_csv(file)
        safetyConcat.append(tempDF)

mergedDF = pd.concat(safetyConcat, ignore_index=True)
df = mergedDF.groupby("country")["safety_index"].mean().reset_index()
df.rename(columns={"safety_index": "avgSafetyIndex"}, inplace=True)
#df.head()
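
To make the averaging step concrete, here is a minimal sketch on toy data. The country labels and values are hypothetical and are not taken from the real data sets; only the column names `country` and `safety_index` follow the cell above.

```python
import pandas as pd

# Hypothetical yearly values for two countries (not real data)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "safety_index": [50.0, 60.0, 70.0, 80.0, 90.0],
})

# One row per country, holding the mean of all its yearly values
avg = (
    toy.groupby("country", as_index=False)["safety_index"]
    .mean()
    .rename(columns={"safety_index": "avgSafetyIndex"})
)
print(avg)
```

Note that a country missing from some yearly files is averaged over the years it does appear in, so the averages are not necessarily based on the same number of years for every country.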

## Loading cost of living data sets

We merge a new data set with general cost-of-living information for the given countries into our general data set. Of all the indices available, we keep three: cost of living index, rent index and groceries index.<br>
Read about the cost of living indices here: <a href="https://www.numbeo.com/cost-of-living/cpi_explained.jsp">Understanding Cost of Living Indexes</a> or see the <a href="./../docs/docs.docx">UniMatch documentation</a>.

In [21]:
# Merging and preparing cost of living data sets
costLiving = pd.read_csv("./cli_2024.csv")
costLiving.columns = costLiving.columns.str.lower()
#costLiving.head()
# Inner join: keep only countries present in both data sets
df = pd.merge(
    df,
    costLiving[["country", "cost of living index", "rent index", "groceries index"]],
    on="country",
    how="inner",
)
df.head()


| | country | avgSafetyIndex | cost of living index | rent index | groceries index |
|---|---|---|---|---|---|
| 0 | Albania | 56.3 | 42.1 | 10.6 | 42.0 |
| 1 | Algeria | 47.94 | 28.9 | 3.8 | 36.8 |
| 2 | Argentina | 36.46 | 29.4 | 7.6 | 29.7 |
| 3 | Armenia | 77.72 | 41.0 | 19.0 | 36.0 |
| 4 | Australia | 55.96 | 70.2 | 33.4 | 77.3 |
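
Because an inner merge silently drops every country that is missing from either table (for example when country names are spelled differently), it can be useful to check what was lost. The following is a minimal sketch under assumed toy data: the frames `left` and `right` and their values are hypothetical, while `indicator=True` is a standard `pd.merge` option that adds a `_merge` column.

```python
import pandas as pd

# Hypothetical stand-ins for the safety and cost-of-living tables
left = pd.DataFrame({"country": ["A", "B", "C"], "avgSafetyIndex": [50.0, 60.0, 70.0]})
right = pd.DataFrame({"country": ["B", "C", "D"], "rent index": [10.0, 20.0, 30.0]})

# The inner merge keeps only countries found in both tables
inner = pd.merge(left, right, on="country", how="inner")

# An outer merge with indicator=True marks where each row came from
audit = pd.merge(left, right, on="country", how="outer", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", ["country", "_merge"]]

print(inner["country"].tolist())
print(dropped)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard, which gives a quick list of country names worth checking for spelling mismatches.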
