## Context

We have a separate dataset for each type of "factor" that people might consider when deciding which country to live in. To build models with this data, we need to merge them all into a single dataset.

### Exploration

We have 5 datasets in total:
- cost of living
- healthcare index
- safety index
- internet speed
- average temperatures

In [13]:
import pandas as pd

cost_of_living = pd.read_csv("../raw_data/Alternative_sources_country_level/cost_expense.csv")
healthcare = pd.read_csv("../raw_data/Alternative_sources_country_level/healthcare_index.csv")
climate = pd.read_csv("../raw_data/Alternative_sources_country_level/climate_avarage_temperature.csv")
internet = pd.read_csv("../raw_data/Alternative_sources_country_level/internet_speed_rankings.csv")
safety = pd.read_csv("../raw_data/Alternative_sources_country_level/safety_index_data.csv")

In [14]:
# Select only the columns we need for now, give them descriptive names, and set the index to Country for all datasets

# cost of living dataset
cost_of_living = cost_of_living[['country', 'average_monthly_cost($)']]
cost_of_living.columns = ['Country', 'average_monthly_cost_$']
cost_of_living.set_index('Country', inplace=True)

# healthcare dataset
healthcare = healthcare[['Country', 'Health Care Index']]
healthcare.columns = ['Country', 'Healthcare Index']
healthcare.set_index('Country', inplace=True)

# climate dataset
climate = climate[['Country', 'Temperature']]
climate.columns = ['Country', 'average_yearly_temperature']
climate.set_index('Country', inplace=True)

# internet dataset
internet = internet[['Country', 'Internet Speed (Mbps)']]
internet.columns = ['Country', 'internet_speed_mbps']
internet.set_index('Country', inplace=True)

# safety dataset
safety = safety[['Country', 'Safety Index']]
safety.columns = ['Country', 'safety_index']
safety.set_index('Country', inplace=True)
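
The same select, rename, and set-index steps repeat for every dataset, so they could be folded into one helper. The sketch below is only an illustration: `select_factor` is a hypothetical name, and the toy frame stands in for one of the raw CSVs with made-up values.

```python
import pandas as pd

def select_factor(df, country_col, value_col, new_name):
    """Keep one value column, rename it, and index the frame by country."""
    return (
        df[[country_col, value_col]]
        .rename(columns={country_col: "Country", value_col: new_name})
        .set_index("Country")
    )

# Toy frame standing in for a raw CSV (values are made up)
raw = pd.DataFrame({
    "country": ["Albania", "Algeria"],
    "average_monthly_cost($)": [500.0, 350.0],
    "extra": [1, 2],
})
tidy = select_factor(raw, "country", "average_monthly_cost($)", "average_monthly_cost_$")
print(tidy)
```

Because the helper returns a new frame instead of modifying a slice in place, the raw frame stays untouched and can be reused.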


In [15]:
# 1. Assess how the Country indexes are misaligned between the datasets

# Get all unique country names from each dataset
countries_cost = set(cost_of_living.index)
countries_climate = set(climate.index)
countries_internet = set(internet.index)
countries_safety = set(safety.index)
countries_healthcare = set(healthcare.index)

# Count total unique countries across all datasets
all_countries = countries_cost.union(countries_climate, countries_internet, countries_safety, countries_healthcare)
print(f"Total unique countries across all datasets: {len(all_countries)}")

# Check how many countries are common across all datasets
common_countries = countries_cost.intersection(countries_climate, countries_internet, countries_safety, countries_healthcare)
print(f"Countries common to all datasets: {len(common_countries)}")
print(f"Countries that would be lost in a direct merge: {len(all_countries) - len(common_countries)}")

# Check dataset-specific coverage
print(f"\nDataset coverage:")
print(f"Cost of living dataset: {len(countries_cost)} countries")
print(f"Climate dataset: {len(countries_climate)} countries")
print(f"Internet dataset: {len(countries_internet)} countries")
print(f"Safety dataset: {len(countries_safety)} countries")
print(f"Healthcare dataset: {len(countries_healthcare)} countries")

Total unique countries across all datasets: 255
Countries common to all datasets: 116
Countries that would be lost in a direct merge: 139

Dataset coverage:
Cost of living dataset: 172 countries
Climate dataset: 178 countries
Internet dataset: 155 countries
Safety dataset: 147 countries
Healthcare dataset: 236 countries
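
The counts above show how large the mismatch is, but not which names cause it. One way to inspect this, sketched here with plain sets, is to list the names that appear in a single dataset only. `only_in` is a hypothetical helper and the toy sets are illustrative, not the real indexes.

```python
# Toy country sets standing in for the real indexes (names are illustrative)
toy_sets = {
    "cost": {"Albania", "Antigua and Barbuda", "USA"},
    "healthcare": {"Albania", "Antigua And Barbuda", "United States"},
}

def only_in(name, sets):
    """Return the names found in sets[name] and in no other dataset."""
    others = set().union(*(s for key, s in sets.items() if key != name))
    return sorted(sets[name] - others)

for name in toy_sets:
    print(name, only_in(name, toy_sets))
```

Names that differ only in spelling or capitalization tend to show up in these lists in pairs, which makes them easy to spot before writing replacement rules.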


In [16]:
# 2. Standardize country names

# Create a function to standardize country names
def standardize_country_name(name):
    # Convert to lowercase for comparison
    name = name.lower()

    # Replacements to be made manually
    # NOTE: the two Congo entries are different countries; mapping both to
    # 'congo' collapses them into a single name
    easy_replacements = {
        'democratic republic of the congo': 'congo',
        'republic of the congo': 'congo',
        'hong kong (sar)': 'hong kong',
        'macau (sar)': 'macau',
        'trinidad and tobago': 'trinidad & tobago',
        'bosnia and herzegovina': 'bosnia & herzegovina',
        'antigua and barbuda': 'antigua & barbuda',
    }

    # Apply replacements (exact match on the full lowercase name)
    if name in easy_replacements:
        return easy_replacements[name]

    # Remove common prefixes
    prefixes = ['republic of ', 'the ', 'democratic republic of ', 'federation of ']
    for prefix in prefixes:
        if name.startswith(prefix):
            name = name[len(prefix):]

    # Write ' and ' as ' & ' so both spellings compare equal
    name = name.replace(' and ', ' & ')

    return name

# Create standardized versions of each dataset
def standardize_dataset(df):
    # Create a copy to avoid modifying the original
    df_std = df.copy()
    
    # Create a mapping of original to standardized names
    name_mapping = {idx: standardize_country_name(idx) for idx in df.index}
    
    # Create a new index with standardized names
    df_std.index = [name_mapping[idx] for idx in df.index]
    
    return df_std, name_mapping

# Standardize each dataset
cost_of_living_std, cost_mapping = standardize_dataset(cost_of_living)
climate_std, climate_mapping = standardize_dataset(climate)
internet_std, internet_mapping = standardize_dataset(internet)
safety_std, safety_mapping = standardize_dataset(safety)

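# Check the overlap after standardization
# (healthcare is not standardized here, so this is a four-dataset count and is
# not directly comparable with the five-dataset count of 116 above)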
countries_cost_std = set(cost_of_living_std.index)
countries_climate_std = set(climate_std.index)
countries_internet_std = set(internet_std.index)
countries_safety_std = set(safety_std.index)

common_countries_std = countries_cost_std.intersection(countries_climate_std, countries_internet_std, countries_safety_std)
print(f"\nAfter standardization:")
print(f"Countries common to all datasets: {len(common_countries_std)}")




After standardization:
Countries common to all datasets: 119
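
Rule-based standardization recovers only a few countries, so the remaining mismatches probably need manual review. A minimal sketch of how candidates could be surfaced with the standard library's `difflib.get_close_matches` follows. `suggest_matches` and the 0.8 cutoff are assumptions for illustration, and the suggestions are meant to be checked by hand rather than applied automatically.

```python
from difflib import get_close_matches

def suggest_matches(unmatched, reference, cutoff=0.8):
    """Map each unmatched name to its closest reference names (case-insensitive)."""
    lowered = {name.lower(): name for name in reference}
    suggestions = {}
    for name in unmatched:
        hits = get_close_matches(name.lower(), list(lowered), n=3, cutoff=cutoff)
        if hits:
            suggestions[name] = [lowered[h] for h in hits]
    return suggestions

# Toy inputs (illustrative names only)
unmatched = ["Antigua And Barbuda", "Bosnia and Herzegovina", "Bhutan"]
reference = ["Antigua and Barbuda", "Bosnia & Herzegovina", "Benin"]
suggestions = suggest_matches(unmatched, reference)
print(suggestions)
```

Names with no sufficiently similar counterpart, such as `Bhutan` in the toy input, are left out of the result, which separates true gaps in coverage from spelling variants.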


In [25]:
# 3. Analyze country coverage and missing countries

# Create a DataFrame to show which countries are in which datasets
country_coverage = pd.DataFrame(index=sorted(all_countries))
country_coverage['in_cost'] = country_coverage.index.isin(countries_cost)
country_coverage['in_climate'] = country_coverage.index.isin(countries_climate)
country_coverage['in_internet'] = country_coverage.index.isin(countries_internet)
country_coverage['in_safety'] = country_coverage.index.isin(countries_safety)
country_coverage['in_healthcare'] = country_coverage.index.isin(countries_healthcare)

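# Calculate coverage percentage for each country
# NOTE: five indicator columns are summed but the divisor is 4, so a country in
# all five datasets scores 125 and one in four of the five scores 100; the
# `< 100` filter below therefore only catches countries in three or fewer datasets
country_coverage['coverage_pct'] = country_coverage.sum(axis=1) / 4 * 100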

# Show countries with partial coverage
partial_coverage = country_coverage[country_coverage['coverage_pct'] < 100]
print(f"\nCountries with partial coverage: {len(partial_coverage)}")
print(partial_coverage.head(10))  # Show first 10 examples

# Create a mapping dictionary for manual corrections
# Each variant maps to one canonical spelling (mapping in both directions
# would only swap the names and leave the datasets misaligned)
manual_corrections = {
    # Examples of manual corrections
    'USA': 'United States',
    'UK': 'United Kingdom',
    # Add more as needed based on the analysis
}

# Function to merge datasets after applying the manual name corrections
# (standardize_country_name is not applied here)
def merge_datasets_with_standardization():
    # Work on copies so the original frames are left untouched
    cost_std = cost_of_living.copy()
    climate_std = climate.copy()
    internet_std = internet.copy()
    safety_std = safety.copy()
    healthcare_std = healthcare.copy()
    
    # Apply manual corrections to indexes
    for dataset in [cost_std, climate_std, internet_std, safety_std, healthcare_std]:
        new_index = [manual_corrections.get(country, country) for country in dataset.index]
        dataset.index = new_index
    
    # Merge datasets using inner join to keep only countries present in all datasets
    merged = pd.merge(cost_std, climate_std, left_index=True, right_index=True, how='inner')
    merged = pd.merge(merged, internet_std, left_index=True, right_index=True, how='inner')
    merged = pd.merge(merged, safety_std, left_index=True, right_index=True, how='inner')
    merged = pd.merge(merged, healthcare_std, left_index=True, right_index=True, how='inner')
    
    return merged

# Build a provisional merge now; rerun once the standardization approach is finalized
merged_data = merge_datasets_with_standardization()



Countries with partial coverage: 111
                     in_cost  in_climate  in_internet  in_safety  \
Aland Islands          False       False        False      False   
Alderney               False       False        False      False   
American Samoa         False       False        False      False   
Anguilla               False       False        False      False   
Antigua And Barbuda    False       False        False      False   
Antigua and Barbuda    False        True         True      False   
Aruba                   True       False        False      False   
Benin                  False        True         True      False   
Bermuda                 True       False        False      False   
Bhutan                  True        True        False      False   

                     in_healthcare  coverage_pct  
Aland Islands                 True          25.0  
Alderney                      True          25.0  
American Samoa                True          25.0  
Anguilla 
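
Coverage can also be computed as the row-wise mean of the boolean indicator columns, which keeps the denominator equal to the number of datasets without hard-coding it. The sketch below uses toy membership sets with only three indicators, so the values are illustrative and not taken from the real data.

```python
import pandas as pd

# Toy membership sets (illustrative, not the real data)
membership = {
    "in_cost": {"Aruba", "Bhutan", "Albania"},
    "in_climate": {"Bhutan", "Albania"},
    "in_healthcare": {"Aland Islands", "Albania"},
}
all_names = sorted(set().union(*membership.values()))

flags = pd.DataFrame(
    {col: [name in names for name in all_names] for col, names in membership.items()},
    index=all_names,
)
# The mean of booleans is the fraction of datasets that contain the country,
# so the denominator always matches the number of indicator columns
flags["coverage_pct"] = flags.mean(axis=1) * 100
print(flags)
```

With this form, 100 always means "present in every dataset", whatever the number of indicator columns.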

In [26]:
merged_data.head()

| | average_monthly_cost_$ | average_yearly_temperature | internet_speed_mbps | safety_index | Healthcare Index |
|---|---|---|---|---|---|
| Afghanistan | 960.545 | 18.1 | 3.88 | 24.9 | 24.24 |
| Albania | 518.916429 | 22.2 | 81.41 | 55.3 | 48.21 |
| Algeria | 356.0455 | 22.8 | 16.54 | 47.4 | 54.43 |
| Angola | 740.635 | 27.1 | 22.91 | 33.7 | 36.58 |
| Argentina | 503.73125 | 15.1 | 93.38 | 36.6 | 68.0 |
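
The chained inner merges keep only countries present in every dataset. To see the trade-off against keeping all countries, the same chain can be written with `functools.reduce` and `DataFrame.join`, switching only the `how` argument. This is a sketch on toy frames with made-up values, not the notebook's actual merge.

```python
from functools import reduce
import pandas as pd

# Toy single-column frames indexed by country (values are made up)
cost = pd.DataFrame({"cost": [500.0, 350.0]}, index=["Albania", "Algeria"])
temp = pd.DataFrame({"temp": [22.2, 27.1]}, index=["Albania", "Angola"])
speed = pd.DataFrame({"speed": [81.4, 16.5, 22.9]}, index=["Albania", "Algeria", "Angola"])
frames = [cost, temp, speed]

# Inner join: keep only countries present in every frame
inner = reduce(lambda left, right: left.join(right, how="inner"), frames)

# Outer join: keep every country and leave the gaps as NaN
outer = reduce(lambda left, right: left.join(right, how="outer"), frames)

print(inner)
print(outer)
```

An outer join preserves partially covered countries for later imputation, at the cost of missing values that the models would need to handle.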


In [28]:
merged_data.to_csv("../raw_data/merged_country_level/temporary_merged_data.csv")
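
Because the merged index has no name, pandas writes an empty header for the first column, and it reads back as `Unnamed: 0` unless an index column is specified. One option, sketched here with an in-memory buffer instead of the real file path, is to label the index on write and restore it on read. The `"Country"` label is an assumption chosen to match the earlier cells.

```python
import io
import pandas as pd

# Toy frame with an unnamed country index (values are illustrative)
merged = pd.DataFrame({"safety_index": [24.9, 55.3]}, index=["Afghanistan", "Albania"])

buffer = io.StringIO()
# index_label names the index column in the written file
merged.to_csv(buffer, index_label="Country")
buffer.seek(0)

# index_col restores the country names as the index on load
restored = pd.read_csv(buffer, index_col="Country")
print(restored)
```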