# Global Ethos: Clustering Countries based on Values and Beliefs

## Introduction

This project aims to analyze and cluster countries based on their values and beliefs, using data from the WVS Cross-National Wave 7 (2017-2022) dataset. Focusing on three thematic sub-sections—social values, religious values, and ethical values—we seek to uncover patterns and relationships within and among nations, providing insights into global cultural diversity.

Through the application of unsupervised learning, specifically clustering and dimensionality reduction, we will group countries based on shared characteristics, revealing clusters and patterns that shed light on cultural similarities and differences across nations. This analysis will contribute to a deeper understanding of the cultural landscapes shaping societies globally.

The project's outcomes have broad implications for cross-cultural studies, policy-making, and intercultural understanding. Understanding the values and beliefs that underpin different societies can foster empathy, appreciation, and collaboration among nations, promoting a more inclusive and harmonious global society.

## Dataset

| Feature                                               | Description                                      |
|-------------------------------------------------------|--------------------------------------------------|
| country                                               | Country of the respondent                        |
| importance_of_family                                  | Is family important? (numerical)                 |
| importance_of_friends                                 | Are friends important? (numerical)               |
| importance_of_leisure_time                            | Is leisure time important? (numerical)           |
| importance_of_politics                                | Is politics important? (numerical)               |
| importance_of_work                                    | Is work important? (numerical)                   |
| importance_of_religion                                | Is religion important? (numerical)               |
| child_qualities_good_manners                          | Are good manners important for children? (numerical) |
| child_qualities_independence                          | Is independence important for children? (numerical) |
| child_qualities_hard_work                             | Is hard work important for children? (numerical)  |
| child_qualities_feeling_of_responsibility             | Is the feeling of responsibility important for children? (numerical) |
| child_qualities_imagination                           | Is imagination important for children? (numerical) |
| child_qualities_tolerance_and_respect                 | Are tolerance and respect important for children? (numerical) |
| child_qualities_thrift_saving_money_and_things        | Is thrift (saving money and things) important for children? (numerical) |
| child_qualities_determination_perseverance            | Is determination and perseverance important for children? (numerical) |
| child_qualities_religious_faith                       | Is religious faith important for children? (numerical) |
| child_qualities_unselfishness                         | Is unselfishness important for children? (numerical) |
| child_qualities_obedience                             | Is obedience important for children? (numerical)   |
| neighbors_drug_addicts                                | Would it be a problem to have drug addicts as neighbors? (numerical) |
| neighbors_different_race                              | Would it be a problem to have people of different race as neighbors? (numerical) |
| neighbors_people_with_aids                            | Would it be a problem to have people with AIDS as neighbors? (numerical) |
| neighbors_immigrants_foreign_workers                  | Would it be a problem to have immigrants/foreign workers as neighbors? (numerical) |
| neighbors_homosexuals                                 | Would it be a problem to have homosexuals as neighbors? (numerical) |
| neighbors_different_religion                          | Would it be a problem to have people of different religion as neighbors? (numerical) |
| neighbors_heavy_drinkers                              | Would it be a problem to have heavy drinkers as neighbors? (numerical) |
| neighbors_unmarried_couples_living_together           | Would it be a problem to have unmarried couples living together as neighbors? (numerical) |
| neighbors_different_language                          | Would it be a problem to have people who speak a different language as neighbors? (numerical) |
| goal_make_parents_proud                               | One of my main goals in life has been to make my parents proud (numerical) |
| child_suffers_when_working_mother                     | When a mother works for pay, the children suffer (numerical) |
| men_better_political_leaders_than_women               | On the whole, men make better political leaders than women do (numerical) |
| university_more_important_for_boys                    | A university education is more important for a boy than for a girl (numerical) |
| men_better_business_executives_than_women             | On the whole, men make better business executives than women do (numerical) |
| being_a_housewife_just_as_fulfilling                  | Being a housewife is just as fulfilling as working for pay (numerical) |
| jobs_scarce_men_have_more_right_to_job_than_women     | When jobs are scarce, men should have more right to a job than women (numerical) |
| jobs_scarce_employers_give_priority_to_nation_people  | When jobs are scarce, employers should give priority to people of this country over immigrants (numerical) |
| problem_if_women_have_more_income_than_husband        | If a woman earns more money than her husband, it's almost certain to cause problems (numerical) |
| homosexual_couples_as_good_parents                    | Homosexual couples are as good parents as other couples (numerical) |
| duty_towards_society_to_have_children                 | It is a duty towards society to have children (numerical) |
| children_duty_to_take_care_of_ill_parent              | Adult children have the duty to provide long-term care for their parents (numerical) |
| people_who_dont_work_turn_lazy                        | People who don’t work turn lazy (numerical) |
| work_is_duty_towards_society                          | Work is a duty towards society (numerical) |
| work_always_comes_first_even_with_less_spare_time     | Work should always come first, even if it means less spare time (numerical) |
| best_description_of_society                           | What attitude concerning the society we live in best describes your opinion? (numerical) | 
| future_changes_less_importance_on_work                | How would you feel if in the future there was less importance placed on work in our lives? (numerical) |
| future_changes_more_emphasis_on_technology            | How would you feel if in the future there was more emphasis on the development of technology? (numerical) |
| future_changes_greater_respect_for_authority          | How would you feel if in the future there was greater respect for authority? (numerical) |
| importance_of_god                                     | How important is God in your life? (numerical) |
| believe_in_god                                        | Do you believe in God? (numerical) |
| believe_in_life_after_death                           | Do you believe in life after death? (numerical) |
| believe_in_hell                                       | Do you believe in hell? (numerical) |
| believe_in_heaven                                     | Do you believe in heaven? (numerical) |
| whenever_science_and_religion_conflict_religion_is_right | Whenever science and religion conflict, religion is always right (numerical) |
| only_acceptable_religion_is_my_religion                | The only acceptable religion is my religion (numerical) |
| frequency_of_attending_religious_services             | How often do you attend religious services? (numerical) |
| frequency_of_praying                                  | How often do you pray? (numerical) |
| religious_person                                      | Are you a religious person? (numerical) |
| meaning_of_religion_follow_religious_norms_and_ceremonies | Is the basic meaning of religion to follow religious norms and ceremonies or to do good to other people? (numerical) |
| meaning_of_religion_make_sense_of_life_after_death    | Is the basic meaning of religion to make sense of life after death or to make sense of life in this world? (numerical) |
| trouble_choosing_moral_rules                     | Are people often troubled deciding which moral rules are the right ones to follow? (numerical) |
| justifiable_claiming_government_benefits              | Is it justifiable to claim government benefits to which you are not entitled? (numerical) |
| justifiable_avoiding_fare_on_public_transport         | Is it justifiable to avoid paying fare on public transport? (numerical) |
| justifiable_stealing_property                         | Is it justifiable to steal property? (numerical) |
| justifiable_cheating_on_taxes                         | Is it justifiable to cheat on taxes if you have a chance? (numerical) |
| justifiable_accepting_a_bribe_in_duties               | Is it justifiable to accept a bribe in the course of your duties? (numerical) |
| justifiable_homosexuality                                            | Is homosexuality justifiable? (numerical) |
| justifiable_prostitution                                             | Is prostitution justifiable? (numerical)                                    |
| justifiable_abortion                                                 | Is abortion justifiable? (numerical)                                         |
| justifiable_divorce                                                  | Is divorce justifiable? (numerical)                                         |
| justifiable_sex_before_marriage                                      | Is sex before marriage justifiable? (numerical)                             |
| justifiable_suicide                                                  | Is suicide justifiable? (numerical)                                         |
| justifiable_euthanasia                                               | Is euthanasia justifiable? (numerical)                                      |
| justifiable_man_beating_wife                                         | Is it justifiable for a man to beat his wife? (numerical)                    |
| justifiable_parents_beating_children                                 | Is it justifiable for parents to beat their children? (numerical)            |
| justifiable_violence_against_others                                  | Is violence against others justifiable? (numerical)                          |
| justifiable_terrorism_as_political_ideological_or_religious_mean?     | Is terrorism as a political, ideological, or religious means justifiable? (numerical) |
| justifiable_having_casual_sex                                        | Is having casual sex justifiable? (numerical)                               |
| justifiable_political_violence                                       | Is political violence justifiable? (numerical)                               |
| justifiable_death_penalty                                            | Is the death penalty justifiable? (numerical)                               |
| government_right_keep_people_under_video_surveillance?                | Is it the government's right to keep people under video surveillance? (numerical) |
| government_right_monitor_all_emails_and_internet_information?         | Is it the government's right to monitor all emails and internet information? (numerical) |
| government_right_collect_information_about_residents?                 | Is it the government's right to collect information about residents? (numerical) |

## Project Definition

...


# Import Base Libraries

Import the relevant libraries: NumPy, Pandas, Matplotlib, and Seaborn. We will also use a Jupyter magic command `%matplotlib notebook` to enable interactive plots.

In [1]:
# import relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
%matplotlib notebook

# Logging

We define a custom logging system using an enum and a class. The enum defines different logging levels with associated emojis, while the class provides a way to log messages with a specified level. We also define a `Logger` object with a default logging level of `LogLevel.INFO`. This code can be used to log messages in a more expressive and customizable way than the built-in `print()` function.

In [2]:
from enum import Enum

def bold(text):
    return "\033[1m" + text + "\033[0m"

class LogLevel(Enum):
    ERROR = "\033[91m🔴 ERROR\033[0m"
    WARNING = "\033[93m⚠️ WARNING\033[0m"
    INFO = "\033[94m🟡 INFO\033[0m"
    DEBUG = "\033[34m🔵 DEBUG\033[0m"
    SUCCESS = "\033[32m🟢 SUCCESS\033[0m"


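class Logger:
    # Explicit severity order (lower = more severe). The enum values are
    # display strings, so comparing them directly would be lexicographic.
    _PRIORITY = {
        LogLevel.ERROR: 0,
        LogLevel.WARNING: 1,
        LogLevel.SUCCESS: 2,
        LogLevel.INFO: 2,
        LogLevel.DEBUG: 3,
    }

    def __init__(self, level=LogLevel.INFO):
        self.level = level

    def log(self, level, message, **kwargs):
        # Print only messages at least as severe as the configured level
        if self._PRIORITY[level] <= self._PRIORITY[self.level]:
            nl = kwargs.get('nl', False)
            if nl:
                print()
            print(f"{level.value}: {bold(message)}")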

    def error(self, message, **kwargs):
        self.log(LogLevel.ERROR, message, **kwargs)

    def warning(self, message, **kwargs):
        self.log(LogLevel.WARNING, message, **kwargs)

    def info(self, message, **kwargs):
        self.log(LogLevel.INFO, message, **kwargs)

    def debug(self, message, **kwargs):
        self.log(LogLevel.DEBUG, message, **kwargs)

    def success(self, message, **kwargs):
        self.log(LogLevel.SUCCESS, message, **kwargs)
        
logger = Logger(level=LogLevel.INFO)

# Preprocessing Presets

We define a set of preprocessing helpers for cleaning and transforming the data before modeling. `split_data` drops rows with missing values and splits a dataset into input and output components; `target_col` names the output column, and the last column is used when it is omitted. `label_encode`, `one_hot_encode`, and `ordinal_encode` convert categorical data to a numerical format. `minmax_scale`, `standard_scale`, and `robust_scale` rescale numeric data; `standard_scale` standardizes each column to zero mean and unit variance. Finally, `full_clean` combines these steps: it drops duplicate rows, label-encodes the `country` column, replaces the negative codes that represent "Unknown" answers with missing values, and standardizes the numeric columns.

We will use this code as a toolset to prepare and clean data for various machine learning tasks.

In [3]:
# import relevant libraries
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler, RobustScaler

# preprocessing methods
def split_data(data, target_col=None):
    data = data.dropna()
    if target_col is None:
        X = data.iloc[:, :-1]
        y = data.iloc[:, -1]
    else:
        X = data.drop(target_col, axis=1)
        y = data[target_col]
    return X, y


def label_encode(target, mode='sklearn'):
    if mode == 'sklearn':
        return LabelEncoder().fit_transform(target)

def one_hot_encode(X, mode='pandas', **kwargs):
    if mode == 'pandas':
        return pd.get_dummies(X, **kwargs)
    if mode == 'sklearn':
        return OneHotEncoder().fit_transform(X, **kwargs)

def ordinal_encode(X, mode='sklearn', **kwargs):
    if mode == 'sklearn':
        return OrdinalEncoder().fit_transform(X, **kwargs)

def minmax_scale(X, mode='sklearn', **kwargs):
    if mode == 'sklearn':
        return MinMaxScaler().fit_transform(X, **kwargs)

def standard_scale(X, mode='sklearn', **kwargs):
    if mode == 'sklearn':
        return StandardScaler().fit_transform(X, **kwargs)

def robust_scale(X, mode='sklearn', **kwargs):
    if mode == 'sklearn':
        return RobustScaler().fit_transform(X, **kwargs)
    
def full_clean(df):
    df.drop_duplicates(inplace=True)

    # Select the numeric columns before encoding 'country',
    # so that the encoded country labels are not scaled
    numeric_columns = df.select_dtypes(include='number').columns

    # Encode the 'country' column using LabelEncoder
    df['country'] = label_encode(df['country'])

    # Mask "Unknown" answers, then scale the numeric columns using StandardScaler
    # (StandardScaler ignores NaN when fitting and keeps it in the output)
    df[numeric_columns] = handle_unknowns(df[numeric_columns])
    df[numeric_columns] = standard_scale(df[numeric_columns])

    return df

def handle_unknowns(values):
    # Negative values represent "Unknown"; replace them with NaN.
    # Works element-wise on a Series or a DataFrame.
    return values.mask(values < 0)
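
As a minimal sketch of the "Unknown" handling and standardization described above, the example below uses a small hypothetical DataFrame (the values are made up for illustration, not taken from the WVS data). It assumes that negative codes mark unknown answers, masks them as missing values, and then standardizes each column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy responses; negative codes stand for "Unknown" answers
toy = pd.DataFrame({
    "importance_of_family": [1, 2, -1, 1],
    "importance_of_work": [3, -2, 2, 1],
})

# Mask negative codes as missing values (NaN)
masked = toy.mask(toy < 0)

# StandardScaler disregards NaN when fitting and keeps it in the output
scaled = pd.DataFrame(
    StandardScaler().fit_transform(masked),
    columns=toy.columns,
)

print(masked)
print(scaled.mean().round(6))        # column means are close to 0
print(scaled.std(ddof=0).round(6))   # population standard deviations are close to 1
```

The missing values survive the scaling step, so they still have to be imputed or dropped before applying methods such as PCA that do not accept NaN.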

# Read Data

We load the subset of the WVS Cross-National Wave 7 dataset covering social, religious, and ethical values from a CSV file into a Pandas DataFrame. The exploration and preprocessing steps that follow help us understand the data and prepare it for modeling.

In [4]:
# Read the CSV file
data = pd.read_csv('WVS_Cross-National_Wave_7_csv_v5_0_social_religious_ethical.csv')

# Data Exploration

We will perform an exploratory analysis on the dataset to gain insights into its structure, patterns, and distributions.

## Overview of the Dataset

The dataset contains 94278 samples.

It has 80 numeric columns and 1 object column. The object column holds the country and takes 64 distinct values, so 64 countries are represented in the study. The numeric columns have relatively low granularity: each has between 5 and 14 unique values, with an average of 8.3125.

Because the dataset has a large number of variables, PCA can be used to reduce the dimensionality and simplify the analysis. We also suspect that many of these variables are highly correlated and therefore redundant.

In [5]:
print('%s: %s\n' % (bold('Number of samples'), len(data)))

print(bold('Dataset info'))
print(data.info(), '\n')

print(bold('Dataset preview'))
print(data.head(), '\n')

# get the feature labels
features = data.columns

# get numeric and categorical columns
numeric_col = data.select_dtypes(include=['int64', 'float64']).columns
category_col = data.select_dtypes(include=['object']).columns

print('%s: %s\n' % (bold('Numeric columns'), numeric_col.values))
print('%s: %s\n' % (bold('Categorical columns'), category_col.values))

print(bold('Dataset granularity'))
for col in data.columns:
    print(col, ':', data[col].nunique(), 'unique values')

Number of samples: 94278

Dataset info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94278 entries, 0 to 94277
Data columns (total 81 columns):
 #   Column                                                            Non-Null Count  Dtype 
---  ------                                                            --------------  ----- 
 0   country                                                           94278 non-null  object
 1   importance_of_family                                              94278 non-null  int64 
 2   importance_of_friends                                             94278 non-null  int64 
 3   importance_of_leisure_time                                        94278 non-null  int64 
 4   importance_of_politics                                            94278 non-null  int64 
 5   importance_of_work                                                94278 non-null  int64 
 6   importance_of_religion                                            94278 non-nul
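
The granularity figures quoted in the overview (minimum, maximum, and average number of unique values per numeric column) can be summarized in a few lines. The sketch below runs on a small hypothetical DataFrame; the country codes and values are placeholders rather than real survey rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for the survey data
toy = pd.DataFrame({
    "country": ["AND", "ARG", "AND", "ARG"],
    "importance_of_family": [1, 2, 1, 1],
    "believe_in_god": [1, 2, -1, 1],
})

# Count the distinct values of every numeric column
numeric_cols = toy.select_dtypes(include="number").columns
unique_counts = toy[numeric_cols].nunique()

print("min:", unique_counts.min())
print("max:", unique_counts.max())
print("mean:", unique_counts.mean())
```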

## Correlation Heatmap

This heatmap shows the pairwise correlation between the numeric features of the dataset. Red cells indicate a strong positive correlation and blue cells a strong negative correlation; the color intensity encodes the correlation coefficient (cell annotations are disabled with `annot=False`). The plot helps identify strongly correlated features and supports the use of PCA to reduce dimensionality.

In [6]:
correlation_matrix = data[numeric_col].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', linewidths=0.5, fmt=".2f", cbar=True, xticklabels=True, yticklabels=True)
plt.title(None)
plt.xticks(fontsize=6)
plt.yticks(fontsize=6) 
plt.tight_layout()

plt.show()
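
With 80 features the heatmap is hard to read cell by cell, so it can help to list the most strongly correlated pairs explicitly. The following is a sketch on synthetic data (the columns `a`, `b`, and `c` are hypothetical): it keeps the strict upper triangle of the correlation matrix so that each pair appears once, then sorts the pairs by absolute correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # strongly correlated with "a"
    "c": rng.normal(size=200),                     # independent noise
})

corr = toy.corr()

# Keep the strict upper triangle: drops the diagonal and duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# One row per feature pair, ordered by absolute correlation
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs.head())
```

The same steps should apply unchanged to `correlation_matrix` from the cell above.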


# Principal Component Analysis (PCA) 

The Principal Component Analysis (PCA) section focuses on applying this dimensionality reduction technique to our dataset, effectively transforming the original features into a new set of uncorrelated variables called principal components. Through PCA, we aim to identify the most influential features that contribute to the overall variability in the data, enabling us to achieve a more concise representation while retaining the essential information contained within the dataset.

## Knee/Elbow Point Detection

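We compute the cumulative explained variance ratio and locate its knee point. In the default `kneed` mode, the knee is detected with `KneeLocator` from the `kneed` package. In the alternative `threshold` mode, we pick the first number of components for which the unexplained variance (1 minus the cumulative ratio) falls below `threshold` times its maximum value.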
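
To make the threshold rule concrete, here is a small worked example with NumPy only. The explained-variance ratios are hypothetical numbers chosen for illustration; the selection logic mirrors the rule that compares the unexplained variance with a fraction of its maximum.

```python
import numpy as np

# Hypothetical explained-variance ratios for six components (they sum to 1)
explained = np.array([0.50, 0.25, 0.15, 0.06, 0.03, 0.01])
cumulative = np.cumsum(explained)

threshold = 0.1
diff = 1 - cumulative  # variance still unexplained after k components

# First k whose unexplained variance drops below threshold * the largest gap
index = int(np.argmax(diff < diff.max() * threshold)) + 1
print(index)  # 4
```

Here the largest gap is 0.5, so the cutoff is 0.05; the unexplained variance first drops below it after four components (0.04), which gives an index of 4.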

In [7]:
from sklearn.decomposition import PCA
from kneed import KneeLocator

def find_optimal_n_components(X, mode="kneed", threshold=0.1, legend=False):
    """
    Find the optimal number of components using the "knee" or "elbow" point approach.

    Parameters:
    - X: The input data matrix.
    - mode: The mode to determine the optimal number of components. Options: "kneed" or "threshold" (default: "kneed").
    - threshold: The threshold value used in the "threshold" mode to determine the optimal number of components (default: 0.1).
    - legend: Boolean flag to include legends for the index and variance (default: False).

    Returns:
    - The optimal number of components.
    """

    pca = PCA()
    pca.fit(X)
    index = 0

    cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
    
    if mode == "kneed":
        kneedle = KneeLocator(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, curve='concave', direction='increasing')
        index = kneedle.knee
    elif mode == "threshold":
        diff = 1 - cumulative_variance_ratio
        index = np.argmax(diff < np.max(diff) * threshold) + 1

    # Plot the cumulative explained variance ratio and the knee/elbow point
    plt.figure(figsize=(8, 6))
    plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, marker='o')
    plt.axvline(x=index, color='r', linestyle='--')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    
    if legend:
        # The ratio is a fraction in [0, 1]; the '%' format spec converts it to a percentage
        plt.legend(['Cumulative Variance', f'Optimal Number of Components: {index}\nOptimal Variance: {cumulative_variance_ratio[index-1]:.2%}'], loc="best")
         
    plt.tight_layout()
    plt.show()

    return index

optimal_n_components = find_optimal_n_components(data[numeric_col])
print('%s: %s\n' % (bold('Optimal number of components'), optimal_n_components))

Optimal number of components: 23



## Apply PCA for Dimensionality Reduction

In this section, we apply Principal Component Analysis (PCA) to reduce the dimensionality of the dataset based on the optimal number of components determined earlier.

In [8]:
pca = PCA(n_components=optimal_n_components)
numeric_data_pca = pca.fit_transform(data[numeric_col])

# Create a DataFrame from the transformed data
numeric_data_pca_df = pd.DataFrame(numeric_data_pca, columns=[f'PC{i}' for i in range(1, optimal_n_components + 1)])

data_pca = data.copy()
data_pca.drop(columns=numeric_col, inplace=True)
data_pca = pd.concat([data_pca, numeric_data_pca_df], axis=1)

print(data_pca.shape)

(94278, 24)
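
As a self-contained sketch of the same reduction step, the example below builds synthetic data from two hidden factors (all names and numbers are hypothetical), projects it onto two principal components, and checks how much variance the reduced representation retains.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))   # two hidden factors
mixing = rng.normal(size=(2, 6))     # how the factors combine into 6 features
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 6))

# Project the 6 observed features onto 2 principal components
pca_demo = PCA(n_components=2)
reduced = pd.DataFrame(
    pca_demo.fit_transform(X),
    columns=[f"PC{i}" for i in range(1, 3)],
)

print(reduced.shape)
print(pca_demo.explained_variance_ratio_.sum())  # share of variance retained
```

Checking `explained_variance_ratio_.sum()` after fitting is a quick way to confirm that the chosen number of components keeps most of the information.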


## Feature Loadings Visualization

This section visualizes the feature loadings obtained from PCA, that is, the weight of each original feature in each principal component. The heatmap highlights which features contribute most to each component.

In [9]:
# Get the feature loadings
feature_loadings = pd.DataFrame(pca.components_.T, columns=['PC'+str(i+1) for i in range(pca.components_.shape[0])], index=numeric_col)

# Plot the feature loadings
plt.figure(figsize=(12, 10))
sns.heatmap(feature_loadings, cmap='coolwarm', annot=False, fmt=".2f", linewidths=0.5, cbar=True, xticklabels=True, yticklabels=True)
plt.xlabel('Principal Components')
plt.ylabel('Original Features')
plt.xticks(fontsize=6)
plt.yticks(fontsize=6) 
plt.tight_layout()
plt.show()
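
Besides the heatmap, the loadings can be ranked to list the features that weigh most on a given component. The sketch below uses synthetic data with hypothetical feature names `f1`, `f2`, and `f3`, where `f1` and `f2` share a common factor and should therefore dominate the first component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
factor = rng.normal(size=300)
X = pd.DataFrame({
    "f1": factor + rng.normal(scale=0.1, size=300),
    "f2": factor + rng.normal(scale=0.1, size=300),
    "f3": rng.normal(scale=0.3, size=300),  # unrelated, low-variance feature
})

pca_demo = PCA(n_components=2).fit(X)

# Rows are original features, columns are principal components
loadings = pd.DataFrame(pca_demo.components_.T, index=X.columns, columns=["PC1", "PC2"])

# Rank features by the absolute size of their PC1 loading
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(2)
print(top_pc1)
```

The sign of a loading is arbitrary up to a flip of the whole component, which is why the ranking uses absolute values.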
