# European Social Survey (Round 11) - Trust Variables Dimensionality Reduction

Ondřej Marvan, 477001 

---

## Project Overview

This analysis explores the latent structure of trust-related variables in the European Social Survey (ESS) Round 11 data using **Principal Component Analysis (PCA)** and **Factor Analysis**. The goal is to discover how different types of trust—institutional, political, and social—relate to each other and how they vary across European countries.

### Variables Analyzed (0-10 scale):
| Variable | Description |
|----------|-------------|
| `trstplt` | Trust in politicians |
| `trstplc` | Trust in the police |
| `trstprl` | Trust in country's parliament |
| `trstprt` | Trust in political parties |
| `trstlgl` | Trust in the legal system |
| `trstep` | Trust in the European Parliament |
| `trstun` | Trust in the United Nations |
| `ppltrst` | Social trust (people can be trusted) |
| `pplhlp` | People are helpful |
| `pplfair` | People are fair |

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FactorAnalysis
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11



## 1. Data Loading and Preparation

In [None]:
# Data Path - main.csv 
data_path = "/home/ondrej-marvan/Documents/GitHub/OBS_DataScience/OBS_DataScience/Autumn 2025/2400-DS1UL Unsupervised Learning/Projects/Task_DimReduction/Data/main.csv"

# Load data
df = pd.read_csv(data_path, low_memory=False)
print(f"Dataset shape: {df.shape}")
print(f"Number of respondents: {df.shape[0]:,}")

In [None]:
# Define trust variables and labels
trust_vars = ['trstplt', 'trstplc', 'trstprl', 'trstprt', 'trstlgl', 
              'trstep', 'trstun', 'ppltrst', 'pplhlp', 'pplfair']

var_labels = {
    'trstplt': 'Trust Politicians',
    'trstplc': 'Trust Police',
    'trstprl': 'Trust Parliament',
    'trstprt': 'Trust Parties',
    'trstlgl': 'Trust Legal System',
    'trstep': 'Trust EU Parliament',
    'trstun': 'Trust UN',
    'ppltrst': 'Social Trust',
    'pplhlp': 'People Helpful',
    'pplfair': 'People Fair'
}

country_names = {
    'AT': 'Austria', 'BE': 'Belgium', 'BG': 'Bulgaria', 'CH': 'Switzerland',
    'CY': 'Cyprus', 'CZ': 'Czechia', 'DE': 'Germany', 'DK': 'Denmark',
    'EE': 'Estonia', 'ES': 'Spain', 'FI': 'Finland', 'FR': 'France',
    'GB': 'United Kingdom', 'GR': 'Greece', 'HR': 'Croatia', 'HU': 'Hungary',
    'IE': 'Ireland', 'IS': 'Iceland', 'IT': 'Italy', 'LT': 'Lithuania',
    'LV': 'Latvia', 'ME': 'Montenegro', 'NL': 'Netherlands', 'NO': 'Norway',
    'PL': 'Poland', 'PT': 'Portugal', 'RS': 'Serbia', 'SE': 'Sweden',
    'SI': 'Slovenia', 'SK': 'Slovakia', 'UA': 'Ukraine', 'XK': 'Kosovo'
}

## 2. Geographical Coverage

The analysis keeps the **broadest possible geographical coverage**: every country present in the dataset is included.

In [None]:
# Get all countries
countries = df['cntry'].unique()
print(f"Countries in dataset: {len(countries)}")
print(f"\nCountry codes: {sorted(countries)}")

# Sample sizes per country
print("\n" + "="*50)
print("Sample Sizes by Country:")
print("="*50)
country_counts = df['cntry'].value_counts().sort_index()
for code, count in country_counts.items():
    name = country_names.get(code, code)
    print(f"{code} ({name}): {count:,}")

## 3. Variable Inspection

In [None]:
# Create working subset
df_trust = df[['cntry'] + trust_vars].copy()

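# Handle missing values: ESS stores non-substantive answers (e.g. refusal, don't know,
# no answer) as numeric codes above the scale maximum, such as 77, 88, 99.
# Valid values are 0-10, so anything outside that range is set to NaN.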
print("Checking for missing/invalid values:")
print("-" * 50)
for var in trust_vars:
    df_trust[var] = pd.to_numeric(df_trust[var], errors='coerce')
    # Replace values outside 0-10 range with NaN
    invalid_mask = (df_trust[var] < 0) | (df_trust[var] > 10)
    df_trust.loc[invalid_mask, var] = np.nan
    n_missing = df_trust[var].isna().sum()
    pct_missing = (n_missing / len(df_trust)) * 100
    print(f"{var_labels[var]:20s}: {n_missing:,} missing ({pct_missing:.1f}%)")
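
The range-based cleaning above can be checked on a small toy table. The sketch below is illustrative only: the helper name `clean_scale` and the toy values are assumptions, not part of the ESS data or of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy data: valid answers are 0-10; 77/88/99 stand in for ESS missing codes.
toy = pd.DataFrame({
    "trstplt": [3, 77, 10, 0, 88],
    "ppltrst": [5, 6, 99, "x", 2],   # "x" simulates a non-numeric entry
})

def clean_scale(series, low=0, high=10):
    """Hypothetical helper: coerce to numbers and set anything outside [low, high] to NaN."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.where(numeric.between(low, high))

cleaned = toy.apply(clean_scale)          # apply column by column
missing_per_var = cleaned.isna().sum()    # number of NaN values per variable

print(cleaned)
print(missing_per_var)
```

Masking by valid range rather than by an explicit list of codes has the advantage that an unexpected code (or a stray string) is also treated as missing instead of leaking into the analysis as a very high trust score.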

In [None]:
# Descriptive statistics
print("\nDescriptive Statistics:")
print("="*70)
stats = df_trust[trust_vars].describe().round(2)
stats.columns = [var_labels[v] for v in trust_vars]
display(stats)

In [None]:
# Distribution of trust variables
fig, axes = plt.subplots(2, 5, figsize=(16, 8))
axes = axes.flatten()

for i, var in enumerate(trust_vars):
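    # One bin centred on each integer score 0-10 (bins=11 over [0, 10] would misalign the bars)
    df_trust[var].dropna().hist(bins=np.arange(-0.5, 11.5, 1), ax=axes[i], edgecolor='black', alpha=0.7)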
    axes[i].set_title(var_labels[var], fontsize=10)
    axes[i].set_xlabel('Trust Score (0-10)')
    axes[i].set_ylabel('Frequency')

plt.suptitle('Distribution of Trust Variables', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## 4. Data Preparation for PCA

In [None]:
# Complete case analysis
df_complete = df_trust.dropna(subset=trust_vars)
print(f"Complete cases: {len(df_complete):,} out of {len(df_trust):,} ({100*len(df_complete)/len(df_trust):.1f}%)")
print(f"Countries with complete data: {df_complete['cntry'].nunique()}")

# Extract and standardize features
X = df_complete[trust_vars].values
countries_complete = df_complete['cntry'].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"\nFeature matrix shape: {X_scaled.shape}")
print("Data standardized (mean=0, std=1)")
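
Listwise deletion can remove respondents unevenly across countries, which would change how much each country contributes to the pooled PCA. A minimal sketch of a per-country retention check follows; the toy values and the two-item subset are assumptions chosen only to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; NaN marks an item a respondent did not answer.
toy = pd.DataFrame({
    "cntry":   ["AT", "AT", "AT", "AT", "BG", "BG", "BG", "BG"],
    "trstplt": [5, 6, np.nan, 4, 2, np.nan, np.nan, 3],
    "ppltrst": [7, 6, 5, 5, np.nan, 4, 3, 3],
})
items = ["trstplt", "ppltrst"]

# A respondent is a complete case only if every item is answered.
is_complete = toy[items].notna().all(axis=1)

# Share of respondents retained per country after listwise deletion.
retention = is_complete.groupby(toy["cntry"]).mean()
print(retention)
```

Applied to the real data, the same pattern would use `df_trust`, `trust_vars`, and the `cntry` column; countries with a markedly lower retention rate deserve a closer look before their PCA positions are interpreted.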

## 5. Principal Component Analysis

In [None]:
# Fit PCA
pca = PCA()
pca.fit(X_scaled)

# Explained variance
explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)
eigenvalues = pca.explained_variance_

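# Kaiser criterion: retain components whose eigenvalue exceeds 1, i.e. components
# that explain more variance than a single standardized variable
n_kaiser = int((eigenvalues > 1).sum())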

print("Explained Variance by Component:")
print("="*60)
for i, (var, cum, eig) in enumerate(zip(explained_var, cumulative_var, eigenvalues), 1):
    kaiser = "*" if eig > 1 else " "
    bar = '█' * int(var * 40)
    print(f"PC{i}: {var*100:5.1f}%  (cum: {cum*100:5.1f}%)  λ={eig:.2f} {kaiser} {bar}")

print(f"\n* Kaiser criterion (eigenvalue > 1): Retain {n_kaiser} components")
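
Because the data are standardized, the PCA eigenvalues correspond to the eigenvalues of the correlation matrix (up to a factor of n/(n-1) that is negligible for large samples), and they sum to the number of variables. The sketch below illustrates this and the Kaiser count on synthetic data; the two-factor structure and all numbers are assumptions made for the example, not ESS results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 6 variables driven by 2 hypothetical latent factors plus noise.
n, p = 2000, 6
latent = rng.normal(size=(n, 2))
weights = np.array([[0.9, 0.8, 0.7, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 0.9, 0.8, 0.7]])
X = latent @ weights + rng.normal(scale=0.5, size=(n, p))

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# The eigenvalues sum to the number of variables, so 1.0 is the average
# eigenvalue; the Kaiser rule keeps components above that average.
n_retained = int((eigenvalues > 1).sum())
explained_ratio = eigenvalues / eigenvalues.sum()

print(np.round(eigenvalues, 2))
print(n_retained)
```

The Kaiser rule is a heuristic: an eigenvalue just above or just below 1 should be read together with the scree plot rather than treated as a hard cut-off.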

In [None]:
# Scree plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Eigenvalues
ax1 = axes[0]
ax1.plot(range(1, len(eigenvalues)+1), eigenvalues, 'bo-', linewidth=2, markersize=10)
ax1.axhline(y=1, color='r', linestyle='--', linewidth=2, label='Kaiser criterion (λ=1)')
ax1.set_xlabel('Principal Component', fontsize=12)
ax1.set_ylabel('Eigenvalue', fontsize=12)
ax1.set_title('Scree Plot', fontsize=14, fontweight='bold')
ax1.set_xticks(range(1, len(eigenvalues)+1))
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Cumulative variance
ax2 = axes[1]
ax2.bar(range(1, len(explained_var)+1), explained_var*100, alpha=0.7, color='steelblue', label='Individual')
ax2.plot(range(1, len(cumulative_var)+1), cumulative_var*100, 'ro-', linewidth=2, markersize=8, label='Cumulative')
ax2.axhline(y=70, color='g', linestyle='--', alpha=0.7, label='70% threshold')
ax2.axhline(y=80, color='orange', linestyle='--', alpha=0.7, label='80% threshold')
ax2.set_xlabel('Principal Component', fontsize=12)
ax2.set_ylabel('Variance Explained (%)', fontsize=12)
ax2.set_title('Variance Explained', fontsize=14, fontweight='bold')
ax2.set_xticks(range(1, len(explained_var)+1))
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Component Loadings Analysis

In [None]:
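# Component loadings: eigenvectors scaled by sqrt(eigenvalue), so each entry is the
# correlation between a standardized variable and a component
loadings = pca.components_[:n_kaiser].T * np.sqrt(pca.explained_variance_[:n_kaiser])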
loadings_df = pd.DataFrame(
    loadings,
    index=[var_labels[v] for v in trust_vars],
    columns=[f'PC{i+1}' for i in range(n_kaiser)]
)

print(f"Component Loadings (retaining {n_kaiser} components):")
print("="*60)
display(loadings_df.round(3))
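
A note on terminology: the rows of `pca.components_` are unit-length eigenvectors, while "loadings" in the factor-analytic sense are usually those eigenvectors multiplied by the square root of the corresponding eigenvalue. With standardized variables, the scaled version equals the correlation between each variable and each component, which is the scale on which cut-offs such as 0.3 or 0.5 are normally read. The sketch below verifies this on synthetic data (all values are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, correlated data standing in for the standardized trust items.
n, p = 1000, 4
common = rng.normal(size=(n, 1)) @ np.array([[0.9, 0.8, 0.6, 0.3]])
X = common + rng.normal(scale=0.6, size=(n, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-scores

# Eigen-decomposition of the correlation matrix (largest eigenvalue first).
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scaled loadings: eigenvector entries times the component's standard deviation.
loadings = eigvecs * np.sqrt(eigvals)

# Check: each loading equals the correlation between a variable and a component score.
scores = Z @ eigvecs
corr_var_pc = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                         for k in range(p)] for j in range(p)])

print(np.allclose(loadings, corr_var_pc))
```

With ten variables that all contribute to a first component, raw eigenvector entries are around 0.3 each simply because the vector has unit length, so the choice between the two conventions changes how "strong" a loading appears.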

In [None]:
# Component interpretation
print("\nComponent Interpretation:")
print("="*60)
for i in range(n_kaiser):
    print(f"\nPC{i+1} ({explained_var[i]*100:.1f}% variance):")
    print("-" * 40)
    pc_loadings = loadings_df[f'PC{i+1}'].sort_values(key=abs, ascending=False)
    for var, load in pc_loadings.items():
        direction = "+" if load > 0 else "-"
        strength = "STRONG" if abs(load) > 0.5 else "moderate" if abs(load) > 0.3 else "weak"
        print(f"  {direction} {var}: {load:+.3f} ({strength})")
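
The `+`/`-` direction printed above depends on the sign of each component, which is arbitrary: multiplying a component and its scores by -1 describes the same axis. One possible convention, sketched below with a hypothetical helper `align_signs` and made-up eigenvectors, is to flip each component so that its entries sum to a non-negative value.

```python
import numpy as np

def align_signs(components):
    """Flip each component (row) so that its entries sum to a non-negative value."""
    components = np.asarray(components, dtype=float)
    signs = np.where(components.sum(axis=1) < 0, -1.0, 1.0)
    return components * signs[:, np.newaxis], signs

# Hypothetical eigenvectors: the first row points in the "negative" direction.
components = np.array([[-0.5, -0.5, -0.5, -0.5],
                       [ 0.5,  0.5, -0.5, -0.5]])
aligned, signs = align_signs(components)

print(aligned)
print(signs)
```

If such a flip is applied, the component scores must be multiplied by the same `signs` so that plots and loadings stay consistent. Only the pattern of signs within a component (which variables move together, which oppose each other) carries substantive meaning.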

In [None]:
# Loadings heatmap
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(loadings_df, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            vmin=-1, vmax=1, ax=ax, cbar_kws={'label': 'Loading'},
            linewidths=0.5)
ax.set_title('PCA Component Loadings', fontsize=14, fontweight='bold')
ax.set_ylabel('Trust Variables', fontsize=12)
ax.set_xlabel('Principal Components', fontsize=12)
plt.tight_layout()
plt.show()

## 7. Factor Analysis (Comparison with Varimax Rotation)

In [None]:
# Factor Analysis with Varimax rotation
fa = FactorAnalysis(n_components=n_kaiser, rotation='varimax', random_state=42)
fa.fit(X_scaled)

fa_loadings = pd.DataFrame(
    fa.components_.T,
    index=[var_labels[v] for v in trust_vars],
    columns=[f'Factor{i+1}' for i in range(n_kaiser)]
)

print(f"Factor Loadings (Varimax Rotation, {n_kaiser} factors):")
print("="*60)
display(fa_loadings.round(3))
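
Varimax is an orthogonal rotation: it redistributes variance among the factors to make each factor easier to label, but it leaves each variable's communality (the sum of its squared loadings) unchanged. The sketch below uses a common SVD-based formulation of varimax written in plain NumPy; it is an illustration under assumed loadings, not a description of scikit-learn's internals.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a (variables x factors) loading matrix with varimax."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the varimax criterion, solved via SVD.
        target = rotated**3 - rotated * (rotated**2).sum(axis=0) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if criterion != 0 and new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation

def simplicity(L):
    """Varimax criterion: summed variance of squared loadings within each factor."""
    return (L**2).var(axis=0).sum()

# Hypothetical unrotated loadings: a general factor plus a contrast factor.
unrotated = np.array([[0.7,  0.4],
                      [0.7,  0.5],
                      [0.6,  0.4],
                      [0.6, -0.5],
                      [0.7, -0.4],
                      [0.6, -0.4]])
rotated, rotation = varimax(unrotated)

communalities_before = (unrotated**2).sum(axis=1)
communalities_after = (rotated**2).sum(axis=1)

print(np.round(rotated, 2))
print(np.allclose(communalities_before, communalities_after))
```

The assumed starting pattern (one general factor, one contrast) mirrors the typical unrotated PCA picture; after rotation each variable loads mainly on one factor, which is why rotated factor loadings and unrotated PCA loadings can look quite different while describing the same shared variance.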

In [None]:
# Compare PCA and FA loadings
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

sns.heatmap(loadings_df, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            vmin=-1, vmax=1, ax=axes[0], cbar_kws={'label': 'Loading'}, linewidths=0.5)
axes[0].set_title('PCA Loadings', fontsize=14, fontweight='bold')

sns.heatmap(fa_loadings, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            vmin=-1, vmax=1, ax=axes[1], cbar_kws={'label': 'Loading'}, linewidths=0.5)
axes[1].set_title('Factor Analysis Loadings (Varimax)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## 8. Dimensionality Reduction Visualization

In [None]:
# Transform data to 2D
pca_2d = PCA(n_components=2)
X_pca = pca_2d.fit_transform(X_scaled)

df_pca = pd.DataFrame({
    'PC1': X_pca[:, 0],
    'PC2': X_pca[:, 1],
    'Country': countries_complete
})

# Country means
country_means = df_pca.groupby('Country').agg({'PC1': 'mean', 'PC2': 'mean'}).reset_index()
country_means['Country_Name'] = country_means['Country'].map(country_names)

print("Country Positions in PCA Space:")
print("="*50)
display(country_means.sort_values('PC1', ascending=False).round(3))
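
Two properties help when reading the country positions. Since PCA scores are a linear projection, the mean score of a country equals the projection of that country's mean standardized profile. And since the data are centred on the pooled sample, the respondent-weighted average of the country means sits at the origin. The sketch below checks both on synthetic data; the group labels and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic centred data for two hypothetical country groups of unequal size.
Z = rng.normal(size=(200, 5))
Z -= Z.mean(axis=0)
group = np.repeat(["A", "B"], [80, 120])

# Principal axes from the SVD of the centred data (rows of vt are components).
_, _, vt = np.linalg.svd(Z, full_matrices=False)
axes_2d = vt[:2]
scores = Z @ axes_2d.T

# Mean of the scores vs. score of the mean profile, per group.
mean_of_scores = {g: scores[group == g].mean(axis=0) for g in ["A", "B"]}
score_of_mean = {g: Z[group == g].mean(axis=0) @ axes_2d.T for g in ["A", "B"]}

# Respondent-weighted average of the group means.
weighted_centre = (80 * mean_of_scores["A"] + 120 * mean_of_scores["B"]) / 200

for g in ["A", "B"]:
    print(g, np.allclose(mean_of_scores[g], score_of_mean[g]))
print(np.allclose(weighted_centre, 0.0))
```

A practical consequence is that the origin of the plot represents the average respondent in the pooled sample, not the average country, so countries with large samples pull the origin towards themselves.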

In [None]:
# Main scatter plot
fig, ax = plt.subplots(figsize=(14, 10))

unique_countries = df_pca['Country'].unique()
colors = plt.cm.tab20(np.linspace(0, 1, len(unique_countries)))
color_map = dict(zip(unique_countries, colors))

# Individual points
for country in unique_countries:
    mask = df_pca['Country'] == country
    ax.scatter(df_pca.loc[mask, 'PC1'], df_pca.loc[mask, 'PC2'],
               c=[color_map[country]], alpha=0.1, s=5)

# Country means
for _, row in country_means.iterrows():
    ax.scatter(row['PC1'], row['PC2'], c=[color_map[row['Country']]],
               s=250, edgecolors='black', linewidths=2, zorder=5)
    ax.annotate(row['Country'], (row['PC1'], row['PC2']),
                xytext=(5, 5), textcoords='offset points',
                fontsize=10, fontweight='bold')

ax.set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)', fontsize=12)
ax.set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)', fontsize=12)
ax.set_title('European Trust Landscape: PCA of Trust Variables\n(Small dots: individuals, Large circles: country means)',
             fontsize=14, fontweight='bold')
ax.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax.axvline(x=0, color='gray', linestyle='-', alpha=0.3)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 9. Biplot Visualization

In [None]:
fig, ax = plt.subplots(figsize=(14, 10))

scale = 4  # Arrow scale factor

# Country means
for _, row in country_means.iterrows():
    ax.scatter(row['PC1'], row['PC2'], c=[color_map[row['Country']]],
               s=350, edgecolors='black', linewidths=2, alpha=0.7, zorder=5)
    name = row['Country_Name'] if pd.notna(row['Country_Name']) else row['Country']
    ax.annotate(name, (row['PC1'], row['PC2']),
                xytext=(7, 7), textcoords='offset points',
                fontsize=9, fontweight='bold')

# Variable loadings as arrows
loadings_2d = pca_2d.components_.T
for i, var in enumerate(trust_vars):
    ax.arrow(0, 0, loadings_2d[i, 0]*scale, loadings_2d[i, 1]*scale,
             head_width=0.12, head_length=0.06, fc='darkred', ec='darkred', alpha=0.9, linewidth=2)
    ax.text(loadings_2d[i, 0]*scale*1.12, loadings_2d[i, 1]*scale*1.12,
            var_labels[var], fontsize=10, color='darkred', ha='center', fontweight='bold')

ax.set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)', fontsize=12)
ax.set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)', fontsize=12)
ax.set_title('PCA Biplot: Countries and Trust Variables\n(Red arrows: variable loadings, Circles: country means)',
             fontsize=14, fontweight='bold')
ax.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax.axvline(x=0, color='gray', linestyle='-', alpha=0.3)
ax.grid(True, alpha=0.3)
ax.set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()

## 10. Country Comparison

In [None]:
# Mean trust scores by country
country_trust_means = df_complete.groupby('cntry')[trust_vars].mean()
country_trust_means.columns = [var_labels[v] for v in trust_vars]
country_trust_means = country_trust_means.sort_values('Trust Politicians', ascending=False)

# Add country names
country_trust_means.index = [f"{idx} ({country_names.get(idx, idx)})" for idx in country_trust_means.index]

# Heatmap
fig, ax = plt.subplots(figsize=(14, 16))
sns.heatmap(country_trust_means, annot=True, fmt='.1f', cmap='RdYlGn',
            ax=ax, cbar_kws={'label': 'Mean Trust Score (0-10)'},
            linewidths=0.5)
ax.set_title('Trust Levels by Country\n(Sorted by Trust in Politicians)', fontsize=14, fontweight='bold')
ax.set_xlabel('Trust Variables', fontsize=12)
ax.set_ylabel('Country', fontsize=12)
plt.tight_layout()
plt.show()

## 11. Correlation Structure

In [None]:
# Correlation matrix
fig, ax = plt.subplots(figsize=(10, 8))
corr = df_complete[trust_vars].corr()
corr.columns = [var_labels[v] for v in trust_vars]
corr.index = [var_labels[v] for v in trust_vars]

mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, vmin=-1, vmax=1, ax=ax, square=True,
            cbar_kws={'label': 'Correlation'}, linewidths=0.5)
ax.set_title('Correlation Matrix of Trust Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 12. Summary and Conclusions

In [None]:
print("="*70)
print("ANALYSIS SUMMARY")
print("="*70)
print(f"""
DATA OVERVIEW:
- Total respondents analyzed: {len(df_complete):,}
- Countries included: {df_complete['cntry'].nunique()}
- Variables: {len(trust_vars)} trust-related measures

PCA RESULTS:
- PC1 explains {explained_var[0]*100:.1f}% of variance
- PC2 explains {explained_var[1]*100:.1f}% of variance
- First {n_kaiser} components (Kaiser criterion) explain {cumulative_var[n_kaiser-1]*100:.1f}% of total variance

KEY FINDINGS:

1. DIMENSION STRUCTURE:
   - PC1 captures GENERAL TRUST (all variables load with the same sign;
     the sign itself is arbitrary in PCA)
     → People who trust one institution tend to trust others
   
   - PC2 distinguishes INSTITUTIONAL vs SOCIAL trust
     → Political/institutional trust vs interpersonal trust

2. COUNTRY PATTERNS:
   - Nordic countries tend to show higher trust levels
   - Some Eastern European countries show lower institutional trust
   - Social trust patterns differ from political trust patterns

3. VARIABLE CLUSTERING:
   - Political trust cluster: politicians, parties, parliament
   - Social trust cluster: ppltrst, pplfair, pplhlp
   - Police/legal system: bridge both clusters

METHODOLOGY:
- Complete case analysis (listwise deletion)
- Standardized data (z-scores)
- Kaiser criterion for component retention
- Varimax rotation for Factor Analysis comparison
""")

---

**Conclusion**: This dimensionality reduction analysis suggests that trust in European societies is organized along two main dimensions: (1) a general trust factor capturing overall trust propensity, and (2) a dimension separating institutional/political trust from interpersonal/social trust. This structure is consistent with social capital theory, and the country means in PCA space point to meaningful cross-national variation in trust.