Abstract

This project examines whether the intensity of religious belief in a society has any meaningful relationship with its murder rate, focusing specifically on the year 2020. We combine homicide data with global religious composition measures to see if stronger religious adherence or practice is associated with higher or lower levels of violence. Our approach uses metrics such as percentage of population affiliated with a religion, frequency of religious observance, and indicators of religious freedom to quantify “religion intensity.” We also incorporate social factors like income inequality, education levels, and law-enforcement capacity to understand whether these variables strengthen or weaken the connection between religion and murder rates. The goal is not just to test for correlation, but to evaluate whether religion could plausibly play a causal role once broader societal conditions are accounted for. By comparing countries and regions, this analysis gives us a clearer picture of how religious commitment functions in modern societies and whether it meaningfully shapes patterns of violence.

Data (to be updated for sex crimes, economic crimes, and homicide)

- Grouping crimes 

Our project uses two main datasets: a global homicide dataset (including sexual crimes) for 2020 from the United Nations Office on Drugs and Crime (UNODC) and an international religious composition dataset covering 2020 from the Pew Research Center. We combined them to examine whether the intensity of religious belief or practice relates to murder rates across countries. Both datasets are real, publicly available, and meet the assignment requirement that our data be recent and verifiable.

The Homicides data provides the number of murders recorded in each country for the year 2020. Each row represents a country and includes fields such as total homicides and population size, allowing us to compute per-capita murder rates. This dataset gives us a consistent measure of violent crime across different regions, which is essential for comparing societies fairly.
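
As a minimal sketch of the per-capita calculation, the example below computes a rate per 100,000 residents. The column names `homicides` and `population`, the helper `rate_per_100k`, and all of the numbers are illustrative assumptions, not values from the UNODC file. The notebook itself scales by 100 (a percentage); per 100,000 is simply another common convention.

```python
import pandas as pd

# Illustrative rows only; the country labels and counts are made up.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "homicides": [50, 1200],
    "population": [1_000_000, 60_000_000],
})


def rate_per_100k(counts, population):
    """Return events per 100,000 residents."""
    return counts / population * 100_000


toy["homicide_rate"] = rate_per_100k(toy["homicides"], toy["population"])
print(toy)
```

Dividing by population before comparing is what lets a small country and a large one be placed on the same scale.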

The Religious Composition from 2020 file contains detailed counts of religious affiliation for every major world religion. For each country, it reports the number of people identifying with major religious groups (such as Christianity, Islam, Hinduism, Buddhism, folk religions, and the unaffiliated). Because the dataset includes multiple years, we restricted our analysis to 2020 to match the homicide data. From this file, we generated indicators of “religion intensity,” such as the percentage of the population adhering to any religion and the relative size of dominant religious groups. We also computed "Homicide Density": each country's homicide count as a percentage of its population.
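
The two indicators described above could be derived roughly as follows. This is a sketch on made-up counts for two hypothetical countries; the column names mirror the ones used in the notebook, while `Dominant Group` and `Dominant Share` are names introduced here for illustration.

```python
import pandas as pd

# Made-up counts for two hypothetical countries.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Population": [1000, 2000],
    "Christians": [600, 200],
    "Muslims": [100, 1500],
    "Religiously_unaffiliated": [300, 300],
})

groups = ["Christians", "Muslims"]

# Share of the population affiliated with any religion (0.0 to 1.0)
toy["Religion Density"] = 1 - toy["Religiously_unaffiliated"] / toy["Population"]

# Largest single religious group and its share of the population
toy["Dominant Group"] = toy[groups].idxmax(axis=1)
toy["Dominant Share"] = toy[groups].max(axis=1) / toy["Population"]
print(toy)
```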

We merged the two datasets using country names as the key, producing a single table that links murder rates with religious adherence levels. The homicide data gives us the outcome we want to study, while the religious composition data quantifies how religious each society is. Together they form the foundation of the analysis and allow us to explore whether differences in religious intensity correlate with differences in murder rates.
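
Because an inner merge on country names silently drops any country whose name is spelled differently in the two sources, it can help to check for mismatches first. The sketch below uses placeholder tables (`religion_toy`, `homicide_toy`) with placeholder numbers; only the `Country` and `VALUE` column names come from the notebook.

```python
import pandas as pd

# Placeholder tables; "Turkey" vs "Türkiye" stands in for a spelling mismatch.
religion_toy = pd.DataFrame({
    "Country": ["Chile", "Turkey", "Japan"],
    "Religion Density": [0.5, 0.5, 0.5],
})
homicide_toy = pd.DataFrame({
    "Country": ["Chile", "Türkiye", "Japan"],
    "VALUE": [1, 2, 3],
})

# An outer merge with indicator=True marks rows found in only one table
check = pd.merge(religion_toy, homicide_toy, how="outer", on="Country", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["Country", "_merge"]]
print(unmatched)
```

Any rows listed in `unmatched` would be lost by `how='inner'` unless the names are harmonized beforehand.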

We also included additional societal indicators, such as income inequality and education levels, so that we can test whether religion still matters after accounting for other major factors.
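
One way to test whether religion "still matters" is to compare a regression coefficient with and without a control variable. The sketch below does this on synthetic data in which inequality drives both religiosity and homicide, and religiosity has no direct effect by construction. The helper `ols`, the variable names, and every number are assumptions for illustration, not results from our data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: inequality influences both variables;
# religion has no direct effect on homicide by construction.
inequality = rng.normal(size=n)
religion = 0.8 * inequality + rng.normal(scale=0.5, size=n)
homicide = 2.0 * inequality + rng.normal(scale=0.5, size=n)


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


naive = ols(homicide, religion)                 # religion only
adjusted = ols(homicide, religion, inequality)  # religion plus the control

print("religion coefficient, no control:  ", round(naive[1], 2))
print("religion coefficient, with control:", round(adjusted[1], 2))
```

In this constructed example the religion coefficient is large on its own and shrinks toward zero once inequality is included, which is the pattern that would indicate confounding rather than a direct effect.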


In [None]:
import pandas as pd
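# Load the UNODC homicide data, reading Population as an integer
df = pd.read_csv('Homicides.csv', dtype={"Population": "int64"})
# Drop descriptive columns that are not needed for the country totals
df.drop(['Region', 'Subregion', 'Dimension', 'Category', 'Year', 'Unit of measurement', 'Source'], axis=1, inplace=True)
df.head()

# Total recorded homicides per country
country = df.groupby(by='Country')['VALUE'].sum()
country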

In [None]:
# Religion
religion = pd.read_csv('religion.csv', thousands=',', dtype={"Population": "int64"})
religion.drop(['Region', 'Level', 'Countrycode'], axis=1, inplace=True)
# Keep only 2020 to match the homicide data
religion = religion.query('Year == 2020').copy()

# Strip any remaining thousands separators and store the counts as integers,
# so that later arithmetic and sorting treat them as numbers rather than strings
for col in ['Population', 'Religiously_unaffiliated']:
    religion[col] = (
        religion[col]
        .astype(str)
        .str.replace(',', '', regex=False)
        .astype('int64')
    )

# Share of the population affiliated with any religion (0.0 to 1.0)
religion['Religion Density'] = 1 - religion['Religiously_unaffiliated'] / religion['Population']
religion

In [None]:
# Sex
sex = pd.read_csv('Sex.csv', thousands=',', dtype={"Population":"int64"})
sex.drop(['Iso3_code', 'Region', 'Subregion', 'Indicator', 'Dimension', 'Category'], axis=1, inplace=True)
sex = sex.groupby(by='Country')['VALUE'].sum()
sex

In [None]:
# Corruption 
corruption = pd.read_csv('Corruption.csv', thousands=',', dtype={"Population":"int64"})
corruption.query('`Unit of measurement` == "Counts"', inplace=True)
corruption = corruption.groupby(by='Country')['VALUE'].sum()
corruption

In [None]:
# Merging
merged_df = pd.merge(religion, country, how='inner', on=['Country'])
merged_df = pd.merge(merged_df, sex, how='inner', on=['Country'], suffixes=('_hom', '_sex'))
merged_df = pd.merge(merged_df, corruption, how='inner', on=['Country'])
merged_df['Homicide Density'] = merged_df['VALUE_hom'].astype(int) / merged_df['Population'].astype(int) * 100
merged_df['Sex Assault Density'] = merged_df['VALUE_sex'].astype(int) / merged_df['Population'].astype(int) * 100
merged_df = merged_df.rename(columns={'VALUE': 'VALUES_corr'})
merged_df['Corruption Density'] = merged_df['VALUES_corr'].astype(int) / merged_df['Population'].astype(int) * 100
# merged_df.sort_values(by='Homicide Density', ascending=False)
merged_df



In [None]:
df.head()
df.dtypes


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load Data
# Ensure 'final_1.csv' is in the same directory as this script
df = pd.read_csv('final_1.csv')

# 2. Data Cleaning
# Remove commas from number strings and convert to actual numbers
cols_to_clean = ['Christians', 'Muslims', 'Buddhists', 'Hindus', 'Jews', 
                 'Other_religions', 'VALUE_hom']

for col in cols_to_clean:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', ''), errors='coerce')

# 3. Filter Data
# Keep only countries with a population above 2 million to remove outliers
df_filtered = df[df['Population'] > 2000000].copy()

# 4. Prepare Variables for Plotting
# Define "Non-Religious" using the 'Religiously_unaffiliated' column
df_filtered['Non_Religious_Count'] = df_filtered['Religiously_unaffiliated']
# Define "Religious" as the remainder of the population
df_filtered['Religious_Count'] = df_filtered['Population'] - df_filtered['Non_Religious_Count']

# Sort by Population so the largest bars are at the top
df_sorted = df_filtered.sort_values(by='Population', ascending=True)

# 5. Create the Double-Sided Graph
# Create two subplots side-by-side sharing the Y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, len(df_sorted) * 0.5), sharey=True)

# --- LEFT SIDE: HOMICIDES ---
# Plot Homicides in RED
ax1.barh(df_sorted['Country'], df_sorted['VALUE_hom'], color='#FF0000', edgecolor='black', linewidth=0.5)

# Formatting Left Side
# Reverse the x-axis limits (Max -> 0) so bars grow to the left
ax1.set_xlim(df_sorted['VALUE_hom'].max() * 1.1, 0)  # Series.max() skips NaN left by the cleaning step
ax1.set_xlabel('Number of Homicides')
ax1.set_title('Homicides', fontsize=14, fontweight='bold', color='#FF0000')
ax1.grid(axis='x', linestyle='--', alpha=0.5)

# --- RIGHT SIDE: RELIGIOUS vs NON-RELIGIOUS POPULATION ---
# Stacked Bar Chart
# 1. Plot Religious count first (Light Blue)
p1 = ax2.barh(df_sorted['Country'], df_sorted['Religious_Count'], color='#87CEFA', label='Religious', edgecolor='black', linewidth=0.5)

# 2. Plot Non-Religious count "on top" / to the right (Lighter Blue)
p2 = ax2.barh(df_sorted['Country'], df_sorted['Non_Religious_Count'], left=df_sorted['Religious_Count'], color='#E6F3FF', label='Non-Religious', edgecolor='black', linewidth=0.5)

# Formatting Right Side
ax2.set_xlabel('Population')
ax2.set_title('Population Distribution', fontsize=14, fontweight='bold', color='#2e86de')
ax2.legend()
ax2.grid(axis='x', linestyle='--', alpha=0.5)
# Format x-axis to use plain numbers (prevent scientific notation like 1e8)
ax2.ticklabel_format(style='plain', axis='x')

# --- GLOBAL LAYOUT ADJUSTMENTS ---
# Add a main title for the whole figure
plt.suptitle('Comparison: Homicides vs. Religious Composition', fontsize=16, y=1.005)

# Apply tight_layout first; calling it afterwards would reset the wspace set below
plt.tight_layout()
# Remove the space between the two plots to make them look like one continuous chart
plt.subplots_adjust(wspace=0.0)

plt.show()

In [None]:
df.dtypes

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# 1. Load Data
df = pd.read_csv('final_1.csv')

# 2. Data Cleaning
cols_to_clean = ['Christians', 'Muslims', 'Buddhists', 'Hindus', 'Jews', 
                 'Other_religions', 'VALUE_hom', 'Sex Assault Density']

for col in cols_to_clean:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', ''), errors='coerce')

# 3. Filter Data
# Same filter as the previous chart: keep countries with a population above 2 million
df_filtered = df[df['Population'] > 2000000].copy()

# 4. Prepare Data for Correlation
x_data = df_filtered['Religion Density']
y_data = df_filtered['Sex Assault Density']

# 5. Create the Jittered Scatter Plot
plt.figure(figsize=(16, 12))

sns.regplot(
    x=x_data, 
    y=y_data, 
    data=df_filtered,
    x_jitter=0.03,
    fit_reg=False, 
    scatter_kws={
        'alpha': 0.6,
        's': 100,
        'edgecolor': 'w'
    },
    color='#e74c3c'
)

plt.title('Correlation: Religion Density vs. Sexual Assault Density', fontsize=16)
plt.xlabel('Religion Density (0.0 to 1.0)', fontsize=12)
plt.ylabel('Sexual Assault Density (per capita)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)

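# --- TEXT LABEL LOGIC ---

# Label every country that has both values. Labels sit at the true (unjittered)
# coordinates, so they can be slightly offset from the jittered dots.
labelled = df_filtered.dropna(subset=['Religion Density', 'Sex Assault Density'])
for _, row in labelled.iterrows():
    plt.text(
        row['Religion Density'], 
        row['Sex Assault Density'], 
        row['Country'], 
        fontsize=9, 
        fontweight='bold', 
        ha='right', # Right-align the label so its text ends at the data point
        alpha=0.8
    )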

plt.tight_layout()
plt.show()
merged_df = merged_df.sort_values(by='Population', ascending=True)
merged_df
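
A scatter plot shows the shape of a relationship but does not quantify it. As a rough sketch of how the correlation could be measured, the example below computes Pearson and Spearman coefficients on made-up values. The generic columns `x` and `y` are placeholders; in the notebook they would correspond to 'Religion Density' and 'Sex Assault Density'.

```python
import pandas as pd

# Made-up values with one extreme point to show how the two measures differ.
toy = pd.DataFrame({
    "x": [0.1, 0.2, 0.3, 0.4, 0.5],
    "y": [1.0, 2.0, 3.0, 4.0, 100.0],
})

# Drop rows where either value is missing before correlating
pair = toy[["x", "y"]].dropna()

# Pearson measures linear association
pearson = pair["x"].corr(pair["y"])

# Spearman is Pearson applied to ranks, so it is less sensitive to outliers
spearman = pair["x"].rank().corr(pair["y"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Reporting both is a cheap check: a large gap between them suggests that a few extreme countries are driving the linear coefficient.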
