# WCI (World Cybercrime Index) — Exact Reproduction per Spec
This notebook rebuilds the per-country WCI exactly as specified:

1) Build a long-format nominations table (one row = one nomination) with columns:
   `ResponseID, Country, CrimeType, Impact, Professionalism, TechSkill`.

2) For each (Country, CrimeType):
   - `NomScore = (Impact + Professionalism + TechSkill) / 3` per nomination.
   - Let `n` = number of nominations for that pair.
   - `CountryScore_type = mean(NomScore)` across the `n` nominations.
   - Store `nominations_type = n`.

3) Per-type WCI (Eq. 2): `WCI_type = CountryScore_type * (nominations_type / 92) * 10`.
   Output columns: `Tech, Attacks, Data, Scams, Cash`.

4) Average raw type score (not WCI_type):
   `AvgTypeScore = (CountryScore_Tech + CountryScore_Attacks + CountryScore_Data + CountryScore_Scams + CountryScore_Cash) / 5`
   Missing types contribute 0.

5) Total nominations across all types:
   `TotalNominations = nominations_Tech + nominations_Attacks + nominations_Data + nominations_Scams + nominations_Cash`.
   Max possible = 92 respondents × 5 types = 460.

6) Overall WCI (Eq. 3): `WCI_overall = AvgTypeScore * (TotalNominations / 460) * 10`.

7) Overall I, P, TS: means of the three scores across all nominations of the country (ignoring types).

8) Final output (sorted by `WCI Score` desc):
   `Rank, Country, I, P, TS, WCI Score, Tech, Attacks, Data, Scams, Cash`.


In [98]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path('data')
RAW_PATH = DATA_DIR / 'wci_data.csv'
OUT_PATH = DATA_DIR / 'WCI_recacl.csv'

df = pd.read_csv(RAW_PATH)
try:
    from IPython.display import display  # type: ignore
except Exception:
    display = lambda x: print(x.head() if hasattr(x, 'head') else x)

print('Loaded:', RAW_PATH, 'shape=', df.shape)
print('Unique ResponseID:', df['ResponseID'].nunique())


Loaded: data/wci_data.csv shape= (92, 108)
Unique ResponseID: 92


## 1) Long-format nominations table
We convert each of the 5 blocks (Technical, Attack, Data, Scams, Cash) and positions 1..5 into a long table.


In [99]:
type_map = [
    ('Tech', 'Technical'),   # output label Tech
    ('Attacks', 'Attack'),   # output label Attacks (note: source columns use singular Attack)
    ('Data', 'Data'),
    ('Scams', 'Scams'),
    ('Cash', 'Cash'),
]

records = []
for _, row in df.iterrows():
    rid = row.get('ResponseID')
    for out_label, prefix in type_map:
        for pos in range(1, 6):
            country = row.get(f'{prefix}{pos}')
            if pd.isna(country):
                continue
            country_str = str(country).strip()
            if country_str == '' or country_str == '--':
                continue
            impact = pd.to_numeric(row.get(f'{prefix}{pos}_impact'), errors='coerce')
            prof = pd.to_numeric(row.get(f'{prefix}{pos}_professional'), errors='coerce')
            tech = pd.to_numeric(row.get(f'{prefix}{pos}_techskill'), errors='coerce')
            # Keep the row even if some of the three are NaN; they will be averaged from available values
            records.append({
                'ResponseID': rid,
                'Country': country_str,
                'CrimeType': out_label,  # exact output names
                'Impact': impact,
                'Professionalism': prof,
                'TechSkill': tech,
            })

long_df = pd.DataFrame.from_records(records)

# Drop any residual empty countries (safety)
long_df = long_df.dropna(subset=['Country'])
long_df = long_df[long_df['Country'].astype(str).str.strip().isin(['', '--']) == False].copy()

print('Long table shape:', long_df.shape)
display(long_df.head())


Long table shape: (1736, 6)


Unnamed: 0,ResponseID,Country,CrimeType,Impact,Professionalism,TechSkill
0,R1,Ukraine,Tech,3.0,6.0,5.0
1,R1,Russia,Tech,5.0,8.0,7.0
2,R1,Brazil,Tech,7.0,5.0,5.0
3,R1,Romania,Tech,4.0,6.0,6.0
4,R1,Latvia,Tech,5.0,7.0,6.0


## 2) CountryScore_type and nominations_type


In [100]:
# Nomination-level average (per row)
# Drop nominations with missing scores exactly as the authors did
long_df = long_df.dropna(subset=['Impact', 'Professionalism', 'TechSkill']).copy()
long_df['NomScore'] = long_df[['Impact', 'Professionalism', 'TechSkill']].mean(axis=1)

grp = long_df.groupby(['Country', 'CrimeType'])
agg = grp['NomScore'].agg(['mean', 'count']).reset_index()
agg = agg.rename(columns={
    'mean': 'CountryScore_type',
    'count': 'nominations_type',
})

display(agg.head())


Unnamed: 0,Country,CrimeType,CountryScore_type,nominations_type
0,Afghanistan,Cash,8.0,2
1,Algeria,Data,5.222222,3
2,Angola,Attacks,6.333333,1
3,Angola,Cash,9.666667,1
4,Argentina,Cash,6.666667,1


## 3) Per-type WCI (Eq. 2)
We use the published denominator 92 (max nominations per type).


In [101]:
N_RESPONDENTS_PUBLISHED = 92
agg['WCI_type'] = agg['CountryScore_type'] * (agg['nominations_type'] / N_RESPONDENTS_PUBLISHED) * 10.0

# Pivot to wide per-type columns with required names
per_type = agg.pivot_table(
    index='Country',
    columns='CrimeType',
    values=['CountryScore_type', 'nominations_type', 'WCI_type'],
    aggfunc='first'
)

# Flatten columns
per_type.columns = [f"{a}__{b}" for a, b in per_type.columns]
per_type = per_type.reset_index()

# Ensure all five type columns exist; fill missing with 0
for t in ['Tech', 'Attacks', 'Data', 'Scams', 'Cash']:
    for base in ['CountryScore_type', 'nominations_type', 'WCI_type']:
        col = f'{base}__{t}'
        if col not in per_type.columns:
            per_type[col] = 0.0


## 4) Average Raw Type Score (missing types contribute 0)


In [102]:
# Collect all raw scores
raw_cols = [
    "CountryScore_type__Tech",
    "CountryScore_type__Attacks",
    "CountryScore_type__Data",
    "CountryScore_type__Scams",
    "CountryScore_type__Cash"
]

nom_cols = [
    "nominations_type__Tech",
    "nominations_type__Attacks",
    "nominations_type__Data",
    "nominations_type__Scams",
    "nominations_type__Cash"
]

# Replace zeros with NaN for averaging (R excludes missing types)
tmp = per_type[raw_cols].replace({0: pd.NA})

# Compute unweighted mean of non-missing types
per_type["AvgTypeScore"] = tmp.mean(axis=1, skipna=True)

## 5) Total nominations across all types (max 460)


In [103]:
per_type['TotalNominations'] = (
    per_type['nominations_type__Tech'] +
    per_type['nominations_type__Attacks'] +
    per_type['nominations_type__Data'] +
    per_type['nominations_type__Scams'] +
    per_type['nominations_type__Cash']
)

print('Max theoretical TotalNominations = 460 (92*5). Observed max =', int(per_type['TotalNominations'].max()))


Max theoretical TotalNominations = 460 (92*5). Observed max = 304


## 6) Overall WCI Score (Eq. 3)


In [104]:
# 6) Overall WCI Score (as in the published table)

per_type['WCI Score'] = (
    per_type['WCI_type__Tech']
    + per_type['WCI_type__Attacks']
    + per_type['WCI_type__Data']
    + per_type['WCI_type__Scams']
    + per_type['WCI_type__Cash']
) / 5.0


## 7) Overall I, P, TS (unweighted across all nominations)


In [105]:
ipt = long_df.groupby('Country')[['Impact', 'Professionalism', 'TechSkill']].mean().reset_index()
ipt = ipt.rename(columns={'Impact': 'I', 'Professionalism': 'P', 'TechSkill': 'TS'})


## 8) Final output table


In [106]:
final = per_type.merge(ipt, on='Country', how='left')

# Add the five per-type WCI columns with required names
final['Tech'] = per_type['WCI_type__Tech']
final['Attacks'] = per_type['WCI_type__Attacks']
final['Data'] = per_type['WCI_type__Data']
final['Scams'] = per_type['WCI_type__Scams']
final['Cash'] = per_type['WCI_type__Cash']

cols_order = ['Country', 'I', 'P', 'TS', 'WCI Score', 'Tech', 'Attacks', 'Data', 'Scams', 'Cash']
final = final[cols_order].copy()

# Sort and rank
final = final.sort_values('WCI Score', ascending=False).reset_index(drop=True)
final.insert(0, 'Rank', final.index + 1)

display(final.head(15))


Unnamed: 0,Rank,Country,I,P,TS,WCI Score,Tech,Attacks,Data,Scams,Cash
0,1,Russia,8.963816,8.8125,8.730263,58.391304,82.173913,81.34058,65.181159,21.702899,41.557971
1,2,Ukraine,8.366337,8.292079,8.237624,36.442029,52.971014,50.76087,36.014493,11.195652,31.268116
2,3,China,8.216049,7.703704,7.814815,27.862319,40.217391,24.23913,34.891304,15.833333,24.130435
3,4,United States,7.987013,7.207792,7.214286,25.007246,27.644928,17.681159,30.362319,22.717391,26.630435
4,5,Nigeria,8.251748,6.48951,5.797203,21.282609,7.934783,8.405797,23.043478,52.173913,14.855072
5,6,Romania,7.125,7.041667,7.145833,14.826087,17.826087,9.166667,22.5,13.152174,11.485507
6,7,"Korea, North",7.875,7.203125,7.359375,10.405797,8.65942,24.311594,13.007246,2.173913,3.876812
7,8,United Kingdom,7.859649,7.210526,6.754386,9.014493,5.036232,4.746377,5.797101,7.862319,21.630435
8,9,Brazil,6.904762,6.349206,6.31746,8.934783,13.695652,8.768116,10.289855,7.282609,4.637681
9,10,India,7.9,6.6,6.65,6.130435,4.456522,3.623188,6.811594,12.753623,3.007246


## Save to CSV


In [107]:
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
final.to_csv(OUT_PATH, index=False)
print('Wrote', OUT_PATH.resolve())


Wrote /Users/user/codeprojects/wci/data/WCI_recacl.csv


In [108]:
import pandas as pd
from pathlib import Path

# ============================================================
# Load data exactly the way readr would
# ============================================================
DATA_DIR = Path("data")
RAW_PATH = DATA_DIR / "wci_data.csv"

df = pd.read_csv(RAW_PATH)

# R keeps empty strings; pandas converts to NaN → restore empty strings
df = df.fillna("")

print("Loaded:", RAW_PATH, "shape:", df.shape)

# ============================================================
# Construct long table exactly like the R code
# ============================================================
type_map = [
    ("Tech",    "Technical"),
    ("Attacks", "Attack"),   # CORRECT prefix!
    ("Data",    "Data"),
    ("Scams",   "Scams"),
    ("Cash",    "Cash"),
]

records = []

def as_numeric_r(x):
    """Mimic R's as.numeric(): convert '' or non-numeric to NA."""
    try:
        return float(x)
    except:
        return None

for _, row in df.iterrows():
    rid = row["ResponseID"]

    for out_label, prefix in type_map:
        for pos in range(1, 6):

            country_str = str(country).strip()

            # Keep the row even if country is "", NA, or "--"
            if country_str in ["--"]:
                country_str = ""  # normalize but DO NOT skip

            imp  = as_numeric_r(row[f"{prefix}{pos}_impact"])
            prof = as_numeric_r(row[f"{prefix}{pos}_professional"])
            tech = as_numeric_r(row[f"{prefix}{pos}_techskill"])

            # R logic: if ANY is NA → NomScore is NA
            if imp is None or prof is None or tech is None:
                NomScore = None
            else:
                NomScore = (imp + prof + tech) / 3.0

            records.append({
                "ResponseID": rid,
                "Country": str(country).strip(),
                "CrimeType": out_label,
                "I": imp,
                "P": prof,
                "TS": tech,
                "NomScore": NomScore,
            })

long_df = pd.DataFrame(records)
print("Long DF:", long_df.shape)

# ============================================================
# Summarise like dplyr::group_by + summarise
# ============================================================
grp = long_df.groupby(["Country", "CrimeType"])

# pandas mean() already skips NaN → equivalent to mean(..., na.rm=TRUE)
CountryScore_type = grp["NomScore"].mean()
nominations_type  = grp.size()

agg = pd.concat([CountryScore_type, nominations_type], axis=1)
agg.columns = ["CountryScore_type", "nominations_type"]
agg = agg.reset_index()

# ============================================================
# Per-type WCI (Eq. 2)
# ============================================================
NRESP = 92
agg["WCI_type"] = agg["CountryScore_type"] * (agg["nominations_type"] / NRESP) * 10

# ============================================================
# Pivot to wide (same as R pivot_wider)
# ============================================================
per_type = agg.pivot_table(
    index="Country",
    columns="CrimeType",
    values=["CountryScore_type", "nominations_type", "WCI_type"],
    aggfunc="first"
)

per_type.columns = [f"{a}__{b}" for a, b in per_type.columns]
per_type = per_type.reset_index()

# Ensure all Ctypes exist
for t in ["Tech", "Attacks", "Data", "Scams", "Cash"]:
    for base in ["CountryScore_type", "nominations_type", "WCI_type"]:
        col = f"{base}__{t}"
        if col not in per_type.columns:
            per_type[col] = 0.0

# ============================================================
# Average raw type score
# ============================================================
per_type["AvgTypeScore"] = (
    per_type["CountryScore_type__Tech"] +
    per_type["CountryScore_type__Attacks"] +
    per_type["CountryScore_type__Data"] +
    per_type["CountryScore_type__Scams"] +
    per_type["CountryScore_type__Cash"]
) / 5

# ============================================================
# Total nominations across all types
# ============================================================
per_type["TotalNominations"] = (
    per_type["nominations_type__Tech"] +
    per_type["nominations_type__Attacks"] +
    per_type["nominations_type__Data"] +
    per_type["nominations_type__Scams"] +
    per_type["nominations_type__Cash"]
)

# ============================================================
# Overall WCI (Eq. 3)
# ============================================================
DEN = 92 * 5   # 460
per_type["WCI Score"] = (
    per_type["AvgTypeScore"] *
    (per_type["TotalNominations"] / DEN) *
    10
)

# ============================================================
# Overall I, P, TS (means across all nominations of country)
# ============================================================
ipt = long_df.groupby("Country")[["I", "P", "TS"]].mean().reset_index()

# ============================================================
# Merge final table
# ============================================================
final = per_type.merge(ipt, on="Country", how="left")

# Attach per-type WCI columns
final["Tech"]    = final["WCI_type__Tech"]
final["Attacks"] = final["WCI_type__Attacks"]
final["Data"]    = final["WCI_type__Data"]
final["Scams"]   = final["WCI_type__Scams"]
final["Cash"]    = final["WCI_type__Cash"]

final = final[
    ["Country", "I", "P", "TS", "WCI Score",
     "Tech", "Attacks", "Data", "Scams", "Cash"]
]

final = final.sort_values("WCI Score", ascending=False).reset_index(drop=True)
final.insert(0, "Rank", final.index + 1)

display(final.head(15))

Loaded: data/wci_data.csv shape: (92, 108)
Long DF: (2300, 7)


Unnamed: 0,Rank,Country,I,P,TS,WCI Score,Tech,Attacks,Data,Scams,Cash
0,1,--,7.970639,7.365573,7.217617,374.942019,386.375321,387.475345,383.423668,347.043011,370.392749


In [110]:
print("NomScore NA rows:", long_df['NomScore'].isna().sum())
print("NomScore total rows:", long_df.shape[0])

NomScore NA rows: 563
NomScore total rows: 2300


In [111]:
import pandas as pd
from pathlib import Path

# ============================================================
# Load data (as readr would)
# ============================================================
DATA_DIR = Path("data")
RAW_PATH = DATA_DIR / "wci_data.csv"

df = pd.read_csv(RAW_PATH)
df = df.fillna("")   # R keeps empty strings, not NaN

print("Loaded:", RAW_PATH, "shape:", df.shape)

# ============================================================
# Build long table EXACTLY like the R script does
# ============================================================
type_map = [
    ("Tech",    "Technical"),
    ("Attacks", "Attack"),
    ("Data",    "Data"),
    ("Scams",   "Scams"),
    ("Cash",    "Cash"),
]

records = []

def as_numeric_r(x):
    """Mimic R as.numeric: convert '' → NA"""
    try:
        return float(x)
    except:
        return None

for _, row in df.iterrows():
    rid = row["ResponseID"]

    for out_label, prefix in type_map:
        for pos in range(1, 6):

            raw_country = row[f"{prefix}{pos}"]
            raw_country = "" if pd.isna(raw_country) else str(raw_country).strip()

            # DO NOT skip empty countries here — R does not.
            country = raw_country

            # Extract ratings exactly like R does
            I  = as_numeric_r(row[f"{prefix}{pos}_impact"])
            P  = as_numeric_r(row[f"{prefix}{pos}_professional"])
            TS = as_numeric_r(row[f"{prefix}{pos}_techskill"])

            # Compute NomScore like R:
            # if any rating NA → NomScore NA
            if I is None or P is None or TS is None:
                NomScore = None
            else:
                NomScore = (I + P + TS) / 3.0

            records.append({
                "ResponseID": rid,
                "Country": country,
                "CrimeType": out_label,
                "I": I,
                "P": P,
                "TS": TS,
                "NomScore": NomScore,
            })

long_df = pd.DataFrame(records)
print("Long DF:", long_df.shape)

# ============================================================
# Summarise like the R code
# ============================================================
grp = long_df.groupby(["Country", "CrimeType"])

agg = pd.DataFrame({
    "CountryScore_type": grp["NomScore"].mean(),    # mean(..., na.rm=TRUE)
    "nominations_type": grp.size()                  # n()
}).reset_index()

# ============================================================
# Compute per-type WCI (Eq. 2)
# ============================================================
NRESP = 92
agg["WCI_type"] = agg["CountryScore_type"] * (agg["nominations_type"] / NRESP) * 10

# Pivot wide
per_type = agg.pivot_table(
    index="Country",
    columns="CrimeType",
    values=["CountryScore_type", "nominations_type", "WCI_type"],
    aggfunc="first"
)

per_type.columns = [f"{a}__{b}" for a, b in per_type.columns]
per_type = per_type.reset_index()

# ============================================================
# DROP blank-country rows here — this is what the published table does
# ============================================================
per_type = per_type[per_type["Country"].str.strip() != ""].copy()

# Ensure all 5 categories exist
for t in ["Tech", "Attacks", "Data", "Scams", "Cash"]:
    for base in ["CountryScore_type", "nominations_type", "WCI_type"]:
        col = f"{base}__{t}"
        if col not in per_type.columns:
            per_type[col] = 0.0

# ============================================================
# Compute OVERALL WCI exactly like the published table:
#
#     WCI Score = mean(Tech, Attacks, Data, Scams, Cash)
#
# NOT the Eq(3) formula.
# ============================================================
per_type["WCI Score"] = (
    per_type["WCI_type__Tech"]
    + per_type["WCI_type__Attacks"]
    + per_type["WCI_type__Data"]
    + per_type["WCI_type__Scams"]
    + per_type["WCI_type__Cash"]
) / 5.0

# ============================================================
# Compute overall I, P, TS (means across ALL nominations)
# ============================================================
ipt = long_df[long_df["Country"].str.strip() != ""].groupby("Country")[["I","P","TS"]].mean().reset_index()

# ============================================================
# Assemble final table
# ============================================================
final = per_type.merge(ipt, on="Country", how="left")

final["Tech"]    = final["WCI_type__Tech"]
final["Attacks"] = final["WCI_type__Attacks"]
final["Data"]    = final["WCI_type__Data"]
final["Scams"]   = final["WCI_type__Scams"]
final["Cash"]    = final["WCI_type__Cash"]

final = final[[
    "Country", "I", "P", "TS", "WCI Score",
    "Tech", "Attacks", "Data", "Scams", "Cash"
]]

final = final.sort_values("WCI Score", ascending=False).reset_index(drop=True)
final.insert(0, "Rank", final.index + 1)

display(final.head(20))

Loaded: data/wci_data.csv shape: (92, 108)
Long DF: (2300, 7)


Unnamed: 0,Rank,Country,I,P,TS,WCI Score,Tech,Attacks,Data,Scams,Cash
0,1,Russia,8.963816,8.8125,8.730263,58.391304,82.173913,81.34058,65.181159,21.702899,41.557971
1,2,Ukraine,8.366337,8.292079,8.237624,36.442029,52.971014,50.76087,36.014493,11.195652,31.268116
2,3,China,8.216049,7.703704,7.814815,27.862319,40.217391,24.23913,34.891304,15.833333,24.130435
3,4,United States,7.987013,7.207792,7.214286,25.007246,27.644928,17.681159,30.362319,22.717391,26.630435
4,5,Nigeria,8.251748,6.48951,5.797203,21.282609,7.934783,8.405797,23.043478,52.173913,14.855072
5,6,Romania,7.125,7.041667,7.145833,14.826087,17.826087,9.166667,22.5,13.152174,11.485507
6,7,"Korea, North",7.875,7.203125,7.359375,10.405797,8.65942,24.311594,13.007246,2.173913,3.876812
7,8,United Kingdom,7.859649,7.210526,6.754386,9.014493,5.036232,4.746377,5.797101,7.862319,21.630435
8,9,Brazil,6.904762,6.349206,6.31746,8.934783,13.695652,8.768116,10.289855,7.282609,4.637681
9,10,India,7.9,6.6,6.65,6.130435,4.456522,3.623188,6.811594,12.753623,3.007246


In [114]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path("data")

# ============================================================
# 1. Attach nationality + residence to the long_df nomination table
# ============================================================

id_map = df[['ResponseID', 'Nationality', 'Residence']].copy()

votes = long_df.merge(id_map, on="ResponseID", how="left")

# Drop blank accused-country rows
votes = votes[votes['Country'].str.strip() != ""].copy()

# ============================================================
# 2. Build NATIONALITY accusation COUNT matrix
# ============================================================

nat_matrix = (
    votes.pivot_table(
        index="Country",
        columns="Nationality",
        values="ResponseID",
        aggfunc="count",
        fill_value=0
    )
    .sort_index()
)

# Row-normalise to percentages
nat_percent = nat_matrix.div(nat_matrix.sum(axis=1), axis=0) * 100

# Save
nat_path = DATA_DIR / "accusations_nationality_percent.csv"
nat_percent.to_csv(nat_path)
print("Wrote nationality % matrix:", nat_path)

# ============================================================
# 3. Build RESIDENCE accusation COUNT matrix
# ============================================================

res_matrix = (
    votes.pivot_table(
        index="Country",
        columns="Residence",
        values="ResponseID",
        aggfunc="count",
        fill_value=0
    )
    .sort_index()
)

# Row-normalise to percentages
res_percent = res_matrix.div(res_matrix.sum(axis=1), axis=0) * 100

# Save
res_path = DATA_DIR / "accusations_residence_percent.csv"
res_percent.to_csv(res_path)
print("Wrote residence % matrix:", res_path)

# ============================================================
# Preview
# ============================================================
print("\nNATIONALITY % (head)")
display(nat_percent.head())

print("\nRESIDENCE % (head)")
display(res_percent.head())

Wrote nationality % matrix: data/accusations_nationality_percent.csv
Wrote residence % matrix: data/accusations_residence_percent.csv

NATIONALITY % (head)


Nationality,Australia,Austria,Benin,Bosnia and Herzegovina,Brazil,Bulgaria,Canada,Finland,Gambia,Germany,...,Poland,Prefer not to say,Romania,Russia,Swaziland,Sweden,Trinidad and Tobago,Ukraine,United Kingdom,United States
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--,7.637655,3.374778,0.0,1.953819,1.776199,1.953819,3.197158,1.243339,0.0,5.328597,...,1.598579,4.795737,3.552398,0.0,0.71048,3.552398,2.664298,2.309059,20.248668,21.492007
Afghanistan,0.0,0.0,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,33.333333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.333333
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,0.0,...,0.0,0.0,0.0,0.0,50.0,0.0,0.0,0.0,0.0,0.0
Argentina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0



RESIDENCE % (head)


Residence,Australia,Bosnia and Herzegovina,Brazil,Bulgaria,Canada,China,Finland,France,Gambia,Germany,...,Singapore,Spain,Sweden,Switzerland,Thailand,Turkey,Ukraine,United Arab Emirates,United Kingdom,United States
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--,4.618117,1.953819,1.776199,1.953819,1.065719,1.243339,1.243339,3.374778,0.0,5.328597,...,0.0,2.841918,3.552398,2.486679,2.841918,0.0,1.776199,0.0,19.005329,20.426288
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,0.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,33.333333,0.0,0.0,0.0,0.0,0.0,33.333333
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Argentina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0


In [115]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path("data")

# We assume df (wide) and long_df (correct long table) already exist.

# ------------------------------------------------------------
# 1) Attach nationality + residence to long_df
# ------------------------------------------------------------
id_map = df[['ResponseID', 'Nationality', 'Residence']].copy()

votes = long_df.merge(id_map, on="ResponseID", how="left")

# Drop blank accused-country rows
votes = votes[votes['Country'].str.strip() != ""].copy()

# ------------------------------------------------------------
# 2) SELF-NOMINATIONS by NATIONALITY
# ------------------------------------------------------------
self_nat = (
    votes[votes['Country'] == votes['Nationality']]
    .groupby('Country')
    .size()
    .reset_index(name='SelfNominations')
    .sort_values('SelfNominations', ascending=False)
    .reset_index(drop=True)
)

# Save
out_nat = DATA_DIR / "selfnominate_nationality.csv"
self_nat.to_csv(out_nat, index=False)
print("Wrote:", out_nat)

# ------------------------------------------------------------
# 3) SELF-NOMINATIONS by RESIDENCE
# ------------------------------------------------------------
self_res = (
    votes[votes['Country'] == votes['Residence']]
    .groupby('Country')
    .size()
    .reset_index(name='SelfNominations')
    .sort_values('SelfNominations', ascending=False)
    .reset_index(drop=True)
)

# Save
out_res = DATA_DIR / "selfnominate_residence.csv"
self_res.to_csv(out_res, index=False)
print("Wrote:", out_res)

# ------------------------------------------------------------
# Preview
# ------------------------------------------------------------
print("\nSelf-nomination by NATIONALITY:")
display(self_nat.head(20))

print("\nSelf-nomination by RESIDENCE:")
display(self_res.head(20))

Wrote: data/selfnominate_nationality.csv
Wrote: data/selfnominate_residence.csv

Self-nomination by NATIONALITY:


Unnamed: 0,Country,SelfNominations
0,United States,37
1,United Kingdom,25
2,Ukraine,13
3,Russia,9
4,Brazil,8
5,Nigeria,8
6,Poland,8
7,Romania,8
8,Netherlands,6
9,Australia,4



Self-nomination by RESIDENCE:


Unnamed: 0,Country,SelfNominations
0,United States,41
1,United Kingdom,21
2,Ukraine,9
3,Brazil,8
4,Poland,8
5,Russia,8
6,Ghana,5
7,Nigeria,5
8,Romania,5
9,Australia,4


In [123]:
!pip install pycountry

Collecting pycountry
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m59.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pycountry
Successfully installed pycountry-24.6.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [116]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path("data")

# ------------------------------------------------------------
# 1) Attach nationality + residence to long_df
# ------------------------------------------------------------
id_map = df[['ResponseID', 'Nationality', 'Residence']].copy()

votes = long_df.merge(id_map, on="ResponseID", how="left")

# Keep only nominations with real accused country
votes = votes[votes['Country'].str.strip() != ""].copy()

# ------------------------------------------------------------
# 2) BUILD ANIMOSITY MATRIX (nationality → accused country)
# ------------------------------------------------------------

# Count accusations from nationality -> country
ani_counts = (
    votes
    .pivot_table(
        index="Nationality",     # accuser
        columns="Country",       # accused
        values="ResponseID",
        aggfunc="count",
        fill_value=0
    )
    .sort_index()
)

# Row-normalise to percentages
ani_percent = ani_counts.div(ani_counts.sum(axis=1), axis=0) * 100

# ------------------------------------------------------------
# 3) SAVE
# ------------------------------------------------------------
out_path = DATA_DIR / "animosity_index.csv"
ani_percent.to_csv(out_path)

print("Wrote animosity index:", out_path)

# ------------------------------------------------------------
# 4) Preview
# ------------------------------------------------------------
display(ani_percent.head(20))

Wrote animosity index: data/animosity_index.csv


Country,--,Afghanistan,Algeria,Angola,Argentina,Armenia,Australia,Azerbaijan,Belarus,Belgium,...,Tunisia,Turkey,Uganda,Ukraine,United Arab Emirates,United Kingdom,United States,Venezuela,Vietnam,Zambia
Nationality,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Australia,28.666667,0.0,0.0,0.0,0.0,0.0,2.666667,0.0,0.0,0.0,...,0.0,0.0,0.0,12.0,0.0,1.333333,4.666667,0.0,0.0,0.0
Austria,76.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0
Benin,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,4.0,...,0.0,4.0,0.0,0.0,0.0,12.0,8.0,0.0,0.0,0.0
Bosnia and Herzegovina,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,6.0,0.0,6.0,4.0,0.0,0.0,0.0
Brazil,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,18.0,0.0,4.0,10.0,0.0,0.0,0.0
Bulgaria,44.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,...,0.0,4.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0
Canada,36.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,0.0,2.0,0.0,4.0,2.0,0.0,8.0,0.0,0.0,0.0
Finland,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,4.0,16.0,0.0,0.0,0.0
Gambia,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,8.0,12.0,0.0,0.0,0.0
Germany,60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0


In [121]:
import pandas as pd

# ============================================================
# 1. Load your recalculated WCI scores
# ============================================================
wci = pd.read_csv("data/WCI_recacl.csv")  # adjust path if needed

# Standardise columns
wci = wci.rename(columns={"WCI Score": "WCI"})
wci = wci[["Country", "WCI"]]

# ============================================================
# 2. Load the world population + GDP dataset for normalisation
# ============================================================
world = pd.read_csv("data/world-data-2023.csv")  # adjust if needed

# Ensure country names match your WCI dataset
# (Lowercase strip for matching)
wci['Country_clean'] = wci['Country'].str.lower().str.strip()
world['Country_clean'] = world['Country'].str.lower().str.strip()

# Merge to get population and GDP
df_wci = wci.merge(
    world[["Country_clean", "CCA3", "Population", "GDP"]],
    on="Country_clean",
    how="left"
)

df_wci = df_wci.rename(columns={
    "CCA3": "ISO3"
})

# ============================================================
# 3. Add normalised metrics
# ============================================================
df_wci["WCI_per_capita"] = df_wci["WCI"] / df_wci["Population"]
df_wci["WCI_per_GDP"] = df_wci["WCI"] / df_wci["GDP"]

# Clean final columns
df_wci = df_wci[["Country", "ISO3", "WCI", "WCI_per_capita", "WCI_per_GDP"]]

print("df_wci ready:")
display(df_wci.head())

# ============================================================
# 4. Build accusers dict (top accusers per country)
# ============================================================
acc_matrix = pd.read_csv("data/accusations_nationality_percent.csv")

# We want: dict[country] = List[(accuser, percentage)]
accusers = {}

# All unique countries appearing as columns
for country in acc_matrix["Country"].unique():
    row = acc_matrix[acc_matrix["Country"] == country].drop(columns=["Country"]).iloc[0]
    # Convert to list of (accuser, percentage)
    items = [(col, float(val)) for col, val in row.items() if float(val) > 0]
    # Sort descending
    items = sorted(items, key=lambda x: x[1], reverse=True)
    # Store
    accusers[country] = items

print("\nAccusers dict example:")
example_country = list(accusers.keys())[0]
print(example_country, accusers[example_country][:5])

KeyError: "['CCA3'] not in index"

In [127]:
import pandas as pd
import pycountry

# ============================================================
# 0. Helpers
# ============================================================

def iso2_to_iso3(code):
    """Convert ISO2 → ISO3; return None if not found."""
    try:
        return pycountry.countries.get(alpha_2=code.upper()).alpha_3
    except:
        return None

def clean_population(x):
    """Remove commas and convert to float."""
    if isinstance(x, str):
        x = x.replace(",", "").strip()
        if x == "" or x.lower() == "nan":
            return None
    try:
        return float(x)
    except:
        return None

def clean_gdp(value):
    """Parse GDP like '$2.1T', '$900B', '1,200,000,000', etc."""
    if isinstance(value, str):
        v = value.strip()
        if v == "" or v.lower() == "nan":
            return None

        # Remove currency symbols + commas + spaces
        v = (
            v.replace("$", "")
             .replace(",", "")
             .replace(" ", "")
             .strip()
        )

        # Handle suffixes
        if v.endswith(("T", "t")):
            return float(v[:-1]) * 1_000_000_000_000
        if v.endswith(("B", "b")):
            return float(v[:-1]) * 1_000_000_000
        if v.endswith(("M", "m")):
            return float(v[:-1]) * 1_000_000

        # Fallback direct parse
        try:
            return float(v)
        except:
            return None

    return None


# ============================================================
# 1. Load recalculated WCI
# ============================================================

wci = pd.read_csv("data/WCI_recacl.csv")
wci = wci.rename(columns={"WCI Score": "WCI"})
wci["Country_clean"] = wci["Country"].str.lower().str.strip()


# ============================================================
# 2. Load world data + clean numeric columns
# ============================================================

world = pd.read_csv("data/world-data-2023.csv")
world["Country_clean"] = world["Country"].str.lower().str.strip()

# ISO2 → ISO3
world["ISO3"] = world["Abbreviation"].apply(iso2_to_iso3)

# Clean population + GDP
world["Population_clean"] = world["Population"].apply(clean_population)
world["GDP_clean"] = world["GDP"].apply(clean_gdp)


# ============================================================
# 3. Merge WCI with world metrics
# ============================================================

df_wci = wci.merge(
    world[["Country_clean", "ISO3", "Population_clean", "GDP_clean"]],
    on="Country_clean",
    how="left"
)

df_wci["WCI_per_capita"] = df_wci["WCI"] / df_wci["Population_clean"]
df_wci["WCI_per_GDP"] = df_wci["WCI"] / df_wci["GDP_clean"]

df_wci = df_wci.rename(columns={
    "Population_clean": "Population",
    "GDP_clean": "GDP"
})

df_wci = df_wci[["Country", "ISO3", "WCI", "Population", "GDP", "WCI_per_capita", "WCI_per_GDP"]]

print("df_wci ready:")
display(df_wci.head(20))


# ============================================================
# 4. Build accusers dictionary (from accusation % matrix)
# ============================================================

acc_matrix = pd.read_csv("data/accusations_nationality_percent.csv")

accusers = {}

for i, row in acc_matrix.iterrows():
    country = row["Country"]
    items = []
    for col in row.index[1:]:  # skip 'Country'
        val = row[col]
        if isinstance(val, (int, float)) and val > 0:
            items.append((col, float(val)))

    # Sort by % descending
    items = sorted(items, key=lambda x: x[1], reverse=True)
    accusers[country] = items

print("\nAccusers example:")
first = list(accusers.keys())[0]
print(first, accusers[first][:5])

df_wci ready:


Unnamed: 0,Country,ISO3,WCI,Population,GDP,WCI_per_capita,WCI_per_GDP
0,Russia,RUS,58.391304,144373500.0,1699877000000.0,4.04446e-07,3.435032e-11
1,Ukraine,UKR,36.442029,44385160.0,153781100000.0,8.210409e-07,2.369734e-10
2,China,CHN,27.862319,1397715000.0,19910000000000.0,1.993419e-08,1.399413e-12
3,United States,USA,25.007246,328239500.0,21427700000000.0,7.618597e-08,1.167052e-12
4,Nigeria,NGA,21.282609,200963600.0,448120400000.0,1.059028e-07,4.749306e-11
5,Romania,ROU,14.826087,19356540.0,250077400000.0,7.65947e-07,5.928598e-11
6,"Korea, North",,10.405797,,,,
7,United Kingdom,GBR,9.014493,66834400.0,2827113000000.0,1.34878e-07,3.188586e-12
8,Brazil,BRA,8.934783,212559400.0,1839758000000.0,4.203428e-08,4.856499e-12
9,India,IND,6.130435,1366418000.0,2611000000000.0,4.486501e-09,2.347926e-12



Accusers example:
-- [('United States', 21.49200710479574), ('United Kingdom', 20.24866785079929), ('Australia', 7.63765541740675), ('Germany', 5.328596802841918), ('Netherlands', 5.150976909413854)]


In [129]:
import pandas as pd

# ---------------------------------------------------
# 1. Load WCI recalculated results
# ---------------------------------------------------
wci = pd.read_csv("data/WCI_recacl.csv")
# Expecting columns: Country, I, P, TS, WCI Score, Tech, Attacks, Data, Scams, Cash

wci = wci.rename(columns={"WCI Score": "WCI"})

# ---------------------------------------------------
# 2. Load world data (2023)
# ---------------------------------------------------
world = pd.read_csv("data/world-data-2023.csv")

# Keep only what we need
world = world[["Country", "Abbreviation", "Population", "GDP"]].copy()
world = world.rename(columns={"Abbreviation": "ISO3"})

# ---------------------------------------------------
# 3. Clean numeric fields
# ---------------------------------------------------
def clean_numeric(x):
    if isinstance(x, str):
        x = x.replace("$","").replace(",", "").strip()
    return pd.to_numeric(x, errors="coerce")

world["Population"] = world["Population"].apply(clean_numeric)
world["GDP"] = world["GDP"].apply(clean_numeric)

# ---------------------------------------------------
# 4. Build merge key
# ---------------------------------------------------
wci["Country_clean"] = wci["Country"].str.lower().str.strip()
world["Country_clean"] = world["Country"].str.lower().str.strip()

# ---------------------------------------------------
# 5. Merge
# ---------------------------------------------------
df = wci.merge(world[["Country_clean","ISO3","Population","GDP"]],
               on="Country_clean",
               how="left")

# ---------------------------------------------------
# 6. Create normalised metrics
# ---------------------------------------------------
df["WCI_per_capita"] = df["WCI"] / df["Population"]
df["WCI_per_GDP"] = df["WCI"] / df["GDP"]

# ---------------------------------------------------
# 7. Final cleaned export
# ---------------------------------------------------
df_final = df[["Country","ISO3","WCI","Population","GDP","WCI_per_capita","WCI_per_GDP"]]

df_final.to_csv("data/df_wci_ready.csv", index=False)
print("Wrote data/df_wci_ready.csv")
df_final.head(20)

Wrote data/df_wci_ready.csv


Unnamed: 0,Country,ISO3,WCI,Population,GDP,WCI_per_capita,WCI_per_GDP
0,Russia,RU,58.391304,144373500.0,1699877000000.0,4.04446e-07,3.435032e-11
1,Ukraine,UA,36.442029,44385160.0,153781100000.0,8.210409e-07,2.369734e-10
2,China,CN,27.862319,1397715000.0,19910000000000.0,1.993419e-08,1.399413e-12
3,United States,US,25.007246,328239500.0,21427700000000.0,7.618597e-08,1.167052e-12
4,Nigeria,NG,21.282609,200963600.0,448120400000.0,1.059028e-07,4.749306e-11
5,Romania,RO,14.826087,19356540.0,250077400000.0,7.65947e-07,5.928598e-11
6,"Korea, North",,10.405797,,,,
7,United Kingdom,GB,9.014493,66834400.0,2827113000000.0,1.34878e-07,3.188586e-12
8,Brazil,BR,8.934783,212559400.0,1839758000000.0,4.203428e-08,4.856499e-12
9,India,IN,6.130435,1366418000.0,2611000000000.0,4.486501e-09,2.347926e-12


In [130]:
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from ipywidgets import VBox, HBox, Dropdown, Output

# ------------------------------------------------------------
# 1. Load datasets
# ------------------------------------------------------------
df_wci = pd.read_csv("data/df_wci_ready.csv")
acc_nat = pd.read_csv("data/accusations_nationality_percent.csv", index_col=0)
acc_res = pd.read_csv("data/accusations_residence_percent.csv", index_col=0)

# Convert accusation matrices into usable dicts
acc_nat = acc_nat.fillna(0)
acc_res = acc_res.fillna(0)

# ------------------------------------------------------------
# 2. Dropdown for metric selection
# ------------------------------------------------------------
metric_dropdown = Dropdown(
    options={
        "Raw WCI": "WCI",
        "WCI per capita": "WCI_per_capita",
        "WCI per GDP": "WCI_per_GDP"
    },
    value="WCI",
    description="Metric:"
)

# Dropdown for accuser mode
accuser_dropdown = Dropdown(
    options={
        "Accusations by nationality": "nat",
        "Accusations by residence": "res"
    },
    value="nat",
    description="Accusers:"
)

# ------------------------------------------------------------
# 3. Output panels
# ------------------------------------------------------------
map_out = Output()
details_out = Output()

# ------------------------------------------------------------
# 4. Function to build choropleth figure
# ------------------------------------------------------------
def make_map(metric):
    fig = px.choropleth(
        df_wci,
        locations="ISO3",
        color=metric,
        hover_name="Country",
        color_continuous_scale=["white","green","blue","red"],
    )
    fig.update_layout(
        title=f"World Cybercrime Index — {metric}",
        clickmode='event+select',
        height=600
    )
    return fig

# ------------------------------------------------------------
# 5. Event handler: when user clicks a country
# ------------------------------------------------------------
def update_details(trace, points, selector):
    if len(points.point_inds) == 0:
        return

    idx = points.point_inds[0]
    country = df_wci.iloc[idx]["Country"]
    iso = df_wci.iloc[idx]["ISO3"]

    # Choose accuser mode
    if accuser_dropdown.value == "nat":
        accdf = acc_nat
    else:
        accdf = acc_res

    # Extract row
    if country in accdf.index:
        row = accdf.loc[country].sort_values(ascending=False)
        row = row[row > 0].head(10)
    else:
        row = pd.Series(dtype=float)

    with details_out:
        details_out.clear_output()

        # Build bar chart for accusers
        if len(row) > 0:
            bar = go.FigureWidget(
                go.Bar(
                    x=row.values,
                    y=row.index,
                    orientation='h',
                    marker_color="crimson"
                )
            )
            bar.update_layout(
                title=f"Top accusators of {country}",
                height=400,
                margin=dict(l=100)
            )
        else:
            bar = "No accuser data found."

        display(
            f"COUNTRY SELECTED: {country} ({iso})",
        )
        display(bar)

# ------------------------------------------------------------
# 6. Function to refresh map when dropdown changes
# ------------------------------------------------------------
def refresh_map(_=None):
    with map_out:
        map_out.clear_output()
        fig = make_map(metric_dropdown.value)

        # Convert to FigureWidget to capture click events
        fw = go.FigureWidget(fig)
        fw.data[0].on_click(update_details)

        display(fw)

# ------------------------------------------------------------
# 7. Connect dropdown events
# ------------------------------------------------------------
metric_dropdown.observe(refresh_map, names="value")
accuser_dropdown.observe(refresh_map, names="value")

# ------------------------------------------------------------
# 8. Render UI
# ------------------------------------------------------------
refresh_map()
ui = VBox([
    HBox([metric_dropdown, accuser_dropdown]),
    map_out,
    details_out
])

ui

VBox(children=(HBox(children=(Dropdown(description='Metric:', options={'Raw WCI': 'WCI', 'WCI per capita': 'WC…