# Cyber Crime Losses & Complaints – Guided Analysis

Welcome! This notebook is designed for **non‑technical and non‑coding users** who need to explore a cyber crime dataset and prepare results for Power BI.  \
We will walk through each step, explain what the code does in plain language, and export tables to CSV files so you can create charts in Power BI without writing any code yourself.  \
If a plot would normally appear in Python, we instead save the underlying data for Power BI.  \
At the end you will have several CSV files ready to import into Power BI to create your own charts.

## 1. Load and prepare the data

In this section we will:

1. Import the Python packages we need.
2. Create a folder to store our output files (so everything is in one place).
3. Read the cyber crime dataset from `LossFromNetCrime.csv`.
4. Clean the data by filling missing country names and converting numeric columns to numbers.
5. Define the list of years (2019–2024) we will analyse.

Everything is commented so you can follow along even if you have never coded before.
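
To make step 4 concrete, here is a minimal sketch on a tiny made-up table. The values and the single `2019_Losses` column are invented for illustration, and the real file may look different. It shows how `fillna` and `pd.to_numeric(..., errors='coerce')` behave: text that looks like a number becomes a number, and anything that cannot be converted becomes a missing value (`NaN`) instead of raising an error.

```python
import pandas as pd

# Tiny made-up table: one missing country, numbers stored as text, one bad value.
demo = pd.DataFrame({
    'Country': ['US', None, 'DE'],
    '2019_Losses': ['1000', '250.5', 'n/a'],
})

# Missing country names become the label 'Unknown'.
demo['Country'] = demo['Country'].fillna('Unknown')

# Text is converted to numbers; unconvertible text ('n/a') becomes NaN.
demo['2019_Losses'] = pd.to_numeric(demo['2019_Losses'], errors='coerce')

print(demo)

# Sums skip NaN by default, so the bad cell is simply left out of the total.
print(demo['2019_Losses'].sum())
```

One thing worth checking in the real file, if it stores two-letter country codes: when reading a CSV, pandas treats the text `NA` as a missing value by default, so a code such as `NA` (Namibia) would end up as `Unknown` unless the file is read with `keep_default_na=False`.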

In [None]:
# Import Python packages
import os
import pandas as pd
import numpy as np

# 1.1 Create an output folder if it does not already exist.
# All of our CSV files will be saved here.
os.makedirs('Code_Output', exist_ok=True)

# 1.2 Load the cyber crime dataset into a pandas DataFrame.
# The CSV file contains rows for countries and columns for each year’s complaints and losses.
df_loss = pd.read_csv('LossFromNetCrime.csv')

# 1.3 Replace any missing country names with 'Unknown'.
# This prevents errors when sorting or grouping.
df_loss['Country'] = df_loss['Country'].fillna('Unknown')

# 1.4 Convert all columns except 'Country' to numbers.
# Sometimes numbers are stored as text; this forces them to numeric so we can add and average them.
numeric_cols = [c for c in df_loss.columns if c != 'Country']
df_loss[numeric_cols] = df_loss[numeric_cols].apply(pd.to_numeric, errors='coerce')

# 1.5 Define the years we will analyse.
years = [2019, 2020, 2021, 2022, 2023, 2024]

# 1.6 Display the first few rows to verify the data loaded correctly.
df_loss.head()

## 2. Totals by year and changes

To understand the overall trend, we need to know how many complaints and how much money was lost across **all countries** each year.

We will:

- Sum complaints and losses for each year.
- Compute the change from one year to the next (this shows whether things are getting better or worse).
- Save the results to a CSV file for Power BI (`totals.csv`).
- Display the table in the notebook for reference.
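
As a small illustration of the year-over-year step, the sketch below uses three made-up totals (not real data). `diff()` subtracts the previous row from the current one, so the first year has no change value. The percentage column is an optional extra that the notebook itself does not compute; it is shown only because it can be easier to read in a chart.

```python
import pandas as pd

# Made-up totals for three years (not real data).
demo_totals = pd.DataFrame({
    'Year': [2019, 2020, 2021],
    'Total_Complaints': [100, 150, 120],
})

# diff() subtracts the previous row from the current row.
# The first row has nothing before it, so its change is NaN (blank in the CSV).
demo_totals['Complaints_Change'] = demo_totals['Total_Complaints'].diff()

# Optional extra: the same change expressed as a percentage of the previous year.
demo_totals['Complaints_Change_Pct'] = demo_totals['Total_Complaints'].pct_change() * 100

print(demo_totals)
```

In Power BI the blank first-year change can be left as is; it simply means there is no earlier year to compare against.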

In [None]:
# 2.1 Calculate total complaints and losses for each year across all countries
totals = []
for year in years:
    total_complaints = df_loss[f'{year}_Complaints'].sum()
    total_losses = df_loss[f'{year}_Losses'].sum()
    totals.append({
        'Year': year,
        'Total_Complaints': total_complaints,
        'Total_Losses': total_losses
    })

# 2.2 Convert the list of totals to a DataFrame
totals_df = pd.DataFrame(totals)

# 2.3 Compute year‑over‑year changes
totals_df['Complaints_Change'] = totals_df['Total_Complaints'].diff()
totals_df['Losses_Change'] = totals_df['Total_Losses'].diff()

# 2.4 Save the totals and changes to CSV for Power BI
totals_df.to_csv('Code_Output/totals.csv', index=False)

# 2.5 Display the totals table
totals_df

## 3. Top countries by complaints and losses (per year)

Power BI visualisations often focus on the top performers or worst offenders. We will identify the **top 5 countries** for each year based on both complaints and losses. Instead of plotting the results here, we build tidy tables for Power BI:

- `top5_complaints_chart_data.csv`: Each row contains the year, country, and complaints (in millions).
- `top5_losses_chart_data.csv`: Each row contains the year, country, and losses (in billions).

These files allow you to create bar charts in Power BI showing the top countries year by year.
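
The idea of a tidy "top N per year" table can be sketched on a small made-up dataset. The country labels `AA` to `DD` and all numbers are invented, and the example takes the top 2 rather than the top 5 to keep the output short. It uses `nlargest`, which in this example selects the same rows as sorting in descending order and taking the first rows.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year.
demo = pd.DataFrame({
    'Country': ['AA', 'BB', 'CC', 'DD'],
    '2019_Complaints': [4_000_000, 1_000_000, 3_000_000, 2_000_000],
    '2020_Complaints': [1_500_000, 5_000_000, 2_500_000, 500_000],
})

rows = []
for year in [2019, 2020]:
    col = f'{year}_Complaints'
    # Keep the 2 countries with the highest value in this year's column.
    top = demo.nlargest(2, col)
    for country, value in zip(top['Country'], top[col]):
        rows.append({
            'Year': year,
            'Country': country,
            'Complaints_Millions': value / 1_000_000,
        })

# One row per (year, country) pair: the shape Power BI bar charts work well with.
tidy = pd.DataFrame(rows)
print(tidy)
```

Because each row carries its own `Year`, a single Power BI bar chart can be filtered or sliced by year without reshaping the data again.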

In [None]:
# 3.1 Build the data for top 5 complaints per year
complaint_rows = []
for year in years:
    col_name = f'{year}_Complaints'
    # Sort the countries by complaints for this year and take the top 5
    top5 = df_loss.sort_values(by=col_name, ascending=False).head(5)
    for country, value in zip(top5['Country'], top5[col_name]):
        complaint_rows.append({
            'Year': year,
            'Country': country,
            'Complaints_Millions': value / 1_000_000
        })

complaint_chart_df = pd.DataFrame(complaint_rows)

# 3.2 Save the data to CSV
complaint_chart_df.to_csv('Code_Output/top5_complaints_chart_data.csv', index=False)

# 3.3 Display the first few rows for reference
complaint_chart_df.head()

In [None]:
# 3.4 Build the data for top 5 losses per year
loss_rows = []
for year in years:
    col_name = f'{year}_Losses'
    top5 = df_loss.sort_values(by=col_name, ascending=False).head(5)
    for country, value in zip(top5['Country'], top5[col_name]):
        loss_rows.append({
            'Year': year,
            'Country': country,
            'Losses_Billions': value / 1_000_000_000
        })

loss_chart_df = pd.DataFrame(loss_rows)

# 3.5 Save to CSV
loss_chart_df.to_csv('Code_Output/top5_losses_chart_data.csv', index=False)

# 3.6 Display the first few rows
loss_chart_df.head()

## 4. Correlation between complaints and losses

Do countries with more complaints also experience higher financial losses? To find out, we calculate the **Pearson correlation coefficient** between complaints and losses for each year. This statistic ranges from –1 (a perfect negative linear relationship) through 0 (no linear relationship) to +1 (a perfect positive linear relationship). We save the results to `correlation_complaints_losses.csv`.
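
For readers curious about what the coefficient measures, the sketch below computes it by hand on five made-up pairs of values and compares the result with the pandas `corr` method. The numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values for five hypothetical countries.
complaints = pd.Series([10, 20, 30, 40, 50], dtype=float)
losses = pd.Series([12, 25, 29, 48, 51], dtype=float)

# Distance of each value from its own average.
dx = complaints - complaints.mean()
dy = losses - losses.mean()

# Pearson r: how strongly the two sets of distances move together,
# scaled so the result always lies between -1 and +1.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same statistic (Pearson is its default method).
r_pandas = complaints.corr(losses)

print(round(r_manual, 4), round(r_pandas, 4))
```

Note that pandas leaves out any row where either value is missing, so countries without data for a given year do not affect that year's coefficient.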

In [None]:
# 4.1 Compute the correlation for each year
correlation_list = []
for year in years:
    complaints_col = f'{year}_Complaints'
    losses_col = f'{year}_Losses'
    corr_value = df_loss[complaints_col].corr(df_loss[losses_col])
    correlation_list.append({'Year': year, 'Correlation': corr_value})

# 4.2 Create a DataFrame of correlation values
corr_df = pd.DataFrame(correlation_list)

# 4.3 Save to CSV for Power BI
corr_df.to_csv('Code_Output/correlation_complaints_losses.csv', index=False)

# 4.4 Show the table
corr_df

## 5. Forecasting 2025 using trained linear regression

This section trains a separate linear model for each selected country on the 2019–2024 data, predicts both losses and complaints for 2025, and builds a tidy table of actual vs. predicted values. The selected countries are the union of the top 5 by total losses and the top 5 by total complaints.

What this does:

- Fits a straight-line model to each country’s 2019–2024 losses and complaints.
- Predicts losses and complaints for 2025 only (no extrapolation beyond 2025).
- Outputs `predicted_losses_linear_2025.csv` with columns `Country`, `Year`, `Predicted_Losses`, and `Predicted_Complaints`.
- Creates `actual_vs_predicted_linear_by_country.csv` with a row for every (country, year) pair, including actual 2019–2024 values and the 2025 prediction, plus a `Type` column (Actual/Predicted). This tidy file is ideal for a Power BI line chart: set `Year` on the x-axis (categorical), `Losses_Billions` and/or `Complaints_Millions` as values (Sum), `Country` as legend, and `Type` to distinguish solid vs. dotted segments.
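
The straight-line fit described above can be sketched with NumPy alone. The notebook itself uses scikit-learn's `LinearRegression`; with a single input (the year), `np.polyfit` with `deg=1` finds the same least-squares line, so it is used here only to keep the example small. The loss values belong to one hypothetical country and are made up.

```python
import numpy as np

years = np.array([2019, 2020, 2021, 2022, 2023, 2024])

# Made-up losses for one hypothetical country (not real data).
losses = np.array([100.0, 110.0, 125.0, 130.0, 150.0, 160.0])

# Fit a least-squares straight line: losses = slope * x + intercept.
# Years are shifted so 2019 becomes 0, which keeps the arithmetic well behaved.
x = years - 2019
slope, intercept = np.polyfit(x, losses, deg=1)

# Extend the line one step past the data to get the 2025 estimate.
pred_2025 = slope * (2025 - 2019) + intercept

print(round(slope, 2), round(pred_2025, 2))
```

The slope is the average yearly increase implied by the line. With only six data points per country, the 2025 value is best read as a rough trend indicator rather than a precise forecast.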

In [None]:
# --- Linear regression prediction for 2025 (losses and complaints) ---

# We're still using the same years list defined earlier: [2019, 2020, 2021, 2022, 2023, 2024]

from sklearn.linear_model import LinearRegression

# Step 1: recompute total losses/complaints across all years to find top countries.
df_loss['Total_Losses']     = df_loss[[f'{y}_Losses'     for y in years]].sum(axis=1)
df_loss['Total_Complaints'] = df_loss[[f'{y}_Complaints' for y in years]].sum(axis=1)

# Identify the union of top 5 countries by total losses and total complaints.
top5_by_losses     = df_loss.nlargest(5, 'Total_Losses')['Country'].tolist()
top5_by_complaints = df_loss.nlargest(5, 'Total_Complaints')['Country'].tolist()
selected_countries = sorted(set(top5_by_losses + top5_by_complaints))

# Step 2: train a linear regression model per country and predict 2025.
predictions = []
for country in selected_countries:
    # Prepare feature array (years) and target arrays (losses & complaints).
    X = np.array(years).reshape(-1, 1)  # shape (6,1)
    y_losses     = np.array([df_loss.loc[df_loss['Country'] == country, f'{yr}_Losses'].values[0]     for yr in years])
    y_complaints = np.array([df_loss.loc[df_loss['Country'] == country, f'{yr}_Complaints'].values[0] for yr in years])

    # Fit linear models.
    lr_loss = LinearRegression().fit(X, y_losses)
    lr_comp = LinearRegression().fit(X, y_complaints)

    # Predict 2025 values.
    pred_loss_2025 = lr_loss.predict(np.array([[2025]]))[0]
    pred_comp_2025 = lr_comp.predict(np.array([[2025]]))[0]

    predictions.append({
        'Country': country,
        'Year': 2025,
        'Predicted_Losses': pred_loss_2025,
        'Predicted_Complaints': pred_comp_2025
    })

# Convert predictions to a DataFrame and save to CSV (easy import into Power BI).
pred_df_linear = pd.DataFrame(predictions)
pred_df_linear.to_csv('Code_Output/predicted_losses_linear_2025.csv', index=False)

# Step 3: build a tidy table of actual vs. predicted values for each country/year.
rows = []
for country in selected_countries:
    # Actual data for 2019–2024.
    for yr in years:
        rows.append({
            'Country': country,
            'Year': yr,
            'Losses_Billions': df_loss.loc[df_loss['Country'] == country, f'{yr}_Losses'].values[0]     / 1e9,
            'Complaints_Millions': df_loss.loc[df_loss['Country'] == country, f'{yr}_Complaints'].values[0] / 1e6,
            'Type': 'Actual'
        })
    # Predicted 2025 values.
    pred_loss_b  = pred_df_linear.loc[pred_df_linear['Country'] == country, 'Predicted_Losses'].values[0]     / 1e9
    pred_comp_m  = pred_df_linear.loc[pred_df_linear['Country'] == country, 'Predicted_Complaints'].values[0] / 1e6
    rows.append({
        'Country': country,
        'Year': 2025,
        'Losses_Billions': pred_loss_b,
        'Complaints_Millions': pred_comp_m,
        'Type': 'Predicted'
    })

# Create the tidy DataFrame and save it for Power BI.
actual_vs_pred_linear_df = pd.DataFrame(rows)
actual_vs_pred_linear_df.to_csv('Code_Output/actual_vs_predicted_linear_by_country.csv', index=False)

# Optional: display the first few rows for verification.
actual_vs_pred_linear_df.head()


## 6. Forecasting European countries for 2025

In [None]:
# --- European-only linear regression prediction with country codes & names ---

# 1. Define dictionaries for EU country codes and for full country names.
eu_country_codes = {
    "PT": "Portugal", "AT": "Austria", "RO": "Romania", "BE": "Belgium",
    "BG": "Bulgaria", "SE": "Sweden", "SI": "Slovenia", "SK": "Slovakia",
    "CY": "Cyprus", "CZ": "Czech Republic", "DE": "Germany", "DK": "Denmark",
    "EE": "Estonia", "ES": "Spain", "FI": "Finland", "FR": "France",
    "GR": "Greece", "HR": "Croatia", "HU": "Hungary", "IE": "Ireland",
    "IT": "Italy", "LT": "Lithuania", "LU": "Luxembourg", "LV": "Latvia",
    "MT": "Malta", "NL": "Netherlands", "PL": "Poland", "EU": "Europe region"
}

country_codes = {
    "PR": "Puerto Rico", "PS": "Palestine", "PT": "Portugal", "PY": "Paraguay",
    "AE": "United Arab Emirates", "AF": "Afghanistan", "AL": "Albania", "AM": "Armenia",
    "AO": "Angola", "AR": "Argentina", "AT": "Austria", "AU": "Australia",
    "AZ": "Azerbaijan", "RO": "Romania", "BA": "Bosnia and Herzegovina", "RS": "Serbia",
    "BD": "Bangladesh", "RU": "Russia", "BE": "Belgium", "BG": "Bulgaria",
    "BH": "Bahrain", "SA": "Saudi Arabia", "BR": "Brazil", "SC": "Seychelles",
    "SE": "Sweden", "SG": "Singapore", "SI": "Slovenia", "BY": "Belarus",
    "SK": "Slovakia", "BZ": "Belize", "CA": "Canada", "SV": "El Salvador",
    "CH": "Switzerland", "SZ": "Eswatini", "CL": "Chile", "CN": "China",
    "CO": "Colombia", "CR": "Costa Rica", "TH": "Thailand", "CY": "Cyprus",
    "CZ": "Czech Republic", "TR": "Turkey", "DE": "Germany", "TW": "Taiwan",
    "TZ": "Tanzania", "DK": "Denmark", "DO": "Dominican Republic", "UA": "Ukraine",
    "UG": "Uganda", "US": "United States", "EC": "Ecuador", "EE": "Estonia",
    "EG": "Egypt", "UZ": "Uzbekistan", "ES": "Spain", "VE": "Venezuela",
    "VG": "British Virgin Islands", "VN": "Vietnam", "FI": "Finland", "FR": "France",
    "GB": "United Kingdom", "GE": "Georgia", "GH": "Ghana", "GN": "Guinea",
    "GR": "Greece", "GT": "Guatemala", "HK": "Hong Kong", "HN": "Honduras",
    "HR": "Croatia", "YE": "Yemen", "HU": "Hungary", "ID": "Indonesia",
    "IE": "Ireland", "IL": "Israel", "IN": "India", "ZA": "South Africa",
    "IQ": "Iraq", "IR": "Iran", "IS": "Iceland", "IT": "Italy",
    "ZW": "Zimbabwe", "JO": "Jordan", "JP": "Japan", "KE": "Kenya",
    "KG": "Kyrgyzstan", "KH": "Cambodia", "KN": "Saint Kitts and Nevis", "KR": "South Korea",
    "KZ": "Kazakhstan", "LB": "Lebanon", "LK": "Sri Lanka", "LT": "Lithuania",
    "LU": "Luxembourg", "LV": "Latvia", "LY": "Libya", "MD": "Moldova",
    "MM": "Myanmar", "MN": "Mongolia", "MT": "Malta", "MV": "Maldives",
    "MX": "Mexico", "MY": "Malaysia", "MZ": "Mozambique", "NG": "Nigeria",
    "NI": "Nicaragua", "NL": "Netherlands", "NO": "Norway", "NP": "Nepal",
    "NZ": "New Zealand", "OM": "Oman", "PA": "Panama", "PE": "Peru",
    "PG": "Papua New Guinea", "PH": "Philippines", "PK": "Pakistan", "PL": "Poland",
    "NaN": "NaN"  # Represents missing or undefined country
}

# 2. Identify which EU country codes appear in our dataset (df_loss). Ignore codes not present.
eu_codes_in_data = [
    code for code in eu_country_codes.keys()
    if code in df_loss['Country'].unique()
]

# 3. Prepare a European subset of df_loss.
df_loss_eu = df_loss[df_loss['Country'].isin(eu_codes_in_data)].copy()

# 3a. Add full country names and a numeric Country ID.
# This makes the data human-friendly and works well with Power BI slicers.
df_loss_eu = df_loss_eu.rename(columns={'Country': 'Country Code'})
df_loss_eu['Country'] = df_loss_eu['Country Code'].map(country_codes)

# Drop any rows where the country code was missing (rare but safe).
df_loss_eu = df_loss_eu.dropna(subset=['Country Code'])

# Assign a unique Country ID (starting at 101, increment by 1).
# Sorting by country code first ensures the IDs are assigned in a deterministic order.
df_loss_eu = df_loss_eu.sort_values('Country Code')
df_loss_eu['Country ID'] = range(101, 101 + len(df_loss_eu))

# 4. Fit linear regression models (2019–2024) and predict 2025 for losses and complaints.
eu_predictions = []
for code in eu_codes_in_data:
    # Collect this country's yearly losses and complaints (2019–2024) as training targets.
    losses = [df_loss_eu.loc[df_loss_eu['Country Code'] == code, f'{yr}_Losses'].values[0] for yr in years]
    comps  = [df_loss_eu.loc[df_loss_eu['Country Code'] == code, f'{yr}_Complaints'].values[0] for yr in years]
    X = np.array(years).reshape(-1, 1)

    # Fit separate linear models for losses and complaints.
    lr_loss = LinearRegression().fit(X, losses)
    lr_comp = LinearRegression().fit(X, comps)

    pred_loss_2025 = lr_loss.predict(np.array([[2025]]))[0]
    pred_comp_2025 = lr_comp.predict(np.array([[2025]]))[0]

    eu_predictions.append({
        'Country Code': code,
        'Country': country_codes.get(code, code),
        'Country ID': df_loss_eu.loc[df_loss_eu['Country Code'] == code, 'Country ID'].values[0],
        'Year': 2025,
        'Predicted_Losses': pred_loss_2025,
        'Predicted_Complaints': pred_comp_2025
    })

pred_df_linear_eu = pd.DataFrame(eu_predictions)
pred_df_linear_eu.to_csv('Code_Output/predicted_losses_linear_2025_europe.csv', index=False)

# 5. Build a tidy table combining actual 2019–2024 data and the 2025 predictions.
eu_rows = []
for code in eu_codes_in_data:
    country_name = country_codes.get(code, code)
    country_id   = df_loss_eu.loc[df_loss_eu['Country Code'] == code, 'Country ID'].values[0]

    # Add actual data rows
    for yr in years:
        eu_rows.append({
            'Country Code': code,
            'Country': country_name,
            'Country ID': country_id,
            'Year': yr,
            'Losses_Billions': df_loss_eu.loc[df_loss_eu['Country Code'] == code, f'{yr}_Losses'].values[0] / 1e9,
            'Complaints_Millions': df_loss_eu.loc[df_loss_eu['Country Code'] == code, f'{yr}_Complaints'].values[0] / 1e6,
            'Type': 'Actual'
        })

    # Add 2025 prediction row
    pred_loss_b = pred_df_linear_eu.loc[pred_df_linear_eu['Country Code'] == code, 'Predicted_Losses'].values[0] / 1e9
    pred_comp_m = pred_df_linear_eu.loc[pred_df_linear_eu['Country Code'] == code, 'Predicted_Complaints'].values[0] / 1e6
    eu_rows.append({
        'Country Code': code,
        'Country': country_name,
        'Country ID': country_id,
        'Year': 2025,
        'Losses_Billions': pred_loss_b,
        'Complaints_Millions': pred_comp_m,
        'Type': 'Predicted'
    })

actual_vs_pred_linear_eu_df = pd.DataFrame(eu_rows)
actual_vs_pred_linear_eu_df.to_csv(
    'Code_Output/actual_vs_predicted_linear_by_country_europe.csv',
    index=False
)

# Optional: display to verify
actual_vs_pred_linear_eu_df.head()

## 7. Build a tidy table for actual vs predicted losses

To visualise the results in Power BI, we prepare a **tidy** table where each row represents a single observation (one country, one year, and whether the value is actual or predicted).

This table includes:

- The actual losses for 2019–2024 (in billions).
- The predicted loss for 2025 (in billions).
- A column named `Type` indicating whether the value is Actual or Predicted.

We save this table to `actual_vs_predicted_losses_by_country.csv`.

In [None]:
# 7.1 Build a tidy DataFrame of actual and predicted losses
rows = []
for country in selected_countries:
    # Actual data: convert each year's loss to billions
    actual_losses = [df_loss.loc[df_loss['Country'] == country, f'{year}_Losses'].values[0] for year in years]
    actual_losses_b = [val / 1e9 for val in actual_losses]
    for year_val, val in zip(years, actual_losses_b):
        rows.append({
            'Country': country,
            'Year': year_val,
            'Losses_Billions': val,
            'Type': 'Actual'
        })
    # Predicted: use the linear regression prediction for 2025 from section 5
    predicted_b = pred_df_linear.loc[pred_df_linear['Country'] == country, 'Predicted_Losses'].values[0] / 1e9
    rows.append({
        'Country': country,
        'Year': 2025,
        'Losses_Billions': predicted_b,
        'Type': 'Predicted'
    })

# 7.2 Create a DataFrame and save to CSV
pred_chart_df = pd.DataFrame(rows)
pred_chart_df.to_csv('Code_Output/actual_vs_predicted_losses_by_country.csv', index=False)

# 7.3 Display a few rows to verify
pred_chart_df.head()

## 8. Summary and next steps

You have now:

- Loaded and cleaned the cyber crime dataset.
- Calculated annual totals and changes.
- Identified top countries for complaints and losses.
- Evaluated the correlation between complaints and losses.
- Predicted 2025 losses and complaints with per-country linear regression, for the top countries and for European countries.
- Prepared a tidy dataset for plotting actual vs predicted losses.

All tables have been saved to the `Code_Output` folder as CSV files.  \
You can now import these files into Power BI to create bar charts, line charts, and other visualisations.  \
In Power BI remember to set the appropriate data types (e.g., `Year` as a whole number) and choose **Sum** for numeric fields.
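
Before switching to Power BI, it can help to confirm that every file was written. The sketch below checks the `Code_Output` folder for the file names used in this notebook. The helper `missing_outputs` is a hypothetical name introduced here for illustration and is not part of the notebook's own code.

```python
import os

# File names written by the sections of this notebook.
expected_files = [
    'totals.csv',
    'top5_complaints_chart_data.csv',
    'top5_losses_chart_data.csv',
    'correlation_complaints_losses.csv',
    'predicted_losses_linear_2025.csv',
    'actual_vs_predicted_linear_by_country.csv',
    'predicted_losses_linear_2025_europe.csv',
    'actual_vs_predicted_linear_by_country_europe.csv',
    'actual_vs_predicted_losses_by_country.csv',
]


def missing_outputs(folder='Code_Output'):
    """Return the expected file names that are not present in the folder."""
    return [
        name for name in expected_files
        if not os.path.isfile(os.path.join(folder, name))
    ]


missing = missing_outputs()
print('All files present.' if not missing else f'Missing: {missing}')
```

If any file is reported as missing, re-running the section that creates it is usually enough.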