# Visualizing Data

Summarizing data visually with tools like Matplotlib helps you understand a dataset and communicate what you find. Visualizations make patterns, trends, and outliers easier to spot, and they let stakeholders grasp the key findings quickly.

Before we can visualize anything, we need to load some data. We already know how to load data from an API or a SQL server, so here is one more option: reading data from a local file. I have added a .csv file that I found online. We will first read this file and then use Matplotlib to visualize it.

In [None]:
# We always start by importing the necessary packages (you probably need to activate your .venv and install them first)
import matplotlib.pyplot as plt
import polars as pl

In [None]:
# We will now use Polars (pandas or another package works too) to read the data from the CSV file
# Note the r before the path: this makes it a raw string literal, so the backslashes in a Windows path are not treated as escape characters
df = pl.read_csv(r"C:\Users\YouriDibbet\PipInstallParty\data\Global_Cybersecurity_Threats_2015-2024.csv")
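
To see why the `r` prefix matters, here is a small sketch using only the standard library. The `C:\temp\new_data.csv` path and the relative `data` folder are hypothetical and only there for illustration. The sketch also shows `pathlib` as one possible alternative that avoids backslashes altogether; `pl.read_csv` accepts such a `Path` object as well.

```python
from pathlib import Path

# With the r prefix the backslashes stay literal, so these two spellings are equal
raw = r"C:\temp\new_data.csv"
escaped = "C:\\temp\\new_data.csv"
print(raw == escaped)

# Without the r prefix, "\t" becomes a tab and "\n" becomes a newline
broken = "C:\temp\new_data.csv"
print(repr(broken))

# Hypothetical layout: a "data" folder next to the notebook
csv_path = Path("data") / "Global_Cybersecurity_Threats_2015-2024.csv"
print(csv_path.name)
```

A relative path like this also keeps the notebook usable on another machine, as long as the file sits in the same place relative to the notebook.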

In [None]:
df.head()

In [None]:
df.describe()

##### Matplotlib provides a wide range of visualization tools. You can find more information in the [Matplotlib documentation](https://matplotlib.org/).

In [None]:
# Group by Country and sum Financial Loss
country_loss = (
    df.group_by("Country")
    .agg(pl.col("Financial Loss (in Million $)").sum().alias("Total Loss"))
    .sort("Total Loss", descending=True)
    .head(10)
)

# Plot
plt.figure(figsize=(12, 6))
plt.bar(country_loss["Country"], country_loss["Total Loss"])
plt.title("Top 10 Countries by Total Financial Loss (2015 - 2024)")
plt.ylabel("Financial Loss (in Million $)")
plt.xlabel("Country")
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(axis='y')
plt.show()
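
The cell above uses the `plt.` functions, which always act on the current figure. As a rough sketch of an alternative, the same kind of bar chart can be built on explicit `Figure` and `Axes` objects, which also makes it easy to save the result to a file. The country names, values, and file name below are made up for illustration and do not come from the dataset.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical toy values, not taken from the dataset
countries = ["A", "B", "C"]
total_loss = [120.0, 95.5, 60.25]

# fig is the whole figure, ax is the single plotting area inside it
fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.bar(countries, total_loss)
ax.set_title("Total Financial Loss per Country (toy data)")
ax.set_xlabel("Country")
ax.set_ylabel("Financial Loss (in Million $)")
ax.tick_params(axis="x", labelrotation=45)
ax.grid(axis="y")
fig.tight_layout()

# Save the chart as a PNG in a temporary folder (hypothetical file name)
out_path = Path(tempfile.mkdtemp()) / "top_countries.png"
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

Working with `ax` directly tends to pay off once a figure contains more than one plot, because every call states which plot it applies to.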

In [None]:
# Group by year and sum the number of affected users
users_per_year = (
    df.group_by("Year")
    .agg(pl.col("Number of Affected Users").sum().alias("Total Affected Users"))
    .sort("Year")
)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(users_per_year["Year"], users_per_year["Total Affected Users"], marker="o")
plt.title("Total Affected Users per Year (2015 - 2024)")
plt.xlabel("Year")
plt.ylabel("Total Affected Users")
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
# Take a random sample of 10 rows (the fixed seed makes the sample reproducible)
sample = df.sample(n=10, seed=42)

# Store the x and y values in variables so they are easier to reuse
x = sample["Number of Affected Users"].to_list()
y = sample["Financial Loss (in Million $)"].to_list()
colors = sample["Attack Type"].to_list()

# Map each attack type to an integer so it can be used as a color value
# (sorted so the mapping is the same on every run)
unique_types = sorted(set(colors))
color_map = {atype: i for i, atype in enumerate(unique_types)}
color_values = [color_map[atype] for atype in colors]

# Plot
plt.figure(figsize=(12, 6))
scatter = plt.scatter(x, y, c=color_values, cmap="tab10", alpha=0.7)
plt.title("Financial Loss vs. Number of Affected Users")
plt.xlabel("Number of Affected Users")
plt.ylabel("Financial Loss (in Million $)")
plt.grid(True)

# Legend: reuse the scatter's own colormap and normalization so the legend colors match the points
handles = [plt.Line2D([0], [0], marker='o', color='w',
                      label=atype, markersize=8, markerfacecolor=scatter.cmap(scatter.norm(i)))
           for atype, i in color_map.items()]
plt.legend(handles=handles, title="Attack type", bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()
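
Building the legend by hand works, but it is easy to get the colors out of sync with the points. One possible alternative, sketched below on made-up rows, is to call `scatter` once per category with a `label`, so Matplotlib assigns the colors and builds the legend itself. The attack type names and numbers are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical toy rows: (affected users, financial loss, attack type)
rows = [
    (100, 1.0, "Phishing"),
    (200, 2.5, "Malware"),
    (150, 1.7, "Phishing"),
    (300, 4.0, "DDoS"),
]

fig, ax = plt.subplots(figsize=(12, 6))

# One scatter call per attack type: each call gets the next color in the default cycle
for attack_type in sorted({row[2] for row in rows}):
    xs = [row[0] for row in rows if row[2] == attack_type]
    ys = [row[1] for row in rows if row[2] == attack_type]
    ax.scatter(xs, ys, label=attack_type, alpha=0.7)

ax.set_xlabel("Number of Affected Users")
ax.set_ylabel("Financial Loss (in Million $)")
legend = ax.legend(title="Attack type", bbox_to_anchor=(1.05, 1), loc="upper left")
legend_labels = [text.get_text() for text in legend.get_texts()]
fig.tight_layout()
plt.close(fig)
```

The trade-off is one `scatter` call per category instead of a single call, which is fine for a handful of categories.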


In [None]:
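# Collect the individual losses per attack type
# (agg without a reduction such as sum() yields one list of values per group)
grouped = (
    df.group_by("Attack Type")
    .agg(pl.col("Financial Loss (in Million $)").alias("Losses"))
    .sort("Attack Type")
)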

# Define variables
labels = grouped["Attack Type"].to_list()
loss_lists = grouped["Losses"].to_list()

# Plot (tick_labels requires Matplotlib 3.9 or newer; older versions use labels= instead)
plt.figure(figsize=(12, 6))
plt.boxplot(loss_lists, tick_labels=labels, patch_artist=True)
plt.title("Financial Loss per Attack Type (Boxplot)")
plt.ylabel("Financial Loss (in Million $)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(axis='y')
plt.show()
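
To read a boxplot it helps to know what the box and whiskers stand for. Assuming Matplotlib's default settings (`whis=1.5`), the box runs from the first quartile (Q1) to the third quartile (Q3), the line inside it is the median, the whiskers extend to the most extreme data points within 1.5 times the interquartile range (IQR) of the box, and anything beyond that is drawn as a separate point. A minimal sketch of that calculation with NumPy, on made-up numbers:

```python
import numpy as np

# Hypothetical toy losses, not taken from the dataset
losses = np.array([1, 2, 3, 4, 5, 100], dtype=float)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(losses, [25, 50, 75])
iqr = q3 - q1

# Whiskers may not extend past these limits
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the limits are drawn as individual markers
outliers = losses[(losses < lower_fence) | (losses > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"limits=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```

In this toy example the value 100 lies far above the upper limit of 8.5, so it would show up as a single point above the box.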

These are just some basic visualizations, but I hope they give you an idea of how useful this can be for analyzing large amounts of data. In this example we used a .csv file, but you are free to use a SQL database, API data, Excel, or any other readable data source.