<a href="https://colab.research.google.com/github/ACTH-DKES/ACTH2025/blob/main/week6/week_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualization Tutorial - Cultural Heritage

---

Data visualization helps transform raw data into visuals that highlight patterns, comparisons, and relationships. A well-constructed visualization should:

- **Communicate** a key insight clearly
- **Add value** beyond just displaying data
- **Be accessible**
- **Remain readable** and interpretable

## Section 1: Let's Generate a Synthetic Dataset

To avoid handing out too many files, we generate a fake dataset of archaeological finds with variables such as site, period, material, and depth. We'll use the following libraries:

- `pandas`: to handle tabular data
- `matplotlib.pyplot`: to create basic visualizations
- `seaborn`: to create statistical visualizations

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import numpy as np

# Dataset configuration
num_entries = 120
sites = ['Pompeii', 'Herculaneum', 'Ostia Antica', 'Delphi', 'Knossos']
periods = ['Bronze Age', 'Iron Age', 'Classical', 'Hellenistic', 'Roman']
artifact_types = ['Pottery', 'Coin', 'Sculpture', 'Inscription', 'Tool']
materials = ['Clay', 'Bronze', 'Marble', 'Iron', 'Glass']
conditions = ['Excellent', 'Good', 'Fair', 'Poor']
years = list(range(1900, 2024))

# Generate synthetic entries
data = []
for i in range(num_entries):
    entry = {
        "Site": random.choice(sites),
        "Period": random.choice(periods),
        "ArtifactType": random.choice(artifact_types),
        "Material": random.choice(materials),
        "Condition": random.choices(conditions, weights=[1, 2, 3, 2])[0],
        "DiscoveryYear": random.choice(years),
        "DepthMeters": round(random.uniform(0.3, 5.0), 2)
    }
    data.append(entry)

# Convert list to pandas DataFrame
df = pd.DataFrame(data)
df.head()
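
Because `random` is not seeded in the cell above, every run produces a different dataset, so your plots will not match your neighbor's. If you want repeatable results, one option is a seeded generator. The sketch below is only an illustration: the helper name `make_finds`, its `seed` parameter, and the shortened column list are assumptions, not part of the original notebook.

```python
import random

import pandas as pd


# Hypothetical helper (not in the original notebook): builds a small
# synthetic table using a private, seeded random generator.
def make_finds(num_entries=10, seed=42):
    rng = random.Random(seed)  # same seed -> same sequence of choices
    sites = ["Pompeii", "Herculaneum", "Ostia Antica", "Delphi", "Knossos"]
    conditions = ["Excellent", "Good", "Fair", "Poor"]
    rows = []
    for _ in range(num_entries):
        rows.append({
            "Site": rng.choice(sites),
            "Condition": rng.choices(conditions, weights=[1, 2, 3, 2])[0],
            "DepthMeters": round(rng.uniform(0.3, 5.0), 2),
        })
    return pd.DataFrame(rows)


first = make_finds(seed=42)
second = make_finds(seed=42)
print(first.equals(second))  # both tables come from the same seed
```

Changing the seed gives a different, but again repeatable, dataset.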

## Heatmap: Visualizing How Artifact Types Are Distributed Across Sites

A **heatmap** is a grid where each cell is colored based on a numeric value. It’s especially good for showing:

- How **frequently** something occurs
- The **distribution** of values across two categories

In this example, we use a heatmap to explore:
- Which **artifact types** were found at which **archaeological sites**
- How often each combination appears in the data

---

- `DataFrame.pivot(index=..., columns=..., values=...)`: reshapes the dataset so that one category becomes the rows (e.g. `Site`), another becomes the columns (e.g. `ArtifactType`), and a third column supplies the cell values (e.g. `Count`). Combinations that never occur are left empty (`NaN`).
- `seaborn.heatmap()`: draws a color-coded grid showing the relationship.
- `annot=True`: prints the actual number in each cell.
- `cmap="YlGnBu"`: sets the color gradient from light (low values) to dark (high values).
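
To see what the reshaping step does before applying it to the full dataset, here is a small sketch on hand-made toy data (the five rows are invented for illustration). It also shows one possible way to deal with site/type pairs that never occur.

```python
import pandas as pd

# Toy data: no Coin was found at Delphi, so that combination is missing.
toy = pd.DataFrame({
    "Site": ["Delphi", "Delphi", "Knossos", "Knossos", "Knossos"],
    "ArtifactType": ["Pottery", "Pottery", "Pottery", "Coin", "Coin"],
})

# Count each (Site, ArtifactType) pair, then spread the counts into a grid.
counts = toy.groupby(["Site", "ArtifactType"]).size().reset_index(name="Count")
matrix = counts.pivot(index="Site", columns="ArtifactType", values="Count")
print(matrix)  # the Delphi/Coin cell is NaN because that pair never occurs

# Replace the gaps with 0 and convert back to whole numbers.
matrix_filled = matrix.fillna(0).astype(int)
print(matrix_filled)
```

Whether an empty cell should be shown as 0 or left blank is a design choice: 0 is accurate for counts, while a blank cell can make the gaps stand out more.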


In [None]:
# We already created 'df' at the beginning. We'll reuse it here.

# Step 1: Count how many times each artifact type appears at each site
heatmap_data = df.groupby(['Site', 'ArtifactType']).size().reset_index(name='Count')

# Step 2: Reshape the data into a matrix format
# Site/type pairs that never occur would be NaN, so fill them with 0
heatmap_matrix = (
    heatmap_data.pivot(index='Site', columns='ArtifactType', values='Count')
    .fillna(0)
    .astype(int)
)

# Step 3: Plot the heatmap (sns and plt were imported in the first cell)
plt.figure(figsize=(8, 5))
sns.heatmap(heatmap_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Frequency of Artifact Types by Site')
plt.show()


## Section 2: Count Plot of Artifact Types

We use `seaborn.countplot()` to display the frequency of each artifact type. Before plotting, we use `plt.figure(figsize=(8, 5))` to define the figure size in inches (width x height).

In [None]:
plt.figure(figsize=(8, 5))  # Set figure size
sns.countplot(data=df, x="ArtifactType", order=df["ArtifactType"].value_counts().index)  # Frequency plot, most common type first
plt.title("Frequency of Artifact Types")  # Title
plt.xticks(rotation=45)  # Important: rotate x-axis labels for readability
plt.tight_layout()  # Adjust padding to fit everything
plt.show()  # Render the plot below the cell
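
The `order=` argument in the cell above decides the left-to-right order of the bars. A minimal sketch of what `value_counts().index` produces, using an invented six-item series rather than the real `df`:

```python
import pandas as pd

# Toy column standing in for df["ArtifactType"].
types = pd.Series(
    ["Coin", "Pottery", "Coin", "Tool", "Coin", "Pottery"],
    name="ArtifactType",
)

counts = types.value_counts()   # sorted from most to least frequent
print(counts)

bar_order = list(counts.index)  # the category names in that same order
print(bar_order)
```

Passing this index to `order=` is what makes the bars appear from the most to the least frequent category.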

## Section 3: Depth by Site and Period (Box Plot)
### Box Plot: Understanding Distribution in One Visual

A box plot (also called a box-and-whisker plot) is a statistical chart used to visualize the distribution of a dataset, especially useful when comparing values across groups.

It summarizes a numeric variable using five key statistics:

1. **Minimum**: The smallest value (excluding outliers)
2. **First Quartile (Q1)**: 25% of the data falls below this value
3. **Median (Q2)**: The middle value (50% of the data is above and below this)
4. **Third Quartile (Q3)**: 75% of the data falls below this value
5. **Maximum**: The largest value (excluding outliers)

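Outliers (extremely high or low values) are plotted as dots outside the "whiskers." By default, seaborn treats a value as an outlier when it lies more than 1.5 times the interquartile range (Q3 minus Q1) below Q1 or above Q3.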

---

### What It Shows:

- The **box** spans from Q1 to Q3 — the interquartile range (IQR) — containing the middle 50% of the data.
- The **line inside the box** is the median.
- The **whiskers** extend to the lowest and highest non-outlier values.
- **Outliers** are shown as individual points beyond the whiskers.
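
The quantities in the box plot can be computed by hand. The sketch below uses seven invented depth values and assumes the common 1.5 × IQR rule for the whiskers (the default in seaborn and matplotlib); it is meant to make the definitions concrete, not to reproduce the plotting code.

```python
import pandas as pd

# Invented depths; 12.0 is deliberately far from the rest.
depths = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 12.0])

q1 = depths.quantile(0.25)
median = depths.quantile(0.50)
q3 = depths.quantile(0.75)
iqr = q3 - q1  # height of the box

# Values beyond these fences are drawn as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = depths[(depths >= lower_fence) & (depths <= upper_fence)]
outliers = depths[(depths < lower_fence) | (depths > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers reach {inside.min()} and {inside.max()}")
print(f"outliers: {outliers.tolist()}")
```

Note that the whiskers stop at the most extreme data points that are still inside the fences, not at the fences themselves.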

---

### When to Use It:

Box plots are great when you want to:

- Compare **distributions across multiple groups** (e.g., Price by Region)
- Identify **skewed data** (asymmetric box or whiskers)
- Spot **outliers**

---

We set `x="Site"` and `y="DepthMeters"`, and pass `hue="Period"` so that each period gets its own color. This lets us compare excavation depths across sites and periods.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x="Site", y="DepthMeters", hue="Period")
plt.title("Excavation Depth by Site and Period")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# RDF Graph Visualization (CIDOC CRM)

## Explanation of `pyvis` Graph Construction

We use the `Network` class from the `pyvis.network` module to build an interactive graph based on RDF triples.

---

### `net = Network(height="750px", width="100%", directed=True)`

This creates a new interactive graph canvas.

- `height` and `width`: set the size of the output graph in pixels or percentage.
- `directed=True`: means the graph will have **arrowed edges**, showing direction from subject to object.

---

### `net.add_node(...)`

This adds a node to the graph. A node is a circle with a label attached.

- The first argument is the node ID (it must be unique).
- `label=...`: sets the text shown next to the node.
- If the node already exists, it won’t be duplicated.

### `net.add_edge(...)`

`net.add_edge(s_label, o_label, label=p_label)`

This creates a line (edge) between two nodes.

- `s_label`: the source node (where the edge starts)
- `o_label`: the target node (where the edge ends)
- `label=p_label`: optional text displayed on the line (typically the RDF predicate)

---

Together, these methods let you turn RDF triples (subject, predicate, object) into a visual graph, where:

- **Nodes** are resources (subjects and objects)
- **Edges** are relationships (predicates)


In [19]:
# Generating a synthetic graph
#!pip install rdflib
#!pip install pyvis
from rdflib import Graph, Namespace, RDF, RDFS, Literal
from pyvis.network import Network
import random

g = Graph()
EX = Namespace("http://example.org/")
CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
g.bind("ex", EX)
g.bind("crm", CRM)

artifacts = ["Amphora", "Statue", "Coin", "Mosaic"]
sites = ["Pompeii", "Knossos"]
materials = ["Clay", "Marble", "Bronze", "Stone"]
periods = ["Roman", "Hellenistic"]

for i, name in enumerate(artifacts):
    art = EX[f"artifact{i}"]
    prod = EX[f"production{i}"]
    g.add((art, RDF.type, CRM["E22_Man-Made_Object"]))
    g.add((art, RDFS.label, Literal(name)))
    g.add((art, CRM["P108i_was_produced_by"], prod))
    g.add((prod, CRM["P7_took_place_at"], Literal(random.choice(sites))))
    g.add((prod, CRM["P4_has_time-span"], Literal(random.choice(periods))))
    g.add((art, CRM["P45_consists_of"], Literal(random.choice(materials))))

### Basic Interactive RDF Graph (one color)

In [20]:
net = Network(height="750px", width="100%", directed=True)

for s, p, o in g:
    s_label = s.split("/")[-1]
    # Literals keep their full text; URIs are shortened to their last path segment
    o_label = str(o) if isinstance(o, Literal) else o.split("/")[-1]
    p_label = p.split("/")[-1]
    net.add_node(s_label, label=s_label)
    net.add_node(o_label, label=o_label)
    net.add_edge(s_label, o_label, label=p_label)

net.write_html("basic_graph.html") # save the graph
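
Splitting only on `/` works for the `ex:` and `crm:` namespaces, but the RDF and RDFS namespaces end in `#`, so `rdf:type` is displayed as `22-rdf-syntax-ns#type`. One possible fix is a small helper that also handles `#`. The function name `local_name` is hypothetical and not part of rdflib or pyvis.

```python
def local_name(term):
    """Return the text after the last '#', or else after the last '/'.

    Text without either separator (e.g. a literal such as "Pompeii")
    is returned unchanged.
    """
    text = str(term)
    for separator in ("#", "/"):
        if separator in text:
            return text.rsplit(separator, 1)[-1]
    return text


print(local_name("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"))
print(local_name("http://www.cidoc-crm.org/cidoc-crm/P45_consists_of"))
print(local_name("Pompeii"))
```

In the loop above you could then write `p_label = local_name(p)` instead of `p.split("/")[-1]` to get shorter edge labels.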

### Class-Colored RDF Graph (Yellow Classes, Purple Instances)

In [21]:
from rdflib.namespace import RDF

net = Network(height="750px", width="100%", directed=True)
classes = set(o for s, p, o in g.triples((None, RDF.type, None)))

for s, p, o in g:
    s_label = s.split("/")[-1]
    # Literals keep their full text; URIs are shortened to their last path segment
    o_label = str(o) if isinstance(o, Literal) else o.split("/")[-1]
    p_label = p.split("/")[-1]

    net.add_node(s_label, label=s_label, color='purple')
    color = 'yellow' if o in classes else 'purple'
    net.add_node(o_label, label=o_label, color=color)
    net.add_edge(s_label, o_label, label=p_label)

net.write_html("class_colored_graph.html") # save the graph

### Exercise: Add Different Colors for Materials and Time Periods
First, use RDFLib to get the set of materials (they are the objects of a specific relationship...). The same goes for time periods. Once you have the sets, you can assign each of them a color.
<details>
<summary> Click to show solution</summary>

```python
from rdflib.namespace import RDF

net = Network(height="750px", width="100%", directed=True)
classes = set(o for s, p, o in g.triples((None, RDF.type, None)))
materials = set(o for s, p, o in g.triples((None, CRM.P45_consists_of, None)))
timespans = set(o for s, p, o in g.triples((None, CRM["P4_has_time-span"], None)))

for s, p, o in g:
    s_label = s.split("/")[-1]
    # Literals keep their full text; URIs are shortened to their last path segment
    o_label = str(o) if isinstance(o, Literal) else o.split("/")[-1]
    p_label = p.split("/")[-1]

    net.add_node(s_label, label=s_label, color='purple')
    if o in classes:
        color = "yellow"
    elif o in materials:
        color = "green"
    elif o in timespans:
        color = "pink"
    else:
        color = "purple"
    net.add_node(o_label, label=o_label, color=color)
    net.add_edge(s_label, o_label, label=p_label)

net.write_html("new_colored_graph.html")
```
</details>

In [23]:
from rdflib.namespace import RDF

net = Network(height="750px", width="100%", directed=True)
classes = set(o for s, p, o in g.triples((None, RDF.type, None)))
materials = set(o for s, p, o in g.triples((None, CRM.P45_consists_of, None)))
timespans = set(o for s, p, o in g.triples((None, CRM["P4_has_time-span"], None)))

for s, p, o in g:
    s_label = s.split("/")[-1]
    # Literals keep their full text; URIs are shortened to their last path segment
    o_label = str(o) if isinstance(o, Literal) else o.split("/")[-1]
    p_label = p.split("/")[-1]

    net.add_node(s_label, label=s_label, color='purple')
    if o in classes:
        color = "yellow"
    elif o in materials:
        color = "green"
    elif o in timespans:
        color = "pink"
    else:
        color = "purple"
    net.add_node(o_label, label=o_label, color=color)
    net.add_edge(s_label, o_label, label=p_label)

net.write_html("new_colored_graph.html")

## An "easy" way to look for visualizations


In [None]:
# !pip install ydata_profiling  # run this the first time to install the package
import ydata_profiling as cheat
import pandas as pd

example_met = pd.read_csv("met_museum_5000_sample.csv")

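# This profiles the synthetic DataFrame (df) from Section 1.
# Pass example_met instead once the MET sample has been cleaned.
cheat.ProfileReport(df)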

In [None]:
example_met.head()

In [None]:
### You need to clean and filter the data before the profile report
### can handle it correctly, otherwise it will contain mistakes :)
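
What "cleaning" means depends on the file, so the sketch below only illustrates a few common pandas steps on an invented toy table. The column names `Title`, `Medium`, and `Notes` are assumptions for illustration and are not taken from the MET sample.

```python
import pandas as pd

# Toy stand-in for a messy museum export; all values are invented.
raw = pd.DataFrame({
    "Title": ["Amphora", "Amphora", "Statue", None],
    "Medium": ["Clay", "Clay", "Marble", "Bronze"],
    "Notes": [None, None, None, None],  # completely empty column
})

cleaned = (
    raw.dropna(axis="columns", how="all")  # drop columns with no values at all
       .drop_duplicates()                  # drop repeated rows
       .dropna(subset=["Title"])           # keep only rows that have a title
       .reset_index(drop=True)
)
print(cleaned)
```

After steps like these, the cleaned table can be passed to `ProfileReport` in place of the raw one.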

## What data visualization to use?
https://www.data-to-viz.com/
## How to find the code to do visualizations in Python?
https://python-graph-gallery.com/


### Exercise:

Experiment with visualizations on the MET sample and on the initial DataFrame (`df`). Look at the data and try to make new visualizations.

Explore the various visualizations on data-to-viz. Look for the corresponding code on the Python Graph Gallery, if it is available!

Tweak the code to adapt it to your data, and filter your data so that it is suitable for a visualization!

### Extra, at home: a tutorial that is useful for the exam (we will see more of this in 2 weeks)

https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial (multilingual)