# Analyze the created dataset
This notebook analyzes each feature of the dataset to identify which types of paintings are under-represented.

### 0. Import libraries and load data

In [1]:
import polars as pl
import plotly.express as px

COLORS = ["#cd968e", "#acb0e0", "#aecbdc", "#bcd5c3", "#bfbfbf"]

In [2]:
data = pl.read_json("../../data/intermediate/met_paintings/met_paintings_enhanced_data.json")
data

| id (i64) | title (str) | artist (str) | year (i64) | type (str) | style (str) | description (str) |
|---|---|---|---|---|---|---|
| 0 | "A Ship in a Stormy Sea" | "Ivan Konstantinovich Aivazovsk… | 1900 | "landscape" | null | "Aivazovsky was a celebrated pa… |
| 1 | "Saint Giles with Christ Triump… | "Miguel Alcañiz (or Miquel Alca… | 1413 | null | null | "These panels, from an altarpie… |
| 2 | "Flora and Zephyr" | "Jacopo Amigoni" | 1739 | "mythological" | null | "The composition celebrates the… |
| 4 | "Jérôme Bonaparte (1784–1860), … | "Giacomo Andreoli" | 1813 | null | null | "The following miniature is cle… |
| 5 | "Saint Alexander" | "Fra Angelico (Guido di Pietro)" | 1430 | null | null | "This early work by Fra Angelic… |
| … | … | … | … | … | … | … |
| 2137 | "Picquigny" | "Frits Thaulow" | 1899 | null | null | "Thaulow earned great success w… |
| 2138 | "Bust-Length Study of a Man" | "François-Auguste Biard" | 1848 | null | null | "Despite the nuanced depiction … |
| 2139 | "A Man Seated and Asleep" | "Giuseppe Abbati" | 1870 | null | null | "This picture’s lack of pretens… |
| 2140 | "Rachel Ruysch (1664–1750)" | "Michiel van Musscher\|Rachel Ru… | 1692 | null | null | "Over a career that spanned mor… |


### 1. Artists

In [3]:
artist_frequency = data["artist"].value_counts().sort("count").rename({"count": "frequency"})
print(f"Number of artists: {data['artist'].n_unique()}")

fig = px.histogram(
    artist_frequency, x="frequency", title="Artist Frequency", color_discrete_sequence=COLORS[2:3]
)
fig.show()

Number of artists: 915


### 2. Year of creation

In [4]:
paintings_per_century = (
    # (year - 1) // 100 + 1 keeps years ending in 00 in the century they close (1900 -> 19th)
    data.with_columns(((pl.col("year") - 1) // 100 + 1).alias("century"))
    .group_by("century")
    .len()
    .sort("century")
    .with_columns(pl.col("len") / data.shape[0] * 100)
    .rename({"len": "percentage"})
)
print(f"Covered period: {data['year'].min()} - {data['year'].max()}")

fig = px.bar(
    paintings_per_century,
    x="century",
    y="percentage",
    title="Distribution of Paintings Across Centuries",
    color_discrete_sequence=COLORS[3:4],
)
fig.show()

Covered period: 1239 - 1931
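
Century labels depend on how years ending in 00 are binned. The following is a minimal sketch in plain Python (the function names are illustrative and not part of the notebook) that compares two common formulas on years from the covered period. Under the ordinal convention, the 19th century runs from 1801 to 1900.

```python
def century_floor(year: int) -> int:
    """Formula `year // 100 + 1`: treats 1900 as the first year of the 20th century."""
    return year // 100 + 1


def century_ordinal(year: int) -> int:
    """Formula `(year - 1) // 100 + 1`: treats 1900 as the last year of the 19th century."""
    return (year - 1) // 100 + 1


# Compare both formulas on the period bounds and on two years ending in 00.
for year in (1239, 1800, 1900, 1931):
    print(year, century_floor(year), century_ordinal(year))
```

The two formulas only disagree on years divisible by 100, so the choice matters mostly for works dated exactly at a century boundary.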


### 3. Type and style

In [5]:
print(data["type"].value_counts().sort("count").to_numpy())

[['literary' 1]
 ['battle' 1]
 ['vanitas' 1]
 ['wildlife' 1]
 ['capriccio' 2]
 ['pastorale' 2]
 ['marina' 3]
 ['animal' 4]
 ['interior' 5]
 ['allegorical' 7]
 ['veduta' 7]
 ['cityscape' 7]
 ['history' 7]
 ['self-portrait' 8]
 ['sketch and study' 8]
 ['nude' 9]
 ['flower' 14]
 ['mythological' 32]
 ['still life' 33]
 ['landscape' 98]
 ['genre' 110]
 ['religious' 143]
 ['portrait' 153]
 [None 1397]]


In [6]:
print(data["style"].value_counts().sort("count").to_numpy())

[['art nouveau (modern)' 1]
 ['naïve art (primitivism)' 1]
 ['international gothic' 1]
 ['neo-rococo' 2]
 ['academicism' 2]
 ['proto renaissance' 2]
 ['pointillism' 3]
 ['tenebrism' 3]
 ['neoclassicism' 4]
 ['cloisonnism' 4]
 ['classicism' 5]
 ['symbolism' 9]
 ['high renaissance' 14]
 ['post-impressionism' 17]
 ['mannerism (late renaissance)' 17]
 ['early renaissance' 18]
 ['rococo' 24]
 ['romanticism' 29]
 ['northern renaissance' 32]
 ['realism' 62]
 ['impressionism' 68]
 ['baroque' 69]
 [None 1666]]
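
Most rows carry no label in either column. The following is a minimal sketch in plain Python (the helper `missing_share` is illustrative and not part of the notebook) that turns `(label, count)` pairs like the ones printed above into a percentage of unlabeled rows. The totals 656 and 387 are the sums of the non-`None` counts in the two outputs.

```python
def missing_share(value_counts):
    """Return the percentage of rows whose label is None, given (label, count) pairs."""
    total = sum(count for _, count in value_counts)
    missing = sum(count for label, count in value_counts if label is None)
    return missing / total * 100


# Non-None counts from the outputs above, collapsed into one "labeled" entry each.
type_counts = [("labeled", 656), (None, 1397)]
style_counts = [("labeled", 387), (None, 1666)]

print(f"type missing:  {missing_share(type_counts):.1f}%")
print(f"style missing: {missing_share(style_counts):.1f}%")
```

With these counts, roughly 68% of the paintings have no `type` and roughly 81% have no `style`, so the per-label counts describe only a minority of the dataset.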


### 4. Description length

In [7]:
data_description_word_count = data.with_columns(
    # Approximate the word count by splitting on single spaces with native expressions
    pl.col("description").str.split(" ").list.len().alias("description word count")
)

fig = px.box(
    x=data_description_word_count["description word count"],
    title="Description Word Count",
    color_discrete_sequence=COLORS[0:1],
)
fig.update_xaxes(title_text="number of words")
fig.show()
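
Splitting on a single space character only approximates a word count: a newline stays attached to its neighboring words, and each doubled space produces an empty token. The following is a minimal sketch in plain Python (the helper names are illustrative and not part of the notebook) that compares it with the whitespace-aware `str.split()`.

```python
def count_by_single_space(text: str) -> int:
    """Count tokens separated by exactly one space character."""
    return len(text.split(" "))


def count_by_whitespace(text: str) -> int:
    """Count tokens separated by any run of whitespace, including newlines."""
    return len(text.split())


samples = ["one two three", "one two\nthree", "one  two"]
for text in samples:
    print(repr(text), count_by_single_space(text), count_by_whitespace(text))
```

The two counts agree on cleanly spaced text and differ on multi-paragraph descriptions or text with repeated spaces.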

In [8]:
description_lengths = ["shortest:\n", "medium:\n", "long:\n"]

# Sort once, then sample descriptions at three ranks of the ascending word count
sorted_descriptions = data_description_word_count.sort("description word count")["description"]

for label, description_index in zip(description_lengths, [0, 100, 1000]):
    print(f"{label}{sorted_descriptions[description_index]}")

shortest:
A replica of this panel is in the Museum of Fine Arts, Houston.

medium:
This is one of three views of Gardanne, a hill town near Aix-en-Provence where Cézanne worked from the summer of 1885 through the spring of 1886. The steeple of the local church crowns the cluster of red-roofed buildings which animate the sloping terrain. Faceted and geometric, the structures anticipate early-twentieth-century Cubism.

long:
To the left in this view of London and the Thames River is the still unfinished Westminster Bridge. Behind it from left to right are Saint Johns, Smith Square, Westminster Hall, Westminster Abbey, and the tower of Saint Margaret's Church. Work on Westminster Bridge began in 1738 and was completed in 1750. It is shown here in approximately the state it would have reached by 1742.
Samuel Scott's early seascapes are indebted in style and subject matter to the Dutch painter Willem van de Velde the Younger (1633–1707). He turned to topographical views along the Thames in 