In [1]:
# Import pandas for data manipulation
import pandas as pd

# Optional: Import for pretty display
from IPython.display import display

print("✓ Libraries imported successfully!")


✓ Libraries imported successfully!


In [2]:
# Load the CSV file
# pd.read_csv() reads CSV into a DataFrame (a table)
df = pd.read_csv('../data/artemis_dataset_release_v0.csv')

# Show basic info
print("✓ Dataset loaded successfully!")
print(f"  Total rows (annotations): {len(df):,}")
print(f"  Columns: {list(df.columns)}")


✓ Dataset loaded successfully!
  Total rows (annotations): 454,684
  Columns: ['art_style', 'painting', 'emotion', 'utterance', 'repetition']


In [3]:
# Display first 5 rows
# .head() shows the first few rows of a DataFrame
df.head()


| | art_style | painting | emotion | utterance | repetition |
|---|---|---|---|---|---|
| 0 | Post_Impressionism | vincent-van-gogh_portrait-of-madame-ginoux-l-a... | something else | She seems very happy in the picture, and you w... | 10 |
| 1 | Post_Impressionism | vincent-van-gogh_portrait-of-madame-ginoux-l-a... | sadness | This woman has really knotty hands which makes... | 10 |
| 2 | Post_Impressionism | vincent-van-gogh_portrait-of-madame-ginoux-l-a... | something else | When looking at this woman, I am filled with c... | 10 |
| 3 | Post_Impressionism | vincent-van-gogh_portrait-of-madame-ginoux-l-a... | contentment | A woman looking at ease, peaceful, and satisfi... | 10 |
| 4 | Post_Impressionism | vincent-van-gogh_portrait-of-madame-ginoux-l-a... | awe | She looks like a lady from that past that migh... | 10 |


## Step 4: Basic Dataset Statistics


In [4]:
# Calculate basic statistics
total_annotations = len(df)
unique_paintings = df['painting'].nunique()  # .nunique() counts unique values
unique_styles = df['art_style'].nunique()
unique_emotions = df['emotion'].nunique()

print("="*60)
print("DATASET STATISTICS")
print("="*60)
print(f"Total annotations:     {total_annotations:,}")
print(f"Unique paintings:      {unique_paintings:,}")
print(f"Unique art styles:     {unique_styles}")
print(f"Unique emotions:       {unique_emotions}")
print(f"Avg captions/painting: {total_annotations/unique_paintings:.2f}")
print("="*60)


DATASET STATISTICS
Total annotations:     454,684
Unique paintings:      80,031
Unique art styles:     27
Unique emotions:       9
Avg captions/painting: 5.68


### 💡 What This Tells Us

- **454k+ annotations** but only **80k unique paintings**
- Each painting has **~5-6 different captions** from different people
- This is good! Multiple perspectives on the same artwork
- For training, we'll need to decide: use all captions or one per image?
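
One way to explore the all-captions versus one-per-image choice is sketched below. It assumes a DataFrame with the same `painting` and `utterance` columns as above; the toy rows and the names `toy`, `per_painting`, `all_pairs`, and `one_per_painting` are illustrative only and not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical rows, same column names)
toy = pd.DataFrame({
    "painting": ["a", "a", "a", "b", "b", "c"],
    "utterance": ["calm sea", "blue waves", "a quiet shore",
                  "dark forest", "tall trees", "a red square"],
})

# How many captions does each painting have?
per_painting = toy.groupby("painting").size()
print(per_painting.describe())

# Option 1: keep every caption (one training pair per row)
all_pairs = toy

# Option 2: keep one randomly chosen caption per painting
# (shuffle the rows, then keep the first row seen for each painting)
one_per_painting = (
    toy.sample(frac=1, random_state=42)
       .drop_duplicates(subset="painting")
       .sort_values("painting")
       .reset_index(drop=True)
)
print(one_per_painting)
```

Option 1 gives the model more text to learn from, while Option 2 keeps every painting equally weighted. Which is better likely depends on the training setup, so treat this as a starting point rather than a recommendation.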


## Step 5: Emotion Distribution

What emotions do people feel when looking at art?


In [5]:
# Count how many times each emotion appears
# .value_counts() counts occurrences of each unique value
emotion_counts = df['emotion'].value_counts()

print("EMOTION DISTRIBUTION:")
print("-"*60)
for emotion, count in emotion_counts.items():
    percentage = (count / total_annotations) * 100
    # Create a simple bar chart with characters
    bar = "█" * int(percentage)
    print(f"{emotion:20s}: {count:7,} ({percentage:5.2f}%) {bar}")


EMOTION DISTRIBUTION:
------------------------------------------------------------
contentment         : 126,134 (27.74%) ███████████████████████████
awe                 :  72,927 (16.04%) ████████████████
something else      :  52,962 (11.65%) ███████████
sadness             :  49,061 (10.79%) ██████████
amusement           :  45,336 ( 9.97%) █████████
fear                :  41,577 ( 9.14%) █████████
excitement          :  37,636 ( 8.28%) ████████
disgust             :  22,411 ( 4.93%) ████
anger               :   6,640 ( 1.46%) █
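
The distribution above is skewed: contentment appears about 19 times as often as anger (126,134 vs. 6,640). A minimal sketch of measuring that skew is shown here, using a toy Series in place of `df['emotion']`; the toy labels and the name `imbalance_ratio` are assumptions made for illustration.

```python
import pandas as pd

# Toy labels (hypothetical); in the notebook this would be df['emotion']
emotions = pd.Series(
    ["contentment"] * 6 + ["awe"] * 3 + ["anger"] * 1, name="emotion"
)

# normalize=True returns fractions instead of raw counts
shares = emotions.value_counts(normalize=True)
print((shares * 100).round(2))

# Ratio between the most and least frequent class
counts = emotions.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(f"Most/least frequent ratio: {imbalance_ratio:.1f}")
```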


## Step 6: Caption Length Analysis

**Why this matters for our model:**
- It tells us roughly how many words the model should generate per caption
- Batched LSTM training needs a fixed maximum sequence length
- It guides our padding/truncation strategy


In [6]:
# Calculate length of each caption (number of words)
# .str.split() splits text into words: "Hello world" → ["Hello", "world"]
# .str.len() counts the words: ["Hello", "world"] → 2
df['caption_length'] = df['utterance'].str.split().str.len()

# Get statistics
min_length = df['caption_length'].min()
max_length = df['caption_length'].max()
avg_length = df['caption_length'].mean()
median_length = df['caption_length'].median()

print("CAPTION LENGTH STATISTICS:")
print("-"*60)
print(f"Shortest caption:  {min_length} words")
print(f"Longest caption:   {max_length} words")
print(f"Average length:    {avg_length:.2f} words")
print(f"Median length:     {median_length:.0f} words")
print("-"*60)

# Show distribution
print("\nLength Distribution (word count ranges):")
# The last bin is open-ended so captions longer than 200 words
# (the longest has 202) are counted instead of being dropped as NaN
length_ranges = pd.cut(df['caption_length'], bins=[0, 10, 20, 30, 40, 50, 100, float('inf')],
                        labels=['1-10', '11-20', '21-30', '31-40', '41-50', '51-100', '101+'])
print(length_ranges.value_counts().sort_index())


CAPTION LENGTH STATISTICS:
------------------------------------------------------------
Shortest caption:  1 words
Longest caption:   202 words
Average length:    15.69 words
Median length:     14 words
------------------------------------------------------------

Length Distribution (word count ranges):
caption_length
1-10      103760
11-20     268259
21-30      61960
31-40      13989
41-50       3962
51-100      2644
101+         110
Name: count, dtype: int64


### 📏 Design Decision Time!

Based on the caption lengths:
- Most captions are short: about 82% have 20 words or fewer (median 14, mean 15.69)
- We'll need to set a **max_length** for our LSTM (probably 50-60 words; 50 already covers more than 99% of captions)
- Shorter captions will be **padded** with `<pad>` tokens
- Longer captions might need to be **truncated** to `max_length`

**Remember the max_length you choose - you'll use it when building the LSTM!**
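
The padding and truncation idea can be sketched in plain Python as below. The helper names `pad_or_truncate` and `coverage`, the toy captions, and the tiny `max_length` of 5 are all assumptions chosen to keep the example readable; they are not the values or functions used later in this project.

```python
PAD = "<pad>"  # padding token named in the text above


def pad_or_truncate(tokens, max_length, pad_token=PAD):
    """Return a list of exactly max_length tokens."""
    if len(tokens) >= max_length:
        return tokens[:max_length]  # cut long captions
    # fill short captions with pad tokens on the right
    return tokens + [pad_token] * (max_length - len(tokens))


def coverage(lengths, max_length):
    """Fraction of captions that fit in max_length without truncation."""
    return sum(n <= max_length for n in lengths) / len(lengths)


# Toy captions (hypothetical), split on whitespace like .str.split()
captions = [
    "the angel will fly around in the starry sky",  # 9 words
    "a red square",                                 # 3 words
    "tall trees by the river",                      # 5 words
]
tokenized = [c.split() for c in captions]
lengths = [len(t) for t in tokenized]

max_length = 5  # illustrative only
batch = [pad_or_truncate(t, max_length) for t in tokenized]

for row in batch:
    print(row)
print(f"Captions kept whole: {coverage(lengths, max_length):.0%}")
```

Running `coverage` over the real `caption_length` column for a few candidate values is one way to see how much text each `max_length` would cut off.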


## Step 7: Example - Same Painting, Different Perspectives

Let's look at how different people describe the SAME artwork!


In [7]:
# Pick the first painting in the dataset
first_painting = df['painting'].iloc[0]
art_style = df['art_style'].iloc[0]

print("="*80)
print(f"PAINTING: {first_painting}")
print(f"STYLE: {art_style}")
print("="*80)
print()

# Get all captions for this painting
same_painting = df[df['painting'] == first_painting]

print(f"This painting has {len(same_painting)} different descriptions:\n")

# Show each person's description
for i, (idx, row) in enumerate(same_painting.iterrows(), 1):
    print(f"Person {i} felt '{row['emotion'].upper()}':")
    print(f'  "{row["utterance"]}"')
    print()


PAINTING: vincent-van-gogh_portrait-of-madame-ginoux-l-arlesienne-1890
STYLE: Post_Impressionism

This painting has 10 different descriptions:

Person 1 felt 'SOMETHING ELSE':
  "She seems very happy in the picture, and you want to know what what is behind the smile."

Person 2 felt 'SADNESS':
  "This woman has really knotty hands which makes her look like she has arthritis."

Person 3 felt 'SOMETHING ELSE':
  "When looking at this woman, I am filled with curiosity about what she is thinking about with her elbow on the table and a very emotionless face."

Person 4 felt 'CONTENTMENT':
  "A woman looking at ease, peaceful, and satisfied amongst her books makes me feel content."

Person 5 felt 'AWE':
  "She looks like a lady from that past that might have been a teacher (books).  She looks tired and I wondered how hard it must have been for them back then."

Person 6 felt 'DISGUST':
  "The details of the woman's face is off-putting and mildly disturbing."

Person 7 felt 'CONTENTMENT':
  "

### 🤔 Notice the Diversity?

- Same image → Different interpretations
- Different emotions felt by different people
- Some focus on visual details, others on feelings
- This is why image captioning is interesting but challenging!

**Our model will learn from all these perspectives!**
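
To see how much annotators disagree across the whole dataset rather than for one painting, a per-painting summary can help. This is a minimal sketch on toy data, assuming the same `painting` and `emotion` column names; the rows and the `summary` table are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical rows, same column names)
toy = pd.DataFrame({
    "painting": ["a", "a", "a", "b", "b", "c"],
    "emotion": ["awe", "sadness", "awe", "fear", "fear", "contentment"],
})

grouped = toy.groupby("painting")["emotion"]
summary = pd.DataFrame({
    "n_captions": grouped.size(),      # rows per painting
    "n_emotions": grouped.nunique(),   # distinct emotion labels per painting
    # most frequent label per painting (first one wins on ties)
    "top_emotion": grouped.agg(lambda s: s.value_counts().idxmax()),
})
print(summary)
```

Paintings with a high `n_emotions` are the ones where annotators disagree most.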


## Step 8: Art Style Distribution

Which art movements are most represented?


In [8]:
# Count art styles
style_counts = df['art_style'].value_counts()

print("TOP 15 ART STYLES:")
print("-"*80)
for style, count in style_counts.head(15).items():
    percentage = (count / total_annotations) * 100
    bar = "█" * int(percentage / 2)  # Scaled down for display
    print(f"{style:40s}: {count:6,} ({percentage:5.2f}%) {bar}")


TOP 15 ART STYLES:
--------------------------------------------------------------------------------
Impressionism                           : 72,361 (15.91%) ███████
Realism                                 : 59,681 (13.13%) ██████
Romanticism                             : 39,069 ( 8.59%) ████
Expressionism                           : 38,717 ( 8.52%) ████
Post_Impressionism                      : 36,374 ( 8.00%) ███
Art_Nouveau_Modern                      : 24,711 ( 5.43%) ██
Symbolism                               : 24,103 ( 5.30%) ██
Baroque                                 : 23,469 ( 5.16%) ██
Abstract_Expressionism                  : 16,075 ( 3.54%) █
Northern_Renaissance                    : 14,160 ( 3.11%) █
Naive_Art_Primitivism                   : 14,086 ( 3.10%) █
Rococo                                  : 11,904 ( 2.62%) █
Cubism                                  : 11,462 ( 2.52%) █
Color_Field_Painting        

In [9]:
# Sample 10 random captions
sample = df.sample(10, random_state=42)

print("RANDOM SAMPLE OF CAPTIONS:")
print("="*80)

for i, (idx, row) in enumerate(sample.iterrows(), 1):
    print(f"\n{i}. [{row['emotion'].upper()}] ({row['art_style']})")
    print(f"   Painting: {row['painting'][:50]}...")  # Truncate long names
    print(f'   Caption: "{row["utterance"]}"')
    print(f"   Length: {row['caption_length']} words")


RANDOM SAMPLE OF CAPTIONS:

1. [EXCITEMENT] (Symbolism)
   Painting: william-blake_night-startled-by-the-lark-1820...
   Caption: "the angel will fly around in the starry sky"
   Length: 9 words

2. [SOMETHING ELSE] (Minimalism)
   Painting: robert-mangold_untitled-from-skowhegan-suite-1992...
   Caption: "This image makes me feel interested because the orange board does not seem to go with the black string."
   Length: 19 words

3. [SADNESS] (Impressionism)
   Painting: konstantin-korovin_in-a-room-1886...
   Caption: "The man in the bed looks as if he could be potentially ill with the way his face seems bleak and the way he is leaning."
   Length: 26 words

4. [CONTENTMENT] (Realism)
   Painting: vasily-surikov_whacky-seated-on-the-ground-study-t...
   Caption: "The person sitting has his hand up, looks like a monk posture and is reflective."
   Length: 15 words

5. [FEAR] (Realism)
   Painting: viktor-vasnetsov_edge-of-the-spruce-forest-1881...
   Caption: "The trees look so close t