# CuratorAI: A Personalized Art & Museum Discovery Engine
***Building an AI-Powered Museum Guide***

# Table of Contents
1. [Overview 🎯](#overview)
2. [Data Import 📥](#data-import)
3. [Pre-Processing 🧹](#pre-processing)
4. [Exploratory Data Analysis 🔍](#exploratory-data-analysis)
5. [Feature Engineering ⚙️](#feature-engineering)
6. [Model Building & Training 🤖](#model-building--training)
7. [Model Evaluation & Validation 📊](#model-evaluation--validation)
8. [Model Deployment 💾](#model-deployment)
9. [Conclusion & Next Steps 📝](#conclusion--next-steps)


## Overview

**The Art Discovery Problem**

> **"I know what I like, but I don't know how to find more of it."**

This simple frustration echoes through the grand halls of museums worldwide. Visitors stand before masterpieces feeling overwhelmed by choice, unsure where to turn next in collections spanning 5,000 years of human creativity. The Metropolitan Museum of Art alone houses over 470,000 artworks: at 30 seconds per piece, seeing them all would take 163 consecutive days of non-stop viewing.


**The Business Challenge**

Museums face a critical engagement problem:
- <u>Visitor Overwhelm</u>: Too many choices lead to decision paralysis
- <u>Personalization Gap</u>: One-size-fits-all audio guides and maps
- <u>Discovery Barriers</u>: Visitors struggle to find artworks that resonate with their personal tastes
- <u>Digital Engagement</u>: Online collections remain underutilized without intelligent navigation


**Solution: CuratorAI**

CuratorAI is an intelligent recommendation system that turns art discovery from an overwhelming experience into a personalized one. It analyzes the textual metadata of artworks (titles, artists, cultural contexts, and historical periods) to build bridges between what you love and what you'll love next.

Think of it as: 
> **"The Netflix for art, the Spotify for sculptures, the personal curator in your pocket."**


**Technical Approach**
- <u>Content-Based Filtering</u>: Using TF-IDF and Nearest Neighbors to find similar artworks
- <u>Feature Engineering</u>: Transforming textual metadata into meaningful numerical representations
- <u>End-to-End Pipeline</u>: From raw data to deployed web application
- <u>Real-World Constraints</u>: Working within The Met's Open Access data policies
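
As a rough illustration of the content-based approach listed above, the sketch below vectorizes a handful of made-up artwork descriptions with TF-IDF and retrieves the closest matches with cosine-distance nearest neighbors. The toy records, the `recommend` helper, and the choice of `metric='cosine'` are assumptions for illustration, not the pipeline's final configuration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy records (hypothetical), standing in for the Met metadata
toy = pd.DataFrame({
    'Title': ['Gold coin', 'Silver coin',
              'Oil portrait of a lady', 'Oil landscape with river'],
    'Medium': ['Gold', 'Silver', 'Oil on canvas', 'Oil on canvas'],
})

# Combine the text fields into one document per artwork
toy['text'] = toy['Title'] + ' ' + toy['Medium']

# Turn each document into a TF-IDF vector
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(toy['text'])

# Brute-force cosine search works directly on the sparse TF-IDF matrix
nn_model = NearestNeighbors(metric='cosine', algorithm='brute')
nn_model.fit(tfidf)

def recommend(index, k=2):
    """Return the titles of the k artworks most similar to the one at `index`."""
    # Ask for k + 1 neighbors because the query artwork matches itself
    _, neighbors = nn_model.kneighbors(tfidf[index:index + 1], n_neighbors=k + 1)
    return [toy['Title'].iloc[i] for i in neighbors[0] if i != index][:k]

print(recommend(0, k=1))
```

Cosine distance is a common pairing with TF-IDF because it compares the direction of the term vectors rather than their length, so long and short descriptions are treated comparably.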


**Value Proposition**

- For **art lovers**: Discover hidden gems and personal connections across centuries
- For **museums**: Increase engagement, visit duration, and digital interaction
- For **educators**: Create thematic learning journeys through visual culture

**Let's begin our journey through data, art, and machine learning, transforming how the world discovers beauty.**

## Data Import

In [1]:
# Data & ML
import pandas as pd
import numpy as np
import scipy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import joblib

# API & Images
import requests
import json
from PIL import Image
from io import BytesIO
import time

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
import os
import warnings
warnings.filterwarnings('ignore')

In [9]:
# Loading dataset
df = pd.read_csv('/Users/Rosella/Library/Mobile Documents/com~apple~CloudDocs/Personal Projects/Production-ML-Portafolio/CuratorAI/MetObjects.csv')

# General info
df.info()

# Dataset dimensions
print(f'\nRows and columns: {df.shape}')

# First 10 rows
print("\nFirst 10 rows with all columns:")
print(df.head(10).to_string())

# Systematic assessment of missing values (percentage per column, highest first)
missing_analysis = df.isnull().sum().sort_values(ascending=False)
missing_percentage = (missing_analysis / len(df) * 100)

print("\nColumns missing values:")
print(missing_percentage)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 484956 entries, 0 to 484955
Data columns (total 54 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Object Number            484956 non-null  object 
 1   Is Highlight             484956 non-null  bool   
 2   Is Timeline Work         484956 non-null  bool   
 3   Is Public Domain         484956 non-null  bool   
 4   Object ID                484956 non-null  int64  
 5   Gallery Number           49541 non-null   object 
 6   Department               484956 non-null  object 
 7   AccessionYear            481094 non-null  object 
 8   Object Name              482690 non-null  object 
 9   Title                    456153 non-null  object 
 10  Culture                  208190 non-null  object 
 11  Period                   91143 non-null   object 
 12  Dynasty                  23201 non-null   object 
 13  Reign                    11236 non-null   object 
 14  Port

The initial exploration reveals several data quality challenges that need to be addressed systematically:

**Data Quality Issues Identified:**
- **Missing Values**: Critical fields such as `Artist Display Name` (59% missing), `Object Date` (97.2% missing), and `Culture` (57% missing; only 208,190 of 484,956 rows are non-null) require strategic handling
- **Data Type Mismatches**: Numeric fields stored as objects, date information inconsistently formatted
- **Column Relevance**: Only ~15 of 54 columns directly support the recommendation objective
- **Data Integrity**: Need to verify no duplicate `Object ID` entries and validate image URL availability
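
The integrity and type checks named above could be sketched as follows on a small stand-in frame. The `sample` data and the helper name `quality_report` are hypothetical; only the column names come from the dataset, and image URL validation is left out of this sketch.

```python
import pandas as pd

# Hypothetical stand-in rows; the real checks would run on `df`
sample = pd.DataFrame({
    'Object ID': [1, 2, 2, 4],
    'AccessionYear': ['1979', '1980', '1980', None],
    'Culture': ['American', None, None, None],
})

def quality_report(frame, id_column='Object ID'):
    """Summarize duplicate IDs, missing percentages, and numeric-looking text columns."""
    # Number of rows whose ID repeats an earlier row
    duplicate_ids = int(frame[id_column].duplicated().sum())

    # Share of missing values per column, as a percentage
    missing_pct = (frame.isnull().mean() * 100).round(1)

    # Text columns whose non-null values all parse as numbers
    numeric_like = []
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue
        non_null = frame[col].dropna()
        converted = pd.to_numeric(non_null, errors='coerce')
        if len(non_null) > 0 and converted.notna().all():
            numeric_like.append(col)

    return duplicate_ids, missing_pct, numeric_like

duplicate_ids, missing_pct, numeric_like = quality_report(sample)
print(f'Duplicate IDs: {duplicate_ids}')
print(missing_pct)
print(f'Numeric values stored as text: {numeric_like}')
```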

**Immediate Action Plan:**
1. Identify the optimal subset of columns for feature engineering and check their percentage of missing values
2. Develop a missing data strategy (imputation vs. filtering)
3. Validate data types and convert if necessary
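
One way to frame step 2 is a simple threshold rule: columns that are mostly complete get a placeholder value, and sparse ones are dropped. The 50% cutoff, the `plan_missing_strategy` name, and the toy frame are illustrative assumptions, not values taken from this analysis.

```python
import pandas as pd

def plan_missing_strategy(frame, max_missing=0.5):
    """Map each column to 'keep', 'impute', or 'drop' based on its missing share."""
    plan = {}
    for col, share in frame.isnull().mean().items():
        if share == 0:
            plan[col] = 'keep'      # nothing to fix
        elif share <= max_missing:
            plan[col] = 'impute'    # fill gaps with a placeholder
        else:
            plan[col] = 'drop'      # too sparse to be useful
    return plan

# Hypothetical rows for demonstration
toy = pd.DataFrame({
    'Title': ['Coin', None, 'Bowl', 'Vase'],
    'Reign': [None, None, None, 'Amenhotep III'],
    'Department': ['A', 'B', 'C', 'D'],
})
print(plan_missing_strategy(toy))
```

The cutoff is a judgment call: a stricter value keeps fewer but cleaner columns, while a looser one retains more context at the cost of more placeholder text.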

## Pre-Processing

In [4]:
# Optimal columns
primary_columns = ['Object ID',
                    'Title', 
                    'Object Name',
                    'Object Begin Date',
                    'Object End Date',
                    'Department', 
                    'Classification',
                    'Medium'
                    ]

image_columns = ['Link Resource',
                  'Dimensions'
                  ]

# Checking the percentage of non-null (complete) values per column
def checking_nulls(columns):
    for col in columns:
        complete_percentage = (1 - df[col].isnull().mean()) * 100

        print(f'{col}: {complete_percentage}')

checking_nulls(primary_columns)
print()
checking_nulls(image_columns)

Object ID: 100.0
Title: 94.06069829015415
Object Name: 99.53274111465782
Object Begin Date: 100.0
Object End Date: 100.0
Department: 100.0
Classification: 83.76821814762576
Medium: 98.51223616163117

Link Resource: 100.0
Dimensions: 84.5227195869316


Ten columns (eight primary and two image-related) have been selected for the recommendation engine, all with **>83% data completeness**. Together they provide comprehensive coverage of artwork identity, historical context, physical characteristics, and visual content while maintaining high data quality standards.

In [5]:
# Creating new dataset with selected columns
selected_columns = primary_columns + image_columns
data = df[selected_columns].copy()

# Checking data types for the new dataset
print('Data Types:')
print(data.dtypes)

# Checking null values
print('\nNull values before cleaning:')
for col in selected_columns:
    if data[col].isnull().sum() > 0:
        print(f'Column {col}: {data[col].isnull().sum()}')

# Handling null values with explicit placeholders
# (plain assignment instead of chained inplace fillna, which newer pandas versions warn about)
fill_values = {
    'Title': 'Untitled',
    'Object Name': 'Artwork',
    'Classification': 'Unclassified',
    'Medium': 'Medium unknown',
    'Dimensions': 'Dimensions unavailable',
}
data = data.fillna(value=fill_values)
print(f'\nNull values after cleaning: {data.isnull().sum().sum()}')

# Checking for duplicates (fully identical rows)
if data.duplicated().sum() > 0:
    print('⚠️  Duplicates found - investigate before proceeding')
else:
    print('✅ No duplicate records')

# Clean Dataset
print('\n=== CLEAN DATASET READY ===')
print(f'Final shape: {data.shape}')
data.head()


Data Types:
Object ID             int64
Title                object
Object Name          object
Object Begin Date     int64
Object End Date       int64
Department           object
Classification       object
Medium               object
Link Resource        object
Dimensions           object
dtype: object

Null values before cleaning:
Column Title: 28803
Column Object Name: 2266
Column Classification: 78717
Column Medium: 7215
Column Dimensions: 75058

Null values after cleaning: 0
✅ No duplicate records

=== CLEAN DATASET READY ===
Final shape: (484956, 10)


| | Object ID | Title | Object Name | Object Begin Date | Object End Date | Department | Classification | Medium | Link Resource | Dimensions |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | One-dollar Liberty Head Coin | Coin | 1853 | 1853 | The American Wing | Unclassified | Gold | http://www.metmuseum.org/art/collection/search/1 | Dimensions unavailable |
| 1 | 2 | Ten-dollar Liberty Head Coin | Coin | 1901 | 1901 | The American Wing | Unclassified | Gold | http://www.metmuseum.org/art/collection/search/2 | Dimensions unavailable |
| 2 | 3 | Two-and-a-Half Dollar Coin | Coin | 1909 | 1927 | The American Wing | Unclassified | Gold | http://www.metmuseum.org/art/collection/search/3 | Diam. 11/16 in. (1.7 cm) |
| 3 | 4 | Two-and-a-Half Dollar Coin | Coin | 1909 | 1927 | The American Wing | Unclassified | Gold | http://www.metmuseum.org/art/collection/search/4 | Diam. 11/16 in. (1.7 cm) |
| 4 | 5 | Two-and-a-Half Dollar Coin | Coin | 1909 | 1927 | The American Wing | Unclassified | Gold | http://www.metmuseum.org/art/collection/search/5 | Diam. 11/16 in. (1.7 cm) |


## Exploratory Data Analysis

In [6]:
# Categorical columns exploration
print("=== CATEGORICAL DISTRIBUTION ANALYSIS ===")
categorical_columns = ['Object Name', 'Department', 'Classification', 'Medium']

for col in categorical_columns:
    print(f'\n{col} Analysis:')

    # Unique values
    unique_count = data[col].nunique()
    print(f'Unique values: {unique_count}')

    # Top categories
    cat_value_counts = data[col].value_counts()
    print(f'Top ten: {cat_value_counts.head(10)}')

=== CATEGORICAL DISTRIBUTION ANALYSIS ===

Object Name Analysis:


Unique values: 28632
Top ten: Object Name
Print             102986
Photograph         29451
Drawing            26018
Book               13397
Kylix fragment      8926
Piece               8621
Fragment            7213
Painting            6014
Negative            5928
Bowl                3633
Name: count, dtype: int64

Department Analysis:
Unique values: 19
Top ten: Department
Drawings and Prints                       172630
European Sculpture and Decorative Arts     43051
Photographs                                37459
Asian Art                                  37000
Greek and Roman Art                        33726
Costume Institute                          31652
Egyptian Art                               27969
The American Wing                          18532
Islamic Art                                15573
Modern and Contemporary Art                14696
Name: count, dtype: int64

Classification Analysis:
Unique values: 1245
Top ten: Classification
Prints                  84326
Unclas

In [7]:
# Departments with the most diverse collections
print("=== DEPARTMENT DIVERSITY ===")
dept_diversity = data.groupby('Department')['Object Name'].nunique().sort_values(ascending=False)
print(dept_diversity.head(10))

# Most common mediums by department
print("\n=== TOP MEDIUMS BY DEPARTMENT ===")
top_mediums = data.groupby('Department')['Medium'].apply(lambda x: x.value_counts().head(3))
print(top_mediums.head(30))

# Artworks by period
def categorize_by_century(begin_date, end_date):
    # Assumed completion: the original cell left this body unfinished.
    # Labels an artwork by the century of its date-range midpoint;
    # negative years are treated as BCE.
    mid_year = (begin_date + end_date) // 2
    if mid_year < 0:
        return f'{(-mid_year - 1) // 100 + 1} c. BCE'
    return f'{max(mid_year - 1, 0) // 100 + 1} c. CE'

## Feature Engineering