## EDA - The Price of Art

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import kagglehub
import os

In [9]:
# Download latest version
path = kagglehub.dataset_download("flkuhm/art-price-dataset")

print("Path to dataset files:", path)

Path to dataset files: /home/codespace/.cache/kagglehub/datasets/flkuhm/art-price-dataset/versions/1


In [10]:
# 1. First, let's see what files are inside the directory we just downloaded.
# This will print the name of the CSV file(s).
print("Files in the directory:", os.listdir(path))

Files in the directory: ['artDataset.csv', 'artDataset']


In [None]:
# 2. Now we set the exact filename. The os.listdir output above shows the CSV is named 'artDataset.csv'.
filename = 'artDataset.csv'

In [12]:
# 3. We create the full, correct path to the file. 
# os.path.join is the best way to do this, as it works on any operating system (Windows, Mac, Linux).
full_path = os.path.join(path, filename)

In [13]:
# 4. Finally, we read the CSV file into our pandas DataFrame.
try:
    df = pd.read_csv(full_path)
    print("\nDataset loaded successfully!")
except FileNotFoundError:
    print(f"\nError: The file '{filename}' was not found in the directory '{path}'.")
    print("Please check the filename from the 'os.listdir(path)' output and update the 'filename' variable.")



Dataset loaded successfully!


In [14]:
df = pd.read_csv(full_path)
df

,Unnamed: 0,price,artist,title,yearCreation,signed,condition,period,movement
0,0,28.500 USD,Tommaso Ottieri,Bayreuth Opera,2021,Signed on verso,This work is in excellent condition.,Contemporary,Baroque
1,1,3.000 USD,Pavel Tchelitchew,Drawings of the Opera,First Half 20th Century,Signed and titled,Not examined out of frame.No obvious signs of ...,Post-War,Surrealism
2,2,5.000 USD,Leo Gabin,Two on Sidewalk,2016,"Signed, titled and dated on verso",This work is in excellent condition.,Contemporary,Abstract
3,3,5.000 USD,Matthias Dornfeld,Blumenszene,2010,"Signed, titled and dated on the reverse with t...",This work is in excellent condition.There is m...,Contemporary,Abstract
4,4,2.500 USD,Alexis Marguerite Teplin,Feverish Embarkation,2001,Signed on verso,This work is in excellent condition.,Contemporary,Abstract
...,...,...,...,...,...,...,...,...,...
749,749,680 USD,Jane Kent,Miracle Grow #17,2012,Signed and dated on lower right.,Not examined out of frame.No obvious signs of ...,Contemporary,Abstract
750,750,1.275 USD,Gary Bower,Rolph Series,1970,[nan],Not examined out of frame.Significant undulati...,Contemporary,Geometric Abstraction
751,751,680 USD,Jane Kent,Untitled,2012,[nan],Not examined out of frame.No apparent imperfec...,Contemporary,Geometric Abstraction
752,752,1.275 USD,T. L. Solien,Juniper,1986,[nan],Not examined outside of frame.Pinholes at edge...,Contemporary,Abstract


In [15]:
# Let's understand the dimensions and the columns of our dataset.
# This gives us a quick overview without printing every row.

print(f"The DataFrame has {df.shape[0]} rows and {df.shape[1]} columns.")
print("\nColumn Names:")
print(df.columns.tolist())

The DataFrame has 754 rows and 9 columns.

Column Names:
['Unnamed: 0', 'price', 'artist', 'title', 'yearCreation', 'signed', 'condition', 'period', 'movement']


In [16]:
# The 'Unnamed: 0' column is likely an old index from the CSV file.
# It's redundant, so let's drop it to clean our DataFrame.

if 'Unnamed: 0' in df.columns:
    df.drop('Unnamed: 0', axis=1, inplace=True)
    print("Dropped the 'Unnamed: 0' column.")
else:
    print("'Unnamed: 0' column not found. No action taken.")

# Let's verify the change
print("\nUpdated Column Names:")
print(df.columns.tolist())

Dropped the 'Unnamed: 0' column.

Updated Column Names:
['price', 'artist', 'title', 'yearCreation', 'signed', 'condition', 'period', 'movement']


In [17]:
# Get a concise summary of the DataFrame to check data types and missing values.
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 754 entries, 0 to 753
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   price         754 non-null    str  
 1   artist        753 non-null    str  
 2   title         754 non-null    str  
 3   yearCreation  754 non-null    str  
 4   signed        754 non-null    str  
 5   condition     754 non-null    str  
 6   period        754 non-null    str  
 7   movement      754 non-null    str  
dtypes: str(8)
memory usage: 47.3 KB


In [18]:
# Get summary statistics. Every column is a string at this point, so describe()
# reports count, unique, top and freq rather than mean, std or quartiles.
df.describe()

,price,artist,title,yearCreation,signed,condition,period,movement
count,754,753,754,754,754,754,754,754
unique,108,454,679,136,390,376,5,34
top,800 USD,Russell Young,Untitled,2012,[nan],Excellent condition.,Contemporary,Realism
freq,124,17,35,34,153,82,414,177


## Initial Findings from Data Inspection
After the initial data loading and inspection using .info() and .describe(), several key characteristics and potential issues have been identified in the dataset:

- Data Type Mismatch: All columns, including critical ones like price and yearCreation, are currently stored as strings (dtype str). This prevents any numerical or time-series analysis until they are converted.
- Price Column: The price column contains the ' USD' suffix, which needs to be removed before conversion to a numeric type. Values such as '28.500 USD' and '1.275 USD' also suggest that the period acts as a thousands separator rather than a decimal point, which should be confirmed before conversion. The most frequent price is '800 USD' (124 of 754 rows), which may indicate a specific category of art or a default price that requires further investigation.
- Missing Values:
  - The artist column has one official missing value (NaN). Given that there is only one, this warrants a specific investigation of the corresponding title to make an informed imputation decision (e.g., is it a famous work, or by an 'Anonymous' artist?).
  - The signed column contains the literal string '[nan]' as its most frequent value (153 rows). This is a hidden missing value that needs to be converted to a proper NaN for accurate analysis.
- High Cardinality: The condition column has 376 unique values across 754 rows, which is too many for effective analysis. It will likely require grouping or simplification later in the project.
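
The findings above can be checked column by column with a small audit table. The following is a minimal sketch run on an illustrative four-row sample rather than the real file: the `sample` frame and the `audit` name are assumptions made for this example, and the row with the missing artist is made up to exercise the check.

```python
import numpy as np
import pandas as pd

# Illustrative sample only; the last row is hypothetical and exists to
# include a true NaN alongside the literal '[nan]' strings.
sample = pd.DataFrame({
    "price": ["28.500 USD", "1.275 USD", "680 USD", "800 USD"],
    "artist": ["Tommaso Ottieri", "Gary Bower", "Jane Kent", np.nan],
    "signed": ["Signed on verso", "[nan]", "[nan]", "Signed on verso"],
})

# For each column: real NaN count, literal '[nan]' count, distinct values.
audit = pd.DataFrame({
    "true_nan": sample.isna().sum(),
    "literal_nan": sample.eq("[nan]").sum(),
    "n_unique": sample.nunique(),
})
print(audit)
```

Running the same three expressions on `df` would show whether '[nan]' is confined to the signed column or also appears elsewhere.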

## Proposed Cleaning & Preprocessing Plan
To prepare the dataset for exploratory analysis and visualization, the following steps will be taken in order:

- Investigate Missing Artist: Before any bulk transformations, we will isolate the single row with the missing artist value. By examining its title and other attributes, we will decide on the most appropriate value to impute (e.g., "Anonymous", "Unknown", or the actual artist's name if identifiable).
- Convert price to Numeric:
  - Remove the ' USD' suffix and any thousands separators from the price column. In this dataset the separator appears to be a period (e.g., '28.500 USD') rather than a comma.
  - Convert the column to a numeric (float) data type.
- Convert yearCreation to Numeric:
  - Inspect the column for non-numeric entries, such as 'First Half 20th Century' in row 1.
  - Convert the column to a numeric (integer) data type, deciding how to treat entries that are not a plain year.
- Handle Hidden Missing Values:
  - Convert the literal '[nan]' strings in the signed column to actual NaN values.
  - Decide on a strategy for the now-missing values in the signed column (e.g., imputation with a placeholder like 'Not Signed').
- (Future) Simplify Categorical Columns:
  - Once the main structural issues are resolved, we will address the high cardinality in the condition column to make it more analyzable.
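
As a rough illustration of the conversion steps in this plan, here is a minimal sketch on a small sample whose values are copied from the rows displayed earlier. The helper name `clean_sample` is hypothetical, and two choices in it are assumptions rather than decisions made in this notebook: the period in prices is treated as a thousands separator, and year entries that are not a plain number (such as 'First Half 20th Century') simply become missing.

```python
import pandas as pd

sample = pd.DataFrame({
    "price": ["28.500 USD", "3.000 USD", "680 USD", "1.275 USD"],
    "yearCreation": ["2021", "First Half 20th Century", "2012", "[nan]"],
    "signed": ["Signed on verso", "Signed and titled", "[nan]", "[nan]"],
})

def clean_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the planned conversions to a copy."""
    out = df.copy()

    # price: strip the ' USD' suffix, then drop periods.
    # ASSUMPTION: '.' is a thousands separator, so '28.500' means 28500.
    digits = (
        out["price"]
        .str.replace(" USD", "", regex=False)
        .str.replace(".", "", regex=False)
    )
    out["price"] = pd.to_numeric(digits, errors="coerce").astype("float64")

    # yearCreation: anything that is not a plain number becomes missing;
    # the nullable Int64 dtype keeps whole-number years next to those gaps.
    years = pd.to_numeric(out["yearCreation"], errors="coerce")
    out["yearCreation"] = years.astype("Int64")

    # signed: turn the literal '[nan]' string into a real missing value.
    out["signed"] = out["signed"].mask(out["signed"] == "[nan]")
    return out

cleaned = clean_sample(sample)
print(cleaned)
print(cleaned.dtypes)
```

Coercing to missing is the bluntest option for yearCreation; mapping phrases like 'First Half 20th Century' to an approximate year would keep more rows but requires an explicit rule.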

In [19]:
# Investigating the single missing 'artist' value:

# Let's find the row where the 'artist' is missing (NaN)
missing_artist_row = df[df['artist'].isnull()]

# Display the full information for this row
print("Investigating the row with the missing artist:")
print(missing_artist_row.T) # Using .T to transpose it for better readability

Investigating the row with the missing artist:
                                                            725
price                                                 1.275 USD
artist                                                      NaN
title                                                     [nan]
yearCreation                                              [nan]
signed                      Signed and dated in pencil to verso
condition     Not examined out of frame.Minor sheet undulati...
period                                             Contemporary
movement                                                Realism
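
The row above also holds the literal '[nan]' in title and yearCreation, so it offers little that could identify the artist. One possible follow-up, sketched here on a two-row stand-in rather than the full DataFrame, is to list the columns that hide '[nan]' and fill the artist with a placeholder. The `stand_in` frame and the 'Unknown' label are assumptions for this example, not a decision recorded in the notebook.

```python
import numpy as np
import pandas as pd

# Two-row stand-in: row 0 and row 725, with values copied from the outputs above.
stand_in = pd.DataFrame(
    {
        "price": ["28.500 USD", "1.275 USD"],
        "artist": ["Tommaso Ottieri", np.nan],
        "title": ["Bayreuth Opera", "[nan]"],
        "yearCreation": ["2021", "[nan]"],
    },
    index=[0, 725],
)

# Columns that contain the literal '[nan]' string anywhere.
hidden_cols = [c for c in stand_in.columns if stand_in[c].eq("[nan]").any()]
print("Columns with literal '[nan]':", hidden_cols)

# ASSUMPTION: 'Unknown' is the placeholder chosen for a missing artist.
stand_in["artist"] = stand_in["artist"].fillna("Unknown")
print(stand_in.loc[725])
```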
