## Introduction 

In this notebook, we perform an exploratory data analysis (EDA) of the MoMA (Museum of Modern Art) collection dataset, which is available on Kaggle: https://www.kaggle.com/momanyc/museum-collection. It contains two tables, one for artists and one for artworks. Combined, they hold about 145,000 records.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
artists = pd.read_csv('artists.csv')

In [3]:
artworks = pd.read_csv('artworks.csv')

## Checking the cleanliness of the set

According to the website, an incomplete record is marked with the string 'not curator approved'. We will replace these strings with NumPy NaN values to make processing easier.
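
The replacement described above could look roughly like the sketch below. It runs on a small made-up table rather than on `artists.csv`. The exact marker text and its casing are assumptions, so the match here is case-insensitive, and `mark_unapproved_as_nan` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table: column names mirror artists.csv,
# but the rows are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Nationality": ["American", "Not Curator Approved", "French"],
    "Gender": ["not curator approved", "Female", "Male"],
})

# Assumed marker text; compared in lower case with surrounding spaces removed.
MARKER = "not curator approved"


def mark_unapproved_as_nan(df):
    """Return a copy of df with marker strings replaced by NaN (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        # Numeric columns cannot contain the marker string, so skip them.
        if pd.api.types.is_numeric_dtype(out[col]):
            continue
        is_marker = out[col].astype(str).str.strip().str.lower() == MARKER
        out[col] = out[col].mask(is_marker, np.nan)
    return out


cleaned = mark_unapproved_as_nan(toy)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

If the marker turns out to have a single fixed spelling, passing it to the `na_values` parameter of `pd.read_csv` when loading would be a simpler alternative.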

In [4]:
artists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15091 entries, 0 to 15090
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Artist ID    15091 non-null  int64  
 1   Name         15091 non-null  object 
 2   Nationality  12603 non-null  object 
 3   Gender       12019 non-null  object 
 4   Birth Year   11237 non-null  float64
 5   Death Year   4579 non-null   float64
dtypes: float64(2), int64(1), object(3)
memory usage: 707.5+ KB


Even before looking for the string 'not curator approved', we can see that Death Year has a large number of missing values. Counting them directly gives the following.

In [9]:
artists['Death Year'].isna().sum()

10512
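
This count agrees with the `info()` output above: 15091 rows minus 4579 non-null values gives 10512, or roughly 69.7% of the column. A minimal sketch of the same cross-check on a made-up column (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Invented values standing in for the real 'Death Year' column.
toy = pd.DataFrame({"Death Year": [1950.0, np.nan, np.nan, 2001.0, np.nan]})

n_missing = int(toy["Death Year"].isna().sum())   # NaN entries
n_present = int(toy["Death Year"].count())        # non-null entries, as reported by info()
share_missing = toy["Death Year"].isna().mean()   # fraction of rows that are NaN

print(n_missing, n_present, share_missing)
```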

In [10]:
artworks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130262 entries, 0 to 130261
Data columns (total 21 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Artwork ID          130262 non-null  int64  
 1   Title               130210 non-null  object 
 2   Artist ID           128802 non-null  object 
 3   Name                128802 non-null  object 
 4   Date                127950 non-null  object 
 5   Medium              118343 non-null  object 
 6   Dimensions          118799 non-null  object 
 7   Acquisition Date    124799 non-null  object 
 8   Credit              127192 non-null  object 
 9   Catalogue           130262 non-null  object 
 10  Department          130262 non-null  object 
 11  Classification      130262 non-null  object 
 12  Object Number       130262 non-null  object 
 13  Diameter (cm)       1399 non-null    float64
 14  Circumference (cm)  10 non-null      float64
 15  Height (cm)         111893 non-nul

Scanning the columns, we see that Circumference, Length, and Weight are mostly NA. Circumference (cm), for example, has only 10 non-null values out of 130,262 rows.
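
Rather than reading the `info()` table by eye, the columns can be ranked by their share of missing values. The sketch below uses a small invented frame; the 0.5 cutoff in `THRESHOLD` is an arbitrary, hypothetical choice, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Invented rows; the column names echo a few of those in artworks.csv.
toy = pd.DataFrame({
    "Title": ["a", "b", "c", "d"],
    "Diameter (cm)": [np.nan, np.nan, np.nan, 12.0],
    "Height (cm)": [10.0, np.nan, 30.0, 40.0],
})

# Fraction of NaN per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns where more than half the rows are missing.
THRESHOLD = 0.5
sparse_columns = missing_share[missing_share > THRESHOLD].index.tolist()

print(missing_share)
print(sparse_columns)
```

Applied to the real table, the same two lines would run on `artworks` in place of `toy`.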