# Final Project: Researching Catalogued Work in MoMA

## Introduction
Representation of art is known to be very problematic in terms ...

In this study, we examine whether an artwork's catalogued status is related to its other attributes. Cataloguing artworks is extremely important: it not only signals the importance the museum places on an artwork, but is also essential for risk management, research and exhibition development.

  https://mgnsw.org.au/sector/resources/online-resources/collection-management/cataloguing/

## Data: ...


### Data Preparation

We start off by reading the data.

In [1]:
## Importing & reading data
import pandas as pd
import random
import re

In [2]:
df = pd.read_csv("./collection/Artworks.csv")

We clean the data in two main ways, by dropping unnecessary attributes and replacing NaN values.

In [3]:
## Cleaning the data by dropping unnecessary attributes

# These attributes give the same info: 'Artist' & 'ConstituentID'
#  so I plan on dropping 'Artist'.
df.drop("Artist", axis=1, inplace=True)

# These attributes give the same info: "ObjectID" & "Title"
#  so I plan on dropping "Title"
df.drop("Title", axis=1, inplace=True)

# Measurement attributes give the same info as "Dimensions"
#  so I plan on dropping "Dimensions"
df.drop("Dimensions", axis=1, inplace=True)

# Also dropping 'URL' & 'ThumbnailURL' as they don't give any real info.
#  These are only filled out for catalogued artworks.
df.drop('URL', axis=1, inplace=True)
df.drop('ThumbnailURL', axis=1, inplace=True)

# "AccessionNumber" refers to the barely visible coded number attached to the artwork in 
#  real life. This is used to link the displayed or archived work with its proper 
#  identification and information for the people working at the museum.
   # https://www.okmuseums.org/sites/oma2/uploads/documents/Technical_Bulletins/Technical_Bulletin_42_-_Applying_Accession_Numbers_Part_I.pdf

df.drop("AccessionNumber", axis=1, inplace=True)


# We are dropping these attributes as it is time-expensive to look at them. (However, in time we
#   might go back to look at their effects.)

df.drop('ArtistBio', axis=1, inplace=True)
df.drop('Nationality', axis=1, inplace=True)
df.drop('BeginDate', axis=1, inplace=True)
df.drop('EndDate', axis=1, inplace=True)

In [4]:
df.isna().sum()

ConstituentID           1283
Gender                  1283
Date                    2202
Medium                  9701
CreditLine              2437
Classification             0
Department                 0
DateAcquired            7125
Cataloged                  0
ObjectID                   0
Circumference (cm)    138141
Depth (cm)            124312
Diameter (cm)         136689
Height (cm)            17796
Length (cm)           137409
Weight (kg)           137861
Width (cm)             18717
Seat Height (cm)      138151
Duration (sec.)       136011
dtype: int64

In [5]:
# Measurement attributes that are NaN will now be zero (ie 'Circumference (cm)', 
#  'Depth (cm)', 'Diameter (cm)', 'Height (cm)', 'Length (cm)', 'Weight (kg)', 'Width (cm)', 
#  'Seat Height (cm)', 'Duration (sec.)' )
df['Circumference (cm)'].fillna(0, inplace=True)
df['Depth (cm)'].fillna(0, inplace=True)
df['Diameter (cm)'].fillna(0, inplace=True)
df['Height (cm)'].fillna(0, inplace=True)
df['Length (cm)'].fillna(0, inplace=True)
df['Weight (kg)'].fillna(0, inplace=True)
df['Width (cm)'].fillna(0, inplace=True)
df['Seat Height (cm)'].fillna(0, inplace=True)
df['Duration (sec.)'].fillna(0, inplace=True)

# Unknown artist will have a negative id number
df["ConstituentID"].fillna( "-1", inplace=True)

df["Cataloged"].replace("Y", 1, inplace=True)
df["Cataloged"].replace("N", 0, inplace=True)

# Discrete values with NaN will now be a string "UNKNOWN"
df['Medium'].replace(pd.NA, "UNKNOWN", inplace=True)
df['Date'].replace(pd.NA, "UNKNOWN", inplace=True)
df["Gender"].replace(pd.NA, "UNKNOWN", inplace=True)
df["CreditLine"].replace(pd.NA, "UNKNOWN", inplace=True)

df["DateAcquired"].replace(pd.NA, "0001-01-01", inplace=True)

df['Medium'] = df['Medium'].str.lower()
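
The per-column `fillna(..., inplace=True)` calls above can also be written as one call that assigns its result back, a pattern that recent pandas releases tend to prefer over in-place methods on a selected column. The sketch below is only an illustration on a toy frame: the values are invented, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Height (cm)": [48.6, np.nan],
    "Medium": ["Ink on paper", np.nan],
    "Cataloged": ["Y", "N"],
})

# One fill value per column, applied in a single call.
fill_values = {"Height (cm)": 0, "Medium": "UNKNOWN"}
toy = toy.fillna(value=fill_values)

# Map the Y/N flag to 1/0 and normalise the medium text.
toy["Cataloged"] = toy["Cataloged"].map({"Y": 1, "N": 0})
toy["Medium"] = toy["Medium"].str.lower()
```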

We change the `DateAcquired` attribute to a daily period object, and then to an int (the number of days since 1970-01-01).

In [6]:
# This is now a period obj
df["DateAcquired"] =  pd.PeriodIndex(df["DateAcquired"], freq="D")
df["DateAcquired"] = df["DateAcquired"].astype(int)
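
To make the integer encoding easier to read, here is a small sketch of how a daily period maps to an integer and back. It assumes the integers in `DateAcquired` are daily period ordinals, which is what the cell above produces. The placeholder date "0001-01-01" maps to a large negative number, which is likely why the `DateAcquired` means in later tables are negative.

```python
import pandas as pd

# A daily Period stores an integer ordinal: the number of days since 1970-01-01.
acquired = pd.Period("1996-04-09", freq="D")
days = acquired.ordinal

# The integer can be turned back into a date when reading the tables below.
decoded = pd.Period(ordinal=9595, freq="D")
print(days, decoded)
```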

(At this point we jump to the EDA, before changing any more features.) From here on, the values that need to be changed are discrete, so we deal with them on a case-by-case basis.

In [7]:
# Changing string attributes to numbers as well as giving them their own csv

departments = {"DepartmentName": [], "UID": []}
ls = df["Department"].unique()
for idx, x in enumerate(ls):
    departments["DepartmentName"].append(x)
    departments["UID"].append(idx)
    df["Department"].replace(x, idx, inplace=True)

classifications = {"ClassificationName": [], "UID": []}
ls = df["Classification"].unique()
for idx, x in enumerate(ls):
    classifications["ClassificationName"].append(x)
    classifications["UID"].append(idx)
    df["Classification"].replace(x, idx, inplace=True)

creditlines = {"CreditLineDescription": [], "UID": []}
ls = df["CreditLine"].unique()
for idx, x in enumerate(ls):
    creditlines["CreditLineDescription"].append(x)
    creditlines["UID"].append(idx)
    df["CreditLine"].replace(x, idx, inplace=True)
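
The same string-to-integer encoding can be sketched with `pd.factorize`, which avoids calling `replace` once per distinct value (that loop may be slow for a column with thousands of distinct values, such as `CreditLine`). This is an alternative under stated assumptions, not the method used above; the department names in the toy frame and the name `departments_lookup` are made up.

```python
import pandas as pd

toy = pd.DataFrame({"Department": ["Drawings", "Film", "Drawings", "Photography"]})

# factorize numbers the values in order of first appearance,
# which matches looping over unique() with enumerate().
codes, names = pd.factorize(toy["Department"])
toy["Department"] = codes

# Lookup table analogous to the `departments` dictionary.
departments_lookup = pd.DataFrame({"DepartmentName": names, "UID": range(len(names))})
```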

In [8]:
# departments = pd.DataFrame(departments)
# departments.to_csv("departments.csv", sep=',')

# classifications = pd.DataFrame(classifications)
# classifications.to_csv("classifications.csv", sep=',')

# creditlines = pd.DataFrame(creditlines)
# creditlines.to_csv("creditlines.csv", sep=',')

At this point, we create medium types based on characteristics of the 60 most common medium values.

We do this by breaking the medium descriptions into important words and phrases, which generalize to more cases. For example, "Gelatin silver print" and "Albumen silver print" are both prints, and both use the silver print process (a photographic process). However, they use different solutions for the print process (i.e. gelatin or albumen). Therefore, we can make four medium categories from this example: "print", "silver print", "gelatin", "albumen".

We also add two types that do not appear within the top 60, cardboard and paint, as they are fairly common within fine art presentations.

In [9]:
df["Medium"].value_counts().head(60)

gelatin silver print                                                                              16215
unknown                                                                                            9706
lithograph                                                                                         7822
albumen silver print                                                                               4874
pencil on paper                                                                                    1795
chromogenic color print                                                                            1722
letterpress                                                                                        1681
etching                                                                                            1630
ink on paper                                                                                       1460
lithograph, printed in color                                    

In [10]:
mediums = {"ink", "lithograph", "engraving", "gelatin", "albumen", "silver print", "pencil", 
          "chromogenic", "color print", "paper", "letterpress", "tracing paper", "offset", 
          "video", "oil", "canvas", "drypoint", "woodcut", "screenprint", 
          "poster", "etching", "inkjet print", "pigmented", "photogravure", "black", "color", 
          "wood", "aquatint", "platinum print", "collotype", "matte", "ballpoint pen", "board", 
          "watercolor", "illustrated book", "photolithograph", "white", "sound", "portfolio", 
          "linoleum cut", "dye transfer print", "gouache", "glass negative", 
          "colored pencil", "cardboard", "silkscreen", "silent", "charcoal", "paint", "bronze"}

In [11]:
for m in mediums:
    df[m] = 0
    df.loc[df['Medium'].str.contains(m),m] = 1
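
As a small worked example of the keyword flags, the sketch below applies the same substring test to three toy medium strings. The rows and the short keyword list are chosen for illustration, and `regex=False` is an addition that treats each keyword as literal text.

```python
import pandas as pd

toy = pd.DataFrame({"Medium": ["gelatin silver print", "albumen silver print", "oil on canvas"]})
keywords = ["gelatin", "albumen", "silver print", "print", "oil"]

for kw in keywords:
    # regex=False treats the keyword as literal text rather than a pattern.
    toy[kw] = toy["Medium"].str.contains(kw, regex=False).astype(int)
```

Because this is substring matching, overlapping keywords fire together: a medium containing "color print" is also flagged for "color", and "woodcut" is also flagged for "wood".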

From this point, we take the "Date" attribute and create two more attributes: "TimeStarted" and "TimeFinished". 

"Date" is a string attribute that holds the years during which the art piece was being worked on. Normally, its values take the form XXXX-XX or XXXX. Other valid years could be extracted from longer strings, but we avoid these for the time being. Therefore, the dataframe shrinks wherever years cannot be easily extracted.

In [12]:
# Reading all unique dates and changing the two new columns based on that info.

startAttri = "TimeStarted"
finishAttri = "TimeFinished"

df[startAttri] = "-"
df[finishAttri] = "-"
ls = df["Date"].unique()
for s in ls:
    try:
        d = int(s)
        df.loc[df["Date"].str.contains(s),startAttri] = d
        df.loc[df["Date"].str.contains(s), finishAttri] = d
    except ValueError:
        # Not a plain year; try the XXXX-XX form instead.
        l = s.split('-')
        if len(l) == 2 and len(l[1]) == 2 and len(l[0]) == 4:
            start = l[0]
            finish = l[0][:2]+l[1]
            df.loc[df["Date"].str.contains(s), startAttri] = int(start)
            df.loc[df["Date"].str.contains(s), finishAttri] = int(finish)

In [13]:
# Removing rows where dates could not be extracted.
df = df.loc[df[startAttri] != "-"]
df = df.loc[df[finishAttri] != "-"]
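
Note that `str.contains` performs substring matching, so a lookup for "1896" also touches rows such as "c. 1896". As a point of comparison, the sketch below parses a single date string with an exact match on the two accepted forms. `parse_years` and `YEAR_PATTERN` are hypothetical names, and using this helper instead of the cell above could change which rows are kept.

```python
import re

# Accept only "XXXX" or "XXXX-XX"; anything else is treated as unparseable.
YEAR_PATTERN = re.compile(r"(\d{4})(?:-(\d{2}))?")

def parse_years(date_text):
    """Return (start, finish) for 'XXXX' or 'XXXX-XX', else None."""
    match = YEAR_PATTERN.fullmatch(date_text.strip())
    if match is None:
        return None
    start = int(match.group(1))
    if match.group(2) is None:
        return start, start
    # Reuse the century of the start year, as in the cell above.
    finish = int(match.group(1)[:2] + match.group(2))
    return start, finish
```

For example, `parse_years("1896-97")` gives `(1896, 1897)`, while `parse_years("c. 1896")` gives `None`.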

In [14]:
# Finding the number of artists involved and how many are of each gender (male, female, neither). 
#  Taking the first artist listed

df["NumMales"] = df["Gender"].str.count('Male')
df["NumFemales"] = df["Gender"].str.count('Female')
df["NumArtists"] = df["ConstituentID"].str.count(',') + 1
df["FirstArtistListed"] = df["ConstituentID"].str.extract(r'(^\d*)', expand=False).str.strip()

df["NumArtists"].fillna( -1, inplace=True)
df["FirstArtistListed"].fillna( -1, inplace=True)
df["FirstArtistListed"] = pd.to_numeric( df["FirstArtistListed"])
df["FirstArtistListed"].fillna( -1, inplace=True)
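
A toy run of the same counting logic may make the derived columns easier to follow. The string formats below (parenthesised genders, comma-separated ids) are assumptions for illustration, not taken from the data file.

```python
import pandas as pd

# Assumed formats, for illustration only.
toy = pd.DataFrame({
    "Gender": ["(Male)", "(Male) (Female)", "UNKNOWN"],
    "ConstituentID": ["6210", "7470, 7605", "-1"],
})

# 'Male' is case-sensitive, so it does not match inside 'Female'.
toy["NumMales"] = toy["Gender"].str.count("Male")
toy["NumFemales"] = toy["Gender"].str.count("Female")
toy["NumArtists"] = toy["ConstituentID"].str.count(",") + 1

# Leading digits only; "-1" yields an empty string, which becomes -1.
first = toy["ConstituentID"].str.extract(r"^(\d*)", expand=False)
toy["FirstArtistListed"] = pd.to_numeric(first, errors="coerce").fillna(-1).astype(int)
```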

In [15]:
df.drop("ConstituentID", axis=1, inplace=True)
df.drop("Date", axis=1, inplace=True)
df.drop("Gender", axis=1, inplace=True)
df.drop("Medium", axis=1, inplace=True)
df.drop("ObjectID", axis=1, inplace=True)

In [16]:
df.isna().sum()

CreditLine           0
Classification       0
Department           0
DateAcquired         0
Cataloged            0
                    ..
TimeFinished         0
NumMales             0
NumFemales           0
NumArtists           0
FirstArtistListed    0
Length: 70, dtype: int64

In [17]:
print(df.columns)
df.head(5)

Index(['CreditLine', 'Classification', 'Department', 'DateAcquired',
       'Cataloged', 'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)',
       'Height (cm)', 'Length (cm)', 'Weight (kg)', 'Width (cm)',
       'Seat Height (cm)', 'Duration (sec.)', 'silkscreen', 'white',
       'platinum print', 'wood', 'bronze', 'screenprint', 'paint', 'cardboard',
       'collotype', 'woodcut', 'video', 'letterpress', 'pencil', 'color',
       'color print', 'chromogenic', 'ink', 'board', 'albumen', 'silent',
       'watercolor', 'poster', 'oil', 'illustrated book', 'portfolio',
       'pigmented', 'tracing paper', 'sound', 'canvas', 'inkjet print',
       'charcoal', 'linoleum cut', 'photogravure', 'lithograph', 'matte',
       'glass negative', 'etching', 'aquatint', 'dye transfer print',
       'drypoint', 'engraving', 'ballpoint pen', 'photolithograph', 'paper',
       'offset', 'silver print', 'gouache', 'black', 'gelatin',
       'colored pencil', 'TimeStarted', 'TimeFinished', 'NumMales',

Unnamed: 0,CreditLine,Classification,Department,DateAcquired,Cataloged,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),...,gouache,black,gelatin,colored pencil,TimeStarted,TimeFinished,NumMales,NumFemales,NumArtists,FirstArtistListed
0,0,0,0,9595,1,0.0,0.0,0.0,48.6,0.0,...,0,0,0,0,1896,1896,1,0,1,6210.0
1,1,0,0,9147,1,0.0,0.0,0.0,40.6401,0.0,...,0,0,0,1,1987,1987,1,0,1,7470.0
2,2,0,0,9876,1,0.0,0.0,0.0,34.3,0.0,...,1,0,0,0,1903,1903,1,0,1,7605.0
3,3,0,0,9147,1,0.0,0.0,0.0,50.8,0.0,...,0,0,0,0,1980,1980,1,0,1,7056.0
4,2,0,0,9876,1,0.0,0.0,0.0,38.4,0.0,...,1,0,0,0,1903,1903,1,0,1,7605.0


## Exploration (EDA)

Below we do some data exploration, comparing catalogued and uncatalogued works. The EDA is performed before the string types are changed into categorical int types.

In [18]:
# ncat = df.loc[df["Cataloged"] == 0]
# print("Not Catalogued: ", len(ncat["Cataloged"]))
# 
# cat = df.loc[df["Cataloged"] == 1]
# print("Cataloged: ", len(cat["Cataloged"]))

#### Dates of Acquisition

In [19]:
# ## Top 25 acquisition dates
# b = df["DateAcquired"].value_counts()[:25]

# b.plot.bar(title = "Top 25 acquisition dates", figsize = (10,8))

In [20]:
# ## Top 10 acquisition dates, given not catalogued
# a = df.loc[df["Cataloged"] == 0]
# b = a["DateAcquired"].value_counts()[:10]

# b.plot.bar(title = "Top 10 acquisition dates, given not catalogued", figsize = (10,8))

In [21]:
# ## Top 10 acquisition dates, given catalogued
# a = df.loc[df["Cataloged"] == 1]
# b = a["DateAcquired"].value_counts()[:10]

# b.plot.bar(title = "Top 10 acquisition dates, given catalogued", figsize = (10,8))

#### Departments

In [22]:
# ## Departments in ascending order of works
# b = df["Department"].value_counts()
# b.plot.barh(title = "Top Departments", figsize = (10,8))

In [23]:
# ## Departments in ascending order of works, given not catalogued
# b = ncat["Department"].value_counts()
# b.plot.barh(title = "Top Departments, given not catalogued", figsize = (10,8))

In [24]:
# ## Departments in ascending order of works, given catalogued
# b = cat["Department"].value_counts()
# b.plot.barh(title = "Top Departments, given catalogued", figsize = (10,8))

#### Credit Line

In [25]:
# ## Top 10 CreditLine info
# b = df["CreditLine"].value_counts()[:10]
# b.plot.barh(title = "Top 10 Credit Lines", figsize = (10,8))

In [26]:
# ## Top 10 CreditLine info, given not catalogued
# b = ncat["CreditLine"].value_counts()[:10]
# b.plot.barh(title = "Top 10 Credit Lines, given not catalogued", figsize = (10,8))

In [27]:
# ## Top 10 CreditLine info, given catalogued
# b = cat["CreditLine"].value_counts()[:10]
# b.plot.barh(title = "Top 10 Credit Lines, given catalogued", figsize = (10,8))

## Analysis (Part 1)

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn import metrics
import numpy as np

For this project we run two classification algorithms: naive Bayes and logistic regression. We also check which features have the greatest effect on whether an artwork is catalogued or not.

The **first feature set** we run is below.

In [29]:
features = ['CreditLine',
       'Classification', 'Department', 'DateAcquired', 
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)', 'NumMales', 'NumFemales', 'NumArtists',
       'FirstArtistListed']



X = df[features].to_numpy()
y = df["Cataloged"].to_numpy()


We start off by creating three folds.

In [30]:
# Seed NumPy's generator, since np.random.shuffle is what shuffles the indices
np.random.seed(10)

length = X.shape[0]
indices = list(range(length))
np.random.shuffle(indices)

mid = int(length//3)

X_f1, y_f1 = X[indices[:mid], :], y[indices[:mid]]
X_f2, y_f2 = X[indices[mid: (mid*2)], :], y[indices[mid: (mid*2)]]
X_f3, y_f3 = X[indices[(mid*2):], :], y[indices[(mid*2):]]
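
An alternative to shuffling indices by hand is scikit-learn's `KFold`, which takes its own `random_state` so the folds are the same on every run. The sketch below uses toy arrays in place of `X` and `y`; it is not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data standing in for X and y.
X_toy = np.arange(18).reshape(9, 2)
y_toy = np.array([0, 1, 1, 0, 1, 1, 0, 1, 1])

# shuffle=True with a fixed random_state gives the same folds on every run.
kf = KFold(n_splits=3, shuffle=True, random_state=10)
fold_sizes = []
for train_idx, test_idx in kf.split(X_toy):
    X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
    y_tr, y_te = y_toy[train_idx], y_toy[test_idx]
    fold_sizes.append((len(train_idx), len(test_idx)))
```

Each row index appears in exactly one test fold, and the other two folds form the training set for that round.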


### Logistic Regression (Feature Set 1)

In [31]:
from sklearn.linear_model import LogisticRegression
# import seaborn as sns
import matplotlib.pyplot as plt

In [32]:
lr = LogisticRegression()

In [33]:
# Fold3 = Test
X_test, y_test = X_f3, y_f3 
X_train = np.concatenate((X_f1, X_f2), axis = 0)
y_train = np.concatenate((y_f1, y_f2), axis = 0)

In [34]:
%timeit -n1 -r1 classifier = lr.fit(X_train, y_train)

900 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [35]:
y_pred = lr.predict(X_test)

In [36]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.53      0.06      0.10     16026
           1       0.65      0.97      0.78     28887

    accuracy                           0.65     44913
   macro avg       0.59      0.51      0.44     44913
weighted avg       0.61      0.65      0.54     44913



In [37]:
w1 = pd.DataFrame(lr.coef_[0], columns=["Weights"], index=features)
w1

Unnamed: 0,Weights
CreditLine,8.273215e-05
Classification,0.0003320555
Department,4.212876e-06
DateAcquired,1.537652e-06
Circumference (cm),3.863682e-07
Depth (cm),0.000722556
Diameter (cm),0.0001537022
Height (cm),0.007749727
Length (cm),0.0003572888
Weight (kg),0.0006882361


In [38]:
lr.intercept_

array([8.07858148e-05])

In [39]:
# class_names = classifier.classes_

# np.set_printoptions(precision=2)

# # Plot non-normalized confusion matrix
# titles_options = [
#     ("F1: Confusion matrix, without normalization", None),
#     ("F1: Normalized confusion matrix", "true"),
# ]
# for title, normalize in titles_options:
#     disp = ConfusionMatrixDisplay.from_estimator(
#         classifier,
#         X_test,
#         y_test,
#         display_labels=class_names[:8],
#         cmap=plt.cm.Blues,
#         normalize=normalize,
#     )
#     disp.ax_.set_title(title)
#     disp.ax_.set_figsize=(10, 10)
#     print(title)
#     print(disp.confusion_matrix)

# plt.show()

In [40]:
# Fold2 = Test
X_test, y_test = X_f2, y_f2 
X_train = np.concatenate((X_f1, X_f3), axis = 0)
y_train = np.concatenate((y_f1, y_f3), axis = 0)

In [41]:
%timeit -n1 -r1 classifier = lr.fit(X_train, y_train)

694 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [42]:
y_pred = lr.predict(X_test)

In [43]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.52      0.06      0.10     16132
           1       0.65      0.97      0.78     28781

    accuracy                           0.64     44913
   macro avg       0.58      0.51      0.44     44913
weighted avg       0.60      0.64      0.53     44913



In [44]:
w2 = pd.DataFrame(lr.coef_[0], columns=["Weights"], index=features)
w2

Unnamed: 0,Weights
CreditLine,7.6e-05
Classification,0.000369
Department,2e-05
DateAcquired,2e-06
Circumference (cm),1e-06
Depth (cm),0.000735
Diameter (cm),0.000152
Height (cm),0.007798
Length (cm),0.000296
Weight (kg),0.001859


In [45]:
# class_names = classifier.classes_

# np.set_printoptions(precision=2)

# # Plot non-normalized confusion matrix
# titles_options = [
#     ("F2: Confusion matrix, without normalization", None),
#     ("F2: Normalized confusion matrix", "true"),
# ]
# for title, normalize in titles_options:
#     disp = ConfusionMatrixDisplay.from_estimator(
#         classifier,
#         X_test,
#         y_test,
#         display_labels=class_names[:8],
#         cmap=plt.cm.Blues,
#         normalize=normalize,
#     )
#     disp.ax_.set_title(title)
#     disp.ax_.set_figsize=(10, 10)
#     print(title)
#     print(disp.confusion_matrix)

# plt.show()

In [46]:
# Fold1 = Test
X_test, y_test = X_f1, y_f1 
X_train = np.concatenate((X_f3, X_f2), axis = 0)
y_train = np.concatenate((y_f3, y_f2), axis = 0)

In [47]:
%timeit -n1 -r1 classifier = lr.fit(X_train, y_train)

952 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [48]:
y_pred = lr.predict(X_test)

In [49]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.51      0.05      0.10     16061
           1       0.65      0.97      0.78     28852

    accuracy                           0.64     44913
   macro avg       0.58      0.51      0.44     44913
weighted avg       0.60      0.64      0.53     44913



In [50]:
w3 = pd.DataFrame(lr.coef_[0], columns=["Weights"], index=features)
w3

Unnamed: 0,Weights
CreditLine,7.9e-05
Classification,0.000359
Department,1.5e-05
DateAcquired,2e-06
Circumference (cm),1e-06
Depth (cm),0.000676
Diameter (cm),0.000155
Height (cm),0.007691
Length (cm),0.000353
Weight (kg),0.001373


In [51]:
# class_names = classifier.classes_

# np.set_printoptions(precision=2)

# # Plot non-normalized confusion matrix
# titles_options = [
#     ("F3: Confusion matrix, without normalization", None),
#     ("F3: Normalized confusion matrix", "true"),
# ]
# for title, normalize in titles_options:
#     disp = ConfusionMatrixDisplay.from_estimator(
#         classifier,
#         X_test,
#         y_test,
#         display_labels=class_names[:8],
#         cmap=plt.cm.Blues,
#         normalize=normalize,
#     )
#     disp.ax_.set_title(title)
#     disp.ax_.set_figsize=(10, 10)
#     print(title)
#     print(disp.confusion_matrix)

# plt.show()

We now look at the average weights.

In [52]:
w = (w1+w2+w3)/3
w

| Feature | Weights |
| --- | --- |
| CreditLine | 7.920009e-05 |
| Classification | 0.0003533194 |
| Department | 1.313471e-05 |
| DateAcquired | 1.5531e-06 |
| Circumference (cm) | 9.543643e-07 |
| Depth (cm) | 0.0007112098 |
| Diameter (cm) | 0.0001535338 |
| Height (cm) | 0.007746335 |
| Length (cm) | 0.0003353329 |
| Weight (kg) | 0.001306814 |


In [53]:
# w.to_csv("avg_w.csv", sep=',')
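
The weights above are fitted on raw features measured in different units (centimetres, kilograms, day numbers, category ids), so their magnitudes are not directly comparable. One common way to put them on a shared scale is to standardize the features before fitting. The sketch below illustrates this on synthetic data: the two features and their scales are invented, and this is not the pipeline that produced the tables in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: two features on very different scales, both equally informative.
n = 500
small = rng.normal(0, 1, n)          # e.g. a count-like feature
large = rng.normal(0, 1000, n)       # e.g. a day-number feature
X_toy = np.column_stack([small, large])
y_toy = (small + large / 1000 + rng.normal(0, 0.5, n) > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_toy, y_toy)

# Coefficients now refer to one standard deviation of each feature.
scaled_coefs = model.named_steps["logisticregression"].coef_[0]
print(scaled_coefs)
```

After standardization, the two coefficients should come out at a similar size, reflecting that both features carry similar information despite their different units.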

### Naive Bayes (Feature Set 1)

In [54]:
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

In [55]:
gnb = GaussianNB()

In [56]:
# Fold3 = Test
X_test, y_test = X_f3, y_f3 
X_train = np.concatenate((X_f1, X_f2), axis = 0)
y_train = np.concatenate((y_f1, y_f2), axis = 0)

In [57]:
%timeit -n1 -r1 gnb.fit(X_train, y_train)

40.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [58]:
y_pred = gnb.predict(X_test)

In [59]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.37      0.99      0.54     16026
           1       0.90      0.05      0.10     28887

    accuracy                           0.39     44913
   macro avg       0.64      0.52      0.32     44913
weighted avg       0.71      0.39      0.26     44913



In [60]:
results = {"Class 0: Variance":gnb.var_[0], "Class 0: Mean":gnb.theta_[0],
           "Class 1: Variance":gnb.var_[1], "Class 1: Mean":gnb.theta_[1]}

vm1 = pd.DataFrame(results, index=features)
vm1

Unnamed: 0,Class 0: Variance,Class 0: Mean,Class 1: Variance,Class 1: Mean
CreditLine,4517229.0,1978.97164,5525944.0,2388.778703
Classification,49.42439,6.138881,43.18625,6.153749
Department,26.75858,1.608735,26.16839,1.439696
DateAcquired,37454580000.0,-52119.682726,17513360000.0,-18411.795759
Circumference (cm),25.12729,0.003417,25.13832,0.003586
Depth (cm),74.66801,0.509035,502.9008,2.307758
Diameter (cm),27.47501,0.062649,59.18144,0.334697
Height (cm),605.7552,24.310268,3345.267,37.743652
Length (cm),30.38123,0.079411,1407.638,0.737942
Weight (kg),24.92142,0.0,124374.5,2.3816


In [61]:
# Fold2 = Test
X_test, y_test = X_f2, y_f2 
X_train = np.concatenate((X_f1, X_f3), axis = 0)
y_train = np.concatenate((y_f1, y_f3), axis = 0)

In [62]:
%timeit -n1 -r1 gnb.fit(X_train, y_train)

29.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [63]:
y_pred = gnb.predict(X_test)

In [64]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.37      0.99      0.54     16132
           1       0.89      0.06      0.11     28781

    accuracy                           0.39     44913
   macro avg       0.63      0.52      0.32     44913
weighted avg       0.70      0.39      0.26     44913



In [65]:
results = {"Class 0: Variance":gnb.var_[0], "Class 0: Mean":gnb.theta_[0],
           "Class 1: Variance":gnb.var_[1], "Class 1: Mean":gnb.theta_[1]}

vm2 = pd.DataFrame(results, index=features)
vm2

Unnamed: 0,Class 0: Variance,Class 0: Mean,Class 1: Variance,Class 1: Mean
CreditLine,4597351.0,2001.224826,5508499.0,2371.602106
Classification,49.16805,6.140026,42.96082,6.153588
Department,26.42822,1.610621,25.81268,1.436464
DateAcquired,37283160000.0,-51793.802443,17082280000.0,-17791.563778
Circumference (cm),24.77031,0.003428,24.92368,0.005694
Depth (cm),46.17157,0.480578,578.5073,2.372132
Diameter (cm),26.03158,0.056674,71.99154,0.363855
Height (cm),642.2392,24.419334,3987.486,37.99457
Length (cm),30.33036,0.075678,209.2662,0.628691
Weight (kg),24.56376,1.4e-05,730313.9,6.068197


In [66]:
# Fold1 = Test
X_test, y_test = X_f1, y_f1 
X_train = np.concatenate((X_f3, X_f2), axis = 0)
y_train = np.concatenate((y_f3, y_f2), axis = 0)

In [67]:
%timeit -n1 -r1 gnb.fit(X_train, y_train)

32.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [68]:
y_pred = gnb.predict(X_test)

In [69]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.37      0.99      0.54     16061
           1       0.89      0.05      0.10     28852

    accuracy                           0.39     44913
   macro avg       0.63      0.52      0.32     44913
weighted avg       0.71      0.39      0.26     44913



In [70]:
results = {"Class 0: Variance":gnb.var_[0], "Class 0: Mean":gnb.theta_[0],
           "Class 1: Variance":gnb.var_[1], "Class 1: Mean":gnb.theta_[1]}

vm3 = pd.DataFrame(results, index=features)
vm3

Unnamed: 0,Class 0: Variance,Class 0: Mean,Class 1: Variance,Class 1: Mean
CreditLine,4528538.0,1975.115275,5492844.0,2374.131667
Classification,49.39518,6.10691,43.47971,6.172869
Department,26.82982,1.594316,26.26431,1.44066
DateAcquired,37911790000.0,-52925.101406,17398820000.0,-18291.743532
Circumference (cm),25.01822,0.0,25.16522,0.002461
Depth (cm),71.3287,0.512162,462.6658,2.257279
Diameter (cm),27.85865,0.071201,66.74633,0.351295
Height (cm),641.8459,24.399297,2449.632,37.872606
Length (cm),30.59909,0.076182,1408.387,0.772387
Weight (kg),25.01822,1.4e-05,608582.9,4.454813


We now look at the average variances and means.

In [71]:
vm = (vm1+vm2+vm3)/3
vm

| Feature | Class 0: Variance | Class 0: Mean | Class 1: Variance | Class 1: Mean |
| --- | --- | --- | --- | --- |
| CreditLine | 4547706.0 | 1985.103914 | 5509096.0 | 2378.170826 |
| Classification | 49.32921 | 6.128605 | 43.20893 | 6.160068 |
| Department | 26.67221 | 1.604557 | 26.0818 | 1.43894 |
| DateAcquired | 37549840000.0 | -52279.528858 | 17331490000.0 | -18165.034357 |
| Circumference (cm) | 24.97194 | 0.002282 | 25.07574 | 0.003914 |
| Depth (cm) | 64.05609 | 0.500591 | 514.6913 | 2.31239 |
| Diameter (cm) | 27.12175 | 0.063508 | 65.9731 | 0.349949 |
| Height (cm) | 629.9468 | 24.3763 | 3260.795 | 37.870276 |
| Length (cm) | 30.43689 | 0.07709 | 1008.43 | 0.713007 |
| Weight (kg) | 24.83447 | 9e-06 | 487757.1 | 4.301537 |


### Observations (Part 1)

Insights on relations.
- All the weights are positive
- Width and height have the largest weights of all the features

General Notes.
- Of the dimension attributes, height & width have the most effect 
  - Assuming most artworks have a height/width, this makes sense
- The classification of the artwork (i.e. painting, sculpture, etc.) had 26x more effect than the department (i.e. “Painting & Sculpture”, “Film”, etc.)
  - Classification is in a sense a subfield of department, so I wonder whether medium, a subfield of classification, has more of an effect?
- The number of male artists has more of an effect than the number of female artists
  - This could simply be related to the overwhelming number of male artists in the collection
- Interestingly, the mean & variance of film duration are strikingly different between the classes. 
  - Not Catalogued: 
    - Var: 4.98e+05
    - Mean: 28.52
  - Catalogued:
    - Var: 9.38e+06
    - Mean: 63.15
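
Per-class means and variances like the duration figures above can be cross-checked directly with a `groupby`. The sketch below uses made-up durations; `ddof=0` is chosen because `GaussianNB` uses the population variance (plus a small smoothing term).

```python
import pandas as pd

# Toy rows; the durations are made up.
toy = pd.DataFrame({
    "Cataloged": [0, 0, 1, 1, 1],
    "Duration (sec.)": [0.0, 60.0, 0.0, 90.0, 300.0],
})

# ddof=0 gives the population variance, the convention GaussianNB uses
# (GaussianNB additionally adds a small smoothing term to its variances).
summary = toy.groupby("Cataloged")["Duration (sec.)"].agg(
    mean="mean", var=lambda s: s.var(ddof=0)
)
```

That smoothing term is proportional to the largest feature variance, which is likely why features that are almost always zero still show a variance of about 25 in the tables above.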


## Analysis (Part 2)

The **second feature set** is below.

In [72]:
features = list(df.columns)
features.remove("Cataloged")

print("Features to look at: ")
print(features)

X = df[features].to_numpy()
y = df["Cataloged"].to_numpy()

Features to look at: 
['CreditLine', 'Classification', 'Department', 'DateAcquired', 'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)', 'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)', 'Duration (sec.)', 'silkscreen', 'white', 'platinum print', 'wood', 'bronze', 'screenprint', 'paint', 'cardboard', 'collotype', 'woodcut', 'video', 'letterpress', 'pencil', 'color', 'color print', 'chromogenic', 'ink', 'board', 'albumen', 'silent', 'watercolor', 'poster', 'oil', 'illustrated book', 'portfolio', 'pigmented', 'tracing paper', 'sound', 'canvas', 'inkjet print', 'charcoal', 'linoleum cut', 'photogravure', 'lithograph', 'matte', 'glass negative', 'etching', 'aquatint', 'dye transfer print', 'drypoint', 'engraving', 'ballpoint pen', 'photolithograph', 'paper', 'offset', 'silver print', 'gouache', 'black', 'gelatin', 'colored pencil', 'TimeStarted', 'TimeFinished', 'NumMales', 'NumFemales', 'NumArtists', 'FirstArtistListed']


We start off by creating three folds.

In [73]:
length = X.shape[0]
indices = list(range(length))
np.random.shuffle(indices)

mid = int(length//3)

X_f1, y_f1 = X[indices[:mid], :], y[indices[:mid]]
X_f2, y_f2 = X[indices[mid: (mid*2)], :], y[indices[mid: (mid*2)]]
X_f3, y_f3 = X[indices[(mid*2):], :], y[indices[(mid*2):]]


### Naive Bayes (Feature Set 2)

In [74]:
# Fold3 = Test
X_test, y_test = X_f3, y_f3 
X_train = np.concatenate((X_f1, X_f2), axis = 0)
y_train = np.concatenate((y_f1, y_f2), axis = 0)

In [75]:
%timeit -n1 -r1 gnb.fit(X_train, y_train)

417 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [76]:
y_pred = gnb.predict(X_test)

In [77]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.37      0.99      0.54     16168
           1       0.90      0.06      0.10     28745

    accuracy                           0.39     44913
   macro avg       0.63      0.52      0.32     44913
weighted avg       0.71      0.39      0.26     44913



In [78]:
results = {"Class 0: Variance":gnb.var_[0], "Class 0: Mean":gnb.theta_[0],
           "Class 1: Variance":gnb.var_[1], "Class 1: Mean":gnb.theta_[1]}

vm1 = pd.DataFrame(results, index=features)
vm1

Unnamed: 0,Class 0: Variance,Class 0: Mean,Class 1: Variance,Class 1: Mean
CreditLine,4.526331e+06,1976.279617,5.529487e+06,2382.292133
Classification,4.894180e+01,6.113756,4.285667e+01,6.148767
Department,2.639721e+01,1.601479,2.580533e+01,1.436313
DateAcquired,3.734183e+10,-51961.338679,1.705471e+10,-17758.878425
Circumference (cm),2.472360e+01,0.002246,2.477767e+01,0.003289
...,...,...,...,...
TimeFinished,1.135736e+03,1950.765967,1.204896e+03,1959.031848
NumMales,2.494810e+01,0.880628,2.507852e+01,0.863401
NumFemales,2.466173e+01,0.099466,2.474102e+01,0.188749
NumArtists,2.521891e+01,1.091978,2.520073e+01,1.126560


In [79]:
# Fold2 = Test
X_test, y_test = X_f2, y_f2 
X_train = np.concatenate((X_f1, X_f3), axis = 0)
y_train = np.concatenate((y_f1, y_f3), axis = 0)

In [80]:
%timeit -n1 -r1 gnb.fit(X_train, y_train)

398 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [81]:
y_pred = gnb.predict(X_test)

In [82]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.37      0.98      0.54     16065
           1       0.88      0.07      0.13     28848

    accuracy                           0.40     44913
   macro avg       0.63      0.53      0.33     44913
weighted avg       0.70      0.40      0.27     44913



In [83]:
results = {"Class 0: Variance":gnb.var_[0], "Class 0: Mean":gnb.theta_[0],
           "Class 1: Variance":gnb.var_[1], "Class 1: Mean":gnb.theta_[1]}

vm2 = pd.DataFrame(results, index=features)
vm2

Unnamed: 0,Class 0: Variance,Class 0: Mean,Class 1: Variance,Class 1: Mean
CreditLine,4.541673e+06,1981.213193,5.517574e+06,2383.783482
Classification,4.979054e+01,6.130963,4.356235e+01,6.164915
Department,2.702102e+01,1.601636,2.643885e+01,1.441410
DateAcquired,3.797113e+10,-53021.682621,1.763759e+10,-18608.439052
Circumference (cm),2.539445e+01,0.003421,2.541555e+01,0.003766
...,...,...,...,...
TimeFinished,1.138445e+03,1950.563103,1.208003e+03,1958.844552
NumMales,2.556582e+01,0.880481,2.571540e+01,0.861978
NumFemales,2.528602e+01,0.097717,2.537349e+01,0.189347
NumArtists,2.586681e+01,1.090471,2.586558e+01,1.125156


In [84]:
# Fold1 = Test
X_test, y_test = X_f1, y_f1 
X_train = np.concatenate((X_f3, X_f2), axis = 0)
y_train = np.concatenate((y_f3, y_f2), axis = 0)

In [85]:
%timeit -n1 -r1 gnb.fit(X_train, y_train)

401 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [86]:
y_pred = gnb.predict(X_test)

In [87]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.37      0.99      0.53     15986
           1       0.90      0.05      0.09     28927

    accuracy                           0.38     44913
   macro avg       0.63      0.52      0.31     44913
weighted avg       0.71      0.38      0.25     44913



In [88]:
results = {"Class 0: Variance":gnb.var_[0], "Class 0: Mean":gnb.theta_[0],
           "Class 1: Variance":gnb.var_[1], "Class 1: Mean":gnb.theta_[1]}

vm3 = pd.DataFrame(results, index=features)
vm3

Feature,Class 0: Variance,Class 0: Mean,Class 1: Variance,Class 1: Mean
CreditLine,4.574981e+06,1997.717401,5.480168e+06,2368.401594
Classification,4.925480e+01,6.141005,4.320797e+01,6.166548
Department,2.659822e+01,1.610523,2.600102e+01,1.439099
DateAcquired,3.733676e+10,-51856.483945,1.730206e+10,-18127.850503
Circumference (cm),2.479781e+01,0.001179,2.503408e+01,0.004691
...,...,...,...,...
TimeFinished,1.143108e+03,1950.753451,1.204125e+03,1958.959283
NumMales,2.513701e+01,0.880588,2.523176e+01,0.859844
NumFemales,2.485640e+01,0.102318,2.493176e+01,0.189242
NumArtists,2.540751e+01,1.094282,2.536176e+01,1.122341


We now average the per-class variances and means over the three folds.

In [89]:
vm = (vm1+vm2+vm3)/3
vm

Feature,Class 0: Variance,Class 0: Mean,Class 1: Variance,Class 1: Mean
CreditLine,4.547662e+06,1985.070070,5.509076e+06,2378.159070
Classification,4.932905e+01,6.128575,4.320900e+01,6.160077
Department,2.667215e+01,1.604546,2.608173e+01,1.438941
DateAcquired,3.754991e+10,-52279.835082,1.733145e+10,-18165.055993
Circumference (cm),2.497195e+01,0.002282,2.507577e+01,0.003915
...,...,...,...,...
TimeFinished,1.139097e+03,1950.694174,1.205675e+03,1958.945228
NumMales,2.521698e+01,0.880566,2.534190e+01,0.861741
NumFemales,2.493472e+01,0.099834,2.501542e+01,0.189113
NumArtists,2.549774e+01,1.092244,2.547603e+01,1.124686


In [90]:
# vm.to_csv("avg_vm_additional.csv", sep=',')

### Logistic Regression (Feature Set 2)

In [91]:
# Fold3 = Test
X_test, y_test = X_f3, y_f3 
X_train = np.concatenate((X_f1, X_f2), axis = 0)
y_train = np.concatenate((y_f1, y_f2), axis = 0)

In [92]:
# lr.fit trains lr in place; a name assigned inside %timeit would not persist.
%timeit -n1 -r1 lr.fit(X_train, y_train)

1.69 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [93]:
y_pred = lr.predict(X_test)

In [94]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.54      0.08      0.14     16168
           1       0.65      0.96      0.78     28745

    accuracy                           0.64     44913
   macro avg       0.60      0.52      0.46     44913
weighted avg       0.61      0.64      0.55     44913



In [95]:
w1 = pd.DataFrame(lr.coef_[0], columns=["Weights"], index=features)
w1

Feature,Weights
CreditLine,8.740665e-05
Classification,-8.413380e-05
Department,-1.452519e-04
DateAcquired,1.559072e-06
Circumference (cm),5.568949e-07
...,...
TimeFinished,-1.004475e-04
NumMales,-1.377356e-05
NumFemales,5.121368e-05
NumArtists,9.830890e-06


In [96]:
lr.intercept_

array([-1.71197423e-06])

In [97]:
# class_names = classifier.classes_

# np.set_printoptions(precision=2)

# # Plot non-normalized confusion matrix
# titles_options = [
#     ("F1: Confusion matrix, without normalization", None),
#     ("F1: Normalized confusion matrix", "true"),
# ]
# for title, normalize in titles_options:
#     disp = ConfusionMatrixDisplay.from_estimator(
#         classifier,
#         X_test,
#         y_test,
#         display_labels=class_names[:8],
#         cmap=plt.cm.Blues,
#         normalize=normalize,
#     )
#     disp.ax_.set_title(title)
#     disp.ax_.set_figsize=(10, 10)
#     print(title)
#     print(disp.confusion_matrix)

# plt.show()

In [98]:
# Fold2 = Test
X_test, y_test = X_f2, y_f2 
X_train = np.concatenate((X_f1, X_f3), axis = 0)
y_train = np.concatenate((y_f1, y_f3), axis = 0)

In [99]:
# lr.fit trains lr in place; a name assigned inside %timeit would not persist.
%timeit -n1 -r1 lr.fit(X_train, y_train)

1.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [100]:
y_pred = lr.predict(X_test)

In [101]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.54      0.08      0.14     16065
           1       0.65      0.96      0.78     28848

    accuracy                           0.65     44913
   macro avg       0.59      0.52      0.46     44913
weighted avg       0.61      0.65      0.55     44913



In [102]:
w2 = pd.DataFrame(lr.coef_[0], columns=["Weights"], index=features)
w2

Feature,Weights
CreditLine,8.823695e-05
Classification,-8.512570e-05
Department,-1.395393e-04
DateAcquired,1.563881e-06
Circumference (cm),7.814084e-08
...,...
TimeFinished,-9.535415e-05
NumMales,-1.427362e-05
NumFemales,5.140628e-05
NumArtists,9.704267e-06


In [104]:
# Fold1 = Test
X_test, y_test = X_f1, y_f1 
X_train = np.concatenate((X_f3, X_f2), axis = 0)
y_train = np.concatenate((y_f3, y_f2), axis = 0)

In [105]:
# lr.fit trains lr in place; a name assigned inside %timeit would not persist.
%timeit -n1 -r1 lr.fit(X_train, y_train)

2.63 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [106]:
y_pred = lr.predict(X_test)

In [107]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.56      0.08      0.15     15986
           1       0.66      0.96      0.78     28927

    accuracy                           0.65     44913
   macro avg       0.61      0.52      0.46     44913
weighted avg       0.62      0.65      0.55     44913



In [108]:
w3 = pd.DataFrame(lr.coef_[0], columns=["Weights"], index=features)
w3

Feature,Weights
CreditLine,0.000081
Classification,-0.000089
Department,-0.000149
DateAcquired,0.000002
Circumference (cm),0.000002
...,...
TimeFinished,-0.000103
NumMales,-0.000016
NumFemales,0.000051
NumArtists,0.000007


We calculate the average weights over the three folds.

In [110]:
w = (w1+w2+w3)/3
w

Feature,Weights
CreditLine,8.555990e-05
Classification,-8.602097e-05
Department,-1.444568e-04
DateAcquired,1.554406e-06
Circumference (cm),9.515379e-07
...,...
TimeFinished,-9.969518e-05
NumMales,-1.478274e-05
NumFemales,5.106589e-05
NumArtists,8.702412e-06
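
An average alone does not show how much a weight moves between folds. As a hedged sketch, the per-fold weights can be placed side by side so that the mean, the spread and the sign agreement come from one table. The values are the first three rows as displayed in the fold tables above. The third fold is printed rounded to six decimals, so the mean here differs slightly from the averaged table. The names `fold_weights`, `stacked` and `summary` are hypothetical.

```python
import pandas as pd

rows = ["CreditLine", "Classification", "Department"]

# Weights copied from the displayed fold tables; the last column is rounded as printed.
fold_weights = {
    "fold3_test": [8.740665e-05, -8.413380e-05, -1.452519e-04],
    "fold2_test": [8.823695e-05, -8.512570e-05, -1.395393e-04],
    "fold1_test": [0.000081, -0.000089, -0.000149],
}
stacked = pd.DataFrame(fold_weights, index=rows)

# Per-attribute mean, sample standard deviation, and whether all folds agree on the sign.
summary = pd.DataFrame(
    {
        "Mean": stacked.mean(axis=1),
        "Std": stacked.std(axis=1),
        "SameSign": (stacked > 0).all(axis=1) | (stacked < 0).all(axis=1),
    }
)
print(summary)
```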


In [111]:
# w.to_csv("avg_w_additional.csv", sep=',')

### Observations (Part 2)

Insights on relations.
 - The logistic regression now assigns negative weights to some attributes.
 - A couple of the attributes that appear in both feature sets reverse the direction of their relation once the new attributes are introduced.
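
A logistic regression is linear in the log-odds: the log-odds of class 1 equal the intercept plus the weighted sum of the attributes. A negative weight therefore means that raising the attribute lowers the predicted odds of class 1, and raising it by one unit multiplies the odds by the exponential of the weight. The sketch below illustrates that reading with two of the averaged weights from the table above. `odds_multiplier` is a hypothetical helper, and "one unit" means one step in whatever encoding the attribute uses.

```python
import math

# Averaged weights copied from the table above.
weights = {"Department": -1.444568e-04, "NumFemales": 5.106589e-05}


def odds_multiplier(weight, delta=1.0):
    """Factor by which the odds of class 1 change when the attribute rises by delta."""
    return math.exp(weight * delta)


for name, weight in weights.items():
    direction = "lowers" if weight < 0 else "raises"
    factor = odds_multiplier(weight)
    print(f"{name}: +1 unit {direction} the odds of class 1 by a factor of {factor:.6f}")
```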

General Notes.

 - Of the dimension attributes, height and width have the largest effect.
   - This holds even when accounting for medium types.
 - The classification of the artwork had about 1.7x less effect than the department (average absolute weights of 8.60e-05 for Classification against 1.44e-04 for Department).
   - This differs widely from the result when we did not account for mediums.
   - On the question of mediums as subfields, 8 fields have weights somewhat comparable to classification: portfolio, black, lithograph, drypoint, pencil, illustrated book, paper, color.
     - "Comparable" here means an absolute weight between 0.0001 and 0.0004.
 - The number of female artists has more of an effect than the number of male artists.
   - This is the reverse of the result before medium types were included.
 - The variances of all the medium types are similar.
 - For the mediums with the most weight, the class means mostly differ by a large margin, though for some the difference is small:
   - Not Catalogued:
     - portfolio: 0.026482579861738
     - black: 0.112505069970958
     - lithograph: 0.138698558962618
   - Catalogued:
     - portfolio: 0.0817274676381294
     - black: 0.0114768301798104
     - lithograph: 0.182224500679839
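
The relative-effect comparisons in these notes can be reproduced from the averaged weights. The sketch below uses only the nine rows shown in the truncated weight table, so the medium attributes hidden behind the `...` row are absent. It also assumes that raw weights can be compared across attributes, which requires the attributes to be on similar scales. `avg_weights`, `ranked` and `effect_ratio` are hypothetical names.

```python
# Averaged logistic regression weights, copied from the displayed rows.
avg_weights = {
    "CreditLine": 8.555990e-05,
    "Classification": -8.602097e-05,
    "Department": -1.444568e-04,
    "DateAcquired": 1.554406e-06,
    "Circumference (cm)": 9.515379e-07,
    "TimeFinished": -9.969518e-05,
    "NumMales": -1.478274e-05,
    "NumFemales": 5.106589e-05,
    "NumArtists": 8.702412e-06,
}

# Rank attributes by absolute weight, largest first.
ranked = sorted(avg_weights, key=lambda name: abs(avg_weights[name]), reverse=True)


def effect_ratio(a, b):
    """Ratio of the absolute weight of attribute a to that of attribute b."""
    return abs(avg_weights[a]) / abs(avg_weights[b])


print(ranked)
print(round(effect_ratio("Department", "Classification"), 2))  # 1.68
print(round(effect_ratio("NumFemales", "NumMales"), 2))        # 3.45
```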

