# Artist classification by painting

This project explores the merits of using machine learning to supplement art experts' efforts to identify forgeries among disputed paintings. From the movement of brushstrokes to the use of light and dark, successful algorithms will likely need to capture many aspects of a painter's unique style.

[Multi Label Image Classification Simplified Model](https://www.analyticsvidhya.com/blog/2019/04/build-first-multi-label-image-classification-model-python/)

# Exploratory Data Analysis

### Download data

We will start with the "train.zip" data from the [Painter by Numbers](https://www.kaggle.com/competitions/painter-by-numbers/overview) competition on Kaggle. The full train.zip is 38.7 GB, which takes a long time to download and load.

To fit our timeline, we will use a subset of the full training data instead: "train_1.zip".

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image, ImageFile
from tqdm import tqdm
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.utils import to_categorical

Let's take a look at the first painting.

In [None]:
im = Image.open(r'C:\Users\14794\Documents\NACME_capstone\train_1\train_1\1.jpg')
im  # the last expression in a notebook cell is displayed inline

Let's take a look at the first 10 images in the train_1 folder.

In [None]:
# List all files in the train_1 folder
list_dir = os.listdir('train_1/train_1/')

# Display the first 10 images
for name in list_dir[:10]:
    img = Image.open(os.path.join('train_1/train_1/', name))
    plt.imshow(img)
    plt.axis('off')
    plt.show()


Notice how the files are ordered. The filenames do not run in numeric order (1, 2, 3, ...); instead they run 1, 10, 11, 12, 14, ..., which is what you get when names are compared as strings rather than as numbers. Next, load all_data_info.csv.
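As a brief aside before loading the CSV: if numeric ordering of the files is ever needed, a "natural sort" key is one way to get it. The sketch below is illustrative only; `natural_key` and the filename list are hypothetical and not part of the original notebook.

```python
import re

def natural_key(filename):
    """Split a filename into text and integer chunks so '10.jpg' sorts after '2.jpg'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', filename)]

# Hypothetical filenames that mimic the pattern seen in train_1
example_files = ['1.jpg', '10.jpg', '11.jpg', '12.jpg', '14.jpg', '2.jpg', '3.jpg']

string_order = sorted(example_files)                    # plain string comparison
numeric_order = sorted(example_files, key=natural_key)  # numbers compared as integers

print(string_order)
print(numeric_order)
```

The order does not matter for the merge that follows, since rows are matched by filename rather than by position.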

In [None]:
# Load the data
labels = pd.read_csv('C:/Users/14794/Documents/NACME_capstone/all_data_info.csv/all_data_info.csv')
labels.head(10)

Viewing the first 10 rows, we can see that this dataframe describes every image in both the train and test datasets. We need to filter it down to the rows that correspond to our train_1 dataset.

In [None]:
path = 'train_1/train_1/'  # folder that holds the train_1 images
list_df = pd.DataFrame(list_dir, columns = ['new_filename'])
list_df['path'] = path + list_df['new_filename']  # one path per row, built column-wise
print(list_df)

We named the list_dir column "new_filename" because the merge needs a column that both dataframes share. We also store each file's path in the dataframe in case it is needed later.

In [None]:
new_df = pd.merge(labels, list_df, on ="new_filename")
new_df = new_df.dropna()

Dropping all rows that contain NaN values cost us roughly 2,000 rows.
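One way to check how many rows the merge and `dropna` remove, and which column is responsible, is sketched below. The two toy dataframes are made-up stand-ins for `labels` and `list_df`, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for `labels` and `list_df`; all values are invented for illustration
toy_labels = pd.DataFrame({
    'new_filename': ['1.jpg', '2.jpg', '3.jpg', '4.jpg'],
    'artist': ['artist_a', None, 'artist_b', 'artist_a'],
})
toy_files = pd.DataFrame({'new_filename': ['1.jpg', '2.jpg', '3.jpg']})

merged = pd.merge(toy_labels, toy_files, on='new_filename')  # inner join by default
cleaned = merged.dropna()

rows_lost = len(merged) - len(cleaned)
print(f'{len(merged)} rows after merge, {len(cleaned)} after dropna, {rows_lost} dropped')
print(merged.isna().sum())  # NaN count per column shows which column caused the drops
```

Running the same three lines of bookkeeping on the real `new_df` would confirm the row loss reported above.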

Now we can do some exploratory data analysis (EDA).

In [None]:
print(new_df["artist"].unique())        # unique artist names
print(new_df["artist"].nunique())       # number of unique artists
print(new_df.columns)                   # all column names
count = new_df.groupby(['artist'])['artist'].count()  # number of entries per artist
print(count.sort_values(ascending = False).tail(10))  # the 10 least prominent artists
print(count.sort_values(ascending = False).head(10))  # the 10 most prominent artists

So now we know that we have around 500 artists and around 8,000 paintings to use. We also see that many of the artists in this reduced dataset have very few paintings. Many artists have only 1 entry, and others have fewer than 5, which makes it very difficult for our model to learn from them.
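The imbalance described above can be quantified directly from the per-artist counts. This is a minimal sketch on a made-up dataframe (the artist labels and counts are hypothetical), using `value_counts` in place of the `groupby` shown earlier.

```python
import pandas as pd

# Hypothetical data: one prolific artist and three sparse ones
toy = pd.DataFrame({'artist': ['a'] * 12 + ['b'] * 4 + ['c'] + ['d']})

counts = toy['artist'].value_counts()     # paintings per artist, most prolific first
singletons = int((counts == 1).sum())     # artists with exactly one painting
under_five = int((counts < 5).sum())      # artists with fewer than five paintings

print(counts)
print(f'{singletons} artists have 1 painting; {under_five} have fewer than 5')
```

Applied to `new_df['artist']`, the same two counts give a concrete measure of how many classes are too sparse to train on.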

To deal with these sparse artists, we first have to tidy our data!
 - Here we will merge the dataframe holding the pictures and the dataframe holding all the information

In [None]:
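# Keep only the label rows whose file exists in the training folder
new_df = pd.merge(labels, list_df, on="new_filename")
# Let PIL load image files that are truncated instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True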

We merged the dataframes so that the result only includes files that are actually in the training folder. This way we never ask the computer to open a picture we don't have.

The next step is converting every image file into an array of numbers so the model can actually read it.

In [None]:
def data_preprocessing():
    # Load each painting, resize it to 224x224 and scale pixel values to [0, 1]
    painting_images = []
    for filename in tqdm(new_df['new_filename']):
        img = image.load_img('train_1/train_1/' + filename, target_size=(224, 224))
        img = image.img_to_array(img)
        img = img / 255
        painting_images.append(img)
    X = np.array(painting_images)
    y = np.array(new_df['artist'])
    return X, y
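To make the image-to-array conversion concrete, here is a small NumPy-only sketch. It uses randomly generated pixel data as a stand-in for decoded 224x224 RGB images (an assumption, so no files are needed) and estimates how much memory the roughly 8,000 paintings mentioned earlier would occupy as float32.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two decoded 224x224 RGB images with 8-bit channels
fake_images = [rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
               for _ in range(2)]

# Same scaling as in the loading loop: divide by 255 to map pixels into [0, 1]
scaled = [img.astype('float32') / 255 for img in fake_images]
X_demo = np.array(scaled)

bytes_per_image = X_demo[0].nbytes   # 224 * 224 * 3 values, 4 bytes each
approx_images = 8000                 # rough dataset size quoted in the text
total_gib = bytes_per_image * approx_images / 1024 ** 3

print(X_demo.shape, X_demo.dtype, X_demo.min(), X_demo.max())
print(f'about {total_gib:.1f} GiB for {approx_images} images')
```

If that footprint is too large for the machine at hand, loading images in batches is a common alternative, though the notebook does not do so.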

Now we will define a function that builds the model. We designed our own architecture, based on popular models we found online.

In [None]:
def create_model(num_classes):
    model = Sequential()
    model.add(Conv2D(filters=16, kernel_size=(5, 5), activation="relu", input_shape=(224, 224, 3)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Conv2D(filters=32, kernel_size=(5, 5), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Conv2D(filters=64, kernel_size=(5, 5), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Conv2D(filters=64, kernel_size=(5, 5), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))

    model.add(Dense(64, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))

    # new addition
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.25))
    
    model.add(Dense(num_classes, activation='softmax'))
    model.summary()

    return model
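It can help to trace how the feature map shrinks through the four convolution and pooling stages above. The sketch below does that arithmetic by hand, assuming the Keras defaults used in `create_model` (no padding, stride 1 for `Conv2D`, and a pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, pool):
    """Output width of max pooling whose stride equals the pool size."""
    return size // pool

size = 224
filters = [16, 32, 64, 64]            # filter counts of the four Conv2D layers
for n_filters in filters:
    size = pool_out(conv_out(size, kernel=5), pool=2)
    print(f'after conv({n_filters}) + pool: {size}x{size}x{n_filters}')

flatten_units = size * size * filters[-1]   # inputs to the first Dense layer
dense_params = flatten_units * 128 + 128    # weights plus biases of Dense(128)
print(flatten_units, dense_params)
```

These values should match the shapes and parameter counts reported by `model.summary()`.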

Here is the second model we created; it is not as good as the first model.

In [None]:
def RESNET50(num_classes):
    resnet = Sequential()
    # ImageNet-pretrained backbone without its classification head
    pretrained = tf.keras.applications.ResNet50(include_top = False,
                                                input_shape = (224, 224, 3),
                                                pooling = 'avg',
                                                weights = 'imagenet')
    # Freeze the backbone so only the new dense layers are trained
    for layer in pretrained.layers:
        layer.trainable = False

    resnet.add(pretrained)
    resnet.add(Flatten())
    resnet.add(Dense(128, activation='relu'))
    resnet.add(Dropout(0.5))
    resnet.add(Dense(64, activation='relu'))

    resnet.add(Dense(32, activation='relu'))
    resnet.add(Dropout(0.5))

    resnet.add(Dropout(0.25))
    resnet.add(Dense(num_classes, activation='softmax'))

    return resnet

Now we create the function that cleans our data further! It removes the artists that have too few paintings (fewer than 10) and also performs the train/test split.

In [None]:
def cleaning_data(x_array, y_array):
    dataframe_1 = pd.DataFrame(x_array.reshape((y_array.shape[0], -1)), columns=list(range(150528)))
    dataframe_2 = pd.DataFrame(y_array, columns=["filename"])
    df = pd.merge(dataframe_1, dataframe_2, left_index=True, right_index=True)
    df = df.dropna()
    df['count'] = df.groupby('filename')['filename'].transform('count')
    dataframe = df[df['count'] >= 10]
    print(dataframe.head(10))
    labels = [x for x in dataframe["filename"].unique()]
    num_artists = len(labels)
    print(num_artists)

    print(dataframe.isna().sum())

    X_array = dataframe[list(range(150528))].to_numpy()
    Y_array = dataframe['filename'].to_numpy()
    artists = list(np.unique(Y_array))
    num_classes = len(artists)
    Y_array = np.array([artists.index(i) for i in Y_array])
    Y_array = to_categorical(Y_array)

    x_train, x_test, y_train, y_test = train_test_split(X_array, Y_array, random_state=42, test_size=0.15, shuffle=True)
    x_train = x_train.reshape((y_train.shape[0], 224, 224, 3))
    x_test = x_test.reshape((y_test.shape[0], 224, 224, 3))
    # Sanity-check the shapes of the resulting splits
    print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

    return x_train, x_test, y_train, y_test, num_artists
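The filtering and label-encoding steps inside `cleaning_data` can also be expressed directly on NumPy arrays, which avoids building a dataframe with 150,528 pixel columns. The following is a sketch of that alternative, not the notebook's method; `filter_and_encode` and the toy arrays are hypothetical.

```python
import numpy as np

def filter_and_encode(x, y, min_count=10):
    """Drop classes with fewer than min_count samples, then one-hot encode the rest."""
    y = np.asarray(y).ravel()
    _, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    keep = counts[inverse] >= min_count          # per-sample mask: is my class big enough?
    x_kept, y_kept = x[keep], y[keep]
    classes, class_ids = np.unique(y_kept, return_inverse=True)
    one_hot = np.eye(len(classes), dtype='float32')[class_ids]
    return x_kept, one_hot, classes

# Toy data: 25 tiny "images" from three hypothetical artists
x_toy = np.zeros((25, 4, 4, 3), dtype='float32')
y_toy = np.array(['a'] * 12 + ['b'] * 10 + ['c'] * 3)

x_kept, one_hot, classes = filter_and_encode(x_toy, y_toy)
print(x_kept.shape, one_hot.shape, list(classes))
```

The returned arrays can be passed to `train_test_split` as before, and the image array keeps its (N, 224, 224, 3) shape throughout, so no reshaping is needed.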

Here is the main code, which trains, saves, and evaluates both models.

In [None]:
if __name__ == "__main__":

    x_array, y_array = data_preprocessing()
    x_train, x_test, y_train, y_test, num_classes = cleaning_data(x_array, y_array)
    model = create_model(num_classes)
    model.summary()
    model.compile(optimizer = 'adam', loss = "categorical_crossentropy", metrics = ['accuracy'])
    history = model.fit(x_train, y_train, epochs=10, batch_size = 18)
    model.save('./ckpts/Model_1')
    ResNet = RESNET50(num_classes)
    ResNet.summary()
    ResNet.compile(optimizer = 'adam', loss = "categorical_crossentropy", metrics = ['accuracy'])
    ResHistory = ResNet.fit(x_train, y_train, epochs = 10, batch_size = 18)
    ResNet.save('./ckpts/ResNet')

    # Evaluate both models on the held-out test set
    model.evaluate(x_test, y_test)
    ResNet.evaluate(x_test, y_test)