### Imports
1. **Libraries Used**:
   - `LabelEncoder` from `sklearn.preprocessing`: Encodes categorical labels into numerical labels.
   - `train_test_split` from `sklearn.model_selection`: Splits datasets into training and validation sets.
   - `layers, models` from `tensorflow.keras`: Used to create and manage Convolutional Neural Network (CNN) models.
   - `Image` from `PIL`: Used for image loading and manipulation.
   - `pandas`, `numpy`, `os`, `matplotlib`: Libraries for data manipulation, mathematical operations, file handling, and plotting respectively.

In [1]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models
import tensorflow as tf
import pandas as pd
import os
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt


2024-12-03 03:23:35.800336: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Loading Artist Metadata
- **Loading Artist Data**:
  - Loads data from `artists.csv` which contains metadata about the artists, such as their names, genres, and nationality.
  - Filters the metadata for a specific artist, *Albrecht Dürer*, and displays the artist info, which includes the Wikipedia link, genre, nationality, and more.
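
One thing that may be worth checking when filtering on a name such as *Albrecht Dürer*: the letter `ü` can be stored either as a single code point or as `u` followed by a combining diaeresis, and the two forms do not compare equal. The sketch below uses only the standard library and hard-coded strings (not the real `artists.csv`) to show the pitfall and one way to guard against it. Whether the actual CSV or the image directory names are affected is an assumption to verify.

```python
import unicodedata

# Two visually identical spellings of the same name (illustrative strings).
composed = "Albrecht D\u00fcrer"      # "ü" as one code point (NFC form)
decomposed = "Albrecht Du\u0308rer"   # "u" + combining diaeresis (NFD form)


def nfc(text: str) -> str:
    """Return the NFC-normalized form of a string."""
    return unicodedata.normalize("NFC", text)


# A plain comparison treats the two spellings as different strings.
plain_match = composed == decomposed

# After normalizing both sides to the same form, they compare equal.
normalized_match = nfc(composed) == nfc(decomposed)

print(plain_match, normalized_match)
```

With pandas, the same normalization can be applied to a whole column through `artists_df['name'].str.normalize('NFC')` before comparing.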


In [2]:
# Load the metadata from artists.csv
artists_csv_path = '/Users/ijunhyeong/Desktop/2024_Fall/STAT4060J/Project/dataset/artists.csv'
artists_df = pd.read_csv(artists_csv_path)

# Filter metadata for Albrecht Dürer
durer_info = artists_df[artists_df['name'] == 'Albrecht Dürer']
print(durer_info)  # This will display the artist info

    id            name        years                 genre nationality  \
19  19  Albrecht Dürer  1471 - 1528  Northern Renaissance      German   

                                                  bio  \
19  Albrecht Dürer (; German: [ˈʔalbʁɛçt ˈdyːʁɐ]; ...   

                                      wikipedia  paintings  
19  http://en.wikipedia.org/wiki/Albrecht_Dürer        328  


### Data Preparation
- **Image Loading and Processing**:
  - The images are stored in one subdirectory per artist and are resized to a uniform `224x224` so that every input to the CNN has the same shape.
  - Each image is loaded, converted to RGB, resized, and scaled so that pixel values fall in `[0, 1]`.
  - The artist's name is used as a label for each image.
  
- **Limiting Data**:
  - To reduce memory and computational load, the script processes at most `10` images per artist from the first `50` artists.
  - This keeps the dataset small enough to fit in memory on a machine with limited resources.
- **Encoding Artist Labels**:
  - The labels (artist names) are converted into numerical form using `LabelEncoder`. This is necessary because neural networks require numerical labels.
- **Dataset Creation**:
  - Converts the list of images (`X`) and labels (`y`) to numpy arrays, which are better suited for efficient computation.
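  - Prints the shape of the dataset. With `10` images from each of `50` artist directories this gives `500` images, each of shape `(224, 224, 3)` (RGB). Artists whose directory is missing are skipped, so the count can be lower.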
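
The per-image steps described above can be tried in isolation. The following is a minimal sketch, assuming Pillow and NumPy are installed. It uses a synthetic grayscale image instead of a file from the dataset, and the helper name `preprocess` is hypothetical. Unlike the notebook cell, it stores the result as `float32`, which halves the memory needed compared with the `float64` arrays that `np.array(img) / 255.0` produces.

```python
import numpy as np
from PIL import Image


def preprocess(img: Image.Image, size=(224, 224)) -> np.ndarray:
    """Convert to RGB, resize, and scale pixel values to [0, 1]."""
    rgb = img.convert("RGB")      # Grayscale or RGBA inputs become 3-channel
    resized = rgb.resize(size)    # size is (width, height)
    return np.asarray(resized, dtype=np.float32) / 255.0


# A synthetic grayscale image stands in for a painting loaded from disk.
dummy = Image.new("L", (640, 480), color=128)
arr = preprocess(dummy)

print(arr.shape, arr.dtype, float(arr.min()), float(arr.max()))
```

For 500 images of shape `(224, 224, 3)`, `float64` storage takes roughly 600 MB, while `float32` takes roughly 300 MB.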

In [3]:
# Get the list of artist names
artist_names = artists_df['name'].tolist()

# Directory containing all paintings from all artists
image_base_dir = "/Users/ijunhyeong/Desktop/2024_Fall/STAT4060J/Project/dataset/resized"

# Desired image size (e.g., 224x224)
desired_size = (224, 224)

# List to store image data and labels
X = []
y = []

In [4]:
# Loop through artist directories
for artist in artist_names[:50]:  # Limit to the first 50 artists
    artist_dir = os.path.join(image_base_dir, artist.replace(" ", "_"))
    
    if not os.path.isdir(artist_dir):
        # Skip if the directory for the artist does not exist
        continue

    # Loop through each image file in the artist directory (limit to 10 paintings per artist).
    # The limit keeps memory use manageable on a slow machine. To load every painting,
    # replace the limited loop further down with the loop inside the string below.
    '''
    # Loop through each image file in the artist directory
    for filename in os.listdir(artist_dir):
        if filename.endswith(".jpg"):
            file_path = os.path.join(artist_dir, filename)

            # Load image, resize to desired size, and convert to RGB to ensure uniformity
            with Image.open(file_path) as img:
                img_rgb = img.convert('RGB')  # Convert to RGB if not already
                img_resized = img_rgb.resize(desired_size)  # Resize to the same size
                img_array = np.array(img_resized) / 255.0  # Normalize pixel values to [0, 1]

            # Append to list
            X.append(img_array)
            y.append(artist)  # Label with the artist's name
    '''
    
    count = 0
    for filename in os.listdir(artist_dir):
        if count >= 10:
            break  # Stop scanning once 10 paintings have been loaded
        if filename.endswith(".jpg"):
            file_path = os.path.join(artist_dir, filename)

            # Load image, resize to desired size, and convert to RGB to ensure uniformity
            with Image.open(file_path) as img:
                img_rgb = img.convert('RGB')  # Convert to RGB if not already
                img_resized = img_rgb.resize(desired_size)  # Resize to the same size
                img_array = np.array(img_resized) / 255.0  # Normalize pixel values to [0, 1]
                
                # Append to list
                X.append(img_array)
                y.append(artist)  # Label with the artist's name
                count += 1

# Convert lists to numpy arrays
X = np.array(X)
y = np.array(y)

# Check the shape of the dataset
print(f"Number of images: {X.shape[0]}, Image shape: {X.shape[1:]}")

Number of images: 500, Image shape: (224, 224, 3)


In [5]:
# Encode the labels (artist names)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
print(f"Encoded labels: {y_encoded}")

# Print the classes to see which label corresponds to each artist
print(f"Classes: {label_encoder.classes_}")

Encoded labels: [ 2  2  2  2  2  2  2  2  2  2 47 47 47 47 47 47 47 47 47 47  8  8  8  8
  8  8  8  8  8  8  7  7  7  7  7  7  7  7  7  7 43 43 43 43 43 43 43 43
 43 43 44 44 44 44 44 44 44 44 44 44 11 11 11 11 11 11 11 11 11 11  3  3
  3  3  3  3  3  3  3  3 48 48 48 48 48 48 48 48 48 48 19 19 19 19 19 19
 19 19 19 19 24 24 24 24 24 24 24 24 24 24 28 28 28 28 28 28 28 28 28 28
 32 32 32 32 32 32 32 32 32 32 33 33 33 33 33 33 33 33 33 33 37 37 37 37
 37 37 37 37 37 37 38 38 38 38 38 38 38 38 38 38 15 15 15 15 15 15 15 15
 15 15 16 16 16 16 16 16 16 16 16 16 13 13 13 13 13 13 13 13 13 13  0  0
  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1 40 40 40 40 40 40
 40 40 40 40 30 30 30 30 30 30 30 30 30 30 18 18 18 18 18 18 18 18 18 18
 45 45 45 45 45 45 45 45 45 45  6  6  6  6  6  6  6  6  6  6 29 29 29 29
 29 29 29 29 29 29  9  9  9  9  9  9  9  9  9  9 21 21 21 21 21 21 21 21
 21 21 26 26 26 26 26 26 26 26 26 26 10 10 10 10 10 10 10 10 10 10 42 42
 42 42 42 42 42 42 42 42 46 46 46 4
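
The integer codes printed above follow the sorted order of the artist names, which is why the first artist loaded does not get code `0`. As a rough illustration of that mapping, the sketch below reproduces it with `np.unique`. The three names are examples rather than the notebook's full label list, and this NumPy version is a stand-in for `LabelEncoder`, not a replacement for it.

```python
import numpy as np

# Illustrative labels; the names are examples, not the notebook's full list.
y_demo = np.array([
    "Amedeo Modigliani",
    "Amedeo Modigliani",
    "Albrecht Dürer",
    "Vasiliy Kandinskiy",
])

# Sorted unique classes, plus each label's index into that sorted array.
classes, codes = np.unique(y_demo, return_inverse=True)

# Decoding is plain indexing, analogous to LabelEncoder.inverse_transform.
decoded = classes[codes]

print(classes.tolist())
print(codes.tolist())
```

Because the codes are only identifiers, their numeric order carries no meaning; a classifier trained on them would normally use a loss such as sparse categorical cross-entropy rather than treat them as quantities.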

- **Splitting Data**:
  - The dataset is split into an `80%` training set and a `20%` validation set to train and evaluate the model's performance.
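
With only 10 images per artist, a purely random split can leave some artists with very few or no validation images. `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. The sketch below shows the effect on synthetic labels with the same layout as the dataset (50 classes, 10 samples each). The arrays `X_demo` and `y_demo` are placeholders, and using stratification here is a suggestion, not what the notebook cell does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 50 classes with 10 samples each.
y_demo = np.repeat(np.arange(50), 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # Placeholder features

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many samples of each class landed in each part.
train_counts = np.bincount(y_tr, minlength=50)
val_counts = np.bincount(y_va, minlength=50)

print(len(y_tr), len(y_va))
print(train_counts.min(), train_counts.max(), val_counts.min(), val_counts.max())
```

In the notebook this would correspond to passing `stratify=y_encoded` to the existing call.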

In [6]:
# Split the data: 80% for training, 20% for validation
X_train, X_val, y_train, y_val = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

print(f"Training samples: {X_train.shape[0]}, Validation samples: {X_val.shape[0]}")

Training samples: 400, Validation samples: 100


### CNN Model for Feature Extraction
- **Model Creation**:
  - A basic CNN model with three convolutional layers (`Conv2D`) followed by max-pooling (`MaxPooling2D`) layers.
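  - The model ends with a `Flatten` layer that turns the final feature maps (height × width × channels) into a 1D feature vector per image.
  - The model is neither compiled nor trained before `predict` is called, so the extracted features come from randomly initialized filters. They are intended as input for further analysis or classification.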
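
To get a feel for how large the flattened feature vector is, the layer shapes can be worked out by hand. The sketch below assumes the Keras defaults used in the model definition (3x3 kernels with stride 1 and no padding, 2x2 pooling with stride 2). The helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size: int, kernel: int = 3) -> int:
    """Output length of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1


def pool_out(size: int, pool: int = 2) -> int:
    """Output length of non-overlapping max pooling; a leftover row is dropped."""
    return size // pool


side = 224
for filters in (32, 64, 128):
    side = pool_out(conv_out(side))
    print(f"after Conv2D({filters}) + MaxPooling2D: {side}x{side}x{filters}")

flat_len = side * side * 128          # Length of one flattened feature vector
bytes_for_500 = 500 * flat_len * 4    # 500 images stored as float32

print(flat_len, bytes_for_500 / 1e6, "MB")
```

Each image therefore maps to a vector of 86,528 values, and the feature matrix for 500 images occupies about 173 MB as `float32`.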

### Code Limitations
- Some parts of the code (for example, saving all features to a CSV file) are commented out because of system limitations. They can be re-enabled on a machine with more memory and compute for a more comprehensive analysis.
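
Writing every feature vector into a CSV cell as a Python list produces a large file that is slow to parse back. One possible alternative, sketched here with small random arrays and a temporary directory, is NumPy's compressed `.npz` format. The names ending in `_demo` and the array shapes are placeholders, and this is not the export used in the notebook.

```python
import os
import tempfile

import numpy as np

# Small random stand-ins for the real arrays (shapes are illustrative).
rng = np.random.default_rng(0)
features_demo = rng.random((4, 16), dtype=np.float32)
labels_demo = np.array(["artist_a", "artist_a", "artist_b", "artist_b"])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "painter_image_features.npz")

    # One compressed archive holds both arrays under named keys.
    np.savez_compressed(out_path, features=features_demo, labels=labels_demo)

    # Reading the arrays back restores shape and dtype without string parsing.
    with np.load(out_path) as archive:
        restored_features = archive["features"]
        restored_labels = archive["labels"]

print(restored_features.shape, restored_features.dtype, restored_labels.tolist())
```

The binary format keeps the numeric dtype intact, so the features can be reloaded directly as an array for later analysis.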


In [11]:
# Define the CNN model for feature extraction
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
])

# Extract features from the images
features = model.predict(X)

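# Create a CSV file with painter name, image file name, and feature vector.
# Disabled because my computer cannot handle it.
# Note: `image_filenames` is never defined in this notebook; the file names would have to be
# collected in the loading loop (alongside X and y) before this block can run.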
'''
data = []
for i in range(len(X)):
    data.append([y[i], image_filenames[i], features[i].tolist()])

# Convert to DataFrame
columns = ["Painter Name", "Image File Name", "NumPy Array"]
df = pd.DataFrame(data, columns=columns)

# Save the DataFrame to a CSV file
df.to_csv('painter_image_features.csv', index=False)
''' 

# Display each image's artist name and its normalized pixel array (X, not the extracted features)
for i, img_array in enumerate(X):
    print(f"{y[i]}: {img_array}")

[1m16/16[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 251ms/step
Amedeo Modigliani: [[[0.22352941 0.14509804 0.13333333]
  [0.21960784 0.12941176 0.11372549]
  [0.29019608 0.17647059 0.15294118]
  ...
  [0.14901961 0.10980392 0.11372549]
  [0.41960784 0.37254902 0.38823529]
  [0.96078431 0.9372549  0.95294118]]

 [[0.20784314 0.12941176 0.10980392]
  [0.18823529 0.09803922 0.07058824]
  [0.24705882 0.13333333 0.09803922]
  ...
  [0.14117647 0.09803922 0.10196078]
  [0.41568627 0.36470588 0.38039216]
  [0.96078431 0.9372549  0.95294118]]

 [[0.22745098 0.15294118 0.11764706]
  [0.2        0.10980392 0.07058824]
  [0.24705882 0.1372549  0.09019608]
  ...
  [0.1372549  0.09019608 0.09803922]
  [0.41176471 0.36470588 0.37647059]
  [0.96078431 0.93333333 0.94901961]]

 ...

 [[0.3372549  0.32156863 0.3254902 ]
  [0.22745098 0.19215686 0.18431373]
  [0.26666667 0.19607843 0.16470588]
  ...
  [0.10980392 0.10196078 0.1254902 ]
  [0.10588235 0.10196078 0.14901961]
  [0.12941176 0.125