<a href="https://colab.research.google.com/github/ConstanzaSchibber/capstone_colors/blob/main/notebooks/3_Data_Annotation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Annotation: How to Validate Model Results?

Because the dataset has no color labels, I manually crop each image down to the makeup area, for instance isolating just the lipstick in a product image.

In this notebook, I compute the average CIELAB color of each cropped region and use it as a close approximation of the 'ground truth' color. This annotation step is what makes evaluation possible: it provides a reference value against which the model's extracted colors can be compared.
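As a minimal sketch of the idea, the snippet below averages the CIELAB values of a synthetic solid-color patch and compares the result with a made-up prediction. The patch, the `mean_cielab` helper, and `predicted_lab` are hypothetical stand-ins. CIE76 is only one possible color-difference metric, and the notebook does not state which metric the project uses.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_cie76

def mean_cielab(rgb_image):
    """Return the mean (L*, a*, b*) over all pixels of an RGB image."""
    lab = rgb2lab(rgb_image)  # uint8 input is rescaled to [0, 1] internally
    return lab.reshape(-1, 3).mean(axis=0)

# Synthetic stand-in for a cropped swatch: a 4x4 patch of one solid color
swatch = np.full((4, 4, 3), (200, 60, 80), dtype=np.uint8)
ground_truth_lab = mean_cielab(swatch)

# Hypothetical model output for the same product, also expressed in CIELAB
predicted_lab = ground_truth_lab + np.array([2.0, -1.0, 0.5])

# CIE76 color difference: the Euclidean distance between the two Lab triples
delta_e = deltaE_cie76(ground_truth_lab, predicted_lab)
print(ground_truth_lab, delta_e)
```

Averaging in CIELAB rather than RGB keeps the reference value in the same space in which color differences are usually measured.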

# Libraries

In [2]:
# Save the installed package versions for reproducibility
!pip list --format=freeze > requirements3.txt

In [17]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
from PIL import Image
import numpy as np
import cv2
from skimage.color import rgb2lab, lab2rgb

# Reading Data & Identifying Images with a Ground Truth Color Value

In [1]:
# Mount Drive
from google.colab import drive

# Mount Google Drive to access files stored there
# The 'force_remount=True' option ensures that the drive is remounted even if it is already mounted
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


I could not extract the makeup color from every image, because some images show only the container and not the makeup itself (e.g., the lipstick). As a result, a small number of images have no ground truth value.

To identify the images that do have a ground truth value, I add a variable to the metadata coded 1 if a ground truth value exists and 0 otherwise. The variable is built from the list of cropped image files.

In [4]:
# Metadata
metadata = pd.read_csv('/content/drive/MyDrive/data/processed/data_1.csv')
metadata.columns

Index(['Unnamed: 0', 'level_0', 'index', 'category', 'joined', 'brand',
       'product', 'shade', 'img_url', 'shade_description_original', 'id',
       'validation', 'img_name'],
      dtype='object')

In [6]:
# List of cropped images
folder_path = '/content/drive/MyDrive/data/processed/makeup_img_ground_truth'
files = glob.glob(os.path.join(folder_path, '*'))  # List all files with full paths

# Keep the file names without their extension, because some extensions changed when the images were cropped
files = [os.path.splitext(os.path.basename(f))[0] for f in files]

print(files)

['ulta232', 'ulta250', 'ulta433', 's2427938-main-zoom', 'CF_PDP_Raunchy_swatch', 'juicypangwaterblusherCR01', 'Sunset', 'ulta8', 's2474427-av-04-zoom', 'ulta6', 'Product5_540x', 'ulta14', '00000000_zi_dcd6941f-7cfd-4fc8-99ef-84dad24047f0__02_ai', 'ulta7', 's2410009-main-zoom', '71dCGKXNjwL', 'BOUNCE_Blush_PlayfulPeach_Swatch_v1_RGB', 'ulta13', 'LiquidBlushLip_Tempo_1024x1024', 'CHEEK-GELEE_SWATCH_Lively_1200x', 's2115871-main-zoom', 'CHEEK_SHADE_SMITTEN_Elephant_copy_1', '4LyjpbQeBAGrzzEpwZZs', 'Morning-Vibes-PPB-Target-Pre-Fall-Half-Closed_800x1200', 'ulta23', '0a1c6623-93fb-4e6e-a940-c7ddda5d0949', 'Poppi-Instant-Crush-Matte-Blush-Inline-Half-Closed', 'ulta25', 'Watercolour-Liquid-Blush-Angel-3', 'Wish-Me-Luck-PPB-Target-Pre-Fall-Half-Closed_800x1200', 'fresh-n-peachy_2_800x1200', 'ulta30', 'Watercolour-Liquid-Blush-Crush-3', 'Watercolour-Liquid-Illuminator-Elegance-3', 'Watercolour-Liquid-Blush-Gentle-3', 'Watercolour-Liquid-Blush-Caress-3', 'Watercolour-Liquid-Blush-Chelsea-3', 'Wa

In [8]:
metadata['img_name'][0].split('.')[0]

'Sunset'

In [10]:
# Ground truth dummy: 1 if the image has a cropped version, 0 otherwise
cropped = set(files)  # set membership keeps each lookup fast

# Strip the extension the same way as for the cropped files (os.path.splitext)
stems = metadata['img_name'].map(lambda name: os.path.splitext(name)[0])
metadata['ground_truth'] = stems.isin(cropped).astype(int)


Below, we observe that 88.8% of the images in the dataset have a corresponding ground truth value. The remaining 11.2% lack one for reasons unrelated to the color itself, such as incomplete data (e.g., the image shows the packaging but not the makeup color). Their absence is therefore unlikely to affect the overall analysis much, since it should not introduce bias related to the color properties being studied.

In [14]:
round(metadata.ground_truth.value_counts()/len(metadata)*100, 2)

ground_truth  count
1             88.8
0             11.2
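One way to probe whether the missing ground truth values are spread evenly across the data is to compare coverage across product categories. The sketch below uses a hypothetical toy frame. The real metadata has `category` and `ground_truth` columns, but the values here are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the metadata
toy = pd.DataFrame({
    'category': ['blush'] * 4 + ['lipstick'] * 4,
    'ground_truth': [1, 1, 1, 0, 1, 1, 0, 0],
})

# Share of images with a ground truth value within each category, in percent
coverage = (
    toy.assign(ground_truth=toy['ground_truth'].astype(int))
       .groupby('category')['ground_truth']
       .mean()
       .mul(100)
       .round(1)
)
print(coverage)
```

Large gaps between categories would suggest that the missing images are not missing at random. Similar coverage across categories is consistent with the argument above, although it does not prove it.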


# Ground Truth CIELAB Color Value

In [32]:
# Store the mean CIELAB color of each cropped image
metadata['ground_truth_CIELAB'] = pd.Series(dtype='object')

# Folder with the manually cropped images
directory = '/content/drive/MyDrive/data/processed/makeup_img_ground_truth/'

# Extract the mean color and save it for every image that has a cropped version
for i in range(len(metadata)):
  if metadata['ground_truth'][i] == 1:
    # Find the cropped file regardless of its extension
    filename = os.path.splitext(metadata['img_name'][i])[0]
    file_path = glob.glob(os.path.join(directory, glob.escape(filename) + '.*'))

    # Read the image (OpenCV loads it as BGR); skip files that are missing or unreadable
    swatch = cv2.imread(file_path[0]) if file_path else None
    if swatch is None:
      continue

    # Convert BGR -> RGB -> CIELAB
    swatch = cv2.cvtColor(swatch, cv2.COLOR_BGR2RGB)
    img_lab = rgb2lab(swatch)

    # Average over all pixels to get a single (L*, a*, b*) triple
    mean_swatch = img_lab.mean(axis=(0, 1))
    metadata.at[i, 'ground_truth_CIELAB'] = mean_swatch

In [34]:
metadata.ground_truth_CIELAB.info()

<class 'pandas.core.series.Series'>
RangeIndex: 527 entries, 0 to 526
Series name: ground_truth_CIELAB
Non-Null Count  Dtype 
--------------  ----- 
468 non-null    object
dtypes: object(1)
memory usage: 4.2+ KB


In [35]:
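# Save the annotated metadata; index=False avoids writing another unnamed index column
metadata.to_csv('/content/drive/My Drive/metadata_ground_truth.csv', index=False)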
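Because `ground_truth_CIELAB` holds NumPy arrays, `to_csv` writes each array as its string representation, which has to be parsed again when the file is read back. A possible alternative, sketched below on a hypothetical toy frame, is to expand the arrays into three numeric columns before saving. The column names `gt_L`, `gt_a`, and `gt_b` are assumptions, not names used elsewhere in the project.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical toy frame: one row with a mean CIELAB array, one without ground truth
toy = pd.DataFrame({'img_name': ['a.jpg', 'b.jpg']})
toy['ground_truth_CIELAB'] = pd.Series(
    [np.array([52.0, 41.5, 12.25]), np.nan], dtype='object'
)

# Expand the arrays into numeric columns; rows without an array become NaN
lab_cols = ['gt_L', 'gt_a', 'gt_b']  # hypothetical column names
values = np.vstack([
    v if isinstance(v, np.ndarray) else np.full(3, np.nan)
    for v in toy['ground_truth_CIELAB']
])
expanded = pd.DataFrame(values, columns=lab_cols, index=toy.index)
flat = pd.concat([toy.drop(columns='ground_truth_CIELAB'), expanded], axis=1)

# Round-trip through an in-memory CSV to check that the values come back as floats
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

With separate columns, the saved file can be loaded with plain `pd.read_csv` and used for numeric comparisons without any string parsing.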