# Data Tagging for Image Similarity

This notebook contains code and explanation for our labeling process.

In [2]:
import pandas as pd
import os
from PIL import Image
import csv

## Sample Equal-Size Classes

Due to resource limitations, we could not label every pair in the full dataset (428 + 289 + 150 = 867 images, so roughly 867^2 pairs). Instead, we keep a class-balanced subset by downsampling each class to the size of the smallest one, 150 images, for 450 samples in total.

In [3]:
path = "datasets/house_styles/"
image_folder = "datasets/house_styles/all_images"

img_labels = pd.read_csv(path+"labels.csv")
img_labels['house_type'].value_counts()

house_type
farmhouse    428
modern       289
rustic       150
Name: count, dtype: int64

In [4]:
# sample 150 images from each category:
sample_size = 150
# groupby(...).sample draws per class directly and avoids the groupby.apply warning
sampled_images = img_labels.groupby('house_type').sample(n=sample_size)
sampled_images = sampled_images.reset_index(drop=True)

sampled_images['house_type'].value_counts()

house_type
farmhouse    150
modern       150
rustic       150
Name: count, dtype: int64
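
The balanced downsampling above can also be written so that the class size is derived from the data instead of being hard-coded. The following is a minimal sketch on made-up toy rows; the `random_state` argument is an assumption added for repeatability and is not part of the original cell.

```python
import pandas as pd

# Toy stand-in for labels.csv: column names mirror the notebook, rows are made up.
toy_labels = pd.DataFrame({
    "file_label": [f"img_{i}.jpg" for i in range(9)],
    "house_type": ["farmhouse"] * 4 + ["modern"] * 3 + ["rustic"] * 2,
})

# Size of the smallest class (150 in the real data, 2 in this toy example).
min_count = int(toy_labels["house_type"].value_counts().min())

# Draw min_count rows per class; random_state (an assumption) makes the draw repeatable.
balanced = (
    toy_labels.groupby("house_type")
    .sample(n=min_count, random_state=0)
    .reset_index(drop=True)
)

print(balanced["house_type"].value_counts().to_dict())
```

Fixing the seed would let the same 450 images be redrawn if the notebook is rerun, which matters because later files are keyed on the sampled file names.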

In [5]:
image_names = sampled_images['file_label'].values
len(image_names)

450

In [6]:
# save only sampled images to a new labels file:
sampled_images.to_csv(path+"sampled_labels.csv", index=False)

## Paired Labeling

We use a labeling scheme in which every pair of images from different classes is automatically assigned similarity 0, and only pairs from the same class are labeled manually. This reduces the number of manual labels from 101,025 (all 450*449/2 unordered pairs) to 33,525 (3 * 150*149/2 same-class pairs).
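
The counts behind this reduction can be checked with a few lines, assuming each unordered pair of images is counted once (as the pairing loop in the next cell does). The toy dictionary at the end is made up purely to illustrate the same-class filter.

```python
from itertools import combinations
from math import comb

n_classes, per_class = 3, 150

total_pairs = comb(n_classes * per_class, 2)        # all unordered pairs of 450 images
same_class_pairs = n_classes * comb(per_class, 2)   # pairs left for manual labeling
cross_class_pairs = total_pairs - same_class_pairs  # pairs auto-labeled 0

print(total_pairs, same_class_pairs, cross_class_pairs)

# Toy illustration (made-up names): only same-class pairs need a manual label.
toy = {"a1": "farmhouse", "a2": "farmhouse", "b1": "modern", "b2": "modern"}
toy_pairs = list(combinations(toy, 2))
toy_manual = [(x, y) for x, y in toy_pairs if toy[x] == toy[y]]
print(len(toy_pairs), len(toy_manual))
```

These figures agree with the totals printed by the cells below: 101025 pairs overall, 67500 with similarity 0, and 33525 left unlabeled.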

In [7]:
# Look up each image's class once instead of filtering the DataFrame for every pair:
label_of = dict(zip(sampled_images['file_label'], sampled_images['house_type']))

# newline='' keeps the csv module from writing blank rows on Windows
with open(path+'sampled_paired_labels.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['image1', 'image1_path', 'image2', 'image2_path', 'similarity'])
    count = 0

    # visit every unordered pair (i < j) exactly once
    for i in range(len(image_names)):
        for j in range(i+1, len(image_names)):
            name1, name2 = image_names[i], image_names[j]
            # paths are wrapped in parentheses (stripped again when the images are loaded)
            link1 = '('+os.path.join(image_folder, name1)+')'
            link2 = '('+os.path.join(image_folder, name2)+')'

            # different classes get similarity 0; same-class pairs stay empty for manual labeling
            similarity = 0 if label_of[name1] != label_of[name2] else ""
            writer.writerow([name1, link1, name2, link2, similarity])
            count += 1

print("Total pairs: ", count)


Total pairs:  101025


In [14]:
# load sampled_paired_labels and count nulls:
sampled_paired_labels = pd.read_csv(path+'sampled_paired_labels.csv', index_col=False)
sampled_paired_labels['similarity'].value_counts()

similarity
0.0    67500
Name: count, dtype: int64

In [15]:
# number of pairs still waiting for a manual label (empty similarity):
int(sampled_paired_labels['similarity'].isna().sum())

33525

## Shuffle Rows

In [6]:
df_origin = pd.read_csv(path+'sampled_paired_labels.csv')

df_shuffled = df_origin.sample(frac=1).reset_index(drop=True)
df_shuffled.to_csv(path+'sampled_paired_labels_shuffled.csv', index=False)
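
Because later labeling rounds refer to the shuffled file, it may help to make the shuffle repeatable. The sketch below uses made-up toy rows; `random_state=42` is an assumption added here, as the cell above shuffles without a seed.

```python
import pandas as pd

# Toy stand-in for sampled_paired_labels.csv (made-up rows).
toy_pairs = pd.DataFrame({
    "image1": ["a.jpg", "a.jpg", "b.jpg"],
    "image2": ["b.jpg", "c.jpg", "c.jpg"],
    "similarity": [0, None, None],
})

# Same seed, same row order on every run.
shuffled_once = toy_pairs.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = toy_pairs.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_once.equals(shuffled_again))
```

Shuffling only reorders rows, so the set of pairs and their labels is unchanged.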

In [None]:
bad_images = ['487_6af5bc71.jpg']

## Append Labels to Main File

In [14]:
main_file_path = path+'sampled_paired_labels_shuffled.csv'
main_file = pd.read_csv(main_file_path)
rounds_file_path = "active_learning_labels"
rounds_concats = []

current_round = 1

# round_idx avoids shadowing the built-in round()
for round_idx in range(0, current_round+1):
    path_file = os.path.join(rounds_file_path, "round_"+str(round_idx)+".csv")
    df = pd.read_csv(path_file)
    rounds_concats.append(df)

# Join all rounds to add similarity column to main file:
df = pd.concat(rounds_concats)
df = df[['image1', 'image2', 'similarity']]
df = df.rename(columns={'similarity': 'similarity_round'})

new_main_file = pd.merge(main_file, df, on=['image1', 'image2'], how='left')

# keep similarity of main unless it is null, then add similarity_round if exists:
new_main_file['similarity'] = new_main_file['similarity'].fillna(new_main_file['similarity_round'])
new_main_file = new_main_file.drop(columns=['similarity_round'])

print(new_main_file.value_counts('similarity'))

new_main_file.to_csv(path+'sampled_paired_labels_shuffled.csv', index=False)


similarity
0.0    67500
1.0      137
2.0      113
3.0       87
Name: count, dtype: int64


In [11]:
main_file.value_counts('similarity')

similarity
0.0    67500
1.0       98
2.0       78
3.0       61
Name: count, dtype: int64
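
One thing to watch in the merge above: if the same pair appears in more than one round file, a left merge adds duplicate rows to the main file. The following is a minimal sketch of one way to guard against that, using made-up toy frames. Keeping the most recent label per pair is an assumption, not a rule stated in this notebook.

```python
import pandas as pd

# Toy main file: one auto-labeled pair and two pairs awaiting labels.
main = pd.DataFrame({
    "image1": ["a.jpg", "a.jpg", "b.jpg"],
    "image2": ["b.jpg", "c.jpg", "c.jpg"],
    "similarity": [0.0, None, None],
})
# Hypothetical round files; the pair (a.jpg, c.jpg) was labeled in both rounds.
round_0 = pd.DataFrame({"image1": ["a.jpg"], "image2": ["c.jpg"], "similarity": [2]})
round_1 = pd.DataFrame({"image1": ["a.jpg", "b.jpg"],
                        "image2": ["c.jpg", "c.jpg"],
                        "similarity": [3, 1]})

rounds = pd.concat([round_0, round_1], ignore_index=True)
# Assumption: the latest round wins when a pair was labeled more than once.
rounds = rounds.drop_duplicates(subset=["image1", "image2"], keep="last")
rounds = rounds.rename(columns={"similarity": "similarity_round"})

# validate raises MergeError if either side still has duplicate keys.
merged = pd.merge(main, rounds, on=["image1", "image2"], how="left",
                  validate="one_to_one")
merged["similarity"] = merged["similarity"].fillna(merged["similarity_round"])
merged = merged.drop(columns=["similarity_round"])

print(merged["similarity"].tolist())
```

With `validate="one_to_one"`, pandas fails loudly instead of silently growing the main file, which keeps the row count at the expected number of pairs.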

## Draft - Automated Labeling Prompt

In [None]:
import pandas as pd
from PIL import Image
from IPython.display import display
import os

def display_images_inline(image_path1, image_path2):
    """
    Display two images side by side inline in a Jupyter Notebook.
    """
    if os.path.exists(image_path1) and os.path.exists(image_path2):
        # Open and display the images
        img1 = Image.open(image_path1)
        img2 = Image.open(image_path2)

        # Display both images in the notebook
        print("Image 1:")
        display(img1)
        print("Image 2:")
        display(img2)
    else:
        print(f"One or both image paths are invalid: {image_path1}, {image_path2}")

def update_csv_with_label(csv_file, row_index, label):
    """
    Update the CSV file at the specified row with the new label.
    """
    df = pd.read_csv(csv_file)
    # write to 'similarity', the column pipeline() checks for missing labels
    df.at[row_index, 'similarity'] = int(label)
    df.to_csv(csv_file, index=False)

def pipeline(csv_file):
    """
    Main function that iterates through the CSV, displays images inline, and takes user input.
    """
    df = pd.read_csv(csv_file)

    for index, row in df.iterrows():
        if pd.isna(row['similarity']):
            print(f"Row {index}: Label is missing, displaying images...")

            # Get image paths from the row
            image_path1 = row['image1_path'].strip("()")
            image_path2 = row['image2_path'].strip("()")

            # Display the images inline in the notebook
            display_images_inline(image_path1, image_path2)

            # Ask the user to input the label (1, 2, or 3); repeat until valid
            while True:
                label = input("Please enter the label (1, 2, or 3): ").strip()
                if label in ['1', '2', '3']:
                    break
                print("Invalid input. Please enter 1, 2, or 3.")

            # Update the CSV file with the user input and save it
            update_csv_with_label(csv_file, index, label)

            print(f"Row {index} updated with label {label}. Continuing...\n")

csv_file = "datasets/house_styles/test_sampled_paired_labels.csv"
pipeline(csv_file)


## Manual Labeling Rules

- Different Class: label 0
- Same Class: label 1 to 3

**Objective:** The user's information need is to find houses with similar aesthetic characteristics, utilities, and environment.

**Features for tagging:**
1. Are the houses of similar size? (floors / area / people capacity)
2. Are both houses in bold colors, or both in regular colors?
3. Do the houses share the same color palette? (light / dark)
4. Are the houses made of the same material? (wooden / concrete / glass / brick)
5. Do the houses share the same building-style "vibe"? (roof shape / floors...)
6. Do both houses share some identical characteristics? Examples:
    - for modern: pool, parking space
    - for farmhouse: porch, garage
    - for rustic: chimney
7. Do both houses have, or both lack, a garden or open space?
8. Are the houses in the same environment? (urban / rural)
9. Do the houses both feel open or closed? (windows / spaces / doors)

**Features to avoid:**
1. Avoid comparison by image size, quality, or photo style.
2. Avoid comparison influenced by angle.
3. Avoid comparison by time of year and time of day in the image.
4. Avoid considering people and objects in the image.

**Labeling Rules:**
Consider all the features above and answer each question with "yes", "no", or "not relevant / not sure".
Among the features that are relevant, compute the fraction of positive ("yes") answers.
- more than 2/3 positive answers: label 3
- between 1/3 and 2/3 positive answers (inclusive): label 2
- less than 1/3 positive answers: label 1
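
The thresholds above can be sketched as a small helper. This is illustrative only: the function name `similarity_label` is hypothetical, fractions of exactly 1/3 or 2/3 are assumed to fall into label 2, and raising an error when no feature is relevant is also an assumption, since the rules do not cover that case.

```python
from fractions import Fraction

def similarity_label(answers):
    """Map per-feature answers ("yes", "no", anything else = not relevant) to a label 1..3."""
    # Only "yes" and "no" count as relevant; other answers are ignored.
    relevant = [a for a in answers if a in ("yes", "no")]
    if not relevant:
        # Assumption: the rules do not say what to do here, so fail loudly.
        raise ValueError("no relevant features answered")

    # Exact fraction of positive answers, avoiding float rounding at 1/3 and 2/3.
    share = Fraction(relevant.count("yes"), len(relevant))
    if share > Fraction(2, 3):
        return 3
    if share < Fraction(1, 3):
        return 1
    return 2

print(similarity_label(["yes"] * 7 + ["no"] * 2))
print(similarity_label(["yes", "no", "no", "not relevant"]))
```

Using `Fraction` keeps the boundary cases exact, so a pair with 3 of 9 positive answers lands on label 2 rather than depending on floating-point rounding.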

