# Same Class Different Color, Size and Texture (SCDCST) Text-Vision Comparison - SyntheticKonkle

This notebook compares CVCL and CLIP models on text-vision matching using the SyntheticKonkle dataset.
The task is 4-way forced choice: a query image is matched against four text prompts that share the SAME class but describe DIFFERENT color-size-texture combinations.

## Text Ordering Considerations

**We use natural English adjective ordering**: `"{size} {color} {texture} {class}"`
- Example: "large red smooth apple", "small blue bumpy apple", "medium green smooth apple"
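- This follows the conventional English adjective order: size → color → texture → noun
- CVCL (trained on child-directed speech) and CLIP (trained on internet text) were both trained on natural English, so we assume this ordering suits both models; this notebook does not test alternative orderings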
- Parents typically say "big red bumpy ball" not "bumpy red big ball"
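
As a minimal sketch of the prompt template described above, the helper below joins the attributes in size → color → texture → class order. The name `build_prompt` is hypothetical and does not appear in the notebook. The lowercasing and stripping mirror the normalization the notebook applies to its metadata columns.

```python
def build_prompt(size: str, color: str, texture: str, class_name: str) -> str:
    """Join attributes in size -> color -> texture -> class order."""
    parts = [size, color, texture, class_name]
    # Lowercase and strip each field so prompts are formatted consistently
    return " ".join(p.strip().lower() for p in parts)


print(build_prompt("Large", "red", "smooth", "Apple"))  # large red smooth apple
```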

## Test Characteristics
- **Visual**: All candidates come from the SAME class
- **Variation**: Candidates differ in their color-size-texture combination; any two differ in at least one of the three attributes, not necessarily all three
- **No control**: Every visual attribute (size, color, texture) is named in the text prompt
- **Difficulty**: Medium-Hard - the class name is identical across candidates, so only the attribute words discriminate between them
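
One way to picture how a single trial is assembled is sketched below. It is a simplified illustration, not the notebook's exact code: the hypothetical `make_trial` requires at least four distinct combinations, whereas the notebook also accepts classes with exactly three and repeats one of them to fill the fourth slot. The combination strings are illustrative values in the notebook's `color_size_texture` format.

```python
import random


def make_trial(combos, rng):
    """Pick 4 distinct combinations; return (query, candidates, correct index)."""
    if len(combos) < 4:
        raise ValueError("need at least 4 distinct combinations")
    selected = rng.sample(combos, 4)  # distinct by construction
    query = selected[0]               # first pick is the query
    rng.shuffle(selected)             # randomize candidate order
    return query, selected, selected.index(query)


rng = random.Random(0)  # local generator so the sketch is reproducible
combos = [
    "red_large_bumpy", "green_large_bumpy", "blue_small_smooth",
    "yellow_medium_bumpy", "orange_large_smooth",
]
query, candidates, correct_idx = make_trial(combos, rng)
assert candidates[correct_idx] == query
```

With four distinct candidates, a model that guesses uniformly at random would be correct 25% of the time.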

In [1]:
# Imports
import os
import sys
import pandas as pd
import numpy as np
import torch
from PIL import Image
from tqdm import tqdm
import random
from datetime import datetime
import clip
from collections import defaultdict
from torch.utils.data import Dataset, DataLoader

# Path setup - Use absolute paths to avoid any confusion
REPO_ROOT = r'C:\Users\jbats\Projects\NTU-Synthetic'

# Add discover-hidden-visual-concepts to path
DISCOVER_ROOT = os.path.join(REPO_ROOT, 'discover-hidden-visual-concepts')
sys.path.insert(0, DISCOVER_ROOT)
sys.path.insert(0, REPO_ROOT)

# Import from discover-hidden-visual-concepts repo
sys.path.append(os.path.join(DISCOVER_ROOT, 'src'))
from utils.model_loader import load_model
from models.feature_extractor import FeatureExtractor

# Paths
DATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle_224', 'SyntheticKonkle')
METADATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle', 'master_labels.csv')
RESULTS_PATH = os.path.join(REPO_ROOT, 'PatrickProject', 'Chart_Generation', 'text_vision_results.csv')

print(f"Data path: {DATA_PATH}")
print(f"Metadata path: {METADATA_PATH}")
print(f"Results will be saved to: {RESULTS_PATH}")

Data path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle_224\SyntheticKonkle
Metadata path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle\master_labels.csv
Results will be saved to: C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv


In [2]:
# Load and prepare data
def load_synthetic_data():
    """Load SyntheticKonkle dataset with metadata for SCDCST testing"""
    # Read metadata
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to only entries with valid size, color, and texture information
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values (lowercase)
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Filter to only valid size and texture values
    valid_sizes = ['small', 'medium', 'large']
    valid_textures = ['smooth', 'bumpy']
    df = df[df['size'].isin(valid_sizes) & df['texture'].isin(valid_textures)].copy()
    
    # Create combination columns
    df['color_size_texture'] = df['color'] + '_' + df['size'] + '_' + df['texture']
    df['full_combo'] = df['class'] + '_' + df['color'] + '_' + df['size'] + '_' + df['texture']
    
    print(f"Loaded {len(df)} images with size, color, and texture annotations")
    print(f"Unique classes: {df['class'].nunique()}")
    print(f"Unique colors: {df['color'].nunique()}")
    print(f"Unique sizes: {sorted(df['size'].unique())}")
    print(f"Unique textures: {sorted(df['texture'].unique())}")
    
    # Find classes that have multiple color-size-texture combinations
    class_groups = df.groupby('class')['color_size_texture'].nunique()
    valid_classes = class_groups[class_groups >= 3].index.tolist()
    
    print(f"\nClasses with 3+ color-size-texture combinations: {len(valid_classes)}")
    if valid_classes:
        print(f"Examples: {valid_classes[:3]}")
        # Show combinations for the first example class
        example = valid_classes[0]
        cst_combos = df[df['class'] == example]['color_size_texture'].unique()[:4]
        print(f"  {example} has combinations like: {cst_combos}")
    
    return df, valid_classes

# Load data
data_df, valid_classes = load_synthetic_data()
print("\nSample data:")
print(data_df[['class', 'color', 'size', 'texture', 'color_size_texture']].head())

Loaded 7865 images with size, color, and texture annotations
Unique classes: 67
Unique colors: 11
Unique sizes: ['large', 'medium', 'small']
Unique textures: ['bumpy', 'smooth']

Classes with 3+ color-size-texture combinations: 67
Examples: ['abacus', 'apple', 'axe']
  abacus has combinations like: ['red_large_bumpy' 'green_large_bumpy' 'blue_large_bumpy'
 'yellow_large_bumpy']

Sample data:
    class   color   size texture  color_size_texture
0  abacus     red  large   bumpy     red_large_bumpy
1  abacus   green  large   bumpy   green_large_bumpy
2  abacus    blue  large   bumpy    blue_large_bumpy
3  abacus  yellow  large   bumpy  yellow_large_bumpy
4  abacus  orange  large   bumpy  orange_large_bumpy


In [3]:
def run_scdcst_text_vision_test(model_name='cvcl-resnext', seed=0, device=None, num_trials=4000):
    """Run Same Class Different Color, Size and Texture text-vision test
    
    Text format uses natural English ordering: "{size} {color} {texture} {class}"
    Example: "large red smooth apple", "small blue bumpy apple"
    
    Args:
        model_name: Model to test ('cvcl-resnext' or 'clip-resnext')
        seed: Random seed for reproducibility
        device: Device to use (None for auto-detect)
        num_trials: Total number of trials to run
    """
    # Set seeds to match original test methodology
    random.seed(seed)
    torch.manual_seed(seed)
    
    print(f"\n{'='*60}")
    print(f"Running SCDCST Text-Vision Test with {model_name}")
    print(f"(Same Class Different Color, Size & Texture)")
    print(f"Text format: {{size}} {{color}} {{texture}} {{class}} (natural English order)")
    print(f"{'='*60}")
    
    # Device selection
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    if device == 'cuda' and not torch.cuda.is_available():
        print("[ERROR] CUDA requested but not available! Falling back to CPU.")
        device = 'cpu'
    
    print(f"Using device: {device}")
    
    # Load model
    print(f"[INFO] Loading {model_name} on {device}...")
    model, transform = load_model(model_name, seed=seed, device=device)
    extractor = FeatureExtractor(model_name, model, device)
    model.eval()
    
    # Load and prepare data
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to entries with all annotations
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Filter to only valid size and texture values
    valid_sizes = ['small', 'medium', 'large']
    valid_textures = ['smooth', 'bumpy']
    df = df[df['size'].isin(valid_sizes) & df['texture'].isin(valid_textures)].copy()
    
    # Create combination column
    df['color_size_texture'] = df['color'] + '_' + df['size'] + '_' + df['texture']
    
    # Find classes with at least 3 different color-size-texture combinations
    class_groups = df.groupby('class')
    valid_classes = []
    for class_name, group in class_groups:
        unique_cst = group['color_size_texture'].unique()
        if len(unique_cst) >= 3:
            valid_classes.append(class_name)
    
    if len(valid_classes) == 0:
        print("ERROR: No classes have 3+ different color-size-texture combinations.")
        print("Cannot run SCDCST test.")
        return [], 0.0
    
    print(f"\nFound {len(valid_classes)} classes with 3+ color-size-texture combinations")
    
    # Pre-compute image embeddings
    print("\nExtracting image embeddings...")
    image_embeddings = {}
    skipped_images = []
    
    # Get all relevant images
    df_valid = df[df['class'].isin(valid_classes)]
    all_image_paths = df_valid['image_path'].unique().tolist()
    batch_size = 16
    
    for i in tqdm(range(0, len(all_image_paths), batch_size), desc="Extracting embeddings"):
        batch_paths = all_image_paths[i:i+batch_size]
        batch_images = []
        
        for img_path in batch_paths:
            try:
                img = Image.open(img_path).convert('RGB')
                img_processed = transform(img).unsqueeze(0).to(device)
                batch_images.append((img_path, img_processed))
            except Exception:
                # Unreadable or corrupted image: record it and move on
                skipped_images.append(img_path)
        
        if batch_images:
            paths = [p for p, _ in batch_images]
            imgs = torch.cat([img for _, img in batch_images], dim=0)
            
            with torch.no_grad():
                embeddings = extractor.get_img_feature(imgs)
                embeddings = extractor.norm_features(embeddings)
            
            for path, emb in zip(paths, embeddings):
                image_embeddings[path] = emb.cpu().float()
    
    print(f"Extracted embeddings for {len(image_embeddings)} images")
    if skipped_images:
        print(f"Skipped {len(skipped_images)} corrupted/invalid images")
    
    # Prepare for trials
    correct_count = 0
    trial_results = []
    
    # Calculate trials per class
    trials_per_class = num_trials // len(valid_classes)
    remaining_trials = num_trials % len(valid_classes)
    
    print(f"\nRunning {num_trials} trials across {len(valid_classes)} classes...")
    
    # Run trials
    for class_idx, class_name in enumerate(tqdm(valid_classes, desc="Processing classes")):
        # Get all images for this class
        class_data = df_valid[df_valid['class'] == class_name]
        
        # Group by color-size-texture
        cst_groups = class_data.groupby('color_size_texture').agg({
            'image_path': list,
            'color': 'first',
            'size': 'first',
            'texture': 'first'
        }).to_dict('index')
        
        available_cst = list(cst_groups.keys())
        
        if len(available_cst) < 3:
            continue
        
        # Determine number of trials for this class
        n_trials = trials_per_class + (1 if class_idx < remaining_trials else 0)
        
        for trial in range(n_trials):
            # Select color-size-texture combinations for the 4-way choice
            if len(available_cst) == 3:
                # Only 3 combinations exist: use all 3 and repeat one to fill the
                # fourth slot (two candidate prompts are then identical)
                selected_cst = available_cst.copy()
                selected_cst.append(random.choice(available_cst))
            else:
                # 4 or more combinations exist: sample 4 distinct ones
                selected_cst = random.sample(available_cst, 4)
            
            # First combination is the query
            query_cst = selected_cst[0]
            query_data = cst_groups[query_cst]
            
            # Select random query image from valid images
            valid_query_paths = [p for p in query_data['image_path'] if p in image_embeddings]
            if not valid_query_paths:
                continue
            query_img_path = random.choice(valid_query_paths)
            query_color = query_data['color']
            query_size = query_data['size']
            query_texture = query_data['texture']
            
            # Shuffle for candidate order
            random.shuffle(selected_cst)
            correct_idx = selected_cst.index(query_cst)
            
            # Create text prompts - NATURAL ENGLISH ORDER: {size} {color} {texture} {class}
            candidate_texts = []
            for cst in selected_cst:
                cst_data = cst_groups[cst]
                # Natural English order: size → color → texture → noun
                text_prompt = f"{cst_data['size']} {cst_data['color']} {cst_data['texture']} {class_name.lower()}"
                candidate_texts.append(text_prompt)
            
            # Encode text prompts
            with torch.no_grad():
                if "clip" in model_name:
                    tokens = clip.tokenize(candidate_texts, truncate=True).to(device)
                    txt_features = model.encode_text(tokens)
                    txt_features = extractor.norm_features(txt_features)
                else:  # CVCL
                    tokens, token_len = model.tokenize(candidate_texts)
                    tokens = tokens.to(device)
                    if isinstance(token_len, torch.Tensor):
                        token_len = token_len.to(device)
                    txt_features = model.encode_text(tokens, token_len)
                    txt_features = extractor.norm_features(txt_features)
            
            # Get query image embedding
            query_embedding = image_embeddings[query_img_path].unsqueeze(0).to(device)
            
            # Calculate similarity
            query_embedding = query_embedding.float()
            txt_features = txt_features.float()
            
            similarity = (100.0 * query_embedding @ txt_features.transpose(-2, -1)).softmax(dim=1)
            
            # Get prediction
            pred_idx = similarity.argmax(dim=1).item()
            
            # Check if correct
            is_correct = (pred_idx == correct_idx)
            if is_correct:
                correct_count += 1
            
            # Store trial result
            trial_results.append({
                'trial': len(trial_results) + 1,
                'query_class': class_name,
                'query_color': query_color,
                'query_size': query_size,
                'query_texture': query_texture,
                'query_img': os.path.basename(query_img_path),
                'correct_idx': correct_idx,
                'predicted_idx': pred_idx,
                'correct': is_correct,
                'candidate_texts': candidate_texts,
                'similarity_scores': similarity.cpu().numpy().tolist()
            })
    
    # Calculate accuracy
    accuracy = correct_count / len(trial_results) if trial_results else 0
    
    print(f"\n{'='*60}")
    print(f"Results for {model_name} - SCDCST Text-Vision Test:")
    print(f"Total trials: {len(trial_results)}")
    print(f"Correct: {correct_count}")
    print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"{'='*60}")
    
    # Save results
    results_row = {
        'Model': model_name,
        'Test': 'SCDCST-TextVision',
        'Dataset': 'SyntheticKonkle',
        'Correct': correct_count,
        'Trials': len(trial_results),
        'Accuracy': accuracy
    }
    
    os.makedirs(os.path.dirname(RESULTS_PATH), exist_ok=True)
    if os.path.exists(RESULTS_PATH):
        results_df = pd.read_csv(RESULTS_PATH)
    else:
        results_df = pd.DataFrame()
    
    results_df = pd.concat([results_df, pd.DataFrame([results_row])], ignore_index=True)
    results_df.to_csv(RESULTS_PATH, index=False, float_format='%.4f')
    print(f"\nResults saved to {RESULTS_PATH}")
    
    return trial_results, accuracy
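

The scoring step inside the function above can be sketched without any deep-learning dependencies. This is an illustrative pure-Python version, assuming plain lists stand in for embedding tensors. The names `normalize` and `score_candidates` are hypothetical, and the toy 2-D vectors are made up for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def score_candidates(image_vec, text_vecs, scale=100.0):
    """Return softmax probabilities over candidates and the predicted index."""
    img = normalize(image_vec)
    # Scaled cosine similarity between the image and each text embedding
    logits = [scale * sum(a * b for a, b in zip(img, normalize(t)))
              for t in text_vecs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, probs.index(max(probs))


# Toy vectors: the second text embedding points closest to the image
probs, pred = score_candidates(
    [1.0, 0.0],
    [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]],
)
print(pred)  # 1
```

Softmax is monotonic, so the predicted index is the same as the argmax of the raw similarities; the probabilities matter only for the stored `similarity_scores`.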

## Run CVCL SCDCST Text-Vision Test

In [4]:
# Run CVCL test with seed=0 (matching original tests)
cvcl_trials, cvcl_accuracy = run_scdcst_text_vision_test('cvcl-resnext', seed=0, num_trials=4000)


Running SCDCST Text-Vision Test with cvcl-resnext
(Same Class Different Color, Size & Texture)
Text format: {size} {color} {texture} {class} (natural English order)
Using device: cuda
[INFO] Loading cvcl-resnext on cuda...
Loading checkpoint from C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt


Lightning automatically upgraded your loaded checkpoint from v1.5.8 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt`



Found 67 classes with 3+ color-size-texture combinations

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 492/492 [00:22<00:00, 21.73it/s]


Extracted embeddings for 7835 images
Skipped 22 corrupted/invalid images

Running 4000 trials across 67 classes...


Processing classes: 100%|██████████| 67/67 [00:38<00:00,  1.72it/s]


Results for cvcl-resnext - SCDCST Text-Vision Test:
Total trials: 4000
Correct: 1158
Accuracy: 0.2895 (28.95%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv
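
The 4000 trials reported above are split across the 67 classes by integer division, with the remainder going to the first classes in iteration order. The sketch below reproduces that arithmetic in isolation; `allocate_trials` is a hypothetical helper name, not a function in the notebook.

```python
def allocate_trials(num_trials, num_classes):
    """Give each class the base share, plus one extra for the first `extra` classes."""
    base, extra = divmod(num_trials, num_classes)
    return [base + (1 if i < extra else 0) for i in range(num_classes)]


counts = allocate_trials(4000, 67)
print(counts[0], counts[-1], sum(counts))  # 60 59 4000
```

Since 4000 = 59 × 67 + 47, the first 47 classes get 60 trials each and the remaining 20 get 59.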





## Run CLIP SCDCST Text-Vision Test

In [5]:
# Run CLIP test with seed=0 (matching original tests)
clip_trials, clip_accuracy = run_scdcst_text_vision_test('clip-resnext', seed=0, num_trials=4000)


Running SCDCST Text-Vision Test with clip-resnext
(Same Class Different Color, Size & Texture)
Text format: {size} {color} {texture} {class} (natural English order)
Using device: cuda
[INFO] Loading clip-resnext on cuda...

Found 67 classes with 3+ color-size-texture combinations

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 492/492 [00:17<00:00, 27.51it/s]


Extracted embeddings for 7835 images
Skipped 22 corrupted/invalid images

Running 4000 trials across 67 classes...


Processing classes: 100%|██████████| 67/67 [00:18<00:00,  3.54it/s]


Results for clip-resnext - SCDCST Text-Vision Test:
Total trials: 4000
Correct: 3600
Accuracy: 0.9000 (90.00%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv





## Compare Results

In [None]:
# Display comparison
print("\n" + "="*60)
print("SCDCST TEXT-VISION TEST COMPARISON")
print("="*60)
print(f"\nTest: Same Class Different Color, Size & Texture (4-way forced choice)")
print(f"No control: All attributes mentioned in text")
print(f"Text format: '{{size}} {{color}} {{texture}} {{class}}' (natural English order)")
print(f"Example: 'large red smooth apple' vs 'small blue bumpy apple'")
print(f"\nResults:")
print(f"  CVCL Accuracy: {cvcl_accuracy:.4f} ({cvcl_accuracy*100:.2f}%)")
print(f"  CLIP Accuracy: {clip_accuracy:.4f} ({clip_accuracy*100:.2f}%)")
print(f"\nDifference: {abs(cvcl_accuracy - clip_accuracy):.4f} ({abs(cvcl_accuracy - clip_accuracy)*100:.2f} percentage points)")
if cvcl_accuracy > clip_accuracy:
    print(f"CVCL performs better by {(cvcl_accuracy - clip_accuracy)*100:.2f} percentage points")
elif clip_accuracy > cvcl_accuracy:
    print(f"CLIP performs better by {(clip_accuracy - cvcl_accuracy)*100:.2f} percentage points")
else:
    print("Both models perform equally")

print("\n" + "="*60)
print("\nAnalysis:")
print("- Maximum attribute variation within same class")
print("- All three attributes can provide discriminative signals")
print("- Prompts use natural English ordering (size → color → texture → class)")
print("- Hypothesis: should be easier than tests with fewer varying attributes")

## Analysis Notes

### SCDCST Text-Vision Test Characteristics:
- **Visual**: All 4 candidates from SAME class
- **Variation**: Candidates differ in their color-size-texture combination (at least one of color, size, or texture differs between any two)
- **Text Prompts**: Full natural order "{size} {color} {texture} {class}"
- **No control**: All visual attributes are mentioned in text
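
To put the reported accuracies in context, they can be compared against the 25% chance level of a 4-way choice. The sketch below uses a normal-approximation z-score. It assumes trials are independent and that chance is exactly 0.25, which holds only when all four candidate prompts are distinct. `z_vs_chance` is a hypothetical helper, and the counts are those printed in the runs above.

```python
import math


def z_vs_chance(correct, trials, chance=0.25):
    """Return (accuracy, z-score) of observed accuracy against a chance rate."""
    acc = correct / trials
    # Standard error of a proportion under the chance hypothesis
    se = math.sqrt(chance * (1 - chance) / trials)
    return acc, (acc - chance) / se


for name, correct in [("cvcl-resnext", 1158), ("clip-resnext", 3600)]:
    acc, z = z_vs_chance(correct, 4000)
    print(f"{name}: accuracy={acc:.4f}, z vs. chance={z:.1f}")
```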

