# Same Class Different Color and Size (SCDCS) Text-Vision Comparison - SyntheticKonkle

This notebook compares CVCL and CLIP models on text-vision matching using the SyntheticKonkle dataset.
The task is 4-way classification where distractors are from the SAME class but DIFFERENT colors AND sizes.

## Text Ordering Considerations

**We use natural English adjective ordering**: `"{size} {color} {class}"`
- Example: "large red apple", "small green apple", "medium blue apple"
- This follows standard English grammar rules (opinion → size → age → shape → color → origin → material → purpose → noun)
- Both CVCL (trained on child-directed speech) and CLIP (trained on internet text) are likely to have seen this ordering far more often than the alternatives during training
- Parents typically say "big red ball" not "red big ball" or "ball big red"
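
The prompt template described above can be sketched as a small helper. This is only an illustration: `build_prompt` is a hypothetical name, not a function from this notebook, and the lowercasing and trimming are assumed here to mirror the normalization applied to the metadata columns later on.

```python
def build_prompt(size: str, color: str, class_name: str) -> str:
    """Compose a "{size} {color} {class}" prompt, lowercased and trimmed."""
    parts = [size, color, class_name]
    return " ".join(p.strip().lower() for p in parts)


# Example: size comes before color, and color before the noun
print(build_prompt("Large", " red ", "Apple"))
```

Keeping the template in one place would make it easier to try other orderings (for example `"{color} {size} {class}"`) without touching the trial loop.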

## Test Characteristics
- **Visual**: All candidates from SAME class, with controlled texture
- **Variation**: Both color AND size differ between candidates
- **Control**: Texture held constant visually but NOT mentioned in text
- **Difficulty**: Medium - harder than DCDC/DCDS (where class also varies) but easier than pure color or size tests

In [1]:
# Imports
import os
import sys
import pandas as pd
import numpy as np
import torch
from PIL import Image
from tqdm import tqdm
import random
from datetime import datetime
import clip
from collections import defaultdict
from torch.utils.data import Dataset, DataLoader

# Path setup - Use absolute paths to avoid any confusion
REPO_ROOT = r'C:\Users\jbats\Projects\NTU-Synthetic'

# Add discover-hidden-visual-concepts to path
DISCOVER_ROOT = os.path.join(REPO_ROOT, 'discover-hidden-visual-concepts')
sys.path.insert(0, DISCOVER_ROOT)
sys.path.insert(0, REPO_ROOT)

# Import from discover-hidden-visual-concepts repo
sys.path.append(os.path.join(DISCOVER_ROOT, 'src'))
from utils.model_loader import load_model
from models.feature_extractor import FeatureExtractor

# Paths
DATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle_224', 'SyntheticKonkle')
METADATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle', 'master_labels.csv')
RESULTS_PATH = os.path.join(REPO_ROOT, 'PatrickProject', 'Chart_Generation', 'text_vision_results.csv')

print(f"Data path: {DATA_PATH}")
print(f"Metadata path: {METADATA_PATH}")
print(f"Results will be saved to: {RESULTS_PATH}")


Data path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle_224\SyntheticKonkle
Metadata path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle\master_labels.csv
Results will be saved to: C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv


In [2]:
# Load and prepare data
def load_synthetic_data():
    """Load SyntheticKonkle dataset with metadata for color-size testing"""
    # Read metadata
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to only entries with valid size, color, and texture information
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values (lowercase)
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Create combination columns
    df['class_texture'] = df['class'] + '_' + df['texture']
    df['color_size'] = df['color'] + '_' + df['size']
    df['full_combo'] = df['class'] + '_' + df['color'] + '_' + df['texture'] + '_' + df['size']
    
    print(f"Loaded {len(df)} images with size, color, and texture annotations")
    print(f"Unique classes: {df['class'].nunique()}")
    print(f"Unique colors: {df['color'].nunique()}")
    print(f"Unique sizes: {df['size'].nunique()}")
    print(f"Size values: {sorted(df['size'].unique())}")
    print(f"Color values: {sorted(df['color'].unique())[:10]}...")  # Show first 10
    
    # Find class-texture combinations that have multiple color-size pairs
    ct_groups = df.groupby('class_texture')['color_size'].nunique()
    valid_ct = ct_groups[ct_groups >= 3].index.tolist()
    
    print(f"\nClass-Texture combinations with 3+ color-size pairs: {len(valid_ct)}")
    if valid_ct:
        print(f"Examples: {valid_ct[:3]}")
        # Show the color-size distribution for the first example
        example = valid_ct[0]
        color_sizes = df[df['class_texture'] == example]['color_size'].unique()[:4]
        print(f"  {example} has color-size pairs like: {color_sizes}")
    
    return df, valid_ct

# Load data
data_df, valid_combinations = load_synthetic_data()
print("\nSample data:")
print(data_df[['class', 'color', 'size', 'texture', 'color_size']].head())

Loaded 7882 images with size, color, and texture annotations
Unique classes: 67
Unique colors: 12
Unique sizes: 4
Size values: ['bumpy', 'large', 'medium', 'small']
Color values: ['ball', 'black', 'blue', 'brown', 'gray', 'green', 'orange', 'pink', 'purple', 'red']...

Class-Texture combinations with 3+ color-size pairs: 136
Examples: ['abacus_bumpy', 'abacus_smooth', 'apple_bumpy']
  abacus_bumpy has color-size pairs like: ['red_large' 'green_large' 'blue_large' 'yellow_large']

Sample data:
    class   color   size texture    color_size
0  abacus     red  large   bumpy     red_large
1  abacus   green  large   bumpy   green_large
2  abacus    blue  large   bumpy    blue_large
3  abacus  yellow  large   bumpy  yellow_large
4  abacus  orange  large   bumpy  orange_large
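
The output above lists `'bumpy'` among the size values and `'ball'` among the color values, which suggests that a few metadata rows may have attributes in the wrong column. A minimal sketch of a vocabulary check follows. `EXPECTED_SIZES` and `unexpected_values` are hypothetical names, and the allowed set is an assumption about the intended labels rather than something the dataset documents.

```python
# Assumed vocabulary: adjust to the labels the dataset is meant to use.
EXPECTED_SIZES = {"small", "medium", "large"}


def unexpected_values(values, allowed):
    """Return the sorted, normalized values that are not in `allowed`."""
    normalized = {str(v).strip().lower() for v in values}
    return sorted(normalized - set(allowed))


# Size values copied from the printed output above
observed_sizes = ["bumpy", "large", "medium", "small"]
print(unexpected_values(observed_sizes, EXPECTED_SIZES))
```

With the DataFrame at hand, the same idea could be applied as a filter, for example `df[df['size'].isin(EXPECTED_SIZES)]`, before the class-texture groups are built.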


In [3]:
def run_scdcs_text_vision_test(model_name='cvcl-resnext', seed=0, device=None, num_trials=4000):
    """Run Same Class Different Color and Size text-vision test with controlled texture
    
    Text format uses natural English ordering: "{size} {color} {class}"
    Example: "large red apple", "small green apple"
    
    Args:
        model_name: Model to test ('cvcl-resnext' or 'clip-resnext')
        seed: Random seed for reproducibility
        device: Device to use (None for auto-detect)
        num_trials: Total number of trials to run
    """
    # Set seeds to match original test methodology
    random.seed(seed)
    torch.manual_seed(seed)
    
    print(f"\n{'='*60}")
    print(f"Running SCDCS Text-Vision Test with {model_name}")
    print(f"(Same Class Different Color & Size - Controlled Texture)")
    print(f"Text format: {{size}} {{color}} {{class}} (natural English order)")
    print(f"{'='*60}")
    
    # Device selection
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    if device == 'cuda' and not torch.cuda.is_available():
        print("[WARN] CUDA requested but not available. Falling back to CPU.")
        device = 'cpu'
    
    print(f"Using device: {device}")
    
    # Load model
    print(f"[INFO] Loading {model_name} on {device}...")
    model, transform = load_model(model_name, seed=seed, device=device)
    extractor = FeatureExtractor(model_name, model, device)
    model.eval()
    
    # Load and prepare data
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to entries with all annotations
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Create combination columns
    df['class_texture'] = df['class'] + '_' + df['texture']
    df['color_size'] = df['color'] + '_' + df['size']
    
    # Find class-texture combinations with at least 3 different color-size pairs
    ct_groups = df.groupby('class_texture')
    valid_ct = []
    for ct, group in ct_groups:
        unique_color_sizes = group['color_size'].unique()
        if len(unique_color_sizes) >= 3:
            valid_ct.append(ct)
    
    if len(valid_ct) == 0:
        print("ERROR: No class-texture combinations have 3+ different color-size pairs.")
        print("Cannot run SCDCS test with strict controls.")
        return [], 0.0
    
    print(f"\nFound {len(valid_ct)} class-texture combinations with 3+ color-size pairs")
    
    # Pre-compute image embeddings for efficiency
    print("\nExtracting image embeddings...")
    image_embeddings = {}
    skipped_images = []
    
    # Get all relevant images
    df_valid = df[df['class_texture'].isin(valid_ct)]
    all_image_paths = df_valid['image_path'].unique().tolist()
    batch_size = 16
    
    for i in tqdm(range(0, len(all_image_paths), batch_size), desc="Extracting embeddings"):
        batch_paths = all_image_paths[i:i+batch_size]
        batch_images = []
        
        for img_path in batch_paths:
            try:
                img = Image.open(img_path).convert('RGB')
                img_processed = transform(img).unsqueeze(0).to(device)
                batch_images.append((img_path, img_processed))
            except Exception:
                # Unreadable or corrupted image: record it and move on
                skipped_images.append(img_path)
        
        if batch_images:
            paths = [p for p, _ in batch_images]
            imgs = torch.cat([img for _, img in batch_images], dim=0)
            
            with torch.no_grad():
                embeddings = extractor.get_img_feature(imgs)
                embeddings = extractor.norm_features(embeddings)
            
            for path, emb in zip(paths, embeddings):
                image_embeddings[path] = emb.cpu().float()
    
    print(f"Extracted embeddings for {len(image_embeddings)} images")
    if skipped_images:
        print(f"Skipped {len(skipped_images)} corrupted/invalid images")
    
    # Prepare for trials
    correct_count = 0
    trial_results = []
    
    # Calculate trials per combination
    trials_per_ct = num_trials // len(valid_ct)
    remaining_trials = num_trials % len(valid_ct)
    
    print(f"\nRunning {num_trials} trials across {len(valid_ct)} combinations...")
    
    # Run trials
    for ct_idx, ct in enumerate(tqdm(valid_ct, desc="Processing combinations")):
        # Get all images for this class-texture combination
        ct_data = df_valid[df_valid['class_texture'] == ct]
        
        # Group by color-size
        cs_groups = ct_data.groupby('color_size').agg({
            'image_path': list,
            'color': 'first',
            'size': 'first'
        }).to_dict('index')
        
        available_color_sizes = list(cs_groups.keys())
        
        if len(available_color_sizes) < 3:
            continue
        
        # Take the class name from the data; splitting `ct` on '_' would
        # truncate class names that themselves contain underscores
        class_name = ct_data['class'].iloc[0]
        
        # Determine number of trials for this combination
        n_trials = trials_per_ct + (1 if ct_idx < remaining_trials else 0)
        
        for trial in range(n_trials):
            # Select color-size pairs for the 4-way choice.
            # Pairs are distinct as combinations, so two candidates may
            # still share a color or a size.
            if len(available_color_sizes) == 3:
                # Only 3 pairs: use all of them and repeat one, so two of the
                # four candidate texts are identical (effectively a 3-way choice)
                selected_pairs = available_color_sizes.copy()
                selected_pairs.append(random.choice(available_color_sizes))
            else:
                # 4 or more pairs available: sample 4 distinct ones
                selected_pairs = random.sample(available_color_sizes, 4)
            
            # First pair is the query
            query_pair = selected_pairs[0]
            query_data = cs_groups[query_pair]
            
            # Select random query image from valid images
            valid_query_paths = [p for p in query_data['image_path'] if p in image_embeddings]
            if not valid_query_paths:
                continue
            query_img_path = random.choice(valid_query_paths)
            query_color = query_data['color']
            query_size = query_data['size']
            
            # Shuffle for candidate order
            random.shuffle(selected_pairs)
            correct_idx = selected_pairs.index(query_pair)
            
            # Create text prompts - NATURAL ENGLISH ORDER: {size} {color} {class}
            candidate_texts = []
            for pair in selected_pairs:
                pair_data = cs_groups[pair]
                # Natural English order: size before color before noun
                text_prompt = f"{pair_data['size']} {pair_data['color']} {class_name.lower()}"
                candidate_texts.append(text_prompt)
            
            # Encode text prompts
            with torch.no_grad():
                if "clip" in model_name:
                    tokens = clip.tokenize(candidate_texts, truncate=True).to(device)
                    txt_features = model.encode_text(tokens)
                    txt_features = extractor.norm_features(txt_features)
                else:  # CVCL
                    tokens, token_len = model.tokenize(candidate_texts)
                    tokens = tokens.to(device)
                    if isinstance(token_len, torch.Tensor):
                        token_len = token_len.to(device)
                    txt_features = model.encode_text(tokens, token_len)
                    txt_features = extractor.norm_features(txt_features)
            
            # Get query image embedding
            query_embedding = image_embeddings[query_img_path].unsqueeze(0).to(device)
            
            # Calculate similarity
            query_embedding = query_embedding.float()
            txt_features = txt_features.float()
            
            similarity = (100.0 * query_embedding @ txt_features.transpose(-2, -1)).softmax(dim=1)
            
            # Get prediction
            pred_idx = similarity.argmax(dim=1).item()
            
            # Check if correct
            is_correct = (pred_idx == correct_idx)
            if is_correct:
                correct_count += 1
            
            # Store trial result
            trial_results.append({
                'trial': len(trial_results) + 1,
                'query_class': class_name,
                'query_color': query_color,
                'query_size': query_size,
                'class_texture': ct,
                'query_img': os.path.basename(query_img_path),
                'correct_idx': correct_idx,
                'predicted_idx': pred_idx,
                'correct': is_correct,
                'candidate_texts': candidate_texts,
                'similarity_scores': similarity.cpu().numpy().tolist()
            })
    
    # Calculate accuracy
    accuracy = correct_count / len(trial_results) if trial_results else 0
    
    print(f"\n{'='*60}")
    print(f"Results for {model_name} - SCDCS Text-Vision Test:")
    print(f"Total trials: {len(trial_results)}")
    print(f"Correct: {correct_count}")
    print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"{'='*60}")
    
    # Save results
    results_row = {
        'Model': model_name,
        'Test': 'SCDCS-TextVision',
        'Dataset': 'SyntheticKonkle',
        'Correct': correct_count,
        'Trials': len(trial_results),
        'Accuracy': accuracy
    }
    
    os.makedirs(os.path.dirname(RESULTS_PATH), exist_ok=True)
    if os.path.exists(RESULTS_PATH):
        results_df = pd.read_csv(RESULTS_PATH)
    else:
        results_df = pd.DataFrame()
    
    results_df = pd.concat([results_df, pd.DataFrame([results_row])], ignore_index=True)
    results_df.to_csv(RESULTS_PATH, index=False, float_format='%.4f')
    print(f"\nResults saved to {RESULTS_PATH}")
    
    return trial_results, accuracy
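
The trial loop above samples distinct color-size pairs, which does not by itself guarantee that a distractor differs from the target in both attributes (the earlier sample output shows `red_large`, `green_large`, and `blue_large` within one group). A stricter sampler could look like the sketch below. `pick_strict_candidates` is a hypothetical helper, not part of this notebook, and it assumes `pairs` holds unique `(color, size)` tuples.

```python
import random


def pick_strict_candidates(pairs, rng, n_candidates=4):
    """Pick a target plus distractors that differ from it in BOTH attributes.

    `pairs` is a list of unique (color, size) tuples. Returns
    (candidates, correct_idx), or None when too few distractors qualify.
    Distractors are only checked against the target, not against each other.
    """
    target = rng.choice(pairs)
    distractors = [p for p in pairs if p[0] != target[0] and p[1] != target[1]]
    if len(distractors) < n_candidates - 1:
        return None
    candidates = [target] + rng.sample(distractors, n_candidates - 1)
    rng.shuffle(candidates)
    return candidates, candidates.index(target)


rng = random.Random(0)
toy_pairs = [("red", "large"), ("green", "small"), ("blue", "medium"), ("yellow", "tiny")]
print(pick_strict_candidates(toy_pairs, rng))
```

Under this rule, groups whose images all share one size would yield no valid trial, so the number of usable class-texture combinations would likely drop.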

## Run CVCL SCDCS Text-Vision Test

In [4]:
# Run CVCL test with seed=0 (matching original tests)
cvcl_trials, cvcl_accuracy = run_scdcs_text_vision_test('cvcl-resnext', seed=0, num_trials=4000)


Running SCDCS Text-Vision Test with cvcl-resnext
(Same Class Different Color & Size - Controlled Texture)
Text format: {size} {color} {class} (natural English order)
Using device: cuda
[INFO] Loading cvcl-resnext on cuda...
Loading checkpoint from C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt


Lightning automatically upgraded your loaded checkpoint from v1.5.8 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt`



Found 136 class-texture combinations with 3+ color-size pairs

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 492/492 [00:21<00:00, 23.00it/s]


Extracted embeddings for 7841 images
Skipped 31 corrupted/invalid images

Running 4000 trials across 136 combinations...


Processing combinations: 100%|██████████| 136/136 [00:38<00:00,  3.49it/s]


Results for cvcl-resnext - SCDCS Text-Vision Test:
Total trials: 3971
Correct: 1078
Accuracy: 0.2715 (27.15%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv
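
The split of 4000 requested trials across 136 combinations follows the `trials_per_ct` and `remaining_trials` logic in the function above. The sketch below restates that allocation on its own. `allocate_trials` is a hypothetical name used only for illustration.

```python
def allocate_trials(num_trials, num_groups):
    """Spread trials evenly; the first `extra` groups get one more each."""
    base, extra = divmod(num_trials, num_groups)
    return [base + (1 if i < extra else 0) for i in range(num_groups)]


allocation = allocate_trials(4000, 136)
print(allocation[0], allocation[-1], sum(allocation))
```

Here 56 combinations receive 30 trials and the remaining 80 receive 29. The reported total can still fall below the requested number because the loop skips a trial when the chosen query pair has no image with an extracted embedding.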





## Run CLIP SCDCS Text-Vision Test

In [5]:
# Run CLIP test with seed=0 (matching original tests)
clip_trials, clip_accuracy = run_scdcs_text_vision_test('clip-resnext', seed=0, num_trials=4000)


Running SCDCS Text-Vision Test with clip-resnext
(Same Class Different Color & Size - Controlled Texture)
Text format: {size} {color} {class} (natural English order)
Using device: cuda
[INFO] Loading clip-resnext on cuda...

Found 136 class-texture combinations with 3+ color-size pairs

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 492/492 [00:17<00:00, 28.21it/s]


Extracted embeddings for 7841 images
Skipped 31 corrupted/invalid images

Running 4000 trials across 136 combinations...


Processing combinations: 100%|██████████| 136/136 [00:20<00:00,  6.76it/s]


Results for clip-resnext - SCDCS Text-Vision Test:
Total trials: 3971
Correct: 3599
Accuracy: 0.9063 (90.63%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv





## Compare Results

In [6]:
# Display comparison
print("\n" + "="*60)
print("SCDCS TEXT-VISION TEST COMPARISON")
print("="*60)
print(f"\nTest: Same Class Different Color & Size (4-way forced choice)")
print(f"Control: Texture held constant (not mentioned in text)")
print(f"Text format: '{{size}} {{color}} {{class}}' (natural English order)")
print(f"Example: 'large red apple' vs 'small green apple'")
print(f"\nResults:")
print(f"  CVCL Accuracy: {cvcl_accuracy:.4f} ({cvcl_accuracy*100:.2f}%)")
print(f"  CLIP Accuracy: {clip_accuracy:.4f} ({clip_accuracy*100:.2f}%)")
print(f"\nDifference: {abs(cvcl_accuracy - clip_accuracy):.4f} ({abs(cvcl_accuracy - clip_accuracy)*100:.2f}%)")
if cvcl_accuracy > clip_accuracy:
    print(f"CVCL performs better by {(cvcl_accuracy - clip_accuracy)*100:.2f}%")
elif clip_accuracy > cvcl_accuracy:
    print(f"CLIP performs better by {(clip_accuracy - cvcl_accuracy)*100:.2f}%")
else:
    print("Both models perform equally")

print("\n" + "="*60)
print("\nAnalysis:")
print("- Tests multi-attribute discrimination within same class")
print("- Both color AND size provide discriminative signals")
print("- Natural English ordering helps both models")
print("- Should perform better than single-attribute tests (SCDC or SCDS)")


SCDCS TEXT-VISION TEST COMPARISON

Test: Same Class Different Color & Size (4-way forced choice)
Control: Texture held constant (not mentioned in text)
Text format: '{size} {color} {class}' (natural English order)
Example: 'large red apple' vs 'small green apple'

Results:
  CVCL Accuracy: 0.2715 (27.15%)
  CLIP Accuracy: 0.9063 (90.63%)

Difference: 0.6349 (63.49%)
CLIP performs better by 63.49%


Analysis:
- Tests multi-attribute discrimination within same class
- Both color AND size provide discriminative signals
- Natural English ordering helps both models
- Should perform better than single-attribute tests (SCDC or SCDS)


## Analysis Notes

### SCDCS Text-Vision Test Characteristics:
- **Visual Control**: All 4 candidates have same class and texture
- **Variation**: Both color AND size differ between candidates
- **Text Prompts**: Natural order "{size} {color} {class}" (e.g., "large red apple")
- **NOT mentioned**: Texture is controlled but excluded from text
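
To put the reported accuracies in context, they can be compared against the 25% chance level of a 4-way choice. The sketch below uses a normal approximation to the binomial. `z_vs_chance` is a hypothetical helper, and the 25% baseline is itself an assumption: trials built from only three color-size pairs contain a repeated candidate text, so the effective chance level for those trials is somewhat different.

```python
import math


def z_vs_chance(correct, trials, chance=0.25):
    """Normal-approximation z-score of observed accuracy against `chance`."""
    p_hat = correct / trials
    se = math.sqrt(chance * (1 - chance) / trials)
    return (p_hat - chance) / se


# Counts copied from the two result blocks above
for name, correct, trials in [("cvcl-resnext", 1078, 3971),
                              ("clip-resnext", 3599, 3971)]:
    z = z_vs_chance(correct, trials)
    print(f"{name}: accuracy={correct / trials:.4f}, z vs. chance={z:.2f}")
```

A larger z-score means the accuracy sits further above chance; an exact binomial test would be a reasonable alternative to this approximation.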
