# Same Class Different Size and Texture (SCDST) Text-Vision Comparison - SyntheticKonkle

This notebook compares the CVCL and CLIP models on text-vision matching using the SyntheticKonkle dataset.
The task is a 4-way forced choice: given one query image, the model must pick the matching text prompt out of four. Distractor prompts name the SAME class but a DIFFERENT size-texture combination.

## Text Ordering Considerations

**We use natural English adjective ordering**: `"{size} {texture} {class}"`
- Example: "large smooth apple", "small bumpy apple", "medium smooth apple"
- This follows the conventional English adjective order, in which size precedes texture
- CVCL (trained on child-directed speech) and CLIP (trained on internet text) have presumably seen this ordering far more often than the reverse, although this notebook does not test the reverse order
- For example, a parent would typically say "big smooth ball", not "smooth big ball"
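
The prompt format above can be sketched as a small helper. This is a minimal illustration, not code from the notebook: the function name `make_prompt` is hypothetical, and the lowercasing and stripping mirror the standardization applied to the metadata columns later on.

```python
def make_prompt(size: str, texture: str, class_name: str) -> str:
    """Build a '{size} {texture} {class}' prompt (hypothetical helper)."""
    parts = (size, texture, class_name)
    # Lowercase and strip each part so prompts are formatted consistently
    return " ".join(p.strip().lower() for p in parts)


# The three example prompts from the list above
prompts = [
    make_prompt(s, t, "apple")
    for s, t in [("large", "smooth"), ("small", "bumpy"), ("medium", "smooth")]
]
print(prompts)
```

Keeping the ordering in one function makes it easy to swap in a different order later if the effect of adjective ordering is ever tested.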

## Test Characteristics
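- **Visual**: All candidates come from the SAME class, with color held constant
- **Variation**: Candidates differ in their size-texture combination; two candidates may share a size or a texture, but not both (a combination is repeated only when a class-color group has just three)
- **Control**: Color is held constant visually but NOT mentioned in the text
- **Difficulty**: Medium - expected to be harder than DCDST (where class also varies) but easier than pure size or texture tests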
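
To make the trial structure concrete, here is a minimal sketch of how one 4-way trial could be drawn from the size-texture pairs of a single class-color group. It is an illustration under stated assumptions rather than the notebook's own routine: `sample_trial` is a hypothetical name, and the sketch requires at least four distinct pairs instead of padding a three-pair group with a repeat.

```python
import random


def sample_trial(pairs, rng):
    """Draw one query pair plus three distractors (hypothetical helper).

    Returns the shuffled candidate list and the index of the query in it.
    """
    if len(pairs) < 4:
        raise ValueError("need at least 4 distinct size-texture pairs")
    candidates = rng.sample(pairs, 4)  # four distinct pairs, random order
    query = candidates[0]              # first sampled pair acts as the query
    rng.shuffle(candidates)            # randomize the position of the answer
    return candidates, candidates.index(query)


rng = random.Random(0)
# All six combinations of the 3 sizes and 2 textures used in this notebook
pairs = [(s, t) for s in ("small", "medium", "large") for t in ("smooth", "bumpy")]
candidates, correct_idx = sample_trial(pairs, rng)
print(candidates, correct_idx)
```

With only six possible combinations, any four distinct candidates necessarily share some size or texture values, which is why the distractors differ in combination rather than in every attribute.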

In [1]:
# Imports
import os
import sys
import pandas as pd
import numpy as np
import torch
from PIL import Image
from tqdm import tqdm
import random
from datetime import datetime
import clip
from collections import defaultdict
from torch.utils.data import Dataset, DataLoader

# Path setup - Use absolute paths to avoid any confusion
REPO_ROOT = r'C:\Users\jbats\Projects\NTU-Synthetic'

# Add discover-hidden-visual-concepts to path
DISCOVER_ROOT = os.path.join(REPO_ROOT, 'discover-hidden-visual-concepts')
sys.path.insert(0, DISCOVER_ROOT)
sys.path.insert(0, REPO_ROOT)

# Import from discover-hidden-visual-concepts repo
sys.path.append(os.path.join(DISCOVER_ROOT, 'src'))
from utils.model_loader import load_model
from models.feature_extractor import FeatureExtractor

# Paths
DATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle_224', 'SyntheticKonkle')
METADATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle', 'master_labels.csv')
RESULTS_PATH = os.path.join(REPO_ROOT, 'PatrickProject', 'Chart_Generation', 'text_vision_results.csv')

print(f"Data path: {DATA_PATH}")
print(f"Metadata path: {METADATA_PATH}")
print(f"Results will be saved to: {RESULTS_PATH}")

Data path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle_224\SyntheticKonkle
Metadata path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle\master_labels.csv
Results will be saved to: C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv


In [2]:
# Load and prepare data
def load_synthetic_data():
    """Load SyntheticKonkle dataset with metadata for size-texture testing"""
    # Read metadata
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to only entries with valid size, color, and texture information
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values (lowercase)
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Filter to only valid size and texture values
    valid_sizes = ['small', 'medium', 'large']
    valid_textures = ['smooth', 'bumpy']
    df = df[df['size'].isin(valid_sizes) & df['texture'].isin(valid_textures)].copy()
    
    # Create combination columns
    df['class_color'] = df['class'] + '_' + df['color']
    df['size_texture'] = df['size'] + '_' + df['texture']
    df['full_combo'] = df['class'] + '_' + df['color'] + '_' + df['size'] + '_' + df['texture']
    
    print(f"Loaded {len(df)} images with size, color, and texture annotations")
    print(f"Unique classes: {df['class'].nunique()}")
    print(f"Unique sizes: {sorted(df['size'].unique())}")
    print(f"Unique textures: {sorted(df['texture'].unique())}")
    print(f"Unique colors: {df['color'].nunique()}")
    
    # Find class-color combinations that have multiple size-texture pairs
    cc_groups = df.groupby('class_color')['size_texture'].nunique()
    valid_cc = cc_groups[cc_groups >= 3].index.tolist()
    
    print(f"\nClass-Color combinations with 3+ size-texture pairs: {len(valid_cc)}")
    if valid_cc:
        print(f"Examples: {valid_cc[:3]}")
        # Show a few size-texture pairs for the first example
        example = valid_cc[0]
        st_pairs = df[df['class_color'] == example]['size_texture'].unique()[:4]
        print(f"  {example} has size-texture pairs like: {st_pairs}")
    
    return df, valid_cc

# Load data
data_df, valid_combinations = load_synthetic_data()
print("\nSample data:")
print(data_df[['class', 'size', 'texture', 'color', 'size_texture']].head())

Loaded 7865 images with size, color, and texture annotations
Unique classes: 67
Unique sizes: ['large', 'medium', 'small']
Unique textures: ['bumpy', 'smooth']
Unique colors: 11

Class-Color combinations with 3+ size-texture pairs: 663
Examples: ['abacus_black', 'abacus_blue', 'abacus_brown']
  abacus_black has size-texture pairs like: ['large_bumpy' 'large_smooth' 'medium_bumpy' 'medium_smooth']

Sample data:
    class   size texture   color size_texture
0  abacus  large   bumpy     red  large_bumpy
1  abacus  large   bumpy   green  large_bumpy
2  abacus  large   bumpy    blue  large_bumpy
3  abacus  large   bumpy  yellow  large_bumpy
4  abacus  large   bumpy  orange  large_bumpy


In [3]:
def run_scdst_text_vision_test(model_name='cvcl-resnext', seed=0, device=None, num_trials=4000):
    """Run Same Class Different Size and Texture text-vision test with controlled color
    
    Text format uses natural English ordering: "{size} {texture} {class}"
    Example: "large smooth apple", "small bumpy apple"
    
    Args:
        model_name: Model to test ('cvcl-resnext' or 'clip-resnext')
        seed: Random seed for reproducibility
        device: Device to use (None for auto-detect)
        num_trials: Total number of trials to run
    """
    # Set seeds to match original test methodology
    random.seed(seed)
    torch.manual_seed(seed)
    
    print(f"\n{'='*60}")
    print(f"Running SCDST Text-Vision Test with {model_name}")
    print(f"(Same Class Different Size & Texture - Controlled Color)")
    print(f"Text format: {{size}} {{texture}} {{class}} (natural English order)")
    print(f"{'='*60}")
    
    # Device selection
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    if device == 'cuda' and not torch.cuda.is_available():
        print("[WARN] CUDA requested but not available; falling back to CPU.")
        device = 'cpu'
    
    print(f"Using device: {device}")
    
    # Load model
    print(f"[INFO] Loading {model_name} on {device}...")
    model, transform = load_model(model_name, seed=seed, device=device)
    extractor = FeatureExtractor(model_name, model, device)
    model.eval()
    
    # Load and prepare data
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to entries with all annotations
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Filter to only valid size and texture values
    valid_sizes = ['small', 'medium', 'large']
    valid_textures = ['smooth', 'bumpy']
    df = df[df['size'].isin(valid_sizes) & df['texture'].isin(valid_textures)].copy()
    
    # Create combination columns
    df['class_color'] = df['class'] + '_' + df['color']
    df['size_texture'] = df['size'] + '_' + df['texture']
    
    # Find class-color combinations with at least 3 different size-texture pairs
    cc_groups = df.groupby('class_color')
    valid_cc = []
    for cc, group in cc_groups:
        unique_st = group['size_texture'].unique()
        # With 3 sizes and 2 textures, we can have up to 6 combinations
        if len(unique_st) >= 3:
            valid_cc.append(cc)
    
    if len(valid_cc) == 0:
        print("ERROR: No class-color combinations have 3+ different size-texture pairs.")
        print("Cannot run SCDST test with strict controls.")
        return [], 0.0
    
    print(f"\nFound {len(valid_cc)} class-color combinations with 3+ size-texture pairs")
    
    # Pre-compute image embeddings for efficiency
    print("\nExtracting image embeddings...")
    image_embeddings = {}
    skipped_images = []
    
    # Get all relevant images
    df_valid = df[df['class_color'].isin(valid_cc)]
    all_image_paths = df_valid['image_path'].unique().tolist()
    batch_size = 16
    
    for i in tqdm(range(0, len(all_image_paths), batch_size), desc="Extracting embeddings"):
        batch_paths = all_image_paths[i:i+batch_size]
        batch_images = []
        
        for img_path in batch_paths:
            try:
                img = Image.open(img_path).convert('RGB')
                img_processed = transform(img).unsqueeze(0).to(device)
                batch_images.append((img_path, img_processed))
            except Exception:
                # Unreadable or corrupt file: record it and move on
                skipped_images.append(img_path)
        
        if batch_images:
            paths = [p for p, _ in batch_images]
            imgs = torch.cat([img for _, img in batch_images], dim=0)
            
            with torch.no_grad():
                embeddings = extractor.get_img_feature(imgs)
                embeddings = extractor.norm_features(embeddings)
            
            for path, emb in zip(paths, embeddings):
                image_embeddings[path] = emb.cpu().float()
    
    print(f"Extracted embeddings for {len(image_embeddings)} images")
    if skipped_images:
        print(f"Skipped {len(skipped_images)} corrupted/invalid images")
    
    # Prepare for trials
    correct_count = 0
    trial_results = []
    
    # Calculate trials per combination
    trials_per_cc = num_trials // len(valid_cc)
    remaining_trials = num_trials % len(valid_cc)
    
    print(f"\nRunning {num_trials} trials across {len(valid_cc)} combinations...")
    
    # Run trials
    for cc_idx, cc in enumerate(tqdm(valid_cc, desc="Processing combinations")):
        # Get all images for this class-color combination
        cc_data = df_valid[df_valid['class_color'] == cc]
        
        # Group by size-texture
        st_groups = cc_data.groupby('size_texture').agg({
            'image_path': list,
            'size': 'first',
            'texture': 'first'
        }).to_dict('index')
        
        available_st = list(st_groups.keys())
        
        if len(available_st) < 3:
            continue
        
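        # Take the class name from the data rather than splitting the key,
        # which would truncate any class name that contains an underscore
        class_name = cc_data['class'].iloc[0]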
        
        # Determine number of trials for this combination
        n_trials = trials_per_cc + (1 if cc_idx < remaining_trials else 0)
        
        for trial in range(n_trials):
            # Select size-texture pairs for the 4-way choice
            if len(available_st) == 3:
                # Only 3 pairs exist: use all of them and repeat one at random,
                # so this trial has 4 slots but only 3 distinct prompts.
                # NOTE: in this branch the query below is always available_st[0].
                selected_pairs = available_st.copy()
                selected_pairs.append(random.choice(available_st))
            else:
                # 4 or more pairs exist: sample 4 distinct pairs in random order
                selected_pairs = random.sample(available_st, 4)
            
            # First pair is the query
            query_pair = selected_pairs[0]
            query_data = st_groups[query_pair]
            
            # Select random query image from valid images
            valid_query_paths = [p for p in query_data['image_path'] if p in image_embeddings]
            if not valid_query_paths:
                continue
            query_img_path = random.choice(valid_query_paths)
            query_size = query_data['size']
            query_texture = query_data['texture']
            
            # Shuffle for candidate order
            random.shuffle(selected_pairs)
            correct_idx = selected_pairs.index(query_pair)
            
            # Create text prompts - NATURAL ENGLISH ORDER: {size} {texture} {class}
            candidate_texts = []
            for pair in selected_pairs:
                pair_data = st_groups[pair]
                # Natural English order: size before texture before noun
                text_prompt = f"{pair_data['size']} {pair_data['texture']} {class_name.lower()}"
                candidate_texts.append(text_prompt)
            
            # Encode text prompts
            with torch.no_grad():
                if "clip" in model_name:
                    tokens = clip.tokenize(candidate_texts, truncate=True).to(device)
                    txt_features = model.encode_text(tokens)
                    txt_features = extractor.norm_features(txt_features)
                else:  # CVCL
                    tokens, token_len = model.tokenize(candidate_texts)
                    tokens = tokens.to(device)
                    if isinstance(token_len, torch.Tensor):
                        token_len = token_len.to(device)
                    txt_features = model.encode_text(tokens, token_len)
                    txt_features = extractor.norm_features(txt_features)
            
            # Get query image embedding
            query_embedding = image_embeddings[query_img_path].unsqueeze(0).to(device)
            
            # Calculate similarity
            query_embedding = query_embedding.float()
            txt_features = txt_features.float()
            
            similarity = (100.0 * query_embedding @ txt_features.transpose(-2, -1)).softmax(dim=1)
            
            # Get prediction
            pred_idx = similarity.argmax(dim=1).item()
            
            # Check if correct
            is_correct = (pred_idx == correct_idx)
            if is_correct:
                correct_count += 1
            
            # Store trial result
            trial_results.append({
                'trial': len(trial_results) + 1,
                'query_class': class_name,
                'query_size': query_size,
                'query_texture': query_texture,
                'class_color': cc,
                'query_img': os.path.basename(query_img_path),
                'correct_idx': correct_idx,
                'predicted_idx': pred_idx,
                'correct': is_correct,
                'candidate_texts': candidate_texts,
                'similarity_scores': similarity.cpu().numpy().tolist()
            })
    
    # Calculate accuracy
    accuracy = correct_count / len(trial_results) if trial_results else 0
    
    print(f"\n{'='*60}")
    print(f"Results for {model_name} - SCDST Text-Vision Test:")
    print(f"Total trials: {len(trial_results)}")
    print(f"Correct: {correct_count}")
    print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"{'='*60}")
    
    # Save results
    results_row = {
        'Model': model_name,
        'Test': 'SCDST-TextVision',
        'Dataset': 'SyntheticKonkle',
        'Correct': correct_count,
        'Trials': len(trial_results),
        'Accuracy': accuracy
    }
    
    os.makedirs(os.path.dirname(RESULTS_PATH), exist_ok=True)
    if os.path.exists(RESULTS_PATH):
        results_df = pd.read_csv(RESULTS_PATH)
    else:
        results_df = pd.DataFrame()
    
    results_df = pd.concat([results_df, pd.DataFrame([results_row])], ignore_index=True)
    results_df.to_csv(RESULTS_PATH, index=False, float_format='%.4f')
    print(f"\nResults saved to {RESULTS_PATH}")
    
    return trial_results, accuracy
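
The scoring step inside the trial loop (L2-normalize, scale the cosine similarities by 100, softmax, argmax) can be illustrated without loading either model. The sketch below is a pure-Python stand-in, not the notebook's tensor code: the function names are hypothetical and the 3-dimensional vectors are toy values chosen only to show the mechanics.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_candidates(image_vec, text_vecs, scale=100.0):
    """Return softmax probabilities and the index of the best-matching text."""
    img = l2_normalize(image_vec)
    logits = [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_vecs
    ]
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return probs, pred


# Toy embeddings (assumed values, not real model outputs)
image_vec = [0.9, 0.1, 0.0]
text_vecs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
probs, pred = score_candidates(image_vec, text_vecs)
print(pred, [round(p, 4) for p in probs])
```

Softmax is monotonic, so the predicted index is the same as the argmax of the raw cosine similarities; the scaling and softmax only affect the stored `similarity_scores`.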

## Run CVCL SCDST Text-Vision Test

In [4]:
# Run CVCL test with seed=0 (matching original tests)
cvcl_trials, cvcl_accuracy = run_scdst_text_vision_test('cvcl-resnext', seed=0, num_trials=4000)


Running SCDST Text-Vision Test with cvcl-resnext
(Same Class Different Size & Texture - Controlled Color)
Text format: {size} {texture} {class} (natural English order)
Using device: cuda
[INFO] Loading cvcl-resnext on cuda...
Loading checkpoint from C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt


Lightning automatically upgraded your loaded checkpoint from v1.5.8 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt`



Found 663 class-color combinations with 3+ size-texture pairs

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 490/490 [00:21<00:00, 22.81it/s]


Extracted embeddings for 7805 images
Skipped 22 corrupted/invalid images

Running 4000 trials across 663 combinations...


Processing combinations: 100%|██████████| 663/663 [00:39<00:00, 16.79it/s]


Results for cvcl-resnext - SCDST Text-Vision Test:
Total trials: 4000
Correct: 1378
Accuracy: 0.3445 (34.45%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv





## Run CLIP SCDST Text-Vision Test

In [5]:
# Run CLIP test with seed=0 (matching original tests)
clip_trials, clip_accuracy = run_scdst_text_vision_test('clip-resnext', seed=0, num_trials=4000)


Running SCDST Text-Vision Test with clip-resnext
(Same Class Different Size & Texture - Controlled Color)
Text format: {size} {texture} {class} (natural English order)
Using device: cuda
[INFO] Loading clip-resnext on cuda...

Found 663 class-color combinations with 3+ size-texture pairs

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 490/490 [00:18<00:00, 26.68it/s]


Extracted embeddings for 7805 images
Skipped 22 corrupted/invalid images

Running 4000 trials across 663 combinations...


Processing combinations: 100%|██████████| 663/663 [00:22<00:00, 30.10it/s]


Results for clip-resnext - SCDST Text-Vision Test:
Total trials: 4000
Correct: 1662
Accuracy: 0.4155 (41.55%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv





## Compare Results
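
Before reading too much into the gap between the two models, it helps to check how large it is relative to sampling noise. The sketch below uses the counts printed in the run outputs of this notebook (1378 and 1662 correct out of 4000 trials each). It is a rough check under stated assumptions: trials are treated as independent, a normal approximation is used, and 0.25 is taken as the nominal chance level, which holds only when all four candidate prompts are distinct. The function names are illustrative.

```python
import math


def accuracy_with_se(correct, trials):
    """Accuracy and its binomial standard error."""
    p = correct / trials
    return p, math.sqrt(p * (1 - p) / trials)


def z_vs_chance(correct, trials, chance=0.25):
    """z-score of the observed accuracy against a nominal chance level."""
    p = correct / trials
    return (p - chance) / math.sqrt(chance * (1 - chance) / trials)


def two_proportion_z(c1, n1, c2, n2):
    """Pooled two-proportion z-score for the difference p2 - p1."""
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (c2 / n2 - c1 / n1) / se


cvcl_p, cvcl_se = accuracy_with_se(1378, 4000)
clip_p, clip_se = accuracy_with_se(1662, 4000)
z_diff = two_proportion_z(1378, 4000, 1662, 4000)

print(f"CVCL: {cvcl_p:.4f} +/- {1.96 * cvcl_se:.4f} (z vs chance {z_vs_chance(1378, 4000):.1f})")
print(f"CLIP: {clip_p:.4f} +/- {1.96 * clip_se:.4f} (z vs chance {z_vs_chance(1662, 4000):.1f})")
print(f"CLIP minus CVCL z-score: {z_diff:.2f}")
```

Because both runs use the same seed and draw from the same images, the trials are not truly independent, so these figures are best read as an order-of-magnitude guide.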

In [None]:
# Display comparison
print("\n" + "="*60)
print("SCDST TEXT-VISION TEST COMPARISON")
print("="*60)
print(f"\nTest: Same Class Different Size & Texture (4-way forced choice)")
print(f"Control: Color held constant (not mentioned in text)")
print(f"Text format: '{{size}} {{texture}} {{class}}' (natural English order)")
print(f"Example: 'large smooth apple' vs 'small bumpy apple'")
print(f"\nResults:")
print(f"  CVCL Accuracy: {cvcl_accuracy:.4f} ({cvcl_accuracy*100:.2f}%)")
print(f"  CLIP Accuracy: {clip_accuracy:.4f} ({clip_accuracy*100:.2f}%)")
diff_pp = abs(cvcl_accuracy - clip_accuracy) * 100  # gap in percentage points
print(f"\nDifference: {diff_pp / 100:.4f} ({diff_pp:.2f} percentage points)")
if cvcl_accuracy > clip_accuracy:
    print(f"CVCL performs better by {diff_pp:.2f} percentage points")
elif clip_accuracy > cvcl_accuracy:
    print(f"CLIP performs better by {diff_pp:.2f} percentage points")
else:
    print("Both models perform equally")

print("\n" + "="*60)
print("\nAnalysis:")
print("- Tests multi-attribute discrimination within the same class")
print("- Both size AND texture provide discriminative signals")
print("- Prompts use natural English ordering (size before texture); its effect is not tested here")
print("- Expected to be easier than single-attribute tests (SCDS or pure texture), pending comparison")

## Analysis Notes

### SCDST Text-Vision Test Characteristics:
- **Visual Control**: All 4 candidates share the query's class and color
- **Variation**: Candidates differ in size-texture combination (any two may share one attribute, but not both)
- **Text Prompts**: Natural order "{size} {texture} {class}" (e.g., "large smooth apple")
- **NOT mentioned**: Color is controlled but excluded from the text
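
Because both size and texture vary, it can be useful to see whether errors concentrate on particular attribute values. The sketch below groups per-trial records by one field and computes accuracy per group. It assumes records shaped like the dictionaries the test function stores (with `query_size`, `query_texture` and `correct` keys); `accuracy_by` is a hypothetical helper and the four records are toy data, not results from this notebook.

```python
from collections import defaultdict


def accuracy_by(trials, key):
    """Accuracy per distinct value of `key` across trial records."""
    totals = defaultdict(lambda: [0, 0])  # value -> [correct, total]
    for t in trials:
        bucket = totals[t[key]]
        bucket[0] += int(t["correct"])
        bucket[1] += 1
    return {value: c / n for value, (c, n) in totals.items()}


# Toy records for illustration only
toy_trials = [
    {"query_size": "large", "query_texture": "smooth", "correct": True},
    {"query_size": "large", "query_texture": "bumpy", "correct": False},
    {"query_size": "small", "query_texture": "smooth", "correct": True},
    {"query_size": "small", "query_texture": "bumpy", "correct": True},
]

by_size = accuracy_by(toy_trials, "query_size")
by_texture = accuracy_by(toy_trials, "query_texture")
print(by_size)
print(by_texture)
```

The same call could be applied to `cvcl_trials` or `clip_trials` after the runs above to compare the two models attribute by attribute.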

