# Same Class Different Texture (SCDT) Text-Vision Comparison - SyntheticKonkle

This notebook compares CVCL and CLIP models on text-vision matching using the SyntheticKonkle dataset.
The task is a 4-way forced choice in which every candidate comes from the SAME class and the text prompts differ only in texture.

## Test Characteristics
- **Visual**: All candidates come from the SAME class; color and size may vary between candidates
- **Variation**: Texture (smooth vs bumpy) is the only attribute named in the text prompts
- **Control**: Color and size are NOT mentioned in the text, so they cannot help match a prompt to the query image
- **Text format**: `"{texture} {class}"` (e.g., "smooth apple", "bumpy apple")
- **Difficulty**: Very hard - only texture provides discrimination within same class
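
As a small illustration of the text format above, the prompt for each candidate can be built from its texture and class label. The helper name `build_prompt` is hypothetical; lowercasing the class name mirrors what the test function in this notebook does.

```python
def build_prompt(texture: str, class_name: str) -> str:
    """Return a '{texture} {class}' prompt, e.g. 'smooth apple'."""
    return f"{texture} {class_name.lower()}"


# Four candidates from one class under a 3-1 texture split.
candidate_textures = ["smooth", "smooth", "smooth", "bumpy"]
prompts = [build_prompt(t, "Apple") for t in candidate_textures]
print(prompts)
```

With only two texture values, the four prompts collapse to two distinct strings.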

## Special Note on 4-way Choice with 2 Textures
Since there are only 2 texture values (smooth and bumpy), we create 4-way choice by:
- Using 3 images of one texture (randomly chosen)
- Using 1 image of the other texture
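- Each image is paired with the prompt for its own texture, so the 4 prompts contain only 2 distinct strings (3 copies of one, 1 of the other)
- Example: 3 smooth apples + 1 bumpy apple gives the prompt "smooth apple" three times and "bumpy apple" once; the query image is scored against the position of its own prompt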
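
Because the 3-1 split yields three identical prompt strings, it may help to check what position-based scoring can reach at best. The sketch below is not the notebook's code: it enumerates every query choice and shuffle order for two idealised scorers, and it assumes score ties resolve to the first maximal index. The names `first_argmax` and `accuracy` are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Candidate textures under the 3-1 split; indices 0-2 are the majority.
textures = ["smooth", "smooth", "smooth", "bumpy"]


def first_argmax(scores):
    """Index of the first maximal score (ties resolved by position)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(score_fn):
    """Exact accuracy over every query choice and every shuffle order."""
    correct = total = 0
    for query_idx in range(4):
        for order in permutations(range(4)):
            shuffled = [textures[i] for i in order]
            correct_idx = order.index(query_idx)
            scores = [score_fn(textures[query_idx], t) for t in shuffled]
            correct += first_argmax(scores) == correct_idx
            total += 1
    return Fraction(correct, total)


# Idealised model: a prompt scores 1 when its texture matches the query's.
ideal = accuracy(lambda query_texture, prompt_texture: int(query_texture == prompt_texture))
# Texture-blind model: every prompt gets the same score.
blind = accuracy(lambda query_texture, prompt_texture: 0)

print(f"ideal: {ideal}, texture-blind: {blind}")
```

Under these assumptions the idealised scorer reaches 1/2 and the texture-blind scorer 1/4, which suggests that accuracies from this design are better read on a 25%-50% scale than on a 25%-100% scale.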

In [1]:
# Imports
import os
import sys
import pandas as pd
import numpy as np
import torch
from PIL import Image
from tqdm import tqdm
import random
from datetime import datetime
import clip
from collections import defaultdict
from torch.utils.data import Dataset, DataLoader

# Path setup - Use absolute paths to avoid any confusion
REPO_ROOT = r'C:\Users\jbats\Projects\NTU-Synthetic'

# Add discover-hidden-visual-concepts to path
DISCOVER_ROOT = os.path.join(REPO_ROOT, 'discover-hidden-visual-concepts')
sys.path.insert(0, DISCOVER_ROOT)
sys.path.insert(0, REPO_ROOT)

# Import from discover-hidden-visual-concepts repo
sys.path.append(os.path.join(DISCOVER_ROOT, 'src'))
from utils.model_loader import load_model
from models.feature_extractor import FeatureExtractor

# Paths
DATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle_224', 'SyntheticKonkle')
METADATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle', 'master_labels.csv')
RESULTS_PATH = os.path.join(REPO_ROOT, 'PatrickProject', 'Chart_Generation', 'text_vision_results.csv')

print(f"Data path: {DATA_PATH}")
print(f"Metadata path: {METADATA_PATH}")
print(f"Results will be saved to: {RESULTS_PATH}")



Data path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle_224\SyntheticKonkle
Metadata path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle\master_labels.csv
Results will be saved to: C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv


In [2]:
# Load and prepare data
def load_synthetic_data():
    """Load SyntheticKonkle dataset with metadata for texture testing"""
    # Read metadata
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to only entries with valid size, color, and texture information
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values (lowercase)
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Filter to only valid texture values
    valid_textures = ['smooth', 'bumpy']
    df = df[df['texture'].isin(valid_textures)].copy()
    
    # Create combination columns
    df['class_color_size'] = df['class'] + '_' + df['color'] + '_' + df['size']
    df['full_combo'] = df['class'] + '_' + df['color'] + '_' + df['size'] + '_' + df['texture']
    
    print(f"Loaded {len(df)} images with texture annotations")
    print(f"Unique classes: {df['class'].nunique()}")
    print(f"Unique textures: {sorted(df['texture'].unique())}")
    print(f"Unique colors: {df['color'].nunique()}")
    print(f"Unique sizes: {df['size'].nunique()}")
    
    # Find class-color-size combinations that have both textures
    ccs_groups = df.groupby('class_color_size')['texture'].nunique()
    valid_ccs = ccs_groups[ccs_groups == 2].index.tolist()  # Must have both smooth and bumpy
    
    print(f"\nClass-Color-Size combinations with both textures: {len(valid_ccs)}")
    if len(valid_ccs) > 0:
        print(f"Examples: {valid_ccs[:3]}")
    
    return df, valid_ccs

# Load data
data_df, valid_combinations = load_synthetic_data()
print("\nSample data:")
print(data_df[['class', 'texture', 'color', 'size']].head())

Loaded 7865 images with texture annotations
Unique classes: 67
Unique textures: ['bumpy', 'smooth']
Unique colors: 11
Unique sizes: 3

Class-Color-Size combinations with both textures: 1961
Examples: ['abacus_black_large', 'abacus_black_medium', 'abacus_black_small']

Sample data:
    class texture   color   size
0  abacus   bumpy     red  large
1  abacus   bumpy   green  large
2  abacus   bumpy    blue  large
3  abacus   bumpy  yellow  large
4  abacus   bumpy  orange  large


In [None]:
def run_scdt_text_vision_test(model_name='cvcl-resnext', seed=0, device=None, num_trials=4000):
    """Run Same Class Different Texture text-vision test
    
    Test design:
    - Candidates: 4 images of the SAME CLASS, 3 of one texture and 1 of the other
    - Query: one of the 4 candidates, chosen uniformly at random
    - Colors and sizes are not controlled and can vary between candidates
    - Text format: "{texture} {class}" (e.g., "smooth apple", "bumpy apple")
    - Scoring: the query image embedding is compared with the 4 candidate prompts
    
    Args:
        model_name: Model to test ('cvcl-resnext' or 'clip-resnext')
        seed: Random seed for reproducibility
        device: Device to use (None for auto-detect)
        num_trials: Total number of trials to run
    """
    # Set seeds to match original test methodology
    random.seed(seed)
    torch.manual_seed(seed)
    
    print(f"\n{'='*60}")
    print(f"Running SCDT Text-Vision Test with {model_name}")
    print(f"(Same Class Different Texture - Varied Colors & Sizes)")
    print(f"Text format: {{texture}} {{class}}")
    print(f"Note: Using 3-1 split (3 of one texture, 1 of the other)")
    print(f"{'='*60}")
    
    # Device selection
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    if device == 'cuda' and not torch.cuda.is_available():
        print("[ERROR] CUDA requested but not available! Falling back to CPU.")
        device = 'cpu'
    
    print(f"Using device: {device}")
    
    # Load model
    print(f"[INFO] Loading {model_name} on {device}...")
    model, transform = load_model(model_name, seed=seed, device=device)
    extractor = FeatureExtractor(model_name, model, device)
    model.eval()
    
    # Load and prepare data
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to entries with texture annotation
    df = df[df['texture'].notna() & (df['texture'] != '')].copy()
    
    # Standardize values
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Filter to only valid texture values
    valid_textures = ['smooth', 'bumpy']
    df = df[df['texture'].isin(valid_textures)].copy()
    
    print(f"\nLoaded {len(df)} images with texture annotations")
    print(f"Unique classes: {df['class'].nunique()}")
    print(f"Unique textures: {sorted(df['texture'].unique())}")
    
    # Find classes that have both textures with enough images
    class_groups = df.groupby('class')
    valid_classes = []
    for class_name, group in class_groups:
        unique_textures = group['texture'].unique()
        if len(unique_textures) == 2:  # Has both smooth and bumpy
            texture_counts = group.groupby('texture').size()
            if texture_counts.min() >= 4:  # At least 4 images per texture (for 3-1 split)
                valid_classes.append(class_name)
    
    if len(valid_classes) == 0:
        print("ERROR: No classes have both textures with enough images.")
        print("Cannot run SCDT test.")
        return [], 0.0
    
    print(f"\nFound {len(valid_classes)} classes with both textures")
    print(f"Classes (first 10): {sorted(valid_classes)[:10]}")
    
    # Pre-compute image embeddings
    print("\nExtracting image embeddings...")
    image_embeddings = {}
    skipped_images = []
    
    # Get all relevant images
    df_valid = df[df['class'].isin(valid_classes)]
    all_image_paths = df_valid['image_path'].unique().tolist()
    batch_size = 16
    
    for i in tqdm(range(0, len(all_image_paths), batch_size), desc="Extracting embeddings"):
        batch_paths = all_image_paths[i:i+batch_size]
        batch_images = []
        
        for img_path in batch_paths:
            try:
                img = Image.open(img_path).convert('RGB')
                img_processed = transform(img).unsqueeze(0).to(device)
                batch_images.append((img_path, img_processed))
            except Exception:
                # Unreadable or corrupted image: record it and move on
                skipped_images.append(img_path)
                continue
        
        if batch_images:
            paths = [p for p, _ in batch_images]
            imgs = torch.cat([img for _, img in batch_images], dim=0)
            
            with torch.no_grad():
                embeddings = extractor.get_img_feature(imgs)
                embeddings = extractor.norm_features(embeddings)
            
            for path, emb in zip(paths, embeddings):
                image_embeddings[path] = emb.cpu().float()
    
    print(f"Extracted embeddings for {len(image_embeddings)} images")
    if skipped_images:
        print(f"Skipped {len(skipped_images)} corrupted/invalid images")
    
    # Prepare for trials
    correct_count = 0
    trial_results = []
    
    # Calculate trials per class
    trials_per_class = num_trials // len(valid_classes)
    remaining_trials = num_trials % len(valid_classes)
    
    print(f"\nRunning {num_trials} trials across {len(valid_classes)} classes...")
    print(f"Trials per class: {trials_per_class}, with {remaining_trials} getting 1 extra")
    
    # Run trials
    for class_idx, class_name in enumerate(tqdm(valid_classes, desc="Processing classes")):
        # Get all images for this class
        class_data = df_valid[df_valid['class'] == class_name]
        
        # Group by texture
        smooth_images = class_data[class_data['texture'] == 'smooth']['image_path'].tolist()
        bumpy_images = class_data[class_data['texture'] == 'bumpy']['image_path'].tolist()
        
        # Filter to valid embeddings
        smooth_images = [p for p in smooth_images if p in image_embeddings]
        bumpy_images = [p for p in bumpy_images if p in image_embeddings]
        
        # Determine number of trials for this class
        n_trials = trials_per_class + (1 if class_idx < remaining_trials else 0)
        
        for trial in range(n_trials):
            if len(trial_results) >= num_trials:
                break
            
            # Randomly choose which texture gets 3 images vs 1
            if random.random() < 0.5:
                majority_texture = 'smooth'
                minority_texture = 'bumpy'
                majority_images = smooth_images
                minority_images = bumpy_images
            else:
                majority_texture = 'bumpy'
                minority_texture = 'smooth'
                majority_images = bumpy_images
                minority_images = smooth_images
            
            # Need at least 3 majority and 1 minority
            if len(majority_images) < 3 or len(minority_images) < 1:
                continue
            
            # Select 3 different images from majority texture (can be different colors/sizes)
            selected_majority = random.sample(majority_images, 3)
            # Select 1 from minority texture
            selected_minority = random.sample(minority_images, 1)
            
            # Build candidates list with (image_path, texture) tuples
            candidates = []
            for img_path in selected_majority:
                candidates.append((img_path, majority_texture))
            for img_path in selected_minority:
                candidates.append((img_path, minority_texture))
            
            # Select query from candidates
            query_idx = random.randint(0, 3)
            query_img_path, query_texture = candidates[query_idx]
            
            # Create text prompts
            candidate_texts = [f"{texture} {class_name.lower()}" for _, texture in candidates]
            
            # Shuffle for random presentation
            shuffled_order = list(range(4))
            random.shuffle(shuffled_order)
            shuffled_candidates = [candidates[i] for i in shuffled_order]
            shuffled_texts = [candidate_texts[i] for i in shuffled_order]
            correct_idx = shuffled_order.index(query_idx)
            
            # Encode text prompts
            with torch.no_grad():
                if "clip" in model_name:
                    tokens = clip.tokenize(shuffled_texts, truncate=True).to(device)
                    txt_features = model.encode_text(tokens)
                    txt_features = extractor.norm_features(txt_features)
                else:  # CVCL
                    tokens, token_len = model.tokenize(shuffled_texts)
                    tokens = tokens.to(device)
                    if isinstance(token_len, torch.Tensor):
                        token_len = token_len.to(device)
                    txt_features = model.encode_text(tokens, token_len)
                    txt_features = extractor.norm_features(txt_features)
            
            # Get query image embedding
            query_embedding = image_embeddings[query_img_path].unsqueeze(0).to(device)
            
            # Calculate similarity
            query_embedding = query_embedding.float()
            txt_features = txt_features.float()
            
            similarity = (100.0 * query_embedding @ txt_features.transpose(-2, -1)).softmax(dim=1)
            
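            # Get prediction (index of the highest-scoring prompt)
            pred_idx = similarity.argmax(dim=1).item()
            
            # Check if correct. The comparison is by position, not by prompt text:
            # the three majority-texture prompts are identical strings, so a
            # majority-texture query is only scored correct if argmax lands on
            # the query's own slot among those duplicates.
            is_correct = (pred_idx == correct_idx)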
            if is_correct:
                correct_count += 1
            
            # Store trial result
            trial_results.append({
                'trial': len(trial_results) + 1,
                'query_class': class_name,
                'query_texture': query_texture,
                'query_img': os.path.basename(query_img_path),
                'correct_idx': correct_idx,
                'predicted_idx': pred_idx,
                'correct': is_correct,
                'candidate_texts': shuffled_texts,
                'similarity_scores': similarity.cpu().numpy().tolist()
            })
    
    # Calculate accuracy
    accuracy = correct_count / len(trial_results) if trial_results else 0
    
    print(f"\n{'='*60}")
    print(f"Results for {model_name} - SCDT Text-Vision Test:")
    print(f"Total trials: {len(trial_results)}")
    print(f"Correct: {correct_count}")
    print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"{'='*60}")
    
    # Save results
    results_row = {
        'Model': model_name,
        'Test': 'SCDT-TextVision',
        'Dataset': 'SyntheticKonkle',
        'Correct': correct_count,
        'Trials': len(trial_results),
        'Accuracy': accuracy
    }
    
    os.makedirs(os.path.dirname(RESULTS_PATH), exist_ok=True)
    if os.path.exists(RESULTS_PATH):
        results_df = pd.read_csv(RESULTS_PATH)
    else:
        results_df = pd.DataFrame()
    
    results_df = pd.concat([results_df, pd.DataFrame([results_row])], ignore_index=True)
    results_df.to_csv(RESULTS_PATH, index=False, float_format='%.4f')
    print(f"\nResults saved to {RESULTS_PATH}")
    
    return trial_results, accuracy
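
The per-class trial budget used in the function above can be sketched in isolation. This is an illustrative helper (the name `allocate_trials` is not from the notebook), and the group counts 16 and 1961 are simply the values that appear in the recorded run outputs.

```python
def allocate_trials(num_trials: int, num_groups: int) -> list[int]:
    """Split num_trials across groups; the first `remainder` groups get one extra."""
    base, remainder = divmod(num_trials, num_groups)
    return [base + (1 if idx < remainder else 0) for idx in range(num_groups)]


for groups in (16, 1961):
    plan = allocate_trials(4000, groups)
    print(groups, "groups ->", "min", min(plan), "max", max(plan), "total", sum(plan))
```

The plan is an upper bound: a trial that cannot draw 3 majority and 1 minority images is skipped rather than retried, which may be why a run can report fewer trials than requested.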

## Run CVCL SCDT Text-Vision Test

In [4]:
# Run CVCL test with seed=0 (matching original tests)
cvcl_trials, cvcl_accuracy = run_scdt_text_vision_test('cvcl-resnext', seed=0, num_trials=4000)


Running SCDT Text-Vision Test with cvcl-resnext
(Same Class Different Texture - Controlled Color & Size)
Text format: {texture} {class}
Note: Using 3-1 split (3 of one texture, 1 of the other)
Using device: cuda
[INFO] Loading cvcl-resnext on cuda...
Loading checkpoint from C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt


Lightning automatically upgraded your loaded checkpoint from v1.5.8 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt`



Found 16 class-color-size combinations suitable for testing

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 5/5 [00:00<00:00, 11.74it/s]


Extracted embeddings for 64 images
Skipped 10 corrupted/invalid images

Running 4000 trials across 16 combinations...


Processing combinations: 100%|██████████| 16/16 [00:07<00:00,  2.17it/s]


Results for cvcl-resnext - SCDT Text-Vision Test:
Total trials: 879
Correct: 246
Accuracy: 0.2799 (27.99%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv





## Run CLIP SCDT Text-Vision Test

In [5]:
# Run CLIP test with seed=0 (matching original tests)
clip_trials, clip_accuracy = run_scdt_text_vision_test('clip-resnext', seed=0, num_trials=4000)


Running SCDT Text-Vision Test with clip-resnext
(Same Class Different Texture - Controlled Color & Size)
Text format: {texture} {class}
Using device: cuda
[INFO] Loading clip-resnext on cuda...

Found 1961 class-color-size combinations with both textures

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 488/488 [00:17<00:00, 27.59it/s]


Extracted embeddings for 7786 images
Skipped 22 corrupted/invalid images

Running 4000 trials across 1961 combinations...


Processing combinations: 100%|██████████| 1961/1961 [00:21<00:00, 89.87it/s]


Results for clip-resnext - SCDT Text-Vision Test:
Total trials: 4000
Correct: 1322
Accuracy: 0.3305 (33.05%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv





## Compare Results

In [None]:
# Display comparison
print("\n" + "="*60)
print("SCDT TEXT-VISION TEST COMPARISON")
print("="*60)
print(f"\nTest: Same Class Different Texture (4-way forced choice)")
print(f"Control: Colors and sizes can vary (not mentioned in text)")
print(f"Text format: '{{texture}} {{class}}'")
print(f"Implementation: 3-1 split (3 of one texture, 1 of another)")
print(f"\nResults:")
print(f"  CVCL Accuracy: {cvcl_accuracy:.4f} ({cvcl_accuracy*100:.2f}%)")
print(f"  CLIP Accuracy: {clip_accuracy:.4f} ({clip_accuracy*100:.2f}%)")
print(f"\nDifference: {abs(cvcl_accuracy - clip_accuracy):.4f} ({abs(cvcl_accuracy - clip_accuracy)*100:.2f} percentage points)")
if cvcl_accuracy > clip_accuracy:
    print(f"CVCL performs better by {(cvcl_accuracy - clip_accuracy)*100:.2f} percentage points")
elif clip_accuracy > cvcl_accuracy:
    print(f"CLIP performs better by {(clip_accuracy - cvcl_accuracy)*100:.2f} percentage points")
else:
    print("Both models perform equally")

print("\n" + "="*60)
print("\nAnalysis:")
print("- Tests pure texture discrimination within same class")
print("- Texture is a subtle tactile property")
print("- Colors/sizes vary to increase available test data")
print("- Models must encode and match texture information")

## Analysis Notes

### SCDT Text-Vision Test Characteristics:
- **Visual Control**: All 4 candidates share the same class; color and size may vary between candidates
- **Variation**: Texture (smooth vs bumpy) is the only attribute named in the prompts
- **Text Prompts**: "{texture} {class}" (e.g., "smooth apple")
- **4-way Implementation**: 3-1 split (3 images of one texture, 1 of the other)

### What This Tests:
- Pure texture discrimination within same class
- Model's ability to encode tactile/surface properties
- Whether texture information transfers from vision to language

### Expected Performance:
- Likely the hardest test - texture is subtle
- Performance may be near chance (25%) if models don't encode texture well
- CVCL might have an advantage if child-directed speech emphasizes texture
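
To judge how far an observed accuracy sits from the 25% chance level, a rough normal-approximation check can be used. This is a sketch, not part of the notebook's analysis: the function name `z_vs_chance` is hypothetical, and the approximation treats trials as independent, which is only loosely true here because images and classes are reused across trials.

```python
import math


def z_vs_chance(correct: int, trials: int, chance: float = 0.25) -> float:
    """Normal-approximation z-score of observed accuracy against a chance rate."""
    observed = correct / trials
    standard_error = math.sqrt(chance * (1 - chance) / trials)
    return (observed - chance) / standard_error


# Counts copied from the recorded runs in this notebook.
runs = {"cvcl-resnext": (246, 879), "clip-resnext": (1322, 4000)}
for name, (correct, trials) in runs.items():
    print(f"{name}: accuracy={correct / trials:.4f}, z={z_vs_chance(correct, trials):.2f}")
```

Because the two recorded runs used different numbers of trials, their z-scores are not directly comparable as effect sizes; the accuracy values themselves are the fairer comparison.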

### Why This Is Challenging:
- Texture is primarily a tactile property
- Visual texture cues can be subtle
- With only 2 texture values, a 4-way trial necessarily repeats text prompts
- Same class constraint removes object-level cues