# Different Class Different Color and Texture (DCDCT) Text-Vision Comparison - SyntheticKonkle

This notebook compares CVCL and CLIP models on text-vision matching using the SyntheticKonkle dataset.
The task is 4-way classification where distractors are from DIFFERENT classes with DIFFERENT colors AND textures.

## Text Ordering Considerations

**We use natural English adjective ordering**: `"{color} {texture} {class}"`
- Example: "red smooth apple", "blue bumpy car", "green smooth ball"
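- This places color before texture, which the notebook treats as the natural order. Conventional adjective-order guides usually list physical quality (such as "bumpy") before color, so this ordering is a working assumption rather than a strict grammar rule
- Both CVCL (trained on child-directed speech) and CLIP (trained on internet text) are assumed to handle this ordering; alternative orderings are not tested here
- The assumption is that parents say "red bumpy ball" rather than "bumpy red ball"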
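
The prompt template above can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: `make_prompt` and its `order` parameter are hypothetical names, and the optional `order` argument is only there to show how an alternative adjective order could be generated for comparison.

```python
def make_prompt(attrs, order=("color", "texture", "class")):
    """Join the selected attributes into a lowercase, space-separated prompt.

    `attrs` is a dict with "color", "texture", and "class" keys (hypothetical
    structure mirroring the metadata columns used in this notebook).
    """
    return " ".join(str(attrs[key]).strip().lower() for key in order)


item = {"class": "Apple", "color": "red", "texture": "smooth"}

# Default order used in this notebook: "{color} {texture} {class}"
default_prompt = make_prompt(item)

# Alternative order, e.g. for an ordering-sensitivity check
swapped_prompt = make_prompt(item, order=("texture", "color", "class"))

print(default_prompt)
print(swapped_prompt)
```

Keeping the order as a parameter makes it cheap to rerun the same trials with a different template, should the ordering assumption need to be checked.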

## Test Characteristics
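- **Visual**: Every distractor comes from a DIFFERENT class than the query, with size controlled
- **Variation**: Class, color, AND texture are intended to differ between the query and each distractor; the sampling code enforces only the class difference and leaves color and texture to chance
- **Control**: Size is held constant visually but NOT mentioned in the text prompts
- **Difficulty**: Expected to be easier, since class, color, and texture can all serve as discriminative cues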
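
One way to make the stated constraints explicit is sketched below. It is an assumption-laden illustration rather than the sampling used in `run_dcdct_text_vision_test`: `pick_distractors` is a hypothetical helper, combinations are plain dicts instead of DataFrame rows, and a dedicated `random.Random` instance is used so the draw is reproducible. Because the dataset has only two textures, every distractor that differs from the query in texture necessarily shares the other texture, so texture can differ from the query but not pairwise across all four candidates.

```python
import random


def pick_distractors(query, pool, k=3, rng=None):
    """Pick k distractors matching the query's size but differing from it
    in class, color, and texture, with mutually distinct classes.

    Returns None when the pool cannot satisfy the constraints.
    """
    rng = rng or random.Random()
    eligible = [
        c for c in pool
        if c["size"] == query["size"]
        and c["class"] != query["class"]
        and c["color"] != query["color"]
        and c["texture"] != query["texture"]
    ]
    rng.shuffle(eligible)

    chosen, seen_classes = [], set()
    for cand in eligible:
        if cand["class"] in seen_classes:
            continue  # keep distractor classes distinct from each other
        chosen.append(cand)
        seen_classes.add(cand["class"])
        if len(chosen) == k:
            return chosen
    return None


# Toy pool (made up for illustration)
pool = [
    {"class": "apple", "color": "red", "texture": "smooth", "size": "large"},
    {"class": "car", "color": "blue", "texture": "bumpy", "size": "large"},
    {"class": "ball", "color": "green", "texture": "bumpy", "size": "large"},
    {"class": "cup", "color": "yellow", "texture": "bumpy", "size": "large"},
    {"class": "car", "color": "green", "texture": "bumpy", "size": "large"},
    {"class": "hat", "color": "red", "texture": "bumpy", "size": "large"},   # same color as query
    {"class": "key", "color": "blue", "texture": "bumpy", "size": "small"},  # different size
]
query = pool[0]
distractors = pick_distractors(query, pool, rng=random.Random(0))
```

Returning `None` instead of silently relaxing a constraint lets the caller decide whether to skip the trial or fall back to a looser rule.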

In [1]:
# Imports
import os
import sys
import pandas as pd
import numpy as np
import torch
from PIL import Image
from tqdm import tqdm
import random
from datetime import datetime
import clip
from collections import defaultdict
from torch.utils.data import Dataset, DataLoader

# Path setup - Use absolute paths to avoid any confusion
REPO_ROOT = r'C:\Users\jbats\Projects\NTU-Synthetic'

# Add discover-hidden-visual-concepts to path
DISCOVER_ROOT = os.path.join(REPO_ROOT, 'discover-hidden-visual-concepts')
sys.path.insert(0, DISCOVER_ROOT)
sys.path.insert(0, REPO_ROOT)

# Import from discover-hidden-visual-concepts repo
sys.path.append(os.path.join(DISCOVER_ROOT, 'src'))
from utils.model_loader import load_model
from models.feature_extractor import FeatureExtractor

# Paths
DATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle_224', 'SyntheticKonkle')
METADATA_PATH = os.path.join(REPO_ROOT, 'data', 'SyntheticKonkle', 'master_labels.csv')
RESULTS_PATH = os.path.join(REPO_ROOT, 'PatrickProject', 'Chart_Generation', 'text_vision_results.csv')

print(f"Data path: {DATA_PATH}")
print(f"Metadata path: {METADATA_PATH}")
print(f"Results will be saved to: {RESULTS_PATH}")

Data path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle_224\SyntheticKonkle
Metadata path: C:\Users\jbats\Projects\NTU-Synthetic\data\SyntheticKonkle\master_labels.csv
Results will be saved to: C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv
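
The paths above are tied to one machine. A minimal sketch of a more portable setup follows, assuming the repository root can be supplied through an environment variable; the variable name `NTU_SYNTHETIC_ROOT` and the helper `resolve_repo_root` are hypothetical and not part of the original project.

```python
import os
from pathlib import Path


def resolve_repo_root(env_var="NTU_SYNTHETIC_ROOT", default=None):
    """Return the repo root from an environment variable, else `default`,
    else the current working directory."""
    value = os.environ.get(env_var)
    if value:
        return Path(value).expanduser()
    return Path(default) if default is not None else Path.cwd()


repo_root = resolve_repo_root()

# Same relative layout as the hard-coded paths in the cell above
data_path = repo_root / "data" / "SyntheticKonkle_224" / "SyntheticKonkle"
metadata_path = repo_root / "data" / "SyntheticKonkle" / "master_labels.csv"

print(f"Data path: {data_path}")
print(f"Metadata path: {metadata_path}")
```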


In [2]:
# Load and prepare data
def load_synthetic_data():
    """Load SyntheticKonkle dataset with metadata for DCDCT testing"""
    # Read metadata
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to only entries with valid size, color, and texture information
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values (lowercase)
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Filter to only valid size and texture values
    valid_sizes = ['small', 'medium', 'large']
    valid_textures = ['smooth', 'bumpy']
    df = df[df['size'].isin(valid_sizes) & df['texture'].isin(valid_textures)].copy()
    
    # Create combination columns
    df['class_color_texture'] = df['class'] + '_' + df['color'] + '_' + df['texture']
    df['full_combo'] = df['class'] + '_' + df['color'] + '_' + df['size'] + '_' + df['texture']
    
    print(f"Loaded {len(df)} images with size, color, and texture annotations")
    print(f"Unique classes: {df['class'].nunique()}")
    print(f"Unique colors: {df['color'].nunique()}")
    print(f"Unique textures: {sorted(df['texture'].unique())}")
    print(f"Unique sizes: {sorted(df['size'].unique())}")
    
    # Show unique class-color-texture combinations
    unique_cct = df['class_color_texture'].nunique()
    print(f"\nTotal unique class-color-texture combinations: {unique_cct}")
    print(f"Sample combinations: {df['class_color_texture'].unique()[:5]}")
    
    return df

# Load data
data_df = load_synthetic_data()
print("\nSample data:")
print(data_df[['class', 'color', 'texture', 'size', 'class_color_texture']].head())

Loaded 7865 images with size, color, and texture annotations
Unique classes: 67
Unique colors: 11
Unique textures: ['bumpy', 'smooth']
Unique sizes: ['large', 'medium', 'small']

Total unique class-color-texture combinations: 1342
Sample combinations: ['abacus_red_bumpy' 'abacus_green_bumpy' 'abacus_blue_bumpy'
 'abacus_yellow_bumpy' 'abacus_orange_bumpy']

Sample data:
    class   color texture   size  class_color_texture
0  abacus     red   bumpy  large     abacus_red_bumpy
1  abacus   green   bumpy  large   abacus_green_bumpy
2  abacus    blue   bumpy  large    abacus_blue_bumpy
3  abacus  yellow   bumpy  large  abacus_yellow_bumpy
4  abacus  orange   bumpy  large  abacus_orange_bumpy


In [3]:
def run_dcdct_text_vision_test(model_name='cvcl-resnext', seed=0, device=None, num_trials=4000):
    """Run Different Class Different Color and Texture text-vision test with controlled size
    
    Text format uses natural English ordering: "{color} {texture} {class}"
    Example: "red smooth apple", "blue bumpy car", "green smooth ball"
    
    Args:
        model_name: Model to test ('cvcl-resnext' or 'clip-resnext')
        seed: Random seed for reproducibility
        device: Device to use (None for auto-detect)
        num_trials: Total number of trials to run
    """
    # Set seeds to match original test methodology.
    # pandas' DataFrame.sample draws from NumPy's global RNG, so seed that too.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    
    print(f"\n{'='*60}")
    print(f"Running DCDCT Text-Vision Test with {model_name}")
    print(f"(Different Class Different Color & Texture - Controlled Size)")
    print(f"Text format: {{color}} {{texture}} {{class}} (natural English order)")
    print(f"{'='*60}")
    
    # Device selection
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    if device == 'cuda' and not torch.cuda.is_available():
        print("[WARN] CUDA requested but not available; falling back to CPU.")
        device = 'cpu'
    
    print(f"Using device: {device}")
    
    # Load model
    print(f"[INFO] Loading {model_name} on {device}...")
    model, transform = load_model(model_name, seed=seed, device=device)
    extractor = FeatureExtractor(model_name, model, device)
    model.eval()
    
    # Load and prepare data
    df = pd.read_csv(METADATA_PATH)
    
    # Build full paths
    df['image_path'] = df.apply(lambda row: os.path.join(DATA_PATH, row['folder'], row['filename']), axis=1)
    
    # Filter to entries with all annotations and valid values
    df = df[
        df['size'].notna() & (df['size'] != '') &
        df['color'].notna() & (df['color'] != '') &
        df['texture'].notna() & (df['texture'] != '')
    ].copy()
    
    # Standardize values
    df['size'] = df['size'].str.lower().str.strip()
    df['color'] = df['color'].str.lower().str.strip()
    df['texture'] = df['texture'].str.lower().str.strip()
    
    # Filter to only valid size and texture values
    valid_sizes = ['small', 'medium', 'large']
    valid_textures = ['smooth', 'bumpy']
    df = df[df['size'].isin(valid_sizes) & df['texture'].isin(valid_textures)].copy()
    
    # Create unique identifier for each combination
    df['class_color_texture_size'] = df['class'] + '_' + df['color'] + '_' + df['texture'] + '_' + df['size']
    
    # Group all data by unique combinations
    combo_groups = df.groupby('class_color_texture_size').agg({
        'image_path': list,
        'class': 'first',
        'color': 'first',
        'texture': 'first',
        'size': 'first'
    }).reset_index()
    
    print(f"\nTotal unique combinations: {len(combo_groups)}")
    print(f"Unique classes: {combo_groups['class'].nunique()}")
    print(f"Unique colors: {combo_groups['color'].nunique()}")
    print(f"Unique textures: {combo_groups['texture'].nunique()} - {sorted(combo_groups['texture'].unique())}")
    print(f"Unique sizes: {combo_groups['size'].nunique()} - {sorted(combo_groups['size'].unique())}")
    
    # Pre-compute image embeddings for efficiency
    print("\nExtracting image embeddings...")
    image_embeddings = {}
    skipped_images = []
    
    all_image_paths = df['image_path'].unique().tolist()
    batch_size = 16
    
    for i in tqdm(range(0, len(all_image_paths), batch_size), desc="Extracting embeddings"):
        batch_paths = all_image_paths[i:i+batch_size]
        batch_images = []
        
        for img_path in batch_paths:
            try:
                img = Image.open(img_path).convert('RGB')
                img_processed = transform(img).unsqueeze(0).to(device)
                batch_images.append((img_path, img_processed))
            except Exception:
                # Unreadable or corrupted file: record it and move on
                skipped_images.append(img_path)
        
        if batch_images:
            paths = [p for p, _ in batch_images]
            imgs = torch.cat([img for _, img in batch_images], dim=0)
            
            with torch.no_grad():
                embeddings = extractor.get_img_feature(imgs)
                embeddings = extractor.norm_features(embeddings)
            
            for path, emb in zip(paths, embeddings):
                image_embeddings[path] = emb.cpu().float()
    
    print(f"Extracted embeddings for {len(image_embeddings)} images")
    if skipped_images:
        print(f"Skipped {len(skipped_images)} corrupted/invalid images")
    
    # Prepare for trials
    correct_count = 0
    trial_results = []
    
    print(f"\nRunning {num_trials} trials...")
    
    # Run trials
    for trial_num in tqdm(range(num_trials), desc="Running trials"):
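        # For DCDCT: pick a query plus 3 distractors whose class differs from the query's.
        # Distractors are not forced to differ from each other, nor from the query in color/texture.
        # Size is matched to the query when enough same-size combinations exist.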
        
        # First, randomly select a query combination
        query_idx = random.randint(0, len(combo_groups) - 1)
        query_combo = combo_groups.iloc[query_idx]
        query_class = query_combo['class']
        query_size = query_combo['size']
        
        # Find combinations with same size but different classes
        same_size_diff_class = combo_groups[
            (combo_groups['size'] == query_size) & 
            (combo_groups['class'] != query_class)
        ]
        
        # Also get combinations with different classes (fallback if not enough same size)
        diff_class_combos = combo_groups[combo_groups['class'] != query_class]
        
        # Build candidate list
        candidate_combos = [query_combo]
        
        # Try to get 3 distractors with same size but different classes
        if len(same_size_diff_class) >= 3:
            # Uniform random sample; color and texture are left to chance
            distractors = same_size_diff_class.sample(3)
        else:
            # If not enough same-size options, use any different class
            distractors = diff_class_combos.sample(3)
        
        for _, distractor in distractors.iterrows():
            candidate_combos.append(distractor)
        
        # Ensure we have exactly 4 candidates
        if len(candidate_combos) != 4:
            continue
            
        # Select random query image from valid images
        valid_query_paths = [p for p in query_combo['image_path'] if p in image_embeddings]
        if not valid_query_paths:
            continue
        query_img_path = random.choice(valid_query_paths)
        
        # Shuffle candidates for random order (keeping track of correct index)
        shuffled_order = list(range(4))
        random.shuffle(shuffled_order)
        shuffled_candidates = [candidate_combos[i] for i in shuffled_order]
        correct_idx = shuffled_order.index(0)  # Find where the query (index 0) ended up
        
        # Create text prompts - NATURAL ENGLISH ORDER: {color} {texture} {class}
        candidate_texts = []
        for candidate in shuffled_candidates:
            # Natural English order: color before texture before noun
            text_prompt = f"{candidate['color']} {candidate['texture']} {candidate['class'].lower()}"
            candidate_texts.append(text_prompt)
        
        # Encode text prompts
        with torch.no_grad():
            if "clip" in model_name:
                tokens = clip.tokenize(candidate_texts, truncate=True).to(device)
                txt_features = model.encode_text(tokens)
                txt_features = extractor.norm_features(txt_features)
            else:  # CVCL
                tokens, token_len = model.tokenize(candidate_texts)
                tokens = tokens.to(device)
                if isinstance(token_len, torch.Tensor):
                    token_len = token_len.to(device)
                txt_features = model.encode_text(tokens, token_len)
                txt_features = extractor.norm_features(txt_features)
        
        # Get query image embedding
        query_embedding = image_embeddings[query_img_path].unsqueeze(0).to(device)
        
        # Calculate similarity
        query_embedding = query_embedding.float()
        txt_features = txt_features.float()
        
        similarity = (100.0 * query_embedding @ txt_features.transpose(-2, -1)).softmax(dim=1)
        
        # Get prediction
        pred_idx = similarity.argmax(dim=1).item()
        
        # Check if correct
        is_correct = (pred_idx == correct_idx)
        if is_correct:
            correct_count += 1
        
        # Store trial result
        trial_results.append({
            'trial': trial_num + 1,
            'query_class': query_combo['class'],
            'query_color': query_combo['color'],
            'query_texture': query_combo['texture'],
            'query_size': query_combo['size'],
            'query_img': os.path.basename(query_img_path),
            'correct_idx': correct_idx,
            'predicted_idx': pred_idx,
            'correct': is_correct,
            'candidate_texts': candidate_texts,
            'similarity_scores': similarity.cpu().numpy().tolist()
        })
    
    # Calculate accuracy
    accuracy = correct_count / len(trial_results) if trial_results else 0
    
    print(f"\n{'='*60}")
    print(f"Results for {model_name} - DCDCT Text-Vision Test:")
    print(f"Total trials: {len(trial_results)}")
    print(f"Correct: {correct_count}")
    print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"{'='*60}")
    
    # Save results
    results_row = {
        'Model': model_name,
        'Test': 'DCDCT-TextVision',
        'Dataset': 'SyntheticKonkle',
        'Correct': correct_count,
        'Trials': len(trial_results),
        'Accuracy': accuracy
    }
    
    os.makedirs(os.path.dirname(RESULTS_PATH), exist_ok=True)
    if os.path.exists(RESULTS_PATH):
        results_df = pd.read_csv(RESULTS_PATH)
    else:
        results_df = pd.DataFrame()
    
    results_df = pd.concat([results_df, pd.DataFrame([results_row])], ignore_index=True)
    results_df.to_csv(RESULTS_PATH, index=False, float_format='%.4f')
    print(f"\nResults saved to {RESULTS_PATH}")
    
    return trial_results, accuracy
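
The scoring step inside `run_dcdct_text_vision_test` (normalize, take dot products, scale by 100, softmax, argmax) can be illustrated on toy vectors without any model. The sketch below is pure Python and assumes that `extractor.norm_features` performs L2 normalization; `l2_normalize` and `score_trial` are hypothetical names, and the embeddings are made up.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed to match norm_features)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def score_trial(image_vec, text_vecs, scale=100.0):
    """Return (predicted index, softmax probabilities) for one 4-way trial."""
    img = l2_normalize(image_vec)
    logits = [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_vecs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs


# Toy 3-d embeddings: the second text points in nearly the same direction as the image
image = [1.0, 0.2, 0.0]
texts = [
    [0.0, 1.0, 0.0],
    [0.9, 0.3, 0.1],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
pred, probs = score_trial(image, texts)
print(pred, [round(p, 4) for p in probs])
```

Softmax is monotonic, so the predicted index is the same as the argmax of the raw cosine similarities; the scaling only affects how peaked the stored `similarity_scores` are.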

## Run CVCL DCDCT Text-Vision Test

In [4]:
# Run CVCL test with seed=0 (matching original tests)
cvcl_trials, cvcl_accuracy = run_dcdct_text_vision_test('cvcl-resnext', seed=0, num_trials=4000)


Running DCDCT Text-Vision Test with cvcl-resnext
(Different Class Different Color & Texture - Controlled Size)
Text format: {color} {texture} {class} (natural English order)
Using device: cuda
[INFO] Loading cvcl-resnext on cuda...
Loading checkpoint from C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt


Lightning automatically upgraded your loaded checkpoint from v1.5.8 to v2.5.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\jbats\.cache\huggingface\hub\models--wkvong--cvcl_s_dino_resnext50_embedding\snapshots\f50eaa0c50a6076a5190b1dd52aeeb6c3e747045\cvcl_s_dino_resnext50_embedding.ckpt`



Total unique combinations: 3952
Unique classes: 67
Unique colors: 11
Unique textures: 2 - ['bumpy', 'smooth']
Unique sizes: 3 - ['large', 'medium', 'small']

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 492/492 [00:21<00:00, 22.55it/s]


Extracted embeddings for 7835 images
Skipped 22 corrupted/invalid images

Running 4000 trials...


Running trials: 100%|██████████| 4000/4000 [00:49<00:00, 80.52it/s]


Results for cvcl-resnext - DCDCT Text-Vision Test:
Total trials: 4000
Correct: 1053
Accuracy: 0.2632 (26.32%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv





## Run CLIP DCDCT Text-Vision Test

In [5]:
# Run CLIP test with seed=0 (matching original tests)
clip_trials, clip_accuracy = run_dcdct_text_vision_test('clip-resnext', seed=0, num_trials=4000)


Running DCDCT Text-Vision Test with clip-resnext
(Different Class Different Color & Texture - Controlled Size)
Text format: {color} {texture} {class} (natural English order)
Using device: cuda
[INFO] Loading clip-resnext on cuda...

Total unique combinations: 3952
Unique classes: 67
Unique colors: 11
Unique textures: 2 - ['bumpy', 'smooth']
Unique sizes: 3 - ['large', 'medium', 'small']

Extracting image embeddings...


Extracting embeddings: 100%|██████████| 492/492 [00:18<00:00, 27.27it/s]


Extracted embeddings for 7835 images
Skipped 22 corrupted/invalid images

Running 4000 trials...


Running trials: 100%|██████████| 4000/4000 [00:26<00:00, 151.91it/s]


Results for clip-resnext - DCDCT Text-Vision Test:
Total trials: 4000
Correct: 3848
Accuracy: 0.9620 (96.20%)

Results saved to C:\Users\jbats\Projects\NTU-Synthetic\PatrickProject\Chart_Generation\text_vision_results.csv





## Compare Results

In [6]:
# Display comparison
print("\n" + "="*60)
print("DCDCT TEXT-VISION TEST COMPARISON")
print("="*60)
print(f"\nTest: Different Class Different Color & Texture (4-way forced choice)")
print(f"Control: Size held constant (not mentioned in text)")
print(f"Text format: '{{color}} {{texture}} {{class}}' (natural English order)")
print(f"Example: 'red smooth apple' vs 'blue bumpy car' vs 'green smooth ball'")
print(f"\nResults:")
print(f"  CVCL Accuracy: {cvcl_accuracy:.4f} ({cvcl_accuracy*100:.2f}%)")
print(f"  CLIP Accuracy: {clip_accuracy:.4f} ({clip_accuracy*100:.2f}%)")
print(f"\nDifference: {abs(cvcl_accuracy - clip_accuracy):.4f} ({abs(cvcl_accuracy - clip_accuracy)*100:.2f} percentage points)")
if cvcl_accuracy > clip_accuracy:
    print(f"CVCL performs better by {(cvcl_accuracy - clip_accuracy)*100:.2f} percentage points")
elif clip_accuracy > cvcl_accuracy:
    print(f"CLIP performs better by {(clip_accuracy - cvcl_accuracy)*100:.2f} percentage points")
else:
    print("Both models perform equally")

print("\n" + "="*60)
print("\nAnalysis:")
print("- Multiple cues: class, color, AND texture can all differ")
print("- Up to three varying attributes are available as discriminative cues")
print("- Prompts use color-before-texture ordering for both models")
print("- Chance level for 4-way choice is 25%; compare each accuracy against it")


DCDCT TEXT-VISION TEST COMPARISON

Test: Different Class Different Color & Texture (4-way forced choice)
Control: Size held constant (not mentioned in text)
Text format: '{color} {texture} {class}' (natural English order)
Example: 'red smooth apple' vs 'blue bumpy car' vs 'green smooth ball'

Results:
  CVCL Accuracy: 0.2632 (26.32%)
  CLIP Accuracy: 0.9620 (96.20%)

Difference: 0.6987 (69.88 percentage points)
CLIP performs better by 69.88 percentage points


Analysis:
- Multiple cues: class, color, AND texture can all differ
- Up to three varying attributes are available as discriminative cues
- Prompts use color-before-texture ordering for both models
- Chance level for 4-way choice is 25%; compare each accuracy against it
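
The overall accuracies can be broken down further using the per-trial records returned by `run_dcdct_text_vision_test`. The sketch below groups trials by one of the stored keys (for example `query_texture` or `query_size`). `accuracy_by` is a hypothetical helper and the toy trials are made up; only the key names come from the trial dictionaries built in the function above.

```python
from collections import defaultdict


def accuracy_by(trials, key):
    """Group trial dicts by `key` and return {value: (correct, total, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])
    for trial in trials:
        counts[trial[key]][0] += int(trial["correct"])
        counts[trial[key]][1] += 1
    return {value: (c, n, c / n) for value, (c, n) in sorted(counts.items())}


# Toy trials with the same keys as the entries in trial_results
toy_trials = [
    {"query_texture": "smooth", "correct": True},
    {"query_texture": "smooth", "correct": False},
    {"query_texture": "bumpy", "correct": True},
    {"query_texture": "bumpy", "correct": True},
]
by_texture = accuracy_by(toy_trials, "query_texture")
for value, (correct, total, acc) in by_texture.items():
    print(f"{value}: {correct}/{total} = {acc:.2f}")
```

In the notebook, the same call would take `cvcl_trials` or `clip_trials` in place of `toy_trials`.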


## Analysis Notes

### DCDCT Text-Vision Test Characteristics:
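- **Visual Control**: All 4 candidates share the same size whenever at least three same-size combinations from other classes are available
- **Variation**: Class always differs from the query; color and texture are intended to differ as well, but the sampling code leaves them to chance
- **Text Prompts**: Natural order "{color} {texture} {class}" (e.g., "red smooth apple")
- **NOT mentioned**: Size is controlled but excluded from the text prompts
- **Chance level**: With four candidates, chance accuracy is 25%; CVCL's 26.32% is close to chance, while CLIP reaches 96.20%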
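
With four candidates per trial, chance accuracy is 25%, so each reported accuracy can be put on a common footing by asking how far it sits from chance. The sketch below uses a normal approximation to the binomial and plugs in the counts reported above (1053 and 3848 correct out of 4000). `z_vs_chance` and `one_sided_p` are hypothetical helpers, and the calculation assumes trials are independent, which is only approximately true because queries are resampled from the same pool of combinations.

```python
import math


def z_vs_chance(correct, trials, chance=0.25):
    """Normal-approximation z-score of observed accuracy against a chance rate."""
    p_hat = correct / trials
    standard_error = math.sqrt(chance * (1 - chance) / trials)
    return (p_hat - chance) / standard_error


def one_sided_p(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


cvcl_z = z_vs_chance(1053, 4000)
clip_z = z_vs_chance(3848, 4000)

print(f"CVCL: z = {cvcl_z:.2f}, one-sided p = {one_sided_p(cvcl_z):.4f}")
print(f"CLIP: z = {clip_z:.2f}")
```

Under these assumptions CVCL's z-score comes out at roughly 1.9, which is borderline evidence of above-chance performance, whereas CLIP's score is far beyond any conventional threshold.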

