# Loop 7 Analysis: Diagnosing Submission Duplicate IDs

**Goal**: Identify why the exp_006 submission contains 40 duplicate IDs, and fix the issue

**Problem**: The submission file has 200,040 rows instead of 200,000 - 40 IDs appear twice

**Hypothesis**: Index misalignment or concatenation bug in prediction generation

In [1]:
import pandas as pd
import numpy as np

print("Loading data to diagnose duplicate issue...")

# Load test data
test = pd.read_csv('/home/data/test.csv')
print(f"Test data shape: {test.shape}")
print(f"Test ID range: {test['id'].min()} - {test['id'].max()}")
print(f"Test ID uniqueness: {test['id'].nunique() == len(test)}")

# Load submission
sub = pd.read_csv('/home/code/submission_candidates/candidate_005.csv')
print(f"\nSubmission shape: {sub.shape}")
print(f"Submission ID range: {sub['id'].min()} - {sub['id'].max()}")
print(f"Submission ID uniqueness: {sub['id'].nunique() == len(sub)}")

# Find duplicates
duplicates = sub[sub['id'].duplicated(keep=False)].copy()
print(f"\nTotal duplicate rows: {len(duplicates)}")
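# Count distinct IDs that repeat; duplicated().sum() alone counts extra rows, not distinct IDs
print(f"Unique duplicated IDs: {sub.loc[sub['id'].duplicated(), 'id'].nunique()}")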

# Show sample duplicates
print("\nSample duplicate rows:")
print(duplicates.head(10))

Loading data to diagnose duplicate issue...
Test data shape: (200000, 10)
Test ID range: 300000 - 499999
Test ID uniqueness: True

Submission shape: (200040, 2)
Submission ID range: 300000 - 499999
Submission ID uniqueness: False

Total duplicate rows: 80
Unique duplicated IDs: 40

Sample duplicate rows:
           id      Price
576    300576  91.312671
577    300576  91.312671
4191   304190  66.682781
4192   304190  66.682781
6025   306023  96.941909
6026   306023  96.941909
10368  310365  94.525570
10369  310365  94.525570
15683  315679  94.442971
15684  315679  94.442971


## Analyze Duplicate Pattern

In [2]:
# Analyze the pattern of duplicates
print("=== DUPLICATE PATTERN ANALYSIS ===")

# Group by ID to see the duplication pattern
dup_groups = duplicates.groupby('id').agg({
    'Price': ['count', 'first', 'last', 'nunique']
}).round(6)
dup_groups.columns = ['count', 'first_price', 'last_price', 'unique_prices']

print(f"Duplicate ID groups: {len(dup_groups)}")
print(f"IDs with 2 rows: {(dup_groups['count'] == 2).sum()}")
print(f"IDs with >2 rows: {(dup_groups['count'] > 2).sum()}")
print(f"IDs with different prices: {(dup_groups['unique_prices'] > 1).sum()}")

print("\nSample duplicate groups:")
print(dup_groups.head(10))

# Check if duplicates are consecutive
print("\n=== CHECKING IF DUPLICATES ARE CONSECUTIVE ===")
dup_ids = sub[sub['id'].duplicated()]['id'].unique()
for dup_id in dup_ids[:5]:
    idx = sub[sub['id'] == dup_id].index
    print(f"ID {dup_id}: indices {list(idx)}, consecutive: {idx[1] - idx[0] == 1}")

=== DUPLICATE PATTERN ANALYSIS ===
Duplicate ID groups: 40
IDs with 2 rows: 40
IDs with >2 rows: 0
IDs with different prices: 0

Sample duplicate groups:
        count  first_price  last_price  unique_prices
id                                                   
300576      2    91.312671   91.312671              1
304190      2    66.682781   66.682781              1
306023      2    96.941909   96.941909              1
310365      2    94.525570   94.525570              1
315679      2    94.442971   94.442971              1
316920      2    94.945388   94.945388              1
320094      2   113.191144  113.191144              1
325617      2    76.980625   76.980625              1
327553      2    84.249429   84.249429              1
330018      2    54.893704   54.893704              1

=== CHECKING IF DUPLICATES ARE CONSECUTIVE ===
ID 300576: indices [576, 577], consecutive: True
ID 304190: indices [4191, 4192], consecutive: True
ID 306023: indices [6025, 6026], consecutive: True
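
Every duplicated ID appears exactly twice, on adjacent rows, with identical prices, so one possible interim fix is to drop the repeated rows. The following is a minimal sketch under that assumption; `dedupe_submission` is a hypothetical helper, not part of the experiment code, and the toy frame only mimics the pattern seen above.

```python
import pandas as pd

def dedupe_submission(sub: pd.DataFrame, id_col: str = "id") -> pd.DataFrame:
    """Drop repeated IDs, refusing to do so if the copies disagree."""
    # Guard: dropping rows is only safe when every copy of an ID is identical.
    rows_per_id = sub.drop_duplicates().groupby(id_col).size()
    conflicting = rows_per_id[rows_per_id > 1]
    if len(conflicting) > 0:
        raise ValueError(f"{len(conflicting)} IDs have conflicting rows")
    return sub.drop_duplicates(subset=id_col, keep="first").reset_index(drop=True)

# Toy data mimicking the observed pattern (hypothetical values except ID 300576)
toy = pd.DataFrame({
    "id": [300575, 300576, 300576, 300577],
    "Price": [80.0, 91.312671, 91.312671, 70.5],
})
deduped = dedupe_submission(toy)
print(deduped)
```

Dropping rows repairs the file but not the pipeline, so the cause of the extra rows still needs to be found.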

## Investigate Experiment Code

In [3]:
# Look at the experiment folder to understand submission generation
import os
import json

exp_folder = "/home/code/experiments/006_original_dataset_combo"
print(f"Checking experiment folder: {exp_folder}")

# List files in experiment folder
if os.path.exists(exp_folder):
    files = os.listdir(exp_folder)
    print(f"Files in experiment folder: {files}")
    
    # Look for submission generation code
    py_files = [f for f in files if f.endswith('.py')]
    print(f"Python files: {py_files}")
    
    # Check if there's a notebook
    nb_files = [f for f in files if f.endswith('.ipynb')]
    print(f"Notebook files: {nb_files}")
else:
    print("Experiment folder not found!")

# Check session state for clues
with open('/home/code/session_state.json', 'r') as f:
    session_state = json.load(f)

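# The session-state ID queried is 'exp_005' although the folder is numbered 006;
# the notes it returns describe the original-dataset/combo experiment, so the IDs appear to be zero-based.
exp_entry = next((e for e in session_state['experiments'] if e['id'] == 'exp_005'), None)
if exp_entry:
    print(f"\nExperiment notes: {exp_entry.get('notes', 'No notes')}")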

Checking experiment folder: /home/code/experiments/006_original_dataset_combo
Files in experiment folder: []
Python files: []
Notebook files: []

Experiment notes: MASSIVE IMPROVEMENT: Added original dataset features (orig_price, orig_price_r7, orig_price_r8, orig_price_r9) and COMBO/interaction features (NaN encoding, NaN×Weight Capacity, categorical×Weight Capacity). Removed histogram bins. Optimized groupby statistics (kept mean, count, median only). Hyperparameters: learning_rate=0.03, max_depth=10, reg_alpha=0.1, reg_lambda=1.0. 20-fold CV. Feature count: 53 total (4 orig + 15 combo + 28 groupby + 6 other). Top features: weight_capacity_mean_price, weight_capacity_median_price, Laptop Compartment_median_price. Achieved RMSE 24.321 ± 0.162, beating target 38.616 by 14.295 points!


## Root Cause Hypotheses

In [None]:
print("=== ROOT CAUSE HYPOTHESES ===")

print("\n1. INDEX MISALIGNMENT:")
print("   - Test data reordered during feature engineering")
print("   - Predictions generated with different index order")
print("   - When merging back, some indices duplicated")

print("\n2. CONCATENATION BUG:")
print("   - Multiple prediction arrays concatenated incorrectly")
print("   - Could happen if using np.concatenate or pd.concat with wrong axis")

print("\n3. CROSS-VALIDATION FOLD MISALIGNMENT:")
print("   - Per-fold test predictions not properly aligned with test IDs")
print("   - Some folds might have written overlapping indices")

print("\n4. FEATURE ENGINEERING SIDE EFFECT:")
print("   - A groupby aggregation alone does not add rows, but merging its result back can")
print("   - If the merged-in table has duplicate keys (e.g. repeated Weight Capacity values), matching test rows are expanded")

# Count repeated Weight Capacity values in test. Repeats on the test side are expected;
# rows only multiply if the table being merged in has duplicate keys.
test_dup_weights = test['Weight Capacity (kg)'].duplicated().sum()
print(f"\nTest data duplicate Weight Capacity values: {test_dup_weights}")

# Check distribution of weight capacity
weight_counts = test['Weight Capacity (kg)'].value_counts()
print(f"Max duplicates for any weight capacity: {weight_counts.max()}")
print(f"Number of weight capacities with >1 occurrence: {(weight_counts > 1).sum()}")
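
Hypothesis 4 can be illustrated on toy data. The sketch below assumes the features are attached with a left merge on a key column; the frames, the column names `weight` and `mean_price`, and the duplicated key are all hypothetical. It shows how a duplicate key in the merged-in table expands rows, and how the `validate` argument of `DataFrame.merge` can catch this.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy test set: repeated key values on this (left) side are harmless
test_toy = pd.DataFrame({"id": [1, 2, 3], "weight": [10.0, 20.0, 10.0]})

# Hypothetical lookup table with an accidental duplicate key (10.0 appears twice)
stats = pd.DataFrame({"weight": [10.0, 10.0, 20.0],
                      "mean_price": [50.0, 50.0, 60.0]})

# Each test row with weight 10.0 matches two lookup rows and is emitted twice
expanded = test_toy.merge(stats, on="weight", how="left")
print(len(test_toy), "->", len(expanded))  # 3 -> 5

# validate="many_to_one" requires unique keys on the right side and raises otherwise
try:
    test_toy.merge(stats, on="weight", how="left", validate="many_to_one")
    caught = False
except MergeError:
    caught = True
print("Duplicate keys detected:", caught)
```

The expanded rows are exact copies sitting next to each other, which matches the pattern in the submission, though this alone does not confirm the cause.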

## Solution Approach

In [None]:
print("=== SOLUTION PLAN ===")

print("\n1. CREATE CLEAN SUBMISSION:")
print("   - Load test.csv to get correct ID order")
print("   - Generate predictions using trained model from exp_006")
print("   - Ensure predictions align 1:1 with test IDs")
print("   - Save with exactly 200,000 rows")

print("\n2. VERIFY ALIGNMENT:")
print("   - Check that prediction length matches test length")
print("   - Verify no duplicate IDs in output")
print("   - Validate ID range matches test data")

print("\n3. RESUBMIT:")
print("   - Use Submit() function with corrected candidate")
print("   - Monitor for successful upload")

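# Create a submission template with the correct IDs in test order.
# Price is a placeholder only: fill it with real predictions before submitting this file.
clean_sub = test[['id']].copy()
clean_sub['Price'] = 0.0  # Placeholder, not a prediction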

print(f"\nClean submission template shape: {clean_sub.shape}")
print(f"Clean submission ID uniqueness: {clean_sub['id'].nunique() == len(clean_sub)}")
print(f"ID range: {clean_sub['id'].min()} - {clean_sub['id'].max()}")

# Save clean template
clean_sub.to_csv('/home/code/submission_candidates/candidate_005_clean.csv', index=False)
print("\n✓ Clean template saved to: /home/code/submission_candidates/candidate_005_clean.csv")
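
The alignment checks in the plan above could be sketched as a small helper. This is an illustration under assumptions, not the experiment's actual code: `build_submission` is a hypothetical function, predictions are assumed to be a 1-D array in the same row order as test.csv, and the toy values stand in for real model output.

```python
import numpy as np
import pandas as pd

def build_submission(test_ids, preds, target="Price"):
    """Pair predictions with test IDs positionally and validate the result."""
    ids = np.asarray(test_ids)
    preds = np.asarray(preds, dtype=float)
    # One prediction per test row, in the same order as test.csv
    if preds.shape != (len(ids),):
        raise ValueError(f"Expected {len(ids)} predictions, got shape {preds.shape}")
    out = pd.DataFrame({"id": ids, target: preds})
    if not out["id"].is_unique:
        raise ValueError("Duplicate IDs in submission")
    if not np.isfinite(out[target]).all():
        raise ValueError("Non-finite predictions in submission")
    return out

# Toy usage with stand-in values; real predictions would come from the exp_006 model
toy_ids = pd.Series([300000, 300001, 300002])
toy_sub = build_submission(toy_ids, [81.5, 92.0, 77.25])
print(toy_sub.shape)
```

Building the frame from the raw ID array rather than merging avoids any dependence on index labels, which is one way to rule out the misalignment hypothesis.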