# Phase 2 - Data Preparation & Splitting

**Objective**: Load the FIM `.jsonl` dataset produced in Phase 1 and split it by repository into train/val/test sets.

**Key Points**:
- Split by **repository**, not by file (prevents data leakage)
- Ratio: 80% train / 5% validation / 15% test
- Verify no repository overlap between splits

## Step 1: Upload FIM Dataset

Mount Google Drive and copy the Phase 1 `.jsonl` file onto the Colab runtime's local disk.

In [2]:
from google.colab import drive
import os
import shutil
print("Mounting Google Drive...")
drive.mount('/content/drive')
DRIVE_PATH = '/content/drive/MyDrive/AI-Auto-Complete/phase1_output/fim_dataset.jsonl'
os.makedirs('/content/fim_data', exist_ok=True)
print(f"Copying {DRIVE_PATH} to local...")
if os.path.exists(DRIVE_PATH):
    shutil.copy(DRIVE_PATH, '/content/fim_data/fim_dataset.jsonl')
    print(f"Copied: fim_dataset.jsonl")
else:
    print(f"ERROR: File not found at {DRIVE_PATH}")
    print("Please update DRIVE_PATH to match your file location!")

Mounting Google Drive...
Mounted at /content/drive
Copying /content/drive/MyDrive/AI-Auto-Complete/phase1_output/fim_dataset.jsonl to local...
✓ Copied: fim_dataset.jsonl


## Step 2: Load and Validate FIM Dataset

In [3]:
import json
import glob
from collections import defaultdict

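def load_jsonl(file_path):
    """Read one JSON object per line; warn about and skip lines that fail to parse."""
    samples = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # blank lines are not records, so skip them silently
            try:
                samples.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Warning: Skipping invalid JSON at line {line_num}: {e}")
    return samples

def validate_fim_format(sample):
    """Check that all three FIM markers are present (their order is not checked)."""
    text = sample.get('text', '')
    return '<PRE>' in text and '<SUF>' in text and '<MID>' in text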

all_samples = []
jsonl_files = glob.glob('/content/fim_data/*.jsonl')

print(f"Found {len(jsonl_files)} JSONL files\n")

for file_path in jsonl_files:
    print(f"Loading {os.path.basename(file_path)}...")
    samples = load_jsonl(file_path)

    valid_samples = [s for s in samples if validate_fim_format(s)]
    invalid_count = len(samples) - len(valid_samples)

    if invalid_count > 0:
        print(f"  Warning: {invalid_count} samples with invalid FIM format")

    all_samples.extend(valid_samples)
    print(f"  Loaded {len(valid_samples)} valid samples")

print(f"\n{'='*50}")
print(f"Total samples loaded: {len(all_samples):,}")
print(f"{'='*50}")

Found 1 JSONL files

Loading fim_dataset.jsonl...
  Loaded 28361 valid samples

Total samples loaded: 28,361
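
The check in this step only tests that the three markers are present somewhere in the text. A stricter variant, sketched below, also requires each marker to appear exactly once and in the order `<PRE>`, `<SUF>`, `<MID>`. This assumes Phase 1 always emits that order, which is what the preview parsing in Step 7 implies. The function name `has_ordered_fim_markers` and the example strings are made up for illustration.

```python
def has_ordered_fim_markers(text):
    """Return True if <PRE>, <SUF>, <MID> each occur exactly once, in that order."""
    markers = ('<PRE>', '<SUF>', '<MID>')
    # Each marker must appear exactly once.
    if any(text.count(m) != 1 for m in markers):
        return False
    # Marker positions must be strictly increasing: <PRE> ... <SUF> ... <MID>.
    positions = [text.find(m) for m in markers]
    return positions[0] < positions[1] < positions[2]

# Hypothetical examples, not taken from the dataset.
examples = {
    'ok': '<PRE>def f():<SUF>    return x<MID>    x = 1\n',
    'wrong_order': '<MID>x = 1<PRE>def f():<SUF>return x',
    'duplicate': '<PRE>a<SUF>b<MID>c<MID>d',
}
results = {name: has_ordered_fim_markers(t) for name, t in examples.items()}
print(results)
```

In the notebook this could be called as `has_ordered_fim_markers(sample.get('text', ''))`. One caveat: source files that legitimately contain a literal `<PRE>` (HTML snippets, for instance) would be rejected by the exactly-once rule, so it may be too strict for some corpora.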


## Step 3: Extract Repository Information

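Derive a repository key for each sample so the split can be done per repository. The key is taken from the `repository` field if present, then from `file_path`, then from a GitHub `url`. Samples with none of these are bucketed by a hash of the first 100 characters of their text.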

In [4]:
import re

def extract_repository(sample):
    if 'repository' in sample:
        return sample['repository']

    if 'file_path' in sample:
        path = sample['file_path']
        parts = path.split('/')
        if len(parts) >= 3:
            return f"{parts[1]}/{parts[2]}"

    if 'url' in sample:
        url = sample['url']
        match = re.search(r'github\.com/([^/]+)/([^/]+)', url)
        if match:
            return f"{match.group(1)}/{match.group(2)}"

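    # Fallback: bucket by the first 100 characters of the sample text.
    # Note: hash() on str is salted per Python process unless PYTHONHASHSEED is set,
    # so these bucket labels (and any split built on them) can change between sessions.
    text_sample = sample.get('text', '')[:100]
    return f"unknown_{hash(text_sample) % 10000}"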

repo_to_samples = defaultdict(list)

for sample in all_samples:
    repo = extract_repository(sample)
    repo_to_samples[repo].append(sample)

print(f"Total repositories: {len(repo_to_samples):,}")
print(f"\nRepository size distribution:")
repo_sizes = [len(samples) for samples in repo_to_samples.values()]
print(f"  Min samples per repo: {min(repo_sizes)}")
print(f"  Max samples per repo: {max(repo_sizes)}")
print(f"  Average samples per repo: {sum(repo_sizes)/len(repo_sizes):.1f}")

print(f"\nTop 10 largest repositories:")
sorted_repos = sorted(repo_to_samples.items(), key=lambda x: len(x[1]), reverse=True)
for i, (repo, samples) in enumerate(sorted_repos[:10], 1):
    print(f"  {i}. {repo}: {len(samples)} samples")

Total repositories: 9,318

Repository size distribution:
  Min samples per repo: 1
  Max samples per repo: 63
  Average samples per repo: 3.0

Top 10 largest repositories:
  1. unknown_4994: 63 samples
  2. unknown_2348: 49 samples
  3. unknown_3257: 48 samples
  4. unknown_9801: 46 samples
  5. unknown_5840: 44 samples
  6. unknown_6417: 41 samples
  7. unknown_3052: 33 samples
  8. unknown_5098: 33 samples
  9. unknown_1098: 31 samples
  10. unknown_8483: 30 samples
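
All ten of the largest groups above carry `unknown_` labels, which suggests that for these samples none of `repository`, `file_path`, or `url` was available and the grouping came from the text-prefix fallback. Such buckets group samples that share the same first 100 characters, which is only a proxy for a real repository. If the fallback is kept, a content hash from `hashlib` gives labels that stay the same across sessions. The sketch below is one way to do that. `stable_bucket` and its parameters are hypothetical names, and SHA-256 is an arbitrary choice of digest.

```python
import hashlib

def stable_bucket(text, n_buckets=10000, prefix_len=100):
    """Map the first `prefix_len` characters of `text` to a reproducible bucket label."""
    digest = hashlib.sha256(text[:prefix_len].encode('utf-8')).hexdigest()
    # The digest depends only on the bytes of the prefix, never on the Python process.
    return f"unknown_{int(digest, 16) % n_buckets}"

# Hypothetical inputs for illustration.
a = stable_bucket("package com.example;\nimport java.util.List;")
b = stable_bucket("package com.example;\nimport java.util.List;")
c = stable_bucket("import os\nimport sys\n")
print(a, b, c)
```

Identical prefixes always land in the same bucket, so files that begin with the same license header or package line would still be grouped together, as they are with the original fallback.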


## Step 4: Split by Repository (80/5/15)

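**Critical**: Split repositories, not individual samples, so that no repository contributes samples to more than one split. The 80/5/15 ratio is applied to repository counts, so the per-sample percentages only approximate it.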

In [5]:
import random
import numpy as np

random.seed(42)
np.random.seed(42)

all_repos = list(repo_to_samples.keys())
random.shuffle(all_repos)

n_repos = len(all_repos)
train_end = int(n_repos * 0.80)
val_end = int(n_repos * 0.85)

train_repos = all_repos[:train_end]
val_repos = all_repos[train_end:val_end]
test_repos = all_repos[val_end:]

print(f"Repository split:")
print(f"  Train: {len(train_repos):,} repos ({len(train_repos)/n_repos*100:.1f}%)")
print(f"  Val:   {len(val_repos):,} repos ({len(val_repos)/n_repos*100:.1f}%)")
print(f"  Test:  {len(test_repos):,} repos ({len(test_repos)/n_repos*100:.1f}%)")

train_samples = []
val_samples = []
test_samples = []

for repo in train_repos:
    train_samples.extend(repo_to_samples[repo])

for repo in val_repos:
    val_samples.extend(repo_to_samples[repo])

for repo in test_repos:
    test_samples.extend(repo_to_samples[repo])

print(f"\nSample split:")
total_samples = len(train_samples) + len(val_samples) + len(test_samples)
print(f"  Train: {len(train_samples):,} samples ({len(train_samples)/total_samples*100:.1f}%)")
print(f"  Val:   {len(val_samples):,} samples ({len(val_samples)/total_samples*100:.1f}%)")
print(f"  Test:  {len(test_samples):,} samples ({len(test_samples)/total_samples*100:.1f}%)")

Repository split:
  Train: 7,454 repos (80.0%)
  Val:   466 repos (5.0%)
  Test:  1,398 repos (15.0%)

Sample split:
  Train: 22,707 samples (80.1%)
  Val:   1,437 samples (5.1%)
  Test:  4,217 samples (14.9%)
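
The same split can be wrapped in a small function, sketched below under a few assumptions. It sorts the group keys before shuffling, so the result does not depend on the order in which keys were inserted into the dictionary, and it uses a local `random.Random(seed)` so the global random state is left alone. The cut points mirror the cell above (`0.80` and `0.85` of the repository count). `split_groups` and `val_end_frac` are hypothetical names, and the repository IDs are invented.

```python
import random

def split_groups(group_ids, train_frac=0.80, val_end_frac=0.85, seed=42):
    """Shuffle group IDs reproducibly and cut them into train/val/test lists."""
    ids = sorted(group_ids)            # fixed starting order, whatever the input order was
    random.Random(seed).shuffle(ids)   # local RNG seeded explicitly
    train_end = int(len(ids) * train_frac)
    val_end = int(len(ids) * val_end_frac)
    return ids[:train_end], ids[train_end:val_end], ids[val_end:]

# Hypothetical repository IDs for illustration.
repos = [f"owner{i}/repo{i}" for i in range(100)]
train, val, test = split_groups(repos)
print(len(train), len(val), len(test))

# Feeding the same IDs in a different order yields the same split.
same_split = split_groups(list(reversed(repos))) == (train, val, test)
print(same_split)
```

In the notebook it would be called as `split_groups(repo_to_samples.keys())`. Reproducibility still depends on the keys themselves being stable from one session to the next.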


## Step 5: Verify No Data Leakage

In [6]:
train_repo_set = set(train_repos)
val_repo_set = set(val_repos)
test_repo_set = set(test_repos)

train_val_overlap = train_repo_set & val_repo_set
train_test_overlap = train_repo_set & test_repo_set
val_test_overlap = val_repo_set & test_repo_set

print("Data Leakage Check:")
print(f"  Train ∩ Val:  {len(train_val_overlap)} repos")
print(f"  Train ∩ Test: {len(train_test_overlap)} repos")
print(f"  Val ∩ Test:   {len(val_test_overlap)} repos")

if len(train_val_overlap) == 0 and len(train_test_overlap) == 0 and len(val_test_overlap) == 0:
    print("\nSUCCESS: No data leakage detected!")
else:
    print("\nERROR: Data leakage detected! Check your splitting logic.")
    if train_val_overlap:
        print(f"  Overlapping repos (Train/Val): {list(train_val_overlap)[:5]}")
    if train_test_overlap:
        print(f"  Overlapping repos (Train/Test): {list(train_test_overlap)[:5]}")
    if val_test_overlap:
        print(f"  Overlapping repos (Val/Test): {list(val_test_overlap)[:5]}")

Data Leakage Check:
  Train ∩ Val:  0 repos
  Train ∩ Test: 0 repos
  Val ∩ Test:   0 repos

✅ SUCCESS: No data leakage detected!
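
The three repository lists are slices of one list of unique dictionary keys, so the check above passes by construction. It cannot detect the same content appearing under two different repository labels (forks or vendored files, for example). A complementary check, sketched below, compares the sample texts themselves across splits using exact-match fingerprints. The helper names and the toy samples are hypothetical, and near-duplicates that differ by even one character are not caught.

```python
import hashlib

def text_fingerprints(samples):
    """Set of SHA-256 digests, one per distinct 'text' value."""
    return {hashlib.sha256(s['text'].encode('utf-8')).hexdigest() for s in samples}

def cross_split_duplicates(train, val, test):
    """Count texts that appear verbatim in more than one split."""
    tr, va, te = (text_fingerprints(x) for x in (train, val, test))
    return {
        'train_val': len(tr & va),
        'train_test': len(tr & te),
        'val_test': len(va & te),
    }

# Toy samples for illustration: one text is shared between train and test.
train = [{'text': '<PRE>a<SUF>b<MID>c'}, {'text': '<PRE>d<SUF>e<MID>f'}]
val = [{'text': '<PRE>g<SUF>h<MID>i'}]
test = [{'text': '<PRE>a<SUF>b<MID>c'}]
report = cross_split_duplicates(train, val, test)
print(report)
```

In the notebook the call would be `cross_split_duplicates(train_samples, val_samples, test_samples)`. Any non-zero count points at content shared across splits despite the repository-level separation.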


## Step 6: Save Split Datasets

In [7]:
def save_jsonl(samples, file_path):
    """Save samples to JSONL file"""
    with open(file_path, 'w', encoding='utf-8') as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + '\n')
    print(f"Saved {len(samples):,} samples to {file_path}")

# Save datasets
os.makedirs('/content/split_data', exist_ok=True)

save_jsonl(train_samples, '/content/split_data/train.jsonl')
save_jsonl(val_samples, '/content/split_data/val.jsonl')
save_jsonl(test_samples, '/content/split_data/test.jsonl')

print("\n" + "="*50)
print("Data preparation complete!")
print("="*50)
print("\nNext step: Use these files in 02_training.ipynb")

✓ Saved 22,707 samples to /content/split_data/train.jsonl
✓ Saved 1,437 samples to /content/split_data/val.jsonl
✓ Saved 4,217 samples to /content/split_data/test.jsonl

Data preparation complete!

Next step: Use these files in 02_training.ipynb
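
Before handing the files to the training notebook, it can be worth reading them back and confirming that each one parses and holds the expected number of records. The sketch below writes a small file to a temporary directory in the same way as `save_jsonl` and counts the records on re-read. `count_jsonl_records` is a hypothetical helper and the two samples are invented.

```python
import json
import os
import tempfile

def count_jsonl_records(file_path):
    """Parse every non-blank line as JSON and return how many records were read."""
    n = 0
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError if the line is corrupt
                n += 1
    return n

# Invented samples, including a non-ASCII character and an embedded newline.
samples = [{'text': '<PRE>a<SUF>b<MID>c'}, {'text': '<PRE>naïve\n<SUF>x<MID>y'}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.jsonl')
    with open(path, 'w', encoding='utf-8') as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + '\n')
    n_read = count_jsonl_records(path)
print(n_read == len(samples))
```

In the notebook, `count_jsonl_records('/content/split_data/train.jsonl')` should equal `len(train_samples)`. Files under `/content` live on the Colab runtime's local disk, so they probably need to be copied back to Drive if they must outlast the session.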


## Step 7: Preview Sample Data

In [9]:
import random

sample = random.choice(train_samples)
text = sample['text']

print("Random Training Sample:")
print("="*80)

if '<PRE>' in text and '<SUF>' in text and '<MID>' in text:
    parts = text.split('<SUF>')
    prefix = parts[0].replace('<PRE>', '').strip()

    suffix_mid = parts[1].split('<MID>')
    suffix = suffix_mid[0].strip()
    middle = suffix_mid[1].strip() if len(suffix_mid) > 1 else ''

    print("PREFIX:")
    print(prefix[:200] + "..." if len(prefix) > 200 else prefix)
    print("\nSUFFIX:")
    print(suffix[:200] + "..." if len(suffix) > 200 else suffix)
    print("\nMIDDLE (what model should learn to fill):")
    print(middle[:200] + "..." if len(middle) > 200 else middle)
else:
    print(text[:500])

Random Training Sample:
PREFIX:
package com.azure.monitor.opentelemetry.exporter;


import com.azure.data.appconfiguration.ConfigurationClientBuilder;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
...

SUFFIX:
Tracer tracer = openTelemetrySdk.getTracer("Sample");

         
        ConfigurationClient client = new ConfigurationClientBuilder()
            .connectionString("{app-config-connection-string}")
 ...

MIDDLE (what model should learn to fill):
.buildAndRegisterGlobal();
