# GPU-Aware Data Preprocessing for Nanotron Training

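This notebook loads parquet files, cleans them, and splits them into training and evaluation datasets (80/20 split). pandas itself runs on the CPU; the GPU-aware parts detect available devices and adjust chunk size and memory limits accordingly.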

## Features:
- 🔍 Automatic GPU detection and configuration
- 💾 Memory-optimized processing for large datasets
- ⚙️ Device-specific optimizations (CPU vs GPU)
- 📈 Progress tracking and memory monitoring
- 📦 Efficient data loading with chunked processing
- 📊 Data validation and integrity checks

## Hardware Requirements:
- **CPU / RAM**: At least 8 GB of system RAM recommended for large datasets
- **GPU**: Optional but recommended for faster processing
- **Storage**: SSD recommended for better I/O performance

## Configuration:
The notebook detects available GPUs, total RAM, and CPU core count, then derives the chunk size, worker count, and memory limits from them.
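
As a rough illustration of how detected resources could map to settings, the sketch below derives a chunk size, worker count, and memory cap from three inputs. `ProcessingSettings` and `derive_settings` are hypothetical names introduced here; the thresholds mirror the values in this notebook's configuration cell and are not tuned recommendations.

```python
from dataclasses import dataclass


@dataclass
class ProcessingSettings:
    chunk_size: int
    num_workers: int
    max_memory_gb: float


def derive_settings(has_gpu: bool, total_ram_gb: float, cpu_cores: int) -> ProcessingSettings:
    """Hypothetical helper: map detected hardware to processing settings."""
    return ProcessingSettings(
        # Larger chunks when a GPU is present, as in the configuration cell
        chunk_size=50_000 if has_gpu else 10_000,
        # Cap workers at 4 and never go below 1
        num_workers=max(1, min(4, cpu_cores)),
        # Use up to 80% of total RAM, capped at 32 GB
        max_memory_gb=min(total_ram_gb * 0.8, 32.0),
    )


settings = derive_settings(has_gpu=False, total_ram_gb=16.0, cpu_cores=8)
print(settings)
```

Keeping the mapping in one function makes it easy to test the settings for a machine you do not have at hand.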

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import os
from sklearn.model_selection import train_test_split
import json
from typing import List, Dict, Any
import warnings
from tqdm.auto import tqdm
import time

# GPU and device management
import torch
import psutil

warnings.filterwarnings('ignore')

# Enable progress bars for pandas operations
tqdm.pandas()

print("Libraries imported successfully!")

# GPU Detection and Device Configuration
def detect_gpu_setup():
    """Detect available GPUs and system configuration"""
    print("\n🔍 GPU and System Detection:")
    
    # Check PyTorch installation
    print(f"   PyTorch version: {torch.__version__}")
    
    # Check CUDA availability
    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        print(f"   ✅ CUDA available with {gpu_count} GPU(s)")
        
        for i in range(gpu_count):
            gpu_props = torch.cuda.get_device_properties(i)
            memory_gb = gpu_props.total_memory / 1024**3
            print(f"      GPU {i}: {gpu_props.name} ({memory_gb:.1f} GB)")
            
        # Get current GPU
        current_device = torch.cuda.current_device()
        print(f"   Current device: cuda:{current_device}")
        
        return True, gpu_count
    else:
        print("   ⚠️  CUDA not available - using CPU for data processing")
        return False, 0

# System resources
def check_system_resources():
    """Check system memory and CPU cores"""
    print("\n💻 System Resources:")
    
    # Memory
    memory = psutil.virtual_memory()
    memory_gb = memory.total / 1024**3
    available_gb = memory.available / 1024**3
    print(f"   RAM: {memory_gb:.1f} GB total, {available_gb:.1f} GB available")
    
    # CPU
    cpu_count = psutil.cpu_count()
    print(f"   CPU cores: {cpu_count}")
    
    return memory_gb, cpu_count

# Run detection
has_gpu, gpu_count = detect_gpu_setup()
memory_gb, cpu_count = check_system_resources()

print("\n📊 Recommendations for data processing:")
if has_gpu:
    print("   • Use GPU acceleration for large datasets")
    print("   • Enable GPU-accelerated pandas operations")
    print("   • Consider GPU memory when processing large files")
else:
    print("   • Optimize for CPU processing")
    print("   • Use chunked processing for large datasets")
    print("   • Increase num_workers for parallel processing")

Libraries imported successfully!


## Configuration

Set the path to your parquet files and output directories.
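
Because the loader in this notebook searches the input directory recursively, writing outputs into that same directory means a second run would also pick up the generated train and eval files. A minimal sketch of a guard follows; `output_overlaps_input` is a hypothetical helper, not part of the notebook, and the paths in the demo are made up.

```python
from pathlib import Path


def output_overlaps_input(input_path: str, output_dir: str) -> bool:
    """Return True if output_dir is the input directory or lies inside it."""
    inp = Path(input_path).resolve()
    out = Path(output_dir).resolve()
    return out == inp or inp in out.parents


# Illustrative paths only
print(output_overlaps_input("/data/raw", "/data/raw"))        # same directory
print(output_overlaps_input("/data/raw", "/data/processed"))  # separate directory
```

Calling this before creating the output directories gives an early warning instead of silently duplicated rows on a re-run.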

In [None]:
# Configuration with GPU and Device Settings
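INPUT_DATA_PATH = "/Users/zhang/Desktop/huawei/untitled folder 5/nanotron-infini/data"  # Change this to your actual path
# NOTE: OUTPUT_DIR is the same directory as INPUT_DATA_PATH here. Input files are discovered
# recursively, so a second run would also pick up the generated train/ and eval/ files.
# Point OUTPUT_DIR at a separate directory if you plan to re-run the notebook.
OUTPUT_DIR = "/Users/zhang/Desktop/huawei/untitled folder 5/nanotron-infini/data"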
TRAIN_SPLIT = 0.8
EVAL_SPLIT = 0.2
RANDOM_SEED = 42

# GPU Configuration (set these for your training device)
GPU_DEVICE = "cuda:0"  # Change to your specific GPU (cuda:0, cuda:1, etc.)
USE_GPU_PROCESSING = has_gpu  # Enable GPU-accelerated processing if available
CHUNK_SIZE = 10000 if not has_gpu else 50000  # Larger chunks if GPU available
NUM_WORKERS = min(4, cpu_count or 1)  # Parallel processing workers (psutil.cpu_count() can return None)

# Memory management
MAX_MEMORY_GB = min(memory_gb * 0.8, 32)  # Use up to 80% of total RAM, capped at 32 GB
GPU_MEMORY_FRACTION = 0.9  # Use 90% of GPU memory if available

print("🔧 Configuration:")
print(f"   Input path: {INPUT_DATA_PATH}")
print(f"   Output directory: {OUTPUT_DIR}")
print(f"   Train split: {TRAIN_SPLIT}, Eval split: {EVAL_SPLIT}")
print(f"   Random seed: {RANDOM_SEED}")
print("\n🎯 Device Configuration:")
print(f"   Target GPU device: {GPU_DEVICE}")
print(f"   GPU processing: {'Enabled' if USE_GPU_PROCESSING else 'Disabled'}")
print(f"   Chunk size: {CHUNK_SIZE:,} rows")
print(f"   Parallel workers: {NUM_WORKERS}")
print(f"   Max memory usage: {MAX_MEMORY_GB:.1f} GB")

if USE_GPU_PROCESSING and has_gpu:
    # Set GPU memory fraction
    torch.cuda.set_per_process_memory_fraction(GPU_MEMORY_FRACTION, device=torch.cuda.current_device())
    print(f"   GPU memory fraction: {GPU_MEMORY_FRACTION}")

# Create output directories
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/train", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/eval", exist_ok=True)

print("\n✅ Configuration complete!")

Input path: /Users/zhang/Desktop/huawei/untitled folder 5/nanotron-infini/data
Output directory: /Users/zhang/Desktop/huawei/untitled folder 5/nanotron-infini/data
Train split: 0.8, Eval split: 0.2


## Data Loading

Load all parquet files from the specified directory.

In [5]:
def find_parquet_files(data_path: str) -> List[str]:
    """Find all parquet files in the given directory."""
    parquet_files = []
    data_path = Path(data_path)
    
    if data_path.is_file() and data_path.suffix == '.parquet':
        return [str(data_path)]
    
    for file_path in data_path.rglob("*.parquet"):
        parquet_files.append(str(file_path))
    
    return sorted(parquet_files)

# Find all parquet files
parquet_files = find_parquet_files(INPUT_DATA_PATH)
print(f"Found {len(parquet_files)} parquet files:")
for i, file in enumerate(parquet_files[:10]):  # Show first 10 files
    print(f"  {i+1}. {file}")
if len(parquet_files) > 10:
    print(f"  ... and {len(parquet_files) - 10} more files")

Found 1 parquet files:
  1. /Users/zhang/Desktop/huawei/untitled folder 5/nanotron-infini/data/000_00000.parquet


In [None]:
def load_parquet_files(file_paths: List[str]) -> pd.DataFrame:
    """Load and concatenate multiple parquet files with progress tracking."""
    dataframes = []
    total_rows = 0
    
    print("Loading parquet files...")
    
    # Use tqdm for progress bar
    for file_path in tqdm(file_paths, desc="Loading files", unit="file"):
        try:
            df = pd.read_parquet(file_path)
            dataframes.append(df)
            total_rows += len(df)
            tqdm.write(f"  ✓ Loaded {Path(file_path).name}: {len(df):,} rows")
        except Exception as e:
            tqdm.write(f"  ✗ Error loading {Path(file_path).name}: {e}")
    
    if not dataframes:
        raise ValueError("No parquet files could be loaded successfully!")
    
    print(f"\n📊 Concatenating {len(dataframes)} dataframes...")
    # Show progress for concatenation
    with tqdm(total=1, desc="Concatenating", unit="operation") as pbar:
        combined_df = pd.concat(dataframes, ignore_index=True)
        pbar.update(1)
    
    print(f"✅ Total rows after concatenation: {len(combined_df):,}")
    
    return combined_df

def load_parquet_files_gpu_aware(file_paths: List[str]) -> pd.DataFrame:
    """Load and concatenate multiple parquet files with GPU-aware processing and memory management."""
    dataframes = []
    total_rows = 0
    
    print(f"📚 Loading {len(file_paths)} parquet files with GPU-aware processing...")
    print(f"   Device target: {GPU_DEVICE}")
    print(f"   Memory limit: {MAX_MEMORY_GB:.1f} GB")
    
    # Memory monitoring
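    def get_memory_usage():
        # pandas keeps its data in host RAM, so always report RAM;
        # append GPU memory only when GPU processing is enabled
        ram_usage = psutil.virtual_memory().used / 1024**3
        usage = f"RAM: {ram_usage:.1f}GB"
        if USE_GPU_PROCESSING and has_gpu:
            gpu_memory = torch.cuda.memory_allocated() / 1024**3
            usage += f", GPU: {gpu_memory:.1f}GB"
        return usage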
    
    # Process files with memory monitoring
    for file_path in tqdm(file_paths, desc="Loading files", unit="file"):
        try:
            # Check memory before loading
            memory_info = get_memory_usage()
            
            # Load with appropriate engine for performance
            df = pd.read_parquet(file_path, engine='pyarrow')
            
            # Memory optimization
            if USE_GPU_PROCESSING and has_gpu:
                # Convert to GPU-friendly format if needed
                # Note: pandas doesn't directly support GPU, but we prepare for downstream GPU processing
                pass
            
            dataframes.append(df)
            total_rows += len(df)
            
            tqdm.write(f"  ✓ Loaded {Path(file_path).name}: {len(df):,} rows | {memory_info}")
            
            # Memory management - garbage collection if needed
            if len(dataframes) % 10 == 0:
                import gc
                gc.collect()
                if USE_GPU_PROCESSING and has_gpu:
                    torch.cuda.empty_cache()
                    
        except Exception as e:
            tqdm.write(f"  ✗ Error loading {Path(file_path).name}: {e}")
    
    if not dataframes:
        raise ValueError("No parquet files could be loaded successfully!")
    
    print(f"\n🔗 Concatenating {len(dataframes)} dataframes...")
    
    # Efficient concatenation with progress tracking
    with tqdm(total=1, desc="Concatenating", unit="operation") as pbar:
        # Use efficient concatenation
        combined_df = pd.concat(dataframes, ignore_index=True, copy=False)
        pbar.update(1)
    
    # Clear intermediate dataframes to free memory
    del dataframes
    import gc
    gc.collect()
    if USE_GPU_PROCESSING and has_gpu:
        torch.cuda.empty_cache()
    
    print(f"✅ Total rows after concatenation: {len(combined_df):,}")
    print(f"   Final memory usage: {get_memory_usage()}")
    
    return combined_df

# Load all data with the baseline loader.
# Note: the GPU-aware block that follows reloads the same files and overwrites `df`,
# so running only one of the two loaders is enough.
if parquet_files:
    start_time = time.time()
    df = load_parquet_files(parquet_files)
    load_time = time.time() - start_time
    print(f"⏱️  Loading completed in {load_time:.2f} seconds")
    print(f"\n📈 Dataset shape: {df.shape}")
    print(f"📋 Columns: {list(df.columns)}")
else:
    print("❌ No parquet files found! Please check your INPUT_DATA_PATH.")

# Load all data with GPU awareness
if parquet_files:
    start_time = time.time()
    
    print("\n🚀 Starting GPU-aware data loading...")
    if USE_GPU_PROCESSING and has_gpu:
        print(f"   Using GPU acceleration where possible")
        print(f"   Target device: {GPU_DEVICE}")
    else:
        print(f"   Using CPU-optimized processing")
    
    df = load_parquet_files_gpu_aware(parquet_files)
    load_time = time.time() - start_time
    
    print(f"⏱️  Loading completed in {load_time:.2f} seconds")
    print(f"📈 Dataset shape: {df.shape}")
    print(f"📋 Columns: {list(df.columns)}")
    print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
else:
    print("❌ No parquet files found! Please check your INPUT_DATA_PATH.")

Loading parquet files...
  Loaded /Users/zhang/Desktop/huawei/untitled folder 5/nanotron-infini/data/000_00000.parquet: 1048581 rows

Concatenating 1 dataframes...
Total rows after concatenation: 1048581

Dataset shape: (1048581, 9)
Columns: ['text', 'id', 'dump', 'url', 'date', 'file_path', 'language', 'language_score', 'token_count']


## Data Exploration

Explore the structure and content of the loaded data.

In [7]:
# Data exploration
if 'df' in locals():
    print("Dataset Info:")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print("\nColumn types:")
    print(df.dtypes)
    
    print("\nFirst few rows:")
    display(df.head())
    
    print("\nDataset statistics:")
    print(df.describe(include='all'))
    
    # Check for missing values
    missing_values = df.isnull().sum()
    if missing_values.sum() > 0:
        print("\nMissing values:")
        print(missing_values[missing_values > 0])
    else:
        print("\nNo missing values found.")

Dataset Info:
Shape: (1048581, 9)
Memory usage: 6094.63 MB

Column types:
text               object
id                 object
dump               object
url                object
date               object
file_path          object
language           object
language_score    float64
token_count         int64
dtype: object

First few rows:


Unnamed: 0,text,id,dump,url,date,file_path,language,language_score,token_count
0,|Viewing Single Post From: Spoilers for the We...,<urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7>,CC-MAIN-2013-20,http://daytimeroyaltyonline.com/single/?p=8906...,2013-05-18T05:48:59Z,s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...,en,0.82321,142
1,"*sigh* Fundamentalist community, let me pass o...",<urn:uuid:ba819eb7-e6e6-415a-87f4-0347b6a4f017>,CC-MAIN-2013-20,http://endogenousretrovirus.blogspot.com/2007/...,2013-05-18T06:43:03Z,s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...,en,0.973771,703
2,A novel two-step immunotherapy approach has sh...,<urn:uuid:07b8e00d-b445-4736-a593-cd1c147dce21>,CC-MAIN-2013-20,http://news.cancerconnect.com/,2013-05-18T05:23:15Z,s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...,en,0.872709,576
3,Free the Cans! Working Together to Reduce Wast...,<urn:uuid:c970d9a2-a5ce-4050-9ea3-58d7bbd609a8>,CC-MAIN-2013-20,http://sharingsolution.com/2009/05/23/free-the...,2013-05-18T05:49:03Z,s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...,en,0.93236,575
4,"ORLANDO, Fla. ‚Äî While the Rapid Recall Exchang...",<urn:uuid:5c2cac9e-2fda-4194-959b-6ede0668ad2a>,CC-MAIN-2013-20,http://supermarketnews.com/food-safety/more-su...,2013-05-18T05:25:43Z,s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...,en,0.955206,708



Dataset statistics:
                                                     text  \
count                                             1048581   
unique                                            1048417   
top     |Track & Field Profile - Embed| Suggest a Corr...   
freq                                                    5   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                                                   NaN   
75%                                                   NaN   
max                                                   NaN   

                                                     id             dump  \
count                                           1048581          1048581   
unique                                          1048581                8   
top     <urn:uuid:

## Data Preprocessing

Clean and preprocess the data for training.
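
The exploration output reports 1,048,417 unique `text` values across 1,048,581 rows, while every `id` is unique. A whole-row `drop_duplicates()` therefore removes nothing, even though some documents repeat. The sketch below contrasts the two behaviours on made-up data; whether text-level deduplication suits your training set is an assumption to verify.

```python
import pandas as pd

# Two rows share the same text but carry different ids
demo = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "text": ["same document", "same document", "another document"],
    }
)

# Whole-row comparison: ids differ, so nothing is dropped
full_row_dedup = demo.drop_duplicates()

# Compare on the text column only and keep the first occurrence
text_dedup = demo.drop_duplicates(subset=["text"], keep="first")

print(len(full_row_dedup), len(text_dedup))
```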

In [None]:
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Preprocess the dataset with progress tracking.
    Modify this function based on your specific data requirements.
    """
    print("🚀 Starting data preprocessing...")
    original_shape = df.shape
    
    # Create a progress bar for preprocessing steps
    preprocessing_steps = [
        "Removing duplicates",
        "Handling missing values", 
        "Filtering text columns",
        "Custom preprocessing"
    ]
    
    with tqdm(total=len(preprocessing_steps), desc="Preprocessing", unit="step") as pbar:
        # 1. Remove duplicates
        pbar.set_description("Removing duplicates")
        df = df.drop_duplicates()
        duplicates_removed = original_shape[0] - df.shape[0]
        tqdm.write(f"  ✓ After removing duplicates: {df.shape} (removed {duplicates_removed:,} rows)")
        pbar.update(1)
        
        # 2. Handle missing values
        pbar.set_description("Handling missing values")
        before_na = len(df)
        df = df.dropna()
        na_removed = before_na - len(df)
        tqdm.write(f"  ✓ After removing missing values: {df.shape} (removed {na_removed:,} rows)")
        pbar.update(1)
        
        # 3. Filter out empty text fields
        pbar.set_description("Filtering text columns")
        text_columns = df.select_dtypes(include=['object']).columns
        
        if len(text_columns) > 0:
            for col in tqdm(text_columns, desc="Processing text cols", leave=False):
                if col in df.columns:
                    initial_len = len(df)
                    df = df[df[col].str.strip().str.len() > 0]
                    removed = initial_len - len(df)
                    if removed > 0:
                        tqdm.write(f"    ✓ Filtered empty {col}: removed {removed:,} rows")
        else:
            tqdm.write("  ℹ️  No text columns found")
        pbar.update(1)
        
        # 4. Custom preprocessing steps
        pbar.set_description("Custom preprocessing")
        # Add any custom preprocessing steps here
        # Example: text length filtering, tokenization, etc.
        tqdm.write("  ✓ Custom preprocessing completed")
        pbar.update(1)
    
    total_removed = original_shape[0] - len(df)
    print("✅ Preprocessing complete!")
    print(f"   📊 Original shape: {original_shape}")
    print(f"   📊 Final shape: {df.shape}")
    print(f"   📊 Total rows removed: {total_removed:,} ({total_removed/original_shape[0]*100:.1f}%)")
    
    return df

def preprocess_data_gpu_aware(df: pd.DataFrame) -> pd.DataFrame:
    """
    GPU-aware preprocessing with memory optimization and device management.
    Modify this function based on your specific data requirements.
    """
    print("🚀 Starting GPU-aware data preprocessing...")
    print(f"   Device target: {GPU_DEVICE}")
    print(f"   GPU processing: {'Enabled' if USE_GPU_PROCESSING else 'Disabled'}")
    
    original_shape = df.shape
    initial_memory = df.memory_usage(deep=True).sum() / 1024**2
    print(f"   Initial memory usage: {initial_memory:.1f} MB")
    
    # Memory monitoring function
    def monitor_memory(step_name):
        current_memory = df.memory_usage(deep=True).sum() / 1024**2
        if USE_GPU_PROCESSING and has_gpu:
            gpu_memory = torch.cuda.memory_allocated() / 1024**2
            return f"RAM: {current_memory:.1f}MB, GPU: {gpu_memory:.1f}MB"
        return f"RAM: {current_memory:.1f}MB"
    
    # Create a progress bar for preprocessing steps
    preprocessing_steps = [
        "Memory optimization",
        "Removing duplicates",
        "Handling missing values", 
        "Filtering text columns",
        "GPU preparation",
        "Custom preprocessing"
    ]
    
    with tqdm(total=len(preprocessing_steps), desc="GPU-aware preprocessing", unit="step") as pbar:
        # 0. Memory optimization
        pbar.set_description("Memory optimization")
        # Optimize data types to reduce memory usage
        for col in df.select_dtypes(include=['int64']).columns:
            df[col] = pd.to_numeric(df[col], downcast='integer')
        for col in df.select_dtypes(include=['float64']).columns:
            df[col] = pd.to_numeric(df[col], downcast='float')
        # Optimize object columns
        for col in df.select_dtypes(include=['object']).columns:
            if df[col].nunique() / len(df) < 0.5:  # If less than 50% unique values
                df[col] = df[col].astype('category')
        
        memory_after_opt = df.memory_usage(deep=True).sum() / 1024**2
        memory_saved = initial_memory - memory_after_opt
        tqdm.write(f"  ✓ Memory optimized: {memory_saved:.1f} MB saved | {monitor_memory('optimization')}")
        pbar.update(1)
        
        # 1. Remove duplicates
        pbar.set_description("Removing duplicates")
        df = df.drop_duplicates()
        duplicates_removed = original_shape[0] - df.shape[0]
        tqdm.write(f"  ✓ After removing duplicates: {df.shape} (removed {duplicates_removed:,} rows) | {monitor_memory('duplicates')}")
        pbar.update(1)
        
        # 2. Handle missing values
        pbar.set_description("Handling missing values")
        before_na = len(df)
        df = df.dropna()
        na_removed = before_na - len(df)
        tqdm.write(f"  ✓ After removing missing values: {df.shape} (removed {na_removed:,} rows) | {monitor_memory('missing')}")
        pbar.update(1)
        
        # 3. Filter out empty text fields
        pbar.set_description("Filtering text columns")
        text_columns = df.select_dtypes(include=['object', 'category']).columns
        
        if len(text_columns) > 0:
            for col in tqdm(text_columns, desc="Processing text cols", leave=False):
                if col in df.columns and df[col].dtype in ['object', 'category']:
                    initial_len = len(df)
                    # The .str accessor also works on categorical columns with string
                    # categories, so no conversion back to str is needed and the
                    # memory-saving dtype from step 0 is preserved
                    df = df[df[col].str.strip().str.len() > 0]
                    removed = initial_len - len(df)
                    if removed > 0:
                        tqdm.write(f"    ✓ Filtered empty {col}: removed {removed:,} rows")
        else:
            tqdm.write("  ℹ️  No text columns found")
        
        tqdm.write(f"  ✓ Text filtering complete | {monitor_memory('text_filter')}")
        pbar.update(1)
        
        # 4. GPU preparation
        pbar.set_description("GPU preparation")
        if USE_GPU_PROCESSING and has_gpu:
            tqdm.write(f"  ✓ Data prepared for GPU device: {GPU_DEVICE}")
            tqdm.write("  ✓ GPU memory management enabled")
            # Clear any existing GPU cache
            torch.cuda.empty_cache()
        else:
            tqdm.write("  ✓ Data optimized for CPU processing")
        pbar.update(1)
        
        # 5. Custom preprocessing steps
        pbar.set_description("Custom preprocessing")
        
        # Text length filtering (if text column exists)
        if 'text' in df.columns:
            initial_len = len(df)
            min_text_length = 50  # Minimum characters
            df = df[df['text'].str.len() >= min_text_length]
            filtered_short = initial_len - len(df)
            if filtered_short > 0:
                tqdm.write(f"    ✓ Filtered short texts (<{min_text_length} chars): removed {filtered_short:,} rows")
        
        # Final memory cleanup
        import gc
        gc.collect()
        if USE_GPU_PROCESSING and has_gpu:
            torch.cuda.empty_cache()
        
        tqdm.write(f"  ✓ Custom preprocessing completed | {monitor_memory('custom')}")
        pbar.update(1)
    
    total_removed = original_shape[0] - len(df)
    final_memory = df.memory_usage(deep=True).sum() / 1024**2
    memory_reduction = initial_memory - final_memory
    
    print("✅ GPU-aware preprocessing complete!")
    print(f"   📊 Original shape: {original_shape}")
    print(f"   📊 Final shape: {df.shape}")
    print(f"   📊 Total rows removed: {total_removed:,} ({total_removed/original_shape[0]*100:.1f}%)")
    print(f"   💾 Memory reduction: {memory_reduction:.1f} MB ({memory_reduction/initial_memory*100:.1f}%)")
    print(f"   🎯 Ready for GPU training on device: {GPU_DEVICE}")
    
    return df

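# Apply baseline preprocessing.
# Note: the GPU-aware block that follows reprocesses `df` and overwrites `df_processed`,
# so running only one of the two variants is enough.
if 'df' in locals():
    start_time = time.time()
    df_processed = preprocess_data(df.copy())
    process_time = time.time() - start_time
    print(f"⏱️  Preprocessing completed in {process_time:.2f} seconds")
else:
    print("❌ No data to preprocess. Please load data first.")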

# Apply GPU-aware preprocessing
if 'df' in locals():
    start_time = time.time()
    print("\n🔧 Starting preprocessing with GPU configuration:")
    print(f"   Target device: {GPU_DEVICE}")
    print(f"   GPU processing: {'Enabled' if USE_GPU_PROCESSING else 'Disabled'}")
    
    df_processed = preprocess_data_gpu_aware(df.copy())
    process_time = time.time() - start_time
    print(f"⏱️  GPU-aware preprocessing completed in {process_time:.2f} seconds")
else:
    print("❌ No data to preprocess. Please load data first.")

Starting data preprocessing...
After removing duplicates: (1048581, 9) (removed 0 rows)
After removing missing values: (1048581, 9)
After filtering empty text: (1048581, 9) (removed 0 rows)
After filtering empty id: (1048581, 9) (removed 0 rows)
After filtering empty dump: (1048581, 9) (removed 0 rows)
After filtering empty url: (1048581, 9) (removed 0 rows)
After filtering empty date: (1048581, 9) (removed 0 rows)
After filtering empty file_path: (1048581, 9) (removed 0 rows)
After filtering empty language: (1048581, 9) (removed 0 rows)
Preprocessing complete. Final shape: (1048581, 9)


## Train/Eval Split

Split the data into training and evaluation sets.
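
The split in this notebook shuffles rows with a fixed seed, so the assignment of a given row depends on the full dataset. As an alternative, and not the method used here, a hash of each record's `id` can decide its split, which keeps assignments stable when rows are added or removed. A minimal sketch, with `assign_split` as a hypothetical helper and made-up ids:

```python
import hashlib


def assign_split(record_id: str, train_fraction: float = 0.8) -> str:
    """Map an id to 'train' or 'eval' using a stable hash bucket."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    # First 32 bits of the digest scaled into [0, 1)
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return "train" if bucket < train_fraction else "eval"


ids = [f"doc-{i}" for i in range(1000)]
splits = [assign_split(record_id) for record_id in ids]
train_share = splits.count("train") / len(splits)
print(f"train share: {train_share:.3f}")
```

The resulting share is only approximately 80%, since each id is assigned independently; the shuffle-based split gives exact proportions.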

In [9]:
def split_dataset(df: pd.DataFrame, train_size: float = 0.8, random_state: int = 42) -> tuple:
    """Split dataset into train and eval sets with progress tracking."""
    print(f"🔀 Splitting dataset with train_size={train_size}, random_state={random_state}")
    
    with tqdm(total=3, desc="Dataset splitting", unit="step") as pbar:
        # Shuffle the dataset
        pbar.set_description("Shuffling dataset")
        df_shuffled = df.sample(frac=1, random_state=random_state).reset_index(drop=True)
        pbar.update(1)
        
        # Split the data
        pbar.set_description("Splitting data")
        train_df, eval_df = train_test_split(
            df_shuffled, 
            train_size=train_size, 
            random_state=random_state,
            shuffle=False  # Already shuffled above
        )
        pbar.update(1)
        
        # Validate split
        pbar.set_description("Validating split")
        assert len(train_df) + len(eval_df) == len(df), "Split validation failed!"
        pbar.update(1)
    
    print("✅ Split completed!")
    print(f"   📊 Train set: {train_df.shape} ({len(train_df) / len(df)*100:.1f}%)")
    print(f"   📊 Eval set: {eval_df.shape} ({len(eval_df) / len(df)*100:.1f}%)")
    
    return train_df, eval_df

# Split the data
if 'df_processed' in locals():
    start_time = time.time()
    train_df, eval_df = split_dataset(df_processed, TRAIN_SPLIT, RANDOM_SEED)
    split_time = time.time() - start_time
    print(f"⏱️  Splitting completed in {split_time:.2f} seconds")
else:
    print("❌ No processed data to split. Please run preprocessing first.")

Splitting dataset with train_size=0.8, random_state=42
Train set: (838864, 9)
Eval set: (209717, 9)
Train ratio: 0.800
Eval ratio: 0.200


## Save Processed Data

Save the train and eval datasets to parquet files.
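
Alongside the parquet files, the save step writes a `metadata.json` describing the split. As a worked example of what those fields contain, the sketch below rebuilds the main entries from the split sizes reported in this notebook's output (838,864 train rows and 209,717 eval rows). `build_split_metadata` is a hypothetical helper that covers only a subset of the fields.

```python
import json


def build_split_metadata(train_samples: int, eval_samples: int, random_seed: int) -> dict:
    """Hypothetical helper: summarize a split the way metadata.json does."""
    total = train_samples + eval_samples
    return {
        "total_samples": total,
        "train_samples": train_samples,
        "eval_samples": eval_samples,
        "train_split": train_samples / total,
        "eval_split": eval_samples / total,
        "random_seed": random_seed,
    }


# Split sizes taken from the output shown in this notebook
metadata = build_split_metadata(838_864, 209_717, random_seed=42)

# Round-trip through JSON to confirm every value is serializable
restored = json.loads(json.dumps(metadata, indent=2))
print(restored["total_samples"], round(restored["train_split"], 3))
```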

In [None]:
def save_datasets(train_df: pd.DataFrame, eval_df: pd.DataFrame, output_dir: str):
    """Save train and eval datasets to parquet files with progress tracking."""
    print("💾 Saving datasets...")
    
    save_tasks = [
        ("Training data", train_df, f"{output_dir}/train/train_data.parquet"),
        ("Evaluation data", eval_df, f"{output_dir}/eval/eval_data.parquet")
    ]
    
    saved_paths = []
    
    with tqdm(total=len(save_tasks) + 1, desc="Saving datasets", unit="file") as pbar:
        for task_name, data, path in save_tasks:
            pbar.set_description(f"Saving {task_name.lower()}")
            
            # Save with progress
            data.to_parquet(path, index=False)
            file_size = os.path.getsize(path) / 1024**2
            
            tqdm.write(f"  ✓ {task_name} saved to: {path}")
            tqdm.write(f"    📊 Shape: {data.shape}")
            tqdm.write(f"    💾 Size: {file_size:.2f} MB")
            
            saved_paths.append(path)
            pbar.update(1)
        
        # Save metadata
        pbar.set_description("Saving metadata")
        metadata = {
            "total_samples": len(train_df) + len(eval_df),
            "train_samples": len(train_df),
            "eval_samples": len(eval_df),
            "train_split": len(train_df) / (len(train_df) + len(eval_df)),
            "eval_split": len(eval_df) / (len(train_df) + len(eval_df)),
            "columns": list(train_df.columns),
            "random_seed": RANDOM_SEED,
            # locals() inside a function never contains notebook-level names, so check globals()
            "source_files": len(parquet_files) if 'parquet_files' in globals() else 0,
            "processing_timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
        }
        
        metadata_path = f"{output_dir}/metadata.json"
        with open(metadata_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        
        tqdm.write(f"  ✓ Metadata saved to: {metadata_path}")
        saved_paths.append(metadata_path)
        pbar.update(1)
    
    print("✅ All datasets saved successfully!")
    return saved_paths[0], saved_paths[1], saved_paths[2]

# Save the datasets
if 'train_df' in locals() and 'eval_df' in locals():
    start_time = time.time()
    train_path, eval_path, metadata_path = save_datasets(train_df, eval_df, OUTPUT_DIR)
    save_time = time.time() - start_time
    print(f"⏱️  Saving completed in {save_time:.2f} seconds")
else:
    print("❌ No data to save. Please run the previous cells first.")

Saving datasets...


## Verification

Verify the saved datasets by loading them back and checking their properties.
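
The integrity checks in this notebook compare columns and row counts. One further check worth considering is that no record appears in both splits. The sketch below shows the idea on plain lists; `overlapping_ids` is a hypothetical helper and the ids are made up.

```python
from typing import Iterable, Set


def overlapping_ids(train_ids: Iterable[str], eval_ids: Iterable[str]) -> Set[str]:
    """Return ids that appear in both splits (an empty set means no leakage)."""
    return set(train_ids) & set(eval_ids)


leaked = overlapping_ids(["a", "b", "c"], ["d", "e"])
print(leaked)
```

With the frames loaded during verification, the same call could take `train_loaded["id"]` and `eval_loaded["id"]`, assuming the `id` column is present as in the dataset explored here.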

In [None]:
def verify_saved_data(train_path: str, eval_path: str, metadata_path: str):
    """Verify the saved datasets with progress tracking."""
    print("🔍 Verifying saved datasets...")
    
    verification_tasks = [
        ("Loading metadata", metadata_path),
        ("Loading train data", train_path), 
        ("Loading eval data", eval_path),
        ("Checking data integrity", None)
    ]
    
    with tqdm(total=len(verification_tasks), desc="Verification", unit="task") as pbar:
        # Load metadata
        pbar.set_description("Loading metadata")
        with open(metadata_path, 'r') as f:
            metadata = json.load(f)
        tqdm.write("✓ Metadata loaded:")
        for key, value in metadata.items():
            tqdm.write(f"    {key}: {value}")
        pbar.update(1)
        
        # Load and verify train data
        pbar.set_description("Loading train data")
        train_loaded = pd.read_parquet(train_path)
        tqdm.write(f"✓ Train data loaded from: {Path(train_path).name}")
        tqdm.write(f"    📊 Shape: {train_loaded.shape}")
        tqdm.write(f"    📋 Columns: {list(train_loaded.columns)}")
        pbar.update(1)
        
        # Load and verify eval data
        pbar.set_description("Loading eval data")
        eval_loaded = pd.read_parquet(eval_path)
        tqdm.write(f"✓ Eval data loaded from: {Path(eval_path).name}")
        tqdm.write(f"    📊 Shape: {eval_loaded.shape}")
        tqdm.write(f"    📋 Columns: {list(eval_loaded.columns)}")
        pbar.update(1)
        
        # Data integrity checks
        pbar.set_description("Checking integrity")
        checks_passed = 0
        total_checks = 3
        
        # Check 1: Column consistency
        if list(train_loaded.columns) == list(eval_loaded.columns):
            tqdm.write("  ‚úì Column names match between train and eval")
            checks_passed += 1
        else:
            tqdm.write("  ‚úó Column names mismatch between train and eval")
        
        # Check 2: No empty datasets
        if len(train_loaded) > 0 and len(eval_loaded) > 0:
            tqdm.write("  ✓ Both datasets contain data")
            checks_passed += 1
        else:
            tqdm.write("  ✗ One or both datasets are empty")
        
        # Check 3: Metadata consistency
        expected_total = metadata['train_samples'] + metadata['eval_samples']
        actual_total = len(train_loaded) + len(eval_loaded)
        if expected_total == actual_total:
            tqdm.write("  ✓ Sample counts match metadata")
            checks_passed += 1
        else:
            tqdm.write(f"  ✗ Sample count mismatch: expected {expected_total}, got {actual_total}")
        
        pbar.update(1)
    print(f"✅ Verification complete! ({checks_passed}/{total_checks} checks passed)")
    return train_loaded, eval_loaded

# Verify the saved data
# globals() is used here because locals() inside a generator expression
# only sees the generator's own scope, so the check would always fail
if all(var in globals() for var in ['train_path', 'eval_path', 'metadata_path']):
    start_time = time.time()
    train_verified, eval_verified = verify_saved_data(train_path, eval_path, metadata_path)
    verify_time = time.time() - start_time
    print(f"⏱️  Verification completed in {verify_time:.2f} seconds")
else:
    print("❌ No saved data to verify. Please run the saving step first.")
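
The three integrity checks above do not confirm that the saved split follows the 80/20 ratio this notebook targets. A minimal sketch of such a check follows, assuming the train and eval sample counts are available (for example from `metadata.json`). The helper name `split_ratio_ok` and the default tolerance are hypothetical and not part of the notebook.

```python
def split_ratio_ok(train_samples: int, eval_samples: int,
                   expected_eval_fraction: float = 0.2,
                   tolerance: float = 0.01) -> bool:
    """Return True if the eval share is within `tolerance` of the expected fraction."""
    total = train_samples + eval_samples
    if total == 0:
        return False  # an empty split cannot satisfy any ratio
    eval_fraction = eval_samples / total
    return abs(eval_fraction - expected_eval_fraction) <= tolerance


# Made-up counts for illustration (not results from this notebook)
print(split_ratio_ok(8000, 2000))
print(split_ratio_ok(9500, 500))
```

A small tolerance is useful because a split over a row count that is not a multiple of five cannot hit the ratio exactly.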

## Summary

Data preprocessing and splitting completed successfully! 

### Next Steps:
1. Review the processed data quality
2. Adjust preprocessing parameters if needed
3. Use the saved parquet files for training with Nanotron
4. Point your training configuration at the saved data paths

### File Outputs:
- Training data: `{OUTPUT_DIR}/train/train_data.parquet`
- Evaluation data: `{OUTPUT_DIR}/eval/eval_data.parquet` 
- Metadata: `{OUTPUT_DIR}/metadata.json`
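
A minimal sketch of how a training script might resolve these output paths, assuming the directory layout listed above. The helper name `resolve_output_paths` and the example directory `processed_data` are hypothetical placeholders.

```python
from pathlib import Path


def resolve_output_paths(output_dir: str) -> dict:
    """Build the expected output file paths under `output_dir` (layout as listed above)."""
    base = Path(output_dir)
    return {
        "train": base / "train" / "train_data.parquet",
        "eval": base / "eval" / "eval_data.parquet",
        "metadata": base / "metadata.json",
    }


# "processed_data" is a placeholder; substitute your own OUTPUT_DIR
paths = resolve_output_paths("processed_data")
missing = [name for name, path in paths.items() if not path.exists()]
print("Missing outputs:", missing or "none")
```

Once the files exist, each parquet path can be passed to `pd.read_parquet`, as the verification cell does.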

In [None]:
# Final summary with enhanced progress display
# globals() is used because locals() inside a generator expression
# only sees the generator's own scope
if all(var in globals() for var in ['train_df', 'eval_df']):
    print("🎉 " + "="*60 + " 🎉")
    print("📊 GPU-AWARE DATA PREPROCESSING SUMMARY")
    print("🎉 " + "="*60 + " 🎉")
    
    # Hardware configuration summary
    print(f"\n🖥️  Hardware Configuration:")
    print(f"   Target GPU device: {GPU_DEVICE}")
    print(f"   GPU processing: {'Enabled' if USE_GPU_PROCESSING else 'Disabled'}")
    print(f"   Processing chunk size: {CHUNK_SIZE:,} rows")
    print(f"   Parallel workers: {NUM_WORKERS}")
    print(f"   Max memory usage: {MAX_MEMORY_GB:.1f} GB")
    
    # Calculate total processing time from whichever stage timers exist
    total_time = 0
    print(f"\n⏱️  Performance Metrics:")
    if 'load_time' in locals():
        total_time += load_time
        print(f"   Loading time: {load_time:.2f}s")
    if 'process_time' in locals():
        total_time += process_time  
        print(f"   Processing time: {process_time:.2f}s")
    if 'split_time' in locals():
        total_time += split_time
        print(f"   Splitting time: {split_time:.2f}s")
    if 'save_time' in locals():
        total_time += save_time
        print(f"   Saving time: {save_time:.2f}s")
    if 'verify_time' in locals():
        total_time += verify_time
        print(f"   Verification time: {verify_time:.2f}s")
    
    if total_time > 0:
        print(f"   Total processing time: {total_time:.2f}s")
    
    # Data summary
    print(f"\n📈 Data Summary:")
    print(f"   Input files processed: {len(parquet_files) if 'parquet_files' in locals() else 0}")
    print(f"   Total samples: {len(train_df) + len(eval_df):,}")
    print(f"   Training samples: {len(train_df):,} ({len(train_df)/(len(train_df)+len(eval_df))*100:.1f}%)")
    print(f"   Evaluation samples: {len(eval_df):,} ({len(eval_df)/(len(train_df)+len(eval_df))*100:.1f}%)")
    print(f"   Output directory: {OUTPUT_DIR}")
    
    # File paths for training
    print(f"\n📁 Training Files:")
    print(f"   Train data: {OUTPUT_DIR}/train/train_data.parquet")
    print(f"   Eval data: {OUTPUT_DIR}/eval/eval_data.parquet")
    print(f"   Metadata: {OUTPUT_DIR}/metadata.json")
    
    # Next steps for training
    print(f"\n🚀 Next Steps for Training:")
    print(f"   1. 📝 Review the processed data quality")
    print(f"   2. 🔧 Configure your training environment with these settings:")
    print(f"      - Device: {GPU_DEVICE}")
    print(f"      - GPU processing: {'Enabled' if USE_GPU_PROCESSING else 'Disabled'}")
    print(f"      - Memory optimization: {MAX_MEMORY_GB:.1f} GB limit")
    print(f"   3. 🎯 Run the training notebook (scripts/train.ipynb)")
    print(f"   4. 📊 Monitor training progress and GPU utilization")
    
    # Configuration for training script
    training_config = {
        "data_path": OUTPUT_DIR,
        "train_file": f"{OUTPUT_DIR}/train/train_data.parquet",
        "eval_file": f"{OUTPUT_DIR}/eval/eval_data.parquet",
        "device": GPU_DEVICE,
        "use_gpu": USE_GPU_PROCESSING,
        "batch_size_per_gpu": 16 if USE_GPU_PROCESSING else 4,
        "gradient_accumulation_steps": 4,
        "mixed_precision": USE_GPU_PROCESSING,
        "flash_attention": USE_GPU_PROCESSING,
        "total_samples": len(train_df) + len(eval_df)
    }
    
    print(f"\n📄 Training Configuration (copy to training script):")
    for key, value in training_config.items():
        print(f"   {key}: {value}")
    
    print("\n🎉 Data ready for GPU-accelerated training with Infini attention!")
    print("🎉 " + "="*60 + " 🎉")
else:
    print("❌ Please run all cells above to complete the data preprocessing pipeline.")
    
    # Show which steps are missing
    missing_vars = []
    if 'parquet_files' not in locals():
        missing_vars.append("📁 File discovery")
    if 'df' not in locals():
        missing_vars.append("📊 Data loading")
    if 'df_processed' not in locals():
        missing_vars.append("🔧 Data preprocessing")
    if 'train_df' not in locals() or 'eval_df' not in locals():
        missing_vars.append("🔀 Data splitting")
        
    if missing_vars:
        print("Missing steps:")
        for step in missing_vars:
            print(f"  ❌ {step}")

### Next Steps for Training:
1. Check the processed data statistics above
2. **Training Environment**: Use the detected GPU configuration for optimal performance
3. **Nanotron Training**: Run `scripts/train.ipynb` with the generated data files
4. **Monitoring**: Track GPU utilization and memory usage during training

### ⚙️ Configuration for Training Script:
```python
# Use these paths in your training configuration
TRAIN_DATA_PATH = "{OUTPUT_DIR}/train/train_data.parquet"
EVAL_DATA_PATH = "{OUTPUT_DIR}/eval/eval_data.parquet"
DEVICE = "cuda:0"  # or your detected GPU device
USE_FLASH_ATTENTION = True  # if GPU supports it
MIXED_PRECISION = True      # for faster training
```

### 📊 Performance Optimization:
- **GPU Memory**: Optimized for available VRAM
- **Batch Size**: Automatically configured based on your hardware
- **Data Loading**: Parallel workers for efficient I/O
- **Memory Management**: Garbage collection and cache clearing
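
The summary cell prints `training_config` so it can be copied into a training script by hand. As an alternative, a minimal sketch of persisting it to a JSON file is shown below. The helper name `save_training_config`, the output filename, and the example values are hypothetical; in the notebook the dictionary would come from the summary cell.

```python
import json
import tempfile
from pathlib import Path


def save_training_config(config: dict, path: str) -> Path:
    """Write `config` as indented JSON and return the path written."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as f:
        # default=str converts non-JSON values such as Path objects to strings
        json.dump(config, f, indent=2, default=str)
    return out_path


# Placeholder values for illustration; not output from this notebook
example_config = {"device": "cuda:0", "use_gpu": True, "gradient_accumulation_steps": 4}
target = Path(tempfile.gettempdir()) / "training_config.json"
saved = save_training_config(example_config, str(target))
print(f"Saved training config to {saved}")
```

A training script can then read the file back with `json.load` instead of relying on copied values.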