## 1. The Challenge

I needed to convert 20,000 professional Go games (SGF format) into numerical tensors that my neural network could read.

Each game consists of approximately 200 moves. My initial goal was to save every board state as a training example.

## 2. The Initial Approach (The Mistake)

My first script (`process.py`) was designed to iterate through every move of every game and save it immediately as a separate .npz file.

### 🔴 The "Bad" Logic

In [None]:
# Pseudo-code of the initial failure
# for game_id, game in enumerate(games):  # 20,000 games
#     for move_num, move in enumerate(game.moves):  # Approx. 200 moves per game
#         # 1. Encode the board to a 17x19x19 tensor
#         tensor = encoder.encode(board)
#
#         # 2. Save IMMEDIATELY to disk
#         # Problem: this runs ~200 times per game
#         # (np.savez, not np.save, is the call that writes .npz files)
#         np.savez(f"data/processed/game_{game_id}_move_{move_num}.npz", tensor)

## 3. The Bottleneck Analysis

When I ran this script, my computer froze, and the processing speed dropped drastically.

### The Math

$$20{,}000 \text{ games} \times 200 \text{ moves/game} = 4{,}000{,}000 \text{ files}$$
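The same estimate can be scripted before launching a job. The sketch below is a back-of-the-envelope calculation, assuming roughly 200 moves per game, one 17-plane 19×19 `float32` tensor per position, and no compression. The function name `estimate_output` is hypothetical and only serves this illustration.

```python
def estimate_output(num_games, moves_per_game, tensor_shape, bytes_per_value=4):
    """Return (file_count, raw_bytes) for a one-file-per-move layout."""
    file_count = num_games * moves_per_game
    values_per_tensor = 1
    for dim in tensor_shape:
        values_per_tensor *= dim
    # Raw payload only: ignores file headers and file system block rounding.
    raw_bytes = file_count * values_per_tensor * bytes_per_value
    return file_count, raw_bytes


files, raw_bytes = estimate_output(20_000, 200, (17, 19, 19))
print(f"{files:,} files, about {raw_bytes / 1e9:.1f} GB before compression")
# prints: 4,000,000 files, about 98.2 GB before compression
```

Even as a rough figure, seeing the file count and raw size up front makes it clear that the layout, not the encoder, is the thing to redesign.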

### Why this failed:

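1. **Metadata Overhead and Inode Pressure**: Every file costs a metadata record (an inode on ext4, an MFT entry on NTFS), and listing or searching a single folder gets much slower once it holds millions of tiny files. On ext4 the number of inodes is fixed when the file system is created, so a run like this can also exhaust them.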

2. **I/O Latency**: The time spent opening and closing 4 million files was greater than the time spent actually processing the data.

3. **Manageability**: Moving, deleting, or even listing 4 million files later would be impractical in the OS file explorer.
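The difference between the two layouts can be tried on a small scale. The sketch below writes one synthetic game (200 all-zero tensors, not real data) into temporary directories, once as one file per move and once as a single compressed archive. The helper names `write_per_move` and `write_per_game` are hypothetical, and the timings it prints depend heavily on the disk, the OS cache, and the data, so a single game will not reproduce the slowdown seen with 4 million files.

```python
import os
import tempfile
import time

import numpy as np


def write_per_move(tensors, out_dir):
    """One .npy file per position (the slow layout)."""
    for i, tensor in enumerate(tensors):
        np.save(os.path.join(out_dir, f"move_{i}.npy"), tensor)


def write_per_game(tensors, out_dir):
    """One compressed .npz file for the whole game (the batched layout)."""
    np.savez_compressed(os.path.join(out_dir, "game.npz"), inputs=np.stack(tensors))


# 200 synthetic positions standing in for one game.
tensors = [np.zeros((17, 19, 19), dtype=np.float32) for _ in range(200)]

results = {}
for name, writer in [("per_move", write_per_move), ("per_game", write_per_game)]:
    with tempfile.TemporaryDirectory() as out_dir:
        start = time.perf_counter()
        writer(tensors, out_dir)
        elapsed = time.perf_counter() - start
        results[name] = (len(os.listdir(out_dir)), elapsed)

for name, (file_count, elapsed) in results.items():
    print(f"{name}: {file_count} files in {elapsed:.4f} s")
```

The file counts (200 versus 1) are the reliable part of the output; multiplying them by 20,000 games gives the totals discussed above.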

## 4. The Solution: Batching

I refactored the pipeline to aggregate all moves from a single game into a list in memory, and then save one file per game.

### 🟢 The Optimized Logic

In [None]:
# Pseudo-code of the fix
# for game_id, game in enumerate(games):  # 20,000 games
#     # Create temporary lists in RAM
#     game_inputs = []
#     game_targets = []
#
#     for move in game.moves:
#         tensor = encoder.encode(board)
#         game_inputs.append(tensor)  # Store in RAM
#         game_targets.append(move_coord)
#
#     # Save ONCE per game
#     # Reduces file count by a factor of ~200
#     save_path = f"data/processed/game_{game_id}.npz"
#     np.savez_compressed(save_path, inputs=game_inputs, targets=game_targets)

## 5. Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Total Files** | 4,000,000 | 20,000 | 200x reduction |
| **Disk Usage** | Highly fragmented | Efficient (compressed) | ✓ |
| **Loading Speed** | Extremely slow | Fast (200 examples per file) | ✓ |

### Key Improvements

$$\text{File Reduction} = \frac{4{,}000{,}000}{20{,}000} = 200\times$$

- **Disk Usage**: Smaller on disk, since `np.savez_compressed` compresses each per-game archive.
- **Loading Speed**: A PyTorch `Dataset` can now open one file and read roughly 200 examples from it, instead of opening 200 separate files, which speeds up training considerably.
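A loader for the per-game layout might look like the sketch below. It is written in plain NumPy and only follows the `__len__`/`__getitem__` protocol that a map-style PyTorch dataset implements; in real use it would presumably subclass `torch.utils.data.Dataset`. The class name `PerGameDataset` is hypothetical, and the array keys `inputs` and `targets` are assumed to match the ones used when saving.

```python
import glob
import os

import numpy as np


class PerGameDataset:
    """Map-style dataset over per-game .npz files (hypothetical helper)."""

    def __init__(self, processed_dir):
        # Sort so that an index always maps to the same game.
        self.paths = sorted(glob.glob(os.path.join(processed_dir, "*.npz")))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # One open and one decompression return every position of the game.
        with np.load(self.paths[index]) as data:
            return data["inputs"], data["targets"]
```

Each item here is a whole game, so the number of examples per item varies with game length. A training loop would need to flatten or re-batch the positions, for example with a custom collate function.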

## Key Takeaway

**Always calculate total file output before running a massive loop.** I/O overhead is often the silent killer in ML pipelines.

### Lessons Learned

- ✓ Batch operations in memory before writing to disk
- ✓ Estimate file system impact (inode limits, I/O bottlenecks)
- ✓ Use compression when saving large arrays
- ✓ Profile your code to identify bottlenecks early

The first `process.py` caused the problem. The script below is the refactored version, which saves one file per game.

In [None]:
import os
import glob
import numpy as np
from sgfmill import sgf, boards
from encoder import GoBoardEncoder

# --- CONFIGURATION ---
RAW_DATA_DIR = 'data/raw'
PROCESSED_DIR = 'data/processed'
MAX_GAMES_TO_PROCESS = 20000  # Limit for testing

os.makedirs(PROCESSED_DIR, exist_ok=True)

class GameStateWrapper:
    def __init__(self, sgfmill_board, color_to_move):
        self.board_size = 19
        self.board = np.zeros((19, 19), dtype=int)
        self.color_to_move = color_to_move
        for r in range(19):
            for c in range(19):
                color = sgfmill_board.get(r, c)
                if color == 'b': self.board[r][c] = 1
                elif color == 'w': self.board[r][c] = 2

def get_sgf_content(filepath):
    with open(filepath, 'rb') as f:
        raw_data = f.read()
    try:
        return raw_data.decode('utf-8')
    except UnicodeDecodeError:
        try:
            # Many SGF collections are encoded in a Chinese code page
            return raw_data.decode('gb18030')
        except UnicodeDecodeError:
            return None

def process_games():
    encoder = GoBoardEncoder()
    # Case insensitive search for .sgf, .SGF, etc.
    files = glob.glob(os.path.join(RAW_DATA_DIR, '*.[sS][gG][fF]'))
    
    print(f"📂 Found {len(files)} SGF files. Processing up to {MAX_GAMES_TO_PROCESS}...")
    
    processed_count = 0
    
    for i, filepath in enumerate(files):
        if processed_count >= MAX_GAMES_TO_PROCESS:
            break

        sgf_content = get_sgf_content(filepath)
        if sgf_content is None: continue

        try:
            game = sgf.Sgf_game.from_string(sgf_content)
            
            # --- DATA DIET FILTER ---
            # Only keep games where at least one player is 9d or a pro (p).
            # This ensures the AI learns from pros and advanced amateurs,
            # not from weaker players.
            root = game.get_root()
            try:
                b_rank = root.get("BR").lower()
                w_rank = root.get("WR").lower()
                # Simple check: keep the game if either rank contains 'p' or '9d'.
                is_high_dan = ('p' in b_rank) or ('p' in w_rank) or \
                              ('9d' in b_rank) or ('9d' in w_rank)
                if not is_high_dan:
                    continue
            except (KeyError, AttributeError):
                # Rank is missing or not a string: skip this game
                continue

            board = boards.Board(19)
            # splitext also strips upper-case extensions such as .SGF
            filename_base = os.path.splitext(os.path.basename(filepath))[0]
            
            # Arrays to hold ALL moves for this single game
            game_inputs = []
            game_targets = []
            
            for move_node in game.get_main_sequence():
                color, move_coords = move_node.get_move()
                if move_coords is None: continue 
                
                row, col = move_coords
                
                # 1. Encode
                game_state = GameStateWrapper(board, color)
                input_tensor = encoder.encode(game_state)
                target_index = row * 19 + col
                
                # 2. Collect (Don't save yet)
                game_inputs.append(input_tensor)
                game_targets.append(target_index)
                
                # 3. Update
                board.play(row, col, color)
            
            # 4. SAVE ONCE PER GAME
            if len(game_inputs) > 0:
                save_path = os.path.join(PROCESSED_DIR, f"{filename_base}.npz")
                
                # Stack them: Inputs becomes (N, 17, 19, 19), Targets becomes (N,)
                np.savez_compressed(
                    save_path, 
                    inputs=np.array(game_inputs, dtype=np.float32), 
                    targets=np.array(game_targets, dtype=np.int64)
                )
                processed_count += 1
                if processed_count % 100 == 0:
                    print(f"✅ Processed {processed_count} games...")

        except Exception as e:
            # Don't crash on one bad file, but report what was skipped
            print(f"Skipping {filepath}: {e}")
            continue

    print(f"🎉 DONE! Processed {processed_count} valid high-quality games.")

if __name__ == "__main__":
    process_games()
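
The rank filter in the script above relies on substring checks (`'p' in rank`, `'9d' in rank`), which is simple but would also accept any rank string that happens to contain the letter `p`. A stricter variant is sketched below, assuming ranks are written as a number followed by `k`, `d`, or `p` (for example `9d` or `3p`). Real SGF collections may use other spellings, and the helper name `is_strong_rank` is hypothetical.

```python
import re

# Leading number, optional space, then the rank class: kyu, dan, or pro.
RANK_PATTERN = re.compile(r"^\s*(\d+)\s*([kdp])", re.IGNORECASE)


def is_strong_rank(rank, min_dan=9):
    """True for any professional rank, or an amateur dan rank >= min_dan."""
    match = RANK_PATTERN.match(rank or "")
    if match is None:
        return False  # Missing or unrecognised rank: treat as not strong.
    level, kind = int(match.group(1)), match.group(2).lower()
    if kind == "p":
        return True
    return kind == "d" and level >= min_dan


print([is_strong_rank(r) for r in ["9p", "9d", "5d", "3k", "", "unknown"]])
# prints: [True, True, False, False, False, False]
```

Whether unparseable ranks should be dropped or kept is a judgment call; this sketch drops them, which matches the script's choice to skip games with missing ranks.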
