# SageMaker Hyperscaler Training for Legal Reasoning Model (Part 1)

This notebook shows how to train the Legal Reasoning Model with SageMaker Hyperscaler on ml.g5.8xlarge instances, an instance type chosen here for its price-performance.

## Part 1: Setup and Configuration

### Setup

Import the libraries used in the notebook and set the plot style.

In [None]:
import os
import json
import yaml
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFace
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
plt.style.use('ggplot')
sns.set_theme(style="whitegrid")

### Configure AWS and SageMaker

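Create the SageMaker session, look up the execution role, and choose the S3 bucket and prefix. The cell assumes AWS credentials are already available in the environment; it does not configure them.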

In [None]:
# Set AWS region
region = 'us-east-1'  # Change to your preferred region

# Create SageMaker session
boto_session = boto3.Session(region_name=region)
sagemaker_session = sagemaker.Session(boto_session=boto_session)

# Get SageMaker execution role
role = sagemaker.get_execution_role()

# Set S3 bucket and prefix
bucket = sagemaker_session.default_bucket()
prefix = 'legal-reasoning-model'

print(f"SageMaker Role ARN: {role}")
print(f"S3 Bucket: {bucket}")
print(f"S3 Prefix: {prefix}")
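
The bucket and prefix above are later combined into S3 locations such as `{prefix}/data`. As a minimal sketch of that composition, the helper below joins a bucket and key segments into an `s3://` URI. The function name `s3_uri` and the example bucket name are hypothetical and not part of the original notebook.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key segments into an s3:// URI."""
    # Drop empty segments and stray slashes so "a/" + "/b" becomes "a/b".
    segments = [p.strip("/") for p in parts if p and p.strip("/")]
    return "/".join([f"s3://{bucket}", *segments])


# Placeholder values for illustration only. In the notebook, the bucket comes
# from sagemaker_session.default_bucket() and the prefix is defined above.
example_bucket = "example-bucket"
example_prefix = "legal-reasoning-model"

train_data_uri = s3_uri(example_bucket, example_prefix, "data")
print(train_data_uri)
```

Building the URI in one place avoids doubled or missing slashes when the prefix is edited later.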

### Load Configuration

Load the hyperscaler configuration from a YAML file, then override its region and bucket with the values from the current session.

In [None]:
# Load configuration
config_path = "../configs/hyperscaler_config.yaml"

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Update config with our session values
config['aws']['region'] = region
config['aws']['s3_bucket'] = bucket

# Display configuration
print("Model Configuration:")
print(json.dumps(config['model'], indent=2))

print("\nTraining Configuration:")
print(json.dumps(config['training'], indent=2))

print("\nHyperscaler Configuration:")
print(json.dumps(config['hyperscaler'], indent=2))
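
The cell above indexes into `config['aws']`, `config['model']`, `config['training']`, and `config['hyperscaler']`, and a later cell reads `config['model']['language']`. If any of these is missing, the failure is a bare `KeyError`. The sketch below shows one way to check for them up front. The function name `find_config_problems` and the sample dictionary are hypothetical; the real schema of `hyperscaler_config.yaml` is not shown in this notebook, so only the keys the notebook actually reads are checked.

```python
from typing import Any, List

# Top-level sections that the notebook cells read from the loaded config.
REQUIRED_SECTIONS = ("aws", "model", "training", "hyperscaler")


def find_config_problems(config: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    # yaml.safe_load returns None for an empty file, so guard the top level.
    if not isinstance(config, dict):
        return ["top-level config is not a mapping"]

    problems = []
    for section in REQUIRED_SECTIONS:
        if not isinstance(config.get(section), dict):
            problems.append(f"missing or non-mapping section: {section!r}")

    # The data preparation step reads config['model']['language'].
    model = config.get("model")
    if isinstance(model, dict) and not model.get("language"):
        problems.append("missing key: model.language")
    return problems


# Hypothetical sample standing in for the parsed YAML file.
sample_config = {
    "aws": {},
    "model": {"language": "german"},
    "training": {},
    "hyperscaler": {},
}
print(find_config_problems(sample_config))
```

In the notebook this would be called on `config` right after `yaml.safe_load`, raising or printing the problems before any cell depends on the values.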

### Prepare Training Data

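Check that the processed input file exists, then build the command that prepares the training data and uploads it to S3. This cell prints the command; it does not run it.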

In [None]:
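# Define paths
# Note: input_file points at the German data directory, while `language`
# is read from the config; keep the two consistent.
input_file = "../data/german/processed/all_examples.jsonl"
output_dir = "../data/hyperscaler"
language = config['model']['language']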

# Check if input file exists
if not os.path.exists(input_file):
    print(f"Input file not found: {input_file}")
    print("Please run the data processing script first.")
else:
    print(f"Input file found: {input_file}")
    
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    print(f"Output directory: {output_dir}")
    
    # Build the command line for the prepare_data_for_hyperscaler.py script.
    # Each backslash-newline is a line continuation, so cmd is a single line.
    cmd = f"""python ../scripts/prepare_data_for_hyperscaler.py \
    --input-file {input_file} \
    --output-dir {output_dir} \
    --s3-bucket {bucket} \
    --s3-prefix {prefix}/data \
    --language {language}"""
    
    print("\nCommand to prepare data:")
    print(cmd)
    
    # Note: to actually run the command from the notebook, use !{cmd}
    # or a subprocess call.
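
The note above mentions a subprocess call as one way to run the command. A minimal sketch of that option follows. It passes the arguments as a list rather than a shell string, so paths containing spaces need no quoting. The function names are hypothetical, the flags are the ones shown in the printed command, and `sys.executable` is swapped in for `python` so that the script runs under the notebook's own interpreter.

```python
import subprocess
import sys
from typing import List


def build_prepare_command(
    input_file: str,
    output_dir: str,
    bucket: str,
    prefix: str,
    language: str,
    script: str = "../scripts/prepare_data_for_hyperscaler.py",
) -> List[str]:
    """Return the data preparation command as an argument list."""
    return [
        sys.executable, script,
        "--input-file", input_file,
        "--output-dir", output_dir,
        "--s3-bucket", bucket,
        "--s3-prefix", f"{prefix}/data",
        "--language", language,
    ]


def run_prepare_command(args: List[str]) -> int:
    """Run the command; check=True raises CalledProcessError on failure."""
    completed = subprocess.run(args, check=True)
    return completed.returncode


# Placeholder values for illustration; the notebook would pass its own
# input_file, output_dir, bucket, prefix, and language variables.
example_args = build_prepare_command(
    "../data/german/processed/all_examples.jsonl",
    "../data/hyperscaler",
    "example-bucket",
    "legal-reasoning-model",
    "german",
)
print(" ".join(example_args))
```

Calling `run_prepare_command(example_args)` would execute the script. It is left out of the sketch because it needs the script, the input file, and S3 access to be in place.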