# Text Summarization Model Training

This notebook demonstrates how to fine-tune a transformer model (BART) on the CNN/DailyMail dataset using SageMaker for text summarization.

**Note**: This notebook is configured to run with the **Python 3 (ipykernel)** kernel and is optimized for AWS free tier usage.

In [None]:
# Install required libraries
!pip install --upgrade pip
!pip install boto3 sagemaker
!pip install transformers datasets
!pip install torch torchvision
!pip install pandas numpy scikit-learn
!pip install rouge-score nltk

# Report whether CUDA is available (this notebook assumes CPU-only training)
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

In [None]:
# Check for the backend directory and the scripts this notebook uploads
import os

# Create the backend directory if it does not exist yet
if not os.path.exists('backend'):
    print("Creating backend directory...")
    os.makedirs('backend', exist_ok=True)

# List the files in the backend directory
print("Files in backend directory:")
for file in os.listdir('backend'):
    print(f"  - {file}")

# Later cells expect these two scripts; warn early if either is missing
for script in ('preprocess.py', 'train.py'):
    if not os.path.exists(os.path.join('backend', script)):
        print(f"WARNING: backend/{script} not found")

In [None]:
# Import required libraries
import os
import boto3
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import ScriptProcessor
from sagemaker.pytorch import PyTorch

# Initialize SageMaker session
session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Define S3 bucket and prefixes
bucket = session.default_bucket()
prefix = "text-summarization"

# Define S3 paths for data and model artifacts
data_prefix = f"{prefix}/data"
model_prefix = f"{prefix}/model"
output_path = f"s3://{bucket}/{model_prefix}/output"
preprocessing_output_path = f"s3://{bucket}/{data_prefix}"

# Define SageMaker instance types. ml.t3.medium is a small, low-cost CPU instance.
# Not every instance type is offered for every job type (processing, training,
# hosting) or region, so check the quotas for your account before running.
processing_instance_type = "ml.t3.medium"
training_instance_type = "ml.t3.medium"
inference_instance_type = "ml.t3.medium"

# Verify that instance types are properly set
print(f"Processing instance type: {processing_instance_type}")
print(f"Training instance type: {training_instance_type}")
print(f"Inference instance type: {inference_instance_type}")

print(f"SageMaker role: {role}")
print(f"S3 bucket: {bucket}")
print(f"Data will be saved to: {preprocessing_output_path}")
print(f"Model will be saved to: {output_path}")

## Step 1: Data Preprocessing

We'll use a SageMaker Processing job to download and prepare the CNN/DailyMail dataset.
This step has been optimized for CPU instances and minimal data usage.
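
As a rough illustration of how the two size limits passed to the preprocessing job might interact, the sketch below assumes the script parses `--train-split-size` and `--max-samples` with `argparse` and keeps the smaller of the two resulting counts. `backend/preprocess.py` is not shown in this notebook, so the helper name `effective_sample_count` and the parsing logic are assumptions, not the script's actual code.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments this notebook passes to the preprocessing job."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", default="/opt/ml/processing/output")
    parser.add_argument("--train-split-size", type=float, default=0.01)
    parser.add_argument("--max-samples", type=int, default=100)
    return parser.parse_args(argv)


def effective_sample_count(total_examples, split_size, max_samples):
    """Hypothetical helper: apply the fractional split, then the hard cap."""
    by_fraction = int(total_examples * split_size)
    return min(by_fraction, max_samples)


args = parse_args(["--train-split-size", "0.01", "--max-samples", "100"])
# 200_000 is an illustrative dataset size, not a measured value.
print(effective_sample_count(200_000, args.train_split_size, args.max_samples))
```

Under this assumption, the hard cap of 100 samples wins whenever 1% of the split exceeds 100 examples, so `--max-samples` is the setting that bounds the job's cost.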

In [None]:
# Upload the preprocessing script to S3
preprocessing_script_path = "backend/preprocess.py"
preprocessing_s3_path = sagemaker.s3.S3Uploader.upload(
    local_path=preprocessing_script_path,
    desired_s3_uri=f"s3://{bucket}/{prefix}/scripts"
)

# Get PyTorch image URI
pytorch_image = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=session.boto_region_name,
    version="1.10.0",
    py_version="py38",
    instance_type=processing_instance_type,
    image_scope="training"
)

print(f"PyTorch image URI: {pytorch_image}")

# Configure the preprocessing job
processor = ScriptProcessor(
    command=['python3'],
    image_uri=pytorch_image,
    role=role,
    instance_count=1,
    instance_type=processing_instance_type,
    base_job_name='text-summarization-preprocessing',
    max_runtime_in_seconds=1800  # 30 minutes max
)

print(f"Starting preprocessing job using script: {preprocessing_s3_path}")

# Start the preprocessing job
processor.run(
    code=preprocessing_s3_path,
    arguments=[
        '--output-dir', '/opt/ml/processing/output',
        '--train-split-size', '0.01',  # Use only 1% of the data
        '--max-samples', '100'  # Cap at 100 samples
    ],
    outputs=[
        ProcessingOutput(
            output_name="data",
            source="/opt/ml/processing/output",
            destination=preprocessing_output_path
        )
    ],
    wait=True
)

print(f"Preprocessing job completed. Data saved to: {preprocessing_output_path}")

## Step 2: Model Training

We'll now train the model with the training script (`backend/train.py`), which has been adapted for:
1. Memory-efficient training
2. Minimal dataset usage
3. Early stopping via a hard cap on training steps
4. Resource optimization for AWS free tier
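
In SageMaker script mode, the estimator's hyperparameters reach the entry point as command-line arguments of the form `--key value`. The sketch below shows one way a training script could parse a few of them and derive the number of optimizer steps. It is an assumption about how `train.py` might work, not its actual contents; `str_to_bool` and `planned_steps` are hypothetical helpers.

```python
import argparse


def str_to_bool(value):
    """Convert the string form of a boolean hyperparameter to a bool."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_hyperparameters(argv=None):
    """Parse a subset of the hyperparameters defined in this notebook."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--max-train-samples", type=int, default=50)
    parser.add_argument("--max-steps", type=int, default=10)
    parser.add_argument("--use-max-steps", type=str_to_bool, default=False)
    return parser.parse_args(argv)


def planned_steps(args):
    """Hypothetical rule: cap by max-steps when use-max-steps is set."""
    # Ceiling division: number of batches needed to cover the samples once.
    steps_per_epoch = -(-args.max_train_samples // args.batch_size)
    total = steps_per_epoch * args.epochs
    return min(total, args.max_steps) if args.use_max_steps else total


args = parse_hyperparameters(["--use-max-steps", "True", "--max-steps", "10"])
print(planned_steps(args))
```

An explicit converter is used for `use-max-steps` because `type=bool` in `argparse` would treat any non-empty string, including `"False"`, as true.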

In [None]:
# Upload the training script to S3 (optional: the estimator below also packages and uploads source_dir)
training_script_path = "backend/train.py"
training_s3_path = sagemaker.s3.S3Uploader.upload(
    local_path=training_script_path,
    desired_s3_uri=f"s3://{bucket}/{prefix}/scripts"
)

# Define hyperparameters - optimized for free tier
hyperparameters = {
    'model-name': 'facebook/bart-base',
    'epochs': 1,
    'batch-size': 1,  # Minimal batch size for memory efficiency
    'learning-rate': 2e-5,
    'warmup-steps': 10,  # Reduced for faster training
    'max-input-length': 128,  # Shorter sequences for memory
    'max-target-length': 32,  # Shorter summaries for memory
    'dataset-size': 0.01,  # Use only 1% of the dataset
    'max-train-samples': 50,  # Cap training samples
    'max-val-samples': 10,  # Cap validation samples
    'max-steps': 10,  # Very small number of steps
    'use-max-steps': True  # Use steps instead of epochs
}

# Create PyTorch estimator
model_estimator = PyTorch(
    entry_point='train.py',
    source_dir='backend',
    role=role,
    framework_version='1.10.0',
    py_version='py38',
    instance_type=training_instance_type,
    instance_count=1,
    hyperparameters=hyperparameters,
    output_path=output_path,
    max_run=1800,  # 30 minutes max (estimators use max_run, not max_runtime_in_seconds)
    base_job_name='text-summarization-training'
)

# Start training
print("Starting training job...")
try:
    model_estimator.fit(
        inputs={
            'train': preprocessing_output_path + '/train',
            'validation': preprocessing_output_path + '/validation'
        },
        wait=True
    )
    print("Training completed!")
    print(f"Model artifacts saved to: {output_path}")
except Exception as e:
    print(f"Error during training: {e}")
    print("\nTroubleshooting steps:")
    print("1. Check your AWS account limits in the AWS Console")
    print("2. Verify that you have sufficient permissions")
    print("3. Check if the preprocessing job completed successfully")
    print("4. Try using a different instance type")
    print("\nYou can also try training manually through the AWS Console")
    raise

## Step 3: Model Deployment

Now we'll deploy the trained model to a SageMaker endpoint for real-time inference.
The deployment cell first tries an `ml.t3.medium` instance and falls back to `ml.t3.large` if that attempt fails.
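
For the endpoint to accept `{'text': ...}` and return `{'summary': ...}`, the inference script needs handlers that decode and encode JSON. The sketch below assumes the script defines the `input_fn`, `predict_fn`, and `output_fn` handlers that the SageMaker PyTorch serving container looks for. The bodies are illustrative, `fake_model` is a stand-in rather than BART, and `model_fn` (which would load the trained weights) is omitted because it depends on how `train.py` saves the model.

```python
import json


def input_fn(request_body, request_content_type):
    """Decode a JSON request of the form {"text": "..."} into a string."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    return json.loads(request_body)["text"]


def predict_fn(input_data, model):
    """Run the model; here `model` is any callable mapping text to a summary."""
    return model(input_data)


def output_fn(prediction, accept):
    """Encode the summary as {"summary": "..."} for the client."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"summary": prediction})


def fake_model(text):
    """Stand-in used only to exercise the handlers locally: first sentence."""
    return text.split(".")[0].strip() + "."


body = json.dumps({"text": "First sentence. Second sentence."})
text = input_fn(body, "application/json")
print(output_fn(predict_fn(text, fake_model), "application/json"))
```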

In [None]:
# Deploy the model to a SageMaker endpoint
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

endpoint_name = "summarizer-endpoint"

print(f"Deploying model to endpoint: {endpoint_name}")
print("This may take several minutes...")

try:
    # First try with t3.medium
    predictor = model_estimator.deploy(
        initial_instance_count=1,
        instance_type=inference_instance_type,  # ml.t3.medium
        endpoint_name=endpoint_name,
        serializer=JSONSerializer(),      # send requests as JSON
        deserializer=JSONDeserializer(),  # parse JSON responses into Python objects
        wait=True
    )
    print(f"Model deployed to endpoint: {endpoint_name}")
except Exception as e:
    print(f"Error deploying with t3.medium: {e}")
    print("Trying with t3.large...")
    try:
        # Try with t3.large as fallback
        predictor = model_estimator.deploy(
            initial_instance_count=1,
            instance_type="ml.t3.large",
            endpoint_name=endpoint_name,
            serializer=JSONSerializer(),
            deserializer=JSONDeserializer(),
            wait=True
        )
        print(f"Model deployed to endpoint with t3.large: {endpoint_name}")
    except Exception as inner_e:
        print(f"Error deploying with t3.large: {inner_e}")
        print("\nTroubleshooting steps:")
        print("1. Check your AWS account limits in the AWS Console")
        print("2. Verify that you have sufficient permissions")
        print("3. Check if the model artifacts were saved correctly")
        print("4. Try deleting any existing endpoints with the same name")
        print("\nYou can also try deploying manually through the AWS Console")
        raise

## Step 4: Test the Endpoint

Let's test our deployed model with a sample text.
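
The endpoint can also be called without the `predictor` object, for example from a separate application, through the `sagemaker-runtime` client in boto3. The sketch below assumes the endpoint speaks JSON with `text` and `summary` keys, as in the test cell that follows. `FakeRuntimeClient` is a local stand-in so the example runs without AWS credentials; in practice you would pass `boto3.client("sagemaker-runtime")` instead.

```python
import io
import json


def summarize(runtime_client, endpoint_name, text):
    """Send one JSON request to the endpoint and return the summary string."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"text": text}),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())["summary"]


class FakeRuntimeClient:
    """Hypothetical stand-in that echoes the first 20 characters as a 'summary'."""

    def invoke_endpoint(self, EndpointName, ContentType, Accept, Body):
        text = json.loads(Body)["text"]
        reply = json.dumps({"summary": text[:20]}).encode("utf-8")
        return {"Body": io.BytesIO(reply)}


print(summarize(FakeRuntimeClient(), "summarizer-endpoint",
                "The Chrysler Building will be sold."))
```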

In [None]:
# Test the endpoint with a sample text
sample_text = """
The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price. The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal. Mubadala, an Abu Dhabi investment fund, purchased 90% of the building for $800 million in 2008. Real estate firm Tishman Speyer had owned the other 10%. The buyer is RFR Holding, a New York real estate company. Officials with Tishman and RFR did not immediately respond to a request for comments. It's unclear when the deal will close. The building sold fairly quickly after being publicly placed on the market only two months ago. The sale was handled by CBRE Group. The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building. The rent is rising from $7.75 million last year to $32.5 million this year to $41 million in 2028. Meantime, rents in the building itself are not rising nearly that fast. While the building is an iconic landmark in the New York skyline, it is competing against newer office towers with large floor plans that are preferred by many tenants. The Chrysler Building was briefly the world's tallest, before it was surpassed by the Empire State Building, which was completed the following year.
"""

try:
    # Use the endpoint for inference
    response = predictor.predict({'text': sample_text})
    print("Generated summary:")
    print(response['summary'])
except Exception as e:
    print(f"Error during inference: {e}")
    print("This could be due to an issue with the endpoint setup or the model itself.")
    print("Check the CloudWatch logs for the endpoint for more details.")

## Conclusion

We've implemented a memory-efficient training pipeline optimized for AWS with the following improvements:

1. **Resource Usage**: Using t3.medium instances which are commonly available across regions
2. **Small Dataset**: Using only 1% of the data with a maximum of 100 samples
3. **Memory Optimization**: Implemented gradient checkpointing and half-precision training
4. **Early Stopping**: Training is capped at 10 steps (`max-steps`) to avoid excessive resource usage
5. **Efficient Preprocessing**: Optimized data preprocessing for minimal memory usage
6. **Resource Monitoring**: Added memory usage tracking throughout the pipeline
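
The memory tracking mentioned in the last item lives in the backend scripts, which are not shown here. As a minimal sketch of what such tracking could look like, the snippet below reads the process's peak resident memory through the standard-library `resource` module (Unix only). `peak_rss_mib` and `log_memory` are hypothetical names, not functions from the scripts.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(stage):
    """Print and return peak memory, labeled with the pipeline stage."""
    mib = peak_rss_mib()
    print(f"[{stage}] peak RSS: {mib:.1f} MiB")
    return mib


log_memory("after tokenization")
```

Peak RSS covers memory allocated by native libraries as well as by Python objects, which matters when most of the footprint comes from model tensors.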

The model is now trained and deployed with settings chosen to keep resource usage low. Because of the small dataset and short training time, expect limited summary quality; treat the result as a starting point for experimentation.

To improve the model's performance, you could:
1. Use a larger dataset (but this would require more resources)
2. Train for more steps (but this would increase costs)
3. Use a larger model (but this would require more memory)

Remember to delete the endpoint when you're done to avoid incurring charges!

In [None]:
# Clean up - delete the endpoint when done
try:
    predictor.delete_endpoint()
    print(f"Endpoint {endpoint_name} deleted successfully")
except Exception as e:
    print(f"Error deleting endpoint: {e}")