# AI Engineer Technical Assessment

## Overview
Build an AI-powered solution for sentiment analysis of movie reviews that leverages the existing dataset to improve accuracy. This assessment is designed to be completed in 2-3 hours; we do NOT expect very detailed answers or long explanations.

## Notes
- AI assistance is allowed and, in fact, encouraged. The caveats are:
    - Concise explanations and simple code are preferred.
    - Solutions that use newer information and go beyond LLMs' cutoff dates are valuable.
    - You must be able to explain the code you write here.

- Look up any information you need; copying and pasting code is allowed.
- Set up the environment as needed. You can use your local environment, Colab, or any other environment of your preference.
- Focus on working solutions; leave iteration and improvements for when you have extra time.

## Setup
The following cells will download and prepare the IMDB dataset. 

In [1]:
import pandas as pd
import numpy as np
from datasets import load_dataset

# Load IMDB dataset
dataset = load_dataset("imdb")
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

# Sample subset for quicker development
train_df = train_df.sample(n=5000, random_state=42)
test_df = test_df.sample(n=10, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")

# Display sample data
print("\nSample review:")
sample = train_df.iloc[0]
print(f"Text: {sample['text'][:200]}...")
print(f"Sentiment: {'Positive' if sample['label'] == 1 else 'Negative'}")


Training samples: 5000
Test samples: 10

Sample review:
Text: Dumb is as dumb does, in this thoroughly uninteresting, supposed black comedy. Essentially what starts out as Chris Klein trying to maintain a low profile, eventually morphs into an uninspired version...
Sentiment: Negative


## Task 1: Model Implementation
Implement a solution that analyzes sentiment in movie reviews. This part is explicitly open-ended: explore ways to leverage the example dataset to enhance predictions. You can consider a pre-trained language model that can understand and generate text, external APIs, RAG systems, etc.
Feel free to use any library or tool you are comfortable with.

In [6]:
from dotenv import load_dotenv
import warnings
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, PreTrainedTokenizerFast
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from tqdm import tqdm
from typing import List, Dict
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

In [7]:
# Load Hugging Face token from .env file
load_dotenv()
print("Loaded .env file.")

True

In [3]:
# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


In [4]:
# Define the model name. We fine-tune an LLM with LoRA; Llama 3 8B offers a good balance
# of performance and resource requirements.
model_name = "meta-llama/Meta-Llama-3-8B"
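
To make the resource side of this choice concrete, the sketch below estimates the memory needed just to hold the weights at different precisions. The parameter count of roughly 8.03 billion is an assumption taken from the published model card, and the figures ignore activations, the KV cache, optimizer state, and quantization overhead, so they should be read as lower bounds.

```python
# Rough weight-memory estimate; ignores activations, KV cache and optimizer state.
N_PARAMS = 8.03e9  # assumption: approximate parameter count of Llama 3 8B


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{name:>12}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights need about 32 GB in float32, 16 GB in float16, and 4 GB in 4-bit form, which is why the CPU path without quantization is expected to be slow and memory-hungry.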

In [None]:
# Check if we're using CUDA to determine the quantization approach
if device.type == "cuda":
    # Set up quantization configuration for efficient memory usage
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # Load the model weights in 4-bit format
        bnb_4bit_quant_type="nf4",             # Use the "nf4" quantization type
        bnb_4bit_compute_dtype=torch.float16,  # Use the float16 data type for computations
        bnb_4bit_use_double_quant=True,        # Also quantize the quantization constants to save memory
    )

    # Load the model with quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )
else:
    # On CPU, skip 4-bit quantization and load the model with default settings
    print("Running on CPU. This will be slow and may require a smaller model.")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        trust_remote_code=True
    )

Running on CPU. This will be slow and may require a smaller model.



In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
# Prepare the model for training:
#   1. Freezes all parameters
#   2. Cast output embeddings and LayerNorm weights to float32
#   3. Enables gradient checkpointing
#   4. Add the upcasting of the lm head to float32
model = prepare_model_for_kbit_training(model)

In [None]:
# Set up LoRA configuration
peft_config = LoraConfig(
    r=16,                                                    # Rank of the LoRA update matrices
    lora_alpha=32,                                           # Scaling factor for the LoRA update (applied as lora_alpha / r)
    lora_dropout=0.05,                                       # Dropout probability for LoRA layers
    bias="none",                                             # Do not train any bias parameters
    task_type="CAUSAL_LM",                                   # Type of task that the model is being trained for
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],    # Attention and MLP projection layers
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)
print("Model prepared with LoRA")
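
As a rough check on how small the trainable part is, the sketch below counts the LoRA weights implied by this configuration. The layer shapes are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); the loaded model's own config is the authoritative source.

```python
# Back-of-the-envelope count of trainable LoRA parameters.
# Assumption: layer shapes follow the published Llama 3 8B config.
HIDDEN = 4096         # hidden_size
KV_DIM = 1024         # 8 key/value heads x head_dim 128 (grouped-query attention)
INTERMEDIATE = 14336  # intermediate_size of the MLP
NUM_LAYERS = 32

# (in_features, out_features) of each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) per target, i.e. r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


r, lora_alpha = 16, 32
print(f"Trainable LoRA parameters: {lora_param_count(r):,}")
print(f"Update scaling (lora_alpha / r): {lora_alpha / r}")
```

Under these assumptions the adapters total about 42 million weights, roughly 0.5% of the base model, and the LoRA update is scaled by a factor of 2.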

In [None]:
def preprocess_data(df: pd.DataFrame, tokenizer: PreTrainedTokenizerFast, max_length: int = 512) -> List[Dict]:
    """Preprocess data for training by formatting prompts and tokenizing.

    Args:
        df (pd.DataFrame): DataFrame containing the data.
        tokenizer (PreTrainedTokenizerFast): Tokenizer to use for preprocessing.
        max_length (int, optional): Maximum length of the input sequence. Defaults to 512.

    Returns:
        (List[Dict]): List of preprocessed data items.
    """
    processed_data = []

    for _, row in tqdm(df.iterrows(), total=len(df), desc="Preprocessing data"):
        text = row['text']
        label = "positive" if row['label'] == 1 else "negative"

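        # Create the prompt in the format Llama 3 expects
        # Caveat: truncation cuts from the right, so reviews longer than max_length lose the trailing sentiment label
        prompt = f"<|begin_of_text|>Review: {text}\nSentiment: {label}<|end_of_text|>"

        # Tokenize
        encodings = tokenizer(prompt, truncation=True, max_length=max_length, padding="max_length", return_tensors="pt")

        # Mask padding positions with -100 so they are ignored by the loss
        labels = encodings["input_ids"][0].clone()
        labels[encodings["attention_mask"][0] == 0] = -100

        processed_data.append({
            "input_ids": encodings["input_ids"][0],
            "attention_mask": encodings["attention_mask"][0],
            "labels": labels,
            "original_text": text,
            "original_label": row['label']
        })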

    return processed_data

In [None]:
# Preprocess training data
print("Preprocessing training data...")
train_data = preprocess_data(train_df, tokenizer)
print(f"Processed {len(train_data)} training examples")

In [None]:
# Create a simple dataset class
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return {
            "input_ids": self.data[idx]["input_ids"],
            "attention_mask": self.data[idx]["attention_mask"],
            "labels": self.data[idx]["labels"]
        }

In [None]:
# Create the training dataset and dataloader
train_dataset = IMDBDataset(train_data)
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=4,     # Small batch size due to memory constraints
    shuffle=True
)

In [None]:
def train_model(model: torch.nn.Module, dataloader: torch.utils.data.DataLoader, num_epochs: int = 3) -> torch.nn.Module:
    """Train the model using the provided dataloader. Displays a chart of training loss over time.

    Args:
        model (torch.nn.Module): The model to be trained.
        dataloader (torch.utils.data.DataLoader): The dataloader to be used for training.
        num_epochs (int, optional): The number of epochs to train for. Defaults to 3.

    Returns:
        (torch.nn.Module): The trained model.
    """
    # Set up the optimizer over the trainable (LoRA) parameters only
    optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-5)

    # Training loop
    model.train()
    losses = []

    for epoch in range(num_epochs):
        epoch_losses = []
        progress_bar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")

        for batch in progress_bar:
            # Move batch to device
            batch = {k: v.to(device) for k, v in batch.items()}

            # Forward pass
            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"]
            )

            loss = outputs.loss
            epoch_losses.append(loss.item())

            # Backward pass
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            # Update progress bar
            progress_bar.set_postfix({"loss": loss.item()})

        avg_loss = sum(epoch_losses) / len(epoch_losses)
        losses.append(avg_loss)
        print(f"Epoch {epoch+1}/{num_epochs} - Average Loss: {avg_loss:.4f}")

    # Plot training loss
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, num_epochs+1), losses, marker='o')
    plt.title('Training Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.grid(True)
    plt.show()

    return model

In [None]:
# Train the model
print("Starting model training...")
trained_model = train_model(model, train_dataloader, num_epochs=5)
print("Model training completed")

In [None]:
def predict_sentiment(text: str, model: torch.nn.Module, tokenizer: PreTrainedTokenizerFast, max_length: int = 512) -> Dict[str, str | float]:
    """Predict sentiment for a given text. Also returns the confidence score.
    
    Args:
        text (str): The text to be analyzed.
        model (any): The model to be used for prediction.
        tokenizer (PreTrainedTokenizerFast): The tokenizer to be used for preprocessing.
        max_length (int, optional): The maximum length of the input sequence. Defaults to 512.
    
    Returns:
        (Dict[str, str | float]): A dictionary containing the predicted sentiment and confidence score. 
    """
    # Create prompt
    prompt = f"<|begin_of_text|>Review: {text}\nSentiment:"

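    # Tokenize (caveat: for reviews longer than max_length, truncation removes the trailing "Sentiment:" cue)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_length)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate (greedy decoding keeps the predicted label deterministic)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens, not the prompt
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)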

    # Extract sentiment
    try:
        sentiment_part = generated_text.split("Sentiment:")[-1].strip().lower()
        if "positive" in sentiment_part:
            sentiment = "positive"
            confidence = 0.9  # Simplified confidence score
        elif "negative" in sentiment_part:
            sentiment = "negative"
            confidence = 0.9  # Simplified confidence score
        else:
            # If the model output is unclear, use a heuristic approach
            positive_words = ["good", "great", "excellent", "amazing", "enjoyed", "best", "recommend"]
            negative_words = ["bad", "terrible", "awful", "worst", "disappointing", "waste", "boring"]

            pos_count = sum(1 for word in positive_words if word in text.lower())
            neg_count = sum(1 for word in negative_words if word in text.lower())

            if pos_count > neg_count:
                sentiment = "positive"
                confidence = 0.6 + (0.1 * min(pos_count - neg_count, 3))  # Scale confidence
            else:
                sentiment = "negative"
                confidence = 0.6 + (0.1 * min(neg_count - pos_count, 3))  # Scale confidence
    except Exception:
        # Fallback
        sentiment = "unknown"
        confidence = 0.5

    return {
        "sentiment": sentiment,
        "confidence": confidence
    }
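
The fixed 0.9 confidence above is a placeholder. One common alternative is to compare the model's next-token scores for the two label words and normalise them with a softmax. The sketch below shows only that arithmetic on made-up logit values; obtaining the real logits from the fine-tuned model, and handling label words that span several tokens, is left out and would need checking against the tokenizer. The `label_confidence` helper is hypothetical.

```python
import math
from typing import Dict


def label_confidence(logit_positive: float, logit_negative: float) -> Dict[str, object]:
    """Turn two label logits into a predicted label and a softmax confidence."""
    # Subtract the max before exponentiating for numerical stability
    m = max(logit_positive, logit_negative)
    exp_pos = math.exp(logit_positive - m)
    exp_neg = math.exp(logit_negative - m)
    p_positive = exp_pos / (exp_pos + exp_neg)

    if p_positive >= 0.5:
        return {"sentiment": "positive", "confidence": p_positive}
    return {"sentiment": "negative", "confidence": 1.0 - p_positive}


# Hypothetical logits, for illustration only
print(label_confidence(logit_positive=2.0, logit_negative=0.0))
```

A score derived this way varies with how strongly the model prefers one label, which also makes the confidence histogram in Task 3 more informative than a constant value.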

In [None]:
# Test the model on a sample
sample_review = "This movie was absolutely fantastic! The acting was superb and the plot kept me engaged throughout."
result = predict_sentiment(sample_review, trained_model, tokenizer)
print(f"Sample review: {sample_review}")
print(f"Predicted sentiment: {result['sentiment']} (confidence: {result['confidence']:.2f})")

In [None]:
# Save model fine-tuned
print("Saving fine-tuned model...")
trained_model.save_pretrained("fine_tuned_model")
tokenizer.save_pretrained("tokenizer")
print("Model saved successfully")

## Task 2: API Implementation
Create a simple API using FastAPI that serves your solution. The API should accept a review text and return the sentiment analysis result.

Expected format:
```python
# Request
{
    "review_text": "This movie exceeded my expectations..."
}

# Response
{
    "sentiment": "positive",
    "confidence": 0.92,
    "similar_reviews": [
        {},
        {}
    ]
}
```

In [None]:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn
from typing import List, Dict
import datetime

In [None]:
# Define request and response models
class ReviewRequest(BaseModel):
    review_text: str

class SimilarReview(BaseModel):
    text: str
    sentiment: str
    similarity: float

class SentimentResponse(BaseModel):
    sentiment: str
    confidence: float
    similar_reviews: List[SimilarReview]

In [None]:
# Create FastAPI app
api_version = "1.0.0"
app = FastAPI(title="Movie Review Sentiment Analysis API",
              description="API for analyzing sentiment in movie reviews using Llama 3 with LoRA fine-tuning",
              version=api_version)

In [None]:
# Global variables to store model and data
global_model = AutoModelForCausalLM.from_pretrained("fine_tuned_model")
global_tokenizer = AutoTokenizer.from_pretrained("tokenizer")
global_train_data = train_data

In [None]:
# Root endpoint
@app.get("/")
def read_root():
    return {"message": "Welcome to the Movie Review Sentiment Analysis API"}

In [None]:
# Health check endpoint
@app.get("/health")
def health_check():
    return {
        "status": "healthy",
        "version": api_version,
        "timestamp": datetime.datetime.now().isoformat(),
        "model_loaded": global_model is not None
    }

In [None]:
def find_similar_reviews(text: str, train_data: List[Dict], top_n: int = 2) -> List[Dict]:
    """Find similar reviews in the training data using simple word overlap.

    Args:
        text (str): The review text.
        train_data (List[Dict]): The training data.
        top_n (int, optional): The number of top results. Defaults to 2.

    Returns:
        (List[Dict]): A list of dictionaries containing the result.
    """
    text_words = set(text.lower().split())
    similarities = []

    for item in train_data:
        review_words = set(item["original_text"].lower().split())
        # Calculate Jaccard similarity
        intersection = len(text_words.intersection(review_words))
        union = len(text_words.union(review_words))
        similarity = intersection / union if union > 0 else 0

        similarities.append({
            "text": item["original_text"],
            "sentiment": "positive" if item["original_label"] == 1 else "negative",
            "similarity": similarity
        })

    # Sort by similarity and return top_n
    similar_reviews = sorted(similarities, key=lambda x: x["similarity"], reverse=True)[:top_n]
    return similar_reviews
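
As a quick sanity check of the word-overlap measure, the sketch below computes Jaccard similarity for two toy sentences. The `jaccard` helper and the example strings are illustrative and not part of the notebook.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the sets of lower-cased, whitespace-split words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


query = "a great movie with great acting"
candidate = "great acting but a weak plot"

# Shared words: {"a", "great", "acting"}; the union has 8 distinct words
print(jaccard(query, candidate))  # 3 / 8 = 0.375
```

Because `split()` keeps punctuation attached to words and common words such as "a" count as much as informative ones, this score is only a rough signal of similarity.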

In [None]:
def complete_prediction(text: str, model: torch.nn.Module, tokenizer: PreTrainedTokenizerFast, train_data: List[Dict]) -> Dict[str, object]:
    """Complete prediction function that returns sentiment, confidence, and similar reviews.

    Args:
        text (str): The review text.
        model (any): The model to be used.
        tokenizer (PreTrainedTokenizerFast): The tokenizer to be used for preprocessing.
        train_data (List[Dict]): The training data.

    Returns:
        (Dict[str, any]): A dictionary containing the result.
    """
    # Get sentiment prediction
    result = predict_sentiment(text, model, tokenizer)

    # Find similar reviews
    similar_reviews = find_similar_reviews(text, train_data)

    return {
        "sentiment": result["sentiment"],
        "confidence": result["confidence"],
        "similar_reviews": similar_reviews
    }

In [None]:
# Sentiment analysis endpoint
@app.post("/analyze", response_model=SentimentResponse)
def analyze_sentiment(request: ReviewRequest):
    try:
        # Check if the text is provided
        if not request.review_text or len(request.review_text.strip()) == 0:
            raise HTTPException(status_code=400, detail="Review text cannot be empty")

        # Get prediction
        result = complete_prediction(
            text=request.review_text,
            model=global_model,
            tokenizer=global_tokenizer,
            train_data=global_train_data
        )

        # Format similar reviews for response
        similar_reviews = [
            SimilarReview(
                text=review["text"],
                sentiment=review["sentiment"],
                similarity=review["similarity"]
            )
            for review in result["similar_reviews"]
        ]

        # Return response
        return SentimentResponse(
            sentiment=result["sentiment"],
            confidence=result["confidence"],
            similar_reviews=similar_reviews
        )
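    except HTTPException:
        # Re-raise client errors (e.g. the 400 above) instead of converting them to 500
        raise
    except Exception as e:
        # Change to a log error in a production environment
        print(f"Error processing request: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing request: {str(e)}")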

In [None]:
# Function to run the API server
def main():
    uvicorn.run(app, host="0.0.0.0", port=8000)

In [None]:
# Example of how to start the API server
print("To start the API server, run:")
print("main()")

In [None]:
# Example of API usage with curl
print("\nExample curl request:")
print('''curl -X POST "http://localhost:8000/analyze" \\
    -H "Content-Type: application/json" \\
    -d '{"review_text": "This movie was absolutely fantastic! The acting was superb."}'
''')

# Example of API usage with Python
print("\nExample of API usage in Python:")
print('''
import requests
import json

url = "http://localhost:8000/analyze"
data = {"review_text": "This movie was absolutely fantastic! The acting was superb."}
response = requests.post(url, json=data)
result = response.json()
print(json.dumps(result, indent=2))
''')

## Task 3: Testing and Performance
Evaluate your solution's performance on the test set. Include:
1. Accuracy metrics (precision, recall, F1-score)
2. Inference speed (average time per prediction)

Compare performance with and without using the example data to demonstrate any improvements.
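
For reference, the three requested accuracy metrics can be derived from confusion-matrix counts. The sketch below treats "positive" as the positive class and uses made-up labels; it is a plain-Python illustration with a hypothetical `binary_metrics` helper, not a replacement for a library implementation.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```

With the 10-review test subset sampled in Setup, one wrong prediction changes accuracy by 10 percentage points, so results at that size are indicative only.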

In [None]:
import time
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
import numpy as np
from typing import Optional
from transformers import pipeline

In [None]:
# Preprocess test data
print("Preprocessing test data...")
test_data = preprocess_data(test_df, tokenizer)
print(f"Processed {len(test_data)} test examples")

In [None]:
def evaluate_model(model: object, test_data: List[Dict], tokenizer: Optional[PreTrainedTokenizerFast] = None) -> Dict:
    """Evaluate the model performance on test data.

    Args:
        model (any): The model to be used.
        test_data (List[Dict]): The test data.
        tokenizer (Optional[PreTrainedTokenizerFast]): The tokenizer to be used for preprocessing.

    Returns:
        (Dict): A dictionary containing all the metrics.
    """
    true_labels = []
    predicted_labels = []
    confidences = []
    inference_times = []

    print("Evaluating model on test data...")
    for item in tqdm(test_data, desc="Evaluating"):
        # Get the true label
        true_label = "positive" if item["original_label"] == 1 else "negative"
        true_labels.append(true_label)

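        # Measure inference time
        start_time = time.time()
        if tokenizer is None:
            # Hugging Face pipeline: returns [{"label": "POSITIVE" or "NEGATIVE", "score": float}]
            output = model(item["original_text"], truncation=True)[0]
            result = {"sentiment": output["label"].lower(), "confidence": output["score"]}
        else:
            result = predict_sentiment(item["original_text"], model, tokenizer)
        end_time = time.time()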

        # Record results
        predicted_labels.append(result["sentiment"])
        confidences.append(result["confidence"])
        inference_times.append(end_time - start_time)

    # Calculate metrics
    accuracy = accuracy_score([1 if label == "positive" else 0 for label in true_labels],
                             [1 if label == "positive" else 0 for label in predicted_labels])

    precision, recall, f1, _ = precision_recall_fscore_support(
        [1 if label == "positive" else 0 for label in true_labels],
        [1 if label == "positive" else 0 for label in predicted_labels],
        average='binary'
    )

    avg_inference_time = sum(inference_times) / len(inference_times)

    # Print results
    print(f"\nAccuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Average Inference Time: {avg_inference_time:.4f} seconds per review")

    # Print classification report
    print("\nClassification Report:")
    print(classification_report(
        [1 if label == "positive" else 0 for label in true_labels],
        [1 if label == "positive" else 0 for label in predicted_labels],
        target_names=["Negative", "Positive"]
    ))

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "avg_inference_time": avg_inference_time,
        "true_labels": true_labels,
        "predicted_labels": predicted_labels,
        "confidences": confidences,
        "inference_times": inference_times
    }
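
Per-call timings can be noisy, and the first call can include one-off setup cost. The sketch below shows one way to average latency after a few untimed warm-up calls, using `time.perf_counter`. The `average_latency` helper and the `dummy_predict` stand-in are hypothetical and only illustrate the measurement pattern.

```python
import time
from typing import Callable, List


def average_latency(predict: Callable[[str], object], texts: List[str], warmup: int = 1) -> float:
    """Average seconds per call of `predict`, after `warmup` untimed calls."""
    for text in texts[:warmup]:
        predict(text)  # untimed warm-up

    start = time.perf_counter()
    for text in texts:
        predict(text)
    return (time.perf_counter() - start) / len(texts)


def dummy_predict(text: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "positive" if "good" in text.lower() else "negative"


print(f"{average_latency(dummy_predict, ['Good film', 'Dull plot']):.6f} s per review")
```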

In [None]:
# Evaluate fine-tuned model
print("\n=== Evaluating Fine-tuned Llama 3 Model with LoRA ===")
fine_tuned_results = evaluate_model(trained_model, test_data, tokenizer)

In [None]:
# For comparison, let's use a baseline model (BERT-based sentiment classifier)
print("\n=== Evaluating Baseline Model (BERT) ===")
# Load the default sentiment-analysis pipeline (a DistilBERT model fine-tuned on SST-2)
baseline_model = pipeline("sentiment-analysis", device=0 if torch.cuda.is_available() else -1)
baseline_results = evaluate_model(baseline_model, test_data)

In [None]:
# Compare models
print("\n=== Model Comparison ===")
print(f"Metric          | Fine-tuned Llama 3 | Baseline BERT")
print(f"----------------|-------------------|-------------")
print(f"Accuracy        | {fine_tuned_results['accuracy']:.4f}              | {baseline_results['accuracy']:.4f}")
print(f"Precision       | {fine_tuned_results['precision']:.4f}              | {baseline_results['precision']:.4f}")
print(f"Recall          | {fine_tuned_results['recall']:.4f}              | {baseline_results['recall']:.4f}")
print(f"F1 Score        | {fine_tuned_results['f1']:.4f}              | {baseline_results['f1']:.4f}")
print(f"Inference Time  | {fine_tuned_results['avg_inference_time']:.4f} sec          | {baseline_results['avg_inference_time']:.4f} sec")

In [None]:
# Visualize results
plt.figure(figsize=(12, 6))
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
fine_tuned_scores = [fine_tuned_results['accuracy'], fine_tuned_results['precision'],
                     fine_tuned_results['recall'], fine_tuned_results['f1']]
baseline_scores = [baseline_results['accuracy'], baseline_results['precision'],
                   baseline_results['recall'], baseline_results['f1']]

x = np.arange(len(metrics))
width = 0.35

plt.bar(x - width/2, fine_tuned_scores, width, label='Fine-tuned Llama 3')
plt.bar(x + width/2, baseline_scores, width, label='Baseline BERT')

plt.ylabel('Score')
plt.title('Model Performance Comparison')
plt.xticks(x, metrics)
plt.ylim(0, 1.0)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Visualize inference time comparison
plt.figure(figsize=(8, 5))
models = ['Fine-tuned Llama 3', 'Baseline BERT']
inference_times = [fine_tuned_results['avg_inference_time'], baseline_results['avg_inference_time']]

plt.bar(models, inference_times, color=['#1f77b4', '#ff7f0e'])
plt.ylabel('Average Inference Time (seconds)')
plt.title('Inference Speed Comparison')
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Analyze confidence distribution
plt.figure(figsize=(10, 6))
plt.hist(fine_tuned_results['confidences'], bins=10, alpha=0.7, label='Fine-tuned Llama 3')
plt.hist(baseline_results['confidences'], bins=10, alpha=0.7, label='Baseline BERT')
plt.xlabel('Confidence Score')
plt.ylabel('Frequency')
plt.title('Confidence Score Distribution')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

In [None]:
# Conclusion
fine_tuned_f1 = fine_tuned_results['f1']
baseline_f1 = baseline_results['f1']
baseline_avg_inference_time = baseline_results['avg_inference_time']

print("\n=== Performance Analysis Conclusion ===")
if fine_tuned_f1 > baseline_f1:
    print("The fine-tuned Llama 3 model with LoRA outperforms the baseline BERT model in terms of F1 score.")
    if baseline_f1 > 0:  # Avoid dividing by zero
        print(f"Improvement: {(fine_tuned_f1 - baseline_f1) / baseline_f1 * 100:.2f}% increase in F1 score.")
else:
    print("The baseline BERT model outperforms the fine-tuned Llama 3 model in terms of F1 score.")
    if fine_tuned_f1 > 0:  # Avoid dividing by zero
        print(f"Difference: {(baseline_f1 - fine_tuned_f1) / fine_tuned_f1 * 100:.2f}% higher F1 score for baseline.")

if fine_tuned_results['avg_inference_time'] < baseline_avg_inference_time:
    print("\nThe fine-tuned Llama 3 model is faster for inference.")
    print(f"Speed improvement: {(baseline_avg_inference_time - fine_tuned_results['avg_inference_time']) / baseline_avg_inference_time * 100:.2f}% faster.")
else:
    print("\nThe baseline BERT model is faster for inference.")
    print(f"Speed difference: {(fine_tuned_results['avg_inference_time'] - baseline_avg_inference_time) / baseline_avg_inference_time * 100:.2f}% slower for fine-tuned model.")

print("\nPotential advantages of the fine-tuned Llama 3 model (to be confirmed by the metrics above):")
print("1. Better understanding of context and nuanced language in movie reviews")
print("2. Ability to provide similar reviews for context")
print("3. More flexible and adaptable to different types of review language")
print("4. Can be further improved with more training data and longer training time")

print("\nTrade-offs:")
print("1. Resource requirements (memory, computation)")
print("2. Training time and complexity")
print("3. Deployment considerations for larger models")

## Task 4: Deployment Strategy

1. Describe your deployment strategy considering:
   - Data storage and retrieval
   - Scalability
   - Resource requirements
   - Cost considerations

2. Create a simple Dockerfile to package your solution

In [None]:
# Write your deployment strategy here as a markdown cell
deployment_strategy = """
# Deployment Strategy for Sentiment Analysis API

## Infrastructure

* Compute: Amazon ECS with EC2 launch type (for GPU support)
* Container orchestration: Docker container with the FastAPI app.
* Networking: ECS Service with Application Load Balancer (ALB).
* CI/CD: GitHub Actions to deploy container image to Amazon ECR and update ECS task.

## Scalability Approach

1. Horizontal Scaling:
   - Scale tasks up/down based on CPU/memory usage and request rate

2. Performance Optimization:
   - Model quantization (4-bit) to reduce memory footprint
   - Batch processing for high-throughput scenarios
   - Caching frequently requested predictions

3. High Availability:
   - Multi-zone deployment for fault tolerance
   - Health checks to ensure service is healthy before traffic is routed to it
   - Graceful degradation with fallback models if primary model is unavailable

## Model & Data Storage

1. Model Storage:
   - Store model artifacts in cloud object storage (S3)
   - Version control for models using DVC or similar tools
   - Model registry to track model versions and performance metrics

2. Data Management:
   - MongoDB for storing processed reviews and metadata
   - Redis for caching frequent predictions and similar reviews
   - Periodic data archiving for historical analysis
   - Data versioning to track dataset changes

3. Secrets Management:
   - Securely store API keys and other sensitive data
   - Use AWS Secrets Manager to manage secrets

## Resource & Cost Considerations

1. Resource Optimization:
   - Right-sizing containers based on workload patterns
   - Spot instances for batch processing and training
   - Reserved instances for baseline capacity
   - Autoscaling to match demand patterns

2. Cost Management:
   - Monitoring and alerting on resource usage
   - Cost allocation tagging for different components
   - Regular review of resource utilization
   - Scheduled scaling for predictable traffic patterns

3. Performance vs. Cost Tradeoffs:
   - Quantization to reduce compute requirements
   - Caching strategy to reduce inference calls
   - Tiered service levels based on response time requirements
"""

print(deployment_strategy)

# Write your Dockerfile content (raw string so the trailing backslashes are preserved when printed)
dockerfile_content = r"""
# Use NVIDIA CUDA base image for GPU support
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Set working directory
WORKDIR /app

# Install Python and dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.10 \
    python3-pip \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy model files and application code
COPY ./model /app/model
COPY ./app /app/app

# Set environment variables
ENV MODEL_PATH=/app/model
ENV PYTHONPATH=/app

# Expose port for API
EXPOSE 8000

# Set up entrypoint script
COPY entrypoint.sh /app/
RUN chmod +x /app/entrypoint.sh

# Run the API server
ENTRYPOINT ["/app/entrypoint.sh"]
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
"""

print("\nDockerfile:")
print(dockerfile_content)
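
The "caching frequently requested predictions" item in the strategy can be prototyped in-process before introducing Redis. A minimal sketch using `functools.lru_cache` follows; `slow_predict` is a hypothetical stand-in for the model call, and a per-process cache like this is not shared across containers.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the underlying "model" actually runs


def slow_predict(text: str) -> str:
    """Hypothetical stand-in for an expensive model call."""
    CALLS["count"] += 1
    return "positive" if "great" in text.lower() else "negative"


@lru_cache(maxsize=1024)
def cached_predict(normalized_text: str) -> str:
    """Memoise predictions keyed on the normalised review text."""
    return slow_predict(normalized_text)


def predict(text: str) -> str:
    # Normalise whitespace and case so trivially different requests share a cache entry
    return cached_predict(" ".join(text.lower().split()))


print(predict("A great film"), predict("a  GREAT film"), CALLS["count"])
```

Normalising case before the lookup is a design choice that raises the hit rate; it should be skipped if the deployed model is sensitive to casing.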

## Evaluation Criteria
- Implementation that can process reviews and return sentiments
- Use of extra data to improve predictions
- Proper API design
- Reasonable deployment strategy

Good luck!
