I've created a text file with all the code sections clearly marked with === SECTION NAME === separators. Each section includes both the markdown cells and code cells, making it easy for you to copy and paste them into your Jupyter notebook. The sections are organized as follows:
- Title and Introduction
- Part 1: Data Exploration and Preprocessing
  - Imports
  - Load Data
  - Text Preprocessing
  - Create Labels
- Part 2: Model Implementation
  - Dataset Class
  - Data Preparation
  - Model Training
  - Model Evaluation
- Part 3: Model Deployment
  - Save Model
  - API Deployment
- Part 4: Summary and Conclusions
You can now create a new Jupyter notebook and copy each section between the markers into separate cells. For sections marked with === MARKDOWN: ===, create markdown cells, and for sections marked with === CODE: ===, create code cells.


=== MARKDOWN: TITLE ===
# Customer Support Ticket Classification - Implementation

This notebook implements our multi-label classification model for customer support tickets, using Amazon Electronics reviews as the training data.

=== MARKDOWN: PART 1 ===
## Part 1: Data Exploration and Preprocessing

=== CODE: IMPORTS ===
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # transformers' own AdamW is deprecated; use the PyTorch optimizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import hamming_loss, accuracy_score, f1_score, roc_auc_score
from imblearn.over_sampling import RandomOverSampler
import time

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Set style for plots
sns.set_theme()  # the 'seaborn' style name is no longer available in recent Matplotlib releases
sns.set_palette('husl')

=== MARKDOWN: LOAD DATA ===
### 1.1 Load and Examine Data

=== CODE: LOAD DATA ===
# Load the dataset
df = pd.read_json('data/electronics_reviews.json', lines=True)

# Display basic information
print("Dataset Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nSample Data:")
df.head()

=== MARKDOWN: TEXT PREPROCESSING ===
### 1.2 Text Preprocessing
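
The cleaning step lowercases the text, drops every character that is not a letter or whitespace, and collapses repeated spaces. The sketch below restates `preprocess_text` from the code cell so it runs on its own, and applies it to a made-up review (the sample sentence is an assumption for illustration):

```python
import re

def preprocess_text(text):
    """Lowercase, keep only letters and whitespace, collapse extra spaces."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return ' '.join(text.split())

sample = "Doesn't work with my 4K TV!!  Returned after 2 days."
print(preprocess_text(sample))
# doesnt work with my k tv returned after days
```

Note that digits disappear entirely ("4K" becomes "k", "2 days" becomes "days") and apostrophes are removed without a space ("doesn't" becomes "doesnt"). Since `bert-base-uncased` already lowercases its input, the lowercasing here mainly matters for the keyword matching in section 1.3.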

=== CODE: TEXT PREPROCESSING ===
def preprocess_text(text):
    """Clean and normalize text data."""
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

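# Combine review text and summary (missing values become empty strings,
# otherwise a NaN in either column would make the combined text NaN)
df['combined_text'] = df['reviewText'].fillna('') + ' ' + df['summary'].fillna('')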

# Clean text
df['combined_text'] = df['combined_text'].apply(preprocess_text)

# Display sample of preprocessed text
print("Sample of preprocessed text:")
print(df['combined_text'].iloc[0])

=== MARKDOWN: CREATE LABELS ===
### 1.3 Create Multi-Label Categories
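
The labels in this notebook are not human annotations; they are derived by checking each review for category keywords. The check is a plain substring test, so a keyword also fires inside longer words. A minimal sketch of that behavior next to a word-boundary variant (`matches_substring` and `matches_whole_word` are hypothetical helper names for illustration, not part of the notebook):

```python
import re

def matches_substring(text, keywords):
    """Same test as `any(word in review ...)`: plain substring containment."""
    return any(keyword in text for keyword in keywords)

def matches_whole_word(text, keywords):
    """Stricter variant: a keyword must appear as a whole word or phrase."""
    return any(
        re.search(r'\b' + re.escape(keyword) + r'\b', text)
        for keyword in keywords
    )

functionality_keywords = ['work', 'function', 'feature', 'performance']
review = 'the network adapter arrived on time'

print(matches_substring(review, functionality_keywords))   # True: 'work' is inside 'network'
print(matches_whole_word(review, functionality_keywords))  # False: no standalone keyword
```

The whole-word variant also stops matching inflected forms such as "works" or "worked", so either choice is a trade-off between false positives and missed reviews.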

=== CODE: CREATE LABELS ===
def create_labels(df):
    """Create multi-label categories based on review content with 10 categories."""
    labels = []
    
    for _, row in df.iterrows():
        review = row['combined_text'].lower()
        label = [0] * 10  # 10 categories
        
        # Product Quality
        if any(word in review for word in ['quality', 'durable', 'reliable', 'broken', 'defect']):
            label[0] = 1
            
        # Customer Service
        if any(word in review for word in ['service', 'support', 'help', 'customer', 'return']):
            label[1] = 1
            
        # Price
        if any(word in review for word in ['price', 'cost', 'expensive', 'cheap', 'value']):
            label[2] = 1
            
        # Functionality
        if any(word in review for word in ['work', 'function', 'feature', 'performance']):
            label[3] = 1
            
        # Technical Issues
        if any(word in review for word in ['bug', 'error', 'crash', 'glitch', 'problem']):
            label[4] = 1
            
        # Shipping/Delivery
        if any(word in review for word in ['shipping', 'delivery', 'arrived', 'package']):
            label[5] = 1
            
        # User Experience
        if any(word in review for word in ['easy', 'difficult', 'simple', 'complicated']):
            label[6] = 1
            
        # Product Compatibility
        if any(word in review for word in ['compatible', 'compatibility', 'works with']):
            label[7] = 1
            
        # Product Features
        if any(word in review for word in ['feature', 'specification', 'capability']):
            label[8] = 1
            
        # Others
        if sum(label) == 0:
            label[9] = 1
            
        labels.append(label)
    
    return np.array(labels)

# Create labels
labels = create_labels(df)

# Create DataFrame for labels
labels_df = pd.DataFrame(labels, columns=[
    'Product Quality', 'Customer Service', 'Price', 'Functionality',
    'Technical Issues', 'Shipping/Delivery', 'User Experience',
    'Product Compatibility', 'Product Features', 'Others'
])

# Display label distribution
print("Label Distribution:")
print(labels_df.sum())

=== MARKDOWN: PART 2 ===
## Part 2: Model Implementation

=== MARKDOWN: DATASET CLASS ===
### 2.1 Create Custom Dataset Class

=== CODE: DATASET CLASS ===
class ReviewDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
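        # Store texts as a list so self.texts[idx] is positional; a pandas
        # Series from train_test_split keeps its shuffled index labels
        self.texts = list(texts)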
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.float)
        }

=== MARKDOWN: DATA PREPARATION ===
### 2.2 Prepare Data for Training

=== CODE: DATA PREPARATION ===
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['combined_text'], labels, test_size=0.2, random_state=42
)

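# Handle class imbalance.
# RandomOverSampler does not accept multi-label targets, so oversample on each
# distinct label combination and resample row indices rather than the texts.
# Note: rare combinations are duplicated heavily, which can enlarge the training set a lot.
combo_keys = np.array([''.join(map(str, row)) for row in y_train])
ros = RandomOverSampler(random_state=42)
resampled_idx, _ = ros.fit_resample(
    np.arange(len(y_train)).reshape(-1, 1), combo_keys
)
resampled_idx = resampled_idx.flatten()
X_train_resampled = X_train.values[resampled_idx]
y_train_resampled = y_train[resampled_idx]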

# Initialize tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Create datasets
train_dataset = ReviewDataset(X_train_resampled, y_train_resampled, tokenizer)
test_dataset = ReviewDataset(X_test, y_test, tokenizer)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

print(f"Training samples: {len(train_dataset)}")
print(f"Testing samples: {len(test_dataset)}")

=== MARKDOWN: MODEL TRAINING ===
### 2.3 Model Training
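
Training uses `BCEWithLogitsLoss`, which treats each of the 10 categories as an independent yes/no decision: it applies a sigmoid to every logit and, with the default reduction, averages the binary cross-entropy over all entries. A small pure-Python sketch of that computation for one ticket with three labels (the numbers are made up for illustration, and `bce_with_logits` is a hypothetical helper, not the PyTorch implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, each label scored independently."""
    losses = []
    for logit, target in zip(logits, targets):
        p = sigmoid(logit)
        losses.append(-(target * math.log(p) + (1 - target) * math.log(1 - p)))
    return sum(losses) / len(losses)

logits = [2.0, -1.0, 0.0]   # raw model outputs for three categories
targets = [1, 0, 1]         # ground-truth label vector
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Because each label has its own sigmoid, the predicted probabilities do not have to sum to 1, which is what allows a single ticket to belong to several categories at once.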

=== CODE: MODEL TRAINING ===
def train_model(model, train_loader, val_loader, device, num_epochs=3):
    """Train the BERT model."""
    optimizer = AdamW(model.parameters(), lr=2e-5)
    criterion = torch.nn.BCEWithLogitsLoss()
    
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_mask)
            loss = criterion(outputs.logits, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                
                outputs = model(input_ids, attention_mask=attention_mask)
                loss = criterion(outputs.logits, labels)
                val_loss += loss.item()
        
        print(f'Epoch {epoch + 1}:')
        print(f'Training Loss: {total_loss/len(train_loader):.4f}')
        print(f'Validation Loss: {val_loss/len(val_loader):.4f}')

# Initialize model
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=10,
    problem_type="multi_label_classification"
)

# Move model to device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Train model
print("Training model...")
train_model(model, train_loader, test_loader, device)

=== MARKDOWN: MODEL EVALUATION ===
### 2.4 Model Evaluation
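
Hamming loss and subset accuracy answer different questions: Hamming loss is the fraction of individual label entries that are wrong, while subset accuracy only counts a sample as correct when every one of its labels matches. A toy example with hand-made labels (illustrative values, not model output), written in plain Python to mirror what `hamming_loss` and `accuracy_score` compute for multi-label indicator arrays:

```python
y_true = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
]
y_pred = [
    [1, 0, 0],   # one of three labels wrong
    [0, 1, 0],   # exact match
    [1, 1, 0],   # exact match
]

n_samples = len(y_true)
n_labels = len(y_true[0])

# Hamming loss: wrong label entries divided by all label entries
wrong_entries = sum(
    t != p
    for true_row, pred_row in zip(y_true, y_pred)
    for t, p in zip(true_row, pred_row)
)
hamming = wrong_entries / (n_samples * n_labels)

# Subset accuracy: share of samples whose full label vector is correct
exact_matches = sum(
    true_row == pred_row for true_row, pred_row in zip(y_true, y_pred)
)
subset_accuracy = exact_matches / n_samples

print(f"Hamming loss: {hamming:.4f}")             # Hamming loss: 0.1111
print(f"Subset accuracy: {subset_accuracy:.4f}")  # Subset accuracy: 0.6667
```

A single wrong label costs one ninth in Hamming loss but a full third in subset accuracy, so with 10 categories subset accuracy is usually the harsher of the two.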

=== CODE: MODEL EVALUATION ===
def evaluate_model(model, test_loader, device):
    """Evaluate the model using various metrics."""
    model.eval()
    all_probs = []
    all_labels = []
    start_time = time.time()
    
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(input_ids, attention_mask=attention_mask)
            # Keep the raw probabilities; thresholding happens after the loop
            probs = torch.sigmoid(outputs.logits)
            
            all_probs.extend(probs.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    
    # Stop the clock here so that only inference is timed, not metric computation
    inference_time = time.time() - start_time
    
    all_probs = np.array(all_probs)
    all_labels = np.array(all_labels)
    all_preds = (all_probs > 0.5).astype(float)
    
    # Calculate threshold-based metrics
    hamming = hamming_loss(all_labels, all_preds)
    subset_acc = accuracy_score(all_labels, all_preds)
    micro_f1 = f1_score(all_labels, all_preds, average='micro')
    macro_f1 = f1_score(all_labels, all_preds, average='macro')
    
    # Calculate AUC-ROC for each category
    category_names = [
        'Product Quality', 'Customer Service', 'Price', 'Functionality',
        'Technical Issues', 'Shipping/Delivery', 'User Experience',
        'Product Compatibility', 'Product Features', 'Others'
    ]
    
    auc_scores = []
    for i in range(all_labels.shape[1]):
        # AUC-ROC is undefined when a category has only one class in the test set
        if len(np.unique(all_labels[:, i])) < 2:
            auc = float('nan')
        else:
            # Use probabilities, not thresholded predictions, to rank samples
            auc = roc_auc_score(all_labels[:, i], all_probs[:, i])
        auc_scores.append((category_names[i], auc))
    
    return {
        'hamming_loss': hamming,
        'subset_accuracy': subset_acc,
        'micro_f1': micro_f1,
        'macro_f1': macro_f1,
        'auc_scores': dict(auc_scores),
        'inference_time': inference_time
    }

# Evaluate model
print("Evaluating model...")
metrics = evaluate_model(model, test_loader, device)

print("\nModel Evaluation Results:")
print(f"Hamming Loss: {metrics['hamming_loss']:.4f}")
print(f"Subset Accuracy: {metrics['subset_accuracy']:.4f}")
print(f"Micro F1 Score: {metrics['micro_f1']:.4f}")
print(f"Macro F1 Score: {metrics['macro_f1']:.4f}")
print("\nAUC-ROC Scores for each category:")
for category, score in metrics['auc_scores'].items():
    print(f"{category}: {score:.4f}")
print(f"\nInference Time: {metrics['inference_time']:.2f} seconds")

=== MARKDOWN: PART 3 ===
## Part 3: Model Deployment

=== MARKDOWN: SAVE MODEL ===
### 3.1 Save the Model

=== CODE: SAVE MODEL ===
# Save the model
model.save_pretrained('trained_model')
tokenizer.save_pretrained('trained_model')
print("Model saved successfully!")

=== MARKDOWN: API DEPLOYMENT ===
### 3.2 Create API for Model Deployment

=== CODE: API DEPLOYMENT ===
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    categories: dict

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        # Preprocess text
        text = preprocess_text(request.text)
        
        # Tokenize
        inputs = tokenizer(
            text,
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
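        # Make prediction (inputs must be on the same device as the model)
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        model.eval()
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.sigmoid(outputs.logits)
            predictions = (predictions > 0.5).float()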
        
        # Create response
        category_names = [
            'Product Quality', 'Customer Service', 'Price', 'Functionality',
            'Technical Issues', 'Shipping/Delivery', 'User Experience',
            'Product Compatibility', 'Product Features', 'Others'
        ]
        
        result = {
            category_names[i]: bool(predictions[0][i])
            for i in range(len(category_names))
        }
        
        return {"categories": result}
    
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

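# Run the API (in a separate cell, or preferably from a standalone script:
# Jupyter already runs an event loop, which typically makes uvicorn.run() fail in a cell)
# uvicorn.run(app, host="0.0.0.0", port=8000)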

=== MARKDOWN: SUMMARY ===
## Part 4: Summary and Conclusions

### 4.1 Project Summary

1. **Data Processing**:
   - Loaded and preprocessed Amazon Electronics Reviews
   - Derived 10 multi-label categories from keyword matching
   - Handled class imbalance using RandomOverSampler

2. **Model Development**:
   - Fine-tuned BERT for multi-label classification
   - Implemented custom dataset class
   - Evaluated with Hamming loss, subset accuracy, micro/macro F1, and per-category AUC-ROC

3. **Deployment**:
   - Built a FastAPI service for model serving
   - Added error handling
   - Implemented prediction endpoint

### 4.2 Future Improvements

1. **Model Enhancements**:
   - Implement data augmentation
   - Create ensemble model
   - Fine-tune hyperparameters

2. **Feature Engineering**:
   - Add more category keywords
   - Implement advanced text preprocessing
   - Add sentiment analysis

3. **Deployment**:
   - Add authentication
   - Implement rate limiting
   - Add monitoring and logging 
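
As an illustration of the rate-limiting item, here is a minimal in-memory sliding-window limiter. It is a sketch under stated assumptions, not wired into the FastAPI app: the class name `SlidingWindowLimiter`, its parameters, and the client identifier are all hypothetical.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.history = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id):
        now = self.clock()
        timestamps = self.history[client_id]
        # Drop timestamps that have fallen out of the window
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True

limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
print([limiter.allow("client-a") for _ in range(3)])  # [True, True, False]
```

In the API, `allow` would be called at the top of the `/predict` handler with some client identifier, returning HTTP 429 when it yields `False`. The state lives in one process's memory, so it is lost on restart and not shared across multiple workers.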