# Deep Learning Model for Sentiment Analysis

This notebook creates and trains deep learning models for sentiment analysis on movie reviews. The notebook includes:
- Data loading and preprocessing
- Memory optimization for different system configurations
- CNN and Transformer model architectures
- Model training with optimized parameters
- Model evaluation and prediction functions

## 1. Import Libraries and Setup

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import re
import matplotlib.pyplot as plt
import psutil
import pickle
import gc
import math

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

print("Libraries imported successfully!")

## 2. Data Loading and Initial Analysis

In [None]:
# Load the dataset
df = pd.read_csv('final_combined_dataset_clean.csv')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print("First few rows:")
print(df.head())
print("\nSentiment distribution:")
print(df['sentiment'].value_counts())

Dataset shape: (100000, 2)
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

## 3. Text Preprocessing

In [None]:
def enhanced_preprocess_text(text):
    """Enhanced text preprocessing for sentiment analysis"""
    text = text.lower()
    # Strip HTML tags such as <br /> so they do not leave stray "br" tokens
    text = re.sub(r'<[^>]+>', ' ', text)
    # Keep letters, whitespace, and punctuation that might indicate sentiment
    text = re.sub(r'[^a-z\s!?.,]', '', text)
    # Collapse repeated exclamation/question marks into a single one
    text = re.sub(r'!+', '!', text)
    text = re.sub(r'\?+', '?', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

# Apply preprocessing
print("Applying text preprocessing...")
df['review'] = df['review'].apply(enhanced_preprocess_text)

# Convert sentiment labels to binary (0 for negative, 1 for positive)
df['sentiment'] = df['sentiment'].map({'negative': 0, 'positive': 1})

print("Preprocessing completed!")
print(f"Final sentiment distribution:\n{df['sentiment'].value_counts()}")

Preprocessing completed!
Sentiment distribution:
sentiment
1    50000
0    50000
Name: count, dtype: int64
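To make the effect of the character filter concrete, the sketch below runs the same filtering and punctuation-collapsing expressions on a made-up review containing an HTML line break like the ones visible in the dataset preview, once without and once with tag removal. The helper name `clean_review`, its `strip_html` switch, and the sample string are illustrative assumptions.

```python
import re

def clean_review(text, strip_html=False):
    """Character filter and punctuation collapsing, with optional tag removal."""
    text = text.lower()
    if strip_html:
        # Replace tags such as <br /> with a space before filtering characters
        text = re.sub(r'<[^>]+>', ' ', text)
    # Keep letters, whitespace, and ! ? . ,
    text = re.sub(r'[^a-zA-Z\s!?.,]', '', text)
    # Collapse runs of ! or ? into a single mark
    text = re.sub(r'(!)\1+', r'!', text)
    text = re.sub(r'(\?)\1+', r'?', text)
    return ' '.join(text.split())

sample = "A wonderful film!!! <br /><br />Didn't expect that, did you???"
print(clean_review(sample))
# a wonderful film! br br didnt expect that, did you?
print(clean_review(sample, strip_html=True))
# a wonderful film! didnt expect that, did you?
```

Without tag removal, each `<br />` survives as a spurious `br` token, which then competes for a slot in the vocabulary. Apostrophes are dropped in both cases, so contractions such as "didn't" become "didnt".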


## 4. Memory Optimization and Configuration

In [None]:
def estimate_system_resources():
    """Estimate system resources and recommend configuration"""
    print("=== System Resource Analysis ===")
    print(f"Total RAM: {psutil.virtual_memory().total / (1024**3):.1f} GB")
    print(f"Available RAM: {psutil.virtual_memory().available / (1024**3):.1f} GB")
    print(f"Current Python process memory: {psutil.Process().memory_info().rss / (1024**3):.1f} GB")
    
    # Memory estimation for different configurations
    embedding_dims = [200, 256, 300]
    max_features_options = [10000, 15000, 20000]
    
    print("\n=== Memory Estimation for Model Configurations ===")
    for max_feat in max_features_options:
        for emb_dim in embedding_dims:
            embedding_memory = max_feat * emb_dim * 4 / (1024**2)  # 4 bytes per float32
            print(f"MAX_FEATURES={max_feat:,}, EMB_DIM={emb_dim}: ~{embedding_memory:.1f} MB for embeddings")
    
    return psutil.virtual_memory().available / (1024**3)

def get_optimal_config(available_gb):
    """Get optimal configuration based on available memory"""
    if available_gb < 4:
        config = {
            'MAX_FEATURES': 10000,
            'MAX_LENGTH': 250,
            'BATCH_SIZE': 32,
            'EPOCHS': 12,
            'EMBEDDING_DIM': 200,
            'description': 'Conservative (for <4GB RAM)'
        }
    elif available_gb < 8:
        config = {
            'MAX_FEATURES': 15000,
            'MAX_LENGTH': 300,
            'BATCH_SIZE': 64,
            'EPOCHS': 15,
            'EMBEDDING_DIM': 256,
            'description': 'Moderate (for <8GB RAM)'
        }
    else:
        config = {
            'MAX_FEATURES': 20000,
            'MAX_LENGTH': 300,
            'BATCH_SIZE': 128,
            'EPOCHS': 20,
            'EMBEDDING_DIM': 300,
            'description': 'Full (for >=8GB RAM)'
        }
    
    print(f"\n=== Selected Configuration: {config['description']} ===")
    for key, value in config.items():
        if key != 'description':
            print(f"{key}: {value}")
    
    return config

# Analyze system and get optimal configuration
available_gb = estimate_system_resources()
config = get_optimal_config(available_gb)

# Set global configuration variables
MAX_FEATURES = config['MAX_FEATURES']
MAX_LENGTH = config['MAX_LENGTH']
BATCH_SIZE = config['BATCH_SIZE']
EPOCHS = config['EPOCHS']
EMBEDDING_DIM = config['EMBEDDING_DIM']

System Information:
Total RAM: 13.9 GB
Available RAM: 4.8 GB
Python process memory: 0.6 GB

Memory Estimation (approximate):
MAX_FEATURES=10000, EMB_DIM=256: ~9.8 MB for embeddings
MAX_FEATURES=10000, EMB_DIM=300: ~11.4 MB for embeddings
MAX_FEATURES=15000, EMB_DIM=256: ~14.6 MB for embeddings
MAX_FEATURES=15000, EMB_DIM=300: ~17.2 MB for embeddings
MAX_FEATURES=20000, EMB_DIM=256: ~19.5 MB for embeddings
MAX_FEATURES=20000, EMB_DIM=300: ~22.9 MB for embeddings

Using MAX_FEATURES = 15000 (Moderate for <8GB RAM)
Average review length: 231 tokens
Median review length: 173 tokens
90th percentile length: 452 tokens
Tokenization completed!
Vocabulary size: 145314
X shape: (100000, 300)
y shape: (100000,)
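The embedding estimate is simple arithmetic, and the same formula can be applied to the padded input matrix, which is usually the larger of the two. The sketch below reproduces the estimate for the configurations listed above; the function names are hypothetical, and the 4-byte integer width for the padded matrix is an assumption based on the default `int32` output of `pad_sequences`.

```python
def embedding_megabytes(max_features, embedding_dim, bytes_per_value=4):
    """Size of a float32 embedding matrix in MiB (weights only)."""
    return max_features * embedding_dim * bytes_per_value / (1024 ** 2)

def padded_matrix_megabytes(n_samples, max_length, bytes_per_value=4):
    """Size of the padded integer sequence matrix in MiB (int32 assumed)."""
    return n_samples * max_length * bytes_per_value / (1024 ** 2)

print(f"{embedding_megabytes(15000, 256):.1f} MB for embeddings")
# 14.6 MB for embeddings
print(f"{padded_matrix_megabytes(100000, 300):.1f} MB for padded inputs")
# 114.4 MB for padded inputs
```

These figures cover stored values only. Activations, gradients, and optimizer state during training add to them, so the numbers are best read as a lower bound.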


## 5. Text Tokenization and Sequence Processing

In [None]:
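# Initialize tokenizer
# Note: Tokenizer's default `filters` strips punctuation, including ! ? . and ,
# so the marks kept during preprocessing do not become tokens here.
print("=== Tokenization Process ===")
tokenizer = Tokenizer(num_words=MAX_FEATURES, oov_token="<OOV>")
tokenizer.fit_on_texts(df['review'])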

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(df['review'])

# Analyze sequence lengths
lengths = [len(seq) for seq in sequences]
print(f"Sequence length statistics:")
print(f"  Average: {np.mean(lengths):.0f} tokens")
print(f"  Median: {np.median(lengths):.0f} tokens")
print(f"  90th percentile: {np.percentile(lengths, 90):.0f} tokens")
print(f"  95th percentile: {np.percentile(lengths, 95):.0f} tokens")

# Pad sequences
X = pad_sequences(sequences, maxlen=MAX_LENGTH, padding='post', truncating='post')
y = df['sentiment'].values

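print("\n=== Tokenization Results ===")
# word_index holds every unique token; sequences only use indices below MAX_FEATURES
print(f"Unique tokens in corpus: {len(tokenizer.word_index):,} (sequences use indices below {MAX_FEATURES:,})")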
print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
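With `padding='post'` and `truncating='post'`, short reviews get zeros appended and long reviews lose their tail. The plain-Python sketch below mirrors that behaviour for a single sequence and measures how many sequences a given `MAX_LENGTH` would cut. The function names and sample lengths are illustrative, and this is not the Keras implementation.

```python
def pad_or_truncate_post(sequence, max_length, pad_value=0):
    """Post-truncate, then post-pad, one sequence of token ids."""
    trimmed = sequence[:max_length]  # truncating='post' drops the tail
    return trimmed + [pad_value] * (max_length - len(trimmed))  # padding='post' appends zeros

def share_truncated(lengths, max_length):
    """Fraction of sequences longer than max_length."""
    return sum(1 for n in lengths if n > max_length) / len(lengths)

print(pad_or_truncate_post([5, 9, 2], 5))              # [5, 9, 2, 0, 0]
print(pad_or_truncate_post([5, 9, 2, 7, 1, 4], 4))     # [5, 9, 2, 7]
print(share_truncated([120, 173, 231, 452, 610], 300)) # 0.4
```

Running `share_truncated(lengths, MAX_LENGTH)` on the real `lengths` list shows how much text is lost. If reviewers tend to state their verdict at the end, `truncating='pre'` may be worth comparing.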

## 6. Data Splitting

In [None]:
# Split the data with stratification to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"=== Data Split Results ===")
print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")
print(f"Training set positive ratio: {y_train.mean():.3f}")
print(f"Test set positive ratio: {y_test.mean():.3f}")

Training set size: 80000
Test set size: 20000
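Stratification means each class is split at the same ratio, so the positive share is the same in both partitions. The sketch below illustrates the idea with the standard library only; `stratified_split` is a hypothetical helper written for illustration, not scikit-learn's algorithm, and the toy labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split at the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))                     # 80 20
print(sum(labels[i] for i in test_idx) / len(test_idx))  # 0.5
```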


## 7. Model Architecture Definitions

In [None]:
def create_cnn_model():
    """Create CNN model optimized for current system configuration"""
    model = Sequential([
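        # Declare the input shape explicitly (`input_length` is deprecated in recent Keras)
        tf.keras.Input(shape=(MAX_LENGTH,)),
        Embedding(input_dim=MAX_FEATURES, output_dim=EMBEDDING_DIM),
        
        # Stacked Conv1D blocks with growing kernel sizes (3, 5, 7)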
        tf.keras.layers.Conv1D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling1D(2),
        
        tf.keras.layers.Conv1D(128, 5, activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling1D(2),
        
        tf.keras.layers.Conv1D(256, 7, activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.GlobalMaxPooling1D(),
        
        # Dense layers
        Dense(128, activation='relu'),
        tf.keras.layers.BatchNormalization(),
        Dropout(0.3),
        Dense(64, activation='relu'),
        Dropout(0.2),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        # Name the metric explicitly so the history keys are always 'auc' / 'val_auc'
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
    )
    return model

def create_transformer_model():
    """Create Transformer model with multi-head attention"""
    inputs = tf.keras.Input(shape=(MAX_LENGTH,))
    
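    # Token embedding plus a learned linear projection of it.
    # Note: this Dense layer sees only the token embeddings, so it carries no
    # information about token position; it is not a positional encoding in the usual sense.
    embedding = Embedding(input_dim=MAX_FEATURES, output_dim=EMBEDDING_DIM)(inputs)
    position_encoding = tf.keras.layers.Dense(EMBEDDING_DIM, activation='linear')(embedding)
    embedding = tf.keras.layers.Add()([embedding, position_encoding])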
    
    # Multi-head attention layers
    attention_output = tf.keras.layers.MultiHeadAttention(
        num_heads=8, key_dim=32, dropout=0.1
    )(embedding, embedding)
    attention_output = tf.keras.layers.Add()([embedding, attention_output])
    attention_output = tf.keras.layers.LayerNormalization()(attention_output)
    
    # Feed-forward network
    ffn = tf.keras.layers.Dense(EMBEDDING_DIM * 2, activation='relu')(attention_output)
    ffn = tf.keras.layers.Dense(EMBEDDING_DIM)(ffn)
    ffn = tf.keras.layers.Dropout(0.1)(ffn)
    ffn_output = tf.keras.layers.Add()([attention_output, ffn])
    ffn_output = tf.keras.layers.LayerNormalization()(ffn_output)
    
    # Global pooling and classification
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(ffn_output)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(ffn_output)
    pooled = tf.keras.layers.Concatenate()([avg_pool, max_pool])
    
    x = tf.keras.layers.Dense(128, activation='relu')(pooled)
    x = tf.keras.layers.LayerNormalization()(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    x = tf.keras.layers.Dense(64, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01),
        loss='binary_crossentropy',
        # Name the metric explicitly so the history keys are always 'auc' / 'val_auc'
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
    )
    return model

print("Model architectures defined successfully!")
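A positional encoding has to depend on each token's position index, because self-attention on its own is order-agnostic. One standard fixed scheme is the sinusoidal encoding from the original Transformer paper, sketched below in plain Python. The function name is hypothetical and the table is not wired into the notebook's model. In practice it would be converted to an array and added to the token embeddings.

```python
import math

def sinusoidal_positions(max_length, embedding_dim):
    """Return a max_length x embedding_dim table of sinusoidal position encodings."""
    table = []
    for pos in range(max_length):
        row = []
        for i in range(embedding_dim):
            # Dimension pairs (2k, 2k+1) share one frequency
            angle = pos / (10000 ** (2 * (i // 2) / embedding_dim))
            # Even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

table = sinusoidal_positions(max_length=300, embedding_dim=256)
print(len(table), len(table[0]))  # 300 256
print(table[0][:4])               # [0.0, 1.0, 0.0, 1.0]
```

Every row is different, so two identical tokens at different positions receive different inputs. A learned alternative is a second embedding layer indexed by position.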

## 8. Training Configuration and Callbacks

In [None]:
def create_training_callbacks(model_name):
    """Create optimized callbacks for training"""
    callbacks = []
    
    # Early stopping
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=5,
        restore_best_weights=True,
        verbose=1
    )
    callbacks.append(early_stopping)
    
    # Learning rate reduction
    lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.2,
        patience=3,
        min_lr=0.00001,
        verbose=1
    )
    callbacks.append(lr_schedule)
    
    # Model checkpointing
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        f'best_{model_name}_model.keras',
        monitor='val_loss',
        save_best_only=True,
        verbose=1
    )
    callbacks.append(checkpoint)
    
    # Memory cleanup
    class MemoryCleanupCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            gc.collect()
    
    callbacks.append(MemoryCleanupCallback())
    
    return callbacks

print("Training callbacks configured successfully!")
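The two patience values interact: learning-rate reduction fires after 3 stagnant epochs and early stopping after 5. The sketch below replays a made-up validation-loss curve through a simplified version of that logic. It is an approximation for building intuition, not the Keras implementation (it ignores `min_delta` and cooldown, for instance), and `simulate_callbacks` is a hypothetical name.

```python
def simulate_callbacks(val_losses, es_patience=5, lr_patience=3,
                       lr=1e-3, factor=0.2, min_lr=1e-5):
    """Simplified replay of patience-based early stopping and LR reduction."""
    best = float("inf")
    best_epoch = None
    es_wait = 0
    lr_wait = 0
    lr_by_epoch = []
    for epoch, loss in enumerate(val_losses):
        lr_by_epoch.append(lr)  # learning rate in effect during this epoch
        if loss < best:
            # Improvement: remember it and reset both patience counters
            best, best_epoch = loss, epoch
            es_wait = lr_wait = 0
            continue
        es_wait += 1
        lr_wait += 1
        if lr_wait >= lr_patience:
            lr = max(lr * factor, min_lr)  # shrink the rate, never below min_lr
            lr_wait = 0
        if es_wait >= es_patience:
            return {"stopped_epoch": epoch, "best_epoch": best_epoch,
                    "lr_by_epoch": lr_by_epoch}
    return {"stopped_epoch": None, "best_epoch": best_epoch,
            "lr_by_epoch": lr_by_epoch}

result = simulate_callbacks([0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39, 0.40, 0.41])
print(result["stopped_epoch"], result["best_epoch"])  # 7 2
print(result["lr_by_epoch"])
```

In this replay the rate is cut once, two epochs before training stops, so the model gets only a short window at the lower rate. Raising the early-stopping patience relative to the learning-rate patience widens that window.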

## 9. Model Training - CNN

In [None]:
# Create and train CNN model
print("=== Training CNN Model ===")
cnn_model = create_cnn_model()
cnn_model.summary()

# Train the model
cnn_callbacks = create_training_callbacks('cnn')
history_cnn = cnn_model.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.1,  # hold out part of the training data; keep the test set for final evaluation
    callbacks=cnn_callbacks,
    verbose=1
)

print("CNN training completed!")

## 10. Model Training - Transformer

In [None]:
# Create and train Transformer model
print("=== Training Transformer Model ===")
transformer_model = create_transformer_model()
transformer_model.summary()

# Train the model
transformer_callbacks = create_training_callbacks('transformer')
history_transformer = transformer_model.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.1,  # hold out part of the training data; keep the test set for final evaluation
    callbacks=transformer_callbacks,
    verbose=1
)

print("Transformer training completed!")
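Batch size determines how many weight updates one epoch performs, which is worth knowing when comparing the three memory configurations. The arithmetic below uses the 80,000-sample training set from the split above. The helper name and its `validation_split` parameter are illustrative assumptions.

```python
import math

def steps_per_epoch(n_train, batch_size, validation_split=0.0):
    """Number of batches per epoch after an optional validation hold-out."""
    n_fit = math.floor(n_train * (1 - validation_split))
    return math.ceil(n_fit / batch_size)

for batch_size in (32, 64, 128):
    print(batch_size, steps_per_epoch(80000, batch_size))
# 32 2500
# 64 1250
# 128 625
print(steps_per_epoch(80000, 64, validation_split=0.1))  # 1125
```

Larger batches mean fewer, smoother updates per epoch, so a learning rate tuned for one batch size may not be ideal for another.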

## 11. Model Evaluation and Comparison

In [None]:
def evaluate_model(model, model_name, X_test, y_test):
    """Comprehensive model evaluation"""
    print(f"\n=== {model_name} Model Evaluation ===")
    
    # Basic metrics
    test_loss, test_accuracy, test_auc = model.evaluate(X_test, y_test, verbose=0)
    print(f"Test Loss: {test_loss:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Test AUC: {test_auc:.4f}")
    
    # Predictions and classification report
    # Flatten the (n, 1) prediction array so it matches the shape of y_test
    y_pred = (model.predict(X_test, verbose=0) > 0.5).astype(int).ravel()
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
    
    return test_accuracy, test_auc

# Evaluate both models
cnn_accuracy, cnn_auc = evaluate_model(cnn_model, "CNN", X_test, y_test)
transformer_accuracy, transformer_auc = evaluate_model(transformer_model, "Transformer", X_test, y_test)

# Model comparison
print(f"\n=== Model Comparison ===")
print(f"CNN - Accuracy: {cnn_accuracy:.4f}, AUC: {cnn_auc:.4f}")
print(f"Transformer - Accuracy: {transformer_accuracy:.4f}, AUC: {transformer_auc:.4f}")

# Select best model
if cnn_accuracy > transformer_accuracy:
    best_model = cnn_model
    best_model_name = "CNN"
    best_history = history_cnn
else:
    best_model = transformer_model
    best_model_name = "Transformer"
    best_history = history_transformer

print(f"Best performing model: {best_model_name}")
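The classification report is built from four confusion counts. The sketch below computes them by hand for a tiny made-up set of labels and predicted probabilities, using the same 0.5 threshold as `evaluate_model`. The function name and data are illustrative only.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Confusion counts plus precision, recall, and accuracy for binary labels."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }

y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
metrics = binary_metrics(y_true, y_prob)
print(metrics["tp"], metrics["fp"], metrics["fn"], metrics["tn"])  # 2 1 1 2
print(round(metrics["precision"], 3), round(metrics["recall"], 3))  # 0.667 0.667
```

On a balanced dataset like this one, accuracy is a reasonable headline metric. Precision and recall per class show whether errors lean towards false positives or false negatives.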

## 12. Training Visualization

In [None]:
def plot_training_history(history, model_name):
    """Plot training history"""
    plt.figure(figsize=(15, 5))
    
    # Accuracy plot
    plt.subplot(1, 3, 1)
    plt.plot(history.history['accuracy'], label='Training Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.title(f'{model_name} - Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)
    
    # Loss plot
    plt.subplot(1, 3, 2)
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title(f'{model_name} - Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)
    
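    # AUC plot (look the key up, since Keras may name the metric 'auc', 'auc_1', ...)
    plt.subplot(1, 3, 3)
    auc_key = next(k for k in history.history if k.startswith('auc'))
    plt.plot(history.history[auc_key], label='Training AUC')
    plt.plot(history.history[f'val_{auc_key}'], label='Validation AUC')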
    plt.title(f'{model_name} - AUC')
    plt.xlabel('Epoch')
    plt.ylabel('AUC')
    plt.legend()
    plt.grid(True)
    
    plt.tight_layout()
    plt.show()

# Plot training histories
plot_training_history(history_cnn, "CNN")
plot_training_history(history_transformer, "Transformer")

## 13. Prediction Function

In [None]:
def predict_sentiment(text, model=best_model):
    """Predict sentiment of a new review"""
    # Preprocess the text
    cleaned_text = enhanced_preprocess_text(text)
    
    # Convert to sequence
    sequence = tokenizer.texts_to_sequences([cleaned_text])
    padded_sequence = pad_sequences(sequence, maxlen=MAX_LENGTH, padding='post', truncating='post')
    
    # Make prediction (cast the NumPy float32 to a plain Python float)
    prediction = float(model.predict(padded_sequence, verbose=0)[0][0])
    
    sentiment = "Positive" if prediction > 0.5 else "Negative"
    confidence = prediction if prediction > 0.5 else 1 - prediction
    
    return sentiment, confidence

# Test with sample reviews
sample_reviews = [
    "This movie was absolutely fantastic! Great acting and storyline.",
    "Terrible movie. Boring plot and bad acting.",
    "The movie was okay, nothing special but not bad either.",
    "Amazing cinematography and brilliant performances by all actors!",
    "Worst movie I've ever seen. Complete waste of time."
]

print(f"\n=== Testing Sentiment Prediction ({best_model_name} Model) ===")
for i, review in enumerate(sample_reviews, 1):
    sentiment, confidence = predict_sentiment(review)
    print(f"{i}. Review: '{review}'")
    print(f"   Predicted: {sentiment} (Confidence: {confidence:.3f})")
    print()
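The confidence value returned by `predict_sentiment` is the sigmoid output folded around the threshold, so at a 0.5 threshold it always lies between 0.5 and 1.0. The standalone sketch below isolates that step. The function name and the probabilities are made up for illustration.

```python
def label_and_confidence(probability, threshold=0.5):
    """Map a sigmoid output to a label and the probability of that label."""
    positive = probability > threshold
    label = "Positive" if positive else "Negative"
    confidence = probability if positive else 1.0 - probability
    return label, confidence

for p in (0.93, 0.50, 0.12):
    label, confidence = label_and_confidence(p)
    print(f"{p:.2f} -> {label} ({confidence:.2f})")
# 0.93 -> Positive (0.93)
# 0.50 -> Negative (0.50)
# 0.12 -> Negative (0.88)
```

A confidence close to 0.5, as for a lukewarm review, means the model is near its decision boundary, so the label should be treated as uncertain rather than as a clear verdict.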

## 14. Model and Configuration Saving

In [None]:
# Save the best model
print("=== Saving Models and Configuration ===")

# Save both models
cnn_model.save('sentiment_cnn_model.keras')
transformer_model.save('sentiment_transformer_model.keras')
best_model.save('best_sentiment_model.keras')

print(f"✓ CNN model saved as 'sentiment_cnn_model.keras'")
print(f"✓ Transformer model saved as 'sentiment_transformer_model.keras'")
print(f"✓ Best model ({best_model_name}) saved as 'best_sentiment_model.keras'")

# Save the tokenizer
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)
print("✓ Tokenizer saved as 'tokenizer.pkl'")

# Save all configuration parameters
model_config = {
    'MAX_FEATURES': MAX_FEATURES,
    'MAX_LENGTH': MAX_LENGTH,
    'EMBEDDING_DIM': EMBEDDING_DIM,
    'BATCH_SIZE': BATCH_SIZE,
    'EPOCHS': EPOCHS,
    'best_model_name': best_model_name,
    'best_model_accuracy': float(cnn_accuracy if best_model_name == "CNN" else transformer_accuracy),
    'system_config': config['description']
}

with open('model_config.pkl', 'wb') as f:
    pickle.dump(model_config, f)
print("✓ Model configuration saved as 'model_config.pkl'")

print("\n=== Summary ===")
print(f"Best model: {best_model_name}")
print(f"Final accuracy: {model_config['best_model_accuracy']:.4f}")
print(f"Configuration: {model_config['system_config']}")
print("All files saved successfully!")
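At inference time the saved configuration matters as much as the weights, because new text must be padded to the same `MAX_LENGTH` the model was trained with. The sketch below shows the pickle round trip for a configuration dictionary in a temporary directory. The values are the "Moderate" settings from the configuration step, and the temporary path is illustrative.

```python
import os
import pickle
import tempfile

# Subset of the saved configuration, using the "Moderate" values
model_config = {"MAX_FEATURES": 15000, "MAX_LENGTH": 300, "EMBEDDING_DIM": 256}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_config.pkl")
    with open(path, "wb") as f:
        pickle.dump(model_config, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_config)  # True
print(restored["MAX_LENGTH"])    # 300
```

Pickle files can execute code when loaded, so they should only be read from trusted sources. For a plain dictionary like this one, JSON would be a more portable alternative.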

## 15. Final Notes

This notebook has successfully:
- ✅ Loaded and preprocessed the sentiment analysis dataset
- ✅ Optimized configuration based on system resources
- ✅ Created and trained both CNN and Transformer models
- ✅ Evaluated model performance with comprehensive metrics
- ✅ Implemented prediction functions for new text
- ✅ Saved all models and configurations for future use

The saved models, tokenizer, and configuration can be reloaded from these files to make predictions on new movie reviews.