**Copyright: © NexStream Technical Education, LLC**.  
All rights reserved

##Combined BERT Simplified with Classifiers    

In this project, you will utilize pre-trained BERT embeddings and traditional Machine Learning models to implement a Sentiment Analyzer.   
Specifically, you will:   
- Explore how pre-trained BERT embeddings can be used with traditional machine learning models
- Compare transformer-based embedding extraction with traditional NLP embedding methods
- Implement a sentiment analysis pipeline using BERT embeddings and logistic regression
- Evaluate model performance and interpret results using appropriate metrics

Follow the instructions in the code cells to complete and test your code. You will replace all triple underscores (___) with your code. Please refer to the lecture slides for details on each of the functions/algorithms and hints on the implementation.   

<br>

**NOTE - IF USING COLAB, YOU SHOULD SET YOUR RUNTIME TO USE A GPU FOR THIS NOTEBOOK!!!**
- Select Runtime - Change runtime type, then select a GPU option
- Note that sometimes the free version will prohibit you from using the GPU resources during peak use times

**Introduction to Transfer Learning with Transformers**   

Pre-trained Language Models  
Pre-trained transformer models like BERT (Bidirectional Encoder Representations from Transformers) are trained on massive text corpora to learn contextual representations of language, which can then be leveraged for downstream tasks without having to train a deep learning model from scratch.

<br>

BERT uses the transformer encoder architecture, which consists of:
- Multi-head self-attention mechanisms: Allow the model to focus on different parts of the input sequence
- Feed-forward neural networks: Process the attended information
- Layer normalization and residual connections: Facilitate training of deep networks
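
To make the self-attention bullet concrete, here is a minimal sketch of single-head scaled dot-product attention on toy vectors. It is an illustration only: the function names are hypothetical, and the learned query/key/value projections, multiple heads, feed-forward layers, and normalization of a real encoder layer are all omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head self-attention with queries = keys = values = the inputs.

    A real encoder layer applies learned projections first; they are
    left out here to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Scaled dot product of this token's query with every token's key.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, vectors))
                        for j in range(dim)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
attended, attention_weights = self_attention(tokens)
for row in attention_weights:
    print([round(w, 3) for w in row])
```

Each row of `attention_weights` sums to 1, so every output vector is a mixture of all input tokens, which is what lets one position draw on the whole sequence.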

<br>

Transfer Learning Approaches with BERT   
There are two primary approaches to using BERT for downstream tasks:
- Fine-tuning: Update all or part of BERT's parameters on a target task
- Feature extraction via pre-trained models: Use BERT as a fixed feature extractor and train a separate model (e.g. classifier) on these features

<br>

In this project, we'll focus on the feature extraction approach, which is more computationally efficient than fine-tuning because BERT's weights stay frozen, and which often works well when paired with traditional machine learning models.


###Step 1.  Install and import the required libraries   

No coding needed - simply run this cell to set up your environment.  




In [None]:
#Install and Import Required Libraries
!pip install transformers scikit-learn torch

import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel
import torch
import warnings
warnings.filterwarnings('ignore')

# Configure TensorFlow to handle GPU memory properly
try:
    # Configure GPU memory growth to avoid allocation issues
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print(f"GPU Available: {len(gpus)} GPU(s) configured with memory growth")
    else:
        print("No GPU available, using CPU")
except Exception as e:
    print(f"GPU configuration warning: {e}")
    print("Continuing with default configuration")

print("PyTorch CUDA Available: ", torch.cuda.is_available())

###Step 2:  Load BERT tokenizer  

Load the Hugging Face BERT tokenizer using a BERT base pre-trained, uncased model.  
Reference links:
- https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertTokenizer  
- https://huggingface.co/google-bert/bert-base-uncased

In [None]:
# Load the BERT tokenizer for the pre-trained, uncased BERT base model

print("Loading BERT tokenizer...")
tokenizer = ___

###Step 3: BERT Embedding Extraction Function   

Preparing text for the pre-trained BERT model involves the following steps:  
- Tokenization using WordPiece (breaking words into subword units)
- Adding special tokens: [CLS] at the beginning and [SEP] at the end
- Converting tokens to IDs using BERT's vocabulary
- Creating attention masks to handle padding
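
The four steps above can be sketched with a toy tokenizer. Everything here is an assumption made for illustration: `TOY_VOCAB`, `wordpiece`, and `encode` are hypothetical names, the vocabulary has eight entries instead of BERT's roughly 30,000, and the real Hugging Face tokenizer handles many details (punctuation, normalization, whole-word unknowns) that this sketch ignores.

```python
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "play": 4, "##ing": 5, "great": 6, "movie": 7}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched at this position
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_length):
    """Tokenize, add special tokens, map to IDs, pad, and build the mask."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_length - 1] + ["[SEP]"]  # truncate, then close
    input_ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)          # 1 = real token
    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding                # 0 = padding
    return tokens, input_ids, attention_mask

tokens, input_ids, attention_mask = encode("Great movie playing", TOY_VOCAB, max_length=8)
print(tokens)          # ['[CLS]', 'great', 'movie', 'play', '##ing', '[SEP]']
print(input_ids)       # [2, 6, 7, 4, 5, 3, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The attention mask tells the model which positions hold real tokens, so the two padding positions at the end do not influence the result.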

<br>

Notes on the [CLS] token:
The [CLS] token is a special token added to the beginning of each input sequence. BERT is trained so that the final hidden state corresponding to this token serves as an aggregate representation of the entire sequence, which makes it a natural input for classification tasks.  
When BERT processes text, it works in three key ways that enable the [CLS] token to represent the full sequence:
- Bidirectional Context: Unlike earlier models that processed text left to right, BERT's transformer architecture allows each word to "see" all other words in the sequence. This means the [CLS] token's representation is influenced by every other word in the text.
- Self-Attention Mechanism: BERT uses attention mechanisms that allow it to focus on different parts of the input sequence when creating each token's representation. The [CLS] token's representation is shaped by these attention patterns across the entire sequence.
- Special Pre-training: During BERT's pre-training, the [CLS] token was specifically trained to serve as an aggregate representation for classification tasks through a "next sentence prediction" objective. This trained the model to pack sentence-level information into this token.
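
As a shape-level illustration of "the final hidden state corresponding to this token", the sketch below uses a toy nested list in place of a real model output. The numbers and the hidden size of 4 are made up; for BERT base the hidden size is 768.

```python
# Toy "final hidden states" with shape (batch=2, seq_len=3, hidden=4).
# Position 0 of every sequence is where the [CLS] token sits.
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Keep only the first position of each sequence: one vector per input text.
cls_vectors = [sequence[0] for sequence in last_hidden_state]

print(len(cls_vectors), len(cls_vectors[0]))  # 2 4
```

The result has one row per text and one column per hidden dimension, which is the `(n_texts, 768)` layout a downstream classifier expects when the hidden size is 768.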

In this step, you will write the function:  

    def extract_bert_embeddings(texts, max_length=128, batch_size=32):

    Parameters:
    -----------
    texts : list
        List of texts to encode
    max_length : int
        Maximum sequence length for tokenization
    batch_size : int
        Batch size for processing to manage memory

    Returns:
    --------
    numpy.ndarray
        Array of BERT embeddings with shape (n_texts, 768)

Reference links:
- Initialize PyTorch BERT model:  https://huggingface.co/docs/transformers/v4.50.0/en/model_doc/bert#transformers.BertModel
- Note:  Set to evaluation mode using Model.eval()
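
The `batch_size` parameter exists so that a long list of texts is processed in manageable chunks. A minimal sketch of that slicing pattern, using a hypothetical helper `iter_batches` and placeholder strings instead of real reviews:

```python
def iter_batches(items, batch_size):
    """Yield (start_index, batch) pairs that cover items in order."""
    for start in range(0, len(items), batch_size):
        # The last slice may be shorter than batch_size.
        yield start, items[start:start + batch_size]

reviews = [f"review {n}" for n in range(7)]
batches = list(iter_batches(reviews, batch_size=3))
for start, batch in batches:
    print(start, batch)
```

Seven items with a batch size of 3 give batches starting at indices 0, 3, and 6, and the final batch holds the single leftover item.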

In [None]:
def extract_bert_embeddings(texts, max_length=128, batch_size=32):
    """
    Extract BERT embeddings from text inputs using PyTorch BERT.

    Parameters:
    -----------
    texts : list
        List of texts to encode
    max_length : int
        Maximum sequence length for tokenization
    batch_size : int
        Batch size for processing to manage memory

    Returns:
    --------
    numpy.ndarray
        Array of BERT embeddings with shape (n_texts, 768)
    """
    # Load PyTorch BERT model (more stable than TensorFlow version)
    # Set the model to use the BERT base model, uncased.
    embedding_model = ___

    # Set to evaluation mode so repeated runs give consistent embeddings.
    # Execute the following line (no coding needed).
    # eval() disables dropout; BERT uses layer normalization rather than batch normalization.
    embedding_model.eval()

    # Move model to GPU if available
    # Execute the following lines (no coding needed)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    embedding_model = embedding_model.to(device)

    # Function to decode review if it's in list format (from keras dataset)
    # Execute the following function (no coding needed)
    def decode_review(text):
        if not isinstance(text, list):
            return text
        # If you're using the IMDB dataset from keras
        try:
            word_index = keras.datasets.imdb.get_word_index()
            reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
            return ' '.join([reverse_word_index.get(i - 3, '?') for i in text])
        except Exception:
            return "Unable to decode review"

    # Process texts in batches to avoid memory issues
    # Create empty list to store all embeddings
    all_embeddings = ___


    # Disable gradient computation for efficiency.
    # Execute the following line (no coding needed)
    with torch.no_grad():

        # Loop over the length of the input texts in steps of the input batch size
        for i in range(0, ___, ___):

            # Slice the batch of texts
            batch_texts = ___

            # Create an empty list to store the embeddings.
            batch_embeddings = ___

            # Loop over the batch of texts
            for text in ___:

                # Call the decode_review function and pass the current text
                decoded_text = ___

                # Convert the string to a sequence of integer token IDs using the tokenizer and its vocabulary.
                # The 'tokenizer' reference was set up in Step 2.
                # Hint: see the text cells for this step for reference API
                # Pass decoded_text as the first sequence to be encoded
                # Set max_length to the function's max_length argument and padding to 'max_length'
                # Set truncation to True
                # Set return_tensors to 'pt' for PyTorch
                inputs = ___(
                         ___,
                         ___,
                         ___,
                         ___,
                         ___
                )

                # Move inputs to same device as model
                # Execute the following line (no coding needed)
                inputs = {key: val.to(device) for key, val in inputs.items()}

                # Extract embeddings from the embedding_model.
                # Feed the input_ids and attention_mask from 'inputs' to the BertModel.
                # Hint:  do this by unpacking the 'inputs' dictionary defined above into the model call.
                outputs = ___

                # Get the [CLS] token embedding (the first token of the last hidden state)
                # Hint:  move the tensor to the CPU and convert it to a NumPy array
                cls_embedding = ___
                batch_embeddings.___

            all_embeddings.extend(batch_embeddings)

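            # Report progress every 10 batches (i is always a multiple of batch_size,
            # so a check such as i % 500 == 0 would almost never fire)
            if (i // batch_size) % 10 == 0: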
                print(f"Processed {i}/{len(texts)} examples")

    return np.array(all_embeddings)

###Step 4:  Classifier Building Functions

####Sequential Neural Network Classifier

    Build a sequential neural network classifier for BERT embeddings.

    - Dense Layer: 256 units with ReLU activation (input dimension = dimension of the BERT base embeddings)
    - First Dropout Layer: Helps prevent overfitting (rate=0.2)
    - Dense Layer: 64 units with ReLU activation
    - Second Dropout Layer: Additional regularization (rate=0.2)
    - Classification Layer: 1 unit with sigmoid activation for binary classification

    Compile the model with:
    - Optimizer: Adam
    - Loss: Binary Cross-Entropy
    - Metrics: Accuracy

    Parameters:
    -----------
    input_dim : int
        Dimension of the input embeddings (768 for BERT base)

    Returns:
    --------
    keras.Model
        The sequential neural network model

Reference links:
- https://keras.io/api/models/sequential/
- https://keras.io/api/layers/core_layers/dense/
- https://keras.io/api/models/model_training_apis/
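
To get a feel for the size of the network specified above, the arithmetic below counts its trainable parameters. The helper `dense_params` is a hypothetical name; the layer sizes come from the specification, and dropout layers contribute no parameters.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

# (inputs, units) for the three dense layers: 768 -> 256 -> 64 -> 1
layer_sizes = [(768, 256), (256, 64), (64, 1)]
per_layer = [dense_params(i, u) for i, u in layer_sizes]
total = sum(per_layer)

print(per_layer)  # [196864, 16448, 65]
print(total)      # 213377
```

Almost all of the roughly 213,000 parameters sit in the first layer, because it maps the 768-dimensional embedding down to 256 units. Once your model is built, its `summary()` output should report matching counts.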

In [None]:
def build_sequential_neural_network(input_dim=768):
    """
      Build a sequential neural network classifier for BERT embeddings.
      - Dense Layer: 256 units with ReLU activation (input dimension = dimension of the BERT base embeddings)
      - First Dropout Layer: Helps prevent overfitting (rate=0.2)
      - Dense Layer: 64 units with ReLU activation
      - Second Dropout Layer: Additional regularization (rate=0.2)
      - Classification Layer: 1 unit with sigmoid activation for binary classification

      Compile the model with:
      - Optimizer: Adam
      - Loss: Binary Cross-Entropy
      - Metrics: Accuracy

      Parameters:
      -----------
      input_dim : int
          Dimension of the input embeddings (768 for BERT base)

      Returns:
      --------
      keras.Model
          The sequential neural network model
    """

    # Build the Sequential network according to the specifications above
    model = ___([
                  ___,
                  ___,
                  ___,
                  ___,
                  ___
    ])

    # Compile the model according to the specifications above
    model.___

    return model

####Logistic Regression Classifier   

    """
    Build a logistic regression classifier for BERT embeddings.

    Parameters:
    -----------
    C : float
        Inverse of regularization strength
    max_iter : int
        Maximum number of iterations for solver

    Returns:
    --------
    LogisticRegression
        The logistic regression model
    """

Reference links:
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
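
Because `C` is the inverse of the regularization strength, smaller values shrink the weights more. The sketch below illustrates this on four one-dimensional points with a hand-written gradient descent. It is a toy under stated assumptions, not scikit-learn's solver: `fit_1d_logistic`, the data, the learning rate, and the step count are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_1d_logistic(xs, ys, C, lr=0.01, steps=2000):
    """Gradient descent on sum(log-loss) + w**2 / (2 * C) for one feature."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Data gradient plus the L2 penalty gradient w / C.
        grad_w = sum(e * x for e, x in zip(errors, xs)) + w / C
        grad_b = sum(errors)  # the bias is not regularized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
for C in (0.01, 1.0, 100.0):
    w, _ = fit_1d_logistic(xs, ys, C)
    print(f"C={C}: w={w:.3f}")
```

With `C=0.01` the penalty dominates and the weight stays close to zero, while with `C=100.0` the weight grows much larger because the data term dominates.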

In [None]:
def build_logistic_regression_classifier(C=1.0, max_iter=1000):
    """
    Build a logistic regression classifier for BERT embeddings.

    Parameters:
    -----------
    C : float
        Inverse of regularization strength
    max_iter : int
        Maximum number of iterations for solver

    Returns:
    --------
    LogisticRegression
        The logistic regression model
    """
    # Create the logistic regression model according to the specifications above
    # Set the random state to 42 for reproducibility.
    # Set the inverse regularization strength (C) and the max number of iterations according to the inputs
    # Hint:  set n_jobs=-1 to utilize all available cores
    lr_model = ___(___, ___, random_state=42, ___)
    return lr_model

###Step 5:  Training and Evaluation Functions   

In this step you will create models for the classifiers (Sequential Neural Network and Logistic Regression).  
Then you will train the model, make predictions and evaluate the model performance.
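
The evaluation code in this step prints a classification report and a confusion matrix. As a reminder of what those numbers mean, here is a minimal sketch that computes them by hand for a binary problem. The function name `binary_report` and the labels are made up; scikit-learn's `classification_report` and `confusion_matrix` are what the notebook actually uses.

```python
def binary_report(y_true, y_pred):
    """Confusion counts and the metrics derived from them (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        # Rows are actual classes, columns are predicted classes.
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
report = binary_report(y_true, y_pred)
print(report)
```

In this toy case one positive review is missed and one negative review is flagged as positive, so accuracy, precision, recall, and F1 all come out to 0.75.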

####Sequential Neural Network   

    """
    Train and evaluate a neural network classifier on BERT embeddings.

    Parameters:
    -----------
    train_data : numpy.ndarray
        Training data embeddings
    train_labels : numpy.ndarray
        Training labels
    test_data : numpy.ndarray
        Test data embeddings
    test_labels : numpy.ndarray
        Test labels
    epochs : int
        Number of training epochs
    batch_size : int
        Batch size for training

    Returns:
    --------
    tuple
        (model, history) - the trained model and training history
    """

Reference links:
- https://keras.io/api/callbacks/early_stopping/
- https://keras.io/api/models/model_training_apis/
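
The early stopping behavior used in this step (stop after `patience` epochs without improvement, then restore the best weights) can be simulated on a list of validation losses. This is a simplified sketch, not the Keras implementation: `simulate_early_stopping` and the loss values are hypothetical.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best loss).

    Epochs are 0-indexed. Training stops once `patience` consecutive
    epochs have failed to improve on the best validation loss so far.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.50]
stopped_at, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(stopped_at, best_epoch)  # 5 2
```

Training stops at epoch 5 and the weights from epoch 2 would be restored. The lower loss at epoch 6 is never reached, which is the trade-off a small patience value makes.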

In [None]:
def train_and_evaluate_neural_network(train_data, train_labels, test_data, test_labels, epochs=10, batch_size=32):
    """
    Train and evaluate a neural network classifier on BERT embeddings.

    Parameters:
    -----------
    train_data : numpy.ndarray
        Training data embeddings
    train_labels : numpy.ndarray
        Training labels
    test_data : numpy.ndarray
        Test data embeddings
    test_labels : numpy.ndarray
        Test labels
    epochs : int
        Number of training epochs
    batch_size : int
        Batch size for training

    Returns:
    --------
    tuple
        (model, history) - the trained model and training history
    """
    # Force CPU usage to avoid device conflicts
    print("Using CPU for neural network training to avoid GPU compatibility issues...")

    with tf.device('/CPU:0'):
        # Build model on CPU
        # Call your build function and pass the input dimension as the number of embedding features (columns of train_data)
        model = ___

        # Enable the early stopping callback.
        # Set the patience to 3 (Number of epochs with no improvement after which training will be stopped).
        # Set the callback to restore the model weights from the epoch with the best value of the monitored quantity.
        # Hint: https://keras.io/api/callbacks/early_stopping/
        early_stopping = ___

        # Train model
        # Pass in train_data, train_labels
        # Set the train/validation split to 90/10%
        # Set the number of epochs to the function input parameter
        # Set the batch size to the function input parameter
        # Set the early stop callback (parameter set up previously)
        # Set the verbose flag to '1'
        # Hint: https://keras.io/api/models/model_training_apis/
        # Hint: call the fit method with the parameters defined above.
        history = model.___(
                        ___,
                        ___,
                        ___,
                        ___,
                        ___,
                        ___
        )

        # Evaluate model
        # Hint: https://keras.io/api/models/model_training_apis/
        # Hint: call the evaluate method
        test_loss, test_accuracy = ___
        print(f"Test Accuracy: {test_accuracy:.4f}")

        # Make predictions
        # Hint: https://keras.io/api/models/model_training_apis/
        # Hint: call the predict method
        predictions = ___

        # Convert the predicted probabilities to class labels: 1 if the prediction is > 0.5, otherwise 0
        # Hint: you'll need to flatten() the labels after thresholding
        predicted_labels = ___


    # Generate a classification report.
    # Execute the following lines (no coding needed).
    print("\nClassification Report:")
    print(classification_report(test_labels, predicted_labels, target_names=['Negative', 'Positive']))

    # Confusion matrix
    cm = confusion_matrix(test_labels, predicted_labels)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix - Neural Network')
    plt.show()

    # Plot training history
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Neural Network Accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='upper left')

    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Neural Network Loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='upper right')
    plt.tight_layout()
    plt.show()

    return model, history


####Logistic Regression   

    """
    Train and evaluate a logistic regression classifier on BERT embeddings.

    Parameters:
    -----------
    train_data : numpy.ndarray
        Training data embeddings
    train_labels : numpy.ndarray
        Training labels
    test_data : numpy.ndarray
        Test data embeddings
    test_labels : numpy.ndarray
        Test labels

    Returns:
    --------
    LogisticRegression
        The trained logistic regression model
    """

Reference links:
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
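
The positive-class probabilities from `predict_proba` feed the ROC curve plotted in this step. One way to read the area under that curve: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way on four made-up scores; `roc_auc_pairwise` is a hypothetical helper, and the notebook itself uses `roc_curve` and `auc` from scikit-learn.

```python
def roc_auc_pairwise(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class
print(roc_auc_pairwise(labels, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.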

In [None]:
def train_and_evaluate_logistic_regression(train_data, train_labels, test_data, test_labels):
    """
    Train and evaluate a logistic regression classifier on BERT embeddings.

    Parameters:
    -----------
    train_data : numpy.ndarray
        Training data embeddings
    train_labels : numpy.ndarray
        Training labels
    test_data : numpy.ndarray
        Test data embeddings
    test_labels : numpy.ndarray
        Test labels

    Returns:
    --------
    LogisticRegression
        The trained logistic regression model
    """
    # Build the logistic regression model
    # Call your build function
    model = ___

    # Train model
    # Pass in train_data, train_labels
    # Hint: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
    print("Training logistic regression model...")
    model.___

    # Evaluate model
    # Hint: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score
    train_accuracy = ___
    test_accuracy = ___

    print(f"Training Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")

    # Make the following predictions:
    #     1. predicted class labels for the test data (test_data)
    #     2. probability estimates for each class for the test data (store the probability of a positive class only)
    # Hint:  https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict
    # Hint:  https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
    predictions = ___
    predicted_proba = ___  # Probability of positive class

    # Generate a classification report
    # Execute the following lines (no coding needed)
    print("\nClassification Report:")
    print(classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))

    # Confusion matrix
    cm = confusion_matrix(test_labels, predictions)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix - Logistic Regression')
    plt.show()

    # Feature importance analysis (coefficients)
    plt.figure(figsize=(10, 6))
    # Sort coefficients and plot top 20 most influential
    coef = model.coef_[0]
    top_positive_idx = np.argsort(coef)[-20:]
    top_negative_idx = np.argsort(coef)[:20]

    plt.barh(range(20), coef[top_positive_idx], color='green')
    plt.barh(range(20, 40), coef[top_negative_idx], color='red')
    plt.yticks([])
    plt.title('Top 20 Feature Coefficients (Positive and Negative)')
    plt.xlabel('Coefficient Value')
    plt.tight_layout()
    plt.show()

    # ROC curve
    fpr, tpr, _ = roc_curve(test_labels, predicted_proba)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve - Logistic Regression')
    plt.legend(loc="lower right")
    plt.show()

    return model

###Step 6:  Pipeline Functions   

In this step, you will implement the sentiment analysis pipeline for each classifier (Sequential Neural Network and Logistic Regression) on top of the pre-trained BERT embeddings.   

Our sentiment analysis pipeline consists of:
- Extracting embeddings from the text data with the pre-trained BERT model
- Training the sequential neural network and logistic regression models on these embeddings
- Evaluating the models' performance
- Using the models to predict the sentiment of new texts


####Sequential Neural Network   

    """
    Run the complete neural network pipeline on BERT embeddings.

    Parameters:
    -----------
    train_data : list
        Training data texts
    train_labels : numpy.ndarray
        Training labels
    test_data : list
        Test data texts
    test_labels : numpy.ndarray
        Test labels
    max_length : int
        Maximum sequence length for tokenization

    Returns:
    --------
    tuple
        (model, embeddings) - the trained model and BERT embeddings
    """

In [None]:
def run_neural_network_pipeline(train_data, train_labels, test_data, test_labels, max_length=128):
    """
    Run the complete neural network pipeline on BERT embeddings.

    Parameters:
    -----------
    train_data : list
        Training data texts
    train_labels : numpy.ndarray
        Training labels
    test_data : list
        Test data texts
    test_labels : numpy.ndarray
        Test labels
    max_length : int
        Maximum sequence length for tokenization

    Returns:
    --------
    tuple
        (model, embeddings) - the trained model and BERT embeddings
    """

    # Extract the BERT embeddings for the training and validation (test) data using your previous function (extract_bert_embeddings)
    print("Extracting BERT embeddings for training data...")
    train_embeddings = ___
    print("Extracting BERT embeddings for test data...")
    test_embeddings = ___

    # Train and evaluate your model using your previous function (train_and_evaluate_neural_network)
    print("Training the neural network classifier...")
    model, history = ___

    return model, (train_embeddings, test_embeddings)

####Logistic Regression   

    """
    Run the complete logistic regression pipeline on BERT embeddings.

    Parameters:
    -----------
    train_data : list
        Training data texts
    train_labels : numpy.ndarray
        Training labels
    test_data : list
        Test data texts
    test_labels : numpy.ndarray
        Test labels
    max_length : int
        Maximum sequence length for tokenization

    Returns:
    --------
    tuple
        (model, embeddings) - the trained model and BERT embeddings
    """

In [None]:
def run_logistic_regression_pipeline(train_data, train_labels, test_data, test_labels, max_length=128):
    """
    Run the complete logistic regression pipeline on BERT embeddings.

    Parameters:
    -----------
    train_data : list
        Training data texts
    train_labels : numpy.ndarray
        Training labels
    test_data : list
        Test data texts
    test_labels : numpy.ndarray
        Test labels
    max_length : int
        Maximum sequence length for tokenization

    Returns:
    --------
    tuple
        (model, embeddings) - the trained model and BERT embeddings
    """

    # Extract the BERT embeddings for the training and validation (test) data using your previous function (extract_bert_embeddings)
    print("Extracting BERT embeddings for training data...")
    train_embeddings = ___
    print("Extracting BERT embeddings for test data...")
    test_embeddings = ___

    # Train and evaluate your model using your previous function (train_and_evaluate_logistic_regression)
    print("Training the logistic regression classifier...")
    model = ___

    return model, (train_embeddings, test_embeddings)


###Step 7:  Combined Pipeline Function (for efficiency)

Because the pre-trained BERT model is used as a fixed feature extractor, the embeddings are the same no matter which classifier consumes them, so the pipeline can be made more efficient by extracting them once and reusing them for both classifiers.  
In this step, you will implement a combined processing pipeline that trains both classifiers (Sequential Neural Network and Logistic Regression) on the same embeddings.   
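
The saving from sharing embeddings can be shown with a call counter. Everything in this sketch is a stand-in: `slow_extract` returns two made-up features per text rather than BERT embeddings, and `train_stub` only reports what it received.

```python
calls = {"count": 0}

def slow_extract(texts):
    """Stand-in for an expensive embedding step; counts how often it runs."""
    calls["count"] += 1
    return [[float(len(t)), float(t.count(" "))] for t in texts]

def train_stub(name, features):
    """Stand-in for training a classifier on precomputed features."""
    return f"{name} trained on {len(features)} examples"

texts = ["good film", "bad film", "it was fine"]
classifiers = ("neural_network", "logistic_regression")

# Separate pipelines: each one extracts the features again.
calls["count"] = 0
for name in classifiers:
    train_stub(name, slow_extract(texts))
separate_calls = calls["count"]

# Combined pipeline: extract once, then share the result.
calls["count"] = 0
shared_features = slow_extract(texts)
results = [train_stub(name, shared_features) for name in classifiers]
combined_calls = calls["count"]

print(separate_calls, combined_calls)  # 2 1
```

With real BERT embeddings the extraction step dominates the running time, so halving the number of extraction passes roughly halves the cost of the whole experiment.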


    """
    Run both neural network and logistic regression pipelines on the same BERT embeddings.
    
    This is more efficient as embeddings are extracted only once.

    Parameters:
    -----------
    train_data : list
        Training data texts
    train_labels : numpy.ndarray
        Training labels
    test_data : list
        Test data texts
    test_labels : numpy.ndarray
        Test labels
    max_length : int
        Maximum sequence length for tokenization

    Returns:
    --------
    tuple
        (nn_model, lr_model, embeddings)
    """

In [None]:
def run_combined_pipeline(train_data, train_labels, test_data, test_labels, max_length=128):
    """
    Run both neural network and logistic regression pipelines on the same BERT embeddings.

    This is more efficient as embeddings are extracted only once.

    Parameters:
    -----------
    train_data : list
        Training data texts
    train_labels : numpy.ndarray
        Training labels
    test_data : list
        Test data texts
    test_labels : numpy.ndarray
        Test labels
    max_length : int
        Maximum sequence length for tokenization

    Returns:
    --------
    tuple
        (nn_model, lr_model, embeddings)
    """

    # Extract the BERT embeddings for the training and validation (test) data using your previous function (extract_bert_embeddings)
    print("Extracting BERT embeddings for training data...")
    train_embeddings = ___
    print("Extracting BERT embeddings for test data...")
    test_embeddings = ___

    # Train and evaluate both models using your previous functions
    # (train_and_evaluate_neural_network and train_and_evaluate_logistic_regression)
    # Train neural network
    print("\n=== Training Neural Network Classifier ===")
    nn_model, history = ___

    # Train logistic regression
    print("\n=== Training Logistic Regression Classifier ===")
    lr_model = ___

    return nn_model, lr_model, (train_embeddings, test_embeddings)

###Step 8:  Prediction Functions   

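Now we can test our sentiment analyzer by making predictions with the pre-trained BERT model and each trained classifier (Sequential Neural Network or Logistic Regression).

Write a function for each classifier that takes a text, a trained model, and a maximum sequence length, and returns a dictionary containing the text, the score, the predicted sentiment, and the confidence. The neural network version has the signature: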

    def predict_sentiment_nn(text, model, max_length=128):
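
The prediction functions in this step turn a positive-class score into a sentiment label and a confidence value. A minimal sketch of that mapping, using a hypothetical helper `summarize_score` and a fixed 0.5 threshold:

```python
def summarize_score(score, threshold=0.5):
    """Map a positive-class probability to a label and a confidence.

    Confidence is the probability of whichever class was chosen.
    """
    is_positive = score > threshold
    return {
        "score": float(score),
        "sentiment": "Positive" if is_positive else "Negative",
        "confidence": float(score) if is_positive else float(1 - score),
    }

for score in (0.8, 0.3, 0.5):
    print(summarize_score(score))
```

A score of 0.3 is reported as Negative with confidence 0.7, and a score of exactly 0.5 falls on the Negative side because the comparison is strict.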



####Sequential Neural Network   

    """
    Predict sentiment using the neural network model.
    """

In [None]:
def predict_sentiment_nn(text, model, max_length=128):
    """
    Predict sentiment using the neural network model.
    """
    # Extract the embeddings for the text using your previous function
    # (it expects a list of texts, so wrap the single text in a list)
    embedding = ___

    # Make prediction using CPU to match model device
    # Hint:  call the predict function for the input model
    with tf.device('/CPU:0'):
        prediction = ___

    return {
        'text': text,
        'score': float(prediction),
        'sentiment': 'Positive' if prediction > 0.5 else 'Negative',
        'confidence': float(prediction) if prediction > 0.5 else float(1 - prediction)
    }

####Logistic Regression   

    """
    Predict sentiment using the logistic regression model.
    """

In [None]:
def predict_sentiment_lr(text, model, max_length=128):
    """
    Predict sentiment using the logistic regression model.
    """
    # Get the embedding for the text using your previous function
    # (it expects a list of texts, so wrap the single text in a list)
    embedding = ___

    # Make prediction
    # Hint:  call the model's predict method for the class label and its predict_proba method for the class probabilities
    prediction = model.___
    probability = model.___  # Probability of positive class

    return {
        'text': text,
        'score': float(probability),
        'sentiment': 'Positive' if prediction == 1 else 'Negative',
        'confidence': float(probability) if prediction == 1 else float(1 - probability)
    }


###Step 9:  Load and Prepare Data   

No coding needed for this step.   
Do not modify this cell - for test purposes.

In [None]:
# Load IMDB dataset
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Create a smaller sample for demonstration
sample_size = 5000  # Adjust based on available computational resources
train_sample = train_data[:sample_size]
train_sample_labels = train_labels[:sample_size]
test_sample = test_data[:1000]
test_sample_labels = test_labels[:1000]



###Step 10:  Run the combined pipeline   

No coding needed for this step.   
Do not modify this cell - for test purposes.

In [None]:
nn_model, lr_model, embeddings = run_combined_pipeline(
    train_sample, train_sample_labels,
    test_sample, test_sample_labels
)

###Step 11:  Model Comparison  

No coding needed for this step.   
Do not modify this cell - for test purposes.

In [None]:
# Try some examples with both models
examples = [
    "This movie was fantastic! I really enjoyed every minute of it.",
    "The acting was terrible and the plot made no sense.",
    "It was okay, not great but not terrible either."
]

print("===== Model Comparison =====")
for i, example in enumerate(examples):
    nn_result = predict_sentiment_nn(example, nn_model)
    lr_result = predict_sentiment_lr(example, lr_model)

    print(f"\nExample {i+1}: \"{example}\"")
    print(f"Neural Network: {nn_result['sentiment']} (confidence: {nn_result['confidence']:.2f})")
    print(f"Logistic Regression: {lr_result['sentiment']} (confidence: {lr_result['confidence']:.2f})")

###Step 12:  (Optional): Parameter Sensitivity Analysis for Logistic Regression

In [None]:
# Tune the Logistic Regression regularization parameter and observe its effect on performance

print("\n===== Logistic Regression Parameter Sensitivity Analysis =====")
c_values = [0.01, 0.1, 1.0, 10.0, 100.0]
train_accuracies = []
test_accuracies = []

for c in c_values:
    # Use a separate name so the lr_model trained in Step 10 is not overwritten
    sweep_model = LogisticRegression(C=c, max_iter=1000, random_state=42)
    sweep_model.fit(embeddings[0], train_sample_labels)

    train_acc = sweep_model.score(embeddings[0], train_sample_labels)
    test_acc = sweep_model.score(embeddings[1], test_sample_labels)

    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)

    print(f"C={c}: Train Accuracy={train_acc:.4f}, Test Accuracy={test_acc:.4f}")

# Plot regularization parameter sensitivity
plt.figure(figsize=(10, 6))
plt.semilogx(c_values, train_accuracies, 'b-o', label='Training Accuracy')
plt.semilogx(c_values, test_accuracies, 'r-o', label='Test Accuracy')
plt.xlabel('Regularization Parameter (C)')
plt.ylabel('Accuracy')
plt.title('Logistic Regression Sensitivity to Regularization')
plt.legend()
plt.grid(True)
plt.show()

## REFLECTION QUESTIONS  

Answer the following questions.

1. Summarize the performance of both your models.  What explains the difference in training accuracy vs. test accuracy between the neural network and logistic regression models?
2. Why might the models disagree on the neutral example "It was okay, not great but not terrible either"? Why does the neural network appear "overconfident" (high confidence score in its sentiment prediction)?
3. Given the similar performance of both models, what factors would influence your choice between using the neural network or logistic regression in a production environment?
4. How does the feature extraction approach using BERT embeddings differ from traditional NLP approaches, and why is it particularly effective for this task?
5. What modifications would you make to either model to improve performance on neutral or ambiguous examples?