# Installation of Required Libraries

This cell installs the libraries the project depends on: TensorFlow for building and training the neural network, NLTK for accessing and processing text data, and Keras-Tuner for optimizing the model's hyperparameters. Running it first ensures these packages are available to the later cells.



In [None]:
# Install TensorFlow, NLTK, and Keras-Tuner for model development and tuning
%pip install tensorflow nltk keras-tuner


# Restart the Python Environment

This cell calls the Databricks utility `dbutils.library.restartPython()` to restart the Python process. Restarting after `%pip install` ensures that the newly installed package versions are the ones imported, rather than versions loaded earlier in the session. Because the restart clears Python state, run this cell before any imports or variable definitions.


In [None]:
# Restart the Python environment to ensure newly installed packages are loaded
dbutils.library.restartPython()
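
After the restart, it can be useful to confirm that the packages are importable before running the rest of the notebook. The sketch below is one way to do that using only the standard library. The helper name `missing_packages` is hypothetical, and the import names (`tensorflow`, `nltk`, `keras_tuner`) are assumed to correspond to the pip packages installed earlier.

```python
import importlib.util

def missing_packages(import_names):
    """Return the top-level names from import_names that cannot be found for import."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]

# Import names assumed for the pip packages tensorflow, nltk, and keras-tuner
required = ["tensorflow", "nltk", "keras_tuner"]
print("Missing packages:", missing_packages(required))
```

An empty list suggests the environment is ready; any name listed would need to be installed before continuing.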


# Imports and Data Preparation

This cell sets up the data pipeline. It imports NLTK for the corpus, NumPy for numerical operations, scikit-learn for the train/test split, and the TensorFlow Keras text utilities. NLTK downloads and loads the `movie_reviews` dataset, whose reviews are already split into word tokens. The Keras `Tokenizer` maps those tokens to integer IDs with the vocabulary capped by `num_words=10000`, `pad_sequences` pads or truncates every review to 200 IDs, and the result is split 80/20 into training and test sets.


In [None]:
# Import necessary libraries for data processing and model preparation
import nltk
from nltk.corpus import movie_reviews
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Download the NLTK movie reviews dataset
nltk.download('movie_reviews')

# Load the movie reviews and their associated categories (positive or negative)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Unzip reviews and labels from the documents
reviews, labels = zip(*documents)

# Convert labels to a numpy array with binary encoding (1 = positive, 0 = negative)
labels = np.array([1 if label == 'pos' else 0 for label in labels])

# Initialize and configure the tokenizer to convert text to sequences of integers
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(reviews)

# Transform reviews to sequences and pad them to ensure uniform length
sequences = tokenizer.texts_to_sequences(reviews)
data = pad_sequences(sequences, maxlen=200)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
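
To make the encoding step concrete, the following pure-Python sketch mimics, under simplifying assumptions, what the vocabulary cap and the default padding do: tokens are ranked by frequency, tokens whose index is not below `num_words` are dropped, and each sequence is left-padded with zeros or truncated from the front. The helper names (`build_word_index`, `encode`, `pad_pre`) and the toy reviews are hypothetical and are not part of Keras.

```python
from collections import Counter

def build_word_index(token_lists):
    """Rank lowercased tokens by frequency; the most frequent token gets index 1."""
    counts = Counter(tok.lower() for toks in token_lists for tok in toks)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, word_index, num_words):
    """Map tokens to indices, dropping any token whose index is >= num_words."""
    ids = (word_index.get(tok.lower()) for tok in tokens)
    return [i for i in ids if i is not None and i < num_words]

def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros (index 0 is reserved)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Toy data standing in for the tokenized movie reviews
toy_reviews = [["great", "film", "film"],
               ["a", "a", "dull", "great", "great", "a", "a"]]

word_index = build_word_index(toy_reviews)
encoded = [encode(r, word_index, num_words=4) for r in toy_reviews]
padded = [pad_pre(s, maxlen=5) for s in encoded]
print(padded)  # [[0, 0, 2, 3, 3], [1, 2, 2, 1, 1]]
```

In the toy output, the rare token "dull" is dropped by the vocabulary cap, the short review is padded on the left, and the long review loses its earliest token.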


# Model Definition

This cell defines `build_model`, which creates the neural network with TensorFlow Keras. The model starts with an Embedding layer for the integer-encoded text, followed by one to three Dense layers with ReLU activations, each followed by a Dropout layer for regularization; the number of layers, the units per layer, and the dropout rate are drawn from the hyperparameter object `hp`. A Global Average Pooling layer then collapses the sequence dimension, and a single sigmoid unit produces the binary classification output. Wrapping the architecture in a function lets Keras Tuner rebuild the model for each trial.


In [None]:
# Import TensorFlow and define the model architecture using Keras
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(hp):
    """Builds a sequential neural network model from hyperparameters."""
    model = models.Sequential()
    model.add(layers.Embedding(10000, 128, input_length=200))

    # Add 1-3 dense layers with dropout for regularization; 'n_units' and
    # 'dropout_rate' are each registered once by name, so all layers in a trial share them
    for i in range(hp.Int('n_layers', 1, 3)):
        model.add(layers.Dense(units=hp.Int('n_units', min_value=64, max_value=256, step=32),
                               activation='relu'))
        model.add(layers.Dropout(hp.Float('dropout_rate', 0.1, 0.5)))

    model.add(layers.GlobalAveragePooling1D())
    model.add(layers.Dense(1, activation='sigmoid'))

    # Compile the model with Adam optimizer and binary cross-entropy loss
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
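
Because the Dense layers sit before the pooling layer, they act on each token position independently, and the sequence dimension only disappears at `GlobalAveragePooling1D`. The sketch below illustrates that pooling step and the final sigmoid on toy numbers using plain Python. The function names, the activations, and the weights are made up for illustration and are not taken from the trained model.

```python
import math

def global_average_pool(timesteps):
    """Average a list of per-token feature vectors into one feature vector."""
    n = len(timesteps)
    return [sum(column) / n for column in zip(*timesteps)]

def sigmoid(x):
    """Squash a real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

# Toy activations for one review: 3 token positions, 2 features each
token_features = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = global_average_pool(token_features)

# Hypothetical weights and bias for the final one-unit layer
weights, bias = [0.5, -0.5], 0.0
probability = sigmoid(sum(w * p for w, p in zip(weights, pooled)) + bias)
print(pooled, probability)  # [2.0, 2.0] 0.5
```

A probability above 0.5 would be read as a positive review under the label encoding used earlier (1 = positive, 0 = negative).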


# Hyperparameter Tuning and MLflow Integration

This cell integrates MLflow for experiment tracking and sets up hyperparameter tuning with Keras Tuner's `RandomSearch`. The MLflow tracking URI is set to `databricks` so that experiment data is logged to the Databricks workspace, and TensorFlow autologging is enabled. The tuner samples up to 50 configurations from the search space defined in `build_model`, trains each for 5 epochs, and ranks them by validation accuracy on a 10% hold-out of the training data. The best model is then evaluated on the test set, and its test accuracy is logged to MLflow.


In [None]:
# Import MLflow and Keras Tuner for model tracking and hyperparameter tuning
import mlflow
from keras_tuner import RandomSearch

# Set the MLflow tracking URI for Databricks
mlflow.set_tracking_uri("databricks")

# Enable auto-logging for TensorFlow to MLflow
mlflow.tensorflow.autolog()

# Configure and initiate the RandomSearch tuner with the defined model-building function
tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=50,
    executions_per_trial=1,
    directory='my_dir',
    project_name='SentimentAnalysis',
    overwrite=True
)

# Start an MLflow run and perform hyperparameter tuning with training and validation
with mlflow.start_run():
    tuner.search(X_train, y_train, epochs=5, validation_split=0.1)
    best_model = tuner.get_best_models(num_models=1)[0]

    # Evaluate the best model on the test set and log the test accuracy to MLflow
    test_loss, test_acc = best_model.evaluate(X_test, y_test)
    print("Test Accuracy: ", test_acc)
    mlflow.log_metric("test_accuracy", test_acc)
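
To get a feel for how much of the search space 50 random trials can cover, the short calculation below enumerates the discrete hyperparameters declared in `build_model` (integer bounds in Keras Tuner are inclusive). It is a back-of-the-envelope sketch only: `dropout_rate` is continuous, so it is left out of the count, and the variable names are illustrative.

```python
# hp.Int('n_layers', 1, 3): both bounds are inclusive
n_layers_choices = list(range(1, 4))

# hp.Int('n_units', min_value=64, max_value=256, step=32)
n_units_choices = list(range(64, 257, 32))

discrete_combinations = len(n_layers_choices) * len(n_units_choices)

# Training budget implied by max_trials=50, executions_per_trial=1, epochs=5
max_trials, executions_per_trial, epochs = 50, 1, 5
total_epochs = max_trials * executions_per_trial * epochs

print(n_units_choices)        # [64, 96, 128, 160, 192, 224, 256]
print(discrete_combinations)  # 21
print(total_epochs)           # 250
```

With 21 discrete combinations and a continuous dropout rate, 50 trials will typically revisit some layer and unit settings with different dropout values.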


# Model Deployment and Lifecycle Management

The final cell handles deployment. It logs the best-performing model to MLflow, builds a `runs:/` URI from the run ID and the artifact path, and registers the model in the MLflow Model Registry under the name `SentimentAnalysisModel`. The registered version is then transitioned to the `Production` stage, which marks it as the version that downstream applications should load.


In [None]:
# Import MLflow TensorFlow utilities for model logging
import mlflow.tensorflow

# Define the path for logging the model
model_path = "models/sentiment_analysis"

# Log the best model inside an explicit run; the tuning run from the previous
# cell has already ended, so the run ID is captured here and the run is closed cleanly
with mlflow.start_run() as run:
    mlflow.tensorflow.log_model(best_model, artifact_path=model_path)

# Construct the model URI using the run ID and the model path
model_uri = f"runs:/{run.info.run_id}/{model_path}"

# Register the logged model in the MLflow Model Registry
model_details = mlflow.register_model(model_uri=model_uri, name="SentimentAnalysisModel")

# Transition the registered model to the 'Production' stage using the MLflow client
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="SentimentAnalysisModel",
    version=model_details.version,
    stage="Production"
)
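
The two URI forms involved in this step can be sketched with plain string formatting. The helper names below are hypothetical and the run ID is a placeholder, but the `runs:/` and `models:/` schemes are the ones MLflow uses for run artifacts and registered models respectively.

```python
def run_model_uri(run_id, artifact_path):
    """URI of a model logged as an artifact of a specific run."""
    return f"runs:/{run_id}/{artifact_path}"

def registry_model_uri(name, version_or_stage):
    """URI of a registered model, addressed by version number or stage name."""
    return f"models:/{name}/{version_or_stage}"

example_run_id = "0123456789abcdef"  # placeholder, not a real run ID

print(run_model_uri(example_run_id, "models/sentiment_analysis"))
print(registry_model_uri("SentimentAnalysisModel", "Production"))
print(registry_model_uri("SentimentAnalysisModel", 1))
```

A consumer of the model would typically pass the registry form to a loader such as `mlflow.pyfunc.load_model`, so that it picks up whichever version currently holds the `Production` stage.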
