### Preprocessing and Modeling: Chatbot 'Yoldi'

This notebook is dedicated to the preprocessing and modeling phases for the development of the 'Yoldi' chatbot. It focuses on transforming the cleaned data into a format suitable for training machine learning models and developing the chatbot's response generation system.

#### Preprocessing
- **Objective**: Prepare and refine the data for model training and response system development.
- **Steps Involved**:
  - Application of the function to link customer queries to corresponding Customer Support responses in the cleaned dataset.
  - Further preprocessing of the linked data, ensuring consistency and usability for model training.
  - Extraction of features relevant to the chatbot's response system, such as topics, sentiment, and named entities.

#### Intent Recognition and Response Generation
- **Objective**: Develop mechanisms to recognize user intent and generate appropriate responses.
- **Approach**:
  - Explore various NLP techniques and algorithms for intent recognition without labeled data.
  - Implement and evaluate different models for response generation, considering the context and intent of user queries.

#### Model Training and Evaluation
- **Objective**: Train and evaluate models for the chatbot's core functionalities.
- **Methodology**:
  - Train models for topic modeling, sentiment analysis, and intent recognition.
  - Evaluate models using appropriate metrics to ensure effectiveness and accuracy.
  - Fine-tune models based on evaluation results to improve performance.

#### Response System Integration
- **Objective**: Integrate trained models to create a coherent response system for the chatbot.
- **Details**:
  - Combine models to interpret user queries and generate relevant responses.
  - Implement logic to handle various types of queries and maintain contextual relevance.

#### Prototyping and Testing
- **Objective**: Prototype the chatbot and conduct initial testing.
- **Process**:
  - Develop a basic User Interface for interacting with the chatbot.
  - Conduct test runs to assess the chatbot's response accuracy and coherence.
  - Gather feedback and insights for further improvement.

In [6]:
import sys
sys.path.append('../scripts/')

In [7]:
import logging
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.utils import Sequence
from utils import *

In [8]:
# setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

#### Loading Cleaned Data:

In [9]:
file_path = '../data/interim/cleaned_data.csv'
df = load_data(file_path=file_path)

2023-12-09 11:46:00,934 - INFO - Starting execution of load_data
2023-12-09 11:46:03,822 - INFO - Data loading completed successfully


In [10]:
df.head()

| Unnamed: 0 | query_id | query_text | query_inbound | query_response_tweet_id | query_in_response_to_tweet_id | response_id | response_text | response_inbound | sentiment | sentiment_class | query_length | response_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1515 | id like at least one month to go by where i do... | True | 1514 | | 1514 | hello this does not sound good can you dm the ... | False | -0.5423 | Negative | 122 | 113 |
| 1 | 425816 | klasse vielen dank dafür könntet ihr dann bitt... | True | 425815 | 425817.0 | 425815 | hi tom weve escalated your case to our special... | False | -0.5994 | Negative | 132 | 136 |
| 2 | 354377 | will do thanks | True | 354378 | 354376.0 | 354378 | youre welcome steffi | False | 0.4404 | Positive | 14 | 20 |
| 3 | 2231678 | report a site published photoshop crack | True | 2231677 | 2231679.0 | 2231677 | hi bianko you can submit your report here let ... | False | 0.0 | Neutral | 39 | 68 |
| 4 | 810064 | how is nairobi airport developing have you got... | True | 810062 | | 810062 | you soon you can always check our lounge locat... | False | 0.0 | Neutral | 75 | 75 |


#### Preprocessing:

We will perform a series of steps to transform the data for modeling. Doing so keeps the query and response text consistent. We will also add engineered features intended to improve response generation.

In [11]:
try:
    # Drop rows where either query_text or response_text is missing
    df.dropna(subset=['query_text', 'response_text'], inplace=True)
    logging.info(f"Dataframe shape after dropping missing values: {df.shape}")
except Exception as e:
    logging.error(f"Error in handling missing values: {e}")

2023-12-09 11:46:03,997 - INFO - Dataframe shape after dropping missing values: (494440, 12)


In [12]:
# sampling query_id and corresponding response_id
try:
    subset_ids = df[['query_id', 'response_id']].drop_duplicates().sample(frac=0.1, random_state=42)
    # merging the sampled ids back with the original dataframe to get the full rows
    df_sampled = pd.merge(subset_ids, df, on=['query_id', 'response_id'], how='inner')
    df_model = df_sampled[['query_text', 'response_text']]
    logging.info(f"Reduced dataset size: {df_sampled.shape[0]}")
except Exception as e:
    logging.error(f"Error reducing the dataframe: {e}")
    raise

2023-12-09 11:46:04,173 - INFO - Reduced dataset size: 49444


In [13]:
# saving preprocessed data to csv
folder_path2 = '../data/'  # Adjust the path as needed
file_name2 = 'processed/processed_data.csv'
full_path2 = save_data(df, folder_path2, file_name2)

if full_path2:
    print(f"DataFrame saved at: {full_path2}")
else:
    print("Failed to save the DataFrame.")

2023-12-09 11:46:09,753 - INFO - DataFrame saved successfully to ../data/processed/processed_data.csv


DataFrame saved at: ../data/processed/processed_data.csv


In [14]:
# saving the model-ready query/response pairs to csv
folder_path3 = '../data/'  # Adjust the path as needed
file_name3 = 'processed/model_data.csv'
full_path3 = save_data(df_model, folder_path3, file_name3)

# check the path returned for this save (full_path3), not the earlier one
if full_path3:
    print(f"DataFrame saved at: {full_path3}")
else:
    print("Failed to save the DataFrame.")

2023-12-09 11:46:10,164 - INFO - DataFrame saved successfully to ../data/processed/model_data.csv


DataFrame saved at: ../data/processed/model_data.csv


### LSTM-based Seq2Seq Model Development for Chatbot 'Yoldi'

This section of the notebook is dedicated to the development of an LSTM-based Seq2Seq model for the chatbot 'Yoldi', aimed at automating responses to customer queries. The process is divided into several key stages:

#### Model Preparation
- **Objective**: Set up the LSTM-based Seq2Seq model architecture.
- **Tasks**:
  - Define the Encoder and Decoder architecture.
  - Set up embedding layers for text processing.
  - Initialize LSTM layers and define model parameters.

#### Data Splitting
- **Objective**: Divide the dataset into training and testing subsets.
- **Methodology**:
  - Utilize the `train_test_split` method to segregate the data.
  - Ensure a balanced representation of data in both training and testing sets.

#### Model Training
- **Objective**: Train the Seq2Seq model on the dataset.
- **Approach**:
  - Feed the training data into the model.
  - Monitor performance metrics during the training process.
  - Employ validation checks to assess model learning.

#### Model Evaluation
- **Objective**: Evaluate the trained model's performance.
- **Techniques**:
  - Apply the model to the test dataset.
  - Analyze the model's accuracy, precision, recall, and other relevant metrics.
  - Use techniques like confusion matrix, ROC curve, etc., for deeper analysis.

#### Model Optimization
- **Objective**: Fine-tune the model for improved performance.
- **Strategies**:
  - Adjust model hyperparameters like learning rate, batch size, etc.
  - Experiment with different numbers of LSTM units and layers.
  - Explore regularization techniques to prevent overfitting.

#### Conclusion and Next Steps
- Summarize the findings from the model development process in the project report draft.
- Outline potential improvements and the testing and deployment stages.

In [15]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49444 entries, 0 to 49443
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   query_text     49444 non-null  object
 1   response_text  49444 non-null  object
dtypes: object(2)
memory usage: 772.7+ KB


In [16]:
def tokenize_texts(texts):
    """
    Tokenize a list of texts and convert them into sequences.
    Args:
    texts (list of str): A list of texts to be tokenized.
    Returns:
    (Tokenizer, list): A tokenizer object and a list of text sequences
    """
    try:
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(texts)
        return tokenizer, tokenizer.texts_to_sequences(texts)
    except Exception as e:
        logging.error(f"Error during tokenization: {e}")
        raise

# Tokenize texts
try:
    input_tokenizer, input_sequences = tokenize_texts(df_model['query_text'])
    target_tokenizer, target_sequences = tokenize_texts(df_model['response_text'])
    logging.info("Tokenization successful.")
except Exception as e:
    logging.error(f"Error during tokenization: {e}")
    raise  # later cells depend on the tokenizers and sequences

2023-12-09 11:46:12,825 - INFO - Tokenization successful.
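The responses above are tokenized without explicit start and end markers, which a Seq2Seq decoder usually needs in order to know where generation begins and stops. The following is a minimal sketch, not part of the original pipeline. The helper `add_sequence_markers` and the marker words `startseq` and `endseq` are hypothetical names chosen for illustration; plain words are used because the default Keras `Tokenizer` filters strip characters such as `<` and `>`.

```python
def add_sequence_markers(texts, start_token="startseq", end_token="endseq"):
    """Wrap each text in start/end marker words (hypothetical marker names)."""
    return [f"{start_token} {text.strip()} {end_token}" for text in texts]


# Two responses taken from the df.head() preview above
responses = ["youre welcome steffi", "hi tom weve escalated your case"]
marked_responses = add_sequence_markers(responses)
print(marked_responses[0])  # startseq youre welcome steffi endseq
```

Under this assumption, the marked responses would be passed to `tokenize_texts` in place of `df_model['response_text']`.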


#### Splitting Data

In [17]:
try:
    # Splitting the data into training and testing sets
    train_input_texts, test_input_texts, train_target_texts, test_target_texts = train_test_split(
        input_sequences, target_sequences, test_size=0.2, random_state=42)

    logging.info("Data split into training and testing sets.")
except Exception as e:
    logging.error(f"Error during data splitting: {e}")
    raise  # later cells depend on the split data

2023-12-09 11:46:12,866 - INFO - Data split into training and testing sets.


#### Preparing Encoder and Decoder Data

In [18]:
try:
    # calculating maximum sequence lengths from the training data
    max_encoder_seq_length = max(len(seq) for seq in train_input_texts)
    max_decoder_seq_length = max(len(seq) for seq in train_target_texts)

    # token counts
    num_encoder_tokens = len(input_tokenizer.word_index) + 1  # +1 for padding token
    num_decoder_tokens = len(target_tokenizer.word_index) + 1

    logging.info(f'Max encoder sequence length (train): {max_encoder_seq_length}')
    logging.info(f'Max decoder sequence length (train): {max_decoder_seq_length}')
    logging.info(f'Number of encoder tokens: {num_encoder_tokens}')
    logging.info(f'Number of decoder tokens: {num_decoder_tokens}')
except Exception as e:
    logging.error(f'Error in calculating sequence lengths or number of tokens: {e}')
    raise

2023-12-09 11:46:12,897 - INFO - Max encoder sequence length (train): 60
2023-12-09 11:46:12,899 - INFO - Max decoder sequence length (train): 57
2023-12-09 11:46:12,899 - INFO - Number of encoder tokens: 39684
2023-12-09 11:46:12,899 - INFO - Number of decoder tokens: 22196
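The maximum lengths above come from the longest training sequence. If a handful of unusually long texts were to inflate that value, a percentile-based cap is one alternative. The sketch below is illustrative only: `choose_max_length` is a hypothetical helper, and the toy sequences are made up. Note that `pad_sequences` truncates anything longer than `maxlen` (from the front by default), so a cap below the true maximum discards tokens.

```python
import numpy as np


def choose_max_length(sequences, percentile=95):
    """Return a padding length that covers `percentile` percent of the sequences."""
    lengths = np.array([len(seq) for seq in sequences])
    return int(np.ceil(np.percentile(lengths, percentile)))


# Toy data: four short sequences and one outlier of length 60
toy_sequences = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10], [1] * 60]
full_length = choose_max_length(toy_sequences, percentile=100)
capped_length = choose_max_length(toy_sequences, percentile=75)
print(full_length, capped_length)
```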


In [19]:
def prepare_data(train_input_sequences, train_target_sequences, 
                 test_input_sequences, test_target_sequences, 
                 num_encoder_tokens, num_decoder_tokens, 
                 max_encoder_seq_length, max_decoder_seq_length):
    """
    Prepare data for Seq2Seq model training and testing.
    """
    try:
        # Pad sequences for training data
        train_encoder_input_data = pad_sequences(train_input_sequences, maxlen=max_encoder_seq_length, padding='post')
        train_decoder_input_data = pad_sequences(train_target_sequences, maxlen=max_decoder_seq_length, padding='post')

        # Pad sequences for testing data
        test_encoder_input_data = pad_sequences(test_input_sequences, maxlen=max_encoder_seq_length, padding='post')
        test_decoder_input_data = pad_sequences(test_target_sequences, maxlen=max_decoder_seq_length, padding='post')

        # Prepare decoder target data for training set
        train_decoder_target_data = np.zeros((len(train_target_sequences), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
        for i, seq in enumerate(train_target_sequences):
            for t, token in enumerate(seq):
                # Targets lead the decoder input by one timestep. No start token is added
                # during tokenization, so the first word of each response never appears as a target.
                if t > 0:
                    train_decoder_target_data[i, t - 1, token] = 1.

        # Prepare decoder target data for testing set
        test_decoder_target_data = np.zeros((len(test_target_sequences), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
        for i, seq in enumerate(test_target_sequences):
            for t, token in enumerate(seq):
                if t > 0:  # same logic as above
                    test_decoder_target_data[i, t - 1, token] = 1.

        logging.info("Data preparation for training and testing completed successfully.")
        return (train_encoder_input_data, train_decoder_input_data, train_decoder_target_data), (test_encoder_input_data, test_decoder_input_data, test_decoder_target_data)
    except Exception as e:
        logging.error(f"Error during data preparation: {e}")
        raise
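The nested loops in `prepare_data` implement a one-step shift: the target at timestep `t - 1` is the decoder input token at timestep `t`. The same shift can be written directly on the padded integer ids, as in the sketch below. `shift_targets` and the toy ids (1 as a start marker, 2 as an end marker, 0 as padding) are assumptions for illustration, not values from the notebook.

```python
import numpy as np


def shift_targets(padded_decoder_input):
    """Build decoder targets that lead the decoder input by one timestep."""
    targets = np.zeros_like(padded_decoder_input)
    # Drop the first token of every row and pad the freed last position with 0
    targets[:, :-1] = padded_decoder_input[:, 1:]
    return targets


# Toy batch with hypothetical ids: 1 = start marker, 2 = end marker, 0 = padding
decoder_input = np.array([[1, 5, 6, 2, 0],
                          [1, 7, 2, 0, 0]])
decoder_target = shift_targets(decoder_input)
print(decoder_target)
```

Integer targets of shape `(batch, timesteps)` like these can be used with the `sparse_categorical_crossentropy` loss in Keras, which avoids building a one-hot array.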


In [20]:
try:
    # Prepare the training and testing data
    (train_encoder_input_data, 
     train_decoder_input_data, 
     train_decoder_target_data), (test_encoder_input_data, 
                                  test_decoder_input_data, 
                                  test_decoder_target_data) = prepare_data(
        train_input_texts, train_target_texts, 
        test_input_texts, test_target_texts, 
        num_encoder_tokens, num_decoder_tokens, 
        max_encoder_seq_length, max_decoder_seq_length)

    logging.info("Data for training and testing prepared successfully.")
except Exception as e:
    logging.error(f"Error during preparing data for training and testing: {e}")
    raise

2023-12-09 11:46:15,996 - INFO - Data preparation for training and testing completed successfully.
2023-12-09 11:46:15,997 - INFO - Data for training and testing prepared successfully.
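The one-hot target arrays prepared above have shape `(samples, max_decoder_seq_length, num_decoder_tokens)`, which becomes very large with a vocabulary of 22,196 tokens. The arithmetic below estimates the dense size and compares it with plain integer ids. The training set size of 39,555 is an assumption (about 80% of the 49,444 sampled rows), and both helper functions are hypothetical.

```python
def dense_one_hot_gib(n_samples, seq_len, vocab_size, bytes_per_value=4):
    """Size in GiB of a dense float32 one-hot array."""
    return n_samples * seq_len * vocab_size * bytes_per_value / 1024 ** 3


def integer_ids_mib(n_samples, seq_len, bytes_per_value=4):
    """Size in MiB of the same targets stored as int32 ids."""
    return n_samples * seq_len * bytes_per_value / 1024 ** 2


n_train = 39555  # assumption: roughly 80% of the 49,444 sampled rows
one_hot_size = dense_one_hot_gib(n_train, seq_len=57, vocab_size=22196)
ids_size = integer_ids_mib(n_train, seq_len=57)
print(f"dense one-hot targets: {one_hot_size:.1f} GiB")
print(f"integer id targets:    {ids_size:.1f} MiB")
```

With these figures the dense array comes to roughly 186 GiB once fully written, against under 9 MiB for integer ids, which is one reason to consider integer targets or per-batch one-hot encoding.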


In [21]:
# function to build model
def build_seq2seq_model(num_encoder_tokens, num_decoder_tokens, latent_dim=64):
    """
    Build a Seq2Seq model.
    Args:
        num_encoder_tokens (int): Number of unique tokens in the input.
        num_decoder_tokens (int): Number of unique tokens in the output.
        latent_dim (int, optional): Dimensionality of the encoding space. Defaults to 64.
    Returns:
        Model: A Seq2Seq model instance.
    """
    try:
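        # Encoder: expects one-hot input of shape (batch, timesteps, num_encoder_tokens)
        encoder_inputs = Input(shape=(None, num_encoder_tokens))
        encoder_lstm = LSTM(latent_dim, return_state=True)
        _, state_h, state_c = encoder_lstm(encoder_inputs)
        # Keep only the final hidden and cell states to initialize the decoder
        encoder_states = [state_h, state_c]

        # Decoder: expects one-hot input of shape (batch, timesteps, num_decoder_tokens)
        decoder_inputs = Input(shape=(None, num_decoder_tokens))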
        decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
        decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
        decoder_dense = Dense(num_decoder_tokens, activation='softmax')
        decoder_outputs = decoder_dense(decoder_outputs)

        # Model
        model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
        logging.info("Seq2Seq model built successfully.")
        return model
    except Exception as e:
        logging.error(f"Error building Seq2Seq model: {e}")
        raise
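The model above declares inputs of shape `(None, num_encoder_tokens)`, that is, one-hot vectors per timestep, while `prepare_data` returns integer-padded 2D arrays for the encoder and decoder inputs. One way to reconcile the two, in line with the embedding layers mentioned in the Model Preparation outline, is to feed integer ids through `Embedding` layers. The sketch below is a possible variant, not the notebook's model: `build_embedding_seq2seq`, `embedding_dim`, and the toy vocabulary sizes are assumptions.

```python
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Model


def build_embedding_seq2seq(num_encoder_tokens, num_decoder_tokens,
                            embedding_dim=64, latent_dim=64):
    """Seq2Seq variant that takes integer token ids instead of one-hot vectors."""
    # Encoder: integer ids -> dense embeddings -> final LSTM states
    encoder_inputs = Input(shape=(None,), name="encoder_ids")
    encoder_embedded = Embedding(num_encoder_tokens, embedding_dim)(encoder_inputs)
    _, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_embedded)

    # Decoder: starts from the encoder states and predicts a token per timestep
    decoder_inputs = Input(shape=(None,), name="decoder_ids")
    decoder_embedded = Embedding(num_decoder_tokens, embedding_dim)(decoder_inputs)
    decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
        decoder_embedded, initial_state=[state_h, state_c])
    decoder_outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

    return Model([encoder_inputs, decoder_inputs], decoder_outputs)


# Toy vocabulary sizes, for illustration only
sketch_model = build_embedding_seq2seq(num_encoder_tokens=100, num_decoder_tokens=80)
sketch_model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```

This sketch treats the padding id 0 as an ordinary token; masking could be added if padded positions should be ignored. With `sparse_categorical_crossentropy`, the targets would be integer ids rather than one-hot arrays.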

In [22]:
# Building and compiling the Seq2Seq model
try:
    # Build the model
    seq2seq_model = build_seq2seq_model(num_encoder_tokens, num_decoder_tokens, latent_dim=64)

    # Compile the model
    seq2seq_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

    # Print model summary
    seq2seq_model.summary()
    logging.info("Seq2Seq model built and compiled successfully.")
except Exception as e:
    logging.error(f"Error during model building or compilation: {e}")
    raise  # training cannot proceed without a compiled model

2023-12-09 11:46:16,456 - INFO - Seq2Seq model built successfully.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, None, 39684)]        0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, None, 22196)]        0         []                            
                                                                                                  
 lstm (LSTM)                 [(None, 64),                 10175744  ['input_1[0][0]']             
                              (None, 64),                                                         
                              (None, 64)]                                                         
                                                                                              

2023-12-09 11:46:16,471 - INFO - Seq2Seq model built and compiled successfully.


In [23]:
try:
    history = seq2seq_model.fit(
        [train_encoder_input_data, train_decoder_input_data], 
        train_decoder_target_data,
        batch_size=32,  # Adjust based on your system's capability
        epochs=50,  # Number of epochs to train for
        validation_data=([test_encoder_input_data, test_decoder_input_data], test_decoder_target_data),
        verbose=1
    )
    logging.info("Model training completed successfully.")
except Exception as e:
    logging.error(f"Error during model training: {e}")
    raise
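The notebook imports `Sequence` from `tensorflow.keras.utils` but does not use it, and the call above passes the full one-hot target array to `fit` at once. A batch-wise alternative is to build the one-hot targets only for the current batch. The generator below is a NumPy-only sketch under that assumption; `batch_generator` and the toy arrays are hypothetical, and padding is assumed to use id 0.

```python
import numpy as np


def batch_generator(encoder_ids, decoder_ids, target_ids, num_decoder_tokens, batch_size=32):
    """Yield ([encoder, decoder], one-hot targets) one batch at a time."""
    n_samples = len(encoder_ids)
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        batch_targets = target_ids[start:end]
        one_hot = np.zeros(batch_targets.shape + (num_decoder_tokens,), dtype="float32")
        # Mark only real tokens; padded positions (id 0) stay all-zero
        rows, steps = np.nonzero(batch_targets)
        one_hot[rows, steps, batch_targets[rows, steps]] = 1.0
        yield [encoder_ids[start:end], decoder_ids[start:end]], one_hot


# Toy integer-padded arrays, for illustration only
toy_encoder = np.array([[1, 2, 0], [3, 0, 0], [4, 5, 6]])
toy_decoder = np.array([[1, 2, 3], [1, 4, 0], [1, 5, 0]])
toy_targets = np.array([[2, 3, 0], [4, 0, 0], [5, 0, 0]])
batches = list(batch_generator(toy_encoder, toy_decoder, toy_targets,
                               num_decoder_tokens=7, batch_size=2))
print(len(batches), batches[0][1].shape)
```

This generator makes a single pass over the data; for multi-epoch training it would need to be restarted each epoch or wrapped in a `Sequence` subclass.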


In [None]:
# Save the trained model to disk
model_path = '../models/seq2seq_chatbot.h5'  # Adjust path as needed
save_model(seq2seq_model, model_path)


In [None]:
# Evaluate model performance
try:
    test_loss = seq2seq_model.evaluate([test_encoder_input_data, test_decoder_input_data], test_decoder_target_data)
    logging.info(f"Test Loss: {test_loss}")
except Exception as e:
    logging.error(f"Error during model evaluation: {e}")
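Because the model is compiled with a loss only, `evaluate` returns just the test loss. A token-level accuracy that ignores padded positions can complement it. The function below is a NumPy sketch under stated assumptions: `masked_token_accuracy` is a hypothetical helper, padding uses id 0, and the toy probabilities stand in for the output of `seq2seq_model.predict`.

```python
import numpy as np


def masked_token_accuracy(true_ids, predicted_probs, pad_id=0):
    """Share of non-padding positions where the most likely token matches the target."""
    predicted_ids = predicted_probs.argmax(axis=-1)
    mask = true_ids != pad_id
    if mask.sum() == 0:
        return 0.0
    return float((predicted_ids[mask] == true_ids[mask]).mean())


# Toy example: one sequence, three timesteps, vocabulary of three tokens
toy_true = np.array([[2, 1, 0]])
toy_probs = np.array([[[0.1, 0.2, 0.7],    # predicts token 2 (correct)
                       [0.6, 0.3, 0.1],    # predicts token 0 (wrong, target is 1)
                       [0.8, 0.1, 0.1]]])  # padded position, ignored
toy_accuracy = masked_token_accuracy(toy_true, toy_probs)
print(toy_accuracy)  # 0.5
```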