### Preprocessing and Modeling: Chatbot 'Yoldi'

This notebook is dedicated to the preprocessing and modeling phases for the development of the 'Yoldi' chatbot. It focuses on transforming the cleaned data into a format suitable for training machine learning models and developing the chatbot's response generation system.

#### Preprocessing
- **Objective**: Prepare and refine the data for model training and response system development.
- **Steps Involved**:
  - Application of the function to link customer queries to corresponding Customer Support responses in the cleaned dataset.
  - Further preprocessing of the linked data, ensuring consistency and usability for model training.
  - Extraction of features relevant to the chatbot's response system, such as topics, sentiment, and named entities.

#### Intent Recognition and Response Generation
- **Objective**: Develop mechanisms to recognize user intent and generate appropriate responses.
- **Approach**:
  - Explore various NLP techniques and algorithms for intent recognition without labeled data.
  - Implement and evaluate different models for response generation, considering the context and intent of user queries.

#### Model Training and Evaluation
- **Objective**: Train and evaluate models for the chatbot's core functionalities.
- **Methodology**:
  - Train models for topic modeling, sentiment analysis, and intent recognition.
  - Evaluate models using appropriate metrics to ensure effectiveness and accuracy.
  - Fine-tune models based on evaluation results to improve performance.

#### Response System Integration
- **Objective**: Integrate trained models to create a coherent response system for the chatbot.
- **Details**:
  - Combine models to interpret user queries and generate relevant responses.
  - Implement logic to handle various types of queries and maintain contextual relevance.

#### Prototyping and Testing
- **Objective**: Prototype the chatbot and conduct initial testing.
- **Process**:
  - Develop a basic User Interface for interacting with the chatbot.
  - Conduct test runs to assess the chatbot's response accuracy and coherence.
  - Gather feedback and insights for further improvement.

In [1]:
import sys
sys.path.append('../scripts/')  

In [3]:
import logging
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import nltk
import re
from utils import *
from sklearn.model_selection import train_test_split
# Import necessary libraries for modeling
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical


2023-11-29 12:28:55.171900: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [4]:
# setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

#### Loading Cleaned Data:

In [4]:
file_path = '../data/interim/cleaned_data.csv'
df = load_data(file_path=file_path)

2023-11-28 19:40:32,880 - INFO - Starting execution of load_data
2023-11-28 19:40:35,440 - INFO - Data loading completed successfully


In [5]:
df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,cleaned_text,processed_text,pos_tags,dep_parse,sentiment,entities,sentiment_class,topic
0,2149594,Tesco,False,2017-11-09 00:55:53,@535047 Can you please confirm the requested d...,,2149595.0,can you please confirm the requested details i...,confirm request detail dm thank,"['AUX', 'PRON', 'INTJ', 'VERB', 'DET', 'VERB',...","['aux', 'nsubj', 'intj', 'ROOT', 'det', 'amod'...",0.3612,[],Positive,8.0
1,1561881,482478,True,2017-11-04 04:57:53,What's with the pricing @GoDaddyHelp?\n$12.96....,156188215618831561880,,what s with the pricing ok t cs state ican ok ...,s pricing ok t cs state ican ok add option ok ...,"['PRON', 'VERB', 'ADP', 'DET', 'NOUN', 'INTJ',...","['nsubj', 'csubj', 'prep', 'det', 'pobj', 'pre...",0.6808,[],Positive,5.0
2,1956537,SpotifyCares,False,2017-10-30 17:24:43,@580626 Hey Henry! It's an easter egg for the ...,,1956538.0,hey henry it s an easter egg for the netflix s...,hey henry s easter egg netflix strange thing t...,"['INTJ', 'INTJ', 'PRON', 'VERB', 'DET', 'ADJ',...","['intj', 'intj', 'nsubj', 'ROOT', 'det', 'amod...",-0.3818,"[('henry s', 'PERSON'), ('netflix', 'GPE')]",Negative,1.0
3,1495379,AskCiti,False,2017-11-05 00:32:31,@467204 Hello. We haven't heard from u. If u s...,,1495377.0,hello we haven t heard from u if u still requi...,hello haven t hear u u require assistance pls ...,"['INTJ', 'PRON', 'VERB', 'PROPN', 'VERB', 'ADP...","['intj', 'nsubj', 'ROOT', 'nsubj', 'ccomp', 'p...",0.4215,[],Positive,3.0
4,582530,ATVIAssist,False,2017-12-03 13:19:03,"@257338 Apologies for the delay, please provid...",,582531.0,apologies for the delay please provide us more...,apology delay provide detail include gamer tag...,"['NOUN', 'ADP', 'DET', 'NOUN', 'INTJ', 'VERB',...","['nsubj', 'prep', 'det', 'pobj', 'intj', 'ROOT...",0.1531,[],Neutral,3.0


#### Preprocessing:

We will perform a series of steps to transform the data for modeling, which keeps the query-response pairs consistent. We will also add engineered features intended to help the response generation model.

In [6]:
# Retain necessary features
features_to_keep = ['tweet_id', 'author_id', 
                    'processed_text', 'sentiment', 'entities', 
                    'sentiment_class', 'topic', 'pos_tags', 'dep_parse']
# creating a separate df with important features
df_features = df[features_to_keep]
# linking queries and responses
df_preprocessed = link_queries_responses(df)
# merging back the retained features
df_preprocessed = df_preprocessed.merge(df_features, on=['tweet_id', 'author_id', 'processed_text'], how='left')
# dropping rows with NaN in response_processed_text
df_preprocessed.dropna(subset=['response_processed_text'], inplace=True)

# checking data quality
df_preprocessed.head()

2023-11-28 19:40:46,093 - INFO - Queries and responses linked successfully


Unnamed: 0,tweet_id,author_id,processed_text,created_at,response_processed_text,response_created_at,sentiment,entities,sentiment_class,topic,pos_tags,dep_parse
17,664563,278380,s movie yo,2017-11-21 00:59:51,able stream movie personal device flight check...,2017-11-21 01:18:25,0.0,[],Neutral,5.0,"['SCONJ', 'VERB', 'DET', 'NOUN', 'PROPN']","['advmod', 'ROOT', 'det', 'nsubj', 'nsubj']"
23,2647246,746880,account hack way time try fix refuse help get ...,2017-11-17 22:50:25,m sorry frustration receive e mail account spe...,2017-11-17 22:53:29,0.128,[],Neutral,1.0,"['PRON', 'NOUN', 'AUX', 'VERB', 'ADV', 'ADV', ...","['poss', 'nsubjpass', 'auxpass', 'ccomp', 'adv..."
30,2192747,641784,soon flight check bag,2017-11-09 19:22:43,min prior departure count bag late check kr,2017-11-09 19:27:54,0.0,[],Neutral,4.0,"['SCONJ', 'ADV', 'ADP', 'DET', 'NOUN', 'AUX', ...","['advmod', 'advmod', 'prep', 'det', 'pobj', 'a..."
36,1820428,546520,escalate issue delivery guy roll order show re...,2017-10-29 07:33:39,kindly provide detail ll look issue appropriat...,2017-10-29 07:47:00,-0.296,[],Negative,5.0,"['INTJ', 'VERB', 'DET', 'NOUN', 'DET', 'NOUN',...","['intj', 'ROOT', 'det', 'dobj', 'det', 'compou..."
62,1067084,371817,dad accidentally account deactivate need activ...,2017-10-23 12:58:07,hi ve reply dm let s continue chat gu,2017-10-23 15:03:29,-0.34,[],Negative,5.0,"['PRON', 'NOUN', 'ADV', 'VERB', 'PRON', 'NOUN'...","['poss', 'nsubj', 'advmod', 'ROOT', 'poss', 'n..."
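
The linking step above relies on `link_queries_responses` from `utils`, which is not shown in this notebook. As a rough sketch of how such a linking could work, assuming customer queries are the rows with `inbound == True` and that each support reply points to its query through `in_response_to_tweet_id`, a self-merge in pandas is one option. The function name `link_pairs` and the toy rows are hypothetical and only serve the illustration.

```python
import pandas as pd

def link_pairs(df):
    """Pair each inbound query with the reply that references its tweet_id."""
    # Customer queries: inbound rows
    queries = df.loc[df["inbound"],
                     ["tweet_id", "author_id", "processed_text", "created_at"]]
    # Support replies: outbound rows, renamed so they can be joined on tweet_id
    responses = df.loc[~df["inbound"],
                       ["in_response_to_tweet_id", "processed_text", "created_at"]]
    responses = (
        responses
        .rename(columns={
            "in_response_to_tweet_id": "tweet_id",
            "processed_text": "response_processed_text",
            "created_at": "response_created_at",
        })
        .dropna(subset=["tweet_id"])
        .assign(tweet_id=lambda d: d["tweet_id"].astype("int64"))
    )
    # Inner join keeps only queries that actually received a reply
    return queries.merge(responses, on="tweet_id", how="inner")

# Toy data for illustration only
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "author_id": ["115712", "SupportCo", "115713"],
    "inbound": [True, False, True],
    "created_at": ["2017-11-01 10:00:00", "2017-11-01 10:05:00", "2017-11-01 11:00:00"],
    "processed_text": ["soon flight check bag", "min prior departure", "account hack"],
    "in_response_to_tweet_id": [None, 1.0, None],
})
pairs = link_pairs(toy)
print(pairs)
```

An inner join drops unanswered queries directly, which has the same effect as the left merge followed by `dropna` on `response_processed_text` used in the cell above.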


In [7]:
df_preprocessed = feature_engineering(df_preprocessed)

2023-11-28 19:41:03,436 - INFO - Feature Engineering function applied successfully


In [12]:
df_preprocessed.head(5)

Unnamed: 0,tweet_id,author_id,processed_text,created_at,response_processed_text,response_created_at,sentiment,entities,sentiment_class,topic,pos_tags,dep_parse,entity_count,text_length,unique_pos_count,sentence_complexity,vocab_diversity,product_entity_count
17,664563,278380,s movie yo,2017-11-21 00:59:51,able stream movie personal device flight check...,2017-11-21 01:18:25,0.0,[],Neutral,5.0,"['SCONJ', 'VERB', 'DET', 'NOUN', 'PROPN']","['advmod', 'ROOT', 'det', 'nsubj', 'nsubj']",0,10,0,0,1.0,0
23,2647246,746880,account hack way time try fix refuse help get ...,2017-11-17 22:50:25,m sorry frustration receive e mail account spe...,2017-11-17 22:53:29,0.128,[],Neutral,1.0,"['PRON', 'NOUN', 'AUX', 'VERB', 'ADV', 'ADV', ...","['poss', 'nsubjpass', 'auxpass', 'ccomp', 'adv...",0,72,0,0,1.0,0
30,2192747,641784,soon flight check bag,2017-11-09 19:22:43,min prior departure count bag late check kr,2017-11-09 19:27:54,0.0,[],Neutral,4.0,"['SCONJ', 'ADV', 'ADP', 'DET', 'NOUN', 'AUX', ...","['advmod', 'advmod', 'prep', 'det', 'pobj', 'a...",0,21,0,0,1.0,0
36,1820428,546520,escalate issue delivery guy roll order show re...,2017-10-29 07:33:39,kindly provide detail ll look issue appropriat...,2017-10-29 07:47:00,-0.296,[],Negative,5.0,"['INTJ', 'VERB', 'DET', 'NOUN', 'DET', 'NOUN',...","['intj', 'ROOT', 'det', 'dobj', 'det', 'compou...",0,64,0,0,0.9,0
62,1067084,371817,dad accidentally account deactivate need activ...,2017-10-23 12:58:07,hi ve reply dm let s continue chat gu,2017-10-23 15:03:29,-0.34,[],Negative,5.0,"['PRON', 'NOUN', 'ADV', 'VERB', 'PRON', 'NOUN'...","['poss', 'nsubj', 'advmod', 'ROOT', 'poss', 'n...",0,58,0,0,1.0,0


In [9]:
df_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9824 entries, 17 to 136449
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tweet_id                 9824 non-null   int64  
 1   author_id                9824 non-null   object 
 2   processed_text           9592 non-null   object 
 3   created_at               9824 non-null   object 
 4   response_processed_text  9824 non-null   object 
 5   response_created_at      9824 non-null   object 
 6   sentiment                9824 non-null   float64
 7   entities                 9824 non-null   object 
 8   sentiment_class          9824 non-null   object 
 9   topic                    9592 non-null   float64
 10  pos_tags                 9824 non-null   object 
 11  dep_parse                9824 non-null   object 
 12  entity_count             9824 non-null   int64  
 13  text_length              9824 non-null   int64  
 14  unique_pos_count         9

In [13]:
df_preprocessed.describe()

Unnamed: 0,tweet_id,sentiment,topic,entity_count,text_length,unique_pos_count,sentence_complexity,vocab_diversity,product_entity_count
count,9824.0,9824.0,9592.0,9824.0,9824.0,9824.0,9824.0,9824.0,9824.0
mean,1481305.0,0.06117,4.502189,0.0,50.779316,0.0,0.0,0.941806,0.0
std,869393.0,0.400826,2.263526,0.0,32.491739,0.0,0.0,0.161121,0.0
min,63.0,-0.9552,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,732643.0,-0.1027,3.0,0.0,27.0,0.0,0.0,0.928571,0.0
50%,1471785.0,0.0,5.0,0.0,48.0,0.0,0.0,1.0,0.0
75%,2238852.0,0.3612,6.0,0.0,68.0,0.0,0.0,1.0,0.0
max,2987314.0,0.9753,9.0,0.0,257.0,0.0,0.0,1.0,0.0
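
The summary above shows `entity_count`, `unique_pos_count`, `sentence_complexity`, and `product_entity_count` at zero for every row, even though `pos_tags` is clearly populated. One possible cause, which is an assumption because `feature_engineering` is defined in `utils` and not shown here, is that list-valued columns come back from the CSV as strings rather than lists. The sketch below parses such strings with `ast.literal_eval` before counting; the helper names are hypothetical.

```python
import ast

def parse_list_column(value):
    """Turn a stringified list such as "['NOUN', 'VERB']" back into a list."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # NaN or other non-text values
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

def basic_features(processed_text, pos_tags, entities):
    """Compute a few of the engineered features for a single row."""
    is_text = isinstance(processed_text, str)
    tokens = processed_text.split() if is_text else []
    return {
        "text_length": len(processed_text) if is_text else 0,  # characters
        "vocab_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "unique_pos_count": len(set(parse_list_column(pos_tags))),
        "entity_count": len(parse_list_column(entities)),
    }

row = basic_features(
    "s movie yo",
    "['SCONJ', 'VERB', 'DET', 'NOUN', 'PROPN']",
    "[]",
)
print(row)
```

For the first row shown earlier, this yields a `text_length` of 10 and a `vocab_diversity` of 1.0, matching the table, while `unique_pos_count` becomes 5 instead of 0.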


In [10]:
# saving preprocessed data to csv
folder_path2 = '../data/'  # Adjust the path as needed
file_name2 = 'processed/processed_data.csv'
full_path2 = save_data(df_preprocessed, folder_path2, file_name2)

if full_path2:
    print(f"DataFrame saved at: {full_path2}")
else:
    print("Failed to save the DataFrame.")

2023-11-29 11:51:25,014 - INFO - DataFrame saved successfully to ../data/processed/processed_data.csv


DataFrame saved at: ../data/processed/processed_data.csv


### LSTM-based Seq2Seq Model Development for Chatbot 'Yoldi'

This section of the notebook is dedicated to the development of an LSTM-based Seq2Seq model for the chatbot 'Yoldi', aimed at automating responses to customer queries. The process is divided into several key stages:

#### Model Preparation
- **Objective**: Set up the LSTM-based Seq2Seq model architecture.
- **Tasks**:
  - Define the Encoder and Decoder architecture.
  - Set up embedding layers for text processing.
  - Initialize LSTM layers and define model parameters.
#### Data Splitting
- **Objective**: Divide the dataset into training and testing subsets.
- **Methodology**:
  - Utilize the `train_test_split` method to segregate the data.
  - Ensure a balanced representation of data in both training and testing sets.
#### Model Training
- **Objective**: Train the Seq2Seq model on the dataset.
- **Approach**:
  - Feed the training data into the model.
  - Monitor performance metrics during the training process.
  - Employ validation checks to assess model learning.
#### Model Evaluation
- **Objective**: Evaluate the trained model's performance.
- **Techniques**:
  - Apply the model to the test dataset.
  - Analyze the model's accuracy, precision, recall, and other relevant metrics.
  - Use techniques like confusion matrix, ROC curve, etc., for deeper analysis.
#### Model Optimization
- **Objective**: Fine-tune the model for improved performance.
- **Strategies**:
  - Adjust model hyperparameters like learning rate, batch size, etc.
  - Experiment with different numbers of LSTM units and layers.
  - Explore regularization techniques to prevent overfitting.
#### Conclusion and Next Steps
- Summarize the findings from the model development process into the project report draft.
- Outline potential improvements, testing and deployment stages.
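
For the data splitting stage described above, a minimal sketch with scikit-learn's `train_test_split` could look like the following. The toy pairs, the 80/20 ratio, and the `random_state` value are assumptions for illustration, not settings taken from this project.

```python
from sklearn.model_selection import train_test_split

# Toy query/response pairs standing in for the real dataset
queries = [f"query {i}" for i in range(10)]
responses = [f"response {i}" for i in range(10)]

# Splitting both lists in one call keeps each query aligned with its response
train_q, test_q, train_r, test_r = train_test_split(
    queries, responses, test_size=0.2, random_state=42
)

print(len(train_q), "training pairs,", len(test_q), "test pairs")
```

Passing both lists to the same call matters here: splitting them separately would shuffle queries and responses independently and break the pairing.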

In [5]:
file_path3 = '../data/processed/processed_data.csv'
df_modeling = load_data(file_path=file_path3)

2023-11-29 12:32:41,542 - INFO - Starting execution of load_data
2023-11-29 12:32:41,644 - INFO - Data loading completed successfully


In [None]:
import logging
import pandas as pd

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def remove_nan_values(df, column_name):
    """
    Remove rows from the DataFrame where the specified column has NaN values.
    
    Args:
    df (pd.DataFrame): The DataFrame to clean.
    column_name (str): The name of the column to check for NaN values.

    Returns:
    pd.DataFrame: A DataFrame with NaN values removed from the specified column.
    """
    try:
        # Check if column exists in DataFrame
        if column_name not in df.columns:
            raise ValueError(f"Column '{column_name}' not found in DataFrame")

        # Count NaN values before removal
        nan_count_before = df[column_name].isna().sum()

        # Remove NaN values
        cleaned_df = df.dropna(subset=[column_name])

        # Count NaN values after removal
        nan_count_after = cleaned_df[column_name].isna().sum()

        logging.info(f"Removed {nan_count_before - nan_count_after} rows with NaN values from '{column_name}' column.")
        return cleaned_df

    except Exception as e:
        logging.error(f"Error in remove_nan_values function: {e}")
        raise


In [6]:
try:
    input_texts = df_modeling['processed_text'].tolist()
    target_texts = ['\t' + text for text in df_modeling['response_processed_text'].tolist()]  # '\t' as start token for responses

    logging.info("Dataset loaded successfully with %d pairs.", len(input_texts))
except Exception as e:
    logging.error("Error loading dataset: %s", e)
    raise

2023-11-29 16:44:11,721 - INFO - Dataset loaded successfully with 9824 pairs.
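
One detail worth checking in the cell above: with its default `filters` argument, the Keras `Tokenizer` strips tab and newline characters, so a `'\t'` prefix may not survive word-level tokenization as a start token. A minimal sketch of an alternative, assuming hypothetical word markers `startseq` and `endseq` that do not occur in the corpus:

```python
START_TOKEN = "startseq"  # hypothetical marker, assumed absent from the corpus
END_TOKEN = "endseq"      # hypothetical marker, assumed absent from the corpus

def add_boundary_tokens(text):
    """Wrap a response in explicit word-level start and end markers."""
    return f"{START_TOKEN} {text.strip()} {END_TOKEN}"

def teacher_forcing_pair(tokens):
    """Split a marked-up token list into decoder input and decoder target."""
    # Input drops the last token, target drops the first (one-step offset)
    return tokens[:-1], tokens[1:]

marked = add_boundary_tokens("min prior departure count bag")
decoder_in, decoder_out = teacher_forcing_pair(marked.split())
print(decoder_in)
print(decoder_out)
```

An explicit end marker also gives the decoder a stopping signal at inference time, which a start token alone does not provide.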


In [7]:
# function to fit tokenizer and return sequences and word index
def tokenize_texts(texts):
    """
    Fit a Keras Tokenizer on the given texts and convert them to integer sequences.
    Args:
    texts: array-like of strings to tokenize.
    Returns:
    tuple: (tokenizer, sequences, word_index), where sequences is a list of
    integer lists and word_index maps each word to its integer id.
    """
    try:
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(texts)
        sequences = tokenizer.texts_to_sequences(texts)
        word_index = tokenizer.word_index
        return tokenizer, sequences, word_index
    except Exception as e:
        logging.error("Error during tokenization: %s", e)
        raise

In [9]:
# Function to prepare the encoder and decoder data
def prepare_data(input_texts, target_texts, num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length):
    try:
        input_tokenizer, input_sequences, _ = tokenize_texts(input_texts)
        target_tokenizer, target_sequences, _ = tokenize_texts(target_texts)
        
        encoder_input_data = pad_sequences(input_sequences, maxlen=max_encoder_seq_length, padding='post')
        decoder_input_data = pad_sequences(target_sequences, maxlen=max_decoder_seq_length, padding='post')

        # Initialize decoder_target_data
        decoder_target_data = np.zeros((len(target_sequences), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
        
        for i, seq in enumerate(target_sequences):
            for t, token in enumerate(seq):
                if t > 0:  # decoder_target_data will be ahead by one timestep and will not include the start token.
                    decoder_target_data[i, t - 1, token] = 1.
        
        logging.info("Data prepared successfully.")
        return encoder_input_data, decoder_input_data, decoder_target_data
    except Exception as e:
        logging.error("Error during data preparation: %s", e)
        raise
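
The one-step offset between decoder input and decoder target can be seen on a toy example. This sketch mirrors the loop in `prepare_data`, with the added assumption that position 0 of each sequence holds the start token. It also bounds the slice so that sequences longer than `max_len` cannot index out of range. The function name and the toy ids are hypothetical.

```python
import numpy as np

def one_hot_shifted_targets(sequences, max_len, vocab_size):
    """Build one-hot decoder targets that run one timestep ahead of the input."""
    targets = np.zeros((len(sequences), max_len, vocab_size), dtype="float32")
    for i, seq in enumerate(sequences):
        # Skip the start token at position 0; keep at most max_len tokens
        for t, token in enumerate(seq[1:max_len + 1]):
            targets[i, t, token] = 1.0
    return targets

# Toy integer sequences where id 1 plays the role of the start token
toy_sequences = [[1, 2, 3], [1, 4]]
toy_targets = one_hot_shifted_targets(toy_sequences, max_len=3, vocab_size=5)
print(toy_targets.shape)
```

Timesteps beyond the end of a sequence stay all-zero, which is the same padding behaviour as in the original loop.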

In [8]:
# Function to create a Seq2Seq model
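def create_seq2seq_model(num_encoder_tokens, num_decoder_tokens, latent_dim=256):
    """
    Build an LSTM encoder-decoder model for training with teacher forcing.

    Both inputs are expected as one-hot tensors of shape
    (batch, timesteps, num_tokens); the integer sequences returned by
    prepare_data would need to be one-hot encoded first.
    """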
    try:
        # Define encoder
        encoder_inputs = Input(shape=(None, num_encoder_tokens))
        encoder = LSTM(latent_dim, return_state=True)
        encoder_outputs, state_h, state_c = encoder(encoder_inputs)
        encoder_states = [state_h, state_c]

        # Define decoder
        decoder_inputs = Input(shape=(None, num_decoder_tokens))
        decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
        decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
        decoder_dense = Dense(num_decoder_tokens, activation='softmax')
        decoder_outputs = decoder_dense(decoder_outputs)

        # Define the Seq2Seq model
        model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
        logging.info("Seq2Seq model created successfully.")
        return model
    except Exception as e:
        logging.error("Error during model creation: %s", e)
        raise
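
Because the model's `Input` layers declare a trailing dimension of `num_encoder_tokens` and `num_decoder_tokens`, they expect one-hot tensors, whereas `pad_sequences` returns 2-D integer arrays. One way to bridge the two is sketched below with NumPy on a toy array; the helper name `one_hot_encode` is hypothetical. Adding an `Embedding` layer in front of each LSTM would be an alternative that avoids materialising the one-hot tensors.

```python
import numpy as np

def one_hot_encode(padded_sequences, num_tokens):
    """Convert (batch, timesteps) integer ids to (batch, timesteps, num_tokens)."""
    identity = np.eye(num_tokens, dtype="float32")
    # Indexing the identity matrix with the id array selects one row per id
    return identity[np.asarray(padded_sequences)]

padded = np.array([[1, 2, 0], [3, 0, 0]])  # toy padded ids, 0 = padding
encoded = one_hot_encode(padded, num_tokens=4)
print(encoded.shape)
```

Note that padding id 0 is encoded like any other token here, and that the resulting array grows with batch size, sequence length, and vocabulary size, so memory use is worth checking on the full dataset.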

In [11]:
# Function to find indices of float values in lists
def find_floats_in_list(text_list):
    float_indices = [index for index, item in enumerate(text_list) if isinstance(item, float)]
    return float_indices


# Find indices of float values
input_float_indices = find_floats_in_list(input_texts)
target_float_indices = find_floats_in_list(target_texts)

# Print out the indices
print("Input text float indices:", input_float_indices)
print("Target text float indices:", target_float_indices)


Input text float indices: [22, 30, 65, 67, 99, 122, 196, 197, 247, 276, 318, 378, 445, 504, 505, 516, 579, 581, 629, 834, 867, 880, 908, 947, 1022, 1181, 1197, 1232, 1287, 1392, 1415, 1417, 1424, 1524, 1525, 1554, 1585, 1678, 1688, 1707, 1710, 1727, 1778, 1790, 1822, 1895, 1900, 1911, 2014, 2015, 2067, 2070, 2072, 2164, 2187, 2265, 2328, 2334, 2347, 2437, 2472, 2496, 2568, 2580, 2616, 2668, 2686, 2751, 2808, 2819, 2840, 2848, 2913, 2932, 3091, 3108, 3109, 3117, 3181, 3201, 3202, 3285, 3386, 3428, 3477, 3538, 3559, 3642, 3693, 3727, 3751, 3767, 3856, 3871, 3890, 3953, 3979, 4017, 4029, 4054, 4070, 4076, 4100, 4329, 4599, 4606, 4618, 4715, 4774, 4786, 4851, 4901, 4955, 4997, 5018, 5031, 5061, 5065, 5071, 5188, 5214, 5224, 5239, 5283, 5290, 5297, 5331, 5441, 5552, 5569, 5638, 5692, 5695, 5717, 5805, 5835, 5854, 5868, 5872, 5937, 5957, 5971, 6026, 6073, 6183, 6195, 6223, 6240, 6270, 6278, 6299, 6300, 6349, 6388, 6463, 6495, 6586, 6590, 6599, 6613, 6672, 6714, 6735, 6751, 6865, 6934, 6963, 

In [12]:
# Print out examples of float values in the input texts
for index in input_float_indices[:10]:  # Adjust the slice as needed
    print(f"Index: {index}, Value: {input_texts[index]}")

Index: 22, Value: nan
Index: 30, Value: nan
Index: 65, Value: nan
Index: 67, Value: nan
Index: 99, Value: nan
Index: 122, Value: nan
Index: 196, Value: nan
Index: 197, Value: nan
Index: 247, Value: nan
Index: 276, Value: nan


In [10]:
try:
    # Defining hyperparameters and sequence lengths
    max_encoder_seq_length = max(len(txt) for txt in input_texts if txt is not None)
    max_decoder_seq_length = max(len(txt) for txt in target_texts if txt is not None)
    num_encoder_tokens = max(max(seq) for seq in input_sequences if seq) + 1  # +1 for padding
    num_decoder_tokens = max(max(seq) for seq in target_sequences if seq) + 1

    logging.info(f'Max encoder sequence length: {max_encoder_seq_length}')
    logging.info(f'Max decoder sequence length: {max_decoder_seq_length}')
    logging.info(f'Number of encoder tokens: {num_encoder_tokens}')
    logging.info(f'Number of decoder tokens: {num_decoder_tokens}')
except Exception as e:
    logging.error('Error calculating sequence lengths or number of tokens: %s', e)
    raise

2023-11-29 16:59:53,299 - ERROR - Error calculating sequence lengths or number of tokens: object of type 'float' has no len()


TypeError: object of type 'float' has no len()
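
The `TypeError` above comes from missing `processed_text` values, which pandas loads as float `nan`; a `txt is not None` check does not filter them out. The cell also measures length in characters rather than tokens, and it refers to `input_sequences` and `target_sequences`, which appear to be defined only inside `prepare_data`. A minimal sketch of one way to clean the pairs and measure token lengths, using hypothetical helper names and toy inputs:

```python
import math

def is_missing(value):
    """True for None and for float NaN, which is how pandas represents missing text."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_missing_pairs(inputs, targets):
    """Keep only (query, response) pairs where neither side is missing."""
    kept = [(q, r) for q, r in zip(inputs, targets)
            if not is_missing(q) and not is_missing(r)]
    return [q for q, _ in kept], [r for _, r in kept]

def max_token_length(texts):
    """Longest text measured in whitespace-separated tokens, not characters."""
    return max((len(text.split()) for text in texts), default=0)

# Toy stand-ins for input_texts / target_texts
raw_inputs = ["soon flight check bag", float("nan"), "s movie yo"]
raw_targets = ["\tmin prior departure", "\thi ve reply dm", "\table stream movie"]

clean_inputs, clean_targets = drop_missing_pairs(raw_inputs, raw_targets)
print(len(clean_inputs), "pairs kept; longest query has",
      max_token_length(clean_inputs), "tokens")
```

Filtering both lists together keeps queries and responses aligned. The `remove_nan_values` helper defined earlier could serve the same purpose if applied to `df_modeling` before the lists are built.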

In [None]:
import logging
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
import numpy as np
import pandas as pd

# Configure logging
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)

# Function to fit tokenizer and return sequences and word index
def tokenize_texts(texts):
    try:
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(texts)
        sequences = tokenizer.texts_to_sequences(texts)
        word_index = tokenizer.word_index
        return tokenizer, sequences, word_index
    except Exception as e:
        logging.error("Error during tokenization: %s", e)
        raise

# Function to prepare the encoder and decoder data
def prepare_data(input_texts, target_texts, num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length):
    try:
        input_tokenizer, input_sequences, _ = tokenize_texts(input_texts)
        target_tokenizer, target_sequences, _ = tokenize_texts(target_texts)
        
        encoder_input_data = pad_sequences(input_sequences, maxlen=max_encoder_seq_length, padding='post')
        decoder_input_data = pad_sequences(target_sequences, maxlen=max_decoder_seq_length, padding='post')

        # Initialize decoder_target_data
        decoder_target_data = np.zeros((len(target_sequences), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
        
        for i, seq in enumerate(target_sequences):
            for t, token in enumerate(seq):
                if t > 0:  # decoder_target_data will be ahead by one timestep and will not include the start token.
                    decoder_target_data[i, t - 1, token] = 1.
        
        logging.info("Data prepared successfully.")
        return encoder_input_data, decoder_input_data, decoder_target_data
    except Exception as e:
        logging.error("Error during data preparation: %s", e)
        raise

# Function to create a Seq2Seq model
def create_seq2seq_model(num_encoder_tokens, num_decoder_tokens, latent_dim=256):
    try:
        # Define encoder
        encoder_inputs = Input(shape=(None, num_encoder_tokens))
        encoder = LSTM(latent_dim, return_state=True)
        encoder_outputs, state_h, state_c = encoder(encoder_inputs)
        encoder_states = [state_h, state_c]

        # Define decoder
        decoder_inputs = Input(shape=(None, num_decoder_tokens))
        decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
        decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
        decoder_dense = Dense(num_decoder_tokens, activation='softmax')
        decoder_outputs = decoder_dense(decoder_outputs)

        # Define the Seq2Seq model
        model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
        logging.info("Seq2Seq model created successfully.")
        return model
    except Exception as e:
        logging.error("Error during model creation: %s", e)
        raise

# Example of loading data from a file (assuming CSV format)
# Here, we need to load your preprocessed dataset to fill `input_texts` and `target_texts`
try:
    # Load the dataset
    dataset_path = '/path/to/your/dataset.csv'  # Update with your dataset path
    data = pd.read_csv(dataset_path)
    
    # Assuming 'processed_text' column is for input and 'response_processed_text' column is for target
    input_texts = data['processed_text'].tolist()
    target_texts = ['\t' + text for text in data['response_processed_text'].tolist()]  # '\t' as start token for responses

    logging.info("Dataset loaded successfully with %d pairs.", len(input_texts))
except Exception as e:
    logging.error("Error loading dataset: %s", e)
    raise

# Continue with the rest of your code for data preparation and model training
# ...
