## Preprocessing and Modeling: Chatbot 'Yoldi'

This notebook covers the preprocessing and modeling phases of the 'Yoldi' chatbot. It transforms the cleaned data into a format suitable for training machine learning models and develops the chatbot's response generation system.

### 1. Preprocessing
- **Objective**: Prepare and refine the data for model training and response system development.
- **Steps Involved**:
  - Apply the `link_queries_responses` function to link customer queries to their corresponding Customer Support responses in the cleaned dataset.
  - Further preprocess the linked data to ensure consistency and usability for model training.
  - Extract features relevant to the chatbot's response system, such as topics, sentiment, and named entities.

### 2. Intent Recognition and Response Generation
- **Objective**: Develop mechanisms to recognize user intent and generate appropriate responses.
- **Approach**:
  - Explore various NLP techniques and algorithms for intent recognition without labeled data.
  - Implement and evaluate different models for response generation, considering the context and intent of user queries.
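
One common way to approach intent recognition without labels is to cluster the query texts and treat each cluster as a candidate intent. The sketch below assumes TF-IDF features and k-means from scikit-learn. The sample queries, the number of clusters, and the function name `cluster_intents` are illustrative assumptions and not part of this project's code.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_intents(texts, n_clusters=2, random_state=42):
    """Group texts into candidate intents with TF-IDF + k-means (illustrative)."""
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(features)
    return labels, vectorizer, model


# Hypothetical queries written in the same lemmatized style as `processed_text`
sample_queries = [
    "account hack reset password",
    "reset password account lock",
    "flight delay check bag",
    "check bag flight departure",
]
labels, vectorizer, model = cluster_intents(sample_queries, n_clusters=2)
for text, label in zip(sample_queries, labels):
    print(label, text)
```

Cluster ids carry no meaning on their own; in practice each cluster would still need to be inspected and named before it can serve as an intent.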

### 3. Model Training and Evaluation
- **Objective**: Train and evaluate models for the chatbot's core functionalities.
- **Methodology**:
  - Train models for topic modeling, sentiment analysis, and intent recognition.
  - Evaluate models using appropriate metrics to ensure effectiveness and accuracy.
  - Fine-tune models based on evaluation results to improve performance.
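
As a rough illustration of the train-and-evaluate loop described above, the sketch below fits a small sentiment classifier and scores it on a held-out split. The toy texts and labels are made up for the example, and the choice of TF-IDF with logistic regression is an assumption; in the notebook the inputs would come from `processed_text` and `sentiment_class`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for `processed_text` / `sentiment_class`
texts = [
    "thank great help", "love fast service thank",
    "great support quick reply", "happy issue fix thank",
    "bad service no help", "angry order late refund",
    "terrible app crash again", "worst support no reply",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# Hold out a stratified test split so both classes appear in it
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

With this little data the scores are not meaningful; the point is only the shape of the split, fit, predict, and score steps.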

### 4. Response System Integration
- **Objective**: Integrate trained models to create a coherent response system for the chatbot.
- **Details**:
  - Combine models to interpret user queries and generate relevant responses.
  - Implement logic to handle various types of queries and maintain contextual relevance.
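
One simple way to combine linked query/response pairs into a response system is retrieval: find the stored query most similar to the user's message and return its response. The sketch below assumes TF-IDF cosine similarity. The class name `RetrievalResponder`, the similarity threshold, the fallback message, and the sample pairs are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class RetrievalResponder:
    """Return the stored response whose query is most similar to the input."""

    def __init__(self, queries, responses, min_similarity=0.1,
                 fallback="Sorry, I did not understand that. Could you rephrase?"):
        self.responses = list(responses)
        self.min_similarity = min_similarity
        self.fallback = fallback
        self.vectorizer = TfidfVectorizer()
        self.query_matrix = self.vectorizer.fit_transform(queries)

    def respond(self, text):
        # Similarity between the incoming text and every stored query
        vector = self.vectorizer.transform([text])
        scores = cosine_similarity(vector, self.query_matrix)[0]
        best = scores.argmax()
        # Fall back when nothing is close enough to any known query
        if scores[best] < self.min_similarity:
            return self.fallback
        return self.responses[best]


# Hypothetical linked pairs in the style of `processed_text` / `response_processed_text`
queries = ["account hack need help", "soon flight check bag"]
responses = ["sorry send account detail dm", "check bag close before departure"]

responder = RetrievalResponder(queries, responses)
print(responder.respond("check bag flight"))
print(responder.respond("weather today"))
```

The threshold gives the system an explicit path for queries it has never seen, which is one way to handle the "various types of queries" mentioned above.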

### 5. Prototyping and Testing
- **Objective**: Prototype the chatbot and conduct initial testing.
- **Process**:
  - Develop a basic User Interface for interacting with the chatbot.
  - Conduct test runs to assess the chatbot's response accuracy and coherence.
  - Gather feedback and insights for further improvement.

In [1]:
import sys
sys.path.append('../scripts/')  

In [2]:
import logging
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import nltk
import re
from utils import *
from sklearn.model_selection import train_test_split

In [3]:
# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

#### Loading Cleaned Dataset

In [4]:
file_path = '../data/interim/cleaned_data.csv'
df = load_data(file_path=file_path)

2023-11-28 19:40:32,880 - INFO - Starting execution of load_data
2023-11-28 19:40:35,440 - INFO - Data loading completed successfully


In [5]:
df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,cleaned_text,processed_text,pos_tags,dep_parse,sentiment,entities,sentiment_class,topic
0,2149594,Tesco,False,2017-11-09 00:55:53,@535047 Can you please confirm the requested d...,,2149595.0,can you please confirm the requested details i...,confirm request detail dm thank,"['AUX', 'PRON', 'INTJ', 'VERB', 'DET', 'VERB',...","['aux', 'nsubj', 'intj', 'ROOT', 'det', 'amod'...",0.3612,[],Positive,8.0
1,1561881,482478,True,2017-11-04 04:57:53,What's with the pricing @GoDaddyHelp?\n$12.96....,156188215618831561880,,what s with the pricing ok t cs state ican ok ...,s pricing ok t cs state ican ok add option ok ...,"['PRON', 'VERB', 'ADP', 'DET', 'NOUN', 'INTJ',...","['nsubj', 'csubj', 'prep', 'det', 'pobj', 'pre...",0.6808,[],Positive,5.0
2,1956537,SpotifyCares,False,2017-10-30 17:24:43,@580626 Hey Henry! It's an easter egg for the ...,,1956538.0,hey henry it s an easter egg for the netflix s...,hey henry s easter egg netflix strange thing t...,"['INTJ', 'INTJ', 'PRON', 'VERB', 'DET', 'ADJ',...","['intj', 'intj', 'nsubj', 'ROOT', 'det', 'amod...",-0.3818,"[('henry s', 'PERSON'), ('netflix', 'GPE')]",Negative,1.0
3,1495379,AskCiti,False,2017-11-05 00:32:31,@467204 Hello. We haven't heard from u. If u s...,,1495377.0,hello we haven t heard from u if u still requi...,hello haven t hear u u require assistance pls ...,"['INTJ', 'PRON', 'VERB', 'PROPN', 'VERB', 'ADP...","['intj', 'nsubj', 'ROOT', 'nsubj', 'ccomp', 'p...",0.4215,[],Positive,3.0
4,582530,ATVIAssist,False,2017-12-03 13:19:03,"@257338 Apologies for the delay, please provid...",,582531.0,apologies for the delay please provide us more...,apology delay provide detail include gamer tag...,"['NOUN', 'ADP', 'DET', 'NOUN', 'INTJ', 'VERB',...","['nsubj', 'prep', 'det', 'pobj', 'intj', 'ROOT...",0.1531,[],Neutral,3.0


#### Preprocessing

We will perform a series of steps to transform the data for modeling. Doing so keeps the linked responses consistent. We will also add engineered features intended to improve response generation.

In [6]:
# Features to carry over from the cleaned dataset
features_to_keep = ['tweet_id', 'author_id',
                    'processed_text', 'sentiment', 'entities',
                    'sentiment_class', 'topic', 'pos_tags', 'dep_parse']
# Keep a separate frame with only these features
df_features = df[features_to_keep]
# Link each customer query to its support response
df_preprocessed = link_queries_responses(df)
# Merge the retained features back onto the linked queries
df_preprocessed = df_preprocessed.merge(df_features, on=['tweet_id', 'author_id', 'processed_text'], how='left')
# Drop queries that have no linked response
df_preprocessed = df_preprocessed.dropna(subset=['response_processed_text'])

# Preview the result
df_preprocessed.head()

2023-11-28 19:40:46,093 - INFO - Queries and responses linked successfully


Unnamed: 0,tweet_id,author_id,processed_text,created_at,response_processed_text,response_created_at,sentiment,entities,sentiment_class,topic,pos_tags,dep_parse
17,664563,278380,s movie yo,2017-11-21 00:59:51,able stream movie personal device flight check...,2017-11-21 01:18:25,0.0,[],Neutral,5.0,"['SCONJ', 'VERB', 'DET', 'NOUN', 'PROPN']","['advmod', 'ROOT', 'det', 'nsubj', 'nsubj']"
23,2647246,746880,account hack way time try fix refuse help get ...,2017-11-17 22:50:25,m sorry frustration receive e mail account spe...,2017-11-17 22:53:29,0.128,[],Neutral,1.0,"['PRON', 'NOUN', 'AUX', 'VERB', 'ADV', 'ADV', ...","['poss', 'nsubjpass', 'auxpass', 'ccomp', 'adv..."
30,2192747,641784,soon flight check bag,2017-11-09 19:22:43,min prior departure count bag late check kr,2017-11-09 19:27:54,0.0,[],Neutral,4.0,"['SCONJ', 'ADV', 'ADP', 'DET', 'NOUN', 'AUX', ...","['advmod', 'advmod', 'prep', 'det', 'pobj', 'a..."
36,1820428,546520,escalate issue delivery guy roll order show re...,2017-10-29 07:33:39,kindly provide detail ll look issue appropriat...,2017-10-29 07:47:00,-0.296,[],Negative,5.0,"['INTJ', 'VERB', 'DET', 'NOUN', 'DET', 'NOUN',...","['intj', 'ROOT', 'det', 'dobj', 'det', 'compou..."
62,1067084,371817,dad accidentally account deactivate need activ...,2017-10-23 12:58:07,hi ve reply dm let s continue chat gu,2017-10-23 15:03:29,-0.34,[],Negative,5.0,"['PRON', 'NOUN', 'ADV', 'VERB', 'PRON', 'NOUN'...","['poss', 'nsubj', 'advmod', 'ROOT', 'poss', 'n..."
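
The `link_queries_responses` helper comes from `utils` and is not shown in this notebook. A minimal sketch of how such linking could work is given below, assuming that inbound tweets are customer queries and that each support reply points at its query through `in_response_to_tweet_id`. The function name `link_queries_to_responses` and the sample rows are hypothetical, and the project's real helper may differ.

```python
import pandas as pd


def link_queries_to_responses(tweets: pd.DataFrame) -> pd.DataFrame:
    """Pair each inbound customer tweet with the support reply that answers it."""
    queries = tweets.loc[
        tweets["inbound"],
        ["tweet_id", "author_id", "processed_text", "created_at"],
    ]
    replies = (
        tweets.loc[
            ~tweets["inbound"],
            ["in_response_to_tweet_id", "processed_text", "created_at"],
        ]
        # Replies without a parent id cannot be linked to any query
        .dropna(subset=["in_response_to_tweet_id"])
        # The id column is float because of NaNs; align it with `tweet_id`
        .astype({"in_response_to_tweet_id": "int64"})
        .rename(columns={
            "in_response_to_tweet_id": "tweet_id",
            "processed_text": "response_processed_text",
            "created_at": "response_created_at",
        })
    )
    # Left join keeps unanswered queries with NaN in the response columns
    return queries.merge(replies, on="tweet_id", how="left")


# Hypothetical sample rows
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "author_id": ["115712", "SupportCo", "115713"],
    "inbound": [True, False, True],
    "created_at": ["2017-11-01 10:00:00", "2017-11-01 10:05:00", "2017-11-01 11:00:00"],
    "processed_text": ["account hack need help", "sorry send detail dm", "soon flight check bag"],
    "in_response_to_tweet_id": [None, 1.0, None],
})
linked = link_queries_to_responses(sample)
print(linked)
```

Because the join is a left join, unanswered queries survive with missing response fields, which is why a `dropna` on `response_processed_text` is still needed afterwards.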


In [7]:
df_preprocessed = feature_engineering(df_preprocessed)

2023-11-28 19:41:03,436 - INFO - Feature Engineering function applied successfully


In [8]:
df_preprocessed.head()

Unnamed: 0,tweet_id,author_id,processed_text,created_at,response_processed_text,response_created_at,sentiment,entities,sentiment_class,topic,pos_tags,dep_parse,entity_count,text_length,unique_pos_count,sentence_complexity,vocab_diversity,product_entity_count
17,664563,278380,s movie yo,2017-11-21 00:59:51,able stream movie personal device flight check...,2017-11-21 01:18:25,0.0,[],Neutral,5.0,"['SCONJ', 'VERB', 'DET', 'NOUN', 'PROPN']","['advmod', 'ROOT', 'det', 'nsubj', 'nsubj']",0,10,0,0,1.0,0
23,2647246,746880,account hack way time try fix refuse help get ...,2017-11-17 22:50:25,m sorry frustration receive e mail account spe...,2017-11-17 22:53:29,0.128,[],Neutral,1.0,"['PRON', 'NOUN', 'AUX', 'VERB', 'ADV', 'ADV', ...","['poss', 'nsubjpass', 'auxpass', 'ccomp', 'adv...",0,72,0,0,1.0,0
30,2192747,641784,soon flight check bag,2017-11-09 19:22:43,min prior departure count bag late check kr,2017-11-09 19:27:54,0.0,[],Neutral,4.0,"['SCONJ', 'ADV', 'ADP', 'DET', 'NOUN', 'AUX', ...","['advmod', 'advmod', 'prep', 'det', 'pobj', 'a...",0,21,0,0,1.0,0
36,1820428,546520,escalate issue delivery guy roll order show re...,2017-10-29 07:33:39,kindly provide detail ll look issue appropriat...,2017-10-29 07:47:00,-0.296,[],Negative,5.0,"['INTJ', 'VERB', 'DET', 'NOUN', 'DET', 'NOUN',...","['intj', 'ROOT', 'det', 'dobj', 'det', 'compou...",0,64,0,0,0.9,0
62,1067084,371817,dad accidentally account deactivate need activ...,2017-10-23 12:58:07,hi ve reply dm let s continue chat gu,2017-10-23 15:03:29,-0.34,[],Negative,5.0,"['PRON', 'NOUN', 'ADV', 'VERB', 'PRON', 'NOUN'...","['poss', 'nsubj', 'advmod', 'ROOT', 'poss', 'n...",0,58,0,0,1.0,0
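
In the preview above, `unique_pos_count` and `sentence_complexity` are 0 for every row even though `pos_tags` and `dep_parse` are populated. One possible explanation is that list-valued columns come back from CSV as strings, so they may need to be parsed before counting. The sketch below shows that idea with `ast.literal_eval`. The `feature_engineering` helper itself is not shown in this notebook, so the feature definitions here are guesses that happen to match the preview (for example, a `text_length` of 10 for `s movie yo`), and the names `parse_list_column` and `add_basic_features` are hypothetical.

```python
import ast

import pandas as pd


def parse_list_column(value):
    """Turn a stringified list (as read back from CSV) into a real list."""
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []
        return parsed if isinstance(parsed, list) else []
    return value if isinstance(value, list) else []


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add simple count-based features (assumed definitions, see lead-in)."""
    out = df.copy()
    text = out["processed_text"].fillna("")
    pos = out["pos_tags"].apply(parse_list_column)
    entities = out["entities"].apply(parse_list_column)
    tokens = text.str.split()

    out["entity_count"] = entities.apply(len)
    out["text_length"] = text.str.len()  # length in characters
    out["unique_pos_count"] = pos.apply(lambda tags: len(set(tags)))
    out["vocab_diversity"] = tokens.apply(
        lambda t: len(set(t)) / len(t) if t else 0.0  # unique / total tokens
    )
    return out


# First row mirrors the preview above; the second row is hypothetical
sample = pd.DataFrame({
    "processed_text": ["s movie yo", "hey henry hey"],
    "pos_tags": ["['SCONJ', 'VERB', 'DET', 'NOUN', 'PROPN']", "['INTJ', 'PROPN', 'INTJ']"],
    "entities": ["[]", "[('henry', 'PERSON')]"],
})
features = add_basic_features(sample)
print(features[["entity_count", "text_length", "unique_pos_count", "vocab_diversity"]])
```

Checking `type(df_preprocessed['pos_tags'].iloc[0])` is a quick way to see whether the column holds lists or strings before relying on any count derived from it.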
