
# **Aim of the Project**
The aim of this project is to build an intelligent conversational chatbot, Riki, that can understand complex user queries and respond intelligently.

## **Background**
R-Intelligence Inc., an AI startup, has partnered with bluedit.io, an online chat and discussion website. bluedit.io averages more than 5 million active customers across the globe and more than 100,000 active chat rooms. Because of the increased traffic, they want to improve the user experience with a chatbot moderator, Riki, that engages users in meaningful conversation and keeps them updated on trending topics while they chat. The Artificial Intelligence-powered chat experience gives customers easy access to information and a host of options.

## **Business Requirement**
R-Intelligence Inc. has invested in Python, PySpark, and TensorFlow. Using Artificial Intelligence, Machine Learning, and Natural Language Processing, Riki should make the whole conversation feel as realistic as talking to an actual human. The chatbot should understand that users have different intents and make it simple to handle them by presenting users with the options and recommendations that best suit their needs.

## **Suggested Approach**
R-Intelligence Inc. takes an approach based only on Natural Language Processing, in which Seq2seq models (an encoder and a decoder) are used as the state-of-the-art approach to implement end-to-end text generation for a conversational bot.


### **Tasks to be performed**
* Download the GloVe model available at https://nlp.stanford.edu/projects/glove/
  * Specification: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip
* Load the GloVe word embeddings into a dictionary where the key is a unique word token and the value is a d-dimensional vector
* Data Preparation: filter the conversations up to a maximum word length and convert the dialogue pairs into input texts and target texts. Add start and end tokens to mark the beginning and end of each sentence
* Create two dictionaries and save them to disk in NumPy file format:
  * target_word2id
  * target_id2word
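
The data preparation tasks above could look roughly like the following sketch. The marker tokens `<start>` and `<end>`, the `MAX_LEN` value, and the toy dialogue pairs are assumptions made for illustration; they are not taken from the project.

```python
import os
import tempfile

import numpy as np

START, END = "<start>", "<end>"  # hypothetical marker tokens
MAX_LEN = 5                      # hypothetical maximum number of words

# Toy dialogue pairs standing in for pairs extracted from the corpus
pairs = [
    ("how are you", "i am fine"),
    ("hello", "hi there"),
    ("this input sentence is far too long to keep", "ok"),
]

input_texts, target_texts = [], []
for source, reply in pairs:
    source_words, reply_words = source.split(), reply.split()
    # Drop pairs where either side exceeds the maximum length
    if len(source_words) <= MAX_LEN and len(reply_words) <= MAX_LEN:
        input_texts.append(source_words)
        target_texts.append([START] + reply_words + [END])

# Two lookup tables over the target vocabulary
target_vocab = sorted({word for sentence in target_texts for word in sentence})
target_word2id = {word: idx for idx, word in enumerate(target_vocab)}
target_id2word = {idx: word for word, idx in target_word2id.items()}

# Save as .npy and load back; a dictionary needs allow_pickle=True on load
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "target_word2id.npy")
    np.save(path, target_word2id)
    restored = np.load(path, allow_pickle=True).item()

print(target_texts[0])
print(restored == target_word2id)
```

Wrapping only the target side in start and end tokens is one common choice for Seq2seq training, since the decoder is the part that has to learn where a reply begins and ends.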

Prepare the input data with embeddings. The input data is a list of lists:
* The outer list is a list of sentences
* Each sentence is a list of words

Then generate the training data batch by batch.
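
As a rough sketch of the embedding steps, the snippet below loads GloVe-style vectors into a dictionary and yields embedded batches. The helper names (`load_glove`, `embed_sentence`, `generate_batches`) are hypothetical, and a three-dimensional stand-in file is written to a temporary location so the example runs without the 1.42 GB download. With the real data, the path would point at one of the text files inside glove.twitter.27B.zip.

```python
import os
import tempfile

import numpy as np


def load_glove(path):
    """Read a GloVe text file into a {token: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) > 1:
                embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings


def embed_sentence(words, embeddings, dim, max_len):
    """Map a list of words to a (max_len, dim) matrix; unknown words stay zero."""
    matrix = np.zeros((max_len, dim), dtype="float32")
    for position, word in enumerate(words[:max_len]):
        if word in embeddings:
            matrix[position] = embeddings[word]
    return matrix


def generate_batches(sentences, embeddings, dim, max_len, batch_size):
    """Yield arrays of shape (batch, max_len, dim), one batch at a time."""
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        yield np.stack([embed_sentence(s, embeddings, dim, max_len) for s in chunk])


# Tiny stand-in file in the "token v1 v2 ..." layout (3 dimensions, made up)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("hello 0.1 0.2 0.3\nhow 0.4 0.5 0.6\nare 0.7 0.8 0.9\n")
glove = load_glove(tmp.name)
os.remove(tmp.name)

# Input data as a list of sentences, each a list of words
sentences = [["hello"], ["how", "are", "you"], ["are", "you", "there"]]
batches = list(generate_batches(sentences, glove, dim=3, max_len=4, batch_size=2))
print([batch.shape for batch in batches])
```

Padding every sentence to `max_len` with zero vectors keeps each batch rectangular. Mapping out-of-vocabulary words to zeros is only one possible choice.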

**Define the model architecture and perform the following steps:**

**Step 1:** Use an LSTM encoder to encode the input words as (encoder outputs, encoder hidden state, encoder context).

**Step 2:** Use an LSTM decoder to encode the target words as (decoder outputs, decoder hidden state, decoder context). Use the encoder hidden state and encoder context (which represent the input memory) as the decoder's initial state.

**Step 3:** Use a dense layer to predict the next token out of the vocabulary, given the decoder output generated in Step 2.

**Step 4:** Use loss='categorical_crossentropy' and optimizer='rmsprop'.

Generate the model summary.

Finally, generate the prediction.
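
One way Steps 1 to 4 might be wired together in Keras is sketched below. The sizes (`EMBED_DIM`, `HIDDEN_UNITS`, `NUM_DECODER_TOKENS`) are placeholder assumptions kept small so the model builds quickly, and feeding the decoder one-hot target tokens is a design choice of this sketch rather than something the task description prescribes.

```python
from tensorflow import keras
from tensorflow.keras import layers

EMBED_DIM = 100          # assumed: matches the 100d GloVe vectors
HIDDEN_UNITS = 64        # hypothetical LSTM size
NUM_DECODER_TOKENS = 50  # hypothetical target vocabulary size

# Step 1: the encoder LSTM returns its outputs plus hidden state and context
encoder_inputs = keras.Input(shape=(None, EMBED_DIM), name="encoder_inputs")
encoder_lstm = layers.LSTM(HIDDEN_UNITS, return_state=True, name="encoder_lstm")
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Step 2: the decoder LSTM starts from the encoder's hidden state and context
decoder_inputs = keras.Input(shape=(None, NUM_DECODER_TOKENS), name="decoder_inputs")
decoder_lstm = layers.LSTM(HIDDEN_UNITS, return_sequences=True,
                           return_state=True, name="decoder_lstm")
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# Step 3: a dense softmax layer scores every vocabulary token at each time step
decoder_dense = layers.Dense(NUM_DECODER_TOKENS, activation="softmax",
                             name="decoder_dense")
decoder_outputs = decoder_dense(decoder_outputs)

# Step 4: compile with the loss and optimizer named above, then summarize
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
model.summary()
```

This covers the training graph only. Generating a prediction would additionally need a decoding loop that feeds each predicted token back into the decoder, which is not shown here.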

## **Dataset Description**
Dataset: Cornell Movie-Dialogs Corpus

### **Brief Description**
This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:

* 220,579 conversational exchanges between 10,292 pairs of movie characters
* involves 9,035 characters from 617 movies
* in total 304,713 utterances
* movie metadata included:
  * genres
  * release year
  * IMDB rating
  * number of IMDB votes
* character metadata included:
  * gender (for 3,774 characters)
  * position on movie credits (3,321 characters)

### **File Description**
In all files the field separator is " +++$+++ "

* movie_titles_metadata.txt: contains information about each movie title. Fields:
  * movieID
  * movie title
  * movie year
  * IMDB rating
  * no. IMDB votes
  * genres in the format ['genre1','genre2',...,'genreN']
* movie_characters_metadata.txt: contains information about each movie character. Fields:
  * characterID
  * character name
  * movieID
  * movie title
  * gender ("?" for unlabeled cases)
  * position in credits ("?" for unlabeled cases)
* movie_lines.txt: contains the actual text of each utterance. Fields:
  * lineID
  * characterID (who uttered this phrase)
  * movieID
  * character name
  * text of the utterance
* movie_conversations.txt: contains the structure of the conversations. Fields:
  * characterID of the first character involved in the conversation
  * characterID of the second character involved in the conversation
  * movieID of the movie in which the conversation occurred
  * list of the utterances that make up the conversation, in chronological order: ['lineID1','lineID2',...,'lineIDN']. This list has to be matched with movie_lines.txt to reconstruct the actual content
* raw_script_urls.txt: contains the URLs from which the raw sources were retrieved
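
Given the file layout described above, reconstructing dialogue pairs might look like the sketch below. The sample rows are illustrative, written in the documented field order rather than copied from the corpus, and the function names are hypothetical.

```python
import ast

SEP = " +++$+++ "  # field separator used in all corpus files

# Illustrative rows in the movie_lines.txt layout:
# lineID, characterID, movieID, character name, text of the utterance
line_rows = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hi.",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Hello.",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ How are you?",
]

# Illustrative row in the movie_conversations.txt layout:
# first characterID, second characterID, movieID, list of lineIDs
conversation_rows = ["u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']"]


def parse_lines(rows):
    """Map each lineID to the text of its utterance."""
    id2text = {}
    for row in rows:
        parts = row.rstrip("\n").split(SEP, 4)
        if len(parts) == 5:
            id2text[parts[0]] = parts[4]
    return id2text


def parse_conversations(rows):
    """Return one list of lineIDs per conversation, in chronological order."""
    conversations = []
    for row in rows:
        parts = row.rstrip("\n").split(SEP)
        if len(parts) == 4:
            # The last field is a Python-style list literal of lineIDs
            conversations.append(ast.literal_eval(parts[3]))
    return conversations


def build_pairs(id2text, conversations):
    """Pair every utterance with the one that follows it in the same conversation."""
    pairs = []
    for line_ids in conversations:
        for first, second in zip(line_ids, line_ids[1:]):
            if first in id2text and second in id2text:
                pairs.append((id2text[first], id2text[second]))
    return pairs


pairs = build_pairs(parse_lines(line_rows), parse_conversations(conversation_rows))
print(pairs)
```

When reading the real files from disk, an explicit `encoding` argument to `open` may be needed, because the raw scripts contain non-ASCII characters.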

In [11]:
# prompt: generate a complete code for this dataset of data loading and preprocessing on advanced methods: /content/project_data.json

import json
import numpy as np
import re

# Load the dataset
with open('/content/project_data.json', 'r') as f:
    data = json.load(f)

# Minimal preprocessing pass (extend with project-specific cleaning as needed)
def preprocess_data(data):
    """Builds (input, target) text pairs from consecutive conversation lines.

    Args:
        data: A dictionary with a 'conversations' list; each item holds a
            'lines' list of utterance strings.

    Returns:
        A tuple of (input_texts, target_texts).
    """
    input_texts = []
    target_texts = []

    # NOTE: if the file has no top-level 'conversations' key, the loop below
    # is skipped and both lists stay empty.
    for item in data.get('conversations', []):
        conversation = item.get('lines', [])
        # Pair each line with the line that follows it
        for i in range(len(conversation) - 1):
            # Basic cleaning: keep letters and digits, collapse the rest to spaces
            input_text = re.sub(r"[^a-zA-Z0-9]+", ' ', conversation[i]).lower().strip()
            target_text = re.sub(r"[^a-zA-Z0-9]+", ' ', conversation[i + 1]).lower().strip()

            input_texts.append(input_text)
            target_texts.append(target_text)

    return input_texts, target_texts

# Example usage
input_texts, target_texts = preprocess_data(data)
print(f"Number of input texts: {len(input_texts)}")
print(f"Number of target texts: {len(target_texts)}")

# Build vocabulary
def build_vocabulary(texts, min_count=1):
    # Count how often each whitespace-separated word occurs
    word_counts = {}
    for text in texts:
        for word in text.split():
            word_counts[word] = word_counts.get(word, 0) + 1

    # Keep words seen at least min_count times (raise min_count to drop rare words)
    vocabulary = [word for word, count in word_counts.items() if count >= min_count]
    word2id = {word: index for index, word in enumerate(vocabulary)}
    id2word = {index: word for word, index in word2id.items()}
    return word2id, id2word, vocabulary

# Create and save vocabularies
target_word2id, target_id2word, target_vocab = build_vocabulary(target_texts)
# np.save pickles the dictionaries; reload them with
# np.load(path, allow_pickle=True).item()
np.save('target_word2id.npy', target_word2id)
np.save('target_id2word.npy', target_id2word)


# Placeholder for data embedding and batch generation steps (replace with your actual implementation)

# Example of data embedding and batch generation:
# ...


# Placeholder for model architecture, training, and prediction (replace with actual implementation)


print("Data loading and preprocessing complete.")

Number of input texts: 0
Number of target texts: 0
Data loading and preprocessing complete.


In [15]:
# Importing necessary libraries
import json
import nltk
import numpy as np
import random
from tensorflow import keras
from tensorflow.keras import layers

# Download the tokenizer data required by nltk.word_tokenize
# ('punkt_tab' is used by newer NLTK releases, 'punkt' by older ones)
nltk.download('punkt_tab')
nltk.download('punkt')

# Load the dataset
with open('/content/project_data.json') as f:
    intents = json.load(f)

# Prepare dataset
words = []
classes = []
documents = []

for intent in intents['intents']:
    for pattern in intent['patterns']:
        # Tokenize each word
        word_list = nltk.word_tokenize(pattern)
        words.extend(word_list)
        # Add documents in the corpus
        documents.append((word_list, intent['tag']))
        # Add to our classes if it's not already there
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

# Tokens to drop from the vocabulary (punctuation and clitics); modify as needed
ignore_words = ['?', '!', '.', ',', "'s", "'m"]

# Create the stemmer once instead of once per word
stemmer = nltk.stem.PorterStemmer()

# Stem and lowercase each word, then remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = sorted(set(words))

# Sort classes
classes = sorted(set(classes))

# Prepare training data: one bag-of-words vector plus a one-hot label per pattern
training = []
output_empty = [0] * len(classes)

for pattern_words, tag in documents:
    # Stem the pattern the same way as the vocabulary; a set gives fast lookups
    stems = {stemmer.stem(word.lower()) for word in pattern_words}
    bag = [1 if w in stems else 0 for w in words]

    # Output is 0 for each tag and 1 for the current tag
    output_row = list(output_empty)
    output_row[classes.index(tag)] = 1

    training.append(bag + output_row)

# Shuffle features and convert to numpy array
random.shuffle(training)
training = np.array(training)

# Splitting the dataset into features and labels
X_train = training[:, :-len(classes)]
y_train = training[:, -len(classes):]

# Create the model
model = keras.Sequential()
model.add(layers.Dense(128, activation='relu', input_shape=(len(X_train[0]),)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(len(classes), activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=200, batch_size=5, verbose=1)

# Save the model
model.save('chatbot_model.h5')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Epoch 1/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - accuracy: 0.1769 - loss: 1.3948
Epoch 2/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.3538 - loss: 1.3772  
Epoch 3/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.2904 - loss: 1.4012 
Epoch 4/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.2788 - loss: 1.4162  
Epoch 5/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.3173 - loss: 1.3382 
Epoch 6/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.3423 - loss: 1.3475 
Epoch 7/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.3923 - loss: 1.2986 
Epoch 8/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.4558 - loss: 1.2546 
Epoch 9/200
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m



# **Chat Functionality**

In [17]:
# Load model and data
# (this cell relies on `words` and `classes` built in the training cell above)
model = keras.models.load_model('chatbot_model.h5')
with open("/content/project_data.json") as f:
    intents = json.load(f)

# Create the stemmer once and reuse it for every request
stemmer = nltk.stem.PorterStemmer()

# Create a function to respond to user input
def chatbot_response(text):
    # Tokenize and stem the input the same way as the training patterns
    tokens = {stemmer.stem(word.lower()) for word in nltk.word_tokenize(text)}

    # Bag-of-words vector over the training vocabulary
    bag = [1 if w in tokens else 0 for w in words]

    # Predict the class with the highest probability
    pred = model.predict(np.array([bag]))[0]
    tag = classes[np.argmax(pred)]

    # Pick a random response from the predicted intent
    responses = [intent['responses'] for intent in intents['intents'] if intent['tag'] == tag]
    return random.choice(responses[0])

# Example Usage
print(chatbot_response("Hello"))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step
Hello!
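
The bag-of-words and one-hot encodings that the training cell and `chatbot_response` both rely on can be looked at in isolation. The sketch below uses plain `str.split` instead of NLTK tokenization and stemming, so it runs without extra downloads. The toy vocabulary, toy classes, and helper names are assumptions made for illustration.

```python
def bag_of_words(tokens, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries occur in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def one_hot(tag, classes):
    """Return a vector with a single 1 at the position of the given tag."""
    row = [0] * len(classes)
    row[classes.index(tag)] = 1
    return row


# Hypothetical vocabulary and intent tags, sorted as in the training cell
toy_vocabulary = sorted({"hello", "hi", "bye", "thanks"})
toy_classes = sorted({"greeting", "goodbye", "thanks"})

tokens = "Hello hi there".lower().split()
features = bag_of_words(tokens, toy_vocabulary)
label = one_hot("greeting", toy_classes)
print(features)  # [0, 1, 1, 0]
print(label)     # [0, 1, 0]
```

Words outside the vocabulary, such as "there" in this example, simply leave the vector unchanged, so a message made up only of unseen words yields an all-zero input.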
