# Architecture of a basic chatbot
A **basic chatbot** can be **trained** using a **predefined corpus** and **provide responses to queries** using **vectorization** and **cosine similarity**.
The **most important requirement** for building a **chatbot** is the **corpus** or **text data** on which the **chatbot will be trained**. 
* The **corpus** should be **relevant** and **exhaustive**. 
* Ensure that the **response time is acceptable** and that the **bot** is **not taking** an **inordinate amount of time** to **respond**. 
* The **bot** should also ideally seem **human-like** and have an **acceptable accuracy rate**.

> **Amazon's Q&A data** is a **repository of questions** and **answers** gathered from
> **Amazon's website** for **various product categories**. The **corpus** is provided in a **JavaScript Object Notation (JSON)** format.
>
> http://jmcauley.ucsd.edu/data/amazon/qa/


In order to **build a similarity-based chatbot**, the following steps should be taken:
1. Store all the **questions** from the **corpus** in an array
2. Store all **corresponding answers** from the **corpus** in an array
3. **Vectorize** and **preprocess** the **question data**
4. **Vectorize** and **preprocess** the **user's query**
5. Assess the **most similar question** to the **user's query** using **cosine similarity**
6. Return the **corresponding answer** to the **most similar question** as a **chat response**

> Note that we **do not need** to **vectorize** and **preprocess** the **answers data**, because the answers are the **target to return**. Only the **user's query** is **compared** with the **question data**, so only these two require **text preprocessing**.
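
The six steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's implementation: the questions and answers are made up, and plain term counts stand in for TF-IDF so that the sketch needs only the standard library.

```python
import math
import re
from collections import Counter

# Steps 1-2: parallel arrays of questions and answers (toy data, not from the corpus)
questions = [
    "Do these cables come with a bag?",
    "How many amps can this handle?",
]
answers = ["No.", "Up to 300 amps."]


def vectorize(text: str) -> Counter:
    """Lowercase, split into word tokens, and count them (stand-in for TF-IDF)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 3: preprocess and vectorize the stored questions once
question_vecs = [vectorize(q) for q in questions]


def respond(query: str) -> str:
    # Step 4: preprocess and vectorize the query in exactly the same way
    query_vec = vectorize(query)
    # Step 5: score the query against every stored question
    scores = [cosine(query_vec, qv) for qv in question_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    # Step 6: the answer lives at the same index as the best question
    return answers[best]


print(respond("how many amps does it handle"))
```

The answers are never vectorized: they are only looked up by the index of the best-matching question.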

# Import dependencies

In [None]:
import urllib.request
from urllib.parse import urlparse
from urllib.error import HTTPError
import gzip
import json
import ast
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from typing import Generator

In [None]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

# Read data from the compressed JSON

In [None]:
automotive_url = r'http://jmcauley.ucsd.edu/data/amazon/qa/qa_Automotive.json.gz'

In [None]:
def read_gzip_json(path_or_url: str, row_limit: int = None) -> Generator[dict, None, None]:
  try:
    # Create a buffer from the HTTP response
    if urlparse(path_or_url).scheme:
      buffer = urllib.request.urlopen(path_or_url)
    
    # Set the buffer to a local path
    else:
      buffer = path_or_url
  
    # Yield each evaluated line from the compressed JSON
    for i, line in enumerate(gzip.open(buffer)):
      if i == row_limit:
        break
      yield ast.literal_eval(line.decode('utf-8'))
  
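  # Raise appropriate error if buffer is invalid
  except FileNotFoundError as error:
    raise ValueError(f"Invalid File Path: {path_or_url} not found") from error
  except HTTPError as error:
    # Report the actual status code and the URL that was requested
    raise ValueError(f"HTTP Error {error.code}: could not fetch {path_or_url}") from error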
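
The reading logic can be tried without a network connection by compressing a few lines in memory. The sketch below makes some assumptions: the three records are invented, the helper name `read_records` is hypothetical, and the lines are written as Python-style dict literals because the reader above parses them with `ast.literal_eval` rather than `json.loads`.

```python
import ast
import gzip
import io
from itertools import islice

# Invented records, one Python-style dict literal per line
raw = (
    b"{'question': 'Q1', 'answer': 'A1'}\n"
    b"{'question': 'Q2', 'answer': 'A2'}\n"
    b"{'question': 'Q3', 'answer': 'A3'}\n"
)
buffer = io.BytesIO(gzip.compress(raw))


def read_records(fileobj, row_limit=None):
    """Yield one parsed dict per line, stopping after row_limit lines if given."""
    with gzip.open(fileobj) as lines:
        for line in islice(lines, row_limit):
            # literal_eval only evaluates literals, so it cannot run arbitrary code
            yield ast.literal_eval(line.decode("utf-8"))


records = list(read_records(buffer, row_limit=2))
print(records)
```

Because the function is a generator, nothing is read until it is iterated, which is also when any file or HTTP error would surface.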

In [None]:
try:
  # Read compressed JSON and parse lines to a DataFrame
  automotive_df = pd.concat(
      pd.json_normalize(data)[['question', 'answer']] 
      for data in read_gzip_json(automotive_url, row_limit=1000)
  )

  # Remove duplicate QAs and reset index
  automotive_df.drop_duplicates(
      ignore_index=True,
      inplace=True
  )
except ValueError as error:
  print(error)
  automotive_df = pd.DataFrame(columns=['question', 'answer'])

In [None]:
automotive_df.head(20)

| | question | answer |
|---|---|---|
| 0 | What is the most useful length to get? | at least 20 feet.......heres why....say you ha... |
| 1 | Are these cables made of copper or aluminum? | Coleman's website does indeed say copper equiv... |
| 2 | I bought the Red Extra Heavy Duty. Is that too... | For jumper cables, you can never have "too muc... |
| 3 | Hi, Being 20ft 4gauge how heavy is this? | Not nearly heavy enough. I keep them under my ... |
| 4 | Do these cables come with a bag? | No |
| 5 | Are the wires paired together? Am surprised bo... | Yes, it's a twined cable. And why does it surp... |
| 6 | How many amps can this handle? | Per Coleman Cable specifications 4 gauge 20 fo... |
| 7 | Can I use this cables to boost a school bus ? | I would think so. I bought them to charge a pu... |
| 8 | can this be a replacement for AQ41682059, is t... | It replace these models but you have to use th... |
| 9 | It does not state what RPM this motor is, does... | 1100 R 1100 1100 RPM Dan CSH |


# Text preprocessing

## POS tagging

In [None]:
# This is a common method which is widely used across the NLP community
def wordnet_pos_tag(token: str) -> str:
  """
  Maps POS tags to the first character lemmatize() accepts.
  WordNet groups [N]ouns, [V]erbs, [A]djectives, and Adve[R]bs into synsets.
  """
  tag_dict = {
      'J': wordnet.ADJ,
      'N': wordnet.NOUN,
      'V': wordnet.VERB,
      'R': wordnet.ADV
  }
  tag = nltk.pos_tag([token])[0][1][0].upper()
  return tag_dict.get(tag, wordnet.NOUN)
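
The mapping itself can be illustrated without loading the tagger. In this sketch the tags are supplied by hand rather than produced by `nltk.pos_tag`, and the single-character constants are written out literally; they are the values behind `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`. The helper name `to_wordnet_pos` is hypothetical.

```python
# WordNet's single-character POS constants, written out literally
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"

# First letter of a Penn Treebank tag -> WordNet POS
TAG_TO_WORDNET = {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}


def to_wordnet_pos(treebank_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return TAG_TO_WORDNET.get(treebank_tag[:1].upper(), NOUN)


# JJ = adjective, NNS = plural noun, VBD = past-tense verb,
# RB = adverb, IN = preposition (falls back to noun)
for tag in ("JJ", "NNS", "VBD", "RB", "IN"):
    print(tag, "->", to_wordnet_pos(tag))
```

Tags outside the four groups fall back to noun, which is also the lemmatizer's own default.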

## Lemma-based tokenization

In [None]:
def tokenize_to_lemma(doc: str) -> list:
  tokenizer = TweetTokenizer()
  lemmatizer = WordNetLemmatizer()
  # Return a list (as annotated) of lemmas, one per token
  return [
      lemmatizer.lemmatize(token, pos=wordnet_pos_tag(token))
      for token in tokenizer.tokenize(doc)
  ]

## Stopwords removal

In [None]:
def is_question_word(word: str) -> bool:
  question_words = {'who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom'}
  return any(qw in word for qw in question_words)

In [None]:
# Keep every English stop word except the question words
stop_words = [w for w in stopwords.words('english') if not is_question_word(w)]
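
The intent of this step is a stop-word list that still lets question words through to the vectorizer. The sketch below uses a short hand-picked list as a stand-in for NLTK's English stop words (an assumption, so that it runs without downloads) and shows why the words must be selected with a comprehension or `filter` rather than mapped through the predicate.

```python
# Small stand-in for NLTK's English stop-word list (illustrative only)
sample_stopwords = ["the", "is", "what", "how", "a", "where", "of"]
question_words = {"who", "what", "when", "why", "how", "which", "where", "whom"}

# Selecting: keep only the stop words that are not question words
filtered = [w for w in sample_stopwords if w not in question_words]
print(filtered)  # ['the', 'is', 'a', 'of']

# Mapping the predicate instead yields booleans, not words
flags = list(map(lambda w: w not in question_words, sample_stopwords))
print(flags)     # [True, True, False, False, True, False, True]
```

Only the first form gives a list of strings, which is what a vectorizer's stop-word parameter is meant to receive.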

# Text vectorization

In [None]:
vectorizer = TfidfVectorizer(
    tokenizer=tokenize_to_lemma,
    stop_words=stop_words,
    lowercase=True, 
    norm='l2'
)

In [None]:
X = vectorizer.fit_transform(automotive_df['question'])
y = automotive_df['answer']
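
To make the weighting concrete, here is a standard-library sketch of TF-IDF on three pre-tokenized toy documents. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per row); the documents and the names `tfidf_row` and `X_toy` are illustrative.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (illustrative)
docs = [
    ["how", "many", "amps"],
    ["how", "heavy", "cable"],
    ["cable", "bag"],
]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# Document frequency: in how many documents does each term occur?
df = Counter(t for d in docs for t in set(d))

# Smoothed inverse document frequency: rarer terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Term count times IDF for every vocabulary term, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


X_toy = [tfidf_row(d) for d in docs]
print(vocab)
print([round(w, 3) for w in X_toy[0]])
```

Since every row has unit length, the cosine between two rows reduces to their dot product.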

# Similarity-based chatbot
***X*** is the **repository matrix** that is searched for the **most similar question** every time a **new question is entered** in the **chatbot**. To implement this, we calculate the **cosine of the angle** between the **new question vector** and every row of the **X matrix**. Finally, we **find the row** that has the **maximum cosine** (or the **minimum angle**) with the **new question vector** and **return the corresponding answer** to that question as the response. 
* If the **smallest angle** between the **question vector** and **every row of the matrix** is **greater than a threshold value**,
then we consider that **question to be different enough not to warrant a response**.
 
> Note that we **search our repository matrix** ***X*** based on the **user's query**, meaning that the **similarity** is computed as ***cosine_similarity(query, X)***
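
The relation between cosine and angle can be checked with a few lines of standard-library Python. The cosine is the dot product divided by the product of the vector lengths, so a larger cosine means a smaller angle. The vectors below are illustrative, and the 0.5 cutoff is the value used by the chatbot function in this notebook.

```python
import math


def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def angle_degrees(u, v):
    """Angle between two vectors, clamping the cosine against rounding error."""
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine(u, v)))))


print(round(cosine([1, 0], [1, 1]), 4), round(angle_degrees([1, 0], [1, 1]), 1))
print(round(cosine([1, 0], [0, 1]), 4), round(angle_degrees([1, 0], [0, 1]), 1))

# A cosine threshold of 0.5 corresponds to an angle of about 60 degrees
print(round(math.degrees(math.acos(0.5)), 1))
```

A query whose best cosine is below 0.5 is therefore more than about 60 degrees away from every stored question.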

In [None]:
def chatbot_answer(query: str) -> str:
  # Define global dependencies
  global X, y, vectorizer
  
  # Apply TF-IDF transformation on the query
  query_vec = vectorizer.transform([query])

  # Calculate the cosine between the query vector and each row
  similarity_vec = cosine_similarity(query_vec, X)[0]
  
  # If the best cosine is below the threshold (the angle is too large), do not answer
  if max(similarity_vec) < 0.5:
    return "Sorry, I didn't quite understand that. Would you mind repeating once again?"
  
  # Answer corresponding to the maximum cosine (or the minimum angle) with the query vector
  return y.iloc[np.argmax(similarity_vec)]
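
The selection rule can be isolated from the vectorizer and exercised on hand-written scores. This is a sketch under stated assumptions: `pick_answer` is a hypothetical helper, the scores and answers are invented, and the extra empty-input guard is not part of the function above.

```python
def pick_answer(similarities, answers, threshold=0.5,
                fallback="Sorry, I didn't quite understand that."):
    """Return the answer paired with the highest score, or a fallback if it is too low."""
    if not similarities:
        # Nothing to compare against (for example, an empty corpus)
        return fallback
    best = max(range(len(similarities)), key=similarities.__getitem__)
    if similarities[best] < threshold:
        return fallback
    return answers[best]


toy_answers = ["No", "300 amps", "About 20 feet"]
print(pick_answer([0.10, 0.82, 0.30], toy_answers))  # clear winner at index 1
print(pick_answer([0.10, 0.42, 0.30], toy_answers))  # best score is below 0.5
```

Raising the threshold makes the bot decline more often; lowering it makes it answer with weaker matches.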

In [None]:
def QA_support():
  try:
    user_name = input("Please enter your username: ")
    print("Q&A Support: Hi, welcome to the Amazon's Automotive Q&A Support. How may I help you?")
    while True:
      user_query = input(f"{user_name}: ")
      if 'bye' in user_query.lower():
        raise KeyboardInterrupt
      print(f"Q&A Support: {chatbot_answer(user_query)}")
  except KeyboardInterrupt:
    print(
      "Q&A Support: Thank you for our conversation, "
      "I'm always here and willing to help you.\n"
      "In case of any further questions, feel free to reach out to me :)"
    )
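
Because the loop above reads from `input()`, it is awkward to exercise automatically. One possible variation, sketched here with a hypothetical `run_session` helper, takes the queries as an iterable and the answering function as a parameter, while keeping the same rule that any query containing "bye" ends the session.

```python
from typing import Callable, Iterable, List


def run_session(queries: Iterable[str], answer: Callable[[str], str]) -> List[str]:
    """Collect one reply per query, stopping at the first query containing 'bye'."""
    replies = []
    for query in queries:
        # Same substring check as the interactive loop
        if 'bye' in query.lower():
            break
        replies.append(answer(query))
    return replies


# str.upper stands in for the real answering function
print(run_session(["hello", "how heavy is it", "Bye for now", "ignored"], answer=str.upper))
```

Note that the substring check also ends the session on words such as "goodbye", since it does not look at word boundaries.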

In [None]:
if __name__ == '__main__':
  QA_support()

Please enter your username: mikey_72
Q&A Support: Hi, welcome to the Amazon's Automotive Q&A Support. How may I help you?
mikey_72: How many amps can this handle?
Q&A Support: Per Coleman Cable specifications 4 gauge 20 foot long cables current rating is 300 amps.
mikey_72: I need a replacement for the  fan model f0510B2944
Q&A Support: I just copied this information from the Lamanco site www.lamanco Are replacement parts available for power vents? Replacement parts for Lomanco power vents, Model 2000, 2000TH, or 1800 can be found in our online store. The most recent motor model number is F0510B2944. This motor replaces several older models (A0416B2059, A0510B2389, F0510B2497). We also have distributors who may stock the motor locally. Please contact customer service to see if the motor is available locally (800-643-5596). Help Topics: Specific Product / 1800 ADD-A-VENT, Specific Product / 2000 Power Vent, Product Support / Powered Vents Hope this helps
mikey_72: How heavy is the Being

# Summary
There is plenty of room to **improve the accuracy** of this **chatbot**. Some areas that can be further refined include:
* **preprocessing**
* **cleaning of raw data**
* **tweaking TF-IDF normalization**

While **cosine similarity-based chatbots** were among the **first-generation NLP applications** used
in industry to **automate simple Q&A-based tasks**, **new-age chatbots** have come a long way
and are able to **handle much more complex** and **bespoke requirements** using **deep learning-based models**.