
Hey Geeks! Welcome to an exciting project on Natural Language Processing, where we are going to build a Chat Bot within minutes. We'll use some basic NLP techniques to build a simple bot. It is pretty naive, but it could get a lot better with more data. So, let's get started.

Let's start by installing the newspaper3k package, which makes it very easy to download an article from the web and extract its text.

In [2]:
!pip install newspaper3k --quiet



Next, let's import all the necessary libraries for the project.

In [3]:
import random, string, nltk, numpy as np # NLTK is a Natural Language Tool Kit, which contains a lot of useful functions for NLP.
from newspaper import Article # Article is used to handle articles which contain our data.
from sklearn.feature_extraction.text import CountVectorizer # CountVectorizer creates a matrix of words and their frequencies in different sentences.
from sklearn.metrics.pairwise import cosine_similarity # Cosine similarity is a measure of how similar two sentences are. 
import warnings
warnings.filterwarnings('ignore')

In [4]:
nltk.download('punkt', quiet=True) # Punkt is the tokenizer model NLTK uses to split raw text into sentences.

True

Now, we fetch our data from the web and prepare it.

In [5]:
article = Article('https://en.wikipedia.org/wiki/Natural_language_processing') # Here, we pass the URL of the page to scrape data from.
article.download() # To download the article's HTML.
article.parse() # To extract the main text from the HTML.
article.nlp() # To extract keywords and a summary (optional here, since we only use article.text below).

Let's have a look at the data.

In [6]:
corpus = article.text
corpus

'This article is about natural language processing done by computers. For the natural language processing done by the human brain, see Language processing in the brain\n\nField of computer science and linguistics\n\nNatural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.\n\nChallenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.\n\nHistory [ edit ]\n\nNatural language processing has its

As you can see, the article is now plain text, in the form of one long string of sentences. We'll now convert this string into a list of sentences by tokenizing it.

In [7]:
text = corpus
sentence_list = nltk.sent_tokenize(text) # This function is used to tokenize the string into constituent sentences.
sentence_list

['This article is about natural language processing done by computers.',
 'For the natural language processing done by the human brain, see Language processing in the brain\n\nField of computer science and linguistics\n\nNatural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.',
 'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.',
 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.',
 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.',
 'History [ edit ]\n\nNatural language pro
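
Notice that some entries above still carry `\n\n` paragraph breaks and `[ edit ]` markers from the web page, so a heading can end up glued to the sentence that follows it. Here is a minimal sketch of an optional cleanup step. It is not part of the original notebook, the helper name `split_paragraphs` is made up for illustration, and the sample text is invented.

```python
import re

def split_paragraphs(raw_text):
  """Drop '[ edit ]' markers and split the text on blank lines."""
  without_markers = re.sub(r'\s*\[\s*edit\s*\]', '', raw_text)
  paragraphs = re.split(r'\n\s*\n', without_markers)
  return [p.strip() for p in paragraphs if p.strip()]

# Invented sample that mimics the shape of the scraped text.
sample = 'Heading [ edit ]\n\nFirst paragraph. Still first.\n\nSecond paragraph.'
print(split_paragraphs(sample))
# ['Heading', 'First paragraph. Still first.', 'Second paragraph.']
```

Each cleaned paragraph could then be passed to `nltk.sent_tokenize` separately, so that headings and sentences stay apart.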

So now, it is a list of sentences.
Next up, we'll build the functions that make up the Chat Bot. To start with, we'll define a function that greets the user back.

In [8]:
def greeting_response(text):
  """ This function takes an input (which is derived from the user) and checks if any word in the input is in the list of greetings to expect from the user.
      If it is, then the bot greets the user back by choosing a random greeting from the list of bot greetings. Otherwise, it returns None. """

  text = text.lower()
  bot_greetings = ['hi', 'hello', 'namaste', 'hola', 'howdy', 'hey']
  user_greetings = ['hey', 'hoy', 'wassup?', 'namaskaram', 'etlunnav?', 'hi', 'hello']

  words_in_text = text.split()

  for word in words_in_text:
    if word in user_greetings:
      chosen_greeting = random.choice(bot_greetings)
      return chosen_greeting
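
One thing to keep in mind: `text.split()` keeps punctuation attached to the words, so an input like `hello!` or `hi,` would not be recognised as a greeting. Below is a rough sketch of a punctuation-tolerant variant, assuming it is fine to strip punctuation from both ends of each word. The name `greeting_response_clean` and the two constants are hypothetical and not part of the notebook.

```python
import random
import string

BOT_GREETINGS = ['hi', 'hello', 'namaste', 'hola', 'howdy', 'hey']
# Same user greetings as in the notebook, stored without trailing punctuation.
USER_GREETINGS = {'hey', 'hoy', 'wassup', 'namaskaram', 'etlunnav', 'hi', 'hello'}

def greeting_response_clean(text):
  """Hypothetical variant of greeting_response that ignores punctuation."""
  for word in text.lower().split():
    # Strip punctuation from both ends, so 'hello!' becomes 'hello'.
    if word.strip(string.punctuation) in USER_GREETINGS:
      return random.choice(BOT_GREETINGS)
  return None  # No greeting found.

print(greeting_response_clean('Hello, bot!'))   # one of BOT_GREETINGS, chosen at random
print(greeting_response_clean('What is NLP?'))  # None
```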

Next, we define a function that sorts the elements of a list in descending order and returns their indices in the original list rather than the elements themselves. This function will be used later on in the project.

In [9]:
def index_sort(similarity_scores_list):
  """ In order to return indices, we first enumerate the elements. Then, we sort them by the values. Finally, we return only the indices. """
  sorted_scores = sorted(enumerate(similarity_scores_list), key = lambda tup : tup[1], reverse = True)
  sorted_indices = [tup[0] for tup in sorted_scores] # To pick only the indices (which are the first elements in each tuple)
  return sorted_indices
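
To see what this function does, here is a small worked example with made-up scores. The function is repeated so that the snippet runs on its own.

```python
def index_sort(similarity_scores_list):
  """Return the indices ordered from the highest score to the lowest."""
  sorted_scores = sorted(enumerate(similarity_scores_list), key=lambda tup: tup[1], reverse=True)
  return [tup[0] for tup in sorted_scores]

scores = [0.1, 0.9, 0.0, 0.5]  # made-up similarity scores
order = index_sort(scores)
print(order)  # [1, 3, 0, 2]
```

The highest score (0.9) sits at index 1, so 1 comes first. Since Python's sort is stable, equal scores keep their original order.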

Now comes the important part. Here, we use a few NLP techniques to find the sentences most similar to the user input, which hopefully make a sensible response. Let's code up that part.

In [10]:
def bot_response(user_input):
  user_input = user_input.lower()
  sentence_list.append(user_input) # We temporarily append the user's sentence to our list of sentences.
  b_response = '' # The initial bot response is an empty string.
  vectorizer = CountVectorizer() # We use the count vectorizer from sklearn to count how often each word appears in each sentence.
  count_matrix = vectorizer.fit_transform(sentence_list) # fit_transform builds a matrix with one row per sentence and one column per word.
  # We want to measure the degree of similarity between the user's sentence (the last row) and every sentence in the matrix.
  similarity_scores = cosine_similarity(count_matrix[-1], count_matrix)
  # similarity_scores is a 1 x N array: entry i says how similar sentence i is to the input sentence.
  similarity_scores_list = similarity_scores.flatten() # To convert it into a 1D array.
  index_list = index_sort(similarity_scores_list) # To obtain the indices of the sentences, most similar first.

  response_flag = 0 # To indicate that we haven't yet found a similar sentence.
  n = 2 # Number of similar sentences we want to find.
  for index in index_list[1:]: # We skip the first index, since it is normally the input sentence itself (which is obviously the most similar).
    if similarity_scores_list[index] > 0.0: # A score > 0 means the sentences share at least one word.
      b_response = b_response + ' ' + sentence_list[index] # We append the next most similar sentence to the bot response.
      response_flag = 1 # To indicate we found at least one similar sentence. It is set before the break so that n = 1 also works.
      n -= 1 # We decrease the counter as we found one match.
      if n == 0: break

  if response_flag == 0:
    b_response = b_response + " Sorry, I couldn't understand that!" # In case there is no similar sentence, we want to convey that to the user.

  sentence_list.pop(-1) # Finally, we remove the user's sentence from the list of sentences.

  return b_response # Returning the bot's response.
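
If you're curious what the count matrix and the cosine similarity actually compute, here is a minimal pure-Python sketch of the same idea. It assumes plain whitespace tokenization, whereas `CountVectorizer` by default also ignores punctuation and single-character tokens, so real scores can differ. The helper names `count_vector` and `cosine` and the sample sentences are made up for illustration.

```python
import math
from collections import Counter

def count_vector(sentence):
  """Bag-of-words counts, assuming simple whitespace tokenization."""
  return Counter(sentence.lower().split())

def cosine(a, b):
  """Cosine similarity between two word-count vectors."""
  dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
  norm_a = math.sqrt(sum(count * count for count in a.values()))
  norm_b = math.sqrt(sum(count * count for count in b.values()))
  if norm_a == 0 or norm_b == 0:
    return 0.0  # An empty sentence is similar to nothing.
  return dot / (norm_a * norm_b)

query = count_vector('what is language processing')
s1 = count_vector('language processing is fun')
s2 = count_vector('cats sleep all day')

print(cosine(query, s1))  # 0.75, since three of the four words are shared
print(cosine(query, s2))  # 0.0, since no word is shared
```

A score of 0 means the two sentences have no word in common, which is why `bot_response` only keeps sentences with a score above 0.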

With that done, let's create a way to interact with the Chat Bot and test it!

In [11]:
exit_list = ['bye', 'see you later', 'good night'] # These are the phrases that end the conversation with the Chat Bot.
print('Hi there! Doc Bot here! Ready to assist you.')

while True: # Loop until the user says goodbye
  user_input = input('User : ').lower()
  if user_input in exit_list:
    print('Doc Bot : Bye Pal! Talk to you later!')
    break
  greeting = greeting_response(user_input) # Called only once, so the greeting we check is the one we print.
  if greeting is not None: # If the user is greeting the Bot
    print('Doc Bot : ' + greeting) # Greeting from the bot!
  else: # If the user is in some conversation
    print('Doc Bot : ' + bot_response(user_input)) # Bot's reply to the user.

Hi there! Doc Bot here! Ready to assist you.
User : Hello Bot
Doc Bot : hey
User : How are you?
Doc Bot :  Then, identify semantic roles that are not explicitly realized in the current sentence, classify them into arguments that are explicitly realized elsewhere in the text and those that are not specified, and resolve the former against the local text. In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing.
User : Oops! Straight to business?
Doc Bot :  to solve properly. Text-to-speech can be used to aid the visually impaired.
User : Okay, Got it! What is NLP?
Doc Bot :  Symbolic NLP (1950s – early 1990s) [ edit ]

The premise of symbolic NLP is well-summarized by John Searle's Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates 

Haha! That was our Chat Bot's first conversation with a human! It greeted us back and replied with sentences from the article, although, as you can see, matching on shared words alone doesn't always pick a relevant one. So, that was all about our project. It's a pretty simple one, yet useful.
We could make it a lot better by adding more data, using more advanced NLP techniques, employing better similarity measures that give more importance to certain key words (TF-IDF weighting, for example) and using more robust algorithms such as K-nearest neighbours.
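
As a small taste of the "more importance to key words" idea, here is a rough sketch of inverse document frequency (IDF) weighting, one common way to do it. This is an assumption about a possible next step, not something this notebook implements. The helper name `idf_weights` and the sample sentences are made up, and the smoothed formula is the one scikit-learn's `TfidfVectorizer` uses by default.

```python
import math
from collections import Counter

def idf_weights(sentences):
  """Smoothed inverse document frequency for every word in the corpus."""
  tokenized = [set(sentence.lower().split()) for sentence in sentences]
  n = len(tokenized)
  # Number of sentences in which each word appears.
  df = Counter(word for tokens in tokenized for word in tokens)
  return {word: math.log((1 + n) / (1 + count)) + 1 for word, count in df.items()}

sentences = [
  'the goal is a computer that understands language',
  'the history of the field is long',
  'what is the goal',
]
idf = idf_weights(sentences)
for word in ('the', 'goal', 'what'):
  print(word, round(idf[word], 3))
# the 1.0
# goal 1.288
# what 1.693
```

Words that appear in every sentence, like "the", get the lowest weight, while rarer words count for more. Multiplying the word counts by these weights before computing the cosine similarity gives TF-IDF.
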
Next, we'll take up endeavours that address these issues. See you all in the next project!