## Building a Simple Chatbot

**Importing the required libraries**

In [2]:
import numpy as np  # For numerical computation in Python
import nltk         # For natural language processing
import string       # For string constants such as string.punctuation
import random
import warnings
warnings.filterwarnings('ignore')

**Importing and reading the corpus**

In [6]:
with open('chatbotDS.txt', 'r', errors='ignore') as f:  # 'r' opens the file in read mode
    raw_doc = f.read()
raw_doc = raw_doc.lower()  # Convert text to lowercase
nltk.download('punkt')     # Punkt sentence tokenizer; other NLTK tokenizers include the tweet and regexp tokenizers
nltk.download('wordnet')   # WordNet dictionary, used later for lemmatization
sent_tokens = nltk.sent_tokenize(raw_doc)  # Convert the document to a list of sentences
word_tokens = nltk.word_tokenize(raw_doc)  # Convert the document to a list of words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Chandru\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Chandru\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Corpus is taken from the Wikipedia page on data science:
https://en.wikipedia.org/wiki/Data_science

More about tokenizers:
https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/

**Example of sentence tokens**

In [7]:
sent_tokens[:2]  #Printing first two sentences

['data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains.',
 'data science is related to data mining, machine learning and big data.']

**Example of word tokens**

In [8]:
word_tokens[:10]  #Print first 10 words

['data',
 'science',
 'is',
 'an',
 'interdisciplinary',
 'field',
 'that',
 'uses',
 'scientific',
 'methods']
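
As a rough illustration of what sentence and word tokenization do, here is a minimal regex-based sketch that uses only the standard library. It is not the Punkt algorithm that `nltk.sent_tokenize` uses; the function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the splitting rules are deliberately naive.

```python
import re

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Naive: abbreviations such as "e.g. this" would be split incorrectly.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "data science is related to data mining. it also uses statistics!"
print(simple_sent_tokenize(sample))
print(simple_word_tokenize("data science, in short."))
```

A trained tokenizer such as Punkt handles abbreviations and other edge cases that this sketch ignores, which is why the notebook relies on NLTK.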

**Text preprocessing**

In [10]:
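lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically oriented dictionary of English included in the NLTK library.
def LemTokens(tokens):
    # Reduce each token to its dictionary form (lemma)
    return [lemmer.lemmatize(token) for token in tokens]
# Translation table that maps every punctuation character to None, i.e. deletes it
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    # Lowercase, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))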
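
To see what the punctuation table does on its own, the following small sketch applies the same `str.translate` step without the NLTK tokenizer or lemmatizer. The helper name `strip_punctuation` is hypothetical and exists only for this illustration.

```python
import string

# Same construction as remove_punct_dict: each punctuation code point maps to None.
punct_table = {ord(ch): None for ch in string.punctuation}

def strip_punctuation(text):
    # Lowercase the text, then delete every character listed in punct_table.
    return text.lower().translate(punct_table)

print(strip_punctuation("Hello, World! What's up?"))  # hello world whats up
print(strip_punctuation("data,[1][2] and"))           # data12 and
```

The second call shows a side effect worth knowing about: Wikipedia citation markers such as `[1][2]` lose their brackets and fuse with the preceding word, so cleaning the corpus before tokenizing may give better matches.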

**Defining the greeting function**

In [11]:
GREET_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey")   # "sup" is slang for "what's up?"
GREET_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greet(sentence): 
    for word in sentence.split():
        if word.lower() in GREET_INPUTS:
            return random.choice(GREET_RESPONSES)
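
Because `greet` compares one whitespace-separated word at a time, the two-word entry "what's up" can never match, and neither can a greeting with attached punctuation such as "hello!". One possible variant is sketched below; `greet_phrase` is a hypothetical name, and the padding trick is just one way to match whole words and phrases.

```python
import random
import string

GREET_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREET_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greet_phrase(sentence):
    # Strip punctuation from the edges of each word ("hello!" becomes "hello"),
    # but keep inner apostrophes so "what's" survives.
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]
    # Pad with spaces so only whole words or whole phrases can match.
    padded = " " + " ".join(words) + " "
    for greeting in GREET_INPUTS:
        if f" {greeting} " in padded:
            return random.choice(GREET_RESPONSES)
    return None

print(greet_phrase("What's up?"))
print(greet_phrase("this is history"))  # None: "hi" only appears inside other words
```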

**Response generation**

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF: term frequency weighted by inverse document frequency, so rare words count more
from sklearn.metrics.pairwise import cosine_similarity       # Cosine of the angle between two vectors; scores how similar two sentences are

In [13]:
def response(user_response):
    # Assumes user_response has already been appended to sent_tokens,
    # so the last row of the TF-IDF matrix is the user's query.
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)  # Similarity of the query to every sentence
    idx = vals.argsort()[0][-2]   # Second-highest score; the highest is normally the query itself
    req_tfidf = vals[0][idx]
    if req_tfidf == 0:
        # No sentence shares any term with the query
        return "I am sorry! I don't understand you"
    return sent_tokens[idx]
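
The retrieval idea behind `response` can be illustrated without scikit-learn. The sketch below is a simplification under stated assumptions: it uses raw word counts instead of TF-IDF weights, splits on whitespace instead of lemmatizing, and compares the query against the corpus directly rather than appending it first. The names `cosine`, `best_match`, and `toy_corpus` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    # a and b are Counter objects mapping term -> count.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, sentences):
    # Score every sentence against the query and keep the highest scorer.
    query_vec = Counter(query.lower().split())
    scores = [cosine(query_vec, Counter(s.lower().split())) for s in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    if scores[best] == 0:
        return None, 0.0  # No shared terms, like the "I don't understand" branch
    return sentences[best], scores[best]

toy_corpus = [
    "data science is related to data mining",
    "statistics is a branch of mathematics",
]
print(best_match("what is data mining", toy_corpus))
print(best_match("weather forecast", toy_corpus))
```

Since the query is scored separately here, the top score is the answer. The notebook's function takes the second-highest score instead because the query sits inside `sent_tokens` and matches itself perfectly.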

**Defining conversation start/end protocols**

In [15]:
flag = True
print("BOT: My name is Chand. Let's have a conversation! Also, if you want to exit any time, just type Bye!")
while flag:
    user_response = input().lower()
    if user_response != 'bye':
        if user_response in ('thanks', 'thank you'):
            flag = False
            print("BOT: You are welcome..")
        else:
            greeting = greet(user_response)  # Call once; each call picks a random reply
            if greeting is not None:
                print("BOT: " + greeting)
            else:
                sent_tokens.append(user_response)  # Temporarily add the query so it gets vectorized
                print("BOT: ", end="")
                print(response(user_response))
                sent_tokens.remove(user_response)  # Restore the corpus
    else:
        flag = False
        print("BOT: Goodbye! Take care <3 ")  # <3 is a heart emoticon

BOT: My name is Chand. Let's have a conversation! Also, if you want to exit any time, just type Bye!
hello
BOT: hey
hi
BOT: *nods*
foundations
BOT: [4][5]


contents
1	foundations
1.1	relationship to statistics
2	etymology
2.1	early usage
2.2	modern usage
3	impact
4	technologies and techniques
4.1	techniques
5	see also
6	references
foundations
data science is an interdisciplinary field focused on extracting knowledge from data sets, which are typically large (see big data), and applying the knowledge and actionable insights from data to solve problems in a wide range of application domains.
impact of data science
BOT: "how data science will impact future of businesses?".
data science
BOT: "data science".
data science
BOT: "data science".
technologies and techniques
BOT: [29]

technologies and techniques
there is a variety of different technologies and techniques that are used for data science which depend on the application.
background
BOT: I am sorry! I don't understand you
bye
BOT: Goodbye! Take care <3
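
The branching in the conversation loop is easier to check when it is separated from `input()` and `print()`. The following is a minimal sketch of that idea, assuming the greeting and retrieval functions are passed in as parameters; `route`, `demo_greet`, and `demo_respond` are hypothetical names, and the two demo callables are stand-ins so the sketch runs without the corpus.

```python
def route(user_text, greet_fn, respond_fn):
    # Return (reply, keep_going) for a single turn of the conversation.
    text = user_text.lower()
    if text == 'bye':
        return "Goodbye! Take care <3", False
    if text in ('thanks', 'thank you'):
        return "You are welcome..", False
    greeting = greet_fn(text)
    if greeting is not None:
        return greeting, True
    return respond_fn(text), True

# Stand-in callables; in the notebook these roles are played by greet and response.
def demo_greet(text):
    return "hi there" if text in ("hi", "hello") else None

def demo_respond(text):
    return "echo: " + text

print(route("Hello", demo_greet, demo_respond))
print(route("foundations", demo_greet, demo_respond))
print(route("Bye", demo_greet, demo_respond))
```

With this split, the loop itself only needs to read input, call the routing function, print the reply, and stop when `keep_going` is false.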

**This is one of the simplest chatbots you can build, and it takes very few lines of code. A more sophisticated chatbot depends largely on the size and breadth of the corpus we give it. Hope this helps my fellow Data Science aspirants.**

In [1]:
print("You have learnt chatbot development")

You have learnt chatbot development


In [3]:
print("It's the end. Happy Learning")

It's the end. Happy Learning
