# Import Necessary Libraries

In [1]:
import numpy as np
import nltk
import string
import random

# Reading the corpus of Text

In [2]:
with open('data.txt', 'r', errors='ignore') as f:  # the context manager closes the file automatically
    raw_doc = f.read()

In [3]:
raw_doc = raw_doc.lower() # converting entire text to lower case
nltk.download('punkt') # using the punkt tokenizer
nltk.download('wordnet') # using the wordnet dictionary
nltk.download('omw-1.4')
sentence_tokens = nltk.sent_tokenize(raw_doc)
word_tokens = nltk.word_tokenize(raw_doc)

[nltk_data] Downloading package punkt to C:\Users\AYUSH NATH
[nltk_data]     TIWARI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\AYUSH NATH
[nltk_data]     TIWARI\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\AYUSH NATH
[nltk_data]     TIWARI\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


>Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences. It uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences. Such a model must be trained on a large collection of plain text in the target language before it can be used; `nltk.download('punkt')` fetches a pre-trained model, so no training is needed here.
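To see why a sentence tokenizer needs a model of abbreviations, the sketch below compares a naive split on end-of-sentence punctuation with a split that skips known abbreviations. It is only a toy illustration, not Punkt's algorithm: `ABBREVIATIONS`, `naive_split`, and `abbreviation_aware_split` are hypothetical names, and the abbreviation list is written by hand, whereas Punkt learns it from text.

```python
import re

# Hypothetical hand-written list; Punkt learns abbreviations from a corpus instead
ABBREVIATIONS = {"dr.", "mr.", "prof."}

def naive_split(text):
    # Split at any whitespace that follows '.', '!' or '?'
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # End the sentence only if the token ends with punctuation
        # and is not a known abbreviation
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith joined the lab in 1955. He worked with Prof. Lee."
print(naive_split(text))
print(abbreviation_aware_split(text))
```

The naive version breaks after "Dr." and "Prof.", while the abbreviation-aware version returns the two intended sentences.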

>WordNet is a lexical database of English that can be accessed through the Natural Language Toolkit (NLTK) for Python. NLTK is an extensive library built to make Natural Language Processing (NLP) easier.

In [4]:

raw_doc

'\nmain menu\n\nwikipedia the free encyclopedia\n\n    create account\n    log in\n\npersonal tools\n\ncontents\n\n    beginning\n    how does ai work?\n    history\n    related pages\n    references\n\nartificial intelligence\n\n    page\n    talk\n\n    read\n    change\n    change source\n    view history\n\ntools\n\nfrom simple english wikipedia, the free encyclopedia\n\t\nthis article needs to be updated. you can help wikipedia by updating it. (may 2023)\n\nartificial intelligence (ai) is the ability of a computer program or a machine to think and learn.[1] it is also a field of study which tries to make computers "smart". they work on their own without being encoded with commands. john mccarthy came up with the name, "artificial intelligence" in 1955.\n\nin general use, the term "artificial intelligence" means a programme which mimics human cognition. at least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though n

# Tokenization

In [5]:

sentence_tokens = nltk.sent_tokenize(raw_doc)
word_tokens = nltk.word_tokenize(raw_doc)

In [6]:
sentence_tokens[6]

'john mccarthy came up with the name, "artificial intelligence" in 1955.\n\nin general use, the term "artificial intelligence" means a programme which mimics human cognition.'

In [7]:

word_tokens[2:8]

['wikipedia', 'the', 'free', 'encyclopedia', 'create', 'account']

## Performing Text Pre-Processing steps

In [8]:
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punc_dict = dict((ord(punct),None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punc_dict)))
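The punctuation-stripping step inside `LemNormalize` can be tried on its own. The sketch below rebuilds the same translation table and applies it to a made-up sentence; `sample` and `alt_table` are hypothetical names used only for this illustration.

```python
import string

# Same construction as in the cell above: map every punctuation code point to None
remove_punc_dict = dict((ord(punct), None) for punct in string.punctuation)

sample = "Hello, world! What's AI?"  # hypothetical sample sentence
cleaned = sample.lower().translate(remove_punc_dict)
print(cleaned)

# str.maketrans with a third argument builds a table that deletes those characters
alt_table = str.maketrans("", "", string.punctuation)
print(sample.lower().translate(alt_table))
```

Both calls print `hello world whats ai`. Note that the apostrophe is removed as well, so "what's" becomes "whats".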

>Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it takes the context of a word (such as its part of speech) into account, so inflected forms are mapped to a valid dictionary form, the lemma.

>Stemming is the process of removing affixes from a word so that only its stem is left. For example, the words 'run', 'running', and 'runs' all reduce to the root word 'run' after stemming.

>Lemmatization is often preferred over stemming because it performs a morphological analysis of the words.

>Morphology focuses on how the components within a word (stems, root words, prefixes, suffixes, etc.) are arranged or modified to create different meanings. English, for example, often adds "-s" or "-es" to the end of count nouns to indicate plurality, and "-d" or "-ed" to a verb to indicate past tense.
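The difference between the two approaches can be sketched with a toy example. Everything here is an assumption made for illustration: `crude_stem` strips a few suffixes blindly, and `LEMMA_TABLE` is a tiny hand-written lookup standing in for a real dictionary such as WordNet. Neither reproduces what NLTK actually does.

```python
def crude_stem(word):
    """Strip a few common suffixes without checking that the result is a real word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary
LEMMA_TABLE = {"running": "run", "runs": "run", "studies": "study", "feet": "foot"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table
    return LEMMA_TABLE.get(word, word)

for word in ("running", "runs", "studies", "feet"):
    print(word, "->", crude_stem(word), "|", toy_lemmatize(word))
```

The crude stemmer yields fragments such as "runn" and "studi" and leaves "feet" untouched, whereas the dictionary lookup returns real words. In NLTK, `lemmer.lemmatize(token)` treats the token as a noun by default; passing `pos='v'` asks for the verb lemma instead.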

## Define Greeting functions

In [9]:
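greet_inputs = ("hello", "hi", "whassup", "how are you?")
greet_responses = ("hi", "Hey", "Hey There!", "There there!!")
def greet(sentence):
    # Check the whole input first so the multi-word greeting "how are you?" can match
    if sentence.lower().strip() in greet_inputs:
        return random.choice(greet_responses)
    # Otherwise look for a single-word greeting anywhere in the input
    for word in sentence.split():
        if word.lower() in greet_inputs:
            return random.choice(greet_responses)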

## Response Generation by the Bot

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

>CountVectorizer simply counts the number of times a word appears in a document (using a bag-of-words approach), while TF-IDF Vectorizer takes into account not only how many times a word appears in a document but also how important that word is to the whole corpus.

>The TfidfVectorizer uses an in-memory vocabulary (a Python dict) to map the most frequent words to feature indices and hence compute a sparse word occurrence frequency matrix.
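The contrast between raw counts and TF-IDF weights can be worked through by hand on a tiny corpus. This is a minimal sketch using the textbook formula `count * log(N / document frequency)`; the corpus and the helper `tfidf` are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ, but the ranking idea is the same.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already lower-cased
docs = ["the robot learns", "the robot thinks", "the cat sleeps"]
tokenized = [doc.split() for doc in docs]

# Bag-of-words counts: what a count-based vectorizer records
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(doc_counts, n_docs):
    # Textbook weighting: term count * log(N / document frequency)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in doc_counts.items()}

weights = [tfidf(c, len(docs)) for c in counts]
print(counts[0])   # every word in the first document has count 1
print(weights[0])  # "the" gets weight 0, "learns" gets the largest weight
```

All three words of the first document have the same count, yet "the" (present in every document) is weighted zero while "learns" (present in only one) is weighted highest.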



In [11]:
def response(user_response):
    robo1_response = ' '
    TfidfVec = TfidfVectorizer(tokenizer = LemNormalize, stop_words = 'english')
    tfidf = TfidfVec.fit_transform(sentence_tokens)
    vals = cosine_similarity(tfidf[-1],tfidf)
    idx = vals.argsort()[0][-2]
    flat = vals.flatten() # flatten the 1 x n similarity matrix into a 1-D array
    flat.sort()
    req_tfidf = flat[-2]
    if (req_tfidf == 0):
        robo1_response = robo1_response + "I am sorry. Unable to understand you!"
        return robo1_response
    else:
        robo1_response = robo1_response + sentence_tokens[idx]
        return robo1_response
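The selection logic in `response` relies on the user's input being the last row of the TF-IDF matrix: its similarity to itself is the highest score, so the second-highest score (`[-2]`) points at the best-matching corpus sentence. The sketch below reproduces that idea with plain lists; the vectors are made-up numbers, not real TF-IDF output, and `cosine` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; the last row plays the role of the user's query
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 1.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 1.0],  # query
]

query = vectors[-1]
scores = [cosine(query, v) for v in vectors]

# Sort indices by ascending score, like vals.argsort() in the cell above
ranked = sorted(range(len(scores)), key=lambda i: scores[i])
best = ranked[-2]  # ranked[-1] is the query itself
print(best, scores[best])
```

If the second-highest score is 0, no corpus sentence shares a weighted term with the query, which is the case in which `response` returns its apology message.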

>Stop words are a set of commonly used words in a language. In English, for example, “the”, “is”, and “and” easily qualify as stop words. NLP and text mining applications filter out stop words so that processing can focus on the more informative words.
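Stop-word filtering amounts to dropping every token that appears in a fixed list. A minimal sketch, assuming a deliberately tiny hand-written list (`STOP_WORDS` and `remove_stop_words` are hypothetical names; real lists, such as the one selected by `stop_words = 'english'`, are much longer):

```python
# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"the", "is", "and", "a", "of"}

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word list
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "the robot is smart and the robot learns".split()
print(remove_stop_words(tokens))
```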

## Defining the Chat Flow

In [12]:
flag = True
print('Hello! I am Retrieval Learning Bot. Start typing your text after greeting to talk to me. For ending conversation type bye!')
while flag:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thank you' or user_response == 'thanks':
            flag = False
            print('Bot: You are Welcome..')
        else:
            greeting = greet(user_response)  # call once: greet() picks a random reply
            if greeting is not None:
                print('Bot: ' + greeting)
            else:
                # temporarily add the query so response() can compare it with the corpus
                sentence_tokens.append(user_response)
                print('Bot: ', end='')
                print(response(user_response))
                sentence_tokens.remove(user_response)
    else:
        flag = False
        print('Bot: Goodbye!')

Hello! I am Retrieval Learning Bot. Start typing your text after greeting to talk to me. For ending conversation type bye!
hello
Bot: hi
Tell me about yourself
Bot:  I am sorry. Unable to understand you!
Tell me about genral AI
Bot:  "what is ai and how it work".
ok
Bot:  I am sorry. Unable to understand you!
bye
Bot: Goodbye!
