In [1]:
import numpy as np
import nltk
import random
import string
import warnings

## <font color='green'> Importing Corpus</font>

**Corpus** - The corpus is the training data our chatbot learns from. It is a collection of text documents, which can be thought of as a set of text files in a directory, often alongside other directories of text files.

In [2]:
warnings.filterwarnings('ignore')

In [3]:
with open('data_R.txt',errors='ignore') as f:
    data=f.read()

In [4]:
print(data)

Delhi (/ËˆdÉ›li/; Hindi pronunciation: [ËˆdÉªlËiË] DillÄ«; Punjabi pronunciation: [ËˆdÉªlËiË] DillÄ«; Urdu pronunciation: [ËˆdÉ›É¦liË] DÃªhlÄ«),[17] officially the National Capital Territory (NCT) of Delhi, is a city and a union territory of India containing New Delhi, the capital of India.[18][19] Straddling the Yamuna river, primarily its western or right bank, Delhi shares borders with the state of Uttar Pradesh in the east and with the state of Haryana in the remaining directions. The NCT covers an area of 1,483.0 square kilometres (572.6 sq mi).[5] According to the 2011 census, Delhi's city proper population was over 11 million,[6][20] while the NCT's population was about 16.8 million.[7] Delhi's urban agglomeration, which includes the satellite cities of Ghaziabad, Faridabad, Gurgaon and Noida in an area known as the National Capital Region (NCR), has an estimated population of over 28 million, making it the largest metropolitan area in India and the second-largest in the world (
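The garbled characters in the output above (for example `DillÄ«` where `Dillī` is expected) are consistent with a UTF-8 file being decoded as Windows-1252. A minimal sketch of reading the corpus with an explicit encoding follows, assuming the file really is UTF-8. `read_corpus` and the temporary file are hypothetical names used only for illustration; they are not part of the notebook.

```python
import os
import tempfile

def read_corpus(path, encoding="utf-8"):
    # Hypothetical helper: decode the file with an explicit encoding
    # instead of relying on the platform default.
    with open(path, encoding=encoding, errors="replace") as f:
        return f.read()

# Write a small UTF-8 sample to a temporary file so the sketch is self-contained.
sample = "Delhi (/ˈdɛli/) Dillī"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

corpus_text = read_corpus(tmp_path)
os.remove(tmp_path)

# Decoding UTF-8 bytes as Windows-1252 reproduces the kind of garbling seen above.
garbled = "Dillī".encode("utf-8").decode("cp1252")
print(corpus_text)
print(garbled)
```

If the file turns out to use a different encoding, pass that name instead; `errors="replace"` makes undecodable bytes visible rather than silently dropping them.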

In [5]:
len(data)

3549

In [6]:
len(data.split('\n'))

5

The text splits into 5 lines.

In [7]:
len(data.split('\n\n'))

3

The text splits into 3 paragraphs (blocks separated by a blank line).

In [8]:
len(data.split(' '))

575
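The counts above come from splitting on a literal `'\n'`, `'\n\n'`, or `' '`. Splitting on a single space treats newlines as part of a word and counts empty strings between repeated spaces, so the word count is only approximate. The sketch below shows a slightly more robust way to gather the same statistics; `corpus_stats` and the sample text are hypothetical and only illustrate the idea.

```python
def corpus_stats(text):
    # Hypothetical helper: basic size statistics for a piece of text.
    return {
        "characters": len(text),
        "lines": len(text.splitlines()),
        # Ignore empty blocks caused by leading or trailing blank lines.
        "paragraphs": len([p for p in text.split("\n\n") if p.strip()]),
        # str.split() with no argument splits on any run of whitespace.
        "words": len(text.split()),
    }

sample = "Delhi is a city.\nIt is in India.\n\nThe  NCT covers 1,483 sq km.\n"
stats = corpus_stats(sample)
naive_word_count = len(sample.split(" "))

print(stats)
print(naive_word_count)
```

On this sample the whitespace-aware count and the single-space count disagree, which is the effect to keep in mind when reading the figure of 575 above.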

## Converting our data to lower case

In [9]:
data=data.lower()

**Let's use WordNet and Punkt for the upcoming steps (uncomment the downloads below on the first run)**

In [10]:
#nltk.download('wordnet')

In [11]:
#nltk.download('punkt')

**Let's convert our raw data into lists of sentences and words**

In [12]:
sent=nltk.sent_tokenize(data)  #sentence tokens

In [13]:
word=nltk.wordpunct_tokenize(data) #word tokens

**Let's check the sentences and words**

In [14]:
sent[0]

'delhi (/ëˆdé›li/; hindi pronunciation: [ëˆdéªlëië] dillä«; punjabi pronunciation: [ëˆdéªlëië] dillä«; urdu pronunciation: [ëˆdé›é¦lië] dãªhlä«),[17] officially the national capital territory (nct) of delhi, is a city and a union territory of india containing new delhi, the capital of india.'

In [15]:
print(sent[1:3])

['[18][19] straddling the yamuna river, primarily its western or right bank, delhi shares borders with the state of uttar pradesh in the east and with the state of haryana in the remaining directions.', 'the nct covers an area of 1,483.0 square kilometres (572.6 sq mi).']


In [16]:
word[2]

'ëˆdé'

In [17]:
print(word[1:15])

['(/', 'ëˆdé', '›', 'li', '/;', 'hindi', 'pronunciation', ':', '[', 'ëˆdéªlëië', ']', 'dillä', '«;', 'punjabi']
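Tokens such as `'(/'` and `'›'` appear above because `wordpunct_tokenize` separates runs of word characters from runs of punctuation. To my understanding it behaves like the regular expression `\w+|[^\w\s]+`; the sketch below uses that pattern directly with `re`, so it runs without any NLTK data. `wordpunct_like` is a hypothetical stand-in, not the NLTK function itself.

```python
import re

def wordpunct_like(text):
    # Assumed approximation of nltk.wordpunct_tokenize:
    # either a run of word characters or a run of non-word, non-space characters.
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = wordpunct_like("area of 1,483.0 square kilometres (572.6 sq mi).")
print(tokens)
```

Note how the number `1,483.0` is broken into five tokens and the closing `).` stays together as one punctuation token. This is why the preprocessing step later strips punctuation before building features.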


## <font color='green'>Text Preprocessing</font>

In [18]:
lem=nltk.stem.WordNetLemmatizer()

**We are going to lemmatize the tokens and strip all punctuation marks from our data**

In [19]:
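def Lemmanization(t):
    # t is a list of word tokens; return the WordNet lemma of each one
    return [lem.lemmatize(tok) for tok in t]
# translation table that deletes every ASCII punctuation character
punc_remove=dict((ord(pun),None) for pun in string.punctuation)
def Normalise(text):
    # lower-case, strip punctuation, tokenize into words, then lemmatize
    return Lemmanization(nltk.word_tokenize(text.lower().translate(punc_remove)))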
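To see what the punctuation step does on its own, here is a small sketch that isolates it. Lemmatization is left out because it needs the WordNet data; `punc_table` and `strip_punctuation` are hypothetical names, and `str.maketrans` is used as an equivalent way to build the same kind of deletion table.

```python
import string

# Deletion table: every character in string.punctuation maps to None.
punc_table = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    # Lower-case the text, then delete ASCII punctuation characters.
    return text.lower().translate(punc_table)

cleaned = strip_punctuation("Delhi's NCT covers 1,483.0 sq km (572.6 sq mi).[5]")
leftover = strip_punctuation("dillä«;")

print(cleaned)
print(leftover)
```

Two side effects are worth knowing: digits separated by punctuation are merged (`1,483.0` becomes `14830`), and only ASCII punctuation is removed, so characters such as `«` survive.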

## <font color='green'>Let's define some Greeting Functions</font>

In [20]:
g_input=['hello','hi',"what's up",'how are you','hey','hey there','namaste']

In [21]:
g_response=['namaste','hello','hey',"It's nice talking to you",'hi']

In [22]:
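def greet(s):
    # Matches single words only, so the multi-word entries in g_input
    # ("what's up", "how are you", "hey there") can never match here.
    for word in s.split():
        if word in g_input:
            return random.choice(g_response)
    return None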
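Because `greet` compares one word at a time, the multi-word greetings in `g_input` are never matched. One possible variation is sketched below: it looks for each greeting as a whole phrase inside the input. `greet_phrase_aware` and the upper-case list names are hypothetical; the list contents are copied from the cells above.

```python
import random
import string

GREETING_INPUTS = ['hello', 'hi', "what's up", 'how are you', 'hey', 'hey there', 'namaste']
GREETING_RESPONSES = ['namaste', 'hello', 'hey', "It's nice talking to you", 'hi']

def greet_phrase_aware(s):
    # Lower-case and trim surrounding spaces and punctuation, e.g. "What's up?" -> "what's up".
    cleaned = s.lower().strip(string.punctuation + " ")
    # Pad with spaces so that "hi" does not match inside "history".
    padded = f" {cleaned} "
    for phrase in GREETING_INPUTS:
        if f" {phrase} " in padded:
            return random.choice(GREETING_RESPONSES)
    return None

print(greet_phrase_aware("What's up?"))
print(greet_phrase_aware("delhi gdp"))
```

This is still a simple substring check; punctuation in the middle of the input (as in `hi, there`) is not handled and would need a proper tokenizer.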

## <font color='green'> Text Generation </font>

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

In [25]:
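def answer(response):
    # The caller has already appended the query to `sent`, so it is the last row of the matrix.
    bot_response=''
    # Normalise tokenizes into words; Lemmanization alone would iterate over the characters of each sentence.
    tfid=TfidfVectorizer(tokenizer=Normalise,stop_words='english')
    t=tfid.fit_transform(sent)
    vals=cosine_similarity(t[-1],t)
    idx=vals.argsort()[0][-2]   # best match other than the query itself
    flat=vals.flatten()
    flat.sort()
    rt=flat[-2]                 # similarity score of that best match
    if(rt==0):
        bot_response+="Sorry! I am not able to understand you"
    else:
        bot_response=bot_response+sent[idx]
    return bot_response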
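The idea behind `answer` is retrieval: weight each word by TF-IDF, compare the query with every sentence by cosine similarity, and return the closest sentence. The sketch below shows that idea in plain Python on three toy sentences. It is not what scikit-learn computes internally: the weighting is a simple term count times a smoothed inverse document frequency, there is no stop-word removal, and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Tokenize by whitespace and weight each term by count * smoothed IDF.
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()} for toks in tokenized]

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(sentences, query):
    # Vectorize the sentences together with the query, as answer() does.
    vecs = tfidf_vectors(sentences + [query])
    scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] == 0:
        return None, 0.0
    return sentences[best], scores[best]

toy = [
    "delhi has the second-highest gdp per capita in india",
    "the nct covers an area of 1,483.0 square kilometres",
    "delhi shares borders with uttar pradesh and haryana",
]
match, score = best_match(toy, "delhi gdp")
no_match, _ = best_match(toy, "zebra")
print(match)
print(no_match)
```

A query with no word in common with any sentence scores zero everywhere, which corresponds to the "Sorry! I am not able to understand you" branch in `answer`.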
    

In [26]:
f=True
print("Hello My name is Jazzx, Lets keep having conversation and if you want to quit please type 'Bye'")
while f:
    inp=input().lower()
    if inp=='bye':
        f=False
        print("Have a nice day ahead, Take care :)")
    elif inp in ('thanks','thank you'):
        f=False
        print("Jazzx: You are welcome")
    else:
        greeting=greet(inp)   # call once so the check and the reply use the same result
        if greeting is not None:
            print("Jazzx : ",greeting)
        else:
            sent.append(inp)  # answer() expects the query as the last sentence
            print("Jazzx: ",end='')
            print(answer(inp))
            sent.pop()        # remove the query again so the corpus is unchanged
        

Hello My name is Jazzx, Lets keep having conversation and if you want to quit please type 'Bye'
hello
Jazzx :  hi
hi
Jazzx :  hey
Delgi GDP
Jazzx: [15] delhi has the second-highest gdp per capita in india (after goa).
delhi gdp
Jazzx: [15] delhi has the second-highest gdp per capita in india (after goa).
delhi language'
Jazzx: the khariboli dialect of delhi was part of a linguistic development that gave rise to the literature of the urdu language and then of modern standard hindi.
delhi human development index
Jazzx: [22] delhi ranks fifth among the indian states and union territories in human development index.
thank you
Jazzx: You are welcome
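The loop above mixes input handling with the decision about what to reply, which makes it awkward to test without typing. One way to separate the two is sketched here: a function that takes one user turn and returns the reply together with a flag saying whether to continue. `route`, `greet_fn`, and `answer_fn` are hypothetical names; in the notebook the last two would be `greet` and a wrapper around `answer`.

```python
def route(inp, greet_fn, answer_fn):
    """Return (reply, keep_going) for a single user turn."""
    text = inp.strip().lower()
    if text == "bye":
        return "Have a nice day ahead, Take care :)", False
    if text in ("thanks", "thank you"):
        return "Jazzx: You are welcome", False
    greeting = greet_fn(text)
    if greeting is not None:
        return f"Jazzx: {greeting}", True
    return f"Jazzx: {answer_fn(text)}", True

# Stand-in functions so the sketch runs on its own.
reply_bye, keep_bye = route("Bye", lambda t: None, lambda t: "n/a")
reply_hi, keep_hi = route("hello", lambda t: "hi", lambda t: "n/a")
reply_q, keep_q = route("delhi gdp", lambda t: None, lambda t: "some sentence")
print(reply_bye, keep_bye)
print(reply_hi, keep_hi)
print(reply_q, keep_q)
```

With this shape, the interactive loop reduces to reading a line, calling `route`, printing the reply, and stopping when the flag is `False`.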


In [27]:
f=True
print("Hello My name is Jazzx, Lets keep having conversation and if you want to quit please type 'Bye'")
while f:
    inp=input().lower()
    if inp=='bye':
        f=False
        print("Have a nice day ahead, Take care :)")
    elif inp in ('thanks','thank you'):
        f=False
        print("Jazzx: You are welcome")
    else:
        greeting=greet(inp)   # call once so the check and the reply use the same result
        if greeting is not None:
            print("Jazzx : ",greeting)
        else:
            sent.append(inp)  # answer() expects the query as the last sentence
            print("Jazzx: ",end='')
            print(answer(inp))
            sent.pop()        # remove the query again so the corpus is unchanged
        

Hello My name is Jazzx, Lets keep having conversation and if you want to quit please type 'Bye'
hello
Jazzx :  hey
namaste
Jazzx :  namaste
bye
Have a nice day ahead, Take care :)
