# **Import Libraries**

In [143]:
import numpy as np
import nltk
import string
import random

# **Import and Read Data**

In [144]:
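# Use a context manager so the file handle is closed after reading
with open('/content/drive/MyDrive/Colab Notebooks/data.txt', 'r', errors='ignore') as f:
    data = f.read().lower()  # lowercase so later matching is case-insensitive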
nltk.download('punkt')
nltk.download('wordnet')
sent_tokens = nltk.sent_tokenize(data)
word_tokens = nltk.word_tokenize(data)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Check Tokens**

In [145]:
sent_tokens[:3]

['data science is an interdisciplinary academic field[1] that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.',
 '[2]\n\ndata science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine).',
 '[3] data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.']

In [146]:
word_tokens[:3]

['data', 'science', 'is']

# **Data Pre-Processing**

In [147]:
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    # Return the lemmatized tokens (the original omitted the return statement)
    return [lemmer.lemmatize(token) for token in tokens]

def LemNormalize(text):
    # Tokenize, drop non-alphanumeric tokens such as punctuation, then lemmatize
    tokens = nltk.word_tokenize(text)
    return LemTokens(token for token in tokens if token.isalnum())

# **Greeting Function**

In [148]:
GREET_INPUTS = ["hi","hello","what's up","hey","greetings"]
GREET_RESPONSES = ["hi there I'm AIzy. How can I help you!!","hello I'm AIzy. How can I help you!","hey I'm AIzy. How can I help you!"]
# Roman Urdu replies, roughly: "Peace be upon you too, my name is AIzy... how can I help you?"
GREET_MUSLIMS = ["wasalam mera naam AIzy hai...Mai apki kis tarha madad kar sakti hoo!!","walaikum assalam mera naam AIzy hai...Mai apki kis tarha madad kar sakti hoo!!"]
SALAM_INPUTS = ["aoa","assalam o alaikum","assalam-o-alaikum","assalam"]

def greet(sentence):
    text = sentence.lower().strip()
    # Multi-word greetings can never match a single split() word, so test the whole input first
    if text in GREET_INPUTS:
        return random.choice(GREET_RESPONSES)
    if text in SALAM_INPUTS:
        return random.choice(GREET_MUSLIMS)
    for word in text.split():
        if word in GREET_INPUTS:
            return random.choice(GREET_RESPONSES)
        if word in SALAM_INPUTS:
            return random.choice(GREET_MUSLIMS)
    return None  # no greeting found



# **Response Generator**

In [149]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [150]:
nltk.download('stopwords')
from nltk.corpus import stopwords
def response(user_response):
    # Assumes the caller has already appended user_response to sent_tokens,
    # so the last TF-IDF row represents the user's input.
    TfidVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words=stopwords.words('english'))
    tfidf = TfidVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)  # similarity of the input to every sentence
    idx = vals.argsort()[0][-2]  # best match other than the input itself
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]  # similarity score of that best match
    if req_tfidf == 0:
        return "I am sorry! I don't understand you"
    return sent_tokens[idx]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**argsort()** returns the indices that would sort an array in ascending order. Here, vals.argsort()[0] gives the sentence indices ordered from least to most similar to the user input. The user input is itself the last entry in sent_tokens and has similarity 1 with itself, so it sorts last; [-2] therefore selects the most similar sentence other than the input.

**flatten()** converts the 2D array vals into a 1D array (flat), and flat.sort() sorts that array in place in ascending order. This makes the similarity scores themselves, not just their indices, available in sorted order.

**flat[-2]** is the second-to-last element of the sorted array, i.e. the similarity score of the best-matching sentence. If this value (req_tfidf) is 0, no sentence shares any term with the input and the chatbot apologizes; otherwise it replies with the most similar sentence.
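
The selection logic described above can be tried on a small hand-written array. This is only a sketch: the similarity values are made up for illustration and stand in for the output of cosine_similarity, with the last entry playing the role of the user input compared with itself.

```python
import numpy as np

# Hypothetical similarity row (not real chatbot output).
# The last value is the user input compared with itself, so it is 1.0.
vals = np.array([[0.10, 0.45, 0.00, 1.00]])

# argsort gives indices in ascending order of similarity: [2, 0, 1, 3]
idx = vals.argsort()[0][-2]   # second-to-last index: best match excluding the input

flat = vals.flatten()         # 2D -> 1D copy
flat.sort()                   # in-place ascending sort: [0.0, 0.1, 0.45, 1.0]
req_tfidf = flat[-2]          # similarity score of that best match

print(idx, req_tfidf)
```

With these values, idx points at the second sentence (index 1) and req_tfidf is its score, 0.45. If every entry except the last were 0, req_tfidf would be 0 and the chatbot would fall back to its apology.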

# **Defining Conversation**

In [152]:
flag = True
print("AIzy: Hey!!I am AIzy....Let's have a conversation...If you want to exit just type esc")
while flag:
    user_response = input().lower()
    if user_response == 'esc':
        flag = False
        print("AIzy:GoodBye!!")
    elif user_response == 'bye':
        flag = False
        print('Bye!!...Have a nice Day :)')
    else:
        greeting = greet(user_response)  # call once: greet() picks a random reply
        if greeting is not None:
            print("AIzy:" + greeting)
        else:
            # Temporarily add the input so response() can compare it with the corpus
            sent_tokens.append(user_response)
            print("AIzy: ", end="")
            print(response(user_response))
            sent_tokens.remove(user_response)





AIzy: Hey!!I am AIzy....Let's have a conversation...If you want to exit just type esc
esc
AIzy:GoodBye!!
