# Asking ChatBot about chosen information


*   What is..
*   Tell me about..




## Libraries or concepts used in the process
 
*   NLTK - a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

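*   TF-IDF - a statistical measure of how important a word is to a document relative to a collection of documents.

*   Cosine similarity - a measure of how similar two vectors are, based on the angle between them. Here it compares the TF-IDF vector of the user's question with the vector of each sentence.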

*   WordNet -  a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members.

*   WordNetLemmatizer - lemmatizes words using WordNet's built-in morphy function. Lemmatization reduces a word to its dictionary form (lemma), e.g. "theories" to "theory".

*   Wikipedia - Python library that makes it easy to access and parse data from Wikipedia.



## Installs libraries

In [18]:
!pip install wikipedia



## Imports libraries

In [19]:
import random
import re
import string
import unicodedata
import urllib.request
import warnings
from collections import defaultdict

import nltk
import wikipedia as wk
from IPython.display import Image
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

warnings.filterwarnings("ignore")

# POS tagger, sentence tokenizer, and WordNet data used by NLTK below
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Gets text data from url and cleans it

In [0]:
uf = urllib.request.urlopen("https://plato.stanford.edu/entries/meaning/")
html = uf.read()

In [21]:
html[:100]

b'<!DOCTYPE html>\n<!--[if lt IE 7]> <html class="ie6 ie"> <![endif]-->\n<!--[if IE 7]>    <html class="'

In [0]:
# remove html tags and unnecessary characters
def cleanhtml(raw_html):
  cleanr = re.compile('\n|<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

In [23]:
# change to string
html = html.decode("utf-8")
html[:100]

'<!DOCTYPE html>\n<!--[if lt IE 7]> <html class="ie6 ie"> <![endif]-->\n<!--[if IE 7]>    <html class="'

## Shows the beginning of the text

In [24]:
html = cleanhtml(html)
print(html[:100])
raw = html.lower()
print(raw[:100])

                  -->  Theories of Meaning (Stanford Encyclopedia of Philosophy)                    
                  -->  theories of meaning (stanford encyclopedia of philosophy)                    


## Sentence tokenizer

In [0]:
sent_tokens = nltk.sent_tokenize(raw)

In [26]:
sent_tokens[1:5]

['unfortunately, this term has also been used to mean a greatnumber of different things.',
 'in this entry, the focus is on two sortsof theory of meaning.',
 'the first sort of theoryasemantic theoryis a theory which assigns semantic contents toexpressions of a language.',
 'the second sort of theoryafoundational theory of meaningis a theory which states thefacts in virtue of which expressions have the semantic contents thatthey have.']
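
The glued words in the output above ("greatnumber", "sortsof", "theoryasemantic") appear because `cleanhtml` replaces newlines, tags, and entities with an empty string. A minimal sketch of a variant that keeps word boundaries is shown below. The function name `clean_html_keep_spaces` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import html
import re

def clean_html_keep_spaces(raw_html):
    """Strip tags, decode entities, and collapse whitespace to single spaces."""
    # Replace each tag with a space so neighbouring words stay separated
    no_tags = re.sub(r"<.*?>", " ", raw_html, flags=re.DOTALL)
    # Turn entities such as &mdash; or &amp; into the characters they stand for
    decoded = html.unescape(no_tags)
    # Collapse newlines and runs of spaces into one space
    return re.sub(r"\s+", " ", decoded).strip()

# Hypothetical sample with a line break and an entity inside the text
sample = "<p>a great\nnumber of different things&mdash;two sorts</p>"
print(clean_html_keep_spaces(sample))
```

Replacing tags with a space can leave a stray space before punctuation that directly follows an inline tag, so this is a trade-off rather than a drop-in fix.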

## Text normalization

### Word tokenization

*   Tokenization is the act of breaking up a string into pieces such as words, keywords, phrases, symbols and other elements called tokens

*   Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks may be discarded
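
As a rough illustration of the idea, the sketch below tokenizes a sentence with a regular expression from the standard library. It is a simplified stand-in for `nltk.word_tokenize`, which handles many more cases; the function name `simple_word_tokenize` and the example sentence are assumptions made for this example.

```python
import re

def simple_word_tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    # Anything that is not a letter, digit, or apostrophe acts as a separator,
    # so punctuation marks such as commas and question marks are discarded
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = simple_word_tokenize("The second sort of theory is a foundational theory, isn't it?")
print(tokens)
```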

 


## Working with Unicode
Strings are usually easy to deal with when they contain only ASCII characters, but problems can appear once non-ASCII characters are involved.

### What are strings made of?
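A byte is a unit of information made up of 8 bits. Bytes are used to store every file on a hard disk, so all of the CSV and JSON files on your computer are made of bytes. A Python 3 `str`, by contrast, is a sequence of Unicode code points; it is converted to bytes (encoded) when it is stored or sent.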
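
The difference between characters and bytes can be seen directly in Python. The short sketch below encodes a word to UTF-8 and compares the two lengths; the variable names are chosen for this example only.

```python
# "\u00f6" is the single code point for "ö"
word = "pyth\u00f6n"

encoded = word.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(len(word))      # number of code points in the str
print(len(encoded))   # number of bytes after encoding
print(encoded)
print(decoded == word)
```

The byte string is one element longer than the text because "ö" needs two bytes in UTF-8.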



In [27]:
# The ord() function returns an integer representing the Unicode character
ord('🐍')

128013

In [28]:
# The chr() returns a character (a string) whose Unicode code point is the integer
chr(128013)

'🐍'

### ASCII



* character encoding standard
* 128 characters (code points 0 to 127), each fitting in 7 bits
* enough for English text, which was fine for the first few decades or so



### Unicode

* International standard that maps each character to a unique number (its code point)
* Over 137k characters covering the scripts used for English, Hindi, Chinese and Japanese, among others, as well as emojis



### Unicode encodings UTF-8, UTF-16, and UTF-32


*   UTF-8 uses 1, 2, 3 or 4 bytes to encode each code point
*   UTF-16 uses 2 or 4 bytes per code point, which is compact for many Asian scripts
*   UTF-32 uses a fixed 4 bytes per code point, needs a lot of memory, and is not used very often

`bytes.decode()` -> `str` <br>
`str.encode()` -> `bytes`

![alt text](https://cdn.bulldogjob.com/system/photos/files/000/005/268/original/1_nyvQSXsxG7cZILqZ8H5-Wg.png)
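
To make the size differences concrete, the sketch below counts the bytes a few characters need in each encoding. The helper name `encoded_sizes` is an assumption for this example, and the little-endian codec names (`utf-16-le`, `utf-32-le`) are used so that no byte-order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Return the number of bytes `ch` needs in each encoding (no byte-order mark)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# a, ö, euro sign, snake emoji
samples = ["a", "\u00f6", "\u20ac", "\U0001F40D"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For mostly-ASCII text UTF-8 is the smallest of the three, while characters such as the euro sign take 3 bytes in UTF-8 but only 2 in UTF-16.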

### Example of encoding and decoding

In [29]:
word = "pythön"
# NFKD normalization decomposes "ö" into "o" plus a combining diaeresis;
# encoding to ASCII with 'ignore' then drops only the combining mark
print(unicodedata.normalize('NFKD', word).encode('ascii', 'ignore'))
print(unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore'))

b'python'
python


In [30]:
word = "pythön"
# 'ignore' drops characters that cannot be encoded instead of replacing them
print(word.encode('ascii', 'ignore'))
print(word.encode('ascii', 'ignore').decode('utf-8', 'ignore'))

b'pythn'
pythn
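
The two results differ because of what NFKD normalization does before the ASCII encoding step. The sketch below inspects the normalized string; it uses only the standard-library `unicodedata` module, and the variable names are chosen for this example.

```python
import unicodedata

word = "pyth\u00f6n"  # "ö" as a single precomposed code point
decomposed = unicodedata.normalize("NFKD", word)

# The decomposed form is one code point longer: "ö" became "o" + a combining mark
print(len(word), len(decomposed))
print([unicodedata.name(ch) for ch in decomposed[4:6]])

# Only the combining mark is non-ASCII now, so 'ignore' keeps the base letter
print(decomposed.encode("ascii", "ignore"))
# Without normalization the whole "ö" is non-ASCII and is dropped
print(word.encode("ascii", "ignore"))
```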


## Text normalization

## POS tagging

*  POS tagging is one of the fundamental tasks of natural language processing and supports others, e.g. word sense disambiguation
*  words often occur in different senses as different parts of speech, e.g.:

1.  She saw a bear
2.  Your efforts will bear fruit

* the two uses of "bear" have completely different senses: one is a noun and the other is a verb.
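
The sketch below shows how a POS tag can steer lemmatization for the two "bear" sentences. The Penn Treebank tags are assigned by hand for illustration and are not tagger output, and the WordNet part-of-speech constants are written as plain strings so the example runs without NLTK data. The helper name `wordnet_pos` is an assumption for this example.

```python
from collections import defaultdict

# WordNet part-of-speech constants as plain strings
# (the same values as NOUN, VERB, ADJ, ADV in nltk.corpus.wordnet)
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

# Map the first letter of a Penn Treebank tag to a WordNet POS, defaulting to noun
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

def wordnet_pos(tagged_tokens):
    """Pair each token with the WordNet POS derived from its Treebank tag."""
    return [(token, tag_map[tag[0]]) for token, tag in tagged_tokens]

# Hand-assigned tags (assumption), one list per example sentence
sentence_1 = [("She", "PRP"), ("saw", "VBD"), ("a", "DT"), ("bear", "NN")]
sentence_2 = [("Your", "PRP$"), ("efforts", "NNS"), ("will", "MD"), ("bear", "VB"), ("fruit", "NN")]

print(wordnet_pos(sentence_1))
print(wordnet_pos(sentence_2))
```

With this mapping "bear" is treated as a noun in the first sentence and as a verb in the second, which is the information a lemmatizer needs to pick the right base form.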

In [0]:
def Normalize(text):
    # map every ASCII punctuation character to None so translate() deletes it
    remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
    # word tokenization on the lowercased, punctuation-free text
    word_token = nltk.word_tokenize(text.lower().translate(remove_punct_dict))

    # strip accents and drop any remaining non-ASCII characters
    new_words = []
    for word in word_token:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)

    # remove leftover tags (a safeguard: the punctuation step above already deletes < and >)
    rmv = []
    for w in new_words:
        cleaned = re.sub("</?.*?>", "", w)
        rmv.append(cleaned)

    # POS tagging and lemmatization: the first letter of each Penn Treebank tag
    # selects the WordNet POS, with noun as the default
    tag_map = defaultdict(lambda : wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV
    lmtzr = WordNetLemmatizer()
    lemma_list = []
    # drop tokens that became empty during cleaning
    rmv = [i for i in rmv if i]
    for token, tag in nltk.pos_tag(rmv):
        lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
        lemma_list.append(lemma)
    return lemma_list

## Creating greeting responses

In [0]:
# inputs that the bot recognizes as a greeting
welcome_input = ["hello", "hi", "greetings", "sup", "what's up","hey"]
# replies the bot picks from at random
welcome_response = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def welcome(user_response):
    # match the whole phrase first so multi-word greetings like "what's up" work
    text = user_response.lower()
    if text in welcome_input:
        return random.choice(welcome_response)
    for word in text.split():
        if word in welcome_input:
            return random.choice(welcome_response)
    return None

## Generating response for the knowledge question

### Term Frequency (TF)


* The number of times a word appears in a document divided by the total number of words in the document

### Inverse Document Frequency (IDF)


* The log of the number of documents divided by the number of documents that contain the word w.
* Inverse document frequency gives more weight to words that are rare across the documents in the corpus
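
A small worked example of the two definitions above, using only the standard library. The toy corpus and the function names `tf`, `idf`, and `tf_idf` are assumptions for this sketch, and `idf` assumes the word occurs in at least one document. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values will differ from these textbook ones.

```python
import math

# Toy corpus: each document is a list of tokens
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(word, doc):
    """Occurrences of `word` in `doc` divided by the number of words in `doc`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of (number of documents / number of documents containing `word`)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(word, doc) * idf(word, docs)

for word in ("the", "cat", "dog"):
    print(word, round(idf(word, docs), 3))
```

A word that appears in every document, like "the" here, gets an IDF of zero, while a word found in a single document gets the highest weight.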



In [0]:
def generateResponse(user_response):
    # temporarily add the question to the corpus so it is vectorized with the sentences
    # (the caller removes it again after the answer is printed)
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=Normalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # similarity of the question (last row) to every sentence, itself included
    vals = cosine_similarity(tfidf[-1], tfidf)
    # linear_kernel(tfidf[-1], tfidf) gives the same scores here because
    # TfidfVectorizer L2-normalizes its rows by default
    # the top score is the question itself, so take the second-best match
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0 or "tell me about" in user_response:
        print("Checking Wikipedia")
        return wikipedia_data(user_response)
    return sent_tokens[idx]


def wikipedia_data(user_input):
    # wikipedia search: only "tell me about <topic>" questions are understood
    reg_ex = re.search('tell me about (.*)', user_input)
    if not reg_ex:
        return "I am sorry, I could not find an answer to that."
    try:
        topic = reg_ex.group(1)
        return wk.summary(topic, sentences=3)
    except Exception:
        return "No content has been found"
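
The retrieval step can be illustrated without scikit-learn. The sketch below is a simplification under stated assumptions: it uses raw word counts rather than TF-IDF weights, splits on whitespace instead of calling `Normalize`, and compares the question to each sentence directly rather than appending it to the corpus. The names `cosine` and `best_match` and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(sentences, question):
    """Return (sentence, score) for the sentence most similar to the question."""
    q_vec = Counter(question.split())
    scored = [(cosine(Counter(s.split()), q_vec), s) for s in sentences]
    score, sentence = max(scored, key=lambda pair: pair[0])
    return sentence, score

sentences = [
    "a semantic theory assigns contents to expressions",
    "a foundational theory states facts about meaning",
    "possible worlds are used in semantics",
]
sentence, score = best_match(sentences, "what is a semantic theory")
print(sentence, round(score, 3))
```

A score of zero means the question shares no words with any sentence, which is the case in which the notebook falls back to Wikipedia.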

## Running the bot in a loop

In [0]:
flag = True
print("My name is Chatterbot and I'm a chatbot. If you want to exit, type Bye!")
while flag:
    # lowercase the input so matching is case-insensitive
    user_response = input().lower()
    if user_response in ['bye', 'shutdown', 'exit', 'quit']:
        # the user wants to exit
        flag = False
        print("Chatterbot : Bye!!! ")
    elif user_response in ('thanks', 'thank you'):
        # thanking the bot also ends the conversation
        flag = False
        print("Chatterbot : You are welcome..")
    elif user_response in welcome_input:
        print("Chatterbot : " + welcome(user_response))
    else:
        print("Chatterbot : ", end="")
        print(generateResponse(user_response))
        # generateResponse appended the question to sent_tokens; take it out again
        sent_tokens.remove(user_response)

My name is Chatterbot and I'm a chatbot. If you want to exit, type Bye!
Tell me about nothing
Chatterbot : Checking Wikipedia
"Nothing", used as a pronoun subject, is the absence of a something or particular thing that one might expect or desire to be present ("We found nothing", "Nothing was there") or the inactivity of a thing or things that are usually or could be active ("Nothing moved", "Nothing happened").  As a predicate or complement "nothing" is the absence of meaning, value, worth, relevance, standing, or significance ("It is a tale/ Told by an idiot, full of sound and fury,/ Signifying nothing"; "The affair meant nothing"; "I'm nothing in their eyes").  "Nothingness" is a philosophical term for the general state of nonexistence, sometimes reified as a domain or dimension into which things pass when they cease to exist or out of which they may come to exist, e.g., God is understood to have created the universe ex nihilo, "out of nothing".
What is theory of meaning
Chatterbot 

### Ask what is:

*   What is theory of meaning
*   What is possible worlds semantics

### Ask to tell:

*   Tell me about nothing
*   Tell me about human



In [0]:
'''Sources: https://towardsdatascience.com/lets-build-an-intelligent-chatbot-7ea7f215ada6,
https://towardsdatascience.com/a-guide-to-unicode-utf-8-and-strings-in-python-757a232db95c'''