<a href="https://www.kaggle.com/rajkumarl/wiki-ir-chatbot?scriptVersionId=83842798" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Wiki-IR-ChatBot

### A ChatBot that responds to humans by retrieving information directly from Wikipedia.
### Users are free to choose any topic of interest!


![robo_chat](https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Images/robo_chat.jpg)
> [Image by Brett Jordan](https://unsplash.com/@brett_jordan)

### What is a chatbot?
> A ChatBot is a kind of virtual assistant that can hold conversations with human users: a chatting robot. Building a chatbot is one of the popular tasks in Natural Language Processing.

### What is Information Retrieval?
> Information Retrieval (IR for short) is the task of identifying and collecting the most relevant information from a source based on a pre-defined heuristic. Text is a good example of unstructured data, and it is abundant everywhere. Finding a piece of information manually in a huge collection of text is hard, and since information is usually needed quickly, a good IR system is always in demand.

### Are all chatbots the same?
> Chatbots fall under three common categories:
>##### 1. Rule-based chatbots
>##### 2. Retrieval-based chatbots
>##### 3. Intelligent chatbots

### Rule-based chatbots
> These bots respond to users' inputs based on pre-specified rules, which can be written, for instance, as if-elif-else statements. While writing the rules, it is important to anticipate all possible user inputs; otherwise the bot may fail to answer properly. Rule-based chatbots therefore do not possess any cognitive skills.
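
As a minimal sketch of the idea, a rule-based bot can be a plain chain of if-elif-else checks. The function name `rule_based_reply`, the rules and the replies below are made up for illustration and are not part of this project.

```python
def rule_based_reply(text):
    """Return a canned reply chosen by hand-written rules (illustrative only)."""
    # normalize the input so that the rules are case-insensitive
    text = text.lower().strip()
    if text in ("hi", "hello", "hey"):
        return "Hello! How can I help you?"
    elif "your name" in text:
        return "I am a rule-based bot."
    elif text in ("bye", "quit", "exit"):
        return "See you soon! Bye!"
    else:
        # any input the rules did not anticipate ends up here
        return "Sorry, I do not understand."


print(rule_based_reply("Hello"))
print(rule_based_reply("What is your name?"))
print(rule_based_reply("Explain lemmatization."))
```

The last call falls through to the `else` branch, which shows why every expected input needs its own rule.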

### Retrieval-based chatbots
> These bots respond to users' inputs by retrieving the most relevant information from a given text document. Relevance can be determined with Natural Language Processing and a scoring system such as the cosine-similarity score. Though these bots use NLP to hold conversations, they lack the cognitive skills to match a real human chatting companion. This [Wiki-IR-ChatBot](https://github.com/RajkumarGalaxy/Wiki-IR-ChatBot) falls under this category!
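
For reference, the cosine-similarity score between a query vector $q$ and a sentence vector $s$ (for example, their TF-IDF vectors) is the standard

$$
\cos(q, s) = \frac{q \cdot s}{\lVert q \rVert \, \lVert s \rVert} = \frac{\sum_i q_i s_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i s_i^2}}
$$

With non-negative TF-IDF weights the score lies between 0 (no shared terms) and 1 (same direction), so the sentence with the highest score is taken as the most relevant one.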

### Intelligent chatbots
> These bots respond to users' inputs after understanding them, as humans do. They are trained as Machine Learning models on large datasets of human conversations, which lets them approach a human in conversing. Popular virtual assistants such as Amazon's Alexa and Apple's Siri fall under this category. Further, most of these bots can take the preceding chat texts into account. [Conversational AI ChatBot](https://github.com/RajkumarGalaxy/Conversational-AI-ChatBot), built by the [author](https://github.com/RajkumarGalaxy), employs Microsoft's DialoGPT to make intelligent conversations!

### What is in this project?
> This project builds an information retrieval (IR) chatbot that scrapes Wikipedia with BeautifulSoup on the topic of the user's interest and answers the user's queries with a heuristic backed by TF-IDF and cosine-similarity scores. This Wiki-IR-ChatBot is user-friendly: it lets users choose any topic and presents either a short, crisp response or a detailed one. It leverages the NLTK library for text processing and the scikit-learn library for modeling. Let's dive deep!


# Create Environment by Importing Libraries

In [41]:
# To scrape Wikipedia
from bs4 import BeautifulSoup
# To access contents from URLs
import requests
# to preprocess text
import nltk
# to handle punctuations
from string import punctuation
# TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# cosine similarity score
from sklearn.metrics.pairwise import cosine_similarity 
# to do array operations
import numpy as np
# to have sleep option
from time import sleep 

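# NLTK data used by the class below; uncomment on first run
#nltk.download('stopwords')  # stopword list
#nltk.download('wordnet')    # data for the WordNet lemmatizer
#nltk.download('punkt')      # tokenizer models (recent NLTK releases may ask for 'punkt_tab')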

# Create a ChatBot Class

A chatbot class that performs information retrieval (IR) from Wikipedia and makes conversations with human users based on the retrieved data!

In [45]:
class ChatBot():
    
    # initialize bot
    def __init__(self):
        # flag whether to end chat
        self.end_chat = False
        # flag whether topic is found in wikipedia
        self.got_topic = False
        # flag whether to skip calling respond()
        # in some cases, the response has already been made
        self.do_not_respond = True
        
        # wikipedia title
        self.title = None
        # wikipedia scraped para and description data
        self.text_data = []
        # data as sentences
        self.sentences = []
        # to keep track of paragraph indices
        # corresponding to all sentences
        self.para_indices = []
        # currently retrieved sentence id
        self.current_sent_idx = None
        
        # a punctuation dictionary
        self.punctuation_dict = str.maketrans({p:None for p in punctuation})
        # wordnet lemmatizer for preprocessing text
        self.lemmatizer = nltk.stem.WordNetLemmatizer()
        # collection of stopwords
        self.stopwords = nltk.corpus.stopwords.words('english')
        # initialize chatting
        self.greeting()

    # greeting method - to be called internally
    # chatbot initializing chat on screen with greetings
    def greeting(self):
        print("Initializing ChatBot ...")
        # some time to get user ready
        sleep(2)
        # chat ending tags
        print('Type "bye" or "quit" or "exit" to end chat')
        sleep(2)
        # chatbot descriptions
        print('\nEnter your topic of interest when prompted. \
        \nChatBot will access Wikipedia, prepare itself to \
        \nrespond to your queries on that topic. \n')
        sleep(3)
        print('ChatBot will respond with short info. \
        \nIf you input "more", it will give you detailed info \
        \nYou can also jump to next query')
        # give time to read what has been printed
        sleep(3)
        print('-'*50)
        # Greet and introduce
        greet = "Hello, Great day! Please give me a topic of your interest. "
        print("ChatBot >>  " + greet)
        
    # chat method - should be called by user
    # chat method controls inputs, responses, data scraping, preprocessing, modeling.
    # once an instance of the ChatBot class is initialized, chat method should be called
    # to do the entire chatting in one go!
    def chat(self):
        # continue chat
        while not self.end_chat:
            # receive input
            self.receive_input()
            # finish chat if opted by user
            if self.end_chat:
                print('ChatBot >>  See you soon! Bye!')
                sleep(2)
                print('\nQuitting ChatBot ...')
            # if data scraping successful
            elif self.got_topic:
                # in case not already responded
                if not self.do_not_respond:
                    self.respond()
                # clear flag so that bot can respond next time
                self.do_not_respond = False
    
    # receive_input method - to be called internally
    # receives input from user and makes preliminary decisions
    def receive_input(self):
        # receive input from user
        text = input("User    >> ")
        # end conversation if user wishes so
        if text.lower().strip() in ['bye', 'quit', 'exit']:
            # turn flag on 
            self.end_chat=True
        # if user needs more information 
        elif text.lower().strip() == 'more':
            # respond here itself
            self.do_not_respond = True
            # if at least one query has been received 
            if self.current_sent_idx is not None:
                response = self.text_data[self.para_indices[self.current_sent_idx]]
            # prompt user to start querying
            else:
                response = "Please input your query first!"
            print("ChatBot >>  " + response)
        # if topic is not chosen
        elif not self.got_topic:
            self.scrape_wiki(text)
        else:
            # add user input to sentences, so that we can vectorize in whole
            self.sentences.append(text)
                
    # respond method - to be called internally
    def respond(self):
        # tf-idf-modeling
        vectorizer = TfidfVectorizer(tokenizer=self.preprocess)
        # fit data and obtain tf-idf vector
        tfidf = vectorizer.fit_transform(self.sentences)
        # calculate cosine similarity scores between the query (last row)
        # and every scraped sentence (all rows except the last),
        # so that the query can never be picked as its own best match
        scores = cosine_similarity(tfidf[-1], tfidf[:-1]).flatten()
        # identify the closest sentence
        best_idx = int(scores.argmax())
        # if there is a matching sentence
        if scores[best_idx] > 0:
            # remember it, so that "more" can show its paragraph
            self.current_sent_idx = best_idx
            print("ChatBot >>  " + self.sentences[self.current_sent_idx])
        # if no sentence matches the query
        else:
            print("ChatBot >>  I am not sure. Sorry!")
        # remove the user query from sentences
        del self.sentences[-1]
        
    # scrape_wiki method - to be called internally.
    # called when user inputs topic of interest.
    # employs requests to access Wikipedia via URL.
    # employs BeautifulSoup to scrape paragraph tagged data
    # and h1 tagged article heading.
    # employs NLTK to tokenize data
    def scrape_wiki(self,topic):
        # process topic as required by Wikipedia URL system
        topic = topic.lower().strip().capitalize().split(' ')
        topic = '_'.join(topic)
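        try:
            # create a URL
            link = 'https://en.wikipedia.org/wiki/'+ topic
            # access contents via the URL
            response = requests.get(link)
            # raise an error for missing pages (e.g. HTTP 404),
            # so that the except block below asks for another topic
            response.raise_for_status()
            data = response.content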
            # parse data as soup object
            soup = BeautifulSoup(data, 'html.parser')
            # extract all paragraph data
            # scrape strings with html tag 'p'
            p_data = soup.find_all('p')
            # scrape strings with html tag 'dd'
            dd_data = soup.find_all('dd')
            # scrape strings with html tag 'li'
            #li_data = soup.find_all('li')
            p_list = list(p_data)
            dd_list = list(dd_data)
            #li_list = list(li_data)
            # iterate over all data
            for tag in p_list+dd_list: #+li_list:
                # a bucket to collect processed data
                a = []
                # iterate over para, desc data and list items contents
                for i in tag.contents:
                    # exclude references, superscripts, formattings
                    if i.name != 'sup' and i.string is not None:
                        stripped = ' '.join(i.string.strip().split())
                        # collect data pieces
                        a.append(stripped)
                # with collected string pieces formulate a single string
                # each string is a paragraph
                self.text_data.append(' '.join(a))
            
            # obtain sentences from paragraphs
            for i,para in enumerate(self.text_data):
                sentences = nltk.sent_tokenize(para)
                self.sentences.extend(sentences)
                # for each sentence, its para index must be known
                # it will be useful in case user prompts "more" info
                index = [i]*len(sentences)
                self.para_indices.extend(index)
            
            # extract h1 heading tag from soup object
            self.title = soup.find('h1').string
            # turn respective flag on
            self.got_topic = True
            # tell the user that the chatbot is ready now
            print('ChatBot >>  Topic is "Wikipedia: {}". Let\'s chat!'.format(self.title)) 
        # in case of unavailable topics
        except Exception as e:
            print('ChatBot >>  Error: {}. '
                  'Please input some other topic!'.format(e))
        
    # preprocess method - to be called internally by Tf-Idf vectorizer
    # text preprocessing, stopword removal, lemmatization, word tokenization
    def preprocess(self, text):
        # lowercase and remove punctuation
        text = text.lower().strip().translate(self.punctuation_dict) 
        # tokenize into words
        words = nltk.word_tokenize(text)
        # remove stopwords
        words = [w for w in words if w not in self.stopwords]
        # lemmatize 
        return [self.lemmatizer.lemmatize(w) for w in words]

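The retrieval step in `respond()` can be illustrated without scikit-learn. The sketch below is a simplified, from-scratch stand-in, not the exact computation that `TfidfVectorizer` performs. The helper names (`tokenize`, `tfidf_vectors`, `cosine`, `best_match`) and the toy sentences are assumptions made for illustration, tokenization is a plain whitespace split without stopword removal or lemmatization, and the idf term uses the smoothed form that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1.

```python
import math
from collections import Counter
from string import punctuation


def tokenize(text):
    # lowercase, drop punctuation, split on whitespace
    return text.lower().translate(str.maketrans('', '', punctuation)).split()


def tfidf_vectors(docs):
    # build one {term: weight} dict per document
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    # cosine similarity of two sparse vectors stored as dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(sentences, query):
    # vectorize the sentences together with the query (query goes last),
    # then score the query against every sentence
    vectors = tfidf_vectors(sentences + [query])
    scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    # None signals that no sentence shares a term with the query
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)


sentences = [
    "Tea is an aromatic beverage.",
    "A bicycle has two wheels.",
    "Lemmatization reduces words to their normalized form.",
]
idx, score = best_match(sentences, "Explain lemmatization")
print(idx, round(score, 3))
print(best_match(sentences, "quantum physics"))
```

Only the third sentence shares a term with the first query, so it is the one retrieved; the second query shares no term with any sentence, which corresponds to the "I am not sure. Sorry!" branch of the bot.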

# Happy Chatting!

Initialize ChatBot and start chatting.

In [46]:
# instantiate an object
wiki = ChatBot()
# call chat method
wiki.chat()

Initializing ChatBot ...
Type "bye" or "quit" or "exit" to end chat

Enter your topic of interest when prompted.         
ChatBot will access Wikipedia, prepare itself to         
respond to your queries on that topic. 

ChatBot will respond with short info.         
If you input "more", it will give you detailed info         
You can also jump to next query
--------------------------------------------------
ChatBot >>  Hello, Great day! Please give me a topic of your interest. 


User    >>  Natural Language Processing


ChatBot >>  Topic is "Wikipedia: Natural language processing". Let's chat!


User    >>  Tell me about Turing test.


ChatBot >>  Already in 1950, Alan Turing published an article titled " Computing Machinery and Intelligence " which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articulated as a problem separate from artificial intelligence.


User    >>  What are the challenges in natural language processing?


ChatBot >>  Challenges in natural language processing frequently involve speech recognition , natural language understanding , and natural language generation .


User    >>  Explain lemmatization.


ChatBot >>  Lemmatization is another technique for reducing words to their normalized form.


User    >>  About stemming vs lemmatization?


ChatBot >>  Stemming yields similar results as lemmatization, but does so on grounds of rules, not a dictionary.


User    >>  How are hidden markov models used in natural language processing?


ChatBot >>  However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models , which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.


User    >>  What is neural machine translation?


ChatBot >>  For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT).


User    >>  Tell me about the NLP task - Document AI.


ChatBot >>  A Document AI platform sits on top of the NLP technology enabling users with no prior experience of artificial intelligence, machine learning or NLP to quickly train a computer to extract the specific data they need from different document types.


User    >>  more


ChatBot >>  A Document AI platform sits on top of the NLP technology enabling users with no prior experience of artificial intelligence, machine learning or NLP to quickly train a computer to extract the specific data they need from different document types. NLP-powered Document AI enables non-technical teams to quickly access information hidden in documents, for example, lawyers, business analysts and accountants.


User    >>  Explain about cognitive linguistics.


ChatBot >>  Cognitive linguistics is an interdisciplinary branch of linguistics, combining knowledge and research from both psychology and linguistics.


User    >>  bye


ChatBot >>  See you soon! Bye!

Quitting ChatBot ...
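
The `more` reply in the chat above works because every sentence remembers which paragraph it came from. A minimal sketch of that bookkeeping follows. The toy paragraphs are made up, and a naive split on `". "` stands in for `nltk.sent_tokenize` so that the snippet runs without NLTK data.

```python
# two toy paragraphs standing in for scraped Wikipedia paragraphs
text_data = [
    "Tea is a beverage. It is popular worldwide.",
    "A bicycle has two wheels. It is pedal-driven.",
]

sentences = []      # all sentences, in order
para_indices = []   # para_indices[k] is the paragraph index of sentences[k]

for i, para in enumerate(text_data):
    # naive sentence splitter (assumption): the class uses nltk.sent_tokenize
    parts = [s.strip() for s in para.split(". ") if s.strip()]
    sentences.extend(parts)
    # record the paragraph index once per sentence
    para_indices.extend([i] * len(parts))

# suppose retrieval picked the third sentence (index 2) as the short answer
current_sent_idx = 2
print(sentences[current_sent_idx])                 # short answer
print(text_data[para_indices[current_sent_idx]])   # detailed answer for "more"
```

Keeping `para_indices` parallel to `sentences` makes the lookup for the detailed answer a single list access.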


# Some chats by this [Wiki-IR-ChatBot](https://www.kaggle.com/rajkumarl/wiki-ir-chatbot)
### A chat on topic **[Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing)**
![chat3](https://raw.githubusercontent.com/RajkumarGalaxy/Wiki-IR-ChatBot/main/wiki_ir_chatbot_chats_3.jpg)


### A chat on topic **[Bicycle](https://en.wikipedia.org/wiki/Bicycle)**
![chat2](https://raw.githubusercontent.com/RajkumarGalaxy/Wiki-IR-ChatBot/main/wiki_ir_chatbot_chats_2.jpg)


### A chat on topic **[Tea](https://en.wikipedia.org/wiki/Tea)**
![chat1](https://raw.githubusercontent.com/RajkumarGalaxy/Wiki-IR-ChatBot/main/wiki_ir_chatbot_chats_1.jpg)

#### Thanks for your time! 
#### Find the repo with more project details [here](https://github.com/RajkumarGalaxy/Wiki-IR-ChatBot) on GitHub.