<a href="https://colab.research.google.com/github/Aryaa6603/PolicyBot/blob/main/Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PolicyBot**, your one-stop assistant for company policies and rules!

Navigating complex company rules and policies can be daunting for employees, leading to misunderstandings and compliance issues. Traditional methods, like manuals, often fall short in engaging and promptly informing employees.



**Project Description:**

In response to the challenge of effectively communicating company rules and policies to employees, I propose building a specialized Chatbot that provides quick and accurate information about company regulations.

My goal is to create a Chatbot that:

1. **Ensures Instant Responses:** Offers real-time answers to employee queries on company rules.

2. **Enhances Accessibility:** Makes information easily accessible to all employees regardless of their role.

3. **Boosts Engagement:** Creates a user-friendly interface that encourages employees to stay informed.

4. **Allows Customization:** Tailors responses to specific company policies for accuracy.

5. **Enables Integration:** Seamlessly integrates into existing communication channels within the company.

By developing this specialized Chatbot, I aim to streamline communication, foster compliance, and empower employees with the knowledge needed to navigate corporate regulations effectively. The project aligns with my broader objective of enhancing organizational transparency and creating a more informed workforce.

## Import necessary libraries

In [1]:
import io
import random
import string # to process standard python strings
import warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
warnings.filterwarnings('ignore') # silence library warnings in the notebook output

## Downloading and installing NLTK
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

[Natural Language Processing with Python](http://www.nltk.org/book/) provides a practical introduction to programming for language processing.


In [2]:
pip install nltk



### Installing NLTK Packages




In [3]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # for downloading packages
#nltk.download('punkt') # first-time use only
#nltk.download('wordnet') # first-time use only

True

## Reading in the corpus

For my example, I will be using the dataset of sample policies that I created as my corpus. Download the file named ‘chatbot.txt’ and make it available at the path used below. However, you can use any corpus of your choice.

In [4]:
with open('/content/chatbot.txt', 'r', errors='ignore') as f:  # the file is closed automatically
    raw = f.read()
raw = raw.lower()  # converts to lowercase

The primary challenge with text data is its format in strings, which is not suitable for machine learning algorithms. To prepare it for NLP projects, several basic text pre-processing steps are essential:

1. **Case Normalization:** Convert the entire text to either uppercase or lowercase to ensure consistent treatment of words regardless of their cases.

2. **Tokenization:** Tokenization converts text strings into lists of tokens, such as words. A sentence tokenizer splits a text into sentences, while a word tokenizer splits a string into words. NLTK provides a pre-trained Punkt tokenizer for English.

3. **Noise Removal:** Eliminate everything that doesn't consist of standard numbers or letters, removing irrelevant characters.

4. **Stop Words Removal:** Exclude extremely common words that contribute little value in matching user needs. These words, known as stop words, are removed from the vocabulary.

5. **Stemming:** Reduce inflected or derived words to their base or root form. For example, stemming "Stems," "Stemming," "Stemmed," and "Stemtization" results in the single word "stem."

6. **Lemmatization:** A variation of stemming, lemmatization reduces each word to its dictionary form (lemma), so the result is always an actual word. For instance, "run" is the base form of "running" and "ran," and "better" and "good" share the same lemma, so they are treated as identical.
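
The first four steps can be sketched with the standard library alone. This is only a minimal illustration, not the pipeline used later in this notebook: the regex-based tokenizer and the tiny `STOP_WORDS` set are simplifying assumptions, whereas the notebook itself relies on NLTK's tokenizer and scikit-learn's English stop-word list.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "are", "to", "a", "an", "of", "in"}

def preprocess(text):
    """Lowercase, strip punctuation, split into words, and drop stop words."""
    text = text.lower()                                               # case normalization
    text = text.translate(str.maketrans("", "", string.punctuation))  # noise removal
    tokens = re.findall(r"[a-z0-9]+", text)                           # simplified tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]           # stop-word removal

print(preprocess("Employees are allowed to bring personal devices."))
# ['employees', 'allowed', 'bring', 'personal', 'devices']
```

Stemming and lemmatization are left out here because they need a language resource such as WordNet.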


## Tokenisation

In [5]:
sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences
word_tokens = nltk.word_tokenize(raw)# converts to list of words

## Preprocessing

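I now define a function called LemTokens, which takes a list of tokens and returns their lemmatized forms, and a function called LemNormalize, which lowercases a text, strips its punctuation, tokenizes it, and lemmatizes the resulting tokens.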

In [6]:
lemmer = nltk.stem.WordNetLemmatizer()
#WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
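
To see what `remove_punct_dict` does on its own, the snippet below applies it to a sample sentence. It needs only the standard library; the sample sentence is made up for illustration.

```python
import string

# Map every punctuation code point to None so that str.translate deletes it.
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

sample = "Hello, PolicyBot! What's the leave policy?"
cleaned = sample.lower().translate(remove_punct_dict)
print(cleaned)  # hello policybot whats the leave policy
```

Note that the apostrophe counts as punctuation too, so "what's" becomes "whats" before tokenization.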

## Keyword matching

Next, I define a function that handles greetings: if the user’s input contains a greeting, the bot returns a randomly chosen greeting in response.

In [7]:
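GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):
    """Return a random greeting if the sentence contains one, otherwise None."""
    text = sentence.lower()
    # strip surrounding punctuation so that "hello!" still matches "hello"
    words = [word.strip(string.punctuation) for word in text.split()]
    for phrase in GREETING_INPUTS:
        # single words must match a whole token; multi-word phrases such as
        # "what's up" can only be found as substrings of the full text
        if phrase in words or (" " in phrase and phrase in text):
            return random.choice(GREETING_RESPONSES)
    return None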

## Generating Response
**Bag of Words:**

Following the initial preprocessing stage, the transformation of text into a meaningful array of numbers is crucial. The bag-of-words represents text by capturing the occurrence of words within a document. It comprises two key elements:

1. A vocabulary of known words.
2. A measure of the presence of known words.

The term "bag" is used because any information about the order or structure of words is discarded, focusing solely on whether known words occur in the document, not their specific placement.

The underlying idea behind the Bag of Words is that documents with similar content are considered similar. It allows us to extract meaning from the document based solely on its content.

For example, if our dictionary includes words like {Learning, is, the, not, great}, and we want to vectorize the text "Learning is great," the resulting vector would be (1, 1, 0, 0, 1).
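
The example above can be reproduced in a few lines. This is a minimal sketch using whitespace tokenization and binary presence flags; the helper name `bag_of_words` is chosen here for illustration and is not part of any library.

```python
def bag_of_words(text, vocabulary):
    """Return a binary vector marking which vocabulary words occur in the text."""
    words = set(text.lower().split())
    return tuple(1 if term.lower() in words else 0 for term in vocabulary)

vocabulary = ["Learning", "is", "the", "not", "great"]
print(bag_of_words("Learning is great", vocabulary))  # (1, 1, 0, 0, 1)
```

Because word order is discarded, "great is Learning" produces exactly the same vector.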

**TF-IDF Approach:**

The Bag of Words approach has limitations, especially with highly frequent words dominating and potentially giving more weight to longer documents. To address this, we can rescale the frequency of words using Term Frequency-Inverse Document Frequency (TF-IDF). This scoring method considers both the frequency of a word in the current document and how rare the word is across all documents.

- Term Frequency (TF): Measures the frequency of a word in the current document.
  - TF = (Number of times term t appears in a document)/(Number of terms in the document)

- Inverse Document Frequency (IDF): Scores how rare a word is across documents.
  - IDF = 1 + log(N/n), where N is the number of documents, and n is the number of documents a term t has appeared in.
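
The two formulas can be checked by hand on a toy corpus. The sketch below assumes the natural logarithm and uses three made-up documents. scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF, 1 + ln((1 + N) / (1 + n)), and normalizes each row, so its numbers will differ somewhat from this hand calculation.

```python
import math

def tf(term, document):
    """Term frequency: occurrences of the term divided by the number of terms."""
    return document.count(term) / len(document)

def idf(term, documents):
    """Inverse document frequency: 1 + log(N / n), using the natural log."""
    n = sum(1 for doc in documents if term in doc)
    return 1 + math.log(len(documents) / n)  # assumes the term occurs in at least one document

# Hypothetical mini-corpus, already tokenized into words.
docs = [
    "overtime work must be pre-approved".split(),
    "performance reviews are conducted annually".split(),
    "remote work requires manager approval".split(),
]

print(tf("work", docs[0]))               # 0.2
print(round(idf("work", docs), 3))       # 1.405
print(round(idf("overtime", docs), 3))   # 2.099
```

"overtime" appears in only one document, so it receives a higher IDF than "work", which appears in two.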

**Cosine Similarity:**

TF-IDF weight is a statistical measure used in information retrieval and text mining to assess a word's importance in a document. Cosine Similarity, calculated using the dot product of two non-zero vectors, is employed to measure the similarity between two documents (d1 and d2).

Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
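
As a small illustration of the formula, the function below computes the cosine similarity of two plain Python lists. The name `cosine_sim` is chosen here to avoid clashing with scikit-learn's `cosine_similarity`, which the notebook uses for the actual computation.

```python
import math

def cosine_sim(d1, d2):
    """Cosine similarity of two equal-length, non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

print(cosine_sim([1, 1, 0], [1, 1, 0]))  # approximately 1: same direction
print(cosine_sim([1, 0, 0], [0, 1, 0]))  # 0.0: no terms in common
```

A value near 1 means the two documents use the same terms in the same proportions, while 0 means they share no terms at all.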

To generate a response from the bot for an input question, the concept of document similarity is used. I define a function response which adds the user’s utterance to the list of corpus sentences, converts all of them to TF-IDF vectors, and returns the corpus sentence whose cosine similarity to the utterance is highest. If no sentence shares any term with the input, so that the best similarity is zero, it returns the response: "I am sorry! I don’t understand you"

In [8]:
def response(user_response):
    # Append the query so that it is vectorized together with the corpus;
    # the caller is expected to remove it again afterwards.
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the query (last row) to every sentence, including itself.
    vals = cosine_similarity(tfidf[-1], tfidf)
    # The top match is the query itself, so take the second-highest score.
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        return "I am sorry! I don't understand you"
    return sent_tokens[idx]
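
As a possible variation, the vectorizer can be fitted once on the corpus and the query transformed separately, which avoids appending the query to the sentence list and refitting on every call. The sketch below is an assumption-laden alternative, not the function used in this notebook: `best_match` and the three-sentence `corpus` are hypothetical, and scikit-learn's default tokenizer stands in for `LemNormalize`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for sent_tokens.
corpus = [
    "performance reviews are conducted annually.",
    "employees are allowed to bring personal devices.",
    "please ensure that overtime work is pre-approved.",
]

vectorizer = TfidfVectorizer(stop_words="english")
corpus_tfidf = vectorizer.fit_transform(corpus)  # fit once, on the corpus only

def best_match(query):
    """Return the corpus sentence most similar to the query, or a fallback."""
    query_tfidf = vectorizer.transform([query.lower()])
    scores = cosine_similarity(query_tfidf, corpus_tfidf).flatten()
    idx = scores.argmax()
    if scores[idx] == 0:  # the query shares no term with any sentence
        return "I am sorry! I don't understand you"
    return corpus[idx]

print(best_match("How are performance reviews conducted?"))
print(best_match("xyzzy"))
```

Since the query is never part of the fitted matrix, the highest score can be used directly instead of the second-highest one.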



Finally, I define the lines that the bot says when starting and ending a conversation, depending on the user’s input.

In [10]:
flag = True
print("PolicyBot: My name is PolicyBot. I will answer your queries about company policies. If you want to exit, type Bye!")
while flag:
    user_response = input().lower()
    if user_response != 'bye':
        if user_response in ('thanks', 'thank you'):
            flag = False
            print("PolicyBot: You are welcome..")
        else:
            # call greeting() once: it picks a random reply on every call
            reply = greeting(user_response)
            if reply is not None:
                print("PolicyBot: " + reply)
            else:
                print("PolicyBot: ", end="")
                print(response(user_response))
                # undo the append made inside response()
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("PolicyBot: Bye! take care..")

PolicyBot: My name is PolicyBot. I will answer your queries about company policies. If you want to exit, type Bye!
Can I work remotely?
PolicyBot: please ensure that overtime work is pre-approved.
Explain the overtime compensation policy.
PolicyBot: please ensure that overtime work is pre-approved.
How are performance reviews conducted?
PolicyBot: performance reviews are conducted annually.
Can employees bring personal devices to work
PolicyBot: yes, employees are allowed to bring personal devices.
What steps should be taken in case of a cybersecurity threat?
PolicyBot: immediately report any cybersecurity threats or suspicious activities to the it department.
Can employees participate in open-source projects outside of work?
PolicyBot: yes, employees are encouraged to contribute to open-source projects.
bye
PolicyBot: Bye! take care..
