# **Project on Chatbot using TF-IDF and Cosine Similarity**

## **FactBot: Interactive Chatbot for Sharing Interesting Facts**

This project uses **TF-IDF (Term Frequency-Inverse Document Frequency)** and **cosine similarity** to match the user's input against a dataset of facts. TF-IDF weights each word by how often it appears in a sentence and how rare it is across the dataset, so distinctive words count for more than common ones. Cosine similarity then identifies the fact in the dataset that is most similar to the input.

Additionally, the chatbot is designed to respond to common user queries. It can provide information about itself, handle expressions of gratitude, bid farewell, and engage in greetings. This makes the chatbot more interactive and capable of engaging in small talk with users, enhancing the overall user experience.

### Subjects Familiar to the FactBot

1. General Knowledge
2. Science and Technology
3. History and Politics
4. Arts and Entertainment
5. Sports and Recreation
6. Health and Wellness
7. Social and Environmental Issues

## **Import and Setup**

In [None]:
import nltk
import string
import random
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### **GKFacts Dataset**

The GKFacts dataset is a collection of facts on General Knowledge topics. It serves as a valuable resource for those interested in gaining knowledge across various domains, including history, science, geography, arts, and more.

### Dataset Description

- Total Facts: ~1200
- Topics Covered:
  - General Knowledge
  - Science and Technology
  - History and Politics
  - Arts and Entertainment
  - Sports and Recreation
  - Health and Wellness
  - Social and Environmental Issues

### Usage

The facts in the dataset can be incorporated into chatbot applications to provide users with interesting and informative responses.

### Data Format

The dataset is provided in a text file format, with each fact separated by a newline. The facts are written in a concise and informative manner, making them suitable for quick consumption.
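
As a minimal sketch of this format, assuming one fact per line, the text can be split into a list of facts. The sample string stands in for the file contents, and `parse_facts` is a hypothetical helper name that is not part of the project code.

```python
def parse_facts(raw_text):
    """Split newline-separated text into a list of non-empty, stripped facts."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]

# Stand-in for the contents of the dataset file (assumption: one fact per line).
sample = (
    "The capital of China is Beijing.\n"
    "\n"
    "There are 118 known elements, out of which 94 occur naturally on Earth.\n"
)

facts = parse_facts(sample)
print(len(facts))  # 2
print(facts[0])    # The capital of China is Beijing.
```

Blank lines are skipped, so stray empty lines in the file would not become empty facts.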

### Examples

Here are a few examples of facts from the GKFacts dataset:

1. Fact 1: The capital of China is Beijing.
2. Fact 2: There are 118 known elements, out of which 94 occur naturally on Earth.
3. Fact 3: "The Starry Night" by Vincent van Gogh is an iconic painting featuring swirling skies and a vibrant, expressive depiction of the night landscape.


In [None]:
# Read the whole dataset; undecodable bytes are skipped instead of raising an error
with open('/content/GKFacts.txt', 'r', errors='ignore') as f:
  raw_doc = f.read()

## **Performing Text Pre-processing Steps**

To ensure effective processing and analysis of text data in the GKFacts dataset, the following pre-processing steps have been performed:

### 1. Tokenization

Tokenization is the process of breaking text down into smaller units called tokens. In this notebook the text is tokenized at two levels: `sent_tokenize` splits it into sentences, which serve as the candidate responses, and `word_tokenize` splits it into individual words.

### 2. Lemmatization

Lemmatization reduces words to their dictionary base form, known as the lemma, which normalizes different inflections of the same word. With part-of-speech information, for example, "running" and "ran" both reduce to "run". The `WordNetLemmatizer` used in this notebook treats every token as a noun unless a `pos` argument is passed, so as called here it mainly normalizes plural nouns (for example, "countries" becomes "country").

### 3. Stemming

Stemming is a related technique that reduces words to a root form by stripping suffixes (and, in some algorithms, prefixes). The resulting stems are not always actual words, but stemming can reduce the dimensionality of the data. The code in this notebook does not apply a separate stemming algorithm; normalization relies on lemmatization and punctuation removal.

These pre-processing steps improve the consistency and accuracy of the text data, making it easier for subsequent analysis and modeling tasks.
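
The normalization idea behind these steps can be sketched without NLTK. The example below is only an illustration: `simple_normalize` is a hypothetical helper that lowercases the text, deletes punctuation with `str.translate`, and splits on whitespace as a stand-in for `word_tokenize`. It does not lemmatize.

```python
import string

# Map every ASCII punctuation character to None so translate() deletes it.
REMOVE_PUNCT = {ord(ch): None for ch in string.punctuation}

def simple_normalize(text):
    """Lowercase, drop punctuation, and split on whitespace (no lemmatization)."""
    return text.lower().translate(REMOVE_PUNCT).split()

tokens = simple_normalize("The capital of China is Beijing.")
print(tokens)  # ['the', 'capital', 'of', 'china', 'is', 'beijing']
```

Lowercasing and punctuation removal ensure that "Beijing." and "beijing" end up as the same token before any weighting is computed.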




In [None]:
# TOKENIZATION
raw_doc = raw_doc.lower()  # Convert all text to lowercase
sentence_tokens = sent_tokenize(raw_doc)  # Divides the content into sentences
word_tokens = word_tokenize(raw_doc) # Divides the text into tokens(words)

In [None]:
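# Lemmatization and punctuation removal (no separate stemmer is applied)
lemmer = WordNetLemmatizer()

def LemTokens(tokens):
  # Lemmatize each token; without a pos argument, tokens are treated as nouns
  return [lemmer.lemmatize(token) for token in tokens]

# Translation table that deletes every ASCII punctuation character
remove_punc_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
  # Lowercase, strip punctuation, tokenize, then lemmatize
  return LemTokens(word_tokenize(text.lower().translate(remove_punc_dict)))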

## **Personality Injection: Adding a Human Touch**

### Default Prompts

- Greeting: The chatbot can respond to common greetings like "hi," "hello," and "hey" with friendly responses.
- Thanks: It recognizes expressions of gratitude like "thank you" and "thanks a lot" and acknowledges them appropriately.
- Goodbye: The chatbot understands various ways of ending the conversation and provides a farewell message.
- Description: It can provide a brief explanation of what the GKFacts Chatbot is and how it functions.
- Capabilities: The chatbot can explain its features and capabilities, such as sharing fascinating facts and answering questions.


In [None]:
greet_inputs =  [ "hi", "hello", "hey", "yo", "sup", "what's up", "hey there", "wassup", "howdy", "hola", "hey dude", "what's poppin'", "g'day mate", "hey ya", "hey buddy", "hey fam", "hey homie"]
greet_responses = [
    "Hello, how can I assist you?",    "Hey there! How can I help?",
    "Hi! What can I do for you?",    "Hey! How may I assist you today?",
    "Hi, how can I be of service?",    "Hello! How can I assist you?",
    "Hey, what can I do for you today?",    "Hi there! How may I help you?",
    "Yo! What's up? How can I assist you?",    "Sup! How may I help you today?",
    "What's up! How can I assist you?",    "Hey dude! How may I help you today?",
    "Hey ya! What can I do for you?",    "Hey buddy! How may I assist you today?",
    "Hey fam! How can I help you?",    "Hey homie! What can I do for you?",
    "Hey there! How can I assist you today?"]

goodbye_inputs = ["bye", "goodbye", "see you later", "take care", "farewell", "until next time" ]
goodbye_responses = [ "Goodbye!", "Farewell!", "Have a great day!", "Take care!", "Until we meet again!", "Goodbye! Take care!" ]

thanks_inputs = [ "thank you", "thanks a lot", "much appreciated", "thank you so much", "thanks for your help"]

thanks_responses = [ "You're welcome!", "No problem!", "Glad I could help!", "You're welcome. Happy to assist!",
                    "You're welcome! If you need further assistance, feel free to ask." ]

description_inputs = ["tell me about yourself", "who are you"]
description_responses = [
    "I am the Fact Bot, your friendly companion for interesting facts!",
    "I'm an AI-powered Fact Bot designed to share fascinating facts with you.",
    "Welcome to the Fact Bot! I'm here to provide you with intriguing facts and answer your questions."
]

capabilities_inputs = ["what can you do", "tell me your capabilities", "what are you good at", "how do you work"]
capabilities_responses = [
    "I can share facts on a wide range of topics. Just ask me anything!",
    "Ask me about animals, history, science, or any other subject you're curious about!",
    "I'm knowledgeable in various fields and can provide facts to quench your thirst for knowledge.",
    "I utilize TF-IDF and cosine similarity algorithms to analyze your input and find the most relevant facts to share with you.",
]

### Prompt Response Function

The `get_response` function scans a list of known inputs and checks whether any of them appears in the user's input. If a match is found, it returns the response at the same index in the corresponding response list; otherwise it returns `None`.


In [None]:
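def get_response(user_input, input_list, response_list):
  # Return the response paired (by index) with the first known input
  # that occurs as a substring of the user's input
  for i, input_item in enumerate(input_list):
    if input_item in user_input:
      return response_list[i]
  # No known input matched
  return None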
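
Because `get_response` relies on substring matching (`input_item in user_input`), a short pattern such as "hi" would also match inside unrelated words like "which" if the function were called on arbitrary input. One possible refinement, shown here as a sketch rather than the notebook's method, is to match on word boundaries with a regular expression. The helper name `contains_phrase` is hypothetical.

```python
import re

def contains_phrase(phrase, text):
    """Return True if `phrase` occurs in `text` as whole words, ignoring case."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print("hi" in "which planet is largest")                 # True: plain substring match
print(contains_phrase("hi", "which planet is largest"))  # False: not a whole word
print(contains_phrase("who are you", "Who are you?"))    # True
```

The lookarounds `(?<!\w)` and `(?!\w)` require that the phrase is not directly preceded or followed by a word character, which also lets it match next to punctuation such as a trailing question mark.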

## Response Generation by the Bot

The `response` function uses TF-IDF and cosine similarity to pick a reply. The main loop appends the user's input to `sentence_tokens` before calling it, so the input becomes the last row of the TF-IDF matrix. The function computes the cosine similarity between that row and every sentence, skips the top match (the input compared with itself), and returns the next most similar sentence, capitalized. If that similarity is zero, meaning the input shares no weighted terms with any fact, the chatbot apologizes for not understanding.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def response(user_response):
  # Assumes the caller has already appended user_response to sentence_tokens,
  # so the last row of the TF-IDF matrix represents the user's input
  TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
  tfidf = TfidfVec.fit_transform(sentence_tokens)
  # Similarity of the user's input to every sentence (including itself)
  vals = cosine_similarity(tfidf[-1], tfidf)
  # The highest score is the input matched with itself, so take the second highest
  idx = vals.argsort()[0][-2]
  flat = vals.flatten()
  flat.sort()
  req_tfidf = flat[-2]
  if req_tfidf == 0:
    # The input shares no weighted terms with any fact
    return "I am sorry. I am unable to understand you."
  return sentence_tokens[idx].capitalize()
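
To make the retrieval step concrete, here is a small self-contained sketch of TF-IDF weighting and cosine similarity in plain Python. It is an illustration, not the notebook's implementation: the three-sentence corpus, the whitespace tokenizer, and the helper names `tfidf_vectors` and `cosine` are assumptions. The smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, is chosen to mirror scikit-learn's default, and no stop words are removed.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative corpus (assumption), already lowercased and without punctuation.
facts = [
    "the capital of china is beijing",
    "oxygen is essential for respiration",
    "saudi arabia is the largest country in the middle east",
]
query = "why is oxygen essential"

# As in the notebook, the query is vectorized together with the facts.
vectors = tfidf_vectors(facts + [query])
scores = [cosine(vectors[-1], vec) for vec in vectors[:-1]]
best = max(range(len(facts)), key=scores.__getitem__)
print(facts[best])  # oxygen is essential for respiration
```

Here the query is compared only with the facts, so the best score can be taken directly. The notebook instead compares the query with every row, including itself, which is why it picks the second-highest score.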

### **Main Chatbot Flow**



In [None]:
flag = True
print("Bot: Welcome to the Fact Bot! I'm here to provide you with interesting facts. Ask me anything, and I'll share a fascinating fact with you.\nFor ending convo type bye")
while flag:
  user_response = input("User: ").lower()
  if user_response in goodbye_inputs:
    # A goodbye ends the conversation
    flag = False
    print("Bot: " + random.choice(goodbye_responses))
  elif user_response in thanks_inputs:
    # As written, a thank-you also ends the conversation
    flag = False
    print("Bot: " + random.choice(thanks_responses))
  elif user_response in greet_inputs:
    print("Bot: " + get_response(user_response, greet_inputs, greet_responses))
  elif any(input_item in user_response for input_item in description_inputs):
    print("Bot: " + get_response(user_response, description_inputs, description_responses))
  elif any(input_item in user_response for input_item in capabilities_inputs):
    print("Bot: " + get_response(user_response, capabilities_inputs, capabilities_responses))
  else:
    # Temporarily add the input so it is vectorized together with the facts
    sentence_tokens.append(user_response)
    print("Bot: " + response(user_response))
    # Remove the entry just appended (the last one), leaving the facts untouched
    sentence_tokens.pop()


Bot: Welcome to the Fact Bot! I'm here to provide you with interesting facts. Ask me anything, and I'll share a fascinating fact with you.
For ending convo type bye
User: Hey there
Bot: Hi! What can I do for you?
User: Who are you?
Bot: I'm an AI-powered Fact Bot designed to share fascinating facts with you.
User: Can you tell me a fact on the French Revolution?
Bot: The french revolution took place from 1789 to 1799, leading to the overthrow of the monarchy and the rise of napoleon bonaparte.
User: Explain the theory of relativity.
Bot: The theory of general relativity, also by einstein, provides a framework for understanding gravity as the curvature of spacetime.
User: Which is the largest country in the Middle east?
Bot: Saudi arabia is the largest country in the middle east.
User: Why is oxygen essential?
Bot: Oxygen is essential for respiration and combustion, while nitrogen is a crucial component of proteins and nucleic acids.
User: Who is Leo Tolstoy?
Bot: Leo tolstoy, a russian