Building a Simple Chatbot using Wikipedia URL in Python (using NLTK) With Voice Assistance

Project By - HARSH SHINDE

NLP
Natural Language Processing (NLP) is a way for computers to analyze, understand, and derive meaning from human language. With NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

In [1]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install newspaper3k




In [3]:
pip install pyttsx3

Note: you may need to restart the kernel to use updated packages.


In [4]:
import random
import string
import warnings
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from newspaper import Article
import pyttsx3

Downloading and installing NLTK
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

Natural Language Processing with Python provides a practical introduction to programming for language processing.

For platform-specific installation instructions, see the NLTK installation documentation.

In [5]:
# Initialize the text-to-speech engine
engine = pyttsx3.init()

In [6]:
# Suppress warnings and download the NLTK data packages used below
warnings.filterwarnings('ignore')
nltk.download('popular', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)


True

In [7]:
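# Function to normalize text: lowercase, strip punctuation, tokenize into words.
# remove_punct_dict is defined in a later cell and must exist before this is called.
def LemNormalize(text):
    return nltk.word_tokenize(text.lower().translate(remove_punct_dict))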


In [8]:
# Keywords for greetings
greeting_input = ["hello", "hi", "greetings", "sup", "what's up", "hey"]
greeting_response = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]


In [9]:
# Function to recognize greetings
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in greeting_input:
            return random.choice(greeting_response)
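
Because greeting() splits the sentence on whitespace, an input such as "hi!" or the two-word phrase "what's up" is not matched. The following is a minimal sketch of a more tolerant matcher, not the notebook's own method. The name greeting_tolerant and the upper-case constants are hypothetical and introduced only for this illustration.

```python
import random
import string

# Same keyword lists as in the notebook, copied here so the sketch is self-contained.
GREETING_INPUT = ["hello", "hi", "greetings", "sup", "what's up", "hey"]
GREETING_RESPONSE = ["hi", "hey", "*nods*", "hi there", "hello",
                     "I am glad! You are talking to me"]

# Delete all punctuation except the apostrophe, so "what's up" keeps its spelling.
_PUNCT_EXCEPT_APOSTROPHE = string.punctuation.replace("'", "")
_STRIP_TABLE = str.maketrans("", "", _PUNCT_EXCEPT_APOSTROPHE)


def greeting_tolerant(sentence):
    """Hypothetical variant of greeting(): ignores punctuation, matches phrases."""
    cleaned = " ".join(sentence.lower().translate(_STRIP_TABLE).split())
    # Pad with spaces so only whole words or phrases match ("hi" but not "this").
    padded = f" {cleaned} "
    for phrase in GREETING_INPUT:
        if f" {phrase} " in padded:
            return random.choice(GREETING_RESPONSE)
    return None
```

With this variant, "Hi!" and "what's up?" both produce a greeting, while a sentence such as "this is shipping" does not.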

In [10]:
# Function to generate responses
def response(user_response):
    user_response = user_response.lower()
    robo_response = ''
    # Temporarily add the query so it is vectorized together with the corpus
    sent_tokens.append(user_response)
    tfidfvec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = tfidfvec.fit_transform(sent_tokens)
    # Similarity of the query (last row) to every sentence, itself included
    val = cosine_similarity(tfidf[-1], tfidf)
    # The largest value is the query matching itself, so take the second largest
    idx = val.argsort()[0][-2]
    flat = val.flatten()
    flat.sort()
    score = flat[-2]
    if score == 0:
        robo_response = "Sorry, I don't understand."
    else:
        robo_response = sent_tokens[idx]

    # Remove the query again (it is always the last element)
    sent_tokens.pop()
    speak(robo_response)  # Speak the response
    return robo_response
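
To make the retrieval step less of a black box, here is a small standard-library sketch of the same idea: weight terms by TF-IDF, then pick the sentence with the highest cosine similarity to the query. It is an illustration under simplifying assumptions (whitespace tokenization, no stop-word removal, no lemmatization), not a reimplementation of scikit-learn. The function names tfidf_vectors, cosine, and best_match are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Raw term count times a smoothed inverse document frequency.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, sentences):
    """Return (index, score) of the sentence most similar to the query."""
    vectors = tfidf_vectors(sentences + [query])
    query_vec = vectors[-1]
    scores = [cosine(query_vec, vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

A score of 0.0 means the query shares no term with any sentence, which corresponds to the "Sorry, I don't understand." branch in response().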

In [11]:
# Function to speak text
def speak(text):
    engine.say(text)
    engine.runAndWait()
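
pyttsx3 relies on a speech driver provided by the operating system, so speech may be unavailable on some machines, for example a headless server. The sketch below shows one possible way to fall back to printing in that case. The function name speak_or_print is hypothetical, and the engine argument is assumed to be any object with the say() and runAndWait() methods used in the cell above.

```python
def speak_or_print(text, engine=None):
    """Speak text with a pyttsx3-style engine, or fall back to printing.

    Returns True if the text was spoken, False if it was only printed.
    """
    if engine is None:
        # No engine available: show the text instead of speaking it.
        print(text)
        return False
    try:
        engine.say(text)
        engine.runAndWait()
        return True
    except Exception as exc:
        # Any driver error is reported and the text is printed instead.
        print(f"(speech unavailable: {exc}) {text}")
        return False
```

In the notebook, this could be called as speak_or_print(robo_response, engine) in place of speak(robo_response).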

In [12]:
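# Download and parse the article from its URL
article = Article('https://simple.wikipedia.org/wiki/Earth')
article.download()
article.parse()
article.nlp()
# Keep the plain text of the article as the chatbot's corpus
corpus = article.text
# Print the extracted text
print(corpus)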

Earth is the third planet from the Sun in the Solar System. It is the only planet known to have life on it. The Earth formed about 4.6 billion years ago.[29][30]

It is one of four rocky planets on the inner side of the Solar System. The other three are Mercury, Venus, and Mars.

The large mass of the Sun keeps the Earth in orbit through the force of gravity.[31] Earth also turns around in space, so that different parts face the Sun at different times. Earth goes around the Sun once (one year) for every 365​1⁄ 4 times it turns around (one day).

Earth is the only planet in the Solar System that has a large amount of liquid water on its surface.[32][33][34] About 71% of the surface of Earth is covered by liquid or frozen water.[35] Because of this, people sometimes call it the blue planet.[36]

Because of its water, Earth is home to millions of species of plants and animals which need water to survive.[37][38] The things that live on Earth have changed its surface greatly. For example, 

In [13]:
# Tokenization
text = corpus
sent_tokens = nltk.sent_tokenize(text)
print(sent_tokens)

['Earth is the third planet from the Sun in the Solar System.', 'It is the only planet known to have life on it.', 'The Earth formed about 4.6 billion years ago.', '[29][30]\n\nIt is one of four rocky planets on the inner side of the Solar System.', 'The other three are Mercury, Venus, and Mars.', 'The large mass of the Sun keeps the Earth in orbit through the force of gravity.', '[31] Earth also turns around in space, so that different parts face the Sun at different times.', 'Earth goes around the Sun once (one year) for every 365\u200b1⁄ 4 times it turns around (one day).', 'Earth is the only planet in the Solar System that has a large amount of liquid water on its surface.', '[32][33][34] About 71% of the surface of Earth is covered by liquid or frozen water.', '[35] Because of this, people sometimes call it the blue planet.', '[36]\n\nBecause of its water, Earth is home to millions of species of plants and animals which need water to survive.', '[37][38] The things that live on Ea
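
The tokenized output above shows Wikipedia citation markers such as "[29][30]" attached to the start of some sentences, and they can later surface in the bot's answers. One possible cleanup, sketched here with the standard library only, is to remove the markers from the text before sentence tokenization. The function name strip_citations is hypothetical.

```python
import re

# Matches bracketed numeric footnote markers such as "[29]" or "[32][33][34]".
CITATION_PATTERN = re.compile(r"(?:\[\d+\])+")


def strip_citations(text):
    """Remove numeric citation markers and tidy the spaces they leave behind."""
    without_markers = CITATION_PATTERN.sub("", text)
    # Collapse runs of spaces or tabs, but keep newlines so paragraphs survive.
    return re.sub(r"[ \t]{2,}", " ", without_markers)
```

In the notebook this could be applied as text = strip_citations(corpus) before calling nltk.sent_tokenize(text).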

In [14]:
lemmer = nltk.stem.WordNetLemmatizer()

# WordNet is a semantically oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

# Translation table that deletes every punctuation character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

# Lowercase, strip punctuation, tokenize, then lemmatize
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

In [15]:
# Inspect the punctuation characters and the dictionary that removes them
print(string.punctuation)
print(remove_punct_dict)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
{33: None, 34: None, 35: None, 36: None, 37: None, 38: None, 39: None, 40: None, 41: None, 42: None, 43: None, 44: None, 45: None, 46: None, 47: None, 58: None, 59: None, 60: None, 61: None, 62: None, 63: None, 64: None, 91: None, 92: None, 93: None, 94: None, 95: None, 96: None, 123: None, 124: None, 125: None, 126: None}
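
The dictionary printed above maps each punctuation code point to None, which tells str.translate to delete that character. As a small illustration, the same table can also be built with the standard helper str.maketrans. The variable name remove_punct_table and the sample string are introduced only for this example.

```python
import string

# Mapping built the way the notebook does it: code point -> None (delete).
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

# Equivalent table from the standard library helper; the third argument
# lists the characters to delete.
remove_punct_table = str.maketrans("", "", string.punctuation)

sample = "Hello, world! (It's 4.6 billion years old.)"
print(sample.translate(remove_punct_dict))   # Hello world Its 46 billion years old
print(sample.translate(remove_punct_table))  # Hello world Its 46 billion years old
```

Note that removing the period turns "4.6" into "46", which is also visible in the notebook's tokenized output.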


In [16]:
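# Lowercase the text, strip punctuation, and tokenize into words.
# Note: this redefinition replaces the lemmatizing LemNormalize from the previous cell.
def LemNormalize(text):
    return nltk.word_tokenize(text.lower().translate(remove_punct_dict))
print(LemNormalize(text))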

['earth', 'is', 'the', 'third', 'planet', 'from', 'the', 'sun', 'in', 'the', 'solar', 'system', 'it', 'is', 'the', 'only', 'planet', 'known', 'to', 'have', 'life', 'on', 'it', 'the', 'earth', 'formed', 'about', '46', 'billion', 'years', 'ago2930', 'it', 'is', 'one', 'of', 'four', 'rocky', 'planets', 'on', 'the', 'inner', 'side', 'of', 'the', 'solar', 'system', 'the', 'other', 'three', 'are', 'mercury', 'venus', 'and', 'mars', 'the', 'large', 'mass', 'of', 'the', 'sun', 'keeps', 'the', 'earth', 'in', 'orbit', 'through', 'the', 'force', 'of', 'gravity31', 'earth', 'also', 'turns', 'around', 'in', 'space', 'so', 'that', 'different', 'parts', 'face', 'the', 'sun', 'at', 'different', 'times', 'earth', 'goes', 'around', 'the', 'sun', 'once', 'one', 'year', 'for', 'every', '365\u200b1⁄', '4', 'times', 'it', 'turns', 'around', 'one', 'day', 'earth', 'is', 'the', 'only', 'planet', 'in', 'the', 'solar', 'system', 'that', 'has', 'a', 'large', 'amount', 'of', 'liquid', 'water', 'on', 'its', 'surfa

In [17]:
# Keywords for greetings (same definitions as in cells 8 and 9)
greeting_input = ["hello", "hi", "greetings", "sup", "what's up", "hey"]
greeting_response = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    for word in sentence.split():
        if word.lower() in greeting_input:
            return random.choice(greeting_response)

In [None]:
# Create dictionary to remove punctuation
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

flag = True
print("Hello! This is HKS Bot. I can answer your queries related to anything. Just provide a Wikipedia URL. Type 'bye' to exit.")
while flag:
    # Normalize the input so that 'Bye' or ' thanks ' are recognized too
    user_response = input("\nUSER: ").strip().lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("HKS BOT: Anytime! :)")
        else:
            # Call greeting() once so the printed reply is the one that was checked
            greeting_reply = greeting(user_response)
            if greeting_reply is not None:
                print("HKS BOT: " + greeting_reply)
            else:
                print("HKS BOT: " + response(user_response))
    else:
        flag = False
        print("HKS BOT: See you later! :)")

Hello! This is HKS Bot. I can answer your queries related to anything. Just provide a Wikipedia URL. Type 'bye' to exit.

USER:  hi

HKS BOT: hi

USER:  earth

HKS BOT: All places on Earth are made of, or are on top of, rocks.

USER:  earth

HKS BOT: All places on Earth are made of, or are on top of, rocks.

USER:  sun distance

HKS BOT: [40][41]

Earth is about 150,000,000 kilometres or 93,000,000 miles away from the Sun (this distance is called an "astronomical unit" or au.
