<a href="https://colab.research.google.com/github/SeanGMONeill/Chatbot_Lessons/blob/main/2_Building_a_simple_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a chatbot

This project explores different ways to build chatbots: virtual messaging partners that can answer questions on specific topics, or even hold general conversations, in plain English.

### Part 1 - The old-fashioned way


In the software world, there are often different ways to approach the same problem.
In this first approach, you'll build a chatbot by writing your own hard-coded logic.

You'll be doing this in Python, within a Jupyter notebook. Don't worry if you've never used Python - the core concepts are similar to many other programming languages.

Here's a snippet of code to get you started - take some time to understand what it does:

In [None]:
# Initialize the run variable to True
run = True

# While run is still True, loop through the rest of the script
while run:
  # Wait for the user to input text, and store it in the msg variable
  msg = input()
  # Give a response, based on the input (if we recognise it)
  if msg == 'exit':
    print('Goodbye!')
    # Set run to False, so the loop won't run again
    # This means we won't be trapped in an infinite loop
    run = False
  elif msg == 'Hello':
    print('Hi!')
  # If the input doesn't match any of our statements, print a generic answer
  else:
    print('Sorry, I don\'t understand')

Hello
Hi!
yo
Sorry, I don't understand
exit
Goodbye!


**Press the Play button in the cell above to try it out**

If you break the 'exit' condition of your script and the loop never ends, press *Runtime -> Interrupt Execution* to halt the script.


## Exercises

Try adding your own case to the chatbot code - copy the 'Hello' segment, and make it respond to 'How are you?'



### Case sensitivity


Can you see any drawbacks to the approach we're using so far?



Notice that this script will only recognize a phrase if the user types it exactly the same way.

We can help a little by calling a built-in Python string method, *str.lower()*, on msg. It returns a lowercase copy of the string, so we can compare it to other lowercase strings, as demonstrated in the cells below.



In [None]:
msg = 'Hello'
print(msg)
print(msg == 'hello')

Hello
False


In [None]:
msg = 'Hello'
msg = msg.lower()
print(msg)
print(msg == 'hello')

hello
True


Try converting the input to lowercase in your code up above.
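
One way this could look is sketched below. To keep the example runnable on its own, the matching logic is moved into a hypothetical helper, *get_reply* (a name made up for this example, not part of the original script); the loop would call it with the text returned by *input()*.

```python
# Hypothetical helper (not in the original script): returns the bot's reply
def get_reply(msg):
  # Lowercase the input once, then compare against lowercase phrases only
  msg = msg.lower()
  if msg == 'exit':
    return 'Goodbye!'
  elif msg == 'hello':
    return 'Hi!'
  else:
    return "Sorry, I don't understand"

print(get_reply('HELLO'))
print(get_reply('hello'))
```

Both calls give the same reply, because the comparison now happens after the input has been lowercased. Note that the phrases we compare against must be written in lowercase too.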

### Keywords

There's still another drawback! The user still needs to enter the exact wording that the programmer used, otherwise the program won't understand what they're saying.

One way to help with this is to look for words within the input, using Python's 'in' operator, which checks whether one string appears anywhere inside another.

In [None]:
msg = 'What is the weather like?'
msg = msg.lower()
print(msg == 'weather')

False


In [None]:
msg = 'What is the weather like?'
msg = msg.lower()
print('weather' in msg)

True


Try updating your script to use this to reply to the user with 'I love rain!' if they mention 'rain' anywhere in their message.

### Tokenization

Congratulations, you've just introduced a new issue!

Now, the script will reply to a message which contains the word 'rain', even if that word is in the middle of another word:

In [None]:
msg = 'I think it\'s going to rain'
msg = msg.lower()
print('rain' in msg)

True


In [None]:
msg = 'I strained the pasta'
msg = msg.lower()
print('rain' in msg)

True


Notice that there is 'rain' in 'st**rain**ed'.

This will lead to some confusing answers, when someone talking about pasta receives replies about the weather.

We can solve this by using a Natural Language Processing (NLP) technique called *Tokenization* - splitting a phrase into a list of smaller chunks.


One approach to tokenization is to split the phrase into words. The simplest (though imperfect) way to do this is to split on spaces.

In [None]:
msg = 'I strained the pasta'
msg = msg.lower()
tokens = msg.split(' ') # Splitting on the space character
print(tokens)

['i', 'strained', 'the', 'pasta']


*str.split(separator)* splits a string into a list of strings, splitting wherever it sees the separator (in our case, the space character).
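
Splitting on a single space is imperfect in one way that is easy to demonstrate: two spaces in a row produce an empty token. As a small sketch (the sample message is made up for illustration), calling *split()* with no separator splits on any run of whitespace instead:

```python
msg = 'i  strained the pasta'  # note the double space after 'i'

# Splitting on one space leaves an empty string where the extra space was
print(msg.split(' '))

# With no argument, split() treats any run of whitespace as one separator
print(msg.split())
```

The first call prints `['i', '', 'strained', 'the', 'pasta']` and the second prints `['i', 'strained', 'the', 'pasta']`. The rest of this notebook keeps *split(' ')* for simplicity.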

We can also move our pre-processing into a function, to keep things tidy:

In [None]:
def tokenize(msg):
  msg = msg.lower()
  tokens = msg.split(' ')
  return tokens

We can then check whether a word appears in the list of tokens, using *in* again:

In [None]:
msg = 'I think it\'s going to rain'
tokens = tokenize(msg)
print(tokens)
print('rain in msg? ')
print('rain' in msg)
print('rain in tokens?')
print('rain' in tokens)

['i', 'think', "it's", 'going', 'to', 'rain']
rain in msg? 
True
rain in tokens?
True


In [None]:
msg = 'I strained the pasta'
tokens = tokenize(msg)
print(tokens)
print('rain in msg? ')
print('rain' in msg)
print('rain in tokens?')
print('rain' in tokens)

['i', 'strained', 'the', 'pasta']
rain in msg? 
True
rain in tokens?
False


This avoids the false positive which we were seeing. Try updating your script to use this technique.
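
As a rough sketch of what that could look like, the snippet below wraps the check in a hypothetical helper, *get_weather_reply* (a name chosen for this example, not part of the original script), and repeats *tokenize* so that it runs on its own:

```python
def tokenize(msg):
  msg = msg.lower()
  return msg.split(' ')

# Hypothetical helper: picks a reply by looking for whole-word tokens
def get_weather_reply(msg):
  tokens = tokenize(msg)
  if 'rain' in tokens:
    return 'I love rain!'
  return "Sorry, I don't understand"

print(get_weather_reply('I think it is going to rain'))
print(get_weather_reply('I strained the pasta'))
```

The first message gets the rain reply; the second falls through to the generic answer, because 'strained' is a different token from 'rain'.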

### Punctuation - friend or foe?

Punctuation is great, and can add a lot of meaning to sentences! But since our current script isn't complicated enough to take it into account, it could just get in the way.

For example:

In [None]:
msg = 'Did you see that cat?'
tokens = tokenize(msg)
print(tokens)
print('cat' in tokens)

['did', 'you', 'see', 'that', 'cat?']
False


Our token-matcher isn't recognising the word 'cat', because it's been tokenized with a question mark at the end (as 'cat?').

This is where removing punctuation can help. The function below removes a chosen set of symbols. Try to understand what it does:

In [None]:
def remove_punctuation(msg):
  symbols = ['?','-',',',':',';']
  for symbol in symbols:
    msg = msg.replace(symbol, '')
  return msg

msg = 'Did you see that cat?'
remove_punctuation(msg)

'Did you see that cat'

Notice we aren't removing spaces: we split on them, so we need to keep them for tokenization.

In [None]:
msg = 'Did you see that cat?'
tokens = tokenize(remove_punctuation(msg))
print(tokens)
print('cat' in tokens)

['did', 'you', 'see', 'that', 'cat']
True


Now we can correctly identify the token 'cat'.
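
The symbol list in *remove_punctuation* is hand-picked, so characters such as '!' and '.' are left in place. One alternative, sketched here as an option rather than as this notebook's method, is to strip every character in Python's *string.punctuation* using *str.translate*. The function name *remove_all_punctuation* is made up for this example.

```python
import string

# Hypothetical alternative to remove_punctuation:
# drops every ASCII punctuation character, not just a chosen few
def remove_all_punctuation(msg):
  # The third argument to str.maketrans lists the characters to delete
  table = str.maketrans('', '', string.punctuation)
  return msg.translate(table)

print(remove_all_punctuation("Wow! It's a cat."))
```

This prints `Wow Its a cat`. Be aware of the trade-off: apostrophes count as punctuation too, so "It's" becomes "Its", which may or may not be what you want.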

### Making it more complicated

In [None]:
elements = {
    'earth': 1,
    'air': 2,
    'fire': 3,
    'water': 4
}

msg = 'What is the chemical symbol for fire'
tokens = tokenize(msg)
# True only if the message contains 'symbol' AND at least one key from elements
print('symbol' in tokens and len(set(elements).intersection(set(tokens))) > 0)

True
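
The check above only tells us whether some element was mentioned alongside 'symbol'. A possible next step, sketched here under the assumption that the bot should answer with the value stored in *elements*, is to find which element matched and look it up. The helper name *find_element* and the output format are invented for this example.

```python
elements = {
    'earth': 1,
    'air': 2,
    'fire': 3,
    'water': 4
}

def tokenize(msg):
  msg = msg.lower()
  return msg.split(' ')

# Hypothetical helper: returns the first token that is a key in elements, or None
def find_element(tokens):
  for token in tokens:
    if token in elements:
      return token
  return None

tokens = tokenize('What is the chemical symbol for fire')
element = find_element(tokens)
if 'symbol' in tokens and element is not None:
  # Look up the value stored for the element that was mentioned
  print(element, '->', elements[element])
```

For this message the snippet prints `fire -> 3`.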
