# Week 6: Retrieval-based chatbots

This week you will be learning about methods to help you build chatbots that retrieve data, both from local datasets ([Part 1](#part-1-retrieving-local-data)) and from the world wide web ([Part 2](#part-2-retrieving-data-from-the-web)). These two parts are independent so you can do part 2 before part 1 if you are more interested in web scraping.

Before you get started though, let's just make sure that this notebook is set up to run using the `nlp` conda environment that you created last week.

To set this notebook to the right environment, click the **Select kernel** button in the top right corner of this notebook, then select **Python Environments...** and then select the environment `nlp`.

To double check you have done this correctly, hit the run cell button (▶) on the cell below:

In [None]:
import os
print(os.environ['CONDA_DEFAULT_ENV'])

##### import libraries

Now test that the libraries we will be using today for part 1 are installed correctly and can be imported:

In [None]:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Part 1: Retrieving local data

In this section you will look at how to compare text inputs (aka queries) to documents in a local dataset. You will be trying to find the most relevant songs by comparing a text query to a dataset of song lyrics. Most of the code is already written for you; [the tasks](#task-2-add-functions-to-chatbot) mostly ask you to incorporate this code into a helpful retrieval-based chatbot.

##### Load dataset into a pandas dataframe

Here is some code that loads in a dataset of song lyrics from the TSV file [class-datasets/lyric_data.tsv](class-datasets/lyric_data.tsv) into a [Pandas dataframe](https://www.w3schools.com/python/pandas/pandas_dataframes.asp), which is a very useful way of storing data tables in Python:

In [None]:
def load_lyrics(file_path, separator, column_names):
    df = pd.read_csv(file_path, sep=separator, usecols=column_names)
    return df

#### Song lyrics to bag of words

This function will convert the song lyrics into a bag of words (BoW) matrix:

In [None]:
def lyrics_to_bow(df):
    vectorizer = CountVectorizer(stop_words='english')
    lyrics_matrix = vectorizer.fit_transform(df['LYRICS'])
    return vectorizer, lyrics_matrix

The word *matrix* sounds very technical, but it is essentially just a big table that looks something like this:

![bag of words matrix visualisation](media/week-6/bow-visualisation.png)

#### Find nearest song function

The following function takes a text query as input and vectorises it using the vectorizer that is fitted to the vocabulary of the song lyrics dataset. It then uses a measure called the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) (more on the maths behind this in term 2!) to measure how similar the BoW vector that represents our input query is to the BoW vector for every document (song's lyrics) in our dataset.

It then returns the artist name, song name and the lyrics for the closest matching song:

In [None]:
# Find the nearest song based on cosine similarity
def find_nearest_song(input_query, vectorizer, lyrics_matrix, df):
    
    # Use the same vectorizer as the lyrics so that the dimensions match (so that they use the same dictionary of words)
    input_bow = vectorizer.transform([input_query])
    
    # Get similarity between the input query and every song's lyrics
    similarities = cosine_similarity(input_bow, lyrics_matrix).flatten()
    
    # Find the index of the most similar song
    nearest_index = similarities.argmax()
    
    # Extract artist, song name and the lyrics
    artist_song_id = df.iloc[nearest_index]['ARTIST_NAME-SONG_NAME']
    # Split on the first hyphen only, in case the song name itself contains one
    artist_id, song_id = artist_song_id.split('-', 1)
    lyrics = df.iloc[nearest_index]['LYRICS']
    
    return artist_id, song_id, lyrics
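
To build some intuition for what the cosine similarity in `find_nearest_song` measures, here is a small hand-written sketch in plain Python. The `cosine` helper and the word counts are made up for illustration. The notebook itself uses scikit-learn's `cosine_similarity`, which works on whole matrices at once.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up word counts over the vocabulary ['rain', 'sunshine', 'window']
query = [0, 1, 0]    # the query "sunshine"
song_a = [0, 2, 1]   # mentions sunshine twice and window once
song_b = [1, 0, 1]   # never mentions sunshine

print(cosine(query, song_a))
print(cosine(query, song_b))
```

A song that shares words with the query scores closer to 1, and a song that shares none scores 0. This simplified helper does not handle an all-zero vector, which would cause a division by zero.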

##### Load and vectorise dataset

Now you can load in your dataset and vectorise it. You should end up with a *"Compressed Sparse Row sparse matrix of dtype 'int64'"*, which is essentially just a table of integers that mostly contains 0s.

In [None]:
dataset_path = 'class-datasets/lyric_data.tsv'
column_names = ['ARTIST_NAME-SONG_NAME', 'SONG_NAME', 'LYRICS']
separator = '\t'

df = load_lyrics(dataset_path, separator, column_names)
vectorizer, lyrics_matrix = lyrics_to_bow(df)
lyrics_matrix

Now you can test an input. Try changing the value of `input_query` to find different songs:

In [None]:
input_query = "sunshine and rainbows"
artist_id, song_id, nearest_lyrics = find_nearest_song(input_query, vectorizer, lyrics_matrix, df)

print(f"Nearest Song: {song_id} by {artist_id}")
print(f"Lyrics: {nearest_lyrics}")

### Task 1: Use TF-IDF instead of Bag of Words

Write a new function that uses [the TfidfVectorizer](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) instead of [the CountVectorizer](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) (aka bag of words).

Use the code in [the song lyrics to bag of words function](#song-lyrics-to-bag-of-words) as inspiration. No need to import TfidfVectorizer as that has [already been imported](#import-libraries):

In [None]:
def lyrics_to_tfidf(df):
    # Your code here

Now replace the BoW vectorizer and matrix with the new TF-IDF ones. You should end up with a *"Compressed Sparse Row sparse matrix of dtype 'float64'"*, which is now a table of **floats** that mostly contains 0s:

In [None]:
vectorizer, lyrics_matrix = lyrics_to_tfidf(df)
lyrics_matrix

And now test this code again:

In [None]:
input_query = "sunshine and rainbows"
artist_id, song_id, nearest_lyrics = find_nearest_song(input_query, vectorizer, lyrics_matrix, df)

print(f"Nearest Song: {song_id} by {artist_id}")
print(f"Lyrics: {nearest_lyrics}")

### Task 2: Add functions to Chatbot

In `week-6b-music-info-chatbot.py`, add the functions `load_lyrics`, `lyrics_to_bow`, `lyrics_to_tfidf` and `find_nearest_song` from this notebook to the class `MusicBot` as member functions. Don't forget to add `self` as the first parameter of each one.

### Task 3: Load and pre-process data in constructor

Then include the code for [loading and vectorizing your data](#load-and-vectorise-dataset) in the constructor (`__init__`) of your music chatbot. Make sure that you assign the variables you create as member variables of your class.
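
If member functions and member variables are new to you, the following sketch shows the general pattern on a deliberately unrelated example. `ExampleBot` and everything inside it are hypothetical names for illustration; this is not the `MusicBot` class from the chatbot file.

```python
class ExampleBot:
    def __init__(self, greeting):
        # The constructor runs once when the object is created.
        # Anything assigned to self.<name> becomes a member variable.
        self.greeting = greeting
        self.shouted_greeting = self.shout(greeting)

    def shout(self, text):
        # A standalone function shout(text) becomes a member function
        # by moving it into the class and adding self as the first parameter
        return text.upper() + '!'

    def generate_response(self, name):
        # Member variables and member functions are reached through self
        return f'{self.shouted_greeting} {name}'

bot = ExampleBot('hello')
print(bot.generate_response('Ada'))
```

Doing expensive work once in `__init__` and storing the result on `self` means later calls can reuse it rather than recomputing it for every message.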

### Task 4: Create chat interface for getting song recommendations

In the function `generate_response`:
1. Use a regex to match the string input 'give me a song about {x}' or 'recommend me a song about {x}', where {x} is whatever collection of words that a user is looking for a song about. 
2. Pass the extracted group from *{x}* as the query into the function `find_nearest_song`
3. Return a response that includes the artist name and the song name, and a sample of the lyrics (i.e. the first 250 characters).
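
One possible way to approach the matching step is sketched below. The pattern, the group name `topic` and the helper `extract_topic` are assumptions chosen for illustration, and you may prefer a different regex.

```python
import re

# Matches either opening phrase and captures everything after 'about' as the topic
SONG_REQUEST = re.compile(r'(?:give|recommend) me a song about (?P<topic>.+)',
                          re.IGNORECASE)

def extract_topic(user_input):
    match = SONG_REQUEST.search(user_input)
    if match is None:
        return None  # the input was not a song request
    return match.group('topic').strip()

print(extract_topic('Recommend me a song about sunshine and rainbows'))
print(extract_topic('what is the weather like?'))

# String slicing gives a sample of a long piece of text
long_text = 'la ' * 200
sample = long_text[:250]
print(len(sample))
```

Returning `None` when nothing matches lets the calling code fall through to its other response rules.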

## Bonus tasks

Here are some bonus tasks related to the first part of this notebook. Feel free to move straight onto [part 2 of the notebook](#part-2-retrieving-data-from-the-web) and come back to these if you want more of a challenge.

#### Bonus task A

Can you use a [stemmer](https://www.nltk.org/howto/stem.html) or a [lemmatizer](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet) from the [NLTK library](https://www.nltk.org/index.html) in your chatbot? 

> Tips: 
> 1. Write a function for stemming or lemmatising your text.
> 2. It is a good idea to write and test this code in this notebook first before implementing it in your chatbot.
> 3. Step 5 from this [stackoverflow answer](https://stackoverflow.com/a/45670652) shows you how to best apply stemming to text in the way that the song lyrics are formatted.
> 4. To transform all the lyrics in your dataframe, you can pass your stemming/lemmatising function as a parameter to the [dataframe.apply()](https://www.w3schools.com/python/pandas/ref_df_apply.asp) method to quickly transform all of the song lyrics in one go.
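
The tips above could be sketched roughly as follows, using NLTK's `PorterStemmer` on a made-up dataframe. The `stem_text` helper and the simple whitespace split are assumptions for illustration. Real lyrics contain punctuation that you may want to handle with a proper tokenizer.

```python
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    # Split on whitespace, stem each word, then join back into one string
    return ' '.join(stemmer.stem(word) for word in text.split())

# Made-up lyrics for illustration only
toy_df = pd.DataFrame({'LYRICS': ['running through rainbows',
                                  'dancing in the rain']})

# apply() on a single column calls stem_text once per row
toy_df['LYRICS'] = toy_df['LYRICS'].apply(stem_text)
print(toy_df)
```

If you stem the lyrics before vectorizing, remember to stem the user's query in the same way, otherwise the words will no longer match.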

#### Bonus task B

Can you adapt the code to include the lyrics of Taylor Swift songs from [class-datasets/TaylorSwift.csv](class-datasets/TaylorSwift.csv)?

> Tips: The data is in a different format and uses different names for the columns. You will need to adapt the previous code to take account of this:
> - Simple solution: If you only want Taylor Swift songs, you can just change the values in [the cell that loads the dataset](#load-and-vectorise-dataset), along with any column names used in the functions.
> - Advanced solution: Writing code that loads both datasets and harmonises them before calculating the BoW or TF-IDF matrix is more complicated. You will need to [rename](https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas) and [reorder](https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns) the columns of the dataframes to be the same before [concatenating](https://stackoverflow.com/questions/59267129/how-to-concatenate-multiple-dataframes-from-multiple-sources-in-pandas) them.
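
For the advanced route, the rename, reorder and concatenate steps might look something like this sketch. Both dataframes are made-up stand-ins, and the column names `Lyric`, `Title` and `Artist` are hypothetical. Check the real file to see what its columns are actually called.

```python
import pandas as pd

# Made-up stand-ins for the two datasets; the real column names may differ
df_a = pd.DataFrame({'ARTIST_NAME-SONG_NAME': ['Artist A-Song One'],
                     'LYRICS': ['first lyrics']})
df_b = pd.DataFrame({'Lyric': ['second lyrics'],
                     'Title': ['Song Two'],
                     'Artist': ['Artist B']})

# Build the combined artist-song column so both frames share the same layout
df_b['ARTIST_NAME-SONG_NAME'] = df_b['Artist'] + '-' + df_b['Title']
# Rename the lyrics column, then keep only the shared columns in the same order
df_b = df_b.rename(columns={'Lyric': 'LYRICS'})
df_b = df_b[['ARTIST_NAME-SONG_NAME', 'LYRICS']]

# Stack the rows of both frames and renumber the index from 0
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```

Using `ignore_index=True` matters here because `find_nearest_song` looks rows up by position, so the combined dataframe needs one continuous index.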

#### Bonus task C

Have a look for [other datasets of song lyrics on kaggle](https://www.kaggle.com/search?q=song+lyrics+in%3Adatasets), can you adapt the code to work with any of these bigger datasets? 
> **Warning:** Loading your data and calculating your BoW or TF-IDF matrix may become very slow if your dataset is very large! If this happens, try [dropping some of the rows from your dataframe](https://stackoverflow.com/a/77145573) before vectorizing.

# Part 2: Retrieving data from the web

In part two you will be looking at how to extract data from the web using some simple web scraping. You will be extracting some biographical data from Wikipedia for musicians that the user inputs.


Let's import the libraries we need for this and make sure they can be loaded:

In [None]:
import time
import requests
import urllib.request
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

##### Limit requests 

Run this code to limit how many times a failed connection is retried (at most 3) and to add an increasing delay between retries:

In [None]:
# Limit requests, code from: https://stackoverflow.com/a/47475019
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

##### Get html content function

This function gets data from a URL and uses Beautiful Soup to parse the HTML content:

In [None]:
def get_html_content(target_url):
    try:
        r = session.get(target_url)
        soup = BeautifulSoup(r.content, 'html.parser')
        return soup
    except Exception:
        # Something went wrong (e.g. no network connection), so signal failure to the caller
        return None

Now let's test it:

In [None]:
target_url = 'https://en.wikipedia.org/wiki/Taylor_Swift'
soup = get_html_content(target_url)
print(soup)

### Extract biographical information from the HTML

Extract name:

In [None]:
name = soup.find("div", class_="fn").get_text(strip=True)
print(f"Name: {name}")

Extract date of birth and age:

In [None]:
birth_info = soup.find("span", class_="bday")
birth_date = birth_info.get_text(strip=True)
age = soup.find("span", class_="ForceAgeToShow").get_text(strip=True).replace("(", "").replace(")", "")

print(f"Birth Date: {birth_date}")
print(f"Age: {age}")

Extract birthplace:

In [None]:
birthplace = soup.find("div", class_="birthplace").get_text(strip=True)

print(f"Birthplace: {birthplace}")

Extract musical genres:

In [None]:
genres_list = soup.find("th", string="Genres").find_next_sibling("td").find_all("li")
genres = [genre.get_text(strip=True) for genre in genres_list]

print(f"Genres: {', '.join(genres)}")
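
The lookups above assume that each element exists on the page. When nothing matches, `find` returns `None`, and calling `get_text` on it raises an `AttributeError`. The sketch below shows one way to guard against that, using a tiny made-up HTML snippet so that it runs without any network access. The helper name `text_or_none` is hypothetical.

```python
from bs4 import BeautifulSoup

# A tiny made-up HTML snippet standing in for a downloaded page
toy_html = '<div class="fn">Example Artist</div><span class="bday">1990-01-01</span>'
toy_soup = BeautifulSoup(toy_html, 'html.parser')

def text_or_none(soup, tag, css_class):
    # find() returns None when nothing matches, so check before calling get_text()
    element = soup.find(tag, class_=css_class)
    if element is None:
        return None
    return element.get_text(strip=True)

print(text_or_none(toy_soup, 'span', 'bday'))
print(text_or_none(toy_soup, 'div', 'birthplace'))
```

The second call returns `None` because the snippet has no element with the class `birthplace`, which is the situation you will meet on pages that lack a particular infobox field.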

### Task 5: Format URLs

Write a function that takes a string containing a musician's name and turns it into the name of their page on Wikipedia.

For instance: 'taylor swift' -> 'Taylor_Swift'

To do this you will need to [capitalise each word in the string](https://stackoverflow.com/a/12336911), strip any whitespace at the [start](https://www.w3schools.com/python/ref_string_lstrip.asp) or [end](https://www.w3schools.com/python/ref_string_rstrip.asp) of the string, and replace spaces in between parts of the name with an underscore. You can do this with the Python [string replace method](https://www.w3schools.com/python/ref_string_replace.asp) or the [regex sub method](https://docs.python.org/3/library/re.html#re.sub).
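
As a quick reminder of what each of these string tools does on its own, here is a small sketch using made-up strings. How you combine them into `name_to_wikipedia_page_name` is left to you.

```python
import re

stripped = '  hello big world '.strip()      # removes whitespace at both ends
titled = 'hello world'.title()               # capitalises the first letter of each word
replaced = 'a b c'.replace(' ', '-')         # replaces every space
collapsed = re.sub(r'\s+', '-', 'a   b')     # replaces each run of whitespace

print(repr(stripped))
print(repr(titled))
print(repr(replaced))
print(repr(collapsed))
```

Note that `strip()` does the work of `lstrip()` and `rstrip()` in one call, and that `title()` also capitalises letters that follow an apostrophe, which may matter for some names.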

In [None]:
def name_to_wikipedia_page_name(input_str):
    #

To test your code, first run the following cell, which defines a function that checks whether a URL exists:

In [None]:
# Function from here: https://gist.github.com/dehowell/884204?permalink_comment_id=1771089#gistcomment-1771089
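def url_is_alive(url):
    # Send a HEAD request so that only the headers are fetched, not the whole page
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request):
            return True
    except urllib.error.URLError:
        # HTTPError (e.g. a 404) is a subclass of URLError, so this also covers network failures
        return False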

Now test your function with the following code. All the URLs should be valid if your function works properly:

In [None]:
musicians = ['Akon ', 'taylor swift', ' Bob dylan', ' Cardi B ']
url_root = 'https://en.wikipedia.org/wiki/'

for musician in musicians:
    page_name = name_to_wikipedia_page_name(musician)
    target_url = url_root + page_name
    is_real_url = url_is_alive(target_url)
    print(f'URL: {target_url} is {"valid" if is_real_url else "not valid"}')
    time.sleep(0.1)

### Task 6: Add functions to chatbot

Add the functions `get_html_content`, `name_to_wikipedia_page_name` and `url_is_alive` to the class `MusicBot` as member functions. Don't forget to add `self` as the first parameter of each one.

### Task 7: Add code to constructor 

Add the code to [create a requests session object and limit the requests made by the session](#limit-requests) to the constructor (`__init__`) of class `MusicBot`. Don't forget to use `self` to make the `session` object a member variable of your `MusicBot` class.

### Task 8: Write a function to get biographical information about a musician

Write a function that gets biographical information from the URL [using the examples above](#extract-biographical-information-from-the-html). There are many things you can get, such as age, birthplace, and musical genres. You can use the function below as a template. 
> Tip: you may want to use a [Try-Except block](https://www.w3schools.com/python/python_try_except.asp) in your function to catch any instances where the data you are looking for cannot be found on a particular page, then return something useful so that you can tell the user that information can't be found.

In [None]:
def get_age(url):
    # Get HTML content from URL
    # Extract the age from the HTML
    # Return the age as an Int

Once you have written it, test your function and then add it to the class `MusicBot` as a member function.

### Task 9: Create chat interface for retrieving biographical information

In the function `generate_response`:
1. Use a regex to match an input from the user where they ask a question about their favourite musician, e.g. the string input 'how old is {x}', where {x} is the name of the musician.
2. Use the function `name_to_wikipedia_page_name` to get the page name for the specified musician and append it to the Wikipedia URL root to build the URL.
3. Check whether the URL is valid using the function `url_is_alive`. If it is not, tell the user the musician could not be found; if it is, move on to the next step.
4. Pass the url into your function that retrieves information about the artist.
5. Return a response that includes the artist name and the information that has been extracted.
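
The control flow of the steps above could be sketched as follows. To keep the example runnable without network access, the three helper functions are passed in as parameters and replaced with simple stand-ins. The function name `answer_age_question`, the regex and the response wording are all assumptions for illustration.

```python
import re

# Captures the name and ignores an optional question mark at the end
AGE_QUESTION = re.compile(r'how old is (?P<name>.+?)\??$', re.IGNORECASE)

def answer_age_question(user_input, to_page_name, is_alive, get_age):
    match = AGE_QUESTION.search(user_input.strip())
    if match is None:
        return None  # not an age question
    name = match.group('name')
    url = 'https://en.wikipedia.org/wiki/' + to_page_name(name)
    if not is_alive(url):
        return f'Sorry, I could not find {name}.'
    age = get_age(url)
    if age is None:
        return f'Sorry, I could not find an age for {name}.'
    return f'{name} is {age} years old.'

# Stand-in functions so the sketch runs offline
found = answer_age_question('How old is Example Artist?',
                            to_page_name=lambda n: n.replace(' ', '_'),
                            is_alive=lambda url: True,
                            get_age=lambda url: 30)
missing = answer_age_question('How old is Example Artist?',
                              to_page_name=lambda n: n.replace(' ', '_'),
                              is_alive=lambda url: False,
                              get_age=lambda url: 30)
print(found)
print(missing)
```

In your chatbot the stand-ins would be your own member functions, called through `self`. Returning early at each failure point keeps every error message close to the check that triggers it.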


## Bonus tasks

Some more bonus tasks...

#### Bonus task A

Write functions for [all of the biographical information](#extract-biographical-information-from-the-html) that can be found on the Wikipedia page. Then incorporate them into the function `generate_response` in your chatbot.

#### Bonus task B

1. Look at the Wikipedia page for a musician. Is there any other information that you could extract for your chatbot?
2. Look at the source of the Wikipedia page to see if you can identify the necessary tags to extract that content.
3. Then try writing some code in this notebook to see if you can reliably extract that information.
4. Then follow the previous steps to incorporate that into your chatbot.

#### Bonus task C

Add functionality to your chatbot that allows it to answer questions about bands as well as solo musicians.
