<a href="https://colab.research.google.com/github/JRicardo11/recommendation_engine/blob/main/bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Colab Configuration

**Execute these steps to configure the Google Colab environment before running this notebook. They are not required if you are running it locally and have configured your local environment as explained in the GitHub repository.**

The first step is to clone the repository to have access to all the data and files

In [1]:
! git clone https://github.com/acastellanos-ie/MBD-EN-BL-ENE-2020-J-1.git

Cloning into 'MBD-EN-BL-ENE-2020-J-1'...
remote: Enumerating objects: 4485, done.
remote: Counting objects: 100% (4485/4485), done.
remote: Compressing objects: 100% (4372/4372), done.
remote: Total 4485 (delta 161), reused 4387 (delta 94), pack-reused 0
Receiving objects: 100% (4485/4485), 13.41 MiB | 17.58 MiB/s, done.
Resolving deltas: 100% (161/161), done.


Install the requirements

In [None]:
! pip install -Uqqr MBD-EN-BL-ENE-2020-J-1/requirements.txt



Ensure that you have the GPU runtime activated:

![](https://miro.medium.com/max/3006/1*vOkqNhJNl1204kOhqq59zA.png)

Now you have everything you need to execute the code in Colab

# Bag-of-words

In [None]:
import nltk
nltk.download('shakespeare')
nltk.download('stopwords')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np

The `nltk` library includes several corpora for experimentation. In this notebook we are going to use the corpus containing a set of Shakespeare's plays.

In the following cell, I will load the corpus and create a dataframe with the name of each book and its textual content.

In [None]:
shakespeare_df = pd.DataFrame(columns=["book", "words"])
for ii, book in enumerate(nltk.corpus.shakespeare.fileids()):
    shakespeare_df.loc[ii] = (book, " ".join(nltk.corpus.shakespeare.words(book)))
print(shakespeare_df)

While this representation can be useful for humans, it is of no use if you want to feed these data to an NLP system.

As we discussed in class, we need to create the document-term matrix, which will be the input for any NLP system we build on top of it. In the document-term matrix we have a row for each document (each of Shakespeare's plays) and a column for each distinct word in the dataset. Each cell holds the weight of the word in the document (for example, how many times the word appears in the document).
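
To make the structure concrete before we use scikit-learn, here is a minimal sketch that builds a count-based document-term matrix by hand. The two toy documents and the names `docs`, `vocabulary` and `dt_matrix` are made up for illustration; they are not part of the Shakespeare data.

```python
from collections import Counter

# Toy corpus (illustrative only, not taken from the Shakespeare corpus).
docs = {
    "doc_a": "the king loves the queen",
    "doc_b": "the queen loves the crown",
}

# Count how many times each word occurs in each document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The sorted set of all words defines the columns of the matrix.
vocabulary = sorted({word for c in counts.values() for word in c})

# One row per document, one column per vocabulary word.
dt_matrix = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)  # ['crown', 'king', 'loves', 'queen', 'the']
for name, row in dt_matrix.items():
    print(name, row)
```

Each row has the same length as the vocabulary, and a word that is missing from a document simply gets a weight of 0.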

In class we presented several weighting approaches. Let's see how we can create them.

Let's start with the simplest one: binary weighting. Binary weighting only records whether a word appears (1) or does not appear (0) in a document.

In [None]:
binary_weighting = CountVectorizer(binary=True)
binary_shakespeare = binary_weighting.fit_transform(shakespeare_df.words)
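# .toarray() and get_feature_names_out() replace the older .A and get_feature_names(),
# which recent SciPy / scikit-learn releases no longer provide.
binary_dt_matrix = pd.DataFrame(binary_shakespeare.toarray(), columns=binary_weighting.get_feature_names_out())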
print(binary_dt_matrix)
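
To see what information binary weighting throws away, here is a small sketch on a single toy sentence. The sentence and the hand-picked `vocabulary` (which deliberately includes one absent word, `king`) are assumptions for illustration.

```python
from collections import Counter

doc = "to be or not to be"  # illustrative sentence
term_counts = Counter(doc.split())

# Hand-picked vocabulary; "king" does not occur in the sentence.
vocabulary = ["be", "king", "not", "or", "to"]

# Count row: how many times each vocabulary word occurs.
count_row = [term_counts[term] for term in vocabulary]

# Binary row: 1 if the word occurs at all, 0 otherwise.
binary_row = [1 if count > 0 else 0 for count in count_row]

print(count_row)   # [2, 0, 1, 1, 2]
print(binary_row)  # [1, 0, 1, 1, 1]
```

After binarization, `be` (two occurrences) and `or` (one occurrence) become indistinguishable.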

Let's inspect the most and least important terms for document 6 (Othello).

In [None]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(binary_dt_matrix.iloc[:, np.argsort(binary_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(binary_dt_matrix.iloc[:, np.argsort(binary_dt_matrix.loc[document])[::-1]].iloc[document][-25:])



As you can see, this representation is not very useful as it is: knowing only whether a word appears in a document gives us little information. **Can you think of a situation where binary weighting would be sufficient?**

The next thing to know is whether the word appears only once or several times.

In [None]:
tf_weighting = CountVectorizer()
tf_shakespeare = tf_weighting.fit_transform(shakespeare_df.words)
# .toarray() / get_feature_names_out() replace the older .A / get_feature_names().
tf_dt_matrix = pd.DataFrame(tf_shakespeare.toarray(), columns=tf_weighting.get_feature_names_out())
print(tf_dt_matrix)

OK, now we have the words weighted according to how many times they appear in the document.

Let's now check the most and least important words in Othello.

In [None]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_dt_matrix.iloc[:, np.argsort(tf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_dt_matrix.iloc[:, np.argsort(tf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

**What problem do you see with the most important words? Are they really representative?**



Let's now see how to create the TF-IDF weighting and whether it improves this representation.

In [None]:
tf_idf_weighting = TfidfVectorizer()
tf_idf_shakespeare = tf_idf_weighting.fit_transform(shakespeare_df.words)
# .toarray() / get_feature_names_out() replace the older .A / get_feature_names().
tf_idf_dt_matrix = pd.DataFrame(tf_idf_shakespeare.toarray(), columns=tf_idf_weighting.get_feature_names_out())
print(tf_idf_dt_matrix)
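
To get a feeling for where these numbers come from, here is a rough sketch of the computation, assuming scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`): the raw count is multiplied by `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit length. The toy documents and the helper names `idf` and `tf_idf` are made up for illustration, and the whitespace tokenization is a simplification of what `TfidfVectorizer` actually does.

```python
import math
from collections import Counter

# Toy corpus (illustrative only).
docs = [
    "the moor the moor",
    "the king",
    "the queen",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tf_idf(tokens):
    # Raw term count times IDF, then L2-normalize the row.
    tf = Counter(tokens)
    raw = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tf_idf(tokenized[0])
print(weights)
```

In the first toy document, `the` and `moor` both occur twice, but `moor` ends up with the larger weight because it appears in only one document while `the` appears in all of them.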

In [None]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_idf_dt_matrix.iloc[:, np.argsort(tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_idf_dt_matrix.iloc[:, np.argsort(tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

**What do you see now in the representation? Have we solved all the problems?**

# StopWords

In the previous section we ran into some problems related to stopwords, such as `and` or `of`. These words carry little meaning on their own and are unlikely to help any subsequent NLP task, so we can safely remove them.

Let's see how to do it via NLTK.

Since stopwords are language-dependent, NLTK provides a list for several languages.

In [None]:
from nltk.corpus import stopwords
print("Languages for which NLTK provides a stopword list:", ", ".join(stopwords.fileids()))

We are only interested in the English stopword list.

In [None]:
print("Example of 25 English stopwords:", ", ".join(stopwords.words("english")[:25]))

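We can remove these words from our representation and create the document-term matrix without them. Note that `stop_words='english'` tells scikit-learn to use its own built-in English list; to use the NLTK list shown above instead, pass `stop_words=stopwords.words("english")`. Let's check.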

In [None]:
sw_free_tf_idf_weighting = TfidfVectorizer(stop_words='english')
sw_free_tf_idf_shakespeare = sw_free_tf_idf_weighting.fit_transform(shakespeare_df.words)
# .toarray() / get_feature_names_out() replace the older .A / get_feature_names().
sw_free_tf_idf_dt_matrix = pd.DataFrame(sw_free_tf_idf_shakespeare.toarray(), columns=sw_free_tf_idf_weighting.get_feature_names_out())
print(sw_free_tf_idf_dt_matrix)
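
Conceptually, stopword removal is just a filter applied to the tokens before they are counted. Here is a minimal sketch of that idea; the tiny `toy_stopwords` set and the example sentence are made up for illustration (the real NLTK and scikit-learn lists are much longer).

```python
from collections import Counter

# Tiny illustrative stopword set, NOT the real NLTK or scikit-learn list.
toy_stopwords = {"the", "and", "of", "to", "a"}

text = "the tragedy of othello the moor of venice"
tokens = text.split()

# Keep only the tokens that are not stopwords.
kept = [t for t in tokens if t not in toy_stopwords]

print(Counter(tokens).most_common(2))  # dominated by 'the' and 'of'
print(kept)  # ['tragedy', 'othello', 'moor', 'venice']
```

Before filtering, the most frequent tokens are stopwords; after filtering, only the content-bearing words are left to be weighted.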

In [None]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(sw_free_tf_idf_dt_matrix.iloc[:, np.argsort(sw_free_tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(sw_free_tf_idf_dt_matrix.iloc[:, np.argsort(sw_free_tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

It's much better now, isn't it?

Try playing with the previous code: change the document to see how the different weightings affect its representation, or use a different corpus from the ones included in NLTK.