# **Extract Bag of Words (BoW) Features from Course Textual Content**


Estimated time needed: **60** minutes


The main goal of recommender systems is to help users find items they are potentially interested in. Depending on the recommendation task, an item can be a movie, a restaurant, or, in our case, an online course.

Machine learning algorithms cannot work on an item directly, so we first need to extract features and represent each item mathematically, i.e., as a feature vector.

Many items are described by text, such as the title and description of a movie or course. Since machine learning algorithms cannot process textual data directly, we need to transform the raw text into numeric feature vectors.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/extract_textual_features.png)


In this lab, you will learn to extract bag of words (BoW) features from course titles and descriptions. BoW is a simple but effective way to characterize textual data and is widely used in many text-based machine learning tasks.


## Objectives


After completing this lab you will be able to:


* Extract Bag of Words (BoW) features from course titles and descriptions
* Build a course BoW dataset to be used for building a content-based recommender system later


----


## Prepare and set up the lab environment


First, let's install and import required libraries:


In [None]:
!pip install nltk==3.6.7
!pip install gensim==4.1.2

In [None]:
import gensim
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora

%matplotlib inline

Download the tokenizer model, the stop word list, and the part-of-speech tagger data used later in this lab:


In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

In [None]:
# also set a random state
rs = 123

### Bag of Words (BoW) features


BoW features are essentially the counts or frequencies of each word that appears in a text (string). Let's illustrate it with some simple examples.


Suppose we have two course descriptions as follows:


In [None]:
course1 = "this is an introduction data science course which introduces data science to beginners"

In [None]:
course2 = "machine learning for beginners"

In [None]:
courses = [course1, course2]
courses

The first step is to split the two strings into words (tokens). In text processing, a token is the smallest unit of text we work with, such as a word, a symbol or punctuation mark, or a phrase. The process of transforming a string into a collection of tokens is called `tokenization`.


One common way to do `tokenization` is to use the Python built-in `split()` method of the `str` class. However, in this lab, we want to leverage the `nltk` (Natural Language Toolkit) package, which is probably the most commonly used package for processing text or natural language.


More specifically, we will apply the `word_tokenize()` function to the content of each course (a string):


In [None]:
# Tokenize the two courses
tokenized_courses = [word_tokenize(course) for course in courses]

In [None]:
tokenized_courses

As you can see from the cell output, the two courses have been tokenized and turned into two lists of tokens.
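For comparison, the plain `str.split()` approach mentioned above can be sketched as follows. This is a minimal illustration using only the standard library; the second sentence, with punctuation, is made up for this example and is not part of the course dataset.

```python
# Whitespace tokenization with the built-in str.split()
course1 = "this is an introduction data science course which introduces data science to beginners"
split_tokens = course1.split()
print(split_tokens)

# Hypothetical sentence (not from the dataset) showing a limitation:
# split() leaves punctuation attached to the neighbouring word.
sample = "Hello, data science!"
sample_tokens = sample.split()
print(sample_tokens)
```

A tokenizer such as `word_tokenize()` typically separates punctuation into its own tokens, which is one reason to prefer it over `split()` for real text.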


Next, we want to create a token dictionary to index all tokens. Basically, we want to assign a key/index for each token. One way to index tokens is to use the `gensim` package which is another popular package for processing textual data:


In [None]:
# Create a token dictionary for the two courses
tokens_dict = gensim.corpora.Dictionary(tokenized_courses)

In [None]:
print(tokens_dict.token2id)

With the token dictionary, we can easily count each token in the two example courses and output two BoW feature vectors. More conveniently, the `gensim` package provides a `doc2bow()` method that generates BoW features out of the box.
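To make the counting step concrete, here is a minimal standard-library sketch of building BoW pairs by hand. The `token2id` mapping and the `manual_bow()` helper are hypothetical names introduced for this illustration, and the ids are assigned in order of first appearance, which may differ from the ids `gensim` assigns.

```python
from collections import Counter

# Toy tokenized courses, mirroring the two example descriptions
tokenized = [
    "this is an introduction data science course which introduces data science to beginners".split(),
    "machine learning for beginners".split(),
]

# Hypothetical indexing scheme: assign ids in order of first appearance.
# (A library such as gensim may assign different ids.)
token2id = {}
for doc in tokenized:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def manual_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(doc)
    return sorted((token2id[token], count) for token, count in counts.items())


manual_bows = [manual_bow(doc) for doc in tokenized]
print(manual_bows)
```

The `doc2bow()` call in the next cell produces the same kind of `(token_id, count)` pairs without the manual bookkeeping.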


In [None]:
# Generate BoW features for each course
courses_bow = [tokens_dict.doc2bow(course) for course in tokenized_courses]

In [None]:
courses_bow

It outputs two BoW lists where each element is a tuple, e.g., `(0, 1)` and `(7, 2)`. The first element of the tuple is the token ID and the second element is its count. So `(0, 1)` means `('an', 1)` and `(7, 2)` means `('science', 2)`.


We can use the following code snippet to print each token and its count:


In [None]:
for course_idx, course_bow in enumerate(courses_bow):
    print(f"Bag of words for course {course_idx}:")
    # For each token index, print its bow value (word count)
    for token_index, token_bow in course_bow:
        token = tokens_dict.get(token_index)
        print(f"--Token: '{token}', Count:{token_bow}")

If we turn the long list into horizontal feature vectors, we can see that the two courses become two numerical feature vectors:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/bow.png)
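The conversion from sparse `(token_id, count)` pairs to a fixed-length vector can be sketched as below. The `to_dense()` helper and the example values are hypothetical and only illustrate the idea; they are not taken from the lab's dataset.

```python
def to_dense(sparse_bow, vocab_size):
    """Expand (token_id, count) pairs into a fixed-length count vector."""
    vector = [0] * vocab_size
    for token_id, count in sparse_bow:
        vector[token_id] = count
    return vector


# Hypothetical sparse BoW for a 5-token vocabulary (illustrative values only)
example_bow = [(0, 1), (3, 2)]
dense = to_dense(example_bow, vocab_size=5)
print(dense)  # [1, 0, 0, 2, 0]
```

Tokens that do not occur in a document simply get a count of 0, so every document ends up with a vector of the same length. `gensim` also provides `gensim.matutils.sparse2full()` for this kind of conversion.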


### BoW dimensionality reduction


A document may contain tens of thousands of words, which makes the dimension of the BoW feature vector huge. One common way to reduce the dimensionality is to filter out relatively uninformative tokens, such as stop words or, sometimes, adpositions and adjectives.


Note there are many other ways to reduce dimensionality such as `stemming` and `lemmatization` but they are beyond the scope of this capstone project. You are encouraged to explore them yourself.


We can use the English stop words provided in `nltk`:


In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
stop_words

Then we can filter those English stop words from the tokens in course1:


In [None]:
# Tokens in course 1
tokenized_courses[0]

In [None]:
processed_tokens = [w for w in tokenized_courses[0] if w.lower() not in stop_words]

In [None]:
processed_tokens

You can see the number of tokens for ```course1``` has been reduced.


Another common way is to keep only the nouns in the text. We can use the `nltk.pos_tag()` function to analyze the part of speech (POS) and annotate each word.


In [None]:
tags = nltk.pos_tag(tokenized_courses[0])
tags

As we can see, `introduction`, `data`, `science`, `course`, and `beginners` are all nouns, and we may keep them in the BoW feature vector.
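Keeping only the nouns from a list of `(word, tag)` pairs could look like the sketch below. It assumes Penn Treebank tags, which `nltk.pos_tag()` uses by default; the `keep_nouns()` helper is hypothetical, and the tagged pairs are hand-written for illustration rather than actual tagger output.

```python
def keep_nouns(tagged_tokens):
    """Keep words whose Penn Treebank tag starts with 'NN' (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


# Hand-written (word, tag) pairs for illustration; real pos_tag output may differ.
example_tags = [("an", "DT"), ("introduction", "NN"), ("to", "TO"), ("beginners", "NNS")]
nouns = keep_nouns(example_tags)
print(nouns)  # ['introduction', 'beginners']
```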


### TASK: Extract BoW features for course textual content and build a dataset


By now you have learned what a BoW feature is, so let's start extracting BoW features from some real course textual content.


In [None]:
course_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
course_content_df = pd.read_csv(course_url)

In [None]:
course_content_df.iloc[0, :]

The course content dataset has three columns: `COURSE_ID`, `TITLE`, and `DESCRIPTION`. `TITLE` and `DESCRIPTION` are both text columns from which we want to extract BoW features.


Let's join those two text columns together.


In [None]:
# Merge the TITLE and DESCRIPTION columns into a single text column
course_content_df['course_texts'] = course_content_df[['TITLE', 'DESCRIPTION']].agg(' '.join, axis=1)
course_content_df = course_content_df.reset_index()
course_content_df['index'] = course_content_df.index

In [None]:
course_content_df.iloc[0, :]

We have prepared a `tokenize_course()` function for you to tokenize the course content:


In [None]:
def tokenize_course(course, keep_only_nouns=True):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(course)
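    # Remove English stop words and numbers
    word_tokens = [w for w in word_tokens if w.lower() not in stop_words and not w.isnumeric()]
    # Approximate "nouns only" by dropping the POS tags listed below
    # (other tags, such as verbs and plain adjectives, are kept)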
    if keep_only_nouns:
        filter_list = ['WDT', 'WP', 'WRB', 'FW', 'IN', 'JJR', 'JJS', 'MD', 'PDT', 'POS', 'PRP', 'RB', 'RBR', 'RBS',
                       'RP']
        tags = nltk.pos_tag(word_tokens)
        word_tokens = [word for word, pos in tags if pos not in filter_list]

    return word_tokens

Let's try it on the first course.


In [None]:
a_course = course_content_df.iloc[0, :]['course_texts']
a_course

In [None]:
tokenize_course(a_course)

Next, you will need to write some code snippets to generate the BoW features for each course. Let's start by tokenizing all courses in `course_content_df`:


_TODO: Use the provided tokenize_course() function to tokenize all courses in course_content_df['course_texts']._


In [None]:
# WRITE YOUR CODE HERE



<details>
    <summary>Click here for Hints</summary>

Use `tokenize_course(text, True)` to tokenize each text in `course_content_df['course_texts']`.

</details>


Then we need to create a token dictionary `tokens_dict`


_TODO: Use gensim.corpora.Dictionary(tokenized_courses) to create a token dictionary._


In [None]:
# WRITE YOUR CODE HERE


Then we can use the `doc2bow()` method to generate BoW features for each tokenized course.


_TODO: Use tokens_dict.doc2bow() to generate BoW features for each tokenized course._


In [None]:
# WRITE YOUR CODE HERE


<details>
    <summary>Click here for Hints</summary>
    
Use `tokens_dict.doc2bow(course)` for each course in `tokenized_courses`.

</details>


Lastly, you need to append the BoW features for each course into a new BoW dataframe. The new dataframe needs to include the following columns (you may include other relevant columns as well):
- 'doc_index': the course index starting from 0
- 'doc_id': the actual course id such as `ML0201EN`
- 'token': the tokens for each course
- 'bow': the bow value for each token


_TODO: Create a new course_bow dataframe based on the extracted BoW features._


In [None]:
# WRITE YOUR CODE HERE

#  ...
#  bow_dicts = {"doc_index": doc_indices,
#            "doc_id": doc_ids,
#            "token": tokens,
#            "bow": bow_values}
#  pd.DataFrame(bow_dicts)

<details>
    <summary>Click here for Hints</summary>
    
You can use two nested for-loops to build the dataframe. The outer loop is `for doc_index, doc_bow in enumerate(bow_docs):`, where `bow_docs` is the list of BoW features for each tokenized course. The inner loop is `for token_index, token_bow in doc_bow:`. Inside the inner loop, look up each "token" by passing `token_index` to `tokens_dict`, use `token_bow` as the "bow" value, use `doc_index` as the "doc_index" value, and get "doc_id" by indexing the `course_content_df['COURSE_ID']` column with `doc_index`.

</details>


Your course BoW dataframe may look like the following:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/bow_dataset.png)


You may refer to previous code examples in this lab if you need help with creating the BoW dataframe.
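As a rough sketch of the nested-loop pattern described in the hint, the snippet below builds the four columns from small toy inputs. The names `id2token`, `doc_id_list`, and the course ids are hypothetical stand-ins, not the lab's real objects or data, and this is one possible approach rather than the reference solution.

```python
# Hypothetical toy inputs standing in for the real objects built earlier in the lab
id2token = {0: "data", 1: "science", 2: "python"}  # stand-in for the token dictionary
doc_id_list = ["COURSE_A", "COURSE_B"]             # stand-in for the COURSE_ID column
bow_docs = [[(0, 2), (1, 1)], [(2, 3)]]            # stand-in for the doc2bow outputs

doc_indices, doc_ids, tokens, bow_values = [], [], [], []
for doc_index, doc_bow in enumerate(bow_docs):
    for token_index, token_bow in doc_bow:
        # One output row per (document, token) pair
        doc_indices.append(doc_index)
        doc_ids.append(doc_id_list[doc_index])
        tokens.append(id2token[token_index])
        bow_values.append(token_bow)

bow_dicts = {"doc_index": doc_indices, "doc_id": doc_ids,
             "token": tokens, "bow": bow_values}
print(bow_dicts)
# In the lab, pd.DataFrame(bow_dicts) would turn these columns into a dataframe.
```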


### Other popular textual features


In addition to the basic token BoW feature, there are two other types of widely used textual features. If you are interested, you may explore them yourself to learn how to extract them from the course textual content: 


- **tf-idf**: tf-idf stands for Term Frequency–Inverse Document Frequency. Like BoW, tf-idf counts how often each word occurs in a document. It then scales that count down according to the number of documents in the corpus that contain the word, to adjust for the fact that some words appear more frequently in general. A higher tf-idf value normally means the word/token is more important to that document.
- **Text embedding vector**: Embedding means projecting an object into a latent feature space. We normally employ neural networks or deep neural networks to learn the latent features of a textual object such as a word, a sentence, or an entire document. The learned latent feature vectors are then used to represent the original textual entities.
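
To give a feel for tf-idf, here is a minimal standard-library sketch on the two toy course descriptions. It assumes one common variant of the formula, raw count multiplied by the natural log of (number of documents / number of documents containing the token); libraries often use smoothed or normalized variants, so their numbers may differ. The `tf_idf()` helper is a hypothetical name for this illustration.

```python
import math
from collections import Counter

docs = [
    "this is an introduction data science course which introduces data science to beginners".split(),
    "machine learning for beginners".split(),
]


def tf_idf(docs):
    """Compute count(t, d) * ln(N / df(t)) for every token in every document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return [
        {token: count * math.log(n_docs / doc_freq[token])
         for token, count in Counter(doc).items()}
        for doc in docs
    ]


scores = tf_idf(docs)
print(scores[0]["data"])       # 2 * ln(2): "data" occurs twice, only in the first document
print(scores[0]["beginners"])  # 0.0: "beginners" occurs in both documents
```

Under this variant, a token that appears in every document gets a score of zero, which is how tf-idf discounts words that are common across the corpus.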
