# Bag-of-Words


This notebook helps students improve their coding skills and provides code examples for the bag-of-words (BoW) model.

Each topic contains two sections:
- Definition
- Code examples

## **1- Introduction**

Machine learning algorithms operate on numerical inputs. How can we extract semantic information from unstructured text and convert it into a numerical vector that a computer can process? The DLMAINLPCV01 lecture book introduces two methods for embedding words and documents into a semantic vector space. The first is the simple and intuitive bag-of-words approach; the second is the more powerful approach of neural word and sentence vectors. Neural language models (NLMs) use such word embeddings to make their predictions.

This notebook contains code examples for the bag-of-words method. Word vectors are explained in [this notebook](https://colab.research.google.com/github/EmrahYener/DLMAINLPCV01_demo/blob/master/notebooks/nlp_1-3_word_vectors.ipynb).



## **2- Bag-of-Words**

Bag-of-Words (BoW) is a featurization method that represents a text as a vector of word counts or, in the binary variant, as a vector of 0/1 indicators of whether each word is present. Word order is ignored.
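
The following is a minimal sketch of both variants, using only Python's standard library. The example sentences and the helper name `bag` are illustrative assumptions and not part of the lecture material. It shows that two sentences containing the same words in a different order produce the same bag, and how the count representation differs from the binary one.

```python
from collections import Counter

def bag(text):
    """Return word counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

a = bag("the dog chased the cat")
b = bag("the cat chased the dog")

# Word order is ignored: both sentences map to the same bag
print(a == b)

# Count variant: how often each word occurs
print(dict(a))

# Binary variant: only whether a word occurs at all
binary = {word: 1 for word in a}
print(binary)
```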

Many algorithms (for example kNN, Naïve Bayes, logistic regression, k-means clustering, topic models, and collaborative filtering) expect numerical vectors as input. BoW is a simple way to convert text into numbers: we collect the unique words of the given texts (the “bag”) as a vocabulary and represent each document as a vector that contains the count of each vocabulary word in that document.
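
To illustrate why a numerical representation matters, the sketch below computes the Euclidean distance between count vectors, which is the kind of calculation a distance-based method such as kNN relies on. The three toy vectors, the four-word vocabulary, and the function name `euclidean` are assumptions made for illustration only.

```python
import math

# Hypothetical count vectors over the vocabulary ["tennis", "grand", "slam", "goal"]
doc_a = [2, 1, 1, 0]
doc_b = [1, 1, 1, 0]
doc_c = [0, 0, 0, 3]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# doc_a and doc_b share most of their words, so their distance is small
print(euclidean(doc_a, doc_b))

# doc_a and doc_c have no words in common, so their distance is larger
print(euclidean(doc_a, doc_c))
```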



#### **Code Examples:**

<b>1. </b>   The following example performs these steps:
- remove every period "." from the given sentences
- split the sentences into words and concatenate them into one list ("doc")
- iterate through the list "doc" and keep each word the first time it occurs
- store these unique words in a new list ("unique_tokens")

In [None]:
# Input sentences
s1="Federer is one of the greatest tennis players of all time."
s2="Federer has won twenty grand slam titles to date."

# Find unique word tokens for both sentences

## Remove '.' from each sentence, split it into words and concatenate the words into one list
doc = s1.replace('.','').split() + s2.replace('.','').split()

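## Create an empty set that records the words seen so far
seen = set()

## Iterate through the document and keep each word only the first time it occurs.
## set.add() returns None, so the condition is true only for unseen words,
## which are added to "seen" as a side effect.
unique_tokens = [
                 t for t in doc if not (t in seen or seen.add(t))
                ]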


print(unique_tokens)


['Federer', 'is', 'one', 'of', 'the', 'greatest', 'tennis', 'players', 'all', 'time', 'has', 'won', 'twenty', 'grand', 'slam', 'titles', 'to', 'date']
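
As an alternative sketch, the same order-preserving removal of duplicates can be written with `dict.fromkeys`, which relies on Python dictionaries keeping insertion order (guaranteed since Python 3.7). The sentences are repeated here so that the cell runs on its own; the variable name `unique_tokens_alt` is an illustrative assumption.

```python
s1 = "Federer is one of the greatest tennis players of all time."
s2 = "Federer has won twenty grand slam titles to date."

doc = s1.replace('.', '').split() + s2.replace('.', '').split()

# Dictionary keys are unique and keep insertion order,
# so duplicates are dropped and the order of first occurrence is kept
unique_tokens_alt = list(dict.fromkeys(doc))

print(unique_tokens_alt)
```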


<br></br>
<b>2. </b>   The following example shows how to calculate the frequency of each word in a given sentence: 

In [None]:
# Count the frequency of each unique word token in sentence s1

## Create an empty list to use as a vector
vec_1 = []

## Remove the period "." from sentence s1 and split it into a list of words (token_1)
token_1 = s1.replace('.','').split()


## For each word in "unique_tokens", count how often it occurs in "token_1" and append the count to the vector
for t in unique_tokens:
  count = token_1.count(t)
  print(f'{t}: {count}')
  vec_1.append(count)

print(f'\nVector output:\n{s1}\n{vec_1}')

Federer: 1
is: 1
one: 1
of: 2
the: 1
greatest: 1
tennis: 1
players: 1
all: 1
time: 1
has: 0
won: 0
twenty: 0
grand: 0
slam: 0
titles: 0
to: 0
date: 0

Vector output:
Federer is one of the greatest tennis players of all time.
[1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
# Count the frequency of each unique word token in sentence s2

## We will follow the same steps as for sentence s1

vec_2 = []

token_2 = s2.replace('.','').split()

for t in unique_tokens:
  count = token_2.count(t)
  print(f'{t}: {count}')
  vec_2.append(count)

print(f'\nVector output:\n{s2}\n{vec_2}')

Federer: 1
is: 0
one: 0
of: 0
the: 0
greatest: 0
tennis: 0
players: 0
all: 0
time: 0
has: 1
won: 1
twenty: 1
grand: 1
slam: 1
titles: 1
to: 1
date: 1

Vector output:
Federer has won twenty grand slam titles to date.
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
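
Because the steps for `s1` and `s2` are identical, they can be wrapped in a function. The following is a minimal sketch; the function name `bow_vector` and its parameters are assumptions made for illustration. It uses `collections.Counter` so that each sentence is counted once instead of calling `list.count` for every vocabulary word.

```python
from collections import Counter

def bow_vector(sentence, vocabulary):
    """Return the word counts of `sentence` in the order given by `vocabulary`."""
    counts = Counter(sentence.replace('.', '').split())
    # A Counter returns 0 for words that do not occur in the sentence
    return [counts[word] for word in vocabulary]

s1 = "Federer is one of the greatest tennis players of all time."
s2 = "Federer has won twenty grand slam titles to date."

# Vocabulary: unique words of both sentences in order of first occurrence
vocabulary = list(dict.fromkeys((s1 + ' ' + s2).replace('.', '').split()))

print(bow_vector(s1, vocabulary))
print(bow_vector(s2, vocabulary))
```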


<br></br>
<b>3. </b>   We have seen how to find the unique words in a given text and how to count the frequency of each word. The following example prints each sentence together with its bag-of-words vector:

In [None]:
# Print each sentence together with its corresponding vector
print(f'\n{s1}\n{vec_1}')
print(f'\n{s2}\n{vec_2}')


Federer is one of the greatest tennis players of all time.
[1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Federer has won twenty grand slam titles to date.
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
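
Both vectors share the same word order, so they can be stacked into a small table with one row per sentence and one column per word. The sketch below repeats the values printed above so that it runs on its own; the names `matrix` and `shared` are illustrative assumptions.

```python
unique_tokens = ['Federer', 'is', 'one', 'of', 'the', 'greatest', 'tennis', 'players', 'all',
                 'time', 'has', 'won', 'twenty', 'grand', 'slam', 'titles', 'to', 'date']
vec_1 = [1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
vec_2 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Stack the vectors: one row per sentence, one column per vocabulary word
matrix = [vec_1, vec_2]

# Print one line per word with its count in each sentence
print(f"{'word':<10}{'s1':>4}{'s2':>4}")
for word, c1, c2 in zip(unique_tokens, *matrix):
    print(f"{word:<10}{c1:>4}{c2:>4}")

# Words that occur in both sentences have a non-zero count in both rows
shared = [w for w, c1, c2 in zip(unique_tokens, *matrix) if c1 and c2]
print(shared)
```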


Copyright © 2021 IU International University of Applied Sciences