## Text Vectorization

Question: What is text vectorization?

Answer: **The process of transforming text data into numerical vectors**

## Why do we need text vectorization?

Think back to when we learned about **Label Encoding** and **One-Hot Encoding**: We took categories (text) and transformed them into numerical values. 

Text vectorization is similar: we take text and turn it into something a machine can understand and manipulate by translating each word into a unique vector of numbers. For example, we could associate the unique vector `(0, 1, 0, 1)` with the word `queen`.
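One simple way to give every word its own vector is one-hot encoding over a fixed vocabulary. The sketch below assumes a made-up four-word vocabulary and a hypothetical helper named `one_hot`; the vector it produces for `queen` differs from the `(0, 1, 0, 1)` illustration above, which is not tied to any particular scheme.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration
vocabulary = ["king", "queen", "man", "woman"]


def one_hot(word, vocabulary):
    """Return a vector with a 1 at the word's position and 0 everywhere else."""
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[vocabulary.index(word)] = 1
    return vector


print(one_hot("queen", vocabulary))  # [0 1 0 0]
```

Every word in the vocabulary gets a different position for its `1`, so no two words share a vector.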

### Question: What are some other use cases for text vectorization?

##### Use Cases for Text Vectorization

- Count how many times each unique word appears in each sentence (Bag-of-Words; we'll discuss this shortly!)

- Assign weights to each word in the sentence.

- Map each word to a number (a dictionary with words as keys and numbers as values) and represent each sentence as a sequence of numbers
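The last use case can be sketched in a few lines of plain Python. The names `word_to_id` and `sequences` are illustrative, and splitting on whitespace is an assumption that works here only because the example sentences carry no punctuation.

```python
sentences = [
    "this is the first sentence",
    "this one is the second sentence",
]

# Give each new word the next free integer, in order of first appearance
word_to_id = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)

# Replace every word by its number to get one sequence per sentence
sequences = [[word_to_id[w] for w in s.lower().split()] for s in sentences]

print(word_to_id)
print(sequences)  # [[0, 1, 2, 3, 4], [0, 5, 1, 2, 6, 4]]
```

Unlike Bag-of-Words, this representation keeps the order of the words in each sentence.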


## Bag-of-Words Matrix

- Bag-of-Words (BoW) is a matrix whose **rows are sentences** and whose **columns are the unique words** seen across all of the sentences

### BoW Example

We have the following 4 sentences:

1. This is the first sentence.
1. This one is the second sentence.
1. And this is the third one.
1. Is this the first sentence?

**Question:** Given the above sentences, how many unique words are there?

<!-- Answer: 9 -->

A BoW matrix for these sentences looks like the following, where `0` means the word does not appear in the sentence and `1` means it does. (In general, each cell holds the number of times the word appears; here no word repeats within a sentence.)

![bow_matrix](../Notebooks/Images/bag-of-words-matrix.png)

## BoW Worksheet (7 min)

**Complete the following worksheet on your own:**

- Copy [this blank table](https://docs.google.com/presentation/d/1B7v33fPEwblhHYBCSrCvKRBZz776Df4T_t2jcPXt4k8/edit?usp=sharing), and create the BoW matrix for the following sentences:


1. Data Science is the best.
1. Data Science has cool topics.
1. Are these the best topics?
1. Is Data Science the best track?

## BoW in Sklearn

- We can write a function to return a BoW matrix 

- Below, we will see how we can build a BoW matrix by calling [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn-feature-extraction-text-countvectorizer) in sklearn
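As a sketch of the hand-written route from the first bullet, the function below builds a BoW matrix with the standard library only. The name `bag_of_words` and the `\w+` tokenization are assumptions made for illustration; `CountVectorizer` uses its own default token pattern, which for instance ignores one-letter words.

```python
import re


def bag_of_words(sentences):
    """Return (vocabulary, matrix) with one row per sentence and one column per word."""
    # Lowercase each sentence and keep runs of word characters as tokens
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    # Sorted unique words become the columns
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    column = {word: i for i, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in sentences]
    for row, tokens in enumerate(tokenized):
        for word in tokens:
            matrix[row][column[word]] += 1  # count each occurrence
    return vocabulary, matrix


vocabulary, matrix = bag_of_words([
    "This is the first sentence.",
    "This one is the second sentence.",
    "And this is the third one.",
    "Is this the first sentence?",
])
print(vocabulary)
for row in matrix:
    print(row)
```

For these four sentences the result should agree with what `CountVectorizer` produces, since none of them contains a one-letter word.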

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

sentences = ['This is the first sentence.',
             'This one is the second sentence.',
             'And this is the third one.',
             'Is this the first sentence?']


In [3]:
vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix.
# Each stored entry is addressed by a tuple (sentence index, word index) and holds a count.
# For example, if the word "one" is assigned the number 3,
# then "one" in the third sentence is stored at (2, 3), because indices start at 0
X = vectorizer.fit_transform(sentences)

# convert the sparse matrix to a dense array to see the full BoW matrix
print(X.toarray())

# each row of the output is one sentence: the first row is the first sentence

[[0 1 1 0 0 1 1 0 1]
 [0 0 1 1 1 1 1 0 1]
 [1 0 1 1 0 0 1 1 1]
 [0 1 1 0 0 1 1 0 1]]
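The `(sentence, word)` tuples mentioned in the cell's comments can be checked directly. This sketch refits the same sentences so that it runs on its own; it uses the fitted vectorizer's `vocabulary_` attribute, which maps each word to its column number.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['This is the first sentence.',
             'This one is the second sentence.',
             'And this is the third one.',
             'Is this the first sentence?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Column number assigned to the word "one"
one_column = vectorizer.vocabulary_["one"]
print(one_column)

# Entry for the third sentence (row 2) and the word "one"
print(X[2, one_column])

# Columns with a non-zero count in the third sentence
print(sorted(X[2].nonzero()[1].tolist()))
```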


## How do we get unique words?

In [4]:
# Get the unique words
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['and', 'first', 'is', 'one', 'second', 'sentence', 'the', 'third', 'this']


## Activity: Worksheet --> sklearn (7 min)

Take the 4 sentences from the worksheet and create their BoW matrix using sklearn.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

sentences = ['Data Science is the best.',
             'Data Science has cool topics.',
             'Are these the best topics?',
             'Is Data Science the best track?']
vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix.
# Each stored entry is addressed by a tuple (sentence index, word index) and holds a count.
# For example, if the word "best" is assigned the number 1,
# then "best" in the third sentence is stored at (2, 1), because indices start at 0
X = vectorizer.fit_transform(sentences)

# print the matrix shape (sentences, unique words), then the dense BoW matrix
print(X.shape)
print(X.toarray())

# each row of the output is one sentence: the first row is the first sentence

(4, 11)
[[0 1 0 1 0 1 1 1 0 0 0]
 [0 0 1 1 1 0 1 0 0 1 0]
 [1 1 0 0 0 0 0 1 1 1 0]
 [0 1 0 1 0 1 1 1 0 0 1]]


In [7]:
# Get the unique words
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['are', 'best', 'cool', 'data', 'has', 'is', 'science', 'the', 'these', 'topics', 'track']


## Clustering

- Clustering is an unsupervised learning method. A **cluster** is a group of data points that are grouped together due to similarities in their features

- Clustering is used very often **because we usually don’t have labeled data**

- **K-Means clustering** is a popular clustering algorithm: it finds a fixed number _(k)_ of clusters in a set of data.

- The goal of any clustering algorithm is to **find groups (clusters) in the given data**
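To make the idea of a fixed number of clusters concrete, here is a minimal sketch using sklearn's `KMeans` on made-up 2-D points. The data, `k = 2`, and the `n_init` and `random_state` settings are all assumptions chosen so that the example is small and repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two visibly separate groups of 2-D points
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.0], [8.0, 9.5],
])

# k is chosen up front; n_init and random_state make the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # one cluster id per point
print(kmeans.cluster_centers_)  # one center per cluster
```

The cluster ids themselves are arbitrary: what matters is which points share an id, not whether a group is called `0` or `1`.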

### Question: What are some use cases of clustering?

## Examples of Clustering

- Movie dataset clustering -> We expect movies with similar genres to be clustered in the same group

- News article clustering -> We want news related to science to be in one group and news related to sports to be in another
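The news example can be sketched by combining the two tools from this lesson: vectorize the text with `CountVectorizer`, then cluster the rows with `KMeans`. The headlines and the `true_topics` labels below are invented for illustration; real articles are longer and noisier, so results will rarely be this clean.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import adjusted_rand_score

# Made-up headlines: two about science, two about sports
headlines = [
    "scientists discover new planet",
    "scientists discover new star",
    "football team wins match",
    "football team loses match",
]
# Hypothetical ground-truth topics, used only to score the clustering
true_topics = [0, 0, 1, 1]

# Turn each headline into a row of word counts
X = CountVectorizer().fit_transform(headlines)

# Group the rows into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)
# A score of 1.0 means the clusters match the topics, up to renaming the ids
print(adjusted_rand_score(true_topics, labels))
```

`adjusted_rand_score` compares two groupings without caring which cluster id is which, which is why it suits unsupervised results.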

## Demo of K-means

In [None]:
from figures import plot_kmeans_interactive

plot_kmeans_interactive()