LIN 373 UT Austin :: Jessy Li

## Vectorizing categorical features

### The inner workings

Let's encode the Naive Bayes example we used in class into the count table shown on the slides.

In [1]:
## Here's the data. Let's pretend these are grammatical sentences.
docs_train = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
              "Tokyo Japan Chinese"]
Y_train = [1, 1, 1, 0]

docs_test = ["Chinese Chinese Chinese Tokyo Japan"]

In [2]:
## first, we need to tokenize each document (here: split on whitespace)
docs_train_tokenized = [doc.split() for doc in docs_train]
print(docs_train_tokenized)

docs_test_tokenized = [doc.split() for doc in docs_test]
print(docs_test_tokenized)

[['Chinese', 'Beijing', 'Chinese'], ['Chinese', 'Chinese', 'Shanghai'], ['Chinese', 'Macao'], ['Tokyo', 'Japan', 'Chinese']]
[['Chinese', 'Chinese', 'Chinese', 'Tokyo', 'Japan']]


So how do we put words into a table?
The rows are just the examples, one per document. But we still need to come up with the columns.
Each distinct word in the training docs gets its own column number!
We do that by creating a dictionary that maps each word to a unique column id:

In [3]:
word_to_col_id = {}
for doc in docs_train_tokenized:
    for word in doc:
        if word not in word_to_col_id:
            word_to_col_id[word] = len(word_to_col_id)
            
print(word_to_col_id)

{'Chinese': 0, 'Beijing': 1, 'Shanghai': 2, 'Macao': 3, 'Tokyo': 4, 'Japan': 5}


Now, we can make the table, and fill it up!

In [5]:
import numpy as np

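## one row per document, one column per vocabulary word; all counts start at 0
X_train = np.zeros((len(docs_train), len(word_to_col_id)))
for i, doc in enumerate(docs_train_tokenized):
    for word in doc:
        col_id = word_to_col_id[word]
        X_train[i, col_id] += 1  ## count this occurrence of the word
print(X_train)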

[[2. 1. 0. 0. 0. 0.]
 [2. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1. 1.]]


We do the same for the test docs, reusing the column ids we built from the training docs so that the two tables line up.

In [6]:
X_test = np.zeros((len(docs_test), len(word_to_col_id)))
for i, doc in enumerate(docs_test_tokenized):
    for word in doc:
        col_id = word_to_col_id[word]
        X_test[i, col_id] += 1
print(X_test)

[[3. 0. 0. 0. 1. 1.]]


What if there is a new word in the testing docs?
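
One possible answer, as a sketch: with the loop above, a new word has no entry in `word_to_col_id`, so the lookup fails with a `KeyError`. A simple way out is to skip words that were never seen in training. The helper name `vectorize` and the example word `Seoul` are made up for illustration; they are not from the lecture.

```python
# Vocabulary built from the training docs, copied from the output above.
word_to_col_id = {'Chinese': 0, 'Beijing': 1, 'Shanghai': 2,
                  'Macao': 3, 'Tokyo': 4, 'Japan': 5}

def vectorize(tokens, word_to_col_id):
    """Count known words; skip words that have no column (hypothetical helper)."""
    row = [0] * len(word_to_col_id)
    for word in tokens:
        col_id = word_to_col_id.get(word)  # None for words unseen in training
        if col_id is not None:
            row[col_id] += 1
    return row

# 'Seoul' never appeared in training, so it is dropped from the counts.
new_doc = "Chinese Seoul Tokyo".split()
print(vectorize(new_doc, word_to_col_id))
```

Skipping is only one choice. Another common one is to reserve an extra column for all unknown words.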


### Using a tool

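All of this is implemented by sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), which does both the tokenization and the corpus-to-table transformation. By default it also lowercases the text and orders the columns alphabetically, so its columns differ from the ones we built by hand.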

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorizer = vectorizer.fit_transform(docs_train)

In [11]:
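## get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())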

['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']


In [12]:
print(X_train_vectorizer.toarray())

[[1 2 0 0 0 0]
 [0 2 0 0 1 0]
 [0 1 0 1 0 0]
 [0 1 1 0 0 1]]


In [13]:
## the vectorizer returns a sparse matrix: only the nonzero (row, column) counts are stored
print(X_train_vectorizer)

  (0, 1)	2
  (0, 0)	1
  (1, 1)	2
  (1, 4)	1
  (2, 1)	1
  (2, 3)	1
  (3, 1)	1
  (3, 5)	1
  (3, 2)	1
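
The cells above only vectorize the training docs. Here is a sketch of how the test docs could be handled with the same vectorizer: `transform` (rather than `fit_transform`) reuses the vocabulary learned from the training docs, and words outside that vocabulary are ignored. The document containing `Seoul` is a made-up example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs_train = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
              "Tokyo Japan Chinese"]
docs_test = ["Chinese Chinese Chinese Tokyo Japan"]

vectorizer = CountVectorizer()
X_train_vectorizer = vectorizer.fit_transform(docs_train)

# vocabulary_ plays the role of word_to_col_id: lowercased token -> column id
print(vectorizer.vocabulary_)

# transform() keeps the training columns; it does not learn new ones
X_test_vectorizer = vectorizer.transform(docs_test)
print(X_test_vectorizer.toarray())

# 'Seoul' is not in the vocabulary, so it contributes nothing to the row
X_new = vectorizer.transform(["Chinese Seoul Tokyo"])
print(X_new.toarray())
```

Because the column order here is alphabetical, a model should be trained and tested on tables from the same source: either both hand-built or both from the vectorizer.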


## Now: Naive Bayes

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [14]:
from sklearn.naive_bayes import MultinomialNB

## fit on the hand-built count table, whose columns match X_test
model = MultinomialNB()
model.fit(X_train, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
print(model.predict(X_test))

[1]
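
To see where this prediction comes from, here is a sketch that redoes the calculation by hand with add-one smoothing (`alpha = 1`, the default shown above). It assumes the standard multinomial Naive Bayes estimates: the class prior is the fraction of training docs with that label, and each word probability is (count + alpha) / (total words in the class + alpha * vocabulary size). The function name `score` is made up for illustration.

```python
from fractions import Fraction

docs_train = [d.split() for d in ["Chinese Beijing Chinese",
                                  "Chinese Chinese Shanghai",
                                  "Chinese Macao",
                                  "Tokyo Japan Chinese"]]
Y_train = [1, 1, 1, 0]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

# 6 distinct words in the training docs
vocab = sorted({w for doc in docs_train for w in doc})

def score(label, alpha=1):
    """Unnormalized P(label) * product of P(word | label) over test_doc."""
    docs = [d for d, y in zip(docs_train, Y_train) if y == label]
    words = [w for d in docs for w in d]          # all tokens in this class
    prior = Fraction(len(docs), len(docs_train))
    denom = len(words) + alpha * len(vocab)
    result = prior
    for w in test_doc:
        result *= Fraction(words.count(w) + alpha, denom)
    return result

scores = {label: score(label) for label in (0, 1)}
print(scores)
print(max(scores, key=scores.get))  # label with the larger score
```

Class 1 gets 3/4 * (3/7)^3 * (1/14) * (1/14), about 0.0003, and class 0 gets 1/4 * (2/9)^5, about 0.0001, so class 1 wins, in agreement with the model's prediction.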


What if we set alpha (the additive smoothing parameter, 1.0 by default) to 0?

In [17]:
model2 = MultinomialNB(alpha=0)
model2.fit(X_train, Y_train)
print(model2.predict(X_test))

[0]


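sklearn also prints a warning for this cell: in the version used here, an alpha this small is replaced by a tiny positive minimum (`_ALPHA_MIN`) to avoid numeric errors.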
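
Why does the prediction flip? A sketch of the unsmoothed calculation, assuming plain relative-frequency estimates (count / total words in the class): "Tokyo" and "Japan" never occur in a class-1 training doc, so their estimated probability is 0 and the whole product for class 1 collapses. The function name `likelihood` is made up for illustration.

```python
from fractions import Fraction

# All tokens from the three training docs labeled 1, and from the one labeled 0.
class1_words = "Chinese Beijing Chinese Chinese Chinese Shanghai Chinese Macao".split()
class0_words = "Tokyo Japan Chinese".split()
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

def likelihood(words, doc):
    """Product of unsmoothed estimates count(w) / len(words) over the doc."""
    result = Fraction(1)
    for w in doc:
        result *= Fraction(words.count(w), len(words))
    return result

score1 = Fraction(3, 4) * likelihood(class1_words, test_doc)  # prior 3/4
score0 = Fraction(1, 4) * likelihood(class0_words, test_doc)  # prior 1/4
print(score1, score0)
```

Class 1 scores exactly 0 here, while class 0 scores 1/4 * (1/3)^5 = 1/972. Whether sklearn uses exactly 0 or a tiny positive value for alpha depends on the version, but either way the class-1 score is driven to (almost) nothing, so the model predicts 0.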
