LIN 373 UT Austin :: Jessy Li

## Vectorizing categorical features

### Review: the inner workings

Let's encode the Naive Bayes example we used in class into the count table shown on the slides.

In [1]:
## Here's the data. Let's pretend these are grammatical sentences.
docs_train = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
             "Tokyo Japan Chinese"]
Y_train = [1, 1, 1, 0]

docs_test = ["Chinese Chinese Chinese Tokyo Japan"]

In [2]:
## first need to tokenize each document
docs_train_tokenized = [doc.split() for doc in docs_train]
print(docs_train_tokenized)

docs_test_tokenized = [doc.split() for doc in docs_test]
print(docs_test_tokenized)

[['Chinese', 'Beijing', 'Chinese'], ['Chinese', 'Chinese', 'Shanghai'], ['Chinese', 'Macao'], ['Tokyo', 'Japan', 'Chinese']]
[['Chinese', 'Chinese', 'Chinese', 'Tokyo', 'Japan']]


So how do we put words into a table?
First, we need to create that table. The rows are just the examples. But we need to come up with the columns.
We need to assign each word to a column number!
We do that by creating a dictionary to map from word to a unique column id:

In [3]:
word_to_col_id = {}
for doc in docs_train_tokenized:
    for word in doc:
        if word not in word_to_col_id:
            word_to_col_id[word] = len(word_to_col_id)
            
print(word_to_col_id)

{'Chinese': 0, 'Beijing': 1, 'Shanghai': 2, 'Macao': 3, 'Tokyo': 4, 'Japan': 5}


Now, we can make the table, and fill it up!

In [4]:
import numpy as np

X_train = np.zeros((len(docs_train), len(word_to_col_id)))
for i,doc in enumerate(docs_train_tokenized):
    for word in doc:
        col_id = word_to_col_id[word]
        X_train[i][col_id] += 1
print(X_train)

[[2. 1. 0. 0. 0. 0.]
 [2. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1. 1.]]


We do the same for the test docs, reusing the column ids we built from the training data.

In [5]:
X_test = np.zeros((len(docs_test), len(word_to_col_id)))
for i,doc in enumerate(docs_test_tokenized):
    for word in doc:
        col_id = word_to_col_id[word]
        X_test[i][col_id] += 1
print(X_test)

[[3. 0. 0. 0. 1. 1.]]
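
The loop above works because every test word also appears in training. A test document with a new word would raise a `KeyError`. Here is a minimal sketch of one way to handle that, assuming we simply skip unseen words; the document `"Chinese Osaka Tokyo"` and the name `X_unseen` are made up for illustration.

```python
import numpy as np

docs_train = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
              "Tokyo Japan Chinese"]

# Build the word -> column id mapping from the training data only.
word_to_col_id = {}
for doc in docs_train:
    for word in doc.split():
        if word not in word_to_col_id:
            word_to_col_id[word] = len(word_to_col_id)

# Hypothetical test document with a word ("Osaka") never seen in training.
docs_unseen = ["Chinese Osaka Tokyo"]

X_unseen = np.zeros((len(docs_unseen), len(word_to_col_id)))
for i, doc in enumerate(docs_unseen):
    for word in doc.split():
        col_id = word_to_col_id.get(word)  # None if the word has no column
        if col_id is not None:
            X_unseen[i, col_id] += 1       # unseen words are skipped
print(X_unseen)
```

Skipping is only one choice; another common option is to reserve an extra "unknown" column.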


## Class 2: adding features

What if we would like to add some new features? Say, the author of a document:
```
authors_train = ['Cao', 'Wang', 'Cao', 'Hirao']
authors_test = ['Cao']
```
What are the high-level steps we should take?

In [6]:
authors_train = ['Cao', 'Wang', 'Cao', 'Hirao']
authors_test = ['Cao']

## First, let's vectorize that feature!

## just one way to do this: create a new matrix for authors
author_to_col_id = {}
for author in authors_train:
    if author not in author_to_col_id:
        author_to_col_id[author] = len(author_to_col_id)
print(author_to_col_id)

{'Cao': 0, 'Wang': 1, 'Hirao': 2}


In [7]:
X_train_authors = np.zeros((len(docs_train), len(author_to_col_id)))
for i, author in enumerate(authors_train):
    col_id = author_to_col_id[author]
    X_train_authors[i][col_id] += 1
print(X_train_authors)

[[1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]


In [8]:
## now merge the features
X_train = np.hstack((X_train, X_train_authors))
print(X_train)

[[2. 1. 0. 0. 0. 0. 1. 0. 0.]
 [2. 0. 1. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1. 1. 0. 0. 1.]]


In [9]:
## now do the same for test
X_test_authors = np.zeros((len(docs_test), len(author_to_col_id)))
for i, author in enumerate(authors_test):
    col_id = author_to_col_id[author]
    X_test_authors[i][col_id] += 1
    
X_test = np.hstack((X_test, X_test_authors))
print("X_test", X_test)

X_test [[3. 0. 0. 0. 1. 1. 1. 0. 0.]]
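
The word and author code follow the same two steps: build a column-id dictionary, then fill a table. A rough sketch of how those steps could be factored into helpers is below; `build_col_ids` and `count_matrix` are hypothetical names, not functions from class or from any library.

```python
import numpy as np

def build_col_ids(items):
    """Map each distinct item to a column id, in order of first appearance."""
    col_ids = {}
    for item in items:
        if item not in col_ids:
            col_ids[item] = len(col_ids)
    return col_ids

def count_matrix(rows, col_ids):
    """rows is a list of lists of items; returns a count table."""
    X = np.zeros((len(rows), len(col_ids)))
    for i, row in enumerate(rows):
        for item in row:
            if item in col_ids:          # skip items without a column
                X[i, col_ids[item]] += 1
    return X

docs_train = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
              "Tokyo Japan Chinese"]
authors_train = ['Cao', 'Wang', 'Cao', 'Hirao']

tokens = [doc.split() for doc in docs_train]
word_ids = build_col_ids(word for doc in tokens for word in doc)
author_ids = build_col_ids(authors_train)

X_words = count_matrix(tokens, word_ids)
X_authors = count_matrix([[a] for a in authors_train], author_ids)

# Same merge step as above: word columns first, then author columns.
X_train_manual = np.hstack((X_words, X_authors))
print(X_train_manual)
```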


### Using a tool

**Review**: We saw that sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) can tokenize and vectorize raw text.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorizer = vectorizer.fit_transform(docs_train)
X_test_vectorizer = vectorizer.transform(docs_test)

In [11]:
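## note: scikit-learn 1.0 added get_feature_names_out() and deprecated
## get_feature_names(), which was removed in 1.2; use the new name on recent versions
print(vectorizer.get_feature_names())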

['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
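
Notice that the column order differs from our hand-built dictionary: the words are lowercased and sorted alphabetically. A small sketch for inspecting this through the fitted `vocabulary_` attribute, and for turning lowercasing off with `lowercase=False`, is below (the variable name `cased` is just for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs_train = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
              "Tokyo Japan Chinese"]

vectorizer = CountVectorizer()
vectorizer.fit(docs_train)

# vocabulary_ plays the role of our word_to_col_id dictionary:
# it maps each (lowercased) token to its column index.
print(sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))

# Keep the original casing instead of lowercasing.
cased = CountVectorizer(lowercase=False).fit(docs_train)
print(sorted(cased.vocabulary_, key=cased.vocabulary_.get))
```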


In [12]:
print(X_train_vectorizer.toarray())

[[1 2 0 0 0 0]
 [0 2 0 0 1 0]
 [0 1 0 1 0 0]
 [0 1 1 0 0 1]]


In [13]:
## vectorizer uses sparse encoding
print(X_train_vectorizer)

  (0, 1)	2
  (0, 0)	1
  (1, 1)	2
  (1, 4)	1
  (2, 1)	1
  (2, 3)	1
  (3, 1)	1
  (3, 5)	1
  (3, 2)	1
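
Each line of that printout is a `(row, column)` position followed by its value, and only the non-zero cells are listed. To see the idea in isolation, here is a small sketch that converts the dense table from above into a SciPy sparse matrix and back; the names `dense` and `X_sparse` are ours.

```python
import numpy as np
from scipy import sparse

# The dense count table printed earlier.
dense = np.array([[1, 2, 0, 0, 0, 0],
                  [0, 2, 0, 0, 1, 0],
                  [0, 1, 0, 1, 0, 0],
                  [0, 1, 1, 0, 0, 1]])

X_sparse = sparse.csr_matrix(dense)

# Only the non-zero cells are stored.
print(X_sparse.nnz, "of", dense.size, "cells are stored")

# List the stored cells as (row, column) value.
rows, cols = X_sparse.nonzero()
for r, c in zip(rows, cols):
    print((int(r), int(c)), dense[r, c])

# Converting back gives the original table.
print((X_sparse.toarray() == dense).all())
```

With a real vocabulary of tens of thousands of words, almost every cell is zero, which is why the sparse encoding saves so much memory.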


**Today**, we look at sklearn's [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), which vectorizes categorical features -- like author info!

In [14]:
from sklearn.preprocessing import OneHotEncoder
author_enc = OneHotEncoder()
X_train_authors = author_enc.fit_transform(authors_train)
print(X_train_authors)

ValueError: Expected 2D array, got 1D array instead:
array=['Cao' 'Wang' 'Cao' 'Hirao'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [16]:
## note that we first need to convert this from a list to a list of lists,
## where each inner list (row) stands for one example

X_train_authors = author_enc.fit_transform([[a] for a in authors_train])
print(X_train_authors)

  (0, 0)	1.0
  (1, 2)	1.0
  (2, 0)	1.0
  (3, 1)	1.0
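
Here 'Wang' lands in column 2, whereas our hand-built dictionary gave 'Wang' column 1. The encoder sorts the categories it finds rather than using order of first appearance. A short sketch for checking this through the fitted `categories_` attribute (the names `enc` and `X` are ours):

```python
from sklearn.preprocessing import OneHotEncoder

authors_train = ['Cao', 'Wang', 'Cao', 'Hirao']

enc = OneHotEncoder()
X = enc.fit_transform([[a] for a in authors_train])

# One array of categories per input column, in column order.
print(enc.categories_)

# Dense view of the same one-hot table.
print(X.toarray())
```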


In [19]:
## here's the original info in training
author_enc.inverse_transform(X_train_authors)

array([['Cao'],
       ['Wang'],
       ['Cao'],
       ['Hirao']], dtype=object)

In [21]:
## we transform the test data too, reusing the encoder fit on training (no refitting!)
X_test_authors = author_enc.transform([[a] for a in authors_test])
print(X_test_authors)

  (0, 0)	1.0
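
What if a test author never appeared in training? By default `OneHotEncoder` raises an error in that case. A minimal sketch of one alternative, assuming we are fine with encoding unknown authors as all zeros via `handle_unknown="ignore"`; the author `'Li'` is made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

authors_train = ['Cao', 'Wang', 'Cao', 'Hirao']

# handle_unknown="ignore" encodes unseen categories as an all-zero row
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([[a] for a in authors_train])

# 'Li' is a hypothetical author that never appears in training.
X_new = enc.transform([["Cao"], ["Li"]]).toarray()
print(X_new)
```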


Now, we're ready to put everything together. Specifically, we have learnt how to create two vectorizers that hold our features, one from the actual text, one using the author information. So how do we create a single vectorizer that includes both?

We use the [ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces). 

In [23]:
## first, create a single structure for both documents and authors
import pandas as pd

train_all = pd.DataFrame({"words": docs_train, "authors": authors_train})
train_all.head()

| | words | authors |
|---|---|---|
| 0 | Chinese Beijing Chinese | Cao |
| 1 | Chinese Chinese Shanghai | Wang |
| 2 | Chinese Macao | Cao |
| 3 | Tokyo Japan Chinese | Hirao |


In [24]:
test_all = pd.DataFrame({"words": docs_test, "authors": authors_test})
test_all.head()

| | words | authors |
|---|---|---|
| 0 | Chinese Chinese Chinese Tokyo Japan | Cao |


In [25]:
## using the ColumnTransformer
from sklearn.compose import ColumnTransformer
column_trans = ColumnTransformer(
    [("words", CountVectorizer(), "words"),
    ("author", OneHotEncoder(), "authors")]
)
    
X_train = column_trans.fit_transform(train_all)
print(X_train)

ValueError: 1D data passed to a transformer that expects 2D data. Try to specify the column selection as a list of one item instead of a scalar.

The `CountVectorizer` expects a 1D array as input, so its column is specified as a string (`'words'`). However, `preprocessing.OneHotEncoder`, like most other transformers, expects 2D data, so in that case you need to specify the column as a list of strings (`['authors']`).

In [26]:
column_trans = ColumnTransformer(
    [("words", CountVectorizer(), "words"),
     ("author", OneHotEncoder(), ["authors"])]
)
    
X_train = column_trans.fit_transform(train_all)
print(X_train)

[[1. 2. 0. 0. 0. 0. 1. 0. 0.]
 [0. 2. 0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 1. 0. 0.]
 [0. 1. 1. 0. 0. 1. 0. 1. 0.]]


In [28]:
X_test = column_trans.transform(test_all)
print(X_test)

[[0. 3. 1. 0. 0. 1. 1. 0. 0.]]
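
The first six columns are the word counts and the last three are the one-hot author columns. In other words, the `ColumnTransformer` is doing the same "vectorize each feature, then stack side by side" that we did by hand. The sketch below checks this on our toy data; `to_dense` is a small hypothetical helper, needed because the transformer may return either a sparse or a dense matrix.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

train_all = pd.DataFrame({
    "words": ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
              "Chinese Macao", "Tokyo Japan Chinese"],
    "authors": ["Cao", "Wang", "Cao", "Hirao"],
})

def to_dense(X):
    """Return a dense array whether X is sparse or already dense."""
    return X.toarray() if sparse.issparse(X) else X

# Vectorize each feature separately, then stack the columns side by side.
X_words = CountVectorizer().fit_transform(train_all["words"])
X_authors = OneHotEncoder().fit_transform(train_all[["authors"]])
X_stacked = sparse.hstack([X_words, X_authors]).toarray()

# Let the ColumnTransformer do both steps at once.
column_trans = ColumnTransformer(
    [("words", CountVectorizer(), "words"),
     ("author", OneHotEncoder(), ["authors"])]
)
X_ct = to_dense(column_trans.fit_transform(train_all))

print((X_stacked == X_ct).all())
```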


## We still run Naive Bayes as usual

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [29]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [30]:
print(model.predict(X_test))

[1]
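
Vectorizing and classifying can also be chained into one object, so that `fit` and `predict` take the DataFrame directly. This is a sketch using sklearn's `make_pipeline`, which we have not covered above; the variable name `pipe` is ours.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train_all = pd.DataFrame({
    "words": ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
              "Chinese Macao", "Tokyo Japan Chinese"],
    "authors": ["Cao", "Wang", "Cao", "Hirao"],
})
test_all = pd.DataFrame({
    "words": ["Chinese Chinese Chinese Tokyo Japan"],
    "authors": ["Cao"],
})
Y_train = [1, 1, 1, 0]

column_trans = ColumnTransformer(
    [("words", CountVectorizer(), "words"),
     ("author", OneHotEncoder(), ["authors"])]
)

# The pipeline first vectorizes, then hands the matrix to Naive Bayes.
pipe = make_pipeline(column_trans, MultinomialNB())
pipe.fit(train_all, Y_train)
print(pipe.predict(test_all))
```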


## Adding continuous features

What if we want to add continuous features, say, the average length of words in each document?<br />
How would we do it from scratch?
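
One from-scratch answer: a continuous feature needs no dictionary. It is already a number, so it becomes a single extra column that we stack onto the count table, just like we stacked the author columns. A minimal sketch, with the word-count table from the review section typed in by hand (the names `X_counts` and `X_with_len` are ours):

```python
import numpy as np

docs_train = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
              "Tokyo Japan Chinese"]

# Word-count table from the review section, copied here by hand.
X_counts = np.array([[2., 1., 0., 0., 0., 0.],
                     [2., 0., 1., 0., 0., 0.],
                     [1., 0., 0., 1., 0., 0.],
                     [1., 0., 0., 0., 1., 1.]])

# One real number per document: the average word length.
wl = np.array([np.mean([len(word) for word in doc.split()])
               for doc in docs_train])

# Turn the 1D array into a single column and append it to the table.
X_with_len = np.hstack((X_counts, wl.reshape(-1, 1)))
print(X_with_len)
```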

OK, now that we understand the inner workings, let's use tools, which are usually much more reliable than custom implementations.

First, get the average word lengths!

In [31]:
def avg_word_len(doc):
    return np.mean([len(word) for word in doc.split()])

wl_train = [avg_word_len(doc) for doc in docs_train]
print(wl_train)

wl_test = [avg_word_len(doc) for doc in docs_test]
print(wl_test)

[7.0, 7.333333333333333, 6.0, 5.666666666666667]
[6.2]


Now, we update our Pandas DataFrame to put all info together again:

In [32]:
train_all = train_all.join(pd.DataFrame({"avg_word_len": wl_train}))
train_all.head()

| | words | authors | avg_word_len |
|---|---|---|---|
| 0 | Chinese Beijing Chinese | Cao | 7.0 |
| 1 | Chinese Chinese Shanghai | Wang | 7.333333 |
| 2 | Chinese Macao | Cao | 6.0 |
| 3 | Tokyo Japan Chinese | Hirao | 5.666667 |


In [33]:
test_all = test_all.join(pd.DataFrame({"avg_word_len": wl_test}))
test_all.head()

| | words | authors | avg_word_len |
|---|---|---|---|
| 0 | Chinese Chinese Chinese Tokyo Japan | Cao | 6.2 |


Next, we join features with the ColumnTransformer. Note that since avg_word_len is a *continuous* feature, i.e., not a categorical feature but a real-valued one, we don't need to vectorize it. So, we tell our ColumnTransformer to leave it alone and copy it into the output unchanged. Essentially, we *only* tell ColumnTransformer which features to vectorize and how, and set `remainder='passthrough'` so that every other column is appended as-is (by default, the remaining columns would be dropped).

In [35]:
column_trans = ColumnTransformer(
    [("words", CountVectorizer(), "words"),
     ("author", OneHotEncoder(), ["authors"])],
    remainder="passthrough"
)
    
X_train = column_trans.fit_transform(train_all)
print(X_train)

[[1.         2.         0.         0.         0.         0.
  1.         0.         0.         7.        ]
 [0.         2.         0.         0.         1.         0.
  0.         0.         1.         7.33333333]
 [0.         1.         0.         1.         0.         0.
  1.         0.         0.         6.        ]
 [0.         1.         1.         0.         0.         1.
  0.         1.         0.         5.66666667]]


In [36]:
X_test = column_trans.transform(test_all)
print(X_test)

[[0.  3.  1.  0.  0.  1.  1.  0.  0.  6.2]]


## Discussion

Can MultinomialNB be used with continuous features?

What should we do instead?

If *all* your features are continuous, you can use the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) classifier, which assumes that each feature follows a Gaussian (normal) distribution within each class.
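
As a rough sketch, here is GaussianNB trained on the average word length alone, with the values from above typed in by hand. With only four training points (and a single example of class 0), this only shows the mechanics, not a meaningful model.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Average word lengths computed earlier, as a single-column table.
wl_train = np.array([7.0, 7.333333333333333, 6.0, 5.666666666666667]).reshape(-1, 1)
wl_test = np.array([[6.2]])
Y_train = [1, 1, 1, 0]

gnb = GaussianNB()
gnb.fit(wl_train, Y_train)

# theta_ holds the per-class mean of each feature (one row per class).
print(gnb.theta_)
print(gnb.predict(wl_test))
```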

However, our data is a *mixture* of continuous and categorical variables. We can do one of three things:

(1) transform our continuous data into categories, aka, bin them!

(2) model ensembling: build a MultinomialNB on categorical features ONLY, build a GaussianNB on the continuous features ONLY, then build another model on top.

(3) Use logistic regression!
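
To make option (1) concrete, here is a minimal sketch that bins avg_word_len with sklearn's `KBinsDiscretizer` inside the ColumnTransformer, so that MultinomialNB only ever sees counts and one-hot columns. The choice of 2 equal-width bins is an arbitrary assumption for this tiny dataset, and the names `binned_trans` and `model` are ours.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

train_all = pd.DataFrame({
    "words": ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
              "Chinese Macao", "Tokyo Japan Chinese"],
    "authors": ["Cao", "Wang", "Cao", "Hirao"],
    "avg_word_len": [7.0, 7.333333333333333, 6.0, 5.666666666666667],
})
test_all = pd.DataFrame({
    "words": ["Chinese Chinese Chinese Tokyo Japan"],
    "authors": ["Cao"],
    "avg_word_len": [6.2],
})
Y_train = [1, 1, 1, 0]

binned_trans = ColumnTransformer(
    [("words", CountVectorizer(), "words"),
     ("author", OneHotEncoder(), ["authors"]),
     # 2 equal-width bins, one-hot encoded (an arbitrary choice here)
     ("length", KBinsDiscretizer(n_bins=2, encode="onehot", strategy="uniform"),
      ["avg_word_len"])]
)

model = make_pipeline(binned_trans, MultinomialNB())
model.fit(train_all, Y_train)
print(model.predict(test_all))
```

The continuous column is replaced by two "short words" / "long words" indicator columns, so the table has 6 word columns, 3 author columns, and 2 length-bin columns.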