# Vectorization

**Objectives**
- Describe text vectorization
- Vectorize text with sklearn
- Compare count vectorization to TFIDF vectorization
- Customize vectorization with sklearn

### Vectorization


Machine learning models require numerical input, but text is inherently non-numerical. We need to transform our text data into a numerical format that machine learning algorithms can interpret. This process is known as text vectorization.

There are various methods for achieving this, ranging from simple techniques like counting the occurrences of each word in each document (as done by CountVectorizer), to more complex methods like TF-IDF (Term Frequency-Inverse Document Frequency), which weights the importance of terms based on their frequency in a document relative to their frequency in the entire corpus.

Beyond traditional methods like CountVectorizer and TfidfVectorizer, there are also more advanced techniques like word embeddings. Word embeddings are dense vectors that capture semantic meanings of words based on their context, and they often provide more nuanced representations than methods based solely on term frequency. While this lesson will focus on CountVectorizer and TfidfVectorizer, it's important to know that word embeddings offer another powerful tool for text vectorization, which we will cover in the following week.




### Imports, data

In [1]:
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(transform_output='pandas')

**Data**

We will write 4 sentences to serve as our sample data. In this example, each sentence is considered a "document," and the 4 sentences together are considered the "corpus."



### Count Vectorization

Count vectorization is one of the simplest methods of converting text data into vectors. It tokenizes the text and performs some very basic preprocessing. The CountVectorizer then counts the occurrences of each unique word in each document and constructs a document-term matrix (DTM) from these counts.

Conceptually:
1. Tokenize the text: Break down the text into words (or n-grams, if specified).
2. Build a vocabulary: Create a vocabulary of unique words from the entire corpus.
3. Generate vectors: For each document, count the occurrences of each word in the vocabulary and store it in a vector.
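
The three steps above can be sketched in plain Python. This is only an illustration, not sklearn's implementation: the `tokenize` helper is a hypothetical stand-in that lowercases the text and keeps runs of two or more word characters, which is assumed to approximate the default behavior described later in this lesson.

```python
import re
from collections import Counter

docs = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: tokenize every document
tokenized = [tokenize(doc) for doc in docs]

# Step 2: build the vocabulary (alphabetical order, term -> column index)
unique_terms = sorted({term for tokens in tokenized for term in tokens})
vocab = {term: index for index, term in enumerate(unique_terms)}

# Step 3: one count vector per document, one position per vocabulary term
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts.get(term, 0) for term in vocab])

# Counts of "programming" in each document
print([row[vocab["programming"]] for row in vectors])
```

Each row of `vectors` is one document and each position is one vocabulary term, which is the same layout as the document-term matrix that CountVectorizer produces.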


In [2]:
# Sample sentences
X = np.array([
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language."
])

We will start by instantiating a default version of the CountVectorizer. We will fit it on our sample data (X).

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
# instantiate a vectorizer
vectorizer = CountVectorizer()
# Fit it on the data 
vectorizer.fit(X)

After fitting, we can find the full vocabulary the vectorizer used by checking the .vocabulary_ attribute. It will return a dictionary of unique vocabulary words as the keys and an integer assigned to the word as the value.

In [4]:
# Save the vocabulary (one entry per column of the document-term matrix)
vocab_dict = vectorizer.vocabulary_
type(vocab_dict)

dict

In [5]:
# How many words in our vocab?
len(vocab_dict)

21

In [6]:
# This is a small dictionary, so we can display it here.
vocab_dict

{'enjoy': 3,
 'learning': 11,
 'new': 15,
 'programming': 16,
 'languages': 10,
 'the': 19,
 'best': 2,
 'is': 7,
 'python': 17,
 'so': 18,
 'fun': 5,
 'love': 13,
 'would': 20,
 'give': 6,
 'it': 8,
 'an': 1,
 'amazing': 0,
 'life': 12,
 'my': 14,
 'favorite': 4,
 'language': 9}

Note that the integer in the dictionary is not a count. It is the term's column index in the document-term matrix, assigned according to the alphabetical order of the words: "amazing" is assigned 0, while "would" is assigned 20.

We must transform the X data with the fitted vectorizer to obtain the count associated with each word.

In [7]:
# To obtain the count, transform the X data
X_count = vectorizer.transform(X)
type(X_count)

scipy.sparse._csr.csr_matrix

Vectorizers return sparse matrices to save memory. We can convert these to numpy arrays using the .toarray() method.
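
To see why a sparse format helps, we can compare the number of stored (non-zero) values with the total number of cells in the matrix. The sketch below refits a default CountVectorizer on the same four sample sentences so that it runs on its own; `n_cells` and `density` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

X = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

X_count = CountVectorizer().fit_transform(X)

# A sparse matrix stores only the non-zero entries
n_cells = X_count.shape[0] * X_count.shape[1]
density = X_count.nnz / n_cells

print(f"Non-zero values stored: {X_count.nnz}")
print(f"Total cells in the matrix: {n_cells}")
print(f"Density: {density:.2f}")
```

On a real corpus with thousands of terms, the share of non-zero cells is typically far smaller than in this toy example, so the savings grow accordingly.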

In [8]:
# Convert sparse matrix to array for display
X_count.toarray()

array([[0, 0, 1, 1, 0, 1, 0, 2, 0, 0, 1, 1, 0, 0, 0, 1, 2, 1, 1, 1, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 1, 0, 0, 3, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]],
      dtype=int64)

In [9]:
# Check the shape of the array
X_count.shape

(4, 21)

There are four rows, each corresponding to one of the original sentences. There are 21 columns, each corresponding to a unique word used anywhere in the corpus.

The counts are easier to interpret when displayed as a dataframe. We can name the columns using .get_feature_names_out().

In [10]:
# Make array into a df
X_count_df = pd.DataFrame(X_count.toarray(), columns= vectorizer.get_feature_names_out())
X_count_df

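|  | amazing | an | best | enjoy | favorite | fun | give | is | it | language | ... | learning | life | love | my | new | programming | python | so | the | would |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |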


Now, we can see the results of the Count Vectorizer. Let's examine the column for "programming." We see that the term "programming" occurs twice in the first sentence (index 0), once in the sentence at index 1, three times in the sentence at index 2, and once in the sentence at index 3.

We also find that, with the default CountVectorizer:
- The words were converted to lowercase.
- Tokens shorter than 2 characters were dropped (for example, "I" and the "A" in "A+").
- Stopwords were NOT removed.
- Punctuation WAS removed.
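
One way to check these defaults on a single sentence is the vectorizer's build_analyzer() method, which returns the function the vectorizer uses to turn raw text into tokens. The snippet below is a small sketch using the second sample sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenization function
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("I love programming, I would give it an A+!")
print(tokens)
```

The printed tokens are lowercase, contain no punctuation, omit the one-character tokens "I" and "A", and still include stopwords such as "it" and "an".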

We can control and customize the preprocessing performed by the CountVectorizer and have it perform more of our preprocessing simultaneously. Before we demonstrate the customization of the CountVectorizer, we will introduce an alternative method of vectorization.


### TF-IDF Vectorization

TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistic that reflects how important a word is to a document relative to a corpus. It considers the frequency of a word in a particular document and how unique the word is across all documents.

- Term Frequency (TF): This is simply the number of times a word appears in a document. (This is the same value given by the count vectorizer.)
- Inverse Document Frequency (IDF): A weight computed from two quantities:
    - n is the total number of documents
    - df is the number of documents that contain the term, t
    - With sklearn's default settings, idf(t) = ln((1 + n) / (1 + df)) + 1. This reduces the weight of terms that appear frequently across documents and increases the weight of terms that appear in fewer documents. (A higher weight is given to rarer words.)

By multiplying TF by IDF, each term in each document in the corpus ends up with a TF-IDF weight that represents its importance in that document relative to the entire corpus. By default, sklearn's TfidfVectorizer then scales each document's vector to unit length (L2 normalization), so the TF-IDF weight (sometimes called TF-IDF score) will range from 0 to 1.
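
To make the calculation concrete, here is a plain-Python sketch of TF-IDF under the assumptions that match sklearn's documented defaults: raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The `tokenize` and `tfidf` helpers are hypothetical names written for this illustration; they are not part of sklearn.

```python
import math
import re
from collections import Counter

docs = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]
n = len(docs)

# df: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens):
    # Raw TF-IDF: term count multiplied by the term's IDF
    raw = {term: count * idf[term] for term, count in Counter(tokens).items()}
    # L2 normalization: divide by the length of the document vector
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]
for index, doc_weights in enumerate(weights):
    print(index, round(doc_weights["programming"], 4))
```

Because "programming" appears in every document, its IDF is the minimum possible value of 1, so its final weight depends mostly on how often it occurs in a document relative to that document's other terms.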





### Using Sklearn's TfidfVectorizer

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer Example
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(X)
X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns= tfidf_vectorizer.get_feature_names_out())
X_tfidf_df.round(4)

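|  | amazing | an | best | enjoy | favorite | fun | give | is | it | language | ... | learning | life | love | my | new | programming | python | so | the | would |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.297 | 0.297 | 0.0 | 0.297 | 0.0 | 0.3791 | 0.0 | 0.0 | ... | 0.297 | 0.0 | 0.0 | 0.0 | 0.297 | 0.3099 | 0.2341 | 0.297 | 0.297 | 0.0 |
| 1 | 0.0 | 0.452 | 0.0 | 0.0 | 0.0 | 0.0 | 0.452 | 0.0 | 0.452 | 0.0 | ... | 0.0 | 0.0 | 0.3564 | 0.0 | 0.0 | 0.2359 | 0.0 | 0.0 | 0.0 | 0.452 |
| 2 | 0.3383 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6477 | 0.0 | 0.0 | ... | 0.0 | 0.3383 | 0.2667 | 0.0 | 0.0 | 0.5296 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4822 | 0.0 | 0.0 | 0.3078 | 0.0 | 0.4822 | ... | 0.0 | 0.0 | 0.0 | 0.4822 | 0.0 | 0.2516 | 0.3801 | 0.0 | 0.0 | 0.0 |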


Notice that the value for each term is no longer a simple count. Again, looking at "programming" as an example, we can see that it now has values calculated based on the frequency of the term in the document and within the corpus. Because each row is scaled to unit length by default, all TF-IDF scores fall between 0 and 1.
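
A quick way to confirm the scaling is to compute the length (L2 norm) of each row. The sketch below refits a default TfidfVectorizer on the sample sentences so it can run on its own; `row_norms` is an illustrative name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

X = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

X_tfidf = TfidfVectorizer().fit_transform(X)

# Euclidean length of each document vector (one value per row)
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)
print(row_norms.round(4))
```

With the default settings every row should have a length of 1, which, combined with non-negative counts, is why no single score can exceed 1.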

### Comparison: CountVectorizer vs TfidfVectorizer

![table](https://assets.codingdojo.com/boomyeah2015/codingdojo/curriculum/content/chapter/1698079860__Screenshot20231023125045.png)

- CountVectorizer: Simple word counts. Common words that appear in many documents could overshadow meaningful terms.
    - Use CountVectorizer when you want a simple representation and do not need to consider the importance of a term relative to the corpus.
- TfidfVectorizer: Weights the word counts by a measure of how many documents each word appears in, which helps to adjust for the frequency of words across all documents.
    - Use TfidfVectorizer when you want to determine important terms that are relevant in the context of the entire corpus.





### Preprocessing using sklearn's vectorizers

Either vectorizer can also perform additional preprocessing on the text data, such as:
- Eliminating stopwords
- Creating n-grams as well as single tokens
- Changing tokenization patterns (or using a custom function to tokenize)

### Removing stopwords

The CountVectorizer has a stop_words parameter that can be one of the following:
- None (default): no stopwords removed
- 'english' (keyword): removes the list of stopwords from `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS`
- a list of strings: removes exactly the words in the list


Here, we pass the stop_words argument when instantiating CountVectorizer.

In [12]:
# instantiate a vectorizer
vectorizer_stopped = CountVectorizer(stop_words='english')
# Fit it on the data 
X_vec = vectorizer_stopped.fit_transform(X)
X_stopped = pd.DataFrame(X_vec.toarray(), columns= vectorizer_stopped.get_feature_names_out())
X_stopped

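|  | amazing | best | enjoy | favorite | fun | language | languages | learning | life | love | new | programming | python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |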


We have reduced the number of words by eliminating stopwords.

In [13]:
# Comparing default vocab 
print(f"# of terms in original vocabulary: {len(vectorizer.vocabulary_)}")
print(f"# of terms in stopwords-removed vocabulary: {len(vectorizer_stopped.vocabulary_)}")

# of terms in original vocabulary: 21
# of terms in stopwords-removed vocabulary: 13


Some of the stopwords that were removed include "an," "is," and "it."
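
To list every term that was dropped, we can compare the two vocabularies with a set difference. The sketch below refits both vectorizers on the sample sentences so it is self-contained; `removed_terms` is an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

X = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

vectorizer = CountVectorizer().fit(X)
vectorizer_stopped = CountVectorizer(stop_words='english').fit(X)

# Terms in the default vocabulary that are missing after stopword removal
removed_terms = set(vectorizer.vocabulary_) - set(vectorizer_stopped.vocabulary_)
print(sorted(removed_terms))
```

Reviewing the removed terms is a useful habit, since a general-purpose stopword list may drop words that carry meaning in a particular corpus.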

Note that we can perform the same customization with the TFIDF vectorizer.

In [14]:
# instantiate a vectorizer
vectorizer_tfidf_stopped = TfidfVectorizer(stop_words='english')
# Fit it on the data 
X_vec_tfidf_stopped = vectorizer_tfidf_stopped.fit_transform(X)
X_stopped_tfidf = pd.DataFrame(X_vec_tfidf_stopped.toarray(), columns= vectorizer_tfidf_stopped.get_feature_names_out())
X_stopped_tfidf

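|  | amazing | best | enjoy | favorite | fun | language | languages | learning | life | love | new | programming | python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.360121 | 0.360121 | 0.0 | 0.360121 | 0.0 | 0.360121 | 0.360121 | 0.0 | 0.0 | 0.360121 | 0.375852 | 0.283924 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.833884 | 0.0 | 0.551939 | 0.0 |
| 2 | 0.444008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.444008 | 0.350061 | 0.0 | 0.695105 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.587838 | 0.0 | 0.587838 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.306758 | 0.463458 |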


Notice that we now have tfidf scores for only the "non-stopwords."

**Customized stopwords**

We can also customize the list of stopwords as we have done in previous lessons. For example, we will add "programming" to the list of English stop words.


In [15]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
custom_stopwords = [*ENGLISH_STOP_WORDS, 'programming']
# instantiate a vectorizer
vectorizer_stopped_custom = CountVectorizer(stop_words=custom_stopwords)
# Fit it on the data 
X_vec = vectorizer_stopped_custom.fit_transform(X)
X_stopped_custom = pd.DataFrame(X_vec.toarray(), columns= vectorizer_stopped_custom.get_feature_names_out())
X_stopped_custom

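|  | amazing | best | enjoy | favorite | fun | language | languages | learning | life | love | new | python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |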


### Changing Tokenization Method

The sklearn vectorizers use a built-in tokenizer by default, but we can control the tokenization method by using the tokenizer argument when we instantiate the vectorizer.

For example, we could use nltk's wordpunct_tokenize, which will keep punctuation as separate tokens.

In [16]:
from nltk import wordpunct_tokenize
# instantiate a vectorizer with english stopwords
vectorizer_nltk = CountVectorizer(stop_words='english',
                                  tokenizer=wordpunct_tokenize, token_pattern = None)
# Fit it on the data 
X_count_nltk = vectorizer_nltk.fit_transform(X)
# Getting the feature names (vocabulary)
X_count_nltk_df = pd.DataFrame(X_count_nltk.toarray(), columns= vectorizer_nltk.get_feature_names_out())
X_count_nltk_df

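|  | `!` | `+!` | `,` | `.` | amazing | best | enjoy | favorite | fun | language | languages | learning | life | love | new | programming | python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 1 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |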


We can customize the tokenizer used with the TF-IDF vectorizer using the same syntax.

In [17]:
from nltk import wordpunct_tokenize
# instantiate a TF-IDF vectorizer with english stopwords
vectorizer_tfidf_nltk = TfidfVectorizer(stop_words='english',
                                        tokenizer=wordpunct_tokenize, token_pattern=None)
# Fit it on the data
X_tfidf_nltk = vectorizer_tfidf_nltk.fit_transform(X)
# Name the columns with the feature names (vocabulary)
X_tfidf_nltk_df = pd.DataFrame(X_tfidf_nltk.toarray(), columns=vectorizer_tfidf_nltk.get_feature_names_out())
X_tfidf_nltk_df

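|  | `!` | `+!` | `,` | `.` | amazing | best | enjoy | favorite | fun | language | languages | learning | life | love | new | programming | python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.310978 | 0.0 | 0.0 | 0.396986 | 0.0 | 0.310978 | 0.310978 | 0.0 | 0.310978 | 0.0 | 0.310978 | 0.310978 | 0.0 | 0.0 | 0.310978 | 0.324562 | 0.245178 |
| 1 | 0.0 | 0.587838 | 0.587838 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.463458 | 0.0 | 0.306758 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.647743 | 0.338271 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.338271 | 0.266697 | 0.0 | 0.529572 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.351295 | 0.0 | 0.0 | 0.0 | 0.550372 | 0.0 | 0.550372 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.287207 | 0.433919 |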
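
Tokenization can also be changed without an external tokenizer by passing a different regular expression to the token_pattern argument. As a sketch, the pattern below keeps tokens of one or more word characters, so one-letter words such as "I" are no longer dropped; the variable name `vectorizer_short` is just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

X = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

# \w+ accepts tokens of length 1 or more (the default pattern requires 2 or more)
vectorizer_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vectorizer_short.fit(X)

print(len(vectorizer_short.vocabulary_))
print("i" in vectorizer_short.vocabulary_)
```

Keeping one-letter tokens can matter when single characters are meaningful, for example letter grades.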


### Ngrams

The vectorizers also have the option to extract not just single tokens but n-grams as well.

We specify this using the ngram_range argument, which is a tuple of the min and max number of words included in a token.

The default is ngram_range=(1,1) which indicates that only single words will be tokenized. If we specify ngram_range=(1,2), it will find bigrams as well.
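
As a rough illustration of what ngram_range means, the hypothetical helper below builds word n-grams from a list of tokens by sliding a window of each requested size over it. It is a sketch of the idea, not sklearn's own code. The token list is the fourth sample sentence after lowercasing and English stopword removal.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    # Hypothetical helper: join every run of n consecutive tokens, for each n in the range
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["python", "favorite", "programming", "language"]

unigrams_and_bigrams = word_ngrams(tokens, ngram_range=(1, 2))
print(unigrams_and_bigrams)
```

Each bigram is simply two neighboring tokens joined by a space, so a document with k tokens contributes up to k - 1 bigrams on top of its k single words.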

In [18]:
# instantiate a vectorizer to include bigrams
vectorizer_ngrams = CountVectorizer(stop_words='english', ngram_range=(1,2))
# Fit it on the data 
X_vec = vectorizer_ngrams.fit_transform(X)
X_ngrams = pd.DataFrame(X_vec.toarray(), columns= vectorizer_ngrams.get_feature_names_out())
X_ngrams

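|  | amazing | amazing programming | best | best python | enjoy | enjoy learning | favorite | favorite programming | fun | language | ... | programming | programming amazing | programming fun | programming language | programming languages | programming life | programming love | python | python favorite | python programming |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | ... | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |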


Notice that we have single words and bigrams.

We could include trigrams as well.

In [19]:
# instantiate a vectorizer to include bigrams and trigrams
vectorizer_ngrams = CountVectorizer(stop_words='english', ngram_range=(1,3))
# Fit it on the data 
X_vec = vectorizer_ngrams.fit_transform(X)
X_ngrams = pd.DataFrame(X_vec.toarray(), columns= vectorizer_ngrams.get_feature_names_out())
X_ngrams

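|  | amazing | amazing programming | amazing programming love | best | best python | best python programming | enjoy | enjoy learning | enjoy learning new | favorite | ... | programming languages | programming languages best | programming life | programming love | programming love programming | python | python favorite | python favorite programming | python programming | python programming fun |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |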


In [20]:
len(vectorizer_ngrams.vocabulary_)

42

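Note that the number of terms increases dramatically as n increases: with stopwords removed, our 13 single words grew to 42 terms once bigrams and trigrams were included. On a large corpus, be cautious about using n-grams larger than 2 to avoid running out of RAM.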
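
To watch the vocabulary grow, we can refit the vectorizer with an increasing upper bound for ngram_range and record the vocabulary size each time. This is a small sketch on the sample sentences; `sizes` is an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

X = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

sizes = {}
for max_n in (1, 2, 3):
    # Include all n-grams from single words up to max_n words
    vec = CountVectorizer(stop_words='english', ngram_range=(1, max_n)).fit(X)
    sizes[max_n] = len(vec.vocabulary_)

for max_n, size in sizes.items():
    print(f"ngram_range=(1,{max_n}): {size} terms")
```

Even on four short sentences the vocabulary roughly triples, which hints at how quickly the column count can grow on a full dataset.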

### Manually Controlling Vocabulary Size

Once we start using a full dataset, we may have thousands or hundreds of thousands of terms in our vocabulary.

We can artificially reduce the number of words included by changing the max_features argument. The vectorizer will keep only the top max_features terms, ranked by how often they occur across the corpus.

While this may improve the speed of our models, it may also lose much of the documents' original meaning, especially if we limit the number of features and do not remove stopwords (they are so common that they would out-compete the more meaningful terms in the corpus).

**max_features**

We can limit the size of our vectorized data by reducing the maximum number of words included in the vocabulary.

In [21]:
# instantiate a vectorizer
vectorizer_max10 = CountVectorizer(stop_words='english', max_features=10)
# Fit it on the data 
X_vec = vectorizer_max10.fit_transform(X)
X_max10 = pd.DataFrame(X_vec.toarray(), columns= vectorizer_max10.get_feature_names_out())
X_max10

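|  | amazing | best | enjoy | favorite | fun | language | languages | love | programming | python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |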


**max_df (maximum document frequency)**

When max_df is a float, any term that appears in more than that proportion of documents will be ignored. If an integer is used, it defines the maximum number of documents a term can appear in. Here we will set max_df to 0.5, which eliminates any terms that occur in more than half of the documents.


In [22]:
# instantiate a vectorizer
vectorizer_maxdf = CountVectorizer(stop_words='english', max_df = .5)
# Fit it on the data 
X_vec = vectorizer_maxdf.fit_transform(X)
X_maxdf = pd.DataFrame(X_vec.toarray(), columns= vectorizer_maxdf.get_feature_names_out())
X_maxdf

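|  | amazing | best | enjoy | favorite | fun | language | languages | learning | life | love | new | python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |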


Notice that "programming" does not appear because it is in more than 50% of the documents.

**min_df (minimum document frequency)**

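When min_df is a float, any term that appears in fewer than that proportion of documents will be ignored. If an integer is used, it defines the minimum number of documents a term must appear in. Here we will set min_df to 0.5, so only terms found in at least half of the documents are kept.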

In [23]:
# instantiate a vectorizer
vectorizer_mindf = CountVectorizer(stop_words='english', min_df = .5)
# Fit it on the data 
X_vec = vectorizer_mindf.fit_transform(X)
X_mindf = pd.DataFrame(X_vec.toarray(), columns= vectorizer_mindf.get_feature_names_out())
X_mindf

|  | love | programming | python |
| --- | --- | --- | --- |
| 0 | 0 | 2 | 1 |
| 1 | 1 | 1 | 0 |
| 2 | 1 | 3 | 0 |
| 3 | 0 | 1 | 1 |
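
Since our corpus has 4 documents, a proportion of 0.5 corresponds to 2 documents, so the integer form min_df=2 should select the same terms. The sketch below checks this on the sample sentences; the vectorizer names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

X = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

# Float: proportion of documents; integer: absolute number of documents
vectorizer_mindf_float = CountVectorizer(stop_words='english', min_df=0.5).fit(X)
vectorizer_mindf_int = CountVectorizer(stop_words='english', min_df=2).fit(X)

print(sorted(vectorizer_mindf_float.vocabulary_))
print(sorted(vectorizer_mindf_int.vocabulary_))
```

The proportion form adapts automatically as the corpus grows, while the integer form keeps the same absolute cutoff, so the two only agree for a fixed corpus size.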


### Summary

In this lesson, you've learned about two popular methods for text vectorization: CountVectorizer and TfidfVectorizer. CountVectorizer is simple and effective for transforming text into numerical vectors based on word frequency. TfidfVectorizer offers a more sophisticated method that considers the importance of each term relative to the entire corpus. Sklearn's vectorizers can be customized using similar syntax. Both are powerful tools in the field of text analytics and are essential for further tasks like text classification.