# Lesson: Vectorization

## Imports

In [3]:
# packages

import numpy as np
import pandas as pd
from sklearn import set_config
set_config(transform_output='pandas')

In [4]:
# data (each sentence is considered a doc)

X = np.array([
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language."
])


## Count Vectorization

1) Tokenize
2) Build vocabulary
3) Generate vectors (frequency)
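
The three steps above can be sketched by hand before reaching for scikit-learn. This is a minimal illustration, not how `CountVectorizer` is implemented: the names `docs`, `token_re`, `vocab`, and `vectors` are made up for this example, and the regular expression is an assumption chosen to mimic the default behaviour of lowercasing and keeping tokens of two or more characters.

```python
import re
from collections import Counter

docs = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

# 1) Tokenize: lowercase, then keep runs of two or more word characters
token_re = re.compile(r"\b\w\w+\b")
tokenized = [token_re.findall(doc.lower()) for doc in docs]

# 2) Build vocabulary: map each unique token to a column index (alphabetical order)
unique_tokens = sorted({tok for toks in tokenized for tok in toks})
vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}

# 3) Generate vectors: one row per document, one count per vocabulary entry
vectors = []
for toks in tokenized:
    counts = Counter(toks)  # missing tokens count as 0
    vectors.append([counts[tok] for tok in vocab])

print(len(vocab))
print(vectors[2])
```

Each row of `vectors` has one entry per vocabulary word, so every document ends up with the same length regardless of how many words it contains.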

In [7]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# instantiate a vectorizer
vectorizer = CountVectorizer()

# Fit it on the data 
vectorizer.fit(X)

In [8]:
# The fitted vocabulary is stored as a dict mapping each token to its column index
vocab_dict = vectorizer.vocabulary_
type(vocab_dict)

dict

In [12]:
# indices follow the alphabetical order of the tokens, e.g. 'amazing' = 0, 'an' = 1, etc.
vocab_dict

{'enjoy': 3,
 'learning': 11,
 'new': 15,
 'programming': 16,
 'languages': 10,
 'the': 19,
 'best': 2,
 'is': 7,
 'python': 17,
 'so': 18,
 'fun': 5,
 'love': 13,
 'would': 20,
 'give': 6,
 'it': 8,
 'an': 1,
 'amazing': 0,
 'life': 12,
 'my': 14,
 'favorite': 4,
 'language': 9}

In [11]:
# check count of unique words in vocabulary
len(vocab_dict)

21

In [13]:
# To obtain the count, transform the X data
X_count = vectorizer.transform(X)
type(X_count)

scipy.sparse._csr.csr_matrix

In [14]:
# Convert sparse matrix to array for display
X_count.toarray()

array([[0, 0, 1, 1, 0, 1, 0, 2, 0, 0, 1, 1, 0, 0, 0, 1, 2, 1, 1, 1, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 1, 0, 0, 3, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]],
      dtype=int64)

In [15]:
# Check the shape of the array
X_count.shape

(4, 21)
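
The result is stored as a sparse matrix because most entries are zero: each document only uses a fraction of the vocabulary. A small sketch of how to check this, assuming the same four sentences (`docs`, `n_stored`, and `density` are names made up for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

X_count = CountVectorizer().fit_transform(docs)

# nnz is the number of explicitly stored (non-zero) entries
n_stored = X_count.nnz
n_rows, n_cols = X_count.shape

# Fraction of cells that actually hold a count
density = n_stored / (n_rows * n_cols)

print(n_stored, n_rows * n_cols, round(density, 3))
```

On a toy corpus the saving is small, but with thousands of documents and a large vocabulary the dense array can become far larger than the sparse one.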

In [16]:
# Convert the array into a DataFrame, using the vocabulary as column names
X_count_df = pd.DataFrame(X_count.toarray(), columns=vectorizer.get_feature_names_out())
X_count_df

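| | amazing | an | best | enjoy | favorite | fun | give | is | it | language | ... | learning | life | love | my | new | programming | python | so | the | would |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |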


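Using the default CountVectorizer settings resulted in the following:

All words were converted to lowercase.

Single-character tokens were dropped ("I", and the "A" in "A+"), because the default token pattern only keeps tokens of two or more alphanumeric characters.

Stopwords were not removed.

Punctuation was removed, since it is treated as a token separator.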
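
These defaults can be inspected on a single sentence with `build_analyzer()`, which returns the callable the vectorizer uses for preprocessing and tokenization. A short sketch (`analyzer` and `tokens` are names chosen for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that lowercases and tokenizes the text
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("I love programming, I would give it an A+!")

# "I" and "A" are dropped, punctuation is gone, stopwords such as "it" remain
print(tokens)
```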

## TF-IDF Vectorization

TF-IDF gives a higher weight to words that are frequent within a document but rare across the corpus. With the default L2 normalization, each weight falls between 0 and 1.

In [18]:
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer Example
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(X)
X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
X_tfidf_df.round(4)



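| | amazing | an | best | enjoy | favorite | fun | give | is | it | language | ... | learning | life | love | my | new | programming | python | so | the | would |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.297 | 0.297 | 0.0 | 0.297 | 0.0 | 0.3791 | 0.0 | 0.0 | ... | 0.297 | 0.0 | 0.0 | 0.0 | 0.297 | 0.3099 | 0.2341 | 0.297 | 0.297 | 0.0 |
| 1 | 0.0 | 0.452 | 0.0 | 0.0 | 0.0 | 0.0 | 0.452 | 0.0 | 0.452 | 0.0 | ... | 0.0 | 0.0 | 0.3564 | 0.0 | 0.0 | 0.2359 | 0.0 | 0.0 | 0.0 | 0.452 |
| 2 | 0.3383 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6477 | 0.0 | 0.0 | ... | 0.0 | 0.3383 | 0.2667 | 0.0 | 0.0 | 0.5296 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4822 | 0.0 | 0.0 | 0.3078 | 0.0 | 0.4822 | ... | 0.0 | 0.0 | 0.0 | 0.4822 | 0.0 | 0.2516 | 0.3801 | 0.0 | 0.0 | 0.0 |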
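
To see where these weights come from, the sketch below recomputes them from the raw counts. It assumes the default `TfidfVectorizer` settings (smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the variable names `counts`, `doc_freq`, `idf`, `manual`, and `reference` are introduced for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

# Term frequency: raw counts per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with scikit-learn's result
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

A word that appears in every document, such as "programming", gets the smallest possible IDF of 1, while a word that appears in only one document gets the largest.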


## Comparison

CountVectorizer:

Produces raw word counts. Words that are common across many documents can overshadow more meaningful terms.

Use CountVectorizer when you want a simple representation and do not need to consider the importance of a term relative to the corpus.

TfidfVectorizer:

Scales the word counts by inverse document frequency, so words that appear in many documents are down-weighted relative to words that appear in only a few.

Use TfidfVectorizer when you want to determine important terms that are relevant in the context of the entire corpus.
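
The difference shows up clearly in the third document, where "programming" and "is" both occur three times. The sketch below compares the two representations for a few words; `count_vec`, `tfidf_vec`, `counts`, and `weights` are names made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

count_vec = CountVectorizer().fit(docs)
tfidf_vec = TfidfVectorizer().fit(docs)

# Vectorize only the third document with each fitted vectorizer
counts = count_vec.transform([docs[2]]).toarray()[0]
weights = tfidf_vec.transform([docs[2]]).toarray()[0]

# Both vectorizers share the same vocabulary, so the column indices match
for word in ("programming", "is", "amazing"):
    col = count_vec.vocabulary_[word]
    print(word, counts[col], round(float(weights[col]), 4))
```

The counts for "programming" and "is" are identical, but "programming" receives the lower TF-IDF weight because it appears in all four documents.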

Either vectorizer can also perform additional preprocessing on the text data, such as:

Eliminating stopwords with `stop_words` (not removed by default).

Creating n-grams as well as single tokens with `ngram_range`.

Changing the tokenization pattern with `token_pattern` (or passing a custom function as `tokenizer`).
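
A brief sketch of these options on the same four sentences. The specific settings below (the built-in English stopword list, unigrams plus bigrams, and a pattern that keeps single-character tokens) are example choices, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I enjoy learning new programming languages. The best is Python. Programming is so fun!",
    "I love programming, I would give it an A+!",
    "Programming is amazing. Programming is love. Programming is life.",
    "Python is my favorite programming language.",
]

# Remove English stopwords such as "is" and "the"
no_stop = CountVectorizer(stop_words="english").fit(docs)

# Keep single words and add two-word sequences (bigrams)
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# Custom token pattern that also keeps one-character tokens like "i" and "a"
keep_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)

print("is" in no_stop.vocabulary_)
print("programming is" in with_bigrams.vocabulary_)
print("i" in keep_short.vocabulary_, "a" in keep_short.vocabulary_)
```

The same parameters are accepted by `TfidfVectorizer`, so the choice of preprocessing is independent of the choice of weighting.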