#### Count Vectorizer

CountVectorizer is a tool commonly used in natural language processing (NLP) to convert a collection of text documents into a matrix of token counts. It is part of the scikit-learn library in Python. In the examples below, CountVectorizer tokenizes a list of names and builds a matrix in which each row is a name, each column is a token, and each cell holds the number of times that token occurs in that name.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample names
names = ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown"]

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the names to obtain the token count matrix
X = vectorizer.fit_transform(names)

# Get the feature names (tokens)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense array for easier inspection
matrix_array = X.toarray()

# Display the feature names and token count matrix
print("Feature Names (Tokens):", feature_names)
print("Token Count Matrix:")
print(matrix_array)


Feature Names (Tokens): ['alice' 'bob' 'brown' 'doe' 'jane' 'john' 'johnson' 'smith']
Token Count Matrix:
[[0 0 0 1 0 1 0 0]
 [0 0 0 0 1 0 0 1]
 [0 1 0 0 0 0 1 0]
 [1 0 1 0 0 0 0 0]]


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

names = ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown"]

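# ngram_range=(1, 2) extracts single tokens (unigrams) and pairs of adjacent tokens (bigrams);
# lowercase=True is the default and is shown here only for clarity
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)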
X = vectorizer.fit_transform(names)

feature_names = vectorizer.get_feature_names_out()
matrix_array = X.toarray()

# Display the feature names and token count matrix
print("Feature Names (Tokens):", feature_names)
print("Token Count Matrix:")
print(matrix_array)

Feature Names (Tokens): ['alice' 'alice brown' 'bob' 'bob johnson' 'brown' 'doe' 'jane'
 'jane smith' 'john' 'john doe' 'johnson' 'smith']
Token Count Matrix:
[[0 0 0 0 0 1 0 0 1 1 0 0]
 [0 0 0 0 0 0 1 1 0 0 0 1]
 [0 0 1 1 0 0 0 0 0 0 1 0]
 [1 1 0 0 1 0 0 0 0 0 0 0]]


#### Q. What is the difference between the fit() and transform() methods? Can we apply transform() to test data?

fit() learns parameters or statistics from the training data; for CountVectorizer, this is the vocabulary of tokens. transform() applies those learned parameters to data, whether that is the training set itself or new data. If a transformer has both methods, fit_transform() performs both steps in one call on the same data and can be more efficient than calling fit() and transform() separately.
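
A minimal sketch of that relationship, assuming a small two-sentence corpus chosen for illustration: calling fit_transform() on the training data should give the same count matrix as calling fit() followed by transform() on that same data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.", "This document is the second document."]

# One step: learn the vocabulary and build the count matrix together
one_step = CountVectorizer().fit_transform(corpus)

# Two steps: learn the vocabulary first, then build the count matrix
two_step_vectorizer = CountVectorizer()
two_step_vectorizer.fit(corpus)
two_step = two_step_vectorizer.transform(corpus)

# Both routes produce identical counts for the same input
same = np.array_equal(one_step.toarray(), two_step.toarray())
print(same)  # True
```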

Yes. In scikit-learn, once a transformer has been fitted on the training data with fit(), you can call transform() to apply the same transformation to new, unseen data such as a test set. Do not call fit() or fit_transform() on the test data: that would learn a new vocabulary, and the resulting columns would no longer match the ones learned from the training data.


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
corpus = ["This is the first document.", "This document is the second document."]

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer to the training data
vectorizer.fit(corpus)

# Transform new data using the previously fitted vectorizer
new_data = ["This is a new document."]
transformed_data = vectorizer.transform(new_data)

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

train_data = ["This is the first document.", "This document is the second document."]
test_data = ["This is a new document."]

# Create an instance of CountVectorizer and fit it on the training data only
vectorizer = CountVectorizer()
vectorizer.fit(train_data)

# Transform both training and test data
train_transformed = vectorizer.transform(train_data)
test_transformed = vectorizer.transform(test_data)

# Both matrices share the same columns, so they can be passed to the same ML model
print("Transformed Training Data:")
print(train_transformed.toarray())

print("Transformed Test Data:")
print(test_transformed.toarray())

Transformed Training Data:
[[1 1 1 0 1 1]
 [2 0 1 1 1 1]]
Transformed Test Data:
[[1 0 1 0 0 1]]
