In [None]:
CountVectorizer is a feature extraction tool in the scikit-learn library that converts a collection of text documents into a bag-of-words representation.

It transforms the text into a matrix of token (word) counts, where:

Each row represents a document.

Each column represents a unique word (term) in the entire corpus.

Each matrix entry is the number of times that word occurs in the respective document.

Key Features
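Tokenization: Splits the text into word tokens. By default, punctuation is treated as a separator and tokens shorter than two characters are dropped.
Lowercasing: Converts all text to lowercase by default (lowercase=True).
Stopword Removal: Optionally removes common words (e.g., "the", "and") that carry little meaning on their own. No stopwords are removed unless stop_words is set.
Sparse Representation: Returns a SciPy sparse matrix that stores only the non-zero counts, which saves memory on large corpora.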


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# Example documents
documents = [
    "Cat runs behind rat",
    "Dog runs behind cat",
    "The quick brown fox jumps"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Extract feature names (unique words)
features = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense matrix for better visualization
dense_matrix = X.toarray()

# Print the results
print("Feature Names (Vocabulary):", features)
print("\nDocument-Term Matrix:")
print(dense_matrix)


Feature Names (Vocabulary): ['behind' 'brown' 'cat' 'dog' 'fox' 'jumps' 'quick' 'rat' 'runs' 'the']

Document-Term Matrix:
[[1 0 1 0 0 0 0 1 1 0]
 [1 0 1 1 0 0 0 0 1 0]
 [0 1 0 0 1 1 1 0 0 1]]


In [None]:
Feature Names (Vocabulary):

['behind', 'brown', 'cat', 'dog', 'fox', 'jumps', 'quick', 'rat', 'runs', 'the']

Document-Term Matrix:

[[1 0 1 0 0 0 0 1 1 0]   # Document 1: "Cat runs behind rat"
 [1 0 1 1 0 0 0 0 1 0]   # Document 2: "Dog runs behind cat"
 [0 1 0 0 1 1 1 0 0 1]]  # Document 3: "The quick brown fox jumps"

Each row corresponds to a document and each column to a word in the vocabulary. Each entry is the number of times that word occurs in that document.
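
To connect a word to its column, the fitted vectorizer exposes a vocabulary_ dictionary that maps each term to its column index. The sketch below reuses the three example documents. The new sentence "cat chases cat" is invented here to show that transform() counts repeated words and silently ignores words that were not seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Cat runs behind rat",
    "Dog runs behind cat",
    "The quick brown fox jumps",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Look up the column that holds the counts for "cat"
col = int(vectorizer.vocabulary_["cat"])
cat_counts = X[:, col].toarray().ravel()
print(col, cat_counts)   # 2 [1 1 0]

# Vectorize an unseen document with the already fitted vocabulary.
# "chases" is not in the vocabulary, so it contributes nothing.
new_row = vectorizer.transform(["cat chases cat"]).toarray()[0]
print(new_row)           # [0 0 2 0 0 0 0 0 0 0]
```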

Parameters of CountVectorizer:

stop_words: Remove common stopwords (e.g., stop_words='english').
max_features: Limit the number of features (e.g., max_features=100).
ngram_range: Capture phrases of multiple words (e.g., ngram_range=(1, 2) for unigrams and bigrams).
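min_df and max_df: Ignore words that appear in too few or too many documents. An integer is read as a number of documents, a float as a proportion of documents (e.g., min_df=2, max_df=0.5).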

Applications:
Text Preprocessing: Transform raw text into numerical vectors for machine learning.
Feature Engineering: Generate features for text classification, clustering, or similarity analysis.
Information Retrieval: Represent documents for search and ranking.
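
As one illustration of the similarity use case, the count vectors can be compared with cosine similarity. This is a minimal sketch on the three example documents. Using cosine_similarity from sklearn.metrics.pairwise is a choice made here for illustration, and other similarity measures would work as well.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Cat runs behind rat",
    "Dog runs behind cat",
    "The quick brown fox jumps",
]

# Build the document-term matrix, then compare every pair of rows
X = CountVectorizer().fit_transform(documents)
similarity = cosine_similarity(X)

# Documents 1 and 2 share 3 of their 4 words: 3 / (2 * 2) = 0.75
print(round(float(similarity[0, 1]), 2))   # 0.75

# Documents 1 and 3 share no words
print(float(similarity[0, 2]))             # 0.0
```

Because the bag-of-words representation ignores word order, "Cat runs behind rat" and "Dog runs behind cat" score as highly similar even though they describe different situations.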