### CountVectorizer :
- It is a tool provided by scikit-learn (`sklearn`) in Python that converts a collection of text documents into a matrix of token counts.
- Simply put, it converts text data into numerical form.

---

### Why use it? :
- ML models cannot work with raw text directly; they need numbers. CountVectorizer helps by:
- breaking text down into words (called tokens)
- counting how many times each word appears
- representing the text as a numerical vector
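
The three steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's actual implementation: the helper name `bag_of_words` is hypothetical, and the regular expression mirrors CountVectorizer's default `token_pattern`, which keeps only tokens of two or more word characters.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern:
# a token is a run of 2+ word characters, so "I" is dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(docs):
    """Hypothetical helper: return (vocabulary, count matrix) for a list of strings."""
    # 1. Tokenize: lowercase each document and split it into tokens
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # 2. Build the vocabulary: unique tokens in sorted order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    # 3. Count: one row per document, one column per vocabulary entry
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocab, rows = bag_of_words(["I love NLP", "NLP loves me"])
print(vocab)  # ['love', 'loves', 'me', 'nlp']
print(rows)   # [[1, 0, 0, 1], [0, 1, 1, 1]]
```

Each row is the numerical vector for one document, and each column corresponds to one vocabulary word.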

---


In [None]:
# Step-by-step working of CountVectorizer:
corpus_old = ["i love NLP", "NLP loves me"]

# 1. Tokenization: lowercases the text and breaks each sentence into words.
#    The default tokenizer keeps only tokens of 2+ characters, so "i" is dropped.
#    Sentence 1: ["love", "nlp"]
#    Sentence 2: ["nlp", "loves", "me"]

# 2. Build vocabulary: sorted list of unique words
#    ["love", "loves", "me", "nlp"]

# 3. Count word frequencies: for each sentence
#    Sentence 1: [1, 0, 0, 1]
#    Sentence 2: [0, 1, 1, 1]


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
corpus = ["I love NLP", "NLP loves me"]

# Create CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Print feature names
print("Vocabulary:", vectorizer.get_feature_names_out())

# Print the matrix
print("BoW Matrix:\n", X.toarray())


Vocabulary: ['love' 'loves' 'me' 'nlp']
BoW Matrix:
 [[1 0 0 1]
 [0 1 1 1]]
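
A fitted vectorizer can also be inspected and reused. The sketch below repeats the fit so that it runs on its own, reads the learned word-to-column mapping from `vocabulary_`, and calls `transform` on a new sentence. The sentence "I love cats" is an illustrative assumption, not part of the original corpus. Words that were not seen during fitting are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP loves me"]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# Mapping from each token to its column index in the matrix
mapping = {word: int(idx) for word, idx in vectorizer.vocabulary_.items()}
print(sorted(mapping.items()))

# transform() reuses the fitted vocabulary; "cats" is unseen and is dropped
new_counts = vectorizer.transform(["I love cats"]).toarray()
print(new_counts)
```

Only the `love` column is non-zero for the new sentence, because `cats` is not in the vocabulary and `I` is removed by the default tokenizer.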


In [5]:
corpus = [
    "I love natural language processing.",
    "Language processing is fun.",
    "I love NLP!"
]


# 1. Basic use :
v = CountVectorizer()
X = v.fit_transform(corpus)
print(v.get_feature_names_out())
print(X.toarray())

['fun' 'is' 'language' 'love' 'natural' 'nlp' 'processing']
[[0 0 1 1 1 0 1]
 [1 1 1 0 0 0 1]
 [0 0 0 1 0 1 0]]


In [None]:
# 2. Using lowercase (lowercase=True is the default)
v = CountVectorizer(lowercase=True)
X = v.fit_transform(corpus)
print(v.get_feature_names_out())

v2 = CountVectorizer(lowercase=False)
X2 = v2.fit_transform(corpus)
print(v2.get_feature_names_out())

['fun' 'is' 'language' 'love' 'natural' 'nlp' 'processing']
['Language' 'NLP' 'fun' 'is' 'language' 'love' 'natural' 'processing']


In [20]:
# 3. Stop Words 

v = CountVectorizer(stop_words = 'english')
X = v.fit_transform(corpus)
print(v.get_feature_names_out())

['fun' 'language' 'love' 'natural' 'nlp' 'processing']


In [23]:
# 4. n-grams
# ngram_range=(1, 2) uses both unigrams and bigrams

v = CountVectorizer(ngram_range=(1,2))
X = v.fit_transform(corpus)
print(v.get_feature_names_out())

['fun' 'is' 'is fun' 'language' 'language processing' 'love'
 'love natural' 'love nlp' 'natural' 'natural language' 'nlp' 'processing'
 'processing is']
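
To see where the bigrams come from, here is a rough plain-Python sketch of word n-gram generation. The helper `word_ngrams` is hypothetical and only approximates what the vectorizer does internally: it slides a window of each size over the token list and joins the tokens with a space.

```python
def word_ngrams(tokens, min_n, max_n):
    """Hypothetical helper: join every run of min_n..max_n consecutive tokens."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of size n over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Tokens of "Language processing is fun." after lowercasing and tokenizing
tokens = ["language", "processing", "is", "fun"]
grams = word_ngrams(tokens, 1, 2)
print(grams)
# ['language', 'processing', 'is', 'fun',
#  'language processing', 'processing is', 'is fun']
```

N-grams never cross document boundaries, which is why the output above contains `love natural` and `love nlp` but no pair that spans two sentences.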


In [26]:
# 5. analyzer='char'
# character-level analysis: features are character n-grams, including spaces and punctuation
v = CountVectorizer(analyzer='char', ngram_range=(2,4))
X = v.fit_transform(corpus)
print(v.get_feature_names_out())

[' f' ' fu' ' fun' ' i' ' is' ' is ' ' l' ' la' ' lan' ' lo' ' lov' ' n'
 ' na' ' nat' ' nl' ' nlp' ' p' ' pr' ' pro' 'ag' 'age' 'age ' 'al' 'al '
 'al l' 'an' 'ang' 'angu' 'at' 'atu' 'atur' 'ce' 'ces' 'cess' 'e ' 'e n'
 'e na' 'e nl' 'e p' 'e pr' 'es' 'ess' 'essi' 'fu' 'fun' 'fun.' 'g ' 'g i'
 'g is' 'g.' 'ge' 'ge ' 'ge p' 'gu' 'gua' 'guag' 'i ' 'i l' 'i lo' 'in'
 'ing' 'ing ' 'ing.' 'is' 'is ' 'is f' 'l ' 'l l' 'l la' 'la' 'lan' 'lang'
 'lo' 'lov' 'love' 'lp' 'lp!' 'n.' 'na' 'nat' 'natu' 'ng' 'ng ' 'ng i'
 'ng.' 'ngu' 'ngua' 'nl' 'nlp' 'nlp!' 'oc' 'oce' 'oces' 'ov' 'ove' 'ove '
 'p!' 'pr' 'pro' 'proc' 'ra' 'ral' 'ral ' 'ro' 'roc' 'roce' 's ' 's f'
 's fu' 'si' 'sin' 'sing' 'ss' 'ssi' 'ssin' 'tu' 'tur' 'tura' 'ua' 'uag'
 'uage' 'un' 'un.' 'ur' 'ura' 'ural' 've' 've ' 've n']
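
The long feature list above is easier to follow with a small sketch of character n-gram extraction. The helper `char_ngrams` is hypothetical and simplified: it lowercases the text and collects every substring of the requested lengths, so spaces and punctuation become part of the features.

```python
def char_ngrams(text, min_n, max_n):
    """Hypothetical helper: every substring of length min_n..max_n, lowercased."""
    text = text.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n characters over the whole string
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


grams = char_ngrams("I love NLP!", 2, 4)
print(len(grams))        # 10 two-grams + 9 three-grams + 8 four-grams = 27
print("nlp!" in grams)   # punctuation is kept
print(" lo" in grams)    # n-grams may span the space between words
```

Because every position in the string produces several features, character-level vocabularies grow much faster than word-level ones, as the output above shows.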
