# **Count Vector**
In one-hot encoding, words are represented with vectors of 0s and 1s. In a count vector, each position instead holds the frequency of the corresponding word in the text: the vector is no longer binary, and each entry records how many times that word appears in the document.
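
As a minimal sketch of that difference, the snippet below builds both representations over the same vocabulary. The sentence and the names `binary_vector` and `count_vector` are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

# Illustrative sentence and vocabulary (assumed for this sketch).
tokens = "the who is the band".split()
vocab = sorted(set(tokens))  # ['band', 'is', 'the', 'who']

counts = Counter(tokens)

# One-hot style (binary): 1 if the word occurs at all, else 0.
binary_vector = [1 if counts[w] > 0 else 0 for w in vocab]

# Count vector: how many times each word occurs.
count_vector = [counts[w] for w in vocab]

print(vocab)
print(binary_vector)  # [1, 1, 1, 1]
print(count_vector)   # [1, 1, 2, 1]
```

Only the entry for `the` differs, because it is the only word that occurs more than once.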

### **Example**

In [1]:
C = ['The who is the band!', 'who is the band?', 'The band who plays the who.']

print('C has %d texts:' % len(C))
for i in range(len(C)):
    print(f"t{i+1} = {C[i]}")

C has 3 texts:
t1 = The who is the band!
t2 = who is the band?
t3 = The band who plays the who.


In [2]:
import re

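def get_tokens(text):
    # Replace every non-word character with a space, split on whitespace, lowercase.
    # ignore = ['a', 'the', 'is']  # optional stop-word list (currently unused)
    tokens = re.sub(r"[^\w]", " ", text).split()
    return [w.lower() for w in tokens]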

def tokenize(texts):
    # Collect the unique tokens of all texts into a sorted vocabulary.
    words = set()

    for text in texts:
        words.update(get_tokens(text))

    return sorted(words)

V = tokenize(C)
print(f"V has {len(V)} words: {V}")
    

V has 5 words: ['band', 'is', 'plays', 'the', 'who']


In [3]:
import numpy

for text in C:
    words = get_tokens(text)
    bag_vector = numpy.zeros(len(V))
    
    for w in words:
        for i, word in enumerate(V):
            if word == w:
                bag_vector[i] += 1  # increment the count (one-hot would set it to 1)

    print(f"{text} = {bag_vector}")

The who is the band! = [1. 1. 0. 2. 1.]
who is the band? = [1. 1. 0. 1. 1.]
The band who plays the who. = [1. 0. 1. 2. 2.]
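
Once the vocabulary is fixed, the same counting can be applied to a text that was not part of `C`. The sketch below is one possible way to do that with a word-to-index dictionary; the helper `count_vector` and the sample sentence are hypothetical, and the vocabulary is copied in so the block runs on its own. Words that are not in the vocabulary are simply skipped.

```python
import re

# Vocabulary from the example above (copied here so the block is self-contained).
V = ['band', 'is', 'plays', 'the', 'who']
index = {word: i for i, word in enumerate(V)}  # word -> column position

def count_vector(text):
    """Count vocabulary words in `text`; out-of-vocabulary words are skipped."""
    vector = [0] * len(V)
    for token in re.findall(r"\w+", text.lower()):
        if token in index:
            vector[index[token]] += 1
    return vector

# Hypothetical unseen sentence: 'rocks' and 'stage' are not in V.
print(count_vector("The band rocks the stage!"))  # [1, 0, 0, 2, 0]
```

The dictionary lookup also avoids scanning the whole vocabulary for every token, which matters once the vocabulary grows beyond a handful of words.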


### **Text Vectorization Library**

In [6]:
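def pre_process_corpus(corpus):
    # Lowercase each document, then strip punctuation unless it touches a digit
    # (so values such as 3.14 or 10:30 are left intact).
    new_corpus = [doc.lower() for doc in corpus]
    regex = r"(?<!\d)[\!\?.,;:-](?!\d)"
    return [re.sub(regex, "", doc) for doc in new_corpus]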

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = pre_process_corpus(C)
print(corpus)

['the who is the band', 'who is the band', 'the band who plays the who']


In [8]:
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# fit_transform returns a sparse matrix; convert it to a dense array for display.
print(pd.DataFrame(doc_term_matrix.toarray(), columns=terms).to_string())

   band  is  plays  the  who
0     1   1      0    2    1
1     1   1      0    1    1
2     1   0      1    2    2
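
By default `CountVectorizer` lowercases its input and extracts tokens with the pattern `(?u)\b\w\w+\b`, so punctuation acts as a separator and single-character tokens are dropped. The sketch below only approximates that default tokenization with the standard `re` module; the helper name `approx_default_tokens` and the second sentence are assumptions made for illustration, and the real class has further options that are not reproduced here.

```python
import re

# Default `token_pattern` of CountVectorizer: two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def approx_default_tokens(text):
    """Approximate CountVectorizer's default tokenization (lowercase + pattern)."""
    return re.findall(TOKEN_PATTERN, text.lower())

print(approx_default_tokens("The who is the band!"))
# ['the', 'who', 'is', 'the', 'band']

# Hypothetical sentence: the one-letter words 'I' and 'a' are dropped.
print(approx_default_tokens("I saw a band."))
# ['saw', 'band']
```

This is why the table above matches the manual count vectors for this corpus, where every word has at least two characters.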
