
# Bag Of Words

---


- https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/

> Bag of words is one of the most widely used techniques for feature extraction, that is, for organizing information in a structured form. Texts are unstructured data: they are not organized in a fixed layout such as a table. For that reason, we transform texts into numerical information.


> The bag-of-words technique lets us represent a text by the number of occurrences of each word, ignoring word order and the structure of the text. It really is as if all the words had been thrown into a bag.
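To make the "bag" idea concrete, here is a minimal sketch using `collections.Counter` from the standard library. The two sentences are hypothetical examples, not part of the notebook's data; they contain the same words in a different order and therefore produce the same bag.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()


# Hypothetical sentences: same words, different order (and different meaning).
sentence_a = "the dog chased the cat"
sentence_b = "the cat chased the dog"

bag_a = Counter(tokenize(sentence_a))
bag_b = Counter(tokenize(sentence_b))

print(bag_a)
print(bag_a == bag_b)  # True: word order is lost in a bag of words
```

This is also the main limitation of the representation: the two sentences mean different things, yet they are indistinguishable once reduced to counts.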

In [None]:
import pandas as pd
import numpy as np
import collections
import re

In [None]:
doc1 = 'Game of Thrones is an amazing tv series!'
doc2 = 'Game of Thrones is the best tv series!'
doc3 = 'Game of Thrones is so great'

In [None]:
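# Lowercase each document, replace every non-alphanumeric character with a space,
# and split on whitespace to get a list of tokens
l_doc1 = re.sub(r"[^a-zA-Z0-9]", " ", doc1.lower()).split()
l_doc2 = re.sub(r"[^a-zA-Z0-9]", " ", doc2.lower()).split()
l_doc3 = re.sub(r"[^a-zA-Z0-9]", " ", doc3.lower()).split()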

In [None]:
# Build the vocabulary: the sorted union of the tokens of all three documents
wordset12 = np.union1d(l_doc1, l_doc2)
wordset = np.union1d(wordset12, l_doc3)
print(wordset)

['amazing' 'an' 'best' 'game' 'great' 'is' 'of' 'series' 'so' 'the'
 'thrones' 'tv']


In [None]:
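def calculateBOW(wordset, l_doc):
    # Start every vocabulary word at zero so that each document gets the same columns
    tf_diz = dict.fromkeys(wordset, 0)
    # Count each token once per occurrence; tokens outside the vocabulary are ignored
    for word in l_doc:
        if word in tf_diz:
            tf_diz[word] += 1
    return tf_diz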

In [None]:
bow1 = calculateBOW(wordset,l_doc1)
bow2 = calculateBOW(wordset,l_doc2)
bow3 = calculateBOW(wordset,l_doc3)
df_bow = pd.DataFrame([bow1,bow2,bow3])
df_bow.head()

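|   | amazing | an | best | game | great | is | of | series | so | the | thrones | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |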
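As a cross-check on the manual approach, the same document-term matrix can be sketched with the standard library only. This is an illustrative alternative, not the notebook's own method: the names `tokenize`, `vocabulary`, and `bow_vector` are introduced here for the example, and `collections.Counter` replaces the hand-written counting loop.

```python
from collections import Counter
import re

docs = [
    "Game of Thrones is an amazing tv series!",
    "Game of Thrones is the best tv series!",
    "Game of Thrones is so great",
]


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()


tokenized = [tokenize(doc) for doc in docs]

# Vocabulary: every distinct token across all documents, in alphabetical order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def bow_vector(tokens, vocabulary):
    """Return the count of each vocabulary word in one tokenized document."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocabulary]


matrix = [bow_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row lines up with the corresponding row of `df_bow`, one column per vocabulary word.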


## Scikit-learn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()

In [None]:
X = vectorizer.fit_transform([doc1,doc2,doc3])

In [None]:
vectorizer.get_feature_names_out()

array(['amazing', 'an', 'best', 'game', 'great', 'is', 'of', 'series',
       'so', 'the', 'thrones', 'tv'], dtype=object)

In [None]:
df_bow_sklearn = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

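|   | amazing | an | best | game | great | is | of | series | so | the | thrones | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |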


In [None]:
# Remove stop words (scikit-learn's built-in English list)

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform([doc1,doc2,doc3])
df_bow_sklearn = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

|   | amazing | best | game | great | series | thrones | tv |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |


In [None]:
# N-grams: ngram_range=(2, 2) keeps bigrams only

vectorizer = CountVectorizer(stop_words='english',ngram_range=(2,2))
X = vectorizer.fit_transform([doc1,doc2,doc3])
df_bow_sklearn = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

|   | amazing tv | best tv | game thrones | thrones amazing | thrones best | thrones great | tv series |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
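Columns such as `game thrones` and `thrones amazing` appear because stop words are dropped before the bigrams are formed, so words that were not adjacent in the original sentence end up paired. The sketch below reproduces that order of operations in plain Python. It is an illustration, not scikit-learn's implementation: `STOP_WORDS` is a hand-picked set that covers only the stop words in these documents, and `tokenize` and `ngrams` are names introduced for the example.

```python
import re

# Assumption: a tiny stop list covering only the words that occur in these documents;
# scikit-learn's built-in English list is much longer.
STOP_WORDS = {"an", "is", "of", "so", "the"}


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc = "Game of Thrones is an amazing tv series!"

# Step 1: drop stop words. Step 2: pair up the tokens that remain.
tokens = [token for token in tokenize(doc) if token not in STOP_WORDS]
bigrams = ngrams(tokens, 2)

print(tokens)
print(bigrams)
```

The resulting bigrams are exactly the columns with a 1 in row 0 of the table above.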


In [None]:
# ngram_range=(1, 2) keeps unigrams and bigrams together
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
X = vectorizer.fit_transform([doc1,doc2,doc3])
df_bow_sklearn = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

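|   | amazing | amazing tv | best | best tv | game | game thrones | great | series | thrones | thrones amazing | thrones best | thrones great | tv | tv series |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |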
