# Name: Matheus Gustavo Alves Sasso


The goal of this experiment is to get to know scikit-learn's CountVectorizer by applying it to a small sample of the IMDB dataset and by coding equivalent functions in Python.

Functions to implement:

1. vocab = build_vocab(corpus)
2. corpus_tok = tokenizer(corpus, vocab)
3. doc_term = feature(corpus_tok)

While debugging your program, use a very small corpus with only a few examples; once it is debugged, run it on the 1000 examples of imdb_sample.

## Using the scikit-learn example:

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
import re


In [0]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


In [0]:
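vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out().tolist()  # plain list, so vocab.index() works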



## Showing the document-term matrix, also known as the "bag of words"

In [0]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
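
Each column of the matrix corresponds to one vocabulary word, in alphabetical order. As a minimal sketch of how to read a row, the snippet below pairs the second row of the output above with the vocabulary. The `vocab` list is written out by hand here (the sorted set of words in `corpus`), so it is an assumption of this example rather than something printed by the notebook.

```python
# Vocabulary of the four-sentence corpus, written out by hand (assumption).
vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# Second row of the document-term matrix shown above.
row = [0, 2, 0, 1, 0, 1, 1, 0, 1]

# Keep only the words that actually occur in the document.
counts = {word: n for word, n in zip(vocab, row) if n > 0}
print(counts)  # {'document': 2, 'is': 1, 'second': 1, 'the': 1, 'this': 1}
```

This matches the second sentence, 'This document is the second document.', where "document" appears twice.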


## My implementation of a simple tokenizer using the vocabulary already extracted by scikit-learn

First version: using a simple for loop




In [0]:
list_word_based = []
list_token_based = []
for amostra in corpus:
    amostra = re.sub(r'\W',' ',amostra).strip().lower()
    list_words = amostra.split()  # split() collapses repeated whitespace
    list_tokens = []
    for word in list_words:      
        list_tokens.append(vocab.index(word))
    list_word_based.append(list_words)
    list_token_based.append(list_tokens)
list_word_based, list_token_based

([['this', 'is', 'the', 'first', 'document'],
  ['this', 'document', 'is', 'the', 'second', 'document'],
  ['and', 'this', 'is', 'the', 'third', 'one'],
  ['is', 'this', 'the', 'first', 'document']],
 [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]])

Second version: for loop with a list comprehension




In [0]:
list_word_based = []
list_token_based = []
for amostra in corpus:
    amostra = re.sub(r'\W',' ',amostra).strip().lower()
    list_words = amostra.split()  # split() collapses repeated whitespace
    list_tokens = [vocab.index(word) for word in list_words]
    list_word_based.append(list_words)
    list_token_based.append(list_tokens)
list_word_based, list_token_based

([['this', 'is', 'the', 'first', 'document'],
  ['this', 'document', 'is', 'the', 'second', 'document'],
  ['and', 'this', 'is', 'the', 'third', 'one'],
  ['is', 'this', 'the', 'first', 'document']],
 [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]])
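
A caveat of replacing non-word characters with spaces and then splitting: `split(' ')` produces empty strings whenever two separators are adjacent, and `vocab.index('')` would then raise `ValueError`, whereas `split()` with no argument collapses runs of whitespace. The four sentences above do not trigger this because their only punctuation is at the end and gets stripped. The sentence below is a made-up example, not part of the corpus.

```python
import re

text = "Well... this is new, right?"  # hypothetical sentence with repeated punctuation
clean = re.sub(r'\W', ' ', text).strip().lower()

with_single_space = clean.split(' ')  # keeps empty strings between adjacent spaces
with_whitespace = clean.split()       # collapses runs of whitespace

print(with_single_space)
print(with_whitespace)  # ['well', 'this', 'is', 'new', 'right']
```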

## Function implementations by Matheus Sasso

Individual implementation of each function. To simplify the task, I also wrote the tokenizer and the feature step as a single function.

In [0]:
import numpy as np

In [0]:
def split_corpus(corpus):
  return [re.findall(r'\b\w+\b',amostra.lower()) for amostra in corpus]

In [0]:
def build_vocab(corpus):
  vocab = set()
  for sample in split_corpus(corpus):
    vocab.update(sample)
  return sorted(vocab)

In [0]:
def tokenizer(corpus, vocab):
    """Return the numeric representation of the corpus tokens."""
    dict_vocab = dict((v, i) for i, v in enumerate(vocab)) # cache: word -> index
    # Unknown tokens map to -1, the value that feature() skips.
    idx = [[dict_vocab.get(tok, -1) for tok in sample] for sample in split_corpus(corpus)]
    return idx
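
Because `build_vocab` returns a sorted list, each token id is simply a position in that list, so the numeric representation can be mapped back to words. A small sketch using the ids shown earlier for the second document; the `vocab` list is written out by hand for the four-sentence corpus, and `word_to_id` is a name made up for this example.

```python
# Sorted vocabulary of the four-sentence corpus, written out by hand (assumption).
vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# Token ids of 'This document is the second document.' as printed earlier.
token_ids = [8, 1, 3, 6, 5, 1]

# Decoding: each id indexes directly into the vocabulary list.
words = [vocab[i] for i in token_ids]
print(words)  # ['this', 'document', 'is', 'the', 'second', 'document']

# Encoding again with a dictionary gives back the same ids.
word_to_id = {word: i for i, word in enumerate(vocab)}
roundtrip = [word_to_id[w] for w in words]
```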

In [0]:
# def feature(corpus_tok, vocab):
#     """Return the number of occurrences of each vocabulary word for every sample in the corpus"""
#     X = np.zeros((len(corpus_tok), len(vocab))).astype(int)
#     for i, pos in enumerate(corpus_tok):
#         print(X[i])
#         np.add.at(X[i], pos, 1).astype(int)  # for each row, add 1 at every incoming position
#     return X

# vocab = build_vocab(corpus)
# corpus_tok = tokenizer(corpus,vocab)
# feature(corpus,vocab)

# doc_term_exemplo

    # X = np.zeros((len(corpus_tok), len(vocab)))
    # list_corpus_tok = []
    # doct_term = []
    # for i,token in enumerate(corpus_tok):
    #   if i in token:
    #     list_corpus_tok.append(corpus_tok.count(i))
    #   elif :
    #   else:
    #     list_corpus_tok.append(int(0))
    #   doct_term.append(list_corpus_tok)
    # return doct_term

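def feature(corpus_tok, vocab):
  """Return, for each sample, the number of occurrences of every vocabulary word."""
  list_count_vectorizer = []
  for amostra in corpus_tok:
    list_count_iter = [0] * (len(vocab))  # one counter per vocabulary entry
    for v in amostra:
      if v != -1:  # -1 marks a token outside the vocabulary
        list_count_iter[v] += 1
    list_count_vectorizer.append(list_count_iter)
  return list_count_vectorizer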
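
The counting step can also be written with `collections.Counter` from the standard library. This is only an alternative sketch, not the implementation used in this notebook; `feature_counter` and its `vocab_size` parameter are names made up for the example, and the token ids are the ones printed earlier for the four-sentence corpus.

```python
from collections import Counter

def feature_counter(corpus_tok, vocab_size):
    """Count token ids per sample; ids equal to -1 are ignored."""
    rows = []
    for sample in corpus_tok:
        counts = Counter(tok for tok in sample if tok != -1)
        # Counter returns 0 for ids that never occurred in the sample.
        rows.append([counts[i] for i in range(vocab_size)])
    return rows

# Token ids of the four example sentences, as printed earlier.
corpus_tok = [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]]
doc_term = feature_counter(corpus_tok, 9)
for row in doc_term:
    print(row)
```

The rows should agree with the `X.toarray()` output from scikit-learn shown at the top of the notebook.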


In [0]:
def tokenizer_plus_feature(corpus, vocab):
  """Tokenize each sample and count vocabulary occurrences in one function."""
  regex = r'\b\w+\b'
  corpus_tok = []
  for amostra in corpus:
      list_words = re.findall(regex, amostra.lower())
      # Map each word to its vocabulary index (assumes every word is in vocab).
      list_tokens = [vocab.index(word) for word in list_words]
      # One count per vocabulary entry; entries that do not occur get 0.
      list_corpus_tok = [list_tokens.count(i) for i in range(len(vocab))]
      corpus_tok.append(list_corpus_tok)
  return corpus_tok
corpus_tok = tokenizer_plus_feature(corpus, vocab)
corpus_tok





[[0, 1, 1, 1, 0, 0, 1, 0, 1],
 [0, 2, 0, 1, 0, 1, 1, 0, 1],
 [1, 0, 0, 1, 1, 0, 1, 1, 1],
 [0, 1, 1, 1, 0, 0, 1, 0, 1]]

# Downloading the IMDB_sample dataset (only 1000 examples)

The dataset is loaded from the datasets provided by the fast.ai course: https://course.fast.ai/datasets.html

The wget command fetches the file imdb_sample.tgz.
The tar command extracts the archive into the local directory.

In [0]:
!wget -nc http://files.fast.ai/data/examples/imdb_sample.tgz
!tar -xzf imdb_sample.tgz

File ‘imdb_sample.tgz’ already there; not retrieving.



The extracted directory contains a single file in CSV format:

In [0]:
!ls imdb_sample

texts.csv


In [0]:
import pandas as pd

In [0]:
df = pd.read_csv('imdb_sample/texts.csv')
df.shape

(1000, 3)

In [0]:
df.head()

|   | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even ... | False |
| 1 | positive | This is a extremely well-made film. The acting... | False |
| 2 | negative | Every once in a long while a movie will come a... | False |
| 3 | positive | Name just says it all. I watched this movie wi... | False |
| 4 | negative | This movie succeeds at being one of the most u... | False |


In [0]:
corpus_exemplo = list(df['text'])

In [0]:
## Solution
np.set_printoptions(edgeitems=20, linewidth=150)
vocab_exemplo = build_vocab(corpus_exemplo)
corpus_tok_exemplo = tokenizer(corpus_exemplo,vocab_exemplo)
doc_term_exemplo = feature(corpus_tok_exemplo,vocab_exemplo)

doc_term_exemplo

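
For the IMDB sample the dense result has 1000 rows and one column per vocabulary word, and most entries are expected to be zero, since each review uses only a small part of the vocabulary. CountVectorizer returns a sparse matrix for this reason (hence the `toarray()` call when printing). As a rough sketch of the same idea in plain Python, a row can be stored as a dictionary of its non-zero columns; `to_sparse_row` and `to_dense_row` are hypothetical helpers, not part of the notebook.

```python
def to_sparse_row(dense_row):
    """Keep only the non-zero columns of a dense count row as {column: count}."""
    return {col: count for col, count in enumerate(dense_row) if count != 0}

def to_dense_row(sparse_row, n_cols):
    """Rebuild the dense list of counts from the {column: count} form."""
    row = [0] * n_cols
    for col, count in sparse_row.items():
        row[col] = count
    return row

# A row from the small four-sentence example.
dense = [0, 2, 0, 1, 0, 1, 1, 0, 1]
sparse = to_sparse_row(dense)
print(sparse)  # {1: 2, 3: 1, 5: 1, 6: 1, 8: 1}
```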

In [0]:
# Vocabulary size (number of distinct tokens)
print(len(vocab_exemplo))

18705
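
One thing to keep in mind when comparing this count with scikit-learn: `split_corpus` uses the pattern `\b\w+\b`, which keeps one-character tokens, while the default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. The two vocabularies can therefore differ in size. A minimal sketch with a made-up sentence:

```python
import re

text = "This is a film I liked."  # hypothetical sentence, not from the dataset

# Pattern used by split_corpus in this notebook.
own_tokens = re.findall(r'\b\w+\b', text.lower())
# Default token_pattern of CountVectorizer (two or more word characters).
sklearn_like = re.findall(r'(?u)\b\w\w+\b', text.lower())

print(own_tokens)    # ['this', 'is', 'a', 'film', 'i', 'liked']
print(sklearn_like)  # ['this', 'is', 'film', 'liked']
```

Passing `token_pattern=r'(?u)\b\w+\b'` to `CountVectorizer` should make the two tokenizations line up, although that has not been checked against this dataset here.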


## Verification



In [0]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus_exemplo)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab_exemplo = vectorizer.get_feature_names_out().tolist()
print(vocab_exemplo)
print(X.toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .