# Tidene Code

## Reading the texts

### Option 1: Reading a very large corpus (texts)

Class readCorpus - Extracts specific columns from a csv file.
Requirement: the csv file must have a header row that names each column (field).

Input parameters:
   - csvfile => name of the csv file
   - list_of_fields_to_read => list of columns to read (if omitted or empty, all columns are read)
   - tokenizer = None => a tokenizer object (if given, the text is returned already tokenized with it); applies only when a single field is selected
   - encoding => character encoding (default = utf8)

Output: an iterator that yields one corpus row at a time
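
As a rough illustration of the idea described above (not the class itself), the sketch below streams selected columns from a csv one row at a time using only the standard library. The helper name `iter_columns` and the in-memory sample are assumptions made for this example.

```python
import csv
import io
from itertools import islice

def iter_columns(fileobj, fields=None):
    """Lazily yield the selected columns of each CSV row (hypothetical helper)."""
    reader = csv.DictReader(fileobj)  # the first row is taken as the header
    for row in reader:
        # an empty selection means "every column"
        wanted = fields or reader.fieldnames
        yield [row[name] for name in wanted]

# tiny in-memory stand-in for a large file on disk
sample = io.StringIO("section,data\nB,separation apparatus\nB,method and installation\n")

# only the rows that are actually requested get parsed
first_rows = list(islice(iter_columns(sample, fields=["data"]), 1))
print(first_rows)
```

Because rows are produced on demand, memory use stays roughly constant regardless of file size, as long as the caller does not collect every row into a list.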


#### Example .csv input

subgroup,maingroup,subclas,clas,section,othersipcs,data
B03B00402,B03B004,B03B,B03,B,B07B00408,separation apparatus this invention relates to a method for separation of a light material from a heavier material a separation table of vibrator type and a cyclone and a

B03B00500,B03B005,B03B,B03,B,B01D01102-E02Fn means00388,method and installation for desalinating sand and suction hopper comprising such an installation the invention 


In [3]:
import csv

class readCorpus(object):
    def __init__(self, csvfile, list_of_fields_to_read=None, tokenizer=None, encoding='utf8'):
        self.csvfile = csvfile
        # None or an empty list means "read every column"
        self.fields = list(list_of_fields_to_read or [])
        self.tokenizer = tokenizer
        self.encoding = encoding

    def __iter__(self):
        # the with-block closes the file once iteration ends
        with open(self.csvfile, encoding=self.encoding, errors='ignore', newline='') as f:
            reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_MINIMAL)  # field separator
            headers = next(reader, None)
            if headers is None:  # empty file: nothing to yield
                return
            fields = self.fields if self.fields else headers
            selected_field_indexes = [idx for idx, field in enumerate(headers) if field in fields]

            for line in reader:
                if not line:
                    continue
                if len(selected_field_indexes) > 1:
                    yield [line[idx] for idx in selected_field_indexes]
                else:
                    value = line[selected_field_indexes[0]]
                    # the tokenizer applies only when a single field is selected
                    yield self.tokenizer.tokenize(value) if self.tokenizer else value
                        

#### Usage example

In [4]:
corpus = readCorpus("data/train.csv",list_of_fields_to_read=['sentiment','review'])
textos = [texto for texto in corpus]
print(textos[3])

['0', 'It must be assumed that those who praised this film (\\the greatest filmed opera ever,\\" didn\'t I read somewhere?) either don\'t care for opera, don\'t care for Wagner, or don\'t care about anything except their desire to appear Cultured. Either as a representation of Wagner\'s swan-song, or as a movie, this strikes me as an unmitigated disaster, with a leaden reading of the score matched to a tricksy, lugubrious realisation of the text.<br /><br />It\'s questionable that people with ideas as to what an opera (or, for that matter, a play, especially one by Shakespeare) is \\"about\\" should be allowed anywhere near a theatre or film studio; Syberberg, very fashionably, but without the smallest justification from Wagner\'s text, decided that Parsifal is \\"about\\" bisexual integration, so that the title character, in the latter stages, transmutes into a kind of beatnik babe, though one who continues to sing high tenor -- few if any of the actors in the film are the singers, an

In [6]:
# if you want to store the rows in a DataFrame
import pandas as pd
df_textos = pd.DataFrame(textos,columns=['sentiment','review']) # one row per [sentiment, review] pair

In [7]:
df_textos

Unnamed: 0,sentiment,review
0,1,With all this stuff going down at the moment w...
1,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,0,The film starts with a manager (Nicholas Bell)...
3,0,It must be assumed that those who praised this...
4,1,Superbly trashy and wondrously unpretentious 8...
5,1,I dont know why people think this is such a ba...
6,0,"This movie could have been very good, but come..."
7,0,I watched this video at a friend's house. I'm ...
8,0,"A friend of mine bought this film for £1, and ..."
9,1,<br /><br />This movie is full of references. ...


### Option 2: Reading the .csv file directly into a DataFrame

In [9]:
import pandas as pd
df_textos = pd.read_csv('data/train.csv',encoding='utf8')['review']

In [10]:
print(df_textos)

0        With all this stuff going down at the moment w...
1        \The Classic War of the Worlds\" by Timothy Hi...
2        The film starts with a manager (Nicholas Bell)...
3        It must be assumed that those who praised this...
4        Superbly trashy and wondrously unpretentious 8...
5        I dont know why people think this is such a ba...
6        This movie could have been very good, but come...
7        I watched this video at a friend's house. I'm ...
8        A friend of mine bought this film for £1, and ...
9        <br /><br />This movie is full of references. ...
10       What happens when an army of wetbacks, towelhe...
11       Although I generally do not like remakes belie...
12       \Mr. Harvey Lights a Candle\" is anchored by a...
13       I had a feeling that after \Submerged\", this ...
14       note to George Litman, and others: the Mystery...
15       Stephen King adaptation (scripted by King hims...
16       `The Matrix' was an exciting summer blockbuste.
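
For files too large to load in one call, `pd.read_csv` can also be given `usecols` and `chunksize`. The sketch below is a minimal illustration under that assumption; the in-memory csv text and the chunk size of 2 are made up for the example.

```python
import io
import pandas as pd

# in-memory stand-in for data/train.csv (made-up rows)
csv_text = "sentiment,review\n1,good film\n0,bad film\n1,fine film\n"

reviews = []
# usecols limits parsing to the needed column; chunksize yields DataFrames of at most 2 rows
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["review"], chunksize=2):
    reviews.extend(chunk["review"].tolist())

print(reviews)
```

In a real pipeline each chunk would be processed and discarded rather than accumulated, which is what keeps memory bounded.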

## Text cleaning + dimensionality reduction

In [11]:
import nltk
import numpy as np

### Tokenization

In [13]:
import nltk
from nltk.tokenize import RegexpTokenizer

# instantiate the tokenizer
tokenizer = RegexpTokenizer("[a-zA-Z']+")

# ... this one, for example, splits the text into words and keeps apostrophes inside them
# usage example
tokenizer.tokenize("my can't go should't 321")


['my', "can't", 'go', "should't"]
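
To see what the pattern itself matches, the same result can be reproduced with the standard `re` module. This is a sketch that assumes `RegexpTokenizer`, in its default mode, simply returns every match of the pattern.

```python
import re

# runs of ASCII letters and apostrophes; digits, spaces and punctuation act as separators
pattern = r"[a-zA-Z']+"

matches = re.findall(pattern, "my can't go should't 321")
print(matches)
```

Note that the character class only covers unaccented ASCII letters, so accented words would be split at the accented character.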

In [30]:
corpus = readCorpus("data/train.csv",list_of_fields_to_read=['review'],tokenizer=tokenizer)
tokens = [texto for texto in corpus]   #values.tolist()
print(tokens[1])


['The', 'Classic', 'War', 'of', 'the', 'Worlds', 'by', 'Timothy', 'Hines', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'H', 'G', "Wells'", 'classic', 'book', 'Mr', 'Hines', 'succeeds', 'in', 'doing', 'so', 'I', 'and', 'those', 'who', 'watched', 'his', 'film', 'with', 'me', 'appreciated', 'the', 'fact', 'that', 'it', 'was', 'not', 'the', 'standard', 'predictable', 'Hollywood', 'fare', 'that', 'comes', 'out', 'every', 'year', 'e', 'g', 'the', 'Spielberg', 'version', 'with', 'Tom', 'Cruise', 'that', 'had', 'only', 'the', 'slightest', 'resemblance', 'to', 'the', 'book', 'Obviously', 'everyone', 'looks', 'for', 'different', 'things', 'in', 'a', 'movie', 'Those', 'who', 'envision', 'themselves', 'as', 'amateur', 'critics', 'look', 'only', 'to', 'criticize', 'everything', 'they', 'can', 'Others', 'rate', 'a', 'movie', 'on', 'more', 'important', 'bases', 'like', 'being', 'entertained', 'which

### Stopword removal

In [18]:
from nltk import download
from nltk.corpus import stopwords
download('stopwords')

[nltk_data] Downloading package stopwords to /home/bruno/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Stopword list available in nltk

In [19]:
stop_words = stopwords.words('english')
print(stop_words)


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Remove stopwords

In [21]:
tokens_noStp = [[word for word in texto if word not in stop_words and len(word) > 1] for texto in tokens]
print(tokens_noStp[1])

['The', 'Classic', 'War', 'Worlds', 'Timothy', 'Hines', 'entertaining', 'film', 'obviously', 'goes', 'great', 'effort', 'lengths', 'faithfully', 'recreate', "Wells'", 'classic', 'book', 'Mr', 'Hines', 'succeeds', 'watched', 'film', 'appreciated', 'fact', 'standard', 'predictable', 'Hollywood', 'fare', 'comes', 'every', 'year', 'Spielberg', 'version', 'Tom', 'Cruise', 'slightest', 'resemblance', 'book', 'Obviously', 'everyone', 'looks', 'different', 'things', 'movie', 'Those', 'envision', 'amateur', 'critics', 'look', 'criticize', 'everything', 'Others', 'rate', 'movie', 'important', 'bases', 'like', 'entertained', 'people', 'never', 'agree', 'critics', 'We', 'enjoyed', 'effort', 'Mr', 'Hines', 'put', 'faithful', "Wells'", 'classic', 'novel', 'found', 'entertaining', 'This', 'made', 'easy', 'overlook', 'critics', 'perceive', 'shortcomings']
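
The output above still contains capitalized stopwords such as 'The', 'Those' and 'This', because the nltk list is lowercase and the comparison is case-sensitive. One possible variant, sketched here with a small hypothetical stopword set so that it runs without nltk, lowercases each word only for the lookup:

```python
# small hypothetical stopword set; in the notebook this would be set(stopwords.words('english'))
stop_set = {"the", "of", "is", "a", "this"}

tokens_example = [["The", "Classic", "War", "of", "the", "Worlds"],
                  ["This", "is", "a", "movie"]]

# compare the lowercased word against the set, but keep the original casing in the output
filtered = [[w for w in doc if w.lower() not in stop_set and len(w) > 1]
            for doc in tokens_example]
print(filtered)
```

Using a set instead of a list also makes each membership test constant-time, which matters on a large corpus.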


#### Reducing words to their root form (using a lemmatizer or a stemmer)

(https://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization)

In [24]:
# Lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
tokens_lem = [[wordnet_lemmatizer.lemmatize(word) for word in texto] for texto in tokens_noStp]
print(tokens_lem[1])

[nltk_data] Downloading package wordnet to /home/bruno/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
['The', 'Classic', 'War', 'Worlds', 'Timothy', 'Hines', 'entertaining', 'film', 'obviously', 'go', 'great', 'effort', 'length', 'faithfully', 'recreate', "Wells'", 'classic', 'book', 'Mr', 'Hines', 'succeeds', 'watched', 'film', 'appreciated', 'fact', 'standard', 'predictable', 'Hollywood', 'fare', 'come', 'every', 'year', 'Spielberg', 'version', 'Tom', 'Cruise', 'slightest', 'resemblance', 'book', 'Obviously', 'everyone', 'look', 'different', 'thing', 'movie', 'Those', 'envision', 'amateur', 'critic', 'look', 'criticize', 'everything', 'Others', 'rate', 'movie', 'important', 'base', 'like', 'entertained', 'people', 'never', 'agree', 'critic', 'We', 'enjoyed', 'effort', 'Mr', 'Hines', 'put', 'faithful', "Wells'", 'classic', 'novel', 'found', 'entertaining', 'This', 'made', 'easy', 'overlook', 'critic', 'perceive', 'shortcoming']


In [25]:
# Stemmer
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
tokens_stem = [[porter_stemmer.stem(word) for word in texto] for texto in tokens_noStp]
print(tokens_stem[1])

['the', 'classic', 'war', 'world', 'timothi', 'hine', 'entertain', 'film', 'obvious', 'goe', 'great', 'effort', 'length', 'faith', 'recreat', "wells'", 'classic', 'book', 'Mr', 'hine', 'succe', 'watch', 'film', 'appreci', 'fact', 'standard', 'predict', 'hollywood', 'fare', 'come', 'everi', 'year', 'spielberg', 'version', 'tom', 'cruis', 'slightest', 'resembl', 'book', 'obvious', 'everyon', 'look', 'differ', 'thing', 'movi', 'those', 'envis', 'amateur', 'critic', 'look', 'critic', 'everyth', 'other', 'rate', 'movi', 'import', 'base', 'like', 'entertain', 'peopl', 'never', 'agre', 'critic', 'We', 'enjoy', 'effort', 'Mr', 'hine', 'put', 'faith', "wells'", 'classic', 'novel', 'found', 'entertain', 'thi', 'made', 'easi', 'overlook', 'critic', 'perceiv', 'shortcom']
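
The two outputs above differ in kind: the lemmatizer returns dictionary words, while the stemmer clips suffixes by rule and may produce non-words. A minimal sketch on three words taken from the output above, assuming nltk is installed (the Porter stemmer needs no downloaded data):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()  # rule-based, needs no downloaded data

words = ["entertaining", "critics", "movie"]
stems = [stemmer.stem(w) for w in words]
print(list(zip(words, stems)))
```

Stems such as 'movi' are fine as vocabulary keys for a bag-of-words model, but they are harder to read when inspecting results.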


### Saving the corpus to disk, serialized as (index, frequency) pairs

$ conda install gensim


In [27]:
import gensim


In [28]:
# first, build the dictionary (maps each unique word to an integer index)
dictionary = gensim.corpora.Dictionary(tokens_lem)
dictionary.save('dictionary.dict')
print(dictionary.token2id)



#### Bag-of-Words representation (word counts)

In [34]:
dictionary = gensim.corpora.Dictionary.load("dictionary.dict") # load the dictionary from disk

bowcorpus = [dictionary.doc2bow(texto) for texto in tokens_lem] # vectorize into the (index, freq) representation
gensim.corpora.MmCorpus.serialize('bowcorpus.mm', bowcorpus)  # write to disk
print(bowcorpus[1])

[(24, 1), (61, 1), (74, 1), (78, 2), (105, 1), (110, 1), (119, 2), (130, 1), (158, 1), (170, 1), (178, 1), (179, 1), (180, 3), (181, 1), (182, 2), (183, 1), (184, 1), (185, 1), (186, 1), (187, 1), (188, 1), (189, 1), (190, 1), (191, 1), (192, 2), (193, 1), (194, 1), (195, 1), (196, 1), (197, 1), (198, 2), (199, 2), (200, 1), (201, 3), (202, 1), (203, 1), (204, 2), (205, 1), (206, 1), (207, 2), (208, 1), (209, 1), (210, 1), (211, 1), (212, 1), (213, 1), (214, 1), (215, 1), (216, 1), (217, 1), (218, 1), (219, 1), (220, 2), (221, 1), (222, 1), (223, 1), (224, 1), (225, 1), (226, 1), (227, 1), (228, 1), (229, 1), (230, 1), (231, 1), (232, 1), (233, 1), (234, 1), (235, 1), (236, 1)]
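
To make the (index, frequency) format concrete, here is a plain-Python sketch of the same idea. It is an illustration only, not gensim's implementation, and the order in which gensim assigns ids may differ; the token lists are made up.

```python
from collections import Counter

docs = [["film", "book", "film"], ["book", "critic"]]  # made-up token lists

# assign each unique token an integer id in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(token2id)
print(bow)
```

Each document becomes a sparse vector: only the ids that actually occur are stored, which is why the representation stays compact even with a large vocabulary.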


#### Tf-idf representation (https://radimrehurek.com/gensim/tutorial.html)

In [35]:
bowcorpus = gensim.corpora.MmCorpus('bowcorpus.mm')
tfidf_vectorizer = gensim.models.TfidfModel(bowcorpus)
tfidf_corpus_matrix = tfidf_vectorizer[bowcorpus]
gensim.corpora.MmCorpus.serialize('tfidf_corpus_matrix.mm', tfidf_corpus_matrix)  # write to disk

print(tfidf_corpus_matrix[1])


[(24, 0.007662904148069618), (61, 0.05643117471901765), (74, 0.046087077917631177), (78, 0.024121267352814908), (105, 0.016982230836856595), (110, 0.03039415623618463), (119, 0.019593985186936842), (130, 0.031710183360033936), (158, 0.03184060382785375), (170, 0.05613681462132145), (178, 0.12871114483228724), (179, 0.13416749380015613), (180, 0.46254837136180627), (181, 0.06341166601597016), (182, 0.1448957113078988), (183, 0.11196338499012136), (184, 0.12995395457082043), (185, 0.12641775185963003), (186, 0.01853308180925667), (187, 0.0982893326591571), (188, 0.12360940354782138), (189, 0.08566687355523837), (190, 0.08736074292359518), (191, 0.06215022331537497), (192, 0.3302624556632151), (193, 0.14984453437855053), (194, 0.08441583667184949), (195, 0.10488884648524759), (196, 0.10796603326121335), (197, 0.10968221273876273), (198, 0.120590749304403), (199, 0.12005086406135927), (200, 0.03825877860289763), (201, 0.2716356739596658), (202, 0.13705719592586532), (203, 0.078886856657866
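
The weights above can be approximated by hand. The sketch below assumes what are, to the best of my knowledge, the default `TfidfModel` settings (raw term count, idf = log2(N / df), and scaling each document vector to unit length). It is a plain-Python illustration on a made-up corpus, not gensim's code.

```python
import math

# made-up bag-of-words corpus: each document is a list of (token_id, count)
bow_docs = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(1, 3)]]
n_docs = len(bow_docs)

# document frequency: in how many documents each token id occurs
df = {}
for doc in bow_docs:
    for token_id, _ in doc:
        df[token_id] = df.get(token_id, 0) + 1

def tfidf(doc):
    """Weight = count * log2(N / df), then scale the vector to unit length."""
    weights = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in doc]
    weights = [(tid, w) for tid, w in weights if w > 0]  # drop zero weights
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

print(tfidf(bow_docs[0]))
```

Token 1 occurs in every document, so its idf is log2(3/3) = 0 and it disappears from all vectors; this is how tf-idf downweights words that do not help distinguish documents.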