# Tidene Code

## Reading the texts

### Option 1: Reading a very large corpus (texts)

Class readCorpus: extracts specific columns from a CSV file, one row at a time.
Requirement: the CSV file must have a header row that names each column (field).

Input parameters:
   - csvfile => name of the CSV file
   - list_of_fields_to_read => list of columns to read (if omitted or empty, every column is read)
   - tokenizer = None => a tokenizer object; when given, the text is returned already tokenized with it (applies only when a single column is selected)
   - encoding => character encoding (default = utf8)

Output: an iterator over the rows of the corpus


#### Example .csv input

subgroup,maingroup,subclas,clas,section,othersipcs,data
B03B00402,B03B004,B03B,B03,B,B07B00408,separation apparatus this invention relates to a method for separation of a light material from a heavier material a separation table of vibrator type and a cyclone and a

B03B00500,B03B005,B03B,B03,B,B01D01102-E02Fn means00388,method and installation for desalinating sand and suction hopper comprising such an installation the invention 


In [1]:
import csv

class readCorpus(object):
    def __init__(self, csvfile, list_of_fields_to_read=None, tokenizer=None, encoding='utf8'):
        self.csvfile = csvfile
        self.fields = list_of_fields_to_read or []  # empty means "read every column"
        self.tokenizer = tokenizer
        self.encoding = encoding

    def __iter__(self):
        # the with-block closes the file once iteration ends
        with open(self.csvfile, encoding=self.encoding, errors='ignore', newline='') as f:
            reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_MINIMAL)  # field separator
            headers = next(reader, None)
            if headers is None:  # empty file: nothing to yield
                return
            fields = self.fields if self.fields else headers
            selected_field_indexes = [idx for idx, field in enumerate(headers) if field in fields]

            for line in reader:
                if not line:
                    continue
                if len(selected_field_indexes) > 1:
                    yield [line[idx] for idx in selected_field_indexes]
                else:
                    value = line[selected_field_indexes[0]]
                    # the tokenizer only applies when a single column is selected
                    yield self.tokenizer.tokenize(value) if self.tokenizer else value
                        

#### Usage example

In [2]:
corpus = readCorpus("toy.csv",list_of_fields_to_read=['subgroup','data'])
textos = [texto for texto in corpus]
print(textos[3])

['B03B00510', 'method of separating particles in a fluid medium and an apparatus therefor the present invention relates to a method of separating particles in a fluid medium having a density higher than that of the particles to be separated whereby a mixture of the particles to be separated is fed to a separation chamber of a separation apparatus and streams enriched in a particular type of particles are discharged from the separation chamber the use of a fluid medium for the separation of a mixture of particles in two or more fractions is generally known when the particles have a specific density lower than that of the fluid medium such a separation is not very easy it is known in the art to use centrifugation in order to increase the effect of the difference in density between the types of particles this technique is expensive and often results in an unsatisfactory separation the object of the present invention is to provide a method wherein mixtures comprising different types of par
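The list comprehension above loads every row into memory, which can defeat the purpose of an iterator on a very large corpus. A minimal sketch of lazy inspection follows. It assumes a hypothetical helper `iter_rows` that stands in for `readCorpus`, and a throwaway file `toy_sketch.csv` created only for this illustration:

```python
import csv
import os
import tempfile
from itertools import islice

def iter_rows(path, fields, encoding="utf8"):
    """Hypothetical stand-in for readCorpus: lazily yields the selected columns of each row."""
    with open(path, encoding=encoding, newline="") as f:
        for row in csv.DictReader(f):
            yield [row[name] for name in fields]

# Throwaway CSV created only for this illustration (not the real toy.csv).
path = os.path.join(tempfile.mkdtemp(), "toy_sketch.csv")
with open(path, "w", encoding="utf8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["subgroup", "section", "data"])
    writer.writerow(["B03B00402", "B", "separation apparatus"])
    writer.writerow(["B03B00500", "B", "method and installation"])
    writer.writerow(["B03B00546", "B", "device for sorting"])

# islice pulls only the first two rows; the remaining rows are never materialized.
first_two = list(islice(iter_rows(path, ["subgroup", "data"]), 2))
print(first_two)
```

The same `islice` call should work on a `readCorpus` instance, since it only relies on the object being iterable.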

In [3]:
# optionally, store the rows in a DataFrame
import pandas as pd
df_textos = pd.DataFrame(textos,columns=['subgroup','data']) # keeps the subgroup and data columns

In [4]:
df_textos

| | subgroup | data |
|---|---|---|
| 0 | B03B00402 | separation apparatus this invention relates to... |
| 1 | B03B00500 | method and installation for desalinating sand ... |
| 2 | B03B00546 | device for sorting a mix of objects the invent... |
| 3 | B03B00510 | method of separating particles in a fluid medi... |
| 4 | B03B00512 | hutch chamber for jig background of invention ... |
| 5 | B03B00562 | a method and a device for treatment of medium ... |
| 6 | B03B00562 | title a reflux classifier technical field the ... |
| 7 | B03B00562 | particle classifier field of the invention the... |
| 8 | B03B00566 | apparatus for cleaning and destoning particula... |
| 9 | H03F00126 | error extraction using autocalibrating rf corr... |


### Option 2: Read the .csv file directly into a DataFrame

In [5]:
import pandas as pd
df_textos = pd.read_csv('toy.csv',encoding='utf8')['data']

In [6]:
print(df_textos)

0     separation apparatus this invention relates to...
1     method and installation for desalinating sand ...
2     device for sorting a mix of objects the invent...
3     method of separating particles in a fluid medi...
4     hutch chamber for jig background of invention ...
5     a method and a device for treatment of medium ...
6     title a reflux classifier technical field the ...
7     particle classifier field of the invention the...
8     apparatus for cleaning and destoning particula...
9     error extraction using autocalibrating rf corr...
10    audio transient suppression device 1 field of ...
11    flexible current control in power amplifiers b...
12    system and method for compressing an audio sig...
13    system employing data compression transparent ...
14    arithmetic encoding decoding of a multi channe...
15    universally programmable variable length decod...
16    reception of variable and run length encoded d...
Name: data, dtype: object
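For files too large to load at once, `pd.read_csv` can also restrict the columns it parses and return the data in pieces. The sketch below uses the `usecols` and `chunksize` parameters. The file `toy_sketch.csv` and the chunk size of 2 are made up for illustration:

```python
import os
import tempfile
import pandas as pd

# Throwaway CSV created only for this illustration (not the real toy.csv).
path = os.path.join(tempfile.mkdtemp(), "toy_sketch.csv")
pd.DataFrame({
    "subgroup": ["B03B00402", "B03B00500", "B03B00546"],
    "data": ["separation apparatus", "method and installation", "device for sorting"],
}).to_csv(path, index=False, encoding="utf8")

# usecols parses only the requested column; chunksize yields DataFrames of at most 2 rows each.
chunk_sizes = []
for chunk in pd.read_csv(path, usecols=["data"], chunksize=2, encoding="utf8"):
    chunk_sizes.append(len(chunk))
    print(chunk["data"].tolist())
```

Each chunk is an ordinary DataFrame, so the cleaning steps can be applied chunk by chunk instead of on the whole file.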


## Text cleaning + dimensionality reduction

In [7]:
import nltk
import numpy as np

### Tokenization

In [8]:
from nltk.tokenize import RegexpTokenizer

# instantiate the tokenizer
tokenizer = RegexpTokenizer("[a-zA-Z']+")

# this pattern splits the text into words, keeps apostrophes inside a word, and drops digits
# usage example
tokenizer.tokenize("my can't go shouldn't 321")


['my', "can't", 'go', "shouldn't"]
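To see what the pattern `[a-zA-Z']+` matches without NLTK, the same regular expression can be applied with the standard `re` module. This is a sketch of the matching behaviour only, not NLTK's implementation; `simple_tokenize` is a hypothetical helper:

```python
import re

# Same pattern as above: maximal runs of ASCII letters and apostrophes.
pattern = re.compile(r"[a-zA-Z']+")

def simple_tokenize(text):
    """Hypothetical helper: return every run of letters/apostrophes; digits and punctuation are dropped."""
    return pattern.findall(text)

tokens_example = simple_tokenize("my can't go 321, state-of-the-art")
print(tokens_example)
```

Note that hyphenated words are split at the hyphen and that accented letters are not matched, so this pattern is best suited to plain English text.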

In [9]:
corpus = readCorpus("toy.csv",list_of_fields_to_read=['data'],tokenizer=tokenizer)
tokens = [texto for texto in corpus]   # one list of tokens per document
print(tokens)


[['separation', 'apparatus', 'this', 'invention', 'relates', 'to', 'a', 'method', 'for', 'separation', 'of', 'a', 'light', 'material', 'from', 'a', 'heavier', 'material', 'a', 'separation', 'table', 'of', 'vibrator', 'type', 'and', 'a', 'cyclone', 'and', 'a', 'fan', 'means', 'being', 'found', 'the', 'invention', 'also', 'relates', 'to', 'an', 'arrangement', 'for', 'making', 'the', 'separation', 'possible', 'it', 'is', 'previously', 'known', 'by', 'means', 'of', 'the', 'mentioned', 'equipment', 'to', 'separate', 'a', 'material', 'from', 'another', 'one', 'the', 'air', 'following', 'the', 'material', 'which', 'is', 'sucked', 'into', 'the', 'cyclone', 'often', 'contains', 'substances', 'which', 'is', 'damaging', 'for', 'the', 'environment', 'and', 'this', 'air', 'according', 'to', 'previously', 'known', 'technique', 'has', 'been', 'let', 'out', 'into', 'the', 'atmosphere', 'this', 'is', 'of', 'course', 'unsatisfying', 'in', 'the', 'society', 'of', 'today', 'the', 'purpose', 'of', 'this', 

### Stopword removal

In [10]:
from nltk import download
from nltk.corpus import stopwords
download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/andreiabonfante/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Stopword list available in NLTK

In [11]:
stop_words = stopwords.words('english')
print(stop_words)


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Removing stopwords

In [12]:
stop_set = set(stop_words)  # set lookup is much faster than scanning the list for every token
tokens_noStp = [[word for word in texto if word not in stop_set] for texto in tokens]
print(tokens_noStp)

[['separation', 'apparatus', 'invention', 'relates', 'method', 'separation', 'light', 'material', 'heavier', 'material', 'separation', 'table', 'vibrator', 'type', 'cyclone', 'fan', 'means', 'found', 'invention', 'also', 'relates', 'arrangement', 'making', 'separation', 'possible', 'previously', 'known', 'means', 'mentioned', 'equipment', 'separate', 'material', 'another', 'one', 'air', 'following', 'material', 'sucked', 'cyclone', 'often', 'contains', 'substances', 'damaging', 'environment', 'air', 'according', 'previously', 'known', 'technique', 'let', 'atmosphere', 'course', 'unsatisfying', 'society', 'today', 'purpose', 'invention', 'eliminate', 'problem', 'provide', 'arrangement', 'air', 'system', 'closed', 'furthermore', 'arrangement', 'design', 'particularly', 'suitable', 'separation', 'light', 'material', 'instance', 'granulate', 'heavier', 'material', 'like', 'instance', 'metal', 'stones', 'preferred', 'embodiment', 'invention', 'shall', 'described', 'closely', 'reference', 'a
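The filtering step can be sketched without downloading any NLTK data. The stopword set below is a tiny hand-picked sample for illustration, not the NLTK list:

```python
# Tiny illustrative stopword sample (assumption: NOT the full NLTK list).
sample_stop_words = {"this", "to", "a", "for", "of", "from"}

docs = [
    ["this", "invention", "relates", "to", "a", "method"],
    ["separation", "of", "a", "light", "material", "from", "a", "heavier", "material"],
]

# Keep only the tokens that are not in the stopword set, document by document.
docs_no_stop = [[w for w in doc if w not in sample_stop_words] for doc in docs]
print(docs_no_stop)
```

Repeated content words such as `material` are kept on purpose: their counts are what the later bag-of-words step relies on.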

#### Reducing words to their base form (using a lemmatizer or a stemmer)

Reference: https://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization

In [13]:
# Lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
tokens_lem = [[wordnet_lemmatizer.lemmatize(word) for word in texto] for texto in tokens_noStp]
print(tokens_lem)

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/andreiabonfante/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[['separation', 'apparatus', 'invention', 'relates', 'method', 'separation', 'light', 'material', 'heavier', 'material', 'separation', 'table', 'vibrator', 'type', 'cyclone', 'fan', 'mean', 'found', 'invention', 'also', 'relates', 'arrangement', 'making', 'separation', 'possible', 'previously', 'known', 'mean', 'mentioned', 'equipment', 'separate', 'material', 'another', 'one', 'air', 'following', 'material', 'sucked', 'cyclone', 'often', 'contains', 'substance', 'damaging', 'environment', 'air', 'according', 'previously', 'known', 'technique', 'let', 'atmosphere', 'course', 'unsatisfying', 'society', 'today', 'purpose', 'invention', 'eliminate', 'problem', 'provide', 'arrangement', 'air', 'system', 'closed', 'furthermore', 'arrangement', 'design', 'particularly', 'suitable', 'separation', 'light', 'material', 'instance', 'granulate', 'heavi

In [17]:
# Stemmer
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
tokens_stem = [[porter_stemmer.stem(word) for word in texto] for texto in tokens_noStp]
print(tokens_stem)

[['separ', 'apparatu', 'invent', 'relat', 'method', 'separ', 'light', 'materi', 'heavier', 'materi', 'separ', 'tabl', 'vibrat', 'type', 'cyclon', 'fan', 'mean', 'found', 'invent', 'also', 'relat', 'arrang', 'make', 'separ', 'possibl', 'previous', 'known', 'mean', 'mention', 'equip', 'separ', 'materi', 'anoth', 'one', 'air', 'follow', 'materi', 'suck', 'cyclon', 'often', 'contain', 'substanc', 'damag', 'environ', 'air', 'accord', 'previous', 'known', 'techniqu', 'let', 'atmospher', 'cours', 'unsatisfi', 'societi', 'today', 'purpos', 'invent', 'elimin', 'problem', 'provid', 'arrang', 'air', 'system', 'close', 'furthermor', 'arrang', 'design', 'particularli', 'suitabl', 'separ', 'light', 'materi', 'instanc', 'granul', 'heavier', 'materi', 'like', 'instanc', 'metal', 'stone', 'prefer', 'embodi', 'invent', 'shall', 'describ', 'close', 'refer', 'accompani', 'draw', 'arrang', 'accord', 'invent', 'make', 'separ', 'possibl', 'shown', 'refer', 'draw', 'shown', 'essenti', 'horizont', 'separ', 'ta
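The two outputs above show the trade-off. The lemmatizer returns dictionary words (`means` becomes `mean`, `substances` becomes `substance`), while the Porter stemmer truncates more aggressively and can produce non-words such as `apparatu`. A small sketch of the stemmer on words taken from the first document; it assumes NLTK is installed, and the Porter stemmer itself needs no extra data download:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["separation", "apparatus", "invention", "relates", "material"]

# Map each word to its stem so the truncation is easy to inspect side by side.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stemming usually yields a smaller vocabulary, whereas lemmatization keeps the tokens readable; which one suits the corpus better is a judgment call.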

### Saving the serialized corpus to disk as (index, frequency) pairs

In [18]:
import gensim


In [19]:
# first, build the dictionary (each unique word is mapped to an index)
dictionary = gensim.corpora.Dictionary(tokens_lem)
dictionary.save('dictionary.dict')
print(dictionary.token2id)

{'accompanying': 0, 'according': 1, 'air': 2, 'also': 3, 'another': 4, 'apparatus': 5, 'applicable': 6, 'arrangement': 7, 'atmosphere': 8, 'belt': 9, 'case': 10, 'closed': 11, 'closely': 12, 'contains': 13, 'conveyor': 14, 'course': 15, 'cyclone': 16, 'damaging': 17, 'described': 18, 'design': 19, 'drawing': 20, 'eliminate': 21, 'embodiment': 22, 'end': 23, 'environment': 24, 'equipment': 25, 'essentially': 26, 'fan': 27, 'fed': 28, 'feeding': 29, 'flute': 30, 'following': 31, 'found': 32, 'front': 33, 'furthermore': 34, 'granulate': 35, 'heavier': 36, 'heavy': 37, 'horizontal': 38, 'idea': 39, 'instance': 40, 'intended': 41, 'invention': 42, 'known': 43, 'let': 44, 'light': 45, 'like': 46, 'made': 47, 'making': 48, 'material': 49, 'mean': 50, 'mentioned': 51, 'metal': 52, 'method': 53, 'mm': 54, 'often': 55, 'one': 56, 'particularly': 57, 'possible': 58, 'preferred': 59, 'previously': 60, 'problem': 61, 'provide': 62, 'purpose': 63, 'rear': 64, 'reference': 65, 'referring': 66, 'relat
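Conceptually, the dictionary is a mapping from each unique token to an integer id. A pure-Python sketch of that idea follows. `build_token2id` is a hypothetical helper, and the ids it assigns (alphabetical over the whole corpus) are not guaranteed to match the ones gensim assigns:

```python
def build_token2id(docs):
    """Hypothetical helper: map each unique token to an integer id, in alphabetical order."""
    vocabulary = sorted({token for doc in docs for token in doc})
    return {token: idx for idx, token in enumerate(vocabulary)}

docs = [["separation", "apparatus", "invention"], ["method", "separation"]]
token2id = build_token2id(docs)
print(token2id)
```

The size of this mapping is the dimensionality of the vectors built in the next steps, which is why stopword removal and lemmatization are applied first.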

#### Bag-of-Words representation (word counts)

In [20]:
dictionary = gensim.corpora.Dictionary.load("dictionary.dict") # load the dictionary from disk

bowcorpus = [dictionary.doc2bow(texto) for texto in tokens_lem] # vectorize into (index, frequency) pairs
gensim.corpora.MmCorpus.serialize('bowcorpus.mm', bowcorpus)  # write to disk
print(bowcorpus[0])

[(0, 1), (1, 3), (2, 3), (3, 1), (4, 1), (5, 2), (6, 1), (7, 4), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 2), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 2), (36, 2), (37, 1), (38, 1), (39, 1), (40, 2), (41, 1), (42, 6), (43, 2), (44, 1), (45, 3), (46, 2), (47, 1), (48, 2), (49, 10), (50, 3), (51, 1), (52, 2), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 2), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 2), (68, 1), (69, 2), (70, 1), (71, 9), (72, 2), (73, 2), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 1), (80, 1), (81, 4), (82, 1), (83, 1), (84, 3), (85, 1), (86, 3)]
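Each pair in the output is (token id, number of occurrences in the document). The same representation can be sketched with `collections.Counter`. Here `to_bow` and the small `token2id` mapping are hypothetical stand-ins for `dictionary.doc2bow` and the gensim dictionary; tokens missing from the mapping are skipped, which is assumed to match gensim's default behaviour:

```python
from collections import Counter

def to_bow(doc, token2id):
    """Hypothetical helper: return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = {"air": 0, "material": 1, "separation": 2}
doc = ["separation", "material", "air", "material", "unknown", "separation", "material"]
bow = to_bow(doc, token2id)
print(bow)
```

Only ids with a non-zero count are stored, so the representation stays sparse even when the vocabulary is large.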


#### Tf-idf representation (https://radimrehurek.com/gensim/tutorial.html)

In [21]:
bowcorpus = gensim.corpora.MmCorpus('bowcorpus.mm')
tfidf_vectorizer = gensim.models.TfidfModel(bowcorpus)
tfidf_corpus_matrix = tfidf_vectorizer[bowcorpus]
gensim.corpora.MmCorpus.serialize('tfidf_corpus_matrix.mm', tfidf_corpus_matrix)  # write to disk

print(tfidf_corpus_matrix[0])


[(0, 0.07863880912565448), (1, 0.10190151346903384), (2, 0.1444370169899047), (3, 0.020921727273285274), (4, 0.040160754557408335), (5, 0.05781329009169032), (6, 0.07863880912565448), (7, 0.23759912736612562), (8, 0.07863880912565448), (9, 0.059399781841531406), (10, 0.059399781841531406), (11, 0.07863880912565448), (12, 0.07863880912565448), (13, 0.059399781841531406), (14, 0.059399781841531406), (15, 0.15727761825130895), (16, 0.08032150911481667), (17, 0.07863880912565448), (18, 0.033967171156344615), (19, 0.07863880912565448), (20, 0.15727761825130895), (21, 0.07863880912565448), (22, 0.15727761825130895), (23, 0.08032150911481667), (24, 0.07863880912565448), (25, 0.059399781841531406), (26, 0.059399781841531406), (27, 0.07863880912565448), (28, 0.04814567232996823), (29, 0.15727761825130895), (30, 0.07863880912565448), (31, 0.059399781841531406), (32, 0.04814567232996823), (33, 0.07863880912565448), (34, 0.059399781841531406), (35, 0.15727761825130895), (36, 0.08032150911481667), 
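The weights above come from gensim's `TfidfModel`. Under what are understood to be its default settings, each weight is the raw term frequency multiplied by `log2(N / df)` (N documents in the corpus, df documents containing the term), and each document vector is then scaled to unit length. A pure-Python sketch under that assumption follows; `tfidf` is a hypothetical helper, not gensim's code:

```python
import math
from collections import Counter

def tfidf(bow_corpus):
    """Hypothetical helper: tf * log2(N / df), followed by L2 normalization per document."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each token id occurs.
    df = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    result = []
    for doc in bow_corpus:
        weights = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in doc]
        # A term present in every document gets weight 0 and is dropped here.
        weights = [(tid, w) for tid, w in weights if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        result.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return result

# Two tiny documents in (token_id, count) form; token 0 occurs in both.
bow_corpus = [[(0, 1), (1, 2)], [(0, 3), (2, 1)]]
vectors = tfidf(bow_corpus)
print(vectors)
```

Token 0 appears in both documents, so it carries no discriminating information and disappears from both vectors, while the tokens unique to a document receive all of the weight.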