## Simple text classification using Deep Learning with Python

GitHub repository (https://github.com/Richard-Teske/text-classification-keras) 

### Importing the training / test files

In [1]:
import pandas as pd
import os
import datetime
import zipfile

begin = datetime.datetime.now()

file_path = %pwd

In [2]:
file_path

'D:\\GitHub\\text-classification-stack'

Extracting the zipped training and test files

In [3]:
dataset_folder = 'dataset'
zip_file = 'dataset.zip'

In [4]:
train_file = 'train.csv'
test_file = 'test.csv'

train_path = os.path.join(file_path, dataset_folder, train_file)
test_path = os.path.join(file_path, dataset_folder, test_file)

In [5]:
# Extract the archive only if the training file is not already on disk
if not os.path.isfile(train_path):
    # Use a name other than `zip` so the built-in function is not shadowed
    with zipfile.ZipFile(os.path.join(file_path, dataset_folder, zip_file), 'r') as zf:
        zf.extractall(os.path.join(file_path, dataset_folder))


In [6]:
print(train_path)
print(test_path)

D:\GitHub\text-classification-stack\dataset\train.csv
D:\GitHub\text-classification-stack\dataset\test.csv


<font color=red>*Make sure this is the correct directory for the training and test files in your environment*</font>

In [7]:
df_train_original = pd.read_csv(train_path, sep=';', header=None, names=['Body','Tag','Title'])
df_test_original = pd.read_csv(test_path, sep=';', header=None, names=['Body','Tag','Title'])

In [8]:
df_train_original.head(10)

Unnamed: 0,Body,Tag,Title
0,<p>I am new to Silverlight 2.0 and I am actual...,silverlight-2.0,Silverlight WebPart in Sharepoint
1,<p>I have been used to do some refactorings by...,refactoring,Is Refactoring by Compilation Errors Bad?
2,"<p>I've seen this questions <a href=""https://s...",coding-style,What is the name of this particular indent sty...
3,<p>How do you like your CRUD programs. Code-g...,refactoring,Code Generator vs Code Refactoring
4,"<p>Reading <a href=""https://stackoverflow.com/...",build-process,Continuous Integration vs. Nightly Builds
5,<p>It should first be noted that I am trying t...,build-process,VBC + NAnt. Error compiling WinForm
6,<p>SiteA.com and siteB.com are .NET 2.0 apps o...,.net-2.0,How can one IIS6 .NET app appear as subsite of...
7,"<p>In .NET 1.x, you could use the <a href=""htt...",.net-2.0,How can I prevent unauthorized code from acces...
8,<p>I heard google has some automated process l...,build-process,Build Server Best Practices
9,<p>I am trying to run a .Net 2.0 application f...,.net-2.0,.Net 2.0 application from network share withou...


## Data Cleaning

> `This dataset comes from the public StackOverflow database, which can be downloaded for free as an MDF file. In the original data structure, the Body contains HTML tags that need to be handled, and there are stopwords that need to be removed.`

In [9]:
from nltk.corpus import stopwords 
from nltk import word_tokenize
import re

stopWords_original = stopwords.words('english')

`NLTK ships predefined stopword lists for several languages (including Portuguese).
I chose to apply the same kind of cleaning to the original stopwords: special characters are removed and every word is lowercased (the original list is already lowercase).`
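
As a quick illustration of what that cleaning does to individual stopwords, here is a minimal sketch using the same regular expression as the notebook. The `sample` list is made up for the example and is not the NLTK list itself.

```python
import re

# A few hand-picked words; the real list comes from nltk.corpus.stopwords
sample = ["you're", "it's", "Each"]

# Lowercase first, then drop every character that is not a-z
cleaned = [re.sub('[^a-z]+', '', w.lower()) for w in sample]
print(cleaned)  # ['youre', 'its', 'each']
```

Note that contractions collapse into a single token (`"you're"` becomes `youre`), which is why `its` appears twice in the cleaned list shown in the next cells.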

In [10]:
stopWords_original

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [11]:
stopWords = []

for word in stopWords_original:
    stopWords.append(re.sub('[^a-z]+','',word.lower()))

stopWords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'youre',
 'youve',
 'youll',
 'youd',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'shes',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'thatll',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'fe

In [12]:
df_test = pd.DataFrame()
df_train = pd.DataFrame()

`Removes all special characters from the training and test files (including the HTML tags in the Body) and converts the text to lowercase.
The Body and Title of the dataset are also merged into a single field`
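
The cleaning steps described above can be sketched on a single made-up row. `clean_text` is a hypothetical helper, not part of the notebook, and `str.split` stands in for `nltk.word_tokenize` so the sketch runs without downloading NLTK data.

```python
import re

def clean_text(title, body, stop_words):
    """Hypothetical helper mirroring the notebook's cleaning steps."""
    # Strip HTML tags from the body before removing other characters
    body = re.sub('<[^>]+>', ' ', body.lower())
    # Merge Title and Body, then keep only the letters a-z
    text = re.sub('[^a-z]+', ' ', title.lower() + ' ' + body)
    # str.split is an assumption standing in for nltk.word_tokenize
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = clean_text('Is Refactoring Bad?',
                    '<p>I have <b>some</b> code.</p>',
                    {'is', 'i', 'have', 'some'})
print(sample)  # refactoring bad code
```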

<font size=3>**Test**</font>

In [13]:
# Clean every test row: strip HTML tags, keep only letters, drop stopwords
test_rows = []
for index, rows in df_test_original.iterrows():

    title = word_tokenize(re.sub('[^a-z]+',' ',rows['Title'].lower()))
    body = word_tokenize(re.sub('[^a-z]+',' ',re.sub('<[^>]+>',' ',rows['Body'].lower())))

    # Merge Title and Body tokens, discarding stopwords
    content = [w for w in title + body if w not in stopWords]
    test_rows.append((' '.join(content), rows['Tag']))

# Build the DataFrame once; calling pd.concat inside the loop copies the frame on every row
df_test = pd.DataFrame(test_rows)

<font size=3>**Train**</font>

In [14]:
# Same cleaning as above, applied to the training rows
train_rows = []
for index, rows in df_train_original.iterrows():

    title = word_tokenize(re.sub('[^a-z]+',' ',rows['Title'].lower()))
    body = word_tokenize(re.sub('[^a-z]+',' ',re.sub('<[^>]+>',' ',rows['Body'].lower())))

    # Merge Title and Body tokens, discarding stopwords
    content = [w for w in title + body if w not in stopWords]
    train_rows.append((' '.join(content), rows['Tag']))

# Build the DataFrame once; calling pd.concat inside the loop copies the frame on every row
df_train = pd.DataFrame(train_rows)

In [15]:
df_test.columns = ['Content','Tag']
df_train.columns = ['Content', 'Tag']

In [16]:
df_test.head(10)

Unnamed: 0,Content,Tag
0,determine user timezone standard way web serve...,javascript
1,swap unique indexed column values database dat...,sql
2,efficient code first prime numbers want print ...,performance
3,build windows nt using visual studio mfc appli...,c++
4,best way access exchange using php writing cms...,php
5,deploying sql server databases test live wonde...,sql-server
6,programmatically launch ie mobile favorites sc...,internet-explorer
7,lucene score results lucene multiple indexes c...,search
8,use unsigned values signed ones appropriate us...,language-agnostic
9,use nested classes case working collection cla...,c++


## Data Preprocessing

In [17]:
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
from keras import utils
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
import numpy as np

Using TensorFlow backend.


In [18]:
num_Words = 1000

`Maximum number of words used by the training model (top N words)
NOTE: Increasing this value also increases the memory consumption on your machine
Keep an eye on how much memory you can use for training, otherwise you will hit an out-of-memory error`
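
To get a feel for the memory warning, the sketch below estimates the size of a dense matrix with one column per word. It assumes 8 bytes per value (NumPy's default `float64`); the dtype your Keras version actually uses is an assumption here. `dense_matrix_gib` is a hypothetical helper, and the sample count is the 209956 + 23329 training rows reported in the training log later in this notebook.

```python
def dense_matrix_gib(n_samples, num_words, bytes_per_value=8):
    """Rough size in GiB of a dense (n_samples x num_words) matrix."""
    return n_samples * num_words * bytes_per_value / 1024 ** 3

n_train = 209956 + 23329  # training + validation samples from the log

for n in (1000, 5000, 10000):
    print(n, 'words ->', round(dense_matrix_gib(n_train, n), 2), 'GiB')
```

Under these assumptions the estimate grows linearly with `num_Words`, so a tenfold larger vocabulary needs roughly ten times the memory for the training matrix alone.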

In [19]:
tokenizer = Tokenizer(num_words=num_Words)
tokenizer.fit_on_texts(df_train['Content'])

### Converting texts to a matrix

In [20]:
x_train = tokenizer.texts_to_matrix(df_train['Content'])
x_test = tokenizer.texts_to_matrix(df_test['Content'])
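
For intuition, here is a pure-Python sketch of a binary bag-of-words matrix similar in spirit to what `texts_to_matrix` produces in its default mode. `fit_vocab` and `texts_to_binary_matrix` are hypothetical helpers, not Keras functions, and the reserved index 0 reflects my understanding of the Keras `Tokenizer` rather than something stated in this notebook.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Map the most frequent words to column indices 1..num_words-1."""
    counts = Counter(w for t in texts for w in t.split())
    # Index 0 is kept unused, so only (num_words - 1) words get a column
    top = counts.most_common(num_words - 1)
    return {w: i + 1 for i, (w, _) in enumerate(top)}

def texts_to_binary_matrix(texts, vocab, num_words):
    """One row per text; a cell is 1.0 if that word occurs in the text."""
    matrix = []
    for t in texts:
        row = [0.0] * num_words
        for w in t.split():
            if w in vocab:
                row[vocab[w]] = 1.0
        matrix.append(row)
    return matrix

texts = ['sql join sql', 'php array', 'sql index']
vocab = fit_vocab(texts, num_words=4)
matrix = texts_to_binary_matrix(texts, vocab, num_words=4)
for row in matrix:
    print(row)
```

Words outside the top-N vocabulary are simply ignored, which is why a small `num_Words` loses information about rare tags.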

In [21]:
encoder = LabelEncoder()
encoder.fit(df_train['Tag'])

LabelEncoder()

In [22]:
y_train = encoder.transform(df_train['Tag'])
y_test = encoder.transform(df_test['Tag'])

In [23]:
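# Count the classes known to the encoder rather than relying on the highest label in the test set
n_classes = len(encoder.classes_)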
n_classes

136

In [24]:
y_train = utils.to_categorical(y_train, num_classes=n_classes)
y_test = utils.to_categorical(y_test, num_classes=n_classes)
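
The two label steps above (integer encoding, then one-hot encoding) can be sketched in plain Python. This is only an illustration of the idea with made-up tags; it does not call `LabelEncoder` or `to_categorical`.

```python
tags = ['sql', 'php', 'c++', 'sql']

# Sorted unique tags play the role of encoder.classes_
classes = sorted(set(tags))
index = {c: i for i, c in enumerate(classes)}

# Integer encoding: each tag becomes its position in the sorted class list
encoded = [index[t] for t in tags]

# One-hot encoding: a vector with a single 1.0 at the class position
one_hot = [[1.0 if i == e else 0.0 for i in range(len(classes))]
           for e in encoded]

print(classes)   # ['c++', 'php', 'sql']
print(encoded)   # [2, 1, 0, 2]
print(one_hot[0])  # [0.0, 0.0, 1.0]
```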

In [25]:
## Number of samples processed per training step
batch_size = 100
## Number of times the neural network passes over the whole dataset
epochs = 10

### Layers 

In [26]:
model = Sequential()
model.add(Dense(512, input_shape=(num_Words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes))
model.add(Activation('softmax'))
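
As a sanity check on the size of this network, the trainable parameters of the two `Dense` layers can be counted by hand (weights plus biases). `dense_params` is a hypothetical helper; the sizes come from `num_Words = 1000` and `n_classes = 136` shown earlier.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(1000, 512)  # input layer -> 512 hidden units
output = dense_params(512, 136)   # hidden units -> 136 tags

print(hidden, output, hidden + output)  # 512512 69768 582280
```

Activation and Dropout layers add no parameters, so the total should match what `model.summary()` reports if the shapes above hold in your run.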

In [27]:
model.compile(  loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

In [28]:
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=2,
                    validation_split=0.1)

Train on 209956 samples, validate on 23329 samples
Epoch 1/10
 - 30s - loss: 1.1121 - acc: 0.7297 - val_loss: 0.8776 - val_acc: 0.7691
Epoch 2/10
 - 29s - loss: 0.8244 - acc: 0.7692 - val_loss: 0.8286 - val_acc: 0.7746
Epoch 3/10
 - 30s - loss: 0.7505 - acc: 0.7821 - val_loss: 0.8233 - val_acc: 0.7768
Epoch 4/10
 - 30s - loss: 0.6997 - acc: 0.7921 - val_loss: 0.8226 - val_acc: 0.7763
Epoch 5/10
 - 31s - loss: 0.6543 - acc: 0.8022 - val_loss: 0.8260 - val_acc: 0.7746
Epoch 6/10
 - 31s - loss: 0.6159 - acc: 0.8114 - val_loss: 0.8393 - val_acc: 0.7744
Epoch 7/10
 - 31s - loss: 0.5809 - acc: 0.8197 - val_loss: 0.8578 - val_acc: 0.7762
Epoch 8/10
 - 32s - loss: 0.5511 - acc: 0.8257 - val_loss: 0.8724 - val_acc: 0.7758
Epoch 9/10
 - 33s - loss: 0.5275 - acc: 0.8327 - val_loss: 0.9057 - val_acc: 0.7728
Epoch 10/10
 - 32s - loss: 0.5007 - acc: 0.8395 - val_loss: 0.9222 - val_acc: 0.7714
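
In the log above, the training loss keeps falling while the validation loss starts rising after a few epochs, which usually indicates overfitting. The sketch below just locates the epoch with the lowest validation loss from the values copied out of that log. Keras also offers an `EarlyStopping` callback that can stop training automatically; whether to use it here is a judgment call, not something the original notebook does.

```python
# val_loss values copied from the training log, epochs 1..10
val_loss = [0.8776, 0.8286, 0.8233, 0.8226, 0.8260,
            0.8393, 0.8578, 0.8724, 0.9057, 0.9222]

# Epochs are 1-based in the log, list indices are 0-based
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

print('Lowest validation loss at epoch', best_epoch)  # epoch 4
```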


In [29]:
score = model.evaluate(x_test, y_test,
                    batch_size= batch_size,
                    verbose=1)

print('Test score: ', score[0])
print('Test accuracy: ',score[1])

Test score:  0.8788562448718217
Test accuracy:  0.7733567661895484


`All the tags used in training`

In [30]:
labels = encoder.classes_
labels

array(['.net-2.0', '.net-3.5', 'actionscript', 'actionscript-3',
       'activerecord', 'ado.net', 'air', 'amazon-web-services',
       'android-emulator', 'ant', 'apache', 'api', 'arrays',
       'asp-classic', 'automated-tests', 'azure', 'bash', 'build-process',
       'button', 'c++', 'caching', 'cassandra', 'charts', 'cmake',
       'cocos2d-iphone', 'coding-style', 'collections',
       'compiler-construction', 'cookies', 'core-data', 'cryptography',
       'crystal-reports', 'csv', 'database-design', 'delphi',
       'dependency-injection', 'design-patterns', 'devexpress',
       'dictionary', 'diff', 'django', 'dll', 'dynamic', 'dynamics-crm',
       'embedded', 'ffmpeg', 'file-io', 'file-upload', 'filesystems',
       'firefox', 'floating-point', 'fortran', 'frameworks', 'function',
       'g++', 'geometry', 'google-app-engine', 'graph', 'html5', 'iframe',
       'image', 'indexing', 'installer', 'internet-explorer', 'ios4',
       'javascript', 'jpa', 'jqgrid', 'jquery-plugins

`Tests with some rows from the original dataset`

In [31]:
# Predict the tag for the first 10 test rows and compare with the real tag
for n in range(0,10):
    prediction = model.predict(np.array([x_test[n]]))
    # The index of the highest probability maps back to a tag name
    predicted_label = labels[np.argmax(prediction)]
    print('Question: '+ df_test_original['Title'].iloc[n])
    print('Actual label: ' + df_test_original['Tag'].iloc[n])
    print("Predicted label: " + predicted_label + "\n")

Question: Determine a User's Timezone
Actual label: javascript
Predicted label: php

Question: Swap unique indexed column values in database
Actual label: sql
Predicted label: sql

Question: Most efficient code for the first 10000 prime numbers?
Actual label: performance
Predicted label: c++

Question: Build for Windows NT 4.0 using Visual Studio 2005?
Actual label: c++
Predicted label: c++

Question: Best way to access Exchange using PHP?
Actual label: php
Predicted label: php

Question: Deploying SQL Server Databases from Test to Live
Actual label: sql-server
Predicted label: sql

Question: Programmatically launch IE Mobile favorites screen
Actual label: internet-explorer
Predicted label: internet-explorer

Question: Lucene Score results
Actual label: search
Predicted label: search

Question: When to use unsigned values over signed ones?
Actual label: language-agnostic
Predicted label: loops

Question: Should I use nested classes in this case?
Actual label: c++
Predicted label: c++



In [33]:
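end = datetime.datetime.now()

diff = end - begin

# Split the elapsed time into whole hours and the remaining minutes
hours, remainder = divmod(int(diff.total_seconds()), 3600)
minutes = remainder // 60

print('Execution time: %d hours and %d minutes' % (hours, minutes))

Execution time: 1 hours and 2 minutes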


*`You can save the trained model to a file and use it for classification later without retraining`*

In [34]:
import pickle

model.save('model.h5')
np.save('labels.npy',labels)
with open('tokenizer.pickle','wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

`It is also important to save your labels and the tokenizer, which holds properties used during training.
The labels are saved to a .npy file (NumPy), the tokenizer to a .pickle file (pickle), and the model to an .h5 file (HDF5)`

**<font color=red>IMPORTANT: If you save the labels or tokenizer file in another format, make sure the files keep their contents exactly as they were during training.
Any ordering that differs from the one used in training will return wrong predictions</font>**
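
To see why the order matters, here is a small sketch with made-up labels and probabilities. It stores the labels as JSON (an alternative format chosen only for illustration, not what this notebook uses) and shows that the same prediction maps to a different tag once the list is reordered.

```python
import json

labels = ['.net-2.0', 'c++', 'php', 'sql']

# A JSON round trip keeps the list order intact
restored = json.loads(json.dumps(labels))

# Made-up softmax output: position 2 has the highest probability
probs = [0.05, 0.10, 0.70, 0.15]
best = max(range(len(probs)), key=probs.__getitem__)

print(restored[best])                      # php
print(sorted(labels, reverse=True)[best])  # c++ (wrong: the order changed)
```

The model only outputs positions, so the label list is the sole link between a position and a tag name.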

`To reuse the training files, simply load them back`

In [35]:
from keras.models import load_model

model = load_model('model.h5')
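# The labels are stored as Python strings (object dtype), which recent NumPy versions only load with allow_pickle=True
labels = np.load('labels.npy', allow_pickle=True)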
with open('tokenizer.pickle','rb') as handle:
    tokenizer = pickle.load(handle)

In [36]:
texts = ['How can I redirect the user from one page to another using jQuery or pure JavaScript?',
        'What is the difference between “INNER JOIN” and “OUTER JOIN” in SQL Server?',
        'How do you give a C# Auto-Property a default value? I either use the constructor, or revert to the old syntax.',
        'In PHP 5, what is the difference between using self and $this? When is each appropriate?',
        'What are valid values for the id attribute in HTML?',
        'Optimizing Lucene performance',
        'Continuous Integration System for Delphi',
        'How does database indexing work?']

In [37]:
for t in texts:
    # texts_to_matrix expects a list of texts, so wrap the single question
    X = tokenizer.texts_to_matrix([t])
    prediction = model.predict(X)
    predicted_label = labels[np.argmax(prediction)]
    print('Question: ',t)
    print('Predicted tag: ',predicted_label)
    print('\n')

Question:  How can I redirect the user from one page to another using jQuery or pure JavaScript?
Predicted tag:  javascript


Question:  What is the difference between “INNER JOIN” and “OUTER JOIN” in SQL Server?
Predicted tag:  sql-server


Question:  How do you give a C# Auto-Property a default value? I either use the constructor, or revert to the old syntax.
Predicted tag:  c++


Question:  In PHP 5, what is the difference between using self and $this? When is each appropriate?
Predicted tag:  php


Question:  What are valid values for the id attribute in HTML?
Predicted tag:  html5


Question:  Optimizing Lucene performance
Predicted tag:  performance


Question:  Continuous Integration System for Delphi
Predicted tag:  delphi


Question:  How does database indexing work?
Predicted tag:  sql




### <center>That's all. Suggestions and criticism are welcome. Happy studying, and thanks for stopping by.</center>
#### <center>If you have any questions, feel free to reach out :)</center>

LinkedIn (https://www.linkedin.com/in/richard-teske-25b88214b/) <br>
Email (richardaraujo.dba@gmail.com) <br>
GitHub (https://github.com/Richard-Teske)