## NLP (Natural Language Processing)
### Tasks
* Text classification
* Text generation
* Text summarization
* Named Entity Recognition (NER)
* Word embedding
* Part-of-speech (POS) tagging
* Sequence tagging
* Language modelling
* Machine translation
* Speech recognition
* Question answering

Sources:
* https://realpython.com/python-keras-text-classification/
* https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
* http://norvig.com/spell-correct.html
* https://medium.com/@BhashkarKunal/spelling-correction-using-deep-learning-how-bi-directional-lstm-with-attention-flow-works-in-366fabcc7a2f


## BOW (Bag of Words) method
* A mechanism for representing text in numeric form
* A vocabulary maps words to numbers
* Paragraphs are converted into sequences of numbers

### Representing textual data

* Corpus: the full set of sentences (documents) in the dataset. The BOW model counts how often each word occurs in each sentence and relates those counts to the set of all words in the corpus.

* Vocabulary: the list of words that occur in a text, with an index assigned to each word.

* A text can be represented as a matrix. Each sentence becomes a vector whose length equals the number of words in the vocabulary. The vector holds 0 for every vocabulary word that is absent from the sentence and, for every word that is present, the number of times it appears.

* This vector is called the feature vector.
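The representation described above can be sketched in plain Python, without any library. This is only an illustration under simplifying assumptions: the toy sentences are already lowercased and stripped of punctuation, and `to_feature_vector` is a hypothetical helper name, not part of scikit-learn.

```python
from collections import Counter

# Toy sentences, lowercased and without punctuation to keep tokenization trivial.
sentences = [
    "john likes ice cream",
    "john hates chocolate",
    "john hates chocolate and john",
]

# Vocabulary: every distinct word, sorted alphabetically and mapped to a column index.
all_words = sorted({word for sentence in sentences for word in sentence.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def to_feature_vector(sentence, vocabulary):
    """Return the word counts of `sentence`, one slot per vocabulary word."""
    counts = Counter(sentence.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

# One row per sentence, one column per vocabulary word.
matrix = [to_feature_vector(s, vocabulary) for s in sentences]
for row in matrix:
    print(row)
```

Each row has as many columns as the vocabulary has words, and a word that appears twice in a sentence (such as `john` in the last one) yields a count of 2 in its column.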


## The following example shows how to represent sentences numerically

In [None]:
sentences = ['John likes ice cream', 'John hates chocolate.', 'john hates chocolate and John.']

## CountVectorizer

* `min_df`: when building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. An integer is an absolute document count; a float is a proportion of documents.
* `lowercase`: convert all text to lowercase before tokenizing.
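To make the effect of the two parameters concrete, here is a minimal plain-Python sketch of the same idea. `build_vocabulary` is a hypothetical helper written for illustration; its tokenization (strip periods, split on whitespace) is much simpler than what `CountVectorizer` actually does, so treat it as an approximation of the thresholding logic only.

```python
from collections import Counter

def build_vocabulary(documents, min_df=1, lowercase=True):
    """Hypothetical helper: keep terms whose document frequency is >= min_df."""
    tokenized = []
    for doc in documents:
        text = doc.lower() if lowercase else doc
        tokenized.append(text.replace(".", "").split())
    # Document frequency counts each term at most once per document.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # A float min_df is read as a proportion of the number of documents.
    threshold = min_df * len(documents) if isinstance(min_df, float) else min_df
    kept = sorted(term for term, df in doc_freq.items() if df >= threshold)
    return {term: idx for idx, term in enumerate(kept)}

docs = ["John likes ice cream", "John hates chocolate.", "john hates chocolate and John."]

print(build_vocabulary(docs, min_df=1, lowercase=False))  # 'John' and 'john' stay distinct
print(build_vocabulary(docs, min_df=2, lowercase=True))   # only terms found in >= 2 documents
```

With `lowercase=False` the vocabulary keeps `John` and `john` as separate entries, while raising `min_df` to 2 drops every term that occurs in a single document.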

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 keeps every term; some scikit-learn versions reject the integer 0 here.
vectorizer = CountVectorizer(min_df=1, lowercase=False)
#vectorizer = CountVectorizer(min_df=1, lowercase=True)
vectorizer.fit(sentences)
vectorizer.vocabulary_

## Vectorizing the text
* Using the vocabulary just built, convert the text into numbers

In [None]:
vectorizer.transform(sentences).toarray()

## Dataset used for the NLP experiments

https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

### The dataset has 3 columns
* Source: Yelp, Amazon, or IMDb
* Text
* Label (0 or 1)
    * 1 = positive text
    * 0 = negative text

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
filepath_dict = {'yelp':   'data/sentiment/yelp_labelled.txt',
                 'amazon': 'data/sentiment/amazon_cells_labelled.txt',
                 'imdb':   'data/sentiment/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])

sentence    Wow... Loved this place.
label                              1
source                          yelp
Name: 0, dtype: object


In [None]:
df.describe()

## Preparing the data for the model

In [6]:
from sklearn.model_selection import train_test_split

df_yelp = df[df['source'] == 'yelp']

sentences = df_yelp['sentence'].values

y = df_yelp['label'].values

sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)

## Vectorizing the sentences

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)
X_train

<750x1714 sparse matrix of type '<class 'numpy.int64'>'
	with 7368 stored elements in Compressed Sparse Row format>

## Applying logistic regression

In [9]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)

print("Accuracy:", score)

Accuracy: 0.796




## Evaluating accuracy for every source

In [10]:
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    y = df_source['label'].values

    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test  = vectorizer.transform(sentences_test)

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))

Accuracy for yelp data: 0.7960
Accuracy for amazon data: 0.7960
Accuracy for imdb data: 0.7487


## Building a neural network to classify reviews as positive or negative

In [None]:
from keras.models import Sequential
from keras import layers

input_dim = X_train.shape[1]  # Number of features

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [None]:
history = model.fit(X_train, y_train, epochs=100, verbose=False, validation_data=(X_test, y_test) , batch_size=10)

In [None]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

In [25]:
#plt.style.use('ggplot')

def plot_history(history):
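    # Older Keras logs 'acc'/'val_acc'; newer versions log 'accuracy'/'val_accuracy'.
    acc = history.history.get('acc', history.history.get('accuracy'))
    val_acc = history.history.get('val_acc', history.history.get('val_accuracy'))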
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

In [None]:
plot_history(history)

## Data representation strategies

There are several ways to vectorize text:

* Words, with each word represented as a vector
* Characters, with each character represented as a vector
* N-grams of words or characters, each represented as a vector
  * N-grams are overlapping groups of several consecutive words or characters in the text
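As a quick illustration of what "overlapping groups" means, the sketch below extracts word bigrams and character trigrams in plain Python. The `ngrams` function is a hypothetical helper written for this example.

```python
def ngrams(items, n):
    """Return all overlapping runs of n consecutive items as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word-level bigrams: each pair shares one word with its neighbor.
words = "john likes ice cream".split()
word_bigrams = ngrams(words, 2)

# Character-level trigrams of a single word.
char_trigrams = ["".join(gram) for gram in ngrams("cream", 3)]

print(word_bigrams)
print(char_trigrams)
```

A sequence of length `k` yields `k - n + 1` n-grams, so short texts produce few of them and sequences shorter than `n` produce none.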

In [None]:
cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
city_labels = encoder.fit_transform(cities)
city_labels

In [None]:
city_labels.shape

## One-hot encoder

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Note: scikit-learn 1.2 renamed this argument to sparse_output.
encoder = OneHotEncoder(sparse=False)
city_labels = city_labels.reshape((5, 1))
X=encoder.fit_transform(city_labels)

In [None]:
X.shape

In [None]:
X

## Word Embeddings


* This method represents words as dense word vectors (word embeddings) that are learned during training, unlike one-hot encoding, which is hardcoded. As a result, word embeddings pack more information into fewer dimensions.

* Word embeddings do not understand text the way a human would; they map the statistical structure of the language used in the corpus. The goal is to map semantic meaning into a geometric space, called the embedding space.

* Semantically similar words, such as numbers or colors, end up close together in the embedding space. If the embedding captures the relationships between words well, operations such as vector arithmetic become possible.
* A famous example in this field is the mapping King - Man + Woman = Queen.

* The cells that follow tokenize the training sentences into integer sequences, the input format an embedding layer expects.
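The vector-arithmetic idea can be illustrated without a trained model. The sketch below uses hand-made three-dimensional vectors that were invented for this example (they are not learned embeddings, and the axis meanings are an assumption) to show how King - Man + Woman can land nearest to Queen under cosine similarity.

```python
import math

# Hand-made toy vectors, NOT trained embeddings.
# The axes loosely stand for (royalty, male, female).
toy_vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With real embeddings such as word2vec or GloVe the same nearest-neighbor search runs over the whole vocabulary, and the match is approximate rather than exact.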

In [19]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

print(sentences_train[5])
print(X_train[5])
print(sentences_train.shape)

It was that year, however, that reminded us that Huston was still at the top of his game as evinced by his faithful adaptation of James Joyce's acclaimed novella "The Dead.  
[8, 12, 14, 311, 199, 14, 938, 394, 14, 939, 12, 112, 31, 1, 255, 4, 49, 173, 17, 940, 38, 49, 941, 565, 4, 942, 943, 944, 945, 1, 566]
(561,)


In [15]:
for word in ['the', 'all', 'novella']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))

the: 1
all: 27
novella: 945


* Sentences are vectorized into sequences of different lengths, depending on the number of words in each sentence (words outside the `num_words` most frequent are dropped).
* Padding, which means filling with zeros, produces a dataset with the same number of columns in every row.
* As an example, one row of the dataset is shown before and after padding.
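The padding step can be sketched in a few lines of plain Python. `pad_post` is a hypothetical helper written for illustration, not the Keras function. It pads at the end, as `padding='post'` does, and for sequences longer than `maxlen` it keeps the last `maxlen` items, which is meant to mirror Keras' default `truncating='pre'`. It assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to maxlen; keep the last maxlen items if longer."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[-maxlen:])  # drop items from the front when too long
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

sequences = [[170, 116, 390, 35], [1, 2, 3, 4, 5, 6, 7]]
padded = pad_post(sequences, maxlen=5)
for row in padded:
    print(row)
```

After padding, every row has exactly `maxlen` entries, so the list can be stacked into a rectangular array and fed to a model in batches.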

In [22]:
from keras.preprocessing.sequence import pad_sequences

print("Antes do Padding")
print(X_train[0])

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

print("Após Padding")
print(X_train[0, :])

Antes do Padding
[170 116 390  35   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]
Após Padding
[170 116 390  35   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]


## Keras Embedding Layer
* Embedding layer in an MLP model

In [23]:
from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
flatten_1 (Flatten)          (None, 5000)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                50010     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
Total params: 178,771
Trainable params: 178,771
Non-trainable params: 0
_________________________________________________________________
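The parameter counts in the summary above can be checked by hand. The short calculation below is a sketch of that arithmetic; the vocabulary size of 2,575 is not printed anywhere in the notebook and is inferred here from the embedding parameter count (128,750 / 50).

```python
vocab_size = 2575      # assumption: inferred from the summary, 128,750 / 50
embedding_dim = 50
maxlen = 100

# Embedding: one trainable vector of size embedding_dim per token id.
embedding_params = vocab_size * embedding_dim

# Flatten: the (maxlen, embedding_dim) output becomes a single vector, no parameters.
flatten_units = maxlen * embedding_dim

# Dense layers: one weight per input-output pair plus one bias per output unit.
dense_1_params = flatten_units * 10 + 10
dense_2_params = 10 * 1 + 1

total_params = embedding_params + dense_1_params + dense_2_params
print(embedding_params, flatten_units, dense_1_params, dense_2_params, total_params)
```

Most of the parameters sit in the embedding table and in the first dense layer, which is one reason a model this size can memorize a training set of a few hundred sentences.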


In [26]:
history = model.fit(X_train, y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

Training Accuracy: 1.0000
Testing Accuracy:  0.6684



## Keras Embedding Layer
* Embedding layer
* Max pooling layer
* MLP model


* Another way to work with embeddings is to add a MaxPooling1D / AveragePooling1D or GlobalMaxPooling1D / GlobalAveragePooling1D layer after the Embedding layer. Pooling layers can be thought of as a way to downsample (reduce the size of) the incoming feature vectors.

* Max pooling takes the maximum value of all features in the pool for each feature dimension.
* Average pooling takes the mean of all features in the pool for each feature dimension.
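The two global pooling operations can be sketched in plain Python on a tiny example. The input below stands in for the output of an embedding layer for one sentence: 4 time steps by 3 embedding dimensions, with values invented for illustration. The function names are hypothetical, not Keras APIs.

```python
def global_max_pool(sequence):
    """Maximum over the time axis: one value per embedding dimension."""
    return [max(column) for column in zip(*sequence)]

def global_average_pool(sequence):
    """Mean over the time axis: one value per embedding dimension."""
    return [sum(column) / len(column) for column in zip(*sequence)]

# Toy "embedded sentence": 4 tokens, each a 3-dimensional vector.
embedded = [
    [0.5, -1.0, 2.0],
    [1.5, 0.0, -2.0],
    [-1.0, 4.0, 3.0],
    [3.0, 1.0, 1.0],
]

max_pooled = global_max_pool(embedded)
avg_pooled = global_average_pool(embedded)
print(max_pooled)
print(avg_pooled)
```

Either way the sequence axis disappears and the result has one value per embedding dimension, which is why the dense layer after a global pooling layer needs far fewer weights than one placed after `Flatten`.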


In [28]:
from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                510       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
Total params: 129,271
Trainable params: 129,271
Non-trainable params: 0
_________________________________________________________________


In [29]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

Training Accuracy: 1.0000
Testing Accuracy:  0.7968



In [44]:
!ls ../glove


glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.zip
glove.6B.200d.txt  glove.6B.50d.txt


## Pre-trained word embeddings

In [31]:
import numpy as np

def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # Adding again 1 because of reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word] 
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix

In [45]:
embedding_dim = 50
embedding_matrix = create_embedding_matrix(
     '../glove/glove.6B.50d.txt',
     tokenizer.word_index, embedding_dim)

In [46]:
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
nonzero_elements / vocab_size

0.9522330097087378

In [None]:
## Using a pre-trained embedding (trainable = False)

In [47]:
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=False))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 50)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                510       
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11        
Total params: 129,271
Trainable params: 521
Non-trainable params: 128,750
_________________________________________________________________


In [48]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

Training Accuracy: 0.7790
Testing Accuracy:  0.7219



## Using a pre-trained embedding with trainable = True

In [49]:
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=True))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 50)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 10)                510       
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 11        
Total params: 129,271
Trainable params: 129,271
Non-trainable params: 0
_________________________________________________________________


In [50]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

Training Accuracy: 1.0000
Testing Accuracy:  0.7326



## Embeddings
* https://radimrehurek.com/gensim/models/word2vec.html
* https://nlp.stanford.edu/projects/glove/

## Classifying text by topic

In [51]:
import keras
from keras.datasets import reuters

(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)

word_index = reuters.get_word_index(path="reuters_word_index.json")

print('# of Training Samples: {}'.format(len(x_train)))
print('# of Test Samples: {}'.format(len(x_test)))

num_classes = max(y_train) + 1
print('# of Classes: {}'.format(num_classes))

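# Note: load_data shifts every word index by 3 by default (index_from=3) to reserve
# ids for padding, start, and unknown tokens, so this direct lookup is off by three
# positions and the decoded text below does not read as the original newswire.
index_to_word = {}
for key, value in word_index.items():
    index_to_word[value] = key
print(' '.join([index_to_word[x] for x in x_train[0]]))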
print(y_train[0])




Downloading data from https://s3.amazonaws.com/text-datasets/reuters.npz
Downloading data from https://s3.amazonaws.com/text-datasets/reuters_word_index.json
# of Training Samples: 8982
# of Test Samples: 2246
# of Classes: 46
the wattie nondiscriminatory mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs
3


In [52]:
from keras.preprocessing.text import Tokenizer

max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(x_train[0])
print(len(x_train[0]))

print(y_train[0])
print(len(y_train[0]))


[0. 1. 0. ... 0. 0. 0.]
10000
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
46
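What the cell above produces can be sketched in plain Python: a multi-hot row per document and a one-hot row per label. Both functions are hypothetical helpers that approximate the behavior of `sequences_to_matrix(mode='binary')` and `to_categorical` on small inputs; they are not the Keras implementations.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Row i gets 1.0 at column j when word index j occurs in sequence i."""
    matrix = []
    for seq in sequences:
        row = [0.0] * num_words
        for index in seq:
            if index < num_words:  # indices beyond the vocabulary cap are dropped
                row[index] = 1.0
        matrix.append(row)
    return matrix

def one_hot_label(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

x = sequences_to_binary_matrix([[1, 3, 3], [2, 7]], num_words=5)
y = one_hot_label(3, num_classes=46)
print(x)
print(y.index(1.0), len(y))
```

In binary mode, word order and repetition are discarded: a word that occurs three times leaves the same single 1.0 as a word that occurs once.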


In [56]:
y_test

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [57]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.metrics_names)
batch_size = 16
epochs = 10

history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1)
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])


['loss', 'acc']
Train on 8083 samples, validate on 899 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 1.1603857609223598
Test accuracy: 0.7947462155472882


## Sentiment Analysis

## Spelling Correction (Morphological Analysis)

File: https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt

In [61]:
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('../big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [63]:
print(correction('speling'))
print(correction('korrectud'))


spelling
corrected
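To get a feel for how many candidates a single edit generates, the sketch below repeats the four edit operations but returns them as a list, so duplicates can be counted. For a word of length n there are n deletions, n - 1 transpositions, 26n replacements, and 26(n + 1) insertions, which gives 54n + 25 candidates before deduplication. `edits1_with_duplicates` is a hypothetical variant written for this check and needs no word-frequency file.

```python
def edits1_with_duplicates(word, letters="abcdefghijklmnopqrstuvwxyz"):
    """Same edits as edits1, returned as a list so duplicates are preserved."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return deletes + transposes + replaces + inserts

word = "speling"
n = len(word)
raw = edits1_with_duplicates(word)
unique = set(raw)

print(len(raw), 54 * n + 25)   # raw count matches the closed-form formula
print(len(unique) < len(raw))  # the set is smaller because some edits coincide
print("spelling" in unique)    # the intended word is one insertion away
```

Only a small fraction of these strings are real words, which is why `known` filters the candidates against the dictionary before `P` ranks them.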
