# Preprocessing

Preprocess and clean the text with these techniques:

- Remove standard and custom stopwords
- Lemmatize
- Build n-grams
- Normalize brand names

Then create the **Word Embedding** and the **Autoencoder**.

![img](https://www.kdnuggets.com/wp-content/uploads/text-preprocessing-framework-2.png)

In [1]:
import pandas as pd
import numpy as np
import nltk.data
from nltk import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import gensim
from gensim.models import Word2Vec
from gensim.models import FastText
import re, unicodedata
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from nltk.stem import SnowballStemmer

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding
from keras.models import Model, Sequential
from keras.layers import Input, Flatten, Dense, Conv1D, MaxPooling1D, GlobalMaxPool1D, \
                          UpSampling1D, LSTM, RepeatVector, TimeDistributed
from keras.utils import plot_model
from sklearn.neighbors import NearestNeighbors

import os
from custom_functions import norm_text, norm_brands

pd.set_option('max_colwidth', 250)

Using TensorFlow backend.


In [2]:
path = os.path.join('../Data/')
path_models = os.path.join('../Models/')
print (os.listdir(path))

['204kProducts.csv', 'Brands.csv', 'Categories.csv', 'Descriptions204k.csv', 'desktop.ini', 'FinalItems', 'Images', 'Images.csv', 'lemmatization-es.txt', 'stopwords_catalan.txt', 'Texto_PreProcesado.csv']


In [3]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\enric\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\enric\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\enric\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
data = pd.read_csv(path + 'FinalItems/data_filtered.csv', sep = ';')
data = data[['item_id', 'brand', 'name']]
data.columns = ['item_id', 'brand', 'text']
data.head()

| | item_id | brand | text |
|---|---|---|---|
| 0 | A28233506 | Woman Limited El Corte Inglés | Abrigo masculino con textura de mujer |
| 1 | A29054782 | Woman Limited El Corte Inglés | Abrigo doble faz de mujer con cinturón a tono |
| 2 | A27354432 | Woman El Corte Inglés | Abrigo largo de antelina de mujer Woman El Corte Inglés |
| 3 | A28302706 | Lloyd's | Chaqueta térmica de mujer Lloyds con efecto cortavientos |
| 4 | A27435502 | Lloyd's | Parka 100% algodón de mujer Lloyds con capucha |


Build the list of brands so they can be removed from the text whenever a brand name appears in it. Duplicate brands are dropped to keep the list small and speed up the lookup.
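
A minimal sketch of that idea, assuming brands are plain strings and that a brand can be matched as a single token (multi-word brands such as `El Corte Inglés` would need phrase matching, which this sketch does not cover). `build_brand_set` and `strip_brands` are hypothetical helpers, not the functions in `norm_brands`:

```python
def build_brand_set(brands):
    """Lowercase the brands and drop duplicates; a set gives O(1) average lookups."""
    return {b.strip().lower() for b in brands if b and b.strip()}


def strip_brands(tokens, brand_set):
    """Remove every token that exactly matches a known brand."""
    return [t for t in tokens if t not in brand_set]


# Toy data; in the notebook the brands would come from data['brand'].
brand_set = build_brand_set(["Lloyds", "lloyds", "Ludilo", ""])
tokens = strip_brands(["chaqueta", "termica", "mujer", "lloyds"], brand_set)
print(sorted(brand_set))
print(tokens)
```

Using a set rather than a list is what makes the deduplication pay off: membership tests no longer scan every brand.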

In [5]:
data.loc[199000]

item_id                       A16252195
brand                          Fox Home
text       Futurama. 7ª Temporada (DVD)
Name: 199000, dtype: object

## Standardization

Standardization puts all text on the same footing: convert everything to a single case, remove punctuation, convert figures to their equivalents in words, and so on. This gives every word form equal weight and lets later processing treat the text uniformly. Some of the techniques we will apply are:

- Drop unusual characters
- Lowercase all text
- Remove punctuation characters (., &, !, ?, ¿, /, etc.)
- Convert numbers to text
- Remove stopwords
- Stem
- Lemmatize

**Important:** After standardization we work at the word (token) level rather than on whole texts.
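
A few of these steps can be sketched with the standard library alone. This is only an illustration covering lowercasing, accent stripping, punctuation removal, and stopword removal; `strip_accents` and `simple_normalize` are hypothetical names, and the notebook's actual logic lives in `norm_text.normalize`, which is not shown here:

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_normalize(text, stopwords):
    """Lowercase, strip accents, replace punctuation with spaces, drop stopwords."""
    text = strip_accents(text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in stopwords]


# The stopword set here is a toy assumption, not the notebook's STOPWORDS_ALL.
tokens = simple_normalize(
    "Chaqueta térmica de mujer, con efecto cortavientos!", {"de", "con"}
)
print(tokens)
```

The function returns a list of tokens, which matches the note above: after this step the unit of work is the token, not the raw string.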

In [6]:
data_copy = data.copy()

In [7]:
%%time
#3min
main_dir = os.path.join(os.path.dirname(os.path.abspath('05_Preprocessing.ipynb')), 'custom_functions')
brands = norm_brands.launch_normalizer(data_copy)
STOPWORDS_ALL = norm_text.gen_stopwords(main_dir)
lemmatizer_inv = norm_text.get_lemmatizer(main_dir)
# Normalize each product text token by token and assign the column back in one step
# (writing into `.values` element by element relies on NumPy view semantics).
data_copy['text'] = [
    norm_text.normalize(words=text.split(), p_brands=brands,
                        STOPWORDS_ALL=STOPWORDS_ALL,
                        lemmatizer_inv=lemmatizer_inv)
    for text in data_copy['text'].values
]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\enric\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Wall time: 4min 25s


In [9]:
data_copy.head(5)

| | item_id | brand | text |
|---|---|---|---|
| 0 | A28233506 | woman limited el corte inglés | abrigo masculino textura mujer |
| 1 | A29054782 | woman limited el corte inglés | abrigo doble faz mujer cinturon tono |
| 2 | A27354432 | woman el corte inglés | abrigo largo antelina mujer woman corte_ingles |
| 3 | A28302706 | lloyds | chaqueta termica mujer efecto cortavientos |
| 4 | A27435502 | lloyds | parka algodon mujer capucha |


In [10]:
data.head(5)

| | item_id | brand | text |
|---|---|---|---|
| 0 | A28233506 | Woman Limited El Corte Inglés | Abrigo masculino con textura de mujer |
| 1 | A29054782 | Woman Limited El Corte Inglés | Abrigo doble faz de mujer con cinturón a tono |
| 2 | A27354432 | Woman El Corte Inglés | Abrigo largo de antelina de mujer Woman El Corte Inglés |
| 3 | A28302706 | Lloyd's | Chaqueta térmica de mujer Lloyds con efecto cortavientos |
| 4 | A27435502 | Lloyd's | Parka 100% algodón de mujer Lloyds con capucha |


In [11]:
#7000
print(data.loc[199000])
print(data_copy.loc[199000])

item_id                       A16252195
brand                          Fox Home
text       Futurama. 7ª Temporada (DVD)
Name: 199000, dtype: object
item_id                    A16252195
brand                       fox home
text       futurama 7a_temporada dvd
Name: 199000, dtype: object


In [12]:
data_copy['text'].shape

(204812,)

In [13]:
maxw = 0
mean = 0
max_s = 0
for i, sentence in enumerate(data_copy['text']):
    mean += len(sentence.split())
    if maxw < len(sentence.split()):
        maxw = len(sentence.split())
        max_s = i
print('Max words in a sentence:' + ' '*10 + 'Mean words in a sentence:')
print("-" * 60)
print (f"{maxw} {' '*30} {mean/len(data_copy['text'])}")
print("-" * 60)
print (data_copy['text'][max_s])

Max words in a sentence:          Mean words in a sentence:
------------------------------------------------------------
19                                5.062686756635354
------------------------------------------------------------
feel lite 16gb 5in2 hd 2gb ram 8mp 5mp android rosa gold 64gb microsd version homologada ce movil libre


In [35]:
data_copy.to_csv(path + 'Texto_PreProcesado.csv', sep = ';', index = False)

### Bag of Words

In [14]:
bag_of_words = nltk.word_tokenize(data_copy['text'].to_string())
bag_of_words = list(dict.fromkeys(bag_of_words))
print(bag_of_words[:10])
print(len(bag_of_words)) 

['0', 'abrigo', 'masculino', 'textura', 'mujer', '1', 'doble', 'faz', 'cinturon', 'tono']
256799
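
`Series.to_string()` renders the index next to each text, which is why `'0'` and `'1'` show up among the tokens above. A minimal sketch that builds the vocabulary from the texts alone, using a plain list as a stand-in for `data_copy['text']` (`build_vocabulary` is a hypothetical helper):

```python
from collections import Counter


def build_vocabulary(texts):
    """Count every whitespace-separated token across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


toy_corpus = [
    "abrigo masculino textura mujer",
    "abrigo doble faz mujer cinturon tono",
]
vocab = build_vocabulary(toy_corpus)
print(len(vocab))            # number of distinct tokens
print(vocab.most_common(2))  # most frequent tokens first
```

Because the texts were already normalized into space-separated tokens, `str.split()` is enough here and no tokenizer model is needed.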


In [15]:
corpus = [sent for sent in data_copy['text']]
corpus[0:5]

['abrigo masculino textura mujer',
 'abrigo doble faz mujer cinturon tono',
 'abrigo largo antelina mujer woman corte_ingles',
 'chaqueta termica mujer efecto cortavientos',
 'parka algodon mujer capucha']

---

# Let's test our results

## Word2Vec


In [43]:
EMBEDDING_DIM = 200

In [48]:
sentences = [word.split() for word in data_copy['text'].values]
sentences[:3]

len(sentences)

204812

In [45]:
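%%time
# Passing `sentences` to the constructor already builds the vocabulary and trains
# with the default number of epochs; train() below continues for 50 more epochs.
# `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`.
modelWV = Word2Vec(sentences, workers = 3, min_count=5, window = 10, size = EMBEDDING_DIM)
modelWV.train(sentences, total_examples=len(sentences), epochs=50)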

  "C extension not loaded, training will be slow. "


KeyboardInterrupt: 

In [47]:
modelWV.save(path_models + "word2vec_model_v2")

NameError: name 'modelWV' is not defined

In [35]:
#model = Word2Vec.load("word2vec_model_v1")
wl = 'sandalia'
modelWV.wv.most_similar (positive = wl)
#model.wv.most_similar_cosmul(positive = ['disfraz', 'abrigo'])

[('alpargata', 0.6290709972381592),
 ('botin', 0.6135832667350769),
 ('salon', 0.5850276947021484),
 ('mocasin', 0.5569319725036621),
 ('chancla', 0.5555477142333984),
 ('bota', 0.5497039556503296),
 ('mercedita', 0.5435197353363037),
 ('nautico', 0.5411031246185303),
 ('pepito', 0.5364699363708496),
 ('deportivas', 0.5191572308540344)]
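
`most_similar` ranks vocabulary words by the cosine similarity of their vectors to the query vector. A toy sketch with made-up 3-dimensional vectors (the real model uses `EMBEDDING_DIM = 200`; the vectors and the `cosine_similarity` helper below are illustrative assumptions only):

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented vectors: two footwear words point in a similar direction, "dvd" does not.
toy_vectors = {
    "sandalia": [1.0, 0.9, 0.0],
    "alpargata": [0.9, 1.0, 0.1],
    "dvd": [0.0, 0.1, 1.0],
}

query = toy_vectors["sandalia"]
ranking = sorted(
    (
        (word, cosine_similarity(query, vec))
        for word, vec in toy_vectors.items()
        if word != "sandalia"
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```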

In [36]:
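def similar_products(text):
    # Apply the same normalization used for the corpus before querying the model
    text = norm_text.normalize(words=text.split(), p_brands=brands,
                               STOPWORDS_ALL=STOPWORDS_ALL,
                               lemmatizer_inv=lemmatizer_inv)
    list_text = text.split()
    most_similar = modelWV.wv.most_similar_cosmul(positive=list_text)
    return most_similar

similar_products('Silla de paseo')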

[('maternal', 0.47183555364608765),
 ('carrycot', 0.4686814248561859),
 ('cambiador', 0.4548194706439972),
 ('born', 0.45320045948028564),
 ('portabebes', 0.44677817821502686),
 ('isofix', 0.4405463933944702),
 ('trona', 0.4361507296562195),
 ('seat', 0.4346296489238739),
 ('gemelar', 0.4344612658023834),
 ('capazo', 0.4313882887363434)]

---

## Vectorize Sentences

- Initialize the tokenizer with `num_words = MAX_NB_WORDS` (the size of the bag of words built above). The tokenizer counts word occurrences, sorts them in descending order, and keeps the top `MAX_NB_WORDS` words.
- Use the tokenizer's `texts_to_sequences` method to convert each text to an array of integers.
- The resulting arrays may differ in length; use `pad_sequences` to obtain arrays of length `MAX_SEQUENCE_LENGTH` (24).
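
The three steps can be mimicked in plain Python to see what the Keras utilities produce. This is a simplified sketch, not the Keras implementation: it assumes ids start at 1 in descending frequency order and that padding and truncation happen at the front, which are the Keras defaults. `build_word_index` and `to_padded_sequence` are hypothetical helpers:

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an id starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded_sequence(text, word_index, maxlen):
    """Convert a text to ids, keep the last `maxlen` ids, left-pad with zeros."""
    seq = [word_index[w] for w in text.split() if w in word_index]
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


toy_texts = ["abrigo largo mujer", "parka mujer capucha"]
toy_index = build_word_index(toy_texts)
padded = to_padded_sequence("abrigo mujer", toy_index, maxlen=5)
print(toy_index)
print(padded)
```

Id 0 is never assigned to a word, so it can safely act as the padding value.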

In [37]:
MAX_NB_WORDS = len(bag_of_words) # ~257k tokens
MAX_SEQUENCE_LENGTH = 24                                

In [38]:
all_text = data_copy['text']
all_text = all_text.drop_duplicates (keep = False)

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, )
tokenizer.fit_on_texts(all_text)

data_sequences = tokenizer.texts_to_sequences(data_copy['text'])
data_vec = pad_sequences(data_sequences, maxlen=MAX_SEQUENCE_LENGTH)

In [39]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
#49421

Found 49418 unique tokens.


In [41]:
data_vec[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    1,    3,   94,
       2461, 1432])

`word_index` maps each word in the data to a unique integer ID. Example:

In [42]:
word_index = tokenizer.word_index
test_string = "ropa deporte abrigo raqueta bebe"
print("word\t\tid")
print("-" * 20)
for word in test_string.split():
    print("%s\t\t%s" % (word, word_index[word]))

word		id
--------------------
ropa		34
deporte		12
abrigo		94
raqueta		1645
bebe		4


In [43]:
EMBEDDING_DIM=200

In [44]:
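word_vectors = modelWV.wv
vocabulary_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
embedding_matrix = np.zeros((vocabulary_size, EMBEDDING_DIM))

for word, i in word_index.items():
    if word in word_vectors:
        # Use the trained Word2Vec vector when the word is in the model vocabulary
        embedding_matrix[i] = word_vectors[word]
    else:
        # Fall back to a random vector for words missing from the Word2Vec vocabulary
        embedding_matrix[i] = np.random.rand(EMBEDDING_DIM)

del word_vectors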

embedding_layer = Embedding(input_dim = vocabulary_size,
                            output_dim = EMBEDDING_DIM,
                            input_length = MAX_SEQUENCE_LENGTH,
                            weights=[embedding_matrix],
                            name='w2v_embedding',
                            trainable=False)

  


In [45]:
modelWV.save("w2v_embedding_v1_1.h5")

In [46]:
embedding_layer_2 = modelWV.wv.get_keras_embedding()

In [47]:
VOCAB_SIZE = len(word_index) + 1  # vocabulary size, including the padding index 0
#timesteps = X_train.shape[0]

### Dense

In [50]:
X_train = np.vstack(data_vec)
X_train.shape

(204812, 24)

In [51]:
model = Sequential()
model.add(embedding_layer)
model.compile('rmsprop', 'mse')

In [52]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
w2v_embedding (Embedding)    (None, 24, 200)           9883800   
Total params: 9,883,800
Trainable params: 0
Non-trainable params: 9,883,800
_________________________________________________________________


In [53]:
input_i = Input(shape=(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM))
encoded_h1 = Dense(128, activation='relu')(input_i)
encoded_h2 = Dense(64, activation='relu')(encoded_h1)
encoded_h3 = Dense(32, activation='relu')(encoded_h2)
encoded_h4 = Dense(16, activation='relu')(encoded_h3)
#encoded_h5 = Dense(8, activation='relu')(encoded_h4)

latent = Dense(8, activation='relu', name = 'ENCODER')(encoded_h4)

#decoder_h1 = Dense(8, activation='relu')(latent)
decoder_h2 = Dense(16, activation='relu')(latent)
decoder_h3 = Dense(32, activation='relu')(decoder_h2)
decoder_h4 = Dense(64, activation='relu')(decoder_h3)
decoder_h5 = Dense(128, activation='relu')(decoder_h4)

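# Note: a ReLU output cannot produce negative values, while Word2Vec vectors
# usually contain them; a linear output activation may reconstruct them better.
output = Dense(EMBEDDING_DIM, activation='relu')(decoder_h5)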

autoencoder = Model(input_i,output)

autoencoder.compile('rmsprop','mse')

In [54]:
autoencoder.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 24, 200)           0         
_________________________________________________________________
dense_1 (Dense)              (None, 24, 128)           25728     
_________________________________________________________________
dense_2 (Dense)              (None, 24, 64)            8256      
_________________________________________________________________
dense_3 (Dense)              (None, 24, 32)            2080      
_________________________________________________________________
dense_4 (Dense)              (None, 24, 16)            528       
_________________________________________________________________
ENCODER (Dense)              (None, 24, 8)             136       
_________________________________________________________________
dense_5 (Dense)              (None, 24, 16)            144       
__________
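
The parameter counts in the summary follow from `inputs * units + units`: a `Dense` layer applied to a 3-D tensor acts on the last axis only and shares its weights across the 24 positions, so the sequence length does not affect the count. A short arithmetic check (`dense_params` is a hypothetical helper):

```python
def dense_params(n_inputs, n_units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


# Layer widths of the autoencoder up to the first decoder layer:
# 200 -> 128 -> 64 -> 32 -> 16 -> 8 -> 16
widths = [200, 128, 64, 32, 16, 8, 16]
counts = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
print(counts)

# The latent code keeps the sequence axis, so each product yields 24 * 8 values.
SEQ_LEN, LATENT_DIM = 24, 8
code_size = SEQ_LEN * LATENT_DIM
print(code_size)
```

This also explains why the encoder output has shape `(None, 24, 8)` rather than a single 8-dimensional vector per product.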

In [55]:
X_embedded = model.predict(X_train, verbose = 1)



In [56]:
%%time
autoencoder.fit(X_embedded,X_embedded,epochs=3,
            batch_size=32, verbose = 1)

Epoch 1/3
Epoch 2/3
Epoch 3/3
Wall time: 9min 10s


<keras.callbacks.History at 0x26ca7e916d8>

In [59]:
encoder = Model(inputs=autoencoder.input, outputs=autoencoder.get_layer('ENCODER').output)

In [60]:
#encoder.save('encoder_text_V1.h5')
#autoencoder = load_model('autoencoder_v2.h5')

In [63]:
#plot_model(model, to_file='encoder_plot.png', show_shapes=True, show_layer_names=True)

### Most similar Products

In [64]:
X_train[20000].shape

(24,)

In [69]:
data.loc[160700]

id                                         001097632608283
brand                                               Ludilo
text     Juguetes Juegos de mesa Habilidad Caperucita Roja
Name: 160700, dtype: object

In [70]:
query = X_embedded[160700]

In [71]:
X_test = X_embedded.copy()
X_test.shape

(204812, 24, 200)

In [72]:
%%time
codes = encoder.predict(X_test)
codes.shape

Wall time: 46.9 s


In [73]:
query_code = encoder.predict(query.reshape(1,MAX_SEQUENCE_LENGTH,EMBEDDING_DIM))
query_code.shape

(1, 24, 8)

In [74]:
codes = codes.reshape(-1, 24*8)
print(codes.shape)
query_code = query_code.reshape(1, 24*8)
print(query_code.shape)

(204812, 192)
(1, 192)


### Fit the KNN to the test set

In [75]:
%%time
n_neigh = 10
nbrs = NearestNeighbors(n_neighbors=n_neigh).fit(codes)

Wall time: 8min 16s


In [76]:
distances, indices = nbrs.kneighbors(np.array(query_code))
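
With its default settings, `NearestNeighbors` uses the Euclidean (Minkowski, `p=2`) distance. A brute-force sketch of the same query on toy codes, flattening each `(positions, latent)` code into one vector as done above. `flatten` and `k_nearest` are hypothetical helpers, and the sketch ignores the index structures scikit-learn may use to speed up the search:

```python
import math


def flatten(code):
    """Turn a (positions, latent) nested list into one flat vector."""
    return [value for row in code for value in row]


def k_nearest(query, candidates, k):
    """Return (indices, distances) of the k candidates closest to the query."""
    scored = sorted((math.dist(query, c), i) for i, c in enumerate(candidates))
    top = scored[:k]
    return [i for _, i in top], [d for d, _ in top]


# Toy codes with 2 positions and 2 latent values each (the notebook uses 24 x 8).
toy_codes = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.1], [1.0, 1.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
flat_codes = [flatten(code) for code in toy_codes]
nn_indices, nn_distances = k_nearest(flat_codes[0], flat_codes, k=2)
print(nn_indices, nn_distances)
```

As in the notebook, the query item itself comes back first with distance 0 because it is part of the fitted set.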

In [77]:
closest_sent = X_test[indices]
closest_sent = closest_sent.reshape(-1, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
print(closest_sent.shape)

(10, 24, 200)


## Get the closest text

In [78]:
data.loc[160700]

id                                         001097632608283
brand                                               Ludilo
text     Juguetes Juegos de mesa Habilidad Caperucita Roja
Name: 160700, dtype: object

In [79]:
mis_indices = indices.tolist()[0]
for i in range(n_neigh):
    print (data.loc[mis_indices[i]])
    print('-'*50)

id                                         001097632608283
brand                                               Ludilo
text     Juguetes Juegos de mesa Habilidad Caperucita Roja
Name: 160700, dtype: object
--------------------------------------------------
id                                                 001097631222540
brand                                                     Fournier
text     Juguetes Juegos de mesa Habilidad Baraja infantil Grojuss
Name: 160683, dtype: object
--------------------------------------------------
id                                                 001003916512670
brand                                                     Fournier
text     Juguetes Juegos de mesa Habilidad Baraja infantil Grojuss
Name: 160685, dtype: object
--------------------------------------------------
id                                          001097631508690
brand                                              IMC_Toys
text     Juguetes Juegos de mesa Habilidad Atrapa Estrellas
Name