## KERAS PREPROCESSING LAYERS

Just as sklearn has its "transformers" and various functions for processing data, which we can then include in a pipeline if we want, Keras provides preprocessing "layers" that we can include in the model in an analogous way.

We are going to work with a dataset that we used a while ago for unsupervised learning, to review a few of the layers that are equivalent to what we have already used with sklearn. That will let us introduce Embedding layers and, from there, move on to natural language processing with DL.

In [1]:
import random as rm

In [2]:
import numpy as np
import pandas as pd
import re
import tensorflow as tf


In [76]:
df = pd.read_csv("./data/pharma_full.csv")
df.head()

Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview,Sales,Production
0,enalapril,4,Highly Effective,Mild Side Effects,management of congestive heart failure,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ...",318440,398.0
1,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,birth prevention,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest...",888949,909.0
2,ponstel,10,Highly Effective,No Side Effects,menstrual cramps,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...,264077,465.0
3,prilosec,3,Marginally Effective,Mild Side Effects,acid reflux,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...,542110,602.0
4,lyrica,2,Marginally Effective,Severe Side Effects,fibromyalgia,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above,83761,124.0


In [77]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3107 entries, 0 to 3106
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   urlDrugName        3107 non-null   object 
 1   rating             3107 non-null   int64  
 2   effectiveness      3107 non-null   object 
 3   sideEffects        3107 non-null   object 
 4   condition          3106 non-null   object 
 5   benefitsReview     3107 non-null   object 
 6   sideEffectsReview  3105 non-null   object 
 7   commentsReview     3099 non-null   object 
 8   Sales              3107 non-null   int64  
 9   Production         3107 non-null   float64
dtypes: float64(1), int64(2), object(7)
memory usage: 242.9+ KB


We will prepare the data a little so that we can use every type of layer.

For now, we will keep handling missing values separately.

In [78]:
df_clean = df.fillna("No Value")

In [79]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3107 entries, 0 to 3106
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   urlDrugName        3107 non-null   object 
 1   rating             3107 non-null   int64  
 2   effectiveness      3107 non-null   object 
 3   sideEffects        3107 non-null   object 
 4   condition          3107 non-null   object 
 5   benefitsReview     3107 non-null   object 
 6   sideEffectsReview  3107 non-null   object 
 7   commentsReview     3107 non-null   object 
 8   Sales              3107 non-null   int64  
 9   Production         3107 non-null   float64
dtypes: float64(1), int64(2), object(7)
memory usage: 242.9+ KB


In [80]:
import nltk as nl
nl.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\glezr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [81]:
for col in [col for col in df_clean.columns if "Review" in col]:
    df_clean[col + "_wc"] = df_clean[col].apply(lambda value: len(nl.tokenize.word_tokenize(value)))


In [82]:
df_clean

Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview,Sales,Production,benefitsReview_wc,sideEffectsReview_wc,commentsReview_wc
0,enalapril,4,Highly Effective,Mild Side Effects,management of congestive heart failure,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ...",318440,398.0,26,29,11
1,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,birth prevention,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest...",888949,909.0,38,58,14
2,ponstel,10,Highly Effective,No Side Effects,menstrual cramps,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...,264077,465.0,52,7,81
3,prilosec,3,Marginally Effective,Mild Side Effects,acid reflux,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...,542110,602.0,135,21,31
4,lyrica,2,Marginally Effective,Severe Side Effects,fibromyalgia,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above,83761,124.0,23,31,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3102,vyvanse,10,Highly Effective,Mild Side Effects,adhd,"Increased focus, attention, productivity. Bett...","Restless legs at night, insomnia, headache (so...","I took adderall once as a child, and it made m...",270483,470.0,41,52,124
3103,zoloft,1,Ineffective,Extremely Severe Side Effects,depression,Emotions were somewhat blunted. Less moodiness.,"Weight gain, extreme tiredness during the day,...",I was on Zoloft for about 2 years total. I am ...,504277,524.0,8,24,468
3104,climara,2,Marginally Effective,Moderate Side Effects,total hysterctomy,---,Constant issues with the patch not staying on....,---,127063,167.0,2,161,2
3105,trileptal,8,Considerably Effective,Mild Side Effects,epilepsy,Controlled complex partial seizures.,"Dizziness, fatigue, nausea",Started at 2 doses of 300 mg a day and worked ...,342695,502.0,5,5,66


In [83]:
df_target = df[["rating"]] - 1  # shift ratings from 1..10 to 0..9: sparse_categorical_crossentropy expects class indices starting at 0

In [84]:
df_target

Unnamed: 0,rating
0,3
1,0
2,9
3,2
4,1
...,...
3102,9
3103,0
3104,1
3105,7


In [85]:
df_target.value_counts()

rating
9         742
7         558
8         480
6         350
0         305
4         159
5         157
2         146
3         107
1         103
dtype: int64
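The shift from 1..10 to 0..9 matters because a sparse categorical loss uses the label as an index into the softmax output. The following is a minimal sketch in plain Python, not Keras' own implementation; the function name and the uniform prediction are illustrative assumptions.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: minus the log of the probability assigned to the true class."""
    if not 0 <= label < len(probs):
        raise ValueError(f"label {label} is outside [0, {len(probs)})")
    return -math.log(probs[label])

num_classes = 10
uniform = [1.0 / num_classes] * num_classes  # assumption: an untrained softmax is roughly uniform

rating = 10          # original rating, in 1..10
label = rating - 1   # shifted label, in 0..9, a valid index into the 10 outputs
loss = sparse_categorical_crossentropy(uniform, label)
print(round(loss, 4))  # 2.3026, i.e. ln(10)
```

With the unshifted rating 10 the index would fall outside the 10 outputs. The value ln(10) ≈ 2.30 is also a useful reference point for the loss curves that follow: a model that has learned nothing sits around that level.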

In [86]:
df_X = df_clean.drop("rating", axis = 1)

In [87]:
numericas = [col for col in df_X.columns if df_X[col].dtype != "object"]

## Preprocessing with Keras

### Normalization Layer

In [88]:
X_num = df_X[numericas]
y_num = df_target.copy()
X_train = X_num[:2400]
y_train = y_num[:2400]
X_valid = X_num[2400:]
y_valid = y_num[2400:]

In [90]:
tf.random.set_seed(42)  # extra code – ensures reproducibility
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(100, activation = "relu"),
    tf.keras.layers.Dense(10, activation = "softmax")
])
model.compile(loss="sparse_categorical_crossentropy", \
              optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3),\
              metrics =["acc"])
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=300, verbose = 0)

In [91]:
pd.DataFrame(history.history)

Unnamed: 0,loss,acc,val_loss,val_acc
0,1.307729e+08,0.221250,2.296485,0.24611
1,2.315079e+00,0.237083,2.290218,0.24611
2,2.306194e+00,0.237083,2.284142,0.24611
3,2.297484e+00,0.236667,2.278260,0.24611
4,2.289187e+00,0.236667,2.272562,0.24611
...,...,...,...,...
295,2.078734e+00,0.237917,2.113286,0.24611
296,2.078717e+00,0.237917,2.113297,0.24611
297,2.078701e+00,0.237917,2.113309,0.24611
298,2.078690e+00,0.237917,2.113320,0.24611


In [92]:
tf.random.set_seed(42)  # extra code – ensures reproducibility
norm_layer = tf.keras.layers.Normalization()
model = tf.keras.models.Sequential([
    norm_layer,
    tf.keras.layers.Dense(100, activation = "relu"),
    tf.keras.layers.Dense(10, activation = "softmax")
])
model.compile(loss="sparse_categorical_crossentropy", \
              optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3),\
              metrics =["acc"])
norm_layer.adapt(X_train)  # computes the mean and variance of every feature
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=300, verbose = 0)

In [93]:
pd.DataFrame(history.history)

Unnamed: 0,loss,acc,val_loss,val_acc
0,2.387116,0.124583,2.375459,0.138614
1,2.361022,0.162917,2.350753,0.178218
2,2.337331,0.187083,2.328239,0.196605
3,2.315755,0.194167,2.307621,0.199434
4,2.295938,0.200833,2.288679,0.210750
...,...,...,...,...
295,1.600010,0.383750,1.618311,0.398868
296,1.598243,0.385000,1.616507,0.398868
297,1.596454,0.384583,1.614857,0.398868
298,1.594712,0.385833,1.613164,0.398868
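Conceptually, the Normalization layer learns a mean and a variance per feature during `adapt()` and then standardizes every input with them. The sketch below reproduces that idea in plain Python; the helper names `adapt` and `normalize` and the `eps` guard are assumptions for illustration, and the layer's exact numerical details may differ. The three rows are the `Sales` and `Production` values shown by `df.head()`.

```python
def adapt(rows):
    """Compute the per-feature mean and (population) variance."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(n_features)]
    return means, variances

def normalize(rows, means, variances, eps=1e-7):
    """Standardize each feature: (x - mean) / std, guarding against a zero std."""
    return [[(x - m) / max(v ** 0.5, eps) for x, m, v in zip(r, means, variances)]
            for r in rows]

# (Sales, Production) for the first three rows of the dataset
rows = [[318440, 398.0], [888949, 909.0], [264077, 465.0]]
means, variances = adapt(rows)
scaled = normalize(rows, means, variances)
for row in scaled:
    print([round(x, 3) for x in row])
```

After scaling, both columns have mean 0 and variance 1, so `Sales` (hundreds of thousands) no longer dwarfs `Production` (hundreds). That difference in scale is the likely cause of the first-epoch loss of about 1.3e8 in the model trained without the layer.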


### Categorical Encoding and One-hot Encoding Layer

As with pyspark, we first have to do a string encoding (StringIndexer), that is, an OrdinalEncoding.

In [94]:
ordinalEncoding = tf.keras.layers.StringLookup()
ordinalEncoding.adapt(df_X["effectiveness"])

In [95]:
df_X["effec_coded"] = ordinalEncoding(df_X["effectiveness"])
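What the lookup does can be sketched without Keras: build a vocabulary from the training values, reserve index 0 for anything unseen, and map each string to its position. The helper names below are hypothetical, and the tiny `train` list is made up for the example; only the category labels come from the dataset.

```python
from collections import Counter

def build_vocabulary(values, oov_token="[UNK]"):
    """Index 0 is reserved for unknown values; the rest are ordered by descending frequency."""
    counts = Counter(values)
    return [oov_token] + [token for token, _ in counts.most_common()]

def lookup(values, vocabulary):
    """Map each string to its index in the vocabulary, or to 0 if it was never seen."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(value, 0) for value in values]

train = ["Highly Effective", "Highly Effective", "Highly Effective",
         "Ineffective", "Ineffective", "Marginally Effective"]
vocab = build_vocabulary(train)
codes = lookup(["Ineffective", "Highly Effective", "Never seen"], vocab)
print(vocab)  # ['[UNK]', 'Highly Effective', 'Ineffective', 'Marginally Effective']
print(codes)  # [2, 1, 0]
```

Note that the resulting integers carry an order (1 < 2 < 3) that reflects frequency, not meaning. That is one reason to prefer one-hot encoding for nominal categories.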

In [96]:
numericas = [col for col in df_X.columns if df_X[col].dtype != "object"]
print([numericas])
X_num = df_X[numericas]
y_num = df_target.copy()
X_train = X_num[:2400]
y_train = y_num[:2400]
X_valid = X_num[2400:]
y_valid = y_num[2400:]

[['Sales', 'Production', 'benefitsReview_wc', 'sideEffectsReview_wc', 'commentsReview_wc', 'effec_coded']]


In [97]:
tf.random.set_seed(42)  # extra code – ensures reproducibility
norm_layer = tf.keras.layers.Normalization()
model = tf.keras.models.Sequential([
    norm_layer,
    tf.keras.layers.Dense(100, activation = "relu"),
    tf.keras.layers.Dense(10, activation = "softmax")
])
model.compile(loss="sparse_categorical_crossentropy", \
              optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3),\
              metrics =["acc"])
norm_layer.adapt(X_train)  # computes the mean and variance of every feature
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=300, verbose = 0)

In [98]:
pd.DataFrame(history.history).head(-10)

Unnamed: 0,loss,acc,val_loss,val_acc
0,2.392560,0.068333,2.350723,0.089109
1,2.351232,0.095000,2.311529,0.114569
2,2.313275,0.135000,2.275472,0.171146
3,2.278324,0.177917,2.242112,0.198020
4,2.245965,0.215000,2.211182,0.247525
...,...,...,...,...
285,1.490166,0.451667,1.522711,0.422914
286,1.489538,0.451250,1.522182,0.424328
287,1.488901,0.452083,1.521581,0.424328
288,1.488300,0.451667,1.521076,0.424328


But we can do the one-hot encoding in a single step by configuring the StringLookup layer appropriately. We will also now use the functional API to include the layer inside the model (so that we do not have to do the conversion outside it).

In [99]:
df_X.drop("effec_coded", axis = 1, inplace = True)

In [100]:
numericas = [col for col in df_X.columns if df_X[col].dtype != "object"]
print([numericas])
X_train = df_X[numericas]
y_train = df_target.copy()

[['Sales', 'Production', 'benefitsReview_wc', 'sideEffectsReview_wc', 'commentsReview_wc']]


In [101]:
X_train.shape

(3107, 5)

In [102]:
X_eff = df_X[["effectiveness"]]

In [103]:
X_eff.shape

(3107, 1)

In [104]:
X_eff.effectiveness.value_counts()

Highly Effective          1330
Considerably Effective     928
Moderately Effective       415
Ineffective                247
Marginally Effective       187
Name: effectiveness, dtype: int64

In [105]:
normalization_layer = tf.keras.layers.Normalization()
hidden_layer1 = tf.keras.layers.Dense(100, activation="relu")
codingLayer = tf.keras.layers.StringLookup(output_mode="one_hot")
concat_layer = tf.keras.layers.Concatenate()
output_layer = tf.keras.layers.Dense(10, activation = "softmax")


normalization_layer.adapt(X_train)
codingLayer.adapt(X_eff)


input_num = tf.keras.layers.Input(shape=X_train.shape[1:])
input_cat = tf.keras.layers.Input(shape=X_eff.shape[1:], dtype = tf.string)
normalized = normalization_layer(input_num)
encoded = codingLayer(input_cat)
concat = concat_layer([normalized,encoded])
hidden1 = hidden_layer1(concat)
output = output_layer(hidden1)

model = tf.keras.Model(inputs=[input_num,input_cat], outputs=[output])

In [106]:
normalization_layer.adapt(X_train)
codingLayer.adapt(X_eff)

In [107]:
model.compile(loss="sparse_categorical_crossentropy", \
              optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3),\
              metrics =["acc"])
history = model.fit((X_train,X_eff),y_train, epochs=200, verbose = 0, validation_split = 0.2)

In [108]:
pd.DataFrame(history.history).tail()

Unnamed: 0,loss,acc,val_loss,val_acc
195,1.501737,0.457143,1.574187,0.419614
196,1.500772,0.457143,1.573279,0.419614
197,1.499824,0.457143,1.57245,0.419614
198,1.498845,0.457143,1.571551,0.419614
199,1.497899,0.45674,1.57068,0.419614
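As a rough picture of what reaches the hidden layer in the functional model: each sample is the normalized numeric features followed by the one-hot vector of its category. The sketch below uses plain Python lists. The vocabulary order follows the value counts shown earlier, with an extra slot for unknown values, and the numeric values are hypothetical placeholders.

```python
def one_hot(index, depth):
    """Vector of zeros with a single 1.0 at the given position."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

# Slot 0 is for unknown values; the others follow the frequencies in the dataset
vocab = ["[UNK]", "Highly Effective", "Considerably Effective",
         "Moderately Effective", "Ineffective", "Marginally Effective"]
index = {token: i for i, token in enumerate(vocab)}

def encode(value):
    return one_hot(index.get(value, 0), len(vocab))

numeric_features = [0.5, -1.2, 0.0, 0.3, 1.1]   # hypothetical, already-normalized values
row = numeric_features + encode("Ineffective")  # concatenation for a single sample
print(encode("Ineffective"))  # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
print(len(row))               # 11
```

So the dense layer sees 5 numeric inputs plus 6 binary inputs, instead of 5 numeric inputs plus one integer code.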


## Vectorization Layers

The equivalent of sklearn's CountVectorizer and TfidfVectorizer is the TextVectorization layer.

In [109]:
text_vec_layer_count = tf.keras.layers.TextVectorization(output_mode = "count") # output_mode = "count" -> Countvectorizer
text_vec_layer_count.adapt(df_X[["sideEffectsReview"]])

In [110]:
text_vec_layer_count.get_vocabulary()[:10]

['[UNK]', 'i', 'the', 'and', 'to', 'a', 'of', 'my', 'it', 'was']

In [111]:
texto = df_X["sideEffectsReview"][2:3].values
print(texto)

['Heavier bleeding and clotting than normal.']


In [112]:
text_vec_layer_tfidf = tf.keras.layers.TextVectorization(output_mode= "tf_idf") # output_mode = "tf_idf" -> TfidfVectorizer; there is a third mode (the default), which we will see a little later
text_vec_layer_tfidf.adapt(df_X["sideEffectsReview"])

In [113]:
vectors = text_vec_layer_count(df_X["sideEffectsReview"])

In [114]:
vectors

<tf.Tensor: shape=(3107, 7441), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 2., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 7., 5., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 3., 0., ..., 0., 0., 0.]], dtype=float32)>

In [115]:
df_vectors = pd.DataFrame(vectors.numpy(),\
                          columns= text_vec_layer_count.get_vocabulary())

In [116]:
df_vectors

Unnamed: 0,[UNK],i,the,and,to,a,of,my,it,was,...,10142008,1014,1012,100mgthe,100mgs,100mgdoses,100110,1000mg,10000,072009
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,2.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3102,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3103,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3104,0.0,7.0,5.0,5.0,5.0,1.0,2.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
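To make the table above less opaque, here is a minimal count-vectorization sketch in plain Python: standardize the text, build a vocabulary with an `[UNK]` slot, and count tokens per vocabulary position. The helpers are hypothetical stand-ins and only approximate the layer's default behaviour. The two corpus sentences are reviews from the dataset.

```python
import re
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation (a rough stand-in for the default standardization)."""
    return re.sub(r"[^\w\s]", "", text.lower())

def fit_vocabulary(texts):
    """Slot 0 is [UNK]; the remaining tokens are ordered by descending frequency."""
    counts = Counter(token for text in texts for token in standardize(text).split())
    return ["[UNK]"] + [token for token, _ in counts.most_common()]

def count_vector(text, vocabulary):
    """One position per vocabulary entry, holding how often that token occurs in the text."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in standardize(text).split():
        vector[index.get(token, 0)] += 1
    return vector

corpus = ["Heavier bleeding and clotting than normal.", "Dizziness, fatigue, nausea"]
vocab = fit_vocabulary(corpus)
vec = count_vector("Bleeding and nausea and headache", vocab)
print(vocab)
print(vec)  # [1, 0, 1, 2, 0, 0, 0, 0, 0, 1]
```

The unseen word "headache" lands in the `[UNK]` slot, and "and" is counted twice. With a real corpus the vector has one position per vocabulary word, which is where the 7,441 columns come from.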


We will have to normalize/clean the text...

In [117]:
# Reusing the one we wrote for the IMDB Reviews

from nltk.corpus import stopwords
# Raw strings avoid invalid-escape warnings; the patterns themselves are unchanged
replace_no_space = r'(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(")|(\()|(\))|(\[)|(\])|(\d+)'
REPLACE_NO_SPACE = re.compile(replace_no_space)
replace_with_space = r"(<br \s*/><br\s*/>)|(\-)|(\/)"
REPLACE_WITH_SPACE = re.compile(replace_with_space)
NO_SPACE = ""
SPACE = " "
dictionary = stopwords.words("english")


def clean(row):
    # Strip punctuation and digits, and convert to lowercase
    dato = REPLACE_NO_SPACE.sub(NO_SPACE, row.lower())
    # Turn line breaks (<br /><br />), hyphens ("-") and slashes into spaces
    dato = REPLACE_WITH_SPACE.sub(SPACE, dato)
    # Remove any links
    dato = " ".join([word for word in dato.split() if "http" not in word])
    # Remove stopwords
    dato = " ".join([word for word in dato.split(" ") if word not in dictionary])
    return dato




Two ways:  
    1. Outside the layer

In [118]:
df_X_clean = df_X.copy()

In [119]:
df_X_clean["sideEffectsReview"] = df_X_clean.sideEffectsReview.apply(clean)

In [120]:
df_X_clean

Unnamed: 0,urlDrugName,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview,Sales,Production,benefitsReview_wc,sideEffectsReview_wc,commentsReview_wc
0,enalapril,Highly Effective,Mild Side Effects,management of congestive heart failure,slowed the progression of left ventricular dys...,cough hypotension proteinuria impotence renal ...,"monitor blood pressure , weight and asses for ...",318440,398.0,26,29,11
1,ortho-tri-cyclen,Highly Effective,Severe Side Effects,birth prevention,Although this type of birth control has more c...,heavy cycle cramps hot flashes fatigue long la...,"I Hate This Birth Control, I Would Not Suggest...",888949,909.0,38,58,14
2,ponstel,Highly Effective,No Side Effects,menstrual cramps,I was used to having cramps so badly that they...,heavier bleeding clotting normal,I took 2 pills at the onset of my menstrual cr...,264077,465.0,52,7,81
3,prilosec,Marginally Effective,Mild Side Effects,acid reflux,The acid reflux went away for a few months aft...,constipation dry mouth mild dizziness would go...,I was given Prilosec prescription at a dose of...,542110,602.0,135,21,31
4,lyrica,Marginally Effective,Severe Side Effects,fibromyalgia,I think that the Lyrica was starting to help w...,felt extremely drugged dopey could drive med a...,See above,83761,124.0,23,31,2
...,...,...,...,...,...,...,...,...,...,...,...,...
3102,vyvanse,Highly Effective,Mild Side Effects,adhd,"Increased focus, attention, productivity. Bett...",restless legs night insomnia headache sometime...,"I took adderall once as a child, and it made m...",270483,470.0,41,52,124
3103,zoloft,Ineffective,Extremely Severe Side Effects,depression,Emotions were somewhat blunted. Less moodiness.,weight gain extreme tiredness day insomnia nig...,I was on Zoloft for about 2 years total. I am ...,504277,524.0,8,24,468
3104,climara,Marginally Effective,Moderate Side Effects,total hysterctomy,---,constant issues patch staying called manufactu...,---,127063,167.0,2,161,2
3105,trileptal,Considerably Effective,Mild Side Effects,epilepsy,Controlled complex partial seizures.,dizziness fatigue nausea,Started at 2 doses of 300 mg a day and worked ...,342695,502.0,5,5,66


In [54]:
print("\n\n".join(df_X_clean["sideEffectsReview"][3104:3106].values))

constant issues patch staying called manufacture bayer took lot number said sorry said going send new box local pharmacy charge checked every week last three weeks nothing yeah great way follow throughout patch noticed large acne face lack energy libido vaginal dryness daily hot flashes extreme moodiness sorry husband even though great know three months body trying adjust could take anymore contacted gyn prescribed estratest cross fingers hope works

dizziness fatigue nausea


In [121]:
new_text_vec_layer_count = tf.keras.layers.TextVectorization(output_mode="count")

In [122]:
new_text_vec_layer_count.adapt(df_X_clean.sideEffectsReview)

In [123]:
df_vectors_clean = pd.DataFrame(new_text_vec_layer_count(df_X_clean.sideEffectsReview).numpy(),\
                                columns= new_text_vec_layer_count.get_vocabulary())

In [125]:
df_vectors_clean

Unnamed: 0,[UNK],side,effects,taking,none,also,drug,day,skin,medication,...,abslutly,abscense,abruptly,aboveand,abnormalities,abilify,abfter,abbsessed,abandoning,abandon
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3102,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


2. Second way: build it into the layer when instantiating it

In [127]:
def clean_tensors(in_tensor):
    # Stopword pattern: a stopword preceded by a delimiter (or at the start of the string) and followed by a space
    re_expression = "|".join([f"([\.,;: ]{word} )" for word in dictionary]) + "|" + "|".join([f"^{word} " for word in dictionary])
    #print(re_expression)
    lowercase = tf.strings.lower(in_tensor) # convert to lowercase
    cleaned_one = tf.strings.regex_replace(lowercase, replace_no_space, NO_SPACE) # strip punctuation and digits
    print(cleaned_one[0])
    cleaned_two = tf.strings.regex_replace(cleaned_one, replace_with_space, SPACE) # replace <br /> breaks, hyphens and slashes with spaces
    print(cleaned_two[0])
    # "Removes" the stopwords; each match consumes its trailing space, so a stopword right after another one is not matched
    cleaned_three = tf.strings.regex_replace(cleaned_two, re_expression,SPACE)
    #print(cleaned_three[0])
    return cleaned_three
        

In [128]:
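# Note: despite the variable name, this layer produces counts (output_mode="count"), not TF-IDF weights
clean_text_vec_layer_tf_idf = tf.keras.layers.TextVectorization(output_mode="count", standardize = clean_tensors)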

In [129]:
clean_text_vec_layer_tf_idf.adapt(df_X.sideEffectsReview)

Tensor("strided_slice:0", shape=(1,), dtype=string)
Tensor("strided_slice_1:0", shape=(1,), dtype=string)


In [130]:
vectors = clean_text_vec_layer_tf_idf(df_X.sideEffectsReview)

tf.Tensor(b'cough hypotension  proteinuria impotence  renal failure  angina pectoris  tachycardia  eosinophilic pneumonitis tastes disturbances  anusease anorecia  weakness fatigue insominca weakness', shape=(), dtype=string)
tf.Tensor(b'cough hypotension  proteinuria impotence  renal failure  angina pectoris  tachycardia  eosinophilic pneumonitis tastes disturbances  anusease anorecia  weakness fatigue insominca weakness', shape=(), dtype=string)


In [131]:
df_X.sideEffectsReview

0       cough, hypotension , proteinuria, impotence , ...
1       Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...
2              Heavier bleeding and clotting than normal.
3       Constipation, dry mouth and some mild dizzines...
4       I felt extremely drugged and dopey.  Could not...
                              ...                        
3102    Restless legs at night, insomnia, headache (so...
3103    Weight gain, extreme tiredness during the day,...
3104    Constant issues with the patch not staying on....
3105                           Dizziness, fatigue, nausea
3106    I find when I am taking Micardis that I tend t...
Name: sideEffectsReview, Length: 3107, dtype: object

In [132]:
df_vectors = pd.DataFrame(vectors.numpy(), columns= clean_text_vec_layer_tf_idf.get_vocabulary())

In [133]:
df_vectors

Unnamed: 0,[UNK],the,i,side,effects,a,my,was,taking,have,...,'emerging,'effective','coursing','cool','buzz','bug','blah','asshole',$million,#
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3103,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3104,0.0,3.0,1.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
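The vocabulary above still contains stopwords such as "the", "i" and "a". One likely reason is that the stopword pattern consumes the space after each match, so a stopword that immediately follows another one has no delimiter left in front of it; a stopword at the very end of the text is also missed. The sketch below reproduces the effect with Python's `re` module, using a tiny hypothetical stopword list, and compares it with a word-boundary pattern. It is an illustration under those assumptions, not a drop-in replacement for `clean_tensors`.

```python
import re

stopword_list = ["with", "the", "not", "on"]  # tiny hypothetical stopword list

# Same construction as in clean_tensors: delimiter + word + trailing space, or word at string start
consuming = ("|".join(f"([\\.,;: ]{w} )" for w in stopword_list)
             + "|" + "|".join(f"^{w} " for w in stopword_list))

# Alternative: word boundaries match positions, so they consume no characters
bounded = r"\b(?:" + "|".join(re.escape(w) for w in stopword_list) + r")\b"

text = "constant issues with the patch not staying on"

partial = re.sub(consuming, " ", text)
full = re.sub(r"\s+", " ", re.sub(bounded, " ", text)).strip()

print(partial)  # constant issues the patch staying on
print(full)     # constant issues patch staying
```

In the first result "the" survives because " with " already consumed the space before it, and "on" survives because nothing follows it. The second pattern removes all four.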


In [134]:
X_train_vectors = df_vectors.copy()
y_train_vectors = df_target.copy()

In [135]:
tf.random.set_seed(42)  # extra code – ensures reproducibility
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(100, activation = "relu", input_shape = X_train_vectors.shape[1:]),
    tf.keras.layers.Dense(10, activation = "softmax")
])
model.compile(loss="sparse_categorical_crossentropy", \
              optimizer=tf.keras.optimizers.SGD(learning_rate= 1e-3, momentum = 0.09, nesterov = True),\
              metrics =["acc"])
history = model.fit(X_train_vectors.to_numpy(), y_train_vectors.to_numpy(), validation_split = 0.2, epochs=100, verbose = 0)

In [136]:
pd.DataFrame(history.history).tail()

Unnamed: 0,loss,acc,val_loss,val_acc
95,1.961522,0.308652,2.049788,0.273312
96,1.959983,0.30825,2.048953,0.276527
97,1.958426,0.309054,2.048133,0.276527
98,1.956905,0.309457,2.047332,0.276527
99,1.955366,0.309457,2.046525,0.276527


In [137]:
df_vectors.shape

(3107, 7083)

In [138]:
numericas = [col for col in df_X.columns if df_X[col].dtype != "object"]
print([numericas])
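# Note: reset_index() without drop=True keeps the old index as an extra "index" column, so the row number becomes a feature
X_num = pd.concat([df_X[numericas].reset_index(),df_vectors], axis = 1)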
y_train = df_target.copy()
X_train = X_num[:2400].to_numpy()
y_train = y_num[:2400].to_numpy()
X_valid = X_num[2400:].to_numpy()
y_valid = y_num[2400:].to_numpy()

[['Sales', 'Production', 'benefitsReview_wc', 'sideEffectsReview_wc', 'commentsReview_wc']]


In [139]:
X_train

array([[0.00000e+00, 3.18440e+05, 3.98000e+02, ..., 0.00000e+00,
        0.00000e+00, 0.00000e+00],
       [1.00000e+00, 8.88949e+05, 9.09000e+02, ..., 0.00000e+00,
        0.00000e+00, 0.00000e+00],
       [2.00000e+00, 2.64077e+05, 4.65000e+02, ..., 0.00000e+00,
        0.00000e+00, 0.00000e+00],
       ...,
       [2.39700e+03, 6.07030e+05, 7.47000e+02, ..., 0.00000e+00,
        0.00000e+00, 0.00000e+00],
       [2.39800e+03, 7.89800e+03, 6.80000e+01, ..., 0.00000e+00,
        0.00000e+00, 0.00000e+00],
       [2.39900e+03, 9.78193e+05, 1.03800e+03, ..., 0.00000e+00,
        0.00000e+00, 0.00000e+00]])

In [140]:
tf.random.set_seed(42)  # extra code – ensures reproducibility
norm_layer = tf.keras.layers.Normalization()
model = tf.keras.models.Sequential([
    norm_layer,
    tf.keras.layers.Dense(100),  # no activation is given here, so this hidden layer is linear
    tf.keras.layers.Dense(10, activation = "softmax")
])
model.compile(loss="sparse_categorical_crossentropy", \
              optimizer=tf.keras.optimizers.SGD(learning_rate= 1e-3, momentum = 0.09, nesterov = True),\
              metrics =["acc"])
norm_layer.adapt(X_train)
history = model.fit(X_train, y_train,validation_data=(X_valid, y_valid), epochs=100, verbose = 0)

In [141]:
pd.DataFrame(history.history).tail()

Unnamed: 0,loss,acc,val_loss,val_acc
95,0.400767,0.892917,314135.25,0.23338
96,0.399127,0.89375,314437.875,0.23338
97,0.397523,0.89375,314738.71875,0.23338
98,0.395928,0.894167,315038.28125,0.23338
99,0.394372,0.89375,315336.375,0.231966


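Too many dimensions (7,000+)? Perhaps: training accuracy reaches about 0.89 while validation accuracy stays near 0.23, which suggests the model is memorizing the training set rather than generalizing.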

## EMBEDDINGS

In general, an embedding is a compact representation of data that is high-dimensional but sparse. For example, a one-hot encoding of a categorical variable holding the world's capital cities would give 195-dimensional vectors, each with a single 1 and 194 zeros... We could try to convert each capital into a two-dimensional vector (that would be building an embedding for it). But let's see it applied to language.

#### Index sequences

Representing sentences with the vectorization we have seen so far has two drawbacks:  

1. Depending on the size of the vocabulary, it yields a very sparse representation.
2. It does not preserve word order.

One possible way to represent sentences, which we did not cover when we dealt with traditional NLP (how sneaky of us), is to convert each word of a vocabulary into an index and each sentence into a sequence of indices from that vocabulary:

Assuming a vocabulary of 1000 words in which:   
Me -> is word 734,  
llamo -> is word 124,  
Iñigo -> is word 343,  
Montoya -> is word 99,  
tú -> is word 2,  
mataste -> word 643,  
a -> word 1,  
mi -> word 23,  
padre -> word 15  
  
* "Me llamo Iñigo Montoya" would become [734 "Me", 124 "llamo", 343 "Iñigo", 99 "Montoya"] (length = 4)
* "Tú mataste a mi padre" would become [2,643,1,23,15] (length = 5)

  
With this "encoding"/vectorization we would have everything we need, right?
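The mapping just described can be sketched in a few lines of Python. The indices are the ones listed above; lowercasing the words and reserving index 0 for words outside the vocabulary are assumptions made for the example.

```python
vocabulary = {
    "me": 734, "llamo": 124, "iñigo": 343, "montoya": 99,
    "tú": 2, "mataste": 643, "a": 1, "mi": 23, "padre": 15,
}
UNKNOWN = 0  # assumption: index reserved for words outside the vocabulary

def to_indices(sentence):
    """Turn a sentence into the sequence of vocabulary indices of its words."""
    return [vocabulary.get(word, UNKNOWN) for word in sentence.lower().split()]

seq1 = to_indices("Me llamo Iñigo Montoya")
seq2 = to_indices("Tú mataste a mi padre")
print(seq1)  # [734, 124, 343, 99]
print(seq2)  # [2, 643, 1, 23, 15]
```

Unlike a count vector, each sequence is as long as its sentence and keeps the words in order.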

#### Word Embeddings

The previous representation has two (related) problems:
    1. There is no relationship between the indices and the words (or there may be one, but we only have a single dimension in which to represent it) 
    2. Since that relationship does not exist, a regressor (a neuron in a DL layer, for example) would apply a single weight to all words (they are represented by a single feature)
    
One solution is to turn words into vectors, and to build that vector representation so that words that are similar or related (for example synonyms, words belonging to the same semantic field, etc.) produce vectors that are related (in general, that lie close together; the more dimensions, the more ways of being close :-)).

Our sentences would become (multivariate, yes) sequences such as:  
"Me llamo Iñigo Montoya" -> [ [12,3,12] "Me" , [0, -12,1,-123] "llamo", [111,0,111,1] "Iñigo", [-1,-1,-2,112] "Montoya"]

Word embeddings are the way to achieve this. But how do we build a word embedding? (That is, how does Iñigo go from being 343 to being a vector such as [111,0,111,1]?) Let the network learn it :-)

An embedding layer is a layer placed at the start of the model, with randomly initialized weights, whose output is the representation of each word as a vector whose dimension is given as an argument (it is a hyperparameter). In this way it learns the most suitable representation for the problem at hand. But...
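Seen from the outside, an embedding layer behaves like a lookup table: one row of weights per word index, and the output for a sentence is simply the rows selected by its indices. Here is a minimal sketch in plain Python. The small uniform initialization range is an assumption chosen for illustration, and no training step is shown.

```python
import random

random.seed(42)
vocab_size, output_dim = 1000, 4  # output_dim is the hyperparameter mentioned above

# One row per word index, initialised with small random values.
# In a real model these rows are trainable weights updated by backpropagation.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
                    for _ in range(vocab_size)]

def embed(indices):
    """Replace each word index by its row in the embedding matrix."""
    return [embedding_matrix[i] for i in indices]

sentence = [734, 124, 343, 99]  # "Me llamo Iñigo Montoya" as an index sequence
vectors = embed(sentence)
print(len(vectors), len(vectors[0]))  # 4 4
```

Each word now has `output_dim` numbers instead of a single index, so a downstream neuron can weight different aspects of a word separately.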


#### 2013 (Efficient Estimation of Word Representations in Vector Space)

Mechanisms for turning words into vectors have been proposed since 1960, but things really took off in 2013, when some researchers at (guess who; yes, Google) proposed a way to build those embeddings.  
Basically, they trained a neural network on a very large text dataset. The objective was to predict which words accompanied (before and after) a given word. The dedicated layers learned quite spectacular word representations. They had obtained a very powerful embedding, and... since these are layers, we can include them in our models thanks to transfer learning.  

<img src="./img/embeddings.png" alt="Diagram of one-hot encodings" width="1000" />


For example, combining the vectors obtained for King, Man and Woman as V_King - V_Man + V_Woman gives a vector that is very close to the vector obtained for Queen.  

Likewise, doing it with Madrid - Spain + France gives a vector very close to the one produced by ... Paris
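The arithmetic can be illustrated with hand-made vectors. The two-dimensional values below are invented for the example (first axis roughly "royalty", second axis roughly "gender") and are not real word2vec embeddings; the sketch only shows how the nearest word to V_King - V_Man + V_Woman is found with cosine similarity.

```python
import math

# Hand-made 2-D vectors, NOT real embeddings: [royalty, gender]
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

result = analogy("king", "man", "woman")
print(result)  # queen
```

With real pretrained embeddings the vectors have hundreds of dimensions and the match is approximate rather than exact, but the search works the same way.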

Since then, embeddings have been the default in NLP tasks with DL. 

### KERAS EMBEDDINGS

Keras provides a (trainable) Embedding layer, although it is also possible to download pretrained embeddings and adapt them to our model.

We will now work through a small example of how it works, and in the next notebook we will build the prediction model that we already built with traditional NLP on the Amazon reviews dataset.

In [142]:
categorias_ejemplo = ["Me","llamo","Iñigo","Montoya","soy","tú","mataste","a","mi","padre"]
pre_conversion = tf.keras.layers.StringLookup() # We have to convert our vocabulary to indices
pre_conversion.adapt(categorias_ejemplo)
lookup_y_embedding = tf.keras.Sequential([
                                          tf.keras.layers.InputLayer(input_shape=[], dtype=tf.string), 
                                          pre_conversion,
                                          tf.keras.layers.Embedding(input_dim = pre_conversion.vocabulary_size(),
                                                                   output_dim = 2)])
# input_dim -> size of the vocabulary to be converted into vectors of output_dim dimensions

In [144]:
lookup_y_embedding(np.array([categorias_ejemplo]))

<tf.Tensor: shape=(1, 10, 2), dtype=float32, numpy=
array([[[ 0.04604291, -0.02598288],
        [ 0.04039853, -0.03119655],
        [-0.00592728,  0.04219984],
        [ 0.03060097, -0.034375  ],
        [ 0.0383018 ,  0.02832749],
        [ 0.00023763, -0.00343919],
        [-0.03366919,  0.02914692],
        [-0.00399958, -0.04371947],
        [-0.00220978,  0.00620984],
        [-0.01796765, -0.03476111]]], dtype=float32)>

In [145]:
frase = "Me llamo Iñigo Montoya"
lookup_y_embedding(np.array([frase.split()]))

<tf.Tensor: shape=(1, 4, 2), dtype=float32, numpy=
array([[[ 0.04604291, -0.02598288],
        [ 0.04039853, -0.03119655],
        [-0.00592728,  0.04219984],
        [ 0.03060097, -0.034375  ]]], dtype=float32)>