## [75.06 / 95.58] Organización de Datos
## Practical Assignment 2: Machine Learning Competition
### Group 18: DATAVID-20

* 102732 - Bilbao, Manuel
* 101933 - Karagoz, Filyan
* 98684 - Markarian, Darío
* 100901 - Stroia, Lautaro

Sentiment classification takes a piece of text and decides whether the person who wrote it likes or dislikes the topic they are writing about. Here we adapt the same approach to our own problem: deciding whether a tweet describes a real event or a fake one.

**We will focus on:**
*  Building a deep neural network for classification.
*  Training the model with word embeddings (using Word2Vec).

**The process is roughly the following:**

Tweets -> Embeddings -> Deep RNN -> Fully connected layer -> Activation function (sigmoid in this case) -> Target (1 or 0)
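
The last two steps of the pipeline can be illustrated in isolation. The following is a minimal sketch, not the notebook's code: `sigmoid` and `to_target` are hypothetical helper names, the scores are made up, and the 0.5 threshold is an assumption chosen to mirror the rounding applied to the predictions later on.

```python
import math

def sigmoid(z: float) -> float:
    """Squash a real-valued score into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

def to_target(probability: float, threshold: float = 0.5) -> int:
    """Turn a probability into a hard 0/1 label (the threshold is an assumption)."""
    return 1 if probability > threshold else 0

scores = [-2.0, 0.0, 3.0]  # hypothetical outputs of the last dense unit
probabilities = [sigmoid(z) for z in scores]
labels = [to_target(p) for p in probabilities]
print(labels)
```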

In [1]:
import pandas as pd
import numpy as np
import re as re
import os

import gensim
from gensim.parsing.preprocessing import remove_stopwords

import tensorflow
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, GRU

import nltk
from nltk.tokenize import word_tokenize

import sklearn
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_rows = None  # show every row of the DataFrame
pd.options.display.float_format = '{:20,.2f}'.format  # suppress scientific notation in the outputs


## First model

Keras is left in charge of learning the tweet embeddings.

In [2]:
# Remove digits from a text
def eliminar_numeros(text):
    return re.sub(r"\d+", "", text)

# Remove punctuation
def eliminar_puntuacion(text):
    return re.sub(r'[^\w\s]', '', text)

# Convert to lowercase
def minusculas(text):
    return text.lower()

# Remove special characters (anything that is not a letter, digit, space, newline or period)
def eliminar_caracteres(text):
    return re.sub(r'[^a-zA-Z0-9 \n\.]', '', text)

# Remove URLs
def eliminar_url(text):
    url_reg = re.compile(r'https?://\S+|www\.\S+')
    return url_reg.sub('', text)

**Setup and data split**

In [3]:
test_set = pd.read_csv('test.csv')
train_set = pd.read_csv('train.csv')

# Note: eliminar_url runs last, after eliminar_puntuacion has already stripped
# the "://" and dots from the links, so its pattern no longer finds anything to remove.
for data in [test_set, train_set]:
    data['text'] = data['text'].apply(eliminar_puntuacion)
    data['text'] = data['text'].apply(minusculas)
    data['text'] = data['text'].apply(eliminar_numeros)
    data['text'] = data['text'].apply(eliminar_caracteres)
    data['text'] = data['text'].apply(remove_stopwords)
    data['text'] = data['text'].apply(eliminar_url)

X = train_set['text']    # features
y = train_set['target']  # variable to predict

# Keep 75% of the set for training and the other 25% for validation
X_train,X_valid,y_train,y_valid = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_train.shape)
print(X_valid.shape)

(5709,)
(1904,)
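
The order of the cleaning steps matters. In the loop above `eliminar_url` runs after `eliminar_puntuacion`, by which point the `://` and dots that the URL pattern relies on are already gone. The sketch below reproduces two of the helpers on a made-up tweet to show the difference; the sample text and the `clean_*` function names are illustrative assumptions, not part of the original notebook.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')

def eliminar_url(text):
    return URL_RE.sub('', text)

def eliminar_puntuacion(text):
    return re.sub(r'[^\w\s]', '', text)

def clean_punctuation_first(text):
    # Order used in the loop above: the URL pattern no longer matches
    return eliminar_url(eliminar_puntuacion(text))

def clean_url_first(text):
    # Alternative order: strip URLs while they are still recognisable
    return eliminar_puntuacion(eliminar_url(text))

sample = "Forest fire near the lake! http://t.co/abc123"  # made-up tweet
print(repr(clean_punctuation_first(sample)))
print(repr(clean_url_first(sample)))
```

With the first order the link survives as a single junk token, which then enters the vocabulary; with the second it disappears entirely.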


**Tokenizing the data**

In [55]:
tokenizer = Tokenizer()
data_text = pd.concat([X,test_set.text])
tokenizer.fit_on_texts(data_text)

# Length of the longest text; shorter texts are padded up to this length
max_len = max([len(text.split()) for text in data_text])

# Vocabulary size (+1 because index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index)+1

X_train_tokens = tokenizer.texts_to_sequences(X_train)
X_valid_tokens = tokenizer.texts_to_sequences(X_valid)

X_train_padded = pad_sequences(X_train_tokens, maxlen=max_len, padding='post')
X_valid_padded = pad_sequences(X_valid_tokens, maxlen=max_len, padding='post')
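
To make the two steps above concrete, here is a dependency-free sketch of what word-index tokenization and padding do. It is a simplified stand-in, not the Keras implementation: `fit_word_index`, `texts_to_sequences` and `pad` are hypothetical helpers, words are ranked by frequency with index 0 reserved for padding, and only the `'pre'`/`'post'` padding choice is modelled.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad(sequences, maxlen, padding='pre'):
    """Pad with zeros so that every sequence has length maxlen.

    Simplification: longer sequences just keep their first maxlen items.
    """
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]
        zeros = [0] * (maxlen - len(seq))
        padded.append(seq + zeros if padding == 'post' else zeros + seq)
    return padded

texts = ["fire in the forest", "the forest is calm"]  # made-up examples
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts + ["fire"], word_index)
print(pad(sequences, maxlen=4, padding='post'))
print(pad(sequences, maxlen=4, padding='pre'))
```

Because a recurrent layer reads the positions in order, it is generally advisable to use the same padding side for the training, validation and test sequences.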

**Building the model**

In [56]:
# Size of the embedding vectors
EMBEDDING_SIZE = 100

model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_SIZE,input_length = max_len))
model.add(GRU(units=32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 21, 100)           2749800   
_________________________________________________________________
gru_1 (GRU)                  (None, 32)                12864     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
Total params: 2,762,697
Trainable params: 2,762,697
Non-trainable params: 0
_________________________________________________________________
None
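
The parameter counts in the summary can be checked by hand. The sketch below assumes the `reset_after=True` GRU variant, which keeps two bias vectors per gate and is consistent with the 12,864 figure above, and it infers the vocabulary size (27,498) from the embedding row of the summary. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_size):
    # One vector of length embedding_size per vocabulary entry
    return vocab_size * embedding_size

def gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights and two bias vectors
    # (the two-bias layout is the reset_after=True assumption)
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    # Weights plus one bias per output unit
    return input_dim * units + units

counts = [embedding_params(27498, 100), gru_params(100, 32), dense_params(32, 1)]
print(counts, sum(counts))
```

Almost all of the parameters sit in the embedding table, which helps explain why a model trained from scratch on a few thousand tweets can memorise the training set quickly.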


**Training the model**

In [57]:
model.fit(X_train_padded,y_train, batch_size=128, epochs=25, validation_data=(X_valid_padded,y_valid),verbose=2)

Epoch 1/25
45/45 - 3s - loss: 0.6837 - accuracy: 0.5689 - val_loss: 0.6821 - val_accuracy: 0.5730
Epoch 2/25
45/45 - 2s - loss: 0.6804 - accuracy: 0.5695 - val_loss: 0.6611 - val_accuracy: 0.5746
Epoch 3/25
45/45 - 2s - loss: 0.4407 - accuracy: 0.7993 - val_loss: 0.4758 - val_accuracy: 0.7831
Epoch 4/25
45/45 - 2s - loss: 0.2141 - accuracy: 0.9245 - val_loss: 0.5687 - val_accuracy: 0.7789
Epoch 5/25
45/45 - 2s - loss: 0.1039 - accuracy: 0.9672 - val_loss: 0.6458 - val_accuracy: 0.7689
Epoch 6/25
45/45 - 2s - loss: 0.0669 - accuracy: 0.9809 - val_loss: 0.7889 - val_accuracy: 0.7363
Epoch 7/25
45/45 - 2s - loss: 0.0497 - accuracy: 0.9856 - val_loss: 0.7733 - val_accuracy: 0.7558
Epoch 8/25
45/45 - 2s - loss: 0.0349 - accuracy: 0.9918 - val_loss: 0.9650 - val_accuracy: 0.7511
Epoch 9/25
45/45 - 2s - loss: 0.0262 - accuracy: 0.9937 - val_loss: 0.8397 - val_accuracy: 0.7605
Epoch 10/25
45/45 - 2s - loss: 0.0308 - accuracy: 0.9926 - val_loss: 0.8498 - val_accuracy: 0.7574
Epoch 11/25
45/45 -

<tensorflow.python.keras.callbacks.History at 0x7ff2ecbb0f60>
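
In the log above the validation loss bottoms out at epoch 3 and then rises while the training loss keeps falling, which is the usual signature of overfitting. One common remedy is early stopping. The sketch below is a plain-Python illustration of the idea, applied to the ten validation losses visible in the log; `stopping_epoch` and the `patience` value are assumptions for illustration. Keras offers comparable behaviour through its `EarlyStopping` callback.

```python
def stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses of epochs 1-10, copied from the training log above
val_losses = [0.6821, 0.6611, 0.4758, 0.5687, 0.6458,
              0.7889, 0.7733, 0.9650, 0.8397, 0.8498]
print(stopping_epoch(val_losses))
```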

**Predicting on the test set**

In [58]:
test_tokens = tokenizer.texts_to_sequences(test_set.text)
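# Note: padding defaults to 'pre' here, whereas the training sequences were padded with padding='post'
test_tokens_padded = pad_sequences(test_tokens, maxlen=max_len)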

predictions = model.predict(test_tokens_padded)

In [59]:
submit = pd.DataFrame(test_set['id'],columns=['id'])
predictions = model.predict(test_tokens_padded)
submit['target'] = predictions
submit['target'] = round(submit['target']).astype('int')
#submit.to_csv('SUBMITS/embeddings+DL.csv', index=False)

# This model scored 0.73735 on Kaggle

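## Model 2

We use Gensim's Word2Vec implementation. First we tokenize the (preprocessed) tweets. Note that the Word2Vec model trained here is only used to count the vocabulary; the embedding layer further down is initialised from pre-trained GloVe vectors.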

In [106]:
tokens_list = list()
for tweet in data_text.values:
    tokens_list.append(word_tokenize(tweet))

model2 = gensim.models.Word2Vec(sentences=tokens_list,size=EMBEDDING_SIZE,min_count=1)

# Number of words learned (gensim 3.x API; gensim 4 renames `size` to `vector_size` and `wv.vocab` to `wv.key_to_index`)
tokens_generated = list(model2.wv.vocab)
print(len(tokens_generated))

27496


**Generating the embeddings (using a pre-trained GloVe set)**

In [120]:
embeds = {}
with open(os.environ['HOME']+'/glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeds[word] = vector
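
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces, which is what the loop above relies on. The sketch below parses the same layout from an in-memory string so it can be run without downloading the file; the two 3-dimensional vectors are made up, and `parse_glove` is a hypothetical helper that uses plain lists instead of NumPy arrays.

```python
import io

def parse_glove(handle):
    """Read 'word v1 v2 ... vn' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        vectors[values[0]] = [float(v) for v in values[1:]]
    return vectors

fake_file = io.StringIO("fire 0.1 -0.2 0.3\nflood 0.0 0.5 -1.0\n")  # made-up data
vectors = parse_glove(fake_file)
print(vectors["fire"])
```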

**Next, the tokenized tweets are converted into padded sequences of token indices.**

In [138]:
tokenizer2 = Tokenizer()
tokenizer2.fit_on_texts(tokens_list)
sequences = tokenizer2.texts_to_sequences(tokens_list)

# Word-to-index mapping learned by the tokenizer
word_index = tokenizer2.word_index
print("Number of unique tokens: ", len(word_index))

tokens_padded = pad_sequences(sequences, maxlen=max_len)
tokens_padded.shape

Number of unique tokens:  27496


(10876, 21)

**Map the GloVe embeddings to each word in the word_index vocabulary and build a matrix from those vectors**

In [122]:
cant_tokens = len(word_index) + 1  # +1 because index 0 is reserved for padding
embeddings = np.zeros((cant_tokens, EMBEDDING_SIZE))
for word, i in word_index.items():
    # Words missing from GloVe keep an all-zero row
    vector = embeds.get(word)
    if vector is not None:
        embeddings[i] = vector

In [137]:
embeddings.shape

(27497, 100)
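
Words that do not appear in GloVe keep an all-zero row in the matrix, so it is worth checking how much of the vocabulary is actually covered. This is a minimal sketch on toy data; `glove_coverage`, the tiny `word_index` and the fake vectors are assumptions for illustration only.

```python
def glove_coverage(word_index, pretrained):
    """Return the fraction of vocabulary words that have a pre-trained vector
    and the list of words that do not."""
    missing = [word for word in word_index if word not in pretrained]
    covered = len(word_index) - len(missing)
    return covered / len(word_index), missing

word_index = {"fire": 1, "flood": 2, "httptcoabc": 3, "lol": 4}            # toy vocabulary
pretrained = {"fire": [0.1, 0.2], "flood": [0.3, 0.4], "lol": [0.5, 0.6]}  # fake vectors

coverage, missing = glove_coverage(word_index, pretrained)
print(f"{coverage:.0%} covered, missing: {missing}")
```

Inspecting the missing words on the real vocabulary would show whether they are mostly noise (leftover link fragments, typos) or meaningful terms worth handling separately.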

**Building the model**

In [146]:
model2 = Sequential()
embds_layer = Embedding(cant_tokens, EMBEDDING_SIZE, weights=[embeddings], input_length= max_len,
              trainable=False)
model2.add(embds_layer)
model2.add(GRU(units=32,dropout=0.2, recurrent_dropout=0.2,return_sequences=False))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss="binary_crossentropy",optimizer='adam',metrics=['accuracy'])
print(model2.summary())

Model: "sequential_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 21, 100)           2749700   
_________________________________________________________________
gru_12 (GRU)                 (None, 32)                12864     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 33        
Total params: 2,762,597
Trainable params: 12,897
Non-trainable params: 2,749,700
_________________________________________________________________
None


**Since the two datasets (train and test) were concatenated to build the embeddings over all the text, we split them apart again, and then split the training part into a training set and a validation set.**

In [167]:
new_train = tokens_padded[0:train_set.shape[0]]
new_test = tokens_padded[train_set.shape[0]:]

X_train2,X_valid2,y_train2,y_valid2 = train_test_split(new_train,train_set.target, test_size=0.25, random_state=1)

**Training...**

In [168]:
model2.fit(X_train2,y_train2, batch_size=120,epochs=50,validation_data=(X_valid2,y_valid2),verbose=2)

Epoch 1/50
48/48 - 1s - loss: 0.6211 - accuracy: 0.6721 - val_loss: 0.5272 - val_accuracy: 0.7600
Epoch 2/50
48/48 - 1s - loss: 0.4889 - accuracy: 0.7807 - val_loss: 0.4525 - val_accuracy: 0.8020
Epoch 3/50
48/48 - 1s - loss: 0.4457 - accuracy: 0.8021 - val_loss: 0.4439 - val_accuracy: 0.8083
Epoch 4/50
48/48 - 1s - loss: 0.4375 - accuracy: 0.8066 - val_loss: 0.4410 - val_accuracy: 0.8072
Epoch 5/50
48/48 - 1s - loss: 0.4271 - accuracy: 0.8124 - val_loss: 0.4421 - val_accuracy: 0.8099
Epoch 6/50
48/48 - 1s - loss: 0.4197 - accuracy: 0.8142 - val_loss: 0.4379 - val_accuracy: 0.8041
Epoch 7/50
48/48 - 1s - loss: 0.4167 - accuracy: 0.8159 - val_loss: 0.4384 - val_accuracy: 0.8125
Epoch 8/50
48/48 - 1s - loss: 0.4143 - accuracy: 0.8191 - val_loss: 0.4359 - val_accuracy: 0.8093
Epoch 9/50
48/48 - 1s - loss: 0.4066 - accuracy: 0.8168 - val_loss: 0.4388 - val_accuracy: 0.8141
Epoch 10/50
48/48 - 1s - loss: 0.3999 - accuracy: 0.8240 - val_loss: 0.4411 - val_accuracy: 0.8136
Epoch 11/50
48/48 -

<tensorflow.python.keras.callbacks.History at 0x7ff1fc2c8860>

**Predicting...**

In [177]:
submit2 = pd.DataFrame(test_set['id'],columns=['id'])
predictions2 = model2.predict(new_test)
submit2['target'] = predictions2
submit2['target'] = round(submit2['target']).astype('int')
#submit2.to_csv('SUBMITS/embeddings+DL-model2.csv', index=False)

# This model scored 0.80416 on Kaggle

## Model 3

Using the Gensim and GloVe embeddings, and adding the 'keyword' column as a feature.

**Setup, data cleaning and feature engineering**

In [113]:
test_features = pd.read_csv('test.csv')
train_features = pd.read_csv('train.csv')

train_features = train_features.fillna('')
test_features = test_features.fillna('')

for data in [test_features,train_features]:
    # Lowercase, remove digits and stopwords
    data['text'] = data['text'].apply(lambda x: minusculas(x))
    data['text'] = data['text'].apply(lambda x: eliminar_numeros(x))
    data['text'] = data['text'].apply(lambda x: remove_stopwords(x))
    data['keyword'] = data['keyword'].apply(lambda x: re.sub(r'%20',' ', str(x)))
    # Feature engineering
    data['combined_text'] = data['text']+' '+data['keyword']
    data['qty_hashtags'] = data['text'].apply(lambda x: x.count('#'))
    data['qty_urls'] = data['text'].apply(lambda x: x.count('http'))
    data['tweet_len'] = data['text'].str.len()
    data['qty_words_tweet'] = data['text'].apply(lambda x: len(str(x).split()))
    data['len_diff_mean'] = abs(data['tweet_len'] - data['tweet_len'].mean())
    
X3 = train_features[['combined_text',
                     'qty_hashtags',
                     'qty_urls',
                     'tweet_len',
                     'qty_words_tweet',
                     'len_diff_mean']]
y3 = train_features.target

# Concatenate both DataFrames so everything is vectorized together
data_text3 = pd.concat([X3,test_features[['combined_text',
                                         'qty_hashtags',
                                         'qty_urls',
                                         'tweet_len',
                                         'qty_words_tweet',
                                         'len_diff_mean']]])
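
The hand-crafted features can be computed for a single tweet as shown in this sketch. One difference from the loop above is flagged as an assumption: here `len_diff_mean` is measured against a reference mean passed in explicitly (for example the training-set mean), whereas the notebook uses the mean of whichever DataFrame is being processed. `tweet_features` and the sample tweets are hypothetical.

```python
def tweet_features(text, reference_mean_len):
    """Numeric features for one tweet, mirroring the columns built above."""
    tweet_len = len(text)
    return {
        'qty_hashtags': text.count('#'),
        'qty_urls': text.count('http'),
        'tweet_len': tweet_len,
        'qty_words_tweet': len(text.split()),
        # Assumption: distance to an explicitly supplied reference mean
        'len_diff_mean': abs(tweet_len - reference_mean_len),
    }

train_texts = ["#fire downtown http://t.co/x", "calm day"]  # made-up tweets
mean_len = sum(len(t) for t in train_texts) / len(train_texts)

features = tweet_features("#flood #alert now", reference_mean_len=mean_len)
print(features)
```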


**Tokenization**

In [42]:
EMBEDDING_SIZE = 100
tokens_list3 = list()
for tweet in data_text3.combined_text.values:
    tokens_list3.append(word_tokenize(tweet))

model3 = gensim.models.Word2Vec(sentences=tokens_list3,size=EMBEDDING_SIZE,min_count=1)

# Number of words learned
tokens_generated3 = list(model3.wv.vocab)
print(len(tokens_generated3))

28524


**Embeddings from a GloVe set**

In [44]:
embeds3 = {}
with open(os.environ['HOME']+'/glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeds3[word] = vector

**Converting the tweets into token sequences**

In [108]:
tokenizer3 = Tokenizer()
tokenizer3.fit_on_texts(tokens_list3)
sequences3 = tokenizer3.texts_to_sequences(tokens_list3)

# Length of the longest text; shorter texts are padded up to this length
max_len3 = max([len(text.split()) for text in data_text3.combined_text])

# Word-to-index mapping learned by the tokenizer
word_index3 = tokenizer3.word_index
print("Number of unique tokens: ", len(word_index3))

tokens_padded3 = pad_sequences(sequences3, maxlen=max_len3)
tokens_padded3.shape

Number of unique tokens:  28524


(10876, 29)

**Map the GloVe embeddings to each word in the word_index vocabulary and build a matrix from those vectors**

In [109]:
cant_tokens3 = len(word_index3) + 1  # +1 because index 0 is reserved for padding
embeddings3 = np.zeros((cant_tokens3, EMBEDDING_SIZE))
for word, i in word_index3.items():
    # Words missing from GloVe keep an all-zero row
    vector = embeds3.get(word)
    if vector is not None:
        embeddings3[i] = vector

print("Shape of the embedding matrix: ", embeddings3.shape)

Shape of the embedding matrix:  (28525, 100)


**Since the two datasets (train and test) were concatenated to build the embeddings over all the text, we split them apart again, and then split the training part into a training set and a validation set.**

In [114]:
# Convert the tokenized vectors to DataFrames so that
# the numeric features can be added
new_train3 = pd.DataFrame(tokens_padded3[0:train_features.shape[0]])
new_test3 = pd.DataFrame(tokens_padded3[train_features.shape[0]:])

# Add the numeric features
new_train3['qty_hashtags'] = X3['qty_hashtags']
new_train3['qty_urls'] = X3['qty_urls']
new_train3['tweet_len'] = X3['tweet_len']
new_train3['qty_words_tweet'] = X3['qty_words_tweet']
new_train3['len_diff_mean'] = X3['len_diff_mean']

new_test3['qty_hashtags'] = test_features['qty_hashtags']
new_test3['qty_urls'] = test_features['qty_urls']
new_test3['tweet_len'] = test_features['tweet_len']
new_test3['qty_words_tweet'] = test_features['qty_words_tweet']
new_test3['len_diff_mean'] = test_features['len_diff_mean']

# Split the training set: 75% for training and 25% for validation
X_train3,X_valid3,y_train3,y_valid3 = train_test_split(new_train3,train_features.target, test_size=0.25, random_state=1)

print(new_train3.shape)
print(new_test3.shape)

(7613, 34)
(3263, 34)


**Building the model**

In [124]:
model3 = Sequential()
embds_layer3 = Embedding(cant_tokens3, EMBEDDING_SIZE, weights=[embeddings3], input_length= max_len3+5,
              trainable=False)
model3.add(embds_layer3)
model3.add(LSTM(units=32,dropout=0.2, recurrent_dropout=0.2,return_sequences=False))
model3.add(Dense(1, activation='sigmoid'))
model3.compile(loss="binary_crossentropy",optimizer='adam',metrics=['accuracy'])
print(model3.summary())

Model: "sequential_17"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_17 (Embedding)     (None, 34, 100)           2852500   
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense_25 (Dense)             (None, 1)                 33        
Total params: 2,869,557
Trainable params: 17,057
Non-trainable params: 2,852,500
_________________________________________________________________
None


**Training**

In [125]:
model3.fit(X_train3,y_train3,batch_size=128,epochs=50,validation_data=(X_valid3,y_valid3),verbose=2)

Epoch 1/50
45/45 - 2s - loss: 0.6283 - accuracy: 0.6434 - val_loss: 0.5232 - val_accuracy: 0.7642
Epoch 2/50
45/45 - 2s - loss: 0.5062 - accuracy: 0.7670 - val_loss: 0.4819 - val_accuracy: 0.7868
Epoch 3/50
45/45 - 2s - loss: 0.4659 - accuracy: 0.7944 - val_loss: 0.4497 - val_accuracy: 0.8057
Epoch 4/50
45/45 - 2s - loss: 0.4497 - accuracy: 0.7998 - val_loss: 0.4436 - val_accuracy: 0.8009
Epoch 5/50
45/45 - 2s - loss: 0.4374 - accuracy: 0.8033 - val_loss: 0.4393 - val_accuracy: 0.8104
Epoch 6/50
45/45 - 2s - loss: 0.4276 - accuracy: 0.8085 - val_loss: 0.4403 - val_accuracy: 0.8030
Epoch 7/50
45/45 - 2s - loss: 0.4159 - accuracy: 0.8185 - val_loss: 0.4453 - val_accuracy: 0.8104
Epoch 8/50
45/45 - 2s - loss: 0.4090 - accuracy: 0.8166 - val_loss: 0.4513 - val_accuracy: 0.8078
Epoch 9/50
45/45 - 2s - loss: 0.4071 - accuracy: 0.8241 - val_loss: 0.4487 - val_accuracy: 0.8114
Epoch 10/50
45/45 - 2s - loss: 0.4030 - accuracy: 0.8210 - val_loss: 0.4505 - val_accuracy: 0.8078
Epoch 11/50
45/45 -

<tensorflow.python.keras.callbacks.History at 0x7fb523ad4160>

**Prediction**

In [126]:
submit3 = pd.DataFrame(test_features['id'],columns=['id'])
predictions3 = model3.predict(new_test3)
submit3['target'] = predictions3
submit3['target'] = round(submit3['target']).astype('int')
submit3.to_csv('SUBMITS/embeddings+DL-model3.csv', index=False)

# This model scored 0.79436 on Kaggle, worse than without the extra features.
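
One possible reason for Model 3 scoring lower, offered as a hypothesis rather than a confirmed diagnosis: the five numeric columns are appended to the padded token indices and passed through the same `Embedding` layer (note `input_length=max_len3+5`), so a value such as `tweet_len` is looked up as if it were a word index. The toy sketch below illustrates that effect with a hypothetical vocabulary; `index_word` and the sample row are made up.

```python
# Toy vocabulary: index 0 is padding, the rest are words ranked by frequency
index_word = {0: '<pad>', 1: 'the', 2: 'fire', 3: 'flood', 4: 'in', 5: 'city'}

token_ids = [0, 0, 2, 4, 5]   # padded tweet: "fire in city"
numeric_features = [1, 3]     # e.g. qty_hashtags=1, qty_words_tweet=3

# Concatenating both, as the DataFrame above does, yields one flat row
row = token_ids + numeric_features

# An embedding lookup treats every position as a word index
looked_up = [index_word[i] for i in row]
print(looked_up)
```

A common alternative, not tried in this notebook, is to give the numeric features their own input and join them with the recurrent layer's output (for instance with the Keras `Concatenate` layer) before the final dense layer, ideally after scaling them.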