In [55]:
import pandas as pd
import numpy as np

from tensorflow import keras
from sklearn.model_selection import train_test_split

Data obtained from: https://www.kaggle.com/code/tmishinev/nlp-depression-tweets-keras-lstm

In [2]:
raw = pd.read_csv("Mental-Health-Twitter.csv")
raw.shape

(20000, 11)

In [3]:
raw.head(10)

Unnamed: 0.1,Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,It's just over 2 years since I was diagnosed w...,1013187241,84,211,251,837,0,1
1,1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,"It's Sunday, I need a break, so I'm planning t...",1013187241,84,211,251,837,1,1
2,2,637749345908051968,Sat Aug 29 22:11:07 +0000 2015,Awake but tired. I need to sleep but my brain ...,1013187241,84,211,251,837,0,1
3,3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,RT @SewHQ: #Retro bears make perfect gifts and...,1013187241,84,211,251,837,2,1
4,4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,It’s hard to say whether packing lists are mak...,1013187241,84,211,251,837,1,1
5,5,637692793083817985,Sat Aug 29 18:26:24 +0000 2015,Making packing lists is my new hobby... #movin...,1013187241,84,211,251,837,1,1
6,6,637691649943072772,Sat Aug 29 18:21:51 +0000 2015,At what point does keeping stuff for nostalgic...,1013187241,84,211,251,837,1,1
7,7,637689418472652800,Sat Aug 29 18:12:59 +0000 2015,Currently in the finding-boxes-of-random-shit ...,1013187241,84,211,251,837,0,1
8,8,637687177946734592,Sat Aug 29 18:04:05 +0000 2015,"Can't be bothered to cook, take away on the wa...",1013187241,84,211,251,837,0,1
9,9,637684866906255360,Sat Aug 29 17:54:54 +0000 2015,RT @itventsnews: ITV releases promo video for ...,1013187241,84,211,251,837,41,1


For the purposes of this project, only the post text and the label are of interest for classification, so the dataset is reduced to just these columns. In future work, however, one could build a model that, given a Twitter handle, scrapes the user's tweets and analyzes them to determine whether the user has depression. For this project, classification is performed on a single text only.

In [4]:
data = raw[["post_text", "label"]]
data.head(10)

Unnamed: 0,post_text,label
0,It's just over 2 years since I was diagnosed w...,1
1,"It's Sunday, I need a break, so I'm planning t...",1
2,Awake but tired. I need to sleep but my brain ...,1
3,RT @SewHQ: #Retro bears make perfect gifts and...,1
4,It’s hard to say whether packing lists are mak...,1
5,Making packing lists is my new hobby... #movin...,1
6,At what point does keeping stuff for nostalgic...,1
7,Currently in the finding-boxes-of-random-shit ...,1
8,"Can't be bothered to cook, take away on the wa...",1
9,RT @itventsnews: ITV releases promo video for ...,1


# Normalization
(Stop-word removal, along with lemmatization or stemming, is normally part of text normalization, but I hypothesize that for a recurrent model the information carried by a stop word or a conjugated verb may be more valuable than the stems or lemmas alone, so they are not removed.)

In [5]:
# regex=True is explicit because pandas >= 2.0 treats the pattern as a literal by default.
norm_text = data["post_text"].str.replace(r'[^a-zA-Z0-9\sáéíóúüñÁÉÍÓÚÑ]', '', regex=True)
# strip() already removes leading/trailing newlines, so the extra rstrip calls are dropped.
norm_text = norm_text.str.lower().str.strip()
# assign() returns a new DataFrame, avoiding SettingWithCopyWarning on the slice of raw.
data = data.assign(norm_text=norm_text)
data.head(10)


Unnamed: 0,post_text,label,norm_text
0,It's just over 2 years since I was diagnosed w...,1,its just over 2 years since i was diagnosed wi...
1,"It's Sunday, I need a break, so I'm planning t...",1,its sunday i need a break so im planning to sp...
2,Awake but tired. I need to sleep but my brain ...,1,awake but tired i need to sleep but my brain h...
3,RT @SewHQ: #Retro bears make perfect gifts and...,1,rt sewhq retro bears make perfect gifts and ar...
4,It’s hard to say whether packing lists are mak...,1,its hard to say whether packing lists are maki...
5,Making packing lists is my new hobby... #movin...,1,making packing lists is my new hobby movinghouse
6,At what point does keeping stuff for nostalgic...,1,at what point does keeping stuff for nostalgic...
7,Currently in the finding-boxes-of-random-shit ...,1,currently in the findingboxesofrandomshit pack...
8,"Can't be bothered to cook, take away on the wa...",1,cant be bothered to cook take away on the way ...
9,RT @itventsnews: ITV releases promo video for ...,1,rt itventsnews itv releases promo video for th...
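To make the effect of this normalization concrete, here is a minimal sketch using only the standard library `re` module. The helper name `normalize` and the sample strings are assumptions made for illustration; the notebook itself works on the pandas column. Note how stop words such as "i" and "a" survive, in line with the hypothesis above.

```python
import re

# Character class keeping ASCII letters, digits, whitespace and Spanish accented letters.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\sáéíóúüñÁÉÍÓÚÑ]")

def normalize(text: str) -> str:
    """Hypothetical helper: strip punctuation, lowercase, trim both ends."""
    return NON_ALNUM.sub("", text).lower().strip()

samples = [
    "It's Sunday, I need a break!",
    "RT @SewHQ: #Retro bears make perfect gifts",
]
normalized = [normalize(s) for s in samples]
print(normalized)
# ['its sunday i need a break', 'rt sewhq retro bears make perfect gifts']
```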


### Tokenization

In [14]:
train, test = train_test_split(data[["norm_text", "label"]], test_size=0.2)
print(train.shape)
train.head(10)

(16000, 2)


Unnamed: 0,norm_text,label
16872,mhardzsali active na active bah\r\n\r\npaytfor...,0
3684,kimberlym06 those are just your shoes lol,1
19401,california college san diego national city htt...,0
17231,tweet and retweet \r\n\r\ngopayt dreamteamyong,0
2511,rt cbs11doug whats it like to be the homeowner...,1
10322,lydiamcrtins well she was yes but she wasnt ju...,0
12857,salon hey joy this morning was a complete fail...,0
13404,marclotter realdonaldtrump mikepence how ma...,0
19423,skinovate skin care clinic httpstcoxpfp5yaij5 ...,0
5101,gasp you hate me another gasp and my fucks ran...,1


In [18]:
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(train["norm_text"].to_numpy())
train_sequences = tokenizer.texts_to_sequences(train["norm_text"].to_numpy())
test_sequences = tokenizer.texts_to_sequences(test["norm_text"].to_numpy())

words = len(tokenizer.word_index)
words

29805
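As a rough illustration of what the tokenizer does, the sketch below builds a frequency-ordered word index in plain Python. The function names and the sample texts are hypothetical, and the real `Tokenizer` also applies its own filtering and lowercasing. The point is that indices start at 1, so index 0 stays free for padding, which is why the models further down size their `Embedding` layer as `words+1`.

```python
from collections import Counter

def fit_word_index(texts):
    """Hypothetical stand-in for fit_on_texts: index words by descending frequency."""
    counts = Counter(word for text in texts for word in text.split())
    # Index 0 is left unused so it can serve as the padding value later on.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Map each text to integer ids, skipping words never seen during fitting."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

train_texts = ["i need a break", "i need to sleep"]
word_index = fit_word_index(train_texts)
sequences = to_sequences(["i need coffee"], word_index)
vocab_size = len(word_index) + 1  # +1 for the reserved index 0

print(word_index)
print(sequences, vocab_size)
```

Because the index is fitted on the training split only, words that appear only in the test split (like "coffee" here) are simply dropped from their sequences.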

In [19]:
# Longest tokenized sequence across both splits; used as the padding length.
max_seq = max(len(seq) for seq in train_sequences + test_sequences)
max_seq

34

In [21]:
x_train = keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=max_seq)
x_test = keras.preprocessing.sequence.pad_sequences(test_sequences, maxlen=max_seq)
x_train.shape

(16000, 34)
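By default `pad_sequences` pads (and truncates) at the front of each sequence. A minimal plain-Python sketch of that behaviour follows; the helper name `pad_pre` and the toy sequences are assumptions for illustration only.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper: left-pad with `value`, keeping the last `maxlen` ids."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]  # drop ids from the front if the sequence is too long
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

padded = pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(padded)
# [[0, 0, 1, 2], [4, 5, 6, 7]]
```

Since `max_seq` here is the longest sequence in either split, nothing is actually truncated in this notebook; shorter tweets are only padded with leading zeros.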

In [30]:
y_train = train["label"].to_numpy()
y_test = test["label"].to_numpy()
y_train.shape

(16000,)

# Building a simple model

In [42]:
keras.backend.clear_session()
inputs = keras.Input(shape=(max_seq,), dtype="int32")
x = keras.layers.Embedding(words+1, 128)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
simple_lstm = keras.Model(inputs, outputs)

simple_lstm.compile("adam", "binary_crossentropy", metrics=["accuracy", keras.metrics.Precision()])

simple_lstm.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 34)]              0         
                                                                 
 embedding (Embedding)       (None, 34, 128)           3815168   
                                                                 
 bidirectional (Bidirectiona  (None, 34, 128)          98816     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 4,012,929
Trainable params: 4,012,929
Non-train
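The parameter counts in the summary can be checked by hand. The sketch below is a small worked calculation, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the helper name `lstm_params` is hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

vocab_size, embed_dim, units = 29805 + 1, 128, 64

embedding = vocab_size * embed_dim
bilstm_1 = 2 * lstm_params(embed_dim, units)   # forward + backward LSTM
bilstm_2 = 2 * lstm_params(2 * units, units)   # input is the concatenated 128-d output
dense = 2 * units * 1 + 1                      # weights + bias
total = embedding + bilstm_1 + bilstm_2 + dense

print(embedding, bilstm_1, bilstm_2, dense, total)
# 3815168 98816 98816 129 4012929
```

Roughly 95% of the parameters sit in the embedding table, so the vocabulary size dominates the model size far more than the recurrent layers do.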

In [43]:
simple_lstm.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1c35f00cb80>

In [44]:
simple_lstm.evaluate(x_test, y_test)



[0.34942957758903503, 0.8522499799728394, 0.8514562845230103]

With a relatively simple model we have reached our goal of an accuracy of 0.85, with a very similar precision.

# Experimenting with more models

In [46]:
keras.backend.clear_session()
inputs = keras.Input(shape=(max_seq,), dtype="int32")
x = keras.layers.Embedding(words+1, 128)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
x = keras.layers.Dense(32, activation='relu')(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy", keras.metrics.Precision()])

model.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)

print("")
print("Evaluation")
model.evaluate(x_test, y_test)

Epoch 1/2
Epoch 2/2

Evaluation


[0.3594318926334381, 0.8514999747276306, 0.8512396812438965]

Adding a dense layer improved accuracy and precision on the validation split but slightly worsened them on the test set (accuracy 0.8522 to 0.8515, precision 0.8515 to 0.8512), which seems to indicate overfitting. Let's try adding dropout:

In [47]:
keras.backend.clear_session()
inputs = keras.Input(shape=(max_seq,), dtype="int32")
x = keras.layers.Embedding(words+1, 128)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
x = keras.layers.Dense(32, activation='relu')(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy", keras.metrics.Precision()])

model.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)

print("")
print("Evaluation")
model.evaluate(x_test, y_test)

Epoch 1/2
Epoch 2/2

Evaluation


[0.3679468631744385, 0.8550000190734863, 0.8560273051261902]
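For intuition about what the `Dropout(0.2)` layers do during training, here is a minimal plain-Python sketch of inverted dropout: each activation is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)` so the expected value is unchanged. The helper name `dropout` and the constant activations are assumptions for illustration; at inference time the layer passes values through untouched.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000
dropped = dropout(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(round(zero_fraction, 2), round(mean_after, 2))
```

About a fifth of the activations are silenced on every training step, which discourages the network from relying on any single unit.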

Test accuracy improves marginally (0.8515 to 0.8550). What happens if more neurons are added?

In [48]:
keras.backend.clear_session()
inputs = keras.Input(shape=(max_seq,), dtype="int32")
x = keras.layers.Embedding(words+1, 256)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
x = keras.layers.Dense(64, activation='relu')(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy", keras.metrics.Precision()])

model.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)

print("")
print("Evaluation")
model.evaluate(x_test, y_test)

Epoch 1/2
Epoch 2/2

Evaluation


[0.3578365445137024, 0.8579999804496765, 0.8218818306922913]

We see an increase in accuracy (0.8550 to 0.8580) but a decrease in precision (0.8560 to 0.8219). Since this model is meant to assist in making diagnoses, accepting this loss is not a particularly good idea.
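To see why the two metrics can move in opposite directions, the sketch below computes both from raw predictions, assuming the usual 0.5 threshold on the sigmoid output. The helper name and the labels and scores are made up for illustration. Precision only looks at the samples predicted positive, so a few extra false positives can lower it even when overall accuracy rises.

```python
def accuracy_and_precision(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: binary accuracy and precision at a given threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Made-up labels and scores, purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1, 0.6, 0.95]
acc, prec = accuracy_and_precision(y_true, y_prob)
print(acc, prec)
```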

Another test is run, reducing the number of neurons in the dense layer.

In [49]:
keras.backend.clear_session()
inputs = keras.Input(shape=(max_seq,), dtype="int32")
x = keras.layers.Embedding(words+1, 256)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
x = keras.layers.Dense(32, activation='relu')(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy", keras.metrics.Precision()])

model.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)

print("")
print("Evaluation")
model.evaluate(x_test, y_test)

Epoch 1/2
Epoch 2/2

Evaluation


[0.33578750491142273, 0.8514999747276306, 0.8804634213447571]

This test reduces our accuracy marginally (0.8580 to 0.8515) but increases our precision considerably (0.8219 to 0.8805).

What happens if we add another recurrent layer?

In [50]:
keras.backend.clear_session()
inputs = keras.Input(shape=(max_seq,), dtype="int32")
x = keras.layers.Embedding(words+1, 256)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(32))(x)
x = keras.layers.Dense(16, activation='relu')(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy", keras.metrics.Precision()])

model.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)

print("")
print("Evaluation")
model.evaluate(x_test, y_test)

Epoch 1/2
Epoch 2/2

Evaluation


[0.4035157859325409, 0.8512499928474426, 0.8852721452713013]

We see a marginal increase in precision (0.8805 to 0.8853) and a marginal decrease in accuracy (0.8515 to 0.8512). Note that so far the number of neurons has been reduced from one layer to the next. This is the best model so far; let's see what happens if that reduction policy is changed.

In [51]:
keras.backend.clear_session()
inputs = keras.Input(shape=(max_seq,), dtype="int32")
x = keras.layers.Embedding(words+1, 256)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
x = keras.layers.Dense(128, activation='relu')(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy", keras.metrics.Precision()])

model.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)

print("")
print("Evaluation")
model.evaluate(x_test, y_test)

Epoch 1/2
Epoch 2/2

Evaluation


[0.3523355722427368, 0.8615000247955322, 0.8523967862129211]

Accuracy increases (0.8512 to 0.8615), but precision decreases considerably (0.8853 to 0.8524), so this configuration will not be considered.

It is therefore decided to train the model over several epochs using the best configuration so far.
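The choice of "best configuration" can be made explicit by collecting the test metrics reported by each `evaluate` call above and ranking them by precision. The sketch below does this; the numbers are copied from the outputs and rounded to four decimals, while the labels are informal descriptions added here for readability. Each configuration was trained once for two epochs without a fixed seed, so small differences between rows may not be significant.

```python
# (label, test accuracy, test precision) from the evaluate() outputs above.
results = [
    ("2 BiLSTM(64)", 0.8522, 0.8515),
    ("+ Dense(32)", 0.8515, 0.8512),
    ("+ Dropout(0.2)", 0.8550, 0.8560),
    ("Embedding 256, BiLSTM 128/64, Dense(64)", 0.8580, 0.8219),
    ("Embedding 256, BiLSTM 128/64, Dense(32)", 0.8515, 0.8805),
    ("Embedding 256, BiLSTM 128/64/32, Dense(16)", 0.8512, 0.8853),
    ("Embedding 256, BiLSTM 128/64/64, Dense(128)", 0.8615, 0.8524),
]

# Precision is the priority metric in this notebook, so rank by it.
ranked = sorted(results, key=lambda r: r[2], reverse=True)
best = ranked[0]
for label, acc, prec in ranked:
    print(f"{prec:.4f}  {acc:.4f}  {label}")
```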

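The training and test datasets are merged so that the model is trained on all of the data. Note that the final evaluation on x_test therefore measures performance on samples the model has already seen, so it is not comparable to the earlier test metrics.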

In [62]:
total_x = np.vstack((x_train, x_test))
total_y = np.hstack((y_train, y_test))

total_x.shape, total_y.shape

((20000, 34), (20000,))

In [64]:
keras.backend.clear_session()
inputs = keras.Input(shape=(max_seq,), dtype="int32")
x = keras.layers.Embedding(words+1, 256)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(32))(x)
x = keras.layers.Dense(16, activation='relu')(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy", keras.metrics.Precision()])

earlyStopping = keras.callbacks.EarlyStopping(monitor="accuracy", patience=10, verbose=0, mode="max")
mcp_save = keras.callbacks.ModelCheckpoint("tweets_resulting_model.h5", save_best_only=True, monitor="accuracy", mode="max")

model.fit(total_x, total_y, batch_size=32, epochs=25,
          callbacks=[earlyStopping, mcp_save])

print("")
print("Evaluation")
model.evaluate(x_test, y_test)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Evaluation


[0.011706149205565453, 0.9962499737739563, 0.9965652823448181]
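The log shows all 25 epochs running, which means the `EarlyStopping` callback never triggered. A minimal sketch of the patience rule helps explain why: training only stops after `patience` consecutive epochs without a new best value of the monitored metric, and here that metric is the training accuracy, since `fit` was called without validation data. The helper name `epochs_run` and both accuracy curves are made up for illustration.

```python
def epochs_run(metric_per_epoch, patience):
    """Hypothetical helper: epochs completed under patience-based stopping (mode="max")."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best, wait = value, 0   # improvement: remember it and reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch        # no improvement for `patience` epochs in a row
    return len(metric_per_epoch)

# Made-up accuracy curves, purely for illustration.
steadily_improving = [0.80 + 0.005 * i for i in range(25)]
plateau = [0.80, 0.85, 0.90] + [0.90] * 22

full_run = epochs_run(steadily_improving, patience=10)
early_stop = epochs_run(plateau, patience=10)
print(full_run, early_stop)
# 25 13
```

With a patience of 10 and a training metric that keeps creeping upward, stopping early within 25 epochs is unlikely.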