# Data for Good: predicting suicidal behavior likelihood among Reddit users using Deep Learning (Part 2)

*Deep Learning and Reinforcement Learning (part of IBM Machine Learning Professional Certificate) - Course Project.*

>*No one is useless in this world who lightens the burdens of another.*  
― **Charles Dickens**

<img src='https://www.discover-norway.no/upload/images/-development/header/desktop/kul_munch/edvard%20munch%20the%20scream%201893_munchmmuseet.jpg'></img>

## Table of contents
1. [Data Preparation](#preparation)  
2. [Model Development: Recurrent Neural Network](#model)  
  2.1. [...](#kmeans)  
  2.2. [...](#hac)  
  2.3. [...](#dbscan)  
3. [Results](#results)  
4. [Discussion](#discussion)  
5. [Conclusion](#conclusion)  
  5.1. [Project Summary](#summary)  
  5.2. [Outcome of the Analysis](#outcome)  
  5.3. [Potential Developments](#developments)

## 1. Data Preparation <a name=preparation></a>

Steps to process the data for modeling:
1. Drop irrelevant dataset features.
2. Remove the stopwords found during the word cloud analysis from the data.
3. Tokenize the posts.
4. One-hot encode the target variable (the classes).
5. Pad the sequences.
6. Split dataset into training and testing sets.

In [1]:
#Import needed libraries
import keras
import pandas as pd
import random
from random import randrange, seed
from keras.preprocessing.text import Tokenizer
import numpy as np
from keras.utils import pad_sequences
from sklearn.model_selection import train_test_split

In [2]:
#Import data (after the cleaning and the EDA performed in the word-cloud environment notebook)
data = pd.read_csv(r'data.csv')
processed_data = data.copy()
processed_data.head()

Unnamed: 0,User,Post,Label,word_count,Post_nostopwords,classes,class
0,user-0,its not a viable option and youll be leaving y...,Supportive,134,viable option leaving wife behind youd pain be...,0,0
1,user-1,it can be hard to appreciate the notion that y...,Ideation,2163,appreciate notion meet deeply boyfriend desire...,1,1
2,user-2,hi so last night i was sitting on the ledge of...,Behavior,470,hi night sitting ledge window contemplating wh...,1,1
3,user-3,i tried to kill my self once and failed badly ...,Attempt,885,tried kill self failed badly cause moment want...,1,1
4,user-4,hi nem3030 what sorts of things do you enjoy d...,Ideation,208,hi nem3030 sorts enjoy personally welcome musi...,1,1


###### 1. Drop irrelevant features.

In [3]:
#Drop irrelevant features
processed_data.drop(['User', 'word_count', 'Label', 'Post_nostopwords'], axis=1, inplace=True)
processed_data.tail()

Unnamed: 0,Post,classes,class
495,its not the end it just feels that way or at l...,0,0
496,it was a skype call but she ended it and ventr...,0,0
497,that sounds really weird maybe you were distra...,0,0
498,dont know there as dumb as it sounds i feel hy...,1,1
499,gt it gets better trust me ive spent long enou...,1,1


###### 2. Remove the stop words.

I start processing the data by deleting the stopwords found during the word cloud analysis (see Part 1 Notebook).

#Import the stop_words list and create a Python list
#strip() removes the trailing newline without cutting the last word if the file has no final newline
with open(r'stop_words.txt', 'r') as stop_words:
    sw = [line.strip() for line in stop_words]

print('Length of stop word list:', len(sw))

#The with block closes the file automatically
print('Is the file closed?', stop_words.closed)

print("First 50 stop words:\n", sw[:50])

#let's visualize a random post
random.seed(3)
processed_data.loc[randrange(500)]['Post']

random.seed(3)
print('Length of the post before removing the stop words:', len(processed_data.loc[randrange(500)]['Post']))

#let's remove the stop words (a set makes each membership test fast)
sw_set = set(sw)
processed_data['Post'] = processed_data['Post'].apply(lambda x: ' '.join([word for word in x.split() if word not in sw_set]))

#let's visualize the same post without stopwords
random.seed(3)
processed_data.loc[randrange(500)]['Post']

random.seed(3)
print('Length of the post after removing the stop words:', len(processed_data.loc[randrange(500)]['Post']))
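
As a minimal sketch of the filtering step above, here is the same idea on a toy sentence. The stop-word set and the helper name `remove_stop_words` are made up for illustration and are not the list from Part 1.

```python
# Hypothetical stop words, for illustration only
toy_stop_words = {"the", "a", "is", "to"}

def remove_stop_words(text, stop_words):
    """Keep only the whitespace-separated words that are not stop words."""
    return " ".join(word for word in text.split() if word not in stop_words)

toy_post = "the night is a long road to morning"
cleaned = remove_stop_words(toy_post, toy_stop_words)

print(cleaned)
print(len(toy_post), "->", len(cleaned))
```

Comparing the character length before and after, as done above for the random post, is a quick check that the filter actually removed something.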

###### 3. Tokenize the text.

Next I tokenize the posts with the Keras Tokenizer class: each post is split into a list of individual words, and each word is then mapped to an integer.

In [4]:
#let's visualize a random post
random.seed(13)
processed_data.loc[randrange(500)]['Post']

'dude dont do this you wont be called brave or bold you will just become the guy who killed himself a no body live through it buck up if i see it on the news when i die ill kick your ass in heaven or whever we go'

In [5]:
#Let's tokenize the data
tokenizer = Tokenizer()
#train the tokenizer
tokenizer.fit_on_texts(processed_data['Post'])
#convert text into lists of integers
posts = tokenizer.texts_to_sequences(processed_data['Post'])

In [6]:
#let's visualize the same post after tokenizing
random.seed(13)
print(posts[randrange(500)])

[804, 26, 27, 28, 2, 233, 17, 561, 1243, 33, 6501, 2, 40, 23, 392, 5, 353, 78, 952, 766, 6, 63, 545, 141, 103, 7, 6502, 50, 20, 3, 101, 7, 29, 5, 1294, 57, 3, 282, 191, 1424, 14, 998, 11, 2693, 33, 11724, 85, 76]


In [7]:
#Let's map the integers back to words to check what each integer means
random.seed(13)
' '.join(tokenizer.index_word[w] for w in posts[randrange(500)])

'dude dont do this you wont be called brave or bold you will just become the guy who killed himself a no body live through it buck up if i see it on the news when i die ill kick your ass in heaven or whever we go'
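
To make the word-to-integer mapping concrete, here is a simplified pure-Python stand-in for what the tokenizer does. It is only a sketch: the function names are hypothetical, and it just lowercases and splits on whitespace, whereas the Keras Tokenizer also filters punctuation. The key idea it illustrates is that ids start at 1 and the most frequent word gets the smallest id.

```python
from collections import Counter

def fit_word_index(texts):
    """Build a word -> integer map, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word with its integer id; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_posts = ["you are not alone", "you are stronger than you think"]
word_index = fit_word_index(toy_posts)
sequences = texts_to_ids(toy_posts, word_index)

# Inverse map, used to turn integer ids back into words
index_word = {i: w for w, i in word_index.items()}
round_trip = " ".join(index_word[i] for i in sequences[1])

print(sequences)
print(round_trip)
```

Id 0 is never assigned to a word, which leaves it free to be used as the padding value later on.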

###### 4. One-Hot Encode the target variable.

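I now one-hot encode the data classes using the Keras library.

#'Label' was dropped from processed_data in step 1, so read it from the original data.
#The codes are stored separately so the binary 'class' column used for modeling is not overwritten.
label_codes = data['Label'].apply(lambda x: 1 if x == 'Supportive' else 2 if x == 'Indicator'
                                  else 3 if x == 'Ideation' else 4 if x == 'Behavior' else 5)

#to_categorical also creates a column for class 0; the codes start at 1, so that column is dropped
output = keras.utils.to_categorical(label_codes)
output = output[:,1:]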
output
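
The same encoding can be sketched with NumPy alone, which makes it easier to see why the first column is dropped. The class codes below are made up for illustration.

```python
import numpy as np

# Toy integer class codes in 1..5 (values made up for illustration)
codes = np.array([1, 3, 5, 2])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(codes.max() + 1)[codes]

# Codes start at 1, so column 0 is always zero and can be dropped
one_hot = one_hot[:, 1:]

print(one_hot.shape)
print(one_hot)
```

After dropping column 0, the position of the 1 in each row is the class code minus one.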

###### 5. Pad the sequences.

Let's now create sequences of the same length. During the Exploratory Data Analysis we found that 80% of posts have fewer than 2,000 words. Therefore I set the maximum sequence length to 2,000: posts longer than 2,000 words are truncated, whilst posts shorter than 2,000 words are padded with zeros.

In [8]:
posts = pad_sequences(posts, maxlen=2000, padding='post', truncating='post')
#posts[0]
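
As a rough sketch of what `padding='post'` and `truncating='post'` mean for a single sequence, the hypothetical helper below cuts from the end and then right-pads with zeros. It is an illustration in plain Python, not the Keras implementation.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Truncate from the end, then right-pad with pad_value up to maxlen."""
    truncated = sequence[:maxlen]
    return truncated + [pad_value] * (maxlen - len(truncated))

short = pad_post([5, 8, 2], maxlen=5)
long = pad_post([5, 8, 2, 9, 4, 7, 1], maxlen=5)

print(short)
print(long)
```

Because 0 is used only for padding, an Embedding layer with `mask_zero=True` can tell the downstream recurrent layer to skip those positions.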

###### 6. Split the dataset into training and testing sets.

Let's now create the final dataset ready for modelling, by concatenating the tokenized word sequences with the encoded classes:

In [9]:
model_data = np.concatenate((posts, np.expand_dims(np.array(processed_data['class']), axis=1)), axis=1)
np.shape(model_data)

(500, 2001)

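Let's now count the distinct token ids in the padded sequences. Note that this count includes the padding id 0 and excludes words that appear only beyond the 2,000-token cutoff, so it is not the size of the tokenizer's full vocabulary.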

In [10]:
num_words = len(np.unique(posts))
print('After the pre-processing stage, the data contains {} unique words'.format(f'{num_words:,}'))

After the pre-processing stage, the data contains 14,680 unique words


Let's split the dataset into train and test sets. I use 20% of the dataset (100 observations) as test data, and the stratify parameter to keep the class proportions the same in both sets.

In [11]:
x_train, x_test, y_train, y_test = train_test_split(model_data[:,:-1], model_data[:,-1], test_size=0.2, random_state=666,
                                                    stratify = model_data[:,-1])

In [12]:
print('Training dataset shape:', x_train.shape)
print('Testing dataset shape:', x_test.shape)

Training dataset shape: (400, 2000)
Testing dataset shape: (100, 2000)
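
To illustrate what stratification does, here is a hand-rolled sketch that splits each class separately so the test set keeps roughly the same class proportions. The function `stratified_split` and the toy labels are hypothetical; this is not scikit-learn's algorithm, only the idea behind the `stratify` argument.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), sampling test_size of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy imbalanced labels: 30% class 0, 70% class 1
toy_labels = [0] * 30 + [1] * 70
train_idx, test_idx = stratified_split(toy_labels, test_size=0.2)

test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

With a plain random split on a dataset this small, a minority class could easily end up over- or under-represented in the test set.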


## 2. Model Development <a name= 'model'></a>

Model hyperparameters:
- embedding layer dimensions, and whether the embeddings are trained or pre-trained
- number of layers before/after the recurrent section of the network
- the state dimension
- RNN initializers: default
- number of neurons in the hidden layer(s)
- activation functions for the hidden layers (sigmoid, tanh, ReLU, leaky ReLU)
- learning rate
- batch size (usually 16 or 32)
- number of epochs
- regularization: stochastic or mini-batch gradient descent (evaluate other regularization techniques only if the model overfits the data)
- optimizers

In [18]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding, GRU
from numpy.random import seed
import tensorflow
seed(1)
tensorflow.random.set_seed(1)

In [14]:
#Initialize the model
plain_rnn = Sequential()

# Add the Embedding layer, which maps each input integer (word) to a 300-dimensional vector.
#I am not using any pre-trained embeddings
plain_rnn.add(Embedding(posts.max()+1, output_dim=300, trainable=True, mask_zero=True))

# Add the RNN layer
plain_rnn.add(SimpleRNN(units=150, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',activation='tanh',
                        input_shape=x_train.shape[1:]))

# Add the final output layer (a single sigmoid unit for binary classification)
plain_rnn.add(Dense(1, activation='sigmoid'))

# Compile the model
adam = keras.optimizers.Adam(learning_rate=0.001)
plain_rnn.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

#Let's check the model architecture
plain_rnn.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 300)         5315700   
                                                                 
 simple_rnn (SimpleRNN)      (None, 150)               67650     
                                                                 
 dense (Dense)               (None, 1)                 151       
                                                                 
Total params: 5,383,501
Trainable params: 5,383,501
Non-trainable params: 0
_________________________________________________________________
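
The parameter counts in the summary can be reproduced by hand. In the sketch below, the vocabulary size is not stated in the notebook; it is inferred from the embedding parameter count (5,315,700 / 300), so treat it as an assumption.

```python
vocab_size = 17_719   # posts.max() + 1, inferred from 5,315,700 / 300
embedding_dim = 300
units = 150

# Embedding: one embedding_dim-sized vector per token id
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input kernel + recurrent kernel + bias, for each unit
rnn_params = units * (embedding_dim + units + 1)

# Dense(1): one weight per RNN unit plus one bias
dense_params = units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, which is worth keeping in mind with only 400 training posts.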


In [15]:
# Train the model (the random seeds were set above for reproducibility)
plain_rnn.fit(x_train, y_train, batch_size=16, epochs=10, shuffle=True, validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x28bf691c610>

In [16]:
# Continue training the same model for 5 more epochs
plain_rnn.fit(x_train, y_train, batch_size=16, epochs=5, shuffle=True, validation_data=(x_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x28bffcac7f0>

In [None]:
#To output the validation set loss and metrics
plain_rnn.evaluate(x_test, y_test)
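
`evaluate` returns the loss and the accuracy. To get class labels from a single sigmoid output, the predicted probabilities need to be thresholded. The sketch below uses made-up probabilities and a 0.5 threshold, which is an assumption that may not be the best operating point for this task.

```python
import numpy as np

# Made-up sigmoid outputs shaped (n_samples, 1), as a Dense(1) layer would return
probabilities = np.array([[0.91], [0.12], [0.55], [0.49]])
true_labels = np.array([1, 0, 0, 0])

# Flatten to 1-D and threshold at 0.5 to obtain 0/1 predictions
predicted = (probabilities.ravel() > 0.5).astype(int)

accuracy = (predicted == true_labels).mean()
print(predicted, accuracy)
```

Lowering the threshold trades more false alarms for fewer missed at-risk posts, which may be the preferable error in this setting.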

---

**Gated Recurrent Unit**

In [19]:
#Initialize the model
gru_rnn = Sequential()

# Add the Embedding layer, which maps each input integer (word) to a 300-dimensional vector.
#I am not using any pre-trained embeddings
gru_rnn.add(Embedding(posts.max()+1, output_dim=300, trainable=True, mask_zero=True))

# Add the RNN layer
gru_rnn.add(GRU(units=150, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',activation='tanh',
                recurrent_activation="sigmoid", input_shape=x_train.shape[1:]))

# Add the final output layer (a single sigmoid unit for binary classification)
gru_rnn.add(Dense(1, activation='sigmoid'))

# Compile the model
adam = keras.optimizers.Adam(learning_rate=0.001)
gru_rnn.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

#Let's check the model architecture
gru_rnn.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, None, 300)         5315700   
                                                                 
 gru (GRU)                   (None, 150)               203400    
                                                                 
 dense_1 (Dense)             (None, 1)                 151       
                                                                 
Total params: 5,519,251
Trainable params: 5,519,251
Non-trainable params: 0
_________________________________________________________________
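
The GRU parameter count can be checked the same way. The sketch assumes the layer uses two bias vectors per gate block (`reset_after=True`, the default in TensorFlow 2 Keras), which is consistent with the 203,400 parameters reported in the summary.

```python
embedding_dim = 300
units = 150

# Three weight blocks (update gate, reset gate, candidate state), each with
# an input kernel, a recurrent kernel and two bias vectors
gru_params = 3 * (units * (embedding_dim + units) + 2 * units)

# SimpleRNN with the same sizes, for comparison
simple_rnn_params = units * (embedding_dim + units + 1)

print(gru_params, simple_rnn_params, round(gru_params / simple_rnn_params, 2))
```

The GRU has roughly three times the recurrent parameters of the plain RNN, but both are small next to the embedding layer.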


In [20]:
gru_rnn.fit(x_train, y_train, batch_size=16, epochs=10, shuffle=True, validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x28b9d4c6730>

---

In [21]:
#Initialize the model
gru_rnn = Sequential()

# Add the Embedding layer, which maps each input integer (word) to a 300-dimensional vector.
#I am not using any pre-trained embeddings
gru_rnn.add(Embedding(posts.max()+1, output_dim=300, trainable=True, mask_zero=True))

# Add the RNN layer
gru_rnn.add(GRU(units=150, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',activation='tanh',
                recurrent_activation="sigmoid", input_shape=x_train.shape[1:], dropout=0.25, recurrent_dropout=0.25))

# Add the final output layer (a single sigmoid unit for binary classification)
gru_rnn.add(Dense(1, activation='sigmoid'))

# Compile the model
adam = keras.optimizers.Adam(learning_rate=0.001)
gru_rnn.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

#Let's check the model architecture
gru_rnn.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, None, 300)         5315700   
                                                                 
 gru_1 (GRU)                 (None, 150)               203400    
                                                                 
 dense_2 (Dense)             (None, 1)                 151       
                                                                 
Total params: 5,519,251
Trainable params: 5,519,251
Non-trainable params: 0
_________________________________________________________________


In [22]:
gru_rnn.fit(x_train, y_train, batch_size=16, epochs=10, shuffle=True, validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x28b9d595940>

Model 2: reducing complexity by reducing the number of parameters (shorter sequences, smaller embedding and state dimensions).

In [None]:
posts_1000 = pad_sequences(posts, maxlen=1000, padding='post', truncating='post')

In [None]:
model_data = np.concatenate((posts_1000, np.expand_dims(np.array(processed_data['class']), axis=1)), axis=1)
np.shape(model_data)

Let's count the distinct token ids again, this time in the sequences truncated to 1,000 tokens (the count includes the padding id 0).

In [None]:
num_words = len(np.unique(posts_1000))
print('After the pre-processing stage, the data contains {} unique words'.format(f'{num_words:,}'))

Let's split the dataset into train and test sets. I use 20% of the dataset (100 observations) as test data, and the stratify parameter to keep the class proportions the same in both sets.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(model_data[:,:-1], model_data[:,-1], test_size=0.2, random_state=50,
                                                    stratify = model_data[:,-1])

In [None]:
print('Training feature dataset shape:', x_train.shape)
print('Testing feature dataset shape:', x_test.shape)

In [None]:
print('Training class dataset shape:', y_train.shape)
print('Testing class dataset shape:', y_test.shape)

In [None]:
seed(2)
tensorflow.random.set_seed(2)

#Initialize the model
rnn_2 = Sequential()

# Add the Embedding layer, which maps each input integer (word) to a 250-dimensional vector.
#I am not using any pre-trained embeddings
rnn_2.add(Embedding(posts_1000.max()+1, output_dim=250, trainable=True, mask_zero=True))

# Add the RNN layer
rnn_2.add(SimpleRNN(units=100, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', activation='tanh',
                    input_shape=x_train.shape[1:]))

# Add the final output layer (a single sigmoid unit for binary classification)
rnn_2.add(Dense(1, activation='sigmoid'))

# Compile the model
adam = keras.optimizers.Adam(learning_rate=0.001)
rnn_2.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

#Let's check the model architecture
rnn_2.summary()

In [None]:
rnn_2.fit(x_train, y_train, batch_size=16, epochs=10, shuffle=True, validation_data=(x_test, y_test))

In [None]:
rnn_2.fit(x_train, y_train, batch_size=16, epochs=10, shuffle=True, validation_data=(x_test, y_test))

In [None]:
rnn_2.fit(x_train, y_train, batch_size=16, epochs=10, shuffle=True, validation_data=(x_test, y_test))

In [None]:
rnn_2.fit(x_train, y_train, batch_size=16, epochs=5, shuffle=True, validation_data=(x_test, y_test))

In [None]:
rnn_2.fit(x_train, y_train, batch_size=16, epochs=5, shuffle=True, validation_data=(x_test, y_test))

---

In [None]:
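#Note: the argmax-based cells from here on assume one-hot encoded targets
#and a model whose output layer has one unit per class
y_pred_train = rnn_2.predict(x_train)
sum(np.argmax(y_train, axis=1) == np.argmax(y_pred_train, axis=1))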

In [None]:
y_pred = rnn_2.predict(x_test)
y_pred = np.argmax(y_pred, axis=1)
y_pred

In [None]:
np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 0)[0]],return_counts=True)[1]

In [None]:
2/22

In [None]:
#actual class: supportive (0)
i=0
for v in np.unique(y_pred[np.where(np.argmax(y_test, axis=1) == 0)]):
    print(v,
          np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 0)[0]],return_counts=True)[1][i]/
          len(np.where(np.argmax(y_test,axis=1)==0)[0]))
    i+=1

In [None]:
np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 1)[0]],return_counts=True)

In [None]:
len(np.where(np.argmax(y_test,axis=1)==1)[0])

In [None]:
6/20

In [None]:
#actual class: indicator (1)
i=0
for v in np.unique(y_pred[np.where(np.argmax(y_test, axis=1) == 1)]):
    print(v,
          np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 1)[0]],return_counts=True)[1][i]/
          len(np.where(np.argmax(y_test,axis=1)==1)[0]))
    i+=1

In [None]:
np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 2)[0]],return_counts=True)

In [None]:
len(np.where(np.argmax(y_test,axis=1)==2)[0])

In [None]:
11/34

In [None]:
#actual class: ideation (2)
i=0
for v in np.unique(y_pred[np.where(np.argmax(y_test, axis=1) == 2)]):
    print(v,
          np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 2)[0]],return_counts=True)[1][i]/
          len(np.where(np.argmax(y_test,axis=1)==2)[0]))
    i+=1

In [None]:
np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 3)[0]],return_counts=True)

In [None]:
len(np.where(np.argmax(y_test,axis=1)==3)[0])

In [None]:
3/15

In [None]:
#actual class: behavior (3)
i=0
for v in np.unique(y_pred[np.where(np.argmax(y_test, axis=1) == 3)]):
    print(v,
          np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 3)[0]],return_counts=True)[1][i]/
          len(np.where(np.argmax(y_test,axis=1)==3)[0]))
    i+=1

In [None]:
np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 4)[0]],return_counts=True)

In [None]:
len(np.where(np.argmax(y_test,axis=1)==4)[0])

In [None]:
3/9

In [None]:
#actual class: attempt (4)
i=0
for v in np.unique(y_pred[np.where(np.argmax(y_test, axis=1) == 4)]):
    print(v,
          np.unique(y_pred[np.where(np.argmax(y_test,axis=1) == 4)[0]],return_counts=True)[1][i]/
          len(np.where(np.argmax(y_test,axis=1)==4)[0]))
    i+=1
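
The per-class cells above compute, for each actual class, the share of predictions that fall into each predicted class. That is a row-normalized confusion matrix, and a compact sketch of it follows. The helper name and the toy labels are made up, and the inputs are assumed to be integer class ids (for example after `np.argmax`).

```python
import numpy as np

def row_normalized_confusion(y_true, y_pred, n_classes):
    """Return (counts, rates); rates[i, j] is the share of class i predicted as j."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    # Add 1 at every (actual, predicted) pair, including repeated pairs
    np.add.at(counts, (y_true, y_pred), 1)
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows with no samples stay at zero instead of dividing by zero
    rates = np.divide(counts, row_totals,
                      out=np.zeros(counts.shape, dtype=float),
                      where=row_totals > 0)
    return counts, rates

# Toy labels for three classes (made up for illustration)
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0, 2])

counts, rates = row_normalized_confusion(y_true_toy, y_pred_toy, n_classes=3)
print(counts)
print(rates.round(2))
```

The diagonal of `rates` is the per-class recall, which replaces the manual divisions such as `2/22` with a single table.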

Extras:
1. Can I do cross-validation / hyperparameter tuning with deep learning models? https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/#:~:text=By%20setting%20the%20n_jobs%20argument,for%20each%20combination%20of%20parameters.

---

sources for data-preprocessing (NLP):
- https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470
- https://medium0.com/@saad.arshad102/sentiment-analysis-text-classification-using-rnn-bi-lstm-recurrent-neural-network-81086dda8472

---

data source: https://www.kaggle.com/datasets/thedevastator/c-ssrs-labeled-suicidality-in-500-anonymized-red
https://zenodo.org/record/2667859#.Y9aqCXZBw2z