# Data for Good: predicting suicidal behavior likelihood among Reddit users using Deep Learning (Part 2)

*Deep Learning and Reinforcement Learning (part of IBM Machine Learning Professional Certificate) - Course Project.*

>*No one is useless in this world who lightens the burdens of another.*  
― **Charles Dickens**

<img src='https://www.discover-norway.no/upload/images/-development/header/desktop/kul_munch/edvard%20munch%20the%20scream%201893_munchmmuseet.jpg'></img>

## Table of contents
1. [Data Preparation](#preparation)  
2. [Model Development: Recurrent Neural Network](#model)  
  2.1. [...](#kmeans)  
  2.2. [...](#hac)  
  2.3. [...](#dbscan)  
3. [Results](#results)  
4. [Discussion](#discussion)  
5. [Conclusion](#conclusion)  
  5.1. [Project Summary](#summary)  
  5.2. [Outcome of the Analysis](#outcome)  
  5.3. [Potential Developments](#developments)

## 1. Data Preparation <a name=preparation></a>

Steps to process the data for modeling:
1. Drop irrelevant dataset features.
2. Remove the stop words found during the word cloud analysis from the data.
3. Tokenize the posts.
4. One-hot encode the target variable (the classes).
5. Pad the sequences.
6. Split dataset into training and testing sets.

In [1]:
#Import needed libraries
import keras
import pandas as pd
import random
from random import randrange, seed
from keras.preprocessing.text import Tokenizer
import numpy as np
from keras.utils import pad_sequences
from sklearn.model_selection import train_test_split

In [2]:
#Import data (after the cleaning and EDA performed in the word-cloud environment notebook)
data = pd.read_csv(r'data.csv')
processed_data = data.copy()
processed_data.head()

Unnamed: 0,User,Post,Label,word_count
0,user-0,its not a viable option and youll be leaving y...,Supportive,134
1,user-1,it can be hard to appreciate the notion that y...,Ideation,2163
2,user-2,hi so last night i was sitting on the ledge of...,Behavior,470
3,user-3,i tried to kill my self once and failed badly ...,Attempt,885
4,user-4,hi nem3030 what sorts of things do you enjoy d...,Ideation,208


###### 1. Drop irrelevant features.

In [3]:
#Drop irrelevant features
processed_data.drop(['User', 'word_count'], axis=1, inplace=True)
processed_data.tail()

Unnamed: 0,Post,Label
495,its not the end it just feels that way or at l...,Supportive
496,it was a skype call but she ended it and ventr...,Indicator
497,that sounds really weird maybe you were distra...,Supportive
498,dont know there as dumb as it sounds i feel hy...,Attempt
499,gt it gets better trust me ive spent long enou...,Behavior


###### 2. Remove the stop words.

I start by removing the stop words identified during the word cloud analysis (see the Part 1 notebook).

In [4]:
#Read the stop words file into a Python list
stop_words = open(r'stop_words.txt', 'r')
sw=[]
for line in stop_words:
    #Strip only the trailing newline; line[:-1] would cut a letter off a last line that has no newline
    sw.append(line.rstrip('\n'))
    
print('Length of stop word list:', len(sw))

Length of stop word list: 323


In [5]:
#Close the file
stop_words.close()
print('Is the file closed?', stop_words.closed)

Is the file closed? True


In [6]:
print("First 51 stop words:\n",sw[:51])

First 51 stop words:
 ['see', "where's", 'not', 'm', 'have', 'whom', 'need', 'maybe', 'to', 'someone', 'get', 'which', "aren't", 'our', 'made', 'like', "weren't", 'hasn', 'won', "you'd", "isn't", 'nor', 'back', "here's", 'my', 'else', 'too', 'shouldn', 'always', "who's", 'day', "we'd", 'how', "how's", 'years', 'since', 'happy', 'was', 'friends', 'under', "i've", 'www', 'try', 'thats', "you'll", 'give', 'yours', "we've", 'ever', 'll', 'd']


In [7]:
#let's visualize a random post
random.seed(3)
processed_data.loc[randrange(500)]['Post']

'no more ideas i dont agree with live for others kind of advice i think you should live for yourself and your friends and family the world isnt going to be fixed any time soon so stop thinking its all on your shoulders regular exercise and a lack of excessive stress is important to a good life so is a decent job work is now stressful yes its never done im on a long break now its tired hot and humid where i now live so i cant really do anything i cant handle the heat well i want to prepare for my death before i go back to work its not only that the career enabled me to live a certain lifestyle and live in a certain place and not have to worry too much about money and other things why would you like that i dont think there are any other kinds of job i could do in this country it has been 5 years since i lost my job i have tried my best the things i lost in my life i believe them to be extremely fundamental and important things i also lost a life that had little worry and stress now i hav

In [8]:
random.seed(3)
print('Length of the post before removing the stop words:', len(processed_data.loc[randrange(500)]['Post']))

Length of the post before removing the stop words: 2269


In [9]:
#let's remove the stop words
processed_data['Post'] = processed_data['Post'].apply(lambda x: ' '.join([word for word in x.split() if word not in (sw)]))

#let's visualize the same post without stopwords
random.seed(3)
processed_data.loc[randrange(500)]['Post']

'ideas agree others kind advice family world fixed soon stop thinking shoulders regular exercise lack excessive stress important decent job stressful yes done break tired hot humid handle heat prepare death career enabled certain lifestyle certain place worry money kinds job country 5 lost job tried best lost believe extremely fundamental important lost little worry stress job gets worse allow exercise boiling hot city saps energy horrible bitchy colleagues norm realize liked living country kind jobs worse world shitty jobs best jobs world threw tolerate job rest move different job industry city less hot humid place wont climate city ill lost suicide arent suicide attempt looked upon mental asthenia moment madness kind childish gesture arent actual suicides imagine kill guess lack understanding survival mechanism suicidal likely fixed world fucked 7bn fucking planet mere presence forget enjoy'

In [10]:
random.seed(3)
print('Length of the post after removing the stop words:', len(processed_data.loc[randrange(500)]['Post']))

Length of the post after removing the stop words: 904
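
Checking each word against a Python list scans the whole list every time, so turning the stop words into a set is a common speed-up on larger datasets. Below is a minimal sketch of the same filtering step; the stop words and the sentence are made up for illustration and do not come from the dataset, and `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical stop words and post, for illustration only
stop_word_set = {"i", "to", "the"}

def remove_stop_words(text, stop_words):
    """Return text without the whitespace-separated tokens found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

example = "i want to see the world"
cleaned = remove_stop_words(example, stop_word_set)
print(cleaned)  # want see world
```

In the notebook this would amount to building `set(sw)` once and using it inside the lambda.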


###### 3. Tokenize the text.

I am going to tokenize the posts: I'll split the text into a list of individual words and then convert each word into an integer, using the Keras Tokenizer class.
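
To make the idea concrete, here is a simplified pure-Python sketch of what a word-level tokenizer does: it ranks words by frequency and assigns 1 to the most frequent one, leaving 0 free for padding. This is an illustration under my own assumptions, not the Keras implementation (the real class also strips punctuation and has options such as `num_words` and `oov_token`); the function names and the toy posts are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer; 1 is the most frequent word, 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_posts = ["lost job lost hope", "job stress"]
word_index = fit_word_index(toy_posts)
sequences = texts_to_sequences(toy_posts, word_index)
print(word_index)  # {'lost': 1, 'job': 2, 'hope': 3, 'stress': 4}
print(sequences)   # [[1, 2, 1, 3], [2, 4]]
```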

In [11]:
#let's visualize a random post
random.seed(13)
processed_data.loc[randrange(500)]['Post']

'dude wont called brave bold become guy killed body buck news die ill kick ass heaven whever'

In [12]:
#Let's tokenize the data
tokenizer = Tokenizer()
#train the tokenizer
tokenizer.fit_on_texts(processed_data['Post'])
#convert text into lists of integers
posts = tokenizer.texts_to_sequences(processed_data['Post'])

In [13]:
#let's visualize the same post after tokenizing
random.seed(13)
print(posts[randrange(500)])

[579, 39, 342, 1008, 6259, 178, 143, 725, 326, 6260, 1059, 75, 17, 1188, 770, 2454, 11488]


In [14]:
#Let's map the integers back to words to check what each integer means
random.seed(13)
' '.join(tokenizer.index_word[w] for w in posts[randrange(500)])

'dude wont called brave bold become guy killed body buck news die ill kick ass heaven whever'

###### 4. One-Hot Encode the target variable.

I now one-hot encode the target classes using the Keras library.

In [15]:
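#Map each label to an integer class from 1 to 5 (any label not listed, here 'Attempt', becomes 5)
processed_data['class'] =  processed_data['Label'].apply(lambda x: 1 if x == 'Supportive' else 2 if x == 'Indicator'
                                                         else 3 if x == 'Ideation' else 4 if x == 'Behavior' else 5 )

output = keras.utils.to_categorical(processed_data['class'])
#The classes start at 1, so column 0 is never used: drop it to keep 5 columns
output = output[:,1:]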
output

array([[1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.]], dtype=float32)
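
Because the classes above are numbered from 1, the encoding has an unused leading column that must be sliced off. An alternative, sketched below with NumPy only, is to number the classes from 0 so that no column needs dropping. The class order is taken from the lambda above; the variable names and the four toy labels are assumptions for illustration.

```python
import numpy as np

# Class order assumed from the mapping used in the notebook
class_names = ["Supportive", "Indicator", "Ideation", "Behavior", "Attempt"]
class_to_index = {name: index for index, name in enumerate(class_names)}

toy_labels = ["Supportive", "Ideation", "Behavior", "Attempt"]
indices = np.array([class_to_index[label] for label in toy_labels])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(class_names), dtype="float32")[indices]
print(one_hot)
```

Each row contains a single 1 in the column of its class, which is the same layout as `output` above.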

###### 5. Pad the sequences.

Let's now create sequences of the same length. During the Exploratory Data Analysis we found that 80% of posts have fewer than 2,000 words. I therefore set the maximum sequence length to 2,000: posts longer than 2,000 words are truncated, whilst posts shorter than 2,000 words are padded.

In [16]:
posts = pad_sequences(posts, maxlen=2000, padding='post', truncating='post')
#posts[0]
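
The effect of `padding='post'` and `truncating='post'` can be illustrated with a small NumPy sketch: zeros are appended to short sequences and anything beyond the maximum length is cut from the end. `pad_post` is a hypothetical helper written for this illustration, not part of Keras, and the toy sequences are made up.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` and drop tokens beyond `maxlen`."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]              # truncate from the end
        padded[row, :len(kept)] = kept   # remaining positions keep the pad value
    return padded

toy_sequences = [[5, 8], [3, 1, 4, 1, 5, 9]]
print(pad_post(toy_sequences, maxlen=4))
# [[5 8 0 0]
#  [3 1 4 1]]
```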

###### 6. Split the dataset into training and testing sets.

Let's now create the final dataset ready for modelling, by concatenating the tokenized word sequences with the encoded classes:

In [17]:
model_data = np.concatenate((posts, output), axis=1)
np.shape(model_data)

(500, 2005)

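Let's now count the distinct token indices in the padded sequences, which gives the size of the vocabulary the model actually sees. Note that this count includes the padding index 0 and leaves out any word that appears only in the truncated part of a long post.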

In [18]:
num_words = len(np.unique(posts))
print('After the pre-processing stage, the data contains {} unique words'.format(f'{num_words:,}'))

After the pre-processing stage, the data contains 17,452 unique words
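
One detail worth keeping in mind: `np.unique` on a padded array also counts the padding index 0. The toy example below (made-up numbers) shows the difference between counting with and without it. For the full vocabulary before truncation, `len(tokenizer.word_index)` would be the number to look at.

```python
import numpy as np

# Toy padded sequences; 0 is the padding index
padded = np.array([[5, 8, 0, 0],
                   [3, 1, 4, 1]])

unique_ids = np.unique(padded)
n_with_padding = len(unique_ids)                   # includes the padding index 0
n_real_tokens = len(unique_ids[unique_ids != 0])   # word indices only
print(n_with_padding, n_real_tokens)  # 6 5
```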


Let's split the dataset into train and test sets. I use 20% of the dataset (100 observations) as test data, and the stratify parameter so that both sets keep the class proportions of the full (imbalanced) dataset.

In [19]:
x_train, x_test, y_train, y_test = train_test_split(model_data[:,:-5], model_data[:,-5:], test_size=0.2, random_state=666,
                                                    stratify = model_data[:,-5:])

In [20]:
print('Training dataset shape:', x_train.shape)
print('Testing dataset shape:', x_test.shape)

Training dataset shape: (400, 2000)
Testing dataset shape: (100, 2000)
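
The idea behind stratification can be sketched in a few lines of NumPy: split each class separately, so that the test set receives the same share of every class. This is a simplified illustration under my own assumptions, not the algorithm scikit-learn uses internally; `stratified_split_indices` and the toy class vector are hypothetical.

```python
import numpy as np

def stratified_split_indices(class_ids, test_size, seed=0):
    """Return (train_idx, test_idx) with roughly `test_size` of every class in the test set."""
    rng = np.random.default_rng(seed)
    test_idx = []
    for c in np.unique(class_ids):
        members = rng.permutation(np.flatnonzero(class_ids == c))
        n_test = int(round(test_size * len(members)))
        test_idx.extend(members[:n_test])
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(len(class_ids)), test_idx)
    return train_idx, test_idx

# Toy imbalanced data: 10 samples of class 0 and 5 of class 1
toy_classes = np.array([0] * 10 + [1] * 5)
train_idx, test_idx = stratified_split_indices(toy_classes, test_size=0.2)
print(np.bincount(toy_classes[train_idx]), np.bincount(toy_classes[test_idx]))  # [8 4] [2 1]
```

Both splits keep the 2:1 ratio between the two classes.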


## 2. Model Development <a name= 'model'></a>

Model hyperparameters:
- embedding layer dimensions and trained/pretrained embeddings
- number of layers before/after the recurrent section of the network
- the state dimension
- RNN initializers: default
- number of neurons in the hidden layer(s)
- activation functions for the hidden layers (sigmoid, hyperbolic tangent, relu, leaky relu)
- learning rate
- batch size (usually 16 or 32)
- number of epochs
- regularization: stochastic or mini-batch (evaluate other regularization techniques only if the model overfits the data)
- optimizers
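
One way to keep track of such choices is to write them down as a search space and enumerate the combinations. The sketch below does this with the standard library only; the parameter names and candidate values are illustrative assumptions, not settings I have tested in this project.

```python
from itertools import product

# Hypothetical candidate values for a few of the hyperparameters listed above
search_space = {
    "embedding_dim": [50, 300],
    "rnn_units": [50, 150],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [16, 32],
}

# One dictionary per combination of values
configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]
print(len(configs))  # 16
print(configs[0])    # {'embedding_dim': 50, 'rnn_units': 50, 'learning_rate': 0.001, 'batch_size': 16}
```

Each dictionary could then be used to build and train one model, which quickly becomes expensive as the grid grows.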

In [21]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding
from numpy.random import seed
import tensorflow

In [22]:
#Initialize the model
plain_rnn = Sequential()

# Add the Embedding layer, which maps each input integer (word) to a 300-dimensional vector.
#I am not using any pre-trained embeddings
plain_rnn.add(Embedding(posts.max()+1, output_dim=300, trainable=True, mask_zero=True))

# Add the RNN layer
plain_rnn.add(SimpleRNN(units=150, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', activation='tanh',
                        input_shape=x_train.shape[1:]))

# Add the final output layer (the extra dense layers are commented out for now)
#plain_rnn.add(Dense(75, activation='sigmoid'))
#plain_rnn.add(Dense(50, activation='tanh'))
#plain_rnn.add(Dense(25, activation='sigmoid'))
#plain_rnn.add(Dense(10, activation='tanh'))
plain_rnn.add(Dense(5, activation='softmax'))

# Compile the model
adam = keras.optimizers.Adam(learning_rate=0.001)
plain_rnn.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

#Let's check the model architecture
plain_rnn.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 300)         5248800   
                                                                 
 simple_rnn (SimpleRNN)      (None, 150)               67650     
                                                                 
 dense (Dense)               (None, 5)                 755       
                                                                 
Total params: 5,317,205
Trainable params: 5,317,205
Non-trainable params: 0
_________________________________________________________________
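
The parameter counts in the summary can be checked by hand. The embedding layer stores one 300-dimensional vector per input index, the SimpleRNN layer has an input kernel, a recurrent kernel and a bias, and the dense layer has a weight matrix and a bias. The vocabulary size below is inferred from the summary (5,248,800 / 300), which corresponds to `posts.max()+1`.

```python
vocab_size = 17_496   # inferred from the summary: 5,248,800 / 300
embedding_dim = 300
rnn_units = 150
n_classes = 5

embedding_params = vocab_size * embedding_dim
# input kernel (300 x 150) + recurrent kernel (150 x 150) + bias (150)
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)
# weights (150 x 5) + bias (5)
dense_params = rnn_units * n_classes + n_classes

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 5248800 67650 755 5317205
```

Almost all the parameters sit in the embedding layer, which is worth remembering given that the training set has only 400 posts.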


In [23]:
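# Seed NumPy and TensorFlow, then train the model.
# Note: the layer weights were already created in the previous cell, so for fully reproducible results the seeds should be set before the model is built.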
seed(1)
tensorflow.random.set_seed(2)
plain_rnn.fit(x_train, y_train, batch_size=16, epochs=10, shuffle=True, validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x24f3764cac0>

In [24]:
y_pred = np.argmax(plain_rnn.predict(x_train), axis=1)
y_pred[:10]



array([2, 3, 2, 2, 0, 0, 1, 1, 2, 0], dtype=int64)

In [25]:
np.argmax(y_train[:10], axis=1)

array([2, 3, 2, 2, 0, 0, 1, 1, 2, 0], dtype=int64)

In [26]:
y_pred == np.argmax(y_train, axis=1)

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,
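
Rather than reading a long boolean array, the comparison can be summarised as an accuracy value and a confusion matrix. Keep in mind that the comparison above is made on the training data, so a perfect match there says nothing about how the model generalises. The sketch below uses NumPy only; the helper name and the toy labels are made up for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)  # add 1 for every (true, predicted) pair
    return matrix

# Toy class indices, not taken from the model above
y_true_toy = np.array([2, 3, 2, 0, 1, 1])
y_pred_toy = np.array([2, 3, 0, 0, 1, 2])

accuracy = np.mean(y_true_toy == y_pred_toy)
matrix = confusion_matrix(y_true_toy, y_pred_toy, n_classes=4)
print(round(float(accuracy), 2))  # 0.67
print(matrix)
```

In the notebook, the same functions could be applied to `np.argmax(y_test, axis=1)` and the predictions on `x_test`.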

In [None]:
#To output the validation set loss and metrics
plain_rnn.evaluate(x_test, y_test)
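
With the compile settings used above, `evaluate` returns the loss (categorical cross-entropy) followed by the accuracy. The sketch below shows how these two numbers are computed from predicted probabilities, using NumPy and made-up values; the clipping constant is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of minus the log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

def accuracy_score(y_true, y_prob):
    """Share of samples whose most probable class is the true class."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

# Toy one-hot targets and predicted probabilities (three classes, two samples)
y_true_toy = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_prob_toy = np.array([[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]])

loss = categorical_crossentropy(y_true_toy, y_prob_toy)
acc = accuracy_score(y_true_toy, y_prob_toy)
print(round(loss, 3), acc)  # 0.525 1.0
```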

Extras:
1. Can I do cross-validation / hyperparameter tuning with deep learning models? See: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/#:~:text=By%20setting%20the%20n_jobs%20argument,for%20each%20combination%20of%20parameters.
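
As a starting point for this question, k-fold cross-validation only needs a way to split the sample indices into folds; a fresh model would then be trained on each training part and scored on the held-out part. Below is a minimal NumPy sketch of the index splitting only (`k_fold_indices` is a hypothetical helper, and the folds are not stratified).

```python
import numpy as np

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
```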

---

Sources for data preprocessing (NLP):
- https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470
- https://medium0.com/@saad.arshad102/sentiment-analysis-text-classification-using-rnn-bi-lstm-recurrent-neural-network-81086dda8472

---

Data source:
- https://www.kaggle.com/datasets/thedevastator/c-ssrs-labeled-suicidality-in-500-anonymized-red
- https://zenodo.org/record/2667859#.Y9aqCXZBw2z