# Data for Good: predicting suicidal behavior likelihood among Reddit users using Deep Learning (Part 2)

*Deep Learning and Reinforcement Learning (part of IBM Machine Learning Professional Certificate) - Course Project.*

>*No one is useless in this world who lightens the burdens of another.*  
― **Charles Dickens**

<img src='https://www.discover-norway.no/upload/images/-development/header/desktop/kul_munch/edvard%20munch%20the%20scream%201893_munchmmuseet.jpg'></img>

## Table of contents
1. [Data Preparation](#preparation)  
2. [Model Development: Recurrent Neural Network](#model)  
  2.1. [...](#kmeans)  
  2.2. [...](#hac)  
  2.3. [...](#dbscan)  
3. [Results](#results)  
4. [Discussion](#discussion)  
5. [Conclusion](#conclusion)  
  5.1. [Project Summary](#summary)  
  5.2. [Outcome of the Analysis](#outcome)  
  5.3. [Potential Developments](#developments)

## 1. Data Preparation <a name=preparation></a>

In [1]:
#Import needed libraries
import keras
import pandas as pd
import random
from random import randrange, seed
from keras.preprocessing.text import Tokenizer

In [2]:
#Import data (after the cleaning and EDA performed in the word cloud environment notebook)
data = pd.read_csv(r'data.csv')
data.head()

Unnamed: 0,User,Post,Label,word_count
0,user-0,its not a viable option and youll be leaving y...,Supportive,134
1,user-1,it can be hard to appreciate the notion that y...,Ideation,2163
2,user-2,hi so last night i was sitting on the ledge of...,Behavior,470
3,user-3,i tried to kill my self once and failed badly ...,Attempt,885
4,user-4,hi nem3030 what sorts of things do you enjoy d...,Ideation,208


In [3]:
#Drop irrelevant features
data.drop(['User', 'word_count'], axis=1, inplace=True)
data.tail()

Unnamed: 0,Post,Label
495,its not the end it just feels that way or at l...,Supportive
496,it was a skype call but she ended it and ventr...,Indicator
497,that sounds really weird maybe you were distra...,Supportive
498,dont know there as dumb as it sounds i feel hy...,Attempt
499,gt it gets better trust me ive spent long enou...,Behavior


I start processing the data by removing the stop words identified during the word cloud analysis (see the Part 1 notebook).

In [4]:
#Import the stop_words list and create a Python list
stop_words = open(r'stop_words.txt', 'r')
sw=[]
for line in stop_words:
    #strip only the newline, so a last line without one is not truncated
    sw.append(line.rstrip('\n'))
    
print('Length of stop word list:', len(sw))

Length of stop word list: 323


In [5]:
#Close the file
stop_words.close()
print('Is the file closed?', stop_words.closed)

Is the file closed? True
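
As a side note, the open/read/close steps above can also be written with a context manager, which closes the file automatically even if reading fails. The sketch below is only an illustration: `load_stop_words` and the demo file are hypothetical names that are not part of the original notebook, and the demo file stands in for `stop_words.txt`.

```python
import os
import tempfile

def load_stop_words(path):
    """Read one stop word per line, skipping blank lines (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as handle:
        # strip() removes the newline and any stray spaces around each word
        return [line.strip() for line in handle if line.strip()]

# Hypothetical demo file: it contains a blank line and the last line
# has no trailing newline, two cases that are easy to get wrong.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stop_words_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("were\nunderstand\n\nam")
    demo_words = load_stop_words(demo_path)

print(demo_words)  # ['were', 'understand', 'am']
```

With this pattern there is no need for a separate cell to close the file.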


In [6]:
print("First 50 stop words:\n",sw[:50])

First 50 stop words:
 ['were', 'understand', 'am', 'between', 'then', 'have', 'which', 'i', 'has', 'much', 'someone', 'a', 'himself', 'd', 'want', 'therefore', 'other', 'find', 'that', 'Ive', 'these', 'they', 'y', 'up', 'said', 'with', 'where', 'like', 'over', "it's", "you'll", 'll', 'things', 'nor', 'friend', 'any', 'know', 'their', 'from', 'or', "we've", 'again', 'being', 'say', 'try', 'thing', 'when', "how's", "she'll", 'Im']


In [7]:
#let's visualize a random post
random.seed(3)
data.loc[randrange(500)]['Post']

'no more ideas i dont agree with live for others kind of advice i think you should live for yourself and your friends and family the world isnt going to be fixed any time soon so stop thinking its all on your shoulders regular exercise and a lack of excessive stress is important to a good life so is a decent job work is now stressful yes its never done im on a long break now its tired hot and humid where i now live so i cant really do anything i cant handle the heat well i want to prepare for my death before i go back to work its not only that the career enabled me to live a certain lifestyle and live in a certain place and not have to worry too much about money and other things why would you like that i dont think there are any other kinds of job i could do in this country it has been 5 years since i lost my job i have tried my best the things i lost in my life i believe them to be extremely fundamental and important things i also lost a life that had little worry and stress now i hav

In [8]:
random.seed(3)
print('Length of the post (in characters) before removing the stop words:', len(data.loc[randrange(500)]['Post']))

Length of the post (in characters) before removing the stop words: 2269


In [9]:
#let's remove the stop words
data['Post'] = data['Post'].apply(lambda x: ' '.join([word for word in x.split() if word not in (sw)]))

#let's visualize the same post without stopwords
random.seed(3)
data.loc[randrange(500)]['Post']

'ideas agree others kind advice family world fixed soon stop thinking shoulders regular exercise lack excessive stress important decent job stressful yes done im break tired hot humid handle heat prepare death career enabled certain lifestyle certain place worry money kinds job country 5 lost job tried best lost believe extremely fundamental important lost little worry stress job gets worse allow exercise boiling hot city saps energy horrible bitchy colleagues norm realize liked living country kind jobs worse world shitty jobs best jobs world threw tolerate job rest move different job industry city less hot humid place wont climate city ill lost suicide arent suicide attempt looked upon mental asthenia moment madness kind childish gesture arent actual suicides imagine kill guess lack understanding survival mechanism suicidal likely fixed world fucked 7bn fucking planet mere presence forget enjoy'

In [10]:
random.seed(3)
print('Length of the post (in characters) after removing the stop words:', len(data.loc[randrange(500)]['Post']))

Length of the post (in characters) after removing the stop words: 907
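
One detail worth noting: the stop word list contains capitalized entries such as 'Ive' and 'Im', while the posts are lowercase, which is probably why "im" still appears in the filtered post above. A minimal sketch of a case-insensitive variant follows. The function name `remove_stop_words` and the demo data are assumptions made for illustration, not part of the original notebook. It also uses a set, which makes each membership check faster than scanning a list.

```python
def remove_stop_words(text, stop_words):
    """Drop stop words from a whitespace-separated text, ignoring case."""
    # Lowercase the stop words once and store them in a set for fast lookups
    stop_set = {word.lower() for word in stop_words}
    return " ".join(word for word in text.split() if word.lower() not in stop_set)

# Hypothetical demo data mimicking the mixed-case entries of the real list
demo_stop_words = ["Im", "Ive", "i", "a", "that"]
demo_post = "im sure that i need a break"

cleaned = remove_stop_words(demo_post, demo_stop_words)
print(cleaned)  # sure need break
```

On the real data the same function could be passed to `data['Post'].apply(...)` in place of the lambda, assuming `sw` holds the stop word list.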


Next, I tokenize the posts: the Keras Tokenizer class splits each text into individual words and then converts each word into an integer. The tokenizer builds its vocabulary from this dataset, and since I won't use any pre-trained embeddings, the word representations will also be learned from this dataset.
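
To make the word-to-integer mapping described above more concrete, here is a simplified plain-Python sketch of the idea. It is not the Keras implementation: the helper names `fit_word_index` and `texts_to_sequences` and the demo posts are assumptions for illustration, and the sketch only lowercases and splits on whitespace, whereas the Keras Tokenizer also filters punctuation by default. Like Keras, it gives the most frequent word the index 1 and leaves 0 unused.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank, most frequent word first (1-based)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer; unknown words are skipped."""
    return [[word_index[word] for word in text.lower().split() if word in word_index]
            for text in texts]

# Hypothetical mini-corpus standing in for data['Post']
demo_posts = ["job lost job", "lost hope", "job stress"]

word_index = fit_word_index(demo_posts)
sequences = texts_to_sequences(demo_posts, word_index)
print(word_index)  # {'job': 1, 'lost': 2, 'hope': 3, 'stress': 4}
print(sequences)   # [[1, 2, 1], [2, 3], [1, 4]]

# Inverse mapping, to turn integers back into words
index_word = {index: word for word, index in word_index.items()}
print(" ".join(index_word[i] for i in sequences[0]))  # job lost job
```

Because the indices are frequency ranks, small integers correspond to common words and large integers to rare ones.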

In [11]:
#let's visualize a random post
random.seed(13)
data.loc[randrange(500)]['Post']

'dude wont called brave bold become guy killed body buck news die ill kick ass heaven whever'

In [12]:
#Let's tokenize the data
tokenizer = Tokenizer()
#train the tokenizer
tokenizer.fit_on_texts(data['Post'])
#convert text into lists of integers
posts = tokenizer.texts_to_sequences(data['Post'])

In [13]:
#let's visualize the same post after tokenizing
random.seed(13)
print(posts[randrange(500)])

[582, 42, 345, 1011, 6262, 181, 146, 728, 329, 6263, 1062, 78, 20, 1191, 773, 2457, 11491]


In [14]:
#Let's map the integers back to words to check the mapping
random.seed(13)
' '.join(tokenizer.index_word[w] for w in posts[randrange(500)])

'dude wont called brave bold become guy killed body buck news die ill kick ass heaven whever'

---

Sources for data preprocessing (NLP):
- https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470
- https://medium0.com/@saad.arshad102/sentiment-analysis-text-classification-using-rnn-bi-lstm-recurrent-neural-network-81086dda8472

---

Data source:
- https://www.kaggle.com/datasets/thedevastator/c-ssrs-labeled-suicidality-in-500-anonymized-red
- https://zenodo.org/record/2667859#.Y9aqCXZBw2z