<img src="../assets/change_faces.gif" style="float:right ; margin: 10px ; width:300px;"> 

<h1><left>Using Natural Language Processing to identify suicidal posts</left></h1>
<h4><left>by Ziyi Zhu</left></h4>

___

## Part Ⅰ. Data Collection
- Use Reddit's JSON API to collect posts from two subreddits: "r/depression" and "r/SuicideWatch".
- When collecting data, add a randomized delay between requests as a courtesy to Reddit's servers and security staff.
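
The randomized delay mentioned above can be sketched as a small helper. This is only an illustration: the name `polite_pause` and its parameters are hypothetical, and the 1 to 6 second range simply mirrors the `randint(1,6)` call used in the scraping function of this notebook.

```python
import random
import time


def polite_pause(min_seconds=1, max_seconds=6, sleep=time.sleep):
    """Sleep for a random whole number of seconds and return the delay used.

    `sleep` is injectable so the helper can be exercised without waiting.
    """
    delay = random.randint(min_seconds, max_seconds)  # inclusive on both ends
    sleep(delay)
    return delay
```

Calling `polite_pause()` between two requests spaces them out by an unpredictable interval instead of a fixed one.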
  
  

In [1]:
import requests
import time
import pandas as pd
from random import randint

### 1.1 Exploring the JSON structure of the r/depression subreddit page

In [2]:
# SCRAPE THE r/depression AND r/SuicideWatch SUBREDDITS
# START BY EXPLORING THE JSON STRUCTURE OF THE FORMER
url_1 = "https://www.reddit.com/r/depression.json"

In [3]:
#DEFINING A USER AGENT AND MAKING SURE STATUS IS GOOD TO GO
headers = {"User-agent" : "Sam He"}
res = requests.get(url_1, headers=headers)
res.status_code

200

In [4]:
# PEEKING AT THE DATA
depress_json = res.json()
depress_json

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'depression',
     'selftext': 'We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in public comments, at least until you\'ve gotten to k

In [6]:
# DATA IS ORGANISED AS A DICTIONARY
# GET ITS KEYS
sorted(depress_json["data"].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [7]:
# THE after VALUE IS PASSED AS A QUERY PARAMETER
# TO INDICATE IN OUR URL THAT WE WANT TO SEE THE
# NEXT PAGE OF POSTS (ABOUT 25) FOLLOWING THE after "CODE"

depress_json["data"]["after"]

't3_jyqk5v'
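
To make the role of `after` concrete, the sketch below builds the URL for the next page using only the standard library. The helper name `next_page_url` is hypothetical; in practice `requests` assembles the same query string from its `params` argument.

```python
from urllib.parse import urlencode


def next_page_url(base_url, after=None):
    """Return base_url with an `after` query parameter when one is given."""
    if after is None:
        return base_url  # first page: no query string needed
    return base_url + "?" + urlencode({"after": after})


print(next_page_url("https://www.reddit.com/r/depression.json", "t3_jyqk5v"))
# https://www.reddit.com/r/depression.json?after=t3_jyqk5v
```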

In [8]:
# DOUBLE CONFIRMING THAT THE PREVIOUS AFTER KEY IS REALLY THE LAST ITEM ON OUR PAGE
[post["data"]["name"] for post in depress_json["data"]["children"]]

['t3_doqwow',
 't3_iq10oq',
 't3_jyo9pv',
 't3_jygm7t',
 't3_jyskqq',
 't3_jyrxiq',
 't3_jyty8g',
 't3_jyt5ek',
 't3_jygstw',
 't3_jytot0',
 't3_jymhv3',
 't3_jyqkl6',
 't3_jyvyeh',
 't3_jyridh',
 't3_jyow93',
 't3_jyr7kp',
 't3_jyoyab',
 't3_jyibec',
 't3_jyrc5j',
 't3_jyvw2b',
 't3_jyuwvn',
 't3_jytyup',
 't3_jyteur',
 't3_jyvddr',
 't3_jyuckd',
 't3_jywbri',
 't3_jyqk5v']
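
The same check can be written as an explicit comparison rather than read off the list by eye. The sketch uses a tiny hand-made listing in place of `depress_json`, and the helper name `after_matches_last_post` is hypothetical.

```python
def after_matches_last_post(listing):
    """True when the listing's `after` code equals the last post's name."""
    children = listing["data"]["children"]
    if not children:
        return False
    return listing["data"]["after"] == children[-1]["data"]["name"]


# Toy listing shaped like Reddit's response (only the keys used here).
toy_listing = {
    "data": {
        "after": "t3_jyqk5v",
        "children": [
            {"data": {"name": "t3_jywbri"}},
            {"data": {"name": "t3_jyqk5v"}},
        ],
    }
}
print(after_matches_last_post(toy_listing))  # True
```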

In [9]:
# CHECKING OUT THE NUMBER OF POSTS IN ONE PAGE
len(depress_json["data"]["children"])

27

In [10]:
# DATAFRAME IT. 
pd.DataFrame(depress_json["data"]["children"])

Unnamed: 0,kind,data
0,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
1,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
2,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
3,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
4,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
5,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
6,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
7,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
8,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
9,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."


In [11]:
# LOOK AT A SINGLE POST
depress_json["data"]["children"][0]["data"]

{'approved_at_utc': None,
 'subreddit': 'depression',
 'selftext': 'We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in public comments, at least until you\'ve gotten to know someone.  It will be maintained at /r/depression/wiki/private_contact, and the full text of the current v

### 1.2 Creating functions to automate the Data Collection process 
- We will first run these functions on r/depression and check that they work as expected.

In [16]:
# NOW WE CAN DEFINE A FUNCTION TO SCRAPE A REDDIT PAGE

def reddit_scrape(url_string, number_of_scrapes, output_list):
    # SCRAPED POSTS ARE APPENDED TO output_list (SHOULD BE EMPTY TO START)
    # after IS None FOR THE FIRST REQUEST, WHICH FETCHES THE FIRST PAGE
    # NOTE: headers IS THE GLOBAL USER-AGENT DICT DEFINED EARLIER
    after = None
    for batch in range(number_of_scrapes):
        if batch == 0:
            print("SCRAPING {}\n--------------------------------------------------".format(url_string))
            print("<<<SCRAPING COMMENCED>>>")
            print("Downloading Batch {} of {}...".format(1, number_of_scrapes))
        elif (batch + 1) % 5 == 0:
            print("Downloading Batch {} of {}...".format((batch + 1), number_of_scrapes))

        if after is None:
            params = {}
        else:
            #THIS WILL TELL THE SCRAPER TO GET THE NEXT SET AFTER REDDIT'S after CODE
            params = {"after": after}

        # requests.get OPENS AND CLOSES ITS OWN SESSION FOR EACH CALL,
        # SO NO SEPARATE SESSION OBJECT IS NEEDED HERE
        res = requests.get(url_string, params=params, headers=headers)
        if res.status_code == 200:
            the_json = res.json()
            output_list.extend(the_json["data"]["children"])
            # NOTE: IF after COMES BACK AS None (END OF LISTING), THE NEXT REQUEST
            # STARTS AGAIN FROM THE FIRST PAGE, WHICH IS ONE SOURCE OF DUPLICATE POSTS
            after = the_json["data"]["after"]
        else:
            print(res.status_code)
            break
        # RANDOMIZED DELAY BETWEEN REQUESTS
        time.sleep(randint(1, 6))

    print("<<<SCRAPING COMPLETED>>>")
    print("Number of posts downloaded: {}".format(len(output_list)))
    print("Number of unique posts: {}".format(len({p["data"]["name"] for p in output_list})))
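
One design choice worth considering: when a listing runs out, Reddit returns `after` as `None`, and a loop that keeps requesting pages would start again from the first page. The sketch below follows the `after` chain and stops at the end of the listing. It takes a hypothetical `fetch_page` callable (an assumption, not part of the original notebook) so the logic can be tried without network access; in the notebook that callable would wrap the `requests.get(...)` call and return `res.json()`.

```python
def collect_posts(fetch_page, max_pages):
    """Follow `after` codes page by page; stop when the listing ends.

    fetch_page(after) must return a dict shaped like Reddit's listing JSON.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # no further pages, so avoid looping back to page 1
            break
    return posts


# Fake two-page listing standing in for the real API.
fake_pages = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}}], "after": "t3_a"}},
    "t3_a": {"data": {"children": [{"data": {"name": "t3_b"}}], "after": None}},
}
posts = collect_posts(lambda after: fake_pages[after], max_pages=50)
print([p["data"]["name"] for p in posts])  # ['t3_a', 't3_b']
```

Even though `max_pages` is 50, the loop ends after two requests because the second page reports no `after` code.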
 

In [17]:
# CALLING THE FUNCTION ON OUR DEPRESSION SUBREDDIT
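# NOTE: THIS MODULE-LEVEL SETTING DOES NOT APPEAR TO CHANGE HOW requests.get RETRIES;
# RETRIES ARE NORMALLY CONFIGURED BY MOUNTING AN HTTPAdapter(max_retries=...) ON A Session
requests.adapters.DEFAULT_RETRIES = 5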
depress_scraped = [] 
reddit_scrape("https://www.reddit.com/r/depression.json", 50, depress_scraped)

SCRAPING https://www.reddit.com/r/depression.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1250
Number of unique posts: 950
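
The gap between posts downloaded and unique posts means some posts were fetched more than once. A minimal sketch for seeing which names repeat, using `collections.Counter` on a toy list in place of `depress_scraped` (the helper name `repeated_names` is hypothetical):

```python
from collections import Counter


def repeated_names(scraped):
    """Map each post name seen more than once to its number of occurrences."""
    counts = Counter(post["data"]["name"] for post in scraped)
    return {name: n for name, n in counts.items() if n > 1}


toy_scrape = [
    {"data": {"name": "t3_a"}},
    {"data": {"name": "t3_b"}},
    {"data": {"name": "t3_a"}},
]
print(repeated_names(toy_scrape))  # {'t3_a': 2}
```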


In [18]:
#CREATING A FUNCTION TO OUTPUT A LIST OF UNIQUE POSTS
def create_unique_list(original_scrape_list, new_list_name):
    #TRACK SEEN POST NAMES IN A SET FOR FAST MEMBERSHIP CHECKS
    seen_names = set()
    for post in original_scrape_list:
        name = post["data"]["name"]
        if name not in seen_names:
            new_list_name.append(post["data"])
            seen_names.add(name)
    #CHECKING IF THE NEW LIST IS OF SAME LENGTH AS UNIQUE POSTS
    print("LIST NOW CONTAINS {} UNIQUE SCRAPED POSTS".format(len(new_list_name)))
    

In [19]:
#CALLING THE FUNCTION ON OUR SCRAPED DATA
depress_scraped_unique = []
create_unique_list(depress_scraped, depress_scraped_unique)

LIST NOW CONTAINS 950 UNIQUE SCRAPED POSTS


In [20]:
#PUTTING DEPRESSION DATA INTO A DATAFRAME AND ADDING THE is_suicide LABEL (SAVED TO CSV LATER)
depression = pd.DataFrame(depress_scraped_unique)
depression["is_suicide"] = 0
depression.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_suicide
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,1,False,Our most-broken and least-understood rules is ...,[],...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,701874,1572361000.0,1,,False,,0
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_1t70,False,,0,False,"Regular Check-In Post. Plus, a reminder about ...",[],...,no_ads,True,https://www.reddit.com/r/depression/comments/i...,701874,1599735000.0,0,,False,,0
2,,depression,"my memory is SO bad. i can’t focus, i can’t co...",t2_5gsuu4va,False,,0,False,I feel like years of depression has given me b...,[],...,no_ads,False,https://www.reddit.com/r/depression/comments/j...,701874,1606014000.0,0,,False,,0
3,,depression,16f ... cant manage to make friends and get ov...,t2_4aw3lv52,False,,0,False,Is it wierd that i have imaginary friends at t...,[],...,no_ads,False,https://www.reddit.com/r/depression/comments/j...,701874,1605986000.0,0,,False,,0
4,,depression,I do not exist. When you see me say hi to my c...,t2_6kvhqtit,False,,1,False,I'm a ghost.,[],...,no_ads,False,https://www.reddit.com/r/depression/comments/j...,701874,1606035000.0,0,,False,,0


### 1.3 Running our functions on the r/SuicideWatch subreddit 

In [21]:
#CALLING THE SCRAPING FUNCTION ON OUR SUICIDEWATCH SUBREDDIT
suicide_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/SuicideWatch.json", 50, suicide_scraped)

SCRAPING https://www.reddit.com/r/SuicideWatch.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1239
Number of unique posts: 988


In [22]:
#CALLING THE "UNIQUE ONLY" FUNCTION ON OUR SCRAPED DATA
suicide_scraped_unique = []
create_unique_list(suicide_scraped, suicide_scraped_unique)

LIST NOW CONTAINS 988 UNIQUE SCRAPED POSTS


In [23]:
#PUTTING SUICIDEWATCH DATA INTO A DATAFRAME AND ADDING THE is_suicide LABEL (SAVED TO CSV LATER)
suicide_watch = pd.DataFrame(suicide_scraped_unique)
suicide_watch["is_suicide"] = 1
suicide_watch.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_suicide
0,,SuicideWatch,We've been seeing a worrying increase in pro-s...,t2_1t70,False,,1,False,New wiki on how to avoid accidentally encourag...,[],...,no_ads,True,https://www.reddit.com/r/SuicideWatch/comments...,239628,1567526000.0,0,,False,,1
1,,SuicideWatch,"Activism, i.e. advocating or fundraising for s...",t2_1t70,False,,1,False,Please remember that NO ACTIVISM of any kind i...,[],...,no_ads,True,https://www.reddit.com/r/SuicideWatch/comments...,239628,1599734000.0,0,,False,,1
2,,SuicideWatch,"Whenever I'm sad, I imagine myself dying. This...",t2_7q7no6ah,False,,0,False,Suicide comforts me.,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,239628,1606018000.0,0,,False,,1
3,,SuicideWatch,my fucked up logic is that they'll only have t...,t2_6ylbfxum,False,,0,False,I'm going to kill myself on my birthday,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,239628,1606027000.0,0,,False,,1
4,,SuicideWatch,... And then they go back into treating you li...,t2_7q7no6ah,False,,0,False,"""No, don't kill yourself"". People say...",[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,239628,1606018000.0,0,,False,,1


### 1.4 Writing the data to CSV

In [24]:
suicide_watch.to_csv('../data/suicide_watch.csv', index = False)
depression.to_csv('../data/depression.csv', index = False)

In [25]:
#CHECKING WHETHER r/SuicideWatch HAS ANY COLUMNS THAT r/depression LACKS
suicide_watch.columns.difference(depression.columns)

Index([], dtype='object')
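
Note that `Index.difference` is one-directional: it lists columns present in `suicide_watch` but missing from `depression`, not the reverse. A sketch of a two-way check with plain Python sets, using made-up column lists rather than the real frames (the function name `column_mismatch` is hypothetical):

```python
def column_mismatch(columns_a, columns_b):
    """Return (only_in_a, only_in_b) as sorted lists of column names."""
    set_a, set_b = set(columns_a), set(columns_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


only_a, only_b = column_mismatch(
    ["title", "selftext", "author_cakeday"],
    ["title", "selftext"],
)
print(only_a, only_b)  # ['author_cakeday'] []
```

With the real data this could be called as `column_mismatch(suicide_watch.columns, depression.columns)`.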

In [26]:
#LOOKING INTO THE author_cakeday COLUMN
suicide_watch['author_cakeday'].isnull().value_counts()

True     987
False      1
Name: author_cakeday, dtype: int64

#### Thoughts about the collected data
  
- The two sets are slightly uneven in size: we collected 988 unique r/SuicideWatch posts and 950 unique r/depression posts. We might want to even them out with another round of collection.
- The column comparison above returns an empty index, so in this run r/SuicideWatch has no column that r/depression lacks. The "author_cakeday" column is almost entirely NaN (987 of 988 r/SuicideWatch rows), so it does not look like a column we will use for our classifier.
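
To put the size difference in numbers, the class shares can be computed from the unique-post counts printed by `create_unique_list` (988 and 950). A small sketch; the function name `class_shares` is hypothetical.

```python
def class_shares(counts):
    """Return each class's share of the total, given {label: count}."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = class_shares({"SuicideWatch": 988, "depression": 950})
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# SuicideWatch: 51.0%
# depression: 49.0%
```

A split this close to 50/50 suggests the imbalance is mild, so another round of collection may be optional rather than necessary.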