## Overview

Detect depression from social media data


**1. Collect data from a social media platform**

> * Decide the platform & method
> * Get proper authentication
> * Scrape data

**2. Label the data and create the data set**

> * Posts found with the keyword "depression" : depressed (1)
> * Posts found with the keyword "jolly"      : not depressed (0)
> * Shuffle the data set

**3. Data set preparation**

> * Lower case
> * Remove punctuation, URLs and tags
> * Remove stop words
> * Divide the data set into train, eval and test.

**4. Model Building : Deep Learning Models**

> * BERT
> * ALBERT
> * XLNet
> * RoBERTa
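
The four architectures listed above differ only in the model type and checkpoint name passed to Simple Transformers. A minimal sketch of keeping those pairs in one place, using the checkpoint names that appear later in this notebook. The names `MODEL_CHECKPOINTS`, `SHARED_ARGS` and `model_spec` are illustrative and not part of any library:

```python
# Model type -> pretrained checkpoint, as used in the sections below.
MODEL_CHECKPOINTS = {
    "bert": "bert-base-uncased",
    "albert": "albert-base-v2",
    "xlnet": "xlnet-base-cased",
    "roberta": "roberta-base",
}

# Training arguments shared by every run in this notebook.
SHARED_ARGS = {
    "reprocess_input_data": True,
    "use_cached_eval_features": False,
    "overwrite_output_dir": True,
}

def model_spec(model_type, num_train_epochs=1):
    """Return (model_type, checkpoint, args) for one architecture."""
    args = dict(SHARED_ARGS, num_train_epochs=num_train_epochs)
    return model_type, MODEL_CHECKPOINTS[model_type], args

for name in MODEL_CHECKPOINTS:
    print(model_spec(name))
```

Each tuple can then be unpacked into `ClassificationModel(model_type, checkpoint, num_labels=2, args=args)`.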


# Data collection from Twitter

 **Twitter authentication:**

 https://www.youtube.com/watch?v=vlvtqp44xoQ
 
 

**Import necessary libraries**

In [None]:
import pandas as pd
import numpy as np
import tweepy
import threading
import time

**Function to scrape data and create data frame**

In [None]:
def scrape(words, numtweet):
  # `api` is the authenticated tweepy.API object created in the next cell.
  # Note: Tweepy 4.0+ renames `api.search` to `api.search_tweets`.
  tweets = tweepy.Cursor(api.search, q=words, lang="en", tweet_mode='extended').items(numtweet)
  rows = []
  for tweet in tweets:
    try:
      # For a retweet, the untruncated text lives on the original status
      text = tweet.retweeted_status.full_text
    except AttributeError:
      text = tweet.full_text
    rows.append({"username": tweet.user.screen_name,
                 "description": tweet.user.description,
                 "tweet text": text})
  # Build the frame once; DataFrame.append was removed in pandas 2.0
  return pd.DataFrame(rows, columns=["username", "description", "tweet text"])
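
The `try`/`except AttributeError` in `scrape` is there because only retweets carry a `retweeted_status`, and the retweet's own `full_text` can still be cut short. A small offline sketch of that fallback, using `SimpleNamespace` objects as hypothetical stand-ins for tweepy status objects (`full_text_of` and `tweets_to_rows` are illustrative names):

```python
from types import SimpleNamespace

def full_text_of(tweet):
    """Return the text of the original status for a retweet, else the tweet's own text."""
    try:
        return tweet.retweeted_status.full_text
    except AttributeError:
        return tweet.full_text

def tweets_to_rows(tweets):
    """Turn tweet-like objects into plain dicts, one per row."""
    return [
        {
            "username": t.user.screen_name,
            "description": t.user.description,
            "tweet text": full_text_of(t),
        }
        for t in tweets
    ]

# Hypothetical stand-ins for tweepy status objects.
user = SimpleNamespace(screen_name="alice", description="demo account")
plain = SimpleNamespace(user=user, full_text="an original tweet")
retweet = SimpleNamespace(
    user=user,
    full_text="RT @bob: a long tweet that gets cut...",
    retweeted_status=SimpleNamespace(full_text="a long tweet that gets cut off here"),
)

rows = tweets_to_rows([plain, retweet])
print([r["tweet text"] for r in rows])
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`.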

**User authentication and data scraping**

In [None]:
if __name__ == '__main__':
  # Replace the placeholders with your own credentials; never publish real keys
  consumer_key = 'YOUR_CONSUMER_KEY'
  consumer_secret = 'YOUR_CONSUMER_SECRET'
  access_key = 'YOUR_ACCESS_TOKEN'
  access_secret = 'YOUR_ACCESS_TOKEN_SECRET'
  auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
  auth.set_access_token(access_key, access_secret)
  api = tweepy.API(auth)
  numtweet = 50
  depressed_df = scrape("depression", numtweet)
  not_depressed_df = scrape("jolly", numtweet)

In [None]:
depressed_df.head(5)

Unnamed: 0,description,tweet text,username
0,Vibes 🏀 MUFC 4L❤️🦅,Big thanks to @Bujutoyourears and @SmirnoffNg...,anon_abdul
1,,@teteales My mom is still in depression due to...,areyyyyaaarrr
2,ㅤㅤ ㅤㅤ ㅤㅤㅤ {♡} — ꒰ #이동훈 ꒱\n \n\n ㅤㅤㅤㅤㅤㅤㅤ ㅤㅤㅤ...,E: my kpop boys are enlisting and idk what to ...,dhnblue
3,| | architect | cat slave | creepy basement owner,@NatAlleyCat It’ll never work,cpt_depression_
4,Fucking google it,Ya know what’s really great for depression?? ...,YungSpinster


In [None]:
not_depressed_df.head(5)

Unnamed: 0,description,tweet text,username
0,"Anti-Fascist, Pro-Curry, Anti-Bigot, Pro-Gin, ...",Good morning. Feeling a bit glum ?\nJolly-up y...,Sillytees
1,Remembering the sacrifice of the few and Bombe...,For what it’s worth my \nHYPOCRITE OF THE DAY ...,dav_jolly
2,😊,Avar ku free time kidachaaa \n avarodaa entert...,eanokfdo
3,Remembering the sacrifice of the few and Bombe...,Al Capone inprisoned for Tax evasion not His M...,dav_jolly
4,“Live! Live the wonderful life that is in you!...,@miffythegamer @dav_jolly It certainly is.,mo04933471


## Data labelling & data set creation

In [None]:
d_df = pd.DataFrame()
# Join profile description and tweet text with a space; a missing description becomes ''
d_df['text'] = depressed_df['description'].fillna('') + ' ' + depressed_df['tweet text']
d_df['labels'] = 1  # depressed
d_df.head(5)

Unnamed: 0,text,labels
0,Shout out to the particular hell that is funct...,1
1,I hate that people don’t understand that i don...,1
2,If 10 years ago someone told me that in 10 yea...,1
3,"It’s like I died at 15, but my body just kept ...",1
4,High-functioning depression: I feel like I'm l...,1


In [None]:
nd_df = pd.DataFrame()
# Join profile description and tweet text with a space; a missing description becomes ''
nd_df['text'] = not_depressed_df['description'].fillna('') + ' ' + not_depressed_df['tweet text']
nd_df['labels'] = 0  # not depressed
nd_df.head(5)

Unnamed: 0,text,labels
0,Global Protests: Stop Genocide of UyghyrsHello...,0
1,"""The 'blackies' are coming from Africa"" said P...",0
2,WHO is oulived itself. The healthcare system h...,0
3,I thought there is no video of this incident!,0
4,In the form of a crescent sea hidden behind on...,0


In [None]:
frames = [d_df, nd_df]
data_df = pd.concat(frames)
data_df

Unnamed: 0,text,labels
0,Shout out to the particular hell that is funct...,1
1,I hate that people don’t understand that i don...,1
2,If 10 years ago someone told me that in 10 yea...,1
3,"It’s like I died at 15, but my body just kept ...",1
4,High-functioning depression: I feel like I'm l...,1
...,...,...
45,100% renewable energy could power the world by...,0
46,Building in Fire. Warsaw Poland 08-06-2019,0
47,Nature reveals its treasures with the first ra...,0
48,𝑻𝒉𝒆 𝒌𝒊𝒏𝒅 𝒐𝒇 𝒇𝒓𝒊𝒆𝒏𝒅𝒔 𝒆𝒗𝒆𝒓𝒚𝒐𝒏𝒆 𝒏𝒆𝒆𝒅𝒔😍,0


In [None]:
data_df = data_df.sample(frac=1).reset_index(drop=True)
data_df.head(10)

Unnamed: 0,text,labels
0,I thought there is no video of this incident!,0
1,If 10 years ago someone told me that in 10 yea...,1
2,Does anyone else stay up to 3AM every night wa...,1
3,When you realize that you have an incredibly e...,1
4,Being suicidal is like being at a terrible par...,1
5,Weed is now finally legal in Mexico! Cheers!,0
6,Does anyone have a period where they feel real...,1
7,Its out!!,0
8,High-functioning depression: I feel like I'm l...,1
9,"I get it, being friends with a depressed perso...",1
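
`sample(frac=1)` draws a new random order on every run, so the rows shown above change between runs unless a seed is fixed (pandas accepts `random_state` for this). The idea can be sketched with the standard library alone; `shuffled` and the toy rows are illustrative:

```python
import random
from collections import Counter

def shuffled(rows, seed=42):
    """Return a shuffled copy of rows; the same seed gives the same order."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows

# Toy (text, label) rows standing in for the labelled data set.
rows = [(f"sad text {i}", 1) for i in range(5)] + [(f"happy text {i}", 0) for i in range(5)]
first = shuffled(rows)
second = shuffled(rows)

print(first == second)                        # same seed, same order
print(Counter(label for _, label in first))   # shuffling keeps the class counts
```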


## Data preparation

In [None]:
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# Build these once instead of on every call (or every token)
STOP_WORDS = set(stopwords.words('english'))
TOKENIZER = RegexpTokenizer(r'\w+')

def preprocess(sentence):
    text = str(sentence).lower()
    text = re.sub(r'<.*?>', '', text)     # strip HTML tags
    text = re.sub(r'http\S+', '', text)   # strip URLs
    text = re.sub(r'[0-9]+', '', text)    # strip digits
    text = re.sub(r'@\S+', '', text)      # strip @mentions
    tokens = TOKENIZER.tokenize(text)     # word characters only, so punctuation is dropped
    # Drop very short tokens and English stop words
    filtered_words = [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]
    return " ".join(filtered_words)

data_df['text'] = data_df['text'].map(preprocess)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
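
To see what each cleaning step removes, here is a worked example on a single string. It is a sketch that mirrors the regular expressions of `preprocess` but uses `re.findall` and a tiny hand-written stop-word set (an assumption made so the snippet runs without NLTK data):

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "and", "for", "this", "with", "was"}

def clean(text):
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)      # HTML tags
    text = re.sub(r"http\S+", "", text)    # URLs
    text = re.sub(r"[0-9]+", "", text)     # digits
    text = re.sub(r"@\S+", "", text)       # @mentions
    tokens = re.findall(r"\w+", text)      # word characters only
    return " ".join(w for w in tokens if len(w) > 2 and w not in STOP_WORDS)

sample = "<b>Thanks</b> @friend for the 2 links: https://example.com #Depression"
print(clean(sample))
print(repr(clean("@user 2021!!")))
```

The second call returns an empty string. Rows like that are written to CSV as empty fields, which `pd.read_csv` reads back as `NaN`, so it can be worth dropping them before the split.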


In [None]:
data_df.head(5)

Unnamed: 0,text,labels
0,shout particular hell functional depression ge...,1
1,scary part depression start feel hard know tem...,1
2,thoughts image,0
3,anyone else stay every night wanting die get s...,1
4,commit suicide option suddenly stop existing w...,1


**Split the data into train, evaluation and test**


`Download the csv files for future processing`

In [None]:
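# Contiguous slices, so no row is skipped between the splits
train = data_df[:60]
eval_df = data_df[60:80]  # named eval_df to avoid shadowing the built-in eval()
test = data_df[80:]
train.to_csv("train.csv", index=False)
eval_df.to_csv("eval.csv", index=False)
test.to_csv("test.csv", index=False)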
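
Slicing a shuffled frame by position does not guarantee that each split keeps the same share of both classes. A stratified split is one common alternative; the sketch below uses only the standard library, and `stratified_split` with its percentage arguments is an illustrative helper rather than what the notebook runs:

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, train_pct=60, eval_pct=20, seed=0):
    """Split (text, label) rows into train/eval/test, keeping class ratios."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, evaluation, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = len(group) * train_pct // 100
        n_eval = len(group) * eval_pct // 100
        train += group[:n_train]
        evaluation += group[n_train:n_train + n_eval]
        test += group[n_train + n_eval:]   # whatever remains goes to test
    return train, evaluation, test

# Toy data: 50 rows per class.
rows = [(f"post {i}", i % 2) for i in range(100)]
train, evaluation, test = stratified_split(rows)
print(len(train), len(evaluation), len(test))
print(Counter(label for _, label in test))
```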

## Model building : Deep Learning

**Simple Transformers** (Hugging Face pretrained model names): https://huggingface.co/transformers/v3.3.1/pretrained_models.html

In [None]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.63.3-py3-none-any.whl (247 kB)

### BERT

In [None]:
from simpletransformers.classification import ClassificationModel

model=ClassificationModel('bert','bert-base-uncased',num_labels=2,use_cuda=False,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 1 })

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

#### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

(8, 0.7211211398243904)

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

{'mcc': 0.0, 'tp': 7, 'tn': 0, 'fp': 12, 'fn': 0, 'auroc': 0.6309523809523808, 'auprc': 0.49696969696969695, 'eval_loss': 0.7654226620992025}
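
The result dictionary reports the confusion counts, so the other metrics can be recomputed by hand. With `tn = 0` and `fn = 0`, every evaluation example was predicted as class 1, which is why `mcc` is 0. A short sketch of that arithmetic (`confusion_metrics` is an illustrative helper):

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and Matthews correlation from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator is 0
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}

# Counts from the evaluation result above.
metrics = confusion_metrics(tp=7, tn=0, fp=12, fn=0)
print({name: round(value, 3) for name, value in metrics.items()})
```

The same function applied to the counts of the later runs should reproduce the `mcc` values they report.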


#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


#### Performance

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.metrics import classification_report
print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        12
           1       0.37      1.00      0.54         7

    accuracy                           0.37        19
   macro avg       0.18      0.50      0.27        19
weighted avg       0.14      0.37      0.20        19
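
An accuracy of 0.37 is easier to judge next to a trivial baseline. The sketch below rebuilds the label counts from the support column of the report (12 zeros, 7 ones) and compares the all-ones predictions with always guessing the most frequent class; the helper names are illustrative:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def accuracy(labels, predictions):
    """Fraction of positions where label and prediction agree."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

# Class counts taken from the support column of the report above.
labels = [0] * 12 + [1] * 7
all_ones = [1] * len(labels)

print(round(accuracy(labels, all_ones), 2))
print(round(majority_baseline_accuracy(labels), 2))
```

On these counts the one-epoch model scores below the majority baseline, which suggests it has not learned to separate the classes yet.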



### BERT with GPU



1. Go to Runtime > Change runtime type
2. Select GPU and Save
3. Connect to the runtime
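
Before passing `use_cuda=True`, it can help to check that the runtime really exposes a GPU. A minimal sketch, assuming PyTorch is the backend (it falls back to `False` when `torch` is not installed); `cuda_available` is an illustrative helper:

```python
def cuda_available():
    """Return True only if PyTorch is installed and reports a usable GPU."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()

use_cuda = cuda_available()
print("GPU available:", use_cuda)
```

The resulting flag can be passed as `use_cuda=use_cuda` so the same cell runs on both CPU and GPU runtimes.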




In [None]:
!pip install simpletransformers

In [None]:
from simpletransformers.classification import ClassificationModel

model=ClassificationModel('bert','bert-base-uncased',num_labels=2,use_cuda=True,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 5 }) # Increase for better performance

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

#### Train the model

**Upload the files and read the csv files**

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

(40, 0.31236915588378905)

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

{'mcc': 0.6746010525388914, 'tp': 10, 'tn': 6, 'fp': 1, 'fn': 2, 'auroc': 0.9047619047619048, 'auprc': 0.9623538011695908, 'eval_loss': 0.2925872802734375}


#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

[0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1]
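
`model.predict` also returns `raw_outputs`. Assuming these hold one row of two unnormalised scores (logits) per text, a softmax turns them into class probabilities, which is useful for ranking posts by confidence instead of reading only the hard 0/1 label. The logits below are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two texts: [not depressed, depressed]
raw_outputs = [[1.2, -0.8], [-0.3, 2.1]]
for row in raw_outputs:
    probs = softmax(row)
    print(probs.index(max(probs)), [round(p, 3) for p in probs])
```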


#### Performance

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.metrics import classification_report
print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       1.00      1.00      1.00        14

    accuracy                           1.00        19
   macro avg       1.00      1.00      1.00        19
weighted avg       1.00      1.00      1.00        19
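
A perfect score on 19 test examples leaves a lot of uncertainty about the true accuracy. One way to quantify it is a Wilson score interval; the sketch below applies the standard formula with z = 1.96 (roughly 95% confidence) to 19 correct predictions out of 19:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

low, high = wilson_interval(19, 19)
print(round(low, 3), round(high, 3))
```

The lower bound comes out at roughly 0.83, so a larger test set would be needed before reading the result as near-perfect accuracy.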



# Data collection from Reddit

**Reddit Authentication:**

https://www.youtube.com/watch?v=4Lmfgw4RZCM


https://www.reddit.com/prefs/apps


**Install package - Python Reddit API Wrapper**

In [None]:
!pip install praw



**Import packages**

In [None]:
import praw
import pandas as pd

**Get secret tokens from Reddit**

In [None]:
# Replace the placeholders with the credentials of your own Reddit app; never publish real keys
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='Reddit -data')

**Sub-reddit selection & number of posts**

In [None]:
subreddit = reddit.subreddit('depression')
top_subreddit = subreddit.top(limit=50)


In [None]:
top_subreddit

<praw.models.listing.generator.ListingGenerator at 0x7f6d03a713d0>
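
The output above shows that `subreddit.top()` returns a lazy `ListingGenerator` rather than a list, so posts are fetched only while it is iterated. Assuming it behaves like an ordinary Python iterator, a second loop over the same object yields nothing. The sketch below shows that behaviour with a plain generator standing in for the listing (`listing` is a made-up stand-in, not a PRAW function):

```python
def listing(limit):
    """Stand-in for a lazy listing: yields items only when iterated."""
    for i in range(limit):
        yield f"post {i}"

top = listing(3)
first_pass = [title for title in top]
second_pass = [title for title in top]  # the generator is already exhausted
print(first_pass, second_pass)

# Materialise once if the items are needed more than once.
posts = list(listing(3))
print(len(posts))
```

If a cell that loops over `top_subreddit` is re-run, re-create the listing first or keep the submissions in a list.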

**Dump data into dictionary**

In [None]:
topics_dict = { "title":[], "id":[], "url":[],  "created": [],  "body":[]} 

for submission in top_subreddit:
    topics_dict["title"].append(submission.title)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext)

**Converting dictionary to dataframe**

In [None]:
depressed_df = pd.DataFrame(topics_dict)
depressed_df.head(5)

Unnamed: 0,title,id,url,created,body
0,Shout out to the particular hell that is funct...,cd0hjp,https://www.reddit.com/r/depression/comments/c...,1563091000.0,"This is me. Don’t get me wrong, it’s better th..."
1,I hate that people don’t understand that i don...,i3ajk8,https://www.reddit.com/r/depression/comments/i...,1596506000.0,
2,If 10 years ago someone told me that in 10 yea...,ccaxvm,https://www.reddit.com/r/depression/comments/c...,1562939000.0,"But here I am, 24 years old man and doing exac..."
3,"It’s like I died at 15, but my body just kept ...",g9ndgw,https://www.reddit.com/r/depression/comments/g...,1588081000.0,I’m trapped inside. Does anyone else get that ...
4,High-functioning depression: I feel like I'm l...,dpl4bu,https://www.reddit.com/r/depression/comments/d...,1572515000.0,I read a lot of posts on here of people strugg...
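
The `created` column holds Unix timestamps in seconds. A small sketch of converting one to a readable UTC datetime, using the value shown for the first row (the table display may have rounded it); `to_utc` is an illustrative helper:

```python
from datetime import datetime, timezone

def to_utc(created):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created, tz=timezone.utc)

dt = to_utc(1563091000.0)
print(dt.isoformat())
```

PRAW submissions also expose `created_utc`, which is likely the safer field when exact UTC times matter.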


In [None]:
subreddit = reddit.subreddit('world')
top_subreddit = subreddit.top(limit=50)

nd_dict = { "title":[], "score":[],"id":[], "url":[],   "created": [],  "body":[]} 

for submission in top_subreddit:
    nd_dict["title"].append(submission.title)
    nd_dict["score"].append(submission.score)
    nd_dict["id"].append(submission.id)
    nd_dict["url"].append(submission.url)
    nd_dict["created"].append(submission.created)
    nd_dict["body"].append(submission.selftext)

not_depressed_df = pd.DataFrame(nd_dict)
not_depressed_df.head(5)

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Unnamed: 0,title,score,id,url,created,body
0,Global Protests: Stop Genocide of Uyghyrs,94,igbb84,https://www.reddit.com/r/world/comments/igbb84...,1598359000.0,"Hello,\n\nAs we all are aware of what is going..."
1,"""The 'blackies' are coming from Africa"" said P...",52,oatypp,https://v.redd.it/ovsx0uxvjd871,1625047000.0,
2,WHO is oulived itself. The healthcare system h...,42,s61pj9,https://www.reddit.com/r/world/comments/s61pj9...,1642418000.0,Hey fellas\n\nHow did your crypto year start?...
3,I thought there is no video of this incident!,39,bw3k3k,https://v.redd.it/gzfenwghy0231,1559517000.0,
4,In the form of a crescent sea hidden behind on...,34,qagvoa,https://i.redd.it/jkmqmalzi5u71.jpg,1634538000.0,


## Data labelling & data set creation

In [None]:
d_df = pd.DataFrame()
# Join title and body with a space so the last title word and first body word do not fuse
d_df['text'] = depressed_df['title'] + ' ' + depressed_df['body']
d_df['labels'] = 1  # depressed
d_df.head(5)

Unnamed: 0,text,labels
0,Shout out to the particular hell that is funct...,1
1,I hate that people don’t understand that i don...,1
2,If 10 years ago someone told me that in 10 yea...,1
3,"It’s like I died at 15, but my body just kept ...",1
4,High-functioning depression: I feel like I'm l...,1


In [None]:
nd_df = pd.DataFrame()
# Join title and body with a space so the last title word and first body word do not fuse
nd_df['text'] = not_depressed_df['title'] + ' ' + not_depressed_df['body']
nd_df['labels'] = 0  # not depressed
nd_df.head(5)

Unnamed: 0,text,labels
0,Global Protests: Stop Genocide of UyghyrsHello...,0
1,"""The 'blackies' are coming from Africa"" said P...",0
2,WHO is oulived itself. The healthcare system h...,0
3,I thought there is no video of this incident!,0
4,In the form of a crescent sea hidden behind on...,0


In [None]:
frames = [d_df, nd_df]
data_df = pd.concat(frames)
data_df

Unnamed: 0,text,labels
0,Shout out to the particular hell that is funct...,1
1,I hate that people don’t understand that i don...,1
2,If 10 years ago someone told me that in 10 yea...,1
3,"It’s like I died at 15, but my body just kept ...",1
4,High-functioning depression: I feel like I'm l...,1
...,...,...
45,100% renewable energy could power the world by...,0
46,Building in Fire. Warsaw Poland 08-06-2019,0
47,Nature reveals its treasures with the first ra...,0
48,𝑻𝒉𝒆 𝒌𝒊𝒏𝒅 𝒐𝒇 𝒇𝒓𝒊𝒆𝒏𝒅𝒔 𝒆𝒗𝒆𝒓𝒚𝒐𝒏𝒆 𝒏𝒆𝒆𝒅𝒔😍,0


In [None]:
data_df = data_df.sample(frac=1).reset_index(drop=True)
data_df.head(10)

Unnamed: 0,text,labels
0,Antlers are really beautiful!,0
1,"Colourised footage of England in 1901, everyon...",0
2,Being suicidal is like being at a terrible par...,1
3,Don't you just want to sleep and sleep and sle...,1
4,does anyone else have a constant feeling of no...,1
5,Like???????????,0
6,"Location Clouds, Philippines",0
7,"The art of Nature, Serbia! 🌾💜🌳",0
8,I don't want to die but if I was offered a cha...,1
9,The beautiful Iguazu Falls in South America 💚💦...,0


## Data preparation

In [None]:
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# Build these once instead of on every call (or every token)
STOP_WORDS = set(stopwords.words('english'))
TOKENIZER = RegexpTokenizer(r'\w+')

def preprocess(sentence):
    text = str(sentence).lower()
    text = re.sub(r'<.*?>', '', text)     # strip HTML tags
    text = re.sub(r'http\S+', '', text)   # strip URLs
    text = re.sub(r'[0-9]+', '', text)    # strip digits
    text = re.sub(r'@\S+', '', text)      # strip @mentions
    tokens = TOKENIZER.tokenize(text)     # word characters only, so punctuation is dropped
    # Drop very short tokens and English stop words
    filtered_words = [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]
    return " ".join(filtered_words)

data_df['text'] = data_df['text'].map(preprocess)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
data_df.head(5)

Unnamed: 0,text,labels
0,antlers really beautiful,0
1,colourised footage england everyone intrigued ...,0
2,suicidal like terrible partybeing suicidal lik...,1
3,want sleep sleep sleep closest thing dying lif...,1
4,anyone else constant feeling fitting belonging...,1


**Split the data into train, evaluation and test**


`Download the csv files for future processing`

In [None]:
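# Contiguous slices, so no row is skipped between the splits
train = data_df[:60]
eval_df = data_df[60:80]  # named eval_df to avoid shadowing the built-in eval()
test = data_df[80:]
train.to_csv("train.csv", index=False)
eval_df.to_csv("eval.csv", index=False)
test.to_csv("test.csv", index=False)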

## Classification

### ALBERT

In [None]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.63.4-py3-none-any.whl (248 kB)

In [None]:
from simpletransformers.classification import ClassificationModel


model=ClassificationModel('albert','albert-base-v2',num_labels=2,use_cuda=True,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 1})

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForSequenceClassification: ['predictions.dense.bias', 'predictions.LayerNorm.weight', 'predictions.decoder.weight', 'predictions.dense.weight', 'predictions.decoder.bias', 'predictions.LayerNorm.bias', 'predictions.bias']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You sho

#### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

(8, 0.4992561340332031)

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

{'mcc': 0.7272727272727273, 'tp': 8, 'tn': 8, 'fp': 3, 'fn': 0, 'auroc': 0.9659090909090909, 'auprc': 0.9659090909090909, 'eval_loss': 0.3691626638174057}


#### Predict the labels

In [None]:
df_test['text']

0     high functioning depression feel like living d...
1     suicide attempt one overdose graduated made fu...
2     worst part depression feeling deep deep reason...
3     anyone else stay every night wanting die get s...
4                                        thoughts image
5                                          spring japan
6                       isla blanca quintana roo mexico
7     alive barely wrote suicide note today listened...
8                                                   NaN
9     coming home coming rome italy defeats england ...
10                             went hiking south dakota
11    ever stop dead middle whatever feel great wave...
12    henan china experienced worst rainfall years l...
13    panther abandoned mother grows human rottweile...
14    anyone else strangely sick high functioning kn...
15                             malaysia large frogmouth
16                                       shizuoka japan
17    amazing everyone never depression lives ex

In [None]:
df_eval['text']

0     blackies coming africa said putin annual perso...
1     worst part depression waking first thought pop...
2                                         beautiful cat
3     global protests stop genocide uyghyrshello awa...
4                                 anna hummingbird male
5                                    heart lake ontario
6     nature reveals treasures first ray light dawn ...
7                                                 great
8     people depressed exist fuckin insane like post...
9     hate foggy brain syndrome anyone else feel lik...
10               miss myselfi miss happy life pointless
11    cried front family today ended comparing issue...
12    mars moon india russia germany several private...
13                                 colors smiles nature
14                 lake malawi one beautiful seas world
15                     weed finally legal mexico cheers
16    scary part depression start feel hard know tem...
17    fucking hate dreams make feel love make fe

In [None]:
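# Predict on the test set, since the report below scores against df_test.labels.
# Texts that became empty are read back from CSV as NaN, so replace them with ''.
predictions, raw_outputs = model.predict(df_test['text'].fillna('').tolist())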
print(predictions)

[0 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1]


#### Performance

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.metrics import classification_report
print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       0.75      0.55      0.63        11
           1       0.55      0.75      0.63         8

    accuracy                           0.63        19
   macro avg       0.65      0.65      0.63        19
weighted avg       0.66      0.63      0.63        19



### XLNet

In [None]:
from simpletransformers.classification import ClassificationModel


model=ClassificationModel('xlnet','xlnet-base-cased',num_labels=2,use_cuda=True,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 1 }) # Increase for better performance

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

#### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

(8, 0.7780409082770348)

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

{'mcc': 0.0, 'tp': 7, 'tn': 0, 'fp': 12, 'fn': 0, 'auroc': 0.5, 'auprc': 0.4655534941249227, 'eval_loss': 0.8694166938463846}


#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


#### Performance

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.metrics import classification_report
print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        12
           1       0.37      1.00      0.54         7

    accuracy                           0.37        19
   macro avg       0.18      0.50      0.27        19
weighted avg       0.14      0.37      0.20        19



### RoBERTa

In [None]:
from simpletransformers.classification import ClassificationModel


model=ClassificationModel('roberta','roberta-base',num_labels=2,use_cuda=True,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 2 })

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

#### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

  0%|          | 0/60 [00:00<?, ?it/s]

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 0 of 2:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 1 of 2:   0%|          | 0/8 [00:00<?, ?it/s]

(16, 0.6698712110519409)
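
The tuple printed after training appears to be the global step count followed by the average training loss. As a rough sanity check, the step count can be reproduced from the run's shape. The sketch assumes 60 training rows, a batch size of 8 and no gradient accumulation; these values are inferred for illustration, not read from the model's configuration.

```python
import math

# Assumptions for illustration: 60 training rows, batch size 8, 2 epochs,
# and one optimizer step per batch (no gradient accumulation).
num_examples = 60
train_batch_size = 8
num_epochs = 2

steps_per_epoch = math.ceil(num_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```
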

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

  0%|          | 0/19 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/3 [00:00<?, ?it/s]

{'mcc': 0.0, 'tp': 7, 'tn': 0, 'fp': 12, 'fn': 0, 'auroc': 0.6666666666666666, 'auprc': 0.6547619047619047, 'eval_loss': 0.7553224762280782}


#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

  0%|          | 0/19 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
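
Every prediction above is 1, so it can help to check the predicted class distribution before reading the report. The sketch below rebuilds the 19 predictions as a NumPy array (copied from the output, not produced by a model) and compares the outcome with a hypothetical baseline that always predicts the majority class.

```python
import numpy as np

# Predictions copied from the output above: every test row is labelled 1.
predictions = np.ones(19, dtype=int)

labels, counts = np.unique(predictions, return_counts=True)
distribution = dict(zip(labels.tolist(), counts.tolist()))

# A single predicted class means the classifier has collapsed to a constant output.
is_degenerate = len(distribution) == 1

# Accuracy of always predicting the majority class (12 of 19 test rows are class 0).
majority_baseline = 12 / 19

print(distribution, is_degenerate, round(majority_baseline, 2))
```
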


#### Performance

In [None]:
import warnings
from sklearn.metrics import classification_report

# Silence all warnings, including the UndefinedMetricWarning that sklearn raises
# when a class receives no predictions.
warnings.filterwarnings("ignore")

print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        12
           1       0.37      1.00      0.54         7

    accuracy                           0.37        19
   macro avg       0.18      0.50      0.27        19
weighted avg       0.14      0.37      0.20        19



In [None]:
!pip install praw

Collecting praw
  Downloading praw-7.5.0-py3-none-any.whl (176 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.2.3-py3-none-any.whl (53 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.5.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.2.3


In [None]:
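import os

import pandas as pd
import praw

# Read the API credentials from environment variables rather than hard-coding them
# (the variable names here are illustrative).
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="Reddit -data",
)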


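# Fetch the top 50 submissions from r/depression as the depressed class.
subreddit = reddit.subreddit('depression')
top_subreddit = subreddit.top(limit=50)

# Column-wise accumulator; each key becomes a DataFrame column.
topics_dict = {"title": [], "id": [], "url": [], "comms_num": [], "created": [], "body": []}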

for submission in top_subreddit:
    topics_dict["title"].append(submission.title)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext)

depressed_df = pd.DataFrame(topics_dict)
depressed_df.head(5)

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



| | title | id | url | comms_num | created | body |
|---|---|---|---|---|---|---|
| 0 | Shout out to the particular hell that is funct... | cd0hjp | https://www.reddit.com/r/depression/comments/c... | 350 | 1563091000.0 | This is me. Don’t get me wrong, it’s better th... |
| 1 | I hate that people don’t understand that i don... | i3ajk8 | https://www.reddit.com/r/depression/comments/i... | 260 | 1596506000.0 | |
| 2 | If 10 years ago someone told me that in 10 yea... | ccaxvm | https://www.reddit.com/r/depression/comments/c... | 218 | 1562939000.0 | But here I am, 24 years old man and doing exac... |
| 3 | It’s like I died at 15, but my body just kept ... | g9ndgw | https://www.reddit.com/r/depression/comments/g... | 311 | 1588081000.0 | I’m trapped inside. Does anyone else get that ... |
| 4 | High-functioning depression: I feel like I'm l... | dpl4bu | https://www.reddit.com/r/depression/comments/d... | 354 | 1572515000.0 | I read a lot of posts on here of people strugg... |
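
The `created` column holds Unix timestamps in seconds, which are hard to read as raw floats. A minimal sketch of converting them with pandas follows. It uses two values copied from the table as displayed; treating them as UTC is an assumption (PRAW also exposes a `created_utc` attribute), and the `created_at` column name is made up for this example.

```python
import pandas as pd

# Two `created` values copied from the table above (seconds since the Unix epoch).
sample = pd.DataFrame({"created": [1563091000.0, 1596506000.0]})

# unit="s" tells pandas the numbers are seconds; utc=True makes the result timezone-aware.
# `created_at` is a hypothetical column name used only in this sketch.
sample["created_at"] = pd.to_datetime(sample["created"], unit="s", utc=True)
print(sample)
```
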


In [None]:
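import os

import pandas as pd
import praw

# Read the API credentials from environment variables rather than hard-coding them
# (the variable names here are illustrative).
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="Reddit -data",
)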


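# Fetch the top 50 submissions from r/world as the non-depressed class.
subreddit = reddit.subreddit('world')
top_subreddit = subreddit.top(limit=50)

# Column-wise accumulator; each key becomes a DataFrame column.
nd_dict = {"title": [], "score": [], "id": [], "url": [], "comms_num": [], "created": [], "body": []}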

for submission in top_subreddit:
    nd_dict["title"].append(submission.title)
    nd_dict["score"].append(submission.score)
    nd_dict["id"].append(submission.id)
    nd_dict["url"].append(submission.url)
    nd_dict["comms_num"].append(submission.num_comments)
    nd_dict["created"].append(submission.created)
    nd_dict["body"].append(submission.selftext)

not_depressed_df = pd.DataFrame(nd_dict)
not_depressed_df.head(5)




| | title | score | id | url | comms_num | created | body |
|---|---|---|---|---|---|---|---|
| 0 | Global Protests: Stop Genocide of Uyghyrs | 92 | igbb84 | https://www.reddit.com/r/world/comments/igbb84... | 5 | 1598359000.0 | Hello,\n\nAs we all are aware of what is going... |
| 1 | "The 'blackies' are coming from Africa" said P... | 50 | oatypp | https://v.redd.it/ovsx0uxvjd871 | 23 | 1625047000.0 | |
| 2 | WHO is oulived itself. The healthcare system h... | 41 | s61pj9 | https://www.reddit.com/r/world/comments/s61pj9... | 4 | 1642418000.0 | Hey fellas\n\nHow did your crypto year start?... |
| 3 | I thought there is no video of this incident! | 40 | bw3k3k | https://v.redd.it/gzfenwghy0231 | 5 | 1559517000.0 | |
| 4 | In the form of a crescent sea hidden behind on... | 35 | qagvoa | https://i.redd.it/jkmqmalzi5u71.jpg | 1 | 1634538000.0 | |
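
With both frames collected, the labelling step from the overview (depressed = 1, not depressed = 0, then shuffle) could look like the sketch below. It runs on toy placeholder rows instead of the scraped data. Joining `title` and `body` into a single `text` column is an assumption, and `to_labeled` is a hypothetical helper; the `text` and `labels` column names match those used in the training cells.

```python
import pandas as pd

# Toy stand-ins for depressed_df and not_depressed_df (placeholder rows, not real data).
depressed_df = pd.DataFrame({"title": ["title a", "title b"], "body": ["body a", ""]})
not_depressed_df = pd.DataFrame({"title": ["title c", "title d"], "body": ["", "body d"]})


def to_labeled(df: pd.DataFrame, label: int) -> pd.DataFrame:
    """Hypothetical helper: join title and body into `text` and attach a constant label."""
    text = (df["title"].fillna("") + " " + df["body"].fillna("")).str.strip()
    return pd.DataFrame({"text": text, "labels": label})


dataset = pd.concat(
    [to_labeled(depressed_df, 1), to_labeled(not_depressed_df, 0)],
    ignore_index=True,
)

# Shuffle rows reproducibly so the two classes are interleaved.
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)
print(dataset)
```
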
