## Overview

Detect depression from social media data


**1. Collect data from a social media platform**

> * Decide the platform & method
> * Get proper authentication
> * Scrape data

**2. Label the data and create the data set**

> * Keyword "depression": labelled depressed (1)
> * Keyword "jolly": labelled not depressed (0)
> * Shuffle the data set

**3. Data set preparation**

> * Lower case
> * Remove punctuation, URLs and tags
> * Remove stop words
> * Divide the data set into train, eval and test.

**4. Model building: deep learning models**

> * BERT
> * ALBERT
> * XLNet
> * RoBERTa


# Data collection from Twitter

**Twitter authentication:**

https://www.youtube.com/watch?v=vlvtqp44xoQ
 

**Import necessary libraries**

In [None]:
import pandas as pd
import numpy as np
import tweepy
import threading
import time

**Function to scrape data and create data frame**

In [None]:
def scrape(words, numtweet):
  """Search for tweets matching `words` and return them as a data frame."""
  # Uses the global `api` object created in the authentication cell.
  # Tweepy 4 renamed API.search to API.search_tweets; adjust the name for your version.
  tweets = tweepy.Cursor(api.search, q=words, lang="en", tweet_mode='extended').items(numtweet)
  rows = []
  for tweet in tweets:
    try:
      # A retweet carries the untruncated text on the original status
      text = tweet.retweeted_status.full_text
    except AttributeError:
      text = tweet.full_text
    rows.append({"description": tweet.user.description,
                 "tweet text": text,
                 "username": tweet.user.screen_name})
  # Build the frame once; DataFrame.append was removed in pandas 2.0
  return pd.DataFrame(rows, columns=["description", "tweet text", "username"])
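
The retweet fallback in `scrape` can be tried without network access. The following is a minimal sketch, assuming tweet objects expose `user`, `full_text` and, for retweets, `retweeted_status`; the `SimpleNamespace` objects are hypothetical stand-ins for Tweepy statuses, and `tweet_to_row` is an illustrative helper that is not part of the notebook.

```python
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Turn one tweet-like object into a row dict, preferring the retweeted status text."""
    # getattr with a default replaces the try/except AttributeError pattern
    original = getattr(tweet, "retweeted_status", None)
    text = original.full_text if original is not None else tweet.full_text
    return {
        "username": tweet.user.screen_name,
        "description": tweet.user.description,
        "tweet text": text,
    }


# Hypothetical stand-ins for Tweepy status objects, for illustration only
user = SimpleNamespace(screen_name="example_user", description="example bio")
plain = SimpleNamespace(user=user, full_text="a plain tweet")
retweet = SimpleNamespace(
    user=user,
    full_text="RT @someone: a truncated te...",
    retweeted_status=SimpleNamespace(full_text="the full original text"),
)

rows = [tweet_to_row(t) for t in (plain, retweet)]
print(rows[1]["tweet text"])  # the full original text
```

A list of such row dicts can be passed to `pd.DataFrame(rows)` in one step.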

**User authentication and data scraping**

In [None]:
import os

if __name__ == '__main__':
  # Read the keys from environment variables rather than hard-coding them;
  # the variable names below are placeholders, so use whichever names you export.
  consumer_key = os.environ['TWITTER_CONSUMER_KEY']
  consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
  access_key = os.environ['TWITTER_ACCESS_KEY']
  access_secret = os.environ['TWITTER_ACCESS_SECRET']
  auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
  auth.set_access_token(access_key, access_secret)
  api = tweepy.API(auth)
  numtweet = 50
  depressed_df = scrape("depression", numtweet)
  not_depressed_df = scrape("jolly", numtweet)
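
API keys are easy to leak from a shared notebook, so it may help to load them from the environment and fail early when one is missing. This is a sketch under assumptions: the variable names in `REQUIRED_KEYS` and the helper `load_credentials` are hypothetical, not something Tweepy requires.

```python
import os

# Hypothetical environment variable names; pick whatever naming suits your setup
REQUIRED_KEYS = ("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                 "TWITTER_ACCESS_KEY", "TWITTER_ACCESS_SECRET")


def load_credentials(env=None):
    """Return the required credentials from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demonstration with a fake environment; real values would come from os.environ
fake_env = {key: "dummy-value" for key in REQUIRED_KEYS}
creds = load_credentials(fake_env)
print(sorted(creds))
```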

In [None]:
depressed_df.head(5)

Unnamed: 0,description,tweet text,username
0,Woman™️. Assigned Fabulous At Birth.,"@StopSurrogacy This is all leading to, “Men ge...",Sal_Robins
1,burnout,"Anxiety, depression and so on aren't a joke. N...",urfavfeeder
2,Sometimes I just want to give it all up and be...,seasonal depression before valentines https://...,uxama005
3,Bucky - she/they - 31 - Not A Girl || just an ...,"oop, bed time. gotta get an okay night's sleep...",princesDameron
4,ed rant,"ed makes me wanna do anything to lose weight, ...",lust4skinny


In [None]:
not_depressed_df.head(5)

Unnamed: 0,description,tweet text,username
0,,It's cognitively very dissonant to read that 4...,GabeShepperd
1,A Proud Fan of Sushant Singh Rajput.,SUSHANT DAY\n\nJolly SUSHANT,its_ssrwarrior
2,"For Climate change action, social justice and ...",It's cognitively very dissonant to read that 4...,strebormt
3,Random Stranger\nMahaLima is mijn liefde,@KAIATrendsPH @KAIAOfficialPH @SB19Official Th...,SfvXin
4,A proud Hindu (Sanatani).Against totalitarian ...,@Manik_M_Jolly @iitdelhi @iitdelhi what’s wron...,Panjanya3


## Data labelling & data set creation

In [None]:
d_df = pd.DataFrame()
d_df['text'] = []
d_df['labels'] = []
d_df['text'] =  depressed_df['description'] + depressed_df['tweet text']
d_df['labels'] = "1" #depressed
d_df.head(5)

Unnamed: 0,text,labels
0,Woman™️. Assigned Fabulous At Birth.@StopSurro...,1
1,"burnoutAnxiety, depression and so on aren't a ...",1
2,Sometimes I just want to give it all up and be...,1
3,Bucky - she/they - 31 - Not A Girl || just an ...,1
4,ed ranted makes me wanna do anything to lose w...,1
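
Joining the two columns with `+` leaves no separator between them, which is why row 1 above reads "burnoutAnxiety". One possible alternative, sketched here with a hypothetical helper `join_fields`, joins only the non-empty parts with a space.

```python
def join_fields(*parts, sep=" "):
    """Join the non-empty text fields with a separator."""
    cleaned = [str(p).strip() for p in parts if p is not None and str(p).strip()]
    return sep.join(cleaned)


print(join_fields("burnout", "Anxiety, depression and so on"))
# burnout Anxiety, depression and so on
print(join_fields("", "only the tweet text"))
# only the tweet text
```

On a data frame this could be applied row by row, for example `depressed_df.apply(lambda r: join_fields(r['description'], r['tweet text']), axis=1)`.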


In [None]:
nd_df = pd.DataFrame()
nd_df['text'] = []
nd_df['labels'] = []
nd_df['text'] =  not_depressed_df['description'] + not_depressed_df['tweet text']
nd_df['labels'] = "0" #not depressed
nd_df.head(5)

Unnamed: 0,text,labels
0,It's cognitively very dissonant to read that 4...,0
1,A Proud Fan of Sushant Singh Rajput.SUSHANT DA...,0
2,"For Climate change action, social justice and ...",0
3,Random Stranger\nMahaLima is mijn liefde@KAIAT...,0
4,A proud Hindu (Sanatani).Against totalitarian ...,0


In [None]:
frames = [d_df, nd_df]
data_df = pd.concat(frames)
data_df

Unnamed: 0,text,labels
0,Woman™️. Assigned Fabulous At Birth.@StopSurro...,1
1,"burnoutAnxiety, depression and so on aren't a ...",1
2,Sometimes I just want to give it all up and be...,1
3,Bucky - she/they - 31 - Not A Girl || just an ...,1
4,ed ranted makes me wanna do anything to lose w...,1
...,...,...
45,"Vet.Gurkha,MI. Fellow-@echoinggreen @millersoc...",0
46,"✈️ Delayed, cancelled or overbooked flight?\n🔎...",0
47,#OTF #LLRJ🕊 #LLMURD🕊 #LLMunn🕊 #SmittyCitty🕊She...,0
48,@Manik_M_Jolly Exactly! So very well said. Par...,0


In [None]:
data_df = data_df.sample(frac=1).reset_index(drop=True)
data_df.head(10)

Unnamed: 0,text,labels
0,I thought there is no video of this incident!,0
1,If 10 years ago someone told me that in 10 yea...,1
2,Does anyone else stay up to 3AM every night wa...,1
3,When you realize that you have an incredibly e...,1
4,Being suicidal is like being at a terrible par...,1
5,Weed is now finally legal in Mexico! Cheers!,0
6,Does anyone have a period where they feel real...,1
7,Its out!!,0
8,High-functioning depression: I feel like I'm l...,1
9,"I get it, being friends with a depressed perso...",1


## Data preparation

In [None]:
import re

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# Build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
TOKENIZER = RegexpTokenizer(r'\w+')


def preprocess(sentence):
    sentence = str(sentence).lower()
    cleantext = re.sub(r'<.*?>', '', sentence)    # strip HTML-like tags
    rem_url = re.sub(r'http\S+', '', cleantext)    # strip URLs
    rem_num = re.sub(r'[0-9]+', '', rem_url)       # strip digits
    rem_tag = re.sub(r'@\S+', '', rem_num)         # strip @mentions
    tokens = TOKENIZER.tokenize(rem_tag)           # keeps word characters, drops punctuation
    # Drop tokens of two characters or fewer and English stop words
    filtered_words = [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]
    return " ".join(filtered_words)

data_df['text'] = data_df['text'].map(preprocess)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
data_df.head(15)

Unnamed: 0,text,labels
0,shout particular hell functional depression ge...,1
1,hate people understand want kill want alive an...,1
2,years ago someone told years would routinely s...,1
3,like died body kept living trapped inside anyo...,1
4,high functioning depression feel like living d...,1
5,cried front family today ended comparing issue...,1
6,anyone else stay super late avoid next day kno...,1
7,commit suicide option suddenly stop existing w...,1
8,sucks wake thing look forward sleeping againth...,1
9,become closed due depression friends family st...,1
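
The cleaning steps can also be tried on a single string without NLTK. This is a minimal sketch, assuming a tiny illustrative stop-word set in place of NLTK's full English list; `clean` is a hypothetical name and mirrors the order of the steps in `preprocess`.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's full English list
STOP_WORDS = {"the", "and", "that", "this", "with", "for", "are", "you"}


def clean(text):
    text = str(text).lower()
    text = re.sub(r"<.*?>", "", text)      # HTML-like tags
    text = re.sub(r"http\S+", "", text)    # URLs
    text = re.sub(r"[0-9]+", "", text)     # digits
    text = re.sub(r"@\S+", "", text)       # @mentions
    tokens = re.findall(r"\w+", text)      # word characters only, so punctuation is dropped
    return " ".join(t for t in tokens if len(t) > 2 and t not in STOP_WORDS)


print(clean("@user Seasonal depression before Valentine's https://t.co/abc 2022!"))
# seasonal depression before valentine
```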


**Split the data into train, evaluation and test**


`Download the csv files for future processing`

In [None]:
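# Contiguous 60/20/20 split; slices such as 61:80 and 81:100 would skip rows 60 and 80
train_df = data_df[:60]
eval_df = data_df[60:80]    # named eval_df to avoid shadowing the built-in eval()
test_df = data_df[80:100]
train_df.to_csv("train.csv")
eval_df.to_csv("eval.csv")
test_df.to_csv("test.csv")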
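
Slicing a shuffled frame by position does not guarantee that each split contains both classes. A stratified split is one way to keep the class ratio similar across train, eval and test. The sketch below uses only the standard library; `stratified_split`, the 60/20/20 fractions and the seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, eval, test) index lists with similar class ratios in each."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, eval_, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_eval = round(len(indices) * fractions[1])
        train += indices[:n_train]
        eval_ += indices[n_train:n_train + n_eval]
        test += indices[n_train + n_eval:]
    return train, eval_, test


labels = [1] * 50 + [0] * 50
train_idx, eval_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(eval_idx), len(test_idx))  # 60 20 20
```

With pandas, the returned positions can select rows through `data_df.iloc[train_idx]` and so on.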

## Model building: deep learning

**Simple Transformers** (pretrained model names: https://huggingface.co/transformers/v3.3.1/pretrained_models.html)
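
Every model in this notebook is created with the same call pattern and differs only in the model type, the pretrained model name and the number of epochs. The sketch below collects those values in one place; `MODELS`, `COMMON_ARGS` and `model_config` are hypothetical names, and the block builds only the arguments so that it runs without `simpletransformers` installed.

```python
# (model_type, model_name) pairs as used in this notebook
MODELS = {
    "bert": "bert-base-uncased",
    "albert": "albert-base-v2",
    "xlnet": "xlnet-base-cased",
    "roberta": "roberta-base",
}

# Arguments shared by every model in the notebook
COMMON_ARGS = {
    "reprocess_input_data": True,
    "use_cached_eval_features": False,
    "overwrite_output_dir": True,
}


def model_config(model_type, num_train_epochs=1):
    """Return the positional and keyword arguments for ClassificationModel."""
    args = dict(COMMON_ARGS, num_train_epochs=num_train_epochs)
    return (model_type, MODELS[model_type]), {"num_labels": 2, "args": args}


positional, keyword = model_config("roberta", num_train_epochs=2)
print(positional)  # ('roberta', 'roberta-base')
# With simpletransformers installed:
# ClassificationModel(*positional, use_cuda=False, **keyword)
```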

In [None]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.63.4-py3-none-any.whl (248 kB)

### BERT

In [None]:
from simpletransformers.classification import ClassificationModel

model=ClassificationModel('bert','bert-base-uncased',num_labels=2,use_cuda=False,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 1 })

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

##### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

  0%|          | 0/60 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/8 [00:00<?, ?it/s]

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

  0%|          | 0/19 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/3 [00:00<?, ?it/s]

{'mcc': 0.0, 'tp': 0, 'tn': 0, 'fp': 19, 'fn': 0, 'auroc': nan, 'auprc': nan, 'eval_loss': 1.415236274401347}
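
In the result above `tp` and `fn` are both 0, so the evaluation split contains no positive examples. That is consistent with `auroc` and `auprc` being `nan`, since those metrics need both classes. The Matthews correlation coefficient can be recomputed from the four counts. This is a minimal sketch that returns 0.0 when the denominator is zero, a common convention; `mcc` is an illustrative helper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when any marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Counts from the evaluation output above
print(mcc(tp=0, tn=0, fp=19, fn=0))  # 0.0
# A hypothetical balanced result, for comparison (71 / 90)
print(round(mcc(tp=8, tn=9, fp=1, fn=1), 3))  # 0.789
```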




#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

  0%|          | 0/19 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


#### Performance

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.metrics import classification_report
print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      19.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00      19.0
   macro avg       0.00      0.00      0.00      19.0
weighted avg       0.00      0.00      0.00      19.0
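
The support column above shows that all 19 test labels are 0, while every prediction is 1, which explains the accuracy of 0.00. A quick look at the label distribution before training can catch this kind of single-class split. The helpers below are hypothetical and use only the standard library.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_degenerate(labels):
    """True when every item carries the same label."""
    return len(set(labels)) <= 1


predictions = [1] * 19  # as printed by model.predict above
print(label_distribution(predictions))  # {1: 1.0}
print(is_degenerate(predictions))       # True
```

The same check could be run on `df_train.labels`, `df_eval.labels` and `df_test.labels` after reading the CSV files.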



### BERT with GPU



1. Go to Runtime > Change runtime type
2. Select GPU and save
3. Connect to the runtime
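
After switching the runtime it can be worth confirming that a GPU is visible before passing `use_cuda=True`. This is a minimal sketch, assuming PyTorch as the backend; `cuda_available` is a hypothetical helper that returns `False` when PyTorch is not installed.

```python
def cuda_available():
    """Return True only if PyTorch is installed and reports a usable GPU."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


use_cuda = cuda_available()
print("GPU available:", use_cuda)
```

The resulting flag could then be passed as `use_cuda=use_cuda` when the model is created.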




In [None]:
!pip install simpletransformers



In [None]:
from simpletransformers.classification import ClassificationModel

model=ClassificationModel('bert','bert-base-uncased',num_labels=2,use_cuda=True,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 3 }) #Increase for better performance

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

##### Train the model

**Upload the files and read the csv files**

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

  0%|          | 0/60 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/8 [00:00<?, ?it/s]

(24, 0.34869130452473956)

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

  0%|          | 0/19 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/3 [00:00<?, ?it/s]

{'mcc': 0.0, 'tp': 0, 'tn': 8, 'fp': 11, 'fn': 0, 'auroc': nan, 'auprc': nan, 'eval_loss': 0.9359571735064188}


#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

  0%|          | 0/19 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

[0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1]


#### Performance

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.metrics import classification_report
print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       1.00      1.00      1.00        14

    accuracy                           1.00        19
   macro avg       1.00      1.00      1.00        19
weighted avg       1.00      1.00      1.00        19



# Data collection from Reddit

**Reddit Authentication:**

https://www.youtube.com/watch?v=4Lmfgw4RZCM


https://www.reddit.com/prefs/apps


In [None]:
!pip install praw

Collecting praw
  Downloading praw-7.5.0-py3-none-any.whl (176 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.2.3-py3-none-any.whl (53 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.5.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.2.3


In [None]:
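import os
import praw
import pandas as pd
import datetime as dt

# Read the app credentials from environment variables instead of hard-coding them
# (REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholder names)
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='Reddit -data')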

subreddit = reddit.subreddit('depression')

top_subreddit = subreddit.top(limit=50)
topics_dict = { "title":[], "id":[], "url":[],  "created": [],  "body":[]} 

for submission in top_subreddit:
    topics_dict["title"].append(submission.title)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext)

depressed_df = pd.DataFrame(topics_dict)
depressed_df.head(5)

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Unnamed: 0,title,id,url,created,body
0,Shout out to the particular hell that is funct...,cd0hjp,https://www.reddit.com/r/depression/comments/c...,1563091000.0,"This is me. Don’t get me wrong, it’s better th..."
1,I hate that people don’t understand that i don...,i3ajk8,https://www.reddit.com/r/depression/comments/i...,1596506000.0,
2,If 10 years ago someone told me that in 10 yea...,ccaxvm,https://www.reddit.com/r/depression/comments/c...,1562939000.0,"But here I am, 24 years old man and doing exac..."
3,"It’s like I died at 15, but my body just kept ...",g9ndgw,https://www.reddit.com/r/depression/comments/g...,1588081000.0,I’m trapped inside. Does anyone else get that ...
4,High-functioning depression: I feel like I'm l...,dpl4bu,https://www.reddit.com/r/depression/comments/d...,1572515000.0,I read a lot of posts on here of people strugg...
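
The `created` column holds Unix timestamps in seconds. A short sketch of converting one to a readable UTC datetime follows; `to_utc` is an illustrative helper, and the input is the rounded value displayed in row 0 above, so the exact posting time may differ slightly.

```python
import datetime as dt


def to_utc(created):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)


print(to_utc(1563091000.0).isoformat())  # 2019-07-14T07:56:40+00:00
```

On the data frame this could be applied with `depressed_df['created'].map(to_utc)`.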


In [None]:
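import os
import praw
import pandas as pd
import datetime as dt

# Read the app credentials from environment variables instead of hard-coding them
# (REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholder names)
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='Reddit -data')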


subreddit = reddit.subreddit('world')

top_subreddit = subreddit.top(limit=50)
nd_dict = { "title":[], "score":[],"id":[], "url":[],   "created": [],  "body":[]} 

for submission in top_subreddit:
    nd_dict["title"].append(submission.title)
    nd_dict["score"].append(submission.score)
    nd_dict["id"].append(submission.id)
    nd_dict["url"].append(submission.url)
    nd_dict["created"].append(submission.created)
    nd_dict["body"].append(submission.selftext)

not_depressed_df = pd.DataFrame(nd_dict)
not_depressed_df.head(5)

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Unnamed: 0,title,score,id,url,created,body
0,Global Protests: Stop Genocide of Uyghyrs,94,igbb84,https://www.reddit.com/r/world/comments/igbb84...,1598359000.0,"Hello,\n\nAs we all are aware of what is going..."
1,"""The 'blackies' are coming from Africa"" said P...",52,oatypp,https://v.redd.it/ovsx0uxvjd871,1625047000.0,
2,WHO is oulived itself. The healthcare system h...,42,s61pj9,https://www.reddit.com/r/world/comments/s61pj9...,1642418000.0,Hey fellas\n\nHow did your crypto year start?...
3,I thought there is no video of this incident!,39,bw3k3k,https://v.redd.it/gzfenwghy0231,1559517000.0,
4,In the form of a crescent sea hidden behind on...,37,qagvoa,https://i.redd.it/jkmqmalzi5u71.jpg,1634538000.0,


## Data labelling & data set creation

In [None]:
d_df = pd.DataFrame()
d_df['text'] = []
d_df['labels'] = []
d_df['text'] = depressed_df['title'] + depressed_df['body'] 
d_df['labels'] = "1" #depressed
d_df.head(5)

Unnamed: 0,text,labels
0,Shout out to the particular hell that is funct...,1
1,I hate that people don’t understand that i don...,1
2,If 10 years ago someone told me that in 10 yea...,1
3,"It’s like I died at 15, but my body just kept ...",1
4,High-functioning depression: I feel like I'm l...,1


In [None]:
nd_df = pd.DataFrame()
nd_df['text'] = []
nd_df['labels'] = []
nd_df['text'] = not_depressed_df['title'] + not_depressed_df['body']
nd_df['labels'] = "0" #not depressed
nd_df.head(5)

Unnamed: 0,text,labels
0,Global Protests: Stop Genocide of UyghyrsHello...,0
1,"""The 'blackies' are coming from Africa"" said P...",0
2,WHO is oulived itself. The healthcare system h...,0
3,I thought there is no video of this incident!,0
4,In the form of a crescent sea hidden behind on...,0


In [None]:
frames = [d_df, nd_df]
data_df = pd.concat(frames)
data_df

Unnamed: 0,text,labels
0,Shout out to the particular hell that is funct...,1
1,I hate that people don’t understand that i don...,1
2,If 10 years ago someone told me that in 10 yea...,1
3,"It’s like I died at 15, but my body just kept ...",1
4,High-functioning depression: I feel like I'm l...,1
...,...,...
45,100% renewable energy could power the world by...,0
46,Building in Fire. Warsaw Poland 08-06-2019,0
47,Nature reveals its treasures with the first ra...,0
48,𝑻𝒉𝒆 𝒌𝒊𝒏𝒅 𝒐𝒇 𝒇𝒓𝒊𝒆𝒏𝒅𝒔 𝒆𝒗𝒆𝒓𝒚𝒐𝒏𝒆 𝒏𝒆𝒆𝒅𝒔😍,0


In [None]:
data_df = data_df.sample(frac=1).reset_index(drop=True)
data_df.head(10)

Unnamed: 0,text,labels
0,"Henan, China has experienced the worst rainfal...",0
1,High-functioning depression: I feel like I'm l...,1
2,100% renewable energy could power the world by...,0
3,I fucking hate dreams that make you feel love ...,1
4,Does anyone else stay up to 3AM every night wa...,1
5,The worst part of depression is waking up and ...,1
6,"Heart Lake in Ontario, ❤️",0
7,A Panther Abandoned By Its Mother Grows Up Wit...,0
8,Anyone else ever feel like the “old you” died ...,1
9,If 10 years ago someone told me that in 10 yea...,1


## Data preparation

In [None]:
import re

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# Build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
TOKENIZER = RegexpTokenizer(r'\w+')


def preprocess(sentence):
    sentence = str(sentence).lower()
    cleantext = re.sub(r'<.*?>', '', sentence)    # strip HTML-like tags
    rem_url = re.sub(r'http\S+', '', cleantext)    # strip URLs
    rem_num = re.sub(r'[0-9]+', '', rem_url)       # strip digits
    rem_tag = re.sub(r'@\S+', '', rem_num)         # strip @mentions
    tokens = TOKENIZER.tokenize(rem_tag)           # keeps word characters, drops punctuation
    # Drop tokens of two characters or fewer and English stop words
    filtered_words = [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]
    return " ".join(filtered_words)

data_df['text'] = data_df['text'].map(preprocess)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
data_df.head(5)

Unnamed: 0,text,labels
0,shout particular hell functional depression ge...,1
1,scary part depression start feel hard know tem...,1
2,thoughts image,0
3,anyone else stay every night wanting die get s...,1
4,commit suicide option suddenly stop existing w...,1


**Split the data into train, evaluation and test**


`Download the csv files for future processing`

In [None]:
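# Contiguous 60/20/20 split; slices such as 61:80 and 81:100 would skip rows 60 and 80
train_df = data_df[:60]
eval_df = data_df[60:80]    # named eval_df to avoid shadowing the built-in eval()
test_df = data_df[80:100]
train_df.to_csv("train.csv")
eval_df.to_csv("eval.csv")
test_df.to_csv("test.csv")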

### ALBERT

In [None]:
from simpletransformers.classification import ClassificationModel


model=ClassificationModel('albert','albert-base-v2',num_labels=2,use_cuda=True,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 1})

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForSequenceClassification: ['predictions.bias', 'predictions.LayerNorm.bias', 'predictions.decoder.bias', 'predictions.dense.bias', 'predictions.decoder.weight', 'predictions.dense.weight', 'predictions.LayerNorm.weight']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You sho

##### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

  0%|          | 0/60 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/8 [00:00<?, ?it/s]

(24, 0.17180967330932617)

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

  0%|          | 0/19 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/3 [00:00<?, ?it/s]

{'mcc': 0.0, 'tp': 0, 'tn': 19, 'fp': 0, 'fn': 0, 'auroc': nan, 'auprc': nan, 'eval_loss': 0.4141506652037303}


#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

  0%|          | 0/19 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

[0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1]


#### Performance

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.metrics import classification_report
print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       1.00      0.84      0.91        19
           1       0.00      0.00      0.00         0

    accuracy                           0.84        19
   macro avg       0.50      0.42      0.46        19
weighted avg       1.00      0.84      0.91        19
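
The figures in this report can be reproduced by hand from the printed predictions. The sketch below assumes that all 19 true labels are 0, which is what the support column implies; `per_class_scores` is an illustrative helper, not the scikit-learn implementation.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, using 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Predictions as printed above; the support column implies all 19 true labels are 0
y_pred = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
y_true = [0] * 19

for label in (0, 1):
    p, r, f = per_class_scores(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
# 0 1.0 0.84 0.91
# 1 0.0 0.0 0.0
```

Class 0 is predicted 16 times and every one of those is correct (precision 1.00), but 3 of the 19 true zeros are predicted as 1 (recall 16/19, about 0.84).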



### XLNet

In [None]:
from simpletransformers.classification import ClassificationModel


model=ClassificationModel('xlnet','xlnet-base-cased',num_labels=2,use_cuda=True,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 2 })

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

##### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

  0%|          | 0/60 [00:00<?, ?it/s]

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 0 of 2:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 1 of 2:   0%|          | 0/8 [00:00<?, ?it/s]

(16, 0.29885751008987427)

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

  0%|          | 0/19 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/3 [00:00<?, ?it/s]

{'mcc': 0.0, 'tp': 0, 'tn': 12, 'fp': 7, 'fn': 0, 'auroc': nan, 'auprc': nan, 'eval_loss': 1.2775149941444397}


#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

  0%|          | 0/19 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1]


#### Performance

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.metrics import classification_report
print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       1.00      0.84      0.91        19
           1       0.00      0.00      0.00         0

    accuracy                           0.84        19
   macro avg       0.50      0.42      0.46        19
weighted avg       1.00      0.84      0.91        19



### RoBERTa

In [None]:
from simpletransformers.classification import ClassificationModel


model=ClassificationModel('roberta','roberta-base',num_labels=2,use_cuda=True,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 2 }) #2

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.

##### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)

  0%|          | 0/60 [00:00<?, ?it/s]

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 0 of 2:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 1 of 2:   0%|          | 0/8 [00:00<?, ?it/s]

(16, 0.4545861482620239)

#### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

  0%|          | 0/19 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/3 [00:00<?, ?it/s]

{'mcc': 0.0, 'tp': 0, 'tn': 0, 'fp': 19, 'fn': 0, 'auroc': nan, 'auprc': nan, 'eval_loss': 1.5303412675857544}


#### Predict the labels

In [None]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

  0%|          | 0/19 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


#### Performance

In [None]:
import warnings
from sklearn.metrics import classification_report

# Silence all warnings, including sklearn's undefined-metric warnings
# for classes that have no true or predicted samples
warnings.filterwarnings("ignore")

print(classification_report(df_test.labels, predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      19.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00      19.0
   macro avg       0.00      0.00      0.00      19.0
weighted avg       0.00      0.00      0.00      19.0
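
The zeros in the report follow directly from the inputs: every prediction is `1`, while the support column indicates that all 19 test labels are `0`. The sketch below recomputes precision and recall per class in plain Python. The `y_true` list is an assumption inferred from the support column, `per_class_scores` is a hypothetical helper, and undefined ratios are reported as `0.0` by assumption.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, support) for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    support = sum(t == label for t in y_true)
    # Undefined ratios (0/0) are reported as 0.0 in this sketch.
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    return precision, recall, support

y_pred = [1] * 19  # the predictions printed above
y_true = [0] * 19  # assumption: inferred from the support column

for label in (0, 1):
    print(label, per_class_scores(y_true, y_pred, label))
```

A test split with a single class cannot show whether the model separates the two classes, so it is worth checking the label distribution of each split before training.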



#  Data Collection from Reddit

In [None]:
!pip install praw
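
The Reddit cells in this notebook pass `client_id` and `client_secret` to `praw.Reddit` as string literals. A minimal sketch of reading them from environment variables instead is shown here. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are hypothetical, as is the helper `reddit_credentials`.

```python
import os

def reddit_credentials(env=os.environ):
    """Collect the keyword arguments for praw.Reddit from environment variables."""
    return {
        "client_id": env["REDDIT_CLIENT_ID"],          # hypothetical variable name
        "client_secret": env["REDDIT_CLIENT_SECRET"],  # hypothetical variable name
        "user_agent": env.get("REDDIT_USER_AGENT", "Reddit -data"),
    }

# Usage (requires praw and the variables to be set):
# reddit = praw.Reddit(**reddit_credentials())
```

This keeps the keys out of the notebook file, so the notebook can be shared without exposing them.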



In [None]:
import praw
import pandas as pd
import datetime as dt

reddit = praw.Reddit(client_id='NJKR19IOkedmpg', \
                     client_secret='65DR2u7ncehsg8Z2BAYaRzDlz28', \
                     user_agent='Reddit -data')


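# Top 50 posts from r/depression (label: depressed)
subreddit = reddit.subreddit('depression')

top_subreddit = subreddit.top(limit=50)
# One list per column of the resulting data frame
topics_dict = {"title": [], "id": [], "url": [], "comms_num": [], "created": [], "body": []}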

for submission in top_subreddit:
    topics_dict["title"].append(submission.title)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext)

depressed_df = pd.DataFrame(topics_dict)
depressed_df.head(5)

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



| | title | id | url | comms_num | created | body |
|---|---|---|---|---|---|---|
| 0 | Shout out to the particular hell that is funct... | cd0hjp | https://www.reddit.com/r/depression/comments/c... | 350 | 1563091000.0 | This is me. Don’t get me wrong, it’s better th... |
| 1 | I hate that people don’t understand that i don... | i3ajk8 | https://www.reddit.com/r/depression/comments/i... | 260 | 1596506000.0 | |
| 2 | If 10 years ago someone told me that in 10 yea... | ccaxvm | https://www.reddit.com/r/depression/comments/c... | 218 | 1562939000.0 | But here I am, 24 years old man and doing exac... |
| 3 | It’s like I died at 15, but my body just kept ... | g9ndgw | https://www.reddit.com/r/depression/comments/g... | 311 | 1588081000.0 | I’m trapped inside. Does anyone else get that ... |
| 4 | High-functioning depression: I feel like I'm l... | dpl4bu | https://www.reddit.com/r/depression/comments/d... | 354 | 1572515000.0 | I read a lot of posts on here of people strugg... |
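
The `created` column holds raw numbers rather than dates. Assuming these are Unix timestamps in seconds, a short sketch of converting one to a readable UTC datetime follows. The helper `to_utc_datetime` is hypothetical, and the sample value is the one displayed in the first row, so it may be rounded.

```python
import datetime as dt

def to_utc_datetime(epoch_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc)

# Value as displayed in the first row of the table above (possibly rounded).
created = 1563091000.0
print(to_utc_datetime(created).date().isoformat())
```

For a whole column, `pd.to_datetime(depressed_df["created"], unit="s", utc=True)` does the same conversion in one call. PRAW submissions also expose `created_utc`, which may be the safer field to store.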


In [None]:
import praw
import pandas as pd
import datetime as dt

reddit = praw.Reddit(client_id='NJKR19IOkedmpg', \
                     client_secret='65DR2u7ncehsg8Z2BAYaRzDlz28', \
                     user_agent='Reddit -data')


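# Top 50 posts from r/world (label: not depressed)
subreddit = reddit.subreddit('world')

top_subreddit = subreddit.top(limit=50)
# One list per column of the resulting data frame
nd_dict = {"title": [], "score": [], "id": [], "url": [], "comms_num": [], "created": [], "body": []}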

for submission in top_subreddit:
    nd_dict["title"].append(submission.title)
    nd_dict["score"].append(submission.score)
    nd_dict["id"].append(submission.id)
    nd_dict["url"].append(submission.url)
    nd_dict["comms_num"].append(submission.num_comments)
    nd_dict["created"].append(submission.created)
    nd_dict["body"].append(submission.selftext)

not_depressed_df = pd.DataFrame(nd_dict)
not_depressed_df.head(5)


| | title | score | id | url | comms_num | created | body |
|---|---|---|---|---|---|---|---|
| 0 | Global Protests: Stop Genocide of Uyghyrs | 89 | igbb84 | https://www.reddit.com/r/world/comments/igbb84... | 5 | 1598359000.0 | Hello,\n\nAs we all are aware of what is going... |
| 1 | "The 'blackies' are coming from Africa" said P... | 52 | oatypp | https://v.redd.it/ovsx0uxvjd871 | 23 | 1625047000.0 | |
| 2 | WHO is oulived itself. The healthcare system h... | 42 | s61pj9 | https://www.reddit.com/r/world/comments/s61pj9... | 4 | 1642418000.0 | Hey fellas\n\nHow did your crypto year start?... |
| 3 | I thought there is no video of this incident! | 40 | bw3k3k | https://v.redd.it/gzfenwghy0231 | 5 | 1559517000.0 | |
| 4 | In the form of a crescent sea hidden behind on... | 35 | qagvoa | https://i.redd.it/jkmqmalzi5u71.jpg | 1 | 1634538000.0 | |
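
With both sets of posts collected, the labeling, shuffling and splitting steps from the overview can be sketched as follows. This is an illustration in plain Python so that it runs without Reddit access, not the notebook's own method. The 60/20/20 fractions, the seed and the helper `stratified_split` are assumptions. Splitting each label separately ensures that every split contains both classes.

```python
import random

def stratified_split(texts_by_label, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split texts into (train, eval, test) lists of (text, label) pairs,
    keeping roughly the same label proportions in every split."""
    rng = random.Random(seed)
    splits = ([], [], [])
    for label, texts in texts_by_label.items():
        items = list(texts)
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_eval = round(len(items) * fractions[1])
        bounds = (0, n_train, n_train + n_eval, len(items))
        for part, lo, hi in zip(splits, bounds, bounds[1:]):
            part.extend((text, label) for text in items[lo:hi])
    for part in splits:
        rng.shuffle(part)  # mix the two labels within each split
    return splits

# Placeholder texts standing in for the scraped posts (1 = depressed, 0 = not).
posts = {
    1: [f"depressed post {i}" for i in range(50)],
    0: [f"other post {i}" for i in range(50)],
}
train, eval_, test = stratified_split(posts)
print(len(train), len(eval_), len(test))
```

With pandas, a comparable result can be obtained by adding a label column to each data frame, combining them with `pd.concat`, and shuffling with `sample(frac=1)` before splitting.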
