## Overview

Detect depression from social media data

1. Collect data from social media platform

> * Decide the platform & method
> * Get proper authentication
> * Select appropriate hashtags (depression/jolly) and extract

2. Label the data and create the data set

> * Depression  : depressed (1)
> * Jolly       : not depressed (0)
> * Shuffle the data set

3. Data preparation

> * Lower case
> * Remove punctuation, URLs and tags
> * Remove stop words
> * Divide the data set into train, eval and test.

4. Model Building

> * BERT

##  Data Collection from twitter

In [60]:
import pandas as pd
import numpy as np
import tweepy
import threading
import time
 
def scrape(words, numtweet):
  db = pd.DataFrame()
  tweets = tweepy.Cursor(api.search, q=words, lang="en",tweet_mode='extended').items(numtweet)	
  list_tweets = [tweet for tweet in tweets]	
  i = 1
  for tweet in list_tweets:
    username = tweet.user.screen_name
    description = tweet.user.description
    try:
      text = tweet.retweeted_status.full_text
    except AttributeError:
      text = tweet.full_text    
    ith_tweet = [username, description, text]
    db = db.append({"username" : ith_tweet[0], "description" :ith_tweet[1], "tweet text" : ith_tweet[2]},ignore_index=True)
  return db

if __name__ == '__main__':
  consumer_key = 'SkS5RvRcwbLTQWm6BiNoJBQW6'
  consumer_secret = '70b7yKuFmPg4jsFS5rGPUQxLsaFsNthWTUFMZ1OlyXU4JMp5va'
  access_key = '1156440412615131136-uiUgTuCC1qxrzHR9LXxlEs2ZuZGRZD'
  access_secret = 'U2YhP7Y5i67fAlW4MgDf6NQpufI7P4HWxJ6zykroOJ9Sj'
  auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
  auth.set_access_token(access_key, access_secret)
  api = tweepy.API(auth)
  numtweet = 50
  depressed_df = scrape("depression",numtweet)
  not_depressed_df = scrape("jolly",numtweet)

In [61]:
depressed_df.head(5)

Unnamed: 0,description,tweet text,username
0,♡I don't fuck with them chickens unless they l...,The depression is real,lydsmay3
1,cita-cita ingin menjadi slime,They say there are five stages of grief. Denia...,tafsirkowalski
2,im cool tbh 👩‍🔬,just found an unironic family guy depression e...,clown_jas
3,SDSU ALUM 2020,Them random waves of depression suck 👎🏽,khalilionnaire
4,"Hi, I'm Zora, and I destroy cringe culture eve...",you’ve heard of elf on the shelf now get ready...,Laundry_Pool


In [62]:
not_depressed_df.head(5)

Unnamed: 0,description,tweet text,username
0,𝗘𝗻𝗴𝗹𝗶𝘀𝗵 𝗥𝗣 - @konnect_YUJU // The cherry bloss...,“Jolly la Fiesta” will be held on the 28 - 30t...,koojunhoef
1,Dabble Dabble Doo 06-10-19 RIP Willis #GoPackG...,I got ptsd from those jolly ranchers- Ginger,ZLF420
2,"#OttoSquad Married to my soul mate, mom of 3,...",Have a Holly Jolly JUICI™ sweepstakes! Enter t...,StacyBauer22
3,20+ | she/her | i only have one dps and his na...,@Tasare_Art I mean…it does kinda glisten like ...,starrywishes_
4,Advocating for women’s and children’s rights. ...,@PierrePoilievre What a jolly little chap!! 🥰,MrsDrBee


## Data labelling & data set creation

In [104]:
d_df = pd.DataFrame()
d_df['text'] = []
d_df['labels'] = []
d_df['text'] = depressed_df['tweet text'] + depressed_df['description']
d_df['labels'] = "1" #depressed
d_df.head(5)

Unnamed: 0,text,labels
0,The depression is real♡I don't fuck with them ...,1
1,They say there are five stages of grief. Denia...,1
2,just found an unironic family guy depression e...,1
3,Them random waves of depression suck 👎🏽SDSU AL...,1
4,you’ve heard of elf on the shelf now get ready...,1


In [105]:
nd_df = pd.DataFrame()
nd_df['text'] = []
nd_df['labels'] = []
nd_df['text'] = not_depressed_df['tweet text'] + not_depressed_df['description']
nd_df['labels'] = "0" #not depressed
nd_df.head(5)

Unnamed: 0,text,labels
0,“Jolly la Fiesta” will be held on the 28 - 30t...,0
1,I got ptsd from those jolly ranchers- GingerDa...,0
2,Have a Holly Jolly JUICI™ sweepstakes! Enter t...,0
3,@Tasare_Art I mean…it does kinda glisten like ...,0
4,@PierrePoilievre What a jolly little chap!! 🥰A...,0


In [106]:
frames = [d_df, nd_df]
data_df = pd.concat(frames)
data_df

Unnamed: 0,text,labels
0,The depression is real♡I don't fuck with them ...,1
1,They say there are five stages of grief. Denia...,1
2,just found an unironic family guy depression e...,1
3,Them random waves of depression suck 👎🏽SDSU AL...,1
4,you’ve heard of elf on the shelf now get ready...,1
...,...,...
45,"@susie_j616 Honestly, this decision is depende...",0
46,@vijayshekhar Hope we create policies for safe...,0
47,@Woodgirl1977 I hope not.\n\nPS: Mostly becaus...,0
48,"Welp, I have made An Error of Judgement and no...",0


In [140]:
data_df = data_df.sample(frac=1).reset_index(drop=True)
data_df.head(10)

Unnamed: 0,text,labels
0,vaxxed testing big issue sky high cases people...,1
1,really loved ayesha today natural always said ...,0
2,weeks think see nancy pelosi amp cmte make rep...,0
3,jolly almost cambridge qwi pro wrestler commen...,0
4,jolly merry christmas dragoons artsuki dragoon...,0
5,hope create policies safe disposal waste unlik...,0
6,still wanting hold onto holiday mojo like holl...,0
7,heard elf shelf get ready depressionhi zora de...,1
8,started day parking lot covid visit ended day ...,1
9,season jolly ikon coming town hanbin love เอสร...,0


## Data preparation

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
import re


def preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.lower()
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    rem_url=re.sub(r'http\S+', '',cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    rem_tag = re.sub(r'@\S+', '',rem_num)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_tag)  
    filtered_words = [w for w in tokens if len(w) > 2 if not w in stopwords.words('english')]
    return " ".join(filtered_words)

data_df['text']=data_df['text'].map(lambda s:preprocess(s))


In [139]:
data_df.head(5)

Unnamed: 0,text,labels
0,depression alum data science hooli,1
1,ever stop tweeting means either finally beat d...,1
2,holly jolly juici sweepstakes enter sweepstake...,0
3,heigh life jolly fanmade lyric bot tweeting ko...,0
4,thank making smile depression,1


In [None]:
train = data_df[:60]
eval = data_df[61:80]
test = data_df[81:100]
train.to_csv("train.csv")
eval.to_csv("eval.csv")
test.to_csv("test.csv")

## Model building

In [None]:
!pip install simpletransformers

In [124]:
import pandas as pd
from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

np.set_printoptions(threshold=np.inf)

model=ClassificationModel('bert','bert-base-uncased',num_labels=2,use_cuda=False,args={
        "reprocess_input_data" : True,
        "use_cached_eval_features":False, 
        "overwrite_output_dir": True, 
        "num_train_epochs": 1 })

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### Train the model

In [None]:
df_train = pd.read_csv('train.csv')
df_eval = pd.read_csv('eval.csv')
df_test = pd.read_csv('test.csv')

model.train_model(df_train)


### Evaluate 

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(result)

### Predict the labels

In [123]:
predictions, raw_outputs = model.predict(df_test['text'].tolist())
print(predictions)

  0%|          | 0/19 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

[0 0 1 1 0 1 0 0 0 0 0 1 0 0 1 1 1 1 0]


In [134]:
print("original   predicted")
for i in range(len( df_test.labels)):
  print(df_test['labels'][i],'        ',predictions[i])

original   predicted
1          0
0          0
0          1
1          1
0          0
0          1
0          0
0          0
0          0
1          0
0          0
0          1
0          0
1          0
1          1
1          1
0          1
1          1
0          0
