# Carrying out analysis on texts on the topic of depression and the emotions felt

The emotion classifier was trained on the XED dataset (https://github.com/Helsinki-NLP/XED). The classifier was trained following, and sourced from, the "Multilingual Sentiment Analysis" app example on Cohere's website (https://docs.cohere.com/page/multilingual-sentiment-analysis).

A little about the XED dataset:
The XED dataset consists of annotated movie subtitles. The annotations consist of these emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The topics of the sentences vary. The sentence length distribution is:
- 1-31 words: 41.9%
- 31-61 words: 42.2%
- 61-91 words: 11.9%

Initially, I explored a dataset of tweets. Tweets initiate conversation, and their maximum length of 280 characters allows a maximum of roughly 40 to 70 words. This makes them somewhat similar in length to the sentences in the XED dataset. The dataset I used (https://www.kaggle.com/datasets/gargmanas/sentimental-analysis-for-tweets/data) was labelled with a keyword search.
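
To check the length comparison rather than estimate it, the word counts of any text collection can be bucketed into the same ranges quoted for XED. The sketch below is a minimal, standard-library illustration. The bucket edges mirror the ranges listed above, treating them as half-open intervals is my assumption, and `length_distribution` is a hypothetical helper name.

```python
from collections import Counter

# Bucket edges are assumptions chosen to mirror the XED ranges quoted above.
BUCKETS = [(1, 31), (31, 61), (61, 91)]


def length_distribution(texts, buckets=BUCKETS):
    """Return the percentage of texts whose word count falls in each bucket.

    Each bucket is treated as half-open: low <= word_count < high.
    Texts that fall outside every bucket are counted under "other".
    """
    counts = Counter()
    for text in texts:
        n_words = len(text.split())
        for low, high in buckets:
            if low <= n_words < high:
                counts[f"{low}-{high}"] += 1
                break
        else:
            counts["other"] += 1
    total = len(texts)
    return {label: 100 * count / total for label, count in counts.items()}


# Three tweets taken from the dataset explored below
sample = [
    "is reading manga http://plurk.com/p/mzp1e",
    "ADD ME ON MYSPACE!!! myspace.com/LookThunder",
    "The weather outside today is what depression looks like.",
]
print(length_distribution(sample))
```

Once the dataframe is loaded, the same function could be applied to the whole tweet column, for example `length_distribution(list(df["message to examine"]))`.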

Here, we import the modules we need and load the dataset of tweets into a dataframe.

In [3]:
import pandas as pd
import cohere
import torch
import pickle
import sklearn
import numpy
pd.set_option('display.max_colwidth', None)

# Load the tweets dataset into a dataframe
df = pd.read_csv("data/sentiment_tweets3.csv")

## Exploring the Twitter dataset

In [4]:
# Top 5 rows of the dataset. There doesn't seem to be an order to the dataset
df.head()

Unnamed: 0,Index,message to examine,label (depression result)
0,106,"just had a real good moment. i missssssssss him so much,",0
1,217,is reading manga http://plurk.com/p/mzp1e,0
2,220,@comeagainjen http://twitpic.com/2y2lx - http://www.youtube.com/watch?v=zoGfqvh2ME8,0
3,288,"@lapcat Need to send 'em to my accountant tomorrow. Oddly, I wasn't even referring to my taxes. Those are supporting evidence, though.",0
4,540,ADD ME ON MYSPACE!!! myspace.com/LookThunder,0


In [5]:
# Nothing out of the ordinary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10314 entries, 0 to 10313
Data columns (total 3 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Index                      10314 non-null  int64 
 1   message to examine         10314 non-null  object
 2   label (depression result)  10314 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 241.9+ KB


In [8]:
# 2314 rows where text is labelled as depression
df[df["label (depression result)"]==1].value_counts()

Index   message to examine                                                                                                                                                                                                                                                                                                        label (depression result)
800000  The lack of this understanding is a small but significant part of what causes anxiety & depression to both feel so incredibly lonely. It's soooo easy to compare. It's so easy to invalidate ourselves because of that.                                                                                                   1                            1
801590  I've got a black eye of a soulMorals in a hole Wish I was still dead But the tv said it's just a tropical depression                                                                                                                                                                               

In [9]:
# A sample of tweets labelled as depression
dep_rows = df[df["label (depression result)"]==1]
dep_rows.sample(10)

Unnamed: 0,Index,message to examine,label (depression result)
9167,801167,@awg_allan @TelegramSam100 I'll agree to disagree. Solving the self-esteem and depression issue stops him from ever getting to the murderous obsession (imo).,1
8064,800064,Depression and No tears left to cry snatched me bold pic.twitter.com/Fxw7YgNmwY,1
9237,801237,"For some people, psychiatric diagnosis is helpful, and the problem is that it was not given early enough. For others, a diagnosis is deeply oppressive. What's your experience? http://bit.ly/2JmldiA #Psychiatry #MentalIllness #Schizophrenia #Depression #Borderline #OCD",1
10068,802068,"It's a cliche, but I hope you'll start a nuclear war to cure your depression",1
9396,801396,"@asymmetricinfo @ThePlumLineGS It is almost the definition of anti-capitalist. No, that doesn't mean that it can't be considered as a short term sunsetting fix for a depression. So why would we discuss it now?",1
9759,801759,The weather outside today is what depression looks like.,1
8195,800195,"Who wants to see me sing! Created for Stand-Up When, I present Crash! The Musical!, a multi-media number set in the great-depression right after the stock market crash of 1929. It's depression-tastic! pic.twitter.com/nEXTt0G5VZ",1
10220,802220,Today's a day where my depression is just a huge thing over my head,1
10305,802305,RT Depression Could Be Improved With Vitamin D Deficiency Treatment <Emoji: Rightwards arrow> http://aboutdepressionfacts.com/4wxu pic.twitter.com/QGgbqPZUMR #health #well,1
8465,800465,@JustGiving I've #justdonated to develop a charity to further the understanding and treatment of anxiety and depression in young people so that other lives may be saved.. Donate on @justgiving and help raise £10000 https://www.justgiving.com/crowdfunding/youngpeoplesmentalhealth?utm_source=twitter&utm_medium=socpledgedesktop&utm_content=youngpeoplesmentalhealth&utm_campaign=post-pledge-desktop&utm_term=qveqXZaEy …,1


I concluded that a dataset of tweets is a poor fit for this scenario: tweets often consist of news and other topics unrelated to personal conversation or thoughts, and sometimes people tweet about other people's thoughts rather than their own.

An example is the second-to-last row, which seems to be a link to an article about treatment for depression. This could also be because keyword search is a weak method for labelling. The source of the data is incredibly important for an issue that is talked about so much on social media.
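
The weakness of keyword labelling can be illustrated with a few lines of code. The exact procedure used to label this dataset is not documented here, so the whole-word, case-insensitive match below is an assumption. `keyword_label` is a hypothetical helper, and the example texts are excerpts from the tweets shown above.

```python
import re


def keyword_label(text, keyword="depression"):
    """Label a text 1 if it contains the keyword as a whole word, else 0.

    This is an assumed stand-in for the dataset's labelling rule.
    """
    pattern = rf"\b{re.escape(keyword)}\b"
    return int(re.search(pattern, text, flags=re.IGNORECASE) is not None)


tweets = [
    "Today's a day where my depression is just a huge thing over my head",
    "But the tv said it's just a tropical depression",
    "a multi-media number set in the great-depression right after the stock market crash of 1929",
    "is reading manga http://plurk.com/p/mzp1e",
]

# The weather and economic senses of the word are labelled 1 as well
labels = [keyword_label(t) for t in tweets]
print(labels)
```

Under this assumed rule, a keyword match cannot tell a personal account of depression apart from a weather report or a history reference, which is consistent with the noisy rows in the sample.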

Instead of taking comments or tweets, I looked at Reddit post titles. Post titles are basically summaries of the post. Below, I start exploring a dataset collected from various Reddit forums focused on depression-related topics (https://www.kaggle.com/datasets/diegosilvadefrana/depression-dataset). Posts in these forums tend to come from people looking for support, conversation, or a connection with others who may be experiencing similar struggles. Hence, by carrying out sentiment analysis on post titles, we can aim to better understand the emotions of people struggling with depression.

## Exploring the Reddit dataset

In [4]:
df_reddit = pd.read_csv("data/dataset.csv")
df_reddit.head()

Unnamed: 0,title,content,score
0,"Regular check-in post, with information about our rules and wikis","Welcome to /r/depression's check-in post - a place to take a moment and share what is going on and how you are doing. If you have an accomplishment you want to talk about (these shouldn't be standalone posts in the sub as they violate the ""role model"" rule, but are welcome here), or are having a tough time but prefer not to make your own post, this is a place you can share.\n\n-----\n\nOur subreddit rules are located in the sidebar (you can also always access them at https://www.reddit.com/r/depression/about/rules) - since all of them exist for important safety reasons, we ask everyone here to read and follow them. Please click 'report' on any harmful content you see here - we always want to know and deal as soon as we can.\n\nWe also have several wikis there for help with finding and giving support:\n\nhttps://www.reddit.com/r/depression/wiki/what_is_depression provides guidance about what is and isn't a depressive disorder, guidance on the complex nature of the illnesses that are usually grouped under the ""depression"" label, and redirect information for common off-topic issues.\n\nhttps://www.reddit.com/r/depression/wiki/giving_help offers information on the nature and value of peer support for mental-health issues in general, and lots of guidance for learning what is -- and isn't -- usually helpful in giving peer support.\n\nYSK that the types of rule violations that we most frequently see interfering with people getting safe and relevant support here are:\n\n- People breaking the private contact rule. You should never trust anyone who tries to get you into a private conversation in response to a post here. See https://www.reddit.com/r/depression/wiki/private_contact\n\n- ""I'm here to help"" posts. This shows that you don't understand the most basic principles of peer support, especially selectivity. The ""giving help"" wiki explains more about this.\n\n- Role modelling, i.e. 
""achievement"" or ""advice"" posts. This is an expert-free zone -- that's what peer support means (rule 5). We know that ""internet culture"" celebrate not just bragging about your achievements but bragging about your good intentions. Nothing like that is ever acceptable here.\n\n- Content that's more about 'making a statement' or casually polling the sub than seeking personal support (or, in a comment, giving it) (rules 1, 2 and 10).\n\n- Off-topic posts about difficult situations or circumstances, including interpersonal losses. Grief, sadness, anger, and other difficult emotions are not mental illnesses. The ""what is depression"" wiki has suggestions for other places to post about these issues, which are 100% valid and serious but inappropriate here.",115
1,"Our most-broken and least-understood rules is ""helpers may not invite private contact as a first resort"", so we've made a new wiki to explain it","We understand that most people who reply immediately to an OP with an invitation to talk privately mean only to help, but this type of response usually leads to either disappointment or disaster. it usually works out quite differently here than when you say ""PM me anytime"" in a casual social context. \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves. We're hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start. \n\nOur new wiki page explains in detail why it's much better to respond in public comments, at least until you've gotten to know someone. It will be maintained at /r/depression/wiki/private_contact, and the full text of the current version is below.\n\n*****\n\n###Summary###\n\n**Anyone who, while acting as a helper, invites or accepts private contact (I.e. PMs, chat, or any kind of offsite communication) early in the conversion is showing either bad intentions or bad judgement. Either way, it's unwise to trust them.** \n\n\n""PM me anytime"" seems like a kind and generous offer. And it might be perfectly well-meaning, but, unless and until a solid rapport has been established, it's just not a wise idea. Here are some points to consider before you offer or accept an invitation to communicate privately.\n\n* **By posting supportive replies publicly, you'll help more people than just the OP. If your responses are of good quality, you'll educate and inspire other helpers.** [The 1-9-90 rule](https://en.wikipedia.org/wiki/1%25_rule_(Internet_culture\)) applies here as much as it does anywhere else on the internet. 
\n\n* People who are struggling with serious mental-health issues often (justifiably) have a low tolerance for disappointment and a high-level of ever-changing emotional need. **Unless the helper is able to make a 100% commitment to be there for them in every way, for as long as necessary, offering a personal inbox as a resource is likely to do more harm than good.** This is why mental-health crisis-line responders usually don't give their names and callers aren't allowed to request specific responders. It's much healthier and safer for the callers to develop a relationship with the agency as a whole. Analogously, it's much safer and healthier for our OPs to develop a relationship with the community as a whole. Even trained responders are generally not allowed to work high-intensity situations alone. It's partly about availability, but it's mostly about wider perspective and preventing compassion fatigue. \n\n* **If a helper gets in over their head with someone whose mental-health issues (including suicidality, which is often comorbid with depression) escalate, in a PM conversation it's much harder for others, including the /r/depression and /r/SuicideWatch moderators to help**. (Contrary to common assumptions, moderators can't see or police PMs.) \n\n* In our observation over many years, the people who say ""PM me"" the most are consistently the ones with the least understanding of mental-health issues and mental-health support. We all have gaps in our knowledge and in our ability to communicate effectively. Community input mitigates these limitations. **There's no reason why someone who's truly here to help would want to hide their responses from community scrutiny**. If helpers are concerned about their own privacy, keep in mind that self-disclosure, when used supportively, is more about the feelings than the details, and that we have no problem here with the use of alt/throwaway accounts, and have no restrictions on account age or karma. 
\n\n* We all know the internet is used by some people to exploit or abuse others. These people *do* want to hide their deceptive and manipulative responses from everyone except their victims. There are many of them who specifically target those who are vulnerable because of mental-health issues. **If a helper invites an OP to talk privately and gives them a good, supportive experience, they've primed that person to be more vulnerable to abusers.** This sort of cognitive priming tends to be particularly effective when someone's in a state of mental-health crisis, when people rely more on heuristics than critical reasoning.\n\n* If OPs want to talk privately, posting on a wide-open anonymous forum like reddit might not be the best option. Although we don't recommend it, we do allow OPs to request private contact when asking for support. If you want to do this, please keep your expectations realistic, and to have a careful look at the history of anyone who offers to PM before opening up to them.",2365
2,Going back to college at 33 after 3 times of dropping out.,"I've always wanted to go back to school, the cards never aligned. I had too many distractions or no transportation and grew fed up with the bus. This time I have my own car, I've been at my job for 2 years and they offer the flexibility to go back.\n\nI want to prove to myself that I can stick to it and not self sabotage. I ran into someone(and avoided like the plague) at disneyland and began to want change. It feels daunting but if I start with 2 classes I can work from there.",84
3,Crying alone all the time,"All I wan is to be loved. I just want someone to hug me and tell me its gonna be ok. Just a little affection, not too much. Instead I have nothing and noone.",108
4,I genuinely don’t think I’ll ever be able to handle all of the adult responsibilities I have.,"I don’t even drive, didn’t finish my high school degree (can always get my GED though), can’t make it through college, can’t hold down a job. \nI know this is gross but I can barely shower or brush my teeth. \nWashing the dishes, cleaning the house, making my bed are almost impossible tasks. \n\nI’m so jealous of people that can do these things. I feel trapped and think of death often as the only way to escape this. \nMaybe I should try therapy? \nI don’t even know how to be an adult tbh. Like what is adulting? \nI do want to be a nurse (or a PSW at first) in the future \nI cannot leave my house unless I’m with my mom. I think this is due to bullying in school and now I just don’t want to be in the outside world.",60


I'm going to clean the data by removing the post content, as its sentence length distribution differs too much from that of the data the model was trained on. The post's score is also unnecessary.

In the future, I plan to use Cohere's `Summarize` to summarize the content of the post. I can then compare this data to the current dataset and to itself.

I will also remove the first 2 posts as they are introduction posts to the subreddit /r/depression.

In [5]:
df_reddit = df_reddit.drop(['content', 'score'], axis=1)
df_reddit.head()

Unnamed: 0,title
0,"Regular check-in post, with information about our rules and wikis"
1,"Our most-broken and least-understood rules is ""helpers may not invite private contact as a first resort"", so we've made a new wiki to explain it"
2,Going back to college at 33 after 3 times of dropping out.
3,Crying alone all the time
4,I genuinely don’t think I’ll ever be able to handle all of the adult responsibilities I have.


In [6]:
df_reddit = df_reddit.drop([0, 1])
df_reddit.head()

Unnamed: 0,title
2,Going back to college at 33 after 3 times of dropping out.
3,Crying alone all the time
4,I genuinely don’t think I’ll ever be able to handle all of the adult responsibilities I have.
5,I'm not depressed anymore yay
6,"Ever since I started meds, everyone acts like I’m not allowed to have negative feelings."


Exploring the dataset a little further

In [85]:
df_reddit.sample(5)

Unnamed: 0,title
8224,hopeless. sorrowful. and lost.
10369,need to talk to someone..... saw someone that I have issues with.
425,I hate how life isn’t perfect
6375,STOP THE COMPETITVENESS
3242,damn I'm crying


In [86]:
df_reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12454 entries, 2 to 12455
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   12454 non-null  object
dtypes: object(1)
memory usage: 97.4+ KB


In [87]:
# Some post titles are bound to be repeated, with some people sharing the same feelings
# The emotions associated with frequent pieces of text can be analysed further to create relationships between certain keywords and emotions
df_reddit.describe()

Unnamed: 0,title
count,12454
unique,12099
top,I need help
freq,8


In [88]:
# Prepping to create embeddings
[list(df_reddit['title'])[0]]

['Going back to college at 33 after 3 times of dropping out.']

## Getting the embeddings

In [8]:
api_key = ""

co = cohere.Client(api_key)

In [None]:
# Load the pre-trained emotion classifier (only unpickle files from a trusted source)
with open('models/trained_emotion_classifier.pkl', 'rb') as f:
    chain_model = pickle.load(f)

# Use Cohere's 'embed' method to turn the text into vectors the classifier can consume
# NOTE: only one piece of text is passed in due to a limit on API calls
text_embeddings = co.embed(texts=[list(df_reddit['title'])[0]],
                           model='multilingual-22-12',
                           truncate='RIGHT').embeddings

# predict_proba is the scikit-learn method that returns one probability per emotion
# (the trained classifier uses logistic regression)
outputs = torch.as_tensor(chain_model.predict_proba(text_embeddings),
                          dtype=torch.float32)

# Sort the probabilities so the strongest emotions come first; `indices` holds the matching class ids
probas, indices = torch.sort(outputs, descending=True)

# Convert to NumPy arrays and keep the first (only) row. NumPy operates on the CPU;
# the tensors were built from CPU data, so .cpu() is just a safeguard here
probas = probas.cpu().numpy()[0]
indices = indices.cpu().numpy()[0]
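
The cell above embeds a single title because of the API call limit. If more calls were available, one way to cover the whole dataset would be to send the titles in batches and pause between requests. The sketch below is only an illustration under assumptions: `embed_in_batches` is a hypothetical helper, the batch size and pause would need to be chosen from the limits of the actual Cohere plan, and a stand-in function replaces the real API call so the example runs offline.

```python
import time


def embed_in_batches(texts, embed_fn, batch_size, pause_seconds=0.0):
    """Embed texts in fixed-size batches, pausing between requests.

    embed_fn takes a list of texts and returns one embedding per text.
    With the notebook's client it could be something like:
        lambda batch: co.embed(texts=batch, model='multilingual-22-12',
                               truncate='RIGHT').embeddings
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
        # Sleep only if another batch is still to come
        if pause_seconds and start + batch_size < len(texts):
            time.sleep(pause_seconds)
    return embeddings


def fake_embed(batch):
    """Stand-in for the real API call: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

Passing the embedding function in as a parameter keeps the batching logic testable without an API key.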

In [10]:
# Probabilities associated with emotions
# (numbers associated with emotions are shown below, from 0 - 7)
print(probas)
print(indices)

[0.5975665  0.39681658 0.26276317 0.1869488  0.10719531 0.10698901
 0.09613472 0.02223289]
[4 5 7 1 0 3 2 6]


In [12]:
# Copy the first row so the new fields are not written to a view of df_reddit
index_two = df_reddit.iloc[0].copy()

# Assigning text labels to the classified emotions and adding them as new fields on the row
index_two['Anger'] = probas[numpy.where(indices == 0)[0][0]]
index_two['Anticipation'] = probas[numpy.where(indices == 1)[0][0]]
index_two['Disgust'] = probas[numpy.where(indices == 2)[0][0]]
index_two['Fear'] = probas[numpy.where(indices == 3)[0][0]]
index_two['Joy'] = probas[numpy.where(indices == 4)[0][0]]
index_two['Sadness'] = probas[numpy.where(indices == 5)[0][0]]
index_two['Surprise'] = probas[numpy.where(indices == 6)[0][0]]
index_two['Trust'] = probas[numpy.where(indices == 7)[0][0]]

index_two['Top 3 Emotions'] = ["Joy", "Sadness", "Trust"]

index_two

title             Going back to college at 33 after 3 times of dropping out.
Anger                                                               0.107195
Anticipation                                                        0.186949
Disgust                                                             0.096135
Fear                                                                0.106989
Joy                                                                 0.597566
Sadness                                                             0.396817
Surprise                                                            0.022233
Trust                                                               0.262763
Top 3 Emotions                                         [Joy, Sadness, Trust]
Name: 2, dtype: object
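
The top three emotions in the cell above are typed in by hand. They could instead be derived from the sorted indices. The sketch below assumes the class order 0 to 7 matches the label assignments used above (Anger through Trust) and reuses the printed `probas` and `indices` values. `top_emotions` and `emotion_scores` are hypothetical helper names.

```python
# Class order assumed from the label assignments in the cell above
EMOTIONS = ["Anger", "Anticipation", "Disgust", "Fear",
            "Joy", "Sadness", "Surprise", "Trust"]


def top_emotions(indices, k=3):
    """Map the sorted class indices to emotion names and keep the first k."""
    return [EMOTIONS[i] for i in indices[:k]]


def emotion_scores(probas, indices):
    """Pair each sorted probability with the name of its emotion."""
    return {EMOTIONS[i]: p for i, p in zip(indices, probas)}


# Values printed earlier for the first post title
probas = [0.5975665, 0.39681658, 0.26276317, 0.1869488,
          0.10719531, 0.10698901, 0.09613472, 0.02223289]
indices = [4, 5, 7, 1, 0, 3, 2, 6]

print(top_emotions(indices))
print(emotion_scores(probas, indices)["Anger"])
```

The same functions accept the NumPy arrays produced earlier, since they only rely on slicing and iteration.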

You don't need to carry out sentiment analysis to realise that the emotions felt throughout a depressive period aren't just sadness. Sentiment analysis can be very important and useful, but it can also be redundant. Of course, the topic of the inputs, the emotion classes in the training dataset, and the source of the text are all examples of aspects that can change how useful sentiment analysis is.

In this example, the largest probability is associated with "Joy". An interesting point is that "Sadness" is the emotion associated with the second largest probability. I don't think that makes either emotion redundant.

If I were able to predict emotion probabilities for all the datapoints, I would carry out some of the following statistical analyses:
- looking at relationships between emotions
- looking at relationships between keywords and emotions
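
As one possible starting point for the first item, the relationship between two emotions could be measured with a Pearson correlation across titles. The sketch below is illustrative only: the probabilities are made-up numbers for four hypothetical titles, not model output, and `pearson` is a small hand-written helper rather than a library call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up probabilities for four hypothetical titles (NOT real results)
joy = [0.60, 0.10, 0.05, 0.40]
sadness = [0.40, 0.80, 0.90, 0.30]

r = pearson(joy, sadness)
print(round(r, 3))
```

A value near -1 would suggest the two emotions rarely co-occur, while a value near 0 would suggest that titles like the first example, where joy and sadness are both high, are not unusual.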

A study of Twitter posts found that users with depression exhibited distinct tweeting patterns (https://www.psypost.org/2021/12/an-analysis-of-twitter-posts-suggests-that-people-with-depression-show-increased-rumination-on-social-media-overnight-62306). Similar patterns could be explored in the texts and emotions in this dataset.