**About Dataset**

**Context**

This repository contains the datasets for classifying stress in text-based social media articles from Reddit. The datasets were created for the paper "Stress Detection from Social Media: New Dataset Benchmark and Analytical Study".

*Dataset overview*


We construct four high-quality datasets from text articles on Reddit and Twitter. Each article has a class label of '0' or '1', where '0' marks a Stress Negative article and '1' marks a Stress Positive article. Annotation was done using an automated DNN-based strategy described in the aforementioned study.

The datasets are described below:


*Reddit Title*: Consists of titles from the articles collected from both stress and non-stress related subreddits from Reddit.

*Reddit Combi*: Consists of title and body text combined together to form a single text sequence, collected from both stress and non-stress related subreddits from Reddit.

In [4]:
from google.colab import drive

# Mount Google Drive in Colab to retrieve the dataset
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import pandas as pd
import numpy as np

USECOLS = ['Body_Title', 'label']
df_reddit = pd.read_csv("/content/drive/MyDrive/archive/Reddit_Combi.csv", sep=';', usecols=USECOLS)  # load the dataset, stored as a semicolon-separated CSV
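
As a minimal sketch of what the call above does, the snippet below reads a small in-memory semicolon-delimited sample instead of the real file. The sample rows and the `extra` column are hypothetical and only illustrate how `sep` and `usecols` behave.

```python
import io
import pandas as pd

# Hypothetical sample in a similar layout: semicolon-separated, with one column we do not need.
sample_csv = io.StringIO(
    "extra;Body_Title;label\n"
    "a;I feel overwhelmed at work;1\n"
    "b;Had a relaxing weekend;0\n"
)

# sep=';' sets the delimiter; usecols keeps only the listed columns.
sample_df = pd.read_csv(sample_csv, sep=';', usecols=['Body_Title', 'label'])
print(sample_df.shape)
```

Only the two requested columns survive, which is why `df_reddit.shape` later reports two columns.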



**EDA (Exploratory Data Analysis)**

In [7]:
df_reddit.head()

Unnamed: 0,Body_Title,label
0,Envy to other is swallowing me Im from develop...,1
1,Nothin outta the ordinary. Paradise. Job stres...,1
2,Almost 49 and the chasm of emptiness has never...,1
3,I’m happy again After my closest friend left m...,0
4,Is it possible to recover from such a traumati...,1


In [8]:
df_reddit.isnull()

Unnamed: 0,Body_Title,label
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
3118,False,False
3119,False,False
3120,False,False
3121,False,False
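
The full boolean mask is hard to scan for a frame of this size. A more compact check, sketched here on a hypothetical toy frame rather than on `df_reddit`, is to sum the mask per column.

```python
import pandas as pd

# Hypothetical frame with one missing value, standing in for df_reddit.
toy = pd.DataFrame({'Body_Title': ['a', None, 'c'], 'label': [1, 0, 1]})

# Summing the boolean mask counts the missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Applied to the real data, the same call would be `df_reddit.isnull().sum()`.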


In [10]:
df_reddit.shape

(3123, 2)

In [11]:
df_reddit.describe()

In [12]:
df_reddit.info()

In [13]:

df_reddit.nunique()

Body_Title    3123
label            2
dtype: int64

**Data Visualization**

In [16]:
from plotly import express
express.pie(data_frame=df_reddit, names='label', color='label') #pie plot

*Our classes are unbalanced.*
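
The imbalance can also be read off as numbers rather than from the pie chart. The sketch below uses a made-up label series (not the real class shares) to show the idea.

```python
import pandas as pd

# Hypothetical labels: 1 = Stress Positive, 0 = Stress Negative.
labels = pd.Series([1, 1, 1, 1, 1, 1, 1, 0], name='label')

# normalize=True returns class shares instead of raw counts.
shares = labels.value_counts(normalize=True)
print(shares)
```

On the real data the equivalent call would be `df_reddit['label'].value_counts(normalize=True)`.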

In [15]:
express.histogram(x=df_reddit['Body_Title'].str.len(), log_y=True) #histogram


Let's first try to transform our documents into vectors we can use for classification. We need to first turn our documents into a gensim corpus.

In [17]:
from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import strip_multiple_whitespaces
from gensim.parsing.preprocessing import strip_numeric
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import strip_short
from gensim.parsing.preprocessing import strip_tags
CUSTOM_FILTERS = [lambda x: x.lower(),
                  remove_stopwords,
                  strip_multiple_whitespaces,
                  strip_numeric,
                  strip_punctuation,
                  strip_short,
                  strip_tags,
                 ]
documents = df_reddit['Body_Title'].values.tolist()
texts = [preprocess_string(s=document, filters=CUSTOM_FILTERS) for document in documents]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
print(dictionary)

Dictionary<14372 unique tokens: ['afford', 'age', 'beetwen', 'better', 'big']...>
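
To make the bag-of-words format concrete, here is a standard-library sketch of the kind of structure `doc2bow` returns: each document becomes a list of `(token_id, count)` pairs. This is an illustration on made-up tokens, not gensim's implementation, and gensim may assign ids in a different order.

```python
from collections import Counter

# Two already-tokenized toy documents (made up for illustration).
toy_texts = [['stress', 'work', 'stress'], ['work', 'holiday']]

# Assign each distinct token an integer id, in order of first appearance.
token2id = {}
for tokens in toy_texts:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

# Represent each document as sorted (token_id, count) pairs.
toy_corpus = [sorted(Counter(token2id[t] for t in tokens).items()) for tokens in toy_texts]
print(toy_corpus)
```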


Arrow is a Python library that simplifies the handling of dates, times, and timestamps. It offers a sensible and human-friendly approach to creating, manipulating, formatting, and converting temporal data. Here we use it only to time the training steps.

In [18]:
pip install arrow

Collecting arrow
  Downloading arrow-1.3.0-py3-none-any.whl (66 kB)
Collecting types-python-dateutil>=2.8.10 (from arrow)
  Downloading types_python_dateutil-2.9.0.20240316-py3-none-any.whl (9.7 kB)
Installing collected packages: types-python-dateutil, arrow
Successfully installed arrow-1.3.0 types-python-dateutil-2.9.0.20240316


Now we train our Doc2Vec model.

In [19]:
from arrow import now
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
doc2vec_start = now()
doc2vec_model = Doc2Vec(vector_size=100, min_count=20, epochs=40)
# Tag each document with its row index so the learned vectors line up with df_reddit.
# Note: `corpus` holds bag-of-words (token_id, count) pairs rather than token lists;
# Doc2Vec is normally fed token lists such as `texts`. The results below were produced with `corpus`.
corpus_iterable = [TaggedDocument(item, [index]) for index, item in enumerate(corpus)]
doc2vec_model.build_vocab(corpus_iterable=corpus_iterable)
doc2vec_model.train(corpus_iterable=corpus_iterable, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)
df_reddit['vectors'] = doc2vec_model.dv.vectors.tolist()
print('doc2vec training time: {}'.format(now() - doc2vec_start))

doc2vec training time: 0:00:22.445607


Let's use dimensionality reduction to see whether our document vectors contain a signal that will be easy for a model to find.

UMAP (Uniform Manifold Approximation and Projection) is a powerful dimensionality reduction technique used for visualizing high-dimensional data in a lower-dimensional space. It is particularly useful for preserving both local and global structures of the data.

In [20]:
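# Note: the PyPI package `umap` is not the UMAP library; `umap-learn` (installed in the next cell) provides `from umap import UMAP`.
pip install umap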

Collecting umap
  Downloading umap-0.1.1.tar.gz (3.2 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: umap
  Building wheel for umap (setup.py) ... done
  Created wheel for umap: filename=umap-0.1.1-py3-none-any.whl size=3543 sha256=9a982fdb9f393c5024557e3161ee05a4a695d3c841c6a4baec1e7c943a88558e
  Stored in directory: /root/.cache/pip/wheels/15/f1/28/53dcf7a309118ed35d810a5f9cb995217800f3f269ab5771cb
Successfully built umap
Installing collected packages: umap
Successfully installed umap-0.1.1


In [21]:
!pip install umap-learn

Collecting umap-learn
  Downloading umap_learn-0.5.6-py3-none-any.whl (85 kB)
Collecting pynndescent>=0.5 (from umap-learn)
  Downloading pynndescent-0.5.12-py3-none-any.whl (56 kB)
Installing collected packages: pynndescent, umap-learn
Successfully installed pynndescent-0.5.12 umap-learn-0.5.6


In [22]:
from umap import UMAP

doc2vec_umap_start = now()
doc2vec_umap_model = UMAP(n_components=2, random_state=2024, verbose=1, init='pca', n_jobs=1)
df_reddit[['x', 'y']] = doc2vec_umap_model.fit_transform(X=df_reddit['vectors'].apply(func=pd.Series),)
df_reddit['short document'] = df_reddit['Body_Title'].str[:80]
print('doc2vec umap time: {}'.format(now() - doc2vec_umap_start))

UMAP(init='pca', n_jobs=1, random_state=2024, verbose=1)
Sat Apr  6 14:58:15 2024 Construct fuzzy simplicial set
Sat Apr  6 14:58:26 2024 Finding Nearest Neighbors
Sat Apr  6 14:58:32 2024 Finished Nearest Neighbor Search
Sat Apr  6 14:58:35 2024 Construct embedding



	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Sat Apr  6 14:58:44 2024 Finished embedding
doc2vec umap time: 0:00:31.467244


In [23]:
express.scatter(data_frame=df_reddit, x='x', y='y', color='label', height=800, hover_name='short document')

Doc2vec does not appear to cluster our documents according to their labels.

In [24]:
import arrow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_reddit['vectors'].apply(func=pd.Series), df_reddit['label'], test_size=0.25, random_state=2024, stratify=df_reddit['label'])

time_start = arrow.now()
regression = LogisticRegression(max_iter=100000, tol=1e-12).fit(X=X_train, y=y_train)
print('model fit in {} iterations took {}'.format(regression.n_iter_[0], arrow.now() - time_start))

print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=regression.predict(X=X_test))))
print('model done in {}'.format(now() - time_start))


model fit in 59 iterations took 0:00:00.121352
accuracy: 0.8848
model done in 0:00:00.132128


An accuracy of nearly 0.9 seems encouraging, but our classes are so unbalanced that a dummy model labeling every document 1 would reach almost the same accuracy (686 of the 781 test documents, about 0.88, are labeled 1). Let's look at the classification report.
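
The dummy baseline can be checked directly. The sketch below builds synthetic labels with the same class counts as the test split in this notebook (95 zeros, 686 ones) and scores scikit-learn's `DummyClassifier`. The feature matrix is a placeholder, since the dummy model ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels matching the test split's class counts: 95 zeros, 686 ones.
y = np.array([0] * 95 + [1] * 686)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

# 'most_frequent' always predicts the majority class (here, 1).
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
dummy_accuracy = accuracy_score(y, dummy.predict(X))
print('dummy accuracy: {:5.4f}'.format(dummy_accuracy))
```

Any real model should be judged against this baseline rather than against 0.5.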

In [25]:
from sklearn.metrics import classification_report

print(classification_report(y_true=y_test, y_pred=regression.predict(X=X_test)))

              precision    recall  f1-score   support

           0       0.62      0.14      0.22        95
           1       0.89      0.99      0.94       686

    accuracy                           0.88       781
   macro avg       0.76      0.56      0.58       781
weighted avg       0.86      0.88      0.85       781
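
To see why the macro and weighted averages differ so much, the arithmetic can be reproduced from the rounded per-class F1 scores and supports in the report above. This is a sketch of the averaging only; scikit-learn works from unrounded values, so other rows may differ in the last digit.

```python
# Per-class F1 and support copied from the report above.
f1 = {0: 0.22, 1: 0.94}
support = {0: 95, 1: 686}

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())
print('macro F1: {:.2f}, weighted F1: {:.2f}'.format(macro_f1, weighted_f1))
```

The macro average exposes the weak minority class, while the weighted average is dominated by the 686 documents labeled 1.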



In [26]:
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

gauss = GaussianProcessClassifier(1.0 * RBF(1.0), random_state=2024)
gauss.fit(X=X_train, y=y_train)

print(classification_report(y_true=y_test, y_pred=gauss.predict(X=X_test)))

              precision    recall  f1-score   support

           0       0.62      0.19      0.29        95
           1       0.90      0.98      0.94       686

    accuracy                           0.89       781
   macro avg       0.76      0.59      0.61       781
weighted avg       0.86      0.89      0.86       781



Our Gaussian Process model does only marginally better: recall on class 0 is still just 0.19.

Let's try BERT embeddings.

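BERT (Bidirectional Encoder Representations from Transformers) is a language representation model that has had a large impact on natural language processing (NLP). Here we obtain document embeddings through KeyBERT, using the `all-MiniLM-L12-v2` sentence-embedding model.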

In [27]:
%env TOKENIZERS_PARALLELISM=false
!pip install --quiet keybert
print('pip install keybert complete.')

env: TOKENIZERS_PARALLELISM=false

In [28]:
from arrow import now
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_DF = 1.0
MIN_DF = 4
MODEL = 'all-MiniLM-L12-v2'
STOP_WORDS = 'english'
DOCS = df_reddit['Body_Title'].values.tolist()

model_start = now()
bert = KeyBERT(model=MODEL,)
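# Note: this sets an attribute on the KeyBERT wrapper; it likely does not change the
# underlying embedding model's maximum sequence length.
bert.max_seq_length = 512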
vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=STOP_WORDS, min_df=MIN_DF, max_df=MAX_DF, )
document_embeddings, word_embeddings = bert.extract_embeddings(docs=DOCS, vectorizer=vectorizer, )
print('embedding time: {}'.format(now() - model_start))
print('we have {} documents and {} words.'.format(len(document_embeddings), len(word_embeddings)))



embedding time: 0:06:33.427515
we have 3123 documents and 4663 words.


In [29]:
df_reddit['embedding'] = document_embeddings.tolist()
embedding_umap_model = UMAP(n_components=2, random_state=2024, verbose=1, init='pca', n_jobs=1)
df_reddit[['ex', 'ey']] = embedding_umap_model.fit_transform(X=df_reddit['embedding'].apply(func=pd.Series),)
express.scatter(data_frame=df_reddit, x='ex', y='ey', color='label', height=800, hover_name='short document')

UMAP(init='pca', n_jobs=1, random_state=2024, verbose=1)
Sat Apr  6 15:38:18 2024 Construct fuzzy simplicial set
Sat Apr  6 15:38:34 2024 Finding Nearest Neighbors
Sat Apr  6 15:38:35 2024 Finished Nearest Neighbor Search
Sat Apr  6 15:38:35 2024 Construct embedding



	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Sat Apr  6 15:38:44 2024 Finished embedding


In [30]:
import arrow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Xe_train, Xe_test, ye_train, ye_test = train_test_split(df_reddit['embedding'].apply(func=pd.Series), df_reddit['label'], test_size=0.25, random_state=2024, stratify=df_reddit['label'])

time_start = arrow.now()
embedding_regression = LogisticRegression(max_iter=100000, tol=1e-12).fit(X=Xe_train, y=ye_train)
print('model fit in {} iterations took {}'.format(embedding_regression.n_iter_[0], arrow.now() - time_start))

print('accuracy: {:5.4f}'.format(accuracy_score(y_true=ye_test, y_pred=embedding_regression.predict(X=Xe_test))))
print('model done in {}'.format(now() - time_start))

model fit in 25 iterations took 0:00:00.280040
accuracy: 0.9449
model done in 0:00:00.304798


In [31]:
print(classification_report(y_true=ye_test, y_pred=embedding_regression.predict(X=Xe_test)))

              precision    recall  f1-score   support

           0       0.92      0.60      0.73        95
           1       0.95      0.99      0.97       686

    accuracy                           0.94       781
   macro avg       0.93      0.80      0.85       781
weighted avg       0.94      0.94      0.94       781



In [32]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(alpha=1.0, max_iter=100000, random_state=2024).fit(X=Xe_train, y=ye_train)
print('score: {:5.4f}'.format(mlp.score(X=Xe_test, y=ye_test)))

print(classification_report(y_true=ye_test, y_pred=mlp.predict(X=Xe_test)))

score: 0.9462
              precision    recall  f1-score   support

           0       0.92      0.61      0.73        95
           1       0.95      0.99      0.97       686

    accuracy                           0.95       781
   macro avg       0.93      0.80      0.85       781
weighted avg       0.95      0.95      0.94       781

