Because the full dataset is large, only a small test sample is uploaded for reproducibility. I have included some of the original outputs and images (as markdown and .png files) to better communicate the original workflow.

In [1]:
# Used while building util file
# %load_ext autoreload
# %autoreload 2

In [7]:
# Add the src directory (sibling of Notebooks) to the import path
import sys
sys.path.append(sys.path[0].replace('Notebooks', 'src'))

#/src/cleaning/cleaning_util.py
import cleaning.cleaning_util as clean
#/src/visualizations/viz.py
from visualizations.viz import hist, class_word_count_plots, word_sims_plots

#Pandas preferences
clean.pd.set_option('display.max_rows', 500)
clean.pd.set_option('display.max_columns', 500)
clean.pd.options.mode.chained_assignment = None

# Data Processing

In [8]:
df = clean.load_process_text('../Data/raw/raw_sample.csv')

In [9]:
df.head()

| Unnamed: 0 | created | title | flair | text | edited | ups | down | num_comments | gilded | awards | sub | total_text | tokens | lemmatized | post_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-07-14 15:52:12 | How can I handle the anxiety that job intervie... | Work/School | I understand that job interviews are stressful... | False | 4 | 0 | 10 | 0 | 0 | anxiety | How can I handle the anxiety that job intervie... | (how, can, i, handle, the, anxiety, that, job,... | [handle, anxiety, job, interview, cause, under... | 182 |
| 1 | 2018-11-26 07:14:32 | How I feel like a failure. |  | Hey everyone, longtime watcher first-time sub ... | False | 3 | 0 | 2 | 0 | 0 | adhd | How I feel like a failure.. Hey everyone, long... | (how, i, feel, like, a, failure, .., hey, ever... | [feel, like, failure, hey, longtime, watcher, ... | 183 |
| 2 | 2019-03-14 13:03:59 | I've always needed upcoming excitement |  | All my life I've needed something exciting in ... | False | 4 | 0 | 2 | 0 | 0 | adhd | I've always needed upcoming excitement. All my... | (i, 've, always, needed, upcoming, excitement,... | [need, upcoming, excitement, life, need, excit... | 101 |
| 3 | 2020-05-29 23:52:15 | i’m not sure what to gift myself for my birthd... | :thinking: Thoughts & Ideas | i’ve never much been the type to ask for anyth... | False | 3 | 0 | 6 | 0 | 0 | non_clinical | i’m not sure what to gift myself for my birthd... | (i, ’m, not, sure, what, to, gift, myself, for... | [sure, gift, birthday, type, ask, birthday, im... | 138 |
| 4 | 2019-01-22 21:08:44 | Free Online Therapy Service? | Advice Needed | This might be a long shot but I was wondering ... | False | 1 | 0 | 0 | 0 | 0 | anxiety | Free Online Therapy Service?. This might be a ... | (free, online, therapy, service, ?, ., this, m... | [free, online, therapy, service, long, shot, w... | 157 |


# EDA

While I didn't plan to do any feature engineering on structural aspects of posts (e.g., post lengths), I felt it would be negligent not to examine at least a couple of basic distributions.

## Post distributions

In [28]:
hist(df['post_length'])

![total length](../reports/figures/output_14_0.png)

In [27]:
hist(df[df['post_length'] < 500]['post_length'])

![total length](../reports/figures/output_15_0.png)

It looks like most posts are shorter than 100 words. Such brevity could make classification harder, since short posts give the model little text to work with.

## Word Counts

In [12]:
class_word_counts = clean.class_count_dict(df, ['depression', 'anxiety', 'adhd', 'non_clinical'])

In [26]:
class_word_count_plots(2, 2, class_word_counts)

![class word counts](../reports/figures/output_20_0.png)

# Text feature engineering

I explored a variety of methods for engineering features, nearly all of which were inspired by this paper: https://white.ucc.asn.au/publications/White2015SentVecMeaning.pdf.

They are as follows:
1. Distributed memory paragraph vectors (PV-DM)
2. Distributed bag of words paragraph vectors (PV-DBOW)
3. Mean of word embeddings (MOWE)
4. Term frequency - inverse document frequency (Tf-idf)
5. Idf-weighted MOWE

The PV-DM and PV-DBOW methods were implemented using Gensim's ```Doc2Vec```. Rather than using a pre-trained model, I chose to train custom embeddings. With approximately 135,000 posts, the dataset was reasonably large, and I felt the niche nature of the subreddits could make a valuable contribution to mental-health language models. Conveniently, word embeddings were also learned during this process. The MOWE method averaged the individual word embeddings for each post, with words not in the ```Doc2Vec``` model's vocabulary assigned zero vectors. The tf-idf matrix was constructed using this same vocabulary, and the idf weights were also saved separately to weight the individual word embeddings in the final method.
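
To make the MOWE and idf-weighted MOWE ideas concrete, here is a minimal pure-Python sketch. It is not the code in ```GetVectors```: the function name, the toy embeddings, and the toy idf weights are all hypothetical, and I am assuming that out-of-vocabulary words still count toward the denominator of the mean.

```python
from typing import Dict, List, Optional

Vector = List[float]


def mean_of_word_embeddings(
    tokens: List[str],
    embeddings: Dict[str, Vector],
    dim: int,
    idf: Optional[Dict[str, float]] = None,
) -> Vector:
    """Average the word vectors of one post, optionally weighting each by idf.

    Out-of-vocabulary tokens act as zero vectors: they add nothing to the sum
    but (by assumption) still count toward the number of tokens averaged over.
    """
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # zero vector, contributes nothing to the sum
        weight = idf.get(token, 1.0) if idf is not None else 1.0
        for i, value in enumerate(vector):
            total[i] += weight * value
    if not tokens:
        return total
    return [value / len(tokens) for value in total]


# Hypothetical 2-dimensional embeddings and idf weights, for illustration only
toy_embeddings = {"feel": [1.0, 0.0], "anxious": [0.0, 1.0]}
toy_idf = {"feel": 1.0, "anxious": 3.0}
post = ["feel", "anxious", "asdfgh"]  # the last token is out of vocabulary

mowe = mean_of_word_embeddings(post, toy_embeddings, dim=2)
idf_mowe = mean_of_word_embeddings(post, toy_embeddings, dim=2, idf=toy_idf)
print(mowe)
print(idf_mowe)
```

In the weighted version, the rarer word ("anxious" here) pulls the post vector further in its own direction than the common one does.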

I use two classes for these purposes. The first, ```D2V```, tags the documents and trains a model with default parameters using either the ```PV-DM``` or the ```PV-DBOW``` method. The second, ```GetVectors```, contains several methods for computing the MOWEs, idf-weighted MOWEs, and tf-idf scores.

First, I'll split the data into train/test sets. We don't want any test set leakage to occur while training the embeddings/getting the tf-idf scores.

In [14]:
X_train, X_test, y_train, y_test = clean.train_test_split(df.drop('sub', axis = 1),
                                                                  df['sub'], 
                                                                  test_size = .2,   
                                                                  random_state = 13,  
                                                                  stratify = df['sub'])

## Models & feature extraction

In [15]:
dm = clean.D2V(X_train).tag_docs().model_train()
dbow = clean.D2V(X_train, model_type = 'dbow').tag_docs().model_train()

Now that the models are trained, we have access to all of the ```Doc2Vec``` attributes/methods. As a sanity check, I'm going to run the classic ```King - man + woman = _``` example. The theoretically "correct" answer that we should see for the most similar word is "queen".

In [17]:
# print('PV-DM Model:\n')
# print(dm.wv.most_similar_cosmul(positive=["king", "woman"], negative=["man"], topn=10))
# print('\n')
# print('PV-DBOW Model:\n')
# print(dbow.wv.most_similar_cosmul(positive=["king", "woman"], negative=["man"], topn=10))

```
PV-DM Model:

[('profit', 0.661750853061676), ('prince', 0.6553366780281067), ('washington', 0.6530252695083618), ('atlanta', 0.6519854664802551), ('princess', 0.651204526424408), ('lookin', 0.6439474821090698), ('sopranos', 0.6396740078926086), ('san', 0.6389790177345276), ('brewery', 0.637477457523346), ('percentile', 0.6371206641197205)]


PV-DBOW Model:

[('queen', 0.6674830317497253), ('stephen', 0.6629737615585327), ('broadway', 0.6586296558380127), ('malcolm', 0.6560490131378174), ('lion', 0.6448732018470764), ('harley', 0.6433538198471069), ('j.', 0.6418041586875916), ('gecko', 0.6416754722595215), ('unfunny', 0.6396648287773132), ('prince', 0.637599766254425)]
```
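
For intuition about what ```most_similar_cosmul``` is scoring, here is a small pure-Python sketch of the multiplicative analogy objective (3CosMul) on toy vectors. As I understand it, Gensim shifts cosine similarities into [0, 1] and divides the product over the positive words by the product over the negative words plus a small epsilon; treat those details, the function names, and the hand-made 3-dimensional vectors as assumptions for illustration, not as the library's code.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy_cosmul(
    vectors: Dict[str, Vector],
    positive: List[str],
    negative: List[str],
    topn: int = 3,
    eps: float = 1e-6,
) -> List[Tuple[str, float]]:
    """Rank candidate words by the multiplicative analogy score."""

    def shifted(word: str, term: str) -> float:
        # Map cosine from [-1, 1] to [0, 1] so the products stay non-negative
        return (1 + cosine(vectors[word], vectors[term])) / 2

    query_words = set(positive) | set(negative)
    scores = []
    for word in vectors:
        if word in query_words:
            continue  # never return one of the query words themselves
        numerator = math.prod(shifted(word, term) for term in positive)
        denominator = math.prod(shifted(word, term) for term in negative) + eps
        scores.append((word, numerator / denominator))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors: dimensions loosely mean (royalty, male, female)
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [0.9, 1.0, 0.0],
    "apple": [0.1, 0.1, 0.1],
}

ranked = analogy_cosmul(toy_vectors, positive=["king", "woman"], negative=["man"])
print(ranked)
```

With these toy vectors, "queen" ranks first because it is close to both "king" and "woman" while being far from "man".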

<br>
<br>
The DBOW model is right on the money: it ranks "queen" first, while the DM model doesn't return it in its top 10. Next, I'm going to extract the document vectors from both the DBOW and DM models. Computing MOWE and idf-MOWE embeddings for both models might be overkill, so I'll run a quick logistic regression to compare the two. I'll use these results to decide which model to extract the final sets of embeddings from.

In [18]:
le = clean.LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [21]:
#Creating `GetVectors` object for both models
dm_fit = clean.GetVectors(X_train, X_test, model = dm)
dbow_fit = clean.GetVectors(X_train, X_test, model = dbow)

#PV-DM document vectors
dm_doc_vecs_train, dm_doc_vecs_test = dm_fit.d2v_vecs()

#PV-DBOW document vectors
dbow_doc_vecs_train, dbow_doc_vecs_test = dbow_fit.d2v_vecs()

In [35]:
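# Compare the two sets of document vectors with a quick multinomial logistic regression.
# Note: F1 is computed on the training vectors, so these are training-set scores.
doc_vec_pairs = [('PV-DM', dm_doc_vecs_train), ('PV-DBOW', dbow_doc_vecs_train)]
for name, vecs in doc_vec_pairs:
    lr = clean.LogisticRegression(multi_class = 'multinomial', max_iter = 1000)
    lr.fit(vecs, y_train)
    # f1_score expects (y_true, y_pred)
    print('{} F1 Score: {:.3f}'.format(name, clean.f1_score(y_train, lr.predict(vecs), average = 'macro')))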

```
PV-DM F1 Score: 0.799
PV-DBOW F1 Score: 0.836
```
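
Since the comparison rests on macro-averaged F1, here is a short sketch of how that metric is computed: F1 is calculated per class and then averaged with equal weight, so a small class counts as much as a large one. The function name and the toy labels are hypothetical; in the notebook, the score comes from ```f1_score``` with ```average = 'macro'```.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical encoded labels for three classes
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(toy_true, toy_pred)
print(round(score, 3))
```

Here the per-class scores are 1.0, 2/3, and 0.8, and the macro F1 is their plain average.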

<br>
Based on these training-set scores, I'm going to use the DBOW model for the remainder of the embeddings.

In [23]:
#MOWE vectors
dbow_mowe_train, dbow_mowe_test = dbow_fit.w2v_vecs()
#Tf-idf
tf_fitted = dbow_fit.tfidf_fit()
#Tf-idf matrices
tfidf_train, tfidf_test = tf_fitted.tfidf_transform()
#Idf-MOWE vectors
dbow_idf_weighted_train, dbow_idf_weighted_test = tf_fitted.tf_mowe()

As a quick check, we can view the idf weights of two words: one we expect to be quite common, and one we expect to be rarer. For example, a word like "depressed" should have a lower weight than a word like "zoloft", since it presumably appears in far more posts across the corpus.

In [25]:
print('Depressed: {:.3f}'.format(tf_fitted.words_weights['depressed']))
print('Prescription: {:.3f}'.format(tf_fitted.words_weights['prescription']))

Depressed: 3.902
Prescription: 5.740


<br>

```zoloft``` was used in the original run but does not appear in this corpus sample. The original output on the full corpus was:

```
Depressed: 3.778
Zoloft: 6.209
```
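
To show why the rarer word gets the larger weight, here is a minimal sketch of an idf computation on a toy corpus. I am assuming the smoothed formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn uses by default; the weights in ```words_weights``` may be computed differently, and the function name and toy documents are hypothetical.

```python
import math
from typing import Dict, List


def idf_weights(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq: Dict[str, int] = {}
    for doc in docs:
        for token in set(doc):  # count each token at most once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {
        token: math.log((1 + n_docs) / (1 + freq)) + 1
        for token, freq in doc_freq.items()
    }


# Hypothetical tokenized posts: "depressed" is in all three, "zoloft" in one
toy_docs = [
    ["depressed", "tired"],
    ["depressed", "zoloft"],
    ["depressed", "sleep"],
]
weights = idf_weights(toy_docs)
print('Depressed: {:.3f}'.format(weights['depressed']))
print('Zoloft: {:.3f}'.format(weights['zoloft']))
```

A token that appears in every document gets the minimum weight of 1 under this formula, and the weight grows as the token appears in fewer documents.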

Now, we'll save these before moving on to the modeling. Additionally, I'm saving the actual PV-DBOW model so that, after testing, I can continue training it with the test set added.

In [None]:
# clean.dump(dbow_doc_vecs_train, '../Data/processed/dbow_vecs_train.joblib')
# clean.dump(dbow_doc_vecs_test, '../Data/processed/dbow_vecs_test.joblib')
# clean.dump(dbow_mowe_train, '../Data/processed/dbow_mowe_train.joblib')
# clean.dump(dbow_mowe_test, '../Data/processed/dbow_mowe_test.joblib')
# clean.dump(tfidf_train, '../Data/processed/tfidf_train.joblib')
# clean.dump(tfidf_test, '../Data/processed/tfidf_test.joblib')
# clean.dump(dbow_idf_weighted_train, '../Data/processed/mowe_idf_train.joblib')
# clean.dump(dbow_idf_weighted_test, '../Data/processed/mowe_idf_test.joblib')

# Visualizations

Something really cool that we can do with Gensim is get word similarity scores. A good test of how well the embeddings capture meaning is to examine medication words: similar medications should have similar vectors.

*Unfortunately, these words don't show up in the test sample, and the Doc2Vec model was too big to upload in a zipped folder. If you want to play around with the embeddings, I'm happy to send them on request!*

## Medications

Adderall = ADHD medication<br>
Zoloft = antidepressant<br>
Xanax = anxiety medication<br>

In [None]:
# dbow.wv.most_similar('adderall', topn = 5)

```
[('vyvanse', 0.7835362553596497),
 ('xr', 0.7615510821342468),
 ('ritalin', 0.757441520690918),
 ('concerta', 0.7297625541687012),
 ('med', 0.7136313915252686)]
```

In [None]:
# dbow.wv.most_similar('zoloft', topn = 5)

```
[('lexapro', 0.602796196937561),
 ('mg', 0.589096188545227),
 ('medication', 0.5872026681900024),
 ('wellbutrin', 0.5750789046287537),
 ('sertraline', 0.5705499649047852)]
```

In [None]:
# dbow.wv.most_similar('xanax', topn = 5)

```
[('klonopin', 0.5459912419319153),
 ('benzo', 0.49367237091064453),
 ('prescribe', 0.48809340596199036),
 ('ativan', 0.4806302785873413),
 ('pill', 0.47574320435523987)]
```

---
Visualizing two of these: adderall and xanax.

In [None]:
# word_sims_plots(1, 2, ['adderall', 'xanax'], dbow, 5, ['#B13000', '#FF7E4E'])

![medication similarities](../reports/figures/output_53_0.png)