As in the ```Cleaning``` notebook, test data samples are used to demonstrate functionality. Markdown cells from the original output are also included.

Four different feature sets were modeled separately:
1. Doc2Vec
2. Word2Vec averages
3. Tf-idf
4. Word2Vec idf-weighted averages
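
To make feature sets 2 and 4 concrete, here is a minimal sketch of averaging word vectors for one document, with and without idf weighting. The toy vectors, the smoothed idf formula, and the function names are assumptions for illustration only; the project's own implementation lives in the util files and is not reproduced here.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Smoothed idf per token: ln((1 + N) / (1 + df)) + 1 (an assumed formula)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def mean_embedding(tokens, vectors, weights=None):
    """Average the vectors of known tokens; weight by idf when weights are given."""
    dim = len(next(iter(vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary tokens
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        total = [t + w * v for t, v in zip(total, vectors[tok])]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

# Toy 2-dimensional "word vectors" (hypothetical values, not trained embeddings)
vectors = {'sleep': [1.0, 0.0], 'the': [0.0, 1.0]}
docs = [['the', 'sleep'], ['the'], ['the']]
idf = idf_weights(docs)

plain = mean_embedding(docs[0], vectors)          # every word counts equally
weighted = mean_embedding(docs[0], vectors, idf)  # the rarer word 'sleep' counts more
```

In the weighted version, the document vector is pulled towards the rarer word, which is the intended effect of idf weighting.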

In [10]:
# Used while building util file
# %load_ext autoreload
# %autoreload 2

In [3]:
#Adding path to util
import sys
# Append the src directory rather than overwriting an existing sys.path entry
sys.path.append(sys.path[0].replace('Notebooks', 'src'))

#/src/modeling/modeling_util.py
import modeling.modeling_util as model
#/src/visualizations/viz.py
from visualizations.viz import confusion_heatmap, model_comparison

#Pandas preferences
model.pd.set_option('display.max_rows', 500)
model.pd.set_option('display.max_columns', 500)
model.pd.options.mode.chained_assignment = None

---

In [15]:
tfidf_train, tfidf_test, \
dbow_vecs_train, dbow_vecs_test, \
mowe_train, mowe_test, \
mowe_idf_train, mowe_idf_test, \
y_train, y_test = model.load_data()

---

Need to encode the y-labels numerically.

In [16]:
y_train, y_test = model.label_encode(y_train, y_test)
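
`model.label_encode` is defined in the util file and is not shown here. As a rough sketch of the idea, assuming the mapping is learned from the training labels only and then applied to both splits (the function name and label lists below are hypothetical):

```python
def encode_labels(y_train, y_test):
    """Map string labels to integers using the classes seen in the training set."""
    classes = sorted(set(y_train))
    to_int = {label: i for i, label in enumerate(classes)}
    # A KeyError here would mean the test set contains a label unseen in training
    return [to_int[y] for y in y_train], [to_int[y] for y in y_test], classes

train_labels = ['Depression', 'ADHD', 'Non-clinical', 'Anxiety', 'ADHD']
test_labels = ['Anxiety', 'Depression']
y_tr, y_te, classes = encode_labels(train_labels, test_labels)
```

Keeping the `classes` list around makes it possible to translate integer predictions back into readable labels later.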

To get a baseline for each set of features, I'm going to start with a simple train/validation split and evaluate a few models on each. Ideally, I'd like to run 5-fold cross-validation for every combination, but at this early stage of modeling the computational cost outweighs the benefit.

In [65]:
models = {'Logistic Regression':model.LogisticRegression(multi_class = 'multinomial', max_iter = 1000),
          'KNN': model.KNeighborsClassifier(),
          'SVM': model.LinearSVC(max_iter = 5000),
          'Random Forest': model.RandomForestClassifier()}
training_features = {'PV-DBOW': dbow_vecs_train,
                     'TFIDF':tfidf_train, 
                     'MOWE': mowe_train, 
                     'IDF-MOWE': mowe_idf_train}
initial_evaluations = model.train_val_df(training_features, y_train, models)

```
6.25% complete
12.50% complete
18.75% complete
25.00% complete
31.25% complete
37.50% complete
43.75% complete
50.00% complete
56.25% complete
62.50% complete
68.75% complete
75.00% complete
81.25% complete
87.50% complete
93.75% complete
100.00% complete
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Data</th>
      <th>Model</th>
      <th>F1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>5</th>
      <td>TFIDF</td>
      <td>Logistic Regression</td>
      <td>0.842</td>
    </tr>
    <tr>
      <th>7</th>
      <td>TFIDF</td>
      <td>SVM</td>
      <td>0.836</td>
    </tr>
    <tr>
      <th>1</th>
      <td>PV-DBOW</td>
      <td>Logistic Regression</td>
      <td>0.831</td>
    </tr>
    <tr>
      <th>3</th>
      <td>PV-DBOW</td>
      <td>SVM</td>
      <td>0.830</td>
    </tr>
    <tr>
      <th>8</th>
      <td>TFIDF</td>
      <td>Random Forest</td>
      <td>0.809</td>
    </tr>
    <tr>
      <th>9</th>
      <td>MOWE</td>
      <td>Logistic Regression</td>
      <td>0.808</td>
    </tr>
    <tr>
      <th>11</th>
      <td>MOWE</td>
      <td>SVM</td>
      <td>0.808</td>
    </tr>
    <tr>
      <th>13</th>
      <td>IDF-MOWE</td>
      <td>Logistic Regression</td>
      <td>0.773</td>
    </tr>
    <tr>
      <th>15</th>
      <td>IDF-MOWE</td>
      <td>SVM</td>
      <td>0.773</td>
    </tr>
    <tr>
      <th>12</th>
      <td>MOWE</td>
      <td>Random Forest</td>
      <td>0.739</td>
    </tr>
    <tr>
      <th>4</th>
      <td>PV-DBOW</td>
      <td>Random Forest</td>
      <td>0.692</td>
    </tr>
    <tr>
      <th>16</th>
      <td>IDF-MOWE</td>
      <td>Random Forest</td>
      <td>0.692</td>
    </tr>
    <tr>
      <th>10</th>
      <td>MOWE</td>
      <td>KNN</td>
      <td>0.673</td>
    </tr>
    <tr>
      <th>14</th>
      <td>IDF-MOWE</td>
      <td>KNN</td>
      <td>0.626</td>
    </tr>
    <tr>
      <th>6</th>
      <td>TFIDF</td>
      <td>KNN</td>
      <td>0.613</td>
    </tr>
    <tr>
      <th>2</th>
      <td>PV-DBOW</td>
      <td>KNN</td>
      <td>0.559</td>
    </tr>
  </tbody>
</table>
</div>
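
The helper `model.train_val_df` is not shown in this notebook. The sketch below is one way such a baseline comparison could be written, assuming a stratified 80/20 split and macro F1 as the metric; the function name, split size, and synthetic data are all assumptions rather than the project's actual code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def train_val_scores(feature_sets, y, models, seed=0):
    """Fit every model on every feature set; score macro F1 on a held-out split."""
    rows = []
    for data_name, X in feature_sets.items():
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed)
        for model_name, estimator in models.items():
            estimator.fit(X_tr, y_tr)
            score = f1_score(y_val, estimator.predict(X_val), average='macro')
            rows.append({'Data': data_name, 'Model': model_name, 'F1': round(score, 3)})
    return pd.DataFrame(rows).sort_values('F1', ascending=False)

# Synthetic stand-in for the real feature matrices
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
demo = train_val_scores({'all': X, 'first 10': X[:, :10]}, y,
                        {'Logistic Regression': LogisticRegression(max_iter=1000),
                         'KNN': KNeighborsClassifier()})
```

Each row of `demo` corresponds to one (feature set, model) pair, sorted from best to worst, which mirrors the layout of the table above.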


<br>
Tf-idf and the document vectors (PV-DBOW) are the two best-performing sets of features, with logistic regression leading the pack among the models. SVM is a close second, and I'd like to see how it continues to perform. This initial evaluation only included a single training/validation split, so a difference this small warrants further inspection. Evaluating over more folds will add evidence of generalizability, and it will be interesting to see whether these results hold up under closer scrutiny.

In [23]:
models = {'Logistic Regression':model.LogisticRegression(multi_class = 'multinomial', max_iter = 1000),
          'SVM': model.LinearSVC(max_iter = 5000)}

training_features = {'PV-DBOW': dbow_vecs_train,
                     'TFIDF':tfidf_train}

cv_eval = model.kfold_df(training_features, y_train, models)

```
25.00% complete
50.00% complete
75.00% complete
100.00% complete
```


<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Data</th>
      <th>Model</th>
      <th>F1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>3</th>
      <td>TFIDF</td>
      <td>Logistic Regression</td>
      <td>0.843</td>
    </tr>
    <tr>
      <th>4</th>
      <td>TFIDF</td>
      <td>SVM</td>
      <td>0.838</td>
    </tr>
    <tr>
      <th>1</th>
      <td>PV-DBOW</td>
      <td>Logistic Regression</td>
      <td>0.832</td>
    </tr>
    <tr>
      <th>2</th>
      <td>PV-DBOW</td>
      <td>SVM</td>
      <td>0.831</td>
    </tr>
  </tbody>
</table>
</div>
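
For reference, a k-fold comparison like the one above can be sketched with scikit-learn's `cross_val_score`. This is not the body of `model.kfold_df`, which is not shown; the number of folds (5), the stratified shuffling, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for one of the real feature matrices
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Stratified folds keep the class balance roughly equal in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_means = {}
for name, estimator in {'Logistic Regression': LogisticRegression(max_iter=1000),
                        'SVM': LinearSVC(max_iter=5000)}.items():
    scores = cross_val_score(estimator, X, y, cv=folds, scoring='f1_macro')
    cv_means[name] = scores.mean()  # average macro F1 across the 5 folds
```

Reporting the mean across folds smooths out the luck of any single split, which is the point of moving beyond the single train/validation evaluation.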


The logistic models are more parsimonious and take substantially less time to train. From an engineering perspective, this is far preferable in production, and fortunately we are not sacrificing any performance here either. The tf-idf set as a whole also outperforms PV-DBOW, albeit marginally.
<br>
<br>
I decided to move forward with tf-idf over PV-DBOW. To be honest, I was biased towards the document vectors; they are much more interesting to me theoretically, and the idea of custom embeddings excites me. Alas, tf-idf had a slight edge in performance and a massive edge in a production setting. The end-game application I imagine for this model would be real-time evaluation of mental health status. In a domain like telehealth -- and dealing with certain psychopathologies -- minutes matter, and this particular model is the quickest of all current choices. Additionally, the document vectors would require far more maintenance and training than tf-idf. 

Now, all that is left is some hyperparameter tuning to try to squeeze out any last performance boosts.

In [27]:
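# Search the penalty type and the inverse regularization strength C, scored by macro F1.
# Note: in scikit-learn releases where lbfgs is the default solver, only l2 is supported,
# so the l1 candidates fail to fit and score NaN unless a solver such as 'saga' or
# 'liblinear' is set on the estimator.
param_grid = [{'penalty': ['l1', 'l2'], 
               'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
             ]
tfidf_logreg_grid = model.GridSearchCV(model.LogisticRegression(max_iter = 1000), 
                                 param_grid, 
                                 scoring = 'f1_macro', 
                                 n_jobs = -1)

# Fit on the training set; best_score_ is the mean cross-validated macro F1
tfidf_logreg_grid.fit(tfidf_train, y_train)
final_model = tfidf_logreg_grid.best_estimator_
print('Best F1 Score: {:.3f}'.format(tfidf_logreg_grid.best_score_))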

```
Best F1 Score: 0.843
```
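
When a grid mixes `l1` and `l2` penalties, it is worth checking that every candidate actually fit, since not every solver supports every penalty. The sketch below is a small illustration on synthetic data, assuming the `saga` solver (which accepts both penalties) and a reduced grid; it is not the tuning run reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tf-idf training matrix
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

param_grid = {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(solver='saga', max_iter=5000),
                    param_grid, scoring='f1_macro', cv=5)
grid.fit(X, y)

# Candidates whose fits failed are recorded as NaN in cv_results_
n_failed = int(np.isnan(grid.cv_results_['mean_test_score']).sum())
```

A non-zero `n_failed` would indicate that part of the grid was silently skipped rather than compared.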

In [28]:
final_model = tfidf_logreg_grid.best_estimator_

<br>
<br>
On to the final predictions!

In [31]:
preds = final_model.predict(tfidf_test)
print('Test Set F1 Score: {:.3f}'.format(model.f1_score(y_test, preds, average = 'macro')))

```
Test Set F1 Score: 0.848
```
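
Macro F1, the metric used throughout, is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. A small hand-rolled sketch on toy labels (not project data) shows the arithmetic:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    per_class = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)  # per-class F1: 0.5, 0.8, 2/3
```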

---

Overall, I am quite pleased with the results. Mental health is something that I'm quite passionate about, and it's the domain that sparked my interest in data science. I really believe that machine learning (especially NLP) has spectacular implications for mental healthcare, and I based this project on one such application I imagined.

The model performed nicely given the difficult nature of the problem. Especially with depression and anxiety, it can be hard to differentiate between disorders, as well as to carve out the decision boundary between non-clinical and clinical thresholds. While I cannot say with certainty that every individual in the clinical subreddits does indeed meet clinical criteria, these are promising results that I would be eager to see on validated populations!

# Visualizations

In [63]:
# model_comparison(initial_evaluations)

![initial evals](../reports/figures/initial_model_evals.png)

In [61]:
# confusion_heatmap(y_test, preds, ['ADHD', 'Anxiety', 'Depression', 'Non-clinical'])

![confusion matrix](../reports/figures/new_heatmap.png)
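
The counts behind a confusion heatmap like the one above can be tallied in a few lines. This sketch is not the `confusion_heatmap` function from `viz.py`, which is not shown; the function name and the toy label arrays are assumptions, and only the class names come from the call above.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

labels = ['ADHD', 'Anxiety', 'Depression', 'Non-clinical']
y_true = [0, 1, 1, 2, 2, 3]  # toy values, not project data
y_pred = [0, 1, 2, 2, 1, 3]
matrix = confusion_counts(y_true, y_pred, len(labels))
```

Off-diagonal cells show which classes get mistaken for each other; in this toy example the only confusion is between Anxiety and Depression.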