# Narratives Classification using a Hierarchical Approach

### Pipeline:

1. Modify the current dataset so that each narrative is labelled only with the category of its hassle
2. Train RoBERTa on the category labels
3. Make predictions
4. Compare performance against the old model
5. If there are improvements, modify the dataset again to include the specific hassle within each category (one model must be trained per category)
6. Make the second set of predictions
7. Compare performance against the old model
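
The two-stage idea behind the pipeline above can be sketched as follows. This is only an illustration: `hierarchical_predict`, `toy_category_model`, and `toy_hassle_models` are hypothetical names, the toy functions stand in for the trained RoBERTa models, and the sketch assumes one category and one hassle per narrative.

```python
from typing import Callable, Dict, Tuple


def hierarchical_predict(
    narrative: str,
    predict_category: Callable[[str], str],
    predict_hassle: Dict[str, Callable[[str], str]],
) -> Tuple[str, str]:
    """Route a narrative through a category model, then that category's hassle model."""
    category = predict_category(narrative)          # stage 1: coarse category
    hassle = predict_hassle[category](narrative)    # stage 2: hassle within that category
    return category, hassle


# Toy stand-ins for the trained models (hypothetical, for illustration only).
def toy_category_model(text: str) -> str:
    return 'Academic Hassles' if 'class' in text else 'General hassles'


toy_hassle_models = {
    'Academic Hassles': lambda text: 'Getting late to class',
    'General hassles': lambda text: 'Misplacing or losing things',
}

result = hierarchical_predict(
    'I was late to class again', toy_category_model, toy_hassle_models
)
print(result)
```

One consequence of this design is that a stage 1 mistake cannot be recovered in stage 2, which is why the pipeline checks the category model against the old model before training the per-category models.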


# Section 1: Importing libraries

In [343]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

from transformers import AutoTokenizer
from sklearn.metrics import classification_report, accuracy_score
from pathlib import Path

from NarrativesDataset import NarrativesDataset
from DataModule import NarrativesDataModule
from Model import NarrativesClassifier

# Section 2: Data Cleaning: 1st part

In [344]:
train_path = 'data/train-1.csv'
val_path = 'data/val-1.csv'
num_workers = 4

# Load the training and validation sets, replacing the curly apostrophe in one
# column name so that it matches the key used in the `categories` mapping.
temp_df = pd.read_csv(train_path)
temp_df.rename(columns={"Troubling thoughts about one’s future": "Troubling thoughts about ones future"}, inplace = True)
temp_val_df = pd.read_csv(val_path)
temp_val_df.rename(columns={"Troubling thoughts about one’s future": "Troubling thoughts about ones future"}, inplace = True)


In [345]:
temp_val_df

Unnamed: 0,Narrative,Misplacing or losing things,Silly practical mistakes,Trouble with pets,Difficulties with friends,Regrets over past decision/s,Concerned about the meaning of life,Being lonely,Inability to express oneself,Fear of rejection,...,Side effects of medication,Concerns about health in general,Concerns about bodily functions,Dissatisfaction with academic performance,Challenges with instructors,Discontent with current academic responsibilities,Concerns regarding academic transitions,Difficulties with peers or classmates,Challenges in managing group projects,Getting late to class
0,Today was just one of those days where everyth...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Today was quite a rollercoaster. As a typical ...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Today was just one of those days where everyth...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Today started off with a frustrating experienc...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"So, like, it was just another typical day at D...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,"So, today was one of those days where everythi...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
308,"So, here's the thing. I woke up this morning f...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
309,"So, it was just another typical day for me at ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
310,"So, there I was, rushing through the bustling ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [346]:
categories = {
    'General hassles': ['Misplacing or losing things', 'Silly practical mistakes', 'Trouble with pets', 'Difficulties with friends'],
    'Inner concerns': ['Regrets over past decision/s', 'Concerned about the meaning of life', 'Being lonely', 'Inability to express oneself', 'Fear of rejection', 'Trouble making decisions', 'Physical appearance', 'Not seeing people', "Troubling thoughts about ones future", 'Not enough personal energy', 'Concerns about getting ahead', 'Fear of confrontation', 'Wasting time'],
    'Financial concerns': ['Not enough money for basic necessities (food, clothing, transportation, housing, healthcare etc.)', 'Not enough money for wants (entertainment and recreation)', 'Concerns about owing money', 'Concerns about money for emergencies', 'Financial security'],
    'Time Pressures': ['Not enough time to do things one needs to', 'Too many responsibilities', 'Not getting enough rest', 'Too many interruptions', 'Not enough time for entertainment and recreation', 'Too many meetings', 'Social obligations', 'Concerns about meeting high standards', 'Noise'],
    'Environmental Hassles': ['Pollution', 'Crime', 'Traffic', 'Concerns about news events', 'Rising prices of common goods', 'Concerns about accidents'],
    'Family Hassles': ['Yardwork or outside home maintenance', 'Overloaded with family responsibilities', 'Home maintenance (inside)'],
    'Health Hassles': ['Concerns about medical treatment', 'Physical illness', 'Side effects of medication', 'Concerns about health in general', 'Concerns about bodily functions'],
    'Academic Hassles': ['Dissatisfaction with academic performance', 'Challenges with instructors', 'Discontent with current academic responsibilities', 'Concerns regarding academic transitions', 'Difficulties with peers or classmates', 'Challenges in managing group projects', 'Getting late to class']
}

In [347]:
train_df = pd.DataFrame()
val_df = pd.DataFrame()
train_df['Narrative'] = temp_df['Narrative']
val_df['Narrative'] = temp_val_df['Narrative']

In [348]:
for category, hassles in categories.items():
    train_df[category] = temp_df[hassles].max(axis = 1 )
    val_df[category] = temp_val_df[hassles].max(axis = 1)
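
To see what the `max(axis = 1)` step does, here is a small sketch on made-up data. The three toy narratives and the reduced `toy_categories` mapping are assumptions for illustration and are not taken from the dataset. A category column ends up as 1 whenever at least one of its hassle columns is 1.

```python
import pandas as pd

# Made-up one-hot hassle labels for three narratives.
toy = pd.DataFrame({
    'Narrative': ['lost my keys', 'late to class', 'cat got sick and I lost my keys'],
    'Misplacing or losing things': [1, 0, 1],
    'Trouble with pets': [0, 0, 1],
    'Getting late to class': [0, 1, 0],
})

# Reduced category -> hassles mapping, same shape as `categories`.
toy_categories = {
    'General hassles': ['Misplacing or losing things', 'Trouble with pets'],
    'Academic Hassles': ['Getting late to class'],
}

collapsed = pd.DataFrame({'Narrative': toy['Narrative']})
for category, hassles in toy_categories.items():
    # Row-wise max over 0/1 columns acts as a logical OR.
    collapsed[category] = toy[hassles].max(axis=1)

print(collapsed)
```

Note that the third narrative has two positive hassles in the same category and still maps to a single 1, so the collapsed labels do not record how many hassles were present.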

In [349]:
# Check the number of rows per category
category_names = list(categories)
label_counts = train_df[category_names].sum(axis = 0)
label_counts

General hassles          120
Inner concerns           390
Financial concerns       150
Time Pressures           270
Environmental Hassles    180
Family Hassles            90
Health Hassles           150
Academic Hassles         210
dtype: int64
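
The counts above add up to 1560. If every narrative is meant to carry exactly one category (an assumption, not something the notebook states), that total should equal the number of rows. A quick check could look like the sketch below, shown on made-up data with the hypothetical names `label_columns` and `single_label`.

```python
import pandas as pd

# Made-up category labels for three narratives.
toy = pd.DataFrame({
    'Narrative': ['a', 'b', 'c'],
    'General hassles': [1, 0, 0],
    'Academic Hassles': [0, 1, 1],
})
label_columns = ['General hassles', 'Academic Hassles']

# Number of positive categories per narrative.
labels_per_row = toy[label_columns].sum(axis=1)
single_label = bool((labels_per_row == 1).all())

# Per-category counts, as in the cell above.
counts = toy[label_columns].sum(axis=0)

print(single_label, int(counts.sum()), len(toy))
```

If `single_label` turns out to be false on the real data, the task is multi-label and the category counts will sum to more than the row count.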

In [350]:
directory = 'hierarchical_data/'

#train data
train_filename = 'hierarchical_equal_train-1.csv'
train_filepath = directory + train_filename
#train_df.to_csv(train_filepath, index = False)
# temp_df.to_csv(train_filepath, index = False)

#val data
# val_filename = 'hierarchical_equal_val-1.csv'
# val_filepath = directory + val_filename
# val_df.to_csv(val_filepath, index = False)

# Section 3: Data Cleaning: 2nd part of hierarchical data

In [351]:
train_df

Unnamed: 0,Narrative,General hassles,Inner concerns,Financial concerns,Time Pressures,Environmental Hassles,Family Hassles,Health Hassles,Academic Hassles
0,Today was one of those days where everything s...,1,0,0,0,0,0,0,0
1,Today started off on a rough note as I misplac...,1,0,0,0,0,0,0,0
2,So today started off with me misplacing my key...,1,0,0,0,0,0,0,0
3,Today was a typical day in the life of a busy ...,1,0,0,0,0,0,0,0
4,"So, my day started off like any other day. Rus...",1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
1555,Today was just one of those days where everyth...,0,0,0,0,0,0,0,1
1556,I woke up to the harsh sound of my alarm clock...,0,0,0,0,0,0,0,1
1557,"So there I was, rushing through the crowded st...",0,0,0,0,0,0,0,1
1558,"So, I woke up this morning feeling all groggy ...",0,0,0,0,0,0,0,1


In [352]:
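def initialize_dataframes(categories):
    """Create one empty train and one empty validation dataframe per category."""
    dfs_train = {}
    dfs_val = {}

    for category, columns in categories.items():
        # Each per-category dataframe holds the narrative plus that category's hassles
        columns = ['Narrative'] + columns
        dfs_train[category] = pd.DataFrame(columns=columns)
        dfs_val[category] = pd.DataFrame(columns=columns)

    return dfs_train, dfs_val

def separate_dataframe(category):
    # Placeholder: not implemented in this notebook
    pass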

In [353]:
initialize_dataframes(categories)

({'General hassles': Empty DataFrame
  Columns: [Narrative, Misplacing or losing things, Silly practical mistakes, Trouble with pets, Difficulties with friends]
  Index: [],
  'Inner concerns': Empty DataFrame
  Columns: [Narrative, Regrets over past decision/s, Concerned about the meaning of life, Being lonely, Inability to express oneself, Fear of rejection, Trouble making decisions, Physical appearance, Not seeing people, Troubling thoughts about ones future, Not enough personal energy, Concerns about getting ahead, Fear of confrontation, Wasting time]
  Index: [],
  'Financial concerns': Empty DataFrame
  Columns: [Narrative, Not enough money for basic necessities (food, clothing, transportation, housing, healthcare etc.), Not enough money for wants (entertainment and recreation), Concerns about owing money, Concerns about money for emergencies, Financial security]
  Index: [],
  'Time Pressures': Empty DataFrame
  Columns: [Narrative, Not enough time to do things one needs to, Too

In [354]:
dfs_train, dfs_val = initialize_dataframes(categories)

In [355]:
for category in categories.keys():
    # Columns of this category that are present in the training data
    common_columns = list(set(dfs_train[category].columns).intersection(set(temp_df.columns)))
    print(common_columns)
    dfs_train[category] = pd.concat([dfs_train[category], temp_df[common_columns]], ignore_index=True)
    # Keep only the rows with at least one positive hassle label in this category
    dfs_train[category] = dfs_train[category][(dfs_train[category] == 1).any(axis=1)]

    # Repeat for the validation data, using the validation dataframe's own columns
    common_columns = list(set(dfs_val[category].columns).intersection(set(temp_val_df.columns)))
    dfs_val[category] = pd.concat([dfs_val[category], temp_val_df[common_columns]], ignore_index=True)
    dfs_val[category] = dfs_val[category][(dfs_val[category] == 1).any(axis=1)]


['Silly practical mistakes', 'Misplacing or losing things', 'Narrative', 'Trouble with pets', 'Difficulties with friends']
['Concerned about the meaning of life', 'Not seeing people', 'Physical appearance', 'Fear of confrontation', 'Trouble making decisions', 'Not enough personal energy', 'Regrets over past decision/s', 'Narrative', 'Troubling thoughts about ones future', 'Inability to express oneself', 'Concerns about getting ahead', 'Wasting time', 'Fear of rejection', 'Being lonely']
['Not enough money for wants (entertainment and recreation)', 'Concerns about money for emergencies', 'Financial security', 'Not enough money for basic necessities (food, clothing, transportation, housing, healthcare etc.)', 'Narrative', 'Concerns about owing money']
['Noise', 'Concerns about meeting high standards', 'Not enough time for entertainment and recreation', 'Not enough time to do things one needs to', 'Too many responsibilities', 'Narrative', 'Too many meetings', 'Too many interruptions', 'No
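
The same per-category split can also be written with a boolean mask, which avoids the set intersection and therefore keeps the columns in the order given in `categories`. The sketch below is an alternative, not the notebook's method: `split_by_category` is a hypothetical helper, the toy data is made up, and it resets the index whereas the cell above keeps the original row positions.

```python
import pandas as pd


def split_by_category(df, categories):
    """Return one dataframe per category with the rows positive for any of its hassles."""
    subsets = {}
    for category, hassles in categories.items():
        # True for rows where at least one hassle in this category equals 1.
        mask = df[hassles].eq(1).any(axis=1)
        subsets[category] = df.loc[mask, ['Narrative'] + hassles].reset_index(drop=True)
    return subsets


# Made-up data for illustration.
toy = pd.DataFrame({
    'Narrative': ['lost my keys', 'late to class', 'my dog ran away'],
    'Misplacing or losing things': [1, 0, 0],
    'Trouble with pets': [0, 0, 1],
    'Getting late to class': [0, 1, 0],
})
toy_categories = {
    'General hassles': ['Misplacing or losing things', 'Trouble with pets'],
    'Academic Hassles': ['Getting late to class'],
}

subsets = split_by_category(toy, toy_categories)
print(subsets['General hassles'])
```

Restricting the mask to the hassle columns also avoids comparing the `Narrative` text against 1.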

In [356]:
# File name stems, in the same order as the keys of `categories`
filenames = ['General_hassles', 'Inner_concerns', 'Financial_concerns', 'Time_Pressures', 'Environmental_Hassles', 'Family_Hassles', 'Health_Hassles', 'Academic_Hassles']


train_extension = '_train.csv'
val_extension = '_val.csv'

In [357]:
train_directory = 'hierarchical_data/train_pt2'
val_directory = 'hierarchical_data/val_pt2'
i = 0

In [358]:
# Save the per-category dataframes to CSV files for the training and validation sets.
# enumerate() keeps the index local to the loop, so the cell can be re-run safely.
for i, category in enumerate(categories.keys()):
    train_filepath = Path(train_directory) / (filenames[i] + train_extension)
    dfs_train[category].to_csv(train_filepath, index = False)

    val_filepath = Path(val_directory) / (filenames[i] + val_extension)
    dfs_val[category].to_csv(val_filepath, index = False)
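
`to_csv` raises an error if the target directory does not exist, so it can help to create it first. The sketch below shows this on made-up data inside a temporary directory, so it does not touch `hierarchical_data/`. It also derives each file name from the category key by replacing spaces with underscores, which reproduces the stems in `filenames`. The names `frames` and `out_dir` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up per-category dataframe.
frames = {
    'General hassles': pd.DataFrame({
        'Narrative': ['lost my keys'],
        'Misplacing or losing things': [1],
    }),
}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'train_pt2'
    # Create the directory (and any missing parents) before writing.
    out_dir.mkdir(parents=True, exist_ok=True)

    for category, frame in frames.items():
        # 'General hassles' -> 'General_hassles_train.csv'
        filename = category.replace(' ', '_') + '_train.csv'
        frame.to_csv(out_dir / filename, index=False)

    written = sorted(p.name for p in out_dir.iterdir())
    reloaded = pd.read_csv(out_dir / 'General_hassles_train.csv')

print(written)
```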