# Final Project: Text Analysis and Uncovering Demographic or Regional Differences in Family Caregiver Experiences

### *May Khine, Andrea Robang, Mark Lekina*

## Visualization notebook

### Load packages

In [1]:
## helpful packages
import pandas as pd
import numpy as np
import random
import re
import nltk
import spacy
import en_core_web_sm
import gensim
import scipy as sp
from scipy.special import logsumexp

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

## nltk imports
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk import pos_tag
from nltk.stem.snowball import SnowballStemmer

# libraries for data visualization
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

## load pipeline
nlp = en_core_web_sm.load()

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)

### Load data

In [2]:
# load datasets
dartmouth_data = pd.read_excel("../data/DartmouthDataSet.xlsx")
start_feis_data = pd.read_excel("../data/STARTFEISData.xlsx")

### Subset to relevant columns

In [3]:
# define columns of interest in each dataset
# Dartmouth dataset
cols_interest_dartmouth = ['Local ID', 'Region', 'County', 'Date Enrolled in START',
                 'Date of birth', 'Gender', 'Race', 'Ethnicity', 'Level of Intellectual Disability']

# START FEIS dataset
cols_interest_start = ['Respondent ID #  (SIRS Local ID)', 'Start Date', 'End Date',
                       'Was there any particular service that your\nfamily member needed that was not available?',
                       'If yes, please describe the service.',
                       'What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?']

# subset to relevant columns in each dataset
demo_df = dartmouth_data[cols_interest_dartmouth].copy()
resp_df = start_feis_data[cols_interest_start].copy()

### Merge relevant columns from both datasets

In [4]:
# merge demographic data from dartmouth_data with the responses from start_feis_data into a common dataframe
combined_data = pd.merge(demo_df, resp_df, how="inner", 
                              left_on='Local ID', right_on='Respondent ID #  (SIRS Local ID)')

# define the two open response columns
open_cols = ['If yes, please describe the service.', 
        'What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?']

# print the difference in dataframe shapes
demo_df.shape
resp_df.shape
combined_data.shape

# print the head of the resulting dataframe
combined_data.head()

(4986, 9)

(1940, 6)

(1097, 15)

Unnamed: 0,Local ID,Region,County,Date Enrolled in START,Date of birth,Gender,Race,Ethnicity,Level of Intellectual Disability,Respondent ID # (SIRS Local ID),Start Date,End Date,Was there any particular service that your\nfamily member needed that was not available?,"If yes, please describe the service.",What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?
0,8008815,California : CA START East Bay,Alameda,2020-12-29,2006-11-13,Male,Other: Mexican,Hispanic - specific origin not specified,Mild,8008815,2021-01-14 16:59:09,2021-01-14 17:13:45,No,,
1,6570649,California : CA START San Andreas,Santa Cruz,2020-12-29,1999-02-07,Female,"Unknown, not collected",Hispanic - specific origin not specified,Severe,6570649,2021-01-05 13:35:02,2021-01-05 13:42:08,Yes,A counselor was not and has not been made available for the last six months.,"â€œPlease be aware of her conditions and diagnosis, so many professionals are unfamiliar with the medical history of Citlalli. It is discouraging when professionals do not know Citlalli, but make recommendations for her. Also, it is discouraging when the professionals do not take the opinions of the family seriously.â€"
2,434021,New York : Region 3,Orange,2020-12-28,2001-12-24,Female,White,Not of Hispanic origin,Borderline,434021,2020-12-28 11:49:28,2020-12-28 11:55:53,Yes,In-home behavior support,
3,6580618,California : CA START San Andreas,Santa Cruz,2020-12-23,2004-12-21,Male,White,Not of Hispanic origin,Moderate,6580618,2021-01-22 11:42:45,2021-01-22 11:50:28,Yes,"""After Trevorâ€™s psychiatrist left the office, the office also stopped taking his insurance and as a result, Trevor went without a psychiatrist for a while. Trevorâ€™s family tried their best to get him in with other psychiatrists, but struggled to find one that would treat Trevor. Through SARC, Trevor was referred to Hope Services and will begin seeing a psychiatrist there on 1.27.21.""",Declined to answer/did not know.
4,354280,New York : Region 3,Warren,2020-12-21,2005-09-04,Male,"Unknown, not collected","Unknown, not collected",Mild,354280,2021-02-16 16:02:49,2021-02-16 16:38:02,Yes,"At home off hour support on phone or in person/respite, have removed for the night for safety reasons.","Listen to the parents, take what parents report seriously, and provide tips, not just call the cops, have options/walk parent through it."
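
The inner merge above reduces 4,986 demographic rows and 1,940 survey rows to 1,097 combined rows. One way to audit where rows are lost is an outer merge with pandas' `indicator=True` option. The sketch below runs on small made-up frames; the names `demo`, `resp`, and `audit`, the shortened column name `Respondent ID`, and all values are illustrative assumptions rather than the project data.

```python
import pandas as pd

# made-up stand-ins for the demographic and survey tables
demo = pd.DataFrame({"Local ID": [1, 2, 3, 4],
                     "Region": ["A", "B", "C", "D"]})
resp = pd.DataFrame({"Respondent ID": [2, 3, 3, 5],
                     "answer": ["x", "y", "z", "w"]})

# an outer merge keeps every row; the `_merge` column records where each row came from
audit = pd.merge(demo, resp, how="outer",
                 left_on="Local ID", right_on="Respondent ID",
                 indicator=True)
counts = audit["_merge"].value_counts()
print(counts)

# rows tagged "both" are exactly the rows an inner merge would return
inner_rows = audit[audit["_merge"] == "both"]
print(len(inner_rows))
```

In this toy case ID 3 answered the survey twice, so it contributes two merged rows. The same can happen in the real data, which means the merged row count is not necessarily the number of distinct respondents.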


### Topic modeling on the combined dataset

**Question 1**: What themes and keywords are most prevalent in responses from individuals with IDD and their families that highlight outstanding mental health needs and services?

**Response**: We will perform topic modeling on the two free response columns of the combined dataframe and determine the prevalent topics, as well as the most frequent words in each topic.

### Define a function to preprocess the responses in the relevant columns

In [5]:
def process_step(string):
    # Get stopwords
    list_stopwords = stopwords.words("english")

    # Add custom stopwords (if present)
    custom_words_toadd = []  # e.g. ["need", "needs", "family", "families", "service", "services", "provide", "provides"]
    list_stopwords_new = list_stopwords + custom_words_toadd

    # Initialize stemmer
    porter = PorterStemmer()
    
    # perform necessary steps and return preprocessed text (or a blank string on error)
    try:
        nostop_listing = [word for word in wordpunct_tokenize(string)
                          if word not in list_stopwords_new]
        clean_listing = [porter.stem(word) for word in nostop_listing
                         if word.isalpha() and len(word) > 3]    # Can change length of word here
#         clean_listing = [word for word in clean_listing if word not in ["need", "servic", "famili", "provid"]]
        clean_listing_str = " ".join(clean_listing)   
        
        return clean_listing_str
    except TypeError:
        # non-string input (e.g. NaN for a missing response) cannot be tokenized
        return ""

### Apply `process_step` to the open response columns and return new columns for the processed text

In [6]:
# preprocess the responses in each open response column
for col in open_cols:
    combined_data["processed_{col}".format(col=col)] = [process_step(response) for response in combined_data[col]]
    
# print the head of the resulting columns
rel_cols = open_cols + ["processed_{col}".format(col=col) for col in open_cols]
combined_data[rel_cols].head()

Unnamed: 0,"If yes, please describe the service.",What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?,"processed_If yes, please describe the service.",processed_What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?
0,,,,
1,A counselor was not and has not been made available for the last six months.,"â€œPlease be aware of her conditions and diagnosis, so many professionals are unfamiliar with the medical history of Citlalli. It is discouraging when professionals do not know Citlalli, but make recommendations for her. Also, it is discouraging when the professionals do not take the opinions of the family seriously.â€",counselor made avail last month,œpleas awar condit diagnosi mani profession unfamiliar medic histori citlal discourag profession know citlal make recommend also discourag profession take opinion famili serious
2,In-home behavior support,,home behavior support,
3,"""After Trevorâ€™s psychiatrist left the office, the office also stopped taking his insurance and as a result, Trevor went without a psychiatrist for a while. Trevorâ€™s family tried their best to get him in with other psychiatrists, but struggled to find one that would treat Trevor. Through SARC, Trevor was referred to Hope Services and will begin seeing a psychiatrist there on 1.27.21.""",Declined to answer/did not know.,after trevorâ psychiatrist left offic offic also stop take insur result trevor went without psychiatrist trevorâ famili tri best psychiatrist struggl find would treat trevor through sarc trevor refer hope servic begin see psychiatrist,declin answer know
4,"At home off hour support on phone or in person/respite, have removed for the night for safety reasons.","Listen to the parents, take what parents report seriously, and provide tips, not just call the cops, have options/walk parent through it.",home hour support phone person respit remov night safeti reason,listen parent take parent report serious provid tip call cop option walk parent
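
Two details are visible in the processed text above. Capitalized stopwords such as "After" and "Through" survive (they appear as `after` and `through`), because the stopword check runs before the stemmer lowercases each token. Mis-encoded quote characters also leak into tokens such as `trevorâ` and `œpleas`. The sketch below shows one possible normalization step to run before tokenizing. It uses only the standard library; the `MOJIBAKE` mapping, the tiny `STOPWORDS_DEMO` set, and the helper names are assumptions for illustration, and the regular expression mirrors the pattern that `wordpunct_tokenize` is based on. Stemming and the word-length filter from `process_step` are left out to keep the example short.

```python
import re

# assumed mapping from mis-encoded sequences seen in the responses to plain quotes;
# longer sequences come first so the bare prefix is only replaced last
MOJIBAKE = {"â€™": "'", "â€œ": '"', "â€": '"'}

# small stand-in for the NLTK English stopword list (hypothetical subset)
STOPWORDS_DEMO = {"after", "the", "his", "a"}

def normalize(text):
    """Repair known mojibake sequences and lowercase the text."""
    for bad, good in MOJIBAKE.items():
        text = text.replace(bad, good)
    return text.lower()

def tokens_without_stopwords(text, stop_set):
    """Split into word / punctuation runs, keep alphabetic non-stopwords."""
    return [tok for tok in re.findall(r"\w+|[^\w\s]+", text)
            if tok.isalpha() and tok not in stop_set]

raw = "After Trevorâ€™s psychiatrist left the office"

# without normalization: "After" is kept and the name carries a stray "â"
case_sensitive = tokens_without_stopwords(raw, STOPWORDS_DEMO)
print(case_sensitive)

# with normalization: the stopword is removed and the name is clean
normalized = tokens_without_stopwords(normalize(raw), STOPWORDS_DEMO)
print(normalized)
```

Whether to normalize this way is a design choice: lowercasing before the stopword check removes more function words, which would change the topics reported below.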


### Define a function to perform topic modeling

In [7]:
def topic_model(col, num_topics):
    # compute parameters using helper function
    params = set_model_params(col)
    text_raw_dict = params[1]
    corpus_fromdict = params[2]
    
    # train the model
    ldamod = gensim.models.ldamodel.LdaModel(corpus_fromdict, 
                                    num_topics = num_topics, id2word=text_raw_dict, 
                                    passes=20, alpha = 'auto',
                                    per_word_topics = True, random_state = 0)

    # return the trained model
    return ldamod

def set_model_params(col):
    # tokenize each response
    text_raw_tokens = [wordpunct_tokenize(text) for text in col]

    # use gensim to create a dictionary of all unique words across documents
    text_raw_dict = corpora.Dictionary(text_raw_tokens)

    # apply dictionary to tokenized texts
    corpus_fromdict = [text_raw_dict.doc2bow(text) for text in text_raw_tokens]
    
    # return parameters
    return (text_raw_tokens, text_raw_dict, corpus_fromdict)
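
As a rough illustration of what `set_model_params` produces, the sketch below builds a token-to-id mapping and bag-of-words vectors in plain Python. It is only an analogue of `corpora.Dictionary` and `doc2bow`, not gensim's implementation (gensim may assign different ids), and the names `vocab`, `to_bow`, and `docs` are assumptions. It also shows that an empty response yields an empty bag, which `set_model_params` passes to the model unchanged, whereas `prep_topicdata` filters such rows out.

```python
import re
from collections import Counter

def tokenize(text):
    # same word / punctuation split that wordpunct_tokenize is based on
    return re.findall(r"\w+|[^\w\s]+", text)

# three processed responses taken from the output above; the second one is empty
docs = ["home behavior support", "", "listen parent take parent report"]
tokens = [tokenize(doc) for doc in docs]

# assign each unique token an integer id in order of first appearance
vocab = {}
for doc in tokens:
    for tok in doc:
        vocab.setdefault(tok, len(vocab))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[tok] for tok in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in tokens]
print(bows)

# empty responses carry no information for the topic model and can be dropped
non_empty = [bow for bow in bows if bow]
print(len(bows), len(non_empty))
```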

In [8]:
# define columns on which to perform topic modelling
preprocessed_cols = ["processed_{col}".format(col=col) for col in open_cols]
preprocessed_names = ["response_service", "response_advice"]

preprocessed_dict = dict(zip(preprocessed_names, preprocessed_cols))
preprocessed_dict

{'response_service': 'processed_If yes, please describe the service.',
 'response_advice': 'processed_What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?'}

### Define function to visualize the topics given a column of a dataframe

In [9]:
# fit topic models on one column and return the top-words table and the pyLDAvis visualization
def visualize_topics(col):
    lda_model = topic_model(col, num_topics=7)
    tokenized_text, text_raw_dict, corpus_fromdict = set_model_params(col)
    
    # tabulate the topics and respective top words
    # (note: topic_model_rhea fits its own model with passes=6, so the table
    # may not match the pyLDAvis model, which is trained with passes=20)
    table = topic_model_rhea(tokenized_text, text_raw_dict, corpus_fromdict, topic_num=7)

    # visualize the topics
    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim_models.prepare(lda_model, corpus_fromdict, text_raw_dict)
    
    # return the table and the visualization
    return table, vis

### Functions to help tabulate the data
>Originally in 01_topic_modeling.ipynb 

In [10]:
def prep_topicdata(df):
    # filter out empty strings, then drop words that appear in too few or too many documents
    df = df[df.proc_advice != ""].copy()
    tokenized_text = [wordpunct_tokenize(one_text) for one_text in 
                                      df.proc_advice]
    text_proc_dict = corpora.Dictionary(tokenized_text)
    # no_below is an absolute document count; no_above is a fraction of the corpus
    text_proc_dict.filter_extremes(no_below = round(df.shape[0]*0.02),
                             no_above = 0.98)
    
    corpus_fromdict_proc = [text_proc_dict.doc2bow(one_text) 
                   for one_text in tokenized_text]

    return tokenized_text, text_proc_dict, corpus_fromdict_proc

def topic_model_rhea(tokenized_text, text_proc_dict, corpus_fromdict_proc, topic_num):
        
    # create model
    ldamod = gensim.models.ldamodel.LdaModel(corpus_fromdict_proc, 
                                         num_topics = topic_num, id2word=text_proc_dict, 
                                         passes=6, alpha = 'auto',
                                        per_word_topics = True, random_state = 0)

    # get topics using frex values
    frex_words = create_topic_word_data(ldamod, corpus_fromdict_proc, frex_w=0.7)

    # get top 10 words per topic and collect in a list
    frex_topwords = frex_words.sort_values(['frex'], axis=0, ascending = False).groupby('topic').head(10)    
    frex_wordlist = frex_topwords.groupby('topic').agg({'word': [lambda x: list(x)]})
    frex_word_df = pd.DataFrame(frex_wordlist)
    frex_word_df.columns = ["top_words"]
    return(frex_word_df)

import scipy.stats  # ensure the stats submodule is loaded so that sp.stats resolves

def ecdf(arr):
    """Calculate the ECDF values for all elements in a 1D array."""
    return sp.stats.rankdata(arr, method='max') / arr.size


def frex(mod, w=0.7):
    """Calculate FREX for all words in a topic model.

    See R STM package for details.

    """
    log_beta = np.log(mod.get_topics())
    log_exclusivity = log_beta - logsumexp(log_beta, axis=0)
    exclusivity_ecdf = np.apply_along_axis(ecdf, 1, log_exclusivity)
    freq_ecdf = np.apply_along_axis(ecdf, 1, log_beta)
    out = 1. / (w / exclusivity_ecdf + (1 - w) / freq_ecdf)
    return out

def create_topic_word_data(mod, corpus, frex_w=0.7):
    """Create data frame with topic-word information.

    Parameters
    ------------
    mod: :class:`gensim.models.LdaModel`
        Fitted LDA model
    corpus: list
        Corpus in the same input format as required by
        :class:`gensim.models.LdaModel`.
    frex_w: float
        Weight to use in FREX calculations.
        
    Returns
    ---------
    :class:`pandas.DataFrame`
        Data frame with the following columns.


        :word: ``str``. Word
        :word_id: ``int``. Word identifier
        :topic: ``int``. Topic number
        :prob: ``float``. Probability of the word conditional on a topic.
        :frex: ``float``. FREX score

    """
    id2word = mod.id2word
    term_topics = pd.DataFrame(mod.get_topics())
    term_topics.index.name = "topic"
    term_topics.reset_index(inplace=True)
    words = pd.DataFrame({
        'word_id': list(id2word.keys()),
        'word': list(id2word.values())
        }, columns=("word_id", "word"))
    term_topics = pd.melt(term_topics, id_vars=["topic"],
                          var_name="word_id",
                          value_name="prob").\
        merge(words, left_on="word_id", right_on="word_id")
    term_topics['frex'] = np.ravel(frex(mod, w=frex_w), order="F")
    return term_topics


def compute_coherence_values(corpus_fromdict_proc, tokenized_text, text_proc_dict):
    # candidate numbers of topics: 2 through 9 (range excludes the upper limit)
    limit = 10
    start = 2
    step = 1
    coherence_values = []
    model_list = []
    
    # create models for each number of topics and coherence model values
    for num in range(start, limit, step):
        ldamod = gensim.models.ldamodel.LdaModel(corpus_fromdict_proc, 
                                                 num_topics = num, id2word=text_proc_dict, 
                                                 passes=20, alpha = 'auto',
                                                 per_word_topics = True, random_state = 0)    
        model_list.append(ldamod)
        coherencemodel = CoherenceModel(model=ldamod, texts=tokenized_text, dictionary=text_proc_dict, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())


    # Show graph
    x = range(start, limit, step)
    print(list(x))
    print(coherence_values)

    plt.plot(x, coherence_values)
    plt.xlabel("Num Topics")
    plt.ylabel("Coherence score")
    # pass the label as a list; a bare string would be split into characters
    plt.legend(["coherence_values"], loc='best')
    plt.show()

    return model_list, coherence_values
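
The `frex` function above ranks words by a weighted harmonic mean of two within-topic ECDF values: how exclusive a word is to the topic, and how frequent it is in the topic. The sketch below reproduces that calculation with NumPy only, on a made-up 2-topic, 3-word matrix. The `beta` values and the names `ecdf_row` and `frex_toy` are assumptions for illustration. It works on probabilities rather than log-probabilities, which gives the same ranks because the logarithm is monotonic.

```python
import numpy as np

def ecdf_row(row):
    # share of entries in the row that are <= each entry (ties get the highest rank)
    return np.array([(row <= value).sum() for value in row]) / row.size

def frex_toy(beta, w=0.7):
    """FREX scores for a topics-by-words probability matrix."""
    # exclusivity: a word's probability in one topic relative to all topics
    exclusivity = beta / beta.sum(axis=0)
    excl_ecdf = np.apply_along_axis(ecdf_row, 1, exclusivity)
    freq_ecdf = np.apply_along_axis(ecdf_row, 1, beta)
    # weighted harmonic mean of the two ECDF values
    return 1.0 / (w / excl_ecdf + (1 - w) / freq_ecdf)

# made-up topic-word probabilities: word 0 is common in both topics,
# word 1 is mostly in topic 0, word 2 is mostly in topic 1
beta = np.array([[0.5, 0.4, 0.1],
                 [0.5, 0.1, 0.4]])

scores = frex_toy(beta)
print(np.round(scores, 3))

# the most probable word is the shared word 0 in both topics,
# while the top FREX word is the word that is specific to each topic
top_by_prob = beta.argmax(axis=1)
top_by_frex = scores.argmax(axis=1)
print(top_by_prob, top_by_frex)
```

With `w=0.7` exclusivity dominates, which is why the tables of top words below tend to surface rare, topic-specific stems rather than the most common ones.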

### Visualize topic modeling on the preprocessed columns for the entire corpus

In [15]:
from IPython.display import display

for col_name in preprocessed_dict:
    col = preprocessed_dict[col_name]
    table, vis = visualize_topics(combined_data[col])
    
    # define pathnames for the tables and graphs
    path = '../output/topic_model_vis/{col_name}/whole_corpus/model_{col_name}'.format(col_name=col_name)
    graph_path = path + ".html"
    table_path = path + ".csv"
    
    print("visualizing topics and top words in {col_name} for complete corpus ...".format(col_name=col_name))
    print("see '{path}' if pyLDAvis graph is not displayed\n".format(path=path))
    
    # display the table and save it to .csv file
    print(table, '\n')
    table.to_csv(table_path)
    
    # visualize and save the output
    # (a bare expression inside a loop is not rendered, so display() is called explicitly)
    display(pyLDAvis.display(vis))
    pyLDAvis.save_html(vis, graph_path)

visualizing topics and top words in response_service for complete corpus ...
see '../output/topic_model_vis/response_service/whole_corpus/model_response_service' if pyLDAvis graph is not displayed

                                                                                     top_words
topic                                                                                         
0            [psychiatri, appoint, life, psychiatr, state, coach, sonâ, genet, syndrom, tutor]
1           [speech, therapi, respons, parent, occup, camp, hour, overnight, clarif, diagnost]
2               [intervent, more, mobil, autism, divers, plan, crisi, everyth, transport, med]
3      [afterschool, neurolog, olymp, describ, sign, american, identifi, lika, engag, austist]
4                  [school, psychiatrist, take, set, time, after, blood, sure, trevor, doctor]
5               [hous, function, escal, thing, with, doe, bigger, disord, agit, neurobehaivor]
6         [habilit, residenti, commun, loc

visualizing topics and top words in response_advice for complete corpus ...
see '../output/topic_model_vis/response_advice/whole_corpus/model_response_advice' if pyLDAvis graph is not displayed

                                                                                     top_words
topic                                                                                         
0      [profession, self, rural, appoint, citlal, discourag, histori, habit, with, acknowledg]
1                 [keep, meet, daughter, dysregul, need, behavior, have, regard, list, proper]
2       [social, say, unknown, psychiatr, therapist, fragil, insist, nurs, welcom, practition]
3                   [none, go, sure, figur, first, structur, child, everyth, baselin, coverag]
4             [earli, advic, live, choos, respit, experienc, disabl, environ, medicaid, skill]
5                   [answer, call, even, personnel, kind, wiiht, outcom, polic, enrol, period]
6                [continu, health, follow, co

### **Question 2**: How do these themes and keywords vary with racial groups, age, and nature of disability over the course of the pandemic?

**Response**: We repeat the same procedure, but also examine the correlation between the demographics and topic proportions/top topics.

### Subset the data by demographic

In [12]:
# define demographics of interest
demographics_cols = ['Gender', 'Level of Intellectual Disability', 'Race']

# specify the different categories in each demographic
# in the case of 'Race', responses from patients who identified as White made up a huge proportion,
# therefore we will compare them with the remainder of the responses (see cell below)
demographics_subsets = [['Male', 'Female'],
                        ['Mild', 'Moderate', 'Severe'], 
                        ['White']]

# combine them in a dictionary
demographics_dict = dict(zip(demographics_cols, demographics_subsets))

demographics_dict

{'Gender': ['Male', 'Female'],
 'Level of Intellectual Disability': ['Mild', 'Moderate', 'Severe'],
 'Race': ['White']}
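
The White-versus-remainder comparison described in the comments above comes down to a pair of boolean masks. The sketch below uses a made-up frame (the `toy` rows and the mask names are assumptions; only the `Race` column name and the category labels come from the data shown earlier). It illustrates that a plain `!= 'White'` test also selects rows where race is missing or recorded as "Unknown, not collected", and shows a stricter alternative in case those rows should be excluded.

```python
import pandas as pd

# made-up rows covering the cases of interest
toy = pd.DataFrame({
    "Race": ["White", "Other: Mexican", None, "Unknown, not collected", "White"],
    "text": ["a", "b", "c", "d", "e"],
})

# simple complement: everything that is not exactly 'White',
# which includes missing values and the 'Unknown, not collected' label
not_white = toy["Race"] != "White"
print(not_white.tolist())

# stricter complement: a race was recorded and it is not 'White'
known_not_white = (toy["Race"].notna()
                   & ~toy["Race"].isin(["White", "Unknown, not collected"]))
print(known_not_white.tolist())

# either mask can be used to select the response column for a subset
print(toy.loc[known_not_white, "text"].tolist())
```

Which definition is appropriate depends on the research question; the simple complement keeps more responses, while the stricter one avoids mixing unknown race into the comparison group.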

In [13]:
advice_col_1 = 'processed_If yes, please describe the service.'
advice_col_2 = ('processed_What\nadvice would you give to service planners regarding the mental'
              ' health service\nneeds of persons with IDD and their families?')

### Perform topic modeling on the preprocessed columns for each subset

In [14]:
# loop through each response
for col_name in preprocessed_dict:
    response_col = preprocessed_dict[col_name]
    
    # loop through each demographic
    for demo in demographics_dict:
        # loop through each subset
        for subset in demographics_dict[demo]:
            # visualize topics
            col = combined_data[combined_data[demo] == subset][response_col]
            table, vis = visualize_topics(col)
    
            # define pathnames for the tables and graphs
            # display outputs and save them to outputs folder
            path = "../output/topic_model_vis/{col_name}/by_demographic/{demo}/model_{subset}".format(demo=demo,
                                                                                                           subset=subset,
                                                                                                           col_name=col_name)
            
            graph_path = path + ".html"
            table_path = path + ".csv"
            
            print("visualizing topics and top words in {col_name} for subset: {demo} == {subset} ...".format(demo=demo,
                                                                                                             subset=subset,
                                                                                                             col_name=col_name))
            print("(see {path} to display pyLDAvis graph)\n".format(path=path))
            
            # display the table
            print(table, '\n')
            table.to_csv(table_path)

            # uncomment to display graph
            # pyLDAvis.display(vis)
            pyLDAvis.save_html(vis, graph_path)
        
# add race subset (excluding responses from White patients)
# loop through each response
for col_name in preprocessed_dict:
    response_col = preprocessed_dict[col_name]
    
    # specify subset and create visualization
    col = combined_data[combined_data['Race'] != 'White'][response_col]
    table, vis = visualize_topics(col)

    # save output to html and .csv file
    path = "../output/topic_model_vis/{col_name}/by_demographic/Race/model_not_White".format(col_name=col_name)
    graph_path = path + ".html"
    table_path = path + ".csv"

    print("visualizing topics and top words in {col_name} for subset: Race != White ...".format(col_name=col_name))
    print("(see {path} to display pyLDAvis graph)\n".format(path=path))
    # uncomment to display graph
    # pyLDAvis.display(vis)
    pyLDAvis.save_html(vis, graph_path)
    
    # display the table
    print(table, '\n')
    table.to_csv(table_path)

visualizing topics and top words in response_service for subset: Gender == Male ...
(see ../output/topic_model_vis/response_service/by_demographic/Gender/model_Male to display pyLDAvis graph)

                                                                                                  top_words
topic                                                                                                      
0      [mental, placement, health, psychiatrist, rehabilit, psychiatr, syndrom, better, life, psychologist]
1                          [crisi, intervent, respons, speech, occup, therapi, comm, prevent, anger, mobil]
2                           [psychiatri, blood, comprehens, neurolog, get, medicaid, sure, take, go, vocat]
3                             [habilit, counsel, care, autism, unit, inpati, teenag, progress, front, half]
4                                 [parent, program, work, train, hillsid, med, much, divers, great, mirror]
5                        [aid, build, outpati, regu

visualizing topics and top words in response_advice for subset: Level of Intellectual Disability == Mild ...
(see ../output/topic_model_vis/response_advice/by_demographic/Level of Intellectual Disability/model_Mild to display pyLDAvis graph)

                                                                                        top_words
topic                                                                                            
0          [situat, meet, report, present, obtain, background, psych, evalu, closest, underestim]
1      [appoint, simpli, enough, prescrib, doctor, psychiatrist, patricia, regard, someth, reach]
2                [experienc, continu, often, import, program, sever, mani, œwrongâ, truli, versu]
3              [none, live, expand, explain, gaurdian, sinc, environ, contact, figueora, weekend]
4                        [specif, taunt, turn, go, guard, structur, donâ, supervis, crise, adequ]
5               [there, dysregul, medic, high, behavior, system, covera