# Feature Engineering

This thesis project analyzes the linguistic features of therapists' responses in an online forum in order to predict engagement levels.

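The following linguistic features are extracted from the text answers in the dataset:

   - Use of Modal Verbs: Identify and count modal verbs (such as could, should, would).
   - Concreteness: Score the words of each answer against published concreteness ratings.
   - Readability Scores: Formulas such as Flesch Reading Ease, Flesch-Kincaid Grade Level, and the Gunning Fog Index can be used to assess the readability of the text.
   - Clause Density: Count clause-level dependencies in each answer.
   - T-Unit Analysis: Count minimal terminable units (a main clause with all its subordinate clauses).
   - Present Indicative Verbs: Count present-tense verb forms.
   - Self-Referential Sentences: Count sentences that contain a first-person singular pronoun.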

In [12]:
import pandas as pd

# Load the dataset
file_path = 'data/counsel_df_processed.csv'
counsel_df = pd.read_csv(file_path)
counsel_df

Unnamed: 0,level_0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement
0,0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High
1,1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,Medium
2,2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,Medium
3,3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,Medium
4,4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High
...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low
2611,2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low
2612,2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low
2613,2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low


In [13]:
counsel_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2615 entries, 0 to 2614
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   level_0              2615 non-null   int64  
 1   index                2615 non-null   int64  
 2   ID                   2615 non-null   int64  
 3   topic                2615 non-null   object 
 4   question             2615 non-null   object 
 5   answer               2615 non-null   object 
 6   upvotes              2615 non-null   int64  
 7   views                2615 non-null   int64  
 8   upvotes_scaled       2615 non-null   float64
 9   views_scaled         2615 non-null   float64
 10  weighted_engagement  2615 non-null   float64
 11  engagement           2615 non-null   object 
dtypes: float64(3), int64(5), object(4)
memory usage: 245.3+ KB


## Text Normalization

The following text normalization steps are the most relevant and are applied as needed for each feature extraction process:
1. Case Folding: Ensures uniformity in the text. It is especially important for consistent word counts and lexical diversity measures.
2. Tokenization: Fundamental for almost all types of text analysis. It is necessary for calculating lexical diversity, sentiment analysis, and syntactic complexity.
3. Removing Punctuation and Special Characters: Generally useful, but punctuation should be retained for sentiment analysis (it can convey sentiment) and for readability scores (sentence boundaries depend on it).
4. Stop Word Removal: Depends on the feature. For lexical diversity, stop words should be kept because they contribute to the overall diversity of the text. For sentiment analysis, removing them can be considered. Readability formulas are defined on the full text, so stop words are kept there.
5. Spelling Correction: Optional and dependent on the quality of the text. If the dataset has many typos it may be worth doing, but it is not usually critical for the analyses performed here.
6. Lemmatization: Preferable to stemming because it returns valid dictionary forms and so retains the meaning of words. Useful for lexical diversity and sentiment analysis.
7. Handling Negations: Important for sentiment analysis to accurately capture the sentiment of the text.
8. Identifying Part-of-Speech (POS): Necessary for counting modal verbs, personal pronouns, and understanding sentence structure for syntactic complexity.
9. Sentence Segmentation: Crucial for syntactic complexity analysis (like average sentence length) and also useful for certain readability metrics.
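
As a rough illustration of steps 1, 2, 3, and 9, the sketch below segments a text into sentences and lowercased word tokens using only the standard library. The function name `normalize` and the sample sentence are made up for this example, and the regular expressions are a simplification; the notebook itself relies on NLTK and spaCy for these steps.

```python
import re

def normalize(text):
    """Split text into sentences, then into lowercased word tokens."""
    # Sentence segmentation: naive split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Case folding + tokenization: keep runs of letters and apostrophes,
    # which also drops punctuation and digits.
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences]

# Made-up sample text, not taken from the dataset.
sample = "I hear you. It's okay to ask for help!"
print(normalize(sample))
```

A naive splitter like this breaks on abbreviations such as "e.g.", which is one reason a trained sentence segmenter is preferable for real data.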

In [14]:
from nltk.tokenize import sent_tokenize, word_tokenize

### Use of Modal Verbs
Count the frequency of modal verbs in the text. Modal verbs include words like "can," "could," "may," "might," "must," "shall," "should," "will," "would."

In [15]:
def count_modal_verbs(text):
    modal_verbs = {'can', 'could', 'may', 'might', 'must', 'shall', 'should', 'will', 'would'}
    words = word_tokenize(text.lower())
    return sum(word in modal_verbs for word in words)

In [16]:
# Apply the functions to each row in the 'answer' column
counsel_df['modal_verbs'] = counsel_df['answer'].apply(count_modal_verbs)

### Concreteness

Concreteness is scored with the ratings published by Brysbaert et al. (2014), a widely recognized resource in this field. The database provides concreteness ratings for about 40,000 English words and two-word expressions, rated on a scale from 1 (most abstract) to 5 (most concrete).


In [18]:
# Load the concreteness ratings database provided by Brysbaert et al. (2014)
concreteness_df = pd.read_excel('data/concreteness_brysbaert/13428_2013_403_MOESM1_ESM.xlsx')  # Update with the actual path
concreteness_dict = pd.Series(concreteness_df.conc_mean.values, index=concreteness_df.word).to_dict()

concreteness_df

Unnamed: 0,word,Bigram,conc_mean,Conc_std,Unknown,Total,Percent_known,SUBTLEX
0,a,0,1.46,1.14,2,30,0.933333,1041179
1,aardvark,0,4.68,0.86,0,28,1.000000,21
2,aback,0,1.65,1.07,4,27,0.851852,15
3,abacus,0,4.52,1.12,2,29,0.931034,12
4,abandon,0,2.54,1.45,1,27,0.962963,413
...,...,...,...,...,...,...,...,...
39949,zebra crossing,1,4.56,0.75,1,28,0.964286,0
39950,zero tolerance,1,2.21,1.45,0,29,1.000000,0
39951,ZIP code,1,3.77,1.59,0,30,1.000000,0
39952,zoom in,1,3.57,1.40,0,28,1.000000,0


In [19]:
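def calculate_average_concreteness(text, concreteness_dict):
    # Lowercase and tokenize, keeping alphabetic tokens only
    words = word_tokenize(text.lower())
    words = [word for word in words if word.isalpha()]
    # Look up each word; words missing from the database are skipped
    scores = [concreteness_dict.get(word, None) for word in words]
    scores = [score for score in scores if score is not None]
    # Despite the function name, this returns the summed rating,
    # so the value grows with the length of the answer
    return sum(scores)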


# Apply the function to each row in the 'answer' column
counsel_df['concreteness'] = counsel_df['answer'].apply(lambda x: calculate_average_concreteness(x, concreteness_dict))
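
The function above returns the sum of the ratings, so longer answers receive larger values regardless of word choice. A length-independent alternative is the mean rating over the words found in the database. The sketch below is one possible variant, not the method used to produce the `concreteness` column; `mean_concreteness` and `sample_ratings` are hypothetical names, the regex tokenizer stands in for `word_tokenize`, and the four ratings are copied from the table above.

```python
import re

# Four ratings copied from the concreteness table above; in the notebook the
# full lookup would be `concreteness_dict`.
sample_ratings = {"a": 1.46, "aardvark": 4.68, "abacus": 4.52, "abandon": 2.54}

def mean_concreteness(text, ratings):
    """Mean rating over the words found in `ratings`; None if none are found."""
    # Simplified tokenizer (assumption): lowercase runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    scores = [ratings[w] for w in words if w in ratings]
    return sum(scores) / len(scores) if scores else None

# Made-up phrase: "broken" is not in the sample ratings and is skipped.
print(mean_concreteness("Abandon a broken abacus", sample_ratings))
```

Returning `None` when no word is rated keeps unrated answers distinguishable from answers with a genuinely low score.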

In [20]:
counsel_df

Unnamed: 0,level_0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,modal_verbs,concreteness
0,0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,5,194.24
1,1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,Medium,10,437.93
2,2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,Medium,9,488.62
3,3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,Medium,2,133.65
4,4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,3,265.12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,8,352.25
2611,2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,7,528.16
2612,2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,2,103.12
2613,2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,9,422.44


Concreteness ratings give a quantifiable measure of how tangible and specific the language in therapists' responses is, a factor that may matter for client engagement and comprehension. Scoring each response against Brysbaert et al.'s ratings ties the lexical choices to an established empirical benchmark and makes it possible to examine how the balance of abstract and concrete language correlates with user engagement in an online counseling context. An established database was chosen for its reliability and academic rigor.

### Readability

To assess the readability of the therapists' answers in the dataset, Python's textstat library is used. It provides several readability metrics, two of which are computed here:

- Flesch Reading Ease: Higher scores indicate material that is easier to read; lower scores mark passages that are more difficult to read.
- Automated Readability Index (ARI): Like the Flesch-Kincaid Grade Level, this index estimates the grade level needed to comprehend the text.
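
For reference, both metrics are simple functions of surface counts. The sketch below writes out the standard published formulas, taking the counts as arguments; the function and parameter names are chosen for this example. textstat performs its own counting of syllables, characters, and sentences, so its scores may differ slightly from values computed by hand.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher means easier to read."""
    return (206.835
            - 1.015 * (n_words / n_sentences)
            - 84.6 * (n_syllables / n_words))

def automated_readability_index(n_chars, n_words, n_sentences):
    """ARI: approximate grade level, based on characters per word
    and words per sentence."""
    return (4.71 * (n_chars / n_words)
            + 0.5 * (n_words / n_sentences)
            - 21.43)

# Illustrative counts (made up): 100 words, 5 sentences, 150 syllables, 450 characters.
print(flesch_reading_ease(100, 5, 150))
print(automated_readability_index(450, 100, 5))
```

Both formulas penalize long sentences; Flesch Reading Ease additionally penalizes syllables per word, while ARI uses characters per word and therefore needs no syllable counter.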

In [21]:
import textstat

# Define a function to calculate readability scores
def calculate_readability(text):
    flesch_reading_ease = textstat.flesch_reading_ease(text)
    ari = textstat.automated_readability_index(text)
    return flesch_reading_ease, ari

In [22]:
# Apply the function to each row in the 'answer' column
counsel_df[['flesch_reading_ease', 'ari']] = counsel_df['answer'].apply(lambda x: pd.Series(calculate_readability(x)))

In [None]:
counsel_df

Readability metrics such as the Flesch Reading Ease and the Automated Readability Index (ARI) quantify the linguistic accessibility of therapists' responses, which is likely to matter for client comprehension and engagement in an online counseling setting. Measuring text complexity in this way makes it possible to examine how readability relates to user engagement and underlines the importance of clear, comprehensible communication in therapeutic interactions. Both metrics were selected because they are widely recognized.

### Clause Density

Complex clauses involving subordination arise because a core or non-core dependent is realized as a clausal structure. We distinguish four basic types of subordinate clause and additionally count conjuncts:
- Clausal subjects, divided into ordinary subjects (csubj) and passive subjects (csubjpass).
- Clausal complements (objects), divided into those with obligatory control (xcomp) and those without (ccomp).
- Clausal adverbial modifiers (advcl).
- Clausal adnominal modifiers (acl) (with relative clauses as an important subtype in many languages).
- Conjuncts (conj): elements connected by a coordinating conjunction such as and, or, or but.

In [23]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

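def count_clauses(sentence):
    doc = nlp(sentence)
    # Identify clauses by looking for specific dependency tags.
    # Note: 'conj' marks every conjunct, including coordinated nouns or
    # adjectives, not only coordinated clauses.
    clause_tags = {'csubj', 'csubjpass', 'conj', 'advcl', 'acl', 'xcomp', 'ccomp'}
    clauses = [tok for tok in doc if tok.dep_ in clause_tags]
    return len(clauses)

In [24]:
def calculate_clause_density(text):
    # Segment the answer into sentences and count clause tags in each one
    sentences = [sent.text.strip() for sent in nlp(text).sents]
    total_clauses = sum(count_clauses(sentence) for sentence in sentences)
    # Raw total for the answer; it is not normalized by length
    return total_clauses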

In [25]:
# Apply the function to each row in the 'answer' column
counsel_df['clause_density'] = counsel_df['answer'].apply(calculate_clause_density)
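
The `clause_density` column holds the total number of clause tags in an answer, so it grows with answer length. A per-sentence rate is one way to make it comparable across answers. The sketch below is an assumption-laden illustration, not the notebook's method: `clauses_per_sentence` is a hypothetical helper that takes dependency labels already grouped by sentence, which with spaCy could be obtained as `[[tok.dep_ for tok in sent] for sent in nlp(text).sents]`. The example labels are written by hand, not parser output.

```python
CLAUSE_TAGS = {"csubj", "csubjpass", "conj", "advcl", "acl", "xcomp", "ccomp"}

def clauses_per_sentence(dep_labels_by_sentence):
    """Mean number of clause-level dependency labels per sentence."""
    if not dep_labels_by_sentence:
        return 0.0
    total = sum(
        sum(label in CLAUSE_TAGS for label in labels)
        for labels in dep_labels_by_sentence
    )
    return total / len(dep_labels_by_sentence)

# Hand-written labels for two sentences (illustrative only).
example = [
    ["nsubj", "ROOT", "mark", "nsubj", "advcl", "punct"],  # one advcl
    ["nsubj", "ROOT", "dobj", "punct"],                    # no clause tags
]
print(clauses_per_sentence(example))  # 0.5
```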

In [28]:
counsel_df

Unnamed: 0,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,modal_verbs,concreteness,flesch_reading_ease,ari,clause_density
0,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,5,194.24,47.49,12.0,10
1,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,Medium,10,437.93,71.44,7.6,25
2,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,Medium,9,488.62,64.85,11.0,23
3,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,Medium,2,133.65,93.54,3.1,9
4,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,3,265.12,81.63,6.8,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,8,352.25,46.71,16.2,23
2611,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,7,528.16,68.60,10.8,34
2612,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,2,103.12,94.56,4.5,6
2613,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,9,422.44,62.41,14.1,25


Clause density, which here covers clausal subjects (csubj, csubjpass), clausal complements (xcomp, ccomp), adverbial clauses (advcl), adnominal clauses (acl), and conjuncts (conj), captures the structural complexity of therapists' responses in online counseling. These counts describe how elaborate the sentence constructions are. Analyzing them lets the study examine the relationship between the complexity of language and user engagement, in support of the hypothesis that certain syntactic structures may either facilitate or impede client comprehension and connection in a therapeutic setting. The approach was chosen because it shows how therapists' communicative styles, as reflected in their syntactic choices, may influence the effectiveness and accessibility of their online counseling interventions.

### T-Unit Analysis
A T-unit is a minimal terminable unit, essentially a main clause with all its subordinate clauses. Analyzing the length and number of T-units can indicate complexity.


In [29]:
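def count_t_units(text):
    doc = nlp(text)
    t_units = 0
    for sent in doc.sents:
        # A sentence root is its own head in spaCy; each root counts as one T-unit.
        # Coordinated main clauses within a sentence are not counted separately.
        main_clauses = [tok for tok in sent if tok.head == tok and tok.dep_ != 'conj']
        t_units += len(main_clauses)
    return t_units

In [30]:
def calculate_t_unit_complexity(text):
    # Segment the answer into sentences, then count T-units in each one
    sentences = list(nlp(text).sents)
    total_t_units = sum(count_t_units(sent.text) for sent in sentences)
    return total_t_units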

In [31]:
# Apply the function to each row in the 'answer' column
counsel_df['t_unit_complexity'] = counsel_df['answer'].apply(calculate_t_unit_complexity)
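
The counter above treats each sentence root as one T-unit, so a sentence with two coordinated main clauses ("I came and I saw it") counts once. The sketch below shows one possible refinement, under stated assumptions: it also counts a verb conjoined to the root when that verb has its own subject. `Tok` is a hypothetical stand-in for a parsed token; with spaCy its fields would come from `token.i`, `token.dep_`, `token.head.i`, and `token.pos_`. The two example parses are written by hand, and only conjuncts attached directly to the root are considered.

```python
from collections import namedtuple

# Minimal stand-in for a parsed token: index, dependency label, head index, coarse POS.
Tok = namedtuple("Tok", "i dep head pos")

SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "csubjpass"}

def count_t_units_in_sentence(tokens):
    """Count the root clause plus root-coordinated verbs that have their own subject."""
    roots = [t for t in tokens if t.dep == "ROOT"]
    root_ids = {t.i for t in roots}
    t_units = len(roots)
    for t in tokens:
        is_coordinated_verb = (
            t.dep == "conj" and t.head in root_ids and t.pos in {"VERB", "AUX"}
        )
        has_own_subject = any(
            c.head == t.i and c.dep in SUBJECT_DEPS for c in tokens
        )
        if is_coordinated_verb and has_own_subject:
            t_units += 1
    return t_units

# "I came and I saw it": two main clauses, each with its own subject.
coordinated = [
    Tok(0, "nsubj", 1, "PRON"), Tok(1, "ROOT", 1, "VERB"), Tok(2, "cc", 1, "CCONJ"),
    Tok(3, "nsubj", 4, "PRON"), Tok(4, "conj", 1, "VERB"), Tok(5, "dobj", 4, "PRON"),
]
# "I came and left": one subject shared by two verbs, a single T-unit.
shared_subject = [
    Tok(0, "nsubj", 1, "PRON"), Tok(1, "ROOT", 1, "VERB"),
    Tok(2, "cc", 1, "CCONJ"), Tok(3, "conj", 1, "VERB"),
]
print(count_t_units_in_sentence(coordinated))     # 2
print(count_t_units_in_sentence(shared_subject))  # 1
```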

In [32]:
counsel_df

Unnamed: 0,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,modal_verbs,concreteness,flesch_reading_ease,ari,clause_density,t_unit_complexity
0,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,5,194.24,47.49,12.0,10,7
1,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,Medium,10,437.93,71.44,7.6,25,13
2,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,Medium,9,488.62,64.85,11.0,23,9
3,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,Medium,2,133.65,93.54,3.1,9,5
4,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,3,265.12,81.63,6.8,14,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,8,352.25,46.71,16.2,23,7
2611,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,7,528.16,68.60,10.8,34,13
2612,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,2,103.12,94.56,4.5,6,7
2613,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,9,422.44,62.41,14.1,25,8


T-unit analysis adds a further measure of syntactic complexity for therapists' responses in an online counseling context. Because a T-unit is a main clause together with its subordinate structures, counting T-units shows how a response is divided into independent statements, which offers insight into how complex sentence constructions might influence client comprehension and engagement in digital therapeutic interactions.


### Present Indicative Verbs

In [35]:
def extract_present_indicative_verbs(text):
    # Process the text using the spaCy NLP pipeline
    doc = nlp(text)
    # Extract verbs that are in present tense (verb, non-3rd person singular present: VBP; verb, 3rd person singular present: VBZ)
    present_verbs = [token.text for token in doc if token.tag_ in ['VBP', 'VBZ']]

    return len(present_verbs)

In [37]:
# Apply
counsel_df['present_indicative_verbs'] = counsel_df['answer'].apply(extract_present_indicative_verbs)

In [38]:
counsel_df

Unnamed: 0,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,modal_verbs,concreteness,flesch_reading_ease,ari,clause_density,t_unit_complexity,present_indicative_verbs
0,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,5,194.24,47.49,12.0,10,7,4
1,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,Medium,10,437.93,71.44,7.6,25,13,12
2,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,Medium,9,488.62,64.85,11.0,23,9,19
3,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,Medium,2,133.65,93.54,3.1,9,5,7
4,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,3,265.12,81.63,6.8,14,8,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,8,352.25,46.71,16.2,23,7,8
2611,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,7,528.16,68.60,10.8,34,13,29
2612,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,2,103.12,94.56,4.5,6,7,4
2613,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,9,422.44,62.41,14.1,25,8,17


Counting present indicative verbs captures how much of a therapist's response is framed as current, ongoing action, which offers insight into the immediacy and directness of the interaction. This feature is posited to correlate with client engagement, making it a useful metric for analyzing effective communication in the counseling context.

### Self-Referential Sentences

In [42]:
def extract_self_referential_sentences(text):
    # Process the text with spaCy
    doc = nlp(text)

    # List to store self-referential sentences
    self_ref_sentences = []

    # Define first-person singular pronouns (lowercase, for case-insensitive matching)
    first_person_pronouns = {"i", "me", "my", "mine", "myself"}

    # Iterate over the sentences in the text
    for sent in doc.sents:
        # Check if any token in the sentence is a first-person pronoun;
        # lowercasing ensures a sentence-initial "My" or "Myself" is also matched
        if any(token.text.lower() in first_person_pronouns for token in sent):
            self_ref_sentences.append(sent.text)

    return len(self_ref_sentences)

In [44]:
# Apply
counsel_df['self_referential_sent'] = counsel_df['answer'].apply(extract_self_referential_sentences)

In [47]:
counsel_df

Unnamed: 0,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,modal_verbs,concreteness,flesch_reading_ease,ari,clause_density,t_unit_complexity,present_indicative_verbs,self_referential_sent
0,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,5,194.24,47.49,12.0,10,7,4,2
1,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,Medium,10,437.93,71.44,7.6,25,13,12,5
2,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,Medium,9,488.62,64.85,11.0,23,9,19,0
3,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,Medium,2,133.65,93.54,3.1,9,5,7,1
4,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,3,265.12,81.63,6.8,14,8,13,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,8,352.25,46.71,16.2,23,7,8,1
2611,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,7,528.16,68.60,10.8,34,13,29,4
2612,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,2,103.12,94.56,4.5,6,7,4,1
2613,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,9,422.44,62.41,14.1,25,8,17,4


In [48]:
counsel_df.to_csv('data/counsel_df_lingo.csv', index=False)

## Additional Analysis for Future Work
Given the context of our analysis, we might consider incorporating the following additional linguistic features:

1. Pragmatic Markers
   - Speech Acts: Analyze the types of speech acts (e.g., questioning, advising, reassuring) used in the responses, as they can significantly impact the perceived helpfulness or effectiveness of a response.
   - Politeness Strategies: Assess the use of politeness strategies, which can affect the tone and perceived empathy in the responses.
2. Discourse Analysis
   - Coherence and Cohesion: Analyze how ideas are connected and flow within the text. This includes the use of transition words, pronoun reference, and thematic progression.
   - Narrative Structure: Investigate the presence of narrative elements, such as storytelling or the use of anecdotes, which can be impactful in therapy.
3. Lexical Richness
   - Word Frequency Analysis: Beyond lexical diversity, examine the frequency of specific types of words (e.g., therapeutic jargon, emotion words) that could influence engagement.
   - Semantic Analysis: Explore the semantic fields most commonly used (e.g., emotional, cognitive, health-related) to understand thematic focuses in responses.
4. Psycholinguistic Features
   - Emotionally Charged Language: Analyze the use of language that evokes emotions, as it can play a crucial role in empathy and rapport building.
   - Sensory Language: Assess the use of sensory descriptions, which can make responses more vivid and relatable.
5. Linguistic Style Matching
   - Mirroring Client's Language: Measure the degree to which therapists' linguistic style (e.g., vocabulary, syntax) matches that of the clients, which can be an indicator of rapport and alignment.
6. Nonverbal Communication Indicators
   - Punctuation and Formatting: Assess the use of punctuation and formatting (e.g., ellipses, emoticons, paragraph breaks) for their potential to convey nonverbal cues or emotional tone.
7. Topic Modeling
   - Identifying Key Themes: Use algorithms like Latent Dirichlet Allocation (LDA) to identify prevalent topics in the responses, which can help in understanding focus areas in therapy.
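
As an indication of how little code some of these features would need, the punctuation and formatting indicators from item 6 can be approximated with a few counts. This is only a sketch: `punctuation_features` is a hypothetical helper, the emoticon pattern covers only a handful of common Western-style emoticons, and the sample text is made up.

```python
import re

def punctuation_features(text):
    """Count a few surface cues that may carry nonverbal or emotional tone."""
    return {
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        # Three or more consecutive periods count as one ellipsis.
        "ellipses": len(re.findall(r"\.{3,}", text)),
        # Simple emoticons such as :) ;-) :( :D :P (assumed pattern, far from complete).
        "emoticons": len(re.findall(r"[:;]-?[)(DP]", text)),
        # A blank line between blocks of text counts as one paragraph break.
        "paragraph_breaks": len(re.findall(r"\n\s*\n", text)),
    }

# Made-up sample text, not taken from the dataset.
sample_text = "That sounds hard... :)\n\nWould you like to talk about it? You can!"
print(punctuation_features(sample_text))
```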