# Feature Engineering

This thesis project analyzes the linguistic features of therapists' responses in an online forum to predict engagement levels.

Let's break down the extraction of the linguistic features from the text answers in the dataset:

1. Length of the Answer: This can be measured in characters (answer_len_char), and also in words for a more nuanced understanding.
2. Lexical Diversity:

    - Overall Lexical Diversity: This can be calculated using measures like Type-Token Ratio (TTR), which is the ratio of unique words to the total number of words.
    - Use of Modal Verbs: Identify and count modal verbs (like could, should, would).
    - Frequency of Personal Pronouns: Count occurrences of personal pronouns (e.g., I, you, we).
    - Concreteness, Imageability, Perceptual Salience: These are more complex and might require specific lexicons or datasets that rate words based on these attributes. 

3. Sentiment Analysis: This involves categorizing the sentiment of the text (positive, negative, neutral). Utilize sentiment analysis tools or libraries (like TextBlob, NLTK's VADER, or more advanced models if available) to determine the sentiment polarity and subjectivity of the answers.
4. Readability Scores: There are various formulas like Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, etc., that can be used to assess the readability of the text.
5. Syntactic Complexity:
    - Average Sentence Length: Measure the average number of words per sentence.
    - Use of Subordinate Clauses: This might require parsing the sentence structure to identify subordinate clauses.
    - Other Metrics: Complex metrics like the number of dependent clauses, noun phrase complexity, etc., can also be considered.

In [1]:
import pandas as pd

# Load the dataset
file_path = 'data/counsel_df_processed.csv'
counsel_df = pd.read_csv(file_path)

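# Preview without the 'index' column; the result is not assigned, so counsel_df itself is unchanged
counsel_df.drop(columns=['index'])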

Unnamed: 0,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement
0,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High
1,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,High
2,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,High
3,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,High
4,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High
...,...,...,...,...,...,...,...,...,...,...
2610,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low
2611,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low
2612,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low
2613,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low


## Text Normalization

The following text normalization processes are most relevant:
1. Case Folding: Essential for ensuring uniformity in the text. It's especially important for consistency in word counts and lexical diversity measures.
2. Tokenization: Fundamental for almost all types of text analysis. It's necessary for calculating lexical diversity, sentiment analysis, and syntactic complexity.
3. Removing Punctuation and Special Characters: This is generally useful, but you might want to retain punctuation for sentiment analysis (as it can convey sentiment) and for readability scores (since punctuation impacts readability).
4. Stop Words Removal: This is a bit nuanced. For lexical diversity, you might want to keep stop words as they contribute to the overall diversity of the text. Keep them for readability scores as well, since those formulas depend on the number of words per sentence, and be careful with sentiment analysis, where common stop-word lists include negations such as "not".
5. Spelling Correction: This is optional and depends on the quality of the text. If the dataset has a lot of typos, it may be worth doing, but it's not usually critical for the types of analysis you're performing.
6. Lemmatization: Preferable over stemming as it retains the meaning of words. Useful for lexical diversity and sentiment analysis.
7. Handling Negations: Important for sentiment analysis to accurately capture the sentiment of the text.
8. Identifying Part-of-Speech (POS): Necessary for counting modal verbs, personal pronouns, and understanding sentence structure for syntactic complexity.
9. Sentence Segmentation: Crucial for syntactic complexity analysis (like average sentence length) and also useful for certain readability metrics.
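
A few of the steps above (case folding, tokenization, sentence segmentation, and negation handling) can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in the rest of the notebook, which relies on NLTK. The helper names, the regular expressions, and the small `NEGATIONS` list are all assumptions made for the example.

```python
import re

# Assumed, illustrative negation list; a real lexicon would be larger.
NEGATIONS = {"not", "no", "never", "cannot"}
PUNCTUATION = set(".,!?;:")

def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def tokenize(sentence):
    """Case-fold, then extract word tokens (with contractions) and punctuation marks."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?|[.,!?;:]", sentence.lower())

def mark_negation(tokens):
    """Prefix tokens after a negation word with NOT_ until the next punctuation mark."""
    marked, negating = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False          # punctuation closes the negation scope
            marked.append(token)
        elif negating:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
            if token in NEGATIONS or token.endswith("n't"):
                negating = True
    return marked

sentences = split_sentences("I am not sure this helps. You are doing well!")
print([mark_negation(tokenize(s)) for s in sentences])
```

Marking the scope of a negation this way is a simple heuristic; tools such as VADER apply their own negation rules internally, so this step matters mainly for bag-of-words style features.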

## 1. Length of Answers

- char_count: The character length of each answer.
- word_count: The number of words in each answer.
- sentence_count: The number of sentences in each answer.

In [2]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Download necessary NLTK data
nltk.download('punkt')

def calculate_lengths(text):
    """Return (char_count, word_count, sentence_count) for one answer."""
    words = word_tokenize(text)  # Tokenize into words
    words = [word for word in words if word.isalpha()]  # Keep purely alphabetic tokens (drops punctuation, numbers, and contraction fragments such as "'ve")

    # Word count: number of alphabetic tokens
    word_count = len(words)

    # Character count: letters in the kept tokens only (no spaces or punctuation)
    char_count = len(''.join(words))

    # Sentence count: sentences found in the original, unfiltered text
    sentence_count = len(sent_tokenize(text))

    return char_count, word_count, sentence_count


# Apply the function to each row in the 'answer' column
counsel_df[['char_count', 'word_count', 'sentence_count']] = counsel_df['answer'].apply(lambda x: pd.Series(calculate_lengths(x)))
counsel_df

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ghaithalseirawan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,char_count,word_count,sentence_count
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,536,104,7
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,High,857,200,13
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,High,875,207,9
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,High,217,58,5
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,512,121,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,867,168,4
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,1135,250,10
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,200,53,3
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,926,203,7


Character, word, and sentence counts give a baseline measure of how verbose each response is. They serve two purposes: a first check on how comprehensive a response is, and a baseline for relating response length to user engagement in online counseling.

## 2. Lexical Diversity

- Overall Lexical Diversity: This can be calculated using measures like Type-Token Ratio (TTR), which is the ratio of unique words to the total number of words.
- Use of Modal Verbs: Identify and count modal verbs (like could, should, would).
- Frequency of Personal Pronouns: Count occurrences of personal pronouns (e.g., I, you, we).
- Concreteness: Average concreteness rating of the words in an answer, taken from a rated lexicon.

### 2.1 Type-Token Ratio (TTR)
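TTR is a simple measure of lexical diversity, calculated as the number of unique words (types) divided by the total number of words (tokens) in a text. The implementation here lowercases the text and keeps punctuation tokens, so punctuation marks count toward both types and tokens.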

In [3]:
def calculate_ttr(text):
    words = word_tokenize(text.lower())
    return len(set(words)) / len(words) if words else 0
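
TTR tends to fall as a text gets longer, because common words repeat. The previews in this notebook show this: a 58-word answer scores about 0.81, while a 207-word answer scores about 0.51. One way to reduce this length effect is a moving-average TTR (MATTR), which averages TTR over fixed-size windows. The sketch below is a swapped-in alternative, not the measure used in this notebook. The regex tokenizer, the function names, and the default window of 50 tokens are assumptions for illustration.

```python
import re

def simple_tokens(text):
    """Lowercase alphabetic tokens only (a rough regex stand-in for word_tokenize)."""
    return re.findall(r"[a-z]+", text.lower())

def ttr(tokens):
    """Plain type-token ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return ttr(tokens)  # text shorter than one window: fall back to plain TTR
    ratios = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

tokens = simple_tokens("The cat and the dog and the bird")
print(ttr(tokens), mattr(tokens, window=4))
```

Because every window has the same size, MATTR values for short and long answers are more directly comparable than plain TTR values.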

### 2.2 Use of Modal Verbs
Count the frequency of modal verbs in the text. Modal verbs include words like "can," "could," "may," "might," "must," "shall," "should," "will," "would."

In [4]:
def count_modal_verbs(text):
    modal_verbs = {'can', 'could', 'may', 'might', 'must', 'shall', 'should', 'will', 'would'}
    words = word_tokenize(text.lower())
    return sum(word in modal_verbs for word in words)

### 2.3 Frequency of Personal Pronouns
Count occurrences of personal pronouns like "I," "you," "he," "she," "it," "we," "they."

In [5]:
def count_personal_pronouns(text):
    personal_pronouns = {'i', 'you', 'he', 'she', 'it', 'we', 'they'}
    words = word_tokenize(text.lower())
    return sum(word in personal_pronouns for word in words)

In [6]:
# Apply the functions to each row in the 'answer' column
counsel_df['TTR'] = counsel_df['answer'].apply(calculate_ttr)
counsel_df['modal_verbs'] = counsel_df['answer'].apply(count_modal_verbs)
counsel_df['personal_pronouns'] = counsel_df['answer'].apply(count_personal_pronouns)
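
The modal verb and pronoun features above are raw counts, so longer answers will usually score higher simply because they contain more words. A length-normalized rate is one possible complement. The sketch below is an assumption-laden illustration: the function name `rate_per_100_words` and the regex tokenizer are hypothetical, and only the two word lists come from the cells above.

```python
import re

MODAL_VERBS = {'can', 'could', 'may', 'might', 'must', 'shall', 'should', 'will', 'would'}
PERSONAL_PRONOUNS = {'i', 'you', 'he', 'she', 'it', 'we', 'they'}

def rate_per_100_words(text, vocabulary):
    """Occurrences of vocabulary items per 100 alphabetic tokens."""
    words = re.findall(r"[a-z]+", text.lower())  # rough tokenizer, assumed for the sketch
    if not words:
        return 0.0
    hits = sum(word in vocabulary for word in words)
    return 100.0 * hits / len(words)

example = "I think you should talk to someone we can help"
print(rate_per_100_words(example, MODAL_VERBS))
print(rate_per_100_words(example, PERSONAL_PRONOUNS))
```

Rates like these can sit alongside the raw counts, which keeps the length signal in `word_count` separate from the stylistic signal.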

In [7]:
counsel_df

Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,char_count,word_count,sentence_count,TTR,modal_verbs,personal_pronouns
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,536,104,7,0.650407,5,4
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,High,857,200,13,0.515982,10,19
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,High,875,207,9,0.511211,9,22
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,High,217,58,5,0.805970,2,7
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,512,121,8,0.613139,3,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,867,168,4,0.548913,8,7
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,1135,250,10,0.531469,7,19
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,200,53,3,0.746479,2,6
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,926,203,7,0.472803,9,8


Type-Token Ratio, modal verb counts, and personal pronoun counts capture the lexical variety and relational tone of therapists' responses. With these features, the study can examine whether richer and more personal language is associated with higher user engagement in online counseling.

### 2.4 Concreteness

To implement a solution using an existing concreteness database, I will leverage the concreteness ratings provided by Brysbaert et al. (2014), which are a widely recognized resource in this field. The database offers concreteness ratings for about 40,000 English words and two-word expressions, rated on a scale from 1 (most abstract) to 5 (most concrete).


In [8]:
# Load the concreteness ratings database provided by Brysbaert et al. (2014)
concreteness_df = pd.read_excel('data/concreteness_brysbaert/13428_2013_403_MOESM1_ESM.xlsx')  # Update with the actual path
concreteness_dict = pd.Series(concreteness_df.conc_mean.values, index=concreteness_df.word).to_dict()

concreteness_df

Unnamed: 0,word,Bigram,conc_mean,Conc_std,Unknown,Total,Percent_known,SUBTLEX
0,a,0,1.46,1.14,2,30,0.933333,1041179
1,aardvark,0,4.68,0.86,0,28,1.000000,21
2,aback,0,1.65,1.07,4,27,0.851852,15
3,abacus,0,4.52,1.12,2,29,0.931034,12
4,abandon,0,2.54,1.45,1,27,0.962963,413
...,...,...,...,...,...,...,...,...
39949,zebra crossing,1,4.56,0.75,1,28,0.964286,0
39950,zero tolerance,1,2.21,1.45,0,29,1.000000,0
39951,ZIP code,1,3.77,1.59,0,30,1.000000,0
39952,zoom in,1,3.57,1.40,0,28,1.000000,0


In [9]:
def calculate_average_concreteness(text, concreteness_dict):
    words = word_tokenize(text.lower())
    words = [word for word in words if word.isalpha()]
    scores = [concreteness_dict.get(word, None) for word in words]
    scores = [score for score in scores if score is not None]
    return sum(scores) / len(scores) if scores else 0


# Apply the function to each row in the 'answer' column
counsel_df['average_concreteness_answer'] = counsel_df['answer'].apply(lambda x: calculate_average_concreteness(x, concreteness_dict))
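
An average over rated words says nothing about how many words were actually found in the lexicon. A coverage figure can help judge how much to trust each score. The sketch below is a hypothetical variant of the function above, not a replacement for it: it uses a regex tokenizer instead of `word_tokenize`, returns NaN rather than 0 when no word is rated (0 lies outside the 1 to 5 scale), and runs on a three-word toy dictionary whose values are copied from the preview of the ratings table.

```python
import math
import re

def concreteness_with_coverage(text, ratings):
    """Return (mean concreteness of rated words, share of words that were rated)."""
    words = re.findall(r"[a-z]+", text.lower())  # rough tokenizer, assumed for the sketch
    scores = [ratings[word] for word in words if word in ratings]
    if not scores:
        return math.nan, 0.0  # nothing rated: no meaningful average
    return sum(scores) / len(scores), len(scores) / len(words)

# Toy lexicon for illustration only; values taken from the table preview above.
toy_ratings = {"a": 1.46, "abandon": 2.54, "abacus": 4.52}

mean_score, coverage = concreteness_with_coverage("Abandon a plan", toy_ratings)
print(mean_score, coverage)
```

Answers with low coverage could then be flagged or excluded before relating concreteness to engagement.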

In [10]:
counsel_df

Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,char_count,word_count,sentence_count,TTR,modal_verbs,personal_pronouns,average_concreteness_answer
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,536,104,7,0.650407,5,4,2.340241
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,High,857,200,13,0.515982,10,19,2.432944
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,High,875,207,9,0.511211,9,22,2.455377
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,High,217,58,5,0.805970,2,7,2.386607
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,512,121,8,0.613139,3,15,2.477757
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,867,168,4,0.548913,8,7,2.348333
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,1135,250,10,0.531469,7,19,2.286407
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,200,53,3,0.746479,2,6,2.398140
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,926,203,7,0.472803,9,8,2.283459


Concreteness ratings give a quantifiable measure of how tangible and specific the language in therapists' responses is, a factor that may matter for client engagement and comprehension. Scoring each answer against the Brysbaert et al. ratings, an established and widely used lexicon, makes it possible to examine how the balance of abstract versus concrete language correlates with user engagement in an online counseling context.

## 3. Sentiment

### 3.1 Using TextBlob
TextBlob is a simple library for processing textual data. It can provide sentiment polarity (ranging from -1 to 1, where -1 is very negative, 0 is neutral, and 1 is very positive) and subjectivity (ranging from 0 to 1, where 0 is very objective and 1 is very subjective).

In [11]:
from textblob import TextBlob

def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

In [12]:

# Apply the function to each row in the 'answer' column
counsel_df[['sentiment_polarity_answer', 'sentiment_subjectivity_answer']] = counsel_df['answer'].apply(lambda x: pd.Series(get_sentiment(x)))
counsel_df[['sentiment_polarity_question', 'sentiment_subjectivity_question']] = counsel_df['question'].apply(lambda x: pd.Series(get_sentiment(x)))

In [15]:
counsel_df

Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,...,word_count,sentence_count,TTR,modal_verbs,personal_pronouns,average_concreteness_answer,sentiment_polarity_answer,sentiment_subjectivity_answer,sentiment_polarity_question,sentiment_subjectivity_question
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,...,104,7,0.650407,5,4,2.340241,0.194444,0.525926,0.314286,0.469048
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,...,200,13,0.515982,10,19,2.432944,0.218889,0.428889,0.314286,0.469048
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,...,207,9,0.511211,9,22,2.455377,0.282051,0.548718,0.314286,0.469048
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,...,58,5,0.805970,2,7,2.386607,0.286508,0.467063,0.314286,0.469048
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,...,121,8,0.613139,3,15,2.477757,0.076786,0.373214,0.314286,0.469048
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,...,168,4,0.548913,8,7,2.348333,0.265000,0.530000,0.000000,0.750000
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,...,250,10,0.531469,7,19,2.286407,-0.050430,0.649432,0.000000,0.750000
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,...,53,3,0.746479,2,6,2.398140,0.267857,0.760119,0.000000,0.750000
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,...,203,7,0.472803,9,8,2.283459,0.045767,0.641723,0.000000,0.750000


### 3.2 Using NLTK's VADER
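VADER (Valence Aware Dictionary and sEntiment Reasoner) is particularly good for sentiments expressed in social media because its lexicon and rules cover slang, emoticons, and emphasis such as capitalization and exclamation marks. The compound score used here ranges from -1 (most negative) to 1 (most positive).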

In [18]:
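import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# The analyzer needs the VADER lexicon; download it once
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

def get_vader_sentiment(text):
    # Keep only the compound score from the returned dictionary
    return sia.polarity_scores(text)['compound']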

In [19]:
# Apply the function to each row in the 'answer' column
counsel_df['vader_sentiment'] = counsel_df['answer'].apply(get_vader_sentiment)

# Now your DataFrame will have a new column 'vader_sentiment'
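
The overview describes sentiment analysis as categorizing text into positive, negative, or neutral, while the column above stores a continuous compound score. A simple threshold rule can map one to the other. The sketch below assumes the commonly used cut-off of 0.05 on either side of zero; the function name and the default threshold are illustrative choices, not something this notebook applies.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the first two rows of the preview in this section.
for score in (-0.9460, 0.9719, 0.0):
    print(score, label_from_compound(score))
```

The continuous score keeps more information for modeling, so a label like this is mainly useful for summaries and plots.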


In [20]:
counsel_df

Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,...,sentence_count,TTR,modal_verbs,personal_pronouns,average_concreteness_answer,sentiment_polarity_answer,sentiment_subjectivity_answer,sentiment_polarity_question,sentiment_subjectivity_question,vader_sentiment
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,...,7,0.650407,5,4,2.340241,0.194444,0.525926,0.314286,0.469048,-0.9460
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,...,13,0.515982,10,19,2.432944,0.218889,0.428889,0.314286,0.469048,0.9719
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,...,9,0.511211,9,22,2.455377,0.282051,0.548718,0.314286,0.469048,0.9847
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,...,5,0.805970,2,7,2.386607,0.286508,0.467063,0.314286,0.469048,0.7876
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,...,8,0.613139,3,15,2.477757,0.076786,0.373214,0.314286,0.469048,0.4215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,...,4,0.548913,8,7,2.348333,0.265000,0.530000,0.000000,0.750000,0.8441
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,...,10,0.531469,7,19,2.286407,-0.050430,0.649432,0.000000,0.750000,-0.6065
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,...,3,0.746479,2,6,2.398140,0.267857,0.760119,0.000000,0.750000,-0.3187
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,...,7,0.472803,9,8,2.283459,0.045767,0.641723,0.000000,0.750000,0.9307


Sentiment analysis with TextBlob and VADER quantifies the emotional tone and subjectivity of the therapists' responses, which makes it possible to examine whether emotional expression in an answer correlates with client engagement. The two tools can disagree: in the previews above, the first answer has a TextBlob polarity of 0.19 but a VADER compound score of -0.95, so both scores are kept as separate features.

## 4. Readability

To assess the readability of the therapists' answers in your dataset, you can use Python's textstat library, which provides several readability metrics. Commonly used metrics include:

- Flesch Reading Ease: Higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read.
- Flesch-Kincaid Grade Level: Shows the US grade level needed to understand the text.
- Gunning Fog Index: Estimates the years of formal education needed to understand the text on the first reading.
- SMOG Index: Calculates the years of education needed to understand a piece of writing.
- Automated Readability Index (ARI): Like the Flesch-Kincaid Grade Level, this index estimates the grade level needed to comprehend the text.
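
For reference, the standard published formulas behind these five metrics can be written directly in terms of text counts. The sketch below takes the counts as inputs and does not attempt syllable counting. The function and parameter names are chosen for illustration, and textstat's own results may differ because of how it counts syllables, detects sentences, and rounds.

```python
import math

def flesch_reading_ease(words, sentences, syllables):
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    # complex_words: words with three or more syllables
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smog_index(polysyllables, sentences):
    # polysyllables: count of words with three or more syllables
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

def automated_readability_index(characters, words, sentences):
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# Made-up counts for a 100-word, 5-sentence passage.
print(flesch_reading_ease(words=100, sentences=5, syllables=150))
print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
print(automated_readability_index(characters=450, words=100, sentences=5))
```

All five formulas depend on words per sentence, which is why sentence segmentation and intact punctuation matter for these scores.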

In [21]:
import textstat

# Define a function to calculate readability scores
def calculate_readability(text):
    flesch_reading_ease = textstat.flesch_reading_ease(text)
    flesch_kincaid_grade = textstat.flesch_kincaid_grade(text)
    gunning_fog = textstat.gunning_fog(text)
    smog_index = textstat.smog_index(text)
    ari = textstat.automated_readability_index(text)
    return flesch_reading_ease, flesch_kincaid_grade, gunning_fog, smog_index, ari

In [22]:

# Apply the function to each row in the 'answer' column
counsel_df[['flesch_reading_ease', 'flesch_kincaid_grade', 'gunning_fog', 'smog_index', 'ari']] = counsel_df['answer'].apply(lambda x: pd.Series(calculate_readability(x)))


In [24]:
counsel_df

Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,...,sentiment_polarity_answer,sentiment_subjectivity_answer,sentiment_polarity_question,sentiment_subjectivity_question,vader_sentiment,flesch_reading_ease,flesch_kincaid_grade,gunning_fog,smog_index,ari
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,...,0.194444,0.525926,0.314286,0.469048,-0.9460,47.49,10.4,12.10,13.3,12.0
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,...,0.218889,0.428889,0.314286,0.469048,0.9719,71.44,7.4,9.08,10.3,7.6
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,...,0.282051,0.548718,0.314286,0.469048,0.9847,64.85,10.0,11.19,11.9,11.0
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,...,0.286508,0.467063,0.314286,0.469048,0.7876,93.54,3.1,5.33,5.7,3.1
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,...,0.076786,0.373214,0.314286,0.469048,0.4215,81.63,5.6,8.67,9.5,6.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,...,0.265000,0.530000,0.000000,0.750000,0.8441,46.71,12.8,15.37,15.7,16.2
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,...,-0.050430,0.649432,0.000000,0.750000,-0.6065,68.60,8.5,10.80,11.7,10.8
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,...,0.267857,0.760119,0.000000,0.750000,-0.3187,94.56,2.7,5.75,6.7,4.5
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,...,0.045767,0.641723,0.000000,0.750000,0.9307,62.41,10.9,13.17,13.0,14.1


Readability metrics such as Flesch Reading Ease and the Gunning Fog Index quantify how accessible the therapists' responses are, which matters for client comprehension and engagement in an online counseling setting. Scoring each answer with several widely used formulas makes it possible to examine how text complexity relates to user engagement.

## 5. Syntactic Complexity Analysis

Applying syntactic complexity analysis to the therapists' responses involves examining the structure of the sentences to determine their complexity. This can include metrics like average sentence length, the use of subordinate clauses, and the variety of sentence structures used. Syntactic complexity can be indicative of the sophistication of the text and may affect readability and user engagement.

**Metrics for Syntactic Complexity**
1. Average Sentence Length: Measured in words per sentence, it's a basic indicator of complexity.
2. Clause Density: The average number of clauses per sentence.
3. T-Unit Analysis: A T-unit is a minimal terminable unit, essentially a main clause with all its subordinate clauses. Analyzing the length and number of T-units can indicate complexity.
4. Subordination Index: The ratio of subordinate clauses to total clauses.
5. Sentence Variety: Different types of sentence structures (simple, compound, complex, compound-complex).
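
Clause density and the subordination index from the list above need clause boundaries, which usually come from a dependency parser. The sketch below assumes the parsing has already been done and that each sentence is given as a list of dependency labels, one per token. The label set counted as subordinate clauses, the one-main-clause-per-sentence simplification (coordinated clauses are ignored), and the function name are all assumptions for illustration.

```python
# Assumed set of dependency labels that introduce a subordinate clause.
SUBORDINATE_DEPS = {"csubj", "csubjpass", "ccomp", "xcomp", "advcl", "acl"}

def clause_metrics(parsed_sentences):
    """Return (clause_density, subordination_index) from per-sentence dependency labels."""
    n_sentences = len(parsed_sentences)
    if n_sentences == 0:
        return 0.0, 0.0
    subordinate = sum(
        dep in SUBORDINATE_DEPS for sentence in parsed_sentences for dep in sentence
    )
    total_clauses = n_sentences + subordinate  # simplification: one main clause per sentence
    clause_density = total_clauses / n_sentences
    subordination_index = subordinate / total_clauses
    return clause_density, subordination_index

# Hand-written labels for "She left because he called." and "He ate lunch."
parsed = [
    ["nsubj", "root", "mark", "nsubj", "advcl"],
    ["nsubj", "root", "obj"],
]
print(clause_metrics(parsed))
```

With a real parser, the label lists would be read from its output for each sentence; the exact label names depend on the parser and its annotation scheme.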

### 5.1 Using NLTK for Syntactic Complexity Analysis
Python's NLTK library can be used to analyze some aspects of syntactic complexity. Here’s an example of how you could implement a basic analysis:

In [25]:
def calculate_syntactic_complexity(text):
    """Return (average sentence length in tokens, number of sentences)."""
    sentences = sent_tokenize(text)
    total_words = word_tokenize(text)  # Includes punctuation tokens, unlike word_count above

    if not sentences:
        return 0, 0

    average_sentence_length = len(total_words) / len(sentences)
    # Other metrics like clause density, T-unit analysis, subordination index, etc., can be added here
    return average_sentence_length, len(sentences)

In [26]:
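# Apply the function to each row in the 'answer' column
# Note: 'sentence_count' already exists from the length features; it is recomputed here with the same sent_tokenize call
counsel_df[['average_sentence_length', 'sentence_count']] = counsel_df['answer'].apply(lambda x: pd.Series(calculate_syntactic_complexity(x)))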

In [27]:
counsel_df

Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,...,sentiment_subjectivity_answer,sentiment_polarity_question,sentiment_subjectivity_question,vader_sentiment,flesch_reading_ease,flesch_kincaid_grade,gunning_fog,smog_index,ari,average_sentence_length
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,...,0.525926,0.314286,0.469048,-0.9460,47.49,10.4,12.10,13.3,12.0,17.571429
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,...,0.428889,0.314286,0.469048,0.9719,71.44,7.4,9.08,10.3,7.6,16.846154
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,...,0.548718,0.314286,0.469048,0.9847,64.85,10.0,11.19,11.9,11.0,24.777778
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,...,0.467063,0.314286,0.469048,0.7876,93.54,3.1,5.33,5.7,3.1,13.400000
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,...,0.373214,0.314286,0.469048,0.4215,81.63,5.6,8.67,9.5,6.8,17.125000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,...,0.530000,0.000000,0.750000,0.8441,46.71,12.8,15.37,15.7,16.2,46.000000
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,...,0.649432,0.000000,0.750000,-0.6065,68.60,8.5,10.80,11.7,10.8,28.600000
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,...,0.760119,0.000000,0.750000,-0.3187,94.56,2.7,5.75,6.7,4.5,23.666667
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,...,0.641723,0.000000,0.750000,0.9307,62.41,10.9,13.17,13.0,14.1,34.142857


### 5.2 Clause Density

Complex clauses involving subordination arise because a core or non-core dependent is realized as a clausal structure. We distinguish four basic types of subordinate clause and, in addition, count coordinated conjuncts:
- Clausal subjects, divided into ordinary subjects (csubj) and passive subjects (csubjpass).
- Clausal complements (objects), divided into those with obligatory control (xcomp) and those without (ccomp).
- Clausal adverbial modifiers (advcl).
- Clausal adnominal modifiers (acl) (with relative clauses as an important subtype in many languages).
- Conjuncts (conj): elements connected by a coordinating conjunction such as "and" or "or".

In [41]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

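def count_clauses(sentence):
    doc = nlp(sentence)
    # Identify clauses by their dependency labels.
    # Caveats: 'conj' also marks coordinated words and phrases, not only clauses,
    # and spaCy's English models label relative clauses 'relcl', which is not in this set.
    clause_tags = {'csubj', 'csubjpass', 'conj', 'advcl', 'acl', 'xcomp', 'ccomp'}
    clauses = [tok for tok in doc if tok.dep_ in clause_tags]
    return len(clauses)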

In [43]:
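def calculate_clause_density(text):
    # Split the answer into sentences, then count clause-marking dependencies in each
    sentences = [sent.text.strip() for sent in nlp(text).sents]
    total_clauses = sum(count_clauses(sentence) for sentence in sentences)
    # Note: this is the total clause count for the answer, not an average per sentence
    return total_clauses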

In [44]:
# Apply the function to each row in the 'answer' column
counsel_df['clause_density'] = counsel_df['answer'].apply(calculate_clause_density)
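
The `clause_density` column computed above holds the total number of clause-marking dependencies in each answer, so longer answers tend to score higher regardless of how their sentences are built. To obtain the average number of clauses per sentence, the total can be divided by the number of sentences. The following is a minimal sketch of that variant, not the method used for the results in this notebook. The function name `clauses_per_sentence` is hypothetical, and it expects a text that has already been parsed (for example `nlp(text)`), so each answer is parsed only once.

```python
CLAUSE_TAGS = {"csubj", "csubjpass", "conj", "advcl", "acl", "xcomp", "ccomp"}


def clauses_per_sentence(doc, clause_tags=CLAUSE_TAGS):
    """Average number of clause-marking dependencies per sentence.

    doc: a parsed spaCy Doc (any object exposing `.sents`, whose tokens
    expose `.dep_`, works as well).
    """
    # One count per sentence: tokens whose dependency label marks a clause
    sentence_counts = [
        sum(1 for tok in sent if tok.dep_ in clause_tags)
        for sent in doc.sents
    ]
    if not sentence_counts:
        return 0.0
    return sum(sentence_counts) / len(sentence_counts)


# Possible usage with the objects defined in this notebook (not executed here):
# counsel_df['clauses_per_sentence'] = counsel_df['answer'].apply(
#     lambda text: clauses_per_sentence(nlp(text))
# )
```

Normalizing by sentence count makes the value comparable across answers of different lengths.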

In [45]:
counsel_df

Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,...,sentiment_polarity_question,sentiment_subjectivity_question,vader_sentiment,flesch_reading_ease,flesch_kincaid_grade,gunning_fog,smog_index,ari,average_sentence_length,clause_density
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,...,0.314286,0.469048,-0.9460,47.49,10.4,12.10,13.3,12.0,17.571429,10
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,...,0.314286,0.469048,0.9719,71.44,7.4,9.08,10.3,7.6,16.846154,25
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,...,0.314286,0.469048,0.9847,64.85,10.0,11.19,11.9,11.0,24.777778,23
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,...,0.314286,0.469048,0.7876,93.54,3.1,5.33,5.7,3.1,13.400000,9
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,...,0.314286,0.469048,0.4215,81.63,5.6,8.67,9.5,6.8,17.125000,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,...,0.000000,0.750000,0.8441,46.71,12.8,15.37,15.7,16.2,46.000000,23
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,...,0.000000,0.750000,-0.6065,68.60,8.5,10.80,11.7,10.8,28.600000,34
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,...,0.000000,0.750000,-0.3187,94.56,2.7,5.75,6.7,4.5,23.666667,6
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,...,0.000000,0.750000,0.9307,62.41,10.9,13.17,13.0,14.1,34.142857,25


Average sentence length and clause density capture the structural complexity of therapists' responses in online counseling. Clause density, as computed here, counts clausal subjects (csubj, csubjpass), clausal complements (xcomp, ccomp), adverbial clauses (advcl), adnominal clauses (acl), and conjuncts (conj) in each answer. Together, these metrics describe how elaborate the sentence constructions are. This allows the study to examine the relationship between syntactic complexity and user engagement, and to test the hypothesis that certain syntactic structures facilitate or impede client comprehension and connection in a therapeutic setting. The approach was chosen because therapists' syntactic choices reflect their communicative style, which may influence the effectiveness and accessibility of their online counseling interventions.

### 5.3 T-Unit Analysis
A T-unit is a minimal terminable unit, essentially a main clause with all its subordinate clauses. Analyzing the length and number of T-units can indicate complexity.


In [46]:
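def count_t_units(text):
    doc = nlp(text)
    t_units = 0
    for sent in doc.sents:
        # A token that is its own head is the sentence root, i.e. the head of the main clause.
        # The root never carries the 'conj' label, so this counts one T-unit per sentence;
        # coordinated main clauses are not counted separately.
        main_clauses = [tok for tok in sent if tok.head == tok and tok.dep_ != 'conj']
        t_units += len(main_clauses)
    return t_units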

In [50]:
def calculate_t_unit_complexity(text):
    sentences = list(nlp(text).sents)
    total_t_units = sum(count_t_units(sent.text) for sent in sentences)
    return total_t_units

In [51]:
# Apply the function to each row in the 'answer' column
counsel_df['t_unit_complexity'] = counsel_df['answer'].apply(calculate_t_unit_complexity)
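
T-unit analysis usually treats coordinated main clauses that have their own subject as separate T-units ("He ran and she walked" has two), while coordination with a shared subject ("He ran and jumped") remains a single T-unit. A rough sketch of that refinement follows. It is an assumption-laden illustration rather than the method used for the results in this notebook: the names `count_t_units_with_coordination` and `SUBJECT_LABELS` are hypothetical, and the logic relies on spaCy-style attributes (`dep_`, `head`, `children`) and on the parser attaching coordinated clauses with the `conj` label.

```python
SUBJECT_LABELS = {"nsubj", "nsubjpass", "csubj", "csubjpass"}


def _is_coordinated_main_clause(tok):
    """True if tok heads a clause coordinated with the sentence's main clause."""
    if tok.dep_ != "conj":
        return False
    # A separate T-unit needs its own subject ("He ran and she walked"),
    # unlike a shared-subject verb phrase ("He ran and jumped").
    if not any(child.dep_ in SUBJECT_LABELS for child in tok.children):
        return False
    # Follow the chain of conjuncts upwards; it must end at the sentence root.
    head = tok.head
    while head.dep_ == "conj":
        head = head.head
    return head.dep_ == "ROOT"


def count_t_units_with_coordination(doc):
    """Count one T-unit per sentence root plus one per coordinated main clause.

    doc: a parsed spaCy Doc (or any object with `.sents` yielding tokens
    that expose `.dep_`, `.head` and `.children`).
    """
    total = 0
    for sent in doc.sents:
        total += 1  # the main clause headed by the sentence root
        total += sum(1 for tok in sent if _is_coordinated_main_clause(tok))
    return total


# Possible usage with the objects defined in this notebook (not executed here):
# counsel_df['t_units_coord'] = counsel_df['answer'].apply(
#     lambda text: count_t_units_with_coordination(nlp(text))
# )
```

Sentence fragments without a verb still count as one unit in this sketch, so the result should be read as an approximation that depends on parser accuracy.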

In [52]:
counsel_df

Unnamed: 0,index,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,...,sentiment_subjectivity_question,vader_sentiment,flesch_reading_ease,flesch_kincaid_grade,gunning_fog,smog_index,ari,average_sentence_length,clause_density,t_unit_complexity
0,1,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,...,0.469048,-0.9460,47.49,10.4,12.10,13.3,12.0,17.571429,10,7
1,4,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,...,0.469048,0.9719,71.44,7.4,9.08,10.3,7.6,16.846154,25,13
2,5,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,...,0.469048,0.9847,64.85,10.0,11.19,11.9,11.0,24.777778,23,9
3,6,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,...,0.469048,0.7876,93.54,3.1,5.33,5.7,3.1,13.400000,9,5
4,7,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,...,0.469048,0.4215,81.63,5.6,8.67,9.5,6.8,17.125000,14,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,2744,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,...,0.750000,0.8441,46.71,12.8,15.37,15.7,16.2,46.000000,23,7
2611,2745,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,...,0.750000,-0.6065,68.60,8.5,10.80,11.7,10.8,28.600000,34,13
2612,2746,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,...,0.750000,-0.3187,94.56,2.7,5.75,6.7,4.5,23.666667,6,7
2613,2747,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,...,0.750000,0.9307,62.41,10.9,13.17,13.0,14.1,34.142857,25,8


Incorporating T-unit analysis into the study provides a refined measure of syntactic complexity, enabling a deeper understanding of the structural sophistication in therapists' responses within an online counseling context. This analytical focus on T-units, which represent main clauses along with their subordinate structures, is pivotal for evaluating the linguistic intricacy of therapeutic communication, offering insights into how complex sentence constructions might influence client comprehension and engagement in digital therapeutic interactions.

## Additional analysis for future work
Given the context of our analysis, we might consider incorporating the following additional linguistic features:

1. Pragmatic Markers
    - Speech Acts: Analyze the types of speech acts (e.g., questioning, advising, reassuring) used in the responses, as they can significantly impact the perceived helpfulness or effectiveness of a response.
    - Politeness Strategies: Assess the use of politeness strategies, which can affect the tone and perceived empathy in the responses.
2. Discourse Analysis
    - Coherence and Cohesion: Analyze how ideas are connected and flow within the text. This includes the use of transition words, pronoun reference, and thematic progression.
    - Narrative Structure: Investigate the presence of narrative elements, such as storytelling or the use of anecdotes, which can be impactful in therapy.
3. Lexical Richness
    - Word Frequency Analysis: Beyond lexical diversity, examine the frequency of specific types of words (e.g., therapeutic jargon, emotion words) that could influence engagement.
    - Semantic Analysis: Explore the semantic fields most commonly used (e.g., emotional, cognitive, health-related) to understand thematic focuses in responses.
4. Psycholinguistic Features
    - Emotionally-Charged Language: Analyze the use of language that evokes emotions, as it can play a crucial role in empathy and rapport building.
    - Sensory Language: Assess the use of sensory descriptions, which can make responses more vivid and relatable.
5. Linguistic Style Matching
    - Mirroring Client’s Language: Measure the degree to which therapists’ linguistic style (e.g., vocabulary, syntax) matches that of the clients, which can be an indicator of rapport and alignment.
6. Nonverbal Communication Indicators
    - Punctuation and Formatting: Assess the use of punctuation and formatting (e.g., ellipses, emoticons, paragraph breaks) for their potential to convey nonverbal cues or emotional tone.
7. Topic Modeling
    - Identifying Key Themes: Use algorithms like Latent Dirichlet Allocation (LDA) to identify prevalent topics in the responses, which can help in understanding focus areas in therapy.
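
Of the features listed above, the punctuation and formatting indicators (item 6) are the simplest to prototype, since they need only pattern matching. The sketch below is a hypothetical starting point: the function name `punctuation_features`, the feature names, and both regular expressions are assumptions, and the emoticon pattern is deliberately crude (it covers only a few Western-style emoticons and can produce false positives).

```python
import re

# Crude patterns, chosen for illustration only
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")   # e.g. :) ;-) :D =P
ELLIPSIS_RE = re.compile(r"\.{3,}|…")         # three or more dots, or the ellipsis character
PARAGRAPH_BREAK_RE = re.compile(r"\n\s*\n")   # a blank line between paragraphs


def punctuation_features(text):
    """Count simple punctuation and formatting cues in one answer."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "ellipses": len(ELLIPSIS_RE.findall(text)),
        "emoticons": len(EMOTICON_RE.findall(text)),
        "paragraph_breaks": len(PARAGRAPH_BREAK_RE.findall(text)),
    }


sample = "Dang right! :) Heh heh... correct me if I'm wrong?\n\nTake care."
print(punctuation_features(sample))
```

Each count could be added as a separate column, for instance by expanding the returned dictionary with `pd.Series`, in the same way as the other features in this notebook.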

In [55]:
counsel_df

Unnamed: 0,ID,topic,question,answer,upvotes,views,upvotes_scaled,views_scaled,weighted_engagement,engagement,...,sentiment_subjectivity_question,vader_sentiment,flesch_reading_ease,flesch_kincaid_grade,gunning_fog,smog_index,ari,average_sentence_length,clause_density,t_unit_complexity
0,0,depression,Do I have too many issues for counseling? I ha...,"I've never heard of someone having ""too many i...",3,386,0.166667,0.201888,0.173711,High,...,0.469048,-0.9460,47.49,10.4,12.10,13.3,12.0,17.571429,10,7
1,0,depression,Do I have too many issues for counseling? I ha...,I just want to acknowledge you for the courage...,2,256,0.083333,0.133718,0.093410,High,...,0.469048,0.9719,71.44,7.4,9.08,10.3,7.6,16.846154,25,13
2,0,depression,Do I have too many issues for counseling? I ha...,It's not really a question of whether you have...,2,435,0.083333,0.227583,0.112183,High,...,0.469048,0.9847,64.85,10.0,11.19,11.9,11.0,24.777778,23,9
3,0,depression,Do I have too many issues for counseling? I ha...,There is no such thing as too much. Start wher...,2,217,0.083333,0.113267,0.089320,High,...,0.469048,0.7876,93.54,3.1,5.33,5.7,3.1,13.400000,9,5
4,0,depression,Do I have too many issues for counseling? I ha...,The most direct answer is no. I would venture ...,2,1064,0.083333,0.557420,0.178151,High,...,0.469048,0.4215,81.63,5.6,8.67,9.5,6.8,17.125000,14,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2610,939,counseling-fundamentals,Are some clients more difficult than others? W...,Although many clients have the capacity to be ...,1,47,0.000000,0.024122,0.004824,Low,...,0.750000,0.8441,46.71,12.8,15.37,15.7,16.2,46.000000,23,7
2611,939,counseling-fundamentals,Are some clients more difficult than others? W...,"I usually don't label a client as ""difficult"" ...",1,22,0.000000,0.011012,0.002202,Low,...,0.750000,-0.6065,68.60,8.5,10.80,11.7,10.8,28.600000,34,13
2612,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Dang right! :)Heh heh, and correct me if I'm ...",1,23,0.000000,0.011536,0.002307,Low,...,0.750000,-0.3187,94.56,2.7,5.75,6.7,4.5,23.666667,6,7
2613,939,counseling-fundamentals,Are some clients more difficult than others? W...,"Yes, just like some relationships outside of o...",1,41,0.000000,0.020975,0.004195,Low,...,0.750000,0.9307,62.41,10.9,13.17,13.0,14.1,34.142857,25,8


In [56]:
counsel_df.to_csv('data/counsel_df_lingo.csv', index=False)

## Summary

Text Length Metrics

- Character, word, and sentence length: These metrics provide basic insights into the verbosity and structural breadth of the responses.

Lexical Diversity

- Type-Token Ratio (TTR): Measures the variety of vocabulary used.
- Frequency of modal verbs and personal pronouns: These offer insights into the tone and interpersonal dynamics of the responses.

Concreteness and Imageability

- Average concreteness and imageability scores: These scores, derived from established lexicons, quantify how tangible or visually evocative the language is.

Sentiment Analysis

- Polarity and subjectivity scores: Assessed using tools like TextBlob or VADER, these scores indicate the emotional tone of the responses.

Readability Scores

- Metrics such as Flesch Reading Ease and Gunning Fog Index: These provide an understanding of the textual complexity and accessibility.

Syntactic Complexity

- Average sentence length and clause density: These metrics evaluate the structural complexity of the responses.
- T-unit analysis: Offers deeper insights into the complexity of sentence constructions.