# **Solution**

---


## **PREPARE Challenge - Phase 3**

# **Summary**

---



### This Notebook Covers:
✅ **Component 0. Setup Environment**: Install dependencies, load SpeechCARE model checkpoints & configure paths

✅ **Component 1. Model Initialization & Data Preprocessing**: Initializes the SpeechCARE model from the trained checkpoint and preprocesses the audio data. To run this step on your own data, specify the path to the speaker's audio file and the speaker's age.

✅ **Component 2. SHAP-Highlighted Transcripts**: Visualization of linguistic cues using SHAP (SHapley Additive exPlanations). This process generates SHAP values that quantify the contribution of each token in the text input to the SpeechCARE model's decision.

✅ **Component 3: Linguistic Features of the Transcripts (NEW In Phase 3)**: Extraction of a set of predefined linguistic features from the transcripts. These features provide a method to quantify the transcript with regard to four main categories (Lexical Richness, Syntactic Complexity, Disfluencies and Repetition, and Semantic Coherence and Referential Clarity), which are later used to provide an interpretation of the transcripts.

✅ **Component 4: Transcript Interpretation with Linguistic Features and SHAP Analysis: LLaMA 70B as LLM for Interpretation (IMPROVED IN PHASE 3)**:
The system integrates token-level SHAP values with quantified linguistic features to interpret the transcript across multiple dimensions. First, it analyzes the SHAP values to identify which parts of the text most influenced the model's decision, focusing on lexical items, syntactic patterns, and semantic cues. Then, it incorporates linguistic features categorized into four key areas: Lexical Richness (e.g., diversity of vocabulary), Syntactic Complexity (e.g., use of embedded or coordinated structures), Disfluencies and Repetition (e.g., fillers, pauses, repeated words), and Semantic Coherence and Referential Clarity (e.g., logical flow, clear referents). Finally, both layers of analysis, model-driven (SHAP) and linguistic, are combined to provide a comprehensive interpretation that captures the interplay between surface-level cues and deeper structural and semantic patterns in the transcript.



# **Hardware**

---

**Hardware Requirements**
*  Components 1–4: A minimum of one NVIDIA T4 GPU with 16 GB of VRAM is required.


**Training Time**
*  No training is needed; you only need to load the model checkpoint.

**Inference Time**
*  Components 1–4: Inference time is estimated to be less than 2 minutes for each component.


# **Component 0. Setup Environment**

---

**Install the required Python packages**

This cell installs all Python packages listed in `requirements.txt`. The file contains all necessary dependencies.

- Run this cell only once at the beginning of your session

- If you encounter errors, you may need to restart the runtime after installation


In [None]:
!git clone https://github.com/SpeechCARE/SpeechCARE_Explainability_Framework.git

In [4]:
import sys
# Directory name matches the repository cloned above (the /content prefix assumes Google Colab)
sys.path.append('/content/SpeechCARE_Explainability_Framework')

In [None]:
%cd SpeechCARE_Explainability_Framework

In [None]:
!pip install --upgrade pip
!pip uninstall xformers -y
!pip install -r requirements.txt
!apt update && apt install ffmpeg -y

**Import Necessary Libraries**

This cell imports all required Python libraries and modules. The imports are organized by functionality for better understanding.

In [None]:
# Model and weights management
from model.WeightsManager import WeightManager
from model.ModelWrapper import ModelWrapper

# SHAP analysis tools
from explainability.SHAP.LinguisticShap import LinguisticShap

# Configuration handling
from utils.Config import Config
from utils.Utils import load_yaml_file
from utils.dataset_utils import preprocess_data

# Data processing and visualization
import pandas as pd
import yaml
import torch
from IPython.display import Image, display, Markdown, HTML

# General explainability functionality
from linguistic_module.text_feature_extraction import TextFeatureExtractor
from linguistic_module.text_interpreter import TextInterpreter
from linguistic_module.utils import visualize_linguistic_features, generate_final_linguistic_interpretation_html

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

**OpenRouter API Configuration**

Configure the OpenRouter API connection settings for making requests.

**Configuration Options:**

* `api_key`: Replace 'YOUR_API_KEY' with your actual OpenRouter API key

* `base_url`: The OpenRouter API endpoint (typically doesn't need modification)

In [11]:
openrouter_base_url = 'https://openrouter.ai/api/v1'
open_ai = {'api_key': 'YOUR_API_KEY', 'base_url':openrouter_base_url}
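
As an alternative to pasting the key into the notebook, the same dictionary can be built from an environment variable. The sketch below is only an illustration: the helper `build_openrouter_config` and the variable name `OPENROUTER_API_KEY` are assumptions, not part of the framework.

```python
import os


def build_openrouter_config(env_var="OPENROUTER_API_KEY",
                            base_url="https://openrouter.ai/api/v1"):
    """Return a dict with the same shape as `open_ai`, reading the key from the environment.

    `env_var` is a hypothetical variable name chosen for this example.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell.")
    return {"api_key": api_key, "base_url": base_url}
```

With the variable set, `open_ai = build_openrouter_config()` yields the same structure as the cell above, so the key never has to be saved in the notebook.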

**Download Model Checkpoint**

Download the trained model checkpoints from Google Drive using the `WeightManager` class.


**Configuration Options:**

- `file_id`: Replace with your own Google Drive file ID

- `output_name`: Change to save weights with a different filename

In [None]:
# Initialize the weight manager
weight_manager = WeightManager()

try:
    # Download model weights from Google Drive
    weight_path = weight_manager.download_weights(
        file_id="1-SLBGZRoNGRPHBtWJUDR8WCFpFiVYJM4",  # Google Drive file ID
        output_name="model.pt",                       # Name for downloaded file
    )
    print(f"Success: Weights saved to {weight_path}")

except Exception as e:
    print(f"Error downloading weights: {e}")
    print("Please check:")
    print("- Your internet connection")
    print("- The file ID is correct")
    print("- Google Drive permissions")

**Load and Display Configurations**

Loads and displays the configuration file that controls model behavior.

In [9]:
# Display model configuration
display(Markdown("### Model Configuration"))
with open("data/model_config.yaml") as f:
    model_config = yaml.safe_load(f)
    display(model_config)

# Initialize config objects for the application
config = Config(load_yaml_file("data/model_config.yaml"))

### Model Configuration

{'model_checkpoints': {'HUBERT': 'facebook/hubert-base-ls960',
  'WAV2VEC2': 'facebook/wav2vec2-base-960h',
  'mHuBERT': 'utter-project/mHuBERT-147',
  'MGTEBASE': 'Alibaba-NLP/gte-multilingual-base',
  'WHISPER': 'openai/whisper-large-v3-turbo'},
 'config': {'seed': 133,
  'bs': 4,
  'epochs': 14,
  'lr': '1e-6',
  'hidden_size': 128,
  'wd': '1e-3',
  'integration': 16,
  'num_labels': 3,
  'txt_transformer_chp': 'Alibaba-NLP/gte-multilingual-base',
  'speech_transformer_chp': 'utter-project/mHuBERT-147',
  'segment_size': 5,
  'active_layers': 12,
  'demography': 'age_bin',
  'demography_hidden_size': 128,
  'max_num_segments': 7}}
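
In the output above, `lr` and `wd` appear as quoted strings. PyYAML's `safe_load` reads exponent notation without a decimal point (such as `1e-6`) as a string rather than a float. Whether the framework's `Config` class converts these values is not shown here, so the helper below (`coerce_numeric_strings`, a hypothetical name) is only a sketch of how one might convert them when reading the raw dictionary directly.

```python
def coerce_numeric_strings(settings, keys=("lr", "wd")):
    """Return a copy of `settings` with the listed string values converted to float."""
    converted = dict(settings)
    for key in keys:
        value = converted.get(key)
        if isinstance(value, str):
            converted[key] = float(value)
    return converted


# A small subset of the configuration shown above, used as example input.
example = {"lr": "1e-6", "wd": "1e-3", "bs": 4}
numeric = coerce_numeric_strings(example)
print(type(numeric["lr"]).__name__, type(numeric["wd"]).__name__)
```

Writing the values as `1.0e-6` and `1.0e-3` in the YAML file would be another way to have them parsed as floats.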

# **Component 1: Model Initialization & Data Preprocessing**
---

### **Model Initialization:**
The SpeechCARE model is loaded here from the trained checkpoint.


**Key Components:**

- `ModelWrapper`: Handles model architecture

- `config`: Contains model parameters

- `weight_path`: Location of trained SpeechCARE model checkpoint

In [None]:
# Initialize model wrapper with configuration
wrapper = ModelWrapper(config)

# Load trained weights into the model
model = wrapper.get_model(weight_path)
model.eval()

print("Model successfully initialized with weights from:", weight_path)

### **Data Preprocessing:**

Prepares a specific audio sample by denoising the audio and producing its transcription and age category. You should provide:

- `AUDIO_PATH`: Full path to the audio file; WAV format is recommended (required for an individual sample).
- `AGE`: Age of the speaker as an integer value (required for an individual sample).

In [12]:
AUDIO_PATH = "data/qnvo.mp3"  # Path to the input audio file (update it with your desired audio file path)
ID = "qnvo"
AGE = 72  # Age of the speaker
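
Before running the preprocessing, it can help to check the two inputs early. The following is a minimal sketch under stated assumptions: `check_sample_inputs` is a hypothetical helper, and the accepted suffixes (`.wav`, `.mp3`) are inferred from the text and the example path, not taken from `preprocess_data` itself.

```python
from pathlib import Path


def check_sample_inputs(audio_path, age, allowed_suffixes=(".wav", ".mp3")):
    """Raise a descriptive error if the audio path or age looks unusable."""
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")
    if path.suffix.lower() not in allowed_suffixes:
        raise ValueError(f"Unexpected audio format: {path.suffix}")
    # bool is a subclass of int, so it is rejected explicitly
    if isinstance(age, bool) or not isinstance(age, int) or age <= 0:
        raise ValueError(f"AGE must be a positive integer, got {age!r}")
    return path
```

Calling `check_sample_inputs(AUDIO_PATH, AGE)` before `preprocess_data` surfaces a wrong path or a non-integer age before any audio processing starts.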

In [None]:
# Process audio sample and corresponding age
# Arguments:
#   audio_path: Path to input audio file
#   age: Speaker's age
#   output_dir: Directory to store processed audio file

processed_audio_path, demography_info, transcription = preprocess_data(audio_path=AUDIO_PATH, age=AGE, output_dir='../processed_audio')

# Convert the demography value returned by preprocess_data to a tensor
# and reshape it to (1, 1) for batch-processing compatibility
demography_tensor = torch.tensor(demography_info, dtype=torch.float16).reshape(1, 1)

# --- Word-Level Analysis Setup ---
# Print analysis information for verification
print(f"Analyzing sample: {processed_audio_path}")
print(f"Speaker age category: {demography_info}")


# **Component 2: SHAP-Highlighted Transcripts**


---
Visualization of linguistic cues using SHAP (SHapley Additive exPlanations). This process generates SHAP values that quantify the contribution of each token in the text input to the SpeechCARE model's decision.
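
To make the idea of a per-token contribution concrete, the sketch below computes exact Shapley values for a toy scoring function by averaging each token's marginal contribution over all orderings. It is an illustration only: `toy_score` is a made-up stand-in for a model output, and practical SHAP implementations approximate these values rather than enumerating every ordering.

```python
from itertools import permutations


def exact_shapley(tokens, score):
    """Average each token's marginal contribution to `score` over all orderings."""
    n = len(tokens)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        present = set()
        previous = score([])
        for idx in order:
            present.add(idx)
            current = score([tokens[i] for i in sorted(present)])
            totals[idx] += current - previous
            previous = current
    return [total / len(orderings) for total in totals]


def toy_score(present_tokens):
    """Made-up scoring rule: 'stool' adds 1.0, 'fall' adds 0.5, both together add 0.5 more."""
    value = 0.0
    if "stool" in present_tokens:
        value += 1.0
    if "fall" in present_tokens:
        value += 0.5
    if "stool" in present_tokens and "fall" in present_tokens:
        value += 0.5
    return value


tokens = ["fall", "off", "the", "stool"]
values = exact_shapley(tokens, toy_score)
for token, value in zip(tokens, values):
    print(f"{token}: {value:+.2f}")
```

The values sum to the difference between the score with all tokens and the score with none, which is the property that lets SHAP attribute a prediction to individual tokens.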





#### Initializing the linguistic explainer class


In [14]:
linguistc_explainer = LinguisticShap(model)
print("Linguistic explainer ready for analysis")

Linguistic explainer ready for analysis


#### Run inference to get predicted_label and transcription

In [None]:
print("Running inference to compute predicted_label and transcription.")
predicted_label, probabilities = model.inference(config=config, audio_path=processed_audio_path, demography_info=AGE)

#### Generating SHAP values

In [16]:
shap_values, shap_html_result = linguistc_explainer.get_text_shap_results()

Running SHAP values...
Input text: [" from the cookie jar and he's going to fall off the stool and the sink and the bath and the kitchen is overflowing and the mother is drying the dishes and I see the backyard through the window and what else? The cookie jar, boy and a girl and it's on the top shelf and he's going to fall off that stool and what else? The mom's going to slip on the water if she doesn't be careful. And what else do you want me to tell you? Do you want me to describe everything on that that I see? Oh, that's it, I think. She's drying the dish. Yes, I can. Okay, so the"]
Values explained...
In len 2
mci


#### Highlighting informative linguistic cues





In [17]:
display(HTML(shap_html_result))

# **Component 3: Linguistic Features of the Transcripts (NEW In Phase 3)**


---
Extraction of a set of predefined linguistic features from the transcripts. These features provide a method to quantify the transcript with regard to four main categories (Lexical Richness, Syntactic Complexity, Disfluencies and Repetition, and Semantic Coherence and Referential Clarity), which are later used to provide an interpretation of the transcripts.



#### Initializing the linguistic feature extractor class


In [18]:
linguistc_feature_extractor = TextFeatureExtractor()
print("Linguistic Feature Extractor ready for analysis")

Linguistic Feature Extractor ready for analysis


#### Run inference to get predicted_label and transcription

In [None]:
print("Running inference to compute predicted_label and transcription.")
predicted_label, probabilities = model.inference(config=config, audio_path=processed_audio_path, demography_info=AGE)

#### Generate Linguistic Features


In [None]:
linguistc_feature = linguistc_feature_extractor.extract_all_features(text=model.transcription, audio_path=processed_audio_path, save_path="")
linguistc_feature = linguistc_feature.loc[0, "Content_Density":].to_dict()  # Select only the columns with the linguistic features

#### Displaying Linguistic Features

In [22]:
linguistic_features_html = visualize_linguistic_features(linguistc_feature)

In [23]:
display(HTML(linguistic_features_html))

| Feature | Value | Normal Range | Interpretation |
| --- | --- | --- | --- |
| Type-Token Ratio (TTR) | 0.475 | 0-1 | LOW: repetitive; HIGH: diverse |
| Root Type-Token Ratio (RTTR) | 5.251 | 2.0-8.0 (Guiraud's Index) | LOW: simple vocab; HIGH: varied vocab |
| Corrected Type-Token Ratio (CTTR) | 3.713 | 1.5-5.0 (Carroll's CTTR) | LOW: restricted vocab; HIGH: rich vocab |
| Brunet's Index | 11.685 | ~10-100 | LOW: diverse; HIGH: limited vocab |
| Honoré's Statistic | 1272.505 | ~0-2000 | LOW: low richness; HIGH: high richness |
| Measure of Textual Lexical Diversity (MTLD) | 23.907 | ~10-150 | LOW: limited vocab; HIGH: stable diversity |
| Hypergeometric Distribution Diversity (HDD) | 1.0 | 0-1 | LOW: low diversity; HIGH: diverse vocab |
| Ratio unique word count to total word count | 0.475 | 0-1 | LOW: repetition; HIGH: variety |
| Unique Word count | 58.0 | 10-∞ | LOW: restricted vocab; HIGH: lexical richness |
| Lexical frequency | 5.358 | 0-∞ | LOW: rare words; HIGH: frequent/common words |

| Feature | Value | Normal Range | Interpretation |
| --- | --- | --- | --- |
| Part_of_Speech_rate | 0.8 | 0-1 | LOW: reduced variation; HIGH: balanced grammar |
| Relative_pronouns_rate | 0.0 | 0-1 | LOW: simple syntax; HIGH: complex clauses |
| Determiners Ratio | 0.131 | 0-1 | LOW: vague; HIGH: clear reference |
| Verbs Ratio | 0.131 | 0-1 | LOW: static speech; HIGH: dynamic structure |
| Nouns Ratio | 0.156 | 0-1 | LOW: low content; HIGH: info-dense |
| Negative_adverbs_rate | 0.0 | 0-1 | LOW: less negation; HIGH: complex expression |
| Word count | 122.0 | 10-∞ | LOW: brevity; HIGH: verbosity/planning |

| Feature | Value | Normal Range | Interpretation |
| --- | --- | --- | --- |
| Speech rate (wps) | 4.067 | 2.3-3.3 wps | LOW: slowed cognition; HIGH: normal/pressured |
| Consecutive repeated clauses count | 2.0 | 0-∞ | LOW: flexible; HIGH: perseveration |

| Feature | Value | Normal Range | Interpretation |
| --- | --- | --- | --- |
| Content_Density | 0.456 | 0-1 | LOW: vague; HIGH: info-rich |
| Reference_Rate_to_Reality (noun-to-verb ratio) | 1.187 | 0-∞ | LOW: abstract; HIGH: concrete info |
| Pronouns Ratio | 0.18 | 0-1 | LOW: specific; HIGH: ambiguous |
| Definite_articles Ratio | 0.123 | 0-1 | LOW: vague; HIGH: specific reference |
| Indefinite_articles Ratio | 0.008 | 0-1 | LOW: specific; HIGH: general |
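
Several of the lexical-richness values reported above are consistent with the standard definitions of these measures when applied to the reported word count (122) and unique word count (58). The sketch below recomputes them from those two counts. Whether `TextFeatureExtractor` uses exactly these formulas is an assumption; the exponent 0.165 is the constant commonly used for Brunet's index.

```python
import math


def lexical_richness(total_words, unique_words):
    """Compute a few count-based lexical richness measures from two counts."""
    return {
        "TTR": unique_words / total_words,                    # types / tokens
        "RTTR": unique_words / math.sqrt(total_words),        # Guiraud's index
        "CTTR": unique_words / math.sqrt(2 * total_words),    # Carroll's corrected TTR
        "Brunet": total_words ** (unique_words ** -0.165),    # Brunet's index
    }


metrics = lexical_richness(total_words=122, unique_words=58)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because TTR depends on transcript length, the length-corrected measures (RTTR, CTTR) are generally easier to compare across speakers who produce different amounts of speech.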


# **Component 4: Transcript Interpretation with Linguistic Features and SHAP Analysis: LLaMA 70B as LLM for Interpretation (IMPROVED IN PHASE 3 using Tree of Thought Reasoning)**
---
The system integrates token-level SHAP values with quantified linguistic features to interpret the transcript across multiple dimensions. The final interpretation is generated with Tree of Thoughts prompting. This approach provides a deep interpretation of the linguistic features (Lexical Richness, Syntactic Complexity, Disfluencies and Repetition, and Semantic Coherence and Referential Clarity) associated with cognitive impairment. The details are as follows:



#### Initializing the Transcription Interpreter class


In [24]:
interpreter = TextInterpreter(openai_config = open_ai)

#### Run inference to get predicted_label, transcription, and probabilities

In [None]:
print("Running inference to compute predicted_label and transcription.")
predicted_label, probabilities = model.inference(config=config, audio_path=processed_audio_path, demography_info=AGE)

#### Generate the Interpretation of the Transcription

In [26]:
result_linguistic = interpreter.get_all_interpretations(model.transcription,predicted_label, shap_values, linguistc_feature, probabilities)
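
As a rough illustration of how token-level SHAP values can be condensed into text for an LLM, the sketch below ranks tokens by absolute SHAP value and formats the top few as a prompt fragment. This is not the framework's actual prompt: the helper names and the numbers are hypothetical, and `TextInterpreter` may organize its inputs differently.

```python
def top_tokens_by_abs_shap(tokens, values, k=3):
    """Return the k (token, value) pairs with the largest absolute SHAP value."""
    ranked = sorted(zip(tokens, values), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]


def format_for_prompt(ranked):
    """Render ranked (token, value) pairs as a short bulleted text fragment."""
    lines = [f"- {token!r}: {value:+.3f}" for token, value in ranked]
    return "Most influential tokens (by |SHAP|):\n" + "\n".join(lines)


# Made-up tokens and SHAP values, for illustration only.
example_tokens = ["cookie", "jar", "what", "else", "stool"]
example_values = [0.12, 0.03, -0.20, -0.18, 0.05]

ranked = top_tokens_by_abs_shap(example_tokens, example_values)
print(format_for_prompt(ranked))
```

Ranking by absolute value keeps tokens that push the prediction in either direction, which matters when the interpretation should cover evidence both for and against the predicted label.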

#### **Step 1. SHAP-values Analysis:** It analyzes the SHAP values to identify which parts of the text most influenced the model’s decision, focusing on lexical items, syntactic patterns, and semantic cues.

In [27]:
display(Markdown(result_linguistic[1]))

Here is the analysis of the text in terms of the provided linguistic features:

• **Lexical Richness**: The speaker uses simple vocabulary and repeats words ("cookie jar", "stool", "what else"), indicating potential word-finding issues or overuse of familiar words. This feature supports the model's prediction of cognitive impairment.

• **Syntactic Complexity**: The speaker's sentence structure is simplified, with short, fragmented sentences and a lack of complex grammar. This feature also supports the model's prediction.

• **Disfluencies and Repetition**: The speaker frequently uses fillers ("and", "what else") and repeats words, indicating disfluency. This feature is consistent with cognitive impairment.

• **Semantic Coherence**: The speaker's ideas are somewhat disorganized, jumping between different elements of the scene (e.g., the boy on the stool, the mother drying dishes, the backyard). This feature supports the model's prediction.

• **Difficulty with Spatial Reasoning and Visualization**: The speaker has trouble describing the spatial relationships between objects (e.g., "it's on the top shelf", "the kitchen is overflowing"). This feature is indicative of cognitive impairment.

• **Impaired Executive Function**: The speaker's speech is disorganized, with abrupt topic changes and a lack of clear sequencing. This feature strongly supports the model's prediction.

• **Additional Feature: Lack of Descriptive Detail**: The speaker fails to provide detailed descriptions of the scene, relying on vague references (e.g., "the kitchen is overflowing"). This feature is consistent with cognitive impairment.

Overall, the speaker's language patterns suggest difficulties with word-finding, sentence structure, and spatial reasoning, as well as disorganized and repetitive speech. These features collectively support the model's prediction of cognitive impairment with a confidence level of 0.7129285931587219.

#### **Step 2. Linguistic Features Analysis:** It incorporates linguistic features grouped into four categories: Lexical Richness (e.g., diversity of vocabulary), Syntactic Complexity (e.g., simple versus complex sentence structure), Disfluencies and Repetition (e.g., fillers, pauses, repeated words), and Semantic Coherence and Referential Clarity (e.g., logical flow, clear **pronoun referents**).

In [28]:
display(Markdown(result_linguistic[3]))


**Analysis of Linguistic Features in Relation to Cognitive Status**

Based on the provided text and linguistic features, here is a concise analysis of the speaker's language use and its implications for cognitive status:

* **Lexical Richness:** The speaker's vocabulary is somewhat restricted, as indicated by low values in Type-Token Ratio (TTR), Root Type-Token Ratio (RTTR), and Corrected Type-Token Ratio (CTTR). This may suggest word-finding difficulties or lexical retrieval deficits. However, the Unique Word Count is relatively high (58), indicating some degree of lexical richness.
* **Syntactic Complexity:** The speaker's syntax is relatively simple, with low values in Part_of_Speech_rate and Relative_pronouns_rate. This may indicate reduced structural variety and sentence planning abilities.
* **Disfluencies and Repetition:** The speaker exhibits some disfluencies, with a Consecutive Repeated Clauses Count of 2. This may reflect planning difficulties and reduced cognitive flexibility.
* **Semantic Coherence and Referential Clarity:** The speaker's language is somewhat vague, with low values in Content_Density and Reference_Rate_to_Reality. This may indicate impaired semantic organization and discourse tracking.

**Model's Prediction and Confidence:**

The machine learning model predicts cognitive impairment with a confidence of 0.7129285931587219. Based on the analysis of linguistic features, this prediction is supported by the speaker's restricted vocabulary, simple syntax, disfluencies, and vague language.

**Key Aspects of the Analysis:**

* The speaker's language is characterized by a mix of restricted and rich vocabulary, suggesting some degree of lexical retrieval deficits.
* Simple syntax and disfluencies may indicate reduced cognitive flexibility and planning difficulties.
* Vague language and low semantic coherence may reflect impaired semantic organization and discourse tracking.
* The model's prediction of cognitive impairment is supported by these linguistic features, which are consistent with cognitive decline.

Overall, the analysis suggests that the speaker's language use is consistent with cognitive impairment, particularly in terms of lexical richness, syntactic complexity, and semantic coherence.

#### **Step 3. Reasoning:** Using the Tree of Thoughts technique, the outputs of Step 1 and Step 2 are given to the LLaMA model to capture the interplay between the SHAP-based interpretation (surface-level cues) and the linguistic-feature interpretation (deeper structural and semantic patterns) in the transcript.

In [29]:
display(Markdown(result_linguistic[5]))


**Unified Analysis of Cognitive Status**

Based on the two expert interpretations, the following key aspects of the analysis describe the linguistic features of the passage and their implications for cognitive status:

• **Lexical Richness:** The speaker's vocabulary is somewhat restricted, with low values in Type-Token Ratio (TTR) and other lexical richness metrics. However, the Unique Word Count is relatively high, indicating some degree of lexical richness. This mixed pattern may suggest word-finding difficulties or lexical retrieval deficits, which are consistent with cognitive impairment.

• **Syntactic Complexity:** The speaker's sentence structure is simplified, with short, fragmented sentences and a lack of complex grammar. This feature supports the model's prediction of cognitive impairment and is consistent with reduced structural variety and sentence planning abilities.

• **Disfluencies and Repetition:** The speaker exhibits disfluencies, including fillers ("and", "what else") and repeated words. This feature is consistent with cognitive impairment and may reflect planning difficulties and reduced cognitive flexibility.

• **Semantic Coherence:** The speaker's ideas are somewhat disorganized, jumping between different elements of the scene. This feature supports the model's prediction of cognitive impairment and is consistent with impaired semantic organization and discourse tracking.

• **Difficulty with Spatial Reasoning and Visualization:** The speaker has trouble describing the spatial relationships between objects, which is indicative of cognitive impairment.

• **Impaired Executive Function:** The speaker's speech is disorganized, with abrupt topic changes and a lack of clear sequencing. This feature strongly supports the model's prediction of cognitive impairment.

• **Additional Feature: Lack of Descriptive Detail:** The speaker fails to provide detailed descriptions of the scene, relying on vague references. This feature is consistent with cognitive impairment and may reflect impaired semantic organization and discourse tracking.

Overall, the analysis suggests that the speaker's language use is consistent with cognitive impairment, characterized by restricted vocabulary, simple syntax, disfluencies, and vague language. The model's prediction of cognitive impairment is supported by these linguistic features, which are consistent with cognitive decline.

#### **Step 4. Summarization:** It provides a brief, clear, and human-understandable interpretation of the four linguistic categories.

In [30]:
linguistic_interpretation_html = generate_final_linguistic_interpretation_html(result_linguistic[-1])

In [31]:
display(HTML(linguistic_interpretation_html))