Mental Health Conversation Analysis

This project analyzes conversations between users and psychologists related to mental health topics using Natural Language Processing (NLP) techniques. The analysis aims to identify patterns, extract insights, and develop models that could potentially support automated mental health guidance systems.

Dataset Source

The dataset used in this project is sourced from the Mental Health Counseling Conversations dataset on Hugging Face. This dataset contains:

  • 3,512 conversations between users and psychologists
  • Questions covering a wide range of mental health topics
  • Professional responses from qualified psychologists
  • Anonymized data with no personally identifiable information

The dataset is particularly valuable for:

  • Training and fine-tuning language models for mental health advice
  • Analyzing patterns in mental health conversations
  • Developing automated mental health guidance systems

Features

  • Text Preprocessing: Cleaning, tokenization, and lemmatization of conversation data
  • Sentiment Analysis: Analysis of emotional content in both user questions and professional responses
  • Topic Modeling: Latent Dirichlet Allocation (LDA) to identify key topics in conversations
  • Text Embeddings: Multiple embedding approaches including:
    • Word2Vec
    • LDA-based embeddings
    • BERT embeddings (both pre-trained and fine-tuned)
  • Document Similarity: Comparison of different similarity models
  • Response Generation: Framework for generating response recommendations based on similar conversations

Requirements

R Packages

install.packages(c(
  "jsonlite",
  "tidyverse",
  "tidytext",
  "wordcloud",
  "tm",
  "topicmodels",
  "text2vec",
  "sentimentr",
  "ggplot2",
  "textdata",
  "bit",
  "reticulate"
))

Python Dependencies

pip install sentence-transformers torch pandas scikit-learn datasets numpy tqdm

Data Structure

The dataset contains two main columns:

  • Context: Questions or concerns expressed by users about mental health issues
  • Response: Professional responses from psychologists
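To make the two-column structure concrete, here is a small Python sketch with hypothetical rows, computing the word-count comparison that the text-statistics component performs (the project itself does this in R):

```python
# Hypothetical Context/Response rows mirroring the dataset's structure,
# with a per-column word count like the project's text statistics.
import pandas as pd

df = pd.DataFrame({
    "Context": [
        "I feel anxious all the time.",
        "I can't sleep at night.",
    ],
    "Response": [
        "Anxiety is common; grounding exercises and routine can help.",
        "Good sleep hygiene, such as a fixed bedtime, often improves rest.",
    ],
})

# Word counts per column, for comparing question vs. response lengths
df["context_words"] = df["Context"].str.split().str.len()
df["response_words"] = df["Response"].str.split().str.len()
print(df[["context_words", "response_words"]])
```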

Analysis Components

  1. Text Statistics

    • Word count distribution
    • Text length analysis
    • Comparison of question vs. response lengths
  2. Sentiment Analysis

    • Basic sentiment scoring
    • Emotional dimensions analysis
    • Comparison of sentiment between questions and responses
  3. Topic Modeling

    • Optimal topic number determination
    • Topic distribution analysis
    • Key term extraction for each topic
  4. Text Embeddings

    • Word2Vec implementation
    • LDA-based embeddings
    • BERT embeddings (with fine-tuning capability)
  5. Model Comparison

    • Performance evaluation of different similarity models
    • Runtime comparison
    • Success rate analysis
  6. Response Generation

    • Similar conversation identification
    • Response recommendation system
    • Multiple model support

Usage

  1. Data Loading

# Load the JSON dataset (stream_in comes from the jsonlite package)
library(jsonlite)
json_data <- stream_in(file("combined_dataset.json"))
mental_health_df <- as.data.frame(json_data)

  2. Text Preprocessing

# Clean and tokenize text
mental_health_df$clean_context <- sapply(mental_health_df$Context, clean_text)
mental_health_df$clean_response <- sapply(mental_health_df$Response, clean_text)

  3. Response Generation

# Generate response recommendations
query <- "I feel anxious all the time and can't focus on my work. What should I do?"
recommendations <- generate_response_recommendations(query)

Model Performance

The project compares three main similarity models:

  1. Word2Vec
  2. LDA-based
  3. BERT (pre-trained or fine-tuned)

Each model is evaluated based on:

  • Average runtime
  • Top similarity scores
  • Mean similarity scores
  • Success rate
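The evaluation loop can be sketched as follows: time each model over a set of queries and record its top and mean similarity scores. This Python version uses plain TF-IDF cosine similarity as the "model" and invented texts; the project's actual comparison covers Word2Vec, LDA, and BERT in R.

```python
# Illustrative evaluation loop: average runtime per query plus
# top/mean similarity scores for one similarity model.
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["anxious and stressed", "trouble sleeping", "family conflict"]
queries = ["feeling anxious", "cannot sleep"]

vec = TfidfVectorizer()
matrix = vec.fit_transform(corpus)

start = time.perf_counter()
tops, means = [], []
for q in queries:
    scores = cosine_similarity(vec.transform([q]), matrix)[0]
    tops.append(scores.max())    # best match for this query
    means.append(scores.mean())  # average over the whole corpus
runtime = (time.perf_counter() - start) / len(queries)

print(runtime, sum(tops) / len(tops), sum(means) / len(means))
```

Success rate would additionally require labeled query/answer pairs, which this toy setup omits.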

Notes

  • The BERT model can be fine-tuned on the specific mental health conversation dataset
  • The system automatically selects the best performing model for response generation
  • All models include duplicate detection to ensure diverse recommendations

About

One of the projects completed during my studies at DABE.
