This project analyzes conversations between users and psychologists related to mental health topics using Natural Language Processing (NLP) techniques. The analysis aims to identify patterns, extract insights, and develop models that could potentially support automated mental health guidance systems.
The dataset used in this project is sourced from the Mental Health Counseling Conversations dataset on Hugging Face. This dataset contains:
- 3,512 conversations between users and psychologists
- Questions covering a wide range of mental health topics
- Professional responses from qualified psychologists
- All data is anonymized and contains no personally identifiable information
The dataset is particularly valuable for:
- Training and fine-tuning language models for mental health advice
- Analyzing patterns in mental health conversations
- Developing automated mental health guidance systems
The project applies the following NLP techniques:

- Text Preprocessing: Cleaning, tokenization, and lemmatization of conversation data
- Sentiment Analysis: Analysis of emotional content in both user questions and professional responses
- Topic Modeling: Latent Dirichlet Allocation (LDA) to identify key topics in conversations
- Text Embeddings: Multiple embedding approaches, including:
  - Word2Vec
  - LDA-based embeddings
  - BERT embeddings (both pre-trained and fine-tuned)
- Document Similarity: Comparison of different similarity models
- Response Generation: Framework for generating response recommendations based on similar conversations
Install the required R packages:

```r
install.packages(c(
  "jsonlite",
  "tidyverse",
  "tidytext",
  "wordcloud",
  "tm",
  "topicmodels",
  "text2vec",
  "sentimentr",
  "ggplot2",
  "textdata",
  "bit",
  "reticulate"
))
```

The BERT components run through Python (via reticulate) and require these Python packages:

```bash
pip install sentence-transformers torch pandas scikit-learn datasets numpy tqdm
```

The dataset contains two main columns:
- Context: Questions or concerns expressed by users about mental health issues
- Response: Professional responses from psychologists
The analysis covers the following stages; minimal sketches of the first three follow the list.

- Text Statistics
  - Word count distribution
  - Text length analysis
  - Comparison of question vs. response lengths
- Sentiment Analysis
  - Basic sentiment scoring
  - Emotional dimensions analysis
  - Comparison of sentiment between questions and responses
- Topic Modeling
  - Optimal topic number determination
  - Topic distribution analysis
  - Key term extraction for each topic
- Text Embeddings
  - Word2Vec implementation
  - LDA-based embeddings
  - BERT embeddings (with fine-tuning capability)
- Model Comparison
  - Performance evaluation of different similarity models
  - Runtime comparison
  - Success rate analysis
- Response Generation
  - Similar conversation identification
  - Response recommendation system
  - Multiple model support
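To make these stages concrete, here are a few minimal sketches. They assume the `mental_health_df` data frame built in the usage examples below. First, basic text statistics with `stringr` (installed as part of the tidyverse); the derived column names are illustrative:

```r
library(stringr)

# Word counts for user questions and professional responses
mental_health_df$context_words  <- str_count(mental_health_df$Context, "\\S+")
mental_health_df$response_words <- str_count(mental_health_df$Response, "\\S+")

# Compare the two length distributions
summary(mental_health_df$context_words)
summary(mental_health_df$response_words)
```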
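The sentiment stage can be sketched with `sentimentr` from the install list; `sentiment_by()` averages sentence-level scores per conversation:

```r
library(sentimentr)

# Score sentiment at the sentence level, then average per conversation
context_sent  <- sentiment_by(get_sentences(mental_health_df$Context))
response_sent <- sentiment_by(get_sentences(mental_health_df$Response))

# Compare average sentiment of user questions vs. professional responses
mean(context_sent$ave_sentiment)
mean(response_sent$ave_sentiment)
```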
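And the topic-modeling stage with `tm` plus `topicmodels`; `k = 10` is a placeholder here, since the project determines the topic count empirically (`clean_context` is created in the preprocessing step below):

```r
library(tm)
library(topicmodels)

# Build a document-term matrix from the cleaned user questions
corpus <- VCorpus(VectorSource(mental_health_df$clean_context))
dtm    <- DocumentTermMatrix(corpus)

# Drop documents left empty after cleaning (slam is installed with tm)
dtm <- dtm[slam::row_sums(dtm) > 0, ]

# Fit LDA and inspect the key terms per topic
lda_model <- LDA(dtm, k = 10, control = list(seed = 42))
terms(lda_model, 5)  # five top terms for each topic
```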
Typical usage proceeds in three steps:

- Data Loading

```r
# Load the line-delimited JSON dataset into a data frame
library(jsonlite)
json_data <- stream_in(file("combined_dataset.json"))
mental_health_df <- as.data.frame(json_data)
```

- Text Preprocessing
```r
# Clean and tokenize text
mental_health_df$clean_context  <- sapply(mental_health_df$Context, clean_text)
mental_health_df$clean_response <- sapply(mental_health_df$Response, clean_text)
```
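`clean_text()` is defined in the project code rather than shown here. As a minimal sketch of what such a helper might look like, using the `tm` package from the install list (the project's actual implementation may also lemmatize):

```r
library(tm)

# Hypothetical sketch of a basic cleaning helper; the project's actual
# clean_text() may differ (e.g., by adding lemmatization)
clean_text <- function(text) {
  text <- tolower(text)                       # normalize case
  text <- removePunctuation(text)             # strip punctuation
  text <- removeNumbers(text)                 # strip digits
  text <- removeWords(text, stopwords("en"))  # drop common stop words
  stripWhitespace(trimws(text))               # collapse extra whitespace
}
```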
- Response Generation

```r
# Generate response recommendations for a new user query
query <- "I feel anxious all the time and can't focus on my work. What should I do?"
recommendations <- generate_response_recommendations(query)
```
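`generate_response_recommendations()` is the project's own wrapper. As a rough sketch of the underlying retrieval idea (not the project's actual implementation), assuming a precomputed embedding matrix `context_embeddings` and an `embed_query()` helper, both hypothetical names:

```r
# Hypothetical sketch: recommend responses from the most similar stored questions
generate_response_recommendations <- function(query, top_n = 3) {
  query_vec <- embed_query(query)  # hypothetical: embeds with the selected model

  # Cosine similarity between the query and every stored question embedding
  sims <- as.vector(context_embeddings %*% query_vec) /
    (sqrt(rowSums(context_embeddings^2)) * sqrt(sum(query_vec^2)))

  best <- order(sims, decreasing = TRUE)[seq_len(top_n)]
  data.frame(
    similarity = sims[best],
    response   = mental_health_df$Response[best]
  )
}
```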
The project compares three main similarity models:

- Word2Vec
- LDA-based
- BERT (pre-trained or fine-tuned)
Each model is evaluated based on:
- Average runtime
- Top similarity scores
- Mean similarity scores
- Success rate
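A minimal sketch of what such an evaluation loop might look like, assuming each model exposes a `find_similar(query)` function that returns a similarity score per stored conversation (a hypothetical interface), with "success" arbitrarily defined as a top score above a threshold:

```r
# Hypothetical evaluation harness comparing similarity models
evaluate_model <- function(find_similar, queries, success_threshold = 0.5) {
  runtimes  <- numeric(length(queries))
  top_sims  <- numeric(length(queries))
  mean_sims <- numeric(length(queries))

  for (i in seq_along(queries)) {
    runtimes[i]  <- system.time(sims <- find_similar(queries[i]))["elapsed"]
    top_sims[i]  <- max(sims)
    mean_sims[i] <- mean(sims)
  }

  data.frame(
    avg_runtime  = mean(runtimes),
    top_sim      = mean(top_sims),
    mean_sim     = mean(mean_sims),
    success_rate = mean(top_sims > success_threshold)
  )
}
```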
Additional notes:

- The BERT model can be fine-tuned on the specific mental health conversation dataset
- The system automatically selects the best performing model for response generation
- All models include duplicate detection to ensure diverse recommendations