# Mental Health Text Analysis Pipeline

This notebook analyzes mental health-related text (Reddit posts with precomputed readability and TF-IDF features) using NLP techniques and API services such as OpenAI.
A pilot test section validates the pipeline on a small subset before the entire dataset is processed.

In [2]:
# Import necessary libraries
import os
import csv
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import time
from typing import Dict, List, Any, Optional
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

## Load Environment Variables for API Keys

We'll load API keys from environment variables to keep them secure.

In [3]:
# Load environment variables from .env file
load_dotenv()

# API Keys (loaded from environment variables)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# Add any other API keys as needed
# OTHER_API_KEY = os.getenv("OTHER_API_KEY")

# Validate API keys
def validate_api_keys():
    """Validate that required API keys are available."""
    if not OPENAI_API_KEY:
        logger.warning("OpenAI API key not found in environment variables.")
        print("⚠️ Warning: OpenAI API key not found in environment variables.")
        # You can decide whether to raise an exception or continue with limited functionality
    else:
        print("✅ OpenAI API key loaded successfully.")
    
    # Add validation for other API keys as needed

validate_api_keys()

✅ OpenAI API key loaded successfully.
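
If the pipeline should stop rather than continue without a key, a stricter check might look like the following. This is only a minimal sketch: `require_env` is a hypothetical helper that does not exist in this notebook, and `EXAMPLE_API_KEY` is a throwaway variable name used so the example runs anywhere.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Required environment variable {name} is not set.")
    return value


# Demo with a made-up variable so the sketch does not depend on a real key
os.environ["EXAMPLE_API_KEY"] = "sk-example"
example_key = require_env("EXAMPLE_API_KEY")
```

Raising early keeps a long run from finishing with every analysis field empty because the key was never loaded.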


## Configuration

Set up the configuration for the pipeline.

In [5]:
# Configuration
DATA_PATH = "data/mentalhealth_post_features_tfidf_256.csv"
RESULTS_DIR = "full_results"
PILOT_SAMPLE_SIZE = 5

# Create results directory if it doesn't exist
if not os.path.exists(RESULTS_DIR):
    os.makedirs(RESULTS_DIR)
    print(f"Created results directory: {RESULTS_DIR}")
else:
    print(f"Results directory already exists: {RESULTS_DIR}")

Created results directory: full_results


## Data Loading

Load the data from the CSV file.

In [6]:
def load_data(data_path=DATA_PATH):
    """Load data from CSV file."""
    print(f"Loading data from {data_path}")
    try:
        df = pd.read_csv(data_path)
        print(f"Loaded {len(df)} records")
        return df
    except Exception as e:
        print(f"Error loading data: {e}")
        raise

# Load the data
df = load_data()

# Display the first few rows to understand the structure
print("\nDataset preview:")
df.head()

Loading data from data/mentalhealth_post_features_tfidf_256.csv
Loaded 13514 records

Dataset preview:


Unnamed: 0,subreddit,author,date,post,automated_readability_index,coleman_liau_index,flesch_kincaid_grade_level,flesch_reading_ease,gulpease_index,gunning_fog_index,...,tfidf_wish,tfidf_without,tfidf_wonder,tfidf_work,tfidf_worri,tfidf_wors,tfidf_would,tfidf_wrong,tfidf_x200b,tfidf_year
0,mentalhealth,Autumfire117,2020/01/01,"Not depressed or suicidal, yet the thought of ...",6.283511,5.687673,6.568202,80.795116,66.002584,9.955408,...,0.0,0.139581,0.0,0.0,0.0,0.0,0.051431,0.0,0.0,0.093246
1,mentalhealth,elf_boy_,2020/01/01,How I Barely Survived the Last Decade Trigger ...,4.953877,6.13965,5.820522,78.808202,69.141398,9.052063,...,0.0,0.0,0.0,0.0,0.0,0.0,0.020834,0.096198,0.0,0.245525
2,mentalhealth,mcks02,2020/01/01,Coping skills I was wondering if anyone had an...,0.919777,2.657734,3.307321,90.400864,80.95122,6.22948,...,0.0,0.0,0.137643,0.092402,0.0,0.0,0.0,0.0,0.0,0.081734
3,mentalhealth,IAndrOwS,2020/01/01,Overcoming a Mental Illness is Like Trying to ...,1.398685,4.035708,2.814915,88.778398,87.263069,5.48193,...,0.0,0.0,0.0,0.17977,0.0,0.062912,0.087707,0.0,0.0,0.079508
4,mentalhealth,Nyteblk,2020/01/01,Sooo I need your help I’m going to lead with w...,5.030568,5.675974,6.477938,76.127234,68.640288,9.055476,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.258315,0.0,0.152145


## Data Preprocessing

Preprocess the data before analysis.

In [7]:
def preprocess_data(df):
    """Preprocess the data."""
    print("Preprocessing data...")
    
    # Add your preprocessing steps here
    # Example: df = df.dropna(subset=['post'])  # the text column in this dataset is 'post'
    
    print("Preprocessing complete.")
    return df

# We'll apply preprocessing in the pilot and full run sections
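
As one possible way to fill in the preprocessing step, the sketch below drops rows without usable text and normalizes whitespace. It assumes the text lives in the `post` column (as in the dataset preview); `preprocess_posts` and the demo frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def preprocess_posts(df: pd.DataFrame, text_column: str = "post") -> pd.DataFrame:
    """Drop rows without usable text and normalize whitespace in the text column."""
    # Work on a copy so the caller's DataFrame is left untouched
    out = df.dropna(subset=[text_column]).copy()
    # Collapse runs of whitespace (including newlines) and trim both ends
    out[text_column] = (
        out[text_column].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # Remove rows that are empty after cleaning; the original index is kept
    # so that row ids remain traceable to the source file
    return out[out[text_column] != ""]


# Made-up demo data: one real post, one missing value, one blank string
demo = pd.DataFrame({"post": ["  I feel\n\nbetter today ", None, "   "]})
cleaned = preprocess_posts(demo)
```

Keeping the original index is a deliberate choice here, since later steps may use it as a record id.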

## Text Analysis Function

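Define the function that analyzes a single text entry. The API call is still a placeholder (see the TODO), so every field keeps its empty default until it is implemented.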

In [9]:
def analyze_text(text):
    """Analyze a single text entry."""
    
    result = {
        "sentiment": None,
        "topics": [],
        "mental_health_indicators": [],
        "risk_assessment": None,
    }
    
    if OPENAI_API_KEY:
        # TODO: replace with actual API calls
        # result["sentiment"] = call_openai_api(text)
        pass
    
    return result
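
The placeholder could be filled in along the following lines. This is a sketch under stated assumptions, not the notebook's actual implementation: it assumes the model is asked to reply with a JSON object using the same four keys, and `complete` is a hypothetical callable (prompt in, response string out) standing in for whichever API client is used. No real API call is made; `fake_complete` returns a fixed, made-up reply.

```python
import json
from typing import Any, Callable, Dict

EMPTY_RESULT: Dict[str, Any] = {
    "sentiment": None,
    "topics": [],
    "mental_health_indicators": [],
    "risk_assessment": None,
}

PROMPT_TEMPLATE = (
    "Analyze the following post and reply with a JSON object containing the keys "
    "sentiment, topics, mental_health_indicators, and risk_assessment.\n\nPost:\n{text}"
)


def analyze_text_with(complete: Callable[[str], str], text: str) -> Dict[str, Any]:
    """Send one post to `complete` and merge its JSON reply into the result schema."""
    # Start from a fresh copy of the defaults (lists are copied, not shared)
    result = {key: (list(value) if isinstance(value, list) else value)
              for key, value in EMPTY_RESULT.items()}
    try:
        parsed = json.loads(complete(PROMPT_TEMPLATE.format(text=text)))
    except json.JSONDecodeError:
        return result  # Unparseable reply: fall back to the empty result
    if isinstance(parsed, dict):
        # Keep only the expected keys so stray fields never reach the CSV
        result.update({k: v for k, v in parsed.items() if k in EMPTY_RESULT})
    return result


def fake_complete(prompt: str) -> str:
    """Stand-in for a real API client; returns a fixed, made-up reply."""
    return json.dumps({"sentiment": "negative", "topics": ["sleep"], "extra": 1})


demo_result = analyze_text_with(fake_complete, "I can't sleep lately.")
```

Passing the client in as a callable keeps the parsing logic testable without network access or an API key.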

## Batch Processing Function

Define the function to process a batch of data.

In [10]:
def process_batch(df):
    """Process a batch of data."""
    results = []
    
    for n, (idx, row) in enumerate(df.iterrows(), start=1):
        # Extract text from row; this dataset stores it in the 'post' column
        # (adjust the column name if your data differs)
        text = row.get('post', '')
        
        # Skip missing (NaN) or empty text
        if pd.isna(text) or not str(text).strip():
            continue
            
        # Process the text
        result = analyze_text(text)
        
        # Add metadata (fall back to the DataFrame index if there is no 'id' column)
        result['id'] = row.get('id', idx)
        
        # Add to results
        results.append(result)
        
        # Log progress periodically, counting rows visited rather than index labels
        if n % 100 == 0:
            print(f"Processed {n} records")
    
    return results
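
Once the analysis step calls a remote API, a single transient failure would abort the whole batch. One common mitigation is to retry with exponential backoff, sketched below. `with_retries` is a hypothetical helper, and the attempt count and delays are illustrative values only; the `sleep` parameter exists so the demo can record delays instead of actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `func`, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # base, 2x base, 4x base, ...
    raise ValueError("max_attempts must be at least 1")


# Demo: a function that fails twice, then succeeds; delays are recorded, not slept
calls = {"count": 0}
delays: List[float] = []


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = with_retries(flaky, sleep=delays.append)
```

Inside the batch loop, the per-row call could then be wrapped as `with_retries(lambda: analyze_text(text))`.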

## Save Results Function

Define the function to save results to a CSV file.

In [11]:
def save_results(results, filename="results.csv"):
    """Save results to a CSV file."""
    if not results:
        print("No results to save")
        return
        
    output_path = os.path.join(RESULTS_DIR, filename)
    print(f"Saving results to {output_path}")
    
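    # Collect every key that appears in any result; sort for a stable column order
    all_keys = set()
    for result in results:
        all_keys.update(result.keys())
    
    # Write as UTF-8 so non-ASCII characters are stored consistently across platforms
    with open(output_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=sorted(all_keys))
        writer.writeheader()
        writer.writerows(results)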
        
    print(f"Saved {len(results)} records to {output_path}")
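
Fields such as `topics` hold lists, and the `csv` module writes non-string values using `str()`, which yields a Python-style representation that is awkward to parse back. A possible alternative, sketched here with a hypothetical `write_results_csv` helper and made-up demo data, is to JSON-encode list and dict values before writing.

```python
import csv
import json
import os
import tempfile
from typing import Any, Dict, List


def write_results_csv(results: List[Dict[str, Any]], output_path: str) -> List[str]:
    """Write results to CSV, JSON-encoding list/dict values; return the column order."""
    # Sorted union of all keys gives a stable, reproducible header
    fieldnames = sorted({key for result in results for key in result})
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for result in results:
            writer.writerow({
                key: json.dumps(value) if isinstance(value, (list, dict)) else value
                for key, value in result.items()
            })
    return fieldnames


# Demo: write one made-up record to a temporary directory and read it back
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.csv")
columns = write_results_csv(
    [{"id": 0, "sentiment": "neutral", "topics": ["sleep", "work"]}], demo_path
)
with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

When the file is loaded later, the list columns can be restored with `json.loads` on each cell.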

## Pilot Test

Run a pilot test on a small sample of the data to validate the pipeline.

In [None]:
def run_pilot():
    """Run a pilot test on a small sample of the data."""
    print(f"\n🧪 Running pilot test on {PILOT_SAMPLE_SIZE} records")
    
    # Take a small sample (copy it so preprocessing never modifies the full DataFrame)
    sample_df = df.head(PILOT_SAMPLE_SIZE).copy()
    
    # Preprocess
    sample_df = preprocess_data(sample_df)
    
    # Process the sample
    results = process_batch(sample_df)
    
    # Save pilot results
    save_results(results, "pilot_results.csv")
    
    # Print sample of results for inspection
    print("\nPilot test results sample:")
    for i, result in enumerate(results):
        print(f"Result {i+1}: {result}")
    
    return results

# Run the pilot test
pilot_results = run_pilot()

## Full Pipeline

Run the full pipeline on all data.

In [None]:
def run_full_pipeline():
    """Run the full pipeline on all data."""
    print("\n🚀 Running full pipeline")
    start_time = time.time()
    
    # Preprocess
    processed_df = preprocess_data(df)
    
    # Process all data
    results = process_batch(processed_df)
    
    # Save results
    save_results(results)
    
    # Calculate and log total runtime
    total_time = time.time() - start_time
    print(f"\n✅ Full pipeline completed with {len(results)} results")
    print(f"Total runtime: {total_time:.2f} seconds")
    
    return results

# Uncomment the following line to run the full pipeline
# full_results = run_full_pipeline()

## Summary

This notebook implements a pipeline for analyzing mental health text data. It includes:

1. Loading API keys from environment variables
2. Loading the mentalhealth_post_features_tfidf_256.csv file (path configurable via DATA_PATH)
3. Running a pilot test on 5 records (PILOT_SAMPLE_SIZE) to validate the pipeline structure
4. Running the full pipeline on the entire dataset

To use a different dataset, change the DATA_PATH variable in the Configuration section. The analysis step is still a placeholder, so results stay empty until the API call in analyze_text is implemented.