# Mental Health Analysis - Model Testing

This notebook demonstrates how to load trained models and test them on new data.

## Overview
- Load pre-trained models and preprocessing components
- Test models on sample text
- Demonstrate prediction pipeline
- Evaluate model performance on test cases

## 1. Import Libraries

In [11]:
import pickle
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from scipy.sparse import hstack
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

print("Libraries imported successfully! ✓")

Libraries imported successfully! ✓


## 2. Load Pre-trained Models and Components

Load the best saved model together with the vectorizer, label encoder, and preprocessing components produced during the training phase.

In [12]:
# Load the TF-IDF vectorizer
with open('models/tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

# Load the label encoder
with open('models/label_encoder.pkl', 'rb') as f:
    label_encoder = pickle.load(f)

# Load preprocessing pipeline
with open('models/preprocessing_pipeline.pkl', 'rb') as f:
    preprocessing_pipeline = pickle.load(f)

# Load model information
with open('models/model_info.pkl', 'rb') as f:
    model_info = pickle.load(f)

# Extract preprocessing components
stemmer = preprocessing_pipeline['stemmer']
lemmatizer = preprocessing_pipeline['lemmatizer']
stopwords = preprocessing_pipeline['stopwords']

# Load the best trained model
with open('models/training_model.pkl', 'rb') as f:
    best_model = pickle.load(f)

print(f"✓ Best model ({model_info['model_name']}) loaded successfully")
print(f"✓ Model accuracy: {model_info['accuracy']:.4f}")
print("\nAll model accuracies during training:")
for name, acc in model_info['all_accuracies'].items():
    print(f"  - {name}: {acc:.4f}")

print(f"\nAvailable classes: {label_encoder.classes_}")
print("Model ready for testing! ✓")

✓ Best model (Logistic Regression) loaded successfully
✓ Model accuracy: 0.7541

All model accuracies during training:
  - Bernoulli Naive Bayes: 0.6538
  - Decision Tree: 0.6753
  - Logistic Regression: 0.7541
  - KNN: 0.5155

Available classes: ['Anxiety' 'Bipolar' 'Depression' 'Normal' 'Personality disorder' 'Stress'
 'Suicidal']
Model ready for testing! ✓
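
The `pickle.load` blocks above all follow the same pattern, so they could be folded into one helper. The sketch below is one possible shape, not the notebook's own code: `load_artifact` and its `models_dir` parameter are hypothetical names, and the demo writes a throwaway pickle to a temporary directory instead of reading the real `models/` folder.

```python
import pickle
import tempfile
from pathlib import Path


def load_artifact(name, models_dir="models"):
    """Load one pickled artifact, failing with a clear message if it is missing.

    `load_artifact` and `models_dir` are hypothetical names for this sketch.
    """
    path = Path(models_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"Expected artifact not found: {path}")
    with path.open("rb") as f:
        return pickle.load(f)


# Demo against a temporary directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    demo_info = {"model_name": "Logistic Regression", "accuracy": 0.7541}
    with open(Path(tmp) / "model_info.pkl", "wb") as f:
        pickle.dump(demo_info, f)

    loaded_info = load_artifact("model_info.pkl", models_dir=tmp)
    print(loaded_info["model_name"], f"{loaded_info['accuracy']:.4f}")
```

Only unpickle files you produced yourself: `pickle.load` can execute arbitrary code embedded in an untrusted file.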


In [None]:
# Define text_preprocessing function using loaded components
def text_preprocessing(text):
    """
    Basic text preprocessing: lowercase, remove non-alphabetic, remove stopwords.
    """
    # Lowercase
    text = text.lower()
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_tokens = [token for token in tokens if token not in stopwords]
    # Join back to string
    return ' '.join(filtered_tokens)
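
To see what these cleaning steps do to a sentence without loading NLTK or the pickled stopword set, here is a minimal stand-in. It is a sketch under stated assumptions: `demo_preprocess` and `DEMO_STOPWORDS` are hypothetical names, the stopword list is a tiny made-up sample, and `str.split` replaces `word_tokenize`.

```python
import re

# Hypothetical mini stopword list; the notebook uses the set stored in
# preprocessing_pipeline['stopwords'] instead.
DEMO_STOPWORDS = {"i", "and", "to", "my", "the", "with"}


def demo_preprocess(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, drop non-letters, split on whitespace, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # apostrophes, digits, punctuation vanish here
    tokens = text.split()                 # stand-in for nltk's word_tokenize
    return " ".join(t for t in tokens if t not in stopwords)


print(demo_preprocess("I can't sleep, and my heart races!"))
# prints: cant sleep heart races
```

Because the apostrophe is deleted before tokenization, "can't" becomes the single token "cant" and is not split into separate pieces.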

## 3. Create Prediction Pipeline

Define a complete pipeline function that can take raw text and return predictions.

In [14]:
def predict_mental_health(text):
    """
    Complete prediction pipeline for mental health classification using the best trained model.
    
    Args:
        text (str): Input text to classify
    
    Returns:
        dict: Prediction results including class and confidence
    """
    try:
        # Step 1: Text preprocessing
        cleaned_text = text_preprocessing(text)
        
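        # Step 2: Lemmatization
        # NOTE: WordNetLemmatizer.lemmatize() expects a single word. Given the
        # whole cleaned string it will usually return it unchanged. The call is
        # kept as-is on the assumption that it mirrors the training notebook,
        # because changing it here would change the features the model sees.
        lemmatized_text = lemmatizer.lemmatize(cleaned_text)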
        
        # Step 3: Tokenization
        tokens = word_tokenize(lemmatized_text)
        
        # Step 4: Stemming
        stemmed_tokens = ' '.join([stemmer.stem(str(token)) for token in tokens])
        
        # Step 5: Calculate numerical features
        num_characters = len(text)
        num_sentences = len(nltk.sent_tokenize(text))
        
        # Step 6: TF-IDF transformation
        text_features = vectorizer.transform([stemmed_tokens])
        
        # Step 7: Combine features
        numerical_features = np.array([[num_characters, num_sentences]])
        combined_features = hstack([text_features, numerical_features])
        
        # Step 8: Make prediction using the best model
        prediction = best_model.predict(combined_features)[0]
        
        # Get prediction probabilities if the model exposes them
        confidence = None
        prob_dict = None
        if hasattr(best_model, 'predict_proba'):
            probabilities = best_model.predict_proba(combined_features)[0]
            confidence = np.max(probabilities)
            # Map each class name to its probability (column i <-> classes_[i])
            prob_dict = dict(zip(label_encoder.classes_, probabilities))
        
        # Decode the prediction
        predicted_class = label_encoder.inverse_transform([prediction])[0]
        
        return {
            'predicted_class': predicted_class,
            'confidence': confidence,
            'probabilities': prob_dict,
            'model_used': model_info['model_name'],
            'processed_text': (stemmed_tokens[:100] + '...') if len(stemmed_tokens) > 100 else stemmed_tokens,
            'features': {
                'num_characters': num_characters,
                'num_sentences': num_sentences,
                'num_tokens': len(tokens)
            }
        }
    
    except Exception as e:
        return {
            'error': str(e),
            'predicted_class': None,
            'confidence': None
        }

print("Prediction pipeline created successfully! ✓")

Prediction pipeline created successfully! ✓
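
The pipeline relies on column `i` of `predict_proba` belonging to `label_encoder.classes_[i]`, which is assumed to hold because the model was trained on the encoded labels. The sketch below replays that mapping with plain lists. The probability row is copied, rounded to three decimals, from the custom example output in Section 5; `top_k` is a hypothetical helper name.

```python
# Class order as printed by label_encoder.classes_ in Section 2
classes = ["Anxiety", "Bipolar", "Depression", "Normal",
           "Personality disorder", "Stress", "Suicidal"]

# One probability row in that same order (rounded, so it sums to slightly under 1)
probabilities = [0.000, 0.000, 0.003, 0.004, 0.000, 0.943, 0.048]


def top_k(classes, probabilities, k=3):
    """Pair each class with its probability and return the k most likely."""
    ranked = sorted(zip(classes, probabilities), key=lambda cp: cp[1], reverse=True)
    return ranked[:k]


prob_dict = dict(zip(classes, probabilities))
predicted_class = max(prob_dict, key=prob_dict.get)  # class with the highest probability
confidence = prob_dict[predicted_class]

print(predicted_class, confidence)
# prints: Stress 0.943
print(top_k(classes, probabilities))
# prints: [('Stress', 0.943), ('Suicidal', 0.048), ('Normal', 0.004)]
```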


## 4. Test with Sample Cases

Let's test our prediction pipeline with various sample texts representing different mental health conditions.

In [15]:
# Define test cases for different mental health conditions
test_cases = [
    {
        'text': "I've been feeling really sad and hopeless lately. Nothing seems to bring me joy anymore and I just want to stay in bed all day.",
        'expected': 'Depression'
    },
    {
        'text': "My heart races and I can't breathe properly when I'm in crowded places. I'm constantly worried about having a panic attack.",
        'expected': 'Anxiety'
    },
    {
        'text': "I feel great today! The weather is beautiful and I'm excited about my new project at work. Life is good.",
        'expected': 'Normal'
    },
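    {
        # NOTE: 'OCD' is not one of the classes in label_encoder.classes_,
        # so the model can never return this label.
        'text': "I keep checking if I locked the door over and over again. I can't stop these repetitive thoughts and behaviors.",
        'expected': 'OCD'
    },
    {
        # NOTE: 'Insomnia' is not one of the model's classes either.
        'text': "I've been having trouble sleeping for weeks now. I just can't seem to get a good night's rest no matter what I try.",
        'expected': 'Insomnia'
    }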
]

# Test each case with the best model
print(f"Testing prediction pipeline with {model_info['model_name']} model:\n")
print("=" * 80)

for i, case in enumerate(test_cases, 1):
    print(f"\nTest Case {i}:")
    print(f"Text: {case['text']}")
    print(f"Expected: {case['expected']}")
    print("-" * 50)
    
    # Make prediction
    result = predict_mental_health(case['text'])
    
    if 'error' not in result:
        print(f"Prediction: {result['predicted_class']}")
        if result['confidence']:
            print(f"Confidence: {result['confidence']:.3f}")
        print(f"Model: {result['model_used']}")
        
        # Show top 3 probabilities if available
        if result['probabilities']:
            sorted_probs = sorted(result['probabilities'].items(), key=lambda x: x[1], reverse=True)[:3]
            print("Top 3 predictions:")
            for j, (class_name, prob) in enumerate(sorted_probs, 1):
                print(f"  {j}. {class_name}: {prob:.3f}")
    else:
        print(f"Error: {result['error']}")
    
    print("=" * 80)

Testing prediction pipeline with Logistic Regression model:


Test Case 1:
Text: I've been feeling really sad and hopeless lately. Nothing seems to bring me joy anymore and I just want to stay in bed all day.
Expected: Depression
--------------------------------------------------
Prediction: Normal
Confidence: 0.481
Model: Logistic Regression
Top 3 predictions:
  1. Normal: 0.481
  2. Depression: 0.285
  3. Suicidal: 0.121

Test Case 2:
Text: My heart races and I can't breathe properly when I'm in crowded places. I'm constantly worried about having a panic attack.
Expected: Anxiety
--------------------------------------------------
Prediction: Anxiety
Confidence: 0.552
Model: Logistic Regression
Top 3 predictions:
  1. Anxiety: 0.552
  2. Normal: 0.296
  3. Stress: 0.145

Test Case 3:
Text: I feel great today! The weather is beautiful and I'm excited about my new project at work. Life is good.
Expected: Normal
--------------------------------------------------
Prediction: Normal
Confid
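
The loop above prints each prediction but does not tally how many matched. A small summary could look like the sketch below. It uses only the three predictions visible in the output above (cases 4 and 5 are not shown, so they are left as `None`), and `summarize` is a hypothetical helper name.

```python
# Classes printed by label_encoder.classes_ in Section 2
known_classes = {"Anxiety", "Bipolar", "Depression", "Normal",
                 "Personality disorder", "Stress", "Suicidal"}

# (expected, predicted) pairs; predicted is None where no output is shown
observed = [
    ("Depression", "Normal"),
    ("Anxiety", "Anxiety"),
    ("Normal", "Normal"),
    ("OCD", None),
    ("Insomnia", None),
]


def summarize(observed, known_classes):
    """Count matches and list expected labels the model cannot output."""
    unreachable = [exp for exp, _ in observed if exp not in known_classes]
    scored = [(exp, pred) for exp, pred in observed if pred is not None]
    correct = sum(exp == pred for exp, pred in scored)
    return {"correct": correct, "scored": len(scored), "unreachable": unreachable}


summary = summarize(observed, known_classes)
print(f"{summary['correct']}/{summary['scored']} visible cases match")
# prints: 2/3 visible cases match
print("Labels outside the model's classes:", summary["unreachable"])
# prints: Labels outside the model's classes: ['OCD', 'Insomnia']
```

The two unreachable labels mean those test cases can never count as correct, whatever the model predicts.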

## 5. Interactive Testing

Try the model with your own text input!

In [16]:
# Interactive testing function
def interactive_test():
    """Interactive function to test custom text input"""
    print("Mental Health Text Classifier - Interactive Testing")
    print("=" * 55)
    print(f"Using best model: {model_info['model_name']}")
    print(f"Model accuracy: {model_info['accuracy']:.4f}")
    print("\nEnter your text below (or type 'quit' to exit):")
    
    while True:
        # Get user input
        user_text = input("\nEnter text: ")
        
        if user_text.lower() == 'quit':
            print("Thank you for testing! 😊")
            break
        
        if not user_text.strip():
            print("Please enter some text.")
            continue
        
        # Make prediction
        result = predict_mental_health(user_text)
        
        # Display results
        print("\n" + "-" * 50)
        if 'error' not in result:
            print(f"Prediction: {result['predicted_class']}")
            if result['confidence']:
                print(f"Confidence: {result['confidence']:.3f}")
            print(f"Model used: {result['model_used']}")
            print(f"Text features:")
            for key, value in result['features'].items():
                print(f"  - {key}: {value}")
            
            # Show probabilities if available
            if result['probabilities']:
                print("\nAll class probabilities:")
                sorted_probs = sorted(result['probabilities'].items(), key=lambda x: x[1], reverse=True)
                for class_name, prob in sorted_probs:
                    print(f"  - {class_name}: {prob:.3f}")
        else:
            print(f"Error: {result['error']}")
        print("-" * 50)

# Example usage - uncomment the line below to run interactively
# interactive_test()

# For demonstration, let's test with a custom example
custom_text = "I feel overwhelmed with work and can't seem to manage my stress levels. Everything feels too much."
result = predict_mental_health(custom_text)

print("Custom Text Analysis:")
print(f"Text: {custom_text}")
if 'error' in result:
    # The error dict carries no 'model_used' or 'features' keys
    print(f"Error: {result['error']}")
else:
    print(f"Prediction: {result['predicted_class']}")
    if result['confidence']:
        print(f"Confidence: {result['confidence']:.3f}")
    print(f"Model: {result['model_used']}")
    print(f"Features: {result['features']}")

    if result['probabilities']:
        print("\nClass probabilities:")
        for class_name, prob in sorted(result['probabilities'].items(), key=lambda x: x[1], reverse=True):
            print(f"  - {class_name}: {prob:.3f}")

Custom Text Analysis:
Text: I feel overwhelmed with work and can't seem to manage my stress levels. Everything feels too much.
Prediction: Stress
Confidence: 0.943
Model: Logistic Regression
Features: {'num_characters': 98, 'num_sentences': 2, 'num_tokens': 11}

Class probabilities:
  - Stress: 0.943
  - Suicidal: 0.048
  - Normal: 0.004
  - Depression: 0.003
  - Bipolar: 0.000
  - Personality disorder: 0.000
  - Anxiety: 0.000
