# Mental Health Detection ML Project

This notebook implements a machine learning model to predict mental health states based on various input features like stress level, work hours, and other relevant factors. The model will classify individuals into different mental states such as normal, stressed, anxious, etc.

## Project Overview
1. Data Loading and Analysis
2. Data Preprocessing
3. Feature Engineering
4. Model Development
5. Model Evaluation
6. Interactive Prediction System

In [31]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import joblib
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
print("All required libraries imported successfully!")

All required libraries imported successfully!


In [32]:
# Load both datasets
try:
    # Load the datasets
    df1 = pd.read_csv('Mental Health Dataset.csv')
    df2 = pd.read_csv('mental_health_data final data.csv')

    print("Dataset 1 Shape:", df1.shape)
    print("Dataset 2 Shape:", df2.shape)

    # Display columns from both datasets to compare
    print("\nColumns in Dataset 1:", df1.columns.tolist())
    print("\nColumns in Dataset 2:", df2.columns.tolist())

    # Display first few rows of both datasets
    print("\nFirst few rows of Dataset 1:")
    display(df1.head())
    print("\nFirst few rows of Dataset 2:")
    display(df2.head())

    # Check for missing values in both datasets
    print("\nMissing Values in Dataset 1:")
    display(df1.isnull().sum())
    print("\nMissing Values in Dataset 2:")
    display(df2.isnull().sum())
    
    print("\nDatasets loaded successfully!")

except FileNotFoundError as e:
    print(f"Error: Could not find one of the dataset files. Please ensure both CSV files are in the correct location.")
    raise e
except Exception as e:
    print(f"An error occurred while loading the datasets: {str(e)}")
    raise e

Dataset 1 Shape: (292364, 17)
Dataset 2 Shape: (50000, 17)

Columns in Dataset 1: ['Timestamp', 'Gender', 'Country', 'Occupation', 'self_employed', 'family_history', 'treatment', 'Days_Indoors', 'Growing_Stress', 'Changes_Habits', 'Mental_Health_History', 'Mood_Swings', 'Coping_Struggles', 'Work_Interest', 'Social_Weakness', 'mental_health_interview', 'care_options']

Columns in Dataset 2: ['User_ID', 'Age', 'Gender', 'Occupation', 'Country', 'Mental_Health_Condition', 'Severity', 'Consultation_History', 'Stress_Level', 'Sleep_Hours', 'Work_Hours', 'Physical_Activity_Hours', 'Social_Media_Usage', 'Diet_Quality', 'Smoking_Habit', 'Alcohol_Consumption', 'Medication_Usage']

First few rows of Dataset 1:


Unnamed: 0,Timestamp,Gender,Country,Occupation,self_employed,family_history,treatment,Days_Indoors,Growing_Stress,Changes_Habits,Mental_Health_History,Mood_Swings,Coping_Struggles,Work_Interest,Social_Weakness,mental_health_interview,care_options
0,8/27/2014 11:29,Female,United States,Corporate,,No,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Not sure
1,8/27/2014 11:31,Female,United States,Corporate,,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,No
2,8/27/2014 11:32,Female,United States,Corporate,,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Yes
3,8/27/2014 11:37,Female,United States,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,Maybe,Yes
4,8/27/2014 11:43,Female,United States,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Yes



First few rows of Dataset 2:


Unnamed: 0,User_ID,Age,Gender,Occupation,Country,Mental_Health_Condition,Severity,Consultation_History,Stress_Level,Sleep_Hours,Work_Hours,Physical_Activity_Hours,Social_Media_Usage,Diet_Quality,Smoking_Habit,Alcohol_Consumption,Medication_Usage
0,1,36,Male,Education,Australia,Yes,,Yes,Low,7.6,46,8,2.2,Healthy,Regular Smoker,Regular Drinker,Yes
1,2,48,Male,Engineering,Other,No,Low,No,Low,6.8,74,2,3.4,Unhealthy,Heavy Smoker,Social Drinker,No
2,3,18,Prefer not to say,Sales,India,No,,Yes,Medium,7.1,77,9,5.9,Healthy,Heavy Smoker,Social Drinker,No
3,4,30,Non-binary,Engineering,Australia,No,Medium,No,Low,6.9,57,4,5.4,Average,Regular Smoker,Regular Drinker,No
4,5,58,Male,IT,USA,Yes,,Yes,High,4.7,45,10,3.3,Unhealthy,Regular Smoker,Non-Drinker,Yes



Missing Values in Dataset 1:


Timestamp                     0
Gender                        0
Country                       0
Occupation                    0
self_employed              5202
family_history                0
treatment                     0
Days_Indoors                  0
Growing_Stress                0
Changes_Habits                0
Mental_Health_History         0
Mood_Swings                   0
Coping_Struggles              0
Work_Interest                 0
Social_Weakness               0
mental_health_interview       0
care_options                  0
dtype: int64


Missing Values in Dataset 2:


User_ID                        0
Age                            0
Gender                         0
Occupation                     0
Country                        0
Mental_Health_Condition        0
Severity                   25002
Consultation_History           0
Stress_Level                   0
Sleep_Hours                    0
Work_Hours                     0
Physical_Activity_Hours        0
Social_Media_Usage             0
Diet_Quality                   0
Smoking_Habit                  0
Alcohol_Consumption            0
Medication_Usage               0
dtype: int64


Datasets loaded successfully!


In [33]:
# Combine and preprocess datasets
try:
    # First, let's identify common features between the two datasets
    common_columns = list(set(df1.columns) & set(df2.columns))
    print("Common columns between datasets:", common_columns)

    # Function to standardize column names
    def standardize_column_names(df):
        return df.rename(columns=lambda x: str(x).strip().lower().replace(' ', '_'))

    # Standardize column names
    df1 = standardize_column_names(df1)
    df2 = standardize_column_names(df2)

    # Ensure all columns are present in both datasets
    all_columns = list(set(df1.columns) | set(df2.columns))
    for col in all_columns:
        if col not in df1.columns:
            df1[col] = np.nan
        if col not in df2.columns:
            df2[col] = np.nan

    # Combine datasets
    combined_df = pd.concat([df1, df2], axis=0, ignore_index=True)

    # Remove duplicates if any
    combined_df = combined_df.drop_duplicates()

    # Convert numeric columns that might have been read as strings
    for col in combined_df.columns:
        try:
            if combined_df[col].dtype == 'object':
                combined_df[col] = pd.to_numeric(combined_df[col], errors='ignore')
        except:
            continue

    print("\nCombined Dataset Shape:", combined_df.shape)
    print("\nColumns in Combined Dataset:", combined_df.columns.tolist())

    # Display basic statistics of the combined dataset
    print("\nBasic Statistics of Combined Dataset:")
    display(combined_df.describe())

    # Update the main dataframe
    df = combined_df
    print("\nDatasets combined successfully!")

except Exception as e:
    print(f"An error occurred while combining datasets: {str(e)}")
    raise e

Common columns between datasets: ['Occupation', 'Country', 'Gender']

Combined Dataset Shape: (340051, 31)

Columns in Combined Dataset: ['timestamp', 'gender', 'country', 'occupation', 'self_employed', 'family_history', 'treatment', 'days_indoors', 'growing_stress', 'changes_habits', 'mental_health_history', 'mood_swings', 'coping_struggles', 'work_interest', 'social_weakness', 'mental_health_interview', 'care_options', 'stress_level', 'age', 'work_hours', 'social_media_usage', 'alcohol_consumption', 'physical_activity_hours', 'consultation_history', 'severity', 'mental_health_condition', 'diet_quality', 'smoking_habit', 'sleep_hours', 'user_id', 'medication_usage']

Basic Statistics of Combined Dataset:


Unnamed: 0,age,work_hours,social_media_usage,physical_activity_hours,sleep_hours,user_id
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,41.47308,55.06286,3.24316,4.98204,7.009934,25000.5
std,13.844185,14.691575,1.585235,3.161759,1.732674,14433.901067
min,18.0,30.0,0.5,0.0,4.0,1.0
25%,29.0,42.0,1.9,2.0,5.5,12500.75
50%,41.0,55.0,3.2,5.0,7.0,25000.5
75%,53.0,68.0,4.6,8.0,8.5,37500.25
max,65.0,80.0,6.0,10.0,10.0,50000.0



Datasets combined successfully!


In [34]:
# Verify data quality
print("=== Data Quality Check ===")

# Check for infinite values
inf_count = np.isinf(df.select_dtypes(include=np.number)).sum().sum()
print("\nNumber of infinite values:", inf_count)

# Check for missing values
missing_count = df.isnull().sum().sum()
print("Number of missing values:", missing_count)

# Display percentage of missing values by column
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nMissing values percentage by column:")
print(missing_percentage[missing_percentage > 0])

# Check data types
print("\nData types of columns:")
print(df.dtypes)

# Basic statistics
print("\nBasic statistics for numerical columns:")
display(df.describe())

print("\nData quality check completed!")

=== Data Quality Check ===

Number of infinite values: 0
Number of missing values: 4790909

Missing values percentage by column:
timestamp                  14.703677
self_employed              16.230801
family_history             14.703677
treatment                  14.703677
days_indoors               14.703677
growing_stress             14.703677
changes_habits             14.703677
mental_health_history      14.703677
mood_swings                14.703677
coping_struggles           14.703677
work_interest              14.703677
social_weakness            14.703677
mental_health_interview    14.703677
care_options               14.703677
stress_level               85.296323
age                        85.296323
work_hours                 85.296323
social_media_usage         85.296323
alcohol_consumption        85.296323
physical_activity_hours    85.296323
consultation_history       85.296323
severity                   92.648750
mental_health_condition    85.296323
diet_quality        

Unnamed: 0,age,work_hours,social_media_usage,physical_activity_hours,sleep_hours,user_id
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,41.47308,55.06286,3.24316,4.98204,7.009934,25000.5
std,13.844185,14.691575,1.585235,3.161759,1.732674,14433.901067
min,18.0,30.0,0.5,0.0,4.0,1.0
25%,29.0,42.0,1.9,2.0,5.5,12500.75
50%,41.0,55.0,3.2,5.0,7.0,25000.5
75%,53.0,68.0,4.6,8.0,8.5,37500.25
max,65.0,80.0,6.0,10.0,10.0,50000.0



Data quality check completed!


In [35]:
# Data Preprocessing
try:
    # Print all columns first for verification
    print("All columns in dataset:", df.columns.tolist())
    print("\nColumn data types:")
    print(df.dtypes)

    # Identify numeric and categorical columns
    numerical_features = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
    categorical_features = df.select_dtypes(include=['object']).columns.tolist()

    print("\nNumerical features:", numerical_features)
    print("\nCategorical features:", categorical_features)

    # Determine the target variable based on available columns
    # We'll look for mental health related columns in both datasets
    possible_target_columns = [col for col in df.columns if any(term in col.lower() for term in ['mental', 'health', 'condition', 'state'])]
    print("\nPossible target columns:", possible_target_columns)

    if not possible_target_columns:
        raise ValueError("No mental health related columns found in the dataset")

    # Select the most appropriate target variable
    target_variable = possible_target_columns[0]  # We'll use the first mental health related column as target
    print("\nSelected target variable:", target_variable)

    # Remove target variable from features lists
    if target_variable in numerical_features:
        numerical_features.remove(target_variable)
    if target_variable in categorical_features:
        categorical_features.remove(target_variable)

    # Create preprocessing pipelines
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline

    # Create preprocessing steps
    numeric_transformer = None
    if numerical_features:
        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ])

    categorical_transformer = None
    if categorical_features:
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(drop='first', sparse=False, handle_unknown='ignore'))
        ])

    # Combine preprocessing steps
    transformers = []
    if numeric_transformer:
        transformers.append(('num', numeric_transformer, numerical_features))
    if categorical_transformer:
        transformers.append(('cat', categorical_transformer, categorical_features))

    preprocessor = ColumnTransformer(transformers=transformers)

    # Prepare feature matrix X and target vector y
    feature_columns = numerical_features + categorical_features
    print("\nFeatures being used:", feature_columns)

    X = df[feature_columns].copy()
    y = df[target_variable].copy()

    print("\nShape of feature matrix X:", X.shape)
    print("Shape of target vector y:", y.shape)

    # Convert target to numerical values if it's categorical
    if y.dtype == 'object':
        label_encoder = LabelEncoder()
        y = label_encoder.fit_transform(y)
        print("\nUnique classes in target variable:", label_encoder.classes_)
    else:
        print("\nTarget variable is already numeric")
        label_encoder = None

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    print("\nTraining set shape:", X_train.shape)
    print("Testing set shape:", X_test.shape)
    
    if label_encoder:
        print("\nTarget classes:", label_encoder.classes_)

except Exception as e:
    print(f"Error occurred: {str(e)}")
    print("\nPlease verify that:")
    print("1. All required libraries are imported")
    print("2. The datasets are loaded correctly")
    print("3. The column names are as expected")

All columns in dataset: ['timestamp', 'gender', 'country', 'occupation', 'self_employed', 'family_history', 'treatment', 'days_indoors', 'growing_stress', 'changes_habits', 'mental_health_history', 'mood_swings', 'coping_struggles', 'work_interest', 'social_weakness', 'mental_health_interview', 'care_options', 'stress_level', 'age', 'work_hours', 'social_media_usage', 'alcohol_consumption', 'physical_activity_hours', 'consultation_history', 'severity', 'mental_health_condition', 'diet_quality', 'smoking_habit', 'sleep_hours', 'user_id', 'medication_usage']

Column data types:
timestamp                   object
gender                      object
country                     object
occupation                  object
self_employed               object
family_history              object
treatment                   object
days_indoors                object
growing_stress              object
changes_habits              object
mental_health_history       object
mood_swings                 obje

In [36]:
# Model Training
try:
    # Verify that preprocessor and data are available
    if 'preprocessor' not in locals() or 'X_train' not in locals() or 'y_train' not in locals():
        raise ValueError("Please run the preprocessing cell first to create the preprocessor and split the data")

    print("Starting model training with data shapes:")
    print(f"X_train shape: {X_train.shape}")
    print(f"y_train shape: {y_train.shape}")
    print(f"X_test shape: {X_test.shape}")
    print(f"y_test shape: {y_test.shape}")

    # 1. Random Forest Classifier with error handling
    print("\nInitializing Random Forest model...")
    rf_model = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    ])

    # 2. XGBoost Classifier with error handling
    print("Initializing XGBoost model...")
    xgb_model = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss'))
    ])

    # Train Random Forest model
    print("\nTraining Random Forest model...")
    rf_model.fit(X_train, y_train)
    print("Random Forest model training completed!")

    # Train XGBoost model
    print("\nTraining XGBoost model...")
    xgb_model.fit(X_train, y_train)
    print("XGBoost model training completed!")

    # Define evaluation function
    def evaluate_model(model, X, y, model_name):
        try:
            # Make predictions
            predictions = model.predict(X)
            
            # Print classification report
            print(f"\n{model_name} Results:")
            print("\nClassification Report:")
            print(classification_report(y, predictions))
            
            # Plot confusion matrix
            plt.figure(figsize=(10, 8))
            cm = confusion_matrix(y, predictions)
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
            plt.title(f'{model_name} Confusion Matrix')
            plt.ylabel('True Label')
            plt.xlabel('Predicted Label')
            plt.tight_layout()
            plt.show()
            
            # Calculate and print accuracy
            accuracy = np.mean(predictions == y)
            print(f"\n{model_name} Accuracy: {accuracy:.4f}")
            
        except Exception as e:
            print(f"Error evaluating {model_name}: {str(e)}")

    # Evaluate both models
    print("\nEvaluating models...")
    evaluate_model(rf_model, X_test, y_test, "Random Forest")
    evaluate_model(xgb_model, X_test, y_test, "XGBoost")

except Exception as e:
    print(f"\nError during model training: {str(e)}")
    print("\nPlease ensure:")
    print("1. The preprocessing cell has been run successfully")
    print("2. The data has been properly split into training and test sets")
    print("3. All required libraries are imported")
    print("4. The target variable has been properly encoded")


Error during model training: Please run the preprocessing cell first to create the preprocessor and split the data

Please ensure:
1. The preprocessing cell has been run successfully
2. The data has been properly split into training and test sets
3. All required libraries are imported
4. The target variable has been properly encoded


In [37]:
# Feature Importance Analysis
try:
    # Ensure models are trained
    if 'rf_model' not in locals() or 'numerical_features' not in locals():
        raise ValueError("Please run the model training cell first")

    # Get feature importance from Random Forest
    feature_importance = pd.DataFrame({
        'feature': numerical_features,
        'importance': rf_model.named_steps['classifier'].feature_importances_
    })
    feature_importance = feature_importance.sort_values('importance', ascending=False)

    # Plot feature importance
    plt.figure(figsize=(10, 6))
    sns.barplot(data=feature_importance, x='importance', y='feature')
    plt.title('Feature Importance (Random Forest)')
    plt.tight_layout()
    plt.show()

    # Create a prediction function for new data with error handling
    def predict_mental_health(input_data):
        """
        Predict mental health state based on input features.
        input_data: dictionary containing feature values
        Returns: dictionary with predicted class and probabilities
        Raises: ValueError if input data is invalid
        """
        try:
            # Validate input data
            required_features = set(numerical_features)
            provided_features = set(input_data.keys())
            
            # Check for missing features
            missing_features = required_features - provided_features
            if missing_features:
                raise ValueError(f"Missing required features: {missing_features}")
            
            # Validate data types and ranges
            for feature, value in input_data.items():
                if not isinstance(value, (int, float)):
                    raise ValueError(f"Value for {feature} must be numeric")
                if value < 0:
                    raise ValueError(f"Value for {feature} cannot be negative")
                
                # Additional range checks for specific features
                if feature == 'Hours_Worked' and value > 24:
                    raise ValueError("Hours_Worked cannot exceed 24")
                if feature == 'Sleep_Hours' and value > 24:
                    raise ValueError("Sleep_Hours cannot exceed 24")
                if 'Level' in feature and value > 10:
                    raise ValueError(f"{feature} should be on a scale of 1-10")

            # Convert input dictionary to DataFrame
            input_df = pd.DataFrame([input_data])
            
            # Make prediction using the best model (assuming XGBoost performed better)
            prediction = xgb_model.predict(input_df)
            prediction_proba = xgb_model.predict_proba(input_df)
            
            # Get the predicted class label
            predicted_class = label_encoder.inverse_transform(prediction)[0]
            
            # Get prediction probabilities
            class_probabilities = dict(zip(label_encoder.classes_, prediction_proba[0]))
            
            return {
                'predicted_class': predicted_class,
                'probabilities': class_probabilities
            }
            
        except Exception as e:
            print(f"Error in prediction: {str(e)}")
            print("\nPlease ensure:")
            print("1. All required features are provided")
            print("2. All values are numeric and within valid ranges")
            print("3. The model has been trained successfully")
            raise

    # Save the best model with error handling
    try:
        print("Saving models...")
        joblib.dump(xgb_model, 'mental_health_model.joblib')
        joblib.dump(label_encoder, 'label_encoder.joblib')
        print("Models saved successfully!")
    except Exception as e:
        print(f"Error saving models: {str(e)}")
        print("Please ensure you have write permissions in the current directory")

    # Example of how to use the prediction function
    print("\nTesting prediction with example input...")
    example_input = {
        'Stress_Level': 7,
        'Hours_Worked': 8,
        'Sleep_Hours': 6,
    }
    
    # Add all required numerical features with default values
    for feature in numerical_features:
        if feature not in example_input:
            if 'Level' in feature or 'Rating' in feature:
                example_input[feature] = 5  # Default middle value for ratings
            elif 'Hours' in feature:
                example_input[feature] = 8  # Default reasonable hours
            else:
                example_input[feature] = 0  # Default safe value for other features

    result = predict_mental_health(example_input)
    print("\nExample Prediction:")
    print(f"Predicted Mental Health State: {result['predicted_class']}")
    print("\nProbabilities for each class:")
    for class_name, prob in result['probabilities'].items():
        print(f"{class_name}: {prob:.2f}")

except Exception as e:
    print(f"\nError in feature importance analysis: {str(e)}")
    print("\nPlease ensure:")
    print("1. The model training cell has been run successfully")
    print("2. Required variables (rf_model, numerical_features) are available")
    print("3. The models have been trained properly")
    print("4. All required libraries are imported")


Error in feature importance analysis: Please run the model training cell first

Please ensure:
1. The model training cell has been run successfully
2. Required variables (rf_model, numerical_features) are available
3. The models have been trained properly
4. All required libraries are imported


In [38]:
# Interactive Mental Health Assessment Questionnaire
import os

def get_user_input(required_features):
    """
    Function to ask questions and collect user responses for mental health assessment
    Args:
        required_features: list of features required by the model
    Returns:
        Dictionary of user responses or None if cancelled
    """
    questions = {
        'Hours_Worked': "How many hours do you work per day? (Enter a number between 0-24): ",
        'Sleep_Hours': "How many hours do you sleep per day? (Enter a number between 0-24): ",
        'Stress_Level': "On a scale of 1-10, how would you rate your current stress level? (1=lowest, 10=highest): ",
        'Physical_Activity': "How many hours per week do you spend on physical activity? (Enter a number): ",
        'Social_Connection': "On a scale of 1-10, how would you rate your social connections? (1=very isolated, 10=very connected): ",
        'Work_Life_Balance': "On a scale of 1-10, how would you rate your work-life balance? (1=poor, 10=excellent): ",
        'Support_System': "On a scale of 1-10, how strong is your support system? (1=none, 10=very strong): ",
        'Job_Satisfaction': "On a scale of 1-10, how satisfied are you with your job? (1=very dissatisfied, 10=very satisfied): "
    }
    
    # Add any missing required features with default questions
    for feature in required_features:
        if feature not in questions:
            if 'level' in feature.lower():
                questions[feature] = f"On a scale of 1-10, how would you rate your {feature.replace('_', ' ').lower()}? (1=lowest, 10=highest): "
            elif 'hours' in feature.lower():
                questions[feature] = f"How many hours of {feature.replace('_', ' ').lower()}? (Enter a number between 0-24): "
            else:
                questions[feature] = f"Please enter a value for {feature.replace('_', ' ').lower()} (Enter a number): "
    
    responses = {}
    print("\n=== Mental Health Assessment Questionnaire ===\n")
    
    try:
        for column, question in questions.items():
            if column in required_features:  # Only ask questions for required features
                while True:
                    try:
                        response = float(input(question))
                        if column in ['Hours_Worked', 'Sleep_Hours'] or 'hours' in column.lower():
                            if 0 <= response <= 24:
                                break
                            print("Please enter a valid number between 0 and 24.")
                        elif 'scale' in question.lower() or 'level' in column.lower():
                            if 1 <= response <= 10:
                                break
                            print("Please enter a valid number between 1 and 10.")
                        else:
                            if response >= 0:
                                break
                            print("Please enter a valid positive number.")
                    except ValueError:
                        print("Please enter a valid number.")
                
                responses[column] = response
                print()  # Add a blank line between questions
        
    except KeyboardInterrupt:
        print("\nAssessment cancelled.")
        return None
    
    return responses

def assess_mental_health():
    """
    Main function to run the mental health assessment
    """
    try:
        # Check if required variables are available
        if 'numerical_features' not in globals():
            raise ValueError("Model not initialized. Please run the model training cells first.")
        
        if not os.path.exists('mental_health_model.joblib') or not os.path.exists('label_encoder.joblib'):
            raise FileNotFoundError("Model files not found. Please train and save the model first.")
            
        print("Welcome to the Mental Health Assessment System")
        print("Please answer the following questions honestly for an accurate assessment.\n")
        
        # Get user responses with required features
        responses = get_user_input(numerical_features)
        
        if responses:
            try:
                print("\nAnalyzing your responses...")
                # Make prediction using our trained model
                result = predict_mental_health(responses)
                
                print("\n=== Mental Health Assessment Results ===")
                print(f"\nBased on your responses, you may be experiencing: {result['predicted_class']}")
                print("\nProbability breakdown:")
                for state, prob in result['probabilities'].items():
                    print(f"{state}: {prob*100:.1f}%")
                
                # Provide some general recommendations
                print("\nGeneral Recommendations:")
                if float(responses['Sleep_Hours']) < 7:
                    print("- Consider getting more sleep (7-9 hours is recommended)")
                if float(responses['Stress_Level']) > 7:
                    print("- Your stress level is high. Consider stress management techniques")
                if float(responses.get('Physical_Activity', 0)) < 3:
                    print("- Try to increase your physical activity")
                if float(responses.get('Social_Connection', 0)) < 5:
                    print("- Consider strengthening your social connections")
                
                print("\nNote: This is a preliminary assessment. Please consult with a mental health professional for a proper diagnosis.")
            
            except Exception as e:
                print(f"\nError during prediction: {str(e)}")
                print("Please ensure all required features are provided and within valid ranges.")
        
    except Exception as e:
        print(f"\nError initializing assessment: {str(e)}")
        print("\nPlease ensure:")
        print("1. You have run all the notebook cells in order")
        print("2. The model has been trained successfully")
        print("3. The model files have been saved correctly")

# Run the interactive assessment
if __name__ == "__main__":
    assess_mental_health()


Error initializing assessment: Model files not found. Please train and save the model first.

Please ensure:
1. You have run all the notebook cells in order
2. The model has been trained successfully
3. The model files have been saved correctly


# How to Use the Mental Health Assessment System

1. Run the assessment by executing the cell above
2. Answer each question honestly
3. The system will:
   - Analyze your responses
   - Predict your mental health state
   - Show probability for each possible state
   - Provide some general recommendations

Note: This is a preliminary screening tool and should not be used as a substitute for professional medical advice. Always consult with a qualified mental health professional for proper diagnosis and treatment.

# ML Concepts Used in This Project

1. **Supervised Learning**: We used supervised learning for classification, where we train the model on labeled data (known mental health states) to predict the mental health state of new individuals.

2. **Feature Engineering**:
   - Numerical feature scaling using StandardScaler
   - Missing value imputation using SimpleImputer
   - Label encoding for categorical target variable

3. **Models Used**:
   - **Random Forest**: An ensemble learning method that creates multiple decision trees and combines their predictions
   - **XGBoost**: A gradient boosting algorithm known for its high performance and speed

4. **Model Evaluation**:
   - Train-test split for model validation
   - Classification metrics (accuracy, precision, recall, F1-score)
   - Confusion matrix visualization
   - Feature importance analysis

5. **Pipeline Implementation**:
   - Scikit-learn Pipeline for combining preprocessing and model training
   - ColumnTransformer for handling different types of features

6. **Model Persistence**:
   - Saving trained model using joblib for future use
   - Creating a prediction function for easy model deployment

This implementation focuses on creating an accurate and interpretable model for mental health state prediction based on various input features. The model can be used to assess an individual's mental health state by answering questions related to stress level, work hours, and other relevant factors.