In [15]:
def predict_depression_status(text_input):
    # Tokenize the input text
    encoded_input = tokenizer(
        text_input,
        truncation=True,
        padding='max_length',
        max_length=MAX_LEN,
        return_tensors='pt'
    )

    # Move tokenized inputs to the same device as the model
    input_ids = encoded_input['input_ids'].to(device)
    attention_mask = encoded_input['attention_mask'].to(device)

    # Set the model to evaluation mode
    model.eval()

    # Make prediction
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    # Get predicted label
    prediction = torch.argmax(logits, dim=1).cpu().numpy()[0]

    # Map numerical prediction back to original label
    reverse_label_mapping = {v: k for k, v in label_mapping.items()}
    predicted_status = reverse_label_mapping[prediction]

    return predicted_status

print("Depression Status Prediction System")
print("Enter your text below. Type 'exit' to quit.\n")

while True:
    user_text = input("Your text: ")
    if user_text.strip().lower() == 'exit':  # ignore surrounding whitespace
        break

    status = predict_depression_status(user_text)
    print(f"Predicted Status: {status}\n")

print("Thank you for using the Depression Status Prediction System!")

Depression Status Prediction System
Enter your text below. Type 'exit' to quit.

Your text: After hearing about the accident, I feel restless and can’t sleep properly.
Predicted Status: Anxious

Your text: The heat and power cuts are draining me every day.
Predicted Status: Anxious

Your text: Reading about the flood damage made me feel empty and hopeless..
Predicted Status: Stressed

Your text: Sad news today. Hope things improve soon.
Predicted Status: Neutral

Your text: exit
Thank you for using the Depression Status Prediction System!
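
The function above returns only the top label, so a reader cannot tell a confident prediction from a near tie. The following is a minimal sketch of how the raw logits could also be turned into a confidence score, assuming softmax probabilities are an acceptable proxy for confidence. The helper `label_with_confidence`, the `min_confidence` threshold, and the `'Uncertain'` fallback are hypothetical and not part of the notebook.

```python
import math

# Same mapping as defined in the preprocessing cell of this notebook.
label_mapping = {'Anxious': 0, 'Depressed': 1, 'Stressed': 2, 'Neutral': 3}
reverse_label_mapping = {v: k for k, v in label_mapping.items()}

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits, min_confidence=0.6):
    """Hypothetical helper: return (label, probability of the top class).

    Falls back to 'Uncertain' when the top probability is below
    min_confidence (an assumed threshold, not tuned on any data).
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return 'Uncertain', probs[best]
    return reverse_label_mapping[best], probs[best]

# Example with made-up logits for a single input.
print(label_with_confidence([0.1, 3.0, 0.2, 0.0]))
```

Inside `predict_depression_status`, the list passed in could plausibly be `logits[0].tolist()`.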


# Task
Develop a Depression Detection System using a BERT model. The system should involve loading data from '/content/Dataset_2025.csv', preprocessing it for BERT, configuring and training the BERT model, implementing strategies to mitigate overfitting during training, and evaluating the model's performance. Finally, provide a summary of the complete Depression System program, including a discussion on the methods used to mitigate overfitting and the achieved model performance.

## Load and Explore Data

### Subtask:
Load the dataset from '/content/Dataset_2025.csv' into a pandas DataFrame. Perform an initial exploration to understand the data's structure, identify text and label columns, and check for any missing values or inconsistencies.


**Reasoning**:
I will import the pandas library and load the dataset into a DataFrame. Then, I'll display the first few rows and a concise summary to get a preliminary understanding of the data's structure, column names, and data types, as well as to identify non-null values.



In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/Dataset_2025.csv')

# Display the first few rows
print("First 5 rows of the DataFrame:")
print(df.head())

# Print a concise summary of the DataFrame
print("\nDataFrame Info:")
df.info()

First 5 rows of the DataFrame:
         Date   Location Event_Category  Casualty_Count  Age Gender  \
0  2024-09-03  Bengaluru  Building Fire              28   26   Male   
1  2024-06-05      Delhi  Road Accident              27   32   Male   
2  2024-06-29      Assam  Air Pollution              35   18   Male   
3  2024-07-06    Kolkata  Building Fire              97   39   Male   
4  2024-06-13  Hyderabad        Cyclone              77   34   Male   

                                       Reaction_Text Label_Status  \
0  I feel completely drained after hearing this n...    Depressed   
1  I feel numb reading about this. Everything fee...    Depressed   
2  I’m feeling uneasy and nervous since this inci...      Anxious   
3  This was unexpected. Hope authorities handle i...      Neutral   
4  I feel numb reading about this. Everything fee...    Depressed   

   Severity_Score  
0               3  
1               3  
2               2  
3               0  
4               3  

DataFr

**Reasoning**:
Now that the data is loaded and its initial structure is understood, I will check for missing values across all columns to ensure data quality. Afterwards, I will examine the unique values and their counts in the 'Label_Status' column to understand the distribution of the target variable, which is crucial for the depression detection system.



In [2]:
print("\nMissing values per column:")
print(df.isnull().sum())

print("\nUnique values and their counts for 'Label_Status' column:")
print(df['Label_Status'].value_counts())


Missing values per column:
Date              0
Location          0
Event_Category    0
Casualty_Count    0
Age               0
Gender            0
Reaction_Text     0
Label_Status      0
Severity_Score    0
dtype: int64

Unique values and their counts for 'Label_Status' column:
Label_Status
Anxious      160
Depressed    149
Stressed     147
Neutral      144
Name: count, dtype: int64
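
To put these counts in perspective, a short sketch that turns them into class shares and a majority-class baseline may help. The counts are copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
label_counts = {'Anxious': 160, 'Depressed': 149, 'Stressed': 147, 'Neutral': 144}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(shares.values())

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for label, share in shares.items():
    print(f"{label:<10} {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
print(f"Largest/smallest class ratio: {imbalance_ratio:.2f}")
```

A ratio this close to 1 suggests that plain accuracy is a reasonable headline metric here, with the majority-class baseline as the floor to beat.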


## Preprocess Data for BERT

### Subtask:
Prepare the text data for input into the BERT model. This will involve tokenization using a BERT tokenizer, encoding the tokens into numerical IDs, creating attention masks, and segmenting the data into training and validation sets suitable for BERT's input requirements.


**Reasoning**:
First, I will import the necessary libraries, instantiate the BERT tokenizer, define the maximum sequence length, tokenize the 'Reaction_Text' column, and encode the 'Label_Status' into numerical labels. This sets up the core components for BERT input preparation.



In [3]:
import torch
from transformers import BertTokenizerFast
from sklearn.model_selection import train_test_split

# 2. Instantiate a BertTokenizerFast object
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# 3. Define the maximum sequence length
MAX_LEN = 128  # A common choice, can be adjusted based on text length distribution

# 4. Apply the tokenizer to the 'Reaction_Text' column
encodings = tokenizer(list(df['Reaction_Text']), truncation=True, padding='max_length', max_length=MAX_LEN)

# 5. Encode the 'Label_Status' column into numerical labels
label_mapping = {'Anxious': 0, 'Depressed': 1, 'Stressed': 2, 'Neutral': 3}
df['Encoded_Label'] = df['Label_Status'].map(label_mapping)

print("Tokenization and label encoding complete.")

Tokenization and label encoding complete.
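
The cell above fixes `MAX_LEN = 128` and notes that it can be adjusted based on the text length distribution. One way to make that choice data-driven is sketched below. The function `suggest_max_len` and its `coverage` parameter are hypothetical; the cap of 512 reflects the maximum sequence length of `bert-base-uncased`.

```python
import math

def suggest_max_len(token_lengths, coverage=0.95, cap=512):
    """Return the smallest length that fully fits `coverage` of the sequences.

    token_lengths: number of tokens per text, special tokens included.
    cap: upper bound imposed by the model (512 positions for bert-base-uncased).
    """
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    # Index of the sequence sitting at the requested coverage quantile.
    index = max(0, math.ceil(coverage * len(ordered)) - 1)
    return min(ordered[index], cap)

# Made-up lengths for illustration only.
example_lengths = [10, 20, 30, 40]
print(suggest_max_len(example_lengths, coverage=0.5))
```

In this notebook the lengths could be obtained without padding, for example with `[len(ids) for ids in tokenizer(list(df['Reaction_Text']))['input_ids']]`.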


**Reasoning**:
I will now split the tokenized inputs and numerical labels into training and validation sets, ensuring stratification, and then convert these into PyTorch tensors, as per the remaining instructions of the subtask.



In [4]:
# Descriptive names: these are token IDs and attention masks, not features/targets
input_ids = encodings['input_ids']
attention_masks = encodings['attention_mask']
labels = df['Encoded_Label'].values

# 6. Split the tokenized inputs and numerical labels into training and validation sets
#    (stratify=labels keeps the class proportions the same in both splits)
ids_train, ids_val, masks_train, masks_val, labels_train, labels_val = train_test_split(
    input_ids, attention_masks, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Data split into training and validation sets.")

# 7. Convert the training and validation inputs and labels into PyTorch tensors
input_ids_train = torch.tensor(ids_train)
attention_masks_train = torch.tensor(masks_train)
labels_train = torch.tensor(labels_train)

input_ids_val = torch.tensor(ids_val)
attention_masks_val = torch.tensor(masks_val)
labels_val = torch.tensor(labels_val)

print("Training and validation data converted to PyTorch tensors.")

Data split into training and validation sets.
Training and validation data converted to PyTorch tensors.
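
To illustrate what stratification does, here is a minimal pure-Python sketch that splits indices class by class. It is not scikit-learn's algorithm and will not reproduce the exact split made by `train_test_split`; the function `stratified_split` is hypothetical, and the label list is rebuilt from the class counts reported earlier.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Labels rebuilt from the reported counts (0..3 follow label_mapping).
labels = [0] * 160 + [1] * 149 + [2] * 147 + [3] * 144
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

The resulting 480/120 split matches the 80/20 ratio used in the notebook, and each class contributes roughly a fifth of its examples to validation.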


## Configure and Load BERT Model

### Subtask:
Load a pre-trained BERT base model (e.g., 'bert-base-uncased') from a library like Hugging Face Transformers. Configure the model by adding a classification layer on top, appropriate for the depression detection task (here, a linear layer that outputs one logit per class for the four-class problem).


**Reasoning**:
To configure and load the BERT model, I will first import `BertForSequenceClassification` and define the number of unique labels using the `label_mapping` from previous steps. Then, I will load the pre-trained `bert-base-uncased` model for sequence classification and move it to the appropriate device (GPU if available, otherwise CPU).



In [5]:
from transformers import BertForSequenceClassification

# 1. Determine the number of unique labels
num_labels = len(label_mapping)

# 2. Load a pre-trained BertForSequenceClassification model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

# 3. Move the model to GPU if available, otherwise CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(f"BERT model loaded and moved to {device}.")
print(f"Number of output labels configured: {num_labels}")



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT model loaded and moved to cpu.
Number of output labels configured: 4


**Reasoning**:
The previous steps successfully configured and loaded the BERT model. The next logical step, according to the main task description, is to prepare the data for training, define data loaders, and set up the training parameters.



In [7]:
from torch.utils.data import TensorDataset, DataLoader

# 1. Create TensorDataset objects for both the training and validation sets
train_dataset = TensorDataset(input_ids_train, attention_masks_train, labels_train)
val_dataset = TensorDataset(input_ids_val, attention_masks_val, labels_val)

# 2. Define batch sizes for training and validation
batch_size = 32 # A common batch size, can be adjusted

# 3. Create DataLoader objects for both the training and validation datasets
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

print("TensorDatasets and DataLoaders created successfully.")

TensorDatasets and DataLoaders created successfully.


## Train BERT Model with Overfitting Mitigation

### Subtask:
Train the configured BERT model using the preprocessed training data, incorporating early stopping and weight decay to mitigate overfitting.


**Reasoning**:
I will implement the training loop for the BERT model, incorporating the AdamW optimizer with weight decay, a linear learning-rate decay schedule (configured here with zero warmup steps), gradient clipping, and early stopping to mitigate overfitting, as per the subtask instructions.



In [10]:
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from tqdm.notebook import tqdm

# 1. Set up the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# 2. Calculate the total number of training steps
epochs = 5  # Number of training epochs
total_steps = len(train_dataloader) * epochs

# 3. Set up the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0, # No warmup steps
    num_training_steps=total_steps
)

# 4. Initialize variables for early stopping
best_val_loss = float('inf')
epochs_no_improve = 0
patience = 3 # Number of epochs to wait for improvement before stopping

print("Starting BERT model training...")

# 5. Main training loop
for epoch_i in range(epochs):
    print(f"\n======== Epoch {epoch_i + 1} / {epochs} ========")
    print("Training...")

    # Training phase
    model.train() # Set model to training mode
    total_train_loss = 0

    for batch in tqdm(train_dataloader, desc="Training"): # Iterate over batches with a progress bar
        # Move batch to GPU if available
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        labels = batch[2].to(device)

        optimizer.zero_grad() # Clear any previously calculated gradients

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits
        total_train_loss += loss.item()

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Clip gradients to prevent exploding gradients

        optimizer.step() # Update model parameters
        scheduler.step() # Update learning rate

    # Calculate the average training loss
    avg_train_loss = total_train_loss / len(train_dataloader)
    print(f"  Average training loss: {avg_train_loss:.2f}")

    # Evaluation phase
    print("\nRunning Validation...")
    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    with torch.no_grad(): # Disable gradient calculations during validation
        for batch in tqdm(val_dataloader, desc="Validation"): # Iterate over validation batches with a progress bar
            input_ids = batch[0].to(device)
            attention_mask = batch[1].to(device)
            labels = batch[2].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits
            total_val_loss += loss.item()

    # Calculate the average validation loss
    avg_val_loss = total_val_loss / len(val_dataloader)
    print(f"  Average validation loss: {avg_val_loss:.2f}")

    # Early stopping logic
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        epochs_no_improve = 0
        print("  Validation loss improved. Saving model (not implemented in this step).")
    else:
        epochs_no_improve += 1
        print(f"  Validation loss did not improve. Epochs without improvement: {epochs_no_improve}")
        if epochs_no_improve >= patience:
            print(f"  Early stopping triggered after {patience} epochs without improvement.")
            break

print("\nTraining complete!")

Starting BERT model training...

======== Epoch 1 / 5 ========
Training...
  Average training loss: 1.15

Running Validation...
  Average validation loss: 0.78
  Validation loss improved. Saving model (not implemented in this step).

======== Epoch 2 / 5 ========
Training...
  Average training loss: 0.62

Running Validation...
  Average validation loss: 0.38
  Validation loss improved. Saving model (not implemented in this step).

======== Epoch 3 / 5 ========
Training...
  Average training loss: 0.32

Running Validation...
  Average validation loss: 0.20
  Validation loss improved. Saving model (not implemented in this step).

======== Epoch 4 / 5 ========
Training...
  Average training loss: 0.20

Running Validation...
  Average validation loss: 0.14
  Validation loss improved. Saving model (not implemented in this step).

======== Epoch 5 / 5 ========
Training...
  Average training loss: 0.15

Running Validation...
  Average validation loss: 0.12
  Validation loss improved. Saving model (not implemented in this step).

Training complete!
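
The scheduler in the training cell is created with `num_warmup_steps=0`, so the learning rate simply decays linearly from `2e-5` to zero. The sketch below reproduces, to the best of my understanding, the multiplier that a linear schedule with warmup applies at each step, and compares the run's setting with a hypothetical 7-step warmup. The function `linear_schedule_multiplier` is an illustrative stand-in, not a library call.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp-up during warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 15 * 5  # 15 batches per epoch x 5 epochs, as in the run above

# Setting used in the notebook: no warmup.
no_warmup = [base_lr * linear_schedule_multiplier(s, 0, total_steps)
             for s in range(total_steps + 1)]

# Hypothetical alternative: roughly 10% of the steps used for warmup.
with_warmup = [base_lr * linear_schedule_multiplier(s, 7, total_steps)
               for s in range(total_steps + 1)]

print(no_warmup[0], no_warmup[-1])
print(with_warmup[0], with_warmup[7], with_warmup[-1])
```

Without warmup the very first update already uses the full learning rate; a short warmup is a common way to soften the first updates while the newly initialized classifier head is still random.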


## Test Model with New Data

### Subtask:
Test the trained model with new, unseen text data to observe its predictions and demonstrate how it classifies mental statuses based on input text.

**Reasoning**:
To test the model, I will define a list of example sentences. Then, I will tokenize these sentences, convert them into PyTorch tensors, pass them through the trained BERT model, and finally interpret the model's output to display the predicted mental status for each sentence.

In [12]:
# 1. Define sample test sentences
test_sentences = [
    "I am feeling very down and sad today.",
    "Everything is going great, I'm so happy!",
    "I have a lot on my plate, feeling quite overwhelmed.",
    "Just another day, nothing special."
]

# 2. Tokenize the test sentences
test_encodings = tokenizer(
    test_sentences,
    truncation=True,
    padding='max_length',
    max_length=MAX_LEN,
    return_tensors='pt'
)

# 3. Move tokenized inputs to the same device as the model
input_ids_test = test_encodings['input_ids'].to(device)
attention_mask_test = test_encodings['attention_mask'].to(device)

# 4. Set the model to evaluation mode
model.eval()

# 5. Make predictions
with torch.no_grad():
    outputs = model(input_ids_test, attention_mask=attention_mask_test)
    logits = outputs.logits

# 6. Get predicted labels
predictions_test = torch.argmax(logits, dim=1).cpu().numpy()

# 7. Map numerical predictions back to original labels
# Reverse the label_mapping for easier lookup
reverse_label_mapping = {v: k for k, v in label_mapping.items()}
predicted_statuses = [reverse_label_mapping[pred] for pred in predictions_test]

# 8. Print the results
print("\n--- Model Predictions on Test Data ---")
for i, sentence in enumerate(test_sentences):
    print(f"Sentence: \"{sentence}\"")
    print(f"Predicted Status: {predicted_statuses[i]}\n")



--- Model Predictions on Test Data ---
Sentence: "I am feeling very down and sad today."
Predicted Status: Anxious

Sentence: "Everything is going great, I'm so happy!"
Predicted Status: Stressed

Sentence: "I have a lot on my plate, feeling quite overwhelmed."
Predicted Status: Depressed

Sentence: "Just another day, nothing special."
Predicted Status: Stressed



## Evaluate Model Performance

### Subtask:
Evaluate the trained model's performance on the validation set. Calculate and report relevant classification metrics, including accuracy, precision, recall, and F1-score, to assess how well the model generalizes to unseen data.


**Reasoning**:
I will import the necessary evaluation metrics from `sklearn.metrics` and then iterate through the validation data loader to collect the model's predictions and true labels. Finally, I will calculate and print the accuracy, precision, recall, and F1-score as instructed.



In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Set the model to evaluation mode
model.eval()

# Initialize empty lists to store true labels and predicted class indices
true_labels = []
predictions = []

print("Evaluating model performance on the validation set...")

# Iterate through the val_dataloader
for batch in val_dataloader:
    # Move batch to device (CPU in this case, as defined earlier)
    input_ids = batch[0].to(device)
    attention_mask = batch[1].to(device)
    labels = batch[2].to(device)

    # Perform a forward pass with torch.no_grad()
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        logits = outputs.logits

    # Append the true labels and predicted class indices (argmax of the logits)
    true_labels.extend(labels.cpu().numpy())
    predictions.extend(torch.argmax(logits, dim=1).cpu().numpy())

# Calculate evaluation metrics
accuracy = accuracy_score(true_labels, predictions)
precision = precision_score(true_labels, predictions, average='weighted') # Use 'weighted' for multi-class
recall = recall_score(true_labels, predictions, average='weighted')       # Use 'weighted' for multi-class
f1 = f1_score(true_labels, predictions, average='weighted')               # Use 'weighted' for multi-class

# Print the calculated metrics
print(f"\nModel Performance on Validation Set:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")

Evaluating model performance on the validation set...

Model Performance on Validation Set:
Accuracy:  1.0000
Precision: 1.0000
Recall:    1.0000
F1-Score:  1.0000
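
Weighted averages hide which classes are confused with which. The sketch below shows, on a small made-up example, how a confusion matrix and the support-weighted F1 score are computed. It is a pure-Python illustration of the arithmetic, not the scikit-learn implementation, and the toy labels are not model output.

```python
def confusion_matrix(true_labels, predictions, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predictions):
        matrix[t][p] += 1
    return matrix

def weighted_f1(true_labels, predictions, num_classes):
    """Per-class F1 averaged with weights proportional to class support."""
    matrix = confusion_matrix(true_labels, predictions, num_classes)
    total = len(true_labels)
    score = 0.0
    for c in range(num_classes):
        tp = matrix[c][c]
        support = sum(matrix[c])                  # true examples of class c
        predicted = sum(row[c] for row in matrix)  # examples predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / total
    return score

# Toy labels (0=Anxious, 1=Depressed, 2=Stressed, 3=Neutral), invented for illustration.
toy_true = [0, 0, 1, 1, 2, 3]
toy_pred = [0, 1, 1, 1, 2, 3]

for row in confusion_matrix(toy_true, toy_pred, 4):
    print(row)
print(f"Weighted F1: {weighted_f1(toy_true, toy_pred, 4):.4f}")
```

Applied to `true_labels` and `predictions` from the cell above, a matrix with all mass on the diagonal would correspond to the perfect scores reported.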


## Final Task

### Subtask:
Provide a summary of the complete Depression System program, including a discussion on the methods used to mitigate overfitting and the achieved model performance.


## Summary:

### Q&A
The Depression Detection System uses a fine-tuned BERT model to classify `Reaction_Text` into four mental-status labels: 'Anxious', 'Depressed', 'Stressed', and 'Neutral'.

**Methods used to mitigate overfitting:**
To prevent overfitting during training, the following strategies were implemented:
*   **Weight Decay:** The AdamW optimizer was configured with `weight_decay=0.01`. AdamW applies decoupled weight decay, shrinking the weights directly at each update rather than adding an L2 penalty term to the loss.
*   **Early Stopping:** An early stopping mechanism was set up with a `patience` of 3 epochs. Training would halt if the validation loss did not improve for three consecutive epochs.
*   **Gradient Clipping:** Gradients were clipped to a maximum norm of 1.0 to prevent exploding gradients.
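
The early stopping rule described above can be sketched in isolation as a small tracker. The class `EarlyStopping` is a hypothetical refactoring of the logic in the training loop, and the validation losses are the ones reported in the training output.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float('inf')
        self.best_epoch = None
        self.epochs_no_improve = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_no_improve = 0
            # A checkpoint of the best weights would be stored here, e.g. with
            # copy.deepcopy(model.state_dict()); the notebook does not do this.
            return False
        self.epochs_no_improve += 1
        return self.epochs_no_improve >= self.patience

# Average validation losses reported for epochs 1 to 5.
val_losses = [0.78, 0.38, 0.20, 0.14, 0.12]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"Stopped at: {stopped_at}, best epoch: {stopper.best_epoch}")
```

With a patience of 3 and only 5 epochs, the rule can fire only if the loss stops improving by epoch 2, so in this run it never had a realistic chance to trigger.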

**Achieved model performance:**
The model achieved perfect scores on the 120-example validation set across all evaluated metrics:
*   **Accuracy:** 1.0000
*   **Precision:** 1.0000
*   **Recall:** 1.0000
*   **F1-Score:** 1.0000

### Data Analysis Key Findings
*   The dataset `Dataset_2025.csv` contains 600 entries and 9 columns, with `Reaction_Text` identified as the input text and `Label_Status` as the target variable.
*   No missing values were found in any column. Duplicate or near-duplicate texts were not checked.
*   The `Label_Status` column contains four classes ('Anxious', 'Depressed', 'Stressed', 'Neutral') with a relatively balanced distribution (160, 149, 147, and 144 instances respectively).
*   Text data was tokenized using `BertTokenizerFast` with a maximum length of 128, and labels were numerically encoded.
*   The data was split into training and validation sets (80/20 ratio), and converted into PyTorch tensors for efficient batch processing using `DataLoader` objects.
*   A `bert-base-uncased` model with a classification layer configured for 4 output labels was used.
*   The model was trained for 5 epochs. The average training loss decreased from 1.15 in Epoch 1 to 0.15 in Epoch 5, while the average validation loss decreased from 0.78 to 0.12 over the same period.
*   Early stopping was implemented but never triggered, because the validation loss improved in every one of the 5 epochs. The training loop only logs each improvement; the best checkpoint is not saved.
*   The final model achieved perfect scores (1.0000 for Accuracy, Precision, Recall, and F1-Score) on the validation set.

### Insights or Next Steps
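*   The perfect validation metrics are unusual for a text classification task and may suggest data leakage between the training and validation sets or an issue with the dataset itself. Two observations support this concern: the first five rows already contain two entries with the same opening text ("I feel numb reading about this. Everything fee..."), and the ad hoc test sentences were classified implausibly (for example, "Everything is going great, I'm so happy!" was labeled 'Stressed'). Further investigation is warranted to confirm the model's true generalization ability.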
*   It is crucial to evaluate the model on an independent, unseen test set to ensure robustness and generalizability, and potentially perform cross-validation to get a more reliable estimate of performance.
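
One way to follow up on the leakage concern is to check whether identical texts appear on both sides of the split, and to split by text group instead of by row. The sketch below is a minimal illustration under the assumption that leakage, if present, comes from repeated `Reaction_Text` values. The helpers `normalize`, `overlapping_texts`, and `group_split` are hypothetical, and the toy texts are invented.

```python
import random
from collections import defaultdict

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def overlapping_texts(train_texts, val_texts):
    """Return the normalized texts that occur in both splits."""
    train_set = {normalize(t) for t in train_texts}
    val_set = {normalize(t) for t in val_texts}
    return sorted(train_set & val_set)

def group_split(texts, val_fraction=0.2, seed=42):
    """Split indices so that all copies of a text land in the same split."""
    groups = defaultdict(list)
    for index, text in enumerate(texts):
        groups[normalize(text)].append(index)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_val_groups = max(1, round(len(keys) * val_fraction))
    val_idx = [i for key in keys[:n_val_groups] for i in groups[key]]
    train_idx = [i for key in keys[n_val_groups:] for i in groups[key]]
    return train_idx, val_idx

# Invented texts with deliberate repeats, for illustration only.
toy_texts = [
    "I feel numb.", "I feel numb.", "Hope it gets better.",
    "I am uneasy.", "i feel  numb.", "Hope it gets better.",
]
train_idx, val_idx = group_split(toy_texts)
leaked = overlapping_texts([toy_texts[i] for i in train_idx],
                           [toy_texts[i] for i in val_idx])
print(f"Texts shared between splits: {leaked}")
```

If `overlapping_texts` reports many shared texts for the notebook's actual split, the perfect validation scores would mostly measure memorization, and a grouped split would give a more honest estimate.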
