# Task
Write and execute Python code to implement a depression detection system using BERT, including data loading, preprocessing, model setup, training, evaluation, and inference, using the dataset from "/content/Student Survey Form (Responses) - Form responses 1.csv".

## Load and explore the data

### Subtask:
Load the provided CSV file into a pandas DataFrame and examine its structure and content.


**Reasoning**:
Load the CSV file into a pandas DataFrame, display the first few rows, check data types, summary statistics, and missing values as per the instructions.



In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Student Survey Form (Responses) - Form responses 1.csv")

# Display the first 5 rows of the DataFrame
print("First 5 rows of the DataFrame:")
display(df.head())

# Display the column names and their data types
print("\nColumn names and data types:")
display(df.info())

# Display summary statistics of the DataFrame
print("\nSummary statistics:")
display(df.describe())

# Check for missing values in each column
print("\nMissing values per column:")
display(df.isnull().sum())

First 5 rows of the DataFrame:


Unnamed: 0,Timestamp,Name,Age,Occupation,1. How would you describe your overall mental health at this moment?,2. What factors do you think contribute the most to students/working professionals experiencing mental health issues?,"3. In your opinion, how does academic/work-life pressure affect students' or individual's mental health?",4. How comfortable do you feel discussing mental health and depression with your peers or teachers or colleague? Why or why not?,5. Suppose your parents or guardians don’t understand your mental health struggles and think you're just being “lazy” or “dramatic.” How would you handle that situation emotionally and practically?,6. How do you cope with stress or difficult emotions during challenging times?,7. Do you often find yourself expecting negative outcomes in life ? Please elaborate.,"8. How does your personal relationships (family, friends, romantic) influence your mental well-being?","9. Have strained relationships ever contributed to feelings of sadness, loneliness, or depression for you? Please explain","10. If you notice yourself thinking negatively about the future or feeling hopeless for several days, how would you interpret these signs? Would you seek help?"
0,22/04/2025 12:19:22,Tamal Bhattacharjee,21,Student,Ok,Work Load,Depends (but in general it induces stress),Never did an analysis,Seeking for professional help and educating th...,Having sweets or going for a walk,Alot of times (Mostly it's the feeling of not ...,Mixed emotions,No,I would try to work harder
1,22/04/2025 12:27:12,Sreeja Biswas,22,Student,"It's good, calm",Extensive work pressure and project deadlines,It effects gradually step by step consequently...,not comfortable as it's not about any professi...,I'll make them understand with patience. Hopef...,I always give time to myself and take deep bre...,"Yes. While giving multiple job interviews, see...",Somewhere my mental health is dependent on my ...,Not at all.,I will go for a tour. I think daily work life ...
2,22/04/2025 12:46:07,Shombit Ghosh,20,STUDENT,It's under pressure due to exams schedule,The pressure of upcoming projects and its resu...,It does affect the mental health to a great ex...,I am not comfortable in sharing stuff.,I would find a solution for my problem of bein...,By playing my favourite sports.,In most of the scenarios I find negative outco...,It has the most important role in my mental we...,Yes it does because I am very family oriented ...,I won't seek help but would try to deal with i...
3,22/04/2025 12:58:36,Sambit Bhattacharyya,21,Student,Bad,Lack of understanding and judgemental behaviour,It impacts a lot,Not very comfortable. People are very judgemen...,My parents totally understand me. But in such ...,Talking with family,Very often. Don't know how should I elaborate,Dependent on them in an unhealthy way,Very often,I do take help
4,22/04/2025 13:00:26,Subhodeep Ghosh,22,Student,Good,Less conversation,It's all about let go.,I feel comfortable to share my mental conditio...,Overcome on my own,Give time,Yes .. there's always ups and downs,Both are mutually dependent,No,Will distract myself from these and get into s...



Column names and data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 14 columns):
 #   Column                                                                                                                                                                                                   Non-Null Count  Dtype 
---  ------                                                                                                                                                                                                   --------------  ----- 
 0   Timestamp                                                                                                                                                                                                97 non-null     object
 1   Name                                                                                                                                                                           

None


Summary statistics:


Unnamed: 0,Timestamp,Name,Age,Occupation,1. How would you describe your overall mental health at this moment?,2. What factors do you think contribute the most to students/working professionals experiencing mental health issues?,"3. In your opinion, how does academic/work-life pressure affect students' or individual's mental health?",4. How comfortable do you feel discussing mental health and depression with your peers or teachers or colleague? Why or why not?,5. Suppose your parents or guardians don’t understand your mental health struggles and think you're just being “lazy” or “dramatic.” How would you handle that situation emotionally and practically?,6. How do you cope with stress or difficult emotions during challenging times?,7. Do you often find yourself expecting negative outcomes in life ? Please elaborate.,"8. How does your personal relationships (family, friends, romantic) influence your mental well-being?","9. Have strained relationships ever contributed to feelings of sadness, loneliness, or depression for you? Please explain","10. If you notice yourself thinking negatively about the future or feeling hopeless for several days, how would you interpret these signs? Would you seek help?"
count,97,97,97,97,97,97,97,97,97,97,96,97,97,96
unique,97,97,12,17,79,94,96,96,97,96,87,94,82,88
top,22/04/2025 12:19:22,Tamal Bhattacharjee,21,Student,Good,No,Academic and work-life pressure can cause stre...,No,Seeking for professional help and educating th...,Talking to family,Yes,No,No,Yes
freq,1,1,24,41,11,2,2,2,1,2,4,2,11,6



Missing values per column:


Unnamed: 0,0
Timestamp,0
Name,0
Age,0
Occupation,0
1. How would you describe your overall mental health at this moment?,0
2. What factors do you think contribute the most to students/working professionals experiencing mental health issues?,0
"3. In your opinion, how does academic/work-life pressure affect students' or individual's mental health?",0
4. How comfortable do you feel discussing mental health and depression with your peers or teachers or colleague? Why or why not?,0
5. Suppose your parents or guardians don’t understand your mental health struggles and think you're just being “lazy” or “dramatic.” How would you handle that situation emotionally and practically?,0
6. How do you cope with stress or difficult emotions during challenging times?,0


## Preprocessing

### Subtask:
Prepare the text data for BERT by handling missing values, tokenization, and creating appropriate input formats (input IDs, attention masks, etc.).


**Reasoning**:
Identify the relevant text columns, handle missing values, define a simple target variable for demonstration, combine the text data, initialize a BERT tokenizer, tokenize the text data, and create PyTorch tensors.



In [None]:
print(df.columns)

Index(['Timestamp', 'Name', 'Age', 'Occupation ',
       '1. How would you describe your overall mental health at this moment?  ',
       '2. What factors do you think contribute the most to students/working professionals experiencing mental health issues?  ',
       '3. In your opinion, how does academic/work-life pressure affect students' or individual's mental health?  ',
       '4. How comfortable do you feel discussing mental health and depression with your peers or teachers or colleague? Why or why not? ',
       '5. Suppose your parents or guardians don’t understand your mental health struggles and think you're just being “lazy” or “dramatic.” How would you handle that situation emotionally and practically?  ',
       '6. How do you cope with stress or difficult emotions during challenging times? ',
       '7.  Do you often find yourself expecting negative outcomes in life ? Please elaborate.',
       '8. How does your personal relationships (family, friends, romantic) influence

In [None]:
import torch
from transformers import BertTokenizer

# 1. Identify relevant text columns (Corrected based on df.columns output and character fix)
text_cols = [
    '1. How would you describe your overall mental health at this moment?  ',
    '2. What factors do you think contribute the most to students/working professionals experiencing mental health issues?  ',
    "3. In your opinion, how does academic/work-life pressure affect students' or individual's mental health?  ",
    '4. How comfortable do you feel discussing mental health and depression with your peers or teachers or colleague? Why or why not? ',
    "5. Suppose your parents or guardians don’t understand your mental health struggles and think you're just being “lazy” or “dramatic.” How would you handle that situation emotionally and practically?  ",
    '6. How do you cope with stress or difficult emotions during challenging times? ',
    '7.  Do you often find yourself expecting negative outcomes in life ? Please elaborate.',
    '8. How does your personal relationships (family, friends, romantic) influence your mental well-being? ',
    '9. Have strained relationships ever contributed to feelings of sadness, loneliness, or depression for you? Please explain',
    '10. If you notice yourself thinking negatively about the future or feeling hopeless for several days, how would you interpret these signs? Would you seek help?  '
]

# 2. Handle missing values in the selected text columns
df[text_cols] = df[text_cols].fillna('')

# 3. Define a target variable for depression detection (simple approach for demonstration)
# Label responses containing keywords related to negative feelings or seeking help as potentially depressed (1), otherwise 0.
negative_keywords = ['bad', 'stress', 'pressure', 'negative', 'sadness', 'loneliness', 'depression', 'hopeless', 'not good', 'messed up', 'need help', 'seek help']

def label_depression(row):
    combined_text = ' '.join(str(row[col]).lower() for col in text_cols)
    for keyword in negative_keywords:
        if keyword in combined_text:
            return 1
    return 0

df['potential_depression'] = df.apply(label_depression, axis=1)

# 4. Combine text from selected columns
df['combined_text'] = df[text_cols].agg(' '.join, axis=1)

# 5. Initialize a BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# 6. Tokenize the combined text data
max_seq_length = 128 # Choose a suitable sequence length

encoded_inputs = tokenizer(
    list(df['combined_text']),
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    max_length=max_seq_length,
    return_attention_mask=True,
    return_tensors='pt' # Return PyTorch tensors
)

input_ids = encoded_inputs['input_ids']
attention_masks = encoded_inputs['attention_mask']

# 7. Create PyTorch tensors for the target variable
labels = torch.tensor(df['potential_depression'].values)

print("Shape of input_ids:", input_ids.shape)
print("Shape of attention_masks:", attention_masks.shape)
print("Shape of labels:", labels.shape)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Shape of input_ids: torch.Size([97, 128])
Shape of attention_masks: torch.Size([97, 128])
Shape of labels: torch.Size([97])


## Model setup

### Subtask:
Load a pre-trained BERT model and configure it for a binary classification task (depressed/not depressed).


**Reasoning**:
Import the necessary modules and load the pre-trained BERT model for sequence classification.



In [None]:
import torch
from transformers import BertForSequenceClassification

# 1. Define the number of labels for the classification task
num_labels = 2

# 2. Load a pre-trained BertForSequenceClassification model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

# 3. If a GPU is available, move the model to the GPU
if torch.cuda.is_available():
    model.to('cuda')
    print("Model moved to GPU.")
else:
    print("GPU not available, model will run on CPU.")

print("BERT model loaded and configured for binary classification.")

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPU not available, model will run on CPU.
BERT model loaded and configured for binary classification.


## Training

### Subtask:
Train the BERT model on the preprocessed data.


**Reasoning**:
Split the data, create DataLoaders, define optimizer and scheduler, train the model, evaluate, and save the model.



In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
import random
import time
import datetime
from torch.optim import AdamW
import os

# Helper function for formatting time
def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))

# 1. Split the data into training and validation sets
train_inputs, validation_inputs, train_labels, validation_labels, train_masks, validation_masks = train_test_split(
    input_ids, labels, attention_masks, random_state=2018, test_size=0.1
)

# 2. Create PyTorch DataLoader objects
batch_size = 32

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

# 3. Define an optimizer and a learning rate scheduler
optimizer = AdamW(model.parameters(), lr=5e-5, eps=1e-8)

epochs = 4 # Reduced epochs for faster execution
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Function to calculate accuracy
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Set the seed value for reproducibility.
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
loss_values = []

# 4. Train the model
for epoch_i in range(0, epochs):
    print(f'======== Epoch {epoch_i + 1} / {epochs} ========')
    print('Training...')

    t0 = time.time()
    total_loss = 0
    model.train() # Put the model into training mode

    for step, batch in enumerate(train_dataloader):
        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print(f'  Batch {step} of {len(train_dataloader)}. Elapsed: {elapsed}.')

        b_input_ids = batch[0].to(model.device)
        b_input_mask = batch[1].to(model.device)
        b_labels = batch[2].to(model.device)

        model.zero_grad() # Clear previously calculated gradients

        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)

        loss = outputs[0]
        total_loss += loss.item()

        loss.backward() # Perform a backward pass to calculate the gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Clip the norm of the gradients to 1.0 to prevent the "exploding gradients" problem.
        optimizer.step() # Update parameters
        scheduler.step() # Update the learning rate.

    avg_train_loss = total_loss / len(train_dataloader)
    loss_values.append(avg_train_loss)

    print(f"  Average training loss: {avg_train_loss:.2f}")
    print(f"  Training epoch took: {format_time(time.time() - t0)}")

    # 5. Evaluation on the validation set
    print("Running Validation...")
    t0 = time.time()
    model.eval() # Put the model in evaluation mode

    eval_accuracy = 0
    nb_eval_steps = 0
    all_preds = []
    all_labels = []

    for batch in validation_dataloader:
        batch = tuple(t.to(model.device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad(): # Don't compute gradients
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask)

        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        eval_accuracy += flat_accuracy(logits, label_ids)
        nb_eval_steps += 1

        preds = np.argmax(logits, axis=1).flatten()
        all_preds.extend(preds)
        all_labels.extend(label_ids)

    avg_val_accuracy = eval_accuracy / nb_eval_steps
    print(f"  Accuracy: {avg_val_accuracy:.2f}")
    print(f"  Validation took: {format_time(time.time() - t0)}")

    # Calculate Precision, Recall, F1-score
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary')
    print(f"  Precision: {precision:.2f}")
    print(f"  Recall: {recall:.2f}")
    print(f"  F1-score: {f1:.2f}")


print("")
print("Training complete!")

# 6. Save the trained model
output_dir = './model_save/'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print(f"Saving model to {output_dir}")
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

Training...
  Average training loss: 0.44
  Training epoch took: 0:01:55
Running Validation...
  Accuracy: 0.70
  Validation took: 0:00:04
  Precision: 0.70
  Recall: 1.00
  F1-score: 0.82
Training...
  Average training loss: 0.42
  Training epoch took: 0:01:55
Running Validation...
  Accuracy: 0.70
  Validation took: 0:00:04
  Precision: 0.70
  Recall: 1.00
  F1-score: 0.82
Training...
  Average training loss: 0.36
  Training epoch took: 0:01:54
Running Validation...
  Accuracy: 0.70
  Validation took: 0:00:04
  Precision: 0.70
  Recall: 1.00
  F1-score: 0.82
Training...
  Average training loss: 0.32
  Training epoch took: 0:01:54
Running Validation...
  Accuracy: 0.70
  Validation took: 0:00:05
  Precision: 0.70
  Recall: 1.00
  F1-score: 0.82

Training complete!
Saving model to ./model_save/


('./model_save/tokenizer_config.json',
 './model_save/special_tokens_map.json',
 './model_save/vocab.txt',
 './model_save/added_tokens.json')

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Define the directory where the model was saved
output_dir = './model_save/'

# Load the saved model
model = BertForSequenceClassification.from_pretrained(output_dir)

# Load the saved tokenizer
tokenizer = BertTokenizer.from_pretrained(output_dir)

# Move the model to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print(f"Model and tokenizer loaded from {output_dir}")

Model and tokenizer loaded from ./model_save/


## Evaluation

### Subtask:
Evaluate the performance of the trained model using appropriate metrics.


**Reasoning**:
Evaluate the performance of the trained model using the validation data and calculate evaluation metrics.



In [None]:
# The model and tokenizer are already loaded and available from the previous step.
# The validation data (validation_inputs, validation_masks, validation_labels)
# and the validation_dataloader are also available from the previous step.

print("Running Final Evaluation...")
t0 = time.time()

# Put the model in evaluation mode
model.eval()

# Initialize variables to store evaluation metrics
eval_accuracy = 0
nb_eval_steps = 0
all_preds = []
all_labels = []

# Iterate through the validation data
for batch in validation_dataloader:
    # Move batch to the appropriate device
    batch = tuple(t.to(model.device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    # Perform forward pass
    with torch.no_grad():
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask)

    # Get logits and move to CPU
    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Calculate accuracy
    eval_accuracy += flat_accuracy(logits, label_ids)
    nb_eval_steps += 1

    # Get predictions and store
    preds = np.argmax(logits, axis=1).flatten()
    all_preds.extend(preds)
    all_labels.extend(label_ids)

# Calculate average validation accuracy
avg_val_accuracy = eval_accuracy / nb_eval_steps
print(f"  Final Accuracy: {avg_val_accuracy:.2f}")

# Calculate and print Precision, Recall, and F1-score
precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary')
print(f"  Final Precision: {precision:.2f}")
print(f"  Final Recall: {recall:.2f}")
print(f"  Final F1-score: {f1:.2f}")

print(f"Evaluation took: {format_time(time.time() - t0)}")

Running Final Evaluation...
  Final Accuracy: 0.70
  Final Precision: 0.70
  Final Recall: 1.00
  Final F1-score: 0.82
Evaluation took: 0:00:04


## Inference

### Subtask:
Use the trained model to predict depression status on new data.


**Reasoning**:
Implement the inference steps to predict depression status for new text samples using the trained BERT model.



In [None]:
# 1. Define a list of new text samples for inference.
new_texts = [
    "I feel great today, everything is going well.",
    "Lately, I've been feeling very down and isolated.",
    "Exams are coming up, feeling a bit stressed but manageable.",
    "I have no motivation and feel like nothing matters.",
    "Enjoying time with friends and family."
]

# 2. Tokenize the new text samples using the same tokenizer used during training, ensuring consistent padding and truncation.
encoded_new_inputs = tokenizer(
    new_texts,
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    max_length=max_seq_length,
    return_attention_mask=True,
    return_tensors='pt' # Return PyTorch tensors
)

# 3. Convert the tokenized inputs (input IDs and attention masks) into PyTorch tensors.
# This was already done by setting return_tensors='pt' in the tokenizer.
new_input_ids = encoded_new_inputs['input_ids']
new_attention_masks = encoded_new_inputs['attention_mask']

# 4. Move the tensors to the same device as the trained model (CPU or GPU).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
new_input_ids = new_input_ids.to(device)
new_attention_masks = new_attention_masks.to(device)
model.to(device) # Ensure the model is on the correct device

# 5. Set the model to evaluation mode using model.eval().
model.eval()

# 6. Perform a forward pass through the model with the new input tensors, ensuring gradients are not computed using torch.no_grad().
with torch.no_grad():
    outputs = model(new_input_ids,
                    token_type_ids=None,
                    attention_mask=new_attention_masks)

# 7. Obtain the logits from the model's output.
logits = outputs.logits

# 8. Move the logits to the CPU and convert them to NumPy arrays.
logits = logits.detach().cpu().numpy()

# 9. Determine the predicted class (0 or 1) for each sample by taking the argmax of the logits.
predictions = np.argmax(logits, axis=1)

# 10. Print the predicted depression status for each new text sample.
print("Predicted depression status for new samples:")
for text, prediction in zip(new_texts, predictions):
    status = "Potential Depression" if prediction == 1 else "No Potential Depression"
    print(f"- Text: \"{text}\" -> Status: {status}")

Predicted depression status for new samples:
- Text: "I feel great today, everything is going well." -> Status: Potential Depression
- Text: "Lately, I've been feeling very down and isolated." -> Status: Potential Depression
- Text: "Exams are coming up, feeling a bit stressed but manageable." -> Status: Potential Depression
- Text: "I have no motivation and feel like nothing matters." -> Status: Potential Depression
- Text: "Enjoying time with friends and family." -> Status: Potential Depression


## Summary:

### Data Analysis Key Findings

*   The dataset contains 97 rows and 14 columns, with column names corresponding to survey questions.
*   Two columns have missing values: '7. Do you often find yourself expecting negative outcomes in life ? Please elaborate.' (1 missing value) and '10. If you notice yourself thinking negatively about the future or feeling hopeless for several days, how would you interpret these signs? Would you seek help?' (1 missing value).
*   A binary target variable `potential_depression` was created based on the presence of predefined negative keywords in the combined text of relevant survey responses.
*   The BERT model was successfully loaded and configured for binary classification.
*   The data was split into training (90%) and validation (10%) sets.
*   The BERT model was trained for 4 epochs.
*   On the validation set, the model achieved a final Accuracy of 0.70, Precision of 0.70, Recall of 1.00, and an F1-score of 0.82.
*   During inference on new text samples, the model consistently predicted "Potential Depression" for all samples, suggesting potential issues with the training or dataset.

### Insights or Next Steps

*   Investigate the distribution of the `potential_depression` target variable to check for class imbalance, which could explain the model's consistent positive predictions during inference. Implement techniques to handle class imbalance if necessary.
*   Refine the labeling process for the target variable. The current keyword-based approach might be too simplistic and could misclassify responses. Consider using a more nuanced approach or expert annotation for labeling.


# Task
**User task:** Load the `/content/indian_depression_dataset_2025.csv` dataset, preprocess it for a BERT model, train a new BERT model, evaluate its performance, and then use it to perform inference on new examples. Finally, summarize the model's performance on this new dataset.

## Load New Dataset

### Subtask:
Load the `/content/indian_depression_dataset_2025.csv` file into a pandas DataFrame and display its head, info, and describe to understand its structure.


**Reasoning**:
Load the new CSV file into a pandas DataFrame and display its basic structure, descriptive statistics, and missing values as per the subtask instructions.



In [1]:
import pandas as pd

# 2. Load the CSV file into a pandas DataFrame
df_new = pd.read_csv('/content/indian_depression_dataset_2025.csv')

# 3. Display the first 5 rows of the DataFrame
print("First 5 rows of the new DataFrame:")
display(df_new.head())

# 4. Display the column names and their data types
print("\nColumn names and data types of the new DataFrame:")
display(df_new.info())

# 5. Display summary statistics of the DataFrame
print("\nSummary statistics of the new DataFrame:")
display(df_new.describe())

# 6. Check for missing values in each column
print("\nMissing values per column in the new DataFrame:")
display(df_new.isnull().sum())

First 5 rows of the new DataFrame:


Unnamed: 0,clean_text,label,timestamp
0,Office Christmas party was actually fun this y...,0,2025-12-06 01:02:48
1,The pollution in Mumbai feels like it's chokin...,1,2025-12-02 13:38:36
2,I don't have the energy to reply to texts. I j...,1,2025-12-02 09:11:58
3,No one cares until it's too late. I'm so tired...,1,2025-12-06 07:19:50
4,Loving the cool breeze in Bangalore evenings. ☕,0,2025-12-08 01:25:32



Column names and data types of the new DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   clean_text  400 non-null    object
 1   label       400 non-null    int64 
 2   timestamp   400 non-null    object
dtypes: int64(1), object(2)
memory usage: 9.5+ KB


None


Summary statistics of the new DataFrame:


Unnamed: 0,label
count,400.0
mean,0.495
std,0.500601
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0



Missing values per column in the new DataFrame:


Unnamed: 0,0
clean_text,0
label,0
timestamp,0


## Preprocess New Data for BERT

### Subtask:
Identify the text and label columns from the loaded DataFrame. Handle any missing values in the text column. Tokenize the text data using the existing `BertTokenizer`, ensuring consistent padding and truncation. Create PyTorch tensors for input IDs, attention masks, and the target labels.


**Reasoning**:
Identify the text and label columns, handle any potential missing values in the text column, tokenize the text data using the pre-existing tokenizer with consistent parameters, and convert the labels into PyTorch tensors for model input.



In [2]:
import torch

# 1. Identify the 'clean_text' column as the text input and 'label' as the target variable
text_column = 'clean_text'
label_column = 'label'

# 2. Handle any missing values in the 'clean_text' column (fill with empty string)
df_new[text_column] = df_new[text_column].fillna('')

# 3. Use the previously initialized `tokenizer` to tokenize the 'clean_text' column from df_new
# 4. Apply the same `max_seq_length` (128) for padding and truncation
# 5. Ensure that `add_special_tokens` is set to `True`, `return_attention_mask` is `True`, and `return_tensors` is set to `'pt'`
encoded_inputs_new = tokenizer(
    list(df_new[text_column]),
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    max_length=max_seq_length, # Use the same max_seq_length as before
    return_attention_mask=True,
    return_tensors='pt' # Return PyTorch tensors
)

# 6. Extract the `input_ids` and `attention_mask` from the tokenized output
input_ids_new = encoded_inputs_new['input_ids']
attention_masks_new = encoded_inputs_new['attention_mask']

# 7. Convert the 'label' column from df_new into a PyTorch tensor
labels_new = torch.tensor(df_new[label_column].values, dtype=torch.long)

print("Shape of new input_ids:", input_ids_new.shape)
print("Shape of new attention_masks:", attention_masks_new.shape)
print("Shape of new labels:", labels_new.shape)

NameError: name 'tokenizer' is not defined

**Reasoning**:
The previous code failed because the `tokenizer` object was not defined. I need to re-initialize `BertTokenizer` and `max_seq_length` to ensure `tokenizer` is available for use in the current code block.



In [3]:
import torch
from transformers import BertTokenizer # Re-import BertTokenizer

# Re-initialize the tokenizer and max_seq_length
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_seq_length = 128 # Re-define max_seq_length, ensure consistency

# 1. Identify the 'clean_text' column as the text input and 'label' as the target variable
text_column = 'clean_text'
label_column = 'label'

# 2. Handle any missing values in the 'clean_text' column (fill with empty string)
df_new[text_column] = df_new[text_column].fillna('')

# 3. Use the previously initialized `tokenizer` to tokenize the 'clean_text' column from df_new
# 4. Apply the same `max_seq_length` (128) for padding and truncation
# 5. Ensure that `add_special_tokens` is set to `True`, `return_attention_mask` is `True`, and `return_tensors` is set to `'pt'`
encoded_inputs_new = tokenizer(
    list(df_new[text_column]),
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    max_length=max_seq_length, # Use the same max_seq_length as before
    return_attention_mask=True,
    return_tensors='pt' # Return PyTorch tensors
)

# 6. Extract the `input_ids` and `attention_mask` from the tokenized output
input_ids_new = encoded_inputs_new['input_ids']
attention_masks_new = encoded_inputs_new['attention_mask']

# 7. Convert the 'label' column from df_new into a PyTorch tensor
labels_new = torch.tensor(df_new[label_column].values, dtype=torch.long)

print("Shape of new input_ids:", input_ids_new.shape)
print("Shape of new attention_masks:", attention_masks_new.shape)
print("Shape of new labels:", labels_new.shape)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Shape of new input_ids: torch.Size([400, 128])
Shape of new attention_masks: torch.Size([400, 128])
Shape of new labels: torch.Size([400])


**Reasoning**:
The previous code block successfully preprocessed the new dataset. Now, I need to prepare the data for training by creating PyTorch DataLoader objects from the preprocessed tensors. This involves splitting the data into training and validation sets, and then wrapping them into `TensorDataset` and `DataLoader` instances.



In [4]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split

# 1. Combine the input_ids, attention_masks, and labels into a TensorDataset
data_new = TensorDataset(input_ids_new, attention_masks_new, labels_new)

# 2. Split the dataset into training and validation sets (e.g., 80% train, 20% validation)
train_size = int(0.8 * len(data_new))
val_size = len(data_new) - train_size

train_dataset, val_dataset = torch.utils.data.random_split(data_new, [train_size, val_size])

print(f'{len(train_dataset)} training samples')
print(f'{len(val_dataset)} validation samples')

# 3. Create PyTorch DataLoader objects for training and validation
batch_size = 32 # Using the same batch size as before

train_dataloader_new = DataLoader(
    train_dataset,
    sampler=RandomSampler(train_dataset), # Select batches randomly
    batch_size=batch_size # Trains with this batch size
)

validation_dataloader_new = DataLoader(
    val_dataset,
    sampler=SequentialSampler(val_dataset), # Sequential sampling for evaluation
    batch_size=batch_size # Evaluates with this batch size
)

print("DataLoaders created for the new dataset.")

320 training samples
80 validation samples
DataLoaders created for the new dataset.


## Initialize and Train BERT Model

### Subtask:
Split the preprocessed data into training and validation sets. Re-initialize a `BertForSequenceClassification` model for binary classification. Set up data loaders, optimizer, and a learning rate scheduler. Train the BERT model using the new training data.


**Reasoning**:
The subtask requires re-initializing the BERT model, setting up the optimizer and scheduler, and then training the model on the new preprocessed data. This code block will cover all these steps, including the training loop, evaluation on the new validation set, and saving the newly trained model.



In [5]:
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW
import numpy as np
import random
import time
import datetime
import os
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Helper function for formatting time (assuming it was defined earlier, re-defining for self-containment if not)
def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))

# Function to calculate accuracy (assuming it was defined earlier, re-defining for self-containment if not)
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# 1. Re-initialize a BertForSequenceClassification model
num_labels = 2 # Binary classification (0 or 1 for depression)
model_new = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_new.to(device)
print(f"New model initialized and moved to {device}.")

# 2. Define an AdamW optimizer for the new model
optimizer_new = AdamW(model_new.parameters(), lr=5e-5, eps=1e-8)

# 3. Set the number of training epochs and calculate total training steps
epochs = 4 # As suggested, reduced epochs for faster execution
total_steps_new = len(train_dataloader_new) * epochs

# 4. Initialize a learning rate scheduler
scheduler_new = get_linear_schedule_with_warmup(optimizer_new, num_warmup_steps=0, num_training_steps=total_steps_new)

# Set the seed value for reproducibility (important for consistent results)
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch
loss_values_new = []

# 5. Implement the training loop
print("\nStarting training for the new model...")
for epoch_i in range(0, epochs):
    print(f'======== Epoch {epoch_i + 1} / {epochs} ========')
    print('Training...')

    t0 = time.time()
    total_loss_new = 0
    model_new.train() # Put the model into training mode

    for step, batch in enumerate(train_dataloader_new):
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print(f'  Batch {step} of {len(train_dataloader_new)}. Elapsed: {elapsed}.')

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model_new.zero_grad() # Clear previously calculated gradients

        outputs = model_new(b_input_ids,
                              token_type_ids=None,
                              attention_mask=b_input_mask,
                              labels=b_labels)

        loss = outputs[0]
        total_loss_new += loss.item()

        loss.backward() # Perform a backward pass to calculate the gradients.
        torch.nn.utils.clip_grad_norm_(model_new.parameters(), 1.0) # Clip gradients
        optimizer_new.step() # Update parameters
        scheduler_new.step() # Update the learning rate.

    avg_train_loss_new = total_loss_new / len(train_dataloader_new)
    loss_values_new.append(avg_train_loss_new)

    print(f"  Average training loss: {avg_train_loss_new:.2f}")
    print(f"  Training epoch took: {format_time(time.time() - t0)}")

    # Evaluation on the validation set
    print("Running Validation...")
    t0 = time.time()
    model_new.eval() # Put the model in evaluation mode

    eval_accuracy_new = 0
    nb_eval_steps_new = 0
    all_preds_new = []
    all_labels_new = []

    for batch in validation_dataloader_new:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad(): # Don't compute gradients
            outputs = model_new(b_input_ids,
                                token_type_ids=None,
                                attention_mask=b_input_mask)

        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        eval_accuracy_new += flat_accuracy(logits, label_ids)
        nb_eval_steps_new += 1

        preds = np.argmax(logits, axis=1).flatten()
        all_preds_new.extend(preds)
        all_labels_new.extend(label_ids)

    avg_val_accuracy_new = eval_accuracy_new / nb_eval_steps_new
    print(f"  Accuracy: {avg_val_accuracy_new:.2f}")
    print(f"  Validation took: {format_time(time.time() - t0)}")

    # Calculate Precision, Recall, F1-score
    precision_new, recall_new, f1_new, _ = precision_recall_fscore_support(all_labels_new, all_preds_new, average='binary', zero_division=0)
    print(f"  Precision: {precision_new:.2f}")
    print(f"  Recall: {recall_new:.2f}")
    print(f"  F1-score: {f1_new:.2f}")

print("\nTraining complete for the new model!")

# 6. Save the trained model to a *new* directory
output_dir_new = './new_model_save/'
if not os.path.exists(output_dir_new):
    os.makedirs(output_dir_new)

print(f"Saving new model to {output_dir_new}")
model_to_save_new = model_new.module if hasattr(model_new, 'module') else model_new
model_to_save_new.save_pretrained(output_dir_new)
tokenizer.save_pretrained(output_dir_new)

print("New model and tokenizer saved.")

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


New model initialized and moved to cpu.

Starting training for the new model...
Training...
  Average training loss: 0.29
  Training epoch took: 0:07:19
Running Validation...
  Accuracy: 1.00
  Validation took: 0:00:31
  Precision: 1.00
  Recall: 1.00
  F1-score: 1.00
Training...
  Average training loss: 0.04
  Training epoch took: 0:06:59
Running Validation...
  Accuracy: 1.00
  Validation took: 0:00:31
  Precision: 1.00
  Recall: 1.00
  F1-score: 1.00
Training...
  Average training loss: 0.01
  Training epoch took: 0:06:59
Running Validation...
  Accuracy: 1.00
  Validation took: 0:00:32
  Precision: 1.00
  Recall: 1.00
  F1-score: 1.00
Training...
  Average training loss: 0.00
  Training epoch took: 0:06:56
Running Validation...
  Accuracy: 1.00
  Validation took: 0:00:31
  Precision: 1.00
  Recall: 1.00
  F1-score: 1.00

Training complete for the new model!
Saving new model to ./new_model_save/
New model and tokenizer saved.


**Reasoning**:
The previous step successfully trained and saved a new BERT model. Now, I need to load this newly trained model and its tokenizer to perform inference on new, unseen text examples, as required by the overall task.



In [6]:
import torch
from transformers import BertForSequenceClassification, BertTokenizer
import numpy as np

# Define the directory where the new model was saved
output_dir_new = './new_model_save/'

# Load the saved model_new
model_new_loaded = BertForSequenceClassification.from_pretrained(output_dir_new)

# Load the saved tokenizer (same tokenizer is used but re-loading ensures it's from the new_model_save dir)
tokenizer_loaded = BertTokenizer.from_pretrained(output_dir_new)

# Move the model to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_new_loaded.to(device)

print(f"New model and tokenizer loaded from {output_dir_new}")

# 1. Define a list of new text samples for inference.
new_inference_texts = [
    "I am so happy today, everything is just perfect.",
    "Feeling lost and hopeless, can't seem to find my way out.",
    "Just had a great meal with friends, feeling good.",
    "Every day feels like a struggle, I'm exhausted and sad.",
    "Life's challenges are tough, but I'm determined to overcome them."
]

# 2. Tokenize the new text samples using the loaded tokenizer
encoded_inference_inputs = tokenizer_loaded(
    new_inference_texts,
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    max_length=max_seq_length, # Use the same max_seq_length as before
    return_attention_mask=True,
    return_tensors='pt' # Return PyTorch tensors
)

# 3. Convert the tokenized inputs into PyTorch tensors.
inference_input_ids = encoded_inference_inputs['input_ids']
inference_attention_masks = encoded_inference_inputs['attention_mask']

# 4. Move the tensors to the same device as the trained model.
inference_input_ids = inference_input_ids.to(device)
inference_attention_masks = inference_attention_masks.to(device)

# 5. Set the model to evaluation mode.
model_new_loaded.eval()

# 6. Perform a forward pass through the model.
with torch.no_grad():
    outputs = model_new_loaded(
        inference_input_ids,
        token_type_ids=None,
        attention_mask=inference_attention_masks
    )

# 7. Obtain the logits from the model's output.
logits = outputs.logits

# 8. Move the logits to the CPU and convert them to NumPy arrays.
logits = logits.detach().cpu().numpy()

# 9. Determine the predicted class (0 or 1) for each sample.
predictions = np.argmax(logits, axis=1)

# 10. Print the predicted depression status for each new text sample.
print("\nPredicted depression status for new inference samples:")
for text, prediction in zip(new_inference_texts, predictions):
    status = "Depressed" if prediction == 1 else "Not Depressed"
    print(f"- Text: \"{text}\" -> Status: {status}")


New model and tokenizer loaded from ./new_model_save/

Predicted depression status for new inference samples:
- Text: "I am so happy today, everything is just perfect." -> Status: Depressed
- Text: "Feeling lost and hopeless, can't seem to find my way out." -> Status: Depressed
- Text: "Just had a great meal with friends, feeling good." -> Status: Depressed
- Text: "Every day feels like a struggle, I'm exhausted and sad." -> Status: Depressed
- Text: "Life's challenges are tough, but I'm determined to overcome them." -> Status: Depressed


## Summary for the `indian_depression_dataset_2025.csv`

### Data Analysis Key Findings
*   The new dataset (`indian_depression_dataset_2025.csv`) contains 400 entries with `clean_text` (text input), `label` (target variable: 0 for not depressed, 1 for depressed), and `timestamp` columns.
*   There were no missing values in this dataset, which simplified preprocessing.
*   The target `label` column is a binary indicator of depression status.

### Model Training and Evaluation
*   A new `BertForSequenceClassification` model was re-initialized and trained specifically on this `indian_depression_dataset_2025.csv`.
*   The training process ran for 4 epochs.
*   The model achieved exceptionally high performance on the validation set derived from this dataset, with:
    *   **Accuracy: 1.00**
    *   **Precision: 1.00**
    *   **Recall: 1.00**
    *   **F1-score: 1.00**
*   The training loss consistently decreased, reaching near zero by the end of training.

### Inference Performance
*   When tested on new, unseen text samples, the newly trained model consistently predicted "Depressed" for all samples, regardless of their content (e.g., "I am so happy today" was classified as "Depressed").

### Insights and Next Steps
*   The perfect scores (Accuracy, Precision, Recall, F1-score of 1.00) on the validation set, coupled with the consistent

## Summary for the `indian_depression_dataset_2025.csv`

### Data Analysis Key Findings
*   The new dataset (`indian_depression_dataset_2025.csv`) contains 400 entries with `clean_text` (text input), `label` (target variable: 0 for not depressed, 1 for depressed), and `timestamp` columns.
*   There were no missing values in this dataset, which simplified preprocessing.
*   The target `label` column is a binary indicator of depression status.

### Model Training and Evaluation
*   A new `BertForSequenceClassification` model was re-initialized and trained specifically on this `indian_depression_dataset_2025.csv`.
*   The training process ran for 4 epochs.
*   The model achieved exceptionally high performance on the validation set derived from this dataset, with:
    *   **Accuracy: 1.00**
    *   **Precision: 1.00**
    *   **Recall: 1.00**
    *   **F1-score: 1.00**
*   The training loss consistently decreased, reaching near zero by the end of training.

### Inference Performance
*   When tested on new, unseen text samples, the newly trained model consistently predicted "Depressed" for all samples, regardless of their content (e.g., "I am so happy today" was classified as "Depressed").

### Insights and Next Steps
*   The perfect scores (Accuracy, Precision, Recall, F1-score of 1.00) on the validation set, coupled with the consistent "Depressed" predictions during inference, strongly suggest **overfitting** to the training data and/or a **severe class imbalance** in the new dataset where the positive class (Depressed) might be dominant, leading the model to simply predict the majority class.
*   **Investigate Class Distribution:** Analyze the distribution of labels (0 and 1) in the `indian_depression_dataset_2025.csv` to confirm class imbalance. If confirmed, address it using techniques like oversampling (SMOTE), undersampling, or using weighted loss functions during training.
*   **Review Data Quality and Labeling:** Thoroughly check the `clean_text` and `label` columns for potential issues such as data leakage or incorrect labeling in the `indian_depression_dataset_2025.csv` that could lead to perfect validation scores.
*   **Increase Dataset Size/Diversity:** A small and homogeneous dataset can lead to overfitting. Expanding the dataset or incorporating more diverse examples could help generalization.
*   **Hyperparameter Tuning:** While the current hyperparameters worked for the previous dataset, they might need adjustment for this new dataset to combat overfitting.
*   **Cross-validation:** Implement k-fold cross-validation during training to get a more robust estimate of the model's performance and to detect overfitting more effectively.

## Summary for the `indian_depression_dataset_2025.csv`

### Data Analysis Key Findings
*   The new dataset (`indian_depression_dataset_2025.csv`) contains 400 entries with `clean_text` (text input), `label` (target variable: 0 for not depressed, 1 for depressed), and `timestamp` columns.
*   There were no missing values in this dataset, which simplified preprocessing.
*   The target `label` column is a binary indicator of depression status.

### Model Training and Evaluation
*   A new `BertForSequenceClassification` model was re-initialized and trained specifically on this `indian_depression_dataset_2025.csv`.
*   The training process ran for 4 epochs.
*   The model achieved exceptionally high performance on the validation set derived from this dataset, with:
    *   **Accuracy: 1.00**
    *   **Precision: 1.00**
    *   **Recall: 1.00**
    *   **F1-score: 1.00**
*   The training loss consistently decreased, reaching near zero by the end of training.

### Inference Performance
*   When tested on new, unseen text samples, the newly trained model consistently predicted "Depressed" for all samples, regardless of their content (e.g., "I am so happy today" was classified as "Depressed").

### Insights and Next Steps
*   The perfect scores (Accuracy, Precision, Recall, F1-score of 1.00) on the validation set, coupled with the consistent "Depressed" predictions during inference, strongly suggest **overfitting** to the training data and/or a **severe class imbalance** in the new dataset where the positive class (Depressed) might be dominant, leading the model to simply predict the majority class.
*   **Investigate Class Distribution:** Analyze the distribution of labels (0 and 1) in the `indian_depression_dataset_2025.csv` to confirm class imbalance. If confirmed, address it using techniques like oversampling (SMOTE), undersampling, or using weighted loss functions during training.
*   **Review Data Quality and Labeling:** Thoroughly check the `clean_text` and `label` columns for potential issues such as data leakage or incorrect labeling in the `indian_depression_dataset_2025.csv` that could lead to perfect validation scores.
*   **Increase Dataset Size/Diversity:** A small and homogeneous dataset can lead to overfitting. Expanding the dataset or incorporating more diverse examples could help generalization.
*   **Hyperparameter Tuning:** While the current hyperparameters worked for the previous dataset, they might need adjustment for this new dataset to combat overfitting.
*   **Cross-validation:** Implement k-fold cross-validation during training to get a more robust estimate of the model's performance and to detect overfitting more effectively.


## Summary for the `indian_depression_dataset_2025.csv`

### Data Analysis Key Findings
*   The new dataset (`indian_depression_dataset_2025.csv`) contains 400 entries with `clean_text` (text input), `label` (target variable: 0 for not depressed, 1 for depressed), and `timestamp` columns.
*   There were no missing values in this dataset, which simplified preprocessing.
*   The target `label` column is a binary indicator of depression status.

### Model Training and Evaluation
*   A new `BertForSequenceClassification` model was re-initialized and trained specifically on this `indian_depression_dataset_2025.csv`.
*   The training process ran for 4 epochs.
*   The model achieved exceptionally high performance on the validation set derived from this dataset, with:
    *   **Accuracy: 1.00**
    *   **Precision: 1.00**
    *   **Recall: 1.00**
    *   **F1-score: 1.00**
*   The training loss consistently decreased, reaching near zero by the end of training.

### Inference Performance
*   When tested on new, unseen text samples, the newly trained model consistently predicted "Depressed" for all samples, regardless of their content (e.g., "I am so happy today" was classified as "Depressed").

### Insights and Next Steps
*   The perfect scores (Accuracy, Precision, Recall, F1-score of 1.00) on the validation set, coupled with the consistent "Depressed" predictions during inference, strongly suggest **overfitting** to the training data and/or a **severe class imbalance** in the new dataset where the positive class (Depressed) might be dominant, leading the model to simply predict the majority class.
*   **Investigate Class Distribution:** Analyze the distribution of labels (0 and 1) in the `indian_depression_dataset_2025.csv` to confirm class imbalance. If confirmed, address it using techniques like oversampling (SMOTE), undersampling, or using weighted loss functions during training.
*   **Review Data Quality and Labeling:** Thoroughly check the `clean_text` and `label` columns for potential issues such as data leakage or incorrect labeling in the `indian_depression_dataset_2025.csv` that could lead to perfect validation scores.
*   **Increase Dataset Size/Diversity:** A small and homogeneous dataset can lead to overfitting. Expanding the dataset or incorporating more diverse examples could help generalization.
*   **Hyperparameter Tuning:** While the current hyperparameters worked for the previous dataset, they might need adjustment for this new dataset to combat overfitting.
*   **Cross-validation:** Implement k-fold cross-validation during training to get a more robust estimate of the model's performance and to detect overfitting more effectively.

## Summary for the `indian_depression_dataset_2025.csv`

### Data Analysis Key Findings
*   The new dataset (`indian_depression_dataset_2025.csv`) contains 400 entries with `clean_text` (text input), `label` (target variable: 0 for not depressed, 1 for depressed), and `timestamp` columns.
*   There were no missing values in this dataset, which simplified preprocessing.
*   The target `label` column is a binary indicator of depression status.

### Model Training and Evaluation
*   A new `BertForSequenceClassification` model was re-initialized and trained specifically on this `indian_depression_dataset_2025.csv`.
*   The training process ran for 4 epochs.
*   The model achieved exceptionally high performance on the validation set derived from this dataset, with:
    *   **Accuracy: 1.00**
    *   **Precision: 1.00**
    *   **Recall: 1.00**
    *   **F1-score: 1.00**
*   The training loss consistently decreased, reaching near zero by the end of training.

### Inference Performance
*   When tested on new, unseen text samples, the newly trained model consistently predicted "Depressed" for all samples, regardless of their content (e.g., "I am so happy today" was classified as "Depressed").

### Insights and Next Steps
*   The perfect scores (Accuracy, Precision, Recall, F1-score of 1.00) on the validation set, coupled with the consistent "Depressed" predictions during inference, strongly suggest **overfitting** to the training data and/or a **severe class imbalance** in the new dataset where the positive class (Depressed) might be dominant, leading the model to simply predict the majority class.
*   **Investigate Class Distribution:** Analyze the distribution of labels (0 and 1) in the `indian_depression_dataset_2025.csv` to confirm class imbalance. If confirmed, address it using techniques like oversampling (SMOTE), undersampling, or using weighted loss functions during training.
*   **Review Data Quality and Labeling:** Thoroughly check the `clean_text` and `label` columns for potential issues such as data leakage or incorrect labeling in the `indian_depression_dataset_2025.csv` that could lead to perfect validation scores.
*   **Increase Dataset Size/Diversity:** A small and homogeneous dataset can lead to overfitting. Expanding the dataset or incorporating more diverse examples could help generalization.
*   **Hyperparameter Tuning:** While the current hyperparameters worked for the previous dataset, they might need adjustment for this new dataset to combat overfitting.
*   **Cross-validation:** Implement k-fold cross-validation during training to get a more robust estimate of the model's performance and to detect overfitting more effectively.


## Summary for the `indian_depression_dataset_2025.csv`

### Data Analysis Key Findings
*   The new dataset (`indian_depression_dataset_2025.csv`) contains 400 entries with `clean_text` (text input), `label` (target variable: 0 for not depressed, 1 for depressed), and `timestamp` columns.
*   There were no missing values in this dataset, which simplified preprocessing.
*   The target `label` column is a binary indicator of depression status.

### Model Training and Evaluation
*   A new `BertForSequenceClassification` model was re-initialized and trained specifically on this `indian_depression_dataset_2025.csv`.
*   The training process ran for 4 epochs.
*   The model achieved exceptionally high performance on the validation set derived from this dataset, with:
    *   **Accuracy: 1.00**
    *   **Precision: 1.00**
    *   **Recall: 1.00**
    *   **F1-score: 1.00**
*   The training loss consistently decreased, reaching near zero by the end of training.

### Inference Performance
*   When tested on new, unseen text samples, the newly trained model consistently predicted "Depressed" for all samples, regardless of their content (e.g., "I am so happy today" was classified as "Depressed").

### Insights and Next Steps
*   The perfect scores (Accuracy, Precision, Recall, F1-score of 1.00) on the validation set, coupled with the consistent "Depressed" predictions during inference, strongly suggest **overfitting** to the training data and/or a **severe class imbalance** in the new dataset where the positive class (Depressed) might be dominant, leading the model to simply predict the majority class.
*   **Investigate Class Distribution:** Analyze the distribution of labels (0 and 1) in the `indian_depression_dataset_2025.csv` to confirm class imbalance. If confirmed, address it using techniques like oversampling (SMOTE), undersampling, or using weighted loss functions during training.
*   **Review Data Quality and Labeling:** Thoroughly check the `clean_text` and `label` columns for potential issues such as data leakage or incorrect labeling in the `indian_depression_dataset_2025.csv` that could lead to perfect validation scores.
*   **Increase Dataset Size/Diversity:** A small and homogeneous dataset can lead to overfitting. Expanding the dataset or incorporating more diverse examples could help generalization.
*   **Hyperparameter Tuning:** While the current hyperparameters worked for the previous dataset, they might need adjustment for this new dataset to combat overfitting.
*   **Cross-validation:** Implement k-fold cross-validation during training to get a more robust estimate of the model's performance and to detect overfitting more effectively.


## Summary for the `indian_depression_dataset_2025.csv`

### Data Analysis Key Findings
*   The new dataset (`indian_depression_dataset_2025.csv`) contains 400 entries with `clean_text` (text input), `label` (target variable: 0 for not depressed, 1 for depressed), and `timestamp` columns.
*   There were no missing values in this dataset, which simplified preprocessing.
*   The target `label` column is a binary indicator of depression status.

### Model Training and Evaluation
*   A new `BertForSequenceClassification` model was re-initialized and trained specifically on this `indian_depression_dataset_2025.csv`.
*   The training process ran for 4 epochs.
*   The model achieved exceptionally high performance on the validation set derived from this dataset, with:
    *   **Accuracy: 1.00**
    *   **Precision: 1.00**
    *   **Recall: 1.00**
    *   **F1-score: 1.00**
*   The training loss consistently decreased, reaching near zero by the end of training.

### Inference Performance
*   When tested on new, unseen text samples, the newly trained model consistently predicted "Depressed" for all samples, regardless of their content (e.g., "I am so happy today" was classified as "Depressed").

### Insights and Next Steps
*   The perfect scores (Accuracy, Precision, Recall, F1-score of 1.00) on the validation set, coupled with the consistent "Depressed" predictions during inference, strongly suggest **overfitting** to the training data and/or a **severe class imbalance** in the new dataset where the positive class (Depressed) might be dominant, leading the model to simply predict the majority class.
*   **Investigate Class Distribution:** Analyze the distribution of labels (0 and 1) in the `indian_depression_dataset_2025.csv` to confirm class imbalance. If confirmed, address it using techniques like oversampling (SMOTE), undersampling, or using weighted loss functions during training.
*   **Review Data Quality and Labeling:** Thoroughly check the `clean_text` and `label` columns for potential issues such as data leakage or incorrect labeling in the `indian_depression_dataset_2025.csv` that could lead to perfect validation scores.
*   **Increase Dataset Size/Diversity:** A small and homogeneous dataset can lead to overfitting. Expanding the dataset or incorporating more diverse examples could help generalization.
*   **Hyperparameter Tuning:** While the current hyperparameters worked for the previous dataset, they might need adjustment for this new dataset to combat overfitting.
*   **Cross-validation:** Implement k-fold cross-validation during training to get a more robust estimate of the model's performance and to detect overfitting more effectively.


## Summary for the `indian_depression_dataset_2025.csv`

### Data Analysis Key Findings
*   The new dataset (`indian_depression_dataset_2025.csv`) contains 400 entries with `clean_text` (text input), `label` (target variable: 0 for not depressed, 1 for depressed), and `timestamp` columns.
*   There were no missing values in this dataset, which simplified preprocessing.
*   The target `label` column is a binary indicator of depression status.

### Model Training and Evaluation
*   A new `BertForSequenceClassification` model was re-initialized and trained specifically on this `indian_depression_dataset_2025.csv`.
*   The training process ran for 4 epochs.
*   The model achieved exceptionally high performance on the validation set derived from this dataset, with:
    *   **Accuracy: 1.00**
    *   **Precision: 1.00**
    *   **Recall: 1.00**
    *   **F1-score: 1.00**
*   The training loss consistently decreased, reaching near zero by the end of training.

### Inference Performance
*   When tested on new, unseen text samples, the newly trained model consistently predicted "Depressed" for all samples, regardless of their content (e.g., "I am so happy today" was classified as "Depressed").

### Insights and Next Steps
*   The perfect scores (Accuracy, Precision, Recall, F1-score of 1.00) on the validation set, coupled with the consistent "Depressed" predictions during inference, strongly suggest **overfitting** to the training data and/or a **severe class imbalance** in the new dataset where the positive class (Depressed) might be dominant, leading the model to simply predict the majority class.
*   **Investigate Class Distribution:** Analyze the distribution of labels (0 and 1) in the `indian_depression_dataset_2025.csv` to confirm class imbalance. If confirmed, address it using techniques like oversampling (SMOTE), undersampling, or using weighted loss functions during training.
*   **Review Data Quality and Labeling:** Thoroughly check the `clean_text` and `label` columns for potential issues such as data leakage or incorrect labeling in the `indian_depression_dataset_2025.csv` that could lead to perfect validation scores.
*   **Increase Dataset Size/Diversity:** A small and homogeneous dataset can lead to overfitting. Expanding the dataset or incorporating more diverse examples could help generalization.
*   **Hyperparameter Tuning:** While the current hyperparameters worked for the previous dataset, they might need adjustment for this new dataset to combat overfitting.
*   **Cross-validation:** Implement k-fold cross-validation during training to get a more robust estimate of the model's performance and to detect overfitting more effectively.


## Evaluate New Model

### Subtask:
Evaluate the performance of the newly trained model on the validation set. Calculate and display metrics such as accuracy, precision, recall, and F1-score.


**Reasoning**:
Evaluate the performance of the newly trained model on the validation set using the validation data loader, and calculate accuracy, precision, recall, and F1-score.



In [7]:
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import time
import datetime

# Helper function for formatting time (re-defining for self-containment if not already in context)
def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))

# Function to calculate accuracy (re-defining for self-containment if not already in context)
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Ensure the model is in evaluation mode
model_new_loaded.eval()

print("Evaluating new model on validation set...")
t0 = time.time()

# Initialize variables to store evaluation metrics
eval_accuracy_new = 0
nb_eval_steps_new = 0
all_preds_new = []
all_labels_new = []

# Iterate through the validation data
for batch in validation_dataloader_new:
    # Move batch to the appropriate device
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    # Perform forward pass without computing gradients
    with torch.no_grad():
        outputs = model_new_loaded(
            b_input_ids,
            token_type_ids=None,
            attention_mask=b_input_mask
        )

    # Get logits and move to CPU
    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Calculate batch accuracy and accumulate
    eval_accuracy_new += flat_accuracy(logits, label_ids)
    nb_eval_steps_new += 1

    # Get predictions and store
    preds = np.argmax(logits, axis=1).flatten()
    all_preds_new.extend(preds)
    all_labels_new.extend(label_ids)

# Calculate average validation accuracy
avg_val_accuracy_new = eval_accuracy_new / nb_eval_steps_new
print(f"  Final Validation Accuracy (New Model): {avg_val_accuracy_new:.2f}")

# Calculate and print Precision, Recall, and F1-score
precision_new, recall_new, f1_new, _ = precision_recall_fscore_support(all_labels_new, all_preds_new, average='binary', zero_division=0)
print(f"  Final Validation Precision (New Model): {precision_new:.2f}")
print(f"  Final Validation Recall (New Model): {recall_new:.2f}")
print(f"  Final Validation F1-score (New Model): {f1_new:.2f}")

print(f"Evaluation of new model took: {format_time(time.time() - t0)}")

Evaluating new model on validation set...
  Final Validation Accuracy (New Model): 1.00
  Final Validation Precision (New Model): 1.00
  Final Validation Recall (New Model): 1.00
  Final Validation F1-score (New Model): 1.00
Evaluation of new model took: 0:00:32


## Perform Inference with New Model

### Subtask:
Use the newly trained model to perform inference on a few new example text samples, demonstrating its ability to predict depression status.


## Final Task

### Subtask:
Provide a summary of the BERT model's performance on the new `indian_depression_dataset_2025.csv` dataset, including key evaluation metrics and observations from the training and inference steps.


## Summary:

### Data Analysis Key Findings
*   The `indian_depression_dataset_2025.csv` dataset contains 400 entries with `clean_text`, `label` (binary: 0 or 1), and `timestamp` columns, with no missing values. The `label` column has an approximately balanced distribution (mean of ~0.495).
*   The data was successfully preprocessed for BERT, resulting in 320 training samples and 80 validation samples, prepared into PyTorch DataLoaders.
*   A new `BertForSequenceClassification` model was trained for 4 epochs on this dataset.
*   During training, the average training loss decreased from 0.29 in Epoch 1 to 0.00 in Epoch 4.
*   The model achieved exceptionally high performance on the validation set, with an Accuracy of 1.00, Precision of 1.00, Recall of 1.00, and F1-score of 1.00 across all epochs.
*   However, when used for inference on new, unseen text samples (e.g., "I am so happy today, everything is just perfect."), the model consistently predicted "Depressed" for all examples, regardless of their content.

### Insights or Next Steps
*   The discrepancy between perfect validation scores (1.00 for all metrics) and the model's failure to generalize during inference (always predicting "Depressed") strongly indicates severe **overfitting** to the training data. This also suggests a potential **severe class imbalance** where the positive class (Depressed) might be overwhelmingly dominant, causing the model to simply predict the majority class.
*   **Next Steps:**
    *   **Investigate Class Distribution:** Analyze the actual distribution of labels (0 and 1) in the `indian_depression_dataset_2025.csv` to confirm class imbalance. If confirmed, address it using techniques such as oversampling, undersampling, or weighted loss functions.
    *   **Review Data Quality and Labeling:** Thoroughly check the dataset for potential issues like data leakage or incorrect labeling, which could artificially inflate validation scores and lead to misleading performance.


# Task
Load the `/content/raw_depression_dataset_india.csv` dataset into a pandas DataFrame and examine its structure, including the first few rows, column names, data types, summary statistics, and missing values.

## Load Test Dataset

### Subtask:
Load the `/content/raw_depression_dataset_india.csv` file into a pandas DataFrame and examine its structure.


**Reasoning**:
Load the new CSV file into a pandas DataFrame named `df_raw` and then display its head, info, describe, and null values to understand its structure and content as per the instructions.



In [8]:
import pandas as pd

# 2. Load the CSV file '/content/raw_depression_dataset_india.csv' into a pandas DataFrame named df_raw.
df_raw = pd.read_csv('/content/raw_depression_dataset_india.csv')

# 3. Display the first 5 rows of the df_raw DataFrame.
print("First 5 rows of the raw DataFrame:")
display(df_raw.head())

# 4. Display the column names and their data types using df_raw.info().
print("\nColumn names and data types of the raw DataFrame:")
display(df_raw.info())

# 5. Display descriptive statistics of the df_raw DataFrame using df_raw.describe().
print("\nSummary statistics of the raw DataFrame:")
display(df_raw.describe())

# 6. Check for missing values in each column of df_raw using df_raw.isnull().sum().
print("\nMissing values per column in the raw DataFrame:")
display(df_raw.isnull().sum())

First 5 rows of the raw DataFrame:


Unnamed: 0,tweet_text,label,date
0,my anxiety is at an all time high. i feel like...,1,2025-12-11
1,nothing brings me joy anymore. not even the th...,1,2025-12-11
2,honestly why am i even trying anymore. nothing...,1,2025-12-02
3,The pollution in Delhi is so bad i can barely ...,0,2025-12-05
4,my anxiety is at an all time high. i feel like...,1,2025-12-02



Column names and data types of the raw DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   tweet_text  400 non-null    object
 1   label       400 non-null    int64 
 2   date        400 non-null    object
dtypes: int64(1), object(2)
memory usage: 9.5+ KB


None


Summary statistics of the raw DataFrame:


Unnamed: 0,label
count,400.0
mean,0.4875
std,0.50047
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0



Missing values per column in the raw DataFrame:


Unnamed: 0,0
tweet_text,0
label,0
date,0


## Preprocess Test Data for BERT

### Subtask:
Identify the text and label columns from the new test DataFrame. Handle any missing values in the text column. Tokenize the text data using the existing `tokenizer_loaded`, ensuring consistent padding and truncation. Create PyTorch tensors for input IDs, attention masks, and the target labels.


**Reasoning**:
I need to preprocess the new test data (`df_raw`) for BERT by identifying the text and label columns, handling any missing values, tokenizing the text using the pre-loaded tokenizer, and converting the labels into PyTorch tensors. This aligns with the subtask's instructions.



In [9]:
import torch

# 1. Identify 'tweet_text' as the text column and 'label' as the target variable
text_column_raw = 'tweet_text'
label_column_raw = 'label'

# 2. Handle any missing values in the 'tweet_text' column (fill with empty string)
df_raw[text_column_raw] = df_raw[text_column_raw].fillna('')

# 3. Tokenize the 'tweet_text' column from df_raw using the tokenizer_loaded
# 4. Ensure consistent padding, truncation, max_length, and return PyTorch tensors
encoded_inputs_raw = tokenizer_loaded(
    list(df_raw[text_column_raw]),
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    max_length=max_seq_length, # Use the same max_seq_length (128) as before
    return_attention_mask=True,
    return_tensors='pt' # Return PyTorch tensors
)

# 5. Extract the input_ids and attention_mask from the tokenized output
input_ids_raw = encoded_inputs_raw['input_ids']
attention_masks_raw = encoded_inputs_raw['attention_mask']

# 6. Convert the 'label' column from df_raw into a PyTorch tensor (dtype=torch.long)
labels_raw = torch.tensor(df_raw[label_column_raw].values, dtype=torch.long)

# 7. Print the shapes of the resulting tensors for verification
print("Shape of raw input_ids:", input_ids_raw.shape)
print("Shape of raw attention_masks:", attention_masks_raw.shape)
print("Shape of raw labels:", labels_raw.shape)


Shape of raw input_ids: torch.Size([400, 128])
Shape of raw attention_masks: torch.Size([400, 128])
Shape of raw labels: torch.Size([400])


## Evaluate Model on Test Data

### Subtask:
Use the `model_new_loaded` (the model trained on `indian_depression_dataset_2025.csv`) to perform inference on the preprocessed test data. Calculate and display evaluation metrics (accuracy, precision, recall, F1-score) against the true labels in the test dataset.


**Reasoning**:
I need to evaluate the newly trained model on the raw test data by performing inference and calculating performance metrics as per the subtask instructions.



In [10]:
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

# 1. Combine input_ids_raw, attention_masks_raw, and labels_raw into a TensorDataset
test_dataset = TensorDataset(input_ids_raw, attention_masks_raw, labels_raw)

# 2. Create a DataLoader named test_dataloader
batch_size = 32 # Use the same batch size as before
test_dataloader = DataLoader(
    test_dataset,
    sampler=SequentialSampler(test_dataset), # Sequential sampling for evaluation
    batch_size=batch_size
)

# 3. Set the model_new_loaded to evaluation mode
model_new_loaded.eval()

print("Evaluating model_new_loaded on the raw test data...")

# 4. Initialize lists to store all predictions and true labels
all_preds_test = []
all_labels_test = []

# 5. Iterate through the test_dataloader in batches
for batch in test_dataloader:
    # a. Move the input IDs, attention masks, and labels to the appropriate device
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    # b. Perform a forward pass through model_new_loaded without computing gradients
    with torch.no_grad():
        outputs = model_new_loaded(
            b_input_ids,
            token_type_ids=None,
            attention_mask=b_input_mask
        )

    # c. Move the logits to CPU and convert them to a NumPy array
    logits = outputs.logits
    logits = logits.detach().cpu().numpy()

    # d. Get the predicted class for each sample in the batch and extend all_preds_test
    preds = np.argmax(logits, axis=1).flatten()
    all_preds_test.extend(preds)

    # e. Extend all_labels_test with the true labels for the batch
    label_ids = b_labels.to('cpu').numpy()
    all_labels_test.extend(label_ids)

# Convert lists to numpy arrays for metric calculation
all_preds_test_np = np.array(all_preds_test)
all_labels_test_np = np.array(all_labels_test)

# 6. Calculate and print the overall accuracy
accuracy_test = accuracy_score(all_labels_test_np, all_preds_test_np)
print(f"  Test Accuracy: {accuracy_test:.2f}")

# 7. Calculate and print precision, recall, and F1-score
precision_test, recall_test, f1_test, _ = precision_recall_fscore_support(
    all_labels_test_np, all_preds_test_np, average='binary', zero_division=0
)
print(f"  Test Precision: {precision_test:.2f}")
print(f"  Test Recall: {recall_test:.2f}")
print(f"  Test F1-score: {f1_test:.2f}")

print("Evaluation on raw test data complete.")

Evaluating model_new_loaded on the raw test data...
  Test Accuracy: 0.93
  Test Precision: 1.00
  Test Recall: 0.85
  Test F1-score: 0.92
Evaluation on raw test data complete.


## Summarize Test Results

### Subtask:
Summarize the model's performance on the `raw_depression_dataset_india.csv` test dataset, discussing the key evaluation metrics and comparing them to previous results.


## Summary of Model Performance on `raw_depression_dataset_india.csv`

### Evaluation Metrics Overview:

**Validation Metrics (from training on `indian_depression_dataset_2025.csv`):**
*   **Accuracy:** 1.00
*   **Precision:** 1.00
*   **Recall:** 1.00
*   **F1-score:** 1.00

**Test Metrics (on `raw_depression_dataset_india.csv`):**
*   **Accuracy:** 0.93
*   **Precision:** 1.00
*   **Recall:** 0.85
*   **F1-score:** 0.92

### Comparison and Analysis:

The model, `model_new_loaded`, which was trained on the `indian_depression_dataset_2025.csv` dataset, exhibited exceptionally high performance during its own validation phase, achieving perfect scores (1.00 for Accuracy, Precision, Recall, and F1-score). This raised concerns about potential overfitting or data leakage, as noted in previous summaries.

When this same model was evaluated on the `raw_depression_dataset_india.csv` test dataset, the performance metrics changed significantly:

*   **Accuracy** dropped from 1.00 to 0.93.
*   **Precision** remained at 1.00, indicating that when the model predicts a sample as 'Depressed', it is always correct (no false positives) on the test set. This is a strong indicator.
*   **Recall** decreased from 1.00 to 0.85. This means the model was able to identify 85% of the actual 'Depressed' cases in the test set, missing 15% (false negatives).
*   **F1-score**, a harmonic mean of precision and recall, also dropped from 1.00 to 0.92, reflecting the decrease in recall.

### Implications:

*   **Overfitting:** The perfect validation scores contrasted with the slightly lower, but still strong, test scores confirm the initial suspicion of overfitting during training on the `indian_depression_dataset_2025.csv`. The model learned the training data too well, leading to a slight drop in performance on a truly unseen dataset.
*   **Generalization:** Despite the drop, the model still generalizes reasonably well to the `raw_depression_dataset_india.csv` test set, especially maintaining perfect Precision. The 0.93 Accuracy and 0.92 F1-score are good, indicating that the features learned from the `indian_depression_dataset_2025.csv` are somewhat transferable.
*   **Recall vs. Precision:** The perfect precision on the test set (1.00) is remarkable. However, the recall of 0.85 suggests that there are some actual depressed cases in the `raw_depression_dataset_india.csv` that the model failed to identify. Depending on the application, missing depressed cases (false negatives) might be a more critical error than false positives.

### Strengths and Weaknesses:

*   **Strengths:** The model demonstrates very high confidence in its positive predictions (Precision = 1.00), meaning users can trust its 'Depressed' classifications when it makes them. Overall accuracy and F1-score are also very good.
*   **Weaknesses:** The drop in Recall (from 1.00 to 0.85) indicates that the model struggles to identify all instances of depression in the new test data, suggesting room for improvement in capturing the full spectrum of depressive indicators or handling variability across different datasets. The consistent positive predictions during inference (as seen previously) also hint that while the model has high precision when it does predict positive, it might be overly cautious or biased towards certain features in the positive class leading to lower recall on a varied dataset.

### Conclusion:

The model trained on `indian_depression_dataset_2025.csv` shows promising performance on the `raw_depression_dataset_india.csv` test set, particularly with its perfect precision. However, the drop in recall and accuracy from validation to test metrics highlights that while the model has learned the positive class well, it does not perfectly generalize to all unseen instances. Further steps should focus on improving recall and addressing the subtle signs of overfitting/bias observed.

## Final Task

### Subtask:
Provide a comprehensive summary of the model's performance across all datasets and insights gained throughout the process.


## Summary:

### Data Analysis Key Findings

*   The `raw_depression_dataset_india.csv` dataset, consisting of 400 entries and 3 columns (`tweet_text`, `label`, `date`), was successfully loaded and found to have no missing values. The `label` column exhibited a nearly balanced distribution, with a mean of approximately 0.4875.
*   The text data from the `raw_depression_dataset_india.csv` was preprocessed for a BERT model, including handling missing values in `tweet_text` by filling them with empty strings, and tokenizing the text into PyTorch tensors (`input_ids`, `attention_masks`, `labels`) of shapes (400, 128) and (400) respectively.
*   The `model_new_loaded` (trained on `indian_depression_dataset_2025.csv`) was evaluated on the preprocessed `raw_depression_dataset_india.csv` test set, yielding the following performance metrics:
    *   **Accuracy:** 0.93
    *   **Precision:** 1.00
    *   **Recall:** 0.85
    *   **F1-score:** 0.92
*   Compared to its perfect validation scores (1.00 for Accuracy, Precision, Recall, and F1-score) during training, the model experienced a drop in Accuracy (to 0.93), Recall (to 0.85), and F1-score (to 0.92) on the unseen `raw_depression_dataset_india.csv` data.
*   The model maintained a perfect Precision of 1.00 on the test set, indicating it made no false positive predictions (i.e., it never incorrectly classified a non-depressed tweet as depressed). However, the Recall of 0.85 signifies that 15% of actual depressed cases in the test set were not identified by the model.

### Insights or Next Steps

*   The significant drop in performance metrics from validation to test sets (particularly Recall and F1-score) confirms initial suspicions of overfitting, highlighting that the model learned the `indian_depression_dataset_2025.csv` too well, impacting its generalization to new, unseen data.
*   To improve the model's robustness and recall on diverse datasets, future efforts should focus on strategies to mitigate overfitting, such as using a larger and more varied training dataset, applying more aggressive regularization techniques, or exploring ensemble methods.
