### Step 1: Importing Necessary Libraries

In [32]:
#importing necessary libraries
import pandas as pd
import numpy as np

**Comments:** The step above imports the Python libraries needed for data manipulation (pandas) and numerical computation (NumPy).

### Step 2: Importing and Preparing Merged Dataset

In [33]:
#The merged file saved at the end of Notebook 02
merged_dataset = pd.read_csv("merged_DPQJ_DEMOJ_cleaned.csv")

phq_cols = ["DPQ010","DPQ020","DPQ030","DPQ040","DPQ050",
            "DPQ060","DPQ070","DPQ080","DPQ090"]

# Ensure PHQ columns are numeric; values that cannot be parsed become NaN
merged_dataset[phq_cols] = merged_dataset[phq_cols].apply(pd.to_numeric, errors="coerce")

**Comments:**
The output of Notebook 02, the merged dataset, is imported: merged_DPQJ_DEMOJ_cleaned.csv

PHQ-9 item columns are defined, and values are converted to numeric for accurate scoring.
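The numeric conversion described above can be illustrated on a toy frame. This is a minimal sketch: the values below are hypothetical and are not taken from the NHANES file; only `pd.to_numeric` with `errors="coerce"` mirrors the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values (not from merged_DPQJ_DEMOJ_cleaned.csv)
toy = pd.DataFrame({
    "DPQ010": ["0", "2", "refused"],   # a string that cannot be parsed
    "DPQ020": [1, None, "3"],          # mixed types and a missing value
})

# errors="coerce" turns anything unparseable into NaN instead of raising
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")
print(toy_numeric)
```

After the conversion both columns are floating point, so later sums and comparisons behave consistently.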

In [34]:
merged_dataset.head()

Unnamed: 0,SEQN,DPQ010,DPQ020,DPQ030,DPQ040,DPQ050,DPQ060,DPQ070,DPQ080,DPQ090,...,DMDHREDZ,DMDHRMAZ,DMDHSEDZ,WTINT2YR,WTMEC2YR,SDMVPSU,SDMVSTRA,INDHHIN2,INDFMIN2,INDFMPIR
0,93705,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,2.0,,8614.571172,8338.419786,2.0,145.0,3.0,3.0,0.82
1,93706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,1.0,2.0,8548.632619,8723.439814,2.0,134.0,,,
2,93708,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,13329.450589,14372.488765,2.0,138.0,6.0,6.0,1.63
3,93709,,,,,,,,,,...,2.0,2.0,,12043.388271,12277.556662,1.0,136.0,2.0,2.0,0.41
4,93711,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,1.0,3.0,11178.260106,12390.919724,2.0,134.0,15.0,15.0,5.0


### Step 3: Computing Depression Outcomes

In [35]:
# how many PHQ-9 items each participant answered (counts non-missing values; valid responses are 0–3)
merged_dataset["phq9_items_answered"] = merged_dataset[phq_cols].notna().sum(axis=1)

# total score (sum across answered items; NaNs ignored)
merged_dataset["phq9_total_score"] = merged_dataset[phq_cols].sum(axis=1, min_count=1)

merged_dataset[["phq9_items_answered","phq9_total_score"]].head(10)

Unnamed: 0,phq9_items_answered,phq9_total_score
0,9,0.0
1,9,0.0
2,9,0.0
3,0,
4,9,2.0
5,9,1.0
6,9,8.0
7,9,2.0
8,9,4.0
9,9,2.0


**Comments:** This step counts the number of PHQ-9 items each participant responded to (phq9_items_answered) and computes the participant's total depression score (phq9_total_score).

Missing responses (NaN) are skipped in both calculations: they are not counted as answered and do not contribute to the sum.

A higher value for phq9_items_answered indicates better data completeness, while phq9_total_score reflects the overall symptom severity.

Participants for whom all items were missing have NaN totals, because min_count=1 requires at least one valid response before a sum is returned.
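To make the effect of `min_count=1` concrete, the sketch below compares it with the default sum on a toy frame. The two-column frame and its values are hypothetical; the pandas calls are the same ones used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-item frame: full, partial, and fully missing rows
toy = pd.DataFrame({
    "DPQ010": [1.0, np.nan, np.nan],
    "DPQ020": [2.0, 3.0, np.nan],
})

items_answered = toy.notna().sum(axis=1)     # count of non-missing items per row
total_default = toy.sum(axis=1)              # all-NaN row sums to 0.0
total_min1 = toy.sum(axis=1, min_count=1)    # all-NaN row stays NaN

print(pd.DataFrame({
    "items_answered": items_answered,
    "total_default": total_default,
    "total_min1": total_min1,
}))
```

Without `min_count=1`, the fully missing row would receive a score of 0 and be indistinguishable from a participant who answered "not at all" to every item. Note that a partially answered row still receives a total based only on the items answered; whether to require a minimum number of items is an analysis choice this sketch does not make.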

In [36]:
# Probable depression flag (PHQ-9 >= 10)
# Note: a NaN total compares as False, so participants without a score are coded 0 here
merged_dataset["probable_depression"] = (merged_dataset["phq9_total_score"] >= 10).astype(int)

# Suicidal ideation flag (DPQ090 > 0); a missing DPQ090 is likewise coded 0
merged_dataset["suicidal_ideation"] = (merged_dataset["DPQ090"] > 0).astype(int)
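Because a comparison against NaN evaluates to False, the integer flags above cannot distinguish "below the cutoff" from "no valid score". If that distinction matters downstream, one possible variant keeps NaN in the flag. This is a sketch on hypothetical toy data, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a case, a non-case, and a participant with no valid data
toy = pd.DataFrame({
    "phq9_total_score": [12.0, 3.0, np.nan],
    "DPQ090": [1.0, 0.0, np.nan],
})

# Same construction as the notebook: the missing row is coded 0
flag_int = (toy["phq9_total_score"] >= 10).astype(int)

# Variant: convert to float and blank out rows where the source value is missing
flag_nan = (toy["phq9_total_score"] >= 10).astype(float).where(toy["phq9_total_score"].notna())
ideation_nan = (toy["DPQ090"] > 0).astype(float).where(toy["DPQ090"].notna())

print(pd.DataFrame({"flag_int": flag_int, "flag_nan": flag_nan, "ideation_nan": ideation_nan}))
```

With the NaN-preserving variant, `mean()` on the flag column gives the proportion among participants with valid data directly, since pandas skips NaN by default.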

### Step 4: Performing Summary Statistics

In [37]:
merged_dataset[["phq9_items_answered","phq9_total_score"]].describe()

Unnamed: 0,phq9_items_answered,phq9_total_score
count,5533.0,5093.0
mean,8.272004,3.242097
std,2.44534,4.248203
min,0.0,0.0
25%,9.0,0.0
50%,9.0,2.0
75%,9.0,5.0
max,9.0,25.0


**Comments:** This output gives summary statistics for two columns: the number of PHQ-9 items each participant answered and their total depression score. The count for phq9_items_answered (5,533) covers every participant in the file, because an item count is always defined, while the count for phq9_total_score (5,093) covers only participants with at least one valid response. On average, participants answered about 8.27 out of 9 items; this mean is pulled down by participants who answered none, since the 25th percentile is already 9. The mean PHQ-9 score is 3.24, suggesting that on average participants reported fairly low levels of depression. The standard deviation of 4.25 indicates moderate variation in symptom severity. The minimum of 0 items answered shows that some participants skipped all the questions. The highest score recorded was 25, which is within the maximum possible score of 27. The quartiles show that 50% of scored participants scored 2 or below and 75% scored 5 or below; in other words, most fall in the minimal-to-mild range.
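The severity ranges mentioned above can be made explicit by binning total scores. The sketch below uses the commonly cited PHQ-9 bands (0–4 minimal, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe); these bands and the column name `severity` are assumptions for illustration and are not defined anywhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical scores covering each band edge, plus one missing value
scores = pd.Series([0, 4, 5, 9, 10, 14, 15, 19, 20, 27, np.nan], name="phq9_total_score")

# Right-closed bins: (-1, 4], (4, 9], (9, 14], (14, 19], (19, 27]
bins = [-1, 4, 9, 14, 19, 27]
labels = ["minimal", "mild", "moderate", "moderately severe", "severe"]

severity = pd.cut(scores, bins=bins, labels=labels)
print(severity.value_counts(sort=False))
```

Missing scores stay missing after binning, so participants without a valid total are not assigned to any band.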


### Step 5: Calculating Probable Depression and Prevalence

In [38]:
# Restrict to participants with a valid (non-NaN) PHQ-9 total score
valid_phq9 = merged_dataset['phq9_total_score'].notna()
n_valid = valid_phq9.sum()
n_cases = (merged_dataset.loc[valid_phq9, 'phq9_total_score'] >= 10).sum()
n_suic = merged_dataset.loc[valid_phq9, "suicidal_ideation"].sum()
# Both percentages use the same denominator: participants with a valid total score
percentage = round(n_cases / n_valid * 100, 2)
pct_suic = round(n_suic / n_valid * 100, 2)
print(f"Valid PHQ-9 scores: {n_valid}")
print(f"Probable depression (PHQ-9 ≥ 10): {n_cases} participants")
print(f"Prevalence: {percentage}%")
print("Any suicidal ideation:", n_suic, f"({pct_suic}%)")

Valid PHQ-9 scores: 5093
Probable depression (PHQ-9 ≥ 10): 461 participants
Prevalence: 9.05%
Any suicidal ideation: 192 (3.77%)


**Comments:** The above step identifies the number of participants who meet the standard PHQ-9 cutoff score for probable depression (≥ 10).

First, participants with valid PHQ-9 total scores are counted (n_valid = 5093). Of those, n_cases = 461 scored 10 or higher.

The prevalence is then calculated as (n_cases / n_valid) × 100, which gives 9.05%.

This means that roughly one in every eleven participants in this dataset meets the PHQ-9 criterion for probable depression, a figure that appears broadly consistent with general population estimates, although it is unweighted.

The suicidal ideation indicator is created from PHQ-9 item 9, which asks whether the participant had thoughts of self-harm or suicide. Any non-zero response (1–3) is coded as 1, indicating the presence of suicidal ideation, while 0 indicates none.

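Missing values from “Refused” or “Don’t know” responses are treated as NaN. Because the flag is built with a comparison, a missing item 9 is coded 0 rather than excluded, and the denominator is the number of participants with a valid PHQ-9 total score (n_valid).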

In this dataset, 192 participants reported some level of suicidal ideation, representing approximately 3.77% of all respondents with valid PHQ-9 total scores.

This rate aligns with expectations for population-level mental health surveys and highlights the importance of monitoring item 9 separately due to its clinical significance.
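The two percentages reported above can be reproduced from the printed counts alone. The helper below is a hypothetical convenience function (the name `prevalence_pct` is not part of the notebook); it simply restates the (n_cases / n_valid) × 100 formula.

```python
def prevalence_pct(n_cases: int, n_valid: int, ndigits: int = 2) -> float:
    """Return n_cases as a percentage of n_valid, rounded to ndigits."""
    if n_valid <= 0:
        raise ValueError("n_valid must be positive")
    return round(n_cases / n_valid * 100, ndigits)

# Counts taken from the output of the cell above
print(prevalence_pct(461, 5093))  # → 9.05 (probable depression)
print(prevalence_pct(192, 5093))  # → 3.77 (any suicidal ideation)
```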

In [39]:
merged_dataset.columns

Index(['SEQN', 'DPQ010', 'DPQ020', 'DPQ030', 'DPQ040', 'DPQ050', 'DPQ060',
       'DPQ070', 'DPQ080', 'DPQ090', 'DPQ100', 'SDDSRVYR', 'RIDSTATR',
       'RIAGENDR', 'RIDAGEYR', 'RIDAGEMN', 'RIDRETH1', 'RIDRETH3', 'RIDEXMON',
       'RIDEXAGM', 'DMQMILIZ', 'DMQADFC', 'DMDBORN4', 'DMDCITZN', 'DMDYRSUS',
       'DMDEDUC3', 'DMDEDUC2', 'DMDMARTL', 'RIDEXPRG', 'SIALANG', 'SIAPROXY',
       'SIAINTRP', 'FIALANG', 'FIAPROXY', 'FIAINTRP', 'MIALANG', 'MIAPROXY',
       'MIAINTRP', 'AIALANGA', 'DMDHHSIZ', 'DMDFMSIZ', 'DMDHHSZA', 'DMDHHSZB',
       'DMDHHSZE', 'DMDHRGND', 'DMDHRAGZ', 'DMDHREDZ', 'DMDHRMAZ', 'DMDHSEDZ',
       'WTINT2YR', 'WTMEC2YR', 'SDMVPSU', 'SDMVSTRA', 'INDHHIN2', 'INDFMIN2',
       'INDFMPIR', 'phq9_items_answered', 'phq9_total_score',
       'probable_depression', 'suicidal_ideation'],
      dtype='object')

### Step 6: Save Final Processed Dataset

Save the merged dataset, including the new PHQ-9 columns, for further analysis:

In [40]:
merged_dataset.to_csv("merged_cleaned_dataset_nb3.csv", index=False)
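A quick round-trip check can confirm that the saved file reloads with the same columns and that missing scores survive as NaN. The sketch below uses an in-memory buffer and a hypothetical two-row frame rather than the real file, so it runs without touching disk.

```python
import io
import pandas as pd

# Hypothetical miniature of the processed dataset
toy = pd.DataFrame({"SEQN": [93705, 93709], "phq9_total_score": [0.0, None]})

buf = io.StringIO()
toy.to_csv(buf, index=False)   # index=False avoids writing the row index as an extra column
buf.seek(0)
reloaded = pd.read_csv(buf)

print(reloaded.columns.tolist())
print(reloaded)
```

Writing with `index=False` keeps the row index out of the file, so reloading does not produce a stray unnamed column.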

### Step 7: Interpretation

**Interpretation:**

After data cleaning, valid scores on the PHQ-9 total were available for 5,093 participants.

Of these, 461 participants scored 10 or higher, meeting the threshold for probable depression according to PHQ-9 scoring guidelines.

This represents an estimated (unweighted) prevalence of 9.05%, indicating that roughly one in every eleven respondents may exhibit moderate to severe depressive symptoms.

In addition, 192 participants (about 3.77%) reported some level of suicidal ideation based on PHQ-9 item 9, highlighting an important subgroup that may require further clinical attention.

Most participants completed all nine PHQ-9 items. However, 440 of the 5,533 participants had no valid responses at all (5,533 − 5,093), and a small number answered only some items, probably as a result of skipped questions or data entry omissions.

The total scores of the PHQ-9 ranged from 0 to 25, which is in the expected scoring range of 0–27.

No major anomalies were observed other than typical missingness, and overall, the dataset appears clean and suitable for further analysis.