### Step 1: Importing Libraries

In [16]:
#import necessary libraries
import pandas as pd

### Step 2: Loading both datasets


In [17]:
# Load the DPQ_J and DEMO_J datasets (CSV exports of "DPQ_J.XPT" and "DEMO_J.XPT")

Depression_data = pd.read_csv("DPQ_J.csv")
Demographic_data  = pd.read_csv("DEMO_J.csv")
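The cell above reads CSV files, while its comment names the `.XPT` (SAS transport) files. The following is a minimal sketch, assuming the raw transport files are available locally, of a loader that accepts either format. `load_nhanes_table` is a hypothetical helper introduced only for illustration; the notebook itself uses `pd.read_csv`.

```python
from pathlib import Path

import pandas as pd


def load_nhanes_table(path):
    """Read a table from a CSV export or a SAS transport (.XPT) file.

    Hypothetical helper for illustration only.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".xpt":
        # pandas reads SAS transport files when format="xport" is given
        return pd.read_sas(path, format="xport")
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {suffix}")
```

With such a helper, `load_nhanes_table("DPQ_J.csv")` would behave like the `pd.read_csv` call above, and a `.XPT` path would be routed to `pd.read_sas` instead.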

### Step 3: Checking uniqueness of SEQN in both datasets

In [18]:
print("Unique IDs in DPQ_J :", Depression_data['SEQN'].is_unique)
print("Unique IDs in DEMO_J:", Demographic_data['SEQN'].is_unique)

#counting duplicates explicitly
print("Duplicate IDs in DPQ_J:", Depression_data['SEQN'].duplicated().sum())
print("Duplicate IDs in DEMO_J:", Demographic_data['SEQN'].duplicated().sum())

Unique IDs in DPQ_J : True
Unique IDs in DEMO_J: True
Duplicate IDs in DPQ_J: 0
Duplicate IDs in DEMO_J: 0


**Comments:** The SEQN variable (Respondent Sequence Number) is the unique number linking participants across NHANES components.

Uniqueness was checked in both the depression screener (**DPQ_J**) and demographic (**DEMO_J**) datasets using `is_unique` and `duplicated()`.

The results confirm that each participant appears only once in each dataset (`is_unique` is True and the duplicate count is 0), so SEQN is a reliable key for the merge in Step 4.
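The same uniqueness assumption can also be enforced at merge time. The sketch below uses small made-up frames rather than the NHANES files (the values, and the use of `RIDAGEYR` as the second column, are placeholders) and relies on the `validate` argument of `pd.merge`, which raises an error if the key is not unique on both sides.

```python
import pandas as pd

# Toy frames standing in for DPQ_J and DEMO_J; all values are made up.
depression = pd.DataFrame({"SEQN": [1, 2, 3], "DPQ010": [0, 1, 2]})
demographic = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [34, 51, 27, 60]})

# validate="one_to_one" makes pandas check key uniqueness in both frames.
merged = pd.merge(depression, demographic, on="SEQN", how="inner",
                  validate="one_to_one")

# With a duplicated key, the same call raises MergeError instead of
# silently multiplying rows.
with_duplicate = pd.concat([depression, depression.iloc[[0]]], ignore_index=True)
try:
    pd.merge(with_duplicate, demographic, on="SEQN", how="inner",
             validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```

This is an optional safeguard on top of the explicit `is_unique` check, not a replacement for it.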

### Step 4: Merging both data sets on SEQN using Inner join


In [21]:
merged_data = pd.merge(Depression_data, Demographic_data, on='SEQN', how='inner')

### Step 5: Checking shapes before and after merge

In [22]:
print("Shape of DEMO_J:", Demographic_data.shape)
print("Shape of DPQ_J:", Depression_data.shape)
print("Shape of Merged data:", merged_data.shape)

Shape of DEMO_J: (9254, 46)
Shape of DPQ_J: (5533, 11)
Shape of Merged data: (5533, 56)


The merged dataset contains 5,533 participants and 56 columns: 11 from DPQ_J plus 46 from DEMO_J, minus 1 because the shared SEQN column appears only once.
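The column count follows from simple arithmetic: 11 + 46 - 1 = 56, since the key column is kept once. A small sketch with made-up frames (the columns `A`, `B`, `C` are hypothetical) shows the same rule on a smaller scale:

```python
import pandas as pd

# Toy frames with 3 and 4 columns that share only SEQN; values are made up.
left = pd.DataFrame({"SEQN": [1, 2], "DPQ010": [0, 1], "DPQ020": [1, 0]})
right = pd.DataFrame({"SEQN": [1, 2, 3], "A": [5, 6, 7],
                      "B": [1, 1, 2], "C": [0, 0, 0]})

merged = pd.merge(left, right, on="SEQN", how="inner")

# The key column is counted once, so columns = left + right - 1.
expected_cols = left.shape[1] + right.shape[1] - 1
print(merged.shape[1] == expected_cols)  # True
print(merged.shape)                      # (2, 6)
```

The rule holds only when SEQN is the sole column name the two frames share; other shared names would be kept twice with `_x` and `_y` suffixes.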

### Step 6: Previewing merged file

In [23]:
merged_data.head()

| | SEQN | DPQ010 | DPQ020 | DPQ030 | DPQ040 | DPQ050 | DPQ060 | DPQ070 | DPQ080 | DPQ090 | ... | DMDHREDZ | DMDHRMAZ | DMDHSEDZ | WTINT2YR | WTMEC2YR | SDMVPSU | SDMVSTRA | INDHHIN2 | INDFMIN2 | INDFMPIR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 93705.0 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | ... | 1.0 | 2.0 | NaN | 8614.571172 | 8338.419786 | 2.0 | 145.0 | 3.0 | 3.0 | 0.82 |
| 1 | 93706.0 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | ... | 3.0 | 1.0 | 2.0 | 8548.632619 | 8723.439814 | 2.0 | 134.0 | NaN | NaN | NaN |
| 2 | 93708.0 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | ... | 1.0 | 1.0 | 1.0 | 13329.450589 | 14372.488765 | 2.0 | 138.0 | 6.0 | 6.0 | 1.63 |
| 3 | 93709.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.0 | 2.0 | NaN | 12043.388271 | 12277.556662 | 1.0 | 136.0 | 2.0 | 2.0 | 0.41 |
| 4 | 93711.0 | 1.0 | 5.397605e-79 | 1.0 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | 5.397605e-79 | ... | 3.0 | 1.0 | 3.0 | 11178.260106 | 12390.919724 | 2.0 | 134.0 | 15.0 | 15.0 | 5.0 |
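The value 5.397605e-79 that fills most PHQ-9 columns in the preview looks like a floating-point stand-in for 0 left over from the file conversion. This is an assumption; the notebook does not say so. Under that assumption, a minimal sketch of a cleanup step on a toy frame (the values mimic the preview, and the tolerance is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few preview cells; not the real NHANES data.
preview = pd.DataFrame({
    "SEQN": [93705.0, 93709.0, 93711.0],
    "DPQ010": [5.397605e-79, np.nan, 1.0],
    "DPQ020": [5.397605e-79, np.nan, 5.397605e-79],
})

item_cols = [c for c in preview.columns if c.startswith("DPQ")]

# Assumed tolerance: any magnitude below it is treated as an exact zero.
TOLERANCE = 1e-10

cleaned = preview.copy()
near_zero = cleaned[item_cols].abs() < TOLERANCE  # NaN compares as False
cleaned[item_cols] = cleaned[item_cols].mask(near_zero, 0.0)

print(cleaned)
```

Missing responses stay as NaN because the comparison is False for them, so only the near-zero cells are rewritten.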


**Comments:** The demographic file (DEMO_J) included 9,254 participants, and the depression screener questionnaire file (DPQ_J) included 5,533 participants with a PHQ-9 record. Some of these records have missing item responses, as row 3 of the preview shows.

After the inner join on SEQN, the merged dataset still had 5,533 participants, that is, the individuals who have both a demographic record and a depression screener record.

This is consistent with the NHANES design, in which only a subset of interviewed participants completes each questionnaire module.
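To see how many demographic records have no match in the screener file, a left join with `indicator=True` can be used. The sketch below runs on made-up frames; on the real data, the shapes above imply 5,533 rows marked `both` and 9,254 - 5,533 = 3,721 marked `left_only`.

```python
import pandas as pd

# Toy frames standing in for DEMO_J and DPQ_J; values are made up.
demographic = pd.DataFrame({"SEQN": [1, 2, 3, 4, 5]})
depression = pd.DataFrame({"SEQN": [2, 4, 5], "DPQ010": [0.0, 1.0, None]})

# indicator=True adds a "_merge" column recording where each row came from.
audit = pd.merge(demographic, depression, on="SEQN", how="left", indicator=True)

counts = audit["_merge"].value_counts()
n_both = int(counts["both"])            # in both files
n_demo_only = int(counts["left_only"])  # demographic record only

print(n_both, n_demo_only)  # 3 2
```

A row counts as `both` when its SEQN is present in the screener file, even if the PHQ-9 answers themselves are missing.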