# Introduction

In this notebook, we explore the 2021-2023 NSDUH data with four goals:

1) Learning more about the demographic characteristics of young adults aged 18-25 
2) Learning more about the substance use and mental health patterns among young adults
3) Examining the associations between our confounders and mental health
4) Examining initial associations between illicit substance use and mental health

To accomplish goal 1, we create bar plots that show the frequencies of demographics such as income, household composition, and education among young adults.

To accomplish goal 2, we create bar plots to examine substance use and mental health patterns.

To accomplish goal 3, we create comparative tables (followed up with chi-squared tests and 2 sample t-tests in RStudio).

To accomplish goal 4, we combine bar plots with error bars and inferential statistics to capture initial associations between illicit substance use and mental health outcomes.

By learning more about the demographic characteristics of young adults, we gain an understanding of our sample.

By learning more about substance use and mental health patterns among young adults, we gain an initial understanding of why young adults are an important population for health-related research: they tend to experience more substance use and mental health issues (as outlined in the 2023 NSDUH national report).

By examining the associations between our covariates and mental health, we gain an understanding of the confounders that influence the relationship between illicit substance use and mental health, which we account for in our logistic regression analysis.

By examining initial associations between illicit substance use and mental health, we gain an initial understanding of how illicit substance use is associated with mental health outcomes among young adults, motivating a deeper look at areas where a clear association appears.

In [None]:
#Import modules
import os #file path
import pandas as pd #Access, modify, and manipulate data
import matplotlib.pyplot as plt #Data visualization
import seaborn as sns #Data visualization 
import numpy as np #Numerical computation
from samplics.categorical import Tabulation #Survey design
from samplics.estimation import TaylorEstimator #Survey design
from samplics.utils.types import PopParam #Survey design
#Base path to GitHub repository
path="C:/Users/John Platt/OneDrive/DA301Project/John-Platt-DA-401-Project"
#Directory where the project data is
datadir="data"

In [None]:
#Read in the cleaned 2021-2023 NSDUH dataset
nsduh_2123 = pd.read_csv(os.path.join(path, datadir, "NSDUH_2021_2023.csv"))

In [None]:
#Delete first column of the data that has the row numbers (starts at 1)
nsduh_2123.drop('Unnamed: 0', axis=1, inplace=True) 
#Print out the dataset
nsduh_2123
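The row-number column can also be skipped at read time instead of being dropped afterwards. The sketch below is a minimal illustration, assuming the first CSV column holds row numbers written by an earlier export; `csv_text` is made-up data, not NSDUH records.

```python
import io
import pandas as pd

# Hypothetical three-row CSV whose first (unnamed) column holds row numbers
csv_text = ",CATAGE,IRSEX\n1,2,Male\n2,3,Female\n3,2,Female\n"

# index_col=0 uses the first column as the index instead of loading it as 'Unnamed: 0'
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(demo.columns))
```

With this option the later `drop('Unnamed: 0', ...)` step would not be needed, at the cost of the index starting at 1 rather than 0.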

In [None]:
def rel_frequencies(df,xvar,na_rm=True):
    '''
    Compute weighted relative frequencies (accounting for survey design)
    of a categorical variable among Young Adults

    Parameters:
    df - object - Cleaned 2021-2023 NSDUH dataframe
    xvar - string - The categorical variable that we will tabulate relative frequencies of
    na_rm - boolean - Whether or not we will eliminate missing values

    Return:
    Table with relative frequencies and 95% confidence intervals
    '''
    # Work on a copy to avoid modifying original df
    df = df.copy()
    #Create new column that keeps xvar for Young Adults (CATAGE == 2) and is missing for everyone else
    df["subpop_xvar"]=df[xvar].where(df["CATAGE"] == 2, np.nan)
    #Tabulation function calculates and tabulates summary statistics of a categorical variable given
    #survey weights and survey design measures
    #We account for the subpopulation (Young Adults) WHEN we do the tabulation 
    #Don't filter data down to Young Adults before doing the tabulation as 
    #that would bias the SE/variance estimates towards that specific subgroup rather than the total population
    #Tabulation with full dataset, but subpopulation defined
    tab = Tabulation(PopParam.prop)  
    tab.tabulate(
        #Column we want to tabulate relative frequencies based on 
        vars=df[["subpop_xvar"]],
        #2023 NSDUH sample weight
        samp_weight=df["ANALWT2_C3"],
        #Variance stratum
        stratum=df["VESTR_C"],
        #Variance primary sampling unit
        psu=df["VEREP"],
        remove_nan=na_rm
    )
    #Convert the weighted frequency estimates to a dataframe
    df=tab.to_dataframe()
    #Sort the frequency dataframe by frequencies in descending order
    df.sort_values(by=PopParam.prop,ascending=False,inplace=True)
    return df
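For intuition about what the proportion estimates represent, the sketch below computes weighted proportions by hand on made-up data. It covers only the point-estimate idea (a category's share of the subpopulation's total weight) and none of the design-based variance estimation that `Tabulation` handles. The function name `weighted_proportions` and the toy values are illustrative assumptions, not part of the analysis.

```python
import numpy as np

def weighted_proportions(categories, weights, in_domain):
    """Point estimates only: each category's share of the domain's total weight."""
    categories = np.asarray(categories)
    weights = np.asarray(weights, dtype=float)
    in_domain = np.asarray(in_domain, dtype=bool)
    # Total weight of the subpopulation (e.g. Young Adults)
    total = weights[in_domain].sum()
    return {
        str(cat): float(weights[in_domain & (categories == cat)].sum() / total)
        for cat in np.unique(categories[in_domain])
    }

# Made-up toy data: five respondents, three of them in the subpopulation
cats = ["Yes", "No", "Yes", "No", "Yes"]
wts = [100.0, 300.0, 100.0, 500.0, 200.0]
young_adult = [True, True, True, False, False]

props = weighted_proportions(cats, wts, young_adult)
print(props)
```

Respondents outside the subpopulation contribute nothing to either the numerator or the denominator, which mirrors the role of the masked `subpop_xvar` column above.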

In [None]:
def rel_freq_barplot(df,xvar,xlabel,xaxisrotation,title):
    '''
    Given a dataframe, creates barplot for relative frequencies of one variable
    
    Parameters:
    df - object - Cleaned 2021-2023 NSDUH dataframe
    xvar - string - The variable that will be the x-axis of our barplot (variable that we will show proportions of)
    xlabel - string - The x-axis label of our barplot
    xaxisrotation - integer - How much we should rotate the x-axis tick labels (0 to 360 degrees)
    title - string - The title of our barplot

    Return:
    The frequency table
    '''   
    #Create table with weighted relative frequency estimates 
    df=rel_frequencies(df,xvar,na_rm=True)
    #Create percentage column
    df['Percentage']=df[PopParam.prop]*100
    #Create column that has the lower bound of 95% CI expressed as a percentage
    df['Lower_ci_percentage']=df["lower_ci"]*100
    #Create column that has the upper bound of 95% CI expressed as a percentage
    df['Upper_ci_percentage']=df["upper_ci"]*100
    #Create new column with the lower error: the percentage minus the Lower_ci_percentage
    df['lower_error']=df["Percentage"]-df['Lower_ci_percentage']
    #Create new column with the upper error: the Upper_ci_percentage minus the percentage
    df['upper_error']=df['Upper_ci_percentage']-df["Percentage"]
    #Put lower and upper measures into a list
    asymmetric_err=[df['lower_error'],df['upper_error']]
    #Create bar plot with error bars
    plot=plt.bar(df['category'],df['Percentage'], yerr=asymmetric_err, capsize=5, color=sns.color_palette('Spectral', n_colors=len(df)))
    #Add data labels to bar plot
    plt.bar_label(plot,fmt="%.2f",label_type='center')
    #Change x-axis label of plot
    plt.xlabel(xlabel)
    #Change y-axis label of plot
    plt.ylabel("Weighted Proportion (%)")
    #Change title of plot
    plt.title(title)
    #Change rotation angle of x-axis tick labels of plot
    plt.xticks(rotation=xaxisrotation)
    #Show plot
    plt.show()
    #Return frequency table
    return df
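Matplotlib's `yerr` accepts a 2 x N array whose first row holds the distances below each bar and whose second row holds the distances above it. The sketch below shows that conversion from confidence bounds on made-up percentages; `ci_to_yerr` is a hypothetical helper name, not a function used elsewhere in this notebook.

```python
import numpy as np

def ci_to_yerr(estimate, lower, upper):
    """Convert CI bounds into the (2, N) offsets that plt.bar(..., yerr=...) expects."""
    estimate = np.asarray(estimate, dtype=float)
    # Distance from each estimate down to its lower bound
    lower_err = estimate - np.asarray(lower, dtype=float)
    # Distance from each estimate up to its upper bound
    upper_err = np.asarray(upper, dtype=float) - estimate
    return np.vstack([lower_err, upper_err])

# Made-up estimates with asymmetric intervals
yerr = ci_to_yerr([50.0, 30.0], [46.0, 27.5], [55.0, 33.0])
print(yerr)
```

Passing offsets rather than the bounds themselves matters because `yerr` is interpreted relative to the bar height.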

In [None]:
def frequencies(df,xvar,na_rm=True):
    '''
    Compute weighted counts (accounting for survey design)
    across demographics

    Parameters:
    df - object - Cleaned 2021-2023 NSDUH dataframe
    xvar - string - The demographic variable that we will generate counts from
    na_rm - boolean - Whether or not we will eliminate missing values

    Return:
    Table with weighted counts and 95% confidence intervals
    '''
    # Work on a copy to avoid modifying original df
    df = df.copy()
    #Create new column that keeps xvar for Young Adults (CATAGE == 2) and is missing for everyone else
    df["subpop_xvar"]=df[xvar].where(df["CATAGE"] == 2, np.nan)
    #Do tabulation to estimate counts of demographic variable
    tab = Tabulation(PopParam.count)  
    tab.tabulate(
        #Column we want to tabulate counts based on 
        vars=df[["subpop_xvar"]],
        #2023 NSDUH sample weight
        samp_weight=df["ANALWT2_C3"],
        #Variance stratum
        stratum=df["VESTR_C"],
        #Variance primary sampling unit
        psu=df["VEREP"],
        remove_nan=na_rm
    )
    totals=tab.to_dataframe()
    return totals

In [None]:
def avg_tbl(df,xvar,na_rm=True):
    '''
    Compute weighted mean (including accounting for survey design)
    across substance use measures

    Parameters:
    df - object - Cleaned 2021-2023 NSDUH dataframe
    xvar - string - The substance use variable that we will generate an average from
    na_rm - boolean - Whether or not we will eliminate missing values

    Return:
    Table with substance use averages (for Young Adults and all other populations) and 95% confidence intervals
    '''
    # Work on a copy to avoid modifying original df
    df = df.copy()
    #Create new domain variable based on where there are Young Adults in the cleaned dataset
    #This will be used when accounting for subgroup analysis in the taylor estimation
    domain=(df["CATAGE"] == 2).astype(int)
    #Taylor estimator for mean substance use frequency
    est = TaylorEstimator(PopParam.mean)
    est.estimate(
        y=df[xvar], #Substance use variable calculating the mean frequency on
        samp_weight=df['ANALWT2_C3'], #Sample weight
        stratum=df['VESTR_C'], #Variance stratum
        psu=df['VEREP'], #Primary sampling unit
        domain=domain, #pass subgroup vector
        remove_nan=na_rm
    )
    #to_dataframe() is handy when domain is provided (row per domain level)
    out = est.to_dataframe()
    #Change values of domain column to be more descriptive
    out['_domain'] = out['_domain'].replace({1: 'Young Adults', 0: 'All Other Populations'})
    #Sort the dataframe in descending order by the mean estimates
    out.sort_values(by='_estimate',ascending=False,inplace=True)
    #Return dataframe with mean (accounts for complex survey design)
    return out
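The domain means that `avg_tbl` reports can be pictured as weighted averages computed separately within each domain level. The sketch below is a hand calculation on made-up data, assuming missing values are simply dropped. It gives point estimates only, with no standard errors or confidence intervals, and `weighted_domain_means` is an illustrative name.

```python
import numpy as np

def weighted_domain_means(y, weights, domain):
    """Weighted mean of y within each domain level, ignoring missing y values."""
    y = np.asarray(y, dtype=float)
    weights = np.asarray(weights, dtype=float)
    domain = np.asarray(domain)
    keep = ~np.isnan(y)
    means = {}
    for level in np.unique(domain):
        rows = keep & (domain == level)
        means[int(level)] = float(np.average(y[rows], weights=weights[rows]))
    return means

# Made-up toy data: days of use, survey weights, and a 0/1 Young Adult indicator
days = [10, 0, 30, np.nan, 5]
wts = [2.0, 1.0, 1.0, 4.0, 3.0]
is_young_adult = [1, 1, 0, 0, 0]

means = weighted_domain_means(days, wts, is_young_adult)
print(means)
```

Here the Young Adult mean is (10*2 + 0*1) / 3 and the mean for everyone else is (30*1 + 5*3) / 4, since the respondent with a missing value is excluded.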

In [None]:
def mean_comp(df,xvar,xlabel,xaxisrotation,title):
    '''
    Create bar plots with error bars that compare substance use averages between Young Adults aged 18-25 and
    all other populations

    Parameters:
    df - object - Cleaned 2021-2023 NSDUH dataframe
    xvar - string - The variable that will be the x-axis of our barplot (variable that we will show averages of)
    xlabel - string - The x-axis label of our barplot
    xaxisrotation - integer - How much we should rotate the x-axis tick labels (0 to 360 degrees)
    title - string - The title of our barplot

    Return:
    The averages table
    '''    
    df=df.copy()
    #Obtain averages table
    df=avg_tbl(df,xvar,na_rm=True)
    #Create new column with the lower error: the average minus the _lci
    df['lower_error']=df['_estimate']-df['_lci']
    #Create new column with the upper error: the _uci minus the average
    df['upper_error']=df['_uci']-df['_estimate']
    #Put lower and upper measures into a list
    asymmetric_err=[df['lower_error'],df['upper_error']]
    #Create bar plot with error bars
    plot=plt.bar(df['_domain'],df['_estimate'], yerr=asymmetric_err, capsize=5, color=sns.color_palette('Spectral', n_colors=len(df)))
    #Add data labels to bar plot
    plt.bar_label(plot,fmt="%.2f",label_type='center')
    #Change x-axis label of plot
    plt.xlabel(xlabel)
    #Change y-axis label of plot
    plt.ylabel("Weighted Average")
    #Change title of plot
    plt.title(title)
    #Change rotation angle of x-axis tick labels of plot
    plt.xticks(rotation=xaxisrotation)
    #Show plot
    plt.show()
    #Return averages table
    return df
    

In [None]:
def avg_tbl2(df,xvar,yvar,na_rm=True):
    '''
    Compute weighted mean (including accounting for survey design)
    across substance use measures by mental health measures

    Parameters:
    df - object - Cleaned 2021-2023 NSDUH dataframe
    xvar - string - The substance use variable that we will generate an average from
    yvar - string - The mental health variable we are comparing substance use measures across
    na_rm - boolean - Whether or not we will eliminate missing values

    Return:
    Table with substance use averages (for Young Adults with and without the mental health issue) and 95% confidence intervals
    '''
    # Work on a copy to avoid modifying original df
    df = df.copy()
    #Create domain series for Young Adults who have experienced the specified mental health issue (yvar)
    domain_yes = ((df["CATAGE"]==2) & (df[yvar]=='Yes')).astype(int)
    #Taylor estimator for mean substance use frequency for mental health group 1 (Young Adults 
    #who experienced the mental health issue in the past year)
    est = TaylorEstimator(PopParam.mean)
    est.estimate(
        y=df[xvar],
        samp_weight=df["ANALWT2_C3"],
        stratum=df["VESTR_C"],
        psu=df["VEREP"],
        domain=domain_yes,
        remove_nan=na_rm
    )
    #to_dataframe() is handy when domain is provided (row per domain level)
    out1 = est.to_dataframe()
    #Change values of domain column to be more descriptive
    out1['_domain'] = out1['_domain'].replace({1: 'Yes', 0: 'All Other Populations'})
    #Create domain series for Young Adults who have NOT experienced the specified mental health issue (yvar)
    domain_no = ((df["CATAGE"]==2) & (df[yvar]=='No')).astype(int)
    #Taylor estimator for mean substance use frequency for mental health group 2 (Young Adults 
    #who did not experience the mental health issue in the past year)
    est = TaylorEstimator(PopParam.mean)
    est.estimate(
        y=df[xvar],
        samp_weight=df["ANALWT2_C3"],
        stratum=df["VESTR_C"],
        psu=df["VEREP"],
        domain=domain_no,
        remove_nan=na_rm
    )
    #to_dataframe() is handy when domain is provided (row per domain level)
    out2 = est.to_dataframe()
    #Change values of domain column to be more descriptive
    out2['_domain'] = out2['_domain'].replace({1: 'No', 0: 'All Other Populations'})
    #Combine the two dataframes, taking the row of each dataframe where _domain=Group 1 or _domain=Group 2
    group_comp=pd.concat([out1.loc[out1['_domain']=='Yes',:],out2.loc[out2['_domain']=='No',:]])
    #Sort the combined dataframe
    group_comp.sort_values(by='_estimate',ascending=False,inplace=True)
    #Return dataframe with mean (accounts for complex survey design) illicit substance use for each group
    return group_comp

In [None]:
def avg_tbl3(df,xvar,yvar,na_rm=True):
    '''
    Compute weighted mean (accounting for survey design)
    of potential confounders by suicidal thoughts

    Parameters:
    df - object - Cleaned 2021-2023 NSDUH dataframe
    xvar - string - The numeric confounder variable that we will generate an average from
    yvar - string - The mental health variable (Yes/No) we are comparing the averages across
    na_rm - boolean - Whether or not we will eliminate missing values

    Return:
    Table with averages (for Young Adults with and without the mental health issue) and 95% confidence intervals
    '''
    # Work on a copy to avoid modifying original df
    df = df.copy()
    #Recode the mental health variable passed as yvar (e.g. suicidal ideation) from Yes/No to 1/0
    df[yvar]=df[yvar].replace({'Yes':1,'No':0})
    #Create domain series for Young Adults who have experienced the specified mental health issue (yvar)
    domain_yes = ((df["CATAGE"]==2) & (df[yvar]==1)).astype(int)
    #Taylor estimator for mean substance use frequency for mental health group 1 (Young Adults 
    #who experienced the mental health issue in the past year)
    est = TaylorEstimator(PopParam.mean)
    est.estimate(
        y=df[xvar],
        samp_weight=df["ANALWT2_C3"],
        stratum=df["VESTR_C"],
        psu=df["VEREP"],
        domain=domain_yes,
        remove_nan=na_rm
    )
    #to_dataframe() is handy when domain is provided (row per domain level)
    out1 = est.to_dataframe()
    #Change values of domain column to be more descriptive
    out1['_domain'] = out1['_domain'].replace({1: 'Yes', 0: 'All Other Populations'})
    #Create domain series for Young Adults who have NOT experienced the specified mental health issue (yvar)
    domain_no = ((df["CATAGE"]==2) & (df[yvar]==0)).astype(int)
    #Taylor estimator for mean substance use frequency for mental health group 2 (Young Adults 
    #who did not experience the mental health issue in the past year)
    est = TaylorEstimator(PopParam.mean)
    est.estimate(
        y=df[xvar],
        samp_weight=df["ANALWT2_C3"],
        stratum=df["VESTR_C"],
        psu=df["VEREP"],
        domain=domain_no,
        remove_nan=na_rm
    )
    #to_dataframe() is handy when domain is provided (row per domain level)
    out2 = est.to_dataframe()
    #Change values of domain column to be more descriptive
    out2['_domain'] = out2['_domain'].replace({1: 'No', 0: 'All Other Populations'})
    #Combine the two dataframes, taking the row of each dataframe where _domain=Group 1 or _domain=Group 2
    group_comp=pd.concat([out1.loc[out1['_domain']=='Yes',:],out2.loc[out2['_domain']=='No',:]])
    #Sort the combined dataframe
    group_comp.sort_values(by='_estimate',ascending=False,inplace=True)
    #Return dataframe with mean (accounts for complex survey design) substance use for each group
    return group_comp

In [None]:
def mean_comp2(df,xvar,yvar,xlabel,xaxisrotation,title):
    '''
    Create bar plots with error bars that compare substance use averages between Young Adults aged 18-25
    who did and did not experience a given mental health issue

    Parameters:
    df - object - Cleaned 2021-2023 NSDUH dataframe
    xvar - string - The illicit substance use variable that will be the y-axis of our barplot (variable that we will show averages of)
    yvar - string - The mental health variable that will be on the x-axis of our barplot
    xlabel - string - The x-axis label of our barplot
    xaxisrotation - integer - How much we should rotate the x-axis tick labels (0 to 360 degrees)
    title - string - The title of our barplot

    Return:
    The averages table 
    '''    
    df=df.copy()
    #Obtain averages table
    df=avg_tbl2(df,xvar,yvar,na_rm=True)
    #Create new column with the lower error: the average minus the _lci
    df['lower_error']=df['_estimate']-df['_lci']
    #Create new column with the upper error: the _uci minus the average
    df['upper_error']=df['_uci']-df['_estimate']
    #Put lower and upper measures into a list
    asymmetric_err=[df['lower_error'],df['upper_error']]
    #Create bar plot with error bars
    plot=plt.bar(df['_domain'],df['_estimate'], yerr=asymmetric_err, capsize=5, color=sns.color_palette('Spectral', n_colors=len(df)))
    #Add data labels to bar plot
    plt.bar_label(plot,fmt="%.2f",label_type='center')
    #Change x-axis label of plot
    plt.xlabel(xlabel)
    #Change y-axis label of plot
    plt.ylabel("Weighted Average")
    #Change title of plot
    plt.title(title)
    #Change rotation angle of x-axis tick labels of plot
    plt.xticks(rotation=xaxisrotation)
    #Show plot
    plt.show()
    #Return averages table
    return df

# Understanding Our Sample

In this section, we accomplish goal 1 outlined in the introduction. 

In [None]:
#Examine gender frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRSEX','Sex at birth',0,"Percentage of Young Adults Aged 18-25 \n Who Are Certain Genders")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRSEX',na_rm=True)

In [None]:
#Examine education frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'EDUHIGHCAT','Highest Education Obtained',45,"Percentage of Young Adults Aged 18-25 \n With Certain Education Levels")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'EDUHIGHCAT',na_rm=True)

In [None]:
#Examine work status among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRWRKSTAT18','Work Status',45,"Percentage of Young Adults Aged 18-25 \n With Certain Work Statuses")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRWRKSTAT18',na_rm=True)

In [None]:
#Examine private health insurance frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRPRVHLT','Has private health insurance',0,"Percentage of Young Adults Aged 18-25 \n With Private Health Insurance")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRPRVHLT',na_rm=True)

In [None]:
#Examine household size frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRHHSIZ2','Number of people in household',45,"Percentage of Young Adults Aged 18-25 \n With Certain Household Sizes")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRHHSIZ2',na_rm=True)

In [None]:
#Examine income frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'INCOME','Total Household Income',45,"Percentage of Young Adults Aged 18-25 \n With Certain Household Incomes")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population) among
#Young Adults aged 18-25
frequencies(nsduh_2123,'INCOME',na_rm=True)

# Examining Substance Use and Mental Health Patterns

In this section, we accomplish goal 2 outlined in the introduction.

In [None]:
#Examine co-occurring substance use disorder and any mental illness frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'AMISUD5ANYO','Had SUD (substance use disorder) and/or AMI (any mental illness)',45,"Percentage of Young Adults Aged 18-25 \n With Certain AMI or SUD Frequencies")
 

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'AMISUD5ANYO',na_rm=True)

In [None]:
#Examine average of alcohol use frequency for those that are Young Adults vs other populations 
mean_comp(nsduh_2123,"IRALCFY",'Past Year Alcohol Use (0-365 days)',0,'Average Past Year Alcohol Use for\n Young Adults Aged 18-25 Compared to All Other Populations')

In [None]:
#Examine average of binge drinking frequency in the past month for those that are Young Adults vs other populations 
mean_comp(nsduh_2123,"IRALCBNG30D",'Past Month Binge Drinking (0-30 days)',0,'Average Past Month Binge Drinking for\n Young Adults Aged 18-25 Compared to All Other Populations')

In [None]:
#Examine average of marijuana use frequency for those that are Young Adults vs other populations 
mean_comp(nsduh_2123,"IRMJFY",'Past Year Marijuana Use (0-365 days)',0,'Average Past Year Marijuana Use for\n Young Adults Aged 18-25 Compared to All Other Populations')

In [None]:
#Examine average of cocaine frequency for those that are Young Adults vs other populations 
mean_comp(nsduh_2123,"IRCOCFY",'Past Year Cocaine Use (0-365 days)',0,'Average Past Year Cocaine Use for\n Young Adults Aged 18-25 Compared to All Other Populations')

In [None]:
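#Examine average of hallucinogen frequency for those that are Young Adults vs other populations 
#The original call passed "IRCOCFY" (the cocaine variable), which repeated the cocaine plot under a hallucinogen title
#Assumption: the cleaned data keeps the NSDUH hallucinogen frequency column under its codebook name IRHALLUCYFQ
mean_comp(nsduh_2123,"IRHALLUCYFQ",'Past Year Hallucinogen Use (0-365 days)',0,'Average Past Year Hallucinogen Use for\n Young Adults Aged 18-25 Compared to All Other Populations')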

In [None]:
#Examine average of cigarette frequency in the past month for those that are Young Adults vs other populations 
mean_comp(nsduh_2123,"IRCIGFM",'Past Month Cigarette Use (0-30 days)',0,'Average Past Month Cigarette Use for\n Young Adults Aged 18-25 Compared to All Other Populations')

In [None]:
#Examine average of nicotine vaping frequency in the past month for those that are Young Adults vs other populations 
df=nsduh_2123.copy()
#Exclude records coded -9 for the vaping variable
df=df[df['IRNICVAP30N'] != -9] 
#Obtain averages table
df=avg_tbl(df,'IRNICVAP30N',na_rm=True)
#Create new column with the lower error: the average minus the _lci
df['lower_error']=df['_estimate']-df['_lci']
#Create new column with the upper error: the _uci minus the average
df['upper_error']=df['_uci']-df['_estimate']
#Put lower and upper measures into a list
asymmetric_err=[df['lower_error'],df['upper_error']]
#Create bar plot with error bars
plot=plt.bar(df['_domain'],df['_estimate'], yerr=asymmetric_err, capsize=5, color=sns.color_palette('Spectral', n_colors=len(df)))
#Add data labels to bar plot
plt.bar_label(plot,fmt="%.2f",label_type='center')
#Change x-axis label of plot
plt.xlabel('Past Month Nicotine Vaping')
#Change y-axis label of plot
plt.ylabel("Weighted Average")
#Change title of plot
plt.title('Average Past Month Nicotine Vaping for\n Young Adults Aged 18-25 Compared to All Other Populations')
#Change rotation angle of x-axis tick labels of plot
plt.xticks(rotation=0)
#Show plot
plt.show()
#Print averages table
print(df)
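An alternative to filtering out the `-9` rows is to recode that value as missing, so the dataframe keeps every respondent and the `na_rm` argument decides how those records are handled. The sketch below uses made-up values. Treating `-9` as a missing-value code is an assumption inferred from the filter in the cell above, not something confirmed by the codebook here.

```python
import numpy as np
import pandas as pd

# Made-up toy data; -9 is assumed to be a sentinel for "no valid answer"
toy = pd.DataFrame({
    "IRNICVAP30N": [0, 12, -9, 30, -9],
    "CATAGE": [2, 2, 3, 1, 2],
})

# Recode the sentinel to NaN on a copy instead of dropping the rows
recoded = toy.copy()
recoded["IRNICVAP30N"] = recoded["IRNICVAP30N"].replace(-9, np.nan)

print(recoded)
```

The row count is unchanged after recoding, which keeps the design columns (weights, strata, PSUs) aligned with the full dataset.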


In [None]:
#Examine substance use treatment frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'SUTINPPY','Received In-Patient Substance Use Treatment',0,"Percentage of Young Adults Aged 18-25 \n Who Received Inpatient Substance Use Treatment")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'SUTINPPY',na_rm=True)

In [None]:
#Examine past year feeling nervous frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRDSTNRV12','How Often Felt Nervous - Worst Month',45,"Percentage of Young Adults Aged 18-25 \n Who Felt Varying Levels of Nervousness")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRDSTNRV12',na_rm=True)

In [None]:
#Examine past year everything felt like an effort frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRDSTEFF12','How Often Everything Felt Like an Effort - Worst Month',45,"Percentage of Young Adults Aged 18-25 \n Who Felt Varying Levels of Effort")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRDSTEFF12',na_rm=True)

In [None]:
#Examine past year difficulty concentrating frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRIMPCONCN','Difficulty Concentrating - One Month',45,"Percentage of Young Adults Aged 18-25 \n With Varying Levels of Concentration Difficulties")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRIMPCONCN',na_rm=True)

In [None]:
#Examine past year suicidal thoughts frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRSUICTHNK','Seriously Thought About Killing Self - Past Year',0,"Percentage of Young Adults Aged 18-25 \n Who Had Suicidal Thoughts in Past Year")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRSUICTHNK',na_rm=True)

In [None]:
#Examine past year major depressive episode frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'IRAMDEYR','Major Depressive Episode - Past Year',0,"Percentage of Young Adults Aged 18-25 \n Who Had a Major Depressive Episode in Past Year")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'IRAMDEYR',na_rm=True)

In [None]:
#Examine past year receipt of inpatient mental health treatment frequencies among Young Adults aged 18-25
rel_freq_barplot(nsduh_2123,'MHTINPPY','Received Mental Health Treatment as an Inpatient - Past Year',0,"Percentage of Young Adults Aged 18-25 \n Who Received Inpatient Mental Health Treatment in Past Year")

In [None]:
#Compute counts/frequencies (weighted to represent the totals for Young Adults in the U.S. population)
frequencies(nsduh_2123,'MHTINPPY',na_rm=True)

# Associations Between Our Confounders and Mental Health

In this section, we accomplish goal 3 outlined in the introduction.

In [None]:
#Define variables
df=nsduh_2123
xvar='IRSEX'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar for young adults who had suicidal thoughts and is missing for everyone else
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
    #Column we want to tabulate relative frequencies based on 
    vars=df[["subpop_xvar"]],
    #2023 NSDUH sample weight
    samp_weight=df["ANALWT2_C3"],
    #Variance stratum
    stratum=df["VESTR_C"],
    #Variance primary sampling unit
    psu=df["VEREP"],
    remove_nan=na_rm
)
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar for young adults who did not have suicidal thoughts and is missing for everyone else
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
    #Column we want to tabulate relative frequencies based on 
    vars=df[["subpop_xvar"]],
    #2023 NSDUH sample weight
    samp_weight=df["ANALWT2_C3"],
    #Variance stratum
    stratum=df["VESTR_C"],
    #Variance primary sampling unit
    psu=df["VEREP"],
    remove_nan=na_rm
)
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)
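The masking step in this cell is repeated for each confounder, so it could be factored into a small helper. The sketch below covers only the masking, not the survey tabulation, and runs on made-up data. `mask_by_outcome` and its default arguments are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def mask_by_outcome(df, xvar, outcome, age_col="CATAGE", age_value=2):
    """Return two copies of xvar: one kept only where outcome == 1, one only where outcome == 0,
    both restricted to the chosen age category and set to NaN everywhere else."""
    in_age = df[age_col] == age_value
    yes = df[xvar].where(in_age & (df[outcome] == 1), np.nan)
    no = df[xvar].where(in_age & (df[outcome] == 0), np.nan)
    return yes, no

# Made-up toy data with the outcome already recoded to 1/0
toy = pd.DataFrame({
    "CATAGE": [2, 2, 3, 2],
    "IRSUICTHNK": [1, 0, 1, 0],
    "IRSEX": ["Male", "Female", "Male", "Male"],
})

yes_col, no_col = mask_by_outcome(toy, "IRSEX", "IRSUICTHNK")
print(yes_col.tolist())
print(no_col.tolist())
```

Each returned Series has the same length as the input, so it can be passed to a tabulation together with the full-length weight, stratum, and PSU columns.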



In [None]:
#Define variables
df=nsduh_2123
xvar='NEWRACE2'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar for young adults who had suicidal thoughts and is missing for everyone else
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
    #Column we want to tabulate relative frequencies based on 
    vars=df[["subpop_xvar"]],
    #2023 NSDUH sample weight
    samp_weight=df["ANALWT2_C3"],
    #Variance stratum
    stratum=df["VESTR_C"],
    #Variance primary sampling unit
    psu=df["VEREP"],
    remove_nan=na_rm
)
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar for young adults who did not have suicidal thoughts and is missing for everyone else
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
    #Column we want to tabulate relative frequencies based on 
    vars=df[["subpop_xvar"]],
    #2023 NSDUH sample weight
    samp_weight=df["ANALWT2_C3"],
    #Variance stratum
    stratum=df["VESTR_C"],
    #Variance primary sampling unit
    psu=df["VEREP"],
    remove_nan=na_rm
)
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)
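
The cells in this section all rely on the same masking step: `Series.where` keeps the value of `xvar` for rows inside the subpopulation and sets every other row to NaN, so no rows are dropped from the data frame itself. A minimal sketch of that step on a made-up toy frame (the column names mirror the notebook, but the values are hypothetical and are not NSDUH data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "CATAGE": [1, 2, 2, 2, 3],
    "IRSUICTHNK": [0, 1, 0, 1, 1],
    "NEWRACE2": ["A", "B", "A", "C", "B"],
})

# Boolean mask for the domain: young adults (CATAGE == 2) with suicidal thoughts
in_domain = (toy["CATAGE"] == 2) & (toy["IRSUICTHNK"] == 1)

# Keep NEWRACE2 inside the domain, NaN everywhere else
toy["subpop_xvar"] = toy["NEWRACE2"].where(in_domain, np.nan)

# All five rows are still present; rows outside the domain are NaN
print(toy)
```

Only the masked column changes between the two tabulations in each cell, which is why the weight, stratum, and PSU columns are always taken from the full data frame.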

In [None]:
#Define variables
df=nsduh_2123
xvar='EDUHIGHCAT'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
#Define variables
df=nsduh_2123
xvar='IRWRKSTAT18'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
#Define variables
df=nsduh_2123
xvar='IRPRVHLT'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
#Define variables
df=nsduh_2123
xvar='IRHHSIZ2'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
#Define variables
df=nsduh_2123
xvar='INCOME'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
#Define variables
df=nsduh_2123
xvar='IRDSTNRV12'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
#Define variables
df=nsduh_2123
xvar='IRDSTEFF12'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
#Define variables
df=nsduh_2123
xvar='IRIMPCONCN'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
#Define variables
df=nsduh_2123
xvar='SUTINPPY'
na_rm=True
#Work on a copy to avoid modifying original df
df = df.copy()
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create new column that keeps xvar only for young adults in the cleaned dataset who had suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 1), np.nan)
#Tabulation function calculates and tabulates summary statistics of a categorical variable given
#survey weights and survey design measures
#We account for the subpopulation (Young Adults) WHEN we do the tabulation 
#Don't filter data down to Young Adults before doing the tabulation as 
#that would bias the SE/variance estimates towards that specific subgroup rather than the total population
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df1=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df1.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Create new column that keeps xvar only for young adults in the cleaned dataset who did not have suicidal thoughts (all other rows become NaN)
df["subpop_xvar"]=df[xvar].where((df["CATAGE"] == 2) & (df['IRSUICTHNK'] == 0), np.nan)
#Tabulation with full dataset, but subpopulation defined
tab = Tabulation(PopParam.prop)  
tab.tabulate(
#Column we want to tabulate relative frequencies based on 
vars=df[["subpop_xvar"]],
#2023 NSDUH sample weight
samp_weight=df["ANALWT2_C3"],
#Variance stratum
stratum=df["VESTR_C"],
#Variance primary sampling unit
psu=df["VEREP"],
remove_nan=na_rm
    )
#Convert the weighted frequency estimates to a dataframe
df2=tab.to_dataframe()
#Sort the frequency dataframe by frequencies in descending order
df2.sort_values(by=PopParam.prop,ascending=False,inplace=True)
#Display first dataframe
print(df1)
#Display second dataframe
print(df2)

In [None]:
avg_tbl3(df,'IRALCFY','IRSUICTHNK',na_rm=True)

In [None]:
avg_tbl3(df,'IRCIGFM','IRSUICTHNK',na_rm=True)

In [None]:
avg_tbl3(df,'IRALCBNG30D','IRSUICTHNK',na_rm=True)

In [None]:
#Examine average of nicotine vaping frequency in the past month for those that are Young Adults with suicidal thoughts
#vs young adults without suicidal thoughts 
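#Create copy of nsduh dataframe
df=nsduh_2123.copy()
#Remove NaN values during estimation (defined here so this cell does not rely on an earlier cell)
na_rm=True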
#Filter values of -9 out of the column
df=df[df['IRNICVAP30N'] != -9] 
#Recode suicidal ideation variable
df['IRSUICTHNK']=df['IRSUICTHNK'].replace({'Yes':1,'No':0})
#Create domain series for Young Adults who experienced suicidal thoughts in the past year
domain_yes = ((df["CATAGE"]==2) & (df['IRSUICTHNK']==1)).astype(int)
#Taylor estimator for mean past-month nicotine vaping frequency for group 1 (Young Adults 
#who experienced suicidal thoughts in the past year)
est = TaylorEstimator(PopParam.mean)
est.estimate(
        y=df['IRNICVAP30N'],
        samp_weight=df["ANALWT2_C3"],
        stratum=df["VESTR_C"],
        psu=df["VEREP"],
        domain=domain_yes,
        remove_nan=na_rm
    )
#to_dataframe() is handy when domain is provided (row per domain level)
out1 = est.to_dataframe()
#Change values of domain column to be more descriptive
out1['_domain'] = out1['_domain'].replace({1: 'Yes', 0: 'All Other Populations'})
#Create domain series for Young Adults who did NOT experience suicidal thoughts in the past year
domain_no = ((df["CATAGE"]==2) & (df['IRSUICTHNK']==0)).astype(int)
#Taylor estimator for mean past-month nicotine vaping frequency for group 2 (Young 
#Adults who did not experience suicidal thoughts in the past year)
est = TaylorEstimator(PopParam.mean)
est.estimate(
        y=df['IRNICVAP30N'],
        samp_weight=df["ANALWT2_C3"],
        stratum=df["VESTR_C"],
        psu=df["VEREP"],
        domain=domain_no,
        remove_nan=na_rm
    )
#to_dataframe() is handy when domain is provided (row per domain level)
out2 = est.to_dataframe()
#Change values of domain column to be more descriptive
out2['_domain'] = out2['_domain'].replace({1: 'No', 0: 'All Other Populations'})
#Combine the two dataframes, taking the row where _domain='Yes' from the first and the row where _domain='No' from the second
group_comp=pd.concat([out1.loc[out1['_domain']=='Yes',:],out2.loc[out2['_domain']=='No',:]])
#Sort the combined dataframe
group_comp.sort_values(by='_estimate',ascending=False,inplace=True)
#Return dataframe with mean (accounts for complex survey design) substance use for each group
group_comp
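
As a sanity check on the point estimates above, the design-weighted mean for a domain is the weighted sum of the outcome over the domain rows divided by the sum of the weights over the domain rows. The sketch below computes only that point estimate with NumPy on made-up numbers; `weighted_domain_mean` is a hypothetical helper, not part of `TaylorEstimator`, and it does not produce standard errors.

```python
import numpy as np

def weighted_domain_mean(y, w, in_domain):
    """Weighted mean of y restricted to rows where in_domain is True."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    d = np.asarray(in_domain, dtype=bool)
    # sum(w * y) over the domain divided by sum(w) over the domain
    return np.sum(w[d] * y[d]) / np.sum(w[d])

# Made-up outcome values, weights, and domain indicator (not NSDUH data)
y = [0, 10, 30, 5]
w = [100.0, 200.0, 100.0, 50.0]
in_domain = [False, True, True, False]

domain_mean = weighted_domain_mean(y, w, in_domain)
print(domain_mean)
```

If the `_estimate` column for a group differs noticeably from this hand calculation on the same rows, the domain definition or the weight column is the first thing to check.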

# Initial Associations Between Substance Use and Mental Health

In this section, we accomplish goal 4 outlined in the introduction.

In [None]:
#Estimate average yearly marijuana use among Young Adults who experienced an MDE in the past year
#vs Young Adults who did not experience an MDE in the past year
mean_comp2(nsduh_2123,'IRMJFY','IRAMDEYR','Experience an MDE (major depressive episode) in the past year',0,'Average Past Year Marijuana Use for\n Young Adults that experienced an MDE vs did not experience an MDE')

In [None]:
#Estimate average yearly cocaine use among Young Adults who experienced an MDE in the past year
#vs Young Adults who did not experience an MDE in the past year
mean_comp2(nsduh_2123,'IRCOCFY','IRAMDEYR','Experience an MDE (major depressive episode) in the past year',0,'Average Past Year Cocaine Use for\n Young Adults that experienced an MDE vs did not experience an MDE')

In [None]:
#Estimate average yearly hallucinogen use among Young Adults who experienced an MDE in the past year
#vs Young Adults who did not experience an MDE in the past year
mean_comp2(nsduh_2123,'IRHALLUCYFQ','IRAMDEYR','Experience an MDE (major depressive episode) in the past year',0,'Average Past Year Hallucinogen Use for\n Young Adults that experienced an MDE vs did not experience an MDE')

In [None]:
#Estimate average yearly marijuana use among Young Adults who experienced suicidal thoughts in the past year
#vs Young Adults who did not experience suicidal thoughts in the past year
mean_comp2(nsduh_2123,'IRMJFY','IRSUICTHNK','Experience suicidal thoughts in the past year',0,'Average Past Year Marijuana Use for\n Young Adults that Experienced Suicidal Thoughts vs did not Experience Suicidal Thoughts')

In [None]:
#Estimate average yearly cocaine use among Young Adults who experienced suicidal thoughts in the past year
#vs Young Adults who did not experience suicidal thoughts in the past year
mean_comp2(nsduh_2123,'IRCOCFY','IRSUICTHNK','Experience suicidal thoughts in the past year',0,'Average Past Year Cocaine Use for\n Young Adults that Experienced Suicidal Thoughts vs did not Experience Suicidal Thoughts')

In [None]:
#Estimate average yearly hallucinogen use among Young Adults who experienced suicidal thoughts in the past year
#vs Young Adults who did not experience suicidal thoughts in the past year
mean_comp2(nsduh_2123,'IRHALLUCYFQ','IRSUICTHNK','Experience suicidal thoughts in the past year',0,'Average Past Year Hallucinogen Use for\n Young Adults that Experienced Suicidal Thoughts vs did not Experience Suicidal Thoughts')

In [None]:
#Estimate average yearly marijuana use among Young Adults who received inpatient mental health treatment in the past year
#vs Young Adults who did not receive inpatient mental health treatment
mean_comp2(nsduh_2123,'IRMJFY','MHTINPPY','Received inpatient mental health treatment in the past year',0,'Average Past Year Marijuana Use for\n Young Adults that Received Inpatient Mental Health Treatment\n vs did not Receive Inpatient Mental Health Treatment')

In [None]:
#Estimate average yearly cocaine use among Young Adults who received inpatient mental health treatment in the past year
#vs Young Adults who did not receive inpatient mental health treatment
mean_comp2(nsduh_2123,'IRCOCFY','MHTINPPY','Received inpatient mental health treatment in the past year',0,'Average Past Year Cocaine Use for\n Young Adults that Received Inpatient Mental Health Treatment\n vs did not Receive Inpatient Mental Health Treatment')

In [None]:
#Estimate average yearly hallucinogen use among Young Adults who received inpatient mental health treatment in the past year
#vs Young Adults who did not receive inpatient mental health treatment
mean_comp2(nsduh_2123,'IRHALLUCYFQ','MHTINPPY','Received inpatient mental health treatment in the past year',0,'Average Past Year Hallucinogen Use for\n Young Adults that Received Inpatient Mental Health Treatment\n vs did not Receive Inpatient Mental Health Treatment')

# Checking Conditions

In this final section, we check the conditions required for a weighted t-test before conducting the t-tests in RStudio.
We also verify that the confidence intervals calculated above are valid, along with the estimates produced by the Taylor linearization and tabulation functions.

The conditions for Taylor linearization are:
1. Correctly specified survey design: weights, strata, and PSUs
2. Subpopulation definition: use a domain (subpopulation) approach instead of dropping rows, since dropping rows discards design information and gives incorrect variance estimates
3. Sufficient number of PSUs/strata: design degrees of freedom >= 30
4. PSUs are assumed to be independent draws within strata: NSDUH is a multistage stratified probability sample rather than a simple random sample, but that is accounted for when we specify the survey design
5. Smoothness of the statistic: the mean, which is the statistic we use, is an example of a smooth statistic that works well with Taylor linearization
6. No extreme instability from weights
7. Correctly specified model form (if using Taylor linearization for a model such as linear regression)
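
To make these conditions concrete, here is a rough sketch of what a Taylor-linearized standard error for a weighted mean involves. It assumes the usual with-replacement approximation at the PSU level with no finite population correction, and `linearized_mean_se` is a hypothetical helper written for illustration; it is not the samplics implementation. The toy values are made up.

```python
import numpy as np
import pandas as pd

def linearized_mean_se(y, w, stratum, psu, domain=None):
    """Weighted (domain) mean and its linearized SE for a stratified cluster design."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    d = np.ones(len(y)) if domain is None else np.asarray(domain, dtype=float)
    wd = w * d                                    # weights are zero outside the domain
    y_in = np.where(d > 0, y, 0.0)                # ignore y outside the domain
    mean = np.sum(wd * y_in) / np.sum(wd)
    # Linearized (score) variable for a ratio mean
    z = wd * np.where(d > 0, y - mean, 0.0) / np.sum(wd)
    scores = pd.DataFrame({"h": stratum, "c": psu, "z": z})
    psu_totals = scores.groupby(["h", "c"])["z"].sum()
    variance = 0.0
    for _, z_h in psu_totals.groupby(level="h"):
        n_h = len(z_h)                            # PSUs in this stratum
        if n_h < 2:
            raise ValueError("each stratum needs at least 2 PSUs")
        variance += n_h / (n_h - 1) * np.sum((z_h - z_h.mean()) ** 2)
    return mean, float(np.sqrt(variance))

# Made-up data: 2 strata with 2 PSUs each, equal weights
mean, se = linearized_mean_se(
    y=[1, 2, 3, 2, 4, 4],
    w=[1, 1, 1, 1, 1, 1],
    stratum=["A", "A", "A", "B", "B", "B"],
    psu=[1, 1, 2, 1, 2, 2],
)
print(mean, se)
```

The sketch shows why conditions 1, 2, and 3 matter: the variance is built from between-PSU variation within each stratum, so a stratum with a single PSU has no variance contribution that can be computed, and rows outside the domain still have to be present so that their PSUs are counted.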

The conditions for tabulation are:
1. Correctly specified survey design: weights, strata, and PSUs
2. Subpopulation definition: use a domain (subpopulation) approach instead of dropping rows, since dropping rows discards design information and gives incorrect variance estimates
3. Sufficient number of PSUs/strata: design degrees of freedom >= 30
4. The effective sample size, (sum of weights)^2 / sum of (weights^2), should be large enough (at least 30, ideally 50 or more)
5. Outcome variable format: make sure the outcome variable is coded consistently
6. Smoothness of the statistic
7. PSUs are assumed to be independent draws within strata: NSDUH is a multistage stratified probability sample rather than a simple random sample, but that is accounted for when we specify the survey design

The conditions for a weighted 95% confidence interval are:
1. Correctly specified survey design: weights, strata, and PSUs
2. Subpopulation definition: use a domain (subpopulation) approach instead of dropping rows, since dropping rows discards design information and gives incorrect variance estimates
3. Sufficient number of PSUs/strata: design degrees of freedom >= 30
4. Large-sample (CLT) conditions: within each group/contrast, the effective sample size, (sum of weights)^2 / sum of (weights^2), should be large enough (at least 30, ideally 50 or more) for the sampling distribution of the mean to be approximately normal
5. No stratum with only 1 contributing PSU
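
When these conditions hold, a 95% confidence interval can be formed from an estimate and its design-based standard error. The sketch below uses a normal critical value from the standard library, which is an assumption that is only reasonable when the design degrees of freedom are large; a t critical value based on the design degrees of freedom gives a slightly wider interval. `normal_ci` is a hypothetical helper and the inputs are made up.

```python
from statistics import NormalDist

def normal_ci(estimate, std_error, level=0.95):
    """Two-sided normal-approximation confidence interval."""
    # Critical value, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# Made-up estimate and standard error (not NSDUH results)
lower, upper = normal_ci(estimate=12.0, std_error=1.5)
print(round(lower, 2), round(upper, 2))
```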

The conditions for a weighted t-test are:
1. Correctly specified survey design: weights, strata, and PSUs
2. Subpopulation definition: use a domain (subpopulation) approach instead of dropping rows, since dropping rows discards design information and gives incorrect variance estimates
3. Large-sample (asymptotic) validity: the test statistic is approximately t- or normally distributed if we have a sufficiently large number of PSUs within strata (at least 2 PSUs within each stratum) and not too many empty strata/PSUs in the domain
4. Design degrees of freedom: the number of PSUs minus the number of strata, which equals the number of strata in our case when every stratum has 2 PSUs (should be at least 30)
5. Distribution of the variable: the sampling distribution of the mean should be approximately normal, which holds if the variable itself is normally distributed or, by the CLT, if the effective sample size is large enough
6. PSUs are assumed to be independent draws within strata: NSDUH is a multistage stratified probability sample rather than a simple random sample, but that is accounted for when we specify the survey design
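
The design degrees of freedom in condition 4 are commonly taken to be the number of PSUs minus the number of strata. A small sketch of that count with pandas, assuming PSU identifiers are nested within strata (so the same `VEREP` value in two strata counts as two PSUs); `design_df` is a hypothetical helper and the toy rows are made up:

```python
import pandas as pd

def design_df(data, stratum_col, psu_col):
    """Design degrees of freedom: (number of PSUs) - (number of strata)."""
    n_strata = data[stratum_col].nunique()
    # Count distinct PSUs within each stratum, then add them up
    n_psus = data.groupby(stratum_col)[psu_col].nunique().sum()
    return int(n_psus - n_strata)

# Made-up design columns: 3 strata with 2 PSUs each
toy = pd.DataFrame({
    "VESTR_C": [1, 1, 1, 2, 2, 3, 3, 3],
    "VEREP":   [1, 2, 2, 1, 2, 1, 1, 2],
})
toy_df = design_df(toy, "VESTR_C", "VEREP")
print(toy_df)
```

With exactly 2 PSUs per stratum, this count reduces to the number of strata, which matches the shortcut used in condition 4.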









# Computing the number of unique strata in our sample, effective sample sizes, and number of PSUs within each stratum
Since we specify the survey design and the subpopulation properly throughout our analysis, and our code already satisfies several other conditions (e.g. choosing smooth statistics), the main conditions left to check are the following:
1. Large-sample (asymptotic) validity: the test statistic is approximately t- or normally distributed if we have a sufficiently large number of PSUs within strata (at least 2 PSUs within each stratum) and not too many empty strata/PSUs in the domain
2. Large-sample (CLT) conditions: within each group/contrast, the effective sample size, (sum of weights)^2 / sum of (weights^2), should be large enough (at least 30, ideally 50 or more) for the sampling distribution of the mean to be approximately normal
3. No stratum with only 1 contributing PSU
4. Design degrees of freedom: the number of strata in our case (should be at least 30)



In [None]:
#Filter to domain
df_sub = nsduh_2123.loc[nsduh_2123["CATAGE"] == 2, :]
#Count distinct PSUs by stratum
psu_by_stratum = df_sub.groupby("VESTR_C")["VEREP"].nunique().reset_index()
psu_by_stratum.columns = ["Stratum", "n_PSUs"]
print(psu_by_stratum)
#Count total strata in the domain
total_strata = df_sub["VESTR_C"].nunique()
print("Total distinct strata in domain:", total_strata)
#Count strata with only 1 PSU
n_single_psu = (psu_by_stratum["n_PSUs"] == 1).sum()
print("Strata with only 1 PSU in domain:", n_single_psu)
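#Compute the effective sample size (ESS): (sum of weights)^2 / sum of (weights^2)
ESS=df_sub['ANALWT2_C3'].sum()**2/(df_sub['ANALWT2_C3']**2).sum()
print("Effective sample size in domain:", ESS)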
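
Because the large-sample condition is stated within each group/contrast, it may also help to compute the effective sample size separately for each comparison group rather than only for all young adults. A sketch on a made-up frame, where `kish_ess` is a hypothetical helper and the weights are not NSDUH values:

```python
import pandas as pd

def kish_ess(weights):
    """Effective sample size: (sum of weights)^2 / sum of (weights^2)."""
    w = pd.Series(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Made-up weights for two comparison groups
toy = pd.DataFrame({
    "IRSUICTHNK": ["Yes", "Yes", "No", "No", "No", "No"],
    "ANALWT2_C3": [100.0, 300.0, 50.0, 50.0, 50.0, 50.0],
})

# One ESS value per group
ess_by_group = toy.groupby("IRSUICTHNK")["ANALWT2_C3"].apply(kish_ess)
print(ess_by_group)
```

Equal weights give an effective sample size equal to the row count, while unequal weights shrink it, so the smaller or more unevenly weighted group is usually the one that determines whether the condition is met.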