In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


FILENAME = "drug_age.csv"
filepath = f"../data/{FILENAME}"

drug_age_df = pd.read_csv(filepath)

# Display the first few rows of the data
drug_age_df.head()

Unnamed: 0,age,time,start_time,end_time,setting,all drugs,all opioids,stimulants,cannabis,benzodiazepine
0,0-15 years,1,01/01/2020,01/31/2020,ip,7.7007,0.845,0.1908,1.0358,0.2044
1,0-15 years,1,01/01/2020,01/31/2020,ed,4.0613,0.0239,0.0239,0.1117,0.0106
2,16-34 years,1,01/01/2020,01/31/2020,ip,28.1293,4.932,3.0782,5.8844,1.1224
3,16-34 years,1,01/01/2020,01/31/2020,ed,31.2102,2.2127,1.8439,4.5482,0.4594
4,35-54 years,1,01/01/2020,01/31/2020,ip,39.7774,7.3291,5.6598,3.5135,1.4467


Conduct t-tests to determine if there are statistically significant differences in drug usage between the two age categories for each drug category. 

In a two-sample t-test, the null hypothesis (H0) states that the two group means are equal, and the alternative hypothesis (H1) states that they differ.

In this case, the null hypothesis is that there's no difference in mean drug usage between the two age groups. Running the t-test provides a p-value, which is used to decide whether to reject or fail to reject the null hypothesis.

A small p-value (usually less than 0.05) leads us to reject the null hypothesis, meaning there is a statistically significant difference in drug usage between the two age groups. A larger p-value means we fail to reject the null hypothesis; it does not prove that the groups are the same.

In the context of this data, the t-tests help identify whether two specific age groups ('16-34 years' and '35-54 years') differ in their usage of each drug category ('all drugs', 'all opioids', 'stimulants', 'cannabis', 'benzodiazepine'). Such findings can inform public health interventions targeted at specific age groups and help characterize patterns of drug usage.

In [3]:
from scipy.stats import ttest_ind

# Filter the DataFrame to include only rows with 'ip' setting
ip_df = drug_age_df[drug_age_df['setting'] == 'ip']

# Filter data for the two age groups
age_group1 = ip_df[ip_df['age'] == '16-34 years']['all drugs']
age_group2 = ip_df[ip_df['age'] == '35-54 years']['all drugs']

# Perform independent t-test
t_stat, p_val = ttest_ind(age_group1, age_group2)

print(f't-statistic: {t_stat}')
print(f'p-value: {p_val}')

t-statistic: -41.87025715912268
p-value: 3.5847980364740063e-56


The t-statistic measures the size of the difference between the two group means relative to the variability in the sample data. A larger absolute value indicates a larger difference relative to that variability, while a smaller absolute value indicates a smaller one.
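
To make the "difference relative to variability" idea concrete, here is a minimal sketch that computes the pooled-variance t-statistic by hand and compares it with `ttest_ind`. The two lists are made-up illustrative values, not rows from `drug_age.csv`, and the helper name `pooled_t_statistic` is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind


def pooled_t_statistic(a, b):
    """Student's t-statistic for two independent samples (equal variances assumed)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances (ddof=1).
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference in means.
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / std_err


# Made-up values for illustration only (not taken from the dataset).
group_a = [28.1, 27.5, 29.0, 28.4, 27.9]
group_b = [39.8, 40.2, 38.9, 39.5, 40.6]

manual_t = pooled_t_statistic(group_a, group_b)
scipy_t, scipy_p = ttest_ind(group_a, group_b)  # equal_var=True by default

print(f"manual t: {manual_t:.4f}")
print(f"scipy  t: {scipy_t:.4f}")
```

The two values should agree, because `ttest_ind` uses the pooled-variance (Student's) form unless `equal_var=False` is passed.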

The p-value is the probability of obtaining the observed data (or data more extreme) if the null hypothesis is true. In this case, the null hypothesis is that there's no difference between the means of the two age groups.

With these results:

- The t-statistic value of -41.87 suggests a large difference between the two age groups for the feature 'all drugs'. 

- The p-value is extremely small (almost zero), far less than the common alpha level 0.05, leading us to reject the null hypothesis. This means there is a statistically significant difference in the 'all drugs' usage between the '16-34 years' and '35-54 years' age groups.

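In simpler terms, this tells us that the amount of 'all drugs' usage is significantly different between these two age groups. Because `ttest_ind(age_group1, age_group2)` bases its statistic on the mean of the first group minus the mean of the second, the negative sign indicates that the '35-54 years' group has a higher mean 'all drugs' usage than the '16-34 years' group. Always double-check the order in which the groups were passed to `ttest_ind` when interpreting the sign.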
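
One way to confirm the direction is to compare the group means directly alongside the test. The sketch below uses a tiny invented DataFrame (`demo_df`, a hypothetical stand-in for `ip_df`) so that it runs on its own. It also shows Welch's variant (`equal_var=False`), which may be a safer choice if the two groups cannot be assumed to have equal variances.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Tiny stand-in for ip_df; the numbers are invented for illustration.
demo_df = pd.DataFrame({
    "age": ["16-34 years"] * 4 + ["35-54 years"] * 4,
    "all drugs": [28.1, 27.6, 29.3, 28.8, 39.8, 40.5, 38.7, 39.9],
})

younger = demo_df.loc[demo_df["age"] == "16-34 years", "all drugs"]
older = demo_df.loc[demo_df["age"] == "35-54 years", "all drugs"]

# ttest_ind(a, b) bases its statistic on mean(a) - mean(b),
# so the sign of t should match the sign of this difference.
mean_diff = younger.mean() - older.mean()
t_stat, p_val = ttest_ind(younger, older)

# Welch's variant drops the equal-variance assumption.
welch_t, welch_p = ttest_ind(younger, older, equal_var=False)

print(f"mean difference (younger - older): {mean_diff:.3f}")
print(f"Student t: {t_stat:.3f}, Welch t: {welch_t:.3f}")
```

If the mean difference and the t-statistic share the same sign, the interpretation of which group is higher is consistent with the argument order.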

In [4]:
from scipy.stats import ttest_ind

# Define columns to test
columns_to_test = ['all drugs', 'all opioids', 'stimulants', 'cannabis', 'benzodiazepine']

# Filter data for the two age groups
age_group1 = ip_df[ip_df['age'] == '16-34 years']
age_group2 = ip_df[ip_df['age'] == '35-54 years']

# Loop through the columns and perform t-test
for column in columns_to_test:
    t_stat, p_val = ttest_ind(age_group1[column], age_group2[column])
    print(f'{column}:')
    print(f'\t t-statistic: {t_stat}')
    print(f'\t p-value: {p_val}\n')

all drugs:
	 t-statistic: -41.87025715912268
	 p-value: 3.5847980364740063e-56

all opioids:
	 t-statistic: -19.602329905140994
	 p-value: 2.7700257062194467e-32

stimulants:
	 t-statistic: -20.148558228586293
	 p-value: 4.430776262571822e-33

cannabis:
	 t-statistic: 28.325721277109956
	 p-value: 1.8500483270204584e-43

benzodiazepine:
	 t-statistic: -5.745343337292002
	 p-value: 1.5971265540250007e-07



1. `all drugs`: The negative t-statistic indicates that the '35-54 years' age group has a higher mean 'all drugs' usage than the '16-34 years' age group. The p-value is extremely small, indicating strong evidence against the null hypothesis of no difference between the two age groups.

2. `all opioids`: The negative t-statistic indicates that the '35-54 years' age group has a higher mean 'all opioids' usage than the '16-34 years' age group. The p-value is extremely small, indicating strong evidence against the null hypothesis of no difference between the two age groups.

3. `stimulants`: The negative t-statistic indicates that the '35-54 years' age group has a higher mean 'stimulants' usage than the '16-34 years' age group. The p-value is extremely small, indicating strong evidence against the null hypothesis of no difference between the two age groups.

4. `cannabis`: The positive t-statistic indicates that the '16-34 years' age group has a higher mean 'cannabis' usage than the '35-54 years' age group. The p-value is extremely small, indicating strong evidence against the null hypothesis of no difference between the two age groups.

5. `benzodiazepine`: The negative t-statistic indicates that the '35-54 years' age group has a higher mean 'benzodiazepine' usage than the '16-34 years' age group. The t-statistic is the smallest in magnitude of the five, but the p-value (about 1.6e-07) is still far below 0.05, indicating strong evidence against the null hypothesis of no difference between the two age groups.

In summary, for all five drug categories, there are statistically significant differences in usage between the '16-34 years' and '35-54 years' age groups in the 'ip' setting. The direction of the difference depends on the drug category: the '35-54 years' age group has higher usage of 'all drugs', 'all opioids', 'stimulants', and 'benzodiazepine', while the '16-34 years' age group has higher usage of 'cannabis'.
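
Because five tests were run on the same data, it can be worth checking that the conclusions survive a multiple-comparison adjustment. The sketch below applies a simple Bonferroni correction (an assumption on our part; the analysis above does not apply one) to the p-values copied from the printed output.

```python
# p-values copied from the printed t-test output above.
p_values = {
    "all drugs": 3.5847980364740063e-56,
    "all opioids": 2.7700257062194467e-32,
    "stimulants": 4.430776262571822e-33,
    "cannabis": 1.8500483270204584e-43,
    "benzodiazepine": 1.5971265540250007e-07,
}

alpha = 0.05
# Bonferroni: divide alpha by the number of tests performed.
adjusted_alpha = alpha / len(p_values)

# True where the p-value is still below the stricter threshold.
still_significant = {name: p < adjusted_alpha for name, p in p_values.items()}

for name, significant in still_significant.items():
    print(f"{name}: p={p_values[name]:.3g}, below {adjusted_alpha:.3g}: {significant}")
```

With five tests the adjusted threshold is 0.01, and every p-value above remains well below it, so the summary holds under this stricter criterion.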