# Analysis of Gun-Related Suicides in the United States

## Introduction

Gun-related suicides represent a significant portion of firearm-related deaths in the United States, posing a critical public health challenge. Understanding the factors contributing to these tragic events is essential for developing effective prevention strategies. This notebook aims to analyze gun-related suicides using a comprehensive dataset to uncover patterns, trends, and potential risk factors associated with these incidents. By conducting a thorough analysis, we hope to provide insights that can inform policy decisions, public health interventions, and further research.

### Hypotheses

1. **Hypothesis 1**: Suicide rates are higher for individuals with lower education levels.
    
2. **Hypothesis 2**: Gun-related suicides peak in winter months.

3. **Hypothesis 3**: Gun-related suicides tend to increase as age increases.
   

### Dataset Description

The dataset used in this analysis contains detailed information on gun-related deaths in the United States. The variables include:
- **Year**: The year in which the death occurred.
- **Month**: The month in which the death occurred.
- **Intent**: The intent of the death (e.g., Suicide, Homicide, Accidental).
- **Police**: Whether a police officer was involved in the death.
- **Sex**: The gender of the deceased.
- **Age**: The age of the deceased.
- **Race**: The race of the deceased.
- **Place**: The location where the death occurred.
- **Education**: The education level of the deceased.

### Ethical Considerations

Given the sensitive nature of the data, it is crucial to handle it with care and ensure that all analyses are conducted ethically. This includes:
- **Privacy and Confidentiality**: Ensuring that the data is anonymized and does not contain any personally identifiable information (PII).
- **Data Accuracy and Integrity**: Ensuring the accuracy and integrity of the data to avoid misleading conclusions.
- **Bias and Fairness**: Recognizing and addressing any potential biases in the dataset to ensure a fair analysis.
- **Ethical Reporting**: Presenting the findings transparently and avoiding any manipulation of data to support a particular hypothesis.

### Structure of the Notebook

The notebook is structured into several sections, each focusing on a specific aspect of the analysis:
1. **Data Loading and Exploration**: Importing the dataset and performing initial exploration.
2. **Data Cleaning and Refinement**: Cleaning the data to ensure accuracy and reliability.
3. **Preliminary Analysis and Visualizations**: Conducting surface-level analysis and creating visualizations.
4. **Hypothesis Testing**: Formulating and testing hypotheses related to gun-related suicides.
5. **Conclusions and Policy Recommendations**: Summarizing the findings and providing policy recommendations based on the analysis.

By following this structured approach, we aim to provide a comprehensive analysis of gun-related suicides in the United States, uncovering key insights and contributing to the ongoing efforts to address this critical public health issue.

### Streamlit Dashboard

To make the analysis interactive and accessible, we have also created a Streamlit dashboard. The dashboard allows users to explore the data, visualize trends, and interact with the findings in real-time.


## Objectives

1. **Load the Dataset**: Import the necessary CSV files into dataframes.
2. **Explore the Data**: Perform initial exploration to understand the structure and content of the data.
3. **Data Cleaning and Refinement**: Clean the data by handling missing values, correcting data types, and refining the dataset for analysis.
4. **Surface Level Analysis**: Conduct preliminary analysis to identify trends, patterns, and key statistics.
5. **Basic Visualizations**: Create visual representations of the data to aid in understanding and communicating findings.

## Inputs

* `gun_deaths.csv`: The dataset containing information on gun-related deaths, including variables such as year, month, intent, police involvement, sex, age, race, place, and education.
* `age_bins`: List of age ranges used for categorizing age groups.
* `age_labels`: List of labels corresponding to the age bins.
* `days_in_month`: Dictionary containing the number of days in each month for accurate comparison in seasonal analysis.
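
As a rough illustration of how these helper inputs might be defined and used, here is a minimal sketch. The bin edges and labels are hypothetical, since the notebook's actual values are not listed in this section; only `days_in_month` mirrors the dictionary used later in the seasonal analysis.

```python
import pandas as pd

# Hypothetical bin edges and labels (assumption: the real values may differ).
age_bins = [0, 18, 30, 45, 60, 75, 120]
age_labels = ["0-17", "18-29", "30-44", "45-59", "60-74", "75+"]

# Day counts per month for a non-leap year, keyed by month number.
days_in_month = {1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30,
                 7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}

ages = pd.Series([17, 18, 29, 44, 60, 91])

# right=False makes each bin include its left edge and exclude its right edge,
# so an age of 18 falls in "18-29" rather than "0-17".
age_groups = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(age_groups.tolist())
```

The number of labels must be one less than the number of bin edges, otherwise `pd.cut` raises an error.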

## Outputs

* `cleaned_gun_deaths.csv`: The cleaned dataset after handling missing values, correcting data types, and refining the dataset.
* `age_group_table`: Contingency table for age groups vs. suicide counts.
* `education_intent_table`: Contingency table for education levels vs. intent.
* `monthly_suicide_df`: DataFrame containing monthly suicide counts and daily averages.
* `season_data`: DataFrame containing suicide counts and daily averages by season.
* `age_suicide_df`: DataFrame containing suicide counts, rates, and population distribution by age group.

## Additional Comments

* Ensure that all data cleaning steps are thoroughly documented to maintain transparency.
* When interpreting statistical test results, consider the practical significance in addition to the statistical significance.
* Visualizations should be clear and include appropriate labels, titles, and legends to enhance understanding.
* Ethical considerations should be kept in mind throughout the analysis, especially when dealing with sensitive data related to gun-related deaths.



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")



Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

In [4]:
import os

# Check if the file exists in the current directory
file_path = os.path.join(current_dir, 'gun_deaths.csv')
if os.path.exists(file_path):
    print("File exists")
else:
    print("File does not exist")
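
One caveat with moving to the parent directory is that re-running the cell moves up one more level each time. A possible alternative, sketched here under the assumption that `gun_deaths.csv` sits in the project root, is to walk upward until that file is found. The helper name `find_project_root` and the folder name `jupyter_notebooks` are hypothetical.

```python
import os
import tempfile

def find_project_root(start, marker="gun_deaths.csv"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    path = os.path.abspath(start)
    while True:
        if os.path.exists(os.path.join(path, marker)):
            return path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root without finding the marker
            return None
        path = parent

# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    root = os.path.realpath(root)
    open(os.path.join(root, "gun_deaths.csv"), "w").close()
    notebooks = os.path.join(root, "jupyter_notebooks")  # hypothetical subfolder
    os.makedirs(notebooks)

    found = find_project_root(notebooks)
    found_again = find_project_root(found)  # starting from the root changes nothing
    same_root = (found == root)
    idempotent = (found_again == root)

print(same_root, idempotent)
```

In the notebook this could be used as `os.chdir(find_project_root(os.getcwd()))`, which is safe to run more than once.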



# Section 1

## Data Exploration, Cleaning, and Refinement

### Data Exploration
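Data exploration is the first step: inspecting the dataset to understand its structure and how it is presented. This step includes:
1. **Loading the Dataset**: Reading `gun_deaths.csv` into a pandas DataFrame.
2. **Initial Exploration**: Viewing the first rows and the column types.
3. **Summary Statistics**: Computing descriptive statistics for the numeric columns.
4. **Identifying Missing Values**: Counting missing values per column.
5. **Identifying Duplicate Rows**: Counting fully duplicated rows.

These steps matter because undetected problems such as missing values or duplicate rows can distort the analysis and therefore affect our findings.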

### Data Cleaning
Data cleaning involves handling issues identified during exploration to ensure the dataset is accurate and reliable. This step includes:
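1. **Handling Missing Values**: Dropping rows with missing values in the `intent`, `age`, `place`, and `education` columns.
2. **Removing Duplicate Rows**: Dropping fully duplicated rows.
3. **Correcting Data Types**: Converting categorical columns to the `category` dtype.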

### Data Refinement
Data refinement involves further processing the cleaned dataset to prepare it for analysis. This step includes:
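1. **Feature Engineering**: Deriving new columns such as age groups and seasons.
2. **Normalization and Scaling**: Converting monthly counts to daily averages so that months of different lengths are comparable.
3. **Encoding Categorical Variables**: Storing categorical columns with the `category` dtype.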
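
For the encoding step, one option is an ordered categorical, which keeps education levels in a meaningful order when sorting or plotting. This is a minimal sketch on toy rows: the labels are the ones reported in this notebook's results, while the ordering from lowest to highest is an assumption.

```python
import pandas as pd

# Assumed ordering of the education labels, lowest to highest.
education_order = ["Less than HS", "HS/GED", "Some college", "BA+"]

df = pd.DataFrame({"education": ["BA+", "HS/GED", "Less than HS", "Some college", "HS/GED"]})
df["education"] = pd.Categorical(df["education"], categories=education_order, ordered=True)

# Ordered categories sort by level rather than alphabetically.
counts = df["education"].value_counts().sort_index()
print(counts)

# Integer codes (0 = lowest level) can serve as a simple ordinal encoding.
df["education_code"] = df["education"].cat.codes
```

Without the explicit ordering, `sort_index()` would sort the labels alphabetically and place `BA+` first.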

## Ethical Considerations

### Privacy and Confidentiality
The dataset contains sensitive information about individuals who have died due to gun-related incidents. It is crucial to ensure that the data is anonymized and does not contain any personally identifiable information (PII). In this dataset, all personal identifiers have been removed, and results are reported only in aggregate.

### Data Accuracy and Integrity
Ensuring the accuracy and integrity of the data is essential to avoid misleading conclusions. During the data cleaning process, we handled missing values, corrected data types, and removed duplicate rows to maintain the dataset's reliability.

### Handling Outliers
Outliers can significantly impact the results of the analysis. In this dataset, outliers in the age column were identified using Z-scores. However, since these outliers represent real ages of victims, they were retained to avoid skewing the results and to maintain the integrity of the data.

### Bias and Fairness
It is important to recognize and address any potential biases in the dataset. For instance, the dataset may have inherent biases based on the demographic distribution of the data. To mitigate this, we conducted a thorough exploration and cleaning process to ensure that the analysis is as unbiased as possible.

### Ethical Reporting
When reporting the findings, it is essential to present the results transparently and avoid any manipulation of data to support a particular hypothesis. The analysis and visualizations are conducted objectively, and the results are reported accurately.

### Overcoming Ethical Issues
1. **Anonymization**: Ensured that the dataset does not contain any PII.
2. **Data Cleaning**: Handled missing values, corrected data types, and removed duplicates to maintain data integrity.
3. **Outlier Handling**: Retained outliers that represent real data to avoid skewing results.
4. **Bias Mitigation**: Conducted thorough data exploration and cleaning to minimize biases.
5. **Transparent Reporting**: Presented findings objectively and accurately without manipulating data.

By addressing these ethical considerations, we aim to conduct a responsible and unbiased analysis of the dataset.


In [None]:

# Importing the required libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import streamlit as st
import ipywidgets as widgets
from IPython.display import display


The imported libraries form the foundation of the analysis. Each one serves a specific purpose:

- **pandas**: Provides data structures and data analysis tools for handling and manipulating structured data.
- **numpy**: Supports large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
- **matplotlib**: A plotting library used for creating static, interactive, and animated visualizations in Python.
- **seaborn**: Built on top of matplotlib, it provides a high-level interface for drawing attractive and informative statistical graphics.
- **scipy.stats**: Contains a large number of probability distributions and statistical functions for hypothesis testing and other statistical analyses.
- **statsmodels**: Provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and data exploration.
- **streamlit**: A framework for creating interactive web applications for data science and machine learning projects with minimal effort.
- **ipywidgets**: Provides interactive HTML widgets for Jupyter notebooks, enabling interactive data visualization and manipulation.

In [6]:
# Load the dataset
data = pd.read_csv('gun_deaths.csv')

# Display the first 5 rows of the dataset
data.head()




In [7]:
# Get a summary of the dataset
data.info()



In [8]:
# Get descriptive statistics (by default, describe() summarizes only the numerical columns)
data.describe()




In [9]:
# Check for missing values as it may affect the analysis
missing_values = data.isnull().sum()
missing_values



There are a considerable number of missing values in the age, place and education columns that would affect the analysis. Therefore, we will drop the rows with missing values.
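
Before dropping rows, it can help to see what share of each column is missing and how many rows the drop would cost. A minimal sketch on toy data (the values below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [34, np.nan, 51, 27],
    "place": ["Home", "Street", None, "Home"],
    "education": ["HS/GED", "BA+", "Some college", None],
})

# Percentage of missing values per column.
missing_share = toy.isnull().mean() * 100

# Rows lost when any of the three columns is missing.
cleaned = toy.dropna(subset=["age", "place", "education"])
rows_lost = len(toy) - len(cleaned)

print(missing_share)
print(f"Rows lost: {rows_lost} of {len(toy)}")
```

Here each column is only 25% missing, yet three of four rows are lost because the gaps fall in different rows, so the per-column counts alone can understate the effect of dropping.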

In [10]:
# Check for duplicate rows in the dataset
duplicate_rows = data.duplicated().sum()
duplicate_rows



The code above shows that there are 39,227 duplicated rows in the dataset, which presents a serious issue. We must therefore remove the duplicates during data cleaning to provide accurate results in our hypothesis testing.
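
To make clear what this count measures, here is a small sketch on toy rows. `duplicated()` flags every repeat after the first occurrence, so the total is the number of rows that `drop_duplicates()` would remove.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2012, 2012, 2012, 2013],
    "intent": ["Suicide", "Suicide", "Suicide", "Suicide"],
    "age": [45, 45, 45, 45],
})

n_flagged = toy.duplicated().sum()                # repeats after the first occurrence
n_all_copies = toy.duplicated(keep=False).sum()   # every row that has an identical twin
deduped = toy.drop_duplicates()

print(n_flagged, n_all_copies, len(deduped))
```

One point to keep in mind: the columns listed in the dataset description do not include a unique record identifier, so two identical rows could in principle describe two different people with the same attributes.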

In [11]:
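# Calculate absolute Z-scores for the numerical columns.
# nan_policy='omit' ignores missing values (e.g. in 'age') when computing the mean and
# standard deviation; with the default, a single NaN makes a whole column's Z-scores NaN.
z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number]), nan_policy='omit'))

# Define a threshold for identifying outliers
threshold = 2.5

# Keep rows where any numerical column exceeds the threshold
outliers = data[(z_scores > threshold).any(axis=1)]

# Display the outliers
print(outliers)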





The Z-score check flags outliers in the age column. There are three main ways to address them:

1. **Drop the outliers**
2. **Keep the outliers**
3. **Replace the values with a statistical method using the mean, mode, or median**

However, the outliers will be kept because they are real data: the flagged ages are the actual ages of victims of gun-related deaths. Removing them would also skew the results, and deliberately discarding real victims' ages because they fall far from the mean could be perceived as a form of data manipulation.
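
The Z-score rule can be sketched on a handful of toy ages (not taken from the dataset). The example also shows why missing values need care: a plain NumPy mean over an array containing `NaN` is itself `NaN`, whereas pandas skips `NaN` by default.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20, 24, 28, 31, 35, 38, 41, 44, 47, 50, 53, 56, 60, 64, 68, np.nan, 107])

# A NumPy mean over data containing NaN is NaN, which would hide every outlier.
naive_mean = ages.to_numpy().mean()

# pandas mean/std skip NaN; ddof=0 matches the default of scipy.stats.zscore.
z = (ages - ages.mean()) / ages.std(ddof=0)

threshold = 2.5
outliers = ages[z.abs() > threshold]

print(np.isnan(naive_mean))
print(outliers.tolist())
```

Only the age of 107 exceeds the threshold in this toy sample; it is flagged for inspection, not removed.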

## Begin Data Cleaning and Refinement


In [12]:
# Drop duplicate rows in the dataset
data = data.drop_duplicates()

# Check for duplicate rows in the dataset
duplicate_rows = data.duplicated().sum()
duplicate_rows




Following the findings from the data exploration, the duplicated rows have now been dropped.

In [13]:
# Drop rows with missing values in the 'intent', 'age', 'place', or 'education' columns
data = data.dropna(subset=['intent', 'age', 'place', 'education'])



The rows with missing values have been dropped.

In [14]:
# Correcting data types
data['year'] = data['year'].astype('category')
data['month'] = data['month'].astype('category')
data['police'] = data['police'].astype('category')
data['sex'] = data['sex'].astype('category')
data['race'] = data['race'].astype('category')
data['place'] = data['place'].astype('category')
data['education'] = data['education'].astype('category')


In [15]:
# Double-check if the data types have been corrected
data.info()



In [16]:
# Check if data is cleaned
missing_values = data.isnull().sum()
print(missing_values)



In [17]:
# Drop homicide rows from the dataset as our hypothesis is not about homicide
data = data[data['intent'] != 'Homicide']

# Check first 5 rows of the dataset
data.head()



Data cleaning is now complete, and we can proceed to test and analyse the data.

In [18]:
# Save the cleaned dataset and proceed to the next step
# index=False avoids writing the row index as an extra unnamed column
data.to_csv('cleaned_gun_deaths.csv', index=False)

---

# Section 2: Hypothesis Testing

Now that the data has been explored, cleaned, and refined, we can begin hypothesis testing.

In [19]:
# Load the cleaned dataset
data = pd.read_csv('cleaned_gun_deaths.csv')
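
A detail worth knowing when saving and reloading a DataFrame as CSV: by default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column on reload. A minimal sketch using in-memory buffers in place of files:

```python
import io
import pandas as pd

toy = pd.DataFrame({"intent": ["Suicide", "Accidental"], "age": [45, 30]}, index=[3, 7])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Column dtypes such as `category` are also not stored in a CSV file, so any type corrections would need to be reapplied after reloading.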

Hypothesis 1: Suicide rates are higher for individuals with lower education levels.

This hypothesis examines the relationship between educational attainment and suicide rates. Prior research suggests that socioeconomic factors, including education level, can significantly impact mental health outcomes and suicide risk. Studies have shown that individuals with lower educational attainment often face greater economic hardships, limited access to mental health resources, and potentially fewer coping mechanisms for dealing with life stressors.

By analyzing the distribution of gun-related suicides across different education categories, we can determine whether there is a significant association between educational attainment and suicide risk. If the hypothesis is supported, this would highlight the importance of targeting suicide prevention efforts toward communities with lower educational attainment and developing interventions that address the specific challenges and risk factors faced by these populations.
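
The mechanics of a chi-square test of association can be sketched by hand on a toy table. The counts below are made up for illustration. Cramér's V is included as one possible effect-size measure for judging practical significance; it is not part of the notebook's own analysis.

```python
import numpy as np

# Toy counts: rows are three education levels, columns are two intents. Not real data.
observed = np.array([[30, 10],
                     [20, 20],
                     [10, 30]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts under independence: row total * column total / grand total.
expected = row_totals @ col_totals / n

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cramér's V rescales chi-square to a 0-1 range so that large samples
# do not automatically look like large effects.
cramers_v = np.sqrt(chi2_stat / (n * min(observed.shape[0] - 1, observed.shape[1] - 1)))

print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, Cramér's V = {cramers_v:.3f}")
```

With very large samples even small differences produce tiny p-values, which is why an effect size is a useful companion to the test statistic.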

In [20]:
# Filter data for suicide intent only
suicide_data = data[data['intent'] == 'Suicide']

# Count the number of suicides for each education level
suicide_counts = suicide_data['education'].value_counts().sort_index()

# Display suicide counts by education level
print("Suicide counts by education level:")
print(suicide_counts)

# Create a contingency table for education vs intent
# The dataset has no population denominators, so suicides are compared with the other intents within each education level
# First, create the observed frequencies table
education_intent_table = pd.crosstab(data['education'], data['intent'])
print("\nContingency table:")
print(education_intent_table)

# Perform chi-square test 
chi2, p, dof, expected = stats.chi2_contingency(education_intent_table)

print("\nChi-square test results:")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p:.8e}")
print(f"Degrees of freedom: {dof}")

# Interpret the results
alpha = 0.05
if p < alpha:
    print("\nReject null hypothesis: There is a significant difference in suicide rates across education levels.")
    
    # Calculate the percentage of suicides within each education level
    suicide_percentages = education_intent_table['Suicide'] / education_intent_table.sum(axis=1) * 100
    print("\nPercentage of suicides within each education level:")
    print(suicide_percentages.sort_values(ascending=False))
    
    # Calculate suicide rate relative to total suicides
    total_suicides = education_intent_table['Suicide'].sum()
    print("\nDistribution of suicides by education level:")
    for edu, count in education_intent_table['Suicide'].sort_values(ascending=False).items():
        print(f"{edu}: {count} ({count/total_suicides*100:.1f}%)")
    
else:
    print("\nFail to reject null hypothesis: There is no significant difference in suicide rates across education levels.")



### Results of the Statistical Test

The chi-square test results indicate a significant association between education level and intent (χ² = 279.1158, p-value = 2.4293e-57). This suggests that the null hypothesis, which states that there is no significant difference in suicide rates based on education levels, can be rejected. The test shows that the groups differ but not in which direction, so the percentages printed by the test cell are needed to judge whether suicides are concentrated among those with lower education.




In [22]:
# Filter data for suicide intent
suicide_data = data[data['intent'] == 'Suicide']['education'].value_counts().sort_index()

# Calculate percentages for better interpretation
total_suicides = suicide_data.sum()
suicide_percentages = (suicide_data / total_suicides) * 100

# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))

# Bar chart on the left
sns.barplot(x=suicide_data.index, y=suicide_percentages.values, hue=suicide_data.index, 
            palette="viridis", legend=False, ax=ax1)
ax1.set_title('Share of Suicides by Education Level', fontsize=18, fontweight='bold')
ax1.set_xlabel('Education Level', fontsize=14)
ax1.set_ylabel('Percentage of Suicides (%)', fontsize=14)
ax1.grid(axis='y', linestyle='--', alpha=0.7)
ax1.tick_params(axis='both', labelsize=12)

# Add percentage labels on top of each bar
for i, v in enumerate(suicide_percentages.values):
    ax1.text(i, v + 0.5, f'{v:.1f}%', ha='center', fontsize=12, fontweight='bold')

# Pie chart on the right for a different perspective
ax2.pie(suicide_data, labels=suicide_data.index, autopct='%1.1f%%', 
        startangle=90, shadow=True, colors=sns.color_palette("viridis", 4))
ax2.set_title('Distribution of Suicides by Education Level', fontsize=18, fontweight='bold')
ax2.axis('equal')

# Add annotation with statistical findings
# The p-value is shown in scientific notation because it is far too small for fixed-point display
plt.figtext(0.5, 0.01, 
            f"Chi-square test: χ² = {chi2:.2f}, p-value = {p:.2e} (significant at α = {alpha})\n"
            f"Conclusion: Significant differences in suicide rates across education levels.\n"
            f"Observation: Suicide rates are higher for those with lower education.",
            ha="center", fontsize=12, bbox={"facecolor":"lightgrey", "alpha":0.5, "pad":5})

plt.tight_layout(rect=[0, 0.05, 1, 0.97])
plt.show()




In [23]:
# Filter data for suicide intent
suicide_data = data[data['intent'] == 'Suicide']['education'].value_counts().sort_index()

# Calculate percentages for better interpretation
total_suicides = suicide_data.sum()
suicide_percentages = (suicide_data / total_suicides) * 100

# Create a pie chart to show distribution of suicides by education level
plt.figure(figsize=(12, 8))
plt.pie(suicide_data, labels=suicide_data.index, autopct='%1.1f%%', startangle=90, shadow=True)
plt.title('Distribution of Suicides by Education Level', fontsize=16)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle

# Add a legend listing the education levels
plt.legend(title="Education Level", bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()



The chi-square test results indicate a significant association between education level and intent (χ² = 279.1158, p-value = 2.4293e-57), so the null hypothesis of no difference across education levels can be rejected. Among suicides, the HS/GED education level accounts for the highest percentage (35.1%), followed by Some college (26.2%), BA+ (20.1%), and Less than HS (18.7%). Levels below a bachelor's degree together account for about 80% of gun-related suicides, which is consistent with the hypothesis and with previous research suggesting that socioeconomic factors play a significant role in suicide risk. These percentages are shares of all suicides in the dataset, not per-capita rates, so they also reflect how common each education level is in the population.

The results point to a relationship between educational attainment and gun-related suicide, which has important implications for public health interventions. Individuals with a high school diploma or equivalent (HS/GED) account for the largest share, suggesting that targeted mental health resources and suicide prevention programs should reach communities with lower average educational attainment. These findings are in line with broader research on social determinants of health, where education serves as a protective factor against various negative health outcomes, including mental health challenges and suicidal behavior.
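
When reading percentages like these, it helps to separate a group's share of all suicides from a rate within the group. The sketch below uses entirely hypothetical counts and population sizes, since the dataset records deaths only and any population denominator would have to come from an external source.

```python
import pandas as pd

# Hypothetical numbers for illustration only.
suicides = pd.Series({"Less than HS": 200, "HS/GED": 400, "BA+": 200})
population = pd.Series({"Less than HS": 1_000_000, "HS/GED": 4_000_000, "BA+": 2_000_000})

# Share of all suicides: what a pie chart of counts shows.
share_of_suicides = suicides / suicides.sum() * 100

# Rate per 100,000 people: requires a population denominator.
rate_per_100k = suicides / population * 100_000

largest_share = share_of_suicides.idxmax()
highest_rate = rate_per_100k.idxmax()

print(f"Largest share: {largest_share}")
print(f"Highest rate:  {highest_rate}")
```

In this toy example the group with the largest share is not the group with the highest rate, because the groups differ in size.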

Hypothesis 2: Gun-related suicides peak in winter months.

For our second hypothesis, we are testing whether there is a seasonal pattern to gun-related suicides, specifically whether they tend to peak during winter months (December, January, and February). This hypothesis is based on research suggesting that seasonal affective disorder (SAD) and winter depression might contribute to increased suicide rates during colder, darker months.

The analysis will involve:
1. Comparing suicide counts between winter and non-winter months
2. Adjusting for the different number of days in each month for accurate comparison
3. Statistical testing to determine if any observed differences are significant
4. Visualization of monthly and seasonal trends

Through this analysis, we aim to determine whether targeted suicide prevention efforts should focus on specific seasons or be maintained consistently throughout the year.
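
The day adjustment in step 2 can be written out explicitly. The symbols used here ($N$ for the total number of suicides, $O$ and $E$ for observed and expected counts, $d$ for day counts) are notation introduced for illustration, and a 365-day year is assumed. With winter defined as December, January, and February:

$$
d_w = 31 + 31 + 28 = 90, \qquad d_{nw} = 365 - 90 = 275
$$

$$
E_w = N \cdot \frac{90}{365}, \qquad E_{nw} = N \cdot \frac{275}{365}
$$

$$
\chi^2 = \frac{(O_w - E_w)^2}{E_w} + \frac{(O_{nw} - E_{nw})^2}{E_{nw}}, \qquad \text{df} = 1
$$

Under the null hypothesis that suicides are spread evenly over the days of the year, winter would account for 90/365, or about 24.7%, of the total.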


In [29]:
# Test Hypothesis 2 with chi-squared test
# Define winter months (December, January, February)
winter_months = [12, 1, 2]

# Filter data for suicide intent only
suicide_data = data[data['intent'] == 'Suicide']

# Create contingency table: season (winter/non-winter) vs. suicide counts
observed = pd.DataFrame({
    'Winter': [suicide_data[suicide_data['month'].isin(winter_months)].shape[0]],
    'Non_Winter': [suicide_data[~suicide_data['month'].isin(winter_months)].shape[0]]
})

# Calculate the number of days in each month for accurate comparison
days_in_month = {
    1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30, 
    7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31
}

# Calculate total days in winter and non-winter
winter_days = sum(days_in_month[m] for m in winter_months)
non_winter_days = sum(days_in_month[m] for m in range(1, 13) if m not in winter_months)

# Calculate expected values based on proportion of days in each season
total_suicides = observed['Winter'][0] + observed['Non_Winter'][0]
expected_winter = total_suicides * (winter_days / 365)
expected_non_winter = total_suicides * (non_winter_days / 365)

expected = pd.DataFrame({
    'Winter': [expected_winter],
    'Non_Winter': [expected_non_winter]
})

# Calculate chi-square statistic manually for clarity
chi2_stat = ((observed['Winter'][0] - expected['Winter'][0])**2 / expected['Winter'][0]) + \
            ((observed['Non_Winter'][0] - expected['Non_Winter'][0])**2 / expected['Non_Winter'][0])

# Calculate p-value (the survival function avoids precision loss for very small p-values)
p_value = stats.chi2.sf(chi2_stat, df=1)

# Print results
print(f"Observed suicides in winter months: {observed['Winter'][0]}")
print(f"Observed suicides in non-winter months: {observed['Non_Winter'][0]}")
print(f"Expected suicides in winter months: {expected['Winter'][0]:.2f}")
print(f"Expected suicides in non-winter months: {expected['Non_Winter'][0]:.2f}")
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.8f}")

# Interpret results
alpha = 0.05
if p_value < alpha:
    print("\nReject null hypothesis: There is a significant difference in suicide rates between winter and non-winter months.")
    
    # Determine which season has higher rates
    winter_rate = observed['Winter'][0] / winter_days
    non_winter_rate = observed['Non_Winter'][0] / non_winter_days
    
    if winter_rate > non_winter_rate:
        print(f"Winter months have higher suicide rates ({winter_rate:.4f} per day vs {non_winter_rate:.4f} per day)")
    else:
        print(f"Non-winter months have higher suicide rates ({non_winter_rate:.4f} per day vs {winter_rate:.4f} per day)")
else:
    print("\nFail to reject null hypothesis: There is no significant difference in suicide rates between winter and non-winter months.")
    
    # Still show rates for informational purposes
    winter_rate = observed['Winter'][0] / winter_days
    non_winter_rate = observed['Non_Winter'][0] / non_winter_days
    print(f"Winter suicide rate: {winter_rate:.4f} per day")
    print(f"Non-winter suicide rate: {non_winter_rate:.4f} per day")



In [30]:
# Generate visualisations

# Define winter months (December, January, February in Northern Hemisphere)
winter_months = [12, 1, 2]
non_winter_months = [3, 4, 5, 6, 7, 8, 9, 10, 11]

# Filter data for suicide intent only
suicide_data = data[data['intent'] == 'Suicide']

# Count suicides by month
monthly_suicides = suicide_data.groupby('month').size()

# Create a DataFrame for visualization
monthly_suicide_df = pd.DataFrame({
    'Month': range(1, 13),
    'Suicide_Count': [monthly_suicides.get(i, 0) for i in range(1, 13)]
})

# Add a season column
monthly_suicide_df['Season'] = monthly_suicide_df['Month'].apply(
    lambda x: 'Winter' if x in winter_months else 
              'Spring' if x in [3, 4, 5] else
              'Summer' if x in [6, 7, 8] else 'Fall'
)

# Calculate suicide rates per day (to account for different number of days in months/seasons)
monthly_suicide_df['Days_in_Month'] = monthly_suicide_df['Month'].map(days_in_month)
monthly_suicide_df['Daily_Average'] = monthly_suicide_df['Suicide_Count'] / monthly_suicide_df['Days_in_Month']

# Calculate normalized rates (percentage of yearly total)
yearly_total = monthly_suicide_df['Suicide_Count'].sum()
monthly_suicide_df['Percentage'] = 100 * monthly_suicide_df['Suicide_Count'] / yearly_total

# Create groups for statistical testing
winter_suicides = suicide_data[suicide_data['month'].isin(winter_months)]
non_winter_suicides = suicide_data[suicide_data['month'].isin(non_winter_months)]

# For statistical testing, use daily averages
winter_daily_avg = monthly_suicide_df[monthly_suicide_df['Month'].isin(winter_months)]['Daily_Average']
non_winter_daily_avg = monthly_suicide_df[monthly_suicide_df['Month'].isin(non_winter_months)]['Daily_Average']

# Perform t-test with Welch's correction for unequal variance
t_stat, p_value = stats.ttest_ind(winter_daily_avg, non_winter_daily_avg, equal_var=False)

# Print results
print(f"T-test statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.8f}")

# Set significance level
alpha = 0.05
if p_value < alpha:
    conclusion = "Reject the null hypothesis: There is a significant difference in suicide rates between winter and non-winter months."
    if winter_daily_avg.mean() > non_winter_daily_avg.mean():
        conclusion += " Winter months have higher suicide rates."
    else:
        conclusion += " Winter months have lower suicide rates."
else:
    conclusion = "Fail to reject the null hypothesis: There is no significant difference in suicide rates between winter and non-winter months."

print(conclusion)
print(f"Average daily suicides in winter months: {winter_daily_avg.mean():.2f}")
print(f"Average daily suicides in non-winter months: {non_winter_daily_avg.mean():.2f}")

# Visualize the monthly suicide counts
plt.figure(figsize=(14, 8))

# Create a monthly trend line plot
plt.subplot(2, 1, 1)
sns.lineplot(x='Month', y='Daily_Average', data=monthly_suicide_df, marker='o', linewidth=2)
plt.title('Average Daily Gun-Related Suicides by Month', fontsize=16, fontweight='bold')
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average Daily Suicide Count', fontsize=12)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# Color the background by season
month_positions = range(1, 13)
for season, color in [('Winter', '#D6EAF8'), ('Spring', '#D5F5E3'), ('Summer', '#FADBD8'), ('Fall', '#FAE5D3')]:
    season_months = [m for m in month_positions if monthly_suicide_df.loc[monthly_suicide_df['Month'] == m, 'Season'].iloc[0] == season]
    if season_months:
        plt.axvspan(min(season_months) - 0.5, max(season_months) + 0.5, alpha=0.3, color=color, label=season)

# Add legend for seasons
handles, labels = plt.gca().get_legend_handles_labels()
by_label = dict(zip(labels, handles))
plt.legend(by_label.values(), by_label.keys(), title="Seasons")

# Create a bar chart by season
plt.subplot(2, 1, 2)
season_data = monthly_suicide_df.groupby('Season')[['Suicide_Count', 'Days_in_Month']].sum()
season_data['Daily_Average'] = season_data['Suicide_Count'] / season_data['Days_in_Month']
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
season_data = season_data.reindex(season_order)

sns.barplot(x=season_data.index, y=season_data['Daily_Average'], 
            hue=season_data.index,  # Assigning x variable to hue
            palette=['#81D4FA', '#A5D6A7', '#F48FB1', '#FFCC80'],
            legend=False)  # Disable legend since we are using hue
plt.title('Average Daily Gun-Related Suicides by Season', fontsize=16, fontweight='bold')
plt.xlabel('Season', fontsize=12)
plt.ylabel('Average Daily Suicide Count', fontsize=12)
plt.grid(True, axis='y', linestyle='--', alpha=0.7)

# Add annotation with statistical test results
plt.figtext(0.5, 0.01, 
            f"T-test results: t = {t_stat:.2f}, p-value = {p_value:.8f}\n"
            f"{conclusion}",
            ha="center", fontsize=12, bbox={"facecolor":"lightgrey", "alpha":0.5, "pad":5})

plt.tight_layout(rect=[0, 0.05, 1, 0.95])
plt.show()





## Analysis of Hypothesis 2: Gun-related suicides peak in winter months

This analysis tested whether there is a seasonal pattern to gun-related suicides, specifically investigating if suicide rates peak during winter months (December, January, and February). The hypothesis was based on existing research suggesting that seasonal affective disorder (SAD) and winter depression might contribute to increased suicide rates during colder, darker months.

The methodology included:
- Comparing suicide counts between winter months (December, January, February) and non-winter months
- Adjusting for different number of days in each month to enable accurate comparison
- Calculating daily average suicide rates by season
- Statistical testing using both chi-square test and t-test to determine significance
- Visualization of monthly and seasonal patterns through bar charts and line plots

The analysis incorporated proper corrections for varying month lengths, ensuring that the comparison between seasons was fair. Monthly suicide counts were normalized to daily averages to account for this variation.
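
The month-length correction described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper name `daily_average_by_month` and the example counts are hypothetical, and the sketch assumes the counts belong to a single calendar year.

```python
import calendar


def daily_average_by_month(monthly_counts, year):
    """Return {month: deaths per day} for one calendar year.

    monthly_counts maps a month number (1-12) to a raw count.
    """
    return {
        # calendar.monthrange returns (weekday of first day, days in month)
        month: count / calendar.monthrange(year, month)[1]
        for month, count in monthly_counts.items()
    }


# Hypothetical counts for illustration only, not taken from the dataset.
example_counts = {1: 310, 2: 280, 3: 310}
example_daily = daily_average_by_month(example_counts, year=2013)
print(example_daily)
```

In this made-up example February has fewer deaths than January in raw terms, yet all three months have the same daily average, which is the distortion the normalization is meant to remove.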

Multiple statistical tests were employed to verify the findings, considering both statistical significance and practical implications. The visualizations illustrate monthly and seasonal patterns using daily averages rather than raw counts to prevent misinterpretation.
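
One way a chi-square test can respect differing season lengths is a goodness-of-fit test whose expected counts are proportional to the number of days in each season. The sketch below assumes that setup; it is not necessarily how the test was run in this notebook, and the function name `seasonal_goodness_of_fit` and all numbers are hypothetical.

```python
import numpy as np
from scipy import stats


def seasonal_goodness_of_fit(season_counts, season_days):
    """Chi-square goodness-of-fit test against a 'same rate every day' null."""
    observed = np.asarray(season_counts, dtype=float)
    days = np.asarray(season_days, dtype=float)
    # Under the null, each season's expected count is proportional to its length.
    expected = observed.sum() * days / days.sum()
    return stats.chisquare(f_obs=observed, f_exp=expected)


# Hypothetical values for illustration only, not taken from the dataset.
season_days = [90, 92, 92, 91]        # Winter (non-leap year), Spring, Summer, Fall
season_counts = [900, 920, 920, 910]  # exactly 10 per day in every season
chi2_stat, p_val = seasonal_goodness_of_fit(season_counts, season_days)
print(f"chi2 = {chi2_stat:.4f}, p = {p_val:.4f}")
```

Because the made-up counts follow the day-weighted expectation exactly, the statistic is zero and the test finds no seasonal effect, even though the raw counts differ between seasons.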

This analysis helps inform whether suicide prevention efforts should target specific seasons or maintain consistent interventions throughout the year.



## Hypothesis 3: Gun-related suicide tends to increase as age increases

In [25]:
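# Define age bins and labels; the last bin is open-ended so that ages of 100
# and above fall into '80+' instead of becoming NaN
age_bins = [0, 18, 30, 40, 50, 60, 70, 80, float('inf')]
age_labels = ['0-17', '18-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80+']

# Create age groups (right=False makes each bin include its lower edge, e.g. [18, 30))
data['age_group'] = pd.cut(data['age'], bins=age_bins, labels=age_labels, right=False)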

# Create a contingency table for age groups vs. suicide counts
age_group_table = pd.crosstab(data['age_group'], data['intent'])

# Display the contingency table
print("Contingency table for age groups vs. suicide counts:")
print(age_group_table)

# Perform chi-square test
chi2_age, p_age, dof_age, expected_age = stats.chi2_contingency(age_group_table)

print("\nChi-square test results for age groups:")
print(f"Chi-square statistic: {chi2_age:.4f}")
print(f"p-value: {p_age:.8e}")
print(f"Degrees of freedom: {dof_age}")

# Interpret the results
if p_age < alpha:
    print("\nReject null hypothesis: There is a significant difference in suicide rates across age groups.")
else:
    print("\nFail to reject null hypothesis: There is no significant difference in suicide rates across age groups.")



The chi-square test for Hypothesis 3, which examined whether gun-related suicides tend to increase with age, indicated a significant association between age group and intent. The chi-square statistic was 896.9539 with a p-value of 1.2459e-180, far below the significance level (alpha = 0.05), so we reject the null hypothesis that the distribution of intent is the same across age groups. The test establishes that the share of gun deaths that are suicides differs between age groups, which makes age a significant factor in gun-related suicides. It does not by itself show the direction of that difference; whether the share rises with age is examined through the per-group rates in the visualizations that follow.
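
Since a chi-square test of independence does not indicate direction, a rank correlation between the ordered age groups and the per-group suicide share is one possible complement. The sketch below uses Spearman's rank correlation as a swapped-in technique; it is not part of the original analysis, and the function name `age_trend` and the example shares are hypothetical.

```python
from scipy import stats


def age_trend(shares_by_group):
    """Spearman rank correlation between age-group order and a per-group value.

    shares_by_group must be ordered from the youngest to the oldest group.
    """
    order = list(range(len(shares_by_group)))
    rho, p_value = stats.spearmanr(order, shares_by_group)
    return rho, p_value


# Hypothetical suicide shares (%) per ordered age group, for illustration only.
example_shares = [30.0, 45.0, 55.0, 60.0, 70.0, 80.0, 85.0, 90.0]
rho, p_trend = age_trend(example_shares)
print(f"Spearman rho = {rho:.2f}, p = {p_trend:.4f}")
```

A rho close to +1 would point to a steadily increasing share with age, a value near 0 to no monotonic trend. With only eight groups the p-value should be read cautiously.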

In [26]:
import ipywidgets as widgets
from IPython.display import display

# Create a slider widget for age
age_slider = widgets.IntSlider(
    value=30,
    min=0,
    max=100,
    step=1,
    description='Age:',
    continuous_update=False
)

# Function to update the plot based on the slider value
def update_plot(age):
    filtered_data = data[data['age'] == age]
    suicide_count = filtered_data[filtered_data['intent'] == 'Suicide'].shape[0]
    
    plt.figure(figsize=(8, 6))
    plt.bar(['Suicide'], [suicide_count], color='blue')
    plt.title(f'Suicides at Age {age}', fontsize=16)
    plt.ylabel('Count', fontsize=14)
    plt.ylim(0, data[data['intent'] == 'Suicide']['age'].value_counts().max())
    plt.show()

# Create an interactive output
output = widgets.interactive_output(update_plot, {'age': age_slider})

# Display the slider and the output
display(age_slider, output)





In [31]:
# Create advanced visualizations

# Calculate suicide counts by age group and the suicide share of all gun deaths
# within each group (referred to as the "suicide rate" in the plots)
suicide_by_age = age_group_table['Suicide']
total_by_age = age_group_table.sum(axis=1)
suicide_rate_by_age = (suicide_by_age / total_by_age) * 100

# Create a DataFrame for plotting
# Note: 'Population Distribution (%)' is each age group's share of all gun
# deaths in the dataset, not its share of the general population.
age_suicide_df = pd.DataFrame({
    'Age Group': age_labels,
    'Suicide Count': suicide_by_age.values,
    'Suicide Rate (%)': suicide_rate_by_age.values,
    'Total Deaths': total_by_age.values,
    'Population Distribution (%)': (total_by_age / total_by_age.sum()) * 100
})

# Create a figure with multiple plots
fig, axes = plt.subplots(2, 2, figsize=(18, 14), gridspec_kw={'height_ratios': [1, 0.8]})
fig.suptitle('Analysis of Gun-Related Suicides by Age Group', fontsize=22, fontweight='bold', y=0.98)

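# Plot 1: Bar chart of suicide counts by age group
# (hue is set to the x variable so that palette is applied without a deprecation warning)
sns.barplot(x='Age Group', y='Suicide Count', data=age_suicide_df,
            hue='Age Group', palette='viridis', legend=False,
            ax=axes[0, 0], alpha=0.8)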
axes[0, 0].set_title('Number of Gun-Related Suicides by Age Group', fontsize=16, pad=10)
axes[0, 0].set_ylabel('Number of Suicides', fontsize=14)
axes[0, 0].set_xlabel('Age Group', fontsize=14)
axes[0, 0].grid(axis='y', linestyle='--', alpha=0.7)

# Add value labels on top of bars
for i, v in enumerate(age_suicide_df['Suicide Count']):
    axes[0, 0].text(i, v + 50, f'{int(v):,}', ha='center', fontsize=10, fontweight='bold')

# Plot 2: Line chart showing suicide rate (%) within each age group
line_color = '#FF5733'
bar_colors = sns.color_palette("viridis", len(age_suicide_df))

axes[0, 1].set_title('Suicide Rate (%) Within Each Age Group', fontsize=16, pad=10)
axes[0, 1].set_xlabel('Age Group', fontsize=14)
axes[0, 1].set_ylabel('Suicide Rate (%)', fontsize=14)
axes[0, 1].grid(True, linestyle='--', alpha=0.7)

# Create bars for context
bars = axes[0, 1].bar(age_suicide_df['Age Group'], age_suicide_df['Suicide Rate (%)'], 
                     alpha=0.3, color=bar_colors)

# Add line for emphasis
line = axes[0, 1].plot(age_suicide_df['Age Group'], age_suicide_df['Suicide Rate (%)'], 
                      marker='o', markersize=10, linewidth=3, color=line_color)

# Add percentage labels above each point
for i, v in enumerate(age_suicide_df['Suicide Rate (%)']):
    axes[0, 1].text(i, v + 1, f'{v:.1f}%', ha='center', fontsize=10, fontweight='bold', color='black')

# Plot 3: Share of all gun deaths vs. share of suicides by age group
ax3 = axes[1, 0]
width = 0.35
x = np.arange(len(age_labels))

# Calculate percentage of total suicides in each age group
suicide_distribution = age_suicide_df['Suicide Count'] / age_suicide_df['Suicide Count'].sum() * 100

# Create grouped bar chart; the first series is each group's share of all gun
# deaths in the dataset, not of the general population
pop_bars = ax3.bar(x - width/2, age_suicide_df['Population Distribution (%)'], width, label='% of All Gun Deaths', color='#3498DB')
suicide_bars = ax3.bar(x + width/2, suicide_distribution, width, label='% of Suicides', color='#E74C3C')

ax3.set_title('Share of All Gun Deaths vs. Share of Suicides by Age Group', fontsize=16, pad=10)
ax3.set_xlabel('Age Group', fontsize=14)
ax3.set_ylabel('Percentage (%)', fontsize=14)
ax3.set_xticks(x)
ax3.set_xticklabels(age_labels)
ax3.legend(loc='upper right', fontsize=12)
ax3.grid(axis='y', linestyle='--', alpha=0.7)

# Add data labels on bars
for i, bar in enumerate(pop_bars):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + 0.5,
            f'{height:.1f}%', ha='center', va='bottom', fontsize=9)
    
for i, bar in enumerate(suicide_bars):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + 0.5,
            f'{height:.1f}%', ha='center', va='bottom', fontsize=9)

# Plot 4: Age trend analysis
ax4 = axes[1, 1]
sns.regplot(x=np.arange(len(age_labels)), y=age_suicide_df['Suicide Rate (%)'], 
           ax=ax4, scatter_kws={'s': 100}, line_kws={'color': 'red'})
ax4.set_title('Trend Analysis of Suicide Rate by Age', fontsize=16, pad=10)
ax4.set_xlabel('Age Group (Increasing)', fontsize=14)
ax4.set_ylabel('Suicide Rate (%)', fontsize=14)
ax4.set_xticks(np.arange(len(age_labels)))
ax4.set_xticklabels(age_labels)
ax4.grid(True, linestyle='--', alpha=0.7)

# Add statistical test results annotation
plt.figtext(0.5, 0.01, 
            f"Chi-square test: χ² = {chi2_age:.2f}, p-value < 0.001 (significant at α = {alpha})\n"
            f"Conclusion: There is a significant difference in suicide rates across age groups.\n"
            f"Finding: Suicide rates tend to increase with age, peaking in older age groups.",
            ha="center", fontsize=14, bbox={"facecolor":"lightgrey", "alpha":0.5, "pad":5})

plt.tight_layout(rect=[0, 0.05, 1, 0.95])
plt.subplots_adjust(hspace=0.3)
plt.show()





Summary of notebook:

### Conclusions

1. **Hypothesis 1: Suicide rates are higher for individuals with lower education levels.**
    - The chi-square test results indicated a significant difference in suicide rates across different education levels. The analysis confirmed that individuals with lower education levels have higher suicide rates.

2. **Hypothesis 2: Gun-related suicides peak in winter months.**
    - The t-test results showed no significant difference in suicide rates between winter and non-winter months. Therefore, the hypothesis that gun-related suicides peak in winter months was not supported by the data.

3. **Hypothesis 3: Gun-related suicide tends to increase as age increases.**
    - The chi-square test results indicated a significant difference in suicide rates across different age groups. The analysis supported the hypothesis that gun-related suicides vary significantly with age, with higher rates observed in older age groups.

Several policy suggestions follow from this analysis of gun-related suicides. First, the strong link between higher suicide rates and lower educational attainment points to the need for focused educational initiatives and mental health support in the affected communities. Mental health awareness campaigns and easily accessible mental health services in community centres and schools could reduce the associated risk factors. Suicide rates may also be lowered in the long run by policies that improve socioeconomic conditions and educational opportunities in underprivileged areas.

Additionally, the analysis found no discernible seasonal variation in gun-related suicides, suggesting that year-round interventions are preferable to those that target only particular seasons. However, ongoing support remains crucial for those with seasonal affective disorder (SAD) and other seasonal mental health conditions. People can seek help sooner if mental health services are available all year round and if the symptoms and indicators of SAD are widely known.

Addressing mental health concerns among older adults is crucial, as evidenced by the finding that suicide rates tend to rise with age. In order to combat loneliness and isolation, policies should prioritise providing senior citizens with mental health resources and support, such as counselling services, social support programs, and routine mental health screenings. Additionally, early intervention and treatment outcomes can be enhanced by educating healthcare professionals on how to identify and treat mental health concerns in older adults.

The root causes of the trends in gun-related suicides should be investigated in future studies. Examining the precise causes of the higher suicide rates among older adults and those with less education can yield important information for creating focused interventions. The effectiveness of policy changes over time may be evaluated with the aid of longitudinal studies that look at how socioeconomic and educational advancements affect suicide rates. Furthermore, more thorough and inclusive policy approaches can be informed by research on how cultural, social, and economic factors affect suicide rates across various demographic groups.

---


# Push files to Repo


In [28]:
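import os
try:
    # Create a new folder in the current directory; exist_ok avoids an error
    # when the notebook is re-run and the folder is already there
    os.makedirs(os.path.join(current_dir, 'new_folder'), exist_ok=True)
except Exception as e:
    print(e)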


