# Suicide Rate Regression

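This notebook treats suicide-rate prediction as a regression problem: given a demographic group's sex, age bracket, and generation, it predicts the group's suicide rate per 100k population. Download the [suicide rates dataset](https://www.kaggle.com/datasets/russellyates88/suicide-rates-overview-1985-to-2016) from Kaggle.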

In [1]:
import pandas as pd
import numpy as np
from typing import Callable
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

In [2]:
suicide_data = pd.read_csv('../EP_datasets/suicide-rates-overview-1985-to-2016/master.csv')
suicide_data.head()

| | country | year | sex | age | suicides_no | population | suicides/100k pop | country-year | HDI for year | gdp_for_year ($) | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | | 2156624900 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | | 2156624900 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | | 2156624900 | 796 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | | 2156624900 | 796 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | | 2156624900 | 796 | Boomers |


Note: Omitting data exploration because this step was completed in a previous assignment/notebook. See `Suicide-Rate-Classification.ipynb`.

## Helpers

<a id="display_coeffs_and_equation"></a>
## display_coeffs_and_equation

*The display_coeffs_and_equation function prints the regression coefficients and regression formula for a fitted linear regression model.*

* **model** LinearRegression: a fitted linear regression model

The function reads the feature names from the notebook-level `features` variable, so `features` must be assigned before the function is called.

**returns** None

In [3]:
def display_coeffs_and_equation(model: LinearRegression) -> None:
    # `features` is a notebook-level variable that is set before each call
    coeffs = model.coef_
    intercept = model.intercept_

    print("The coefficients of the linear regression model are:\n")
    for feature, coeff in zip(features, coeffs):
        print(f"{feature}: {coeff}")

    print("\nThe regression equation is:\n")
    # join the terms so that no separator follows the last one
    terms = [f"{coeff}({feature})" for feature, coeff in zip(features, coeffs)]
    print(f"y_pred = {intercept} + " + " + ".join(terms))
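
The helper reads the feature names from a notebook-level `features` variable, so it only works once that variable has been assigned. A minimal sketch of a variant that takes the names explicitly and returns the equation as a string is shown below. `format_equation` and its parameters are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
from typing import Sequence


def format_equation(intercept: float, coeffs: Sequence[float], feature_names: Sequence[str]) -> str:
    """Build 'y_pred = b0 + b1(x1) + ...' from explicit inputs (hypothetical helper)."""
    terms = [f"{coeff}({name})" for name, coeff in zip(feature_names, coeffs)]
    return " + ".join([f"y_pred = {intercept}"] + terms)


# illustrative values, not taken from a fitted model
equation = format_equation(1.5, [2.0, -0.5], ["age", "sex"])
print(equation)
```

Passing the names as an argument avoids any dependence on the order in which notebook cells are run.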

## Predict the suicide rate for (males, age 20, generation X) using the one-hot encoded variables from the previous classification problem

In [4]:
# reformat feature name
suicide_data.rename(columns={'suicides/100k pop': 'suicides_100k_pop'}, inplace=True)

# create a new dataframe with the sex, age, and generation features
one_hot_df = suicide_data[['generation', 'age', 'sex', 'suicides_100k_pop']]

# one-hot encode the features
new_suicide_df = pd.get_dummies(one_hot_df, columns=['sex', 'age', 'generation'])

# convert all values to float
one_hot_suicide_df = new_suicide_df.astype('float64')

one_hot_suicide_df.head()

| | suicides_100k_pop | sex_female | sex_male | age_15-24 years | age_25-34 years | age_35-54 years | age_5-14 years | age_55-74 years | age_75+ years | generation_Boomers | generation_G.I. Generation | generation_Generation X | generation_Generation Z | generation_Millenials | generation_Silent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.71 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 5.19 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 4.83 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4.59 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 3.28 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [5]:
# get labels and drop them from the data
# use the float-converted dataframe created in the previous cell
one_hot_labels = np.array(one_hot_suicide_df['suicides_100k_pop'])
one_hot_data = one_hot_suicide_df.drop('suicides_100k_pop', axis=1)
features = one_hot_data.columns
one_hot_data = np.array(one_hot_data)

In [6]:
# get training and test splits
x_train, x_test, y_train, y_test = train_test_split(one_hot_data, one_hot_labels, test_size=0.2, random_state=42)

In [7]:
# fit the model
lr_model = LinearRegression()
model_1 = lr_model.fit(x_train, y_train)
display_coeffs_and_equation(model_1)

The coefficients of the linear regression model are:

sex_female: -11043810134797.283
sex_male: -11043810134782.477
age_15-24 years: -669736127374.7412
age_25-34 years: -669736127371.7694
age_35-54 years: -669736127369.8015
age_5-14 years: -669736127382.7004
age_55-74 years: -669736127368.9316
age_75+ years: -669736127361.7905
generation_Boomers: -62855403172.59684
generation_G.I. Generation: -62855403169.61795
generation_Generation X: -62855403173.31905
generation_Generation Z: -62855403173.90937
generation_Millenials: -62855403174.31953
generation_Silent: -62855403172.70142

The regression equation is:

y_pred = 11776401665347.26 + -11043810134797.283(sex_female) + -11043810134782.477(sex_male) + -669736127374.7412(age_15-24 years) + -669736127371.7694(age_25-34 years) + -669736127369.8015(age_35-54 years) + -669736127382.7004(age_5-14 years) + -669736127368.9316(age_55-74 years) + -669736127361.7905(age_75+ years) + -62855403172.59684(generation_Boomers) + -62855403169.61795(generat

In [8]:
# create a new data point for a male, age 20 (the '15-24 years' bracket), generation X
test_example = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]).reshape(1, -1)

# get the true labels from the test set for matching examples
matching_indices = np.where((x_test == test_example).all(axis=1))[0]
true_labels = y_test[matching_indices]

# make prediction
predicted_value = model_1.predict(test_example)

# compare the single prediction against every matching test label
repeated_prediction = np.full(len(true_labels), predicted_value[0])
print(f"Predicted Value: {predicted_value[0]}")
print(f"MAE: {mean_absolute_error(true_labels, repeated_prediction)}")

Predicted Value: 16.72265625
MAE: 9.807116201456312


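This model has 14 regression coefficients, one per one-hot encoded column. Within each group (sex, age, and generation) the coefficients are nearly identical and very large in magnitude (on the order of 10^13 for sex), and they are offset by an equally large intercept. This is a symptom of perfect collinearity: the one-hot columns of each feature always sum to 1, so they duplicate the intercept and the individual coefficients are not uniquely determined. Only the differences within a group are meaningful. For example, the `sex_male` coefficient exceeds the `sex_female` coefficient by about 14.8.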
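
The very large, nearly identical coefficients within each group are what one would expect when every one-hot group sums to 1 and the model also fits an intercept. One common remedy is to drop a reference level per feature, for example with `drop_first=True` in `pd.get_dummies`. The sketch below uses a small synthetic frame (the values are made up for illustration and are not taken from the dataset) to show that the remaining coefficients then become directly readable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic data for illustration only: rate = 2 + 10 * male + 3 * (age is '25-34 years')
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "age": ["15-24 years", "15-24 years", "25-34 years",
            "25-34 years", "15-24 years", "25-34 years"],
})
toy["rate"] = 2.0 + 10.0 * (toy["sex"] == "male") + 3.0 * (toy["age"] == "25-34 years")

# drop_first=True removes one reference column per feature
# (here sex_female and age_15-24 years), which breaks the collinearity
encoded = pd.get_dummies(toy[["sex", "age"]], columns=["sex", "age"], drop_first=True)
encoded = encoded.astype("float64")

model = LinearRegression().fit(encoded, toy["rate"])
for name, coeff in zip(encoded.columns, model.coef_):
    print(f"{name}: {coeff:.3f}")
print(f"intercept: {model.intercept_:.3f}")
```

With a reference level dropped, each coefficient is the difference from the reference category and the intercept is the prediction for the reference group.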

## Use the natural numerical features to predict the suicide rate for (males, age 20, generation X)

In [9]:
numerical_df = suicide_data[['generation', 'age', 'sex', 'suicides_100k_pop']].copy()

# encode generation as an ordinal variable
generation_mapping = {0: 'G.I. Generation', 1: 'Silent', 2: 'Boomers', 3: 'Generation X', 4: 'Millenials', 5: 'Generation Z'}
numerical_df['generation'] = numerical_df['generation'].map({v: k for k, v in generation_mapping.items()})

# encode age as an ordinal value
age_mapping = {0: '5-14 years', 1: '15-24 years', 2: '25-34 years', 3: '35-54 years', 4: '55-74 years', 5: '75+ years'}
numerical_df['age'] = numerical_df['age'].map({v: k for k, v in age_mapping.items()})

# encode sex as a binary value
sex_mapping = {0: 'female', 1: 'male'}
numerical_df['sex'] = numerical_df['sex'].map({v: k for k, v in sex_mapping.items()})

# convert all values to float
numerical_df = numerical_df.astype('float64')

print("Generation mapping:\n")
print(generation_mapping)
print("\nAge mapping:\n")
print(age_mapping)
print("\nSex mapping:\n")
print(sex_mapping)

Generation mapping:

{0: 'G.I. Generation', 1: 'Silent', 2: 'Boomers', 3: 'Generation X', 4: 'Millenials', 5: 'Generation Z'}

Age mapping:

{0: '5-14 years', 1: '15-24 years', 2: '25-34 years', 3: '35-54 years', 4: '55-74 years', 5: '75+ years'}

Sex mapping:

{0: 'female', 1: 'male'}


In [10]:
numerical_df.head()

| | generation | age | sex | suicides_100k_pop |
|---|---|---|---|---|
| 0 | 3.0 | 1.0 | 1.0 | 6.71 |
| 1 | 1.0 | 3.0 | 1.0 | 5.19 |
| 2 | 3.0 | 1.0 | 0.0 | 4.83 |
| 3 | 0.0 | 5.0 | 1.0 | 4.59 |
| 4 | 2.0 | 2.0 | 1.0 | 3.28 |


In [11]:
# get labels and drop them from the data
numerical_labels = np.array(numerical_df['suicides_100k_pop'])
numerical_data = numerical_df.drop('suicides_100k_pop', axis = 1)
features = numerical_data.columns
numerical_data = np.array(numerical_data)

In [12]:
# split training and test data
x_train, x_test, y_train, y_test = train_test_split(numerical_data, numerical_labels, test_size=0.2, random_state=42)

In [13]:
# fit the model
lr_model = LinearRegression()
model_2 = lr_model.fit(x_train, y_train)
display_coeffs_and_equation(model_2)

The coefficients of the linear regression model are:

generation: -0.40278264899315597
age: 3.7527464716247825
sex: 14.82885578306375

The regression equation is:

y_pred = -3.0088482318725287 + -0.40278264899315597(generation) + 3.7527464716247825(age) + 14.82885578306375(sex)


In [14]:
# create a new data point for generation X (3), age 20 (the '15-24 years' bracket, 1), and male (1)
test_example = np.array([3.0, 1.0, 1.0]).reshape(1, -1)

# get the true labels from the test set for matching examples
matching_indices = np.where((x_test == test_example).all(axis=1))[0]
true_labels = y_test[matching_indices]

# make the prediction
predicted_value = model_2.predict(test_example)

# compare the single prediction against every matching test label
repeated_prediction = np.full(len(true_labels), predicted_value[0])
print(f"Predicted Value: {predicted_value[0]}")
print(f"MAE: {mean_absolute_error(true_labels, repeated_prediction)}")

Predicted Value: 14.364406075836536
MAE: 9.13720900575308


This model has 3 coefficients, one per feature in the numerically encoded dataset.
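
Because the model is linear, the prediction can be reproduced by substituting the encoded values into the regression equation. The sketch below copies the intercept and coefficients from the printed output and evaluates the equation for generation X (3), the '15-24 years' bracket (1), and male (1).

```python
import numpy as np

# intercept and coefficients copied from the printed regression equation
intercept = -3.0088482318725287
coeffs = np.array([-0.40278264899315597, 3.7527464716247825, 14.82885578306375])  # generation, age, sex

# encoded example: generation X -> 3, '15-24 years' -> 1, male -> 1
example = np.array([3.0, 1.0, 1.0])

# y_pred = intercept + sum(coefficient * feature value)
manual_prediction = intercept + coeffs @ example
print(manual_prediction)
```

The result agrees with the value returned by `model_2.predict` (about 14.36), up to floating-point rounding.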

The model trained on the age, sex, and generation features with numerical (ordinal) encodings achieved a slightly lower error than the model trained on the one-hot encoded features. When predicting the suicide rate for males, age 20, from generation X, the one-hot encoded model had an MAE of 9.81 and the numerically encoded model had an MAE of 9.14. Both values are measured only against the test rows that match this one demographic group, and the improvement is minor. However, the second model is simpler and more interpretable, as the regression equations demonstrate. Therefore, the second model is the preferred choice for this problem.

The ordinal encodings worked well for these features because the generation and age variables have clear orderings. The same kind of ordinal encoding could reduce performance for nominal features with no specific ordering.
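
The caveat about nominal features can be illustrated with a toy example. The sketch below builds a synthetic nominal feature (the `region` column and its values are hypothetical and not part of the dataset) whose group means do not follow the order of the assigned codes, then compares the two encodings. For simplicity the error is measured on the training data itself.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# synthetic nominal feature whose group means do not follow the arbitrary code order
toy = pd.DataFrame({"region": ["A", "B", "C"] * 4})
toy["rate"] = toy["region"].map({"A": 1.0, "B": 10.0, "C": 2.0})
y = toy["rate"].to_numpy()

# ordinal encoding imposes the ordering A < B < C
x_ordinal = toy["region"].map({"A": 0, "B": 1, "C": 2}).to_numpy(dtype="float64").reshape(-1, 1)
# one-hot encoding makes no ordering assumption
x_onehot = pd.get_dummies(toy["region"]).to_numpy(dtype="float64")

mae_ordinal = mean_absolute_error(y, LinearRegression().fit(x_ordinal, y).predict(x_ordinal))
mae_onehot = mean_absolute_error(y, LinearRegression().fit(x_onehot, y).predict(x_onehot))
print(f"ordinal MAE: {mae_ordinal:.2f}, one-hot MAE: {mae_onehot:.2f}")
```

A single slope cannot represent a value that rises from A to B and then falls from B to C, so the ordinal model fits this data worse than the one-hot model.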

## Make a prediction for age 33, male, and generation Alpha (i.e., the generation after generation Z)

In [15]:
# create a new data point with generation Alpha (value 6, after generation Z), age 33 (the '25-34 years' bracket, 2), and male (1)
test_example = np.array([6.0, 2.0, 1.0]).reshape(1, -1)

predicted_value = model_2.predict(test_example)

print(f"Predicted Value: {predicted_value[0]}")

Predicted Value: 16.90880460048185


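The model predicted a suicide rate of about 16.91 per 100k for 33-year-old males in generation Alpha, which is higher than the 14.36 it predicted for 20-year-old males from generation X. The generation value of 6 lies outside the range seen in training (0 to 5), so this prediction is a linear extrapolation.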

One advantage of regression with respect to the independent variables is that each feature receives a continuous, numerical coefficient. A coefficient is interpreted as the change in the predicted outcome for a one-unit increase in that feature while the other features are held constant. Substituting these coefficients into the regression equation makes it easy to describe feature significance and model predictions to non-technical stakeholders. In contrast, interpretable classification techniques describe feature significance in terms of entropy and information gain. These metrics require a more complex explanation than the regression formula.

One advantage of using numerical features with regression instead of one-hot encoded variables is that the regression model can make predictions for values that were not seen in training, as shown in the previous cell. The model trained with one-hot encoded features would have been unable to predict the suicide rate for a new generation because there was no column for this generation in the training data. Additionally, using numerically encoded features rather than one-hot encoded features reduces the dimensionality of the dataset. A lower-dimensional dataset makes the regression equation more interpretable, can reduce overfitting, and improves computational efficiency. However, a numerical or ordinal encoding does not suit every feature. It can reduce performance by making the model assume a natural ordering in a variable that does not have one.
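
The limitation of one-hot encoding for unseen categories can be made concrete with scikit-learn's `OneHotEncoder`. This is a minimal sketch that assumes the encoder is fitted on a few generation labels only; the notebook itself uses `pd.get_dummies`, so the encoder here is a stand-in chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# generation labels seen during training (illustrative subset)
train_generations = np.array([["Generation X"], ["Millenials"], ["Generation Z"]], dtype=object)
new_generation = np.array([["Generation Alpha"]], dtype=object)

# default handle_unknown='error': an unseen category cannot be encoded
strict_encoder = OneHotEncoder().fit(train_generations)
try:
    strict_encoder.transform(new_generation)
    raised = False
except ValueError:
    raised = True
print(f"unseen category rejected: {raised}")

# handle_unknown='ignore' encodes the unseen category as all zeros,
# so no generation information reaches the model
lenient_encoder = OneHotEncoder(handle_unknown="ignore").fit(train_generations)
encoded_new = lenient_encoder.transform(new_generation).toarray()
print(encoded_new)
```

In either case the one-hot representation has no way to express "one step after Generation Z", whereas the ordinal encoding can at least extrapolate along the assumed ordering.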

I would suggest a regression model for the problem of predicting suicide rates. The suicide rate is naturally a numeric quantity. Binning the suicide rate for the classification approach is subjective, and what counts as a high or low suicide rate could vary with the binning method chosen by the machine learning practitioner. Additionally, the threshold used to define a high suicide rate could change over time. The continuous output of the regression model makes no assumption about what is high or low and retains the natural format of the dependent variable. As noted, another benefit of the regression approach is that it can predict the suicide rate for future generations not included in the training set. Therefore, the predictions made by the regression model provide a more valuable metric to describe suicide rates, and this model will generalize better over time.
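
The subjectivity of binning can be seen in a small example. The sketch below labels the same made-up rates (illustrative values, not taken from the dataset) as high or low using two plausible thresholds, the median and the mean.

```python
import pandas as pd

# illustrative rates per 100k, not taken from the dataset
rates = pd.Series([1.0, 2.0, 3.0, 4.0, 50.0])

# scheme 1: split at the median
median_labels = (rates > rates.median()).map({True: "high", False: "low"})
# scheme 2: split at the mean
mean_labels = (rates > rates.mean()).map({True: "high", False: "low"})

print(pd.DataFrame({"rate": rates, "median_split": median_labels, "mean_split": mean_labels}))
```

The rate of 4.0 is labeled high under the median split and low under the mean split, so the class label depends on the practitioner's choice of threshold.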