In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df=pd.read_csv('/kaggle/input/suicide-rates-overview-1985-to-2021/master.csv')

We begin by doing some exploratory analysis.

In [None]:
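df.info()
# Only the last expression in a cell is shown automatically, so display the others explicitly
display(df.head())
display(df.describe())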

We see there are some missing values. Some columns also need to be converted to an appropriate type (gdp_for_year to float, for example). The population column is apparently the number of people in the specific age group, not the country's population. The 'country-year' column is redundant, since we already have separate country and year columns, so we can drop it. We can also drop the 'HDI for year' column since it has too many missing values.
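To make the missing-value check concrete, here is a minimal sketch on a small made-up frame (the toy values are assumptions, not rows from master.csv). It computes the share of missing values per column, which is one way to justify dropping a column like 'HDI for year'.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are made up.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "HDI for year": [np.nan, 0.7, np.nan, np.nan],
    "gdp_for_year": ["1,000", "2,500", "3,000", "4,200"],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Numbers stored as text show up with a non-numeric dtype here.
print(toy.dtypes)
```

On the real data the same two calls can be run on df directly.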


In [None]:
df.drop(['country-year', 'HDI for year'], axis = 1, inplace = True)

In [None]:
df.columns

Let's rename the columns and change gdp_year to float. 

In [None]:
df.columns = ['country', 'year', 'sex', 'age', 'suicides', 'population', 'suicides/100k', 'gdp_year', 'gdp_capita', 'generation']
df['gdp_year'] = df.gdp_year.str.replace(',','').astype('float')
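If the column might contain stray whitespace or entries that cannot be parsed, a more tolerant conversion is possible. This is only a sketch; the sample strings below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative comma-formatted strings, including one with padding
# and one that cannot be parsed.
raw = pd.Series(["2,156,624,900", " 1,000 ", "n/a"])

# Remove thousands separators and surrounding whitespace, then convert.
# errors="coerce" turns unparseable entries into NaN instead of raising.
cleaned = raw.str.replace(",", "", regex=False).str.strip()
gdp = pd.to_numeric(cleaned, errors="coerce").astype("float64")
print(gdp)
```

The trade-off is that bad entries become NaN silently, so it is worth counting them with gdp.isna().sum() afterwards.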

Drop the rows where there are no suicides. 

In [None]:
# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
df = df[df['suicides/100k'] != 0].copy()

Let's check whether the suicide rate differs between age groups and sexes. First we reorder the age categories so the visualization is easier to interpret.

In [None]:
df.age = df.age.astype('category').cat.set_categories(['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years', '75+ years'], ordered = True)

In [None]:
sns.barplot(x = 'age', y = 'suicides/100k', hue = 'sex', data = df)

The difference is quite big. Men have a much higher suicide rate than women, and the rate rises with age.
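One caveat when reading this plot: seaborn's barplot shows the mean of the row-level rates by default, so a small group counts as much as a large one. Here is a sketch, on made-up numbers, of how that differs from a pooled rate (total suicides over total population).

```python
import pandas as pd

# Made-up rows: one small and one large population per sex.
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "suicides": [10, 200, 2, 50],
    "population": [100_000, 1_000_000, 100_000, 1_000_000],
})
toy["suicides/100k"] = toy["suicides"] / toy["population"] * 100_000

# Unweighted mean of the row-level rates (what the bar height represents).
mean_of_rates = toy.groupby("sex")["suicides/100k"].mean()

# Pooled rate: sum the counts first, then divide.
totals = toy.groupby("sex")[["suicides", "population"]].sum()
pooled_rate = totals["suicides"] / totals["population"] * 100_000

print(pd.DataFrame({"mean_of_rates": mean_of_rates, "pooled_rate": pooled_rate}))
```

Both numbers are legitimate summaries. They answer slightly different questions, so it helps to say which one a chart shows.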

Let's check for the differences in gdp_capita.

In [None]:
sns.scatterplot(x = 'gdp_capita', y ='suicides/100k' , data = df)

The trend seems to show that the higher the gdp_capita, the lower the country's suicide rate. Now let's take a look at individual countries.
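To put a number on the trend rather than judging it by eye, one option is a rank (Spearman) correlation, computed here as the Pearson correlation of the ranks. The five rows are made up for illustration.

```python
import pandas as pd

# Made-up values: the rate mostly falls as GDP per capita rises.
toy = pd.DataFrame({
    "gdp_capita": [1_000, 5_000, 10_000, 20_000, 40_000],
    "suicides/100k": [18.0, 15.0, 16.0, 9.0, 7.0],
})

# Spearman correlation = Pearson correlation of the ranked values.
spearman = toy["gdp_capita"].rank().corr(toy["suicides/100k"].rank())
print(round(spearman, 3))
```

A value near -1 indicates a consistently decreasing relationship, and a value near 0 indicates no monotonic trend. On the real data the same expression can be applied to df.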

In [None]:

plt.figure(figsize = (10,20))
sns.barplot(x = 'suicides/100k', y = 'country', data = df)

Let's look at suicides over the years and whether the trend differs by sex.

In [None]:
plt.figure(figsize = (10,5))

ax = sns.lineplot(x = 'year', y = 'suicides/100k', hue = 'sex',  data = df)



It appears that suicide rates have fallen over the last decade. The rate for men is almost three times the rate for women. The decline could be related to the increasing GDP per capita. Let's take a look at this relationship.

In [None]:

fig, axes = plt.subplots(1,2 ,figsize=(10, 5))

sns.lineplot(x = 'year', y = 'gdp_capita', data = df, ax = axes[0])
sns.lineplot(x = 'year', y = 'suicides/100k', data = df, ax = axes[1])

We see that as GDP per capita increases, the suicide rate drops. We even see a small drop in GDP around 2015, accompanied by an increase in suicides at the same time.

Now we'll try to build a machine learning model to see if we can predict suicides/100k. I'll choose a random forest regressor. Random forest is an ensemble technique that can perform both regression and classification using multiple decision trees and bootstrap aggregation, commonly known as bagging. The basic idea is to combine the outputs of many decision trees rather than relying on a single tree.
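To illustrate the bagging idea, here is a minimal sketch on synthetic data (a noisy sine curve, chosen purely for illustration). It is plain bagging of decision trees written by hand, not the internals of RandomForestRegressor, which can additionally subsample features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem: noisy sine curve (an assumption for this sketch).
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

trees = []
for _ in range(25):
    # Bootstrap: draw as many rows as the data has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=5, random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate: average the predictions of all trees.
X_new = np.linspace(0, 10, 50).reshape(-1, 1)
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (25, 50)
bagged = per_tree.mean(axis=0)

# Compare against the noise-free curve.
truth = np.sin(X_new[:, 0])
mse_bagged = np.mean((bagged - truth) ** 2)
mse_single = np.mean((per_tree - truth) ** 2)  # averaged over individual trees
print(f"bagged MSE: {mse_bagged:.4f}  mean single-tree MSE: {mse_single:.4f}")
```

The averaged prediction cannot have a higher squared error than the individual trees have on average, which is the motivation for combining them.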

We will need to transform the data so it will be ready for analysis.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector as selector

In [None]:
df.age = df.age.astype('object')
X = df.drop('suicides/100k', axis = 1)
y = df['suicides/100k']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

In [None]:
column_trans = ColumnTransformer(transformers=
        [('num', MinMaxScaler(), selector(dtype_exclude="object")),
        ('cat', OrdinalEncoder(), selector(dtype_include="object"))],
        remainder='drop')

We'll take a look to see which max_depth is the most appropriate for our regressor. 

In [None]:
results = {}

for i in range(1,20):
    
    clf = RandomForestRegressor(random_state=42, max_depth = i)

    pipeline = Pipeline([('prep',column_trans),
                         ('clf', clf)])
    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_test, y_test)
    results[i] = score

In [None]:
results


In [None]:
plt.plot(*zip(*sorted(results.items())))

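We see that the test score barely improves beyond a max_depth of about 10, so deeper trees add complexity without much benefit. We reach a score of 0.974. Note that score on a regressor returns R² (the coefficient of determination), not accuracy. The score is also likely flattered by the fact that X still contains suicides and population, from which suicides/100k is computed. Now we can look at the most important features for the model.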
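A plateau in the test score does not by itself show overfitting. A common diagnostic is to compare training and test scores and watch the gap between them. Here is a minimal sketch on synthetic data (the data and the depth grid are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: two features plus noise (made up for this sketch).
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

scores = {}
for depth in (2, 5, 10, 15):
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    # .score returns R^2 for regressors.
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={depth:2d}  train R^2={train_r2:.3f}  "
          f"test R^2={test_r2:.3f}  gap={train_r2 - test_r2:.3f}")
```

A training score that keeps rising while the test score stalls or falls is the usual sign of overfitting. Choosing max_depth with cross-validation on the training set would also keep the test set untouched for the final estimate.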

In [None]:
clf = RandomForestRegressor(random_state=42, max_depth = 10)
pipeline = Pipeline([('prep',column_trans),
                         ('clf', clf)])
pipeline.fit(X_train, y_train)

In [None]:
pipeline['clf'].feature_importances_

Let's visualize it. 

In [None]:
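# The ColumnTransformer outputs the numeric columns first and the categorical
# columns second, so rebuild the names in that order rather than using X.columns.
num_cols = selector(dtype_exclude="object")(X)
cat_cols = selector(dtype_include="object")(X)
feature_names = num_cols + cat_cols

# Pair each feature name with its impurity-based importance
df_imp = pd.DataFrame({'FEATURE': feature_names,
                       'IMPORTANCE': pipeline['clf'].feature_importances_})
df_imp = df_imp.sort_values(by='IMPORTANCE', ascending=False)
df_imp['CUMSUM'] = df_imp['IMPORTANCE'].cumsum()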

sns.barplot(x = 'IMPORTANCE', y = 'FEATURE', data = df_imp)
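One thing to watch when labelling importances: a ColumnTransformer emits its columns in the order of its transformers, not in the order of the input frame. Here is a minimal sketch on a toy frame (the column values are made up, and the selectors use "number" rather than "object" only to keep the example independent of how pandas stores strings).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy frame with text and numeric columns interleaved.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "year": [2000, 2001, 2002, 2003],
    "sex": ["male", "female", "male", "female"],
    "gdp_capita": [1000, 2000, 3000, 4000],
})

num_sel = selector(dtype_include="number")
cat_sel = selector(dtype_exclude="number")
ct = ColumnTransformer([("num", MinMaxScaler(), num_sel),
                        ("cat", OrdinalEncoder(), cat_sel)])
out = ct.fit_transform(toy)

# Calling a selector on a frame returns the matching column names.
output_names = num_sel(toy) + cat_sel(toy)

print("input order: ", list(toy.columns))
print("output order:", output_names)
```

Zipping importances with the input order would attach them to the wrong features here, because 'year' and 'gdp_capita' move to the front.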

The graph shows which features the model relies on most when predicting suicide rates. These importances describe the model rather than causes, so the explanations below are only possible interpretations. We saw earlier that the year feature gives some insight into how GDP relates to suicides over time: countries are becoming more developed and people have better access to basic goods. Several factors may explain why sex plays such a big role. For example, men are often said to be more likely to suppress their feelings and to opt for more violent solutions to their problems. Age may matter through loneliness: the older you are, and especially if you don't have a family, the more likely it is that the people you love are no longer part of your life.