# Suicide analysis
### Lukas Forst

In [12]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go  # used for the line chart at the end
import seaborn as sns

Now let's look at the data.

In [3]:
data = pd.read_csv('data.csv')
data.head()

,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


Now select a random sample from the dataset.

In [4]:
data.sample(5)

,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
20150,Qatar,2010,male,35-54 years,18,523563,3.44,Qatar2010,0.844,125122306346,74055,Generation X
11713,Iceland,1988,male,25-34 years,4,21500,18.6,Iceland1988,,6016168896,26249,Boomers
26795,United Kingdom,2011,female,25-34 years,134,4094631,3.27,United Kingdom2011,0.901,2619700404733,44491,Millenials
5043,Canada,1995,female,25-34 years,141,2431800,5.8,Canada1995,0.861,604031623433,21871,Generation X
451,Antigua and Barbuda,2002,male,25-34 years,0,6921,0.0,Antigua and Barbuda2002,,814615333,10499,Generation X


In [5]:
data.describe()

,year,suicides_no,population,suicides/100k pop,HDI for year,gdp_per_capita ($)
count,27820.0,27820.0,27820.0,27820.0,8364.0,27820.0
mean,2001.258375,242.574407,1844794.0,12.816097,0.776601,16866.464414
std,8.469055,902.047917,3911779.0,18.961511,0.093367,18887.576472
min,1985.0,0.0,278.0,0.0,0.483,251.0
25%,1995.0,3.0,97498.5,0.92,0.713,3447.0
50%,2002.0,25.0,430150.0,5.99,0.779,9372.0
75%,2008.0,131.0,1486143.0,16.62,0.855,24874.0
max,2016.0,22338.0,43805210.0,224.97,0.944,126352.0


Since `data.head()` showed some null values, let's check what fraction of each column is null.

In [6]:
# Fraction of null values per column (the mean of the boolean null mask)
data.isnull().mean()

country               0.000000
year                  0.000000
sex                   0.000000
age                   0.000000
suicides_no           0.000000
population            0.000000
suicides/100k pop     0.000000
country-year          0.000000
HDI for year          0.699353
 gdp_for_year ($)     0.000000
gdp_per_capita ($)    0.000000
generation            0.000000
dtype: float64

The `HDI for year` feature is missing in almost 70% of the rows, so we drop it.

In [7]:
data = data.drop('HDI for year', axis=1)
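The same decision can be made by rule instead of by eye. The following is a minimal sketch, assuming a cut-off of 50% missing values; the helper `drop_sparse_columns`, its `max_null_fraction` parameter, and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "suicides_no": [21, 16, 14, 1],
    "HDI for year": [np.nan, np.nan, np.nan, 0.8],
})

def drop_sparse_columns(df, max_null_fraction=0.5):
    """Return a copy of df without columns whose null fraction exceeds the threshold."""
    null_fraction = df.isnull().mean()
    sparse = null_fraction[null_fraction > max_null_fraction].index
    return df.drop(columns=sparse)

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

The 50% threshold is an arbitrary choice for illustration; a stricter or looser value may suit other datasets better.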

In [11]:
data.head()

,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,2156624900,796,Boomers


Let's aggregate the data by year and visualise the yearly totals.

In [38]:
# Sum population and suicide counts over all countries, sexes and age groups per year
yearly = data.groupby('year')[['population', 'suicides_no']].sum()
yearly.describe()

,population,suicides_no
count,32.0,32.0
mean,1603817000.0,210888.125
std,393031100.0,55287.550729
min,132101900.0,15603.0
25%,1520310000.0,202235.0
50%,1740078000.0,233384.5
75%,1845573000.0,243501.25
max,1997297000.0,256119.0


Interestingly, the `min` and `max` values differ by more than an order of magnitude. Let's investigate.

In [50]:
yearly[yearly['population'] == yearly['population'].min()]

year,population,suicides_no
2016,132101896,15603


It seems that 2016, the last year in the dataset, is missing some data, so we remove it in order to work only with complete years.

In [52]:
yearly = yearly.query('year != 2016')
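One way to back up the suspicion that a year is incomplete is to count how many distinct countries report in each year. The sketch below runs on a made-up toy frame; the rule "fewer than half the median number of countries" and the names `countries_per_year` and `incomplete_years` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data: three countries report in 2014 and 2015, only one in 2016.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C", "A"],
    "year": [2014, 2014, 2014, 2015, 2015, 2015, 2016],
    "population": [100, 200, 300, 110, 210, 310, 120],
})

# Number of distinct countries with at least one row in each year
countries_per_year = toy.groupby("year")["country"].nunique()

# Flag years that report fewer than half the median number of countries
threshold = 0.5 * countries_per_year.median()
incomplete_years = countries_per_year[countries_per_year < threshold].index.tolist()
print(incomplete_years)
```

Applied to the real `data` frame, the same `groupby` would show whether the low 2016 totals come from fewer reporting countries.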

In [63]:
import plotly.graph_objects as go

fig = go.Figure()

# Total suicides per year
fig.add_trace(go.Scatter(x=yearly.index, y=yearly['suicides_no'], mode='lines', name='Suicides'))
# Total population per year, divided by 10,000 so both lines fit on one y-axis
fig.add_trace(go.Scatter(x=yearly.index, y=yearly['population'] / 10000, mode='lines', name='Population / 10,000'))

fig.show()
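The plot scales population by a fixed factor of 10,000 so that it can share an axis with the suicide counts. An alternative that may be easier to read is a single rate per 100,000 inhabitants computed from the same two columns. This is a sketch on made-up numbers; `yearly_toy` and the column name `suicides_per_100k` are hypothetical.

```python
import pandas as pd

# Toy yearly totals with the same layout as the `yearly` frame (values are made up).
yearly_toy = pd.DataFrame(
    {"population": [1_000_000, 2_000_000], "suicides_no": [100, 300]},
    index=pd.Index([2014, 2015], name="year"),
)

# Suicides per 100,000 inhabitants for each year
yearly_toy["suicides_per_100k"] = (
    yearly_toy["suicides_no"] * 100_000 / yearly_toy["population"]
)
print(yearly_toy)
```

A rate normalises for population growth, so a rising count and a rising population no longer need to be compared by eye.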
