In this notebook, we will analyze the Suicide Rates dataset and draw conclusions related to global suicide trends from 1985 to 2016.

In [3]:
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import pandas as pd
import numpy as np

Let's take a look at our dataset.

# Data preprocessing

In [4]:
df = pd.read_csv('dataset.csv')
df.sample(7)

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
1099,Armenia,2001,male,25-34 years,2,284346,0.7,Armenia2001,,2118467913,587,Generation X
15149,Lithuania,2003,female,15-24 years,11,247710,4.44,Lithuania2003,,18802576988,5778,Millenials
6423,Croatia,1999,female,35-54 years,70,604300,11.58,Croatia1999,,23386945597,5459,Boomers
16318,Mauritius,1996,female,25-34 years,12,94900,12.64,Mauritius1996,,4421943910,4444,Generation X
10327,Greece,2010,female,75+ years,6,599568,1.0,Greece2010,0.867,299361576558,27886,Silent
14604,Kyrgyzstan,2000,female,75+ years,10,56276,17.77,Kyrgyzstan2000,0.593,1369693171,313,G.I. Generation
7135,Czech Republic,1993,female,75+ years,110,322400,34.12,Czech Republic1993,,40614350197,4186,G.I. Generation


Our data consists of twelve columns:

- country
- year
- sex
- age
- suicides_no: number of suicides
- population
- suicides/100k pop: number of suicides per 100k population
- country-year
- HDI for year: human development index (composite measure of societal development)
- gdp_for_year: annual country GDP in dollars
- gdp_per_capita: average GDP per person in dollars 
- generation: ['Generation X', 'Silent', 'Millenials', 'Boomers', 'G.I. Generation', 'Generation Z']
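Not every column carries independent information. As a rough check, the sketch below recomputes 'suicides/100k pop' from suicides_no and population for three rows copied from the sample above. It assumes the rate is simply suicides divided by population, scaled to 100,000 and rounded to two decimals; the small `rows` frame is a stand-in for illustration, not the full dataset.

```python
import pandas as pd

# Three rows copied from the sample output above (not the full dataset).
rows = pd.DataFrame({
    'country': ['Armenia', 'Lithuania', 'Croatia'],
    'suicides_no': [2, 11, 70],
    'population': [284346, 247710, 604300],
    'suicides/100k pop': [0.70, 4.44, 11.58],
})

# Assumed formula: suicides per 100,000 people, rounded to two decimals.
recomputed = (rows['suicides_no'] / rows['population'] * 100_000).round(2)

# Compare with a small tolerance to absorb floating-point noise.
matches = (recomputed - rows['suicides/100k pop']).abs() < 0.01
print(matches.all())
```

If the check holds on the full data as well, the rate column can always be rebuilt from the two raw columns.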

Let's adapt these columns for our analysis.

The 'country-year' column seems unnecessary since it duplicates information already present in the separate 'country' and 'year' columns.
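One way to confirm this before dropping the column is to rebuild it from its two parts and compare. The sketch below does so on three rows copied from the sample above; the `rows` frame is a small stand-in, and the same comparison could be run on the full `df`.

```python
import pandas as pd

# Three rows copied from the sample output above (not the full dataset).
rows = pd.DataFrame({
    'country': ['Armenia', 'Greece', 'Czech Republic'],
    'year': [2001, 2010, 1993],
    'country-year': ['Armenia2001', 'Greece2010', 'Czech Republic1993'],
})

# Concatenate country and year as strings and compare element-wise.
rebuilt = rows['country'] + rows['year'].astype(str)
is_redundant = (rebuilt == rows['country-year']).all()
print(is_redundant)
```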

In [5]:
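column_name_mapping = {
    'suicides/100k pop': 'suicides_per_100k',
    'HDI for year': 'HDI_for_year',
    # The spaces around this key are intentional: rename() only applies
    # a mapping whose key matches the raw column header exactly.
    ' gdp_for_year ($) ': 'gdp_for_year',
    'gdp_per_capita ($)': 'gdp_per_capita'
}

# Rename to identifier-friendly names, then drop the redundant column.
df.rename(columns=column_name_mapping, inplace=True)
df.drop('country-year', axis=1, inplace=True)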

Let's check the percentage of NaN values in each column.

In [6]:
def check_missing_values(column):
    """Print the share of missing values in the given column of df."""
    nan_percentage = df[column].isnull().sum() / df[column].size
    print(f'"{column}" column consists of {nan_percentage:.2%} missing values.')

for column in df.columns:
    check_missing_values(column)

"country" column consists of 0.00% missing values.
"year" column consists of 0.00% missing values.
"sex" column consists of 0.00% missing values.
"age" column consists of 0.00% missing values.
"suicides_no" column consists of 0.00% missing values.
"population" column consists of 0.00% missing values.
"suicides_per_100k" column consists of 0.00% missing values.
"HDI_for_year" column consists of 69.94% missing values.
"gdp_for_year" column consists of 0.00% missing values.
"gdp_per_capita" column consists of 0.00% missing values.
"generation" column consists of 0.00% missing values.
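The same per-column shares can also be computed without an explicit loop, since the mean of a boolean mask is the fraction of `True` values. The sketch below shows this on a small stand-in frame named `toy` (an assumption for illustration; its HDI values are taken from the sampled rows, with NaN where the sample shows a blank).

```python
import numpy as np
import pandas as pd

# Stand-in for df: HDI values from the sampled rows, NaN where blank.
toy = pd.DataFrame({
    'country': ['Armenia', 'Greece', 'Kyrgyzstan', 'Croatia'],
    'HDI_for_year': [np.nan, 0.867, 0.593, np.nan],
})

# isnull() gives a boolean mask; mean() turns it into a share per column.
missing_share = toy.isnull().mean()

for column, share in missing_share.items():
    print(f'"{column}" column consists of {share:.2%} missing values.')
```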


The 'HDI_for_year' column is missing about 70% of its values.

With this many gaps, imputing 'HDI_for_year' is unreliable. Filling the missing values with the column mean, for example, would distort the data: 
countries with a very low HDI would be pulled up to the mean, producing inaccurate information.
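To make the concern about mean imputation concrete, here is a minimal sketch. The countries and HDI values are hypothetical, chosen only for illustration and not taken from the dataset: country A has a low true HDI that is treated as missing, while B, C and D are observed.

```python
import pandas as pd

# Hypothetical values for illustration only; not from the dataset.
hdi = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'true_hdi': [0.35, 0.90, 0.88, 0.87],
})

# Pretend the value for country A was never recorded.
hdi['observed_hdi'] = hdi['true_hdi'].where(hdi['country'] != 'A')

# Mean imputation: fill the gap with the mean of the observed values.
observed_mean = hdi['observed_hdi'].mean()
hdi['imputed_hdi'] = hdi['observed_hdi'].fillna(observed_mean)

# Positive error means the imputed value overstates the true HDI.
error = hdi['imputed_hdi'] - hdi['true_hdi']
print(hdi[['country', 'true_hdi', 'imputed_hdi']])
print(f'Imputation error for country A: {error.iloc[0]:.2f}')
```

In this toy case the low-HDI country is assigned a value close to that of the highly developed ones, which is exactly the kind of distortion described above.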

Let's leave out this column for now and come back to it later.

In [7]:
df.drop('HDI_for_year', axis=1, inplace=True)
df.sample(7)

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides_per_100k,gdp_for_year,gdp_per_capita,generation
3913,Belize,1996,female,25-34 years,0,15534,0.0,641383800,3545,Generation X
8811,Finland,1993,male,75+ years,59,87700,67.27,89255751015,18826,G.I. Generation
5039,Canada,1995,male,15-24 years,507,2051200,24.72,604031623433,21871,Generation X
22533,Singapore,1996,female,75+ years,21,42600,49.3,96403758865,34148,G.I. Generation
4771,Bulgaria,2003,male,5-14 years,6,407671,1.47,20982685981,2800,Millenials
22597,Singapore,2001,male,15-24 years,21,216300,9.71,89286208629,28774,Millenials
5253,Canada,2013,male,75+ years,231,991048,23.31,1842628005830,55310,Silent


Our data is now ready for analysis.