## Introduction.

"In much of the world, suicide is stigmatized and condemned for religious or cultural reasons. In some countries, suicidal behaviour is a criminal offence punishable by law.
Suicide is therefore often a secretive act surrounded by taboo, and may be unrecognized, misclassified or deliberately hidden in official records of death."
— World Health Organization (2002)

Hi and welcome. 
The goal of this project is to visualize data on suicide rates in 101 countries from 1985 to 2016.
The main aim is to offer an easy-to-read presentation of the data set, highlighting different approaches for further data analysis.
Any social, political or economic discussion based on this data is outside the scope of this notebook, leaving the reader free to extend the analysis.
Any suggestion to correct or improve the results shown is more than welcome.

Thanks a lot for your time.



Ruggero Piazza

Source: Kaggle, https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

<a id='summary'></a>
## Summary:

1 - Data observation and cleaning.

2 - Data analysis:

[2.1 - Group By Year](#by_year)
   
- Total suicides per year.
   
- The 10 Years with the highest number of suicides.
                     
- Year with the highest/lowest amount of cases.
   
- Suicide by gender on highest/lowest year.
    
[2.2 - Group By Age](#by_age)
   
- Total suicides by age category.
   
- Total suicides divided by gender per each age category.

- The 5 countries with the highest number of suicides by age category.
    
[2.3 - Group By Gender](#by_gender)
   
- Total amount of suicides per gender.

- Male/female suicide ratio for each country.
          
[2.4 - Group By Generation](#by_gen)
   
- Total amount of suicides per generation.
    
[2.5 - Group By Country](#by_country)
   
- Total amount of suicides per country.
   
- The 15 countries with the highest/lowest number of cases.
   
- Total number of suicides per gender in the 15 countries with the highest number of cases.

- The 5 countries with the highest number of cases in each year of the steepest growth (1990 to 1999).

- Comparison between total number of suicides and population number, by year, on top 5 countries.

- The 10 countries with the highest suicides/100k population ratio.

3 - [Conclusion.](#conclusion)
    

## 1 - Data observation and cleaning.

In [None]:
import pandas as pd
from pandas import DataFrame
from IPython.display import HTML
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings

In [None]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

In [None]:
df = pd.read_csv("/kaggle/input/suicide-rates-overview-1985-to-2016/master.csv")

A quick look at the data frame to spot the presence of NaN values.

In [None]:
df.head(10)

We can see the presence of NaN values in the "HDI for year" column. Using sample(), we check whether the missing values are persistent.

In [None]:
df.sample(10)

Having noted the presence of NaN values, we explore the data in more depth.

In [None]:
df.info()

In [None]:
df.index # observing the total amount of rows

Every column has 27820 non-null values, apart from the "HDI for year" column, which has fewer (8364). This confirms our previous observation. To see which columns contain missing values we can use isnull().any():

In [None]:
df.isnull().any()

Knowing that:

total number of rows: 27820

total number of non-null values: 8364

total number of NaN values: 27820 - 8364 = 19456

we can calculate the percentage of NaN values:
(19456 / 27820) × 100 ≈ 69.94%

Since about 70% of the values in the "HDI for year" column are NaN, we drop the column from our analysis.
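The same percentage can be obtained directly from pandas, since the mean of a boolean mask is the share of True values. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are illustrative, not taken from the data set):

```python
import numpy as np
import pandas as pd

# hypothetical toy frame: 3 of the 4 values in 'HDI for year' are missing
toy = pd.DataFrame({
    'year': [1985, 1986, 1987, 1988],
    'HDI for year': [np.nan, 0.7, np.nan, np.nan],
})

# isnull() gives a boolean mask; its mean is the fraction of NaN values per column
nan_share = toy.isnull().mean() * 100
print(nan_share.round(2))
```

Applied to the real data frame, the same expression would report the share of missing values for every column at once.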

In [None]:
df.drop('HDI for year', axis=1, inplace=True) # dropping the column

Now we check if there is any missing year in the range 1985-2016:

In [None]:
arr_year = df['year'].unique()
arr_year.sort()
arr_year # checking any missing year in the range 1985-2016
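Reading the sorted array by eye works for a short range; a set difference makes the check explicit. A small sketch, where `years_present` stands in for `df['year'].unique()` and is made up for illustration:

```python
import numpy as np

# hypothetical stand-in for df['year'].unique(); 1987 is deliberately left out
years_present = np.array([1985, 1986, 1988, 1989])

# every year expected in the (shortened) range, minus the ones actually present
expected = set(range(1985, 1990))
missing_years = sorted(expected - set(years_present.tolist()))
print(missing_years)
```

An empty list would mean that no year is missing from the range.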

Visualising the list of countries:

In [None]:
arr_countries = df['country'].unique()
arr_countries # visualising the list of countries

In [None]:
len(arr_countries) # counting the amount of countries

After an overview of the data frame, we can visualise how many observations we have by country. 

In [None]:
alpha = 0.7
plt.figure(figsize=(10, 25))
sns.countplot(y='country', data=df, alpha=alpha, color='blue')
plt.title('Number of observations by country')
plt.show() # visualisation of the observation's count by country

In [None]:
by_country = df.groupby('country')

After creating a dataframe with the number of observations by country, we can highlight
the countries with the highest/lowest number of observations:

In [None]:
# creating a dataframe with only the country's name and the number of observations
df_obs = by_country.size().to_frame('Observations')
df_obs
# size() counts the rows of each group directly, so no explicit loop is needed.

Country with the lowest number of observations:

In [None]:
# country with the lowest number of observations

index_min = df_obs.idxmin() # index of minimum value
df_obs.loc[index_min, 'Observations']

In [None]:
df_obs.loc[df_obs['Observations']<=100]

Country with the highest number of observations:

In [None]:
# country with the highest number of observations

index_max = df_obs.idxmax()
df_obs.loc[index_max, 'Observations']

Countries with a number of observations equal to or greater than 350:

In [None]:
df_obs.loc[df_obs['Observations']>=350]

[back to summary](#summary)

## 2 - Data analysis:

We start the analysis with a quick overview of the main DataFrame grouped by different categories to highlight possible approaches.
<a id='by_year'></a>

## 2.1 - GROUP BY YEAR:

In [None]:
y = df.groupby('year') # grouping by year

Sample of the dataframe grouped by year followed by a graph:

In [None]:
# summing the suicides of each year; the year is stored as a string label in the index
by_year = y['suicides_no'].sum()
by_year.index = by_year.index.astype(str)
by_year = by_year.to_frame('Tot_Suicide')
by_year.sample(10)

In [None]:
# highlighting tot suicides by year
graph_by_year = by_year.plot(legend=False, grid=True) 
graph_by_year.set_xlabel('Year')
graph_by_year.set_ylabel('Tot Suicide')
plt.title('Tot suicides by year')

Creating a dataframe with the 10 largest values and the related graph:

In [None]:
# creating a frame with the 10 largest values
largest = by_year.nlargest(10, 'Tot_Suicide')
largest.sort_index(inplace=True) # sorting
largest

In [None]:
# plotting the frame to highlight the trend 
graph_largest = largest.plot(legend=False, grid=True) 
graph_largest.set_xlabel('Year')
graph_largest.set_ylabel('Tot Suicide') 
plt.title('Tot suicides by top 10 years')

Year with the minimum amount of suicides:

In [None]:
# year with the minimum amount of suicides
year_min_index = by_year.idxmin()
by_year.loc[year_min_index, 'Tot_Suicide']

Year with the maximum amount of suicides:

In [None]:
# year with the maximum amount of suicides
year_max_index = by_year.idxmax()
by_year.loc[year_max_index, 'Tot_Suicide']

Grouping the main dataframe by year and sex:

In [None]:
# group by year and sex
gb_year_sex = df.groupby(['year', 'sex'])
df_year_sex = gb_year_sex[['suicides_no']].sum()
df_year_sex.head(10)

Plotting suicides by gender on highest and lowest year:

In [None]:
df_year_sex.loc[[1999, 2016]].plot(kind='bar')
plt.title('Suicides by gender on highest and lowest year')
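The years 1999 and 2016 are written by hand in the cell above. As a sketch, they could instead be looked up from the yearly totals with `idxmax()` and `idxmin()`; the `toy` frame below is invented for illustration and its numbers are not from the data set:

```python
import pandas as pd

# hypothetical toy data in the same shape as the columns used above
toy = pd.DataFrame({
    'year': [1999, 1999, 2000, 2000, 2016, 2016],
    'sex': ['female', 'male'] * 3,
    'suicides_no': [50, 150, 40, 120, 5, 10],
})

# yearly totals, then the labels of the largest and smallest total
totals = toy.groupby('year')['suicides_no'].sum()
year_max, year_min = totals.idxmax(), totals.idxmin()

# same (year, sex) table as df_year_sex, restricted to the two extreme years
by_year_sex = toy.groupby(['year', 'sex'])[['suicides_no']].sum()
extremes = by_year_sex.loc[[year_max, year_min]]
print(extremes)
```

Looking the years up this way keeps the plot correct if the underlying data changes.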

<a id='by_age'></a>

[back to summary](#summary)

## 2.2 - GROUP BY AGE:

Overview of the age categories:

In [None]:
# overview of the age categories
arr_age = df['age'].unique()
arr_age 

Grouping the main dataframe by age and calculating the total number of suicides per age category:

In [None]:
a = df.groupby('age') # grouping by age

In [None]:
# total suicides per age category, sorted from highest to lowest
by_age = a['suicides_no'].sum().sort_values(ascending=False)
by_age = by_age.to_frame('Tot_Suicides')
by_age

Plotting the result:

In [None]:
# highlighting the number of suicides per age
graph_by_age = by_age.plot(kind='bar')
graph_by_age.set_xlabel('Age')
plt.title('Tot suicides by age category')

Highlighting the breakdown of total suicides by gender within each age category:

In [None]:
# highlighting the breakdown of tot suicides by gender within each age category
gb_age_sex = df.groupby(['age', 'sex'])
gb_age_sex = gb_age_sex[['suicides_no']].sum()
gb_age_sex

In [None]:
gb_age_sex.plot(kind='barh', figsize=(10, 10))
plt.title('suicides number by gender per each age category')

After grouping by age and country and calculating the corresponding number of suicides, we can plot the top 5 countries for each age category:

In [None]:
gb_age_country = df.groupby(['age', 'country'])
gb_age_country = gb_age_country[['suicides_no']].sum()
gb_age_country.sample(10)

In [None]:
# plotting per each age category the top 5 countries
ages = df['age'].unique()
def plotting_data_frame(data_f, iteration):
    for item in iteration:
        new = data_f.loc[item]
        largest = new.nlargest(5, 'suicides_no')
        largest.plot(kind='barh')
        plt.title(f'{item}')
        plt.show()


plotting_data_frame(gb_age_country, ages)
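The function above slices one age category at a time. As an alternative sketch, the top entries of every category can be collected in a single table by sorting first and then taking the head of each group; the frame and its values are made up for illustration:

```python
import pandas as pd

# hypothetical totals indexed by (age, country), like gb_age_country
toy = pd.DataFrame(
    {'suicides_no': [30, 10, 20, 5, 50, 40]},
    index=pd.MultiIndex.from_tuples(
        [('15-24 years', 'A'), ('15-24 years', 'B'), ('15-24 years', 'C'),
         ('75+ years', 'A'), ('75+ years', 'B'), ('75+ years', 'C')],
        names=['age', 'country'],
    ),
)

# sort from largest to smallest, keep the first 2 rows of each age category,
# then restore the (age, country) ordering for readability
top2 = (toy.sort_values('suicides_no', ascending=False)
           .groupby(level='age')
           .head(2)
           .sort_index())
print(top2)
```

A single table like this is handy for checking the rankings numerically, while the per-category plots remain easier to read at a glance.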

<a id='by_gender'></a>

[back to summary](#summary)

## 2.3 - GROUP BY GENDER:

Grouping by gender with corresponding suicides number:

In [None]:
sx = df.groupby('sex') # grouping by gender

In [None]:
# tot amount of suicides by gender
by_sex = sx['suicides_no'].sum().to_frame('Tot_Suicides')
by_sex

In [None]:
by_sex.plot(kind='pie', subplots=True, legend=False, figsize=(5, 5))
plt.title('Tot suicides by gender')

Now we can group by country and sex to highlight the male/female suicides ratio:

In [None]:
# grouping by country and sex
gb_country_gender = df.groupby(['country', 'sex'])
gb_country_gender = gb_country_gender[['suicides_no']].sum()

In [None]:
# pivoting the data frame to have male and female as columns
new_df = gb_country_gender.pivot_table(values='suicides_no', index=['country'], columns=['sex'])
new_df

In [None]:
# creating a new column named 'ratio' with the male/female suicides ratio
new_df['ratio'] = new_df['male'] / new_df['female']
new_df

Check for any 'nan' or 'inf' values in the 'ratio' column:

In [None]:
# checking if there is any 'nan' or 'inf' values:
new_df['ratio'].values

Replacing 'inf' values with 'nan' and then dropping the rows where the ratio is null:

In [None]:
# replacing inf values with nan values (assigned back, since inplace on a selected column may not update new_df)
new_df['ratio'] = new_df['ratio'].replace(np.inf, np.nan)
# dropping the rows with a nan ratio
new_df = new_df.dropna(subset=['ratio'])
new_df['ratio'].values

In [None]:
new_df.reset_index(inplace=True)

We are now able to highlight countries with a specific ratio.

In this example we can see the countries with a ratio higher than 5:

In [None]:
condition = new_df['ratio'] > 5
new_df[condition]
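How an `inf` appears in the ratio, and how it can be masked, is easy to see on made-up numbers. This sketch assumes a pivoted frame with `male` and `female` columns like `new_df`; the country names and counts are hypothetical:

```python
import numpy as np
import pandas as pd

# hypothetical pivoted totals; country 'B' has no recorded female cases
toy = pd.DataFrame(
    {'female': [10, 0, 4], 'male': [30, 7, 24]},
    index=pd.Index(['A', 'B', 'C'], name='country'),
)

# dividing by a zero count yields inf instead of raising an error
toy['ratio'] = toy['male'] / toy['female']
# mask the undefined ratio, then drop the affected rows
toy['ratio'] = toy['ratio'].replace(np.inf, np.nan)
clean = toy.dropna(subset=['ratio'])
print(clean[clean['ratio'] > 5])
```

Dropping the rows from the frame itself, rather than from a selected column, keeps the country labels and the ratios aligned.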

<a id='by_gen'></a>

[back to summary](#summary)

## 2.4 - GROUP BY GENERATION:

Grouping by generation and plotting:

In [None]:
sg = df.groupby('generation')

In [None]:
by_gen = sg['suicides_no'].sum()
# ordering the generations chronologically ('Millenials' is spelled as in the data set)
by_gen = by_gen.loc[['G.I. Generation', 'Silent', 'Boomers', 'Generation X', 'Millenials', 'Generation Z']]
by_gen = by_gen.to_frame('Tot_Suicides')
by_gen

In [None]:
graph_by_gen = by_gen.plot(kind='barh', legend=False)
graph_by_gen.set_ylabel('Generation')
graph_by_gen.set_xlabel('Tot Suicides')
plt.title('Tot suicides by generation')

<a id='by_country'></a>

[back to summary](#summary)

## 2.5 - GROUP BY COUNTRY:

First grouping by country to highlight the top 15 countries with highest/lowest values:

In [None]:
gb_country = df.groupby('country') # grouping the dataframe by country

In [None]:
# total suicides per country
by_country = gb_country['suicides_no'].sum().to_frame('Tot_Suicide')

# visualising the top 15 countries by total number of suicides
by_country_largest = by_country.nlargest(15, 'Tot_Suicide') 
by_country_largest.plot(kind='barh', figsize=(10, 8))
plt.title('Top 15 countries per suicide no')

In [None]:
# visualising the 15 countries with the lowest tot number of suicides
by_country_smallest = by_country.nsmallest(15, 'Tot_Suicide')
by_country_smallest.plot(kind='barh', figsize=(10, 8))
plt.title('15 countries with the lowest suicide no')

Reusing the grouping by 'country' and 'sex' from the Group By Gender section, we can visualise the total number of suicides per gender in the 15 countries with the highest number of suicides:

In [None]:
# re-calling the dataframe previously made grouping by 'country' and 'sex' in the By Gender section.
gb_country_gender.head(10)

In [None]:
# visualising the tot of suicides per gender on the 15 countries with the highest number of suicides
index_largest = by_country_largest.index
df_country_gender = gb_country_gender.loc[index_largest, 'suicides_no'].to_frame()
df_country_gender.plot(kind='barh', figsize=(10, 10))
plt.title('Tot suicides by gender on top 15 countries')

Looking at the previous analysis by year, we can select the years in which the curve rose to its peak, the highest number of cases (1990 to 1999). For each of those years we will highlight the countries with the highest number of cases.

In [None]:
gb_year_country = df.groupby(['year', 'country']) # grouping by year/country
gb_year_country = gb_year_country[['suicides_no']].sum()
gb_1990_1999 = gb_year_country.loc[1990:1999]
gb_1990_1999

In [None]:
years = list(range(1990, 2000))
plotting_data_frame(gb_1990_1999, years) # using function plotting_data_frame()

Russian Federation and United States seem to be the two countries with the highest number of cases. We can expect two very large countries to have a higher total number of cases compared to smaller countries. Following this observation, we should take the total population into account. We therefore proceed by organising the data to include the total population.

We proceed by grouping by 'country' and 'year', calculating the total number of suicides and the population for each year.

In [None]:
# grouping by country/year
gb_country_year_population = df.groupby(['country', 'year']) 
# getting the tot amount of population per year
gb_country_year_population = gb_country_year_population[['suicides_no','population']].sum() 
gb_country_year_population

Now we select the 5 countries with the highest number of suicides, so that the data can be plotted separately for each of them.

In [None]:
# getting the top 5 countries indexes:
top_5 = by_country.nlargest(5, 'Tot_Suicide')
top_5_indexes = top_5.index

I'll visualize the data for these 5 countries to show the number of suicides alongside the population's growth. Seeing how the population changed over the years is a good way to spot events in the country's history that affected the population number. Those events might affect the number of suicides too.

In [None]:
warnings.filterwarnings('ignore') # to ignore warning related to older pandas version
def plotting(data_f, iteration):
    for item in iteration:
        data_f.loc[[item]].unstack(level=0).plot(subplots=True, figsize=(8, 8))

plotting(gb_country_year_population, top_5_indexes)

We proceed to calculate the ratio between the number of suicides and the population. Calculations are made per 100 thousand people.

First, we select from the main dataframe the columns needed for our analysis and we set country as the index:

In [None]:
df_trimmed = df[['country', 'suicides_no', 'population']]
df_trimmed = df_trimmed.set_index('country')
df_trimmed

In [None]:
df_trim_gb = df_trimmed.groupby('country')

In [None]:
df_russia = df_trim_gb.get_group('Russian Federation')
df_usa = df_trim_gb.get_group('United States')
df_japan = df_trim_gb.get_group('Japan')
df_france = df_trim_gb.get_group('France')
df_ukr = df_trim_gb.get_group('Ukraine')
df_germany = df_trim_gb.get_group('Germany')
df_korea = df_trim_gb.get_group('Republic of Korea')
df_brazil = df_trim_gb.get_group('Brazil')
df_pol = df_trim_gb.get_group('Poland')
df_uk = df_trim_gb.get_group('United Kingdom')
top10_list = [df_russia, df_usa, df_japan, df_france, df_ukr, df_germany, df_korea, df_brazil, df_pol, df_uk]


We proceed with calculating the suicides/100k population ratio and plotting the result for the 10 countries selected above:

In [None]:
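dicts_list = []
def making_dict(country_df):
    # one summary row per country; population is summed so that it matches the summed suicides
    summary = {'Country': country_df.index[0],
               'Tot population': country_df['population'].sum(),
               'Tot suicides': country_df['suicides_no'].sum()}
    dicts_list.append(summary)

for df_countries in top10_list:
    making_dict(df_countries)

df_pop = pd.DataFrame(dicts_list).set_index('Country')
# calculating suicides/100k population: suicides divided by population, scaled to 100,000 people
# (dividing population by suicides instead would rank the countries in the opposite direction)
df_pop['suicides/100k pop'] = df_pop['Tot suicides'] / df_pop['Tot population'] * 100_000
df_pop[['suicides/100k pop']].plot(kind='barh', legend=True, figsize=(10, 8))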

Visualising the ratio values in a dataframe:

In [None]:
df_pop[['suicides/100k pop']].sort_values(by='suicides/100k pop', ascending=False)
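For reference, a rate per 100k population is the number of suicides divided by the population, multiplied by 100,000. A minimal sketch on invented figures shows why a large country can lead the totals without leading the rate (country names and numbers are hypothetical, not taken from the data set):

```python
import pandas as pd

# hypothetical rows, one per country and year
toy = pd.DataFrame({
    'country': ['Large', 'Large', 'Small', 'Small'],
    'suicides_no': [3000, 3000, 400, 600],
    'population': [30_000_000, 30_000_000, 2_000_000, 2_000_000],
})

# sum suicides and population over the whole period for each country
totals = toy.groupby('country')[['suicides_no', 'population']].sum()
# suicides per 100k population
totals['suicides/100k pop'] = totals['suicides_no'] / totals['population'] * 100_000
print(totals.sort_values('suicides/100k pop', ascending=False))
```

In this toy case the larger country has six times the cases of the smaller one, yet the smaller country has the higher rate.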

[back to summary](#summary)

<a id='conclusion'></a>

## Conclusion.


- Grouping by year:

According to the data set, the year with the highest number of cases was 1999, the lowest was 2016.
From 1990 to 1999 we can observe a fairly steep growth in cases, followed by
a stable number around the year 2000.
The number of cases dropped slightly around 2001, then remained stable until 2004,
where we can observe another drop in cases.
Another growth, albeit less steep than before, is observable until 2010, after which the cases
seem to slowly drop.
The ratio between male and female suicide numbers remains fairly constant throughout
the years, with around 3 times more cases in the male category.

- Grouping by age:

The age category with the highest number of recorded cases is 35-54 years,
followed by 55-74, 25-34, 15-24, over 75 and 5-14. The gaps between consecutive categories
narrow down the list, the largest being the one between the first two age categories.
The gender gap holds here too, at a ratio of roughly
3 to 1, and widens in the top category (35-54), where it is nearly 4 to 1.
Here is a summary of the countries in the top 5 list for each age category:

35-54 

1. Russian Federation
2. United States
3. Japan
4. Ukraine
5. France

55-74

1. Japan
2. Russian Federation
3. United States
4. Ukraine
5. Germany

25-34

1. Russian Federation
2. United States
3. Japan
4. Brazil
5. Ukraine

15-24

1. Russian Federation
2. United States
3. Japan
4. Brazil
5. Mexico

75+

1. Japan
2. United States
3. Russian Federation
4. France
5. Germany

5-14

1. United States
2. Russian Federation
3. Mexico
4. Brazil
5. Japan

We observe that Russian Federation and United States appear in the top three of every single
category, swapping positions only with Japan, which is also in the top three of every
category apart from 5-14, where it is in the lowest position.
Mexico and Brazil, appearing twice and three times respectively, cover the younger end
of the age range (15-24 / 5-14 for Mexico, 5-14 / 15-24 / 25-34 for Brazil),
while France and Germany, both appearing twice, cover the older end.
Ukraine is the other European country, appearing three times and covering 25-34 / 35-54 / 55-74 years of age.

- Grouping by gender:

As mentioned before, the gender gap seems to be constant around 3 times more 
male cases than female.

- Grouping by generation:

The generation with the highest total number of suicides seems to be the Boomers, followed by
the Silent generation, Generation X, Millennials and the G.I. Generation.

- Grouping by country:

After covering the 15 countries with the highest and lowest number of cases,
we observe that Russian Federation and United States take the top 2 spots, with
quite a large gap between them and the third spot (Japan).
The gender analysis seems to show the same result, with much the same ratio.
Considering the years from 1990 to 1999, where we observe a steady growth in cases, I wanted to highlight the top 5 countries by number of suicides.
We observe that Russian Federation is always in first place, followed by USA and Japan in almost every year.
Only 1998 and 1999 show Japan overtaking USA. Ukraine and Germany complete the list, appearing every year with very close numbers. We see France at the bottom of the list only in 1990 and 1991.
This result for Russian Federation and United States is predictable because of the size of the two nations.
Before removing this factor in the next analysis, I thought it useful to highlight the relationship between population growth and number of suicides in the top 5 countries.
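After relating the number of suicides to the population, we observe a completely different scenario: with the ratio taken as mean population divided by total suicides, Brazil and the United Kingdom take the first two positions, albeit with a large gap. A large value of that ratio means few suicides relative to population, so read as suicides per 100k population it points to these two countries having the lowest relative number of cases among the ten rather than the highest.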


[back to summary](#summary)