## Overview of the material

***
- [Data Exploration and Augmentation of Missing Values](#augmentation)
- [Data Overview: Basic Bubble Plots](#bubbles)
- [Republic of Korea: A more detailed look](#korea)

In [1]:
# height and width of plots (feel free to adjust, if plots are too small/big on your screen)
HEIGHT = 600
WIDTH = 800

# import libraries we need
library(tidyverse)
library(ggplot2)
library(plotly)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0.9000     ✔ purrr   0.3.1
✔ tibble  2.0.1          ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3          ✔ stringr 1.4.0
✔ readr   1.3.1          ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout



Data Exploration and Augmentation of Missing Values
===========
<a id=augmentation></a>
As promised, we will first have a look at the dataset and impute missing values. I will try to include a quick look into the dataframe after every important step, so nobody loses track of what we're doing and we can skim through this part without worrying too much about the code.

In [3]:
suicides = read.csv('../input/master.csv')
head(suicides, 13)

country,year,sex,age,suicides_no,population,suicides.100k.pop,country.year,HDI.for.year,gdp_for_year....,gdp_per_capita....,generation
Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers
Albania,1987,female,75+ years,1,35600,2.81,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,female,35-54 years,6,278800,2.15,Albania1987,,2156624900,796,Silent
Albania,1987,female,25-34 years,4,257200,1.56,Albania1987,,2156624900,796,Boomers
Albania,1987,male,55-74 years,1,137500,0.73,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,female,5-14 years,0,311000,0.0,Albania1987,,2156624900,796,Generation X


### Check for NA-values.

In [4]:
suicides %>%
  select_if(function(x) any(is.na(x))) %>% 
  summarise_all(list(~sum(is.na(.))))

HDI.for.year
19456


Apart from HDI.for.year, no column contains NA values. We're not going to use this column, as there are too many values to replace or estimate. 
As you can see above, records for **Albania begin only in 1987**. Let's check how many more combinations of country, year, sex and age group have no values at all.

In [5]:
years = unique(suicides$year)
countries = unique(suicides$country)
sex = unique(suicides$sex)
ages = unique(suicides$age)

all_combinations = expand.grid(years, countries, sex, ages)
data.table::setnames(all_combinations, 
                     old = c('Var1', 'Var2', 'Var3', 'Var4'), 
                     new = c('year','country','sex','age'))

suicides_with_all_levels = left_join(all_combinations, suicides, 
                                     by = c('year','country','sex','age')) %>%
                             arrange(country, year)
head(suicides_with_all_levels, 5)

year,country,sex,age,suicides_no,population,suicides.100k.pop,country.year,HDI.for.year,gdp_for_year....,gdp_per_capita....,generation
1985,Albania,male,15-24 years,,,,,,,,
1985,Albania,female,15-24 years,,,,,,,,
1985,Albania,male,35-54 years,,,,,,,,
1985,Albania,female,35-54 years,,,,,,,,
1985,Albania,male,75+ years,,,,,,,,


**As you can see**, the above table now contains entries for all combinations of year, country, sex and age. 
In particular, we now have rows for Albania in 1985 and 1986, in which the remaining columns are filled with NA values. 
Let's count the NA values in the columns of our new dataframe again!

In [6]:
suicides_with_all_levels %>%
  select_if(function(x) any(is.na(x))) %>% 
  summarise_all(list(~sum(is.na(.))))

suicides_no,population,suicides.100k.pop,country.year,HDI.for.year,gdp_for_year....,gdp_per_capita....,generation
10964,10964,10964,10964,30420,10964,10964,10964


It seems we're missing data for around 11,000 combinations. To get smooth behaviour in the bubble plots later, we need to estimate these missing values. We will add an indicator column to remember which rows we estimated.

As the column gdp_for_year.... is just a number that can easily be derived from gdp_per_capita...., we will drop it. I will not be using the HDI.for.year and generation columns either, so I will drop these as well.
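The expansion to all combinations and the indicator column can be sketched on a toy table. This is a minimal sketch with made-up values. It uses `tidyr::complete()` as a compact alternative to the `expand.grid()` plus `left_join()` approach above, and the names `toy` and `toy_all` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy data (made-up values): country B has no record for 1985
toy <- tibble(
  country    = c("A", "A", "B"),
  year       = c(1985, 1986, 1986),
  population = c(100, 110, 200)
)

# complete() adds one row for every missing country/year combination
# and fills the remaining columns with NA
toy_all <- toy %>%
  complete(country, year) %>%
  mutate(estimated = is.na(population))  # TRUE marks rows that still need a value

toy_all
sum(toy_all$estimated)  # number of combinations without data
```

Setting the flag before any filling takes place is what lets us tell recorded rows from estimated ones later on.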

In [7]:
suicides_with_all_levels['groups'] = sprintf("%s.%s.%s", 
                                        suicides_with_all_levels$country,
                                        suicides_with_all_levels$sex,
                                        suicides_with_all_levels$age)

suicides_with_all_levels['estimated'] = is.na(suicides_with_all_levels$population)

grouped = suicides_with_all_levels %>%
             select('year', 'country', 'sex', 'age', 'suicides_no', 
                    'population', 'suicides.100k.pop','gdp_per_capita....', 'groups', 'estimated') %>%
             arrange(country, year) %>%
             group_by(groups)

suicides1 = grouped %>%
             fill('population', 'suicides.100k.pop', 'suicides_no', 'gdp_per_capita....',
                  .direction = 'up') %>%
             fill('population', 'suicides.100k.pop', 'suicides_no', 'gdp_per_capita....',
                  .direction = 'down') %>%
             ungroup() %>%
             filter(year < 2000)

suicides2 = grouped %>%
             fill('population', 'suicides.100k.pop', 'suicides_no', 'gdp_per_capita....',
                  .direction = 'down') %>%
             fill('population', 'suicides.100k.pop', 'suicides_no', 'gdp_per_capita....',
                  .direction = 'up') %>%
             ungroup() %>%
             filter(year >= 2000)

suicides = rbind(suicides1, suicides2)

# there might still be countries, where our estimation failed
countries_with_na = unique(suicides[is.na(suicides$population),]$country)
suicides = suicides %>% filter(!country %in% countries_with_na)

head(suicides, 10)

year,country,sex,age,suicides_no,population,suicides.100k.pop,gdp_per_capita....,groups,estimated
1985,Albania,female,15-24 years,14,289700,4.83,796,Albania.female.15-24 years,True
1986,Albania,female,15-24 years,14,289700,4.83,796,Albania.female.15-24 years,True
1987,Albania,female,15-24 years,14,289700,4.83,796,Albania.female.15-24 years,False
1988,Albania,female,15-24 years,8,295600,2.71,769,Albania.female.15-24 years,False
1989,Albania,female,15-24 years,5,299900,1.67,833,Albania.female.15-24 years,False
1990,Albania,female,15-24 years,7,292400,2.39,251,Albania.female.15-24 years,True
1991,Albania,female,15-24 years,7,292400,2.39,251,Albania.female.15-24 years,True
1992,Albania,female,15-24 years,7,292400,2.39,251,Albania.female.15-24 years,False
1993,Albania,female,15-24 years,10,285300,3.51,437,Albania.female.15-24 years,False
1994,Albania,female,15-24 years,6,282600,2.12,697,Albania.female.15-24 years,False
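The order of the two `fill()` calls matters whenever a gap lies between two recorded years. The following is a minimal sketch on a single toy series with made-up values; the names `toy`, `up_then_down` and `down_then_up` are hypothetical.

```r
library(dplyr)
library(tidyr)

# One group with leading NAs and a gap between two recorded years
toy <- tibble(
  group = "A",
  year  = 1985:1990,
  value = c(NA, NA, 14, 8, NA, 7)
)

# Fill from the next recorded year first, then from the previous one
up_then_down <- toy %>%
  group_by(group) %>%
  fill(value, .direction = "up") %>%
  fill(value, .direction = "down") %>%
  ungroup()

# Fill from the previous recorded year first, then from the next one
down_then_up <- toy %>%
  group_by(group) %>%
  fill(value, .direction = "down") %>%
  fill(value, .direction = "up") %>%
  ungroup()

up_then_down$value  # 14 14 14 8 7 7
down_then_up$value  # 14 14 14 8 8 7
```

In the gap, up-then-down takes the later value (7) and down-then-up the earlier one (8). The cell above uses the former for years before 2000 and the latter from 2000 onwards, which is why the estimated Albanian rows for 1990 and 1991 repeat the 1992 values.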


Basic Bubble Plots
===========
<a id='bubbles'></a>

Having filled our empty levels with more or less meaningful data, we can now move on to the interesting part and craft some interactive bubble plots. For this, we will use ggplot2 to declare the actual graphs and feed them to plotly, a very powerful R library for creating all kinds of interactive plots.

Let's first have a look at the overall trends within different countries over the years.

In [8]:
suicides['year_and_country'] = sprintf("%s.%s", suicides$country, suicides$year)

suicides_by_country = suicides %>%
                        group_by(year_and_country) %>%
                        summarise(
                            year = head(year, 1),
                            country = head(country, 1),
                            estimated = head(estimated, 1),
                            population = sum(population),
                            suicides_no = sum(suicides_no),
                            suicides_per_100k = mean(suicides.100k.pop),  # unweighted mean of the group rates
                            gdp_per_capita = mean(gdp_per_capita....))

head(suicides_by_country)

year_and_country,year,country,estimated,population,suicides_no,suicides_per_100k,gdp_per_capita
Albania.1985,1985,Albania,True,2709600,73,2.654167,796
Albania.1986,1986,Albania,True,2709600,73,2.654167,796
Albania.1987,1987,Albania,False,2709600,73,2.654167,796
Albania.1988,1988,Albania,False,2764300,63,2.705,769
Albania.1989,1989,Albania,False,2803100,68,2.783333,833
Albania.1990,1990,Albania,True,2822500,47,1.5,251


In [15]:
plt = suicides_by_country %>%
        ggplot(aes(x = gdp_per_capita, 
                   y = suicides_per_100k,
                   size = population,
                   frame = year,
                   group = country)) +
        geom_point(aes(alpha = 1 - estimated*0.5), colour='magenta') +
        scale_x_log10() + 
        scale_alpha_continuous(range = c(0.4, 0.8)) + 
        labs(title= "Suicide rates grouped by country from 1985 to 2016") +
        #geom_hline(aes(yintercept = mean(suicides_per_100k)), color="gray") + # <- uncomment to add a line at the mean rate
        theme_minimal()

pltly = ggplotly(plt, tooltip = c('country', 'population', 'suicides_per_100k', 'gdp_per_capita'))

# this is just a workaround to make the kaggle kernel show the plot correctly
htmlwidgets::saveWidget(pltly, "plot1.html")
IRdisplay::display_html(sprintf('<iframe src="plot1.html" height=%d width=%d></iframe>', HEIGHT, as.integer(WIDTH/1.2)))

**Interpretation:**
As you can see, we have plotted GDP per capita on a logarithmic axis against the suicides per 100,000 people. The size of a bubble indicates the total population of that country, and higher transparency indicates that we estimated the value. Note how all the estimated bubbles stay at the same position. This makes sense if you recall that we estimated the values by filling in the value of the next or previous recorded year.

Now, what can we actually find out from this? If you press 'play', the animation steps through the years. You can also jump to a year of your choice by moving the slider on the timeline.
Note how most of the bubbles tend to move to the right over the years. This makes sense, as GDP per capita has risen in most countries over the last decades. Interestingly, we see that trend broken in 2009, where most GDPs seem to drop significantly. This coincides with the global financial crisis of 2008-2009, which most of us remember well.
Looking at the number of suicides per 100,000 people, we can also make out a trend towards lower numbers for countries that move rightwards towards a higher GDP per capita.
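The transparency encoding in the plot code above (`alpha = 1 - estimated*0.5` combined with `scale_alpha_continuous(range = c(0.4, 0.8))`) can be traced by hand. The sketch below assumes that the continuous alpha scale rescales the observed data range linearly onto the given output range, and it mimics that step with `scales::rescale()`. The variable names are hypothetical.

```r
library(scales)

estimated <- c(FALSE, TRUE)

# Value mapped to alpha in aes(): 1.0 for recorded rows, 0.5 for estimated rows
alpha_raw <- 1 - estimated * 0.5

# Linear rescaling of the data range [0.5, 1] onto the output range [0.4, 0.8]
alpha_drawn <- rescale(alpha_raw, to = c(0.4, 0.8))

data.frame(estimated, alpha_raw, alpha_drawn)
```

Under this assumption, recorded bubbles are drawn with an opacity of 0.8 and estimated ones with 0.4, so the fainter bubbles are the estimated ones.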

Remember that the dataset provided us with age groups? We can use them to plot the above **developments by age and country**.

In [17]:
suicides['year_age_country'] = sprintf('%s.%s.%s', suicides$age, suicides$year, suicides$country)


suicides_by_age = suicides %>%
                    group_by(year_age_country) %>%
                    summarise(
                            year = head(year, 1),
                            age = head(age, 1),
                            country = head(country, 1),
                            estimated = head(estimated, 1),
                            population = sum(population),
                            suicides_no = sum(suicides_no),
                            suicides_per_100k = mean(suicides.100k.pop),  # unweighted mean of the group rates
                            gdp_per_capita = mean(gdp_per_capita....))

head(suicides_by_age, 14)

year_age_country,year,age,country,estimated,population,suicides_no,suicides_per_100k,gdp_per_capita
15-24 years.1985.Albania,1985,15-24 years,Albania,True,602600,35,5.77,796
15-24 years.1985.Antigua and Barbuda,1985,15-24 years,Antigua and Barbuda,False,15376,0,0.0,3850
15-24 years.1985.Argentina,1985,15-24 years,Argentina,False,4769400,225,4.695,3264
15-24 years.1985.Armenia,1985,15-24 years,Armenia,True,531000,13,2.415,756
15-24 years.1985.Aruba,1985,15-24 years,Aruba,True,10365,0,0.0,17949
15-24 years.1985.Australia,1985,15-24 years,Australia,False,2659800,384,14.26,12374
15-24 years.1985.Austria,1985,15-24 years,Austria,False,1287320,257,19.775,9759
15-24 years.1985.Azerbaijan,1985,15-24 years,Azerbaijan,True,1339500,20,1.475,1439
15-24 years.1985.Bahamas,1985,15-24 years,Bahamas,False,47200,0,0.0,11393
15-24 years.1985.Bahrain,1985,15-24 years,Bahrain,False,87500,0,0.0,9980
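Inside `summarise()`, later expressions see a column that has just been redefined. The sketch below uses the two Albanian rows for ages 15-24 from the `head(suicides)` preview to show what this means for a rate computed after `population = sum(population)`. The column names `weighted`, `unweighted` and `after_redefinition` are hypothetical and only serve the illustration.

```r
library(dplyr)

# Albania 1987, age 15-24 (values from the head(suicides) preview)
toy <- tibble(
  sex        = c("male", "female"),
  population = c(312900, 289700),
  rate       = c(6.71, 4.83)   # suicides per 100k
)

res <- toy %>%
  summarise(
    # computed while `population` still refers to the per-row column
    weighted   = sum(population * rate) / sum(population),
    unweighted = mean(rate),
    # from here on, `population` is the summed total (a single number)
    population = sum(population),
    after_redefinition = sum(population * rate) / (as.numeric(sum(population)) * n())
  )

res
```

In this sketch `after_redefinition` equals `unweighted` (5.77), the value listed for Albania in the table above, while the population-weighted rate is slightly higher at about 5.81. The aggregated rates in this notebook therefore appear to be plain means of the group rates. For a population-weighted rate, the expression would have to come before `population` is redefined.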


In [18]:
plt = suicides_by_age %>%
        ggplot(aes(x = gdp_per_capita, 
                   y = suicides_per_100k,
                   colour = age,
                   group = country,
                   frame = year,
                   size = population)) +
        geom_point(aes(alpha = 1 - estimated*0.5)) +
        scale_alpha_continuous(range = c(0.4, 0.8)) +
        scale_x_log10() + 
        labs(title= "Suicide rates grouped by age and country from 1985 to 2016") +
        theme_minimal()

# highlight() works on the plotly object, so it is applied after ggplotly()
pltly = ggplotly(plt) %>%
          highlight("plotly_selected")

# this is just a workaround to make the kaggle kernel show the plot correctly
htmlwidgets::saveWidget(pltly, "plot2.html")
IRdisplay::display_html(sprintf('<iframe src="plot2.html" height=%d width=%d></iframe>', HEIGHT, WIDTH))

With the age groups taken into account, we now have six times as many bubbles as in the last plot. To keep the plot readable, we added colors for the age groups. Apart from that, everything is as in the last plot. If you want more information about a bubble, you can hover over it (you might need to use the zoom function or filtering to hit it, though). 

Throughout the years, the age group of 5 to 14 has the lowest suicide rate, while 75+ has the highest. We can use the Box Select tool to keep the focus on interesting bubbles. For example, we can choose the 'Republic of Korea'/'75+' bubble (the highest purple bubble in the last frame, 2016). This group seems to be the highest-risk group for suicides from around 2003 onwards. 

Now, I am not an expert in South Korean society of that period, so let's ask [Wikipedia](https://en.wikipedia.org/wiki/Suicide_in_South_Korea) for information. It seems we have identified a well-known societal problem:

> One factor of suicide among elderly South Koreans is due to the amount of widespread poverty among senior citizens in South Korea, with nearly half of the country's elderly population living below the poverty line. Combined with a poorly-funded social safety net for the elderly, this can result in them killing themselves not to be a financial burden on their families, since the old social structure where children looked after their parents has largely disappeared in the 21st century. As a result, people living in rural areas tend to have higher suicide rates. This is due to extremely high rates of elderly discrimination, especially when applying for jobs, with 85.7% of those in their 50s experiencing discrimination. Age discrimination also directly correlates to suicide, on top of influencing poverty rates.

Of course, there's more to discover in this plot, and I would be excited to see your findings in the comments! 


Republic of Korea: A More Detailed Look Into Recent Developments
=======
<a id='korea'></a>
For now, we will stick to Korea as an example of what other animated plots we can use plotly for. As we have just found out, there seems to be a problem with suicides, especially among the older generation. We can visualize the trends in all age groups over the recent years using the plot_ly function:

In [25]:
# got some inspiration from https://plot.ly/r/cumulative-animations/ for this
accumulate_by = function(dat, var) {
  var <- lazyeval::f_eval(var, dat)
  lvls <- plotly:::getLevels(var)
  dats <- lapply(seq_along(lvls), function(x) {
    cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
  })
  dplyr::bind_rows(dats)
}

d <- suicides_by_age %>%
  filter(country == 'Republic of Korea') %>%
  accumulate_by(~year)

pltly = d %>%
         plot_ly(
             x = ~year, 
             y = ~suicides_per_100k,
             split = ~age,
             frame = ~frame,  #we created this column in accumulate_by
             type = 'scatter',
             mode = 'lines',
             color = ~age
         ) %>%
         layout(
             title = "Suicide Rates By Age Groups In Korea, 1985 - 2016"
         ) %>%
         animation_opts(
             frame = 100, 
             transition = 0, 
             redraw = FALSE
         ) %>%
         animation_slider(
             hide = T
         ) %>%
         animation_button(
             x = 1, xanchor = "right", y = 0, yanchor = "bottom"
         )

# this is just a workaround to make the kaggle kernel show the plot correctly
htmlwidgets::saveWidget(pltly, "plot3.html")
IRdisplay::display_html(sprintf('<iframe src="plot3.html" height=%d width=%d></iframe>', HEIGHT, WIDTH))

“number of items to replace is not a multiple of replacement length”

As you can see, this plot is not as interactive as the previous ones, as it only has a play button. However, I find it a useful tool for getting a better feeling for the plotted trends. The rates for all age groups seem to be more or less constant until the early 90s. Throughout that period, the 75+ group has the highest suicide rate at around 20 suicides per 100,000 people. From around 1993 onwards, we can observe a slight increase in suicide rates for all age groups above 25 years. For ages 55-74 and especially 75+, this trend speeds up significantly from 2000 to around 2004, where both groups roughly double their suicide rates, from 28.035 to around 53.83 and from 56.88 to 125.405 suicides per 100,000 people respectively. After what we saw in the previous plot, we expected this for the 75+ group; however, it is worrying to see that even the generation from 55 to 74 seems to fall victim to it.
On the other hand, we can also see that while some of the younger age groups have experienced significantly increased suicide rates after 1993 as well, their trends are not as drastic. 
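The cumulative animation relies on the data layout produced by `accumulate_by`: each frame contains all rows up to and including its year, so the lines appear to grow. The sketch below shows that layout on a toy table with made-up values. `accumulate_rows` is a simplified, hypothetical variant that takes the column name as a string and uses `sort(unique())` instead of plotly's internal helper.

```r
library(dplyr)

# Simplified illustration of the accumulate_by idea
accumulate_rows <- function(dat, column) {
  lvls <- sort(unique(dat[[column]]))
  frames <- lapply(lvls, function(lvl) {
    # keep every row up to the current level and tag it with that frame
    part <- dat[dat[[column]] <= lvl, ]
    part$frame <- lvl
    part
  })
  bind_rows(frames)
}

toy <- tibble(year = c(2000, 2001, 2002), rate = c(10, 12, 15))
acc <- accumulate_rows(toy, "year")

table(acc$frame)  # 1 row in frame 2000, 2 in frame 2001, 3 in frame 2002
```

The price of this layout is size: with n years per line, the accumulated table holds n(n + 1)/2 rows per line instead of n.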

We can compare the above plot with the average global trends:

In [26]:
suicides_by_age['year_age'] = sprintf('%s.%s', suicides_by_age$year, suicides_by_age$age)

d2 = suicides_by_age %>%
  filter(country != 'Republic of Korea') %>%
  group_by(year_age) %>%
  summarise(
      year = head(year, 1),
      age = head(age, 1),
      population = sum(population),
      suicides_no = sum(suicides_no),
      # unweighted mean of the country rates
      suicides_per_100k = mean(suicides_per_100k)) %>%
  accumulate_by(~year)

d['label'] = sprintf('%s %s', d$country, d$age)
d2['label'] = sprintf('World Average %s', d2$age)
d_and_d2 = rbind(select(d, 'label', 'year', 'suicides_per_100k', 'age', 'frame'), 
                 select(d2, 'label', 'year', 'suicides_per_100k', 'age', 'frame'))

In [27]:
pltly = plot_ly() %>%
         add_trace(
             x = ~year, 
             y = ~suicides_per_100k,
             split = ~label,
             frame = ~frame,  #we created this column in accumulate_by
             type = 'scatter',
             mode = 'lines',
             data = d,
             color= ~age,
             opacity = 1.0) %>%
         add_trace( 
             x = ~year, 
             y = ~suicides_per_100k,
             split = ~label,
             frame = ~frame,  #we created this column in accumulate_by
             type = 'scatter',
             mode = 'lines',
             data = d2,
             opacity = 0.4,
             color= ~age) %>%
         layout(
             title = "Suicide Rates By Age Groups: Korea vs. World Average, 1985 - 2016"
         ) %>%
         animation_opts(
             frame = 100, 
             transition = 0, 
             redraw = FALSE
         ) %>%
         animation_slider(
             hide = T
         ) %>%
         animation_button(
             x = 1, xanchor = "right", y = 0, yanchor = "bottom"
         )

# this is just a workaround to make the kaggle kernel show the plot correctly
htmlwidgets::saveWidget(pltly, "plot4.html")
IRdisplay::display_html(sprintf('<iframe src="plot4.html" height=%d width=%d></iframe>', HEIGHT, WIDTH))

As you can see, the growing suicide rates are not reflected in the global average trends. Worldwide, suicide rates seem to be declining continuously. When we go back and look at the bubble plots of the worldwide trends, we see this observation backed up as well: most of the bubbles drift towards the bottom right (increasing wealth and lower suicide rates).

This kernel is a work in progress, so feel free to leave feedback and ideas for future updates.