# <font color = Red>Global Suicide Rate Analysis</font>

<img src = "files/suicide+mgn1.jpg">

## Project Team:
1. Siddharth Suresh
2. Ying Xiong
3. Jiaxing Qiu
4. Luyuanyuan Yan

<img src = "files/UVA.png">
Each of us is part of the Data Science Institute at the University of Virginia.

## Data Reading and cleaning/pre-processing
The dataset was sourced from Kaggle (https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016)

We start off by importing all the libraries needed and reading in the original dataset.

In [None]:
# import the library needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.offline as py
import requests 
from bs4 import BeautifulSoup as soup
sns.set(font_scale=1.2)

# Reading in the original data set and analyzing the features
df1 = pd.read_csv('master.csv') # Suicide rates data set
df1.head()

After reading the dataset, a good starting point would be to check the number of rows & columns.

In [None]:
df1.shape

We see that the original dataset has 12 columns and 27,820 rows. Next, we check the number of NaN values in each column of the dataset.

In [None]:
df1.isna().sum()

We notice that HDI values for **19,456** out of **27,820** (~70%) rows are NaN. However, we believe that a country's HDI does influence its suicide rate, so we decided to supplement the original dataset with HDI data from an external source (http://hdr.undp.org/en/data).
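
The share of missing values per column can be read off directly rather than computed by hand. The sketch below uses a made-up four-row frame (not the Kaggle data) and relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the suicide dataset (values are made up)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "HDI for year": [np.nan, 0.7, np.nan, np.nan],
})

# isna() yields a boolean mask; its column-wise mean is the fraction of NaN values
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to `df1`, the same expression would be expected to show roughly 0.70 for the 'HDI for year' column, in line with the counts above.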

Going through the other columns, it is safe to assume that certain columns won't help in analyzing the suicide rates, such as:
1. <font color = Red>generation</font> (we believe that trends in suicides based on age groups will have a better interpretation than the generation) 
2. <font color = Red>country-year</font> (it's more like a tag for this dataset, but since we would like to use the groupby function to look at individual factors affecting suicides, this would become irrelevant)
3. <font color = Red>HDI for year</font> (since 70% of the rows are NaN)

We load the HDI dataset into a new dataframe and look at a snapshot of the same.

In [None]:
df2 = pd.read_excel('HDI data(1990-2017).xlsx') # HDI data set
df2.head()

We check the number of columns, rows and NaN values for this dataset as well

In [None]:
df2.shape

The HDI dataset has 189 rows (each representing a unique country) and 30 columns (out of which 28 represent a year, the other two being the 'Country' and 'HDI Rank (2017)')

In [None]:
df2.isna().sum()

We observed many NaN values for the years 1990 to 1999. The HDI-based analysis in this project therefore does not cover the countries and years for which these values are missing.

We shall be combining these two dataframes so as to map the HDI values to the original dataset later. First, we need to make sure that the country names are consistent across the two dataframes. This requires further cleaning up of the dataframes.

In [None]:
df1 = df1.rename(columns = {' gdp_for_year ($) ': 'total_gdp'}) # renaming columns for easier reference

# renaming the country names of the dataset based on the country names of the GeoJSON file (downloaded for advanced visualizations) and maintaining uniformity
df1 = df1.replace(['Bahamas', 'Republic of Korea', 'Russian Federation',
       'Saint Vincent and Grenadines', 'United States', 'Cabo Verde', 'Macau', 'Serbia'], ['The Bahamas', 'South Korea', 'Russia',
       'Saint Vincent and the Grenadines', 'United States of America', 'Cape Verde', 'Macao S.A.R', 'Republic of Serbia'])
df2 = df2.replace(['Bahamas', 'Brunei Darussalam', 'Cabo Verde', 'Congo',
       'Congo (Democratic Republic of the)', "Côte d'Ivoire", 'Eswatini',
       'Guinea-Bissau', 'Hong Kong, China (SAR)',
       'Iran (Islamic Republic of)', 'Republic of Korea',
       "Lao People's Democratic Republic", 'Micronesia',
       'Palestine, State of', 'Russian Federation',
       'Saint Vincent and Grenadines', 'Serbia', 'Syrian Arab Republic',
       'Tanzania (United Republic of)',
       'The former Yugoslav Republic of Macedonia', 'Timor-Leste',
       'United States', 'Venezuela (Bolivarian Republic of)', 'Viet Nam'], ['The Bahamas', 'Brunei', 'Cape Verde', 'Republic of Congo',
       'Democratic Republic of the Congo', "Ivory Coast", 'Swaziland',
       'Guinea Bissau', 'Hong Kong S.A.R.',
       'Iran', 'South Korea',
       "Laos", 'Federated States of Micronesia',
       'Palestine', 'Russia',
       'Saint Vincent and the Grenadines', 'Republic of Serbia', 'Syria',
       'United Republic of Tanzania',
       'Macedonia', 'East Timor',
       'United States of America', 'Venezuela', 'Vietnam'])
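
One possible way to check whether the names now line up is to compare the two sets of country names. This is only a sketch on hypothetical miniature frames (`suicides` and `hdi` are illustrative names, not variables from this notebook). A name can also remain unmatched simply because the HDI source does not cover that country.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames
suicides = pd.DataFrame({"country": ["Russia", "South Korea", "Aruba"]})
hdi = pd.DataFrame({"Country": ["Russia", "South Korea", "Brunei"]})

# Names present in the suicide data with no counterpart in the HDI data
unmatched = sorted(set(suicides["country"]) - set(hdi["Country"]))
print(unmatched)
```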

Once both dataframes are in line with each other and consistent in terms of naming conventions, we can go ahead and start the process of stitching both datasets together.

However, another important step here is to transform the HDI dataset from a 'wide' format to a 'long' format. This will help in merging the two datasets based on the 'Country' & 'Year'.

So, it becomes important to transform the different columns in the HDI dataset (representing different 'Years') into one single column that has each year against each of the unique countries in the dataset. 
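
As a toy illustration of the wide-to-long reshaping described above, the sketch below melts a made-up two-country frame (not the UNDP data). The column names `year` and `HDI` are chosen here for illustration.

```python
import pandas as pd

# Wide format: one row per country, one column per year (values are made up)
wide = pd.DataFrame({
    "Country": ["A", "B"],
    "1990": [0.50, 0.60],
    "1991": [0.51, 0.61],
})

# Long format: one row per (country, year) pair
long_df = pd.melt(wide, id_vars=["Country"], var_name="year", value_name="HDI")
# Year labels come from column headers, so they start out as strings
long_df["year"] = long_df["year"].astype(int)
print(long_df)
```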

In [None]:
df2 = df2.iloc[:,1:] # Dropping the HDI rank column from the data set
df2.head()

In [None]:
df2 = pd.melt(df2, id_vars = ['Country'], var_name = 'Years', value_name = 'HDI values') 
# transforming the data set into 3 columns
df2.head()

In [None]:
df2 = df2.rename(columns = {'Country':'country', 'Years':'year', 'HDI values':'HDI'})
# column uniformity before merging HDI data into the original data set
df2['year'] = df2['year'].astype(int)
# changing the year column from str to int

After cleaning and pre-processing, the two dataframes were merged.

In [None]:
df3 = pd.merge(df1, df2, on = ['country', 'year'], how = 'left')
# merging the two datasets based on 'country' and 'year' on the original dataframe
df3.tail()
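
A left merge keeps every row of the left frame and fills unmatched rows with NaN, so it can be useful to count how many rows found no HDI value. The sketch below uses made-up frames and the `indicator` option of `pd.merge`; the frame names are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for the suicide data (left) and the long-format HDI data (right)
left = pd.DataFrame({"country": ["A", "A", "B"], "year": [2000, 2001, 2000]})
right = pd.DataFrame({"country": ["A", "B"], "year": [2000, 2000], "HDI": [0.7, 0.8]})

# indicator=True adds a '_merge' column telling where each row came from
checked = pd.merge(left, right, on=["country", "year"], how="left", indicator=True)
n_unmatched = (checked["_merge"] == "left_only").sum()
print(n_unmatched)
```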

In [None]:
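# total_gdp may be read as text with thousands separators; strip them before converting to a 64-bit integer
df3['total_gdp'] = df3['total_gdp'].astype(str).str.replace(',', '', regex=False).astype('int64')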

Next, we prepared a dataframe to be used exclusively for the interactive visualization. To do that, we dropped the columns listed above as irrelevant to the scope of this project, then removed every remaining column except 'country', 'year', 'total_gdp', 'suicides_no' and 'population'. The suicide rate per 100k population and the GDP per capita were then recalculated from these aggregated columns.

In [None]:
# Dropping all columns except 'country', 'year', 'total_gdp', 'suicides_no' and 'population'
df_map = df3.drop(['age', 'sex', 'suicides/100k pop', 'HDI', 'generation', 'HDI for year', 'gdp_per_capita ($)', 'country-year'], axis = 1)

# using groupby for the number of suicides and population for a given country in a particular year
df_map = df_map.groupby(['country', 'year', 'total_gdp']).sum().reset_index()

# computing the suicide rate and gdp per capita for a given country in a particular year
df_map['suicides_rate'] = ((df_map.suicides_no / df_map.population)*(10**5)).round(4)
df_map['gdp_per_capita'] = (df_map.total_gdp/df_map.population).round(2)

# sorting based on year
df_map.sort_values(by = 'year', axis=0, inplace = True)

# Saving the dataframe into a separate csv file
df_map.to_csv("suicides_rate_map.csv")

df_map.head() # snapshot of the dataframe
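
The rate is recomputed from summed counts because averaging the per-group rates would weight small and large groups equally. The following sketch, on made-up numbers, shows how the two approaches can differ.

```python
import pandas as pd

# Two demographic groups of one country in one year (numbers are made up)
groups = pd.DataFrame({
    "country": ["A", "A"],
    "year": [2000, 2000],
    "suicides_no": [10, 90],
    "population": [100_000, 300_000],
})
groups["rate"] = groups["suicides_no"] / groups["population"] * 1e5

# Pooled rate: sum the counts first, then divide
totals = groups.groupby(["country", "year"])[["suicides_no", "population"]].sum()
pooled_rate = totals["suicides_no"] / totals["population"] * 1e5

# Naive alternative: unweighted mean of the group rates
mean_of_rates = groups.groupby(["country", "year"])["rate"].mean()

print(pooled_rate.iloc[0], mean_of_rates.iloc[0])
```

Here the pooled rate is about 25 per 100k while the unweighted mean of the group rates is about 20, because the larger group carries more weight in the pooled figure.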

## Suicides Rate Overall Trend

### Overall Trend Plot

In [None]:
#create a mapping for the different age groups into numerical numbers
age_mapping = {'5-14 years': 1, '15-24 years':2, '25-34 years':3, '35-54 years':4, '55-74 years':5, '75+ years':6}

df4 = df3
#rename the column names
df4 = df4.rename(columns={'suicides/100k pop': 'suicide_per_100k'})
#df4 = df3.rename(columns={'gdp_for_year ($)': 'gdp_for_year'})
df4 = df4.rename(columns={'gdp_per_capita ($)': 'gdp_per_capita'})
df4['age_group'] = df3.age.map(age_mapping)
df4

In [None]:
# plot the overall suicide ratio over all years
# group by year to get ready to calculate the suicide rate
suicideYear = df4.groupby('year')[['suicides_no','population']].sum().reset_index()
# recalculate the ratio after grouping by year
suicideYear['suicides_per_100k'] = suicideYear.suicides_no/suicideYear.population*100000


# Plotting the overall trend for suicide ratio
plt.figure(figsize=(13,6))
sns.lineplot(x='year', y='suicides_per_100k', data=suicideYear, marker = 'o')
plt.title("Worldwide Suicide Ratio Over Years")
plt.xlabel("Year")
plt.ylabel("Suicides Per 100k Population")
#add a average line of suicide rate into the plot
plt.hlines(suicideYear.suicides_per_100k.mean(), suicideYear.year.min(), suicideYear.year.max() ,colors = 'red', linestyles = 'dashed')

In [None]:
# plot the overall suicide number over all years

suicideYear = df4.groupby('year')[['suicides_no','population']].sum().reset_index()

# Plotting the overall trend for suicide number
plt.figure(figsize=(13,6))
sns.lineplot(x='year', y='suicides_no', data=suicideYear, marker = 'o')
plt.title("Total Suicide Number Over Years")
plt.xlabel("Year")
plt.ylabel("Suicides Number")
#plt.hlines(suicideYear.suicides_per_100k.mean(), suicideYear.year.min(), suicideYear.year.max() ,colors = 'red', linestyles = 'dashed')

### Investigate the big gap from 1988 to 1989

In [None]:
#Build a data frame of suicide rate for each country each year
#calculate the suicide rate for each country each year
topcty = df4.groupby(['country','year'])[['suicides_no', 'population']].sum().reset_index() 
topcty['suicides_per_100k']= topcty.suicides_no/topcty.population*100000 
topcty.head()

In [None]:
# Top countries contributing to the big gap from 1988 to 1989
# pivot to one row per country and one column per year, then take the 1989 - 1988 difference

tmp = topcty[['country', 'year', 'suicides_per_100k']].copy() # a copy of topcty
tmp = tmp.pivot(index = 'country', columns = 'year', values =  'suicides_per_100k') #reform the dataframe for easy access
gap = tmp[1989] - tmp[1988] #calculate the suicide rate difference of 1988 and 1989 for each country
gap.sort_values(ascending=False).head() #find the countries with biggest jump

#Guyana and Malta are the two countries with highest gap from 1988 to 1989

In [None]:
# Plot the overall suicide trend for the countries with the largest jump between 1988 and 1989
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(1,2,1)
ax1.set_title('Overall trend for Guyana')
ax2 = fig.add_subplot(1,2,2)
ax2.set_title('Overall trend for Malta')
# Filter each country's suicide rate and plot it versus year
sns.lineplot(x='year', y='suicides_per_100k', data=topcty[topcty['country']=='Guyana'], marker = 'o',ax=ax1)
sns.lineplot(x='year', y='suicides_per_100k', data=topcty[topcty['country']=='Malta'], marker = 'o', ax=ax2)


### Data missing - Reason causing the big gap in 1989
#### Check how many countries being recorded each year

In [None]:
# Taking total number of countries each year into account 
countryByYear = topcty.groupby('year')['country'].nunique() #count how many countries being recorded each year in this data set

fig = plt.figure(figsize=(15,6)) #create a figure, set figure size
ax1 = fig.add_subplot(1, 1, 1) #create a axis
countryByYear.plot.bar(ax=ax1) #Bar plot for number of countries per year
ax1.set_title('Total Number of Countries Each Year')
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Countries')
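
One hedged way to test whether a jump is driven by the changing set of reporting countries is to recompute the pooled rate using only countries present in both years. The sketch below uses a made-up panel, and the helper `pooled_rate` is hypothetical and not part of this notebook.

```python
import pandas as pd

# Made-up panel: country B only starts reporting in 1989
panel = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [1988, 1989, 1989],
    "suicides_no": [10, 10, 60],
    "population": [100_000, 100_000, 200_000],
})

def pooled_rate(df):
    """Suicides per 100k per year, pooled over all countries in df."""
    totals = df.groupby("year")[["suicides_no", "population"]].sum()
    return totals["suicides_no"] / totals["population"] * 1e5

# Rate over every reporting country: rises because B enters the data
all_rate = pooled_rate(panel)

# Rate over countries that report in both years only
both_years = panel.groupby("country")["year"].nunique() == 2
balanced = panel[panel["country"].isin(both_years[both_years].index)]
balanced_rate = pooled_rate(balanced)

print(all_rate, balanced_rate, sep="\n")
```

In this toy case the pooled rate rises from 1988 to 1989 only because a new country enters, while the balanced panel shows no change.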


### Top 10 countries with highest suicide rates

In [None]:
# check the top 10 countries with highest suicide rates

topcountries = df4.groupby('country')[['suicides_no', 'population']].sum().reset_index()
topcountries['suicides_per_100k']= topcountries.suicides_no/topcountries.population*100000
top10=topcountries.sort_values(by= 'suicides_per_100k',ascending = False)[:10]
top10

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(x=top10.country,y=top10.suicides_per_100k)
plt.title("Top 10 Countries With Highest Suicide Ratios")
plt.xlabel("Countries")
plt.ylabel("Suicides Per 100k Population")

### Trend of suicide rate for top 5 countries

In [None]:
topcountry = topcty[topcty['country'].isin(top10.country.tolist()[0:5])] #Filter the data of the top 5 countries

In [None]:
plt.figure(figsize=(18,8))
sns.lineplot(x='year',y='suicides_per_100k',hue=topcountry.country,data=topcountry)
plt.title("Suicides Ratio For Top 5 Suicide Countries")
plt.ylabel("Suicides Per 100k Population")
plt.xticks(rotation=45)
sns.set(font_scale=1)

#From the plot, Russia contributes the most to the big gap from 1990 to 1995

### Russia contributes the most to the big gap from 1990 to 1995

In [None]:
# Check overall suicide rate without Russia
withoutR = topcty[topcty['country']!='Russia'] #exclude Russia from the data set
withoutR = withoutR.groupby('year')[['suicides_no','population']].sum().reset_index() #recalculate suicide rates
withoutR['suicides_per_100k'] = withoutR.suicides_no/withoutR.population*100000
suicideYear['suicides_per_100k'] = suicideYear.suicides_no/suicideYear.population*100000

fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(1, 1, 1)
# plot for all countries
sns.lineplot(x='year', y='suicides_per_100k', data=suicideYear, ax=ax1, label="All Countries", marker='o')
# plot for all except Russia
sns.lineplot(x='year', y='suicides_per_100k', data=withoutR, ax=ax1, label="Without Russia", marker='o')

# adding title and legend etc.
plt.title('Overall Trend of Suicide Ratio - With/Without Russia',fontdict={'weight':'normal','size': 16})
plt.xlabel("Year")
plt.ylabel("Suicides Per 100k Population")
ax1.legend(loc='best', shadow=True)

### Investigate suicide rates for different age groups

In [None]:
#recalculate the overall suicide rate for different age groups
suicideByAge = df4.groupby(['year', 'age_group'])[['suicides_no', 'population']].sum().reset_index()
suicideByAge['suicides_per_100k']= suicideByAge.suicides_no/suicideByAge.population*100000
suicideByAge.groupby(['age_group']).sum()

In [None]:
#Suicide ratio based on age groups

#create labels for different age groups
en = {1:'5-14 years',
      2:'15-24 years',
      3:'25-34 years',
      4:'35-54 years',
      5:'55-74 years',
      6:'75+ years'}


plt.figure(figsize=(13,6))
color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#2ecc71","#34495e"]

#create bar plot for overall suicide rate for different age group
sns.barplot(x=suicideByAge.age_group.map(en.get), y=suicideByAge['suicides_per_100k'],palette=color)
plt.title("Suicide Ratio based in Age group")
plt.xlabel("Age Group")
plt.ylabel("Suicides Per 100k Population")

In [None]:
# analysis for age group: 75+

elderly = df4[df4.age_group == 6].groupby(['year', 'country'])[['suicides_no', 'population']].sum().reset_index()
elderly['suicides_per_100k'] = elderly.suicides_no/elderly.population*100000

fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(1, 1, 1)
# suicide rate for people in the age group 75+
sns.lineplot(x='year', y='suicides_per_100k', data=elderly, ax=ax1, marker='o')

# adding title and label etc.
ax1.set_title('Overall Trend of Suicide Ratio - People of age 75+')
ax1.set_xlabel("Year")
ax1.set_ylabel("Suicides Per 100k Population")

In [None]:
elderly = elderly.groupby(['country']).sum()[['suicides_no', 'population']].reset_index()
elderly['suicides_per_100k'] = elderly.suicides_no/elderly.population*100000
top5_elderly = elderly.sort_values(by='suicides_per_100k', ascending=False)[:5]
plt.figure(figsize=(12,6))
sns.barplot(x=top5_elderly.country,y=top5_elderly.suicides_per_100k)
plt.title("Top 5 Countries in terms of the suicide rates for 75+")
plt.xlabel("Countries")
plt.ylabel("Suicides Per 100k Population")

In [None]:
top5_elderly

In [None]:
#Total suicide number based on age groups
totalAge = df4.groupby(['age_group'])['suicides_no'].sum().reset_index()
plt.figure(figsize=(12,6))
color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#2ecc71","#34495e"]

sns.barplot(x=totalAge.age_group.sort_values().map(en.get),y=totalAge.suicides_no, palette=color)
plt.title("Total Suicide based in Age group")
plt.xlabel("Age Group")
plt.ylabel("Number of Suicide")

### Suicide rate based on different age groups

In [None]:
#Suicide Based on Year And Age
color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#2ecc71","#34495e"]
plt.figure(figsize=(15,8))
sns.swarmplot(x='year',y='suicides_per_100k',hue=suicideByAge.age_group.map(en.get),data=suicideByAge,palette=color)
plt.title("Suicides Ratio Based On The Age Group")
plt.xticks(rotation=45)
plt.ylabel("Suicides Per 100k Population")

### Suicide rate of different age group in different gender

In [None]:
#Compare the suicide ratio in Male and Female based on different age groups

suicideByAgeGender = df4.groupby(['age_group','sex'])[['suicides_no', 'population']].sum().reset_index()

suicideByAgeGender['suicides_per_100k']= suicideByAgeGender.suicides_no/suicideByAgeGender.population*100000
suicideM_age = suicideByAgeGender[suicideByAgeGender.sex == 'male'].suicides_per_100k
suicideF_age = suicideByAgeGender[suicideByAgeGender.sex == 'female'].suicides_per_100k
index = np.arange(len(suicideF_age))
width = 0.3

plt.figure(figsize=(13,6))
plt.bar(index, height=suicideM_age, width=width, color= "#3498db", label='Male')
plt.bar(index+width, height=suicideF_age, width=width, color="#9b59b6", label='Female') # female bars are offset by one bar width
plt.xlabel('Age Groups')
plt.ylabel('Suicide Ratio')
plt.title('Suicide Ratio in Male and Female')
plt.xticks(index + width / 2, ('5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years', '75+ years'))
plt.legend(loc='best')
plt.show()


### Relationship between suicide ratio, HDI for different countries

In [None]:
suicideDrop =df4.drop(columns=['HDI for year']).dropna(axis=0,how='any') #Drop previous HDI column because of too many NAs
#create a dataframe with HDI 
suicideHdi = suicideDrop.groupby(['year','country'])[['suicides_no', 'population']].sum().reset_index()
suicideHdi['suicides_per_100k']= suicideHdi.suicides_no/suicideHdi.population*100000 # calculate suicide rate for every country every year
hdi = suicideDrop.groupby(['year','country'])[['HDI']].mean().reset_index() # create HDI column for each country each year
suicideHdi['HDI']=hdi.HDI # add HDI column to the dataframe
suicideHdi.head()

In [None]:
# Check HDI versus suicide rate for the first 6 of the top 10 suicide countries (2 x 3 grid)
suicideHdiForTop=suicideHdi[suicideHdi['country'].isin(top10.country)]
fig = plt.figure(figsize=(16,11))
for i, country in enumerate(top10.country.tolist()[:6]):
    ax1=fig.add_subplot(2,3,i+1)
    # use a dedicated name so the original df1 dataframe is not overwritten
    country_df=suicideHdiForTop[suicideHdiForTop['country']==country]
    sns.scatterplot(x=country_df.HDI.round(2) ,y=country_df.suicides_per_100k, ax=ax1)
    ax1.set_title(country)
    ax1.set_xlabel('HDI')
    ax1.set_ylabel('Suicide per 100k')
    

In [None]:
bottom10 = topcountries.sort_values(by= 'suicides_per_100k')[:10]
plt.figure(figsize=(15,6))
sns.barplot(x=bottom10.country, y=bottom10.suicides_per_100k)
plt.title("Bottom 10 Countries With Lowest Suicide Ratios")
plt.xlabel("Countries")
plt.ylabel("Suicides Per 100k Population")

# Gender Suicide Ratio Analysis

In [None]:
df_gender = df4[['country','year','sex','age','suicides_no','population','suicide_per_100k', 'HDI']].copy()
# map the age labels to ordinal codes before relabelling; otherwise '5-14 years' no longer matches and stays 0
df_gender.insert(3, 'age_group', df_gender['age'].map(age_mapping))
# zero-pad the youngest group so that the age labels sort in the right order
df_gender.loc[df_gender['age']=='5-14 years','age'] = '05-14 years'

In [None]:
plt.rc('figure', figsize = (10,5))

### worldwide suicide ratio over 30 years per gender

In [None]:
temp = df_gender.groupby('sex')[['suicides_no','population']].sum().reset_index()
temp = temp.assign(suicide_ratio=temp['suicides_no']/(temp['population']/100000))
ratio_all = pd.pivot_table(temp,values = 'suicide_ratio', columns = ['sex'])
print(ratio_all)
ratio_all.plot.barh(title = 'overall suicide ratio for two genders')

### suicide ratio vs. year per gender

In [None]:
tmp = df_gender.groupby(['year','sex'])[['suicides_no','population']].sum()
tmp = tmp.assign(su_ratio_year = tmp['suicides_no']/(tmp['population']/100000))
ratio_year = tmp['su_ratio_year'].unstack()
ratio_year.plot(title = 'suicide ratio trend with year for two genders').set_ylabel('suicide_ratio')

### suicide ratio vs. age per gender

In [None]:
tmp = df_gender.groupby(['age','sex'])[['suicides_no','population']].sum()
tmp = tmp.assign(su_ratio_age = tmp['suicides_no']/(tmp['population']/100000))
ratio_age = tmp['su_ratio_age'].unstack()
ratio_age.plot.bar(title = 'suicide ratio trend with age for two genders').set_ylabel('suicide_ratio')

In [None]:
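# use a lambda so the ratio is computed from the freshly grouped frame, not from the previous cell's tmp
tmp = df_gender.groupby(['country','sex'])[['suicides_no','population']].sum().assign(ratio = lambda d: d['suicides_no']/(d['population']/100000))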
# tmp has multiIndex('country','sex'), we cannot select values by a single index in multiIndex, reset multiindex into column
tmp_reindex = tmp.reset_index()
female_top5 = tmp_reindex[tmp_reindex['sex']=='female'].sort_values(by = ['ratio'], ascending = False)[['country','ratio']].head(5)
female_top5.plot.bar(x= 'country', y = 'ratio', color = 'Orange')

### top 5 countries with highest suicide ratio per gender

In [None]:
# use a lambda so the ratio is computed from the freshly grouped frame, not from a stale tmp
tmp = df_gender.groupby(['country','sex'])[['suicides_no','population']].sum().assign(ratio = lambda d: d['suicides_no']/(d['population']/100000))
# tmp has multiIndex('country','sex'), we cannot select values by a single index in multiIndex, reset multiindex into column
tmp_reindex = tmp.reset_index()
# tmp is untouched
top5 = tmp_reindex.sort_values(by = ['ratio'], ascending = False).groupby('sex')[['country','sex','ratio']].head(5)
# select the top 5 countries by ratio in two sex groups (in one dataframe)
top5_sex = top5.pivot(index='country', columns='sex', values='ratio')
top5_sex.plot.barh(title = 'Top 5 countries with highest suicide ratio per sex').set_xlabel('overall suicide ratio')
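
The "top N per group" pattern used above (sort, then take the head of each group) can be sketched on a small made-up frame. The numbers below are invented for illustration only.

```python
import pandas as pd

# Made-up suicide ratios for three countries and two sexes
rates = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "sex": ["female"] * 3 + ["male"] * 3,
    "ratio": [5.0, 9.0, 7.0, 20.0, 35.0, 30.0],
})

# Sort once, then keep the first two rows of each sex group
top2 = rates.sort_values("ratio", ascending=False).groupby("sex").head(2)

# One row per country, one column per sex, ready for a grouped bar plot
top2_wide = top2.pivot(index="country", columns="sex", values="ratio")
print(top2_wide)
```

When the top countries differ between the sexes, the pivoted frame contains NaN for the combinations that did not make either list, which shows up as missing bars in the plot.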

# Data Analysis - Relationship between GDP and Suicide Rate

In [None]:
sui_country = df4[['country', 'year', 'population','suicides_no','total_gdp']]
sui_country = sui_country.groupby([sui_country.country,sui_country.year],as_index = False).sum()
sui_country['new_suicide_rate'] = round(sui_country.suicides_no/sui_country.population * 100000,2)
#calculate suicide_rate by ourselves 

In [None]:
sui_country.head()

In [None]:
# mean yearly suicide rate per country, highest first
new_sr_ctr = sui_country.groupby('country')['new_suicide_rate'].mean().sort_values(ascending=False).reset_index()

In [None]:
new_sr_ctr.head()

In [None]:
#suicide rate per country plot
plt.figure(figsize=(10,30))
plt.title('Suicide Rates - Country', fontsize=14)
plt.axvline(x=new_sr_ctr['new_suicide_rate'].mean(),color='gray',ls='--')
sns.barplot(data=new_sr_ctr, y='country',x='new_suicide_rate')

In [None]:
gdp_country = df4[['country', 'year', 'population','total_gdp']]
a = gdp_country.groupby(['country', 'year', 'total_gdp'], as_index=False).sum()
a['gdp_per'] = a['total_gdp']/a['population'] # GDP per capita for each country and year
a = a.drop(columns = ['total_gdp','population','year'])
a['suicide_rate'] = sui_country['new_suicide_rate'] # relies on both frames sharing the same country-year row order
# average over years for each country
a = a.groupby('country')[['gdp_per', 'suicide_rate']].mean().sort_values('gdp_per',ascending = False)

In [None]:
fig, ax1 = plt.subplots(1,1,figsize=(14,4), dpi=80, sharey=False)

sns.lineplot(data=a, y='suicide_rate', x='gdp_per', ax=ax1, label='country')
ax1.grid(False)
ax1.legend(bbox_to_anchor=(1, 0.12))

In [None]:
#overall trend of suicide_rate with gdp
fig, ax1 = plt.subplots(1,1,figsize=(10,8), dpi=80, sharey=False)
sns.regplot(data = a, x='gdp_per',
              y='suicide_rate', ax=ax1, color='C3')

## Data Analysis - Relationship between HDI and Suicide Rate

In [None]:
HDI_country = df3.drop(['age', 'sex', 'suicides/100k pop', 'generation', 'HDI for year', 'gdp_per_capita ($)', 'country-year'], axis = 1)
HDI_country = HDI_country.groupby(['country', 'year', 'HDI', 'total_gdp']).sum().reset_index()
HDI_country['suicide_rate'] = ((HDI_country.suicides_no / HDI_country.population)*(10**5)).round(3)
HDI_country['gdp_per_capita'] = (HDI_country.total_gdp/HDI_country.population).round(2)
HDI_country = HDI_country[HDI_country['country'].isin(['Russia', 'Lithuania', 'Sri Lanka', 'Latvia', 'Hungary'])]

In [None]:
HDI_df = df3.drop(['age', 'sex', 'suicides/100k pop', 'generation', 'HDI for year', 'gdp_per_capita ($)', 'country-year'], axis = 1)
HDI_df = HDI_df.groupby(['country', 'year', 'HDI', 'total_gdp']).sum().reset_index()
HDI_df = HDI_df.groupby(['year']).mean(numeric_only=True).reset_index() # average numeric columns only; 'country' is text
HDI_df['suicides_rate'] = ((HDI_df.suicides_no / HDI_df.population)*(10**5)).round(4)
HDI_df['gdp_per_capita'] = (HDI_df.total_gdp/HDI_df.population).round(2)
sns.regplot(x = 'HDI', y = 'suicides_rate', data = HDI_df, fit_reg = True, color = 'r')

In [None]:
sns.lineplot(x = 'year', y = 'HDI', data = HDI_country, hue = 'country')

In [None]:
sns.lineplot(x = 'year', y = 'suicide_rate', data = HDI_country, hue = 'country')

## Web scraper to import developing and developed country information 

In [None]:
#web scraper to import developing and developed country information 
url = "http://worldpopulationreview.com/countries/developed-countries/"

# Use requests to load the url
page = requests.get(url)

# Create a BeautifulSoup object (named page_soup so the imported `soup` class is not overwritten on re-runs)
page_soup = soup(page.content, 'html.parser')

# pull all the texts of the 'table table-striped' class from the page
table = page_soup.find(class_ = 'table table-striped')

# pull text from all instances of <td> tag within the 'table table-striped' class
table2 = table.find_all('td')

countryname = []
index = 5
while index<len(table2):
    countryname.append(table2[index].getText())
    index = index+4
print(countryname)
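
The fixed stride above (start at cell 5, step 4) depends on the exact cell layout of the page. A row-wise alternative is sketched below on an inline HTML snippet whose layout and values are made up, so it is not a copy of the real page; it assumes the country name sits in the second cell of each row.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the scraped page (layout is an assumption)
html = """
<table class="table table-striped">
  <tr><th>Rank</th><th>Country</th><th>Index</th><th>Population</th></tr>
  <tr><td>1</td><td>Country A</td><td>0.95</td><td>1000</td></tr>
  <tr><td>2</td><td>Country B</td><td>0.94</td><td>2000</td></tr>
</table>
"""

page_soup = BeautifulSoup(html, "html.parser")
table = page_soup.find("table", class_="table table-striped")

# Walk the table row by row and take the second cell of each data row
names = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) >= 2:  # header rows contain <th> cells only and are skipped
        names.append(cells[1].get_text(strip=True))
print(names)
```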

In [None]:
d1 = df4[['country', 'year', 'population','suicides_no','total_gdp']]
d1= d1.groupby([d1.country,d1.year],as_index = False).sum()
d1['new_suicide_rate'] = round(d1.suicides_no/d1.population * 100000,2)

In [None]:
d2 = df4[['country', 'year', 'population','total_gdp']]
d2 = d2.groupby([d2.country,d2.year, d2.total_gdp],as_index = False).sum()
d2['gdp_per'] = round(d2['total_gdp']/d2['population'])
d2['new_suicide_rate'] = d1.new_suicide_rate
d2 = d2.drop(columns = {'population','total_gdp'})

In [None]:
inde = d2.country.isin(countryname) # True for developed countries
d2_develop = d2[inde]
d2_developin = d2[~inde] # ~ inverts the boolean mask

In [None]:
#overall trend of suicide rate versus gdp 
fig, ax1 = plt.subplots(1,1,figsize=(10,8), dpi=80, sharey=False)
sns.regplot(data = d2, x='gdp_per',
              y='new_suicide_rate', ax=ax1, color='C1')

In [None]:
#developed countries' suicide rate vs. gdp per capita
d2_develop = d2_develop.groupby('year')[['new_suicide_rate','gdp_per']].mean()
d2_develop.head(10)

In [None]:
#developing countries' suicide rate vs. gdp per capita
d2_developin = d2_developin.groupby('year')[['new_suicide_rate','gdp_per']].mean()
d2_developin.head(10)

In [None]:
#developing country plot 
fig, ax1 = plt.subplots(1,1,figsize=(10,8), dpi=80, sharey=False)
sns.regplot(data = d2_developin, x='gdp_per',
              y='new_suicide_rate', ax=ax1, color='C1')

In [None]:
#developed country plot 
fig, ax1 = plt.subplots(1,1,figsize=(10,8), dpi=80, sharey=False)
sns.regplot(data = d2_develop, x='gdp_per',
              y='new_suicide_rate', ax=ax1, color='C8')

In [None]:
# correlation heat map 
fig, ax = plt.subplots(figsize=(10, 8))
hm = sns.heatmap(df4[['year','suicides_no','population','suicide_per_100k','total_gdp','gdp_per_capita','HDI']].corr(), annot = True, ax = ax, cmap = "PiYG",fmt = '.2f',
                 linewidths = .05)
fig.subplots_adjust(top = 0.92)
fig.suptitle('Attributes Correlation Heatmap', fontsize = 20)
plt.show()

# Interactive visualization of the suicides rate

Use the map below to view the suicides per 100k population in countries across the world from 1985 to 2016. The map supports the following features:
1. Zoom in/out
2. Hover function using mouse pointer
3. Play/Stop function for the year slider
4. Manual selection through year slider

In [None]:
fig = px.choropleth(df_map, locations="country", locationmode="country names",
                    color="suicides_rate",
                    hover_name="country",
                    animation_frame="year",
                    animation_group="year",
                    labels={'suicides_rate':'Suicides per 100k population', 'year':'Year', 'country':"Country", 'gdp_per_capita':'GDP per capita ($)'},
                    color_continuous_scale=px.colors.carto.Sunset, #sequential.RdPu, carto.Sunset
                    title="Yearly global trend in suicides per 100k")
fig.show()
py.plot(fig, filename= 'suicidesrate_interactive.html') # py is already plotly.offline