<a href="https://colab.research.google.com/github/KwekuYamoah/Suicide-and-GDP-Case-Study/blob/main/Suicide_and_GDP_Case_Study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Suicide and GDP Case Study

Kweku Andoh Yamoah

[MIT License](https://en.wikipedia.org/wiki/MIT_License)

## Introduction
This is the first in a series of notebooks that make up a case study in exploratory data analysis. In this notebook, we:

1.   Read data from Kaggle: [Suicide Rates Overview 1985 to 2016](https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016). The dataset is described as follows:
"*This compiled dataset pulled from four other datasets linked by time and place, and was built to find signals correlated to increased suicide rates among different cohorts globally, across the socio-economic spectrum.*"
2.   Clean the data, particularly dealing with special codes that indicate missing data.
3.   Validate the data by comparing the values in the dataset with values documented in the codebook specified on Kaggle.
4.   Use `describe` to compute summary statistics and `Pmf` or `Cdf` to plot distributions.
5.   Generate "resampled" datasets to ensure efficient randomisation in the dataset.
6.   Store the resampled data in a binary format (HDF5) that makes it easier to work with in the notebooks that follow this one.







## The Data

### General Information

*   **Original format**: csv
*   **Dataset shape**: 27820 x 12 (rows x columns)
*   19456 missing values for HDI
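
These figures can be checked once the CSV is loaded. The sketch below uses a small hand-made frame (the name `toy` and its values are illustrative assumptions, not the real data) so that it runs on its own. On the real file, the same two calls should report the shape and the HDI missing-value count listed above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for master.csv; values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1990, 1991, 1990, 1991],
    "HDI for year": [np.nan, 0.61, np.nan, np.nan],
})

n_rows, n_cols = toy.shape      # (rows, columns)
missing = toy.isna().sum()      # number of missing values per column

print(n_rows, n_cols)
print(missing["HDI for year"])
```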

### Features in the dataset
#### <u>Categorical features</u>
**Country**: A total of 101 countries are included in this dataset. Most countries in Asia are not included, and since there are 195 countries in the world today, the data is quite biased for worldwide analysis.<br>
**Year**: The dataset goes from 1985 to 2016<br>
**Sex**: Male/female differentiation<br>
**Age**: Age is divided into six intervals (5-14, 15-24, 25-34, 35-54, 55-74 and 75+ years).<br>
**Generation**: There are six generations included in this dataset. See 3.6 for details.
<br>
<blockquote>
    <p><font color="darkblue">This data's level of detail is defined by the combination of <b>Country+Year+Sex+Age</b>, each of which identifies a subsample of the population (e.g. Brazilian males aged 15 to 24 in 1996). <br>For each of those we have corresponding numerical features.</font></p>
</blockquote>
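
One way to confirm this level of detail is to check that no Country+Year+Sex+Age combination appears more than once. The following is a minimal sketch on a hand-made frame (`toy` and its numbers are illustrative assumptions). Applied to the real data, the same check should report no duplicates if the description above holds.

```python
import pandas as pd

key = ["country", "year", "sex", "age"]

# Toy rows in the same layout as the dataset; values are illustrative only
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Brazil"],
    "year": [1996, 1996, 1996],
    "sex": ["male", "female", "male"],
    "age": ["15-24 years", "15-24 years", "25-34 years"],
    "suicides_no": [10, 3, 12],
})

# True if any Country+Year+Sex+Age combination occurs more than once
has_duplicates = bool(toy.duplicated(subset=key).any())

# Number of rows per key combination; every entry should be 1
rows_per_key = toy.groupby(key).size()

print(has_duplicates)
print(rows_per_key.max())
```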

#### <u>Numerical Features</u>
**Population size**: Number of people contained in each subsample
<br>**Number of Suicides**: Number of suicides in each subsample
<br>**Suicides per 100k people**: Number of suicides divided by the population size and multiplied by 100,000. This scales the number for easier interpretation and allows comparisons between subsamples of different sizes.
<br>**GDP for year**: *Gross Domestic Product*, a measure of the market value of the goods and services produced by a country in a given year.
<br>**GDP per capita**: Obtained by dividing the GDP by the total population of the country for that year.
<br>**HDI for year**: *Human Development Index*, an index that measures life expectancy, income and education.
<br>
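
The per-100k rate can be reproduced from the other two columns. The sketch below applies the formula to the five Albania 1987 rows that appear in the `head()` output later in this notebook. The column names `reported_rate` and `computed_rate` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# The five Albania 1987 rows shown by head() in this notebook
rows = pd.DataFrame({
    "suicides_no": [21, 16, 14, 1, 9],
    "population": [312900, 308000, 289700, 21800, 274300],
    "reported_rate": [6.71, 5.19, 4.83, 4.59, 3.28],  # suicides/100k pop column
})

# suicides per 100k = suicides / population * 100,000, rounded to 2 decimals
rows["computed_rate"] = (rows["suicides_no"] / rows["population"] * 100_000).round(2)

# Compare with a small tolerance rather than exact float equality
matches = bool(np.allclose(rows["computed_rate"], rows["reported_rate"], atol=1e-9))
print(rows)
print(matches)
```

For the first row this is 21 / 312900 * 100,000 ≈ 6.71, which agrees with the value stored in the dataset.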

### Setup

If you are running this notebook in Colab, the following cell downloads the `empiricaldist` library.

If you are running in another environment, you will need to install it yourself.

In [None]:
# If we're running in Colab, set up the environment

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist

Collecting empiricaldist
  Downloading https://files.pythonhosted.org/packages/c4/88/b6c44c1a5078224850473a8a6b82614d79147232113dd35e29de34b9ac8a/empiricaldist-0.3.9.tar.gz
Building wheels for collected packages: empiricaldist
  Building wheel for empiricaldist (setup.py) ... done
  Created wheel for empiricaldist: filename=empiricaldist-0.3.9-cp36-none-any.whl size=10157 sha256=da4038ee6e26f8198fb4242caffadde59cffcf6327deef175513b271f7c1de42
  Stored in directory: /root/.cache/pip/wheels/bf/70/8c/55788f5a5806e6da295e5da80d2c0ef286d9a8260a1e3142e1
Successfully built empiricaldist
Installing collected packages: empiricaldist
Successfully installed empiricaldist-0.3.9


The following cell loads the packages we need.  If everything works, there should be no error messages. Fingers crossed :)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
from empiricaldist import Pmf

The following cell defines a function I use to decorate the axes in plots.

In [None]:
def decorate(**options):
    """Decorate the current axes.
    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')
    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    plt.gca().set(**options)
    plt.tight_layout()
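
As a usage sketch, the helper can be called right after any plotting command. The example repeats the function so it is self-contained, and the plotted numbers are made up for illustration. The `Agg` backend line is an assumption that lets the code run without a display. In a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt


def decorate(**options):
    """Same helper as above, repeated so this example runs on its own."""
    plt.gca().set(**options)
    plt.tight_layout()


# Illustrative values only
plt.plot([1985, 2000, 2016], [1, 2, 3])
decorate(title="Example", xlabel="Year", ylabel="Value")

# The keyword arguments end up as properties of the current axes
ax = plt.gca()
print(ax.get_title(), ax.get_xlabel(), ax.get_ylabel())
```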

## Reading the Data
The data we'll use is from Kaggle. I'll keep the original dataset as it is and create a new one to process the data throughout this analysis.

In [None]:
# Download the data file if it is not already present

import os

if not os.path.exists('master.csv'):
    !wget https://raw.githubusercontent.com/KwekuYamoah/Suicide-and-GDP-Case-Study/main/master.csv


Now we can read the file using Pandas. Pandas will read the file and store the results in a DataFrame. We will view the first few rows to get a sense of the data.

In [None]:
original_dataset = pd.read_csv("master.csv")
original_dataset.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


Our data came in a pretty clear format. Missing values are already recoded as NaN. The next step is to create a copy of the dataset and use it throughout our analysis, so that the original dataset keeps its structure. We will also rename a few columns to make interpretation and coding easier. Finally, we drop a column that we won't use in our analysis.

In [None]:
#Creating a dataset copy
df = original_dataset.copy()

# Rename a few columns to make interpretation and coding easier
# (the raw gdp_for_year header includes surrounding spaces)
df.rename(columns = {
    'suicides_no':'total_suicides',
    'suicides/100k pop':'suicides_per_100k',
    ' gdp_for_year ($) ':'gdp_for_year',
    'gdp_per_capita ($)':'gdp_per_capita',
    'HDI for year': 'HDI_for_year'
}, inplace=True)

# Drop country-year, which only repeats the country and year columns
df = df.drop('country-year', axis=1)

# gdp_for_year values are strings with comma separators; strip the commas and convert to float
df['gdp_for_year'] = df['gdp_for_year'].str.replace(',', '', regex=False).astype(float)

Now let's see how our dataset looks:

In [None]:
df.head()

Unnamed: 0,country,year,sex,age,total_suicides,population,suicides_per_100k,HDI_for_year,gdp_for_year,gdp_per_capita,generation
0,Albania,1987,male,15-24 years,21,312900,6.71,,2156625000.0,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,,2156625000.0,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,,2156625000.0,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,,2156625000.0,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,,2156625000.0,796,Boomers
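
To make the GDP conversion easier to follow, here is a minimal sketch that replays the rename and the string-to-float step on a two-row CSV built in memory. The first row copies the Albania 1987 GDP values shown above. The second row ("Exampleland") and the variable names `raw` and `toy` are made up for illustration.

```python
import io
import pandas as pd

# Tiny CSV in the raw file's layout; note the spaces around the GDP header
raw = io.StringIO(
    'country,year, gdp_for_year ($) ,gdp_per_capita ($)\n'
    'Albania,1987,"2,156,624,900",796\n'
    'Exampleland,2000,"1,000,000",10\n'   # hypothetical row
)
toy = pd.read_csv(raw)

# Same renames as in the cleaning cell (only the columns present here)
toy = toy.rename(columns={
    ' gdp_for_year ($) ': 'gdp_for_year',
    'gdp_per_capita ($)': 'gdp_per_capita',
})

# The GDP column arrives as text because of the comma separators
was_text = not pd.api.types.is_numeric_dtype(toy['gdp_for_year'])

# Strip the commas and convert to float
toy['gdp_for_year'] = toy['gdp_for_year'].str.replace(',', '', regex=False).astype(float)

print(was_text)
print(toy['gdp_for_year'].tolist())
```

`pd.read_csv` also accepts a `thousands=','` argument, which may make the manual conversion unnecessary. That option is not used in this notebook.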
