# Investigating How Suicide Rates Vary With Socioeconomic Factors

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, I will be analysing how suicide rates vary based on a nation's level of corruption, democracy and freedom of expression.

### Datasets and Indicators

I compiled my datasets using [Gapminder Tools](https://www.gapminder.org/data/), which contains data, broken down by country, on a wide range of indicators.
As the indicators I have chosen are not directly quantitative, I have opted to use the following indices/scores to quantify them:

- **Corruption Perception Index (CPI)** - This index, calculated by [Transparency International](https://www.transparency.org/research/cpi), is a measure of the perceived level of corruption in a country. It is based on a scale of 0 to 100, with 0 indicating a "Highly Corrupt" nation and 100 indicating a "Very Clean" nation.
- **Democracy Index (EIU)** - From the [Economist Intelligence Unit](http://gapm.io/ddemocrix_eiu), this is a summary measure of the quality of a country's democracy, calculated using 60 indicators. It is graded from 0 to 100, with 0 indicating a very low level of democracy and 100 indicating a very high level.
- **Freedom of Expression Index (IDEA)** - Available [here](http://gapm.io/ddemocrix_idea), this aggregates a set of indicators measuring media censorship and freedom of discussion and expression. It is measured on a scale of 0 to 100, with 0 suggesting no freedom of expression at all and 100 suggesting full freedom of expression.

All of these datasets include historical data. However, I am not interested in trends in any of these indicators, so I can discard all but the most recent year for which all indicators have data.

### Questions

I shall be analysing the distribution of countries' suicide rates, and the relationship between suicide rates and the above indicators.
My questions include:

- How are suicide rates distributed, and how do they range between different countries?
- Is there a correlation between the level of corruption of a country and the suicide rate?
- Is there a correlation between the level of democracy of a country and the suicide rate?
- Is there a correlation between the level of freedom of expression of a country and the suicide rate?

In [1]:
from functools import reduce
from IPython.display import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [2]:
# Load Data
suicide_df    = pd.read_csv('suicide.csv')
corruption_df = pd.read_csv('corruption.csv')
democracy_df  = pd.read_csv('democracy.csv')
freedom_df    = pd.read_csv('freedom.csv')

#### Suicide Data

In [3]:
display(suicide_df.describe())
display(suicide_df.head())

Unnamed: 0,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
count,10.0,17.0,19.0,19.0,20.0,29.0,29.0,30.0,32.0,32.0,...,51.0,52.0,50.0,49.0,50.0,50.0,49.0,49.0,39.0,14.0
mean,9.999,10.506471,11.075263,11.221053,11.6025,11.112172,11.366138,10.7955,10.479872,10.5575,...,9.731773,9.602844,10.03814,9.598796,9.09774,9.44442,9.826318,9.056457,9.535072,10.057143
std,5.792266,6.022625,5.859827,6.17709,6.381608,6.803677,6.792478,6.846742,6.93409,6.574295,...,5.782587,6.189495,6.362952,6.025019,6.086949,5.440548,5.677636,5.482087,5.0689,5.153779
min,2.58,2.6,2.14,2.08,2.02,0.233,0.268,0.154,0.0859,0.2,...,0.0454,0.0669,0.159,0.15,0.115,0.104,0.0736,0.0684,0.0578,1.85
25%,8.1375,6.3,6.415,6.82,6.69,6.61,6.66,6.3775,5.7025,6.655,...,5.54,5.46,5.7425,5.41,5.0475,5.5925,6.13,5.39,6.175,7.5725
50%,8.735,9.07,9.88,9.03,10.1,9.11,9.94,9.41,9.36,9.23,...,9.73,9.27,9.325,9.62,8.815,9.205,9.7,8.79,9.04,9.875
75%,10.6775,13.5,14.1,14.8,14.125,15.5,15.1,13.7,14.35,14.075,...,13.15,12.7,13.275,13.1,12.075,12.325,12.6,12.3,12.3,12.375
max,24.4,22.4,22.0,23.7,26.4,28.1,27.2,26.6,27.6,25.0,...,27.4,30.1,30.4,27.6,28.0,25.7,30.9,26.1,25.2,22.9


Unnamed: 0,country,1950,1951,1952,1953,1954,1955,1956,1957,1958,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Albania,,,,,,,,,,...,4.06,5.34,,3.08,,,,,,
1,Antigua and Barbuda,,,,,,,,,,...,,,,,,,,,,
2,Argentina,,,,,,,,,,...,,,,,,,,,,
3,Armenia,,,,,,,,,,...,,1.74,,,,2.39,1.92,1.54,2.02,1.85
4,Australia,9.11,9.41,10.5,10.8,10.7,10.3,10.8,12.2,12.5,...,9.71,9.96,9.62,9.82,9.73,10.3,10.2,11.2,11.6,


It appears that 2014 was the last year with a substantial amount of data: it has 49 entries, whereas 2015 and 2016 have only 39 and 14 respectively. I will therefore take 2014 as the most recent year of data that I can analyse.
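
One way to make this choice reproducible is to count the non-missing entries per year and take the latest year that clears a threshold. The sketch below runs on a small made-up table rather than `suicide.csv`; the helper name `latest_year_with_coverage` and the `min_count` threshold are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide-format table standing in for suicide.csv (values are made up).
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    '2013': [1.0, 2.0, 3.0, 4.0],
    '2014': [1.5, 2.5, 3.5, np.nan],
    '2015': [1.2, np.nan, np.nan, np.nan],
})

def latest_year_with_coverage(df, min_count):
    """Return the latest year column with at least `min_count` non-missing values."""
    # Count the non-missing entries in every year column.
    counts = df.drop(columns='country').notna().sum()
    # Keep only the years that meet the threshold.
    covered = counts[counts >= min_count]
    # Year labels are strings, so compare them as integers.
    return max(covered.index, key=int)

year = latest_year_with_coverage(toy, min_count=3)
print(year)
```

Note that `max` raises a `ValueError` if no year meets the threshold, so the threshold should be chosen with the `describe()` counts in mind.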

#### Corruption Data

In [4]:
display(corruption_df['2014'].describe())
display(corruption_df.head())

count    171.000000
mean      42.929825
std       19.811042
min        8.000000
25%       28.500000
50%       38.000000
75%       55.000000
max       92.000000
Name: 2014, dtype: float64

Unnamed: 0,country,2012,2013,2014,2015,2016,2017
0,Afghanistan,8.0,8.0,12.0,11.0,15.0,15
1,Albania,33.0,31.0,33.0,36.0,39.0,38
2,Algeria,34.0,36.0,36.0,36.0,34.0,33
3,Angola,22.0,23.0,19.0,15.0,18.0,19
4,Argentina,35.0,34.0,34.0,32.0,36.0,39


As we can see, 171 countries have a value in the 2014 column (`count` reports the number of non-missing entries).

#### Democracy Data

In [5]:
display(democracy_df['2014'].describe())
display(democracy_df.head())

count    164.000000
mean      55.279878
std       21.892154
min       10.800000
25%       35.275000
50%       57.750000
75%       73.925000
max       99.300000
Name: 2014, dtype: float64

Unnamed: 0,country,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Afghanistan,30.6,30.4,30.2,27.5,24.8,24.8,24.8,24.8,27.7,27.7,25.5,25.5,29.7
1,Albania,59.1,59.1,59.1,58.9,58.6,58.1,56.7,56.7,56.7,59.1,59.1,59.8,59.8
2,Algeria,31.7,32.5,33.2,33.8,34.4,34.4,38.3,38.3,38.3,39.5,35.6,35.6,35.0
3,Angola,24.1,28.8,33.5,33.4,33.2,33.2,33.5,33.5,33.5,33.5,34.0,36.2,36.2
4,Argentina,66.3,66.3,66.3,67.3,68.4,68.4,68.4,68.4,68.4,70.2,69.6,69.6,70.2


Here we can see that the democracy data also covers 2014, but only 164 countries have a value for that year.

#### Freedom of Expression Data

In [6]:
display(freedom_df['2014'].describe())
display(freedom_df.head())

count    155.000000
mean      61.916129
std       20.752079
min        2.000000
25%       47.500000
50%       65.000000
75%       78.000000
max       95.000000
Name: 2014, dtype: float64

Unnamed: 0,country,1975,1976,1977,1978,1979,1980,1981,1982,1983,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Afghanistan,35.0,35.0,35.0,23.0,20.0,20.0,22.0,22.0,22.0,...,53.0,52.0,52,51,52,52,52,50,51,55
1,Albania,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,...,71.0,71.0,71,71,64,65,65,65,69,62
2,Algeria,34.0,34.0,34.0,34.0,34.0,36.0,36.0,36.0,36.0,...,58.0,57.0,57,52,57,56,56,53,55,56
3,Angola,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,20.0,...,46.0,46.0,46,46,47,47,47,47,51,53
4,Argentina,52.0,24.0,14.0,14.0,14.0,14.0,14.0,17.0,33.0,...,78.0,77.0,77,77,78,78,76,83,82,82


Finally, this shows us that the freedom of expression data has values for only 155 countries in 2014.

### Data Cleaning

As decided previously, I will only be working with data from 2014.
All four datasets have data for 2014, but the suicide dataset has values for only 49 countries, significantly fewer than the indicators.

Filling the blank entries with the mean is not suitable, because such a large share of countries lack a recorded suicide rate (171 countries have corruption data, compared with only 49 with suicide rates), so I will instead discard these rows.
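
To illustrate one reason mean-filling is risky when most values are missing, the sketch below fills a made-up series and compares its spread before and after. The numbers are invented and are not taken from the suicide dataset; this is only one possible way to look at the problem.

```python
import numpy as np
import pandas as pd

# Made-up series: 4 recorded values and 6 missing ones.
observed = pd.Series([2.0, 4.0, 6.0, 8.0] + [np.nan] * 6)

# Fill every gap with the mean of the recorded values.
filled = observed.fillna(observed.mean())

# pandas skips NaN when computing the standard deviation.
print(observed.std(), filled.std())
```

Every filled entry sits exactly on the mean, so the spread shrinks artificially and any later correlation would be driven largely by invented points.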

In order to clean the data, I need to:
- Discard the unused years in all datasets, keeping only the 2014 columns
- Discard all rows in all datasets with missing values
- Rename the 2014 columns to include the name of the data they reference (in order to distinguish once combined)
- Combine all data into one dataset using an inner merge on the `country` column

#### Discard Unused Columns

In [7]:
# Columns to keep
columns = ['country', '2014']

suicide_df = suicide_df.filter(columns)
corruption_df = corruption_df.filter(columns)
democracy_df = democracy_df.filter(columns)
freedom_df = freedom_df.filter(columns)

#### Discard Rows with Missing Values

In [8]:
# Drop all rows with missing values
suicide_df.dropna(inplace=True)
corruption_df.dropna(inplace=True)
democracy_df.dropna(inplace=True)
freedom_df.dropna(inplace=True)

#### Rename Columns

In [9]:
suicide_df.rename(columns={'2014': 'suicide_rate'}, inplace=True)
corruption_df.rename(columns={'2014': 'corruption'}, inplace=True)
democracy_df.rename(columns={'2014': 'democracy'}, inplace=True)
freedom_df.rename(columns={'2014': 'freedom'}, inplace=True)

#### Combine Datasets

In [10]:
# Inner merge two dataframes on the "country" column
def merge(left, right):
    return pd.merge(left, right, on='country', how='inner')

dataframes = [suicide_df, corruption_df, democracy_df, freedom_df]
combined = reduce(merge, dataframes)

combined.head()

Unnamed: 0,country,suicide_rate,corruption,democracy,freedom
0,Armenia,1.54,37.0,41.3,60
1,Australia,11.2,80.0,90.1,86
2,Austria,11.2,72.0,85.4,86
3,Bahrain,0.541,49.0,28.7,23
4,Belgium,13.5,76.0,79.3,89


In [11]:
combined['country'].nunique()

45

The data is now free of missing values and combined into one dataframe, with separate columns for the suicide rate, corruption, democracy and freedom scores. Only 45 countries had 2014 data in all four datasets, so those are the countries I am left with.
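
To see which countries an inner merge leaves out, one option is an outer merge with pandas' `indicator=True` flag, which adds a `_merge` column recording where each row came from. The two small frames below are made-up stand-ins for the cleaned tables, not the real data.

```python
import pandas as pd

# Toy stand-ins for two of the cleaned tables (values are made up).
left = pd.DataFrame({'country': ['A', 'B', 'C'],
                     'suicide_rate': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'country': ['B', 'C', 'D'],
                      'corruption': [40.0, 50.0, 60.0]})

# An outer merge keeps every country; `_merge` is 'left_only',
# 'right_only' or 'both' for each row.
audit = pd.merge(left, right, on='country', how='outer', indicator=True)

# Countries that an inner merge would have dropped.
dropped = audit.loc[audit['_merge'] != 'both', 'country'].tolist()
print(dropped)
```

Applied to the real tables, the same check could show whether countries are lost through genuinely missing data or through spelling differences in the `country` column.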

Ideally there would be more data, but I am limited by the small number of countries for which suicide rates were recorded.

This data is now fully cleaned and ready for analysis.

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1

In [12]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.
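
As a starting point for the first question (how suicide rates are distributed and how widely they range), the sketch below summarises the `suicide_rate` column. It uses only the five rows printed by `combined.head()` in the wrangling section, so the numbers it produces are illustrative and not a finding; the real analysis would run on all 45 countries in `combined`.

```python
import numpy as np
import pandas as pd

# The five rows shown by combined.head(); a stand-in for the full dataframe.
sample = pd.DataFrame({
    'country': ['Armenia', 'Australia', 'Austria', 'Bahrain', 'Belgium'],
    'suicide_rate': [1.54, 11.2, 11.2, 0.541, 13.5],
})

rates = sample['suicide_rate']

# Count, mean, standard deviation, quartiles, min and max.
summary = rates.describe()

# Range between the highest and lowest rate.
spread = rates.max() - rates.min()

# Bin the rates into three equal-width intervals, as a histogram would.
counts, edges = np.histogram(rates, bins=3)

print(summary)
print(spread, counts, edges)
```

In the notebook itself, `combined['suicide_rate'].hist()` would draw the same binning as a plot.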


### Research Question 2

In [13]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.
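
For the correlation questions, one option is the Pearson coefficient that pandas computes with `DataFrame.corr`. The sketch below again uses only the five rows printed by `combined.head()`, so its output says nothing reliable about the full dataset. Note that a higher CPI score means a cleaner country, so the sign of that coefficient has to be read accordingly.

```python
import pandas as pd

# The five rows shown by combined.head(); a stand-in for the full dataframe.
sample = pd.DataFrame({
    'country': ['Armenia', 'Australia', 'Austria', 'Bahrain', 'Belgium'],
    'suicide_rate': [1.54, 11.2, 11.2, 0.541, 13.5],
    'corruption': [37.0, 80.0, 72.0, 49.0, 76.0],
    'democracy': [41.3, 90.1, 85.4, 28.7, 79.3],
    'freedom': [60, 86, 86, 23, 89],
})

indicators = ['corruption', 'democracy', 'freedom']

# Pairwise Pearson correlations; keep only the suicide_rate column
# and drop its trivial correlation with itself.
correlations = (
    sample[['suicide_rate'] + indicators]
    .corr(method='pearson')['suicide_rate']
    .drop('suicide_rate')
)

print(correlations)
```

A coefficient describes only linear association and does not imply causation, and with 45 countries any value should be interpreted with care.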


<a id='conclusions'></a>
## Conclusions
