# Step 1: The Hypotheses

## Hypothesis 1: GDP Per Capita vs Suicide

There's an endless debate between advocates of the "money brings happiness" philosophy and their opponents, the "money's just a number" proponents. Let's put these two views to the test. If money really does bring happiness, then as GDP per capita goes up, the suicide rate should go down. We can randomly sample the population and see what trend shows up in our sample.

To do this, we need to divide our population into two sectors: the high GDP per capita sector and the low GDP per capita sector. We can then randomly sample from each sector to get a so-called "treatment" group and a "control" group. The "treatment" group will be those in the "high GDP per capita" sector and the "control" group will be those in the "low GDP per capita" sector.

The above is my reasoning thus far. So, let me lay out my alternate and null hypotheses as follows:

Alternate Hypothesis: Citizens in the higher GDP per capita group are more prone to suicide than those in the lower GDP per capita group. $p_h - p_l > 0$ (exciting, new claim!)

Null Hypothesis: The suicide rates of the two groups are the same. $p_h - p_l = 0$ (no difference bro!)

Here $p_h$ and $p_l$ are the population proportions for the high and low groups; $\hat{p}_h$ and $\hat{p}_l$ are the estimates we get from our samples.
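One common way to test a difference in proportions like the one above is a pooled two-proportion z-test. The sketch below is only an illustration of that idea, not a step this notebook has run: the function name `two_proportion_z` and the counts in the usage lines are hypothetical and are not taken from `master.csv`.

```python
import math

def two_proportion_z(successes_h, n_h, successes_l, n_l):
    """Pooled z statistic and one-sided p-value for H_A: p_h - p_l > 0."""
    p_hat_h = successes_h / n_h
    p_hat_l = successes_l / n_l
    # Under the null both groups share one proportion, so pool the counts
    p_pool = (successes_h + successes_l) / (n_h + n_l)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_h + 1 / n_l))
    z = (p_hat_h - p_hat_l) / se
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only
z, p_value = two_proportion_z(30, 200, 20, 200)
```

A small p-value would count as evidence against the null hypothesis. For a two-sided alternative (as in Hypothesis 2), the usual adjustment is to double the tail probability of `abs(z)`.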

## Hypothesis 2: Sex vs Suicide 

I am curious to know: are men or women more likely to commit suicide? It would be interesting indeed if we found that, say, men committed suicide more often on average than women. If so, I wonder what the cause of that would be?

I will construct the following hypotheses to test this:

Alternate Hypothesis: The suicide rates between men and women are different (whoa! exciting!).

Null Hypothesis: The suicide rates between men and women are the same (no difference bro!).

## Hypothesis 3: Generation vs Suicide 

Did past generations have more suicides compared to current generations? I've been hearing that suicide rates have been going up. If this is true then the past generations would have lower suicide rates than current generations. So, based on this premise, I'm going to construct my hypotheses like this:

Alternate Hypothesis: The suicide rates among current generations (Gen X, Millennials, Gen Z) are higher than the suicide rates in previous generations (Silent, G.I., Boomers) (whoa! Cool!).

Null Hypothesis: The suicide rates among current generations are the same as the suicide rates in previous generations (no difference bro!).
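To test this, each row's generation label has to be collapsed into one of the two groups. A minimal sketch of that mapping follows. The names `GENERATION_GROUP` and `generation_group` are hypothetical, and the label `"Generation Z"` is an assumption, since it does not appear in the rows shown in this notebook.

```python
# Map each generation label in the data to "previous" or "current"
GENERATION_GROUP = {
    "G.I. Generation": "previous",
    "Silent": "previous",
    "Boomers": "previous",
    "Generation X": "current",
    "Millenials": "current",    # spelled this way in the data
    "Generation Z": "current",  # assumed label, not visible in the rows shown
}

def generation_group(label):
    """Return 'previous' or 'current' for a generation label."""
    return GENERATION_GROUP[label]

example = [generation_group(g) for g in ["Silent", "Generation X", "Boomers"]]
```

With a dataframe, the same dictionary could be passed to `Series.map` on the `gen` column to build a grouping column.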

## Hypothesis 4: Age vs Suicide 

Does age affect suicide? Do younger people commit suicide more than older folks? Or is it the other way around? 

The data is already divided by age group, so we can put forth the below hypotheses:

Alternate Hypothesis: The suicide rates among the various age groups are different (new! exciting!).

Null Hypothesis: The suicide rates among the various age groups are the same (no difference bro!).
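As a rough sketch of how a rate per age group could be computed, the example below uses only the ten Albania 1987 rows that appear in the `head()` output in Step 2. The names `rows`, `totals` and `rate_per_100k` are illustrative, and ten rows from one country-year say nothing about the hypothesis itself.

```python
import pandas as pd

# The ten Albania 1987 rows shown in the head() output of Step 2
rows = pd.DataFrame({
    "age": ["15-24 years", "35-54 years", "15-24 years", "75+ years", "25-34 years",
            "75+ years", "35-54 years", "25-34 years", "55-74 years", "5-14 years"],
    "suicides_no": [21, 16, 14, 1, 9, 1, 6, 4, 1, 0],
    "population": [312900, 308000, 289700, 21800, 274300,
                   35600, 278800, 257200, 137500, 311000],
})

# Sum counts within each age group first, then divide, so that each row
# is weighted by its population instead of averaging per-row rates
totals = rows.groupby("age")[["suicides_no", "population"]].sum()
rate_per_100k = totals["suicides_no"] / totals["population"] * 100_000
```

Summing before dividing is a design choice: averaging the `suicides_100k` column directly would give small and large population rows equal weight.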

## Hypothesis 5: Time vs Suicide 

As we go through time, is suicide increasing? Or is it decreasing? I've heard that suicide rates have been increasing. Let's put this to the test.

Alternate Hypothesis: The suicide rates in past years (1985 - 2000) are greater than the suicide rates in recent years (2001 - 2016).

Null Hypothesis: The suicide rates in past years (1985 - 2000) are the same as the suicide rates in recent years (2001 - 2016).
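Splitting the years into the two periods could look something like the sketch below. It uses four rows copied from the tables further down in this notebook (15-24 year olds from Albania 1987 and Australia 2011) purely to show the mechanics. Mixing two countries like this says nothing about the real trend, and the column name `period` is a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Four rows copied from the tables in this notebook (15-24 year olds only)
rows = pd.DataFrame({
    "country": ["Albania", "Albania", "Australia", "Australia"],
    "yr": [1987, 1987, 2011, 2011],
    "suicides_no": [21, 14, 242, 93],
    "population": [312900, 289700, 1570069, 1495053],
})

# Label each row with the period its year falls in
rows["period"] = np.where(rows["yr"] <= 2000, "1985-2000", "2001-2016")

# Pool counts within each period, then convert to a rate per 100k
totals = rows.groupby("period")[["suicides_no", "population"]].sum()
period_rate_per_100k = totals["suicides_no"] / totals["population"] * 100_000
```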

# Step 2: Retrieving the Data 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [9]:
#Read from the master.csv file and put into a dataframe
suicides = pd.read_csv('master.csv',sep=',',header=0,names=['country','yr','sex','age','suicides_no',
                                                            'population','suicides_100k','country_yr',
                                                            'HDI_yr','gdp_yr','gdp_pc','gen'])

In [23]:
#Check the first 25 entries in the dataframe
suicides.head(25)

Unnamed: 0,country,yr,sex,age,suicides_no,population,suicides_100k,country_yr,HDI_yr,gdp_yr,gdp_pc,gen
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers
5,Albania,1987,female,75+ years,1,35600,2.81,Albania1987,,2156624900,796,G.I. Generation
6,Albania,1987,female,35-54 years,6,278800,2.15,Albania1987,,2156624900,796,Silent
7,Albania,1987,female,25-34 years,4,257200,1.56,Albania1987,,2156624900,796,Boomers
8,Albania,1987,male,55-74 years,1,137500,0.73,Albania1987,,2156624900,796,G.I. Generation
9,Albania,1987,female,5-14 years,0,311000,0.0,Albania1987,,2156624900,796,Generation X


In [30]:
suicides.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 13 columns):
country          27820 non-null object
yr               27820 non-null int64
sex              27820 non-null object
age              27820 non-null object
suicides_no      27820 non-null int64
population       27820 non-null int64
suicides_100k    27820 non-null float64
country_yr       27820 non-null object
HDI_yr           8364 non-null float64
gdp_yr           27820 non-null object
gdp_pc           27820 non-null int64
gen              27820 non-null object
gdp_pc_hilo      27820 non-null int64
dtypes: float64(2), int64(5), object(6)
memory usage: 2.8+ MB
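The `info()` output lists `gdp_yr` as `object` rather than a numeric dtype. If that is because the raw file stores the values as strings with thousands separators (an assumption, since the raw CSV is not shown here), a conversion along these lines would make the column numeric. The variable names and the comma-formatted strings are illustrative; the two underlying values are the `gdp_yr` figures for Albania 1987 and Australia 2011 shown in this notebook.

```python
import pandas as pd

# Assumed raw format: digits with comma thousands separators
raw_gdp_yr = pd.Series(["2,156,624,900", "1,394,280,784,778"])

# Strip the separators as plain text (no regex), then parse as numbers
gdp_yr_numeric = pd.to_numeric(raw_gdp_yr.str.replace(",", "", regex=False))
```

On the real dataframe the same two calls would be applied to `suicides['gdp_yr']`, after confirming what the strings actually look like.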


In [32]:
suicides.shape

(27820, 13)

# Step 3: Testing the Hypotheses 

## Hypothesis 1: GDP Per Capita versus Suicide 

### Step 1a: Determining the High GDP (treatment) and Low GDP (control) groups

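Method 1: Take the max GDP per capita and the min GDP per capita, and compute (max - min) / 2. This is half the range rather than the true midpoint, (max + min) / 2, but because the minimum is small compared with the maximum the two are close, so it serves as a "naive" boundary between the two groups for now.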

In [11]:
suicides.gdp_pc

0         796
1         796
2         796
3         796
4         796
5         796
6         796
7         796
8         796
9         796
10        796
11        796
12        769
13        769
14        769
15        769
16        769
17        769
18        769
19        769
20        769
21        769
22        769
23        769
24        833
25        833
26        833
27        833
28        833
29        833
         ... 
27790    1964
27791    1964
27792    1964
27793    1964
27794    1964
27795    1964
27796    2150
27797    2150
27798    2150
27799    2150
27800    2150
27801    2150
27802    2150
27803    2150
27804    2150
27805    2150
27806    2150
27807    2150
27808    2309
27809    2309
27810    2309
27811    2309
27812    2309
27813    2309
27814    2309
27815    2309
27816    2309
27817    2309
27818    2309
27819    2309
Name: gdp_pc, Length: 27820, dtype: int64

In [24]:
#Let's do a groupby on gdp per capita and see how many total suicides there are
gdp_pc_grouped = suicides.groupby('gdp_pc')['suicides_no'].sum()

In [25]:
gdp_pc_grouped

gdp_pc
251         47
291        556
313        510
345        560
357        106
359        567
385         83
387        112
398        509
424        479
425        605
426        130
428       1576
431       5668
435        496
437        654
441       1416
454       5345
458         87
459        480
462         49
465         77
476         69
484        438
508         47
513       1914
514        105
515         67
516       1251
528        563
          ... 
84442      355
85394      573
85397       49
86068     1073
87951     1038
87961       33
89634     1071
90490       50
90797      485
90809     1032
91587       55
93053     1034
93066       37
93270       43
93638      548
95351       65
103431     505
103443     548
107430     598
107456      64
108408     515
109483      48
109804     554
111328      54
112581      78
113120      51
120423      40
121315      43
122729      55
126352      67
Name: suicides_no, Length: 2233, dtype: int64

In [12]:
#Let's define a naive boundary to decide whether we are in a "high gdp" or "low gdp"
dboundary = (suicides.gdp_pc.max() - suicides.gdp_pc.min()) / 2

In [13]:
dboundary

63050.5
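For comparison, here is the same boundary next to the true midpoint of the range, using the smallest and largest `gdp_pc` values visible in the grouped output above (251 and 126352). The variable names are illustrative only.

```python
# Smallest and largest gdp_pc values from the grouped output above
gdp_min, gdp_max = 251, 126352

# Boundary used in this notebook: half the range
half_range = (gdp_max - gdp_min) / 2

# Midpoint of the range, i.e. gdp_min + half_range
midpoint = (gdp_max + gdp_min) / 2

# The two candidate boundaries differ by exactly gdp_min
difference = midpoint - half_range
```

The two boundaries differ only by the minimum value, so rows with a `gdp_pc` between them are the only ones whose group would change.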

In [26]:
#We engineer a new binary column called "gdp_pc_hilo" which shows 1 = high gdp, 0 = low gdp
suicides['gdp_pc_hilo'] = (suicides['gdp_pc'] >= dboundary).astype(int)

In [28]:
suicides[suicides['gdp_pc_hilo'] == 1]

Unnamed: 0,country,yr,sex,age,suicides_no,population,suicides_100k,country_yr,HDI_yr,gdp_yr,gdp_pc,gen,gdp_pc_hilo
1726,Australia,2011,male,75+ years,146,588053,24.83,Australia2011,0.930,1394280784778,66770,Silent,1
1727,Australia,2011,male,35-54 years,704,3072726,22.91,Australia2011,0.930,1394280784778,66770,Generation X,1
1728,Australia,2011,male,25-34 years,347,1610295,21.55,Australia2011,0.930,1394280784778,66770,Millenials,1
1729,Australia,2011,male,55-74 years,366,2104816,17.39,Australia2011,0.930,1394280784778,66770,Boomers,1
1730,Australia,2011,male,15-24 years,242,1570069,15.41,Australia2011,0.930,1394280784778,66770,Millenials,1
1731,Australia,2011,female,35-54 years,230,3124328,7.36,Australia2011,0.930,1394280784778,66770,Generation X,1
1732,Australia,2011,female,15-24 years,93,1495053,6.22,Australia2011,0.930,1394280784778,66770,Millenials,1
1733,Australia,2011,female,55-74 years,118,2139108,5.52,Australia2011,0.930,1394280784778,66770,Boomers,1
1734,Australia,2011,female,75+ years,45,817927,5.50,Australia2011,0.930,1394280784778,66770,Silent,1
1735,Australia,2011,female,25-34 years,85,1584036,5.37,Australia2011,0.930,1394280784778,66770,Millenials,1


In [29]:
suicides[suicides['gdp_pc_hilo'] == 0]

Unnamed: 0,country,yr,sex,age,suicides_no,population,suicides_100k,country_yr,HDI_yr,gdp_yr,gdp_pc,gen,gdp_pc_hilo
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X,0
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent,0
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X,0
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation,0
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers,0
5,Albania,1987,female,75+ years,1,35600,2.81,Albania1987,,2156624900,796,G.I. Generation,0
6,Albania,1987,female,35-54 years,6,278800,2.15,Albania1987,,2156624900,796,Silent,0
7,Albania,1987,female,25-34 years,4,257200,1.56,Albania1987,,2156624900,796,Boomers,0
8,Albania,1987,male,55-74 years,1,137500,0.73,Albania1987,,2156624900,796,G.I. Generation,0
9,Albania,1987,female,5-14 years,0,311000,0.00,Albania1987,,2156624900,796,Generation X,0


Before we move along, we need to pay attention to two main statistical criteria. We would like to use the normal model (bell curve) to estimate the population proportions in the high and low groups. 

1. Independence Criterion - Our sample size needs to be lower than 10% of the total population to ensure that the observations are sufficiently independent. On top of this, the sample also needs to be big enough. If the population is not heavily skewed, we can get away with a sample size of 30 or more in each group. However, the suicides_no variable is heavily skewed, so in this case we need a sample size greater than 100.

2. Success-Failure Criterion - The expected number of successes (sample size * proportion) and the expected number of failures (sample size * (1 - proportion)) in each group must each be at least 10.

Once the above criteria are satisfied, we can use the normal model to estimate our population proportions.
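The two criteria can be written as a small check. This is a sketch under stated assumptions: the function name `normal_model_ok` and its parameters are hypothetical, and the numbers in the usage lines are made up, apart from 27820, which is the total row count reported by `suicides.shape`.

```python
def normal_model_ok(sample_size, group_size, p_hat, min_count=10):
    """Check the 10% condition and the success-failure condition for one group."""
    # Independence: the sample should be under 10% of the group it is drawn from
    independent = sample_size < 0.10 * group_size
    # Success-failure: expected successes and failures both at least min_count
    successes = sample_size * p_hat
    failures = sample_size * (1 - p_hat)
    success_failure = successes >= min_count and failures >= min_count
    return independent and success_failure

# Illustrative values only
ok_typical = normal_model_ok(200, 27820, 0.15)    # 30 successes, 170 failures
ok_rare = normal_model_ok(200, 27820, 0.01)       # only 2 expected successes
ok_small_group = normal_model_ok(200, 1000, 0.5)  # sample is 20% of the group
```

The check has to pass for both the high and the low group before the normal model is used for the difference.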

In [None]:
#First, let's start with a random sample of N = 200 from each group. This keeps us well above the
#100 minimum from the first criterion.
sample_size = 200
#Total number of rows in the dataframe
population = suicides.shape[0]
