### Table of contents

* [Objectives](#Objectives)
* [EDA + Data Cleaning](#EDA)
* [Modeling](#modeling)
* [Summary](#Summary)
* [Resources](#Resources)

In [1]:
# libraries
import pandas as pd
import numpy as np
import pandas_profiling

import seaborn as sns  # for statistical graphs
import cufflinks  # connects pandas with plotly to create interactive plots easily
cufflinks.go_offline()


<a id='Objectives'></a>

## Objectives:




> **To test a hypothesis about the share of deaths that falls in each Hijri (Islamic calendar) month.**

---

#### The main question: 
* Many in the Muslim community believe that people die more often in the month of Shaban (the 8th month). Is this true?

> **The null hypothesis:** The share of deaths in Shaban (8th month) is at least as high as in the other months.

> **The alternative hypothesis:** The share of deaths in Shaban is lower than in the other months.


<a id='EDA'></a>

## Data Exploration and Data Cleaning


* The data were collected from: 
> https://services.amana-md.gov.sa/eservicesite/Inq/DeathInquiry.aspx

* Since we are only interested in the hypothesis about death counts per month, we collected only the dates of burial.

* In the Islamic world, burial usually takes place by the day after death, so we assume that the burial date is approximately the date of death.




In [2]:
# reading the data 
df = pd.read_csv('data/data.csv',names=['date_of_burial'])

In [3]:
# last 5 records 
df.tail()

| | date_of_burial |
|---|---|
| 113938 | 1439/12/30 |
| 113939 | 1439/12/30 |
| 113940 | 1439/12/30 |
| 113941 | 1439/12/30 |
| 113942 | 1439/12/30 |


In [4]:
# number of samples
len(df)

113943

In [5]:
# missing values 
df.isnull().sum() 
# Ans: no missing 

date_of_burial    0
dtype: int64

**Correcting the Date**

In [6]:


# wrangling the data 
# function to extract time variables
get_year = lambda hijri_date: hijri_date.split('/')[0]
get_month = lambda hijri_date: hijri_date.split('/')[1]
get_day = lambda hijri_date: hijri_date.split('/')[2]

# applying those functions
df['year'] = df.date_of_burial.apply(get_year)
df['month'] = df.date_of_burial.apply(get_month)
df['day'] = df.date_of_burial.apply(get_day)

# check
df.sample(3)

| | date_of_burial | year | month | day |
|---|---|---|---|---|
| 100171 | 1437/12/27 | 1437 | 12 | 27 |
| 19664 | 1419/07/07 | 1419 | 7 | 7 |
| 17347 | 1418/11/03 | 1418 | 11 | 3 |


> We are not using the pandas **to_datetime** function because it would not treat these as Hijri dates, which have a different structure from Gregorian dates.

> I decided to treat them as plain strings.
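As a minimal sketch, the same string-based extraction can also be written with pandas' vectorized string methods. The three dates are copied from the sample output above; the `sample` and `parts` names are only illustrative.

```python
import pandas as pd

# Three Hijri date strings in the YYYY/MM/DD layout, copied from the sample output above
sample = pd.DataFrame({'date_of_burial': ['1437/12/27', '1419/07/07', '1418/11/03']})

# Split once on '/' and spread the three parts into separate columns
parts = sample['date_of_burial'].str.split('/', expand=True)
parts.columns = ['year', 'month', 'day']
sample = sample.join(parts)

# The parts stay strings, so leading zeros such as '07' are preserved
print(sample)
```

Because the parts remain strings, comparisons such as `month == '17'` work as in the rest of the notebook, and `astype('int')` is needed before any numeric check.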

**Different data types problems check**

In [7]:
# Do we have dates in each row or maybe some other variable (non-numeric variable)? 
df.year.str.isnumeric().mean() 

# this means that all years are in a good format 

1.0

**Checking for duplicated extraction (may happen when collecting the data)**

In [8]:
# Did we extract some part of the data twice when collecting it?
# Note: the website returns the records in chronological order, so we can use pandas diff on the year column to check
# example: year (next row) - year (current row) < 0 --> the years go backwards, so a block was extracted again
df.year.astype('int').diff().nsmallest()


1    0.0
3    0.0
4    0.0
5    0.0
6    0.0
Name: year, dtype: float64

> No negative values --> the years never go backwards, so no block of data was extracted twice.
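The idea behind this check can be sketched on two made-up year sequences. The helper name `has_backward_jump` and the toy values are hypothetical and are not part of the original analysis.

```python
import pandas as pd

# Hypothetical year sequences: one in order, one where a block was extracted twice
in_order = pd.Series([1422, 1422, 1423, 1423, 1424])
repeated = pd.Series([1422, 1423, 1424, 1422, 1423, 1424])

def has_backward_jump(years: pd.Series) -> bool:
    """Return True if any year is smaller than the one before it."""
    # diff() leaves NaN in the first position, so drop it before comparing
    return bool((years.diff().dropna() < 0).any())

print(has_backward_jump(in_order))
print(has_backward_jump(repeated))
```

Note that this only detects repeats that break the chronological order. A record duplicated right next to itself produces a difference of 0 and would not be flagged.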

In [9]:
# fast exploration of month counts
df[['year','month','day']].astype('int').describe()

| | year | month | day |
|---|---|---|---|
| count | 113943.0 | 113943.0 | 113943.0 |
| mean | 1427.945034 | 6.754439 | 15.366683 |
| std | 7.994828 | 3.515362 | 8.584238 |
| min | 1364.0 | 1.0 | 1.0 |
| 25% | 1422.0 | 4.0 | 8.0 |
| 50% | 1429.0 | 7.0 | 15.0 |
| 75% | 1435.0 | 10.0 | 23.0 |
| max | 1439.0 | 17.0 | 30.0 |


> We can see that the maximum month value is 17, which is invalid because the Hijri calendar has only 12 months.

In [10]:
# Extracting the row with the month 17 
df[df.month=='17']

| | date_of_burial | year | month | day |
|---|---|---|---|---|
| 35721 | 1423/17/17 | 1423 | 17 | 17 |
| 35722 | 1423/17/17 | 1423 | 17 | 17 |
| 35723 | 1423/17/17 | 1423 | 17 | 17 |
| 35724 | 1423/17/17 | 1423 | 17 | 17 |


> Four records have incorrect month values.
>> Can we correct them by looking at the neighbouring records?

In [11]:
df.iloc[35719:35727,:]

| | date_of_burial | year | month | day |
|---|---|---|---|---|
| 35719 | 1423/12/30 | 1423 | 12 | 30 |
| 35720 | 1423/12/30 | 1423 | 12 | 30 |
| 35721 | 1423/17/17 | 1423 | 17 | 17 |
| 35722 | 1423/17/17 | 1423 | 17 | 17 |
| 35723 | 1423/17/17 | 1423 | 17 | 17 |
| 35724 | 1423/17/17 | 1423 | 17 | 17 |
| 35725 | 1424/01/01 | 1424 | 1 | 1 |
| 35726 | 1424/01/01 | 1424 | 1 | 1 |


> Because the data were collected in order and these records sit between 1423/12/30 and 1424/01/01, we can assume that month 17 is actually 12.
> We also correct the day from 17 to 30.
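A range check of this kind can be sketched as follows, using three month/day pairs copied from the output above. It relies on the fact that a Hijri year has 12 months of 29 or 30 days each; the variable names are illustrative only.

```python
import pandas as pd

# Month/day pairs copied from the rows shown above (kept as strings, as in the notebook)
rows = pd.DataFrame({'month': ['12', '17', '01'], 'day': ['30', '17', '01']})

month = rows['month'].astype(int)
day = rows['day'].astype(int)

# A Hijri year has 12 months, each of 29 or 30 days
invalid = ~month.between(1, 12) | ~day.between(1, 30)

# Only the row with month 17 is flagged
print(rows[invalid])
```

Running the same mask on the full dataframe would list every record that needs manual correction, not only the ones that happen to show up in `describe()`.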

In [12]:
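# correcting the month & day 17
indexes = df[df.month=='17'].index
# .loc selects by index label, which is what `indexes` holds
df.loc[indexes, ['date_of_burial', 'year', 'month', 'day']] = ['1423/12/30', '1423', '12', '30']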


In [13]:
# check again 
df.iloc[indexes,:]

| | date_of_burial | year | month | day |
|---|---|---|---|---|
| 35721 | 1423/12/30 | 1423 | 12 | 30 |
| 35722 | 1423/12/30 | 1423 | 12 | 30 |
| 35723 | 1423/12/30 | 1423 | 12 | 30 |
| 35724 | 1423/12/30 | 1423 | 12 | 30 |


In [14]:
# How is the amount of data changing over time?
df.year.value_counts().sort_index().iplot(kind='line',fill=True,title='Counts of data over time')

> **Assumption:** Systematic data collection appears to have started around the year **1412**. To remove the sparse early years (outliers), we test the hypothesis on data from the year **1415** onward.

In [15]:
# removing the outliers 
df_cleaned =  df[df.year.astype('int') >= 1415].copy()

In [16]:
# check
df_cleaned.head()

| | date_of_burial | year | month | day |
|---|---|---|---|---|
| 5378 | 1415/01/01 | 1415 | 1 | 1 |
| 5379 | 1415/01/01 | 1415 | 1 | 1 |
| 5380 | 1415/01/01 | 1415 | 1 | 1 |
| 5381 | 1415/01/01 | 1415 | 1 | 1 |
| 5382 | 1415/01/01 | 1415 | 1 | 1 |


In [17]:
# number of samples we have in this dataset? how many years? 
df_cleaned.year.nunique()

# Ans: 25 years of data 

25

In [18]:
# wrangling the data 
# we need the month count share (percentage) for each year (preparing our samples)
df_main = (df_cleaned.month.groupby(df_cleaned.year).value_counts(normalize=True)*100).unstack()
df_main.index.rename('year',inplace=True)
# check
df_main.head()

| year | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1415 | 8.019967 | 8.1198 | 7.054908 | 7.321131 | 6.788686 | 7.420965 | 8.086522 | 8.319468 | 9.084859 | 8.552413 | 8.918469 | 12.312812 |
| 1416 | 8.538486 | 7.852914 | 8.039888 | 6.918043 | 6.762231 | 6.762231 | 7.946401 | 8.413836 | 9.442194 | 8.320349 | 9.566843 | 11.436585 |
| 1417 | 8.120722 | 6.440572 | 6.565028 | 6.160548 | 6.036092 | 7.405103 | 7.311761 | 8.494088 | 10.765401 | 9.645302 | 10.889857 | 12.165526 |
| 1418 | 9.387262 | 6.610323 | 6.88198 | 6.88198 | 6.338666 | 6.308482 | 6.731059 | 7.727136 | 9.689104 | 9.417446 | 11.137941 | 12.888621 |
| 1419 | 8.952096 | 7.754491 | 6.766467 | 7.39521 | 6.946108 | 6.556886 | 7.005988 | 7.39521 | 9.760479 | 9.97006 | 9.191617 | 12.305389 |


* Now we have **25 samples** (years) for each of **12 groups** (months).
* This is the final dataset on which we will run our analysis (**statistical test**).
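To make the shape of this year-by-month table concrete, here is a small sketch on made-up data. It uses `pd.crosstab` with `normalize='index'` rather than the `groupby`/`value_counts`/`unstack` chain used in the notebook; the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two years with a handful of burials each
toy = pd.DataFrame({
    'year':  ['1415', '1415', '1415', '1415', '1416', '1416'],
    'month': ['08',   '08',   '09',   '12',   '08',   '12'],
})

# Row-normalised cross-tabulation: each row (year) sums to 100 %
share = pd.crosstab(toy['year'], toy['month'], normalize='index') * 100
print(share)
```

One difference worth knowing: `crosstab` fills a year/month combination with no records with 0, whereas `unstack()` without a `fill_value` leaves it as `NaN`.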

In [19]:
import plotly.graph_objs as go    



# Percentage share of each month, computed for each year separately
# bars: compare the months' percentage shares and how they change over the years
df_main.iplot(kind='bar',barmode='stack',title='Months percentage share over years')


# box: compare the overall distribution of each month's share; less sensitive to outlier years
layout = go.Layout(yaxis=dict(range=[0,50]),title='Months percentage share overall')
df_main.iplot(kind='box',layout=layout)



* We can see an increase for months 9 and 12.
> **Hypothesis**: this is due to the Ramadan and Hajj seasons.

<a id='modeling'></a>

## Modeling 

### Two-sample t-test

I decided to use the two-sample t-test because it gives a direct answer to our question.
> **Two-sample t-test:** The independent-samples t-test (2-sample t-test) compares the means of two independent groups to determine whether there is statistical evidence that the associated population means are significantly different.

> **The null hypothesis for the test:** Mean(Shaban deaths share) >= Mean(other months deaths share)

> **The alternative hypothesis for the test:** Mean(Shaban deaths share) < Mean(other months deaths share)
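For intuition, the one-sided pooled-variance t-test can be computed by hand. The sketch below assumes equal variances in both groups, which is also the default (`usevar='pooled'`) of statsmodels' `ttest_ind`. The function name `pooled_t_less` and the input values are hypothetical and not taken from the dataset.

```python
import numpy as np
from scipy import stats

def pooled_t_less(x1, x2):
    """One-sided pooled two-sample t-test; alternative: mean(x1) < mean(x2)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    t_stat = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    # Lower-tail probability under Student's t with n1 + n2 - 2 degrees of freedom
    p_value = stats.t.cdf(t_stat, df=n1 + n2 - 2)
    return t_stat, p_value

# Hypothetical yearly shares (percent), for illustration only
shaban = [7.9, 8.1, 7.6, 8.0, 7.7]
others = [8.3, 8.4, 8.3, 8.4, 8.3]
print(pooled_t_less(shaban, others))
```

A negative statistic with a small lower-tail p-value is evidence for the alternative, i.e. that the first group's mean is smaller than the second's.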

In [20]:
# preparing the two groups for the test
group1 = df_main['08']# Shaban group
group1.head()

year
1415    8.319468
1416    8.413836
1417    8.494088
1418    7.727136
1419    7.395210
Name: 08, dtype: float64

In [21]:
# taking the mean for all months each year
group2 = df_main.drop(columns=['08']).mean(axis=1) # other months group
group2.head()

year
1415    8.334594
1416    8.326015
1417    8.318719
1418    8.388442
1419    8.418617
dtype: float64

**Normality test before the t-test**

* Group1:

In [22]:
from scipy.stats import shapiro
# normality test
stat, p = shapiro(group1)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

Statistics=0.968, p=0.604
Sample looks Gaussian (fail to reject H0)


* Group2:

In [23]:

# normality test
stat, p = shapiro(group2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

Statistics=0.968, p=0.604
Sample looks Gaussian (fail to reject H0)
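Since the same Shapiro-Wilk check is run on both groups, it could be wrapped in a small helper. This is only a sketch: `check_normality` is a hypothetical name, and the demo data are randomly generated rather than taken from the dataset.

```python
import numpy as np
from scipy.stats import shapiro

def check_normality(sample, alpha=0.05):
    """Run Shapiro-Wilk and return (statistic, p_value, looks_gaussian)."""
    stat, p = shapiro(sample)
    # Failing to reject H0 (p > alpha) means the sample is compatible with normality
    return stat, p, bool(p > alpha)

# Hypothetical data drawn from a normal distribution, for illustration only
rng = np.random.default_rng(0)
demo = rng.normal(loc=8.3, scale=0.5, size=25)
stat, p, looks_gaussian = check_normality(demo)
print('Statistics=%.3f, p=%.3f, looks Gaussian: %s' % (stat, p, looks_gaussian))
```

Calling the helper once per group keeps the two checks identical and makes it easier to spot a case where the wrong variable was passed.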


**T-test:**
> https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ttest_ind.html

In [24]:
from statsmodels.stats.weightstats import ttest_ind

# null hypothesis (H0): mean(group1) >= mean(group2)
tstat, pval, _ = ttest_ind(x1=group1,  # Shaban group
                           x2=group2,  # other months group
                           alternative='smaller',  # H1: mean(group1) is smaller than mean(group2)
                           )

print('p-values = ',pval)

if pval < 0.05:    # alpha value is 0.05 or 5%
    print("\n we are rejecting null hypothesis")
else:
    print("we fail to reject the null hypothesis")


p-values =  2.2742567544460084e-07

 we are rejecting null hypothesis


<a id='Summary'></a>
## Summary

> **To answer the main question: the data do not support the belief that more deaths occur in the month of Shaban. The one-sided t-test indicates that Shaban's share of deaths is significantly lower than the average share of the other months. Some months, such as month 12, have a noticeably higher share. I believe this is due to the Hajj season, when many Muslims travel from around the world to Saudi Arabia, which affects our results and the study.**


<a id='Resources'></a>
## Resources

* https://services.amana-md.gov.sa/eservicesite/Inq/DeathInquiry.aspx

