![license_header_logo](https://user-images.githubusercontent.com/59526258/124226124-27125b80-db3b-11eb-8ba1-488d88018ebb.png)
> **Copyright (c) 2020-2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful,
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

## Introduction 

In this notebook we look at different types of confidence interval and see how to compute them in Python.

## What will we accomplish?

By the end of the notebook we will be able to:
1. Recognise the different types of confidence interval
2. Know how and when to use each type of confidence interval
3. Apply the confidence interval formulas

## Notebook Content
* [Part 1: Population Proportion Confidence Interval](#propotion)
    * [Confidence Interval for the female population proportion that has heart disease](#femalepropotion)
    * [Confidence Interval for the difference in population proportions](#diffpropotion)
* [Part 2: Population Mean Confidence Interval](#meaninterval)
    * [Confidence Interval for the female mean cholesterol level](#femalemeaninterval)
    * [Confidence Interval for the difference in population means](#diffmeaninterval)

We will mainly use pandas and NumPy.

In [25]:
import pandas as pd
import numpy as np

The dataset can be downloaded from https://www.kaggle.com/johnsmith88/heart-disease-dataset

In [47]:
df = pd.read_csv('../data/heart.csv')
df

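| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |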


<a name="propotion"><h3><b>Population Proportion Confidence Interval</b></h3></a>

We build a data frame of people with and without heart disease, split by gender. First, map the numeric `sex` codes to gender labels

In [27]:
df['gender'] = df.sex.replace({1: "Male", 0: "Female"})

Keep only the two columns we need and drop any rows with missing values

In [28]:
data = df[["target", "gender"]].dropna()

In [29]:
pd.crosstab(data.target, data.gender)

| target | Female | Male |
|---|---|---|
| 0 | 86 | 413 |
| 1 | 226 | 300 |


<a name="femalepropotion"><h4><b>Confidence Interval for the female population proportion that has heart disease</b></h4></a>

Take the number of females and the proportion of them with heart disease from the crosstab above; these are the inputs to the confidence interval

In [30]:
f=86+226
f_propotion = 226/(f)
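
The counts above are typed in by hand from the crosstab. As a rough sketch, they could also be read from the crosstab itself, which avoids copying errors. The `data` frame below is a stand-in rebuilt from the crosstab counts so the example runs without `heart.csv`, and the names `counts`, `n_female`, `female_disease` and `p_female` are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Stand-in for the notebook's `data` frame, rebuilt from the crosstab
# counts shown above (assumption: only used so this runs without heart.csv).
data = pd.DataFrame({
    "gender": ["Female"] * 312 + ["Male"] * 713,
    "target": [0] * 86 + [1] * 226 + [0] * 413 + [1] * 300,
})

counts = pd.crosstab(data.target, data.gender)

# The column total is the group size; row label 1 means "has heart disease".
n_female = int(counts["Female"].sum())
female_disease = int(counts.loc[1, "Female"])
p_female = female_disease / n_female

print(n_female, female_disease, round(p_female, 4))  # 312 226 0.7244
```

The same lookup with `"Male"` in place of `"Female"` gives the male counts.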

Get the standard error of the female proportion with heart disease (see the confidence interval for proportions formula)

In [31]:
se_female = np.sqrt(f_propotion * (1 - f_propotion) / f)
se_female

0.02529714756803247
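
For reference, the standard error computed above and the interval built from it in the next step follow the usual normal-approximation formulas for a proportion. The symbols $\hat{p}$ (sample proportion), $n$ (sample size) and $z$ (critical value) are notation introduced here, not names from the notebook.

$$
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\qquad
\text{CI} = \hat{p} \pm z \cdot \mathrm{SE}(\hat{p})
$$

With $\hat{p} = 226/312 \approx 0.7244$ and $n = 312$, this gives $\mathrm{SE}(\hat{p}) \approx 0.0253$, matching the value above.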

For a 95% confidence level the z-score is 1.96 (available from a standard normal table). Use it in the formula to get the interval

In [32]:
z_score = 1.96

lower_interval_FP = f_propotion - z_score* se_female
upper_interval_FP = f_propotion + z_score* se_female

(lower_interval_FP, upper_interval_FP)

(0.6747765651256307, 0.773941383592318)

For the proportion of females with heart disease: the lower and upper bounds of the 95% confidence interval are 0.675 and 0.774.
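
The steps above can be wrapped in a small reusable function. This is only a sketch: `proportion_ci` is a hypothetical helper name, and it assumes the same normal-approximation interval and the rounded z-score of 1.96 used in this notebook.

```python
import numpy as np


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n                 # sample proportion
    se = np.sqrt(p * (1 - p) / n)     # standard error of the proportion
    return p - z * se, p + z * se


# Females with heart disease: 226 out of 312
lower, upper = proportion_ci(226, 312)
print(round(lower, 3), round(upper, 3))  # 0.675 0.774
```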

There are several ways to get the 95% confidence interval, and using statsmodels is one of them

In [33]:
import statsmodels.api as sm
sm.stats.proportion_confint(f * f_propotion, f)

(0.6747774762140357, 0.773940472503913)
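
The statsmodels bounds differ from the manual ones in the sixth decimal place. That is consistent with `proportion_confint` using the unrounded normal quantile instead of the rounded 1.96. The sketch below uses the standard library's `statistics.NormalDist`, which is not part of the original notebook, to show that quantile along with the values for two other common confidence levels.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# A two-sided 95% interval leaves 2.5% in each tail,
# so the critical value is the 97.5th percentile.
z_95 = std_normal.inv_cdf(0.975)
z_90 = std_normal.inv_cdf(0.95)
z_99 = std_normal.inv_cdf(0.995)

print(round(z_90, 4), round(z_95, 4), round(z_99, 4))  # 1.6449 1.96 2.5758
```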

<a name="diffpropotion"><h4><b>Confidence Interval for the difference in population proportions</b></h4></a>

In [34]:
# Male counts from the crosstab above: 300 with heart disease, 413 without
m = 300+413 
m_propotion = 300/(m) 

Get the standard error of the male proportion with heart disease (see the confidence interval for proportions formula)

In [35]:
se_male = np.sqrt(m_propotion * (1 - m_propotion) / m)
round(se_male, 4)

0.0185

Combine the female and male standard errors into the standard error of the difference

In [36]:
se_diff = np.sqrt(se_female**2 + se_male**2)

Get the difference in proportions

In [37]:
d_propotion = f_propotion - m_propotion 

The lower and upper bounds of the interval for the difference

In [38]:
lower_interval_DP = d_propotion - z_score* se_diff
upper_interval_DP = d_propotion + z_score* se_diff

(round(lower_interval_DP, 3), round(upper_interval_DP, 3))

(0.242, 0.365)

For the difference between the female and male proportions with heart disease: the lower and upper bounds of the 95% confidence interval are 0.242 and 0.365.
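
Written out, the calculation in this section follows the usual formula for the difference of two independent proportions. The subscripts $f$ and $m$ (female, male) and the symbol $z$ are notation introduced here for illustration.

$$
\mathrm{SE}(\hat{p}_f - \hat{p}_m) = \sqrt{\mathrm{SE}(\hat{p}_f)^2 + \mathrm{SE}(\hat{p}_m)^2}
= \sqrt{\frac{\hat{p}_f(1-\hat{p}_f)}{n_f} + \frac{\hat{p}_m(1-\hat{p}_m)}{n_m}},
\qquad
\text{CI} = (\hat{p}_f - \hat{p}_m) \pm z \cdot \mathrm{SE}(\hat{p}_f - \hat{p}_m)
$$

The variances add because the two groups are treated as independent samples.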


<a name="meaninterval"><h3><b>Population Mean Confidence Interval</b></h3></a>

<a name="femalemeaninterval"><h4>Confidence Interval for the female mean cholesterol level</h4></a>

Here we get the mean, standard deviation and sample size of the cholesterol level for each gender

In [39]:
df.groupby("gender").agg({"chol": ["mean", "std", "size"]})

| gender | chol mean | chol std | chol size |
|---|---|---|---|
| Female | 261.455128 | 64.466781 | 312 |
| Male | 239.237027 | 43.155535 | 713 |


Set up the values needed for the confidence interval of the mean (see the formula for the confidence interval for the mean of normally distributed data)

In [40]:
z = 1.96
mean_f = 261.455128  
sd_f = 64.466781     
n_f = 312
se_f = sd_f /np.sqrt(n_f)
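
For reference, the standard error above and the interval in the next step follow the usual formulas for a mean. The symbols $\bar{x}$ (sample mean), $s$ (sample standard deviation), $n$ (sample size) and $z$ (critical value) are notation introduced here.

$$
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}},
\qquad
\text{CI} = \bar{x} \pm z \cdot \mathrm{SE}(\bar{x})
$$

With $s = 64.466781$ and $n = 312$, this gives $\mathrm{SE}(\bar{x}) \approx 3.65$.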

Get the intervals for 95% confidence level

In [41]:
lower_interval_FM = mean_f - z* se_f  
upper_interval_FM = mean_f + z* se_f  
(lower_interval_FM, upper_interval_FM)

(254.30169095203016, 268.6085650479699)

For the female cholesterol level: the lower and upper bounds of the 95% confidence interval are 254.302 and 268.609.

<a name="diffmeaninterval"><h4><b>Confidence Interval for the difference in population means</b></h4></a>

Get the same values from the table above for the male population

In [42]:
mean_m = 239.237027  
sd_m = 43.155535     
n_m = 713
se_m = sd_m /np.sqrt(n_m)

Find the standard error of the difference using the unpooled approach. We use the unpooled approach because the female and male variances are clearly not equal, as the standard deviations above show. The two squared standard errors are added and the square root is taken

In [43]:
mean_d = mean_f - mean_m

# Unpooled standard error of the difference in means
se_d = np.sqrt(se_f**2 + se_m**2)

Calculate the interval for the difference 

In [44]:
lower_interval_DM = mean_d - 1.96*se_d  
upper_interval_DM = mean_d + 1.96*se_d  
(round(lower_interval_DM, 3), round(upper_interval_DM, 3))

(14.395, 30.042)

For the difference between the female and male mean cholesterol levels: the lower and upper bounds of the 95% confidence interval are 14.395 and 30.042.
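
To make the choice between the unpooled and pooled approaches concrete, here is a minimal sketch that computes both standard errors from the summary statistics in this section. The function names `unpooled_se` and `pooled_se` are hypothetical helpers, and the formulas are the standard textbook ones rather than anything defined earlier in the notebook.

```python
import numpy as np


def unpooled_se(sd1, n1, sd2, n2):
    """SE of a difference in means when the two variances may differ."""
    return np.sqrt(sd1**2 / n1 + sd2**2 / n2)


def pooled_se(sd1, n1, sd2, n2):
    """SE of a difference in means assuming one common variance."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return np.sqrt(pooled_var) * np.sqrt(1 / n1 + 1 / n2)


# Summary statistics for cholesterol, taken from the table in this section
sd_f, n_f = 64.466781, 312
sd_m, n_m = 43.155535, 713

se_unpooled = unpooled_se(sd_f, n_f, sd_m, n_m)
se_pooled = pooled_se(sd_f, n_f, sd_m, n_m)
print(round(se_unpooled, 2), round(se_pooled, 2))  # 3.99 3.43
```

Here the smaller group has the larger spread, so pooling would understate the standard error and give an interval that is too narrow.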