![license_header_logo](https://user-images.githubusercontent.com/59526258/124226124-27125b80-db3b-11eb-8ba1-488d88018ebb.png)
> **Copyright (c) 2020-2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful,
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

## Introduction 

In this notebook we look at the different types of confidence interval and see how to compute them in Python.

## What will we accomplish?

By the end of the notebook we will be able to:
1. Recognize the different types of confidence interval
2. Know how and when to use each type
3. Apply the confidence interval formulas

## Notebook Content
* [Part 1: Population Proportion Confidence Interval](#propotion)
    * [Confidence interval for the proportion of the female population that has heart disease](#femalepropotion)
    * [Confidence interval for the difference in population proportions](#diffpropotion)
* [Part 2: Population Mean Confidence Interval](#meaninterval)
    * [Confidence interval for the mean female cholesterol level](#femalemeaninterval)
    * [Confidence interval for the difference in population means](#diffmeaninterval)

We will mainly use pandas and NumPy.

In [None]:
import pandas as pd
import numpy as np

The dataset can be downloaded from https://www.kaggle.com/johnsmith88/heart-disease-dataset

In [None]:
df = pd.read_csv('..//Heart.csv')
df

<a name="propotion"><h3><b>Population Proportion Confidence Interval</b></h3></a>

We build the data for the population proportion: people with heart disease, split by gender.

In [None]:
df['gender'] = df.sex.replace({1: "Male", 0: "Female"})

Drop rows with missing values

In [None]:
data = df[["target", "gender"]].dropna()

In [None]:
pd.crosstab(data.target, data.gender)
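The counts in a crosstab like the one above can also be read programmatically instead of being copied by hand. The following is a minimal sketch on hypothetical toy data; the `toy` frame and the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `data` frame.
toy = pd.DataFrame({
    "target": [1, 1, 0, 1, 0, 0, 1, 1],
    "gender": ["Female", "Female", "Female", "Male",
               "Male", "Male", "Male", "Female"],
})

counts = pd.crosstab(toy.target, toy.gender)

# A column total is the group size; row 1 holds the people with heart disease.
n_female = int(counts["Female"].sum())
female_with_disease = int(counts.loc[1, "Female"])
female_proportion = female_with_disease / n_female
```

Reading the counts this way avoids transcription mistakes when the numbers are reused in later cells.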

<a name="femalepropotion"><h4><b>Confidence Interval for female population proportion that has heart disease.</b></h4></a>

Read the number of females and the proportion of them with heart disease from the table above; these are the inputs to the confidence interval.

In [None]:
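f = 86 + 226  # total females: 86 without and 226 with heart disease
f_propotion = 226 / f  # proportion of females with heart disease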

Get the standard error of the female proportion with heart disease (see the confidence interval formula for proportions)

In [None]:
se_female = np.sqrt(f_propotion * (1 - f_propotion) / f)
se_female
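For reference, the formula this cell relies on is usually written as follows. The notation is an assumption for illustration: $\hat{p}$ corresponds to `f_propotion`, $n$ to `f`, and $z$ is the critical value for the chosen confidence level.

$$
SE(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}, \qquad \hat{p} \pm z \cdot SE(\hat{p})
$$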

For a 95% confidence level the z-score is 1.96 (from the standard normal table); use it in the formula to get the interval

In [None]:
z_score = 1.96

lower_interval_FP = f_propotion - z_score* se_female
upper_interval_FP = f_propotion + z_score* se_female

(lower_interval_FP, upper_interval_FP)

For the proportion of females with heart disease, the lower and upper bounds of the 95% confidence interval are 0.675 and 0.774.
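The steps above can be gathered into one small helper. This is a minimal sketch using only the standard library; the function names `z_for_confidence` and `proportion_ci` are hypothetical, and the z-score is derived from the normal distribution rather than read from a table.

```python
import math
from statistics import NormalDist


def z_for_confidence(level):
    """Two-sided critical value of the standard normal for a confidence level."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def proportion_ci(successes, n, level=0.95):
    """Normal-approximation interval for a population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = z_for_confidence(level)
    return p_hat - z * se, p_hat + z * se


print(round(z_for_confidence(0.95), 2))  # 1.96

# Female counts used in this section: 226 of 312 with heart disease.
lower, upper = proportion_ci(226, 312)
print(round(lower, 3), round(upper, 3))  # 0.675 0.774
```

Passing a different `level`, for example 0.99, widens the interval without any other change.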

There are multiple ways to compute the 95% confidence interval; statsmodels is one of them.

In [None]:
import statsmodels.api as sm
sm.stats.proportion_confint(f * f_propotion, f)

<a name="diffpropotion"><h4><b>Confidence Interval for difference in population proportion</b></h4></a>

In [None]:
m = 300 + 413  # total males: 413 without and 300 with heart disease
m_propotion = 300 / m  # proportion of males with heart disease

Get the standard error of the male proportion with heart disease (see the confidence interval formula for proportions)

In [None]:
se_male = np.sqrt(m_propotion * (1 - m_propotion) / m)
se_male

Get the standard error of the difference between the female and male proportions

In [None]:
se_diff = np.sqrt(se_female**2 + se_male**2)

Get the difference in proportions

In [None]:
d_propotion = f_propotion - m_propotion

The lower and upper bounds of the interval for the difference

In [None]:
lower_interval_DP = d_propotion - z_score * se_diff
upper_interval_DP = d_propotion + z_score * se_diff

(lower_interval_DP, upper_interval_DP)

For the difference between the female and male proportions with heart disease, the lower and upper bounds of the 95% confidence interval are 0.242 and 0.365.
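The same sequence of steps (two standard errors, their combination, then the interval) can be sketched as a single function. The helper name `diff_proportion_ci` and the counts in the example are hypothetical and chosen only to illustrate the calculation.

```python
import math


def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Interval for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    se1 = math.sqrt(p1 * (1 - p1) / n1)
    se2 = math.sqrt(p2 * (1 - p2) / n2)
    # Standard error of the difference: add the squared standard errors.
    se_diff = math.sqrt(se1**2 + se2**2)
    diff = p1 - p2
    return diff - z * se_diff, diff + z * se_diff


# Hypothetical counts: 60 of 100 in group 1, 45 of 100 in group 2.
lower, upper = diff_proportion_ci(60, 100, 45, 100)
print(lower, upper)
```

When the interval excludes 0, as it does for these illustrative counts, the data are consistent with a real difference between the two proportions at the chosen confidence level.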


<a name="meaninterval"><h3><b>Population Mean Confidence Interval</b></h3></a>

<a name="femalemeaninterval"><h4><b>Confidence Interval for the female cholesterol level mean</b></h4></a>

Here we get the mean, standard deviation and size of the cholesterol level for each gender

In [None]:
df.groupby("gender").agg({"chol": ["mean", "std", "size"]})

Set up the values needed for the confidence interval of the mean (see the formula for the confidence interval for the mean of normally distributed data)

In [None]:
z = 1.96
mean_f = 261.455128  
sd_f = 64.466781     
n_f = 312
se_f = sd_f /np.sqrt(n_f)
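Written out, the quantities set up in this cell feed the following formula. The notation is an assumption for illustration: $\bar{x}$ corresponds to `mean_f`, $s$ to `sd_f` and $n$ to `n_f`; using the normal critical value $z$ assumes the sample is large enough for it to stand in for the t value.

$$
SE(\bar{x}) = \frac{s}{\sqrt{n}}, \qquad \bar{x} \pm z \cdot SE(\bar{x})
$$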

Get the interval for the 95% confidence level

In [None]:
lower_interval_FM = mean_f - z* se_f  
upper_interval_FM = mean_f + z* se_f  
(lower_interval_FM, upper_interval_FM)

For the mean female cholesterol level, the lower and upper bounds of the 95% confidence interval are 254.302 and 268.609.
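The calculation above can be wrapped in a reusable function. This is a minimal sketch with the standard library only; the name `mean_ci` is hypothetical, and the inputs are the summary statistics quoted in this section.

```python
import math


def mean_ci(mean, sd, n, z=1.96):
    """Normal-approximation interval for a population mean."""
    se = sd / math.sqrt(n)  # standard error of the mean
    return mean - z * se, mean + z * se


# Female cholesterol summary statistics from this section.
lower, upper = mean_ci(261.455128, 64.466781, 312)
print(round(lower, 3), round(upper, 3))  # 254.302 268.609
```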

<a name="diffmeaninterval"><h4><b>Confidence Interval for difference in population mean</b></h4></a>

Take the corresponding values for the male population from the table above

In [None]:
mean_m = 239.237027  
sd_m = 43.155535     
n_m = 713
se_m = sd_m /np.sqrt(n_m)

Find the standard error of the difference using the unpooled approach (we use the unpooled approach because the female and male variances are not the same; each group's own standard deviation is used to estimate its variance)

In [None]:
mean_d = mean_f - mean_m

se_d = np.sqrt(se_f**2 + se_m**2)  # unpooled: add the squared standard errors
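For reference, the unpooled standard error of the difference is usually written as follows. The notation is an assumption for illustration: $s_f, n_f$ and $s_m, n_m$ correspond to `sd_f`, `n_f`, `sd_m` and `n_m`, and $SE_f$, $SE_m$ to `se_f` and `se_m`.

$$
SE(\bar{x}_f - \bar{x}_m) = \sqrt{\frac{s_f^2}{n_f} + \frac{s_m^2}{n_m}} = \sqrt{SE_f^2 + SE_m^2}
$$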

Calculate the interval for the difference

In [None]:
lower_interval_DM = mean_d - 1.96*se_d  
upper_interval_DM = mean_d + 1.96*se_d  
(lower_interval_DM, upper_interval_DM)
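The unpooled approach named in this section can be sketched as one function. The helper name `diff_mean_ci_unpooled` is hypothetical; the inputs are the female and male cholesterol summary statistics quoted above, and the sketch assumes the two samples are independent.

```python
import math


def diff_mean_ci_unpooled(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Interval for mean1 - mean2 without assuming equal variances."""
    se1 = sd1 / math.sqrt(n1)
    se2 = sd2 / math.sqrt(n2)
    # Unpooled standard error: each group contributes its own variance.
    se_diff = math.sqrt(se1**2 + se2**2)
    diff = mean1 - mean2
    return diff - z * se_diff, diff + z * se_diff


lower, upper = diff_mean_ci_unpooled(
    261.455128, 64.466781, 312,  # female: mean, sd, n
    239.237027, 43.155535, 713,  # male: mean, sd, n
)
print(lower, upper)
```

If the resulting interval lies entirely above 0, the female mean cholesterol level is estimated to be higher than the male mean at the 95% confidence level.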