## Finding Confidence Interval for 2 Dependent Populations

In this example, we'll find the confidence interval for the mean difference between two dependent (paired) samples.
The goal is to understand before-after or cause-effect relationships between the two samples.
For example:
    The effect of a pill that is supposed to change the iron level in the blood,
    How successful a diet is, etc.
    
To find a confidence interval when the population variance is unknown and the sample is small (fewer than 30 observations), the t-statistic is used as the reliability factor (RF).
Without going into much detail, you can look up the t value in a t table such as http://www.ttable.org/

The formula for the confidence interval is:
[SampleMean - RF * StandardError , SampleMean + RF * StandardError]

Here SampleMean is the mean of the paired differences. We assume that the population of differences is normally distributed.
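
For reference, the same interval can be written in symbols. The notation is an assumption introduced here, not taken from the original notebook: $\bar{d}$ is the mean of the paired differences, $s_d$ is their sample standard deviation, $n$ is the number of pairs, and $t_{n-1,\,\alpha/2}$ is the two-tailed t value with $n-1$ degrees of freedom that plays the role of RF.

$$
\left[\; \bar{d} - t_{n-1,\,\alpha/2}\,\frac{s_d}{\sqrt{n}} \;,\;\; \bar{d} + t_{n-1,\,\alpha/2}\,\frac{s_d}{\sqrt{n}} \;\right]
$$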

In [1]:
# import libraries
import pandas as pd  # pandas is convenient for working with datasets
import math  # for mathematical functions such as sqrt

In [2]:
# create our dataset:
# body weights (in kilograms) of 10 people before and after a new diet
d = {'before': [103.68, 110.68, 119.05, 101.75, 91.69, 112.03, 88.84, 105.18, 110.37, 120.99],
     'after': [92.87, 101.58, 105.66, 96.18, 86.97, 105.90, 80.56, 97.00, 99.27, 107.44]}
df = pd.DataFrame(data=d)
df

Unnamed: 0,before,after
0,103.68,92.87
1,110.68,101.58
2,119.05,105.66
3,101.75,96.18
4,91.69,86.97
5,112.03,105.9
6,88.84,80.56
7,105.18,97.0
8,110.37,99.27
9,120.99,107.44


In [3]:
# create a difference column (after minus before, so a weight loss is negative)
df['difference'] = df['after'] - df['before']
df

Unnamed: 0,before,after,difference
0,103.68,92.87,-10.81
1,110.68,101.58,-9.1
2,119.05,105.66,-13.39
3,101.75,96.18,-5.57
4,91.69,86.97,-4.72
5,112.03,105.9,-6.13
6,88.84,80.56,-8.28
7,105.18,97.0,-8.18
8,110.37,99.27,-11.1
9,120.99,107.44,-13.55


In [4]:
# the sample mean we need is the mean of the differences
sample_mean = df['difference'].mean()
sample_mean

-9.083000000000002

### How to find t-stat
First, decide on your confidence level. Let's say we want a 95% confidence level.
Confidence level = 1 - α
So, for a 95% interval, our alpha is 5%, which is equal to 0.05.
We are looking for a two-tailed t value. (Short explanation: if you test whether a parameter equals some value, the test is two-tailed; if you test whether it is greater (>) or less (<) than some value, it is one-tailed. Since we want an interval around the mean, bounded on both sides, it is two-tailed, and α is split equally between the two tails.)

Now look up this value in the t table: use the column for 95% confidence (listed at the bottom of the table) and the row for n - 1 = 9 degrees of freedom.
The value is 2.262.

In [5]:
t_95 = 2.262
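
As a cross-check on the table lookup, the same critical value can be computed in code. This is a minimal sketch that assumes SciPy is installed; the original notebook does not use it, and the names `alpha`, `n` and `t_crit` are chosen here for illustration. For a two-tailed value, α is split between the tails, so we ask for the quantile at 1 - α/2.

```python
from scipy import stats  # assumption: SciPy is available (not used in the original notebook)

alpha = 0.05          # 1 - confidence level
n = 10                # number of paired observations
dof = n - 1           # degrees of freedom

# Two-tailed critical value: leave alpha/2 in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df=dof)
print(round(t_crit, 3))
```

Rounded to three decimals, this should agree with the table value used above.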

### Standard Error
Standard error = standard deviation of the differences divided by the square root of the sample size.

In [6]:
sample_count = len(df)
sample_count

10

In [7]:
# standard deviation (std) of the sample differences
# (pandas divides by n - 1 by default, which is the sample formula we want)
std = df['difference'].std()
std

3.111141448264788

In [8]:
# standard error
standard_error = std/math.sqrt(sample_count)
standard_error

0.9838293099471636
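
The standard deviation here has to be the sample standard deviation (dividing by n - 1), not the population one (dividing by n). The sketch below illustrates the difference using only the Python standard library instead of pandas; the `differences` list is copied from the table above, and the variable names are chosen here for illustration.

```python
import math
import statistics

# Differences (after - before) copied from the table above
differences = [-10.81, -9.10, -13.39, -5.57, -4.72,
               -6.13, -8.28, -8.18, -11.10, -13.55]

sample_std = statistics.stdev(differences)       # divides by n - 1
population_std = statistics.pstdev(differences)  # divides by n, shown only for contrast

# Standard error of the mean difference uses the sample standard deviation
standard_error = sample_std / math.sqrt(len(differences))

print(sample_std, population_std, standard_error)
```

With only 10 observations the two standard deviations differ noticeably, so using the population formula by mistake would make the interval too narrow.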

In [9]:
# let's define the interval that contains the population mean difference with 95% confidence
interval_95 = (sample_mean - t_95 * standard_error, sample_mean + t_95 * standard_error)
interval_95

(-11.308421899100486, -6.857578100899518)

The result: we are 95% confident that the population mean difference is between -11.31 and -6.86 kilograms.
This interval is called a confidence interval. Because the whole interval lies below zero, we can say the diet appears to be working: on average, weight goes down.
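
The hand calculation can also be cross-checked against a library routine. The sketch below assumes SciPy is installed (the original notebook does not use it) and passes the mean difference and standard error to `scipy.stats.t.interval`. It works from plain lists rather than the DataFrame so that it runs on its own.

```python
import math
import statistics
from scipy import stats  # assumption: SciPy is available

# Differences (after - before) copied from the table above
differences = [-10.81, -9.10, -13.39, -5.57, -4.72,
               -6.13, -8.28, -8.18, -11.10, -13.55]

n = len(differences)
mean_diff = statistics.mean(differences)
se = statistics.stdev(differences) / math.sqrt(n)

# 95% two-sided interval from the t distribution with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, n - 1, loc=mean_diff, scale=se)
print(low, high)

# The diet "works" in the sense used above if the whole interval is below zero
print(high < 0)
```

The bounds may differ from the manual result in the later decimals, because the library uses the unrounded t value rather than 2.262.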

We can and should define a function for future use of these calculations.

In [10]:
def conf_interval(count, t, std, mean):
    '''
    count = number of observations in the sample
    t = t value from the table
    std = standard deviation of the sample
    mean = mean of the sample
    '''
    margin = t * (std / math.sqrt(count))  # reliability factor * standard error
    return (mean - margin, mean + margin)
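
As a usage illustration, the sketch below wraps the same calculation in a helper that starts from the raw before and after measurements. `paired_conf_interval` is a hypothetical name introduced here, not part of the original notebook, and it uses the standard library instead of pandas so that it runs on its own.

```python
import math
import statistics

def paired_conf_interval(before, after, t):
    """Hypothetical helper: confidence interval for the mean of (after - before).

    t is the two-tailed t value for len(before) - 1 degrees of freedom.
    """
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    margin = t * statistics.stdev(differences) / math.sqrt(n)
    return (mean_diff - margin, mean_diff + margin)

# Same data as in the example above
before = [103.68, 110.68, 119.05, 101.75, 91.69, 112.03, 88.84, 105.18, 110.37, 120.99]
after = [92.87, 101.58, 105.66, 96.18, 86.97, 105.90, 80.56, 97.00, 99.27, 107.44]

low, high = paired_conf_interval(before, after, t=2.262)
print(round(low, 2), round(high, 2))  # -11.31 -6.86
```

Taking the raw measurements as input keeps the differencing, the standard deviation and the sample size consistent with each other, which is easy to get wrong when the four numbers are passed in separately.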