# Confidence Interval

This notebook covers the concept of a confidence interval and how to compute one in various scenarios. It also touches on the z and t tables, since both are central to confidence interval estimation. A confidence interval, as the name suggests, is a range that statisticians use to estimate a population value at a stated level of confidence (usually 95% or 99%). The scenarios covered are:

- Univariate
    - Known variance
    - Unknown variance
- Bivariate
    - Dependent
    - Independent
        - Known variance
        - Unknown variance, assumed to be equal
        - Unknown variance, assumed to be unequal
        
The general formula for a confidence interval is:
**[Point Estimate $\pm$ Reliability Factor * Standard Error]**

The formula for each of these components depends on the kind of data being analysed.
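
As a minimal sketch of how the three components combine, the helper below returns the lower and upper bounds of an interval. The function name `confidence_interval` is hypothetical (it does not come from any library), and the numbers in the usage line are made up purely for illustration.

```python
def confidence_interval(point_estimate, reliability_factor, standard_error):
    """Return (lower, upper) = point estimate -/+ reliability factor * standard error."""
    margin = reliability_factor * standard_error
    return point_estimate - margin, point_estimate + margin


# Illustrative numbers only: estimate of 100, z score of about 1.96, standard error of 5
lower, upper = confidence_interval(100, 1.96, 5)
print(lower, upper)
```

The sections that follow differ only in which point estimate, reliability factor and standard error are plugged into this pattern.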

In [1]:
import pandas as pd
import numpy as np
import scipy.stats # importing the submodule makes scipy.stats.norm and scipy.stats.t available
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Univariate Data
### a. Known Variance

For **univariate data** where the **variance is known**, the **Z score** is used as the reliability factor. The interval is:
$$ \bar{x} \pm Z_{\alpha / 2} \cdot \frac{\sigma}{\sqrt{n}} $$
where $\bar{x}$ is the sample mean, $\sigma$ is the known population standard deviation and $n$ is the sample size.

**Example**: 
Suppose you want to advise a friend on buying a laptop in your city, so you sample the prices of fifty laptops in order to tell your friend how much to budget. You want to be 95% confident about the average cost of laptops.


*Given that the level of confidence is 95%, $\alpha$ is 5%; these translate to 0.95 and 0.05 respectively. Because the interval is two-sided, $\alpha$ is split equally between the two tails, so the z score is looked up at $1 - \alpha / 2 = 0.975$.*
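
The z table lookup described above can be sketched with `scipy.stats.norm.ppf`, which returns the value below which a given fraction of the standard normal distribution lies. The dictionary name `z_values` is just an illustrative choice; the three confidence levels are common ones, not values taken from the example.

```python
from scipy import stats

# Two-sided critical values: alpha / 2 sits in each tail, so look up 1 - alpha / 2
z_values = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    z_values[confidence] = stats.norm.ppf(1 - alpha / 2)
    print('{:.0%} confidence: z = {:.3f}'.format(confidence, z_values[confidence]))
```

A higher confidence level gives a larger z score and therefore a wider interval.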

In [2]:
np.random.seed(2)
laptops = pd.DataFrame(np.random.randint(5000, 100000, 50), columns=['Laptop Price'])
laptops.head()

Unnamed: 0,Laptop Price
0,94256
1,77173
2,49566
3,36019
4,89434


In [3]:
mean = laptops['Laptop Price'].mean() # Mean price
s_error = laptops['Laptop Price'].std() / np.sqrt(len(laptops)) # Standard error (the sample SD stands in for the known sigma)
r_fact = scipy.stats.norm.ppf(1 - (0.05 / 2)) # Reliability factor (z score)

print('Thus we are 95% sure that the average price of laptops will be between {:.0f} and {:.0f}'.format(mean - (r_fact * s_error), 
                                                                                          mean + (r_fact * s_error)))

Thus we are 95% sure that the average price of laptops will be between 46123 and 60642


### b. Unknown Variance

For **univariate data** where the **variance is unknown**, the **T score** is used as the reliability factor, with $n - 1$ degrees of freedom. The interval is:
$$ \bar{x} \pm t_{n - 1, \alpha / 2} \cdot \frac{s}{\sqrt{n}} $$
where $s$ is the sample standard deviation.

**Example**: 
Suppose you want to advise a friend on buying a phone in your city, so you sample the prices of twenty phones in order to tell your friend how much to budget. You want to be 95% confident about the average cost of phones.


*Given that the level of confidence is 95%, $\alpha$ is 5%; these translate to 0.95 and 0.05 respectively. Because the interval is two-sided, the t score is looked up at $1 - \alpha / 2 = 0.975$ with $n - 1$ degrees of freedom.*
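
The t table lookup can be sketched in the same way with `scipy.stats.t.ppf`. The sample sizes below are arbitrary illustrative choices; the point is that the t score is always larger than the z score and approaches it as the sample grows.

```python
from scipy import stats

z_95 = stats.norm.ppf(0.975)  # two-sided 95% z score, for comparison

t_values = {}
for n in (5, 10, 20, 50, 1000):
    # degrees of freedom for a single sample are n - 1
    t_values[n] = stats.t.ppf(0.975, df=n - 1)
    print('n = {:>4}: t = {:.3f} (z = {:.3f})'.format(n, t_values[n], z_95))
```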

In [4]:
np.random.seed(23)
phone = pd.DataFrame(np.random.randint(6000, 100000, 20), columns=['Phone Price'])
phone.head()

Unnamed: 0,Phone Price
0,15256
1,98105
2,82726
3,15704
4,77711


In [5]:
ph_mean = phone['Phone Price'].mean() # Mean price of phone
std_error = phone['Phone Price'].std() / np.sqrt(len(phone)) # Standard Error
r_factor = scipy.stats.t.ppf(1-0.025, df=(len(phone) - 1)) # t critical value with n - 1 degrees of freedom


print('Thus we are 95% sure that the average price of phones will be between {:.0f} and {:.0f}'.format(ph_mean - r_factor * std_error, 
                                                                                           ph_mean + r_factor * std_error))

Thus we are 95% sure that the average price of phones will be between 32734 and 59784
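
As a cross-check, the manual calculation can be compared with SciPy's built-in `scipy.stats.t.interval`. This is a sketch on freshly generated data (the seed and the `sample` array are assumptions for illustration, not the phone prices above). Note that `scipy.stats.sem` and pandas' `std()` both use the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)           # arbitrary seed, illustration only
sample = rng.randint(6000, 100000, 20)   # hypothetical prices

mean = sample.mean()
sem = stats.sem(sample)                  # sample SD (ddof=1) / sqrt(n)
df = len(sample) - 1

# Manual interval: mean -/+ t critical value * standard error
t_crit = stats.t.ppf(1 - 0.05 / 2, df=df)
manual = (mean - t_crit * sem, mean + t_crit * sem)

# Built-in interval; the confidence level is passed positionally
builtin = stats.t.interval(0.95, df, loc=mean, scale=sem)

print(manual)
print(builtin)
```

Both tuples should agree, which confirms that the three-component formula is what the library computes.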


## 2. Bivariate Population
### a. Dependent Population
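For two populations that are **dependent** (paired observations) with **unknown variance**, the confidence interval for the difference between the two means is:

$$ \bar{d} \pm t_{n - 1, \alpha / 2} \cdot \frac{S_{d}}{\sqrt{n}} $$

where $\bar{d}$ is the mean of the paired differences, $S_{d}$ is their sample standard deviation and $n$ is the number of pairs.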

**Example**: The scores of 20 students were recorded before and after a performance enhancing test was administered. We want to determine whether the performance enhancing test made any difference to the scores.

In [6]:
np.random.seed(9)
scores = pd.DataFrame({'Before': np.random.randint(1, 101, 20), 'After': np.random.randint(1, 101, 20)})

scores.head(10)

Unnamed: 0,Before,After
0,93,63
1,55,13
2,57,19
3,23,87
4,66,57
5,23,2
6,53,57
7,60,100
8,41,12
9,92,38


In [7]:
scores['Difference'] = scores.After - scores.Before

In [8]:
sc_mean = scores['Difference'].mean() # mean of the paired differences
r_factor = scipy.stats.t.ppf(1-0.025, df=(len(scores) - 1)) # t critical value
std_err = scores['Difference'].std() / np.sqrt(len(scores['Difference'])) # standard error

print('We are 95% confident that the mean difference in scores (After - Before) lies between {:.0f} and {:.0f} points'.format(sc_mean - (r_factor * std_err), sc_mean + (r_factor * std_err)))

print('Because this interval contains 0, we cannot conclude that the performance enhancing test changed the scores')

We are 95% confident that the mean difference in scores (After - Before) lies between -34 and 2 points
Because this interval contains 0, we cannot conclude that the performance enhancing test changed the scores
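
The link between the interval and a significance test can be sketched as follows: a 95% interval for the paired difference contains 0 exactly when a paired t-test (`scipy.stats.ttest_rel`) gives a p-value above 0.05. The data here are freshly generated with an arbitrary seed and are an assumption for illustration, not the scores above.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(1)     # arbitrary seed, illustration only
before = rng.randint(1, 101, 20)   # hypothetical scores
after = rng.randint(1, 101, 20)

diff = after - before
df = len(diff) - 1
mean_d = diff.mean()
sem_d = diff.std(ddof=1) / np.sqrt(len(diff))  # sample SD of the differences

t_crit = stats.t.ppf(0.975, df=df)
lower, upper = mean_d - t_crit * sem_d, mean_d + t_crit * sem_d

# Paired t-test on the same data; the null hypothesis is a mean difference of zero
t_stat, p_value = stats.ttest_rel(after, before)

contains_zero = bool(lower <= 0 <= upper)
print('95% CI: ({:.1f}, {:.1f}); p-value: {:.3f}'.format(lower, upper, p_value))
print('Interval contains 0:', contains_zero, '| p > 0.05:', bool(p_value > 0.05))
```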


### b. Independent Population 
#### i. Variance Known
For two populations that are **independent** with **known variances**, the variance of the difference between the two sample means is: $$ \frac{\sigma_{x}^{2}}{n_{x}} + \frac{\sigma_{y}^{2}}{n_{y}} $$
The confidence interval for the difference between the two means is thus:

$$ (\bar{x} - \bar{y}) \pm Z_{\alpha / 2} \cdot \sqrt{\frac{\sigma_{x}^{2}}{n_{x}} + \frac{\sigma_{y}^{2}}{n_{y}}} $$

**Example**: The incomes of people in two different professions (delivery men and janitors) were obtained, and we want to check whether there is a difference between the mean incomes of the two professions. 100 and 80 samples were obtained respectively.

In [9]:
delivery_men, janitors = np.random.randint(15000, 800000, 100), np.random.randint(15000, 800000, 80)

# Standard error of the difference between the two means
std_err_diff = np.sqrt((delivery_men.std() ** 2 / len(delivery_men)) + (janitors.std() ** 2 / len(janitors)))
mean_diff = delivery_men.mean() - janitors.mean()
crit_val = scipy.stats.norm.ppf(1 - (0.05 / 2)) # z score for a two-sided 95% interval

print('We are 95% confident that the mean income of delivery men differs from that of janitors by {:.0f} to {:.0f}'.format(
    mean_diff - (crit_val * std_err_diff), mean_diff + (crit_val * std_err_diff)))
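
The known-variance formula can be wrapped in a small function. This is a sketch under stated assumptions: `z_interval_two_means` is a hypothetical helper name, the population standard deviations are passed in explicitly because the formula requires them to be known, and the normally distributed incomes are generated only for illustration.

```python
import numpy as np
from scipy import stats


def z_interval_two_means(x, y, sigma_x, sigma_y, confidence=0.95):
    """CI for mean(x) - mean(y) when both population SDs are known."""
    std_error = np.sqrt(sigma_x ** 2 / len(x) + sigma_y ** 2 / len(y))
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = np.mean(x) - np.mean(y)
    return diff - z_crit * std_error, diff + z_crit * std_error


rng = np.random.RandomState(2)     # arbitrary seed, illustration only
x = rng.normal(50000, 8000, 100)   # hypothetical incomes with sigma = 8000
y = rng.normal(50000, 6000, 80)    # hypothetical incomes with sigma = 6000

lower, upper = z_interval_two_means(x, y, sigma_x=8000, sigma_y=6000)
print('{:.0f} to {:.0f}'.format(lower, upper))
```

Because the standard deviations are fixed inputs, the width of the interval depends only on the sample sizes and the confidence level, not on the sampled values.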

#### ii. Variance Unknown but Assumed to be Equal
For two populations that are **independent** with **unknown variances that are assumed to be equal**, the two sample variances are combined into a pooled variance:
$$ \text{pooled variance} = S_{p}^{2} = \frac{(n_{x} - 1)S_{x}^{2} + (n_{y} - 1)S_{y}^{2}}{n_{x} + n_{y} - 2} $$
The confidence interval for the difference between the two means is thus:

$$ (\bar{x} - \bar{y}) \pm t_{n_{x} + n_{y} - 2, \alpha / 2} \cdot \sqrt{\frac{S_{p}^{2}}{n_{x}} + \frac{S_{p}^{2}}{n_{y}}} $$


**Example**: Samples of barbecue prices in Lagos (`Lag_barb`) and Benin (`benin_barb`) are given below. We want a confidence interval for the difference between the two mean prices. 30 and 40 samples were obtained respectively.

In [90]:
Lag_barb, benin_barb = np.random.randint(1000, 5000, 30), np.random.randint(1000, 5000, 40)

In [91]:
pd.DataFrame(zip(Lag_barb, benin_barb), columns=['Lag_barbecue_Lagos', 'Lag_barbecue_Benin']).head()

Unnamed: 0,Lag_barbecue_Lagos,Lag_barbecue_Benin
0,1465,4067
1,4539,3290
2,4357,2316
3,4461,2355
4,1424,1387


In [92]:
# critical value (t score)
crit_fact = scipy.stats.t.ppf(1-0.025, df=(len(Lag_barb) + len(benin_barb) - 2)) 

# calculating the pooled variance from the two sample variances (ddof=1);
# the whole weighted sum is divided by the degrees of freedom
pooled_var = (((len(Lag_barb) - 1) * Lag_barb.var(ddof=1)) + ((len(benin_barb) - 1) * benin_barb.var(ddof=1))) / (len(Lag_barb) + len(benin_barb) - 2)

# calculating the standard error
std_error = np.sqrt(pooled_var / len(Lag_barb) + pooled_var / len(benin_barb))

# mean difference
mean_diff = Lag_barb.mean() - benin_barb.mean()

# confidence interval
lower, upper = mean_diff - crit_fact * std_error, mean_diff + crit_fact * std_error

print("""We are 95% confident that the mean price of Lagos barbecue minus the mean price
of Benin barbecue lies between {:.0f} and {:.0f}""".format(lower, upper))
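
The pooled-variance interval can be sketched as a reusable function and checked against `scipy.stats.ttest_ind` with `equal_var=True`, which rests on the same pooled standard error. The helper name `pooled_t_interval`, the seed and the generated prices are assumptions for illustration only.

```python
import numpy as np
from scipy import stats


def pooled_t_interval(x, y, confidence=0.95):
    """CI for mean(x) - mean(y), assuming equal population variances."""
    n_x, n_y = len(x), len(y)
    df = n_x + n_y - 2
    # Pooled variance: weighted average of the two sample variances (ddof=1)
    pooled_var = ((n_x - 1) * np.var(x, ddof=1) + (n_y - 1) * np.var(y, ddof=1)) / df
    std_error = np.sqrt(pooled_var / n_x + pooled_var / n_y)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=df)
    diff = np.mean(x) - np.mean(y)
    return diff - t_crit * std_error, diff + t_crit * std_error


rng = np.random.RandomState(3)     # arbitrary seed, illustration only
x = rng.randint(1000, 5000, 30)    # hypothetical prices
y = rng.randint(1000, 5000, 40)

lower, upper = pooled_t_interval(x, y)
t_stat, p_value = stats.ttest_ind(x, y, equal_var=True)
print('95% CI: ({:.0f}, {:.0f}); p-value: {:.3f}'.format(lower, upper, p_value))
```

Since both samples are drawn from the same range, the interval width should be a few hundred price units rather than thousands.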


#### iii. Variance Unknown and Not Assumed to be Equal
For two populations that are **independent** with **unknown variances that are not assumed to be equal**, the confidence interval for the difference between the two means is:

$$ (\bar{x} - \bar{y}) \pm t_{v, \alpha / 2} \cdot \sqrt{\frac{S_{x}^{2}}{n_{x}} + \frac{S_{y}^{2}}{n_{y}}} $$

where the degrees of freedom $v$ are:

$$ v = \frac{\left(\frac{S_{x}^{2}}{n_{x}} + \frac{S_{y}^{2}}{n_{y}}\right)^{2}}{\text{denominator}} $$

$$ \text{denominator} = \frac{\left(\frac{S_{x}^{2}}{n_{x}}\right)^{2}}{n_{x} - 1} + \frac{\left(\frac{S_{y}^{2}}{n_{y}}\right)^{2}}{n_{y} - 1} $$

**Example**: Samples of the prices of bibles and dictionaries are given below. We want a confidence interval for the difference between the two mean prices. 30 samples of each were obtained.

In [68]:
bible, dictionary = np.random.randint(3000, 15001, 30), np.random.randint(3000, 15001, 30)

In [78]:
# calculating the degrees of freedom
def degree(sample_1, sample_2):
    sample_1_error = (sample_1.std() ** 2) / len(sample_1)
    sample_2_error = (sample_2.std() ** 2) / len(sample_2)

    numerator = (sample_1_error + sample_2_error) ** 2
    denom = (sample_1_error ** 2) / (len(sample_1) - 1) + (sample_2_error ** 2) / (len(sample_2) - 1)
    df = numerator / denom
    return df

# Calculating the error
def error(sample_1, sample_2):
    sample_1_error = (sample_1.std() ** 2) / len(sample_1)
    sample_2_error = (sample_2.std() ** 2) / len(sample_2)
    
    combined = sample_1_error + sample_2_error
    return np.sqrt(combined)

# Calculating the mean difference
def mean_diff(sample_1, sample_2):
    return sample_1.mean() - sample_2.mean()

In [95]:
# critical value (t score) using the degrees of freedom computed above
crit_val = scipy.stats.t.ppf(1-0.025, df=degree(bible, dictionary))

# upper limit
upper = mean_diff(bible, dictionary) + crit_val * error(bible, dictionary)

# lower limit
lower = mean_diff(bible, dictionary) - crit_val * error(bible, dictionary)

print("""We are 95% confident that the mean price of bibles minus the mean price
of dictionaries lies between {:.0f} and {:.0f}""".format(lower, upper))

We are 95% confident that the mean price of bibles minus the mean price
of dictionaries lies between -1551 and 1941
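
The unequal-variance interval can also be sketched as a single function and checked against `scipy.stats.ttest_ind` with `equal_var=False`. The helper name `welch_t_interval`, the seed and the generated prices are assumptions for illustration. This sketch uses the sample variance (`ddof=1`), whereas NumPy's `ndarray.std()` defaults to `ddof=0`, so its numbers can differ slightly from a calculation based on the default.

```python
import numpy as np
from scipy import stats


def welch_t_interval(x, y, confidence=0.95):
    """CI for mean(x) - mean(y) without assuming equal variances."""
    v_x = np.var(x, ddof=1) / len(x)   # S_x^2 / n_x
    v_y = np.var(y, ddof=1) / len(y)   # S_y^2 / n_y
    # Approximate degrees of freedom from the formula above
    df = (v_x + v_y) ** 2 / (v_x ** 2 / (len(x) - 1) + v_y ** 2 / (len(y) - 1))
    std_error = np.sqrt(v_x + v_y)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=df)
    diff = np.mean(x) - np.mean(y)
    return diff - t_crit * std_error, diff + t_crit * std_error, df


rng = np.random.RandomState(4)      # arbitrary seed, illustration only
x = rng.randint(3000, 15001, 30)    # hypothetical prices
y = rng.randint(3000, 15001, 30)

lower, upper, df = welch_t_interval(x, y)
t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)
print('95% CI: ({:.0f}, {:.0f}); df = {:.1f}; p-value: {:.3f}'.format(lower, upper, df, p_value))
```

The degrees of freedom are generally not a whole number and fall between the smaller of $n_{x} - 1$ and $n_{y} - 1$ and the pooled value $n_{x} + n_{y} - 2$.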


In [None]:
class ConfidenceInterval:
    """Unfinished skeleton for a general confidence interval helper."""

    def __init__(self, var=True, ind=True, dual=True, diff=True):
        self.var = var
        self.ind = ind
        self.dual = dual
        self.diff = diff

#     crit_val = scipy.stats.t.ppf(1-0.025, df=(len(scores) - 1))

# Rule of thumb: the critical value depends heavily on the sample size. If the sample
# has fewer than 50 observations, use the t score; if it has more than 50, use the z score.