## Finding the Confidence Interval for 2 Independent Samples: Population Variances Unknown but Assumed Equal

In this example, we'll find the confidence interval for the difference between the means of two independent samples.
We don't know the population variances, but we assume that they are equal.

For example, we want to estimate the difference in apple prices between NY and LA. We don't know the population variances; however, we think that they should be the same.
   
To find confidence intervals when the population variance is unknown and the sample is small (generally fewer than 30 observations), the t-statistic is used as the reliability factor (RF).
Without going into much detail, you can use http://www.ttable.org/ to look up the t value.

Finally, we assume that the populations are normally distributed.

In [1]:
#importing libraries
import pandas as pd #pandas is great for working with datasets
import math #for mathematical functions
import numpy as np

In [2]:
#creating our dataset
#apple prices from different locations in both cities
d = {'apples_NY':[3.80, 3.76, 3.87, 3.99, 4.02, 4.25, 4.13, 3.98, 3.99, 3.62],
        'apples_LA':[3.02, 3.22, 3.24, 3.02, 3.06, 3.15, 3.81, 3.44, np.nan, np.nan]} #np.nan marks missing values (the np.NaN alias was removed in NumPy 2.0)
df = pd.DataFrame(data=d)
df

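|   | apples_NY | apples_LA |
|---|-----------|-----------|
| 0 | 3.80 | 3.02 |
| 1 | 3.76 | 3.22 |
| 2 | 3.87 | 3.24 |
| 3 | 3.99 | 3.02 |
| 4 | 4.02 | 3.06 |
| 5 | 4.25 | 3.15 |
| 6 | 4.13 | 3.81 |
| 7 | 3.98 | 3.44 |
| 8 | 3.99 | NaN |
| 9 | 3.62 | NaN |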


To calculate the confidence interval in this situation, we need the pooled variance. Since we don't know the individual population variances, we use the pooled variance to calculate the standard error.
The formula for pooled variance is:
pooled_var = ((nx-1) * varX + (ny-1) * varY) / (nx + ny -2)

nx = length of the first sample (in our case apples_NY),
ny = length of the second sample (in our case apples_LA),
varX and varY are the sample variances of the two samples.
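
Written in conventional notation, the same formula reads as follows. The symbols are introduced here only for illustration: $s_x^2$ and $s_y^2$ stand for varX and varY, and $s_p^2$ stands for pooled_var.

```latex
s_p^2 = \frac{(n_x - 1)\, s_x^2 + (n_y - 1)\, s_y^2}{n_x + n_y - 2}
```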

In [3]:
#let's calculate variances
varNY = df['apples_NY'].var()
varLA = df['apples_LA'].var()

print('variance for NY apples:', varNY)
print('variance for LA apples:', varLA)

variance for NY apples: 0.03383222222222223
variance for LA apples: 0.07177142857142858


In [4]:
#let's calculate counts
nx = df['apples_NY'].count()
ny = df['apples_LA'].count()

print('Count for NY apples:', nx)
print('Count for LA apples:', ny)

Count for NY apples: 10
Count for LA apples: 8


In [5]:
#calculate pooled variance
pooledVar = (((nx-1)*varNY)+((ny-1)*varLA)) / (nx+ny-2)
pooledVar

0.05043062500000001

### How to find t-stat
First, decide on your confidence level. Let's say we want a 95% confidence level.
Confidence level = 1 - α
So, for a 95% interval, our alpha is 5%, which is equal to 0.05.
We are looking for a two-tailed t-value. (Short explanation: a test of whether a parameter equals some value is two-tailed; a test of whether it is greater (>) or less (<) than some value is one-tailed. A confidence interval has both a lower and an upper bound, so α is split between the two tails and we need the two-tailed value.)

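Now look up this value in the t table: use the column for 95% confidence (listed at the bottom of the table) and the row for nx + ny - 2 = 16 degrees of freedom (we subtract 2 because we have 2 samples).
The value is 2.120.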
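
Instead of reading a printed table, the critical value can also be computed. The following is a minimal sketch, assuming SciPy is installed (it is not imported anywhere else in this notebook); the names `confidence`, `alpha`, `dof`, and `t_crit` are illustrative only.

```python
from scipy import stats  # assumption: SciPy is available; the notebook itself does not use it

confidence = 0.95
alpha = 1 - confidence

# degrees of freedom: nx + ny - 2, using the counts found earlier (10 and 8)
dof = 10 + 8 - 2

# two-tailed: alpha is split between both tails, so take the 1 - alpha/2 quantile
t_crit = stats.t.ppf(1 - alpha / 2, dof)
print(round(t_crit, 3))
```

The computed value should agree with the table entry 2.120 to three decimal places.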

In [6]:
t_95 = 2.120

In [7]:
#we need means of the samples for calculating confidence interval
meanNY = df['apples_NY'].mean()
meanLA = df['apples_LA'].mean()

print('Mean for NY apples:', meanNY)
print('Mean for LA apples:', meanLA)

Mean for NY apples: 3.941
Mean for LA apples: 3.245


In [8]:
#let's calculate standard error
standard_error = math.sqrt((pooledVar)/nx + (pooledVar)/ny)
standard_error

0.10652178474377906
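
In symbols, the standard error computed above and the interval it leads to can be written as shown here. The notation is assumed for illustration: $s_p^2$ denotes the pooled variance, $\bar{x}$ and $\bar{y}$ the two sample means, and $t_{\alpha/2,\; n_x + n_y - 2}$ the two-tailed t value.

```latex
SE = \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}, \qquad
(\bar{x} - \bar{y}) \pm t_{\alpha/2,\; n_x + n_y - 2} \cdot SE
```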

In [9]:
#let's define the interval that contains the difference in means with 95% confidence
interval_95 =(((meanNY-meanLA) - t_95 * standard_error) ,((meanNY-meanLA) + t_95 * standard_error))
interval_95

(0.4701738163431881, 0.9218261836568113)

The result: with 95% confidence, the difference between the mean apple prices in NY and LA is between 0.47 and 0.92.
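
As a cross-check, the whole calculation can be collected into one reusable function. This is a sketch using only the Python standard library; `pooled_ci` is a hypothetical helper name, and the t value is passed in by the caller rather than computed.

```python
import math
import statistics

def pooled_ci(x, y, t_crit):
    """Confidence interval for mean(x) - mean(y), assuming equal population variances."""
    nx, ny = len(x), len(y)
    # statistics.variance is the sample variance (divides by n - 1), like pandas .var()
    var_x = statistics.variance(x)
    var_y = statistics.variance(y)
    pooled_var = ((nx - 1) * var_x + (ny - 1) * var_y) / (nx + ny - 2)
    standard_error = math.sqrt(pooled_var / nx + pooled_var / ny)
    diff = statistics.mean(x) - statistics.mean(y)
    margin = t_crit * standard_error
    return diff - margin, diff + margin

# same data as the DataFrame, with the missing LA values left out
apples_ny = [3.80, 3.76, 3.87, 3.99, 4.02, 4.25, 4.13, 3.98, 3.99, 3.62]
apples_la = [3.02, 3.22, 3.24, 3.02, 3.06, 3.15, 3.81, 3.44]

low, high = pooled_ci(apples_ny, apples_la, t_crit=2.120)
print(round(low, 2), round(high, 2))  # 0.47 0.92
```

Unlike the pandas version, plain lists do not skip missing values automatically, so they have to be removed before calling the function.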