<h2 style="color: blanchedalmond;">CONFIDENCE INTERVALS</h2>

Confidence intervals are a statistical tool used to estimate the range within which a population parameter is likely to fall, based on sample data. They provide a measure of uncertainty around a sample statistic, such as a mean or proportion. For example, a 95% confidence interval for a mean suggests that if the same sampling process were repeated many times, approximately 95% of those intervals would contain the true population mean. The interval is constructed using the sample data and a chosen level of confidence (e.g., 95% or 99%), and it typically consists of an upper and lower bound around the sample estimate. This helps to understand the precision of the estimate and the reliability of the inference drawn from the sample.
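
The "repeated sampling" reading of a 95% confidence interval can be checked with a small simulation. The sketch below is illustrative only: it assumes a made-up normal population (mean 50, standard deviation 10) rather than the voter data used later, and the names `population`, `n_trials`, and `coverage` are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical population for illustration only (not the notebook's voter data).
rng = np.random.default_rng(0)
population = rng.normal(loc=50.0, scale=10.0, size=100_000)
true_mean = population.mean()
pop_std = population.std()

n_trials = 1000      # number of repeated samples
sample_size = 100
z_critical = stats.norm.ppf(0.975)   # two-sided 95% level
margin = z_critical * pop_std / np.sqrt(sample_size)

# Count how many of the repeated intervals contain the true mean.
covered = 0
for _ in range(n_trials):
    sample = rng.choice(population, size=sample_size)
    sample_mean = sample.mean()
    if sample_mean - margin <= true_mean <= sample_mean + margin:
        covered += 1

coverage = covered / n_trials
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95, though it will not be exactly 0.95 for any finite number of trials.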


<h1 style="color: aqua;" >point estimate</h1>

A point estimate is a single value derived from sample data used to estimate an unknown population parameter, such as a mean or proportion, representing the best guess based on the sample. 

In [24]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import math
import scipy.stats as stats

<h1 style="color: aquamarine;">average age of each voter</h1>

In [3]:
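# Caution: random.seed() only seeds Python's built-in random module, not NumPy,
# so the draws below change from run to run; np.random.seed() would pin them.
random.seed(10)
# Simulated voter ages: a mixture of two Poisson distributions
pop_ages1=np.random.poisson(lam=35,size=150000)
pop_ages2=np.random.poisson(lam=10,size=100000)
pop_ages=np.concatenate((pop_ages1,pop_ages2))
pop_ages.mean()  # population mean (the parameter being estimated)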


np.float64(25.009468)

In [4]:
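random.seed(6)  # random.seed() does not seed NumPy, so np.random.choice is not reproducible here
# Draw 500 ages (with replacement) and use their mean as the point estimate
sample_ages=np.random.choice(a=pop_ages,size=500)
sample_ages.mean()
# Difference between the population mean and the point estimate
pop_ages.mean()-sample_ages.mean()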

np.float64(-0.014532000000002654)

<h1 style="color: aquamarine;">The race of each voter</h1>

In [5]:
random.seed(10)
pop_race=(['whites']*100000)+(['blacks']*50000)+(['hispanic']*50000)+(['asians']*25000)+(['others']*25000)
demo_sample=random.sample(pop_race,1000)
for race in set(demo_sample):
    print(race + " proportion estimate:")
    print(demo_sample.count(race)/1000)

whites proportion estimate:
0.379
hispanic proportion estimate:
0.192
blacks proportion estimate:
0.231
others proportion estimate:
0.099
asians proportion estimate:
0.099


<h1 style="color: blue;">z critical value</h1>

The z critical value is a point on the standard normal distribution that corresponds to a specified level of significance in hypothesis testing or a confidence level in confidence intervals. It is measured in standard deviations from the mean: for a two-sided 95% confidence level, 95% of the distribution lies within ±1.96 standard deviations of the mean.

Here are some common z critical values:

For a 90% confidence level: z = ±1.645 

For a 95% confidence level: z = ±1.96

For a 99% confidence level: z = ±2.576

The $z_{\text{critical}}$ value, or critical value, represents a point on the standard normal distribution (which has a mean of 0 and a standard deviation of 1) beyond which a certain proportion of the distribution lies. In hypothesis testing or confidence interval calculations, it's used to determine the boundary for rejecting the null hypothesis or for defining the range in which a parameter estimate is expected to fall.
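
The common values listed above can be reproduced with `scipy.stats.norm.ppf`. This is a minimal sketch; the helper name `z_critical_value` is hypothetical and not part of SciPy.

```python
from scipy import stats

def z_critical_value(confidence):
    """Two-sided z critical value for a confidence level in (0, 1)."""
    alpha = 1.0 - confidence
    # Half of alpha sits in each tail, so look up the 1 - alpha/2 quantile.
    return stats.norm.ppf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = +/-{z_critical_value(level):.3f}")
```

The lookup uses `1 - alpha/2` rather than the confidence level itself because a two-sided interval leaves half of the excluded probability in each tail.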

<h1 style="color: aqua;" >how to calculate the margin of error</h1>

In [7]:
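sample_size=1000
sample=np.random.choice(a=pop_ages, size=sample_size)
sample_mean=sample.mean()
# z critical value for a two-sided 95% interval (2.5% in each tail)
z_critical= stats.norm.ppf(q=0.975)
print(z_critical)
# Population standard deviation, treated as known here
pop_stdev=pop_ages.std()
# Margin of error = z critical value * standard error of the mean
margin_of_error=z_critical*(pop_stdev/math.sqrt(sample_size))
confidence_interval=(sample_mean-margin_of_error,sample_mean+margin_of_error)
print(confidence_interval)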


1.959963984540054
(np.float64(24.313827553695806), np.float64(25.954172446304195))
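
The cell above follows the usual formula for a confidence interval for a mean when the population standard deviation is known. Written out, with $\bar{x}$ for `sample_mean`, $\sigma$ for `pop_stdev`, and $n$ for `sample_size` (the symbols are notation assumed here, not taken from the notebook):

$$
\text{margin of error} = z_{\text{critical}} \cdot \frac{\sigma}{\sqrt{n}},
\qquad
\text{confidence interval} = \left(\bar{x} - z_{\text{critical}} \cdot \frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{\text{critical}} \cdot \frac{\sigma}{\sqrt{n}}\right)
$$

For the run shown, the printed interval has a half-width of about 0.82, which is the margin of error for $n = 1000$ at the 95% level.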


<h1 style="color: aquamarine;">t-distribution and t-critical value</h1>

The t-distribution is a probability distribution that arises when estimating the mean of a normally distributed population when the sample size is small and the population standard deviation is unknown. It is symmetric and bell-shaped, similar to the standard normal distribution, but has heavier tails, which allows it to account for the increased variability that occurs with smaller sample sizes. The t-critical value is a point on the t-distribution that corresponds to a specific significance level or confidence level, and it is used in hypothesis testing and constructing confidence intervals. As the sample size increases, the t-distribution approaches the standard normal distribution, and the t-critical values converge to z critical values.
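
The convergence of t critical values toward the z critical value can be seen by printing both for increasing degrees of freedom. This is a small sketch; the particular `dfs` values are arbitrary choices for illustration.

```python
from scipy import stats

z_critical = stats.norm.ppf(0.975)   # two-sided 95% z critical value

dfs = (5, 24, 100, 1000)             # degrees of freedom to compare
t_values = [stats.t.ppf(0.975, df=df) for df in dfs]

for df, t_critical in zip(dfs, t_values):
    gap = t_critical - z_critical
    print(f"df={df:>4}: t_critical={t_critical:.4f}, gap to z={gap:.4f}")
```

Each t critical value is larger than the z critical value, reflecting the heavier tails, and the gap shrinks as the degrees of freedom grow.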

In [13]:
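np.random.seed(10)
sample_size=25
sample=np.random.choice(a=pop_ages, size=sample_size)
sample_mean=sample.mean()
# t critical value for a two-sided 95% interval with n - 1 degrees of freedom
t_critical=stats.t.ppf(q=0.975,df=sample_size-1)
print('t_critical:')
print(t_critical)
# Sample standard deviation (ddof=1), since the population value is treated as unknown
sample_stdev=sample.std(ddof=1)
# Note: sigma here holds the standard error of the mean, not a standard deviation
sigma=sample_stdev/math.sqrt(sample_size)
margin_of_error=t_critical*(sigma)
confidence_interval=(sample_mean-margin_of_error,sample_mean+margin_of_error)
print('confidence_interval:')
print(confidence_interval)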

t_critical:
2.0638985616280205
confidence_interval:
(np.float64(22.0096661765773), np.float64(33.190333823422705))


<h1 style="color: aquamarine;">direct confidence interval</h1>

In [23]:
# stats.t.interval returns the same interval in one call; scale is the standard error (sigma)
confidence_interval=stats.t.interval(df=24,loc=sample_mean,scale=sigma,confidence=0.95)
print('confidence_interval:')
print(confidence_interval)

confidence_interval:
(np.float64(22.0096661765773), np.float64(33.190333823422705))
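
The manual steps and the direct call can be wrapped into one reusable function and cross-checked against each other. The sketch below is self-contained: `t_confidence_interval` is a hypothetical helper name, and `example_sample` holds made-up values rather than data from this notebook.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(sample, confidence=0.95):
    """Two-sided t-based confidence interval for the mean of `sample`."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    sample_mean = sample.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    t_critical = stats.t.ppf(1.0 - (1.0 - confidence) / 2.0, df=n - 1)
    margin = t_critical * standard_error
    return sample_mean - margin, sample_mean + margin

# Made-up ages, for illustration only.
example_sample = [23, 31, 28, 35, 19, 40, 27, 33, 25, 30]

manual = t_confidence_interval(example_sample)
direct = stats.t.interval(
    0.95,
    df=len(example_sample) - 1,
    loc=np.mean(example_sample),
    scale=stats.sem(example_sample),   # sem defaults to ddof=1
)
print("manual:", manual)
print("direct:", direct)
```

Both approaches should print the same bounds up to floating-point rounding, since `stats.t.interval` applies the same critical value and standard error internally.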
