**Let us import the required libraries.**

In [1]:
import scipy.stats as stats
import numpy as np
import pandas as pd
import random

<a id="est"></a>
#  Parameter Estimation

A value that describes a characteristic of the population is known as a `parameter`, while the corresponding characteristic of a sample is described by a `statistic`.
In most real-life problems the population parameters are not known. Thus, we consider a subset of the population (a sample) and use a sample statistic to estimate the population parameter.

`Point estimation` and `Interval estimation` are two of the methods to estimate the population parameter.

<a id="pt"></a>
## Point Estimation

This method considers a single value (a sample statistic) as an estimate of the population parameter.

Let $X_{1}, X_{2}, X_{3}, \ldots, X_{n}$ be a random sample drawn from a population with mean $\mu$ and standard deviation $\sigma$. <br>
The point estimation method estimates the population mean by the sample mean, $\hat{\mu} = \overline{X}$, and the population standard deviation by the sample standard deviation, $\hat{\sigma} = s$. Note that $s$ is not the standard error: the `Standard Error` of the mean is $\frac{s}{\sqrt{n}}$.

### Example:

#### 1. Consider the data of grade points for 35 students in a data science course. Select grades of 20 students randomly from the data and find the point estimate for the population mean.

     Grades: [59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.2, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
              92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]

In [2]:
Grades=[59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.2, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
          92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]

In [3]:
# Point Estimation
# Note: the mean below is computed on all 35 grades, not on a random subset of 20
sample_mean=np.mean(Grades)
pop_mean=sample_mean
print('Point estimate of population mean:',pop_mean)

Point estimate of population mean: 76.68285714285715


In [4]:
# np.std defaults to ddof=0 (divides by n); pass ddof=1 to get the sample standard deviation
print('Point estimate of population SD:',np.std(Grades))

Point estimate of population SD: 14.795336827417437
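
The exercise asks for a random subset of 20 grades, whereas the cells above work on all 35 values. A minimal sketch of the subset version follows. The seed and the names `rng`, `grades_sample`, `mean_hat`, `sd_hat` and `se_hat` are assumptions introduced here for illustration and are not part of the original notebook. The sketch uses `ddof=1` so that the standard deviation is the sample standard deviation, and it keeps the standard error as a separate quantity.

```python
import random
import numpy as np

Grades = [59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.2, 41.9, 55.2, 94.8, 84.1,
          83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 92.1, 74.2, 59.2, 64.0, 75, 78.2,
          95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]

rng = random.Random(0)                   # hypothetical seed, only for reproducibility
grades_sample = rng.sample(Grades, 20)   # 20 grades drawn without replacement

mean_hat = np.mean(grades_sample)                # point estimate of the population mean
sd_hat = np.std(grades_sample, ddof=1)           # sample standard deviation (divides by n-1)
se_hat = sd_hat / np.sqrt(len(grades_sample))    # standard error of the mean

print('Point estimate of population mean:', mean_hat)
print('Point estimate of population SD:', sd_hat)
print('Standard error of the mean:', se_hat)
```

A different seed gives a different subset and therefore a slightly different estimate, which is exactly the sampling variability that interval estimation tries to quantify.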


#### 2. A financial firm has created 50 portfolios. From them, a sample of 13 portfolios was selected, out of which 8 were found to be underperforming. Can you estimate the number of underperforming portfolios in the population?

In [5]:
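# Sample proportion of underperforming portfolios
prop_sample=8/13
# Estimated number of underperforming portfolios among all 50
est_underperforming=50*prop_sample
est_underperforming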

30.76923076923077

<a id="err"></a>
###  Sampling Error

Sampling error is the difference between the sample statistic used to estimate a parameter and the corresponding population parameter; its magnitude is the absolute value of that difference. Since the sample does not contain the entire population, the values of the mean, median, quantiles, and so on calculated on the sample differ from the actual population values.

The sampling error can be reduced by increasing the sample size; various methods exist for determining a sample size that keeps the error within a chosen limit.
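
The effect of sample size can be checked with a small simulation. The sketch below is illustrative only: the population (the integers 10 to 99), the seed, the repetition count, and the helper name `mean_abs_sampling_error` are all assumptions made here, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)       # hypothetical seed, only for reproducibility
population = np.arange(10, 100)      # hypothetical population of 90 values

def mean_abs_sampling_error(n, reps=2000):
    """Average |sample mean - population mean| over `reps` samples of size n."""
    errors = []
    for _ in range(reps):
        sample = rng.choice(population, size=n, replace=False)
        errors.append(abs(sample.mean() - population.mean()))
    return float(np.mean(errors))

err_small = mean_abs_sampling_error(10)
err_large = mean_abs_sampling_error(60)
print('Average absolute sampling error, n=10:', err_small)
print('Average absolute sampling error, n=60:', err_large)
```

On average the larger sample lands closer to the population mean, although any single large sample can still be further off than a lucky small one.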

#### 1. Consider the data for the number of ice-creams sold per day. An ice-cream vendor collected this data for 90 days, and a sample containing the sales for 25 days was then drawn (without replacement).

data = [21, 93, 62, 76, 73, 20, 56, 95, 41, 36, 38, 13, 80, 88, 34, 18, 40, 11, 
        25, 29, 61, 23, 82, 10, 92, 69, 60, 87, 14, 91, 94, 49, 57, 83, 96, 55, 
        79, 52, 59, 39, 58, 17, 19, 98, 15, 54, 48, 46, 72, 45, 65, 28, 37, 30, 
        68, 75, 16, 33, 31, 99, 22, 51, 27, 67, 85, 47, 44, 77, 64, 97, 84, 42, 
        90, 70, 74, 89, 32, 26, 24, 12, 81, 53, 50, 35, 71, 63, 43, 86, 78, 66]
        
sample = [10, 22, 47, 66, 11, 57, 77, 98, 31, 63, 74, 84, 50, 96, 88, 92, 70, 54, 65, 44, 16, 72, 20, 90, 43]

Compute the sampling error for the mean.

In [6]:
data = [21, 93, 62, 76, 73, 20, 56, 95, 41, 36, 38, 13, 80, 88, 34, 18, 40, 
        11, 25, 29, 61, 23, 82, 10, 92, 69, 60, 87, 14, 91, 94, 49, 57, 83, 96, 55, 79, 
        52, 59, 39, 58, 17, 19, 98, 15, 54, 48, 46, 72, 45, 65, 28, 37, 30, 68, 75, 16, 
        33, 31, 99, 22, 51, 27, 67, 85, 47, 44, 77, 64, 97, 84, 42, 90, 70, 74, 89, 32, 26, 
        24, 12, 81, 53, 50, 35, 71, 63, 43, 86, 78, 66]

sample = [10, 22, 47, 66, 11, 57, 77, 98, 31, 63, 74, 84, 50, 96, 88, 92, 70, 54, 65, 44, 16, 72, 20, 90, 43]

In [7]:
# Signed difference: population mean minus sample mean
print('Sampling Error is',np.mean(data)-np.mean(sample))

Sampling Error is -3.1000000000000014


<a id="int"></a>
## Interval Estimation for Mean

This method gives a range of values in which the population parameter is likely to lie. A confidence interval is a range, computed from the sample, that is expected to contain the parameter with a specified level of confidence. It is given by the formula,<br> <p style='text-indent:20em'> `conf_interval = sample statistic ± margin of error`</p>

The uncertainty of an estimate is described by the `confidence level` which is used to calculate the margin of error. 

<a id="large"></a>
### Large Sample Size

Consider a population with mean $\mu$ and standard deviation $\sigma$. Let us take a sample of `n` observations from the population such that $n \geq 30$. The central limit theorem states that the sampling distribution of the mean is approximately normal with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
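
The $\frac{\sigma}{\sqrt{n}}$ claim can be checked empirically. The following is a rough sketch under stated assumptions: the exponential population, the seed, and the values of `n` and `reps` are chosen here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # hypothetical seed, only for reproducibility

sigma = 2.0    # for an exponential distribution, SD = mean = scale
n = 50         # size of each sample
reps = 20000   # number of samples drawn

# Each row is one sample of size n from a skewed (non-normal) population
sample_means = rng.exponential(scale=sigma, size=(reps, n)).mean(axis=1)

observed_se = sample_means.std(ddof=1)   # spread of the simulated sample means
theoretical_se = sigma / np.sqrt(n)      # sigma / sqrt(n) from the theorem

print('Mean of sample means:', sample_means.mean())
print('Observed SD of sample means:', observed_se)
print('Theoretical sigma/sqrt(n):', theoretical_se)
```

Even though the population is strongly skewed, the two spreads should agree closely, and a histogram of `sample_means` would look roughly bell-shaped.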

The confidence interval for the population mean with $100(1-\alpha)$% confidence level is given as: $\overline{X} \pm Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$

Where, <br>
$\overline{X}$: Sample mean<br>
$\alpha$: Level of significance<br>
$\sigma$: Population standard deviation<br>
$n$: Sample size

The quantity $\frac{\sigma}{\sqrt{n}}$ is the standard error of the mean. And the margin of error is given by $Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$.

If we know the desired margin of error (ME), then we can calculate the required sample size (n) using the formula: $n = (Z_{\frac{\alpha}{2}})^{2}\frac{\sigma^{2}}{ME^{2}}$, rounded up to the next whole number.
 
The above equation is valid for any population provided the sample size is sufficiently large (usually $n \geq 30$). Replace $\sigma$ by the standard deviation of the sample ($s$) if the population standard deviation is not known.
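
The sample-size formula can be wrapped in a small helper. This is a minimal sketch; the function name `required_sample_size` and the input values are hypothetical and chosen only to illustrate the calculation.

```python
import math
from scipy import stats

def required_sample_size(sigma, me, cl):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin of error `me`."""
    alpha = 1 - cl
    z = stats.norm.isf(alpha / 2)               # upper-tail critical value Z_{alpha/2}
    return math.ceil((z * sigma / me) ** 2)     # round up: n must be a whole number

# Hypothetical inputs: sigma = 15, margin of error = 2, 95% confidence
n_required = required_sample_size(sigma=15, me=2, cl=0.95)
print('Required sample size:', n_required)
```

Rounding up rather than to the nearest integer guarantees that the achieved margin of error is no larger than the target.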

The value of $Z_{\frac{\alpha}{2}}$ for different $\alpha$ values can be obtained using `stats.norm.isf()` from the scipy library.

To calculate the confidence interval with 95% confidence, use the Z-value corresponding to `alpha = 0.05`. 
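
`stats.norm.isf()` works on the upper tail and `stats.norm.ppf()` on the lower tail, which is why the two are used interchangeably with `np.abs()` in this notebook. A short sketch of the relationship, with variable names chosen here for illustration:

```python
import numpy as np
from scipy import stats

alpha = 0.05

z_isf = stats.norm.isf(alpha / 2)            # upper-tail critical value (positive)
z_ppf_upper = stats.norm.ppf(1 - alpha / 2)  # same point via the CDF inverse
z_ppf_lower = stats.norm.ppf(alpha / 2)      # lower-tail critical value (negative)

print('isf(alpha/2):    ', z_isf)
print('ppf(1 - alpha/2):', z_ppf_upper)
print('ppf(alpha/2):    ', z_ppf_lower)
```

Because the standard normal distribution is symmetric about zero, the lower-tail value is just the negative of the upper-tail one.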

In [9]:
alpha=[0.1,0.05,0.02,0.01]
for a in alpha:
    z_alpha_by_2=stats.norm.interval(1-a)
    z_alpha_by_2=np.round(z_alpha_by_2,2)
    print(f'At {a} alpha. Confidence level {(1-a)*100} interval is{z_alpha_by_2}')

At 0.1 alpha. Confidence level 90.0 interval is[-1.64  1.64]
At 0.05 alpha. Confidence level 95.0 interval is[-1.96  1.96]
At 0.02 alpha. Confidence level 98.0 interval is[-2.33  2.33]
At 0.01 alpha. Confidence level 99.0 interval is[-2.58  2.58]


In [7]:
# z alpha by 2 at the 93% confidence level (ppf of the lower tail is negative; take the absolute value)
cl=0.93
alpha=1-cl
stats.norm.ppf(alpha/2)

-1.8119106729525982


#### 1. A random sample of weight (in kg.) for 35 diabetic patients is drawn from the population with a standard deviation of 8 kg. Find the 90% confidence interval for the population mean.

    Weight: [59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.3, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
             92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]

In [10]:
Weight=[59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.3, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
         92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]

In [11]:
sigma=8
cl=0.9
n=35
alpha=1-cl
# sample statistic
x_bar=np.mean(Weight)

# moe
z_alpha_by2=np.abs(stats.norm.isf(alpha/2))
se=sigma/(np.sqrt(n))
moe=z_alpha_by2*se

# Confidence interval
print(f'{x_bar-moe} to {x_bar+moe} with 90% confidence')

74.46146621975642 to 78.90996235167215 with 90% confidence


In [12]:
cl=0.9
x_bar=np.mean(Weight) 
sigma=8
n=35

stats.norm.interval(confidence=cl,loc=x_bar,scale=sigma/np.sqrt(n))

(74.46146621975642, 78.90996235167215)

#### 2. There are 150 apples on a tree. You randomly choose 40 apples and find that the average weight of the apples is 182 grams, with a population standard deviation of 30 grams. Find the 95% confidence interval for the population mean.

In [20]:
sigma=30
cl=0.95
n=40
alpha=1-cl
# sample statistic
x_bar=182
# moe
z_alpha_by2=np.abs(stats.norm.ppf(alpha/2))
se=sigma/(np.sqrt(n))
moe=z_alpha_by2*se

# Confidence interval
print(f'{x_bar-moe} to {x_bar+moe} with 95% confidence')

172.70307451543158 to 191.29692548456842 with 95% confidence


In [14]:
cl=0.95
x_bar=182
sigma=30
n=40

stats.norm.interval(confidence=cl,loc=x_bar,scale=sigma/np.sqrt(n))

(172.70307451543158, 191.29692548456842)

#### 3. A movie production house needs to estimate the average monthly wage of the technical crew members. Previous data shows that the standard deviation of the wages is 190 dollars. The production team wants the margin of error of the estimated average wage not to exceed 54 dollars. The team has decided to use a small subset of wages for the estimation. Find a suitable number of wages to consider in order to get the estimate with 90% confidence.

In [15]:
sigma=190
moe=54
cl=0.9
alpha=1-cl
z_alpha_by2=np.abs(stats.norm.ppf(alpha/2))
# moe = z*sigma/sqrt(n)
# => n = z**2 * sigma**2 / moe**2 (round up to the next whole number in practice)

n=(z_alpha_by2 **2)*((sigma**2)/moe**2)
print('Sample size should be',n)


Sample size should be 33.49455373554338


In [25]:
x_bar=500
n=600
sigma=40
cl=0.99
stats.norm.interval(cl,loc=x_bar,scale=sigma/(np.sqrt(n)))

(495.7936883611978, 504.2063116388022)

In [16]:
sigma=40
cl=0.99
n=600
alpha=1-cl
# sample statistic
x_bar=500
# moe
z_alpha_by2=np.abs(stats.norm.ppf(alpha/2))
se=sigma/(np.sqrt(n))
moe=z_alpha_by2*se

# Confidence interval
print(f'{x_bar-moe} to {x_bar+moe} with 99% confidence')

495.7936883611978 to 504.2063116388022 with 99% confidence


#### 4. 100 bags of coal were tested and had an average of 35% of ash with a standard deviation of 15%. Calculate the margin of error for a 95% confidence level.

In [17]:
cl=0.95
x_bar=35
n=100
df=n-1
s=15
se=s/(np.sqrt(n))
stats.t.interval(confidence=cl,loc=x_bar,scale=se,df=df)

(32.02367457273698, 37.97632542726302)
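
The cell above returns the confidence interval, while the question asks for the margin of error, which is half the width of that interval. A minimal sketch of the direct calculation, using the same inputs; the names `t_crit`, `lower` and `upper` are introduced here for illustration:

```python
import numpy as np
from scipy import stats

cl = 0.95
x_bar = 35
n = 100
s = 15

se = s / np.sqrt(n)                            # standard error of the mean
t_crit = stats.t.isf((1 - cl) / 2, df=n - 1)   # t_{alpha/2, n-1}
moe = t_crit * se                              # margin of error

# Cross-check: half the width of the interval from stats.t.interval
lower, upper = stats.t.interval(cl, n - 1, loc=x_bar, scale=se)
print('Margin of Error:', moe)
print('Half-width of interval:', (upper - lower) / 2)
```

Both printed values should agree, at roughly 2.98 percentage points of ash content.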

#### 5. From a sample of 250 observations, it is found that the average income of a 27-year-old Londoner is £45,000 with a sample standard deviation of £4000. Obtain the 95% confidence interval to estimate the average income.

In [2]:
s=4000
cl=0.95
n=250
dof=n-1
alpha=1-cl
x_bar=45000
se=s/(np.sqrt(n))
t_alphaby2 = np.abs(stats.t.ppf(alpha/2,df=dof))
moe=t_alphaby2*se
print('Margin of Error',moe)
print('Confidence Interval',x_bar-moe,x_bar+moe)

Margin of Error 498.2577949931726
Confidence Interval 44501.74220500683 45498.25779499317


In [19]:
stats.t.interval(confidence=cl,loc=x_bar,scale=se,df=dof)

(44501.74220500683, 45498.25779499317)

<a id="small"></a>
### Small Sample Size

Let us take a sample of `n` observations from the population such that $n < 30$. Here the standard deviation of the population is unknown, and the population is assumed to be approximately normal. The confidence interval for the population mean with $100(1-\alpha)$% confidence level is given as: $\overline{X} \pm t_{\frac{\alpha}{2}, n-1}\frac{s}{\sqrt{n}}$

Where, <br>
$\overline{X}$: Sample mean<br>
$\alpha$: Level of significance<br>
$s$: Sample standard deviation<br>
$n-1$: degrees of freedom

The ratio $\frac{s}{\sqrt{n}}$ is the estimate of the standard error of the mean. And $t_{\frac{\alpha}{2}, n-1}\frac{s}{\sqrt{n}}$ is the margin of error for the estimate.

The value of $t_{\frac{\alpha}{2}, n-1}$ for different $\alpha$ values can be obtained using `stats.t.isf()` from the scipy library.
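
To see why the t-distribution matters mainly for small samples, the sketch below compares $t_{\frac{\alpha}{2}, n-1}$ with $Z_{\frac{\alpha}{2}}$ as the degrees of freedom grow. The chosen degrees of freedom and the names `z_crit` and `t_crits` are assumptions made here for illustration.

```python
from scipy import stats

alpha = 0.05
z_crit = stats.norm.isf(alpha / 2)   # normal critical value for comparison

# t critical values for increasing degrees of freedom
t_crits = {df: stats.t.isf(alpha / 2, df=df) for df in (5, 14, 29, 100, 1000)}

for df, t_crit in t_crits.items():
    print(f'df={df:5d}  t={t_crit:.4f}  t - z = {t_crit - z_crit:.4f}')
```

The t critical value is always larger than the z value, giving wider intervals, and the gap shrinks towards zero as the degrees of freedom increase.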

In [20]:
# Let's calculate t_alpha/2 for different confidence levels
# Let's assume n=15
n=15
df=n-1
cl=[0.99,0.98,0.95,0.90]
for i in cl:
    inter=stats.t.interval(confidence=i,df=df)
    print(f't interval at alpha/2 and df {df} with cl {i*100} = {np.round(inter,2)} ')

t interval at alpha/2 and df 14 with cl 99.0 = [-2.98  2.98] 
t interval at alpha/2 and df 14 with cl 98.0 = [-2.62  2.62] 
t interval at alpha/2 and df 14 with cl 95.0 = [-2.14  2.14] 
t interval at alpha/2 and df 14 with cl 90.0 = [-1.76  1.76] 


In [21]:
# T_alpha/2 at cl =90 and n=10
cl=0.9
alpha=1-cl
n=10
df=n-1
np.abs(stats.t.ppf(alpha/2,df=df))

1.833112932653634

#### 1. There are 150 apples on a tree. You randomly choose 17 apples and find that the average weight of the apples is 78 grams with a standard deviation of 23 grams. Find the 90% confidence interval for the population mean.

In [22]:
n=17
dof=16
cl=0.9
alpha=1-cl

x_bar=78
s=23
se=s/(np.sqrt(n))

stats.t.interval(confidence=cl,df=dof,loc=x_bar,scale=se)

(68.26090326067306, 87.73909673932694)
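
The example above starts from summary statistics. When the raw observations are available, the same interval can be computed directly from them. The sketch below uses made-up data: the `weights` array is hypothetical and does not come from the exercise.

```python
import numpy as np
from scipy import stats

# Hypothetical raw sample of 10 weights (illustrative values only)
weights = np.array([71.0, 82.5, 64.0, 90.2, 77.8, 69.4, 85.1, 73.3, 80.6, 66.9])

n = len(weights)
x_bar = weights.mean()
se = stats.sem(weights)   # s / sqrt(n), where s uses ddof=1 by default

# 90% confidence interval with n-1 degrees of freedom
lower, upper = stats.t.interval(0.90, n - 1, loc=x_bar, scale=se)
print('Sample mean:', x_bar)
print('90% confidence interval:', (lower, upper))
```

`stats.sem` divides by `n - 1` when computing the standard deviation, which matches the $s$ in the formula for this section.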

<a id="prop"></a>
## Interval Estimation for Proportion

Consider a population in which each observation is either a success or a failure. The population proportion is denoted by `P`, which is the ratio of the number of successes to the size of the population.

The confidence interval for the population proportion with $100(1-\alpha)$% confidence level is given as: $p \pm Z_{\frac{\alpha}{2}}\sqrt{\frac{p(1 - p)}{n}}$

Where, <br>
$p$: Sample proportion<br>
$\alpha$: Level of significance<br>
$n$: Sample size

The quantity $Z_{\frac{\alpha}{2}}\sqrt{\frac{p(1 - p)}{n}}$ is the margin of error.
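
The formula translates directly into a small helper. This is a sketch only; the function name `wald_interval` and the counts used in the call are hypothetical, and the normal approximation behind the formula is most reliable when the sample is reasonably large.

```python
import numpy as np
from scipy import stats

def wald_interval(successes, n, cl):
    """Normal-approximation confidence interval for a population proportion."""
    p = successes / n                     # sample proportion
    z = stats.norm.isf((1 - cl) / 2)      # Z_{alpha/2}
    moe = z * np.sqrt(p * (1 - p) / n)    # margin of error
    return p - moe, p + moe

# Hypothetical counts: 120 successes in a sample of 400, 95% confidence
lower, upper = wald_interval(successes=120, n=400, cl=0.95)
print('Population Proportion Interval', lower, upper)
```

With small samples or proportions near 0 or 1, the computed bounds can fall outside the range 0 to 1, which is a known limitation of this approximation.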

#### 1. A financial firm has created 50 portfolios. From them, a sample of 13 portfolios was selected, out of which 8 were found to be underperforming. Construct a 99% confidence interval to estimate the population proportion.

In [8]:
import statsmodels.stats.proportion as proportion

In [27]:
p=8/13
n=13
se=np.sqrt(p*(1-p)/n)
cl=0.99
alpha=1-cl
z_alpha_by2=np.abs(stats.norm.ppf(alpha/2))
moe=z_alpha_by2*se
print('Population Proportion Interval',p-moe,p+moe)

Population Proportion Interval 0.26782280814713805 0.9629464226220927


In [28]:
proportion.proportion_confint(count=8,nobs=13,alpha=0.01)

(0.26782280814713794, 0.962946422622093)