# Assignment 9: Point Estimate and Interval Estimate (Confidence Interval)
### A random survey of enrollment at 35 community colleges across the United States yielded the following figures:
### 6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044;
### 5,481; 5,200; 5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200;
### 17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768; 7,493; 2,771; 2,861;
### 1,263; 7,285; 28,165; 5,080; 11,622
### Perform point estimate and interval estimate with 95% confidence level using t-distribution.
### Because the population variance is unknown, we use the t-distribution instead of the normal distribution.


In [None]:
# Import Python packages
import math
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (12, 8)  # global default figure size (width, height) in inches

## Step 0 - Data Preprocessing
### Process the raw data to make a list of integers. In order to calculate descriptive statistics, Python needs to work with a list of numbers.
### Note: Don't manually make the list by hand-typing the numbers. Write code to automate the data preparation.

In [None]:
# make each line of numbers a string object and then concatenate them together 
# The end result is one single string containing 35 numbers separated by ";"

data_1 = "6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044;" 
data_2 = "5,481; 5,200; 5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200; "
data_3 = "17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768; 7,493; 2,771; 2,861; "
data_4 = "1,263; 7,285; 28,165; 5,080; 11,622"
data = data_1 + data_2 + data_3 + data_4
data

'6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044;5,481; 5,200; 5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200; 17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768; 7,493; 2,771; 2,861; 1,263; 7,285; 28,165; 5,080; 11,622'

### Convert the single string into a list of strings using the split() method
### Make sure to specify the delimiter (separator), which is ";" here

In [None]:
data = data.split(";")
print(data)

['6,414', ' 1,550', ' 2,109', ' 9,350', ' 21,828', ' 4,300', ' 5,944', ' 5,722', ' 2,825', ' 2,044', '5,481', ' 5,200', ' 5,853', ' 2,750', ' 10,012', ' 6,357', ' 27,000', ' 9,414', ' 7,681', ' 3,200', ' 17,500', ' 9,200', ' 7,380', ' 18,314', ' 6,557', ' 13,713', ' 17,768', ' 7,493', ' 2,771', ' 2,861', ' 1,263', ' 7,285', ' 28,165', ' 5,080', ' 11,622']


### Create a list of integers from the list of strings using List Comprehension or a for loop. Make sure to remove the "," first and then convert the strings to integers.

In [None]:
data = [x.strip(' ') for x in data]
data = [x.replace(",","") for x in data]
print(data)

['6414', '1550', '2109', '9350', '21828', '4300', '5944', '5722', '2825', '2044', '5481', '5200', '5853', '2750', '10012', '6357', '27000', '9414', '7681', '3200', '17500', '9200', '7380', '18314', '6557', '13713', '17768', '7493', '2771', '2861', '1263', '7285', '28165', '5080', '11622']


In [None]:
# Now convert the list of digit strings to integers using map();
# the list comprehension [int(x) for x in data] would be equivalent
data = list(map(int, data))
print(data)

[6414, 1550, 2109, 9350, 21828, 4300, 5944, 5722, 2825, 2044, 5481, 5200, 5853, 2750, 10012, 6357, 27000, 9414, 7681, 3200, 17500, 9200, 7380, 18314, 6557, 13713, 17768, 7493, 2771, 2861, 1263, 7285, 28165, 5080, 11622]
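The preprocessing steps above (split, strip, remove the thousands separator, convert) can also be collapsed into one helper. The following is a minimal sketch, not part of the original assignment; the function name `parse_enrollments` is hypothetical, and the input is a shortened sample of the raw string.

```python
def parse_enrollments(raw):
    """Split on ';', remove the ',' thousands separators, and convert to int."""
    tokens = (tok.strip() for tok in raw.split(";"))
    # Skip empty tokens, which appear when the string ends with a trailing ';'
    return [int(tok.replace(",", "")) for tok in tokens if tok]

# Shortened sample of the raw survey string (note the trailing ';')
raw = "6,414; 1,550; 2,109;"
values = parse_enrollments(raw)
print(values)  # [6414, 1550, 2109]
```

Filtering out empty tokens makes the helper tolerant of a trailing delimiter, which would otherwise make `int("")` raise a `ValueError`.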


## Step 1 - Calculate and Display the Sample Size and Sample Mean

In [None]:
#Calculate and display the sample size
sample_size = len(data) #length of list, which is 35
print("Sample size = " + str(sample_size)) 
#print("Sample size = " + str(len(data))) ---alternatively, this one line of code can do the same thing

#Calculate and display the sample mean
sample_mean = np.mean(data, dtype=np.float64)
sample_mean = np.round(sample_mean, 0) #round to the nearest integer
print("Sample mean is " + str(sample_mean))

Sample size = 35
Sample mean is 8629.0


### The point estimate of the mean enrollment of US community colleges is 8629.


## Step 2 - Calculate and Display the Sample Standard Deviation & Sample Standard Error
### Sample Standard Deviation $S=\sqrt{\dfrac{1}{n-1}\sum\limits_{i=1}^n (X_i-\bar{X})^2}$
### Sample Standard Error = $\dfrac{S}{\sqrt{n}}$
### Note: The default Delta Degrees of Freedom (ddof) for NumPy's std function is 0, which is appropriate for population data. For sample data, we need to specify ddof=1.
### For the enrollment data, we round the statistics to the nearest integer (no decimal places).

In [None]:
#Calculate and display the sample standard deviation using Numpy's std function
std_dev = np.std(data, dtype=np.float64, ddof=1) #specify ddof = 1
std_dev = np.round(std_dev, 0) #round to the nearest integer
print("Sample Standard Deviation = " + str(std_dev))

# Calculate and display the sample standard error
std_error = stats.sem(data, axis=None, ddof=1) #using SciPy stat package, this gets the standard error 
std_error = np.round(std_error, 0) 
print("Sample Standard Error is " + str(std_error))

Sample Standard Deviation = 6944.0
Sample Standard Error is 1174.0
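As a cross-check on the library calls, the two formulas above can be evaluated directly. This is a minimal sketch using only the standard library; the variable names (`s`, `se`) are illustrative and not from the original notebook.

```python
import math
import statistics

data = [6414, 1550, 2109, 9350, 21828, 4300, 5944, 5722, 2825, 2044,
        5481, 5200, 5853, 2750, 10012, 6357, 27000, 9414, 7681, 3200,
        17500, 9200, 7380, 18314, 6557, 13713, 17768, 7493, 2771, 2861,
        1263, 7285, 28165, 5080, 11622]

n = len(data)
mean = sum(data) / n

# Sample standard deviation: divide by n - 1 (the ddof=1 case)
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Standard error of the mean: S / sqrt(n)
se = s / math.sqrt(n)

# statistics.stdev also uses the n - 1 denominator, so the two should agree
print(math.isclose(s, statistics.stdev(data)))
print(round(s), round(se))
```

Dividing by `n` instead of `n - 1` would give the population standard deviation, which is what `np.std` returns when `ddof` is left at its default.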


## Step 3 - Calculate t Critical Value using t-Distribution
### $\alpha$ = 1 - Confidence Level = 1 - 95% = 0.05
### $\dfrac{\alpha}{2}$ = 0.025
### n (sample size) = 35
### df (degrees of freedom) = n - 1 = 35 - 1 = 34
### We will use the PPF (Percent Point Function, the inverse of the CDF) of the scipy.stats t-distribution to calculate the t critical value $t_{0.025,34}$.

In [None]:
# Calculate and display the t critical value using the scipy.stats.t ppf function
# The second argument is the degrees of freedom, n - 1 = 34 (not the sample size, 35)
t_stat = np.round(stats.t.ppf(1-0.025, sample_size - 1), 2) #first using the ppf function, then rounding it to 2 decimals
print("t critical value =  " + str(t_stat))

t critical value =  2.03
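To make the link between the confidence level and the critical value explicit, the sketch below derives the PPF arguments from `confidence` and `n` instead of hard-coding them, and compares the result with the normal critical value. The variable names are illustrative assumptions, not from the original notebook.

```python
from scipy import stats

confidence = 0.95
n = 35

alpha = 1 - confidence   # total area in both tails
df = n - 1               # degrees of freedom for a one-sample mean

# Two-sided critical values: the upper alpha/2 quantile of each distribution
t_crit = stats.t.ppf(1 - alpha / 2, df)
z_crit = stats.norm.ppf(1 - alpha / 2)

print(round(t_crit, 2), round(z_crit, 2))  # 2.03 1.96
```

The t critical value is slightly larger than the normal one, which widens the interval to account for estimating the standard deviation from the sample.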


## Step 4 - Calculate the Margin of Error
### Margin of Error = t Critical Value * Sample Standard Error = $t_{\alpha/2,n-1}\left(\dfrac{s}{\sqrt{n}}\right)$

In [None]:
# Calculate and display the margin of error
margin_error = np.round(t_stat * std_error, 0)
print("Margin of Error = " + str(margin_error))

Margin of Error = 2383.0


## Step 5 - Calculate Lower and Upper Limit of the Confidence Interval
### Lower Limit = Sample Mean - Margin of Error
### Upper Limit = Sample Mean + Margin of Error

In [None]:
# Calculate and display the lower limit
l_limit = sample_mean - margin_error
print("Lower limit = ", round(l_limit, 0))
# Calculate and display the upper limit
u_limit = sample_mean + margin_error
print("Upper limit = ", round(u_limit, 0))

Lower limit =  6246.0
Upper limit =  11012.0


## Step 6 - Now We Have the 95% Confidence Interval
### Confidence Interval ($\sigma$ unknown) = $\bar{x} \pm t_{\alpha/2,n-1}\left(\dfrac{s}{\sqrt{n}}\right)$ = Sample Mean $\pm$ Margin of Error

In [None]:
# Print the 95% confidence interval in the form (lower limit, upper limit)
print("The 95% Confidence Interval = (", round(l_limit, 0), ",", round(u_limit, 0), ")")

The 95% Confidence Interval = ( 6246.0 , 11012.0 )
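The same interval can be cross-checked in a single call with `scipy.stats.t.interval`. This sketch is an alternative to the step-by-step calculation, not the assignment's required method. Because it does no intermediate rounding, its limits may differ from the step-by-step result by a few units.

```python
import numpy as np
from scipy import stats

data = [6414, 1550, 2109, 9350, 21828, 4300, 5944, 5722, 2825, 2044,
        5481, 5200, 5853, 2750, 10012, 6357, 27000, 9414, 7681, 3200,
        17500, 9200, 7380, 18314, 6557, 13713, 17768, 7493, 2771, 2861,
        1263, 7285, 28165, 5080, 11622]

mean = np.mean(data)
sem = stats.sem(data)    # standard error; ddof=1 is the default
df = len(data) - 1

# Confidence level is passed positionally; loc is the center, scale the standard error
lower, upper = stats.t.interval(0.95, df, loc=mean, scale=sem)
print(f"95% CI = ({lower:.0f}, {upper:.0f})")
```

Rounding only at the final step, as done here, avoids accumulating rounding error from the mean, standard error, and critical value.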


### The End
