The Central Limit Theorem (CLT) states that, for any population, provided a sufficiently large number of samples has been taken, the following properties hold:

        Sampling distribution’s mean (μₓ¯) = Population mean (μ)
        Sampling distribution’s standard deviation (standard error) = σ/√n ≈ S/√n
        For n > 30, the sampling distribution is approximately a normal distribution.

Let’s verify the properties of the CLT in Python through Jupyter Notebook.

For the following Python code, we’ll use the datasets of Population and Random Values, which we can find here.

In [None]:
# Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)

In [None]:
# Population
#reading the dataset
df = pd.read_excel('Population.xls')

#Printing the dataset
df

In [None]:
# Extracting Weight Column from the dataset
df = df[['Weight']]
df

# Plotting the distribution graph using the Seaborn library
# (histplot replaces distplot, which is deprecated in recent Seaborn versions)
sns.histplot(df.Weight, kde=True)
plt.show()

# The chart is close to a normal distribution

In [None]:
# Mean of the Weight column
print("Mean =", df.Weight.mean())

# Standard deviation of the Weight column
print("Std. Dev. =", df.Weight.std())

Mean = 220.67326732673268
Std. Dev. = 26.643110470317723

These values are the exact Mean and Standard Deviation values of the Weight Column.

Now, let’s start sampling the data.

First, we’ll repeatedly draw samples of 30 members from the data and record the mean of each sample. The goal is to check whether the resulting sampling distribution of the means follows a normal distribution.
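The repeated-sampling step can be sketched as follows. This is a minimal, self-contained illustration that mirrors the sampling pattern used later in this article. The `weights` Series is a synthetic stand-in for df.Weight (an assumption, generated to roughly match the mean and standard deviation reported above), and the 1000 repetitions are likewise an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df.Weight (assumption): normally distributed values
# with roughly the mean and standard deviation reported above.
rng = np.random.default_rng(42)
weights = pd.Series(rng.normal(loc=220.67, scale=26.64, size=1000))

samp_size = 30     # members per sample
n_repeats = 1000   # number of repeated samples (assumed)

# Draw a sample, record its mean, and repeat
sample_means = pd.Series(
    [weights.sample(samp_size, random_state=i).mean() for i in range(n_repeats)]
)

print("Population mean:     ", weights.mean())
print("Mean of sample means:", sample_means.mean())
print("Std. of sample means:", sample_means.std())
print("sigma / sqrt(n):     ", weights.std() / np.sqrt(samp_size))
```

In the notebook itself, df.Weight would take the place of `weights`.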

In [None]:
samp_size = 30  # sample size used for the repeated sampling
df.Weight.std() / np.sqrt(samp_size)  # theoretical standard error, σ/√n


The code above returns approximately 4.86.

This value is close to sample_means.std(), the standard deviation of the sample means.

So, from the above code, we can infer that:

    Sampling distribution’s mean (μₓ¯) = Population mean (μ)

    Sampling distribution’s standard deviation (standard error) = σ/√n

So far, the original data in the “Weight” column has itself been close to a normal distribution. Let’s see whether the sampling distribution is still normal when the original data is not normally distributed.

In [None]:
# Reading the data into Jupyter Notebook
df1 = pd.read_csv("Random Values.csv")

# Printing the data
df1

# Plotting a histogram of the Values column with Matplotlib
df1.Values.plot.hist(bins=40)
plt.show()

The Values column does not resemble the normal distribution graph. It looks somewhat like an exponential distribution.

Let’s pick samples from this distribution, calculate their means, and plot the sampling distribution.

In [None]:
# Taking random samples of size 50 from the Values column
samp_size = 50

# Calculating the mean of every sample and converting the result into a Series object
sample_means = [df1.Values.sample(samp_size).mean() for i in range(1000)]
sample_means = pd.Series(sample_means)

# Plotting all the sample means in a distribution plot using Seaborn
sns.histplot(sample_means, kde=True)
plt.show()

The distribution of the sample_means we obtained from the Values column, which is itself far from a normal distribution, is still very close to a normal distribution.

Let’s compare the mean of sample_means to the mean of its parent dataset.

In [None]:
sample_means.mean()
# The output will be
# 130.39213999999996

df1.Values.mean()
# The output is
# 130.4181654676259

As we can see, the mean of sample_means and the mean of the original dataset are very close.

Similarly, the standard deviation of the sample means is sample_means.std() = 13.263962580003142

This should be reasonably close to the theoretical standard error, df1.Values.std()/np.sqrt(samp_size) = 14.060457446377631

In [None]:
# Let’s compare the distribution graph of each dataset with its corresponding sampling distribution.
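A side-by-side comparison of this kind could be sketched as below. This is only an illustration under stated assumptions: the two Series are synthetic stand-ins for df.Weight and df1.Values, the helper name `plot_population_vs_sampling` is hypothetical, and the sample size of 50 with 1000 repetitions is chosen for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_population_vs_sampling(values, sample_means, name):
    """Plot a dataset's histogram next to the histogram of its sample means."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].hist(values, bins=40)
    axes[0].set_title(name + ": original data")
    axes[1].hist(sample_means, bins=40)
    axes[1].set_title(name + ": sampling distribution")
    fig.tight_layout()
    return fig

# Synthetic stand-ins (assumptions) for df.Weight and df1.Values
rng = np.random.default_rng(42)
datasets = {
    "Weight (stand-in)": pd.Series(rng.normal(220.67, 26.64, size=1000)),
    "Values (stand-in)": pd.Series(rng.exponential(130.4, size=1000)),
}

figures = []
for name, values in datasets.items():
    # Means of 1000 random samples of size 50 (assumed settings)
    sample_means = [values.sample(50).mean() for _ in range(1000)]
    figures.append(plot_population_vs_sampling(values, sample_means, name))
```

In a notebook, calling plt.show() after the loop displays both figures.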


As we can see, irrespective of the original dataset’s distribution, the sampling distribution resembles the Normal Distribution Curve.

There’s only one thing to consider now, i.e., Sample Size. We’ll observe that, as the sample size increases, the sampling distribution will approximate a normal distribution even more closely.


In [None]:
#Effect of Sample Size on the Sampling Distribution
# Different Sizes of Samples
sample_sizes = [3, 10, 30, 50, 100, 200]

# Plotting the distribution graphs for the above created Samples
plt.figure(figsize=[10,7])
for ind, samp_size in enumerate(sample_sizes):
    sample_means = [df1.Values.sample(samp_size).mean() for i in range(500)]
    plt.subplot(2,3,ind+1)
    sns.histplot(sample_means, bins=25, kde=True)
    plt.title("Sample size: "+str(samp_size))
plt.show()

As we can observe, the distribution graphs for sample sizes 3 and 10 do not resemble a normal distribution. From a sample size of 30 onward, however, the sampling distribution increasingly resembles a normal distribution as the sample size grows.
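One way to put a number on this visual impression is to track the skewness of the sample means, which is 0 for a perfectly symmetric distribution such as the normal. The sketch below is an illustration under stated assumptions: `values` is a synthetic exponential stand-in for df1.Values, 2000 repetitions are used per sample size, and skewness (via the pandas Series.skew method) is our choice of summary rather than something taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic, exponential-looking stand-in for df1.Values (assumption)
rng = np.random.default_rng(42)
values = pd.Series(rng.exponential(scale=130.4, size=5000))

sample_sizes = [3, 10, 30, 50, 100, 200]
skewness = {}
for samp_size in sample_sizes:
    # Means of 2000 random samples of the given size
    sample_means = pd.Series(
        [values.sample(samp_size, random_state=i).mean() for i in range(2000)]
    )
    skewness[samp_size] = sample_means.skew()
    print("Sample size:", samp_size, "skewness:", round(skewness[samp_size], 3))
```

The skewness should shrink toward 0 as the sample size grows, matching what the plots show.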

As a rule of thumb, we can say that a sample size of 30 or above is ideal for concluding that the sampling distribution is nearly normal, and further inferences can be drawn from it.

Through this Python code, we can conclude that the following three properties of the CLT hold:

    Sampling distribution’s mean (μₓ¯) = Population mean (μ)

    Sampling distribution’s standard deviation (standard error) = σ/√n

    For n > 30, the sampling distribution is approximately a normal distribution.

Estimating Mean Using CLT

The mean commute time of 30000 employees (μ) = 36.6 (sample mean) + some margin of error. We can find this margin of error using the CLT. Now that we know what the CLT is, let’s see how we can find the margin of error.

Let’s say the mean commute time of a sample of 100 employees is X¯ = 36.6 min, and the standard deviation of the sample is S = 10 min. Using the CLT, we can infer that:

    Sampling distribution’s mean (μₓ¯) = Population mean (μ)

    Sampling distribution’s standard deviation (standard error) = σ/√n ≈ S/√n = 10/√100 = 1

    Since the sampling distribution is approximately normal, about 95.4% of sample means lie within 2 standard errors of μ (the 1-2-3 rule of the normal distribution curve), so

    P(μ-2 < 36.6 < μ+2) = P(36.6-2 < μ < 36.6+2) = 95.4%

Now, we can say that there is a 95.4% probability that the population mean (μ) lies in the interval (36.6-2, 36.6+2). In other words, we are 95.4% confident that the error in estimating the mean is at most 2 minutes.

The probability associated with the claim is called the confidence level (here, 95.4%).
The maximum error made in the sample mean is called the margin of error (here, 2 min).
The final interval of values is called the confidence interval (here, (34.6, 38.6)).
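The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch using only the numbers given above. The standard-library `statistics.NormalDist` class is our choice for checking the 95.4% figure, and the variable names are illustrative.

```python
from statistics import NormalDist

n = 100        # sample size
x_bar = 36.6   # sample mean commute time (min)
s = 10         # sample standard deviation (min)

standard_error = s / n ** 0.5          # S/√n
z = 2                                  # 2 standard errors on either side
margin_of_error = z * standard_error

# Probability mass of a normal distribution within z standard deviations
confidence_level = NormalDist().cdf(z) - NormalDist().cdf(-z)

confidence_interval = (x_bar - margin_of_error, x_bar + margin_of_error)

print("Confidence level (%):", round(confidence_level * 100, 1))
print("Margin of error (min):", margin_of_error)
print("Confidence interval:", confidence_interval)
```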

We can generalize this concept in the following manner.

Let’s say that we have a sample with sample size n, mean X¯, and standard deviation S. Now, the y% confidence interval (i.e., the confidence interval corresponding to a y% confidence level) for μ would be given by the range:

Confidence interval = (X¯ - (Z* S/√n), X¯ + (Z* S/√n))

where Z* is the Z-score associated with a y% confidence level.
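The general formula can be sketched as a small function. This is an illustration rather than a definitive implementation: the function name `confidence_interval` and its parameters are hypothetical, and Z* is obtained from the standard normal distribution with the standard-library `statistics.NormalDist`, which assumes the sample is large enough for the CLT to apply.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(x_bar, s, n, confidence_level):
    """Return (X¯ - Z*·S/√n, X¯ + Z*·S/√n) for the population mean."""
    # Z* leaves (1 - confidence_level) / 2 in each tail of the standard normal
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin_of_error = z_star * s / sqrt(n)
    return x_bar - margin_of_error, x_bar + margin_of_error

# Commute-time example at several confidence levels
for level in (0.90, 0.95, 0.99):
    low, high = confidence_interval(x_bar=36.6, s=10, n=100, confidence_level=level)
    print(f"{level:.0%} confidence interval: ({low:.2f}, {high:.2f})")
```

A higher confidence level gives a larger Z* and therefore a wider interval for the same sample.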