# Python Automation Illinois State University 
### Scott Schmidt
### Central Limit Theorem and Normal Distribution

Background:
Statistical properties of a population (e.g., mean, standard deviation) are important for data analysis. In practice, we usually cannot collect data for a whole population, so we take random samples and estimate these parameters from the samples.
Given a sample of n observations from a population, we can calculate estimates of the population mean, standard deviation, and various other population characteristics (parameters). Before a sample is drawn, it is uncertain which of all possible samples will occur. Because of this, estimates such as the mean and standard deviation vary from one sample to another. The behavior of such estimates in repeated sampling is described by sampling distributions. A sampling distribution indicates how close an estimate is likely to be to the value of the parameter being estimated.

The Central Limit Theorem states that if you take sufficiently large random samples (a common rule of thumb is n ≥ 30) with replacement from a population with mean μ and standard deviation σ, then the sample means will be approximately normally distributed.
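
Stated a little more formally (this is the standard textbook form of the theorem, not something specific to this lab), for independent draws $X_1, \dots, X_n$ from the population, the sample mean $\bar{X}$ behaves approximately as follows:

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\bar{X} \;\approx\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^{2}}{n}\right)
\quad \text{for large } n
```

In words: the sample means center on μ, and their standard deviation (often called the standard error) is about σ/√n, so larger samples give means that cluster more tightly around μ.
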
Sampling can be done with or without replacement. If with replacement, the same data point could appear more than once in a sample.
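
The difference between the two kinds of sampling can be sketched with the standard `random` module. The `population` list here is a hypothetical toy example (not the lab dataset), chosen so that every value is unique and repeats are easy to spot.

```python
import random

# Hypothetical toy population (not the lab dataset); every value is unique.
population = [10, 20, 30, 40, 50]

random.seed(0)  # arbitrary fixed seed so the run is repeatable

# With replacement: each draw is independent, so a value may repeat,
# and the sample may even be larger than the population.
with_replacement = random.choices(population, k=8)

# Without replacement: each value can be picked at most once,
# so k cannot exceed len(population).
without_replacement = random.sample(population, k=5)

print("With replacement:   ", with_replacement)
print("Without replacement:", without_replacement)
```

Drawing 8 values from only 5 candidates forces at least one repeat, which is only possible with replacement.
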
Please note: in this lab, you should use a for loop to iterate over the data and perform the required calculations. You are not allowed to use the built-in functions max(), min(), and sum(), or mean() from the statistics module. You can use len() if needed.

Dataset: we assume the following 100 weights to be a complete population, and thus can accurately calculate some important parameters of this dataset: [112.99,136.49,153.03,142.34,144.3,123.3,141.49,136.46,112.37,120.67,127.45,114.14,125.61,122.46,116.09,140,129.5,142.97,137.9,124.04,141.28,143.54,97.9,129.5,141.85,129.72,142.42,131.55,108.33,113.89,103.3,120.75,125.79,136.22,140.1,128.75,141.8,121.23,131.35,106.71,124.36,124.86,139.67,137.37,106.45,128.76,145.68,116.82,143.62,134.93,147.02,126.33,125.48,115.71,123.49,147.89,155.9,128.07,119.37,133.81,128.73,137.55,129.76,128.82,135.32,109.61,142.47,132.75,103.53,124.73,129.31,134.02,140.4,102.84,128.52,120.3,138.6,132.96,115.62,122.52,134.63,121.9,155.38,128.94,129.1,139.47,140.89,131.59,121.12,131.51,136.55,141.49,140.61,112.14,133.46,131.8,120.03,123.1,128.14,115.48]

This dataset will be given in CSV file format. You can open it in Notepad as a text file, and then copy and paste it into your Python source code.
Data source: 
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights


--------------------------------------

# 1. Write a program called p1-1.py to (10 points): 
1. Create a Python list to store the population;
2. Calculate the maximum, minimum, and mean; and
3. Print the result (screenshots).

(Output example: The mean of this population is 129.21)

In [3]:
print("Starting program p1-1.py ")

# The population below was copied from the CSV file, so no file reading is needed.

numList=[112.99,136.49,153.03,142.34,144.3,123.3,141.49,136.46,112.37,120.67,127.45,114.14,125.61,
             122.46,116.09,140,129.5,142.97,137.9,124.04,141.28,143.54,97.9,129.5,141.85,129.72,142.42,
             131.55,108.33,113.89,103.3,120.75,125.79,136.22,140.1,128.75,141.8,121.23,131.35,106.71,
             124.36,124.86,139.67,137.37,106.45,128.76,145.68,116.82,143.62,134.93,147.02,126.33,125.48,
             115.71,123.49,147.89,155.9,128.07,119.37,133.81,128.73,137.55,129.76,128.82,135.32,109.61,142.47,
             132.75,103.53,124.73,129.31,134.02,140.4,102.84,128.52, 120.3,138.6,132.96,115.62,122.52,134.63,121.9,
             155.38,128.94,129.1,139.47,140.89,131.59,121.12,131.51,136.55,141.49,140.61,112.14,133.46,131.8,120.03,123.1,128.14,115.48]

def getMean():
    total=0
    for n in numList:
        total=total+n
    return round(total/len(numList),2)
    
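def getMin():
    low=numList[0] #start from the first value instead of an arbitrary sentinel
    for n in numList:
        if n<low:
            low=n
    return low

def getMax():
    high=numList[0] #start from the first value instead of an arbitrary sentinel
    for n in numList:
        if n>high:
            high=n
    return high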

print("The mean of this population is", getMean())
print("The minimum of this population is", getMin())
print("The maximum of this population is", getMax())
print("Program p1-1.py Finished.")
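
The three loops above can also be folded into a single pass over the data. The sketch below is one possible variant, not the required solution; the function name `summarize` is hypothetical, and the short `weights` list holds only the first five values of the population for illustration. It still avoids `max()`, `min()`, `sum()`, and `statistics.mean()`.

```python
def summarize(values):
    """Return (minimum, maximum, mean) of a non-empty list using one for loop."""
    low = values[0]
    high = values[0]
    total = 0
    for v in values:
        if v < low:
            low = v
        if v > high:
            high = v
        total = total + v
    return low, high, total / len(values)


# First five values of the population, used here as a short stand-in.
weights = [112.99, 136.49, 153.03, 142.34, 144.3]
low, high, mean = summarize(weights)
print("Minimum:", low)
print("Maximum:", high)
print("Mean:", round(mean, 2))
```

Passing the list in as a parameter, rather than reading a global, lets the same function be reused on each random sample later in the lab.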

# 2. Write a program called p1-2.py to take random samples from the dataset above. (10 points)

* The sample sizes are 5, 10, and 30, respectively.
* We will do sampling with replacement (the same data point could appear more than once in a sample). 
* Print each sample. 

(Output example: The sample with size 5 is: [140.1, 123.49, 128.14, 125.79, 142.47])

In [4]:
from random import choices
print("Starting program p1-2.py ")

sizes=[5,10,30] #sample sizes
results={} #store results in a dictionary
for size in sizes:
    data=choices(numList, k=size)
    results[size]=data
    print("The sample with size", size, "is:", data) #print() already adds a space between arguments
print("Done!")
#Resource: https://stackoverflow.com/questions/43281886/get-a-random-sample-with-replacement
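
Because `choices` draws at random, each run prints different samples. If repeatable output is wanted (for example, so that a rerun matches an earlier screenshot), one option is to seed the generator first. The sketch below assumes a small hypothetical `population` list and an arbitrary seed value of 42.

```python
import random

# Hypothetical subset of the population, used only for illustration.
population = [112.99, 136.49, 153.03, 142.34, 144.3, 123.3]

random.seed(42)  # arbitrary seed; any fixed integer works
first_run = random.choices(population, k=5)

random.seed(42)  # resetting the same seed replays the same draws
second_run = random.choices(population, k=5)

print(first_run == second_run)  # prints True
```

Leaving the seed out restores the usual behavior of a fresh sample on every run.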

# 3. Program called p1-3.py 

### Part A. The Central Limit Theorem works in this population. (10 points)
* Please do sampling with replacement 100 times with the same sampling size 30 from the population. 
* For each sample, calculate its mean, then convert the mean to int type and save it to a list called sampleMeanIntList.
* Print this list.

(Output example: The mean values of 100 samplings: [129, 127, 129, 132, 129, 131, 130, 130, 129, 125, 128, 128, 131, 129, 131, 129, 130, 127, 131, 127, 128, 134, 129, 128, 125, 133, 127, 123, 127, 126, 125, 128, 129, 132, 131, 127, 134, 130, 128, 129, 129, 131, 129, 131, 124, 125, 129, 125, 127, 127, 127, 126, 132, 128, 128, 131, 129, 131, 131, 125, 127, 126, 133, 127, 126, 127, 126, 125, 128, 129, 127, 128, 126, 126, 128, 129, 133, 130, 128, 130, 127, 129, 126, 131, 128, 130, 130, 130, 129, 133, 130, 127, 129, 128, 131, 126, 130, 130, 129, 128] )

In [5]:
from random import choices
import time
print("Starting program p1-3.py ")

numList=[112.99,136.49,153.03,142.34,144.3,123.3,141.49,136.46,112.37,120.67,127.45,114.14,125.61,
             122.46,116.09,140,129.5,142.97,137.9,124.04,141.28,143.54,97.9,129.5,141.85,129.72,142.42,
             131.55,108.33,113.89,103.3,120.75,125.79,136.22,140.1,128.75,141.8,121.23,131.35,106.71,
             124.36,124.86,139.67,137.37,106.45,128.76,145.68,116.82,143.62,134.93,147.02,126.33,125.48,
             115.71,123.49,147.89,155.9,128.07,119.37,133.81,128.73,137.55,129.76,128.82,135.32,109.61,142.47,
             132.75,103.53,124.73,129.31,134.02,140.4,102.84,128.52, 120.3,138.6,132.96,115.62,122.52,134.63,121.9,
             155.38,128.94,129.1,139.47,140.89,131.59,121.12,131.51,136.55,141.49,140.61,112.14,133.46,131.8,120.03,123.1,128.14,115.48]

start=time.time()
sampleMeanIntList=[]
size=30

while len(sampleMeanIntList) < 100: #stop after exactly 100 samples
    numbers=choices(numList, k=size)
    total=0
    for num in numbers:
        total=total+num
    mean=int(total/size) #size is the number of values in each sample; int() truncates the mean
    sampleMeanIntList.append(mean)

print("The mean values of 100 samplings:", sampleMeanIntList)
print("Program Finished in: ", round(time.time()-start, 2), " seconds")
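
As a rough check on what the theorem predicts, the spread of the sample means can be compared with σ/√n. The sketch below is illustrative only: it uses a synthetic, deliberately non-normal population built with `random.uniform` rather than the lab data, an arbitrary seed, and two hypothetical helpers, `mean_of` and `std_of`, written with for loops in keeping with the lab rules.

```python
import random


def mean_of(values):
    """Mean of a non-empty list, computed with a for loop."""
    total = 0
    for v in values:
        total = total + v
    return total / len(values)


def std_of(values):
    """Population standard deviation, computed with a for loop."""
    m = mean_of(values)
    squared = 0
    for v in values:
        squared = squared + (v - m) ** 2
    return (squared / len(values)) ** 0.5


random.seed(1)  # arbitrary seed so the run is repeatable

# Synthetic, deliberately non-normal population (uniform between 100 and 160).
population = []
for _ in range(1000):
    population.append(random.uniform(100, 160))

size = 30
sample_means = []
for _ in range(2000):
    sample_means.append(mean_of(random.choices(population, k=size)))

predicted = std_of(population) / size ** 0.5  # sigma / sqrt(n)
observed = std_of(sample_means)
print("Predicted spread of sample means:", round(predicted, 3))
print("Observed spread of sample means: ", round(observed, 3))
```

The two printed values should be close to each other even though the population itself is flat rather than bell-shaped.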

### Part B. Normal Distribution: (5 points) 

Use the following code:

Counter(sampleMeanIntList)

(Output example:
Counter({123: 1,
         124: 1,
         125: 7,
         126: 9,
         127: 15,
         128: 15,
         129: 19,
         130: 12,
         131: 12,
         132: 3,
         133: 4,
         134: 2})
)

In [6]:
from collections import Counter
import matplotlib.pyplot as plt

counts=Counter(sampleMeanIntList) #count how many times each int appears in the list

myList = sorted(counts.items()) #sort the (mean, count) pairs by mean
x, y = zip(*myList) #split into x values (means) and y values (counts)

plt.plot(x, y)
plt.xlabel("Sample mean (truncated to int)")
plt.ylabel("Number of samples")
plt.show()

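Yes. The graph above is roughly bell-shaped: the counts are highest near the population mean and fall off on both sides, which is consistent with the sample means being approximately normally distributed.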
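
If matplotlib is not available, the same shape can be seen in a plain-text histogram. The sketch below is one way to do it; `sample_means` is a short hypothetical list standing in for `sampleMeanIntList`.

```python
from collections import Counter

# Hypothetical truncated sample means standing in for sampleMeanIntList.
sample_means = [127, 128, 129, 129, 130, 128, 129, 131, 129, 130, 128, 126]

counts = Counter(sample_means)

# One row per distinct mean, in ascending order, with one '*' per occurrence.
rows = []
for value in sorted(counts):
    rows.append(f"{value}: {'*' * counts[value]}")

print("\n".join(rows))
```

Read sideways, the rows of asterisks form the same hump as the line plot, with the longest row at the most frequent mean.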