# Running simulations with Python

When we simulate events, business decisions, etc., we will always follow up with a statistical analysis of the results.

## Warmup Exercise

- set the seed to 765 with numpy
- create a 5x3 numpy array of random numbers between 0 and 1. Hint: `np.random.rand()`
- multiply the array (elementwise) by 4
- make the array 1-dimensional. Hint: `flatten()` or `reshape()`
- what is the max value?
- what is the index of that largest value? Hint: `argmax()`
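One possible way through the warmup steps, as a sketch rather than the only answer. The variable names (`arr`, `flat`, `largest`, `position`) are illustrative and not part of the exercise:

```python
import numpy as np

np.random.seed(765)          # fix the seed so the draws are reproducible
arr = np.random.rand(5, 3)   # 5x3 array of uniform draws in [0, 1)
arr = arr * 4                # elementwise multiplication by 4
flat = arr.flatten()         # 1-dimensional copy with 15 elements

largest = flat.max()         # the max value
position = flat.argmax()     # index of the max value in the flat array
print(largest, position)
```

`arr.reshape(-1)` would also give a 1-dimensional array; `flatten()` always returns a copy, which keeps the original 5x3 array untouched.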




In [5]:
import random 
import numpy as np

# Basic Stats in Python

![](https://www.statology.org/wp-content/uploads/2018/10/normal_dist.png)

In [4]:
# Very useful package for many math/science/engineering tasks
# import scipy to start
import scipy

In [3]:
# if you are coding locally and don't have scipy yet:
! pip install scipy

Collecting scipy
  Downloading scipy-1.7.0-cp39-cp39-macosx_10_9_x86_64.whl (32.1 MB)
Installing collected packages: scipy
Successfully installed scipy-1.7.0


In [None]:
# the module we will use today:
from scipy import stats

#https://docs.scipy.org/doc/scipy/reference/stats.html

## One Sample T-test



In [None]:
# create a 1d array from a normal dist 0/1
# size = 15
x = np.random.normal(size=15)

In [None]:
#np.random.normal()

In [None]:
x[:5]

In [None]:
x.mean()

In [None]:
# ttest from scipy
stats.ttest_1samp(x, 0)
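To see what `stats.ttest_1samp` is computing, the statistic can be rebuilt by hand: the sample mean minus the hypothesized mean, divided by the standard error. The sketch below assumes a two-sided test (the scipy default); the seed and the names `sample`, `t_manual`, and `p_manual` are illustrative:

```python
import numpy as np
from scipy import stats

np.random.seed(0)                      # illustrative seed, not from the notebook
sample = np.random.normal(size=15)

n = sample.size
std_err = sample.std(ddof=1) / np.sqrt(n)          # sample std (ddof=1) over sqrt(n)
t_manual = (sample.mean() - 0) / std_err           # hypothesized mean is 0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1) # two-sided p-value from the t distribution

result = stats.ttest_1samp(sample, 0)
print(t_manual, result.statistic)      # the two statistics should agree
print(p_manual, result.pvalue)         # and so should the p-values
```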


In [None]:
# try this again - larger sample
# size = 1000
x = np.random.normal(size=1000)

In [None]:
stats.ttest_1samp(x, 0)

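> Makes sense, right? Both samples were drawn from a distribution whose true mean is 0, so the null hypothesis is true and we expect a large p-value most of the time. Note that when the null is true the p-value does not grow systematically with sample size; what a larger sample gives us is a sample mean that sits closer to 0, because the standard error shrinks like 1/sqrt(n).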
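A small simulation is one way to check how the p-value behaves as the sample size changes. When the null hypothesis is true, p-values are uniformly distributed, so about 5% of tests should fall below 0.05 at any sample size. This sketch uses `np.random.default_rng` instead of the global seed used elsewhere in the notebook, and `false_positive_rate` is a hypothetical helper written for this illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # illustrative seed

def false_positive_rate(n, trials=2000, alpha=0.05):
    """Share of t-tests with p < alpha when the true mean really is 0."""
    samples = rng.normal(size=(trials, n))              # one sample per row
    pvalues = stats.ttest_1samp(samples, 0, axis=1).pvalue
    return np.mean(pvalues < alpha)

rate_small = false_positive_rate(15)
rate_large = false_positive_rate(1000)
print(rate_small, rate_large)    # both should land near 0.05
```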

In [None]:
# another sample, this time shifted: mean=2, std=1, size=50
z = np.random.normal(loc=2, scale=1, size=50)
z.mean()

In [None]:
# save out the result to a variable
# result
result = stats.ttest_1samp(z, 0)

In [None]:
result

In [None]:
# type
type(result)

In [None]:
# convert the result to a list
list(result)
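Besides converting to a list, the result object can be read by attribute name or unpacked like a tuple. A minimal sketch, with an illustrative seed and the hypothetical names `shifted` and `res`:

```python
import numpy as np
from scipy import stats

np.random.seed(1)                      # illustrative seed
shifted = np.random.normal(loc=2, scale=1, size=50)
res = stats.ttest_1samp(shifted, 0)

t_stat, p_value = res                  # tuple-style unpacking
print(res.statistic, res.pvalue)       # the same two values, by name

if p_value < 0.05:
    print("reject the null hypothesis that the mean is 0")
```

Attribute access tends to read more clearly in longer analyses, since the names say which number is which.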

## Quick Exercise:

In [None]:
# create an array with mean 85 and standard deviation of 3
# test against a population mean of 91
# draw 50 samples

In [None]:
#grades = np.random.normal(loc=?, scale=?, size=?)

## Two Sample t-test

In [None]:
# let's create two random normal samples: mean/std of 100/15 and 115/15, size=100 each
x = np.random.normal(100, 15, 100)
y = np.random.normal(115, 15, 100)

In [None]:
x.mean()

In [None]:
y.mean()

In [None]:
stats.ttest_ind(x, y)
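By default `stats.ttest_ind` assumes both groups share the same variance and uses a pooled estimate. The sketch below rebuilds that statistic by hand and also shows the `equal_var=False` option (Welch's test) for when the equal-variance assumption is doubtful. The seed and the names `a`, `b`, `pooled_var`, and `t_manual` are illustrative:

```python
import numpy as np
from scipy import stats

np.random.seed(2)                      # illustrative seed
a = np.random.normal(100, 15, 100)
b = np.random.normal(115, 15, 100)

n_a, n_b = a.size, b.size
# pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

student = stats.ttest_ind(a, b)                  # pooled-variance (default)
welch = stats.ttest_ind(a, b, equal_var=False)   # does not assume equal variances
print(t_manual, student.statistic)
print(welch.statistic, welch.pvalue)
```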

## Chi-square

In [None]:
# test for independence
# 4 sets of dice rolls, summarized: rows are the faces 1-6, columns are the 4 sets

a1 = [6, 4, 5, 10]
a2 = [8, 5, 3, 3]
a3 = [5, 4, 8, 4]
a4 = [4, 11, 7, 13]
a5 = [5, 8, 7, 6]
a6 = [7, 3, 5, 9]
dice = np.array([a1, a2, a3, a4, a5, a6])


In [None]:
dice

In [None]:
dice.sum(axis=1)

In [None]:
dice.sum(axis=0)

In [None]:
stats.chi2_contingency(dice)

In [None]:
# another way to unpack the results
stat, p, dof, exp = stats.chi2_contingency(dice)
p
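The expected counts that `chi2_contingency` returns come from the row and column totals: each expected cell is (row total x column total) / grand total. A sketch that rebuilds them, and the test statistic, by hand. The names `expected_manual`, `chi2_manual`, and `dof_manual` are illustrative:

```python
import numpy as np
from scipy import stats

dice = np.array([
    [6, 4, 5, 10],
    [8, 5, 3, 3],
    [5, 4, 8, 4],
    [4, 11, 7, 13],
    [5, 8, 7, 6],
    [7, 3, 5, 9],
])

row_totals = dice.sum(axis=1, keepdims=True)   # shape (6, 1)
col_totals = dice.sum(axis=0, keepdims=True)   # shape (1, 4)
expected_manual = row_totals * col_totals / dice.sum()

# chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((dice - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (dice.shape[0] - 1) * (dice.shape[1] - 1)

stat, p, dof, exp = stats.chi2_contingency(dice)
print(chi2_manual, stat)
print(dof_manual, dof)
```

The hand-computed statistic matches here because scipy only applies its continuity correction when there is a single degree of freedom.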

## Quick Exercise:

The operations manager of a company that manufactures tires wants to determine whether there are any differences in the quality of work among the three daily shifts. She randomly selects 496 tires and carefully inspects them. Each tire is either classified as perfect, satisfactory, or defective, and the shift that produced it is also recorded. The two categorical variables of interest are the shift and condition of the tire produced. The data can be summarized by the accompanying two-way table. Does the data provide sufficient evidence at the 5% significance level to infer that there are differences in quality among the three shifts?



| Shift           | Perfect | Satisfactory | Defective |
|-----------------|---------|--------------|-----------|
| Morning Shift   | 106     | 124          | 1         |
| Afternoon Shift | 67      | 85           | 1         |
| Night Shift     | 37      | 72           | 3         |

Source: https://online.stat.psu.edu/stat500/lesson/8/8.1
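A possible starting point for the exercise, as a sketch: load the table into an array and look at the expected counts before running the test. The chi-square approximation is usually considered reliable when expected counts are at least about 5, so it is worth checking the Defective column. The name `tires` is illustrative:

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# rows: morning, afternoon, night; columns: perfect, satisfactory, defective
tires = np.array([
    [106, 124, 1],
    [67, 85, 1],
    [37, 72, 3],
])

print(tires.sum())            # should match the 496 tires in the prompt
expected = expected_freq(tires)
print(expected.round(2))      # inspect the Defective column in particular
```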