# Running simulations with Python

When we simulate events, business decisions, and so on, we typically follow up with a statistical analysis of the results.

## Warmup Exercise

- Set the seed to 765 with NumPy.
- Create a 5x3 NumPy array of random numbers between 0 and 1. Hint: `np.random.rand()`
- Multiply the array (elementwise) by 4.
- Make the array 1-dimensional. Hint: `flatten()` or `reshape()`
- What is the max value?
- What is the index of that largest value? Hint: `argmax()`




In [2]:
import random
import numpy as np


In [None]:
np.random.seed(765)
x = np.random.rand(5,3)
new_x = x * 4
flat_x = new_x.flatten()
print(flat_x.max())
print(flat_x.argmax())


3.774252844888631
3


# Basic Stats in Python

![](https://www.statology.org/wp-content/uploads/2018/10/normal_dist.png)

In [None]:
# Very useful package for many math/science/engineering tasks
# import scipy to start
import scipy

In [None]:
# if you are coding locally and don't have scipy yet:
! pip install scipy

Collecting scipy
  Downloading scipy-1.7.0-cp39-cp39-macosx_10_9_x86_64.whl (32.1 MB)
Installing collected packages: scipy
Successfully installed scipy-1.7.0


In [None]:
# the module we will use today:
from scipy import stats

#https://docs.scipy.org/doc/scipy/reference/stats.html

## One Sample T-test



In [5]:
# create a 1-D array from a standard normal distribution (mean 0, std 1)
# size = 15
x = np.random.normal(size=15)
x

array([-0.99585943, -1.53814265, -0.58757093,  0.25025828,  0.74253265,
       -0.609139  ,  0.83583234, -0.40800033, -0.95577823,  0.23934566,
        1.21813532, -0.52902373, -0.89769086, -1.64135507, -0.86241618])

In [None]:
# np.random.normal() defaults: centered at 0, standard deviation 1

In [None]:
x[:5]

array([ 0.70269132,  0.53887887,  1.03456033,  0.18994755, -2.26443528])

In [None]:
x.mean()

0.13773217973698818

In [None]:
# ttest from scipy
stats.ttest_1samp(x, 0)


Ttest_1sampResult(statistic=-0.3273365660072085, pvalue=0.7434819174007327)
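
To see what `stats.ttest_1samp` computes, the statistic can be rebuilt by hand: the sample mean minus the hypothesized mean, divided by the standard error. The sketch below is illustrative only. It draws its own sample with `np.random.default_rng` and an arbitrary seed (both assumptions, not part of the cells above), so its numbers will not match the outputs shown in this notebook.

```python
import numpy as np
from scipy import stats

# Assumption: a fresh generator with an arbitrary seed, used only so the
# sketch is reproducible. It does not reproduce the notebook's outputs.
rng = np.random.default_rng(0)
sample = rng.normal(size=15)
popmean = 0

n = sample.size
# standard error uses the sample standard deviation (ddof=1)
std_err = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - popmean) / std_err

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(sample, popmean)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The hand-computed pair should agree with the SciPy result up to floating-point error.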

In [None]:
# try this again - larger sample
# size = 1000
x = np.random.normal(size=1000)

In [None]:
stats.ttest_1samp(x, 0)

Ttest_1sampResult(statistic=1.6603135453792537, pvalue=0.09716520651869501)

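> Both p-values are above 0.05, so neither sample gives evidence that the mean differs from 0. Note that the p-value here actually went down (0.74 to 0.10) with the larger sample: when the null hypothesis is true, the p-value does not systematically grow with sample size. A larger sample gives a tighter estimate of the mean and more power to detect a real difference.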
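
One way to check how sample size affects the test is to repeat it many times and count how often the p-value falls below 0.05. This is a rough sketch, not part of the original cells: the helper `rejection_rate`, the seed, the number of trials, and the shifted mean of 0.2 are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # arbitrary seed for reproducibility
n_trials = 2000                   # number of simulated samples per setting

def rejection_rate(sample_size, true_mean=0.0):
    """Fraction of simulated samples whose one-sample t-test against 0
    gives p < 0.05 (hypothetical helper for this sketch)."""
    samples = rng.normal(loc=true_mean, scale=1.0,
                         size=(n_trials, sample_size))
    # one t-test per row
    pvalues = stats.ttest_1samp(samples, 0, axis=1).pvalue
    return (pvalues < 0.05).mean()

for n in (15, 1000):
    null_rate = rejection_rate(n)                   # true mean really is 0
    shifted_rate = rejection_rate(n, true_mean=0.2) # true mean is 0.2
    print(n, null_rate, shifted_rate)
```

When the true mean is 0, the rejection rate should stay near 5% at both sample sizes. When the true mean is shifted, the larger sample should reject far more often.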

In [None]:
# another sample, this time shifted: mean=2, std=1, size=50
z = np.random.normal(loc=2, scale=1, size=50)
z.mean()

2.371883949764046

In [None]:
# save out the result to a variable
# result
result = stats.ttest_1samp(z, 0)

In [None]:
result

Ttest_1sampResult(statistic=12.810114630246511, pvalue=2.940448100802464e-17)

In [None]:
# type
type(result)

scipy.stats.stats.Ttest_1sampResult

In [None]:
# parse to list
list(result)

[12.810114630246511, 2.940448100802464e-17]
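
Besides converting to a list, the result can be unpacked like a tuple or read through its `statistic` and `pvalue` attributes. A minimal sketch, assuming a freshly generated sample (arbitrary seed) and a significance level `alpha` of 0.05 chosen for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)            # arbitrary seed, illustration only
z = rng.normal(loc=2, scale=1, size=50)
result = stats.ttest_1samp(z, 0)

# tuple-style unpacking
t_stat, p_value = result

# attribute access returns the same two numbers
print(result.statistic, result.pvalue)

alpha = 0.05                              # assumed significance level
if p_value < alpha:
    print("reject the null hypothesis that the mean is 0")
else:
    print("fail to reject the null hypothesis")
```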

## Quick Exercise:

In [None]:
# create an array with mean 85 and standard deviation of 3
# test against a population mean of 91
# draw 50 samples
y = np.random.normal(loc=85, scale=3, size=50)
result = stats.ttest_1samp(y, 91)
# arguments: the sample first, then the population mean to compare against
result

Ttest_1sampResult(statistic=-12.770672325406384, pvalue=3.3052553589513205e-17)

In [None]:
#grades = np.random.normal(loc=?, scale=?, size=?)

## Two Sample t-test

In [None]:
# let's create two random normal samples: mean/std of 100/15 and 115/15, size=100 each
x = np.random.normal(100, 15, 100)
y = np.random.normal(115, 15, 100)

In [None]:
x.mean()

99.47328889257446

In [None]:
y.mean()

118.45599562518208

In [None]:
stats.ttest_ind(x, y)

Ttest_indResult(statistic=-9.415081238418447, pvalue=1.247759799182783e-17)
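
By default `stats.ttest_ind` pools the two variances. If the groups might have different spreads, passing `equal_var=False` runs Welch's t-test instead. The sketch below regenerates its own samples with an arbitrary seed, so the numbers will differ from the output above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)      # arbitrary seed, illustration only
x = rng.normal(100, 15, 100)
y = rng.normal(115, 15, 100)

pooled = stats.ttest_ind(x, y)                  # assumes equal variances
welch = stats.ttest_ind(x, y, equal_var=False)  # Welch's t-test

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

With equal group sizes the two t statistics coincide, and only the degrees of freedom (and so the p-value) differ slightly.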

## Chi-square

In [None]:
## test for independence
# counts from 4 sets of dice rolls: one row per face (1-6), one column per set

a1 = [6, 4, 5, 10]
a2 = [8, 5, 3, 3]
a3 = [5, 4, 8, 4]
a4 = [4, 11, 7, 13]
a5 = [5, 8, 7, 6]
a6 = [7, 3, 5, 9]
dice = np.array([a1, a2, a3, a4, a5, a6])


In [None]:
dice

array([[ 6,  4,  5, 10],
       [ 8,  5,  3,  3],
       [ 5,  4,  8,  4],
       [ 4, 11,  7, 13],
       [ 5,  8,  7,  6],
       [ 7,  3,  5,  9]])

In [None]:
dice.sum(axis=1)

array([25, 19, 21, 35, 26, 24])

In [None]:
dice.sum(axis=0)

array([35, 35, 35, 45])

In [None]:
stats.chi2_contingency(dice)

(16.490612061288754,
 0.35021521809742745,
 15,
 array([[ 5.83333333,  5.83333333,  5.83333333,  7.5       ],
        [ 4.43333333,  4.43333333,  4.43333333,  5.7       ],
        [ 4.9       ,  4.9       ,  4.9       ,  6.3       ],
        [ 8.16666667,  8.16666667,  8.16666667, 10.5       ],
        [ 6.06666667,  6.06666667,  6.06666667,  7.8       ],
        [ 5.6       ,  5.6       ,  5.6       ,  7.2       ]]))

In [None]:
# another way to unpack the results
stat, p, dof, exp = stats.chi2_contingency(dice)
p

0.35021521809742745
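
The expected counts returned by `chi2_contingency` come from the row and column totals: each cell is (row total × column total) / grand total. The following sketch rebuilds the expected table, the statistic, and the p-value by hand for the same `dice` array. Names such as `chi2_manual` are illustrative.

```python
import numpy as np
from scipy import stats

dice = np.array([
    [6, 4, 5, 10],
    [8, 5, 3, 3],
    [5, 4, 8, 4],
    [4, 11, 7, 13],
    [5, 8, 7, 6],
    [7, 3, 5, 9],
])

row_totals = dice.sum(axis=1)
col_totals = dice.sum(axis=0)
total = dice.sum()

# expected count under independence for every cell
expected_manual = np.outer(row_totals, col_totals) / total

# Pearson chi-square statistic and its degrees of freedom
chi2_manual = ((dice - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (dice.shape[0] - 1) * (dice.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

stat, p, dof, exp = stats.chi2_contingency(dice)
print(chi2_manual, stat)
print(p_manual, p)
```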

## Quick Exercise:

The operations manager of a company that manufactures tires wants to determine whether there are any differences in the quality of work among the three daily shifts. She randomly selects 496 tires and carefully inspects them. Each tire is either classified as perfect, satisfactory, or defective, and the shift that produced it is also recorded. The two categorical variables of interest are the shift and condition of the tire produced. The data can be summarized by the accompanying two-way table. Does the data provide sufficient evidence at the 5% significance level to infer that there are differences in quality among the three shifts?



| Shift           | Perfect | Satisfactory | Defective |
|-----------------|---------|--------------|-----------|
| Morning Shift   | 106     | 124          | 1         |
| Afternoon Shift | 67      | 85           | 1         |
| Night Shift     | 37      | 72           | 3         |

Source: https://online.stat.psu.edu/stat500/lesson/8/8.1
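
One possible way to set up this exercise, following the same pattern as the dice example: enter the table as a 2-D array and pass it to `stats.chi2_contingency`. This is a sketch rather than a reference solution. The variable names and the check for small expected counts are additions for illustration.

```python
import numpy as np
from scipy import stats

# rows: morning, afternoon, night; columns: perfect, satisfactory, defective
tires = np.array([
    [106, 124, 1],
    [67, 85, 1],
    [37, 72, 3],
])

stat, p, dof, expected = stats.chi2_contingency(tires)

alpha = 0.05  # significance level given in the exercise
print(f"chi2={stat:.3f}, dof={dof}, p={p:.3f}")
print("reject H0" if p < alpha else "fail to reject H0")

# count the cells whose expected frequency is below 5
small_cells = int((expected < 5).sum())
print(expected.round(2))
print("cells with expected count < 5:", small_cells)
```

The expected counts in the Defective column are all below 5, so the chi-square approximation should be read with some caution here.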