## Python walkthrough of in-class data analysis

## Prerequisites
1. Install Python
2. Install NumPy, pandas, openpyxl

If you are unsure how to do these things, you can download Python here (https://www.python.org/downloads/), and I would recommend using Conda as a package manager to help with step 2 (https://docs.conda.io/en/latest/miniconda.html).

If you are using conda, you should be able to run something like `conda install openpyxl` to install packages.
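Before going further, it can help to confirm that the three packages are visible to the Python you are running. The following is a minimal sketch using only the standard library; the variable names `required` and `missing` are chosen here for illustration.

```python
import importlib.util

# Packages this walkthrough relies on.
required = ["numpy", "pandas", "openpyxl"]

# find_spec returns None when a package cannot be found by this interpreter.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported missing, install it with conda (or your package manager of choice) and run the check again.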



In [1]:
import numpy as np 
import pandas as pd

## Opening the document

In [2]:
data = pd.read_excel("Week 2 - SPC - Excel - InClass.xlsx", header=0)
data[:10]

Unnamed: 0,Sample,Failure Time
0,1,127
1,2,125
2,3,131
3,4,124
4,5,129
5,6,121
6,7,142
7,8,151
8,9,160
9,10,125


```data[:10]``` gets the first 10 rows. Note that in Python counting starts at 0, not 1.

```header=0``` says that the text in row 0 (the first row, in everyday terms) should be used as the column names.
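To see what the ```header``` option changes, here is a minimal sketch that reads an in-memory string with ```pd.read_csv``` instead of the Excel file, so it runs without the workbook. That substitution is an assumption made for illustration: ```read_csv``` takes a ```header``` argument with the same meaning, and the three data rows are copied from the table above.

```python
import io
import pandas as pd

# A tiny stand-in for the spreadsheet: one header row plus three data rows.
text = "Sample,Failure Time\n1,127\n2,125\n3,131\n"

# header=0: row 0 supplies the column names, leaving three data rows.
with_header = pd.read_csv(io.StringIO(text), header=0)

# header=None: no row is treated as a header, so pandas numbers the
# columns 0, 1 and keeps the original header text as an ordinary row.
no_header = pd.read_csv(io.StringIO(text), header=None)

print(list(with_header.columns), len(with_header))
print(list(no_header.columns), len(no_header))
```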

If the workbook has multiple sheets, you can select the one you want with the ```sheet_name``` option, either by name or by number in order of appearance (starting at 0).

In [3]:
## Sheet name by number, get the first sheet
data = pd.read_excel("Week 2 - SPC - Excel - InClass.xlsx", sheet_name=0, header=0)
data[:10]

Unnamed: 0,Sample,Failure Time
0,1,127
1,2,125
2,3,131
3,4,124
4,5,129
5,6,121
6,7,142
7,8,151
8,9,160
9,10,125


In [4]:
## Sheet name by name, get the sheet called "InputData"
data = pd.read_excel("Week 2 - SPC - Excel - InClass.xlsx", sheet_name="InputData", header=0)
data[:10]

Unnamed: 0,Sample,Failure Time
0,1,127
1,2,125
2,3,131
3,4,124
4,5,129
5,6,121
6,7,142
7,8,151
8,9,160
9,10,125


For this example, all three calls above return the same data.
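If you are not sure which sheets a workbook contains, pandas can list them for you. The sketch below builds a small two-sheet workbook in memory so that it runs on its own. The second sheet, "Notes", is hypothetical and is not part of the class file. It assumes openpyxl is installed, as in the prerequisites.

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (a stand-in for the class file).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"Sample": [1, 2, 3], "Failure Time": [127, 125, 131]}).to_excel(
        writer, sheet_name="InputData", index=False)
    # "Notes" is a made-up sheet used only to have more than one sheet.
    pd.DataFrame({"Note": ["hypothetical second sheet"]}).to_excel(
        writer, sheet_name="Notes", index=False)

# List the sheet names in order of appearance.
buffer.seek(0)
sheet_names = pd.ExcelFile(buffer).sheet_names
print(sheet_names)

# sheet_name=None loads every sheet into a dict keyed by sheet name.
buffer.seek(0)
all_sheets = pd.read_excel(buffer, sheet_name=None, header=0)
print(all_sheets["InputData"])
```

With a file on disk, you would pass the file name in place of ```buffer``` and skip the ```seek``` calls.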

## Manipulating data 

There are a lot of ways to manipulate data in Python, and you might have to look them up as they arise. Here are a few:

1. Access a column
To access a column, use syntax like ```data["Column Name"]```. We can then pick out the values we want by index using a ```start:stop:step``` slice. For example, to get every second value among the first 20 failure times (indices 0, 2, ..., 18), we do

In [6]:
data["Failure Time"][0:20:2]

0     127
2     131
4     129
6     142
8     160
10    124
12    120
14    128
16    137
18    142
Name: Failure Time, dtype: int64
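As a smaller illustration of ```start:stop:step``` slicing, here is a sketch that uses only the first ten failure times from the table above. It uses ```.iloc```, which always slices by position; this is a stylistic choice made here, not something the original cell requires.

```python
import pandas as pd

# First ten failure times from the table above.
failure_time = pd.Series([127, 125, 131, 124, 129, 121, 142, 151, 160, 125],
                         name="Failure Time")

first_five = failure_time.iloc[:5]        # positions 0, 1, 2, 3, 4
every_second = failure_time.iloc[0:10:2]  # positions 0, 2, 4, 6, 8
last_three = failure_time.iloc[-3:]       # negative positions count from the end

print(every_second.tolist())
```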

2. Sorting 
We can sort values in increasing or decreasing order. For example, to sort in increasing order (the default):

In [7]:
data["Failure Time"].sort_values()

30    118
13    119
12    120
5     121
20    121
29    122
19    123
11    123
10    124
25    124
3     124
17    124
32    125
35    125
1     125
24    125
9     125
39    126
0     127
26    128
14    128
38    129
4     129
27    129
28    130
31    131
2     131
37    131
15    133
33    133
21    136
16    137
23    137
22    140
36    140
34    141
6     142
18    142
7     151
8     160
Name: Failure Time, dtype: int64

The left-hand column shows each value's original row index, and the right-hand column shows the value.
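Sorting in decreasing order works the same way with ```ascending=False```. The sketch below uses the first ten failure times from the table above and also shows how to renumber the result; the variable names are chosen here for illustration.

```python
import pandas as pd

# First ten failure times from the table above.
failure_time = pd.Series([127, 125, 131, 124, 129, 121, 142, 151, 160, 125],
                         name="Failure Time")

# ascending=False puts the largest values first.
longest_first = failure_time.sort_values(ascending=False)

# The original row labels travel with the values; reset_index(drop=True)
# replaces them with 0, 1, 2, ... if you want a fresh numbering.
renumbered = longest_first.reset_index(drop=True)

print(longest_first.head(3))
print(renumbered.head(3))
```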


3. Selecting data
We can select data in arbitrary ways based on its values. One way is to use ```np.where```. For example, if we wanted the samples whose failure time was above 130, we can do

In [8]:
## np.where returns a tuple of index arrays; [0] takes the array for the only dimension
long_fail_times = np.where(data["Failure Time"] > 130)[0]
long_fail_times

array([ 2,  6,  7,  8, 15, 16, 18, 21, 22, 23, 31, 33, 34, 36, 37])

We can then use this variable to get at the rows of the table that match these indices

In [9]:
data["Failure Time"][long_fail_times]

2     131
6     142
7     151
8     160
15    133
16    137
18    142
21    136
22    140
23    137
31    131
33    133
34    141
36    140
37    131
Name: Failure Time, dtype: int64
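An alternative to ```np.where``` is to use the True/False comparison directly as a mask. This is a minimal sketch on the first ten rows of the table above, not a replacement for the approach shown; the names ```is_long```, ```long_rows``` and ```mid_rows``` are chosen here for illustration.

```python
import pandas as pd

# First ten rows of the table above.
data = pd.DataFrame({
    "Sample": range(1, 11),
    "Failure Time": [127, 125, 131, 124, 129, 121, 142, 151, 160, 125],
})

# Comparing a column to a number gives one True/False per row.
is_long = data["Failure Time"] > 130

# Passing the mask to .loc keeps only the rows marked True.
long_rows = data.loc[is_long]

# Masks combine with & (and) and | (or); each condition needs parentheses.
mid_rows = data.loc[(data["Failure Time"] > 125) & (data["Failure Time"] < 135)]

print(long_rows)
```

Because the mask is aligned with the rows themselves, this also works when the row labels are not simply 0, 1, 2, and so on.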

## Getting statistics about the data
Most of the common operations you'll need are in the numpy package. 

In [10]:
max_value = np.max(data["Failure Time"])
min_value = np.min(data["Failure Time"])
mean = np.mean(data["Failure Time"])
std = np.std(data["Failure Time"])

print("Mean: ", mean)
print("Standard dev: ", std)
print("Max Value: ", max_value)
print("Min Value", min_value)
print("Range: ", max_value - min_value)

Mean:  129.975
Standard dev:  8.80195290830393
Max Value:  160
Min Value 118
Range:  42
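One detail worth knowing: ```np.std``` computes the population standard deviation by default (it divides by n), which is the 8.80 printed above. For the sample standard deviation (dividing by n - 1), pass ```ddof=1```. The sketch below illustrates the difference; the 40 values are reconstructed from the sorted output earlier in this walkthrough, so treat them as an assumption rather than a copy of the spreadsheet.

```python
import numpy as np
import pandas as pd

# Failure times in original row order, reconstructed from the sorted output.
failure_time = pd.Series([
    127, 125, 131, 124, 129, 121, 142, 151, 160, 125,
    124, 123, 120, 119, 128, 133, 137, 124, 142, 123,
    121, 136, 140, 137, 125, 124, 128, 129, 130, 122,
    118, 131, 125, 133, 141, 125, 140, 131, 129, 126,
], name="Failure Time")

pop_std = np.std(failure_time)             # ddof=0: divides by n
sample_std = np.std(failure_time, ddof=1)  # ddof=1: divides by n - 1
pandas_std = failure_time.std()            # the pandas method defaults to ddof=1

print("Population std:", pop_std)
print("Sample std:", sample_std)

# describe() gives count, mean, std (sample), min, quartiles and max at once.
print(failure_time.describe())
```

Which version you want depends on whether the data is the whole population or a sample from it; for control-chart work, check which convention your course uses.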
