# Probability and Statistics for Engineering and the Sciences

I've taken courses in business statistics (bleh), probability (much better!), and stochastic processes (fascinating and difficult and fascinating). Now I'm reading this book to start stapling all of my knowledge together. Jupyter notebooks are ideal because they're laptop-on-my-lap-on-the-train compatible.

* A **trimmed mean** drops a fixed percentage of the lowest and highest values before averaging (outlier correction). The median can be thought of as a completely trimmed mean.
* The mean gives outliers full weight, the median takes zero stock of them. Sometimes you want a measure somewhere in between, which is when a trimmed mean is useful (more analytical than simply "discard these five values as being too outlier-y").
* **Box plot** bits:
 * **Lower fourth** and **upper fourth** (roughly the first and third quartiles) are the lower-half and upper-half medians, respectively; these are the bottom and top of the box. The distance between the two is the **fourth spread**, $f_4$, a statistic that is resistant to outliers.
 * Line down the middle is the median (not mode!).
 * Whiskers extend on either side to the most extreme value still within $1.5 f_4$ of the nearest fourth (the box edge, not the median).
 * **Outliers** lie between $1.5 f_4$ and $3 f_4$ from the nearest fourth; **extreme outliers** lie more than $3 f_4$ away.
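
To make the bullets above concrete, here is a minimal sketch of a trimmed mean and the box plot statistics in plain numpy. The function names (`trimmed_mean`, `fourths`, `box_plot_summary`) and the sample data are made up for illustration. Two assumptions are baked in: for odd $n$ the median is counted in both halves when computing the fourths, and outlier distance is measured from the nearest fourth. Plotting libraries may use slightly different quartile conventions, so don't expect an exact match with what Plotly draws.

```python
import numpy as np

def trimmed_mean(values, proportion):
    """Mean after dropping the lowest and highest `proportion` of the sorted data."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * len(x))  # number of points cut from each end
    if 2 * k >= len(x):
        raise ValueError("proportion too large: nothing left to average")
    return x[k:len(x) - k].mean()

def fourths(values):
    """Medians of the lower and upper halves of the sorted data.

    Assumption: for odd n the overall median is counted in both halves.
    """
    x = np.sort(np.asarray(values, dtype=float))
    half = (len(x) + 1) // 2
    return float(np.median(x[:half])), float(np.median(x[len(x) - half:]))

def box_plot_summary(values):
    """Fourth spread, whisker ends, and mild/extreme outliers."""
    x = np.sort(np.asarray(values, dtype=float))
    lower, upper = fourths(x)
    f4 = upper - lower
    # Distance from the nearest fourth; zero for points inside the box.
    dist = np.clip(np.maximum(lower - x, x - upper), 0, None)
    inside = x[dist <= 1.5 * f4]  # everything that is not an outlier
    return {
        "fourth_spread": f4,
        "whiskers": (float(inside.min()), float(inside.max())),
        "mild": x[(dist > 1.5 * f4) & (dist <= 3 * f4)].tolist(),
        "extreme": x[dist > 3 * f4].tolist(),
    }

sample = [2, 4, 5, 6, 7, 8, 9, 10, 20, 60]
print(np.mean(sample), np.median(sample), trimmed_mean(sample, 0.1))
print(fourths(sample))
print(box_plot_summary(sample))
```

On this sample the 10% trimmed mean lands between the mean and the median, which is the "somewhere in between" behaviour described above.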

**Box plot in Plotly in Jupyter**

In [3]:
import json
import plotly.tools as tls

# Set my plotly credentials.
# Note: in plotly >= 4 these helpers moved to the separate chart_studio package.
with open('plotly_credentials.json') as f:
    data = json.load(f)['credentials']
tls.set_credentials_file(username=data['username'], api_key=data['key'])

In [4]:
import string
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

# Populate a pandas DataFrame with randomized letter values.
# It'd be more sensible to use letter-frequency n-grams but /\o/\ this is an example direct from the source.
N = 100
y_vals = {}
for letter in list(string.ascii_uppercase):
    # np.random.randn(N) returns N standard normal samples; with no arguments it returns a single float.
    # Note, randn takes no sigma and mu parameters, so you do the shift-and-scale yourself
    # (or use np.random.normal(loc, scale, size)). Cute!
    y_vals[letter] = np.random.randn(N) + 3 * np.random.randn()

df = pd.DataFrame(y_vals)
# df.head returns the first five rows of the DataFrame.
df.head()

data = []

# I prefer to use items() (iteritems() before pandas 2.0), but to each his own.
for col in df.columns:
    data.append(go.Box(y=df[col], name=col, showlegend=False))

# Overlay the per-column means as a line.
data.append(go.Scatter(x=df.columns, y=df.mean(), mode='lines', name='mean'))

# This creates and displays the graph, and also uploads it to and saves it on their server (so no non-public data!).
py.iplot(data, filename='pandas-box-plot')

# cf. https://plot.ly/python/histograms-and-box-plots-tutorial/

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~ResidentMario/0 or inside your plot.ly account where it is named 'pandas-box-plot'
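
The "do the normal transform yourself" comment in the cell above is worth a quick check. This is a rough sketch, not part of the original notebook: the values of `mu`, `sigma`, `n`, and the seed are arbitrary, and it uses numpy's newer `default_rng` generator rather than the legacy `np.random.randn`. It draws from $N(\mu, \sigma^2)$ two ways, once by shifting and scaling standard normal samples and once by passing `loc` and `scale` directly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only
mu, sigma, n = 3.0, 2.0, 100_000     # arbitrary illustration values

# Location-scale transform: if Z ~ N(0, 1) then mu + sigma * Z ~ N(mu, sigma^2).
shifted = mu + sigma * rng.standard_normal(n)

# The same distribution, with numpy doing the transform.
direct = rng.normal(loc=mu, scale=sigma, size=n)

print(shifted.mean(), shifted.std())
print(direct.mean(), direct.std())
```

Both pairs of sample statistics should come out close to `mu` and `sigma`, up to sampling noise.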


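* **Pairwise independence** does not imply full (mutual) independence: every pair of events in a collection can be independent while a larger subset fails the product rule, e.g. $P(A \cap B \cap C) \neq P(A) P(B) P(C)$.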
* A **Bernoulli random variable** (or Bernoulli trial) takes on a value of 0 or 1, e.g. success or failure.
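* A **family of probability distributions** is a set of distributions that share one form and are indexed by a **parameter**, e.g. the Bernoulli distributions indexed by the success probability $p$.
* **Discrete CDF**: $F(x) = P(X \leq x) = \sum_{y: y \leq x} p(y)$.
* Since CDFs are right-continuous, technically $P(X \in [a, b]) = F(b) - F(a-)$, where $F(a-)$ is the left-hand limit of $F$ at $a$, i.e. $\lim_{x \to a^-} F(x)$. For an integer-valued $X$ this is $F(a - 1)$.
* **Discrete expectation**: $E(X) = \mu_X = \sum_{x \in \Omega} x \cdot p(x)$.
* **Drunk statistician's rule** (usually called the law of the unconscious statistician): $E[h(X)] = \sum_{x \in \Omega} h(x) \cdot p(x)$.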
* $Var(X) = \sum_{x \in \Omega}(x-\mu)^2 \cdot p(x) = E[(X-\mu)^2] = E(X^2) - [E(X)]^2$
* $s$ and $s^2$ (sample standard deviation and variance) are the sample counterparts of the population $\sigma$ and $\sigma^2$.
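
A couple of the bullets above are easier to believe after checking them by hand, so here is a small sketch using exact fractions. Everything in it is an illustrative choice rather than something from the book: the two-coin-flip events `A`, `B`, `C` are a standard counterexample for pairwise independence, and the pmf is the number of heads in three fair flips. The helper names (`prob`, `cdf`, `expect`) are made up.

```python
from fractions import Fraction
from itertools import product

# --- Pairwise vs. mutual independence: two fair coin flips ---
outcomes = list(product("HT", repeat=2))  # four equally likely outcomes

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == "H"   # first flip is heads
B = lambda o: o[1] == "H"   # second flip is heads
C = lambda o: o[0] != o[1]  # the two flips differ

pairwise = all(
    prob(lambda o: e(o) and f(o)) == prob(e) * prob(f)
    for e, f in [(A, B), (A, C), (B, C)]
)
triple = prob(lambda o: A(o) and B(o) and C(o))
triple_product = prob(A) * prob(B) * prob(C)
print(pairwise, triple, triple_product)

# --- Discrete CDF, expectation, and variance from a pmf ---
# pmf of X = number of heads in three fair flips.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def cdf(x):
    """F(x) = P(X <= x), summing p(y) over y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))

def expect(h=lambda x: x):
    """E[h(X)] = sum of h(x) * p(x); the default h gives E(X)."""
    return sum((h(x) * p for x, p in pmf.items()), Fraction(0))

# X is integer-valued, so F(a-) = F(a - 1) and P(1 <= X <= 2) = F(2) - F(0).
interval = cdf(2) - cdf(0)
mean = expect()
var_definition = expect(lambda x: (x - mean) ** 2)
var_shortcut = expect(lambda x: x ** 2) - mean ** 2
print(interval, mean, var_definition, var_shortcut)
```

The three events are independent in every pair, yet all three cannot happen at once, so the triple product rule fails. The two variance computations agree, as the shortcut formula says they should.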