# Modern Methods of Data Analysis

## Introduction
The goal of data analysis is always to extract information from data. Some pieces of information stand out immediately
(for example, a value that appears very often in the data set is noticeable without the aid of statistical tools),
but most of the information has to be extracted via analytical procedures. In this context, the information that can be
extracted from the data is a model of the process that generated the data.

This model has a specific form (an exponential decay, for example) that most often has to be inferred from the experience of the
person analyzing the data.

To be more precise, a model is a function of a set of input parameters $\vec{\theta}$ and a set
of observables $\vec{x}$ that produces the "true" value(s) $\vec{y} = f(\vec{x}, \vec{\theta})$, which, when combined with uncertainties,
yield the observed data.
These uncertainties arise either from (random) fluctuations of the
quantities of interest ($\vec{x}$), which sometimes cannot be suppressed by their very nature, or from noise and other side effects that are part of the measuring process.
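
As a minimal sketch of this idea, assume an exponential decay as the model and Gaussian noise as the uncertainty. The function names, the parameter values, and the noise level below are illustrative assumptions and do not come from a real measurement.

```python
import math
import random


def model(x, theta):
    """Hypothetical model f(x, theta): an exponential decay."""
    amplitude, decay_constant = theta
    return amplitude * math.exp(-decay_constant * x)


def measure(x, theta, noise_sigma, rng):
    """Simulate one measurement: the "true" value plus Gaussian noise (assumed noise model)."""
    return model(x, theta) + rng.gauss(0.0, noise_sigma)


rng = random.Random(42)            # fixed seed so the sketch is reproducible
theta = (10.0, 0.5)                # assumed "true" parameters (amplitude, decay constant)
xs = [0.5 * i for i in range(10)]  # assumed values of the observable x

true_values = [model(x, theta) for x in xs]
observed = [measure(x, theta, noise_sigma=0.2, rng=rng) for x in xs]

for x, y_true, y_obs in zip(xs, true_values, observed):
    print("x = {:.1f}  true = {:.3f}  observed = {:.3f}".format(x, y_true, y_obs))
```

In a real analysis only the `observed` column is available; the task is to recover `theta` (and the form of `model`) from it.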

Every measurement has some amount of uncertainty associated with it (remember that we are actually all just big interference phenomena
when viewed from a quantum mechanical perspective), as we are always measuring processes that are by their very nature statistical.
Furthermore, the instruments themselves are quantum mechanical objects that are subject to the same random behaviour that causes fluctuations.

There are models that by their very nature can approximate any underlying function (for example a Fourier series). These methods are, however,
mostly avoided, because the calculations involved are very time consuming and don't present the result in a way that is understandable for
humans (Do you know the coefficients of the Fourier series for an exponential decay? Well, I definitely don't), even if the result can be made
arbitrarily accurate and the situation where "the model is wrong" can always be avoided.
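
To make the point about interpretability concrete, here is a small sketch that numerically estimates the first few Fourier coefficients of an exponential decay on a finite interval. The decay constant, the interval length, and the helper `fourier_coefficients` are assumptions chosen for illustration; the integration uses a simple midpoint rule.

```python
import math


def fourier_coefficients(f, period, k, steps=20000):
    """Estimate the Fourier coefficients a_k and b_k of f on [0, period) with the midpoint rule."""
    omega = 2.0 * math.pi * k / period
    dt = period / steps
    a_k = 0.0
    b_k = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        a_k += f(t) * math.cos(omega * t) * dt
        b_k += f(t) * math.sin(omega * t) * dt
    return 2.0 * a_k / period, 2.0 * b_k / period


decay_constant = 1.0  # assumed value, for illustration only
period = 5.0          # assumed length of the interval that is expanded


def decay(t):
    """The function to expand: a plain exponential decay."""
    return math.exp(-decay_constant * t)


coefficients = [fourier_coefficients(decay, period, k) for k in range(6)]
for k, (a_k, b_k) in enumerate(coefficients):
    print("k = {}  a_k = {:.5f}  b_k = {:.5f}".format(k, a_k, b_k))
```

The list of coefficients describes the curve, but no single number in it can be read off as "the decay constant", which is the kind of information we are usually after.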

## The role of Statistics in Data Analysis

As mentioned above, most often we are dealing with systems that are either too complex to be accurately modeled in detail (the stock market comes to mind)
or that by their very nature are statistical (i.e. involve some sort of randomness).

-----
#### Assumptions for complex systems
The reason that we can model not only truly random processes (like those of quantum mechanics) but also highly complex systems is the assumption that in highly complex
systems small fluctuations only cause small effects, which seems to hold true for the most part (for further information please consult your local chaos theorist).
So if the small fluctuations are assumed to be random, the resulting behaviour still depends mostly on the main "forces" acting on the system, because small random fluctuations remain small and only have a small effect. These small effects then add up to the uncertainty of the prediction based on the coarser model.

-----

As randomness is involved in these processes, the mathematical methods of statistics (the study of random quantities) can be applied to gain information from data.

Statistics offers a wide range of procedures and definitions that can be used to reproducibly and repeatedly gain information from data.
In this specific case we would like a model ($\vec{y} = f(\vec{x},\vec{\theta})$) that describes the data. The model therefore represents *information*.

As we are dealing with statistical processes, we cannot make claims with certainty but must use probabilities instead.
Another point that is easily overlooked when taking only a superficial look at statistical analysis is that most statements are not
clearly yes or no but far more often something in between. This "something in between" is expressed by the probability.
The concept of probability allows for a far more nuanced view of the data. It allows us to weigh different options against each other, using the probability as
a measure of which statement is *less wrong*.

-----
#### Side note on Data vs Information
As you may have noticed, I am making a distinction between data and information. I do so to avoid the seemingly common confusion of the two. For this set of documents, data always means the raw measurements. There is a LOT of data out there. The problem with data becomes obvious when considering the following example:

I give you 400,000 pages of rows and rows of numbers that describe the process that you are to become an expert on. Even though you have "all the data", you don't have an *understanding* (i.e. information); you can't *predict* anything based on the data alone. Your job is to *make sense* of all these numbers, to find the *patterns* in
the data, thus deriving information from all of these rows and columns.
As you may see now, information is closely related to understanding and therefore to something we humans can comprehend, act upon, or think about. Data, on the other hand, is only a vast quantity of numbers.

-----

## Interpretations of Probability
Before I get to the more formal math of probability theory, I would like to present two common interpretations and an example to ease into this (sometimes quite confusing) study of probability. Both interpretations of probability will be used depending on the circumstances, so they don't exclude but complement each other.

### The Bayesian interpretation
The Bayesian interpretation of probability views statements as either true or false and interprets probability as a degree of belief in that statement.
I think it is important to note that the statements normally used in combination with the Bayesian interpretation are relatively basic.
This is a consequence of the quickly declining probability of multiple interdependent events that have to occur in a precise order to make a complex
statement true. The probability that statement $E$ is correct is written as $P(E)$.

If $E$ is **true** we define $P(E) = 1$, and if it is **false** we define $P(E) = 0$.

If you are unsure whether a statement $F$ is true, let's say you give it a 50% chance of being true, you would write $P(F) = 0.5$.

The beauty of this way of formulating probability is that logical operations can be mapped to mathematical operations on the real numbers according to the following transformation (in this simple form, the product rule assumes that $E_1$ and $E_2$ are independent, and the sum rule assumes that they are mutually exclusive):

| Logical operation | Mathematical operation |
|-------------------|------------------------|
| **AND**           | $P(E_1) \cdot P(E_2)$  |
| **OR**            | $P(E_1) + P(E_2)$      |
| **NOT**           | $1 - P(E)$             |

With this transformation the logical operations can be extended to include not only the values **true** and **false** but also all probabilities in between.
So we now have a rule for calculating the probability (or truth content) of composite statements like **A AND B** or any other combination that can be built with
Boolean logic.

As we are able to calculate the probability of a statement being **true**, we can also calculate the probability of a statement being **false**.
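
The mapping from logical to arithmetic operations can be tried out directly. The sketch below uses hypothetical helper names (`p_not`, `p_and`, `p_or_exclusive`, `p_or_independent`). Keep in mind that the plain product is only valid for independent statements and the plain sum only for mutually exclusive ones, so an OR for independent statements is built here from NOT and AND instead.

```python
def p_not(p):
    """Probability that a statement is false."""
    return 1.0 - p


def p_and(p1, p2):
    """Probability that both statements are true, assuming they are independent."""
    return p1 * p2


def p_or_exclusive(p1, p2):
    """Probability that either statement is true, assuming they are mutually exclusive."""
    return p1 + p2


def p_or_independent(p1, p2):
    """OR for independent statements: NOT((NOT A) AND (NOT B))."""
    return p_not(p_and(p_not(p1), p_not(p2)))


# With only 0 (false) and 1 (true) as inputs the rules reproduce Boolean logic.
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print("A = {:.0f}  B = {:.0f}  A AND B = {:.0f}  A OR B = {:.0f}".format(
            a, b, p_and(a, b), p_or_independent(a, b)))

# Values in between are handled by exactly the same functions.
print(p_and(0.5, 0.5))             # 0.25
print(p_or_independent(0.5, 0.5))  # 0.75
```

Note that `p_or_exclusive(1.0, 1.0)` would give 2, which is not a probability: two statements that are both certainly true cannot be mutually exclusive, so the simple sum rule does not apply to them.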

(*FYI*: the computer you are reading this on is running entirely on boolean logic, so you can formulate quite a lot with it)

----
#### Engineering is careful optimisation of probability
Let's consider a process that normally occurs billions of times every day. When we get into a vehicle we expect the engine to turn on and run until we switch it off.
A gasoline engine at idle speed revs at about 500 rpm. In a four-stroke engine each cylinder fires once every two revolutions, which makes two firings per revolution for our four-cylinder engine. The speed of the engine is geared down (or up, depending on the gear) to drive the wheels, which ultimately drive the vehicle.

The following data is taken from https://en.wikipedia.org/wiki/Gear_train and is an example for a 2005 Corvette C5 Z06.
Let's calculate the number of cylinder firings per second.

In [9]:
# ---- play around with the numbers here
current_gear = 5   # gear number, counted from 1
current_speed = 50 # in km/h
cylinder_count = 4
# ----

# we are assuming that we want to travel forward, which is why the reverse gear is ignored
# engine revolutions per wheel revolution for gears 1 to 6
# (simplification: any final drive ratio is not included here)
gear_ratio = [2.97/1, 2.07/1, 1.43/1, 1.00/1, 0.84/1, 0.56/1]
wheel_circumference = 2.09 # in meters

# convert speed to m/s (current_speed itself stays in km/h)
speed_ms = current_speed / 3.6

# calculate revolutions per second (wheels) given the current speed
wheel_rps = speed_ms / wheel_circumference
# apply the gear ratio (Python lists are zero-indexed, hence the -1)
engine_rps = wheel_rps * gear_ratio[current_gear - 1]
# determine the average number of firings per revolution based on cylinder count
fpr = cylinder_count / 2 # each cylinder fires once every two revolutions
# determine firings per second
fps = fpr * engine_rps
print("Firings per second at a speed of {:.1f} km/h in gear {} = {:.3f} ".format(current_speed, current_gear, fps))

Firings per second at a speed of 50.0 km/h in gear 5 = 11.164 


So for your (imaginary?) 20 min commute to work (at an average speed of `current_speed`) the number of firings can be calculated.

In [8]:
firings_per_commute = fps * 20 * 60
print("Cylinder firings per Commute = {:.3f}".format(firings_per_commute))

Cylinder firings per Commute = 13397.129


Nowadays cylinder misfires are rare. I'd bet you have never had one outside of racing or pushing your car a bit too hard.

For the sake of being able to formulate concise mathematical statements, let's give these two events names we can reference in such statements. The event of a cylinder firing we shall call $F$ and the event of a cylinder misfiring we call $M$.

So, assuming you want to get through your commute without a single misfire, we can calculate how reliable a single cylinder firing has to be
for you to have even a chance of getting through an **average** commute without a misfire (note the word "average" in the last sentence, it will become important later).

For the next calculation to be any good we need to know how often you are willing to tolerate a misfire on a commute. So let's say we can tolerate a misfire in 10% of our commutes (the real value is probably as low as 0.01%, maybe? This is just a guess). That means that, on average, we have one misfire every week, assuming a 5-day working week with one commute there and one back.

So we have $P(M_{commute}) = 0.1$, where $M_{commute}$ is the event that we get a misfire during our commute. From this we can calculate the probability of a cylinder firing without a problem.

A probability of 10% means that in 1 of 10 commutes we can have a misfire. As we already know, a commute has `firings_per_commute` chances for a cylinder to either fire or misfire. Only a single misfire is allowed in $\frac{1}{P(M_{commute})}$ commutes. As we know how many times a cylinder fires per commute, we can calculate the total number of cylinder firings that may, on average, contain a single misfire.

In [16]:
# let's say we tolerate a misfire in 10% of our commutes
tolerable_ratio_of_commutes_with_misfire = 0.1

# so we can calculate the number of cylinder firings that occur, of which one is allowed to misfire
firings_with_a_maximum_of_one_misfire = firings_per_commute/tolerable_ratio_of_commutes_with_misfire
print("{:.1f}".format(firings_with_a_maximum_of_one_misfire))

133971.3


As there are `firings_with_a_maximum_of_one_misfire` total possibilities for a cylinder to misfire and we want it to misfire only once, the probability of a misfire is $P(M) = \frac{1}{N}$, where $N$ is the total number of times the cylinder fires (not necessarily successfully).

This is the probability of a cylinder misfiring, but we want the probability of a cylinder firing successfully. As these two events are mutually exclusive (that part is quite important, as you will see later on) and are assumed to be the only two possibilities, the probability of firing successfully is the **NOT** of $M$: $P(F) = 1 - P(M)$

In [13]:
probability_of_cylinder_firing = 1-1/firings_with_a_maximum_of_one_misfire
print(probability_of_cylinder_firing)

0.9999925357142857


We can check the result with the following calculation.
Every time a cylinder fires, it can either fire correctly or misfire (or at least let us assume those are the only two possibilities),
so having a cylinder fire successfully twice in a row (like tossing a coin two times in a row and getting the same result twice) requires us to multiply the probabilities
together.

For a total of $n$ consecutive successful firings the probability is given by $P(F)^n$, so we expect `firings_per_commute` consecutive successful cylinder firings to occur in about 90% of cases.

In [15]:
probability_of_cylinder_firing**int(firings_per_commute)

0.9048379528610883

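which, as you can see, is approximately the case. The result is about 0.905 rather than exactly 0.9, because allowing one misfire per ten commutes *on average* is not quite the same as having a misfire in exactly 10% of all commutes.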
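
The small gap between the value computed above (roughly 0.905) and the 0.9 we aimed for can be examined with a short sketch. It compares the $\frac{1}{N}$ estimate used above with the value obtained by solving $P(F)^n = 0.9$ directly, assuming that every firing is independent of all others. The variable names are chosen for this sketch, and `firings_per_commute` is the rounded value from the calculation above.

```python
import math

firings_per_commute = 13397   # rounded value from the calculation above
target_clean_commutes = 0.9   # fraction of commutes without any misfire

# Estimate used above: one misfire allowed per (firings_per_commute / 0.1) firings.
p_fire_approx = 1.0 - (1.0 - target_clean_commutes) / firings_per_commute

# Direct solution of p_fire ** firings_per_commute == target_clean_commutes,
# assuming every firing is independent of the others.
p_fire_exact = target_clean_commutes ** (1.0 / firings_per_commute)

print("approximate P(F) = {:.10f} -> clean commutes: {:.4f}".format(
    p_fire_approx, p_fire_approx ** firings_per_commute))
print("exact       P(F) = {:.10f} -> clean commutes: {:.4f}".format(
    p_fire_exact, p_fire_exact ** firings_per_commute))

# For large n, (1 - q/n) ** n approaches exp(-q), which explains the 0.905.
print("exp(-0.1) = {:.4f}".format(math.exp(-0.1)))
```

The two per-firing probabilities differ only in the seventh decimal place, which is why the simple estimate is good enough for this example.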

----

### The frequentist interpretation

<img src="graphics/Probabilities.svg" />