<div style="text-align: right">INFO 6105 Data Sci Engineering Methods and Tools</div>
<div style="text-align: right">Dino Konstantopoulos, 24 February 2020</div>

We correct your data science homework involving Formula 1 driving. 

<br />
<center>
    <img src="images/f1-python-20.png" width=400 />
</center>

Let's use the following probability functions and classes.

In [1]:
from fractions import Fraction

def p(event, space): 
    """The probability of an event, given a sample space of outcomes. 
    event: a collection of outcomes, or a predicate that is true of outcomes in the event. 
    space: a set of outcomes or a probability distribution of {outcome: frequency} pairs."""
    if is_predicate(event):
        event = such_that(event, space)
    if isinstance(space, ProbDist):
        return sum(space[o] for o in space if o in event)
    else:
        return Fraction(len(event & space), len(space))

is_predicate = callable

def such_that(predicate, space): 
    """The outcomes in the sample space for which the predicate is true.
    If space is a set, return a subset {outcome,...} with outcomes where predicate(element) is true;
    if space is a ProbDist, return a ProbDist {outcome: frequency,...} with outcomes where predicate(element) is true."""
    if isinstance(space, ProbDist):
        return ProbDist({o:space[o] for o in space if predicate(o)})
    else:
        return {o for o in space if predicate(o)}

In [2]:
class ProbDist(dict):
    """A Probability Distribution; an {outcome: probability} mapping."""
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        # Make probabilities sum to 1.0; assert no negative probabilities
        total = sum(self.values())
        for outcome in self:
            self[outcome] = self[outcome] / total
            assert self[outcome] >= 0
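
Here is a quick usage sketch of these helpers, repeated in compact form so the block runs on its own. The fair die and the `weather` counts are made-up illustrations, not part of the homework data.

```python
from fractions import Fraction

class ProbDist(dict):
    """An {outcome: probability} mapping, normalized to sum to 1."""
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        for outcome in self:
            self[outcome] = self[outcome] / total
            assert self[outcome] >= 0

def such_that(predicate, space):
    """Keep only the outcomes for which the predicate is true."""
    if isinstance(space, ProbDist):
        return ProbDist({o: space[o] for o in space if predicate(o)})
    return {o for o in space if predicate(o)}

def p(event, space):
    """Probability of an event (a set or a predicate) in a sample space."""
    if callable(event):
        event = such_that(event, space)
    if isinstance(space, ProbDist):
        return sum(space[o] for o in space if o in event)
    return Fraction(len(event & space), len(space))

# Equally likely outcomes: a fair six-sided die (hypothetical example).
die = {1, 2, 3, 4, 5, 6}
p_even = p(lambda n: n % 2 == 0, die)

# Weighted outcomes: raw counts are normalized into probabilities.
weather = ProbDist(rain=1, sun=3)  # made-up counts, for illustration only
p_rain = p({'rain'}, weather)

print(p_even, p_rain)
```

With a plain set, `p` counts outcomes and returns an exact `Fraction`; with a `ProbDist`, it sums the normalized weights of the outcomes in the event.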

# F1 Sport Analytics

<br />
<center>
    <img src="ipynb.images/mercedes-ferrari.jpg" width=800 />
</center>

Question 1.1 (20 points) There are two F1 races coming up: The Singapore Grand Prix this weekend and the Russian Grand Prix the weekend after. The 2019 driver standings are given [here](https://www.formula1.com/en/results.html/2019/drivers.html). Given these standings (please do not use team standings given on the same Web site, use driver standings), what is the Probability Distribution for each F1 driver to win the Singapore Grand Prix? What is the Probability Distribution for each F1 driver to win *both* the Singapore and Russian Grand Prix? What is the probability for Mercedes to win both races? What is the probability for Mercedes to win at least one race? Note that Mercedes, and each other racing team, has two drivers per race.

Question 1.2 (30 points) If Mercedes wins the first race, what is the probability that Mercedes wins the next one? If Mercedes wins at least one of these two races, what is the probability Mercedes wins both races? How about Ferrari, Alfa Romeo, and McLaren?

Question 1.3 (50 points) Mercedes wins **at least one** of these two races on a rainy day. What is the probability Mercedes wins both races, assuming races can be held on either rainy, sunny, cloudy, snowy or foggy days? Assume that rain, sun, clouds, snow, and fog are the only possible weather conditions on race tracks.

You need to provide *proof* for your answers. `I think it's one in a million because Mercedes sucks and I love Ferrari` is not a good answer. Leverage the counting framework in this workbook.

### Note

We are going to approach the problem a bit differently:

When we are asked to consider the case where Mercedes wins the first race, we are going to model the probability of winning the *next* race instead, because it's easier (for me) to reason this way. But whether we look at the first race or the second race does not make a difference in ***frequentist statistics*** because *we stay with the same probability distribution for wins*.
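
A minimal sketch to check that claim, assuming the two races are independent and share the same distribution (that is what a joint distribution of `SGP` with itself encodes). The helper `weight` and the predicates `first_is_mercedes` and `second_is_mercedes` are hypothetical names introduced for this illustration; the points are the ones used in this notebook.

```python
class ProbDist(dict):
    """An {outcome: probability} mapping, normalized to sum to 1."""
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        for outcome in self:
            self[outcome] = self[outcome] / total

def joint(A, B, sep=' '):
    """Joint distribution of two independent distributions."""
    return ProbDist({a + sep + b: A[a] * B[b] for a in A for b in B})

points = dict(LH=413, VB=326, MV=278, CL=264, SV=240, CS=96, PG=96, AA=92,
              DR=54, SP=52, LN=49, KR=43, DK=37, NH=37, LS=21, KM=20,
              AG=14, RG=8, RK=1, GR=0)
SGP = ProbDist(points)
two_races = joint(SGP, SGP)  # outcomes look like 'LH VB'

MERCEDES = {'LH', 'VB'}

def first_is_mercedes(outcome):
    return outcome.split()[0] in MERCEDES

def second_is_mercedes(outcome):
    return outcome.split()[1] in MERCEDES

def weight(predicate, dist):
    """Total probability of the outcomes that satisfy the predicate."""
    return sum(prob for outcome, prob in dist.items() if predicate(outcome))

p_single = SGP['LH'] + SGP['VB']
p_first = weight(first_is_mercedes, two_races)
p_second = weight(second_is_mercedes, two_races)
p_both = weight(lambda o: first_is_mercedes(o) and second_is_mercedes(o), two_races)
p_second_given_first = p_both / p_first

print(p_single, p_first, p_second, p_second_given_first)
```

All four numbers agree up to floating-point rounding: under this model, knowing who won the first race tells you nothing about the second one.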

Hint: We use `SGP` to denote the Probability Distribution given by the F1 driver standings. Write driver initials as keys and each driver's points from the standings as values in a dictionary that you pass to our class `ProbDist`.


### Question 1

In [3]:
SGP = ProbDist(
    LH = 413,
    VB = 326,
    MV = 278,
    CL = 264,
    SV = 240,
    CS = 96,
    PG = 96,
    AA = 92,
    DR = 54,
    SP = 52,
    LN = 49,
    KR = 43,
    DK = 37,
    NH = 37,
    LS = 21,
    KM = 20,
    AG = 14,
    RG = 8,
    RK = 1,
    GR = 0)
SGP

{'LH': 0.19290051377860812,
 'VB': 0.15226529659037832,
 'MV': 0.12984586641756188,
 'CL': 0.12330686595049042,
 'SV': 0.1120971508640822,
 'CS': 0.044838860345632885,
 'PG': 0.044838860345632885,
 'AA': 0.04297057449789818,
 'DR': 0.025221858944418495,
 'SP': 0.024287716020551145,
 'LN': 0.022886501634750117,
 'KR': 0.02008407286314806,
 'DK': 0.017281644091546006,
 'NH': 0.017281644091546006,
 'LS': 0.009808500700607193,
 'KM': 0.009341429238673517,
 'AG': 0.006539000467071462,
 'RG': 0.0037365716954694066,
 'RK': 0.00046707146193367583,
 'GR': 0.0}

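Assuming the two races are independent, the probability of two ***successive*** wins by the same driver is just the square of that driver's single-win probability (***intersection*** of two events). These values do not sum to 1, because outcomes where two different drivers win are left out: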

In [4]:
def f(x): return x ** 2.0
SRGP = {k: f(v) for k, v in SGP.items()}
SRGP

{'LH': 0.03721060821605098,
 'VB': 0.023184720545755877,
 'MV': 0.016859949025727322,
 'CL': 0.015204583190532214,
 'SV': 0.012565771231844805,
 'CS': 0.002010523397095169,
 'PG': 0.002010523397095169,
 'AA': 0.0018464702726794173,
 'DR': 0.0006361421686121433,
 'SP': 0.0005898931494949367,
 'LN': 0.0005237919570774198,
 'KR': 0.00040336998277224037,
 'DK': 0.000298655222506867,
 'NH': 0.000298655222506867,
 'LS': 9.62066859938118e-05,
 'KM': 8.726230022114447e-05,
 'AG': 4.27585271083608e-05,
 'RG': 1.3961968035383116e-05,
 'RK': 2.181557505528612e-07,
 'GR': 0.0}
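
The team-level questions can be sketched the same way. This is a hedged sketch, not the official solution: it assumes independent races, and the Alfa Romeo (`KR`, `AG`) and McLaren (`CS`, `LN`) pairings are my assumption from the 2019 line-ups, while Mercedes and Ferrari follow the initials used in this notebook. `team_summary` is a hypothetical helper.

```python
points = dict(LH=413, VB=326, MV=278, CL=264, SV=240, CS=96, PG=96, AA=92,
              DR=54, SP=52, LN=49, KR=43, DK=37, NH=37, LS=21, KM=20,
              AG=14, RG=8, RK=1, GR=0)
total = sum(points.values())

# Team line-ups: the last two pairings are assumptions (see lead-in).
teams = {
    'Mercedes': ('LH', 'VB'),
    'Ferrari': ('CL', 'SV'),
    'Alfa Romeo': ('KR', 'AG'),
    'McLaren': ('CS', 'LN'),
}

def team_summary(drivers):
    """Return (single win, both wins, at least one win, both given at least one)."""
    m = sum(points[d] for d in drivers) / total  # either driver wins one race
    both = m ** 2                                # independent races
    at_least_one = 1 - (1 - m) ** 2              # complement of "no win at all"
    return m, both, at_least_one, both / at_least_one

for team, drivers in teams.items():
    m, both, one, cond = team_summary(drivers)
    print(f'{team:10s} single={m:.3f} both={both:.3f} '
          f'at_least_one={one:.3f} both_given_at_least_one={cond:.3f}')
```

For Mercedes this gives roughly 0.345 for a single race, 0.119 for both races, 0.571 for at least one, and 0.209 for both races given at least one win.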

## Debugging

Probability (and statistics as a consequence) is the science of **counting**. Just build your universe of all possible outcomes, and ***count***!

Let's do some **debugging**. Let's display results as a 2D grid of outcomes. A cell will be colored **white** if Mercedes has no win on a rainy day, **yellow** if Mercedes wins at least one race on a rainy day but does not win both races, and **green** if Mercedes wins both races with at least one win on a rainy day. 

Let's reduce the amount of data for our debugging (you always do this in data science: Always test with a smaller amount of rows!), to just Mercedes, Ferrari, and Renault. ***You always debug with less data***! And I picked these teams because they cover the range of wins: a lot, medium, very few. I'll also reduce weather to (r)ain, and (s)un.

In [5]:
def Uniform(outcomes): return ProbDist({event: 1 for event in outcomes})

def joint(A, B, sep=' '):
    """The joint distribution of two independent probability distributions. 
    Result is all entries of the form {a+sep+b: P(a)*P(b)}"""
    return ProbDist({a + sep + b: A[a] * B[b]
                    for a in A
                    for b in B})

def next_mercedes_win_p(outcome): return outcome.count(' LH') + outcome.count(' VB') == 1
def one_mercedes_win_p(outcome): return outcome.count('LH') + outcome.count('VB') >= 1
def two_mercedes_wins_p(outcome): return outcome.count('LH') + outcome.count('VB') == 2

def at_least_one_Mercedes_win_on_a_rainy_day(outcome): return 'LHr' in outcome or 'VBr' in outcome

In [6]:
# probability of winning one race. Mercedes is LH and VB
SGPr = ProbDist(
    LH = 413,
    VB = 326,
    CL = 264,
    SV = 240,
    DR = 54,
    NH = 37)
SGPr

{'LH': 0.3095952023988006,
 'VB': 0.24437781109445278,
 'CL': 0.19790104947526238,
 'SV': 0.17991004497751126,
 'DR': 0.04047976011994003,
 'NH': 0.027736131934032984}

In [7]:
# probability of winning a race on a specific weather condition
SGPrw  = joint(SGPr, Uniform('rs'), '') #care about the weather
SGPrw

{'LHr': 0.15479760119940028,
 'LHs': 0.15479760119940028,
 'VBr': 0.12218890554722636,
 'VBs': 0.12218890554722636,
 'CLr': 0.09895052473763116,
 'CLs': 0.09895052473763116,
 'SVr': 0.08995502248875561,
 'SVs': 0.08995502248875561,
 'DRr': 0.02023988005997001,
 'DRs': 0.02023988005997001,
 'NHr': 0.013868065967016488,
 'NHs': 0.013868065967016488}

In [8]:
# probability of winning two races on specific weather conditions
SRGPrw  = joint(SGPrw, SGPrw)
len(SRGPrw)

144

Ok, I can work with 12 x 12 data points, they won't fry my kernel. What do they look like?

In [9]:
import random
random.sample(list(SRGPrw), 10)

['CLs LHr',
 'LHs VBs',
 'VBr LHr',
 'VBs LHs',
 'VBr NHs',
 'NHr VBs',
 'SVs DRs',
 'NHs VBr',
 'NHr LHr',
 'DRs SVr']

Let's do some plotting. Machine learning is all about *geometry* (specifically, building outcome manifolds in state space that represent the surface joining all possible outcomes). That is why we debug everything with pictures.

Let's plot all possible outcomes of our discrete probability distribution on a grid, and color cells in green and yellow depending on two respective predicates. If the condition is true, the cell gets a color (yellow or green); if the event is true *as well*, the cell is green.

In [10]:
from IPython.display import HTML

def Pgrid(event, condition, dist):
    def first_half(s): return s[:len(s)//2]
    firsts = sorted(set(map(first_half, dist)))
    return HTML('<table>' +
                cat(row(first, event, dist, condition) for first in firsts) +
                '</table>')

def row(first, event, dist, condition):
    "Display a row where the first race result is paired with each of the possible second race results."
    thisrow = sorted(outcome for outcome in dist if outcome.startswith(first))
    return '<tr>' + cat(cell(outcome, event, condition) for outcome in thisrow) + '</tr>'

def cell(outcome, event, condition): 
    "Display outcome in appropriate color."
    color = ('lightgreen' if event(outcome) and condition(outcome) else
             'yellow' if condition(outcome) else
             'white')
    return '<td style="background-color: {}">{}</td>'.format(color, outcome)    

cat = ''.join

In [11]:
# Let's plot all the possible outcomes
# white cells: no Mercedes win on a rainy day
# colored cells (yellow or green): at least one Mercedes win on a rainy day
# green cells: two Mercedes wins with at least one of them on a rainy day
Pgrid(two_mercedes_wins_p, at_least_one_Mercedes_win_on_a_rainy_day, SRGPrw)

CLr CLr,CLr CLs,CLr DRr,CLr DRs,CLr LHr,CLr LHs,CLr NHr,CLr NHs,CLr SVr,CLr SVs,CLr VBr,CLr VBs
CLs CLr,CLs CLs,CLs DRr,CLs DRs,CLs LHr,CLs LHs,CLs NHr,CLs NHs,CLs SVr,CLs SVs,CLs VBr,CLs VBs
DRr CLr,DRr CLs,DRr DRr,DRr DRs,DRr LHr,DRr LHs,DRr NHr,DRr NHs,DRr SVr,DRr SVs,DRr VBr,DRr VBs
DRs CLr,DRs CLs,DRs DRr,DRs DRs,DRs LHr,DRs LHs,DRs NHr,DRs NHs,DRs SVr,DRs SVs,DRs VBr,DRs VBs
LHr CLr,LHr CLs,LHr DRr,LHr DRs,LHr LHr,LHr LHs,LHr NHr,LHr NHs,LHr SVr,LHr SVs,LHr VBr,LHr VBs
LHs CLr,LHs CLs,LHs DRr,LHs DRs,LHs LHr,LHs LHs,LHs NHr,LHs NHs,LHs SVr,LHs SVs,LHs VBr,LHs VBs
NHr CLr,NHr CLs,NHr DRr,NHr DRs,NHr LHr,NHr LHs,NHr NHr,NHr NHs,NHr SVr,NHr SVs,NHr VBr,NHr VBs
NHs CLr,NHs CLs,NHs DRr,NHs DRs,NHs LHr,NHs LHs,NHs NHr,NHs NHs,NHs SVr,NHs SVs,NHs VBr,NHs VBs
SVr CLr,SVr CLs,SVr DRr,SVr DRs,SVr LHr,SVr LHs,SVr NHr,SVr NHs,SVr SVr,SVr SVs,SVr VBr,SVr VBs
SVs CLr,SVs CLs,SVs DRr,SVs DRs,SVs LHr,SVs LHs,SVs NHr,SVs NHs,SVs SVr,SVs SVs,SVs VBr,SVs VBs


Let's ***count***!

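Number of cells where Mercedes wins at least once on a rainy day (yellow or green) = 12 + 12 + (12 - 2) + (12 - 2) = 44
Number of those cells where Mercedes also wins both races (green) = 4 + 4 + 2 + 2 = 12 (rows LHr, VBr, LHs, VBs)
And so, counting cells, the probability of two Mercedes wins given that Mercedes won at least one race on a rainy day is 12 / 44 = 27%. Note that this count treats all 144 cells as equally likely, while the cells of `SRGPrw` carry different probabilities, so it is a rough estimate rather than the exact conditional probability.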

Right!
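
Counting cells assumes every cell is equally likely, which is not the case here since drivers have different win probabilities. A hedged sketch comparing the raw cell count with the probability-weighted answer on the same 12 x 12 grid, assuming independent races and equally likely weather; `rainy_mercedes_win` is a shortened, hypothetical name for the rainy-day predicate:

```python
class ProbDist(dict):
    """An {outcome: probability} mapping, normalized to sum to 1."""
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        for outcome in self:
            self[outcome] = self[outcome] / total

def Uniform(outcomes):
    return ProbDist({event: 1 for event in outcomes})

def joint(A, B, sep=' '):
    """Joint distribution of two independent distributions."""
    return ProbDist({a + sep + b: A[a] * B[b] for a in A for b in B})

SGPr = ProbDist(LH=413, VB=326, CL=264, SV=240, DR=54, NH=37)
SGPrw = joint(SGPr, Uniform('rs'), '')  # 'LHr', 'LHs', ...
SRGPrw = joint(SGPrw, SGPrw)            # 'LHr VBs', ...

def two_mercedes_wins_p(outcome):
    return outcome.count('LH') + outcome.count('VB') == 2

def rainy_mercedes_win(outcome):
    return 'LHr' in outcome or 'VBr' in outcome

colored = [o for o in SRGPrw if rainy_mercedes_win(o)]  # yellow or green
green = [o for o in colored if two_mercedes_wins_p(o)]  # green only

by_count = len(green) / len(colored)
by_weight = sum(SRGPrw[o] for o in green) / sum(SRGPrw[o] for o in colored)

print(len(green), len(colored), round(by_count, 3), round(by_weight, 3))
```

This prints `12 44 0.273 0.482`: the cell count reproduces 12/44, while weighting each cell by its probability gives about 48%, because Mercedes cells are heavier than the others.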

And now let's color a slightly bigger table. Let's increase the amount of data to Mercedes, Ferrari, Renault, ***and Red Bull***. Also, let's add a (c)loudy day.

In [12]:
# probability of winning one race. Mercedes is LH and VB
SGPr = ProbDist(
    LH = 413,
    VB = 326,
    CL = 264,
    SV = 240,
    DR = 54,
    NH = 37,
    MV = 278,
    AA = 92)
SGPr

{'LH': 0.24237089201877934,
 'VB': 0.19131455399061034,
 'CL': 0.15492957746478872,
 'SV': 0.14084507042253522,
 'DR': 0.03169014084507042,
 'NH': 0.02171361502347418,
 'MV': 0.16314553990610328,
 'AA': 0.0539906103286385}

In [13]:
# probability of winning a race on a specific weather condition
SGPrw  = joint(SGPr, Uniform('rsc'), '') #care about the weather
SGPrw

{'LHr': 0.08079029733959313,
 'LHs': 0.08079029733959313,
 'LHc': 0.08079029733959313,
 'VBr': 0.06377151799687013,
 'VBs': 0.06377151799687013,
 'VBc': 0.06377151799687013,
 'CLr': 0.051643192488262914,
 'CLs': 0.051643192488262914,
 'CLc': 0.051643192488262914,
 'SVr': 0.04694835680751174,
 'SVs': 0.04694835680751174,
 'SVc': 0.04694835680751174,
 'DRr': 0.010563380281690142,
 'DRs': 0.010563380281690142,
 'DRc': 0.010563380281690142,
 'NHr': 0.007237871674491394,
 'NHs': 0.007237871674491394,
 'NHc': 0.007237871674491394,
 'MVr': 0.05438184663536776,
 'MVs': 0.05438184663536776,
 'MVc': 0.05438184663536776,
 'AAr': 0.01799687010954617,
 'AAs': 0.01799687010954617,
 'AAc': 0.01799687010954617}

In [14]:
# probability of winning two races on specific weather conditions
SRGPrw  = joint(SGPrw, SGPrw)
len(SRGPrw)

576

Yikes, that's a 24 x 24 table, four times as many cells as our previous table!

In [15]:
Pgrid(two_mercedes_wins_p, at_least_one_Mercedes_win_on_a_rainy_day, SRGPrw)

AAc AAc,AAc AAr,AAc AAs,AAc CLc,AAc CLr,AAc CLs,AAc DRc,AAc DRr,AAc DRs,AAc LHc,AAc LHr,AAc LHs,AAc MVc,AAc MVr,AAc MVs,AAc NHc,AAc NHr,AAc NHs,AAc SVc,AAc SVr,AAc SVs,AAc VBc,AAc VBr,AAc VBs
AAr AAc,AAr AAr,AAr AAs,AAr CLc,AAr CLr,AAr CLs,AAr DRc,AAr DRr,AAr DRs,AAr LHc,AAr LHr,AAr LHs,AAr MVc,AAr MVr,AAr MVs,AAr NHc,AAr NHr,AAr NHs,AAr SVc,AAr SVr,AAr SVs,AAr VBc,AAr VBr,AAr VBs
AAs AAc,AAs AAr,AAs AAs,AAs CLc,AAs CLr,AAs CLs,AAs DRc,AAs DRr,AAs DRs,AAs LHc,AAs LHr,AAs LHs,AAs MVc,AAs MVr,AAs MVs,AAs NHc,AAs NHr,AAs NHs,AAs SVc,AAs SVr,AAs SVs,AAs VBc,AAs VBr,AAs VBs
CLc AAc,CLc AAr,CLc AAs,CLc CLc,CLc CLr,CLc CLs,CLc DRc,CLc DRr,CLc DRs,CLc LHc,CLc LHr,CLc LHs,CLc MVc,CLc MVr,CLc MVs,CLc NHc,CLc NHr,CLc NHs,CLc SVc,CLc SVr,CLc SVs,CLc VBc,CLc VBr,CLc VBs
CLr AAc,CLr AAr,CLr AAs,CLr CLc,CLr CLr,CLr CLs,CLr DRc,CLr DRr,CLr DRs,CLr LHc,CLr LHr,CLr LHs,CLr MVc,CLr MVr,CLr MVs,CLr NHc,CLr NHr,CLr NHs,CLr SVc,CLr SVr,CLr SVs,CLr VBc,CLr VBr,CLr VBs
CLs AAc,CLs AAr,CLs AAs,CLs CLc,CLs CLr,CLs CLs,CLs DRc,CLs DRr,CLs DRs,CLs LHc,CLs LHr,CLs LHs,CLs MVc,CLs MVr,CLs MVs,CLs NHc,CLs NHr,CLs NHs,CLs SVc,CLs SVr,CLs SVs,CLs VBc,CLs VBr,CLs VBs
DRc AAc,DRc AAr,DRc AAs,DRc CLc,DRc CLr,DRc CLs,DRc DRc,DRc DRr,DRc DRs,DRc LHc,DRc LHr,DRc LHs,DRc MVc,DRc MVr,DRc MVs,DRc NHc,DRc NHr,DRc NHs,DRc SVc,DRc SVr,DRc SVs,DRc VBc,DRc VBr,DRc VBs
DRr AAc,DRr AAr,DRr AAs,DRr CLc,DRr CLr,DRr CLs,DRr DRc,DRr DRr,DRr DRs,DRr LHc,DRr LHr,DRr LHs,DRr MVc,DRr MVr,DRr MVs,DRr NHc,DRr NHr,DRr NHs,DRr SVc,DRr SVr,DRr SVs,DRr VBc,DRr VBr,DRr VBs
DRs AAc,DRs AAr,DRs AAs,DRs CLc,DRs CLr,DRs CLs,DRs DRc,DRs DRr,DRs DRs,DRs LHc,DRs LHr,DRs LHs,DRs MVc,DRs MVr,DRs MVs,DRs NHc,DRs NHr,DRs NHs,DRs SVc,DRs SVr,DRs SVs,DRs VBc,DRs VBr,DRs VBs
LHc AAc,LHc AAr,LHc AAs,LHc CLc,LHc CLr,LHc CLs,LHc DRc,LHc DRr,LHc DRs,LHc LHc,LHc LHr,LHc LHs,LHc MVc,LHc MVr,LHc MVs,LHc NHc,LHc NHr,LHc NHs,LHc SVc,LHc SVr,LHc SVs,LHc VBc,LHc VBr,LHc VBs


But the counting is similar: 

24 + 24 + (24 - 2) + (24 - 2) = 92 colored cells (yellow or green)
6 + 6 + 2 + 2 + 2 + 2 = 20 green cells (rows LHr, VBr, LHs, LHc, VBs, VBc)

So, counting cells, the probability of two Mercedes wins given at least one Mercedes win on a rainy day = 20/92 = 22%.

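And the same counting carries over to all 20 Formula 1 drivers and all 5 weather conditions: a 100 x 100 grid with 100 + 100 + (100 - 2) + (100 - 2) = 396 colored cells, of which 10 + 10 + 8 x 2 = 36 are green, so the cell count gives 36/396, about 9%. Do you want to run this? Sorry, I am not going to risk frying my kernel, but you can :-)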
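
The full grid is only 100 x 100 = 10,000 outcomes, so it can be enumerated directly. The sketch below is one way to do it, under assumptions that are mine: the weather letters `r`, `s`, `c`, `n`, `f` (rain, sun, clouds, snow, fog) are equally likely and independent of the drivers, and the two races are independent. It also cross-checks the weighted answer against the closed form m(2w - 1)/(2w - m), where m is the single-race Mercedes probability and w the number of weather conditions; that formula follows from those same assumptions.

```python
class ProbDist(dict):
    """An {outcome: probability} mapping, normalized to sum to 1."""
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        for outcome in self:
            self[outcome] = self[outcome] / total

def Uniform(outcomes):
    return ProbDist({event: 1 for event in outcomes})

def joint(A, B, sep=' '):
    """Joint distribution of two independent distributions."""
    return ProbDist({a + sep + b: A[a] * B[b] for a in A for b in B})

points = dict(LH=413, VB=326, MV=278, CL=264, SV=240, CS=96, PG=96, AA=92,
              DR=54, SP=52, LN=49, KR=43, DK=37, NH=37, LS=21, KM=20,
              AG=14, RG=8, RK=1, GR=0)
SGP = ProbDist(points)
WEATHER = 'rscnf'  # assumed letters: rain, sun, clouds, snow, fog
SGPw = joint(SGP, Uniform(WEATHER), '')
full = joint(SGPw, SGPw)  # 10,000 outcomes such as 'LHr VBf'

def two_mercedes_wins_p(outcome):
    return outcome.count('LH') + outcome.count('VB') == 2

def rainy_mercedes_win(outcome):
    return 'LHr' in outcome or 'VBr' in outcome

colored = [o for o in full if rainy_mercedes_win(o)]
green = [o for o in colored if two_mercedes_wins_p(o)]

by_count = len(green) / len(colored)
by_weight = sum(full[o] for o in green) / sum(full[o] for o in colored)

# Closed-form cross-check under the same independence assumptions.
m = SGP['LH'] + SGP['VB']
w = len(WEATHER)
closed_form = m * (2 * w - 1) / (2 * w - m)

print(len(green), len(colored), round(by_count, 3),
      round(by_weight, 3), round(closed_form, 3))
```

Under these assumptions the cell count is 36/396 (about 9%), while the probability-weighted answer and the closed form agree at about 32%.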


Some of you may have gotten confused when I added evidence to a probability distribution, and you wondered how **the heck** are weather conditions related to F1 rankings? But you were ***thinking too hard***!

Probability theory is ***just counting***: Given all possible outcomes, 1) **figure out the joint sample space**, and 2) **just count favorable outcomes over all possible outcomes**. Isn't *counting* simpler than math formulae?

<center>
    <img src="ipynb.images/elementary.png" width=300 />
</center>

But you know what's the strangest thing? A Mercedes win on a rainy day is understandable, because Mercedes has great Pirelli tires, and once I figure that out, I'm ready to bet on Mercedes on rainy days. But if I told you that Mercedes wins in Russia on the day it rains... in Australia (not on the race track), and that there are 5 possible weather conditions in Australia, *the same listing of cells above holds*! Same sample space, same favorable outcomes and unfavorable ones!

<center>
    <img src="ipynb.images/cmon-iceage.png" width=200 />
</center>

You may say... "Wait a minute, if this theory that you call probability theory gives me illogical answers, how can I trust it?" 

Probability theory is **mathematically sound**, but it all comes down to ***the model you apply it to***. 

If your sample space is the joint distribution of F1 drivers and... weather conditions **in Australia**, then you've probably built the ***wrong statistical model***, and there ain't a single Machine Learning algorithm that will help you here! *Junk in, junk out*. But weather on race tracks and F1 racing, that does make sense, right? Skills of drivers, performance of tires...

That is why the **model** is so important in Data Science, why we're spending so much time talking about **models**. 

Before you start applying statistics to data, or feeding data into a Machine, you need to ***work on your model***. In the case of discrete random variables, your pdf is a **dictionary**, much like our Danish or Formula 1 examples: Even though you have *some* data based on *some* experiment, you still need to build a model with joint distributions to complete your sample space for the problem at hand. In the case of continuous random variables your state space is most often infinite (countable or uncountable) and you need to build a much denser model. In fact, a high dimensional manifold in state space.

**Statistics** is the field of mathematics which deals with the understanding and interpretation of **data**. Specifically, you want to find the underlying mechanism that yields the data you're analyzing. You took the first step by learning probability theory. Now we'll begin to catalog all possible probability distributions that lead to seemingly random data (Poisson, Gaussian, exponential, etc.), and shape models with **Bayesian inference**: 

- You match the histogram of your data to a pdf in the catalogue, then you find its parameters using either point estimates (classical theory) or probabilistic programming methods like variational inference or Monte Carlo sampling for pdf-based estimates. Once you have your analytic model, you may ***extract all kinds of interesting statistics from it*** (instead of from the data).
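
As a minimal sketch of the classical, point-estimate route, here is a Gaussian fitted to data using only the standard library. The data is synthetic (drawn from a Gaussian with mean 5 and standard deviation 2, numbers I made up), and the threshold 8.0 is arbitrary; the point is only that, once fitted, the model answers questions that you would otherwise have to count from the data.

```python
import random
from statistics import NormalDist

random.seed(0)
# Synthetic data, for illustration only: 2000 draws from N(5, 2).
data = [random.gauss(5.0, 2.0) for _ in range(2000)]

# Point estimates of the Gaussian's parameters (sample mean and stdev).
fit = NormalDist.from_samples(data)

# A statistic extracted from the model rather than from the data:
p_above_8 = 1 - fit.cdf(8.0)

# The same quantity counted directly from the data, for comparison.
empirical = sum(x > 8.0 for x in data) / len(data)

print(round(fit.mean, 2), round(fit.stdev, 2),
      round(p_above_8, 3), round(empirical, 3))
```

The fitted parameters should land close to the values used to generate the data, and the model-based tail probability should be close to the counted one. The Bayesian route replaces these point estimates with a full distribution over the parameters.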

That's what meteorologists do!

That is what machines do! They use a deep neural structure to build a curve, which they can then use for prediction. Machine Learning experts have a stronger stats foundation than CS undergraduates in a deep learning class. Information theory, in general, requires a *strong* understanding of data and probability (and linear algebra), and anyone interested in becoming a Data Scientist or Machine Learning Engineer needs to develop a deep intuition for statistical (and linear algebra) concepts. That is your journey in this class.

In many cases, predictive Machine Learning algorithms are useless in helping with the understanding of data because they do not yield their data model. ***Bayesian*** ML changes all that.