# KEN1435 - Principles of Data Science | Lab 4: What are the Chances? and The Law of Averages

First we load the necessary Python packages. Note that `import os` is placed above the other imports, separated by a blank line. The common convention (PEP 8) is to import standard-library modules first, then third-party packages, and finally the modules that you write within your own project. As `os` is part of Python's standard library, it is imported before third-party packages such as `matplotlib`, `numpy`, `pandas`, and `seaborn`.

In [1]:
import os

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches # this is used in the solution for the legend in exercise 15
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
from scipy.special import comb
import seaborn as sns

tab10 = plt.get_cmap("tab10").colors

%matplotlib inline

## What are the Chances?

In the lecture we explored several examples of how we can use simulation to determine the probability that certain events occur. In this lab, we will look at a different example: a lottery.

Suppose we have a lottery in which balls are drawn from a lottery machine. This machine contains numbered balls that range from `1` to `99`. To determine the winning combination, seven balls are drawn from the machine **without replacement**.

1. Build an array `balls` that contains all numbered balls in the machine.

In [2]:
balls = np.arange(1, 100, 1)  # the stop value is exclusive, so 100 is needed to include ball 99
balls

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
       86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

2. Use functions from `np.random` to make a realization of seven drawn balls. Save the seven drawn numbers in the variable `draw`.

In [3]:
# Draw seven balls without replacement, so no number can occur twice
draw = np.random.choice(balls, size=7, replace=False)
draw

array([94, 19, 93, 38,  9, 78, 87])

You participate in the lottery by handing in a ticket on which you predict which balls will be drawn from the machine.

Suppose that for a particular week, you participate with the following numbers: `10`, `19`, `20`, `21`, `40`, `50`, and `52`. 

3. How many numbers would you have had correct if you used this prediction on the outcome in the variable `draw`?

In [10]:
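def rollingNumbers(num_draws, start, end):
    """Simulate num_draws lottery draws and return the total number of correct guesses.

    Balls are numbered start..end (both inclusive) and drawn without replacement.
    """
    count = 0
    numbers = np.array([10, 19, 20, 21, 40, 50, 52])
    machine = np.arange(start, end + 1)
    for _ in range(num_draws):
        drawing = np.random.choice(machine, size=len(numbers), replace=False)
        # intersect1d returns the values that occur in both arrays
        count += len(np.intersect1d(drawing, numbers))
    return count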
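
For a single realization, the number of correct predictions can be read off directly. The following is a minimal sketch that hard-codes the draw shown above so that the result is reproducible; the names `prediction`, `matches`, and `n_correct` are illustrative and not part of the lab.

```python
import numpy as np

# The draw shown earlier in this lab, hard-coded so the example is reproducible
draw = np.array([94, 19, 93, 38, 9, 78, 87])
prediction = np.array([10, 19, 20, 21, 40, 50, 52])

# np.intersect1d returns the sorted values that occur in both arrays
matches = np.intersect1d(draw, prediction)
n_correct = len(matches)
print(matches, n_correct)  # [19] 1
```

Only ball `19` appears in both the draw and the prediction, so this ticket would have one number correct.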

As the chance of predicting all seven balls correctly is very small, we will look at several probabilities instead: the chance of correctly predicting a given number of the seven drawn balls.

4. Simulate one hundred thousand draws from the machine and determine, for each draw, how many balls have been correctly predicted by your guess. Store the number of correct guesses in a data frame called `outcomes`.

In [12]:
rollingNumbers(100000, 1, 99)

48732
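
The total above does not yet give the number of correct guesses per draw. One possible way to keep the per-draw counts in a data frame named `outcomes` is sketched below. The column name `n_correct`, the seed, and the use of `np.random.default_rng` are assumptions made for this example, not requirements of the lab.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1435)  # seed chosen arbitrarily for reproducibility
balls = np.arange(1, 100)          # balls numbered 1..99
prediction = np.array([10, 19, 20, 21, 40, 50, 52])
num_draws = 100_000

# For each simulated draw, count how many of the predicted numbers were drawn
n_correct = np.empty(num_draws, dtype=int)
for i in range(num_draws):
    drawing = rng.choice(balls, size=7, replace=False)
    n_correct[i] = np.isin(drawing, prediction).sum()

outcomes = pd.DataFrame({"n_correct": n_correct})
print(outcomes["n_correct"].value_counts().sort_index())
```

Keeping one row per draw makes it possible to count how often each outcome (0, 1, 2, ... correct) occurred, which a single running total cannot provide.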

5. Suppose that we use these outcomes to estimate the probability of each outcome occurring. What would the estimates be?

In [19]:
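# Divide the total number of correct guesses by the number of draws:
# this is the average number of correct numbers per draw
a = rollingNumbers(100000, 1, 99) / 100000
# Divide by the 7 predicted numbers: the estimated chance that a single predicted number is drawn
solution = a / 7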
solution

0.06935428571428572
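
The value above estimates the chance that one predicted number is drawn. To estimate the probability of each outcome (0 to 7 correct), the relative frequency of each count can be used. The sketch below assumes the per-draw counts are available as a list or series; the helper name `estimate_probabilities` is hypothetical, and the input is a tiny hand-made example rather than simulation output.

```python
import pandas as pd


def estimate_probabilities(n_correct, max_correct=7):
    """Relative frequency of each outcome 0..max_correct in a series of counts."""
    freq = pd.Series(n_correct).value_counts(normalize=True)
    # Outcomes that never occurred get an estimated probability of 0
    return freq.reindex(range(max_correct + 1), fill_value=0.0)


# Tiny hand-made example (not simulation output): ten draws with 0, 1, or 2 correct
example = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
estimates = estimate_probabilities(example)
print(estimates)
```

Applied to the per-draw counts of a full simulation, the same function yields one estimate per outcome, and the estimates sum to 1.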

6. Calculate the actual probability of getting `k` numbers correctly for all possible values of `k`.

In [3]:
# Count how often each predicted number appears in the simulated draws
numbers = np.array([10, 19, 20, 21, 40, 50, 52])
counts_df = pd.DataFrame(0, index=[0], columns=numbers)
num_draws = 100000
for _ in range(num_draws):
    # Seven balls numbered 1..99, drawn without replacement
    drawing = np.random.choice(np.arange(1, 100), size=len(numbers), replace=False)
    for number in drawing:
        if number in numbers:
            counts_df[number] += 1
print(counts_df)
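
Because the seven balls are drawn without replacement, the number of correct predictions follows a hypergeometric distribution: $P(K = k) = \binom{7}{k}\binom{92}{7-k} / \binom{99}{7}$ for $k = 0, \dots, 7$. A minimal sketch of this calculation with `scipy.special.comb` follows; the variable names are illustrative.

```python
from scipy.special import comb

n_balls, n_drawn, n_picked = 99, 7, 7

# Number of ways to draw 7 balls out of 99
total = comb(n_balls, n_drawn, exact=True)

# k of the drawn balls come from the 7 predicted numbers,
# the remaining 7 - k come from the 92 numbers that were not predicted
exact = {
    k: comb(n_picked, k, exact=True)
    * comb(n_balls - n_picked, n_drawn - k, exact=True)
    / total
    for k in range(n_drawn + 1)
}

for k, p in exact.items():
    print(k, p)
```

These exact values can be compared with the simulated relative frequencies; with one hundred thousand draws the two should be close for the common outcomes.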


## The Law of Averages

Let's start by loading all experimental data. Download all the files from Canvas and save them in a subfolder called `data`. We start by getting a list of all the filenames in that folder using `listdir`. Next, we loop over the filenames to load each data file into a dictionary called `data`. Given the names of the files, we will use the first name of the person performing the flips as the key in the dictionary.

In [20]:
fns = os.listdir("data")
data = {}
for fn in fns:
    # The part of the filename before the first underscore is the first name
    key = fn.split("_")[0]
    data[key] = pd.read_csv(os.path.join("data", fn))

7. List all the keys in the `data` dictionary.

In [21]:
data[key]

Unnamed: 0,date,person,coin,start,sequence
0,2023-01-14 10:04:35,Sjoerd Terpstra,0.50RON,t,htxtthhtthththhxhtxthtxhhxxtttthxhhhttxxthxthh...
1,2023-01-14 10:17:55,Sjoerd Terpstra,0.50RON,t,txtttthhhhthtxthxtttuxttthtxhhtttthhxthhhthhxt...
2,2023-01-14 10:25:04,Sjoerd Terpstra,0.50RON,t,tthththththththhxhhttttthtthhhhhhtththhtththhh...
3,2023-01-14 10:31:56,Sjoerd Terpstra,0.50RON,t,txhhxthhtttxxthhthtxthtthhtthtuhxhhhhttxhxttth...
4,2023-01-14 10:38:23,Sjoerd Terpstra,0.50RON,t,thththththxthtthtththhhxxhtthtththhtthtthththh...
5,2023-01-14 10:47:56,Sjoerd Terpstra,0.50RON,h,ththtxhxhhhhhhthtxttxhxhthhhtttttththththhhtth...
6,2023-01-14 10:54:59,Sjoerd Terpstra,0.50RON,t,hththhhtxtthhhxththhhtttttxxttththtxtthhththtt...
7,2023-01-14 11:04:53,Sjoerd Terpstra,0.50RON,t,txhxxhhhhttxhttththtttththttthhuxxtthtththtxth...
8,2023-01-14 11:10:57,Sjoerd Terpstra,0.50RON,t,txthhtthhthtxthttxtxthhthtutxthttxtthtthhtxhth...
9,2023-01-14 11:27:22,Sjoerd Terpstra,0.50EUR,t,thththttxhhxththxhhhhthhtttthtttuhhhxhhhhhttth...
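
The cell above displays the data frame that was loaded last, not the keys themselves. A minimal sketch of listing the keys, together with the shape and column names of each frame, is given below. The dictionary here is a hypothetical stand-in with made-up keys and contents, so that the example runs without the lab files.

```python
import pandas as pd

# Hypothetical stand-in for the `data` dictionary built above
data = {
    "alice": pd.DataFrame(
        {"date": ["2023-01-14"], "coin": ["0.50EUR"], "start": ["h"], "sequence": ["hth"]}
    ),
    "bob": pd.DataFrame({"start": ["t"], "sequence": ["tth"]}),
}

keys = list(data.keys())  # equivalent to list(data)
print(keys)

# Shape and column names give a quick overview of every loaded file
for key, df in data.items():
    print(key, df.shape, list(df.columns))
```

Printing the shape next to each key is one way to spot frames with an unexpected number of columns.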


8. Manually inspect all data files that are contained in the `data` dictionary. Which of them are loaded incorrectly or have more than four columns?

***Answer:*** *Your answer goes here*

9. Fix the incorrectly formatted data frames by loading them correctly and replacing the corresponding entries in `data`.

To simplify accessing the data further down the line, it is good practice to unify the naming of the columns in the data frames. 

10. Rename the columns in all the data frames in `data` to the following names: `["date", "coin_used", "start", "sequence"]`.
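
One possible approach to the renaming is sketched below. It assumes that every data frame has already been reduced to exactly four columns in the target order; the example frame and its original column names are hypothetical.

```python
import pandas as pd

new_names = ["date", "coin_used", "start", "sequence"]

# Hypothetical example frame with four columns under different names
data = {
    "alice": pd.DataFrame(
        {
            "Date": ["2023-01-14 10:04:35"],
            "Coin": ["0.50EUR"],
            "Start": ["t"],
            "Sequence": ["htth"],
        }
    )
}

for key, df in data.items():
    # Assumption: each frame has exactly four columns, already in the target order
    if df.shape[1] != len(new_names):
        raise ValueError(f"{key} has {df.shape[1]} columns, expected {len(new_names)}")
    df.columns = new_names

print(data["alice"].columns.tolist())
```

The explicit check stops with an error for frames that still have extra columns, instead of silently assigning the wrong names.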

11. Construct a dictionary that you can use for the value mapping of the sequence values for all data frames.

12. Calculate the fraction of tosses that land on heads for each sequence and store it in a new column named `frac_heads`

13. Denote whether the start of the sequence was a heads by `1` or a tails by `0` in a new column named `start_heads`.
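
The three steps above (value mapping, `frac_heads`, and `start_heads`) could be sketched as follows. The mapping assumes that `h` stands for heads and `t` for tails. Other characters that appear in the sequences (such as `x` and `u`) are not explained in this lab, so this sketch simply leaves them out of the fraction. The example frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed mapping: "h" = heads, "t" = tails. Other characters seen in the
# data (e.g. "x", "u") are not explained in this lab and are ignored here.
value_map = {"h": 1, "t": 0}


def frac_heads(sequence):
    """Fraction of heads among the tosses that map to heads or tails."""
    values = [value_map[c] for c in sequence if c in value_map]
    return np.mean(values) if values else np.nan


# Hypothetical example frame with the unified column names
df = pd.DataFrame({"start": ["t", "h"], "sequence": ["htxtth", "hhhu"]})
df["frac_heads"] = df["sequence"].apply(frac_heads)
df["start_heads"] = df["start"].map(value_map)
print(df)
```

In the first example row, two of the five counted tosses are heads, giving `frac_heads = 0.4`. Whether the ignored characters should instead be treated differently depends on their meaning in the experiment.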

14. Combine all data frames into a single data frame.
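
A minimal sketch of combining the frames with `pd.concat` is given below. The frames are hypothetical stand-ins, and keeping the dictionary key in a column named `person` is an assumption made for this example.

```python
import pandas as pd

# Hypothetical frames standing in for the cleaned entries of `data`
data = {
    "alice": pd.DataFrame({"start": ["t"], "sequence": ["hth"]}),
    "bob": pd.DataFrame({"start": ["h", "t"], "sequence": ["tth", "hht"]}),
}

# Passing the dict labels each block of rows with its key;
# names= gives that index level the name "person"
combined = (
    pd.concat(data, names=["person", "row"])
    .reset_index(level="person")
    .reset_index(drop=True)
)
print(combined)
```

Keeping the key as a column preserves the information about who performed each sequence after the frames are merged.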

15. Plot the distribution of the observed fraction of heads stratified by the starting side of the coin and indicate the average fraction of heads within these observations with a vertical line.
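
One way to draw such a plot with `matplotlib` alone is sketched here. The data frame is synthetic (random numbers, not lab data) and only serves to make the example runnable; the column names follow the earlier exercises, and the number of bins and the labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the combined data frame (not real lab data)
rng = np.random.default_rng(0)
combined = pd.DataFrame(
    {
        "start_heads": rng.integers(0, 2, size=200),
        "frac_heads": rng.normal(0.5, 0.05, size=200),
    }
)

tab10 = plt.get_cmap("tab10").colors
fig, ax = plt.subplots()

# Shared bin edges so that both histograms are directly comparable
bins = np.linspace(combined["frac_heads"].min(), combined["frac_heads"].max(), 21)

for value, label in [(0, "start: tails"), (1, "start: heads")]:
    subset = combined.loc[combined["start_heads"] == value, "frac_heads"]
    ax.hist(subset, bins=bins, alpha=0.5, color=tab10[value], label=label)
    # Vertical line at the average fraction of heads within this group
    ax.axvline(subset.mean(), color=tab10[value], linestyle="--")

ax.set_xlabel("fraction of heads")
ax.set_ylabel("number of sequences")
ax.legend()
```

Using the same color for a histogram and its mean line makes it clear which average belongs to which starting side.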