# Machine Problem: Conditional Probability

This [Python](https://www.python.org) challenge uses [numpy](https://numpy.org/) and its `random` module to generate pseudorandom numbers.
Remember to import modules whenever they are needed.

In [None]:
import numpy as np
from numpy.random import randint
from numpy.random import binomial
import pandas as pd
import matplotlib.pyplot as plt

Consider two jars containing balls marked zeros and ones.
Jar 0 has 192 zeros and 192 ones.
On the other hand, Jar 1 has 128 zeros and 256 ones.
We begin by writing a method to generate a sequence of drawings with replacement from Jar 0 and Jar 1.
Each drawing should have the appropriate probability, and the total number of samples in the sequence should be `seq_length`.

In [None]:
def jar_seq(jar_number, seq_length=1):
    # Return random drawings from specified jar
    #
    if jar_number == 0:
        prob_one = 192 / 384
    elif jar_number == 1:
        prob_one = 256 / 384
    else:
        prob_one = 0
    outcomes = binomial(1, prob_one, seq_length)
    return outcomes

def jar_rnd_seq(seq_length=1):
    # Return random drawings from Jar 0 or Jar 1 (each jar equally likely),
    # together with the jar label (0 or 1) behind each drawing
    jar_rnd_labels = binomial(1, 0.5, seq_length)
    jar0_out = jar_seq(0, seq_length)
    jar1_out = jar_seq(1, seq_length)
    # Label 0 selects the Jar 0 drawing; label 1 selects the Jar 1 drawing
    outcomes = np.where(jar_rnd_labels == 1, jar1_out, jar0_out)
    return outcomes.astype(int), jar_rnd_labels

Using a large number of samples and empirical averaging, estimate the probability of getting a one when drawing from Jar 0, from Jar 1, and from a random selection thereof.
Make sure that these estimates match your expectations.

In [None]:
seq_length = 10000

jar0_outcomes = jar_seq(0, seq_length)
epmf0 = np.bincount(jar0_outcomes) / seq_length

jar1_outcomes = jar_seq(1, seq_length)
epmf1 = np.bincount(jar1_outcomes) / seq_length

jar_outcomes, jar_labels = jar_rnd_seq(seq_length)
epmf = np.bincount(jar_outcomes) / seq_length

plt.bar([0, 1], epmf)
plt.show()
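
To check the estimates against expectations, the exact probabilities can be computed from the jar contents and the law of total probability. The sketch below assumes each jar is picked with probability 1/2, as in `jar_rnd_seq`; the names `jar_contents`, `p_one`, and `prior` are illustrative and not part of the assignment.

```python
from fractions import Fraction

# Jar contents as stated above: (number of zeros, number of ones)
jar_contents = {0: (192, 192), 1: (128, 256)}

# P(one | jar) is the fraction of balls marked one in that jar
p_one = {jar: Fraction(ones, zeros + ones)
         for jar, (zeros, ones) in jar_contents.items()}

# Assumption: each jar is selected with probability 1/2
prior = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Law of total probability: P(one) = sum over jars of P(jar) * P(one | jar)
p_one_mixture = sum(prior[jar] * p_one[jar] for jar in p_one)

print(p_one[0], p_one[1], p_one_mixture)  # 1/2 2/3 7/12
```

The empirical values `epmf0[1]`, `epmf1[1]`, and `epmf[1]` should land close to 1/2, 2/3, and 7/12, respectively.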

Write code that computes the empirical average of the observed sequence `jar_outcomes`, conditioned on the event that the samples come from Jar 0 or from Jar 1.
In other words, sift through the sequence and only retain the values of `jar_outcomes` corresponding to Jar 0, then compute the empirical average for these values.
Likewise, sift through the sequence and only retain the values of `jar_outcomes` corresponding to Jar 1, then compute the empirical average for these values.

In [None]:
seq_length = 10000
jar_outcomes, jar_labels = jar_rnd_seq(seq_length)

jar_outcomes_cond0 = []
jar_outcomes_cond1 = []
for idx in range(len(jar_labels)):
    if jar_labels[idx] == 0:
        jar_outcomes_cond0.append(jar_outcomes[idx])
    # EDIT
    #

epmf_cond0 = np.bincount(jar_outcomes_cond0) / len(jar_outcomes_cond0)
plt.bar([0, 1], epmf_cond0)
plt.show()
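
The sifting step can also be written with boolean masks instead of an explicit loop. The following is a minimal, self-contained sketch: `labels` and `outcomes` are hypothetical stand-ins for `jar_labels` and `jar_outcomes`, generated here with an arbitrary seed so that the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 10000

# Hypothetical stand-ins for jar_labels and jar_outcomes
labels = rng.integers(0, 2, size=n)  # 0 -> Jar 0, 1 -> Jar 1
p_one = np.where(labels == 1, 256 / 384, 192 / 384)
outcomes = rng.binomial(1, p_one)  # one draw per label

# A boolean mask keeps only the samples that came from the given jar
mean_cond0 = outcomes[labels == 0].mean()
mean_cond1 = outcomes[labels == 1].mean()
print(mean_cond0, mean_cond1)
```

With this many samples, the two conditional averages should be close to the probabilities of drawing a one from Jar 0 alone and from Jar 1 alone.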

Try to infer the relation between the empirical average obtained using the conditioning on Jar 0 and the empirical average obtained when drawing from Jar 0 only.

You are now asked to guess which jar (0 or 1) produced the observations you see.
In doing so, you wish to minimize the probability of being wrong.

  1. If you are given only one observation, what should be your decision rule to select Jar 0 or Jar 1?
  2. If you are given 16 observations, all of them coming from the same (unknown) jar, what should be your decision rule to decide whether all these observations came from Jar 0 or Jar 1?

Implement your decision rules below in Python.

In [None]:
# EDIT
#
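
One possible approach, offered as a sketch rather than the expected solution, is a maximum-likelihood rule: for each row of observations, pick the jar under which the observed zeros and ones are more probable. This assumes both jars are equally likely a priori and that draws are independent; `P_ONE` and `decide_jar` are hypothetical names.

```python
import numpy as np

# P(one | Jar 0) and P(one | Jar 1), from the jar contents
P_ONE = np.array([192 / 384, 256 / 384])

def decide_jar(observations):
    """Hypothetical helper: maximum-likelihood jar for each row.

    observations: 2-D array of zeros and ones, shape (rows, n),
    where all n entries of a row come from the same jar.
    """
    obs = np.asarray(observations)
    n = obs.shape[1]
    ones = obs.sum(axis=1)
    # Log-likelihood of each row under each jar, shape (rows, 2)
    loglik = (np.outer(ones, np.log(P_ONE))
              + np.outer(n - ones, np.log(1 - P_ONE)))
    return loglik.argmax(axis=1)

# Single observations: a zero favors Jar 0, a one favors Jar 1
print(decide_jar(np.array([[0], [1]])))  # [0 1]
```

For rows of 16 observations, the same function reduces to a threshold on the number of ones in the row.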


Apply your decision rule to the two data sets contained in this repository.
The first data set `input1.csv` contains one observation per line, and you should decide whether each line is coming from Jar 0 or Jar 1.
The second data set `input16.csv` contains 16 observations per line (drawn with replacement), all from the same jar; you should decide whether each line is coming from Jar 0 or Jar 1.
Write your decisions to CSV files in the prescribed format; in particular, the files should have the right names and structure.
Add and commit these files to your GitHub repository.

In [None]:
df_input1 = pd.read_csv('input1.csv', index_col=0)
df_input16 = pd.read_csv('input16.csv', index_col=0)

Observation1 = df_input1['O1'].to_numpy()
Observation16 = df_input16.to_numpy()
print(Observation1.shape)
print(Observation16.shape)

# EDIT
#

decision1 = np.zeros(len(Observation1)).astype(int)
data_decision1 = {'decision1': decision1}
df_decision1 = pd.DataFrame(data=data_decision1)
df_decision1.to_csv("output1.csv")

# EDIT
#

decision16 = np.zeros(len(Observation16)).astype(int)
data_decision16 = {'decision16': decision16}
df_decision16 = pd.DataFrame(data=data_decision16)
df_decision16.to_csv("output16.csv")

print(decision1.shape)
print(decision16.shape)

Would you have different decision rules if Jar 0 and Jar 1 were not equally likely?
If yes, how so?
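
One way to explore this question numerically is to weight each jar's likelihood by its prior probability and pick the larger product (a maximum a posteriori rule). The sketch below assumes independent draws; `ones_threshold` and `prior1` (the prior probability of Jar 1) are hypothetical names introduced for illustration.

```python
import numpy as np

def ones_threshold(n, prior1, p0=192 / 384, p1=256 / 384):
    """Hypothetical helper: smallest number of ones (out of n draws)
    for which the posterior of Jar 1 exceeds that of Jar 0.
    Returns None if Jar 1 is never preferred."""
    k = np.arange(n + 1)  # possible counts of ones
    log_post0 = np.log(1 - prior1) + k * np.log(p0) + (n - k) * np.log(1 - p0)
    log_post1 = np.log(prior1) + k * np.log(p1) + (n - k) * np.log(1 - p1)
    picks_jar1 = log_post1 > log_post0
    return int(k[picks_jar1][0]) if picks_jar1.any() else None

for prior1 in (0.1, 0.5, 0.9):
    print(prior1, ones_threshold(16, prior1))
```

Comparing the printed thresholds shows how the number of ones needed to decide for Jar 1 moves as Jar 1 becomes more or less likely a priori.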