# Practice Notebook: Probability and Bayes' Theorem

In [9]:
# Standard Imports
import pandas as pd
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

# Calculating Probabilities: Questions

### 🧠 Knowledge Check

### 1) AND Question:

What is the probability of rolling a 5 on a fair die _and_ getting a tails on a fair coin toss?

**Your answer here**:

- 



<details>
    <summary>Answer</summary>

We're checking for the intersection of these sets. Of the six possible outcomes of a die roll, only one (the 5) will do, so the chance of rolling a 5 is 1/6. Of the two possible outcomes of a coin toss, again only one (tails) will do, so the chance of getting tails is 1/2.

The die roll and the coin toss are independent, so we multiply the two probabilities: $$\large P(5 \cap tails) = \left(\frac{1}{6}\right) \times \left(\frac{1}{2}\right) = \frac{1}{12}$$
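
As a sanity check, the same answer can be reached by brute-force enumeration. The sketch below lists all 12 equally likely die-coin outcomes and counts the ones that match. The variable names (`sample_space`, `favorable`, `p_five_and_tails`) are illustrative and not part of the original notebook.

```python
from fractions import Fraction
from itertools import product

# All equally likely (die, coin) pairs: 6 * 2 = 12 outcomes.
sample_space = list(product(range(1, 7), ["heads", "tails"]))

# Outcomes where the die shows 5 AND the coin shows tails.
favorable = [(die, coin) for die, coin in sample_space
             if die == 5 and coin == "tails"]

p_five_and_tails = Fraction(len(favorable), len(sample_space))
print(p_five_and_tails)  # 1/12
```
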
</details>

### 2) OR Question: 

What is the probability of rolling a 5 on a die _or_ getting a tails on a coin toss?

**Your answer here**:

- 



<details>
    <summary>Answer</summary>
    
We're now checking for the union of these sets. Here we want to count every die-coin combination with a 5 on the die, plus every die-coin combination with tails on the coin.

$$\large P(5 \cup tails) $$

**BUT:**

If the die shows 5, there are two possibilities: 5-heads and **5-tails**.

If the coin shows tails, there are six possibilities: 1-tails, 2-tails, 3-tails, 4-tails, **5-tails**, and 6-tails.

Adding these two lists together counts the one combination where **both** the 5 and the tails occur (5-tails) **twice**.

So the correct calculation is the sum of the individual probabilities **less the probability of their intersection**:

$$\large P(5 \cup tails) = \frac{1}{6} + \frac{1}{2} - \left(\frac{1}{6} \times \frac{1}{2}\right) = \frac{7}{12} $$
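
The double-counting argument can also be checked by enumeration. This is a minimal sketch that builds the two events as Python sets and compares the directly counted union with the sum-minus-intersection formula. The helper `prob` and the set names are hypothetical, introduced only for illustration.

```python
from fractions import Fraction
from itertools import product

# The 12 equally likely (die, coin) outcomes.
sample_space = set(product(range(1, 7), ["heads", "tails"]))

# The two events as sets of outcomes.
fives = {outcome for outcome in sample_space if outcome[0] == 5}
tails = {outcome for outcome in sample_space if outcome[1] == "tails"}

def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event), len(sample_space))

# Count the union directly, then rebuild it with inclusion-exclusion.
p_union = prob(fives | tails)
p_incl_excl = prob(fives) + prob(tails) - prob(fives & tails)
print(p_union, p_incl_excl)  # 7/12 7/12
```
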
    </details>

# Practice Problems: Using Probability in Python

## Enough Talk - Let's Explore in Python!

### Mushroom dataset

Let's look at a modified version of the Mushroom dataset from UCI [here](https://archive.ics.uci.edu/ml/datasets/Mushroom). Each row in this dataset corresponds to one observation (one mushroom). 

In [8]:
df = pd.read_csv('data/Mushrooms_cleaned.csv')

# EDA Goes here

'''
EDA methods:

DataFrames (2-D)

.info()
    information about the columns, including non-null counts
.describe()
    summary statistics for the numeric columns
.head()
    first 5 rows
.tail()
    last 5 rows
.dtypes
    data type of each column
.shape
    (rows, columns)

Series (1-D, e.g. a single column of a DataFrame)

.value_counts()
    returns each distinct value and the number of times it appears in the column

.sort_values()
    sorts the values

'''
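
To see what these methods return without needing the CSV file, here is a small sketch on a made-up DataFrame. The rows in `toy` are hypothetical and only borrow a few column names from the dataset; they are not real observations.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "edible-poisonous": ["edible", "poisonous", "edible", "edible"],
    "gill-spacing": ["close", "close", "crowded", "close"],
    "bruised": [True, False, True, False],
})

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # data type of each column
print(toy.head(2))  # first 2 rows
toy.info()          # column names, non-null counts, dtypes

# include="all" also summarizes the non-numeric columns.
print(toy.describe(include="all"))

# Series methods on a single column.
counts = toy["edible-poisonous"].value_counts()
print(counts)
print(counts.sort_values())  # ascending by count
```

Running the same calls on `df` is a reasonable first pass at the EDA cell above.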

#### 1) If you picked a row from this dataset at random, what is the probability it corresponds to a bruised mushroom? 

In other words, find $P(bruised)$

In [None]:
df['edible-poisonous'].value_counts(normalize=True)

edible       0.517971
poisonous    0.482029
Name: edible-poisonous, dtype: float64

In [None]:
df['bruised'].value_counts(normalize=True)

False    0.584441
True     0.415559
Name: bruised, dtype: float64

In [None]:
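# Total number of rows, then the number of rows for bruised mushrooms
print(len(df.index))
print(len(df.loc[df['bruised'] == True].index))

# P(bruised) = bruised rows / all rows
len(df.loc[df['bruised'] == True]) / len(df)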

8124
3376


0.4155588380108321

In [None]:
# Another way: count rows with .shape[0] instead of len()
p_bruised = df[df['bruised'] == True].shape[0]/df.shape[0]
p_bruised

0.4155588380108321
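
A shorter route, assuming the `bruised` column is stored as booleans (as the `True`/`False` counts above suggest): the mean of a boolean column is the fraction of `True` values. The sketch below uses a made-up five-row frame named `toy` rather than the real data.

```python
import pandas as pd

# Hypothetical data: 2 of 5 mushrooms are bruised.
toy = pd.DataFrame({"bruised": [True, False, True, False, False]})

# Counting rows, as in the cells above.
p_by_count = len(toy[toy["bruised"]]) / len(toy)

# True counts as 1 and False as 0, so the mean is the proportion of True.
p_by_mean = toy["bruised"].mean()

print(p_by_count, p_by_mean)  # 0.4 0.4
```
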

In [None]:
# Pick one row at random
df.sample(1)

| Unnamed: 0 | edible-poisonous | gill-spacing | stalk-shape | stalk-color-above-ring | stalk-color-below-ring | gill-color | bruised |
|---|---|---|---|---|---|---|---|
| 2698 | edible | close | tapering | white | white | white | True |


#### 2) What is the probability you pick a row corresponding to a mushroom that is bruised _AND_ edible?

$P(edible \cap bruised)$

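But are they independent events? If they are not, we can't simply multiply $P(edible)$ by $P(bruised)$, so we count the matching rows directly.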

In [None]:
p_bruised_and_edible = df[(df['bruised'] == True) & (df['edible-poisonous'] == 'edible')].shape[0]/df.shape[0]

p_bruised_and_edible

0.33874938453963566
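
All four joint probabilities can be read off at once from a normalized contingency table. The sketch below uses `pd.crosstab` with `normalize="all"` on a made-up six-row frame; the data and the names `toy`, `joint`, and `p_by_mask` are hypothetical. The boolean column is mapped to string labels only to make the table easier to read and index.

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "bruised": [True, True, False, False, True, False],
    "edible-poisonous": ["edible", "edible", "poisonous",
                         "edible", "poisonous", "poisonous"],
})

# Readable row labels for the table.
bruised_label = toy["bruised"].map({True: "bruised", False: "not bruised"})

# normalize="all" divides every cell by the total number of rows,
# so each cell is a joint probability.
joint = pd.crosstab(bruised_label, toy["edible-poisonous"], normalize="all")
print(joint)

# Same quantity via a combined boolean mask, as in the cell above.
p_by_mask = (toy["bruised"] & (toy["edible-poisonous"] == "edible")).mean()
print(joint.loc["bruised", "edible"], p_by_mask)
```
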

Are being bruised and being edible independent of each other?

> Formally, $A$ and $B$ are *independent* if and only if the probability that *both* $A$ *and* $B$ happen is:
> 
> $$\large P(A \cap B) = P(A) \times P(B)$$

In [None]:
p_bruised

0.4155588380108321

In [None]:
p_edible = len(df[df['edible-poisonous'] == 'edible'])/len(df)
p_edible

0.517971442639094

In [None]:
p_bruised * p_edible

0.21524761082589627

In [None]:
p_bruised_and_edible == p_bruised * p_edible

False
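
Here the two sides differ by a wide margin (about 0.339 versus 0.215), so being bruised and being edible are not independent in this dataset. In general, though, exact `==` on floats can report `False` purely because of rounding, so a tolerance-based comparison is safer. The sketch below reuses the values printed in the cell outputs above, compares them with `math.isclose`, and also computes the conditional probability $P(edible \mid bruised) = P(edible \cap bruised) / P(bruised)$. The tolerance `rel_tol=1e-9` is an arbitrary choice for illustration.

```python
import math

# Values copied from the cell outputs above.
p_bruised = 0.4155588380108321
p_edible = 0.517971442639094
p_bruised_and_edible = 0.33874938453963566

# Tolerance-based check of P(A and B) == P(A) * P(B).
independent = math.isclose(p_bruised_and_edible, p_bruised * p_edible,
                           rel_tol=1e-9)
print(independent)  # False

# If the events were independent, this would equal P(edible).
p_edible_given_bruised = p_bruised_and_edible / p_bruised
print(round(p_edible_given_bruised, 3))  # 0.815
```

Knowing that a mushroom is bruised raises the probability that it is edible from about 0.52 to about 0.82, which is another way of seeing the dependence.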