# Intro to Probability

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('Datasets/assessment_scores.csv')

In [None]:
df

In [None]:
df.shape

## Accessing Certain Data Values

Recall that, when all outcomes in a sample space *S* are equally likely, the probability of an event *A* is defined as

$$P(A) = \frac{\text{number of ways $A$ can occur}}{\text{number of ways $S$ can occur}}$$

Most of the problems involving probability and data frames can be reduced to 2 subproblems:

* How do you find the value of the numerator?
* How do you find the value of the denominator?

We need to know **how** to find the counts for each part of the fraction.

Much of this can be done with the `len()` function.

### Getting Exact Single Values

Suppose we want to find the probability that a randomly selected student is from Ohio.

We will need the following:

* Numerator &rarr; Number of instances of OH in the "state" column.
* Denominator &rarr; Total number of students in the data set

For the denominator, we simply need the number of rows in the data frame. The quickest way to get it is the `len` function. Since we will use this denominator in many problems, it helps to store the result in a variable.

In [None]:
denom = len(df)

In [None]:
denom
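As a quick check on what `len` counts, the sketch below builds a small made-up data frame (the name `toy` and its values are assumptions for illustration, not the real data set) and compares `len` with the first entry of `.shape`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real file.
toy = pd.DataFrame({
    "state": ["OH", "PA", "OH", "NY", "PA"],
    "score": [610, 350, 480, 720, 555],
})

rows_from_len = len(toy)         # len counts rows, not columns
rows_from_shape = toy.shape[0]   # shape is (rows, columns)

print(rows_from_len, rows_from_shape)
```

Both values agree, so either one could serve as the denominator.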

Now that we've taken care of the denominator, we need to figure out how many of the students are from Ohio. If we scrolled through the data frame and kept a running tally of how many rows are "OH," it would take far too much time. There *has* to be an easier way...

In Pandas, there are actually a few ways we can find how many of the students in the data set are from Ohio. 

However, regardless of which method we use, in Python when we are testing for equality, we use `==` and not `=`.

Also, since the state abbreviation "OH" is a **string**, we need to make sure to put the entire abbreviation inside a pair of single (or double) quotation marks.

As usual, feel free to use whichever method works best for you.
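Before looking at the methods, here is a small illustration of the two reminders about `==` and quotation marks. The variable names are made up for this example.

```python
state = "OH"                      # a single = stores the string "OH" in state

is_ohio = (state == "OH")         # a double == asks "are these equal?"
is_lowercase = (state == "oh")    # string comparison is case-sensitive
same_quotes = ('OH' == "OH")      # single and double quotes build the same string

print(is_ohio, is_lowercase, same_quotes)   # True False True
```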

#### Method 1: Using the df[df['column name'] == ] Approach

In this method, you type the name of the column (*don't forget the quotation marks*) inside the square brackets of `df[]`.

On the right side of the `==`, type the value that you are looking to find; in this case, "OH".

You then wrap this statement, `df['state'] == "OH"`, inside another `df[]`.

The inner statement is a condition that is checked for every row: the state either is "OH" (`True`) or is not "OH" (`False`). The outer `df[]` keeps only the rows where the condition is `True`. In essence, we are creating a smaller version of our data frame where we only see the rows in which the state value is "OH".

In [None]:
df[
    df['state'] == "OH"
]
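To see what the inner condition produces on its own, here is a minimal sketch on a made-up four-row data frame (the `toy` frame and its `student` column are assumptions for illustration only).

```python
import pandas as pd

toy = pd.DataFrame({
    "student": ["a", "b", "c", "d"],      # hypothetical column
    "state": ["OH", "PA", "OH", "NY"],
})

# The condition alone gives one True/False value per row.
mask = toy["state"] == "OH"
print(mask.tolist())    # [True, False, True, False]

# Passing those True/False values back into toy[] keeps only the True rows.
ohio_rows = toy[mask]
print(ohio_rows)
```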

If we want to know **how many** rows list the state as "OH", we just need to find the length of our new, smaller data frame. This is where the `len` function comes in handy.

First, it might make it easier to store our previous result as a variable. This way we don't have to type that `df[ yada yada ]` stuff again.

In [None]:
stateOH = df[
    df['state'] == "OH"
]

Now, let's find the length (*i.e.* the number of rows) of our new data frame. This will be our numerator.

In [None]:
numer = len(stateOH)

In [None]:
numer

We now have our numerator (30019) and our denominator (1500000), so we are ready to find the probability that a randomly selected student is from Ohio.

In [None]:
numer/denom
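The raw quotient has many decimal places. One possible way to tidy it up, using the two counts quoted above, is sketched here; the variable names are made up so they do not overwrite `numer` and `denom`.

```python
numer_ex = 30019      # count of "OH" rows quoted above
denom_ex = 1500000    # total number of rows quoted above

prob = numer_ex / denom_ex

print(round(prob, 4))     # 0.02
print(f"{prob:.2%}")      # 2.00%
```

So roughly 2% of the students in the data set are from Ohio.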

#### Method 2: Using the Dot Operator Approach

To get the numerator, instead of typing `df["state"]`, we could type `df.state`. However, we still need to put `df.state == "OH"` inside the square brackets of `df[]`.

In [None]:
len(
    df[df.state == "OH"]
)
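One caveat worth knowing: the dot form only works when the column name is a valid Python name (no spaces, for example). The sketch below uses a made-up data frame with a hypothetical `home state` column to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["OH", "PA", "OH"],
    "home state": ["OH", "OH", "PA"],   # hypothetical column with a space in its name
})

# For a simple name, the dot form and the bracket form give the same column.
same_column = toy.state.equals(toy["state"])
print(same_column)    # True

# A name with a space cannot follow a dot, so square brackets are required.
count_home_oh = len(toy[toy["home state"] == "OH"])
print(count_home_oh)  # 2
```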

#### Method 3: Using the df.query Approach

In this method, we use `df.query` to run a query on the `state` column. 

A **query** is a process of looking up values based on some condition(s).

When using `df.query()`, you will have to put your condition inside quotation marks.

The column name does not have to be in quotation marks, but the state name does. *Note*: I have added an unnecessary space between the double and single quotation marks at the end.

In [None]:
df.query('state == "OH" ')

And, just like before, to find the numerator, use the `len` function.

In [None]:
len(
    df.query('state == "OH" ')
)
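If the value you are looking for is stored in a Python variable, `df.query` lets you refer to it by putting `@` in front of the variable name. Here is a minimal sketch on made-up data; `toy` and `target` are names invented for this example.

```python
import pandas as pd

toy = pd.DataFrame({"state": ["OH", "PA", "OH", "NY"]})

target = "OH"   # the value we want to match

# @target tells query to use the Python variable named target.
matches = toy.query("state == @target")
count = len(matches)

print(count)    # 2
```

This can save some juggling of nested quotation marks when the same query is reused for different states.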

### Comparison Operators: ==, !=, <, >, <=, >=
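Each of the operators in the heading above can take the place of `==` inside `df[]`. The sketch below tries three of them on a made-up data frame; the `score` column and its values are assumptions for illustration, not columns from the real data set.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["OH", "PA", "OH", "NY"],
    "score": [610, 350, 480, 720],    # hypothetical score column
})

n = len(toy)   # denominator: total number of rows

p_greater = len(toy[toy["score"] > 600]) / n      # scores 610 and 720
p_at_most = len(toy[toy["score"] <= 350]) / n     # score 350 only
p_not_oh = len(toy[toy["state"] != "OH"]) / n     # PA and NY

print(p_greater, p_at_most, p_not_oh)   # 0.5 0.25 0.5
```

Note that the numbers are not wrapped in quotation marks, because scores are numeric rather than strings.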

#### Exercise 1:

Find the probability that a randomly selected student's `ela09` score is greater than 600.

#### Exercise 2:

Find the probability that a randomly selected student's `math10` score is less than or equal to 350.

#### Exercise 3:

Find the probability that a randomly selected student is **not** from Pennsylvania.