# Experiment 1

## Problem Statement

Implement Probability Concepts on a Dataset.


## Dataset Link

https://www.kaggle.com/datasets/uciml/student-alcohol-consumption

## Code


In [67]:
# import libraries
import pandas as pd
import scipy.stats as stats
from scipy.stats import uniform
from scipy.stats import poisson

In [68]:
# reading the dataset
df = pd.read_csv('student-por.csv')

print(df.head().to_markdown())

|    | school   | sex   |   age | address   | famsize   | Pstatus   |   Medu |   Fedu | Mjob    | Fjob     | reason   | guardian   |   traveltime |   studytime |   failures | schoolsup   | famsup   | paid   | activities   | nursery   | higher   | internet   | romantic   |   famrel |   freetime |   goout |   Dalc |   Walc |   health |   absences |   G1 |   G2 |   G3 |
|---:|:---------|:------|------:|:----------|:----------|:----------|-------:|-------:|:--------|:---------|:---------|:-----------|-------------:|------------:|-----------:|:------------|:---------|:-------|:-------------|:----------|:---------|:-----------|:-----------|---------:|-----------:|--------:|-------:|-------:|---------:|-----------:|-----:|-----:|-----:|
|  0 | GP       | F     |    18 | U         | GT3       | A         |      4 |      4 | at_home | teacher  | course   | mother     |            2 |           2 |          0 | yes         | no       | no     | no           | yes       | yes      | no         | 

In [69]:
# getting length of the dataset
print(f"Length of the dataset: {len(df)}")

# getting the number of columns
print(f"Number of columns: {len(df.columns)}")

# getting the number of rows
print(f"Number of rows: {len(df.index)}")

Length of the dataset: 649
Number of columns: 33
Number of rows: 649


In [70]:
# Add a boolean column called grade_A noting if a student achieved 80% or higher as a final score. Original values are on a 0–20 scale so we multiply by 5.
df['grade_A'] = df['G3'] * 5 >= 80

In [71]:
# select the absences and G3 columns for the first 395 rows (a plain column subset, not an actual pivot table)
pivot = df.iloc[0:395][['absences', 'G3']]
print(pivot.head().to_markdown())

|    |   absences |   G3 |
|---:|-----------:|-----:|
|  0 |          4 |   11 |
|  1 |          2 |   11 |
|  2 |          6 |   12 |
|  3 |          0 |   14 |
|  4 |          0 |   13 |


In [72]:
# count students with absences > 10 and final grade G3 > 16
count = len(pivot[(pivot['absences'] > 10) & (pivot['G3'] > 16)])
print(f"Count of students with absences > 10 and G3 > 16: {count}")

# count students with absences > 10
count1 = len(pivot[pivot['absences'] > 10])
print(f"Count of students with absences > 10: {count1}")

# count students with final grade G3 > 16
count2 = len(pivot[(pivot['G3'] > 16)])
print(f"Count of students with G3 > 16: {count2}")

Count of students with absences > 10 and G3 > 16: 1
Count of students with absences > 10: 37
Count of students with G3 > 16: 26


In [73]:
# A = final grade G3 > 16, B = absences > 10
n = len(pivot)  # 395 rows in the subset

# calculate P(A ∩ B)
P_a_intersection_b = count/n
print(f"P(A ∩ B): {P_a_intersection_b}")

# calculate P(A)
P_a = count2/n
print(f"P(A): {P_a}")

# calculate P(B)
P_b = count1/n
print(f"P(B): {P_b}")

P(A ∩ B): 0.002531645569620253
P(A): 0.06582278481012659
P(B): 0.09367088607594937


In [74]:
# calculate P(A|B) = P(A ∩ B) / P(B)
P = P_a_intersection_b/P_b
print(f"P(A|B): {P}")

P(A|B): 0.02702702702702703
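
The counting steps above follow the definition P(A|B) = P(A ∩ B) / P(B). A minimal sketch of the same calculation on a tiny made-up sample is given here. The `records` list and the `conditional_probability` helper are hypothetical: they are not part of the dataset or of the notebook, and only illustrate the formula.

```python
# Hypothetical (absences, G3) pairs, invented purely for illustration.
records = [(12, 17), (15, 9), (0, 18), (11, 10),
           (2, 17), (20, 12), (4, 8), (14, 19)]


def conditional_probability(rows, event_a, event_b):
    """Estimate P(A|B) = P(A and B) / P(B) from a list of rows."""
    n = len(rows)
    p_b = sum(1 for r in rows if event_b(r)) / n
    p_a_and_b = sum(1 for r in rows if event_a(r) and event_b(r)) / n
    if p_b == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return p_a_and_b / p_b


def high_grade(row):
    # event A: final grade G3 above 16
    return row[1] > 16


def many_absences(row):
    # event B: more than 10 absences
    return row[0] > 10


p_a_given_b = conditional_probability(records, high_grade, many_absences)
print(f"P(A|B) on the toy sample: {p_a_given_b}")
```

In this toy sample 5 of the 8 rows satisfy B and 2 of those also satisfy A, so the estimate is 2/5.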


#### Question 1

**THE BIRTHDAY PROBLEM**

Compute the probability of at least one shared birthday in a random group of 23 people.
Because obtaining a random sample again and again is a tedious task, we can instead run simulations on a computer under the following assumptions.
The birthdays are independent of each other. Each possible birthday has the same probability. There are only 365 possible birthdays (not 366, as we ignore leap years).
Hint: in other words, we are modelling the process as drawing 23 independent samples from a discrete uniform distribution with parameter n = 365.


In [75]:
from random import randint


NUM_PEOPLE = 23
NUM_POSSIBLE_BIRTHDAYS = 365
NUM_TRIALS = 10000


def generate_random_birthday():
    birthday = randint(1, NUM_POSSIBLE_BIRTHDAYS)
    return birthday


def generate_k_birthdays(k):
    birthdays = [generate_random_birthday() for _ in range(k)]
    return birthdays


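def aloc(birthdays):
    """Return True if at least two of the given birthdays coincide."""
    # a set keeps only distinct values, so a shorter set means a repeat
    unique_birthdays = set(birthdays)

    num_birthdays = len(birthdays)
    num_unique_birthdays = len(unique_birthdays)
    has_coincidence = (num_birthdays != num_unique_birthdays)

    return has_coincidence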


def estimate_p_aloc():
    num_aloc = 0
    for _ in range(NUM_TRIALS):
        birthdays = generate_k_birthdays(NUM_PEOPLE)
        has_coincidence = aloc(birthdays)
        if has_coincidence:
            num_aloc += 1

    p_aloc = num_aloc / NUM_TRIALS
    return p_aloc


p_aloc = estimate_p_aloc()
print(f"Estimated P(ALOC) after {NUM_TRIALS} trials: {p_aloc}")

Estimated P(ALOC) after 10000 trials: 0.5102
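
As a check on the simulated estimate, the same probability can be computed exactly under the stated assumptions using the complement rule: the chance that all 23 birthdays are distinct is the product of (365 − i)/365 for i = 0, …, 22. The sketch below is an addition for comparison, and the function name `exact_p_shared_birthday` is illustrative.

```python
NUM_PEOPLE = 23
NUM_POSSIBLE_BIRTHDAYS = 365


def exact_p_shared_birthday(num_people, num_days):
    """Exact P(at least one shared birthday) via the complement rule."""
    p_all_distinct = 1.0
    for i in range(num_people):
        # person i must avoid the i birthdays already taken
        p_all_distinct *= (num_days - i) / num_days
    return 1.0 - p_all_distinct


p_exact = exact_p_shared_birthday(NUM_PEOPLE, NUM_POSSIBLE_BIRTHDAYS)
print(f"Exact P(at least one shared birthday): {p_exact:.4f}")
```

The exact value is about 0.5073, so a simulated estimate near 0.51 from 10000 trials is within ordinary sampling noise.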


#### Question 2

The weight of a certain species of frog is uniformly distributed between 15 and 25 grams. If you randomly select a frog, what is the probability that it weighs between 17 and 19 grams?


In [76]:
# uniform distribution

p_uniform = uniform.cdf(x=19, loc=15, scale=10) - \
    uniform.cdf(x=17, loc=15, scale=10)

print(f"Probability of a uniform distribution: {p_uniform}")

Probability of a uniform distribution: 0.2
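
For a continuous uniform distribution the answer is simply the length of the interval of interest divided by the length of the support, here (19 − 17) / (25 − 15). A small sketch of that formula without SciPy follows; the helper `uniform_interval_probability` is a hypothetical name introduced only for this illustration.

```python
def uniform_interval_probability(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b].

    The interval is clipped to the support, so any part of [lo, hi]
    that falls outside [a, b] contributes nothing.
    """
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (b - a)


p_manual = uniform_interval_probability(17, 19, 15, 25)
print(f"P(17 <= weight <= 19): {p_manual}")
```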


#### Question 3

**Telecommunication Industry**

According to the Telecommunication Industry, the average monthly cell phone bill is Rs. 1000 with a standard deviation of Rs. 200.

1. What is the probability that a randomly selected cell phone bill is more than Rs. 1200?
2. What is the probability that a randomly selected cell phone bill is between Rs. 750 and Rs. 1200?
3. What is the probability that a randomly selected cell phone bill is no more than Rs. 650?
4. What is the amount above which the top 15% of cell phone bills lie?
5. What is the amount below which the bottom 25% of cell phone bills lie?

Note: This is a normal probability distribution problem. Although the distribution is not stated, in the absence of any other information we assume the population is normally distributed.


In [77]:
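# normal distribution with mean 1000 and standard deviation 200
mu, sigma = 1000, 200

# P(X > 1200)
print(f"{1 - stats.norm.cdf(1200, mu, sigma):.4f}")

# P(750 < X < 1200)
print(f"{stats.norm.cdf(1200, mu, sigma) - stats.norm.cdf(750, mu, sigma):.4f}")

# P(X <= 650)
print(f"{stats.norm.cdf(650, mu, sigma):.4f}")

# amount above which the top 15% of bills lie (85th percentile)
print(f"{stats.norm.ppf(0.85, mu, sigma):.2f}")

# amount below which the bottom 25% of bills lie (25th percentile)
print(f"{stats.norm.ppf(0.25, mu, sigma):.2f}")

0.1587
0.7357
0.0401
1207.29
865.10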
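
As a cross-check on the five parts of Question 3, the same quantities can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8+). This is a sketch under the same normality assumption as the note above, and the variable names are illustrative. Note that the top 15% cutoff is the 85th percentile, not the 15th.

```python
from statistics import NormalDist

# Bills assumed normal with mean Rs. 1000 and standard deviation Rs. 200.
bills = NormalDist(mu=1000, sigma=200)

# P(X > 1200): upper tail, one standard deviation above the mean
p_above_1200 = 1 - bills.cdf(1200)

# P(750 < X < 1200): difference of two cumulative probabilities
p_between_750_1200 = bills.cdf(1200) - bills.cdf(750)

# P(X <= 650)
p_at_most_650 = bills.cdf(650)

# Amount exceeded by the top 15% of bills (85th percentile)
top_15_cutoff = bills.inv_cdf(0.85)

# Amount below which the bottom 25% of bills lie (25th percentile)
bottom_25_cutoff = bills.inv_cdf(0.25)

print(f"P(X > 1200)       = {p_above_1200:.4f}")
print(f"P(750 < X < 1200) = {p_between_750_1200:.4f}")
print(f"P(X <= 650)       = {p_at_most_650:.4f}")
print(f"Top 15% cutoff    = {top_15_cutoff:.2f}")
print(f"Bottom 25% cutoff = {bottom_25_cutoff:.2f}")
```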

#### Question 4

**Fruit problem**

Suppose we own a fruit shop and, on average, 3 customers arrive every 10 minutes. The mean rate is therefore λ = 3 per 10-minute interval. The Poisson distribution can answer questions such as: what is the probability that exactly 5 customers, or at most 5 customers, will arrive in the next 10 minutes?


In [81]:
# poisson distribution

# Probability of 5 or fewer customers, P(X <= 5), with λ = 3
p = poisson.cdf(5, 3)
print("Probability of 5 or fewer customers: ", p)

Probability of 5 or fewer customers:  0.9160820579686966
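
The cell above reports the cumulative probability P(X ≤ 5). The probability of exactly 5 arrivals is a different quantity, given by the Poisson mass function P(X = k) = e^(−λ) λ^k / k!. The sketch below computes both from that formula using only the standard library; the helpers `poisson_pmf` and `poisson_cdf` are hypothetical names written for this illustration and are not SciPy functions.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)


def poisson_cdf(k, lam):
    """P(X <= k), summing the mass function from 0 to k inclusive."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


LAMBDA = 3  # average arrivals per 10-minute interval

p_exactly_5 = poisson_pmf(5, LAMBDA)
p_at_most_5 = poisson_cdf(5, LAMBDA)

print(f"P(X = 5)  = {p_exactly_5:.4f}")
print(f"P(X <= 5) = {p_at_most_5:.4f}")
```

With λ = 3, exactly 5 arrivals has probability of roughly 0.10, while 5 or fewer has probability of roughly 0.92, matching the `poisson.cdf(5, 3)` result above.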
