**Question 1: Define the z-statistic and explain its relationship to the standard normal distribution. How is the z-statistic used in hypothesis testing?**

The z-statistic is a measure of how many standard deviations a data point is from the mean of a distribution. It's calculated as:

z = (x - μ) / σ

Where:

x is the data point

μ is the population mean

σ is the population standard deviation

The relationship to the standard normal distribution is crucial: When a data point comes from a normally distributed population, its corresponding z-score follows a standard normal distribution (mean of 0 and standard deviation of 1). This allows us to use z-tables or statistical software to find probabilities associated with specific z-scores.

In hypothesis testing, the z-statistic tells us how far the observed sample statistic lies from the value stated in the null hypothesis. For a sample mean x̄ from a sample of size n, the denominator is the standard error rather than σ itself: z = (x̄ - μ₀) / (σ / √n), where μ₀ is the mean under the null hypothesis. We compare this z-score to a critical z-value determined by our chosen significance level (alpha), for example ±1.96 for a two-tailed test at alpha = 0.05. If the calculated z-score falls in the critical region (beyond the critical z-value), we reject the null hypothesis; otherwise we fail to reject it.
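The critical-value comparison can be sketched with the standard library's `statistics.NormalDist`. For a sample mean, the z-statistic divides by the standard error σ/√n. The numbers below (sample mean 52, hypothesized mean 50, σ = 6, n = 36) are made up for illustration, and the helper name `z_test_two_sided` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, critical value, reject?) for a two-sided one-sample z-test."""
    standard_error = sigma / sqrt(n)               # spread of the sample mean
    z = (sample_mean - mu0) / standard_error       # distance from H0 in standard errors
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # upper critical value
    return z, z_crit, abs(z) > z_crit

# Illustrative inputs only; they do not come from the text above.
z, z_crit, reject = z_test_two_sided(sample_mean=52, mu0=50, sigma=6, n=36)
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

Here the standard error is 6 / √36 = 1, so z is 2.0, which lies just beyond the two-sided critical value of about 1.96.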

**Question 2: What is a p-value, and how is it used in hypothesis testing? What does it mean if the p-value is very small (e.g., 0.01)?**

The p-value is the probability of observing our sample data (or more extreme data) if the null hypothesis is true. It's a measure of the evidence against the null hypothesis.

In hypothesis testing, we compare the p-value to our significance level (alpha). If the p-value is less than or equal to alpha, we reject the null hypothesis.

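A very small p-value (e.g., 0.01) indicates strong evidence against the null hypothesis. It means that, if the null hypothesis were true, data at least as extreme as ours would occur only about 1% of the time. The p-value is not the probability that the null hypothesis is true; it only says the observed data would be unusual under the null, which is why we reject the null at common significance levels such as 0.05.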
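One way to see what a p-value of 0.01 means is to simulate a world in which the null hypothesis really is true and count how often such a small p-value turns up. The sketch below is a rough illustration using only the standard library; the simulation size, the seed, and the helper name `two_sided_p` are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable
std_normal = NormalDist()

def two_sided_p(z):
    """Two-sided p-value for a z-statistic."""
    return 2 * (1 - std_normal.cdf(abs(z)))

# When H0 is true, a z-statistic is just a draw from the standard normal.
n_sims = 20_000
p_values = [two_sided_p(random.gauss(0, 1)) for _ in range(n_sims)]

fraction_small = sum(p <= 0.01 for p in p_values) / n_sims
print(f"Share of simulations with p <= 0.01: {fraction_small:.4f}")
```

The share should come out close to 0.01: under a true null, results this extreme are rare but not impossible.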

**Question 3: Compare and contrast the binomial and Bernoulli distributions.**

Both the binomial and Bernoulli distributions deal with discrete, binary outcomes (success/failure).

Bernoulli: Represents a single trial with two possible outcomes. It has one parameter, p, representing the probability of success.

Binomial: Represents the number of successes in a fixed number (n) of independent Bernoulli trials. It has two parameters: n (number of trials) and p (probability of success on each trial).

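Essentially, a binomial random variable is the sum of n independent Bernoulli random variables that share the same success probability p. Conversely, the Bernoulli distribution is the special case of the binomial with n = 1.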
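The link between the two distributions can be checked numerically. The sketch below builds the distribution of a sum of Bernoulli trials by repeated convolution and compares it with the binomial formula C(n, k) p^k (1 - p)^(n - k). The values n = 5 and p = 0.3 are arbitrary, and the helper names are hypothetical.

```python
from math import comb

def convolve(a, b):
    """Distribution of the sum of two independent counts with PMFs a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.3            # illustrative values
bernoulli = [1 - p, p]   # P(X = 0), P(X = 1)

sum_pmf = [1.0]          # an empty sum equals 0 with probability 1
for _ in range(n):
    sum_pmf = convolve(sum_pmf, bernoulli)

formula_pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k in range(n + 1):
    print(f"k = {k}: sum of Bernoullis = {sum_pmf[k]:.5f}, binomial formula = {formula_pmf[k]:.5f}")
```

The two columns agree up to floating-point rounding.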

**Question 4: Under what conditions is the binomial distribution used, and how does it relate to the Bernoulli distribution?**

The binomial distribution is used when the following conditions are met:

Fixed number of trials (n).

Each trial is independent.

Each trial has only two possible outcomes (success/failure).

The probability of success (p) is the same for each trial.

As mentioned above, the binomial distribution arises from the sum of n independent Bernoulli random variables, each with the same success probability p.

**Question 5: What are the key properties of the Poisson distribution, and when is it appropriate to use this distribution?**

The Poisson distribution models the probability of a given number of events occurring in a fixed interval of time or space. Key properties:

Events occur independently of each other.

The average rate (λ) of events is constant over the interval.

The mean and the variance of the count are both equal to λ.

It's appropriate to use the Poisson distribution when:

You're counting the number of events.

Events are rare in the sense that the probability of an event in a very short interval is small, so two events essentially never occur at exactly the same instant.

Events occur independently.

The average rate of events is constant.

Examples: number of customers arriving at a store in an hour, number of typos on a page.
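As a rough sketch of the customer-arrival example, the code below evaluates the Poisson PMF, P(X = k) = e^(-λ) λ^k / k!, with the standard library. The rate of 4 customers per hour is an assumed value for illustration, and `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 4.0  # assumed average of 4 customers per hour

# Truncate the infinite support; the mass beyond k = 59 is negligible for lam = 4.
probs = [poisson_pmf(k, lam) for k in range(60)]

mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
p_at_most_2 = sum(probs[:3])

print(f"P(at most 2 customers in an hour) = {p_at_most_2:.4f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```

The computed mean and variance both match λ, which is a quick diagnostic in practice: count data whose variance is far larger than its mean is usually not well described by a Poisson model.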

**Question 6: Define the terms "probability distribution" and "probability density function" (PDF). How does a PDF differ from a probability mass function (PMF)?**

Probability Distribution: A function that describes the likelihood of all possible outcomes of a random variable.

Probability Density Function (PDF): Used for continuous random variables. It describes the relative likelihood of a random variable taking on a given value. The area under the PDF curve over a given interval represents the probability that the variable falls within that interval.

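Probability Mass Function (PMF): Used for discrete random variables. It gives the probability that a random variable takes on a specific value. The key difference is that PMF values are themselves probabilities and sum to 1, whereas PDF values are densities: they can exceed 1, the probability of any single exact value is 0, and only areas under the curve are probabilities.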
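The contrast between a density and a mass function shows up clearly in a small numeric example. The sketch below uses a narrow normal distribution (mean 0, standard deviation 0.1) and a fair six-sided die, both chosen purely for illustration.

```python
from statistics import NormalDist

# Continuous case: a narrow normal distribution.
narrow = NormalDist(mu=0.0, sigma=0.1)
density_at_mean = narrow.pdf(0.0)                        # a density, not a probability
prob_within_one_sd = narrow.cdf(0.1) - narrow.cdf(-0.1)  # area under the curve
prob_exact_value = narrow.cdf(0.0) - narrow.cdf(0.0)     # zero-width interval

# Discrete case: a fair six-sided die.
die_pmf = {face: 1 / 6 for face in range(1, 7)}
total_mass = sum(die_pmf.values())

print(f"PDF at the mean: {density_at_mean:.3f} (greater than 1 is allowed)")
print(f"P(-0.1 <= X <= 0.1): {prob_within_one_sd:.4f}")
print(f"P(X = 0 exactly): {prob_exact_value}")
print(f"PMF of rolling a 3: {die_pmf[3]:.4f}, total mass: {total_mass:.4f}")
```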

**Question 7: Explain the Central Limit Theorem (CLT) with an example.**

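The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, regardless of the shape of the original population distribution, provided the observations are independent and the population has a finite variance. The sample means are centered on the population mean μ, with a standard deviation (the standard error) of σ / √n.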

Example: Imagine rolling a single six-sided die many times. The distribution of individual rolls is uniform (equal probability for each outcome). However, if you repeatedly take samples of, say, 30 dice rolls and calculate the average of each sample, the distribution of those sample averages will be approximately normal, centered around the mean of 3.5.
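The dice example can be simulated directly. The sketch below draws 5,000 samples of 30 rolls each (the number of samples and the seed are arbitrary choices) and compares the spread of the sample averages with the theoretical standard error, using the fact that a single fair die roll has variance 35/12.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

sample_size = 30    # rolls per sample, as in the example
n_samples = 5_000   # number of repeated samples (arbitrary)

sample_means = [
    sum(random.randint(1, 6) for _ in range(sample_size)) / sample_size
    for _ in range(n_samples)
]

# Theory: mean 3.5, standard error sqrt((35 / 12) / 30).
theoretical_se = sqrt((35 / 12) / sample_size)
within = sum(abs(m - 3.5) <= 1.96 * theoretical_se for m in sample_means) / n_samples

print(f"Mean of sample means: {mean(sample_means):.3f} (theory: 3.5)")
print(f"Std dev of sample means: {stdev(sample_means):.3f} (theory: {theoretical_se:.3f})")
print(f"Share within 1.96 standard errors of 3.5: {within:.3f} (normal theory: about 0.95)")
```

Although each individual roll is uniform, roughly 95% of the sample averages land within 1.96 standard errors of 3.5, as a normal distribution would predict.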

**Question 8: Compare z-scores and t-scores. When should you use a z-score, and when should a t-score be applied instead?**

Both z-scores and t-scores are used in hypothesis testing to determine how far a sample statistic is from a hypothesized population parameter, measured in standard error units.

z-score: Used when the population standard deviation (σ) is known.

t-score: Used when the population standard deviation (σ) is unknown and is estimated using the sample standard deviation (s). The t-distribution is similar to the standard normal distribution but has heavier tails, especially with smaller sample sizes. As the sample size increases, the t-distribution approaches the standard normal distribution.
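The heavier tails of the t-distribution translate into larger critical values, which shrink toward the z critical value as the sample grows. The sketch below compares two-tailed critical values at alpha = 0.05 using `scipy.stats`; the sample sizes are arbitrary.

```python
import scipy.stats as st

alpha = 0.05
z_crit = st.norm.ppf(1 - alpha / 2)  # two-tailed critical value from the standard normal

t_crits = {}
for n in (5, 15, 30, 100, 1000):
    # With sigma estimated by s, the reference distribution is t with n - 1 degrees of freedom.
    t_crits[n] = st.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n = {n:>4}: t critical = {t_crits[n]:.3f}  (z critical = {z_crit:.3f})")
```

For small samples the t critical value is noticeably larger, so the same observed difference is less likely to be declared significant; by n = 1000 the two are nearly identical.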

**Question 9: Given a sample mean of 105, a population mean of 100, a standard deviation of 15, and a sample size of 25, calculate the z-score and p-value. Based on a significance level of 0.05, do you reject or fail to reject the null hypothesis?**

```python
import scipy.stats as st

sample_mean = 105
population_mean = 100    # mean under the null hypothesis
population_stddev = 15
sample_size = 25
alpha = 0.05

# z-score: distance of the sample mean from the null mean, in standard errors
standard_error = population_stddev / (sample_size**0.5)
z = (sample_mean - population_mean) / standard_error

# Two-tailed p-value: probability of a z at least this extreme in either direction
p_value = 2 * (1 - st.norm.cdf(abs(z)))

print(f"Z-score: {z}")
print(f"P-value: {p_value}")

# Reject H0 when the p-value is at or below the significance level
if p_value <= alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

```
Z-score: 1.6666666666666667
P-value: 0.09558070454562939
Fail to reject the null hypothesis
```
