# Statistics Advanced - 1


1. What is a random variable in probability theory?

- A random variable is a variable whose value is a numerical outcome of a random experiment or event. Formally, it is a function that assigns a number to each possible outcome of the experiment.

- There are two types:

* Discrete random variable - takes a finite or countably infinite set of values, such as the number of heads in three coin tosses.
* Continuous random variable - can take any value within a range, such as the height of students in a class.
2. What are the types of random variables?
- There are mainly two types of random variables in probability theory:
- Discrete random variable - Takes a finite or countably infinite number of possible values. Examples: number of goals in a football match, number of students present in class.
- Continuous random variable - Takes any value within a given range or interval, so its possible values are uncountably infinite. Examples: temperature in a city, height of people, time taken to finish a race.
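
The two types can be illustrated with a short simulation. This is a minimal sketch using only the Python standard library; the height model (normal with mean 170 cm and standard deviation 10 cm) and the function names are assumptions made for illustration, not part of the question.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def heads_in_three_tosses():
    """Discrete: number of heads in three fair coin tosses (0, 1, 2 or 3)."""
    return sum(random.random() < 0.5 for _ in range(3))

def random_height_cm():
    """Continuous: a height in cm, assumed here to follow Normal(170, 10)."""
    return random.gauss(170, 10)

discrete_samples = [heads_in_three_tosses() for _ in range(10)]
continuous_samples = [random_height_cm() for _ in range(5)]

print("Heads in three tosses:", discrete_samples)
print("Heights (cm):", [round(h, 2) for h in continuous_samples])
```

The discrete samples only ever land on the four values 0 to 3, while the continuous samples can fall anywhere on the real line around 170.
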
3. Explain the difference between discrete and continuous distributions.
- A discrete distribution describes the probabilities of outcomes of a discrete random variable. The values are separate and countable. Example: probability distribution of rolling a die (values 1 to 6). The probabilities are listed for each distinct outcome, and their sum equals 1.
- A continuous distribution describes the probabilities of outcomes of a continuous random variable. The variable can take any value in a range, so probabilities are described by a curve (the probability density function) instead of a simple list, and the total area under the curve equals 1. Example: distribution of heights in a population. The probability of any single exact value is zero; instead, we compute the probability of an interval as the area under the curve over that interval.
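
The contrast can be checked numerically. The sketch below uses the standard library only; the height distribution (normal with mean 170 and standard deviation 10) is a hypothetical model chosen for illustration.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete: a fair six-sided die has one listed probability per outcome.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(die_pmf.values())
print("Sum of die probabilities:", total_probability)

# Continuous: heights modeled (as an assumption) by Normal(mean=170, sd=10).
heights = NormalDist(mu=170, sigma=10)

# The probability of an interval is the area under the density curve,
# computed as a difference of cumulative distribution function values.
p_interval = heights.cdf(180) - heights.cdf(160)
print(f"P(160 <= height <= 180) = {p_interval:.4f}")

# An interval of zero width has zero area, so a single exact value has probability 0.
p_point = heights.cdf(170) - heights.cdf(170)
print("P(height == 170) =", p_point)
```

The die probabilities sum to exactly 1, the interval covering one standard deviation on each side of the mean has probability of about 0.68, and the single point has probability 0.
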
4. What is a binomial distribution, and how is it used in probability?
- A binomial distribution is a probability distribution that gives the probability of each possible number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (success or failure) and the probability of success is the same in every trial.
- It is used in situations such as:

* Predicting the number of heads when flipping a coin multiple times
* Estimating the number of defective items in a batch
* Calculating the probability of a certain number of people agreeing in a survey

- The formula for the probability of getting exactly k successes in n trials is:

- **P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)**
- where:
- n = number of trials
- k = number of successes
- p = probability of success in one trial
- C(n, k) = combination of n items taken k at a time
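
The formula translates directly into code. This is a minimal sketch with the standard library's `math.comb`; the function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 3 fair coin tosses.
p_two_heads = binomial_pmf(2, 3, 0.5)
print(p_two_heads)  # 0.375

# The probabilities over all possible k (0 to n) add up to 1.
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
print(round(total, 10))
```

For the coin example, C(3, 2) = 3, so the result is 3 × 0.25 × 0.5 = 0.375.
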
5. What is the standard normal distribution, and why is it important?
- The standard normal distribution is a special type of normal distribution that has a mean of 0 and a standard deviation of 1. Its shape is a symmetric bell curve centered at zero.
- It is important because:
* It allows us to easily compare different normal distributions by converting values into z-scores (standardized values).
* Tables and software for the standard normal make probability calculations easier.
* Many real-world phenomena follow or approximate a normal distribution, so standardizing helps in statistical inference, hypothesis testing, and confidence intervals.
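
Standardizing a value and looking up its probability can be sketched as follows, using `statistics.NormalDist` from the standard library. The exam scores (mean 70, standard deviation 10, observed score 85) are hypothetical numbers chosen for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mean) / std_dev

# Hypothetical exam: mean 70, standard deviation 10, observed score 85.
z = z_score(85, 70, 10)
p_below = standard_normal.cdf(z)
print(f"z = {z:.2f}, P(Z <= z) = {p_below:.4f}")

# The reverse lookup gives critical values, e.g. the 97.5th percentile
# used for two-sided 95% intervals.
z_975 = standard_normal.inv_cdf(0.975)
print(f"97.5th percentile of Z: {z_975:.2f}")
```

Here z = 1.5, which puts the score above roughly 93% of the distribution, and the 97.5th percentile is the familiar 1.96.
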
6. What is the Central Limit Theorem (CLT), and why is it critical in statistics?
- The Central Limit Theorem (CLT) states that when independent random samples are taken from any population with a finite mean and variance, the sampling distribution of the sample mean will approach a normal distribution as the sample size becomes large, regardless of the original population’s distribution.
- It is critical because:
* It justifies using normal probability methods (like z-tests and t-tests) for sufficiently large samples, even when the original data is not normally distributed
* It enables statistical inference for population parameters using sample statistics
* It provides the theoretical foundation for many statistical procedures in research and data analysis
- A common rule of thumb is that a sample size of 30 or more is usually enough for the CLT to give a good normal approximation, although strongly skewed or heavy-tailed populations can require larger samples.
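
The theorem can be seen in a small simulation. The sketch below draws repeated samples from an exponential population (strongly right-skewed, with mean 1 and standard deviation 1) and looks at the distribution of the sample means. The population, sample size, and number of repetitions are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential population with rate 1 (mean 1, sd 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

n = 30               # size of each sample
repetitions = 5000   # number of samples drawn
means = [sample_mean(n) for _ in range(repetitions)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
theoretical_se = 1 / n**0.5  # population sd divided by sqrt(n)

print(f"Mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (theory: {theoretical_se:.3f})")
```

The sample means cluster around the population mean with a spread close to σ/√n, even though the individual observations are far from normal. Plotting a histogram of `means` would show the bell shape.
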
7. What is the significance of confidence intervals in statistical analysis?
- A confidence interval is a range of values, calculated from sample data, that is likely to contain the true value of a population parameter with a certain level of confidence (such as 95% or 99%).
- Its significance in statistical analysis is:
* It provides an estimate of the parameter along with the margin of error, instead of just a single point estimate
* It reflects the uncertainty of the estimate, helping in better decision-making
* A wider interval suggests more uncertainty, while a narrower interval suggests more precision
* It is widely used in research, surveys, and quality control to express how reliable the results are
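- **Example:** If the 95% confidence interval for a population mean is (48, 52), we are 95% confident that the true mean lies between 48 and 52. More precisely, intervals constructed this way capture the true mean in about 95% of repeated samples.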
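
One way to read the confidence level is as the long-run hit rate of the procedure. The simulation below is a rough sketch of that idea: it builds many 95% intervals from samples of a population whose true mean is known and counts how often the interval contains it. The population parameters (mean 50, standard deviation 5), the sample size, and the number of trials are hypothetical.

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

TRUE_MEAN, TRUE_SD = 50, 5   # hypothetical population, known only to the simulation
n, trials = 40, 2000
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a two-sided 95% interval

covered = 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    sample_mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n**0.5
    lower = sample_mean - z * standard_error
    upper = sample_mean + z * standard_error
    if lower <= TRUE_MEAN <= upper:
        covered += 1

coverage = covered / trials
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should come out close to 0.95. It tends to sit slightly below that here because the sketch uses the normal critical value with an estimated standard deviation; a t critical value would correct for this at small sample sizes.
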
8. What is the concept of expected value in a probability distribution?
- The expected value in a probability distribution is the long-run average or mean value you would expect to get if you repeated an experiment many times under the same conditions.
- For a discrete random variable, it is calculated as:
- **E(X) = Σ [x × P(x)]**
- where:
- x = each possible value of the random variable
- P(x) = probability of that value occurring
- For a continuous random variable, the sum is replaced by an integral: **E(X) = ∫ x × f(x) dx**, where f(x) is the probability density function.
- It is important because it gives a single number summarizing the “center” of the distribution and is used in decision-making, risk assessment, and statistical analysis.
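
For the discrete case, the sum can be computed directly. This is a minimal sketch using exact fractions; the helper name `expected_value` and the two example distributions are chosen here for illustration.

```python
from fractions import Fraction

def expected_value(pmf):
    """E(X) = sum of x * P(x) over all values x in the distribution."""
    return sum(x * p for x, p in pmf.items())

# Fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expected_value(die_pmf)
print("E(die) =", die_mean)  # 7/2

# Number of heads in three fair coin tosses.
heads_pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
heads_mean = expected_value(heads_pmf)
print("E(heads in 3 tosses) =", heads_mean)  # 3/2
```

Note that the expected value of a die roll, 3.5, is not itself a possible outcome: it describes the long-run average, not a typical single result.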

9. Write a Python program to generate 1000 random numbers from a normal distribution with mean = 50 and standard deviation = 5. Compute its mean and standard deviation using NumPy, and draw a histogram to visualize the distribution.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate 1000 random numbers from a normal distribution
mean = 50
std_dev = 5
size = 1000

data = np.random.normal(mean, std_dev, size)

# Compute mean and standard deviation
calculated_mean = np.mean(data)
calculated_std_dev = np.std(data)

print("Calculated Mean:", calculated_mean)
print("Calculated Standard Deviation:", calculated_std_dev)

# Draw histogram
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title('Histogram of Normally Distributed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()


10. You are working as a data analyst for a retail company. The company has collected daily sales data for 2 years and wants you to identify the overall sales trend. daily_sales = [220, 245, 210, 265, 230, 250, 260, 275, 240, 255, 235, 260, 245, 250, 225, 270, 265, 255, 250, 260]
- Explain how you would apply the Central Limit Theorem to estimate the average sales with a 95% confidence interval.
- Write the Python code to compute the mean sales and its confidence interval.

- **Approach:** By the Central Limit Theorem, the mean of n independent daily sales figures is approximately normally distributed around the true average, with standard error s/√n, where s is the sample standard deviation. A 95% confidence interval is therefore: sample mean ± critical value × s/√n. The critical value is 1.96 under the normal approximation; with only 20 observations and a standard deviation estimated from the data, the t critical value with n − 1 = 19 degrees of freedom is the more cautious choice.

In [None]:
import numpy as np
import scipy.stats as stats

# Daily sales data
daily_sales = [220, 245, 210, 265, 230, 250, 260, 275, 240, 255,
               235, 260, 245, 250, 225, 270, 265, 255, 250, 260]

# Calculate mean and standard deviation
mean_sales = np.mean(daily_sales)
std_sales = np.std(daily_sales, ddof=1)  # sample std deviation
n = len(daily_sales)

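# Critical value for 95% confidence.
# With only n = 20 observations and the standard deviation estimated from
# the sample, the t distribution with n - 1 degrees of freedom is used
# instead of the normal value z = 1.96 (it gives a slightly wider interval).
t_crit = stats.t.ppf(0.975, df=n - 1)

# Margin of error
margin_error = t_crit * (std_sales / np.sqrt(n))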

# Confidence interval
lower_bound = mean_sales - margin_error
upper_bound = mean_sales + margin_error

print(f"Mean Sales: {mean_sales:.2f}")
print(f"95% Confidence Interval: ({lower_bound:.2f}, {upper_bound:.2f})")
