# Statistics Basics - Assignment Questions

## 1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.

**Qualitative Data:** Categorical data that describes attributes or labels rather than amounts. Examples: eye color, gender.

**Quantitative Data:** Numeric data that can be counted or measured. Examples: age, height.

- **Nominal Scale:** Categories without a meaningful order (e.g., blood type, car brands).
- **Ordinal Scale:** Ordered categories whose gaps are not necessarily equal (e.g., satisfaction level: low, medium, high).
- **Interval Scale:** Ordered data with meaningful differences, but no true zero (e.g., temperature in Celsius).
- **Ratio Scale:** Has a true zero, allows for meaningful ratios (e.g., height, weight).
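
The difference between the interval and ratio scales, and the ordering that the ordinal scale adds, can be illustrated with a small sketch. The temperatures and the satisfaction labels below are hypothetical values chosen only for illustration.

```python
# Interval vs ratio: Celsius has an arbitrary zero, Kelvin has a true zero.
def celsius_to_kelvin(c):
    return c + 273.15

c_low, c_high = 10.0, 20.0
celsius_ratio = c_high / c_low  # 2.0, but 20 C is not "twice as hot" as 10 C
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)  # about 1.035

# Ordinal: categories can be ranked, but the gaps between ranks are undefined.
satisfaction_order = ["low", "medium", "high"]  # hypothetical labels
responses = ["high", "low", "medium", "high"]
ranked = sorted(responses, key=satisfaction_order.index)

print(celsius_ratio, round(kelvin_ratio, 3), ranked)
```

Ratios only become meaningful once the values are expressed on a scale with a true zero, which is why the Kelvin ratio is the one with a physical interpretation.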

## 2. What are the measures of central tendency, and when should you use each?

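**Mean:** The arithmetic average of a dataset. It works best for roughly symmetric numeric data without extreme outliers, because every value, including an outlier, pulls on it.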

**Median:** The middle value, used when data is skewed or contains outliers.

**Mode:** The most frequent value, useful for categorical data.
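
A minimal sketch of how the three measures react to an outlier, using Python's standard `statistics` module. The income figures and colour labels are made-up values for illustration.

```python
import statistics

# Hypothetical incomes in thousands; 400 is an extreme outlier.
incomes = [30, 32, 35, 35, 38, 40, 400]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # middle value, unaffected by the outlier
mode_income = statistics.mode(incomes)      # most frequent value

# The mode also works for categorical data, where a mean is undefined.
colours = ["red", "blue", "red", "green"]
mode_colour = statistics.mode(colours)

print(mean_income, median_income, mode_income, mode_colour)
```

Here the mean is about 87 while the median and mode are both 35, so the median describes a typical income far better than the mean does.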

## 3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?

Dispersion refers to how spread out the data points are.

**Variance:** Measures the average squared deviation from the mean.

**Standard Deviation:** The square root of variance, measuring data spread in original units.

In [None]:
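import numpy as np

data = [10, 12, 23, 23, 16, 23, 21, 16]
# np.var and np.std default to ddof=0 (population formulas); pass ddof=1 for sample estimates.
variance = np.var(data)  # 24.0
std_dev = np.std(data)   # about 4.899
variance, std_dev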
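
To make the definitions concrete, the same numbers can be worked through by hand. This sketch uses only the standard library and shows both the population formula (divide by n) and the sample formula (divide by n - 1).

```python
import math
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]
n = len(data)
mean = sum(data) / n  # 18.0
squared_deviations = [(x - mean) ** 2 for x in data]  # these sum to 192

population_variance = sum(squared_deviations) / n    # divide by n: 24.0
sample_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1: about 27.43
population_std = math.sqrt(population_variance)      # about 4.90

# The standard library agrees with the hand calculation.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))
```

The sample variance is slightly larger because dividing by n - 1 compensates for estimating the mean from the same data.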

## 4. What is a box plot, and what can it tell you about the distribution of data?

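A box plot (box-and-whisker plot) summarizes the distribution of a dataset with five statistics: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The box spans Q1 to Q3, so its length is the interquartile range, and the position of the median inside the box hints at skewness. Many plotting libraries, including Matplotlib by default, draw the whiskers only as far as 1.5 × IQR beyond the box and mark points outside that range individually, which makes potential outliers easy to spot.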

In [None]:
import matplotlib.pyplot as plt

# Reuses `data` from the cell in question 3.
plt.boxplot(data)
plt.title('Box Plot')
plt.show()
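
The five statistics behind the plot can also be computed directly. This is a sketch using the standard library; the `"inclusive"` method is chosen on the assumption that it interpolates the same way as NumPy's default percentile method, so the quartiles match the values used elsewhere in this notebook.

```python
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, median, q3, max(data))

print(five_number_summary)  # (10, 15.0, 18.5, 23.0, 23)
```

Q3 equals the maximum here, so the upper whisker of the box plot has zero length for this dataset.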

## 5. Discuss the role of random sampling in making inferences about populations.

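Simple random sampling gives every individual in the population an equal chance of being selected. This reduces selection bias, makes the sample more likely to be representative, and allows sample statistics (such as the sample mean) to be used as estimates of population parameters with quantifiable uncertainty.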
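
The effect of random selection can be sketched with a small simulation. The population (the integers 1 to 1000), the sample size, and the seed are all assumptions made for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

population = list(range(1, 1001))  # hypothetical population: the values 1..1000
population_mean = statistics.mean(population)  # 500.5

# Simple random sample: every member has the same chance of selection.
random_sample = random.sample(population, 200)
random_mean = statistics.mean(random_sample)

# Convenience sample: only the first 200 members, which are all small values.
convenience_sample = population[:200]
convenience_mean = statistics.mean(convenience_sample)  # 100.5

print(population_mean, random_mean, convenience_mean)
```

The random sample's mean lands close to the population mean, while the convenience sample is badly biased because it only ever sees one part of the population.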

## 6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?

**Skewness** measures asymmetry in a distribution:

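- **Positive Skew (Right-Skewed):** Tail extends to the right; typically mean > median.
- **Negative Skew (Left-Skewed):** Tail extends to the left; typically mean < median.
- **Zero Skewness:** Symmetric distribution; mean and median coincide.

In skewed data the mean is pulled toward the long tail, so the median is usually the better summary of a typical value.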

In [None]:
from scipy.stats import skew

# Reuses `data` from question 3; the result is about -0.36, a mild left skew.
skewness = skew(data)
skewness
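
For reference, the coefficient can be reproduced by hand. The sketch below computes the Fisher-Pearson coefficient from the second and third central moments, which is what `scipy.stats.skew` returns with its default `bias=True` setting.

```python
data = [10, 12, 23, 23, 16, 23, 21, 16]
n = len(data)
mean = sum(data) / n

m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment (population variance)
m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
g1 = m3 / m2 ** 1.5                          # Fisher-Pearson skewness coefficient

print(round(g1, 4))
```

The negative sign agrees with the rule of thumb above: the mean (18) is slightly below the median (18.5).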

## 7. What is the interquartile range (IQR), and how is it used to detect outliers?

**Interquartile Range (IQR) = Q3 - Q1**

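The IQR is the spread of the middle 50% of the data. A common rule of thumb (Tukey's fences) treats any value below the lower bound or above the upper bound as a potential outlier: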

Lower Bound = Q1 - 1.5 * IQR

Upper Bound = Q3 + 1.5 * IQR

In [None]:
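# Reuses `np` and `data` from question 3.
Q1 = np.percentile(data, 25)  # 15.0
Q3 = np.percentile(data, 75)  # 23.0
IQR = Q3 - Q1                 # 8.0
Lower_Bound = Q1 - 1.5 * IQR  # 3.0
Upper_Bound = Q3 + 1.5 * IQR  # 35.0
# Every value in `data` lies between the bounds, so no outliers are flagged.
IQR, Lower_Bound, Upper_Bound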
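
As a sketch of the final filtering step, the bounds can be applied to a dataset that does contain an extreme value. The appended value 60 is a hypothetical outlier added for illustration, and the quartiles are computed with the standard library's `"inclusive"` method.

```python
import statistics

# Same data as above plus one hypothetical extreme value (60).
data_with_outlier = [10, 12, 23, 23, 16, 23, 21, 16, 60]

q1, _, q3 = statistics.quantiles(data_with_outlier, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = [x for x in data_with_outlier if x < lower_bound or x > upper_bound]

print(lower_bound, upper_bound, outliers)  # 5.5 33.5 [60]
```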

## 8. Discuss the conditions under which the binomial distribution is used.

The binomial distribution models the number of successes when:

- There is a fixed number **n** of independent trials.
- Each trial has exactly two outcomes (success/failure).
- The probability of success **p** is the same on every trial.

In [None]:
from scipy.stats import binom

n, p = 10, 0.5
# P(X = 5): probability of exactly 5 successes in 10 trials, about 0.246
binom_dist = binom.pmf(5, n, p)
binom_dist
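
The same probability follows directly from the binomial formula P(X = k) = C(n, k) · p^k · (1 - p)^(n - k). A minimal sketch using only the standard library:

```python
from math import comb

n, p, k = 10, 0.5, 5

# Number of ways to choose which k trials succeed, times the probability of each such sequence.
probability = comb(n, k) * p**k * (1 - p) ** (n - k)

print(probability)  # 0.24609375, i.e. 252 / 1024
```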

## 9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).

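A normal distribution is symmetric and bell-shaped, is fully described by its mean and standard deviation (SD), and has its mean, median, and mode at the same point. The empirical rule states:

- About **68%** of data falls within 1 SD of the mean.
- About **95%** within 2 SDs.
- About **99.7%** within 3 SDs.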

In [None]:
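import scipy.stats as stats

mean, std_dev = 50, 10
dist = stats.norm(mean, std_dev)
# Probability within 1 SD of the mean: CDF(mean + SD) - CDF(mean - SD), about 0.6827
empirical_rule = dist.cdf(mean + std_dev) - dist.cdf(mean - std_dev)
empirical_rule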
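
All three percentages of the rule can be checked with the standard library alone. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2), regardless of the particular mean and SD.

```python
import math

# P(|Z| < k) for a standard normal variable Z, for k = 1, 2, 3.
within = {k: math.erf(k / math.sqrt(2)) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} SD: {prob:.4f}")
```

The printed values are approximately 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.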

## 10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.

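The Poisson distribution models the number of events in a fixed interval of time or space when events occur independently and at a constant average rate (λ). A real-life example is the number of customers arriving at a shop per hour. Below, the shop averages 4 arrivals per hour and we calculate the probability of exactly 2 arrivals in a given hour.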

In [None]:
from scipy.stats import poisson

lambda_ = 4  # Average arrivals per hour
# P(X = 2): probability of exactly 2 arrivals in one hour, about 0.1465
probability = poisson.pmf(2, lambda_)
probability
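
The result can be verified from the Poisson formula P(X = k) = e^(-λ) · λ^k / k!. This sketch uses only the standard library and also adds the cumulative probability of at most 2 arrivals, which is an extra illustration rather than part of the original calculation.

```python
import math

lam = 4  # average arrivals per hour

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

p_exactly_2 = poisson_pmf(2, lam)                       # about 0.1465
p_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))  # about 0.2381

print(round(p_exactly_2, 4), round(p_at_most_2, 4))
```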

## 11. Explain what a random variable is and differentiate between discrete and continuous random variables.

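A **random variable** assigns a numerical value to each outcome of a random experiment.

- **Discrete:** Takes countable values, each with its own probability (e.g., number of heads in coin flips).
- **Continuous:** Takes any value within an interval, so probabilities are assigned to ranges rather than to single points (e.g., height of students).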
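
The distinction shows up in how probabilities are computed. In this sketch the discrete variable is the number of heads in 3 fair coin flips, and the continuous variable is a height modelled as normal with mean 170 cm and SD 10 cm; both setups are assumptions chosen for illustration.

```python
from math import comb
from statistics import NormalDist

# Discrete: each possible count of heads has its own probability (a PMF).
flips = 3
pmf_heads = {k: comb(flips, k) / 2**flips for k in range(flips + 1)}

# Continuous: probability is attached to an interval, via the CDF.
heights = NormalDist(mu=170, sigma=10)  # hypothetical height model, in cm
p_interval = heights.cdf(180) - heights.cdf(160)

print(pmf_heads)             # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
print(round(p_interval, 4))  # probability of a height between 160 and 180 cm
```

For the continuous variable, the probability of any single exact height is zero; only intervals carry probability.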

## 12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.

In [None]:
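import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 3, 5, 7, 11]})
covariance = df['X'].cov(df['Y'])    # sample covariance: 5.5, positive, so Y tends to rise with X
correlation = df['X'].corr(df['Y'])  # Pearson r, about 0.97: a strong positive linear relationship
covariance, correlation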
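
The two quantities can be derived step by step to show how they relate. This sketch uses only the standard library and the sample (n - 1) formulas, which is what pandas uses by default.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance: average product of paired deviations (sample version divides by n - 1).
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Correlation: covariance rescaled by both standard deviations, so it lies in [-1, 1].
std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
correlation = covariance / (std_x * std_y)

print(round(covariance, 4), round(correlation, 4))
```

The covariance (5.5) depends on the units of X and Y, so its magnitude is hard to judge on its own. The correlation (about 0.97) is unit-free and indicates a strong positive linear relationship, though not a perfect one, since Y grows faster than linearly in X.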