
### 1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.

**Qualitative (Categorical) data:** Describes categories or labels; not numeric in nature.
- *Examples:* Colors of cars (Red, Blue), types of fruit (Apple, Banana), blood groups (A, B, AB, O).

**Quantitative (Numerical) data:** Numeric values that represent either counts (discrete) or measurements (continuous).
- *Examples:* Age, height, exam scores, salary.

**Measurement scales:**
- **Nominal:** Categories with no inherent order (e.g., eye color).
- **Ordinal:** Categories with an order but unknown/unequal intervals (e.g., ratings: Poor, Fair, Good, Excellent).
- **Interval:** Numeric scale with equal intervals but no true zero (e.g., temperature in °C). Differences are meaningful but ratios are not.
- **Ratio:** Numeric scale with equal intervals and a true zero point (e.g., weight, height, income). Ratios are meaningful.
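
The scale distinctions above can be sketched in code. This is a minimal illustration using pandas categoricals; the eye-color and rating values are made up for the example, and the Kelvin conversion is only there to show why ratios need a true zero.

```python
import pandas as pd

# Nominal: categories with no order (hypothetical eye-color sample).
eye_color = pd.Categorical(["brown", "blue", "green", "brown"])

# Ordinal: categories with an explicit order (hypothetical survey ratings).
levels = ["Poor", "Fair", "Good", "Excellent"]
rating = pd.Categorical(["Good", "Poor", "Excellent", "Fair"],
                        categories=levels, ordered=True)

print(eye_color.ordered)             # nominal data carries no ordering
print(rating.min(), rating.max())    # ordering makes min/max meaningful
print((rating >= "Good").tolist())   # order comparisons work on ordinal data

# Interval vs ratio: 20 °C is not "twice as hot" as 10 °C because 0 °C is an
# arbitrary zero. On the Kelvin scale (true zero) the ratio is meaningful.
ratio_celsius = 20 / 10
ratio_kelvin = (20 + 273.15) / (10 + 273.15)
print(ratio_celsius, round(ratio_kelvin, 3))
```

Note that the intervals between rating levels are still unknown here: the ordering supports comparisons, not arithmetic.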

---


### 2. What are the measures of central tendency, and when should you use each? Discuss the mean, median, and mode with examples and situations where each is appropriate.

- **Mean (Arithmetic mean):** Sum of values divided by count. Best for roughly symmetric distributions without outliers. *Example:* Average test score when there are no extreme values.
- **Median:** Middle value after sorting. Robust to outliers and skewness. *Example:* Median household income when incomes are skewed.
- **Mode:** Most frequent value. Useful for categorical data or multimodal distributions. *Example:* Most common shoe size in a sample.
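
To see why the choice of measure matters, the following sketch compares the three measures on a small set of hypothetical test scores, before and after adding one extreme value. The numbers are invented for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical test scores with no extreme values.
scores = [70, 72, 75, 75, 78, 80]
# The same scores plus one extreme value.
scores_with_outlier = scores + [300]

print("no outlier:  ", mean(scores), median(scores), multimode(scores))
print("with outlier:", round(mean(scores_with_outlier), 2),
      median(scores_with_outlier), multimode(scores_with_outlier))
```

The single extreme value pulls the mean well above every ordinary score, while the median and mode stay at 75.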

---


### 3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?

**Dispersion** describes how spread out the data values are around the central value.
- **Variance (σ² or s²):** Average of squared deviations from the mean (dividing by N for a population, σ², or by n − 1 for a sample, s²). Squaring gives larger deviations more weight and puts variance in squared units.
- **Standard deviation (σ or s):** Square root of the variance. It is expressed in the same units as the data, which makes it easier to interpret.

If variance/SD is small, data points cluster near the mean; if large, data are widely spread.
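
As a worked illustration (the eight values are a made-up example, not data from this notebook), take 2, 4, 4, 4, 5, 5, 7, 9, whose mean is 5. Treating them as a whole population:

$$\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2=\frac{9+1+1+1+0+0+4+16}{8}=\frac{32}{8}=4,\qquad \sigma=2$$

Treating the same values as a sample:

$$s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2=\frac{32}{7}\approx 4.57,\qquad s\approx 2.14$$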

---


### 4. What is a box plot, and what can it tell you about the distribution of data?

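A **box plot** (or box-and-whisker plot) summarizes data using the five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The box spans Q1 to Q3, and the line inside it marks the median. In the common Tukey convention, each whisker extends to the most extreme data point within 1.5×IQR of the nearer quartile, so a whisker ends at the minimum or maximum only when there are no outliers on that side; points beyond the whiskers are drawn individually as potential outliers.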

A box plot reveals central tendency, spread, skewness, and outliers, and is useful for comparing distributions between groups.
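
The quantities a box plot draws can be computed directly. The sketch below does so for the Income values from the practical section, using only the standard library. It assumes the 1.5×IQR whisker convention and the `"inclusive"` quantile method, which should match the linear interpolation pandas uses by default.

```python
from statistics import quantiles

# Income values from the example dataset in the practical section.
income = [25000, 27000, 50000, 23000, 120000, 60000, 150000, 30000, 45000, 40000]

# Quartiles (Q1, median, Q3) with linear interpolation between data points.
q1, q2, q3 = quantiles(income, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences.
inside = [x for x in income if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = sorted(x for x in income if x < lower_fence or x > upper_fence)

print("box:", q1, q2, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Here the upper whisker stops at 60000 rather than at the maximum, because the two largest incomes fall beyond the upper fence and are plotted as individual points.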

---


### 5. Discuss the role of random sampling in making inferences about populations.

**Random sampling** selects members of a population by chance, so that every member has a known, nonzero probability of selection; in *simple* random sampling every member has an equal chance. It reduces selection bias and allows us to use probability theory to make valid inferences about the population (e.g., confidence intervals, hypothesis tests). A well-drawn random sample tends to be representative, so sample statistics approximate population parameters, and the approximation generally improves as the sample size grows.
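
The effect of selection bias can be illustrated with a small simulation. The population below is synthetic (an assumed exponential distribution with mean 50), and the "convenience" sample is a deliberately extreme form of biased selection chosen for contrast.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 values from a right-skewed distribution.
population = [random.expovariate(1 / 50) for _ in range(10_000)]
pop_mean = mean(population)

# Simple random sample: every member has the same chance of selection.
sample = random.sample(population, k=200)
sample_mean = mean(sample)

# Biased "convenience" sample: only the 200 smallest values are selected.
biased = sorted(population)[:200]
biased_mean = mean(biased)

print(f"population mean:    {pop_mean:.2f}")
print(f"random sample mean: {sample_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

The random sample's mean lands near the population mean, while the biased sample's mean is far off, regardless of how many values it contains.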

---


### 6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?

**Skewness** measures the asymmetry of a distribution around its mean.
- **Positive (right) skew:** The tail extends to the right, and typically mean > median. Example: income distribution.
- **Negative (left) skew:** The tail extends to the left, and typically mean < median. Example: age at retirement if most people retire late but a few retire very early.

Skewness affects interpretation: in skewed data the mean is pulled toward the tail and may not represent the 'typical' value, so the median is often preferred as a measure of central tendency.
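
As a rough check of the mean-versus-median pattern, the sketch below computes a moment-based skewness coefficient (third central moment divided by the 1.5 power of the second) for the Income values from the practical section. This is the simple population form; library estimators such as pandas `.skew()` apply a small-sample adjustment, so their values may differ somewhat.

```python
from statistics import mean, median

# Income values from the example dataset in the practical section.
income = [25000, 27000, 50000, 23000, 120000, 60000, 150000, 30000, 45000, 40000]

n = len(income)
m = mean(income)

# Second and third central moments (population form, dividing by n).
m2 = sum((x - m) ** 2 for x in income) / n
m3 = sum((x - m) ** 3 for x in income) / n

# Moment coefficient of skewness: positive means a longer right tail.
g1 = m3 / m2 ** 1.5

print("mean:", m, "median:", median(income))
print("skewness (moment-based):", round(g1, 3))
```

The mean (57000) sits well above the median (42500), and the coefficient is positive, both consistent with a right-skewed distribution.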

---


### 7. What is the interquartile range (IQR), and how is it used to detect outliers?

**IQR = Q3 − Q1** (distance between 75th and 25th percentiles). It measures the spread of the middle 50% of data.
A common rule to detect outliers: values < Q1 − 1.5×IQR or > Q3 + 1.5×IQR are considered outliers. This method is robust and less sensitive to extreme values than methods based on mean and SD.

---


### 8. Discuss the conditions under which the binomial distribution is used.

The **binomial distribution** models the number of successes in _n_ independent trials when each trial has exactly two outcomes (success/failure) and the probability of success _p_ is constant across trials.
Conditions (Bernoulli trials):
1. Fixed number of trials, _n_.
2. Each trial is independent.
3. Only two outcomes per trial (success or failure).
4. Probability of success _p_ remains the same each trial.

The probability of exactly _k_ successes is: $P(X=k)=\binom{n}{k} p^{k} (1-p)^{n-k}$.
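
As a worked instance of the formula, the probability of exactly 3 heads in 5 fair coin tosses (n = 5, k = 3, p = 0.5) is:

$$P(X=3)=\binom{5}{3}(0.5)^{3}(0.5)^{2}=10\times 0.125\times 0.25=0.3125$$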

---


### 9. Explain the properties of the normal distribution and the empirical rule (68–95–99.7 rule).

The **normal distribution** is a symmetric, bell-shaped continuous distribution that is fully described by its mean μ and standard deviation σ. Key properties:
- Symmetric about the mean μ.
- Mean = median = mode.
- The probability density function is the Gaussian curve $f(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^{2}/(2\sigma^{2})}$, and the total area under it equals 1.

**Empirical rule:** For a normal distribution:
- ~68% of values lie within μ ± 1σ
- ~95% within μ ± 2σ
- ~99.7% within μ ± 3σ

This helps estimate probabilities and detect unusual observations.
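
The three percentages can be checked numerically. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k/√2); the helper name `prob_within` below is just an illustrative choice.

```python
from math import erf, sqrt

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```

The results (about 0.6827, 0.9545, and 0.9973) do not depend on μ or σ, which is why the rule applies to any normal distribution.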

---


### 10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.

A **Poisson process** models the count of events occurring randomly over a fixed interval when events happen independently and the average rate (λ) is constant.
Real-life example: Number of incoming calls at a call center per hour.
If average calls per hour λ = 6, the probability of observing exactly k calls in an hour is:
$$P(X=k)=\frac{e^{-\lambda} \lambda^{k}}{k!}.$$
For example, probability of exactly 4 calls when λ=6: $P(X=4)=e^{-6} 6^{4}/4!$ (we compute this in the practical section).
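
Written out step by step, the arithmetic is:

$$P(X=4)=\frac{e^{-6}\,6^{4}}{4!}=\frac{1296}{24}\,e^{-6}=54\,e^{-6}\approx 54\times 0.002479\approx 0.134$$

So there is roughly a 13% chance of receiving exactly 4 calls in an hour when the average is 6.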

---


### 11. Explain what a random variable is and differentiate between discrete and continuous random variables.

A **random variable** is a variable that takes on numerical values determined by the outcomes of a random phenomenon.
- **Discrete random variable:** Takes countable values (e.g., number of heads in 10 coin tosses).
- **Continuous random variable:** Takes values from a continuum (e.g., height, weight). Probabilities are given over intervals and use probability density functions.

---


### 12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.

See the practical section below for a worked example with Python code that creates a dataset, computes covariance and Pearson correlation, and interprets them.

---


## Practical — Code Examples
Run the following cells to see computations and visualizations.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import comb, factorial, exp

# Apply seaborn's default plot theme (set_theme is the current name for sns.set)
sns.set_theme()


In [None]:
# Example dataset
data = pd.DataFrame({
    'Age': [23, 25, 31, 22, 45, 36, 52, 27, 30, 29],
    'Income': [25000, 27000, 50000, 23000, 120000, 60000, 150000, 30000, 45000, 40000]
})
data.head()

### Measures of central tendency (mean, median, mode) and dispersion (variance, standard deviation)


In [None]:
mean_age = data['Age'].mean()
median_age = data['Age'].median()
mode_age = data['Age'].mode().tolist()
var_age = data['Age'].var(ddof=0)  # population variance (divide by N); ddof=1 would give the sample variance
std_age = data['Age'].std(ddof=0)  # population standard deviation

mean_income = data['Income'].mean()
median_income = data['Income'].median()
mode_income = data['Income'].mode().tolist()
var_income = data['Income'].var(ddof=0)
std_income = data['Income'].std(ddof=0)

results = {
    'mean_age': mean_age,
    'median_age': median_age,
    'mode_age': mode_age,
    'var_age': var_age,
    'std_age': std_age,
    'mean_income': mean_income,
    'median_income': median_income,
    'mode_income': mode_income,
    'var_income': var_income,
    'std_income': std_income
}
results

### Box plot and skewness


In [None]:
# Boxplot for Income
plt.figure(figsize=(6,4))
sns.boxplot(x=data['Income'])
plt.title('Box plot of Income')
plt.show()

# Skewness
skew_age = data['Age'].skew()
skew_income = data['Income'].skew()
skew_age, skew_income

### IQR and outlier detection


In [None]:
Q1 = data['Income'].quantile(0.25)
Q3 = data['Income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = data[(data['Income'] < lower_bound) | (data['Income'] > upper_bound)]
{'Q1':Q1, 'Q3':Q3, 'IQR':IQR, 'lower_bound':lower_bound, 'upper_bound':upper_bound, 'outliers': outliers.to_dict(orient='records')}

### Random sampling demonstration


In [None]:
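rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible
population = np.arange(1, 501)  # population of 500 items
sample = rng.choice(population, size=50, replace=False)  # simple random sample without replacement
sample[:10]  # show first 10 sample values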

### Binomial distribution example (probability calculation)

Probability of exactly k successes in n independent trials with success probability p:
$$P(X=k)=\binom{n}{k} p^{k} (1-p)^{n-k}$$


In [None]:
def binomial_pmf(n, k, p):
    return comb(n, k) * (p**k) * ((1-p)**(n-k))

# Example: probability of getting exactly 3 heads in 5 fair coin tosses (p=0.5)
binomial_pmf(5, 3, 0.5)

### Poisson distribution example
Probability mass function for Poisson with mean λ:
$$P(X=k)=e^{-\lambda} \frac{\lambda^{k}}{k!}$$


In [None]:
def poisson_pmf(lam, k):
    return exp(-lam) * (lam**k) / factorial(k)

# Example: average 6 calls per hour, probability of exactly 4 calls
poisson_pmf(6, 4)

### Random variables: discrete vs continuous (small demonstration)


In [None]:
# Discrete random variable example: number of heads in 3 coin tosses (values 0,1,2,3)
discrete_values = [binomial_pmf(3, k, 0.5) for k in range(4)]
discrete_values
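
For the continuous side of the comparison, here is a minimal sketch using the standard library's `NormalDist`. The height distribution (mean 170 cm, standard deviation 10 cm) is an assumed example, not data from this notebook.

```python
from statistics import NormalDist

# Hypothetical continuous random variable: height ~ Normal(170 cm, 10 cm).
height = NormalDist(mu=170, sigma=10)

# Probability over an interval is the area under the density curve.
p_interval = height.cdf(180) - height.cdf(160)

# Probability of any single exact value is zero for a continuous variable.
p_point = height.cdf(170) - height.cdf(170)

# The density at a point is not a probability, only a height of the curve.
density_at_mean = height.pdf(170)

print(round(p_interval, 4), p_point, round(density_at_mean, 4))
```

A discrete variable assigns probability to individual values that sum to 1, whereas a continuous variable assigns probability only to intervals.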

### Covariance and Correlation example


In [None]:
cov_matrix = data[['Age','Income']].cov()  # sample covariance (ddof=1)
pearson_corr = data[['Age','Income']].corr(method='pearson')
cov_matrix, pearson_corr

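Interpretation:
- **Covariance:** Shows the direction of the linear relationship. Positive covariance indicates the variables tend to move together; negative covariance indicates they move in opposite directions. Its magnitude depends on the units of both variables, so it is not standardized.
- **Correlation (Pearson):** Standardized measure ranging from -1 to 1. Values near ±1 indicate a strong linear relationship; values near 0 indicate a weak or no linear relationship.
- **For this dataset:** The sample covariance between Age and Income is about 416,222 (in years × income units) and the Pearson correlation is about 0.99, so in this small sample income rises almost linearly with age.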
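
To make the two definitions concrete, the sketch below recomputes the sample covariance and Pearson correlation for the Age and Income columns directly from their formulas, without pandas. The intermediate names (`sxy`, `sxx`, `syy`) are illustrative.

```python
from math import sqrt

# Age and Income values from the example dataset.
age = [23, 25, 31, 22, 45, 36, 52, 27, 30, 29]
income = [25000, 27000, 50000, 23000, 120000, 60000, 150000, 30000, 45000, 40000]

n = len(age)
mean_age = sum(age) / n
mean_income = sum(income) / n

# Sums of cross-products and squared deviations from the means.
sxy = sum((a - mean_age) * (b - mean_income) for a, b in zip(age, income))
sxx = sum((a - mean_age) ** 2 for a in age)
syy = sum((b - mean_income) ** 2 for b in income)

# Sample covariance divides by n - 1 (the ddof=1 convention).
cov_sample = sxy / (n - 1)

# Pearson correlation: covariance rescaled by both standard deviations.
r = sxy / sqrt(sxx * syy)

print(round(cov_sample, 2), round(r, 4))
```

Dividing by the two standard deviations removes the units, which is why the correlation can be compared across datasets while the covariance cannot.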
