<a href="https://colab.research.google.com/github/Chaakash16/Python-Basics/blob/main/Statistics_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Statistics Basics**

Q1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss
nominal, ordinal, interval, and ratio scales.

Ans. Data is classified into **qualitative (categorical)** and **quantitative (numerical)** types.  

- **Qualitative Data**: Descriptive, non-numeric data.  
  - **Nominal Scale**: Categories without order (e.g., colors, gender).  
  - **Ordinal Scale**: Categories with order but no fixed difference (e.g., rankings, satisfaction levels).  

- **Quantitative Data**: Numeric data that can be measured.  
  - **Interval Scale**: Ordered with equal differences but no true zero (e.g., temperature in Celsius).  
  - **Ratio Scale**: Ordered with equal differences and a true zero (e.g., height, weight, income).

Q2. What are the measures of central tendency, and when should you use each? Discuss the mean, median,
and mode with examples and situations where each is appropriate.

Ans. Measures of central tendency summarize data using a single value:  

- **Mean** (average): Best for roughly symmetric data without outliers (e.g., average height of students in a class).  
- **Median** (middle value): Best for skewed data or when outliers exist (e.g., house prices).  
- **Mode** (most frequent value): Ideal for categorical data or distributions with repeating values (e.g., favorite movie genre).
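
As a rough illustration of why the choice matters, the sketch below uses Python's standard `statistics` module on made-up data (the house prices and the genre list are hypothetical, not taken from a real dataset). A single extreme price pulls the mean well above the median, while the mode works directly on category labels.

```python
import statistics

# Hypothetical house prices (in thousands); the last value is an outlier.
prices = [200, 210, 220, 230, 240, 1500]

mean_price = statistics.mean(prices)      # pulled upward by the outlier
median_price = statistics.median(prices)  # barely affected by the outlier

# The mode also works on categorical (non-numeric) data.
genres = ["comedy", "drama", "comedy", "action", "comedy"]
top_genre = statistics.mode(genres)

print(f"mean = {mean_price:.2f}, median = {median_price}, mode = {top_genre}")
```

Here the median stays close to the typical price, which is why it is usually preferred for skewed data such as house prices.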

Q3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?

Ans. **Dispersion** measures how spread out data points are in a dataset.  

- **Variance**: The average of squared differences from the mean, showing overall data spread.  
- **Standard Deviation**: The square root of variance, indicating how much data deviates from the mean in original units.  

Both quantify how consistent the data are: low values mean the points cluster tightly around the mean, while high values indicate a wide spread.
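
A minimal sketch of the two measures, assuming the population formulas (dividing by n) and two invented datasets that share the same mean of 50:

```python
import statistics

# Two hypothetical datasets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

# Population variance: average of squared deviations from the mean.
tight_var = statistics.pvariance(tight)
wide_var = statistics.pvariance(wide)

# Standard deviation: square root of the variance, in the original units.
tight_sd = statistics.pstdev(tight)
wide_sd = statistics.pstdev(wide)

print(f"tight: variance = {tight_var}, sd = {tight_sd:.2f}")
print(f"wide:  variance = {wide_var}, sd = {wide_sd:.2f}")
```

For a sample rather than a full population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.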

Q4. What is a box plot, and what can it tell you about the distribution of data?

Ans. A **box plot** (or **box-and-whisker plot**) visually represents data distribution using five key statistics: **minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum**.  

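It helps identify **skewness, spread, central tendency, and outliers**: a noticeably longer whisker on one side suggests skew in that direction. When outliers are drawn as individual points, the whiskers stop at the most extreme values within 1.5 × IQR of the box rather than at the minimum and maximum.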
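
The five statistics behind a box plot can be computed without drawing anything. The sketch below is one way to do it; the helper name `five_number_summary` and the sample values are illustrative, and `method="inclusive"` is only one of several quartile conventions, so other tools may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using inclusive quartiles."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

# Hypothetical data: the integers 1 through 9 in shuffled order.
summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```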

Q5. Discuss the role of random sampling in making inferences about populations.

Ans. **Random sampling** (in its simplest form) gives every individual in a population an equal chance of selection, which protects against selection bias and makes samples **representative on average**.  

It helps in making **accurate inferences** about populations, reducing bias, and allowing for **generalization** in statistical analysis, such as surveys, experiments, and predictive modeling.
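
A small simulation can make the idea concrete. The sketch below assumes a hypothetical population (the integers 1 to 10000) and a fixed seed so the run is repeatable; it draws a simple random sample and compares the sample mean with the population mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: the integers 1..10000 (true mean is 5000.5).
population = list(range(1, 10001))
population_mean = statistics.mean(population)

# Simple random sample without replacement: every member is equally likely.
sample = rng.sample(population, 500)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean}")
print(f"sample mean     = {sample_mean:.1f}")
```

The sample mean will not match the population mean exactly, but with random selection the gap tends to shrink as the sample size grows.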

Q6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?

Ans. **Skewness** measures the asymmetry of a dataset's distribution.  

- **Positive Skew (Right-Skewed)**: Tail on the right, mean > median (e.g., income distribution).  
- **Negative Skew (Left-Skewed)**: Tail on the left, mean < median (e.g., scores on an easy exam, where most students score high).  
- **Zero Skew (Symmetric)**: Mean = Median = Mode (e.g., normal distribution).  

Skewness affects interpretation by influencing measures of central tendency and indicating potential data transformation needs.
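
One common way to put a number on skewness is the third central moment divided by the cube of the standard deviation. The sketch below assumes that population definition (other definitions apply a small-sample correction) and uses invented datasets; the function name `skewness` is illustrative.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long tail on the right
left_skewed = [-v for v in right_skewed]   # mirror image, tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3),
          "mean:", statistics.mean(data), "median:", statistics.median(data))
```

The sign of the result matches the direction of the tail, and the mean sits on the tail side of the median in both skewed cases.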

Q7. What is the interquartile range (IQR), and how is it used to detect outliers?

Ans. The **Interquartile Range (IQR)** measures data spread by calculating the difference between the **third quartile (Q3) and first quartile (Q1)**:  

\[
IQR = Q3 - Q1
\]  

To detect **outliers**, flag any value below **Q1 − 1.5 × IQR** or above **Q3 + 1.5 × IQR**. These two cutoffs (often called fences) identify unusually low or high data points.
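
The rule can be sketched in a few lines. The helper name `iqr_outliers` and the data are hypothetical, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is one convention among several.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical measurements with one suspiciously large value.
data = [10, 12, 13, 14, 15, 16, 18, 95]
outliers = iqr_outliers(data)
print(outliers)
```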

Q8. Discuss the conditions under which the binomial distribution is used.

Ans. The **binomial distribution** is used when:  

1. **Fixed number of trials (n)** – A set number of experiments or observations.  
2. **Two possible outcomes** – Each trial results in **success** or **failure**.  
3. **Constant probability (p)** – The probability of success remains the same in all trials.  
4. **Independent trials** – The outcome of one trial does not affect another.  

Example: Tossing a coin **10 times** (success = heads, failure = tails).
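
When these four conditions hold, the probability of exactly k successes follows the standard binomial formula, C(n, k) · p^k · (1 − p)^(n − k), which the answer above does not spell out. A minimal sketch for the coin example, assuming a fair coin (p = 0.5); the function name `binomial_pmf` is illustrative:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 tosses of a fair coin.
p_five_heads = binomial_pmf(5, 10, 0.5)
print(f"P(exactly 5 heads in 10 tosses) = {p_five_heads:.4f}")

# Sanity check: the probabilities over k = 0..10 add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(f"sum over all k = {total:.4f}")
```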

Q9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).

Ans. The **normal distribution** is a symmetric, bell-shaped curve characterized by:  

- **Mean = Median = Mode**  
- **Symmetry around the mean**  
- **Asymptotic tails** (never touch the x-axis)  
- **Defined by mean (μ) and standard deviation (σ)**  

### **Empirical Rule (68-95-99.7 Rule):**  
For a normal distribution:  
- **68%** of data falls within **1σ** of the mean.  
- **95%** falls within **2σ**.  
- **99.7%** falls within **3σ**.  

This helps in predicting data spread and identifying outliers.
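
The three percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the standard library with a standard normal (μ = 0, σ = 1); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: share_within(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} sd of the mean: {share:.4f}")
```

The same fractions hold for any mean and standard deviation, since they depend only on the distance from the mean measured in units of σ.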

Q10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.

Ans. A **Poisson process** models events that occur randomly and independently over time or space at a constant average rate.

Example: A call center receives an average of 5 calls per hour. What is the probability of receiving exactly 3 calls in an hour?

Poisson Probability Formula:

\[
P(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}
\]

where:

- λ = 5 (average calls per hour)
- k = 3 (desired number of calls)
- e ≈ 2.718

Calculation:

\[
P(X = 3) = \frac{e^{-5} \cdot 5^{3}}{3!} = \frac{2.718^{-5} \cdot 125}{6}
\]

The probability of receiving exactly 3 calls in an hour is approximately 0.1404 (14.04%).
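
The arithmetic can be reproduced with a short script. This is a sketch using only the `math` module; the function name `poisson_pmf` is illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Call-center example: average of 5 calls per hour, exactly 3 calls observed.
p_three_calls = poisson_pmf(3, 5)
print(f"P(X = 3) = {p_three_calls:.4f}")
```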

Q11. Explain what a random variable is and differentiate between discrete and continuous random variables.

Ans. A **random variable** is a numerical outcome of a random experiment.  

### **Types of Random Variables:**  
1. **Discrete Random Variable**:  
   - Takes **countable** values (e.g., 0, 1, 2, …).  
   - Example: Number of cars in a parking lot.  

2. **Continuous Random Variable**:  
   - Can take **any value** within an interval (uncountably many possibilities).  
   - Example: Height of students in a class (can be 5.6 ft, 5.65 ft, etc.).
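
The difference shows up clearly in simulation. The sketch below uses a die roll as a stand-in discrete variable and, purely as an assumption for illustration, a height drawn uniformly between 5.0 and 6.5 ft as a continuous one.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Discrete: a die roll can only land on one of six countable values.
rolls = [rng.randint(1, 6) for _ in range(10)]

# Continuous: a hypothetical height in feet, anywhere in the interval.
heights = [rng.uniform(5.0, 6.5) for _ in range(10)]

print("rolls:  ", rolls)
print("heights:", [round(h, 3) for h in heights])
```

Repeated values are common among the rolls, while two simulated heights almost never coincide exactly.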

Q12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.

Ans.

### **Example Dataset:**  
We have two variables: **Study Hours (X)** and **Exam Scores (Y)**  

| Student | Study Hours (X) | Exam Score (Y) |
|---------|---------------|-------------|
| 1       | 2             | 50          |
| 2       | 4             | 60          |
| 3       | 6             | 70          |
| 4       | 8             | 80          |
| 5       | 10            | 90          |

### **Covariance Formula:**  
\[
Cov(X, Y) = \frac{\sum (X_i - \bar{X}) (Y_i - \bar{Y})}{n}
\]

### **Correlation Formula (Pearson’s r):**  
\[
r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}
\]

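The means are X̄ = 6 and Ȳ = 70, so the products of deviations are 80, 20, 0, 20, and 80, which sum to 200. Dividing by n = 5 gives Cov(X, Y) = 40. The population variances are 8 for X and 200 for Y, so σ_X σ_Y = √(8 × 200) = 40 and r = 40 / 40 = 1.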

### **Results & Interpretation:**  
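- **Covariance = 40.0** → The positive value indicates that exam scores tend to increase as study hours increase. Its magnitude depends on the units of X and Y, so it is hard to interpret on its own.  
- **Correlation (r) = 1.0** → Perfect positive correlation: the points lie exactly on a straight line (here Y = 5X + 40), so more study hours go with higher scores. Correlation by itself does not show that the extra hours cause the higher scores.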
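
These numbers can be reproduced with plain Python. The sketch below follows the population formulas given above (dividing by n); the variable names are illustrative. Library routines that use the sample convention (dividing by n − 1) would report a covariance of 50 for the same data, while r is unaffected.

```python
import math

hours = [2, 4, 6, 8, 10]        # Study Hours (X) from the table
scores = [50, 60, 70, 80, 90]   # Exam Score (Y) from the table
n = len(hours)

mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Population covariance: average product of deviations from the means.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / n

# Population standard deviations of X and Y.
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in hours) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in scores) / n)

# Pearson's r: covariance scaled by both standard deviations.
r = cov_xy / (sd_x * sd_y)

print(f"covariance = {cov_xy}")
print(f"correlation = {r:.4f}")
```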