# 📊 Statistics – Interview Questions (MCQ + Theory + Coding) PDF Notes

---

## 1️⃣ Multiple Choice Questions (MCQ)

**Q1:** Which measure of central tendency is least affected by outliers?

* A) Mean
* B) Median ✅
* C) Mode
* D) Standard Deviation

**Q2:** If a dataset is normally distributed, approximately what percentage of data lies within 1 standard deviation of the mean?

* A) 50%
* B) 68% ✅
* C) 95%
* D) 99%

**Q3:** What type of variable is 'Blood Group'?

* A) Ordinal
* B) Nominal ✅
* C) Discrete
* D) Continuous

**Q4:** What is the correct formula for sample variance?

* A) Σ(x − μ)² / N
* B) Σ(x − x̄)² / (n − 1) ✅
* C) Σ(x − x̄)² / n
* D) Σ(x − μ)² / (N − 1)

**Q5:** Which sampling technique divides the population into strata and samples from each group?

* A) Simple Random Sampling
* B) Systematic Sampling
* C) Stratified Sampling ✅
* D) Cluster Sampling

---

## 2️⃣ Theory Questions

**Q1:** Explain the difference between Descriptive and Inferential Statistics.

* Descriptive: Summarizes and describes data.
* Inferential: Draws conclusions or predictions about a population from a sample.

**Q2:** Define population and sample with examples.

* Population: All students in a university.
* Sample: 200 students selected from the university.

**Q3:** What is variance and why is it important?

* Variance is the average squared deviation of the data from the mean; its square root is the standard deviation.
* It is important for quantifying variability, assessing risk, and understanding how the data are distributed.

**Q4:** Describe types of variables with examples.

* Qualitative: Nominal (Gender), Ordinal (Ratings)
* Quantitative: Discrete (Number of students), Continuous (Height)

**Q5:** Explain the difference between Z-test and T-test.

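* Z-test: Population SD is known, or the sample is large (commonly n ≥ 30) so the normal approximation holds.
* T-test: Population SD is unknown and estimated from the sample; used mainly for small samples, with the t-distribution (n − 1 degrees of freedom).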
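
As a rough illustration of the unknown-SD case, the one-sample t statistic can be computed with the standard library alone. The sample values and the hypothesized mean `mu_0` below are assumptions chosen for illustration, not data from these notes.

```python
import math
from statistics import mean, stdev

sample = [10, 20, 20, 30, 40]   # illustrative data (assumed)
mu_0 = 20                       # hypothesized population mean (assumed)

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)               # sample SD, divides by n - 1

# t = (x_bar - mu_0) / (s / sqrt(n)), compared against a t-distribution
# with n - 1 degrees of freedom
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
print("t statistic:", round(t_stat, 3))
```

If the population SD were known, the same formula with that SD in place of `s` would give the z statistic instead.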

---

## 3️⃣ Coding Questions (Python)

**Q1:** Compute Mean, Median, Mode, Variance, and Standard Deviation.

```python
import numpy as np
from statistics import mode

data = [10, 20, 20, 30, 40]
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", mode(data))
print("Variance:", np.var(data, ddof=1))  # Sample variance
print("Std Dev:", np.std(data, ddof=1))
```

**Q2:** Create a stratified sample from a dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Department': ['CS','CS','EE','EE','ME','ME'],
    'Score': [90,85,88,92,75,80]
})

# Stratified sampling
train, test = train_test_split(data, test_size=0.5, stratify=data['Department'], random_state=42)
print(test)
```

**Q3:** Calculate Z-score of a value.

```python
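import numpy as np

scores = [90, 85, 88, 92, 75, 80]  # same scores as the DataFrame in Q2
value = 70
mean = np.mean(scores)
std = np.std(scores, ddof=1)  # sample standard deviation
z_score = (value - mean) / std  # number of standard deviations from the mean
print("Z-score:", z_score)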
```

**Q4:** Simulate a normal distribution and plot histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(50, 10, 1000)
plt.hist(data, bins=30, color='skyblue')
plt.title('Normal Distribution')
plt.show()
```

**Q5:** Compute correlation between two variables.

```python
import numpy as np

x = [1,2,3,4,5]
y = [2,4,6,8,10]
correlation = np.corrcoef(x, y)[0,1]
print("Correlation:", correlation)
```

---

✅ **End of Statistics Interview Questions (MCQ + Theory + Coding)**


# 🤖 ML & Data Science Interview Questions

## 1. Linear Regression

**Question:** Explain Linear Regression and how it works.

**Answer:**  
Linear Regression predicts a continuous output \(y\) from input \(X\) using a linear relationship:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon
$$

- \(\beta_i\) = coefficients (weights)  
- \(\epsilon\) = error term  

**Example:**

Predict house price based on `size`:

Data:  
| Size (sq.ft) | Price ($k) |
|--------------|------------|
| 1000         | 200        |
| 1500         | 250        |

Fit a linear model to this data, then use it to predict prices for new houses.
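
A minimal sketch of that fit, using the closed-form least-squares solution for one feature and the two rows from the table above. The query size `new_size = 1200` is a hypothetical value; with only two points the line passes through both exactly, so this shows the mechanics rather than a realistic model.

```python
# Least-squares fit of price = b0 + b1 * size.
sizes = [1000, 1500]    # sq.ft
prices = [200, 250]     # $k

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Closed-form simple linear regression: slope = cov(x, y) / var(x)
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
denominator = sum((x - mean_x) ** 2 for x in sizes)
b1 = numerator / denominator
b0 = mean_y - b1 * mean_x

new_size = 1200  # hypothetical new house
predicted = b0 + b1 * new_size
print(f"price = {b0:.1f} + {b1:.3f} * size")
print(f"Predicted price for {new_size} sq.ft: {predicted:.1f} $k")
```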

---

## 2. Logistic Regression

**Question:** Difference between Linear and Logistic Regression.

**Answer:**  
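- Linear regression predicts a **continuous value**; logistic regression predicts a **probability** (between 0 and 1) used for classification.  
- Logistic regression passes the linear combination through the **sigmoid function**: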

$$
p = \frac{1}{1 + e^{-z}}, \quad z = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n
$$

**Example:** Predict if a student passes (1) or fails (0) based on hours studied.
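
The sigmoid mapping can be sketched for the pass/fail example as follows. The coefficients `beta_0` and `beta_1` are hypothetical, hand-picked values rather than the result of fitting real data, and the 0.5 threshold is only a common default.

```python
import math

def sigmoid(z):
    """Map any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, not fitted to real data.
beta_0 = -4.0   # intercept
beta_1 = 1.0    # weight for hours studied

for hours in (2, 4, 6):
    p = sigmoid(beta_0 + beta_1 * hours)
    label = 1 if p >= 0.5 else 0   # 0.5 threshold is a common default
    print(f"hours={hours}: p(pass)={p:.3f} -> predicted class {label}")
```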

---

## 3. Decision Tree

**Question:** How does a Decision Tree decide splits?

**Answer:**  
- Evaluates candidate splits with impurity metrics such as **Gini Index** or **Entropy** and picks the split that reduces impurity the most.  

**Entropy Formula:**

$$
\text{Entropy}(S) = - \sum_{i=1}^{c} p_i \log_2(p_i)
$$

- \(p_i\) = proportion of class \(i\) in the set  
- \(c\) = number of classes  

**Example:** Classify whether a fruit is apple/orange based on color and size.
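
To make the entropy formula concrete, here is a small sketch that scores one candidate split by its entropy reduction (often called information gain). The fruit labels are toy data assumed for illustration, and the helper names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy fruit labels (assumed): a split on color that separates the classes perfectly.
parent = ["apple"] * 4 + ["orange"] * 4
left, right = ["apple"] * 4, ["orange"] * 4
print("Parent entropy:", entropy(parent))
print("Information gain:", information_gain(parent, left, right))
```

A tree builder would compute this gain for every candidate split and keep the largest one.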

---

## 4. SVM (Support Vector Machine)

**Question:** How does SVM work?

**Answer:**  
- SVM finds a **hyperplane** that separates classes with maximum margin.  

**Equation of Hyperplane:**

$$
w \cdot x + b = 0
$$

- \(w\) = weight vector  
- \(b\) = bias  

**Kernel Trick:** Allows non-linear separation by implicitly mapping the data into a higher-dimensional space using kernels such as RBF or polynomial.
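
The hyperplane equation can be illustrated for the linear case with a few lines of plain Python. The values of `w` and `b` are hypothetical and hand-picked; a real SVM learns them from training data by maximizing the margin.

```python
import math

# Hypothetical, hand-picked parameters; a real SVM learns w and b from data.
w = [1.0, 1.0]
b = -3.0

def decision_value(x):
    """Signed score w . x + b; zero means x lies exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(x) >= 0 else -1

# With the usual scaling (support vectors satisfy |w . x + b| = 1),
# the margin width is 2 / ||w||.
margin_width = 2 / math.sqrt(sum(wi ** 2 for wi in w))

for point in ([3.0, 2.0], [0.5, 1.0]):
    print(point, "->", predict(point))
print("Margin width:", round(margin_width, 3))
```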

---

## 5. K-Nearest Neighbors (KNN)

**Question:** Explain KNN and distance metrics.

**Answer:**  
- KNN classifies based on **majority vote of k nearest neighbors**.  
- Common distance metrics:

$$
\text{Euclidean: } d = \sqrt{\sum (x_i - y_i)^2}
$$

$$
\text{Manhattan: } d = \sum |x_i - y_i|
$$
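
Both distance metrics and the majority vote fit in a short sketch. The training points are toy data assumed for illustration, and `knn_predict` is a hypothetical helper name, not a library function.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3, distance=euclidean):
    """train is a list of (features, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy 2-D points (assumed data).
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((6, 6), "B"), ((7, 6), "B")]
print("Euclidean:", euclidean((0, 0), (3, 4)))
print("Manhattan:", manhattan((0, 0), (3, 4)))
print("Prediction for (2, 2):", knn_predict(train, (2, 2), k=3))
```

Because KNN relies on raw distances, features are usually scaled to comparable ranges before it is applied.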

---

## 6. Bias vs Variance

**Question:** What is Bias-Variance tradeoff?

**Answer:**  
- **Bias:** Error due to oversimplified model (underfitting)  
- **Variance:** Error due to over-complex model (overfitting)  

**Total Error:**

$$
\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
$$

**Example:** Linear model (high bias) vs deep tree (high variance).

---

## 7. Cross Validation

**Question:** How to validate ML models?

**Answer:**  
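- Split data into **k folds**. Train on k-1 folds, test on the remaining fold. Repeat k times so each fold serves as the test set once, then average the scores.  
- Gives a more reliable estimate of generalization than a single train/test split and helps detect overfitting.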

**Example:** 5-Fold CV: data split into 5 parts → train on 4, test on 1, rotate.
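
The fold rotation can be sketched without any library. `k_fold_indices` is a hypothetical helper that uses contiguous, unshuffled folds for readability; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Fold {fold}: train={train_idx} test={test_idx}")
```

Each sample lands in the test set exactly once, so averaging the k scores uses every observation for evaluation.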

---

## 8. PCA (Principal Component Analysis)

**Question:** What is PCA and why use it?

**Answer:**  
- PCA reduces **dimensionality** by creating new uncorrelated features (principal components).  
- Each component points in the direction of maximum remaining **variance**, orthogonal to the previous ones.  

**Math:**

$$
Z = X W
$$

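- \(X\) = original data matrix (mean-centered)  
- \(W\) = matrix whose columns are the top-k eigenvectors of the covariance matrix  
- \(Z\) = data projected onto the principal components  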

**Example:** Reduce 50 features to 10 principal components for faster ML training.
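
A minimal NumPy sketch of the projection via the covariance eigendecomposition. The synthetic data (100 samples, 5 features) and the choice `k = 2` are assumptions for illustration, scaled down from the 50-to-10 example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # synthetic data (assumed)

X_centered = X - X.mean(axis=0)          # PCA assumes mean-centered features
cov = np.cov(X_centered, rowvar=False)   # 5 x 5 covariance matrix

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # largest variance first
k = 2
W = eigvecs[:, order[:k]]                # top-k eigenvectors as columns

Z = X_centered @ W                       # projected data, shape (100, 2)
print("Z shape:", Z.shape)
print("Explained variance ratio:", eigvals[order[:k]] / eigvals.sum())
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is why components are ranked by eigenvalue.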

---

## 9. Confusion Matrix & Metrics

**Question:** Explain Confusion Matrix and evaluation metrics.

**Answer:**  

|                        | Actual Positive | Actual Negative |
|------------------------|-----------------|-----------------|
| **Predicted Positive** | TP              | FP              |
| **Predicted Negative** | FN              | TN              |

- **Accuracy:** \( \frac{TP+TN}{TP+FP+FN+TN} \)  
- **Precision:** \( \frac{TP}{TP+FP} \)  
- **Recall:** \( \frac{TP}{TP+FN} \)  
- **F1-Score:** \( 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \)
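
The four formulas translate directly into code. The counts passed in below are illustrative assumptions, and returning 0 when a denominator is zero is one common convention rather than the only option.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Convention: report 0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts (assumed).
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```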

---

## 10. Regularization

**Question:** Difference between L1 and L2 regularization.

**Answer:**  
- **L1 (Lasso):** Adds a penalty on the absolute values of the weights → drives some weights to exactly zero (sparsity, feature selection)

$$
\text{Loss} = \text{MSE} + \lambda \sum |w_i|
$$

- **L2 (Ridge):** Adds a penalty on the squared weights → shrinks large weights toward zero

$$
\text{Loss} = \text{MSE} + \lambda \sum w_i^2
$$
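
The two penalties can be compared numerically with a short sketch. The values of `mse`, `lam`, and `weights` are hypothetical and chosen only to show how each penalty is computed.

```python
def l1_penalty(weights):
    """Sum of absolute weights (Lasso penalty term)."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weights (Ridge penalty term)."""
    return sum(w ** 2 for w in weights)

# Hypothetical values for illustration.
mse = 1.0
lam = 0.1
weights = [0.5, -2.0, 0.0]

lasso_loss = mse + lam * l1_penalty(weights)
ridge_loss = mse + lam * l2_penalty(weights)
print("L1 (Lasso) loss:", round(lasso_loss, 4))
print("L2 (Ridge) loss:", round(ridge_loss, 4))
```

The squared term makes the large weight (-2.0) dominate the L2 penalty, while the L1 penalty grows only linearly with it.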

---

## ✅ Tips for Interviews

- Always explain **intuition + math + example**  
- Draw diagrams when possible (e.g., Decision Tree, SVM)  
- Use **real dataset examples** like Iris, Titanic  
- Mention **overfitting & regularization** for complex models  
