

### **1. What is Simple Linear Regression?**

* Simple Linear Regression is a method used to examine the linear relationship between two variables: one independent (X) and one dependent (Y).
* The goal is to predict the value of Y based on X using the equation Y = mX + c.
* It assumes a constant rate of change and a straight-line relationship.
* It's often used in forecasting and trend analysis.
* The model minimizes errors using the least squares method.

---

### **2. What are the key assumptions of Simple Linear Regression?**

* **Linearity**: The relationship between X and Y must be linear.
* **Independence**: Observations should be independent of each other.
* **Homoscedasticity**: Constant variance of residuals across values of X.
* **Normality**: Residuals should be normally distributed.
* **No influential outliers**: Not a formal assumption, but extreme values can pull the fitted line and distort results.
* Violation of these assumptions can lead to biased or misleading results.

---

### **3. What does the coefficient m represent in the equation Y = mX + c?**

* The slope **m** represents the change in Y for every one-unit change in X.
* If m is positive, Y increases with X; if negative, Y decreases with X.
* It determines the steepness and direction of the regression line.
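* Its sign gives the direction of the relationship; the strength of the linear association is measured by the correlation coefficient, not by the slope.
* A larger magnitude means a larger change in Y per unit of X, but the value depends on the units in which X and Y are measured.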

---

### **4. What does the intercept c represent in the equation Y = mX + c?**

* The intercept **c** is the predicted value of Y when X is zero.
* It marks the point where the regression line crosses the Y-axis.
* Provides a baseline estimate of Y in the absence of X.
* It helps understand the model's output in theoretical scenarios.
* While not always practically interpretable, it is essential for the equation’s accuracy.

---

### **5. How do we calculate the slope m in Simple Linear Regression?**

* Formula:

  $$
  m = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}
  $$
* It minimizes the sum of squared errors between actual and predicted Y.
* Based on covariance of X and Y divided by variance of X.
* It ensures the best-fitting straight line.
* Essential in building an accurate predictive model.
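
The formula above can be checked numerically. The following is a minimal sketch using made-up data (the arrays `x` and `y` are illustrative assumptions); it computes the slope from the summation formula, derives the intercept from the fact that the fitted line passes through the point of means, and cross-checks against the covariance/variance form and NumPy's `np.polyfit`.

```python
import numpy as np

# Illustrative data (assumed for this example)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

n = len(x)
# Slope from the summation formula
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x) ** 2)
# Intercept: the least squares line passes through (mean(x), mean(y))
c = y.mean() - m * x.mean()

# Cross-checks: covariance / variance form and NumPy's least squares fit
m_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
m_np, c_np = np.polyfit(x, y, deg=1)

print(f"m = {m:.2f}, c = {c:.2f}")  # m = 1.94, c = 0.30
```

All three routes give the same slope, which is why the summation formula and "covariance divided by variance" are described interchangeably.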

---

### **6. What is the purpose of the least squares method in Simple Linear Regression?**

* Minimizes the sum of squared differences between observed and predicted values.
* Ensures the best-fit line through the data points.
* Reduces error in prediction.
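* Under the standard assumptions (linearity, independent errors with constant variance), provides unbiased, minimum-variance linear estimates of the model parameters.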
* Fundamental to regression analysis and model accuracy.
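
Written out for the simple model Y = mX + c, the quantity being minimized is the sum of squared residuals (the label SSE is introduced here for convenience):

$$
\text{SSE}(m, c) = \sum_{i=1}^{n} \left( Y_i - (mX_i + c) \right)^2
$$

Setting the partial derivatives of SSE with respect to m and c to zero yields the slope formula from question 5 and the intercept $c = \bar{Y} - m\bar{X}$.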

---

### **7. How is the coefficient of determination (R²) interpreted in Simple Linear Regression?**

* Indicates the proportion of variance in Y explained by X.
* R² ranges from 0 to 1:

  * **0**: No variance explained
  * **1**: All variance explained
* A higher R² generally indicates a closer fit to the observed data.
* R² = 0.75 means 75% of the variation in Y is explained by its linear relationship with X.
* R² measures association, not causation.
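
As a rough illustration of the definition, the sketch below computes R² as one minus the ratio of unexplained to total variation. The data are made up for the example.

```python
import numpy as np

# Illustrative data (assumed for this example)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

m, c = np.polyfit(x, y, deg=1)
y_pred = m * x + c

ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r2 = 1 - ss_res / ss_tot

# In simple linear regression, R² equals the squared Pearson correlation
r = np.corrcoef(x, y)[0, 1]
```

The equality between `r2` and `r ** 2` holds for simple linear regression with an intercept; it does not carry over unchanged to models with several predictors.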

---

### **8. What is Multiple Linear Regression?**

* An extension of simple regression using two or more independent variables.
* General form:

  $$
  Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon
  $$
* Helps analyze how multiple factors affect the outcome.
* Useful in real-world scenarios where many variables influence the result.
* More flexible but requires careful assumptions handling.
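
One way to see the general form in action is to simulate data from known coefficients and recover them by least squares. This is a sketch under assumed values: `true_beta`, the sample size, and the noise level are all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))            # two predictors, X1 and X2
true_beta = np.array([1.5, -2.0, 0.5])  # beta_0, beta_1, beta_2 (assumed)
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the first coefficient is the intercept beta_0
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
```

`beta_hat` should land close to `true_beta`, with the gap shrinking as the sample grows or the noise falls.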

---

### **9. What is the main difference between Simple and Multiple Linear Regression?**

* Simple: One independent variable.
* Multiple: Two or more independent variables.
* Multiple regression can model more complex relationships.
* Simple regression is easier to visualize and interpret.
* Multiple regression requires more data and computation.

---

### **10. What are the key assumptions of Multiple Linear Regression?**

* **Linearity** between predictors and the outcome.
* **Independence** of errors (no autocorrelation).
* **Homoscedasticity** of residuals.
* **Normal distribution** of residuals.
* **No multicollinearity** among predictors.
* Violations affect model reliability and interpretation.

---

### **11. What is heteroscedasticity, and how does it affect the results of a Multiple Linear Regression model?**

* Occurs when residual variance changes with levels of an independent variable.
* Coefficient estimates remain unbiased but become inefficient, and the usual standard errors are biased.
* Affects confidence intervals and hypothesis testing.
* Detected using residual plots or statistical tests (e.g., Breusch–Pagan).
* Addressed using transformations, robust standard errors, or different models.
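
The detection step can be sketched by hand. The example below computes the studentized (Koenker) form of the Breusch–Pagan statistic: n times the R² from regressing the squared residuals on the predictors. The data are simulated so that the noise grows with x, and the 5% critical value of 3.841 is the chi-squared cutoff for one degree of freedom. The helper `r_squared` is a name introduced here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(1, 10, size=n)
# Noise standard deviation grows with x, so the data are heteroscedastic by construction
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=n)

def r_squared(design, target):
    """R² of an ordinary least squares fit of target on design (design includes the intercept)."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    return 1 - resid @ resid / np.sum((target - target.mean()) ** 2)

design = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta

# Auxiliary regression: squared residuals on the same predictors
lm_stat = n * r_squared(design, resid ** 2)
CHI2_CRIT_1DF_5PCT = 3.841  # chi-squared critical value, 1 degree of freedom, 5% level
heteroscedastic = lm_stat > CHI2_CRIT_1DF_5PCT
```

A statistic above the critical value rejects the hypothesis of constant variance. In practice a statistics package would also report the p-value.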

---

### **12. How can you improve a Multiple Linear Regression model with high multicollinearity?**

* Remove or combine highly correlated variables.
* Use dimensionality reduction (like PCA).
* Apply Ridge or Lasso regression (regularization).
* Evaluate Variance Inflation Factor (VIF) to identify collinearity.
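* Center variables before forming interaction or polynomial terms; standardizing alone does not change the correlation between two distinct predictors.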
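
The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A minimal NumPy sketch follows; the function name `vif` and the simulated columns are assumptions made for the example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X, shape (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus every other column
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # unrelated to the others
vifs = vif(np.column_stack([x1, x2, x3]))
```

Here the first two columns receive very large VIFs while the third stays near 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.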

---

### **13. What are some common techniques for transforming categorical variables for use in regression models?**

* **One-hot encoding**: Converts categories into binary columns.
* **Label encoding**: Assigns integer values to categories (this implies an ordering, which a linear model will treat as meaningful).
* **Binary encoding**: Represents categories in binary code.
* **Ordinal encoding**: Maintains order in ranked categories.
* Categorical transformations are essential because regression requires numerical inputs.
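
A short sketch of one-hot and ordinal encoding with pandas. The column names and category values are invented for illustration, and the ordering in `size_order` is an assumption about what the categories mean.

```python
import pandas as pd

# Illustrative data (assumed for this example)
df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Lima", "Tokyo"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding; drop_first=True drops one category so the dummy
# columns are not perfectly collinear with the intercept
one_hot = pd.get_dummies(df["city"], prefix="city", drop_first=True)

# Ordinal encoding with an explicit, meaningful order
size_order = {"small": 0, "medium": 1, "large": 2}
size_encoded = df["size"].map(size_order)
```

With `drop_first=True`, the dropped category (here `Lima`, the first alphabetically) becomes the baseline against which the other city coefficients are interpreted.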

---

### **14. What is the role of interaction terms in Multiple Linear Regression?**

* Show how two or more variables jointly affect the outcome.
* Example: Effect of X₁ depends on level of X₂.
* Captures combined influences that individual terms miss.
* Important in modeling real-world complexities.
* Must be interpreted carefully to avoid confusion.
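
To make the "effect of X₁ depends on X₂" idea concrete, the sketch below simulates data with a known interaction and fits a model that includes the product term. The coefficients 1, 2, 3, and 4 are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# True model (assumed): y = 1 + 2*x1 + 3*x2 + 4*x1*x2 + noise
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2 + rng.normal(scale=0.1, size=200)

# Design matrix: intercept, main effects, and the interaction column x1*x2
design = np.column_stack([np.ones(200), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

def x1_effect(x2_value):
    """Change in predicted y per unit of x1, at a given level of x2."""
    return beta[1] + beta[3] * x2_value
```

With an interaction in the model, the coefficient on `x1` alone is its effect only when `x2` is zero; at other levels the effect is `beta[1] + beta[3] * x2`, which is why such terms need careful interpretation.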

---

### **15. How can the interpretation of intercept differ between Simple and Multiple Linear Regression?**

* **Simple regression**: Intercept is Y when X = 0.
* **Multiple regression**: Intercept is Y when all of X₁, X₂, ..., Xₙ = 0.
* May not have practical meaning if zero is unrealistic.
* More theoretical in MLR but essential for model calculation.

---

### **16. What is the significance of the slope in regression analysis, and how does it affect predictions?**

* The slope measures the rate of change in Y with respect to X.
* Its sign indicates the direction (positive or negative) of the relationship.
* Steeper slopes mean larger changes in predicted Y per unit change in X.
* Comparing slopes across predictors is meaningful only when the predictors are on comparable scales.
* Helps in forecasting and decision-making.

---

### **17. How does the intercept in a regression model provide context for the relationship between variables?**

* The intercept indicates the expected value of the dependent variable when all independent variables are zero.
* It serves as a baseline or starting point in the regression equation.
* In some cases, it may have no practical meaning (e.g., if zero is not a possible value for predictors), but it's essential for accurately modeling the relationship.
* It shows the level of the response that the model predicts before any predictor contributes.

---

### **18. What are the limitations of using R² as a sole measure of model performance?**

* R² does not indicate whether the regression model is appropriate or whether the predictions are unbiased.
* It can be artificially high for overfitted models or models with many variables.
* It does not indicate causality, only the strength of association.
* A high R² does not guarantee good predictive performance on unseen data.
* It does not account for the number of predictors, which is why adjusted R² is often preferred.

---

### **19. How would you interpret a large standard error for a regression coefficient?**

* A large standard error indicates high variability in the estimate of the coefficient.
* When it is large relative to the coefficient itself, the predictor's effect is estimated imprecisely and may not be statistically significant.
* Large standard errors lead to wider confidence intervals, making statistical conclusions less reliable.
* It may signal multicollinearity or poor data quality.
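
For reference, coefficient standard errors under the usual assumptions come from the diagonal of σ²(XᵀX)⁻¹. The sketch below computes them with NumPy on simulated data. The 1.96 multiplier is a normal approximation to the 95% interval, and all data values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

sigma2 = resid @ resid / (n - X.shape[1])   # residual variance estimate
cov_beta = sigma2 * np.linalg.inv(X.T @ X)  # covariance matrix of the estimates
std_err = np.sqrt(np.diag(cov_beta))

# Approximate 95% confidence intervals (normal approximation)
ci_low = beta - 1.96 * std_err
ci_high = beta + 1.96 * std_err
```

Larger standard errors widen `ci_low` to `ci_high` directly, which is the link between the first and third bullets above.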

---

### **20. How can heteroscedasticity be identified in residual plots, and why is it important to address it?**

* In a plot of residuals against fitted values, heteroscedasticity appears as a funnel shape: the residuals spread wider or narrower as the fitted values increase.
* It violates one of the key assumptions of linear regression.
* It can lead to inefficient estimates, biased standard errors, and unreliable hypothesis tests.
* Techniques to address it include data transformation, weighted least squares, or using robust standard errors.
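
Of the remedies listed, weighted least squares is the easiest to sketch by hand: scale each row by the square root of its weight and run ordinary least squares on the result. The example assumes the error variance is known to be proportional to x², which is rarely known exactly in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 10, size=n)
# Noise standard deviation proportional to x (heteroscedastic by construction)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=n)

design = np.column_stack([np.ones(n), x])

# Ordinary least squares, for comparison
beta_ols, *_ = np.linalg.lstsq(design, y, rcond=None)

# Weighted least squares: weight each observation by 1 / variance.
# The variance is assumed proportional to x**2 here.
w = 1.0 / x**2
sqrt_w = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
```

Both estimators are unbiased in this setup; the weighted fit simply gives noisier observations less influence, which makes it more efficient when the weights are right.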

---

### **21. What does it mean if a Multiple Linear Regression model has a high R² but low adjusted R²?**

* It suggests that adding more predictors has increased the R² but not improved model quality.
* Adjusted R² penalizes the number of predictors and increases only when a new variable improves the fit by more than would be expected by chance.
* A high R² but low adjusted R² may indicate overfitting.
* It calls for reevaluation of predictor variables and model simplification.
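
The adjustment itself is a one-line formula: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), with n observations and p predictors. A small sketch follows, with the function name and the example numbers chosen for illustration.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R² for a model with an intercept and n_predictors predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Same raw R², very different adjusted values (numbers chosen for illustration)
few_predictors = adjusted_r2(0.90, n_samples=20, n_predictors=2)    # about 0.888
many_predictors = adjusted_r2(0.90, n_samples=20, n_predictors=15)  # 0.525
```

With 15 predictors and only 20 observations, an R² of 0.90 shrinks to an adjusted value of 0.525, which is the pattern this question describes.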

---

### **22. Why is it important to scale variables in Multiple Linear Regression?**

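* Ordinary least squares predictions are unaffected by the scale of the predictors, but the coefficient magnitudes are, so scaling puts coefficients on a comparable footing.
* It improves numerical stability and convergence speed, especially in models using regularization (e.g., Ridge, Lasso).
* Standardized coefficients can be compared across predictors measured in different units.
* In regularized models, it prevents the penalty from falling unevenly on predictors simply because of their units.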
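
A minimal sketch of scaling inside a regularized model, using scikit-learn's `StandardScaler` and `Ridge` in a pipeline. The predictors (`income`, `age`), their scales, and the coefficients are invented for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, size=200)  # large numeric scale
age = rng.normal(40, 10, size=200)             # small numeric scale
X = np.column_stack([income, age])
y = 0.0001 * income + 0.5 * age + rng.normal(scale=1.0, size=200)

# Scaling happens inside the pipeline, so it is learned from the data passed to fit()
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

# After scaling, each column has mean 0 and standard deviation 1
X_scaled = model.named_steps["standardscaler"].transform(X)
```

Keeping the scaler in the pipeline means the same means and standard deviations learned during fitting are reused at prediction time.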

---

### **23. What is polynomial regression?**

* Polynomial regression is a type of regression analysis where the relationship between the independent and dependent variable is modeled as an nth-degree polynomial.
* It allows fitting curves to the data rather than straight lines.
* For example, Y = β₀ + β₁X + β₂X² + ... + βₙXⁿ + ε
* It is used when data shows a curvilinear trend.

---

### **24. How does polynomial regression differ from linear regression?**

* Linear regression models linear relationships, while polynomial regression can model nonlinear patterns.
* Polynomial regression includes higher-order powers of the independent variable.
* Linear regression fits a straight line, polynomial fits curves.
* Both are linear in coefficients but differ in the functional form.

---

### **25. When is polynomial regression used?**

* When the relationship between the dependent and independent variables is nonlinear.
* When residual plots indicate patterns that a linear model cannot capture.
* For modeling accelerating or decelerating growth, U-shaped relationships, and other smooth curves over a limited range.
* To improve prediction accuracy when linear models underperform.

---

### **26. What is the general equation for polynomial regression?**

* Y = β₀ + β₁X + β₂X² + β₃X³ + ... + βₙXⁿ + ε
* Where β are coefficients, X is the independent variable, and n is the degree of the polynomial.
* It captures more complex relationships as the degree increases.

---

### **27. Can polynomial regression be applied to multiple variables?**

* Yes, it can be extended to multiple variables (multivariate polynomial regression).
* Includes interaction terms and polynomial terms for each predictor.
* For example: Y = β₀ + β₁X₁ + β₂X₁² + β₃X₂ + β₄X₂² + β₅X₁X₂ + ε
* Complexity increases significantly with more variables.
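
The terms in the example equation can be generated automatically. The sketch below uses scikit-learn's `PolynomialFeatures` on a single made-up row with X₁ = 2 and X₂ = 3; `include_bias=False` leaves the intercept to the regression model.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one row with X1 = 2, X2 = 3 (illustrative)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Columns, in order: X1, X2, X1^2, X1*X2, X2^2
print(X_poly)  # [[2. 3. 4. 6. 9.]]
```

Two predictors at degree 2 already give five columns; the count grows quickly with more predictors or higher degrees, which is the complexity noted in the last bullet.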

---

### **28. What are the limitations of polynomial regression?**

* Prone to overfitting, especially with high-degree polynomials.
* Poor extrapolation beyond the range of the data.
* Interpretation becomes difficult as the degree increases.
* Higher degrees can lead to large variance and numerical instability.

---

### **29. What methods can be used to evaluate model fit when selecting the degree of a polynomial?**

* Use cross-validation techniques (like k-fold CV).
* Compare RMSE or MAE for different degrees.
* Analyze adjusted R² and AIC/BIC values.
* Use validation curves to visualize performance against complexity.
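
The first two bullets can be combined into a short loop: fit each candidate degree with k-fold cross-validation and compare the average RMSE. This is a sketch on simulated data whose true relationship is quadratic; the range of degrees and the 5 folds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
# True relationship (assumed): quadratic plus noise
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

cv_rmse = {}
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn reports negated RMSE so that higher is always better
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[degree] = -scores.mean()

best_degree = min(cv_rmse, key=cv_rmse.get)
```

The error should drop sharply from degree 1 to degree 2 and then flatten. When several degrees score similarly, preferring the lowest one is a reasonable guard against overfitting.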

---

### **30. Why is visualization important in polynomial regression?**

* Helps assess how well the model captures the data's shape.
* Makes it easier to detect overfitting or underfitting.
* Visual residual plots can indicate if the model is appropriate.
* Useful for communicating findings and interpreting relationships.

---

### **31. How is polynomial regression implemented in Python?**

* Use `PolynomialFeatures` from `sklearn.preprocessing` to generate polynomial terms.
* Combine with `LinearRegression` from `sklearn.linear_model`:

```python
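import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Placeholder training data (illustrative); replace with your own X_train, y_train
X_train = np.linspace(-3, 3, 50).reshape(-1, 1)
y_train = 1 + 2 * X_train[:, 0] + 0.5 * X_train[:, 0] ** 2

# Expand X into [1, X, X^2], then fit ordinary least squares on those terms
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)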
```

* `make_pipeline` chains the feature expansion and the regression, so the same transformation is applied when fitting and when predicting.
* Evaluate using R² score, residuals, and visualization.

**End of Answer Sheet**
