
---

# THEORETICAL QUESTIONS

---

**1. What does R-squared represent in a regression model?**

   - **R-squared**, or the coefficient of determination, measures how well the independent variables explain the variability of the dependent variable.

   - A value of **1** means the model explains 100% of the variability in the data, while a value of 0 means it explains none.

   - For example, if \( R^2 = 0.75 \), 75% of the variation in the target variable is explained by the predictors, while the remaining 25% is due to unknown or unmodeled factors.
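
As a minimal sketch of this interpretation, the snippet below computes \( R^2 \) as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are illustrative assumptions, not part of the original answer.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical data, chosen only for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.1, 7.2, 8.9]

r2_model = r_squared(actual, predicted)
r2_mean_only = r_squared(actual, [6.0] * 4)  # always predicting the mean
print(round(r2_model, 3), r2_mean_only)
```

Here the model's predictions give an \( R^2 \) of about 0.995, while always predicting the mean of the target gives exactly 0, matching the "explains none" end of the scale.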

**2. What are the assumptions of linear regression?**

   - Linear regression relies on the following assumptions for accurate results:

   1. **Linearity**: The relationship between predictors and the target is linear.

   2. **Independence**: Observations are independent (e.g., no correlation between consecutive errors in time series).

   3. **Homoscedasticity**: The residuals (errors) have constant variance across all levels of the independent variables.

   4. **Normality of residuals**: Residuals should be normally distributed to ensure valid hypothesis testing.
   
   5. **No multicollinearity**: Predictors should not be highly correlated with each other, as it makes coefficients unreliable.

**3. What is the difference between R-squared and Adjusted R-squared?**

   - **R-squared** never decreases when a new predictor is added to the model, even if the predictor doesn't improve it.

   - **Adjusted R-squared** accounts for the number of predictors in the model. It increases only if an added predictor improves the fit by more than would be expected by chance.
   
   - For example, if you add a predictor that doesn’t contribute much, Adjusted \( R^2 \) may decrease, while \( R^2 \) stays the same or rises slightly.
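
The difference can be illustrated with a small experiment, sketched here under the assumption that NumPy is available. The helper `r2_and_adjusted`, the random seed, and the synthetic data are all hypothetical; the sketch fits ordinary least squares with and without a pure-noise predictor.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Fit OLS with an intercept and return (R^2, adjusted R^2)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adjusted

rng = np.random.default_rng(0)       # assumed seed, for repeatability
n = 60
x = rng.normal(size=(n, 1))
y = 3.0 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 1))      # predictor unrelated to y

r2_base, adj_base = r2_and_adjusted(x, y)
r2_more, adj_more = r2_and_adjusted(np.hstack([x, noise]), y)
print(f"R^2:          {r2_base:.4f} -> {r2_more:.4f}")
print(f"Adjusted R^2: {adj_base:.4f} -> {adj_more:.4f}")
```

The first line of output can only stay level or go up, because the larger model contains the smaller one. The adjusted value may move in either direction, depending on whether the noise column happens to fit better than chance.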

**4. Why do we use Mean Squared Error (MSE)?**

   - **MSE** measures the average squared differences between the predicted and actual values.

   - Squaring the errors penalizes larger errors more than smaller ones, making MSE sensitive to significant deviations.
   
   - MSE is also convenient to optimize (for example, with gradient descent) because it is differentiable everywhere, unlike the absolute error, which has a kink at zero.
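
To make the optimization point concrete, here is a minimal sketch of gradient descent on the MSE of a one-parameter model \( y = w x \). The data, the learning rate, and the function names are assumptions chosen for illustration.

```python
def mse(w, xs, ys):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """Derivative of the MSE with respect to w."""
    return 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x

w = 0.0
learning_rate = 0.01  # assumed step size
for _ in range(500):
    # Step against the gradient; the derivative exists for every w.
    w -= learning_rate * mse_gradient(w, xs, ys)

print(round(w, 4), mse(w, xs, ys))
```

Because the gradient is defined everywhere and shrinks as the fit improves, the updates settle at the slope that generated the data.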

**5. What does an Adjusted R-squared value of 0.85 indicate?**
   - An Adjusted \( R^2 \) value of **0.85** means that roughly 85% of the variance in the target variable is explained by the predictors, after penalizing for the number of predictors.

   - It indicates a good fit that holds up once model complexity (the number of predictors) is taken into account, so the score is less likely to be inflated by overfitting.

---

**6. How do we check for normality of residuals in linear regression?**

   To ensure residuals follow a normal distribution, we can:

   1. Plot a **histogram** of residuals to visually check for normality.

   2. Use a **Q-Q plot** (quantile-quantile plot), where points should follow a straight line if residuals are normal.

   3. Perform statistical tests like the **Shapiro-Wilk test** or **Jarque-Bera test**.
   
   - Normal residuals are crucial for valid confidence intervals and p-values.
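
As a rough sketch of the third option, the Jarque-Bera statistic can be computed by hand from the sample skewness and kurtosis. The function below and the two simulated residual samples are illustrative assumptions; in practice a statistics library would normally be used.

```python
import math
import random

def jarque_bera(residuals):
    """Return (statistic, p_value) of the Jarque-Bera normality test."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality the statistic is approximately chi-square with
    # 2 degrees of freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

random.seed(0)  # assumed seed, for repeatability
normal_resid = [random.gauss(0.0, 1.0) for _ in range(2000)]
skewed_resid = [random.expovariate(1.0) - 1.0 for _ in range(2000)]

jb_normal, p_normal = jarque_bera(normal_resid)
jb_skewed, p_skewed = jarque_bera(skewed_resid)
print(f"normal-like residuals: JB={jb_normal:.2f}, p={p_normal:.3f}")
print(f"skewed residuals:      JB={jb_skewed:.2f}, p={p_skewed:.3g}")
```

A small p-value is evidence against normality, so the strongly skewed sample should be rejected decisively while the normal-like sample typically is not.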

**7. What is multicollinearity, and how does it impact regression?**
   - **Multicollinearity** occurs when two or more predictors are highly correlated, making it hard to determine their individual effects on the target variable.
   - Impact:
     - Coefficient estimates become unstable.
     - Predictions might remain accurate, but the interpretation of individual predictors becomes unreliable.
   - For example, if you use both "age" and "years of experience" in a model, they may show multicollinearity as they are often related.
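
One way to quantify the "age" and "years of experience" example is sketched below. It uses the variance inflation factor (VIF), a common diagnostic that the answer above does not name; with exactly two predictors it reduces to \( 1 / (1 - r^2) \), where \( r \) is their correlation. The data values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical employees: experience rises almost in step with age.
age = [25, 30, 35, 40, 45, 50]
experience = [2, 6, 11, 15, 22, 26]

r = pearson_r(age, experience)
vif = 1.0 / (1.0 - r ** 2)  # two-predictor special case of the VIF
print(f"correlation={r:.3f}, VIF={vif:.1f}")
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the value here is far beyond that range.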

**8. What is Mean Absolute Error (MAE)?**

   - **MAE** is the average of the absolute differences between predicted and actual values.

   - It’s simpler than MSE and is less sensitive to outliers because it doesn’t square errors.
   
   - For example, if the predictions are off by 5, 10, and 15 units, the MAE will be (5 + 10 + 15) / 3 = 10.

**9. What are the benefits of using an ML pipeline?**

   - **Consistency**: Ensures the same preprocessing steps are applied during training and prediction.

   - **Automation**: Reduces manual effort by automating tasks like scaling, encoding, and feature selection.

   - **Efficiency**: Streamlines workflows, saving time and effort.

   - **Reproducibility**: Makes it easy to reproduce results, especially with large datasets.
   
   - Example: A pipeline might standardize data, handle missing values, and train a regression model in one sequence.
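
The example sequence could look roughly like the following sketch, which assumes scikit-learn and NumPy are installed. The step names (`impute`, `scale`, `regress`) and the toy data are arbitrary choices; missing values are filled before scaling here, which is one reasonable ordering rather than the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value (np.nan) in the second feature.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0],
              [4.0, 40.0], [5.0, 50.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill missing values
    ("scale", StandardScaler()),                 # standardize features
    ("regress", LinearRegression()),             # fit the regression
])

model.fit(X, y)

# The same fitted preprocessing is reapplied automatically at prediction.
predictions = model.predict(np.array([[6.0, np.nan]]))
print(predictions)
```

Because the imputer and scaler are fitted inside the pipeline, the statistics learned from the training data are reused unchanged for new inputs, which is the consistency benefit listed first.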

**10. Why is RMSE considered more interpretable than MSE?**

   - **MSE** uses squared units of the target variable, making it hard to interpret.
   
   - **RMSE** takes the square root of MSE, bringing the error back to the same unit as the target variable, which is more intuitive for decision-making.
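
The unit difference is easy to see with numbers. This is a small sketch using hypothetical prices in dollars; the variable names are illustrative.

```python
import math

actual = [100.0, 200.0, 300.0]     # hypothetical prices, in dollars
predicted = [105.0, 190.0, 315.0]  # off by 5, 10 and 15 dollars

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)  # in dollars squared
rmse = math.sqrt(mse)                            # back in dollars
print(round(mse, 2), round(rmse, 2))
```

The MSE of about 116.67 is in "dollars squared" and is hard to relate to the data, whereas the RMSE of about 10.80 reads directly as a typical error of roughly 11 dollars.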

---

**11. What is pickling in Python, and how is it useful in ML?**

   - **Pickling** converts a Python object (like a trained model) into a binary format for saving or transferring.

   - In ML, pickling helps:

     - Save trained models to avoid retraining.
     
     - Share models with others or deploy them for predictions.
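
A minimal round trip with the standard-library `pickle` module might look like this. The "model" is a plain dictionary of hypothetical coefficients standing in for a trained estimator, and `predict` is an illustrative helper.

```python
import pickle

def predict(model, features):
    """Linear prediction from a dict holding an intercept and coefficients."""
    weighted = sum(c * x for c, x in zip(model["coefficients"], features))
    return model["intercept"] + weighted

# Stand-in for a trained model (hypothetical values).
model = {"intercept": 1.5, "coefficients": [2.0, -0.5]}

blob = pickle.dumps(model)     # serialize to bytes (could be written to a file)
restored = pickle.loads(blob)  # rebuild the object without retraining

print(predict(restored, [3.0, 4.0]))
```

To persist to disk instead, the same pair of calls has file-based counterparts, `pickle.dump(obj, file)` and `pickle.load(file)`, used with a file opened in binary mode.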

**12. What does a high R-squared value mean?**

   - A high \( R^2 \) value (close to 1) means the model explains a large portion of the variability in the dependent variable.
   
   - However, it doesn’t guarantee a good model; overfitting can inflate \( R^2 \). Always check residuals and other metrics like Adjusted \( R^2 \) and RMSE.

**13. What happens if linear regression assumptions are violated?**

   - If assumptions are violated:

     - Predictions might still work, but statistical inferences (e.g., p-values, confidence intervals) can become unreliable.

     - Heteroscedasticity or correlated errors leave the coefficient estimates unbiased but inefficient and distort their standard errors, while a violated linearity assumption can bias the coefficients themselves.
     
   - Example: If residuals are far from normally distributed and the sample is small, the p-values for coefficients might be misleading.


**14. How can we address multicollinearity in regression?**

   - **Remove correlated predictors**: Drop one of the highly correlated variables.

   - Use **Regularization** techniques like Lasso or Ridge regression to handle multicollinearity.
   
   - Apply **PCA (Principal Component Analysis)** to reduce dimensions while preserving most of the variance.

**15. How can feature selection improve model performance in regression analysis?**

   - **Feature selection** identifies and retains only the most relevant predictors, improving:

     - Model accuracy by reducing noise.

     - Interpretability by focusing on key drivers of the target variable.
     
     - Efficiency by lowering computational costs and training time.

---

**16. How is Adjusted R-squared calculated?**
   - Adjusted R-squared is calculated using the following formula:

   - \( R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \)
     Where:
     -  \( R^2 \): Regular R-squared value.
     -  \( n \): Number of observations.
     -  \( k \): Number of predictors.

   - It adjusts \( R^2 \) for the number of predictors, penalizing the addition of irrelevant features. Unlike \( R^2 \), it decreases if the added predictors don’t improve the model.
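
The formula translates directly into code. The sketch below uses a hypothetical function name and made-up inputs to show how the penalty grows with the number of predictors while the plain R-squared stays fixed.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Same R-squared and sample size, different numbers of predictors.
adj_5 = adjusted_r_squared(0.90, n=50, k=5)
adj_10 = adjusted_r_squared(0.90, n=50, k=10)
print(round(adj_5, 4), round(adj_10, 4))
```

With an R-squared of 0.90 on 50 observations, five predictors give about 0.8886 and ten predictors give about 0.8744, so the extra predictors must raise R-squared enough to offset the larger penalty.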

**17. Why is MSE sensitive to outliers?**

   - **MSE (Mean Squared Error)** squares the residuals (errors), which amplifies the effect of larger errors.

   - Because each error enters as its square, an error twice as large contributes four times as much, so a single outlier can dominate the overall MSE.
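
A quick numerical check, using made-up error values, shows how differently MSE and MAE react when one error becomes an outlier.

```python
def mse(errors):
    """Mean of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

typical = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, 20.0]  # last error replaced by an outlier

print("MSE:", mse(typical), "->", mse(with_outlier))
print("MAE:", mae(typical), "->", mae(with_outlier))
```

Replacing one error of size 1 with an error of size 20 raises the MAE from 1.0 to 5.75 but raises the MSE from 1.0 to 100.75, since the outlier alone contributes 400 to the sum of squares.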
     



**18. What is the role of homoscedasticity in linear regression?**

   - Homoscedasticity means that the residuals (errors) have constant variance across all levels of the predictors.

   - Importance:
     - Ensures valid statistical tests (e.g., t-tests, F-tests).
     - Keeps the standard errors of the coefficients from being over- or under-estimated.
   - Violations (heteroscedasticity) can lead to biased standard errors, affecting p-values and confidence intervals.

**19. What is Root Mean Squared Error (RMSE)?**

   - RMSE is the square root of MSE: \( RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \), where \( y_i \) are the actual values, \( \hat{y}_i \) the predicted values, and \( n \) the number of observations.

   - It measures the typical size of the difference between predicted and actual values in a regression model, with lower values indicating better performance. Because the differences are squared before averaging, it is sensitive to larger errors.
   
   - It can be read as the standard deviation of the prediction errors (when their mean is close to zero), providing an interpretable measure of model accuracy in the same units as the target variable.


**20. Why is pickling considered risky?**

   - Unpickling (loading a pickled file) can execute arbitrary code, making it a security risk if the file comes from an untrusted source or has been tampered with.

   - Risks include:

     - **Code injection**: Malicious code can be embedded in a pickled file.

     - **Compatibility issues**: Pickled files might not work across different Python or library versions.
     
   - Always verify the source of pickled files before loading them.
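
The code-execution risk can be demonstrated harmlessly. In this sketch the hypothetical `Payload` class uses the `__reduce__` hook to tell `pickle` which callable to invoke on load; the payload here only evaluates a trivial arithmetic expression, but a malicious file could name any importable function.

```python
import pickle

class Payload:
    """Stand-in for a tampered object; the class name is hypothetical."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading does not return a Payload. It calls eval(...) and returns
# whatever that call produced.
result = pickle.loads(blob)
print(result)
```

The loader never needed the `Payload` class at all: the byte stream alone decided what code ran, which is why only pickles from trusted sources should be loaded.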

---

**21. What alternatives exist to pickling for saving ML models?**

Alternatives to pickling for saving ML models include:

1. **Joblib**: Efficient for large models with NumPy arrays. It is built on pickle, so the same security caveats apply.

2. **ONNX**: Cross-platform format for different frameworks.

3. **TensorFlow SavedModel**: For TensorFlow models.

4. **HDF5**: Used with Keras for saving model architecture and weights.

5. **MLflow**: Framework for managing and saving models with versioning.

Which one fits best depends on the use case: some offer better performance on large arrays, others better portability across frameworks and environments.

**22. What is heteroscedasticity, and why is it a problem?**

   - **Heteroscedasticity** occurs when the residuals don’t have constant variance across predictor values (e.g., errors increase with larger values of predictors).

   - Problems caused:

     - Standard errors become biased, leading to invalid hypothesis testing.

     - Predictions may still be accurate, but inferences (e.g., p-values) are unreliable.
     
   - Example: In a housing price model, errors may increase for higher-priced homes.
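
The housing example can be simulated to show what heteroscedastic residuals look like numerically. In this sketch the prices, the noise scale, and the seed are all assumptions; it simply compares the residual variance in the cheaper half of the homes with that in the more expensive half, loosely in the spirit of a split-sample variance comparison.

```python
import random
import statistics

random.seed(42)  # assumed seed, for repeatability

# Hypothetical home prices in thousands, sorted from low to high.
prices = [100 + 10 * i for i in range(200)]

# Simulated residuals whose spread grows with the price.
residuals = [random.gauss(0.0, 0.05 * p) for p in prices]

half = len(prices) // 2
var_low = statistics.pvariance(residuals[:half])   # cheaper homes
var_high = statistics.pvariance(residuals[half:])  # pricier homes
ratio = var_high / var_low
print(f"low={var_low:.1f}, high={var_high:.1f}, ratio={ratio:.1f}")
```

Under homoscedasticity the two variances would be similar and the ratio close to 1; a ratio well above 1 is the pattern described in the example.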


**23. How can interaction terms enhance a regression model's predictive power?**

   - **Interaction terms** account for the combined effect of two or more variables on the target.

   - They allow the model to capture relationships that a simple additive model might miss.
   - Example:
   
     - Suppose we have variables **"hours studied"** and **"sleep hours"**. The effect of studying on grades might depend on the amount of sleep.

     - Adding an interaction term like **(hours studied × sleep hours)** helps capture this dependency.
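
The study and sleep example can be sketched with NumPy's least-squares solver. The data below are synthetic and noise-free, generated from an assumed relationship that includes an interaction, so that the difference between the additive model and the interaction model is easy to see. The helper `fit` is hypothetical.

```python
import numpy as np

# Grid of hypothetical study hours (1-5) and sleep hours (4-8).
hours, sleep = np.meshgrid(np.arange(1.0, 6.0), np.arange(4.0, 9.0))
hours, sleep = hours.ravel(), sleep.ravel()

# Assumed true relationship, including an interaction effect.
grade = 1.0 + 2.0 * hours + 3.0 * sleep + 0.5 * hours * sleep

def fit(columns, y):
    """OLS with an intercept; returns (coefficients, sum of squared errors)."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = float(((y - design @ coef) ** 2).sum())
    return coef, sse

_, sse_additive = fit([hours, sleep], grade)
coef_inter, sse_interaction = fit([hours, sleep, hours * sleep], grade)

print("additive SSE:   ", round(sse_additive, 4))
print("interaction SSE:", round(sse_interaction, 4))
print("coefficients:   ", np.round(coef_inter, 3))
```

The additive model leaves a systematic error because it cannot represent the dependency, while the model with the `hours * sleep` column recovers the assumed coefficients.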

---