# Theoretical Questions

1. What is Simple Linear Regression?

- Simple Linear Regression is a statistical method used to model the linear relationship between a single independent variable (predictor) and a single dependent variable (response). The goal is to find the best-fitting straight line that describes how the dependent variable changes as the independent variable changes.

2. What are the key assumptions of Simple Linear Regression?

- The key assumptions are :
* **Linearity :** The relationship between the independent and dependent variables is linear.
* **Independence of Errors :** The residuals (the differences between the observed and predicted values) are independent of each other.
* **Homoscedasticity :** The variance of the residuals is constant across all levels of the independent variable.
* **Normality of Errors :** The residuals are normally distributed.

3. What does the coefficient m represent in the equation Y = mX + c?

- In the equation Y = mX + c, where Y is the dependent variable and X is the independent variable, m (slope) represents the change in the dependent variable (Y) for a one-unit change in the independent variable (X). It indicates the direction and magnitude of the linear relationship.

4. What does the intercept c represent in the equation Y = mX + c?

- In the equation Y = mX + c, c (intercept) represents the value of the dependent variable (Y) when the independent variable (X) is zero. It's the point where the regression line crosses the Y-axis.

5. How do we calculate the slope m in Simple Linear Regression?

- The slope (m) is typically calculated using the formula:
m = Σ[(Xi - X̄)(Yi - Ȳ)] / Σ[(Xi - X̄)²]

- where :
* Xi and Yi are the individual data points.
* X̄ is the mean of the independent variable.
* Ȳ is the mean of the dependent variable.
* Σ denotes the summation.

- Alternatively, using the correlation coefficient (r) and standard deviations (Sx and Sy), m = r * (Sy / Sx)
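
As a minimal sketch of both slope formulas above, the snippet below computes m two ways with NumPy and checks that they agree. The five data points are made up purely for illustration, and the intercept line uses the standard fact that the fitted line passes through (X̄, Ȳ).

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()

# Deviation form: m = Σ[(Xi - X̄)(Yi - Ȳ)] / Σ[(Xi - X̄)²]
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# The least squares line passes through (X̄, Ȳ), which gives the intercept.
c = y_mean - m * x_mean

# Correlation form: m = r * (Sy / Sx), using sample standard deviations.
r = np.corrcoef(x, y)[0, 1]
m_from_r = r * (y.std(ddof=1) / x.std(ddof=1))

print(f"m = {m:.3f}, c = {c:.3f}, m from r = {m_from_r:.3f}")
```

For this data both routes give a slope of about 0.6 and an intercept of about 2.2.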

6. What is the purpose of the least squares method in Simple Linear Regression?

- The purpose of the least squares method is to find the best-fitting regression line by minimizing the sum of the squared differences between the observed values of the dependent variable and the values predicted by the regression line. Squaring the differences keeps positive and negative errors from cancelling out and penalizes large errors more heavily than small ones.

7. How is the coefficient of determination (R²) interpreted in Simple Linear Regression?

- The coefficient of determination (R²) represents the proportion of the variance in the dependent variable that is predictable from the independent variable. For a least squares fit with an intercept, evaluated on the data used to fit it, R² ranges from 0 to 1.
* An R² of 1 indicates that the regression model perfectly explains all the variance in the dependent variable.
* An R² of 0 indicates that the model explains none of the variance.
* For example, an R² of 0.80 means that 80% of the variation in the dependent variable can be explained by the independent variable.
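
The interpretation above follows from the usual definition R² = 1 - SS_res / SS_tot. The sketch below computes it with NumPy on made-up data; the helper name `r_squared` is an illustrative choice, not a library function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical sample data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept].
m, c = np.polyfit(x, y, deg=1)
r2 = r_squared(y, m * x + c)

print(f"R² = {r2:.3f}")
```

Predicting every point perfectly gives 1, and always predicting the mean of Y gives 0, which matches the two boundary cases listed above.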

8. What is Multiple Linear Regression?

- Multiple Linear Regression is a statistical method used to model the linear relationship between a single dependent variable and two or more independent variables. The goal is to find the best-fitting hyperplane that describes how the dependent variable changes as the independent variables change. With p independent variables, the equation is generally represented as : Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε, where ε is the error term.

9. What is the main difference between Simple and Multiple Linear Regression?

- The main difference is the number of independent variables used to predict the dependent variable. Simple Linear Regression uses one independent variable, while Multiple Linear Regression uses two or more independent variables.

10. What are the key assumptions of Multiple Linear Regression?

- Similar to Simple Linear Regression, the key assumptions are :
* **Linearity :** The relationship between the dependent variable and each independent variable is linear.
* **Independence of Errors :** The residuals are independent of each other.
* **Homoscedasticity :** The variance of the residuals is constant across all combinations of the independent variables.
* **Normality of Errors :** The residuals are normally distributed.
* **No Perfect Multicollinearity :** The independent variables are not perfectly linearly correlated with each other.

11. What is heteroscedasticity, and how does it affect the results of a Multiple Linear Regression model?

- Heteroscedasticity occurs when the variance of the residuals is not constant across all levels of the independent variables. This violates the assumption of homoscedasticity. It can lead to :
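* **Inefficient coefficient estimates :** The least squares coefficient estimates remain unbiased, but they no longer have the smallest possible variance among linear unbiased estimators.
* **Biased standard errors :** The usual standard error formulas are no longer valid, which can lead to incorrect conclusions about the statistical significance of the coefficients (Type I or Type II errors).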

12. How can you improve a Multiple Linear Regression model with high multicollinearity?

- High multicollinearity (high correlation between independent variables) can be reduced by :
* Removing one or more of the highly correlated variables.
* Combining the correlated variables into a single variable (if conceptually meaningful).
* Using dimensionality reduction techniques like Principal Component Analysis (PCA).
* Using regularization techniques like Ridge or Lasso regression.
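
To illustrate the regularization option, here is a minimal sketch of ridge regression in closed form, β = (XᵀX + αI)⁻¹Xᵀy, written with NumPy rather than a library estimator. The data are synthetic, `ridge_fit` is a hypothetical helper name, and the penalty `alpha=1.0` is an arbitrary choice for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, nearly collinear predictors: x2 is almost a copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Center predictors and response so that no intercept column is needed.
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y_centered = y - y.mean()

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution; alpha=0 reduces to ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

beta_ols = ridge_fit(X, y_centered, alpha=0.0)
beta_ridge = ridge_fit(X, y_centered, alpha=1.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With collinear predictors the unpenalized coefficients can be large and unstable, while the penalized ones are shrunk toward zero and their sum stays close to the true combined effect of 3.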

13. What are some common techniques for transforming categorical variables for use in regression models?

- Common techniques include :
* **One-Hot Encoding :** Creating binary (0 or 1) dummy variables for each category of the categorical variable.
* **Label Encoding :** Assigning a unique numerical label to each category. This should be used with caution for nominal variables as it can imply an ordinal relationship.
* **Ordinal Encoding :** Assigning numerical labels based on the order of the categories (used for ordinal variables).
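
One-hot encoding can be sketched in a few lines of NumPy. The `one_hot` helper and its `drop_first` option are illustrative names, not a library API. Dropping one dummy column is a common way to avoid perfect multicollinearity with the intercept.

```python
import numpy as np

def one_hot(values, drop_first=False):
    """Return (category names, 0/1 matrix) with one column per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    if drop_first:
        # Drop the first category; it becomes the reference level.
        return categories[1:], encoded[:, 1:]
    return categories, encoded

# Hypothetical nominal variable.
colors = ["red", "green", "blue", "green"]
cats, dummies = one_hot(colors)

print(cats)
print(dummies)
```

Each row contains exactly one 1, in the column of its category, so no artificial ordering is implied between the categories.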

14. What is the role of interaction terms in Multiple Linear Regression?

- Interaction terms in Multiple Linear Regression allow the effect of one independent variable on the dependent variable to depend on the level of another independent variable. They capture synergistic or antagonistic effects between predictors that a simple additive model would miss. For example, an interaction term between advertising spend and season might show that the impact of advertising is different in summer versus winter.
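
The advertising example can be sketched as follows, assuming a made-up, noise-free data set in which the effect of spend is 2.0 in winter and 3.5 in summer. The variable names and coefficient values are hypothetical. The interaction is simply the product of the two predictors, added as an extra column.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

spend = rng.uniform(0, 10, size=n)    # hypothetical advertising spend
summer = rng.integers(0, 2, size=n)   # 1 = summer, 0 = winter

# Noise-free response so the fitted coefficients can be compared with the truth.
sales = 5.0 + 2.0 * spend + 1.0 * summer + 1.5 * spend * summer

# Design matrix: intercept, main effects, and the interaction column.
X = np.column_stack([np.ones(n), spend, summer, spend * summer])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)

slope_winter = beta[1]
slope_summer = beta[1] + beta[3]
print(f"spend effect in winter: {slope_winter:.2f}, in summer: {slope_summer:.2f}")
```

The coefficient on the product column is the difference between the two seasonal slopes, which is exactly what an additive model without the interaction cannot represent.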

15. How can the interpretation of intercept differ between Simple and Multiple Linear Regression?

- In **Simple Linear Regression** (Y = mX + c), the intercept 'c' is the value of Y when X is zero.

- In **Multiple Linear Regression** (Y = β₀ + β₁X₁ + β₂X₂ + ...), the intercept β₀ is the value of Y when all independent variables (X₁, X₂, ...) are zero. The interpretation depends on whether it's practically meaningful for all independent variables to be zero simultaneously. If it's not, the intercept might not have a direct real-world interpretation but is still necessary for the model.

16. What is the significance of the slope in regression analysis, and how does it affect predictions?

- The slope (or each coefficient in multiple regression) indicates the rate of change in the dependent variable for a one-unit change in the corresponding independent variable, holding the other variables constant in the multiple regression case. Its sign gives the direction of the relationship (positive or negative) and its absolute value gives the magnitude. A slope that is larger in absolute value means a greater change in the predicted value for each unit change in the predictor, so predictions are more sensitive to that variable.

17. How does the intercept in a regression model provide context for the relationship between variables?

- The intercept provides the baseline value of the dependent variable when all independent variables are at their reference point (often zero). It sets the starting point for the linear relationship and helps to anchor the model's predictions. While sometimes not directly interpretable in a practical sense, it's a crucial parameter in defining the regression line or hyperplane.

18. What are the limitations of using R² as a sole measure of model performance?

- Limitations of R² include :
* It doesn't indicate if the model is correctly specified. A high R² doesn't mean the chosen model is the most appropriate.
* It can be artificially inflated by adding more variables to the model, even if those variables are not truly related to the dependent variable. Adjusted R² attempts to address this.
* It doesn't tell you about the direction or nature of the relationships between variables.
* It doesn't assess the precision of the coefficient estimates.
* It is sensitive to outliers.
* It doesn't imply causation.

19. How would you interpret a large standard error for a regression coefficient?

- A large standard error for a regression coefficient indicates that there is a high degree of uncertainty in the estimated value of that coefficient. It suggests that the true effect of the independent variable on the dependent variable might be quite different from the estimated coefficient. This could be due to factors like small sample size, high variability in the data, or multicollinearity.

20. How can heteroscedasticity be identified in residual plots, and why is it important to address it?

- **Heteroscedasticity** can often be identified in residual plots (residuals plotted against predicted values or one of the independent variables) by observing a non-constant variance of the residuals. Common patterns include a funnel shape (where the spread of residuals increases or decreases as the predicted values increase).

- It's important to address **heteroscedasticity** because it violates a key assumption of linear regression, leading to unreliable standard errors of the coefficients and potentially incorrect inferences about the significance of the predictors.
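
A residual plot is the usual tool, but the funnel shape can also be checked numerically. The sketch below is a rough illustration, not a formal test. It generates synthetic data whose noise grows with X, fits a line, and compares the residual spread in the lower and upper halves of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

x = rng.uniform(1, 10, size=n)
# Noise standard deviation grows with x, so the data are heteroscedastic by construction.
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=n)

# Fit a straight line and compute residuals.
m, c = np.polyfit(x, y, deg=1)
fitted = m * x + c
residuals = y - fitted

# Split residuals by fitted value and compare their spread.
order = np.argsort(fitted)
low = residuals[order[: n // 2]]
high = residuals[order[n // 2 :]]
spread_ratio = high.std(ddof=1) / low.std(ddof=1)

print(f"residual spread, upper half vs lower half: {spread_ratio:.2f}")
```

A ratio well above 1 corresponds to a funnel that widens to the right in the residual plot; a ratio near 1 is what constant variance would produce.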

21. What does it mean if a Multiple Linear Regression model has a high R² but low adjusted R²?

- A high R² with a low adjusted R² suggests that while the model explains a large proportion of the variance in the dependent variable, a significant portion of this explanation might be due to the inclusion of one or more independent variables that do not truly contribute to the model. The adjusted R² penalizes the addition of unnecessary variables, so a large discrepancy between R² and adjusted R² indicates that the model might be overfitting the data.
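
The penalty can be seen directly from the standard formula, adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. In the sketch below, the function name and the example numbers are illustrative.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("n_samples must exceed n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Same R² of 0.90 on 20 observations, with few versus many predictors.
few = adjusted_r2(0.90, n_samples=20, n_predictors=2)
many = adjusted_r2(0.90, n_samples=20, n_predictors=15)

print(f"2 predictors: {few:.3f}, 15 predictors: {many:.3f}")
```

With 2 predictors the adjusted value stays near 0.89, while with 15 predictors it drops to about 0.53, which is the kind of gap the answer above describes.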

22. Why is it important to scale variables in Multiple Linear Regression?

- Scaling variables (e.g., using standardization or normalization) can be important in Multiple Linear Regression for several reasons :
* To address issues with variables having vastly different scales : This can slow down or destabilize iterative optimization algorithms such as gradient descent. The closed-form least squares solution gives the same predictions regardless of scale.
* To make the coefficients more directly comparable: When predictors are on different scales, their coefficients are also on different scales, making it hard to directly compare their impact. Scaling can put them on a more level playing field.
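* To reduce the collinearity between a predictor and its own polynomial or interaction terms : Centering the predictors helps here, although scaling does not remove correlation between distinct predictors.
* To make regularization behave sensibly : L1 and L2 penalties depend on the size of the coefficients, so predictors on different scales would otherwise be penalized unevenly.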
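
Standardization (z-scoring) can be sketched with NumPy as below. The `standardize` helper and the two-column example are illustrative. The function returns the mean and standard deviation as well, on the assumption that the same values would be reused to transform new data.

```python
import numpy as np

def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

# Hypothetical predictors on very different scales.
X = np.array([
    [1000.0, 0.1],
    [2000.0, 0.2],
    [3000.0, 0.4],
    [4000.0, 0.5],
])
X_scaled, mean, std = standardize(X)

print(X_scaled.round(3))
```

After scaling, a coefficient describes the change in Y per one standard deviation of its predictor, which is what makes the coefficients comparable.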

23. What is polynomial regression?

- Polynomial regression is a form of regression analysis in which the relationship between the independent variable(s) and the dependent variable is modeled as an nth-degree polynomial. It's used when the relationship between the variables is non-linear.

24. How does polynomial regression differ from linear regression?

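- Simple linear regression fits a straight line, while polynomial regression captures a curved relationship by including polynomial terms (e.g., squared, cubed) of the independent variable(s) in the regression equation. The model remains linear in its coefficients, so polynomial regression is technically a special case of multiple linear regression and can be fitted with the same least squares method.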

25. When is polynomial regression used?

- Polynomial regression is used when the scatter plot of the data shows a curved relationship between the independent and dependent variables, indicating that a straight line would not be a good fit.

26. What is the general equation for polynomial regression?

- For a single independent variable X, the general equation for an nth-degree polynomial regression is:
Y = β₀ + β₁X + β₂X² + β₃X³ + ... + βₙXⁿ + ε

27. Can polynomial regression be applied to multiple variables?

- Yes, polynomial regression can be applied to multiple independent variables. The equation would include polynomial terms and interaction terms involving multiple variables. For example, with two independent variables (X₁ and X₂), a second-degree polynomial regression might look like:
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₁² + β₄X₂² + β₅X₁X₂ + ε

28. What are the limitations of polynomial regression?

- Limitations include :
* **Overfitting :** Higher-degree polynomials can easily overfit the data, capturing noise rather than the underlying relationship, leading to poor generalization to new data.
* **Extrapolation :** Polynomial models can produce very unreliable predictions outside the range of the observed data.
* **Interpretation :** As the degree of the polynomial increases, the interpretation of the coefficients becomes more complex.
* **Data Requirements :** Fitting high-degree polynomials effectively often requires a large amount of data.

29. What methods can be used to evaluate model fit when selecting the degree of a polynomial?

- Methods include :
* **Visual inspection of the fitted curve :** Plotting the regression curve against the data points to see how well it fits.
* **Adjusted R² :** Choosing the degree that maximizes the adjusted R², which penalizes the addition of unnecessary terms.
* **AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) :** These information criteria help balance model fit and complexity.
* **Cross-validation :** Evaluating the model's performance on unseen data for different polynomial degrees.
* **Hypothesis testing on the coefficients :** Assessing the significance of the higher-order polynomial terms.
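
The cross-validation approach can be sketched as follows, using `np.polyfit` and a simple k-fold split. The data are synthetic (a quadratic plus noise). The helper `cv_mse`, the five folds, and the candidate degrees 1 to 6 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a quadratic relationship with a little noise.
x = rng.uniform(-2, 2, size=60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=60)

def cv_mse(x, y, degree, k=5):
    """Mean squared prediction error of a polynomial fit, averaged over k folds."""
    folds = np.array_split(np.arange(len(x)), k)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training part only, then score on the held-out fold.
        coeffs = np.polyfit(x[train_mask], y[train_mask], deg=degree)
        pred = np.polyval(coeffs, x[test_idx])
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

scores = {degree: cv_mse(x, y, degree) for degree in range(1, 7)}
best_degree = min(scores, key=scores.get)

for degree, score in scores.items():
    print(f"degree {degree}: CV MSE = {score:.4f}")
```

The straight-line fit has a much larger held-out error than the quadratic, while degrees above 2 bring little or no improvement, which is the pattern to look for when picking a degree.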

30. Why is visualization important in polynomial regression?

- Visualization is crucial in polynomial regression for :
* Identifying non-linear relationships in the data initially.
* Assessing the fit of the polynomial curve to the data points.
* Detecting overfitting by observing if the curve is excessively wiggly or follows noise.
* Understanding the shape of the relationship being modeled.

31. How is polynomial regression implemented in Python?

- Polynomial regression can be implemented in Python using libraries like :
* **NumPy and scikit-learn :** You can use PolynomialFeatures from sklearn.preprocessing to generate polynomial features from the original independent variable(s), and then use a linear regression model (like LinearRegression from sklearn.linear_model) to fit the model to these transformed features.
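* **Statsmodels :** This library can also fit polynomial regression models, for example through its formula interface (statsmodels.formula.api.ols), where a squared term can be written as I(x**2), and it reports a detailed statistical summary of the fit.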
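
A minimal sketch of the scikit-learn route is shown below, assuming scikit-learn is installed. The data are made up and noise-free so the recovered coefficients can be checked against the known quadratic. The step name `"linearregression"` is the one `make_pipeline` assigns automatically from the class name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data from y = 1 + 2x + 3x².
x = np.linspace(-3, 3, 30)
y = 1.0 + 2.0 * x + 3.0 * x**2

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features).
X = x.reshape(-1, 1)

# include_bias=False because LinearRegression already fits its own intercept.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X, y)

linreg = model.named_steps["linearregression"]
prediction = model.predict(np.array([[4.0]]))

print("intercept:", linreg.intercept_)
print("coefficients for x and x²:", linreg.coef_)
print("prediction at x = 4:", prediction[0])
```

Wrapping both steps in a pipeline means new inputs are expanded into polynomial features automatically before prediction.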