1. What is Simple Linear Regression?

Simple Linear Regression is a statistical method to model the relationship between two variables by fitting a straight line. It predicts the value of a dependent variable (target) based on one independent variable (feature).

The equation is:  
y = b0 + b1 * x  
where,  
- y = predicted value  
- b0 = intercept  
- b1 = slope (coefficient)  
- x = independent variable


2. What are the key assumptions of Simple Linear Regression?

1. Linearity: The relationship between the independent and dependent variable is linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: The errors (residuals) have constant variance across all levels of the independent variable.
4. Normality: The residuals (errors) are normally distributed.
5. No multicollinearity: Not applicable here, because simple linear regression has only one predictor; this assumption matters for multiple regression.


3. What does the coefficient m represent in the equation Y = mX + c?

The coefficient m represents the slope of the line. It indicates the change in the dependent variable (Y) for a one-unit change in the independent variable (X).


4. What does the intercept c represent in the equation Y = mX + c?

The intercept c represents the value of Y when X is zero. It is the point where the regression line crosses the Y-axis.


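5. How do we calculate the slope m in Simple Linear Regression?

Formula for slope (m):
m = covariance(X, Y) / variance(X) = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

The normalizing factor (n or n - 1) appears in both the covariance and the variance, so it cancels and only the two sums are needed.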

import numpy as np

# Example data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])

# Calculate means
mean_x = np.mean(X)
mean_y = np.mean(Y)

# Calculate numerator and denominator for slope
numerator = np.sum((X - mean_x) * (Y - mean_y))
denominator = np.sum((X - mean_x) ** 2)

# Calculate slope
m = numerator / denominator

print("Slope (m):", m)
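Once the slope is known, the intercept follows from the fact that the least-squares line passes through the point (mean of X, mean of Y), so c = mean_y - m * mean_x. A minimal sketch using the same example data:

```python
import numpy as np

# Same example data as in the slope calculation
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])

# Slope: sum of cross-deviations divided by sum of squared deviations of X
m = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# Intercept: the fitted line passes through (mean of X, mean of Y)
c = Y.mean() - m * X.mean()

# Predicted values on the fitted line Y = mX + c
Y_pred = m * X + c

print("Slope (m):", m)
print("Intercept (c):", c)
print("Predictions:", Y_pred)
```

For this data the slope is 0.6 and the intercept is 2.2 (up to floating-point rounding).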


6. What is the purpose of the least squares method in Simple Linear Regression?

The least squares method is used to find the best-fitting line by minimizing the sum of the squared differences (errors) between the observed values and the predicted values. It helps to find the slope (m) and intercept (c) that minimize these errors.
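To see the "minimizing" part in action, the sketch below fits a line with `np.polyfit` and compares its sum of squared errors (SSE) with a few slightly shifted lines. The data and the size of the shifts (0.1) are illustrative assumptions; any nearby line should give a larger SSE.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

def sse(m, c):
    """Sum of squared errors for the line Y = m*X + c."""
    return np.sum((Y - (m * X + c)) ** 2)

# np.polyfit with deg=1 returns the least-squares slope and intercept
m_hat, c_hat = np.polyfit(X, Y, deg=1)

best = sse(m_hat, c_hat)

# Nudge the slope and intercept a little in every direction
nudged = [sse(m_hat + dm, c_hat + dc) for dm in (-0.1, 0.1) for dc in (-0.1, 0.1)]

print("Least-squares line: m =", m_hat, ", c =", c_hat)
print("SSE at the fit:", best)
print("SSE for nudged lines:", nudged)
```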


7. How is the coefficient of determination (R²) interpreted in Simple Linear Regression?

R² measures how well the regression line fits the data. It represents the proportion of the variance in the dependent variable explained by the independent variable.

- R² = 1 means perfect fit.
- R² = 0 means the model explains none of the variance.
- Higher R² indicates better model performance.
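R² can be computed directly as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of Y. A short sketch on illustrative data:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

# Fit the least-squares line
m, c = np.polyfit(X, Y, deg=1)
Y_pred = m * X + c

ss_res = np.sum((Y - Y_pred) ** 2)    # variation left unexplained by the line
ss_tot = np.sum((Y - Y.mean()) ** 2)  # total variation around the mean of Y

r2 = 1 - ss_res / ss_tot
print("R^2:", r2)
```

Here R² is about 0.6, meaning the line explains roughly 60% of the variance in Y.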


8. What is Multiple Linear Regression?

Multiple Linear Regression is a statistical method used to model the relationship between one dependent variable and two or more independent variables by fitting a linear equation.

The equation is:  
y = b0 + b1*x1 + b2*x2 + ... + bn*xn  

where,  
- y = predicted value  
- b0 = intercept  
- b1, b2, ..., bn = coefficients for each independent variable x1, x2, ..., xn  
- n = number of independent variables
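As a rough sketch, the coefficients can be estimated with NumPy's least-squares solver. The data below is hypothetical and generated without noise from y = 1 + 2*x1 + 3*x2, so the fit should recover those values.

```python
import numpy as np

# Hypothetical data: two features, target built as y = 1 + 2*x1 + 3*x2
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept b0
A = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem A @ coef ≈ y
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef

print("b0 (intercept):", b0)
print("b1, b2:", b1, b2)
```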


9. What is the main difference between Simple and Multiple Linear Regression?

- Simple Linear Regression uses one independent variable to predict the dependent variable.
- Multiple Linear Regression uses two or more independent variables to predict the dependent variable.


10. What are the key assumptions of Multiple Linear Regression?

1. Linearity: The relationship between dependent and independent variables is linear.
2. Independence: Observations are independent.
3. Homoscedasticity: Constant variance of residuals across all levels of independent variables.
4. Normality: Residuals are normally distributed.
5. No multicollinearity: Independent variables are not highly correlated with each other.


11. What is heteroscedasticity, and how does it affect the results of a Multiple Linear Regression model?

Heteroscedasticity occurs when the variance of the residuals (errors) is not constant across all levels of the independent variables.

Effects:
- It violates the assumption of homoscedasticity.
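- The coefficient estimates remain unbiased but become inefficient (they no longer have the smallest possible variance).
- The usual standard errors are biased, so confidence intervals, p-values, and hypothesis tests become unreliable.
- Prediction intervals are less reliable, because the error variance differs across the range of the predictors.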


12. How can you improve a Multiple Linear Regression model with high multicollinearity?

- Remove or combine highly correlated features.
- Use dimensionality reduction techniques like Principal Component Analysis (PCA).
- Apply regularization methods such as Ridge or Lasso regression.
- Collect more data if possible.
- Check and carefully select independent variables based on domain knowledge.
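To illustrate the regularization option, here is a minimal sketch of Ridge regression in closed form, (XᵀX + λI)⁻¹Xᵀy, on two nearly identical features. The data, the helper name `ridge`, and the penalty value `lam=1.0` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)         # nearly identical to x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # hypothetical target

# Center features and target so no intercept column is needed
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge(Xc, yc, lam):
    """Closed-form ridge solution; lam=0 gives ordinary least squares."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

beta_ols = ridge(Xc, yc, lam=0.0)
beta_ridge = ridge(Xc, yc, lam=1.0)  # lam=1.0 is an arbitrary illustrative value

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With collinear features the individual OLS coefficients can swing widely even though their sum is well determined; the ridge penalty shrinks them toward a more stable, shared solution.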


13. What are some common techniques for transforming categorical variables for use in regression models?

- One-Hot Encoding: Converts each category into binary columns (0 or 1).
- Label Encoding: Assigns each category a unique integer. Because a linear model treats those integers as ordered and evenly spaced, it is best reserved for ordinal categories.
- Ordinal Encoding: Assigns ordered integers to categories with a meaningful order.
- Binary Encoding: Converts categories into binary digits to reduce dimensionality.
- Target Encoding: Replaces categories with the mean of the target variable for that category.
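As an example of the first technique, one-hot encoding can be sketched with plain NumPy. The `colors` feature is hypothetical; dropping one column is a common convention to avoid perfect collinearity with the intercept.

```python
import numpy as np

# Hypothetical categorical feature
colors = np.array(["red", "green", "blue", "green", "red"])

# categories is sorted; codes holds each row's index into categories
categories, codes = np.unique(colors, return_inverse=True)

# One-hot: one 0/1 column per category, picked from the identity matrix
one_hot = np.eye(len(categories), dtype=int)[codes]

# Drop the first column so the remaining columns are not collinear with an intercept
one_hot_dropped = one_hot[:, 1:]

print("Categories:", categories)
print(one_hot)
```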


14. What is the role of interaction terms in Multiple Linear Regression?

Interaction terms capture the combined effect of two or more independent variables on the dependent variable, showing how the effect of one variable changes depending on the level of another.

They help model complex relationships that are not purely additive.
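In practice an interaction term is just an extra column holding the product of two features. The sketch below uses hypothetical, noise-free data generated from y = 1 + 2*x1 + 0.5*x2 + 1.5*x1*x2, so the fit should recover those coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 5, size=50)
x2 = rng.uniform(0, 5, size=50)

# Hypothetical target: the effect of x1 on y depends on the value of x2
y = 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2

# Design matrix: intercept, x1, x2, and the interaction column x1*x2
A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print("Intercept, b1, b2, b_interaction:", coef)
```

With the interaction included, a one-unit increase in x1 changes y by b1 + b_interaction * x2 rather than by a fixed amount.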


15. How can the interpretation of intercept differ between Simple and Multiple Linear Regression?

- In Simple Linear Regression, the intercept represents the predicted value of the dependent variable when the single independent variable is zero.
- In Multiple Linear Regression, the intercept represents the predicted value when all independent variables are zero simultaneously.


16. What is the significance of the slope in regression analysis, and how does it affect predictions?

The slope indicates the amount of change in the dependent variable for a one-unit increase in the independent variable. It shows the direction and magnitude of the relationship; how closely the points follow the line is measured by the correlation or R², not by the slope.

A positive slope means the dependent variable increases as the independent variable increases, while a negative slope means it decreases.

The slope directly affects predictions by determining how much the output changes with input changes.


17. How does the intercept in a regression model provide context for the relationship between variables?

The intercept represents the expected value of the dependent variable when all independent variables are zero. It provides a baseline or starting point for the model's predictions, helping to understand where the regression line or plane crosses the dependent variable axis.


18. What are the limitations of using R² as a sole measure of model performance?

- R² does not indicate whether the model is appropriate or its estimates are unbiased.
- It never decreases when predictors are added, so it can be artificially high for a model with many predictors (overfitting).
- It does not measure predictive accuracy on new data.
- It does not reveal whether the true relationship is non-linear.
- It does not show whether individual variables are statistically significant.



19. How would you interpret a large standard error for a regression coefficient?

A large standard error means the coefficient estimate is imprecise: it would vary a lot from sample to sample. When the standard error is large relative to the coefficient itself, the confidence interval is wide and may include zero, so the predictor may not be reliably associated with the dependent variable.
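For simple linear regression the standard error of the slope is sqrt(s² / Σ(x - x̄)²), where s² is the residual variance with n - 2 degrees of freedom. A minimal sketch on illustrative data:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)
n = len(X)

# Fit the line and compute residuals
m, c = np.polyfit(X, Y, deg=1)
residuals = Y - (m * X + c)

# Residual variance: n - 2 degrees of freedom (slope and intercept were estimated)
s2 = np.sum(residuals ** 2) / (n - 2)

# Standard error of the slope
se_m = np.sqrt(s2 / np.sum((X - X.mean()) ** 2))

# Estimate divided by its standard error (t-statistic for testing slope = 0)
t_stat = m / se_m

print("Slope:", m)
print("Standard error of slope:", se_m)
print("t-statistic:", t_stat)
```

A small t-statistic (the estimate is only a few standard errors from zero, or less) is the numerical sign of the uncertainty described above.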


20. How can heteroscedasticity be identified in residual plots, and why is it important to address it?

- Identification:
  In residual plots, heteroscedasticity appears as a pattern where the spread (variance) of residuals changes across levels of predicted values (e.g., funnel shape or increasing/decreasing spread).

- Importance: 
  Addressing heteroscedasticity is important because it violates regression assumptions, leading to inefficient estimates, biased standard errors, and unreliable hypothesis tests.
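The funnel shape can also be checked numerically. The sketch below simulates hypothetical data whose noise grows with x, fits a line, and compares the residual spread in the lower and upper halves of x. This is a rough stand-in for looking at the plot, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(1, 10, 200)

# Hypothetical data: the noise standard deviation grows in proportion to x
y = 2 + 3 * x + rng.normal(scale=0.5 * x)

# Fit a straight line and compute residuals
m, c = np.polyfit(x, y, deg=1)
resid = y - (m * x + c)

# Compare residual spread for small x versus large x
cutoff = np.median(x)
spread_low = resid[x <= cutoff].std()
spread_high = resid[x > cutoff].std()

print("Residual std (low x): ", spread_low)
print("Residual std (high x):", spread_high)
```

A clearly larger spread in one half than the other is what shows up as a funnel in a residual plot.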


21. What does it mean if a Multiple Linear Regression model has a high R² but low adjusted R²?

It indicates that while the model explains a lot of variance (high R²), it may include unnecessary or irrelevant predictors. The adjusted R² penalizes adding useless variables, so a low adjusted R² suggests overfitting or that some predictors don't improve the model significantly.
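The penalty is visible in the formula: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The numbers below are hypothetical and only show how the same R² is discounted more heavily as p grows.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R^2 and sample size, different number of predictors (hypothetical values)
few_predictors = adjusted_r2(0.90, n=15, p=2)
many_predictors = adjusted_r2(0.90, n=15, p=10)

print("Adjusted R^2 with 2 predictors: ", few_predictors)
print("Adjusted R^2 with 10 predictors:", many_predictors)
```

With 15 observations, an R² of 0.90 adjusts to about 0.88 for 2 predictors but only 0.65 for 10.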


22. Why is it important to scale variables in Multiple Linear Regression?

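Scaling puts all features on a comparable scale. For ordinary least squares it does not change the predictions, but it makes coefficient magnitudes comparable, improves the numerical stability and convergence of iterative solvers such as gradient descent, and is essential for regularization techniques like Ridge or Lasso regression, whose penalties depend on coefficient size.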
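One common choice is standardization (z-scores): subtract each column's mean and divide by its standard deviation. A minimal sketch with hypothetical age and income columns:

```python
import numpy as np

# Hypothetical features on very different scales: age (years), income (dollars)
X = np.array(
    [[25, 40000], [32, 52000], [47, 81000], [51, 95000], [38, 61000]],
    dtype=float,
)

# Column-wise mean and standard deviation, computed on the training data
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: each column now has mean 0 and standard deviation 1
X_scaled = (X - mean) / std

print(X_scaled)
```

When new data arrives, it should be scaled with the same `mean` and `std` computed from the training data.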


23. What is polynomial regression?

Polynomial regression is an extension of linear regression where the relationship between the independent variable and the dependent variable is modeled as an nth-degree polynomial.

The equation:  
y = b0 + b1*x + b2*x² + ... + bn*xⁿ  

It allows modeling of non-linear relationships by fitting a curve instead of a straight line.


24. How does polynomial regression differ from linear regression?

- **Linear regression** models a straight-line relationship between independent and dependent variables.
- **Polynomial regression** models a curved relationship by including higher-degree terms (e.g., x², x³), capturing non-linear patterns.


25. When is polynomial regression used?

Polynomial regression is used when the relationship between the independent and dependent variables is **non-linear** but can be approximated by a polynomial curve. It helps model curved trends that linear regression cannot capture.


26. What is the general equation for polynomial regression?

y = b0 + b1*x + b2*x² + b3*x³ + ... + bn*xⁿ

where,  
- y = predicted value  
- b0 = intercept  
- b1, b2, ..., bn = coefficients  
- x, x², ..., xⁿ = independent variable raised to powers from 1 to n  
- n = degree of the polynomial


27. Can polynomial regression be applied to multiple variables?

Yes. Polynomial regression can be extended to multiple variables by including polynomial terms and interaction terms of the independent variables, allowing modeling of non-linear relationships in multiple dimensions.
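For two variables and degree 2, the feature set is 1, x1, x2, x1², x1*x2, x2². The sketch below builds these columns by hand on hypothetical, noise-free data generated from y = 1 + x1² + 2*x1*x2, so the fit should recover those terms and leave the others near zero.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.uniform(-2, 2, size=40)
x2 = rng.uniform(-2, 2, size=40)

# Hypothetical target with a squared term and an interaction term
y = 1 + x1 ** 2 + 2 * x1 * x2

# Degree-2 design matrix for two variables: 1, x1, x2, x1^2, x1*x2, x2^2
A = np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2, x1 * x2, x2 ** 2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print("Coefficients for [1, x1, x2, x1^2, x1*x2, x2^2]:")
print(np.round(coef, 3))
```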


28. What are the limitations of polynomial regression?

- Prone to **overfitting**, especially with high-degree polynomials.
- Can produce **oscillations** and unstable predictions outside the training data range.
- **Difficult to interpret** coefficients as the degree increases.
- Sensitive to **outliers**.
- Requires careful selection of polynomial degree.


29. What methods can be used to evaluate model fit when selecting the degree of a polynomial?

- **Cross-validation:** Assess model performance on unseen data.
- **Adjusted R²:** Penalizes adding unnecessary polynomial terms.
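- **Mean Squared Error (MSE):** Lower values on validation data indicate better fit; training MSE alone does not rise as the degree increases, so it cannot be used to choose the degree.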
- **Root Mean Squared Error (RMSE):** Provides error in original units.
- **Visual inspection:** Plotting fitted curve against data.
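These ideas can be combined in a short degree search. The sketch below uses a single train/validation split (a simpler stand-in for full cross-validation) on hypothetical data generated from a quadratic plus noise, and records the MSE for degrees 1 to 6.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=60)

# Hypothetical data: a quadratic trend plus noise
y = 1 + 2 * x - 0.5 * x ** 2 + rng.normal(scale=0.3, size=60)

# Random split: 40 points for training, 20 held out for validation
idx = rng.permutation(60)
train, val = idx[:40], idx[40:]

train_mse, val_mse = {}, {}
for degree in range(1, 7):
    # Fit a polynomial of this degree on the training points only
    coefs = np.polyfit(x[train], y[train], deg=degree)
    train_mse[degree] = np.mean((y[train] - np.polyval(coefs, x[train])) ** 2)
    val_mse[degree] = np.mean((y[val] - np.polyval(coefs, x[val])) ** 2)

# Pick the degree with the lowest validation error
best_degree = min(val_mse, key=val_mse.get)

for degree in range(1, 7):
    print(degree, "train MSE:", train_mse[degree], "validation MSE:", val_mse[degree])
print("Degree with lowest validation MSE:", best_degree)
```

Training MSE keeps shrinking as the degree grows, while validation MSE drops sharply from degree 1 to degree 2 and then stops improving in any meaningful way.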


30. Why is visualization important in polynomial regression?

Visualization helps to:
- Understand the shape of the relationship between variables.
- Detect underfitting or overfitting by comparing the curve to data points.
- Communicate model behavior clearly.
- Identify outliers and patterns that affect the model.


31. How is polynomial regression implemented in Python?

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Example data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 6, 14, 28, 45])

# Transform features to polynomial features (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Fit Linear Regression on transformed features
model = LinearRegression()
model.fit(X_poly, y)

# Predict
y_pred = model.predict(X_poly)
print("Predicted values:", y_pred)
