1. Foundations & Theory

What is linear regression, and what are its assumptions?

Explain the difference between simple and multiple linear regression.

What is the equation of a simple linear regression model? Define each term.

Why do we use the least squares method to fit a regression line?

What do the coefficients in linear regression represent?

State the Gauss–Markov assumptions. What is the BLUE property?

What happens if the linearity assumption is violated?

How do you detect and handle multicollinearity?

What is heteroscedasticity, and how can it be identified (e.g., Breusch–Pagan test, White test)?

What are the consequences of autocorrelation in time-series regression?

How do you check for normality of residuals (QQ plot, Shapiro–Wilk test)?

What is the effect of outliers and influential points (Cook’s distance, leverage)?

Define and interpret R² and adjusted R².

When is R² misleading?

Explain MSE, RMSE, and MAE.

What is the F-statistic in regression output, and what does it test?

Compare AIC and BIC for model selection.

How do you test the significance of a single regression coefficient (t-test)?

How do you test the significance of multiple coefficients (F-test)?

Construct a confidence interval for a regression coefficient.

What is the relationship between the t-test for a coefficient and the F-test for the model?

How do you handle nonlinear relationships while still using linear regression (e.g., polynomial regression, log transformation)?

What is log-linear, linear-log, and log-log regression? Interpret their coefficients.

What is ridge regression (L2 regularization)? How does it differ from OLS?

Explain lasso regression (L1 regularization) and its advantage for feature selection.

What is elastic net regression?

Weighted least squares (WLS) – when and why is it used?

Generalized least squares (GLS) – for autocorrelation or heteroscedasticity.

Robust regression (using Huber loss) – when is it needed?

Quantile regression – how does it differ from OLS?

Regression with dummy variables for categorical predictors.

Interaction terms – how do you include and interpret them?

Explain stepwise regression (forward, backward, bidirectional).

What is the bias–variance tradeoff in model complexity?

How does cross-validation help in regression model selection?

Compare subset selection vs. shrinkage methods (ridge, lasso).

If two predictors are highly correlated, how does it affect coefficient estimates?

How would you test if a regression model is better than just using the mean of Y?

Prove the Gauss–Markov theorem.

Show that OLS minimizes the sum of squared residuals.

Derive the distribution of the OLS estimator under normality of errors.

Explain the maximum likelihood estimation approach for linear regression.

What is Bayesian linear regression, and how does it differ from classical OLS?

How does polynomial regression extend the linear regression model?

What are generalized linear models (GLMs), and how do they relate to linear regression?

Can you discuss the use of spline functions in regression?

What are mixed models, and where might you use them?

Explain the purpose of residual plots and how to interpret them.

What are leverage points and how do they affect a regression model?

Describe how you would detect and address outliers in your regression analysis.

Explain the concept of Cook's distance.

How is influence measured in the context of linear regression?

Describe the variance inflation factor (VIF) and its significance.

What is the role of the intercept term in a linear regression model?

What are the common metrics to evaluate a linear regression model's performance?

Explain the concept of homoscedasticity. Why is it important?

How is hypothesis testing used in the context of linear regression?

Can you explain the concept of gradient descent and its importance in linear regression?

How do you use regularization to improve linear regression models?

How can you optimize the hyperparameters of a regularized linear regression model?

2. Implementation & Practical Application
How would you implement linear regression from scratch in Python (using NumPy)?

In scikit-learn, how do you fit a linear regression and interpret the coefficients?

How do you check regression assumptions using Python/R plots and tests?

Given a real dataset, walk through the steps of building, diagnosing, and selecting a linear regression model.

Implement a multiple linear regression model using NumPy or similar libraries.

Write a Python function that performs the gradient descent algorithm for linear regression.

Create a Python script to calculate the VIF for each predictor in a dataset.

Code a Python function to implement ridge regression using scikit-learn.

Use pandas to load a dataset and prepare it for linear regression, handling any missing values.

Plot residual diagrams and analyze the model fit using Matplotlib or Seaborn.

Write a Python function to compute and print out model evaluation metrics (RMSE, MAE, R-squared).

Perform a polynomial regression on a sample dataset and plot the results.

Use scikit-learn to perform cross-validation on a linear regression model and extract the test scores.

Describe the steps involved in preprocessing data for linear regression analysis.

How do you deal with missing values when preparing data for linear regression?

What feature selection methods can be used prior to building a regression model?

How is feature scaling relevant to linear regression?

Explain the concept of data splitting into training and test sets.

How do you address overfitting in linear regression?

How to handle categorical variables in linear regression?

Implement simple linear regression from scratch in Python.

Implement a linear regression model to predict customer lifetime value using scikit-learn.

Develop a regularized regression model to analyze and predict healthcare costs.

Perform a time-series linear regression analysis on stock market data.

Create a Python script that tunes the hyperparameters of an elastic net regression model using grid search.

Write a Python function that incorporates polynomial features into a regression model for better fit and analyzes the trade-off with model complexity.

3. Application Scenarios & Case Studies
Discuss how linear regression can be used for sales forecasting.

Describe a situation where linear regression could be applied in the finance sector.

How can linear regression be used for price optimization in retail?

Explain how you might use regression analysis to assess the effect of marketing campaigns.

Describe how linear regression models could be used in predicting real estate prices.

How would you approach building a linear regression model to predict customer churn?

Illustrate the process you would follow to model the relationship between advertising spend and revenue generation.

Walk me through a time you diagnosed a poorly performing regression model and how you improved it.

Describe how you might use linear regression to optimize inventory levels in a supply chain context.

Propose a framework for using regression analysis to evaluate the impact of promotional activities on sales volume.

4. Advanced Concepts & Modern Trends
Discuss recent advances in optimization algorithms for linear regression.

How has the field of linear regression modeling evolved with the advent of big data?

What are the latest research trends in regularized regression techniques?

How can linear regression models be made more robust to non-standard data types?

Discuss the potential role of linear regression in the development of AI for personalized medicine.

Describe a situation where logistic regression might be preferred over linear regression.

How would you explain the importance of linear regression to a non-technical stakeholder?

What steps would you take if your linear regression model shows significant bias after deployment?

How would you use A/B testing to validate the outcomes of a linear regression model in a live environment?

Describe a scenario where you'd have to transition from a simple to a multiple linear regression model, and the considerations you'd have to make.

5. Critical Thinking & Edge Cases
When is linear regression unsuitable, even if the relationship seems linear?

Why can’t we use R² to compare models with different dependent variables?

What happens if you regress Y on X and then X on Y? Are the slopes reciprocals?

If you add more variables, can R² decrease?

Can the OLS estimator be unbiased but inconsistent? Explain.

## What is Linear Regression and its Assumptions?

Linear regression is the simplest way to find a trend in data. Mathematically, we’re just fitting a line through data points so that the total squared vertical distance between the points and the line is as small as possible. We use it when we want to see how much a 'target' (like sales) changes when we change a 'feature' (like ad spend).


Assumptions:

Linearity:

We assume the relationship is a straight line. If the data actually follows a curve (like an 'S' or a 'U'), a linear model will miss the pattern and give us poor predictions.

Independence:

We assume each data point is its own separate event. For example, if I'm measuring house prices, one sale shouldn't influence the next. If they do (like in stock market data), our errors get 'linked,' and the model becomes unreliable.

Homoscedasticity (Constant Variance):

This is a fancy word for 'consistent noise.' We want the model's errors to be the same size across the whole dataset. If the model is very accurate for small values but gets wildly inaccurate for large values, it creates a 'fan shape' in our residuals, which we call heteroscedasticity.

No Multicollinearity:

Our predictors shouldn't be 'twins.' If I use both 'Square Footage' and 'Number of Rooms' to predict price, they might be so correlated that the model can’t tell which one is actually driving the price. This makes our coefficients jump around and become unstable.

Normality of Errors:

We want our mistakes to follow a bell curve. This isn't strictly necessary to get a good line, but it's vital, especially in small samples, if we want to trust our p-values and say, 'Yes, this variable is statistically significant.'

## What is regression?

Regression is a statistical method for understanding the relationship between variables. It predicts a dependent variable (outcome) from one or more independent variables (predictors) by finding a best-fit line or curve through the data points. It is widely used in forecasting, risk analysis, and modeling.

## How is linear regression used in predictive modeling?

Linear Regression is used in predictive modeling to predict a future or unknown numerical value based on past data.

Step-by-step use in predictive modeling

Collect data
Gather past data where you already know the result.
Example:

Hours studied → Exam marks

House size → House price

Choose variables

Input (X): What you use to predict (hours studied, size, age, etc.)

Output (Y): What you want to predict (marks, price, salary)

Train the model
Linear Regression finds the best straight line that fits the data:

Y = mX + c


This line shows how X affects Y.

Make prediction
When a new value of X is given, the model calculates Y using the line.
Example:

If a student studies 6 hours → model predicts 75 marks

Evaluate accuracy
Compare predicted values with real values to check how good the model is.
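The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the `hours` and `marks` values are made-up data chosen so that the fitted line reproduces the "6 hours → 75 marks" example, and the function names are my own.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line Y = mX + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope m = covariation of X and Y divided by the variation of X.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(x, slope, intercept):
    """Apply Y = mX + c to a new input value."""
    return slope * x + intercept


# Hypothetical training data: hours studied -> exam marks.
hours = [1, 2, 3, 4, 5]
marks = [50, 55, 60, 65, 70]

m, c = fit_simple_linear_regression(hours, marks)
print(f"Y = {m:.1f}X + {c:.1f}")
print("Predicted marks for 6 hours:", predict(6, m, c))
```

On real data the points will not sit exactly on a line, so the last step (comparing predictions with actual values) is where metrics like RMSE come in.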

## Simple vs. Multiple Linear Regression

The difference is really just about how many 'ingredients' you’re using to make your prediction.

Simple regression is like looking at a single cause and effect: for example, does more 'study time' lead to higher 'test scores'? It’s a 2D relationship with one independent variable (X) and one dependent variable (Y).

Multiple regression is more like the real world. It recognizes that many things affect an outcome at once. To predict 'test scores,' you’d look at study time, hours of sleep, and previous grades all together. In other words, it uses more than one independent variable.



## Equation of a simple linear regression model

``` y = β₀ + β₁x + ε ```

y is our target - what we're trying to predict.

β₀ is our starting point or baseline - what y would be if x were zero.

β₁ is the rate of change - it tells us how much y increases for each one-unit increase in x.

x is our input feature - the data we're using to make predictions.

ε is crucial - it represents all the unmeasured factors and random noise. In interviews, I always emphasize that ε acknowledges no model captures everything perfectly. We assume these errors average to zero and follow a normal distribution.


## Why the "Least Squares" Method?
We use it because it’s the most efficient way to find the 'best' fit. Imagine you draw a line through a bunch of dots. Some dots are above the line, some are below.

We measure those distances (residuals).

We square them so that the negative distances don't cancel out the positive ones, and to 'punish' the line more for big misses.

The 'Least Squares' method is just a mathematical way to find the one specific line where that total 'pile' of squared errors is as small as it can possibly be.

## What do the Coefficients Represent?
The coefficients are basically the 'sensitivity' of your model. In a simple model, the coefficient tells you: 'If I increase X by one unit, how much does y move?' In a multiple regression model, it’s a bit more nuanced. It tells you the impact of one specific variable while assuming every other variable is staying exactly the same. This is key because it lets us isolate the effect of one factor in a complex system.

## Can a coefficient be negative?
Yes. A negative coefficient just means an inverse relationship. For example, as the 'price' of a product goes up, the 'number of sales' usually goes down. That would show up as a negative β₁.

## Gauss–Markov & the BLUE Property
The Gauss-Markov theorem is basically the set of rules that, if followed, guarantees our OLS model is the best possible tool for the job.

BLUE stands for ``` Best Linear Unbiased Estimator```.

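Best means it has the lowest variance among all linear unbiased estimators (it’s the most precise of that group).

Linear means the estimator is a linear function of the observed y values.

Unbiased means that on average, it hits the true population value.

If the Gauss-Markov assumptions hold (linearity in the parameters, errors with zero conditional mean, uncorrelated errors, homoscedasticity, and no perfect multicollinearity), OLS is the king of linear unbiased estimators. Normality of the errors is not needed for this result.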

## What happens if the linearity assumption is violated?
If the true relationship is a curve and you force a straight line through it, you're going to have bias. Your model will systematically over-predict in some areas and under-predict in others. You’ll see this immediately in a residual plot: instead of a random cloud of points, the residuals will show a clear pattern, like a 'U' shape. In simple words, the relationship between the independent variables and the dependent variable is not actually linear, so a straight line cannot describe it.

## How do you detect and handle multicollinearity?

This happens when two or more of your predictors are highly correlated (like 'years of education' and 'age').

Detection: I look at the VIF (Variance Inflation Factor). A VIF over 5 or 10 is a red flag.

Handling: You can either drop one of the redundant variables, combine them into a single index, or use Ridge/Lasso regression, which are designed to handle this 'shakiness' in the coefficients.
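As a rough sketch of the detection step, VIF for predictor j can be computed as 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing that predictor on all the others. The code below assumes NumPy is available; the function name and the simulated columns (`age`, `education`, `income`) are illustrative, not from a real dataset.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs


# Simulated (hypothetical) predictors: education is almost a copy of age.
rng = np.random.default_rng(0)
age = rng.normal(size=200)
education = age + 0.1 * rng.normal(size=200)
income = rng.normal(size=200)  # unrelated to the other two

vifs = variance_inflation_factors(np.column_stack([age, education, income]))
for name, vif in zip(["age", "education", "income"], vifs):
    print(f"{name}: VIF = {vif:.1f}")
```

The two near-duplicate columns should show VIFs far above the 5 to 10 red-flag range, while the unrelated column stays close to 1.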

## What is heteroscedasticity, and how can it be identified? (The "Uneven Noise")
Heteroscedasticity is when the 'noise' (variance of errors) isn't constant. For example, rich people's spending habits vary much more than poor people's.

Identification: Look for a 'fan' or 'cone' shape in your residual plot.

Tests: I’d use the Breusch-Pagan or White test. If the p-value is low, we have heteroscedasticity.

Fix: Use a log-transform on the target variable or switch to robust standard errors so our p-values stay trustworthy.

In short: the variance of the errors (residuals) isn't constant across all levels of the predictor variables. This creates unequal scatter, often seen as a fan or cone shape in residual plots, and it can be identified visually or with tests like the Breusch-Pagan test.
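To make the test less of a black box, here is a minimal sketch of the n·R² form of the Breusch-Pagan test (the studentized variant usually attributed to Koenker) for a single predictor. It regresses the squared residuals on X and compares n·R² to a chi-square distribution with 1 degree of freedom. The data is synthetic, built so that the noise grows with x, and all names are my own.

```python
import math


def simple_ols(xs, ys):
    """Least-squares (intercept, slope) for one predictor."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / s_xx
    return y_mean - slope * x_mean, slope


def r_squared(xs, ys):
    """R-squared of the simple regression of ys on xs."""
    intercept, slope = simple_ols(xs, ys)
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def breusch_pagan_one_predictor(xs, ys):
    """Return (LM statistic, p-value) using LM = n * R^2 of the auxiliary regression."""
    intercept, slope = simple_ols(xs, ys)
    squared_residuals = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    # Guard against a tiny negative value caused by floating-point rounding.
    lm = max(len(xs) * r_squared(xs, squared_residuals), 0.0)
    # Chi-square with 1 degree of freedom: P(chi2 > lm) = erfc(sqrt(lm / 2)).
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Synthetic data: the error magnitude is proportional to x (a "fan" shape).
xs = list(range(1, 51))
ys = [2 + 3 * x + 0.5 * x * (1 if x % 2 == 0 else -1) for x in xs]

lm_stat, p_value = breusch_pagan_one_predictor(xs, ys)
print(f"LM = {lm_stat:.2f}, p-value = {p_value:.3g}")
```

In practice I would reach for a library implementation rather than this hand-rolled version, but the logic is the same: a low p-value means the squared residuals are predictable from X, which is exactly what heteroscedasticity looks like.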

## What are the consequences of autocorrelation in time-series regression?
This usually happens in time-series data where today's error is related to yesterday's error.

Consequences: It doesn't bias your coefficients (as long as the predictors are exogenous), but with positive autocorrelation the usual standard errors come out too small. This makes the model look much more 'confident' than it actually is, leading to 'significant' p-values that are really just artifacts.

In short: biased standard errors and inefficient coefficient estimates.
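One common way to check for this, although the notes above don't name it, is the Durbin-Watson statistic on the residuals. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. A minimal sketch with made-up residual series:

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: errors that persist vs. errors that flip sign each step.
persistent = [1, 1, 1, 1, -1, -1, -1, -1]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)
print("Persistent errors:", dw_persistent)
print("Alternating errors:", dw_alternating)
```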

## How do you check for normality of residuals?
We want our errors to follow a bell curve.
QQ Plot: This is a visual check. If the points fall on a straight diagonal line, we're good.

Shapiro-Wilk Test: This is the formal math test. If the p-value is significant (< 0.05), it means our residuals aren't normal.

Why it matters: We need this for our p-values and confidence intervals to be accurate in small datasets.

## Outliers vs. Influential Points
Not every 'weird' point is a problem.

Outlier: A point with a huge residual (it's far from the line).

Leverage: A point with an extreme X value.

Influential Point: This is the dangerous one. It’s a point that, if removed, would significantly change the slope of the line.
Cook’s Distance: This is the metric I use to find them. If a point has a Cook's distance greater than 1 (or 4/n), it’s likely pulling the whole model toward itself like a magnet.
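Here is a minimal sketch of how leverage and Cook's distance fit together for a one-predictor model. It uses the textbook formula D = (e² / (p · MSE)) · h / (1 − h)², with p = 2 fitted parameters. The data is made up: five points on a clean line plus one point with an extreme X value that does not follow the trend.

```python
def cooks_distances(xs, ys):
    """Cook's distance for each point in a simple (one-predictor) regression."""
    n = len(xs)
    p = 2  # fitted parameters: intercept and slope
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / s_xx
    intercept = y_mean - slope * x_mean

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage: how far each x sits from the centre of the x values.
    leverages = [1.0 / n + (x - x_mean) ** 2 / s_xx for x in xs]
    return [
        (e ** 2 / (p * mse)) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical data: the last point has an extreme x and breaks the pattern.
xs = [1, 2, 3, 4, 5, 20]
ys = [1, 2, 3, 4, 5, 2]

distances = cooks_distances(xs, ys)
for x, d in zip(xs, distances):
    print(f"x = {x:>2}: Cook's distance = {d:.3f}")
```

Notice that the last point's own residual is small: it has dragged the line toward itself, which is why a residual plot alone can miss influential points.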

## Which assumption is the most important?
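The Zero Conditional Mean (E[ε | X] = 0) is arguably the most critical. If your error term is correlated with your predictors (endogeneity), your coefficients will be biased, and no amount of 'big data' will fix a fundamentally biased estimate.

In a quick interview answer, linearity is the assumption most often named as the most important. The two are closely related: if the true relationship is curved, the missing curvature ends up in the error term, which then depends on X.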

## R² and Adjusted R²
R² is the 'Percentage of Explained Variance.' If R² is 0.80, it means your model captures 80% of the variation in the target, and the other 20% is random noise or factors you didn't measure.

Adjusted R² is the honest version of R². Standard R² has a flaw: it never goes down when you add a new variable, even if that variable is total garbage. Adjusted R² 'penalizes' you for adding variables that don't actually improve the model. If you add a useless variable, your Adjusted R² will typically go down.
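A small sketch of both formulas, using R² = 1 − SS_res / SS_tot and Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of observations and p the number of predictors. The numbers are made up purely to show the penalty at work.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R-squared for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical actual vs. predicted values.
r2 = r_squared([3, 5, 7, 9], [2.5, 5.5, 7.5, 8.5])
print("R²:", r2)

# Same R² of 0.80 on 50 observations, but with 5 vs. 20 predictors.
adj_few = adjusted_r_squared(0.80, n_samples=50, n_predictors=5)
adj_many = adjusted_r_squared(0.80, n_samples=50, n_predictors=20)
print("Adjusted R² with 5 predictors:", round(adj_few, 3))
print("Adjusted R² with 20 predictors:", round(adj_many, 3))
```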

## When is R² Misleading?  

R² isn't everything. It can be misleading in a few ways:

Overfitting: You can get a near-perfect R² by adding 100 variables to a dataset of 100 points, but the model will fail on new data.

Non-linearity: You can have a very low R² even if there is a strong relationship (like a curve) that the linear model just can't see.

Spurious Correlation: In time-series, two things that both increase over time (like 'Global Warming' and 'Number of iPhones') will show a high R² even though they have nothing to do with each other.


## MSE vs. RMSE vs. MAE


| Metric | Full name | What it tells you |
|---|---|---|
| MAE | Mean Absolute Error | "The average miss." If MAE is 5, your prediction is off by 5 units on average. It's very easy to explain to business stakeholders. |
| MSE | Mean Squared Error | "The mathematical favorite." It squares the errors, so a miss of 10 is 100 times worse than a miss of 1. It’s useful for math, but hard to interpret because the units are squared. |
| RMSE | Root Mean Squared Error | "The industry standard." It’s the square root of MSE. It brings the units back to the original scale but still penalizes large outliers heavily. |
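The three metrics are easy to compute by hand. The sketch below uses made-up predictions where three misses are small and one is large, to show how much more MSE and RMSE react to the big miss than MAE does.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: the average size of a miss."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: the average squared miss (units are squared)."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: MSE brought back to the original units."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical values: errors are -1, 1, -1 and one large miss of 10.
y_true = [10, 20, 30, 40]
y_pred = [11, 19, 31, 30]

mae_value = mae(y_true, y_pred)
mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(f"MAE = {mae_value}, MSE = {mse_value}, RMSE = {rmse_value:.3f}")
```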


## What is the F-statistic in regression output, and what does it test?
While a t-test checks if a single variable is useful, the F-statistic checks if the entire model is useful. It tests a Null Hypothesis (H_0) that all your slope coefficients (everything except the intercept) are zero. If the F-test p-value is low, it means your model is statistically better than just guessing the average (mean) of Y every time.

In short: it tests the overall significance of the model.

## AIC vs. BIC: The "Tie-Breakers"
These are used when you have two or three different models and you need to pick the best one. They both reward a good fit and penalize complexity (too many variables).

AIC (Akaike): Generally better if your goal is prediction. It's a bit more 'generous' with adding variables.

BIC (Bayesian): Much stricter. It has a higher penalty for extra variables. Use BIC if you want a simple, explainable model and want to avoid overfitting at all costs.

Rule of thumb: For both, the lower the score, the better the model.
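For a least-squares fit with Gaussian errors, both scores can be written (up to an additive constant that cancels when comparing models on the same data) as n·ln(SSE/n) plus a penalty: 2k for AIC and k·ln(n) for BIC, where k is the number of estimated parameters. The SSE values below are invented to show a case where the two criteria disagree.

```python
import math


def gaussian_aic(sse, n, k):
    """AIC (up to a constant) for a least-squares fit with k parameters."""
    return n * math.log(sse / n) + 2 * k


def gaussian_bic(sse, n, k):
    """BIC (up to a constant): same fit term, heavier penalty once n >= 8."""
    return n * math.log(sse / n) + k * math.log(n)


n = 100
# Hypothetical models: the bigger one fits slightly better but uses more parameters.
small = {"sse": 120.0, "k": 3}
big = {"sse": 110.0, "k": 6}

aic_small, aic_big = gaussian_aic(small["sse"], n, small["k"]), gaussian_aic(big["sse"], n, big["k"])
bic_small, bic_big = gaussian_bic(small["sse"], n, small["k"]), gaussian_bic(big["sse"], n, big["k"])

print(f"AIC: small = {aic_small:.2f}, big = {aic_big:.2f}")
print(f"BIC: small = {bic_small:.2f}, big = {bic_big:.2f}")
```

Here AIC prefers the bigger model and BIC prefers the smaller one, which matches the 'generous vs. strict' description above.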

## Why use RMSE instead of MAE?
RMSE is more useful when large errors are particularly expensive or dangerous. For example, in predicting medicine dosages, a big miss is much worse than two small misses. RMSE reflects that risk by squaring the errors.

## How do you test the significance of a single regression coefficient (t-test)?

The t-test asks: 'Is this specific variable actually doing anything?' We start with a Null Hypothesis ($H_0$) that the coefficient is zero (it has no effect). We calculate the t-statistic by taking our estimate and dividing it by its 'standard error' (the uncertainty).

The Logic: If the t-statistic is large (usually greater than 2) and the p-value is small (usually < 0.05), we reject the Null. We conclude that the variable is 'statistically significant,' meaning it’s very unlikely we’d see this relationship by pure chance.





## How do you test the significance of multiple coefficients (F-test)?

The F-test is the 'Team Test.' While the t-test looks at one variable at a time, the F-test looks at a group of variables simultaneously.

Why use it? Sometimes, three variables are individually 'weak' (high p-values) because they are correlated, but together they are very powerful.

The Logic: We compare a 'Big Model' (with the variables) to a 'Small Model' (without them). If the Big Model reduces the error significantly more than we’d expect by chance, the F-test comes back significant.

## Constructing a Confidence Interval

A coefficient (like β1 = 5.2) is just a single 'best guess.' A Confidence Interval (CI) is a range that says, 'We aren't 100% sure it’s 5.2, but we are 95% confident it’s somewhere between 4.8 and 5.6.'

Construction: You take your estimate and add/subtract a 'margin of error' (which is just the standard error multiplied by a critical value from the t-distribution with n − p − 1 degrees of freedom, where p is the number of predictors).

Pro-tip: If the 95% Confidence Interval includes zero, it’s the same as saying the variable is not statistically significant at the 5% level.
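A minimal sketch of the construction for the slope of a simple regression. The standard library has no t-distribution, so the critical value is passed in by hand: 2.306 is the two-sided 95% value for 8 degrees of freedom (n − 2 with n = 10). The data and function name are hypothetical.

```python
import math


def slope_inference(xs, ys, t_critical):
    """Return (slope, standard error, t-statistic, confidence interval)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / s_xx
    intercept = y_mean - slope * x_mean

    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope: sqrt(residual variance / S_xx).
    std_error = math.sqrt(sse / (n - 2) / s_xx)
    t_stat = slope / std_error
    margin = t_critical * std_error
    return slope, std_error, t_stat, (slope - margin, slope + margin)


# Hypothetical data scattered around y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

slope, std_error, t_stat, (ci_low, ci_high) = slope_inference(xs, ys, t_critical=2.306)
print(f"slope = {slope:.3f}, SE = {std_error:.3f}, t = {t_stat:.1f}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

The same function also returns the t-statistic (estimate divided by standard error), so the interval and the t-test can be read side by side.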



## What is the relationship between the t-test for a coefficient and the F-test for the model?

There are two main connections:

The 'Simple' Case: In a simple linear regression (only one X), the F-test and the t-test are actually telling you the exact same thing. In fact, mathematically, F = t^2.

The 'Model' Case: The F-statistic you see at the bottom of a regression summary is testing the entire model against a model with no predictors at all. If your individual t-tests are all failing, but your F-test is significant, it’s a huge red flag that you have multicollinearity: your variables are helpful, but they are stepping on each other's toes.

Outside of regression, the t-test is used to compare the means of two groups and determine if they are significantly different, while the F-test is used to compare the variances of two groups (or, in ANOVA, the means of two or more groups) and assess if they are significantly different.
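The F = t^2 identity for simple regression is easy to check numerically. This sketch computes both statistics from scratch on made-up data; the names are my own.

```python
import math


def t_and_f_statistics(xs, ys):
    """Return (t-statistic of the slope, overall F-statistic) for one predictor."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / s_xx
    intercept = y_mean - slope * x_mean
    fitted = [intercept + slope * x for x in xs]

    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, fitted))  # unexplained
    ssr = sum((y_hat - y_mean) ** 2 for y_hat in fitted)         # explained
    mse = sse / (n - 2)

    t_stat = slope / math.sqrt(mse / s_xx)
    f_stat = (ssr / 1) / mse  # one predictor -> 1 numerator degree of freedom
    return t_stat, f_stat


# Hypothetical noisy data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

t_stat, f_stat = t_and_f_statistics(xs, ys)
print(f"t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```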

## What does a p-value of 0.03 actually mean?
It’s the probability of observing our data (or something more extreme) under the Null Hypothesis. It's not the probability that the theory is true, but rather a measure of how 'surprising' our data is if there were no real effect.

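## What does a p-value indicate?
A low p-value (typically < 0.05) suggests a statistically significant relationship between the predictor and the target.

A high p-value suggests the variable does not significantly contribute to the model.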


## How do you handle nonlinear relationships while still using linear regression (polynomial regression, log transformation)?

The most important thing to remember is that Linear Regression must be linear in its coefficients β, but it doesn't have to be linear in its features (X). This means we can transform our data into curves, and the model will still treat it as a linear math problem.

Polynomial Regression (Adding "Curves")

If the relationship looks like a U-shape or an S-curve, I would use Polynomial Regression.

How I'd do it: I’d create new features by squaring or cubing the original variables (adding an x^2 or x^3 term).

The Result: This allows the 'straight line' of the model to bend and follow the data points more closely.

Example: In real estate, the value of a house might increase slowly at first with square footage, but then jump significantly once you reach a 'luxury' size. A squared term captures that acceleration.

Log Transformations (Handling Exponential Growth)

If the data is skewed or the relationship is based on percentages rather than fixed amounts, I would use a Log Transformation.

How I'd do it: I take the natural log of the Dependent Variable (ln(Y)), the Independent Variable (ln(X)), or both.

The Result: It compresses large values and spreads out small ones. It’s perfect for 'multiplicative' relationships, like how a 5% raise is different for someone making 40k vs. 400k.

Main point: If I log both X and Y, the coefficient represents elasticity (how a 1% change in X impacts the % change in Y).


Polynomials = For physical curves (U-shapes).

Log Transforms = For percentage-based growth or skewed data.
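The 'linear in the coefficients, not in the features' idea can be sketched with NumPy (assumed to be installed). The data is synthetic and follows y = 1 + 2x + 3x² exactly, so the recovered coefficients are easy to check. Adding an x² column is all it takes; the solver is still ordinary least squares.

```python
import numpy as np

# Synthetic data following y = 1 + 2x + 3x^2 with no noise.
x = np.linspace(-3, 3, 13)
y = 1 + 2 * x + 3 * x ** 2

# Design matrix with columns [1, x, x^2]: curved in x, linear in the coefficients.
design = np.column_stack([np.ones_like(x), x, x ** 2])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

# For comparison, a plain straight line fitted to the same data.
line_coefficients, *_ = np.linalg.lstsq(design[:, :2], y, rcond=None)

sse_quadratic = np.sum((y - design @ coefficients) ** 2)
sse_line = np.sum((y - design[:, :2] @ line_coefficients) ** 2)

print("Quadratic coefficients:", np.round(coefficients, 3))
print("SSE straight line:", round(float(sse_line), 3))
print("SSE with x^2 term:", round(float(sse_quadratic), 3))
```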





## What is log-linear, linear-log, and log-log regression? Interpret their coefficients.

### Log-Linear Model (Exponential Growth)

In this model, the dependent variable (Y) is logged, but the predictor (X) stays the same.
USE : When Y grows exponentially (e.g., a bank account balance over time)
Coefficient Interpretation: A 1-unit increase in X is associated with approximately a (β1 × 100)% change in Y.

If β1 = 0.05, then adding 1 year of experience (X) increases salary (Y) by 5%.

### Linear-Log Model (Diminishing Returns)
Here, the independent variable (X) is logged, but Y remains in its original unit.
USE : When you have "diminishing returns", where X has a huge impact at first, but then levels off (e.g., the benefit of adding more fertilizer to a plant).

Coefficient Interpretation: A 1% increase in X is associated with approximately a (β1 / 100) unit change in Y.

Example: If β1 = 20, then increasing advertising spend (X) by 1% results in about 0.20 more units sold (Y).

###  Log-Log Model (Elasticity)

Both X and Y are logged. This is very popular in economics.

When to use it: When the relationship is multiplicative. It helps you find the "Elasticity."

Coefficient Interpretation: A 1% increase in X is associated with approximately a β1% change in Y.

Example: If β1 = -1.5, then a 1% increase in the price of a product (X) leads to about a 1.5% decrease in demand (Y).
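As a quick check of the elasticity interpretation, the sketch below builds synthetic demand data from demand = 100 · price^(−1.5) and regresses ln(demand) on ln(price). The slope of that log-log fit should come back as the elasticity, −1.5. The numbers are invented for illustration.

```python
import math

# Synthetic price/demand data with a built-in elasticity of -1.5.
prices = [1.0, 2.0, 4.0, 8.0, 16.0]
demand = [100.0 * p ** -1.5 for p in prices]

log_x = [math.log(p) for p in prices]
log_y = [math.log(d) for d in demand]

# Ordinary simple regression, just on the logged values.
n = len(log_x)
x_mean, y_mean = sum(log_x) / n, sum(log_y) / n
s_xx = sum((x - x_mean) ** 2 for x in log_x)
elasticity = sum((x - x_mean) * (y - y_mean) for x, y in zip(log_x, log_y)) / s_xx
intercept = y_mean - elasticity * x_mean

print(f"Estimated elasticity: {elasticity:.3f}")
print(f"Implied scale factor: {math.exp(intercept):.1f}")
```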


### Why do we use ln instead of log10?

We use the natural log because for small values of β, the change in ln(Y) is a very close approximation of the percentage change in Y. It makes the math of 'growth rates' much cleaner.


## What is ridge regression (L2 regularization)? How does it differ from OLS?

Ridge Regression, or L2 Regularization, is a technique used to prevent a model from overfitting. In standard linear regression, the model tries to fit the training data as perfectly as possible, which can lead to 'exploded' coefficients if the data is noisy or has highly correlated features. Ridge adds a penalty to the loss function based on the size of the coefficients. This forces the model to keep the weights small, making it more stable and better at predicting new, unseen data.

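To understand the difference, look at the loss functions: OLS minimizes only the sum of squared residuals, while Ridge minimizes that same sum plus a penalty proportional to the sum of squared coefficients. OLS only cares about accuracy, while Ridge cares about accuracy + simplicity.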
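A rough sketch of that difference using the closed-form solutions, assuming NumPy and centered data so the intercept can be left out. OLS solves (XᵀX)β = Xᵀy, while Ridge solves (XᵀX + λI)β = Xᵀy. The two simulated features are nearly identical on purpose, and `lam` is an arbitrary penalty strength picked for illustration.

```python
import numpy as np


def ols_coefficients(X, y):
    """Minimise ||y - X b||^2."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def ridge_coefficients(X, y, lam):
    """Minimise ||y - X b||^2 + lam * ||b||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Simulated data: x2 is almost a copy of x1 (strong multicollinearity).
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=100)

# Centre everything so no intercept column is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

beta_ols = ols_coefficients(X, y)
beta_ridge = ridge_coefficients(X, y, lam=10.0)
print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
```

The Ridge coefficient vector is always shorter than the OLS one for any positive penalty. In practice the features are usually standardized first so the penalty treats them equally.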


### When should you choose Ridge over OLS?

Choose Ridge in these three scenarios:

Multicollinearity: When your independent variables are highly correlated (e.g., "Years of Experience" and "Age"). OLS would give them wild, unreliable weights; Ridge handles them gracefully.

High-Dimensional Data: When you have many features (p) but not many observations (n). Standard OLS often fails or overfits here.

Noisy Data: When you suspect that some of the patterns in your training set are just random noise that shouldn't be learned.

## Explain lasso regression (L1 regularization) and its advantage for feature selection.

Lasso stands for Least Absolute Shrinkage and Selection Operator. Like Ridge, it adds a penalty to the model to prevent overfitting. However, instead of penalizing the square of the coefficients, it penalizes the absolute value. The 'magic' of Lasso is that it doesn't just make coefficients small—it can force them to become exactly zero. This effectively deletes unimportant features from the model, leaving you with a simpler, more interpretable result.
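The 'exactly zero' behaviour comes from the soft-thresholding operator. In the special case of orthonormal features (XᵀX = I) and the objective ½‖y − Xβ‖² + λ‖β‖₁, the Lasso solution is just the OLS coefficient pushed toward zero by λ and clipped at zero. This is a sketch of that special case only, not a general Lasso solver, and the coefficient values are made up. The Ridge comparison uses the matching penalty (λ/2)‖β‖², which scales each coefficient by 1 / (1 + λ).

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within the band becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical OLS coefficients under an orthonormal design.
ols_coefficients = [3.0, -0.4, 0.2, -2.0]
penalty = 0.5

lasso_coefficients = [soft_threshold(b, penalty) for b in ols_coefficients]
ridge_coefficients = [b / (1.0 + penalty) for b in ols_coefficients]

print("OLS:  ", ols_coefficients)
print("Lasso:", lasso_coefficients)  # small coefficients are set exactly to 0
print("Ridge:", [round(b, 3) for b in ridge_coefficients])  # shrunk, never exactly 0
```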

### When to use Lasso

High-Dimensional Data: If you have 1,000 features but suspect only 10 really matter.

Model Interpretability: If you need to explain to stakeholders exactly which 3-4 factors are driving the business results.

Automated Variable Selection: When you want the model to decide which data is "noise" and which is "signal" without manually dropping columns.

### What is the main drawback of Lasso?  

If I have several highly correlated variables, Lasso tends to pick one of them more or less arbitrarily and zero out the others. This can be a problem if I need to see the impact of all those variables. In that case, I might use Elastic Net, which combines both L1 and L2 penalties.


## What is elastic net regression?

Elastic Net is a regularized regression method that uses a linear combination of L1 and L2 penalties. While Lasso is great for feature selection and Ridge is great for handling multicollinearity, they both have weaknesses. Lasso might randomly pick one variable from a group of correlated ones and ignore the others, while Ridge keeps all of them but can't simplify the model. Elastic Net fixes this by balancing both penalties, allowing it to group correlated variables together while still being able to perform feature selection.

### Which one should I use by default?
In practice, Elastic Net is often preferred over Lasso because it is more robust. However, it does require tuning an extra hyperparameter (the L1 Ratio). If I have the computational budget for cross-validation, I'll usually start with Elastic Net to see if a hybrid approach performs better than pure Lasso or Ridge.


## Weighted least squares (WLS) – when and why is it used?
Standard OLS treats every data point as equally important. WLS says, 'Wait, some of these data points are more reliable than others.'

When to use it: When you have Heteroscedasticity and you actually know why the variance is changing.

How it works: You give a higher 'weight' to points with low variance and a lower 'weight' to points with high variance. It’s like listening more to a witness who has 20/20 vision than one who forgot their glasses.
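A minimal sketch of the weighting idea for one predictor. The data is made up: four points sit exactly on y = 2x and the fifth is a noisy measurement. The weights are assumed to be proportional to 1/variance, with the last point's variance taken to be 100 times larger than the others.

```python
def weighted_least_squares(xs, ys, weights):
    """Return (intercept, slope) minimising sum(w * (y - a - b*x)^2)."""
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_sum
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: the true line is y = 2x, the last reading is unreliable.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 20]

equal_weights = [1, 1, 1, 1, 1]            # this is plain OLS
inverse_variance = [1, 1, 1, 1, 0.01]      # assumed: last point is 100x noisier

_, ols_slope = weighted_least_squares(xs, ys, equal_weights)
_, wls_slope = weighted_least_squares(xs, ys, inverse_variance)
print(f"OLS slope: {ols_slope:.3f}")
print(f"WLS slope: {wls_slope:.3f}")
```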

## Generalized Least Squares (GLS)
GLS is the 'Big Brother' of OLS. While OLS assumes all errors are independent and have the same variance, GLS can handle data where errors are correlated (like in Time Series) or have different variances.

The Logic: It mathematically transforms the data to 'clean out' the correlation or uneven variance before running the regression. If you have a complex error structure, GLS is your go-to.

## Robust Regression (Huber Loss)
OLS is a bit of a 'drama queen'—one single outlier can pull the entire line toward itself because it squares the errors. Robust Regression uses something like Huber Loss.

How it works: It acts like OLS for small errors (squaring them), but for huge errors (outliers), it acts like MAE (just taking the absolute value). This prevents outliers from having an unfair amount of influence on your model.
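The loss itself is only a few lines. In this sketch `delta` is the cut-off between 'small' and 'huge' errors; the default of 1.0 is an arbitrary choice for illustration.

```python
def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    abs_residual = abs(residual)
    if abs_residual <= delta:
        return 0.5 * residual ** 2
    return delta * (abs_residual - 0.5 * delta)


def squared_loss(residual):
    """The OLS-style loss, for comparison."""
    return 0.5 * residual ** 2


for residual in [0.5, 3.0, 10.0]:
    print(
        f"residual = {residual:>4}: "
        f"squared = {squared_loss(residual):>5}, huber = {huber_loss(residual)}"
    )
```

For a residual of 10 the squared loss is 50 while the Huber loss is 9.5, which is why a single outlier pulls much less on the fitted line.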

## Quantile Regression
OLS always tries to predict the conditional mean (the average of Y for given X values). But sometimes the average doesn't tell the whole story.

The Difference: Quantile regression allows you to predict the median, the 10th percentile, or the 90th percentile of Y instead.

Example: If you’re studying healthcare costs, you might not care about the 'average' patient; you might want to model the 95th percentile of costs, the level that only the most expensive 5% of patients exceed. Quantile regression lets you do that.
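
Quantile regression works by swapping squared error for the quantile ("pinball") loss. The sketch below, on made-up skewed cost data, fits only a constant, to show that minimizing this loss recovers the requested quantile rather than the mean. The function names are illustrative.

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Average quantile (pinball) loss for quantile level q in (0, 1)."""
    diff = y - pred
    # Under-predictions cost q per unit, over-predictions cost (1 - q) per unit.
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

def best_constant(y, q, candidates):
    """Grid search for the constant prediction with the lowest pinball loss."""
    losses = [pinball_loss(y, c, q) for c in candidates]
    return float(candidates[int(np.argmin(losses))])

rng = np.random.default_rng(2)
costs = rng.exponential(scale=1000.0, size=5000)   # made-up, right-skewed "costs"
candidates = np.linspace(0.0, 8000.0, 801)          # grid with step 10

for q in (0.5, 0.95):
    print(q, best_constant(costs, q, candidates), np.quantile(costs, q))
```

A full quantile regression minimizes the same loss over the coefficients of a linear model instead of a single constant.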

## Regression with dummy variables for categorical predictors
Regression only works on numbers, so we have to turn 'Categories' into 0s and 1s.

The Rule: If you have 3 categories (e.g., Red, Blue, Green) and the model has an intercept, you only need 2 dummy variables.

Why? Because if it’s not Red and not Blue, it must be Green. Green becomes the baseline, and each dummy coefficient measures the difference from it. This prevents the 'Dummy Variable Trap' (perfect multicollinearity), where the dummy columns add up to the intercept column and the coefficients can no longer be uniquely estimated.
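
The trap shows up directly as a rank-deficient design matrix. A small check on made-up data:

```python
import numpy as np

colors = np.array(["Red", "Blue", "Green", "Red", "Green", "Blue"])
levels = ["Red", "Blue", "Green"]

# One 0/1 column per category.
full = np.column_stack([(colors == lvl).astype(float) for lvl in levels])
intercept = np.ones((len(colors), 1))

X_trap = np.hstack([intercept, full])         # intercept + 3 dummies
X_ok = np.hstack([intercept, full[:, :2]])    # intercept + 2 dummies (Green is the baseline)

print(np.linalg.matrix_rank(X_trap), "of", X_trap.shape[1], "columns")
print(np.linalg.matrix_rank(X_ok), "of", X_ok.shape[1], "columns")
```

With all three dummies the matrix has 4 columns but only rank 3, so the normal equations have no unique solution; dropping one dummy restores full column rank.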

## Stepwise Regression: The "Search" Strategies
Stepwise regression is a way to automate the process of picking which variables belong in your model when you have too many to choose from.

Forward Selection: You start with an empty model and add the variables one by one. Each step, you pick the variable that gives the biggest boost (usually based on the lowest p-value or AIC). You stop when no more variables add significant value.

Backward Elimination: You start with everything in the model. Then, you kick out the least helpful variable (the one with the highest p-value) one at a time. You stop when everyone left in the room is statistically significant.

Bidirectional (Stepwise): This is a mix. You add variables like Forward selection, but after each addition, you check if any of the older variables have become redundant and should be removed. It’s like a 'constant re-evaluation' of the team.
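
Forward selection can be sketched in a few lines. This version uses AIC as the stopping criterion (one of the options mentioned above) and assumes Gaussian errors; the helper names and the simulated data are illustrative.

```python
import numpy as np

def aic(X, y):
    """AIC (up to an additive constant) of an OLS fit with Gaussian errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_select(X, y):
    """Greedy forward selection: repeatedly add the column that lowers AIC the most."""
    n, p = X.shape
    selected = []
    design = np.ones((n, 1))               # start from the intercept-only model
    best_aic = aic(design, y)
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        scores = {j: aic(np.column_stack([design, X[:, j]]), y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_aic:     # no candidate improves AIC: stop
            break
        selected.append(j_best)
        design = np.column_stack([design, X[:, j_best]])
        best_aic = scores[j_best]
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(size=300)   # only columns 0 and 3 matter
print(forward_select(X, y))
```

The two real predictors are picked first; whether any noise column sneaks in afterwards depends on the sample, which is part of why the procedure is criticized.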

## What is the bias–variance tradeoff in model complexity?
This is the fundamental balancing act in machine learning.

Bias (Underfitting): This happens when your model is too simple. For example, if you try to fit a straight line to a curve, you have 'High Bias.' The model is too rigid to see the real pattern.

Variance (Overfitting): This happens when your model is too complex. It starts memorizing the noise and 'wiggles' in your specific dataset. It looks great on your training data but fails miserably on new data because it's too 'sensitive.'

The Tradeoff: As you add more variables, Bias goes down, but Variance goes up. The goal is to find the 'sweet spot' where the Total Error is at its lowest.
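
A quick way to see the tradeoff is to fit polynomials of increasing degree to the same small training set. The data-generating curve and the degrees below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.3, size=n)   # made-up curved signal plus noise
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

results = {}
for degree in (1, 3, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(degree, round(train_mse, 3), round(test_mse, 3))
```

Training error can only go down as the degree increases, because each model contains the previous one. Test error typically drops at first (less bias) and then rises again once the extra flexibility is spent fitting noise (more variance).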

## How does cross-validation help in regression model selection?
Cross-validation is how we check that our model isn't just 'memorizing' the training data.

How it works (K-Fold): Instead of splitting the data just once, we split it into, say, 5 groups (folds). We train the model on 4 groups and test it on the 5th. We repeat this 5 times, so every data point gets a turn in the 'test set.'

Why use it: It gives us a much more honest estimate of how the model will perform in the real world. If a model has a great R^2 on the training data but a terrible score during cross-validation, we know we've overfit and need to simplify the model.
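
A minimal K-fold sketch for OLS, written with NumPy only. The function name `kfold_mse` and the simulated data (one real predictor plus 30 pure-noise columns) are illustrative assumptions.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Mean out-of-fold MSE of OLS over k folds."""
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)   # shuffle before splitting
    folds = np.array_split(order, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        scores.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(5)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_simple = np.column_stack([np.ones(n), x])
X_bloated = np.column_stack([X_simple, rng.normal(size=(n, 30))])   # 30 pure-noise columns

print("simple :", kfold_mse(X_simple, y))
print("bloated:", kfold_mse(X_bloated, y))
```

The bloated model always fits the training data at least as well, but its cross-validated error is worse, which is exactly the overfitting signal described above.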

## Compare subset selection vs. shrinkage methods (ridge, lasso).

The fundamental difference is that Subset Selection is "all-or-nothing" (discrete), while Shrinkage is "shades of grey" (continuous).

Subset Selection (The "Discrete" Approach)
In subset selection (like Best Subset, Forward, or Backward selection), you treat each feature as a light switch: it’s either ON or OFF.

How it works: You pick a subset of k predictors and run standard Ordinary Least Squares (OLS) on just those variables. The other p-k variables are discarded.

The Problem (Stability): It is very unstable. Because it's a discrete process, adding just one new data point might cause the model to suddenly "switch off" one variable and "switch on" a completely different one. This leads to high variance.

The Problem (Computation): "Best Subset" is computationally infeasible when there are many features. With 40 features, there are 2^40 (over 1 trillion) possible combinations to check.

Shrinkage Methods (The "Continuous" Approach)

Shrinkage methods (Ridge and Lasso) don't throw variables away immediately. Instead, they put a "leash" on the coefficients, pulling them toward zero.

How it works: They keep all (or most) variables but reduce their magnitude. Instead of a light switch, it’s like a dimmer switch.

Stability: Because the coefficients change smoothly as you adjust the penalty λ, these models are much more stable and have lower variance than subset selection.

Bias-Variance Tradeoff: Shrinkage purposefully introduces a little bias (by not letting the coefficients reach their "perfect" OLS values) to get a massive reduction in variance. This usually results in better predictions on new data.
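
The "dimmer switch" behaviour is easy to see with the closed-form ridge solution. This sketch omits the intercept for simplicity (an assumption; in practice you would center the data or leave the intercept unpenalized), and the data are simulated.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam * I)^-1 X'y, without an intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)    # two highly correlated columns
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=100)

norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(lam, np.round(beta, 3), round(norms[-1], 3))
```

As λ grows, the overall size of the coefficient vector shrinks smoothly toward zero, and with λ = 0 the result is plain OLS.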


### Why not just use Best Subset if I have the computing power?

Even with infinite computing power, Shrinkage is often better. Because Subset Selection runs plain OLS on the chosen variables, it has no way to reduce the variance of the coefficients it keeps. Shrinkage methods provide a 'smoothing' effect. In most real-world data where features are correlated or noisy, a model that shrinks coefficients (Ridge/Lasso) will often generalize better to new data than a model that just picks a subset and runs OLS.



## Why do many people dislike Stepwise Regression?

Because it's a 'greedy' algorithm. It only looks at the best variable in the moment and might miss the best combination of variables overall. Also, because the same data is used both to choose the variables and to test them, the reported p-values look better than they actually are (data dredging). This is why many practitioners prefer Lasso regression or cross-validation for variable selection instead.

## When is Linear Regression Unsuitable?
Even if the data looks like a perfect straight line, linear regression might be a bad choice if:

Extrapolation is required: If you need to predict far outside the range of your data, a linear trend might not last (e.g., a person's height vs. age: it's roughly linear for a while, then it stops).


Endogeneity/Causality: If you want to prove X causes Y, but there's a hidden 'omitted variable' causing both, your coefficients will be biased and misleading.


Time Series with Trends: Two unrelated things that both go up over time (like 'Number of organic grocery stores' and 'SpaceX launches') will show a high linear correlation, but the model is logically meaningless (Spurious Correlation).


## Why can’t we use R^2 to compare different Y variables?
R^2 is a ratio: it’s the variance explained divided by the total variance of the dependent variable. If you change Y (for example, comparing a model that predicts Price vs. a model that predicts log(Price)), the total variance (SST) changes completely. It’s like comparing a student’s score on a Math test to their score on a History test. Even if they got '80%' on both, the difficulty and the 'spread' of the class scores are different. You can't say one model is 'better' just because the R^2 is higher.
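
Written out, with $\mathrm{SS}_{\mathrm{res}}$ as assumed notation for the residual sum of squares:

$$
R^2 \;=\; 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SST}} \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

Both sums are measured in the units of Y, so replacing Y with log(Y) changes the numerator and the denominator at once, and the two resulting ratios are not on a common scale.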

## How to Handle Categorical Variables in Linear Regression?
To handle categorical variables in linear regression, we first need to convert them into numerical formats since regression models cannot process non-numeric data. The two main approaches are one-hot encoding and label encoding:

One-hot encoding: creates new binary columns for each category of a categorical variable.
For example, if you have a variable called "Color" with three possible values: Red, Blue, and Green, one-hot encoding will create three new columns: "Color_Red", "Color_Blue", and "Color_Green". Each row will have a value of 1 in the column corresponding to its color and 0 in the others.

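Can you tell me one drawback of one-hot encoding? It increases dimensionality when there are many categories (curse of dimensionality). Also, in a model with an intercept, one of the columns has to be dropped to avoid the dummy variable trap.

Label Encoding: It assigns a unique integer to each category of a variable. Using the same "Color" example, you could label Red as 0, Blue as 1, and Green as 2. In linear regression this implies an order and equal spacing between the categories (Green counts as twice Blue), so it is only appropriate for genuinely ordinal variables.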

## How to Handle Missing Data in Linear Regression?

Imputation: Replace missing values with the mean, median, or mode of that feature. For example, fill in missing weights with the average weight.

Dropping Rows/Columns: Remove rows or columns with missing data if they are minimal, but do this cautiously to avoid losing important information.

Prediction Models: Use other variables to predict and fill in missing values, ensuring a more complete dataset for analysis.
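
The first two options look like this on a small made-up array of weights, using NumPy's NaN-aware helpers:

```python
import numpy as np

weights = np.array([70.0, np.nan, 82.0, 65.0, np.nan, 90.0])   # made-up data with gaps
missing = np.isnan(weights)

mean_filled = np.where(missing, np.nanmean(weights), weights)      # fills gaps with 76.75
median_filled = np.where(missing, np.nanmedian(weights), weights)  # fills gaps with 76.0
complete_rows = weights[~missing]                                  # the "drop rows" alternative

print(mean_filled)
print(median_filled)
print(complete_rows)
```

Mean or median imputation keeps every row but shrinks the feature's variance, so it is best treated as a simple baseline rather than a default.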

## If two predictors are highly correlated, how does it affect coefficient estimates?

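High correlation between two predictors (multicollinearity) makes it hard for the model to separate their individual effects: the coefficient estimates become unstable, their standard errors inflate, and signs or magnitudes can swing with small changes in the data, even though the overall predictions may stay accurate. A common fix is to remove some of the control variables that are correlated with each other. You may then find that a variable that was insignificant becomes (more) significant because it is no longer collinear with another variable that was showing up as significant.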

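## How would you test if a regression model is better than just using the mean of Y?

Predicting the mean of Y for every observation is the same as fitting an intercept-only model, so the question is whether the predictors add anything beyond that baseline. The formal check is the overall F-test, whose null hypothesis is that all slope coefficients are zero; a small p-value means the model explains significantly more variation than the mean alone.

For that test to be trustworthy, the usual assumptions should hold:

The residuals follow a normal distribution.

The residuals are homoscedastic (constant variance).

There are no extreme outliers in the errors.

There is no autocorrelation in the errors.

There is no severe multicollinearity between the independent variables.

You can also compare the two using metrics like R-squared, MSE, or MAE, ideally on held-out data: lower error (MSE/MAE) or higher R-squared indicates the better model.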
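
One way to make the comparison against the mean concrete is the overall F statistic, which measures how much the residual sum of squares drops relative to the mean-only model. This is a sketch on simulated data; the function name is illustrative, and turning the statistic into a p-value would additionally need the F distribution with (p, n - p - 1) degrees of freedom (for example `scipy.stats.f.sf`, not used here).

```python
import numpy as np

def overall_f_statistic(X, y):
    """F statistic comparing OLS (with intercept) against the mean-only model.

    X holds the p predictors WITHOUT an intercept column.
    """
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss_model = np.sum((y - design @ beta) ** 2)   # residual SS of the regression
    rss_mean = np.sum((y - y.mean()) ** 2)         # residual SS of predicting the mean (SST)
    return ((rss_mean - rss_model) / p) / (rss_model / (n - p - 1))

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 3))
y_signal = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=80)
y_noise = rng.normal(size=80)                      # unrelated to X

f_signal = overall_f_statistic(X, y_signal)
f_noise = overall_f_statistic(X, y_noise)
print(f_signal, f_noise)
```

A large F value says the predictors reduce the error far more than chance would explain; a value near 1 says the model is doing no better than the mean.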


## Explain the maximum likelihood estimation approach for linear regression?

Maximum Likelihood Estimation (MLE) for linear regression finds the model parameters (coefficients and error variance) that maximize the probability (likelihood) of observing the given data, assuming the errors follow a normal distribution.
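
Under the assumption of independent normal errors with variance $\sigma^2$, the log-likelihood for $n$ observations can be written as (the symbols $x_i$, $\beta$, and $\sigma^2$ are notation introduced here):

$$
\ell(\beta, \sigma^2) \;=\; -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2
$$

For any fixed $\sigma^2$, maximizing over $\beta$ means minimizing the sum of squared residuals, so the MLE of the coefficients coincides with the OLS estimate. Maximizing over $\sigma^2$ then gives

$$
\hat{\sigma}^2_{\mathrm{MLE}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i^\top \hat{\beta}\right)^2 ,
$$

which divides by $n$ rather than by the residual degrees of freedom and is therefore slightly biased downward.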

## Can you discuss the use of spline functions in regression?
Spline functions in regression model non-linear relationships by fitting low-degree polynomials to different segments of the data, connected smoothly at points called "knots." This offers more flexibility than a single high-degree polynomial while avoiding its oscillations.
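
The simplest case is a linear spline, which is still an ordinary linear regression once the basis columns are built. This sketch assumes a single knot at a known location (x = 5) and a made-up V-shaped relationship; cubic splines follow the same pattern with higher-order terms.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Columns: 1, x, and one hinge term max(0, x - k) per knot."""
    hinges = [np.maximum(0.0, x - k) for k in knots]
    return np.column_stack([np.ones_like(x), x] + hinges)

x = np.linspace(0.0, 10.0, 200)
y_true = np.abs(x - 5.0)                       # made-up V-shaped relationship
rng = np.random.default_rng(8)
y = y_true + rng.normal(0.0, 0.2, size=x.size)

B = linear_spline_basis(x, knots=[5.0])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)   # plain OLS on the spline basis
fitted = B @ beta
print(np.round(beta, 2))                       # close to intercept 5, slope -1, slope change +2
```

The hinge coefficient is the change in slope at the knot, so the fitted line is continuous but allowed to bend there.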

## What are mixed models, and where might you use them?
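Mixed models are statistical models that incorporate both "fixed effects" and "random effects" to account for variation within and between groups of data. They are useful when observations are grouped or repeated, for example test scores of students nested within schools or repeated measurements on the same patients, where a random effect per group captures the correlation among observations from that group.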

## Explain the purpose of residual plots and how to interpret them.

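A residual plot is a graphical method to check how well a model's predictions match the actual data, usually drawn as residuals against fitted values or against a predictor. For example, if you're predicting sales based on price, the residuals are the differences between the actual and predicted sales. Points scattered randomly around zero suggest the model is adequate; a curved pattern suggests a missed non-linear relationship, and a funnel shape suggests heteroscedasticity.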

## What are leverage points and how do they affect a regression model?

Leverage points are data points with unusual or extreme independent (X) variable values, meaning they are far from the average X, giving them the potential to heavily influence a regression line, even if their Y-value fits the trend.
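
Leverage is usually measured by the diagonal of the hat matrix. A small sketch on made-up data where the last X value sits far from the rest:

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^-1 X'."""
    # With a thin QR factorization X = QR, H = QQ', so h_ii is the squared row norm of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])   # last point is far from the others
X = np.column_stack([np.ones_like(x), x])        # intercept plus one predictor
h = leverages(X)
print(np.round(h, 3))
print(h.sum())                                   # equals the number of columns (2), up to rounding
```

The leverages always sum to the number of coefficients, so an average point here has leverage 2/6; the point at x = 20 has leverage close to 1, meaning the fitted line is almost forced to pass through it.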

## Explain the concept of homoscedasticity. Why is it important?
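Homoscedasticity means the prediction errors have constant variance: they are evenly spread out and do not follow any pattern, regardless of the magnitude of the predictor variables. It matters because the usual OLS standard errors assume it; when it fails, the coefficient estimates remain unbiased, but the standard errors, and therefore the t-tests and confidence intervals, can no longer be trusted.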

## When is linear regression unsuitable, even if the relationship seems linear?

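Linear regression is a poor choice if your target actually depends on curves, thresholds, or complex shapes that only look linear over the range you happened to observe.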