**Theoretical**


1 What does R-squared represent in a regression model?
-R-squared is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model. It is also known as the coefficient of determination.
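
As a minimal sketch of this definition, R-squared can be computed as one minus the ratio of unexplained to total variation. The function name `r_squared` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for the given actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]
r2 = r_squared(y_true, y_pred)
print(round(r2, 4))  # 0.995
```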


2  What are the assumptions of linear regression?
-Linear regression relies on several key assumptions to ensure the validity and reliability of its estimates. These assumptions are:

### **1. Linearity**  
- The relationship between the independent variables (predictors) and the dependent variable (outcome) must be **linear**.  
- If the relationship is non-linear, linear regression may not be appropriate or could require transformation (e.g., logarithmic transformation).

### **2. Independence (No Autocorrelation)**  
- The residuals (errors) should be **independent** of each other.  
- In time series data, this means no serial correlation (errors at one time point should not be correlated with errors at another time point).  
- **Violation test**: Use the **Durbin-Watson test** to check for autocorrelation.

### **3. Homoscedasticity (Constant Variance of Errors)**  
- The variance of residuals should remain **constant** across all levels of the independent variable(s).  
- If variance increases or decreases (heteroscedasticity), standard errors may be biased, leading to unreliable statistical tests.  
- **Violation test**: Use a **scatter plot of residuals vs. predicted values** or **Breusch-Pagan test**.

### **4. Normality of Residuals**  
- The residuals (errors) should be **normally distributed**, especially for small sample sizes.  
- This is important for hypothesis testing (e.g., t-tests, confidence intervals).  
- **Violation test**: Use a **Q-Q plot**, **histogram of residuals**, or **Shapiro-Wilk test**.

### **5. No Multicollinearity**  
- Independent variables should not be highly correlated with each other.  
- High multicollinearity makes it difficult to determine the effect of each predictor on the dependent variable.  
- **Violation test**: Check **Variance Inflation Factor (VIF)**—a VIF > 5 or 10 suggests problematic multicollinearity.

### **6. No Omitted Variable Bias**  
- The model should include all relevant predictors; omitting important variables can lead to biased estimates.  
- More precisely, the predictors should be uncorrelated with the error term (exogeneity), which absorbs the effect of any omitted factors.




3 What is the difference between R-squared and Adjusted R-squared?
### **Difference Between \(R^2\) and Adjusted \(R^2\)**  

| **Metric**          | **Definition** | **Formula** | **Key Difference** |
|---------------------|---------------|------------|-------------------|
| **\(R^2\) (R-Squared)** | Measures the proportion of variance in the dependent variable explained by the independent variables. | \( R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \) | Never decreases when more predictors are added, even if they don’t improve the model. |
| **Adjusted \(R^2\)** | Adjusts \(R^2\) for the number of predictors, penalizing needless model complexity. | \( R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} \) | Increases only if a new predictor improves the fit by more than would be expected by chance; otherwise it decreases. |

### **Key Differences**  
1. **Adjusted \(R^2\) Penalizes Extra Variables**  
   - \(R^2\) never decreases when new independent variables are added, even if they contribute nothing to explaining variance.  
   - Adjusted \(R^2\) **increases only when the added predictor actually improves the model**.  

2. **Use Case**  
   - **\(R^2\)** is useful for basic understanding of model fit.  
   - **Adjusted \(R^2\)** is better for comparing models with different numbers of predictors.  

3. **Formula Adjustment**  
   - Adjusted \(R^2\) introduces a penalty based on sample size (\(n\)) and the number of predictors (\(p\)).  

### **When to Use Which?**  
- If you just want to see how well the model explains variance, use **\(R^2\)**.  
- If you are comparing multiple models or avoiding overfitting, use **Adjusted \(R^2\)**.  
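
The contrast can be checked numerically. The sketch below fits ordinary least squares with NumPy on synthetic data, then adds a predictor that is pure noise. The helper names `fit_r2` and `adjusted_r2` and the data are assumptions made for illustration. \(R^2\) cannot go down when the noise column is added, while Adjusted \(R^2\) may go either way depending on the sample.

```python
import numpy as np

def fit_r2(X, y):
    """Fit OLS with an intercept via least squares and return R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, p):
    """Adjust R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=n)  # predictor unrelated to y

r2_small = fit_r2(x.reshape(-1, 1), y)
r2_big = fit_r2(np.column_stack([x, noise]), y)

print("one predictor :", r2_small, adjusted_r2(r2_small, n, 1))
print("plus noise    :", r2_big, adjusted_r2(r2_big, n, 2))
```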






4 Why do we use Mean Squared Error (MSE)?
### **Why Do We Use Mean Squared Error (MSE)?**  

Mean Squared Error (MSE) is a widely used metric for evaluating the performance of regression models. It measures the **average squared difference** between actual and predicted values.

### **1. Definition & Formula**  
\[
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
Where:  
- \( y_i \) = Actual value  
- \( \hat{y}_i \) = Predicted value  
- \( n \) = Number of observations  

### **2. Reasons for Using MSE**  
#### **a) Penalizes Larger Errors More**  
- Squaring the errors **amplifies larger deviations**, so a few large misses dominate the metric.
- A model trained to minimize MSE is therefore pushed to avoid large prediction errors, which is desirable when big misses are especially costly.

#### **b) Differentiable for Optimization**  
- The squared function is continuous and differentiable, making it easy to compute gradients for optimization (e.g., in gradient descent for machine learning models).

#### **c) Easy to Interpret & Compare**  
- MSE provides a simple way to measure how well a model fits the data.
- It is widely used in model selection and hyperparameter tuning.

### **3. Limitations of MSE**  
#### **a) Not in the Same Units as the Target Variable**  
- Since MSE squares the errors, its unit is **different** from the original variable.  
- Solution: Use **Root Mean Squared Error (RMSE)** to convert it back to the original unit.

#### **b) Sensitive to Outliers**  
- Large errors get squared, making MSE **highly sensitive to outliers**.
- Solution: Use **Mean Absolute Error (MAE)** if robustness to outliers is needed.

### **4. Alternatives to MSE**  
- **Mean Absolute Error (MAE):** Uses absolute values instead of squares, making it less sensitive to outliers.  
- **Root Mean Squared Error (RMSE):** The square root of MSE, making the error units the same as the target variable.  
- **Mean Absolute Percentage Error (MAPE):** Expresses error as a percentage of actual values.  
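
The four metrics above can be computed side by side. This is a small sketch with NumPy; the arrays `y_true` and `y_pred` are made-up values for illustration.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)                     # squared units
rmse = np.sqrt(mse)                            # same units as the target
mae = np.mean(np.abs(errors))                  # same units as the target
mape = np.mean(np.abs(errors / y_true)) * 100  # percent; undefined if any y_true is 0

print(f"MSE={mse}, RMSE={rmse:.2f}, MAE={mae}, MAPE={mape}%")
```

Here MSE is 275, MAE is 12.5, and MAPE is 6.25%. RMSE (about 16.58) is larger than MAE because the single error of 30 is weighted more heavily.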



5 What does an Adjusted R-squared value of 0.85 indicate?
-An **Adjusted R-squared** value of **0.85** indicates that **85% of the variance in the dependent variable is explained by the independent variables** in the regression model, **adjusted for the number of predictors**.  

### **Key Interpretations:**
1. **Strong Model Fit**  
   - Since **Adjusted \(R^2\)** accounts for the number of predictors, a **value of 0.85 suggests a strong fit**, meaning the independent variables explain most of the variability in the target variable.  

2. **Better than Simple \(R^2\)**  
   - Unlike **\(R^2\), which never decreases as variables are added**, **Adjusted \(R^2\) only increases if the new variables contribute meaningful explanatory power**.  
   - If the Adjusted \(R^2\) remains high, it suggests that most of the predictors in the model are relevant.

3. **Still Some Unexplained Variance**  
   - Since Adjusted \(R^2\) is **not 1.0**, about **15% of the variance is still unexplained**, meaning there could be other factors affecting the dependent variable that are not included in the model.

### **What Should You Do Next?**
- **Check for overfitting**: Even though 0.85 is high, ensure the model is generalizable by testing on unseen data.  
- **Evaluate residuals**: Ensure assumptions like normality and homoscedasticity are met.  
- **Consider adding/exploring more variables**: If the domain knowledge suggests other relevant predictors, test their impact.  




6 How do we check for normality of residuals in linear regression?
### **How to Check for Normality of Residuals in Linear Regression**  
The assumption of **normality of residuals** means that the residuals (errors) should follow a normal distribution. This is especially important for valid hypothesis testing and confidence intervals in regression analysis.  

### **Methods to Check Normality**  

#### **1. Visual Inspection**  
📌 *Best for an initial quick check.*  

✅ **Histogram of Residuals**  
- Plot a histogram of the residuals.  
- If the residuals are normally distributed, the histogram should resemble a **bell curve**.  

✅ **Q-Q Plot (Quantile-Quantile Plot)**  
- Compares the residual quantiles to a normal distribution.  
- If the residuals are normal, points should fall along the **45-degree reference line**.  

✅ **Box Plot**  
- Identifies potential **outliers** that may affect normality.  
- Extreme outliers can distort the assumption of normality.  

---

#### **2. Statistical Tests**  
📌 *Useful when you need a formal test rather than a visual check.*  

✅ **Shapiro-Wilk Test**  
- Null hypothesis (\(H_0\)): Residuals follow a normal distribution.  
- If \(p < 0.05\), reject \(H_0\) (residuals are **not** normal).  

✅ **Kolmogorov-Smirnov (K-S) Test**  
- Tests whether residuals deviate significantly from a normal distribution.  
- Works better for **larger datasets**.  

✅ **Anderson-Darling Test**  
- Similar to Shapiro-Wilk but places more emphasis on the tails of the distribution.  

✅ **Jarque-Bera Test**  
- Checks for skewness and kurtosis in residuals.  
- If the test statistic is **significant**, residuals deviate from normality.  

---

#### **3. Skewness and Kurtosis**  
📌 *Numerical indicators of normality.*  
- **Skewness**: Measures symmetry (should be close to 0 for normality).  
- **Kurtosis**: Measures the "tailedness" of the distribution (should be around 3).  
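
As a rough sketch, both indicators can be computed directly from the residuals' central moments, and combined into the Jarque-Bera statistic mentioned earlier. The function names and the simulated residuals below are assumptions for illustration; kurtosis here is the non-excess version, so a normal distribution gives a value near 3.

```python
import numpy as np

def skew_kurtosis(residuals):
    """Return (skewness, kurtosis) from the sample central moments."""
    r = np.asarray(residuals, dtype=float)
    centered = r - r.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def jarque_bera_stat(residuals):
    """Jarque-Bera statistic: grows as skewness leaves 0 or kurtosis leaves 3."""
    n = len(residuals)
    skew, kurt = skew_kurtosis(residuals)
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(42)
normal_resid = rng.normal(size=5000)       # stands in for well-behaved residuals
skewed_resid = rng.exponential(size=5000)  # stands in for right-skewed residuals

print("normal:", skew_kurtosis(normal_resid), jarque_bera_stat(normal_resid))
print("skewed:", skew_kurtosis(skewed_resid), jarque_bera_stat(skewed_resid))
```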

---

### **What to Do If Residuals Are Not Normal?**  
🔹 **Apply Transformations** (e.g., log, square root, Box-Cox transformation).  
🔹 **Check for Outliers** and remove influential points if necessary.  
🔹 **Use Robust Regression** if normality is severely violated.  
🔹 **Increase Sample Size**: with more data, the coefficient estimates become approximately normal (central limit theorem), so inference is less sensitive to non-normal residuals.  




7 What is multicollinearity, and how does it impact regression?
### **What is Multicollinearity?**  
**Multicollinearity** occurs when two or more independent variables in a regression model are highly correlated, meaning they provide redundant information. This makes it difficult to determine the individual effect of each predictor on the dependent variable.

### **Types of Multicollinearity**
1. **Perfect Multicollinearity**  
   - When one predictor is an exact linear combination of one or more other predictors (e.g., \( X_2 = 2X_1 \)).  
   - The model cannot be estimated in this case.  
   
2. **High (But Not Perfect) Multicollinearity**  
   - When independent variables are strongly correlated but not exact duplicates.  
   - This can cause instability in coefficient estimates.

---

### **How Does Multicollinearity Impact Regression?**
1. **Unstable Coefficients**  
   - High correlation between predictors makes it difficult for the model to assign the correct contribution to each variable.  
   - Small changes in data can lead to large fluctuations in coefficients.

2. **Inflated Standard Errors**  
   - Standard errors of regression coefficients increase, making them **statistically insignificant** even if they are actually important.

3. **Misleading p-values**  
   - Due to large standard errors, **p-values may be high**, leading to incorrect conclusions about variable significance.

4. **Difficulty in Interpretation**  
   - When predictors are highly correlated, it becomes hard to determine their independent effect on the outcome.

5. **Overfitting Issues**  
   - A model with highly correlated predictors may perform well on training data but poorly on new data.

---

### **How to Detect Multicollinearity?**
✅ **Variance Inflation Factor (VIF)**  
   - Measures how much variance of a coefficient is inflated due to correlation with other predictors.  
   - **Rule of thumb:**  
     - **VIF > 5 or 10** indicates high multicollinearity.  
     - **VIF ≈ 1** means no multicollinearity.  

✅ **Correlation Matrix**  
   - A heatmap or correlation table can reveal high correlations (above **0.7 or 0.8** is concerning).  

✅ **Condition Number**  
   - A high condition number (\( > 30 \)) suggests multicollinearity.

---

### **How to Handle Multicollinearity?**
🔹 **Remove One of the Correlated Variables**  
   - If two variables convey similar information, keep the more interpretable one.  

🔹 **Use Principal Component Analysis (PCA)**  
   - Reduces correlated variables into uncorrelated principal components.  

🔹 **Combine Variables**  
   - Create a new feature that represents both correlated variables (e.g., average or ratio).  

🔹 **Use Ridge Regression (L2 Regularization)**  
   - Penalizes large coefficients, reducing multicollinearity impact.  



8 What is Mean Absolute Error (MAE)?
-Mean Absolute Error (MAE) is a regression metric that measures the average absolute difference between actual and predicted values. It tells us how far predictions are, on average, from the actual values, making it a useful measure of model accuracy.




9 What are the benefits of using an ML pipeline?
### **Benefits of Using an ML Pipeline**  

A **Machine Learning (ML) pipeline** is a structured way to automate and streamline the ML workflow, from data preprocessing to model deployment. Using an ML pipeline offers several key advantages:

---

### **1. Automation & Efficiency 🚀**  
- Automates repetitive tasks (e.g., data cleaning, feature engineering, model training).  
- Reduces manual intervention, saving time and effort.  

✅ Example: Instead of manually preprocessing data each time, a pipeline ensures it happens automatically before model training.  

---

### **2. Reproducibility 🔄**  
- Ensures consistent results when re-running experiments.  
- Applies the same sequence of transformations and training steps on every run; combined with fixed random seeds, this makes results repeatable.  

✅ Example: Using **Scikit-learn Pipelines**, you can ensure the same preprocessing steps are applied every time.  

---

### **3. Scalability 📈**  
- Easily handles large datasets and complex workflows.  
- Can be deployed to production systems seamlessly.  

✅ Example: A pipeline can be used for batch processing or real-time inference without changing the core logic.  

---

### **4. Modular & Maintainable Code 🏗️**  
- Breaks down the ML process into **separate, manageable steps** (e.g., data preprocessing, feature selection, model training).  
- Makes debugging and updating models easier.  

✅ Example: If a feature engineering step needs updating, you can modify just that step without affecting the entire workflow.  

---

### **5. Hyperparameter Tuning & Experimentation 🎯**  
- Pipelines can integrate **automated hyperparameter tuning** (e.g., GridSearchCV, RandomizedSearchCV).  
- Ensures that model selection and optimization are done systematically.  

✅ Example: Instead of manually testing different parameters, a pipeline can automate hyperparameter tuning.  

---

### **6. Avoids Data Leakage 🚫**  
- Ensures that data transformation steps (like scaling or feature selection) are **only learned from training data** and **not influenced by test data**.  
- Prevents overfitting and ensures fair evaluation.  

✅ Example: If you fit a scaler on the entire dataset before splitting, you risk **data leakage**; a pipeline prevents this by fitting transformations on the training set only and then applying them unchanged to the test set.  
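
A minimal sketch of this idea with scikit-learn's `Pipeline` follows. The synthetic data, step names, and split settings are assumptions for illustration. Because the scaler lives inside the pipeline, `fit` learns its mean and variance from the training rows only, and `score` reuses those values on the test rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

# Split first, so the test rows never influence any fitted step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # fitted on X_train only
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

test_r2 = pipe.score(X_test, y_test)             # R-squared on held-out data
scaler_mean = pipe.named_steps["scaler"].mean_   # equals the training-set means
print(test_r2, scaler_mean)
```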

---

### **7. Easy Deployment & Integration 🏭**  
- Pipelines can be deployed in **production environments** (e.g., cloud, APIs, edge devices).  
- Can be integrated into **MLOps** workflows, **CI/CD** pipelines, and **monitoring systems**.  

✅ Example: A pipeline built with **TensorFlow or Scikit-learn** can be deployed in **AWS, Azure, or Google Cloud** for real-time predictions.  

---

### **8. Supports Parallel Processing ⚡**  
- Many pipeline frameworks (e.g., Apache Airflow, Kubeflow) support parallel execution, reducing computation time.  

✅ Example: Independent steps, such as preprocessing different feature groups or training several candidate models, can run in **parallel**, speeding up ML workflows.  

---

### **Popular ML Pipeline Tools & Frameworks**  
✅ **Scikit-learn Pipelines** (for structured ML workflows)  
✅ **TensorFlow Extended (TFX)** (for deep learning pipelines)  
✅ **Apache Airflow** (for scheduling and orchestrating ML tasks)  
✅ **Kubeflow** (for scalable ML pipelines in Kubernetes)  



10 Why is RMSE considered more interpretable than MSE?
### **Why is RMSE More Interpretable Than MSE?**  

**Root Mean Squared Error (RMSE)** is often preferred over **Mean Squared Error (MSE)** for interpretation because it is in the **same unit** as the target variable, making it easier to understand in practical terms.

---

### **Key Differences Between RMSE and MSE**  

| Metric | Formula | Interpretation | Units |
|--------|---------|----------------|-------|
| **MSE** | \( MSE = \frac{1}{n} \sum (y_i - \hat{y}_i)^2 \) | Measures average squared error but is not in the original scale of the target variable | Squared units (e.g., if target is in dollars, MSE is in dollars²) |
| **RMSE** | \( RMSE = \sqrt{MSE} \) | Measures the typical error magnitude in the same unit as the target variable | Same as the target variable (e.g., dollars) |

---

### **Why is RMSE More Interpretable?**  
1. **Same Unit as Target Variable**  
   - RMSE takes the square root of MSE, bringing the error measurement **back to the original scale of the dependent variable**.  
   - Example: If predicting house prices in **dollars**, RMSE will be in **dollars**, while MSE will be in **dollars²**, which is harder to interpret.  

2. **Easier to Relate to Actual Errors**  
   - RMSE is the square root of the average squared error, so it indicates the **typical size of a prediction error**, with large errors weighted more heavily than in a plain average.  
   - Example: If RMSE = **5,000 dollars**, the model’s predictions are typically off by roughly **$5,000**.  

3. **More Comparable Across Models**  
   - MSE and RMSE always rank models in the same order, but differences in MSE are on a squared scale and harder to judge.  
   - RMSE expresses those differences in the target's own units, giving a more intuitive comparison of model performance.  

---

### **When to Use RMSE vs. MSE?**  
- **Use RMSE** when interpretability is important (e.g., communicating results to stakeholders).  
- **Use MSE** as a training objective, where its smooth, simple gradient is convenient; note that both metrics weight large errors more heavily than MAE does.  



11 What is pickling in Python, and how is it useful in ML?
### **What is Pickling in Python?**  
**Pickling** is the process of **serializing** (converting) a Python object into a binary format so it can be saved to a file and later **deserialized** (loaded back into memory). This is done using Python’s built-in `pickle` module.  

### **How to Pickle an Object?**
```python
import pickle

# Example object (dictionary)
data = {"name": "Alice", "age": 25, "score": 90}

# Save to file
with open("data.pkl", "wb") as file:
    pickle.dump(data, file)

# Load from file
with open("data.pkl", "rb") as file:
    loaded_data = pickle.load(file)

print(loaded_data)  # Output: {'name': 'Alice', 'age': 25, 'score': 90}
```

---

### **Why is Pickling Useful in Machine Learning?**  

1. **Saving and Loading Trained Models**  
   - After training a machine learning model, you can pickle it and reuse it later without retraining.  
   - **Example: Save a trained model**  
     ```python
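     import pickle
     import numpy as np
     from sklearn.linear_model import LinearRegression

     # Small synthetic training set (illustrative only)
     X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
     y_train = np.array([2.0, 4.0, 6.0, 8.0])

     model = LinearRegression()
     model.fit(X_train, y_train)

     # Serialize the fitted model to disk
     with open("model.pkl", "wb") as file:
         pickle.dump(model, file)

     # Load the model later (only unpickle files you trust)
     with open("model.pkl", "rb") as file:
         loaded_model = pickle.load(file)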
     ```

2. **Speeding Up Workflows**  
   - Instead of recalculating expensive feature transformations, pickle and reload them.  

3. **Sharing and Deployment**  
   - You can share trained models with other developers without sharing the raw training data.  
   - Useful for **deploying ML models** in applications and APIs.  

4. **Storing Preprocessed Data and Features**  
   - Avoid reprocessing large datasets by pickling precomputed features.  

---

### **Limitations of Pickling**  
❌ **Not Cross-Language Compatible** – Pickled objects can only be loaded in Python.  
❌ **Security Risk** – Loading an untrusted pickle file can execute malicious code.  
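❌ **Version Issues** – Pickled objects may not load correctly across different Python or library versions (e.g., a model saved with one scikit-learn release and loaded with another).  

✅ **Alternatives:** Use `joblib` for larger ML models (it builds on pickle, so the same security caveat applies) or `JSON` for simple data serialization.  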




12 What does a high R-squared value mean?
### **What Does a High R-Squared Value Mean?**  

A **high \( R^2 \) (R-squared) value** in a regression model indicates that **a large proportion of the variance in the dependent variable is explained by the independent variables**.  

For example:  
- \( R^2 = 0.85 \) means **85% of the variance** in the target variable is explained by the predictors.  
- \( R^2 = 0.95 \) means the model explains **95% of the variance**, leaving only 5% unexplained.  

---

### **Interpretation of a High \( R^2 \) Value**  

✅ **Good Fit (Generally)**  
- If \( R^2 \) is high, the independent variables **effectively predict the dependent variable**.  
- It suggests that the model captures the patterns in the data well.  

❗ **BUT High \( R^2 \) Does NOT Always Mean a Good Model!**  
- A high \( R^2 \) does not confirm that the model is **correct** or **useful**.  
- There could still be **overfitting, multicollinearity, or missing variables**.  

---

### **Things to Watch Out For:**  

1. **Overfitting 🚨**  
   - If \( R^2 \) is **too high (near 1.0)**, the model might be memorizing the training data rather than generalizing well.  
   - Solution: Check **Adjusted \( R^2 \)**, Cross-validation, or Regularization.  

2. **Multicollinearity 🌀**  
   - High \( R^2 \) with **high VIF (Variance Inflation Factor)** may indicate that predictors are highly correlated, reducing the reliability of coefficients.  

3. **Non-Linearity 🔄**  
   - A high \( R^2 \) doesn’t confirm that a linear model is the best choice—there could be **non-linear relationships** that the model fails to capture.  

4. **Missing Important Variables 🔍**  
   - A model can have high \( R^2 \) but still **miss key predictors** if the right variables aren't included.  

---

### **When is a High \( R^2 \) Truly Good?**  
✅ **For Predictive Accuracy:** If the model performs well on both training and test data.  
✅ **For Business & Decision Making:** If the model provides useful insights and makes logical sense.  
✅ **When Adjusted \( R^2 \) is Also High:** This means the predictors are genuinely contributing to the model.  



13 What happens if linear regression assumptions are violated?
### **What Happens If Linear Regression Assumptions Are Violated?**  

Linear regression relies on several assumptions for accurate and reliable results. If these assumptions are violated, it can lead to **biased estimates, incorrect inferences, and poor model performance**.  

---

### **1. Linearity Violation 📈**  
**Assumption:** The relationship between the independent and dependent variables should be **linear**.  

❌ **Consequence:**  
- The model may underfit the data, leading to poor predictions.  
- Residuals (errors) will show a pattern, indicating a non-linear relationship.  

✅ **Solution:**  
- Use polynomial regression or a non-linear model like decision trees or neural networks.  
- Apply transformations (e.g., log, square root) to make relationships more linear.  

---

### **2. Independence Violation (Autocorrelation) 🔄**  
**Assumption:** Errors (residuals) should be **independent** of each other.  

❌ **Consequence:**  
- If residuals are correlated (common in time-series data), the model may **underestimate standard errors**, making p-values unreliable.  
- Predictions may not generalize well.  

✅ **Solution:**  
- Check for autocorrelation using the **Durbin-Watson test**.  
- If autocorrelation is present, use **time series models** like ARIMA instead of linear regression.  

---

### **3. Normality Violation (Non-Normal Residuals) 📊**  
**Assumption:** Residuals should be **normally distributed**.  

❌ **Consequence:**  
- Confidence intervals and hypothesis tests (e.g., p-values) may be inaccurate.  
- The model may struggle with outliers or skewed distributions.  

✅ **Solution:**  
- Check normality using a **QQ plot** or **Shapiro-Wilk test**.  
- Apply transformations (e.g., log or Box-Cox).  
- Use **robust regression** if extreme outliers are an issue.  

---

### **4. Homoscedasticity Violation (Heteroscedasticity) 🌊**  
**Assumption:** The variance of residuals should be **constant** across all levels of independent variables.  

❌ **Consequence:**  
- Unequal variance (heteroscedasticity) leads to **biased standard errors**, making p-values and confidence intervals unreliable.  
- This can result in **incorrect significance tests**.  

✅ **Solution:**  
- Check for heteroscedasticity using **residual plots** or the **Breusch-Pagan test**.  
- Apply transformations (e.g., log transformation).  
- Use **weighted least squares regression** instead of ordinary least squares (OLS).  

---

### **5. Multicollinearity Violation 🔄**  
**Assumption:** Independent variables should not be highly correlated with each other.  

❌ **Consequence:**  
- **Unstable coefficients**—small changes in data lead to large fluctuations in estimated coefficients.  
- **Incorrect variable importance**—the model struggles to distinguish the effect of each predictor.  

✅ **Solution:**  
- Detect multicollinearity using the **Variance Inflation Factor (VIF)** (**VIF > 5** is a concern).  
- Remove or combine correlated variables.  
- Use **regularization techniques** like Ridge Regression (L2 regularization).  

---

### **Summary Table: Consequences & Solutions**  

| **Violation**        | **Consequence** | **Solution** |
|----------------------|----------------|--------------|
| **Non-linearity** | Poor model fit, biased predictions | Use polynomial regression, transformations, or non-linear models |
| **Autocorrelation** | Underestimated errors, unreliable predictions | Use time-series models (e.g., ARIMA), check Durbin-Watson test |
| **Non-normal residuals** | Invalid hypothesis tests, poor interpretability | Apply transformations (log, Box-Cox), use robust regression |
| **Heteroscedasticity** | Biased standard errors, misleading significance tests | Use weighted least squares, check Breusch-Pagan test |
| **Multicollinearity** | Unstable coefficients, incorrect variable importance | Check VIF, remove correlated variables, use Ridge Regression |



14 How can we address multicollinearity in regression?
### **How to Address Multicollinearity in Regression?**  

Multicollinearity occurs when **independent variables are highly correlated**, making it difficult to determine the individual effect of each predictor. It can lead to **unstable coefficients** and **incorrect interpretations**.  

---

### **How to Detect Multicollinearity?**  
1. **Variance Inflation Factor (VIF)** – A VIF **greater than 5 or 10** indicates multicollinearity.  
   ```python
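   import pandas as pd
   from statsmodels.stats.outliers_influence import variance_inflation_factor
   from statsmodels.tools.tools import add_constant

   # df is assumed to be an existing DataFrame containing these predictor columns
   X = add_constant(df[['feature1', 'feature2', 'feature3']])

   # Calculate VIF for each column; the intercept is included so the
   # auxiliary regressions are centered, then dropped from the report
   vif = pd.DataFrame()
   vif["Feature"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
   print(vif[vif["Feature"] != "const"])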
   ```
2. **Correlation Matrix** – A high correlation (e.g., > 0.8) between two predictors suggests multicollinearity.
   ```python
   print(df.corr())
   ```
3. **Eigenvalues of the Correlation Matrix** – If some eigenvalues of the predictors' correlation matrix (or of \( X^T X \)) are very close to zero, it indicates near collinearity.

---

### **How to Fix Multicollinearity?**  

#### **1. Remove Highly Correlated Predictors 🔍**  
- Drop one of the correlated variables.  
- Choose the one that is less important based on business knowledge or feature importance.  

#### **2. Combine Correlated Features (Feature Engineering) 🏗️**  
- Create a new feature by averaging or applying PCA (Principal Component Analysis).  
- Example: If "Height" and "Weight" are highly correlated, create **BMI = Weight / Height²**.

#### **3. Use Regularization (Ridge or Lasso Regression) 🏆**  
- **Ridge Regression (L2 regularization)** shrinks coefficients, which stabilizes the estimates when predictors are correlated.  
- **Lasso Regression (L1 regularization)** can **eliminate** less important predictors, reducing redundancy.  
   ```python
   from sklearn.linear_model import Ridge, Lasso

   ridge = Ridge(alpha=1.0)
   ridge.fit(X_train, y_train)

   lasso = Lasso(alpha=0.1)
   lasso.fit(X_train, y_train)
   ```

#### **4. Use Principal Component Analysis (PCA) 🔄**  
- PCA reduces correlated features into **uncorrelated principal components**, which can be used in regression.  
   ```python
   from sklearn.decomposition import PCA

   pca = PCA(n_components=2)
   X_pca = pca.fit_transform(X)
   ```

#### **5. Increase Sample Size (If Possible) 📊**  
- Sometimes, adding more data can help stabilize coefficient estimates.

---

### **Best Approach?**  
- **If two variables are highly correlated**, remove one.  
- **If multicollinearity is widespread**, use **Ridge Regression or PCA**.  
- **If interpretability is key**, prefer **Lasso Regression** to remove redundant features.  




15 How can feature selection improve model performance in regression analysis?
-Feature selection is the process of choosing the most relevant and important features for a regression model while removing redundant or irrelevant ones. It improves accuracy, interpretability, and efficiency.
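
One simple way to do this is a filter method that ranks features by the strength of their correlation with the target and keeps the top few. The sketch below is only one such approach, not the only or best one; the function name `top_k_by_correlation` and the synthetic data (where only columns 0 and 3 drive the target) are assumptions for illustration.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Return sorted indices of the k features with the largest |Pearson r| to y."""
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return np.sort(best)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
# Only features 0 and 3 influence the target; the rest are irrelevant
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=300)

selected = top_k_by_correlation(X, y, k=2)
print(selected)  # [0 3]
```

A filter like this ignores interactions between features, so wrapper or embedded methods (for example Lasso) may select differently.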




16  How is Adjusted R-squared calculated?
### **How is Adjusted R-Squared Calculated?**  

**Adjusted R-squared (\( R^2_{adj} \))** is a modified version of **R-squared (\( R^2 \))** that accounts for the number of predictors in the model. Unlike regular \( R^2 \), which **never decreases** when more variables are added (even if they are irrelevant), **Adjusted \( R^2 \) penalizes the inclusion of unnecessary predictors**.

---

### **Formula for Adjusted R-Squared**  

\[
R^2_{adj} = 1 - \left( \frac{(1 - R^2) \times (n - 1)}{n - k - 1} \right)
\]

Where:  
- \( R^2 \) = Regular R-squared  
- \( n \) = Number of observations (sample size)  
- \( k \) = Number of independent variables (predictors)  

---

### **How Does It Work?**  
✅ **If a new predictor improves the model significantly**, Adjusted \( R^2 \) **increases**.  
❌ **If a new predictor adds little to no value**, Adjusted \( R^2 \) **decreases**.  

This makes **Adjusted \( R^2 \)** a **better metric than regular \( R^2 \)** for comparing models with different numbers of predictors.  

---

### **Example Calculation**  
Suppose:  
- \( R^2 = 0.85 \)  
- \( n = 100 \) (observations)  
- \( k = 5 \) (predictors)  

\[
R^2_{adj} = 1 - \left( \frac{(1 - 0.85) \times (100 - 1)}{100 - 5 - 1} \right)
\]

\[
R^2_{adj} = 1 - \left( \frac{(0.15 \times 99)}{94} \right)
\]

\[
R^2_{adj} = 1 - \left( \frac{14.85}{94} \right)
\]

\[
R^2_{adj} = 1 - 0.158  
\]

\[
R^2_{adj} = 0.842
\]

So, the Adjusted \( R^2 \) value is **0.842**, which is slightly lower than the original \( R^2 \) of 0.85 due to the penalty for additional predictors.
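
The same arithmetic can be checked in a few lines of Python. The function name `adjusted_r_squared` is a hypothetical helper; the loop also shows how the penalty grows when the same \( R^2 \) is achieved with more predictors.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

value = adjusted_r_squared(0.85, n=100, k=5)
print(round(value, 3))  # 0.842

# Same R-squared and sample size, increasing number of predictors
for k in (5, 20, 50):
    print(k, round(adjusted_r_squared(0.85, n=100, k=k), 3))
```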

---

### **Key Takeaways**  
- **Adjusted \( R^2 \) is always ≤ Regular \( R^2 \)**.  
- **If Adjusted \( R^2 \) increases**, adding the predictor **improves the model**.  
- **If Adjusted \( R^2 \) decreases**, the new predictor **does not contribute much** and should be reconsidered.  





17 Why is MSE sensitive to outliers?
-Mean Squared Error (MSE) is sensitive to outliers because it squares the error terms, giving more weight to larger errors.
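
A small numeric sketch makes the effect visible. The values are made up: every prediction misses by 1, then a single prediction is changed so that it misses by 10.

```python
import numpy as np

def mse(y, p):
    return np.mean((y - p) ** 2)

def mae(y, p):
    return np.mean(np.abs(y - p))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
clean_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])  # every error has size 1
outlier_pred = clean_pred.copy()
outlier_pred[-1] = 22.0                                # one error of size 10

print("MSE:", mse(y_true, clean_pred), "->", mse(y_true, outlier_pred))
print("MAE:", mae(y_true, clean_pred), "->", mae(y_true, outlier_pred))
```

With one outlier, MSE rises from 1.0 to 20.8 while MAE rises only from 1.0 to 2.8, because the error of 10 contributes 100 to the squared sum.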




18 What is the role of homoscedasticity in linear regression?
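-Homoscedasticity refers to the assumption that the variance of residuals (errors) remains constant across all levels of the independent variable(s). It is a key assumption in ordinary least squares (OLS) regression: when it holds, the usual standard errors, confidence intervals, and p-values are valid; when it fails, the coefficient estimates remain unbiased but those inference results become unreliable.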





19 What is Root Mean Squared Error (RMSE)?
-Root Mean Squared Error (RMSE) is a popular metric for evaluating the performance of a regression model. It measures the average magnitude of the prediction error, giving more weight to large errors due to squaring.
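
As a formula, using \( y_i \) for the observed values, \( \hat{y}_i \) for the predictions, and \( n \) for the number of observations (notation assumed here for illustration):

\[
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\]

Because of the square root, RMSE is expressed in the same units as the target variable, which makes it easier to interpret than MSE.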





20 Why is pickling considered risky?
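-Pickling is Python's built-in way to serialize and deserialize objects, and it is commonly used to save and load trained machine learning (ML) models. It is risky for two reasons. First, security: unpickling can execute arbitrary code, so loading a pickle file from an untrusted source can compromise the system. Second, compatibility: a pickle stores references to classes rather than their code, so a model saved with one version of Python or of a library (e.g., scikit-learn) may fail to load, or behave differently, under another version.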
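
The security concern can be illustrated with a deliberately harmless sketch. The class `NotReallyAModel` is hypothetical; it uses the standard `__reduce__` hook to tell `pickle` which callable to run while loading. Here that callable is the built-in `sorted`, but a malicious file could name any importable function instead.

```python
import pickle

class NotReallyAModel:
    """Hypothetical class whose pickle triggers a function call on load."""

    def __reduce__(self):
        # pickle will call sorted([3, 1, 2]) when the payload is loaded.
        # This stand-in is harmless; an attacker could pick any callable.
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(NotReallyAModel())
loaded = pickle.loads(payload)  # the call happens here, during loading

print(type(loaded).__name__, loaded)
```

The loaded value is not a `NotReallyAModel` at all but whatever the callable returned, which is why pickle files should only be loaded from trusted sources.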




21 What alternatives exist to pickling for saving ML models?
-There are several alternatives to **pickling** for saving machine learning models, depending on your needs for **portability, security, efficiency, and compatibility**. Here are some of the most common alternatives:  

### **1. Joblib**  
- **Library**: `joblib`  
- **Why Use It?**  
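  - Faster than `pickle` for objects holding large NumPy arrays (common in ML models)  
  - Built-in **compression** (e.g., `zlib`, `gzip`)  
  - Note: `joblib` is built on `pickle`, so it carries the same security risk when loading untrusted files  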
- **Usage**:  
  ```python
  from joblib import dump, load
  dump(model, 'model.joblib')
  model = load('model.joblib')
  ```  

### **2. ONNX (Open Neural Network Exchange)**  
- **Library**: `onnx`  
- **Why Use It?**  
  - **Interoperability** across frameworks (TensorFlow, PyTorch, etc.)  
  - Supports hardware acceleration (TensorRT, DirectML)  
- **Usage**:  
  ```python
  import torch  
  import onnx  
  torch.onnx.export(model, dummy_input, "model.onnx")  
  ```  

### **3. TensorFlow SavedModel (for TensorFlow/Keras models)**  
- **Library**: `tensorflow`  
- **Why Use It?**  
  - Optimized for **TensorFlow Serving** and **deployment**  
  - Saves entire computation graph (better than HDF5 for TF models)  
- **Usage**:  
  ```python
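  import tensorflow as tf
  # Note: recent Keras releases may require a ".keras" or ".h5" suffix here and use model.export() for SavedModel
  model.save("model_directory")  # Save
  model = tf.keras.models.load_model("model_directory")  # Load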
  ```  

### **4. HDF5 (Hierarchical Data Format)**
- **Library**: `h5py`, `keras`  
- **Why Use It?**  
  - Structured storage for large datasets  
  - Commonly used for Keras models (`.h5` format)  
- **Usage**:  
  ```python
  model.save("model.h5")  
  from tensorflow.keras.models import load_model  
  model = load_model("model.h5")  
  ```  

### **5. PMML (Predictive Model Markup Language)**  
- **Library**: `sklearn2pmml`  
- **Why Use It?**  
  - XML-based, used for **model deployment in enterprise systems**  
  - Works with **Java-based scoring engines**  
- **Usage**:  
  ```python
  from sklearn2pmml import sklearn2pmml
  sklearn2pmml(pipeline, "model.pmml")
  ```  

### **6. JSON (for lightweight models)**  
- **Library**: `json`  
- **Why Use It?**  
  - Human-readable  
  - Useful for **saving hyperparameters** (what `get_params()` returns), but **not learned weights**  
- **Usage**:  
  ```python
  import json  
  with open("model.json", "w") as f:
      json.dump(model.get_params(), f)
  ```  
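
For a linear model, the learned weights can also be written to JSON explicitly. The sketch below is an illustration under assumptions: the `learned` dictionary holds made-up numbers standing in for a fitted model's `coef_` and `intercept_`, and `predict` is a hypothetical helper, not part of any library.

```python
import json

# Made-up learned parameters, standing in for model.coef_ and model.intercept_
learned = {"coef": [3.0, -1.5], "intercept": 4.0}

serialized = json.dumps(learned)   # plain text: safe to inspect and to load
restored = json.loads(serialized)

def predict(params, features):
    """Linear prediction: intercept + sum of coefficient * feature."""
    return params["intercept"] + sum(c * x for c, x in zip(params["coef"], features))

prediction = predict(restored, [2.0, 1.0])
print(prediction)  # 4.0 + 3.0 * 2.0 - 1.5 * 1.0 = 8.5
```

Loading JSON never executes code, which is the main advantage over pickle; the cost is that the prediction logic has to be rebuilt by hand.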

### **7. TorchScript (for PyTorch models)**  
- **Library**: `torch`  
- **Why Use It?**  
  - Converts PyTorch models into an **optimized, deployable format**  
- **Usage**:  
  ```python
  scripted_model = torch.jit.script(model)
  scripted_model.save("model.pt")
  ```  

### **8. MLflow (for Experiment Tracking and Model Storage)**  
- **Library**: `mlflow`  
- **Why Use It?**  
  - Tracks **versions** of models  
  - Easily deployable with **MLflow Serving**  
- **Usage**:  
  ```python
  import mlflow  
  mlflow.sklearn.save_model(model, "model_path")
  model = mlflow.sklearn.load_model("model_path")
  ```  

### **Which One to Choose?**  
| **Use Case** | **Best Alternative** |  
|-------------|----------------|  
| Large NumPy models | Joblib |  
| Cross-framework (TF, PyTorch) | ONNX |  
| TensorFlow/Keras models | SavedModel / HDF5 |  
| Enterprise deployment | PMML |  
| PyTorch models | TorchScript |  
| Lightweight model metadata | JSON |  
| Experiment tracking | MLflow |  




22 What is heteroscedasticity, and why is it a problem?
-Heteroscedasticity refers to a situation in which the variance of the errors (residuals) in a regression model is not constant across all levels of an independent variable. In other words, some observations have much larger or smaller error terms than others, leading to an unequal spread of residuals.


### **Why Is Heteroscedasticity a Problem?**

**1. Violates an OLS assumption**  
- Ordinary Least Squares (OLS) regression assumes homoscedasticity (constant variance).  
- When heteroscedasticity is present, OLS estimates remain unbiased, but they are no longer efficient (i.e., they don't have the smallest variance).  

**2. Affects standard errors**  
- Since the variance is not constant, the standard errors of the regression coefficients are misestimated.  
- This can lead to incorrect p-values and misleading hypothesis tests (e.g., a coefficient might appear significant when it is not).  

**3. Reduces model reliability**  
- Predictions from a model with heteroscedasticity may be unreliable.  
- It indicates that the model is not capturing some important relationships in the data.  

**4. Invalidates confidence intervals**  
- Confidence intervals around estimates become incorrect, affecting decision-making.  
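
The unequal spread can be seen in a small simulation. This is a sketch under stated assumptions: the data are synthetic, the error standard deviation is set to `0.5 * x` purely for illustration, and the fit uses the closed-form formulas for simple OLS.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data: the error standard deviation grows with x,
# which is a textbook form of heteroscedasticity.
n = 400
x = [random.uniform(1, 10) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0, 0.5 * xi) for xi in x]

# Simple OLS fit in closed form
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

# Residuals ordered by x, then split into a low-x half and a high-x half
residuals = [yi - (intercept + slope * xi) for xi, yi in sorted(zip(x, y))]
var_low = statistics.pvariance(residuals[: n // 2])
var_high = statistics.pvariance(residuals[n // 2:])

print(f"Estimated slope: {slope:.2f}")
print(f"Residual variance, low-x half:  {var_low:.2f}")
print(f"Residual variance, high-x half: {var_high:.2f}")
```

The slope estimate stays close to the true value of 3, while the residual variance is clearly larger in the high-x half: the coefficients are still unbiased, but a single standard error no longer describes the errors well.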



23  How can interaction terms enhance a regression model's predictive power?
### **How Interaction Terms Enhance a Regression Model's Predictive Power**  

#### **What Are Interaction Terms?**  
Interaction terms in regression **capture the combined effect of two or more independent variables on the dependent variable**. They allow the relationship between an independent variable and the dependent variable to change depending on the value of another independent variable.

**Formula for a Basic Interaction Term in Multiple Linear Regression:**  
\[
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3(X_1 \times X_2) + \epsilon
\]
Where:
- \( X_1 \) and \( X_2 \) are independent variables  
- \( X_1 \times X_2 \) is the interaction term  
- \( \beta_3 \) measures how the effect of \( X_1 \) on \( Y \) changes based on \( X_2 \)  
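
The role of \( \beta_3 \) can be sketched numerically. The coefficient values below are hypothetical, picked only to make the arithmetic easy to follow; `predict` and `effect_of_x1` are illustrative helper names.

```python
# Hypothetical coefficients for Y = b0 + b1*X1 + b2*X2 + b3*(X1*X2)
b0, b1, b2, b3 = 20.0, 2.0, 1.0, 0.5

def predict(x1, x2):
    """Model prediction, ignoring the error term."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def effect_of_x1(x2):
    """Change in Y for a one-unit increase in X1, at a given X2."""
    return b1 + b3 * x2

for x2 in (0, 10):
    diff = predict(4, x2) - predict(3, x2)
    print(f"X2 = {x2}: one more unit of X1 changes Y by {diff} "
          f"(b1 + b3*X2 = {effect_of_x1(x2)})")
```

With \( \beta_3 = 0 \) the effect of \( X_1 \) would be the same at every value of \( X_2 \); a nonzero \( \beta_3 \) makes it depend on \( X_2 \).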

---

### **How Do Interaction Terms Improve Predictive Power?**  

1. **Capture Non-Additive Relationships**  
   - Standard regression assumes **additive effects**, meaning that the effect of one variable on the outcome is the same regardless of the other variables.  
   - Interaction terms allow for **synergistic or antagonistic effects** that would be missed in a simple linear model.  

   **Example:**  
   - Suppose you're studying **salary (\( Y \))** based on **education (\( X_1 \))** and **experience (\( X_2 \))**.  
   - The effect of education might be **larger for experienced workers** than for fresh graduates.  
   - Adding an interaction term **(education × experience)** captures this relationship.  

---

2. **Improve Model Fit and Accuracy**  
   - Including interactions **reduces bias** by better modeling real-world complexities.  
   - Results in **lower residual variance** and better **R² (goodness of fit)**.  

   **Example:**  
   - If your residual plot shows systematic patterns (not just random scatter), an interaction term might help explain missing relationships.  

---

3. **Enable More Precise Interpretations**  
   - Interaction terms **clarify conditional relationships**, making results more actionable.  
   - Helps understand **when** and **under what conditions** an independent variable has a stronger/weaker effect.  

   **Example:**  
   - If you study **advertising (\( X_1 \))** and **brand loyalty (\( X_2 \))** on sales (\( Y \)), an interaction can reveal:  
     - Advertising might work **better for new customers** but **worse for loyal customers**.  

---

### **How to Create Interaction Terms in Python?**  
You can create interaction terms manually or using `PolynomialFeatures` from `sklearn`.  

#### **Method 1: Manually Adding Interaction Terms**  
```python
import pandas as pd
import statsmodels.api as sm

# Sample dataset
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],  # Education (years)
    'X2': [10, 12, 15, 18, 20],  # Experience (years)
    'Y': [50, 60, 70, 85, 100]  # Salary ($K)
})

# Add interaction term
df['X1_X2'] = df['X1'] * df['X2']

# Fit regression model
X = sm.add_constant(df[['X1', 'X2', 'X1_X2']])  # Adding constant for intercept
y = df['Y']
model = sm.OLS(y, X).fit()

print(model.summary())
```

#### **Method 2: Using `PolynomialFeatures` from `sklearn`**
```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[1, 10], [2, 12], [3, 15], [4, 18], [5, 20]])  # Education & Experience
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interaction = poly.fit_transform(X)

print(X_interaction)  # Columns: X1, X2, X1*X2
```

---

### **When to Use Interaction Terms?**  
✅ When there’s a **logical reason** that one variable modifies another’s effect.  
✅ When residual plots show **non-random patterns** suggesting missing relationships.  
✅ When **domain knowledge** suggests a **multiplicative** or **conditional** effect.  

⚠ **Avoid Overfitting!**  
- Too many interaction terms can make the model **complex and hard to interpret**.  
- Use **feature selection techniques** (e.g., stepwise regression, Lasso) to choose relevant interactions.  

---

### **Key Takeaways**  
✔ Interaction terms **improve predictive accuracy** by capturing **real-world dependencies**.  
✔ They allow for **non-additive effects**, making models more **flexible and interpretable**.  
✔ Use interaction terms **judiciously** to avoid overfitting.  












**Practical:**



1 Write a Python script to visualize the distribution of errors (residuals) for a multiple linear regression model using Seaborn's "diamonds" dataset.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import scipy.stats as stats

# Load the diamonds dataset
df = sns.load_dataset("diamonds")

# Select features and target variable
X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]  # Predictor variables
y = df['price']  # Target variable

# Handle any potential infinite values or missing data
X = X.replace([np.inf, -np.inf], np.nan).dropna()
y = y[X.index]  # Ensure y matches the cleaned X indices

# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (optional for OLS: predictions are unchanged, but coefficients become comparable)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the Multiple Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Compute residuals
residuals = y_test - y_pred

# --- PLOT 1: Histogram of Residuals ---
plt.figure(figsize=(12, 5))
sns.histplot(residuals, bins=50, kde=True, color="royalblue")
plt.axvline(x=0, color='red', linestyle='dashed', linewidth=2)  # Reference line at zero
plt.title("Distribution of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.grid()

# --- PLOT 2: Q-Q Plot (Normality Check) ---
fig, ax = plt.subplots(figsize=(6, 6))
sm.qqplot(residuals, line="s", fit=True, ax=ax)  # Draw on this axes; otherwise qqplot opens its own figure
plt.title("Q-Q Plot of Residuals")
plt.grid()

# Show the plots
plt.show()





2 Write a Python script to calculate and print Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) for a linear regression model.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Load dataset
df = sns.load_dataset("diamonds")

# Select features and target variable
X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]  # Predictor variables
y = df['price']  # Target variable

# Handle missing or infinite values (if any)
X = X.replace([np.inf, -np.inf], np.nan).dropna()
y = y[X.index]  # Ensure y matches the cleaned X indices

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (optional for OLS: predictions are unchanged, but coefficients become comparable)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate error metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print the results
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")



3 Write a Python script to check if the assumptions of linear regression are met. Use a scatter plot to check linearity, a residuals plot for homoscedasticity, and a correlation matrix for multicollinearity.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load dataset
df = sns.load_dataset("diamonds")

# Select relevant features and target variable
X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]  # Predictor variables
y = df['price']  # Target variable

# Handle missing/infinite values
X = X.replace([np.inf, -np.inf], np.nan).dropna()
y = y[X.index]  # Ensure y matches cleaned X indices

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
residuals = y_test - y_pred

# --- 1. LINEARITY CHECK: Scatter plot of actual vs. predicted values ---
plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.5, color="blue")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="red", linestyle="dashed")  # 45-degree line
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Linearity Check: Actual vs. Predicted Prices")
plt.grid()
plt.show()

# --- 2. HOMOSCEDASTICITY CHECK: Residuals plot ---
plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_pred, y=residuals, alpha=0.5, color="purple")
plt.axhline(y=0, color="red", linestyle="dashed")  # Reference line at zero
plt.xlabel("Predicted Prices")
plt.ylabel("Residuals (Errors)")
plt.title("Homoscedasticity Check: Residuals vs. Predicted Values")
plt.grid()
plt.show()

# --- 3. MULTICOLLINEARITY CHECK: Correlation Matrix ---
corr_matrix = pd.DataFrame(X_train_scaled, columns=X.columns).corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Multicollinearity Check: Correlation Matrix")
plt.show()

# --- 4. MULTICOLLINEARITY CHECK: Variance Inflation Factor (VIF) ---
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X_train_scaled, i) for i in range(X_train_scaled.shape[1])]
print("\nVariance Inflation Factor (VIF) for Multicollinearity Check:\n", vif_data)



4 Write a Python script that creates a machine learning pipeline with feature scaling and evaluates the performance of different regression models.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load dataset
df = sns.load_dataset("diamonds")

# Select features and target variable
X = df[['carat', 'depth', 'table', 'x', 'y', 'z']]  # Predictor variables
y = df['price']  # Target variable

# Handle missing/infinite values
X = X.replace([np.inf, -np.inf], np.nan).dropna()
y = y[X.index]  # Ensure y matches cleaned X indices

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define different regression models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Decision Tree": DecisionTreeRegressor(max_depth=10),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42)
}

# Evaluate each model
results = []

for name, model in models.items():
    # Create a pipeline with feature scaling and model training
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", model)
    ])

    # Train model
    pipeline.fit(X_train, y_train)

    # Predictions
    y_pred = pipeline.predict(X_test)

    # Calculate performance metrics
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)

    # Store results
    results.append([name, r2, mse, rmse, mae])

# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=["Model", "R² Score", "MSE", "RMSE", "MAE"])

# Sort by R² Score
results_df = results_df.sort_values(by="R² Score", ascending=False)

# Print results
print("\nPerformance of Regression Models:\n")
print(results_df)

# Plot model comparison
plt.figure(figsize=(10, 5))
sns.barplot(data=results_df, x="Model", y="R² Score", palette="coolwarm")
plt.title("Model Performance Comparison (Higher R² is Better)")
plt.xticks(rotation=45)
plt.show()




5 Implement a simple linear regression model on a dataset and print the model's coefficients, intercept, and R-squared score.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load dataset
df = sns.load_dataset("diamonds")

# Select one feature (carat) and target variable (price)
X = df[['carat']]  # Predictor (independent variable)
y = df['price']    # Target (dependent variable)

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Simple Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Get model parameters
slope = model.coef_[0]   # Coefficient (slope)
intercept = model.intercept_  # Intercept
r2 = model.score(X_test, y_test)  # R-squared score

# Print results
print(f"Coefficient (Slope): {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-squared Score: {r2:.4f}")

# Make predictions
y_pred = model.predict(X_test)

# --- Visualization: Regression Line ---
plt.figure(figsize=(8, 5))
sns.scatterplot(x=X_test['carat'], y=y_test, color="blue", alpha=0.5, label="Actual")
sns.lineplot(x=X_test['carat'], y=y_pred, color="red", label="Regression Line")
plt.xlabel("Carat")
plt.ylabel("Price")
plt.title("Simple Linear Regression: Carat vs. Price")
plt.legend()
plt.grid()
plt.show()




6 Write a Python script that analyzes the relationship between total bill and tip in the 'tips' dataset using simple linear regression and visualizes the results.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load dataset
df = sns.load_dataset("tips")

# Select feature (total_bill) and target variable (tip)
X = df[['total_bill']]  # Independent variable
y = df['tip']  # Dependent variable

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Simple Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Get model parameters
slope = model.coef_[0]  # Coefficient (slope)
intercept = model.intercept_  # Intercept
r2 = model.score(X_test, y_test)  # R-squared score

# Print results
print(f"Coefficient (Slope): {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-squared Score: {r2:.4f}")

# Make predictions
y_pred = model.predict(X_test)

# --- Visualization: Regression Line ---
plt.figure(figsize=(8, 5))
sns.scatterplot(x=X_test['total_bill'], y=y_test, color="blue", alpha=0.5, label="Actual Tips")
sns.lineplot(x=X_test['total_bill'], y=y_pred, color="red", label="Regression Line")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")
plt.title("Simple Linear Regression: Total Bill vs. Tip")
plt.legend()
plt.grid()
plt.show()




7 Write a Python script that fits a linear regression model to a synthetic dataset with one feature. Use the model to predict new values and plot the data points along with the regression line.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# --- 1. Generate Synthetic Dataset ---
np.random.seed(42)  # For reproducibility
X = 2 * np.random.rand(100, 1)  # 100 random values between 0 and 2
y = 4 + 3 * X + np.random.randn(100, 1)  # Linear relation (y = 4 + 3X) with noise

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Train Simple Linear Regression Model ---
model = LinearRegression()
model.fit(X_train, y_train)

# Get model parameters
slope = model.coef_[0][0]  # Coefficient (slope)
intercept = model.intercept_[0]  # Intercept
r2 = model.score(X_test, y_test)  # R² Score

# Print results
print(f"Coefficient (Slope): {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-squared Score: {r2:.4f}")

# --- 3. Make Predictions ---
X_new = np.array([[0], [2]])  # Predict for X = 0 and X = 2
y_pred = model.predict(X_new)

# --- 4. Visualization ---
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data Points")  # Scatter plot of actual data
plt.plot(X_new, y_pred, color="red", linewidth=2, label="Regression Line")  # Regression line
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Simple Linear Regression on Synthetic Data")
plt.legend()
plt.grid()
plt.show()




8 Write a Python script that pickles a trained linear regression model and saves it to a file.

import numpy as np
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Train a Linear Regression Model ---
model = LinearRegression()
model.fit(X_train, y_train)

# --- 3. Pickle the Model ---
filename = "model.pkl"
with open(filename, "wb") as file:
    pickle.dump(model, file)

print(f"Model saved to {filename}")

# --- 4. Load the Pickled Model ---
with open(filename, "rb") as file:
    loaded_model = pickle.load(file)

# --- 5. Make Predictions Using the Loaded Model ---
X_new = np.array([[0], [2]])  # Predict for X = 0 and X = 2
y_pred = loaded_model.predict(X_new)

# Print predictions
print(f"Predictions: {y_pred.flatten()}")
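
One common way to reduce the risk of loading a tampered pickle file is to sign the bytes and verify the signature before unpickling. The sketch below uses only the standard library; `SECRET_KEY`, `sign`, `safe_loads`, and the `model_state` dictionary are hypothetical names and values for illustration, and a real key would need to be stored securely.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key for illustration

def sign(payload: bytes) -> str:
    """HMAC-SHA256 signature of the pickled bytes."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

def safe_loads(payload: bytes, signature: str):
    """Unpickle only if the signature matches."""
    if not hmac.compare_digest(sign(payload), signature):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

# Stand-in for a trained model (made-up numbers)
model_state = {"coef": [3.0], "intercept": 4.0}
payload = pickle.dumps(model_state)
signature = sign(payload)

restored = safe_loads(payload, signature)
print("Restored:", restored)

# A modified payload is rejected before pickle ever sees it
try:
    safe_loads(payload + b"x", signature)
    tamper_detected = False
except ValueError:
    tamper_detected = True
print("Tampering detected:", tamper_detected)
```

This only proves the file was produced by someone holding the key; it does not make pickle itself safe for files from unknown sources.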




9 Write a Python script that fits a polynomial regression model (degree 2) to a dataset and plots the regression curve.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3  # Random values between -3 and 3
y = 2 + X - 0.5 * X**2 + np.random.randn(100, 1)  # Quadratic relationship with noise

# Sort X for better plotting
X_sorted = np.sort(X, axis=0)
y_sorted = y[np.argsort(X, axis=0).flatten()]

# --- 2. Train a Polynomial Regression Model (Degree 2) ---
degree = 2
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(X, y)

# --- 3. Make Predictions ---
y_pred = model.predict(X_sorted)

# --- 4. Visualization ---
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data Points")  # Original data
plt.plot(X_sorted, y_pred, color="red", linewidth=2, label="Polynomial Regression Curve")  # Regression curve
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title(f"Polynomial Regression (Degree {degree})")
plt.legend()
plt.grid()
plt.show()





10 Generate synthetic data for simple linear regression (use random values for X and y) and fit a linear regression model to the data. Print the model's coefficient and intercept.

import numpy as np
from sklearn.linear_model import LinearRegression

# --- 1. Generate Synthetic Data ---
np.random.seed(42)  # For reproducibility
X = 2 * np.random.rand(100, 1)  # 100 random values between 0 and 2
y = 4 + 3 * X + np.random.randn(100, 1)  # Linear relation (y = 4 + 3X) with some noise

# --- 2. Train a Simple Linear Regression Model ---
model = LinearRegression()
model.fit(X, y)

# --- 3. Print Model Parameters ---
print(f"Coefficient (Slope): {model.coef_[0][0]:.2f}")
print(f"Intercept: {model.intercept_[0]:.2f}")




11 Write a Python script that fits polynomial regression models of different degrees to a synthetic dataset and compares their performance.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3  # Random values between -3 and 3
y = 2 + X - 0.5 * X**2 + np.random.randn(100, 1)  # Quadratic relationship with noise

# Sort X for better plotting
X_sorted = np.sort(X, axis=0)
y_sorted = y[np.argsort(X, axis=0).flatten()]

# --- 2. Fit Polynomial Regression Models and Compare ---
degrees = [1, 2, 3, 4, 5]  # Different polynomial degrees
plt.figure(figsize=(10, 6))

for degree in degrees:
    # Create a pipeline with PolynomialFeatures and LinearRegression
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)

    # Predict values
    y_pred = model.predict(X_sorted)

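    # Compute performance metrics (in-sample: measured on the same data used for fitting,
    # so a higher degree will never look worse here)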
    r2 = r2_score(y_sorted, y_pred)
    mse = mean_squared_error(y_sorted, y_pred)
    mae = mean_absolute_error(y_sorted, y_pred)

    # Print performance metrics
    print(f"Degree {degree}: R²={r2:.3f}, MSE={mse:.3f}, MAE={mae:.3f}")

    # Plot regression curve
    plt.plot(X_sorted, y_pred, label=f"Degree {degree} (R²={r2:.2f})")

# --- 3. Visualization ---
plt.scatter(X, y, color="blue", alpha=0.5, label="Data Points")  # Original data
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Polynomial Regression Comparison")
plt.legend()
plt.grid()
plt.show()




12 Write a Python script that fits a linear regression model with two features and prints the model's coefficients, intercept, and R-squared score.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X1 = 2 * np.random.rand(100, 1)  # Feature 1
X2 = 3 * np.random.rand(100, 1)  # Feature 2
y = 5 + 2 * X1 + 3 * X2 + np.random.randn(100, 1)  # Linear relation with noise

# Combine features into a single matrix
X = np.hstack((X1, X2))

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Train a Linear Regression Model ---
model = LinearRegression()
model.fit(X_train, y_train)

# --- 3. Print Model Parameters ---
print(f"Coefficients (Slopes): {model.coef_.flatten()}")
print(f"Intercept: {model.intercept_[0]:.2f}")
print(f"R-squared Score: {model.score(X_test, y_test):.4f}")



13 Write a Python script that generates synthetic data, fits a linear regression model, and visualizes the regression line along with the data points.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X = 2 * np.random.rand(100, 1)  # Feature values (100 random numbers between 0 and 2)
y = 4 + 3 * X + np.random.randn(100, 1)  # Linear relationship with noise

# --- 2. Train a Linear Regression Model ---
model = LinearRegression()
model.fit(X, y)

# --- 3. Make Predictions ---
X_new = np.array([[0], [2]])  # Predict for X=0 and X=2
y_pred = model.predict(X_new)

# --- 4. Visualization ---
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data Points")  # Scatter plot of data
plt.plot(X_new, y_pred, color="red", linewidth=2, label="Regression Line")  # Regression line
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Simple Linear Regression")
plt.legend()
plt.grid()
plt.show()

# --- 5. Print Model Parameters ---
print(f"Coefficient (Slope): {model.coef_[0][0]:.2f}")
print(f"Intercept: {model.intercept_[0]:.2f}")



14 Write a Python script that uses the Variance Inflation Factor (VIF) to check for multicollinearity in a dataset with multiple features.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

# --- 1. Generate Synthetic Data with Multiple Features ---
np.random.seed(42)
X1 = 2 * np.random.rand(100, 1)  # Feature 1
X2 = 3 * X1 + np.random.randn(100, 1) * 0.1  # Feature 2 (highly correlated with X1)
X3 = 5 * np.random.rand(100, 1)  # Feature 3 (less correlated)
y = 4 + 2 * X1 + 3 * X2 + 1.5 * X3 + np.random.randn(100, 1)  # Target variable

# Create a DataFrame
df = pd.DataFrame(np.hstack([X1, X2, X3]), columns=["X1", "X2", "X3"])

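# --- 2. Compute VIF for Each Feature ---
# variance_inflation_factor does not add an intercept itself, so append a
# constant column; without it the VIFs of uncentered features are distorted.
X_vif = df.assign(const=1.0)
vif_data = pd.DataFrame()
vif_data["Feature"] = df.columns
vif_data["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(df.shape[1])]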

# --- 3. Print VIF Results ---
print("Variance Inflation Factor (VIF) for Each Feature:")
print(vif_data)
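
To see what a VIF measures, it helps to compute one by hand. For a model with exactly two predictors, the R-squared from regressing one predictor on the other equals their squared Pearson correlation, so VIF = 1 / (1 - r^2). The sketch below relies on that two-predictor special case; the helper names and the small data lists are made up for illustration.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def vif_two_predictors(a, b):
    """VIF = 1 / (1 - R^2); with a single other predictor, R^2 = r^2."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

x1 = [1, 2, 3, 4, 5, 6]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # roughly 2 * x1
x3_unrelated = [5, 1, 4, 2, 6, 3]                # almost uncorrelated with x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x3_unrelated)
print(f"VIF with a near-duplicate predictor: {vif_high:.1f}")
print(f"VIF with an unrelated predictor:     {vif_low:.2f}")
```

A VIF near 1 means the predictor shares almost no variance with the others; values above roughly 5 to 10 are a common rule-of-thumb warning sign for multicollinearity.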



15 Write a Python script that generates synthetic data for a polynomial relationship (degree 4), fits a polynomial regression model, and plots the regression curve.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3  # Random values between -3 and 3
y = 2 + 1.5 * X - 0.8 * X**2 + 0.5 * X**3 - 0.2 * X**4 + np.random.randn(100, 1) * 3  # Degree 4 polynomial with noise

# Sort X for better plotting
X_sorted = np.sort(X, axis=0)
y_sorted = y[np.argsort(X, axis=0).flatten()]

# --- 2. Fit a Polynomial Regression Model (Degree 4) ---
degree = 4
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(X, y)

# Predict values
y_pred = model.predict(X_sorted)

# --- 3. Visualization ---
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data Points")  # Scatter plot of data
plt.plot(X_sorted, y_pred, color="red", linewidth=2, label=f"Polynomial Regression (Degree {degree})")  # Regression curve
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Polynomial Regression (Degree 4)")
plt.legend()
plt.grid()
plt.show()




16  Write a Python script that creates a machine learning pipeline with data standardization and a multiple
linear regression model, and prints the R-squared score
-import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X1 = 2 * np.random.rand(100, 1)  # Feature 1
X2 = 3 * np.random.rand(100, 1)  # Feature 2
X3 = 4 * np.random.rand(100, 1)  # Feature 3
y = 5 + 2 * X1 + 3 * X2 + 1.5 * X3 + np.random.randn(100, 1)  # Linear relationship with noise

# Combine features into a single matrix
X = np.hstack([X1, X2, X3])

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Create a Machine Learning Pipeline ---
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# Train the model
pipeline.fit(X_train, y_train)

# Predict on test data
y_pred = pipeline.predict(X_test)

# --- 3. Print Model Performance ---
r2 = r2_score(y_test, y_pred)
print(f"R-squared Score: {r2:.4f}")
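
For ordinary least squares with an intercept, standardizing the features rescales the coefficients but leaves the predictions, and therefore R-squared, unchanged; the scaling step starts to matter once a penalty such as Ridge or Lasso is added. The following NumPy-only sketch illustrates this under assumed synthetic data; the helpers `ols_predict` and `r2` are hypothetical names introduced for the example.

```python
import numpy as np

def ols_predict(X_train, y_train, X_test):
    """Fit ordinary least squares with an intercept and predict on X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
X = rng.random((100, 3)) * np.array([2.0, 3.0, 4.0])
y = 5 + X @ np.array([2.0, 3.0, 1.5]) + rng.normal(size=100)
X_train, X_test, y_train, y_test = X[:80], X[80:], y[:80], y[80:]

# Standardize with training-set statistics only, as a fitted scaler in a pipeline would
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
r2_raw = r2(y_test, ols_predict(X_train, y_train, X_test))
r2_std = r2(y_test, ols_predict((X_train - mu) / sigma, y_train, (X_test - mu) / sigma))
print(f"raw: {r2_raw:.6f}  standardized: {r2_std:.6f}")
```

The two printed scores should agree up to floating-point error.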




17 Write a Python script that performs polynomial regression (degree 3) on a synthetic dataset and plots the
regression curve
-import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3  # Random values between -3 and 3
y = 2 + 1.5 * X - 0.8 * X**2 + 0.5 * X**3 + np.random.randn(100, 1) * 3  # Cubic relationship with noise

# Sort X so the fitted curve is drawn from left to right
X_sorted = np.sort(X, axis=0)

# --- 2. Fit a Polynomial Regression Model (Degree 3) ---
degree = 3
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(X, y)

# Predict values
y_pred = model.predict(X_sorted)

# --- 3. Visualization ---
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data Points")  # Scatter plot of data
plt.plot(X_sorted, y_pred, color="red", linewidth=2, label=f"Polynomial Regression (Degree {degree})")  # Regression curve
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Polynomial Regression (Degree 3)")
plt.legend()
plt.grid()
plt.show()




18 Write a Python script that performs multiple linear regression on a synthetic dataset with 5 features. Print
the R-squared score and model coefficients
-import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
n_samples = 100  # Number of samples
n_features = 5   # Number of features

X = np.random.rand(n_samples, n_features) * 10  # Features scaled between 0 and 10
true_coeffs = np.array([2.5, -1.2, 3.8, 0.5, -2.0])  # True coefficients
y = 5 + X.dot(true_coeffs) + np.random.randn(n_samples) * 2  # Linear relationship with noise

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Train a Multiple Linear Regression Model ---
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# --- 3. Print Model Performance ---
r2 = r2_score(y_test, y_pred)
print(f"R-squared Score: {r2:.4f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Coefficients: {model.coef_}")
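
Because the data is synthetic, the estimated coefficients can be compared against the true ones that generated it. The sketch below does this with a plain least-squares solve in NumPy instead of `LinearRegression`; the data generation mirrors the script above but uses a different random generator, so the numbers will not match it exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 100, 5
X = rng.random((n_samples, n_features)) * 10
true_coeffs = np.array([2.5, -1.2, 3.8, 0.5, -2.0])
y = 5 + X @ true_coeffs + rng.normal(scale=2, size=n_samples)

# Prepend a column of ones so the first estimated parameter is the intercept
design = np.column_stack([np.ones(n_samples), X])
params, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coeffs = params[0], params[1:]

print(f"intercept: estimated {intercept:+.3f}, true +5.000")
for i, (est, true) in enumerate(zip(coeffs, true_coeffs), start=1):
    print(f"x{i}: estimated {est:+.3f}, true {true:+.3f}")
```

With 100 samples and this noise level, the slope estimates should land close to the true values, while the intercept is typically estimated less precisely.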



19 Write a Python script that generates synthetic data for linear regression, fits a model, and visualizes the
data points along with the regression line.
-import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X = 2 * np.random.rand(100, 1)  # Random feature values between 0 and 2
y = 4 + 3 * X + np.random.randn(100, 1)  # Linear relationship with noise

# --- 2. Train a Linear Regression Model ---
model = LinearRegression()
model.fit(X, y)

# --- 3. Make Predictions ---
X_new = np.array([[0], [2]])  # Predict for X=0 and X=2
y_pred = model.predict(X_new)

# --- 4. Visualization ---
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data Points")  # Scatter plot of data
plt.plot(X_new, y_pred, color="red", linewidth=2, label="Regression Line")  # Regression line
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Simple Linear Regression")
plt.legend()
plt.grid()
plt.show()

# --- 5. Print Model Parameters ---
print(f"Coefficient (Slope): {model.coef_[0][0]:.2f}")
print(f"Intercept: {model.intercept_[0]:.2f}")
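
For a single feature, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below checks that formula against `np.polyfit` on assumed synthetic data similar to the script above.

```python
import numpy as np

rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(size=100)

# Closed-form least-squares estimates for a single feature
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit with deg=1 solves the same least-squares problem
# and returns the coefficients from highest degree to lowest
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"closed form: slope={slope:.4f}, intercept={intercept:.4f}")
print(f"np.polyfit : slope={slope_np:.4f}, intercept={intercept_np:.4f}")
```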




20 Create a synthetic dataset with 3 features and perform multiple linear regression. Print the model's R-squared score and coefficients
-import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
n_samples = 100  # Number of samples

X1 = 2 * np.random.rand(n_samples, 1)  # Feature 1
X2 = 3 * np.random.rand(n_samples, 1)  # Feature 2
X3 = 4 * np.random.rand(n_samples, 1)  # Feature 3

# Create the target variable (linear combination of features + noise)
true_coeffs = np.array([2.5, -1.2, 3.8])  # True coefficients
y = 5 + (X1 * true_coeffs[0] + X2 * true_coeffs[1] + X3 * true_coeffs[2]).flatten() + np.random.randn(n_samples) * 2

# Combine features into a single matrix
X = np.hstack([X1, X2, X3])

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Train a Multiple Linear Regression Model ---
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# --- 3. Print Model Performance ---
r2 = r2_score(y_test, y_pred)
print(f"R-squared Score: {r2:.4f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Coefficients: {model.coef_}")





21 Write a Python script that demonstrates how to serialize and deserialize machine learning models using
joblib instead of pickling
-import numpy as np
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
X = 2 * np.random.rand(100, 1)  # Random feature values
y = 4 + 3 * X + np.random.randn(100, 1)  # Linear relationship with noise

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Train a Linear Regression Model ---
model = LinearRegression()
model.fit(X_train, y_train)

# --- 3. Serialize (Save) the Model Using joblib ---
# The .joblib extension makes it clear which library wrote the file
joblib.dump(model, "linear_regression_model.joblib")
print("Model saved successfully!")

# --- 4. Deserialize (Load) the Model ---
loaded_model = joblib.load("linear_regression_model.joblib")
print("Model loaded successfully!")

# --- 5. Make Predictions Using the Loaded Model ---
sample_input = np.array([[1.5]])  # Example input
predicted_value = loaded_model.predict(sample_input)
print(f"Prediction for input {sample_input.flatten()[0]}: {predicted_value.flatten()[0]:.2f}")




22 Write a Python script to perform linear regression with categorical features using one-hot encoding. Use
the Seaborn 'tips' dataset.
-import seaborn as sns
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# --- 1. Load the 'tips' Dataset ---
tips = sns.load_dataset("tips")

# --- 2. Select Features and Target ---
categorical_features = ["sex", "smoker", "day", "time"]  # Categorical columns
numerical_features = ["total_bill", "size"]  # Numerical columns
target = "tip"

# Separate features and target variable
X = tips[numerical_features + categorical_features]
y = tips[target]

# --- 3. Perform One-Hot Encoding for Categorical Features ---
X = pd.get_dummies(X, columns=categorical_features, drop_first=True)  # Avoid dummy variable trap

# --- 4. Split Data into Training and Testing Sets ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 5. Train a Multiple Linear Regression Model ---
model = LinearRegression()
model.fit(X_train, y_train)

# --- 6. Make Predictions ---
y_pred = model.predict(X_test)

# --- 7. Print Model Performance ---
r2 = model.score(X_test, y_test)  # R-squared score
print(f"R-squared Score: {r2:.4f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Coefficients: {dict(zip(X.columns, model.coef_))}")
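
The `drop_first=True` argument above avoids the dummy variable trap: when every category gets its own column, the dummy columns sum to the intercept column, so the design matrix loses full column rank. The sketch below shows this with a small hand-made array of day labels (illustrative values, not the actual 'tips' data) and NumPy's rank computation.

```python
import numpy as np

# Small illustrative sample of a categorical column (not the real dataset)
days = np.array(["Thur", "Fri", "Sat", "Sun", "Sat", "Sun", "Thur", "Fri"])
categories, codes = np.unique(days, return_inverse=True)
one_hot = np.eye(len(categories))[codes]  # one column per category
intercept = np.ones((len(days), 1))

full = np.hstack([intercept, one_hot])            # intercept + all dummies
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + dummies minus the first

print(full.shape[1], np.linalg.matrix_rank(full))        # 5 columns, rank 4
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 4 columns, rank 4
```

Dropping one category restores full column rank; the dropped category becomes the baseline against which the remaining dummy coefficients are interpreted.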





23 Compare Ridge Regression with Linear Regression on a synthetic dataset and print the coefficients and R-squared score
-import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# --- 1. Generate Synthetic Data ---
np.random.seed(42)
n_samples = 100  # Number of samples
n_features = 5   # Number of features

X = np.random.rand(n_samples, n_features) * 10  # Features scaled between 0 and 10
true_coeffs = np.array([3, -2, 1.5, 0, -1])  # True coefficients
y = 5 + X.dot(true_coeffs) + np.random.randn(n_samples) * 5  # Linear relationship with noise

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Train Linear Regression ---
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred_lin = lin_reg.predict(X_test)
r2_lin = r2_score(y_test, y_pred_lin)

# --- 3. Train Ridge Regression (L2 Regularization) ---
ridge_reg = Ridge(alpha=1.0)  # Regularization strength
ridge_reg.fit(X_train, y_train)
y_pred_ridge = ridge_reg.predict(X_test)
r2_ridge = r2_score(y_test, y_pred_ridge)

# --- 4. Print Model Performance ---
print("Linear Regression Results:")
print(f"R-squared Score: {r2_lin:.4f}")
print(f"Coefficients: {lin_reg.coef_}\n")

print("Ridge Regression Results:")
print(f"R-squared Score: {r2_ridge:.4f}")
print(f"Coefficients: {ridge_reg.coef_}")
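
With `alpha=1.0` and features on this scale, the Ridge coefficients are likely to be almost identical to the ordinary least-squares ones, so the effect of the penalty is easier to see across a range of alpha values. The sketch below uses the closed-form ridge solution on centered data in NumPy rather than scikit-learn's solver; `ridge_coefficients` is a hypothetical helper and the alpha grid is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((100, 5)) * 10
y = 5 + X @ np.array([3.0, -2.0, 1.5, 0.0, -1.0]) + rng.normal(scale=5, size=100)

# Center X and y so that the intercept is left unpenalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(alpha):
    """Closed-form ridge solution (Xc^T Xc + alpha * I)^-1 Xc^T yc (hypothetical helper)."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

alphas = [0.0, 1.0, 100.0, 10_000.0]  # alpha = 0 reduces to ordinary least squares
norms = [np.linalg.norm(ridge_coefficients(a)) for a in alphas]
for a, norm in zip(alphas, norms):
    print(f"alpha={a:>8}: ||coef|| = {norm:.4f}")
```

The coefficient norm shrinks as alpha grows, which is the L2 regularization effect the comparison is meant to expose.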




24 Write a Python script that uses cross-validation to evaluate a Linear Regression model on a synthetic
dataset.
-from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# --- 1. Generate Synthetic Data ---
# random_state makes the generated dataset reproducible
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)  # 200 samples, 5 features, noise added

# --- 2. Initialize Linear Regression Model ---
model = LinearRegression()

# --- 3. Perform Cross-Validation (Using 5 Folds) ---
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')  # R-squared as evaluation metric

# --- 4. Print Model Performance ---
print(f"Cross-Validation R-squared Scores: {cv_scores}")
print(f"Mean R-squared Score: {cv_scores.mean():.4f}")
print(f"Standard Deviation: {cv_scores.std():.4f}")
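
To make explicit what each cross-validation score measures, the sketch below performs k-fold evaluation by hand in NumPy: each fold is held out once, the model is fit on the remaining folds, and R-squared is computed on the held-out fold. The helper `kfold_r2` and the synthetic data are assumptions for illustration; unlike `cross_val_score` with `cv=5`, which splits a regressor's data into consecutive folds without shuffling, this sketch shuffles the indices first.

```python
import numpy as np

def kfold_r2(X, y, n_splits=5, seed=42):
    """Return one held-out R-squared score per fold (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residuals = y[test_idx] - A_test @ beta
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - residuals @ residuals / ss_tot)
    return np.array(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([10.0, -4.0, 7.0, 0.0, 3.0]) + rng.normal(scale=10, size=200)
scores = kfold_r2(X, y)
print("Fold scores:", scores.round(4))
print(f"Mean: {scores.mean():.4f}, Std: {scores.std():.4f}")
```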




25 Write a Python script that compares polynomial regression models of different degrees and prints the R-squared score for each.
-import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

# --- 1. Generate Synthetic Data (Polynomial Relationship) ---
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3  # Random values between -3 and 3
y = 2 + 1.5 * X - 0.8 * X**2 + 0.3 * X**3 + np.random.randn(100, 1) * 2  # Cubic relationship with noise

# --- 2. Compare Polynomial Regression Models ---
degrees = [1, 2, 3, 4, 5]  # Degrees of polynomial to test
r2_scores = []

plt.figure(figsize=(10, 5))
plt.scatter(X, y, color="gray", alpha=0.5, label="Data Points")  # Plot raw data

X_plot = np.linspace(-3, 3, 100).reshape(-1, 1)  # Evenly spaced grid for plotting smooth curves

for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())  # Pipeline with PolynomialFeatures
    model.fit(X, y)

    # R-squared is computed on the training data, so it never decreases as the degree grows
    y_pred = model.predict(X)
    r2 = r2_score(y, y_pred)
    r2_scores.append((degree, r2))

    # Plot regression curve
    plt.plot(X_plot, model.predict(X_plot), label=f"Degree {degree}")

# --- 3. Print R-squared Scores ---
print("Polynomial Regression Performance:")
for degree, score in r2_scores:
    print(f"Degree {degree}: R-squared Score = {score:.4f}")

# --- 4. Final Plot Formatting ---
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Polynomial Regression Comparison")
plt.legend()
plt.grid()
plt.show()
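
Training R-squared alone cannot identify the best degree, because a higher-degree polynomial always fits the training points at least as well as a lower-degree one. A common remedy is to score each degree on held-out data as well. The sketch below does this with `np.polyfit` on assumed synthetic data from a cubic relationship; the 150/50 split and the helper `r2` are illustrative choices, not part of the script above.

```python
import numpy as np

rng = np.random.default_rng(42)
x = 6 * rng.random(200) - 3
y = 2 + 1.5 * x - 0.8 * x**2 + 0.3 * x**3 + rng.normal(scale=2, size=200)
x_train, x_test, y_train, y_test = x[:150], x[150:], y[:150], y[150:]

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

train_scores, test_scores = [], []
for degree in range(1, 6):
    coefs = np.polyfit(x_train, y_train, deg=degree)  # least-squares polynomial fit
    train_scores.append(r2(y_train, np.polyval(coefs, x_train)))
    test_scores.append(r2(y_test, np.polyval(coefs, x_test)))
    print(f"Degree {degree}: train R2 = {train_scores[-1]:.4f}, test R2 = {test_scores[-1]:.4f}")
```

The training score is non-decreasing in the degree, whereas the held-out score tends to level off or drop once the extra terms mostly fit noise.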


