# Theory Questions


### 1. What does R-squared represent in a regression model?
R-squared represents the proportion of variance in the dependent variable that is predictable from the independent variables.
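
As a minimal sketch of that definition, R-squared can be computed by hand as one minus the ratio of residual to total sum of squares. The helper name `r_squared` and the sample values are illustrative assumptions, not part of any library.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
print(r_squared(y_true, y_pred))
```

For a fitted scikit-learn regressor, `model.score(X, y)` reports the same quantity.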

---

### 2. What are the assumptions of linear regression?
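- Linearity (the relationship between the predictors and the target is linear)
- Independence of errors
- Homoscedasticity (constant variance of errors)
- Normality of residuals
- No (or little) multicollinearity among predictors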

---

### 3. What is the difference between R-squared and Adjusted R-squared?
- **R-squared** never decreases when a predictor is added, even if that predictor is irrelevant.
- **Adjusted R-squared** adjusts for the number of predictors, so it can fall when an added predictor contributes little.

---

### 4. Why do we use Mean Squared Error (MSE)?
MSE measures the average squared difference between actual and predicted values, helping evaluate model accuracy.

---

### 5. What does an Adjusted R-squared value of 0.85 indicate?
It indicates that the model explains roughly 85% of the variance in the dependent variable after penalizing for the number of predictors.

---

### 6. How do we check for normality of residuals in linear regression?
- Histogram or Q-Q plot of residuals
- Shapiro-Wilk test
- Kolmogorov-Smirnov test
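
The Shapiro-Wilk check from the list above can be sketched with `scipy.stats.shapiro`. The residuals here are simulated (one normal sample, one right-skewed sample) purely for illustration; in practice they would come from a fitted model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(0, 1, 200)       # simulated well-behaved residuals
skewed_resid = rng.exponential(1.0, 200)   # simulated right-skewed residuals

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    stat, p_value = stats.shapiro(resid)
    # A small p-value is evidence against normality
    print(f"{name}: W={stat:.3f}, p={p_value:.4f}")
```

A formal test is best read alongside a Q-Q plot, since with large samples even minor departures from normality produce small p-values.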

---

### 7. What is multicollinearity, and how does it impact regression?
Multicollinearity occurs when independent variables are highly correlated, making coefficient estimates unstable and less interpretable.

---

### 8. What is Mean Absolute Error (MAE)?
MAE is the average absolute difference between actual and predicted values. It is less sensitive to outliers than MSE.

---

### 9. What are the benefits of using an ML pipeline?
- Streamlined preprocessing and modeling
- Reduces code duplication
- Ensures reproducibility
- Facilitates parameter tuning and deployment

---

### 10. Why is RMSE considered more interpretable than MSE?
RMSE is in the same unit as the target variable, unlike MSE which is in squared units.

---

### 11. What is pickling in Python, and how is it useful in ML?
Pickling serializes Python objects to a byte stream, useful for saving trained models and reloading them without retraining.

---

### 12. What does a high R-squared value mean?
It indicates that the model explains a high percentage of the variance in the dependent variable on the data it was fitted to. It does not by itself guarantee good predictions on new data.

---

### 13. What happens if linear regression assumptions are violated?
It can lead to biased or inefficient estimates, unreliable hypothesis tests, and poor predictive performance.

---

### 14. How can we address multicollinearity in regression?
- Remove highly correlated predictors
- Use dimensionality reduction (e.g., PCA)
- Apply regularization methods like Ridge or Lasso
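
To illustrate the regularization option, the sketch below builds two nearly identical predictors and compares ordinary least squares with a closed-form ridge fit. The data, the helper `fit`, and the penalty `alpha=1.0` are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=n)

# Center the data so no intercept column is needed
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def fit(Xc, yc, alpha):
    """Solve (X'X + alpha*I) beta = X'y; alpha=0 gives ordinary least squares."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

beta_ols = fit(Xc, yc, alpha=0.0)
beta_ridge = fit(Xc, yc, alpha=1.0)
print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

The OLS coefficients can swing widely between the two near-duplicate columns, while the ridge penalty pulls them toward a more even split with a similar total effect.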

---

### 15. How can feature selection improve model performance in regression analysis?
Feature selection removes irrelevant or redundant features, reducing overfitting and improving model interpretability and performance.

---

### 16. How is Adjusted R-squared calculated?
Adjusted R² = 1 - [(1 - R²) * (n - 1)/(n - k - 1)], where `n` is number of observations and `k` is number of predictors.
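
The formula translates directly into a small helper. The function name `adjusted_r2` and the example inputs are illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (requires n > k + 1)."""
    if n <= k + 1:
        raise ValueError("need n > k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R-squared, more predictors: the adjusted value drops
print(adjusted_r2(0.85, n=100, k=5))
print(adjusted_r2(0.85, n=100, k=30))
```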

---

### 17. Why is MSE sensitive to outliers?
Because it squares the error term, outliers have a disproportionately large impact on MSE.
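
A small sketch with made-up numbers shows the effect: one badly wrong prediction moves MSE far more than MAE. The helpers `mse` and `mae` are written out by hand here rather than taken from a library.

```python
def mse(y_true, y_pred):
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10, 12, 11, 13, 12]
clean = [11, 11, 12, 12, 13]         # every prediction is off by 1
with_outlier = [11, 11, 12, 12, 32]  # last prediction is off by 20

print("clean:        MSE", mse(y_true, clean), "MAE", mae(y_true, clean))
print("with outlier: MSE", mse(y_true, with_outlier), "MAE", mae(y_true, with_outlier))
```

The single large error raises MAE from 1 to 4.8 but raises MSE from 1 to 80.8, because that error contributes 400 to the squared sum.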

---

### 18. What is the role of homoscedasticity in linear regression?
Homoscedasticity (equal error variance) makes ordinary least squares efficient and keeps the usual standard errors, and therefore hypothesis tests and confidence intervals, valid.

---

### 19. What is Root Mean Squared Error (RMSE)?
RMSE is the square root of MSE. It summarizes the typical magnitude of prediction errors in the units of the target, weighting large errors more heavily than MAE does.

---

### 20. Why is pickling considered risky?
Pickling can execute arbitrary code during loading, leading to security vulnerabilities if loading untrusted data.
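
The mechanism can be seen in a harmless sketch: an object's `__reduce__` method tells pickle which callable to invoke on load. The class `Payload` is a hypothetical stand-in that calls `print`; a hostile file could name any importable callable instead.

```python
import pickle

class Payload:
    """Stand-in for a hostile object; __reduce__ tells pickle what to call on load."""
    def __reduce__(self):
        # Harmless here: the unpickler will call print(...) with this argument
        return (print, ("this ran during pickle.loads",))

data = pickle.dumps(Payload())
result = pickle.loads(data)   # calls print(...) instead of rebuilding a Payload
print(type(result))           # the loaded "object" is just print's return value
```

For this reason, only unpickle files whose origin is trusted.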

---

### 21. What alternatives exist to pickling for saving ML models?
- Joblib (optimized for NumPy arrays)
- ONNX (Open Neural Network Exchange)
- JSON (for lightweight, secure storage of plain parameters such as coefficients)
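
As a sketch of the JSON option, the parameters of a linear model can be stored as plain numbers and used to rebuild predictions by hand. The parameter values and the helper `predict` are hypothetical; this approach only suits models whose state is a handful of numbers.

```python
import json

# Hypothetical fitted parameters of a linear model
params = {"coef": [2.5, -1.0], "intercept": 0.5}

text = json.dumps(params)     # plain text that can be inspected before use
restored = json.loads(text)   # parsing JSON never executes code

def predict(row, model_params):
    """Apply intercept + sum(coef_i * x_i) to one row of features."""
    return model_params["intercept"] + sum(
        c * x for c, x in zip(model_params["coef"], row)
    )

print(predict([2.0, 1.0], restored))
```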

---

### 22. What is heteroscedasticity, and why is it a problem?
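Heteroscedasticity is non-constant variance of the errors across the range of the predictors. It violates a regression assumption: the coefficient estimates remain unbiased but become inefficient, and the usual standard errors are biased, which makes hypothesis tests unreliable.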
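
A rough way to see the pattern is to simulate data whose noise grows with the predictor and compare the residual spread at small and large values. This split-sample comparison is a crude illustrative check on simulated data, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1, 10, 500))
y = 2 * x + rng.normal(0, 0.5 * x)   # noise standard deviation grows with x

# Fit a straight line and compute its residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# x is sorted, so the first half holds the small-x observations
low, high = residuals[:250], residuals[250:]
print("Residual std, small x:", low.std())
print("Residual std, large x:", high.std())
```

On a residuals-versus-predicted plot, the same data would show a funnel shape that widens to the right.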

---

### 23. How can interaction terms enhance a regression model's predictive power?
Interaction terms capture combined effects of features, improving model flexibility and accuracy.
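
The sketch below simulates a target that depends on the product of two features and compares the fit with and without the interaction column. The data-generating formula and the helper `r2_of_fit` are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + x1 + x2 + 2 * x1 * x2 + rng.normal(scale=0.5, size=n)

def r2_of_fit(columns, y):
    """Least-squares fit with an intercept; returns the in-sample R-squared."""
    A = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

r2_main = r2_of_fit([x1, x2], y)            # main effects only
r2_inter = r2_of_fit([x1, x2, x1 * x2], y)  # main effects plus interaction
print("R^2 without interaction:", r2_main)
print("R^2 with interaction:   ", r2_inter)
```

The gain is large here only because the simulated target really contains the product term. On real data the improvement should be checked on held-out data.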



# Practical Tasks


## Task 1: Visualize Residuals for Multiple Linear Regression using Seaborn's 'diamonds' Dataset

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np

diamonds = sns.load_dataset('diamonds')
diamonds = diamonds.select_dtypes(include=[np.number]).dropna()

X = diamonds.drop(columns=['price'])
y = diamonds['price']

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
residuals = y - predictions

plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()


## Task 2: Calculate MSE, MAE, and RMSE

In [None]:

from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

mse = mean_squared_error(y, predictions)
mae = mean_absolute_error(y, predictions)
rmse = np.sqrt(mse)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")


## Task 3: Check Assumptions of Linear Regression

In [None]:

import pandas as pd

# Linearity
plt.figure(figsize=(6, 4))
plt.scatter(predictions, y)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Linearity Check')
plt.show()

# Homoscedasticity
plt.figure(figsize=(6, 4))
plt.scatter(predictions, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals Plot')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.show()

# Multicollinearity
corr_matrix = X.drop(columns='const').corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


## Task 4: Write a Python script that creates a machine learning pipeline with feature scaling and evaluates the performance of different regression models

In [None]:
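# Build one scaling + regression pipeline per model and compare test-set R^2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split

# Drop the statsmodels intercept column; scikit-learn fits its own intercept
features = X.drop(columns='const')
X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.2, random_state=42
)

for name, estimator in [('Linear', LinearRegression()), ('Ridge', Ridge(alpha=1.0))]:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', estimator)
    ])
    pipeline.fit(X_train, y_train)
    print(f"{name} R^2:", pipeline.score(X_test, y_test))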


## Task 5 :  Implement a simple linear regression model on a dataset and print the model's coefficients, intercept, and R-squared score.

In [None]:
# Code for Task 5 goes here
model = LinearRegression()
model.fit(X[['carat']], y)
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
print("R^2 Score:", model.score(X[['carat']], y))


## Task 6 : Write a Python script that analyzes the relationship between total bill and tip in the 'tips' dataset using simple linear regression and visualizes the results.

In [None]:
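# Fit tip ~ total_bill on the tips data, report the fitted line, and plot it
df = sns.load_dataset("tips")
X = df[['total_bill']]
y = df['tip']
model = LinearRegression().fit(X, y)
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
print("R^2:", model.score(X, y))
sns.regplot(x='total_bill', y='tip', data=df)
plt.title("Tip vs Total Bill")
plt.show()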


## Task 7 :  Write a Python script that fits a linear regression model to a synthetic dataset with one feature. Use the model to predict new values and plot the data points along with the regression line.

In [None]:
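# Fit a line to one-feature synthetic data, predict new inputs, and plot the fit
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=10)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Predict at inputs that were not part of the training data
X_new = np.array([[-1.0], [0.0], [1.0]])
print("Predictions for new values:", model.predict(X_new))

plt.scatter(X, y)
plt.plot(X, y_pred, color='red')
plt.show()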


## Task 8 : Write a Python script that pickles a trained linear regression model and saves it to a file.

In [None]:
# Code for Task 8 goes here
import pickle

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)


## Task 9 : Write a Python script that fits a polynomial regression model (degree 2) to a dataset and plots the regression curve.

In [None]:
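# Fit a degree-2 polynomial of carat to price on the diamonds data from Task 1
from sklearn.preprocessing import PolynomialFeatures

# X and y were overwritten by earlier tasks, so index the diamonds frame directly
X_carat = diamonds[['carat']].sort_values('carat')  # sorted so the curve draws left to right
y_price = diamonds.loc[X_carat.index, 'price']

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_carat)
model = LinearRegression().fit(X_poly, y_price)
y_pred = model.predict(X_poly)

plt.scatter(X_carat['carat'], y_price, s=5, alpha=0.3)
plt.plot(X_carat['carat'], y_pred, color='red')
plt.show()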


## Task 10 : Generate synthetic data for simple linear regression (use random values for X and y) and fit a linear regression model to the data. Print the model's coefficient and intercept.

In [None]:
# Code for Task 10 goes here
X, y = make_regression(n_samples=100, n_features=1, noise=15)
model = LinearRegression().fit(X, y)
print("Coefficient:", model.coef_, "Intercept:", model.intercept_)


## Task 11 :  Write a Python script that fits polynomial regression models of different degrees to a synthetic dataset and compares their performance.

In [None]:
# Code for Task 11 goes here
for d in [1, 2, 3, 4]:
    poly = PolynomialFeatures(degree=d)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    print(f"Degree {d}, R^2 Score: {model.score(X_poly, y)}")


## Task 12 : Write a Python script that fits a simple linear regression model with two features and prints the model's coefficients, intercept, and R-squared score.

In [None]:
# X and y now hold synthetic arrays, so take the two features from the diamonds frame
X_two = diamonds[['carat', 'depth']]
y_price = diamonds['price']
model = LinearRegression().fit(X_two, y_price)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("R^2:", model.score(X_two, y_price))


## Task 13 : Write a Python script that generates synthetic data, fits a linear regression model, and visualizes the regression line along with the data points.


In [None]:
# Code for Task 13 goes here
X, y = make_regression(n_samples=100, n_features=1, noise=20)
model = LinearRegression().fit(X, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X), color='green')
plt.show()


## Task 14 : Write a Python script that uses the Variance Inflation Factor (VIF) to check for multicollinearity in a dataset with multiple features.

In [None]:
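# Compute the VIF of each predictor in the diamonds data from Task 1
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df currently refers to the tips data, so select the columns from diamonds
X = diamonds[["carat", "depth", "table"]]
X = sm.add_constant(X)  # VIF needs an intercept column; ignore the value reported for 'const'
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)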


## Task 15 : Write a Python script that generates synthetic data for a polynomial relationship (degree 4), fits a polynomial regression model, and plots the regression curve.



In [None]:
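# Generate data from a degree-4 polynomial plus noise, then fit a degree-4 model
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 100))  # sorted so the curve draws left to right
y_quartic = x**4 - 2 * x**3 - 5 * x**2 + 3 * x + rng.normal(0, 5, 100)

poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(x.reshape(-1, 1))
model = LinearRegression().fit(X_poly, y_quartic)
plt.scatter(x, y_quartic)
plt.plot(x, model.predict(X_poly), color='orange')
plt.show()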


## Task 16 : Write a Python script that creates a machine learning pipeline with data standardization and a multiple linear regression model, and prints the R-squared score.



In [None]:
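# Standardize the diamonds features, fit a multiple linear regression, and report R^2
X_multi = diamonds.drop(columns=['price'])
y_price = diamonds['price']
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('reg', LinearRegression())
])
pipeline.fit(X_multi, y_price)
print("R^2 Score:", pipeline.score(X_multi, y_price))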


## Task 17 : Write a Python script that performs polynomial regression (degree 3) on a synthetic dataset and plots the regression curve.

In [None]:
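# Generate noisy cubic data, fit a degree-3 polynomial, and plot the curve
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 100))  # sorted so the curve draws left to right
y_cubic = 0.5 * x**3 - x**2 + 2 * x + rng.normal(0, 3, 100)

poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(x.reshape(-1, 1))
model = LinearRegression().fit(X_poly, y_cubic)
plt.scatter(x, y_cubic)
plt.plot(x, model.predict(X_poly), color='purple')
plt.show()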


## Task 18 :  Write a Python script that performs multiple linear regression on a synthetic dataset with 5 features. Print the R-squared score and model coefficients.

In [None]:
# Code for Task 18 goes here
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, noise=10)
model = LinearRegression().fit(X, y)
print("R^2:", model.score(X, y))
print("Coefficients:", model.coef_)


## Task 19 : Write a Python script that generates synthetic data for linear regression, fits a model, and visualizes the data points along with the regression line.

In [None]:
# Code for Task 19 goes here
X, y = make_regression(n_samples=100, n_features=1, noise=10)
model = LinearRegression().fit(X, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X), color='black')
plt.show()


## Task 20 : Create a synthetic dataset with 3 features and perform multiple linear regression. Print the model's R-squared score and coefficients.

In [None]:
# Code for Task 20 goes here
X, y = make_regression(n_samples=100, n_features=3, noise=5)
model = LinearRegression().fit(X, y)
print("R^2:", model.score(X, y))
print("Coefficients:", model.coef_)


## Task 21 : Write a Python script that demonstrates how to serialize and deserialize machine learning models using joblib instead of pickling.

In [None]:
# Code for Task 21 goes here
import joblib

joblib.dump(model, 'model.joblib')
loaded_model = joblib.load('model.joblib')


## Task 22 : Write a Python script to perform linear regression with categorical features using one-hot encoding. Use the Seaborn 'tips' dataset.

In [None]:
# Code for Task 22 goes here
df = sns.load_dataset("tips")
df = pd.get_dummies(df, columns=["sex", "smoker", "day", "time"], drop_first=True)
X = df.drop("tip", axis=1)
y = df["tip"]
model = LinearRegression().fit(X, y)
print("R^2:", model.score(X, y))


## Task 23 : Compare Ridge Regression with Linear Regression on a synthetic dataset and print the coefficients and R-squared score.

In [None]:
# Fit both models on the same synthetic data and compare coefficients and R^2
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=2, noise=15)
lr = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1).fit(X, y)

print("Linear coefficients:", lr.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Linear R^2:", lr.score(X, y))
print("Ridge R^2:", ridge.score(X, y))


## Task 24 : Write a Python script that uses cross-validation to evaluate a Linear Regression model on a synthetic dataset.

In [None]:
# Code for Task 24 goes here
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=2, noise=10)
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validated scores:", scores)
print("Mean score:", scores.mean())


## Task 25 : Write a Python script that compares polynomial regression models of different degrees and prints the R-squared score for each.

In [None]:
# Code for Task 25 goes here
for degree in range(1, 5):
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    print(f"Degree {degree} R^2: {model.score(X_poly, y)}")
