In [None]:
#### 1. What does R-squared represent in a regression model?
R-squared represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
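
The definition above can be checked by hand. The following is a minimal sketch that computes R² as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the sample values are illustrative assumptions, not part of the original notebook.

```python
def r_squared(y_true, y_pred):
    """Return R² = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total variation
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

r2 = r_squared(y_true, y_pred)
print(f"R²: {r2:.3f}")  # R²: 0.950
```

Here the residual sum of squares is 1 and the total sum of squares is 20, so 95% of the variance is accounted for by the predictions.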

#### 2. What are the assumptions of linear regression?
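- Linearity: the target is a linear function of the predictors
- Independence: the errors are independent of one another
- Homoscedasticity: the errors have constant variance
- Normality of residuals
- No (perfect) multicollinearity among the predictors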

#### 3. What is the difference between R-squared and Adjusted R-squared?
R-squared never decreases when features are added, even irrelevant ones. Adjusted R-squared adjusts for the number of predictors, penalizing unnecessary variables, so it can fall when a new feature adds little.

#### 4. Why do we use Mean Squared Error (MSE)?
MSE gives a measure of prediction accuracy by penalizing larger errors more heavily due to squaring.

#### 5. What does an Adjusted R-squared value of 0.85 indicate?
It indicates that 85% of the variance is explained by the model after adjusting for the number of predictors.

#### 6. How do we check for normality of residuals in linear regression?
Using a Q-Q plot or histogram of residuals.
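
As a rough sketch of the Q-Q approach, `scipy.stats.probplot` returns the theoretical and ordered sample quantiles together with the correlation of the fitted Q-Q line. The residuals here are simulated stand-ins (an assumption for illustration); in practice they would come from a fitted model.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for model residuals (assumption: not from a real fit).
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=2.0, size=200)

# probplot pairs theoretical normal quantiles with the sorted residuals and
# fits a least-squares line through them; r is the correlation of that fit.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

print(f"Q-Q line correlation: {r:.3f}")
```

A correlation close to 1 means the points lie near a straight line, which is consistent with normally distributed residuals. Passing `plot=plt` to `probplot` draws the plot with matplotlib.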

#### 7. What is multicollinearity, and how does it impact regression?
It refers to high correlation among independent variables. It inflates the variance of the coefficient estimates, making them unstable and hard to interpret individually.

#### 8. What is Mean Absolute Error (MAE)?
MAE is the average of the absolute errors between predicted and actual values.

#### 9. What are the benefits of using an ML pipeline?
- Code modularity
- Easy experimentation
- Preprocessing integration
- Cross-validation compatibility
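
To illustrate the last two points, here is a small sketch that passes a whole pipeline to `cross_val_score`, so the scaler is re-fitted on the training portion of every fold rather than on the full data. The synthetic dataset and the `random_state` value are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),   # fitted on each training fold only
    ("model", LinearRegression()),
])

# The pipeline behaves like a single estimator, so it plugs straight into CV.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
print("Fold R² scores:", scores)
print("Mean R²:", scores.mean())
```

Keeping preprocessing inside the pipeline is what prevents information from the validation fold leaking into the scaling step.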

#### 10. Why is RMSE considered more interpretable than MSE?
Because RMSE is in the same unit as the target variable, while MSE is squared.

#### 11. What is pickling in Python, and how is it useful in ML?
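Pickling is serializing a Python object into a byte stream that can be written to disk and loaded back later. In ML it is useful for saving a trained model so it can be reused without retraining.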

#### 12. What does a high R-squared value mean?
It means that a large proportion of variance in the target variable is explained by the model.

#### 13. What happens if linear regression assumptions are violated?
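Coefficient estimates may be biased, inconsistent, or inefficient, and the standard errors, confidence intervals, and p-values derived from them may be unreliable.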

#### 14. How can we address multicollinearity in regression?
- Remove correlated features
- Use regularization (e.g., Ridge)
- Apply PCA
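
The regularization option can be sketched with the closed-form Ridge solution, `(XᵀX + αI)⁻¹ Xᵀy`, applied to two nearly identical features. The data, the `alpha` value, and the helper `ridge_coefficients` are assumptions made for this illustration; scikit-learn's `Ridge` would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(size=n)

# Center the data so no intercept term is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def ridge_coefficients(Xc, yc, alpha):
    """Closed-form Ridge solution; alpha=0 gives ordinary least squares."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)


beta_ols = ridge_coefficients(Xc, yc, alpha=0.0)
beta_ridge = ridge_coefficients(Xc, yc, alpha=1.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With collinear features the individual OLS coefficients can swing far from the true values while their sum stays near 5. The penalty pulls the Ridge coefficients toward a more stable split.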

#### 15. How can feature selection improve model performance in regression analysis?
By removing irrelevant or redundant variables, which reduces overfitting and multicollinearity and improves model interpretability and, often, accuracy on unseen data.

#### 16. How is Adjusted R-squared calculated?
Adjusted R² = 1 - [(1 - R²)(n - 1)/(n - k - 1)], where n = observations, k = predictors.
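
The formula translates directly into a few lines of code. This sketch uses a hypothetical helper, `adjusted_r_squared`, with made-up inputs to show how the penalty grows with the number of predictors.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² for n observations and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same R², same sample size, different numbers of predictors (illustrative values).
adj_few = adjusted_r_squared(0.85, n=100, k=3)
adj_many = adjusted_r_squared(0.85, n=100, k=30)

print(f"k=3:  {adj_few:.4f}")   # k=3:  0.8453
print(f"k=30: {adj_many:.4f}")  # k=30: 0.7848
```

The same raw R² of 0.85 is worth less when it took 30 predictors to reach it.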

#### 17. Why is MSE sensitive to outliers?
Because it squares the errors, giving more weight to larger errors.
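
A small numeric comparison makes the effect concrete. In this sketch the values are invented for illustration: the two prediction sets differ only in the last entry, where one prediction is off by 10 instead of 1.

```python
import math


def mse(y_true, y_pred):
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


actual = [10.0, 12.0, 14.0, 16.0, 18.0]
clean_pred = [11.0, 11.0, 15.0, 15.0, 19.0]     # every error has magnitude 1
outlier_pred = [11.0, 11.0, 15.0, 15.0, 28.0]   # last error has magnitude 10

for name, pred in [("clean", clean_pred), ("outlier", outlier_pred)]:
    m = mse(actual, pred)
    print(f"{name}: MSE={m:.2f}, RMSE={math.sqrt(m):.2f}, MAE={mae(actual, pred):.2f}")
# clean: MSE=1.00, RMSE=1.00, MAE=1.00
# outlier: MSE=20.80, RMSE=4.56, MAE=2.80
```

One bad prediction multiplies MSE by about 21 but MAE by less than 3, because the error of 10 contributes 100 to the squared sum.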

#### 18. What is the role of homoscedasticity in linear regression?
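Homoscedasticity means the errors have constant variance across all levels of the independent variables. When it holds, the usual standard errors, confidence intervals, and hypothesis tests for the coefficients are valid.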

#### 19. What is Root Mean Squared Error (RMSE)?
Square root of the average of squared errors between predicted and actual values.

#### 20. Why is pickling considered risky?
Unpickling untrusted data can lead to code execution vulnerabilities.
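
The mechanism can be shown with a deliberately harmless sketch. An object's `__reduce__` method tells pickle which callable to invoke, and with which arguments, when the bytes are loaded. The class name `Payload` and the use of `sorted` are assumptions chosen to keep the demo benign; a malicious payload could name any importable callable instead.

```python
import pickle


class Payload:
    # __reduce__ returns (callable, args); pickle.loads will call callable(*args).
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload. It runs the callable chosen by the bytes.
result = pickle.loads(blob)
print(type(result).__name__, result)  # list [1, 2, 3]
```

Because the byte stream decides what gets called, pickle files should only be loaded from sources that are trusted.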

#### 21. What alternatives exist to pickling for saving ML models?
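- Joblib (pickle-based and more efficient for large NumPy arrays, but it carries the same security risk when loading untrusted files)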
- ONNX
- PMML
- Model export APIs (e.g., TensorFlow SavedModel)

#### 22. What is heteroscedasticity, and why is it a problem?
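It refers to non-constant variance of the errors across levels of the independent variables. It violates the constant-variance assumption: the coefficient estimates remain unbiased but are no longer efficient, and the usual standard errors, and therefore confidence intervals and p-values, become unreliable.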
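
As an informal check, the spread of the residuals can be compared across the range of a predictor. This sketch simulates data whose noise grows with `x` (an assumption made so the effect is visible) and compares the residual standard deviation in the lower and upper halves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = np.sort(rng.uniform(1, 10, size=n))
y = 2 + 3 * x + rng.normal(scale=x)   # noise standard deviation grows with x

# Fit a straight line; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

half = n // 2
std_low = residuals[:half].std()    # residual spread for small x
std_high = residuals[half:].std()   # residual spread for large x

print(f"Residual std, lower half: {std_low:.2f}")
print(f"Residual std, upper half: {std_high:.2f}")
```

A clearly larger spread in one half is the numeric counterpart of the funnel shape seen in a residuals-versus-fitted plot. A formal test such as Breusch-Pagan would be the next step.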

#### 23. How can interaction terms enhance a regression model's predictive power?
By capturing combined effects of variables that wouldn’t be detected individually.
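
A quick simulation shows the gain. In this sketch the target is generated with a true `x1 * x2` effect (an assumption made for the demo), and least squares is fitted with and without the interaction column. The helper `fit_r2` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 + rng.normal(scale=0.1, size=n)


def fit_r2(design, target):
    """Fit ordinary least squares and return the training R²."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)


ones = np.ones(n)
r2_main = fit_r2(np.column_stack([ones, x1, x2]), y)             # main effects only
r2_inter = fit_r2(np.column_stack([ones, x1, x2, x1 * x2]), y)   # plus interaction

print(f"R² without interaction: {r2_main:.3f}")
print(f"R² with interaction:    {r2_inter:.3f}")
```

The main-effects model cannot represent the way the effect of `x1` changes with `x2`, so a large share of the variance stays unexplained until the product term is added.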


**Practical**

In [None]:
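# Imports used by the tasks in the cells below
import pickle

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Task 1
df = sns.load_dataset('diamonds')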
X = df[['carat', 'depth', 'table']].dropna()
y = df['price'].loc[X.index]
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred
sns.histplot(residuals, kde=True)
plt.title("Distribution of Residuals")
plt.show()

# Task 2
mse = mean_squared_error(y, y_pred)
mae = mean_absolute_error(y, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse}, MAE: {mae}, RMSE: {rmse}")

# Task 3
plt.scatter(y_pred, residuals)
plt.title("Residuals vs Fitted")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.axhline(0, color='r', linestyle='--')
plt.show()

sns.heatmap(X.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

# Task 4
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
pipeline.fit(X_train, y_train)
print("Pipeline R² Score:", pipeline.score(X_test, y_test))

# Task 5
X = df[['carat']].dropna()
y = df['price'].loc[X.index]
model = LinearRegression().fit(X, y)
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
print("R² Score:", model.score(X, y))


In [None]:
# Task 6
tips = sns.load_dataset('tips')
X = tips[['total_bill']]
y = tips['tip']
model = LinearRegression().fit(X, y)
sns.regplot(x='total_bill', y='tip', data=tips)
plt.title("Tip vs Total Bill")
plt.show()

# Task 7
X = np.random.rand(100, 1) * 10
y = 3 * X.flatten() + np.random.randn(100) * 5
model = LinearRegression().fit(X, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X), color='red')
plt.title("Synthetic Linear Regression")
plt.show()

# Task 8
with open("linear_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Task 9
X = np.linspace(0, 10, 100)
y = 2 * X**2 + 3 * X + 5 + np.random.randn(100) * 10
X_poly = PolynomialFeatures(degree=2).fit_transform(X.reshape(-1, 1))
model = LinearRegression().fit(X_poly, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X_poly), color='red')
plt.title("Polynomial Regression Curve (Degree 2)")
plt.show()

# Task 10
X, y = make_regression(n_samples=100, n_features=1, noise=15)
model = LinearRegression().fit(X, y)
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)


In [None]:
# Task 11
for d in range(1, 5):
    poly = PolynomialFeatures(degree=d)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    print(f"Degree {d} R² Score:", model.score(X_poly, y))

# Task 12
X, y = make_regression(n_samples=100, n_features=2, noise=10)
model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("R² Score:", model.score(X, y))

# Task 13
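plt.scatter(X[:, 0], y)
# The model has two features, so draw the fitted line against the first one
# while holding the second at its mean; plotting raw predictions would zigzag.
x1_grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
X_grid = np.column_stack([x1_grid, np.full_like(x1_grid, X[:, 1].mean())])
plt.plot(x1_grid, model.predict(X_grid), color='red')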
plt.title("Regression Line with Data")
plt.show()

# Task 14
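from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# Add an intercept column so each VIF comes from a regression with a constant,
# and use a separate name so the diamonds DataFrame `df` is not overwritten.
features = add_constant(pd.DataFrame(X, columns=['x1', 'x2']))
vif = pd.DataFrame()
vif["Feature"] = features.columns
vif["VIF"] = [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]
print(vif)  # the 'const' row can be ignored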

# Task 15
X = np.linspace(-3, 3, 100).reshape(-1, 1)
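# Build y from a flattened copy of X so it is 1-D; adding (100,) noise to
# (100, 1) terms would broadcast y to shape (100, 100).
x = X.ravel()
y = 1 + 2*x + 3*x**2 + 4*x**3 + 5*x**4 + np.random.randn(100)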
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X_poly), color='red')
plt.title("Polynomial Regression (Degree 4)")
plt.show()


In [None]:
# Task 16
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LinearRegression())
])
pipe.fit(X, y)
print("R² Score:", pipe.score(X, y))

# Task 17
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X_poly), color='red')
plt.title("Polynomial Regression (Degree 3)")
plt.show()

# Task 18
X, y = make_regression(n_samples=100, n_features=5, noise=10)
model = LinearRegression().fit(X, y)
print("R² Score:", model.score(X, y))
print("Coefficients:", model.coef_)

# Task 19
X, y = make_regression(n_samples=100, n_features=1, noise=10)
model = LinearRegression().fit(X, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X), color='red')
plt.title("Regression Line")
plt.show()

# Task 20
X, y = make_regression(n_samples=100, n_features=3, noise=10)
model = LinearRegression().fit(X, y)
print("R² Score:", model.score(X, y))
print("Coefficients:", model.coef_)


In [None]:
# Task 21
joblib.dump(model, "model_joblib.pkl")
model_loaded = joblib.load("model_joblib.pkl")
print("Deserialized Model R² Score:", model_loaded.score(X, y))

# Task 22
tips = sns.load_dataset('tips')
X = tips[['total_bill', 'sex', 'smoker']]
y = tips['tip']
X_encoded = pd.get_dummies(X, drop_first=True)
model = LinearRegression().fit(X_encoded, y)
print("R² Score:", model.score(X_encoded, y))

# Task 23
X, y = make_regression(n_samples=100, n_features=1, noise=20)
lr = LinearRegression().fit(X, y)
rr = Ridge(alpha=1.0).fit(X, y)
print("Linear:", lr.coef_, lr.score(X, y))
print("Ridge:", rr.coef_, rr.score(X, y))

# Task 24
X, y = make_regression(n_samples=100, n_features=1, noise=10)
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", scores)

# Task 25
for d in range(1, 6):
    poly = PolynomialFeatures(degree=d)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    print(f"Degree {d} R² Score:", model.score(X_poly, y))
