<a href="https://colab.research.google.com/github/Chaakash16/Python-Basics/blob/main/Evaluation_Metrics_and_Regression_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Theoretical**
1. What does R-squared represent in a regression model?
R-squared represents the proportion of variance in the dependent variable that is explained by the independent variables.
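
As a minimal sketch of this definition, R-squared can be computed by hand as one minus the ratio of residual to total sum of squares, then compared with scikit-learn's `r2_score`. The synthetic data, seed, and variable names below are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (assumed for illustration): y depends linearly on X plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2, size=200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print("Manual R²:", r2_manual)
print("sklearn R²:", r2_score(y, y_pred))
```

The two printed values should agree up to floating-point error.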

2. What are the assumptions of linear regression?

*   Linearity
*   Independence of errors
*   Homoscedasticity (constant error variance)
*   Normal distribution of residuals
*   No perfect multicollinearity among the predictors

3. What is the difference between R-squared and Adjusted R-squared?
Adjusted R-squared penalizes the addition of unnecessary predictors, making it more reliable when comparing models with different numbers of features.

4. Why do we use Mean Squared Error (MSE)?
MSE is used because it penalizes larger errors more than smaller ones, and because it is differentiable, which makes it a convenient objective for optimization in regression models.

5. What does an Adjusted R-squared value of 0.85 indicate?
It means roughly 85% of the variance in the dependent variable is explained by the model after adjusting for the number of predictors, so the figure is not inflated merely by adding features.

6. How do we check for normality of residuals in linear regression?

*   Use a Q-Q plot or histogram of residuals
*   Apply statistical tests like Shapiro-Wilk
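
The checks listed above could be sketched as follows, assuming SciPy is available. `scipy.stats.shapiro` returns the test statistic and p-value, and `scipy.stats.probplot` returns the Q-Q points together with the correlation of the fitted line. The synthetic data and seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with Gaussian noise (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1.5, size=200)

residuals = y - LinearRegression().fit(X, y).predict(X)

# Shapiro-Wilk: a small p-value suggests the residuals are not normal
w_stat, p_value = stats.shapiro(residuals)

# Q-Q data without plotting: r close to 1 means the points lie near a straight line
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

print(f"Shapiro-Wilk W = {w_stat:.4f}, p-value = {p_value:.4f}")
print(f"Q-Q line correlation r = {r:.4f}")
```

Passing `plot=plt` to `probplot` draws the Q-Q plot itself when Matplotlib is imported as `plt`.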

7. What is multicollinearity, and how does it impact regression?
Multicollinearity refers to high correlation between independent variables. It causes unstable estimates and inflates standard errors.

8. What is Mean Absolute Error (MAE)?
MAE is the average of the absolute differences between predicted and actual values. It is simple and interpretable.

9. What are the benefits of using an ML pipeline?

*   Automates preprocessing and modeling steps
*   Ensures reproducibility
*   Reduces chances of data leakage

10. Why is RMSE considered more interpretable than MSE?
RMSE is in the same units as the target variable, making it easier to understand in practical terms.

11. What is pickling in Python, and how is it useful in ML?
Pickling is a method to serialize Python objects. It's useful for saving trained models and reusing them later.
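
A minimal round-trip sketch, assuming a small scikit-learn model trained on synthetic data: the model is serialized to bytes with `pickle.dumps`, restored with `pickle.loads`, and the predictions are compared.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a small model on synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.5 * X[:, 0] + rng.normal(0, 1, size=50)
model = LinearRegression().fit(X, y)

# Serialize to bytes, then restore a separate object from those bytes
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after round trip:", same)
```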

12. What does a high R-squared value mean?
It means the model explains a large portion of the variance in the dependent variable.

13. What happens if linear regression assumptions are violated?
Model performance and validity of inferences may degrade. For example, coefficients may be biased or standard errors misleading.

14. How can we address multicollinearity in regression?

*   Remove or combine correlated features
*   Use PCA
*   Apply regularization (Ridge, Lasso)

15. How can feature selection improve model performance in regression analysis?
It reduces overfitting, improves model interpretability, and increases computational efficiency.

16. How is Adjusted R-squared calculated?
Adjusted R² = 1 - [(1 - R²) * (n - 1) / (n - p - 1)]
Where n = number of samples, p = number of predictors
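
The formula above translates directly into a small helper. The name `adjusted_r2` is hypothetical (it is not a scikit-learn function), and the input values are illustrative.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for n samples and p predictors (hypothetical helper)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for Adjusted R² to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same R², different numbers of predictors: more predictors means a larger penalty
print(adjusted_r2(0.90, n=100, p=5))   # about 0.8947
print(adjusted_r2(0.90, n=100, p=50))  # noticeably lower
```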

17. Why is MSE sensitive to outliers?
Because it squares the errors, making large errors have an outsized effect.
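
A small numeric illustration, using made-up values: every prediction is off by 1, then a single prediction is changed so that it is off by 10. MAE grows linearly with that one error, while MSE grows with its square.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration
y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])  # every error has size 1

y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = 22.0  # one error of size 10

mae_clean = mean_absolute_error(y_true, y_pred)
mse_clean = mean_squared_error(y_true, y_pred)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)
mse_outlier = mean_squared_error(y_true, y_pred_outlier)

print(f"Without outlier: MAE = {mae_clean:.2f}, MSE = {mse_clean:.2f}")
print(f"With outlier:    MAE = {mae_outlier:.2f}, MSE = {mse_outlier:.2f}")
print(f"RMSE with outlier: {np.sqrt(mse_outlier):.2f}")
```

Here MAE rises from 1.0 to 2.8, while MSE rises from 1.0 to 20.8.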

18. What is the role of homoscedasticity in linear regression?
It ensures that residuals have equal variance across levels of the independent variables, which is important for valid inference.

19. What is Root Mean Squared Error (RMSE)?
RMSE is the square root of MSE and provides error in the same units as the target variable.

20. Why is pickling considered risky?
Unpickling can execute arbitrary code, so loading a pickle file from an untrusted source is insecure.
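
A harmless sketch of the mechanism: an object's `__reduce__` method tells pickle which callable to invoke at load time. The `Payload` class is hypothetical and uses the built-in `len` as a stand-in; an attacker could name any importable callable instead.

```python
import pickle


class Payload:
    """Hypothetical class whose pickle stream calls a function on load."""

    def __reduce__(self):
        # On load, pickle calls len("abc") and returns its result.
        # A malicious stream could name a dangerous callable instead.
        return (len, ("abc",))


data = pickle.dumps(Payload())
result = pickle.loads(data)

# The loaded value is the return value of the call, not a Payload instance
print(type(result), result)
```

The load returns the integer 3, which shows that a function ran during `pickle.loads`.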

21. What alternatives exist to pickling for saving ML models?

*   joblib (optimized for large NumPy arrays; it is still pickle-based, so the same security caveat applies)
*   ONNX
*   PMML
*   JSON (for model parameters)
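
As one sketch of the JSON option, a linear model's learned parameters can be stored as plain numbers and the prediction rebuilt with a dot product. This only works for models whose prediction rule is simple enough to reimplement by hand. The key names `coef` and `intercept` are arbitrary choices, and the data is synthetic.

```python
import json

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a small model on synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, size=50)
model = LinearRegression().fit(X, y)

# Store only the learned parameters as plain JSON (no executable content)
params = {"coef": model.coef_.tolist(), "intercept": float(model.intercept_)}
text = json.dumps(params)

# Later: rebuild the prediction rule y = X @ coef + intercept from the JSON text
loaded = json.loads(text)
y_from_json = X @ np.array(loaded["coef"]) + loaded["intercept"]

match = np.allclose(y_from_json, model.predict(X))
print("Predictions rebuilt from JSON match:", match)
```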

22. What is heteroscedasticity, and why is it a problem?
Heteroscedasticity occurs when residuals have non-constant variance. It violates a regression assumption: coefficient estimates remain unbiased, but standard errors, and therefore p-values and confidence intervals, become unreliable.
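
An informal way to spot non-constant variance is to compare the residual variance at low and high values of the predictor. The sketch below is in the spirit of the Goldfeld-Quandt test but is not a formal test. The helper `variance_ratio` is hypothetical and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=300)
y_const = 2.0 * x + rng.normal(0, 1.0, size=300)    # constant noise level
y_fan = 2.0 * x + rng.normal(0, 1.0, size=300) * x  # noise grows with x


def variance_ratio(x, y):
    """Residual variance in the top third of x divided by the bottom third."""
    X = x.reshape(-1, 1)
    resid = y - LinearRegression().fit(X, y).predict(X)
    order = np.argsort(x)
    third = len(x) // 3
    low = resid[order[:third]]
    high = resid[order[-third:]]
    return high.var() / low.var()


ratio_const = variance_ratio(x, y_const)
ratio_fan = variance_ratio(x, y_fan)
print(f"Variance ratio, constant noise: {ratio_const:.2f}")
print(f"Variance ratio, growing noise:  {ratio_fan:.2f}")
```

A ratio near 1 is consistent with homoscedasticity; a ratio far from 1 suggests the residual spread changes with the predictor.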

23. How can interaction terms enhance a regression model's predictive power?
They capture relationships between features that affect the outcome together in a non-additive way.
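
A minimal sketch on synthetic data where the target depends on the product of two features. `PolynomialFeatures` with `interaction_only=True` adds the product column, and the in-sample R² of the additive model is compared with that of the model including the interaction. The coefficients and seed are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed): the effect of x0 on y depends on the value of x1
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(300, 2))
y = 1.0 * X[:, 0] + 2.0 * X[:, 1] + 3.0 * X[:, 0] * X[:, 1] + rng.normal(0, 1, size=300)

# Additive model: uses x0 and x1 only
r2_additive = LinearRegression().fit(X, y).score(X, y)

# Model with interaction: columns are x0, x1, x0*x1
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)
r2_interaction = LinearRegression().fit(X_inter, y).score(X_inter, y)

print(f"R² without interaction: {r2_additive:.4f}")
print(f"R² with interaction:    {r2_interaction:.4f}")
```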

**Practical**
1. Visualize the distribution of residuals for multiple linear regression (Seaborn "diamonds" dataset)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load and clean data
df = sns.load_dataset('diamonds').dropna()
X = df[['carat', 'depth', 'table']]
y = df['price']

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for reproducibility
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred

# Plot residuals
sns.histplot(residuals, kde=True)
plt.title("Distribution of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()


2. Calculate and print MSE, MAE, and RMSE for a linear regression model

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


3. Check if the assumptions of linear regression are met

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Linearity (pairplot)
sns.pairplot(df[['carat', 'depth', 'table', 'price']])
plt.suptitle("Linearity Check", y=1.02)
plt.show()

# 2. Homoscedasticity (residuals plot)
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Homoscedasticity Check")
plt.show()

# 3. Multicollinearity (correlation matrix)
corr_matrix = df[['carat', 'depth', 'table']].corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()


4. Create an ML pipeline with feature scaling and evaluate regression models

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

X = df[['carat', 'depth', 'table']]
y = df['price']

scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
print("Cross-validated R² scores:", scores)
print("Average R² score:", scores.mean())


5. Simple linear regression model: print coefficients, intercept, R²

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[['carat']]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for reproducibility
model = LinearRegression().fit(X_train, y_train)

print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
print("R-squared Score:", model.score(X_test, y_test))


6. Analyze relationship between total bill and tip (Seaborn 'tips' dataset)

In [None]:
df_tips = sns.load_dataset('tips')
X = df_tips[['total_bill']]
y = df_tips['tip']

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

plt.scatter(X, y, label='Actual')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.title("Linear Regression: Total Bill vs Tip")
plt.legend()
plt.show()


7. Fit linear regression on synthetic data and plot regression line

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data (seeded so the results are reproducible)
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X.squeeze() + np.random.randn(100) * 2

# Fit model
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, label='Data')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.xlabel("X")
plt.ylabel("y")
plt.title("Synthetic Data with Regression Line")
plt.legend()
plt.show()


8. Pickle a trained linear regression model and save to file

In [None]:
import pickle

# Save the model trained on the synthetic data from Q7
with open("linear_model.pkl", "wb") as f:
    pickle.dump(model, f)

print("Model saved to 'linear_model.pkl'")


9. Fit a polynomial regression model (degree 2) and plot curve

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Use synthetic data from Q7
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Fit model
model_poly = LinearRegression().fit(X_poly, y)
y_poly_pred = model_poly.predict(X_poly)

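# Plot (sort by X so the fitted curve is drawn left to right instead of zigzagging)
order = X.squeeze().argsort()
plt.scatter(X, y, label="Data")
plt.plot(X[order], y_poly_pred[order], color='green', label="Polynomial Regression (deg 2)")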
plt.xlabel("X")
plt.ylabel("y")
plt.title("Polynomial Regression Curve")
plt.legend()
plt.show()


10. Generate synthetic data for simple linear regression and print model

In [None]:
np.random.seed(0)  # seeded so the printed coefficient and intercept are reproducible
X = np.random.rand(100, 1) * 5
y = 4 * X.squeeze() + np.random.randn(100)

model = LinearRegression().fit(X, y)

print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)


11. Fit polynomial regression models of different degrees and compare

In [None]:
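from sklearn.metrics import r2_score

# Uses X, y from Q10. In-sample R² never decreases as the degree grows,
# so these scores show fit quality, not which degree generalizes best.
degrees = [1, 2, 3, 4]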

for d in degrees:
    poly = PolynomialFeatures(degree=d)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    y_pred = model.predict(X_poly)
    score = r2_score(y, y_pred)
    print(f"Degree {d} R-squared: {score:.4f}")


12. Multiple linear regression with two features: print coeffs, intercept, R²

In [None]:
X = df[['carat', 'table']]  # Use diamonds dataset
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for reproducibility
model = LinearRegression().fit(X_train, y_train)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("R-squared Score:", model.score(X_test, y_test))


13. Generate synthetic data, fit linear regression, visualize with regression line

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.squeeze() + np.random.randn(100) * 2

# Fit model
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, label='Data')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression on Synthetic Data")
plt.legend()
plt.show()


14. Use Variance Inflation Factor (VIF) to check for multicollinearity

In [None]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Use diamonds dataset
X = df[['carat', 'depth', 'table']]
X_const = add_constant(X)

vif_data = pd.DataFrame()
vif_data['Feature'] = X_const.columns
vif_data['VIF'] = [variance_inflation_factor(X_const.values, i)
                   for i in range(X_const.shape[1])]

print(vif_data)


15. Generate polynomial data (degree 4), fit model, plot curve

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Generate synthetic data
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1*X.squeeze()**4 - 3*X.squeeze()**3 + 2*X.squeeze()**2 + np.random.randn(100) * 5

# Polynomial regression
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
y_pred = model.predict(X_poly)

# Plot
plt.scatter(X, y, label='Data')
plt.plot(X, y_pred, color='purple', label='Degree 4 Regression')
plt.title("Polynomial Regression (Degree 4)")
plt.legend()
plt.show()


16. ML pipeline with standardization and multiple linear regression

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

X = df[['carat', 'depth', 'table']]
y = df['price']

pipeline.fit(X, y)
# Note: this R² is computed on the training data, so it is an optimistic estimate
r2 = pipeline.score(X, y)
print("R-squared Score:", r2)


17. Polynomial regression (degree 3) on synthetic data

In [None]:
X = np.linspace(-2, 2, 100).reshape(-1, 1)
y = X.squeeze()**3 - 2 * X.squeeze()**2 + 1 + np.random.randn(100)

poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
y_pred = model.predict(X_poly)

plt.scatter(X, y, label="Data")
plt.plot(X, y_pred, color="orange", label="Degree 3 Regression")
plt.title("Polynomial Regression (Degree 3)")
plt.legend()
plt.show()


18. Multiple linear regression on synthetic dataset with 5 features

In [None]:
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
model = LinearRegression().fit(X, y)

print("R-squared Score:", model.score(X, y))
print("Coefficients:", model.coef_)


19. Generate synthetic data for linear regression and visualize regression line

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data
X = np.random.rand(100, 1) * 10
y = 5 * X.squeeze() + np.random.randn(100) * 3

# Train model
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, label="Data")
plt.plot(X, y_pred, color="red", label="Regression Line")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression on Synthetic Data")
plt.legend()
plt.show()


20. Create a synthetic dataset with 3 features and perform multiple linear regression

In [None]:
from sklearn.datasets import make_regression

# Generate data
X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)

model = LinearRegression().fit(X, y)

print("R-squared Score:", model.score(X, y))
print("Coefficients:", model.coef_)


21. Serialize and deserialize ML models using joblib instead of pickle

In [None]:
import joblib

# Save the model
joblib.dump(model, "linear_model.joblib")

# Load the model
loaded_model = joblib.load("linear_model.joblib")

print("Model loaded successfully using joblib.")


22. Perform linear regression with categorical features using one-hot encoding ('tips' dataset)

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load data (kept in df_tips so the diamonds DataFrame in df is not overwritten)
df_tips = sns.load_dataset('tips')
X = df_tips[['total_bill', 'sex', 'smoker', 'day', 'time']]
y = df_tips['tip']

# Preprocessing pipeline with one-hot encoding
categorical = ['sex', 'smoker', 'day', 'time']
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), categorical)
], remainder='passthrough')

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', LinearRegression())
])

pipeline.fit(X, y)
print("R-squared Score:", pipeline.score(X, y))


23. Compare Ridge Regression with Linear Regression on a synthetic dataset

In [None]:
from sklearn.linear_model import Ridge

# Regenerate the 5-feature synthetic data used in Q18 (same random_state)
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Linear Regression
lin_model = LinearRegression().fit(X, y)
print("Linear Regression Coefficients:", lin_model.coef_)
print("Linear Regression R²:", lin_model.score(X, y))

# Ridge Regression
ridge_model = Ridge(alpha=1.0).fit(X, y)
print("Ridge Regression Coefficients:", ridge_model.coef_)
print("Ridge Regression R²:", ridge_model.score(X, y))


24. Use cross-validation to evaluate a Linear Regression model on synthetic data

In [None]:
from sklearn.model_selection import cross_val_score

# Synthetic data
X, y = make_regression(n_samples=100, n_features=3, noise=8, random_state=42)

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("Cross-validated R² scores:", scores)
print("Average R² score:", scores.mean())


25. Compare polynomial regression models of different degrees and print R-squared scores

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Generate synthetic data
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.squeeze()**3 - X.squeeze()**2 + 2 * X.squeeze() + np.random.randn(100) * 2

# Compare different polynomial degrees
degrees = [1, 2, 3, 4, 5]
for degree in degrees:
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)

    model = LinearRegression().fit(X_poly, y)
    y_pred = model.predict(X_poly)

    r2 = r2_score(y, y_pred)
    print(f"Degree {degree} - R-squared: {r2:.4f}")
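
The in-sample R² printed above can only stay the same or rise as the degree increases, so it cannot show which degree generalizes best. As a hedged sketch, the same comparison can be repeated with cross-validation. The data generation mirrors the cell above; the shuffled 5-fold split and its `random_state=0` are assumptions (shuffling matters here because `X` is sorted).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic cubic data as in the cell above
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.squeeze()**3 - X.squeeze()**2 + 2 * X.squeeze() + np.random.randn(100) * 2

# Shuffle the folds: X is sorted, so unshuffled folds would test extrapolation
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_r2 = {}
for degree in [1, 2, 3, 4, 5]:
    # The pipeline refits the polynomial expansion inside each fold
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_r2[degree] = cross_val_score(pipe, X, y, cv=cv, scoring="r2").mean()
    print(f"Degree {degree} - mean cross-validated R-squared: {cv_r2[degree]:.4f}")
```

Unlike the in-sample scores, the cross-validated R² is not guaranteed to improve with each extra degree.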
