In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load dataset
df = pd.read_csv("ToyotaCorolla - MLR.csv")

# Display first few rows
print(df.head())

   Price  Age_08_04     KM Fuel_Type  HP  Automatic    cc  Doors  Cylinders  \
0  13500         23  46986    Diesel  90          0  2000      3          4   
1  13750         23  72937    Diesel  90          0  2000      3          4   
2  13950         24  41711    Diesel  90          0  2000      3          4   
3  14950         26  48000    Diesel  90          0  2000      3          4   
4  13750         30  38500    Diesel  90          0  2000      3          4   

   Gears  Weight  
0      5    1165  
1      5    1165  
2      5    1165  
3      5    1165  
4      5    1170  


In [51]:
df.dtypes

Price         int64
Age_08_04     int64
KM            int64
Fuel_Type    object
HP            int64
Automatic     int64
cc            int64
Doors         int64
Cylinders     int64
Gears         int64
Weight        int64
dtype: object

In [52]:
# Preprocess the data
# Convert 'Fuel_Type' to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['Fuel_Type'], drop_first=True)
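As a small illustration of what `drop_first=True` does, the sketch below applies the same call to a hypothetical three-row frame (the values are made up, not taken from the dataset). With k fuel types, k-1 indicator columns remain, and the dropped level acts as the baseline absorbed by the intercept.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Price": [13500, 9000, 11000],
    "Fuel_Type": ["Diesel", "Petrol", "CNG"],
})

# Same call as in the cell above: one indicator column per level,
# minus the first level in sorted order (here "CNG"), which becomes the baseline.
encoded = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

A row with both indicators equal to zero therefore represents the baseline fuel type.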


In [53]:
# Define features (X) and target variable (y)
X = df.drop('Price', axis=1)
y = df['Price']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [54]:
# Multiple Linear Regression Models
# Model 1: all available features
model1 = LinearRegression()
model1.fit(X_train, y_train)
# Model 2: hand-picked subset of five features
model2 = LinearRegression()
model2.fit(X_train[['Age_08_04', 'KM', 'HP', 'Automatic', 'Weight']], y_train)
# Model 3: age and mileage only
model3 = LinearRegression()
model3.fit(X_train[['Age_08_04', 'KM']], y_train)


In [55]:
# Evaluate the models
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}")
    return mse, r2

print("Model 1 Evaluation:")
mse1, r2_1 = evaluate_model(model1, X_test, y_test)

print("Model 2 Evaluation:")
mse2, r2_2 = evaluate_model(model2, X_test[['Age_08_04', 'KM','HP','Automatic','Weight']], y_test)

print("Model 3 Evaluation:")
mse3, r2_3 = evaluate_model(model3, X_test[['Age_08_04', 'KM']], y_test)

Model 1 Evaluation:
Mean Squared Error: 2203043.8231437034
R-squared: 0.8348888040611082
Model 2 Evaluation:
Mean Squared Error: 1985549.574429618
R-squared: 0.8511893129923238
Model 3 Evaluation:
Mean Squared Error: 2925606.602783666
R-squared: 0.7807349994776657
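Because MSE is in squared price units, taking the square root gives an error in the same units as `Price`, which is easier to read. The sketch below only reuses the MSE values printed above; the variable names are illustrative.

```python
import math

# MSE values copied from the evaluation output above.
mse_values = {
    "Model 1": 2203043.8231437034,
    "Model 2": 1985549.574429618,
    "Model 3": 2925606.602783666,
}

# RMSE = sqrt(MSE), expressed in the same units as Price.
rmse_values = {name: math.sqrt(mse) for name, mse in mse_values.items()}

for name, rmse in rmse_values.items():
    print(f"{name}: RMSE = {rmse:.0f}")
```

On this particular split the typical prediction error is roughly 1,400 to 1,700 price units, with Model 2 the lowest. A single train/test split is a noisy basis for ranking models, so the ordering should be read with some caution.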


In [56]:
# Lasso Regression
lasso_model = Lasso(alpha=0.1)  # Adjust alpha as needed
lasso_model.fit(X_train, y_train)

print("\nLasso Regression Evaluation:")
evaluate_model(lasso_model, X_test, y_test)

# Ridge Regression
ridge_model = Ridge(alpha=0.1) # Adjust alpha as needed
ridge_model.fit(X_train, y_train)
print("\nRidge Regression Evaluation:")
evaluate_model(ridge_model, X_test, y_test)


Lasso Regression Evaluation:
Mean Squared Error: 2202270.260024681
R-squared: 0.8349467801805

Ridge Regression Evaluation:
Mean Squared Error: 2202732.2441679
R-squared: 0.8349121559240098


(2202732.2441679, 0.8349121559240098)
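The Lasso and Ridge penalties act on coefficient magnitudes, so their effect depends on the units of each feature (for example, `KM` in the tens of thousands versus `Gears` around 5). One common arrangement, sketched here on synthetic data rather than the Corolla file, is to standardize inside a `Pipeline` so that the scaler is fit on the training split only. The data, the `alpha` value, and the variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(0)
n = 200
km = rng.uniform(0, 200_000, size=n)   # large-scale feature
gears = rng.integers(4, 7, size=n)     # small-scale feature (4, 5, or 6)
X = np.column_stack([km, gears])
y = 20_000 - 0.05 * km + 500 * gears + rng.normal(0, 300, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit on the training data only when the pipeline is fit,
# then re-applied unchanged to the test data at predict/score time.
scaled_ridge = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
scaled_ridge.fit(X_train, y_train)

test_r2 = scaled_ridge.score(X_test, y_test)
scaled_coefs = scaled_ridge.named_steps["ridge"].coef_
print(f"Test R-squared: {test_r2:.3f}")
print("Coefficients per standard deviation of each feature:", scaled_coefs)
```

After standardization each coefficient is expressed per standard deviation of its feature, so the penalty treats all features on a comparable footing.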

Interview Questions:
1. What is Normalization & Standardization and how is it helpful?
2. What techniques can be used to address multicollinearity in multiple linear regression?

1. What is Normalization & Standardization and how is it helpful?

Both normalization and standardization are feature scaling techniques used to preprocess data for machine learning models, especially when dealing with numerical features with different scales.

    Normalization (Min-Max Scaling):
        Scales features to a fixed range, typically [0,1] or [-1,1].
        Formula:
        $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$
        Useful when features must lie in a bounded range and the data does not follow a normal distribution. It is sensitive to outliers, because $X_{min}$ and $X_{max}$ set the scale.

    Standardization (Z-score Scaling):
        Centers the data around zero with a standard deviation of one.
        Formula:
        $X_{std} = \frac{X - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
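        Useful for scale-sensitive methods such as regularized regression, PCA, and gradient-based models. It does not require the data to be normally distributed, and it does not bound values to a fixed range.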
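To make the two formulas concrete, here is a minimal sketch using scikit-learn's `MinMaxScaler` and `StandardScaler` on a made-up column of five values. The numbers are illustrative, not from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature (e.g. mileage); values are made up.
x = np.array([[10_000.0], [20_000.0], [30_000.0], [40_000.0], [50_000.0]])

# Min-max normalization: (x - min) / (max - min), mapped to [0, 1].
x_norm = MinMaxScaler().fit_transform(x)

# Standardization: (x - mean) / std, giving mean 0 and standard deviation 1.
# StandardScaler uses the population standard deviation (ddof=0).
x_std = StandardScaler().fit_transform(x)

print(x_norm.ravel())  # evenly spaced values from 0 to 1
print(x_std.ravel())   # symmetric around 0
```

In a real workflow the scaler would be fit on the training split only and then applied to the test split, so that no information from the test data leaks into preprocessing.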

Why is it helpful?

    Improves model convergence (especially for gradient-based models).
    Prevents features with larger scales from dominating those with smaller scales.
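    Makes training more numerically stable and can improve accuracy for scale-sensitive models (for example, regularized or distance-based ones). Predictions from plain ordinary least squares are unaffected by linear rescaling of the features.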

2. What techniques can be used to address multicollinearity in multiple linear regression?

Multicollinearity occurs when independent variables are highly correlated, leading to unstable coefficient estimates and inflated standard errors.
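The inflation of standard errors can be made explicit. Under the usual ordinary least squares assumptions (a standard textbook result, not something derived in this notebook), the sampling variance of the j-th coefficient is:

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
    \cdot \frac{1}{1 - R_j^2}
```

Here $\sigma^2$ is the error variance and $R_j^2$ is the R-squared obtained by regressing predictor $j$ on all the other predictors. As $R_j^2$ approaches 1, the second factor grows without bound; that factor is the variance inflation factor (VIF).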

Techniques to address multicollinearity:

    Variance Inflation Factor (VIF)
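        Calculate VIF for each independent variable. A VIF above 10 is a common rule of thumb for high multicollinearity (some practitioners use 5). VIF itself is a diagnostic; the remedy is the next step.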
        Drop or combine variables with high VIF values.

    Principal Component Analysis (PCA)
        Reduces dimensionality by transforming correlated variables into a set of uncorrelated principal components.

    Lasso Regression (L1 Regularization)
        Adds an L1 penalty on the absolute size of the coefficients, which can shrink some of them exactly to zero. Among a group of correlated predictors it tends to keep one and drop the others.

    Ridge Regression (L2 Regularization)
        Adds an L2 penalty on the squared coefficients, shrinking them toward zero (but not exactly to zero) and stabilizing the estimates when predictors are correlated.

    Feature Selection
        Use domain knowledge or automated feature selection techniques like Recursive Feature Elimination (RFE) to remove redundant variables.

    Combine Correlated Features
        If two variables are highly correlated, combine them into a new feature (e.g., taking the average or difference).
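As a hedged sketch of the VIF check described in the list above, the code below computes VIF_j = 1 / (1 - R_j^2) directly with NumPy on synthetic data in which one column is nearly a copy of another. The helper name `compute_vif` and the data are assumptions for illustration; statsmodels also provides a ready-made VIF function that could be used instead.

```python
import numpy as np

def compute_vif(X):
    """Return the VIF of each column of X (hypothetical helper).

    Each column is regressed on the remaining columns plus an intercept,
    and VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data: x2 is x1 plus a little noise, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

vif = compute_vif(X)
print(dict(zip(["x1", "x2", "x3"], np.round(vif, 1))))

# Dropping one of the near-duplicate columns brings the VIFs back near 1.
vif_reduced = compute_vif(X[:, [0, 2]])
print(dict(zip(["x1", "x3"], np.round(vif_reduced, 1))))
```

The two near-duplicate columns get very large VIFs while the independent column stays close to 1, which is the pattern that would prompt dropping or combining variables.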