Initial Observations:

#The dataset has 1,436 entries and 11 columns.

#The target variable is Price.

#The feature columns include numerical variables like Age, KM, HP, CC, Doors, Cylinders, Gears, and Weight.

#Fuel_Type is a categorical variable.

#Automatic is binary (1 = Yes, 0 = No).

EDA

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
plt.rcParams['figure.figsize'] = (10, 5)  # default figure size in inches (width, height)
plt.rcParams['figure.dpi'] = 300  # figure resolution in dots per inch
%matplotlib inline
# '%matplotlib inline' renders figures directly below the cell that creates them,
# so plots appear in the notebook even without an explicit plt.show() call.
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings (e.g. FutureWarning) to keep the output clean

In [None]:
df = pd.read_csv("ToyotaCorolla - MLR.csv")  # load the dataset; pandas assigns a default integer index
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.describe()

In [None]:
df.describe(include=object)

In [None]:
df.dtypes

In [None]:
df.info() # find missing values

#No missing values are present.

In [None]:
print(df.columns)

In [None]:
df['Fuel_Type']

In [None]:
df['Fuel_Type'].unique()

In [None]:
df['Fuel_Type'].value_counts()

In [None]:
df = pd.get_dummies(df, columns=['Fuel_Type'], drop_first=True, dtype=int)

In [None]:
df

In [None]:
df.duplicated()

In [None]:
df.duplicated().sum()

In [None]:
df[df.duplicated()]

In [None]:
df[df.duplicated(keep = False)]

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

In [None]:
df.duplicated().sum()

In [None]:
df.head()

In [None]:
df.tail(10)

In [None]:
df['Cylinders'].value_counts()

In [None]:
df.drop(columns=['Cylinders'],inplace=True)

In [None]:
df.head()

In [None]:
sns.pairplot(df)

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(df["Price"], bins=30, kde=True, color="blue")
plt.title("Distribution of Car Prices")
plt.xlabel("Price (Euros)")
plt.ylabel("Frequency")
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

Price Distribution:

#The price distribution is right-skewed, meaning a few expensive cars increase the average.


Correlation Analysis:

#Age and KM are negatively correlated with Price, indicating older and higher mileage cars tend to be cheaper.

#HP, Weight, and CC have a positive correlation with Price, meaning higher engine power and larger vehicles are priced higher.

#Cylinders has no variation (all values are 4), so this feature carries no information and was removed.

#Doors and Gears have weak correlations with Price, which might make them less important predictors.

#Several predictors are correlated with each other, so multicollinearity is likely → requires VIF analysis.


Fuel Type Handling:

#Fuel_Type is categorical, so we need to encode it before using it in the regression model.

#Fuel_Type encoded into two binary columns: Fuel_Type_Diesel and Fuel_Type_Petrol (with CNG as the baseline).
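
As a small illustration of why CNG ends up as the baseline, the sketch below applies the same `pd.get_dummies` call to a toy frame. The four rows are made up for illustration; only the three category names come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical rows); the category names mirror the dataset's Fuel_Type values.
toy = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"]})

# drop_first=True drops the first category in sorted order (CNG),
# so CNG becomes the baseline: a CNG car has 0 in both dummy columns.
encoded = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True, dtype=int)
print(encoded)
```

Each dummy coefficient in the regression is then read as a price difference relative to a CNG car, all else equal.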


In [None]:
# Define independent variables (features) and dependent variable (target)
X = df.drop(columns=["Price"])  # Features
y = df["Price"]  # Target variable

# Split dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

# Display the shape of the training and testing sets
X_train.shape, X_test.shape


The dataset has been successfully split:

Training set: 1,148 samples
Testing set: 287 samples
Features: 10 variables

In [None]:
# Adding constant for OLS regression
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

In [None]:
#========== Variance Inflation Factor (VIF) for Multicollinearity =========
def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
    return vif_data
# Note: X_train has no intercept column here, so these are uncentered VIFs
vif_baseline = calculate_vif(X_train)
print("\nVIF for Baseline Model:\n", vif_baseline)


In [None]:
# ========== Model 1: Baseline Multiple Linear Regression ==========
model_1 = sm.OLS(y_train, X_train_const).fit()
y_pred_1 = model_1.predict(X_test_const)

In [None]:
print("\nModel 1 Summary:\n", model_1.summary())

In [None]:
print("\nModel 1 Coefficients:")
print(model_1.params)

Baseline Model (Model 1)

Includes all features.

Helps determine initial variable importance.

In [None]:
# ========== Model 2: Feature Selection (Dropping Insignificant Features) ==========
# Dropping statistically insignificant features based on p-values
pvalues = model_1.pvalues.drop("const")  # exclude the intercept
significant_features = pvalues[pvalues < 0.05].index  # Select features with p-value < 0.05
X_train_fs = sm.add_constant(X_train[significant_features])
X_test_fs = sm.add_constant(X_test[significant_features])

# Fit new model
model_2 = sm.OLS(y_train, X_train_fs).fit()
y_pred_2 = model_2.predict(X_test_fs)

In [None]:
# Print OLS Summary
print("\nModel 2 Summary (Feature Selection):\n", model_2.summary())

In [None]:
print("\nModel 2 Coefficients:")
print(model_2.params)

Feature Selection Model (Model 2)

Keeps only the features whose p-value in the baseline model is below 0.05.

Improves interpretability while keeping performance close to the baseline.

In [None]:
X_train_int = X_train.copy()
X_test_int = X_test.copy()


In [None]:
# Creating new interaction features
X_train_int["Age_KM"] = X_train["Age_08_04"] * X_train["KM"]
X_test_int["Age_KM"] = X_test["Age_08_04"] * X_test["KM"]
X_train_int["HP_Weight"] = X_train["HP"] * X_train["Weight"]
X_test_int["HP_Weight"] = X_test["HP"] * X_test["Weight"]


In [None]:
# Adding constant
X_train_int_const = sm.add_constant(X_train_int)
X_test_int_const = sm.add_constant(X_test_int)

In [None]:
# Fit new model
model_3 = sm.OLS(y_train, X_train_int_const).fit()
y_pred_3 = model_3.predict(X_test_int_const)

In [None]:
#Print OLS Summary
print("\nModel 3 Summary (With Interaction Terms):\n", model_3.summary())


In [None]:
# Extract Coefficients
print("\nModel 3 Coefficients:")
print(model_3.params)

Interaction Terms Model (Model 3)

Adds Age × KM and HP × Weight interaction terms to capture effects that depend on two features jointly.

Improves R² and predictive performance over the baseline.
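
To illustrate why a product term such as Age × KM can help, here is a minimal sketch on synthetic data (not the Toyota dataset). The target is generated with a true interaction effect, and the helper `r_squared` is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
# Synthetic target whose slope in x1 depends on x2 (a true interaction effect).
y = 5 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rng.normal(0, 1, n)

def r_squared(X, y):
    """In-sample R² of an OLS fit with an intercept."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

r2_main = r_squared(np.column_stack([x1, x2]), y)            # main effects only
r2_inter = r_squared(np.column_stack([x1, x2, x1 * x2]), y)  # plus the product term
print(f"main effects: {r2_main:.4f}, with interaction: {r2_inter:.4f}")
```

A purely additive model cannot represent a slope that changes with another feature; the product column gives it that flexibility.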


const: The intercept, the estimated price when all features are 0 (not meaningful on its own).

Age_08_04: Older cars tend to be cheaper.

KM: Higher mileage means lower resale value.

HP: More power means a higher price.

Automatic: Automatic cars are more expensive than manual ones.

CC (Cylinder Volume): Slight negative impact per unit increase, possibly due to fuel efficiency concerns.

Doors: Each additional door reduces price (may indicate consumer preference for 3-door cars in this dataset).

Gears: More gears increase price, likely due to better performance.

Weight: Each extra kg increases price. Heavier cars often have better build quality.

Fuel_Type_Diesel: Diesel cars are more expensive than CNG (baseline category).

Fuel_Type_Petrol: Petrol cars are more expensive than CNG.

In [288]:
# ========== Model Evaluation ==========
def model_evaluation(y_test, y_pred, model_name):
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"\n{model_name}: RMSE = {rmse:.2f}, R² = {r2:.4f}")

# Evaluating all models
model_evaluation(y_test, y_pred_1, "Baseline MLR")
model_evaluation(y_test, y_pred_2, "Feature Selection MLR")
model_evaluation(y_test, y_pred_3, "Interaction Terms MLR")


Baseline MLR: RMSE = 1290.13, R² = 0.8626

Feature Selection MLR: RMSE = 1293.01, R² = 0.8620

Interaction Terms MLR: RMSE = 1141.06, R² = 0.8925


Model Evaluation

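R² (Coefficient of Determination): Proportion of the variance in Price explained by the model; higher is better.

RMSE (Root Mean Squared Error): Typical size of the prediction error, in the same units as Price; lower is better.

Model 3 (Interaction Terms) performs best, with the lowest RMSE (1141.06) and the highest R² (0.8925) on the test set.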
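
The two metrics can be written out by hand as a sanity check. The sketch below uses four made-up prices (not taken from the dataset) and hypothetical helper names `rmse` and `r2`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical prices: every prediction is off by exactly 500.
y_true = np.array([10000.0, 12000.0, 8000.0, 15000.0])
y_pred = np.array([10500.0, 11500.0, 8500.0, 14500.0])

rmse_value = rmse(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(f"RMSE = {rmse_value:.2f}, R² = {r2_value:.4f}")
```

Since every error here is 500, the RMSE is 500; R² depends on how large that error is relative to the spread of the true prices.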

In [None]:
# ========== AIC (Akaike Information Criterion) for Model Comparison ==========
print(f"\nAIC Scores:")
print(f"Baseline Model AIC: {model_1.aic}")
print(f"Feature Selection Model AIC: {model_2.aic}")
print(f"Interaction Model AIC: {model_3.aic}")


In [None]:
# Homoscedasticity Check
residuals_1 = model_1.resid  # training residuals of the baseline model
plt.scatter(model_1.fittedvalues, residuals_1)
plt.axhline(y=0, color='red', linestyle='--')
plt.title("Residuals vs. Fitted Values (Model 1)")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.residplot(x=model_1.fittedvalues, y=model_1.resid, lowess=True, line_kws={'color': 'red'})
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot for Baseline Model")
plt.show()

Residual Analysis

Residual vs. Fitted plot checks assumptions of homoscedasticity.

If residuals show patterns, model improvements are needed.
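
A visual check can be complemented with a numeric one. The sketch below, on synthetic data, computes a simplified Breusch-Pagan-style statistic (the n·R² form, from regressing squared residuals on the predictor). The helper names are hypothetical; statsmodels also provides `het_breuschpagan` in `statsmodels.stats.diagnostic`, which would be the natural choice in the notebook itself.

```python
import numpy as np

def ols_fit(X, y):
    """Return fitted values and residuals of an OLS fit with an intercept."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    fitted = Xc @ beta
    return fitted, y - fitted

def breusch_pagan_lm(X, resid):
    """LM statistic: n * R² of the regression of squared residuals on X."""
    sq = resid ** 2
    _, aux_resid = ols_fit(X, sq)
    r2_aux = 1 - aux_resid @ aux_resid / np.sum((sq - sq.mean()) ** 2)
    return len(sq) * r2_aux

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 10, n)
y_homo = 2 + 3 * x + rng.normal(0, 2, n)        # constant error variance
y_hetero = 2 + 3 * x + rng.normal(0, 1, n) * x  # error spread grows with x

lm_homo = breusch_pagan_lm(x, ols_fit(x, y_homo)[1])
lm_hetero = breusch_pagan_lm(x, ols_fit(x, y_hetero)[1])
# With one predictor, compare against the chi-square(1) 5% critical value of 3.841.
print(f"LM (homoscedastic): {lm_homo:.2f}, LM (heteroscedastic): {lm_hetero:.2f}")
```

A statistic well above the critical value suggests the residual variance changes with the predictors, which is the pattern a funnel-shaped residual plot shows.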

In [290]:
# Apply Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_int, y_train)
ridge_r2 = r2_score(y_test, ridge_model.predict(X_test_int))

# Apply Lasso Regression
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train_int, y_train)
lasso_r2 = r2_score(y_test, lasso_model.predict(X_test_int))

# Print Ridge & Lasso Performance
print("\nRidge Regression R²:", ridge_r2)
print("Lasso Regression R²:", lasso_r2)


Ridge Regression R²: 0.8924075622355321
Lasso Regression R²: 0.8919120937149712


Lasso & Ridge Regression

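Ridge shrinks coefficients toward zero while keeping all variables, which can reduce overfitting.

Lasso can shrink some coefficients exactly to zero, effectively dropping less useful variables.

Both Ridge (R² = 0.8924) and Lasso (R² = 0.8919) performed almost identically to the Interaction Terms MLR (R² = 0.8925); with alpha = 1.0, regularization did not improve the test R².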
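
The Ridge and Lasso penalties act on coefficient size, so they are sensitive to feature scale, and features such as KM and Age live on very different scales. The sketch below shows, on synthetic data, how Ridge shrinks coefficients once the features are standardized. It uses the closed-form Ridge solution; the function name `ridge_coefs` and the data-generating numbers are assumptions for illustration only.

```python
import numpy as np

def ridge_coefs(X, y, alpha):
    """Closed-form Ridge coefficients on standardized features and a centered target."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ yc)

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(1, 80, n)           # months (synthetic)
km = rng.uniform(1000, 200000, n)     # kilometres (synthetic)
price = 20000 - 120 * age - 0.03 * km + rng.normal(0, 500, n)
X = np.column_stack([age, km])

alphas = [0.0, 10.0, 1000.0]
norms = [np.linalg.norm(ridge_coefs(X, price, a)) for a in alphas]
for a, nrm in zip(alphas, norms):
    print(f"alpha = {a:>7}: coefficient norm = {nrm:.1f}")
```

With alpha = 0 this reduces to ordinary least squares; larger alpha values pull the coefficient vector toward zero. In scikit-learn, the same idea is usually expressed by placing a `StandardScaler` before `Ridge` or `Lasso` in a `Pipeline`.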

INSIGHTS:-
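
 Baseline Model shows good performance (test R² = 0.8626).

 Feature selection simplifies the model with almost no loss in performance.

 Interaction terms improve prediction (test R² = 0.8925).

 Ridge & Lasso guard against overfitting, but gave no measurable gain over the interaction model here.

 Residual analysis is used to check model assumptions such as homoscedasticity.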

Final Conclusion:


The Interaction Terms MLR model was the best among the standard MLR models.
    
Regularization (Ridge/Lasso) gave nearly identical test R², so its stabilizing effect is small on this split, though it adds protection against overfitting.
    
Final Recommendation: Use Ridge Regression with interaction terms for the best balance of accuracy and stability.

Interview Questions:

1. What are Normalization and Standardization, and how are they helpful?

ANS:

Normalization:

Definition: Rescaling the features to a fixed range, usually [0, 1].

Use Case: When the data doesn't follow a normal distribution or when you're using methods like KNN, SVM, or neural networks that are sensitive to scale.

Standardization:

Definition: Rescaling data to have a mean of 0 and standard deviation of 1.

Use Case: When the data roughly follows a normal distribution or you're using linear models, logistic regression, PCA, etc.

How they help:

Prevents features with larger scales from dominating.

Improves convergence speed and performance of gradient-based algorithms.

Makes model weights easier to compare across features.

Ensures fair distance calculation for distance-based algorithms.
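
A minimal sketch of both rescalings on a made-up mileage column (the values are hypothetical, chosen only to show the arithmetic):

```python
import numpy as np

km = np.array([10000.0, 45000.0, 72000.0, 120000.0])  # hypothetical mileages

# Normalization (min-max scaling): maps the smallest value to 0 and the largest to 1.
normalized = (km - km.min()) / (km.max() - km.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
standardized = (km - km.mean()) / km.std()

print("normalized:  ", np.round(normalized, 3))
print("standardized:", np.round(standardized, 3))
```

In scikit-learn the same transformations are available as `MinMaxScaler` and `StandardScaler`; fitting them on the training set only avoids leaking test-set information.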


2. Techniques to Address Multicollinearity in Multiple Linear Regression

Multicollinearity occurs when independent variables are highly correlated, making it hard to determine their individual effect on the dependent variable.

Techniques to Handle It:

Remove Highly Correlated Predictors:

Use correlation matrix or VIF (Variance Inflation Factor) to identify and drop variables.

Rule of thumb: Drop variables with VIF > 5 or 10.

Principal Component Analysis (PCA):

Transforms correlated features into a smaller set of uncorrelated components.

Ridge Regression (L2 Regularization):

Penalizes large coefficients to reduce their impact, helping with multicollinearity.

Partial Least Squares Regression (PLS):

Similar to PCA but considers the response variable when projecting components.

Combine Features:

If two features are highly correlated, combine them using mathematical operations like sum or mean.

Domain Knowledge:

Drop or prioritize variables based on business relevance or practical implications.
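
To make the VIF rule of thumb concrete, the sketch below computes VIF from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on the other features with an intercept. The data are synthetic, and the function name `vif` is a hypothetical helper, not the statsmodels function.

```python
import numpy as np

def vif(X):
    """VIF for each column: 1 / (1 - R²) from regressing it on the other columns."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2_j = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1 / (1 - r2_j))
    return np.array(out)

rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a
c = rng.normal(size=300)                 # unrelated to a and b
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The two near-duplicate columns get VIFs far above 10, while the unrelated column stays close to 1, which is the pattern the "VIF > 5 or 10" rule is designed to flag.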



