## ***Importing Libraries***

**First things first: we import all the libraries we’ll need for this project. These tools cover working with data, running statistical tests, creating visualizations, preparing features, and building prediction models. Basically, this setup gives us everything we need to go from raw data to useful insights.**

In [1]:
# Data handling
import numpy as np
import pandas as pd

# Visualization
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.figure_factory import create_distplot as ff

# Statistical tests and transformations
from scipy.stats import anderson, boxcox, boxcox_normmax, kruskal
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Preprocessing and modelling
from sklearn import metrics
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

In [2]:
cars_df = pd.read_csv('car data.csv')

***What's in the Dataset?***

| Variable         | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| Car_Name         | Name of the car.                                                            |
| Year             | The year the car was purchased.                                             |
| Selling_Price    | Price the car is being sold for (in lakhs).                                |
| Present_Price    | Current ex-showroom price of the car.                                       |
| Kms_Driven       | Number of kilometers the car has been driven.                              |
| Fuel_Type        | Type of fuel used by the car (e.g., Petrol, Diesel).                       |
| Seller_Type      | Indicates whether the seller is a dealer or an individual.                 |
| Transmission     | Type of transmission (Manual or Automatic).                                |
| Owner            | Number of previous owners the car has had.                                 |


## ***Data Snapshot***

In [3]:
display(cars_df.head())

features = cars_df.shape[1]
observations = cars_df.shape[0]

print(f'\n\033[1mInference:\033[0m The dataset has {features} features and {observations} observations\n')

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0

Inference: The dataset has 9 features and 301 observations



In [4]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


### ***Quick Data Check:***
**The data looks pretty solid: no missing values, and every column has a sensible dtype. But with only 301 observations, there’s a chance of overfitting when we build the price prediction model. Given how widely car prices and mileage vary, we might run into some outliers. As for the `Car_Name` column, I’m a bit unsure about its usefulness, so we might remove it, but we’ll double check that later.**


In [5]:
# Age relative to the year after the newest car in the data (vectorized, no per-row lambda)
cars_df['Age'] = cars_df['Year'].max() + 1 - cars_df['Year']
cars_df = cars_df.drop('Year', axis=1)
cars_df = cars_df[['Age'] + [col for col in cars_df.columns if col != 'Age']]
cars_df.head()

Unnamed: 0,Age,Car_Name,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,5,ritz,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,6,sx4,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,2,ciaz,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,8,wagon r,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,5,swift,4.6,6.87,42450,Diesel,Dealer,Manual,0


**When talking about cars, it’s easier to work with age than with the raw year. Here `Age` is the number of years between each car’s `Year` and the year after the newest car in the dataset, so the newest cars get an age of 1. Age measures elapsed time directly, which is what depreciation depends on, and it keeps the feature on a small scale. Keep in mind that the reference point comes from the data itself, so it has to be pinned to a fixed year if the model is ever reused on newer listings.**
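As a minimal sketch of the age calculation with a pinned reference point, the snippet below uses a toy frame and a hypothetical `REFERENCE_YEAR` constant (an assumption for illustration, not something the notebook defines). It is chosen so that the ages match the ones shown in the output above.

```python
import pandas as pd

# Toy frame standing in for the real data (years taken from the first rows shown above).
toy = pd.DataFrame({"Year": [2014, 2017, 2011]})

# Hypothetical fixed reference year; the notebook derives it as max(Year) + 1 instead.
REFERENCE_YEAR = 2019

# Vectorized subtraction, no .apply needed.
toy["Age"] = REFERENCE_YEAR - toy["Year"]

print(toy["Age"].tolist())
```

With a fixed constant, the same car always gets the same age, no matter which other cars happen to be in the file.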


## ***Data Stats***


In [6]:
cars_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,301.0,5.372093,2.891554,1.0,3.0,5.0,7.0,16.0
Selling_Price,301.0,4.661296,5.082812,0.1,0.9,3.6,6.0,35.0
Present_Price,301.0,7.628472,8.644115,0.32,1.2,6.4,9.9,92.6
Kms_Driven,301.0,36947.20598,38886.883882,500.0,15000.0,32000.0,48767.0,500000.0
Owner,301.0,0.043189,0.247915,0.0,0.0,0.0,0.0,3.0


- **Age**: Looks good. The distribution and average seem fine.
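- **Selling_Price**: The standard deviation (5.08) is larger than the mean (4.66), and the max (35.0) sits far above the 75th percentile (6.0), so expect some outliers.
- **Present_Price**: The max of 92.6 is far above the 75th percentile (9.9); it might be a mistake or a rare luxury car.
- **Kms_Driven**: Some outliers here too (the max is 500,000 km), but cutting off every car above a threshold like 200,000 km doesn’t really make sense.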
- **Owner**: 96% of the cars have no previous owner.

In [7]:
cars_df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Car_Name,301,98,city,26
Fuel_Type,301,3,Petrol,239
Seller_Type,301,2,Dealer,195
Transmission,301,2,Manual,261


- **Car_Name**: There are 98 unique values out of 301, so I’m thinking of removing it. I'll check the stats to see if it affects the `Selling_Price` first.
- **Fuel_Type, Seller_Type, Transmission**: No problems here. Since these are categories, we’ll just convert them to numbers using One-Hot Encoding when we build the model.

In [8]:
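# Build one array of selling prices per car name. Use `car_data` (the group), not `cars_df`:
# passing the full column for every group makes all groups identical, which is what
# produced the 0.00 / 1.00 output recorded below.
selling_price_groups = [car_data["Selling_Price"].values for _, car_data in cars_df.groupby("Car_Name")]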


stats,p_val = kruskal(*selling_price_groups)

print(f"Kruskal test statistic: {stats:.2f}")
print(f"Kruskal test p-value: {p_val:.2f}")

if p_val < 0.05:
    print("Car names have a significant effect on selling price")
else:
    print("Car names have no significant effect on selling price")


Kruskal test statistic: 0.00
Kruskal test p-value: 1.00
Car names have no significant effect on selling price


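**The output above can’t be taken at face value: a statistic of exactly 0 and a p-value of exactly 1 is what you get when every group passed to the test contains the same values, so this run says nothing about whether car names matter. Even so, with 98 unique names across 301 rows, most names appear only a handful of times, and one-hot encoding them would add far more columns than the data can support. So we’ll go ahead and drop the column.**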
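To make the difference concrete, here is a small sketch on made-up data (the toy names and prices are assumptions for illustration). It runs the Kruskal-Wallis test once with proper per-group arrays and once with the same full column repeated for every group.

```python
import pandas as pd
from scipy.stats import kruskal

# Made-up data: three car names with clearly different price levels.
toy = pd.DataFrame({
    "Car_Name": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "Selling_Price": [1.0, 1.2, 1.1, 5.0, 5.5, 5.2, 9.0, 9.5, 9.8],
})

# One array per group, taken from the group itself.
groups = [g["Selling_Price"].to_numpy() for _, g in toy.groupby("Car_Name")]
stat, p_val = kruskal(*groups)

# The same full column repeated three times: the groups cannot differ.
same = [toy["Selling_Price"].to_numpy()] * 3
stat_same, p_same = kruskal(*same)

print(f"per-group arrays: H={stat:.2f}, p={p_val:.3f}")
print(f"identical arrays: H={stat_same:.2f}, p={p_same:.3f}")
```

The second call is a useful sanity check: whenever the statistic comes back as exactly zero, it is worth inspecting how the groups were built.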


In [9]:
cars_df = cars_df.drop(['Car_Name'], axis=1)

## ***Getting the Data in Shape***

In [10]:
for col in cars_df.columns:
    pct_missing = cars_df[col].isnull().mean() * 100
    print(f'{col} - {pct_missing:.2f}%')

Age - 0.00%
Selling_Price - 0.00%
Present_Price - 0.00%
Kms_Driven - 0.00%
Fuel_Type - 0.00%
Seller_Type - 0.00%
Transmission - 0.00%
Owner - 0.00%


**Like I said earlier, there are no missing values in the data.**

In [11]:
duplicate_rows = cars_df[cars_df.duplicated(keep=False)]

duplicate_rows_sorted = duplicate_rows.sort_values(by=list(cars_df.columns))

duplicate_rows_sorted.head()

Unnamed: 0,Age,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
15,3,7.75,10.79,43000,Diesel,Dealer,Manual,0
17,3,7.75,10.79,43000,Diesel,Dealer,Manual,0
51,4,23.0,30.61,40000,Diesel,Dealer,Automatic,0
93,4,23.0,30.61,40000,Diesel,Dealer,Automatic,0


**There are some duplicate values, probably from manual entry mistakes or from merging data pulled automatically using APIs or scraping.**

In [12]:
print(f'The dataset contains {duplicate_rows.shape[0]} duplicate rows that need to be removed.') 

cars_df.drop_duplicates(inplace=True)

The dataset contains 4 duplicate rows that need to be removed.


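**The 4 rows above form 2 duplicated pairs, so `drop_duplicates` keeps one row from each pair and removes the other 2. Gotta drop them, because duplicates can mess with the accuracy, throw off the averages, and might even lead us to make bad calls.**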
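The difference between "rows involved in duplication" and "rows actually removed" is easy to see on a toy frame (values below are made up for illustration):

```python
import pandas as pd

# Two duplicated pairs plus one unique row.
toy = pd.DataFrame({
    "Age": [3, 3, 4, 4, 5],
    "Selling_Price": [7.75, 7.75, 23.0, 23.0, 4.6],
})

# keep=False marks every member of a duplicated group.
involved = int(toy.duplicated(keep=False).sum())

# The default keep='first' marks only the extra copies, which is what gets dropped.
extra = int(toy.duplicated().sum())

deduped = toy.drop_duplicates()

print(involved, extra, len(deduped))
```

So a count based on `keep=False` is an upper bound on how many rows share values with another row, not the number of rows that `drop_duplicates` removes.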

In [13]:
numric_col = cars_df.select_dtypes(include=['number']).columns
categorical_col = cars_df.select_dtypes(include=['object']).columns

target_col = 'Selling_Price'
outliers_indexes = []

def detect_outliers_iqr(data,columns):
    outlier_indices = []
    for col in columns:
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        IQR = q3 - q1
        lower_bound = q1 - 1.5 * IQR
        upper_bound = q3 + 1.5 * IQR
        
        outliers_samples = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
        outlier_indices.extend(outliers_samples.index.tolist())
    return outlier_indices


def detect_outliers_categorical(data,columns,target):
    outlier_indices = []
    for col in columns:
        for cat in data[col].unique():
            data_cat = data[data[col] == cat]
            q1 = data_cat[target].quantile(0.25)
            q3 = data_cat[target].quantile(0.75)
            IQR = q3 - q1
            lower_bound = q1 - 1.5 * IQR
            upper_bound = q3 + 1.5 * IQR
            
            outliers_samples = data_cat[(data_cat[target] < lower_bound) | (data_cat[target] > upper_bound)]
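            # `list.extend` returns None, so accumulate in place instead of reassigning.
            # (The original reassignment made this function return None, which means the
            # 37 indices recorded below come from the numeric columns only.)
            outlier_indices.extend(outliers_samples.index.tolist())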
    return outlier_indices

numeric_outlier_indices = detect_outliers_iqr(cars_df, numric_col)
categorical_outlier_indices = detect_outliers_categorical(cars_df, categorical_col, target_col)

outliers_indexes = list(set((numeric_outlier_indices or []) + (categorical_outlier_indices or [])))



print(f"{len(outliers_indexes)} outliers were identified, whose indices are:\n")
for i, index in enumerate(outliers_indexes, start=1):
    print(f"{index:>6}", end="  ") 
    if i % 10 == 0:
        print()


37 outliers were identified, whose indices are:

    37      39      50      51      52      53      54     179     184      58  
    59     189      62      63      64     191      66     192     196      69  
   193     198     201      77      78      79      80     205      82      84  
    85      86      92      96      97     106     241  

**There are 37 outliers, but let’s not rush to remove them. Not everything unusual needs to be deleted. The outliers can actually represent rare, but real cases. They're natural and can even reveal patterns we might have missed. Of course, they could also be data entry mistakes, so we’ll need to check them carefully before deleting.**
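For reference, the IQR rule used above boils down to a few lines. This sketch runs it on a made-up array so the fences can be checked by hand; it flags values rather than deleting them.

```python
import numpy as np

# Made-up values with one obvious extreme point.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (values < lower_bound) | (values > upper_bound)
outliers = values[is_outlier]

print(lower_bound, upper_bound, outliers.tolist())
```

Keeping the boolean mask around makes it easy to inspect the flagged rows first and decide case by case.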

In [14]:
removing_indexes = []
removing_indexes.extend(cars_df[cars_df[target_col]>33].index)
removing_indexes.extend(cars_df[cars_df['Kms_Driven'] > 400000].index)
removing_indexes = list(set(removing_indexes))
cars_df.drop(removing_indexes, inplace=True)
cars_df.reset_index(drop=True, inplace=True)
removing_indexes

[196, 86]

**So, I’ll go ahead and remove only the values that clearly don’t fit the general pattern: rows with a `Selling_Price` above 33 or `Kms_Driven` above 400,000, two rows in total. Since we’ll be building a linear regression model later, it’s important to handle outliers carefully, as linear regression is sensitive to them.**

In [15]:
cars_df.sample(5, random_state=1)

Unnamed: 0,Age,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
138,3,0.6,0.8,20000,Petrol,Individual,Manual,0
233,4,11.25,13.6,68000,Diesel,Dealer,Manual,0
51,2,18.0,19.77,15000,Diesel,Dealer,Automatic,0
290,5,3.75,6.8,33019,Petrol,Dealer,Manual,0
242,7,3.75,6.79,35000,Petrol,Dealer,Manual,0


## ***Data Dive***

In [16]:
def numeric_col_eda(df):
    numeric_col = df.select_dtypes(include=['int64', 'float64']).columns
    num_cols = len(numeric_col)
    
    rows = (num_cols // 2) + (num_cols % 2)
    fig =  sp.make_subplots(rows=rows, cols=2, subplot_titles=numeric_col)
    
    for i,col in enumerate(numeric_col):
        hist_col = [df[col].tolist()]
        group_col = [col]
        
        min_val, max_val = np.floor(df[col].min()), np.ceil(df[col].max())
        values, bin_edges = np.histogram(df[col], bins=10, range=(min_val, max_val)) 
        
        bin_size = np.diff(bin_edges).mean()
        color = '#172f5a'
        
        displot = ff(hist_col, group_col, bin_size=bin_size, show_hist=True, show_rug=False, colors=[color])
        
        for trace in displot['data']:
            fig.add_trace(trace, row=(i // 2) + 1, col=(i % 2) + 1)
            
    fig.update_layout(
        width= 800,
        height= 600,
        title_text = 'Distribution of Numeric Features',
        plot_bgcolor = 'rgba(0,0,0,0)',
        showlegend = False,
        font = dict(family='Arial',size=12,color='black')
        )
    
    fig.show()
numeric_col_eda(cars_df) 

<p align="left">
  <img src="My VS\plot1.png">
  <br>
</p>  

**All numeric columns are right-skewed, meaning most values are small or medium, with a few large ones stretching the tail to the right. This pulls the mean above the median, so the values aren’t evenly spread around the mean.**
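A quick numeric check can back up what the histograms show. The sketch below uses a made-up series with one large value and compares the mean, the median and pandas' sample skewness.

```python
import pandas as pd

# Made-up values: mostly small, with one large value in the right tail.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

mean = s.mean()
median = s.median()
skewness = s.skew()  # positive for a right-skewed sample

print(f"mean={mean:.2f}, median={median:.2f}, skew={skewness:.2f}")
```

On the real data the same three numbers could be computed per column, for example with `cars_df.skew(numeric_only=True)`.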


In [17]:
def compare_target_with_numerics(df, target_col):
    # Exclude the target first so the subplot grid is sized for the plotted columns only.
    numeric_col = [col for col in df.select_dtypes(include=['int64', 'float64']).columns
                   if col != target_col]
    num_cols = len(numeric_col)
    
    rows = (num_cols // 2) + (num_cols % 2)
    subplot_titles = [f"{target_col} vs {col}" for col in numeric_col]
    fig = sp.make_subplots(rows=rows, cols=2, subplot_titles=subplot_titles)
    
    color = '#172f5a'
    
    for i, col in enumerate(numeric_col):
        row_pos = (i // 2) + 1
        col_pos = (i % 2) + 1
        scatter_fig = px.scatter(df, x=col, y=target_col, color_discrete_sequence=[color])
        trace = scatter_fig.data[0]
        fig.add_trace(trace, row=row_pos, col=col_pos)
        
    fig.update_layout(
        width=800,
        height=600,
        title_text=f'{target_col} vs Numeric Features',
        plot_bgcolor='rgba(0,0,0,0)',
        showlegend=False,
        font=dict(family='Arial', size=12, color='black')
    )
    
    fig.show()
    
compare_target_with_numerics(cars_df, target_col)

<p align="left">
  <img src="My VS\plot2.png">
  <br>
</p>  

**Present_Price and Selling_Price are positively correlated. This means that as one increases, so does the other. For the rest, the relationship is negative, meaning as the values go up, the selling price goes down.**


In [18]:
def categorical_cols_eda(df):
    categorical_cols = df.select_dtypes(include=['object']).columns
    num_cols = len(categorical_cols)
    rows = (num_cols // 2) + (num_cols % 2)
    fig = sp.make_subplots(rows=rows, cols=2, subplot_titles=categorical_cols)

    color = '#172f5a'
    
    for i, col in enumerate(categorical_cols):
        row_pos = (i // 2) + 1
        col_pos = (i % 2) + 1
        counts = df[col].value_counts()
        trace = go.Bar(x=counts.index, y=counts.values, marker_color=color, name=col)
        fig.add_trace(trace, row=row_pos, col=col_pos)
        
    
        
    fig.update_layout(
        width=800,
        height=600,
        title_text='Categorical Features',
        plot_bgcolor = 'rgba(0,0,0,0)',
        showlegend=False,
        font=dict(family='Arial', size=12, color='black')
    )
    
    fig.show()

categorical_cols_eda(cars_df)

<p align="left">
  <img src="My VS\plot3.png">
  <br>
</p>  

**In the Fuel_Type column, Petrol is the most common, while CNG is the least. For Seller_Type, Dealer appears more than Individual. In the Transmission column, Manual is more frequent than Automatic.**

In [19]:
def compare_target_with_categoricals(df, target_col):
    categorical_cols= df.select_dtypes(include=['object']).columns
    num_cols = len(categorical_cols)
    
    rows = (num_cols // 2) + (num_cols % 2)
    subplot_titles = [f"{target_col} vs {col}" for col in categorical_cols]
    fig = sp.make_subplots(rows=rows, cols=2, subplot_titles=subplot_titles)
    
    color = '#172f5a'
    
    for i,col in enumerate(categorical_cols):
        row_pos = (i // 2) + 1
        col_pos = (i % 2) + 1
        box_fig = px.box(df,x=col,y=target_col,color_discrete_sequence=[color])
        trace = box_fig.data[0]
        fig.add_trace(trace,row=row_pos,col=col_pos)
        
    fig.update_layout(
        width=800,
        height=600,
        title_text=f'{target_col} vs Categorical Variables',
        plot_bgcolor='rgba(0,0,0,0)',
        showlegend=False,
        font=dict(family='Arial', size=12, color='black')
    )
    
    fig.show()
    
compare_target_with_categoricals(cars_df, target_col)

<p align="left">
  <img src="My VS\plot4.png">
  <br>
</p>  

**Diesel cars sell for the most, followed by CNG and then Petrol. Cars sold by Dealers are more expensive than those sold by Individuals. Automatic cars have a higher price than Manual ones.**

## ***Model Setup***

- Encoding the Categorical Stuff

In [20]:
categorical_features = ['Fuel_Type', 'Seller_Type', 'Transmission']

encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_columns = encoder.fit_transform(cars_df[categorical_features])

encoded_column_names = encoder.get_feature_names_out(categorical_features)
encoded_df = pd.DataFrame(encoded_columns, columns=encoded_column_names)

cars_df_encoded = pd.concat([cars_df.drop(columns=categorical_features), encoded_df], axis=1)

cars_df_encoded.head()

Unnamed: 0,Age,Selling_Price,Present_Price,Kms_Driven,Owner,Fuel_Type_Diesel,Fuel_Type_Petrol,Seller_Type_Individual,Transmission_Manual
0,5,3.35,5.59,27000,0,0.0,1.0,0.0,1.0
1,6,4.75,9.54,43000,0,1.0,0.0,0.0,1.0
2,2,7.25,9.85,6900,0,0.0,1.0,0.0,1.0
3,8,2.85,4.15,5200,0,0.0,1.0,0.0,1.0
4,5,4.6,6.87,42450,0,1.0,0.0,0.0,1.0


- Train-Test Split Time

In [21]:
x = cars_df_encoded.drop('Selling_Price', axis=1)
y = cars_df_encoded['Selling_Price']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

- Setting Up the Pipeline

In [22]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_reg', LinearRegression())    
])
pipe.fit(x_train, y_train)

**We used a pipeline to make sure the scaler only fits on the training data, not the test data. This helps avoid data leakage, when test data sneaks into training by mistake. That kind of leak can make the model look way more accurate than it really is.**
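One way to convince yourself that the scaler only sees the training rows is to compare its learned mean with the training mean. This is a minimal sketch on synthetic data (the arrays and coefficients are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_reg", LinearRegression()),
])
pipe.fit(X_train, y_train)

# The fitted scaler stores the statistics of whatever data it was fitted on.
scaler_mean = pipe.named_steps["scaler"].mean_
train_mean = X_train.mean(axis=0)

print(scaler_mean, train_mean)
print(f"test R²: {pipe.score(X_test, y_test):.3f}")
```

When `pipe.predict` or `pipe.score` is called on the test set, the scaler only applies the stored training statistics; it is never refitted.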


 - Checking How the Model Did on Training Data

In [23]:
train_score = pipe.score(x_train, y_train)
y_train_pred = pipe.predict(x_train)

print(f'MAE: {metrics.mean_absolute_error(y_train, y_train_pred):.2f}')
print(f'MSE: {metrics.mean_squared_error(y_train, y_train_pred):.2f}')
print(f'RMSE: {np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)):.2f}')
print(f"R²: {train_score:.2f}")

MAE: 1.07
MSE: 2.44
RMSE: 1.56
R²: 0.88


- Testing the Model Performance

In [24]:
y_pred = pipe.predict(x_test)

print(f"MAE: {metrics.mean_absolute_error(y_test, y_pred):.2f}")
print(f"MSE: {metrics.mean_squared_error(y_test, y_pred):.2f}")
print(f"RMSE: {np.sqrt(metrics.mean_squared_error(y_test, y_pred)):.2f}")
print(f"R^2: {pipe.score(x_test, y_test):.2f}")

MAE: 1.01
MSE: 2.83
RMSE: 1.68
R^2: 0.88


**Linear regression rests on a few assumptions. If the data doesn’t meet them, the results might be misleading and shouldn’t be relied on.**

- Linearity

In [25]:
residuals = y_test - y_pred

fig_actual = px.scatter(x=y_test, y=y_pred, trendline='lowess', color_discrete_sequence=['#172f5a'])
fig_resid = px.scatter(x=y_pred, y=residuals, trendline='lowess', color_discrete_sequence=['#172f5a'])

fig = sp.make_subplots(rows=1, cols=2, subplot_titles=('Actual vs Predicted', 'Residuals vs Predicted'))

for trace in fig_actual.data:
    fig.add_trace(trace, row=1, col=1)
for trace in fig_resid.data:
    fig.add_trace(trace, row=1, col=2)

fig.update_layout(title_text='Comparison of Actual vs Predicted and Residuals vs Predicted',
                  width=800, height=500,
                  plot_bgcolor='rgba(0,0,0,0)',
                  showlegend=False,
                  font=dict(family='Arial', size=12, color='black'))

# Left panel plots x=y_test (actual) against y=y_pred (predicted), so label it that way.
fig.update_xaxes(title_text='Actual', row=1, col=1)
fig.update_xaxes(title_text='Predicted', row=1, col=2)
fig.update_yaxes(title_text='Predicted', row=1, col=1)
fig.update_yaxes(title_text='Residuals', row=1, col=2) 

<p align="left">
  <img src="My VS\plot5.png">
  <br>
</p>  

**For linear regression, the relationship between the predictors and the target should be linear, meaning the graph should show a straight line or something close to it. Since it's not linear here, linear regression might not be the best choice.**

**One way to fix this is by using Polynomial Terms. It basically helps the model fit curves instead of forcing a straight line, which makes it more flexible with non linear patterns.**

- Normality of Residuals

**There are a few ways to check if residuals follow a normal distribution, but here I'll use a stats based method, the Anderson-Darling test.**


In [26]:
result = anderson(residuals)

print("Anderson-Darling Test Statistic:", result.statistic)
print("Critical Values:", result.critical_values)
print("Significance Levels:", result.significance_level)

if result.statistic < result.critical_values[2]: 
    print("Residuals appear to be normally distributed.")
else:
    print("Residuals do not appear to be normally distributed.")

Anderson-Darling Test Statistic: 3.993515427841146
Critical Values: [0.553 0.63  0.756 0.882 1.049]
Significance Levels: [15.  10.   5.   2.5  1. ]
Residuals do not appear to be normally distributed.


**The second assumption isn’t met either: the test statistic (3.99) is well above the 5% critical value (0.756). That’s another sign that plain linear regression isn’t the best choice. We can try a Box-Cox transformation to make the distribution closer to normal and reduce skewness.**
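Here is a minimal Box-Cox sketch on synthetic, strictly positive data (the log-normal sample is an assumption for illustration). It keeps the fitted lambda so that values can be mapped back to the original scale with `scipy.special.inv_boxcox`.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

# Synthetic right-skewed, strictly positive values.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=1.0, sigma=0.8, size=500)

# boxcox returns the transformed values and the fitted lambda.
transformed, lam = boxcox(prices)

# Keeping lambda lets us undo the transformation later.
restored = inv_boxcox(transformed, lam)

print(f"skew before: {skew(prices):.2f}")
print(f"skew after:  {skew(transformed):.2f}")
print(f"lambda:      {lam:.3f}")
```

Holding on to lambda matters when the target itself is transformed, because predictions then have to be converted back before they can be read as prices.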

- Residuals Shouldn't Be Related

In [27]:
dw_stat = durbin_watson(residuals)
print(f'Durbin-Watson statistic: {dw_stat:.3f}')
if dw_stat < 1.5:
    print('Significant positive autocorrelation')
elif dw_stat > 2.5:
    print('Significant negative autocorrelation')
else:
    print('Little to no autocorrelation')

Durbin-Watson statistic: 1.802
Little to no autocorrelation


**We used this test to check if the residuals are linked. If they were, it would mess with the model. But luckily, they're not connected, so all good.**
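The Durbin-Watson statistic is simple enough to compute by hand: the sum of squared differences between consecutive residuals, divided by the sum of squared residuals. The helper name `durbin_watson_manual` and the two residual sequences below are made up for illustration.

```python
import numpy as np

def durbin_watson_manual(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Residuals that flip sign every step: negative autocorrelation, statistic above 2.
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

# Residuals that stay on one side for a while: positive autocorrelation, statistic below 2.
trending = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

dw_alt = durbin_watson_manual(alternating)
dw_trend = durbin_watson_manual(trending)

print(f"alternating: {dw_alt:.3f}")
print(f"trending:    {dw_trend:.3f}")
```

Because the statistic compares each residual with the one before it, it depends on the order of the rows. That makes it most informative when the observations have a natural ordering.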


- Homoscedasticity

In [28]:
# Plot residuals against the predicted values (not the actual ones), matching the axis titles.
fig = px.scatter(x=y_pred, y=residuals, trendline='lowess', color_discrete_sequence=['#172f5a'])
fig.update_layout(title_text='Residuals vs Predicted Values',
                  width=800, height=500,
                  plot_bgcolor='rgba(0,0,0,0)',
                  showlegend=False,
                  font=dict(family='Arial', size=12, color='black'))
# Dashed reference line at zero residual
fig.add_shape(
    type='line',
    x0=y_pred.min(),
    x1=y_pred.max(),
    y0=0,
    y1=0,
    line=dict(color='black', dash='dash')
)
fig.update_xaxes(title_text='Predicted Values')
fig.update_yaxes(title_text='Residuals')

<p align="left">
  <img src="My VS\plot6.png">
  <br>
</p>  

**Here, the spread of the residuals should stay constant regardless of the predicted values. But in our case, that’s not happening. The LOWESS trend line (the solid one, not the dashed zero line) should be flat, and the points should scatter evenly around it, but they don’t.**

- Making the Data and Model Better

In [29]:
boxcox_transformed_data = cars_df_encoded.copy()

for col in numric_col:
    if (boxcox_transformed_data[col] > 0).all():  
        boxcox_transformed_data[col], _ = boxcox(boxcox_transformed_data[col])
    else:
        print(f"Skipping Box-Cox transformation for column '{col}' as it contains non-positive values.")


boxcox_transformed_data.head()

Skipping Box-Cox transformation for column 'Owner' as it contains non-positive values.


Unnamed: 0,Age,Selling_Price,Present_Price,Kms_Driven,Owner,Fuel_Type_Diesel,Fuel_Type_Petrol,Seller_Type_Individual,Transmission_Manual
0,1.569863,1.373237,2.078641,109.074611,0,0.0,1.0,0.0,1.0
1,1.742802,1.837986,2.89534,129.65272,0,1.0,0.0,0.0,1.0
2,0.685737,2.447546,2.947209,65.399718,0,0.0,1.0,0.0,1.0
3,2.013696,1.169186,1.66227,58.749579,0,0.0,1.0,0.0,1.0
4,1.569863,1.793866,2.382714,129.035489,0,1.0,0.0,0.0,1.0


In [30]:
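x = boxcox_transformed_data.drop('Selling_Price', axis=1)
y = boxcox_transformed_data['Selling_Price']
# Note: `x_train`, `x_test`, `y_train` and `y_test` are not re-split here, so the cells
# below still fit and score on the untransformed split from In [21]. Re-run
# train_test_split(x, y, test_size=0.3, random_state=0) to feed the Box-Cox data to the models.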

In [31]:
poly_pipe = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('linear_reg', LinearRegression())
])

poly_pipe.fit(x_train, y_train)

**We used PolynomialFeatures to help the model learn from the non linear relationship in our data. Since the relationship wasn’t linear, the model didn’t perform that well at first. But now, with polynomial terms added, the model can better fit the curves and capture more complex patterns.**


In [32]:
train_score_poly = poly_pipe.score(x_train, y_train)
y_train_pred_poly = poly_pipe.predict(x_train)

In [33]:
print(f"Training MAE: {metrics.mean_absolute_error(y_train, y_train_pred_poly):.2f}")
print(f"Training MSE: {metrics.mean_squared_error(y_train, y_train_pred_poly):.2f}")
print(f"Training RMSE: {np.sqrt(metrics.mean_squared_error(y_train, y_train_pred_poly)):.2f}")
print(f"Training R²: {train_score_poly:.2f}")

Training MAE: 0.43
Training MSE: 0.36
Training RMSE: 0.60
Training R²: 0.98


In [34]:
test_score_poly = poly_pipe.score(x_test, y_test)
y_test_pred_poly = poly_pipe.predict(x_test)

In [35]:
print(f"Test MAE: {metrics.mean_absolute_error(y_test, y_test_pred_poly):.2f}")
print(f"Test MSE:  {metrics.mean_squared_error(y_test, y_test_pred_poly):.2f}")
print(f"Test RMSE: {np.sqrt(metrics.mean_squared_error(y_test, y_test_pred_poly)):.2f}")
print(f"Test R²: {test_score_poly:.2f}")

Test MAE: 0.57
Test MSE:  0.91
Test RMSE: 0.95
Test R²: 0.96


**Now the model leveled up: test R² went from 0.88 to 0.96, and test RMSE dropped from 1.68 to 0.95. Looks like it finally woke up and started doing its job right.**


- Let's Give Lasso a Shot 

In [36]:
lasso_pipe = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('lasso_CV', LassoCV(eps=0.01, n_alphas=100, max_iter=10000, cv=3))
])

lasso_pipe.fit(x_train, y_train) 


In [37]:
lasso_pipe.named_steps['lasso_CV'].alpha_

np.float64(0.04002906642831475)

In [38]:
train_score_lasso = lasso_pipe.score(x_train, y_train) 
y_train_pred_lasso = lasso_pipe.predict(x_train) 

In [39]:
print(f"Training MAE: {metrics.mean_absolute_error(y_train, y_train_pred_lasso):.2f}")
print(f"Training MSE: {metrics.mean_squared_error(y_train, y_train_pred_lasso):.2f}")
print(f"Training RMSE: {np.sqrt(metrics.mean_squared_error(y_train, y_train_pred_lasso)):.2f}")
print(f"Training R²: {train_score_lasso:.2f}")

Training MAE: 0.52
Training MSE: 0.62
Training RMSE: 0.79
Training R²: 0.97


In [40]:
test_score_lasso = lasso_pipe.score(x_test, y_test)
y_test_pred_lasso = lasso_pipe.predict(x_test)

In [41]:
print(f"Test MAE: {metrics.mean_absolute_error(y_test, y_test_pred_lasso):.2f}")
print(f"Test MSE:  {metrics.mean_squared_error(y_test, y_test_pred_lasso):.2f}")
print(f"Test RMSE: {np.sqrt(metrics.mean_squared_error(y_test, y_test_pred_lasso)):.2f}")
print(f"Test R²: {test_score_lasso:.2f}")

Test MAE: 0.57
Test MSE:  0.81
Test RMSE: 0.90
Test R²: 0.97


**Lasso pulled off a smooth move with a test R² of 0.97 and the lowest test RMSE so far (0.90). Not bad for a model that likes to keep things simple by shrinking the less important coefficients toward zero.**
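The shrinking behaviour is easy to see on synthetic data where only two of five features matter. The data and the fixed `alpha=0.1` below are assumptions for illustration; the notebook lets `LassoCV` pick alpha instead.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only the first two features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Lasso sets uninformative coefficients to exactly zero and shrinks the rest.
n_zero = int(np.sum(lasso.coef_ == 0.0))

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Coefficients set to zero by Lasso:", n_zero)
```

On the real pipeline, a similar check would be counting the zeros in `lasso_pipe.named_steps['lasso_CV'].coef_` to see how many polynomial terms survived.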

 


---

***Looks like we’ve come a long way, from a basic linear model to something smarter and sharper. There's still room to explore and tune more, but for now, we’ve got ourselves a model that actually gets the job done.***