#**Predict Price for Used Cars**

In [None]:
pip install scipy

In [None]:
pip install seaborn

In [None]:
pip install --upgrade pip

In [None]:
pip install pandas

In [None]:
pip install -U scikit-learn

In [None]:
pip install statsmodels

In [None]:
### IMPORT: ------------------------------------
import scipy.stats as stats 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')  # suppress warnings
from scipy.stats import skew
plt.style.use('ggplot')  # set the background style for the graphs
import missingno as msno # to get visualization on missing values
from sklearn.model_selection import train_test_split # Sklearn package's randomized data splitting function
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import math
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_colwidth', 400)
pd.set_option('display.float_format', lambda x: '%.5f' % x)  # suppress scientific notation in numeric display
import statsmodels.api as sm
print("Load Libraries- Done")

#Read and Understand data

In [None]:
#Reading the csv file  used car data.csv 
df=pd.read_csv("C:\\Users\\bodda\\OneDrive\\APDS\\APDS Assignments\\APDS Group Project\\used_car_data.csv")
cars=df.copy()
print(f'There are {cars.shape[0]} rows and {cars.shape[1]} columns') # fstring 

In [None]:
# inspect data, print top 5 
cars.head(5)

In [None]:
# bottom 5 rows:
cars.tail(5)

In [None]:
#get the size of dataframe
print ("Rows     : " , cars.shape[0])  #get number of rows/observations
print ("Columns  : " , cars.shape[1]) #get number of columns
print ("#"*40,"\n","Features : \n\n", cars.columns.tolist()) #get name of columns/features
print ("#"*40,"\nMissing values :\n\n", cars.isnull().sum().sort_values(ascending=False))
print( "#"*40,"\nPercent of missing :\n\n", round(cars.isna().sum() / cars.isna().count() * 100, 2)) # looking at columns with most Missing Values
print ("#"*40,"\nUnique values :  \n\n", cars.nunique())  #  count of unique values

In [None]:
cars.info()

In [None]:
#Visualize missing values
msno.bar(cars)

**Observations**

This preview shows that some columns have substantial missingness, which we will look into later.

Market Category has 3742 missing values, about 31% of its values.

Number of Doors has only 6 missing values. It may be a key factor in deciding price.

Engine Cylinders and Engine HP have 30 and 31 missing values respectively.

Mileage is split into two columns: city mpg and highway MPG.

Mileage, Power, Engine and MSRP are quantitative variables but have object dtype here, so they need to be converted to numeric.

In [None]:
# Columns whose value counts we want to inspect
# (the categorical columns, plus EngineHP and MSRP to see their most frequent values)
cat_col = [ 'EngineFuelType', 'MarketCategory', 'VehicleStyle', 'VehicleSize','EngineHP','MSRP']

# Printing number of count of each unique value in each column
for column in cat_col:
    print(cars[column].value_counts())
    print("#" * 40)

Observations

1,The most common engine fuel type is regular unleaded, with 7172 occurrences, followed by premium unleaded (required) with 2009 occurrences, and premium unleaded (recommended) with 1523 occurrences.

2,The most common market category is crossover with 1110 occurrences, followed by flex fuel with 872 occurrences, and luxury with 855 occurrences.

3,The most common vehicle style is sedan with 3048 occurrences, followed by 4dr SUV with 2488 occurrences, and coupe with 1211 occurrences.

4,The most common vehicle size is compact with 4764 occurrences, followed by midsize with 4373 occurrences, and large with 2777 occurrences.

5,The engine horsepower ranges from 55 to 1001, with the most common horsepower being 200 (456 occurrences), followed by 170 (351 occurrences) and 210 (320 occurrences).

#**Data Preprocessing**

Processing highway MPG, city mpg, MSRP columns

The highway MPG, city mpg and MSRP columns have object dtype because the values carry units, so we strip the units.
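
A minimal sketch of how the units could be stripped, assuming the raw values look like `"26 mpg"` or `"$46,135"`. The sample values and the `strip_units` helper are hypothetical and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical sample: the unit suffixes and values are made up for illustration.
sample = pd.DataFrame({
    "highwayMPG": ["26 mpg", "28 mpg", None],
    "MSRP": ["$46,135", "$40,650", "$36,350"],
})


def strip_units(series):
    """Drop everything except digits, '.' and '-', then convert to numbers."""
    cleaned = series.str.replace(r"[^0-9.\-]", "", regex=True)
    # errors="coerce" turns anything that still cannot be parsed into NaN
    return pd.to_numeric(cleaned, errors="coerce")


numeric = sample.apply(strip_units)
print(numeric.dtypes)
print(numeric)
```

Values that cannot be parsed become NaN, so they show up in the missing-value checks instead of raising an error.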

In [None]:
#np.random.seed(9)
cars[['highwayMPG','citympg','MSRP']]

In [None]:
typeoffuel=['flex-fuel (unleaded/E85)',
'diesel',
'electric',
'flex-fuel (premium unleaded required/E85)','flex-fuel (premium unleaded recommended/E85)',
'flex-fuel (unleaded/natural gas)','natural gas']
cars.loc[cars['EngineFuelType'].isin(typeoffuel)]

In [None]:
cars.info()

In [None]:
# Define the columns to round
cols_to_round = ['EngineHP', 'EngineCylinders', 'highwayMPG', 'citympg', 'MSRP']

# Round the columns to one decimal place
df[cols_to_round] = df[cols_to_round].round(1)

In [None]:
cars.head()

In [None]:
# Define the columns to round
cols_to_round = ['EngineHP', 'EngineCylinders', 'highwayMPG', 'citympg', 'MSRP']

# Round the columns to one decimal place
cars[cols_to_round] = cars[cols_to_round].round(1)

In [None]:
cars.head()


#**Feature Engineering**

Converting data types

In [None]:
# Convert object data types to category data types
cars["EngineFuelType"] = pd.Categorical(cars["EngineFuelType"])
cars["MarketCategory"] = pd.Categorical(cars["MarketCategory"])
cars["VehicleStyle"] = pd.Categorical(cars["VehicleStyle"])

# Convert data types for "Popularity", "Number of Doors", and "Vehicle Size"
cars["Popularity"] = cars["Popularity"].astype(float)
cars["NumberofDoors"] = cars["NumberofDoors"].astype(float)
cars["VehicleSize"] = cars["VehicleSize"].map({'Compact': 1, 'Midsize': 2, 'Large': 3}).astype(float)

In [None]:
cars.info()

In [None]:
cars.describe().T

**Processing Years to Derive Age of car**

The Year column holds values such as 2014 and 1996, which do not directly show how old a car is or how age affects price. We therefore create two new columns: Current_year, set to 2023, and Ageofcar = Current_year - Year. The Current_year column is then dropped.

In [None]:
cars['Current_year']=2023
cars['Ageofcar']=cars['Current_year']-cars['Year']
cars.drop('Current_year',axis=1,inplace=True)
cars.head()

**Processing Name column**

Brands do play an important role in Car selection and Prices. So extracting brand names from the Name.

In [None]:
# Dropping rows with null values in the "Model" column
cars = cars.dropna(subset=['Model'])

In [None]:
# Define the categorical columns
cat_cols = ["Make"]

# Loop through each categorical column
for column in cat_cols:
    # Count the values in the column and sort them in ascending order
    value_counts = cars[column].value_counts().sort_index()
    # Print the value counts
    print(value_counts)
    print("#" * 40)

In [None]:
cars.info()

In [None]:
cars.Make.nunique()

In [None]:
cars.groupby(cars.Make).size().sort_values(ascending =False)

There are 48 unique brands in the dataset. Chevrolet is the most common brand, followed by Ford.

In [None]:
cars.Model.isnull().sum()

In [None]:
cars.Model.nunique()

In [None]:
cars.groupby('Model')['Model'].size().nlargest(30)

There are 915 unique models and Silverado 1500 is most popular Model.



In [None]:
# Get the size of the dataframe
rows = cars.shape[0]
columns = cars.shape[1]
features = cars.columns.tolist()

# Print the number of rows, columns, and features
print("Rows     : ", rows)
print("Columns  : ", columns)
print("#" * 40, "\n", "Features : \n\n", features)

# Print the missing values and percent of missing values for each column
missing_values = cars.isnull().sum().sort_values(ascending=False)
percent_missing = round(cars.isna().sum() / cars.isna().count() * 100, 2)
print("#" * 40, "\nMissing values :\n\n", missing_values)
print("#" * 40, "\nPercent of missing :\n\n", percent_missing)

# Print the count of unique values for each column
unique_values = cars.nunique()
print("#" * 40, "\nUnique values :\n\n", unique_values)

#**EDA**

In [None]:
cars.info()

In [None]:
cars.describe()

**Observations**

1, Year ranges from 1990 to 2017, so the cars are 6 to 33 years old.

2, In city mpg and highway MPG the maximum value is very high and looks like an outlier. This needs further analysis.

3, An MSRP of 2065902 is too high for a used car and looks like an outlier.
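
One common way to follow up on suspected outliers is the interquartile range (IQR) rule. The sketch below is illustrative only: apart from the 2065902 maximum noted above, the prices are made up, and `iqr_outlier_mask` is a hypothetical helper rather than part of this notebook.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return True for values more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    spread = q3 - q1
    return (series < q1 - k * spread) | (series > q3 + k * spread)


# Made-up prices plus the suspicious maximum from the describe() output
msrp = pd.Series([18000, 22000, 25000, 27000, 31000, 35000, 2065902])
flagged = msrp[iqr_outlier_mask(msrp)].tolist()
print(flagged)
```

Whether a flagged value is an error or a genuine luxury car still has to be checked by hand, as is done later for the spike in all wheel drive prices.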

In [None]:
plt.style.use('ggplot')
#select all quantitative columns for checking the spread
numeric_columns = cars.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,25))

for i, variable in enumerate(numeric_columns):
    plt.subplot(10, 3, i + 1)
    sns.histplot(cars[variable], kde=False, color='blue')  # histplot replaces the deprecated distplot
    plt.title(variable)
plt.tight_layout()

**Observations**


1, Year is left skewed and has outliers on the lower side. This column can be dropped.

2, Engine HP, Engine Cylinders and Vehicle Size are right skewed.

3, highway MPG and city mpg are almost normally distributed, with a few outliers on the upper and lower sides that need further checking.

4, Age of car is right skewed.

In [None]:
cat_columns=['Make','Model','EngineFuelType','TransmissionType',	'Driven_Wheels','MarketCategory','VehicleSize',	'VehicleStyle'] #cars.select_dtypes(exclude=np.number).columns.tolist()

plt.figure(figsize=(15,21))

for i, variable in enumerate(cat_columns):
                     plt.subplot(4,2,i+1)
                     order = cars[variable].value_counts(ascending=False).index    
                     ax=sns.countplot(x=cars[variable], data=cars , order=order ,palette='viridis')
                     for p in ax.patches:
                           percentage = '{:.1f}%'.format(100 * p.get_height()/len(cars[variable]))
                           x = p.get_x() + p.get_width() / 2 - 0.05
                           y = p.get_y() + p.get_height()
                           plt.annotate(percentage, (x, y),ha='center')
                     plt.xticks(rotation=90)
                     plt.tight_layout()
                     plt.title(variable)

In [None]:
# Define the categorical columns
cat_columns = ['Make', 'Model', 'Engine Fuel Type', 'Transmission Type', 'Driven_Wheels', 'Market Category','Vehicle Size','Vehicle Style']

# Create a subplot figure with custom size
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(18, 30))

# Loop through each categorical variable and create a countplot
for i, variable in enumerate(cat_columns):
    # Calculate the order of the categories based on their counts and select top 10
    order = cars[variable].value_counts(ascending=False)[:10].index
    # Create a countplot
    ax = sns.countplot(x=variable, data=cars, order=order, palette='viridis', ax=axes[i//2,i%2])
    # Add percentage annotations to the bars
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/len(cars[variable]))
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        ax.annotate(percentage, (x, y), ha='center')
    # Set the x-axis label rotation and title
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
    ax.set_xlabel('')
    ax.set_title(variable)


**Observations**



**Car Profile**


~69% of the cars available for sale have automatic transmission.

~60% of the cars use regular unleaded fuel.

~16% of the cars available for sale are from the Chevrolet and Ford brands.

~40% of the cars available for sale are front wheel drive.

Of all the types of cars available for purchase, sedans have the highest number of models. Passenger minivans have the fewest.


The cars available for sale are 6 to 33 years old.

In [None]:
numeric_columns = cars.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(13,17))

for i, variable in enumerate(numeric_columns):
    plt.subplot(5,2,i+1)
    sns.scatterplot(x=cars[variable], y=cars['MSRP'])
    plt.title('MSRP vs '+ variable)
    plt.tight_layout()

#**Handling missing values**

In [None]:
cars.isnull().sum()

**Calculating missing values in each row**

In [None]:
# counting the number of missing values per row
num_missing = cars.isnull().sum(axis=1)
num_missing.value_counts()

In [None]:
#Investigating how many missing values per row are there for each variable
for n in num_missing.value_counts().sort_index().index:
    if n > 0:
        print("*" *30,f'\nFor the rows with exactly {n} missing values, NAs are found in:')
        n_miss_per_col = cars[num_missing == n].isnull().sum()
        print(n_miss_per_col[n_miss_per_col > 0])
        print('\n\n')

**This confirms that certain columns tend to be missing together or present together, so we will fill in as many missing values as possible.**

In [None]:
cars[num_missing == 2]

In [None]:
col = ['EngineFuelType', 'EngineHP', 'EngineCylinders']
cars[col].isnull().sum()

**We start by grouping on make, model and year and filling missing values with the group median (or the group mode for the categorical fuel type).**

In [None]:
cars.groupby(['Make', 'Model', 'Year'])['EngineHP'].median().head(30)

In [None]:
# Fill within each (Make, Model, Year) group: mode for the categorical column, median for the numeric ones.
# transform keeps the original row index, so the result aligns with cars.
group_keys = ['Make', 'Model', 'Year']
cars['EngineFuelType'] = cars.groupby(group_keys)['EngineFuelType'].transform(lambda x: x.fillna(x.mode().iloc[0]) if len(x.mode()) > 0 else x)
cars['EngineHP'] = cars.groupby(group_keys)['EngineHP'].transform(lambda x: x.fillna(x.median()))
cars['EngineCylinders'] = cars.groupby(group_keys)['EngineCylinders'].transform(lambda x: x.fillna(x.median()))

In [None]:
col = ['EngineFuelType', 'EngineHP', 'EngineCylinders']
cars[col].isnull().sum()

In [None]:
cars.groupby(['Make', 'Model', 'Year'])['EngineCylinders'].median().head(10)

**As we can see, most models have the same number of engine cylinders. Instead of applying the overall median, grouping by model and year gives more granularity and values closer to the true engine cylinder count.**
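
The two-stage idea (group median first, overall median as a fallback) can be sketched on a toy frame. The data below is invented for illustration, and the column names simply mirror the ones used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy data: model "C" has no known cylinder count at all
toy = pd.DataFrame({
    "Model": ["A", "A", "A", "B", "B", "C"],
    "Year": [2015, 2015, 2015, 2016, 2016, 2017],
    "EngineCylinders": [4.0, np.nan, 4.0, 6.0, np.nan, np.nan],
})

# Stage 1: median within each (Model, Year) group, aligned to the original rows
group_median = toy.groupby(["Model", "Year"])["EngineCylinders"].transform("median")
filled = toy["EngineCylinders"].fillna(group_median)

# Stage 2: groups with no known value fall back to the overall median
filled = filled.fillna(toy["EngineCylinders"].median())
print(filled.tolist())
```

Rows in groups with at least one known value take that group's median, and only the fully missing group falls back to the overall median.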

In [None]:
# Fill missing values in Engine Fuel Type, Engine HP, and Engine Cylinders columns
# Use median to fill missing values since there are many outliers
cars['Engine Fuel Type'] = cars['Engine Fuel Type'].fillna(cars['Engine Fuel Type'].mode()[0])
cars['Engine HP'] = cars['Engine HP'].fillna(cars['Engine HP'].median())
cars['Engine Cylinders'] = cars['Engine Cylinders'].fillna(cars['Engine Cylinders'].median())

In [None]:
col = ['Engine Fuel Type', 'Engine HP', 'Engine Cylinders']
cars[col].isnull().sum()

There are no missing values

In [None]:
#cars.groupby(['Model','Year'])['Engine'].agg({'median','mean','max'}).sort_values(by='Model',ascending='True').head(10)

In [None]:
from sklearn.preprocessing import LabelEncoder

# create a label encoder object
le = LabelEncoder()

# encode the categorical variable 'Engine Fuel Type'
cars['Engine Fuel Type Encoded'] = le.fit_transform(cars['Engine Fuel Type'])

# group by Make, Model, and Year and calculate median, mean, and max of the encoded variable
cars.groupby(['Make', 'Model', 'Year'])['Engine Fuel Type Encoded'].agg(['median', 'mean', 'max']).sort_values(by='Model', ascending=True).head(10)

# drop the encoded variable if you don't need it anymore
cars.drop('Engine Fuel Type Encoded', axis=1, inplace=True)


In [None]:
cars['Number of Doors'].isnull().sum()

In [None]:
cars["Driven_Wheels"] = cars["Driven_Wheels"].astype("category")
#cars['Model'] =cars['Model'].astype("category")

In [None]:
cars.info()

In [None]:
#For better granualarity grouping has there would be same car model present so filling with a median value brings it more near to real value
#cars['MSRP']=cars.groupby(['Model', 'Year'])['MSRP'].apply(lambda x:x.fillna(x.median()))

In [None]:
cars.MSRP.isnull().sum()

In [None]:
# transform keeps the original row index, so the filled values align with cars
cars['MSRP'] = cars.groupby('Make')['MSRP'].transform(lambda x: x.fillna(x.median()))

In [None]:
cars.MSRP.isnull().sum()

In [None]:
cars.groupby(['Make'])['MSRP'].median().sort_values(ascending=False)

In [None]:
cars.isnull().sum()

In [None]:
cols1 = ["Engine HP","Engine Cylinders"]

for ii in cols1:
    cars[ii] = cars[ii].fillna(cars[ii].median())

In [None]:
#dropping remaining rows
#cannot further fill this rows so dropping them

cars.dropna(inplace=True,axis=0)

In [None]:
cars.isnull().sum()

In [None]:
cars.head()

In [None]:
cars.isnull().sum()

In [None]:
cars.shape  # cars (not the untouched raw copy df) reflects the rows dropped above

Finally done with all missing values handling

In [None]:
cars.groupby(['Make'])['MSRP'].agg(['median', 'mean', 'max'])  # a list keeps the column order fixed

In [None]:
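# Using business knowledge to group makes into price classes.
# The commas matter: adjacent string literals without them are concatenated into one string.
Low=['Buick',
'Chevrolet',
'Chrysler',
'Dodge',
'FIAT',
'Ford',
'GMC',
'Honda',
'Hyundai',
'Kia',
'Mazda',
'Mitsubishi',
'Nissan',
'Oldsmobile',
'Plymouth',
'Pontiac',
'Saab',
'Scion',
'Subaru',
'Suzuki',
'Toyota',
'Volkswagen',
'Volvo']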

High=['Audi',
      'Mini Cooper',
      'Bentley',
      'Mercedes-Benz',
      'Lamborghini',
      'Volkswagen',
      'Porsche',
      'Land Rover',
      'Nissan',
      'Volvo',
      'Jeep',
      'Jaguar',
      'BMW']# more than 30lakh

In [None]:
def classrange(x):
    if x in Low:
        return "Low"
    elif x in High:
        return "High"
    else: 
        return x

In [None]:
cars['Make_Class'] = cars['Make'].apply(lambda x: classrange(x))

In [None]:
cars['Make_Class'].unique()

In [None]:
cars['Engine HP']=cars['Engine HP'].astype(int)
cars['Make_Class']=cars["Make_Class"].astype('category')

In [None]:
plt.figure(figsize=(10,8))
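# numeric_only=True (pandas 1.5+) skips the categorical columns, which cannot be correlated
sns.heatmap(cars.corr(numeric_only=True), annot=True, cmap="YlGnBu")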
plt.show()

**Observations**

Engine has strong positive correlation to Power [0.86].

MSRP has positive correlation to Engine HP[0.66] as well Engine Cylinders [0.58].

Age of car is negatively correlated with MSRP, Popularity, Vehicle Size, Number of Doors, and city and highway MPG.

Price is negatively correlated with age of car.

city mpg does not appear to impact MSRP.

In [None]:
sns.pairplot(data=cars , corner=True)
plt.show()

**Observations**

The correlations match those seen in the heatmap.

Kilometers driven does not appear to impact price.

As power increases, mileage decreases.

Cars of a more recent make sell at higher prices.

As engine and power increase, the price of the car tends to increase.

#**Variables that are correlated with Price variable**



In [None]:
# understand the relationship between MSRP and Popularity by Transmission Type
plt.figure(figsize=(10,7))

plt.title("MSRP VS Popularity based on Transmission Type")
sns.scatterplot(y='Popularity', x='MSRP', hue='Transmission Type', data=cars)

In [None]:
# understand the relationship between MSRP and Year by Transmission Type
plt.figure(figsize=(10,7))
plt.title("MSRP vs Year based on Transmission Type")
sns.scatterplot(y='Year', x='MSRP', hue='Transmission Type', data=cars)

In [None]:
# Understand the relationships  between city mpg and MSRP
sns.scatterplot(y='city mpg', x='MSRP', hue='Transmission Type', data=cars)

In [None]:
# Understand the relationship between highway MPG and MSRP
sns.scatterplot(y='highway MPG', x='MSRP', hue='Transmission Type', data=cars)

In [None]:
# Impact of years on MSRP 
plt.figure(figsize=(10,7))
plt.title("MSRP based on manufacturing Year of Make")
sns.lineplot(x='Year', y='MSRP',hue='Transmission Type',
             data=cars)

In [None]:
# Impact of years on MSRP 
plt.figure(figsize=(10,7))
plt.title("MSRP Vs Year VS Popularity")
sns.lineplot(x='Year', y='MSRP',hue='Popularity',
             data=cars)

In [None]:
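# Vehicle size was mapped to numeric codes earlier (Compact = 1), so compare with a number, not the string '1'
cars[(cars["Vehicle Size"] == 1) & (cars["Year"] == 2010)].sort_values(by='MSRP', ascending=False)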

In [None]:
plt.figure(figsize=(10,7))
plt.title("MSRP Vs Year VS Driven_Wheels")
sns.lineplot(x='Year', y='MSRP',hue='Driven_Wheels',
             data=cars)

We need to check the reason for the spike in MSRP for all wheel drive cars in 2008.

In [None]:
cars[(cars["Driven_Wheels"]=='all wheel drive') & (cars["Year"].isin([2008]))].sort_values(by='MSRP',ascending =False)

The spike comes from the Porsche Panamera, an expensive luxury car, so the data is valid.

In [None]:
cars.describe()

MSRP Vs Year

In [None]:
#MSRP and Year 
plt.figure(figsize=(20,10))
sns.set(font_scale=1)
sns.barplot(x='Year', y='MSRP', data=cars)
plt.grid()

MSRP Vs Engine Cylinders

In [None]:
#MSRP and Engine Cylinders 
plt.figure(figsize=(15,10))
sns.set(font_scale=2)
sns.barplot(x='Engine Cylinders', y='MSRP', data=cars)
plt.grid()

MSRP Vs Make

In [None]:
#MSRP and make 
plt.figure(figsize=(20,15))
sns.set(font_scale=1)
sns.boxplot(x='MSRP', y='Make', data=cars)
plt.grid()

In [None]:
sns.relplot(data=cars, y='MSRP',x='city mpg',hue='Transmission Type',aspect=1,height=5)

In [None]:
cars.head()

In [None]:
sns.relplot(data=cars, y='MSRP',x='Year',col='Number of Doors',hue='Transmission Type',aspect=1,height=5)


In [None]:
sns.relplot(data=cars, y='MSRP',x='Year',col='Engine Fuel Type',hue='Transmission Type',aspect=0.5,height=10)

#**Insights based on EDA**

Expensive cars are in Coimbatore and Bangalore.

2 seater cars are more expensive.

Diesel cars are more expensive compared to other fuel types.

As expected, older models are sold cheaper compared to the latest models.

Automatic transmission vehicles have a higher price than manual transmission vehicles.

Vehicles with more engine capacity have higher prices.

Price decreases as the number of owners increases.

Automatic transmission requires higher engine capacity and power.

Prices for diesel cars have increased with recent models.

Engine, power, age of the car, mileage, fuel type, location and transmission affect the price.

In [None]:
# Check whether the distributions are skewed. If a distribution is skewed, a log transform is advisable.
cols_to_log = cars.select_dtypes(include=np.number).columns.tolist()
for colname in cols_to_log:
    sns.histplot(cars[colname], kde=True)  # histplot replaces the deprecated distplot
    plt.show()

The distributions are right skewed, so a log transform can help bring them closer to normal.
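
To illustrate why a log transform helps with right skew, the sketch below compares the skewness of a synthetic right-skewed sample before and after taking logs. The lognormal sample stands in for prices and is not drawn from the dataset. `sample_skewness` is a small hand-written helper.

```python
import numpy as np


def sample_skewness(values):
    """Third standardized moment: about 0 for symmetric data, positive for right skew."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Synthetic, strictly positive, right-skewed "prices"
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=10.0, sigma=0.8, size=5000)

raw_skew = sample_skewness(prices)
log_skew = sample_skewness(np.log(prices))
print("skewness before log:", raw_skew)
print("skewness after log: ", log_skew)
```

The log is only defined for positive values, so a column containing zeros would need `np.log1p` or a similar adjustment instead.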

In [None]:
# Distributions are right skewed, so a log transform can help bring them closer to normal
def Perform_log_transform(df, col_log):
    """Add a natural-log copy (<name>_log) of each column in col_log to df, in place."""
    for colname in col_log:
        df[colname + '_log'] = np.log(df[colname])
    #df.drop(col_log, axis=1, inplace=True)
    df.info()

In [None]:
#This needs to be done before the data is split
Perform_log_transform(cars,['Year','MSRP'])

In [None]:
cars.drop(['Model','Year','Make',],axis=1,inplace=True)

In [None]:
cars.info()

#**Model Building**

In [None]:
X = cars.drop(["MSRP", "MSRP_log"], axis=1)
y = cars[["MSRP_log", "MSRP"]]

**Creating dummy variables**

In [None]:
def encode_cat_vars(x):
    x = pd.get_dummies(
        x,
        columns=x.select_dtypes(include=["object", "category"]).columns.tolist(),
        drop_first=True,
    )
    return x

In [None]:
# Dummy variables are created before splitting the data, so all the different categories are covered
X = encode_cat_vars(X)
X.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.reset_index()
print("X_train:",X_train.shape)
print("X_test:",X_test.shape)
print("y_train:",y_train.shape)
print("y_test:",y_test.shape)

In [None]:
# Statsmodel api does not add a constant by default. We need to add it explicitly.
X_train = sm.add_constant(X_train)
# Add constant to test data
X_test = sm.add_constant(X_test)


def build_ols_model(train):
    # Create the model
    olsmodel = sm.OLS(y_train["MSRP_log"], train)
    return olsmodel.fit()

In [None]:
# fit statmodel
olsmodel1 = build_ols_model(X_train)
print(olsmodel1.summary())

Both the R-squared and adjusted R-squared of our model are very high, indicating that the model explains up to 92% of the variance in used car prices.

The model is not an underfitting or overfitting model.

To be able to make statistical inferences from our model, we will have to test that the linear regression assumptions are followed.

Before we move on to assumption testing, we'll do a quick performance check on the test data.

In [None]:
import math

# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())


# MAPE
def mape(predictions, targets):
    return np.mean(np.abs((targets - predictions)) / targets) * 100


# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


# Model Performance on test and train data
def model_pref(olsmodel, x_train, x_test):

    # Insample Prediction
    y_pred_train_pricelog = olsmodel.predict(x_train)
    y_pred_train_Price = y_pred_train_pricelog.apply(math.exp)
    y_train_Price = y_train["MSRP"]

    # Prediction on test data
    y_pred_test_pricelog = olsmodel.predict(x_test)
    y_pred_test_Price = y_pred_test_pricelog.apply(math.exp)
    y_test_Price = y_test["MSRP"]

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train_Price, y_train_Price),
                    rmse(y_pred_test_Price, y_test_Price),
                ],
                "MAE": [
                    mae(y_pred_train_Price, y_train_Price),
                    mae(y_pred_test_Price, y_test_Price),
                ],
                "MAPE": [
                    mape(y_pred_train_Price, y_train_Price),
                    mape(y_pred_test_Price, y_test_Price),
                ],
            }
        )
    )


# Checking model performance
model_pref(olsmodel1, X_train, X_test)

Root Mean Squared Error of train and test data is not different, indicating that our model is not overfitting the train data.

Mean Absolute Error indicates that our current model is able to predict used cars prices within mean error of 9505 on test data.

RMSE and MAE are both in the same units as MSRP. RMSE is greater than MAE because it penalises large errors, such as those from outliers, more heavily.

Mean Absolute Percentage Error is ~19% on the test data.
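
A small worked example shows how the three metrics react to a single large error. The prices below are invented for illustration and are not model output:

```python
import numpy as np

# Four hypothetical cars: three small errors and one large miss
actual = np.array([20000.0, 30000.0, 40000.0, 50000.0])
predicted = np.array([21000.0, 29000.0, 41000.0, 30000.0])

errors = actual - predicted

mae_value = np.mean(np.abs(errors))                  # average absolute error
rmse_value = np.sqrt(np.mean(errors ** 2))           # squares errors before averaging
mape_value = np.mean(np.abs(errors) / actual) * 100  # error relative to the true price

print("MAE :", mae_value)
print("RMSE:", rmse_value)
print("MAPE:", mape_value)
```

Because the errors are squared before averaging, the one 20000 miss dominates RMSE, which is why RMSE ends up well above MAE on the same predictions.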

#**Test Assumptions**

#Checking the Linear Regression Assumptions

1, No multicollinearity

2, Mean of residuals should be 0

3, No heteroscedasticity

4, Linearity of variables

5, Normality of error terms

#Checking Assumption 1: No Multicollinearity
We will use VIF, to check if there is multicollinearity in the data.

Features having a VIF score >5 will be dropped/treated till all the features have a VIF score <5
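
The drop-until-below-threshold procedure can be sketched as a loop. This is a hand-rolled illustration, not the statsmodels function used in the notebook: `vif_table` computes VIF_j = 1 / (1 - R_j^2) by regressing each feature on the others with an intercept. The toy columns and the `drop_high_vif` helper are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_table(features):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = features.to_numpy(dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return pd.Series(vifs, index=features.columns)


def drop_high_vif(features, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    kept = features.copy()
    while kept.shape[1] > 1:
        vifs = vif_table(kept)
        if vifs.max() <= threshold:
            break
        kept = kept.drop(columns=vifs.idxmax())
    return kept


# Toy data: EngineHP_copy is almost a duplicate of EngineHP
rng = np.random.default_rng(1)
hp = rng.normal(250, 60, 500)
toy = pd.DataFrame({
    "EngineHP": hp,
    "EngineHP_copy": hp + rng.normal(0, 1, 500),
    "Popularity": rng.normal(1500, 400, 500),
})
reduced = drop_high_vif(toy)
print(reduced.columns.tolist())
```

Dropping one column at a time matters because removing a single collinear feature usually lowers the VIF of the features it was correlated with.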

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor


def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif

In [None]:
# Check VIF
print(checking_vif(X_train))

In [None]:

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X = cars.drop(['MSRP'], axis=1)
y = cars['MSRP']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:

# Preprocessing
X = df[[ 'Year', 'Make', 'Model','city mpg']]
X = pd.get_dummies(X, drop_first=True)
y = df['MSRP']

# Splitting the data into train and test sets
np.random.seed(0)
msk = np.random.rand(len(df)) < 0.8
X_train = X[msk]
y_train = y[msk]
X_test = X[~msk]
y_test = y[~msk]

# Function to build OLS model
def build_ols_model(X):
    X = sm.add_constant(X)
    model = sm.OLS(y_train, X)
    results = model.fit()
    return results

# Function to check model performance
def model_perf(model, X_train, X_test):
    y_train_pred = model.predict(sm.add_constant(X_train))
    train_rmse = np.sqrt(np.mean((y_train - y_train_pred)**2))
    print('Train RMSE:', train_rmse)

    y_test_pred = model.predict(sm.add_constant(X_test))
    test_rmse = np.sqrt(np.mean((y_test - y_test_pred)**2))
    print('Test RMSE:', test_rmse)

# Build OLS model
X_train1 = X_train.drop(['city mpg'], axis=1)
X_test1 = X_test.drop(['city mpg'], axis=1)
olsmodel2 = build_ols_model(X_train1)

# Print model summary
print(olsmodel2.summary())

# Check model performance
model_perf(olsmodel2, X_train1, X_test1)



In [None]:
def model_perf(model, X_train, X_test):
    y_train_pred = model.predict(sm.add_constant(X_train))
    train_rmse = np.sqrt(np.mean((y_train - y_train_pred)**2))

    y_test_pred = model.predict(sm.add_constant(X_test))
    test_rmse = np.sqrt(np.mean((y_test - y_test_pred)**2))

    return train_rmse, test_rmse

In [None]:
train_rmse, test_rmse = model_perf(olsmodel2, X_train1, X_test1)
print('Train RMSE:', train_rmse)
print('Test RMSE:', test_rmse)

In [None]:
"""
from sklearn.linear_model import Ridge

# Splitting the data into train and test sets
np.random.seed(0)
msk = np.random.rand(len(df)) < 0.8
X_train = X[msk]
y_train = y[msk]
X_test = X[~msk]
y_test = y[~msk]

# Fit Ridge regression model
ridge = Ridge(alpha=0.1) # increase alpha for stronger regularization
ridge.fit(X_train, y_train)

# Print model coefficients
print('Model Coefficients:', ridge.coef_)

# Check model performance
train_rmse = np.sqrt(np.mean((y_train - ridge.predict(X_train))**2))
print('Train RMSE:', train_rmse)

test_rmse = np.sqrt(np.mean((y_test - ridge.predict(X_test))**2))
print('Test RMSE:', test_rmse)
"""

Trial code start

#Checking Assumption 2: Mean of residuals should be 0

In [None]:
#Checking Assumption 2: Mean of residuals should be 0
residuals = olsmodel2.resid
np.mean(residuals)

#**Checking Assumption 3: No Heteroscedasticity (Homoscedasticity)**

**Homoscedasticity** - If the residuals are evenly spread around the regression line, with roughly constant variance, the data is said to be homoscedastic.

**Heteroscedasticity** - If the spread of the residuals around the regression line is not constant, the data is said to be heteroscedastic. In this case the residuals can form a funnel shape or any other non-symmetrical shape.

We'll use the Goldfeld-Quandt test to test the following hypotheses:

**Null hypothesis**: Residuals are homoscedastic

**Alternate hypothesis**: Residuals are heteroscedastic

alpha = 0.05

In [None]:
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(residuals, X_train1)
lzip(name, test)

Since p-value > 0.05 we cannot reject the Null Hypothesis that the residuals are homoscedastic.

Assumption 3 is also satisfied by our olsmodel2.

#**Checking Assumption 4: Linearity of variables**

Predictor variables must have a linear relation with the dependent variable.

To test the assumption, we'll plot residuals and fitted values on a plot and ensure that residuals do not form a strong pattern. They should be randomly and uniformly scattered on the x axis.

In [None]:
# predicted values
fitted = olsmodel2.fittedvalues

# sns.set_style("whitegrid")
sns.residplot(x=fitted, y=residuals, color="purple", lowess=True)
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()

Assumption 4 is satisfied by our olsmodel2. There is no pattern in the residual vs fitted values plot.

#**Checking Assumption 5: Normality of error terms**

The residuals should be normally distributed.

In [None]:
sns.histplot(residuals, kde=True)  # histplot replaces the deprecated distplot

In [None]:
# Plot q-q plot of residuals
import pylab
import scipy.stats as stats

stats.probplot(residuals, dist="norm", plot=pylab)
plt.show()

The residuals have a close to normal distribution. Assumption 5 is also satisfied. We should further investigate these values in the tails where we have made huge residual errors.

Now that we have seen that olsmodel2 follows all the linear regression assumptions, let us use this model to draw inferences.

In [None]:
print(olsmodel2.summary())

In [None]:
# Create a list of predictor variables and the outcome variable
predictors = ['Engine HP', 'Engine Cylinders', 'Number of Doors', 'highway MPG', 'city mpg', 'Ageofcar']
outcome = 'MSRP'

# Fit a linear regression model using OLS
model = sm.OLS(cars[outcome], sm.add_constant(cars[predictors])).fit()

# Extract key performance metrics
rsquared = model.rsquared
coefficients = model.params
pvalues = model.pvalues
std_errors = model.bse
fstat = model.fvalue
f_pvalue = model.f_pvalue

# Print the results
print("R-squared value: ", rsquared)
print("Coefficients: ", coefficients)
print("P-values: ", pvalues)
print("Standard errors of coefficients: ", std_errors)
print("F-statistic: ", fstat)
print("P-value for F-statistic: ", f_pvalue)


In [None]:
import statsmodels.api as sm

# Add constant to independent variables
X = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Get the summary of the model
print(model.summary())

# Get R-squared value
print("R-squared value: ", model.rsquared)

# Get coefficients
print("Coefficients: ", model.params)

# Get p-values
print("P-values: ", model.pvalues)

# Get standard errors of coefficients
print("Standard errors of coefficients: ", model.bse)

# Get F-statistic and p-value for F-statistic
print("F-statistic: ", model.fvalue)
print("P-value for F-statistic: ", model.f_pvalue)

The given OLS regression model has a high R-squared value of 0.981, indicating that a large proportion of the variance in the dependent variable (MSRP) is explained by the independent variables. The adjusted R-squared value of 0.979 also indicates that the independent variables included in the model are a good fit for the dependent variable.

The F-statistic of 611.5 is significant with a p-value of 0.00, indicating that the model as a whole is significant.

There are a total of 926 degrees of freedom for the model, with 10987 degrees of freedom for the residuals.

The log-likelihood of the model is -1.2441e+05. On its own it is hard to interpret, but it is useful for comparing this model against alternatives fitted to the same data.

The AIC and BIC values of 2.507e+05 and 2.575e+05 respectively penalise the fit by the number of parameters. Lower values are better when comparing models.

Since the covariance type is non-robust, the reported standard errors assume homoscedastic errors and may be unreliable if that assumption is violated.

In [None]:
# Assuming X_train1 and X_test1 are already defined
# .shape returns (rows, columns) for both NumPy arrays and pandas DataFrames
print("Shape of X_train1:", X_train1.shape)
print("Shape of X_test1:", X_test1.shape)


#**Observations from the model**

1, It is important to note here that the predicted values are log(price) and therefore coefficients have to be converted accordingly to understand their influence in price.

2, With our linear regression model we have been able to capture ~98% of the variation in our data.

3, The model indicates that the most significant predictors of price of used cars are -

Age of the car

Number of doors in the car

Engine HP

city mpg

Highway MPG

Popularity

Engine Cylinders

Engine Fuel Type

Driven_wheels

Transmission Type

Market Category

Vehicle Size

Vehicle Style



4, Newer cars sell for higher prices. Because the model predicts log(price), an age coefficient of about -0.1123 (negative, since price falls with age) means that a 1 unit increase in the age of the car multiplies the price by exp(-0.1123) ≈ 0.894, a decrease of about 10.6%, when everything else is constant.

As the number of seats increases, the price of the car increases: a coefficient of 0.05 multiplies the price by exp(0.05) ≈ 1.05, an increase of about 5%.

Mileage is inversely correlated with price. Generally, high mileage cars are the lower budget cars.

Kilometers driven has a negative relationship with price, which is intuitive. A car that has been driven more will have more wear and tear and hence sell at a lower price, everything else being equal.

The categorical variables are a little hard to interpret. But it can be seen that all the car_category variables in the dataset have a negative relationship with the Price and the magnitude of this negative relationship decrease as the brand category moves to lower brands.
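
For a model fitted on log(price), the conversion from a coefficient to a percentage change in price can be sketched as follows. The coefficient magnitudes are the ones quoted above. The negative sign on the age coefficient is an assumption based on the stated direction of the effect, and `pct_change_per_unit` is a hypothetical helper.

```python
import math


def pct_change_per_unit(log_coef):
    """Percent change in price for a one-unit increase in the predictor,
    when the model's target is log(price)."""
    return (math.exp(log_coef) - 1.0) * 100.0


age_effect = pct_change_per_unit(-0.1123)  # assumed negative: price falls with age
seat_effect = pct_change_per_unit(0.05)

print(f"One more year of age: {age_effect:.2f}% change in price")
print(f"One more seat:        {seat_effect:.2f}% change in price")
```

The effect is multiplicative: each additional unit scales the price by exp(coefficient), so the change in currency terms depends on the price of the car being considered.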