Here we have a dataset of car details, on which we fit several regression models to predict each car's selling price.

# Importing libraries

In [None]:
import pandas as pd
import numpy as np
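import matplotlib as plt  #alias of the top-level package; plots are called as plt.pyplot.<function>
import matplotlib.pyplot  #load the pyplot submodule explicitly so plt.pyplot is always available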
import seaborn as sns
%matplotlib inline

In [None]:
#Vehicle dataset from CarDekho
data = pd.read_csv(r'../input/vehicle-dataset-from-cardekho/CAR DETAILS FROM CAR DEKHO.csv')
data

# DATA PREPROCESSING

In [None]:
data.head()

In [None]:
data.shape

In [None]:
#removing duplicate entries
data.drop_duplicates(keep='first',inplace=True)
data.shape

In [None]:
#check missing or null values
data.isnull().sum()

Hence there are no null values.

In [None]:
data.info()

In [None]:
data.describe(include="all")

Hence we have 5 categorical columns. If we do not drop "name" before converting the categorical data to indicator variables, the resulting data will have 98+ features, most of them sparse indicator columns for individual car names. Hence:

In [None]:
del data["name"]
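To see why a nearly unique text column inflates the feature count, here is a minimal sketch on a made-up four-row frame. The `toy` data and its values are hypothetical and only illustrate how `pd.get_dummies` behaves; they are not taken from the car dataset.

```python
import pandas as pd

# Hypothetical toy data: "name" is almost unique per row, "fuel" has only two levels.
toy = pd.DataFrame({
    "name": ["Maruti 800 AC", "Hyundai Verna 1.6 SX", "Honda Amaze VX", "Maruti 800 AC"],
    "fuel": ["Petrol", "Diesel", "Diesel", "Petrol"],
    "selling_price": [60000, 600000, 450000, 65000],
})

# Every distinct name (minus the dropped first level) becomes its own indicator column.
with_name = pd.get_dummies(toy, drop_first=True)
without_name = pd.get_dummies(toy.drop(columns="name"), drop_first=True)

print(with_name.shape[1], without_name.shape[1])
```

The number of indicator columns grows with the number of distinct names, so a column with many distinct values adds many features that each describe only a few rows.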

In [None]:
data.columns

In [None]:
data=data[['year', 'km_driven', 'fuel', 'seller_type',
       'transmission', 'owner', 'selling_price']]

In [None]:
#checking unique values for the categorical data
print(data["fuel"].unique())
print(data["seller_type"].unique())
print(data["transmission"].unique())
print(data["owner"].unique())


In [None]:
datac=data.copy(deep=True)

In [None]:
#Converting the categorical to indicator variables.
data=pd.get_dummies(data,drop_first =True)
data.head()

The column "year" is more useful as the age of the car, i.e., the number of years between its model year and the year in which the selling price is estimated. Hence:

In [None]:
from datetime import date
#replace the model year with the car's age in years
year=date.today().year
data["year"] = year-data["year"]
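Because `date.today()` changes over time, the computed ages shift from one run to the next. A minimal sketch of the same conversion with a fixed reference year follows; `reference_year` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical model years for three cars.
toy = pd.DataFrame({"year": [2007, 2014, 2019]})

# Assumed fixed reference year, so the result does not depend on when the code runs.
reference_year = 2020
toy["age"] = reference_year - toy["year"]

print(toy)
```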

In [None]:
data.columns

In [None]:
data.head()

# Visualization

We use seaborn to create joint plots that compare pairs of columns and visualize the correlation between them. This gives a rough estimate of the strength of each correlation and of any multicollinearity, which generally occurs when two or more predictor variables are highly correlated.

In [None]:
sns.jointplot(x='year', y='selling_price',data = data)

In [None]:
sns.jointplot(x='year', y='selling_price',data = data, kind= 'hex')

We can enhance a scatterplot with a fitted linear regression line (and its uncertainty band) using lmplot().

In [None]:
sns.set(color_codes=True)
sns.lmplot(x='km_driven', y='selling_price',data = data)


lmplot returns the FacetGrid object with the plot on it for further tweaking. The FacetGrid class helps visualize the distribution of one variable, as well as the relationship between multiple variables, separately within subsets of the dataset using multiple panels.

A FacetGrid can be drawn with up to three dimensions: row, col, and hue. The first two correspond directly to the resulting array of axes; the hue variable can be thought of as a third dimension along a depth axis, where different levels are plotted in different colors.

A FacetGrid object takes a dataframe as input, along with the names of the variables that will form the row, column, or hue dimensions of the grid.

We now use pairplot to detect any multicollinearity between the predictors, while the selling_price column of the plot shows how the response variable depends on each predictor.

In [None]:
sns.pairplot(datac[['year', 'km_driven','owner', 'selling_price']],hue='owner', markers=["o", "s", "D","p","*"])
#'owner' can be replaced with 'fuel', 'seller_type' or 'transmission' for more plots (include that column in the list and pass one marker per category).

In [None]:
sns.catplot(data=datac, kind="swarm", x="owner", y="selling_price", col="seller_type")

In [None]:
#datac still holds text columns, so restrict the correlation to the numeric ones
correlations=datac.corr(numeric_only=True)
correlations

In [None]:
correlations=data.corr()
correlations
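A correlation matrix can also be scanned programmatically for strongly correlated predictor pairs. The sketch below does this on synthetic data; the column names, the random data, and the `threshold` of 0.8 are assumptions chosen for illustration, not a rule from the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic predictors: "b" is almost a linear function of "a", "c" is independent noise.
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()
threshold = 0.8  # assumed cut-off for "high" correlation

# Collect each unordered pair once (i < j) whose absolute correlation exceeds the cut-off.
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and abs(corr.loc[i, j]) > threshold]
print(high)
```

Pairs flagged this way are candidates for multicollinearity and are worth a closer look in the pair plots.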

# Training and Testing Data.
We now split the data into training and testing sets. We define a variable X that contains all the columns except the target column, and store the target column, i.e., "selling_price", in another variable, y.
 

In [None]:
data

In [None]:
y = data["selling_price"].values
X = data.drop(columns="selling_price").values

In [None]:
#Train-test split of the data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)
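As a small illustration of how the split behaves, the sketch below runs `train_test_split` on a made-up 8-row array. The toy data and the `test_size=0.25` value are assumptions chosen so the split sizes are easy to check by hand; they are not taken from the car dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 8 rows, 2 feature columns, one target value per row.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

# A fixed random_state makes the shuffle reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)  # 6 training rows and 2 test rows
```

Features and targets are shuffled together, so each row of `X_tr` still lines up with its entry in `y_tr`.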



# Feature Scaling

In [None]:
columns = data.columns.tolist()

In [None]:
columns.remove("selling_price")

In [None]:
columns

In [None]:
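#standardize all features (including km_driven and year) and the target
from sklearn.preprocessing import StandardScaler

#separate scalers for features and target, fitted on the training data only
sc_X = StandardScaler()
sc_y = StandardScaler()

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  #reuse the training mean and standard deviation

y_train = sc_y.fit_transform(y_train.reshape(-1, 1))
y_test = sc_y.transform(y_test.reshape(-1, 1))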
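A common convention is to fit the scaler on the training split only and reuse its statistics for the test split, so that both splits are measured on the same scale. A minimal sketch on made-up numbers (the arrays here are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data
test_scaled = scaler.transform(test)        # applies the same mean and std to the test data

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(test_scaled)
print(test_scaled.ravel(), restored.ravel())
```

Keeping the fitted scaler around also makes it possible to convert scaled predictions back to the original units with `inverse_transform`.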

In [None]:
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test

# Training the regression models.


In [None]:
#Importing libraries
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

In [None]:
#linear regression
model=linear_model.LinearRegression()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print('Coefficients: \n', model.coef_)

In [None]:
#the `normalize` argument was removed in scikit-learn 1.2; the features are already standardized above
ridgeregr = linear_model.Ridge(alpha=30)

# Train the model using the training sets
ridgeregr.fit(X_train, y_train)

# Make predictions using the testing set
ridge_y_pred = ridgeregr.predict(X_test)
print('Coefficients: \n', ridgeregr.coef_)

In [None]:
#the `normalize` argument was removed in scikit-learn 1.2, and False was its default
lasso = linear_model.Lasso(alpha=50)
lasso.fit(X_train,y_train)

# Make predictions using the testing set
lasso_y_pred = lasso.predict(X_test)
print('Coefficients: \n', lasso.coef_)

# Performance Evaluation. 

Before we evaluate our performance, I would like to mention the four principal assumptions which justify the use of linear regression models for purposes of inference or prediction, namely:
"(i) linearity and additivity of the relationship between dependent and independent variables:

    (a) The expected value of dependent variable is a straight-line function of each independent variable, holding the others fixed.

    (b) The slope of that line does not depend on the values of the other variables.

    (c)  The effects of different independent variables on the expected value of the dependent variable are additive.

(ii) statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data)

(iii) homoscedasticity (constant variance) of the errors

    (a) versus time (in the case of time series data)

    (b) versus the predictions

    (c) versus any independent variable

(iv) normality of the error distribution."

Let's focus on the last assumption, the normal distribution of the errors, where an error is simply the difference between the actual target value and the predicted target value. Let's visualize this difference to check our regression models on this data.

In [None]:
plt.pyplot.scatter(y_test, y_pred)
plt.pyplot.ylabel('Predicted')
plt.pyplot.xlabel('Actual')
print('Mean squared error: %.4f' % mean_squared_error(y_test, y_pred))
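For reference, the mean squared error is the average of the squared differences between actual and predicted values. The sketch below checks a hand computation against `mean_squared_error` on three made-up values (the arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# Average of the squared errors: (0.25 + 0.0 + 1.0) / 3
manual = np.mean((y_true - y_hat) ** 2)
print(manual, mean_squared_error(y_true, y_hat))
```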

In [None]:
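#distplot is deprecated in seaborn; histplot with a KDE overlay draws a comparable plot
sns.histplot((y_test-y_pred).ravel(), kde=True, stat="density")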

The errors are indeed roughly normally distributed. For the plain linear regression model, however, the distribution is slightly left (negatively) skewed, i.e., it has a longer tail on the negative side.
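Skewness can also be quantified instead of read off the plot. The sketch below uses pandas' `Series.skew()` on synthetic samples; the two samples are made up for illustration and are not the model residuals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic "errors": one symmetric sample, one with a long tail on the negative side.
symmetric = pd.Series(rng.normal(size=5000))
left_skewed = pd.Series(-rng.exponential(size=5000))

# Values near 0 indicate symmetry; negative values indicate left (negative) skew.
print(symmetric.skew(), left_skewed.skew())
```

Applied to a 1-D array of residuals, the same call gives a single number to compare across models.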

In [None]:
plt.pyplot.scatter(y_test, ridge_y_pred)
plt.pyplot.ylabel('Predicted')
plt.pyplot.xlabel('Actual')
print('Mean squared error: %.4f' % mean_squared_error(y_test, ridge_y_pred))

In [None]:
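#distplot is deprecated in seaborn; histplot with a KDE overlay draws a comparable plot
sns.histplot((y_test-ridge_y_pred).ravel(), kde=True, stat="density")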

The distribution of the error for the Ridge regression model is right (positively) skewed for the data used here.

In [None]:
plt.pyplot.scatter(y_test, lasso_y_pred)
plt.pyplot.ylabel('Predicted')
plt.pyplot.xlabel('Actual')
print('Mean squared error: %.4f' % mean_squared_error(y_test, lasso_y_pred))

In [None]:
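#flatten both arrays so the subtraction is element-wise even if their shapes differ ((n, 1) vs (n,))
sns.histplot(y_test.ravel()-lasso_y_pred.ravel(), kde=True, stat="density")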

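Lasso can shrink coefficients exactly to zero, which is why Lasso regression is used for feature selection. Here, with alpha=50 on standardized data, the penalty is strong enough to drive every coefficient to zero, so the model predicts a constant. This reflects the choice of alpha rather than showing that no feature has any significant impact on the target variable.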
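The effect of the penalty strength on Lasso coefficients can be seen on synthetic data. In the sketch below, the data, the true coefficient of 3.0, and the two `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first of three features actually drives the target.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

weak = Lasso(alpha=0.1).fit(X, y)   # mild penalty: the informative feature survives
strong = Lasso(alpha=50).fit(X, y)  # very strong penalty: every coefficient is zeroed

print(weak.coef_)
print(strong.coef_)
```

With a very large alpha the model keeps only the intercept, so all-zero coefficients say more about the penalty than about the features.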

# Conclusion
In this case, it is clear from the figures that Lasso performs the worst of the three models.