# Car Price Prediction using Machine Learning with Python

<b>Problem Statement:</b><br>
We have to predict the price of a used car from several features, such as present price, distance driven, fuel type, and transmission type.
<br>
This model will help both car buyers and sellers estimate the market value of a used car. <br>
<br>
<b>Understanding the Data:</b><br>
The data used in this machine learning project is about used car prices. Specifically, each data point contains information such as current price, distance driven, fuel type, and transmission. We will use data analysis techniques to understand the data better and gain new insights.


#### Mounting Google Drive to import the required data

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd gdrive/My Drive/Colab Notebooks/Predictive-Analytics

#### Importing the dependencies for this project

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn import metrics

#### Data pre-processing

In [None]:
# Reading the data from the csv file to pandas dataframe
cardata = pd.read_csv("UsedCarData.csv")

In [None]:
# Inspecting the first rows of the imported data
cardata.head()

In [None]:
# Checking the shape/ number of data points
cardata.shape

In [None]:
# Getting the information about the dataset/ Different columns and their datatypes
cardata.info()

In [None]:
# Checking for missing/na values in the data set
cardata.isna().any()

In [None]:
# isnull() is an alias of isna(), so this repeats the check above
cardata.isnull().any()

No values are missing from the data set. 
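
A per-column count can be easier to read than the boolean flags returned by `any()`. The following is a minimal sketch on a small made-up frame (`demo` and its values are hypothetical, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up
demo = pd.DataFrame({
    "Present_Price": [5.59, np.nan, 9.85],
    "Driven_kms": [27000, 43000, 6900],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Total number of missing values in the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

On the real data, the same call would be `cardata.isna().sum()`.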

In [None]:
# Summary statistics with upper percentiles, used to check for outliers
# From this summary we conclude that we don't have any outliers.
cardata.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])
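
One common way to make an outlier check explicit is the 1.5 × IQR rule. This is not the method used in this notebook, only a sketch of an alternative; the series `prices` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy values (made up), with one obviously extreme entry
prices = pd.Series([3.35, 4.75, 7.25, 2.85, 4.60, 35.0], name="Selling_Price")

# Interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```

Whether a flagged value is a genuine outlier or just an expensive car is a judgment call that depends on the data.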

In [None]:
# Checking the distribution of the categorical data
print(cardata.Fuel_Type.value_counts())
print("-------"*2)
print(cardata.Selling_type.value_counts())
print("-------"*2)
print(cardata.Transmission.value_counts())

In [None]:
sns.heatmap(cardata[["Year", "Selling_Price", "Present_Price","Driven_kms", "Owner"]].corr(), annot=True, fmt='.2f')

Machine learning algorithms cannot work directly with text data, so we convert the categorical data into numerical data through encoding.

In [None]:
# Encoding the categorical data
cardata.replace({'Fuel_Type':{'Petrol': 0,'Diesel':1,'CNG':2}}, inplace = True)
cardata.replace({'Selling_type':{'Dealer': 0,'Individual':1}}, inplace = True)
cardata.replace({'Transmission':{'Manual': 0,'Automatic':1}}, inplace = True)
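
Integer codes such as 0/1/2 for `Fuel_Type` imply an ordering (Petrol < Diesel < CNG) that a linear model will treat as numeric. One-hot encoding is a possible alternative that avoids this. The sketch below uses a made-up frame (`demo`) and is not the encoding applied in this notebook:

```python
import pandas as pd

# Hypothetical toy data; values are made up
demo = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

# One indicator column per fuel type instead of a single 0/1/2 code
encoded = pd.get_dummies(demo, columns=["Fuel_Type"], dtype=int)
print(encoded)
```

For two-level columns such as `Transmission`, a single 0/1 code carries the same information as one-hot encoding.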

In [None]:
cardata.head()

In [None]:
sns.heatmap(cardata[["Year", "Selling_Price", "Present_Price","Driven_kms", "Owner","Fuel_Type", "Selling_type","Transmission" ]].corr(), annot=True, fmt='.2f')

In [None]:
sns.pairplot(cardata[["Year", "Selling_Price", "Present_Price","Driven_kms", "Owner","Fuel_Type", "Selling_type","Transmission" ]])
plt.show()

From the pair plot, three interesting relationships came up:
1. Selling price versus present price
2. Selling price versus year
3. Selling price versus distance driven

Creating a scatter plot for the first of these

In [None]:
fig, ax = plt.subplots()

ax.scatter(
    cardata["Selling_Price"], 
    cardata["Present_Price"], 
    alpha=.3
);

ax.set_title("Selling Price vs. Present Price");
ax.set_ylabel("Present Price");
ax.set_xlabel("Selling Price");

#### Train & Test Data

We define two variables, X and Y, where X holds the independent variables
and Y is the target variable (what we are trying to predict).

In [None]:
X = cardata.drop(['Car_Name', 'Selling_Price'], axis = 1)
Y = cardata["Selling_Price"]

In [None]:
X.head()

In [None]:
Y.head()

We now split the data into training and test sets.

Since the data is very limited, we keep the test set small.


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
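
Without a fixed seed, `train_test_split` shuffles differently on every run, so the scores reported later change from run to run. Passing `random_state` makes the split repeatable. A minimal sketch on stand-in arrays (`X_demo`, `y_demo`, and the seed 42 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# random_state=42 is an arbitrary choice; any fixed integer works
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Both calls produce identical train/test partitions
X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)
```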

#### Model Training

In [None]:
# Linear Regression Model
linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train, Y_train)
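
`Lasso` is imported above but not used in this section. As a sketch of how it could be fitted in the same way, the example below trains it on synthetic data. The data, the variable names, and `alpha=0.1` are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: the target depends on the first two features only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

# alpha controls the strength of the L1 penalty; 0.1 is an arbitrary value
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_demo, y_demo)

# The L1 penalty shrinks coefficients toward zero
print(lasso_model.coef_)
```

On the real data, the same `fit` call would take `X_train` and `Y_train`.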


#### Model Evaluation

In [None]:
Y_predict = linear_regression_model.predict(X_test)

In [None]:
plt.scatter(Y_test, Y_predict)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual Price versus Predicted Price")
plt.show()

In [None]:
# R-squared value (coefficient of determination), not an error measure
r2 = metrics.r2_score(Y_test, Y_predict)
print(f"R-Squared Value: {r2}")
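
For reference, R-squared is defined as 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below computes it by hand on made-up numbers and compares it with `metrics.r2_score`. It also shows the mean absolute error, which is expressed in the same units as the price:

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted prices (made up)
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# R-squared from its definition
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Same quantity from scikit-learn, plus mean absolute error
r2_sklearn = metrics.r2_score(y_true, y_pred)
mae = metrics.mean_absolute_error(y_true, y_pred)
print(r2_manual, r2_sklearn, mae)
```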