# Car Price Prediction (Model)
## CRISP-DM Methodological Steps:

## 1. Business Understanding

### Business Problem:
The project addresses the issue of accurately predicting car prices. Mispricing can lead to missed opportunities for buyers, sellers, and dealerships. By leveraging predictive analytics, we can better understand the factors influencing car prices, providing more accurate price estimates and helping stakeholders make informed decisions. This approach enhances the efficiency of the car market, benefiting all parties involved.

### Business Goal:
The primary goal of this project is to predict car prices accurately. By analyzing features such as manufacturing year, present price, kilometers driven, fuel type, seller type, transmission, and number of previous owners, we aim to build a model that predicts `Selling_Price` reliably. This will help buyers, sellers, and dealerships make better decisions, improve market efficiency, and streamline transactions.

### Introduction:
This section summarizes the car price dataset and its key features:

#### About Dataset:
- **Car_Name**: Name of the car
- **Year**: Manufacturing year of the car
- **Selling_Price**: Price at which the car is sold
- **Present_Price**: Current market price of the car
- **Kms_Driven**: Total kilometers driven by the car
- **Fuel_Type**: Type of fuel used (e.g., Petrol, Diesel)
- **Seller_Type**: Type of seller (e.g., Individual, Dealer)
- **Transmission**: Transmission type (e.g., Manual, Automatic)
- **Owner**: Number of previous owners

### Summary of Predictive Analytics for Car Price Prediction Project:
This project applies machine learning to car data in order to predict selling prices. Two algorithms are used, **Linear Regression** and **RandomForestRegressor**; each learns from historical data how the available features relate to price. Reliable price estimates let buyers and sellers make more informed decisions. For the car industry, this supports market transparency, better pricing strategies, and sounder financial choices for stakeholders.

## Importing Necessary Libraries

In [2]:
import pandas as pd               # Data manipulation and analysis
import numpy as np                # Numerical operations
import matplotlib.pyplot as plt    # Visualization of data
import seaborn as sns             # Statistical data visualization
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler  # Data preprocessing
from sklearn.linear_model import LinearRegression  # Linear regression model for predictions
from sklearn.ensemble import RandomForestRegressor  # Random forest model for predictions
from sklearn.metrics import r2_score  # Model evaluation (R^2 for regression)
from sklearn.model_selection import train_test_split  # Data splitting for training and testing
import joblib                     # Used to save the trained model

## 2. Data Understanding

### Loading the Dataset

In [5]:
df=pd.read_csv(r"C:\Users\original\Downloads\Car Price Data.csv")

### Displaying the First 5 Rows of the Dataset


In [7]:
df.head(5)

| | Car_Name | Year | Selling_Price | Present_Price | Kms_Driven | Fuel_Type | Seller_Type | Transmission | Owner |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ritz | 2014 | 3.35 | 5.59 | 27000 | Petrol | Dealer | Manual | 0 |
| 1 | sx4 | 2013 | 4.75 | 9.54 | 43000 | Diesel | Dealer | Manual | 0 |
| 2 | ciaz | 2017 | 7.25 | 9.85 | 6900 | Petrol | Dealer | Manual | 0 |
| 3 | wagon r | 2011 | 2.85 | 4.15 | 5200 | Petrol | Dealer | Manual | 0 |
| 4 | swift | 2014 | 4.6 | 6.87 | 42450 | Diesel | Dealer | Manual | 0 |


### Displaying the Last 4 Rows of the Dataset:


In [9]:
df.tail(4)

| | Car_Name | Year | Selling_Price | Present_Price | Kms_Driven | Fuel_Type | Seller_Type | Transmission | Owner |
|---|---|---|---|---|---|---|---|---|---|
| 297 | brio | 2015 | 4.0 | 5.9 | 60000 | Petrol | Dealer | Manual | 0 |
| 298 | city | 2009 | 3.35 | 11.0 | 87934 | Petrol | Dealer | Manual | 0 |
| 299 | city | 2017 | 11.5 | 12.5 | 9000 | Diesel | Dealer | Manual | 0 |
| 300 | brio | 2016 | 5.3 | 5.9 | 5464 | Petrol | Dealer | Manual | 0 |


### Checking the Shape of the Dataset (Rows, Columns)

In [11]:
df.shape

(301, 9)

### Displaying Information About the Dataset

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


### Checking for Missing Values


In [15]:
df.isnull().sum()

Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64

### Summary Statistics of the Dataset


In [17]:
df.describe()

| | Year | Selling_Price | Present_Price | Kms_Driven | Owner |
|---|---|---|---|---|---|
| count | 301.0 | 301.0 | 301.0 | 301.0 | 301.0 |
| mean | 2013.627907 | 4.661296 | 7.628472 | 36947.20598 | 0.043189 |
| std | 2.891554 | 5.082812 | 8.644115 | 38886.883882 | 0.247915 |
| min | 2003.0 | 0.1 | 0.32 | 500.0 | 0.0 |
| 25% | 2012.0 | 0.9 | 1.2 | 15000.0 | 0.0 |
| 50% | 2014.0 | 3.6 | 6.4 | 32000.0 | 0.0 |
| 75% | 2016.0 | 6.0 | 9.9 | 48767.0 | 0.0 |
| max | 2018.0 | 35.0 | 92.6 | 500000.0 | 3.0 |
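The summary statistics show a `Kms_Driven` maximum of 500000 against a 75th percentile of 48767, which hints at outliers. One common way to flag such values is the 1.5 × IQR rule. The sketch below is only an illustration: it uses the five `Kms_Driven` values from the first rows plus the reported maximum, not the full dataset, and the notebook itself does not apply this step.

```python
import pandas as pd

# Small sample: Kms_Driven from the first five rows plus the reported maximum.
kms = pd.Series([27000, 43000, 6900, 5200, 42450, 500000], name="Kms_Driven")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = kms.quantile(0.25), kms.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for inspection, not removed automatically.
outliers = kms[(kms < lower) | (kms > upper)]
print(outliers)
```

Whether to drop, cap, or keep flagged rows is a modeling decision; tree-based and linear models react to extreme values differently.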


### Checking for Duplicates

In [19]:
df.duplicated().sum()

2

In [20]:
df.drop_duplicates(inplace=True)

In [21]:
df.duplicated().sum()

0
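Before dropping duplicates, it can be useful to look at which rows actually repeat. The following is a minimal sketch on a made-up three-row frame (`toy` is a hypothetical stand-in, not the real data):

```python
import pandas as pd

# Toy frame standing in for the car data; the third row repeats the first.
toy = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ritz"],
    "Year": [2014, 2013, 2014],
    "Selling_Price": [3.35, 4.75, 3.35],
})

# keep=False marks every member of a duplicate group, not only the later copies.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence; reset_index gives a gap-free index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))
```

Resetting the index is optional; without it, the index keeps gaps where rows were removed.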

### Checking the Number of Unique Values in Each Column:


In [23]:
df.nunique()

Car_Name          98
Year              16
Selling_Price    156
Present_Price    147
Kms_Driven       206
Fuel_Type          3
Seller_Type        2
Transmission       2
Owner              3
dtype: int64

## 3. Data Preparation


### Dropping Unnecessary Columns:


In [26]:
df.drop(['Car_Name'],axis=1,inplace=True)

### Encoding Categorical Columns


In [28]:
df.dtypes

Year               int64
Selling_Price    float64
Present_Price    float64
Kms_Driven         int64
Fuel_Type         object
Seller_Type       object
Transmission      object
Owner              int64
dtype: object

In [29]:
df.columns

Index(['Year', 'Selling_Price', 'Present_Price', 'Kms_Driven', 'Fuel_Type',
       'Seller_Type', 'Transmission', 'Owner'],
      dtype='object')

In [30]:
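le = LabelEncoder()
# The same encoder is refit on each column, so afterwards `le` only holds the Transmission mapping.
df['Fuel_Type'] = le.fit_transform(df['Fuel_Type'])
df['Seller_Type'] = le.fit_transform(df['Seller_Type'])
df['Transmission'] = le.fit_transform(df['Transmission'])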
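`LabelEncoder` assigns integer codes in sorted label order, and reusing one encoder across columns discards the earlier mappings. A possible variation, sketched here on made-up rows, keeps one encoder per column so each mapping can be inspected or reversed later. The `toy` frame, the `encoders` dictionary, and the `CNG` label are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; "CNG" is an assumed third fuel category for the demo.
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# One encoder per column, so every mapping stays available after encoding.
encoders = {}
for col in ["Fuel_Type", "Transmission"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original labels in code order (sorted alphabetically).
for col, enc in encoders.items():
    print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))

# inverse_transform recovers the original label from a code.
print(encoders["Fuel_Type"].inverse_transform([2]))
```

Note that integer codes imply an ordering (here CNG < Diesel < Petrol) that a linear model will treat as meaningful, which is one reason one-hot encoding is often preferred for nominal features.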

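### Encoding Categorical Variables with One-Hot Encoding (pd.get_dummies)


One-hot encoding is an alternative to the label encoding above. Apply it instead of the `LabelEncoder` cell, not after it: once the columns are label-encoded, `get_dummies` would only expand the integer codes.

df = pd.get_dummies(df, columns=['Fuel_Type', 'Seller_Type', 'Transmission'], drop_first=True)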
df

### Splitting Features and Target Variable


In [34]:
x=df.drop(['Selling_Price'],axis=1)
y=df['Selling_Price']

### Splitting Data into Training and Test Sets

In [36]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2 ,random_state=42)
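To make the effect of `test_size=0.2` and `random_state=42` concrete, the sketch below splits a placeholder array with 299 rows (the row count left after the two duplicates were dropped). The arrays `X` and `y` are synthetic stand-ins, not the car features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 299 rows, matching the de-duplicated dataset size.
X = np.arange(299 * 2).reshape(299, 2)
y = np.arange(299)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))
```

With a test set this small, a single split can give a noisy performance estimate, so scores from one split should be read with some caution.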

## 4. Modeling


### Training the Model


In [39]:
model=LinearRegression()
model.fit(x_train,y_train)
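After fitting, a linear model exposes one coefficient per feature plus an intercept, which is one way to see how each feature moves the prediction. This sketch uses noise-free synthetic data so the recovered values are easy to check; `toy_model` and the data are hypothetical and unrelated to the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free target: y = 3 + 2*x0 - 1*x1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

toy_model = LinearRegression().fit(X, y)

# coef_ follows the column order of X; intercept_ is the constant term.
print("intercept:", toy_model.intercept_)
print("coefficients:", toy_model.coef_)
```

On the real data, the same attributes on `model` could be paired with `x_train.columns` to label each coefficient.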

## 5. Evaluation


### Making Predictions


In [42]:
y_predict=model.predict(x_test)

### Evaluating the Model Performance


In [44]:
print(r2_score(y_test,y_predict))

0.7410829335730038
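The R² score reported above is defined as 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand on made-up numbers and adds MAE and RMSE, which express the error in the target's own units. These extra metrics are a suggestion and are not computed in the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices for illustration.
y_true = np.array([3.35, 4.75, 7.25, 2.85, 4.60])
y_pred = np.array([3.0, 5.0, 7.0, 3.0, 5.0])

# R^2 from its definition.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Error metrics in the same units as the target.
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("R^2 (manual): ", r2_manual)
print("R^2 (sklearn):", r2_score(y_true, y_pred))
print("MAE:", mae, "RMSE:", rmse)
```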


### Saving the Linear Regression Model


In [46]:
joblib.dump(model, r"C:\Users\original\Downloads\Car_Price_Model.pkl")

['C:\\Users\\original\\Downloads\\Car_Price_Model.pkl']
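A saved model is only useful if it can be loaded back and used for prediction. The round trip can be sketched as follows, using a tiny stand-in model and a temporary directory; the file name `toy_model.pkl` and the data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model: y = 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Save to a temporary location and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model predicts exactly like the original.
new_rows = np.array([[5.0]])
print(restored.predict(new_rows))
```

When loading the real model, new inputs need the same columns, in the same order and with the same encoding, as the training data.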

### Training the Model with RandomForestRegressor

In [48]:
model2=RandomForestRegressor()  # no random_state is set, so results vary from run to run
model2.fit(x_train,y_train)
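A fitted random forest exposes `feature_importances_`, which gives a rough ranking of how much each feature contributed to the splits. The sketch below demonstrates this on synthetic data in which the target is driven almost entirely by one column; the data, the column relationship, and the `forest` name are assumptions for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features that reuse three of the dataset's column names.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Present_Price": rng.uniform(1, 20, 200),
    "Kms_Driven": rng.uniform(500, 100000, 200),
    "Owner": rng.integers(0, 3, 200),
})
# Assumed relationship for the demo: price depends on Present_Price plus noise.
y = 0.6 * X["Present_Price"] + rng.normal(0, 0.2, 200)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; higher means the feature was more useful for splitting.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to `model2`, the same pattern with `x_train.columns` as the index would label the importances for the car features.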

In [49]:
y_predict_rf=model2.predict(x_test)

In [50]:
print(r2_score(y_test,y_predict_rf))

0.49731206240825965
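The two R² values reported in this notebook come from a single 80/20 split, and the forest was trained without a fixed seed, so the comparison may shift between runs. K-fold cross-validation with fixed seeds is one way to get a steadier comparison. The sketch below runs on synthetic data; the data, the `models` dictionary, and the fold count are assumptions rather than part of the original workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem with a mostly linear signal.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# Shuffled 5-fold CV with a fixed seed so every model sees the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {
    name: cross_val_score(m, X, y, cv=cv, scoring="r2")
    for name, m in models.items()
}
for name, s in scores.items():
    print(f"{name}: mean R^2 = {s.mean():.3f} (+/- {s.std():.3f})")
```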


### Saving the RandomForestRegressor Model


In [59]:
joblib.dump(model2, r"C:\Users\original\Downloads\Car Price Data.pkl")

['C:\\Users\\original\\Downloads\\Car Price Data.pkl']