# Car Price Prediction

In [1]:
from IPython.display import Image
Image(url='https://miro.medium.com/v2/resize:fit:1200/0*Y7SWB-YvdAfsAUYZ.png')


# Table of Contents
- [1. Project Overview](#1-project-overview)
  - [1.1 Introduction](#11-introduction)
  - [1.2 Problem Statement](#12-problem-statement)
  - [1.3 Objectives](#13-objectives)
- [2. Importing Packages](#2-importing-packages)
- [3. Loading Data](#3-loading-data)
- [4. Data Cleaning](#4-data-cleaning)
- [5. Exploratory Data Analysis (EDA)](#5-exploratory-data-analysis-eda)
- [6. Regression](#6-regression)
- [7. Conclusion](#7-conclusion)

### 1. Project Overview

##### 1.1 Introduction

The automobile industry is one of the most significant contributors to the global economy. With the surge in demand for both new and used cars, understanding the factors that determine car prices is essential for consumers, manufacturers, and resellers. Predicting car prices accurately can help customers make informed purchase decisions, aid manufacturers in pricing strategies, and assist resellers in maximizing profits.

This project aims to use data science techniques to analyze car attributes and build a machine learning model capable of predicting car prices. The project will also uncover insights into how various features, such as mileage, age, and horsepower, influence car value.

##### 1.2 Problem Statement

The car market is highly dynamic, with prices influenced by numerous factors like make, model, age, mileage, and features. Traditional methods of determining car prices often rely on manual appraisals, which can be subjective and inconsistent.

This creates a need for a data-driven approach to predict car prices accurately. The challenge is to build a model that can:

- Handle the diversity of car features.
- Account for non-linear relationships between features and prices.
- Provide explainable predictions for better decision-making.

##### 1.3 Objectives

1. Develop a Machine Learning Model:
   - Build and evaluate a regression model to predict car prices based on their features.
2. Feature Analysis:
   - Identify and quantify the influence of various features (e.g., brand, age, mileage, and horsepower) on car prices.
3. Provide Insights:
   - Offer actionable insights for buyers, sellers, and manufacturers based on the model's output and feature importance.


### 2. Importing Packages

To carry out data cleaning, manipulation, and visualization, we’ll use the following Python libraries:

- pandas: Provides data structures and functions needed to efficiently clean and manipulate the dataset.
- numpy: Adds support for numerical operations, including handling arrays and mathematical functions for outlier treatment.
- matplotlib and seaborn: Libraries for data visualization. matplotlib is a core plotting library, while seaborn builds on it to provide more aesthetic and statistical visualizations.

In [2]:
# Libraries for data loading, manipulation and analysis

import numpy as np
import pandas as pd
import csv
import seaborn as sns
import matplotlib.pyplot as plt

# Displays output inline
%matplotlib inline

# Suppress warnings to keep the notebook output uncluttered
import warnings
warnings.filterwarnings('ignore')

### 3. Loading Data

The data for this project is stored in the car_prediction_data.csv file. To make it easier to manipulate and analyse, the file is loaded into a pandas DataFrame using the pandas function read_csv().

In [3]:
# loading dataset
df = pd.read_csv("car_prediction_data.csv", index_col=False)

In [4]:
df.head()

| | Car_Name | Year | Selling_Price | Present_Price | Kms_Driven | Fuel_Type | Seller_Type | Transmission | Owner |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ritz | 2014 | 3.35 | 5.59 | 27000 | Petrol | Dealer | Manual | 0 |
| 1 | sx4 | 2013 | 4.75 | 9.54 | 43000 | Diesel | Dealer | Manual | 0 |
| 2 | ciaz | 2017 | 7.25 | 9.85 | 6900 | Petrol | Dealer | Manual | 0 |
| 3 | wagon r | 2011 | 2.85 | 4.15 | 5200 | Petrol | Dealer | Manual | 0 |
| 4 | swift | 2014 | 4.6 | 6.87 | 42450 | Diesel | Dealer | Manual | 0 |


Check the DataFrame to see if it loaded correctly.

In [5]:
# Displays the number of rows and columns
df.shape

(301, 9)

Result: The dataset consists of 301 rows (observations) and 9 columns (features).
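
As a minimal sketch of the load-and-verify step, the five rows shown in the head output can be read from an in-memory CSV string instead of the real file. The names `sample_csv` and `sample` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# First five records from the df.head() output, held in memory so the
# example runs without car_prediction_data.csv on disk.
sample_csv = """Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0
"""

# index_col=False tells pandas not to use the first column as the index
sample = pd.read_csv(io.StringIO(sample_csv), index_col=False)

# Unpack the shape to confirm the expected number of rows and columns
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
print(sample.dtypes)
```

On the full file, the same shape check is what reports 301 rows and 9 columns.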

In [6]:
# Display summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


In [7]:
df.describe()

| | Year | Selling_Price | Present_Price | Kms_Driven | Owner |
|---|---|---|---|---|---|
| count | 301.0 | 301.0 | 301.0 | 301.0 | 301.0 |
| mean | 2013.627907 | 4.661296 | 7.628472 | 36947.20598 | 0.043189 |
| std | 2.891554 | 5.082812 | 8.644115 | 38886.883882 | 0.247915 |
| min | 2003.0 | 0.1 | 0.32 | 500.0 | 0.0 |
| 25% | 2012.0 | 0.9 | 1.2 | 15000.0 | 0.0 |
| 50% | 2014.0 | 3.6 | 6.4 | 32000.0 | 0.0 |
| 75% | 2016.0 | 6.0 | 9.9 | 48767.0 | 0.0 |
| max | 2018.0 | 35.0 | 92.6 | 500000.0 | 3.0 |
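
The gap between the 75th percentile and the maximum of Kms_Driven (48,767 vs. 500,000) hints at outliers. A rough sketch of the common 1.5 × IQR rule follows, using only the quartiles reported in the summary statistics. The 1.5 multiplier is a conventional choice assumed here, not something the notebook specifies.

```python
# Quartiles and extremes of Kms_Driven, copied from the df.describe() output
q1, q3 = 15000.0, 48767.0
kms_min, kms_max = 500.0, 500000.0

# Interquartile range and the conventional 1.5 * IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Max exceeds upper fence:", kms_max > upper_fence)
print("Min below lower fence:", kms_min < lower_fence)
```

Under this rule the upper fence is 99,417.5 km, so the 500,000 km maximum would be flagged for closer inspection, while the minimum would not.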


### 4. Data Cleaning

##### 4.1 Handle Missing Values

In [8]:
df.isnull().sum()

Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64

The dataset has no missing values: every column reports zero nulls.

##### 4.2 Remove Duplicates

In [9]:
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
296    False
297    False
298    False
299    False
300    False
Length: 301, dtype: bool

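df.duplicated() returns a Boolean Series that marks each row repeating an earlier row. The truncated output shows only the first and last five rows, so on its own it does not confirm whether the dataset contains duplicates.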

In [10]:
df.drop_duplicates(inplace=True)

Any duplicate rows are then removed in place, keeping the first occurrence of each.
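
Because the truncated df.duplicated() output hides most rows, summing the Boolean flags gives a clearer check. The sketch below assumes a small illustrative frame (three rows taken from the head output, with the first one deliberately repeated) rather than the real dataset. The names `toy` and `deduped` are hypothetical.

```python
import pandas as pd

# Illustrative frame: three rows from the head output, with the first row
# repeated on purpose so there is exactly one duplicate to find.
toy = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz", "ritz"],
    "Year": [2014, 2013, 2017, 2014],
    "Selling_Price": [3.35, 4.75, 7.25, 3.35],
})

# True counts as 1, so the sum is the number of duplicated rows
n_duplicates = int(toy.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# Return a new frame (rather than using inplace=True) and rebuild the
# index so it has no gaps after rows are dropped
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Resetting the index is optional. It simply avoids gaps in the row labels, which can be confusing when rows are later selected by position.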

##### 4.3 Encode Categorical Variables

In [11]:
# Automatically detect categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns

# One-hot encode all categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

print("\nDataFrame after One-Hot Encoding All Categorical Columns:")
print(df_encoded)



DataFrame after One-Hot Encoding All Categorical Columns:
     Year  Selling_Price  Present_Price  Kms_Driven  Owner  \
0    2014           3.35           5.59       27000      0   
1    2013           4.75           9.54       43000      0   
2    2017           7.25           9.85        6900      0   
3    2011           2.85           4.15        5200      0   
4    2014           4.60           6.87       42450      0   
..    ...            ...            ...         ...    ...   
296  2016           9.50          11.60       33988      0   
297  2015           4.00           5.90       60000      0   
298  2009           3.35          11.00       87934      0   
299  2017          11.50          12.50        9000      0   
300  2016           5.30           5.90        5464      0   

     Car_Name_Activa 3g  Car_Name_Activa 4g  Car_Name_Bajaj  ct 100  \
0                 False               False                   False   
1                 False               False           

In [13]:
df_encoded.head()


| | Year | Selling_Price | Present_Price | Kms_Driven | Owner | Car_Name_Activa 3g | Car_Name_Activa 4g | Car_Name_Bajaj ct 100 | Car_Name_Bajaj Avenger 150 | Car_Name_Bajaj Avenger 150 street | ... | Car_Name_swift | Car_Name_sx4 | Car_Name_verna | Car_Name_vitara brezza | Car_Name_wagon r | Car_Name_xcent | Fuel_Type_Diesel | Fuel_Type_Petrol | Seller_Type_Individual | Transmission_Manual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | 3.35 | 5.59 | 27000 | 0 | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | True |
| 1 | 2013 | 4.75 | 9.54 | 43000 | 0 | False | False | False | False | False | ... | False | True | False | False | False | False | True | False | False | True |
| 2 | 2017 | 7.25 | 9.85 | 6900 | 0 | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | True |
| 3 | 2011 | 2.85 | 4.15 | 5200 | 0 | False | False | False | False | False | ... | False | False | False | False | True | False | False | True | False | True |
| 4 | 2014 | 4.6 | 6.87 | 42450 | 0 | False | False | False | False | False | ... | True | False | False | False | False | False | True | False | False | True |
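
One-hot encoding Car_Name adds a column for almost every distinct model, which is why the encoded frame is so wide. One possible alternative, sketched here on the assumption that high-cardinality columns may be dropped, is to count the levels of each categorical column first and encode only the small ones. The `max_levels` threshold and the `sample` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Five rows from the head output, restricted to a few columns
sample = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz", "wagon r", "swift"],
    "Year": [2014, 2013, 2017, 2011, 2014],
    "Selling_Price": [3.35, 4.75, 7.25, 2.85, 4.60],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Petrol", "Diesel"],
    "Transmission": ["Manual"] * 5,
})
categorical_columns = ["Car_Name", "Fuel_Type", "Transmission"]

# Number of distinct levels per categorical column
cardinality = sample[categorical_columns].nunique()
print(cardinality)

# Hypothetical threshold: encode columns with at most this many levels
max_levels = 3
low_card = cardinality[cardinality <= max_levels].index.tolist()
high_card = cardinality[cardinality > max_levels].index.tolist()

# Drop the high-cardinality columns, then one-hot encode the rest.
# dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(
    sample.drop(columns=high_card),
    columns=low_card,
    drop_first=True,
    dtype=int,
)
print(encoded)
```

In this small sample Transmission has a single level, so drop_first=True leaves no column for it. Whether dropping Car_Name loses useful signal on the full dataset is a modelling decision that would need checking against validation error.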


### 5. Exploratory Data Analysis (EDA)

### 6. Regression


### 7. Conclusion
