# PRICE PREDICTICTORS FOR USED CARS.

> __AUTHORS:__ BILL KISUYA, JOAN NJOROGE, BRENDA MUTAI, BRIAN NGENY, JEFF KIARIE & IVAN KIBET.

### Business understanding

#### Introduction.
This project focuses on determining the different factors that affect the selling price of used cars. A vehicle is almost always a depreciating asset i.e the value decreases as time goes by. While the motor vehicle industry is a multi-billion global industry that employs millions of people globally, its interdependence with technology is crucial in determining the different factors that affect the resale price of vehicles. In doing so, we will be able to provide useful insight to car dealerships and owners to help them make informed choices when it comes to purchasing and selling used cars.

#### Problem statement.
Since there are several different factors to be considered when one buys a vehicle, its is hard to determine the selling price of a used car. This project aims to eradicate that problem by combining several different factors to determine how each of them affects the final selling price of a used vehicle.

### Data understanding
The data provided for use in this project contains 301 rows and 9 columns. these are;
> __Car_Name__ - provides the name of each vehicle.<br>
> __Year__ - the year the vehicle was manufactured.<br>
> __Selling_Price__ - the final selling price of the vehicle.<br>
> __Present_Price__ - the initial asking price of the vehicle.<br>
> __Kms_Driven__ - the milage on the car.<br>
> __Fuel_Type__ - the type of fuel used by the car.<br>
> __Seller_Type__ - the current ownership of the car.<br>
> __Transmission__ - the type of gearbox transmission system of the car.<br>
> __Owner__

### Data preparation

In [17]:
#import modules, then load and preview the data 
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/car data.csv')
df.head()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [18]:
# information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


#### Data cleaning.
The process of cleaning these data involves, checking for duplicated records and discarding any duplicates and then checking the value counts in categorical columns before creating dummy variables in preparation for modelling.

In [19]:
# drop duplicate values
df.drop_duplicates(keep='first')

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.60,6.87,42450,Diesel,Dealer,Manual,0
...,...,...,...,...,...,...,...,...,...
296,city,2016,9.50,11.60,33988,Diesel,Dealer,Manual,0
297,brio,2015,4.00,5.90,60000,Petrol,Dealer,Manual,0
298,city,2009,3.35,11.00,87934,Petrol,Dealer,Manual,0
299,city,2017,11.50,12.50,9000,Diesel,Dealer,Manual,0


In [20]:
# objects in the dataset
objects_df = df.select_dtypes(include=object)
objects_df
# find unique varibles and null values in the categorical data
for column in objects_df.columns:
    print(f"Column name: '{column}'")
    print(f"No. of unique values: {len(objects_df[column].unique())}")
    print(f"No. of null values: {objects_df[column].isnull().sum()}")
    print(f"% of null values: {(objects_df[column].isnull().sum() / len(objects_df[column]) * 100) }")
    print(objects_df[column].value_counts())
    print()

Column name: 'Car_Name'
No. of unique values: 98
No. of null values: 0
% of null values: 0.0
Car_Name
city                        26
corolla altis               16
verna                       14
fortuner                    11
brio                        10
                            ..
Honda CB Trigger             1
Yamaha FZ S                  1
Bajaj Pulsar 135 LS          1
Activa 4g                    1
Bajaj Avenger Street 220     1
Name: count, Length: 98, dtype: int64

Column name: 'Fuel_Type'
No. of unique values: 3
No. of null values: 0
% of null values: 0.0
Fuel_Type
Petrol    239
Diesel     60
CNG         2
Name: count, dtype: int64

Column name: 'Seller_Type'
No. of unique values: 2
No. of null values: 0
% of null values: 0.0
Seller_Type
Dealer        195
Individual    106
Name: count, dtype: int64

Column name: 'Transmission'
No. of unique values: 2
No. of null values: 0
% of null values: 0.0
Transmission
Manual       261
Automatic     40
Name: count, dtype: int64



From the analysis above, we can go ahead and create dummy variables for the following columns; Fuel_Type, Seller_Type & Transmission. The Car_Name column has 98 unique values and thus will not be a good idea to create dummy variables for it.

In [21]:
# create a dataframe of dummy variables
df2 = pd.get_dummies(df[['Fuel_Type', 'Seller_Type', 'Transmission']],drop_first=True).astype(int)
df2

Unnamed: 0,Fuel_Type_Diesel,Fuel_Type_Petrol,Seller_Type_Individual,Transmission_Manual
0,0,1,0,1
1,1,0,0,1
2,0,1,0,1
3,0,1,0,1
4,1,0,0,1
...,...,...,...,...
296,1,0,0,1
297,0,1,0,1
298,0,1,0,1
299,1,0,0,1


In [22]:
# merge the dummy variables dataframe with the original dataframe
df_cleaned = pd.concat([df, df2], axis=1)
df_cleaned

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner,Fuel_Type_Diesel,Fuel_Type_Petrol,Seller_Type_Individual,Transmission_Manual
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0,0,1,0,1
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0,1,0,0,1
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0,0,1,0,1
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0,0,1,0,1
4,swift,2014,4.60,6.87,42450,Diesel,Dealer,Manual,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,city,2016,9.50,11.60,33988,Diesel,Dealer,Manual,0,1,0,0,1
297,brio,2015,4.00,5.90,60000,Petrol,Dealer,Manual,0,0,1,0,1
298,city,2009,3.35,11.00,87934,Petrol,Dealer,Manual,0,0,1,0,1
299,city,2017,11.50,12.50,9000,Diesel,Dealer,Manual,0,1,0,0,1


In [23]:
# drop the columns we do not need
df_cleaned.drop(['Seller_Type', 'Fuel_Type', 'Transmission', 'Present_Price', 'Car_Name'], axis=1, inplace=True)
df_cleaned

Unnamed: 0,Year,Selling_Price,Kms_Driven,Owner,Fuel_Type_Diesel,Fuel_Type_Petrol,Seller_Type_Individual,Transmission_Manual
0,2014,3.35,27000,0,0,1,0,1
1,2013,4.75,43000,0,1,0,0,1
2,2017,7.25,6900,0,0,1,0,1
3,2011,2.85,5200,0,0,1,0,1
4,2014,4.60,42450,0,1,0,0,1
...,...,...,...,...,...,...,...,...
296,2016,9.50,33988,0,1,0,0,1
297,2015,4.00,60000,0,0,1,0,1
298,2009,3.35,87934,0,0,1,0,1
299,2017,11.50,9000,0,1,0,0,1


### Modeling
Now, we split the data into X - independet variables and y - dependent/target variable. We shall then perform a train_test_split before fitting the training data on to a linear regression model and then test the model using the test data.

In [24]:
# creating X and y variables
X = df_cleaned.drop(['Selling_Price'], axis=1)
y = df_cleaned['Selling_Price']

In [25]:
# performing a train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [26]:
# fitting the training data onto a linear regression model
model = LinearRegression()

model.fit(X_train, y_train)

# predicting y values using the model
y_train_pred = model.predict(X_train)

In [27]:
# # fitting the test data onto a linear regression model
model.fit(X_test, y_test)

# predicting y values using the model
y_test_pred = model.predict(X_test)

### Evaluation

We shall now evaluate the performance of the model by looking at the mean_squared_error and RMSE of the training data and the test data.

In [28]:
# mean_squared_error
from sklearn.metrics import mean_squared_error

# training data mse 
train_mse = mean_squared_error(y_train, y_train_pred)
print('Training data MSE:', train_mse)

print()

# test data mse 
test_mse = mean_squared_error(y_test, y_test_pred)
print('Test data MSE:', test_mse)

Training data MSE: 10.403536887517268

Test data MSE: 10.917846329814795


In [29]:
# root_mean_squared_error
from sklearn.metrics import mean_squared_error

# training data rmse 
train_rmse = mean_squared_error(y_train, y_train_pred, squared=False)
print('Training data RMSE:', train_rmse)

print()

# test data rmse 
test_rmse = mean_squared_error(y_test, y_test_pred, squared=False)
print('Test data RMSE:', test_rmse)

Training data RMSE: 3.225451423834696

Test data RMSE: 3.304216447179996


In [33]:
model.get_params

<bound method BaseEstimator.get_params of LinearRegression()>

In [32]:
model.coef_

array([ 4.36776435e-01, -3.29063043e-06, -9.41131444e-02,  2.59770204e+00,
       -2.59770204e+00, -3.22757888e+00, -6.41440579e+00])

In [34]:
pd.DataFrame(zip(X.columns, model.coef_))

Unnamed: 0,0,1
0,Year,0.436776
1,Kms_Driven,-3e-06
2,Owner,-0.094113
3,Fuel_Type_Diesel,2.597702
4,Fuel_Type_Petrol,-2.597702
5,Seller_Type_Individual,-3.227579
6,Transmission_Manual,-6.414406


## Summary
This project focuses on determining the factors that influence the selling price of used cars. A dataset containing information about used cars, such as car details, car features, and pricing details, was analyzed. The dataset was cleaned by removing duplicate records and converting categorical columns into dummy variables. A linear regression model was built to predict the selling price of used cars based on various features.

The key findings from the analysis are as follows:

* The dataset includes cars of various fuel types, seller types, and transmission types. Most cars use Petrol as fuel, are sold by dealers, and have manual transmission systems.

* The model identified the important features that influence the selling price of used cars. Notably, the year of manufacture, type of fuel (Diesel), and ownership type (Individual) have a positive impact on the selling price. However, features like manual transmission and higher mileage (Kms_Driven) have a negative impact on the selling price.

## Conclusion:
In conclusion, this project provides valuable insights into the factors affecting the selling price of used cars. By analyzing the dataset and building a linear regression model, we can predict the selling price of used cars based on their features. Car dealerships and owners can use this information to make informed decisions while purchasing or selling used cars.

The linear regression model performed reasonably well, with the mean squared error (MSE) and root mean squared error (RMSE) values indicating a relatively low prediction error. However, it is important to note that the model's performance may vary depending on the dataset's characteristics and the data used for training and testing.