### Exploring the Used Car Market in Saudi Arabia and Predicting Car Prices

#### Objectives:
- Perform Exploratory Data Analysis to uncover trends and patterns in the used car market in Saudi Arabia.
- Build a predictive model to estimate car prices based on key features like mileage, engine size, and year.

In [11]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [25]:
# Read the dataset
df = pd.read_csv('UsedCarsSA_EN.csv')

## Data Preparation 1.A

In [27]:
# Check the condition of the dataset
print(df.shape)

(8035, 13)


In [33]:
df.info()
df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8035 entries, 0 to 8034
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Make         8035 non-null   object 
 1   Type         8035 non-null   object 
 2   Year         8035 non-null   int64  
 3   Origin       8035 non-null   object 
 4   Color        8035 non-null   object 
 5   Options      8035 non-null   object 
 6   Engine_Size  8035 non-null   float64
 7   Fuel_Type    8035 non-null   object 
 8   Gear_Type    8035 non-null   object 
 9   Mileage      8035 non-null   int64  
 10  Region       8035 non-null   object 
 11  Price        8035 non-null   int64  
 12  Negotiable   8035 non-null   bool   
dtypes: bool(1), float64(1), int64(3), object(8)
memory usage: 761.3+ KB


Unnamed: 0,Make,Type,Year,Origin,Color,Options,Engine_Size,Fuel_Type,Gear_Type,Mileage,Region,Price,Negotiable
0,Chrysler,C300,2018,Saudi,Black,Full,5.7,Gas,Automatic,103000,Riyadh,114000,False
1,Nissan,Patrol,2016,Saudi,White,Full,4.8,Gas,Automatic,5448,Riyadh,0,True
2,Nissan,Sunny,2019,Saudi,Silver,Standard,1.5,Gas,Automatic,72418,Riyadh,27500,False
3,Hyundai,Elantra,2019,Saudi,Grey,Standard,1.6,Gas,Automatic,114154,Riyadh,43000,False
4,Hyundai,Elantra,2019,Saudi,Silver,Semi Full,2.0,Gas,Automatic,41912,Riyadh,59500,False
5,Honda,Accord,2018,Saudi,Navy,Full,1.5,Gas,Automatic,39000,Riyadh,72000,False
6,Toyota,Land Cruiser,2011,Saudi,White,Semi Full,4.5,Gas,Automatic,183000,Riyadh,92000,False
7,GMC,Yukon,2009,Saudi,Bronze,Full,5.7,Gas,Automatic,323000,Riyadh,0,True
8,Chevrolet,Impala,2019,Saudi,Black,Standard,3.6,Gas,Automatic,70000,Riyadh,80000,False
9,Toyota,Yaris,2018,Saudi,White,Standard,1.5,Gas,Automatic,131000,Jeddah,32000,False


### Variables in the dataset and a description of each field:
- Make: The car manufacturer.
- Type: Model or type of the car.
- Year: Manufacturing year.
- Origin: Country of origin.
- Color: Color of the car.
- Options: Feature level (e.g., Full, Standard, Semi Full).
- Engine_Size: Engine capacity in liters.
- Fuel_Type: Type of fuel (e.g., Gas, Diesel).
- Gear_Type: Transmission type (e.g., Automatic, Manual).
- Mileage: Distance the car has traveled (in km).
- Region: Region where the car is being sold.
- Price: Selling price.
- Negotiable: Whether the price is negotiable or not.

In [53]:
# Start cleaning the data: first check whether there are any duplicate rows.
sumDup=df.duplicated()
sum(sumDup)

3

In [55]:
# it appears that we have 3 duplicates

In [61]:
df_cars=df.drop_duplicates()
sumDup=df_cars.duplicated()
sum(sumDup)

0
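
The duplicate check above can be illustrated on a small toy frame. This is only a sketch: the rows below are made up for illustration and are not taken from `UsedCarsSA_EN.csv`, and the names `toy`, `duplicate_groups`, and `deduped` are hypothetical.

```python
import pandas as pd

# Toy frame (illustrative values only, not taken from the real dataset)
toy = pd.DataFrame({
    "Make": ["Toyota", "Toyota", "Nissan", "Toyota"],
    "Type": ["Yaris", "Yaris", "Sunny", "Yaris"],
    "Price": [32000, 32000, 27500, 30000],
})

# duplicated() flags each repeat of an earlier row; the first occurrence stays False
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, which helps when inspecting them
duplicate_groups = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row by default
deduped = toy.drop_duplicates()

print(n_duplicates, len(duplicate_groups), len(deduped))
```

Inspecting the rows with `keep=False` before dropping them shows exactly which listings are repeated, which is useful when deciding whether they are true duplicates or legitimate repeat listings.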

In [83]:
# Check for null values in the dataset
df_null= df_cars.isnull().sum()
df_null

Make           0
Type           0
Year           0
Origin         0
Color          0
Options        0
Engine_Size    0
Fuel_Type      0
Gear_Type      0
Mileage        0
Region         0
Price          0
Negotiable     0
dtype: int64

In [89]:
# As seen in the head() output above, some rows have a Price of zero; list them before removing them.
price_zero = df_cars[df_cars['Price'] == 0]
price_zero

Unnamed: 0,Make,Type,Year,Origin,Color,Options,Engine_Size,Fuel_Type,Gear_Type,Mileage,Region,Price,Negotiable
1,Nissan,Patrol,2016,Saudi,White,Full,4.8,Gas,Automatic,5448,Riyadh,0,True
7,GMC,Yukon,2009,Saudi,Bronze,Full,5.7,Gas,Automatic,323000,Riyadh,0,True
18,GMC,Yukon,2018,Saudi,White,Full,5.3,Gas,Automatic,37000,Riyadh,0,True
26,Toyota,Camry,2019,Saudi,Red,Full,2.5,Gas,Automatic,8000,Makkah,0,True
28,Toyota,Avalon,2008,Other,Red,Full,3.5,Gas,Automatic,169000,Riyadh,0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8018,Mazda,CX9,2015,Saudi,Red,Standard,3.6,Gas,Automatic,195000,Al-Medina,0,True
8021,Ford,Explorer,2010,Other,Black,Semi Full,1.6,Gas,Automatic,3275230,Al-Baha,0,True
8022,Toyota,Furniture,2020,Saudi,White,Semi Full,2.7,Gas,Automatic,82000,Makkah,0,True
8024,Toyota,Furniture,2014,Saudi,White,Semi Full,4.0,Gas,Automatic,497480,Riyadh,0,True


In [95]:
# There are 2526 rows with a zero price, so we drop them
df = df_cars[df_cars['Price'] != 0]

# Check the number of rows removed (rows before filtering minus rows after)
zero_removed = len(df_cars) - len(df)

print("Number of rows removed: ",zero_removed)

Number of rows removed:  2526
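
A removed-row count is easiest to read as a positive number, which means subtracting the filtered length from the original length, or counting the boolean mask directly. The sketch below shows both on toy data; the values and the names `toy`, `zero_mask`, and `filtered` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy prices (illustrative only); two of the five rows have a zero price
toy = pd.DataFrame({"Price": [114000, 0, 27500, 0, 43000]})

# Boolean mask marking the rows to remove
zero_mask = toy["Price"] == 0
filtered = toy[~zero_mask]

# Two equivalent ways to count the removed rows
removed_by_length = len(toy) - len(filtered)   # rows before minus rows after
removed_by_mask = int(zero_mask.sum())         # number of True values in the mask

print("Number of rows removed:", removed_by_length)
```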


In [99]:
df.head(10)

Unnamed: 0,Make,Type,Year,Origin,Color,Options,Engine_Size,Fuel_Type,Gear_Type,Mileage,Region,Price,Negotiable
0,Chrysler,C300,2018,Saudi,Black,Full,5.7,Gas,Automatic,103000,Riyadh,114000,False
2,Nissan,Sunny,2019,Saudi,Silver,Standard,1.5,Gas,Automatic,72418,Riyadh,27500,False
3,Hyundai,Elantra,2019,Saudi,Grey,Standard,1.6,Gas,Automatic,114154,Riyadh,43000,False
4,Hyundai,Elantra,2019,Saudi,Silver,Semi Full,2.0,Gas,Automatic,41912,Riyadh,59500,False
5,Honda,Accord,2018,Saudi,Navy,Full,1.5,Gas,Automatic,39000,Riyadh,72000,False
6,Toyota,Land Cruiser,2011,Saudi,White,Semi Full,4.5,Gas,Automatic,183000,Riyadh,92000,False
8,Chevrolet,Impala,2019,Saudi,Black,Standard,3.6,Gas,Automatic,70000,Riyadh,80000,False
9,Toyota,Yaris,2018,Saudi,White,Standard,1.5,Gas,Automatic,131000,Jeddah,32000,False
10,Toyota,Camry,2017,Gulf Arabic,White,Standard,2.5,Gas,Automatic,107000,Dammam,50000,False
11,Nissan,Patrol,2014,Saudi,White,Full,5.6,Gas,Automatic,106000,Dammam,135000,False


In [103]:
# Further check: sort from lowest price to highest
data_sorted = df.sort_values(by='Price', ascending=True)
data_sorted.head(25)

Unnamed: 0,Make,Type,Year,Origin,Color,Options,Engine_Size,Fuel_Type,Gear_Type,Mileage,Region,Price,Negotiable
8023,GMC,Yukon,2019,Saudi,Grey,Full,5.3,Gas,Automatic,50000,Jubail,1,False
6999,Genesis,G80,2018,Other,Grey,Semi Full,3.8,Gas,Automatic,170000,Riyadh,500,False
7625,Toyota,Yaris,2018,Saudi,White,Standard,1.5,Gas,Automatic,100000,Riyadh,850,False
2178,Mitsubishi,Attrage,2019,Saudi,Grey,Standard,1.2,Gas,Automatic,41000,Jeddah,877,False
3642,Kia,Rio,2019,Saudi,Bronze,Standard,1.4,Gas,Automatic,55500,Arar,884,False
7233,Toyota,Yaris,2019,Saudi,White,Standard,1.6,Gas,Automatic,85000,Najran,950,False
1661,MG,5,2020,Saudi,White,Standard,1.5,Gas,Automatic,41000,Al-Ahsa,988,False
7665,Hyundai,Elantra,2019,Saudi,Grey,Standard,2.0,Gas,Automatic,89000,Jeddah,993,False
3009,GMC,Yukon,2021,Saudi,Another Color,Standard,5.3,Gas,Automatic,4000,Jeddah,1000,False
4277,Toyota,Corolla,2020,Saudi,Silver,Standard,1.6,Gas,Automatic,48563,Hail,1002,False


In [121]:
# Next, check the usefulness of the "Negotiable" column: before the zero prices were removed, it appeared to be True only for rows with a zero Price
true_non_zero = df[(df['Negotiable'] == True) & (df['Price'] != 0)]
print(len(true_non_zero))

0


In [115]:
# No row with a non-zero price has Negotiable set to True, so the column carries no extra information and we drop it
df_cars=df.drop(columns=['Negotiable'])
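
The observation that `Negotiable` is True only for zero-price rows can also be checked on unfiltered data with a cross-tabulation. The following is a minimal sketch on toy data; the values and the names `toy`, `table`, and `redundant` are assumptions made for illustration.

```python
import pandas as pd

# Toy data (illustrative only): Negotiable is True exactly where Price is zero
toy = pd.DataFrame({
    "Price": [114000, 0, 27500, 0, 43000],
    "Negotiable": [False, True, False, True, False],
})

# Count rows for each combination of Negotiable and "Price is zero"
table = pd.crosstab(toy["Negotiable"], toy["Price"] == 0)
print(table)

# The column is redundant if it matches the zero-price flag on every row
redundant = bool(((toy["Price"] == 0) == toy["Negotiable"]).all())
print("Negotiable duplicates the zero-price flag:", redundant)
```

If the off-diagonal cells of the table are all zero, the column adds nothing beyond what `Price == 0` already says.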

In [117]:
df_cars.head()

Unnamed: 0,Make,Type,Year,Origin,Color,Options,Engine_Size,Fuel_Type,Gear_Type,Mileage,Region,Price
0,Chrysler,C300,2018,Saudi,Black,Full,5.7,Gas,Automatic,103000,Riyadh,114000
2,Nissan,Sunny,2019,Saudi,Silver,Standard,1.5,Gas,Automatic,72418,Riyadh,27500
3,Hyundai,Elantra,2019,Saudi,Grey,Standard,1.6,Gas,Automatic,114154,Riyadh,43000
4,Hyundai,Elantra,2019,Saudi,Silver,Semi Full,2.0,Gas,Automatic,41912,Riyadh,59500
5,Honda,Accord,2018,Saudi,Navy,Full,1.5,Gas,Automatic,39000,Riyadh,72000


In [128]:
# Validate the dataset by checking its summary statistics
df_cars.describe()

Unnamed: 0,Year,Engine_Size,Mileage,Price
count,5506.0,5506.0,5506.0,5506.0
mean,2014.839085,3.178169,139379.9,78334.94
std,5.142642,1.465976,334933.0,75041.51
min,1963.0,1.0,100.0,1.0
25%,2013.0,2.0,46000.0,35000.0
50%,2016.0,2.7,101000.0,58000.0
75%,2018.0,4.0,180000.0,95000.0
max,2021.0,9.0,20000000.0,1150000.0


In [134]:
# Scientific notation is hard to read, so change the float display format
pd.set_option('display.float_format', '{:,.1f}'.format)
df_cars.describe()

Unnamed: 0,Year,Engine_Size,Mileage,Price
count,5506.0,5506.0,5506.0,5506.0
mean,2014.8,3.2,139379.9,78334.9
std,5.1,1.5,334933.0,75041.5
min,1963.0,1.0,100.0,1.0
25%,2013.0,2.0,46000.0,35000.0
50%,2016.0,2.7,101000.0,58000.0
75%,2018.0,4.0,180000.0,95000.0
max,2021.0,9.0,20000000.0,1150000.0


### First look:
- The maximum Mileage of 20,000,000 km is unrealistic and the standard deviation is very high; we will deal with this later by capping Mileage at a fixed threshold.
- The prices look reasonable overall, but we should restrict them to plausible values, such as 7,000 riyals and above; a one-riyal car carries no useful information.
- Year is in a realistic range, from 1963 to 2021.
- Engine size also looks realistic at first glance, from 1 L to 9 L.
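
As a complement to a fixed cap, extreme values such as the 20,000,000 km mileage can be flagged with the interquartile-range (IQR) rule. This is a swapped-in technique shown only as a sketch, not the method used in this notebook; the toy values and the names `toy`, `upper_fence`, and `outliers` are assumptions.

```python
import pandas as pd

# Toy mileage values (illustrative only), including one extreme entry
toy = pd.DataFrame({"Mileage": [5_000, 46_000, 101_000, 180_000, 20_000_000]})

# First and third quartiles of the column
q1, q3 = toy["Mileage"].quantile([0.25, 0.75])

# IQR rule: values above Q3 + 1.5 * IQR are treated as candidate outliers
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy[toy["Mileage"] > upper_fence]

print(f"Upper fence: {upper_fence:,.0f} km")
print(outliers)
```

A data-driven fence like this is only a starting point; a domain-based cap may still be the better choice when the distribution is heavily skewed.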

In [147]:
# check the number of cars below 7000 riyal
low_price_cars = df_cars[df_cars['Price'] < 7000]
print(f"Number of cars priced below 7000 Riyal: {len(low_price_cars)}")

Number of cars priced below 7000 Riyal: 112


In [151]:
low_price_cars.describe()

Unnamed: 0,Year,Engine_Size,Mileage,Price
count,112.0,112.0,112.0,112.0
mean,2015.3,2.9,112187.7,2379.8
std,8.0,1.4,220693.8,1582.9
min,1986.0,1.0,100.0,1.0
25%,2016.0,2.0,21230.0,1307.5
50%,2019.0,2.5,49500.0,1711.5
75%,2020.0,3.6,111250.0,2730.0
max,2021.0,6.0,2000000.0,6500.0


In [157]:
# Prices as low as one riyal, mostly on recent cars, suggest these listings are unrealistic, so dropping them is better
df_cars_cleaned = df_cars[df_cars['Price'] >= 7000]

# Confirm the new dataset size
print(f"Number of cars remaining after filtering: {len(df_cars_cleaned)}")
# Keep cars with at most 80,000 km. This filter is applied to df_cars, not to
# df_cars_cleaned, so it replaces the price filter above instead of combining with it.
df_cars_cleaned = df_cars[df_cars['Mileage'] <=80_000]
print(f"Number of cars remaining after filtering: {len(df_cars_cleaned)}")

Number of cars remaining after filtering: 5394
Number of cars remaining after filtering: 2155
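
To apply a price floor and a mileage cap together, the two conditions can be combined into one boolean mask so that neither filter overwrites the other. The sketch below uses toy data; the values and the names `toy`, `MIN_PRICE`, `MAX_MILEAGE`, `both_filters`, and `mileage_only` are illustrative assumptions.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Price": [1, 27500, 59500, 92000, 6500],
    "Mileage": [50_000, 72_418, 41_912, 183_000, 20_000],
})

MIN_PRICE = 7_000      # price floor in riyals
MAX_MILEAGE = 80_000   # mileage cap in km

# Combine both conditions with & so a row must pass both to be kept
mask = (toy["Price"] >= MIN_PRICE) & (toy["Mileage"] <= MAX_MILEAGE)
both_filters = toy[mask]

# For comparison: filtering on mileage alone keeps the low-priced rows
mileage_only = toy[toy["Mileage"] <= MAX_MILEAGE]

print(len(both_filters), both_filters["Price"].min())
print(len(mileage_only), mileage_only["Price"].min())
```

Checking the minimum price after filtering is a quick way to confirm that the price floor actually took effect.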


In [161]:
df_cars=df_cars_cleaned

In [167]:
df_cars.head(12)

Unnamed: 0,Make,Type,Year,Origin,Color,Options,Engine_Size,Fuel_Type,Gear_Type,Mileage,Region,Price
2,Nissan,Sunny,2019,Saudi,Silver,Standard,1.5,Gas,Automatic,72418,Riyadh,27500
4,Hyundai,Elantra,2019,Saudi,Silver,Semi Full,2.0,Gas,Automatic,41912,Riyadh,59500
5,Honda,Accord,2018,Saudi,Navy,Full,1.5,Gas,Automatic,39000,Riyadh,72000
8,Chevrolet,Impala,2019,Saudi,Black,Standard,3.6,Gas,Automatic,70000,Riyadh,80000
13,Mercedes,CLA,2020,Other,White,Standard,2.0,Gas,Automatic,20000,Riyadh,235000
14,Mercedes,E,2017,Saudi,Grey,Full,2.0,Gas,Automatic,20600,Dammam,210000
15,Toyota,Corolla,2018,Saudi,White,Standard,1.6,Gas,Automatic,7702,Dammam,45000
17,Nissan,Sunny,2017,Saudi,White,Standard,1.5,Gas,Automatic,58000,Al-Medina,24500
20,Toyota,Prado,2021,Saudi,White,Semi Full,4.0,Gas,Automatic,3000,Dammam,174000
23,Toyota,Furniture,2021,Saudi,White,Semi Full,2.4,Diesel,Automatic,29000,Qassim,145000


#### Reset the index for a cleaner view of the data

In [170]:
df_cars.reset_index(drop=True, inplace=True)
df_cars.head(12)

Unnamed: 0,Make,Type,Year,Origin,Color,Options,Engine_Size,Fuel_Type,Gear_Type,Mileage,Region,Price
0,Nissan,Sunny,2019,Saudi,Silver,Standard,1.5,Gas,Automatic,72418,Riyadh,27500
1,Hyundai,Elantra,2019,Saudi,Silver,Semi Full,2.0,Gas,Automatic,41912,Riyadh,59500
2,Honda,Accord,2018,Saudi,Navy,Full,1.5,Gas,Automatic,39000,Riyadh,72000
3,Chevrolet,Impala,2019,Saudi,Black,Standard,3.6,Gas,Automatic,70000,Riyadh,80000
4,Mercedes,CLA,2020,Other,White,Standard,2.0,Gas,Automatic,20000,Riyadh,235000
5,Mercedes,E,2017,Saudi,Grey,Full,2.0,Gas,Automatic,20600,Dammam,210000
6,Toyota,Corolla,2018,Saudi,White,Standard,1.6,Gas,Automatic,7702,Dammam,45000
7,Nissan,Sunny,2017,Saudi,White,Standard,1.5,Gas,Automatic,58000,Al-Medina,24500
8,Toyota,Prado,2021,Saudi,White,Semi Full,4.0,Gas,Automatic,3000,Dammam,174000
9,Toyota,Furniture,2021,Saudi,White,Semi Full,2.4,Diesel,Automatic,29000,Qassim,145000


In [172]:
df_cars.describe()

Unnamed: 0,Year,Engine_Size,Mileage,Price
count,2155.0,2155.0,2155.0,2155.0
mean,2016.3,3.0,32721.7,101211.9
std,5.4,1.4,27218.1,100419.5
min,1963.0,1.0,100.0,1.0
25%,2015.0,2.0,4000.0,40000.0
50%,2018.0,2.5,30000.0,70000.0
75%,2019.0,3.8,57000.0,128000.0
max,2021.0,9.0,80000.0,1150000.0


In [174]:
df_cars.to_csv('cleaned_used_cars.csv', index=False)
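
After exporting, a quick round-trip read can confirm that the file has the expected shape and columns. The sketch below writes to an in-memory buffer instead of `cleaned_used_cars.csv` so that it is self-contained; the toy values and the names `toy`, `buffer`, and `reloaded` are assumptions.

```python
import io

import pandas as pd

# Toy frame (illustrative only) standing in for the cleaned dataset
toy = pd.DataFrame({
    "Make": ["Nissan", "Honda"],
    "Engine_Size": [1.5, 2.0],
    "Price": [27500, 72000],
})

# Write without the index, as in the export above, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame should have the same shape and column names
print(reloaded.shape, list(reloaded.columns))
```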