In [5]:
import pandas as pd
import numpy as np

Load the dataset and review the data to understand its structure and content

In [30]:
df = pd.read_csv("../data/raw/car_sales_data.csv")
df.head(10)

Unnamed: 0,Date,Salesperson,Customer Name,Car Make,Car Model,Car Year,Sale Price,Commission Rate,Commission Earned
0,2022-08-01,Monica Moore MD,Mary Butler,Nissan,Altima,2018,15983,0.070495,1126.73
1,2023-03-15,Roberto Rose,Richard Pierce,Nissan,F-150,2016,38474,0.134439,5172.4
2,2023-04-29,Ashley Ramos,Sandra Moore,Ford,Civic,2016,33340,0.114536,3818.63
3,2022-09-04,Patrick Harris,Johnny Scott,Ford,Altima,2013,41937,0.092191,3866.2
4,2022-06-16,Eric Lopez,Vanessa Jones,Honda,Silverado,2022,20256,0.11349,2298.85
5,2022-12-18,Terry Perkins MD,John Olsen,Ford,Altima,2015,14769,0.077247,1140.86
6,2022-06-12,Ashley Brown,Tyler Lawson,Honda,F-150,2013,41397,0.14278,5910.67
7,2022-06-20,Norma Watkins,Michael Bond,Ford,Altima,2015,46233,0.071624,3311.38
8,2022-09-02,Scott Parker,Stephanie Smith,Ford,Corolla,2021,27337,0.099504,2720.13
9,2023-04-06,Andrew Smith,Ashley Moreno DDS,Ford,Civic,2018,16309,0.149926,2445.14


In [31]:
print(f"This dataset contains {df.shape[0]} rows and {df.shape[1]} columns.\n")
df.info()

This dataset contains 2500000 rows and 9 columns.

<class 'pandas.DataFrame'>
RangeIndex: 2500000 entries, 0 to 2499999
Data columns (total 9 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Date               str    
 1   Salesperson        str    
 2   Customer Name      str    
 3   Car Make           str    
 4   Car Model          str    
 5   Car Year           int64  
 6   Sale Price         int64  
 7   Commission Rate    float64
 8   Commission Earned  float64
dtypes: float64(2), int64(2), str(5)
memory usage: 171.7 MB


This dataset contains 2.5 million records with a mix of categorical and numerical features. The columns `Date`, `Salesperson`, `Customer Name`, `Car Make`, and `Car Model` are strings, and the remaining four columns are numeric. Next, I use the describe() method to compute summary statistics for the numerical columns.

In [32]:
df.describe()

Unnamed: 0,Car Year,Sale Price,Commission Rate,Commission Earned
count,2500000.0,2500000.0,2500000.0,2500000.0
mean,2015.996,30012.18,0.09998766,3001.005
std,3.739132,11545.14,0.02887202,1481.467
min,2010.0,10000.0,0.05000014,501.34
25%,2013.0,20019.0,0.0749645,1821.71
50%,2016.0,30006.0,0.1000058,2741.91
75%,2019.0,40022.0,0.1250065,3978.142
max,2022.0,50000.0,0.15,7494.53


`Car Year`
The car production years range from 2010 (min) to 2022 (max), with a mean and median of about 2016, so roughly half of the cars were produced in or before 2016. Standard deviation (std) measures how much the values spread out around the mean. Here it is about 3.7 years, so values within one standard deviation of the mean fall roughly between 2012 and 2020. The quartiles give a more direct picture: at least 25% of cars were produced in 2013 or earlier, at least 25% in 2019 or later, and the middle half between 2013 and 2019.
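
How much of the data lies within one standard deviation of the mean depends on the shape of the distribution. The sketch below illustrates this on synthetic data, assuming production years are spread evenly over 2010 to 2022. That even spread is an assumption, not something the summary table proves, and `toy_years` and `share_within_one_std` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Car Year column (assumption: years are evenly spread)
rng = np.random.default_rng(0)
toy_years = pd.Series(rng.integers(2010, 2023, size=100_000))  # upper bound is exclusive

mean, std = toy_years.mean(), toy_years.std()

# Flag values inside [mean - std, mean + std]; between() is inclusive on both ends
within = toy_years.between(mean - std, mean + std)
share_within_one_std = within.mean()

print(f"mean={mean:.2f}, std={std:.2f}")
print(f"share within one std of the mean: {share_within_one_std:.1%}")
```

Under this even-spread assumption only a little over half of the values land within one standard deviation, noticeably less than the roughly 68% expected of a bell-shaped distribution.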

`Sale Price`
The average sale price is around 30,000, with prices ranging from 10,000 to 50,000. The relatively high standard deviation (11,545) indicates noticeable variability in car prices. The first quartile indicates that 25% of sale prices are below 20,000, while the third quartile shows that 25% are above 40,000.

`Commission Rate`
The commission rate ranges from approximately 5% to 15%, with a mean close to 10%. The standard deviation is close to 3%, and the quartiles show that 25% of commission rates are below about 7.5% and 25% are above about 12.5%.

`Commission Earned`
The average commission earned per sale is approximately 3,000, with values ranging from about 500 to 7,500. About 25% of earned commissions are below roughly 1,800 and 25% are above roughly 4,000.

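For `Car Year`, `Sale Price`, and `Commission Rate`, the median is close to the mean, indicating fairly symmetric distributions. `Commission Earned` is the exception: its median (about 2,742) sits below its mean (about 3,001), which suggests a right-skewed distribution.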
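
Comparing the mean with the median is a quick symmetry check, and pandas' `skew()` summarises the same question in one number per column. The sketch below runs on synthetic data, not the real dataset: it assumes, hypothetically, that prices and rates are drawn independently and evenly over the ranges reported above. The names `toy` and `summary` are introduced here.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption: independent, evenly spread prices and rates)
rng = np.random.default_rng(0)
n = 100_000
price = rng.uniform(10_000, 50_000, n)
rate = rng.uniform(0.05, 0.15, n)

toy = pd.DataFrame({
    "sale_price": price,
    "commission_rate": rate,
    "commission_earned": (price * rate).round(2),
})

# Side-by-side view: a mean above the median and a positive skew both point to a right tail
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary.round(3))
```

A skew near zero goes with mean and median agreeing; the product of two symmetric columns can still come out right-skewed, which is one possible explanation for the gap seen in `Commission Earned`.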

Next, we check for fully duplicated rows with df.duplicated().sum().

In [33]:
df.duplicated().sum()

np.int64(0)

There are no duplicate rows in this dataset.
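
By default `duplicated()` only flags rows that repeat an earlier row in every column. The toy example below is a minimal sketch, with a made-up frame and hypothetical column names, showing how the `subset` and `keep` parameters change what counts as a duplicate.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({
    "salesperson": ["A", "A", "B", "B"],
    "sale_price": [100, 100, 100, 200],
})

# Default: a row is a duplicate only if every column repeats an earlier row
full_dupes = toy.duplicated().sum()

# subset: compare on selected columns only
subset_dupes = toy.duplicated(subset=["sale_price"]).sum()

# keep=False: flag every copy, including the first occurrence
all_copies = toy[toy.duplicated(keep=False)]

print(full_dupes, subset_dupes)
print(all_copies)
```

Checking a subset can be useful when the same sale might have been recorded twice with small differences in other columns.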

isna().sum() counts the number of missing values in each column.

In [34]:
df.isna().sum()

Date                 0
Salesperson          0
Customer Name        0
Car Make             0
Car Model            0
Car Year             0
Sale Price           0
Commission Rate      0
Commission Earned    0
dtype: int64

There are no missing values in this dataset. However, placeholder strings may still stand in for missing or invalid data (e.g. '-', '/', 'N/A', whitespace, or an empty string), so I check for exact matches against a few of them.

In [35]:
#Count exact matches against common placeholder strings for missing data
special_characters = ['/', '-', '', ' ', 'N/A']
print(df.isin(special_characters).sum())

Date                 0
Salesperson          0
Customer Name        0
Car Make             0
Car Model            0
Car Year             0
Sale Price           0
Commission Rate      0
Commission Earned    0
dtype: int64
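
`isin` only finds exact matches, so a value like ' n/a ' with padding or different casing would slip through. A stricter variant could normalise the text first. The sketch below uses a made-up frame, and the placeholder list is an assumption that should be adapted to the data at hand.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({
    "car_make": ["Ford", " n/a ", "Honda", "--"],
    "sale_price": [100, 200, 300, 400],
})

# Assumed placeholder list, compared after stripping whitespace and lowercasing
placeholders = ["", "-", "--", "/", "n/a", "na", "null", "none", "?"]
text_cols = ["car_make"]  # string columns to inspect

normalized = toy[text_cols].apply(lambda s: s.str.strip().str.lower())
placeholder_counts = normalized.isin(placeholders).sum()
print(placeholder_counts)
```

Restricting the check to string columns also avoids comparing numeric columns against text.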


In [36]:
#Count Date values that cannot be parsed as dates (errors='coerce' turns them into NaT)
pd.to_datetime(df['Date'], errors='coerce').isna().sum()

np.int64(0)
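
With `errors='coerce'`, unparseable values become `NaT` instead of raising, so counting `NaT` values counts the invalid dates. A minimal sketch on made-up strings follows; the explicit `format` is an assumption that the dates follow the `YYYY-MM-DD` pattern seen in the preview.

```python
import pandas as pd

# Made-up values: two valid dates, one impossible date, one non-date string
toy_dates = pd.Series(["2022-08-01", "2023-02-30", "not a date", "2023-04-29"])

# Invalid entries are coerced to NaT rather than raising an error
parsed = pd.to_datetime(toy_dates, format="%Y-%m-%d", errors="coerce")

n_invalid = parsed.isna().sum()
invalid_values = toy_dates[parsed.isna()].tolist()
print(n_invalid, invalid_values)
```

Keeping the original strings alongside the parsed column makes it easy to inspect exactly which values failed.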

Now, I want to check for any data inconsistencies. Data inconsistencies occur when values are logically or format-wise conflicting, such as mismatched types, invalid dates, or contradictory information.

In [37]:
#Count rows where Commission Earned is not exactly equal to Sale Price * Commission Rate

(df['Commission Earned'] != df['Sale Price'] * df['Commission Rate']).sum()

np.int64(2500000)

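All 2,500,000 rows show a mismatch between Commission Earned and the product of Sale Price and Commission Rate. Exact equality is a very strict test for floating-point values, so this could simply reflect rounding of the stored commissions rather than real inconsistencies. The next cell checks this with a tolerance.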

In [38]:
#Calculate the difference between the recorded commission and the expected one (Sale Price * Commission Rate)

diff = df['Commission Earned'] - (df['Sale Price'] * df['Commission Rate'])

#Check the number of rows with a very small difference 

tolerance = 0.01
close_matches = (diff.abs() < tolerance).sum()
print(f"Number of rows close to expected value: {close_matches}")
max_diff = diff.abs().max()
print(f"Maximum difference: {max_diff}")

Number of rows close to expected value: 2500000
Maximum difference: 0.0049999980242319


After allowing for a small tolerance, all 2,500,000 rows are effectively consistent. The maximum difference is just under 0.005, which is what rounding Commission Earned to two decimal places would produce, so these are rounding effects rather than true inconsistencies.
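
The same tolerance-based comparison can be written with `np.isclose`. The sketch below uses made-up rows (the last one is deliberately wrong) and mirrors the 0.01 tolerance used above; setting `rtol=0` is a choice made here so that only the absolute tolerance applies.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the last commission is deliberately inconsistent
toy = pd.DataFrame({
    "sale_price": [20000, 30000, 10000],
    "commission_rate": [0.1, 0.123456, 0.05],
    "commission_earned": [2000.00, 3703.68, 600.00],
})

expected = toy["sale_price"] * toy["commission_rate"]

# Absolute tolerance of 0.01; rtol=0 disables the relative tolerance
matches = np.isclose(toy["commission_earned"], expected, rtol=0, atol=0.01)
n_mismatched = int((~matches).sum())

print(f"Rows outside tolerance: {n_mismatched}")
print(toy[~matches])
```

Listing the rows that fall outside the tolerance, rather than only counting them, makes genuine errors easy to inspect.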

In [39]:
#Check if there are any numeric values below zero
numeric_columns=['Car Year', 'Sale Price', 'Commission Rate', 'Commission Earned']
for col in numeric_columns:
    print((df[col] < 0).sum())

0
0
0
0


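=> These checks found no inconsistencies in the numeric and date columns. They do not validate `Car Make` against `Car Model`, and the preview above already shows pairings such as Nissan with F-150 and Ford with Civic.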
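
One way to check make and model pairings is to map each model to its manufacturer and compare. This is only a sketch: the rows are copied from the preview above, and `model_to_make` is a hypothetical, partial lookup written for this example rather than anything shipped with the dataset.

```python
import pandas as pd

# Rows copied from the preview at the top of the notebook
toy = pd.DataFrame({
    "Car Make": ["Nissan", "Nissan", "Ford", "Honda"],
    "Car Model": ["Altima", "F-150", "Civic", "Silverado"],
})

# Hypothetical partial lookup of model -> manufacturer
model_to_make = {
    "Altima": "Nissan",
    "F-150": "Ford",
    "Civic": "Honda",
    "Corolla": "Toyota",
    "Silverado": "Chevrolet",
}

expected_make = toy["Car Model"].map(model_to_make)

# Only flag rows whose model is in the lookup and whose make disagrees with it
mismatch = expected_make.notna() & (toy["Car Make"] != expected_make)
n_mismatch = int(mismatch.sum())

print(f"Mismatched make/model pairs: {n_mismatch} of {len(toy)}")
print(toy[mismatch])
```

Whether such mismatches matter depends on the analysis; they would matter for any model that treats make and model as meaningful features.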

Next, I want to extract additional information from the Date column, such as the year, month, day of the week, and quarter.

I also want to create a new column for the car’s age, calculated as the difference between the sale year and the car’s production year.

In [40]:
#Convert the Date column into a datetime format so that date-related operations can be performed

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

In [41]:
#Extract calendar features from the Date column
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Quarter"] = df["Date"].dt.quarter
df["Day Of Week"] = df["Date"].dt.day_of_week  #Monday=0, Sunday=6
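
The `.dt` accessor exposes these calendar fields once a column has a datetime dtype. A small sketch on two dates taken from the preview shows how the integer `day_of_week` lines up with day names; `day_name()` is added here for readability and is not used in the notebook.

```python
import pandas as pd

# Two dates taken from the preview above
toy_dates = pd.to_datetime(pd.Series(["2022-08-01", "2023-04-29"]))

features = pd.DataFrame({
    "year": toy_dates.dt.year,
    "quarter": toy_dates.dt.quarter,
    "day_of_week": toy_dates.dt.day_of_week,  # Monday=0, Sunday=6
    "day_name": toy_dates.dt.day_name(),
})
print(features)
```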

In [42]:
#Count rows where the car production year is less than or equal to the sale year (should equal the total row count)
(df["Car Year"] <= df["Year"]).sum()

np.int64(2500000)
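
Since the count equals the number of rows, every car was produced in or before its sale year. An equivalent way to express this kind of check is `.all()`, together with a filter that surfaces any violating rows. The sketch below uses a made-up frame whose last row is deliberately invalid.

```python
import pandas as pd

# Made-up frame; the last row has a production year after the sale year
toy = pd.DataFrame({
    "Year": [2022, 2023, 2022],
    "Car Year": [2018, 2016, 2023],
})

valid = toy["Car Year"] <= toy["Year"]
all_valid = bool(valid.all())  # True only if every row passes
violations = toy[~valid]       # rows that fail the check

print(all_valid)
print(violations)
```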

In [43]:
#Car age at the time of sale, in years
df["Car Age"] = df["Year"] - df["Car Year"]
df = df.rename(columns={"Car Year": "Car Production Year"})
df.head()

Unnamed: 0,Date,Salesperson,Customer Name,Car Make,Car Model,Car Production Year,Sale Price,Commission Rate,Commission Earned,Year,Month,Quarter,Day Of Week,Car Age
0,2022-08-01,Monica Moore MD,Mary Butler,Nissan,Altima,2018,15983,0.070495,1126.73,2022,8,3,0,4
1,2023-03-15,Roberto Rose,Richard Pierce,Nissan,F-150,2016,38474,0.134439,5172.4,2023,3,1,2,7
2,2023-04-29,Ashley Ramos,Sandra Moore,Ford,Civic,2016,33340,0.114536,3818.63,2023,4,2,5,7
3,2022-09-04,Patrick Harris,Johnny Scott,Ford,Altima,2013,41937,0.092191,3866.2,2022,9,3,6,9
4,2022-06-16,Eric Lopez,Vanessa Jones,Honda,Silverado,2022,20256,0.11349,2298.85,2022,6,2,3,0


I will create a new dataset excluding the `Date`, `Salesperson`, and `Customer Name` columns because they are not needed for the numerical analysis or modeling I plan to perform.
`Date` has already been split into separate features (year, month, quarter, and day of the week), so the original column is redundant.
`Salesperson` and `Customer Name` are identifiers that I do not expect to carry predictive value. They also contain personal names, so removing them protects privacy and simplifies the dataset.

In [44]:
df_new = df.drop(columns=['Date', 'Salesperson', 'Customer Name'])

In [45]:
#Convert column names to lowercase and replace spaces with underscores

df_new.columns = df_new.columns.str.lower().str.replace(' ', '_')
df_new.info()

<class 'pandas.DataFrame'>
RangeIndex: 2500000 entries, 0 to 2499999
Data columns (total 11 columns):
 #   Column               Dtype  
---  ------               -----  
 0   car_make             str    
 1   car_model            str    
 2   car_production_year  int64  
 3   sale_price           int64  
 4   commission_rate      float64
 5   commission_earned    float64
 6   year                 int32  
 7   month                int32  
 8   quarter              int32  
 9   day_of_week          int32  
 10  car_age              int64  
dtypes: float64(2), int32(4), int64(3), str(2)
memory usage: 171.7 MB
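
The same renaming pattern can be made a little more defensive when column names carry stray whitespace or punctuation. This is a sketch on made-up column names, not something the current dataset needs.

```python
import pandas as pd

# Made-up column names with padding, double spaces, and punctuation
toy = pd.DataFrame(columns=["Car Make", " Sale  Price ", "Commission Rate (%)"])

toy.columns = (
    toy.columns.str.strip()
    .str.lower()
    .str.replace(r"[^0-9a-z]+", "_", regex=True)  # any run of other characters -> "_"
    .str.strip("_")                               # drop leading/trailing underscores
)
print(list(toy.columns))
```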


In [46]:
df_new.head()

Unnamed: 0,car_make,car_model,car_production_year,sale_price,commission_rate,commission_earned,year,month,quarter,day_of_week,car_age
0,Nissan,Altima,2018,15983,0.070495,1126.73,2022,8,3,0,4
1,Nissan,F-150,2016,38474,0.134439,5172.4,2023,3,1,2,7
2,Ford,Civic,2016,33340,0.114536,3818.63,2023,4,2,5,7
3,Ford,Altima,2013,41937,0.092191,3866.2,2022,9,3,6,9
4,Honda,Silverado,2022,20256,0.11349,2298.85,2022,6,2,3,0


After performing all checks and data transformations, I will save the cleaned dataset to a new CSV file for further analysis.

In [47]:
df_new.to_csv("../data/processed/car_sales_cleaned.csv", index=False)
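
A quick round-trip check can confirm that a saved file reads back as expected. The sketch below writes a small made-up frame to a temporary directory instead of the project path. Because CSV does not store dtypes, the comparison ignores them.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for the cleaned dataset
toy = pd.DataFrame({
    "car_make": ["Nissan", "Ford"],
    "sale_price": [15983, 33340],
    "commission_earned": [1126.73, 3818.63],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_cleaned.csv"  # hypothetical file name
    toy.to_csv(path, index=False)             # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

# Raises an AssertionError if values or column names differ (dtypes are not compared)
pd.testing.assert_frame_equal(toy, reloaded, check_dtype=False)
same_shape = reloaded.shape == toy.shape
print(same_shape)
```

If the dtypes themselves need to survive, a typed format such as Parquet is an alternative to CSV, at the cost of an extra dependency.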