We'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle (https://www.kaggle.com/orgesleka/used-cars-database/data). The aim of this project is to clean the data and analyze the included used car listings.

In [3]:
#Importing libraries
import pandas as pd
import numpy as np


In [None]:
#Read the data
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [None]:
#Check the data
autos.info()
autos.head(3)

As we can see, the raw data needs cleaning: some columns have fewer non-null values than the total number of rows (< 50,000), and the column names are in "CamelCase". Let's convert them to "snake_case".

In [None]:
#Change column names:
names = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'picture', 'postal_code',
       'last_seen']
#Assign the new names directly (the inplace argument of set_axis was removed in pandas 2.0)
autos.columns = names
autos.columns
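As an alternative to typing every name by hand, the case conversion itself could be automated. The sketch below uses a hypothetical helper, `camel_to_snake`, and a few illustrative names that are not necessarily the dataset's exact headers. It only changes the casing style, so renames that also reword a column would still need the manual list above.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each interior capital letter, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# Illustrative camelCase names (assumed for this example)
sample_names = ['dateCrawled', 'offerType', 'vehicleType', 'name']
converted = [camel_to_snake(n) for n in sample_names]
print(converted)
```

Note that this simple pattern splits runs of capitals letter by letter, so abbreviations inside a name would need special handling.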

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. 

In [None]:
#Check the data
autos.describe(include = 'object')

Let's convert the text columns (price and odometer_km) to integers.

In [None]:
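#Convert text to integer
#regex=False makes '$' a literal character instead of the regex end-of-string anchor
autos['price'] = (autos['price'].str.replace('$', '', regex=False)
                                .str.replace(',', '', regex=False)
                                .astype(int)
                 )

autos['odometer_km'] = (autos['odometer_km'].str.replace(',', '', regex=False)
                                            .str.replace('km', '', regex=False)
                                            .astype(int)
                       )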
autos['registration_year'] = autos['registration_year'].astype(int)
autos['registration_month'] = autos['registration_month'].astype(int)

autos.head(3)
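To see the conversion in isolation, here is a minimal sketch on a small made-up frame. The sample values are assumptions chosen to match the text format that the replacements above imply (a leading `$`, thousands separators, a trailing `km`).

```python
import pandas as pd

# Made-up sample values, not taken from the dataset
sample = pd.DataFrame({
    'price': ['$5,000', '$8,500', '$0'],
    'odometer_km': ['150,000km', '70,000km', '5,000km'],
})

# Strip the unit symbol and the thousands separator, then cast to int
for col, symbol in [('price', '$'), ('odometer_km', 'km')]:
    sample[col] = (sample[col]
                   .str.replace(symbol, '', regex=False)  # literal match, not a regex
                   .str.replace(',', '', regex=False)
                   .astype(int))

print(sample.dtypes)
print(sample)
```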

We'll start by analyzing the odometer_km and price columns, using their minimum and maximum values to look for unrealistically high or low values (outliers) that we want to remove.


In [None]:
#Check odometer and price
autos['odometer_km'].unique().shape
autos['odometer_km'].describe()
autos['odometer_km'].value_counts().sort_index(ascending = True)

autos['price'].unique().shape
autos['price'].describe()
autos['price'].value_counts().sort_index(ascending = False).head(100)

#Keep only listings priced between 1 and 350,000 (both bounds inclusive)
autos = autos[autos["price"].between(1,350000)]

autos['price'].value_counts().sort_index(ascending = True).head(100)

autos.describe()
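The effect of a `between` filter is easier to judge when the number of dropped rows is reported alongside it. The following is a small sketch on made-up prices, not on the real data.

```python
import pandas as pd

# Made-up prices, including a free listing and an implausibly expensive one
prices = pd.Series([0, 1, 500, 12000, 350000, 99999999], name='price')

mask = prices.between(1, 350000)  # inclusive on both ends by default
kept = prices[mask]
removed = int((~mask).sum())

print(kept.tolist())
print(f"removed {removed} of {len(prices)} rows")
```

Checking the removed count before overwriting the dataframe is a cheap way to confirm that the bounds cut off only the tails.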

Let's now move on to the date columns and understand the date range the data covers. Five columns should represent date values: date_crawled, ad_created, last_seen, registration_month and registration_year.

In [7]:
autos.describe(include = 'object')
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)

autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
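The daily distributions computed above come back ordered by frequency. Sorting by index instead puts the days in chronological order, which makes the covered date range easier to read. This is a minimal sketch on made-up timestamps, assuming the text format begins with a 10-character `YYYY-MM-DD` date as the `str[:10]` slice implies.

```python
import pandas as pd

# Made-up timestamps in an assumed 'YYYY-MM-DD HH:MM:SS' text format
stamps = pd.Series([
    '2016-03-05 14:06:30',
    '2016-03-05 20:15:02',
    '2016-03-06 09:41:17',
    '2016-04-07 11:00:00',
])

# Share of rows per day, ordered by date rather than by frequency
share_by_day = (stamps.str[:10]
                      .value_counts(normalize=True, dropna=False)
                      .sort_index())
print(share_by_day)
```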

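The minimum (1000) and maximum (9999) registration years are clearly impossible. Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Let's remove those rows, along with any registered before 1920, the lower bound used in the filter below.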

In [None]:
#Upper bound is 2016: later registration years cannot be correct
autos = autos[autos["registration_year"].between(1920,2016)]
autos['registration_year'].value_counts().sort_index(ascending = True).head(100)

When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column.

In [None]:
autos['brand'].value_counts()

For the top 10 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price.

In [None]:
#value_counts() already sorts in descending order
top10 = autos['brand'].value_counts().head(10)
print(top10)
autos_mile = {}

#Mean odometer reading (in km) for each of the top 10 brands
for b in top10.index:
    rows = autos[autos['brand'] == b]
    mean_km = rows['odometer_km'].mean()
    autos_mile[b] = mean_km

print('.....')
print(autos_mile)
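The loop above covers mileage only. To look for a link with mean price, both averages can be put side by side with `groupby`. The sketch below runs on a made-up frame (brand names and numbers are for illustration only) and takes the top 2 brands instead of 10 to keep the example short.

```python
import pandas as pd

# Made-up listings, not taken from the dataset
sample = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel', 'ford'],
    'price': [9000, 7000, 3000, 2000, 4000],
    'odometer_km': [150000, 125000, 150000, 100000, 90000],
})

# Most frequent brands (top 2 here; the analysis above uses the top 10)
top = sample['brand'].value_counts().head(2).index

# Mean price and mean mileage per brand, most expensive first
summary = (sample[sample['brand'].isin(top)]
           .groupby('brand')[['price', 'odometer_km']]
           .mean()
           .sort_values('price', ascending=False))
print(summary)
```

Applied to the cleaned dataframe with `head(10)`, the same pattern would give one table in which mean price and mean mileage can be compared brand by brand.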