# Exploring Used Car Listings on eBay

This project explores used car listings from the [German eBay](https://www.ebay.de) website, with a special focus on data cleaning. The original dataset, available on [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data), was modified by [Dataquest](https://www.dataquest.io), which reduced it from about 370,000 entries to 50,000.

Let's start by reading the file and examining its content.

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [2]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [3]:
autos.tail()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07
49999,2016-03-14 00:42:12,Opel_Vectra_1.6_16V,privat,Angebot,"$1,250",control,limousine,1996,manuell,101,vectra,"150,000km",1,benzin,opel,nein,2016-03-13 00:00:00,0,45897,2016-04-06 21:18:48


We observe that the dataset comprises 50,000 rows and 20 columns. Five of the columns contain null values: two of them have around 10% of their values missing and one has around 20%. We also see that 15 of the 20 columns are stored as strings, while the other 5 are stored as integers.
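The null percentages quoted above can be read off the `info()` output by hand, but they can also be computed directly. The following is a minimal sketch on a tiny made-up frame (`sample` and its values are hypothetical stand-ins for `autos`), showing one way to get the share of missing values per column:

```python
import pandas as pd

# Hypothetical miniature stand-in for `autos`; the real frame is read from autos.csv.
sample = pd.DataFrame({
    'vehicleType': ['bus', None, 'kombi', 'cabrio'],
    'gearbox': ['manuell', 'automatik', None, None],
    'brand': ['peugeot', 'bmw', 'ford', 'opel'],
})

# isnull() yields a boolean frame; its column-wise mean is the fraction of nulls.
null_share = (sample.isnull().mean() * 100).round(1)

# Show only the columns that actually contain nulls, largest share first.
print(null_share[null_share > 0].sort_values(ascending=False))
```

Applied to the real data, the same two lines with `autos` in place of `sample` should list the five columns that contain nulls.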

The column names in the dataset are entered in camel case, which we will change to snake case following Python conventions. In addition, we will modify some of the column names to improve clarity. 

In [4]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
autos.rename({
    'dateCrawled': 'date_crawled',
    'offerType': 'offer_type',
    'abtest': 'ab_test',
    'vehicleType': 'vehicle_type',
    'yearOfRegistration': 'registration_year',
    'powerPS': 'power_ps',
    'monthOfRegistration': 'registration_month',
    'fuelType': 'fuel_type',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'nrOfPictures': 'num_photos',
    'postalCode': 'postal_code',
    'lastSeen': 'last_seen',
}, axis=1, inplace=True)

autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen'],
      dtype='object')
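The purely mechanical part of the camel case to snake case conversion could also be automated. The sketch below uses a hypothetical helper, `camel_to_snake`, that is not part of the original notebook. It would not reproduce the hand-picked names above (for example `registration_year` or `num_photos`), so it is only a starting point before any renaming for clarity:

```python
import re

def camel_to_snake(name):
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column names from the dataset.
names = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures', 'abtest']
converted = [camel_to_snake(n) for n in names]
print(converted)
```

Names without capitals, such as `abtest`, pass through unchanged, which is why a manual mapping is still needed for cases like `ab_test`.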

## Preliminary Exploration and Data Cleaning

Let's start with descriptive statistics on the dataset.

In [6]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-11 22:38:16,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Looking at the `freq` row first, we notice that some columns consist almost entirely of a single value. Namely, in the `seller` and `offer_type` columns all values but one are identical (the top value occurs 49,999 times out of 50,000). Since these columns carry no meaningful information, we can remove them from our dataset.

It also appears that the `num_photos` column contains zero for every entry. Let's confirm this by counting the unique values in the column.

In [7]:
autos['num_photos'].value_counts()

0    50000
Name: num_photos, dtype: int64

Every row in this column is zero, so we can remove the `num_photos` column along with the `seller` and `offer_type` columns.

In [8]:
autos = autos.drop(['seller', 'offer_type', 'num_photos'], axis=1)
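
Columns dominated by a single value were spotted here by reading the `describe()` table. As a rough sketch of how that check might be automated, the snippet below computes the share of the most frequent value per column on a made-up frame. The `sample` data and the `threshold` of 0.95 are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical stand-in for `autos`: 100 rows, two near-constant columns.
sample = pd.DataFrame({
    'seller': ['privat'] * 99 + ['other'],
    'num_photos': [0] * 100,
    'brand': ['opel', 'bmw', 'audi', 'ford', 'fiat'] * 20,
})

# value_counts() sorts by frequency, so the first entry is the top value's share.
top_share = sample.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

threshold = 0.95  # assumed cut-off for "almost entirely one value"
near_constant = top_share[top_share >= threshold].index.tolist()
print(near_constant)
```

The resulting list could then be passed to `drop(..., axis=1)` after a manual review, since a rare value is sometimes exactly the signal of interest.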

Now let's take a look at the measures of central tendency for the columns. The minimum and maximum values of the `registration_year` column, 1000 and 9999, are clearly incorrect. We can inspect each end of the range before deciding how to deal with the issue.

In [9]:
autos['registration_year'].sort_values().head(20)

22316    1000
49283    1001
24511    1111
35238    1500
10556    1800
32585    1800
28693    1910
42181    1910
15898    1910
3679     1910
30781    1910
33295    1910
45157    1910
22659    1910
46213    1910
21416    1927
22101    1929
11246    1931
2573     1934
2221     1934
Name: registration_year, dtype: int64

In [10]:
autos['registration_year'].sort_values().tail(20)

5763     2019
49185    2019
27578    2800
4549     4100
453      4500
42079    4800
22799    5000
4164     5000
49153    5000
24519    5000
27618    5911
8360     6200
25003    8888
49910    9000
13559    9000
6308     9996
8012     9999
14341    9999
33950    9999
38076    9999
Name: registration_year, dtype: int64

We see that there are other incorrect values on both ends that need to be cleaned. One approach is to remove all of these rows. With 50,000 data points, removing a couple dozen entries is unlikely to skew the data, but we can instead replace these outliers with the average of the valid values, which lets us keep the entries.

In order to achieve this goal, first we should determine the mean value excluding the outliers.

In [11]:
# Use only plausible years (1900 to 2019, inclusive) when computing the mean
valid_years = autos['registration_year'].between(1900, 2019)
reg_year_avg = round(autos.loc[valid_years, 'registration_year'].mean())
reg_year_avg

2003

Now we can replace the outliers with the mean value.

In [15]:
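# Boolean mask of implausible years; .loc avoids chained assignment,
# and no row-by-row loop is needed (Series.iteritems() was removed in pandas 2.0)
outliers = (autos['registration_year'] < 1900) | (autos['registration_year'] > 2019)
autos.loc[outliers, 'registration_year'] = reg_year_avg

autos['registration_year'].describe()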

count    50000.00000
mean      2003.36666
std          7.69210
min       1910.00000
25%       1999.00000
50%       2003.00000
75%       2008.00000
max       2019.00000
Name: registration_year, dtype: float64
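
For comparison, the other option mentioned earlier, removing the implausible rows instead of replacing their values, might look like the sketch below. It runs on a small made-up frame (`sample` is a hypothetical stand-in for `autos`) and assumes the same 1900 to 2019 range used above:

```python
import pandas as pd

# Hypothetical stand-in for `autos` with a few implausible registration years.
sample = pd.DataFrame({
    'registration_year': [1000, 1910, 1999, 2003, 2008, 2019, 9999],
})

# between() is inclusive on both ends by default.
valid = sample['registration_year'].between(1900, 2019)
print(f"{(~valid).mean():.1%} of rows fall outside 1900-2019")

# Keep only the valid rows; copy() avoids modifying a view of `sample` later on.
cleaned = sample[valid].copy()
print(cleaned['registration_year'].describe())
```

Checking the share of affected rows first is a cheap way to judge whether dropping them would noticeably change the dataset.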