# Exploring eBay Car Sales Data

In this project we are going to explore [used cars data](https://data.world/data-society/used-cars-data). Dataquest has made a few modifications to make the data dirtier. The aim of this project is to clean the data and analyze the used car listings.

In [159]:
import pandas as pd
import numpy as np

autos = pd.read_csv("autos.csv", encoding = "Latin-1")

Let's explore the dataset.

In [160]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [161]:
autos.head(3)

,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


## Fixing the column names

With autos.info() we can see the column names and their types. The dataset contains 20 columns; most are stored as strings (object dtype) and a few as integers. Several columns contain null values, most notably notRepairedDamage, which has only 40,171 non-null entries out of 50,000. The column names use camel case instead of Python's preferred snake case, so let's begin by converting them to snake case.

In [162]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [163]:
# Copy the above and change the values.

autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'kilometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
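
The purely mechanical part of this renaming could also be done with a small helper instead of by hand. The sketch below is only an illustration: `camel_to_snake` is a hypothetical function, not part of pandas, and it would not reproduce the manual list exactly, because several names above were also reworded (for example `yearOfRegistration` became `registration_year`).

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case (hypothetical helper)."""
    # Put an underscore between a lowercase letter or digit and a following capital
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Separate a run of capitals from a following capitalised word
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    return s.lower()

original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "yearOfRegistration"]
converted = [camel_to_snake(c) for c in original]
print(converted)
```

Columns that need a different wording would still have to be renamed explicitly afterwards.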

In [164]:
autos.head(3)

,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


The columns are now in snake case, which is Python's preferred naming convention. Now let's begin to explore our data.

## Cleaning data

In [165]:
autos.describe(include="all")

,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-30 19:48:02,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


We can see that the seller and offer_type columns consist almost entirely of a single value (49,999 of 50,000 rows), so we can drop those columns. Some numeric data also needs cleaning: the price and kilometer columns are stored as strings that contain symbols such as "$", "," and "km". The nr_of_pictures column seems to contain only 0s, so let's check that to make sure.

In [166]:
autos["nr_of_pictures"].value_counts()

0    50000
Name: nr_of_pictures, dtype: int64

As we can see there are only 0's in nr_of_pictures. Let's continue with deleting columns seller, offer_type and nr_of_pictures.

In [167]:
autos = autos.drop(["seller", "offer_type", "nr_of_pictures"], axis = 1)
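
Columns like these can also be spotted programmatically by measuring how much of each column is taken up by its most common value. This is a minimal sketch on a hypothetical miniature frame, not the real autos data; the helper name `top_value_share` and the 0.85 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset
df = pd.DataFrame({
    "seller": ["privat"] * 9 + ["other"],
    "nr_of_pictures": [0] * 10,
    "brand": ["bmw", "audi", "ford", "bmw", "opel", "audi", "ford", "vw", "vw", "bmw"],
})

def top_value_share(frame):
    """Return, per column, the fraction of rows held by the most common value."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

shares = top_value_share(df)
# Columns where a single value covers at least 85% of the rows (assumed threshold)
near_constant = shares[shares >= 0.85].index.tolist()
print(shares)
print(near_constant)
```

A column flagged this way carries little information for analysis, which is the reasoning behind dropping seller, offer_type and nr_of_pictures.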

Now we need to clean the columns that contained symbols instead of numeric values.

In [168]:
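# regex=False makes "$" a literal character rather than a regex end-of-string anchor
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)
autos["kilometer"] = autos["kilometer"].str.replace(",", "", regex=False).str.replace("km", "", regex=False).astype(int)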
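
The same cleaning step can be tried out on a few sample strings before running it on the full columns. The sketch below uses hypothetical sample values in the same format as the raw data, and `to_int` is an illustrative helper, not something from the notebook. Passing `regex=False` to `str.replace` makes it explicit that "$" is meant as a literal character and not as a regular-expression anchor.

```python
import pandas as pd

# Hypothetical sample values in the same format as the raw columns
raw_price = pd.Series(["$5,000", "$8,500", "$0"])
raw_km = pd.Series(["150,000km", "70,000km", "5,000km"])

def to_int(series, *tokens):
    """Remove each literal token from the strings, then cast to int."""
    for token in tokens:
        # Literal replacement: no regex interpretation of "$"
        series = series.str.replace(token, "", regex=False)
    return series.astype(int)

price = to_int(raw_price, "$", ",")
km = to_int(raw_km, ",", "km")
print(price.tolist())  # [5000, 8500, 0]
print(km.tolist())     # [150000, 70000, 5000]
```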

Let's also change the kilometer column's name to odometer_km:

In [169]:
autos = autos.rename({"kilometer": "odometer_km"}, axis=1)

We will now dive deeper into the price and odometer_km columns. We are looking for data that doesn't look right, for example unrealistic prices. We will look at the unique values and at summary statistics such as the min, max, median and mean.

In [170]:
autos["odometer_km"].unique().shape

(13,)

From the output we can see that there are only 13 unique values for odometer_km.

In [171]:
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

The output above summarizes the odometer_km column. There are 50,000 values, and since the median is already 150,000, at least half of the listings have the maximum value of 150,000.

In [172]:
autos["odometer_km"].value_counts().head(10).sort_index()

5000        967
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

This list shows the number of listings for each of the ten most common odometer_km values; for example, there are 32,424 listings at 150,000 km. Because only a handful of round numbers occur, the odometer_km values appear to be rounded. Let's now take a look at prices using the same methods.
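
To see how dominant the 150,000 km value is, the counts can be turned into percentages. The sketch below copies the ten counts from the output above and divides by the 50,000 rows reported by `describe()`; on the real frame, `autos["odometer_km"].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    5000: 967, 40000: 819, 50000: 1027, 60000: 1164, 70000: 1230,
    80000: 1436, 90000: 1757, 100000: 2169, 125000: 5170, 150000: 32424,
})
total_rows = 50000  # row count reported by describe()

# Share of all listings at each odometer reading, in percent
shares = (counts / total_rows * 100).round(1)
print(shares)
```

Roughly 64.8% of all listings sit at 150,000 km, which explains why the median and the 75th percentile both equal the maximum.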

In [173]:
autos["price"].unique().shape

(2357,)

This time there are a lot more unique values, 2357.

In [174]:
autos["price"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

There are 50,000 values as expected. The minimum price is 0, which might mean that there are some rows missing a value.

In [175]:
autos["price"].value_counts().sort_index()

0           1421
1            156
2              3
3              1
5              2
            ... 
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price, Length: 2357, dtype: int64

From the output above we can see that there are 1,421 cars listed with a price of 0. Prices such as 11111111, 12345678 and 99999999 also seem unrealistic. There are also many cars listed at very low prices, but since the starting bid could be just 1, we are going to keep those. Let's sort from the highest price to the lowest.

In [176]:
autos["price"].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

The prices rise steadily up to 350000 and then jump to 999990 and beyond, so prices up to 389000 seem realistic and we are going to keep only those. Let's check how many listings fall in the price range of 1 to 389000.

In [177]:
print(autos["price"].between(1,389000).value_counts())

True     48565
False     1435
Name: price, dtype: int64


Above we can see that 48,565 cars fall within the price range of 1 to 389000. The remaining 1,435 rows are outliers and will be removed.

In [178]:
autos = autos.loc[(autos["price"] >= 1) & (autos["price"] <= 389000)]
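
The boolean mask used for filtering selects the same rows as the `between` call used for counting, because `Series.between` includes both endpoints by default. A small sketch with hypothetical prices illustrates this:

```python
import pandas as pd

# Hypothetical prices, including the kinds of outliers seen above
prices = pd.Series([0, 1, 2950, 350000, 389000, 999990, 99999999])

mask_between = prices.between(1, 389000)          # inclusive on both ends by default
mask_manual = (prices >= 1) & (prices <= 389000)  # explicit comparison

# Both masks select exactly the same rows
assert mask_between.equals(mask_manual)

kept = prices[mask_between]
print(kept.tolist())  # [1, 2950, 350000, 389000]
```

Either form works for the filter; `between` is simply more compact.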

## Dates

There are 5 date-related columns in our dataset: `date_crawled`, `last_seen`, `ad_created`, `registration_month` and `registration_year`. Let's print the first few rows to see what they look like.

In [184]:
autos[['date_crawled','last_seen','ad_created','registration_month','registration_year']][0:3]

,date_crawled,last_seen,ad_created,registration_month,registration_year
0,2016-03-26 17:47:46,2016-04-06 06:45:54,2016-03-26 00:00:00,3,2004
1,2016-04-04 13:38:56,2016-04-06 14:45:08,2016-04-04 00:00:00,6,1997
2,2016-03-26 18:57:24,2016-04-06 20:15:37,2016-03-26 00:00:00,7,2009


As we can see, the three timestamp columns use the format year-month-day followed by the time. The `registration_month` and `registration_year` columns contain integer values, while the other columns are strings. We are now going to calculate the distribution of values in the `date_crawled`, `ad_created` and `last_seen` columns as percentages.
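
One possible way to compute such a distribution is to keep only the first 10 characters of each timestamp string (the date part) and count the relative frequencies. This is a sketch on hypothetical sample timestamps, not necessarily the exact approach used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical timestamps in the same "YYYY-MM-DD HH:MM:SS" string format
date_crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-04-06 20:15:37",
])

# The first 10 characters of each string hold the date part
dates = date_crawled.str[:10]

# Relative frequency per day as a percentage, sorted chronologically;
# dropna=False keeps missing values visible in the result
distribution = dates.value_counts(normalize=True, dropna=False).sort_index() * 100
print(distribution)
```

Sorting by index works here because year-month-day strings sort in chronological order.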