This project uses a dataset of car listings from the classifieds section of the German eBay website. I will explore the data, clean it where needed and look at what it says about price and mileage by brand.

In [2]:
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv",encoding="Latin-1")

In [3]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


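There is a mixture of integer and string data. The columns with integer types are yearOfRegistration, powerPS, monthOfRegistration, nrOfPictures and postalCode. All other columns have string types. Five columns have null values: vehicleType, gearbox, model, fuelType and notRepairedDamage. notRepairedDamage has the most, with 9,829 of its 50,000 values missing.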
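
The null counts can be worked out from the info() output by subtracting each non-null count from the row total, but isnull().sum() gives them directly. A minimal sketch on a small made-up frame (the `sample` data is hypothetical and not taken from autos.csv):

```python
import pandas as pd

# Hypothetical stand-in for a few columns of autos; the values are made up.
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
    "brand": ["peugeot", "bmw", "ford", "opel"],
})

# Count missing values per column and keep only the columns that have any.
null_counts = sample.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

The same two lines should work unchanged on the full dataframe.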

In [4]:
#Changing specific column names as specified using .rename()
new_names = {"yearOfRegistration":"registration_year",
            "monthOfRegistration":"registration_month",
            "notRepairedDamage":"unrepaired_damage",
            "dateCreated":"ad_created"}
autos.rename(new_names,axis=1,inplace=True)

#Changing the rest using a function to convert from camel to snake case
def camel_to_snake(s):
    return ''.join(['_'+c.lower() if c.isupper() else c for c in s]).lstrip('_')

autos.columns = [camel_to_snake(s) for s in autos.columns]

#Changing power_p_s to power_ps
autos.rename({"power_p_s":"power_ps"},axis=1,inplace=True)

#Display column names
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
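
The character-by-character conversion above splits every capital letter, which is why powerPS needs a follow-up rename. One possible alternative, sketched here under the assumption that a run of capitals should stay together, is a regular expression that only inserts an underscore where a capital follows a lowercase letter or digit. The name `camel_to_snake_re` is hypothetical and is not part of the notebook:

```python
import re

def camel_to_snake_re(name):
    # Insert "_" only where a capital follows a lowercase letter or digit,
    # so a run of capitals such as "PS" is kept as one word.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

for original in ["powerPS", "offerType", "nrOfPictures", "abtest"]:
    print(original, "->", camel_to_snake_re(original))
```

With this version powerPS maps straight to power_ps, so the extra rename step would not be needed.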

In [5]:
autos.head()

#Investigating the seller columns
#autos["seller"].value_counts()
#autos.loc[autos["seller"]=='gewerblich']

#Investigating null values
#autos.loc[autos["model"].isnull()]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


price and odometer are numbers stored as strings (price has a $ symbol and thousands separators, odometer has km at the end). I will clean these columns and convert them to integers.

nr_of_pictures is 0 for every row, so it carries no information and should be removed.

seller is 'privat' for every row except one, which is 'gewerblich'. That row otherwise looks like a legitimate listing.

vehicle_type, gearbox, model, fuel_type and unrepaired_damage have null values. I used autos.loc[autos["model"].isnull()] and similar code to investigate them, but a brief look did not reveal a pattern.
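
Columns that never vary, like nr_of_pictures, can also be found programmatically instead of by eye. A minimal sketch, assuming a column counts as uninformative when it has exactly one distinct value (the `sample` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of autos; the values are made up.
sample = pd.DataFrame({
    "seller": ["privat", "privat", "gewerblich", "privat"],
    "nr_of_pictures": [0, 0, 0, 0],
    "price": [5000, 8500, 8990, 4350],
})

# A column with a single distinct value (nulls included) carries no information.
constant_cols = [col for col in sample.columns
                 if sample[col].nunique(dropna=False) == 1]
cleaned = sample.drop(columns=constant_cols)

print(constant_cols)
print(cleaned.columns.tolist())
```

Note that seller is kept here because it has two distinct values, even though one of them appears only once.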





In [71]:
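#Now to clean the price and odometer columns.
#regex=False makes pandas treat "$" as a literal character rather than a regex anchor.

autos["price"] = autos["price"].str.replace("$","",regex=False).str.replace(",","",regex=False).astype(int)
autos["odometer"] = autos["odometer"].str.replace("km","",regex=False).str.replace(",","",regex=False).astype(int)

#Rename before calling head() so the output shows the final column name
autos.rename({"odometer":"odometer_km"},axis=1,inplace=True)

autos.head()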

In [7]:
autos["odometer_km"].describe()
autos["price"].describe()

autos.sort_values("price")["price"]

KeyError: 'odometer_km'

The value for odometer_km maxes out at 150000.

Some of the prices are suspicious. The values 11111111, 12345678, 99999999 and 1234566 look like placeholder entries rather than real prices and should be removed. It is feasible that a car could sell for 10 million or 27 million, so I will leave those in, although they are very high.

In [None]:
bad_numbers = [11111111,12345678,99999999,1234566]

autos = autos.loc[autos["price"] != 1234566,:]
#I did this with each of the 'bad numbers'. Need to find out how to do it all at once.

autos.sort_values("price")["price"]
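
One way to answer the question in the comment above (removing all of the bad numbers at once) is Series.isin, which builds a single boolean mask for the whole list. A minimal sketch on made-up prices; only the four values in `bad_numbers` come from the notebook:

```python
import pandas as pd

# Hypothetical prices mixed with the four placeholder values listed above.
sample = pd.DataFrame(
    {"price": [5000, 11111111, 8500, 12345678, 1234566, 99999999, 1350]}
)
bad_numbers = [11111111, 12345678, 99999999, 1234566]

# isin() flags rows whose price appears in the list; ~ inverts the mask
# so that only the remaining rows are kept.
sample = sample.loc[~sample["price"].isin(bad_numbers), :]
print(sample["price"].tolist())
```

Applied to autos, the same line would replace the four separate filters.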

In [None]:
autos["odometer_km"].value_counts()
#All seems to be ok here, although the values have clearly been rounded.

I had to remove suspiciously large values from price. odometer_km seems to be ok, but the values have clearly been rounded: to the nearest 5,000 (if under 10,000), the nearest 10,000 (if under 100,000) or the nearest 25,000 (if above 100,000). This limits the precision of any mileage analysis, and I'm not sure what I can do about it.

In [11]:
autos["date_crawled"].str[:10].value_counts(normalize=True,
                                           dropna=False).sort_index()

#The same check for the other two date columns:
#autos["ad_created"].str[:10].value_counts(normalize=True,
#                                           dropna=False).sort_index()
#autos["last_seen"].str[:10].value_counts(normalize=True,
#                                           dropna=False).sort_index()

2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64
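
Slicing the first ten characters works because the timestamps are stored as fixed-format strings. An alternative, sketched here on a few made-up timestamps in the same format, is to parse the column with pd.to_datetime and take the calendar date, which would also allow grouping by week or month later on:

```python
import pandas as pd

# Hypothetical timestamps in the same format as date_crawled.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

# Parse the strings once, then reduce each timestamp to its calendar date.
dates = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S").dt.date
distribution = dates.value_counts(normalize=True).sort_index()
print(distribution)
```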

In [14]:
autos["registration_year"].describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

There are outliers in the registration year: the minimum is 1000 and the maximum is 9999, neither of which is a plausible year. The number of ads posted seems to rise during March.

In [16]:
autos["registration_year"].value_counts().sort_index()

1000       1
1001       1
1111       1
1500       1
1800       2
1910       9
1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       2
1952       1
1953       1
1954       2
1955       2
1956       5
1957       2
1958       4
1959       7
1960      34
1961       6
1962       4
1963       9
        ... 
2001    2703
2002    2533
2003    2727
2004    2737
2005    3015
2006    2708
2007    2304
2008    2231
2009    2098
2010    1597
2011    1634
2012    1323
2013     806
2014     666
2015     399
2016    1316
2017    1453
2018     492
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, Length: 97, dtype: int64

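Plausible values seem to lie between 1900 and 2016. The listings were crawled in 2016, so a later registration year cannot be correct, and the few years before 1900 are clearly errors.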
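
Before dropping rows it can help to check how much of the data falls outside the chosen range. A minimal sketch using Series.between, which is inclusive at both ends by default (the `years` values are made up, apart from reusing a few of the implausible years seen above):

```python
import pandas as pd

# Hypothetical registration years, including some implausible ones.
years = pd.Series([1000, 1910, 1999, 2003, 2016, 2017, 9999, 2008])

# True where the year lies in the range 1900 to 2016, both ends included.
valid = years.between(1900, 2016)

print("share outside 1900-2016:", (~valid).mean())
print(years[valid].tolist())
```

If the share outside the range is small, removing those rows should have little effect on the rest of the analysis.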

In [22]:
#Keep registration years from 1900 to 2016 inclusive
autos = autos.loc[autos["registration_year"].between(1900,2016),:]
autos["registration_year"].describe()
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48028 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          48028 non-null object
name                  48028 non-null object
seller                48028 non-null object
offer_type            48028 non-null object
price                 48028 non-null object
abtest                48028 non-null object
vehicle_type          44903 non-null object
registration_year     48028 non-null int64
gearbox               45604 non-null object
power_ps              48028 non-null int64
model                 45560 non-null object
odometer              48028 non-null object
registration_month    48028 non-null int64
fuel_type             44301 non-null object
brand                 48028 non-null object
unrepaired_damage     39040 non-null object
ad_created            48028 non-null object
nr_of_pictures        48028 non-null int64
postal_code           48028 non-null int64
last_seen             48028 non-null object
dtypes: int64(5)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


The incorrect values have been removed, which leaves 48,028 of the original 50,000 rows (1,972 rows, or about 4%, were dropped).

In [72]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,...,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen,popularity
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,...,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54,
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,...,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08,
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,...,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37,
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,...,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28,
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,...,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50,


In [80]:
autos["brand"].value_counts()
#value_counts() already sorts in descending order; [:19] keeps the 19 most common brands
top_19 = autos["brand"].value_counts()[:19].index

#Mean price for each of the most common brands
b_dict = {}
for b in top_19:
    b_list = autos.loc[autos["brand"]==b,:]
    b_mean = b_list["price"].mean()
    b_dict[b] = b_mean
b_dict

{'audi': 9093.65003615329,
 'bmw': 8334.645155185466,
 'citroen': 44534.79671150971,
 'fiat': 2711.8011272141707,
 'ford': 7263.015811455847,
 'hyundai': 5308.53911205074,
 'mazda': 4010.7716643741405,
 'mercedes_benz': 30317.447816593885,
 'nissan': 4664.891034482758,
 'opel': 5252.61655437921,
 'peugeot': 3039.4682651622,
 'renault': 2395.4164467897976,
 'seat': 4296.492554410081,
 'skoda': 6334.91948051948,
 'smart': 3542.706586826347,
 'sonstige_autos': 39621.77946768061,
 'toyota': 5115.33388981636,
 'volkswagen': 6516.457597173145,
 'volvo': 4757.108108108108}

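The means for citroen (about 44,535), mercedes_benz (about 30,317) and sonstige_autos (about 39,622) are several times higher than those of the other brands, which suggests they are being shifted by a few extreme prices.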
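
A quick way to test whether a mean is being pulled up by a few extreme values is to put the median next to it, since the median is barely affected by a single outlier. A minimal sketch on made-up listings (the brands are real column values, the prices are hypothetical):

```python
import pandas as pd

# Hypothetical listings: one brand contains a single extreme price.
sample = pd.DataFrame({
    "brand": ["citroen", "citroen", "citroen", "fiat", "fiat", "fiat"],
    "price": [2000, 3000, 10_000_000, 1500, 2500, 3500],
})

# Mean and median side by side for each brand.
summary = sample.groupby("brand")["price"].agg(["mean", "median"])
print(summary)
```

A large gap between the two columns for a brand points to outliers worth inspecting before the mean is trusted.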

In [89]:
autos["brand"].value_counts()
#[:5] keeps the five most common brands
top_5 = autos["brand"].value_counts()[:5].index

#Mean price for each brand
price_dict = {}
for b in top_5:
    b_list = autos.loc[autos["brand"]==b,:]
    b_mean = b_list["price"].mean()
    price_dict[b] = b_mean

#Mean mileage for each brand
mileage_dict = {}
for b in top_5:
    b_list = autos.loc[autos["brand"]==b,:]
    b_mean = b_list["odometer_km"].mean()
    mileage_dict[b] = b_mean

price_series = pd.Series(price_dict)
mileage_series = pd.Series(mileage_dict)

#Combine both series into one dataframe, aligned on brand
df = pd.DataFrame(price_series, columns=['mean_price'])
df["mean_mileage"] = mileage_series
df

Unnamed: 0,mean_price,mean_mileage
audi,9093.650036,129287.780188
bmw,8334.645155,132434.708554
mercedes_benz,30317.447817,130860.262009
opel,5252.616554,129227.141482
volkswagen,6516.457597,128730.369062
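
The two loops and the manual dataframe assembly can probably be condensed into one groupby call. The sketch below assumes pandas 0.25 or later (for named aggregation) and uses a made-up `sample` frame with only two top brands to keep the example short:

```python
import pandas as pd

# Hypothetical listings; the values are made up for illustration.
sample = pd.DataFrame({
    "brand": ["bmw", "audi", "bmw", "opel", "audi", "bmw", "volvo"],
    "price": [8000, 9000, 9000, 5000, 10000, 7000, 4000],
    "odometer_km": [150000, 125000, 100000, 150000, 150000, 125000, 90000],
})

# The two most common brands here; the notebook uses five.
top_brands = sample["brand"].value_counts().head(2).index

# Filter to those brands, then compute both means in one pass.
summary = (
    sample[sample["brand"].isin(top_brands)]
    .groupby("brand")
    .agg(mean_price=("price", "mean"), mean_mileage=("odometer_km", "mean"))
)
print(summary)
```

The result has one row per brand, sorted by brand name, with the same two columns as the table above.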


In [82]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen', 'popularity'],
      dtype='object')