# eBay car sales

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle. Dataquest made a few modifications from the original dataset that was uploaded to Kaggle:

- Dataquest sampled 50,000 data points from the full dataset, to ensure the code runs quickly in our hosted environment
- Dataquest dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)
<br> 

<b>
The aim of this project is to clean the data and analyze the included used car listings. We'll also become familiar with some of the unique benefits Jupyter Notebook provides for pandas.


In [1]:
import pandas as pd, numpy as np

In [2]:
# the data, encoded in latin-1
autos = pd.read_csv('autos.csv', encoding='Latin-1')

# Understand and clean the data
<br>
Let's use some of pandas' and NumPy's useful methods to get an idea of what each column in the dataframe means.

In [7]:
#quick look
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price                 50000 non-null object
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
nr_of_pictures        50000 non-null int64
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(5)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


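Observations about the columns:<br>

- some columns have missing values
- every column is stored either as int64 or as a string (object)
- nr_of_pictures appears to be 0 for all rows
- the original column names are in camelCase rather than snake_case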


In [5]:
#the column names are in camelCase, but snake_case is preferred
autos = autos.rename(columns={'yearOfRegistration': 'registration_year',
              'monthOfRegistration': 'registration_month',
              'notRepairedDamage' : 'unrepaired_damage', 
              'dateCreated':'ad_created',
              'dateCrawled':'date_crawled',
              'offerType':'offer_type', 
              'vehicleType':'vehicle_type', 
              'powerPS':'power_ps',
              'fuelType':'fuel_type',
              'nrOfPictures':'nr_of_pictures',
              'postalCode':'postal_code',
              'lastSeen':'last_seen'}
              )

autos.head(1)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
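
The mapping above is written out by hand. As a rough sketch, the purely mechanical part of the conversion could also be done with a small regular-expression helper. `to_snake_case` is a hypothetical name, not something from the original notebook, and it will not reproduce hand-picked renames such as `yearOfRegistration` to `registration_year`.

```python
import re

def to_snake_case(name):
    """Insert an underscore before an uppercase letter that follows a
    lowercase letter or digit, then lowercase everything."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Column names taken from the rename mapping above
camel_names = ["dateCrawled", "offerType", "powerPS", "nrOfPictures"]
snake_names = [to_snake_case(n) for n in camel_names]
print(snake_names)
```

Names that need reordering or rewording would still have to be handled with an explicit dictionary, as in the cell above.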


In [6]:
#Some important stats/info about each feature
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-23 18:39:34,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Some more observations:
<br>
- seller, offer_type, abtest, gearbox and unrepaired_damage each contain only 2 unique values

- price and odometer are numbers stored as text; date_crawled, ad_created and last_seen are dates stored as text

- vehicle_type, gearbox, model, fuel_type and unrepaired_damage have missing data

In [8]:
#how many ads have pictures?
autos["nr_of_pictures"].value_counts()

0    50000
Name: nr_of_pictures, dtype: int64

Every ad has a picture count of 0, so the nr_of_pictures column carries no information and can be removed.
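
Since `nr_of_pictures` holds a single value, dropping it loses no information. A minimal sketch on a small stand-in frame (the rows below are made up for illustration; in the notebook the same call would be applied to `autos`):

```python
import pandas as pd

# Tiny illustrative frame, not the real data
sample = pd.DataFrame({
    "brand": ["peugeot", "bmw", "volkswagen"],
    "nr_of_pictures": [0, 0, 0],
})

# Find columns that contain only one distinct value, then drop them
constant_cols = [col for col in sample.columns
                 if sample[col].nunique(dropna=False) == 1]
cleaned = sample.drop(columns=constant_cols)
print(constant_cols)
```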

<b> 
How many missing values are there in each column?

In [11]:
autos["vehicle_type"].isnull().sum()

5095

In [12]:
autos["gearbox"].isnull().sum()

2680

In [13]:
autos["model"].isnull().sum()

2758

In [14]:
autos["fuel_type"].isnull().sum()

4482

In [15]:
autos["unrepaired_damage"].isnull().sum()

9829
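
The five cells above can be condensed into one call. As a sketch, `isnull().sum()` on the whole frame counts missing values per column; the small frame here is made-up illustration data.

```python
import pandas as pd
import numpy as np

sample = pd.DataFrame({
    "gearbox": ["manuell", np.nan, "automatik", np.nan],
    "model": ["golf", "andere", np.nan, "focus"],
    "brand": ["volkswagen", "bmw", "opel", "ford"],
})

missing = sample.isnull().sum()
# Keep only columns that actually have gaps, largest first
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```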

## Cleaning/analyzing the numerical columns

In [17]:
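#get rid of the dollar sign and the thousands separator (regex=False treats "$" as a literal character)
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)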

In [18]:
#convert to integer
autos["price"] = autos["price"].astype(int)

The price column is now stored as an integer.

In [19]:
#the same with odometer
autos["odometer"] = autos["odometer"].str.replace("km", "")
autos["odometer"] = autos["odometer"].str.replace(",", "").astype(int)

In [20]:
#make sure nothing was lost
autos["odometer"].value_counts().sum()

50000
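
The price and odometer conversions follow the same pattern: strip everything that is not a digit, then cast. A hedged sketch of a reusable helper (`digits_to_int` is a hypothetical name introduced here) that uses an explicit regular expression, so the behaviour does not depend on how `$` is interpreted:

```python
import pandas as pd

def digits_to_int(series):
    """Remove every non-digit character and convert to int.
    Assumes there are no missing values, decimals or minus signs."""
    return series.str.replace(r"\D", "", regex=True).astype(int)

prices = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km"])

print(digits_to_int(prices).tolist())
print(digits_to_int(odometer).tolist())
```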

In [21]:
# we can preserve the unit of measurement by adding it to the column name
autos = autos.rename(columns={"odometer":"odometer_km"})

In [22]:
#see how it looks now
autos.head(2)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08


In [23]:
#closer look at price now it is in numerical form
prices_size = autos["price"].unique().shape
autos["price"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [28]:
#largest vals
prices = autos["price"].value_counts()
prices.sort_index(ascending=False).head(10)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
Name: price, dtype: int64

In [38]:
#lowest vals
prices = autos["price"].value_counts()
prices.sort_index(ascending=True).head(10)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price, dtype: int64

There are a few implausibly high values, which must either be errors or very rare cases that tell us little about the data as a whole.
<br>
Given that eBay is an auction site, there could legitimately be items where the opening bid is \$1. We will keep the \$1 items, but remove the \$0 listings and anything above \$400,000, since prices seem to increase steadily up to that point and then jump to less realistic numbers.

In [30]:
#restrict on price up to 400000
autos = autos[autos["price"].between(1,400000)]

In [31]:
autos['price'].value_counts().sort_index(ascending=False).head()

350000    1
345000    1
299000    1
295000    1
265000    1
Name: price, dtype: int64

Prices above 400,000 are no longer included.
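
Before filtering, it can help to see how many rows a cut-off would discard. A small sketch with made-up prices; the bounds 1 and 400,000 are the ones used above.

```python
import pandas as pd

prices = pd.Series([0, 1, 2950, 7200, 350000, 999999, 99999999])

in_range = prices.between(1, 400000)  # inclusive on both ends
print("rows kept:", int(in_range.sum()))
print("rows dropped:", int((~in_range).sum()))

filtered = prices[in_range]
```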

In [33]:
#do we need to do the same for odometer
autos["odometer_km"].describe()

count     49986.000000
mean     125736.506222
std       40038.133399
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

Nothing looks wrong with the odometer column, so no rows need to be removed.

## Cleaning/analyzing the date columns

In [34]:
#some more text columns which need to be converted or simplified
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [36]:
#cut off the time, we are only concerned with the date
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['ad_created'] = autos['ad_created'].str[:10]
autos['last_seen'] = autos['last_seen'].str[:10]

In [37]:
#any patterns in date_crawled (the date the ad was first crawled)?
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025387
2016-03-06    0.013944
2016-03-07    0.035970
2016-03-08    0.033269
2016-03-09    0.033209
2016-03-10    0.032129
2016-03-11    0.032489
2016-03-12    0.036770
2016-03-13    0.015564
2016-03-14    0.036630
2016-03-15    0.033990
2016-03-16    0.029508
2016-03-17    0.031509
2016-03-18    0.013064
2016-03-19    0.034910
2016-03-20    0.037831
2016-03-21    0.037490
2016-03-22    0.032909
2016-03-23    0.032389
2016-03-24    0.029108
2016-03-25    0.031749
2016-03-26    0.032489
2016-03-27    0.031049
2016-03-28    0.034850
2016-03-29    0.034150
2016-03-30    0.033629
2016-03-31    0.031909
2016-04-01    0.033809
2016-04-02    0.035410
2016-04-03    0.038691
2016-04-04    0.036490
2016-04-05    0.013104
2016-04-06    0.003181
2016-04-07    0.001420
Name: date_crawled, dtype: float64

<b>Looks like the site was crawled daily over roughly a one month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform.

In [41]:
#any patterns in last_seen (the date the crawler last saw the ad online)?
autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005362
2016-03-08    0.007582
2016-03-09    0.009843
2016-03-10    0.010763
2016-03-11    0.012524
2016-03-12    0.023807
2016-03-13    0.008983
2016-03-14    0.012804
2016-03-15    0.015884
2016-03-16    0.016445
2016-03-17    0.027928
2016-03-18    0.007422
2016-03-19    0.015744
2016-03-20    0.020706
2016-03-21    0.020726
2016-03-22    0.021586
2016-03-23    0.018585
2016-03-24    0.019565
2016-03-25    0.019205
2016-03-26    0.016965
2016-03-27    0.016024
2016-03-28    0.020846
2016-03-29    0.022326
2016-03-30    0.024847
2016-03-31    0.023827
2016-04-01    0.023106
2016-04-02    0.024887
2016-04-03    0.025367
2016-04-04    0.024627
2016-04-05    0.124275
2016-04-06    0.220982
2016-04-07    0.130957
Name: last_seen, dtype: float64

<b> The last three days contain a disproportionate amount of 'last seen' values. It's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.

In [43]:
#date advert was created
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
2015-12-30    0.000020
2016-01-03    0.000020
2016-01-07    0.000020
2016-01-10    0.000040
2016-01-13    0.000020
2016-01-14    0.000020
2016-01-16    0.000020
2016-01-22    0.000020
2016-01-27    0.000060
2016-01-29    0.000020
2016-02-01    0.000020
2016-02-02    0.000040
2016-02-05    0.000040
2016-02-07    0.000020
2016-02-08    0.000020
2016-02-09    0.000040
2016-02-11    0.000020
2016-02-12    0.000060
2016-02-14    0.000040
2016-02-16    0.000020
2016-02-17    0.000020
2016-02-18    0.000040
2016-02-19    0.000060
2016-02-20    0.000040
2016-02-21    0.000060
                ...   
2016-03-09    0.033229
2016-03-10    0.031869
2016-03-11    0.032789
2016-03-12    0.036610
2016-03-13    0.016925
2016-03-14    0.035230
2016-03-15    0.033749
2016-03-16    0.030008
2016-03-17    0.031189
2016-03-18    0.013724
2016-03-19    0.033849
2016-03-20    0.037871
2016-03-21 

<b>There are many distinct creation dates (76 unique values). Most ads were created within the crawl window of March and April 2016, while a handful date back as far as June 2015.

In [29]:
autos['registration_year'].describe()

count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [45]:
#weird dates in car reg
lowest_year = autos.loc[autos['registration_year']==1000, :]
highest_year = autos.loc[autos['registration_year']==9999, :]

In [46]:
lowest_year

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
22316,2016-03-29,VW_Kaefer.__Zwei_zum_Preis_von_einem.,privat,Angebot,1500,control,,1000,manuell,0,kaefer,5000,0,benzin,volkswagen,,2016-03-29,0,48324,2016-03-31


In [47]:
highest_year

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
8012,2016-03-23,Opel_GT_Karosserie_mit_Brief!,privat,Angebot,700,test,,9999,,0,andere,10000,0,,opel,,2016-03-23,0,21769,2016-04-05
14341,2016-03-23,Hole_kostenlos_ab,privat,Angebot,0,test,,9999,,0,,10000,0,,bmw,,2016-03-23,0,32689,2016-03-23
33950,2016-03-23,58er_karmann_ghia_lowlight_Kaefer__zum_restaur...,privat,Angebot,7999,test,,9999,,0,kaefer,10000,0,,volkswagen,,2016-03-23,0,47638,2016-04-06
38076,2016-04-04,Mercedes_Benz_A180,privat,Angebot,18000,test,,9999,,0,a_klasse,10000,0,benzin,mercedes_benz,,2016-04-04,0,51379,2016-04-07


<b>One car claims to be registered in the year 1000, long before cars were invented, and 
four cars claim to be registered in the future (year 9999).


In [33]:
#the weird reg dates
autos_strange = autos.loc[~(autos['registration_year'].between(1900,2017)), :]

In [48]:
#keep only cars with plausible registration years (the ads were crawled in 2016, so later years cannot be right)
autos = autos.loc[autos['registration_year'].between(1950, 2016), :]

In [49]:
# make sure we've updated autos
autos['registration_year'].describe()

count    47992.000000
mean      2002.844016
std          7.100927
min       1950.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

In [50]:
#any patterns in reg year ?
autos['registration_year'].value_counts(normalize=True)

2000    0.069887
2005    0.062823
1999    0.062469
2004    0.057030
2003    0.056822
2006    0.056405
2001    0.056301
2002    0.052780
1998    0.051113
2007    0.048008
2008    0.046487
2009    0.043695
1997    0.042257
2011    0.034047
2010    0.033276
1996    0.030088
2012    0.027567
2016    0.027421
1995    0.027338
2013    0.016794
2014    0.013856
1994    0.013752
1993    0.009272
2015    0.008314
1990    0.008231
1992    0.008126
1991    0.007418
1989    0.003771
1988    0.002959
1985    0.002167
          ...   
1982    0.000896
1972    0.000729
1979    0.000729
1960    0.000688
1981    0.000625
1967    0.000563
1976    0.000563
1971    0.000563
1968    0.000542
1973    0.000521
1974    0.000500
1966    0.000458
1977    0.000458
1969    0.000396
1975    0.000396
1965    0.000354
1964    0.000250
1963    0.000188
1959    0.000146
1961    0.000125
1956    0.000104
1962    0.000083
1958    0.000083
1950    0.000063
1955    0.000042
1957    0.000042
1954    0.000042
1951    0.0000

<b> The majority of cars were registered from the mid-1990s onwards.

## What effect does Brand have on price?
<br> 
Does the brand of the car affect its price?

In [51]:
# the brands
autos['brand'].value_counts().index[:20]

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart',
       'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini'],
      dtype='object')

In [52]:
#the mean price of each brand
brands_mean = {}
for brand in autos['brand'].value_counts()[:20].index:
    mean_price = autos.loc[autos['brand']==brand, 'price'].mean()
    brands_mean[brand]= mean_price

    

In [53]:
#sort brands by mean price, lowest first
mean_prices_sorted = sorted(brands_mean.items(), key=lambda x: x[1])

In [54]:
mean_prices_sorted

[('renault', 2396.2067751869777),
 ('fiat', 2711.8011272141707),
 ('opel', 2877.6314775573105),
 ('peugeot', 3039.4682651622),
 ('smart', 3542.706586826347),
 ('ford', 3640.131162234837),
 ('citroen', 3699.935628742515),
 ('mazda', 4010.7716643741405),
 ('seat', 4296.492554410081),
 ('nissan', 4664.891034482758),
 ('volvo', 4757.108108108108),
 ('toyota', 5115.33388981636),
 ('volkswagen', 5227.481343283582),
 ('hyundai', 5308.53911205074),
 ('skoda', 6334.91948051948),
 ('bmw', 8101.893032942067),
 ('mercedes_benz', 8482.02141140485),
 ('audi', 9093.65003615329),
 ('mini', 10460.012048192772),
 ('sonstige_autos', 10863.923679060665)]

Renault has the lowest mean price and sonstige_autos ("other cars") the highest. Mini is surprisingly expensive, considering it is generally a cheaper car than Audi or Mercedes-Benz.
One possible explanation is that there are far more Mercedes-Benz and Audi listings, so their means are spread over a wider range of cars, while the Mini mean rests on fewer listings.
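
One way to check how much weight each mean deserves is to look at the number of listings behind it. The sketch below uses `groupby` with `agg` on made-up rows; on the real data the same call could stand in for the loop above.

```python
import pandas as pd

sample = pd.DataFrame({
    "brand": ["mini", "audi", "audi", "audi", "renault", "renault"],
    "price": [10000, 9000, 12000, 6000, 2000, 3000],
})

# Mean price and number of listings per brand, most expensive first
brand_stats = (
    sample.groupby("brand")["price"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(brand_stats)
```

A brand with a high mean but a small count is a weaker signal than one with a similar mean over many listings.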

In [55]:
#mileage averages
mileage_mean = {}
for brand in autos['brand'].value_counts()[:20].index:
    mean_mileage = autos.loc[autos['brand']==brand, 'odometer_km'].mean()
    mileage_mean[brand]= mean_mileage

In [56]:
#convert both dictionaries to series 
brands_mean_series = pd.Series(brands_mean)
mileage_mean_series = pd.Series(mileage_mean)

In [57]:
#merge into dataframe
mileage_and_price = pd.DataFrame(brands_mean_series,columns=['mean_price'])
mileage_and_price['mean_mileage']=mileage_mean_series

In [58]:
mileage_and_price.sort_values(by=['mean_price'], ascending=False)

Unnamed: 0,mean_price,mean_mileage
sonstige_autos,10863.923679,88688.845401
mini,10460.012048,88602.409639
audi,9093.650036,129287.780188
mercedes_benz,8482.021411,130899.06052
bmw,8101.893033,132455.509277
skoda,6334.919481,110954.545455
hyundai,5308.539112,106511.627907
volkswagen,5227.481343,128726.924588
toyota,5115.33389,115709.51586
volvo,4757.108108,138355.855856


The top-end brands (Audi, Mercedes-Benz, BMW) are still expensive despite having high mean mileage. Minis rank highly partly because their mean mileage is relatively low. The same could apply to sonstige_autos, although that label means "other cars" and is a catch-all category rather than a single manufacturer.

##  Translating German to English 
<br>
Many of the values are in German. To understand the data better, we translate them into English.

In [59]:
#types of seller
autos['seller'].unique()


array(['privat', 'gewerblich'], dtype=object)

In [60]:
#convert offer type
autos['offer_type'] = autos['offer_type'].map({'Angebot':'offer', 'Gesuch':'request'})

In [61]:
#convert seller
autos['seller'] = autos['seller'].map({'privat':'private', 'gewerblich':'commercial'})

In [62]:
#check
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26,Peugeot_807_160_NAVTECH_ON_BOARD,private,offer,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26,0,79588,2016-04-06
1,2016-04-04,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,private,offer,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04,0,71034,2016-04-06
2,2016-03-26,Volkswagen_Golf_1.6_United,private,offer,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26,0,35394,2016-04-06
3,2016-03-12,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,private,offer,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12,0,33729,2016-03-15
4,2016-04-01,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,private,offer,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01,0,39218,2016-04-01


In [63]:
#the vehicle types
autos['vehicle_type'].unique()

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

In [64]:
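#convert (note: 'limousine' is German for a sedan/saloon body style, so 'sedan' would be a more accurate label than 'luxury')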
autos['vehicle_type'] = autos['vehicle_type'].map({'bus':'bus', 'limousine':'luxury','kleinwagen':'small'
                          ,'kombi':'wagon','coupe':'coupe', 'suv':'suv','cabrio':'cabrio',
                           'andere' : 'other'})

In [66]:
#convert gearbox
autos['gearbox'] = autos['gearbox'].map({'automatik':'automatic', 'manuell':'manual'})

In [67]:
#check
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26,Peugeot_807_160_NAVTECH_ON_BOARD,private,offer,5000,control,bus,2004,,158,andere,150000,3,lpg,peugeot,nein,2016-03-26,0,79588,2016-04-06
1,2016-04-04,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,private,offer,8500,control,luxury,1997,,286,7er,150000,6,benzin,bmw,nein,2016-04-04,0,71034,2016-04-06
2,2016-03-26,Volkswagen_Golf_1.6_United,private,offer,8990,test,luxury,2009,,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26,0,35394,2016-04-06
3,2016-03-12,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,private,offer,4350,control,small,2007,,71,fortwo,70000,6,benzin,smart,nein,2016-03-12,0,33729,2016-03-15
4,2016-04-01,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,private,offer,1350,test,wagon,2003,,0,focus,150000,7,benzin,ford,nein,2016-04-01,0,39218,2016-04-01
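
Note that `gearbox` is empty in the output above. That is what `Series.map` produces when it is applied to values that are not keys of the dictionary, for example if the cell is run a second time after the values have already been translated (an assumption about what happened here). `Series.replace` leaves unmatched values untouched, so it is safe to re-run. A small sketch:

```python
import pandas as pd

translation = {"automatik": "automatic", "manuell": "manual"}
gearbox = pd.Series(["manuell", "automatik", "manuell"])

# map: a second pass finds no matching keys, so every value becomes NaN
mapped_twice = gearbox.map(translation).map(translation)

# replace: values without a matching key are kept as they are
replaced_twice = gearbox.replace(translation).replace(translation)

print(mapped_twice.isnull().all())
print(replaced_twice.tolist())
```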


## Clean the dates further
<br>
Although we have removed the time component from the date columns, they are still stored as strings. Let's turn them into integers so they can be handled as numbers.

In [68]:
autos['date_crawled'] = autos['date_crawled'].str.replace('-', '').astype(int)

In [69]:
autos['ad_created'] = autos['ad_created'].str.replace('-', '').astype(int)

In [70]:
autos['last_seen'] = autos['last_seen'].str.replace('-', '').astype(int)

In [72]:
#new format with int for dates
autos.head(30)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,20160326,Peugeot_807_160_NAVTECH_ON_BOARD,private,offer,5000,control,bus,2004,,158,andere,150000,3,lpg,peugeot,nein,20160326,0,79588,20160406
1,20160404,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,private,offer,8500,control,luxury,1997,,286,7er,150000,6,benzin,bmw,nein,20160404,0,71034,20160406
2,20160326,Volkswagen_Golf_1.6_United,private,offer,8990,test,luxury,2009,,102,golf,70000,7,benzin,volkswagen,nein,20160326,0,35394,20160406
3,20160312,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,private,offer,4350,control,small,2007,,71,fortwo,70000,6,benzin,smart,nein,20160312,0,33729,20160315
4,20160401,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,private,offer,1350,test,wagon,2003,,0,focus,150000,7,benzin,ford,nein,20160401,0,39218,20160401
5,20160321,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,private,offer,7900,test,bus,2006,,150,voyager,150000,4,diesel,chrysler,,20160321,0,22962,20160406
6,20160320,VW_Golf_III_GT_Special_Electronic_Green_Metall...,private,offer,300,test,luxury,1995,,90,golf,150000,8,benzin,volkswagen,,20160320,0,31535,20160323
7,20160316,Golf_IV_1.9_TDI_90PS,private,offer,1990,control,luxury,1998,,90,golf,150000,12,diesel,volkswagen,nein,20160316,0,53474,20160407
8,20160322,Seat_Arosa,private,offer,250,test,,2000,,0,arosa,150000,10,,seat,nein,20160322,0,7426,20160326
9,20160316,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,private,offer,590,control,bus,1997,,90,megane,150000,7,benzin,renault,nein,20160316,0,15749,20160406
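
An alternative to integer encoding, shown here only as a sketch, is `pd.to_datetime`, which keeps the values as real dates so that differences between them can be computed. The strings are taken from the first rows shown earlier.

```python
import pandas as pd

ad_created = pd.Series(["2016-03-26", "2016-04-04", "2016-03-12"])
last_seen = pd.Series(["2016-04-06", "2016-04-06", "2016-03-15"])

created_dt = pd.to_datetime(ad_created, format="%Y-%m-%d")
seen_dt = pd.to_datetime(last_seen, format="%Y-%m-%d")

# Number of days between ad creation and the last sighting by the crawler
days_online = (seen_dt - created_dt).dt.days
print(days_online.tolist())
```

Subtracting the integer form (for example 20160406 - 20160326) would not give a number of days, which is the main reason to prefer real dates for this kind of question.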


## Which model is most popular for each brand?

In [77]:
#dictionary of each brand and its most popular model
most_pop_model_per_brand = {}
for brand in autos['brand'].value_counts().index:
    model_counts = autos.loc[autos['brand']==brand, 'model'].value_counts()
    if len(model_counts) > 0:
        most_common_model = model_counts.index[0]
    else:
        most_common_model = ' '
    most_pop_model_per_brand[brand]=most_common_model
    

In [74]:
print(most_pop_model_per_brand)

{'daewoo': 'matiz', 'land_rover': 'freelander', 'dacia': 'sandero', 'sonstige_autos': ' ', 'ford': 'focus', 'toyota': 'yaris', 'daihatsu': 'cuore', 'honda': 'civic', 'skoda': 'octavia', 'opel': 'corsa', 'mini': 'cooper', 'lada': 'niva', 'trabant': '601', 'smart': 'fortwo', 'subaru': 'legacy', 'lancia': 'ypsilon', 'mazda': '3_reihe', 'jeep': 'grand', 'rover': 'andere', 'volkswagen': 'golf', 'alfa_romeo': '156', 'chevrolet': 'andere', 'bmw': '3er', 'peugeot': '2_reihe', 'suzuki': 'andere', 'citroen': 'andere', 'audi': 'a4', 'fiat': 'punto', 'porsche': '911', 'saab': 'andere', 'nissan': 'micra', 'volvo': 'v70', 'seat': 'ibiza', 'mercedes_benz': 'c_klasse', 'kia': 'andere', 'chrysler': 'andere', 'renault': 'twingo', 'jaguar': 'andere', 'hyundai': 'i_reihe', 'mitsubishi': 'colt'}


In [78]:
#convert to series
most_common_series = pd.Series(most_pop_model_per_brand)

In [80]:
# are all brands considered
most_common_series.shape

(40,)

In [87]:
#add most popular brand model to the dataframe 
autos['most_popular_model']=most_common_series

In [88]:
autos = autos.drop(columns=['most_popular_model'])

In [89]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,20160326,Peugeot_807_160_NAVTECH_ON_BOARD,private,offer,5000,control,bus,2004,,158,andere,150000,3,lpg,peugeot,nein,20160326,0,79588,20160406
1,20160404,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,private,offer,8500,control,luxury,1997,,286,7er,150000,6,benzin,bmw,nein,20160404,0,71034,20160406
2,20160326,Volkswagen_Golf_1.6_United,private,offer,8990,test,luxury,2009,,102,golf,70000,7,benzin,volkswagen,nein,20160326,0,35394,20160406
3,20160312,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,private,offer,4350,control,small,2007,,71,fortwo,70000,6,benzin,smart,nein,20160312,0,33729,20160315
4,20160401,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,private,offer,1350,test,wagon,2003,,0,focus,150000,7,benzin,ford,nein,20160401,0,39218,20160401
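
Assigning `most_common_series` directly aligns on the index. The series is indexed by brand name while `autos` is indexed by row number, so nothing lines up and the new column is filled with NaN. A possible fix, sketched on made-up rows, is to look the value up through the `brand` column with `map`:

```python
import pandas as pd

sample = pd.DataFrame({"brand": ["bmw", "volkswagen", "bmw", "ford"]})
# Values taken from the dictionary printed above
most_common = pd.Series({"bmw": "3er", "volkswagen": "golf", "ford": "focus"})

# Direct assignment aligns brand labels against 0..3 and finds no matches
misaligned = sample.assign(most_popular_model=most_common)

# Mapping through the brand column looks up each row's brand in the series
sample["most_popular_model"] = sample["brand"].map(most_common)
print(sample)
```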


# Effect of mileage on average price

In [90]:

#select the price column before averaging so that text columns are not included in the mean
autos.groupby('odometer_km')['price'].mean()

odometer_km
5000       7686.210938
10000     19859.915323
20000     17940.720839
30000     16414.455137
40000     15448.881101
50000     13621.878607
60000     12269.502203
70000     10817.819850
80000      9575.700573
90000      8353.384661
100000     7923.664452
125000     6086.207905
150000     3653.931111
Name: price, dtype: float64

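<br> As expected, mean price falls as mileage goes up. The one exception is the 5,000 km group, whose mean price is well below that of the 10,000 km group.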

# Conclusion

Without yet being armed with knowledge of visualisation, the purpose of this task was to introduce me to reading, cleaning and glancing over data to gain quick insights. I feel it served its purpose well.