# Guided Project: Exploring eBay Car Sales Data

In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website. The dataset is available [here](https://data.world/data-society/used-cars-data).

The goal of this project is to clean the data and analyze the included used car listings. 


## Introduction

In [1]:
import pandas as pd
autos = pd.read_csv("autos.csv", encoding="Latin-1")

In [2]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

In [3]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


The autos dataset has 20 columns and 50,000 rows. The data dictionary provided with the data is as follows:
- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle Type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- odometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has a damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeen - When the crawler saw this ad last online.

From this information about the dataset, we can make the following observations:
1. The dataset contains 20 columns, most of which are strings.
2. Some columns have null values, but none have more than ~20% null values.
3. The column names use camelCase instead of Python's preferred snake_case, which means we can't just replace spaces with underscores.

## Cleaning Column Names

In [5]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We will convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive.

In [6]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen']
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
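The renaming above was done by hand. As a rough sketch, the purely mechanical part of the conversion (camelCase to snake_case) could be automated with a regular expression. The helper camel_to_snake is hypothetical, not part of pandas, and rewordings such as yearOfRegistration to registration_year would still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Insert "_" before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# A few of the original column names from the dataset.
original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(c) for c in original]
print(converted)
# ['date_crawled', 'offer_type', 'power_ps', 'nr_of_pictures', 'abtest']
```

Names without an internal capital, such as abtest, pass through unchanged, which is why the hand-written list above is still needed for ab_test.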


## Initial Exploration and Cleaning

Now we will do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for: 
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. 
- Examples of numeric data stored as text which can be cleaned and converted.

In [7]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-30 17:37:35,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Our initial observations:

There are a number of text columns where all (or nearly all) of the values are the same:
- seller
- offer_type

The num_photos column looks odd: every summary statistic is 0. We'll need to investigate this further.
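One way to check for columns that carry little or no information is to measure how much of each column is taken up by its most frequent value. The sketch below uses a small hypothetical stand-in for the autos DataFrame; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the autos DataFrame.
sample = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "gewerblich"],
    "num_photos": [0, 0, 0, 0],
    "brand": ["bmw", "ford", "opel", "audi"],
})

# Share of rows occupied by the most frequent value in each column.
top_share = sample.apply(lambda col: col.value_counts(normalize=True).max())

# Columns where a single value fills every row are candidates for dropping.
constant_cols = top_share[top_share == 1.0].index.tolist()
print(top_share)
print(constant_cols)
```

On the real data, the same check could be run with a threshold slightly below 1.0 to also catch columns like seller and offer_type, where all but one row share the same value.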

We found that the price and odometer columns are numeric values stored as text. So we need to:
- Remove any non-numeric characters
- Convert the column to a numeric dtype.

In [8]:
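# regex=False makes "$" a literal character instead of a regex end-of-string anchor.
autos["price"] = autos["price"].str.replace('$', '', regex=False)
autos["price"] = autos["price"].str.replace(',', '', regex=False).astype(int)

# Strip the unit and thousands separator, then keep the unit in the column name.
autos["odometer"] = autos["odometer"].str.replace('km', '', regex=False)
autos["odometer"] = autos["odometer"].str.replace(',', '', regex=False).astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)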
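Since both columns need the same two steps (remove non-numeric characters, convert to a numeric dtype), the steps could be wrapped in one small helper. This is only a sketch: strip_to_int is a hypothetical name, and it assumes the values contain no decimals or negative signs, which holds for the rows shown above.

```python
import pandas as pd

def strip_to_int(series):
    """Remove every non-digit character, then cast to int.

    Assumes whole, non-negative numbers such as "$5,000" or "150,000km".
    """
    return series.str.replace(r"\D", "", regex=True).astype(int)

# Sample values copied from the first rows of the dataset.
prices = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km"])

clean_prices = strip_to_int(prices)
clean_km = strip_to_int(odometer)
print(clean_prices.tolist(), clean_km.tolist())
```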

In [9]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price                 50000 non-null int64
ab_test               50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null int64
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
num_photos            50000 non-null int64
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(7), 

## Exploring the Odometer and Price Columns

In [10]:
autos["odometer_km"].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

We can see that the values in this field are rounded, which might indicate that sellers had to choose from pre-set options for this field. Additionally, there are more high mileage than low mileage vehicles.

In [11]:
print(autos["price"].unique().shape)
print(autos["price"].describe())

(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


Again, the prices in this column seem rounded; however, given that there are 2,357 unique values in the column, that may just reflect people's tendency to round prices on the site.
There are 1,421 cars listed with a price of $0. Since this is less than 3% of the cars, we might consider removing these rows. The maximum price is one hundred million dollars, which seems implausibly high, so let's look at the extreme prices further.

In [12]:
autos["price"].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

There are a number of listings with prices below $30, including 1,421 at $0. There are also a small number of listings with very high values, including 14 at around or over $1 million.

Given that eBay is an auction site, there could legitimately be items where the opening bid is $1. We will keep the $1 items, but remove anything above $350,000, since it seems that prices increase steadily to that number and then jump up to less realistic numbers.
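The cell above lists only the lowest prices. A minimal sketch of the complementary view, sorting the same counts in descending order to inspect the highest prices, is shown here on a hypothetical handful of values rather than the real column.

```python
import pandas as pd

# Hypothetical prices standing in for autos["price"].
prices = pd.Series([0, 1, 1, 500, 3000, 350000, 999990, 99999999])

# Same value_counts call as above, but with the largest prices first.
top_prices = prices.value_counts().sort_index(ascending=False).head(3)
print(top_prices)

# Number of listings above a candidate cut-off.
n_over_limit = (prices > 350000).sum()
print(n_over_limit)
```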

In [13]:
autos = autos[autos["price"].between(1,351000)]
autos["price"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

## Exploring the Date Columns

There are a number of columns with date information:
- date_crawled
- registration_month
- registration_year
- ad_created
- last_seen

These are a combination of dates that were crawled, and dates with meta-information from the crawler. The non-registration dates are stored as strings.

We'll explore each of these columns to learn more about the listings.

In [14]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [15]:
(autos["date_crawled"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_index()
        )

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

In [16]:
(autos["date_crawled"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_values()
        )

2016-04-07    0.001400
2016-04-06    0.003171
2016-03-18    0.012911
2016-04-05    0.013096
2016-03-06    0.014043
2016-03-13    0.015670
2016-03-05    0.025327
2016-03-24    0.029342
2016-03-16    0.029610
2016-03-27    0.031092
2016-03-25    0.031607
2016-03-17    0.031628
2016-03-31    0.031834
2016-03-10    0.032184
2016-03-26    0.032204
2016-03-23    0.032225
2016-03-11    0.032575
2016-03-22    0.032987
2016-03-09    0.033090
2016-03-08    0.033296
2016-03-30    0.033687
2016-04-01    0.033687
2016-03-29    0.034099
2016-03-15    0.034284
2016-03-19    0.034778
2016-03-28    0.034860
2016-04-02    0.035478
2016-03-07    0.036014
2016-04-04    0.036487
2016-03-14    0.036549
2016-03-12    0.036920
2016-03-21    0.037373
2016-03-20    0.037887
2016-04-03    0.038608
Name: date_crawled, dtype: float64

Looks like the site was crawled daily over roughly a one month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform.
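The cells above take the first 10 characters of each timestamp string. An alternative sketch, assuming every timestamp follows the YYYY-MM-DD HH:MM:SS pattern seen in the first rows, parses the strings with pd.to_datetime and then counts by calendar date. The three sample values are copied from the head of the dataset.

```python
import pandas as pd

# Three timestamps copied from the first rows of date_crawled.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# Parse into real datetimes; an explicit format fails loudly on malformed rows.
crawled_dt = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")

# Share of listings per calendar day, in chronological order.
daily_share = crawled_dt.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
```

Working with parsed datetimes also makes later steps, such as extracting the weekday or computing differences between dates, possible without further string handling.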

In [17]:
(autos["last_seen"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_index()
        )

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.

The last three days contain a disproportionate share of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are related to the crawling period ending and don't indicate car sales.

In [18]:
print(autos["ad_created"].str[:10].unique().shape)
(autos["ad_created"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_index()
        )

(76,)


2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21 

There is a large variety of ad created dates. Most fall within 1-2 months of the listing date, but a few are quite old, with the oldest at around 9 months.

In [19]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The year that the car was first registered will likely indicate the age of the car. Looking at this column, we note some odd values. The minimum value is 1000, long before cars were invented, and the maximum is 9999, many years into the future.

## Dealing with Incorrect Registration Year Data

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

One option is to remove the listings whose registration year falls outside the 1900 - 2016 interval. Let's determine what percentage of our data has such values, to see if it's safe to remove those rows entirely or if we need more custom logic:

In [20]:
(~autos["registration_year"].between(1900,2016)).sum() / autos.shape[0]

0.038793369710697

Given that this is less than 4% of our data, we will remove these rows.

In [21]:
# .copy() avoids SettingWithCopyWarning when new columns are assigned later.
autos = autos[autos["registration_year"].between(1900,2016)].copy()
autos["registration_year"].value_counts(normalize=True)

2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
2006    0.057197
2001    0.056468
2002    0.053255
1998    0.050620
2007    0.048778
2008    0.047450
2009    0.044665
1997    0.041794
2011    0.034768
2010    0.034040
1996    0.029412
2012    0.028063
1995    0.026285
2016    0.026135
2013    0.017202
2014    0.014203
1994    0.013474
1993    0.009104
2015    0.008397
1992    0.007926
1990    0.007433
1991    0.007262
1989    0.003727
1988    0.002892
1985    0.002035
          ...   
1966    0.000471
1976    0.000450
1969    0.000407
1975    0.000386
1965    0.000364
1964    0.000257
1963    0.000171
1959    0.000129
1961    0.000129
1910    0.000107
1956    0.000086
1958    0.000086
1937    0.000086
1962    0.000086
1950    0.000064
1954    0.000043
1941    0.000043
1951    0.000043
1934    0.000043
1957    0.000043
1955    0.000043
1953    0.000021
1943    0.000021
1929    0.000021
1939    0.000021
1938    0.000021
1948    0.000021
1927    0.0000

It appears that most of the vehicles were first registered in the past 20 years.

## Exploring Price by Brand

In [22]:
autos["brand"].value_counts(normalize=True)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

German manufacturers account for all of the top five brands (Volkswagen, BMW, Opel, Mercedes-Benz and Audi), which together make up about 61% of the listings. Volkswagen is by far the most popular brand, with roughly as many cars for sale as the next two brands combined.
Many brands account for only a small percentage of listings, so we will limit our analysis to brands representing more than 5% of total listings.

In [24]:
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


In [25]:
brands_mean_prices = {}

for b in common_brands:
    selected_rows = autos[autos["brand"] == b]
    mean = selected_rows["price"].mean()
    brands_mean_prices[b] = int(mean)
    
print(brands_mean_prices)    

{'audi': 9336, 'volkswagen': 5402, 'mercedes_benz': 8628, 'bmw': 8332, 'ford': 3749, 'opel': 2975}


Among the top 6 brands, there is a distinct price gap:

- Audi, BMW and Mercedes Benz are more expensive
- Ford and Opel are less expensive
- Volkswagen is in between - this may explain its popularity, as it may be a 'best of both worlds' option.

## Storing Aggregate Data in a DataFrame

For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price. 



In [26]:
bmp_series = pd.Series(brands_mean_prices)
pd.DataFrame(bmp_series, columns=["mean_price"]) 

Unnamed: 0,mean_price
audi,9336
bmw,8332
ford,3749
mercedes_benz,8628
opel,2975
volkswagen,5402


In [27]:
brands_mean_mileage = {}

for b in common_brands:
    selected_rows = autos[autos["brand"] == b]
    mean_mileage = selected_rows["odometer_km"].mean()
    brands_mean_mileage[b] = int(mean_mileage)

mean_mileage = pd.Series(brands_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brands_mean_prices).sort_values(ascending=False)

In [28]:
brand_info = pd.DataFrame(mean_mileage, columns=['mean_mileage'])
brand_info

Unnamed: 0,mean_mileage
bmw,132572
mercedes_benz,130788
opel,129310
audi,129157
volkswagen,128707
ford,124266


In [29]:
brand_info["mean_price"] = mean_prices
brand_info

Unnamed: 0,mean_mileage,mean_price
bmw,132572,8332
mercedes_benz,130788,8628
opel,129310,2975
audi,129157,9336
volkswagen,128707,5402
ford,124266,3749


The range of car mileages does not vary as much as the prices do by brand, instead all falling within 10% for the top brands. There is a slight trend to the more expensive vehicles having higher mileage, with the less expensive vehicles having lower mileage.
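The loops in this section build one dictionary per metric. As an alternative sketch, both means can be computed in a single groupby call. The DataFrame below is a hypothetical five-row sample, and the column names mean_price and mean_mileage simply mirror the ones used above.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned autos DataFrame.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "opel"],
    "price": [8000, 9000, 3000, 4000, 2500],
    "odometer_km": [150000, 125000, 100000, 150000, 125000],
})
common = ["bmw", "ford"]  # plays the role of common_brands

brand_summary = (
    sample[sample["brand"].isin(common)]   # keep only the common brands
    .groupby("brand")[["price", "odometer_km"]]
    .mean()                                # mean of both columns per brand
    .rename(columns={"price": "mean_price", "odometer_km": "mean_mileage"})
    .sort_values("mean_price", ascending=False)
)
print(brand_summary)
```

Unlike the loop version, the means here stay as floats; they can be rounded or cast with astype(int) if integer output is preferred.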

**Next steps:**
- Translate the vehicle_type and fuel_type values to English

In [31]:
autos["vehicle_type"].value_counts()

limousine     12598
kleinwagen    10585
kombi          8930
bus            4031
cabrio         3016
coupe          2462
suv            1965
andere          390
Name: vehicle_type, dtype: int64

In [32]:
correct_vehicle_type = {
                        "limousine":"limousine",
                        "kleinwagen":"small_car",
                        "kombi":"combi",
                        "bus":"bus",
                        "cabrio":"cabrio",
                        "suv":"suv",
                        "coupe":"coupe",
                        "andere":"other"
                      }

autos["vehicle_type"] = autos["vehicle_type"].map(correct_vehicle_type)

In [33]:
autos["fuel_type"].value_counts()

benzin     28540
diesel     14032
lpg          649
cng           71
hybrid        37
elektro       19
andere        15
Name: fuel_type, dtype: int64

In [34]:
correct_fuel_type = {
                        "benzin":"petrol",
                        "diesel":"diesel",
                        "lpg":"lpg",
                        "cng":"cng",
                        "hybrid":"hybrid",
                        "elektro":"electric",
                        "andere":"other"
                      }

autos["fuel_type"] = autos["fuel_type"].map(correct_fuel_type)

In [35]:
print(autos["fuel_type"].unique())

['lpg' 'petrol' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']


In [36]:
print(autos["vehicle_type"].unique())

['bus' 'limousine' 'small_car' 'combi' nan 'coupe' 'suv' 'cabrio' 'other']
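One thing to keep in mind with Series.map and a dictionary: any value that is missing from the dictionary becomes NaN. The sketch below, on a hypothetical four-value series with a deliberately incomplete dictionary, shows one way to detect such values and one way to fall back to the original text.

```python
import pandas as pd

# Deliberately incomplete dictionary: "lpg" has no entry.
translations = {"benzin": "petrol", "diesel": "diesel", "elektro": "electric"}
fuel = pd.Series(["benzin", "diesel", "lpg", None])

mapped = fuel.map(translations)

# Values that existed before mapping but turned into NaN afterwards.
unmapped = fuel[fuel.notna() & mapped.isna()].unique().tolist()
print(unmapped)

# Keep the original value wherever the dictionary has no translation.
safe = mapped.fillna(fuel)
print(safe.tolist())
```

In the cells above every observed value has a dictionary entry, so the only NaN values after mapping are the ones that were already missing.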


- Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.



In [37]:
autos["date"] = autos["date_crawled"].str.split().str[0]
autos["date"] = autos["date"].str.replace('-','').astype(int)
print(autos["date"])

0        20160326
1        20160404
2        20160326
3        20160312
4        20160401
5        20160321
6        20160320
7        20160316
8        20160322
9        20160316
11       20160316
12       20160331
13       20160323
14       20160323
15       20160401
16       20160316
17       20160329
18       20160326
19       20160317
20       20160305
21       20160306
22       20160328
23       20160310
24       20160403
25       20160321
26       20160403
28       20160319
29       20160402
30       20160314
31       20160314
           ...   
49968    20160401
49969    20160317
49970    20160321
49971    20160329
49972    20160326
49973    20160327
49975    20160327
49976    20160319
49977    20160331
49978    20160404
49979    20160320
49980    20160312
49981    20160315
49982    20160329
49983    20160306
49985    20160402
49986    20160404
49987    20160322
49988    20160328
49989    20160311
49990    20160321
49991    20160306
49992    20160310
49993    20160315
49994    2

- See if there are particular keywords in the name column that you can extract as new columns

In [38]:
autos["name"].value_counts().sort_values(ascending=False)

BMW_316i                                                             75
Volkswagen_Golf_1.4                                                  75
Ford_Fiesta                                                          74
Volkswagen_Polo                                                      72
BMW_318i                                                             72
BMW_320i                                                             71
Opel_Corsa                                                           68
Renault_Twingo                                                       66
Volkswagen_Golf                                                      57
Opel_Corsa_1.2_16V                                                   56
BMW_116i                                                             53
Volkswagen_Polo_1.2                                                  48
Volkswagen_Golf_1.6                                                  47
Peugeot_206                                                     

We observe that if we split the name on "_", the first part is usually the brand and the second is the model or series. The third part varies: it can be the engine capacity, a series or a special edition. Because the third part varies so much, we cannot reliably extract it into a specific new column. A brand-model combination, however, can be extracted from the first two parts of the split.
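A minimal sketch of that split, using four names taken from the outputs above. The variable names name_brand and name_model are hypothetical, and the approach assumes the name really starts with brand and model, which free-text titles such as Senator_A_3.0E_... do not always do.

```python
import pandas as pd

# Listing names copied from the outputs above.
names = pd.Series([
    "Volkswagen_Golf_1.6_United",
    "BMW_316i",
    "Ford_Fiesta",
    "Opel_Corsa_1.2_16V",
])

# Split into at most three parts: brand, model, and everything else.
parts = names.str.split("_", n=2, expand=True)

name_brand = parts[0].str.lower()
name_model = parts[1].str.lower()
print(name_brand.tolist())
print(name_model.tolist())
```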

- **Find the most common brand/model combinations**

In [39]:
autos["name"].value_counts()

BMW_316i                                                         75
Volkswagen_Golf_1.4                                              75
Ford_Fiesta                                                      74
Volkswagen_Polo                                                  72
BMW_318i                                                         72
BMW_320i                                                         71
Opel_Corsa                                                       68
Renault_Twingo                                                   66
Volkswagen_Golf                                                  57
Opel_Corsa_1.2_16V                                               56
BMW_116i                                                         53
Volkswagen_Polo_1.2                                              48
Volkswagen_Golf_1.6                                              47
Opel_Corsa_B                                                     46
Peugeot_206                                     

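Going by the raw listing names, the three most common are BMW_316i, Volkswagen_Golf_1.4 and Ford_Fiesta. Note that these are counts of identical ad titles rather than true brand/model combinations.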
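Counting the name column groups only identical ad titles. A sketch of counting actual brand/model pairs, assuming the cleaned brand and model columns are used instead, is shown below on a hypothetical six-row sample.

```python
import pandas as pd

# Hypothetical sample of the brand and model columns.
sample = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "bmw", "volkswagen", "bmw", "ford"],
    "model": ["golf", "golf", "7er", "golf", "7er", "focus"],
})

# Number of listings per (brand, model) pair, most common first.
combo_counts = (
    sample.groupby(["brand", "model"])
    .size()
    .sort_values(ascending=False)
)
print(combo_counts)
```

Rows with a missing model are left out by groupby by default, so on the real data the totals would cover only listings where both fields are filled in.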

- **Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the mileage**

In [40]:
autos.loc[autos["odometer_km"].between(0,30000),"odometer_km_group"] = "A"
autos.loc[autos["odometer_km"].between(30001,60000),"odometer_km_group"] = "B"
autos.loc[autos["odometer_km"].between(60001,90000) ,"odometer_km_group"] = "C"
autos.loc[autos["odometer_km"].between(90001,120000),"odometer_km_group"] = "D"
autos.loc[autos["odometer_km"].between(120001,150000) ,"odometer_km_group"] = "E"
autos.loc[autos["odometer_km"] > 150000  ,"odometer_km_group"] = "F"

In [41]:
avg_price_by_mileage = {}

for m in autos["odometer_km_group"].unique():
    selected_rows = autos[autos["odometer_km_group"] == m]
    mean_price = int(selected_rows["price"].mean())
    avg_price_by_mileage[m] = mean_price
    
avg_price_by_mileage

{'A': 15122, 'B': 13721, 'C': 9563, 'D': 8132, 'E': 4107}

In [43]:
autos.loc[autos["odometer_km_group"] == 'A',["name","odometer_km","odometer_km_group","price"]].head()

Unnamed: 0,name,odometer_km,odometer_km_group,price
22,MINI_Cooper_S_Cabrio,10000,A,25450
24,BMW_535i_xDrive_Sport_Aut.,30000,A,48500
42,Vw_Polo_l.0_/60_PS_Blue_Motion_Technologie_Son...,20000,A,11900
52,Senator_A_3.0E_Karosserie_restauriert_m._viele...,5000,A,3500
53,Dacia_Sandero_Stepway_II_MPI_85__top_gepflegt,30000,A,9000


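Findings:
- Cars with the lowest odometer readings have the highest average prices.
- Cars with the highest odometer readings have the lowest average prices.
- Group F is empty because no listing has an odometer value above 150,000 km.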
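The mileage grouping used in this section can also be expressed with pd.cut followed by a groupby. The sketch below runs on a hypothetical six-row sample; the bin edges mirror groups A to E, and include_lowest=True keeps a reading of exactly 0 in group A.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned autos DataFrame.
sample = pd.DataFrame({
    "odometer_km": [5000, 30000, 40000, 90000, 125000, 150000],
    "price": [15000, 14000, 13000, 9000, 5000, 3000],
})

bins = [0, 30000, 60000, 90000, 120000, 150000]
labels = ["A", "B", "C", "D", "E"]

# Each interval includes its right edge, so 30000 falls in A and 90000 in C.
sample["odometer_km_group"] = pd.cut(
    sample["odometer_km"], bins=bins, labels=labels, include_lowest=True
)

# observed=True reports only groups that actually contain rows.
mean_price_by_group = (
    sample.groupby("odometer_km_group", observed=True)["price"].mean()
)
print(mean_price_by_group)
```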
    