# Exploring Ebay Car Sales Data

This is a dataset of used car listings from eBay Kleinanzeigen, the classifieds section of the German eBay website. It was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data).

The data dictionary provided with the data is as follows:

* __dateCrawled__ - When this ad was first crawled. All field-values are taken from this date.
* __name__ - Name of the car.
* __seller__ - Whether the seller is private or a dealer.
* __offerType__ - The type of listing.
* __price__ - The price on the ad to sell the car.
* __abtest__ - Whether the listing is included in an A/B test.
* __vehicleType__ - The vehicle Type.
* __yearOfRegistration__ - The year in which the car was first registered.
* __gearbox__ - The transmission type.
* __powerPS__ - The power of the car in PS.
* __model__ - The car model name.
* __kilometer__ - How many kilometers the car has driven.
* __monthOfRegistration__ - The month in which the car was first registered.
* __fuelType__ - What type of fuel the car uses.
* __brand__ - The brand of the car.
* __notRepairedDamage__ - If the car has a damage which is not yet repaired.
* __dateCreated__ - The date on which the eBay listing was created.
* __nrOfPictures__ - The number of pictures in the ad.
* __postalCode__ - The postal code for the location of the vehicle.
* __lastSeenOnline__ - When the crawler saw this ad last online.

#### Goal
The aim of this project is primarily to clean the data and perform basic analysis of the used car listings.

__Disclaimer:__
This project was created as part of the Dataquest guided projects series.

## Data Loading

Below we load the dataset and perform some basic exploration to understand which data cleaning tasks are needed, and then carry them out.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np

In [2]:
# load the dataset
autos = pd.read_csv("autos01012019.csv", encoding="Latin-1")

In [3]:
# check dataframe for more information
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [4]:
# check dataframe
autos.head(5)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Just from the looks of it, we can observe that the following features have missing values:
* ```vehicleType```
* ```gearbox```
* ```model```
* ```fuelType```
* ```notRepairedDamage```

Fortunately, no column is missing more than 20% of its values. The worst case is ```notRepairedDamage```, with 9,829 of 50,000 values (about 19.7%) missing.
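The share of missing values per column can be computed rather than read off the ```info()``` output. The sketch below uses a small made-up frame (```sample``` is a hypothetical stand-in for ```autos```), so the numbers it prints are not the dataset's.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `autos`.
sample = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi", "coupe", "suv"],
    "gearbox": ["manuell", "automatik", np.nan, "manuell", "manuell"],
    "brand": ["bmw", "audi", "ford", "opel", "seat"],
})

# isna() gives a boolean frame; the column-wise mean is the share of missing values.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same expression on ```autos``` should list the five columns named above at the top.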

The column names use camelCase instead of Python's preferred snake_case. We will replace the column names moving forward.

In terms of data types, the following features should be converted to the respective types
* ```dateCrawled```, ```dateCreated```, ```lastSeen``` - to datetime
* ```price``` -  to float
* ```odometer``` - to integer

Potentially, we should separate ```price``` into two columns (price, currency) or rename the feature to include the currency. We will decide after further analysis of that feature. The same applies to ```odometer``` and its unit.

In [5]:
# replace camelcase with snakecase
cols = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 
        'abtest', 'vehicle_type', 'registration_year', 'gearbox', 
        'power_ps', 'model', 'odometer', 'registration_month', 
        'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 
        'nr_of_pictures', 'postal_code', 'last_seen']

# replacing columns
autos.columns = cols

# check dataframe
autos.head(5)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Above we replaced the camelcased column names of the dataset with snakecased (Python's preferred) ones.
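The renaming above was done by hand, which also allowed reordering words (for example ```yearOfRegistration``` became ```registration_year```). For a purely mechanical conversion, a small helper along the following lines could be used; ```camel_to_snake``` is a hypothetical name, not part of pandas, and it would not reproduce the hand-picked names exactly.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures"]
converted = [camel_to_snake(c) for c in original]
print(converted)
```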

Next, we will determine whether other cleaning tasks need to be performed. First we will look for text columns that carry no useful information for analysis. After that, we will look for numeric data stored as text, which we can clean and convert.

In [6]:
# describe dataset including categorical features
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-09 11:54:38,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


We can observe that the following features only have two unique values:
* ```seller```
* ```offer_type```
* ```abtest```
* ```gearbox```
* ```unrepaired_damage```

For our analysis & data cleanup we will remove the features that consist almost entirely of a single value. Let's explore that a bit more.

Next, we have numeric columns stored as text which we need to clean and convert to numeric, as well as rename the column name if necessary. Those are:
* ```price```
* ```odometer```

## Data Cleanup

In this section we are going to combine data cleanup with some basic analysis.

#### Dropping unnecessary columns

In [7]:
# value_counts of features with two unique values
cols_two = ['seller', 'offer_type', 'abtest',
            'gearbox', 'unrepaired_damage']
# print value counts
for col in cols_two:
    print(autos[col].value_counts(), '\n')

privat        49999
gewerblich        1
Name: seller, dtype: int64 

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64 

test       25756
control    24244
Name: abtest, dtype: int64 

manuell      36993
automatik    10327
Name: gearbox, dtype: int64 

nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64 



We can see that ```seller``` and ```offer_type``` are mostly one single value. We will remove those columns, since they won't add any value to our analysis & clean up.
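Columns like these can also be flagged automatically by looking at the share of the most frequent value. The following is a minimal sketch on made-up data; the 0.95 cut-off is an arbitrary assumption and not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical toy data: `seller` is almost constant, `abtest` is balanced.
sample = pd.DataFrame({
    "seller": ["privat"] * 99 + ["gewerblich"],
    "abtest": ["test"] * 50 + ["control"] * 50,
})

# Share of the most frequent value in each column.
top_share = sample.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# The 0.95 cut-off is an assumption made for this sketch.
near_constant = top_share[top_share >= 0.95].index.tolist()
print(near_constant)
```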

In [8]:
# drop above mentioned columns
autos.drop(['seller', 'offer_type'], axis=1, inplace=True)

In [9]:
# check that neither dropped column remains
any(col in autos.columns for col in ['seller', 'offer_type'])

False

#### Dealing with numeric columns

Here, we will convert price and odometer to numeric columns and rename columns.

In [10]:
# remove $, ',' signs & convert to numeric
# regex=False treats '$' as a literal character, not a regex anchor
autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(int)
                 )

In [11]:
# remove km & replace ',' then convert to numeric
autos['odometer'] = (autos['odometer']
                     .str.replace('km', '')
                     .str.replace(',','')
                     .astype(int)
                    )

In [12]:
# rename columns
autos.rename({'price': 'price_usd', 'odometer': 'odometer_km'},
             axis='columns', inplace=True)

In [13]:
# check data
autos[['price_usd', 'odometer_km']].head(3)

   price_usd  odometer_km
0       5000       150000
1       8500       150000
2       8990        70000
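Both conversions follow the same pattern: strip a few literal tokens, then cast to integer. A minimal sketch of that pattern as a reusable helper is shown below; ```strip_to_int``` and the ```raw``` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical raw values in the same format as the scraped columns.
raw = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

def strip_to_int(series, tokens):
    """Remove each literal token from the strings, then cast to int."""
    for token in tokens:
        # regex=False treats '$' as a plain character, not an end-of-string anchor
        series = series.str.replace(token, "", regex=False)
    return series.astype(int)

price_usd = strip_to_int(raw["price"], ["$", ","])
odometer_km = strip_to_int(raw["odometer"], ["km", ","])
print(price_usd.tolist(), odometer_km.tolist())
```

Passing ```regex=False``` matters mainly for ```'$'```, which is an end-of-string anchor when interpreted as a regular expression.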


#### Dealing with Dates

There are 5 columns that contain date information. They are:
* ```date_crawled```
* ```ad_created```
* ```last_seen```
* ```registration_month```
* ```registration_year```

The last two columns are already numeric, and we can use them in our analysis. We need to convert the first three into a datetime object since they are timestamps.

In [14]:
autos[['date_crawled','ad_created','last_seen']][0:5]

          date_crawled           ad_created            last_seen
0  2016-03-26 17:47:46  2016-03-26 00:00:00  2016-04-06 06:45:54
1  2016-04-04 13:38:56  2016-04-04 00:00:00  2016-04-06 14:45:08
2  2016-03-26 18:57:24  2016-03-26 00:00:00  2016-04-06 20:15:37
3  2016-03-12 16:58:10  2016-03-12 00:00:00  2016-03-15 03:16:28
4  2016-04-01 14:38:50  2016-04-01 00:00:00  2016-04-01 14:38:50


To better understand these columns, let's look at the distributions of these dates.

##### date_crawled

In [15]:
# show dates distribution by percentage in ascending order
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)\
                        .sort_index(ascending=True)

2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64

Judging from this data, the crawler scraped the site every day for about a month (34 days), from 5 March 2016 to 7 April 2016. The daily share is roughly uniform at about 3%, with a few lighter days and a drop-off over the last three days.
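The same distribution can be obtained from parsed timestamps instead of string slices, which also makes it easy to compute the length of the crawl window. This is a sketch on four made-up timestamps (```stamps``` is hypothetical); only the first and last dates are chosen to match the range reported above.

```python
import pandas as pd

# Hypothetical timestamps in the same string format as `date_crawled`.
stamps = pd.Series([
    "2016-03-05 14:06:30",
    "2016-03-05 20:11:02",
    "2016-03-06 09:45:10",
    "2016-04-07 06:17:27",
])

parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# Share of rows per calendar day, in chronological order.
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()

# Number of calendar days spanned, counting both end points.
span_days = (parsed.max().normalize() - parsed.min().normalize()).days + 1
print(daily_share)
print(span_days)
```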

##### ad_created

In [16]:
# show dates distribution by percentage in ascending order
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)\
                    .sort_index(ascending=True)

2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...   
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64

The earliest ad creation date is 2015-06-11, so the crawler picked up ads of very different ages. The last records show that a large share of the ads were created close to the crawl period, in April 2016.

Most of the ads were created on 74 days, spread intermittently from January 2016 to April 2016. There are only 6 records from 2015.

##### last_seen

In [17]:
# show dates distribution by percentage in ascending order
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)\
                    .sort_index(ascending=True)

2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64

```last_seen``` and ```date_crawled``` cover the same date range, though with different distributions. This suggests that some ads were crawled for the first time while others were re-checked during that period.

##### registration_year

In [18]:
autos['registration_year'].describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Since ```registration_year``` is numeric, ```describe()``` immediately shows a maximum year of 9999 and a minimum year of 1000. These values are clearly implausible and are probably typos or errors, so let's check all such values and their counts.

Before we proceed, we need to identify a range of registration years that makes sense. The upper bound is 2016, since the data was crawled in 2016 and a car cannot be first registered after its ad was seen. The lower bound is less obvious, but cars were not mainstream before 1900, so we use 1900.

First we look at cars outside of this period 1900-2016.

In [19]:
autos[~autos['registration_year'].between(1900, 2016)].shape[0]

1972

There are 1972 cars that fall outside of this date range.

In [20]:
# registration year outside of those bounds
autos[~autos['registration_year'].between(1900, 2016)]['registration_year'].value_counts()

2017    1453
2018     492
9999       4
5000       4
2019       3
9000       2
1800       2
6200       1
4500       1
8888       1
4800       1
2800       1
1001       1
1000       1
1111       1
1500       1
9996       1
5911       1
4100       1
Name: registration_year, dtype: int64

Based on this quick analysis, we will remove all cars that are outside of this range. 
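Since 1972 of 50,000 rows is roughly 3.9% of the data, dropping them is a small loss. The filter itself can be sketched with a boolean mask on made-up years (```sample``` is hypothetical); this is an alternative way of writing the same idea, not the cell used in this notebook.

```python
import pandas as pd

# Hypothetical registration years, including values like those flagged above.
sample = pd.DataFrame({"registration_year": [1000, 1910, 2004, 2016, 2017, 9999]})

# between() is inclusive on both ends by default.
valid = sample["registration_year"].between(1900, 2016)

outside_count = int((~valid).sum())
cleaned = sample[valid].copy()  # keep only rows inside the range
print(outside_count, cleaned["registration_year"].tolist())
```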

In [21]:
# remove cars with registration year outside the range 1900-2016
autos.drop(autos.index[~autos['registration_year'].between(1900, 2016)], axis=0, inplace=True)

In [22]:
# check the column again
autos['registration_year'].value_counts(normalize=True)\
                            .sort_index(ascending=False)

2016    0.027401
2015    0.008308
2014    0.013867
2013    0.016782
2012    0.027546
          ...   
1934    0.000042
1931    0.000021
1929    0.000021
1927    0.000021
1910    0.000187
Name: registration_year, Length: 78, dtype: float64

After removing the cars with implausible registration years, we can observe that the majority of cars were registered recently, with very few dating from the early 1900s.

Finally, let's convert the dates to datetime objects.

In [23]:
# check our dtypes
autos.dtypes

date_crawled          object
name                  object
price_usd              int32
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer_km            int32
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
nr_of_pictures         int64
postal_code            int64
last_seen             object
dtype: object

In [24]:
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'])
autos['ad_created'] = pd.to_datetime(autos['ad_created'])
autos['last_seen'] = pd.to_datetime(autos['last_seen'])

In [25]:
# check dtypes
autos.dtypes

date_crawled          datetime64[ns]
name                          object
price_usd                      int32
abtest                        object
vehicle_type                  object
registration_year              int64
gearbox                       object
power_ps                       int64
model                         object
odometer_km                    int32
registration_month             int64
fuel_type                     object
brand                         object
unrepaired_damage             object
ad_created            datetime64[ns]
nr_of_pictures                 int64
postal_code                    int64
last_seen             datetime64[ns]
dtype: object
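The conversion above relies on pandas inferring the timestamp format. When the format is known, it can be stated explicitly, and malformed entries can be turned into ```NaT``` rather than raising an error. The following sketch uses made-up values, one of them deliberately invalid.

```python
import pandas as pd

# Hypothetical values; the last one is deliberately malformed.
raw = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

Counting the ```NaT``` values afterwards is a cheap check that nothing was silently lost in the conversion.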

### Outlier detection & clean up

In this section, we will analyse further the __```price_usd```__ and __```odometer_km```__ columns to identify outliers, typos (mistakes) and unrealistic values.

##### Price

Let's explore the price feature. Here we aim to understand any anomalies / outliers in the price feature and manage the data accordingly.

In [26]:
# unique price points
unq_prices = len(autos['price_usd'].unique())
total_prices = autos['price_usd'].shape
print("There are {0} unique & {1} total price points".
      format(unq_prices,total_prices[0]))

# maximum and minimum values
print("The max price is {} and the min is {}".format(autos['price_usd'].max(),
                                                               autos['price_usd'].min()))

# describe price
autos['price_usd'].describe()

There are 2334 unique & 48028 total price points
The max price is 99999999 and the min is 0


count    4.802800e+04
mean     9.585252e+03
std      4.843817e+05
min      0.000000e+00
25%      1.150000e+03
50%      2.990000e+03
75%      7.400000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

In [27]:
# show the top 20 highest values with their counts
autos['price_usd'].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    1
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price_usd, dtype: int64

In [28]:
# show top 20 lowest values with their counts
autos['price_usd'].value_counts().sort_index(ascending=True).head(20)

0     1335
1      150
2        2
3        1
5        2
8        1
9        1
10       6
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       6
35       1
Name: price_usd, dtype: int64

Right away, we can observe that there seem to be some outliers in our prices. 
We have a car priced at almost 100 million and 1,335 cars priced at 0. Let's investigate those.

Let's check the car priced at almost 100 million.

----*__More analysis followed here__*

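Following our analysis, we need to deal with listings whose price is an outlier or looks like a typo or an error. Prices outside the range 1 to 1,000,000 are set to missing (```NaN```) rather than dropping the rows, so the row count stays the same and ```price_usd``` becomes a float column.

In [29]:
# mask price outliers: values outside 1..1,000,000 become NaN (rows are kept)
autos['price_usd'] = autos.loc[autos['price_usd'].between(1, 1000000), 'price_usd']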
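The effect of this kind of masking is easier to see on a handful of values. The sketch below uses ```Series.where``` on made-up prices to get the same result as the assignment above: out-of-range values become ```NaN``` and the column turns into floats.

```python
import pandas as pd

# Hypothetical prices, including values like the extremes listed above.
prices = pd.Series([0, 1, 5000, 999999, 1300000, 99999999])

# where() keeps values that satisfy the condition and writes NaN elsewhere.
masked = prices.where(prices.between(1, 1000000))

print(masked.tolist())
print(masked.dtype)          # float64, because NaN is a float
print(masked.dropna().size)  # rows that would remain after dropping NaN prices
```

Aggregations such as ```mean()``` skip ```NaN``` by default, so masked prices do not distort averages computed later.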

##### Odometer

In [30]:
# unique, total distances
unq_distances = len(autos['odometer_km'].unique())
total_dist = autos['odometer_km'].shape
print("There are {0} unique & {1} total distances".
      format(unq_distances, total_dist[0]))

# describe odometer
autos['odometer_km'].describe()

There are 13 unique & 48028 total distances


count     48028.000000
mean     125544.161739
std       40106.751417
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [31]:
# show the top 10 highest values with their counts
autos['odometer_km'].value_counts().sort_index(ascending=False).head(10)

150000    31029
125000     4960
100000     2110
90000      1696
80000      1396
70000      1199
60000      1137
50000      1008
40000       801
30000       769
Name: odometer_km, dtype: int64

In [32]:
# show top 10 lowest values with their counts
autos['odometer_km'].value_counts().sort_index(ascending=True).head(10)

5000      911
10000     249
20000     763
30000     769
40000     801
50000    1008
60000    1137
70000    1199
80000    1396
90000    1696
Name: odometer_km, dtype: int64

### Translating Categorical Variables

The dataset has columns that contain German words. For better understanding, we are going to translate these categorical variables into English. The features are:

* vehicle_type
* gearbox
* unrepaired_damage

For each of these categorical features, we will translate them and map to the original data.
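One caveat with ```Series.map``` and a dictionary is that any value missing from the dictionary silently becomes ```NaN```. A minimal sketch of how to detect that is shown below; the ```values``` series and the ```'pickup'``` category are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column; 'pickup' has no entry in the translation table.
values = pd.Series(["kombi", "cabrio", np.nan, "pickup"])
translations = {"kombi": "station wagon", "cabrio": "convertible"}

translated = values.map(translations)

# Rows that held a value before mapping but are NaN afterwards were not covered.
unmapped = values[values.notna() & translated.isna()].unique().tolist()
print(unmapped)

# One possible fallback: keep the original value where no translation exists.
translated_safe = translated.fillna(values)
print(translated_safe.tolist())
```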

#### vehicle_type - categorical feature

In [33]:
# unique values
autos['vehicle_type'].unique()

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

In [34]:
# translations for unique values in a dict
vehicle_type_trans = {
    'bus': 'bus',
    'limousine': 'limousine',
    'kleinwagen': 'small car',
    'kombi': 'station wagon',
    'coupe': 'coupe',
    'suv': 'suv',
    'cabrio': 'convertible',
    'andere': 'other'
}

# map values
autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_type_trans)

# check unique values
autos['vehicle_type'].unique()

array(['bus', 'limousine', 'small car', 'station wagon', nan, 'coupe',
       'suv', 'convertible', 'other'], dtype=object)

#### gearbox - categorical feature

In [35]:
# unique values
autos['gearbox'].unique()

array(['manuell', 'automatik', nan], dtype=object)

In [36]:
# translations
gearbox_trans = {'manuell': 'manual', 'automatik': 'automatic'}

# map values
autos['gearbox'] = autos['gearbox'].map(gearbox_trans)

# check unique values
autos['gearbox'].unique()

array(['manual', 'automatic', nan], dtype=object)

#### unrepaired_damage - categorical feature

In [37]:
# unique values
autos['unrepaired_damage'].unique()

array(['nein', nan, 'ja'], dtype=object)

In [38]:
# translations
damage_trans = {'nein': 'No', 'ja': 'Yes'}

# map values
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(damage_trans)

# check unique values
autos['unrepaired_damage'].unique()

array(['No', nan, 'Yes'], dtype=object)

In [39]:
autos.dtypes

date_crawled          datetime64[ns]
name                          object
price_usd                    float64
abtest                        object
vehicle_type                  object
registration_year              int64
gearbox                       object
power_ps                       int64
model                         object
odometer_km                    int32
registration_month             int64
fuel_type                     object
brand                         object
unrepaired_damage             object
ad_created            datetime64[ns]
nr_of_pictures                 int64
postal_code                    int64
last_seen             datetime64[ns]
dtype: object

## Data Analysis

### Car Brands

In this section, I will try to analyse car brands and aggregate car data based on brands. Particularly, I am interested in the following questions

1. What are the top 10 brands by number of ads?
2. Which brands have more than 5% ads in the database?

Both are reasonable indicators of a brand's popularity.

In [40]:
# top 10 brands
autos['brand'].value_counts().head(10)

volkswagen       10188
bmw               5284
opel              5195
mercedes_benz     4580
audi              4149
ford              3352
renault           2274
peugeot           1418
fiat              1242
seat               873
Name: brand, dtype: int64

In [41]:
# top brands with more than 5% representation
autos['brand'].value_counts(normalize=True).head(6)

volkswagen       0.212126
bmw              0.110019
opel             0.108166
mercedes_benz    0.095361
audi             0.086387
ford             0.069793
Name: brand, dtype: float64

We can observe that the top 6 car brands have more than 5% representation in the dataset. We will focus on analysing these car brands.
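Taking ```head(6)``` works here because the cut-off was read from the output. The 5% threshold can also be applied directly, as in this sketch on a made-up brand column (```brands_col``` is hypothetical).

```python
import pandas as pd

# Hypothetical brand column: 'seat' is 1 of 25 listings (4%), 'ford' is 2 of 25 (8%).
brands_col = pd.Series(["volkswagen"] * 12 + ["bmw"] * 6 + ["opel"] * 4
                       + ["ford"] * 2 + ["seat"])

brand_share = brands_col.value_counts(normalize=True)

# Keep only brands whose share exceeds 5%.
major_brands = brand_share[brand_share > 0.05].index.tolist()
print(major_brands)
```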

Next, we will try to get an average price for each of these car brands.

In [42]:
# brands we need
brands = autos['brand'].value_counts(normalize=True).head(6).index

In [43]:
# average price for each brand
brand_avgprice = {}
for brand in brands:
    brand_avgprice[brand] = autos.loc[autos['brand'] == brand, 'price_usd'].mean()

# average brand prices
brand_avgprice

{'volkswagen': 5604.071269261963,
 'bmw': 8332.820517811953,
 'opel': 2975.2419354838707,
 'mercedes_benz': 8628.450366422385,
 'audi': 9336.687453600594,
 'ford': 4054.6930147058824}

We can observe that, among these brands, Audi is the most expensive on average, followed by Mercedes-Benz and BMW. Opel and Ford are the least expensive, with Volkswagen in between.

Next for ease of use, we will convert the dictionary to pandas dataframe and add mean mileage to try to see whether there is a correlation between mileage and price.

In [44]:
# mean mileage calculation for brands
mean_mileage = {}
for brand in brands:
    mean_mileage[brand] = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()

In [45]:
# convert dictionary to dataframe
df = pd.DataFrame(pd.Series(brand_avgprice), columns=['mean_price'])

# add mean mileage
df['mean_mileage'] = pd.Series(mean_mileage)

In [46]:
print(df)

                mean_price   mean_mileage
volkswagen     5604.071269  128730.369062
bmw            8332.820518  132434.708554
opel           2975.241935  129227.141482
mercedes_benz  8628.450366  130860.262009
audi           9336.687454  129287.780188
ford           4054.693015  124046.837709


We can observe from the above dataframe that mean mileage does not explain the differences in mean price between brands: mean mileage stays within a tight range (roughly 124,000 to 132,000 km), while mean price ranges from about 3,000 to about 9,300.
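The two loops and the dictionary-to-dataframe conversion can be collapsed into a single ```groupby``` with named aggregation. This is a sketch on made-up listings (```sample``` and ```top_brands``` are hypothetical), offered as an alternative formulation rather than the approach taken above.

```python
import numpy as np
import pandas as pd

# Hypothetical listings; one price is NaN, as after the outlier masking step.
sample = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "ford"],
    "price_usd": [9000.0, 11000.0, 3000.0, np.nan, 4000.0],
    "odometer_km": [125000, 150000, 150000, 100000, 90000],
})

top_brands = ["audi", "opel"]

# Restrict to the brands of interest, then aggregate both columns in one pass.
summary = (sample[sample["brand"].isin(top_brands)]
           .groupby("brand")
           .agg(mean_price=("price_usd", "mean"),
                mean_mileage=("odometer_km", "mean")))
print(summary)
```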

In [47]:
# check dataset before the next section
autos.head(5)

Unnamed: 0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,5000.0,control,bus,2004,manual,158,andere,150000,3,lpg,peugeot,No,2016-03-26,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500.0,control,limousine,1997,automatic,286,7er,150000,6,benzin,bmw,No,2016-04-04,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,8990.0,test,limousine,2009,manual,102,golf,70000,7,benzin,volkswagen,No,2016-03-26,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,4350.0,control,small car,2007,automatic,71,fortwo,70000,6,benzin,smart,No,2016-03-12,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,1350.0,test,station wagon,2003,manual,0,focus,150000,7,benzin,ford,No,2016-04-01,0,39218,2016-04-01 14:38:50


#### Most Common Brand / Model

In this section, we will find the most common brand model combination. To make this easier first we will create a new feature containing the concatenation of brand and model columns.

In [48]:
# concatenate two columns
autos['brand_model'] = autos['brand'].map(str) + '-' + autos['model'].map(str)

In [49]:
# brand_model combinations counts
autos['brand_model'].value_counts().head(10)

volkswagen-golf           3815
bmw-3er                   2688
volkswagen-polo           1677
opel-corsa                1645
opel-astra                1388
volkswagen-passat         1388
audi-a4                   1265
bmw-5er                   1163
mercedes_benz-c_klasse    1147
mercedes_benz-e_klasse     981
Name: brand_model, dtype: int64

We can observe that the three most common brand-model combinations in the dataset are:

1. volkswagen-golf
2. bmw-3er
3. volkswagen-polo
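Note that ```map(str)``` turns a missing model into the text ```'nan'```, so the concatenated column can contain labels such as ```bmw-nan```. Grouping on both columns avoids this, because ```groupby``` drops rows with a missing key by default. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last listing has no model.
sample = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "bmw", "volkswagen", "bmw"],
    "model": ["golf", "golf", "3er", "polo", np.nan],
})

# groupby drops rows whose key is NaN by default.
combo_counts = (sample.groupby(["brand", "model"])
                .size()
                .sort_values(ascending=False))
print(combo_counts.head(3))
```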

### Average Price vs Mileage

In this section we will explore whether there is a relationship between the average price of a car and its mileage. Our mileage feature ```odometer_km``` is a discrete variable with only 13 distinct values. We will group the mileage into bins and look at how the average price changes as mileage increases.

In [50]:
# remind ourselves the value distribution
autos['odometer_km'].value_counts().sort_index()

5000        911
10000       249
20000       763
30000       769
40000       801
50000      1008
60000      1137
70000      1199
80000      1396
90000      1696
100000     2110
125000     4960
150000    31029
Name: odometer_km, dtype: int64

For simplicity, we will divide the mileage into three groups: ```high```, ```medium``` and ```low``` mileage.

In [51]:
# create a new mileage_group feature
autos['mileage_group'] = pd.cut(autos['odometer_km'],
                                bins=[0, 40000, 80000, 160000],
                                labels=["low", "medium", "high"])

# check the values
autos['mileage_group'].value_counts()

high      39795
medium     4740
low        3493
Name: mileage_group, dtype: int64
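It helps to be explicit about where the bin edges fall: ```pd.cut``` builds right-inclusive intervals by default, so a reading of exactly 40,000 km lands in the low group and 80,000 km in the medium group. The sketch below applies the same bins to a few of the odometer values listed above.

```python
import pandas as pd

# A few of the distinct odometer readings listed above.
readings = pd.Series([5000, 40000, 50000, 80000, 90000, 150000])

groups = pd.cut(readings,
                bins=[0, 40000, 80000, 160000],
                labels=["low", "medium", "high"])

# Bins are right-inclusive by default: (0, 40000], (40000, 80000], (80000, 160000]
print(dict(zip(readings.tolist(), groups.astype(str).tolist())))
```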

Next we will aggregate the data and construct a dictionary with mileage group as keys and average prices as values.

In [52]:
# mileage vs prices dictionary
avg_mileage_price = {}
mileage_groups = autos['mileage_group'].value_counts().index
for mg in mileage_groups:
    avg_mileage_price[mg] = autos.loc[autos['mileage_group'] == mg, 'price_usd'].mean()
    
# check dictionary
avg_mileage_price

{'high': 4587.8234046954185,
 'medium': 11536.19837710869,
 'low': 15212.965714285714}

From the above, we can clearly observe that cars with less mileage are more expensive. On average, a car with at most 40,000 km is roughly 32% more expensive than a car with between 40,000 and 80,000 km. Compared with cars that have more than 80,000 km, cars with at most 40,000 km are about 3.3 times as expensive.
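The ratios between the group means can be checked directly from the dictionary output above. The snippet below copies those three values and computes how the low-mileage group compares with the other two.

```python
# Group means copied from the dictionary output above.
avg_mileage_price = {
    "high": 4587.8234046954185,
    "medium": 11536.19837710869,
    "low": 15212.965714285714,
}

low_vs_medium = avg_mileage_price["low"] / avg_mileage_price["medium"]
low_vs_high = avg_mileage_price["low"] / avg_mileage_price["high"]

print(f"low vs medium: {low_vs_medium:.2f}x")
print(f"low vs high:   {low_vs_high:.2f}x")
```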

### Damaged vs Non-Damaged Cars