The aim of this project is to practice data cleaning techniques and to conduct a basic analysis of the included used car listings. The following relationships will be examined:
- dependence of price on brand;
- the most common brand/model combinations;
- dependence of price on mileage;
- dependence of price on damage.

In [1]:
import numpy as np
import pandas as pd

In [2]:
autos = pd.read_csv('autos.csv', encoding='Latin-1')
print(autos.info())
print(autos.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

The dataset contains 20 columns, most of which are strings. The columns "vehicleType", "gearbox", "model", "fuelType", and "notRepairedDamage" contain null values. Some numeric and date values are stored as strings (columns "dateCrawled", "price", "odometer", "dateCreated", "lastSeen").
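
The null counts and column types noted above can be collected in one small table. The following is a minimal sketch on a hypothetical toy frame; the `sample` data and the `summary` name are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few columns of the listing data
sample = pd.DataFrame({
    'price': ['$5,000', '$8,500', '$0'],
    'gearbox': ['manuell', np.nan, 'automatik'],
    'powerPS': [158, 286, 102],
})

# One row per column: its dtype and the number of missing values
summary = pd.DataFrame({
    'dtype': sample.dtypes.astype(str),
    'n_null': sample.isnull().sum(),
})
print(summary)
```

Columns whose dtype is a string type but whose content looks numeric (here `price`) are the ones that need conversion later.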

In [3]:
# Format columns of the dataframe
autos.columns = ['date_crawled', 'name', 'seller',
                 'offer_type', 'price', 'abtest',
                 'vehicle_type', 'registration_year',
                 'gearbox', 'power_ps', 'model', 'odometer',
                 'registration_month', 'fuel_type', 'brand',
                 'unrepaired_damage', 'ad_created',
                 'nr_of_pictures', 'postal_code', 'last_seen']

The columns 'seller', 'offer_type', 'abtest', and 'nr_of_pictures' hold mostly one value each and can be dropped. Columns that need more investigation:
- 'name' (the car model can be extracted from this column);
- 'price' (needs to be cleaned; the most frequent value is zero);
- 'odometer' (needs to be cleaned);
- 'vehicle_type' and 'unrepaired_damage' (some words are in German);
- 'registration_year' (unusual min and max values);
- 'power_ps' (unusual min and max values);
- numeric and date values stored as strings (columns 'date_crawled', 'price', 'odometer', 'ad_created', 'last_seen') need to be evaluated.
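
One way to check whether a column holds mostly one value is to look at the share of its most frequent value. The sketch below uses a hypothetical helper `dominant_share` and made-up toy data; neither comes from the notebook.

```python
import pandas as pd

def dominant_share(df):
    """Return, per column, the share of rows taken by the most frequent value."""
    return pd.Series({
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    })

# Made-up example: 'col_a' is dominated by one value, 'col_b' is not
toy = pd.DataFrame({
    'col_a': ['x', 'x', 'x', 'y'],
    'col_b': ['p', 'q', 'r', 's'],
})
shares = dominant_share(toy)
print(shares)
```

A share close to 1 suggests the column carries little information and is a candidate for dropping.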

In [4]:
# Drop columns 'seller', 'offer_type', 'abtest',
# and 'nr_of_pictures'
autos = autos.drop(
    ['seller', 'offer_type', 'abtest', 'nr_of_pictures'],
    axis=1)

In [5]:
# Explore data in the column 'name'
print(autos['name'].value_counts())

Ford_Fiesta                                                      78
BMW_316i                                                         75
Volkswagen_Golf_1.4                                              75
Volkswagen_Polo                                                  72
BMW_318i                                                         72
BMW_320i                                                         71
Opel_Corsa                                                       71
Renault_Twingo                                                   70
Volkswagen_Golf                                                  57
Opel_Corsa_1.2_16V                                               56
BMW_116i                                                         53
Opel_Corsa_B                                                     52
Peugeot_206                                                      52
Ford_Focus                                                       50
Volkswagen_Polo_1.2                             

We can see from the data that in most cases the second word of the name is the car model. Let's extract it and store it in the 'model' column, overwriting the original values.

In [6]:
# Extract models from name column
autos['model'] = autos['name'].str.split('_', expand=True).iloc[:, 1]
print(autos[['brand','model']].head(20))

             brand     model
0          peugeot       807
1              bmw      740i
2       volkswagen      Golf
3            smart     smart
4             ford     Focus
5         chrysler     Grand
6       volkswagen      Golf
7       volkswagen        IV
8             seat     Arosa
9          renault    Megane
10      volkswagen      Golf
11   mercedes_benz      A140
12           smart     smart
13            audi        A3
14         renault      Clio
15  sonstige_autos        C3
16            opel    Vectra
17      volkswagen  Scirocco
18             bmw      mein
19           mazda   tribute


The extraction is not perfect (rows 5, 7, and 18 yield 'Grand', 'IV', and 'mein'), but the data can now be aggregated by brand and model, which will be done later.

In [7]:
# Clean data in column 'price' and print some statistics.
# regex=False treats '$' as a literal character, not a regex anchor.
autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(float))

print(autos['price'].unique().shape[0])
print(autos['price'].describe())

2357
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


The column 'price' contains 2357 unique values. There are unusually small values (0) and unusually large values (up to 1e+08). Based on the descriptive statistics, the distribution is skewed to the right with some outliers. Let's define a function that removes outliers.
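
The right skew can also be checked numerically: for a right-skewed column the mean sits well above the median and the sample skewness is positive. The sketch below uses a handful of made-up prices, chosen only to imitate a column with one extreme value.

```python
import pandas as pd

# Made-up prices with one extreme value, for illustration only
prices = pd.Series([0, 1100, 2950, 7200, 100000000], dtype=float)

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # sample skewness; positive means a long right tail

print(mean_price > median_price)
print(skewness > 0)
```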

In [8]:
# Definition of function 'remove_outliers': keep only rows whose value
# lies within three standard deviations of the column mean
def remove_outliers(dataset, column):
    return dataset[
        np.abs(dataset[column] - dataset[column].mean()
              ) <= 3*dataset[column].std()]

In [9]:
# Remove rows with outliers based on column 'price'
autos = remove_outliers(autos, 'price')
print(autos['price'].unique().shape[0])
print(autos['price'].describe())

2351
count    4.999100e+04
mean     5.831664e+03
std      1.427337e+04
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.300000e+06
Name: price, dtype: float64


In [10]:
# Clean data in column 'odometer', rename the column
autos['odometer'] = (autos['odometer']
                    .str.replace('km', '')
                    .str.replace(',', '')
                    .astype(float))
autos.rename({'odometer': 'odometer_km'},
             axis=1,
             inplace=True)
print(autos['odometer_km'].unique().shape[0])
print(autos['odometer_km'].describe())

13
count     49991.000000
mean     125736.432558
std       40038.005358
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


The column 'odometer_km' contains 13 unique values, and none of them is unusual. The distribution is skewed to the left: most of the data lies between 125000 and 150000. Now let's translate some German words into English.

In [11]:
# Some words in column 'vehicle_type' are in German. Let's
# replace them with English ones.
mapping = {'limousine': 'limousine',
           'kleinwagen': 'supermini',
           'kombi': 'station wagon',
           'bus': 'bus',
           'cabrio': 'cabrio',
           'coupe': 'coupe',
           'suv': 'suv',
           'andere': 'other'}
autos['vehicle_type'] = autos['vehicle_type'].map(mapping)
print(autos['vehicle_type'].value_counts())

limousine        12855
supermini        10822
station wagon     9127
bus               4093
cabrio            3061
coupe             2535
suv               1986
other              420
Name: vehicle_type, dtype: int64


In [12]:
# 'unrepaired_damage' contains German words.
# Replace them with English. 
mapping = {'ja': 'yes',
           'nein': 'no'}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(
    mapping)
print(autos['unrepaired_damage'].value_counts())

no     35227
yes     4939
Name: unrepaired_damage, dtype: int64


In [13]:
# Explore data in registration year column
print(autos['registration_year'].describe())

count    49991.000000
mean      2005.074533
std        105.721987
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64


Half of the data falls between 1999 and 2008. However, there are values such as 1000 and 9999, which are not valid registration years. These outliers should be dropped, so we will use the function 'remove_outliers'.

In [14]:
# Remove rows with outliers from dataset based on 
# 'registration_year' column
autos = remove_outliers(autos, 'registration_year')
print(autos['registration_year'].describe())

count    49969.000000
mean      2003.359643
std          7.796921
min       1800.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2019.000000
Name: registration_year, dtype: float64


Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate and should also be removed. Production of petrol-engined automobiles started in 1886. So, we can drop all years before 1886 and after 2016.

In [15]:
# Drop rows with years before 1886 and after 2016
autos = autos[autos['registration_year']
              .between(1886, 2016)]

# Calculate distribution of the remaining values
print(autos['registration_year']
      .value_counts(normalize=True, dropna=False)
      .sort_index())

1910    0.000187
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
1937    0.000083
1938    0.000021
1939    0.000021
1941    0.000042
1943    0.000021
1948    0.000021
1950    0.000062
1951    0.000042
1952    0.000021
1953    0.000021
1954    0.000042
1955    0.000042
1956    0.000104
1957    0.000042
1958    0.000083
1959    0.000146
1960    0.000687
1961    0.000125
1962    0.000083
1963    0.000187
1964    0.000250
1965    0.000354
1966    0.000458
1967    0.000562
1968    0.000541
          ...   
1987    0.001562
1988    0.002957
1989    0.003769
1990    0.008226
1991    0.007413
1992    0.008142
1993    0.009267
1994    0.013744
1995    0.027321
1996    0.030070
1997    0.042232
1998    0.051082
1999    0.062452
2000    0.069844
2001    0.056267
2002    0.052748
2003    0.056788
2004    0.056996
2005    0.062785
2006    0.056371
2007    0.047979
2008    0.046459
2009    0.043689
2010    0.033256
2011    0.034027
2012    0.027550
2013    0.016784
2014    0.0138

We can see that the majority of the cars were registered between 1994 and 2016. The older registration years are nevertheless kept for completeness of the analysis. The next column to examine is "power_ps".

In [16]:
# Print descriptive statistics of the column and count
# unique values
print(autos['power_ps'].describe())
print(autos['power_ps'].value_counts().head(10))

count    48021.000000
mean       117.052019
std        195.135607
min          0.000000
25%         71.000000
50%        107.000000
75%        150.000000
max      17700.000000
Name: power_ps, dtype: float64
0      4988
75     3004
60     2084
150    1985
140    1823
101    1689
90     1671
116    1613
170    1460
105    1354
Name: power_ps, dtype: int64


There are some unusually large and unusually small values. Let's remove the outliers using the remove_outliers function. Because the number of zeros is large, we will replace them with the mean of the remaining values rather than drop them. The car with the smallest horsepower I could find is the Peel P50 with 4.2 PS, so we will also replace every value smaller than 4 with that mean.
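
The replacement step can be illustrated on a few made-up values. This sketch uses `Series.mask`, which is a different call from the `.loc` assignment in the notebook but has the same effect here; the toy numbers are assumptions.

```python
import pandas as pd

# Made-up power values; 0 and 2 are implausibly small
power = pd.Series([0.0, 75.0, 150.0, 2.0, 105.0])

threshold = 4  # same cut-off as in the code cell that follows
# Mean of the plausible values only: (75 + 150 + 105) / 3
valid_mean = power[power >= threshold].mean()

# Replace every value below the threshold with that mean
cleaned = power.mask(power < threshold, valid_mean)
print(cleaned.tolist())
```

Computing the mean from the plausible values only keeps the zeros from dragging the replacement value down.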

In [17]:
# Remove rows with outliers based on column 'power_ps', then replace zeros
# and other values below 4 with the mean of the remaining cells.
autos = remove_outliers(autos, 'power_ps')

autos.loc[autos['power_ps']  < 4, 'power_ps'] = autos.loc[
    autos['power_ps'] >= 4, 'power_ps'].mean()

print(autos['power_ps'].describe())
print(autos['power_ps'].value_counts().head(10))

count    47964.000000
mean       126.476820
std         58.983982
min          4.000000
25%         87.000000
50%        122.000000
75%        150.000000
max        696.000000
Name: power_ps, dtype: float64
126.47682    4996
75.00000     3004
60.00000     2084
150.00000    1985
140.00000    1823
101.00000    1689
90.00000     1671
116.00000    1613
170.00000    1460
105.00000    1354
Name: power_ps, dtype: int64


Now let's look at the data in columns 'date_crawled', 'ad_created', and 'last_seen'.

In [18]:
# Print example data from columns
print(autos[['date_crawled',
             'ad_created',
             'last_seen']].head())

          date_crawled           ad_created            last_seen
0  2016-03-26 17:47:46  2016-03-26 00:00:00  2016-04-06 06:45:54
1  2016-04-04 13:38:56  2016-04-04 00:00:00  2016-04-06 14:45:08
2  2016-03-26 18:57:24  2016-03-26 00:00:00  2016-04-06 20:15:37
3  2016-03-12 16:58:10  2016-03-12 00:00:00  2016-03-15 03:16:28
4  2016-04-01 14:38:50  2016-04-01 00:00:00  2016-04-01 14:38:50


We can extract the year and month from the columns 'date_crawled', 'ad_created', and 'last_seen' using the string slice str[:7].
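
String slicing works because the timestamps share a fixed `YYYY-MM-DD HH:MM:SS` layout. An alternative, which is not what the notebook does, is to parse the strings with `pd.to_datetime` and count by month. A minimal sketch, using four of the 'date_crawled' values printed above:

```python
import pandas as pd

# Four 'date_crawled' values copied from the sample rows above
stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-12 16:58:10',
    '2016-04-01 14:38:50',
])

# Parse into real timestamps, then format back to year-month for counting
parsed = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S')
by_month = (parsed.dt.strftime('%Y-%m')
            .value_counts(normalize=True)
            .sort_index())
print(by_month)
```

Parsing costs a little more than slicing, but it fails loudly on malformed timestamps and makes date arithmetic possible later.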

In [19]:
print(autos['date_crawled'].str[:7].value_counts(
    normalize=True, dropna=False).sort_index())

2016-03    0.837566
2016-04    0.162434
Name: date_crawled, dtype: float64


About 83.8% of the listings were crawled in March 2016.

In [20]:
print(autos['ad_created'].str[:7].value_counts(
    normalize=True, dropna=False).sort_index())

2015-06    0.000021
2015-08    0.000021
2015-09    0.000021
2015-11    0.000021
2015-12    0.000042
2016-01    0.000188
2016-02    0.001313
2016-03    0.836982
2016-04    0.161392
Name: ad_created, dtype: float64


About 83.7% of the ads were created in March 2016.

In [21]:
print(autos['last_seen'].str[:7].value_counts(
    normalize=True, dropna=False).sort_index())

2016-03    0.423443
2016-04    0.576557
Name: last_seen, dtype: float64


Although most of the ads were created in March 2016, they are still of interest to potential buyers: more than half of them were last seen in April. Now let's proceed with the analysis.

In [22]:
# Examining brand column
top_all = autos['brand'].value_counts(
    normalize=True,
    dropna=False)
print(top_all)

volkswagen        0.212180
bmw               0.110103
opel              0.108102
mercedes_benz     0.095342
audi              0.086398
ford              0.069698
renault           0.047369
peugeot           0.029522
fiat              0.025894
seat              0.018139
skoda             0.016054
mazda             0.015157
nissan            0.015116
smart             0.013906
citroen           0.013885
toyota            0.012447
sonstige_autos    0.010925
hyundai           0.009862
volvo             0.009236
mini              0.008652
mitsubishi        0.008152
honda             0.007839
kia               0.007109
alfa_romeo        0.006630
porsche           0.006109
suzuki            0.005900
chevrolet         0.005713
chrysler          0.003649
daihatsu          0.002564
dacia             0.002564
jeep              0.002252
subaru            0.002189
land_rover        0.002043
saab              0.001605
jaguar            0.001585
trabant           0.001564
daewoo            0.001501
r

The data will be aggregated for the top ten brands.

In [23]:
# Find the mean price for each of top 10 brands
price_agg = {}
for brand in top_all.index[:10]: 
    price_agg[brand] = autos.loc[
        autos['brand'] == brand, 'price'].mean()
print(sorted(price_agg.items(),
             key=lambda x: x[1], reverse=True))

[('audi', 9083.736486486487), ('mercedes_benz', 8492.077848239667), ('bmw', 8336.321908729407), ('volkswagen', 5427.8745209786775), ('seat', 4308.93908045977), ('ford', 3953.517200119653), ('peugeot', 3039.313559322034), ('opel', 2874.969720347155), ('fiat', 2711.8011272141707), ('renault', 2396.3147007042253)]


There is a significant price gap among the top 10 brands. 'audi', 'mercedes_benz', and 'bmw' are the most expensive. 'volkswagen', 'seat', and 'ford' are less expensive. 'peugeot', 'opel', 'fiat', and 'renault' are the cheapest. Next, we will check whether the price of a brand depends on mileage.

In [24]:
# Find the mean mileage for each of top 10 brands
mileage_agg = {}
for brand in top_all.index[:10]:
    mileage_agg[brand] = autos.loc[
        autos['brand'] == brand, 'odometer_km'
    ].mean()
print(sorted(mileage_agg.items(),
            key=lambda x: x[1],
            reverse=True))

[('bmw', 132424.73016474152), ('mercedes_benz', 130848.45834244478), ('audi', 129306.22586872587), ('opel', 129273.86692381871), ('volkswagen', 128739.31413972683), ('renault', 128228.43309859154), ('peugeot', 127104.51977401129), ('ford', 124074.18486389471), ('seat', 121701.14942528735), ('fiat', 116553.94524959743)]


In [25]:
# Convert both dictionaries to Series objects
price_series = pd.Series(price_agg)
mileage_series = pd.Series(mileage_agg)

In [26]:
price_mileage = pd.DataFrame({'price': price_series,
                 'odometer_km': mileage_series})
print(price_mileage.sort_values(by=['odometer_km']))

                 odometer_km        price
fiat           116553.945250  2711.801127
seat           121701.149425  4308.939080
ford           124074.184864  3953.517200
peugeot        127104.519774  3039.313559
renault        128228.433099  2396.314701
volkswagen     128739.314140  5427.874521
opel           129273.866924  2874.969720
audi           129306.225869  9083.736486
mercedes_benz  130848.458342  8492.077848
bmw            132424.730165  8336.321909


The aggregate data does not show the expected relationship between price and mileage. Normally the relationship should be negative (the higher the mileage, the lower the price). Here, however, the brands with the highest mean mileage (audi, mercedes_benz, bmw) are also the most expensive. At this level of aggregation the price appears to depend on the brand rather than on mileage.
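
The direction of the brand-level relationship can be summarised with a correlation coefficient. The sketch below recomputes it from the ten brand means in the table above (rounded to whole numbers). With only ten aggregated points it is a rough indication, not a test of the relationship between price and mileage for individual cars.

```python
import pandas as pd

# Brand-level means copied (rounded) from the table above
price_mileage = pd.DataFrame(
    {'odometer_km': [116554, 121701, 124074, 127105, 128228,
                     128739, 129274, 129306, 130848, 132425],
     'price': [2712, 4309, 3954, 3039, 2396,
               5428, 2875, 9084, 8492, 8336]},
    index=['fiat', 'seat', 'ford', 'peugeot', 'renault',
           'volkswagen', 'opel', 'audi', 'mercedes_benz', 'bmw'])

# Pearson correlation between mean mileage and mean price across brands
corr = price_mileage['odometer_km'].corr(price_mileage['price'])
print(round(corr, 2))
```

A positive coefficient means that, across brands, higher mean mileage goes together with a higher mean price, which points to brand as the dominant factor at this level.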

In [27]:
# Find the ten most common models and look up the brand of the
# first listing for each of them
top_brands = []
top_models =  autos['model'].value_counts().head(10).index
for model in top_models:
    brand = autos.loc[autos['model'] == model, 'brand']
    top_brands.append(brand.iloc[0])

top_brand_model = pd.DataFrame({'top_brands': pd.Series(top_brands),
                               'top_models': pd.Series(top_models)})
print(top_brand_model)

      top_brands top_models
0  mercedes_benz       Benz
1     volkswagen       Golf
2           opel      Corsa
3     volkswagen       Polo
4           opel      Astra
5     volkswagen     Passat
6           audi         A4
7           audi         A6
8           audi         A3
9           ford      Focus


We can see the top 10 brand/model combinations. The model 'Benz' for mercedes_benz is most likely an artifact of splitting the name on '_', so the data for mercedes_benz should be investigated more carefully in further analysis.
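
An alternative to looking up the brand for each frequent model is to count brand/model pairs directly with `groupby`. This is a sketch on made-up rows, not the approach used in the notebook; the `toy` frame and the `combos` name are assumptions.

```python
import pandas as pd

# Made-up listings for illustration
toy = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'volkswagen',
              'opel', 'opel', 'volkswagen'],
    'model': ['Golf', 'Golf', 'Golf', 'Corsa', 'Corsa', 'Polo'],
})

# Count each (brand, model) pair and sort from most to least common
combos = (toy.groupby(['brand', 'model'])
          .size()
          .sort_values(ascending=False))
print(combos.head(3))
```

Counting pairs avoids attributing a model name to the wrong brand when the same model string appears under several brands.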

We will explore the "odometer_km" column to see whether the price depends on mileage. We know that the minimum value is 5000 and the maximum value is 150000. We define the following groups to explore:
- 5000-74999 (group1);
- 75000-99999 (group2);
- 100000-124999 (group3);
- 125000-150000 (group4).
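
The same grouping can be expressed with `pd.cut`, which assigns each row to a bin in one step. The sketch below uses the boundaries from the `splits` dictionary in the next cell and made-up prices; it is an alternative formulation, not the notebook's code.

```python
import pandas as pd

# Made-up listings for illustration
toy = pd.DataFrame({
    'odometer_km': [5000, 70000, 80000, 100000, 125000, 150000],
    'price': [14000, 12000, 9000, 8000, 5000, 3000],
})

# pd.cut uses right-closed bins by default, so the edges below give
# (0, 74999], (74999, 99999], (99999, 124999], (124999, 150000].
# The lowest edge is 0 rather than 5000 so that 5000 itself is included.
bins = [0, 74999, 99999, 124999, 150000]
labels = ['group1', 'group2', 'group3', 'group4']
toy['group'] = pd.cut(toy['odometer_km'], bins=bins, labels=labels)

mean_by_group = toy.groupby('group', observed=True)['price'].mean()
print(mean_by_group)
```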

In [28]:
# Split 'odometer_km' into groups based on mileage and
# create a dictionary which contains average price for
# each group
splits = {'group1': [5000, 74999],
         'group2': [75000, 99999],
         'group3': [100000, 124999],
         'group4': [125000, 150000]
         }
price_groups = {}

for group in splits:
    price_groups[group] = autos.loc[
        autos['odometer_km'].between(
            splits[group][0],
            splits[group][1]),
        'price'
    ].mean()

print(price_groups)

{'group2': 8905.781877022653, 'group4': 4107.409825576543, 'group3': 7921.102564102564, 'group1': 13558.854859991203}


We can see from this data that mileage and price have a negative relationship: the higher the mileage, the lower the price. Our next step is to analyze the dependence of price on damage. From the previous steps, we know that there are two values in this column: 'yes' and 'no'. Let's see if the average price differs for each group.

In [29]:
damage_groups = {}
for group in ['yes', 'no']:
    damage_groups[group] = autos.loc[
        autos['unrepaired_damage'] == group, 'price'
    ].mean()
print(damage_groups)

{'yes': 2336.6798493408664, 'no': 7144.980216825926}


We can see from the data that damaged cars are cheaper on average than undamaged cars (about 2337 versus 7145).
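
When comparing group means it helps to see how many rows stand behind each one. The sketch below, on made-up rows, uses `groupby` with `agg` to return the count and the mean together; rows with a missing damage flag are left out by `groupby` by default.

```python
import numpy as np
import pandas as pd

# Made-up listings; one row has no damage information
toy = pd.DataFrame({
    'unrepaired_damage': ['yes', 'no', 'no', np.nan, 'yes'],
    'price': [2000.0, 7000.0, 8000.0, 5000.0, 3000.0],
})

# Count and mean price per damage group (NaN keys are dropped)
stats = toy.groupby('unrepaired_damage')['price'].agg(['count', 'mean'])
print(stats)
```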

Here is the outcome of our analysis:
- the price depends on the brand, mileage (negative relationship), and damage (negative relationship);
- the top ten brand/model combinations are presented.