The aim of this project is to demonstrate data cleaning and analysis skills using the Python pandas package. The dataset I will be using contains eBay car sales data.

In [1]:
import pandas as pd
import numpy as np
import re

df = pd.read_csv("autos.csv", encoding = 'Windows-1252')

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [3]:
df.head()

,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


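After some basic exploration, we can see that the dataset has 20 columns: five are integers and the rest are strings (object dtype). The price and odometer columns are stored as strings such as "$5,000" and "150,000km" even though they hold numeric quantities, so they will need cleaning. The column names use camel case rather than the snake case that is conventional in Python. Several columns contain null values, but none is more than about 20% null (the sparsest, notRepairedDamage, has 40,171 of 50,000 values present).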
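The share of nulls per column can be computed directly instead of being read off the info() output. This is a minimal sketch on a made-up toy frame (the `sample` values are assumptions for illustration, not rows from autos.csv):

```python
import pandas as pd

# Toy frame standing in for the listings; the values are invented for illustration
sample = pd.DataFrame({
    'brand': ['bmw', 'ford', 'opel', 'audi', 'seat'],
    'gearbox': ['manuell', None, 'automatik', 'manuell', 'manuell'],
    'notRepairedDamage': ['nein', None, None, 'ja', 'nein'],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Running the same two method calls on the full dataframe would list the columns from most to least sparse.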

In [4]:
df.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
def cleaned_cols(col):
    col = col.replace('yearOfRegistration','registration_year')
    col = col.replace('monthOfRegistration','registration_month')
    col = col.replace('notRepairedDamage','unrepaired_damage')
    col = col.replace('dateCreated','ad_created')
    col = re.sub( '(?<!^)(?=[A-Z])', '_', col ).lower()
    return col

df.columns = [cleaned_cols(c) for c in df.columns]

df.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
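Note that the regex above inserts an underscore before every capital letter, which is why powerPS becomes power_p_s. A possible alternative, sketched here with a hypothetical helper named `camel_to_snake`, only splits where a lowercase letter or digit is followed by a capital, so runs of capitals stay together:

```python
import re

def camel_to_snake(name):
    # Insert "_" only between a lowercase letter (or digit) and a capital letter,
    # so an acronym such as "PS" is kept as one word
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

for col in ['powerPS', 'yearOfRegistration', 'nrOfPictures', 'abtest']:
    print(col, '->', camel_to_snake(col))
```

The rest of this notebook keeps the power_p_s name produced by the original function.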

In [6]:
df.describe()

,registration_year,power_p_s,registration_month,nr_of_pictures,postal_code
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,2005.07328,116.35592,5.72336,0.0,50813.6273
std,105.712813,209.216627,3.711984,0.0,25779.747957
min,1000.0,0.0,0.0,0.0,1067.0
25%,1999.0,70.0,3.0,0.0,30451.0
50%,2003.0,105.0,6.0,0.0,49577.0
75%,2008.0,150.0,9.0,0.0,71540.0
max,9999.0,17700.0,12.0,0.0,99998.0


In [7]:
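# Strip the currency symbol and thousands separators, then convert to numbers.
# regex=False treats '$' as a literal character rather than a regex anchor.
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].astype(float)
df['odometer'] = df['odometer'].str.replace('km', '', regex=False)
df['odometer'] = df['odometer'].str.replace(',', '', regex=False)
df['odometer'] = df['odometer'].astype(float)
# Keep the unit in the column name now that it is gone from the values
df.rename(columns={'odometer':'odometer_km'}, inplace=True)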

Let's continue exploring the data, specifically looking for values that don't look right. We'll analyze the odometer_km and price columns, starting with odometer_km:

In [8]:
df['odometer_km'].unique().shape

(13,)

In [9]:
df['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [10]:
df['odometer_km'].value_counts().head().sort_index(ascending = False)

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
Name: odometer_km, dtype: int64

And now price:

In [11]:
df['price'].unique().shape

(2357,)

In [12]:
df['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [13]:
df['price'].value_counts().head().sort_index(ascending = False)

2500.0     643
1500.0     734
1200.0     639
500.0      781
0.0       1421
Name: price, dtype: int64
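The price summary shows a minimum of 0 and a maximum of 1.0e+08, and 1,421 listings have a price of 0. One way to set such values aside is a range filter with Series.between(). This is only a sketch: the bounds `LOW` and `HIGH` are hypothetical thresholds chosen for illustration, and the toy `prices` series is made up rather than taken from the dataset.

```python
import pandas as pd

# Toy prices echoing the extremes seen above (0 and 1e8); not real rows
prices = pd.Series([0.0, 500.0, 1500.0, 2500.0, 100000000.0], name='price')

# Hypothetical bounds, chosen only for illustration
LOW, HIGH = 1, 350_000

# between() is inclusive on both ends by default
plausible = prices[prices.between(LOW, HIGH)]
print(plausible.describe())
```

Whatever bounds are chosen, it is worth inspecting the rows near each end (for example with value_counts().sort_index()) before discarding anything.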

There are 5 columns that should represent date values. Some were created by the crawler and some came from the website itself.

Right now, pandas identifies the date_crawled, last_seen, and ad_created columns as strings, so we need to convert them into a numerical representation before we can understand them quantitatively. The other two columns, registration_year and registration_month, are already numeric, so we can use methods like Series.describe() to understand their distribution without any extra processing.

In [14]:
#checking format of string date columns

df[['date_crawled','ad_created','last_seen']][0:5]

,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50
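Since every timestamp starts with a 10-character YYYY-MM-DD date, one way to look at these columns quantitatively is to slice out the date and compute the relative frequency of each day. The sketch below uses only the five date_crawled values shown above; the variable names are illustrative.

```python
import pandas as pd

# The five date_crawled values from the preview above
date_crawled = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
    '2016-03-12 16:58:10',
    '2016-04-01 14:38:50',
], name='date_crawled')

# The first 10 characters of each timestamp are the YYYY-MM-DD date
crawl_dates = date_crawled.str[:10]

# normalize=True returns shares instead of counts; sort_index() orders by date
date_share = crawl_dates.value_counts(normalize=True, dropna=False).sort_index()
print(date_share)
```

The same pattern would apply to ad_created and last_seen. Converting the columns with pd.to_datetime is another option when date arithmetic is needed.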


Earlier we saw that the registration_year column has values ranging from 1000 to 9999. Some of these are clearly invalid: cars were not invented until the 1880s, and a car cannot have been registered after 2019. I will replace every value before 1900 or after 2019 with NaN. (Since the listings were crawled in 2016, an upper bound of 2016 would be a stricter choice.)

In [15]:
df.loc[~ df['registration_year'].between(1900, 2019), 'registration_year'].shape

(24,)

In [16]:
df.loc[~ df['registration_year'].between(1900, 2019), 'registration_year'] = np.nan

When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column. Here's what the process looks like:

* Identify the unique values we want to aggregate by
* Create an empty dictionary to store our aggregate data
* Loop over the unique values, and for each:
    * Subset the dataframe by the unique value
    * Calculate the mean of whichever column we're interested in
    * Assign the value/mean pair to the dictionary as a key/value pair

In [17]:
df['brand'].unique()

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [18]:
df['brand'].value_counts().sort_values(ascending = False)

volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
citroen             701
smart               701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64

In [19]:
mean_prices = {}
brands = df['brand'].unique()
for brand in brands:
    mean_prices[brand] = df.loc[df['brand'] == brand,'price'].mean()

In [20]:
mean_prices

{'alfa_romeo': 3943.562310030395,
 'audi': 8965.560354891431,
 'bmw': 8252.918953766808,
 'chevrolet': 6432.929328621908,
 'chrysler': 3286.0552486187844,
 'citroen': 42657.46362339515,
 'dacia': 5897.736434108527,
 'daewoo': 1038.3544303797469,
 'daihatsu': 1552.09375,
 'fiat': 12134.20642201835,
 'ford': 7105.662546708824,
 'honda': 3889.8596491228072,
 'hyundai': 5316.754098360656,
 'jaguar': 11076.506493506493,
 'jeep': 11363.209090909091,
 'kia': 5707.3258426966295,
 'lada': 2476.9032258064517,
 'lancia': 3070.3508771929824,
 'land_rover': 18934.272727272728,
 'mazda': 3962.542932628798,
 'mercedes_benz': 29511.955428812842,
 'mini': 10392.393867924528,
 'mitsubishi': 3314.0492610837437,
 'nissan': 4588.879310344828,
 'opel': 5106.092657022524,
 'peugeot': 3010.8688186813188,
 'porsche': 44537.97959183674,
 'renault': 2351.301996672213,
 'rover': 1494.5217391304348,
 'saab': 3143.7,
 'seat': 4219.431455897981,
 'skoda': 6305.044529262086,
 'smart': 3482.971469329529,
 'sonstige_au

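Because the dictionary is printed in alphabetical order, its first and last entries are not the extremes. Among the brands visible in the truncated output above, porsche has the highest mean price (about 44,538) and daewoo the lowest (about 1,038). Some means, such as citroen's 42,657, are probably inflated by the extreme prices seen earlier (the maximum price is 1.0e+08). I will now create a new dataframe that stores aggregate values like this for price.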

In [23]:
bmp_series = pd.Series(mean_prices)
# Use a new name so the cleaned listings in df are not overwritten
brand_df = pd.DataFrame(bmp_series, columns=['mean_price'])
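The loop-and-dictionary aggregation above can also be sketched with groupby, which computes the per-brand mean in one step and makes it easy to sort the result so the highest and lowest brands are readable at a glance. The `listings` frame below is made-up toy data, and the names `listings` and `brand_mean_price` are illustrative assumptions.

```python
import pandas as pd

# Toy listings; brands are real column values but the prices are invented
listings = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'ford', 'ford', 'opel'],
    'price': [8000.0, 9000.0, 1000.0, 3000.0, 5000.0],
})

# Group by brand, average the price, sort descending, and name the column
brand_mean_price = (
    listings.groupby('brand')['price']
    .mean()
    .sort_values(ascending=False)
    .to_frame(name='mean_price')
)
print(brand_mean_price)
```

With the result sorted, the first and last rows are the brands with the highest and lowest mean price.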