In [1]:
%%html
<style>
table {align:left;display:block}  /* align HTML tables to the left */
</style>

# Dataquest - Python for Data Science: Intermediate <br/> <br/> Project Title: Exploring eBay Car Sales Data

## Introduction
#### Metadata
Dataset original source: [Link](https://data.world/data-society/used-cars-data)

However, for the purposes of this analysis, we use the modified autos.csv file provided by Dataquest instead (found in the same folder as this Jupyter Notebook).

Pre-treated dataset by Dataquest:
- Sampled 50,000 data points from the full dataset, to ensure code runs quickly in the learning environment
- Dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with; note that the original dataset is no longer available on Kaggle)


| Column | Description |
| --- | --- |
| dateCrawled | When this ad was first crawled. All field-values are taken from this date |
| name | Name of the car |
| seller | Whether the seller is private or a dealer |
| offerType | The type of listing |
| price | The price on the ad to sell the car |
| abtest | Whether the listing is included in an A/B test |
| vehicleType | The vehicle Type |
| yearOfRegistration | The year in which the car was first registered |
| gearbox | The transmission type |
| powerPS | The power of the car in PS |
| model | The car model name |
| odometer | How many kilometers the car has driven |
| monthOfRegistration | The month in which the car was first registered |
| fuelType | What type of fuel the car uses |
| brand | The brand of the car |
| notRepairedDamage | If the car has a damage which is not yet repaired |
| dateCreated | The date on which the eBay listing was created |
| nrOfPictures | The number of pictures in the ad |
| postalCode | The postal code for the location of the vehicle |
| lastSeen | When the crawler last saw this ad online |


## Ask: Background and questions

- The aim of this project is to clean the data and analyze the included used car listings. 
- We'll also become familiar with some of the unique benefits Jupyter Notebook provides for working with pandas.

## Prepare: Load, open and explore datasets

In [2]:
# Import the pandas and NumPy libraries
import numpy as np
import pandas as pd

# Read the CSV into a dataframe with the pandas read_csv function, and assign it to a variable
autos = pd.read_csv('autos.csv', delimiter=',')

In [3]:
# check if import is okay
# A neat feature of jupyter notebook
# is its ability to render the first few and
# last few values of any pandas object.
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [4]:
# get more brief info about dataframe loaded
# including datatypes, number of non-null items, 
# no. of columns
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [5]:
# Re-look at first few rows
autos.head(3)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


### Preliminary Findings:
- Dataset: 50,000 rows, 20 columns
- Columns with null items:
  - vehicleType, gearbox, model, fuelType, notRepairedDamage
- Columns with datatype subject to conversion:
  - DateTime: dateCrawled, dateCreated, lastSeen
  - Int/Float: price, odometer
- Column names subject to rename after later datatype change:
  - odometer: odometer_km
  - price: price_usd
- Column names use camelCase instead of Python's preferred snake_case - subject to rename
- Other column names to consider renaming for readability
  - yearOfRegistration to registration_year
  - monthOfRegistration to registration_month
  - notRepairedDamage to unrepaired_damage
  - dateCreated to ad_created
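
The camelCase to snake_case step can also be done mechanically. The sketch below is one possible approach, not the method used in this notebook: `camel_to_snake` is a hypothetical helper, and it only handles the naming patterns seen in this dataset. It does not produce the hand-picked names listed above (for example `registration_year`), so those would still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Insert '_' wherever a lowercase letter or digit is followed by a capital, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column names from autos.csv
sample_cols = ['dateCrawled', 'offerType', 'yearOfRegistration', 'powerPS']
converted = [camel_to_snake(c) for c in sample_cols]
print(converted)
```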

## Process: Explore and clean dataset

### Initial Exploration and Cleaning

In [6]:
# view existing columns and to ease copy and paste when modifying
autos.columns

# modified new list of columns from autos.columns for readability
new_cols = ['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

# assign modified column names back to dataframe
autos.columns = new_cols

# check if transformation is okay
autos.head(1)

Unnamed: 0,date_crawled,name,seller,offer_type,price_usd,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54


#### Do some basic data exploration to determine what other cleaning tasks need to be done:
- Text columns where all or almost all values are the same. These can often be dropped, as they carry no useful information for analysis
- Numeric data stored as text, which can be cleaned and converted (e.g. see the preliminary findings above)
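
As a rough illustration of the first point, the share of rows taken by the most frequent value can be computed per column. This is a sketch on a made-up miniature frame (`toy`), and the 75% threshold is an arbitrary assumption chosen for the example, not a rule from this project.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `autos`
toy = pd.DataFrame({
    'seller': ['privat', 'privat', 'privat', 'gewerblich'],
    'offer_type': ['Angebot', 'Angebot', 'Angebot', 'Angebot'],
    'brand': ['bmw', 'ford', 'audi', 'opel'],
})

# Share of rows taken by the most frequent value in each column
top_share = toy.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Assumed threshold: flag columns where one value covers at least 75% of rows
near_constant = top_share[top_share >= 0.75].index.tolist()
print(near_constant)
```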

In [7]:
# describe statistical summary
# include='all' to get both categorical and numeric columns
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price_usd,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-19 17:36:18,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


#### Findings:
- Values do not seem meaningful (nearly every observation has the same value)
  - seller, offer_type
- Columns with datatype subject to conversion (per the preliminary findings above)
  - DateTime: date_crawled, ad_created, last_seen
  - Int/Float: price_usd, odometer_km
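
For the date columns, one option is `pd.to_datetime`. The sketch below runs on three sample timestamps copied from the preview above rather than on the full dataframe, and assumes every value follows the `YYYY-MM-DD HH:MM:SS` pattern.

```python
import pandas as pd

# Three sample values in the same format as date_crawled
sample = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
])

# Parse the strings into datetime values (format assumed from the preview)
parsed = pd.to_datetime(sample, format='%Y-%m-%d %H:%M:%S')

# Drop the time of day, then count listings per calendar day
per_day = parsed.dt.normalize().value_counts().sort_index()
print(per_day)
```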

### Exploring the Odometer and Price Columns

In [8]:
# Convert datatypes to numeric
# Remove any non-numeric characters

# start with price attribute
autos['price_usd'] = autos['price_usd'].str.replace('$', '', regex=False)  # regex=False so that '$' is treated as string
autos['price_usd'] = autos['price_usd'].str.replace(',', '')
autos['price_usd'] = autos['price_usd'].astype(int)

# check transformation, and further data exploration
autos['price_usd'].value_counts()

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
6770        1
61999       1
20987       1
6578        1
48600       1
Name: price_usd, Length: 2357, dtype: int64

In [9]:
# next work on odometer
autos['odometer_km'] = autos['odometer_km'].str.replace('km', '')
autos['odometer_km'] = autos['odometer_km'].str.replace(',', '')
autos['odometer_km'] = autos['odometer_km'].astype(int)

# check transformation, and further data exploration
# dtype has become 'int'
autos['odometer_km'].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In [10]:
# drop meaningless columns identified (ie. seller, offer_type)
drop_cols = ['seller', 'offer_type']
autos.drop(labels=drop_cols, axis=1, inplace=True)

# check transformation
autos.columns

Index(['date_crawled', 'name', 'price_usd', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'],
      dtype='object')

In [11]:
# Consider removing outliers after reviewing data:
# Series.unique().shape to see how many unique values
# Series.describe() to view min/max/median/mean etc
# Series.value_counts()
# Series.sort_index()/.sort_values() to view extreme range of values (head if needed)

# Review odometer_km
autos['odometer_km'].unique()

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])

In [12]:
autos['odometer_km'].unique().shape

(13,)

In [13]:
autos['odometer_km'].value_counts(ascending=True).head(1500)

10000       264
20000       784
30000       789
40000       819
5000        967
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

In [14]:
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [15]:
autos['odometer_km'].value_counts().sort_index(ascending=True)

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

In [16]:
# Review price_usd
autos['price_usd'].unique()

array([ 5000,  8500,  8990, ...,   385, 22200, 16995])

In [17]:
autos['price_usd'].unique().shape

(2357,)

In [18]:
autos['price_usd'].value_counts()

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
6770        1
61999       1
20987       1
6578        1
48600       1
Name: price_usd, Length: 2357, dtype: int64

In [19]:
autos['price_usd'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

In [20]:
autos[autos['price_usd'] == 0].head(2)

Unnamed: 0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
27,2016-03-27 18:45:01,Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE,0,control,,2005,,0,,150000,0,,ford,,2016-03-27 00:00:00,0,66701,2016-03-27 18:45:01
71,2016-03-28 19:39:35,Suche_Opel_Astra_F__Corsa_oder_Kadett_E_mit_Re...,0,control,,1990,manuell,0,,5000,0,benzin,opel,,2016-03-28 00:00:00,0,4552,2016-04-07 01:45:48


In [21]:
autos['price_usd'].value_counts().sort_index(ascending=False).head(50)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
194000      1
190000      1
180000      1
175000      1
169999      1
169000      1
163991      1
163500      1
155000      1
151990      1
145000      1
139997      1
137999      1
135000      1
130000      1
129000      1
128000      1
120000      2
119900      1
119500      1
116000      1
115991      1
115000      1
114400      1
109999      1
105000      2
104900      1
99900       2
99000       2
98500       1
Name: price_usd, dtype: int64

In [22]:
autos['price_usd'].value_counts().sort_index(ascending=True).head(50)

0      1421
1       156
2         3
3         1
5         2
8         1
9         1
10        7
11        2
12        3
13        2
14        1
15        2
17        3
18        1
20        4
25        5
29        1
30        7
35        1
40        6
45        4
47        1
49        4
50       49
55        2
59        1
60        9
65        5
66        1
70       10
75        5
79        1
80       15
89        1
90        5
99       19
100     134
110       3
111       2
115       2
117       1
120      39
122       1
125       8
129       1
130      15
135       1
139       1
140       9
Name: price_usd, dtype: int64

In [23]:
# Set prices of USD 100 or less to NaN
autos.loc[autos['price_usd'] <= 100, 'price_usd'] = np.nan

# Set the single USD 99,999,999 price to NaN
autos.loc[autos['price_usd'] == 99999999, 'price_usd'] = np.nan

# check if transformation is okay
autos['price_usd'].value_counts()

500.0      781
1500.0     734
2500.0     643
1000.0     639
1200.0     639
          ... 
27020.0      1
173.0        1
15390.0      1
52000.0      1
188.0        1
Name: price_usd, Length: 2318, dtype: int64

In [24]:
# relook at statistical summary after transformation
autos['price_usd'].describe()

count    4.810300e+04
mean     8.148740e+03
std      1.809206e+05
min      1.100000e+02
25%      1.250000e+03
50%      3.099000e+03
75%      7.500000e+03
max      2.732222e+07
Name: price_usd, dtype: float64

#### Findings:
- There are 1,421 observations with a price of USD 0.
- There is 1 observation with a price of USD 99,999,999.
- There are also a fair number of observations priced at USD 100 or less, which do not look like serious listings
  - As there is no easy way to fill in prices for these cars, and a value of 0 would skew any further statistical analysis, these values were replaced with null/NaN (performed above); dropping the rows would be an alternative
  
#### More context for data cleaning needed:
- The data cleaning criteria could be improved by finding out what a reasonable range of mileage in km (i.e. odometer) or price in USD is for used cars
  - E.g. cars listed at USD 999,990 and above may not be serious offers, or they may be legitimate listings of expensive cars by serious sellers
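
Once a plausible range has been agreed, it can be applied with `Series.between` and `Series.where`. The bounds below (`LOWER_USD`, `UPPER_USD`) are hypothetical placeholders for illustration, not values derived from this dataset, and the prices are made up.

```python
import pandas as pd

# Made-up prices, including implausibly low and high values
prices = pd.Series([0, 1, 500, 1500, 2950, 7200, 350000, 99999999], dtype=float)

# Hypothetical bounds, for illustration only
LOWER_USD, UPPER_USD = 100, 350_000

# Keep values inside the range (inclusive on both ends); everything else becomes NaN
cleaned = prices.where(prices.between(LOWER_USD, UPPER_USD))

print(cleaned.isna().sum())   # number of values set to NaN
print(cleaned.median())       # median of the remaining prices
```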

#### Some statistical observations (after the data cleaning above):
- Mean price of car listings: USD 8,149
- Mean odometer_km: 125,733 km

### Exploring the date columns (Reviewing datetime variables for data cleaning)

In [25]:
# Reviewing datetime variables for data cleaning

# review first few rows
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [26]:
# Transform date_crawled

# extract dates only
date_crawled = autos['date_crawled'].str[:10]

# check transformation
date_crawled.head()

0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
Name: date_crawled, dtype: object

In [27]:
# create date distribution and sort index for chronological order
# value_counts(normalize=True) to display percentages
# value_counts(dropna=False) to include missing values
date_crawled_distribution = date_crawled.value_counts(normalize=True, dropna=False).sort_index(ascending=True)
date_crawled_distribution

2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64

In [28]:
# Transform ad_created in similar ways as date_crawled

# extract dates only
ad_created = autos['ad_created'].str[:10]

# create date distribution and sort index for chronological order
ad_created_distribution = ad_created.value_counts(normalize=True, dropna=False).sort_index(ascending=True)
ad_created_distribution

2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...   
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64

In [29]:
# Transform last_seen in similar ways as date_crawled

# extract dates only
last_seen = autos['last_seen'].str[:10]

# create date distribution and sort index for chronological order
last_seen_distribution = last_seen.value_counts(normalize=True, dropna=False).sort_index(ascending=True)
last_seen_distribution

2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64

In [30]:
# reviewing statistical summary of registration year
autos['registration_year'].describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

#### Findings:
- date_crawled: Date range is 2016-03-05 to 2016-04-07
  - Counts are fairly evenly distributed throughout the observed period. This is reasonable: it is the date the ad was first crawled, and there is no material reason for the distribution to be otherwise.
- ad_created: Date range is 2015-06-11 to 2016-04-07
  - Very few of the car ads were created in 2015
- last_seen: Date range is 2016-03-05 to 2016-04-07
- registration_year:
  - The minimum value is 1000, before cars were invented
  - The maximum value is 9999, many years into the future
  - Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
  

### Dealing with incorrect Registration Year Data

In [31]:
# Let's count the number of listings with cars
# that fall outside the 1900 - 2016 interval
# and see if it's safe to remove those rows
# entirely, or if we need more custom logic.

# Count number of cars registered after year 2016
(autos['registration_year'] > 2016).value_counts()

False    48034
True      1966
Name: registration_year, dtype: int64

In [32]:
# Count number of cars registered before year 1900
(autos['registration_year'] < 1900).value_counts()

False    49994
True         6
Name: registration_year, dtype: int64

In [33]:
# build a boolean mask for rows that fall within the defined range
bool_1900_2016 = autos['registration_year'].between(1900, 2016, inclusive='both')

# check distribution after transformation
registration_year_distribution = autos[bool_1900_2016]['registration_year'].value_counts(normalize=True)
registration_year_distribution.sort_index(ascending=False)

2016    0.027401
2015    0.008308
2014    0.013867
2013    0.016782
2012    0.027546
          ...   
1934    0.000042
1931    0.000021
1929    0.000021
1927    0.000021
1910    0.000187
Name: registration_year, Length: 78, dtype: float64

In [34]:
registration_year_distribution.sort_values(ascending=False)

2000    0.069834
2005    0.062776
1999    0.062464
2004    0.056988
2003    0.056779
          ...   
1948    0.000021
1931    0.000021
1938    0.000021
1939    0.000021
1952    0.000021
Name: registration_year, Length: 78, dtype: float64

### Findings (registration_year):
- The distribution above excludes rows with a registration year before 1900 or after 2016, for the reasons described in the findings above. Note that the mask is only applied when computing this distribution; `autos` itself still contains those rows.
- The earliest registration year within the range is 1910, and the latest is 2016.
- The five most common registration years are 2000, 2005, 1999, 2004 and 2003. This is plausible: these cars were roughly 11 to 17 years old when the ads were crawled in 2016, a typical age for used-car listings.
- Listings for much older cars are also plausible, as they could be aimed at antique collectors. More context is needed for further analysis.
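
If the out-of-range rows should be removed from the dataframe itself rather than only from a derived view, one option is to index with the boolean mask and keep the result. A minimal sketch on a made-up frame (`toy`):

```python
import pandas as pd

# Made-up rows, two of them with implausible registration years
toy = pd.DataFrame({
    'brand': ['bmw', 'ford', 'opel', 'audi'],
    'registration_year': [1000, 1999, 2005, 9999],
})

# True for rows whose year lies in [1900, 2016]
in_range = toy['registration_year'].between(1900, 2016)

# Keep only those rows; .copy() avoids modifying a view of `toy` later on
filtered = toy[in_range].copy()
print(filtered)
```

In the notebook this would correspond to reassigning the result back to `autos`, which changes every statistic computed afterwards.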

### Exploring Price/Mileage by Brand (Aggregating data for analysis)


In [124]:
# generate list of unique brand names
brands = autos['brand'].unique()

mean_price_by_brand = {}  # empty dictionary
for b in brands:
    value = autos.loc[autos['brand'] == b, 'price_usd'].mean()
    mean_price_by_brand[b] = value

# sort dictionary by values in descending order
sorted_mean_price_by_brand = {}
sorted_keys = sorted(mean_price_by_brand, key=mean_price_by_brand.get, reverse=True)
for key in sorted_keys:
    sorted_mean_price_by_brand[key] = mean_price_by_brand[key]    

# print mean price by brand, sorted by descending order
template = "The mean price by brand - {} is {:,.0f}."  # put thousand comma separator, and round to 0 decimal place
for brand in sorted_mean_price_by_brand:
    print(template.format(brand, sorted_mean_price_by_brand[brand]))

The mean price by brand - porsche is 46,764.
The mean price by brand - sonstige_autos is 45,859.
The mean price by brand - citroen is 43,910.
The mean price by brand - volvo is 33,292.
The mean price by brand - land_rover is 18,934.
The mean price by brand - fiat is 12,747.
The mean price by brand - jaguar is 11,844.
The mean price by brand - jeep is 11,574.
The mean price by brand - mini is 10,592.
The mean price by brand - audi is 9,268.
The mean price by brand - mercedes_benz is 8,582.
The mean price by brand - bmw is 8,547.
The mean price by brand - ford is 7,409.
The mean price by brand - chevrolet is 6,717.
The mean price by brand - volkswagen is 6,660.
The mean price by brand - skoda is 6,402.
The mean price by brand - kia is 5,923.
The mean price by brand - dacia is 5,898.
The mean price by brand - hyundai is 5,416.
The mean price by brand - opel is 5,362.
The mean price by brand - toyota is 5,148.
The mean price by brand - nissan is 4,694.
The mean price by brand - seat is 4,3

In [126]:
# review the dictionary directly
sorted_mean_price_by_brand

{'porsche': 46764.2,
 'sonstige_autos': 45859.1600877193,
 'citroen': 43909.67841409692,
 'volvo': 33292.31954022989,
 'land_rover': 18934.272727272728,
 'fiat': 12747.031325301205,
 'jaguar': 11844.041666666666,
 'jeep': 11573.638888888889,
 'mini': 10591.985576923076,
 'audi': 9268.353608496258,
 'mercedes_benz': 8582.035011886752,
 'bmw': 8547.199923693246,
 'ford': 7409.285071942446,
 'chevrolet': 6716.929889298893,
 'volkswagen': 6659.7639593908625,
 'skoda': 6402.441860465116,
 'kia': 5923.288629737609,
 'dacia': 5897.736434108527,
 'hyundai': 5416.23382045929,
 'opel': 5361.629423076923,
 'toyota': 5148.0032733224225,
 'nissan': 4694.3744911804615,
 'seat': 4357.980241492865,
 'suzuki': 4210.185053380783,
 'mazda': 4097.042349726776,
 'subaru': 4068.3861386138615,
 'alfa_romeo': 4054.471875,
 'honda': 4041.0234375,
 'chrysler': 3539.9166666666665,
 'smart': 3538.344927536232,
 'mitsubishi': 3431.6530612244896,
 'lancia': 3240.703703703704,
 'saab': 3183.493670886076,
 'peugeot':

In [128]:
# generate mean mileage for each brand

mean_mileage_by_brand = {}  # empty dictionary
for b in brands:
    value = autos.loc[autos['brand'] == b, 'odometer_km'].mean()
    mean_mileage_by_brand[b] = value

# sort dictionary by values in descending order
sorted_mean_mileage_by_brand = {}
sorted_keys = sorted(mean_mileage_by_brand, key=mean_mileage_by_brand.get, reverse=True)
for key in sorted_keys:
    sorted_mean_mileage_by_brand[key] = mean_mileage_by_brand[key]    

# check transformation
sorted_mean_mileage_by_brand

{'saab': 143750.0,
 'volvo': 138632.3851203501,
 'rover': 136449.27536231885,
 'chrysler': 133149.17127071825,
 'bmw': 132521.64302818198,
 'alfa_romeo': 131109.4224924012,
 'mercedes_benz': 130886.14279678918,
 'audi': 129643.9411627364,
 'opel': 129298.66324848929,
 'volkswagen': 128955.27276129878,
 'renault': 128223.79367720465,
 'peugeot': 127352.33516483517,
 'jeep': 126409.09090909091,
 'mitsubishi': 126293.10344827586,
 'mazda': 125132.10039630119,
 'subaru': 124449.54128440368,
 'ford': 124131.93446392642,
 'honda': 123709.27318295739,
 'lancia': 123157.8947368421,
 'seat': 122061.63655685441,
 'daewoo': 121708.86075949368,
 'jaguar': 121298.7012987013,
 'citroen': 119764.62196861627,
 'nissan': 118978.7798408488,
 'land_rover': 118333.33333333333,
 'fiat': 117037.4617737003,
 'toyota': 115988.65478119935,
 'daihatsu': 114843.75,
 'kia': 112640.44943820225,
 'skoda': 110947.83715012722,
 'suzuki': 109334.47098976109,
 'hyundai': 106782.7868852459,
 'smart': 100756.06276747503,

In [142]:
# We want to construct a new dataframe that contains
# columns for both mean price and mean mileage, by brand
# Step1: Convert both dictionaries to Series objects, using the pandas Series constructor.
# Step2: Create a dataframe from the first Series object using the pandas DataFrame constructor.
# Step3: Assign the other Series as a new column in this dataframe.

# Step1
price_series = pd.Series(sorted_mean_price_by_brand)
mileage_series = pd.Series(sorted_mean_mileage_by_brand)
print(price_series[:3], mileage_series[:3])  # check transformation

# Step2
df = pd.DataFrame(price_series, columns=['mean_price'])

# Step3 (values are aligned on the brand index, not on position)
df['mean_mileage'] = mileage_series
df.head(5)  # display the top 5 brands by mean price

porsche           46764.200000
sonstige_autos    45859.160088
citroen           43909.678414
dtype: float64 saab     143750.000000
volvo    138632.385120
rover    136449.275362
dtype: float64


Unnamed: 0,mean_price,mean_mileage
porsche,46764.2,97363.945578
sonstige_autos,45859.160088,87188.644689
citroen,43909.678414,119764.621969
volvo,33292.31954,138632.38512
land_rover,18934.272727,118333.333333


#### Findings:
- The top 5 brands by mean price are shown above.
- Among these five brands, mean price does not appear to track mean mileage.
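
The same per-brand table can be produced without explicit loops by using `groupby` with named aggregation. This is an alternative sketch on made-up data (`toy`), not the approach taken in the cells above.

```python
import pandas as pd

# Made-up listings; one price is missing to show that mean() skips NaN
toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'ford', 'ford', 'ford'],
    'price_usd': [8000.0, 9000.0, 3000.0, None, 5000.0],
    'odometer_km': [150000, 125000, 100000, 150000, 50000],
})

# One row per brand, with both means computed in a single pass
summary = (
    toy.groupby('brand')
       .agg(mean_price=('price_usd', 'mean'),
            mean_mileage=('odometer_km', 'mean'))
       .sort_values('mean_price', ascending=False)
)
print(summary)
```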

## Conclusion: What we have done in this notebook
- Practiced applying a variety of pandas methods to explore and understand a dataset of car listings.

## Potential areas for further analysis
### Data cleaning next steps
- Identify categorical data that uses German words, translate them, and map the values to their English counterparts
- Convert the dates to uniform numeric data, so "2016-03-21" becomes the integer 20160321.
- See if there are particular keywords in the name column that can be extracted as new columns
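
The first two steps could look roughly like the sketch below. The translation dictionaries are assumptions covering only the example values used here (the value `'ja'` in particular is assumed, as it does not appear in the previews above), and `toy` is a made-up frame.

```python
import pandas as pd

# Made-up rows using German category values and the notebook's date format
toy = pd.DataFrame({
    'gearbox': ['manuell', 'automatik', None],
    'unrepaired_damage': ['nein', 'ja', 'nein'],
    'ad_created': ['2016-03-21 00:00:00', '2016-04-04 00:00:00', '2015-06-11 00:00:00'],
})

# Assumed translations; values missing from a mapping become NaN under Series.map
translations = {
    'gearbox': {'manuell': 'manual', 'automatik': 'automatic'},
    'unrepaired_damage': {'nein': 'no', 'ja': 'yes'},
}
for col, mapping in translations.items():
    toy[col] = toy[col].map(mapping)

# Keep the date part, remove the dashes, and convert to an integer such as 20160321
toy['ad_created'] = (
    toy['ad_created'].str[:10].str.replace('-', '', regex=False).astype(int)
)
print(toy)
```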

### Analysis next steps
- Find the most common brand/model combinations
- Split odometer_km into groups, and use aggregation to see if average prices follow any pattern based on mileage.
- How much cheaper are cars with damage than their non-damaged counterparts?
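
For the mileage grouping, `pd.cut` combined with `groupby` is one way to start. The bin edges and labels below are arbitrary choices for illustration, and the data is made up.

```python
import pandas as pd

# Made-up listings
toy = pd.DataFrame({
    'odometer_km': [5000, 40000, 70000, 100000, 150000, 150000],
    'price_usd': [20000.0, 15000.0, 9000.0, 6000.0, 3000.0, 2000.0],
})

# Assumed bin edges; each bin includes its right edge, e.g. (0, 50000]
bins = [0, 50000, 100000, 150000]
labels = ['0-50k', '50k-100k', '100k-150k']
toy['odometer_group'] = pd.cut(toy['odometer_km'], bins=bins, labels=labels)

# Mean price within each mileage group
mean_price_by_group = toy.groupby('odometer_group', observed=True)['price_usd'].mean()
print(mean_price_by_group)
```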