<a href="https://colab.research.google.com/github/Rossel/DataQuest_Projects/blob/master/Project_03_Exploring_Ebay_Car_Sales_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Exploring Ebay Car Sales Data
*A project on data cleaning and data exploration using Python and pandas to obtain useful insights into the market for second hand cars.*

![eBay logo](https://static.ebayinc.com/static/assets/Uploads/Content/_resampled/FillWyIzMzciLCIxOTAiXQ/eBay-Logo-Preview12.png)


## 1. Introduction

In this project from [DataQuest](https://www.dataquest.io/path/data-analyst/), I want to gain insight into the second hand car market. I would like to better understand the top German car brands and how their offerings compare to each other. My favourite car is the Porsche 911, and in this analysis I will compare the top 5 car brands to this sports car on various criteria.


### **The dataset**

I will work with a dataset of second hand cars from [eBay Kleinanzeigen](https://www.ebay-kleinanzeigen.de/s-autos/c216), a classifieds section of the German eBay website. Much of the data is in German, but I will translate it to English for better understanding.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). DataQuest made the following modifications to the original dataset:

* DataQuest sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment
* DataQuest dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)

The dataset can be downloaded [here](https://drive.google.com/file/d/1H8-SUpdMpteA-Qvxn0F1Ad3Ek3z8lU1t/view?usp=sharing). The data dictionary provided with the data is as follows:

* `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
* `name` - Name of the car.
* `seller` - Whether the seller is private or a dealer.
* `offerType` - The type of listing
* `price` - The price on the ad to sell the car.
* `abtest` - Whether the listing is included in an A/B test.
* `vehicleType` - The vehicle Type.
* `yearOfRegistration` - The year in which the car was first registered.
* `gearbox` - The transmission type.
* `powerPS` - The power of the car in PS.
* `model` - The car model name.
* `kilometer` - How many kilometers the car has driven.
* `monthOfRegistration` - The month in which the car was first registered.
* `fuelType` - What type of fuel the car uses.
* `brand`- The brand of the car.
* `notRepairedDamage` - If the car has a damage which is not yet repaired.
* `dateCreated` - The date on which the eBay listing was created.
* `nrOfPictures` - The number of pictures in the ad.
* `postalCode` - The postal code for the location of the vehicle.
* `lastSeenOnline` - When the crawler saw this ad last online.

With this project I will practice my skills in cleaning the dataset and analysing the used car data. The description of the project can be found [here](https://www.dataquest.io/m/294-guided-project-exploring-ebay-car-sales-data/).


![car image](https://files.porsche.com/filestore/galleryimagerwd/multimedia/none/modelseries-911carrera992-indoor-08/zoom2/a748dda6-e75c-11e8-bec8-0019999cd470;sN;twebp/porsche-zoom2.webp)

### **Importing the data**

Let's start by importing the libraries we need and reading the dataset into pandas using Google Colab.


In [None]:
# Import functions from Google modules into Colaboratory
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Insert file id from Google Drive shareable link:
# https://drive.google.com/file/d/1H8-SUpdMpteA-Qvxn0F1Ad3Ek3z8lU1t/view?usp=sharing
id = "1H8-SUpdMpteA-Qvxn0F1Ad3Ek3z8lU1t"

In [None]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('autos.csv')

In [None]:
# Import pandas library
import pandas as pd
import numpy as np

The dataset could not be read using "UTF-8" or "Windows-1252" encoding, so we used "Latin-1".

In [None]:
# Read encoded csv
autos = pd.read_csv("autos.csv", encoding='Latin-1')
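
The trial-and-error behind this encoding choice can be sketched with a small helper. This is only an illustration: `first_working_encoding` is a hypothetical function (not part of pandas), the sample bytes are made up, and it checks raw bytes instead of calling `pd.read_csv`. `cp1252` is Python's codec name for Windows-1252.

```python
# Hypothetical helper (not part of pandas): return the first codec that can
# decode the given bytes without raising an error.
def first_working_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# Made-up sample bytes: 0xDC is "Ü" in Latin-1 but is not valid UTF-8 here,
# and 0x81 is undefined in Python's cp1252 codec.
sample = b"T\xdcV neu \x81"
detected = first_working_encoding(sample)
print(detected)
```

Keep in mind that Latin-1 maps every possible byte to a character, so it never raises an error. A successful read therefore does not prove that Latin-1 is the encoding the file was written in, only that the file can be loaded.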

### **Exploring the data**
Let's render the first few and last few values of this pandas object, by running the `autos` variable in a separate cell.

In [None]:
# Render the first few and last rows of the autos dataframe
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


Now we call the `DataFrame.info()` method and inspect the `DataFrame.shape` attribute to print information about the `autos` dataframe.

In [None]:
# Print information on the autos dataframe
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [None]:
# Print dataframe dimensions
autos.shape

(50000, 20)

As expected, there are 50,000 records across 20 columns.

Other observations are:
* Five columns are integers, the others are objects (strings).
* Five columns contain null-values, but none have more than ~20% null values. 
* The column names use "[camel case](https://en.wikipedia.org/wiki/Camel_case)" formatting (capital letters inside the name) instead of Python's preferred "[snake case](https://en.wikipedia.org/wiki/Snake_case)", which means we cannot simply replace spaces with underscores.
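
The share of missing values per column can be checked directly rather than read off the `info()` output. Below is a minimal sketch on a toy dataframe with invented values; on the real data the same idea would be `autos.isnull().mean()`. From the `info()` output above, `notRepairedDamage` has 50,000 - 40,171 = 9,829 missing values, which is about 19.7%.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `autos`; the values are invented for illustration.
toy = pd.DataFrame({
    "brand": ["bmw", "audi", "opel", "ford", "fiat"],
    "gearbox": ["manuell", np.nan, "automatik", "manuell", "manuell"],
    "model": ["7er", "a3", np.nan, np.nan, "500"],
})

# isnull() gives True/False per cell; the column mean of those booleans is
# the fraction of missing values in that column.
null_share = toy.isnull().mean()
print(null_share)
```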

### **Duplicate check**
Let's check for duplicates:

In [None]:
duplicate_bool = autos.duplicated() 
autos[duplicate_bool]

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen


No duplicates were found, so each row represents one unique ad.


![car image](https://files.porsche.com/filestore/galleryimagerwd/multimedia/none/modelseries-911carrera992-details-01/zoom2/68c55f32-e75a-11e8-bec8-0019999cd470;sN;twebp/porsche-zoom2.webp)

## 2. Cleaning Column Names

Next we will convert the column names to make them more accessible for the reader. We will reword several columns, and convert the names from camel case to snake case.

This is the current array of the existing column names:

In [None]:
# Print array of column names
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


We will make the following edits to columns names:
* `yearOfRegistration` to `registration_year`
* `monthOfRegistration` to `registration_month`
* `notRepairedDamage` to `unrepaired_damage`
* `dateCreated` to `ad_created`
* `powerPS` to `power_bhp`
* Convert all labels from camelcase to snakecase.

### **The quick and easy rename method**



We could rename the columns quickly and easily by assigning a new list of names (looks clean!):
```
autos.columns = [
       'date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_bhp', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'
]
```

### **More robust rename method using functions**

However, I prefer to use the functions below which would be more useful when more columns need renaming.

In [None]:
# Function name: fix_column(col)
# Input: The name of a column of the dataset
# Output: A standardized version of the column_name
# Description: The column names in the dataset are not consistent. This function
# will rename some of the columns and set all the column names to a common standard.

def fix_column(col):
    col = col.replace("yearOfRegistration","registration_year")
    col = col.replace("monthOfRegistration","registration_month")
    col = col.replace("notRepairedDamage", "unrepaired_damage")
    col = col.replace("dateCreated", "ad_created")
    col = col.replace("powerPS", "power_bhp")
    col = camel_to_snake(col)
    return col

In [None]:
# Function name: camel_to_snake
# Input: The name of a column in the dataset
# Output: The given name in snake case
# Description: This function converts a given column name to snake case to keep
# it consistent with Python conventions and standards.

def camel_to_snake(col):
    result = ""
    for pos, letter in enumerate(col):
        # Insert an underscore before each capital letter, except at the start.
        if letter.isupper() and pos > 0:
            result += "_"
        result += letter.lower()
    return result

In [None]:
# Apply fix_column to every column label and assign the result back
autos.columns = [fix_column(c) for c in autos.columns]

autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_bhp', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

By adopting the standard Python formatting conventions it will be easier for the reader, other developers and data scientists to understand this notebook.
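
As an aside, the camel case to snake case conversion can also be written with a regular expression. The sketch below is an alternative I did not use in this notebook; `camel_to_snake_re` is a hypothetical name. It inserts an underscore before every capital letter that is not at the start of the string.

```python
import re

# Hypothetical alternative to camel_to_snake(): put an underscore in front of
# every capital letter that is not at the start of the string, then lowercase.
def camel_to_snake_re(name):
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


examples = ["dateCrawled", "nrOfPictures", "postalCode", "abtest"]
converted = [camel_to_snake_re(name) for name in examples]
print(converted)
```

Note that consecutive capitals are split one by one, so `powerPS` would become `power_p_s`. Renaming that column explicitly before the conversion, as `fix_column` does, avoids this.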

![car image](https://files.porsche.com/filestore/galleryimagerwd/multimedia/none/modelseries-911carrera992-details-02/zoom2/6febe695-e75a-11e8-bec8-0019999cd470;sN;twebp/porsche-zoom2.webp)

## 3. Deeper Exploration and Cleaning

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. First, let's take a look at the descriptive statistics for all columns. By entering `include='all'` we will get both categorical and numeric columns:

In [None]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_bhp,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-05 16:57:05,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


### **Exploring the data in more detail**

In order to explore the findings, I executed the code below. It may look like a brute approach but it proved to be effective in offering deep insights for each category:

In [None]:
# for cat in autos.columns:
#  print(cat.upper())
#  print('\n' * 1)
#  print(autos[cat].value_counts())
#  print('\n' * 1)
#  print(autos[cat].head(20))
#  print('\n' * 4)

### **Data Exploration Findings**

√ `date_crawled`: This is a consistent time stamp, which looks good.

√ `name`: The first word seems to be the brand (as with the column `brand`), the second word the model (as with the column `model`). This column can therefore be interpreted as the title of the ad.

☐ `seller`: Only one entry is 'gewerblich' ('commercial'); all the rest are 'privat' ('private').
*Consider dropping this column.*

☐ `offer_type`: Only one entry is 'Gesuch' ('Searched'); all the rest are 'Angebot' ('Offered').
*Consider dropping this column.*

☐ `price`: This column is not an integer. For further analysis, remove currency sign, move this to column header and turn into `int`. *Change from string to integer.*

√ `abtest`: Only two different values 'test' and 'control'. Looks good.

☐ `vehicle_type`: 8 different types are mentioned in German, which could be translated to English. There are also 5,095 NaN values among the 50,000 entries (about 10%). *Consider translating.*

√ `registration_year`: Some values in this column are outside any plausible range: the minimum is 1000 and the maximum is 9999. Most entries seem fine.

☐ `gearbox`: There are 2,680 NaN values. The other records have one of two different types of gearbox. *Consider translating.*

☐ `power_bhp`: 5500 entries report 0 horsepower. The maximum value of 17700 is far above the value at the 75th percentile, which suggests inaccurate data in this column. *Requires more investigation.*

☐ `model`: Looks similar to the second word in the `name` column, just after the first underscore. *Consider translating.*

☐ `odometer`: Mileage, where the "km" can be removed, in order to make the cells integers. *Change from string to integer.*

☐ `registration_month`: The minimum in this column is "0" (with 5075 entries), which also suggests inaccurate rows of data.

☐ `fuel_type`: Also described in German, but the entries look accurate and fine. *Consider translating.*

☐ `brand`: Looks fine, no duplicate names. Some strings are in German. *Consider translating.*

☐ `unrepaired_damage`: Almost 10,000 cases of NaN. The other entries are one of two values, written in German. *Consider translating.*

☐ `ad_created`: Date stamp seems fine, but the time stamp can be removed as it is 00:00:00 everywhere. *Consider removing time stamp.*

☐ `nr_of_pictures`: Zero everywhere, so it carries no information. *Consider dropping this column.*

☐ `postal_code`: Has an entry of four digits, where usually 5 digits are used. The entry 99998 also seems unusual, as do numbers after the decimal. *Requires more investigation.*

√ `last_seen`: Date stamp seems fine.

### **Dropping columns with mostly one value**

There are three columns where all or almost all values are the same. These are:
- `seller`
- `offer_type`
- `nr_of_pictures`

The columns will be dropped as they have no significant use for our analysis.

In [None]:
autos.drop(["seller", "offer_type", "nr_of_pictures"], axis=1, inplace = True)
autos.shape

(50000, 17)
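
Columns like these can also be found programmatically by checking how dominant the most frequent value is. The sketch below uses a toy dataframe with invented values, and the 99% cutoff is an assumption of mine, not a rule from the project.

```python
import pandas as pd

# Toy stand-in for `autos`; the values are invented for illustration.
toy = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "privat"],
    "nr_of_pictures": [0, 0, 0, 0],
    "brand": ["bmw", "audi", "opel", "bmw"],
})

# Share of rows taken up by the most frequent value in each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Hypothetical cutoff: flag columns where one value covers at least 99% of rows.
near_constant = top_share[top_share >= 0.99].index.tolist()
print(near_constant)
```

In the real dataset, `seller` and `offer_type` each have 49,999 identical values out of 50,000, so they would pass such a cutoff as well.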

### **Translate German words to English**

Many string values in the dataset are in German. For better understanding I will translate these to English.

In [None]:
# Function name: translate(string)
# Input: A string value from the dataset (possibly a German word)
# Output: The English translation if the word is in the mapping, otherwise the original string
# Description: Many string values in the dataset are in German. For better understanding
# I will translate these to English.

mappings = {"privat": "private", "gewerblich": "commercial", "Angebot": "Offer", "Gesuch": "Search", "kleinwagen":
"mini-car", "kombi":"station wagon", "cabrio": "convertible", "limousine": "sedan", "andere": "other" , "manuell": "manual", "automatik":
"automatic", "benzin":"gas", "elektro": "electric", "sonstige_auto": "other", "sonstige_autos": "other", "nein": "no", "ja": "yes"}

def translate(string):
  if string in mappings:
    return mappings[string]
  else:
    return string

columns_change = ["vehicle_type", "gearbox", "model", "fuel_type", "brand", "unrepaired_damage"]

autos[columns_change] = autos[columns_change].applymap(translate)

autos.head()

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_bhp,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,"$5,000",control,bus,2004,manual,158,other,"150,000km",3,lpg,peugeot,no,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,"$8,500",control,sedan,1997,automatic,286,7er,"150,000km",6,gas,bmw,no,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,"$8,990",test,sedan,2009,manual,102,golf,"70,000km",7,gas,volkswagen,no,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,"$4,350",control,mini-car,2007,automatic,71,fortwo,"70,000km",6,gas,smart,no,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,"$1,350",test,station wagon,2003,manual,0,focus,"150,000km",7,gas,ford,no,2016-04-01 00:00:00,39218,2016-04-01 14:38:50
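
In recent pandas releases `DataFrame.applymap` is deprecated in favour of `DataFrame.map`, so the cell above may emit a warning depending on the installed version. A version-independent alternative is to translate one column at a time with `Series.map`. The sketch below uses a toy dataframe with invented values and only a subset of the mapping.

```python
import numpy as np
import pandas as pd

# Small subset of the mapping used above.
mappings = {"manuell": "manual", "automatik": "automatic", "nein": "no", "ja": "yes"}

# Toy stand-in for `autos`; the values are invented for illustration.
toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik", np.nan],
    "unrepaired_damage": ["nein", "ja", "nein"],
    "brand": ["bmw", "audi", "opel"],
})

# Translate column by column; values without a mapping (including NaN) are kept.
for column in ["gearbox", "unrepaired_damage", "brand"]:
    toy[column] = toy[column].map(lambda value: mappings.get(value, value))

print(toy)
```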


![car image](https://files.porsche.com/filestore/galleryimagerwd/multimedia/none/modelseries-911carrera992-details-05/zoom2/91760262-e75a-11e8-bec8-0019999cd470;sN;twebp/porsche-zoom2.webp)

## 4. Exploring the Odometer and Price Columns

There are two important columns that have numeric values stored as text. These are the columns `price` and `odometer`. I will:
- Remove any non-numeric characters.
- Convert the column to a numeric dtype.
- Rename the two columns to `price_usd` and `odometer_km`.

In [None]:
# Print how the column looks before removing non-numeric characters.
print("Price column before conversion:")
print(autos["price"].head(2))
# The "$" and "," characters are removed, and the column is converted to integers.
# regex=False treats "$" as a literal character rather than a regex anchor.
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))

# The column label is changed to "price_usd".
autos.rename({"price": "price_usd"}, axis=1, inplace=True)

# Print the resulting column head to show the converted outcome.
print('\n' * 1)
print("Price column after conversion:") 
print(autos["price_usd"].head(2))

Price column before conversion:
0    $5,000
1    $8,500
Name: price, dtype: object


Price column after conversion:
0    5000
1    8500
Name: price_usd, dtype: int64


In [None]:
# Print how the column looks before removing non-numeric characters.
print("Odometer column before conversion:")
print(autos["odometer"].head(2))
# The "km" and "," characters are removed, and the column is converted to integers.
autos["odometer"] = (autos["odometer"]
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int))
# The column label is changed to "odometer_km".
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)

# Print the resulting column head to show the converted outcome.
print('\n' * 1)
print("Odometer column after conversion:") 
print(autos["odometer_km"].head(2))

Odometer column before conversion:
0    150,000km
1    150,000km
Name: odometer, dtype: object


Odometer column after conversion:
0    150000
1    150000
Name: odometer_km, dtype: int64
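
Both conversions follow the same pattern, so they could share one helper. The sketch below is a hypothetical `to_int` function that strips every non-digit character with a regular expression. It assumes the values are non-negative whole numbers, because a minus sign or decimal point would be stripped as well.

```python
import pandas as pd

# Hypothetical helper: strip everything that is not a digit, then cast to int.
def to_int(series):
    return series.str.replace(r"\D", "", regex=True).astype(int)


# Sample values in the same layout as the price and odometer columns.
prices = pd.Series(["$5,000", "$8,500"])
odometer = pd.Series(["150,000km", "70,000km"])

print(to_int(prices).tolist())
print(to_int(odometer).tolist())
```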


Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the `odometer_km` and `price_usd` columns.


### **Explore Odometer values**

In [None]:
# Examine how many unique values there are for the odometer_km column.
autos["odometer_km"].unique().shape

(13,)

There are 13 unique values in the `odometer_km` column.

In [None]:
# Examine the min/max/median/mean values to look for outliers.
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [None]:
# Examine and sort the various values in the column to look for outliers.
autos["odometer_km"].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In the `odometer_km` column, there seem to be **no values that look unrealistically high or low (outliers)** that we might want to remove. With only 13 distinct, rounded values, we can assume that eBay offers sellers fixed mileage categories to select from. About 65% of the listings (32,424 of 50,000) fall in the highest category of 150,000 km.

### **Explore Price values**

In [None]:
# Examine how many unique values there are for the price column.
autos["price_usd"].unique().shape

(2357,)

There are 2357 unique values in the price column.

In [None]:
# Examine the min/max/median/mean values to look for outliers.
autos["price_usd"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

In [None]:
# Show the ten most common prices, sorted by price in ascending order.
print(autos["price_usd"]
      .value_counts()
      .head(10)
      .sort_index(ascending= True)
      )

0       1421
500      781
600      531
800      498
1000     639
1200     639
1500     734
2000     460
2500     643
3500     498
Name: price_usd, dtype: int64


In [None]:
# Sort the price column in descending order to see outliers and the most reasonable maximum values.
print(autos["price_usd"]
      .sort_values(ascending= False)
      .iloc[:20]
      )

39705    99999999
42221    27322222
39377    12345678
47598    12345678
27371    12345678
2897     11111111
24384    11111111
11137    10000000
47634     3890000
7814      1300000
22947     1234566
43049      999999
514        999999
37585      999990
36818      350000
14715      345000
34723      299000
35923      295000
12682      265000
47337      259000
Name: price_usd, dtype: int64


These two views help to identify outliers and unrealistic prices, and suggest that **reasonable prices for second hand cars range between USD 500 and USD 350,000.** There are 1,421 listings with a price of USD 0. One could argue that eBay is an auction site where bidding can start low, but such values tell us little about the current value of second hand cars, so we exclude them from our analysis.

In [None]:
# Remove price outliers from the dataset
autos = autos[autos["price_usd"].between(500,350000)]
autos["price_usd"].describe()

# Alternative code to do this is:
# autos[(autos["price_usd"] >= 500) & (autos["price_usd"] <= 350000)]

count     45097.000000
mean       6320.659600
std        9261.841444
min         500.000000
25%        1500.000000
50%        3500.000000
75%        7900.000000
max      350000.000000
Name: price_usd, dtype: float64

After dropping the price outliers, there are 45,097 records left in the dataset. The prices range between USD 500 and USD 350,000, with a mean of about USD 6,321.
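
The filter removed 50,000 - 45,097 = 4,903 rows, or about 9.8% of the dataset. Since the result depends on how `Series.between` treats the boundary values, here is a small sketch on invented prices showing that both bounds are included by default.

```python
import pandas as pd

# Invented prices that sit below, on, inside and above the chosen bounds.
prices = pd.Series([0, 499, 500, 3500, 350000, 999999])

# between() is inclusive on both ends by default.
mask = prices.between(500, 350000)
kept = prices[mask]
print(kept.tolist())

# Equivalent explicit comparison.
same_mask = (prices >= 500) & (prices <= 350000)
print(mask.equals(same_mask))
```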



![car image](https://files.porsche.com/filestore/galleryimagerwd/multimedia/none/modelseries-911carrera992-interior-07/zoom2/029f8d07-e75d-11e8-bec8-0019999cd470;sN;twebp/porsche-zoom2.webp)


## 5. Exploring the date columns

Let's now move on to the date columns and understand the date range the data covers. There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:



- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

Right now, the `date_crawled`, `last_seen`, and `ad_created` columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. 

The other two columns are represented as numeric values, so we can use methods like `Series.describe()` to understand the distribution without any extra data processing.

The values in the three string columns are formatted as the following timestamp values:



In [None]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


The first 10 characters represent the day (e.g. `2016-03-12`). To understand the date range, we can extract just the date values, then use `Series.value_counts()` to generate a distribution, and then sort by the index.
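
Slicing the string works because every timestamp has the same layout. An alternative that I do not use in this notebook is to parse the strings with `pd.to_datetime`, which also makes date arithmetic possible later on. A minimal sketch on invented timestamps in the same layout:

```python
import pandas as pd

# Invented timestamps in the same "YYYY-MM-DD HH:MM:SS" layout as the dataset.
stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

# Parse to datetimes, keep only the calendar date, then build the distribution.
dates = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S").dt.date
distribution = dates.value_counts(normalize=True).sort_index()
print(distribution)
```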

### **Date Crawled Observations** 

In [None]:
# Explore the distribution of the dates on which the ads were crawled
print(autos["date_crawled"]
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
      )

2016-03-05    0.025567
2016-03-06    0.014125
2016-03-07    0.036189
2016-03-08    0.033173
2016-03-09    0.032907
2016-03-10    0.032707
2016-03-11    0.033018
2016-03-12    0.037320
2016-03-13    0.015522
2016-03-14    0.036300
2016-03-15    0.034016
2016-03-16    0.029359
2016-03-17    0.031155
2016-03-18    0.012883
2016-03-19    0.034747
2016-03-20    0.038073
2016-03-21    0.037741
2016-03-22    0.033018
2016-03-23    0.032397
2016-03-24    0.028982
2016-03-25    0.031089
2016-03-26    0.032641
2016-03-27    0.031177
2016-03-28    0.034836
2016-03-29    0.033262
2016-03-30    0.033328
2016-03-31    0.031665
2016-04-01    0.033905
2016-04-02    0.035767
2016-04-03    0.038827
2016-04-04    0.036610
2016-04-05    0.013172
2016-04-06    0.003171
2016-04-07    0.001353
Name: date_crawled, dtype: float64


The data was **crawled between 2016-03-05 and 2016-04-07, which is roughly one month.** Looking at the percentage distribution, most days account for around three percent of the listings, with a few days around one percent. No single day is overweighted in a way that could skew the results.

### **Ad Created Observations** 

In [None]:
# Explore the distribution of the dates on which the ads were created
print(autos["ad_created"]
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
      )

2015-06-11    0.000022
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
                ...   
2016-04-03    0.039049
2016-04-04    0.036987
2016-04-05    0.011908
2016-04-06    0.003260
2016-04-07    0.001197
Name: ad_created, Length: 76, dtype: float64


The ads were **created between 2015-06-11 and 2016-04-07, which covers about 10 months.** The distribution is very uneven: each of the oldest creation dates accounts for only about 0.002% of the listings, while days in early April 2016 account for up to about 4%.

### **Last Seen Observations** 

In [None]:
# Explore the distribution of the dates on which the ads were seen by the crawler
print(autos["last_seen"]
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
      )

2016-03-05    0.001087
2016-03-06    0.004169
2016-03-07    0.005211
2016-03-08    0.007007
2016-03-09    0.009468
2016-03-10    0.010289
2016-03-11    0.012041
2016-03-12    0.023904
2016-03-13    0.008870
2016-03-14    0.012285
2016-03-15    0.015677
2016-03-16    0.016165
2016-03-17    0.027674
2016-03-18    0.007406
2016-03-19    0.015411
2016-03-20    0.020423
2016-03-21    0.020667
2016-03-22    0.021243
2016-03-23    0.018405
2016-03-24    0.019536
2016-03-25    0.018582
2016-03-26    0.016476
2016-03-27    0.015456
2016-03-28    0.020534
2016-03-29    0.021354
2016-03-30    0.024148
2016-03-31    0.023438
2016-04-01    0.022862
2016-04-02    0.024880
2016-04-03    0.024946
2016-04-04    0.024303
2016-04-05    0.126616
2016-04-06    0.225314
2016-04-07    0.134155
Name: last_seen, dtype: float64


The `last_seen` dates cover the same period as the `date_crawled` column, roughly **one month between 2016-03-05 and 2016-04-07.** Most days account for roughly 1 to 2.5 percent of the listings, but the last three days (2016-04-05 to 2016-04-07) together account for almost half of them (about 49%). Since these are the final days of the crawl, the spike most likely reflects the crawling period ending rather than a sudden surge in ads being taken offline.

![car image](https://files.porsche.com/filestore/galleryimagerwd/multimedia/none/992-carrera-gallery-05/zoom2/570851d7-7eea-11ea-80c9-005056bbdc38;sN;twebp/porsche-zoom2.webp)

## 6. Dealing with Incorrect Car Registration Data

### **Registration Year analysis**

Below I explore the descriptive statistics and the distribution of `registration_year` to look for any anomalies or incorrect data.

In [None]:
# Check the descriptive statistics for years cars were registered.
autos["registration_year"].describe()

count    45097.000000
mean      2005.064173
std         89.652017
min       1000.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [None]:
# Explore the distribution of car registration years
print(autos["registration_year"].
      value_counts().
      sort_index()
      )

1000    1
1001    1
1910    2
1927    1
1929    1
       ..
5911    1
6200    1
8888    1
9000    1
9999    3
Name: registration_year, Length: 93, dtype: int64


In [None]:
# Explore the higher ranges of the distribution of car registration years
print(autos["registration_year"]
      .sort_values(ascending= False)
      .iloc[:20]
      )

38076    9999
8012     9999
33950    9999
49910    9000
25003    8888
8360     6200
27618    5911
22799    5000
49153    5000
42079    4800
453      4500
4549     4100
27578    2800
49185    2019
19829    2018
33581    2018
44548    2018
18779    2018
33566    2018
36258    2018
Name: registration_year, dtype: int64


### **Registration Year observations**

The registration years contain obvious outliers: some fall **before 1900 and others after 2016**, the year in which the data was crawled. Striking examples are the years 1000, 1001, 8888, and 9999. These are almost certainly incorrect entries, since a car cannot have been registered after its ad was crawled. A possible cause is that the people posting the ads were unsure of the year in which their vehicle was registered.
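
Before dropping anything, it can help to quantify how many rows fall outside a plausible range. The following is a minimal sketch on a toy Series, not the real data; the values are made up for illustration and the 1900 to 2016 bounds follow the observation above.

```python
import pandas as pd

# Toy registration years for illustration only (not the real dataset)
years = pd.Series([1000, 1995, 2004, 2016, 2018, 9999])

# True where the year lies inside the plausible range (bounds inclusive)
plausible = years.between(1900, 2016)

# Share of rows that filtering on this range would drop
share_dropped = (~plausible).mean()
print(share_dropped)  # prints 0.5
```

Note that `Series.between` includes both bounds by default, so cars registered in 2016 are kept.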

### **Removing outliers**

Following the observations above, we keep only registration years from 1900 to 2016 (inclusive) and remove everything else.

In [None]:
# Remove the values outside our scope
autos = autos[autos["registration_year"].between(1900,2016)]

autos.shape

(43323, 17)

In [None]:
# Explore the distribution of registration years
autos["registration_year"].value_counts(normalize= True)

2005    0.066108
2000    0.062669
2004    0.061930
2006    0.061538
2003    0.061330
          ...   
1948    0.000023
1938    0.000023
1939    0.000023
1943    0.000023
1952    0.000023
Name: registration_year, Length: 78, dtype: float64

In [None]:
# Explore the top range in the distribution of registration years
autos["registration_year"].value_counts(normalize=True).head(10)

2005    0.066108
2000    0.062669
2004    0.061930
2006    0.061538
2003    0.061330
1999    0.059368
2001    0.058168
2002    0.055836
2007    0.052397
2008    0.050920
Name: registration_year, dtype: float64

It turns out that **most vehicles were registered in the early 2000s**: the ten most common registration years all fall between 1999 and 2008.

### **Registration Month analysis**

Now we check the month in which the cars were registered. As mentioned earlier, the min value is 0.

In [None]:
# Explore the distribution of months of registration
autos["registration_month"].value_counts()

3     4543
6     3901
4     3670
5     3669
7     3531
10    3322
9     3106
11    3094
12    3080
0     2973
1     2895
8     2851
2     2688
Name: registration_month, dtype: int64

### **Registration Month observations**

Around 3,000 of the entries are registered in month 0, which is not a valid month. If we want to make accurate statements about the months in which cars were registered, we have to remove the 0 entries from our dataset.
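
Dropping the rows is one option. If the other columns of those rows are still useful, an alternative is to mark month 0 as missing instead. The sketch below contrasts both approaches on toy data; the frame and its values are hypothetical, and only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (column names follow the notebook)
toy = pd.DataFrame({
    "brand": ["audi", "bmw", "opel", "porsche"],
    "registration_month": [0, 3, 12, 0],
})

# Option A (used in this notebook): drop rows with the invalid month 0
dropped = toy[toy["registration_month"] != 0]

# Option B: keep the rows but treat month 0 as a missing value
masked = toy.assign(
    registration_month=toy["registration_month"].replace(0, np.nan)
)

print(len(dropped), masked["registration_month"].isna().sum())  # prints 2 2
```

Option B keeps all four rows for analyses that do not depend on the month, at the cost of having to handle `NaN` values later.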

### **Removing outliers**

In [None]:
# Remove the value of month "0" out of the dataset
autos = autos[~(autos["registration_month"] == 0)]

# Explore the distribution of months of registration after removing month "0"
autos["registration_month"].value_counts()

3     4543
6     3901
4     3670
5     3669
7     3531
10    3322
9     3106
11    3094
12    3080
1     2895
8     2851
2     2688
Name: registration_month, dtype: int64

In [None]:
# Explore the dimensions of the dataframe
autos.shape

(40350, 17)

After removing entries with unrealistic registration dates, **we are left with 40,350 records in our dataset.** This is 80.7% of the original 50,000 records and still a substantial amount of data from which to draw representative conclusions about the German second hand car market.
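
The retention figure can be reproduced from the two row counts. The starting size of 50,000 comes from the dataset description in the introduction; the variable names are illustrative.

```python
# Row counts taken from the text and outputs in this notebook
original_rows = 50_000   # size of the DataQuest sample
remaining_rows = 40_350  # autos.shape[0] after cleaning the registration data

retained = remaining_rows / original_rows
print(f"{retained:.1%}")  # prints 80.7%
```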

## 7. Exploring Price by Brand

Now that we have cleaned the data, let's continue our analysis by looking at all the brands in the dataset to determine the top 5 brands in Germany, together with their mean prices.

In [None]:
autos["brand"].value_counts(normalize= True)

volkswagen       0.209021
bmw              0.115985
mercedes_benz    0.104238
opel             0.098414
audi             0.091202
ford             0.064709
renault          0.043841
peugeot          0.029492
fiat             0.023519
seat             0.017943
skoda            0.017596
smart            0.015217
nissan           0.015192
mazda            0.014696
citroen          0.013903
toyota           0.013829
hyundai          0.010409
mini             0.009839
volvo            0.009517
other            0.009368
honda            0.007782
mitsubishi       0.007658
kia              0.007658
porsche          0.006691
alfa_romeo       0.006592
chevrolet        0.006047
suzuki           0.005998
chrysler         0.003494
dacia            0.002924
jeep             0.002503
land_rover       0.002305
daihatsu         0.002057
subaru           0.001884
saab             0.001685
jaguar           0.001660
daewoo           0.001363
rover            0.001264
trabant          0.000991
lancia      

![car image](https://files.porsche.com/filestore/galleryimagerwd/multimedia/none/modelseries-911carrera992-indoor-01/zoom2/84da3d18-e75c-11e8-bec8-0019999cd470;sN;twebp/porsche-zoom2.webp)

### Most popular brands in Germany

**Volkswagen is clearly the most popular brand** in the German second hand car market, with over 20% of the ads in our dataset. **BMW is second with 11.6%, followed by Mercedes with 10.4%. Opel accounts for 9.8% and Audi for 9.1%**. The brand we are specifically interested in, **Porsche, has a much smaller share: 0.67% of our dataset**.

In [None]:
# Explore the amounts of ads for our top 5 brands
autos["brand"].value_counts()[:5]

volkswagen       8434
bmw              4680
mercedes_benz    4206
opel             3971
audi             3680
Name: brand, dtype: int64

In [None]:
# Explore the amounts of ads for our favourite brand Porsche
autos["brand"].value_counts()["porsche"]

270

There are far fewer Porsches on the German second hand car market than cars from the top brands. Our dataset counts **270 instances**. In comparison, the **five most-listed brands together have nearly 25,000 ads** in our dataset, and the four other German brands in the comparison below (Volkswagen, BMW, Mercedes and Audi) account for 21,000 of them.

### **Calculate the mean values for the top 5 car brands.**

In [None]:
# Select the five German brands to compare (Opel is left out in favour of Porsche)
top_brands = ["volkswagen", "bmw", "mercedes_benz", "audi", "porsche"]

# Keep only the records of the selected brands
mean_price = autos.loc[autos['brand'].isin(top_brands)]

# Calculate the mean of every numeric column per brand;
# numeric_only=True avoids errors on text columns in recent pandas versions
selected = mean_price.groupby(['brand']).mean(numeric_only=True).round(decimals=0)

# Show the mean values, sorted by mean price
selected.sort_values("price_usd")

| brand | price_usd | registration_year | power_bhp | odometer_km | registration_month | postal_code |
|---|---|---|---|---|---|---|
| volkswagen | 6043.0 | 2003.0 | 105.0 | 127677.0 | 6.0 | 50001.0 |
| bmw | 8828.0 | 2003.0 | 173.0 | 132481.0 | 6.0 | 55435.0 |
| mercedes_benz | 8981.0 | 2002.0 | 157.0 | 130804.0 | 6.0 | 51626.0 |
| audi | 9969.0 | 2005.0 | 166.0 | 128221.0 | 7.0 | 55096.0 |
| porsche | 48008.0 | 2002.0 | 301.0 | 98648.0 | 6.0 | 57519.0 |


### **Comparing German Top Brands**

**Volkswagen** is on average the cheapest of the five brands, at about **USD 6,000**. **BMW and Mercedes** follow with an average price of almost **USD 9,000.** The cars of these 3 brands were on average registered at around the same time (**2002 to 2003**) and have a similar mileage of around **130,000 km**. BMW does stand out in horsepower, with an average of 173 BHP versus 105 BHP for Volkswagen.

**Audi** is the most expensive of the mid-range German brands, with an average price of almost **USD 10,000** in this dataset. This may be related to the younger age of the cars, which are on average from **2005**. The mileage of the Audis is about the same as that of the other 3 brands, which may indicate that Audi drivers cover more distance per year. A possible reason is that **Audis are more popular as corporate lease cars** and are therefore used more intensively.
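
The mileage-per-year argument can be made concrete with a rough calculation. The sketch below uses the rounded brand means from the table above and assumes 2016 (the crawl year) as the reference year. Dividing one average by another is only a crude approximation, so the result should be read as an indication rather than a measurement.

```python
import pandas as pd

# Rounded brand means copied from the table above
means = pd.DataFrame(
    {
        "registration_year": [2003, 2003, 2002, 2005, 2002],
        "odometer_km": [127677, 132481, 130804, 128221, 98648],
    },
    index=["volkswagen", "bmw", "mercedes_benz", "audi", "porsche"],
)

CRAWL_YEAR = 2016  # year in which the ads were crawled

# Approximate age in years and average distance driven per year
age_years = CRAWL_YEAR - means["registration_year"]
km_per_year = (means["odometer_km"] / age_years).round()
print(km_per_year.sort_values(ascending=False))
```

On these rounded figures Audi comes out highest at roughly 11,700 km per year, against roughly 9,300 to 10,200 km for Volkswagen, BMW and Mercedes, and about 7,000 km for Porsche.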

It is interesting to compare the results for **Porsche** to the other 4 German car brands. Porsche is (unsurprisingly) about 5 to 8 times more expensive than the other brands, with an average price of **USD 48,000**. My favourite sports car is also much more powerful, with an **average of 300 BHP**, which is roughly 2 to 3 times that of the mid-range cars. The average mileage is lower than that of the other 4 brands, at around 100,000 km. This might suggest that Porsche is **more of a leisure car than an everyday commuting vehicle**.
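
The price multiples can be checked directly from the mean prices in the table above. This is a small worked calculation; the dictionary simply restates the rounded means.

```python
# Rounded mean prices (USD) copied from the table above
mean_price_usd = {
    "volkswagen": 6043,
    "bmw": 8828,
    "mercedes_benz": 8981,
    "audi": 9969,
    "porsche": 48008,
}

# How many times the mean Porsche price exceeds each other brand's mean price
ratios = {
    brand: round(mean_price_usd["porsche"] / price, 1)
    for brand, price in mean_price_usd.items()
    if brand != "porsche"
}
print(ratios)
# prints {'volkswagen': 7.9, 'bmw': 5.4, 'mercedes_benz': 5.3, 'audi': 4.8}
```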

### **Further avenues for research**

- Clean the `power_bhp` column and compare the ranges between brands
- Compare various models per brand
- Measure the cost of damage to similar cars
- Move Data Exploration Findings to table
- Analyse postal codes; the most frequent values are:
	- 10115, 65428, 66333, 45888, 44145
- Check NaN values across various columns

As I move forward in my learning journey on DataQuest.io, I intend to continuously apply my newly gained skills to this analysis.


![car image](https://files.porsche.com/filestore/galleryimagerwd/multimedia/none/modelseries-911carrera992-details-06/zoom2/99cad962-e75a-11e8-bec8-0019999cd470;sN;twebp/porsche-zoom2.webp)


# **Appendix**

## Storing Aggregate Data in a DataFrame

For further practice, I want to apply another method to get an overview of mean prices versus mean mileage, by using a for loop. I will use aggregation to understand the average mileage for the top cars and if there's any visible link with mean price. 

While our natural instinct may be to display both aggregated series objects and visually compare them, this has a few limitations:
- it's difficult to compare more than two aggregate series objects if we want to extend to more columns
- we can't compare more than a few rows from each series object
- both series objects have to stay sorted by their index (brand name) to line up for visual comparison, so we cannot sort them by value

Instead, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly. To do this we can use two pandas constructors:

- [pandas series constructor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)
- [pandas dataframe constructor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

First, let's get the mean prices and mean mileage using a for loop.

In [None]:
# Create an empty dictionary in which to store the average prices per car brand
mean_car_prices = {}

# Loop through the top 5 German car brands to obtain the average prices and add them to the dictionary mean_car_prices
for brand in top_brands:
    car_brand = autos[autos["brand"] == brand]
    mean_price = car_brand["price_usd"].mean()
    mean_car_prices[brand] = int(mean_price)

# Show the mean_car_prices dictionary
mean_car_prices

{'audi': 9969,
 'bmw': 8827,
 'mercedes_benz': 8981,
 'porsche': 48008,
 'volkswagen': 6043}

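These numbers match the previous section, apart from differences of 1 (for example BMW: 8827 here versus 8828 in the earlier table) because `int()` truncates the decimals while the earlier table was rounded. Now let's find out the mean mileage for these cars: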

In [None]:
# Create an empty dictionary in which to store the average mileage per car brand
mean_car_mileage = {}

# Loop through the top 5 German car brands to obtain the average mileage and add them to the dictionary mean_car_mileage
for brand in top_brands:
    car_brand = autos[autos["brand"] == brand]
    mean_mileage = car_brand["odometer_km"].mean()
    mean_car_mileage[brand] = int(mean_mileage)

# Show the mean_car_mileage dictionary
mean_car_mileage

{'audi': 128221,
 'bmw': 132480,
 'mercedes_benz': 130803,
 'porsche': 98648,
 'volkswagen': 127676}

In [None]:
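# Convert both dictionaries to series, sorted from highest to lowest value
average_mileage = pd.Series(mean_car_mileage).sort_values(ascending=False)
average_prices = pd.Series(mean_car_prices).sort_values(ascending=False)

# Build a dataframe from the mileage series (this fixes the row order)
brand_averages = pd.DataFrame(average_mileage, columns=['average_mileage'])

# Add the prices as a second column; pandas aligns them on the brand index
brand_averages["average_prices"] = average_prices
brand_averages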

| brand | average_mileage | average_prices |
|---|---|---|
| bmw | 132480 | 8827 |
| mercedes_benz | 130803 | 8981 |
| audi | 128221 | 9969 |
| volkswagen | 127676 | 6043 |
| porsche | 98648 | 48008 |


This combined table leads to the same conclusions as the section *Comparing German Top Brands* above.
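
For comparison, the two loops and the manual dataframe assembly could also be expressed as a single `groupby` aggregation. The sketch below runs on toy data rather than on `autos`; the frame, its values and the `brands` list are hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

# Toy data for illustration only (column names follow the notebook)
toy = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "opel"],
    "price_usd": [10000, 8000, 9000, 7000, 3000],
    "odometer_km": [100000, 150000, 125000, 150000, 90000],
})
brands = ["audi", "bmw"]  # hypothetical selection, stands in for top_brands

# One pass: filter, group by brand, and average both columns at once
brand_averages = (
    toy[toy["brand"].isin(brands)]
    .groupby("brand")[["price_usd", "odometer_km"]]
    .mean()
    .rename(columns={"price_usd": "average_prices",
                     "odometer_km": "average_mileage"})
    .sort_values("average_mileage", ascending=False)
)
print(brand_averages)
```

The loop version makes each step explicit, which is useful for practice; the `groupby` version is shorter and extends to more columns without extra loops.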

---

