<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/Guided_Project_03_Exploring_Ebay_Car_Sales_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guided Project: Exploring Ebay Car Sales Data
*Practice data cleaning and data exploration using pandas to obtain useful insights into the market for second-hand cars.*

![eBay logo](https://static.ebayinc.com/static/assets/Uploads/Content/_resampled/FillWyIzMzciLCIxOTAiXQ/eBay-Logo-Preview12.png)


## 1. Introduction



![car image](https://s3.caradvice.com.au/wp-content/uploads/2015/12/BMW-M4-GTS.jpg)

In this guided project, we'll work with a dataset of used cars from *eBay Kleinanzeigen*, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website. The data is in German, but will be translated where necessary for better understanding.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). DataQuest made the following modifications to the original dataset:

* DataQuest sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment
* DataQuest dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)

The dataset can be downloaded [here](https://drive.google.com/file/d/1H8-SUpdMpteA-Qvxn0F1Ad3Ek3z8lU1t/view?usp=sharing). The data dictionary provided with the data is as follows:

* `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
* `name` - Name of the car.
* `seller` - Whether the seller is private or a dealer.
* `offerType` - The type of listing
* `price` - The price on the ad to sell the car.
* `abtest` - Whether the listing is included in an A/B test.
* `vehicleType` - The vehicle Type.
* `yearOfRegistration` - The year in which the car was first registered.
* `gearbox` - The transmission type.
* `powerPS` - The power of the car in PS.
* `model` - The car model name.
* `kilometer` - How many kilometers the car has driven.
* `monthOfRegistration` - The month in which the car was first registered.
* `fuelType` - What type of fuel the car uses.
* `brand` - The brand of the car.
* `notRepairedDamage` - Whether the car has damage which is not yet repaired.
* `dateCreated` - The date on which the eBay listing was created.
* `nrOfPictures` - The number of pictures in the ad.
* `postalCode` - The postal code for the location of the vehicle.
* `lastSeenOnline` - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings. We will also become more familiar with some of the unique benefits Jupyter notebook (or Google Colab) provides for pandas.


Let's start by importing the libraries we need and reading the dataset into pandas using Google Colab.


In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
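# Google Drive file ID of autos.csv (named file_id to avoid shadowing the built-in id()).
file_id = "1H8-SUpdMpteA-Qvxn0F1Ad3Ek3z8lU1t"

In [None]:
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('autos.csv')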

In [None]:
import pandas as pd
import numpy as np

The dataset could not be read using "UTF-8" or "Windows-1252" encoding, so we used "Latin-1".

In [None]:
autos = pd.read_csv("autos.csv", encoding='Latin-1')
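One way to arrive at a working encoding is to try candidates in order and keep the first one that decodes. The sketch below is an illustration, not the code used in this notebook: `read_csv_with_fallback` is a hypothetical helper name, and the small temporary file stands in for `autos.csv`.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return (dataframe, encoding) for the first encoding that decodes the file."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next candidate
    raise last_error


# Demo: a tiny file containing "Ü" written as Latin-1, which is not valid UTF-8.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    handle.write("name,price\nGolf_TÜV_neu,1000\n".encode("latin-1"))
    demo_path = handle.name

demo_df, used_encoding = read_csv_with_fallback(demo_path)
os.remove(demo_path)
print(used_encoding)
```

Latin-1 maps every possible byte to a character, so it never raises a decoding error. That is why it goes last in the list, and also why a successful read does not by itself prove the encoding is correct; a quick look at values with umlauts is still worthwhile.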

Let's render the first few and last few rows of this dataframe by running the `autos` variable in a separate cell.

In [None]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


Now we call the `DataFrame.info()` method and inspect the `DataFrame.shape` attribute to print information about the `autos` dataframe.

In [None]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [None]:
autos.shape

(50000, 20)

As expected, there are 50,000 rows and 20 columns.

Other observations are:
* Five columns are integers, the others are objects (strings).
* Five columns contain null-values, but none have more than ~20% null values. 
* The column names contain capital letters ("[camel case](https://en.wikipedia.org/wiki/Camel_case)" formatting) instead of Python's preferred "[snakecase](https://en.wikipedia.org/wiki/Snake_case)", which means we cannot simply replace spaces with underscores.
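The share of missing values per column can be checked directly instead of being read off the `info()` output. The following is a minimal sketch on a made-up toy frame (the values are not taken from the dataset); on the real data the same expression would be applied to `autos`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `autos`; the values are made up for illustration.
toy = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi", "cabrio", np.nan],
    "gearbox": ["manuell", "automatik", np.nan, "manuell", "manuell"],
    "brand": ["bmw", "audi", "opel", "ford", "fiat"],
})

# isnull() yields booleans; their column mean is the fraction of missing values.
null_share = toy.isnull().mean().mul(100).round(1)

# Keep only columns that actually have missing values, largest share first.
null_share = null_share[null_share > 0].sort_values(ascending=False)
print(null_share)
```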

 Let's check for duplicates:

In [None]:
duplicate_bool = autos.duplicated() 
autos[duplicate_bool]

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen


No duplicates were found, so each row represents one unique ad.


## 2. Cleaning Column Names

Next we will convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.

First we print an array of the existing column names:

In [None]:
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


Next we will make the following edits to the column names:
* `yearOfRegistration` to `registration_year`
* `monthOfRegistration` to `registration_month`
* `notRepairedDamage` to `unrepaired_damage`
* `dateCreated` to `ad_created`
* The rest of the column names from camelcase to snakecase.

We could do this quick and clean, by using the following code:
```
autos.columns = [
       'date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'
]
```



However, I prefer to use the functions below, which would also work for datasets with more columns.

In [None]:
# Function name: fix_column(col)
# Input: The name of a column of the dataset
# Output: A standardized version of the column name
# Description: The column names in the dataset are not consistent. This function
# will rename some of the columns and set all the column names to a common standard.

def fix_column(col):
    col = col.replace("yearOfRegistration","registration_year")
    col = col.replace("monthOfRegistration","registration_month")
    col = col.replace("notRepairedDamage", "unrepaired_damage")
    col = col.replace("dateCreated", "ad_created")
    col = camel_to_snake(col)
    return col

In [None]:
# Function name: camel_to_snake
# Input: The name of a column in the dataset
# Output: The given name in snake case
# Description: This function converts a given column name to snake case to keep
# it consistent with Python conventions and standards.

def camel_to_snake(col):
    # The loop iterates over the original string; `col` is rebound inside it.
    for letter in col:
        if letter.isupper():
            # str.index() returns the first occurrence, so this assumes each
            # capital letter appears only once in a column name.
            pos = col.index(letter)
            col = col[:pos] + "_" + col[pos:]
    # Every capital gets its own underscore, so "powerPS" becomes "power_p_s".
    return col.lower()

In [None]:
# Apply fix_column to every column label and assign the result back.
autos.columns = [fix_column(c) for c in autos.columns]

autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

By adopting the standard Python formatting conventions it will be easier for other developers and data scientists to read the notebook.
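Note that the letter-by-letter conversion puts an underscore before every capital, which is why `powerPS` ends up as `power_p_s` in the output above. As a possible alternative, a regular-expression version can treat a run of capitals as a single word. This is a sketch only; `camel_to_snake_regex` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re


def camel_to_snake_regex(name):
    # Insert "_" between a lowercase letter or digit and the capital that follows it.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split a run of capitals from a following capitalized word ("PSValue" -> "PS_Value").
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    return name.lower()


for column in ["dateCrawled", "nrOfPictures", "powerPS", "abtest"]:
    print(column, "->", camel_to_snake_regex(column))
```

With this variant `powerPS` would become `power_ps`. The rest of this notebook keeps the `power_p_s` label produced by the function above.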

## 3. Initial Exploration and Cleaning

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for: 
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. 
- Examples of numeric data stored as text which can be cleaned and converted.



First, let's take a look at the descriptive statistics for all columns. By entering `include='all'` we will get both categorical and numeric columns:

In [307]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-30 19:48:02,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


To explore these findings further, I ran the code below for each column (shown here commented out):

In [308]:
#for cat in autos.columns:
#  print(cat.upper())
#  print('\n' * 1)
#  print(autos[cat].value_counts())
#  print('\n' * 1)
#  print(autos[cat].head(20))
#  print('\n' * 4)

**Data Exploration Findings**

√ `date_crawled`: This is a consistent time stamp, which looks good.

√ `name`: The first word seems to be the brand (as with the column `brand`), the second word the model (as with the column `model`). This column can therefore be interpreted as the title of the ad.

☐ `seller`: Only one entry has 'gewerblich' ('commercial'), all the rest are 'privat' ('private').
*Consider dropping this column.*

☐ `offer_type`: Only one entry has 'Gesuch' ('Searched'), all the rest are 'Angebot' ('Offered').
*Consider dropping this column.*

☐ `price`: This column is stored as text rather than as integers. For further analysis, remove the currency sign, move it to the column header and convert the values to `int`. *Change from string to integer.*

√ `abtest`: Only two different values 'test' and 'control'. Looks good.

☐ `vehicle_type`: 8 different types are given in German, which could be translated to English. There are also 5,095 NaN values among the 50,000 entries (about 10%). *Consider translating.*

√ `registration_year`: Some numbers in this category are out of the logical range. Entries have a minimal value of 1000, and a maximum value of 9999. Most seem fine.

☐ `gearbox`: There are 2680 NaN values. The other records have one of two different types of gearbox. *Consider translating.*

☐ `power_p_s`: 5500 entries have noted 0 horsepower. The maximum value of 17700 is significantly more than the value at 75%, which suggests inaccurate data in this column. *Requires more investigation.*

☐ `model`: Looks similar to the second word in the `name` column, just after the first underscore. *Consider translating.*

☐ `odometer`: Mileage, where the "km" can be removed, in order to make the cells integers. *Change from string to integer.*

☐ `registration_month`: The minimum in this column is "0" (with 5075 entries), which also suggests inaccurate rows of data.

☐ `fuel_type`: Also described in German, but the entries look accurate and fine. *Consider translating.*

☐ `brand`: Looks fine, no duplicate names. Some strings are in German. *Consider translating.*

☐ `unrepaired_damage`: Almost 10,000 cases of NaN. The other entries are one of two values, written in German. *Consider translating.*

☐ `ad_created`: Date stamp seems fine, but the time stamp can be removed as it is 00:00:00 everywhere. *Consider removing time stamp.*

☐ `nr_of_pictures`: Zero everywhere, so the column carries no information. *Consider dropping this column.*

☐ `postal_code`: Has an entry of four digits, where usually 5 digits are used. The entry 99998 also seems unusual, as do numbers after the decimal. *Requires more investigation.*

√ `last_seen`: Date stamp seems fine.

**Dropping columns with mostly one value**

Three columns have exactly the same value in almost every row. They are of no significant use for the analysis and are therefore removed.

In [310]:
autos.drop(["seller", "offer_type", "nr_of_pictures"], axis=1, inplace = True)
autos.shape

(50000, 17)

**Translate German words to English**

In [311]:
# Function name: translate(string)
# Input: A string, possibly a German word listed in `mappings`
# Output: The English translation if one is listed, otherwise the input unchanged
# Description: Many string values in the dataset are in German. For better understanding
# I will translate these to English.

mappings = {
    "privat": "private", "gewerblich": "commercial",
    "Angebot": "Offer", "Gesuch": "Search",
    "kleinwagen": "mini-car", "kombi": "station wagon",
    "cabrio": "convertible", "limousine": "sedan", "andere": "other",
    "manuell": "manual", "automatik": "automatic",
    "benzin": "gas", "elektro": "electric",
    "sonstige_auto": "other", "sonstige_autos": "other",
    "nein": "no", "ja": "yes",
}

def translate(string):
    # Look the value up in the mapping; unlisted values (including NaN) pass through.
    return mappings.get(string, string)

columns_change = ["vehicle_type", "gearbox", "model", "fuel_type", "brand", "unrepaired_damage"]

# applymap applies the function to every cell (pandas 2.1+ names this DataFrame.map).
autos[columns_change] = autos[columns_change].applymap(translate)

autos.head()

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,"$5,000",control,bus,2004,manual,158,other,"150,000km",3,lpg,peugeot,no,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,"$8,500",control,sedan,1997,automatic,286,7er,"150,000km",6,gas,bmw,no,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,"$8,990",test,sedan,2009,manual,102,golf,"70,000km",7,gas,volkswagen,no,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,"$4,350",control,mini-car,2007,automatic,71,fortwo,"70,000km",6,gas,smart,no,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,"$1,350",test,station wagon,2003,manual,0,focus,"150,000km",7,gas,ford,no,2016-04-01 00:00:00,39218,2016-04-01 14:38:50
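A dictionary lookup like this can also be expressed with `DataFrame.replace`, which accepts a mapping of old to new values. The sketch below uses a small made-up frame and a shortened, hypothetical `demo_mappings` dictionary, so it only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the translation mapping, for illustration only.
demo_mappings = {"manuell": "manual", "automatik": "automatic", "nein": "no", "ja": "yes"}

toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik", np.nan],
    "unrepaired_damage": ["nein", "ja", "nein"],
})

# replace() swaps whole cell values found in the mapping and leaves
# everything else, including NaN, untouched.
translated = toy.replace(demo_mappings)
print(translated)
```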


## 4. Exploring the Odometer and Price Columns

Let's begin with the two columns that have numeric values stored as text: `price` and `odometer`. For each column we will:
- Remove any non-numeric characters.
- Convert the column to a numeric dtype.
- Use `DataFrame.rename()` to rename the two columns to `price_usd` and `odometer_km`.

In [314]:
print("Price column before conversion:") 
print(autos["price"].head(2))
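# The "$" and "," characters are removed, and the column is converted to integers.
# regex=False treats "$" as a literal character rather than a regex anchor.
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))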

# The column label is changed to "price_usd".
autos.rename({"price": "price_usd"}, axis=1, inplace=True)

print('\n' * 1)
print("Price column after conversion:") 
print(autos["price_usd"].head(2))

Price column before conversion:
0    $5,000
1    $8,500
Name: price, dtype: object


Price column after conversion:
0    5000
1    8500
Name: price_usd, dtype: int64


In [315]:
print("Odometer column before conversion:") 
print(autos["odometer"].head(2))
# The "km" and "," characters are removed, and the column is converted to integers.
autos["odometer"] = (autos["odometer"]
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int))
# The column label is changed to "odometer_km".
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)

print('\n' * 1)
print("Odometer column after conversion:") 
print(autos["odometer_km"].head(2))

Odometer column before conversion:
0    150,000km
1    150,000km
Name: odometer, dtype: object


Odometer column after conversion:
0    150000
1    150000
Name: odometer_km, dtype: int64


![car image](https://www.motoringresearch.com/wp-content/uploads/2018/12/14_New_Cars_2019.jpg)

Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the `odometer_km` and `price_usd` columns.


### **Odometer**

In [322]:
# Examine how many unique values there are for the odometer_km column.
autos["odometer_km"].unique().shape

(13,)

There are 13 unique values in the `odometer_km` column.

In [323]:
# Examine the min/max/median/mean values to look for outliers.
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [324]:
# Examine and sort the various values in the column to look for outliers.
autos["odometer_km"].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In the `odometer_km` column, there seem to be **no** unrealistically high or low values (outliers) that we might want to remove. The 13 rounded values suggest that eBay offers sellers fixed mileage categories to select from. Most listings (32,424 of 50,000) fall into the highest category of 150,000 km.

### **Price**

In [325]:
# Examine how many unique values there are for the price column.
autos["price_usd"].unique().shape

(2357,)

There are 2357 unique values in the price column.

In [326]:
# Examine the min/max/median/mean values to look for outliers.
autos["price_usd"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

In [327]:
# Show the 10 most frequent prices, sorted by price in ascending order.
autos["price_usd"].value_counts().head(10).sort_index(ascending= True)

0       1421
500      781
600      531
800      498
1000     639
1200     639
1500     734
2000     460
2500     643
3500     498
Name: price_usd, dtype: int64

In [328]:
# Sort the price column in descending order to see outliers and the most reasonable maximum values.
autos["price_usd"].sort_values(ascending= False).iloc[:20]

39705    99999999
42221    27322222
39377    12345678
47598    12345678
27371    12345678
2897     11111111
24384    11111111
11137    10000000
47634     3890000
7814      1300000
22947     1234566
43049      999999
514        999999
37585      999990
36818      350000
14715      345000
34723      299000
35923      295000
12682      265000
47337      259000
Name: price_usd, dtype: int64

The two views above help to identify outliers and unrealistic prices, and suggest that reasonable prices for second-hand cars range between USD 500 and USD 350,000. There are 1,421 listings with a price of USD 0. eBay is a bidding site, which could explain very low starting prices, but such values tell us little about the current value of second-hand cars. Therefore we keep only listings priced between USD 500 and USD 350,000 (inclusive).

In [329]:
# Remove outliers from our dataset.
# Alternatively: autos[(autos["price_usd"] >= 500) & (autos["price_usd"] <= 350000)]
autos = autos[autos["price_usd"].between(500,350000)]
autos["price_usd"].describe()

count     45097.000000
mean       6320.659600
std        9261.841444
min         500.000000
25%        1500.000000
50%        3500.000000
75%        7900.000000
max      350000.000000
Name: price_usd, dtype: float64
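`Series.between()` includes both endpoints by default, which is why the minimum and maximum above are exactly 500 and 350,000. The short sketch below, on made-up prices, shows that it matches an explicit comparison with `>=` and `<=`.

```python
import pandas as pd

# Made-up prices around the two cut-off values.
prices = pd.Series([0, 499, 500, 3500, 350000, 350001, 99999999])

# between() is inclusive on both ends by default.
mask_between = prices.between(500, 350000)

# The equivalent explicit comparison uses >= and <=.
mask_explicit = (prices >= 500) & (prices <= 350000)

print(prices[mask_between].tolist())
```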

## 5. Exploring the date columns

Let's now move on to the date columns and understand the date range the data covers.

There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:



- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

Right now, the `date_crawled`, `last_seen`, and `ad_created` columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like `Series.describe()` to understand the distribution without any extra data processing.

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:



In [None]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


You'll notice that the first 10 characters represent the day (e.g. `2016-03-12`). To understand the date range, we can extract just the date values, use `Series.value_counts()` to generate a distribution, and then sort by the index.

To select the first 10 characters in each column, we can use `Series.str[:10]`:



In [None]:
print(autos['date_crawled'].str[:10])

0        2016-03-26
1        2016-04-04
2        2016-03-26
3        2016-03-12
4        2016-04-01
            ...    
49995    2016-03-27
49996    2016-03-28
49997    2016-04-02
49998    2016-03-08
49999    2016-03-14
Name: date_crawled, Length: 45097, dtype: object
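The remaining steps (counting and sorting by date) could look like the sketch below. It runs on a few made-up timestamps in the same string format; on the real data the chain would start from `autos['date_crawled']`.

```python
import pandas as pd

# Toy timestamps in the same "YYYY-MM-DD HH:MM:SS" string format; values are illustrative.
date_crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

date_distribution = (
    date_crawled
    .str[:10]                                    # keep only the date part
    .value_counts(normalize=True, dropna=False)  # share of rows per date, NaN included
    .sort_index()                                # ISO dates sort chronologically as text
)
print(date_distribution)
```

Using `normalize=True` returns shares instead of counts, which makes the distributions of `date_crawled`, `ad_created` and `last_seen` easier to compare.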


## 6. Dealing with Incorrect Registration Year Data

![car image](https://cdn2.carbuyer.co.uk/sites/carbuyer_d7/files/f-pace-41_3.jpg)

One thing that stands out from the exploration we did earlier is that the `registration_year` column contains some odd values:

- The minimum value is `1000`, before cars were invented
- The maximum value is `9999`, many years into the future

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
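A minimal sketch of that count is shown below. It uses a made-up series of years; in the notebook the same logic would be applied to `autos["registration_year"]`, and the decision to drop rows would depend on the share found there.

```python
import pandas as pd

# Toy registration years; in the notebook this would be autos["registration_year"].
registration_year = pd.Series([1000, 1999, 2003, 2008, 2016, 2017, 9999, 2005])

valid = registration_year.between(1900, 2016)  # inclusive on both ends
n_outside = (~valid).sum()
share_outside = n_outside / len(registration_year)

print(f"{n_outside} of {len(registration_year)} rows ({share_outside:.1%}) fall outside 1900-2016")

# If the share is small, one option is to keep only the plausible years.
cleaned = registration_year[valid]
```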

## 7. Exploring Price by Brand

One of the analysis techniques we learned in this course is aggregation. When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the `brand` column.

In an earlier mission, we explored how to use loops to perform aggregation. Here's what the process looks like:


- Identify the unique values we want to aggregate by
- Create an empty dictionary to store our aggregate data
- Loop over the unique values, and for each:
    - Subset the dataframe by the unique value
    - Calculate the mean of whichever column we're interested in
    - Assign the value/mean pair to the dictionary as key/value
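The steps above can be sketched as follows. The toy dataframe and its prices are made up for illustration; on the real data the loop would run over `autos` and its `brand` and `price_usd` columns.

```python
import pandas as pd

# Toy listings; brands and prices are made up for illustration.
toy = pd.DataFrame({
    "brand": ["bmw", "opel", "bmw", "audi", "opel", "audi"],
    "price_usd": [9000, 2000, 11000, 8000, 4000, 10000],
})

brand_mean_prices = {}                    # empty dictionary for the aggregate data
for brand in toy["brand"].unique():       # unique values to aggregate by
    subset = toy[toy["brand"] == brand]   # rows for this brand only
    brand_mean_prices[brand] = subset["price_usd"].mean()

print(brand_mean_prices)
```

The same means can be obtained in one line with `toy.groupby("brand")["price_usd"].mean()`; the explicit loop is shown because it mirrors the steps listed above.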




## 8. Storing Aggregate Data in a DataFrame

![car image](https://s3.india.com/auto/wp-content/uploads/2017/04/Maserati-at-NYIAS-2017-Ghibli-Nerissimo-edition-studio-w-1.jpg)

In the previous part, we aggregated across brands to understand mean price. We observed that in the top 6 brands, there's a distinct price gap.

- Audi, BMW and Mercedes Benz are more expensive
- Ford and Opel are less expensive
- Volkswagen is in between

For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price. While our natural instinct may be to display both aggregated series objects and visually compare them, this has a few limitations:
- it's difficult to compare more than two aggregate series objects if we want to extend to more columns
- we can't compare more than a few rows from each series object
- we can only sort by the index (brand name) of both series objects so we can easily make visual comparisons

Instead, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly. To do this, we'll need two pandas constructors:

- [pandas series constructor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)
- [pandas dataframe constructor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

Here's an example of the series constructor that uses the `brand_mean_prices` dictionary:



In [None]:
# bmp_series = pd.Series(brand_mean_prices)
# print(bmp_series)

The keys in the dictionary became the index in the series object. We can then create a single-column dataframe from this series object. We need to use the `columns` parameter (which accepts an array-like object) when calling the dataframe constructor to specify the column name; otherwise the column name will be set to `0` by default:



In [None]:
# df = pd.DataFrame(bmp_series, columns=['mean_price'])
# df
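Putting both constructors together, a second aggregate can be added as a new column so that price and mileage share one index. The sketch below uses placeholder numbers, not results from the dataset, and `brand_mean_mileage` and `brand_info` are hypothetical names.

```python
import pandas as pd

# Hypothetical aggregates keyed by brand; the numbers are placeholders,
# not results computed from the dataset.
brand_mean_prices = {"bmw": 10000.0, "opel": 3000.0, "audi": 9000.0}
brand_mean_mileage = {"audi": 130000.0, "bmw": 135000.0, "opel": 128000.0}

bmp_series = pd.Series(brand_mean_prices)   # dictionary keys become the index
bmm_series = pd.Series(brand_mean_mileage)

brand_info = pd.DataFrame(bmp_series, columns=["mean_price"])
# Assigning a series aligns on the index, so the differing key order does not matter.
brand_info["mean_mileage"] = bmm_series

print(brand_info.sort_values("mean_price", ascending=False))
```

Because the combined dataframe can be sorted by any column, it avoids the limitation of comparing two separately sorted series by eye.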

![car image](https://resources.stuff.co.nz/content/dam/images/1/k/6/9/i/l/image.related.StuffLandscapeSixteenByNine.1420x800.1k6a1a.png/1500346348050.jpg)



---



## **Data Cleanup Tasks**

√ `seller`: Only one entry has 'gewerblich' ('commercial'), all the rest are 'privat' ('private').
*Consider dropping this column.*

√ `offer_type`: Only one entry has 'Gesuch' ('Searched'), all the rest are 'Angebot' ('Offered').
*Consider dropping this column.*

√ `price`: This column is stored as text rather than as integers. For further analysis, remove the currency sign, move it to the column header and convert the values to `int`. *Change from string to integer.*

☐ `vehicle_type`: 8 different types are given in German, which could be translated to English. There are also 5,095 NaN values among the 50,000 entries (about 10%). *Consider translating.*

☐ `power_p_s`: 5500 entries have noted 0 horsepower. The maximum value of 17700 is significantly more than the value at 75%, which suggests inaccurate data in this column. *Requires more investigation.*

√ `odometer`: Mileage, where the "km" can be removed, in order to make the cells integers. *Change from string to integer.*

☐ `registration_month`: The minimum in this column is "0" (with 5075 entries), which also suggests inaccurate rows of data.

☐ `fuel_type`: Also described in German, but the entries look accurate and fine. *Consider translating.*

☐ `unrepaired_damage`: Almost 10,000 cases of NaN. The other entries are one of two values, written in German. *Consider translating.*

☐ `ad_created`: Date stamp seems fine, but the time stamp can be removed as it is 00:00:00 everywhere. *Consider removing time stamp.*

√ `nr_of_pictures`: Zero everywhere, so the column carries no information. *Consider dropping this column.*

☐ `postal_code`: Has an entry of four digits, where usually 5 digits are used. The entry 99998 also seems unusual, as do numbers after the decimal. *Requires more investigation.*