# eBay Kleinanzeigen Auto Project Overview
In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

**The aim of this project is to clean the data and analyze the included used car listings.**
    

## Summary of Results

# Environment Setup

## Loading Dependencies

In [1]:
import pandas as pd
import numpy as np
import os
import git
import re

from pathlib import Path

# Show every row when displaying a DataFrame or Series (None removes the limit)
pd.set_option('display.max_rows', None)

## Importing our eBay Kleinanzeigen Auto Data
The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data), but we've already downloaded it and added it to our repository. You can find the file under ** in the `data` directory at the root of our repository.

The data dictionary provided with the data is as follows:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - Whether the car has damage that has not yet been repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler saw this ad last online.

Now that we have an overview of the data we'll be viewing, let's read in the dataset and see what it looks like!

In [2]:
# Read in the data
repo_root = Path(git.Repo(os.getcwd(), search_parent_directories=True).git.rev_parse("--show-toplevel"))
file_name = 'autos.csv'
file_path = f'{repo_root}/data/{file_name}'
autos = pd.read_csv(file_path, encoding='Latin-1', low_memory=False)

# Quick exploration of the data
display(autos.shape)
display(autos.info())
display(autos.isnull().sum())

(50000, 20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

None

dateCrawled               0
name                      0
seller                    0
offerType                 0
price                     0
abtest                    0
vehicleType            5095
yearOfRegistration        0
gearbox                2680
powerPS                   0
model                  2758
odometer                  0
monthOfRegistration       0
fuelType               4482
brand                     0
notRepairedDamage      9829
dateCreated               0
nrOfPictures              0
postalCode                0
lastSeen                  0
dtype: int64

# Data Cleaning & Imputation
As we can see from the above, a few columns need to be cleaned up and have missing values imputed. Let's get started on that.

## Column Headers
The column names are in camelCase, whereas the usual Python convention is snake_case (`one_two` rather than `oneTwo`), so we'll convert them.

In [3]:
# Function to help us easily clean up our columns
def clean_col_headers(col):
    # Adding a '_' between camel case col and lowercasing it
    return re.sub(r"(\w)([A-Z])", r"\1_\2", col).lower()

In [4]:
# Creating a copy of our dataframe
autos_use = autos.copy()

# Cleaning up our columns
autos_use.columns = [clean_col_headers(col) for col in autos_use.columns]

# Viewing our cleaned up columns
autos_use.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'year_of_registration', 'gearbox', 'power_ps', 'model',
       'odometer', 'month_of_registration', 'fuel_type', 'brand',
       'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
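To see what the regular expression in `clean_col_headers` is doing, here is a small standalone sketch that applies the same function to a few of the original column names. The `examples` list and the `converted` dictionary are only for illustration.

```python
import re

def clean_col_headers(col):
    # Insert '_' between a word character and the uppercase letter
    # that follows it, then lowercase the whole name
    return re.sub(r"(\w)([A-Z])", r"\1_\2", col).lower()

# A few of the original column names from the dataset
examples = ["dateCrawled", "yearOfRegistration", "powerPS", "abtest"]
converted = {col: clean_col_headers(col) for col in examples}

for original, cleaned in converted.items():
    print(f"{original} -> {cleaned}")
```

Names with no uppercase letters, such as `abtest`, pass through unchanged. In `powerPS`, only one underscore is inserted because the `P` is consumed by the first match before the `S` is examined.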

## Converting `price` & `odometer`
The `price` and `odometer` columns are currently stored as strings, so we'll need to convert them to numeric values before we can perform any arithmetic on them.

In [5]:
# Taking a peek at these columns
autos_use[['price', 'odometer']].head(25)


# Removes the given substrings from every row in a column
def remove_string_characters(df, col='price', strings_to_replace=('$', ',', 'km')):
    for char in strings_to_replace:
        # regex=False so that '$' is treated as a literal character, not a regex anchor
        df[col] = df[col].str.replace(char, "", regex=False)
    return df
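As a quick sanity check of this approach, the sketch below runs the same idea on a tiny, made-up sample. The rows in `sample` and the helper name `strip_and_convert` are assumptions for illustration only, not part of the dataset or the notebook.

```python
import pandas as pd

# Hypothetical rows in the same raw string format as the listings
sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

def strip_and_convert(series, strings_to_remove=("$", ",", "km")):
    # regex=False treats each pattern literally, so "$" is not
    # interpreted as the end-of-string anchor
    for text in strings_to_remove:
        series = series.str.replace(text, "", regex=False)
    return series.astype(float)

sample["price"] = strip_and_convert(sample["price"])
sample["odometer_km"] = strip_and_convert(sample["odometer"])
print(sample[["price", "odometer_km"]])
```

Unlike the function above, this helper works on a single Series and returns a new one, which avoids modifying the DataFrame that was passed in.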

### `price`
`price` currently contains non-numeric characters: the `$` sign and the comma used as a thousands separator. Let's remove them so that the column can be converted to numeric values.

In [6]:
# Removing string characters from the price column
autos_use = remove_string_characters(autos_use, 'price')

# Converting price to a numeric value
autos_use.price = autos_use.price.astype(float)

In [7]:
# Quick exploration of the data
print('Data Shape:')
display(autos_use.price.unique().shape)
print('Data Descriptives:')
display(autos_use.price.describe().round(2))
print('Ascending Price Values:')
display(autos_use.price.value_counts().sort_index().head(10))
print('Descending Price Values:')
display(autos_use.price.value_counts().sort_index(ascending=False).head(10))

Data Shape:


(2357,)

Data Descriptives:


count       50000.00
mean         9840.04
std        481104.38
min             0.00
25%          1100.00
50%          2950.00
75%          7200.00
max      99999999.00
Name: price, Length: 8, dtype: float64

Ascending Price Values:


0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7
11.0       2
12.0       3
Name: price, Length: 10, dtype: int64

Descending Price Values:


99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
Name: price, Length: 10, dtype: int64

### `odometer`
Similar to `price`, the values in the odometer column contain string characters. Let's go ahead and remove them!

In [8]:
# Removing string characters from the odometer column
autos_use = remove_string_characters(autos_use, 'odometer')

# Converting odometer to a numeric value
autos_use.odometer = autos_use.odometer.astype(float)

# Renaming the odometer column to odometer_km
autos_use.rename({'odometer':'odometer_km'}, axis=1, inplace=True)

In [9]:
# Quick exploration of the data
print('Data Shape:')
display(autos_use.odometer_km.unique().shape)
print('Data Descriptives:')
display(autos_use.odometer_km.describe().round(2))
print('Ascending Odometer Values:')
display(autos_use.odometer_km.value_counts().sort_index().head(10))
print('Descending Odometer Values:')
display(autos_use.odometer_km.value_counts().sort_index(ascending=False).head(10))

Data Shape:


(13,)

Data Descriptives:


count     50000.00
mean     125732.70
std       40042.21
min        5000.00
25%      125000.00
50%      150000.00
75%      150000.00
max      150000.00
Name: odometer_km, Length: 8, dtype: float64

Ascending Odometer Values:


5000.0      967
10000.0     264
20000.0     784
30000.0     789
40000.0     819
50000.0    1027
60000.0    1164
70000.0    1230
80000.0    1436
90000.0    1757
Name: odometer_km, Length: 10, dtype: int64

Descending Odometer Values:


150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
40000.0       819
30000.0       789
Name: odometer_km, Length: 10, dtype: int64

# Future Consideration
In this project, we practiced applying a variety of pandas methods to explore and understand a data set on car listings. Here are some next steps for you to consider:

Data cleaning next steps:

- Identify categorical data that uses German words, translate them, and map the values to their English counterparts
- Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer `20160321`.
- See if there are particular keywords in the name column that you can extract as new columns
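The three cleaning steps above could be approached roughly as follows. This is only a sketch on a made-up sample: the rows in `sample`, the German values and their translations, the `klima` keyword, and the new column names `date_crawled_int` and `has_klima` are all assumptions for illustration, so check them against the real data before relying on them.

```python
import pandas as pd

# Hypothetical sample rows; values and translations are assumptions
sample = pd.DataFrame({
    "gearbox": ["manuell", "automatik", None],
    "not_repaired_damage": ["nein", "ja", None],
    "date_crawled": ["2016-03-21 14:05:12", "2016-04-04 09:30:00",
                     "2016-03-26 17:47:46"],
    "name": ["Golf_4_1.9_TDI_Klima", "BMW_320i_Automatik",
             "Ford_Focus_Klima_TUV_neu"],
})

# 1. Translate German categorical values; anything not in a mapping becomes NaN
translations = {
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "not_repaired_damage": {"nein": "no", "ja": "yes"},
}
for col, mapping in translations.items():
    sample[col] = sample[col].map(mapping)

# 2. Keep the first 10 characters (YYYY-MM-DD), drop the dashes, cast to int
sample["date_crawled_int"] = (
    sample["date_crawled"].str[:10].str.replace("-", "", regex=False).astype(int)
)

# 3. Flag listings whose name mentions a keyword, ignoring case
sample["has_klima"] = sample["name"].str.contains("klima", case=False)

print(sample[["gearbox", "not_repaired_damage", "date_crawled_int", "has_klima"]])
```

Because `map` turns unmapped values into `NaN`, it is worth comparing `value_counts()` before and after translating to make sure no category was silently dropped.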

Analysis next steps:

- Find the most common brand/model combinations
- Split `odometer_km` into groups, and use aggregation to see if the average price follows any pattern based on mileage.
- How much cheaper are cars with damage than their non-damaged counterparts?
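One possible starting point for these analysis steps is sketched below, using `groupby` and `pd.cut` on a small made-up sample. The rows in `sample`, the bin edges, and the group labels are assumptions chosen for illustration; the real analysis would run on the cleaned `autos_use` DataFrame.

```python
import pandas as pd

# Hypothetical cleaned listings, for illustration only
sample = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "bmw", "bmw", "volkswagen", "opel"],
    "model": ["golf", "golf", "3er", "3er", "polo", "corsa"],
    "price": [5000.0, 3000.0, 9000.0, 4000.0, 2500.0, 1500.0],
    "odometer_km": [70000.0, 150000.0, 40000.0, 150000.0, 125000.0, 150000.0],
    "not_repaired_damage": ["no", "yes", "no", "yes", "no", "no"],
})

# Most common brand/model combinations
top_combos = sample.groupby(["brand", "model"]).size().sort_values(ascending=False)

# Average price per mileage group (bin edges are an assumption)
bins = [0, 50000, 100000, 150000]
labels = ["0-50k", "50k-100k", "100k-150k"]
sample["odometer_group"] = pd.cut(sample["odometer_km"], bins=bins, labels=labels)
mean_price_by_group = sample.groupby("odometer_group", observed=True)["price"].mean()

# Average price for damaged vs. undamaged cars
mean_price_by_damage = sample.groupby("not_repaired_damage")["price"].mean()
price_gap = mean_price_by_damage.loc["no"] - mean_price_by_damage.loc["yes"]

print(top_combos)
print(mean_price_by_group)
print(f"Undamaged cars cost {price_gap:.2f} more on average in this sample")
```

Note that `pd.cut` uses right-inclusive bins by default, so a car with exactly 150,000 km falls into the `100k-150k` group.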