# *eBay Kleinanzeigen* Data Analysis 

## Context:

We will be analyzing *eBay Kleinanzeigen*, a classified section of the German eBay website. More specifically, this classified section for *eBay Kleinanzeigen* is for used cars. 

[Original Dataset](https://data.world/data-society/used-cars-data)

Data Dictionary:
- `dateCrawled`: When this ad was first crawled. All field-values are taken from this date.
- `name`: Name of the car.
- `seller`: Whether the seller is private or a dealer.
- `offerType`: The type of listing
- `price`: The price on the ad to sell the car.
- `abtest`: Whether the listing is included in an A/B test.
- `vehicleType`: The vehicle Type.
- `yearOfRegistration`: The year in which the car was first registered.
- `gearbox`: The transmission type.
- `powerPS`: The power of the car in PS.
- `model`: The car model name.
- `kilometer`: How many kilometers the car has driven.
- `monthOfRegistration`: The month in which the car was first registered.
- `fuelType`: What type of fuel the car uses.
- `brand`: The brand of the car.
- `notRepairedDamage`: If the car has a damage which is not yet repaired.
- `dateCreated`: The date on which the eBay listing was created.
- `nrOfPictures`: The number of pictures in the ad.
- `postalCode`: The postal code for the location of the vehicle.
- `lastSeenOnline`: When the crawler saw this ad last online.

## Looking to Answer:

1. TODO
2. TODO
3. TODO

# Findings & Summary:

Audi, BMW, and Mercedes Benz are the most expensive brands. Open and Ford are the least expensive brands. Volkswagen is a middle priced brand--not luxury , but not economy either.

Most cars in the dataset average about 130k miles even though the price alters heavily by brand. BMWs and Mercedes Benz slightly average mileage. Ford has the least amount of mileage at 124k miles.

## Initial Data Lookthrough

In [1]:
# import necessary packages
import numpy as np
import pandas as pd

In [2]:
autos = pd.read_csv('autos.csv',encoding='Latin-1')

In [3]:
#lets take a look at the data
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [4]:
print(autos.dtypes)

dateCrawled            object
name                   object
seller                 object
offerType              object
price                  object
abtest                 object
vehicleType            object
yearOfRegistration      int64
gearbox                object
powerPS                 int64
model                  object
odometer               object
monthOfRegistration     int64
fuelType               object
brand                  object
notRepairedDamage      object
dateCreated            object
nrOfPictures            int64
postalCode              int64
lastSeen               object
dtype: object


Observations: 

1. Most columns are intepreted as strings. We will need to clean columns and convert them to numeric data. Also convert date based columns to datetime objects.


2. There is a time difference between `lastSeen`& `dateCrawled` will have to be examined and further understood.


3. `nrOfPictures`seems to be a column we can delete from first look.


4. Some columns have null values, but none have more than ~20% null values.


5. The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.

# Part One: Cleaning & Exploring Data

In [5]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [6]:
#renaming column names
autos = autos.rename(columns={
    'dateCrawled':'date_crawled', 
    'name':'name', 
    'seller':'seller',
    'offerType':'offer_type',
    'price':'price',
    'abtest':'abtest',
    'vehicleType':'vehicle_type',
    'yearOfRegistration':'registration_year',
    'gearbox':'gear_box',
    'powerPS':'power_ps',
    'model':'model',
    'odometer':'odometer',
    'monthOfRegistration':'registration_month',
    'fuelType':'fuel_type',
    'brand':'brand',
    'notRepairedDamage':'unrepaired_damage',
    'dateCreated':'ad_created',
    'nrOfPictures':'nr_of_pictures',
    'postalCode':'postal_code',
    'lastSeen':'last_seen'}
                    )

Renaming the columns here will make data analysis easier. Some columns were renamed for clarity, others for more traditional `snake size` python programming.

In [7]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-25 19:57:10,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


**Columns that have mostly one value that are candidates to be dropped:**
1. `seller, offer_type,abtest,gear_box`

**Columns that need more investigation:**
1. `date_crawled,date_created, last_seen, power_ps,vehicle_type`

**Column examples of numeric data stored as text that needs to be cleaned:**
1.`price,registration_year,power_ps,odometer,registration_year,nr_of_pictures,postal_code`

In [8]:
# removing non-numeric characters and converting to numeric dtype

def column_namecleaner(ser,str_matches,_float=False,_int=False):
    """
    Clean a series object and convert it to numeric type based on
    what the str contains. In this anaylsis this function is is easier for str
    that contain '$', 'km', or ','
    
    ser: pandas series to search
    str_matches: tuple containing strings to remove from series
    _int: True if we want series to be converted to int, else False
    _float: True if we want series to be converted to float, else False
    
    Return cleaned series
    """
    for string in str_matches:
        if ser.str.contains(string,regex=False).any():
            ser = ser.str.replace(string,'')
    if (_int == True ):
        ser = ser.astype(int)
        return ser
    if (_float == True):
        ser = ser.astype(float)
        return ser

autos['price'] = column_namecleaner(autos['price'],("$",","),_int=True)
autos['odometer'] = column_namecleaner(autos['odometer'],("km",","),_int=True)

In [9]:
# renaming cleaned columns price, and odometer
autos = autos.rename(columns={
    'odometer':'odometer_km'
})

#### Deep dive into price and odometer_km columns

In [10]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [11]:
autos['price'].value_counts().sort_index(ascending=False).head(25)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
194000      1
190000      1
180000      1
175000      1
169999      1
Name: price, dtype: int64

In [12]:
autos['price'].value_counts().sort_index(ascending=True).head(25)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
40       6
45       4
47       1
49       4
50      49
Name: price, dtype: int64

For context, *eBay* is an online auction website, where users are able to sell items to the public. Users generally do not sell items at \\$0 dollars so filtering these values out would be ideal. Considering 75 percent of values are priced at \\$7,200 dollars and below, prices that are in the millions are unlikely to actually be true. 

Considering these reasons, a price range from \\$1-\\$350,000 seems reasonabe to me.

In [13]:
autos = autos[autos["price"].between(1,350001)]

In [14]:
#looking at the prices description now:
autos['price'].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

In [15]:
autos['odometer_km'].describe()

count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [16]:
autos['odometer_km'].value_counts().sort_index(ascending=False)

150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
40000       815
30000       780
20000       762
10000       253
5000        836
Name: odometer_km, dtype: int64

A range of 13 unique values for the odometer tells us that mileage couldve been rounded or grouped for easier selling on the website. The value counts seem normal to me. Considering *eBay* is an auction website, its a logical conclusion that there are more cars with higher mileage, used cars are auctioned off all the time.

#### Exploring the date_crawled, ad_created, & last_seen colums

In [17]:
print(autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index())

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64


Web scrapping was used on the website for a total of 32 days. Except for the last two days of scrapping (2016-04-06 & 2016-04-07) most days have a similar distibution percentage.

In [18]:
print(autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index())

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64


Ad created dates have a total duration of 301 days. From the distributions listed above most ads for this data set were created during year 2016. It could be worth filtering data for only 2016. 

In [19]:
print(autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index())

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64


Last seen date values are about this same as the date_crawled dates.

In [20]:
autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Registration year is missing values. Min and max values are not accurate. Most cars in the dataset are older models--2008 or older. Car registrations' year range from '99-2008.

From personal knowledge/experience, and common knowledge of car selling, most people dont care to sell cars that are very old (greater than 20 years from current year date). Antique car auctions are generally not be held on the eBay website either. Considering the data was scrubbed in 2016, I believe a reasonable date range is 1900-2016.

In [21]:
autos['registration_year'] = autos['registration_year'].between(1900,2016)

# Part Two: Mean Price of Car by Brand

In [39]:
brand_counts = autos['brand'].value_counts(normalize=True)
brands = brand_counts[brand_counts > .05].index

In [40]:
# based on course understandings thus far, we will aggregate using
# for loops and dicts

mean_price_by_brand = {}
for b in brands:
    fil_data = autos[autos['brand']==b]
    mean = fil_data['price'].mean()
    mean_price_by_brand[b] = round(float(mean))
    
mean_price_by_brand

{'volkswagen': 5332,
 'opel': 2945,
 'bmw': 8261,
 'mercedes_benz': 8536,
 'audi': 9213,
 'ford': 3728}

Audi, BMW, and Mercedes Benz are the most expensive brands. Open and Ford are the least expensive brands. Volkswagen is a middle priced brand--not luxury , but not economy either.

In [58]:
mean_price_by_brand1 = {}
mean_mileage_by_brand = {}
for b in brands:
    fil_data = autos[autos['brand']==b]
    meanprice = fil_data['price'].mean()
    meanmileage = fil_data['odometer_km'].mean()
    mean_price_by_brand1[b] = meanprice
    mean_mileage_by_brand[b] = meanmileage

brand_mean_price_series = pd.Series(mean_price_by_brand1)
brand_mean_mileage_series = pd.Series(mean_mileage_by_brand)

branddf = pd.DataFrame(brand_mean_price_series,
                     columns=['mean_price'])

In [59]:
branddf['mean_mileage'] = brand_mean_mileage_series

In [61]:
branddf.describe()

Unnamed: 0,mean_price,mean_mileage
count,6.0,6.0
mean,6335.973056,129266.868629
std,2688.198629,2770.972917
min,2944.607542,124349.497339
25%,4129.428743,129018.224372
50%,6796.930434,129437.867319
75%,8467.365924,130470.464327
max,9212.930662,132682.973075


Most cars in the dataset average about 130k miles even though the price alters heavily by brand. BMWs and Mercedes Benz slightly average mileage. Ford has the least amount of mileage at 124k miles.