# eBay Car Sales Project

The aim of this project is to clean the data and analyze the included used car listings. Here we work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.


In [1]:
#Import the pandas and NumPy libraries

import pandas as pd
import numpy as np


## Step1: Read data
Read the _autos.csv_ file into pandas and assign it to the variable `autos`. We use the common `Latin-1` encoding to read the file without errors.

In [2]:
autos=pd.read_csv('autos.csv',encoding='Latin-1')
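When the encoding of a file is not known in advance, one option is to try a short list of candidates in order. The sketch below is only an illustration: `read_csv_with_fallback` is a hypothetical helper, and the demo file is made-up content, not the real _autos.csv_. Note that Latin-1 maps every byte to a character, so it never raises a decoding error; that is why it works well as a last resort, but it can silently mis-decode text that is actually in another encoding.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: return the first DataFrame that decodes cleanly."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            # Remember the failure and try the next candidate encoding.
            last_error = error
    raise last_error


# Demo on a tiny Latin-1 encoded file (made-up content).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write("name,price\nGOLF_3TÜRER,1500\n".encode("latin-1"))
    demo_path = handle.name

demo = read_csv_with_fallback(demo_path)
os.remove(demo_path)
print(demo)
```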

In [3]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


We make the following observations:

* The dataset contains 20 columns, most of which are strings.
* Some columns have null values.
* Some column names use [camelcase](https://en.wikipedia.org/wiki/Camel_case) instead of Python's preferred [snakecase](https://en.wikipedia.org/wiki/Snake_case), which means we can't just replace spaces with underscores.

## Step2: Data Cleaning
We start by cleaning the data set to make it easier to work with.

### Cleaning Column Names
Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.

In [4]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
#rename the columns
autos.rename(columns={"dateCrawled": "date_crawled",
                      "offerType": "offer_type",
                      "vehicleType": "vehicle_type",
                      "yearOfRegistration": "registration_year",
                      "powerPS": "power_ps",
                      "monthOfRegistration": "registration_month",
                      "fuelType": "fuel_type",
                      "notRepairedDamage": "unrepaired_damage",
                      "dateCreated": "ad_created",
                      "nrOfPictures": "num_pictures",
                      "postalCode": "postal_code",
                      "lastSeen": "last_seen"}, inplace=True)
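For columns that only need a mechanical camelcase-to-snakecase conversion, a small regular expression can do the work. The sketch below is an alternative to the manual mapping, not the method used in this notebook; `camel_to_snake` is a hypothetical helper. It does not reword names, so `yearOfRegistration` would become `year_of_registration` rather than `registration_year`.

```python
import re


def camel_to_snake(name):
    """Hypothetical helper: insert '_' between a lowercase/digit and an uppercase letter."""
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()


original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(name) for name in original]
print(converted)
```

With a DataFrame, the same function could be applied as `autos.columns = [camel_to_snake(c) for c in autos.columns]`, followed by a manual rename for the few columns that need rewording.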

In [6]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


### Initial Data Exploration
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:

* Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
* Examples of numeric data stored as text which can be cleaned and converted.

In [7]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:45:59
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


The following observations can be made:

* `seller` and `offer_type` have nearly all values the same
* `num_pictures` has the value 0 for all rows
* `registration_year` has a min value of 1000, which was long before cars were invented, and a max value of 9999, which is many years in the future
* `registration_month` has a min value of 0, which is invalid as months range from 1 to 12

So `seller`, `offer_type` and `num_pictures` are candidates to be dropped.
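Spotting near-constant columns by eye in a `describe()` table works, but the same check can be sketched in code. The example below runs on a small made-up DataFrame; `top_value_share` is a hypothetical helper and the 0.8 cutoff is an arbitrary assumption, not a value taken from this analysis.

```python
import pandas as pd


def top_value_share(df):
    """Hypothetical helper: share of rows taken by the most frequent value of each column."""
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )


# Made-up data that mimics the pattern seen in the real columns.
toy = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "gearbox": ["manuell"] * 6 + ["automatik"] * 4,
    "num_pictures": [0] * 10,
})

shares = top_value_share(toy)
candidates = shares[shares > 0.8].index.tolist()  # 0.8 is an arbitrary cutoff
print(shares)
print(candidates)
```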


### Dropping columns with mostly one value

In [8]:
autos=autos.drop(['seller','offer_type','num_pictures'],axis=1)



In [9]:
autos.shape

(371528, 17)

### Exploring price and kilometer

In [10]:
autos.price.describe()

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64

In [11]:
autos['price'].value_counts().sort_index(ascending= False).head(50)

2147483647     1
99999999      15
99000000       1
74185296       1
32545461       1
27322222       1
14000500       1
12345678       9
11111111      10
10010011       1
10000000       8
9999999        3
3895000        1
3890000        1
2995000        1
2795000        1
1600000        2
1300000        1
1250000        2
1234566        1
1111111        2
1010010        1
1000000        5
999999        13
999990         1
911911         1
849000         1
820000         1
780000         1
745000         2
725000         1
700000         1
650000         1
619000         1
600000         2
599000         1
585000         1
579000         1
517895         1
500000         2
488997         1
487000         1
485000         1
466000         1
445000         1
440000         1
420000         1
399997         1
395000         1
390000         1
Name: price, dtype: int64

In [12]:
autos["price"].value_counts().head(30)

0       10778
500      5670
1500     5394
1000     4649
1200     4594
2500     4438
600      3819
3500     3792
800      3784
2000     3432
999      3364
750      3203
650      3150
4500     3053
850      2946
2200     2936
700      2936
1800     2886
900      2874
950      2793
1100     2772
1300     2757
300      2731
3000     2720
550      2591
1600     2570
5500     2543
350      2514
400      2442
1250     2441
Name: price, dtype: int64

In [13]:
autos=autos[autos['price'].between(1,350000)]
autos["price"].describe()

count    360635.000000
mean       5898.671956
std        8866.359669
min           1.000000
25%        1250.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

In [14]:
autos['kilometer'].describe()

count    360635.000000
mean     125675.364288
std       39818.609118
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: kilometer, dtype: float64

In [15]:
autos['kilometer'].value_counts().sort_index()

5000        6012
10000       1869
20000       5530
30000       5935
40000       6319
50000       7531
60000       8593
70000       9673
80000      10905
90000      12349
100000     15477
125000     37371
150000    233071
Name: kilometer, dtype: int64

All values are rounded, which indicates that sellers probably had to select from pre-set options for this field. High-mileage cars dominate: about 65% of the listings fall in the 150,000 km category.

### Dealing with Time data

Right now, the _date_crawled, last_seen, and ad_created_ columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.


Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [16]:
autos.loc[0:5,['date_crawled','last_seen','ad_created']]

Unnamed: 0,date_crawled,last_seen,ad_created
0,2016-03-24 11:52:17,2016-04-07 03:16:57,2016-03-24 00:00:00
1,2016-03-24 10:58:45,2016-04-07 01:46:50,2016-03-24 00:00:00
2,2016-03-14 12:52:21,2016-04-05 12:47:46,2016-03-14 00:00:00
3,2016-03-17 16:54:04,2016-03-17 17:40:17,2016-03-17 00:00:00
4,2016-03-31 17:25:20,2016-04-06 10:17:21,2016-03-31 00:00:00
5,2016-04-04 17:36:23,2016-04-06 19:17:07,2016-04-04 00:00:00


You'll notice that the first 10 characters represent the day `(e.g. 2016-03-12)`. To understand the date range, we can extract just the date values, use `Series.value_counts()` to generate a distribution, and then sort by the index.

In [17]:
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025547
2016-03-06    0.014483
2016-03-07    0.035657
2016-03-08    0.033469
2016-03-09    0.034115
2016-03-10    0.032645
2016-03-11    0.032773
2016-03-12    0.036242
2016-03-13    0.015783
2016-03-14    0.036330
2016-03-15    0.033424
2016-03-16    0.030205
2016-03-17    0.031647
2016-03-18    0.013119
2016-03-19    0.035271
2016-03-20    0.036400
2016-03-21    0.035682
2016-03-22    0.032493
2016-03-23    0.032002
2016-03-24    0.029914
2016-03-25    0.032800
2016-03-26    0.031974
2016-03-27    0.030227
2016-03-28    0.035063
2016-03-29    0.034126
2016-03-30    0.033535
2016-03-31    0.031872
2016-04-01    0.034145
2016-04-02    0.035094
2016-04-03    0.038812
2016-04-04    0.037628
2016-04-05    0.012780
2016-04-06    0.003128
2016-04-07    0.001617
Name: date_crawled, dtype: float64

The distribution of `date_crawled` appears roughly uniform, with crawling occurring daily from 5 March to 7 April 2016.

In [18]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001264
2016-03-06    0.004098
2016-03-07    0.005202
2016-03-08    0.007939
2016-03-09    0.009824
2016-03-10    0.011460
2016-03-11    0.012955
2016-03-12    0.023240
2016-03-13    0.008410
2016-03-14    0.012176
2016-03-15    0.016324
2016-03-16    0.016418
2016-03-17    0.028699
2016-03-18    0.006888
2016-03-19    0.016330
2016-03-20    0.019884
2016-03-21    0.020026
2016-03-22    0.020508
2016-03-23    0.018015
2016-03-24    0.019163
2016-03-25    0.019000
2016-03-26    0.015958
2016-03-27    0.016721
2016-03-28    0.022189
2016-03-29    0.023284
2016-03-30    0.023725
2016-03-31    0.024243
2016-04-01    0.023897
2016-04-02    0.024967
2016-04-03    0.025308
2016-04-04    0.025536
2016-04-05    0.126962
2016-04-06    0.218950
2016-04-07    0.130437
Name: last_seen, dtype: float64

`last_seen` indicates when each listing was last seen by the crawler. This helps us know when the ad was removed, probably because the car was sold.

We can see that the last three days recorded have a much larger share of `last_seen` values, roughly 5 to 9 times the share of the preceding days. It is unlikely that there was such a big increase in the number of cars sold; more likely these values reflect the crawling period ending rather than actual sales.


In [19]:
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2014-03-10    0.000003
2015-03-20    0.000003
2015-06-11    0.000003
2015-06-18    0.000003
2015-08-07    0.000003
                ...   
2016-04-03    0.039001
2016-04-04    0.037736
2016-04-05    0.011613
2016-04-06    0.003119
2016-04-07    0.001553
Name: ad_created, Length: 114, dtype: float64

We can see that `ad_created` has a wide date range: 114 distinct dates, from March 2014 to April 2016.

Since `registration_year` has numeric values, we don't have to apply any data processing. We can just use `Series.describe()` to understand its distribution.

In [20]:
autos['registration_year'].describe()

count    360635.000000
mean       2004.433133
std          81.016977
min        1000.000000
25%        1999.000000
50%        2004.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

One thing that stands out from the exploration in the previous cell is that the `registration_year` column contains some odd values:
   * The minimum value is `1000`, before cars were invented
   * The maximum value is `9999`, many years into the future

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
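One way to quantify this is to take the mean of a boolean mask, which gives the share of rows where the mask is true. The sketch below uses a short made-up Series rather than the real column; `Series.between()` includes both endpoints by default.

```python
import pandas as pd

# Made-up registration years, including a few implausible ones.
years = pd.Series([1000, 1985, 1999, 2004, 2016, 2017, 9999, 2010])

valid = years.between(1900, 2016)   # both bounds are inclusive
outside_count = int((~valid).sum())
outside_share = (~valid).mean()     # mean of booleans = fraction of True values

print(outside_count, outside_share)
```

If the share is small, dropping the rows outright is usually reasonable; a larger share would suggest looking at the affected rows before deciding.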

In [21]:
autos[autos['registration_year'].between(1900,2016)].describe()

Unnamed: 0,price,registration_year,power_ps,kilometer,registration_month,postal_code
count,346660.0,346660.0,346660.0,346660.0,346660.0,346660.0
mean,5990.973969,2002.896489,117.563523,125466.292621,5.836125,51097.624217
std,8978.439505,7.244719,186.787008,39877.679301,3.665681,25780.428453
min,1.0,1910.0,0.0,5000.0,0.0,1067.0
25%,1250.0,1999.0,75.0,100000.0,3.0,30853.0
50%,3150.0,2003.0,109.0,150000.0,6.0,49843.0
75%,7500.0,2008.0,150.0,150000.0,9.0,71735.0
max,350000.0,2016.0,20000.0,150000.0,12.0,99998.0


In [22]:
autos.loc[autos['registration_year'].between(1900,2016),'registration_year'].value_counts(normalize=True).sort_index()

1910    0.000167
1911    0.000003
1923    0.000009
1925    0.000003
1927    0.000006
          ...   
2012    0.026998
2013    0.017657
2014    0.013757
2015    0.008420
2016    0.026585
Name: registration_year, Length: 94, dtype: float64

In [23]:
autos=autos[autos['registration_year'].between(1900,2016)]

In [24]:
autos['registration_year'].min()

1910

We can also notice that the time that follows the date in `ad_created` is 00:00:00 for the first few rows. We can check if the case is the same for every row in the dataframe. If so, we can remove this part that will not be of much use during analysis.
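A check along these lines could look like the sketch below, shown on a few made-up values rather than the full column. It tests the string form (characters after the date) and, as a second variant, the parsed timestamps by comparing each value with its midnight-normalized counterpart.

```python
import pandas as pd

# Made-up values in the same format as the ad_created column.
ad_created = pd.Series([
    "2016-03-24 00:00:00",
    "2016-03-14 00:00:00",
    "2016-04-04 00:00:00",
])

# Variant 1: on the raw strings, the time part starts at character 11.
all_midnight_str = ad_created.str[11:].eq("00:00:00").all()

# Variant 2: after parsing, a timestamp equals its normalized value only at midnight.
parsed = pd.to_datetime(ad_created, format="%Y-%m-%d %H:%M:%S")
all_midnight_parsed = (parsed == parsed.dt.normalize()).all()

print(all_midnight_str, all_midnight_parsed)
```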

In [25]:
date_format='%Y-%m-%d %H:%M:%S'
columns=["date_crawled", "ad_created", "last_seen"]
for column in columns:
    autos[column]=pd.to_datetime(autos[column],format=date_format)
    # Checking
    print(column, autos[column].dtype)

date_crawled datetime64[ns]
ad_created datetime64[ns]
last_seen datetime64[ns]


We can see that the values in the 3 columns are no longer stored as objects but as `datetime64[ns]`. For instance, we can access the years in `date_crawled` with the attribute `Series.dt.year`.

In [26]:
autos["date_crawled"].dt.year.unique()

array([2016])

In [27]:
autos["ad_created"] = autos["ad_created"].dt.date
autos.head(3)

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14,90480,2016-04-05 12:47:46
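One design point worth knowing: `Series.dt.date` returns plain Python `date` objects, so the column goes back to `object` dtype and the `.dt` accessor is no longer available on it. If datetime operations are still needed later, `Series.dt.normalize()` is an alternative that keeps the `datetime64` dtype with the time set to midnight. The sketch below compares the two on made-up values; it is not what the notebook does above.

```python
import pandas as pd

# Made-up timestamps in the same format as ad_created.
stamps = pd.to_datetime(pd.Series(["2016-03-24 00:00:00", "2016-03-14 00:00:00"]))

as_dates = stamps.dt.date             # Python date objects, object dtype
as_normalized = stamps.dt.normalize()  # still datetime64, time set to 00:00:00

print(as_dates.dtype, as_normalized.dtype)
print(as_normalized.dt.month.tolist())  # .dt accessor still works here
```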


### German words
Some columns such as `gearbox` and `unrepaired_damage` contain German words. Using `Series.unique()`, we check whether other columns also contain German words. We will then translate them into their English counterparts.

In [28]:
columns = ["vehicle_type", "gearbox", "unrepaired_damage", "fuel_type" ]
for column in columns:
    print(column, "\n", autos[column].unique())

vehicle_type 
 [nan 'coupe' 'suv' 'kleinwagen' 'limousine' 'cabrio' 'bus' 'kombi'
 'andere']
gearbox 
 ['manuell' 'automatik' nan]
unrepaired_damage 
 [nan 'ja' 'nein']
fuel_type 
 ['benzin' 'diesel' nan 'lpg' 'andere' 'hybrid' 'cng' 'elektro']


We can also see that the `name` column contains German words. However, replacing each value in that column using `Series.map()` would be time consuming.

Some of the values seen above are also present in `name`, but with different capitalisation. Therefore, we will use `pandas.DataFrame.replace()` after converting all the letters in `name` to lowercase and replacing underscores with spaces.

In [29]:
# lowercase the names and replace underscores with spaces
autos['name']=autos['name'].str.lower().str.replace('_', ' ')

german_english = {
    'bus'       : 'bus',
    'limousine' : 'limousine',
    'kleinwagen': 'small car',
    'kombi'     : 'station wagon',
    'coupe'     : 'coupe',
    'suv'       : 'suv',
    'cabrio'    : 'convertible',
    'andere'    : 'other',
    'manuell'   : 'manual',
    'automatik' : 'automatic',
    'nein'      : 'no',
    'ja'        : 'yes',
    'lpg'       : 'lpg', 
    'benzin'    : 'petrol',
    'diesel'    : 'diesel',
    'cng'       : 'cng',
    'hybrid'    : 'hybrid',
    'elektro'   : 'electric'
} 

autos=autos.replace(german_english,regex=True)
# regex = True so that it doesn't look for exact cell match
# looks for any string match in the cell
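
One caveat with `regex=True` is that it matches substrings in every string column, so a short key such as `'ja'` can also rewrite unrelated values like a brand name. The sketch below shows the effect on a small made-up DataFrame and contrasts it with whole-cell replacement (the default without `regex=True`). It is an illustration of the trade-off, not a change to the cell above.

```python
import pandas as pd

# Made-up rows; 'jaguar' contains the German word 'ja' as a substring.
toy = pd.DataFrame({
    "brand": ["jaguar", "audi"],
    "unrepaired_damage": ["ja", "nein"],
})
translations = {"ja": "yes", "nein": "no"}

# Substring matching: every occurrence inside any cell is replaced.
substring = toy.replace(translations, regex=True)

# Whole-cell matching: only cells that equal a key exactly are replaced.
whole_cell = toy.replace(translations)

print(substring)
print(whole_cell)
```

Restricting the substring replacement to the columns that need it, for example `name`, and using whole-cell replacement for the categorical columns would avoid this side effect.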

## Exploring Brand

In this section, we make use of an analysis technique called aggregation.

The brand of a car can affect both its price and how well it sells, so it is worth exploring. We use aggregation to understand the relationship between the brand and the average price of cars of that brand.

The steps involved in the aggregation are as follows:

* Identify the unique brands we want to aggregate by
* Create an empty dictionary to store our aggregate data
* For each brand:
    - Select rows in the dataframe that correspond to the brand
    - Calculate the mean of the price column
    - Assign the brand/mean to the dictionary as key/value.
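
The steps above can also be expressed with `DataFrame.groupby()`, which performs the split, the mean and the collection of results in one call. The sketch below uses a small made-up DataFrame and is an alternative to the explicit loop, not a replacement for it.

```python
import pandas as pd

# Made-up listings with just the two columns needed for the aggregation.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "bmw"],
    "price": [9000, 11000, 2000, 4000, 8000],
})

mean_price = (
    toy.groupby("brand")["price"]  # one group per unique brand
    .mean()                        # average price within each group
    .round(2)
    .sort_values(ascending=False)
)
print(mean_price)
```

The explicit loop makes each step visible, which is useful for learning; `groupby` is shorter and scales to all brands without building a list first.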

In [39]:
list_top_brands =autos['brand'].value_counts().head().index
print(list_top_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi'], dtype='object')


We can already see that Volkswagen is the most popular brand, with almost twice as many listings as either of the next two brands. It can also be noted that all five top brands are German.

Now that we have a list of brands that we want to perform aggregation on, we can proceed.

### By price
We use `DataFrame.loc[]` to select the rows matching each brand while keeping only the `price` column.

To calculate the mean, we use `Series.mean()` and round the result to 2 decimal places when adding it as a value in the dictionary, `brand_price`.

Alternatively, we could first select all rows matching each brand as `selected_rows` and then calculate `selected_rows["price"].mean()`.

In [47]:
brand_price = {}
for brand in list_top_brands:
    selected_prices = autos.loc[autos["brand"] == brand, "price"]
    mean_prices = selected_prices.mean()
    # mean_prices contains many decimals
    # round to 2 decimal places
    brand_price[brand] = round(mean_prices, 2)
    
brand_price_sorted=sorted(brand_price.items(), key=lambda x: x[1],reverse=True)
print(brand_price_sorted)
print('\n')
mbp_series=pd.Series(brand_price)
print(mbp_series.sort_values(ascending = False))

[('audi', 9086.28), ('mercedes_benz', 8551.65), ('bmw', 8449.12), ('volkswagen', 5400.19), ('opel', 2971.9)]


audi             9086.28
mercedes_benz    8551.65
bmw              8449.12
volkswagen       5400.19
opel             2971.90
dtype: float64


We can observe that:
* Audi, Mercedes-Benz and BMW are the more expensive brands in the top 5
* Opel is the least expensive
* Despite being the most popular brand, Volkswagen falls in between in terms of average price.

In [49]:
##create a dataframe
mbp_df=pd.DataFrame(mbp_series,columns=["mean_price"])
print(mbp_df.sort_values(by = "mean_price", ascending = False))


               mean_price
audi              9086.28
mercedes_benz     8551.65
bmw               8449.12
volkswagen        5400.19
opel              2971.90


We want to find the most common model for each brand. The analysis uses the same list of top brands as above, again with aggregation.

* We create an empty dictionary that will store the brand name as key and the most common model as value.
* For each brand:
    - We select the rows where `brand` matches the current brand of the loop, keeping only the values in `model`.
    - Using `Series.value_counts()`, we count the occurrences of each model of that brand in descending order. We access the labels with the `Series.index` attribute.
    - Since the first element of that index corresponds to the most common model, we access it with `[0]`.
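
The same idea can be sketched with `groupby` and a small aggregation function, shown here on made-up rows. `value_counts()` drops missing values by default and sorts by count in descending order, so the first index label is the most common model. Ties are not handled specially in this sketch.

```python
import pandas as pd

# Made-up listings; one model is missing on purpose.
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "bmw", "bmw", "bmw"],
    "model": ["golf", "golf", "polo", "3er", "3er", None],
})

top_model = toy.groupby("brand")["model"].agg(
    lambda models: models.value_counts().index[0]  # most frequent label
)
print(top_model)
```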

In [59]:
brand_model={}

for brand in list_top_brands:
    select_model=autos.loc[autos['brand']==brand, 'model']
    sorted_models=select_model.value_counts().index
    brand_model[brand] = sorted_models[0]

brand_model

{'volkswagen': 'golf',
 'bmw': '3er',
 'opel': 'corsa',
 'mercedes_benz': 'c_klasse',
 'audi': 'a4'}

In [62]:
bms=pd.Series(brand_model)
bmdf=pd.DataFrame(bms, columns=['model_car'])

## Exploring Price

**By Mileage**

For this part of our project, we want to see if there are any patterns between mileage and the average prices of cars. We proceed using aggregation.

First, we explore `kilometer` to see which 'unique values' from the column we will use.

In [63]:
autos['kilometer'].unique()

array([150000, 125000,  90000,  30000,  70000,   5000, 100000,  60000,
        20000,  80000,  50000,  40000,  10000])

In [64]:
print(autos['kilometer'].min())
print(autos['kilometer'].max())

5000
150000


We decide to use the following intervals:
* 0-50,000 km
* 50,000-100,000 km
* 100,000-150,000 km

In [70]:
km_class=autos['kilometer'].value_counts().index
km_class

Int64Index([150000, 125000, 100000,  90000,  80000,  70000,  60000,  50000,
             40000,  30000,   5000,  20000,  10000],
           dtype='int64')

In [72]:
for bound in range(0,150000,50000):
    lower_bound=bound
    upper_bound=lower_bound+50000
    kilometer_range=(autos['kilometer'] >= lower_bound) & (autos['kilometer'] < upper_bound)
    avg_price=autos.loc[kilometer_range,'price'].mean()
    print(str(bound)+ "-" + str(upper_bound),":", str(round(avg_price,2)))

0-50000 : 14952.01
50000-100000 : 10704.42
100000-150000 : 6798.24


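We can see that as mileage increases, the average price decreases by roughly 4,000 per interval (about 4,250, then about 3,900). Note that the upper bound of each interval is exclusive in the loop above, so listings at exactly 150,000 km, the largest group, are not included in the last interval.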
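
Binning with `pandas.cut()` is one way to make the interval edges explicit. By default its bins are closed on the right, so `(100000, 150000]` includes the 150,000 km listings, whereas the loop's `< upper_bound` comparison leaves them out. The sketch below uses made-up prices; note that it also assigns edge values such as 50,000 km to the lower interval, which differs from the loop.

```python
import pandas as pd

# Made-up mileage and price values for illustration only.
toy = pd.DataFrame({
    "kilometer": [5000, 50000, 100000, 125000, 150000, 150000],
    "price": [15000, 12000, 9000, 7000, 4000, 3000],
})

bins = [0, 50000, 100000, 150000]
# Default right=True gives (0, 50000], (50000, 100000], (100000, 150000].
toy["km_range"] = pd.cut(toy["kilometer"], bins=bins)

avg_price_by_range = toy.groupby("km_range", observed=True)["price"].mean().round(2)
print(avg_price_by_range)
```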

We can further our analysis to see how the average mileage, price and brand are related.

We will first create a dataframe containing the average price column, with the brands as rows, to display our results using `pandas.DataFrame()`. We have to use `list()` when accessing `Dictionary.keys()`, since the latter returns a `dict_keys` view rather than a list.
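
A minimal sketch of that step, assuming the `brand_price` values printed earlier in this notebook, could look as follows. The variable name `summary` is chosen here for illustration.

```python
import pandas as pd

# Mean prices per brand, copied from the output shown earlier.
brand_price = {
    "audi": 9086.28,
    "mercedes_benz": 8551.65,
    "bmw": 8449.12,
    "volkswagen": 5400.19,
    "opel": 2971.9,
}

keys = brand_price.keys()
print(type(keys).__name__)  # a dict_keys view, not a list

# Convert the view to a list so it can serve as the row index.
summary = pd.DataFrame(
    {"mean_price": list(brand_price.values())},
    index=list(keys),
)
print(summary)
```

Further columns, such as an average mileage per brand, could then be added to `summary` as long as they share the same brand index.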