# Exploring eBay Car Sales data

In this project we use a dataset of used cars from eBay Kleinanzeigen. The aim of this project is to clean the data and analyse the included used car listings.

You can find the dataset [here](https://data.world/data-society/used-cars-data).

The data dictionary provided with the data is as follows:

- __dateCrawled__ - When this ad was first crawled. All field-values are taken from this date.
- __name__ - Name of the car.
- __seller__ - Whether the seller is private or a dealer.
- __offerType__ - The type of listing.
- __price__ - The price on the ad to sell the car.
- __abtest__ - Whether the listing is included in an A/B test.
- __vehicleType__ - The vehicle Type.
- __yearOfRegistration__ - The year in which the car was first registered.
- __gearbox__ - The transmission type.
- __powerPS__ - The power of the car in PS.
- __model__ - The car model name.
- __kilometer__ - How many kilometers the car has driven.
- __monthOfRegistration__ - The month in which the car was first registered.
- __fuelType__ - What type of fuel the car uses.
- __brand__ - The brand of the car.
- __notRepairedDamage__ - If the car has a damage which is not yet repaired.
- __dateCreated__ - The date on which the eBay listing was created.
- __nrofPictures__ - The number of pictures in the ad.
- __postalCode__ - The postal code for the location of the vehicle.
- __lastSeenOnline__ - When the crawler saw this ad last online.

## Reading the Data

The default encoding gave an error, therefore we chose <i>Latin-1</i>.

In [274]:
import numpy as np
import pandas as pd
# we got an error while reading with the default encoding 
autos = pd.read_csv('autos.csv', encoding = "Latin-1")
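The encoding fallback described above can be sketched as a small helper that tries UTF-8 first and only then Latin-1. The helper name `read_csv_with_fallback` and the tiny demo file are hypothetical and not part of the original notebook; this is only one possible way to do it.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn; return (DataFrame, encoding_used)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate
            last_error = err
    raise ValueError(f"none of {encodings} could decode {path}") from last_error


# Demo on a made-up file written in Latin-1 (assumption: not the real autos.csv)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="latin-1") as fh:
        fh.write("name,price\nFord_Focus_TÜV_neu,1350\n")
    demo, used_encoding = read_csv_with_fallback(demo_path)

print(used_encoding)
print(demo)
```

Because Latin-1 maps every byte to a character, it never raises a decoding error, which is why it works well as the last fallback.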


In [275]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


The number of rows is 50000 and the number of columns is 20.


## Cleaning and Organising the Data

We clean the data so that it is consistent for everyone accessing and analysing it.

In [276]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [277]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

- Most of the columns have all 50000 values, except for <b>'vehicleType'</b>, <b>'gearbox'</b>, <b>'model'</b>, <b>'fuelType'</b> and <b>'notRepairedDamage'</b>. These columns have some missing values.
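As a minimal sketch, the missing values per column could also be counted directly instead of being read off `info()`. The small `sample` frame below is made up for illustration; on the real data the same calls would be applied to `autos`.

```python
import pandas as pd

# Hypothetical miniature of the dataset with a few gaps
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
    "brand": ["peugeot", "bmw", "ford", "audi"],
})

# Number of missing values per column
missing_count = sample.isnull().sum()

# Share of missing values per column, in percent
missing_pct = sample.isnull().mean() * 100

print(missing_count)
print(missing_pct)
```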


In [278]:
# using column attribute on the dataframe object 
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


In [279]:
autos = autos.rename(columns={"yearOfRegistration":"registration_year"
                      ,"monthOfRegistration":"registration_month"
                      ,"notRepairedDamage":"unrepaired_damage"
                      ,"dateCreated":"ad_created"})
new_columns = []
for c in autos.columns:
    index=0 
    i=0
    while i<(len(c)):
        if c[i].isupper():
            index = i
            if index!=len(c)-1:
                c=c[:index]+'_'+c[index:]
                i=i+1
        i=i+1
    
    new_columns.append(c.lower())
autos.columns = new_columns
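The camelCase to snake_case loop above can also be sketched with a regular expression. The function name `to_snake_case` is an assumption for illustration; the explicit renames (for example `yearOfRegistration` to `registration_year`) would still be done separately.

```python
import re


def to_snake_case(name):
    """Insert '_' before an uppercase letter that follows a lowercase
    letter or digit, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


examples = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [to_snake_case(n) for n in examples]
print(converted)
```

The lookbehind keeps runs of capitals together, so `powerPS` becomes `power_ps` rather than `power_p_s`.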


In [280]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


We have made the above changes so that the column labels are consistent with our chosen naming convention (snake_case).


In [281]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-29 23:42:13,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Columns that have mostly one value:
1. seller
2. offer_type

- The value of <b>nr_of_pictures</b> is 0 for all the rows.

Also, we can see that price and odometer are stored as strings, with a currency symbol or unit attached to the numbers.


The problems we want to solve:
- convert <b>price</b> and <b>odometer</b> to a numeric type (renaming <b>odometer</b> to <b>odometer_km</b>).
- remove <i>seller</i>, <i>offer_type</i> and <i>nr_of_pictures</i>
- the minimum value in the __registration_month__ column is 0 which is impossible.
- the __registration_year__ minimum value is 1000 which is impossible.

In [282]:
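# regex=False treats '$' as a literal character rather than a regex anchor
autos['price'] = autos['price'].str.replace('$','', regex=False)
autos['price'] = autos['price'].str.replace(',','', regex=False).astype(int)
autos['odometer'] = autos['odometer'].str.replace('km','', regex=False)
autos['odometer'] = autos['odometer'].str.replace(',','', regex=False).astype(int)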
autos = autos.rename({'odometer':'odometer_km'},axis = 1)
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [283]:
autos["price"].dtype

dtype('int64')

In [284]:
autos["odometer_km"].dtype

dtype('int64')

As we can see, the dtype of both series has changed to 'int64', as intended.
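The string-to-integer conversion can be sketched as one reusable helper. The name `strip_to_int` and the two-row `sample` frame are assumptions made for this illustration only.

```python
import pandas as pd


def strip_to_int(series, tokens):
    """Remove each literal token from the strings, then cast to int."""
    cleaned = series
    for token in tokens:
        # regex=False so that '$' is treated as a plain character
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)


sample = pd.DataFrame({
    "price": ["$5,000", "$8,500"],
    "odometer": ["150,000km", "70,000km"],
})

sample["price"] = strip_to_int(sample["price"], ["$", ","])
sample["odometer_km"] = strip_to_int(sample["odometer"], ["km", ","])
sample = sample.drop(columns="odometer")
print(sample.dtypes)
```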


In [285]:
price = autos["price"]
price.unique().shape

(2357,)

In [286]:
price.describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [287]:
price.max()

99999999

The minimum price of 0 dollars seems unrealistic, and the maximum of 99,999,999 dollars is also unrealistic for a used car.

In [288]:
price.value_counts().sort_index(ascending = False)

99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
            ... 
5              2
3              1
2              3
1            156
0           1421
Name: price, Length: 2357, dtype: int64

In my judgement, the price of a used car should be at least 1,000 dollars and below 500,000 dollars, so we keep only prices in that range.

In [289]:
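# blank out rows priced from 0 to 999 dollars
# (note: this assignment sets every column of the matching rows to NaN)
autos[autos["price"].between(0,999)] = np.nan
# similarly for rows priced from 500,000 to 100,000,000 dollars
autos[autos["price"].between(500000,100000000)] = np.nan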

In [290]:
autos["price"].value_counts().sort_index(ascending = False)

350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
           ... 
1050.0       95
1049.0        6
1040.0        1
1039.0        1
1000.0      639
Name: price, Length: 2089, dtype: int64

In [291]:
autos["price"].describe()

count     38626.000000
mean       7255.376275
std        9698.439853
min        1000.000000
25%        2200.000000
50%        4350.000000
75%        8950.000000
max      350000.000000
Name: price, dtype: float64

As we can see, the minimum and maximum values now lie within our chosen boundaries.
Also, the number of non-null prices (count) has decreased to 38626.
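The cell above blanks out whole rows. As a hedged alternative sketch, the rows could instead be filtered out, or only the `price` column could be masked, so that the other columns keep their values. The `sample` frame and its values are made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "price": [0, 999, 1000, 4350, 350000, 500000, 99999999],
    "brand": ["ford", "opel", "fiat", "smart", "porsche", "bmw", "audi"],
})

# True where the price lies in the range kept by the notebook (1000 to 499999)
in_range = sample["price"].between(1000, 499999)

# Option 1: drop the out-of-range rows entirely
filtered = sample[in_range].copy()

# Option 2: keep all rows but set only the price to NaN where out of range
masked = sample.copy()
masked["price"] = masked["price"].where(in_range)

print(filtered)
print(masked)
```

With either option, columns such as `brand` or `odometer_km` would not pick up extra NaN values from the price cleaning.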


After analysing the price values, we can try to analyse odometer's values.

In [292]:
autos["odometer_km"].value_counts(dropna = False)

150000.0    23314
NaN         11374
125000.0     4340
100000.0     1860
90000.0      1569
80000.0      1334
70000.0      1154
60000.0      1099
50000.0       986
40000.0       795
30000.0       748
20000.0       692
5000.0        507
10000.0       228
Name: odometer_km, dtype: int64

In [293]:
autos["odometer_km"].describe()

count     38626.000000
mean     122778.568840
std       40796.873127
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

As we can see, the odometer values are reasonable. The maximum value is not extremely high, at 150,000 km. The 11374 NaN entries are the rows that were blanked out in the price step above.

## Exploring the Date range of the data

As seen from autos.info(), the values of date_crawled, last_seen and ad_created are of dtype 'object', so we will have to process them before we can analyse their distribution. The other two, registration_month and registration_year, have a numeric dtype.

In [294]:
date_col = ["date_crawled", "last_seen" ,"ad_created", 
            "registration_month","registration_year"]
autos[date_col]

Unnamed: 0,date_crawled,last_seen,ad_created,registration_month,registration_year
0,2016-03-26 17:47:46,2016-04-06 06:45:54,2016-03-26 00:00:00,3.0,2004.0
1,2016-04-04 13:38:56,2016-04-06 14:45:08,2016-04-04 00:00:00,6.0,1997.0
2,2016-03-26 18:57:24,2016-04-06 20:15:37,2016-03-26 00:00:00,7.0,2009.0
3,2016-03-12 16:58:10,2016-03-15 03:16:28,2016-03-12 00:00:00,6.0,2007.0
4,2016-04-01 14:38:50,2016-04-01 14:38:50,2016-04-01 00:00:00,7.0,2003.0
...,...,...,...,...,...
49995,2016-03-27 14:38:19,2016-04-01 13:47:40,2016-03-27 00:00:00,1.0,2011.0
49996,2016-03-28 10:50:25,2016-04-02 14:18:02,2016-03-28 00:00:00,5.0,1996.0
49997,2016-04-02 14:44:48,2016-04-04 11:47:27,2016-04-02 00:00:00,11.0,2014.0
49998,2016-03-08 19:25:42,2016-04-05 16:45:07,2016-03-08 00:00:00,11.0,2013.0


We have to be careful with some column labels when referring to the data dictionary here, as we changed their names and format at the start of this project.

In [295]:
autos["registration_month"].value_counts()

3.0     4046
6.0     3532
4.0     3307
5.0     3283
7.0     3178
10.0    2981
11.0    2813
9.0     2797
12.0    2733
1.0     2578
8.0     2547
0.0     2424
2.0     2407
Name: registration_month, dtype: int64

The '0' value is invalid for month, so we will need to remove those values.

In [296]:
autos = autos[~(autos["registration_month"] == 0)]

In [297]:
autos["registration_month"].value_counts()

3.0     4046
6.0     3532
4.0     3307
5.0     3283
7.0     3178
10.0    2981
11.0    2813
9.0     2797
12.0    2733
1.0     2578
8.0     2547
2.0     2407
Name: registration_month, dtype: int64
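One detail worth illustrating: comparing NaN with 0 is False, so the filter `~(... == 0)` keeps rows whose month is NaN. The sketch below uses a made-up `months` series to show this, together with a stricter filter based on `between`.

```python
import numpy as np
import pandas as pd

months = pd.Series([3.0, 0.0, np.nan, 12.0])

# NaN == 0 is False, so negating the comparison keeps the NaN entry
kept_with_nan = months[~(months == 0)]

# between() is False for NaN, so this keeps only valid months 1..12
kept_valid = months[months.between(1, 12)]

print(kept_with_nan)
print(kept_valid)
```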

In [298]:
date_crawled = autos['date_crawled'].str[:10].value_counts(normalize=True,
                    dropna=False).sort_index()*100
date_crawled

2016-03-05     1.944258
2016-03-06     1.065663
2016-03-07     2.671515
2016-03-08     2.463427
2016-03-09     2.452917
2016-03-10     2.541197
2016-03-11     2.536993
2016-03-12     2.858584
2016-03-13     1.223306
2016-03-14     2.782916
2016-03-15     2.545401
2016-03-16     2.211199
2016-03-17     2.309988
2016-03-18     0.977384
2016-03-19     2.654700
2016-03-20     2.930049
2016-03-21     2.829158
2016-03-22     2.473936
2016-03-23     2.431898
2016-03-24     2.194384
2016-03-25     2.299479
2016-03-26     2.520178
2016-03-27     2.404574
2016-03-28     2.684126
2016-03-29     2.570624
2016-03-30     2.509669
2016-03-31     2.368841
2016-04-01     2.652598
2016-04-02     2.770304
2016-04-03     2.995208
2016-04-04     2.831259
2016-04-05     1.013116
2016-04-06     0.256432
2016-04-07     0.117706
NaN           23.907012
Name: date_crawled, dtype: float64

The dates are from March and April 2016, implying that the site was crawled during that period.
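Instead of slicing the first ten characters of each string, the timestamps could be parsed with `pd.to_datetime` and reduced to their date. This is a sketch on a few made-up timestamps in the same format as the data, not the notebook's own method.

```python
import pandas as pd

stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    None,  # stands in for the rows that were blanked out earlier
])

# Parse the strings; unparseable or missing entries become NaT
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# normalize() sets the time to midnight, leaving one value per calendar day
daily_share = (
    parsed.dt.normalize()
    .value_counts(normalize=True, dropna=False)
    .sort_index()
    * 100
)
print(daily_share)
```

A parsed column also gives access to the `.dt` accessor (for example `.dt.month`), which string slicing does not.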

In [299]:
date_crawled.describe()

count    35.000000
mean      2.857143
std       3.735243
min       0.117706
25%       2.255339
50%       2.509669
75%       2.677821
max      23.907012
Name: date_crawled, dtype: float64

In [300]:
autos["date_crawled"].unique().shape

(35252,)

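There are 35252 unique values in the <b>date_crawled</b> column (NaN included), because each timestamp is recorded to the second. Grouped by day, the 35 distinct values (34 crawl dates plus NaN) each account for about 2.86% of the rows on average.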

In [301]:
autos["date_crawled"].shape

(47576,)

In [302]:
ad_created = autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index()*100
ad_created

2015-06-11     0.002102
2015-08-10     0.002102
2015-09-09     0.002102
2015-11-10     0.002102
2015-12-30     0.002102
                ...    
2016-04-04     2.860686
2016-04-05     0.912225
2016-04-06     0.264839
2016-04-07     0.102993
NaN           23.907012
Name: ad_created, Length: 75, dtype: float64

In [303]:
ad_created.describe()

count    75.000000
mean      1.333333
std       2.905960
min       0.002102
25%       0.002102
50%       0.018917
75%       2.464478
max      23.907012
Name: ad_created, dtype: float64

In [304]:
autos["ad_created"].unique().shape

(75,)

There are 75 unique values (74 dates plus NaN), so each one accounts for about 1.3% of the rows on average.

In [305]:
autos["ad_created"].dtype

dtype('O')

In [306]:
last_seen = autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index()*100
last_seen

2016-03-05     0.084076
2016-03-06     0.262738
2016-03-07     0.353119
2016-03-08     0.468724
2016-03-09     0.678914
2016-03-10     0.739869
2016-03-11     0.876492
2016-03-12     1.685724
2016-03-13     0.626366
2016-03-14     0.905919
2016-03-15     1.126618
2016-03-16     1.164453
2016-03-17     1.996805
2016-03-18     0.563309
2016-03-19     1.120313
2016-03-20     1.481840
2016-03-21     1.494451
2016-03-22     1.565916
2016-03-23     1.374643
2016-03-24     1.397764
2016-03-25     1.353624
2016-03-26     1.195981
2016-03-27     1.074071
2016-03-28     1.462922
2016-03-29     1.582731
2016-03-30     1.778207
2016-03-31     1.717252
2016-04-01     1.786615
2016-04-02     1.902220
2016-04-03     1.858080
2016-04-04     1.752985
2016-04-05    10.059694
2016-04-06    17.906087
2016-04-07    10.694468
NaN           23.907012
Name: last_seen, dtype: float64

In [307]:
last_seen.describe()

count    35.000000
mean      2.857143
std       5.075018
min       0.084076
25%       0.891206
50%       1.397764
75%       1.765596
max      23.907012
Name: last_seen, dtype: float64

In [308]:
last_seen.unique().shape

(35,)

The average share of each date is around 3%. The largest shares fall on the 5th, 6th and 7th of April, which together make up about 39% of all rows (10.06% + 17.91% + 10.69%).

In [309]:
registration_year = autos["registration_year"]
registration_year.describe()

count    36202.000000
mean      2004.989006
std         46.408696
min       1927.000000
25%       2001.000000
50%       2005.000000
75%       2009.000000
max       9000.000000
Name: registration_year, dtype: float64

The maximum value is 9000, which is impossible, and the minimum is 1927, which is unusually early for a used car listing. Therefore, we will have to fix it.

Registration year is the year the car was first registered. Realistically, we can consider 1950 the earliest and 2016 the latest plausible registration year (since the dataset is from 2016, any year > 2016 is definitely inaccurate).

In [310]:
autos.loc[autos['registration_year'] < 1950,'registration_year'] = np.nan
autos.loc[autos['registration_year'] > 2016,'registration_year'] = np.nan
autos["registration_year"].describe()

count    35064.000000
mean      2004.193703
std          6.724903
min       1950.000000
25%       2001.000000
50%       2005.000000
75%       2009.000000
max       2016.000000
Name: registration_year, dtype: float64

As we can see, the problem of inaccurate registration years has been fixed.
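The two `.loc` assignments above can be sketched as a single step with `Series.where`, which keeps values where the condition holds and sets the rest to NaN. The `years` series is made up for illustration.

```python
import pandas as pd

years = pd.Series([1000.0, 1927.0, 1950.0, 2004.0, 2016.0, 2017.0, 9000.0])

# Keep years in the plausible range 1950..2016 (inclusive), NaN otherwise
plausible = years.where(years.between(1950, 2016))

print(plausible.describe())
```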

In [311]:
autos["registration_year"].value_counts(normalize = True, dropna= False).sort_index().mean()*100

1.4705882352941175

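Therefore, on average, each registration year accounts for approx. 1.5% of the listings. Note that this figure is simply 100% divided by the 68 distinct values (NaN included).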
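As a quick check of why the mean share depends only on the number of distinct values, here is a small sketch on made-up data: the normalized counts always sum to 100%, so their mean is 100 divided by how many distinct values there are.

```python
import pandas as pd

years = pd.Series([2001, 2001, 2001, 2005, 2009, None])

# Share of each distinct value (NaN counted as its own value), in percent
shares = years.value_counts(normalize=True, dropna=False) * 100

mean_share = shares.mean()
expected = 100 / len(shares)  # 4 distinct values, including NaN

print(mean_share, expected)
```

The mean share therefore says nothing about how the listings are spread over the years; the individual shares or the median are more informative for that.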

## Aggregating Brand Data

In [312]:
autos["brand"].describe(include = 'all')

count          36202
unique            40
top       volkswagen
freq            7609
Name: brand, dtype: object

In [313]:
brand = autos["brand"].value_counts(normalize = True, dropna = True).sort_index()*100
brand

alfa_romeo         0.621513
audi               9.742556
bmw               12.474449
chevrolet          0.654660
chrysler           0.317662
citroen            1.383901
dacia              0.331473
daewoo             0.093917
daihatsu           0.160212
fiat               2.104856
ford               5.878128
honda              0.756864
hyundai            1.099387
jaguar             0.185073
jeep               0.276228
kia                0.798298
lada               0.058008
lancia             0.066295
land_rover         0.256892
mazda              1.386664
mercedes_benz     11.223137
mini               1.110436
mitsubishi         0.690570
nissan             1.397713
opel               8.842053
peugeot            2.751229
porsche            0.748577
renault            3.748412
rover              0.074582
saab               0.140876
seat               1.803768
skoda              1.922546
smart              1.682228
sonstige_autos     0.997182
subaru             0.160212
suzuki             0

We will select brands which account for more than 1% of the listings. Before this, I tried selecting brands with more than 5%, but too few brands remained.

In [314]:
bool_brand = brand>1
bool_brand

alfa_romeo        False
audi               True
bmw                True
chevrolet         False
chrysler          False
citroen            True
dacia             False
daewoo            False
daihatsu          False
fiat               True
ford               True
honda             False
hyundai            True
jaguar            False
jeep              False
kia               False
lada              False
lancia            False
land_rover        False
mazda              True
mercedes_benz      True
mini               True
mitsubishi        False
nissan             True
opel               True
peugeot            True
porsche           False
renault            True
rover             False
saab              False
seat               True
skoda              True
smart              True
sonstige_autos    False
subaru            False
suzuki            False
toyota             True
trabant           False
volkswagen         True
volvo             False
Name: brand, dtype: bool

In [315]:
sorted_brands = brand[bool_brand]

In [316]:
for i in sorted_brands:
    print(i)

9.742555659908291
12.47444892547373
1.3839014419092868
2.1048560852991547
5.878128280205513
1.0993867742113697
1.386663720236451
11.223136843268328
1.1104358875200264
1.3977128335451081
8.842052925252748
2.751229213855588
3.748411689961881
1.8037677476382519
1.9225457157063146
1.6822275012430252
1.4916302966686923
21.01817579139274


In [317]:
brand_list=sorted_brands.index
brand_list

Index(['audi', 'bmw', 'citroen', 'fiat', 'ford', 'hyundai', 'mazda',
       'mercedes_benz', 'mini', 'nissan', 'opel', 'peugeot', 'renault', 'seat',
       'skoda', 'smart', 'toyota', 'volkswagen'],
      dtype='object')

In [318]:
top_dict_price = {}
top_dict_mean = {}
for i in brand_list:
    total_price = autos.loc[autos["brand"]==i,"price"].sum()
    mean = autos.loc[autos["brand"] == i,"price"].mean()
    top_dict_price[i]=total_price
    top_dict_mean[i] = mean
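The loop above can also be sketched with `groupby` and named aggregation, which computes the total and mean price per brand in one pass. The `sample` frame, the 25% threshold and the name `summary` are assumptions chosen to keep the example small.

```python
import pandas as pd

sample = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "fiat", "bmw"],
    "price": [10000.0, 12000.0, 9000.0, 4000.0, None],
})

# Share of each brand in percent, then keep brands above the threshold
share = sample["brand"].value_counts(normalize=True) * 100
common_brands = share[share > 25].index

# Total and mean price per selected brand (NaN prices are skipped)
summary = (
    sample[sample["brand"].isin(common_brands)]
    .groupby("brand")["price"]
    .agg(total_price="sum", mean_price="mean")
    .sort_values("mean_price", ascending=False)
)
print(summary)
```

The result is already a DataFrame indexed by brand, so no intermediate dictionaries are needed.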

In [319]:
top_dict_price

{'audi': 37081584.0,
 'bmw': 41800984.0,
 'citroen': 2310480.0,
 'fiat': 3111017.0,
 'ford': 11413970.0,
 'hyundai': 2488337.0,
 'mazda': 2736561.0,
 'mercedes_benz': 38287813.0,
 'mini': 4330953.0,
 'nissan': 3232847.0,
 'opel': 13687737.0,
 'peugeot': 3990139.0,
 'renault': 4906862.0,
 'seat': 3657409.0,
 'skoda': 4828375.0,
 'smart': 2315984.0,
 'toyota': 3029570.0,
 'volkswagen': 51570062.0}

In [320]:
top_dict_mean

{'audi': 10513.633115962575,
 'bmw': 9256.196634189548,
 'citroen': 4611.736526946108,
 'fiat': 4082.6994750656168,
 'ford': 5363.707706766917,
 'hyundai': 6252.103015075377,
 'mazda': 5451.316733067729,
 'mercedes_benz': 9423.532611370909,
 'mini': 10773.514925373134,
 'nissan': 6389.025691699605,
 'opel': 4276.0815370196815,
 'peugeot': 4006.163654618474,
 'renault': 3615.9631540162122,
 'seat': 5600.932618683001,
 'skoda': 6937.32040229885,
 'smart': 3802.9293924466338,
 'toyota': 5610.314814814815,
 'volkswagen': 6777.508476803785}

In [321]:
sorted(top_dict_price.items(), key = lambda kv: kv[1])

[('citroen', 2310480.0),
 ('smart', 2315984.0),
 ('hyundai', 2488337.0),
 ('mazda', 2736561.0),
 ('toyota', 3029570.0),
 ('fiat', 3111017.0),
 ('nissan', 3232847.0),
 ('seat', 3657409.0),
 ('peugeot', 3990139.0),
 ('mini', 4330953.0),
 ('skoda', 4828375.0),
 ('renault', 4906862.0),
 ('ford', 11413970.0),
 ('opel', 13687737.0),
 ('audi', 37081584.0),
 ('mercedes_benz', 38287813.0),
 ('bmw', 41800984.0),
 ('volkswagen', 51570062.0)]

In [322]:
sorted(top_dict_mean.items(), key = lambda kv: kv[1])

[('renault', 3615.9631540162122),
 ('smart', 3802.9293924466338),
 ('peugeot', 4006.163654618474),
 ('fiat', 4082.6994750656168),
 ('opel', 4276.0815370196815),
 ('citroen', 4611.736526946108),
 ('ford', 5363.707706766917),
 ('mazda', 5451.316733067729),
 ('seat', 5600.932618683001),
 ('toyota', 5610.314814814815),
 ('hyundai', 6252.103015075377),
 ('nissan', 6389.025691699605),
 ('volkswagen', 6777.508476803785),
 ('skoda', 6937.32040229885),
 ('bmw', 9256.196634189548),
 ('mercedes_benz', 9423.532611370909),
 ('audi', 10513.633115962575),
 ('mini', 10773.514925373134)]

From the operations run above, we can draw some conclusions about the total and average prices per brand.

- <b>'volkswagen'</b> has the highest total listed price, followed by <b>'bmw'</b> and <b>'mercedes_benz'</b>.
- Although <b>'mini'</b> accounts for just about 1% of the listings, its average price is the highest, followed by <b>'audi'</b>. (<b>sonstige_autos</b> falls just under the 1% threshold, so it is not part of this comparison.)
- <b>'bmw'</b>, <b>'mercedes_benz'</b> and <b>'audi'</b> are in the top five for both total price and average price.
- <b>'smart'</b>, <b>'citroen'</b> and <b>'fiat'</b> are among the cheapest brands in terms of both total and average price.


In [411]:
autos["brand"].value_counts()

volkswagen        7609
bmw               4516
mercedes_benz     4063
audi              3527
opel              3201
ford              2128
renault           1357
peugeot            996
fiat               762
skoda              696
seat               653
smart              609
toyota             540
nissan             506
mazda              502
citroen            501
mini               402
hyundai            398
sonstige_autos     361
volvo              324
kia                289
honda              274
porsche            271
mitsubishi         250
chevrolet          237
alfa_romeo         225
suzuki             210
dacia              120
chrysler           115
jeep               100
land_rover          93
jaguar              67
subaru              58
daihatsu            58
saab                51
daewoo              34
rover               27
trabant             27
lancia              24
lada                21
Name: brand, dtype: int64

As we can see, <b>'volkswagen'</b>, <b>'bmw'</b>, <b>'mercedes_benz'</b> and <b>'audi'</b> are the most common brands in the listings, probably for two reasons:

- These car companies originate in <b>'Germany'</b> itself.
- They have a reasonable average price for their brand value and their mean mileage (discussed in the cells below).

In [323]:
pd.DataFrame(pd.Series(top_dict_mean),columns = ['mean_price'])

Unnamed: 0,mean_price
audi,10513.633116
bmw,9256.196634
citroen,4611.736527
fiat,4082.699475
ford,5363.707707
hyundai,6252.103015
mazda,5451.316733
mercedes_benz,9423.532611
mini,10773.514925
nissan,6389.025692


In [324]:
top_dict_mileage = {}
for i in brand_list:
    mean = autos.loc[autos["brand"]==i,"odometer_km"].mean()
    top_dict_mileage[i] = mean

In [325]:
top_dict_mileage

{'audi': 127262.54607314999,
 'bmw': 131789.19397697077,
 'citroen': 114071.85628742515,
 'fiat': 107224.4094488189,
 'ford': 119424.34210526316,
 'hyundai': 101909.54773869347,
 'mazda': 119103.58565737052,
 'mercedes_benz': 129851.09524981541,
 'mini': 88718.90547263682,
 'nissan': 110187.74703557312,
 'opel': 123633.23961262105,
 'peugeot': 121892.5702811245,
 'renault': 121591.74649963154,
 'seat': 116408.88208269526,
 'skoda': 109195.40229885057,
 'smart': 98021.34646962234,
 'toyota': 113833.33333333333,
 'volkswagen': 125384.41319490077}

In [326]:
mean_mileage = pd.Series(top_dict_mileage)
mean_mileage

audi             127262.546073
bmw              131789.193977
citroen          114071.856287
fiat             107224.409449
ford             119424.342105
hyundai          101909.547739
mazda            119103.585657
mercedes_benz    129851.095250
mini              88718.905473
nissan           110187.747036
opel             123633.239613
peugeot          121892.570281
renault          121591.746500
seat             116408.882083
skoda            109195.402299
smart             98021.346470
toyota           113833.333333
volkswagen       125384.413195
dtype: float64

In [327]:
mean_price = pd.Series(top_dict_mean)
mean_price = mean_price.sort_values(ascending = False)
mean_price

mini             10773.514925
audi             10513.633116
mercedes_benz     9423.532611
bmw               9256.196634
skoda             6937.320402
volkswagen        6777.508477
nissan            6389.025692
hyundai           6252.103015
toyota            5610.314815
seat              5600.932619
mazda             5451.316733
ford              5363.707707
citroen           4611.736527
opel              4276.081537
fiat              4082.699475
peugeot           4006.163655
smart             3802.929392
renault           3615.963154
dtype: float64

In [328]:
df = pd.DataFrame(mean_price, columns=['mean_price_$'])
df['mean_mileage_kms']= mean_mileage
df

Unnamed: 0,mean_price_$,mean_mileage_kms
mini,10773.514925,88718.905473
audi,10513.633116,127262.546073
mercedes_benz,9423.532611,129851.09525
bmw,9256.196634,131789.193977
skoda,6937.320402,109195.402299
volkswagen,6777.508477,125384.413195
nissan,6389.025692,110187.747036
hyundai,6252.103015,101909.547739
toyota,5610.314815,113833.333333
seat,5600.932619,116408.882083


In [329]:
df['mean_price_$']=df["mean_price_$"].astype(int)
df["mean_mileage_kms"]=df["mean_mileage_kms"].astype(int)
df

Unnamed: 0,mean_price_$,mean_mileage_kms
mini,10773,88718
audi,10513,127262
mercedes_benz,9423,129851
bmw,9256,131789
skoda,6937,109195
volkswagen,6777,125384
nissan,6389,110187
hyundai,6252,101909
toyota,5610,113833
seat,5600,116408


As the dataframe above shows, there is no direct relationship between the mean price and the mean mileage: as the mean price decreases, the mean mileage varies without any observable pattern.
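
One way to put a number on this observation is a correlation coefficient. The sketch below is a rough check only: it uses the ten rounded brand averages displayed in the table above (not the full listing data), and the names `summary` and `r` are introduced here for illustration.

```python
import pandas as pd

# Rounded brand averages copied from the ten-row table above.
summary = pd.DataFrame(
    {
        "mean_price": [10773, 10513, 9423, 9256, 6937,
                       6777, 6389, 6252, 5610, 5600],
        "mean_mileage_km": [88718, 127262, 129851, 131789, 109195,
                            125384, 110187, 101909, 113833, 116408],
    },
    index=["mini", "audi", "mercedes_benz", "bmw", "skoda",
           "volkswagen", "nissan", "hyundai", "toyota", "seat"],
)

# Pearson correlation between the two columns (ranges from -1 to 1).
r = summary["mean_price"].corr(summary["mean_mileage_km"])
print(f"Pearson r = {r:.2f}")
```

A value near 0 would be consistent with the absence of a linear relationship. With only ten brand-level averages, the result should be read as indicative rather than conclusive.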

In [330]:
autos[autos["unrepaired_damage"] == "nein"]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004.0,manuell,158.0,andere,150000.0,3.0,lpg,peugeot,nein,2016-03-26 00:00:00,0.0,79588.0,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997.0,automatik,286.0,7er,150000.0,6.0,benzin,bmw,nein,2016-04-04 00:00:00,0.0,71034.0,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009.0,manuell,102.0,golf,70000.0,7.0,benzin,volkswagen,nein,2016-03-26 00:00:00,0.0,35394.0,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,kleinwagen,2007.0,automatik,71.0,fortwo,70000.0,6.0,benzin,smart,nein,2016-03-12 00:00:00,0.0,33729.0,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,kombi,2003.0,manuell,0.0,focus,150000.0,7.0,benzin,ford,nein,2016-04-01 00:00:00,0.0,39218.0,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,24900.0,control,limousine,2011.0,automatik,239.0,q5,100000.0,1.0,diesel,audi,nein,2016-03-27 00:00:00,0.0,82131.0,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,1980.0,control,cabrio,1996.0,manuell,75.0,astra,150000.0,5.0,benzin,opel,nein,2016-03-28 00:00:00,0.0,44807.0,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,13200.0,test,cabrio,2014.0,automatik,69.0,500,5000.0,11.0,benzin,fiat,nein,2016-04-02 00:00:00,0.0,73430.0,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,22900.0,control,kombi,2013.0,manuell,150.0,a3,40000.0,11.0,diesel,audi,nein,2016-03-08 00:00:00,0.0,35683.0,2016-04-05 16:45:07


In [331]:
autos[autos["unrepaired_damage"] == 'ja']

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
28,2016-03-19 21:56:19,MINI_Cooper_D,privat,Angebot,5250.0,control,kleinwagen,2007.0,manuell,110.0,cooper,150000.0,7.0,diesel,mini,ja,2016-03-19 00:00:00,0.0,15745.0,2016-04-07 14:58:48
51,2016-03-22 12:57:38,Mercedes_Benz_S_320_CDI,privat,Angebot,6000.0,control,limousine,2005.0,automatik,204.0,s_klasse,150000.0,7.0,diesel,mercedes_benz,ja,2016-03-22 00:00:00,0.0,49492.0,2016-04-04 00:17:51
81,2016-03-12 19:46:34,Nissan_Micra_K12___super_Kleinwagen!,privat,Angebot,2000.0,test,kleinwagen,2007.0,manuell,65.0,micra,150000.0,9.0,benzin,nissan,ja,2016-03-12 00:00:00,0.0,52156.0,2016-03-26 05:15:53
100,2016-03-12 18:57:48,Mazda_5_2.0_CD_DPF_Exclusive,privat,Angebot,4500.0,test,bus,2007.0,manuell,143.0,5_reihe,90000.0,7.0,diesel,mazda,ja,2016-03-12 00:00:00,0.0,51375.0,2016-04-07 12:47:06
110,2016-03-09 20:52:02,Ford_Galaxy,privat,Angebot,1300.0,test,bus,2000.0,manuell,145.0,galaxy,150000.0,5.0,benzin,ford,ja,2016-03-09 00:00:00,0.0,59556.0,2016-03-11 05:44:34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49914,2016-03-16 12:47:53,Opel_zafira,privat,Angebot,1300.0,test,limousine,2003.0,manuell,101.0,zafira,150000.0,1.0,diesel,opel,ja,2016-03-16 00:00:00,0.0,12057.0,2016-03-16 12:47:53
49928,2016-03-30 10:54:53,Audi_A4,privat,Angebot,4900.0,test,kombi,2015.0,manuell,131.0,a4,150000.0,1.0,diesel,audi,ja,2016-03-30 00:00:00,0.0,74354.0,2016-04-07 02:16:00
49962,2016-03-14 17:57:15,Mitsubishi_Space_Star_1__3_L__Bj_2004_Standhei...,privat,Angebot,2200.0,test,limousine,2004.0,manuell,82.0,andere,150000.0,4.0,benzin,mitsubishi,ja,2016-03-14 00:00:00,0.0,45481.0,2016-03-15 13:45:08
49966,2016-04-02 19:49:19,Citroën_C1_1.0_**Euro4**TÜV_OKT_2017**Scheiten...,privat,Angebot,1490.0,control,kleinwagen,2006.0,manuell,68.0,c1,150000.0,7.0,benzin,citroen,ja,2016-04-02 00:00:00,0.0,26603.0,2016-04-02 19:49:19


In [346]:
autos["brand"][(autos["unrepaired_damage"] == 'ja') & (autos["brand"] == 'bmw')]


432      bmw
545      bmw
580      bmw
798      bmw
1109     bmw
        ... 
48886    bmw
49166    bmw
49464    bmw
49540    bmw
49773    bmw
Name: brand, Length: 259, dtype: object

In [372]:
unique_brand = autos["brand"].unique()
unique_brand 

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', nan,
       'audi', 'renault', 'sonstige_autos', 'mazda', 'porsche', 'mini',
       'mercedes_benz', 'seat', 'toyota', 'dacia', 'opel', 'jeep', 'saab',
       'volvo', 'nissan', 'jaguar', 'skoda', 'subaru', 'fiat',
       'mitsubishi', 'chevrolet', 'hyundai', 'honda', 'kia', 'citroen',
       'suzuki', 'trabant', 'daewoo', 'land_rover', 'alfa_romeo',
       'daihatsu', 'rover', 'lancia', 'lada'], dtype=object)

In [348]:
mean_price_all_brands= {}
for i in unique_brand:
    mean = autos.loc[autos["brand"] == i,"price"].mean()
    mean_price_all_brands[i] = mean
mean_price_all_brands

{'peugeot': 4006.163654618474,
 'bmw': 9256.196634189548,
 'volkswagen': 6777.508476803785,
 'smart': 3802.9293924466338,
 'ford': 5363.707706766917,
 'chrysler': 4542.069565217392,
 nan: nan,
 'audi': 10513.633115962575,
 'renault': 3615.9631540162122,
 'sonstige_autos': 15244.880886426592,
 'mazda': 5451.316733067729,
 'porsche': 47978.723247232476,
 'mini': 10773.514925373134,
 'mercedes_benz': 9423.532611370909,
 'seat': 5600.932618683001,
 'toyota': 5610.314814814815,
 'dacia': 6012.983333333334,
 'opel': 4276.0815370196815,
 'jeep': 12205.54,
 'saab': 4553.470588235294,
 'volvo': 6144.074074074074,
 'nissan': 6389.025691699605,
 'jaguar': 12501.820895522387,
 'skoda': 6937.32040229885,
 'subaru': 6388.189655172414,
 'fiat': 4082.6994750656168,
 'mitsubishi': 4832.844,
 'chevrolet': 7176.763713080169,
 'hyundai': 6252.103015075377,
 'honda': 5323.131386861314,
 'kia': 6848.294117647059,
 'citroen': 4611.736526946108,
 'suzuki': 5283.490476190476,
 'trabant': 3260.1481481481483,
 '

In [360]:
mean_price_all_brands = pd.Series(mean_price_all_brands)
mean_price_all_brands = mean_price_all_brands.sort_values(ascending = False)

In [361]:
unrepaired_damage_ja_dict = {}
unrepaired_damage_nein_dict = {}
for i in unique_brand:
    unrepaired_damage_ja_price = autos.loc[(autos["brand"] == i) & (autos["unrepaired_damage"] == 'ja'),"price"]
    unrepaired_damage_ja_dict[i] = unrepaired_damage_ja_price.mean()
    unrepaired_damage_nein_price = autos.loc[(autos["brand"] == i) & (autos["unrepaired_damage"] == 'nein'),"price"]
    unrepaired_damage_nein_dict[i] = unrepaired_damage_nein_price.mean()
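
The same split by damage status can be sketched with `groupby` and `unstack`. This is a minimal illustration on a made-up frame (the rows and the names `toy_autos`, `by_damage` and `pct_diff` are assumptions, not part of the notebook). It also reports how many listings each mean is based on, which may matter for brands with few damaged listings.

```python
import pandas as pd

# Hypothetical miniature stand-in for the cleaned `autos` frame.
toy_autos = pd.DataFrame({
    "brand": ["bmw", "bmw", "bmw", "opel", "opel", "opel"],
    "unrepaired_damage": ["ja", "nein", "nein", "ja", "nein", None],
    "price": [5000.0, 9000.0, 11000.0, 2000.0, 3000.0, 3500.0],
})

# Mean price and listing count per (brand, damage status).
# Rows with a missing damage status are dropped by groupby.
by_damage = (
    toy_autos.groupby(["brand", "unrepaired_damage"])["price"]
    .agg(["mean", "count"])
    .unstack("unrepaired_damage")
)

mean_ja = by_damage[("mean", "ja")]      # cars with unrepaired damage
mean_nein = by_damage[("mean", "nein")]  # cars without unrepaired damage

# Percentage by which the undamaged mean exceeds the damaged mean.
pct_diff = (mean_nein - mean_ja) * 100 / mean_ja
print(by_damage)
print(pct_diff)
```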

In [362]:
unrepaired_damage_ja_dict

{'peugeot': 2596.0131578947367,
 'bmw': 5093.131274131274,
 'volkswagen': 3958.844611528822,
 'smart': 1825.851851851852,
 'ford': 2988.3680555555557,
 'chrysler': 4190.727272727273,
 nan: nan,
 'audi': 4742.314285714286,
 'renault': 2502.164835164835,
 'sonstige_autos': 8779.869565217392,
 'mazda': 2455.5588235294117,
 'porsche': 14670.0,
 'mini': 4595.0,
 'mercedes_benz': 5132.489795918367,
 'seat': 3062.121951219512,
 'toyota': 4296.15625,
 'dacia': 5011.0,
 'opel': 2662.6936936936936,
 'jeep': 3414.1428571428573,
 'saab': 1700.0,
 'volvo': 3268.4117647058824,
 'nissan': 3738.0731707317073,
 'jaguar': 5614.142857142857,
 'skoda': 4957.717391304348,
 'subaru': 5207.0,
 'fiat': 2684.6666666666665,
 'mitsubishi': 2459.4545454545455,
 'chevrolet': 4506.076923076923,
 'hyundai': 3792.6060606060605,
 'honda': 2821.125,
 'kia': 3159.12,
 'citroen': 3388.8529411764707,
 'suzuki': 3337.8571428571427,
 'trabant': 1350.0,
 'daewoo': nan,
 'land_rover': 5223.75,
 'alfa_romeo': 4226.0,
 'daihats

In [363]:
unrepaired_damage_nein_dict

{'peugeot': 4255.284301606922,
 'bmw': 9925.5,
 'volkswagen': 7352.649022801303,
 'smart': 3976.042596348884,
 'ford': 5732.630510440836,
 'chrysler': 4741.122222222222,
 nan: nan,
 'audi': 11454.145404663923,
 'renault': 3909.281793229643,
 'sonstige_autos': 16889.297872340427,
 'mazda': 5923.506053268765,
 'porsche': 51125.256097560974,
 'mini': 11196.214092140921,
 'mercedes_benz': 10220.374962201391,
 'seat': 5988.024029574862,
 'toyota': 5907.011061946902,
 'dacia': 6217.416666666667,
 'opel': 4577.949244478884,
 'jeep': 12537.595238095239,
 'saab': 4968.863636363636,
 'volvo': 6647.208029197081,
 'nissan': 6903.976247030879,
 'jaguar': 14118.357142857143,
 'skoda': 7295.691652470187,
 'subaru': 6962.739130434783,
 'fiat': 4348.179571663921,
 'mitsubishi': 5263.656862745098,
 'chevrolet': 7635.208556149732,
 'hyundai': 6671.696048632219,
 'honda': 5804.497737556561,
 'kia': 7533.547008547009,
 'citroen': 4783.72572815534,
 'suzuki': 5444.207650273224,
 'trabant': 3653.8,
 'daewoo'

In [373]:
percentage_diff = {}
for i in unique_brand:
    unrepaired_damage_ja_price = autos.loc[(autos["brand"] == i) & (autos["unrepaired_damage"] == 'ja'),"price"]
    unrepaired_damage_nein_price = autos.loc[(autos["brand"] == i) & (autos["unrepaired_damage"] == 'nein'),"price"]
    percentage_diff[i] = (unrepaired_damage_nein_price.mean()-unrepaired_damage_ja_price.mean())*100/unrepaired_damage_ja_price.mean()
percentage_diff

{'peugeot': 63.91612995743781,
 'bmw': 94.88011334820688,
 'volkswagen': 85.72714375778101,
 'smart': 117.76370258716351,
 'ford': 91.8314747001639,
 'chrysler': 13.133637998274201,
 nan: nan,
 'audi': 141.5307108423478,
 'renault': 56.235981670332734,
 'sonstige_autos': 92.36388134112609,
 'mazda': 141.22843226190037,
 'porsche': 248.50208655460784,
 'mini': 143.66080722831165,
 'mercedes_benz': 99.13093583408943,
 'seat': 95.5514550029625,
 'toyota': 37.49525664823998,
 'dacia': 24.075367524778827,
 'opel': 71.92924801381658,
 'jeep': 267.2252674449419,
 'saab': 192.28609625668446,
 'volvo': 103.37731313347078,
 'nissan': 84.69344851479896,
 'jaguar': 151.47840911982496,
 'skoda': 47.15828024539194,
 'subaru': 33.71882332311855,
 'fiat': 61.96348044439736,
 'mitsubishi': 114.01724510311257,
 'chevrolet': 69.44248148644826,
 'hyundai': 75.91323596540575,
 'honda': 105.75117152045944,
 'kia': 138.46979565660718,
 'citroen': 41.160617211516616,
 'suzuki': 63.10487289498212,
 'trabant': 

In [374]:
df1 = pd.DataFrame(pd.Series(mean_price_all_brands), columns = ["mean_price"])
df1

Unnamed: 0,mean_price
porsche,47978.723247
land_rover,19685.956989
sonstige_autos,15244.880886
jaguar,12501.820896
jeep,12205.54
mini,10773.514925
audi,10513.633116
mercedes_benz,9423.532611
bmw,9256.196634
chevrolet,7176.763713


In [375]:
df1["damage_repaired"] = pd.Series(unrepaired_damage_nein_dict)
df1.head()

Unnamed: 0,mean_price,damage_repaired
porsche,47978.723247,51125.256098
land_rover,19685.956989,21688.512821
sonstige_autos,15244.880886,16889.297872
jaguar,12501.820896,14118.357143
jeep,12205.54,12537.595238


In [376]:
df1["damage_not_repaired"] = pd.Series(unrepaired_damage_ja_dict)
df1.head()

Unnamed: 0,mean_price,damage_repaired,damage_not_repaired
porsche,47978.723247,51125.256098,14670.0
land_rover,19685.956989,21688.512821,5223.75
sonstige_autos,15244.880886,16889.297872,8779.869565
jaguar,12501.820896,14118.357143,5614.142857
jeep,12205.54,12537.595238,3414.142857


In [382]:
df1["percentage_increase"] = pd.Series(percentage_diff)

In [389]:
df1

Unnamed: 0,mean_price,damage_repaired,damage_not_repaired,perecentage_diff,perecentage_increase,percentage_increase
porsche,47978.723247,51125.256098,14670.0,248.502087,248.502087,248.502087
land_rover,19685.956989,21688.512821,5223.75,315.190482,315.190482,315.190482
sonstige_autos,15244.880886,16889.297872,8779.869565,92.363881,92.363881,92.363881
jaguar,12501.820896,14118.357143,5614.142857,151.478409,151.478409,151.478409
jeep,12205.54,12537.595238,3414.142857,267.225267,267.225267,267.225267
mini,10773.514925,11196.214092,4595.0,143.660807,143.660807,143.660807
audi,10513.633116,11454.145405,4742.314286,141.530711,141.530711,141.530711
mercedes_benz,9423.532611,10220.374962,5132.489796,99.130936,99.130936,99.130936
bmw,9256.196634,9925.5,5093.131274,94.880113,94.880113,94.880113
chevrolet,7176.763713,7635.208556,4506.076923,69.442481,69.442481,69.442481


- As we can see, the percentage increase is not directly related to the average value of the car, but it is substantial for the top-end cars, where repairing damage could bring a large return.
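
As a worked check of how the percentage figures are computed, the snippet below applies the formula `(nein - ja) * 100 / ja` to the two bmw means printed earlier in this notebook. The variable names are introduced here for illustration only.

```python
# Mean prices for bmw as printed earlier in this notebook.
mean_not_damaged = 9925.5          # unrepaired_damage == 'nein'
mean_damaged = 5093.131274131274   # unrepaired_damage == 'ja'

# Percentage by which the undamaged mean exceeds the damaged mean.
pct_increase = (mean_not_damaged - mean_damaged) * 100 / mean_damaged
print(round(pct_increase, 2))  # 94.88, matching the bmw row in the table
```

Note that the base of the percentage is the damaged price, so a value above 100 means undamaged cars of that brand are listed at more than twice the damaged price on average.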

In [406]:
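# Boolean mask: True for brands whose price increase is at least 100%
bool_percentage_increase = df1['percentage_increase'] >= 100
# Keep only the entries where the mask is True and take the brand names
profit_on_repair = bool_percentage_increase[bool_percentage_increase].index
profit_on_repair = list(profit_on_repair)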

In [408]:
profit_on_repair

['porsche',
 'land_rover',
 'jaguar',
 'jeep',
 'mini',
 'audi',
 'kia',
 'volvo',
 'mazda',
 'honda',
 'mitsubishi',
 'saab',
 'smart',
 'trabant',
 'rover']

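The list above contains the brands whose undamaged listings are priced, on average, at least 100% higher than their damaged ones, so these are the brands that may give high returns on repairing damage.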

## Conclusion

- The highest sales are recorded by <b>volkswagen</b>, <b>mercedes_benz</b> and <b>bmw</b>.
- The average sale values for <b>mini</b> and <b>sonstige_autos</b> are the highest, but these brands do not contribute much to the total sales.
- We cannot observe a direct relationship between the average price and the mileage of a car.
- The percentage increase is not directly related to the average value of the car, but it is substantial for the top-end cars, where repairing damage could bring a large return.