# eBay Car Sales Data

In this project, I worked with a dataset of used cars from *eBay Kleinanzeigen*, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user [orgesleka](https://www.kaggle.com/orgesleka).
The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

The aim of this project is to clean the data and analyze the included used car listings.

In [1]:
import pandas as pd
import numpy as np
from datetime import timedelta

In [2]:
autos = pd.read_csv('autos.csv', encoding='Windows-1251')

In [3]:
# changing columns position function
def change_col_position(column_change, position):
    columns = list(autos.columns)
    columns.remove(column_change)
    columns.insert(position, column_change)
    return autos[columns]

---
### Dataset overview

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TЬRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [5]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

- The dataset contains 20 columns, most of which are strings.
- Some columns have null values, but none have more than ~20% null values.
- The column names use camelCase. We will rename them so that the words are separated by spaces.
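
As a rough illustration of the renaming idea, the sketch below splits camelCase names at each lower-to-upper boundary. The helper `camel_to_title` is hypothetical and not part of the notebook. The list used in this notebook is written by hand because some names are also reworded (for example "yearOfRegistration" becomes "Registration Year").

```python
import re

def camel_to_title(name):
    """Insert a space at each lower-to-upper boundary and capitalise the first letter."""
    # "dateCrawled" -> "date Crawled"; runs of capitals such as "PS" stay together
    spaced = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)
    return spaced[0].upper() + spaced[1:]

original = ['dateCrawled', 'yearOfRegistration', 'powerPS', 'abtest']
print([camel_to_title(c) for c in original])
```

An automatic conversion like this gives a consistent starting point, and individual names can still be adjusted by hand afterwards.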

---
### Cleaning Column Names

In [6]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [7]:
new_columns = ['Date Crawled', 'Name', 'Seller', 'Offer Type', 'Price EUR', 'Abtest',
       'Vehicle Type', 'Registration Year', 'Gearbox', 'Power PS', 'Model',
       'Kilometer', 'Registration Month', 'Fuel Type', 'Brand',
       'Unepaired Damage', 'Ad Created', 'Nr of Pictures', 'Postal Code',
       'Last Seen']

autos.columns = new_columns
autos.head()

Unnamed: 0,Date Crawled,Name,Seller,Offer Type,Price EUR,Abtest,Vehicle Type,Registration Year,Gearbox,Power PS,Model,Kilometer,Registration Month,Fuel Type,Brand,Unepaired Damage,Ad Created,Nr of Pictures,Postal Code,Last Seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TЬRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


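Changes

- "price" to "Price EUR" (assuming the currency is the euro).
- "yearOfRegistration" to "Registration Year".
- "monthOfRegistration" to "Registration Month".
- "notRepairedDamage" to "Unepaired Damage" (a misspelling of "Unrepaired Damage", which is corrected before the damage analysis at the end).
- "dateCreated" to "Ad Created".
- The remaining camelCase names were split into separate capitalised words.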

---
### Dropping useless columns

In [8]:
autos.describe(include='all')

Unnamed: 0,Date Crawled,Name,Seller,Offer Type,Price EUR,Abtest,Vehicle Type,Registration Year,Gearbox,Power PS,Model,Kilometer,Registration Month,Fuel Type,Brand,Unepaired Damage,Ad Created,Nr of Pictures,Postal Code,Last Seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-06 13:45:54
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


In [9]:
autos['Nr of Pictures'].value_counts(dropna=False)

0    371528
Name: Nr of Pictures, dtype: int64

The columns "Nr of Pictures", "Seller" and "Offer Type" each contain almost exclusively a single value, so they carry no useful information. We will drop them.
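
A minimal sketch of how such columns could be spotted automatically, assuming a threshold of 99% for the most frequent value. The helper `dominant_share`, the threshold and the toy data are all illustrative and do not come from the notebook.

```python
import pandas as pd

def dominant_share(df):
    """Return, per column, the share of rows taken by its most frequent value."""
    return pd.Series({
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    })

# Toy data standing in for the real listings (values are made up)
toy = pd.DataFrame({
    'Seller': ['privat', 'privat', 'privat', 'privat'],
    'Gearbox': ['manuell', 'automatik', 'manuell', 'automatik'],
})

shares = dominant_share(toy)
# Columns where a single value covers more than 99% of the rows
print(shares[shares > 0.99].index.tolist())
```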

In [10]:
autos = autos.drop(["Seller", "Offer Type", "Nr of Pictures"], axis=1)
autos.head()

Unnamed: 0,Date Crawled,Name,Price EUR,Abtest,Vehicle Type,Registration Year,Gearbox,Power PS,Model,Kilometer,Registration Month,Fuel Type,Brand,Unepaired Damage,Ad Created,Postal Code,Last Seen
0,2016-03-24 11:52:17,Golf_3_1.6,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TЬRER,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,60437,2016-04-06 10:17:21


---
### Exploring the Kilometer and Price EUR columns

In [11]:
autos['Kilometer'].value_counts().sort_index()

5000        7069
10000       1949
20000       5676
30000       6041
40000       6376
50000       7615
60000       8669
70000       9773
80000      11053
90000      12523
100000     15920
125000     38067
150000    240797
Name: Kilometer, dtype: int64

We can see that the values in this field are rounded, which might indicate that sellers had to choose from pre-set options. Additionally, there are far more high-mileage than low-mileage vehicles: 240,797 of the 371,528 listings (about 65%) are at 150,000 km.

In [12]:
autos['Price EUR'].value_counts().sort_index()

0             10778
1              1189
2                12
3                 8
4                 1
              ...  
32545461          1
74185296          1
99000000          1
99999999         15
2147483647        1
Name: Price EUR, Length: 5597, dtype: int64

In [13]:
autos['Price EUR'].value_counts().sort_index(ascending=False).head(10)

2147483647     1
99999999      15
99000000       1
74185296       1
32545461       1
27322222       1
14000500       1
12345678       9
11111111      10
10010011       1
Name: Price EUR, dtype: int64

Given that eBay is an auction site, there could legitimately be items where the opening bid is €1. We will keep the €1 items, but remove the 10,778 listings priced at €0 and anything above €350,000, since prices increase steadily up to that figure and then jump to less realistic numbers.

In [14]:
# keep only listings priced from 1 to 350,000 (both bounds inclusive)
autos = autos[autos['Price EUR'].between(1, 350000)]

In [17]:
autos['Price EUR'].value_counts(bins=10)

(-349.0, 35000.9]       356773
(35000.9, 70000.8]        3236
(70000.8, 105000.7]        358
(105000.7, 140000.6]       133
(140000.6, 175000.5]        52
(175000.5, 210000.4]        31
(210000.4, 245000.3]        21
(245000.3, 280000.2]        16
(280000.2, 315000.1]         8
(315000.1, 350000.0]         7
Name: Price EUR, dtype: int64

We can see that most prices fall in the lowest bin: 356,773 of the 360,635 listings (about 99%) cost roughly €35,000 or less.

In [15]:
autos.shape

(360635, 17)

The dataset now has 360,635 rows and 17 columns.

---
### Exploring the date columns

There are a number of columns with date information:

- Date Crawled
- Registration Month
- Registration Year
- Ad Created
- Last Seen

We'll explore each of these columns to learn more about them.

In [19]:
autos[['Date Crawled', 'Registration Month', 'Registration Year', 'Ad Created', 'Last Seen']].head()

Unnamed: 0,Date Crawled,Registration Month,Registration Year,Ad Created,Last Seen
0,2016-03-24 11:52:17,0,1993,2016-03-24 00:00:00,2016-04-07 03:16:57
1,2016-03-24 10:58:45,5,2011,2016-03-24 00:00:00,2016-04-07 01:46:50
2,2016-03-14 12:52:21,8,2004,2016-03-14 00:00:00,2016-04-05 12:47:46
3,2016-03-17 16:54:04,6,2001,2016-03-17 00:00:00,2016-03-17 17:40:17
4,2016-03-31 17:25:20,7,2008,2016-03-31 00:00:00,2016-04-06 10:17:21


The values in the "Date Crawled", "Ad Created" and "Last Seen" columns appear to be in "yyyy-mm-dd hh:mm:ss" format. We will use a regular expression to verify this.

In [20]:
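# accept any year and month so that older "Ad Created" dates are checked too
pattern = r'^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01]) (?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$'

for column in ['Date Crawled', 'Ad Created', 'Last Seen']:
    # .sum() counts the rows that match; .shape[0] would only return the row count
    print(autos[column].str.contains(pattern).sum())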

360635
360635
360635


Since the number of matches equals the number of rows (360,635), we can conclude that every entry in the "Date Crawled", "Ad Created" and "Last Seen" columns is in "yyyy-mm-dd hh:mm:ss" format. Next, we will convert these three columns to datetime.

In [22]:
for column in ['Date Crawled', 'Ad Created', 'Last Seen']:
    autos[column] = pd.to_datetime(autos[column])
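
As a cross-check on the regular expression, the format can also be validated by parsing with an explicit format string and counting the failures. This is only a sketch on made-up values, not something the notebook runs.

```python
import pandas as pd

# Toy values: two well-formed timestamps and one malformed entry (made up)
raw = pd.Series(['2016-03-24 11:52:17', '2015-06-11 00:00:00', '24/03/2016'])

# errors='coerce' turns anything that does not fit the format into NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

n_valid = parsed.notna().sum()
print(n_valid, 'of', len(raw), 'entries match the format')
```

Unlike a regular expression, this approach also rejects impossible calendar dates such as 30 February.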

We will look at the relative frequency distribution of the "Date Crawled", "Ad Created" and "Last Seen" columns, aggregated by day (yyyy-mm-dd).

In [24]:
autos['Date Crawled'].dt.to_period('D').value_counts(normalize=True).sort_index()

2016-03-05    0.025547
2016-03-06    0.014483
2016-03-07    0.035657
2016-03-08    0.033469
2016-03-09    0.034115
2016-03-10    0.032645
2016-03-11    0.032773
2016-03-12    0.036242
2016-03-13    0.015783
2016-03-14    0.036330
2016-03-15    0.033424
2016-03-16    0.030205
2016-03-17    0.031647
2016-03-18    0.013119
2016-03-19    0.035271
2016-03-20    0.036400
2016-03-21    0.035682
2016-03-22    0.032493
2016-03-23    0.032002
2016-03-24    0.029914
2016-03-25    0.032800
2016-03-26    0.031974
2016-03-27    0.030227
2016-03-28    0.035063
2016-03-29    0.034126
2016-03-30    0.033535
2016-03-31    0.031872
2016-04-01    0.034145
2016-04-02    0.035094
2016-04-03    0.038812
2016-04-04    0.037628
2016-04-05    0.012780
2016-04-06    0.003128
2016-04-07    0.001617
Freq: D, Name: Date Crawled, dtype: float64

It looks like the site was crawled daily for roughly one month, from 5 March to 7 April 2016. The shares for "2016-04-06" and "2016-04-07" are much lower than those of the other days.

In [25]:
autos['Ad Created'].value_counts(normalize=True).sort_index()

2014-03-10    0.000003
2015-03-20    0.000003
2015-06-11    0.000003
2015-06-18    0.000003
2015-08-07    0.000003
                ...   
2016-04-03    0.039001
2016-04-04    0.037736
2016-04-05    0.011613
2016-04-06    0.003119
2016-04-07    0.001553
Name: Ad Created, Length: 114, dtype: float64

The frequency table above has 114 rows, which is too many to read at a glance. We will aggregate it into 10 intervals instead.

In [27]:
ad_created_freq  = autos['Ad Created'].value_counts(normalize=True).sort_index()

range_days = abs((ad_created_freq.index.max() - ad_created_freq.index.min()).days)
range_days

759

In [28]:
start = ad_created_freq.index.min()
end = ad_created_freq.index.max() + timedelta(days=1)

ad_created_freq.groupby( pd.cut(ad_created_freq.index, pd.date_range(start, end, freq='76D'), right=False) ).sum()

[2014-03-10, 2014-05-25)    0.000003
[2014-05-25, 2014-08-09)    0.000000
[2014-08-09, 2014-10-24)    0.000000
[2014-10-24, 2015-01-08)    0.000000
[2015-01-08, 2015-03-25)    0.000003
[2015-03-25, 2015-06-09)    0.000000
[2015-06-09, 2015-08-24)    0.000011
[2015-08-24, 2015-11-08)    0.000014
[2015-11-08, 2016-01-23)    0.000147
[2016-01-23, 2016-04-08)    0.999823
Name: Ad Created, dtype: float64

The ad creation dates span more than two years, but about 99.98% of them fall in the last interval, that is, within roughly two and a half months of the end of the crawl. The oldest ad dates from March 2014.

In [29]:
autos['Last Seen'].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05 14:15:08    0.000003
2016-03-05 14:15:16    0.000003
2016-03-05 14:15:39    0.000003
2016-03-05 14:18:30    0.000003
2016-03-05 14:25:59    0.000003
                         ...   
2016-04-07 14:58:47    0.000025
2016-04-07 14:58:48    0.000025
2016-04-07 14:58:49    0.000031
2016-04-07 14:58:50    0.000028
2016-04-07 14:58:51    0.000003
Name: Last Seen, Length: 178199, dtype: float64

In [None]:
# "Last Seen" is already a datetime column, so aggregate it by day
autos["Last Seen"].dt.to_period("D").value_counts(normalize=True, dropna=False).sort_index()

<font color="Coral">
The distribution looks fine. The last three days contain a disproportionate share of the values. It's unlikely that there was a massive spike in sales; more likely these values reflect the end of the crawling period and don't indicate car sales.
</font>

In [None]:
autos["Registration Year"].describe()

<font color="Coral">
The minimum and maximum values are not plausible registration years.
</font>

<font color="DarkGoldenRod">
<h2>Step 6: Dealing with Incorrect Registration Year Data</h2>
</font>

In [None]:
print(autos["Registration Year"].value_counts(dropna=False).sort_index().head(10))

autos["Registration Year"].value_counts(dropna=False).sort_index(ascending=False).head(20)

<font color="Coral">
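We will keep only the rows whose "Registration Year" lies between 1900 and 2016. A car cannot have been first registered after the listing was crawled in 2016, and years before 1900 are not plausible for cars on sale today.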
</font>
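
Before dropping rows, it can help to check how much of the data falls outside the chosen range. The sketch below does this on made-up years; the numbers are illustrative only.

```python
import pandas as pd

# Toy registration years, including two implausible entries (made up)
years = pd.Series([1000, 1998, 2005, 2016, 2019, 2003])

valid = years.between(1900, 2016)   # inclusive on both ends
print(f'{(~valid).mean():.1%} of rows fall outside 1900-2016')

kept = years[valid]
print(kept.tolist())
```

If the share of excluded rows is small, removing them is unlikely to distort the later analysis.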

In [None]:
autos = autos[autos["Registration Year"].between(1900, 2016)]

In [None]:
autos["Registration Year"].value_counts(normalize=True).head(20)

<font color="Coral">
The distribution looks fine. Most of the vehicles were first registered in the past 20 years.
</font>

<font color="DarkGoldenRod">
<h2>Step 7: Exploring Price by Brand</h2>
</font>

In [None]:
# bool_brand_per = autos["Brand"].value_counts(normalize=True, dropna=False) >= 0.01

# autos["Brand"].value_counts(normalize=True, dropna=False)[bool_brand_per]

autos["Brand"].value_counts(normalize=True, dropna=False)

<font color="Coral">
We select the brands whose relative frequency in the "Brand" column is greater than 5 percent.
</font>

In [None]:
brand_mean_prices = {}

In [None]:
brand_counts = autos["Brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > 0.05].index

print(common_brands)

In [None]:
for brand in common_brands:
    brand_mean = autos.loc[autos["Brand"] == brand, "Price EUR"].mean()
    brand_mean_prices[brand] = int(brand_mean) 
    
# for item in sorted(brand_mean_prices.items(), key=lambda item: item[1], reverse=True):
#     print(item)

bmp_series = pd.Series(brand_mean_prices)
bmp_series.sort_values(ascending=False).head(6)

<font color="Coral">
"Audi" and "Mercedes" are more expensive than the other common brands.
</font>

<font color="DarkGoldenRod">
<h2>Step 8: Storing Aggregate Data in a DataFrame</h2>
</font>

In [None]:
brand_mean_mileage = {}
for brand in bmp_series.index:
    mileage_mean = autos.loc[autos["Brand"] == brand, "Kilometer"].mean()
    brand_mean_mileage[brand] = int(mileage_mean) 

bmkm_series = pd.Series(brand_mean_mileage)
bmkm_series.sort_values(ascending=False).head(6)

In [None]:
mean_df = pd.DataFrame(bmkm_series, columns=["mean_mileage"])
mean_df["mean_price"] = bmp_series

mean_df.sort_values(by=['mean_mileage'], ascending=False)

<font color="Coral">
The mean mileage does not vary by brand as much as the mean price does; for the top brands, all values fall within 10% of each other.
</font>
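
The two loops above can also be expressed as a single `groupby` aggregation. The following is a sketch on made-up listings. The column names follow the renamed dataset, while the 0.3 threshold is chosen only to suit the toy data.

```python
import pandas as pd

# Toy listings (values are made up)
toy = pd.DataFrame({
    'Brand': ['audi', 'audi', 'opel', 'opel', 'fiat'],
    'Price EUR': [9000, 11000, 3000, 5000, 2000],
    'Kilometer': [125000, 150000, 150000, 100000, 90000],
})

share = toy['Brand'].value_counts(normalize=True)
common = share[share > 0.3].index   # threshold for the toy data only

summary = (toy[toy['Brand'].isin(common)]
           .groupby('Brand')[['Price EUR', 'Kilometer']]
           .mean()
           .rename(columns={'Price EUR': 'mean_price', 'Kilometer': 'mean_mileage'})
           .sort_values('mean_price', ascending=False))
print(summary)
```

This produces the mean price and mean mileage side by side without intermediate dictionaries.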

<font color="DarkGoldenRod">

#### Exercise: Next steps:
    
1. Data cleaning next steps:
    1. Identify categorical data that uses German words, translate them and map the values to their English counterparts.
    2. Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.
    3. See if there are particular keywords in the name column that you can extract as new columns.
2. Analysis next steps:
    1. Find the most common brand/model combinations.
    2. Split the mileage ("Kilometer") into groups, and use aggregation to see if average prices follow any patterns based on the mileage.
    3. How much cheaper are cars with damage than their non-damaged counterparts?
</font>

<font  color="DarkGoldenRod">

#### 1.A. Identify categorical data that uses German words, translate them and map the values to their English counterparts.
</font>

In [None]:
autos.describe(include="all")

In [None]:
autos["Abtest"].unique()

In [None]:
autos["Vehicle Type"].unique()

In [None]:
# np.nan replaces np.NaN, which was removed in NumPy 2.0
mapping_dict = {'bus': 'bus', 'limousine': 'limousine', 'kleinwagen': "small_car", 'kombi': 'kombi', np.nan: np.nan, 'coupe': 'coupe', 'suv': 'suv', 'cabrio': 'convertible', 'andere': 'other'}

autos["Vehicle Type"] = autos["Vehicle Type"].map(mapping_dict)

autos["Vehicle Type"].unique()

In [None]:
autos["Gearbox"].unique()

In [None]:
mapping_dict = {'manuell': 'manual', 'automatik': 'automatic', np.nan: np.nan}

autos["Gearbox"] = autos["Gearbox"].map(mapping_dict)

autos["Gearbox"].unique()

In [None]:
autos["Fuel Type"].unique()

In [None]:
mapping_dict = dict(zip(('lpg', 'benzin', 'diesel', np.nan, 'cng', 'hybrid', 'elektro', 'andere'), 
                        ('lpg', 'gasoline', 'diesel', np.nan, 'cng', 'hybrid', 'electric', 'others')))

autos["Fuel Type"] = autos["Fuel Type"].map(mapping_dict)

autos["Fuel Type"].unique()

In [None]:
autos["Unepaired Damage"].unique()

In [None]:
mapping_dict = {'nein': 'no', np.nan: np.nan, 'ja': 'yes'}

autos["Unepaired Damage"] = autos["Unepaired Damage"].map(mapping_dict)

print(autos["Unepaired Damage"].unique())

autos.head()
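
One thing to keep in mind with `Series.map` is that any value missing from the dictionary becomes NaN. The sketch below contrasts it with `Series.replace`, which leaves unmapped values untouched. The value 'halbautomatik' is made up to show the difference.

```python
import pandas as pd
import numpy as np

translations = {'manuell': 'manual', 'automatik': 'automatic'}
gearbox = pd.Series(['manuell', 'automatik', np.nan, 'halbautomatik'])  # last value is made up

mapped = gearbox.map(translations)        # unmapped values become NaN
replaced = gearbox.replace(translations)  # unmapped values are left as they are

print(mapped.tolist())
print(replaced.tolist())
```

With `map`, comparing the number of nulls before and after the translation is a quick way to confirm that no category was silently lost.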

<font color="DarkGoldenRod">

#### 1.B. Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.
</font>

In [None]:
autos["Date Crawled"].unique()

In [None]:
# the date columns are already datetimes: format them as yyyymmdd and cast to int
for column in ["Date Crawled", "Ad Created", "Last Seen"]:
    autos[column] = autos[column].dt.strftime("%Y%m%d").astype(int)

autos[["Date Crawled", "Ad Created", "Last Seen"]].head()
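
The same yyyymmdd integer can be built arithmetically from the date components, which avoids the round trip through strings. This is a sketch on two made-up timestamps.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2016-03-21 11:52:17', '2016-04-07 00:00:00']))

# year * 10000 + month * 100 + day gives the yyyymmdd integer
as_int = dates.dt.year * 10000 + dates.dt.month * 100 + dates.dt.day
print(as_int.tolist())
```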

<font color="DarkGoldenRod">

#### 1.C. See if there are particular keywords in the name column that you can extract as new columns.
</font>

In [None]:
autos = change_col_position("Brand", 1)

autos.head()

In [None]:
autos["Name"].describe()

<font color="Coral">

We will remove the brand from the "Name" column. Since the column contains a very large number of unique values, we will proceed as follows:
1. Extract the first token (usually the brand) of each "Name" value in lower case and assign it to a pd.Series called "names_brand".
2. Convert the values in the "Brand" column to lower case.
3. Convert the brand token in the "Name" column to lower case.
4. Remove the brand token from "Name" wherever it matches the value in the "Brand" column.
</font>
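
The steps above can be sketched on a few made-up rows as follows. The helper `strip_brand` is hypothetical and works row by row, which is simpler to read but slower than the vectorised string operations used in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Brand': ['volkswagen', 'audi', 'skoda'],
    'Name': ['Volkswagen_Golf_1.6', 'A5_Sportback_2.7_Tdi', 'Skoda_Fabia_1.4_TDI'],
})

def strip_brand(row):
    """Drop the first '_'-separated token when it equals the brand (case-insensitive)."""
    first, _, rest = row['Name'].partition('_')
    return rest if first.lower() == row['Brand'].lower() else row['Name']

toy['Name'] = toy.apply(strip_brand, axis=1)
print(toy['Name'].tolist())
```

Names that do not start with the brand, such as the Audi listing here, are left unchanged.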

In [None]:
names_brand = autos["Name"].str.split("_").str[0].str.lower()

autos["Brand"] = autos["Brand"].str.lower().str.strip()

autos["Name"] = names_brand + "_" + autos["Name"].str.split("_").str[1:].str.join("_")

autos["Name"].head(10)

In [None]:
bool_names_brands = names_brand == autos["Brand"]

comparison_df = pd.concat([names_brand, autos["Name"]], axis=1)

print(comparison_df[bool_names_brands].head(10))
print(100*"-")

# removing the brand in the "Name" column
autos.loc[bool_names_brands, "Name"] = autos.loc[bool_names_brands, "Name"].str.split("_").str[1:].str.join("_")

comparison_df = pd.concat([names_brand, autos["Name"]], axis=1)

print(comparison_df[bool_names_brands].head(10))

In [None]:
print(autos["Name"][~bool_names_brands].head(10))

print(100*"-")

autos["Name"][~bool_names_brands].value_counts().head(20)

<font color="coral">
Some values in the "Name" column still begin with a brand name. We can remove "vw" and "mercedes_Benz".
</font>

In [None]:
# removing "vw" in the "Name" column
bool_vw = autos["Name"].str.split("_").str[0] == "vw"

autos.loc[bool_vw, "Name"] = autos.loc[bool_vw, "Name"].str.split("_").str[1:].str.join("_")

# removing "mercedes_Benz" in the "Name" column
bool_mb = (autos["Name"].str.split("_").str[0] == "mercedes") & (autos["Name"].str.split("_").str[1] == "Benz")

# drop the first two tokens ("mercedes" and "Benz") in one step
autos.loc[bool_mb, "Name"] = autos.loc[bool_mb, "Name"].str.split("_").str[2:].str.join("_")

autos["Name"][(~bool_names_brands) & (~bool_vw) & (~bool_mb)].describe()

<font color="coral">
Removing "citroлn"
</font>

In [None]:
# removing "citroлn" in the "Name" column (spelled as it appears after loading the file)
bool_ci = autos["Name"].str.split("_").str[0] == "citroлn"

autos.loc[bool_ci, "Name"] = autos.loc[bool_ci, "Name"].str.split("_").str[1:].str.join("_")

print(autos["Name"][(~bool_names_brands) & (~bool_vw) & (~bool_mb) & (~bool_ci)].describe())

# autos["name"][(~bool_names_brands) & (~bool_vw) & (~bool_mb) & (~bool_ci)].value_counts().head(50)

<font color="Coral">
More than 80% of the values in "Name" were treated. Most of the unchanged values are unique.
</font>

<font color="DarkGoldenRod">

#### 2.A. Find the most common brand/model combinations.
</font>

In [None]:
most_common_brand = autos["Brand"].value_counts().idxmax()

# print(most_common_brand) # volkswagen

bool_brand_model = autos["Brand"] == most_common_brand

autos.loc[bool_brand_model, "Name"].value_counts().head()
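
An alternative that counts every brand/model pair directly, rather than looking only at the most common brand, is to group by both columns. This is a sketch on made-up rows and assumes the "Model" column is used instead of the cleaned "Name" column.

```python
import pandas as pd

toy = pd.DataFrame({
    'Brand': ['volkswagen', 'volkswagen', 'bmw', 'volkswagen', 'bmw'],
    'Model': ['golf', 'golf', '3er', 'polo', '3er'],
})

# size() counts the rows in each (Brand, Model) group
combos = (toy.groupby(['Brand', 'Model'])
             .size()
             .sort_values(ascending=False))
print(combos.head(3))
```

Rows with a missing "Model" are left out of the counts, because `groupby` drops NaN keys by default.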

<font color="DarkGoldenRod">

#### 2.B. Split the mileage ("Kilometer") into groups, and use aggregation to see if average prices follow any patterns based on the mileage.
</font>

In [None]:
autos["Kilometer"].describe()

In [None]:
autos["Kilometer"].unique()

In [None]:
autos["Kilometer"].value_counts().sort_index(ascending=False)

<font color="Coral">

We split the "Kilometer" values into 3 groups:
- low_km: values less than or equal to 50,000 km;
- medium_km: values greater than 50,000 km and less than or equal to 100,000 km;
- high_km: values greater than 100,000 km.

We then calculate the mean price of each group and store the results in a dictionary with the keys low_km_mean_price, medium_km_mean_price and high_km_mean_price.
</font>
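
The same grouping can be sketched with `pd.cut`, using the 50,000 km and 100,000 km cut-offs that the code in this section applies. The listings below are made up, and the labels are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'Kilometer': [5000, 40000, 60000, 100000, 125000, 150000],
    'Price EUR': [20000, 15000, 9000, 7000, 4000, 3000],
})

# Bins are right-inclusive: (0, 50000], (50000, 100000], (100000, 150000]
groups = pd.cut(toy['Kilometer'],
                bins=[0, 50000, 100000, 150000],
                labels=['low_km', 'medium_km', 'high_km'])

price_by_km = toy.groupby(groups, observed=True)['Price EUR'].mean()
print(price_by_km)
```

Because each row falls into exactly one bin, this avoids the risk of overlapping conditions when the boundaries are written out by hand.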

In [None]:
low_km_mean_price = int(autos.loc[autos["Kilometer"] <= 50000, "Price EUR"].mean())
# the medium group starts above 50000 so that it does not overlap the low group
medium_km_mean_price = int(autos.loc[(50000 < autos["Kilometer"]) & (autos["Kilometer"] <= 100000), "Price EUR"].mean())
high_km_mean_price = int(autos.loc[(100000 < autos["Kilometer"]) & (autos["Kilometer"] <= 150000), "Price EUR"].mean())

price_by_km = {"low_km_mean_price": low_km_mean_price, "medium_km_mean_price": medium_km_mean_price, "high_km_mean_price": high_km_mean_price}

print(price_by_km)

<font color="Coral">
The mean price of low-mileage cars is more than 3 times the mean price of high-mileage cars.
</font>

<font color="DarkGoldenRod">

#### 2.C. How much cheaper are cars with damage than their non-damaged counterparts?
</font>

In [None]:
# fix the misspelled column name
autos = autos.rename(columns={"Unepaired Damage": "Unrepaired Damage"})

In [None]:
autos["Unrepaired Damage"].unique()

In [None]:
autos["Unrepaired Damage"].value_counts(dropna=False)

In [None]:
damaged_mean = int(autos.loc[autos["Unrepaired Damage"] == "yes", "Price EUR"].mean())
no_damaged_mean = int(autos.loc[autos["Unrepaired Damage"] == "no", "Price EUR"].mean())

price_by_damage = {"damaged_mean": damaged_mean, "no_damaged_mean": no_damaged_mean}

print(price_by_damage)

In [None]:
print(int(autos["Price EUR"].mean()))

<font color="Coral">
The mean price of damaged cars is less than a third of the mean price of non-damaged cars.
</font>
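
The comparison can also be written as one `groupby`, which makes the price ratio explicit. The sketch below uses made-up listings, and its column names are illustrative.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'Unrepaired Damage': ['yes', 'no', 'no', np.nan, 'yes'],
    'Price EUR': [1000, 6000, 8000, 4000, 3000],
})

# groupby drops rows whose key is NaN by default
means = toy.groupby('Unrepaired Damage')['Price EUR'].mean()
ratio = means['no'] / means['yes']
print(means.to_dict(), round(ratio, 2))
```

Listings with no damage information are excluded from both groups here, so the ratio describes only the cars whose damage status is known.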