# eBay Used Car Analysis
In this project I examine eBay Kleinanzeigen, the classifieds section of the German eBay website. The goal is to learn how to clean a dirty data set and to extract meaningful information from the used car listings.

In [1]:
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')


In [2]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Data Descriptions

    dateCrawled - When this ad was first crawled.
    name - Name of the car.
    seller - Whether the seller is private or a dealer.
    offerType - The type of listing.
    price - The price on the ad to sell the car.
    abtest - Whether the listing is included in an A/B test.
    vehicleType - The vehicle type.
    yearOfRegistration - The year in which the car was first registered.
    gearbox - The transmission type.
    powerPS - The power of the car in PS.
    model - The car model name.
    odometer - How many kilometers the car has driven.
    monthOfRegistration - The month in which the car was first registered.
    fuelType - What type of fuel the car uses.
    brand - The brand of the car.
    notRepairedDamage - Whether the car has damage that has not yet been repaired.
    dateCreated - The date on which the eBay listing was created.
    nrOfPictures - The number of pictures in the ad.
    postalCode - The postal code for the location of the vehicle.
    lastSeen - When the crawler last saw this ad online.
    
### Observations
This dataset needs some cleaning. The first few rows show that some columns (for example `price` and `odometer`) mix digits with text such as currency symbols and units, so they are stored as strings and must be converted to a numeric type before they can be analyzed. Several columns (`vehicleType`, `gearbox`, `model`, `fuelType`, `notRepairedDamage`) also contain NaN values, which may limit the accuracy of my analysis.
    


### 1. Convert columns to snake case.
Snake case is the preferred naming style in Python.

In [3]:
autos.columns = 'date_crawled', 'name', 'seller','offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'date_created', 'num_of_pics', 'postal_code','last_seen'
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,date_created,num_of_pics,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
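The renaming above was typed out by hand. As a rough sketch, the same result could be produced programmatically. The helper names `camel_to_snake` and `clean_column` and the `overrides` mapping are hypothetical and not part of the notebook; the overrides cover the columns that were reworded rather than simply re-cased.

```python
import re

def camel_to_snake(name):
    """Insert "_" wherever a lowercase letter or digit is followed by a capital, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# Hypothetical overrides for columns the notebook renamed by hand.
overrides = {
    'abtest': 'ab_test',
    'yearOfRegistration': 'registration_year',
    'odometer': 'odometer_km',
    'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage',
    'nrOfPictures': 'num_of_pics',
}

def clean_column(name):
    """Use the hand-picked name if there is one, otherwise convert the casing."""
    return overrides.get(name, camel_to_snake(name))

original = ['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
            'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
            'odometer', 'monthOfRegistration', 'fuelType', 'brand',
            'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
            'lastSeen']
new_columns = [clean_column(col) for col in original]
print(new_columns)
```

With a DataFrame loaded, `autos.columns = new_columns` would apply the names in one step.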


### 2. Drop irrelevant columns and convert data points to useful types
I will drop columns that are irrelevant to my analysis and convert data types where needed. These changes include...

**seller:** Nearly every row has the same value (`privat` in 49,999 of 50,000 rows), so this column carries almost no information. I will drop this column.

**offer_type:** Nearly every row has the same value (`Angebot` in 49,999 of 50,000 rows). I will drop this column.

**price:** I will convert price to a numeric data type. This will allow for better analysis of that column.

**registration_year:** The minimum and maximum values here are suspicious. I don't think any car was registered in the year 1000 or 9999.

**odometer_km:** This is stored as an object and will need to be converted to an integer for analysis.

**num_of_pics:** All of the data points are zero. The data in this column is useless.

**date_crawled:** The timestamps are stored as strings and must be converted to a representation suitable for analysis.

**last_seen:** The timestamps are stored as strings and must be converted to a representation suitable for analysis.

**date_created:** The timestamps are stored as strings and must be converted to a representation suitable for analysis.



In [4]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,date_created,num_of_pics,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-19 17:36:18,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,
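The plan above mentions dropping `seller`, `offer_type`, and the picture-count column, but the cells in this notebook do not show the drop itself. The following is a minimal sketch of how it could be done, run on a made-up sample frame rather than on `autos.csv`; the sample values are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample rows that mimic the three low-information columns.
sample = pd.DataFrame({
    'seller': ['privat', 'privat', 'privat', 'gewerblich'],
    'offer_type': ['Angebot', 'Angebot', 'Angebot', 'Gesuch'],
    'num_of_pics': [0, 0, 0, 0],
    'price': ['$5,000', '$8,500', '$8,990', '$4,350'],
})

# Share of rows taken by the most common value in each candidate column.
candidates = ['seller', 'offer_type', 'num_of_pics']
top_share = {col: sample[col].value_counts(normalize=True).iloc[0]
             for col in candidates}
print(top_share)

# Drop the candidates; `columns=` avoids having to pass axis=1.
trimmed = sample.drop(columns=candidates)
print(trimmed.columns.tolist())
```

On the real data the equivalent call would be `autos = autos.drop(columns=['seller', 'offer_type', 'num_of_pics'])`.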


In [5]:
# Remove "km" and commas from the odometer column, then convert to int.
# Wrapping the expression in parentheses lets the method chain span several lines.
autos['odometer_km'] = (autos['odometer_km']
                        .str.replace('km', '')  # .str is the string accessor; replace() swaps each match for the replacement string
                        .str.replace(',','')
                        .astype(int))  # convert the column from object to int
                        
odometers = autos["odometer_km"]

odometers.describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

The statistics associated with the odometers column don't seem unrealistic. This column will stay as is. 

In [6]:
#Clean price column
autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)  # "$" is a regex anchor, so match it literally
                  .str.replace(',','')
                  .astype(int))
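One detail worth spelling out: `$` is a regular-expression metacharacter (the end-of-string anchor), and the default value of the `regex` argument of `.str.replace()` has changed between pandas releases. Passing `regex=False` makes the intent explicit. A minimal sketch on made-up values in the same format as the price column:

```python
import pandas as pd

# Made-up sample values in the same format as the price column.
raw_price = pd.Series(['$5,000', '$8,500', '$0'])

clean_price = (raw_price
               .str.replace('$', '', regex=False)  # literal "$", not the regex anchor
               .str.replace(',', '', regex=False)
               .astype(int))
print(clean_price.tolist())  # [5000, 8500, 0]
```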


In [7]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

The cell above gives a quick summary of the price data. There are clearly some large outliers: a minimum of 0 and a maximum of about 100 million are both implausible for a used car.

In [8]:
autos['price'].unique().shape

(2357,)

There are 2,357 unique prices among the 50,000 rows in the data set, so many listings share the same price.

In [9]:
autos['price'].value_counts().sort_index(ascending = True).head(10)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price, dtype: int64

In [10]:
autos['price'].value_counts().sort_index(ascending = False).head(15)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price, dtype: int64

Many cars (1,421) are priced at zero. I will remove these listings because they are likely errors. I will keep cars listed above 0 because eBay is an auction site, and opening bids are often set extremely low to attract bidders.

The top prices are also suspicious. Values such as 12,345,678 or 99,999,999 are more likely to be mistakes or placeholder prices than the actual value of a vehicle. I will remove any price above 350,000, because beyond this point prices jump sharply and appear unrealistically inflated.


In [11]:
#Removing price outliers
autos = autos[autos['price'].between(1, 350000)]
autos['price'].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64
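The new row count can be cross-checked against the two `value_counts()` outputs from In [9] and In [10]. The sketch below only redoes that arithmetic with the counts printed there; it does not touch the data.

```python
# Counts copied from the value_counts() outputs above.
total_rows = 50000
zero_priced = 1421

# Listings priced above 350,000 (350,000 itself is kept, since between() is inclusive).
above_cap = {99999999: 1, 27322222: 1, 12345678: 3, 11111111: 2, 10000000: 1,
             3890000: 1, 1300000: 1, 1234566: 1, 999999: 2, 999990: 1}

removed = zero_priced + sum(above_cap.values())
remaining = total_rows - removed
print(removed, remaining)  # 1435 48565
```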

After removing outliers there are 48,565 rows with price values left in the dataset. The average car price is shown to be about 5888 dollars. Let's look at the registration year next.


In [12]:
reg_year_1000 = autos[autos['registration_year'] == 1000] 
reg_year_1000 

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,date_created,num_of_pics,postal_code,last_seen
22316,2016-03-29 16:56:41,VW_Kaefer.__Zwei_zum_Preis_von_einem.,privat,Angebot,1500,control,,1000,manuell,0,kaefer,5000,0,benzin,volkswagen,,2016-03-29 00:00:00,0,48324,2016-03-31 10:15:28


This is the suspicious data point indicating that the car was registered in the year 1000.

In [13]:
reg_year_9999 = autos[autos['registration_year'] == 9999]
reg_year_9999

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,date_created,num_of_pics,postal_code,last_seen
8012,2016-03-23 16:43:29,Opel_GT_Karosserie_mit_Brief!,privat,Angebot,700,test,,9999,,0,andere,10000,0,,opel,,2016-03-23 00:00:00,0,21769,2016-04-05 20:16:15
33950,2016-03-23 21:52:25,58er_karmann_ghia_lowlight_Kaefer__zum_restaur...,privat,Angebot,7999,test,,9999,,0,kaefer,10000,0,,volkswagen,,2016-03-23 00:00:00,0,47638,2016-04-06 03:46:40
38076,2016-04-04 22:54:47,Mercedes_Benz_A180,privat,Angebot,18000,test,,9999,,0,a_klasse,10000,0,benzin,mercedes_benz,,2016-04-04 00:00:00,0,51379,2016-04-07 02:44:52


These are further odd data points, indicating that the vehicles were registered in the year 9999.

I will check the value counts for the registration year column.

In [14]:
autos['registration_year'].value_counts().sort_index().head()

1000    1
1001    1
1111    1
1800    2
1910    5
Name: registration_year, dtype: int64

In [15]:
autos['registration_year'].value_counts().sort_index(ascending = False).head(15)

9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
Name: registration_year, dtype: int64

A few values in the "registration_year" column are clearly incorrect. I will remove any values before 1800 or after 2019. It's possible an antique steam engine or two from the 1800s found its way onto eBay. Any year after 2019 is unlikely, because the dataset was last uploaded to Kaggle in 2019.

In [16]:
autos = autos[autos['registration_year'].between(1800,2019)]
autos['registration_year'].describe()

count    48547.000000
mean      2003.453128
std          7.677914
min       1800.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       2019.000000
Name: registration_year, dtype: float64
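The minimum of 1800 and maximum of 2019 in the summary above follow from `Series.between()` including both endpoints by default. A small sketch on made-up years placed on and around the two bounds:

```python
import pandas as pd

# Made-up years chosen to sit on and around the two bounds.
years = pd.Series([1000, 1800, 2004, 2019, 2020, 9999])

kept = years[years.between(1800, 2019)]  # both endpoints are included by default
print(kept.tolist())  # [1800, 2004, 2019]
```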

In [17]:
autos['registration_year'].value_counts(normalize = True).sort_index()

1800    0.000041
1910    0.000103
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000041
1937    0.000082
1938    0.000021
1939    0.000021
1941    0.000041
1943    0.000021
1948    0.000021
1950    0.000062
1951    0.000041
1952    0.000021
1953    0.000021
1954    0.000041
1955    0.000041
1956    0.000082
1957    0.000041
1958    0.000082
1959    0.000124
1960    0.000474
1961    0.000124
1962    0.000082
1963    0.000165
1964    0.000247
1965    0.000350
1966    0.000453
1967    0.000536
          ...   
1990    0.007148
1991    0.006983
1992    0.007621
1993    0.008754
1994    0.012957
1995    0.025274
1996    0.028282
1997    0.040188
1998    0.048674
1999    0.059674
2000    0.065009
2001    0.054298
2002    0.051208
2003    0.055596
2004    0.055678
2005    0.060477
2006    0.054998
2007    0.046903
2008    0.045626
2009    0.042948
2010    0.032731
2011    0.033432
2012    0.026984
2013    0.016541
2014    0.013657
2015    0.008075
2016    0.025130
2017    0.0286

Most cars listed were registered within the past 20 years. This makes sense because cars older than 20 years are less likely to be worth the hassle of selling. The exceptions would be collector's cars.

I will now turn to the other date and time columns, extracting the date portion of each timestamp and examining the columns for relevant information.

In [18]:
time_table = autos[['date_crawled', 'date_created', 'last_seen']]
time_table.head()

Unnamed: 0,date_crawled,date_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [19]:
date_crawled_dates = autos['date_crawled'].str[:10]
date_created_dates = autos['date_created'].str[:10]
last_seen_dates = autos['last_seen'].str[:10]

date_crawled_dates.value_counts(normalize = True, dropna = False).sort_index()

2016-03-05    0.025316
2016-03-06    0.014028
2016-03-07    0.036027
2016-03-08    0.033308
2016-03-09    0.033102
2016-03-10    0.032196
2016-03-11    0.032566
2016-03-12    0.036913
2016-03-13    0.015676
2016-03-14    0.036563
2016-03-15    0.034276
2016-03-16    0.029621
2016-03-17    0.031598
2016-03-18    0.012915
2016-03-19    0.034791
2016-03-20    0.037881
2016-03-21    0.037386
2016-03-22    0.032999
2016-03-23    0.032196
2016-03-24    0.029353
2016-03-25    0.031619
2016-03-26    0.032216
2016-03-27    0.031104
2016-03-28    0.034853
2016-03-29    0.034049
2016-03-30    0.033699
2016-03-31    0.031825
2016-04-01    0.033679
2016-04-02    0.035491
2016-04-03    0.038602
2016-04-04    0.036480
2016-04-05    0.013101
2016-04-06    0.003172
2016-04-07    0.001401
Name: date_crawled, dtype: float64
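The cell above takes the first ten characters of each timestamp string. An alternative, shown here only as a sketch, is to parse the strings with `pd.to_datetime`, which yields real datetime values that can also be subtracted or resampled. The three timestamps are the `date_crawled` values from the first rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

# date_crawled values from the first three rows of the data set.
sample = pd.Series(['2016-03-26 17:47:46',
                    '2016-04-04 13:38:56',
                    '2016-03-26 18:57:24'], name='date_crawled')

# Parse the strings into datetime values using the format seen in the data.
parsed = pd.to_datetime(sample, format='%Y-%m-%d %H:%M:%S')

# Keep only the calendar date, then compute each day's share of rows.
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
```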

The web crawler that gathered this data started on March 5, 2016 and collected a fairly consistent amount of data each day through April 4. From April 5 onward, the amount of data gathered dropped off quickly.

In [20]:
date_created_dates.value_counts(normalize = True, dropna = False).sort_index()


2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033164
2016-03-10    0.031907
2016-03-11    0.032896
2016-03-12    0.036748
2016-03-13    0.017014
2016-03-14    0.035203
2016-03-15    0.034008
2016-03-16    0.030136
2016-03-17    0.031248
2016-03-18    0.013595
2016-03-19    0.033699
2016-03-20    0.037943
2016-03-21 

Interestingly, the distribution of the date created column closely tracks that of the date crawled column. This is notable because one date was generated by the crawler and the other by the website. It is likely a result of how the crawler gathers data: in the sample rows shown earlier, each ad was crawled on the same day it was created.

In [21]:
last_seen_dates.value_counts(normalize = True, dropna = False).sort_index()

2016-03-05    0.001071
2016-03-06    0.004326
2016-03-07    0.005397
2016-03-08    0.007415
2016-03-09    0.009599
2016-03-10    0.010670
2016-03-11    0.012380
2016-03-12    0.023771
2016-03-13    0.008899
2016-03-14    0.012606
2016-03-15    0.015882
2016-03-16    0.016458
2016-03-17    0.028076
2016-03-18    0.007354
2016-03-19    0.015820
2016-03-20    0.020660
2016-03-21    0.020640
2016-03-22    0.021381
2016-03-23    0.018539
2016-03-24    0.019775
2016-03-25    0.019218
2016-03-26    0.016808
2016-03-27    0.015655
2016-03-28    0.020866
2016-03-29    0.022349
2016-03-30    0.024760
2016-03-31    0.023750
2016-04-01    0.022803
2016-04-02    0.024924
2016-04-03    0.025213
2016-04-04    0.024492
2016-04-05    0.124724
2016-04-06    0.221785
2016-04-07    0.131934
Name: last_seen, dtype: float64

The share of last seen dates ramps up gradually from the start of the crawl and then skyrockets in the last few days. The last seen date records when the crawler last saw the eBay ad online, so a date before the end of the crawl presumably means the car was sold or the ad was removed for some other reason. The large spikes at the end are unlikely to reflect ads actually being removed, because the share on each of the last three dates is at least 5x that of any earlier date. The most plausible explanation is an artifact of the web crawler rather than a sudden wave of removals.

I will now look at the registration month column.

In [22]:
autos['registration_month'].describe()

count    48547.000000
mean         5.783447
std          3.684942
min          0.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64

It appears that there are some month zero values. Since there is no zeroth month, these values will be removed. First, I would like to check the frequency of this value.

In [23]:
autos['registration_month'].value_counts().sort_index()

0     4468
1     3219
2     2937
3     5002
4     4036
5     4031
6     4270
7     3855
8     3126
9     3330
10    3588
11    3312
12    3373
Name: registration_month, dtype: int64

There are almost 4,500 rows listed with a zeroth month. I will remove these now.

In [24]:
autos = autos[autos['registration_month'].between(1,12)]
autos['registration_month'].describe()

count    44079.000000
mean         6.369677
std          3.349782
min          1.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64

In [25]:
autos['registration_month'].value_counts(normalize = True).sort_index()

1     0.073028
2     0.066630
3     0.113478
4     0.091563
5     0.091449
6     0.096872
7     0.087457
8     0.070918
9     0.075546
10    0.081399
11    0.075138
12    0.076522
Name: registration_month, dtype: float64

Registrations are spread fairly evenly across the year, with a peak in March (about 11%) and the lowest shares in February and August (about 7% each). I'm not sure this data is useful for my analysis.

I will now shift focus to the car brands.

In [26]:
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)



Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


The brands above each account for more than 5% of the listings in the eBay Kleinanzeigen data set. I will examine how these most popular car brands are priced on this site.

In [27]:
brand_price = {}


for brand in common_brands:
    brand_only = autos[autos['brand'] == brand]
    mean_price = brand_only['price'].mean()
    brand_price[brand] = int(mean_price)
print(brand_price)

{'mercedes_benz': 8826, 'volkswagen': 5689, 'bmw': 8586, 'ford': 3937, 'audi': 9680, 'opel': 3142}


The average prices in the data set for the top brands are as follows...

Audi: 9680  
BMW: 8586  
Mercedes Benz: 8826  
Volkswagen: 5689  
Ford: 3937  
Opel: 3142  

Five of the six top brands are German manufacturers.  
The three most expensive brands (Audi, Mercedes-Benz, and BMW) are higher-priced German luxury vehicles.  
Volkswagen takes the mid-tier pricing position.  
The budget options are Ford and Opel.  
Ford is the only non-German brand in the list.  



I would now like to check whether mileage has any link to the price of these top brands, or whether price is mainly driven by brand alone. To do so, I will construct a new dataframe that holds both the average price and the average mileage for each brand.

In [28]:
brand_avg_mileage = {}
for brand in common_brands:
    brand_only = autos[autos['brand'] == brand]
    mean_mileage = brand_only['odometer_km'].mean()
    brand_avg_mileage[brand] = int(mean_mileage)

    

In [29]:
series_avg_mileage = pd.Series(brand_avg_mileage)
series_brand_price = pd.Series(brand_price)

price_brand_mileage_df = pd.DataFrame(series_brand_price, columns = ["avg-price"])
price_brand_mileage_df['avg_mileage_km'] = series_avg_mileage

price_brand_mileage_df

Unnamed: 0,avg-price,avg_mileage_km
audi,9680,128940
bmw,8586,132581
ford,3937,124009
mercedes_benz,8826,130987
opel,3142,128876
volkswagen,5689,128356
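The two loops and the manual DataFrame assembly above can also be written as a single `groupby` aggregation. This is a sketch of that alternative on a made-up sample frame, not the method used in the notebook; the sample rows and the `common` list are assumptions for illustration.

```python
import pandas as pd

# Made-up listings; only the column names match the notebook.
sample = pd.DataFrame({
    'brand': ['audi', 'audi', 'ford', 'ford', 'opel'],
    'price': [9000, 11000, 3000, 5000, 3100],
    'odometer_km': [125000, 150000, 100000, 150000, 125000],
})
common = ['audi', 'ford']  # stands in for common_brands

summary = (sample[sample['brand'].isin(common)]
           .groupby('brand')[['price', 'odometer_km']]
           .mean()          # per-brand averages of both columns
           .astype(int)     # truncate to whole numbers, like int() in the loops
           .rename(columns={'price': 'avg_price',
                            'odometer_km': 'avg_mileage_km'}))
print(summary)
```

On the real data, `sample` would be `autos` and `common` would be `common_brands`.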


All of the top brands have a similar average mileage (roughly 124,000 to 133,000 km), while their average prices differ by a factor of about three. This suggests that brand drives average price more than mileage does.
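To put numbers on the comparison, the sketch below computes the ratio of the highest to the lowest brand average for both columns, using the figures from the table above. The helper name `max_to_min_ratio` is illustrative.

```python
# Per-brand averages copied from the table above.
avg_price = {'audi': 9680, 'bmw': 8586, 'ford': 3937,
             'mercedes_benz': 8826, 'opel': 3142, 'volkswagen': 5689}
avg_mileage_km = {'audi': 128940, 'bmw': 132581, 'ford': 124009,
                  'mercedes_benz': 130987, 'opel': 128876, 'volkswagen': 128356}

def max_to_min_ratio(values):
    """Return the largest value divided by the smallest."""
    vals = list(values)
    return max(vals) / min(vals)

price_ratio = max_to_min_ratio(avg_price.values())
mileage_ratio = max_to_min_ratio(avg_mileage_km.values())
print(round(price_ratio, 2), round(mileage_ratio, 2))  # 3.08 1.07
```

The most expensive brand averages about three times the price of the cheapest, while average mileage varies by only about 7% across the same brands.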

### Conclusion 

This concludes this data cleaning exercise. To recap this project, I...
1. Renamed columns to adhere to a consistent naming convention.
2. Identified and dropped irrelevant or inaccurate data points.
3. Identified the top used car brands and their average prices.
4. Found that, among the top brands, brand appears to matter more than mileage in determining the average used car price.

