# Data Cleaning for eBay Car Sales Data

In this project I aim to demonstrate my data cleaning skills.

We'll analyse <a href="https://data.world/data-society/used-cars-data">existing data</a> on used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website, scraped and uploaded to Kaggle by user orgesleka. The dataset contains the vehicle type, the name of the car, the asking price in the ad, the car brand, and more. You can find the data <a href="https://data.world/data-society/used-cars-data">here</a>.

#### Summary of Results
After analysing the data, we were able to find the most expensive car brands among the most common ones. The result itself has limited value for a business or anybody else, since it is easy to look up, but the main point of this analysis was to demonstrate my data cleaning skills.

## We use Pandas to read the dataset

In [58]:
import pandas as pd

# Read the CSV file and assign it to the variable 'autos'
autos = pd.read_csv('autos.csv', encoding='Latin-1')

#### Some information about the Dataframe

In [59]:
autos.info() #Display some information about the DataFrame Structure
autos.head() #display the first few rows of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


From my observation, the dataset contains 20 columns, most of which are strings. Five columns (vehicleType, gearbox, model, fuelType and notRepairedDamage) have null values. Most column names use camelCase instead of snake_case.

We are going to convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive. Then we'll take care of the rest.
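
The null counts read off the `info()` output above can also be computed directly. The following is a minimal sketch on a small made-up frame (the `sample` data is illustrative, not taken from the dataset); it only assumes the standard `DataFrame.isnull` and `sum` methods.

```python
import pandas as pd

# Toy frame with a few missing values (illustrative data only)
sample = pd.DataFrame({
    "brand": ["bmw", "ford", "smart"],
    "vehicleType": ["limousine", None, "kleinwagen"],
    "gearbox": [None, None, "automatik"],
})

# Count missing values per column, then keep only columns that have any
null_counts = sample.isnull().sum()
print(null_counts[null_counts > 0])
```

Run against `autos`, the same two lines would list only the columns that need attention.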

## Changing Column Names

Changing from camelCase to snake_case is often a matter of coding style and consistency. Doing this makes the data more readable.

In [60]:
# Define a mapping of old column names to new column names
column_mapping = {
    'dateCrawled': 'date_crawled',
    'offerType': 'offer_type',
    'abtest': 'ab_test',
    'vehicleType': 'vehicle_type',
    'yearOfRegistration': 'registration_year',
    'gearbox': 'gear_box',
    'powerPS': 'power_ps',
    'odometer': 'odometer_km',
    'monthOfRegistration': 'registration_month',
    'fuelType': 'fuel_type',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'nrOfPictures': 'nr_of_pictures',
    'postalCode': 'postal_code',
    'lastSeen': 'last_seen'
}

# Rename columns based on the mapping
autos.rename(columns=column_mapping, inplace=True)

# Lowercase any remaining column names and replace spaces with underscores
autos.columns = [col.lower().replace(" ", "_") for col in autos.columns]
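
For wider tables, the mechanical part of this renaming could be automated. The sketch below is an illustration rather than part of the original notebook: `camel_to_snake` is a hypothetical helper that uses a regular expression to insert an underscore wherever a lowercase letter or digit is followed by an uppercase letter.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase name to snake_case (hypothetical helper)."""
    # Insert "_" between a lowercase letter or digit and the uppercase letter after it
    with_underscores = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_underscores.lower()

columns = ["dateCrawled", "offerType", "nrOfPictures", "powerPS", "price"]
print([camel_to_snake(c) for c in columns])
```

Names without capital letters (such as abtest) and renames that reorder words (yearOfRegistration to registration_year) would still need the manual mapping used above.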

In [61]:
autos.head() #printing the first 5 rows

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Exploring the Odometer and Price Columns 
Now we are going to explore the data more closely and try to clean it before analysis. 

In [62]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-21 16:37:21,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


<b>From my observation:</b>
* Several columns have low variability. seller and offer_type hold the same value in all but one row, and ab_test, gear_box and unrepaired_damage take only two values each. We'll remove these five columns because they don't provide much insight for our analysis. We'll also drop name, which has 38,754 unique values and is not used in this analysis.


* Columns like price and odometer_km are numeric values stored as text. We'll remove the non-numeric characters and then convert these columns to a numeric dtype.

### Removing Low Variability columns

In [63]:
# Remove multiple columns at once
columns_to_remove = ['name', 'seller', 'offer_type', 'ab_test', 'gear_box', 'unrepaired_damage']
autos = autos.drop(columns_to_remove, axis=1)

In [64]:
autos.head()

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,power_ps,model,odometer_km,registration_month,fuel_type,brand,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,"$5,000",bus,2004,158,andere,"150,000km",3,lpg,peugeot,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,"$8,500",limousine,1997,286,7er,"150,000km",6,benzin,bmw,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,"$8,990",limousine,2009,102,golf,"70,000km",7,benzin,volkswagen,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,"$4,350",kleinwagen,2007,71,fortwo,"70,000km",6,benzin,smart,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,"$1,350",kombi,2003,0,focus,"150,000km",7,benzin,ford,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Cleaning the price and odometer_km columns
Here we convert the price and odometer_km columns to numeric types. (The odometer column was already renamed to odometer_km in the renaming step above.)

In [65]:
def clean_price_and_odometer_columns(df):
    # Remove dollar sign and commas from the "price" column
    df['price'] = df['price'].str.replace('[$,]', '', regex=True)
    # Convert the cleaned "price" column to integers (if needed)
    df['price'] = df['price'].astype(int)

    # Remove "km" and commas from the "odometer" column
    df['odometer_km'] = df['odometer_km'].str.replace('[km,]', '', regex=True)
    # Convert the cleaned "odometer" column to integers (if needed)
    df['odometer_km'] = df['odometer_km'].astype(int)

# Example usage to clean the "price" and "odometer" columns
clean_price_and_odometer_columns(autos)
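
A slightly more defensive variant of the same idea is sketched below on made-up values in the same format as the raw columns. It strips everything that is not a digit and uses `pd.to_numeric` with `errors="coerce"`, so an unexpected entry would become NaN instead of raising an error. This is an alternative sketch, not the method used in this notebook.

```python
import pandas as pd

# Toy data in the same format as the raw columns (values are illustrative)
sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer_km": ["150,000km", "70,000km", "5,000km"],
})

for col in ["price", "odometer_km"]:
    # Keep digits only, dropping "$", "," and "km"
    digits = sample[col].str.replace(r"[^0-9]", "", regex=True)
    # Convert to numbers; unparseable entries would become NaN
    sample[col] = pd.to_numeric(digits, errors="coerce")

print(sample.dtypes)
print(sample)
```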

### Removing unrealistically high or low values from the price column

In [66]:
print(autos['price'].describe(include="all"))

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


The minimum price is \$0 and the maximum is about \$100 million. Neither value is realistic for a used-car listing on a classifieds site: a car worth \$100 million would not be listed there, and we would not expect a car to be offered for \$0.

We are going to check how many cars in the dataset are priced at \$0, at around \$100 million, and at other values. This will help us identify outliers at both ends of the price range.

In [67]:
# Count how often each price occurs and keep the five most common prices
price_counts = autos['price'].value_counts().head()

# Print each of these prices with the number of cars listed at that price
for price, count in price_counts.items():
    print(f"{count} : ${price}")

1421 : $0
781 : $500
734 : $1500
643 : $2500
639 : $1000


1,421 cars are priced at \$0, and the next most common price is \$500. Note that this list is ordered by frequency, not by price, so it does not show that \$500 is the lowest nonzero price. We'll set our minimum price at \$500, on the assumption that no car is really sold for \$0 and that cars are rarely sold for less than \$500 on the <a href="https://www.kleinanzeigen.de/">site</a>.

In [68]:
price_counts = autos['price'].value_counts().sort_index(ascending=False).head(20)
print(price_counts)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64


These numbers are quite large, especially the top five prices, which range from 10 to 100 million. However, the counts for these prices are quite low. In fact, the ninth and tenth highest prices are below 1 million, and the eleventh price drops to 350,000.

Given the consistently low counts, we can safely leave out cars priced <b>above 350,000 or below 500</b> from our analysis. This helps avoid any skew from the extremely high and low prices.

In [69]:
autos = autos[autos["price"].between(500,350000)]
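
Before applying a filter like this, it can help to build the boolean mask first and check how many rows it would drop. The sketch below uses a small made-up series of prices; `between` is inclusive on both ends by default, so 500 and 350000 themselves are kept.

```python
import pandas as pd

# Made-up prices around the two cut-offs
prices = pd.Series([0, 1, 499, 500, 2950, 350000, 350001, 99999999], name="price")

# True for rows inside the inclusive range [500, 350000]
mask = prices.between(500, 350000)

print(f"kept {mask.sum()} of {len(prices)} rows")
print(prices[~mask].tolist())  # the values that would be dropped
```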

### Checking for unrealistically high or low values in the odometer_km column

In [70]:
autos["odometer_km"].describe()

count     45097.000000
mean     125293.035013
std       39622.744927
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [71]:
autos["odometer_km"].value_counts().sort_index(ascending=False)

150000    28698
125000     4838
100000     2031
90000      1676
80000      1385
70000      1189
60000      1131
50000       996
40000       808
30000       765
20000       727
10000       237
5000        616
Name: odometer_km, dtype: int64

These values seem reasonable (remember that this is the number of kilometres the car has already driven), so we won't remove any rows based on odometer_km.

## Exploring the date columns
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:

- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

In [72]:
autos[['date_crawled','ad_created','last_seen', 'registration_year','registration_month']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen,registration_year,registration_month
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54,2004,3
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08,1997,6
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37,2009,7
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28,2007,6
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50,2003,7


The first three columns hold full timestamp values. We'll later extract only the date part to understand the date range.

In [73]:
print(autos[['date_crawled', 'ad_created', 'last_seen', 'registration_month', 'registration_year']].dtypes)

date_crawled          object
ad_created            object
last_seen             object
registration_month     int64
registration_year      int64
dtype: object


pandas reads the date_crawled, last_seen, and ad_created columns as strings and the other two as integers.

To understand the three string columns quantitatively, we'll extract the date part of each value and look at how the records are distributed across dates.
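
As an alternative to working with the raw strings, the timestamps could be parsed into real datetime values with `pd.to_datetime`. The sketch below is not what this notebook does; it uses three timestamps copied from the date_crawled column shown earlier and assumes every value follows the same `YYYY-MM-DD HH:MM:SS` format.

```python
import pandas as pd

# Timestamps copied from the first rows of date_crawled
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# Parse with an explicit format, then keep only the calendar date
parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")
share_per_day = parsed.dt.date.value_counts(normalize=True).sort_index()

print(share_per_day)
```

Parsed values also support date arithmetic, such as the difference between last_seen and ad_created, which string slicing cannot provide.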

### Extracting "Date values" from string columns

In [74]:
(autos["date_crawled"]
 .str[:10]            # keep only the date part (YYYY-MM-DD) of each timestamp string
 .value_counts(       # count how often each unique date appears
    normalize=True,   # return proportions instead of raw counts
    dropna=False)     # include missing values in the distribution
 .sort_index())       # sort by date, earliest first

2016-03-05    0.025567
2016-03-06    0.014125
2016-03-07    0.036189
2016-03-08    0.033173
2016-03-09    0.032907
2016-03-10    0.032707
2016-03-11    0.033018
2016-03-12    0.037320
2016-03-13    0.015522
2016-03-14    0.036300
2016-03-15    0.034016
2016-03-16    0.029359
2016-03-17    0.031155
2016-03-18    0.012883
2016-03-19    0.034747
2016-03-20    0.038073
2016-03-21    0.037741
2016-03-22    0.033018
2016-03-23    0.032397
2016-03-24    0.028982
2016-03-25    0.031089
2016-03-26    0.032641
2016-03-27    0.031177
2016-03-28    0.034836
2016-03-29    0.033262
2016-03-30    0.033328
2016-03-31    0.031665
2016-04-01    0.033905
2016-04-02    0.035767
2016-04-03    0.038827
2016-04-04    0.036610
2016-04-05    0.013172
2016-04-06    0.003171
2016-04-07    0.001353
Name: date_crawled, dtype: float64

From my observation, data from the site was crawled every day from 5 March to 7 April 2016 (inclusive).

In [75]:
(autos["ad_created"]
 .str[:10]            # keep only the date part (YYYY-MM-DD) of each timestamp string
 .value_counts(       # count how often each unique date appears
    normalize=True,   # return proportions instead of raw counts
    dropna=False)     # include missing values in the distribution
 .sort_index())       # sort by date, earliest first

2015-06-11    0.000022
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
                ...   
2016-04-03    0.039049
2016-04-04    0.036987
2016-04-05    0.011908
2016-04-06    0.003260
2016-04-07    0.001197
Name: ad_created, Length: 76, dtype: float64

From my observation, ads were created between 11 June 2015 and 7 April 2016. The earliest dates shown each account for only about 0.002% of ads, while dates near the end of the crawl period, which started on 5 March 2016, each account for up to about 4%.

In [76]:
(autos["last_seen"]
 .str[:10]            # keep only the date part (YYYY-MM-DD) of each timestamp string
 .value_counts(       # count how often each unique date appears
    normalize=True,   # return proportions instead of raw counts
    dropna=False)     # include missing values in the distribution
 .sort_index())       # sort by date, earliest first

2016-03-05    0.001087
2016-03-06    0.004169
2016-03-07    0.005211
2016-03-08    0.007007
2016-03-09    0.009468
2016-03-10    0.010289
2016-03-11    0.012041
2016-03-12    0.023904
2016-03-13    0.008870
2016-03-14    0.012285
2016-03-15    0.015677
2016-03-16    0.016165
2016-03-17    0.027674
2016-03-18    0.007406
2016-03-19    0.015411
2016-03-20    0.020423
2016-03-21    0.020667
2016-03-22    0.021243
2016-03-23    0.018405
2016-03-24    0.019536
2016-03-25    0.018582
2016-03-26    0.016476
2016-03-27    0.015456
2016-03-28    0.020534
2016-03-29    0.021354
2016-03-30    0.024148
2016-03-31    0.023438
2016-04-01    0.022862
2016-04-02    0.024880
2016-04-03    0.024946
2016-04-04    0.024303
2016-04-05    0.126616
2016-04-06    0.225314
2016-04-07    0.134155
Name: last_seen, dtype: float64

In [77]:
autos['registration_year'].describe()

count    45097.000000
mean      2005.064173
std         89.652017
min       1000.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

It doesn't make sense to have any registration_year later than 2016, given that all the data was crawled in 2016. The minimum of 1000 and the maximum of 9999 are clearly invalid: cars didn't exist in the year 1000, and we haven't reached the year 9999 yet.

The upper bound is therefore 2016, but it is harder to determine the earliest valid year. Realistically, it could be somewhere in the first few decades of the 1900s.

We'll use a stricter lower bound and keep only the cars registered between 1990 and 2016, which also excludes older classic cars.

In [78]:
autos = autos[autos["registration_year"].between(1990,2016)]

In [79]:
autos["registration_year"].describe()

count    42105.000000
mean      2003.922385
std          5.594638
min       1990.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

Now the values make sense: we have successfully limited registration_year to the range 1990-2016.

## Exploring Price by Brand

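We want to find which of the most common brands are the most expensive, which are the cheapest, and which fall in between. We start with the ten brands that account for the largest share of listings.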

In [80]:
autos['brand'].value_counts(normalize=True).head(10)

volkswagen       0.211590
bmw              0.116186
opel             0.101342
mercedes_benz    0.099489
audi             0.092151
ford             0.065123
renault          0.044817
peugeot          0.030186
fiat             0.023418
seat             0.018406
Name: brand, dtype: float64

In [81]:
# List of selected brands
selected_brands = ['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat']

# Create an empty dictionary to hold aggregate data
brand_aggregate = {}

# Loop over the selected brands
for brand in selected_brands:
    # Calculate the mean price for each brand
    mean_price = autos[autos['brand'] == brand]['price'].mean()
    # Assign the mean price to the dictionary with the brand name as the key
    brand_aggregate[brand] = mean_price

# Print the dictionary of aggregate data
print(brand_aggregate)


{'volkswagen': 5780.447973958918, 'bmw': 8616.322158626328, 'opel': 3374.1181157722053, 'mercedes_benz': 8709.691811888279, 'audi': 9686.49613402062, 'ford': 3950.854485776805, 'renault': 2757.5262321144673, 'peugeot': 3354.910306845004, 'fiat': 3142.7951318458418, 'seat': 4810.883870967742}


We can observe that:

- Renault, Fiat, Peugeot, Opel and Ford are (on average) cheaper (~ 2.8 - 4.0 thousand dollars);
- BMW, Mercedes and Audi are (on average) the most expensive (~ 8.6 - 9.7 thousand dollars); and
- Volkswagen and Seat sit (on average) between these two categories, with mean prices of ~ 5.8 and ~ 4.8 thousand dollars respectively.

Among the seven most popular brands, the order of popularity flits between the cheaper and the more expensive brands, and Volkswagen, the most popular brand, is the only one that fits neither category (i.e. the order of popularity is: Volkswagen, expensive, cheap, expensive, expensive, cheap, cheap). It may be that Volkswagen is the most popular because of this middle price.
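
The per-brand loop above could also be written with `groupby`, which computes every brand's mean in one pass. The sketch below uses a tiny made-up table (the `listings` data is illustrative only) and picks the most common brands from `value_counts` instead of a hand-typed list.

```python
import pandas as pd

# Toy listings; brands and prices are made up for illustration
listings = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "fiat", "fiat", "fiat"],
    "price": [9000, 11000, 8000, 3000, 2500, 3500],
})

# Brands ordered by number of listings; keep the two most common
top_brands = listings["brand"].value_counts().head(2).index

# Mean price per brand, restricted to the most common brands
mean_price = (listings[listings["brand"].isin(top_brands)]
              .groupby("brand")["price"]
              .mean())

print(mean_price)
```

On `autos`, replacing `head(2)` with `head(10)` should select the same ten brands as the hand-typed list.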

## Exploring Average Mileage for Top Brands

In the same way, let's look at the average mileage for these top ten brands:

In [82]:
top_brands_mean_mileage = {}

for b in brand_aggregate:
    selected_rows = autos[autos["brand"] == b]
    mean_mileage = selected_rows["odometer_km"].mean()
    top_brands_mean_mileage[b] = mean_mileage
    
print(top_brands_mean_mileage)

{'volkswagen': 128617.68997642833, 'bmw': 132967.0891251022, 'opel': 128637.2158425123, 'mercedes_benz': 131032.46598233469, 'audi': 128717.78350515464, 'ford': 124620.71480671043, 'renault': 126605.72337042926, 'peugeot': 126132.96616837136, 'fiat': 115725.15212981745, 'seat': 120058.06451612903}


From the above analysis, the only immediate takeaway is that cars of every brand are sold with a lot of mileage. The brand with the lowest average mileage is 'fiat' with an average of 115725 kilometers. The brand with the highest average mileage is 'bmw' with an average of 132967 kilometers.

### Brand Dataframe

To allow us to explore the relationship between price and mileage for these top brands, we convert both dictionaries to Series objects using the pandas Series constructor (pandas.Series(data=None, index=None, dtype=None, name=None, copy=None)):

In [83]:
mean_mileage_series = (pd.Series(top_brands_mean_mileage)
                       .sort_values(ascending=False))
# as we want to look at whether there is a visible link
# between avg mileage and price for the most popular brands, 
# it makes sense to change the order of the series from 
# highest to lowest mileage to make patterns easier to spot

print(mean_mileage_series)


bmw              132967.089125
mercedes_benz    131032.465982
audi             128717.783505
opel             128637.215843
volkswagen       128617.689976
renault          126605.723370
peugeot          126132.966168
ford             124620.714807
seat             120058.064516
fiat             115725.152130
dtype: float64


In [84]:
mean_price_series = pd.Series(brand_aggregate)
print(mean_price_series)

volkswagen       5780.447974
bmw              8616.322159
opel             3374.118116
mercedes_benz    8709.691812
audi             9686.496134
ford             3950.854486
renault          2757.526232
peugeot          3354.910307
fiat             3142.795132
seat             4810.883871
dtype: float64


Now we can create a dataframe from the first series object using the pandas dataframe constructor (pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)). We need to use the columns parameter to specify the column name (or the column name will be set to 0 by default):

In [85]:
brand_dataframe = pd.DataFrame(mean_mileage_series, columns=["mean_mileage"])
brand_dataframe

Unnamed: 0,mean_mileage
bmw,132967.089125
mercedes_benz,131032.465982
audi,128717.783505
opel,128637.215843
volkswagen,128617.689976
renault,126605.72337
peugeot,126132.966168
ford,124620.714807
seat,120058.064516
fiat,115725.15213


And finally, we can assign the second series as a new column in this dataframe:

In [86]:
brand_dataframe["mean_price"] = mean_price_series
brand_dataframe

Unnamed: 0,mean_mileage,mean_price
bmw,132967.089125,8616.322159
mercedes_benz,131032.465982,8709.691812
audi,128717.783505,9686.496134
opel,128637.215843,3374.118116
volkswagen,128617.689976,5780.447974
renault,126605.72337,2757.526232
peugeot,126132.966168,3354.910307
ford,124620.714807,3950.854486
seat,120058.064516,4810.883871
fiat,115725.15213,3142.795132


Based on our dataset, average mileage varies little across the top brands (roughly 116,000 to 133,000 km), and we have no evidence that average mileage explains the differences in price. In fact, the three most expensive brands (BMW, Mercedes and Audi) also have the three highest average mileages, so higher mileage does not go with lower average prices here. Since the cars of every brand show broadly similar wear and tear, the average prices are reasonably comparable. Among the brands we reviewed, 'audi' has the highest mean price, which suggests the highest resale value, and 'renault' has the lowest.
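
One way to put a number on this relationship is a correlation coefficient. The sketch below rebuilds the brand table from the rounded means printed above and computes the Pearson correlation between the two columns. With only ten brand-level averages this is a rough indication at best, not a substitute for looking at individual listings.

```python
import pandas as pd

# Mean values per brand, rounded from the two dictionaries printed above
mean_mileage = {
    "volkswagen": 128618, "bmw": 132967, "opel": 128637,
    "mercedes_benz": 131032, "audi": 128718, "ford": 124621,
    "renault": 126606, "peugeot": 126133, "fiat": 115725, "seat": 120058,
}
mean_price = {
    "volkswagen": 5780, "bmw": 8616, "opel": 3374,
    "mercedes_benz": 8710, "audi": 9686, "ford": 3951,
    "renault": 2758, "peugeot": 3355, "fiat": 3143, "seat": 4811,
}

# A dict of Series aligns both columns on the brand index
brand_df = pd.DataFrame({
    "mean_mileage": pd.Series(mean_mileage),
    "mean_price": pd.Series(mean_price),
})

# Pearson correlation between average mileage and average price
corr = brand_df["mean_mileage"].corr(brand_df["mean_price"])
print(corr)
```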