# Exploring eBay Car Sales Data

This project involves exploring and analyzing a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset, originally scraped and modified to resemble real-world, uncleaned data, includes information on car listings such as price, model, registration year, odometer readings, and more.

The goal of this project is to clean the dataset and perform initial analyses, using pandas for data manipulation. This project also highlights some of the advantages of working in a Jupyter environment.

In [4]:
import pandas as pd
import numpy as np

autos = pd.read_csv("Dataset/autos.csv")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 732: invalid continuation byte

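Reading the file with the default UTF-8 encoding fails on byte 0xdc. In Latin-1 that byte is `Ü`, a common character in German text, which suggests the file is Latin-1 encoded, so we reload it with `encoding="latin1"`.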

In [6]:
autos = pd.read_csv("Dataset/autos.csv", encoding="latin1")
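To illustrate why the second read succeeds, here is a minimal sketch of the same fallback on a short byte string. The helper `decode_with_fallback` is hypothetical (it is not part of pandas), and the sample bytes are modeled on a listing name that appears later in this notebook.

```python
# Hypothetical helper: try each encoding in order and report which one worked.
def decode_with_fallback(data, encodings=("utf-8", "latin1")):
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
    raise ValueError("none of the encodings could decode the data")

# 0xdc followed by an ASCII letter is not a valid UTF-8 sequence,
# but in Latin-1 the single byte 0xdc is the character "Ü".
raw = b"GOLF_4_1_4__3T\xdcRER"

text, used = decode_with_fallback(raw)
print(text, used)
```

Latin-1 maps every byte to a character, so it never raises; that makes it a safe last resort, but it can silently produce wrong characters if the file actually uses a different encoding.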

In [7]:
autos.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

In [None]:
autos.head()

### Dataset Overview

The dataset contains **371,528 rows** and **20 columns**, which represent various attributes of used car listings. Here are some key observations:

- **Data Completeness**: 
  - Most columns are fully populated, but some contain missing values, notably:
    - `vehicleType` (333,659 non-null values)
    - `gearbox` (351,319 non-null values)
    - `model` (351,044 non-null values)
    - `fuelType` (338,142 non-null values)
    - `notRepairedDamage` (299,468 non-null values)
  
- **Data Types**:
  - The dataset includes a mix of numerical (`int64`) and categorical (`object`) data.
  - Key numerical fields include `price`, `powerPS`, `yearOfRegistration`, `kilometer`, and `postalCode`.
  - Categorical data includes fields like `name`, `seller`, `offerType`, `vehicleType`, and `fuelType`.
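
The completeness figures above can be turned into missing-value counts and shares. The sketch below only uses the non-null counts reported by `autos.info()`; in the notebook itself, `autos.isna().sum()` should give the same counts directly.

```python
TOTAL_ROWS = 371_528  # from the RangeIndex line of autos.info()

# Non-null counts copied from the autos.info() output
non_null = {
    "vehicleType": 333_659,
    "gearbox": 351_319,
    "model": 351_044,
    "fuelType": 338_142,
    "notRepairedDamage": 299_468,
}

# Missing count = total rows minus non-null count
missing = {col: TOTAL_ROWS - count for col, count in non_null.items()}

# Print columns from most to least incomplete
for col, n_missing in sorted(missing.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<18} {n_missing:>6} missing ({n_missing / TOTAL_ROWS:.1%})")
```

`notRepairedDamage` has the most gaps (72,060 rows, roughly 19% of the dataset), followed by `vehicleType` and `fuelType`.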




### Cleaning Column Names

**The column names use camelCase instead of Python's preferred snake_case.**

In [9]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [10]:
new_columns_name = [
    'date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
    'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
    'kilometer', 'registration_month', 'fuel_type', 'brand',
    'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
    'last_seen',
]

In [11]:
autos.columns = new_columns_name

In [12]:
autos.head()

| | date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-24 11:52:17 | Golf_3_1.6 | privat | Angebot | 480 | test | NaN | 1993 | manuell | 0 | golf | 150000 | 0 | benzin | volkswagen | NaN | 2016-03-24 00:00:00 | 0 | 70435 | 2016-04-07 03:16:57 |
| 1 | 2016-03-24 10:58:45 | A5_Sportback_2.7_Tdi | privat | Angebot | 18300 | test | coupe | 2011 | manuell | 190 | NaN | 125000 | 5 | diesel | audi | ja | 2016-03-24 00:00:00 | 0 | 66954 | 2016-04-07 01:46:50 |
| 2 | 2016-03-14 12:52:21 | Jeep_Grand_Cherokee_"Overland" | privat | Angebot | 9800 | test | suv | 2004 | automatik | 163 | grand | 125000 | 8 | diesel | jeep | NaN | 2016-03-14 00:00:00 | 0 | 90480 | 2016-04-05 12:47:46 |
| 3 | 2016-03-17 16:54:04 | GOLF_4_1_4__3TÜRER | privat | Angebot | 1500 | test | kleinwagen | 2001 | manuell | 75 | golf | 150000 | 6 | benzin | volkswagen | nein | 2016-03-17 00:00:00 | 0 | 91074 | 2016-03-17 17:40:17 |
| 4 | 2016-03-31 17:25:20 | Skoda_Fabia_1.4_TDI_PD_Classic | privat | Angebot | 3600 | test | kleinwagen | 2008 | manuell | 69 | fabia | 90000 | 7 | diesel | skoda | nein | 2016-03-31 00:00:00 | 0 | 60437 | 2016-04-06 10:17:21 |


**Renaming the columns to snake_case makes them consistent with Python naming conventions and easier to read and type. A few names were also reworded rather than just converted (for example, `yearOfRegistration` became `registration_year` and `notRepairedDamage` became `unrepaired_damage`) so that they are more descriptive.**
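
The renaming above was done by hand. As an alternative, most of it can be automated with a regular expression, with a small override table for the names that were reworded. This is only a sketch: `camel_to_snake` and `OVERRIDES` are hypothetical names introduced here, not part of pandas or of the original notebook.

```python
import re

def camel_to_snake(name):
    # Insert "_" at each boundary between a lowercase letter/digit and an
    # uppercase letter, then lowercase everything (powerPS -> power_ps).
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Names that were reworded by hand instead of mechanically converted
OVERRIDES = {
    "yearOfRegistration": "registration_year",
    "monthOfRegistration": "registration_month",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
}

original_columns = [
    "dateCrawled", "name", "seller", "offerType", "price", "abtest",
    "vehicleType", "yearOfRegistration", "gearbox", "powerPS", "model",
    "kilometer", "monthOfRegistration", "fuelType", "brand",
    "notRepairedDamage", "dateCreated", "nrOfPictures", "postalCode",
    "lastSeen",
]

# Use the manual override when there is one, otherwise convert mechanically
renamed = [OVERRIDES.get(col, camel_to_snake(col)) for col in original_columns]
print(renamed)
```

In the notebook, the result could be assigned with `autos.columns = renamed`, which is equivalent to the manual list for these 20 columns.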

### Initial Exploration and Cleaning

For price:

In [None]:
autos.head()

In [None]:
autos.info()

In [None]:
autos["price"].shape

In [None]:
autos["price"].describe()

In [13]:
autos = autos.loc[autos["price"] != 0]

**Entries with a price of 0 were removed as they often represent data entry errors or missing values, which could skew the analysis and statistics by introducing non-representative values of the used car market.**


In [14]:
autos['price'].value_counts().sort_index(ascending=True)

price
1             1189
2               12
3                8
4                1
5               26
              ... 
32545461         1
74185296         1
99000000         1
99999999        15
2147483647       1
Name: count, Length: 5596, dtype: int64

In [15]:
autos = autos[autos['price'] >= 1000]

In [16]:
autos['price'].value_counts().sort_index(ascending=True)

price
1000          4649
1001            11
1003             1
1009             1
1010             2
              ... 
32545461         1
74185296         1
99000000         1
99999999        15
2147483647       1
Name: count, Length: 5095, dtype: int64

In [20]:
autos = autos[autos['price'] <= 27000000]

In [21]:
autos['price'].value_counts().sort_index(ascending=True)

price
1000        4649
1001          11
1003           1
1009           1
1010           2
            ... 
10000000       8
10010011       1
11111111      10
12345678       9
14000500       1
Name: count, Length: 5089, dtype: int64

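I decided to remove all cars priced below 1,000 and all cars priced above 27 million, since I took 27 million as the price of the most expensive car in the world. This upper bound is generous: placeholder-looking prices such as 11111111 and 12345678 still pass it.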
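
The two price filters can also be written as a single range check. The sketch below uses made-up prices and assumes the same bounds as above; `Series.between` is inclusive on both ends by default, which matches the `>=` and `<=` comparisons used in the cells.

```python
import pandas as pd

# Made-up prices covering each case: zero, too low, boundaries, too high
prices = pd.Series([0, 1, 999, 1000, 18300, 27_000_000, 99_999_999, 2_147_483_647])

LOW, HIGH = 1_000, 27_000_000  # same bounds as in the cells above

# True where LOW <= price <= HIGH
mask = prices.between(LOW, HIGH)
kept = prices[mask]
print(kept.tolist())
```

On the real data this would read `autos = autos[autos["price"].between(1000, 27000000)]`. A lower bound of 1,000 already excludes prices of 0, so the separate zero filter becomes redundant.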


For kilometer:

In [25]:
autos['kilometer'].value_counts().sort_index(ascending=True)

kilometer
5000        3634
10000       1723
20000       5027
30000       5640
40000       6184
50000       7343
60000       8257
70000       9193
80000      10195
90000      11294
100000     13693
125000     31976
150000    174014
Name: count, dtype: int64
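
To see how concentrated the listings are, the counts above can be converted to shares. This sketch only uses the numbers from the output; in the notebook, `autos['kilometer'].value_counts(normalize=True)` would compute the shares directly.

```python
# Counts copied from the value_counts output above
km_counts = {
    5000: 3634, 10000: 1723, 20000: 5027, 30000: 5640, 40000: 6184,
    50000: 7343, 60000: 8257, 70000: 9193, 80000: 10195, 90000: 11294,
    100000: 13693, 125000: 31976, 150000: 174014,
}

total = sum(km_counts.values())

# Share of all remaining listings in each kilometer bucket
shares = {km: count / total for km, count in km_counts.items()}

for km, share in shares.items():
    print(f"{km:>7} km: {share:6.1%}")
```

The total is 288,173 listings, and the 150,000 km bucket alone holds about 60% of them, so it most likely acts as a "150,000 or more" category.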

In [26]:
autos["price"].describe

<bound method NDFrame.describe of 1         18300
2          9800
3          1500
4          3600
6          2200
          ...  
371523     2200
371524     1199
371525     9200
371526     3400
371527    28990
Name: price, Length: 288173, dtype: int64>

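The `kilometer` values fall into 13 fixed buckets between 5,000 and 150,000 with no implausible entries, so no rows are removed for this column. For the price, the values shown look plausible, and 288,173 listings remain after filtering. Because `describe` was referenced without parentheses in the last cell, it printed the series itself rather than summary statistics; `autos["price"].describe()` would return them.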
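
For reference, here is a small sketch of the difference between referencing `describe` and calling it. The series holds the first few prices from the output above plus nothing else, so the statistics are illustrative only and do not describe the full dataset.

```python
import pandas as pd

# First few prices from the output above (illustrative sample only)
prices = pd.Series([18300, 9800, 1500, 3600, 2200], name="price")

method = prices.describe     # attribute access: a bound method, nothing is computed
summary = prices.describe()  # call: returns a Series of summary statistics

print(callable(method))
print(summary[["count", "min", "50%", "max"]])
```

The result of `describe()` includes the count, mean, standard deviation, minimum, quartiles, and maximum, which makes remaining outliers much easier to spot than a list of raw values.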