## Used car listings
We are working with a dataset of used cars from eBay Kleinanzeigen, the classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle. The version used here has a few modifications from the original:

+ Sampled 50,000 data points from the full dataset to ensure the code runs quickly in a hosted environment
+ Dirtied the dataset a bit to more closely resemble what would be expected from a scraped dataset

The aim of this project is to clean the data and analyze the included used car listings.


In [None]:
import pandas as pd

# Parse the three timestamp columns on load. pd.datetime has been removed from
# pandas and date_parser is deprecated, so let read_csv infer the
# 'YYYY-MM-DD HH:MM:SS' format itself.
autos = pd.read_csv('autos.csv', encoding='Latin-1', parse_dates=['dateCrawled', 'dateCreated', 'lastSeen'])
autos.head()



In [None]:
autos.info()

## Observations
+ 50,000 rows
+ memory usage of 7.6+ MB
+ 20 columns, mostly strings
+ some columns have null values, but none has more than 20% missing
+ The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
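
The mechanical part of the camelcase-to-snakecase conversion can be sketched with a small regex helper. This is only an illustration: `camel_to_snake` is a hypothetical helper name, not a pandas function, and it does not cover the rewording of column names, which still has to be done by hand.

```python
import re

def camel_to_snake(name):
    # Insert an underscore wherever a lowercase letter or digit is
    # followed by an uppercase letter, then lowercase everything.
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few column names from this dataset, used as sample input.
sample_columns = ['dateCrawled', 'yearOfRegistration', 'powerPS', 'price']
converted = [camel_to_snake(c) for c in sample_columns]
print(converted)
```

A helper like this could be applied to every column with `autos.columns = [camel_to_snake(c) for c in autos.columns]`, followed by a manual rename for the columns that need more descriptive names.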

In [None]:
autos.columns

In [None]:
autos = autos.rename(columns={'dateCrawled': 'date_crawled',
                              'yearOfRegistration': 'registration_year',
                              'monthOfRegistration': 'registration_month',
                              'notRepairedDamage': 'unrepaired_damage',
                              'dateCreated': 'ad_created',
                              'lastSeen': 'last_seen',
                              'postalCode': 'postal_code',
                              'nrOfPictures': 'number_pictures',
                              'fuelType': 'fuel_type',
                              'powerPS': 'power_ps',
                              'vehicleType': 'vehicle_type',
                              'offerType': 'offer_type'})

In [None]:
autos.head()


We converted the column names from camelcase to snakecase and reworded some of them, based on the data dictionary, to be more descriptive.

In [None]:
autos.describe()

## Observations
+ number_pictures seems to hold mostly a single value, which may not be relevant to our analysis
+ power_ps needs further investigation
+ registration year and month might be stored as strings and may need some cleaning
+ We may also have to clean up the trailing longs into floats in each column depending on our usage

In [None]:
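# Preview what a regex-based cleanup would return for the two string columns.
# replace() returns a new frame here, so autos itself is not modified;
# the actual conversion happens in the next two cells.
autos[['price', 'odometer']].replace(to_replace=r'\$([0-9,.]+).*', value=r'\1', regex=True).head()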


In [None]:
# Strip the currency symbol and thousands separators, then convert to integers.
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)

In [None]:
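# Strip the 'km' suffix and thousands separators, then convert to integers.
# The conversion must read from the odometer column, not from price.
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(int)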

In [None]:
autos[['price','odometer']].dtypes

In [None]:
autos = autos.rename(columns={'odometer':'odometer_km'})

In [None]:
autos.head()

In [None]:
autos[['price','odometer_km']].sort_values(by=['price','odometer_km'],ascending=False)

In [None]:
# Keep rows where both price and odometer_km are above 0 and below 169,000.
cols = ['price', 'odometer_km']
autos = autos[(autos[cols] > 0).all(axis=1) & (autos[cols] < 169000).all(axis=1)]

## Cleaning summary
+ Sorted both columns in descending order to inspect the extreme values
+ Found repeated, dubious integer values, so kept only the rows where both price and odometer_km are above 0 and below 169,000
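
The range filter described above can be illustrated on a toy frame. This is a minimal sketch: the values are made up and only the column names and the bounds (above 0, below 169,000) come from the notebook.

```python
import pandas as pd

# Toy data standing in for the real listings (values are made up).
toy = pd.DataFrame({
    'price': [0, 500, 4500, 99999999, 12000],
    'odometer_km': [150000, 125000, 0, 5000, 30000],
})

cols = ['price', 'odometer_km']
# Keep a row only if every checked column is above 0 and below the upper bound.
in_range = (toy[cols] > 0).all(axis=1) & (toy[cols] < 169000).all(axis=1)
kept = toy[in_range]
print(kept)
print(f"dropped {len(toy) - len(kept)} of {len(toy)} rows")
```

Printing the number of dropped rows before overwriting the frame is a cheap way to check that the bounds are not removing more listings than intended.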

In [None]:
autos = autos.sort_values(by=['date_crawled','ad_created','last_seen'],ascending=True)


Sorted autos by date_crawled, ad_created and last_seen in ascending order.

In [None]:
autos['date_crawled'] = pd.to_datetime(autos['date_crawled']).dt.date

In [None]:
autos['last_seen'] = pd.to_datetime(autos['last_seen']).dt.date

Changed the last_seen and date_crawled columns from timestamps to datetime date objects so that the percentage distributions are grouped by day.
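
To see why the time component has to go before computing a percentage distribution, here is a small sketch on made-up timestamps (the values are illustrative and not taken from the dataset).

```python
import pandas as pd

# Made-up timestamps: three on 5 March 2016, one on 6 March 2016.
stamps = pd.Series(pd.to_datetime([
    '2016-03-05 08:15:00',
    '2016-03-05 14:30:00',
    '2016-03-05 21:45:00',
    '2016-03-06 09:00:00',
]))

# Raw timestamps are all distinct, so each one gets its own bucket.
by_timestamp = stamps.value_counts(normalize=True)

# Dropping the time component groups the rows by calendar day.
by_day = stamps.dt.date.value_counts(normalize=True).sort_index()
print(by_day)
```

Calling `sort_index()` orders the result chronologically instead of by frequency, which tends to be easier to read for date distributions.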

In [None]:
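import datetime as dt

# Share of listings crawled on each day of March 2016 (1 March inclusive, 1 April exclusive).
in_march = (autos['date_crawled'] >= dt.date(2016, 3, 1)) & (autos['date_crawled'] < dt.date(2016, 4, 1))
march_date_crawled = autos.loc[in_march, 'date_crawled'].value_counts(normalize=True, dropna=False)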

In [None]:
march_date_crawled

Sample percentage distribution of date_crawled for March.

In [None]:
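# Share of listings last seen on each day of March 2016 (1 March inclusive, 1 April exclusive).
in_march = (autos['last_seen'] >= dt.date(2016, 3, 1)) & (autos['last_seen'] < dt.date(2016, 4, 1))
march_last_seen = autos.loc[in_march, 'last_seen'].value_counts(normalize=True, dropna=False)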

In [None]:
march_last_seen

Sample percentage distribution of last_seen for March.

In [None]:
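# ad_created is still a datetime64 column, so compare against Timestamps rather than date objects.
in_march = (autos['ad_created'] >= pd.Timestamp('2016-03-01')) & (autos['ad_created'] < pd.Timestamp('2016-04-01'))
march_ad_created = autos.loc[in_march, 'ad_created'].value_counts(normalize=True, dropna=False)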

In [None]:
march_ad_created

Sample percentage distribution of ad_created for March.

## Summary of exploring date_crawled, ad_created and last_seen
+ Sorted autos by date_crawled, ad_created and last_seen in ascending order
+ Changed the last_seen and date_crawled columns from timestamps to datetime date objects for the percentage distributions
+ Produced sample percentage distributions for each column

In [None]:
autos['registration_year'].describe()

In [None]:
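# registration_year already holds a plain year, so no datetime conversion is needed.
# pd.DatetimeIndex would read the integers as nanoseconds since 1970 and turn every year into 1970.
autos['registration_year'] = autos['registration_year'].astype(int)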

In [None]:
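# registration_year is an integer column, so it has no .dt accessor; inspect the values directly.
autos['registration_year'].head()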

In [None]:
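# Filter the rows rather than assigning a DataFrame to a single column.
# between() is inclusive, which keeps registration years from 1900 through 2016.
autos = autos[autos['registration_year'].between(1900, 2016)]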

In [None]:
autos.registration_year

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
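
Counting the share of out-of-range rows can be sketched as follows. The years below are made up for illustration; only the 1900 - 2016 interval comes from the text above.

```python
import pandas as pd

# Made-up registration years, including implausible values.
years = pd.Series([1000, 1995, 2004, 2010, 2016, 2019, 9999, 2001])

# between() is inclusive on both ends by default.
valid = years.between(1900, 2016)

# The mean of a boolean Series is the fraction of True values.
share_outside = (~valid).mean()
print(f"{share_outside:.1%} of rows fall outside 1900-2016")
```

If the share on the real data turns out to be small, dropping those rows outright is likely simpler than writing custom repair logic.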

In [None]:
autos.registration_year.value_counts(normalize=True)