# Data Cleaning

1. Import Data

Data Cleaning Steps:
- export raw_combined_dataset
- upload for team access
- create filtered_combined_dataset
    - identify the unique id column(s): ['id', 'url']
    - add new rows from v6 to v5 with a SQL-style join
    - remove from v5 what is not in v6; for every row in v5 ...
- export filtered_combined_dataset to be the "work_data" (new file)

In [20]:
# Import Libraries
from pathlib import Path
import pandas as pd
import datetime

# Step 1: Import the csv's into DataFrames

In [21]:
# identify path to raw data
csvpath_v10 = Path('./raw_data/vehicles_v10.csv')
csvpath_v9 = Path('./raw_data/vehicles_v9.csv')
csvpath_v7 = Path('./raw_data/vehicles_v7.csv')
csvpath_v6 = Path('./raw_data/vehicles_v6.csv')
csvpath_v5 = Path('./raw_data/vehicles_v5.csv')

In [22]:
# load datasets into DataFrames
vehicles_v10_df = pd.read_csv(csvpath_v10)
vehicles_v9_df = pd.read_csv(csvpath_v9)
vehicles_v7_df = pd.read_csv(csvpath_v7)
vehicles_v6_df = pd.read_csv(csvpath_v6)
vehicles_v5_df = pd.read_csv(csvpath_v5)
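
The five `read_csv` calls follow one pattern, so they could also be driven by a list of version numbers. A minimal sketch, assuming the same `vehicles_v{n}.csv` naming; the helper `load_versions` and the temporary-directory demo are hypothetical and not part of the notebook:

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_versions(raw_dir, versions):
    """Return {version: DataFrame} for each vehicles_v{version}.csv in raw_dir."""
    return {
        v: pd.read_csv(Path(raw_dir) / f"vehicles_v{v}.csv")
        for v in versions
    }


# Demo on two tiny stand-in files (the real files live in ./raw_data).
with tempfile.TemporaryDirectory() as tmp:
    for v in (5, 6):
        pd.DataFrame({"url": [f"u{v}"], "price": [100 * v]}).to_csv(
            Path(tmp) / f"vehicles_v{v}.csv", index=False
        )
    frames = load_versions(tmp, [5, 6])

print({v: df.shape for v, df in frames.items()})
```

Keeping the frames in a dict keyed by version makes it easy to loop over them in later steps instead of repeating each operation five times.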

# Step 2: Establish the indexing column as the URL, and check for duplicates

The URL is the closest thing to a unique identifier in these datasets, so each DataFrame should use 'url' as its index.

In [23]:
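# preview the DataFrames indexed by 'url' (only the last expression is displayed)
# note: set_index returns a new DataFrame; without reassignment or inplace=True the originals are unchanged
vehicles_v5_df.set_index('url')
vehicles_v6_df.set_index('url')
vehicles_v7_df.set_index('url')
vehicles_v9_df.set_index('url')
vehicles_v10_df.set_index('url')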

url,id,region,region_url,price,year,manufacturer,model,condition,cylinders,fuel,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html,7222695916,prescott,https://prescott.craigslist.org,6000,,,,,,,...,,,,,,,az,,,
https://fayar.craigslist.org/ctd/d/bentonville-2017-hyundai-elantra-se/7218891961.html,7218891961,fayetteville,https://fayar.craigslist.org,11900,,,,,,,...,,,,,,,ar,,,
https://keys.craigslist.org/cto/d/summerland-key-2005-excursion/7221797935.html,7221797935,florida keys,https://keys.craigslist.org,21000,,,,,,,...,,,,,,,fl,,,
https://worcester.craigslist.org/cto/d/west-brookfield-2002-honda-odyssey-ex/7222270760.html,7222270760,worcester / central MA,https://worcester.craigslist.org,1500,,,,,,,...,,,,,,,ma,,,
https://greensboro.craigslist.org/cto/d/trinity-1965-chevrolet-truck/7210384030.html,7210384030,greensboro,https://greensboro.craigslist.org,4900,,,,,,,...,,,,,,,nc,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
https://wyoming.craigslist.org/ctd/d/atlanta-2019-nissan-maxima-sedan-4d/7301591192.html,7301591192,wyoming,https://wyoming.craigslist.org,23590,2019.0,nissan,maxima s sedan 4d,good,6 cylinders,gas,...,,sedan,,https://images.craigslist.org/00o0o_iiraFnHg8q...,Carvana is the safer way to buy a car During t...,,wy,33.786500,-84.445400,2021-04-04T03:21:31-0600
https://wyoming.craigslist.org/ctd/d/atlanta-2020-volvo-s60-t5-momentum/7301591187.html,7301591187,wyoming,https://wyoming.craigslist.org,30590,2020.0,volvo,s60 t5 momentum sedan 4d,good,,gas,...,,sedan,red,https://images.craigslist.org/00x0x_15sbgnxCIS...,Carvana is the safer way to buy a car During t...,,wy,33.786500,-84.445400,2021-04-04T03:21:29-0600
https://wyoming.craigslist.org/ctd/d/atlanta-2020-caddy-cadillac-xt4-sport/7301591147.html,7301591147,wyoming,https://wyoming.craigslist.org,34990,2020.0,cadillac,xt4 sport suv 4d,good,,diesel,...,,hatchback,white,https://images.craigslist.org/00L0L_farM7bxnxR...,Carvana is the safer way to buy a car During t...,,wy,33.779214,-84.411811,2021-04-04T03:21:17-0600
https://wyoming.craigslist.org/ctd/d/atlanta-2018-lexus-es-es-350-sedan-4d/7301591140.html,7301591140,wyoming,https://wyoming.craigslist.org,28990,2018.0,lexus,es 350 sedan 4d,good,6 cylinders,gas,...,,sedan,silver,https://images.craigslist.org/00z0z_bKnIVGLkDT...,Carvana is the safer way to buy a car During t...,,wy,33.786500,-84.445400,2021-04-04T03:21:11-0600
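
`set_index` returns a new DataFrame unless the result is reassigned (or `inplace=True` is passed), so a bare call previews the re-indexed data without changing the original. A small sketch on made-up rows, showing that behavior and how `verify_integrity=True` rejects repeated URLs:

```python
import pandas as pd

# Toy stand-in for one version of the listings; the URLs are made up.
df = pd.DataFrame({"url": ["a", "b", "b"], "price": [1, 2, 3]})

df.set_index("url")              # result is discarded; df keeps its default index
print(df.index.name)

indexed = df.set_index("url")    # keep the result by reassigning
print(indexed.index.name)

# verify_integrity=True raises ValueError when the new index would contain duplicates.
try:
    df.set_index("url", verify_integrity=True)
    has_duplicates = False
except ValueError:
    has_duplicates = True
print(has_duplicates)
```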


In [24]:
# combine all datasets and check for URLs duplicated across versions
check_duplicated_url_df = pd.concat(
    [
        vehicles_v5_df,
        vehicles_v6_df,
        vehicles_v7_df,
        vehicles_v9_df,
        vehicles_v10_df
    ]
)

In [25]:
check_duplicated_url_df.duplicated().sum()

37

- so there are `113,523` duplicated URLs across the five datasets. (`duplicated()` with no `subset` argument compares entire rows, so the `37` above counts fully identical rows, not repeated URLs.)
    - this especially means that the existence of each listing must be tracked across time.

- each dataset must therefore have some kind of timestamp associated with it.
    - go through each dataset, check the columns, and create a new date column if one doesn't exist
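
Row-level and URL-level duplicate counts can differ, because `duplicated()` compares every column unless `subset` is given. A sketch on invented rows, assuming `url` is an ordinary column rather than the index:

```python
import pandas as pd

combined = pd.DataFrame(
    {
        "url": ["a", "a", "b", "b", "c"],
        "price": [100, 100, 200, 250, 300],
    }
)

# Rows identical in every column (only the second "a" row qualifies).
full_row_dupes = combined.duplicated().sum()

# Rows whose url has already appeared, regardless of the other columns.
url_dupes = combined.duplicated(subset="url").sum()

print(full_row_dupes, url_dupes)
```

Here the second "b" row has a different price, so it counts as a repeated URL but not as a repeated row.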

# Step 3: Identify the 'posting_date' to be added to each dataset based upon each version's scrape date

In [26]:
# generate timestamp string for each of the datasets that do not already contain one
vehicles_v5_df['posting_date'] = '2018-10-31'       # date derived from content description at https://www.kaggle.com/austinreese/craigslist-carstrucks-data/version/5
vehicles_v6_df['posting_date'] = '2019-06-09'       # date taken from version timestamp at https://www.kaggle.com/austinreese/craigslist-carstrucks-data/version/6
vehicles_v7_df['posting_date'] = '2019-07-14'       # date taken from version timestamp at https://www.kaggle.com/austinreese/craigslist-carstrucks-data/version/7

# reminder that v8 didn't have any data associated with its page

vehicles_v9_df['posting_date'] = '2021-04-19'       # date taken from version timestamp at https://www.kaggle.com/austinreese/craigslist-carstrucks-data/version/9

# reminder that v10 already has a 'posting_date' value for nearly all entries but the time needs to be stripped to include only the date
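
The per-version assignments can also be expressed as a lookup that only fills `posting_date` where a version lacks it. A sketch using the dates from the cell above; the `scrape_dates` dict, the `frames` dict, and the toy rows are hypothetical:

```python
import pandas as pd

# Scrape dates per dataset version, copied from the assignments above.
scrape_dates = {5: "2018-10-31", 6: "2019-06-09", 7: "2019-07-14", 9: "2021-04-19"}

# Toy stand-ins: v5 has no posting_date column, v10 already has one.
frames = {
    5: pd.DataFrame({"url": ["a"]}),
    10: pd.DataFrame({"url": ["b"], "posting_date": ["2021-05-04T12:31:18-0500"]}),
}

for version, df in frames.items():
    # Leave versions that already carry their own posting_date untouched.
    if "posting_date" not in df.columns and version in scrape_dates:
        df["posting_date"] = scrape_dates[version]

print(frames[5]["posting_date"].tolist())
```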

In [27]:
# verify that the data type for 'posting_date' is a Python object, i.e. dtype('O')
display(vehicles_v10_df['posting_date'].dtype)
# display the unique date values in the dataset before stripping away the time values
display(vehicles_v10_df['posting_date'].unique())

dtype('O')

array([nan, '2021-05-04T12:31:18-0500', '2021-05-04T12:31:08-0500', ...,
       '2021-04-04T03:21:17-0600', '2021-04-04T03:21:11-0600',
       '2021-04-04T03:21:07-0600'], dtype=object)

In [28]:
# strip the time and leave only the date for the vehicles_v10_df
vehicles_v10_df.loc[:,'posting_date'] = vehicles_v10_df['posting_date'].str.split('T').str[0]
# check how many unique dates are associated with entries in the dataset
pd.unique(vehicles_v10_df['posting_date'])

array([nan, '2021-05-04', '2021-05-03', '2021-05-02', '2021-05-01',
       '2021-04-30', '2021-04-29', '2021-04-28', '2021-04-27',
       '2021-04-26', '2021-04-25', '2021-04-24', '2021-04-23',
       '2021-04-22', '2021-04-21', '2021-04-20', '2021-04-19',
       '2021-04-18', '2021-04-17', '2021-04-16', '2021-04-15',
       '2021-04-14', '2021-04-13', '2021-04-12', '2021-04-11',
       '2021-04-10', '2021-04-09', '2021-04-08', '2021-04-07',
       '2021-04-06', '2021-04-05', '2021-04-04'], dtype=object)
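
The stripped values are still strings. If a real date dtype is wanted for later comparisons, `pd.to_datetime` can convert them, with missing values becoming `NaT`. A sketch on a few sample values in the same format as v10:

```python
import pandas as pd

raw = pd.Series([None, "2021-05-04T12:31:18-0500", "2021-04-04T03:21:07-0600"])

# Keep the local calendar date by taking the text before the 'T'.
date_text = raw.str.split("T").str[0]

# Parse the 'YYYY-MM-DD' strings; missing entries become NaT.
dates = pd.to_datetime(date_text, format="%Y-%m-%d")

print(dates.dtype)
print(dates.isna().sum())
```

Splitting before parsing keeps the date as posted in the listing's local time zone; parsing the full timestamp with a UTC conversion could shift some late-evening postings to the next day.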

# Step 4: Combine all datasets together; Eliminate any duplicate entries

In [29]:
'''
# set each dataset to use 'url' as its index
vehicles_v5_df.set_index(keys='url',inplace=True,verify_integrity=True)
vehicles_v6_df.set_index(keys='url',inplace=True,verify_integrity=True)
vehicles_v7_df.set_index(keys='url',inplace=True,verify_integrity=True)
vehicles_v9_df.set_index(keys='url',inplace=True,verify_integrity=True)
vehicles_v10_df.set_index(keys='url',inplace=True,verify_integrity=True)
'''

In [34]:
# stack every version into a single 'combined_df', printing the running row count after each step
# note: DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
# 1. start with a new empty 'combined_df'
combined_df = pd.DataFrame()
print(f'start:\t{len(combined_df)}')

combined_df = pd.concat([combined_df, vehicles_v5_df])
print(f'append v5:\t{len(combined_df):,.0f}')

combined_df = pd.concat([combined_df, vehicles_v6_df])
print(f'append v6:\t{len(combined_df):,.0f}')

combined_df = pd.concat([combined_df, vehicles_v7_df])
print(f'append v7:\t{len(combined_df):,.0f}')

combined_df = pd.concat([combined_df, vehicles_v9_df])
print(f'append v9:\t{len(combined_df):,.0f}')

combined_df = pd.concat([combined_df, vehicles_v10_df])
print(f'append v10:\t{len(combined_df):,.0f}')

start:	0
append v5:	677,812
append v6:	1,121,217
append v7:	1,668,981
append v9:	2,110,377
append v10:	2,537,257
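
The step heading mentions eliminating duplicates, which the cell above does not yet do. One possible approach, shown on toy frames, is a single `pd.concat` followed by `drop_duplicates` on `url`, keeping the most recent version of each listing. Keeping the last occurrence is an assumption here, not a decision recorded in the notebook:

```python
import pandas as pd

# Toy stand-ins for two versions; url "b" appears in both.
v5 = pd.DataFrame({"url": ["a", "b"], "price": [100, 200], "posting_date": "2018-10-31"})
v6 = pd.DataFrame({"url": ["b", "c"], "price": [180, 300], "posting_date": "2019-06-09"})

# Stack the versions oldest to newest; ignore_index gives a clean 0..n-1 index.
combined = pd.concat([v5, v6], ignore_index=True)

# For a url seen in several versions, keep only its latest row.
latest = combined.drop_duplicates(subset="url", keep="last")

print(len(combined), len(latest))
```

Because listings must be tracked across time, the full `combined` frame is probably still worth keeping alongside any de-duplicated view.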


In [12]:
# compare entries in version 6 to entries in version 5
# if the entry was in version 5 and NOT in version 6, mark it as sold (it could also have been removed; should we just assume it was sold?)
# if the entry was in version 5 and in version 6 then do nothing because it has NOT been sold/removed
# if the entry was NOT in version 5 and IS in version 6 then it's a new listing

# pd.merge and diff_df
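
The three cases above map directly onto the values that `pd.merge(..., indicator=...)` produces in an outer join. A sketch on toy data; the `status` labels are hypothetical, and 'removed' is used instead of 'sold' because a missing listing may simply have been taken down:

```python
import pandas as pd

# Toy stand-ins for two versions; url "b" appears in both.
v5 = pd.DataFrame({"url": ["a", "b"], "price": [100, 200]})
v6 = pd.DataFrame({"url": ["b", "c"], "price": [180, 300]})

# Outer join keeps every url; the 'exist' column records where each was found.
merged = pd.merge(v5, v6, on="url", how="outer", indicator="exist")

# left_only: in v5 but gone from v6; right_only: new in v6; both: still listed.
labels = {"left_only": "removed", "right_only": "new", "both": "still_listed"}
merged["status"] = merged["exist"].astype(str).map(labels)

print(merged[["url", "status"]].sort_values("url").to_string(index=False))
```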

In [13]:
# perform an outer join with an indicator column to identify which version 5 listings were gone (and which listings were new) by the time version 6 was scraped
diff_df = pd.merge(vehicles_v5_df, vehicles_v6_df, on='url', how='outer', indicator='exist')

In [14]:
diff_df = diff_df.loc[diff_df['exist'] != 'both']

In [15]:
diff_df.loc[diff_df['exist'] == 'right_only']

url,city_x,price_x,year_x,manufacturer_x,make_x,condition_x,cylinders_x,fuel_x,odometer_x,title_status_x,...,drive_y,size_y,type_y,paint_color_y,image_url_y,desc,lat_y,long_y,posting_date_y,exist
https://albuquerque.craigslist.org/cto/d/albuquerque-2005-hyundai-elantra/6904944887.html,,,,,,,,,,,...,fwd,mid-size,sedan,blue,https://images.craigslist.org/00C0C_jxcguIkSFN...,Selling a 2005 Hyundai Elantra 2.0 4 cylinder....,35.156142,-106.656501,2019-06-09,right_only
https://albuquerque.craigslist.org/cto/d/albuquerque-2010-bmw-535xi/6904936037.html,,,,,,,,,,,...,,,sedan,white,https://images.craigslist.org/00t0t_cy5QmA1gOF...,"2010 BMW 535xi\nAll wheel drive \n120,000\nCle...",35.151887,-106.708317,2019-06-09,right_only
https://albuquerque.craigslist.org/ctd/d/albuquerque-2005-toyota-tacoma-cd/6904932048.html,,,,,,,,,,,...,,,,,https://images.craigslist.org/00E0E_izNM6a51mb...,Contact at amandafiorello(at)hotmail.com\n\n \...,35.058537,-106.877873,2019-06-09,right_only
https://albuquerque.craigslist.org/cto/d/rio-rancho-classic-1971-ford-truck/6904926976.html,,,,,,,,,,,...,rwd,,,,https://images.craigslist.org/01414_gX7P5ovXx0...,CLASSIC FORD NEEDS A HOME. SERIOUS BUYERS ONLY...,35.249300,-106.681800,2019-06-09,right_only
https://albuquerque.craigslist.org/cto/d/albuquerque-toyota-camry-1998/6904923564.html,,,,,,,,,,,...,,compact,,,https://images.craigslist.org/00L0L_oXTsooKih6...,Good running car. Great on gas. It has a sunro...,35.058537,-106.877873,2019-06-09,right_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
https://zanesville.craigslist.org/ctd/d/zanesville-2014-ford-escape-se-4wd/6882690167.html,,,,,,,,,,,...,4wd,full-size,SUV,white,https://images.craigslist.org/00b0b_bGtUYkMZ6g...,"2014 Ford Escape SE 4WD - $11,750\n\nYear: 201...",39.972619,-82.010975,2019-06-09,right_only
https://zanesville.craigslist.org/ctd/d/zanesville-2013-ram-1500-tradesman-quad/6882690114.html,,,,,,,,,,,...,4wd,full-size,truck,white,https://images.craigslist.org/00C0C_b0QJFQy3wB...,"2013 RAM 1500 Tradesman Quad Cab 4WD - $20,990...",39.972619,-82.010975,2019-06-09,right_only
https://zanesville.craigslist.org/ctd/d/zanesville-2008-buick-lacrosse-cx/6882504783.html,,,,,,,,,,,...,fwd,full-size,sedan,brown,https://images.craigslist.org/00N0N_9lWEqAR0PJ...,Excellent condition everywhere. Runs and drive...,39.947697,-81.962020,2019-06-09,right_only
https://zanesville.craigslist.org/ctd/d/2013-ford-escape-bad-credit-no-credit/6882503839.html,,,,,,,,,,,...,fwd,,,,https://images.craigslist.org/00c0c_fn9EIA5C2Y...,IF YOU'RE LOOKING FOR A GREAT VEHICLE\n\nand y...,41.033400,-81.438500,2019-06-09,right_only


In [16]:
# left_only: 677,812
# right_only: 443,405
# sum: 1,121,217
diff_df

url,city_x,price_x,year_x,manufacturer_x,make_x,condition_x,cylinders_x,fuel_x,odometer_x,title_status_x,...,drive_y,size_y,type_y,paint_color_y,image_url_y,desc,lat_y,long_y,posting_date_y,exist
https://tricities.craigslist.org/cto/d/1978-bronco/6736984460.html,tricities,5000.0,1978.0,ford,bronco,,,gas,,clean,...,,,,,,,,,,left_only
https://tricities.craigslist.org/cto/d/2008-buick-lucerne-cxl-sale/6716121500.html,tricities,5000.0,2008.0,buick,lucerne cxl v6,like new,6 cylinders,gas,51000.0,clean,...,,,,,,,,,,left_only
https://tricities.craigslist.org/cto/d/2006-pont-gto/6731405764.html,tricities,13500.0,2006.0,,Pont GTO,excellent,8 cylinders,gas,93000.0,clean,...,,,,,,,,,,left_only
https://tricities.craigslist.org/cto/d/2006-mercedes-e350/6736958987.html,tricities,6200.0,2006.0,mercedes-benz,,,,gas,,rebuilt,...,,,,,,,,,,left_only
https://tricities.craigslist.org/cto/d/2016-ford-f350-dually/6736964819.html,tricities,37900.0,2016.0,ford,f350,excellent,8 cylinders,diesel,70500.0,clean,...,,,,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
https://zanesville.craigslist.org/ctd/d/zanesville-2014-ford-escape-se-4wd/6882690167.html,,,,,,,,,,,...,4wd,full-size,SUV,white,https://images.craigslist.org/00b0b_bGtUYkMZ6g...,"2014 Ford Escape SE 4WD - $11,750\n\nYear: 201...",39.972619,-82.010975,2019-06-09,right_only
https://zanesville.craigslist.org/ctd/d/zanesville-2013-ram-1500-tradesman-quad/6882690114.html,,,,,,,,,,,...,4wd,full-size,truck,white,https://images.craigslist.org/00C0C_b0QJFQy3wB...,"2013 RAM 1500 Tradesman Quad Cab 4WD - $20,990...",39.972619,-82.010975,2019-06-09,right_only
https://zanesville.craigslist.org/ctd/d/zanesville-2008-buick-lacrosse-cx/6882504783.html,,,,,,,,,,,...,fwd,full-size,sedan,brown,https://images.craigslist.org/00N0N_9lWEqAR0PJ...,Excellent condition everywhere. Runs and drive...,39.947697,-81.962020,2019-06-09,right_only
https://zanesville.craigslist.org/ctd/d/2013-ford-escape-bad-credit-no-credit/6882503839.html,,,,,,,,,,,...,fwd,,,,https://images.craigslist.org/00c0c_fn9EIA5C2Y...,IF YOU'RE LOOKING FOR A GREAT VEHICLE\n\nand y...,41.033400,-81.438500,2019-06-09,right_only
