# Preliminary data preprocessing & exploration

In this notebook we explore the data and look for insights that may be valuable to us. We also test some hypotheses about data types, values, and correlations. In addition, we write the processing functions that will later be used together in the pipeline.

The data we are working with comes from Kaggle, so please make sure you have downloaded the [dataset](https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data) and saved it to `data/archive/vehicles.csv` relative to this notebook. If you keep it somewhere else, change the file path in cell 2.

In [1]:
import pandas as pd

In [2]:
# read data from dataset
# https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data

df = pd.read_csv('data/archive/vehicles.csv')
df.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,7222695916,https://prescott.craigslist.org/cto/d/prescott...,prescott,https://prescott.craigslist.org,6000,,,,,,...,,,,,,,az,,,
1,7218891961,https://fayar.craigslist.org/ctd/d/bentonville...,fayetteville,https://fayar.craigslist.org,11900,,,,,,...,,,,,,,ar,,,
2,7221797935,https://keys.craigslist.org/cto/d/summerland-k...,florida keys,https://keys.craigslist.org,21000,,,,,,...,,,,,,,fl,,,
3,7222270760,https://worcester.craigslist.org/cto/d/west-br...,worcester / central MA,https://worcester.craigslist.org,1500,,,,,,...,,,,,,,ma,,,
4,7210384030,https://greensboro.craigslist.org/cto/d/trinit...,greensboro,https://greensboro.craigslist.org,4900,,,,,,...,,,,,,,nc,,,


In [3]:
# percentage of null values
df.isna().sum() / df.shape[0] * 100

id                0.000000
url               0.000000
region            0.000000
region_url        0.000000
price             0.000000
year              0.282281
manufacturer      4.133714
model             1.236179
condition        40.785232
cylinders        41.622470
fuel              0.705819
odometer          1.030735
title_status      1.930753
transmission      0.598763
VIN              37.725356
drive            30.586347
size             71.767476
type             21.752717
paint_color      30.501078
image_url         0.015930
description       0.016398
county          100.000000
state             0.000000
lat               1.534155
long              1.534155
posting_date      0.015930
dtype: float64

## `region` vs `region_url`

In [6]:
view = df.groupby(['region', 'region_url'])['id']

In [7]:
view.nunique()

region                  region_url                        
SF bay area             https://sfbay.craigslist.org          2936
abilene                 https://abilene.craigslist.org         235
akron / canton          https://akroncanton.craigslist.org    2211
albany                  https://albany.craigslist.org         2312
                        https://albanyga.craigslist.org        225
                                                              ... 
york                    https://york.craigslist.org            777
youngstown              https://youngstown.craigslist.org      664
yuba-sutter             https://yubasutter.craigslist.org     1747
yuma                    https://yuma.craigslist.org            335
zanesville / cambridge  https://zanesville.craigslist.org      313
Name: id, Length: 416, dtype: int64

The output above shows that the `albany` region name covers more than one city (it appears with two different URLs), which indicates that `region_url` carries a bit more information than `region` alone. Raw links are harder to interpret, though, so we transform them into a more human-friendly form by removing `https://` and `.craigslist.org`.
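To list every region name that maps to more than one `region_url`, rather than spotting them by eye in the grouped output, something like the following could be used. This is a minimal sketch on a made-up frame (`toy` and its values are illustrative stand-ins for `df`); only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for `df`; the rows are illustrative only.
toy = pd.DataFrame({
    "region": ["albany", "albany", "abilene", "albany"],
    "region_url": [
        "https://albany.craigslist.org",
        "https://albanyga.craigslist.org",
        "https://abilene.craigslist.org",
        "https://albany.craigslist.org",
    ],
})

# Number of distinct URLs behind each region name
urls_per_region = toy.groupby("region")["region_url"].nunique()

# Region names that are ambiguous on their own
ambiguous_regions = urls_per_region[urls_per_region > 1]
print(ambiguous_regions)
```

Running the same two lines on `df` would show how many region names are affected.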

In [45]:
# verify that all links follow the format `https://<REGION-CODE>.craigslist.org`
# (any link printed below is an exception)
for line in df['region_url'].unique():
    if not line.startswith('https://') or not line.endswith('.craigslist.org'):
        print(line)

In [46]:
def normalize_region_url(data: pd.DataFrame) -> pd.DataFrame:
    if 'region_url' not in data.columns:
        return data
    
    from_link = lambda l: l.strip()[8:-15]  # TODO: swap to regex or more robust way
    data['region_url'] = data['region_url'].apply(from_link)
    return data
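The TODO above mentions switching to a regex. One possible shape is sketched below; it is not the function used in the pipeline. The names `REGION_URL_RE`, `extract_region_code` and `normalize_region_url_regex` are hypothetical, and the sketch assumes that values which do not match the expected format should be left unchanged. Unlike the slicing version, it works on a copy instead of modifying the input frame.

```python
import re
import pandas as pd

# Assumed link format: https://<REGION-CODE>.craigslist.org
REGION_URL_RE = re.compile(r"https://([^./]+)\.craigslist\.org")

def extract_region_code(url):
    """Return the region code, or the input unchanged if it does not match."""
    if not isinstance(url, str):
        return url  # e.g. NaN stays NaN
    match = REGION_URL_RE.fullmatch(url.strip())
    return match.group(1) if match else url

def normalize_region_url_regex(data: pd.DataFrame) -> pd.DataFrame:
    if "region_url" not in data.columns:
        return data
    data = data.copy()  # leave the caller's frame untouched
    data["region_url"] = data["region_url"].apply(extract_region_code)
    return data

# Illustrative input only
toy = pd.DataFrame({
    "region_url": ["https://albanyga.craigslist.org", " https://sfbay.craigslist.org "]
})
result = normalize_region_url_regex(toy)
print(result["region_url"].tolist())
```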

### Check how `VIN` correlates with `id`

First of all, it is worth mentioning that `id` corresponds to an advertisement, while `VIN` should (ideally) be a unique identifier of a car. However, the same `VIN` may appear several times in the dataset, because a car may be sold two or more times during the period in which the data was collected. So we want to check this.

In [47]:
view_vin = df.groupby('VIN')['id']

In [52]:
# Percentage of unique VINs (note: unique() also counts NaN as one value)
len(df['VIN'].unique()) / df.shape[0] * 100

27.70450712143928
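The figure above is the number of distinct VINs relative to all rows, including rows with no VIN at all. A complementary view is how many VINs appear in more than one advertisement. The snippet below is a minimal sketch on a made-up frame (`toy` and its values are illustrative); it also shows the difference between `unique()` and `nunique()` when missing values are present.

```python
import pandas as pd

# Toy stand-in for `df`; values are illustrative only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "VIN": ["A1", "A1", "B2", None, None],
})

# nunique() ignores missing values, while unique() keeps one entry for them
n_distinct = toy["VIN"].nunique()
n_with_missing = len(toy["VIN"].unique())

# Advertisements per VIN; value_counts() drops missing values by default
ads_per_vin = toy["VIN"].value_counts()
repeated_vins = ads_per_vin[ads_per_vin > 1]

# Share of distinct VINs that occur in more than one advertisement
share_repeated = len(repeated_vins) / n_distinct * 100
print(n_distinct, n_with_missing, share_repeated)
```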

In [54]:
# an example of several advertisements for the same `VIN`
df[df['VIN'] == 'ZPBUA1ZL1KLA02237']

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
9080,7306378728,https://flagstaff.craigslist.org/ctd/d/miami-2...,flagstaff / sedona,https://flagstaff.craigslist.org,229500,2019.0,,Lamborghini Urus,,8 cylinders,...,compact,SUV,white,https://images.craigslist.org/00h0h_eYxCtZ0aXt...,Elite Motor Cars of MiamiAsk for: Sales☎ (786)...,,az,25.827103,-80.24152,2021-04-13T14:09:46-0700
415796,7306494891,https://appleton.craigslist.org/ctd/d/miami-20...,appleton-oshkosh-FDL,https://appleton.craigslist.org,229500,2019.0,,Lamborghini Urus,,8 cylinders,...,compact,SUV,white,https://images.craigslist.org/00h0h_eYxCtZ0aXt...,Elite Motor Cars of MiamiAsk for: Sales☎ (786)...,,wi,25.827103,-80.24152,2021-04-13T20:43:00-0500


### Check whether advertisements with the same `VIN` have the same technical parameters

Here we look at the `cylinders` column, since we expect it to be consistent for a given `VIN`.

In [55]:
view_trans = df.groupby('VIN')['cylinders']

In [56]:
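import numpy as np

# find the first plausible VIN (one that contains both digits and letters)
# whose advertisements disagree on the number of cylinders
for vin, cylinders in view_trans:
    has_digit = any(ch.isdigit() for ch in vin)
    has_letter = any(ch.isalpha() for ch in vin)

    # dropna() removes missing values before distinct values are counted
    if has_digit and has_letter and cylinders.dropna().nunique() > 1:
        print(vin)
        break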

19UUA66267A021807


In [58]:
# completely different cars have the same `VIN`!!!! 🤯🤯🤯
df[df['VIN'] == '19UUA66267A021807']

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
116766,7316401351,https://tampa.craigslist.org/hil/ctd/d/tampa-2...,tampa bay area,https://tampa.craigslist.org,5495,2008.0,toyota,camry,excellent,4 cylinders,...,mid-size,sedan,custom,https://images.craigslist.org/01212_fQan0irH9q...,$1500 DOWN!! 2008 TOYOTA CAMARY CE $1500....,,fl,27.965972,-82.385063,2021-05-03T16:15:31-0400
116936,7316182527,https://tampa.craigslist.org/hil/ctd/d/tampa-2...,tampa bay area,https://tampa.craigslist.org,12995,2014.0,ford,f150,excellent,8 cylinders,...,full-size,truck,white,https://images.craigslist.org/00V0V_5A4QbSlT2t...,CASH SPECIAL $12995.00 !! 2014 FORD F150 XL...,,fl,27.965972,-82.385063,2021-05-03T10:32:23-0400
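The loop above stops at the first conflicting `VIN`. To collect all of them at once, a grouped `nunique()` could be used instead. The sketch below runs on a made-up frame: the two VINs are the ones shown in this notebook, but the rows themselves are illustrative, and the digit-and-letter plausibility filter is left out for brevity.

```python
import pandas as pd

# Toy stand-in for `df`; rows are illustrative only.
toy = pd.DataFrame({
    "VIN": [
        "19UUA66267A021807", "19UUA66267A021807",
        "ZPBUA1ZL1KLA02237", "ZPBUA1ZL1KLA02237",
        None,
    ],
    "cylinders": ["4 cylinders", "8 cylinders", "8 cylinders", None, "6 cylinders"],
})

# groupby() skips rows with a missing VIN; nunique() ignores missing cylinders
distinct_cylinders = toy.groupby("VIN")["cylinders"].nunique()

# VINs whose advertisements report more than one cylinder count
conflicting_vins = distinct_cylinders[distinct_cylinders > 1].index.tolist()
print(conflicting_vins)
```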


In [109]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=3, cols=3, specs=[
    [{"type": "pie"} for _ in range(3)] for _ in range(3)
])


# let's look at the distributions of categorical features (those with few unique values)
for idx, column_name in enumerate(['condition', 'cylinders', 'fuel', 'transmission', 'drive', 'size', 'type', 'paint_color']):
    counted_data = df[column_name].value_counts(dropna=False).reset_index()
    fig.add_trace(
        go.Pie( 
            values=counted_data['count'], 
            labels=counted_data[column_name],
            name=column_name
        ),
        row=(idx//3) + 1, col=(idx%3) + 1
    )

fig.update_layout(height=1000, width=1000, title_text='Categorical data distributions')

### Results of the analysis

In [110]:
drop_columns = [
#     'id',
    'url',
    'region',
#     'region_url',
#     'price',
#     'year',
#     'manufacturer',
#     'model',
#     'condition',
#     'cylinders',
#     'fuel',
#     'odometer',
    'title_status',
#     'transmission',
    'VIN',
#     'drive',
#     'size',
#     'type',
#     'paint_color',
    'image_url',
    'description',
    'county',
#     'state',
#     'lat',
#     'long',
    'posting_date'
]
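One way this list might later be applied is sketched below. `drop_unused_columns` is a hypothetical helper name, not something defined elsewhere in this notebook, and `toy` is a made-up frame. The list repeats the uncommented entries of `drop_columns`, and `errors="ignore"` is an assumption that columns already absent should simply be skipped.

```python
import pandas as pd

# Same entries as the uncommented items in `drop_columns`
drop_columns = [
    "url", "region", "title_status", "VIN",
    "image_url", "description", "county", "posting_date",
]

def drop_unused_columns(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a frame without the listed columns."""
    # errors="ignore" skips columns that are not present in `data`
    return data.drop(columns=drop_columns, errors="ignore")

# Illustrative input only
toy = pd.DataFrame({
    "id": [1],
    "url": ["https://example.org/ad"],
    "price": [6000],
    "county": [None],
})
cleaned = drop_unused_columns(toy)
print(list(cleaned.columns))
```

`DataFrame.drop` returns a new frame, so the input is left unchanged.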