## Other Variables - General EDA 

While the other variables aren't directly involved in the problem statement, it is still important to carry out proper EDA to gain insight into how they collectively affect the price of used cars.

This section generates a SweetViz report of the dataset. Besides doing much of the basic EDA for each variable, the report includes an associations matrix: Pearson correlation for pairs of numerical variables, and the correlation ratio and uncertainty coefficient for pairs involving categorical variables. This lets us gauge how strongly each variable is associated with the price variable.

In [None]:
# importing packages to be used
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import sweetviz

pd.options.mode.chained_assignment = None

In [2]:
# If you have not downloaded the dataset, you can unzip the vehicles.zip file to obtain the csv
# or you can download version 10 of the dataset from
# https://www.kaggle.com/austinreese/craigslist-carstrucks-data

import zipfile
# zipfile lets us read vehicles.csv straight from the archive, so there is no need to unzip it first
with zipfile.ZipFile("craigslist-carstrucks-data/vehicles.zip") as z:
    with z.open("vehicles.csv") as f:
        carData = pd.read_csv(f)
carData.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,7222695916,https://prescott.craigslist.org/cto/d/prescott...,prescott,https://prescott.craigslist.org,6000,,,,,,...,,,,,,,az,,,
1,7218891961,https://fayar.craigslist.org/ctd/d/bentonville...,fayetteville,https://fayar.craigslist.org,11900,,,,,,...,,,,,,,ar,,,
2,7221797935,https://keys.craigslist.org/cto/d/summerland-k...,florida keys,https://keys.craigslist.org,21000,,,,,,...,,,,,,,fl,,,
3,7222270760,https://worcester.craigslist.org/cto/d/west-br...,worcester / central MA,https://worcester.craigslist.org,1500,,,,,,...,,,,,,,ma,,,
4,7210384030,https://greensboro.craigslist.org/cto/d/trinit...,greensboro,https://greensboro.craigslist.org,4900,,,,,,...,,,,,,,nc,,,
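As a minimal sketch of how reading a CSV from inside a zip archive works, the example below builds a tiny archive in memory and reads one member without extracting it to disk. The file name `toy_vehicles.csv` and its two rows are made up for illustration and are not part of the real dataset.

```python
import io
import zipfile

import pandas as pd

# Build a small zip archive in memory; the member name and rows are hypothetical
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("toy_vehicles.csv", "price,year\n6000,2010\n11900,2015\n")

# Open the CSV member as a file-like object and hand it straight to pandas
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    with archive.open("toy_vehicles.csv") as member:
        toy_data = pd.read_csv(member)

print(toy_data.shape)
```

The same pattern applies to an archive on disk: only the member being read is decompressed, and no extracted copy is left behind.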


In [None]:
# Please refer to the HTML report under the EDA folder

report = sweetviz.analyze(carData, target_feat="price")  
report.show_html('EDA/report.html')

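For a categorical variable, one common way to measure association with a numerical target such as price is the correlation ratio, which SweetViz's documentation lists among its association measures. The sketch below is our own illustration of that statistic on made-up data. It is not SweetViz's implementation, and the helper name `correlation_ratio` is hypothetical.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories, values):
    """Correlation ratio (eta): square root of the share of variance explained by group means."""
    frame = pd.DataFrame({"category": categories, "value": values}).dropna()
    overall_mean = frame["value"].mean()
    groups = frame.groupby("category")["value"]
    # Between-group sum of squares: group size times squared distance of the group mean
    between = (groups.count() * (groups.mean() - overall_mean) ** 2).sum()
    # Total sum of squares around the overall mean
    total = ((frame["value"] - overall_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0


# Toy data (made up): price differs strongly between the two fuel types
toy = pd.DataFrame({
    "fuel": ["gas", "gas", "diesel", "diesel"],
    "price": [5000, 7000, 20000, 22000],
})
eta = correlation_ratio(toy["fuel"], toy["price"])
print(round(eta, 3))
```

A value near 1 means the category explains almost all of the variation in price, while a value near 0 means the group means are nearly identical.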

The following code blocks plot a bar graph for each variable, showing how many times each of its unique values appears.

In [None]:
def unique_counts_graph(dataset, variable):
    # Plots a bar graph of how many times each unique value of a variable appears
    plt.figure(figsize=(18, 10))
    ax = dataset[variable].value_counts().plot(kind='bar')
    ax.set_title("Number of listings per unique value of '{}'".format(variable))

In [None]:
variable_list = ["condition", "cylinders", "fuel", "title_status", "type", "size", "transmission", "drive", "paint_color"]
for variable in variable_list:
    unique_counts_graph(carData, variable)
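One detail worth keeping in mind when reading these graphs is that `value_counts()` skips missing values by default, so the bars only cover listings where the variable is filled in. A small sketch on made-up data, using `dropna=False` to count the missing entries as well:

```python
import pandas as pd

# Toy column (made up) with two missing entries
toy = pd.DataFrame({"condition": ["good", "excellent", "good", None, "fair", None, "good"]})

# Default behaviour: missing values are left out of the counts
counts_without_missing = toy["condition"].value_counts()

# dropna=False adds a separate entry for the missing values
counts_with_missing = toy["condition"].value_counts(dropna=False)

print(counts_without_missing.sum(), counts_with_missing.sum())
```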

The following code block generates box plots for different variables, showing the price distribution for each unique value. Only variables with a limited number of unique values have been chosen, so that each graph stays readable and useful.

In [None]:
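def box_plot_generation(dataset, variable):
    # Box plot of the price distribution for each unique value of a variable
    df = dataset[['price', variable]].copy()  # copy so the original dataset is not modified
    df[variable] = df[variable].astype('category')
    # Interquartile range (IQR) of price, used to cap the y-axis so that
    # extreme outliers do not squash the boxes
    price_q1 = df['price'].quantile(0.25)
    price_q3 = df['price'].quantile(0.75)
    iqr = price_q3 - price_q1
    upper_limit = price_q3 + 1.5 * iqr
    plt.figure(figsize=(16, 8))
    ax = sb.boxplot(x=variable, y='price', data=df)
    ax.set_ylim(0, upper_limit * 3)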

In [None]:
variable_list = ["condition", "cylinders", "fuel", "size", "transmission", "drive"]
for variable in variable_list:
    box_plot_generation(carData, variable)
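The y-axis limit in the box plots is derived from the interquartile range (IQR) of price. As a minimal sketch of that calculation, the example below computes the usual 1.5 × IQR fences on a made-up price series with one extreme listing. The numbers are illustrative only.

```python
import pandas as pd

# Toy prices (made up), with one extreme value
prices = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 1_000_000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are conventionally drawn as outliers in a box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(lower_fence, upper_fence, outliers.tolist())
```

Capping the axis at a multiple of the upper fence keeps the boxes legible even when a few listings have implausibly high prices.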

## Dropping Variables

Several variables have to be dropped, for different reasons. This section justifies each of those decisions.

A correlation matrix of the listings of several manufacturers shows that the latitude and longitude of the car listings have almost no correlation with price. We have thus decided to drop them.

In [None]:
manufacturer_list = ['ford', 'chevrolet', 'toyota', 'honda', 'nissan', 'jeep']
fig, ax = plt.subplots(ncols=3, nrows=2, figsize=(24, 20))
# One heatmap per manufacturer, filling the 2x3 grid of axes row by row
for axis, manufacturer in zip(ax.flat, manufacturer_list):
    listings = carData[carData['manufacturer'] == manufacturer]
    # Keep only the non-object (numeric) columns, since corr() needs numeric data
    numeric_cols = [column for column in listings.columns if listings[column].dtype != 'object']
    sb.heatmap(listings[numeric_cols].corr(), vmin=-1, vmax=1, annot=True, fmt=".2f", ax=axis)
    axis.set_title(manufacturer)
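When the full heatmap is hard to read, the correlations with price alone can be pulled out as a single column. The sketch below does this on a made-up table. The column names mirror the dataset, but the values are invented and say nothing about the real correlations.

```python
import pandas as pd

# Toy listings (made up): price rises with year, lat is unrelated
toy = pd.DataFrame({
    "price": [5000, 9000, 12000, 20000, 26000],
    "year": [2005, 2009, 2012, 2017, 2020],
    "lat": [40.1, 33.4, 47.6, 29.8, 40.7],
})

# Pearson correlation of every other numeric column with price
price_corr = toy.corr()["price"].drop("price")

# Rank by absolute strength, since a strong negative correlation is also informative
ranked = price_corr.abs().sort_values(ascending=False)
print(ranked)
```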

We have also decided to drop cylinders. This is because a significant portion of the listings have the value "other", which is hard to replace: it covers electric car listings, which have no cylinders, as well as conventional cars with more than 12 cylinders. We cannot lump conventional cars together with electric cars, as this would distort the data.

In [None]:
carData["cylinders"].value_counts()

The same can be said for car type. Unfortunately, a large portion of the values are "other", and there is no easy way to clean this data up without making sweeping generalizations or manually checking each model. 

In [None]:
carData["type"].value_counts()
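To put a number on how much of a column such as cylinders or type is taken up by "other", one can compute the share of rows with that value alongside the share of missing rows. This is a minimal sketch on made-up data. The helper name `other_and_missing_share` is hypothetical.

```python
import pandas as pd


def other_and_missing_share(series, placeholder="other"):
    """Return the fraction of rows equal to `placeholder` and the fraction that are missing."""
    other_share = (series == placeholder).mean()
    missing_share = series.isna().mean()
    return float(other_share), float(missing_share)


# Toy column (made up): 2 of 8 rows are "other", 3 of 8 are missing
toy_cylinders = pd.Series(
    ["4 cylinders", "other", None, "6 cylinders", "other", None, None, "8 cylinders"]
)
other_share, missing_share = other_and_missing_share(toy_cylinders)
print(other_share, missing_share)
```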

There are simply too many distinct models to do anything useful with this variable.

In [None]:
carData["model"].nunique()
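Beyond the raw number of distinct values, it can help to check how many listings the most common values cover, since a long tail of rare models is what makes the column hard to use. A small sketch on made-up model names:

```python
import pandas as pd

# Toy model column (made up), including one missing entry
toy_models = pd.Series(
    ["f-150", "f-150", "f-150", "civic", "civic", "camry", "silverado 1500", "wrangler", None]
)

# Number of distinct non-missing values
n_models = toy_models.nunique()

# Share of non-missing listings covered by the two most frequent models
counts = toy_models.value_counts()
top2_coverage = counts.head(2).sum() / counts.sum()

print(n_models, top2_coverage)
```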

It is hard to carry out any meaningful analysis on paint_color, as the colors given are very general.

In [None]:
carData["paint_color"].value_counts()

# End of EDA
This marks the end of this notebook.

To conclude, in this notebook we:

- Explored the features of the dataset
- Justified why we wanted to drop some variables
- Justified why certain variables are worth exploring

Learning points

- Using Python packages such as sweetviz to generate reports
- The variety of ways we can display and analyse the same information

In the next notebook, we will perform data cleaning using the results and insights gained here.