In [8]:
#set the wd to file location
setwd(getSrcDirectory(function(){})[1])

# Exploratory Data Analysis 

This notebook is to be run 3rd.\
The purpose of this file is to perform initial exploration on the split of training data prior to vetting to suggest the next steps - wrangling, transformations, predictive methods;

In [2]:
#Import Necessary libraries
suppressWarnings({
library("dplyr")
library("tidyverse")})

In [74]:
##Read Data
data<- read.csv("./03a-vehicles-sample.csv", header = TRUE)

check the dimensions of the data frame

In [4]:
dim(data)

In [5]:
colnames(data)

### Convert all character data to factors

In [6]:
data<- data%>%
mutate_if(sapply(data, is.character), as.factor)

### Summary Statistics

In [7]:
summary(data)

     price             year         manufacturer       condition    
 Min.   :  1003   Min.   :2000   ford     :31090            :69731  
 1st Qu.:  7500   1st Qu.:2008   chevrolet:23357   excellent:50158  
 Median : 14995   Median :2013   toyota   :16500   fair     : 2776  
 Mean   : 18614   Mean   :2012   honda    :11184   good     :59523  
 3rd Qu.: 26590   3rd Qu.:2017   nissan   : 9526   like new :12023  
 Max.   :199999   Max.   :2022   jeep     : 8311   new      :  612  
                                 (Other)  :95096   salvage  :  241  
        cylinders           fuel           odometer          title_status   
             :75248           :  1041   Min.   :     0             :  2684  
 6 cylinders :46043   diesel  : 12150   1st Qu.: 41526   clean     :184490  
 4 cylinders :40315   electric:   849   Median : 92050   lien      :  1028  
 8 cylinders :30976   gas     :165360   Mean   : 96207   missing   :    94  
 5 cylinders :  968   hybrid  :  2753   3rd Qu.:139056   parts 

### Comments on numeric data:
 - ```year``` : values range from 2000 to 2022 depicting the year the posted vehicle was manufactured. A better variable for interpretation would be ```age```. The dataset was collected for posts made in year 2022 (over the period of 3 months). We can mutate the dataset to create ```age``` as follows $age = 2022 - year$.
  - ```odometer``` : values range from 0 to ~500,000 miles. Extreme values were excluded from the dataset (see 01-dataClean.ipynb), however looking at the summary odometer appears to be skewed to the right. Considering all values are positive, performing a square-root or cube-root transformation might be beneficial. 

### Comments on categorical data 
 - Binning : many attributes have similar categories that can be binned to produce better predictive power. attribute - ```condition```, ```fule```, ```cylinders```, ```transmission```, ```drive```, ```size```, ```type```, ```paint_color``` & ```title_status``` have significant amount of missing values, these may have their own predictive power. we expect a lower asking price if the poster has lesser information about the car. 
 

### Comments on Response Variable ```price```
 ```price``` gives us the posted ask price in USD for a given vehicle. From the summary statistics, we notice a significant interval in the prices between 3rd Qu. and max price. This indicates a right skewed data with many possible extreme points. We suggest trying a log transformation to account for this. 

In [76]:
country_of_orign<- function(manufacturer){
    originCountry = switch(manufacturer,
        "missing" = "missing",
        "acura" = "Japan",
        "alfa-romeo" = "Italy",
        "aston-martin" = "UK",
        "audi" = "Germany",
        "bmw" = "Germany", 
        "buick" = "USA",
        "cadillac" = "USA",
        "chevrolet" = "USA",
        "chrysler" = "USA",
        "dodge" = "USA",
        "ferrari" = "Italy",
        "fiat" = "Italy",
        "ford" = "USA",
        "gmc" = "USA",
        "harley-davidson" = "USA", 
        "honda" = "Japan",
        "hyundai" = "South Korea",
        "infiniti" = "Japan",
        "jaguar" = "UK",
        "jeep" = "USA",
        "kia" = "South Korea",
        "land rover" = "UK",
        "lexus" = "Japan",
        "lincoln" = "USA",
        "mazda" = "Japan",
        "mercedes-benz" = "Germany",
        "mercury" = "USA",
        "mini" = "UK",
        "mitsubishi" = "Japan",
        "mogran" = "UK",
        "nissan" = "Japan",
        "pontiac" = "USA",
        "porsche" = "Germany",
        "ram" = "USA",
        "rover" = "UK",
        "saturn" = "USA",
        "subaru" = "Japan",
        "tesla" = "USA",
        "toyota"  = "Japan",
        "volkswa gen" = "Germany",
        "volvo" = "Sweden")
    return(originCountry)
    }

In [59]:
var <- as.character(data$manufacturer)
head(var)
var[var==""] = "missing"

data$manufacturer<- var
head(data$manufacturer, 200)

In [80]:
var = sapply(data$manufacturer, function(i) country_of_orign(i))

In [81]:
head(var, 20)


In [82]:
new_data = data%>%
mutate(countryOrign = var)
colnames(new_data)
head(new_data)

Unnamed: 0_level_0,price,year,manufacturer,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color,state,countryOrign
Unnamed: 0_level_1,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<named list>
1,33590,2014,gmc,good,8 cylinders,gas,57923,clean,other,,,pickup,white,al,USA
2,22590,2010,chevrolet,good,8 cylinders,gas,71229,clean,other,,,pickup,blue,al,USA
3,39590,2020,chevrolet,good,8 cylinders,gas,19160,clean,other,,,pickup,red,al,USA
4,30990,2017,toyota,good,8 cylinders,gas,41124,clean,other,,,pickup,red,al,Japan
5,15000,2013,ford,excellent,6 cylinders,gas,128000,clean,automatic,rwd,full-size,truck,black,al,USA
6,27990,2012,gmc,good,8 cylinders,gas,68696,clean,other,4wd,,pickup,black,al,USA


In [83]:
country_origin_transform<-function(data){
    var <- as.character(data$manufacturer)
    var[var==""] = "missing"
    data$manufacturer =  var
    country = sapply(data$manufacturer, function(i) country_of_orign(i))
    new_data = data%>%
        mutate(countryOrign = country)
    return(new_data)             
    }

In [84]:
head(country_origin_transform(data))

Unnamed: 0_level_0,price,year,manufacturer,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color,state,countryOrign
Unnamed: 0_level_1,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<named list>
1,33590,2014,gmc,good,8 cylinders,gas,57923,clean,other,,,pickup,white,al,USA
2,22590,2010,chevrolet,good,8 cylinders,gas,71229,clean,other,,,pickup,blue,al,USA
3,39590,2020,chevrolet,good,8 cylinders,gas,19160,clean,other,,,pickup,red,al,USA
4,30990,2017,toyota,good,8 cylinders,gas,41124,clean,other,,,pickup,red,al,Japan
5,15000,2013,ford,excellent,6 cylinders,gas,128000,clean,automatic,rwd,full-size,truck,black,al,USA
6,27990,2012,gmc,good,8 cylinders,gas,68696,clean,other,4wd,,pickup,black,al,USA
