In [1]:
from fastai.tabular.all import *

pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns', None)
set_seed(42)

In [2]:
df_train = pd.read_csv('train.csv') # 6298 rows
df_test = pd.read_csv('test.csv') # 1000 rows

# File Contents

The training dataset file contains 6298 rows of data in 29 columns. The last 2 columns are the dependent variables: one is categorical ['Vehicle_Trim'] and the other is numeric ['Dealer_Listing_Price']. These two columns do not exist in the test dataset file.

Below are all the column names, with the data type of each column and whether any values are missing in either dataset. An initial feature engineering action is also listed where one has been chosen.
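The per-column dtype and missing-value counts quoted in the sections that follow could be gathered in one table. The sketch below is one way to do that; `missing_summary` is a hypothetical helper name, and the two small frames are stand-ins for `df_train` and `df_test` with made-up values.

```python
import pandas as pd

def missing_summary(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value counts per column for both frames."""
    summary = pd.DataFrame({
        "dtype": train.dtypes.astype(str),
        "missing_train": train.isna().sum(),
        # Columns absent from the test file (the two targets) become NaN here.
        "missing_test": test.isna().sum(),
    })
    # Keep the column order of the training file.
    return summary.reindex(train.columns)

# Stand-in frames with made-up values; the real call would be
# missing_summary(df_train, df_test).
demo_train = pd.DataFrame({
    "SellerZip": [60601.0, None],
    "VehMake": ["Jeep", "Cadillac"],
    "Vehicle_Trim": ["Limited", None],
})
demo_test = pd.DataFrame({"SellerZip": [10001.0], "VehMake": ["Jeep"]})

summary = missing_summary(demo_train, demo_test)
print(summary)
```

A NaN in `missing_test` means the column is not present in the test file at all, which is expected for the two dependent variables.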


## ListingID                 

ListingID is an int64 column that serves as the unique identifier for each listing, effectively an index. It will be dropped from the training data because it serves no purpose in the model. It must also be removed from the test features, but the values need to be kept for the final submission.

#### ACTION: Remove column to simplify the model

df = df.drop(['ListingID'], axis = 1)
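Since the ListingID values are still needed for the submission, they can be set aside before the column is dropped from the test features. This is only a sketch: the frame is a stand-in for `df_test`, `predicted` is a placeholder rather than model output, and the submission layout (ListingID plus one prediction column) is an assumption.

```python
import pandas as pd

# Stand-in for df_test with made-up values.
demo_test = pd.DataFrame({"ListingID": [101, 102, 103],
                          "VehYear": [2015, 2018, 2019]})

# Keep the identifiers aside before dropping the column from the features.
listing_ids = demo_test["ListingID"].copy()
features = demo_test.drop(columns=["ListingID"])

# After prediction, the saved identifiers are re-attached row by row.
# `predicted` is a placeholder list, not real model output.
predicted = ["?"] * len(features)
submission = pd.DataFrame({"ListingID": listing_ids.to_numpy(),
                           "Vehicle_Trim": predicted})
```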


## SellerCity        

SellerCity is a string type. There are no missing values in either the training or test set. 
To simplify the model, this may be eliminated or combined with SellerState and SellerZip to define location.

ACTION: Remove after fixing the zipcode

## SellerIsPriv               

SellerIsPriv is a boolean value to indicate if the listing is from a private seller. There are no missing values in either the training or test set.

ACTION: Keep, make sure it is categorical

## SellerListSrc          

SellerListSrc is a string value. There are 2 values missing in the training data and none in the test data.

ACTION: Fill in the 2 values in the training data.
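The notes do not say which value to fill in. One common choice, assumed here, is the most frequent value in the training data. `fill_with_train_mode` is a hypothetical helper and the source names in the demo frame are made up.

```python
import pandas as pd

def fill_with_train_mode(train: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of train with missing values in col set to the column's mode."""
    most_common = train[col].mode(dropna=True).iloc[0]
    filled = train.copy()
    filled[col] = filled[col].fillna(most_common)
    return filled

# Made-up category labels, used only to exercise the helper.
demo = pd.DataFrame({"SellerListSrc": ["Source A", "Source B", "Source A", None]})
demo_filled = fill_with_train_mode(demo, "SellerListSrc")
```

With only 2 missing rows out of 6298, the choice of fill value is unlikely to matter much; dropping the two rows would be a reasonable alternative.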

## SellerName          

SellerName is a string value. There are no missing values in either the training or test set. 

ACTION: Keep, make sure it is categorical.

## SellerRating            

SellerRating is a float value. There are no missing values in either the training or test set. 

ACTION: Keep, make sure it is continuous.

## SellerRevCnt     

SellerRevCnt is an integer. There are no missing values in either the training or test set. 

ACTION: Keep, make sure it is continuous.

## SellerState       

SellerState is a string type. There are no missing values in either the training or test set.
To simplify the model, this may be eliminated or combined with SellerCity and SellerZip to define location.

ACTION: Remove after fixing the zipcode.

## SellerZip   

SellerZip is a float type. There are 2 values missing in the training data and none in the test data.
To simplify the model, this may be eliminated or combined with SellerCity and SellerState to define location.
ZIP codes are identifiers rather than quantities, so this column should not be stored as a float.

ACTION: Inspect the rows where SellerZip is missing.
df_train[df_train['SellerZip'].isnull()]



In [None]:
df_train.query('SellerZip.isnull()', engine='python')
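To move SellerZip away from float, one option is to go through pandas' nullable integer type and then to a zero-padded string. This is a sketch under the assumption that the values are five-digit US ZIP codes; `zip_to_string` is a hypothetical helper name.

```python
import pandas as pd

def zip_to_string(zips: pd.Series) -> pd.Series:
    """Convert float ZIP codes to five-character strings, keeping missing values."""
    # Nullable Int64 drops the ".0" while keeping missing values as <NA>.
    as_int = zips.astype("Int64")
    # Left-pad to five digits so codes such as 02134 keep their leading zero.
    return as_int.astype("string").str.zfill(5)

# Made-up values, including one that lost its leading zero and one missing entry.
demo_zip = pd.Series([60601.0, 2134.0, None])
zip_text = zip_to_string(demo_zip)
```

The missing entries stay missing after the conversion, so they still need to be handled separately.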

## VehBodystyle  

VehBodystyle is a string type. There are no missing values in either the training or test set. There is only one value for this column.

#### ACTION: Remove column to simplify the model

df = df.drop(['VehBodystyle'], axis = 1)
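Single-valued columns such as this one can also be found programmatically rather than by inspection. A minimal sketch, with `constant_columns` as a hypothetical helper and made-up values in the demo frame:

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts NaN as a value, so a column mixing one value
    # with missing entries is not reported as constant.
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Made-up values; the real call would be constant_columns(df_train).
demo = pd.DataFrame({
    "VehBodystyle": ["SUV", "SUV", "SUV"],
    "VehType": ["Used", "Used", "Used"],
    "VehMake": ["Jeep", "Cadillac", "Jeep"],
})
to_drop = constant_columns(demo)
```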

## VehCertified            

VehCertified is a boolean value. There are no missing values in either the training or test set. 

ACTION: Keep, make sure it is categorical.

## VehColorExt      

VehColorExt is a string value. There are 73 values missing in the training data and 7 in the test data.

ACTION: Make a new column, "NewExtColor", that combines like values and fixes spelling errors.

In [None]:
df_train.VehColorExt.value_counts()
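One way to build a column like "NewExtColor" is to lower-case each value and look for a base colour word in it. The sketch below assumes that approach; the keyword table is hypothetical and would need to be built from the `value_counts()` output above, including any misspellings found there.

```python
import pandas as pd

# Hypothetical keyword -> base colour table; extend it from value_counts().
COLOR_KEYWORDS = {"black": "Black", "white": "White",
                  "silver": "Silver", "red": "Red"}

def simplify_color(value) -> str:
    """Map a free-text colour name to a base colour, 'Other', or 'Unknown'."""
    if pd.isna(value):
        return "Unknown"
    # Match whole words so that "red" does not fire on words like "colored".
    words = str(value).strip().lower().split()
    for keyword, base in COLOR_KEYWORDS.items():
        if keyword in words:
            return base
    return "Other"

# Made-up colour names, used only to exercise the function.
demo_colors = pd.Series(["Diamond Black Crystal Pearlcoat", "bright WHITE", None, "Granite"])
new_ext_color = demo_colors.map(simplify_color)
```

Mapping missing values to an explicit "Unknown" category is a design choice here; it keeps those rows usable without inventing a colour for them.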

## VehColorInt  

VehColorInt is a string value. There are 728 values missing in the training data and 108 in the test data.

ACTION: Make a new column, "NewIntColor", that combines like values and fixes spelling errors.

## VehDriveTrain      

VehDriveTrain is a string value. There are 401 values missing in the training data and 64 in the test data.

ACTION: Make a new column, "NewDriveTrain"

## VehEngine     

VehEngine is a string value. There are 361 values missing in the training data and 58 in the test data.

## VehFeats 

VehFeats is a list of strings. There are 275 values missing in the training data and 37 in the test data.

## VehFuel           

VehFuel is a string value. There are 2 values missing in the training data and none in the test data.

## VehHistory  

VehHistory is a comma-separated string. There are 201 values missing in the training data and 27 in the test data.
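No action is listed for this column yet. If it is kept, one possible treatment, sketched here as an assumption rather than a decision, is to split the comma-separated string into one indicator column per item. The history items in the demo series are made up.

```python
import pandas as pd

# Made-up history strings; the real input would be df_train["VehHistory"].
demo_history = pd.Series(["1 Owner, Clean Title", None, "2 Owners"])

flags = (
    demo_history.fillna("")                          # missing rows become all-zero rows
    .str.replace(r"\s*,\s*", ",", regex=True)        # drop spaces around the commas
    .str.strip()
    .str.get_dummies(sep=",")                        # one 0/1 column per distinct item
)
```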

## VehListdays      

VehListdays is a float value. There are 2 values missing in the training data and none in the test data.

## VehMake

VehMake is a string value. There are no missing values in either the training or test set. There are only 2 values, Jeep and Cadillac, and the column is redundant because VehModel carries the same information. One of these two columns should be removed to simplify the model.

## VehMileage              

VehMileage is a float value. There are 2 values missing in the training data and 1 in the test data. These can probably be imputed with the mean.

ACTION: Impute the missing values with the mean.
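A minimal sketch of the mean imputation, assuming the mean is computed on the training data only and then reused for the test data so that no test information leaks into the fill value. `impute_with_train_mean` is a hypothetical helper and the mileages are made up.

```python
import pandas as pd

def impute_with_train_mean(train: pd.DataFrame, test: pd.DataFrame, col: str):
    """Fill missing values in col of both frames with the training mean."""
    train_mean = train[col].mean()  # NaN values are skipped by default
    train_filled = train.copy()
    test_filled = test.copy()
    train_filled[col] = train_filled[col].fillna(train_mean)
    test_filled[col] = test_filled[col].fillna(train_mean)
    return train_filled, test_filled, train_mean

# Made-up mileages; the real call would use df_train and df_test.
demo_train = pd.DataFrame({"VehMileage": [10000.0, 30000.0, None]})
demo_test = pd.DataFrame({"VehMileage": [None, 50000.0]})
train_filled, test_filled, train_mean = impute_with_train_mean(
    demo_train, demo_test, "VehMileage")
```

fastai's `FillMissing` processor defaults to the median instead; either is defensible with so few missing rows.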

## VehModel

VehModel is a string value. There are no missing values in either the training or test set. There are only 2 values, Grand Cherokee and XT5, and the column is redundant because VehMake carries the same information. One of these two columns should be removed to simplify the model.
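Before dropping one of the two columns, the claim that VehMake and VehModel carry the same information can be checked: every make should map to exactly one model and vice versa. A sketch with a hypothetical `is_one_to_one` helper and a small stand-in frame:

```python
import pandas as pd

def is_one_to_one(df: pd.DataFrame, a: str, b: str) -> bool:
    """True if each value of a pairs with one value of b, and the reverse."""
    a_to_b = df.groupby(a)[b].nunique().le(1).all()
    b_to_a = df.groupby(b)[a].nunique().le(1).all()
    return bool(a_to_b and b_to_a)

# Stand-in rows using the two makes and models named above.
demo = pd.DataFrame({
    "VehMake": ["Jeep", "Cadillac", "Jeep"],
    "VehModel": ["Grand Cherokee", "XT5", "Grand Cherokee"],
})
redundant = is_one_to_one(demo, "VehMake", "VehModel")
```

If the check returns True on `df_train`, either column can be dropped without losing information.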

## VehPriceLabel

VehPriceLabel is a string value. There are 285 values missing in the training data and 38 in the test data.

## VehSellerNotes

VehSellerNotes is free text. There are 243 values missing in the training data and 41 in the test data.

## VehType    

VehType is a string. There are no missing values in either the training or test set. There is only one value for this column.

#### ACTION: Remove column to simplify the model

df = df.drop(['VehType'], axis = 1) 

## VehTransmission         

VehTransmission is a string value. There are 197 values missing in the training data and 27 in the test data.

## VehYear

VehYear is an integer. There are no missing values in either the training or test set.

## Vehicle_Trim     

Vehicle_Trim is a string and the dependent variable for the classification task. There are 405 values missing in the training set. Because this is the dependent variable, I will remove these rows for the classifier: rows without a label can be used neither for training nor for validation.

ACTION: Remove null rows for a better train/test split for the classifier task.

## Dealer_Listing_Price    

Dealer_Listing_Price is a float value and the dependent variable for the regression task. There are 52 values missing in the training set. Because this is the dependent variable, I will remove these rows for the regression: rows without a target can be used neither for training nor for validation.

ACTION: Remove null rows for a better train/test split for the regression task.
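Because the two targets are missing on different rows, the null rows can be dropped separately for each task so that neither task loses more data than it has to. A minimal sketch; `rows_with_target` is a hypothetical helper and the trim names and prices are made up.

```python
import pandas as pd

def rows_with_target(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Keep only the rows where the given dependent variable is present."""
    return df.dropna(subset=[target]).reset_index(drop=True)

# Made-up rows: one lacks the trim label, another lacks the price.
demo = pd.DataFrame({
    "Vehicle_Trim": ["Limited", None, "Premium Luxury"],
    "Dealer_Listing_Price": [30000.0, 25000.0, None],
})
clf_rows = rows_with_target(demo, "Vehicle_Trim")          # classifier data
reg_rows = rows_with_target(demo, "Dealer_Listing_Price")  # regression data
```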

Start of functions. The commented-out lines and the column lists below are copied from a Titanic notebook.

In [None]:
def add_features(df):
    # Drop the identifier and the two single-valued columns identified above.
    df = df.drop(columns=['ListingID', 'VehType', 'VehBodystyle'])

    #df['LogFare'] = np.log1p(df['Fare'])
    #df['Deck'] = df.Cabin.str[0].map(dict(A="ABC", B="ABC", C="ABC", D="DE", E="DE", F="FG", G="FG"))
    #df['Family'] = df.SibSp+df.Parch
    #df['Alone'] = df.Family==1
    #df['TicketFreq'] = df.groupby('Ticket')['Ticket'].transform('count')
    #df['Title'] = df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
    #df['Title'] = df.Title.map(dict(Mr="Mr",Miss="Miss",Mrs="Mrs",Master="Master")).value_counts(dropna=False)

    # Return the new frame; drop() does not modify the caller's DataFrame.
    return df

df_train = add_features(df_train)

In [None]:
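# Placeholder column lists from the Titanic notebook; these names do not exist
# in this dataset and must be replaced with the vehicle columns before training.
cats=["Sex","Embarked"]
conts=['Age', 'SibSp', 'Parch', 'LogFare',"Pclass"]
dep="Survived"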