# Decision Tree based on cleaned data

Now that the data has been cleaned in the other notebook, we can train a model and use it to make predictions.

In [64]:
import pandas as pd

# read in the cleaned training and test data
car_data_training = pd.read_csv('training_clean.csv')
car_data_test = pd.read_csv('test_clean.csv')

#car_data_training.columns
car_data_test["RefId"]

0         73015
1         73016
2         73017
3         73018
4         73019
5         73020
6         73021
7         73022
8         73023
9         73024
10        73025
11        73026
12        73027
13        73028
14        73029
15        73030
16        73031
17        73032
18        73033
19        73034
20        73035
21        73036
22        73037
23        73038
24        73039
25        73040
26        73041
27        73042
28        73043
29        73044
          ...  
48677    121717
48678    121718
48679    121719
48680    121720
48681    121721
48682    121722
48683    121723
48684    121724
48685    121725
48686    121726
48687    121727
48688    121728
48689    121729
48690    121730
48691    121731
48692    121732
48693    121733
48694    121734
48695    121735
48696    121736
48697    121737
48698    121738
48699    121739
48700    121740
48701    121741
48702    121742
48703    121743
48704    121744
48705    121745
48706    121746
Name: RefId, dtype: int64

## Introduce Labels

scikit-learn's decision tree cannot work with strings, so we replace the strings with numerical labels.

In [65]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
label_training = car_data_training.apply(le.fit_transform)
label_test = car_data_test.apply(le.fit_transform)
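
Calling `apply(le.fit_transform)` on each frame refits the encoder for every column of every frame, so the same string can receive one integer in the training data and a different one in the test data. Below is a minimal sketch of one way to keep the codes aligned. The toy frames and the helper `encode_consistently` are hypothetical and not part of the original notebook. It fits each encoder on the union of the training and test values and leaves numeric columns untouched:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the real frames (hypothetical values)
toy_train = pd.DataFrame({"Make": ["FORD", "DODGE", "FORD"],
                          "VehOdo": [89046, 93593, 73807]})
toy_test = pd.DataFrame({"Make": ["DODGE", "KIA"],
                         "VehOdo": [85377, 61873]})


def encode_consistently(train, test):
    """Encode non-numeric columns with one encoder per column, shared by both frames."""
    train, test = train.copy(), test.copy()
    for col in train.columns.intersection(test.columns):
        if pd.api.types.is_numeric_dtype(train[col]):
            continue  # numeric columns are already usable by the tree
        encoder = LabelEncoder()
        # Fit on the values of both frames so each string gets exactly one code
        encoder.fit(pd.concat([train[col], test[col]]).astype(str))
        train[col] = encoder.transform(train[col].astype(str))
        test[col] = encoder.transform(test[col].astype(str))
    return train, test


enc_train, enc_test = encode_consistently(toy_train, toy_test)
print(enc_train)
print(enc_test)
```

With this approach "DODGE" maps to the same integer in both frames, which the per-frame `fit_transform` does not guarantee.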

## Extract Labels and Classes

In [66]:
# the feature columns (independent vals), shared by training and test data
feature_columns = [
        'PurchDate', 
        'Auction',
        'VehYear', 
        'VehicleAge', 
        'Make', 
        'Model', 
        'Trim', 
        'SubModel',
        'Color', 
        'Transmission', 
        'WheelTypeID', 
        'WheelType', 
        'VehOdo',
        'Nationality', 
        'Size', 
        'TopThreeAmericanName',
        'MMRAcquisitionAuctionAveragePrice',
        'MMRAcquisitionAuctionCleanPrice', 
        'MMRAcquisitionRetailAveragePrice',
        'MMRAcquisitonRetailCleanPrice', 
        'MMRCurrentAuctionAveragePrice',
        'MMRCurrentAuctionCleanPrice', 
        'MMRCurrentRetailAveragePrice',
        'MMRCurrentRetailCleanPrice', 
        'PRIMEUNIT', 
        'AUCGUART', 
        'BYRNO',
        'VNZIP1', 
        'VNST', 
        'VehBCost', 
        'IsOnlineSale', 
        'WarrantyCost'
    ]

# the independent vals of the training data
training_input = label_training[feature_columns].values

# the classification info
training_classes = label_training['IsBadBuy'].values

# the same columns from the test data
test_input = label_test[feature_columns].values

# there are no test classes - we have to predict them

# let's take a look at a subset
test_input[:5]

array([[  138,     0,     4,     4,    22,   389,     8,   136,    15,
            1,     0,     1, 25765,     1,     3,     3,  3201,  3698,
         2762,  3209,  3102,  3517,  5189,  5269,     0,     0,    43,
           19,     6,   814,     0,   192],
       [  138,     0,     4,     4,     3,   501,    54,   209,    15,
            1,     0,     1,  9320,     1,     6,     3,  2676,  3009,
         2220,  2501,  2849,  3060,  4223,  4821,     0,     0,    40,
           19,     6,   770,     0,    72],
       [  138,     0,     5,     3,     5,   219,     7,   369,    15,
            1,     0,     1, 13991,     1,     7,     1,  8142,  9531,
         8240, 10028,  8498,  8908, 10733, 10963,     0,     0,    40,
           19,     6,  1493,     0,    88],
       [  138,     0,     1,     7,    23,   453,    51,   200,     5,
            1,     0,     1, 27452,     1,     6,     3,   770,   914,
          521,   629,  1157,  1008,  2447,  2092,     0,     0,    43,
           19,  

## Decision Tree

To use a decision tree, we import it and set it up. Because the test data is unlabeled, we hold out 25% of the training data to check the model's accuracy.

In [67]:
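# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split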

(trin, tein, trcl, tecl) = train_test_split(training_input, training_classes, train_size=0.75, random_state=1)
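
If bad buys are much rarer than good buys (an assumption about the data, not something shown in this notebook), a plain random split can give the hold-out set a different class ratio than the training set. Here is a sketch on synthetic arrays, using the `stratify` parameter of `train_test_split`. The `toy_` names are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 rows, 2 features, 4 of the 20 rows in the positive class
toy_input = np.arange(40).reshape(20, 2)
toy_classes = np.array([0] * 16 + [1] * 4)

# stratify keeps the class proportions the same in both parts of the split
toy_trin, toy_tein, toy_trcl, toy_tecl = train_test_split(
    toy_input, toy_classes, train_size=0.75, random_state=1, stratify=toy_classes
)

# Share of positives in the training part and in the hold-out part
print(toy_trcl.mean(), toy_tecl.mean())
```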

In [68]:
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
dt_classifier = DecisionTreeClassifier()

# Train the classifier with the training set
dt_classifier.fit(trin, trcl)

# Check the accuracy on the held-out split
dt_classifier.score(tein, tecl)

0.82516715992546308
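
An accuracy figure is easier to judge next to a trivial baseline. The sketch below uses synthetic data and scikit-learn's `DummyClassifier`, configured to always predict the most frequent class. The data and the `toy_` names are illustrative assumptions, not results from this notebook:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in with an imbalanced target: 16 negatives, 4 positives
toy_input = np.arange(40).reshape(20, 2)
toy_classes = np.array([0] * 16 + [1] * 4)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_input, toy_classes)

baseline_accuracy = baseline.score(toy_input, toy_classes)
# Rows are the true classes, columns the predicted classes
toy_confusion = confusion_matrix(toy_classes, baseline.predict(toy_input))

print(baseline_accuracy)
print(toy_confusion)
```

On the real data, the same comparison would fit the baseline on `trin`, `trcl` and score it on `tein`, `tecl`. A tree that barely beats this baseline has learned little about the minority class.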

In [71]:
# Do the actual prediction
classified = dt_classifier.predict(test_input)

In [72]:
# Now format the data for the Kaggle entry
df = pd.DataFrame(car_data_test["RefId"])
df["IsBadBuy"] = classified

df[:]

   RefId  IsBadBuy
0  73015         0
1  73016         0
2  73017         0
3  73018         0
4  73019         1
5  73020         0
6  73021         0
7  73022         1
8  73023         0
9  73024         1


In [73]:
# As we've got the right format, save to CSV
df.to_csv("fez_entry.csv", index=False)
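
Before uploading, it can help to read the file back and confirm its layout. Here is a sketch with a toy frame and a temporary file. The path and values are placeholders, not the real submission:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the submission frame (hypothetical values)
toy_entry = pd.DataFrame({"RefId": [73015, 73016, 73017], "IsBadBuy": [0, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_entry.csv")
    # index=False keeps the row index out of the file
    toy_entry.to_csv(path, index=False)
    read_back = pd.read_csv(path)

# Two columns, one row per RefId, no duplicated ids
print(list(read_back.columns))
print(read_back.shape)
print(read_back["RefId"].is_unique)
```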