### Evaluating Used Cars with Classification

#### Introduction
In recent years, the used car market has grown steadily. Many people now buy used cars instead of new ones, since used cars are usually cheaper and many of them are quite reliable. However, there are still plenty of defective used cars on the market. For example, a friend of mine bought a 2000 Toyota, and one day its engine suddenly broke down while she was driving. I am also a used-car victim: I purchased a 2001 Nissan six years ago, and after just one week the car would no longer start. Defective used cars not only hurt customers but also ruin sellers' reputations, so evaluating used cars is very important.

#### Data Description
Our data include 1728 used cars. The variables are: 1) buying price, 2) price of maintenance, 3) number of doors, 4) capacity in terms of persons to carry, 5) size of the trunk, and 6) estimated safety of the car. Buying price and price of maintenance are each categorized into four levels: very high, high, medium, and low. Number of doors takes the values 2, 3, 4, and 5-more. Capacity has three levels: 2, 4, and more. Trunk size is small, medium, or big. Estimated safety is low, medium, or high. Each car is classified as unacceptable, acceptable, good, or very good. The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
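As a quick sanity check on the description above, the levels of the six variables can be enumerated and counted. The sketch below uses only the level names as they appear in the data preview; the dictionary name `levels` is introduced here for illustration. The count matching the number of rows suggests that each combination of levels may appear exactly once, but that reading is an assumption and is not verified against the file here.

```python
from itertools import product

# Levels of each variable, as listed in the data description.
levels = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "trunk": ["small", "med", "big"],
    "safty": ["low", "med", "high"],
}

# Every possible combination of levels: 4 * 4 * 4 * 3 * 3 * 3.
combinations = list(product(*levels.values()))
print(len(combinations))  # 1728
```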

In [1]:
import pandas as pd
car = pd.read_csv(r'C:\Atop Materials\car evaluation.csv', header = 0)
car.head()

Unnamed: 0,Buying,Maint,Doors,Persons,Trunk,Safty,Evaluation
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [36]:
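def tonum(x):
    # Map each category label to an ordinal code (returns None for unknown labels).
    # "med" is shared by the price, trunk, and safety columns.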
    if x == "vhigh":
        return 4
    if x == "high":
        return 3
    if x == "med":
        return 2
    if x == "low":
        return 1
    if x == "5more":
        return 5
    if x == "4":
        return 4
    if x == "3":
        return 3
    if x == "2":
        return 2
    if x == "more":
        return 5
    if x == "small":
        return 1
    if x == "big":
        return 3
    if x == "unacc":
        return 1
    if x == "acc":
        return 2
    if x == "good":
        return 3
    if x == "vgood":
        return 4
    
car["buying"] = car["Buying"].apply(tonum)
car["maint"] = car["Maint"].apply(tonum)
car["doors"] = car["Doors"].apply(tonum)
car["persons"] = car["Persons"].apply(tonum)
car["trunk"] = car["Trunk"].apply(tonum)
car["safty"] = car["Safty"].apply(tonum)
car["evaluation"] = car["Evaluation"].apply(tonum)
car = car[["buying", "maint", "doors", "persons", "trunk", "safty", "evaluation"]]
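The same encoding can also be written with one dictionary per kind of column, which keeps the codes for each variable in one place and makes unknown labels fail loudly instead of silently becoming missing values. This is only a sketch of an alternative, not the code used in this report: the names `COLUMN_MAPS` and `encode` are hypothetical, and the codes are copied from `tonum` above.

```python
import pandas as pd

# Ordinal codes copied from tonum; one mapping per kind of column.
LEVELS = {"low": 1, "med": 2, "high": 3, "vhigh": 4}
SIZE = {"small": 1, "med": 2, "big": 3}
DOORS = {"2": 2, "3": 3, "4": 4, "5more": 5}
PERSONS = {"2": 2, "4": 4, "more": 5}
RATING = {"unacc": 1, "acc": 2, "good": 3, "vgood": 4}

COLUMN_MAPS = {
    "Buying": LEVELS, "Maint": LEVELS, "Doors": DOORS, "Persons": PERSONS,
    "Trunk": SIZE, "Safty": LEVELS, "Evaluation": RATING,
}

def encode(df):
    """Return a new frame with lower-case column names and integer codes."""
    out = pd.DataFrame(index=df.index)
    for col, mapping in COLUMN_MAPS.items():
        # astype(str) so that numeric-looking labels such as 2 and "2" match.
        codes = df[col].astype(str).map(mapping)
        if codes.isna().any():
            raise ValueError(f"unexpected label in column {col!r}")
        out[col.lower()] = codes.astype(int)
    return out

# Two illustrative rows in the same layout as the data preview.
sample = pd.DataFrame({
    "Buying": ["vhigh", "low"], "Maint": ["vhigh", "med"],
    "Doors": ["2", "5more"], "Persons": ["2", "more"],
    "Trunk": ["small", "big"], "Safty": ["low", "high"],
    "Evaluation": ["unacc", "vgood"],
})
print(encode(sample))
```

With the real data, `car = encode(car)` would play the role of the `apply` calls and the column selection above.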

#### Methodology
In this section, we divide our data into a training set and a test set, and then classify the cars with a support vector machine, k-nearest neighbors, and a decision tree.

In [48]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

x = car[["buying", "maint", "doors", "persons", "trunk", "safty"]]
y = car["evaluation"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)


# svm
clf1 = svm.SVC(kernel = 'linear')
clf1.fit(x_train, y_train)
yhat1 = clf1.predict(x_test)
print("Accuracy of SVM is:", np.round(np.mean(y_test == yhat1), 4))

# KNN
k = 5
clf2 = KNeighborsClassifier(n_neighbors = k)
clf2.fit(x_train, y_train)
yhat2 = clf2.predict(x_test)
print("Accuracy of KNN is:", np.round(np.mean(y_test == yhat2), 4))

# Decision Tree
clf3 = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4)
clf3.fit(x_train, y_train)
yhat3 = clf3.predict(x_test)
print("Accuracy of Decision Tree is:", np.round(np.mean(y_test == yhat3), 4))


Accuracy of SVM is: 0.8651
Accuracy of KNN is: 0.9518
Accuracy of Decision Tree is: 0.8555


#### Result
From the Methodology section, we see that KNN with k = 5 has the highest test accuracy (0.9518), ahead of the linear SVM (0.8651) and the decision tree (0.8555).
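The accuracy reported above is the share of test cars whose predicted class equals the true class. A single accuracy figure does not show which classes are confused with each other, so a confusion matrix can be a useful companion. The sketch below uses a handful of made-up labels (they are not results from this report) with the same 1 to 4 coding as the `evaluation` column.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only; 1 = unacc, 2 = acc, 3 = good, 4 = vgood.
y_true = np.array([1, 1, 1, 2, 2, 3, 4, 1])
y_pred = np.array([1, 1, 2, 2, 2, 3, 3, 1])

# Same quantity as np.mean(y_true == y_pred).
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])

# Share of each true class that was predicted correctly.
per_class = cm.diagonal() / cm.sum(axis=1)

print("Accuracy:", accuracy)             # 0.75
print(cm)
print("Per-class accuracy:", per_class)  # 0.75, 1.0, 1.0, 0.0
```

With the fitted models, the same calls could be applied to `y_test` and `yhat1`, `yhat2`, or `yhat3`.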

#### Discussion
In this report, we used SVM, KNN, and a decision tree to classify used cars, and we found that KNN has the highest accuracy. However, we used only very simple versions of these three classifiers. More advanced versions, such as Twin Boundary SVM in the case of SVM, may improve the accuracy of the SVM and the decision tree.
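One simple step short of switching to a more advanced classifier is to tune the settings of the existing ones with cross-validation. The sketch below searches over the kernel and the regularization strength `C` of scikit-learn's `SVC` using `GridSearchCV`. It runs on synthetic stand-in data with six features and four classes, not on the car data, and the parameter grid is an arbitrary choice for illustration, so the printed numbers say nothing about the results in this report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data (NOT the car data): 6 features and 4 classes.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Candidate settings; the grid itself is an assumption, chosen for brevity.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

# 5-fold cross-validation on the training set only.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 4))
print("Test accuracy:", round(search.score(X_test, y_test), 4))
```

The same pattern would apply to `KNeighborsClassifier` (searching over `n_neighbors`) or `DecisionTreeClassifier` (searching over `max_depth`), with `x_train` and `y_train` from the Methodology section in place of the stand-in data.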

#### Conclusion
On this dataset, simple classifiers predict the quality rating of used cars with high accuracy: the best of the three, KNN, classified about 95% of the test cars correctly.