# Car Evaluation 🚗 2.5 (McQueen Edition)
   
## Lab Six: Evaluation and Multi-Layer Perceptron
   
### Justin Ledford, Luke Wood, Traian Pop

# McQueen Quotes

### Speed. I am speed.

### Don't leave me here! I'm in hillbilly hell! My IQ's dropping by the second! I'm becoming one of them!

### Float like a Cadillac, sting like a Beamer.

### I think The King should finish his last race.

### Great timing, Mater.

### 1 winner, 42 losers. I eat losers for breakfast.

### In your dreams, Thunder.

### (Upon meeting Sally) Holy Porsche....

### Hah! This grumpy old race car I know once told me something. It's just an empty cup.

### He won three Piston Cups! (Mater says) He did what in his cup?!

Preparation (15 points total)
[5 points] (mostly the same as from lab four) Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the task is and what parties would be interested in the results. How well should your algorithm perform in order to be useful to third parties. 
[10 points] (mostly the same as from labs one through three) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

## Business Understanding

### Background Knowledge

The dataset we used for this project comes from the UCI Machine Learning Repository, and it pertains to the quality of certain cars. The 1728 instances we have are described by 6 attributes and split into 4 classes. The 6 attributes are: buying price, maintenance cost, number of doors, number of people that can fit in the car, the size of the trunk, and the safety rating. Although a car typically has more characteristics one could measure to get a good representation of its class, the creators of this data set used a model in which the only useful characteristics of a car are PRICE (buying, maintenance), TECH (safety), and COMFORT (doors, persons, trunk size).

The classification task has 4 classes: 'unacc' (unacceptable), 'acc' (acceptable), 'good' (good!), and 'vgood' (very good!). The classes become rarer as you go down the list: about 70% of the data is "unacceptable", about 22% is "acceptable", and about 4% each is "good" and "very good".

### Use-Case

One possible use for this dataset is to help customers purchase a car of the quality they want and to ensure they are satisfied with their vehicle. Throughout this project, we measure 'success' by minimizing false negatives for the 'unacceptable' class, that is, truly unacceptable cars that the model labels as acceptable or better. We focus on that metric, and less on the false negatives of the other car qualities, because of the significant damage a false negative in the 'unacceptable' category can cause.

Take, for example, someone who comes into a dealership intending to buy an affordable family car. While the difference between an "acceptable" and a "good" car might just be a slightly lower safety rank or slightly higher maintenance costs, an unacceptable car could be anything from extremely overpriced to very unsafe. Because the range of "unacceptable" cars is fairly large (they make up about 70% of the dataset), the risk of a customer being disappointed by a false negative is quite high. While a dealership may try to sell its worst possible car for the highest possible price, it still has to meet the customer's requirements. If it oversells a car or starts lying, the repercussions for the dealership are huge and the risk is not worth it.

We are aiming for a score above 90% on our evaluation metric, and anything higher is a bonus. We think this threshold is appropriate because it costs a customer relatively little to go look at a car in person instead of relying solely on online descriptions. Although we will be able to eliminate the majority of the noise during a customer's search, they will still have to finalize specific details and make the overall decision themselves.
___

## Data Preparation

The data originally consisted of categorical attributes with each value as a string:

   buying:       v-high, high, med, low  
   maintain:        v-high, high, med, low  
   doors:        2, 3, 4, 5-more  
   persons:      2, 4, more  
   trunk:        small, med, big  
   safety:       low, med, high   
   
The class values were also strings (unacc, acc, good, vgood).

We converted the attributes to a one-hot encoding and converted the class values to integers (0-3). Since the dataset only has a small number of attributes, we decided to keep all of them and concluded that dimensionality reduction was not necessary with this data.

In [None]:
import pandas as pd

# Load the raw data; the file has no header row
df = pd.read_csv(
        'http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data',
        header=None)

df.columns = ['buying', 'maint', 'doors', 'persons', 'trunk', 'safety', 'class']

# One-hot encode the categorical attributes
df_dummies = pd.get_dummies(df.drop('class', axis=1))

# Convert class labels to integers 0-3, ordered from worst to best
class_to_int = {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}
y = df['class'].map(class_to_int).values
X = df_dummies.values
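
To illustrate what the one-hot step produces, here is a minimal sketch on a three-row toy frame. The `toy` data and variable names are made up for illustration and are not part of the dataset or the original notebook.

```python
import pandas as pd

# Hypothetical toy frame with two of the six categorical attributes
toy = pd.DataFrame({
    'buying': ['vhigh', 'low', 'med'],
    'safety': ['low', 'high', 'high'],
})

# Each distinct value of each attribute becomes its own indicator column
toy_dummies = pd.get_dummies(toy)

print(list(toy_dummies.columns))
print(toy_dummies.astype(int).values)
```

Applied to all six attributes with the value lists given above, the same step yields 4 + 4 + 4 + 3 + 3 + 3 = 21 indicator columns.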

In [None]:
import numpy as np
import plotly.offline

plotly.offline.init_notebook_mode() # run at the start of every notebook

# Pie chart of how many instances fall into each class
graph1 = {'labels': ['Unacceptable', 'Acceptable', 'Good', 'Very good'],
          'values': np.bincount(y),
          'type': 'pie'}
fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Class Distribution',
                 'autosize': False,
                 'width': 500,
                 'height': 300}

plotly.offline.iplot(fig)

## Evaluation 

Evaluation (20 points total)
[10 points] (mostly the same as from lab five) Choose and explain what metric(s) you will use to evaluate your algorithm’s generalization performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.
[10 points] (mostly the same as from lab five) Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time. Convince me that your cross validation method is a realistic mirroring of how an algorithm would be used in practice. 

### Metric Evaluation

With our data set and application, we believe it is most important to minimize the number of false negatives for the "unacceptable" class, because it would be unfortunate to classify a truly unacceptable car as acceptable, or even good/very good.

To do this, we wrote a weighted recall score function that takes a cost matrix of the weights for each false negative. The recall score is still $tp / (tp + fn)$, where $tp$ is the total number of correct predictions across all classes, but the values of $fn$ are now weighted according to the cost matrix.

We weighted any "unacceptable" false negative twice as heavily as any other false negative.
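
Written out, with $C$ the confusion matrix (rows are true classes, columns are predicted classes) and $W$ the cost matrix, the score can be expressed as follows. The symbols $C$ and $W$ are notation introduced here for illustration and do not appear in the original write-up:

$$
\text{weighted recall} = \frac{\sum_{i} C_{ii}}{\sum_{i} C_{ii} + \sum_{i \neq j} W_{ij}\, C_{ij}}
$$

If every off-diagonal weight were 1, this would reduce to the fraction of correctly classified cars. Raising the weights in the "unacceptable" row makes each missed unacceptable car lower the score more than any other mistake.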

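In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix

# Cost matrix: rows are true classes, columns are predicted classes.
# High cost for false negatives on the "unacceptable" class (row 0).
weights = np.array([
    [0, 2, 2, 2],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0]
])

def weighted_recall_score(y_true, y_pred, weight_matrix):
    # Fix the label order so the confusion matrix always matches the
    # shape of weight_matrix, even if a class is missing from a split
    labels = np.arange(weight_matrix.shape[0])
    conf_matrix = confusion_matrix(y_true, y_pred, labels=labels)

    tp = np.sum(np.diagonal(conf_matrix))     # correct predictions
    fn = np.sum(weight_matrix * conf_matrix)  # weighted misclassifications

    return tp / (tp + fn)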
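
As a sanity check of the metric, the sketch below applies the same computation to eight made-up labels. The arrays `y_true_demo` and `y_pred_demo` and the `_demo` names are invented for illustration; the function and cost matrix are repeated only so the example runs on its own.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Same cost matrix as above: errors on true class 0 ("unacceptable") cost 2
weights_demo = np.array([
    [0, 2, 2, 2],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
])

def weighted_recall_demo(y_true, y_pred, weight_matrix):
    labels = np.arange(weight_matrix.shape[0])
    conf = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.trace(conf)                 # correct predictions (diagonal)
    fn = np.sum(weight_matrix * conf)   # weighted off-diagonal errors
    return tp / (tp + fn)

# Hypothetical labels: two unacceptable cars (class 0) are missed,
# and one acceptable car (class 1) is predicted as unacceptable
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 2, 3])
y_pred_demo = np.array([0, 0, 1, 2, 1, 0, 2, 3])

score_demo = weighted_recall_demo(y_true_demo, y_pred_demo, weights_demo)
plain_demo = np.mean(y_true_demo == y_pred_demo)  # unweighted fraction correct
print(score_demo, plain_demo)
```

Five of the eight predictions are correct, so the unweighted fraction is 0.625. The two missed "unacceptable" cars each count double, giving a weighted error total of 2 + 2 + 1 = 5 and a score of 5 / (5 + 5) = 0.5.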

### Dividing the Data

We chose stratified shuffle split as our cross-validation method. Our data set is highly imbalanced (about 70% "unacceptable"), so a stratified method is necessary to keep the class proportions the same in every training and test split. We chose shuffle split so that we could use a larger training size with fewer iterations than k-fold. This mirrors how the model would be used in practice: in production it would be given a sample of cars (new cars for the year/season, etc.), and the class balance of that sample will likely be similar to that of our training set.
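
A minimal sketch of how such a split could be set up with scikit-learn's `StratifiedShuffleSplit` follows. The synthetic labels, the placeholder features, and the values of `n_splits`, `test_size`, and `random_state` are all assumptions chosen for illustration; the write-up does not state which settings were actually used.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels roughly mirroring the class balance described above
# (70% / 22% / 4% / 4%); the real y would come from the dataset
y_demo = np.repeat([0, 1, 2, 3], [700, 220, 40, 40])
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

# n_splits, test_size and random_state are illustrative assumptions
splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

test_fractions = []
for train_idx, test_idx in splitter.split(X_demo, y_demo):
    # Class proportions within this test split
    frac = np.bincount(y_demo[test_idx], minlength=4) / len(test_idx)
    test_fractions.append(frac)
    print(np.round(frac, 2))
```

Each test split should show roughly the same class proportions as the full label vector, which is the property that matters for the rare "good" and "very good" classes.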
___

Modeling (55 points total)
[35 points] Create a custom ensemble classifier that uses multi-layer perceptron models for the individual classifiers. You can use bagging or boosting to select the training examples for each MLP in the ensemble, whichever you prefer.   
[20 points] Evaluate the performance of the ensemble classifier with your chosen evaluation metric(s). Visualize the results with a confusion matrix, receiver operating characteristic, and area under the curve. Visually compare its performance to the individual classifiers that make up the ensemble.

Exceptional Work (10 points total)
You have free rein to provide additional analyses.
One idea: add randomized feature selection to your bagging or boosting models