# Nathan Vanos and Scott Kopczynski
# Project Title: The "Sales" Vector Machine
# CPSC 310, Professor Sprint, Final Project
# Date: 05/09/2019

## Introduction
************
The classification project we took on was certainly one of the most rewarding tasks of the year. That being said, it was also quite difficult, at least in part because of the dataset we chose to classify. Our table is a collection of house listings, each described by a wide range of real estate attributes. So, what was the task we chose? We decided to predict house prices! We discovered a great many things along the way, including how impractical it is to run Naïve Bayes over all the attributes in the dataset. We also found that our ensemble classifier had the highest accuracy, while our support vector machine (at least with this data) had a far less impressive one.
************

## Data Analysis

## Classification Results
************
So, what classifier did we elect to use on this dataset? We chose the support vector machine from scikit-learn (SKLearn). Our general approach to classifying the data with it was similar to the strategy we used in our other assignments throughout the semester.
************
### Cleaning the Data
************
First, our program cleans the data so that no “NA” values remain. We do this in several steps, which are listed below. The program handles the alley, garage, and basement attributes first, and eventually replaces the remaining NAs with the mode of their attribute.
```
# ** clean data
training_table = clean_alley(training_table, header)
training_table = clean_garage(training_table, header)
training_table = clean_basement(training_table, header)
training_table = fill_with_best_val(training_table, header)
training_table = replace_with_mode(training_table, header)
training_table = remove_nas(training_table)
```
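
The last replacement step can be illustrated with a minimal sketch. This is not our actual `replace_with_mode` function: the name `replace_with_mode_sketch`, its signature, and the tiny example table are assumptions made for illustration. It fills every "NA" in a column with the most common non-missing value of that column.

```python
from collections import Counter

def replace_with_mode_sketch(table, missing="NA"):
    """Return a copy of table with each missing value replaced by its column's mode."""
    filled = [list(row) for row in table]
    for col in range(len(filled[0])):
        present = [row[col] for row in filled if row[col] != missing]
        if not present:
            continue  # the whole column is missing, so there is nothing to impute from
        mode = Counter(present).most_common(1)[0][0]
        for row in filled:
            if row[col] == missing:
                row[col] = mode
    return filled

# Made-up rows with two attributes (street type, garage type).
rows = [["Pave", "NA"], ["Pave", "Attchd"], ["NA", "Attchd"], ["Grvl", "Detchd"]]
cleaned = replace_with_mode_sketch(rows)
```

Working on a copy keeps the original table available for comparison, which can be handy when checking how many values were imputed.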
************

### Discretizing the Data
************
Next, the program discretizes the table. This is also a multi-step process, and we eventually had to write a boolean helper function to implement it. The function is_continuous() determines whether an attribute is continuous, and if it returns True, the attribute is converted to a categorical one. Since the class label, SalePrice, is itself a continuous attribute, we first apply sale_prices = np.log1p(sale_prices), which compresses the range of the prices so they can be binned and fit more effectively. Lastly, we chose to split each continuous attribute into ten equal-width bins, mainly because we wanted a number that wasn’t too large or too small.
```
import numpy as np

# attributes treated as continuous; each one is binned into categories
CONTINUOUS_ATTRIBUTES = {
    "LotArea", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF",
    "1stFlrSF", "2ndFlrSF", "LowQualFinSF", "GrLivArea", "GarageArea",
    "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch",
    "ScreenPorch", "PoolArea", "MiscVal", "SalePrice",
}

# discretizes continuous attributes (modifies table in place)
def discretize_table(header, table):
    price_index = header.index("SalePrice")
    # log-transform the sale prices before binning them
    sale_prices = np.log1p(get_column_float(table, price_index))
    # write each log price back into its own row
    for i, row in enumerate(table):
        row[price_index] = sale_prices[i]
    # convert all continuous attributes to categorical ones
    for att in range(len(header)):
        if is_continuous(header[att]):
            values = get_column_float(table, att)
            cutoffs = compute_equal_widths_cutoffs(values, 10)
            convert_to_categorical(cutoffs, att, table)

# checks if an attribute is continuous
def is_continuous(column):
    return column in CONTINUOUS_ATTRIBUTES

```
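
To make the binning step concrete, here is a minimal sketch of equal-width discretization. The helpers `equal_width_cutoffs` and `bin_index` are hypothetical stand-ins for our `compute_equal_widths_cutoffs` and `convert_to_categorical`, whose exact behavior is not shown in this report; in particular, sending a value that equals a cutoff to the higher bin is an assumption. The last lines show that the log transform is reversible with `math.expm1`.

```python
import math

def equal_width_cutoffs(values, num_bins):
    """Return the num_bins - 1 interior cutoffs that split [min, max] evenly."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    return [low + width * i for i in range(1, num_bins)]

def bin_index(value, cutoffs):
    """Return the 0-based bin of value: the number of cutoffs it meets or exceeds."""
    return sum(1 for cutoff in cutoffs if value >= cutoff)

# Made-up values split into four bins (the report uses ten).
values = [0.0, 2.5, 5.0, 7.5, 10.0]
cutoffs = equal_width_cutoffs(values, 4)
labels = [bin_index(v, cutoffs) for v in values]

# log1p(x) = log(1 + x); expm1 undoes it, recovering the original price.
log_price = math.log1p(200000)
recovered_price = math.expm1(log_price)
```

With four bins over the range 0 to 10, the cutoffs fall at 2.5, 5.0, and 7.5, and the maximum value lands in the last bin.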
************

### Creating the clf
************
Since implementing all of the features of a support vector machine ourselves would be ridiculously complicated, we ultimately decided to use SKLearn’s implementation. Even so, we still feel that we learned a lot about this classification technique. There are two key steps. The first step is setting up the classifier. To do this, the program splits the training data into two parts: y_vals holds the class label of each instance, and x_vals holds the rest of the attributes. Then, the classifier (clf) is created with a call to svm’s SVC constructor. An important parameter for SVC is C, which sets how heavily the slack variables (training instances that fall inside the margin or on the wrong side of it) are penalized. A higher value of C penalizes these violations more strongly, which fits the training data more tightly at the risk of overfitting, so a well-chosen C can make or break this classifier’s performance.
```
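from sklearn import svm

# class labels are in the last column; the remaining attributes are the features
y_vals = get_column(train_data, len(table[0])-1)
x_vals = get_attribs(train_data)
# degree only applies to the 'poly' kernel, so it has no effect with kernel='rbf'
clf = svm.SVC(C=10000000000000, cache_size=200, class_weight=None, coef0=0.0,
              decision_function_shape='ovr', degree=80, gamma=0.00000000000001, kernel='rbf',
              max_iter=-1, probability=False, random_state=None, shrinking=True,
              tol=0.001, verbose=False)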
```
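
The gamma parameter matters as much as C. With kernel='rbf', the similarity between two instances is exp(-gamma * d²), where d is the Euclidean distance between them. The sketch below computes that value by hand for two made-up rows; the function name `rbf_kernel` and the rows are assumptions for illustration. With a very small gamma, like the one passed to SVC above, almost every pair of rows looks nearly identical to the kernel, which may be one reason our accuracy was lower than expected.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Two made-up rows of integer-encoded attributes that differ in every position.
row_a = [0, 3, 9, 1]
row_b = [9, 0, 0, 7]

tiny_gamma = rbf_kernel(row_a, row_b, gamma=1e-14)    # similarity is almost exactly 1
moderate_gamma = rbf_kernel(row_a, row_b, gamma=0.01)  # similarity is clearly below 1
```

Here the squared distance is 207, so the moderate gamma gives a similarity of roughly 0.13, while the tiny gamma gives a value indistinguishable from 1 for practical purposes.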
************
### Fitting the Data
************
The second step is to fit the clf to the training data. During fitting, the SVC finds the decision boundaries that best separate the classes in the training set. The fitted clf is then used to predict a class label for each instance in the test set.
```
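clf.fit(x_vals, y_vals)
# predict() expects a list of instances and returns an array of labels,
# so wrap the single row and take the first element (this runs in a for loop)
result = clf.predict([row[:-2]])[0]
predicted.append(result)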
```
************

### Stratified K-Fold Cross Validation
************
To test our classifier’s predictive ability, our program uses the classic approach of stratified k-fold cross validation with k = 10. The original data is split into k folds, each with roughly the same mix of class labels as the full dataset, and each fold is used in turn as the test set while the remaining folds form the training set, for a total of k “runs” of the classifier. The number of correct predictions in each run is recorded, then at the end, the average accuracy across the runs is calculated. The average accuracy is the metric we used to rate our classifier’s performance.
```
from sklearn import svm
import knn_naive_classifier

# classify the dataset with support vector machine
def classify_with_svm(table):
    table = encode_data(table)
    k_folds = knn_naive_classifier.determine_stratified_k_fold(table, 10)
    correct_count = 0
    predicted = []
    actual = []
    for i in range(len(k_folds)):
        # train on every fold except fold i, which is held out as the test set
        train_data = []
        for j in range(len(k_folds)):
            if j != i:
                train_data += k_folds[j]
        y_vals = get_column(train_data, len(table[0])-1)
        x_vals = get_attribs(train_data)
        clf = svm.SVC(C=10000000000000, cache_size=200, class_weight=None, coef0=0.0,
                      decision_function_shape='ovr', degree=80, gamma=0.00000000000001, kernel='rbf',
                      max_iter=-1, probability=False, random_state=None, shrinking=True,
                      tol=0.001, verbose=False)
        clf.fit(x_vals, y_vals)
        for row in k_folds[i]:
            # predict() returns an array, so take its single element
            result = clf.predict([row[:-2]])[0]
            predicted.append(result)
            actual.append(row[-1])
            if result == row[-1]:
                correct_count += 1
    # overall accuracy: correct predictions divided by instances tested
    return correct_count / len(actual)
```
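
The fold generation itself happens inside `determine_stratified_k_fold`, which is not shown in this report. A minimal sketch of one way to build stratified folds follows; the name `stratified_folds`, the round-robin strategy, and the example table are assumptions, not our actual implementation. Rows are grouped by class label and then dealt out to the folds one at a time, so each fold ends up with about the same class mix.

```python
from collections import defaultdict

def stratified_folds(table, k, class_index=-1):
    """Split table into k folds, dealing each class's rows out round-robin."""
    by_class = defaultdict(list)
    for row in table:
        by_class[row[class_index]].append(row)
    folds = [[] for _ in range(k)]
    position = 0  # keeps counting across classes so fold sizes stay balanced
    for rows in by_class.values():
        for row in rows:
            folds[position % k].append(row)
            position += 1
    return folds

def overall_accuracy(correct_per_fold, folds):
    """Correct predictions over all runs divided by the number of instances tested."""
    return sum(correct_per_fold) / sum(len(fold) for fold in folds)

# Made-up table: six "low" rows and three "high" rows, class label in the last column.
table = [[i, "low"] for i in range(6)] + [[i, "high"] for i in range(3)]
folds = stratified_folds(table, 3)
accuracy = overall_accuracy([3, 2, 1], folds)
```

Dividing by the total number of tested instances, rather than by the size of a single fold, keeps the accuracy correct even when the folds are not all the same size.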
************

### Other Classifiers
************
We compared our new classifier to several old ones, including a Naïve Bayes classifier, a KNN classifier, and an ensemble classifier. In Naïve Bayes, the classification of instances is based on probabilities, and in our implementation, computing them over every attribute took a huge amount of computation. We were forced to run this algorithm with significant attribute selection, since the program took more than an hour to run Naïve Bayes without it.
Our KNN classifier turned out to be the second slowest, since it also requires a lot of computation. The basic idea is this: for each instance to be classified, find its k nearest neighbors in the training set using Euclidean distance, and then use a majority vote among those neighbors to pick the class label.
For our ensemble, we chose to use a random forest. Our ensemble generates N decision trees, keeps the M most accurate of those trees, then takes a vote among their predictions in order to choose the class label. We used the TDIDT method to build each tree.
Our program produced some surprising results regarding the accuracies of these classifiers. We both expected the support vector machine’s accuracy to be the highest, but that was not the case. Our support vector machine has the lowest accuracy of all the classifiers, coming in at a measly 60 to 70 percent. Naïve Bayes and KNN come next, and are nearly tied at about 82 percent accuracy, while the ensemble classifier has by far the best accuracy at around 90 percent.
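
The KNN procedure and the voting step shared by KNN and the ensemble can be sketched in a few lines. This is an illustration only, assuming numeric attributes; the names `majority_vote`, `euclidean_distance`, and `knn_predict` and the example data are hypothetical, not taken from our project code.

```python
import math
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def euclidean_distance(a, b):
    """Straight-line distance between two numeric attribute lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_rows, train_labels, instance, k):
    """Label instance by majority vote among its k nearest training rows."""
    distances = sorted(
        (euclidean_distance(row, instance), label)
        for row, label in zip(train_rows, train_labels)
    )
    return majority_vote([label for _, label in distances[:k]])

# Made-up training data: three "low" rows near (1, 1), two "high" rows near (5, 5).
train_rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
train_labels = ["low", "low", "low", "high", "high"]
prediction = knn_predict(train_rows, train_labels, [1.1, 1.0], k=3)

# The same vote can combine the predictions of the trees kept by the ensemble.
forest_prediction = majority_vote(["high", "low", "high"])
```

Every prediction computes a distance to every training row, which is consistent with KNN being one of our slower classifiers.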
************

## Conclusion
************
Ultimately, our project consisted of cleaning the dataset, discretizing continuous attributes, setting up SKLearn’s support vector machine, fitting it to the training data to make predictions, and finally, using stratified k-fold cross validation to compute the accuracy. The dataset proved to be challenging for us in several ways: the number of attributes, the presence of NA values, and lastly, the need to log-transform the sale price. In the end, we had more work to do before implementing the classifier than for the classifier itself. Our classifier’s performance wound up being less than satisfactory, and we definitely expected its accuracy to be higher. The low accuracy could be a result of this particular dataset, or it could be due to our lack of experience using support vector machines. We were unsure exactly how to choose the best C and gamma values for SVC, and we believe that if we had more time to tune these values, the support vector machine’s accuracy could improve considerably. But at the end of the day, we still learned a lot, and now we have a basic understanding of a new classifier to take with us into the real world.
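
One common way to choose C and gamma is a cross-validated grid search, which scikit-learn provides as GridSearchCV. The sketch below is not something we ran for this project: the synthetic data from make_classification stands in for our encoded housing table, and the candidate values in `param_grid` are arbitrary starting points chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the encoded housing table (not the project's data).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Candidate values to try; every (C, gamma) pair is scored with cross validation.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_  # the (C, gamma) pair with the best mean accuracy
best_score = search.best_score_    # mean cross-validated accuracy for that pair
```

A typical follow-up is to search a finer grid around the best pair, since good values of C and gamma often span several orders of magnitude.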
************