# MSML651 Final Project
## Part 1: Model Evaluation
## Julie Lenzer

In this project, I am investigating the classification of poker hands given an ordered list of the cards in a draw. This portion of the project explores several classification models to determine which one performs best on the given data set.

In [32]:
import pandas as pd
import numpy as np

In [33]:
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## Redirect output to a file

In [34]:
import sys
    
original_stdout = sys.stdout # Save a reference to the original standard output
outf = open("LenzerOut-ModelEval.txt", "w")
sys.stdout = outf # Change the standard output to the file we created.
print("MSML651 - Lenzer Final Project\nModel Evaluation", "\n\n")
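An alternative to swapping `sys.stdout` by hand is `contextlib.redirect_stdout` from the standard library, which restores the original stream automatically. This is only a sketch; the file name `example-out.txt` is a hypothetical stand-in, not the project's output file.

```python
import contextlib
import os
import tempfile

# Hypothetical output path, used only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "example-out.txt")

# Everything printed inside the with-block goes to the file. On exit the
# file is closed and stdout is restored, even if an exception is raised.
with open(out_path, "w") as outf, contextlib.redirect_stdout(outf):
    print("Model Evaluation")

# Read the file back to confirm what was captured.
with open(out_path) as f:
    captured = f.read()
```

The trade-off is that the redirect only covers code inside the `with` block, so it suits a single cell or function better than a whole notebook.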

## Load Data and Extract Features
The data set comes from https://www.kaggle.com/rasvob/uci-poker-hand-dataset. The training data contains 25,010 records and the testing data contains 1,000,000; I will process the testing data with pyspark once a model is selected.

In [35]:
fdata = pd.read_csv("poker-hand-training.data")
fdata.columns = ["S1","C1","S2","C2","S3","C3","S4","C4","S5","C5","Y"]
fdata.head()

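|   | S1 | C1 | S2 | C2 | S3 | C3 | S4 | C4 | S5 | C5 | Y |
|---|----|----|----|----|----|----|----|----|----|----|---|
| 0 | 2 | 11 | 2 | 13 | 2 | 10 | 2 | 12 | 2 | 1 | 9 |
| 1 | 3 | 12 | 3 | 11 | 3 | 13 | 3 | 10 | 3 | 1 | 9 |
| 2 | 4 | 10 | 4 | 11 | 4 | 1 | 4 | 13 | 4 | 12 | 9 |
| 3 | 4 | 1 | 4 | 13 | 4 | 12 | 4 | 11 | 4 | 10 | 9 |
| 4 | 1 | 2 | 1 | 4 | 1 | 5 | 1 | 3 | 1 | 6 | 8 |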
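The source does not say whether `poker-hand-training.data` starts with a header line. `pd.read_csv` treats the first line as column names by default, so if the file has no header, the first hand would be consumed as the header and lost. A minimal sketch of the difference, using two rows from the table above as a stand-in for the file:

```python
import io
import pandas as pd

cols = ["S1", "C1", "S2", "C2", "S3", "C3", "S4", "C4", "S5", "C5", "Y"]

# Two rows copied from the table above, standing in for a header-less file.
raw = "2,11,2,13,2,10,2,12,2,1,9\n3,12,3,11,3,13,3,10,3,1,9\n"

# Default behaviour: the first line is used as the header row.
default_read = pd.read_csv(io.StringIO(raw))

# Explicit behaviour: no header in the file, names supplied by the caller.
explicit_read = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(default_read), len(explicit_read))
```

Comparing `len(fdata)` against the expected 25,010 records is a quick way to check which case applies.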


In [36]:
# Extract the features (X), i.e. the suit and rank columns for all five cards, and pull out the label we are trying to predict (Y)
X = fdata.iloc[:, :-1].copy()
Y = fdata['Y']
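Positional slices are easy to get off by a column or two, so it can help to print which columns a slice actually keeps. A small sketch on a one-row frame built from the first row shown earlier (the name `demo` is only for illustration):

```python
import pandas as pd

cols = ["S1", "C1", "S2", "C2", "S3", "C3", "S4", "C4", "S5", "C5", "Y"]
demo = pd.DataFrame([[2, 11, 2, 13, 2, 10, 2, 12, 2, 1, 9]], columns=cols)

# Everything except the last column: all five cards.
all_cards = demo.iloc[:, :-1]

# Starting at position 2 skips S1 and C1, i.e. the first card.
from_second_card = demo.iloc[:, 2:-1]

print(list(all_cards.columns))
print(list(from_second_card.columns))
```

Selecting by name, for example `demo.drop(columns="Y")`, avoids the positional bookkeeping altogether.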

## Data and Model Prep
First, the training set is divided into training and test data sets with a 70/30 split. 

In [37]:
# Using JUST the training data set, break that into train / test in order to determine the "best" model to use 
# for this classification task
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=.3)
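Because poker hand classes are heavily imbalanced, a plain random split can leave rare classes unevenly represented, and without a seed the split changes on every run. The sketch below shows the `stratify` and `random_state` options of `train_test_split` on synthetic data; the data, class probabilities, and variable names are assumptions for illustration, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1,000 rows, 4 feature columns, 3 imbalanced classes.
X_demo = pd.DataFrame(rng.integers(1, 14, size=(1000, 4)))
y_demo = pd.Series(rng.choice([0, 1, 2], size=1000, p=[0.5, 0.4, 0.1]))

# stratify keeps class proportions the same in both splits;
# random_state makes the split repeatable.
xtr, xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

print(xtr.shape, xte.shape)
```

Note that `stratify` requires at least two examples of every class, which may matter for the rarest hands.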

In [38]:
# Look at how the data gets split - what is the distribution of the various classes
print(x_train.shape)
print(x_test.shape)

# Count the total, training, and test distribution by class
from collections import Counter
total_counts = Counter(Y)
training_counts = Counter(y_train)
test_counts = Counter(y_test)
print("Data Analysis\n")
print("Totals by Classification:", total_counts)
print("Training Data counts:", training_counts)
print("Test Data Counts:", test_counts)
print("---------------","\n\n")
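Raw counts are hard to compare across splits of different sizes. A small sketch, using made-up labels rather than the poker data, that puts per-class proportions side by side:

```python
import pandas as pd

# Hypothetical labels standing in for Y, y_train, and y_test.
y_all = pd.Series([0] * 6 + [1] * 3 + [2] * 1)
y_tr = y_all.iloc[[0, 1, 2, 3, 6, 7, 9]]
y_te = y_all.iloc[[4, 5, 8]]

# One column per split; a class missing from a split shows up as 0.
summary = pd.DataFrame({
    "total": y_all.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).fillna(0.0).sort_index()

print(summary.round(3))
```

A class with a proportion of 0 in the test column can never be scored on the held-out data, which is worth knowing before comparing models.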

## Create and Evaluate Models
Run through several models, including K-Nearest Neighbors, Support Vector Machine (SVM), Decision Tree, and Random Forest, to determine which one performs best in terms of accuracy and weighted F1 score. I then performed cross validation to determine whether the train/test split might have a significant impact on each model's overall performance.
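To make the weighted F1 score concrete: it is the per-class F1 averaged with each class weighted by its number of true examples. The sketch below checks that by hand on a tiny made-up label set, and also shows how passing `labels=np.unique(predictions)` changes the result by leaving out classes that were never predicted.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels; class 2 exists in the truth but is never predicted.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Per-class F1 and per-class support (number of true examples).
per_class = f1_score(y_true, y_pred, average=None, labels=[0, 1, 2], zero_division=0)
support = np.bincount(y_true, minlength=3)
weighted_by_hand = np.sum(per_class * support) / support.sum()

# The same quantity from scikit-learn, over all classes.
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Restricting labels to predicted classes drops class 2 from the average.
restricted = f1_score(
    y_true, y_pred, average="weighted", labels=np.unique(y_pred), zero_division=0
)

print(weighted, restricted)
```

In this example the restricted score is higher, because the class the model missed entirely no longer counts against it.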

In [39]:
# Import classification models
#
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import classification_report

In [40]:
# Create a function that fits each classification model and prints its results so the models can be
# compared to find the "best" one
#
def model_eval(models):
    from sklearn.model_selection import cross_val_score

    for key, classifier in models.items():
        # Fit on the training split and score on the held-out split
        classifier.fit(x_train, y_train)
        predictions = classifier.predict(x_test)
        print("Accuracy Score of", key, ": ", metrics.accuracy_score(y_test, predictions))
        result = pd.crosstab(y_test, predictions, rownames=['Actual Result'], colnames=['Predicted Result'])
        print(result)
        print(metrics.classification_report(y_test, predictions, zero_division=0))
        # Weighted F1 restricted to the classes that were actually predicted
        print("F1 Score:", metrics.f1_score(y_test, predictions, average="weighted", labels=np.unique(predictions), zero_division=0))
        #
        # 3-fold cross validation on the full training file
        print("Cross Validated Results")
        cv_scores = cross_val_score(classifier, X, Y, cv=3)
        print(cv_scores)
        print("---------------", "\n")
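The three cross-validation scores are easier to compare across models when summarized as a mean and standard deviation, and `cross_val_score` can report weighted F1 instead of accuracy through its `scoring` argument. The sketch below uses synthetic data and an explicit shuffled `StratifiedKFold`; both are assumptions for illustration and not what the function above does (it passes `cv=3`, which for classifiers uses stratified folds without shuffling).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: the label depends only on the first feature.
X_demo = rng.integers(1, 14, size=(300, 4))
y_demo = (X_demo[:, 0] > 7).astype(int)

# Shuffled, seeded folds so the result is repeatable.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="f1_weighted",
)

print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

A large standard deviation relative to the gap between two models suggests the ranking may depend on the particular split.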

In [41]:
# Create the list of models to evaluate and then run them through the evaluation function
#
model_list = {
    "KNeighborsClassifier": KNeighborsClassifier(5),
    "SVM": svm.SVC(kernel='linear'),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_jobs=2, random_state=0),
}
model_eval(model_list)
#
# Next, testing a number of K-nearest neighbors values
model_list = {f"KNeighbors{k}": KNeighborsClassifier(k) for k in range(2, 9)}
model_eval(model_list)

sys.stdout = original_stdout # Reset the standard output to its original value
outf.close() # Close the file
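As an alternative to looping over hand-built K-Nearest Neighbors models, scikit-learn's `GridSearchCV` can sweep `n_neighbors` with cross validation and report the best setting. This is a different technique from the evaluation function above, shown here on synthetic data as a sketch only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: two classes separated by a line in the first two features.
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Try k = 2..8 with 3-fold cross validation, scored by accuracy.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(2, 9))},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The per-setting scores are available in `search.cv_results_` if the full comparison is needed rather than only the winner.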