# Machine Learning On Poker Hand Dataset

Since the target is categorical (10 hand classes), we will use classification algorithms from sklearn. If need be, we will also try boosting or bagging techniques.

In [1]:
import warnings
warnings.filterwarnings('ignore')

## Helper functions

In [2]:
import pandas as pd

def load_data(path):
    df = pd.read_csv(path)
    return df

In [3]:
from sklearn.preprocessing import StandardScaler

def scale_data(x):
    scaler = StandardScaler()
    x = scaler.fit_transform(x)
    return x
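
A note on `scale_data`: if it is called on the full dataset before splitting, the test rows contribute to the scaler's mean and variance. Below is a minimal sketch of a variant that fits on the training rows only. The helper name `scale_train_test` and the random stand-in data are assumptions for illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def scale_train_test(x_train, x_test):
    """Fit the scaler on the training rows only, then apply it to both splits."""
    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)
    # Return the fitted scaler too, so it can be reused at prediction time
    return x_train_scaled, x_test_scaled, scaler

# Hypothetical stand-in for the 10 feature columns (integers in 1..13)
rng = np.random.default_rng(0)
x_demo = rng.integers(1, 14, size=(200, 10))

x_tr, x_te = train_test_split(x_demo, test_size=0.2, random_state=0)
x_tr_s, x_te_s, fitted_scaler = scale_train_test(x_tr, x_te)
print(x_tr_s.shape, x_te_s.shape)
```

Returning the fitted scaler is a design choice: whatever transforms the training data also has to transform any hand passed to `predict` later.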

In [4]:
from sklearn.model_selection import train_test_split

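def get_prep_data():
    # Load the two CSV exports of the poker hand data
    df_train = load_data('../../dataset/poker-hand-traintest')
    df_test = load_data('../../dataset/poker-hand-test')

    # Drop the first column; keep only the first 25009 rows of the test file
    df_train = df_train.iloc[:, 1:]
    df_test = df_test.iloc[:25009, 1:]

    # Pool both files, then re-split 80/20 below
    df = pd.concat([df_train, df_test])

    # The first 10 columns are the features; 'Hand' is the class label
    x = df.iloc[:, 0:10]
    x = scale_data(x)
    y = df['Hand']
    # No random_state is set, so every call yields a different split
    data_splits = train_test_split(x, y, test_size=0.2)
    return data_splits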

In [5]:
x_train, x_test, y_train, y_test = get_prep_data()
print(f'x_train: {x_train.shape} \ny_train: {y_train.shape} \nx_test: {x_test.shape} \ny_test: {y_test.shape}')

x_train: (40014, 10) 
y_train: (40014,) 
x_test: (10004, 10) 
y_test: (10004,)
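
Because the split above is unseeded and unstratified, rare hand classes can land entirely in one split, and results change from run to run. Here is a minimal sketch of a seeded, stratified split. The label counts and feature values are made up for illustration, and the seed `42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical imbalanced labels: two dominant classes and three rare ones
y_demo = np.array([0] * 500 + [1] * 420 + [2] * 50 + [3] * 20 + [4] * 10)
x_demo = rng.integers(1, 14, size=(len(y_demo), 10))

# stratify keeps each class's share the same in both splits;
# random_state makes the split repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

classes, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes.tolist(), test_counts.tolist())))
```

Note that `stratify` needs at least two samples of every class, so the rarest hands in the real data may need special handling before this can be applied as-is.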


## Logistic Regression

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.501999200319872
Confusion matrix: 
 [[5022    0    0    0    0    0    0    0    0]
 [4212    0    0    0    0    0    0    0    0]
 [ 474    0    0    0    0    0    0    0    0]
 [ 227    0    0    0    0    0    0    0    0]
 [  34    0    0    0    0    0    0    0    0]
 [  20    0    0    0    0    0    0    0    0]
 [  13    0    0    0    0    0    0    0    0]
 [   1    0    0    0    0    0    0    0    0]
 [   1    0    0    0    0    0    0    0    0]]
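
The confusion matrix above has a single non-zero column: every hand is predicted as class 0, so the accuracy is just the share of class 0 in the test split. (Only nine classes appear in this particular split, which is why the matrix is 9x9.) The short sketch below recomputes accuracy and per-class recall from the printed numbers; the variable names are illustrative.

```python
import numpy as np

# Row totals copied from the logistic regression confusion matrix above;
# all predictions fall in column 0
support = np.array([5022, 4212, 474, 227, 34, 20, 13, 1, 1])
cm = np.zeros((9, 9), dtype=int)
cm[:, 0] = support

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per-class recall (rows are true classes)
majority_share = support.max() / support.sum()

print(f'accuracy: {accuracy:.4f}')
print(f'majority class share: {majority_share:.4f}')
print('per-class recall:', recall)
```

Accuracy alone hides this behaviour, which is why the confusion matrix is printed next to it for every model.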


## SVM

In [7]:
from sklearn.svm import SVC, LinearSVC

svc = SVC(C=2.5, kernel='rbf')
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.5697720911635346
Confusion matrix: 
 [[4003 1019    0    0    0    0    0    0    0]
 [2515 1697    0    0    0    0    0    0    0]
 [ 225  249    0    0    0    0    0    0    0]
 [  65  162    0    0    0    0    0    0    0]
 [   0   34    0    0    0    0    0    0    0]
 [  18    2    0    0    0    0    0    0    0]
 [   2   11    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0]]


In [10]:
linsvc = LinearSVC()
linsvc.fit(x_train, y_train)
y_pred = linsvc.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.5061975209916033
Confusion matrix: 
 [[5064    0    0    0    0    0    0    0    0    0]
 [4170    0    0    0    0    0    0    0    0    0]
 [ 492    0    0    0    0    0    0    0    0    0]
 [ 197    0    0    0    0    0    0    0    0    0]
 [  39    0    0    0    0    0    0    0    0    0]
 [  17    0    0    0    0    0    0    0    0    0]
 [  19    0    0    0    0    0    0    0    0    0]
 [   4    0    0    0    0    0    0    0    0    0]
 [   1    0    0    0    0    0    0    0    0    0]
 [   1    0    0    0    0    0    0    0    0    0]]


## Decision Tree

In [11]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion='entropy', splitter='random')
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.5268892443022791
Confusion matrix: 
 [[3048 1781  138   56   17   23    1    0    0    0]
 [1612 2105  261  139   32   14    4    0    3    0]
 [ 101  269   79   31   10    0    2    0    0    0]
 [  37  103   21   34    0    0    2    0    0    0]
 [   8   14    9    4    4    0    0    0    0    0]
 [  12    4    0    1    0    0    0    0    0    0]
 [   2    7    5    3    0    0    1    1    0    0]
 [   0    3    0    1    0    0    0    0    0    0]
 [   1    0    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]]


In [10]:
dt = DecisionTreeClassifier(criterion='entropy', splitter='random')
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.43662534986005597
Confusion matrix: 
 [[1262 1014  131   62   15    3    5    4    1    1]
 [ 987  885  137   69   20    0    8    0    0    2]
 [ 100  119   17   11    1    0    0    0    0    0]
 [  45   47    8    8    0    0    0    0    0    0]
 [   8    7    3    1    0    0    0    0    0    0]
 [   3    0    0    0    0   12    0    0    0    0]
 [   1    5    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]]


In [12]:
dt = DecisionTreeClassifier(criterion='entropy', splitter='random')
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.4634146341463415
Confusion matrix: 
 [[2718 1978  245  100   15    1    7    0    0    0]
 [1864 1843  276  152   25    1    5    3    1    0]
 [ 167  242   53   22    7    0    1    0    0    0]
 [  71   88   20   13    4    0    1    0    0    0]
 [   8   15   11    3    1    0    0    0    1    0]
 [   6    2    0    0    0    8    0    0    0    1]
 [   9    6    1    3    0    0    0    0    0    0]
 [   1    3    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    1    0    0    0    0]
 [   0    0    0    0    0    1    0    0    0    0]]


We can't keep retrying randomized splits. Across these three runs the accuracy ranged from about 44% to 53%, so a single decision tree seems unlikely to get much past 50%.
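
Instead of rerunning the cell by hand, the run-to-run spread can be estimated with seeded cross-validation. This is a minimal sketch on made-up data; the seeds, fold count, and class proportions are assumptions rather than values from the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 10 integer features, 3 imbalanced classes
x_demo = rng.integers(1, 14, size=(600, 10))
y_demo = rng.choice([0, 1, 2], size=600, p=[0.5, 0.4, 0.1])

# random_state fixes the randomness of splitter='random'
dt_seeded = DecisionTreeClassifier(criterion='entropy', splitter='random', random_state=0)
# Seeded, shuffled folds make the whole evaluation repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(dt_seeded, x_demo, y_demo, cv=cv, scoring='accuracy')
print(f'accuracy: {scores.mean():.3f} (std {scores.std():.3f})')
```

The mean and standard deviation over folds give a more stable picture than any single split.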

## Random Forest

In [13]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=250, criterion='gini', max_features='sqrt')
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.6417433026789284
Confusion matrix: 
 [[4228  836    0    0    0    0    0    0    0    0]
 [1978 2189    2    0    1    0    0    0    0    0]
 [  94  397    1    0    0    0    0    0    0    0]
 [  21  172    2    2    0    0    0    0    0    0]
 [   0   39    0    0    0    0    0    0    0    0]
 [  17    0    0    0    0    0    0    0    0    0]
 [   0   19    0    0    0    0    0    0    0    0]
 [   0    4    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]]


We can try to boost this classifier.

## Boosting

In [14]:
from sklearn.ensemble import AdaBoostClassifier

# Note: algorithm='SAMME.R' is deprecated in recent scikit-learn releases (1.4+)
ada = AdaBoostClassifier(n_estimators=150, algorithm='SAMME.R')
ada.fit(x_train, y_train)
y_pred = ada.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.48370651739304277
Confusion matrix: 
 [[4684  170    0    0    0    0    0    0  210    0]
 [3852  155    0    0    0    0    0    0  163    0]
 [ 452   24    0    0    0    0    0    0   16    0]
 [ 180   11    0    0    0    0    0    0    6    0]
 [  37    2    0    0    0    0    0    0    0    0]
 [  16    1    0    0    0    0    0    0    0    0]
 [  15    2    0    0    0    0    0    0    2    0]
 [   3    1    0    0    0    0    0    0    0    0]
 [   1    0    0    0    0    0    0    0    0    0]
 [   1    0    0    0    0    0    0    0    0    0]]


In [17]:
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4)
adarf = AdaBoostClassifier(n_estimators=250, algorithm='SAMME.R', base_estimator=rf)
adarf.fit(x_train, y_train)
y_pred = adarf.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.6444422231107557
Confusion matrix: 
 [[4251  813    0    0    0    0    0    0    0    0]
 [1974 2193    2    0    1    0    0    0    0    0]
 [  74  417    1    0    0    0    0    0    0    0]
 [  27  167    1    2    0    0    0    0    0    0]
 [   0   39    0    0    0    0    0    0    0    0]
 [  17    0    0    0    0    0    0    0    0    0]
 [   0   18    1    0    0    0    0    0    0    0]
 [   0    4    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]]


## Extra Trees

In [18]:
from sklearn.ensemble import ExtraTreesClassifier

ext = ExtraTreesClassifier(n_estimators=125, criterion='gini')
ext.fit(x_train, y_train)
y_pred = ext.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.5964614154338265
Confusion matrix: 
 [[4003 1061    0    0    0    0    0    0    0    0]
 [2206 1958    4    0    2    0    0    0    0    0]
 [ 149  339    3    0    1    0    0    0    0    0]
 [  39  154    2    2    0    0    0    0    0    0]
 [   6   33    0    0    0    0    0    0    0    0]
 [  16    0    0    0    0    1    0    0    0    0]
 [   2   17    0    0    0    0    0    0    0    0]
 [   0    4    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]]


## Voting Classifier

In [19]:
estimators = list()
estimators.append(('adarf', adarf))
estimators.append(('ext', ext))
estimators.append(('rf', rf))

from sklearn.ensemble import VotingClassifier

vc = VotingClassifier(estimators = estimators, voting ='soft')
vc.fit(x_train, y_train)
y_pred = vc.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.6352459016393442
Confusion matrix: 
 [[4217  847    0    0    0    0    0    0    0    0]
 [2030 2136    3    0    1    0    0    0    0    0]
 [ 103  387    1    0    1    0    0    0    0    0]
 [  25  170    1    1    0    0    0    0    0    0]
 [   1   38    0    0    0    0    0    0    0    0]
 [  17    0    0    0    0    0    0    0    0    0]
 [   0   19    0    0    0    0    0    0    0    0]
 [   0    4    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]]


## Stacked Classifiers

In [20]:
models = (('adarf', adarf), 
          ('ext', ext),
          ('rf', rf))

from sklearn.ensemble import StackingClassifier

stclf = StackingClassifier(estimators=models, cv=5, n_jobs=5)
stclf.fit(x_train, y_train)
y_pred = stclf.predict(x_test)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred))

Accuracy Score:  0.6693322670931627
Confusion matrix: 
 [[4040 1024    0    0    0    0    0    0    0    0]
 [1529 2614   15   12    0    0    0    0    0    0]
 [  38  424   22    7    1    0    0    0    0    0]
 [   9  164    9   15    0    0    0    0    0    0]
 [   1   38    0    0    0    0    0    0    0    0]
 [  12    0    0    0    0    5    0    0    0    0]
 [   0   13    3    3    0    0    0    0    0    0]
 [   0    3    0    1    0    0    0    0    0    0]
 [   0    1    0    0    0    0    0    0    0    0]
 [   1    0    0    0    0    0    0    0    0    0]]


WOW! It is almost 67%, but that is still very low. We will have to rely on a deep learning model for better classification, as the data itself is highly imbalanced. Anyway, we will save our stacked classifier so that we can use it in our Streamlit app for a comparative study.
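
Since imbalance is the stated obstacle, one option to try before switching model families is class weighting together with a metric that treats all classes equally. This is a minimal sketch on synthetic data, not a result from this notebook; the data, seed, and `n_estimators=50` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Hypothetical imbalanced labels with a weak signal in the first feature
y_demo = np.array([0] * 600 + [1] * 300 + [2] * 100)
x_demo = rng.normal(size=(1000, 10))
x_demo[:, 0] += y_demo

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# class_weight='balanced' reweights samples inversely to class frequency
rf_balanced = RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=1)
rf_balanced.fit(x_tr, y_tr)
y_hat = rf_balanced.predict(x_te)

acc = accuracy_score(y_te, y_hat)
bal_acc = balanced_accuracy_score(y_te, y_hat)  # mean of per-class recall
print('accuracy:', acc)
print('balanced accuracy:', bal_acc)
```

Balanced accuracy averages recall over classes, so a model that ignores the rare hands scores poorly on it even when plain accuracy looks acceptable.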

Let us save some models.

## Saving models

In [6]:
import pickle

In [22]:
# Use a context manager so each file handle is closed after writing
for name, model in [('random_forest', rf), ('adaboost_random_forest', adarf),
                    ('voting_classifier', vc), ('stacking_classifier', stclf)]:
    with open(f'{name}.pkl', 'wb') as f:
        pickle.dump(model, f)

## Using the models

In [16]:
import numpy as np

def load_model(path):
    # Close the file handle as soon as the model is unpickled
    with open(path, 'rb') as f:
        return pickle.load(f)

model_1 = load_model('random_forest.pkl')
model_2 = load_model('adaboost_random_forest.pkl')
model_3 = load_model('voting_classifier.pkl')
model_4 = load_model('stacking_classifier.pkl')

In [17]:
# One hand as five (suit, rank) pairs: suit 2 with ranks 11, 13, 10, 12, 1.
# Note: the values are passed raw, while the models were trained on standardized features.
sample = np.array([2,11,2,13,2,10,2,12,2,1]).reshape(1,-1)
pred_1 = model_1.predict(sample)
pred_2 = model_2.predict(sample)
pred_3 = model_3.predict(sample)
pred_4 = model_4.predict(sample)

print(pred_1, pred_2, pred_3, pred_4)

[1] [3] [1] [3]
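
The models were trained on features standardized by `scale_data`, but the hand above is passed in unscaled and the fitted scaler was never saved, so these predictions may not be meaningful. One way to keep the two steps together is to pickle a single pipeline that contains both the scaler and the classifier. This is a minimal sketch on random stand-in data; the pipeline layout and hyperparameters are assumptions, not the notebook's setup.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical raw (unscaled) features and labels
x_raw = rng.integers(1, 14, size=(300, 10)).astype(float)
y_demo = rng.integers(0, 3, size=300)

# The scaler is fitted inside the pipeline and travels with the model
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
pipe.fit(x_raw, y_demo)

# Round-trip through pickle, as the saved .pkl files would
restored = pickle.loads(pickle.dumps(pipe))

# Raw values can now be passed directly; scaling happens inside predict
sample = np.array([[2, 11, 2, 13, 2, 10, 2, 12, 2, 1]], dtype=float)
print(restored.predict(sample))
```

With this layout the Streamlit app would only need to load one object and call `predict` on raw card values.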


With that, we are done with classical machine learning. Let us move on to deep learning to improve our project.