# Classification Methods for Poker Hands

Dataset used: https://archive.ics.uci.edu/dataset/158/poker+hand

The methods to be tested are: K-Nearest Neighbors (KNN), Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA).



In [1]:
import pandas as pd
import numpy as np
from sklearn import neighbors
from sklearn.metrics import confusion_matrix, classification_report, precision_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis


The dataset holds the suit (`Naipe`) and rank (`Valor`) of each of the 5 cards in a hand. The `Poker Hand` column is the class of the hand:
      
      0: Nothing in hand; not a recognized poker hand  
      1: One pair; one pair of equal ranks within five cards
      2: Two pairs; two pairs of equal ranks within five cards
      3: Three of a kind; three equal ranks within five cards
      4: Straight; five cards, sequentially ranked with no gaps
      5: Flush; five cards with the same suit
      6: Full house; pair + different rank three of a kind
      7: Four of a kind; four equal ranks within five cards
      8: Straight flush; straight + flush
      9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush
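
The class definitions above can be turned into a small rule-based labeller. The sketch below is only an illustration and is not part of the dataset's tooling: the function name `classify_hand` and the `(suit, rank)` tuple layout are assumptions, as is treating Ten-Jack-Queen-King-Ace as a straight when the suits differ. Ranks follow the dataset encoding (1 = Ace, 13 = King).

```python
from collections import Counter

def classify_hand(cards):
    """Return the class 0..9 for five (suit, rank) tuples; rank 1 is the Ace."""
    suits = [suit for suit, _ in cards]
    ranks = sorted(rank for _, rank in cards)
    # Multiplicities of equal ranks, largest first, e.g. [3, 2] for a full house.
    counts = sorted(Counter(ranks).values(), reverse=True)

    flush = len(set(suits)) == 1
    royal = ranks == [1, 10, 11, 12, 13]          # Ace counted high
    straight = royal or (len(set(ranks)) == 5 and ranks[4] - ranks[0] == 4)

    if flush and royal:
        return 9
    if flush and straight:
        return 8
    if counts[0] == 4:
        return 7
    if counts == [3, 2]:
        return 6
    if flush:
        return 5
    if straight:
        return 4
    if counts[0] == 3:
        return 3
    if counts == [2, 2, 1]:
        return 2
    if counts[0] == 2:
        return 1
    return 0

# A hand with suit 1 and ranks 10, 11, 13, 12, 1 is a royal flush.
print(classify_hand([(1, 10), (1, 11), (1, 13), (1, 12), (1, 1)]))  # 9
```

A labeller like this is handy for spot-checking rows, since the label is a deterministic function of the ten feature columns.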

In [2]:
nome_col = ['Naipe 1','Valor 1','Naipe 2','Valor 2','Naipe 3','Valor 3','Naipe 4','Valor 4','Naipe 5','Valor 5','Poker Hand']

The data is already split into training and test sets: 25010 rows are provided for training and 1000000 for testing.

In [3]:
ph_train = pd.read_csv('poker-hand-training-true.data', names=nome_col)
ph_train.head()

   Naipe 1  Valor 1  Naipe 2  Valor 2  Naipe 3  Valor 3  Naipe 4  Valor 4  Naipe 5  Valor 5  Poker Hand
0        1       10        1       11        1       13        1       12        1        1           9
1        2       11        2       13        2       10        2       12        2        1           9
2        3       12        3       11        3       13        3       10        3        1           9
3        4       10        4       11        4        1        4       13        4       12           9
4        4        1        4       13        4       12        4       11        4       10           9


In [4]:
ph_train.shape

(25010, 11)

In [5]:
ph_test = pd.read_csv('poker-hand-testing.data', names=nome_col)
ph_test.head()

   Naipe 1  Valor 1  Naipe 2  Valor 2  Naipe 3  Valor 3  Naipe 4  Valor 4  Naipe 5  Valor 5  Poker Hand
0        1        1        1       13        2        4        2        3        1       12           0
1        3       12        3        2        3       11        4        5        2        5           1
2        1        9        4        6        1        4        3        2        3        9           1
3        1        4        3       13        2       13        2        1        3        6           1
4        3       10        2        7        1        2        2       11        4        9           0


In [6]:
ph_test.shape

(1000000, 11)

To make the code run faster, I used only the first 25000 rows of the test set:

In [7]:
X_train = ph_train.drop('Poker Hand',axis=1)
y_train = ph_train['Poker Hand']
X_test = ph_test[:25000].drop('Poker Hand',axis=1)
y_test = ph_test[:25000]['Poker Hand']
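
Taking the first 25000 rows is fast, but it relies on the file not being ordered in a way that matters. As a hedged alternative (not what this notebook does), a random subset with a fixed seed could be drawn with `DataFrame.sample`. The toy frame and the names `toy`, `subset`, `X_subset` and `y_subset` below are made up for the illustration.

```python
import pandas as pd

# Toy stand-in for ph_test; the real frame has 11 columns and 1000000 rows.
toy = pd.DataFrame({
    'Valor 1': range(100),
    'Poker Hand': [i % 3 for i in range(100)],
})

# Draw 25 rows at random without replacement; random_state makes the draw repeatable.
subset = toy.sample(n=25, random_state=0)

# Same feature/label split as in the cell above.
X_subset = subset.drop('Poker Hand', axis=1)
y_subset = subset['Poker Hand']
print(X_subset.shape, y_subset.shape)
```

With the real data the call would be `ph_test.sample(n=25000, random_state=0)`; the seed value is arbitrary.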

### Testing KNN:

In [8]:
Accuracy = []
Scores = []

# Test K from 1 to 10 (the label printed below, 'Valor de vizinhos K', means 'number of neighbors K'):
for k in range(1,11):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    pred = knn.fit(X_train, y_train).predict(X_test)
    
    print('Valor de vizinhos K:',k)
    print(confusion_matrix(y_test, pred).T)
    print(classification_report(y_test, pred, digits=3,zero_division=1.0))
    accuracy = knn.score(X_test, y_test)
    Accuracy.append((k, accuracy))


Valor de vizinhos K: 1
[[7545 4521  371   92   21   36    4    0    0    0]
 [4493 5004  614  341   48   13   16    4    0    0]
 [ 371  593  165   57   11    0    8    1    0    0]
 [ 105  268   36   57    5    0    5    2    0    0]
 [  12   56   12    4   10    0    1    0    0    0]
 [  37   16    1    1    0    2    0    0    0    0]
 [   2   14    4    2    1    0    1    1    0    0]
 [   1    5    0    0    0    0    0    0    0    0]
 [   0    3    1    0    1    0    0    0    0    0]
 [   2    2    0    0    1    0    0    0    0    0]]
              precision    recall  f1-score   support

           0      0.599     0.600     0.600     12568
           1      0.475     0.477     0.476     10482
           2      0.137     0.137     0.137      1204
           3      0.119     0.103     0.110       554
           4      0.105     0.102     0.104        98
           5      0.035     0.039     0.037        51
           6      0.040     0.029     0.033        35
           7 

Valor de vizinhos K: 9
[[9266 5511  476  126    9   44    3    0]
 [3296 4918  702  414   89    7   27    7]
 [   4   44   21   12    0    0    3    1]
 [   2    9    4    2    0    0    2    0]
 [   0    0    1    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0]]
              precision    recall  f1-score   support

           0      0.600     0.737     0.662     12568
           1      0.520     0.469     0.493     10482
           2      0.247     0.017     0.033      1204
           3      0.105     0.004     0.007       554
           4      0.000     0.000     1.000        98
           5      1.000     0.000     0.000        51
           6      1.000     0.000     0.000        35
           7      1.000     0.000     0.000         8

    accuracy                          0.568     25000
   macro avg      0.559     0.153     0.274     25000
weighted avg      0.538     0.568   
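
Because the matrix is printed transposed (`.T`), rows are predicted classes and columns are true classes. As a sanity check, the K = 9 figures can be recomputed from the matrix by hand. The variable names below are only for this illustration.

```python
# K = 9 matrix copied from the output above: rows = predicted class, columns = true class.
cm = [
    [9266, 5511, 476, 126, 9, 44, 3, 0],
    [3296, 4918, 702, 414, 89, 7, 27, 7],
    [4, 44, 21, 12, 0, 0, 3, 1],
    [2, 9, 4, 2, 0, 0, 2, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

total = sum(sum(row) for row in cm)

# Accuracy: correctly classified hands (the diagonal) over all hands.
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Class 0 precision: diagonal entry over everything predicted as 0 (row sum).
precision_0 = cm[0][0] / sum(cm[0])

# Class 0 recall: diagonal entry over everything that truly is 0 (column sum).
recall_0 = cm[0][0] / sum(row[0] for row in cm)

print(total, round(accuracy, 3), round(precision_0, 3), round(recall_0, 3))
# 25000 0.568 0.6 0.737
```

These match the 0.568 accuracy and the 0.600 / 0.737 precision and recall reported for class 0.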

In [9]:
max(Accuracy, key=lambda x: x[1])  # select by accuracy; plain max() on (k, accuracy) tuples compares k first

(10, 0.5704)
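
One thing to keep in mind when reading this result: Python compares tuples element by element, so `max` over `(k, accuracy)` pairs is decided by `k` first unless a `key` is given. The values below are hypothetical and only illustrate the difference.

```python
# Hypothetical (k, accuracy) pairs, not results from this notebook.
pairs = [(1, 0.60), (2, 0.55), (3, 0.58)]

by_first_element = max(pairs)                  # compares k first, accuracy only on ties
by_accuracy = max(pairs, key=lambda p: p[1])   # compares accuracy only

print(by_first_element)  # (3, 0.58)
print(by_accuracy)       # (1, 0.6)
```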

In [10]:
Scores.append(('Knn', max(Accuracy, key=lambda x: x[1])[1]))  # best accuracy over all K

### Testing Logistic Regression:

In [11]:
logreg = LogisticRegression(max_iter=25000)

logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [12]:
print(classification_report(y_test, y_pred, digits=3,zero_division=1.0))
logreg.score(X_test, y_test)

              precision    recall  f1-score   support

           0      0.503     1.000     0.669     12568
           1      1.000     0.000     0.000     10482
           2      1.000     0.000     0.000      1204
           3      1.000     0.000     0.000       554
           4      1.000     0.000     0.000        98
           5      1.000     0.000     0.000        51
           6      1.000     0.000     0.000        35
           7      1.000     0.000     0.000         8

    accuracy                          0.503     25000
   macro avg      0.938     0.125     0.084     25000
weighted avg      0.750     0.503     0.336     25000



0.50272
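
The 1.000 precision values for classes 1 to 7 in the report above come from `zero_division=1.0`: the model never predicts those classes, so their precision is 0/0 and is reported as 1. This inflates the averaged precision, which can be checked by hand from the support column. The sketch assumes, as the recall column suggests, that every prediction is class 0.

```python
# Support column from the report above (number of true rows per class 0..7).
support = [12568, 10482, 1204, 554, 98, 51, 35, 8]
total = sum(support)

# Class 0 precision when every row is predicted as class 0;
# the other seven classes are 0/0 and get 1.0 through zero_division=1.0.
precision = [support[0] / total] + [1.0] * 7

macro = sum(precision) / len(precision)
weighted = sum(p * s for p, s in zip(precision, support)) / total
print(round(macro, 3), round(weighted, 3))  # 0.938 0.75
```

The 0.938 macro average is therefore an artifact of the `zero_division` setting, not a sign of a precise model; recall and f1-score give the more honest picture here.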

In [13]:
Scores.append(('LogReg',logreg.score(X_test, y_test)))

### Linear Discriminant Analysis (LDA):

In [14]:
lda = LinearDiscriminantAnalysis()
model = lda.fit(X_train, y_train)

In [15]:
pred=model.predict(X_test)
print(np.unique(pred, return_counts=True))

(array([0], dtype=int64), array([25000], dtype=int64))
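
Since all 25000 predictions are class 0, the LDA accuracy is simply the share of class 0 in the test subset, that is, the majority-class baseline. A quick check using the per-class counts from the classification reports in this notebook:

```python
# Number of test rows per class (support column of the reports in this notebook).
support = {0: 12568, 1: 10482, 2: 1204, 3: 554, 4: 98, 5: 51, 6: 35, 7: 8}

# The baseline always answers with the most frequent class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())
print(majority_class, baseline_accuracy)  # 0 0.50272
```

Any score at or below 0.50272 means the model adds nothing over always answering "nothing in hand".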


In [16]:
print(confusion_matrix(pred, y_test))
print(classification_report(y_test, pred, digits=3,zero_division=1.0))

[[12568 10482  1204   554    98    51    35     8]
 [    0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0]]
              precision    recall  f1-score   support

           0      0.503     1.000     0.669     12568
           1      1.000     0.000     0.000     10482
           2      1.000     0.000     0.000      1204
           3      1.000     0.000     0.000       554
           4      1.000     0.000     0.000        98
           5      1.000     0.000     0.000        51
           6      1.000     0.000     0.000        35
           7      1.000     0.000     0.000         8

    accuracy                          0.503     25000
   macro avg      0.938     0.125     0.084     2

In [17]:
Scores.append(('LDA',lda.score(X_test, y_test)))

### Quadratic Discriminant Analysis (QDA):

In [18]:
qda = QuadraticDiscriminantAnalysis()
model2 = qda.fit(X_train, y_train)



In [19]:
pred2=model2.predict(X_test)
print(np.unique(pred2, return_counts=True))
print(confusion_matrix(pred2, y_test))
print(classification_report(y_test, pred2, digits=3,zero_division=1.0))

(array([0, 1, 3, 4], dtype=int64), array([18431,  6565,     2,     2], dtype=int64))
[[10278  7128   679   251    34    48    12     1]
 [ 2290  3353   525   302    64     3    22     6]
 [    0     0     0     0     0     0     0     0]
 [    0     1     0     0     0     0     0     1]
 [    0     0     0     1     0     0     1     0]
 [    0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0]]
              precision    recall  f1-score   support

           0      0.558     0.818     0.663     12568
           1      0.511     0.320     0.393     10482
           2      1.000     0.000     0.000      1204
           3      0.000     0.000     1.000       554
           4      0.000     0.000     1.000        98
           5      1.000     0.000     0.000        51
           6      1.000     0.000     0.000        35
           7      1.000     0.000     0.000         8

    accuracy      

In [20]:
Scores.append(('QDA',qda.score(X_test, y_test)))

In [21]:
Scores

[('Knn', 0.5704), ('LogReg', 0.50272), ('LDA', 0.50272), ('QDA', 0.54524)]

KNN achieved the best accuracy (0.5704), possibly because it was the only method whose parameter was varied. The other methods might reach better results if their parameters were tuned as well.
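
As a sketch of what varying the parameters of the other methods could look like, the block below runs a cross-validated grid search over one regularisation parameter per model. This is an assumption about how one might proceed, not something this notebook ran: the grids are arbitrary, and synthetic data from `make_classification` stands in for `X_train` and `y_train` so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (swap in the real data to use this).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_redundant=0, n_classes=3, random_state=0)

# One tunable parameter per model; the candidate values are arbitrary.
searches = {
    'LogReg': (LogisticRegression(max_iter=1000), {'C': [0.01, 0.1, 1, 10]}),
    'LDA': (LinearDiscriminantAnalysis(solver='lsqr'), {'shrinkage': [None, 0.1, 0.5, 'auto']}),
    'QDA': (QuadraticDiscriminantAnalysis(), {'reg_param': [0.0, 0.1, 0.5]}),
}

best = {}
for name, (model, grid) in searches.items():
    # 3-fold cross-validation on the training data only; the test set stays untouched.
    search = GridSearchCV(model, grid, cv=3, scoring='accuracy')
    search.fit(X_demo, y_demo)
    best[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Selecting parameters by cross-validation on the training set, rather than by test accuracy as in the KNN loop, keeps the final test score an unbiased estimate.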

In [22]:
max(Scores, key=lambda x: x[1])

('Knn', 0.5704)