In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from scipy import sparse

### Hypothesis 1

We need a feature matrix in which every column represents a unique characteristic (feature) and every row represents a pass (instance).

Let's assume this hypothesis:
- Let the first column represent the distance from the sender to the closest player on the sender's team, the second column the distance to the second-closest teammate, and so on. Thus, the 10th column represents the distance to the teammate farthest from the sender.
- In the same way, let the 11th column represent the distance from the sender to the closest player on the opposing team, and so on up to the 21st column, which represents the distance to the opponent farthest from the sender.
- If all players except the sender are arranged in the order in which their distances appear in the feature row, the position of the receiver in this ordered list (1 to 21) is the label we predict.

Thus, every row of the feature matrix built under this hypothesis comes with exactly one label to predict.
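The construction described above can be sketched for a single pass as follows. This is only an illustration under assumptions: the function name `build_row`, its arguments, and the use of raw (x, y) coordinates are hypothetical, since the notebook loads an already-built matrix from `unscaled_featmat.txt` and does not show how it was produced.

```python
import numpy as np

def build_row(sender, teammates, opponents, receiver):
    """Hypothetical helper: build one feature row and its label.

    sender:    (x, y) position of the passing player
    teammates: array of shape (10, 2), the sender's other teammates
    opponents: array of shape (11, 2), the opposing team
    receiver:  ("team", i) or ("opp", j), an index into the unsorted arrays
    """
    sender = np.asarray(sender, dtype=float)
    d_team = np.linalg.norm(np.asarray(teammates, dtype=float) - sender, axis=1)
    d_opp = np.linalg.norm(np.asarray(opponents, dtype=float) - sender, axis=1)

    # Sort each team separately, closest player first
    order_team = np.argsort(d_team)
    order_opp = np.argsort(d_opp)

    # Columns 1-10: teammates by distance; columns 11-21: opponents by distance
    features = np.concatenate([d_team[order_team], d_opp[order_opp]])

    # Label = 1-based position of the receiver in the ordered list
    side, idx = receiver
    if side == "team":
        label = int(np.where(order_team == idx)[0][0]) + 1
    else:
        label = len(d_team) + int(np.where(order_opp == idx)[0][0]) + 1
    return features, label

# Usage with random positions (assumed pitch coordinates in [0, 100])
rng = np.random.default_rng(0)
teammates_demo = rng.uniform(0, 100, size=(10, 2))
opponents_demo = rng.uniform(0, 100, size=(11, 2))
features_demo, label_demo = build_row((50.0, 50.0), teammates_demo,
                                      opponents_demo, ("opp", 3))
print(features_demo.shape, label_demo)
```

Under this layout a label of 11 or more means the ball ended up with an opponent.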

Let's see how a logistic regression classifier performs in this scenario.

In [3]:
X = np.loadtxt('unscaled_featmat.txt')
Y = np.loadtxt('unscaled_labels.txt')

In [4]:
X.shape,Y.shape

((11682, 21), (11682, 21))

In [6]:
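# Convert each label row (a 1 marks the receiver's position) into a 1-based class index
Y_list = []
for row in Y:
    for i, element in enumerate(row):
        if element == 1:
            Y_list.append(i + 1)
            break  # stop at the first match so every row yields exactly one label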

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_list, random_state=10, test_size=0.2)
sXtr = sparse.csr_matrix(X_train)
sXte = sparse.csr_matrix(X_test)

In [24]:
# Logistic Regression

log = LogisticRegression(penalty='l2')
log.fit(sXtr,Y_train)
training_accuracy = log.score(sXtr, Y_train)
test_accuracy = log.score(sXte, Y_test)

print("Accuracy on training data: %0.2f" %(training_accuracy))
print("Accuracy on test data: %0.2f" %(test_accuracy))

Accuracy on training data: 0.28
Accuracy on test data: 0.26


In [13]:
# Fraction of test passes for which LR predicts class 1 (the teammate closest to the sender)
predicted = log.predict(sXte)
list(predicted).count(1)/len(predicted)

0.7419768934531451

In [28]:
#Naive Bayes
from sklearn import naive_bayes
cnb = naive_bayes.MultinomialNB()
cnb.fit(sXtr,Y_train)
training_accuracy_nb = cnb.score(sXtr,Y_train)
test_accuracy_nb = cnb.score(sXte,Y_test)

print("Accuracy on training data: %0.2f" %(training_accuracy_nb))
print("Accuracy on test data: %0.2f" %(test_accuracy_nb))

Accuracy on training data: 0.08
Accuracy on test data: 0.06


In [29]:
# Fraction of test passes for which NB predicts class 10 (the teammate farthest from the sender)
predicted_nb = cnb.predict(sXte)
list(predicted_nb).count(10)/len(predicted_nb)

0.24732563115104836

In [45]:
# Random forest (with n_estimators=1 the "forest" is a single decision tree)
from sklearn.ensemble import RandomForestClassifier as rf
crf = rf(n_estimators=1)
crf.fit(sXtr,Y_train)

training_accuracy_rf = crf.score(sXtr,Y_train)
test_accuracy_rf = crf.score(sXte,Y_test)

print("Accuracy on training data: %0.2f" %(training_accuracy_rf))
print("Accuracy on test data: %0.2f" %(test_accuracy_rf))

predictedrf = crf.predict(sXte)
probrf = crf.predict_proba(sXte)

Accuracy on training data: 0.69
Accuracy on test data: 0.15


In [53]:
metrics = pd.DataFrame()
metrics['Metric'] = ['Test Accuracy', 'Train Accuracy']
# Accuracies as percentages, one column per model
metrics['RF'] = [test_accuracy_rf*100, training_accuracy_rf*100]
metrics['NB'] = [test_accuracy_nb*100, training_accuracy_nb*100]
metrics['LR'] = [test_accuracy*100, training_accuracy*100]
metrics

| | Metric | RF | NB | LR |
|---|---|---|---|---|
| 0 | Test Accuracy | 15.190415 | 6.332905 | 26.10184 |
| 1 | Train Accuracy | 68.903157 | 7.747459 | 28.378812 |


In [46]:
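# Side-by-side view of true and predicted labels on the test set
a = pd.DataFrame()
a['Original Test Labels'] = Y_test
a['RF predicted'] = predictedrf
a['NB predicted'] = predicted_nb
a['LR predicted'] = predicted
a.head()

| | Original Test Labels | RF predicted | NB predicted | LR predicted |
|---|---|---|---|---|
| 0 | 1 | 3 | 10 | 1 |
| 1 | 2 | 2 | 1 | 1 |
| 2 | 5 | 8 | 20 | 1 |
| 3 | 1 | 21 | 12 | 1 |
| 4 | 14 | 1 | 17 | 1 |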


In [48]:
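# Side-by-side view of true and predicted labels on the training set
b = pd.DataFrame()
b['Original Train Labels'] = Y_train
b['RF predicted'] = crf.predict(sXtr)
b['NB predicted'] = cnb.predict(sXtr)
b['LR predicted'] = log.predict(sXtr)
b.head()

| | Original Train Labels | RF predicted | NB predicted | LR predicted |
|---|---|---|---|---|
| 0 | 4 | 4 | 19 | 1 |
| 1 | 9 | 9 | 21 | 1 |
| 2 | 9 | 9 | 11 | 1 |
| 3 | 4 | 4 | 21 | 1 |
| 4 | 17 | 2 | 17 | 1 |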


These accuracies indicate that the assumed hypothesis is flawed. The best test accuracy is only 26% (logistic regression), and that model predicts class 1 for about 74% of the test passes. With a misclassification rate this high, there is little point in looking at other evaluation metrics.
A better hypothesis is needed.
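To put accuracy figures like these in context, it can help to compare them with a trivial baseline that always predicts the most frequent class, and with uniform guessing over 21 classes (1/21, roughly 5%). The sketch below uses scikit-learn's `DummyClassifier` on synthetic stand-in data, because the notebook's input files are not available here. The names `X_demo` and `y_demo` and the skewed label distribution are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)

# Synthetic stand-in labels: 21 classes, assumed to be skewed toward low indices
weights = 1.0 / np.arange(1, 22)
weights /= weights.sum()
y_demo = rng.choice(np.arange(1, 22), size=2000, p=weights)
X_demo = rng.uniform(0, 100, size=(2000, 21))  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)
uniform_chance = 1.0 / 21

print("Majority-class baseline accuracy: %0.2f" % baseline_accuracy)
print("Uniform-chance accuracy: %0.2f" % uniform_chance)
```

On the real data, replacing `X_demo` and `y_demo` with `X` and `Y_list` would show how much of each model's accuracy comes from simply favouring one class.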

There are a number of instances where a player unintentionally passes the ball to a player from the other team (an interception).
An obvious way of improving the hypothesis is to delete all interceptions.
Let's do that and see where it takes us. An additional advantage is a drastic reduction in the number of labels: since we would no longer consider passes to players from the opposing team, we are left with only 10 classes, each representing a player from the sender's team.
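A minimal sketch of that filtering step, assuming the label encoding used so far (labels 1 to 10 are teammates, 11 to 21 are opponents). The helper name `drop_interceptions` and the toy arrays are hypothetical.

```python
import numpy as np

def drop_interceptions(X, labels, n_teammates=10):
    """Keep only passes whose receiver is on the sender's team.

    Assumes 1-based labels in which values above n_teammates denote
    opponents, i.e. interceptions.
    """
    labels = np.asarray(labels)
    keep = labels <= n_teammates  # boolean mask over the rows
    return X[keep], labels[keep]

# Toy example: 5 passes, 2 of which (labels 14 and 11) are interceptions
X_toy = np.arange(5 * 21, dtype=float).reshape(5, 21)
labels_toy = [1, 14, 10, 11, 3]
X_team, y_team = drop_interceptions(X_toy, labels_toy)
print(X_team.shape, y_team.tolist())
```

The mask should be applied to the dense matrix and the label list together, before the train/test split, so that the rows and labels stay aligned.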