<h1><center>A Win/Lose Prediction Model for Premier League Games</center></h1>

<img src='premier.jpg' width="500">

## Contents

- EDA
- Preprocessing
- Modeling
- Optimization
- Evaluation

---

In [None]:
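# 1. Basic
import numpy as np   # used below as np (hstack, argmax)
import pandas as pd  # used below as pd (read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.pyplot import subplots_adjust  # used by the EDA plots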

In [None]:
# 2. Preprocessing
from sklearn.preprocessing import LabelEncoder

def category_to_ohe(train_col, test_col):
    # NOTE: despite the name, this label-encodes (integer codes); it does not one-hot encode.
    # The encoder is fit on train only, so test must not contain unseen categories.
    le = LabelEncoder()
    le.fit(train_col)

    # integer codes reshaped to column vectors, ready for np.hstack
    labeled_train_col = le.transform(train_col).reshape(-1, 1)
    labeled_test_col = le.transform(test_col).reshape(-1, 1)

    return labeled_train_col, labeled_test_col
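
A small sketch of what the helper above returns, using made-up `Home` values (the strings `'home'` and `'away'` are an assumption; the real column may be coded differently). `LabelEncoder` assigns integer codes in sorted order of the categories, so the result is a single integer column rather than a one-hot matrix.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a categorical column; not taken from the dataset.
toy_train = np.array(['home', 'away', 'home', 'away'])
toy_test = np.array(['away', 'home'])

le = LabelEncoder().fit(toy_train)            # categories are stored in sorted order
train_codes = le.transform(toy_train).reshape(-1, 1)
test_codes = le.transform(toy_test).reshape(-1, 1)

print(le.classes_)          # learned categories
print(train_codes.ravel())  # integer code per training row
print(test_codes.shape)     # column vector, ready for np.hstack
```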

In [None]:
# 3. Modeling

# 3.1 Generative models based on conditional probability
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis # QDA
from sklearn.naive_bayes import GaussianNB, MultinomialNB # Naive Bayes
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis # LDA


# 3.2 Discriminative models based on conditional probability
from sklearn.tree import DecisionTreeClassifier


# 3.3 Model combination (Ensemble)
from sklearn.ensemble import VotingClassifier # voting
from sklearn.ensemble import BaggingClassifier # bagging
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier # random forest

from sklearn.ensemble import AdaBoostClassifier # AdaBoost
from sklearn.ensemble import GradientBoostingClassifier # GradientBoost
import xgboost # xgboost


# 3.4 Discriminant-function models
from sklearn.linear_model import Perceptron # perceptron
from sklearn.linear_model import SGDClassifier # SGD
from sklearn.svm import SVC # support vector machine

In [None]:
# 4. Optimizer
from sklearn.model_selection import validation_curve # validation curve
from sklearn.model_selection import GridSearchCV # grid search
from sklearn.model_selection import ParameterGrid # ParameterGrid

In [None]:
# 5. Evaluation
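from sklearn.metrics import confusion_matrix, classification_report # confusion matrix and report
from sklearn.metrics import roc_curve # ROC curve
from sklearn.preprocessing import label_binarize # binarize labels for the ROC curve
from sklearn.metrics import auc # AUC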

---

## 1. EDA (Exploratory Data Analysis)

In [None]:
# load data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [None]:
# check shape
print('train shape :', train.shape) # 12 - 16 season
print('test shape :', test.shape) # 17 season

In [None]:
# check information
train.info()

In [None]:
# Unique
print('number of unique values per train column')
for col in train.columns[:14]:
    print('{} : {}'.format(col, train[col].nunique()))

In [None]:
# y_data
print('win count :', len(train[train['Result'] == 0]))
print('lose count :', len(train[train['Result'] == 1]))
print('draw count :', len(train[train['Result'] == 2]))

sns.countplot(x = 'Result', data = train)
plt.show()

In [None]:
# feature bar plot (mean of each feature per Result class)
plt.figure(figsize=(20, 18))
subplots_adjust(hspace = 0.3)

for i in range(1, 12+1) :
    plt.subplot(4, 3, i)
    sns.barplot(x = train['Result'], y = train[train.columns[i]])
    plt.title('{} bar plot'.format(train.columns[i]))

In [None]:
# feature boxplot
plt.figure(figsize=(20, 18))
subplots_adjust(hspace = 0.3)

for i in range(1, 12+1) :
    plt.subplot(4, 3, i)
    sns.boxplot(x = train['Result'], y = train[train.columns[i]], data = train)
    plt.title('{} box plot'.format(train.columns[i]))

In [None]:
# numeric summary: mean of each numeric feature per Result class
train.groupby('Result').mean(numeric_only=True)

In [None]:
# scatter plot
result0 = train[train['Result'] == 0]
result1 = train[train['Result'] == 1]
result2 = train[train['Result'] == 2]

# feature scatter plot
plt.figure(figsize=(20, 20))
subplots_adjust(hspace = 0.3)

for i in range(1, 12+1) :
    plt.subplot(4, 3, i)
    plt.plot(result0[result0.columns[i]], 'ro', alpha = 0.5, markersize = 3)
    plt.plot(result1[result1.columns[i]], 'bo', alpha = 0.3, markersize = 3)
    plt.plot(result2[result2.columns[i]], 'go', alpha = 0.2, markersize = 3)
    plt.title('{} scatter plot'.format(result0.columns[i]))
    plt.xlabel('row index')  # plt.plot(series, ...) puts the row index on the x axis
    plt.ylabel(result0.columns[i])

In [None]:
# correlation
correlation = train.drop(['Team', 'Result'], axis = 1)

f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation.corr(), annot=True, linewidths=1)
plt.show()

## 2. Preprocessing

In [None]:
# Select columns
category = ['Home']
continuous = ['Possession', 'Shots', 'Touches', 'Passes',
              'Tackles', 'Clearances', 'SOT', 'Corners', 'Offsides', 'Goal']

In [None]:
# make train/test data

train_cols, test_cols = [], []

# category
for cat in category:
    train_tok, test_tok = category_to_ohe(train[cat],test[cat])
    train_cols.append(train_tok)
    test_cols.append(test_tok)    

# continuous
for con in continuous:
    train_cols.append(train[con].values.reshape(len(train),1))
    test_cols.append(test[con].values.reshape(len(test),1))
 

In [None]:
# stack train/test data
X_train = np.hstack(train_cols)
X_test = np.hstack(test_cols)
y_train = train['Result']

In [None]:
X_train

## 3. Modeling

- Conditional probability models : compute the conditional probability that each class is the correct answer

    - Generative models : use Bayes' theorem

        - LDA (Linear Discriminant Analysis)
        - QDA (Quadratic Discriminant Analysis)
        - Naive Bayes

    - Discriminative models : estimate the conditional probability function directly

        - Logistic Regression
        - Decision Tree
        - KNN (K Nearest Neighbor)


- Discriminant-function models : find a decision boundary and determine on which side of it a data point lies

    - Perceptron
    - Support Vector Machine
    - Neural Network


- Model combination (Ensemble) : combine several predictive models to obtain better predictive performance

    - Aggregation methods : the set of models to use is fixed in advance

        - Majority voting
        - Bagging
        - Random Forest

    - Boosting methods : the set of models grows incrementally

        - AdaBoost
        - Gradient Boost
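
To make the generative idea concrete, here is a minimal sketch of Bayes' theorem for a single feature and three classes. All priors, means, and standard deviations are invented for illustration; they are not estimated from the match data.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical class priors P(y) and per-class likelihood parameters for one feature.
priors = np.array([0.4, 0.4, 0.2])        # win, lose, draw
means = np.array([15.0, 9.0, 12.0])       # made-up class means of the feature
stds = np.array([4.0, 4.0, 4.0])          # made-up class standard deviations

x = 14.0                                   # observed feature value
likelihood = gaussian_pdf(x, means, stds)  # P(x | y) for each class
joint = likelihood * priors                # P(x | y) * P(y)
posterior = joint / joint.sum()            # Bayes' theorem: P(y | x)

print(posterior.round(3))
print('predicted class :', int(np.argmax(posterior)))
```

A discriminative model would skip the likelihood and prior and fit P(y | x) directly.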

### 3.1 Conditional probability models

#### 3.1.1 Generative models based on conditional probability

In [None]:
# LDA (Linear Discriminant Analysis)
# n_components must be <= min(n_features, n_classes - 1), i.e. 2 for three classes
model = LinearDiscriminantAnalysis(n_components=2, solver="svd", 
        store_covariance=True).fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_, here 0, 1, 2)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))
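
The same fit, predict, and report steps recur in every cell of this section. One possible way to factor out the decoding step is sketched below; `labels_from_proba` is a hypothetical helper, not part of the original notebook. It maps each row's most probable column back through the model's `classes_`, which avoids hard-coding the number of test rows.

```python
import numpy as np

def labels_from_proba(proba, classes):
    """Return the most probable class label for each row of a probability matrix.

    proba   : array of shape (n_samples, n_classes), e.g. from model.predict_proba
    classes : label of each column, e.g. model.classes_
    """
    proba = np.asarray(proba)
    classes = np.asarray(classes)
    return classes[np.argmax(proba, axis=1)]

# Toy probabilities for three rows and the classes 0 (win), 1 (lose), 2 (draw).
toy_proba = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.5, 0.3]])
toy_pred = labels_from_proba(toy_proba, classes=[0, 1, 2])
print(toy_pred)
```

With a fitted model this would be called as `labels_from_proba(model.predict_proba(X_test), model.classes_)`.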

In [None]:
# QDA (Quadratic Discriminant Analysis)
model = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Naive Bayes - Multinomial (requires non-negative features, e.g. counts)
model = MultinomialNB().fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Naive Bayes - Gaussian
model = GaussianNB().fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

#### 3.1.2 Discriminative models based on conditional probability

In [None]:
# Logistic Regression : not used (binary logistic regression assumes a binomially distributed target; Result has three classes)

In [None]:
# Decision Tree
model = DecisionTreeClassifier(criterion='entropy', 
        max_depth=7, min_samples_leaf=5).fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# KNN (K Nearest Neighbor)
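
The KNN cell is left empty in the notebook. A minimal sketch of how it might be filled in follows, run on synthetic data so that it is self-contained. The choice of `n_neighbors=5` and the standardization step are assumptions, not the author's settings; in the notebook, `X_train`, `y_train`, and `X_test` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for X_train / y_train (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=11, n_informative=5,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# KNN is distance based, so features are standardized first.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)

knn_proba = knn.predict_proba(X_te)   # fraction of the 5 neighbours in each class
knn_pred = knn.predict(X_te)
print('accuracy :', knn.score(X_te, y_te))
```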

---

### 3.2 Model combination (Ensemble)

#### 3.2.1 Aggregation methods

In [None]:
# Majority voting

# models to combine
# (LDA n_components must be <= n_classes - 1, i.e. 2 for three classes)
model1 = LinearDiscriminantAnalysis(n_components=2, solver="svd", store_covariance=True)
model2 = QuadraticDiscriminantAnalysis()
model3 = GaussianNB()
model4 = MultinomialNB()

# build the ensemble
ensemble = VotingClassifier(estimators=[('lda', model1), ('qda', model2), ('gnb', model3), ('mul', model4)], 
                            voting='soft', weights=[1, 1, 1, 1])

predict_proba = [c.fit(X_train, y_train).predict_proba(X_test) for c in (model1, model2, model3, model4, ensemble)]

# comparison
y_true = test['Result']
# index 4 holds the ensemble's probabilities; take the most probable class per row
y_pred = np.argmax(predict_proba[4], axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))
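
With `voting='soft'`, the ensemble averages the class probabilities of its members (weighted by `weights`) and predicts the class with the highest average. A minimal numpy sketch with made-up probabilities for a single match:

```python
import numpy as np

# Hypothetical predict_proba outputs of four models for one sample
# (columns: win, lose, draw). The numbers are invented for illustration.
member_proba = np.array([[0.50, 0.30, 0.20],   # lda
                         [0.20, 0.50, 0.30],   # qda
                         [0.45, 0.35, 0.20],   # gnb
                         [0.40, 0.35, 0.25]])  # mul
weights = np.array([1, 1, 1, 1])

# Weighted average over the models, then argmax over the classes.
avg_proba = np.average(member_proba, axis=0, weights=weights)
soft_vote = int(np.argmax(avg_proba))

print(avg_proba)
print('soft vote :', soft_vote)
```

Note that one model here prefers "lose", yet the averaged probabilities still favour "win"; hard voting would count votes instead of averaging.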

In [None]:
# Bagging
model1 = DecisionTreeClassifier().fit(X_train, y_train)
model2 = BaggingClassifier(DecisionTreeClassifier(), bootstrap_features=True, random_state=0).fit(X_train, y_train)
predict_proba = model2.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model2.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Random Forest
clf = RandomForestClassifier(n_estimators=1000, max_depth=10, min_samples_split = 10, criterion = 'entropy')
model = clf.fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

#### 3.2.2 Boosting methods

In [None]:
# AdaBoost
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5, random_state=0), 
                               algorithm="SAMME", n_estimators=100).fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Gradient Boost
model = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0).fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# XGBoost
model = xgboost.XGBClassifier(n_estimators=100, max_depth=2).fit(X_train, y_train)
predict_proba = model.predict_proba(X_test)

# comparison
y_true = test['Result']
# most probable class per row (columns follow model.classes_)
y_pred = np.argmax(predict_proba, axis=1)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

---

### 3.3 Discriminant-function models

In [None]:
# Perceptron - perceptron
model = Perceptron(max_iter=500, eta0=0.1, random_state=1).fit(X_train, y_train)

# comparison
y_true = test['Result']
# Perceptron has no predict_proba; predict() already returns class labels,
# so no argmax is needed (argmax of a scalar label would always give 0)
y_pred = model.predict(X_test)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Perceptron - SGD
model = SGDClassifier(loss="hinge", max_iter=3, random_state=1).fit(X_train, y_train)

# comparison
y_true = test['Result']
# hinge loss gives no probabilities; predict() already returns class labels
y_pred = model.predict(X_test)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Support Vector Machine - linear
model = SVC(kernel='linear').fit(X_train, y_train)

# comparison
y_true = test['Result']
# predict() already returns class labels, so no argmax is needed
y_pred = model.predict(X_test)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Support Vector Machine - Polynomial Kernel
model = SVC(kernel="poly", degree=2, gamma=1, coef0=0).fit(X_train, y_train)

# comparison
y_true = test['Result']
# predict() already returns class labels, so no argmax is needed
y_pred = model.predict(X_test)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Support Vector Machine - RBF (Radial Basis Function)
model = SVC(kernel="rbf").fit(X_train, y_train)

# comparison
y_true = test['Result']
# predict() already returns class labels, so no argmax is needed
y_pred = model.predict(X_test)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Support Vector Machine - Sigmoid Kernel
model = SVC(kernel="sigmoid", gamma=2, coef0=2).fit(X_train, y_train)

# comparison
y_true = test['Result']
# predict() already returns class labels, so no argmax is needed
y_pred = model.predict(X_test)

target_names = ['win', 'lose', 'draw']
print('Confusion Matrix : \n\n',confusion_matrix(y_true, y_pred))
print('\n\n Classification Report : \n\n', classification_report(y_true, y_pred, target_names=target_names))

In [None]:
# Neural Network

---

## 4. Optimization

- Validation curve
- GridSearchCV
- ParameterGrid

In [None]:
# Validation curve
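
This cell is empty in the notebook. As a rough sketch of how `validation_curve` could be used, the example below sweeps `max_depth` of a decision tree on synthetic data. The estimator, the parameter range, and `cv=5` are illustrative assumptions; in the notebook, `X_train` and `y_train` would replace the synthetic arrays.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=11, n_informative=5,
                                   n_classes=3, random_state=0)

depths = np.arange(1, 8)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(criterion='entropy', random_state=0),
    X_toy, y_toy,
    param_name='max_depth', param_range=depths,
    cv=5, scoring='accuracy')

# One row per max_depth value, one column per CV fold.
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for d, tr, va in zip(depths, mean_train, mean_valid):
    print('max_depth={} train={:.3f} valid={:.3f}'.format(d, tr, va))
```

A widening gap between the training and validation means as `max_depth` grows is the usual sign of overfitting.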

In [None]:
# GridSearchCV
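
This cell is also empty. A minimal sketch of a grid search over two random forest parameters follows, again on synthetic data. The grid values, `n_estimators=20`, and `cv=3` are assumptions chosen to keep the example fast, not tuned settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=11, n_informative=5,
                                   n_classes=3, random_state=0)

# Every combination of these values is cross-validated (2 x 2 = 4 candidates).
param_grid = {'max_depth': [3, 5], 'min_samples_split': [2, 10]}
search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_toy, y_toy)

print('best params :', search.best_params_)
print('best CV accuracy :', round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is refit on the full training data and can be used like any other classifier.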

In [None]:
# ParameterGrid
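
`ParameterGrid` only enumerates parameter combinations; it does not fit anything. A short sketch with hypothetical SVC settings shows what it yields:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical parameter values, chosen only to illustrate the enumeration.
grid = ParameterGrid({'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]})

combos = list(grid)          # each element is a dict of keyword arguments
print(len(combos))
for params in combos:
    print(params)
```

Each dict can be passed on as `SVC(**params)`, which is useful when a custom loop is needed instead of `GridSearchCV`.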

---

## 5. Evaluation

- Confusion Matrix
- ROC Curve
- AUC

### 5.1 Confusion Matrix

In [None]:
confusion_matrix(y_true, y_pred)

In [None]:
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
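
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes precision, recall, F1, and accuracy by hand for a made-up 3x3 matrix (rows are true classes and columns are predicted classes, as in `confusion_matrix`); the counts are invented and are not results from this notebook.

```python
import numpy as np

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
cm = np.array([[50, 10,  5],
               [ 8, 40, 12],
               [10, 15, 20]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # correct / all predicted as that class
recall = true_pos / cm.sum(axis=1)      # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for k in range(3):
    print('class {} precision={:.3f} recall={:.3f} f1={:.3f}'.format(
        k, precision[k], recall[k], f1[k]))
print('accuracy={:.3f}'.format(accuracy))
```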

### 5.2 ROC (Receiver Operating Characteristic)

In [None]:
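# data
X = X_train
y = label_binarize(y_train, classes=[0, 1, 2])  # one indicator column per class

# ROC curve (one-vs-rest); clf is the RandomForestClassifier defined in section 3.2.1
fpr = [None] * 3
tpr = [None] * 3
thr = [None] * 3

for i in range(3):
    model = clf.fit(X, y[:, i])
    # NOTE: scored on the training data, so these curves are optimistic
    fpr[i], tpr[i], thr[i] = roc_curve(y[:, i], model.predict_proba(X)[:, 1])
    plt.plot(fpr[i], tpr[i], label='class {}'.format(i))

plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.legend()
plt.show()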

### 5.3 AUC (Area Under the Curve)

In [None]:
# AUC
auc(fpr[0], tpr[0]), auc(fpr[1], tpr[1]), auc(fpr[2], tpr[2])
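
`auc` integrates the curve with the trapezoidal rule. A small worked example with hypothetical ROC points (not taken from the curves above) shows the arithmetic:

```python
import numpy as np

# Hypothetical ROC points, sorted by increasing false positive rate.
toy_fpr = np.array([0.0, 0.1, 0.4, 1.0])
toy_tpr = np.array([0.0, 0.6, 0.8, 1.0])

# Trapezoidal rule: sum of (width * average height) over consecutive points.
widths = np.diff(toy_fpr)                     # 0.1, 0.3, 0.6
heights = (toy_tpr[1:] + toy_tpr[:-1]) / 2    # 0.3, 0.7, 0.9
toy_auc = float(np.sum(widths * heights))     # 0.03 + 0.21 + 0.54

print(toy_auc)
```

An AUC of 0.5 corresponds to the diagonal (random guessing) and 1.0 to a perfect ranking.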