In [1]:
from pandas import read_csv, DataFrame, concat
from tqdm import tqdm

games_df = read_csv("./data/processed_games.csv")
# Drop the result and streak bookkeeping columns; they are not used as features.
games_df = games_df.drop(
    columns=[
        "player_result",
        "player_start_of_streak",
        "player_streak_id",
        "opponent_result",
        "opponent_start_of_streak",
        "opponent_streak_id",
    ]
)
games_df = games_df.dropna()

# Every row currently records a win for the player, so label all of them 1.
games_df["label"] = 1

# Mirror each game from the opponent's side and append it to the original data.
# The models need both label values to train on: 1 if the player won, 0 if they lost.
mirrored_rows = []
for i, row1 in tqdm(games_df.iterrows(), total=games_df.shape[0]):
    mirrored_rows.append({
        "date": row1["date"],
        "player": row1["opponent"],
        "opponent": row1["player"],
        "PL5G": row1["OL5G"],
        "OL5G": row1["PL5G"],
        "PS": row1["OS"],
        "OS": row1["PS"],
        "PAW": row1["OAW"],
        "OAW": row1["PAW"],
        "PNW": row1["ONW"],
        "ONW": row1["PNW"],
        "PNL": row1["ONL"],
        "ONL": row1["PNL"],
        "label": 0,
    })

# Build the mirrored frame once and combine it with concat
# (DataFrame.append is no longer available as of pandas 2.0).
games_df = concat([games_df, DataFrame(mirrored_rows)], ignore_index=True)

games_df["PWR"] = games_df["PNW"] / (games_df["PNW"] + games_df["PNL"])
games_df["OWR"] = games_df["ONW"] / (games_df["ONW"] + games_df["ONL"])
games_df["AWR"] = games_df["PAW"] / (games_df["PAW"] + games_df["OAW"])

100%|████████████████████████████████████| 31553/31553 [03:07<00:00, 168.69it/s]
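
The row-by-row loop above took about three minutes. As a rough sketch of an alternative, the same mirroring can be done in one pass by swapping column names. The helper name `mirror_games`, the `SWAP_PAIRS` list, and the one-row demo frame are my own illustrations, not part of the original pipeline, and the sketch assumes `label` has already been set to 1.

```python
import pandas as pd

# Column pairs that trade places when a game is viewed from the other side.
SWAP_PAIRS = [
    ("player", "opponent"),
    ("PL5G", "OL5G"),
    ("PS", "OS"),
    ("PAW", "OAW"),
    ("PNW", "ONW"),
    ("PNL", "ONL"),
]


def mirror_games(df: pd.DataFrame) -> pd.DataFrame:
    """Return df plus a mirrored copy of every row with label 0 (hypothetical helper)."""
    swap = {}
    for a, b in SWAP_PAIRS:
        swap[a] = b
        swap[b] = a
    # rename() maps every column name in a single pass, so each pair swaps cleanly.
    mirrored = df.rename(columns=swap).assign(label=0)
    # Put the mirrored columns in the original order before stacking the frames.
    return pd.concat([df, mirrored[df.columns]], ignore_index=True)


# Made-up single game (not real data), only to show the effect.
demo = pd.DataFrame({
    "date": ["2021-01-01"],
    "player": ["alice"], "opponent": ["bob"],
    "PL5G": [4], "OL5G": [1],
    "PS": [3], "OS": [0],
    "PAW": [2], "OAW": [1],
    "PNW": [10], "ONW": [5],
    "PNL": [2], "ONL": [7],
    "label": [1],
})
both = mirror_games(demo)
print(both[["player", "opponent", "PL5G", "OL5G", "label"]])
```

Calling `mirror_games(games_df)` right after the label assignment should give the same rows as the loop, without iterating in Python.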


#### Modeling

Now that the data is processed, we can try it on several models. Before fitting anything, we first assign the feature columns and the label column, then split the data into training and test sets.

I chose three models to try: Logistic Regression, a Decision Tree Classifier, and a Bayesian classifier (Gaussian Naive Bayes). The selection comes down to one main reason: all three handle binary classification. This project needs a model that decides between a win and a loss, in other words True or False, which is a binary classification problem. At this point I think these three are the best approaches to try.


In [2]:
from sklearn.model_selection import train_test_split

feature_cols = [
    "PL5G",
    "OL5G",
    "PS",
    "OS",
    "PWR",
    "OWR",
    "AWR",
]

X = games_df[feature_cols]
y = games_df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)

print("CONFUSION MATRIX: \n", cnf_matrix)
print("ACCURACY:", accuracy)

CONFUSION MATRIX: 
 [[7633  364]
 [ 351 7429]]
ACCURACY: 0.954680864549661


In [4]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)

print("CONFUSION MATRIX:\n", cnf_matrix)
print("ACCURACY:", accuracy)

CONFUSION MATRIX:
 [[7778  219]
 [ 247 7533]]
ACCURACY: 0.9704633326994992


In [5]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)

print("CONFUSION MATRIX:\n", cnf_matrix)
print("ACCURACY:", accuracy)

CONFUSION MATRIX:
 [[7588  409]
 [ 392 7388]]
ACCURACY: 0.9492298916143753
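
The three cells above repeat the same fit, predict, and score steps. A minimal sketch of how they could share one loop is below. The function name `evaluate_models` is hypothetical, `max_iter=1000` and `random_state=0` are my additions rather than settings used above, and the data comes from `make_classification` as a synthetic stand-in for `games_df`, so the printed numbers say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X_train, X_test, y_train, y_test):
    """Fit each model on the same split and collect its test metrics."""
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Gaussian Naive Bayes": GaussianNB(),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "confusion_matrix": confusion_matrix(y_test, y_pred),
        }
    return results


# Synthetic stand-in with 7 features, matching the length of feature_cols.
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)
split = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
results = evaluate_models(*split)

for name, scores in results.items():
    print(name, "ACCURACY:", scores["accuracy"])
```

With the real data, `evaluate_models(X_train, X_test, y_train, y_test)` would take the split from the earlier cell directly.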


#### Results

The three models do not differ much in their results: accuracy is about 0.955 for Logistic Regression, 0.970 for the Decision Tree, and 0.949 for Gaussian Naive Bayes. For now the Decision Tree fits best, which is roughly what I expected. My reasoning is that a player with an 80% win rate against an opponent is about as likely to win as one with a 90% win rate. In other words, I hypothesized that once certain features pass some level, the player is likely to win, and a decision tree models that kind of threshold directly.
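
One way to look at the threshold idea is to fit a deliberately shallow tree and print its split rules. This is only a sketch: `max_depth=2` is an arbitrary choice of mine, and the data is a synthetic stand-in from `make_classification` that borrows the names in `feature_cols`, so the thresholds it prints have no meaning for the real games.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_cols = ["PL5G", "OL5G", "PS", "OS", "PWR", "OWR", "AWR"]

# Synthetic stand-in data; with the real data this would be X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=len(feature_cols), random_state=0
)

# A shallow tree keeps the printed rules short enough to read.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(X_demo, y_demo)

# export_text renders each split as a "feature <= threshold" rule.
rules = export_text(shallow_tree, feature_names=feature_cols)
print(rules)
```

If the hypothesis holds on the real data, I would expect the top splits to fall on win-rate features such as `AWR` or `PWR`, but that still needs to be checked.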

One reason I suspect all three models reach such high accuracy is my sample size. This project has a lot of data: 31,553 games, or 63,106 rows once every game is mirrored. Compared to that, there are only seven features to consider.
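
The sample-size explanation could be tested with a learning curve, which scores the same model on growing portions of the training data. The sketch below uses scikit-learn's `learning_curve` on synthetic stand-in data. The five training sizes and `cv=5` are arbitrary choices of mine, not settings from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; with the real data this would be X and y from above.
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% to 100% of each training fold
    cv=5,
    scoring="accuracy",
)

# Average the held-out accuracy over the folds for each training size.
for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training rows -> mean CV accuracy {score:.3f}")
```

If accuracy on the real data is already high at the smallest sizes, the amount of data is probably not the main driver and the features themselves deserve a closer look.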