In [3]:
import json, pickle
from collections import Counter
import warnings
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.base import clone
from scipy.sparse import vstack, hstack
from scipy.stats import spearmanr, kendalltau
import random

In [5]:
def load_pickle(path):
    # Context manager makes sure the file handle is closed after loading
    with open(path, 'rb') as f:
        return pickle.load(f)
tournaments, results, players = (load_pickle(name + '.pkl') for name in ('tournaments', 'results', 'players'))

# Summary

### 1. Read and analyze the data; select the tournaments that have data on team rosters and per-question results (the mask field in results.pkl). For consistency, I suggest: 
### 1.1 use tournaments with dateStart in 2019 as the training set;  
### 1.2 use tournaments with dateStart in 2020 as the test set.  
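The selection rule above can be sketched on toy data as follows. The helper names `has_question_level_data` and `split_by_year` and the toy dictionaries are invented for illustration; only the field names `dateStart`, `mask` and `teamMembers` come from the data used in this notebook.

```python
def has_question_level_data(team_results):
    """True if at least one team has both a mask and a non-empty roster."""
    return any(
        team.get("mask") is not None and team.get("teamMembers")
        for team in team_results
    )

def split_by_year(tournaments, results, year):
    """Ids of tournaments that start in `year` and have usable results."""
    return [
        tid for tid, info in tournaments.items()
        if info["dateStart"][:4] == str(year)
        and has_question_level_data(results.get(tid, []))
    ]

# Toy data, invented for illustration only.
toy_tournaments = {
    1: {"id": 1, "dateStart": "2019-03-01T10:00:00+03:00"},
    2: {"id": 2, "dateStart": "2019-06-01T10:00:00+03:00"},
    3: {"id": 3, "dateStart": "2020-01-15T10:00:00+03:00"},
}
toy_results = {
    1: [{"mask": "1011", "teamMembers": [{"player": {"id": 7}}]}],
    2: [{"mask": None, "teamMembers": []}],   # no per-question data, skipped
    3: [{"mask": "0110", "teamMembers": [{"player": {"id": 9}}]}],
}
train_ids = split_by_year(toy_tournaments, toy_results, 2019)
test_ids = split_by_year(toy_tournaments, toy_results, 2020)
```

Tournament 2 starts in 2019 but is dropped because none of its teams has a mask.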

In [6]:
df_tournaments                  = pd.DataFrame(tournaments.values()).set_index("id")
df_tournaments["year"]          = df_tournaments["dateStart"].apply(lambda x: int(x[:4]))
# 1.1
df_train_tournaments            = df_tournaments[df_tournaments["year"] == 2019]
# 1.2
df_test_tournaments             = df_tournaments[df_tournaments["year"] == 2020]

In [7]:
df_players                      = pd.DataFrame(players.values()).set_index("id")
df_players_indexed              = df_players.copy()

### 2. Build a baseline model based on linear or logistic regression that learns a rating list of players. Notes and hints:
### 2.1 per-question results are essentially outcomes of a coin toss, so predicting them is most likely a binary classification problem;
### 2.2 questions differ greatly in difficulty across tournaments, so the model has to account for this; most likely it will need to learn explicitly not only the strength of each player but also the difficulty of each question;
### 2.3 for the baseline model you may ignore teams and assume that a team's per-question results simply apply to each of its players.


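### Solution: train a logistic regression model on the concatenation of two one-hot vectors, one identifying the player and one identifying the question. The fitted player coefficients act as player strength and the question coefficients as question difficulty.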
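A minimal sketch of this baseline on toy data, assuming the same scikit-learn building blocks as the notebook. The observations, the variable names and the `max_iter=1000` setting are invented for illustration and are not the notebook's actual configuration.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy observations (invented): (player id, question id, did the team answer?).
rows = [
    (1, "t1-0", 1), (1, "t1-1", 1), (1, "t1-2", 0),
    (2, "t1-0", 1), (2, "t1-1", 0), (2, "t1-2", 0),
    (3, "t1-0", 0), (3, "t1-1", 0), (3, "t1-2", 0),
]
player_ids = np.array([[r[0]] for r in rows])
question_ids = np.array([[r[1]] for r in rows])
y = np.array([r[2] for r in rows])

# One one-hot block per factor, concatenated column-wise.
enc_players = OneHotEncoder(handle_unknown="ignore")
enc_questions = OneHotEncoder(handle_unknown="ignore")
X = hstack([enc_players.fit_transform(player_ids),
            enc_questions.fit_transform(question_ids)]).tocsr()

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Player coefficients come first, question coefficients after them.
n_players = len(enc_players.categories_[0])
player_strength = dict(zip(enc_players.categories_[0].tolist(),
                           model.coef_[0][:n_players]))
question_easiness = dict(zip(enc_questions.categories_[0].tolist(),
                             model.coef_[0][n_players:]))
```

A larger question coefficient means the question is answered more often, so a difficulty ranking sorts these coefficients in ascending order.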

### Result: the top of the rating table contains well-known players:
#####    'Stanislav Mereminskiy'
#####      'Mikhail Levandovskiy'
#####      'Ilya Novikov'
#####      'Sergey Nikolenko'
#####      'Yulia Arkhangelskaya'
#####      'Nikolay Krapil'

### 3. The quality of a rating system is measured by how well it predicts tournament results. Our models are unlikely to predict the per-question results themselves, since it is unknown how hard the questions of future tournaments will be, and those predictions are not needed for their own sake anyway. Therefore:
### 3.1 propose a way to predict the results of a new tournament with known rosters but unknown questions, in the form of a ranking of teams;
### 3.2 as the quality metric on the test set, let us use the Spearman and Kendall rank correlations (available in scipy) between the actual ranking in the tournament results and the ranking predicted by the model, averaged over the test set of tournaments.
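The metric can be sketched as below. The function name `mean_rank_correlations` and the toy scores are invented for illustration. Note one deliberate difference from the code later in this notebook: the sketch negates the predicted scores so that a perfect prediction gives +1, whereas the notebook takes the absolute value of each correlation.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def mean_rank_correlations(tournaments):
    """Average Spearman and Kendall correlations over tournaments.

    `tournaments` is a list of (predicted_scores, actual_positions) pairs.
    Scores are negated because a higher score should correspond to a
    numerically lower (better) position.
    """
    spearman, kendall = [], []
    for scores, positions in tournaments:
        neg_scores = -np.asarray(scores, dtype=float)
        spearman.append(spearmanr(neg_scores, positions)[0])
        kendall.append(kendalltau(neg_scores, positions)[0])
    # nanmean skips tournaments where a correlation is undefined.
    return float(np.nanmean(spearman)), float(np.nanmean(kendall))

toy = [
    ([0.9, 0.7, 0.4, 0.2], [1, 2, 3, 4]),   # predicted order matches exactly
    ([0.8, 0.3, 0.6, 0.1], [1, 2, 3, 4]),   # second and third teams swapped
]
sp, ke = mean_rank_correlations(toy)
```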

### Solution:
### 1) Choose a random question from the training question base (a tournament id plus a position in the mask from the results table)
### 2) For this question, compute the probability that each team answers it: 
$1 - \prod_{i \in S} \bigl(1 - f(\text{Player Level}_i, \text{Random Question})\bigr)$, where $f$ is the logistic function $\frac{1}{1 + e^{-x}}$ applied to the model's linear score $x$ for that player and question, and $S$ is the set of players in the team
### 3) Rank the teams of each test tournament by this probability and compute the Spearman and Kendall correlations with the actual ranking, averaged over tournaments
### 4) Repeat steps 1-3 n times and average the metrics over the iterations
### 5) Because the computation is time-consuming, n = 10 iterations were used for this task
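Step 2 can be sketched in a few lines. The strengths, the question coefficient and the names below are invented for illustration, and the model intercept is omitted for brevity, so this is a simplification of what `predict_proba` computes in the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def team_success_probability(player_strengths, question_easiness):
    """P(at least one player answers) = 1 - prod_i (1 - sigmoid(s_i + q))."""
    scores = np.asarray(player_strengths, dtype=float) + question_easiness
    return 1.0 - np.prod(1.0 - sigmoid(scores))

# Invented player strengths for three teams and one reference question.
teams = {"A": [1.0, 0.5, 0.2], "B": [0.0, -0.3], "C": [-1.0]}
q = -0.5
probs = {name: team_success_probability(s, q) for name, s in teams.items()}

# Teams sorted from most to least likely to answer the reference question.
ranking = sorted(probs, key=probs.get, reverse=True)
```

Larger teams get a boost under this formula, because every extra player multiplies in another factor below 1.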

### Result (the metrics fall approximately within the ranges mentioned in the task, shown in parentheses):
#### BASELINE METRICS:
#### Spearman Correlation: 78.96% (70% - 80%)
#### Kendall Correlation:     63.07% (50% - 60%)

### 4. Now the main part: ChGK is, after all, a team game. Therefore:
### 4.1 propose a way to account for the fact that several players answer a question together; most likely hidden variables will be needed; feel free to make simplifying assumptions, but the variables "player X answered question Y", conditioned on the data, must now become dependent for players of the same team;
### 4.2 develop an EM scheme for training this model and implement it in code;
### 4.3 train for several iterations, make sure the target metrics grow over time (probably not by much, but they should grow), and choose the best model using the target metrics.

### Solution:

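1) E STEP: 

$Z_{ij} = 0$ if the team did not answer question $j$: a wrong team answer means that every player $i$ in the team failed to answer it.

$Z_{ij} = \frac{f(\text{Player Level}_i, \text{Question}_j)}{P(\text{Team})}$ if the team answered question $j$, where $f$ is the logistic function $\frac{1}{1 + e^{-x}}$ applied to the model's linear score and $P(\text{Team}) = 1 - \prod_{k \in S} \bigl(1 - f(\text{Player Level}_k, \text{Question}_j)\bigr)$ is the team's probability of answering under the model from the previous iteration.

2) M STEP: 

Fit a logistic regression on the $Z$ values obtained in the E step, used as soft labels: each observation enters once with label 1 and weight $Z_{ij}$ and once with label 0 and weight $1 - Z_{ij}$.

3) Run 10 iterations of the E and M steps.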
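The two steps can be sketched for a single team and a single question as follows. The function names `e_step` and `m_step_targets` and the probabilities are invented for illustration; the notebook's own implementation works on whole arrays at once.

```python
import numpy as np

def e_step(p_players, team_answered):
    """Posterior P(player i answered | team outcome) for one team and question.

    p_players: per-player success probabilities from the current model.
    team_answered: 1 if the team answered the question, else 0.
    """
    p_players = np.asarray(p_players, dtype=float)
    if not team_answered:
        # Nobody on the team answered, so every hidden variable is 0.
        return np.zeros_like(p_players)
    p_team = 1.0 - np.prod(1.0 - p_players)
    return p_players / p_team

def m_step_targets(z):
    """Expand soft labels z into (labels, weights) for a weighted binary fit.

    Each observation appears twice: with label 0 and weight 1 - z,
    and with label 1 and weight z.
    """
    z = np.asarray(z, dtype=float)
    n = len(z)
    labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
    weights = np.concatenate([1.0 - z, z])
    return labels, weights

z_hit = e_step([0.5, 0.2], team_answered=1)    # team answered
z_miss = e_step([0.5, 0.2], team_answered=0)   # team did not answer
labels, weights = m_step_targets(z_hit)
```

The labels and weights can then be passed to `LogisticRegression.fit(X_doubled, labels, sample_weight=weights)`, where `X_doubled` (a hypothetical name) is the feature matrix stacked on top of itself.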


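### Result:
### The metrics fluctuate between iterations, and an iteration can score worse than an earlier one. In the run below they dip after the first iteration and then recover, while staying slightly below the baseline average:
### Best model achieved on: 8th iteration out of 10 (index 7 in the printed output), Spearman 78.36%, Kendall 62.72%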

### 5. What about the questions? Build a "rating list" of tournaments by question difficulty. Does it match intuition (for example, a world championship should on the whole have hard questions, and school tournaments easy ones)? If you are interested: build top lists of hard and easy questions with links to the specific entries in the ChGK question database (this is purely technical, there is no ML here).

### Result:
#### 1) Top 10 tournaments (hardest questions, lowest mean question coefficient):

#### 'Чемпионат Санкт-Петербурга. Первая лига'
#### 'Первенство правого полушария'
#### 'Угрюмый Ёрш'
#### 'Кубок городов'
#### 'Чемпионат России'
#### 'Синхрон высшей лиги Москвы'
#### 'All Cats Are Beautiful'
#### 'Ра-II: синхрон "Борского корабела"'
#### 'Антибинго'
#### 'Ускользающая сова'

#### 2) Bottom 10 tournaments (easiest questions, highest mean question coefficient):

#### '(а)Синхрон-lite. Лига старта. Эпизод X'
#### 'Второй тематический турнир имени Джоуи Триббиани'
#### 'Синхрон-lite. Выпуск XXIX'
#### 'Парный асинхронный турнир ChGK is...'
#### '(а)Синхрон-lite. Лига старта. Эпизод IX'
#### '(а)Синхрон-lite. Лига старта. Эпизод III'
#### '(а)Синхрон-lite. Лига старта. Эпизод VI'
#### 'Синхрон-lite. Выпуск XXX'
#### 'Асинхрон по «Королю и Шуту»'
#### '(а)Синхрон-lite. Лига старта. Эпизод V'
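A tournament ranking like the two lists above can be derived by grouping question coefficients by tournament id. The sketch below is an alternative to positional slicing and uses invented coefficients; the function name `tournament_difficulty` is hypothetical. In this notebook the ids could plausibly come from `Encoder_qsts.categories_[0]` and the coefficients from the question part of `model.coef_[0]`, but that pairing is an assumption of the sketch.

```python
from collections import defaultdict

import numpy as np

def tournament_difficulty(question_ids, question_coefs):
    """Mean question coefficient per tournament, hardest tournament first.

    question_ids are strings of the form "<tournament_id>-<question_index>",
    aligned element by element with question_coefs. A lower mean coefficient
    means the questions were answered less often.
    """
    per_tournament = defaultdict(list)
    for qid, coef in zip(question_ids, question_coefs):
        tourn_id = qid.rsplit("-", 1)[0]
        per_tournament[tourn_id].append(coef)
    means = {t: float(np.mean(c)) for t, c in per_tournament.items()}
    return sorted(means.items(), key=lambda item: item[1])

# Invented coefficients for two tournaments with two questions each.
ids = ["10-0", "10-1", "9-0", "9-1"]
coefs = [-2.0, -1.0, 0.5, 1.5]
ranking = tournament_difficulty(ids, coefs)
```

Grouping by the id prefix does not depend on the order in which the encoder happens to sort its categories.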

### Code

### 1

In [8]:
# Build the lists of questions and players across tournaments plus per-team result records
# (flag='train' returns a flat list of records, flag='test' groups the records by tournament)
def qsts_plrs_base_databases (tourn, res, flag, mask_flag, members, player_flag, ids, tourn_id, team_id, plr_id, team, pos):
  
    questions_base, players_base, base, test = [], [], [], []
    
    for i in tourn.index:
        trn_res, number_qsts, mask, data_tourn = res[i], set(), "", []
    
        for word in res[i]:
            if word.get(mask_flag) is not None:
                number_qsts.add(len("".join(filter(str.isdigit, word[mask_flag]))))

        if len(number_qsts) > 1:
            continue
            
        for word in trn_res:
            record = {}
            if word.get(mask_flag) is not None and word.get(members) is not None:
                if word[members] != []:                    
                    if flag == 'test':
                        mask = "".join(filter(str.isdigit, word[mask_flag]))
                        record[tourn_id], record[team_id], record[mask_flag], record[plr_id], record[pos]  = i, word.get(team).get(ids), list(map(int, mask)), [], word[pos]
                        players = word[members]
                        for player in players:
                            record[plr_id].append(player[player_flag][ids])
                        data_tourn.append(record)
                    else:
                        mask, players = "".join(filter(str.isdigit, word[mask_flag])), word[members]
                        record[tourn_id], record[team_id], record[mask_flag], record[plr_id]  = i, word.get(team).get(ids), list(map(int, mask)), []
                        for player in players:
                            record[plr_id].append(player[player_flag][ids])
                            players_base.append(player[player_flag][ids]) 
                        base.append(record)
        
        if flag == 'test':
            if mask:
                test.append(data_tourn)

        for j, k in enumerate(mask):
               questions_base.append(str(i) + "-" + str(j))
                
    if flag == 'train':
        return questions_base, players_base, base
    else:
        return questions_base, players_base, test

In [9]:
questions_base, players_base, base = qsts_plrs_base_databases(df_train_tournaments, results, 'train', 'mask', 'teamMembers', 'player', 'id', 'tournament_id', 'team_id', 'players_id', 'team', 'position')

### 2

In [None]:
Encoder_plrs, Encoder_qsts  = OneHotEncoder(), OneHotEncoder()
players_base_encoded        = Encoder_plrs.fit_transform(np.array(players_base).reshape(-1, 1))
questions_base_encoded      = Encoder_qsts.fit_transform(np.array(questions_base).reshape(-1, 1))

def Logistic_Regression_Training_With_Encoders(df, flag_1, flag_2, flag_3):

    features, labels = [], []
    for record in df:
        if record != {}:

            idx, mask, plr_idx, trn_qstn = str(record[flag_1]), record[flag_2], record[flag_3], []    
            matrix_plr = Encoder_plrs.transform(np.array([np.full((len(mask), ), k) for k in plr_idx]).reshape(-1, 1))

            for j in range(len(mask)):
                trn_qstn.append(idx + "-" + str(j))

            matrix_qsts = Encoder_qsts.transform(np.tile(trn_qstn, len(plr_idx)).reshape(-1, 1))
            labels.append(np.tile(mask, len(plr_idx)).reshape(-1, 1))
            features.append(hstack([matrix_plr, matrix_qsts]))

    # fit() expects a 1-D target, so the column of labels is flattened
    logreg = LogisticRegression()
    logreg.fit(vstack(features), 
               np.vstack(labels).ravel())
    
    return vstack(features), np.vstack(labels), logreg

features, labels, logreg = Logistic_Regression_Training_With_Encoders(base, 'tournament_id', 'mask', 'players_id')

def dict_players(df, flag):
    
    plrs = []
    for record in df:
         if record != {}:
            for idx in record[flag]:
                   if idx not in plrs:
                        plrs.append(idx)
    return plrs

plrs = dict_players(base, 'players_id')

def table_with_players_rating(df, flag_1, flag_2, flag_3):
    
    # OneHotEncoder sorts its categories, so the first len(plrs) coefficients line up with sorted(plrs)
    Rating               = pd.DataFrame({flag_1: sorted(plrs), 'Rating': logreg.coef_[0][:len(plrs)]})
    number_games_per_plr = Counter()
    for record in df:
        if record != {}:
            for idx in record[flag_3]:
                number_games_per_plr[idx] += len(record[flag_2])
            
    Games_Number = pd.DataFrame.from_dict(number_games_per_plr, orient='index')
    Games_Number[flag_1] = Games_Number.index
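    # The players table is indexed by "id"; expose it as a column named flag_1 so the merge key exists
    Rating = df_players_indexed.rename_axis(flag_1).reset_index().merge(Rating, on=flag_1)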
    Rating = Rating.merge(Games_Number, on=flag_1)
    print(Rating.sort_values(by='Rating', ascending=False).head(30))
    return Rating

Rating = table_with_players_rating(base, 'player_id', 'mask', 'players_id')

In [None]:
print(Rating)

### 3

In [17]:
_, _, base_test = qsts_plrs_base_databases(df_test_tournaments, results, "test", "mask", "teamMembers", "player", "id", "tournament_id", "team_id", "players_id", "team", "position")

In [18]:
def predict_team_rating_Spearman_Kendall(df, model, pos, plrs, team, tourn, qst):
    
    test_res = {}
   
    for idx in df:
        tourn_res = {}
        for word in idx:
            k = word[pos]
            mmbrs = []
            for plr_id in word[plrs]:
                try:
                    Encoder_plrs.transform([[plr_id]])
                    mmbrs.append(plr_id)
                except ValueError:
                    continue   
            if mmbrs == []:
                continue
            matrix_plr, matrix_qsts  = Encoder_plrs.transform(np.array(mmbrs).reshape(-1, 1)), Encoder_qsts.transform(np.full((len(mmbrs), 1), qst))
            pred = model.predict_proba(hstack([matrix_plr, matrix_qsts]))[:, 1]
            tourn_res[word[team]] = (1 - np.prod(1 - pred), k)
        test_res[idx[0][tourn]] = tourn_res
    
    sp_corr, ken_corr  = [], []
    
    for idx in test_res.values():
        pred, k = [], []
        for i in idx.values():
            pred.append(i[0])
            k.append(i[1])
        sp_corr.append(spearmanr(pred, k)[0])
        ken_corr.append(kendalltau(pred, k)[0])
    
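    # Positions run from best (lowest number) to worst, so a good model gives negative raw correlations;
    # abs() makes them positive, but it also hides tournaments where the predicted order is inverted
    return np.nanmean(np.abs(sp_corr)), np.nanmean(np.abs(ken_corr))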

In [19]:
number_iter = 10
res = []
for i in range(number_iter):
    qst = random.choice(questions_base)
    sp, ke = predict_team_rating_Spearman_Kendall(base_test, logreg, 'position', 'players_id', 'team_id' , 'tournament_id', qst)
    print('Iteration: ', i + 1, 'Spearman Correlation: ', sp, 'Kendall Correlation: ', ke)
    res.append([sp, ke])
res = np.array(res)

Iteration:  1 Spearman Correlation:  0.7873801114705261 Kendall Correlation:  0.6278180169776942
Iteration:  2 Spearman Correlation:  0.7933366182232459 Kendall Correlation:  0.6347137343408068
Iteration:  3 Spearman Correlation:  0.7943439199792 Kendall Correlation:  0.6362151561649261
Iteration:  4 Spearman Correlation:  0.7904519078919052 Kendall Correlation:  0.6312738479637523
Iteration:  5 Spearman Correlation:  0.7943798073121919 Kendall Correlation:  0.6362406503683327
Iteration:  6 Spearman Correlation:  0.7926932115998585 Kendall Correlation:  0.6340849730380285
Iteration:  7 Spearman Correlation:  0.7723488573135294 Kendall Correlation:  0.6126339864684278
Iteration:  8 Spearman Correlation:  0.7915019977653975 Kendall Correlation:  0.6326936590367839
Iteration:  9 Spearman Correlation:  0.7860364481448148 Kendall Correlation:  0.6265755629369366
Iteration:  10 Spearman Correlation:  0.7932143816070908 Kendall Correlation:  0.6346547201195544


In [20]:
print('BASELINE METRICS:')
print('Spearman Correlation:', np.mean(res, axis = 0)[0])
print('Kendall Correlation:',  np.mean(res, axis = 0)[1])

BASELINE METRICS:
Spearman Correlation: 0.789568726130776
Kendall Correlation: 0.6306904307415244


### 4 

In [21]:
def EM_ALGORITHM (logreg, features, labels, Z, flag):
    
    if flag == 'E':
        
        print('E STEP')
        
        preds, idx = logreg.predict_proba(features)[:, 1], 0
        
        for record in base:
            
            # Slice this team's block of predictions (players x questions, player-major order)
            response_team       = preds[idx : idx + len(record["players_id"]) * len(record["mask"])]
            right_response_team = labels[idx: idx + len(record["mask"])]
            response_team       = response_team.reshape((-1, len(record["mask"]))).T
            vector              = 1 - np.prod(1 - response_team, axis=1)
            response_team       = response_team / vector.reshape(-1, 1)
            response_team       = np.where(right_response_team != 0, response_team, 0)
            preds[idx: idx + len(record["players_id"]) * len(record["mask"])] = response_team.T.reshape(-1) 
            idx                 = idx + len(record["players_id"]) * len(record["mask"])   
    
        return preds
    
    else: 
        
        print('M STEP')
        
        n              = features.shape[0]
        # Each observation is duplicated: label 0 with weight 1 - Z and label 1 with weight Z
        labels, inputs = np.concatenate((np.zeros(n, dtype=int), np.ones(n, dtype=int))), vstack([features, features])
        weights        = np.hstack((1 - Z, Z))
        model          = LogisticRegression()
        model.fit(inputs, labels, sample_weight=weights)
        
        return model

In [23]:
number_iter = 10
res, question = [], qst
model, list_of_models = logreg, []
for i in range(number_iter):
    print('Iteration: ', i + 1)
    # Arguments that a step does not use are passed as None
    Z      = EM_ALGORITHM(model, features, labels, None, 'E')
    model  = EM_ALGORITHM(None,  features, None,   Z,    'M')
    list_of_models.append(model)
    sp, ke = predict_team_rating_Spearman_Kendall(base_test, model, 'position', 'players_id', 'team_id' , 'tournament_id', question)
    print('Spearman Correlation: ', sp, 'Kendall Correlation: ', ke)
    res.append([sp, ke])
res = np.array(res)

Iteration:  1
E STEP
M STEP


  return f(**kwargs)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Spearman Correlation:  0.7790439557600213 Kendall Correlation:  0.6204891541676517
Iteration:  2
E STEP
M STEP


Spearman Correlation:  0.7710685972094713 Kendall Correlation:  0.6122779382318523
Iteration:  3
E STEP
M STEP


Spearman Correlation:  0.7726572059853791 Kendall Correlation:  0.6138552944219076
Iteration:  4
E STEP
M STEP


Spearman Correlation:  0.7776784829962957 Kendall Correlation:  0.6193075666968156
Iteration:  5
E STEP
M STEP


Spearman Correlation:  0.7796262891709009 Kendall Correlation:  0.6217908532270977
Iteration:  6
E STEP
M STEP


Spearman Correlation:  0.7818274820376722 Kendall Correlation:  0.624437957930259
Iteration:  7
E STEP
M STEP


Spearman Correlation:  0.7776042716050733 Kendall Correlation:  0.6201581800653587
Iteration:  8
E STEP
M STEP


Spearman Correlation:  0.7836459166311835 Kendall Correlation:  0.6272008792567701
Iteration:  9
E STEP
M STEP


Spearman Correlation:  0.7782953773596344 Kendall Correlation:  0.6225062012672502
Iteration:  10
E STEP
M STEP


Spearman Correlation:  0.7817081236144051 Kendall Correlation:  0.6264545437050248


In [24]:
# Select the iteration with the highest mean of the two correlations.
# i_max is a 0-based index into res and list_of_models.
i_max = int(np.argmax(res.mean(axis=1)))
print(f'Best model achieved on iteration {i_max + 1} (index {i_max})')

Best model achieved on iteration 8 (index 7)


In [25]:
best_model = list_of_models[i_max]

### 5

In [26]:
def tourn_rankig(df, tourn, mask_flag, model):

    lst = [[df[0][tourn], len(df[0][mask_flag])]]
    for record in df:
        idx = record[tourn]
        if lst[-1] != [idx, len(record[mask_flag])]:
            lst.append([idx, len(record[mask_flag])])

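    # Question coefficients follow the player coefficients. The positional slicing below assumes
    # that the encoder keeps each tournament's questions together, in the same tournament order as lst
    qst_level, tourn_level  = model.coef_[0][len(plrs):], []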
    k = 0
    for record in lst:
        temp = np.array(qst_level[k : k + record[1]])
        tourn_level.append([record[0], np.mean(temp)])
        k = k + record[1]
        
    return tourn_level

# use best model predictions derived on previous step for defining tournament level
tourn_level = tourn_rankig(base, 'tournament_id', 'mask', best_model)

In [27]:
# top 10 tournaments: hardest questions (lowest mean question coefficient)

k = 10
for i in sorted(tourn_level, key=lambda x: x[1])[:k]:
    print(df_train_tournaments.loc[i[0]]['name'])

Чемпионат Санкт-Петербурга. Первая лига
Первенство правого полушария
Угрюмый Ёрш
Кубок городов
Чемпионат России
Синхрон высшей лиги Москвы
All Cats Are Beautiful
Ра-II: синхрон "Борского корабела"
Антибинго
Ускользающая сова


In [28]:
# bottom 10 tournaments: easiest questions (highest mean question coefficient)

for i in sorted(tourn_level, key=lambda x: x[1])[-k:]:
    print(df_train_tournaments.loc[i[0]]['name'])

(а)Синхрон-lite. Лига старта. Эпизод X
Второй тематический турнир имени Джоуи Триббиани
Синхрон-lite. Выпуск XXIX
Парный асинхронный турнир ChGK is...
(а)Синхрон-lite. Лига старта. Эпизод IX
(а)Синхрон-lite. Лига старта. Эпизод III
(а)Синхрон-lite. Лига старта. Эпизод VI
Синхрон-lite. Выпуск XXX
Асинхрон по «Королю и Шуту»
(а)Синхрон-lite. Лига старта. Эпизод V
