# Welcome to **Chess MBTI** for lichess.org! 
### This project is designed for curious chess players who want to learn about their playing style and improve their results.
### The main concept behind Chess MBTI is based on the Myers-Briggs Type Indicator, or MBTI for short.
### Players are evaluated in four categories:
##### - Decision Making: U (Ultra-Calculator) / I (Instinctive) – describes the thought process behind the player's moves.
##### - Position Handling: G (Grinder) / H (Hunter) – does the player capitalize on small positional gains, or pounce on the initiative?
##### - Opening Approach: T (Theoretical) / N (Nonconformist) – does the player study openings diligently, or go with the flow?
##### - Playing Style: L (Level-Headed) / V (Volatile) – mostly shaped by the player's middlegame approach: do they aim for calm or chaotic positions?
### Once the playing style has been evaluated, opening recommendations, middlegame advice and the best personal MBTI matchups will be provided!
### This notebook covers all the theoretical and technical details. Let's start!

## First of all, we need to install the necessary packages and libraries.
### Let's look at them and their roles one by one.

##### 1. Berserk - connects to the Lichess API to import the games for analysis.
##### 2. Pandas - provides the tools for statistical and exploratory analysis.
##### 3. Matplotlib - will be used for visualization.
##### 4. Scikit-learn - a crucial machine learning library that will be used for model building.
##### 5. Numpy - a popular Python library that will be used for mathematical operations.
##### 6. XGBoost - a gradient boosting library; its classifier is imported below alongside the scikit-learn models.

In [37]:
!pip install berserk
!pip install pandas
!pip install matplotlib
!pip install scikit-learn
!pip install xgboost
!pip install numpy
!pip install seaborn
!pip install dotenv

Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Downloading dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Installing collected packages: dotenv
Successfully installed dotenv-0.9.9


### Now let's import the newly installed libraries so that they are available in the notebook.


In [1]:
import berserk
import base64
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import xgboost
import json
import numpy as np
import seaborn as sns
import warnings
import dotenv

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.multiclass import OneVsOneClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from results import result_dict

### Final step of the preparation: obtaining a token to interact with the Lichess API.
### The token authenticates the API session that all later requests go through.

In [3]:
session = berserk.TokenSession('')
client = berserk.Client(session=session)
warnings.filterwarnings("ignore")
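
### The token string is left empty in the cell above. One way to keep a real token out of the notebook is to read it from an environment variable.
### The sketch below is only an illustration: the variable name LICHESS_TOKEN and the helper load_lichess_token are assumptions, not part of the project.

```python
import os

def load_lichess_token(env_var="LICHESS_TOKEN"):
    """Return the API token stored in an environment variable, or fail loudly."""
    # env_var is a hypothetical name chosen for this example.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Lichess API token.")
    return token
```

### The result could then be passed on as berserk.TokenSession(load_lichess_token()). If the token is kept in a .env file, calling dotenv.load_dotenv() first loads it into the environment.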

## Now we are free to begin the more exciting part - **data importing and analysis**.
### To start, we initialize the rating threshold and type threshold dictionaries.
### The rating threshold dictionary divides the players into rating groups.
### This lets us evaluate players more fairly, using the type thresholds that match their rating stage.
### At the moment the values might look like random numbers, but they will be explained later on.

In [4]:

rating_thresholds = {1200: 'beginner', 1500: 'amateur', 1800: 'upper-intermidiate', 2150: 'advanced', 100000: 'specialist'}
type_thresholds = {
0 : {'beginner': [[5,6.5], 4.75], 'amateur': [[5.5, 7], 5], 'upper-intermidiate': [[6, 7.5], 5.5], 'advanced': [[5.5,7.5],5],'specialist': [[5, 6.5], 4.75]},
1 : {'beginner': [[0.65, 0.825], 38], 'amateur': [[0.55, 0.7], 41], 'upper-intermidiate': [[0.45, 0.575], 44], 'advanced': [[0.375, 0.475],47],'specialist': [[0.335, 0.425], 50]},
2 : {'beginner': [[0.45, 0.55], 5], 'amateur': [[0.35,0.45], 6], 'upper-intermidiate': [[0.25, 0.35], 6.5], 'advanced': [[0.175,0.275],7],'specialist': [[0.125, 0.2], 7.5]},
3 : {'beginner': [[0.75, 1.15], 0.4] , 'amateur': [[0.7, 1], 0.35], 'upper-intermidiate': [[0.65, 0.9], 0.3], 'advanced': [[0.6, 0.8], 0.25],'specialist': [[0.58, 0.75], 0.22]}
}
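
### To make the two dictionaries less abstract, here is a minimal sketch of how they are read: the rating stage is the first band whose upper bound exceeds the player's mean rating, and each type threshold entry unpacks into a band plus a second threshold.
### The helper rating_stage and the name decision_making_thresholds are introduced only for this example; the values are copied from the cell above.

```python
# Copies of the values from the cell above (label spelling kept exactly as in the notebook).
rating_thresholds = {1200: 'beginner', 1500: 'amateur', 1800: 'upper-intermidiate',
                     2150: 'advanced', 100000: 'specialist'}
# Part of type_thresholds[0], the Decision Making axis.
decision_making_thresholds = {'beginner': [[5, 6.5], 4.75], 'amateur': [[5.5, 7], 5]}

def rating_stage(mean_rating):
    """Return the label of the first band whose upper bound is above mean_rating."""
    for upper_bound in sorted(rating_thresholds):
        if mean_rating < upper_bound:
            return rating_thresholds[upper_bound]
    raise ValueError("rating is above every configured bound")

stage = rating_stage(1430)
(band_low, band_high), second_threshold = decision_making_thresholds[stage]
print(stage, band_low, band_high, second_threshold)  # amateur 5.5 7 5
```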


### Now let's dive into the main data importing function, which I called get_data.
### Every detail of the code below is explained in the comments, which are marked with ## symbols.
### Feel free to take your time and inspect the function.

In [5]:
def get_data(nickname): ## Initialize the function. The only parameter here is the nickname of the player.
    raw_data = client.games.export_by_player(nickname, max=100,evals=True, clocks=True,opening=True, perf_type=['blitz','rapid'], analysed=True, tags=False)
    ## In the line above, we use the client instance, created with our token, to import the games of a player.
    ## The keyword arguments specify which data should be imported.
    ## Here we request up to 100 of the most recent analyzed blitz and rapid games, with evaluations, clock times and opening data.
    games = list(raw_data)
    ## Notice that the imported data is called raw_data: it has to be converted before it can be analyzed.
    ## We do this by collecting it into a list, which holds the games in order from most to least recent.
    if len(games)<10:
        print('Not enough analyzed games yet!')
        return None, None, None
    ## Now that the data is readable, we make sure there is enough material to work with.
    ## To do this, we check the length of the list, which is the number of games imported.
    ## If there are fewer than 10 games, we print a message and stop the function.
    opening_swing_list = [] ## In the lines preceding the for loop, we initialize the containers that will store the game data.
    blunder_score_list = []
    no_tp_time_usage_means = []
    no_tp_time_usage_stds = []
    no_tp_eval_swing_stds = []
    no_tp_eval_swing_means = []
    game_cnt = 0
    color = ''
    debut_depth_list = []
    eval_swing_means = []
    eval_swing_stds = []
    normalized_time_usage_means = []
    normalized_time_usage_stds = []
    opening_eval_swings = []
    moves_per_game = []
    moves_btp_list=[]
    mean=0
    
    for game in games: ## Now we begin to access each game of the list for the data.
        moves_btp = 0 ## Once again, starting with initializing the data containers for future use.
        no_tp_eval_swing_list=[]
        no_tp_time_usage_list = []
        increment = int(game['clock']['increment'])
        time_list = []
        eval_swing_list = []
        time_std = 0
        eval_swing_mean = 0
        eval_swing_std = 0
        plies = len(game['moves'].split(' '))
        moves = plies//2
        game_cnt+=1
        moves_per_game.append(moves)
        if 'opening' in game.keys(): ## In the rare case where Lichess doesn't recognize the opening, we record an opening length of 1.
            debut_depth_list.append(game['opening']['ply'])
        else:
            debut_depth_list.append(1)
            
        if 'user' in game['players']['white'].keys() and 'user' in game['players']['black'].keys(): ## Now we identify the player's piece color.
            if game['players']['white']['user']['name'].lower() == nickname.lower():
                color = 'w'
                blunder_points = game['players']['white']['analysis']['inaccuracy']/2 + game['players']['white']['analysis']['mistake']*1.5 + game['players']['white']['analysis']['blunder']*3
                blunder_score = blunder_points/moves ## Blunder score = (Inaccuracies * 0.5 + Mistakes * 1.5 + Blunders * 3) / Number of moves
                blunder_score_list.append(round(blunder_score,2)) ## While we are here, we also store the blunder score, which weighs the number of errors by their severity.
                mean+=game['players']['white']['rating'] ## Add the player's rating in this game; the average will determine the rating stage.
            else:
                color = 'b'
                blunder_points = game['players']['black']['analysis']['inaccuracy']/2 + game['players']['black']['analysis']['mistake']*1.5 + game['players']['black']['analysis']['blunder']*3 ## Same weights as for White.
                blunder_score = blunder_points/moves
                blunder_score_list.append(round(blunder_score,2))
                mean+=game['players']['black']['rating']
        else:
            pass
    
            
        for i in range(len(game['analysis'])):
            if type(game['analysis'][i].get('eval')) != int:
                mate_in = game['analysis'][i].get('mate', 0)
                game['analysis'][i]['mate'] = 1500 if mate_in >= 0 else -1500
                ## Evaluation swings must stay on one scale, so when there is a forced mate in the position,
                ## we score it as a 1500 centipawn advantage (positive mate count) or disadvantage (negative mate count).
                
    
        evals = [i.get('eval') if type(i.get('eval')) == int else i.get('mate') for i in game['analysis']]
        ## Generate a full evaluation list for the game to analyze the swings.

        if plies > len(evals):
            pass
        else:
            for i in range(2, plies-2):
                if plies > 0:
                    j = i-1
                    if color == 'w':
                        if i % 2 == 0:
                            move_time = ((int(game['clocks'][i]) - int(game['clocks'][i+2]))/100+increment)
                            time_list.append(round(move_time, 2))
                            eval_swing_list.append(abs(evals[i]-evals[j]))
                            if 'opening' not in game.keys():
                                if int(game['clocks'][i+2]) > 0.2*int(game['clocks'][0]) and i//2+1:
                                    no_tp_eval_swing_list.append((abs(evals[i]-evals[j])))
                                    no_tp_time_usage_list.append(round(move_time, 2))
                                    moves_btp +=1
        ## Now we fill the data containers with per-move statistics: moves played before time pressure, time usage and evaluation swings.
        ## The separate 'no_tp' ("no time pressure") lists only keep moves made while more than 20% of the starting clock time remained.
                            else:
                                if int(game['clocks'][i+2]) > 0.2*int(game['clocks'][0]) and i//2+1 > game['opening']['ply']//2:
                                    no_tp_eval_swing_list.append((abs(evals[i]-evals[j])))
                                    no_tp_time_usage_list.append(round(move_time, 2))
                                    moves_btp +=1     
                    else:
                        if i % 2 == 1:
                            move_time = ((int(game['clocks'][i]) - int(game['clocks'][i+2]))/100+increment)
                            time_list.append(round(move_time, 2))
                            eval_swing_list.append(abs(evals[i]-evals[j]))
                            if 'opening' not in game.keys():
                                if int(game['clocks'][i+2]) > 0.2*int(game['clocks'][0]) and i//2+1:
                                    no_tp_eval_swing_list.append((abs(evals[i]-evals[j])))
                                    no_tp_time_usage_list.append(round(move_time, 2))
                                    moves_btp +=1
                            else:
                                if int(game['clocks'][i+2]) > 0.2*int(game['clocks'][0]) and i//2+1 > game['opening']['ply']//2:
                                    no_tp_eval_swing_list.append((abs(evals[i]-evals[j])))
                                    no_tp_time_usage_list.append(round(move_time, 2))
                                    moves_btp +=1
    
        no_tp_eval_swing_srs = pd.Series(no_tp_eval_swing_list)/100
        nrm_no_tp_time_usage_srs = pd.Series(no_tp_time_usage_list)/(game['clock']['initial']+increment*moves)*100
        opening_swing_list.append(round((sum(eval_swing_list[0:10])/1000), 2) if moves>11 else round((sum(eval_swing_list)/1000), 2))
        nrm_time_list_srs = pd.Series(time_list)/(game['clock']['initial']+increment*moves)*100
        eval_swing_srs = pd.Series(eval_swing_list)/100
        debut_depth_srs = pd.Series(debut_depth_list)
        moves_per_game_srs = pd.Series(moves_per_game)
        no_tp_time_usage_srs=pd.Series(no_tp_time_usage_list)
        moves_btp_list.append(moves_btp)
        ## In the code block above, we convert our Python lists to Pandas objects called Series.
        ## The reason is very simple - we can take statistical measures like mean and std from them in just one method.

        no_tp_eval_swing_means.append(round(no_tp_eval_swing_srs.mean(), 2))
        no_tp_eval_swing_stds.append(round(no_tp_eval_swing_srs.std(), 2))
        eval_swing_means.append(round(eval_swing_srs.mean(),2))
        eval_swing_stds.append(round(eval_swing_srs.std(),2))
        normalized_time_usage_means.append(round(nrm_time_list_srs.mean(),2))
        normalized_time_usage_stds.append(round(nrm_time_list_srs.std(),2))
        no_tp_time_usage_means.append(round(no_tp_time_usage_srs.mean(),2))
        no_tp_time_usage_stds.append(round(no_tp_time_usage_srs.std(),2))

        ## And now we do exactly that to populate our end data containers, which will be the analysis data sources.
    
    blunder_score_srs = pd.Series(blunder_score_list)
    no_tp_eval_swing_stds_pds = pd.Series(no_tp_eval_swing_stds)
    no_tp_eval_swing_means_pds = pd.Series(no_tp_eval_swing_means)
    opening_swing_list_pds = pd.Series(opening_swing_list)
    debut_depth_list_pds = pd.Series(debut_depth_list)
    eval_swing_means_pds = pd.Series(eval_swing_means)
    eval_swing_stds_pds = pd.Series(eval_swing_stds)
    normalized_time_usage_means_pds = pd.Series(normalized_time_usage_means)
    normalized_time_usage_stds_pds = pd.Series(normalized_time_usage_stds)
    moves_per_game_pds = pd.Series(moves_per_game)
    no_tp_time_usage_means_pds = pd.Series(no_tp_time_usage_means)
    no_tp_time_usage_stds_pds = pd.Series(no_tp_time_usage_stds)
    moves_btp_pds=pd.Series(moves_btp_list)
    ## All that's left to do is to convert these end data containers into Series themselves.



    result_type = '' ## Initialize the string that will hold the final MBTI type.
    mean//=game_cnt 
    for key in list(rating_thresholds.keys()):
        if mean < key:
            stage = rating_thresholds[key]
            break
    ## Determine the rating stage: divide the summed ratings by the number of games and pick the first stage whose upper bound exceeds the average.

    type_criterions = {
                0 : [no_tp_time_usage_stds_pds.mean(), no_tp_time_usage_means_pds.mean(),'IU'],
                1 : [no_tp_eval_swing_means_pds.mean(), moves_per_game_pds.quantile(q=0.75),'GH'],
                2 : [opening_swing_list_pds.mean(), debut_depth_list_pds.mean(),'TN'],
                3: [no_tp_eval_swing_stds_pds.mean(), blunder_score_srs.mean(), 'LV']
                }
    ## The dictionary above shows which criteria will be used to conduct the primary analysis and to determine the result MBTI.
    ## For U/I - Time usage STD/Time usage mean
    ## For G/H - Evaluation swing mean, 75% quantile of move amount
    ## For T/N - Evaluation swing in the opening/Length of opening theory mean
    ## For L/V - Evaluation swing STD/Blunder score mean
    

    for i in range(len(type_thresholds)):
        criteria_min_value = type_thresholds[i][stage][0][0]
        criteria_max_value = type_thresholds[i][stage][0][1]
        criteria_second_value = type_thresholds[i][stage][1]
        user_first_criteria_value = type_criterions[i][0]
        user_second_criteria_value = type_criterions[i][1]     
        if i not in [1,2]:
            if criteria_min_value >= user_first_criteria_value:
                result_type+=(type_criterions[i][2][0])
            elif criteria_max_value <= user_first_criteria_value:
                result_type+=type_criterions[i][2][1]
            else:
                if criteria_second_value > user_second_criteria_value:
                    result_type+=(type_criterions[i][2][0])
                else:
                    result_type+=(type_criterions[i][2][1])
        else:
            if criteria_min_value > user_first_criteria_value:
                result_type+=(type_criterions[i][2][0])
            elif criteria_max_value < user_first_criteria_value:
                result_type+=type_criterions[i][2][1]
            else:
                if criteria_second_value > user_second_criteria_value:
                    result_type+=(type_criterions[i][2][1])
                else:
                    result_type+=(type_criterions[i][2][0])
        ## Now we enter the type calculation process. The technique is as follows:
        ## Compare the user's value for the first criterion with the lower and the upper threshold.
        ## If the value is below the lower threshold or above the upper one, the letter is assigned immediately.
        ## If the value lies between them, compare the second criterion with its threshold to assign the letter.

    
    data=blunder_score_srs.mean(),opening_swing_list_pds.mean(),no_tp_eval_swing_stds_pds.mean(), no_tp_eval_swing_means_pds.mean(),debut_depth_list_pds.mean(),eval_swing_means_pds.mean(),eval_swing_stds_pds.mean(),normalized_time_usage_means_pds.mean(),normalized_time_usage_stds_pds.mean(),moves_per_game_pds.quantile(q=0.75),no_tp_time_usage_means_pds.mean(),no_tp_time_usage_stds_pds.mean(),moves_btp_pds.mean()
    X = pd.DataFrame(data=[data], columns=['Blunder scores','Eval swings (first 10 moves)','Eval swings std (middlegame)', 'Eval swings mean (middlegame)','Opening depth','Mean eval swings', 'Eval swing std', 'Mean time usage per move (%)', 'Time usage std per move (%)', 'Moves per game', 'Mean time usage per move (%, middlegame)', 'Time usage std per move (%, middlegame)', 'Moves before time pressure'])
    return X, result_type, stage
    ## Lastly, return the analyzed data along with the calculated MBTI and rating stage. They will be used later on.
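
### The letter assignment at the end of get_data can be read as a small stand-alone rule, restated in the sketch below.
### The function pick_letter and its flipped flag are hypothetical names introduced for illustration; the example values are the 'amateur' thresholds from type_thresholds.

```python
def pick_letter(first_value, second_value, band, second_threshold, letters, flipped=False):
    """Return one MBTI letter from a two-stage threshold check.

    band is the (lower, upper) pair for the first criterion, letters is a
    two-character string such as 'IU', and flipped mirrors axes 1 and 2.
    """
    low, high = band
    if flipped:
        # Axes 1 (G/H) and 2 (T/N): strict comparisons against the band.
        if first_value < low:
            return letters[0]
        if first_value > high:
            return letters[1]
        # Inside the band, a second value below its threshold gives the second letter.
        return letters[1] if second_value < second_threshold else letters[0]
    # Axes 0 (U/I) and 3 (L/V): values exactly on a boundary count as outside the band.
    if first_value <= low:
        return letters[0]
    if first_value >= high:
        return letters[1]
    # Inside the band, a second value below its threshold gives the first letter.
    return letters[0] if second_value < second_threshold else letters[1]

print(pick_letter(4.0, 3.0, (5.5, 7), 5, 'IU'))                   # I
print(pick_letter(6.0, 6.2, (5.5, 7), 5, 'IU'))                   # U
print(pick_letter(0.6, 38, (0.55, 0.7), 41, 'GH', flipped=True))  # H
```

### Calling the rule once per axis and joining the four letters gives the result type string.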

### To create a dataset for model training, we also write a second version of the data importing function.
### This version treats each game as a separate data entry to increase the amount of data.
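
### Each per-game entry is built from the same ingredients as before. As a worked example, the sketch below applies the blunder score formula from the comments of get_data to a single game and converts analysis entries to centipawns.
### The helpers blunder_score_for_game and to_centipawns are hypothetical names, and treating a negative mate count as -1500 is an assumption based on the "advantage or disadvantage" comment.

```python
MATE_VALUE = 1500  # centipawn stand-in for a forced mate, as in get_data

def blunder_score_for_game(inaccuracies, mistakes, blunders, moves):
    """(Inaccuracies * 0.5 + Mistakes * 1.5 + Blunders * 3) / number of moves."""
    points = 0.5 * inaccuracies + 1.5 * mistakes + 3 * blunders
    return round(points / moves, 2)

def to_centipawns(entry):
    """Return the eval of one analysis entry, replacing a mate score by +/-1500."""
    if isinstance(entry.get('eval'), int):
        return entry['eval']
    # Assumption: a negative mate count means the side to be mated is White.
    return MATE_VALUE if entry.get('mate', 0) >= 0 else -MATE_VALUE

# 4 inaccuracies, 2 mistakes and 1 blunder in a 40-move game: (2 + 3 + 3) / 40
print(blunder_score_for_game(4, 2, 1, 40))  # 0.2

analysis = [{'eval': 30}, {'eval': -120}, {'mate': -3}]
evals = [to_centipawns(entry) for entry in analysis]
swings = [abs(after - before) for before, after in zip(evals, evals[1:])]
print(evals, swings)  # [30, -120, -1500] [150, 1380]
```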

In [6]:
def get_data_2(nickname): 
    ## Now let's quickly look at the second version of the data collection function.
    ## It builds the training and testing datasets for the models, which will be shown later.
    ## In this function we will only look at the details that differ from the original function.
    raw_data = client.games.export_by_player(nickname, max=200,evals=True, clocks=True,opening=True, perf_type=['blitz','rapid'], analysed=True, tags=False)
    games = list(raw_data)
    if len(games)<50:
        return None ## Skip players with fewer than 50 analyzed games.
    X = pd.DataFrame(columns=['Blunder scores','Eval swings (first 10 moves)','Eval swings std (middlegame)', 'Eval swings mean (middlegame)','Opening depth','Mean eval swings', 'Eval swing std', 'Mean time usage per move (%)', 'Time usage std per move (%)', 'Moves per game', 'Mean time usage per move (%, middlegame)', 'Time usage std per move (%, middlegame)', 'Moves before time pressure', 'Type 1','Type 2','Type 3','Type 4'])
    
    for game in games:
        if 'opening' not in game.keys():
            continue
        else: 
            opening_swing_list = []
            blunder_score_list = []
            no_tp_time_usage_means = []
            no_tp_time_usage_stds = []
            no_tp_eval_swing_stds = []
            no_tp_eval_swing_means = []
            mean = 0
            game_cnt = 0
            stage = ''
            color = ''
            debut_depth_list = []
            eval_swing_means = []
            eval_swing_stds = []
            normalized_time_usage_means = []
            normalized_time_usage_stds = []
            opening_eval_swings = []
            moves_per_game = []
            moves_btp_list=[]
            moves_btp = 0
            no_tp_eval_swing_list=[]
            no_tp_time_usage_list = []
            game_cnt+=1
            increment = int(game['clock']['increment'])
            time_list = []
            eval_swing_list = []
            time_std = 0
            eval_swing_mean = 0
            eval_swing_std = 0
            plies = len(game['moves'].split(' '))
            moves = plies//2
            moves_per_game.append(moves)
            if 'opening' in game.keys():
                debut_depth_list.append(game['opening']['ply'])
            else:
                debut_depth_list.append(1)
                
            if 'user' in game['players']['white'].keys() and 'user' in game['players']['black'].keys():
                if game['players']['white']['user']['name'].lower() == nickname.lower():
                    mean += game['players']['white']['rating']
                    color = 'w'
                    blunder_points = game['players']['white']['analysis']['inaccuracy']/2 + game['players']['white']['analysis']['mistake']*1.5 + game['players']['white']['analysis']['blunder']*3
                    blunder_score = blunder_points/moves 
                    blunder_score_list.append(round(blunder_score,2))
                else:
                    mean += game['players']['black']['rating']
                    color = 'b'
                    blunder_points = game['players']['black']['analysis']['inaccuracy']/2 + game['players']['black']['analysis']['mistake']*1.5 + game['players']['black']['analysis']['blunder']*3
                    blunder_score = blunder_points/moves
                    blunder_score_list.append(round(blunder_score,2))
            else:
                pass
        
                
            for i in range(len(game['analysis'])):
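                if type(game['analysis'][i].get('eval')) != int:
                    mate_in = game['analysis'][i].get('mate', 0)
                    game['analysis'][i]['mate'] = 1500 if mate_in >= 0 else -1500 ## Keep the sign of the mate count.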
        
            evals = [i.get('eval') if type(i.get('eval')) == int else i.get('mate') for i in game['analysis']]
    
            if plies > len(evals):
                pass
            else:
                for i in range(2, plies-2):
                    if plies > 0:
                        j = i-1
                        if color == 'w':
                            if i % 2 == 0:
                                move_time = ((int(game['clocks'][i]) - int(game['clocks'][i+2]))/100+increment)
                                time_list.append(round(move_time, 2))
                                eval_swing_list.append(abs(evals[i]-evals[j]))
                                if 'opening' not in game.keys():
                                    if int(game['clocks'][i+2]) > 0.2*int(game['clocks'][0]) and i//2+1:
                                        no_tp_eval_swing_list.append((abs(evals[i]-evals[j])))
                                        no_tp_time_usage_list.append(round(move_time, 2))
                                        moves_btp +=1
                                else:
                                    if int(game['clocks'][i+2]) > 0.2*int(game['clocks'][0]) and i//2+1 > game['opening']['ply']//2:
                                        no_tp_eval_swing_list.append((abs(evals[i]-evals[j])))
                                        no_tp_time_usage_list.append(round(move_time, 2))
                                        moves_btp +=1     
                        else:
                            if i % 2 == 1:
                                move_time = ((int(game['clocks'][i]) - int(game['clocks'][i+2]))/100+increment)
                                time_list.append(round(move_time, 2))
                                eval_swing_list.append(abs(evals[i]-evals[j]))
                                if 'opening' not in game.keys():
                                    if int(game['clocks'][i+2]) > 0.2*int(game['clocks'][0]) and i//2+1:
                                        no_tp_eval_swing_list.append((abs(evals[i]-evals[j])))
                                        no_tp_time_usage_list.append(round(move_time, 2))
                                        moves_btp +=1
                                else:
                                    if int(game['clocks'][i+2]) > 0.2*int(game['clocks'][0]) and i//2+1 > game['opening']['ply']//2:
                                        no_tp_eval_swing_list.append((abs(evals[i]-evals[j])))
                                        no_tp_time_usage_list.append(round(move_time, 2))
                                        moves_btp +=1
        
            no_tp_eval_swing_srs = pd.Series(no_tp_eval_swing_list)/100
            nrm_no_tp_time_usage_srs = pd.Series(no_tp_time_usage_list)/(game['clock']['initial']+increment*moves)*100
            opening_swing_list.append(round((sum(eval_swing_list[0:10])/1000), 2) if moves>11 else round((sum(eval_swing_list)/1000), 2))
            nrm_time_list_srs = pd.Series(time_list)/(game['clock']['initial']+increment*moves)*100
            eval_swing_srs = pd.Series(eval_swing_list)/100
            debut_depth_srs = pd.Series(debut_depth_list)
            moves_per_game_srs = pd.Series(moves_per_game)
            no_tp_time_usage_srs=pd.Series(no_tp_time_usage_list)
            moves_btp_list.append(moves_btp)
        
            no_tp_eval_swing_means.append(round(no_tp_eval_swing_srs.mean(), 2))
            no_tp_eval_swing_stds.append(round(no_tp_eval_swing_srs.std(), 2))
            eval_swing_means.append(round(eval_swing_srs.mean(),2))
            eval_swing_stds.append(round(eval_swing_srs.std(),2))
            normalized_time_usage_means.append(round(nrm_time_list_srs.mean(),2))
            normalized_time_usage_stds.append(round(nrm_time_list_srs.std(),2))
            no_tp_time_usage_means.append(round(no_tp_time_usage_srs.mean(),2))
            no_tp_time_usage_stds.append(round(no_tp_time_usage_srs.std(),2))


            blunder_score_srs = pd.Series(blunder_score_list)
            no_tp_eval_swing_stds_pds = pd.Series(no_tp_eval_swing_stds)
            no_tp_eval_swing_means_pds = pd.Series(no_tp_eval_swing_means)
            opening_swing_list_pds = pd.Series(opening_swing_list)
            debut_depth_list_pds = pd.Series(debut_depth_list)
            eval_swing_means_pds = pd.Series(eval_swing_means)
            eval_swing_stds_pds = pd.Series(eval_swing_stds)
            normalized_time_usage_means_pds = pd.Series(normalized_time_usage_means)
            normalized_time_usage_stds_pds = pd.Series(normalized_time_usage_stds)
            moves_per_game_pds = pd.Series(moves_per_game)
            no_tp_time_usage_means_pds = pd.Series(no_tp_time_usage_means)
            no_tp_time_usage_stds_pds = pd.Series(no_tp_time_usage_stds)
            moves_btp_pds=pd.Series(moves_btp_list)

            ## As you might have noticed, in the first version the end data containers were created after all the games were analyzed.
            ## Here, since we want a data entry for every game, they are created separately for each game.
    
            result_type = ''
            for key in list(rating_thresholds.keys()):
                if mean < key:
                    stage = rating_thresholds[key]
                    break
        
            type_criterions = {
                            0 : [no_tp_time_usage_stds_pds.mean(), no_tp_time_usage_means_pds.mean(),'IU'], #Let the middle interval be bigger
                            1 : [no_tp_eval_swing_means_pds.mean(), moves/4*3,'GH'], #Think about implementing quantiles here
                            2 : [opening_swing_list_pds.mean(), debut_depth_list_pds.mean(),'TN'],
                            3: [no_tp_eval_swing_stds_pds.mean(), blunder_score_srs.mean(), 'LV']
                            }
                                
            for i in range(len(type_thresholds)):
                criteria_min_value = type_thresholds[i][stage][0][0]
                criteria_max_value = type_thresholds[i][stage][0][1]
                criteria_second_value = type_thresholds[i][stage][1]
                user_first_criteria_value = type_criterions[i][0]
                user_second_criteria_value = type_criterions[i][1]     
                if i not in [1,2]:
                    if criteria_min_value >= user_first_criteria_value:
                        result_type+=(type_criterions[i][2][0])
                    elif criteria_max_value <= user_first_criteria_value:
                        result_type+=type_criterions[i][2][1]
                    else:
                        if criteria_second_value > user_second_criteria_value:
                            result_type+=(type_criterions[i][2][0])
                        else:
                            result_type+=(type_criterions[i][2][1])
                else:
                    if criteria_min_value > user_first_criteria_value:
                        result_type+=(type_criterions[i][2][0])
                    elif criteria_max_value < user_first_criteria_value:
                        result_type+=type_criterions[i][2][1]
                    else:
                        if criteria_second_value > user_second_criteria_value:
                            result_type+=(type_criterions[i][2][1])
                        else:
                            result_type+=(type_criterions[i][2][0])
            ## Now we have a separate type calculated and the data extracted for each game.

            dataset=blunder_score,pd.Series(opening_swing_list).mean(),pd.Series(no_tp_eval_swing_list).std()/100, pd.Series(no_tp_eval_swing_list).mean()/100,game['opening']['ply'],pd.Series(eval_swing_list).mean()/100,pd.Series(eval_swing_list).std()/100,nrm_time_list_srs.mean(),nrm_time_list_srs.std(),moves/4*3,pd.Series(no_tp_time_usage_list).mean(),pd.Series(no_tp_time_usage_list).std(),moves_btp,result_type[0],result_type[1],result_type[2],result_type[3]
            data=pd.DataFrame(data=[dataset],columns=['Blunder scores','Eval swings (first 10 moves)','Eval swings std (middlegame)', 'Eval swings mean (middlegame)','Opening depth','Mean eval swings', 'Eval swing std', 'Mean time usage per move (%)', 'Time usage std per move (%)', 'Moves per game', 'Mean time usage per move (%, middlegame)', 'Time usage std per move (%, middlegame)', 'Moves before time pressure', 'Type 1','Type 2','Type 3','Type 4'])
            X=pd.concat([X, data], axis=0)
        
    return X
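
### Both functions separate moves made under time pressure from the rest. The sketch below isolates that step for one player.
### The helper move_times is a hypothetical name; unlike the loops above, it starts from the first move and ignores the opening cut-off, so treat it as a simplified illustration.

```python
def move_times(clocks, color, increment):
    """Return (seconds_spent, under_time_pressure) for each move of one player.

    clocks holds the interleaved clock readings in centiseconds, White first.
    """
    start = 0 if color == 'w' else 1
    limit = 0.2 * clocks[0]  # time pressure: 20% of the starting clock or less
    result = []
    for i in range(start, len(clocks) - 2, 2):
        # Clock difference between two consecutive readings of the same player,
        # converted to seconds, plus the increment that was added back.
        spent = (clocks[i] - clocks[i + 2]) / 100 + increment
        result.append((round(spent, 2), clocks[i + 2] <= limit))
    return result

clocks = [18000, 18000, 17500, 17800, 9000, 17000, 3000, 16000]
print(move_times(clocks, 'w', 0))  # [(5.0, False), (85.0, False), (60.0, True)]
```

### Only the moves flagged False would end up in the 'no_tp' lists.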

## Run the cell below to load the dataset!

In [7]:
data_df=pd.read_csv('df1.csv') ## Load the pre-built dataset.
data_df.drop(columns=['Unnamed: 0'], inplace=True)
data_df.dropna(inplace =True, axis=0) ## It's very important to clean your data before analysis. In this case, we do it by removing NaN entries.
data_df


|  | Blunder scores | Eval swings (first 10 moves) | Eval swings std (middlegame) | Eval swings mean (middlegame) | Opening depth | Mean eval swings | Eval swing std | Mean time usage per move (%) | Time usage std per move (%) | Moves per game | Mean time usage per move (%, middlegame) | Time usage std per move (%, middlegame) | Moves before time pressure | Type 1 | Type 2 | Type 3 | Type 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.309524 | 0.12 | 1.100428 | 0.463846 | 8 | 0.686250 | 1.720453 | 2.414489 | 3.943713 | 31.50 | 12.763077 | 15.616969 | 13 | U | H | T | V |
| 1 | 0.111111 | 0.10 | 0.149512 | 0.110833 | 11 | 0.326000 | 0.836770 | 3.749060 | 4.969120 | 20.25 | 10.433333 | 13.796841 | 12 | U | G | T | L |
| 2 | 0.142857 | 0.07 | 0.287418 | 0.219048 | 4 | 0.276750 | 0.341913 | 2.369697 | 4.046828 | 31.50 | 7.104762 | 9.371433 | 21 | U | G | T | L |
| 3 | 0.137500 | 0.09 | 0.373689 | 0.229412 | 7 | 0.220385 | 0.553202 | 1.270775 | 3.233416 | 60.00 | 17.790588 | 27.254869 | 17 | U | G | T | L |
| 4 | 0.288462 | 0.11 | 0.240599 | 0.230556 | 9 | 0.430000 | 0.887032 | 3.436062 | 3.414080 | 19.50 | 16.293333 | 14.066444 | 18 | U | G | T | L |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4385 | 0.437500 | 1.26 | 4.636724 | 1.739333 | 2 | 1.739333 | 4.636724 | 2.080556 | 1.577572 | 24.00 | 6.241667 | 4.732715 | 30 | I | H | N | V |
| 4386 | 0.250000 | 0.22 | 0.557696 | 0.532500 | 4 | 0.448000 | 0.518623 | 0.747333 | 0.541482 | 4.50 | 1.682500 | 1.196450 | 4 | I | G | T | L |
| 4390 | 0.190476 | 0.57 | 0.951872 | 0.829474 | 2 | 0.636750 | 0.910386 | 2.423333 | 2.602794 | 31.50 | 12.058947 | 8.792758 | 19 | U | H | N | L |
| 4392 | 0.789474 | 1.54 | 2.247417 | 1.300476 | 2 | 1.832500 | 2.592671 | 2.665000 | 2.397740 | 28.50 | 10.601905 | 7.087549 | 21 | U | H | N | V |
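
### As a small illustration of the cleaning step, here is a minimal sketch of what `dropna` does to rows with missing values.
### The `toy_df` frame and its values are made up for this example and are not taken from `df1.csv`.
### Note that the surviving rows keep their original index labels, which is probably why the row numbers in the table above skip from 4386 to 4390 to 4392.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has a missing blunder score.
toy_df = pd.DataFrame({
    'Blunder scores': [0.31, np.nan, 0.14],
    'Opening depth': [8, 11, 4],
})

# Drop every row that contains at least one NaN entry.
cleaned_df = toy_df.dropna(axis=0)

# The remaining rows keep their original index labels (no renumbering).
print(cleaned_df)
print(list(cleaned_df.index))
```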


## Now we are ready to begin the analysis!
### To do this, we separate our dataset into target labels (stored in the `classifiers` variable) and main data (the features).
### The target labels, in this case, are the MBTI letters, which depend on the main data.
### Next, we standardize the main data so that every feature is on a comparable scale (zero mean, unit variance).

In [8]:
classifiers = data_df[['Type 1', 'Type 2', 'Type 3', 'Type 4']] ## Extracting the MBTI letters as target variables to model for.
data_df.drop(columns=['Type 1', 'Type 2', 'Type 3', 'Type 4'], inplace=True)
data_df_scaled = StandardScaler().fit_transform(data_df) ## Standardize each feature column (zero mean, unit variance).
scaled_data = pd.DataFrame(data_df_scaled, columns=data_df.columns) ## Only the 13 feature columns remain in data_df at this point, so their names can be reused.
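
### To see what the scaling step does, here is a minimal sketch on two columns.
### The values are the first five rows of 'Opening depth' and 'Moves per game' from the table above; the names `toy_df`, `toy_scaler` and `toy_scaled` are only used in this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of two feature columns, copied from the table above.
toy_df = pd.DataFrame({
    'Opening depth': [8, 11, 4, 7, 9],
    'Moves per game': [31.5, 20.25, 31.5, 60.0, 19.5],
})

# Learn the mean and standard deviation of every column, then rescale.
toy_scaler = StandardScaler().fit(toy_df)
toy_scaled = pd.DataFrame(toy_scaler.transform(toy_df), columns=toy_df.columns)

# After scaling, each column has mean 0 and (population) standard deviation 1,
# even though the raw columns were on very different scales.
column_means = toy_scaled.mean()
column_stds = toy_scaled.std(ddof=0)
print(column_means.round(6))
print(column_stds.round(6))
```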


### In machine learning, the dependent data is usually marked as Y, and the main data is marked as X.
### That's exactly what we do here - scaled version of main data is X and the 4 different MBTI letters are marked Y1 to Y4.

In [9]:
X = scaled_data ## Scaled data without target variables will be used to conduct the analysis.
Y1 = classifiers['Type 1']
Y2 = classifiers['Type 2']
Y3 = classifiers['Type 3']
Y4 = classifiers['Type 4']



## Now we have everything necessary to form the **training sets** and **testing sets** for our models to train on.
### But what exactly are a training set and a testing set?
### In general, the training set is passed to the model so that it can learn the parameters that give the best prediction accuracy.
### Then the testing set, which the model has not seen during training, is used to check whether those parameters were well chosen.

In [10]:
X1_train, X1_test, Y1_train, Y1_test =train_test_split(X, Y1, test_size = 0.2, random_state = 42,stratify = Y1) 
X2_train, X2_test, Y2_train, Y2_test =train_test_split(X, Y2, test_size = 0.2, random_state = 42,stratify = Y2)
X3_train, X3_test, Y3_train, Y3_test =train_test_split(X, Y3, test_size = 0.2, random_state = 42,stratify = Y3)
X4_train, X4_test, Y4_train, Y4_test =train_test_split(X, Y4, test_size = 0.2, random_state = 42,stratify = Y4)
## Divide the data into 80/20 proportions for model training.
## 80% is the training set, 20% is the testing set.
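
### The `stratify` argument used above keeps the share of each letter the same in both sets.
### Here is a minimal sketch on made-up, imbalanced labels (70% 'U', 30% 'I'); all `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature, imbalanced labels.
toy_X = pd.DataFrame({'feature': range(100)})
toy_y = pd.Series(['U'] * 70 + ['I'] * 30)

# 80/20 split that preserves the U/I proportions in both parts.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(len(toy_X_train), len(toy_X_test))
print(toy_y_train.value_counts(normalize=True))
print(toy_y_test.value_counts(normalize=True))
```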

## Last step of model building - let's run the cell below to initialize and train our Random Forest models!

In [11]:
M1 = RandomForestClassifier(criterion='entropy', max_features=None, max_depth=7)
M1.fit(X1_train, Y1_train)
M2 = RandomForestClassifier(criterion='entropy', max_features=None, max_depth=None)
M2.fit(X2_train, Y2_train)
M3 = RandomForestClassifier(criterion='entropy', max_features=None, max_depth=None)
M3.fit(X3_train, Y3_train)
M4 = RandomForestClassifier(criterion='entropy', max_features=None, max_depth=None)
M4.fit(X4_train, Y4_train)
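
### Once a model is trained, the testing set can be used to check how well it generalizes.
### Below is a minimal sketch of that check on synthetic data from `make_classification`, not on the Chess MBTI dataset.
### The `toy_` names and the fixed `random_state` values are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem with 13 features.
toy_X, toy_y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Same kind of model as M1 above, with a fixed seed for reproducibility.
toy_model = RandomForestClassifier(
    criterion='entropy', max_features=None, max_depth=7, random_state=0
)
toy_model.fit(toy_X_train, toy_y_train)

# Compare predictions on unseen rows with the true labels.
toy_pred = toy_model.predict(toy_X_test)
toy_accuracy = accuracy_score(toy_y_test, toy_pred)
print(f'Held-out accuracy: {toy_accuracy:.3f}')
```

### The same two lines (`predict` followed by `accuracy_score`) could be applied to each of M1 to M4 with its own test set.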

## Alright, now we can finally move on to the best part - testing Chess MBTI on ourselves, or our friends if you will!
## Simply input your nickname in the cell below, run the next cells and see the results for yourself.
## Thank you in advance for sticking with me until the end, I appreciate it a lot!

In [12]:
nickname = str(input('Type the nickname: '))

Type the nickname:  RomKali


In [17]:
dataset, threshold_type, stage = get_data(nickname)
type_dict = result_dict

In [18]:
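Scaler = StandardScaler()
Scaler.fit(data_df) ## Fit on the same feature columns the models were trained on.
T_set = Scaler.transform(dataset) ## Scale the player's features with the training mean and standard deviation.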

In [19]:
result_type=''

for model in [M1, M2, M3, M4]: ## One model per MBTI letter, in order.
    result_type+=str(model.predict(T_set)[0])
print(f'Model prediction: {result_type}, threshold prediction: {threshold_type}')


Model prediction: UGTL, threshold prediction: UGTL
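
### The idea of building a four-letter type from four independent models can be sketched on made-up data.
### Everything here is hypothetical: the random features, the rule that assigns each letter, and the `toy_` names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy features for 40 players.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(40, 3))

# One binary model per letter pair; the label is derived from the sign
# of a single feature purely for illustration.
letter_pairs = [('U', 'I'), ('G', 'H'), ('T', 'N'), ('L', 'V')]
toy_models = []
for column, (first, second) in zip([0, 1, 2, 0], letter_pairs):
    toy_y = np.where(toy_X[:, column] > 0, first, second)
    toy_model = RandomForestClassifier(n_estimators=10, random_state=0)
    toy_models.append(toy_model.fit(toy_X, toy_y))

# Predict one letter per model for a single player and join them.
new_player = toy_X[:1]
toy_type = ''.join(str(m.predict(new_player)[0]) for m in toy_models)
print(toy_type)
```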


In [20]:
print('Test Results:\n')
## Double quotes are used for the dictionary keys so the f-strings also work on Python versions before 3.12.
print(f'Your chess MBTI is {result_type} ({type_dict["6"][result_type]})!\n')
print(f'Here is a short description of your chess MBTI:\n{type_dict["5"][result_type]}\n')
print(f'You should definitely look up games of {type_dict["0"][result_type][0]} and {type_dict["0"][result_type][1]}. Seems like you guys have a lot in common! :)\n')
if stage in ['beginner', 'amateur']:
    print(f'Friendly advice: take up {type_dict["1"][result_type][0][0]} and {type_dict["1"][result_type][0][1]}, gain some free Elo and thank me later!\n')
else:
    print(f'Friendly advice: take up {type_dict["1"][result_type][1][0]} and {type_dict["1"][result_type][1][1]}, gain some free Elo and thank me later!\n')
print(f'Being a {result_type}, your competitive advantages are probably: \n1. {type_dict["2"][result_type][0]} \n2. {type_dict["2"][result_type][1]} \nCherish these traits and capitalize on them!\n')
print(f'On the other hand, be on a look out for:\n1. {type_dict["3"][result_type][0]} \n2. {type_dict["3"][result_type][1]}\nWe can not all be flawless like Magnus. Practice and minimize your disadvantages!\n')
print(f'Main goal of Chess MBTI project is to connect chess players with opponents of MBTIs that click and make chess even more fun and enjoyable!\nSo, being a {result_type}, if your date happens to be a {type_dict["4"][result_type]} - buy them a drink and friend them on lichess right away!')

Test Results:

Your chess MBTI is UGTL (Engineer)!

Here is a short description of your chess MBTI:
You are the Engineer — precise, principled, and methodical.
You favor clear, correct play and navigate the board like a well-designed system, valuing structure and incremental advantage. 
Your calculation is sharp, but always purposeful, and you are happy to outlast your opponent move by move. 
You thrive in openings you have mastered and positions where logic wins the day.

You should definitely look up games of Anish Giri and Anatoly Karpov. Seems like you guys have a lot in common! :)

Friendly advice: take up Exchange QGD/Catalan and Berlin Defense, gain some free Elo and thank me later!

Being a UGTL, your competitive advantages are probably: 
1. Highly accurate and strategic in opening and middlegame transitions, often guiding the game into technical waters. 
2. Comfortable grinding out long endgames with solid positional plans. 
Cherish these traits and capitalize on them!

On the