# Halo 5 Match Prediction Model

After many hours playing the Super Fiesta Party playlist, it seemed clear to me that we were being pitted against players with a wide range of skill. Sometimes we would be beaten so badly that we couldn't help but laugh, despite having played just fine for the past few games.

I wanted to know for certain whether games were, in a sense, predetermined in either direction. Of course, this would never be the intention of a multiplayer matchmaking system. In theory, a perfect matchmaking system would always produce a 50/50 matchup. Using what we learned in the first section about data from the API, let's see how close to 50/50 matchmaking really is.

# Imports

We'll start by importing the same packages as our EDA notebook along with an extensive set of scikit-learn tools.

In [1]:
#Standard Packages
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import pickle
import warnings
warnings.filterwarnings(action='ignore') 

# Packages used for API calls and data processing
import requests
import json
def get_keys(path):
    with open(path) as f:
        return json.load(f)
import ast
import time
import http.client, urllib.request, urllib.parse, urllib.error, base64
api_key = 'ceeaacb7cf024c7485e00ef8457e42dc'
gamertag = 'Drymander'
from tqdm import tqdm
# !pip install isodate
import isodate

In [2]:
# Preprocessing tools
from sklearn.model_selection import (train_test_split, cross_val_predict, cross_validate,
                                     cross_val_score, GridSearchCV)
from sklearn.preprocessing import (MinMaxScaler, StandardScaler, OneHotEncoder, minmax_scale,
                                   MaxAbsScaler, RobustScaler, Normalizer,
                                   QuantileTransformer, PowerTransformer)
scaler = StandardScaler()

# Models & Utilities
from sklearn import metrics, svm
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# Note: plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             precision_score, recall_score, f1_score)


# Intro

When I first started this project, I tried running models using matches from my personal history. However, there was one major issue with that choice. In order to get the features for our model, we need to pull information that is only available from the API in a 'total lifetime' format.

This means that if I played with a player **10 months ago**, I would only be able to pull features reflecting that player's stats **today**. In other words, only the past ~3 weeks or so of matches could be considered passable as quality data, where every player's timeline is in sync with when the match was played.

Not satisfied with a poor-quality model whose cross-validation scores were widely scattered, I decided to go about it in a different way.

From my earlier data collection, I was able to amass a list of unique gamertags that I have played with over time. We will build a process to pull each of those players' 25 most recent games, put each game into one line of the dataframe, and limit the date range of those matches so that all data will be properly synced.
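
The date-range restriction can be sketched as follows. This is only an illustration of the idea, not the notebook's code: `filter_synced_matches`, its `max_age_days` parameter, and the three-week default are assumptions based on the "~3 weeks" estimate above, and the two matches are made up.

```python
import pandas as pd

def filter_synced_matches(df, pulled_at, max_age_days=21):
    """Keep matches played within max_age_days before the stats were pulled.

    Hypothetical helper: lifetime stats are a snapshot taken at `pulled_at`,
    so only matches close to that moment line up with the features.
    """
    cutoff = pulled_at - pd.Timedelta(days=max_age_days)
    return df[(df['Date'] >= cutoff) & (df['Date'] <= pulled_at)]

# Made-up matches: one from months ago, one from the week before the pull
matches = pd.DataFrame({
    'MatchId': ['old', 'recent'],
    'Date': pd.to_datetime(['2020-10-01', '2021-07-20']),
})
synced = filter_synced_matches(matches, pd.Timestamp('2021-07-29'))
print(synced['MatchId'].tolist())
```

Only the match inside the window survives, so the lifetime stats attached to it are at most a few weeks out of date.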

# Functions

To pull this off, we'll need to chain together some functions.

## Gamertag for API

This is a simple function to prepare anyone's gamertag from how it would normally appear to how it needs to be formatted for the API.

In [3]:
# Prepare gamertag for API
def gamertag_for_api(gamertag):
    
    # Replace spaces with '+'
    gamertag = gamertag.replace(' ','+')
    return gamertag

# Testing the function
gamertag_for_api('this is a test')

'this+is+a+test'
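
As a possible alternative, the standard library's `urllib.parse.quote_plus` performs the same space-to-plus replacement and also percent-encodes other reserved characters. The notebook does not say whether the Halo API needs anything beyond space replacement, so treat this as a sketch; `gamertag_for_api_quoted` is a hypothetical name.

```python
from urllib.parse import quote_plus

def gamertag_for_api_quoted(gamertag):
    # quote_plus turns spaces into '+' and percent-encodes other reserved characters
    return quote_plus(gamertag)

print(gamertag_for_api_quoted('this is a test'))
```

For gamertags made of letters, digits, and spaces, the result matches `gamertag_for_api`.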

## Pull Recent Match

This pulls the most recent match information for any given player.

It uses two API calls.  The first will give us the match ID and date of the most recent game any player has played by specifying their gamertag.  The second calls the match results API, which gives us the gamertags of all players in the match as well as information on winner / loser / tie etc.  It will also give us Spartan Rank and Total XP.

In [4]:
# Function to pull most recent match stats into JSON format
# Uses two separate API calls, one from player history and another from match details
def pull_recent_match(how_recent, api_key=api_key, explore=False, gamertag='Drymander'):
    
    # Use gamertag_for_api function to remove any spaces
    gamertag = gamertag_for_api(gamertag)
    headers = {
        # Request headers
        'Ocp-Apim-Subscription-Key': api_key,
    }
    # Pulls from arena mode, how_recent is how far to go back in the match history
    # 'count' refers to the number of matches to pull
    params = urllib.parse.urlencode({
        # Request parameters
        'modes': 'arena',
        'start': how_recent,
        'count': 1,
        'include-times': True,
    })
    
    # Try this, otherwise return error message
    try:
        
        # Connect to API and pull most recent match for specified gamer
        conn = http.client.HTTPSConnection('www.haloapi.com')
        conn.request("GET", f"/stats/h5/players/{gamertag}/matches?%s" % params, "{body}", headers)
        response = conn.getresponse()
        latest_match = json.loads(response.read())
        
        # Identify match ID and match date
        match_id = latest_match['Results'][0]['Id']['MatchId']
        match_date = latest_match['Results'][0]['MatchCompletedDate']['ISO8601Date']
        
        # Rest for 1.01 seconds to not get blocked by API
        time.sleep(1.01)
        
        # Using match_id, pull details from match
        conn.request("GET", f"/stats/h5/arena/matches/{match_id}?%s" % params, "{body}", headers)
        response = conn.getresponse()
        data = response.read()
        
        # Option to print the raw byte string for alternative viewing
        if explore:
            print(data)
            match_results = None
        else:
            # Append match ID and date from player history API call
            match_results = json.loads(data)
            match_results['MatchId'] = match_id
            match_results['Date'] = match_date
        conn.close()
    
    # Print error if issue with calling API
    except Exception as e:
        print(f"Error calling API: {e}")
        match_results = None
    
    # Return match results parsed from JSON (None when exploring or on error)
    return match_results

# Show result
match_results = pull_recent_match(0, explore=False, gamertag='Drymander')
# match_results
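
The fixed `time.sleep(1.01)` pauses could also be centralized in a small rate limiter. The sketch below is not part of the notebook: `MinIntervalLimiter` is a hypothetical helper, and the interval is shortened here only so the demonstration runs quickly.

```python
import time

class MinIntervalLimiter:
    """Hypothetical helper that spaces calls at least `interval` seconds apart."""

    def __init__(self, interval=1.01):
        self.interval = interval
        self._last_call = None

    def wait(self):
        # Sleep only for the part of the interval that has not already elapsed
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

# Demonstration with a short interval; an API request would follow each wait()
limiter = MinIntervalLimiter(interval=0.05)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
elapsed = time.monotonic() - start
print(f'{elapsed:.2f} seconds for three spaced calls')
```

Because the limiter measures time since the previous call, time already spent parsing a response counts toward the pause instead of being added on top of it.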

## Build Base Dataframe

Now that we have our match results JSON for the most recent match, we'll build a base dataframe similar to what we built in the EDA.

In [5]:
# Function to build the base dataframe for a single match
# Designed to take in the JSON provided by the pull_recent_match function
def build_base_dataframe(match_results, gamertag):
    
    # Collect one dictionary of stats per player
    rows = []
    for player in match_results['PlayerStats']:
        rows.append({
            # Whether the player left before the match finished (did not finish)
            'DNF': player['DNF'],
            # Team ID
            'TeamId': player['TeamId'],
            # Gamertag
            'Gamertag': player['Player']['Gamertag'],
            # Spartan Rank
            'SpartanRank': player['XpInfo']['SpartanRank'],
            # Previous Total XP
            'PrevTotalXP': player['XpInfo']['PrevTotalXP'],
        })
    
    # Build the base match dataframe (TeamColor is derived from TeamId below)
    columns = ['DNF', 'TeamId', 'Gamertag', 'SpartanRank', 'PrevTotalXP']
    df = pd.DataFrame(rows, columns=columns)
    
    ########## DATE, GAME VARIANT, MAP ID, MATCH ID, PLAYLIST ID ##########
    df['Date'] = match_results['Date']
    df['Date'] = pd.to_datetime(df['Date']).dt.tz_convert(None)
#     df['Date'] = df['Date'].floor('T')
    df['MatchId'] = match_results['MatchId']
    df['GameBaseVariantId'] = match_results['GameBaseVariantId']
    df['MapVariantId'] = match_results['MapVariantId']
    df['PlaylistId'] = match_results['PlaylistId']
    
    ########## DEFINE PLAYER TEAM ##########
    playerteam = df.loc[df['Gamertag'] == gamertag, 'TeamId'].values[0]
    if playerteam == 0:
        enemyteam = 1   
    else:
        enemyteam = 0
        
    df['PlayerTeam'] = df['TeamId'].map({playerteam:'Player', enemyteam:'Enemy'})
    
    if match_results['TeamStats'][0]['TeamId'] == playerteam:
        playerteam_stats = match_results['TeamStats'][0]
        enemyteam_stats = match_results['TeamStats'][1]
    else: 
        playerteam_stats = match_results['TeamStats'][1]
        enemyteam_stats = match_results['TeamStats'][0]
    
    ########## DETERMINE WINNER ##########
    # Tie
    if playerteam_stats['Rank'] == 1 and enemyteam_stats['Rank'] == 1:
        df['Winner'] = 'Tie'
    # Player wins
    elif playerteam_stats['Rank'] == 1 and enemyteam_stats['Rank'] == 2:
        df['Winner'] = df['TeamId'].map({playerteam:'Victory', enemyteam:'Defeat'})
    # Enemy wins
    elif playerteam_stats['Rank'] == 2 and enemyteam_stats['Rank'] == 1:
        df['Winner'] = df['TeamId'].map({enemyteam:'Victory', playerteam:'Defeat'})
    # Error handling: flag the match so the 'Winner' column always exists
    else:
        df['Winner'] = 'Error determining winner'
    
    ########## TEAM COLOR ##########
    df['TeamColor'] = df['TeamId'].map({0:'Red', 1:'Blue'})
    
    # Set columns
    df = df[['Date', 'MatchId', 'GameBaseVariantId', 'PlaylistId', 'MapVariantId', 'DNF',
             'TeamId', 'PlayerTeam', 'Winner', 'TeamColor', 
             'Gamertag', 'SpartanRank', 'PrevTotalXP',
            ]]
    # Sort match by winning team
    df = df.sort_values(by=['Winner'], ascending=False)
    
    return df

df = build_base_dataframe(pull_recent_match(8), 'Drymander')

df.head()

| | Date | MatchId | GameBaseVariantId | PlaylistId | MapVariantId | DNF | TeamId | PlayerTeam | Winner | TeamColor | Gamertag | SpartanRank | PrevTotalXP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2021-07-29 01:41:02.570 | 31bfdcf5-3dd4-441d-b260-9a8637d721e6 | a2949322-dc84-45ab-8454-cf94fb28c189 | f0c9ef9a-48bd-4b24-9db3-2c76b4e23450 | e2c25cc8-8f51-44ba-bcde-ff08993b01c8 | 0.0 | 1.0 | Enemy | Victory | Blue | Doomnwo | 149 | 23749271 |
| 5 | 2021-07-29 01:41:02.570 | 31bfdcf5-3dd4-441d-b260-9a8637d721e6 | a2949322-dc84-45ab-8454-cf94fb28c189 | f0c9ef9a-48bd-4b24-9db3-2c76b4e23450 | e2c25cc8-8f51-44ba-bcde-ff08993b01c8 | 0.0 | 1.0 | Enemy | Victory | Blue | JoelODST117 | 149 | 23720290 |
| 6 | 2021-07-29 01:41:02.570 | 31bfdcf5-3dd4-441d-b260-9a8637d721e6 | a2949322-dc84-45ab-8454-cf94fb28c189 | f0c9ef9a-48bd-4b24-9db3-2c76b4e23450 | e2c25cc8-8f51-44ba-bcde-ff08993b01c8 | 0.0 | 1.0 | Enemy | Victory | Blue | Mx J3NY Mx | 146 | 9194326 |
| 7 | 2021-07-29 01:41:02.570 | 31bfdcf5-3dd4-441d-b260-9a8637d721e6 | a2949322-dc84-45ab-8454-cf94fb28c189 | f0c9ef9a-48bd-4b24-9db3-2c76b4e23450 | e2c25cc8-8f51-44ba-bcde-ff08993b01c8 | 0.0 | 1.0 | Enemy | Victory | Blue | KarryDahZX | 149 | 18940553 |
| 0 | 2021-07-29 01:41:02.570 | 31bfdcf5-3dd4-441d-b260-9a8637d721e6 | a2949322-dc84-45ab-8454-cf94fb28c189 | f0c9ef9a-48bd-4b24-9db3-2c76b4e23450 | e2c25cc8-8f51-44ba-bcde-ff08993b01c8 | 0.0 | 0.0 | Player | Defeat | Red | Drymander | 148 | 15735444 |


## Get Player List

Now that we have our base dataframe, we'll want to use the gamertags in the match to get more extensive player information.  First, we'll need to prepare their gamertags for the API, similar to how we did it for our first function.

In [6]:
# Function to combine all gamertags from the match and prepare them in string
# format for the next API call
def get_player_list(df):
    
    # Replace spaces with '+' in each gamertag and join them with commas
    player_list = ','.join(tag.replace(' ', '+') for tag in df['Gamertag'])
    
    # Return in one full string
    return player_list

get_player_list(df)

'Doomnwo,JoelODST117,Mx+J3NY+Mx,KarryDahZX,Drymander,ToweringAsh97,Andy2542,JohnyUL'

## Get Player History

With the gamertags prepared in one string, we'll call the Player Service Records - Arena API, which returns a JSON record for each player detailing their aggregate stats for every variety of Arena game type (Slayer, Capture the Flag, Oddball, Strongholds, etc.). This includes information like total wins, total losses, and total kills / assists / deaths, all specific to each game type.

While we'll never know for certain, the theory behind compiling the model dataframe by game type is that the features would be more representative of skill and experience in that game type.

To further elaborate, we'll be using games exclusively played in Super Fiesta Party playlist, which respawns players with randomized weapons after every death.  Even if a player is very skilled at Halo, if they have never played this variety of gametype before, they will not likely fare as well as they do in more traditional Halo game types (at least for their first few games).  Thus, total stats for a player's extended Halo history would not be representative of their skill in the Super Fiesta Party playlist.
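
To make the point concrete, here is a small sketch using a mocked service record. The nesting (`Result` → `ArenaStats` → `ArenaGameBaseVariantStats`) follows the keys used in this notebook, but the IDs and numbers are made up, and `stats_for_variant` and `win_fraction` are hypothetical helpers.

```python
# Mock of one player's service record, trimmed to the fields used here.
# The IDs and numbers are made up for illustration.
player_record = {
    'Id': 'ExamplePlayer',
    'Result': {'ArenaStats': {'ArenaGameBaseVariantStats': [
        {'GameBaseVariantId': 'variant-a', 'TotalGamesWon': 300, 'TotalGamesLost': 100},
        {'GameBaseVariantId': 'variant-b', 'TotalGamesWon': 2, 'TotalGamesLost': 8},
    ]}},
}

def stats_for_variant(record, variant_id):
    """Return the stats block for one game base variant, or None if absent."""
    variants = record['Result']['ArenaStats']['ArenaGameBaseVariantStats']
    return next((v for v in variants if v['GameBaseVariantId'] == variant_id), None)

def win_fraction(stats):
    """Wins divided by decided games (ties ignored in this sketch)."""
    played = stats['TotalGamesWon'] + stats['TotalGamesLost']
    return stats['TotalGamesWon'] / played

veteran_mode = win_fraction(stats_for_variant(player_record, 'variant-a'))
new_mode = win_fraction(stats_for_variant(player_record, 'variant-b'))
print(veteran_mode, new_mode)
```

The same player looks strong in one variant and weak in the other, which is why the features are taken from the variant that matches the game being modeled.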

We'll compile each player's service info in a list.

In [None]:
# Function to pull more informative information about each player in the match
# This information is not available in the two previous API calls
def get_player_history(df, readable=False):
    headers = {
        # Request headers
        'Ocp-Apim-Subscription-Key': str(api_key),
    }
    params = urllib.parse.urlencode({
    })
    # Use our function in the block above to prepare the gamertags for the API
    player_list_api = get_player_list(df)
    
    # Try calling service records API using our player list
    try:
        conn = http.client.HTTPSConnection('www.haloapi.com')
        conn.request("GET", f"/stats/h5/servicerecords/arena?players={player_list_api}&%s" % params, "{body}", headers)
        response = conn.getresponse()
        data = response.read()
        player_history = json.loads(data)
        conn.close()
    
    # Print error if issue with API and return nothing
    except Exception as e:
        print(f"Error calling API: {e}")
        return None
    
    # Option to return the raw byte string instead of the parsed JSON
    if readable:
        return data
    return player_history

# Show result
player_history = get_player_history(df)
# player_history

## Build History Dataframe

Now that we have our player service records for the single match, we'll extend our base dataframe by building a new, more detailed 'variant' dataframe that can be merged with the base dataframe.

In [None]:
# Function to build secondary dataframe with more informative player stats
def build_history_dataframe(player_history, variant_id, streamlit=False):
    
    # Option to view 'streamlit' dataframe, which includes pertinent
    # information but excludes all stats for modeling
    if streamlit == True:
        vdf_columns = ['Gamertag','TotalTimePlayed','K/D','Accuracy','WinRate']
    else:
        stat_list = ['Gamertag', 'TotalKills', 'TotalHeadshots', 'TotalWeaponDamage', 'TotalShotsFired',
                    'TotalShotsLanded', 'TotalMeleeKills', 'TotalMeleeDamage', 'TotalAssassinations',
                    'TotalGroundPoundKills', 'TotalGroundPoundDamage', 'TotalShoulderBashKills',
                    'TotalShoulderBashDamage', 'TotalGrenadeDamage', 'TotalPowerWeaponKills',
                    'TotalPowerWeaponDamage', 'TotalPowerWeaponGrabs', 'TotalPowerWeaponPossessionTime',
                    'TotalDeaths', 'TotalAssists', 'TotalGamesCompleted', 'TotalGamesWon',
                    'TotalGamesLost', 'TotalGamesTied', 'TotalTimePlayed','TotalGrenadeKills']
        vdf_columns = stat_list
    
    # Collect one dictionary of stats per player
    rows = []
    
    # Loop that goes through each player in the player history JSON
    for player in player_history['Results']:
        
        # Loop that goes through each Arena Game Base Variant and locates
        # the details specific to the game base variant of the match
        variant_stats = None
        for variant in player['Result']['ArenaStats']['ArenaGameBaseVariantStats']:
            if variant['GameBaseVariantId'] == variant_id:
                variant_stats = variant
                break
        
        # Skip players with no record for this variant so that stats from
        # the previous player are never reused
        if variant_stats is None:
            continue
        
        # Create dictionary where stats will be added
        variant_dic = {'Gamertag': player['Id']}
        
        # Parse the ISO duration of total time played into hours
        hours_played = isodate.parse_duration(variant_stats['TotalTimePlayed']).total_seconds() / 3600
        
        # Streamlit option - calculates specific features
        if streamlit == True:
            variant_dic['TotalTimePlayed'] = hours_played
            variant_dic['K/D'] = variant_stats['TotalKills'] / variant_stats['TotalDeaths']
            variant_dic['Accuracy'] = variant_stats['TotalShotsLanded'] / variant_stats['TotalShotsFired']
            # Note: this is wins divided by losses, not wins over games played
            variant_dic['WinRate'] = variant_stats['TotalGamesWon'] / variant_stats['TotalGamesLost']
        
        # Modeling option - includes all features but does not yet calculate
        else:
            # Loop that adds all stats to variant dic
            for stat in stat_list[1:]:
                variant_dic[stat] = variant_stats[stat]
            variant_dic['TotalTimePlayed'] = hours_played
        
        rows.append(variant_dic)
    
    # Return the streamlit or modeling dataframe
    return pd.DataFrame(rows, columns=vdf_columns)
    
build_history_dataframe(player_history, '1571fdac-e0b4-4ebc-a73a-6e13001b71d3', streamlit=False)

## Recent Match Stats

This function chains together all of the previous functions up to this point.  It returns a full dataframe for a single match, which could then be converted into a single row for our model.

In [None]:
# Function that combines all functions above to go through each step to
# Get the match dataframe
def recent_match_stats(gamertag, back_count=0):
    
    # Pull the match result as JSON from API
    match_results = pull_recent_match(back_count, explore=False, gamertag=gamertag)
    
    # Build the base dataframe
    base_df = build_base_dataframe(match_results, gamertag=gamertag)
    
    # Sleep for 1.01 seconds to avoid issues with API
    time.sleep(1.01)
    
    # Create playerlist for player history API call
    player_list = get_player_list(base_df)
    
    # Call API to get player history JSON
    player_history = get_player_history(base_df)
    
    # Build base player stats dataframe based on player history API call
    history_df = build_history_dataframe(player_history, match_results['GameBaseVariantId'])
    
    # Merge the base dataframe and stats dataframe
    full_stats_df = pd.merge(base_df, history_df, how='inner', on = 'Gamertag')
    
    return full_stats_df

# Show full dataframe for match
df = recent_match_stats('Drymander', back_count=0)
df

## Convert Match Dataframe into Single Row

Now we'll write a function to flatten the dataframe into a single row, which will be required for modeling.

In [None]:
# Function to convert the full match dataframe into a single Pandas row for modeling
def one_row(df, for_model=False):
    
    # If statement that rules out matches that will present issues for the model
    # We want to make sure that exactly 8 players finished the match and that
    # no player exited the game before it was over
    if ((for_model==True) and ((len(df.index) != 8) or (1 in df['DNF'].values))):
        # Returns an empty dataframe that will be appended to the modeling dataset,
        # effectively denoting that the match will not be usable for the model
        df = pd.DataFrame()
    
    # If the match meets the modeling criteria:
    else:
        # Sort by PlayerTeam (captures player team stats first)
        # Sort by TotalTimePlayed
        df = df.sort_values(by=['PlayerTeam', 'TotalTimePlayed'], ascending=(False, False))
        
        # Isolate portion of the dataframe for creating information we need
        df = df.reset_index()
        df_row = df.iloc[0:1,1:6]
        
        # Determine whether player won, lost, or tied the match
        df_player = df.loc[df['PlayerTeam'] == 'Player']
        if df_player['Winner'].str.contains('Victory').any():
            df_row['WinLoseTie'] = 'Victory'
        elif df_player['Winner'].str.contains('Defeat').any():
            df_row['WinLoseTie'] = 'Defeat'
        elif df_player['Winner'].str.contains('Tie').any():
            df_row['WinLoseTie'] = 'Tie'
        else: 
            df_row['WinLoseTie'] = 'Error Determining Victor'
        
        # 'Flatten' the match dataframe so that each player stat can
        # be represented in one line of data
        column_list = df.columns.to_list()
        columns_converted = []
        df = df.drop(df.iloc[:, 0:11], axis = 1)
        df = df.stack().to_frame().T
        df.columns = ['{}_{}'.format(*c) for c in df.columns]
        
        # Dictionary to convert index prefixes to P1-4 (Player 1-4) and E1-4 (Enemy 1-4)
        column_convert_dic = {"0_":"P1-", "1_":"P2-","2_":"P3-","3_":"P4-",
                              "4_":"E1-","5_":"E2-","6_":"E3-","7_":"E4-",}

        # Use dictionary to set column names
        for k, v in column_convert_dic.items():
            df.columns = df.columns.str.replace(k, v)
        df.columns = df.columns.str.replace('-', '_')
        df = df_row.join(df, how='outer')
    
    # Return match dataframe as one row
    return df

# Test function
one_row(df, for_model=False)
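
The flattening idea is easier to see on a toy example. The sketch below uses a plain loop instead of the `stack()` call in `one_row`, purely for readability; the two-player dataframe and its values are made up.

```python
import pandas as pd

# Toy match with two players and two stats (values are made up)
toy = pd.DataFrame({
    'Gamertag': ['Alpha', 'Bravo'],
    'TotalKills': [120, 95],
})
labels = ['P1', 'E1']  # one label per row, in sorted order

# Prefix every column with the player's label and collect it all in one dict
flat = {}
for label, (_, player) in zip(labels, toy.iterrows()):
    for column, value in player.items():
        flat[f'{label}_{column}'] = value

toy_row = pd.DataFrame([flat])
print(toy_row.columns.tolist())
```

Each player's stats become their own set of prefixed columns, so a full 8-player match turns into one wide row.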

## Load Unique Gamertag List

We're almost ready to start pulling rows into the dataframe.  We'll use this list exported from our EDA, which includes 24,248 unique players that I have personally played with over the past year.

In [None]:
# Load unique gamertags pickle file from EDA notebook

with open('data/unique_gamertags.pkl', 'rb') as unique_gamertags_pickle:
    unique_gamertags = pickle.load(unique_gamertags_pickle)

# See how many unique gamertags that the player has played with
len(unique_gamertags)

## Modified Pull Matches Function

We're going to modify our Pull Recent Match function to pull 25 matches per player from our set of unique gamertags.  We're pulling 25 for a couple of reasons:

- The API allows up to 25 match IDs to be pulled at once
- Many of these matches will not qualify for our model due to players leaving in the middle of the match
- Many games will be of different game types that might not be relevant to the game type that we'll be modeling 
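
The last two points amount to a filter over the pulled matches. A rough sketch of such a filter is below; `is_usable_match` is a hypothetical helper, the playlist IDs are placeholders, and the `PlaylistId`, `PlayerStats`, and `DNF` keys are the ones used elsewhere in this notebook.

```python
def is_usable_match(match, playlist_id, team_size=4):
    """Hypothetical filter: right playlist, full lobby, and nobody quit."""
    players = match['PlayerStats']
    return (
        match['PlaylistId'] == playlist_id
        and len(players) == 2 * team_size
        and not any(p['DNF'] for p in players)
    )

# Made-up matches: a clean one, one with a quitter, one from another playlist
clean = {'PlaylistId': 'fiesta', 'PlayerStats': [{'DNF': False}] * 8}
quitter = {'PlaylistId': 'fiesta', 'PlayerStats': [{'DNF': False}] * 7 + [{'DNF': True}]}
other = {'PlaylistId': 'slayer', 'PlayerStats': [{'DNF': False}] * 8}

usable = [m for m in (clean, quitter, other) if is_usable_match(m, 'fiesta')]
print(len(usable))
```

With filters like these discarding a share of every pull, requesting the full 25 matches per player leaves more usable rows.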

In [None]:
# Function that slightly modifies the pull_recent_match function
# Designed to pull 25 matches from each gamertag for modeling
def model_pull_matches(how_recent, api_key=api_key, 
                       gamertag='Drymander', count=25):
    
    # Use gamertag_for_api function to remove any spaces
    gamertag = gamertag_for_api(gamertag)
    # Set API key
    headers = {
        # Request headers
        'Ocp-Apim-Subscription-Key': api_key,
    }
    
    # Pulls from arena mode, how_recent is how far to go back in the match history
    # 'count' refers to the number of matches to pull
    params = urllib.parse.urlencode({
        # Request parameters
        'modes': 'arena',
        'start': how_recent,
        'count': count,
        'include-times': True,
    })
    
    # Empty list to append 25 matches, created before the try block so the
    # function can still return it if the API call fails
    latest_count_matches = []
    
    # Try / except for error handling
    try:
        
        # Connect to API and pull most recent 25 matches for specified gamer
        # and format into JSON
        conn = http.client.HTTPSConnection('www.haloapi.com')
        conn.request("GET", f"/stats/h5/players/{gamertag}/matches?%s" % params, "{body}", headers)
        time.sleep(1.01)
        response = conn.getresponse()
        data = response.read()
        history_pull = json.loads(data)
        print(history_pull['ResultCount'])
        
        # Counter variable for printing status
        i = 0
        
        # Loop to go through each of the 25 matches to pull match details for each
        for match in history_pull['Results']:
            
            # Identify match ID and match date
            match_id = match['Id']['MatchId']
            match_date = match['MatchCompletedDate']['ISO8601Date']
            
            # API call for each match ID in the 25 matches
            conn.request("GET", f"/stats/h5/arena/matches/{match_id}?%s" % params, "{body}", headers)
            time.sleep(1.1)
            
            # Format into JSON and append match ID and date
            response = conn.getresponse()
            data = response.read()
            match_results = json.loads(data)
            match_results['MatchId'] = match_id
            match_results['Date'] = match_date
            
            # Append each match JSON to full list
            latest_count_matches.append(match_results)
            conn.close()
            i += 1
            
            # Print total number of matches appended to list
            print(f'{i} matches appended')
    
    # Error handling
    except Exception as e:
        print(f"Error calling API: {e}")
    
    # Return full set of 25 matches
    return latest_count_matches

# Testing the function
latest_count_matches = model_pull_matches(0, gamertag='Drymander', count=5)

## Add Rows to Dataframe

Finally, this function will chain everything together to add rows to the modeling dataframe.  

You might notice that the gamertag list is chunked at [651:850]. This is where I left off with pulling rows, but I can continue adding more rows should it prove beneficial to the model. 651 is the start index in the unique gamertags list, and 850 is the end index. This means I have pulled 25 matches from 850 gamertags, but it should be noted that some or all of those matches might not have been added to the final modeling dataframe. The reasons for this include:
- Fewer than 8 players (4 on each team) finished the match.
- 8 players finished the match, but one or more players other than the 8 finishers disconnected from the match and was replaced mid-game.
- During the Player Service Record API pull, one or more of the 8 players changed their gamertag **after** the date of the match, meaning the gamertag returned null values for their service record.

This function is designed to append rows to a .csv file as it runs.  Since the function requires multiple API calls per row, this allows the ability to stop / start the function as needed.  It also allows you to pull additional rows in manageable chunks.
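
One way to make such an append resumable is to write the header only when the file does not exist yet, so a restarted run keeps adding rows beneath a single header. The sketch below is an assumption about how that could look, not the notebook's code: `append_row` is a hypothetical helper, and the file name and columns are placeholders.

```python
import os
import tempfile
import pandas as pd

def append_row(row, path):
    """Append a one-row dataframe, writing the header only for a new file."""
    write_header = not os.path.exists(path)
    row.to_csv(path, mode='a', header=write_header, index=False)

# Two separate "runs" appending to the same file in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model_pull_demo.csv')
    append_row(pd.DataFrame([{'MatchId': 'a', 'WinLoseTie': 'Victory'}]), path)
    append_row(pd.DataFrame([{'MatchId': 'b', 'WinLoseTie': 'Defeat'}]), path)
    restored = pd.read_csv(path)

print(restored['MatchId'].tolist())
```

Because every row is written to disk as soon as it is built, stopping partway through loses at most the match being processed.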

In [None]:
# Setting run to false
# If set to true, model will start adding new rows to model dataframe stored in .csv
run = False

# Isolating portion of unique_gamertags to build model dataframe in manageable chunks
gamertag_list = unique_gamertags[651:850]

# Function add rows to the modeling dataframe
def model_recent_match_stats(gamertag_list, back_count=0, count=25):
    
    # Create new dataframe
    df = pd.DataFrame()
    
    # Set gamertag_count to zero, will be used in updating status via print
    gamertag_count = 0
    
    # Loop through gamertags in unique_gamertags list
    for gamertag in tqdm(gamertag_list):
        
        # Try / except to deal with API error handling
        try:
            # Use latest_count_match function to pull 25 matches from player 
            latest_count_matches = model_pull_matches(0, gamertag=gamertag, count=count)
            time.sleep(1.1)
            
            # Setting error counter and additional counter variable
            error_count = 0
            i = 0
            
            # Loop through each of the players 25 matches
            for match in latest_count_matches:
                
                # Error handling
                try:
                    
                    # Build the base dataframe
                    base_df = build_base_dataframe(match, gamertag=gamertag)
                    
                    # Create playerlist for player history API call
                    player_list = get_player_list(base_df)
                    
                    # Call API to get player history JSON
                    player_history = get_player_history(base_df)
                    
                    # Build base player stats dataframe based on player history API call,
                    # using the game base variant of the match currently being processed
                    history_df = build_history_dataframe(player_history, match['GameBaseVariantId'])
                    
                    # Merge the base dataframe and stats dataframe
                    full_stats_df = pd.merge(base_df, history_df, how='inner', on = 'Gamertag')
                    
                    # Flatten full match dataframe into one row, 
                    row = one_row(full_stats_df, for_model=True)
                    
                    # Append row to model dataframe .csv with specific date format  
                    row.to_csv('data/MODEL_PULL.csv', mode ='a', date_format='%Y-%m-%d %H:%M:%S', header=False)
                    
                    # Append to model dataframe if working outside of .csv
                    df = pd.concat([df, row])
                    
                    # Print how many rows have been added to the dataframe
                    i += 1
                    print(f'{i} rows added to model dataframe')

                    time.sleep(1.1)
                except Exception:
                    
                    # Print error count if row cannot be added because it doesn't meet criteria
                    # Typically this occurs when a player has changed their gamertag
                    error_count += 1
                    print(f'{error_count} rows returned error when getting player history')
                    time.sleep(1.1)
                    continue
            
            # Print number of gamertags that the function has gone through
            gamertag_count += 1
            print(f'{gamertag_count} completed')
        
        except:
            
            # Show error message if gamer skipped due to name change or other issue with API
            print('gamertag skipped due to error')
    
    # Return modeling dataframe
    return df

if run == True:
    model_df = model_recent_match_stats(gamertag_list, back_count=0, count=25)
else:
    pass


# Load Model Dataframe from CSV

Let's take a look at our modeling dataframe by loading the .csv file.

In [None]:
# Load csv created by model_recent_match_stats
df = pd.read_csv('data/MODEL_PULL.csv')

# Convert the dates to datetime objects
df['Date'] = pd.to_datetime(df['Date'])

# Drop 'Unnamed: 0' from dataframe
df = df.drop(['Unnamed: 0'], axis=1)

df.head(3)

In [None]:
# Remove rows containing null, infinity, or negative infinity
# These values can cause errors when creating features
len_df = len(df)
print(f'There are {len_df} total rows in the model dataframe.')
invalid_rows = df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
print(f'There are {invalid_rows.sum()} rows with null or infinite values that should be removed from the model.')
df = df[~invalid_rows]
len_df = len(df)
print(f'There are {len_df} rows after removing null and infinite values.')

Power weapon possession time needs to be converted from ISO 8601 duration strings to floats representing hours.

In [None]:
# Define roster to interact with dataframe
roster = ['P1', 'P2', 'P3', 'P4', 'E1', 'E2', 'E3', 'E4']

# Loop through players in roster
for player in roster:
    
    # Empty list of parsed times
    parsed_times = []
    
    # Parse times for each player 
    for row in df[f'{player}_TotalPowerWeaponPossessionTime']:
        row = isodate.parse_duration(row).total_seconds() / 3600
        parsed_times.append(row)
    
    # Set column to parsed times list
    df[f'{player}_TotalPowerWeaponPossessionTime'] = parsed_times

df['P1_TotalPowerWeaponPossessionTime']
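
To make the conversion concrete, here is a minimal standard-library sketch of turning an ISO 8601 duration into hours. It is a simplified stand-in for `isodate.parse_duration`, not a replacement: it assumes the strings only contain day, hour, minute, and second components, and `iso_duration_to_hours` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
import re

# Matches simple durations such as 'PT1H30M', 'PT45M12.5S', or 'P1DT2H'
# (assumption: no year, month, or week components)
_DURATION = re.compile(
    r'^P(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$'
)

def iso_duration_to_hours(duration):
    """Convert a simple ISO 8601 duration string to a float number of hours."""
    match = _DURATION.match(duration)
    if match is None:
        raise ValueError(f'Unsupported duration: {duration!r}')

    # Components missing from the string count as zero
    parts = {name: float(value) if value else 0.0
             for name, value in match.groupdict().items()}

    total_seconds = (parts['days'] * 86400 + parts['hours'] * 3600
                     + parts['minutes'] * 60 + parts['seconds'])
    return total_seconds / 3600

print(iso_duration_to_hours('PT1H30M'))  # 1.5
```

Dividing the total seconds by 3600 is the same arithmetic the cell above applies to the parsed durations.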

Similar to our EDA, we'll need to decode a few columns for readability.

In [None]:
# This function will convert codes provided by the API into a readable format
def decode_column(df, column, api_dict):
    
    # Empty list of decoded values
    decoded_list = []
    
    # Loop through each row
    for row in df[column]:
        
        # Loop through API dictionary
        for item in api_dict:
            
            # If code found, append its name to the list and stop searching
            if item['id'] == row:
                decoded_list.append(item['name'])
                break
    
    # Return decoded list
    return decoded_list

In [None]:
# This function will convert maps to readable format
def decode_maps(df, column, api_dict):
    decoded_list = []
    
    # Loop through each row
    for row in df[column]:
        
        # Search the API dictionary for a matching map id
        for item in api_dict:
            
            # If found, append the map name and stop searching
            if item['id'] == row:
                decoded_list.append(item['name'])
                break
        
        else:
            # No map matched (the loop finished without a break), so name it 'Custom Map'
            decoded_list.append('Custom Map')
    
    # Return decoded list
    return decoded_list
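
Both decode functions amount to an id-to-name lookup. As an alternative sketch, the lookup table can be built once as a dictionary, which guarantees exactly one decoded value per row and lets unknown ids fall back to a default. The metadata values and the `decode_with_lookup` name below are made up for illustration; only the `'id'` and `'name'` keys come from the functions above.

```python
# Toy metadata in the same shape the decode functions expect:
# a list of dictionaries with 'id' and 'name' keys (values are invented)
sample_metadata = [
    {'id': 'aaa-111', 'name': 'Slayer'},
    {'id': 'bbb-222', 'name': 'Capture the Flag'},
]

def decode_with_lookup(codes, api_dict, default=None):
    """Map each code to its name, falling back to `default` for unknown codes."""
    # Build the lookup table once instead of scanning the list for every row
    lookup = {item['id']: item['name'] for item in api_dict}
    return [lookup.get(code, default) for code in codes]

codes = ['bbb-222', 'aaa-111', 'zzz-999']
print(decode_with_lookup(codes, sample_metadata, default='Custom Map'))
# ['Capture the Flag', 'Slayer', 'Custom Map']
```

Because the result always has the same length as the input, it can be assigned straight back to a dataframe column.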

In [None]:
# Loading GameBaseVariantId metadata dictionary pulled from API
with open('data/GameBaseVariantId.pkl', 'rb') as GameBaseVariantId_pickle:
    GameBaseVariantId_dic = pickle.load(GameBaseVariantId_pickle)

# Loading PlaylistId metadata dictionary pulled from API
with open('data/PlaylistId_dic.pkl', 'rb') as PlaylistId_dic_pickle:
    PlaylistId_dic = pickle.load(PlaylistId_dic_pickle)

# Loading map_list metadata dictionary pulled from API
with open('data/map_list.pkl', 'rb') as map_list_pickle:
    map_list = pickle.load(map_list_pickle)

# Decode columns using our decode functions
df['GameBaseVariantId'] = decode_column(df, 'GameBaseVariantId', GameBaseVariantId_dic)    
df['PlaylistId'] = decode_column(df, 'PlaylistId', PlaylistId_dic)
df['MapVariantId'] = decode_maps(df, 'MapVariantId', map_list)

df[['GameBaseVariantId', 'PlaylistId', 'MapVariantId']].head(3)

Next we will:
- Remove tie games
- Ensure victory and defeat are represented as 1 and 0 integers
- Rename WinLoseTie to PlayerWin, which better describes our model target
- Filter by match date to ensure our match information and our player service record information are chronologically synced.

In [None]:
# Remove ties from df
df = df[df['WinLoseTie'] != 'Tie']

# Set victories to 1 and defeats to 0
df.loc[(df['WinLoseTie'] == 'Victory'),'WinLoseTie'] = 1
df.loc[(df['WinLoseTie'] == 'Defeat'),'WinLoseTie'] = 0

# Convert to integers to be safe
df['WinLoseTie'] = df['WinLoseTie'].astype('int')

# Rename WinLoseTie to PlayerWin for clarity
df.rename(columns={'WinLoseTie':'PlayerWin'}, inplace=True)

# Set date range, only want matches later than 7/1/21
print(len(df))
df = df[(df['Date'] > '2021-07-01')]
print(len(df))

df.head(3)

# Choose GameBaseVariantId to model

This function was used a bit more during experimentation, but it is still useful for filtering our model data by game base variant (Slayer or Capture the Flag) and playlist (Super Fiesta Party).  

Since the majority of games I have played were Super Fiesta Party, we'll focus on those games for modeling.

In [None]:
# Function to choose gametype
def choose_gametype(df, GameBaseVariantId, PlaylistId):
    
    # If none selected, return df
    # This will be useful for the next function
    if GameBaseVariantId is None and PlaylistId is None:
        gametype_df = df
    
    # Option to set GameBaseVariantId to None
    elif GameBaseVariantId is None:
        gametype_df = df[df['PlaylistId'] == PlaylistId]
    
    # Option to set PlaylistId to None
    elif PlaylistId is None:
        gametype_df = df[df['GameBaseVariantId'] == GameBaseVariantId]
    
    # Set dataframe to specified GameBaseVariantId and PlaylistId
    else:
        gametype_df = df[(df['GameBaseVariantId'] == GameBaseVariantId) & (df['PlaylistId'] == PlaylistId)]
    
    # Return dataframe
    return gametype_df

# Set to Capture the Flag matches in the Super Fiesta Party playlist
df = choose_gametype(df, 'Capture the Flag', 'Super Fiesta Party')

# Check function with value counts
df['PlaylistId'].value_counts()

## Drop columns

We'll drop columns that will not be helpful for our model.

In [None]:
# Drop unnecessary columns
df = df.drop(['Date',
        'MatchId',
        'GameBaseVariantId',
        'PlaylistId',
        'MapVariantId',
        'P1_Gamertag',
        'P2_Gamertag',
        'P3_Gamertag',
        'P4_Gamertag',
        'E1_Gamertag',
        'E2_Gamertag',
        'E3_Gamertag',
        'E4_Gamertag',
        ]
        ,axis=1)

df.head(5)

## Feature Creation

Although the order differs slightly from the EDA, we will still create features that might be helpful for our model.

We'll be creating:
- Win rate
- K/D
- Accuracy

And we'll also convert all total lifetime game base variant stats into 'per game' stats.  These per game stats might be more indicative of skill, whereas total lifetime stats are more indicative of experience.  Both are relevant.

In [None]:
# Set roster for sifting through players of dataframe
roster = ['P1', 'P2', 'P3', 'P4', 'E1', 'E2', 'E3', 'E4']

# Loop through players in roster
for player in roster:
    
    # Set win rate (computed here as games won divided by games lost)
    df[f'{player}_WinRate'] = df[f'{player}_TotalGamesWon'] / df[f'{player}_TotalGamesLost']
    
    # Set K/D (or Kill / Death ratio)
    df[f'{player}_K/D'] = df[f'{player}_TotalKills'] / df[f'{player}_TotalDeaths']
    
    # Set accuracy
    df[f'{player}_Accuracy'] = df[f'{player}_TotalShotsLanded'] / df[f'{player}_TotalShotsFired']
    
    per_game_stat_list = ['TotalKills', 'TotalHeadshots', 'TotalWeaponDamage', 
                      'TotalShotsFired', 'TotalShotsLanded', 'TotalMeleeKills', 
                      'TotalMeleeDamage', 'TotalAssassinations', 'TotalGroundPoundKills', 
                      'TotalGroundPoundDamage', 'TotalShoulderBashKills', 
                      'TotalShoulderBashDamage', 'TotalGrenadeDamage', 'TotalPowerWeaponKills', 
                      'TotalPowerWeaponDamage', 'TotalPowerWeaponGrabs', 
                      'TotalPowerWeaponPossessionTime', 'TotalDeaths', 'TotalAssists', 
                      'TotalGrenadeKills']
            
    for stat in per_game_stat_list:
        per_game_stat_string = stat.replace('Total', '')
        per_game_stat_string = f'{per_game_stat_string}PerGame'
        df[f'{player}_{per_game_stat_string}'] = df[f'{player}_{stat}'] / df[f'{player}_TotalGamesCompleted']
#         variant_dic[per_game_stat_string] = variant_dic[stat] / variant_dic['TotalGamesCompleted']


# Drop null and infinity values, which can arise if it is the first time a player
# is playing the specified playlist
df = df.dropna()
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

df
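
One detail worth spelling out: the `WinRate` column above divides games won by games lost, so it is a win/loss ratio rather than the share of games won. The sketch below contrasts the two and shows where the infinity values come from. The function names and numbers are hypothetical and are not part of the notebook's pipeline.

```python
def win_loss_ratio(won, lost):
    """Games won divided by games lost; infinite when no games were lost."""
    return won / lost if lost else float('inf')

def win_rate(won, completed):
    """Share of completed games that were won; None when nothing was played."""
    return won / completed if completed else None

def per_game(total, completed):
    """Lifetime total converted to a per-game average; None when nothing was played."""
    return total / completed if completed else None

# A player with 30 wins and 20 losses over 50 completed games
print(win_loss_ratio(30, 20), win_rate(30, 50))  # 1.5 0.6

# A player who has never lost in this playlist produces an infinite ratio,
# which is why such rows are filtered out before modeling
print(win_loss_ratio(5, 0))  # inf
```

Either definition orders players in roughly the same way, but the ratio is unbounded while the share stays between 0 and 1, which matters when reading coefficients later.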

# Make Model Dataframe Function

Something that will be helpful for our model will be sorting each feature from highest to lowest for each team of players.  For each column in feature_list, this function will take each stat for P1, P2, P3, and P4 and reorder the stats.  Hopefully this will help the model understand that P1 represents the 'best' or 'highest' stat on that team, and P4 represents the 'worst' or 'lowest.'  The function will do the same for the enemy team (E1, E2, E3, and E4).

In [None]:
# Set columns for full model dataframe
feature_list = [
    # Main stats
    'WinRate', 'K/D', 'Accuracy', 'TotalGamesCompleted', 'TotalGamesWon',
    'TotalGamesLost', 'PrevTotalXP', 'SpartanRank', 'TotalTimePlayed',
    # Total life time stats
#     'TotalKills', 'TotalHeadshots', 'TotalWeaponDamage', 
#     'TotalShotsFired', 'TotalPowerWeaponPossessionTime',
#     'TotalShotsLanded', 'TotalMeleeKills', 'TotalMeleeDamage', 'TotalAssassinations',
#     'TotalGroundPoundKills', 'TotalGroundPoundDamage', 'TotalShoulderBashKills',
#     'TotalShoulderBashDamage', 'TotalGrenadeDamage', 'TotalPowerWeaponKills',
#     'TotalPowerWeaponDamage', 'TotalPowerWeaponGrabs',
#     'TotalDeaths', 'TotalAssists', 'TotalGamesTied', 'TotalGrenadeKills',
    # Per game stats   
    'KillsPerGame', 'HeadshotsPerGame', 'WeaponDamagePerGame', 
    'ShotsFiredPerGame', 'ShotsLandedPerGame', 'MeleeKillsPerGame', 
    'MeleeDamagePerGame', 'AssassinationsPerGame', 'GroundPoundKillsPerGame', 
    'GroundPoundDamagePerGame', 'ShoulderBashKillsPerGame', 
    'ShoulderBashDamagePerGame', 'GrenadeDamagePerGame', 'PowerWeaponKillsPerGame', 
    'PowerWeaponDamagePerGame', 'PowerWeaponGrabsPerGame', 
    'PowerWeaponPossessionTimePerGame', 'DeathsPerGame', 'AssistsPerGame', 
    'GrenadeKillsPerGame',
]


# Function that sorts player stats
def sort_players(df, feature_list, GameBaseVariantId, PlaylistId):
    
    # Choose gametype function
    df = choose_gametype(df, GameBaseVariantId, PlaylistId) 
    
    # Empty dataframe with PlayerWin as first column
    model_df = pd.DataFrame()
    model_df['PlayerWin'] = df['PlayerWin']
    
    # Loop that sorts player stats per team
    for feature in feature_list:
        feature_columns = [
            f'P1_{feature}', f'P2_{feature}',
            f'P3_{feature}', f'P4_{feature}', f'E1_{feature}',
            f'E2_{feature}', f'E3_{feature}', f'E4_{feature}',
            ]
        
        # Copy input dataframe columns
        feature_df = df[feature_columns].copy()

        # Sort Players in dataframe by highest value
        i = 0
        for row in tqdm(feature_df.iterrows()):
            # Sort player / enemy from highest to lowest in row
            feature_df.iloc[i, 0:4] = feature_df.iloc[i, 0:4].sort_values(ascending=False).values
            feature_df.iloc[i, 4:8] = feature_df.iloc[i, 4:8].sort_values(ascending=False).values
            i += 1
        
        # Join sorted features with PlayerWin column
        model_df = model_df.join(feature_df, on=model_df.index)
    
    # Drop null values
    model_df = model_df.dropna()
    
    # Return sorted dataframe
    return model_df
            
df = sort_players(df, feature_list, None, None)

df.head(5)
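
As a small illustration of what the sorting does to a single row of one feature, here is a plain-Python sketch. The K/D values are invented and `sort_within_teams` is a hypothetical helper, not the function used above.

```python
def sort_within_teams(row, team_size=4):
    """Sort the first `team_size` values and the remaining values separately, highest first."""
    players = sorted(row[:team_size], reverse=True)
    enemies = sorted(row[team_size:], reverse=True)
    return players + enemies

# Made-up K/D values in the column order P1..P4, E1..E4
kd_row = [0.9, 1.4, 1.1, 0.7, 1.0, 1.6, 0.8, 1.2]
print(sort_within_teams(kd_row))
# [1.4, 1.1, 0.9, 0.7, 1.6, 1.2, 1.0, 0.8]
```

Because every feature is sorted independently, a sorted `P1` column no longer describes one specific person: `P1_K/D` is the best K/D on the team and `P1_Accuracy` is the best accuracy on the team, which may belong to different players.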

In [None]:
# df.to_csv('data/Model_DF_PerGameFeatures_Sorted_HuskyRaid.csv')

# Model with All Features

We'll start by modeling all features broken down and sorted by individual player.

In [None]:
df = pd.read_csv('data/Model_DF_PerGameFeatures_Sorted_HuskyRaid.csv')
df = df.drop(['Unnamed: 0'], axis=1)

In [None]:
# Make model_df from a copy of our dataframe up to this point
model_df = df.copy()

# Assign features and target
features = model_df.drop(['PlayerWin'], axis=1)
target = model_df['PlayerWin']

# Assigning X and y for train test split
X = features
y = target

# Ensure target is integer format
y = y.astype('int')

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=8)

# Print shape
print(X_train.shape)
print(X_test.shape)

We'll set up a basic logistic regression for the purposes of creating and testing our 'evaluate model' and 'make model' functions.

In [None]:
# Create logistic regression model
model_log = LogisticRegression(random_state=8)

# Train on X_train and y_train
model_log.fit(X_train, y_train)

This function will return cross validation scores.

In [None]:
# cross_val_score is required by the function below
from sklearn.model_selection import cross_val_score

# Function to return cross validation scores
def cross_val_check(model_string_name, model, X_train, y_train, X_test, y_test):
    scores = cross_val_score(model, X_train, y_train, cv=5) # model, train, target, cross validation
    print(f'{model_string_name} Cross Validation Scores:')
    print(scores)
    print(f'\nCross validation mean: \t{scores.mean():.2%}')
    
cross_val_check('Logistic Regression', model_log, X_train, y_train, X_test, y_test)
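
For readers unfamiliar with `cv=5`, the sketch below shows the basic idea of splitting the training rows into five folds, each used once for validation. It is a simplified illustration with a hypothetical `k_fold_indices` helper: it uses plain consecutive folds, whereas scikit-learn stratifies the folds by class when the estimator is a classifier.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train, validation) index lists for consecutive, unshuffled folds."""
    indices = list(range(n_samples))

    # Spread any remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)

    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Each row appears in exactly one validation fold
for train, validation in k_fold_indices(12, n_splits=5):
    print(f'train on {len(train)} rows, validate on rows {validation}')
```

The cross validation mean reported above is the average of the five validation scores obtained this way.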

This function will return performance information and helpful visuals for interpreting the strengths and weaknesses of our models.

In [None]:
def evaluate_model(model, X_train, X_test, y_train, 
                   y_test, cmap='Greens', normalize=None,
                   classes=None,figsize=(10,4), graphs=False):
    
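    """
    model :: fitted classifier to evaluate
    X_train :: X training data
    X_test :: X test data
    y_train :: y training data
    y_test :: y test data
    cmap :: color palette of confusion matrix
    normalize :: 'true', 'pred', or 'all' for a normalized confusion matrix; None shows raw counts
    classes :: display labels for the target classes
    figsize :: desired plot size
    graphs :: set to True to print the classification report and plot the confusion matrix and ROC curves

    """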
    
    # Print model accuracy
    print(f'Training Accuracy: {model.score(X_train,y_train):.2%}')
    print(f'Test Accuracy: {model.score(X_test,y_test):.2%}')
    print('')
    
    # Option to show graphs
    if graphs == True:
        
        # Print classification report
        y_test_predict = model.predict(X_test)
        print(metrics.classification_report(y_test, y_test_predict,
                                            target_names=classes))

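        # Plot confusion matrix
        # Note: plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
        # newer versions provide ConfusionMatrixDisplay.from_estimator and RocCurveDisplay.from_estimator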
        fig,ax = plt.subplots(ncols=2,figsize=figsize)
        metrics.plot_confusion_matrix(model, X_test,y_test,cmap=cmap, 
                                      normalize=normalize,display_labels=classes,
                                      ax=ax[0])

        #Plot ROC curves
        
        with sns.axes_style("darkgrid"):
            curve = metrics.plot_roc_curve(model,X_train,y_train,ax=ax[1])
            curve2 = metrics.plot_roc_curve(model,X_test,y_test,ax=ax[1])
            curve.ax_.grid()
            curve.ax_.plot([0,1],[0,1],ls=':')
            fig.tight_layout()
            plt.show()

evaluate_model(model_log, X_train, X_test, y_train, 
                   y_test, graphs=True)

Now we'll build a function that accomplishes a few things:
- Conducts the full train / test split process
- Models the data
- Provides evaluation metrics
- Has options for choosing model type (e.g. logistic regression, random forest, SVM, XGBoost, etc.)
- Has an option for running a dummy model
- Has an option for choosing a scaler if desired

In [None]:
# Function to create models
def make_model(df, regressor=LogisticRegression(), scale=False, graphs=False, dummy=False, cmap='Greens',
              slim=False, scaler=StandardScaler()):

    # Assigning X and y for train test split
    X = df.drop(['PlayerWin'], axis=1)
    y = df['PlayerWin']
    
    # Ensure target is integer format
    y=y.astype('int')

    # Train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                        random_state=8)
    
    # Option to scale data with scaler type as parameter
    # The scaler is fit on the training data only, then applied to the test data
    if scale==True:
        X_train = pd.DataFrame(scaler.fit_transform(X_train))
        X_test = pd.DataFrame(scaler.transform(X_test))
    
    # Option to return dummy model
    if dummy == True:
        model_log = DummyClassifier(strategy='stratified')
        print('Using Dummy Model')
    else:
        model_log = regressor
    
    # Fit to X_train and y_train
    model_log.fit(X_train, y_train)
   
    # Print total number of samples
    total_samples = X_train.shape[0] + X_test.shape[0]
    print(f'Total number of samples: {total_samples}')
    print('------------------------------------------')
    
    # Option to suppress cross validation scores
    if slim == False:
        cross_val_check(str(regressor), model_log, X_train, y_train, X_test, y_test)
    
    evaluate_model(model_log, X_train, X_test, y_train, y_test, graphs=graphs, cmap=cmap)

make_model(model_df, scale=False, graphs=True, regressor=LogisticRegression())

## Logistic Regression

Now that we have a function to fully model our data, we can try our data on a few different types of models.  We'll start with basic logistic regression.

In [None]:
# Basic logistic regression
make_model(df, regressor=LogisticRegression(random_state=8), scale=False, graphs=False, cmap='Greens',
              slim=False)

Let's see if using scalers makes a difference.  We'll try Power Transformer, Standard Scaler, and Robust Scaler.

In [None]:
print('Logistic Regression with PowerTransformer')
make_model(model_df, scale=True, graphs=False, regressor=LogisticRegression(), 
           scaler=PowerTransformer())
print('Logistic Regression with StandardScaler')
make_model(model_df, scale=True, graphs=False, regressor=LogisticRegression(), 
           scaler=StandardScaler())
print('Logistic Regression with RobustScaler')
make_model(model_df, scale=True, graphs=False, regressor=LogisticRegression(), 
           scaler=RobustScaler())

It looks like the power transformer is best for logistic regression.  Let's take a look at the full evaluation.

In [None]:
# Logistic regression with power transformer, full evaluation
make_model(model_df, scale=True, graphs=True, regressor=LogisticRegression(), scaler=PowerTransformer())

We were able to reach 70.29% test accuracy.  Let's run a dummy model to see whether this score beats a baseline that guesses at random.

In [None]:
# Dummy model for comparison
make_model(model_df, scale=False, graphs=True, dummy=True, 
           cmap='Reds', regressor=LogisticRegression())

The dummy model reaches a test accuracy of 51.72%, so our logistic regression model with the power transformer, at 70.29% test accuracy, is at least picking up on something.  Before diving into what it's using to make its predictions, let's try a few more models.
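
A note on why the dummy baseline lands near 50%: `DummyClassifier(strategy='stratified')` predicts classes at random in proportion to how often they appear in the training data, so its expected accuracy depends only on class balance. The sketch below works through that arithmetic. The win shares are illustrative values, not figures measured from this dataset, and `expected_stratified_accuracy` is a hypothetical helper.

```python
def expected_stratified_accuracy(p_train, p_test=None):
    """Expected accuracy when class 1 is guessed with probability p_train
    and a fraction p_test of the evaluated labels are class 1."""
    if p_test is None:
        p_test = p_train

    # A guess is correct when both guess and label are 1, or both are 0
    return p_train * p_test + (1 - p_train) * (1 - p_test)

for share in (0.50, 0.55, 0.60):
    accuracy = expected_stratified_accuracy(share)
    print(f'{share:.0%} wins -> {accuracy:.2%} expected accuracy')
```

For a roughly balanced target the baseline stays close to 50%, so any model scoring well above that is learning something from the features.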

## Random Forest

Next we'll try random forest models to see how they perform, with and without scalers.

In [None]:
# Random forest not scaled (random_state set so the run is reproducible)
make_model(model_df, scale=False, graphs=False, regressor=RandomForestClassifier(random_state=8))

In [None]:
# Random forest with different scalers
print('Random Forest with PowerTransformer')
make_model(model_df, scale=True, graphs=False, 
           regressor=RandomForestClassifier(random_state=8), scaler=PowerTransformer())
print('Random Forest with StandardScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=RandomForestClassifier(random_state=8), scaler=StandardScaler())
print('Random Forest with RobustScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=RandomForestClassifier(random_state=8), scaler=RobustScaler())


We were almost able to crack 72% with the power transformer.  Let's take a look at the full evaluation.

In [None]:
make_model(model_df, scale=True, graphs=True, 
           regressor=RandomForestClassifier(random_state=8), scaler=PowerTransformer())

Random forest model with the power transformer is our best model so far at 71.61% test accuracy.

## Support Vector Machines

Next we'll take a look at support vector machines.

In [None]:
# SVM without scaling
make_model(model_df, scale=False, graphs=False, regressor=svm.SVC())

In [None]:
# Support Vector Machines with different scalers
print('Support Vector Machine with PowerTransformer')
make_model(model_df, scale=True, graphs=False, 
           regressor=svm.SVC(), scaler=PowerTransformer())
print('Support Vector Machine with StandardScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=svm.SVC(), scaler=StandardScaler())
print('Support Vector Machine with RobustScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=svm.SVC(), scaler=RobustScaler())

We came pretty close to 72% with SVM using the power transformer, but random forest is still our best model.  Let's take a look at SVM using power transformer, which scored 71.48% test accuracy.

In [None]:
make_model(model_df, scale=True, graphs=True, 
           regressor=svm.SVC(), scaler=PowerTransformer())

## XGBoost

Finally, let's take a look at XGBoost.

In [None]:
# XGBoost with no scaler
make_model(model_df, scale=False, graphs=False, regressor=XGBClassifier())

In [None]:
# XGBoost with different scalers
print('XGBoost with PowerTransformer')
make_model(model_df, scale=True, graphs=False, 
           regressor=XGBClassifier(random_state=8), scaler=PowerTransformer())
print('XGBoost with StandardScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=XGBClassifier(random_state=8), scaler=StandardScaler())
print('XGBoost with RobustScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=XGBClassifier(random_state=8), scaler=RobustScaler())

While worth a shot, XGBoost did not outperform our SVM model.  Most models look nearly identical in terms of accuracy, but let's take a closer look at XGBoost with the robust scaler, since it has the highest cross validation mean at 71.54%.

In [None]:
make_model(model_df, scale=True, graphs=True, 
           regressor=XGBClassifier(random_state=8), scaler=RobustScaler())

# Feature Analysis

Let's take a look at how logistic regression values different features in our full model dataset.  We'll create a model accuracy function and a plot coefficients function to explore the importances.

Before that, we'll redo our train test split using the power transformer, which was best for logistic regression.

In [None]:
# Make model_df from a copy of our dataframe up to this point
model_df = df.copy()

# Assign features and target
features = model_df.drop(['PlayerWin'], axis=1)
target = model_df['PlayerWin']

# Assigning X and y for train test split
X = features
y = target

# Ensure target is integer format
y = y.astype('int')

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=8)

# Define and scale X train and test
scaler = PowerTransformer()
X_train = pd.DataFrame(scaler.fit_transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))

# Create and fit logistic regression
model_log = LogisticRegression()
model_log.fit(X_train, y_train)

# Print shape
print(X_train.shape)
print(X_test.shape)

Now we'll create our functions and take a look at the feature importances.

In [None]:
# Function to show model accuracy
def model_accuracy(model, X_train, y_train, X_test, y_test):
    
    print(f'Training Accuracy: {model.score(X_train,y_train):.2%}')
    print(f'Test Accuracy: {model.score(X_test,y_test):.2%}')

# Uncomment line below to create images for presentation
# sns.set_context('talk')

# Function to plot logistic regression coefficients
def plot_coefficients(model, features, X_train, X_test, y_train, y_test, count=20):    
    
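    # Create a sorted series of coefficients, labeled by feature name
    coeffs = pd.Series(model.coef_.flatten(), index=features.columns).sort_values(ascending=False)
    
    # Keep the `count` largest and `count` smallest coefficients,
    # unless that would cover every feature (which would duplicate some of them)
    if 2 * count < len(coeffs):
        coeffs = pd.concat([coeffs.iloc[:count], coeffs.iloc[-count:]])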
    
    # Display accuracy of newly trained model
    model_accuracy(model, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)

    # Create coefficients plot
    with sns.plotting_context("talk", font_scale=1.4):
        with sns.axes_style("darkgrid"):
            plt.figure(figsize=(16, 12))
            ax = sns.barplot(x=coeffs, y=coeffs.index, palette='coolwarm')
            ax.set(xlabel='Log Coefficients', ylabel='Features')
            ax.set_title("Feature Importances",fontsize=30)
    
    # Tidy up layout
    plt.tight_layout()

    
plot_coefficients(model_log, features, X_train, X_test, y_train, y_test, count=10)
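
When reading the plot, it can help to translate a coefficient into odds. In logistic regression each coefficient is a change in log-odds per one unit of its feature, so exponentiating gives an odds ratio. The sketch below uses a made-up coefficient of 0.4 and hypothetical helper names. It also assumes the features were standardized by the scaler, so that "one unit" means one standard deviation of the transformed feature.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds of a win per one-unit feature increase."""
    return math.exp(coefficient)

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

def probability_after_shift(base_probability, coefficient, shift=1.0):
    """Win probability after moving one feature by `shift` units, all else fixed."""
    log_odds = math.log(base_probability / (1 - base_probability))
    return sigmoid(log_odds + coefficient * shift)

coef = 0.4  # hypothetical coefficient, not taken from the fitted model
print(round(odds_ratio(coef), 3))                    # 1.492
print(round(probability_after_shift(0.5, coef), 3))  # 0.599
```

Under these assumptions, a coefficient of 0.4 would move an otherwise even match to about a 60% predicted chance of winning. Negative coefficients work the same way in the opposite direction.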

It's a bit difficult to interpret the feature importances with so many features.  Let's try condensing the features into an average per team and see if that returns more interpretable importances.

# Model with Condensed Features

We'll create a function to condense the features on each team.

In [None]:
# Function to convert model dataframe from individual players stats to averaged team stats
def condense_features(df, feature_list):
    
    # Empty dataframe
    df_total=pd.DataFrame()
    
    # Copy PlayerWin column
    df_total['PlayerWin'] = df['PlayerWin']
    
    # Loop through features for each player
    for feature in feature_list:
        
        # Add sum features for respective teams
        df_total[f'Player_{feature}'] = df[f'P1_{feature}'] + df[f'P2_{feature}'] + df[f'P3_{feature}'] + df[f'P4_{feature}']
        df_total[f'Enemy_{feature}'] = df[f'E1_{feature}'] + df[f'E2_{feature}'] + df[f'E3_{feature}'] + df[f'E4_{feature}']
        
        # Divide by 4 to get average
        df_total[f'Player_{feature}'] = df_total[f'Player_{feature}'] / 4
        df_total[f'Enemy_{feature}'] = df_total[f'Enemy_{feature}'] / 4

    return df_total
    
df = condense_features(df, feature_list)
df.head(5)
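
A quick sketch of what condensing keeps and what it discards, using invented K/D values and a hypothetical `team_average` helper. The average ignores the order of the players, so the earlier sorting step has no effect on these features, and two teams with very different spreads can end up with the same condensed value.

```python
from statistics import mean

def team_average(values):
    """Average one feature across the four players on a team."""
    return mean(values)

balanced_team = [1.0, 1.0, 1.0, 1.0]  # four evenly matched players
lopsided_team = [1.6, 1.2, 0.8, 0.4]  # one strong player carrying weaker ones

# Both teams condense to roughly the same value
print(round(team_average(balanced_team), 3), round(team_average(lopsided_team), 3))

# Sorting the players first does not change the average
print(team_average(lopsided_team) == team_average(sorted(lopsided_team)))
```

This is the trade-off behind the condensed feature set: fewer, easier-to-read features at the cost of information about how skill is distributed within a team.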

In [None]:
# df.to_csv('data/Model_W_Condensed_Features_HuskyRaid.csv')

In [None]:
df = pd.read_csv('data/Model_W_Condensed_Features_HuskyRaid.csv')
df = df.drop(['Unnamed: 0'], axis=1)

Next we'll apply our train test split again.  Before checking out the features, let's see if condensing the features has any effect on model performance.

In [None]:
# Make model_df from a copy of our dataframe up to this point
model_df = df.copy()

# Assign features and target
features = model_df.drop(['PlayerWin'], axis=1)
target = model_df['PlayerWin']

# Assigning X and y for train test split
X = features
y = target

# Ensure target is integer format
y = y.astype('int')

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=8)

# Print shape
print(X_train.shape)
print(X_test.shape)

## Logistic Regression, Random Forest, SVM, XGBoost

We'll rerun each model we tested before, but now with our condensed feature set.

In [None]:
print('Logistic Regression with PowerTransformer')
make_model(model_df, scale=True, graphs=False, regressor=LogisticRegression(), scaler=PowerTransformer())
print('Logistic Regression with StandardScaler')
make_model(model_df, scale=True, graphs=False, regressor=LogisticRegression(), scaler=StandardScaler())
print('Logistic Regression with RobustScaler')
make_model(model_df, scale=True, graphs=False, regressor=LogisticRegression(), scaler=RobustScaler())

# Random forest with different scalers
print('Random Forest with PowerTransformer')
make_model(model_df, scale=True, graphs=False, 
           regressor=RandomForestClassifier(random_state=8), scaler=PowerTransformer())
print('Random Forest with StandardScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=RandomForestClassifier(random_state=8), scaler=StandardScaler())
print('Random Forest with RobustScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=RandomForestClassifier(random_state=8), scaler=RobustScaler())

# Support Vector Machines with different scalers
print('Support Vector Machine with PowerTransformer')
make_model(model_df, scale=True, graphs=False, 
           regressor=svm.SVC(), scaler=PowerTransformer())
print('Support Vector Machine with StandardScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=svm.SVC(), scaler=StandardScaler())
print('Support Vector Machine with RobustScaler')
make_model(model_df, scale=True, graphs=False, 
           regressor=svm.SVC(), scaler=RobustScaler())

Looks like we beat our highest score with the condensed feature set!  Let's take a look at logistic regression with the power transformer in more detail.

In [None]:
print('Logistic Regression with PowerTransformer')
make_model(model_df, scale=True, graphs=True, regressor=LogisticRegression(), scaler=PowerTransformer())

## Condensed Feature Analysis

Let's take a look at the feature importances now that we have condensed the features.  We'll perform our train / test split again using the power transformer, and then we'll run our feature importances function.

In [None]:
# Make model_df from a copy of our dataframe up to this point
model_df = df.copy()

# Assign features and target
features = model_df.drop(['PlayerWin'], axis=1)
target = model_df['PlayerWin']

# Assigning X and y for train test split
X = features
y = target

# Ensure target is integer format
y = y.astype('int')

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=8)

# Define and scale X train and test
scaler = PowerTransformer()
X_train = pd.DataFrame(scaler.fit_transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))

# Create and fit logistic regression
model_log = LogisticRegression()
model_log.fit(X_train, y_train)

# Print shape
print(X_train.shape)
print(X_test.shape)

In [None]:
plot_coefficients(model_log, features, X_train, X_test, y_train, y_test, count=10)

With the condensed features, we see that Player_WinRate and Player_ShotsLandedPerGame are now the most important features.  It's still a bit difficult to understand why something like ShotsLandedPerGame is ranked so high, but perhaps this speaks to accuracy being an important factor.  

# Model Removing Multicollinearity

We'll take one last step regarding model interpretation by removing multicollinearity.  

In [None]:
df = pd.read_csv('data/Model_W_Condensed_Features_HuskyRaid.csv')
df = df.drop(['Unnamed: 0'], axis=1)
model_df = df.copy()

columns = ['PlayerWin', 'Player_WinRate', 'Enemy_WinRate', 'Player_K/D',
       'Enemy_K/D', 'Player_Accuracy', 'Enemy_Accuracy',
       'Player_TotalGamesCompleted', 'Enemy_TotalGamesCompleted',
       'Player_SpartanRank', 'Enemy_SpartanRank',
       'Player_KillsPerGame', 'Enemy_KillsPerGame',
       'Player_HeadshotsPerGame', 'Enemy_HeadshotsPerGame',
       'Player_DeathsPerGame','Enemy_DeathsPerGame', 
       'Player_GrenadeKillsPerGame', 'Enemy_GrenadeKillsPerGame']

model_df = model_df[columns]

model_df.columns

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Create function to output multicollinearity heatmap
def heatmap(df_name, figsize=(30,30), cmap='Reds'):
    with sns.axes_style("darkgrid"):
        corr = df_name.drop('PlayerWin',axis=1).corr()
        mask = np.zeros_like(corr)
        mask[np.triu_indices_from(mask)] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(corr, annot=True, cmap=cmap, mask=mask)
    
    return fig, ax

heatmap(model_df)

It looks like we were able to get rid of the multicollinear features whose correlations exceeded the 0.75 threshold.  Let's run the train test split again and check out the feature importances.
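
The 0.75 threshold can also be checked programmatically rather than by reading the heatmap. The sketch below computes Pearson correlations for every pair of columns and lists the pairs above the threshold. The data values are invented and `pearson` and `correlated_pairs` are hypothetical helpers written with the standard library only.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.75):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Invented values: kills and shots landed rise together, rank does not
toy_columns = {
    'KillsPerGame': [10, 12, 15, 9, 14],
    'ShotsLandedPerGame': [100, 118, 152, 92, 139],
    'SpartanRank': [40, 12, 33, 150, 75],
}
print(correlated_pairs(toy_columns))
```

An empty list for the reduced column set would confirm what the heatmap shows. Any pair that is flagged is a candidate for dropping one of its two features.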

In [None]:
# Assign features and target
features = model_df.drop(['PlayerWin'], axis=1)
target = model_df['PlayerWin']

# Assigning X and y for train test split
X = features
y = target

# Ensure target is integer format
y = y.astype('int')

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=8)

# Define and scale X train and test
scaler = PowerTransformer()
X_train = pd.DataFrame(scaler.fit_transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))

# Create and fit logistic regression
model_log = LogisticRegression()
model_log.fit(X_train, y_train)

# Print shape
print(X_train.shape)
print(X_test.shape)

In [None]:
plot_coefficients(model_log, features, X_train, X_test, y_train, y_test, count=10)

It seems that Player_WinRate is still the most important feature in predicting the outcome of a match.
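
For readers who want the ranking as a table rather than a plot, here is a minimal sketch of one way to order logistic regression coefficients by magnitude.  It is an illustration only: `rank_coefficients` is a hypothetical helper (it is not the `plot_coefficients` function used above), and the toy data is synthetic.  Because the features are scaled first, the odds ratios refer to a one-unit change in the transformed feature, not in the raw statistic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PowerTransformer

def rank_coefficients(model, feature_names):
    """Hypothetical helper: coefficients sorted by absolute size, with odds ratios."""
    coefs = pd.Series(model.coef_[0], index=list(feature_names))
    order = coefs.abs().sort_values(ascending=False).index
    ranked = coefs.reindex(order)
    return pd.DataFrame(
        {'coefficient': ranked.values, 'odds_ratio': np.exp(ranked.values)},
        index=ranked.index,
    )

# Synthetic data: the outcome depends mostly on the first column
rng = np.random.default_rng(8)
X_toy = pd.DataFrame(
    rng.normal(size=(500, 3)),
    columns=['Player_WinRate', 'Enemy_WinRate', 'Player_Accuracy'],
)
logits = 2.0 * X_toy['Player_WinRate'] - 1.0 * X_toy['Enemy_WinRate']
y_toy = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

# Scale, fit, and rank; column names come from the unscaled frame
X_scaled = PowerTransformer().fit_transform(X_toy)
toy_model = LogisticRegression().fit(X_scaled, y_toy)
ranking = rank_coefficients(toy_model, X_toy.columns)
print(ranking)
```

Passing `features.columns` as the names matters in this notebook, since wrapping the scaled array in `pd.DataFrame` drops the original column labels.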

# Best Model

We explored many model types with a variety of scalers and datasets.  The best model for predicting victory using only gamertags and data available from the API was logistic regression, trained on the per-game statistics condensed by player and enemy teams and scaled with the power transformer.
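
As a rough illustration of that combination, the scaler and the classifier can be chained in a scikit-learn pipeline so the transformer is refit on each training fold during cross-validation.  This is a sketch under assumptions, not the `make_model` helper used in this notebook: the data comes from `make_classification` and is purely synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in for the condensed per-game features
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=8
)

# PowerTransformer followed by logistic regression, treated as one estimator
pipe = make_pipeline(PowerTransformer(), LogisticRegression())

# 5-fold cross-validated accuracy; scaling is fit inside each training fold
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring='accuracy')
print(scores.mean())
```

Keeping the scaler inside the pipeline avoids fitting it on rows that later serve as validation data.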

In [None]:
# import matplotlib.pyplot as plt
# %matplotlib inline
# import seaborn as sns
# sns.set(rc={'axes.facecolor':'white'})

df = pd.read_csv('data/Model_W_Condensed_Features_HuskyRaid.csv')
df = df.drop(['Unnamed: 0'], axis=1)

# Make model_df from a copy of our dataframe up to this point
model_df = df.copy()

# Assign features and target
features = model_df.drop(['PlayerWin'], axis=1)
target = model_df['PlayerWin']

# Assigning X and y for train test split
X = features
y = target

# Ensure target is integer format
y = y.astype('int')

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=8)

# Print shape
print(X_train.shape)
print(X_test.shape)

print('Logistic Regression with PowerTransformer')
make_model(model_df, scale=True, graphs=True, regressor=LogisticRegression(), scaler=PowerTransformer())

The model is more precise at predicting victory than defeat.  This might be because the matches compiled for the dataframe came from players who were fairly experienced in the Super Fiesta Party playlist.  In the future, it might be worth exploring matches from competitors with less play time and less experience in the playlist.
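
To make "more precise at predicting victory" concrete, per-class precision can be computed directly.  The labels below are hand-made for illustration and are not results from this model; with real data, `y_true` and `y_pred` would be `y_test` and the model's predictions.

```python
from sklearn.metrics import classification_report, precision_score

# Hypothetical labels: 1 = victory, 0 = defeat
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Precision for a class = correct predictions of that class / all predictions of it
precision_victory = precision_score(y_true, y_pred, pos_label=1)
precision_defeat = precision_score(y_true, y_pred, pos_label=0)
print(precision_victory, precision_defeat)

# Full per-class breakdown (precision, recall, F1, support)
print(classification_report(y_true, y_pred, target_names=['Defeat', 'Victory']))
```

In this toy case three of the four predicted victories are correct, while four of the six predicted defeats are correct, so victory precision is the higher of the two.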

# Conclusions and Recommendations

For this specific playlist and with players who were most likely more skilled than average, we were able to predict the victor of a match with 72.73% accuracy using only information gathered by the API and no details about what actually occurred during the match.  

While I admittedly have no knowledge of best practices for ensuring a positive player experience, I believe the outcomes produced by an ideal matchmaking algorithm should not be predictable above a certain threshold, ideally not much higher than 50%.
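
One caveat on the 50% reference point: it only applies when wins and losses are evenly represented in the data.  If the sampled players win more often than they lose, always guessing "win" already beats 50%.  The sketch below shows one way to measure that baseline with scikit-learn's `DummyClassifier`.  The data is synthetic and the win rate is an assumption chosen for illustration, not a figure from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data where wins are more common than losses (assumed, not measured)
rng = np.random.default_rng(8)
n = 1000
X_toy = rng.normal(size=(n, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=n) > -0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=8
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
print(f'baseline: {baseline_acc:.3f}  model: {model_acc:.3f}')
```

The gap between the two scores, more than the distance from 50%, indicates how much the features themselves reveal about the outcome.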

It is entirely possible that matchmaking algorithms are already optimized to meet this ideal standard.  Perhaps sourcing modeling data from more skilled players in a very specific playlist would naturally lead to a higher than desired predictive quality simply because there are not enough equally skilled players entering matchmaking to ensure an even match at various hours of the day.

However, if that's not the case, a solution to uneven matchmaking might come in the form of a machine learning model as simple and efficient as logistic regression using readily available player data.  If something like this isn't being used, it could be implemented experimentally.

I should note that none of the modeling was conducted with ranked matchmaking, which certainly exists in Halo 5 and many other competitive games.  That system is likely more nuanced and robust, and deserves its own round of modeling and analysis.

# Next Steps

Regarding the Super Fiesta Party playlist, where players spawn with random weapons throughout the match, there exists a 'Match Events' API call that details nearly every action that happened in any given match.  Most importantly, this provides information on what weapons players spawned with throughout the match.  Given the fact that the weapons are randomized, frequenters of Super Fiesta Party will (or should) freely admit that luck with the random weapons varies substantially.

This project was originally conceived with this in mind, and the goal was to predict victory based on random weapon spawns alone.  The hurdle we encountered was that there was no way to decode the 100+ weapon variants.  343 Industries admitted in a forum post that adding this to the API would not be trivial, and given the API is technically a beta, they're under no obligation to give us this information.  However, it should be possible to decode the weapon variants through some individual data collection conducted through custom matches.

Finally, we would like to expand our modeling dataset to a variety of skill levels and playlists, which will be possible by identifying players that meet these criteria.  It would certainly be worthwhile to determine whether or not ranked matchmaking has the same level of predictive quality.