# Preparation

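I use Python 3. The code below relies on Python 3 features such as <code>urllib.request</code> and true division with <code>/</code>, so it will need small changes to run on Python 2.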

1. Install [HDF5](https://www.hdfgroup.org/HDF5/release/obtain5.html).
2. Install other packages:

<code>pip install h5py keras matplotlib numpy pandas pyyaml scipy scikit-learn theano urllib3</code>

# Tutorial

The goal of this project is to learn distributed representations of MLB players, which can then be used for other types of analyses. The project is inspired by [word2vec](https://en.wikipedia.org/wiki/Word2vec) (hence the name), which learns distributed representations of words. These distributed representations often have pretty interesting properties; for example, Paris - France + Italy = Rome (see [here](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) and [here](http://arxiv.org/pdf/1301.3781.pdf) for more details).

In this notebook, I'll show you how I built a model that simultaneously learns distributed representations of pitchers and batters from at-bat data.

To start things off, let's download some data from [Retrosheet.org](http://retrosheet.org/). We'll use data from the noughties.

In [1]:
import urllib.request
urllib.request.urlretrieve("http://www.retrosheet.org/events/2000seve.zip", "2000seve.zip")

('2000seve.zip', <http.client.HTTPMessage at 0x7ff43410b7b8>)

Now, extract the data.

In [2]:
import zipfile
zip_ref = zipfile.ZipFile("2000seve.zip", "r")
zip_ref.extractall("2000seve")
zip_ref.close()
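
The extraction step can also be wrapped in a small helper. This is a minimal sketch rather than part of the original notebook: the function name <code>extract_archive</code> is hypothetical, and it uses <code>zipfile.ZipFile</code> as a context manager so the archive is closed even if extraction fails.

```python
import zipfile


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    # The context manager closes the archive even if extraction raises.
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
        return zip_ref.namelist()
```

With the file downloaded above, this would be called as <code>extract_archive("2000seve.zip", "2000seve")</code>.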

Next, we'll prepare some variables for collecting the data.

In [3]:
from os import listdir
from os.path import isfile, join

data_directory = "/home/airalcorn2/Projects/Deep Baseball/2000seve"
data_files = [f for f in listdir(data_directory) if isfile(join(data_directory, f))]
data = []
at_bats = {}
home_runs = {}
singles = {}
doubles = {}
counts = {"batter": {}, "pitcher": {}}

And now we'll read in the data. Unfortunately, this is going to be a bunch of spaghetti code. The goal is to collect the batter, pitcher, and pitch outcome (e.g., strike, ball, double) for every pitch. By the end of the following code block, we'll have a Python list of dictionaries with the format <code>{"batter": batter, "pitcher": pitcher, "outcome": outcome}</code>. To best understand what's going on in the code, you'll have to read through Retrosheet's [game file documentation](http://www.retrosheet.org/game.htm).

In [4]:
import string

for data_file in data_files:
    # Skip non-event files.
    if not (".EVA" in data_file or ".EVN" in data_file):
        continue
    
    f = open(join(data_directory, data_file))
    home_pitcher = None
    away_pitcher = None
    line = f.readline().strip()
    
    while line != "":
        parts = line.split(",")
        
        # Get starting pitchers.
        if parts[0] == "id":
            while parts[0] != "play":
                line = f.readline().strip()
                parts = line.split(",")
                if parts[0] == "start" and parts[-1] == "1":
                    if parts[3] == "0":
                        away_pitcher = parts[1]
                    else:
                        home_pitcher = parts[1]
        
        # Skip non-plays, steals, errors on foul fly balls, 
        # picked off stealings, and other random plays.
        if (parts[-1] == "NP" or parts[-1][:2] == "CS" or parts[-1][:2] == "DI" or
            parts[-1][:2] == "SB" or parts[-1][:3] == "FLE" or parts[-1][:4] == "POCS" or
            parts[-1][:2] == "OA"):
            line = f.readline().strip()
            continue
        
        # Get at bat data.
        if parts[0] == "play":
            batter = parts[3]
            pitcher = home_pitcher
            if parts[2] == "1":
                pitcher = away_pitcher
            
            at_bats[batter] = at_bats.get(batter, 0) + 1
            counts["batter"][batter] = counts["batter"].get(batter, 0) + 1
            counts["pitcher"][pitcher] = counts["pitcher"].get(pitcher, 0) + 1
            
            row = {"batter": batter, "pitcher": pitcher}
            
            # Handle balks, wild pitches, passed balls, and pickoffs.
            if (parts[-1][:2] == "BK" or parts[-1][:2] == "WP" or
                parts[-1][:2] == "PB" or parts[-1][:2] == "PO"):
                row["outcome"] = parts[-1][:2]
                data.append(row)
                line = f.readline().strip()
                continue
            
            # Cycle through the pitches for the current at bat.
            # See "The pitches field of the play record" here: http://www.retrosheet.org/eventfile.htm.
            pitches = parts[5]
            i = 0
            while i < len(pitches):
                pitch = pitches[i]
                
                # Handle catcher pickoffs, pitches blocked by catcher, or runners.
                if pitch == "+" or pitch == ">":
                    i += 1
                    pitch = pitches[i]
                elif pitch == "*":
                    i += 1
                    pitch += pitches[i]
                
                if "X" not in pitch and pitch != "." and pitch != "*" and pitch != "+":
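                    # Append a copy so each pitch keeps its own outcome. Appending
                    # `row` itself would alias one dict across every pitch of the at bat.
                    data.append(dict(row, outcome = pitch))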
                
                i += 1
            
            # If the last pitch resulted in contact, figure out the pitch outcome.
            # See "Events made by the batter at the plate" here: http://www.retrosheet.org/eventfile.htm#8.
            if pitches[-1] == "X":
                play_parts = parts[6].split("/")
                main_play = play_parts[0]
                play = main_play.split(".")[0]
                
                if play[0] == "H":
                    play = "HR"
                elif play[0] in string.digits:
                    play = play[0]
                elif play[0] in {"S", "D", "T"}:
                    play = play[:2]
                    # Try to get first ball handler.
                    if len(play) < 2:
                        try:
                            handlers = play_parts[1]
                            play = play[0] + handlers[0]
                        except IndexError:
                            pass
                
                row["outcome"] = play
                if play == "HR":
                    home_runs[batter] = home_runs.get(batter, 0) + 1
                elif play[0] == "S":
                    singles[batter] = singles.get(batter, 0) + 1
                elif play[0] == "D":
                    doubles[batter] = doubles.get(batter, 0) + 1
                
                data.append(row)
        
        # Handle pitching changes.
        if parts[0] == "sub":
            if parts[-1] == "1":
                if parts[3] == "0":
                    away_pitcher = parts[1]
                else:
                    home_pitcher = parts[1]
        
        line = f.readline().strip()
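
To make the parsing loop easier to follow, here is a simplified sketch of how a single Retrosheet <code>play</code> record breaks down into per-pitch rows. It is not the notebook's parser: the helper name <code>pitch_rows</code> and the sample record are made up for illustration, and it only skips the marker characters (<code>+</code>, <code>*</code>, <code>.</code>, <code>&gt;</code>) and the final <code>X</code> instead of handling every case above. It builds a fresh dictionary for each pitch, so the rows don't share state.

```python
# Characters in the pitches field that annotate a pitch rather than record one.
MARKERS = {"+", "*", ".", ">"}


def pitch_rows(line, pitcher):
    """Split one "play" record into a list of per-pitch dictionaries."""
    # Field order: play, inning, home/visitor flag, batter id, count, pitches, event.
    parts = line.strip().split(",")
    batter, pitches = parts[3], parts[5]
    rows = []
    for pitch in pitches:
        # "X" means the ball was put in play; its outcome comes from the event field.
        if pitch in MARKERS or pitch == "X":
            continue
        rows.append({"batter": batter, "pitcher": pitcher, "outcome": pitch})
    return rows


# Hypothetical record: ball, called strike, foul, then a ball put in play.
rows = pitch_rows("play,1,0,batr0001,12,BCFX,S8/G", "pitc0001")
print([row["outcome"] for row in rows])  # ['B', 'C', 'F']
```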

OK, now that we have our raw data, we're going to establish some cutoffs so that we're only analyzing players with a reasonable amount of data. We're going to only include the most frequent batters and pitchers (i.e., those who accounted for 90% of the pitches).

In [5]:
cutoffs = {}
percentile_cutoff = 0.9
for player_type in ["batter", "pitcher"]:
    counts_list = list(counts[player_type].values())
    counts_list.sort(reverse = True)
    total_pitches = sum(counts_list)
    cumulative_percentage = [sum(counts_list[:i + 1]) / total_pitches for i in range(len(counts_list))]
    cutoff_index = sum([1 for total in cumulative_percentage if total <= percentile_cutoff])
    cutoff = counts_list[cutoff_index]
    cutoffs[player_type] = cutoff
    print(player_type)
    print("Original: {0}\tNew: {1}\tProportion: {2:.2f}".format(
            len(counts[player_type]), cutoff_index, cutoff_index / len(counts[player_type])))

batter
Original: 2699	New: 715	Proportion: 0.26
pitcher
Original: 1760	New: 774	Proportion: 0.44


As you can see, only 26% of batters and 44% of pitchers were involved in 90% of the pitches.
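
The cutoff logic is easier to see on toy numbers. The sketch below follows the same idea as the cell above but is not a drop-in replacement: the name <code>coverage_cutoff</code> is hypothetical, it uses <code>itertools.accumulate</code> for the running total, and it adds a guard (an assumption on my part) for the case where the coverage is 1.0.

```python
from itertools import accumulate


def coverage_cutoff(counts, coverage):
    """Return (cutoff_index, cutoff_count) for the given share of total appearances."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    # Running share of all appearances covered by the top i + 1 players.
    shares = [running / total for running in accumulate(ordered)]
    cutoff_index = sum(1 for share in shares if share <= coverage)
    # Guard against coverage = 1.0, where every share is within the threshold.
    cutoff_index = min(cutoff_index, len(ordered) - 1)
    return cutoff_index, ordered[cutoff_index]


# Toy counts: the top three players cover 88% of appearances.
print(coverage_cutoff([50, 30, 8, 7, 5], 0.9))  # (3, 7)
```

Computing the running total once keeps the work linear in the number of players, whereas re-summing a slice for every index is quadratic.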

Let's use these new cutoff points to build the final data set.

In [6]:
final_data = []
for sample in data:
    batter = sample["batter"]
    pitcher = sample["pitcher"]
    if counts["batter"][batter] >= cutoffs["batter"] and counts["pitcher"][pitcher] >= cutoffs["pitcher"]:
        final_data.append(sample)

print("Original: {0}\tReduced: {1}".format(len(data), len(final_data)))

Original: 7204897	Reduced: 5867152


As you can see, we still have a large data set even after removing rare batters and pitchers.

Next, we have to associate an integer index with each of our batters, pitchers, and outcomes.

In [7]:
import random

random.shuffle(final_data)

categories = {"batter": set(), "pitcher": set(), "outcome": set()}
for sample in final_data:
    categories["batter"].add(sample["batter"])
    categories["pitcher"].add(sample["pitcher"])
    categories["outcome"].add(sample["outcome"])

for column in categories:
    categories[column] = list(categories[column])
    categories[column].sort()

category_to_int = {}
for column in categories:
    category_to_int[column] = {categories[column][i]: i for i in range(len(categories[column]))}
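
For a single column, the mapping boils down to the following. This is a small sketch on made-up outcome codes, with <code>build_index</code> as a hypothetical helper name.

```python
def build_index(values):
    """Map each distinct value to its position in sorted order."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


outcome_to_int = build_index(["S", "B", "C", "B", "HR"])
print(outcome_to_int)  # {'B': 0, 'C': 1, 'HR': 2, 'S': 3}
```

Sorting before numbering makes the indices reproducible across runs, which matters later when the rows of the embedding matrices are matched back to player IDs.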

We then have to use these newly defined integer indices to build the appropriate NumPy arrays for our model.

In [8]:
import numpy as np
from keras.utils import np_utils

BATCH_SIZE = 100
NUM_BATTERS = len(categories["batter"])
NUM_PITCHERS = len(categories["pitcher"])
NUM_OUTCOMES = len(categories["outcome"])
VEC_SIZE = 20

data_sets = {"batter": [], "pitcher": [], "outcome": []}
for sample in final_data:
    for column in sample:
        value = sample[column]
        value_index = category_to_int[column][value]
        data_sets[column].append(value_index)

for column in data_sets:
    data_sets[column] = np.array(data_sets[column])

data_sets["outcome"] = np_utils.to_categorical(data_sets["outcome"], NUM_OUTCOMES)
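
The last line turns each outcome index into a one-hot row. As a rough illustration of what that encoding looks like, here is a plain-Python sketch; <code>one_hot</code> is a hypothetical helper, not the Keras implementation.

```python
def one_hot(indices, num_classes):
    """Return one row per index, with a 1.0 in that index's column."""
    rows = []
    for index in indices:
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows


print(one_hot([2, 0], 3))  # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```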

We're now ready to build our model with [Keras](http://keras.io/) and [Theano](http://deeplearning.net/software/theano/). The model is similar in spirit to a word2vec model in that we're trying to learn the player embeddings that best predict the outcome of a pitch (the "target word" in word2vec) given a certain batter and pitcher (the "context" in word2vec). We'll learn separate embedding matrices for batters and pitchers.

In [9]:
from keras.layers import Embedding, Dropout, Merge
from keras.layers.core import Dense, Reshape
from keras.models import Sequential

batter_embed = Sequential()
batter_embed.add(Embedding(NUM_BATTERS, VEC_SIZE, input_length = 1))
batter_embed.add(Reshape((VEC_SIZE,)))

pitcher_embed = Sequential()
pitcher_embed.add(Embedding(NUM_PITCHERS, VEC_SIZE, input_length = 1))
pitcher_embed.add(Reshape((VEC_SIZE,)))

model = Sequential()
model.add(Merge([batter_embed, pitcher_embed], mode = "concat"))
model.add(Dense(NUM_OUTCOMES, activation = "softmax"))
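# Caution: this dropout layer sits after the softmax, so during training it zeroes
# roughly half of the predicted probabilities. That likely explains why the training
# loss reported below is far higher than the validation loss.
model.add(Dropout(0.5))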
model.compile(optimizer = "adadelta", loss = "categorical_crossentropy",
              metrics = ["accuracy"])

Using Theano backend.
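
Conceptually, a prediction from this architecture is two table lookups, a concatenation, and a dense softmax layer. The sketch below spells that out in plain Python with random toy weights. It ignores dropout and training entirely, and all names and sizes in it are illustrative assumptions rather than anything taken from the Keras model.

```python
import math
import random


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # Subtract the max for numerical stability.
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict(batter, pitcher, batter_table, pitcher_table, weights, biases):
    """Look up both embeddings, concatenate them, and apply a dense softmax layer."""
    features = batter_table[batter] + pitcher_table[pitcher]  # List concatenation.
    scores = [sum(w * x for w, x in zip(outcome_weights, features)) + bias
              for outcome_weights, bias in zip(weights, biases)]
    return softmax(scores)


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
vec_size, num_outcomes = 4, 3
batter_table = [[rng.uniform(-1, 1) for _ in range(vec_size)] for _ in range(5)]
pitcher_table = [[rng.uniform(-1, 1) for _ in range(vec_size)] for _ in range(6)]
# One weight vector of length 2 * vec_size per outcome.
weights = [[rng.uniform(-1, 1) for _ in range(2 * vec_size)] for _ in range(num_outcomes)]
biases = [0.0] * num_outcomes

probabilities = predict(2, 3, batter_table, pitcher_table, weights, biases)
print(len(probabilities), round(sum(probabilities), 6))
```

Training adjusts the rows of the two tables along with the dense weights, and those rows are the player embeddings.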


And we're now ready to train our model. We'll save the weights that achieve the lowest loss on a held-out validation set (<code>ModelCheckpoint</code> monitors <code>val_loss</code> by default).

In [10]:
from keras.callbacks import ModelCheckpoint

X_list = [data_sets["batter"].reshape(data_sets["batter"].shape[0], 1),
          data_sets["pitcher"].reshape(data_sets["pitcher"].shape[0], 1)]
y = data_sets["outcome"]
checkpointer = ModelCheckpoint(filepath = "weights.hdf5", save_best_only = True)
model.fit(X_list, y, nb_epoch = 10, batch_size = BATCH_SIZE, validation_split = 0.15,
          verbose = 2, callbacks = [checkpointer], shuffle = True)
model.load_weights("weights.hdf5")

Train on 4987079 samples, validate on 880073 samples
Epoch 1/10
158s - loss: 9.1902 - acc: 0.1302 - val_loss: 2.8912 - val_acc: 0.1652
Epoch 2/10
160s - loss: 9.1805 - acc: 0.1317 - val_loss: 2.8892 - val_acc: 0.1663
Epoch 3/10
148s - loss: 9.1785 - acc: 0.1318 - val_loss: 2.8883 - val_acc: 0.1662
Epoch 4/10
151s - loss: 9.1779 - acc: 0.1316 - val_loss: 2.8897 - val_acc: 0.1656
Epoch 5/10
156s - loss: 9.1795 - acc: 0.1316 - val_loss: 2.8921 - val_acc: 0.1654
Epoch 6/10
160s - loss: 9.1793 - acc: 0.1315 - val_loss: 2.8938 - val_acc: 0.1657
Epoch 7/10
165s - loss: 9.1742 - acc: 0.1315 - val_loss: 2.8985 - val_acc: 0.1630
Epoch 8/10
158s - loss: 9.1770 - acc: 0.1315 - val_loss: 2.8959 - val_acc: 0.1654
Epoch 9/10
158s - loss: 9.1770 - acc: 0.1311 - val_loss: 2.8943 - val_acc: 0.1655
Epoch 10/10
157s - loss: 9.1795 - acc: 0.1310 - val_loss: 2.9055 - val_acc: 0.1610
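
To see which epoch's weights end up being reloaded, we can replay the checkpoint rule on the validation losses from the log. This sketch assumes the checkpoint monitors validation loss (the Keras default for <code>ModelCheckpoint</code>) and only mimics the selection logic; it doesn't touch any weight files.

```python
# Validation losses copied from the training log above, one per epoch.
val_losses = [2.8912, 2.8892, 2.8883, 2.8897, 2.8921,
              2.8938, 2.8985, 2.8959, 2.8943, 2.9055]

best_loss = float("inf")
best_epoch = None
for epoch, val_loss in enumerate(val_losses, start=1):
    # Mimic save_best_only: keep a checkpoint only when the monitored value improves.
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch

print(best_epoch, best_loss)  # 3 2.8883
```

Under that assumption, the weights loaded at the end of the cell are the ones from epoch 3.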


Having trained the model, let's go ahead and get the distributed representations for all of our players. In order to do so, we need to define some functions that return an embedding when provided with a player's integer index.

In [11]:
from keras import backend

get_batter_vec = backend.function([batter_embed.input], batter_embed.output)
get_pitcher_vec = backend.function([pitcher_embed.input], pitcher_embed.output)

batter_vecs = [get_batter_vec([np.array([[i]])]) for i in range(NUM_BATTERS)]
pitcher_vecs = [get_pitcher_vec([np.array([[i]])]) for i in range(NUM_PITCHERS)]

# Get distributed representation of players.
batter_vecs = np.array(batter_vecs).reshape((NUM_BATTERS, VEC_SIZE))
pitcher_vecs = np.array(pitcher_vecs).reshape((NUM_PITCHERS, VEC_SIZE))
player_vecs = {"batter": batter_vecs, "pitcher": pitcher_vecs}

Now, let's find out if these embeddings have anything interesting in them. First, let's collect some information about the players.

In [12]:
# Get player data.
player_data = {}

for data_file in data_files:
    if ".ROS" in data_file:
        f = open(join(data_directory, data_file))
        for line in f:
            parts = line.strip().split(",")
            player_id = parts[0]
            last_name = parts[1]
            first_name = parts[2]
            name = first_name + " " + last_name
            batting_hand = parts[3]
            throwing_hand = parts[4]
            position = parts[6]
            player_data[player_id] = {"name": name, "batting_hand": batting_hand,
                                      "throwing_hand": throwing_hand, "position": position}

Next, we're going to perform a [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA) on the embeddings and color them with various interesting properties.

In [13]:
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition


def run_pca(player_vecs, colors = None, pc_x = 0, pc_y = 1, pc_z = 2, do_print = False, title = ""):
    """
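    Run a PCA on the embedded player representations and plot the results.
    :param player_vecs: array of shape (number of players, VEC_SIZE) holding the embeddings
    :param colors: optional list of Matplotlib colors, one per player
    :param pc_x: index of the principal component plotted on the x-axis
    :param pc_y: index of the principal component plotted on the y-axis
    :param pc_z: index of the principal component plotted on the z-axis (3D plot only)
    :param do_print: if True, print the explained variance ratio of each component
    :param title: title used for both plots
    :return: the players projected onto all principal components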
    """
    pca = decomposition.PCA()
    pca.fit(player_vecs)
    if do_print:
        print(pca.explained_variance_ratio_)
    
    projected = pca.transform(player_vecs)
    
    fig = plt.figure()
    ax = fig.add_subplot(111, projection = "3d")
    ax.scatter(projected[:, pc_x], projected[:, pc_y], projected[:, pc_z], color = colors)
    ax.set_title(title)
    
    plt.show()
    
    plt.scatter(projected[:, pc_x], projected[:, pc_y], c = colors, cmap = "gray")
    plt.title(title)
    plt.show()
    return projected


max_hr_rate = max([home_runs.get(batter_id, 0) / at_bats[batter_id] for batter_id in at_bats if batter_id in categories["batter"]])
max_single_rate = max([singles.get(batter_id, 0) / at_bats[batter_id] for batter_id in at_bats if batter_id in categories["batter"]])
max_double_rate = max([doubles.get(batter_id, 0) / at_bats[batter_id] for batter_id in at_bats if batter_id in categories["batter"]])

batting_hand_color = {"L": "red", "R": "green", "B": "purple"}
batter_colors = {"hand": [], "hr": [], "single": [], "double": []}
for i in range(NUM_BATTERS):
    batter_id = categories["batter"][i]
    batting_hand = player_data[batter_id]["batting_hand"]
    batter_colors["hand"].append(batting_hand_color[batting_hand])
    batter_colors["hr"].append(str((home_runs.get(batter_id, 0) / at_bats[batter_id]) / max_hr_rate))
    batter_colors["single"].append(str((singles.get(batter_id, 0) / at_bats[batter_id]) / max_single_rate))
    batter_colors["double"].append(str((doubles.get(batter_id, 0) / at_bats[batter_id]) / max_double_rate))

for batter_color in ["hand", "single", "double", "hr"]:
    projected_batters = run_pca(batter_vecs, batter_colors[batter_color], title = "batter_{0}".format(batter_color))
    no = run_pca(batter_vecs, batter_colors[batter_color], 1, 2, 3)

projected_batters = run_pca(batter_vecs, batter_colors["hr"], do_print = True)
projected_pitchers = run_pca(pitcher_vecs, do_print = True)

[ 0.33529857  0.22530989  0.13290913  0.05876472  0.04907317  0.0453608
  0.02849336  0.02082874  0.01662581  0.01471605  0.01353817  0.01266403
  0.01091404  0.00788418  0.00695989  0.00630556  0.00590156  0.00356333
  0.00275365  0.00213543]
[ 0.28083625  0.12563516  0.11674933  0.08232093  0.06770106  0.05129958
  0.03345469  0.02995826  0.028036    0.02566545  0.02382667  0.02276182
  0.0213462   0.01885001  0.01771756  0.01673546  0.01459991  0.01128728
  0.00863065  0.00258762]


As you can see, there are some interesting patterns in the embeddings. For example, right-handed hitters are clearly separated from left-handed and switch hitters.

<img src="batters_hand_pca.png">

Similarly, frequent singles hitters are far from infrequent singles hitters.

<img src="batters_singles_pca.png">

So, the model is clearly learning something, but whether or not what it's learning is non-trivial remains to be seen.

Let's go ahead and save those PC scores in a CSV to play around with elsewhere if we want.

In [14]:
import csv

NUM_PLAYERS = {"batter": NUM_BATTERS, "pitcher": NUM_PITCHERS}


def write_projected_data(player_type, projected, fieldnames):
    """
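    Write the first three PC scores of the players to "<player_type>s_pca.csv".
    :param player_type: "batter" or "pitcher"
    :param projected: PC scores as returned by run_pca, one row per player
    :param fieldnames: column names for the CSV, in output order
    :return: None
    """
    # newline = "" is what the csv module recommends for files it writes to.
    out = open("{0}s_pca.csv".format(player_type), "w", newline = "")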
    output = csv.DictWriter(out, fieldnames = fieldnames)
    output.writeheader()
    
    for i in range(NUM_PLAYERS[player_type]):
        player_id = categories[player_type][i]
        row = {}
        for col in fieldnames:
            if col in player_data[player_id]:
                row[col] = player_data[player_id][col]
        
        for j in range(3):
            row["PC{0}".format(j + 1)] = projected[i][j]
        
        row["player_id"] = player_id
        if player_type == "batter":
            row["hr_rate"] = home_runs.get(player_id, 0) / at_bats[player_id]
        
        nothing = output.writerow(row)
    
    out.close()


fieldnames = ["player_id", "name", "position", "batting_hand", "throwing_hand", "hr_rate", "PC1", "PC2", "PC3"]
write_projected_data("batter", projected_batters, fieldnames)
fieldnames = ["player_id", "name", "throwing_hand", "PC1", "PC2", "PC3"]
write_projected_data("pitcher", projected_pitchers, fieldnames)

To get a better sense of the embeddings, I recommend exploring the PC scores in my open source [ScatterPlot3D](https://sites.google.com/site/michaelaalcorn/ScatterPlot3D) software. To run it:

1. Download the appropriate build.
2. Run with <code>java -jar ScatterPlot3D-&lt;version&gt;.jar</code> on Linux systems or by double-clicking the JAR on Windows.
3. Load the data.
4. Enter 4, 5, and 6 as the x, y, and z columns for "pitchers_pca.csv", or 7, 8, and 9 for "batters_pca.csv".
5. Click "Submit".
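
The column numbers in step 4 follow from the field order used when the CSVs were written. As a quick check, assuming the columns are counted from 1 (<code>pc_columns</code> is a hypothetical helper):

```python
def pc_columns(fieldnames):
    """Return the 1-based positions of the PC1, PC2, and PC3 columns."""
    return [fieldnames.index(name) + 1 for name in ("PC1", "PC2", "PC3")]


batter_fields = ["player_id", "name", "position", "batting_hand",
                 "throwing_hand", "hr_rate", "PC1", "PC2", "PC3"]
pitcher_fields = ["player_id", "name", "throwing_hand", "PC1", "PC2", "PC3"]

print(pc_columns(batter_fields))   # [7, 8, 9]
print(pc_columns(pitcher_fields))  # [4, 5, 6]
```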

You can then search, zoom, and rotate the data or click on individual points for more details. For example:

<img src="pitchers_pca_all.png">

<img src="pedro_martinez.png">

Documentation can be downloaded [here](https://sites.google.com/site/michaelaalcorn/ScatterPlot3D/SupplementaryMaterials.zip?attredirects=0&d=1). A gallery of application screenshots can be found [here](http://imgur.com/a/U833y).

We'll also save the player embeddings.

In [15]:
def write_distributed_representations(player_type, player_vecs):
    """
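    Write the latent vector representation of the players to "<player_type>s_latent.csv".
    :param player_type: "batter" or "pitcher"
    :param player_vecs: array of shape (number of players, VEC_SIZE)
    :return: None
    """
    # newline = "" is what the csv module recommends for files it writes to.
    out = open("{0}s_latent.csv".format(player_type), "w", newline = "")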
    fieldnames = ["name"] + ["latent_{0}".format(i + 1) for i in range(VEC_SIZE)]
    output = csv.DictWriter(out, fieldnames = fieldnames)
    output.writeheader()
    
    for i in range(NUM_PLAYERS[player_type]):
        player_id = categories[player_type][i]
        row = {"name": player_data[player_id]["name"]}
        
        for j in range(VEC_SIZE):
            row["latent_{0}".format(j + 1)] = player_vecs[i][j]
        
        nothing = output.writerow(row)
    
    out.close()


write_distributed_representations("batter", batter_vecs)
write_distributed_representations("pitcher", pitcher_vecs)

So, do the embeddings contain any non-obvious information? Maybe comparing nearest neighbors will provide some insight.

In [16]:
import pandas as pd


def get_nearest_neighbors(name, data, latent_vecs, player_names, k = 10):
    """
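    Return the k nearest neighbors (in the latent space) of a given player.
    :param name: name of the query player, as it appears in the "name" column
    :param data: DataFrame loaded from the latent CSV for this player type
    :param latent_vecs: array of shape (number of players, VEC_SIZE)
    :param player_names: list of player names, in the same row order as latent_vecs
    :param k: number of neighbors to return
    :return: list of (name, distance) tuples sorted by distance; the closest entry
             (the query player, at distance 0) is skipped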
    """
    player_index = np.where(data["name"] == name)[0]
    player_latent = latent_vecs[player_index]
    distances = list(np.linalg.norm(latent_vecs - player_latent, axis = 1))
    distances_and_ids = list(zip(player_names, distances))
    distances_and_ids.sort(key = lambda x: x[1])
    
    return distances_and_ids[1:1 + k]


data_files = ["batters_latent.csv", "pitchers_latent.csv"]
data = {}
player_names = {}
latent_vecs = {}
for player_type in ["batter", "pitcher"]:
    data_file = "{0}s_latent.csv".format(player_type)
    data[player_type] = pd.read_csv(data_file)
    player_names[player_type] = list(data[player_type]["name"])
    latent_vecs[player_type] = np.array(data[player_type].iloc[:, 1:])

print("Barry Bonds")
print(get_nearest_neighbors("Barry Bonds", data["batter"], latent_vecs["batter"], player_names["batter"]))
print()

print("Ichiro Suzuki")
print(get_nearest_neighbors("Ichiro Suzuki", data["batter"], latent_vecs["batter"], player_names["batter"]))
print()

print("Bartolo Colon")
print(get_nearest_neighbors("Bartolo Colon", data["pitcher"], latent_vecs["pitcher"], player_names["pitcher"]))
print()

print("Barry Zito")
print(get_nearest_neighbors("Barry Zito", data["pitcher"], latent_vecs["pitcher"], player_names["pitcher"]))
print()

Barry Bonds
[('Brian Giles', 1.2553231210579647), ('Larry Walker', 1.4742346979235055), ('Chipper Jones', 1.5563861461306252), ('Dan Johnson', 1.5591044331404615), ('Luis Gonzalez', 1.5840471952681239), ('Jeff DaVanon', 1.5858893076808074), ('Ben Zobrist', 1.6371824828295256), ('Terrmel Sledge', 1.6424605721312489), ('Nick Johnson', 1.6439087248593454), ('Rafael Palmeiro', 1.652097539122739)]

Ichiro Suzuki
[('Endy Chavez', 1.1513289922046999), ('Alex Sanchez', 1.1517131228516677), ('Kerry Robinson', 1.1575346723783695), ('Jason Tyner', 1.2032641823232502), ('Aaron Miles', 1.2154969813720271), ('Cesar Izturis', 1.2339002696116652), ('Cristian Guzman', 1.2493230732068266), ('Tony Womack', 1.2739236815827384), ('Carl Crawford', 1.2798535075608914), ('Tike Redman', 1.2856320298427966)]

Bartolo Colon
[('Kevin Millwood', 0.62789273774956178), ('Erik Bedard', 0.64943581089753588), ('A.J. Burnett', 0.73368722495631267), ('Matt Herges', 0.73798407374104624), ('Wade Miller', 0.7442495225075526

Unfortunately, my rather limited baseball knowledge means I do not know the answer to that question. Maybe you can tell me? We can also combine players.

In [17]:
def combine_players(player_1_name, player_2_name, data, latent_vecs, player_names, k = 10, subtract = False):
    """
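    Return the k nearest neighbors of the vector resulting from combining two
    players in the latent space.
    :param player_1_name: name of the first player
    :param player_2_name: name of the second player
    :param data: DataFrame loaded from the latent CSV for this player type
    :param latent_vecs: array of shape (number of players, VEC_SIZE)
    :param player_names: list of player names, in the same row order as latent_vecs
    :param k: number of neighbors to return
    :param subtract: if True, use player 1 minus player 2 instead of their sum
    :return: list of (name, distance) tuples sorted by distance; note that the
             single closest entry is skipped, as in get_nearest_neighbors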
    """
    player_1_index = np.where(data["name"] == player_1_name)[0]
    player_1_latent = latent_vecs[player_1_index]
    
    player_2_index = np.where(data["name"] == player_2_name)[0]
    player_2_latent = latent_vecs[player_2_index]
    
    # Add the two latent vectors, or subtract the second from the first.
    target = player_1_latent - player_2_latent if subtract else player_1_latent + player_2_latent
    distances = list(np.linalg.norm(latent_vecs - target, axis = 1))
    
    distances_and_ids = list(zip(player_names, distances))
    distances_and_ids.sort(key = lambda x: x[1])
    return distances_and_ids[1:1 + k]


print("Barry Bonds + Ichiro Suzuki")
print(combine_players("Barry Bonds", "Ichiro Suzuki", data["batter"], latent_vecs["batter"], player_names["batter"]))
print()

print("Barry Bonds - Ichiro Suzuki")
print(combine_players("Barry Bonds", "Ichiro Suzuki", data["batter"], latent_vecs["batter"], player_names["batter"], subtract = True))
print()

print("Ichiro Suzuki - Barry Bonds")
print(combine_players("Ichiro Suzuki", "Barry Bonds", data["batter"], latent_vecs["batter"], player_names["batter"], subtract = True))
print()

print("Bartolo Colon + Barry Zito")
print(combine_players("Bartolo Colon", "Barry Zito", data["pitcher"], latent_vecs["pitcher"], player_names["pitcher"]))
print()

print("Bartolo Colon - Barry Zito")
print(combine_players("Bartolo Colon", "Barry Zito", data["pitcher"], latent_vecs["pitcher"], player_names["pitcher"], subtract = True))
print()

print("Barry Zito - Bartolo Colon")
print(combine_players("Barry Zito", "Bartolo Colon", data["pitcher"], latent_vecs["pitcher"], player_names["pitcher"], subtract = True))
print()

Barry Bonds + Ichiro Suzuki
[('Barry Bonds', 1.9312513638306952), ('Joe Mauer', 1.9413194548863386), ('Scott Podsednik', 1.9448289693502232), ('David DeJesus', 1.9493288196791654), ('Brian Giles', 1.9795736202765661), ('Sean Casey', 2.0625488316419363), ('Jacoby Ellsbury', 2.112198148327614), ('Jody Gerut', 2.1208453732465791), ('Tike Redman', 2.1240904225559425), ('Robinson Cano', 2.1284958902929532)]

Barry Bonds - Ichiro Suzuki
[('Rafael Palmeiro', 2.3174521548970652), ('Chris Iannetta', 2.3664780719288272), ('Mark McGwire', 2.4111897494945791), ('Daric Barton', 2.4558800946485624), ('Jason Giambi', 2.4631377887280501), ('Jeff DaVanon', 2.5052343126778043), ('Chad Kreuter', 2.5210830391616139), ('Hee Seop Choi', 2.5306363662854787), ('Ken Caminiti', 2.5765982815126303), ('Gabe Gross', 2.5814960952162003)]

Ichiro Suzuki - Barry Bonds
[('Angel Berroa', 2.4416910656350042), ('Deivi Cruz', 2.4450743092900526), ('Shea Hillenbrand', 2.4473678229895586), ('Ichiro Suzuki', 2.48545313851488
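
For readers who want to experiment with the idea without the trained embeddings, here is a self-contained toy version of the nearest-neighbor and combination lookups. Everything in it is made up for illustration: the 2D vectors, the placeholder names, and the helper <code>nearest</code>. It requires Python 3.8 or later for <code>math.dist</code>.

```python
import math


def nearest(target, vectors, exclude=(), k=2):
    """Return the k (name, distance) pairs closest to target, skipping excluded names."""
    distances = [(name, math.dist(vec, target))
                 for name, vec in vectors.items() if name not in exclude]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


# Made-up 2D "embeddings"; the names are placeholders, not real players.
vectors = {
    "slugger": [1.0, 0.0],
    "slugger_like": [0.9, 0.1],
    "contact": [0.0, 1.0],
    "contact_like": [0.1, 0.9],
    "balanced": [1.0, 1.0],
}

# Neighbors of one player, excluding the player itself by name.
print(nearest(vectors["slugger"], vectors, exclude={"slugger"}, k=1))

# Neighbors of the sum of two players, excluding both inputs.
combined = [a + b for a, b in zip(vectors["slugger"], vectors["contact"])]
print(nearest(combined, vectors, exclude={"slugger", "contact"}, k=1))
```

One design difference from the functions above: this sketch excludes the query players by name instead of dropping the first result. For a combined vector, the closest match isn't necessarily one of the inputs, so dropping the first result can discard a genuine neighbor.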