#Introduction

The scope of this analysis is to distinguish between "good" players and those that make the Major League Baseball Hall of Fame.  For the sake of the training data, a "good" player is defined as a player that has received at least 1 Hall of Fame vote.

Ultimately, a predictive model will be created that predicts which players will make it into the hall of fame.  Note, I will be focusing on whether the player will ultimately make it into the HOF, not whether or not they will make it into the HOF in a given voting year.
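
To make the target concrete: the label is one value per player, "Y" if any ballot row for that player shows an induction and "N" otherwise.  A minimal sketch of that collapse is below.  The ballot rows, player IDs, and the helper name `ever_inducted` are made up for illustration and are not taken from the Lahman data.

```python
# Hypothetical ballot rows: (playerID, yearid, inducted). Illustrative only.
ballots = [
    ("pitchaa01", 1990, "N"),
    ("pitchaa01", 1991, "N"),
    ("pitchaa01", 1992, "Y"),
    ("pitchbb01", 1990, "N"),
    ("pitchbb01", 1991, "N"),
]

def ever_inducted(rows):
    """Collapse per-year ballot rows to a single Y/N label per player."""
    labels = {}
    for player_id, _year, inducted in rows:
        if inducted == "Y":
            # A single successful year is enough for a "Y" label.
            labels[player_id] = "Y"
        else:
            # Only record "N" if the player has no label yet.
            labels.setdefault(player_id, "N")
    return labels

labels = ever_inducted(ballots)
print(labels)
```

The year column is deliberately ignored, which is what "ultimately make it into the HOF" means here.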

Once a model is trained, I will make predictions for the 2015 and 2016 eligible MLB pitchers. 

This analysis currently covers only pitchers; modeling and prediction for position players will follow at a future date.

In [2]:
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn import ensemble, linear_model
%matplotlib inline

In [3]:
#Import Datasets
Dir = "C:/Users/grazim/Desktop/HoF/lahman"

hof = pd.read_csv(Dir + "/HallOfFame.CSV")
pitch = pd.read_csv(Dir + "/Pitching.CSV")
pitch_post = pd.read_csv(Dir + "/PitchingPost.CSV")
fielding = pd.read_csv(Dir + "/Fielding.CSV")
allstar = pd.read_csv(Dir + "/AllstarFull.CSV")

In [4]:
hof.head(10)

Unnamed: 0,playerID,yearid,votedBy,ballots,needed,votes,inducted,category,needed_note
0,cobbty01,1936,BBWAA,226,170,222,Y,Player,
1,ruthba01,1936,BBWAA,226,170,215,Y,Player,
2,wagneho01,1936,BBWAA,226,170,215,Y,Player,
3,mathech01,1936,BBWAA,226,170,205,Y,Player,
4,johnswa01,1936,BBWAA,226,170,189,Y,Player,
5,lajoina01,1936,BBWAA,226,170,146,N,Player,
6,speaktr01,1936,BBWAA,226,170,133,N,Player,
7,youngcy01,1936,BBWAA,226,170,111,N,Player,
8,hornsro01,1936,BBWAA,226,170,105,N,Player,
9,cochrmi01,1936,BBWAA,226,170,80,N,Player,


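Filter the Hall of Fame dataset to rows in the "Player" category.  The restriction to pitchers happens later, when these playerIDs are matched against the pitching table.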

In [5]:
hof_pitch = hof[hof.category == "Player"]
print(hof_pitch.shape[0], hof.shape[0])

3965 4088


In [6]:
print("HOF Voting date range:", hof_pitch.yearid.min(), hof_pitch.yearid.max())

HOF Voting date range: 1936 2015


#Data Cleaning

In [7]:
#Function to prepare career pitching data for a user-supplied list of playerIDs
def player_hof(pitch, IDs):
    #Keep only the pitching rows for the requested playerIDs
    pitch_hof = pitch[pitch.playerID.isin(IDs)]

    #Count the number of seasons played (one stint == 1 row per season)
    seasons_tot = pitch_hof[pitch_hof.stint == 1]
    seasons_tot = seasons_tot[['playerID', 'yearID']].groupby(['playerID']).count()
    seasons_tot['playerID'] = seasons_tot.index
    seasons_tot = seasons_tot.reset_index(drop=True) #avoid playerID being both index and column
    seasons_tot = seasons_tot.rename(columns={'yearID':'season_count'})
    
    #Total the career statistics of each player (numeric columns only)
    pitch_career = pitch_hof.groupby('playerID').sum(numeric_only=True)
    pitch_career = pitch_career.drop(['stint', 'yearID'], axis=1)
    pitch_career['playerID'] = pitch_career.index
    pitch_career = pitch_career.reset_index(drop=True) #avoid playerID being both index and column
    
    #Add seasons to the career statistics dataframe
    pitch_career = pitch_career.merge(seasons_tot, how = 'left', on = 'playerID')
    
    #ERA: earned runs per 9 innings (IPouts/3 = innings pitched)
    pitch_career['ERA'] = (pitch_career.ER * 9)/(pitch_career.IPouts/3)

    #Wins per game pitched (named Wpct; note this is W/G, not the conventional W/(W+L))
    pitch_career['Wpct'] = (pitch_career.W/pitch_career.G)

    #Strikeouts per walk
    #Replace zeros with a tiny value so the ratio never divides by zero
    #(this also alters zero counts in every other column)
    pitch_career = pitch_career.replace(0, .0000001)
    pitch_career['S/W'] = pitch_career.SO/pitch_career.BB

    #WHIP: walks plus hits per inning pitched
    pitch_career['WHIP'] = (pitch_career.BB + pitch_career.H) / (pitch_career.IPouts/3)

    #IP
    pitch_career['IP'] = pitch_career.IPouts/3

    return pitch_career

In [8]:
hof_pitch.head()

Unnamed: 0,playerID,yearid,votedBy,ballots,needed,votes,inducted,category,needed_note
0,cobbty01,1936,BBWAA,226,170,222,Y,Player,
1,ruthba01,1936,BBWAA,226,170,215,Y,Player,
2,wagneho01,1936,BBWAA,226,170,215,Y,Player,
3,mathech01,1936,BBWAA,226,170,205,Y,Player,
4,johnswa01,1936,BBWAA,226,170,189,Y,Player,


In [9]:
#PlayerIDs of pitchers that got HOF votes
ID_hof = hof_pitch.playerID

#IDs of inducted players
ID_inducted = hof_pitch[hof_pitch.inducted=="Y"]
ID_inducted = ID_inducted[['playerID','inducted']]

pitch_career = player_hof(pitch, ID_hof)

#left join inducted indicator into pitch_career
pitch_career = pitch_career.merge(ID_inducted, how = 'left', on = 'playerID')

#Convert missing (NaN) values in inducted to "N"
pitch_career['inducted'] = pitch_career['inducted'].fillna("N")
pitch_career.head()

pitch_career

Unnamed: 0,W,L,G,GS,CG,SHO,SV,IPouts,H,ER,...,SH,SF,GIDP,playerID,season_count,Wpct,S/W,WHIP,IP,inducted
0,8.700000e+01,1.080000e+02,263,2.540000e+02,3.100000e+01,6.000000e+00,1.000000e-07,5022,1779,7.910000e+02,...,,,,abbotji01,10,3.307985e-01,1.432258e+00,1.433094,1674.000000,N
1,1.940000e+02,1.400000e+02,482,3.550000e+02,2.060000e+02,4.400000e+01,1.500000e+01,8986,2841,9.170000e+02,...,,,,adamsba01,19,4.024896e-01,2.409302e+00,1.092032,2995.333333,N
2,8.600000e+01,8.100000e+01,732,8.900000e+01,1.000000e+01,1.000000e-07,3.180000e+02,3874,1233,5.120000e+02,...,1.000000e+00,1.000000e-07,,aguilri01,16,1.174863e-01,2.934473e+00,1.226639,1291.333333,N
3,4.700000e+01,4.500000e+01,495,1.000000e-07,1.000000e-07,1.000000e-07,1.230000e+02,2238,679,2.720000e+02,...,,,,akerja01,11,9.494949e-02,1.474453e+00,1.277480,746.000000,N
4,1.940000e+02,1.740000e+02,561,4.640000e+02,9.800000e+01,1.800000e+01,3.000000e+00,10103,3376,1.406000e+03,...,,,,alexado01,19,3.458111e-01,1.562372e+00,1.292883,3367.666667,N
5,3.730000e+02,2.080000e+02,696,5.990000e+02,4.370000e+02,9.000000e+01,3.200000e+01,15570,4868,1.476000e+03,...,,,,alexape01,20,5.359195e-01,2.311251e+00,1.121195,5190.000000,Y
6,1.420000e+02,7.500000e+01,352,2.410000e+02,1.090000e+02,1.700000e+01,1.800000e+01,5851,1849,8.130000e+02,...,,,,allenjo02,13,4.034091e-01,1.449864e+00,1.326440,1950.333333,N
7,1.000000e-07,1.000000e-07,1,1.000000e-07,1.000000e-07,1.000000e-07,1.000000e-07,15,11,1.000000e+00,...,,,,allisdo01,1,1.000000e-07,1.000000e-07,2.400000,5.000000,N
8,1.000000e-07,1.000000e-07,1,1.000000e-07,1.000000e-07,1.000000e-07,1.000000e-07,6,3,1.000000e-07,...,,,,alouma01,1,1.000000e-07,3.000000e+00,2.000000,2.000000,N
9,8.300000e+01,7.500000e+01,218,1.610000e+02,1.280000e+02,1.600000e+01,7.000000e+00,4542,1455,4.450000e+02,...,,,,altroni01,16,3.807339e-01,1.562500e+00,1.140687,1514.000000,N


Notice the large amount of missing data in several columns such as intentional walks (IBB), games finished (GF), balks (BK), etc.  Many statistics were not tracked since the beginning of the game.  Other possibly interesting features such as All-Star appearances, Cy Young awards, etc. have also been omitted for similar reasons. 

Construct features from the current dataset, such as ERA, batting average against, winning percentage, and innings pitched per year.
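
As a worked check of the rate statistics, the sketch below recomputes ERA, WHIP, and strikeouts per walk for the first row of the table above (abbotji01).  ER, H, and IPouts are read directly from that row.  The BB and SO totals (620 and 888) are hidden by the truncated display and are back-calculated here from the WHIP and S/W columns, so treat them as an assumption.  The helper name `rate_stats` is hypothetical.

```python
def rate_stats(er, h, bb, so, ipouts):
    """Career rate statistics from counting statistics."""
    ip = ipouts / 3                  # three outs per inning
    return {
        "IP": ip,
        "ERA": er * 9 / ip,          # earned runs per nine innings
        "WHIP": (bb + h) / ip,       # walks plus hits per inning
        "S/W": so / bb,              # strikeouts per walk
    }

# ER, H, IPouts from the abbotji01 row; BB and SO are back-calculated (assumption).
stats = rate_stats(er=791, h=1779, bb=620, so=888, ipouts=5022)
for name, value in stats.items():
    print(name, round(value, 6))
```

The WHIP and S/W values printed here should agree with the 1.433094 and 1.432258 shown in the table.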

In [10]:
pitch_career.dtypes

W               float64
L               float64
G                 int64
GS              float64
CG              float64
SHO             float64
SV              float64
IPouts          float64
H               float64
ER              float64
HR              float64
BB              float64
SO              float64
BAOpp           float64
ERA             float64
IBB             float64
WP              float64
HBP             float64
BK              float64
BFP             float64
GF              float64
R               float64
SH              float64
SF              float64
GIDP            float64
playerID         object
season_count    float64
Wpct            float64
S/W             float64
WHIP            float64
IP              float64
inducted         object
dtype: object

In [11]:
pitch_career.isnull().sum()

W                 0
L                 0
G                 0
GS                0
CG                0
SHO               0
SV                0
IPouts            0
H                 0
ER                0
HR                0
BB                0
SO                0
BAOpp             4
ERA               1
IBB             191
WP                1
HBP              10
BK                0
BFP               2
GF                1
R                 0
SH              424
SF              424
GIDP            479
playerID          0
season_count      3
Wpct              0
S/W               0
WHIP              0
IP                0
inducted          0
dtype: int64

In [12]:
#Drop columns with significant missing data
pitch_career = pitch_career.drop(['SH', 'SF', 'GIDP', 'IBB', 'HBP'], axis=1)

#Drop any remaining rows that still contain missing values
pitch_career = pitch_career.dropna(axis = 0,how='any')


In [13]:
#How many players fall into the "Y" or "N" categories in inducted?

In [14]:
pitch_career.inducted.describe()

count     472
unique      2
top         N
freq      380
Name: inducted, dtype: object

380 of the 472 players in the dataset (about 81%) were not inducted.  This is a testament to the difficulty of gaining access to the HOF. 

Because the labels are this imbalanced, a bagging (bootstrap aggregating) ensemble such as a random forest is a reasonable starting point for the classification.
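
One consequence of the imbalance is that raw accuracy is easy to inflate.  The arithmetic sketch below uses only the counts reported above (472 players, 380 labeled "N") to show the accuracy of a trivial classifier that always predicts "N".  Any model accuracy is best read against that baseline.

```python
total_players = 472   # count from inducted.describe()
not_inducted = 380    # freq of the most common label, "N"
inducted = total_players - not_inducted

# Accuracy of always predicting "N", with no model at all
baseline_accuracy = not_inducted / total_players

print("Inducted:", inducted)
print("Baseline accuracy:", round(baseline_accuracy, 3))
```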

#Model Building

####Run a Random Forest as a baseline for the classification

In [15]:
#Features for the baseline model (each column listed once)
features = ['W','L','G','GS','CG','SHO','SV','IP','H','ER','R','HR','BB','SO','season_count']

RF = ensemble.RandomForestClassifier(n_estimators=500, criterion = 'entropy')
RF_fit = RF.fit(pitch_career[features], pitch_career.inducted)

#Pair each importance with its feature name
var_import = pd.DataFrame(RF.feature_importances_)
var_import['feature'] = features
var_import

Unnamed: 0,0,feature
0,0.127819,W
1,0.038027,L
2,0.04604,G
3,0.036515,GS
4,0.094464,CG
5,0.076633,SHO
6,0.036745,SV
7,0.085192,IP
8,0.063168,H
9,0.05792,ER


Among the features shown, the preliminary run ranks Wins highest, which is unsurprising.  The prominence of Complete Games and Innings Pitched is less expected.  Note that this baseline uses raw earned runs (ER), not ERA.

In [30]:
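from sklearn.model_selection import cross_val_score

#3-fold cross-validated accuracy (cv=3 matches the default of the older sklearn.cross_validation API)
RF_cv = cross_val_score(RF,pitch_career[['W','L','G','GS','CG','SHO','SV','IP','ER','R','HR','BB','SO','season_count']], pitch_career.inducted, cv=3)
RF_cv.mean()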


0.87298496159255645

In [17]:
#Adjust max depth
from sklearn.model_selection import cross_val_score

RF_cv_results1 = []
for i in range(1,30):
    RF1 = ensemble.RandomForestClassifier(n_estimators=500, criterion = 'entropy', max_depth=i)
    #3-fold cross-validated accuracy at this depth
    RF_cv1 = cross_val_score(RF1,pitch_career[['W','L','G','GS','CG','SHO','SV','IP','H','ER','R','HR','BB','SO','season_count']], pitch_career.inducted, cv=3)
    RF_cv_results1.append(RF_cv1.mean())
RF_cv_results1

[0.86240939089040358,
 0.86238234339500153,
 0.87295791409715451,
 0.86660175267770201,
 0.87720437087525693,
 0.8750676187385048,
 0.87295791409715451,
 0.87295791409715451,
 0.87295791409715451,
 0.8750676187385048,
 0.86873850481445425,
 0.8750676187385048,
 0.87295791409715451,
 0.87084820945580432,
 0.87717732337985499,
 0.87295791409715451,
 0.87295791409715451,
 0.87717732337985499,
 0.87295791409715451,
 0.8750676187385048,
 0.8750676187385048,
 0.87084820945580432,
 0.8750676187385048,
 0.87084820945580443,
 0.8750676187385048,
 0.87295791409715451,
 0.87295791409715451,
 0.87295791409715451,
 0.87295791409715451]

##Model Tuning

In [29]:
#Attempt to adjust features included. Use WHIP, Wpct, S/W instead of the individual features
#Adjust max depth
from sklearn.model_selection import cross_val_score

RF_cv_results2 = []
for i in range(1,12):
    RF2 = ensemble.RandomForestClassifier(n_estimators=500, criterion = 'entropy', max_depth=i)
    #Score RF2, the model built in this iteration, with 3-fold cross-validation
    RF_cv2 = cross_val_score(RF2,pitch_career[['W','Wpct','CG','SHO','SV','IP','ERA','HR','WHIP', 'S/W']], pitch_career.inducted, cv=3)
    RF_cv_results2.append(RF_cv2.mean())
RF_cv_results2

[0.87501352374770092,
 0.87712322838905121,
 0.88136968516715353,
 0.8707670669695986,
 0.87712322838905121,
 0.87074001947419666,
 0.87074001947419666,
 0.87498647625229908,
 0.87290381910635073,
 0.87712322838905121,
 0.87287677161094879]
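
To read a depth off a list like the one above, recall that the loop starts at max_depth=1, so the list position is offset by one.  A small sketch, with the scores copied from the output above and rounded to four places:

```python
# Mean CV scores for max_depth = 1..11, rounded from the output above
cv_scores = [0.8750, 0.8771, 0.8814, 0.8708, 0.8771, 0.8707,
             0.8707, 0.8750, 0.8729, 0.8771, 0.8729]
depths = range(1, 12)

# Pair each depth with its score and keep the pair with the highest score
best_depth, best_score = max(zip(depths, cv_scores), key=lambda pair: pair[1])

# Gap between the best and worst depth
spread = max(cv_scores) - min(cv_scores)

print("Best depth:", best_depth, "score:", best_score)
print("Spread across depths:", round(spread, 4))
```

The spread is only about one percentage point, so the ranking of depths may well be within run-to-run noise for a dataset of this size.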

In [21]:
#Repeated random train/test splits (not strict K-fold) to get a confusion matrix for each split
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

Folds = 10
random_start = 40

confuse = {} #dictionary to store confusion matrix of each fold
con_tot = np.zeros((2,2), dtype=int) #running total; an array so that += adds element-wise
for fold in range(1,Folds+1):
    seed = random_start + fold
    x_train, x_test, y_train, y_test = train_test_split(pitch_career[['W','L','GS','CG','SHO','SV','IP','ERA','HR','WHIP', 'S/W']], pitch_career.inducted, test_size=0.15, random_state=seed)
    RF = ensemble.RandomForestClassifier(n_estimators=500, criterion = 'entropy', max_depth=7)
    RF_fit = RF.fit(x_train, y_train)
    RF_pred = RF.predict(x_test)
    #Rows are true labels and columns are predictions, both ordered Y then N
    con_i = confusion_matrix(y_test, RF_pred, labels=["Y","N"])
    confuse[fold] = con_i
    
    con_tot += con_i
    
    #NOTE: the names below are kept to match the printed output, but
    #"error" is the share classified correctly (accuracy),
    #"specificity" is TN/(TN+FN) (negative predictive value), and
    #"sensitivity" is TP/(TP+FP) (precision)
    error = (con_i[0][0] + con_i[1][1])/sum(sum(con_i))
    specificity = (con_i[1][1]/(con_i[1][1] + con_i[0][1]))
    sensitivity = (con_i[0][0] /(con_i[0][0] + con_i[1][0]))
    
    print("Fold", fold,"\nError Rate:", round(error,2), "\nSpecificity:", round(specificity,2), "\nSensitivity:", round(sensitivity,2), "\n")

#Same three quantities, pooled over all splits
error_cv = (con_tot[0][0] + con_tot[1][1])/sum(sum(con_tot))
specificity_cv = (con_tot[1][1]/(con_tot[1][1] + con_tot[0][1]))
sensitivity_cv = (con_tot[0][0] /(con_tot[0][0] + con_tot[1][0]))
print("\nCV Error Rate:", round(error_cv,2), "\nCV Specificity:", round(specificity_cv,2), "\nCV Sensitivity:", round(sensitivity_cv,2), "\n")


Fold 1 
Error Rate: 0.87 
Specificity: 0.89 
Sensitivity: 0.78 

Fold 2 
Error Rate: 0.85 
Specificity: 0.87 
Sensitivity: 0.7 

Fold 3 
Error Rate: 0.89 
Specificity: 0.88 
Sensitivity: 1.0 

Fold 4 
Error Rate: 0.89 
Specificity: 0.95 
Sensitivity: 0.5 

Fold 5 
Error Rate: 0.89 
Specificity: 0.92 
Sensitivity: 0.62 

Fold 6 
Error Rate: 0.79 
Specificity: 0.84 
Sensitivity: 0.54 

Fold 7 
Error Rate: 0.86 
Specificity: 0.85 
Sensitivity: 0.89 

Fold 8 
Error Rate: 0.92 
Specificity: 0.95 
Sensitivity: 0.77 

Fold 9 
Error Rate: 0.9 
Specificity: 0.9 
Sensitivity: 0.92 

Fold 10 
Error Rate: 0.89 
Specificity: 0.92 
Sensitivity: 0.62 


CV Error Rate: 0.87 
CV Specificity: 0.9 
CV Sensitivity: 0.73 



Errors tend to skew toward false positives rather than false negatives.  Attempt to apply class weights to account for this.
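
Because the confusion matrices here are built with labels=["Y","N"], row 0 and column 0 correspond to inducted players.  The sketch below spells out the standard definitions under that layout (rows are true classes, columns are predictions, which is scikit-learn's convention).  The helper name `binary_metrics` and the counts in the example matrix are invented for illustration.

```python
def binary_metrics(cm):
    """cm = [[TP, FN], [FP, TN]] with "Y" as the positive class."""
    (tp, fn), (fp, tn) = cm
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of true Y that were found
        "specificity": tn / (tn + fp),   # share of true N that were found
        "precision": tp / (tp + fp),     # share of predicted Y that were right
        "npv": tn / (tn + fn),           # share of predicted N that were right
    }

# Hypothetical counts for one 71-row test split
example_cm = [[8, 4], [3, 56]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Note that TP / (TP + FP) is precision and TN / (TN + FN) is negative predictive value.  They are different quantities from sensitivity and specificity, and they respond differently to class weights.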

In [28]:
#Repeated random train/test splits (not strict K-fold) to get a confusion matrix for each split
#Same procedure as before, now with class weights
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

Folds = 10
random_start = 40

confuse = {} #dictionary to store confusion matrix of each fold
con_tot = np.zeros((2,2), dtype=int) #running total; an array so that += adds element-wise
for fold in range(1,Folds+1):
    seed = random_start + fold
    x_train, x_test, y_train, y_test = train_test_split(pitch_career[['W','L','GS','CG','SHO','SV','IP','ERA','HR','WHIP', 'S/W']], pitch_career.inducted, test_size=0.15, random_state=seed)
    #class_weight makes mistakes on "N" rows cost 10 times as much as mistakes on "Y" rows
    RF = ensemble.RandomForestClassifier(n_estimators=500, criterion = 'entropy', max_depth=7, class_weight = {"N":10, "Y":1})
    RF_fit = RF.fit(x_train, y_train)
    RF_pred = RF.predict(x_test)
    #Rows are true labels and columns are predictions, both ordered Y then N
    con_i = confusion_matrix(y_test, RF_pred, labels=["Y","N"])
    confuse[fold] = con_i
    
    con_tot += con_i
    
    #NOTE: the names below are kept to match the printed output, but
    #"error" is the share classified correctly (accuracy),
    #"specificity" is TN/(TN+FN) (negative predictive value), and
    #"sensitivity" is TP/(TP+FP) (precision)
    error = (con_i[0][0] + con_i[1][1])/sum(sum(con_i))
    specificity = (con_i[1][1]/(con_i[1][1] + con_i[0][1]))
    sensitivity = (con_i[0][0] /(con_i[0][0] + con_i[1][0]))
    
    print("Fold", fold,"\nError Rate:", round(error,2), "\nSpecificity:", round(specificity,2), "\nSensitivity:", round(sensitivity,2), "\n")

#Same three quantities, pooled over all splits
error_cv = (con_tot[0][0] + con_tot[1][1])/sum(sum(con_tot))
specificity_cv = (con_tot[1][1]/(con_tot[1][1] + con_tot[0][1]))
sensitivity_cv = (con_tot[0][0] /(con_tot[0][0] + con_tot[1][0]))
print("\nCV Error Rate:", round(error_cv,2), "\nCV Specificity:", round(specificity_cv,2), "\nCV Sensitivity:", round(sensitivity_cv,2), "\n")


Fold 1 
Error Rate: 0.85 
Specificity: 0.85 
Sensitivity: 0.8 

Fold 2 
Error Rate: 0.83 
Specificity: 0.85 
Sensitivity: 0.67 

Fold 3 
Error Rate: 0.9 
Specificity: 0.89 
Sensitivity: 1.0 

Fold 4 
Error Rate: 0.9 
Specificity: 0.94 
Sensitivity: 0.57 

Fold 5 
Error Rate: 0.86 
Specificity: 0.88 
Sensitivity: 0.5 

Fold 6 
Error Rate: 0.87 
Specificity: 0.86 
Sensitivity: 1.0 

Fold 7 
Error Rate: 0.86 
Specificity: 0.85 
Sensitivity: 0.89 

Fold 8 
Error Rate: 0.92 
Specificity: 0.93 
Sensitivity: 0.82 

Fold 9 
Error Rate: 0.86 
Specificity: 0.84 
Sensitivity: 1.0 

Fold 10 
Error Rate: 0.89 
Specificity: 0.91 
Sensitivity: 0.67 


CV Error Rate: 0.87 
CV Specificity: 0.88 
CV Sensitivity: 0.81 



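After providing weights to the model, the pooled Sensitivity figure improves from 0.73 to 0.81, while the pooled Specificity figure slips from 0.90 to 0.88 and the overall rate is unchanged at 0.87. 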

#Prediction of 2015 and 2016 nominees


Now that an RF model is tuned, I will make predictions for the 2015 and 2016 classes.

As of this analysis (early 2015), the relevant pitchers who will be eligible for Hall of Fame voting are as follows:

#####2015

-	Randy Johnson
-	Pedro Martinez
-	John Smoltz
-	Eddie Guardado
-	Jason Schmidt
-	Curt Schilling
-	Roger Clemens
-	Lee Smith
-	Mike Mussina
-	Troy Percival
-	Tom Gordon

(http://www.baseball-reference.com/awards/hof_2015.shtml)


#####2016

-	Trevor Hoffman	
-	Billy Wagner	
-	Mike Hampton	
-	Mike Mussina	
-	Lee Smith	
-	Roger Clemens	
-	Curt Schilling

http://www.baseball-reference.com/awards/hof_2016.shtml

NOTE: players listed in both 2015 and 2016 will only be tested in 2015.
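
The rule in the note above amounts to a set difference: the 2016 prediction set is the 2016 list minus anyone already listed for 2015.  A short sketch using the names from the two lists (the variable names are illustrative):

```python
ballot_2015 = ["Randy Johnson", "Pedro Martinez", "John Smoltz", "Eddie Guardado",
               "Jason Schmidt", "Curt Schilling", "Roger Clemens", "Lee Smith",
               "Mike Mussina", "Troy Percival", "Tom Gordon"]
ballot_2016 = ["Trevor Hoffman", "Billy Wagner", "Mike Hampton", "Mike Mussina",
               "Lee Smith", "Roger Clemens", "Curt Schilling"]

# Keep the 2016 order, but skip anyone already covered by the 2015 list
seen_2015 = set(ballot_2015)
only_2016 = [name for name in ballot_2016 if name not in seen_2015]

print(only_2016)
```

This leaves three pitchers to be tested as the 2016 group.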

In [44]:
#List PlayerIDs for each of the players of interest
ID15 = ['johnsra05','martipe02','smoltjo01','guarded01','schmija01','schilcu01','clemero02','smithle02','mussimi01','percitr01','gordoto01']
ID16 = ['hoffmtr01', 'wagnebi02', 'hamptmi01']

In [46]:
players_to_pred15 = player_hof(pitch, ID15)
players_to_pred16 = player_hof(pitch, ID16)
players_to_pred15

Unnamed: 0,W,L,G,GS,CG,SHO,SV,IPouts,H,ER,...,R,SH,SF,GIDP,playerID,season_count,Wpct,S/W,WHIP,IP
0,354,184,709,707,118.0,46.0,1e-07,14750,4185,1707,...,1885,37.0,31.0,,clemero02,24,0.499295,2.956962,1.172542,4916.666667
1,138,126,890,203,18.0,4.0,158.0,6324,1889,927,...,1016,16.0,12.0,,gordoto01,21,0.155056,1.973388,1.359583,2108.0
2,46,61,908,25,1e-07,1e-07,187.0,2834,894,453,...,477,24.0,16.0,,guarded01,17,0.050661,2.293103,1.314749,944.666667
3,303,166,618,603,100.0,37.0,2.0,12406,3346,1513,...,1703,75.0,39.0,,johnsra05,22,0.490291,3.256513,1.171127,4135.333333
4,219,100,476,409,46.0,17.0,3.0,8482,2221,919,...,1006,40.0,32.0,,martipe02,18,0.460084,4.15,1.05435,2827.333333
5,270,153,537,536,57.0,23.0,1e-07,10688,3460,1458,...,1559,41.0,43.0,,mussimi01,18,0.502793,3.583439,1.191523,3562.666667
6,35,43,703,1,1e-07,1e-07,358.0,2126,479,250,...,271,6.0,9.0,,percitr01,14,0.049787,2.552288,1.107714,708.666667
7,216,146,569,436,83.0,20.0,22.0,9783,2998,1253,...,1318,50.0,28.0,,schilcu01,20,0.379613,4.38256,1.137381,3261.0
8,130,96,323,314,20.0,9.0,1e-07,5989,1846,878,...,958,48.0,31.0,,schmija01,14,0.402477,2.219697,1.321423,1996.333333
9,71,92,1022,6,1e-07,1e-07,478.0,3868,1133,434,...,475,,,,smithle02,18,0.069472,2.574074,1.255688,1289.333333


In [47]:
RF = ensemble.RandomForestClassifier(n_estimators=500, criterion = 'entropy', max_depth=8)
RF_fit = RF.fit(pitch_career[['W','Wpct','CG','SHO','IP','ERA','HR','WHIP', 'S/W']], pitch_career.inducted)
RF_pred15 = RF.predict(players_to_pred15[['W','Wpct','CG','SHO','IP','ERA','HR','WHIP', 'S/W']])
RF_pred16 = RF.predict(players_to_pred16[['W','Wpct','CG','SHO','IP','ERA','HR','WHIP', 'S/W']])

var_import = pd.DataFrame(RF.feature_importances_)
var_import['feature'] = ['W','Wpct','CG','SHO','IP','ERA','HR','WHIP', 'S/W']
var_import

Unnamed: 0,0,feature
0,0.185396,W
1,0.09717,Wpct
2,0.121733,CG
3,0.090136,SHO
4,0.148682,IP
5,0.097622,ERA
6,0.074551,HR
7,0.107184,WHIP
8,0.077525,S/W


In this analysis the greatest factors were the number of wins and the number of innings pitched.  This possibly lends support to the idea that longevity is valued in pitching.
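
The ranking is easier to read once the table is sorted.  The sketch below copies the importances from the output above and orders them.  Impurity-based importances from a random forest are normalized, so the values should sum to roughly 1.

```python
# Importances copied from the table above
importances = {
    "W": 0.185396, "Wpct": 0.09717, "CG": 0.121733, "SHO": 0.090136,
    "IP": 0.148682, "ERA": 0.097622, "HR": 0.074551, "WHIP": 0.107184,
    "S/W": 0.077525,
}

# Sort features from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
top_three = [name for name, _ in ranked[:3]]
total = sum(importances.values())

for name, value in ranked:
    print(name, value)
print("Sum of importances:", round(total, 4))
```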


In [50]:
#Match playerIDs to predictions
pred15 = [[player_id, label] for player_id, label in zip(players_to_pred15.playerID, RF_pred15)]

pred16 = [[player_id, label] for player_id, label in zip(players_to_pred16.playerID, RF_pred16)]

In [58]:
print(pred15)

print('\n',pred16)

[['clemero02', 'N'], ['gordoto01', 'N'], ['guarded01', 'N'], ['johnsra05', 'Y'], ['martipe02', 'Y'], ['mussimi01', 'N'], ['percitr01', 'N'], ['schilcu01', 'N'], ['schmija01', 'N'], ['smithle02', 'N'], ['smoltjo01', 'Y']]

 [['hamptmi01', 'N'], ['hoffmtr01', 'N'], ['wagnebi02', 'N']]


The above model predicted that Randy Johnson, Pedro Martinez, and John Smoltz would eventually be elected into the HOF.  Each of those players was elected in 2015.

http://www.baseball-reference.com/awards/hof_2015.shtml

As stated earlier, the intent of this model is to predict if a player will eventually make it into the HOF and does not take into consideration the number of votes needed to get in.  As such, it will simply take time to evaluate whether the predictions for the 11 players not expected to make the HOF hold true.
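
For later scoring, the tally behind these counts can be reproduced from the printed prediction lists.  A small sketch, with both lists copied from the output above:

```python
from collections import Counter

# Predictions copied from the printed output
pred15 = [['clemero02', 'N'], ['gordoto01', 'N'], ['guarded01', 'N'],
          ['johnsra05', 'Y'], ['martipe02', 'Y'], ['mussimi01', 'N'],
          ['percitr01', 'N'], ['schilcu01', 'N'], ['schmija01', 'N'],
          ['smithle02', 'N'], ['smoltjo01', 'Y']]
pred16 = [['hamptmi01', 'N'], ['hoffmtr01', 'N'], ['wagnebi02', 'N']]

all_preds = pred15 + pred16

# Count predicted labels and list the players predicted to be inducted
tally = Counter(label for _player_id, label in all_preds)
predicted_in = [player_id for player_id, label in all_preds if label == "Y"]

print(dict(tally))
print(predicted_in)
```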