# HW4: Baseball Modeling

For this problem please install the <i>Lahman</i> package, a comprehensive package about Baseball statistics,
and use it to answer a few questions.

Important information:
<ul><li>
project home page (with links to impressive graphics):  http://lahman.r-forge.r-project.org/
</li><li>
package documentation (html):  http://lahman.r-forge.r-project.org/doc/
</li></ul>

The documentation includes descriptions of the many tables in this package, such as the
Salaries table: http://lahman.r-forge.r-project.org/doc/Salaries.html


#  The Goal

There are two problems for you to solve:
<ul><li>
Problem 1: construct a model that predicts a player's salary based on his baseball statistics.
You are asked to upload salary predictions for a test dataset; your score will be based on
the <b>RSS (residual sum of squares)</b>.
</li><li>
Problem 2: construct a model that predicts whether a player will be inducted into the Hall of Fame.
Your score will be based on the <b>accuracy rate</b> of the uploaded predictions from your model.
</li></ul>
Important: 50% of the test set for the Hall of Fame will have a Hall-of-Fame classification.
Keep this in mind when making your predictions.

Then, as in HW3, upload a .csv file containing your models to CCLE.

## Step 1: build the models

Using the 'RelevantInformation' table, one model should predict a player's maximum salary,
the other should predict whether or not they will get into the Hall of Fame.

<b>YOU CAN USE ANY MODEL YOU LIKE.</b>
The baseline models are a linear regression model and a logistic regression model,
but you can choose <i>any</i> model.
Please produce the most accurate models you can;
more accurate models will get a higher score.

<hr style="border-width:20px;">

## Step 2: generate two CSV files with predictions for your models on the Test Datasets

To complete the assignment, run your models on the test datasets:
<ul><li>
<tt>HW4_Baseball_Salary_test.csv</tt>
</li><li>
<tt>HW4_Baseball_HallOfFame_test.csv</tt>
</li></ul>
<br/>
From these generate two files of predictions:
<ul><li>
<tt>HW4_Baseball_Salary_predictions.csv</tt>
</li><li>
<tt>HW4_Baseball_HallOfFame_predictions.csv</tt>
</li></ul>
<br/>
Each prediction file contains only one column; salary predictions are integers, and HallOfFame predictions are either 0 or 1 (as in the input dataset).
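A prediction file in this format could be produced roughly as follows. This is only a sketch: the prediction vectors are made-up values, and the files are written to temporary paths rather than to the real submission filenames. `format(..., scientific = FALSE)` is used here because R may otherwise write large round numbers such as 100000 as `1e+05`.

```r
# Hypothetical predictions (illustrative values only)
salary_pred <- c(523456.7, 100000.2, 4500000.0)   # dollars, from some model
hof_prob    <- c(0.91, 0.08, 0.55)                # predicted probabilities

# Salary predictions: round to whole dollars and avoid scientific notation
salary_out <- format(round(salary_pred), scientific = FALSE, trim = TRUE)

# Hall of Fame predictions: convert probabilities to 0/1 labels
hof_out <- as.integer(hof_prob > 0.5)

# One column, no header, no row names
salary_file <- tempfile(fileext = ".csv")
hof_file    <- tempfile(fileext = ".csv")
write.table(salary_out, salary_file, quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(hof_out,    hof_file,    quote = FALSE, row.names = FALSE, col.names = FALSE)

readLines(salary_file)
readLines(hof_file)
```

The 0.5 cutoff is an assumption for illustration; any rule that yields 0/1 labels fits the required format.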

Also please upload a file <tt>HW4_Baseball_Models.csv</tt> that specifies your models
in two lines, one for each model:
<code>
      0.8999,"lm( log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB, data = RecentPlayersAndStatsAndSalary)"
      0.7888,"glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+SlugPct, data = HallOfFameContenders, family=binomial)"
</code>

<b>Each line gives the accuracy of a model</b>,
as well as <b>the command you used to generate the model</b>.
There is no length restriction on the lines.

<br/>

Your score will not depend on the <tt>HW4_Baseball_Models.csv</tt> file.
We want this information to see which models do well on these problems.
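One way to write the two-line models file in the layout shown above is sketched here. The scores are the placeholder values from the example, the model commands are shortened, and the output goes to a temporary file; none of these are real results.

```r
# Placeholder scores and shortened model commands (not real results)
models <- data.frame(
    score   = c(0.8999, 0.7888),
    command = c("lm( log10(max_salary) ~ AB+R+H, data = RecentPlayersAndStatsAndSalary)",
                "glm( HallOfFame ~ AB+R+H, data = HallOfFameContenders, family=binomial)"),
    stringsAsFactors = FALSE
)

models_file <- tempfile(fileext = ".csv")

# quote = 2 quotes only the second column (the command string)
write.table(models, models_file, sep = ",", quote = 2,
            row.names = FALSE, col.names = FALSE)

readLines(models_file)
```

Writing both rows in a single call avoids the need for `append=TRUE`, so re-running the cell does not duplicate lines.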

<hr style="border-width:20px;">

## Step 3: upload your CSV files and notebook to CCLE

Finally, go to CCLE and upload:
<ul><li>
<tt>HW4_Baseball_Salary_predictions.csv</tt>  -- your salary predictions
</li><li>
<tt>HW4_Baseball_HallOfFame_predictions.csv</tt>  -- your Hall of Fame predictions
</li><li>
<tt>HW4_Baseball_Models.csv</tt> -- your models
</li><li>
<tt>HW4_Baseball_Modeling.ipynb</tt>  -- your notebook.
</li></ul>

We are not planning to run any of the uploaded notebooks.
However, your notebook should contain the commands you used in developing your models,
in order to show your work.
As announced, all assignment grading in this course will be automated,
and the notebook is needed in order to check results of the grading program.

# Get the Lahman package for R -- a database of Baseball Statistics

<hr style="border-width:20px;">

### The safe way to install it, so it will work with Jupyter -- execute the command:

<pre>
         sudo conda install -c https://conda.anaconda.org/asmeurer r-lahman
</pre>
### (The 'sudo' is not necessary if your conda installation is not write-protected.)

<hr style="border-width:20px;">

### Another way to install the Lahman package (if this works from within your Jupyter session):

In [1]:
if (!(is.element("Lahman", installed.packages()))) install.packages("Lahman", repos="http://cran.us.r-project.org")

### Load the Lahman baseball data

In [2]:
library(Lahman)

<hr style="border-width:20px;">

### Another way to get the data, if you cannot load the Lahman package:

The files
<ul><li>
<tt>PlayersAndStats.csv</tt>
</li><li>
<tt>PlayersAndStatsAndSalary.csv</tt>
</li></ul>
are distributed with the homework assignment, and are used in the notebook below.

You can use these files rather than recompute the tables using the Lahman package.

# Extract Tables of Relevant Information for your Models

### Player information -- from the Master table
http://lahman.r-forge.r-project.org/doc/Master.html

In [3]:
SelectedColumns = c("playerID","nameFirst","nameLast","birthYear", "weight","height","bats","throws")
Players = na.omit( Master[, SelectedColumns] )
summary(Players)

   playerID          nameFirst           nameLast           birthYear   
 Length:17071       Length:17071       Length:17071       Min.   :1835  
 Class :character   Class :character   Class :character   1st Qu.:1902  
 Mode  :character   Mode  :character   Mode  :character   Median :1941  
                                                          Mean   :1935  
                                                          3rd Qu.:1969  
                                                          Max.   :1994  
     weight          height      bats      throws   
 Min.   : 65.0   Min.   :43.00   B: 1131   L: 3430  
 1st Qu.:170.0   1st Qu.:71.00   L: 4721   R:13641  
 Median :185.0   Median :72.00   R:11219            
 Mean   :186.2   Mean   :72.34                      
 3rd Qu.:200.0   3rd Qu.:74.00                      
 Max.   :320.0   Max.   :83.00                      

### Player Maximum Salary -- from the Salaries table
http://lahman.r-forge.r-project.org/doc/Salaries.html

In [4]:
summary(Salaries)

# example(Salaries)  # see demos of results from the Salaries table

PlayerMaxSalary = aggregate( salary ~ playerID, Salaries, max )
colnames(PlayerMaxSalary) = gsub( "salary", "max_salary", colnames(PlayerMaxSalary) )

head(PlayerMaxSalary)

     yearID         teamID      lgID         playerID        
 Min.   :1985   CLE    :  893   AL:12123   Length:24758      
 1st Qu.:1993   LAN    :  893   NL:12635   Class :character  
 Median :2000   PHI    :  893              Mode  :character  
 Mean   :2000   SLN    :  886                                
 3rd Qu.:2007   BAL    :  883                                
 Max.   :2014   BOS    :  883                                
                (Other):19427                                
     salary        
 Min.   :       0  
 1st Qu.:  260000  
 Median :  525000  
 Mean   : 1932905  
 3rd Qu.: 2199643  
 Max.   :33000000  
                   

Unnamed: 0,playerID,max_salary
1,aardsda01,4500000
2,aasedo01,675000
3,abadan01,327000
4,abadfe01,525900
5,abbotje01,300000
6,abbotji01,2775000


In [5]:
PlayerStartYear = aggregate( yearID ~ playerID, Salaries, min )
colnames(PlayerStartYear) = gsub( "yearID", "startYear", colnames(PlayerStartYear) )

PlayerEndYear = aggregate( yearID ~ playerID, Salaries, max )
colnames(PlayerEndYear) = gsub( "yearID", "endYear", colnames(PlayerEndYear) )

head(PlayerStartYear)

Unnamed: 0,playerID,startYear
1,aardsda01,2004
2,aasedo01,1986
3,abadan01,2006
4,abadfe01,2011
5,abbotje01,1998
6,abbotji01,1989


### Batting Statistics -- from the BattingStats table
http://lahman.r-forge.r-project.org/doc/battingStats.html
   
(See also the Batting table:
http://lahman.r-forge.r-project.org/doc/Batting.html )

A glossary for Baseball Statistics Acronyms is in
   http://en.wikipedia.org/wiki/Baseball_statistics

In [6]:
BattingStats = battingStats()

### Aggregate Batting Stats -- cumulative, over a player's career

In [7]:
TotalBattingCounts = aggregate( cbind(AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB) ~ playerID,
                                     BattingStats, sum)
nrow(TotalBattingCounts)
MaxBattingPcts = aggregate( cbind(SlugPct,OBP,OPS,BABIP) ~ playerID,
                                 BattingStats, max )
nrow(MaxBattingPcts)

AggregateBattingStats = merge(TotalBattingCounts,MaxBattingPcts, by="playerID")
summary(AggregateBattingStats)
nrow(AggregateBattingStats)

   playerID               AB                R                H         
 Length:11532       Min.   :    1.0   Min.   :   0.0   Min.   :   0.0  
 Class :character   1st Qu.:   19.0   1st Qu.:   1.0   1st Qu.:   3.0  
 Mode  :character   Median :  136.5   Median :  12.0   Median :  25.0  
                    Mean   :  896.7   Mean   : 117.6   Mean   : 234.8  
                    3rd Qu.:  834.5   3rd Qu.:  95.0   3rd Qu.: 199.0  
                    Max.   :14053.0   Max.   :2295.0   Max.   :4256.0  
      X2B              X3B                HR             RBI        
 Min.   :  0.00   Min.   :  0.000   Min.   :  0.0   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.0   1st Qu.:   1.0  
 Median :  4.00   Median :  0.000   Median :  1.0   Median :  10.0  
 Mean   : 41.29   Mean   :  6.723   Mean   : 21.4   Mean   : 109.6  
 3rd Qu.: 33.00   3rd Qu.:  5.000   3rd Qu.: 10.0   3rd Qu.:  85.0  
 Max.   :746.00   Max.   :173.000   Max.   :762.0   Max.   :2297.0  
       SB    
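Note that the aggregation above sums `BA` over seasons, so the resulting `BA` column is not a batting average (values above 1 appear in the tables in this notebook, e.g. 6.927 for `aaronha01`). If a career batting average is wanted as a predictor, it can be recomputed from the career totals as H/AB. The sketch below uses two rows copied from tables shown in this notebook; the column names `BA_sum` and `BA_career` are introduced here for illustration.

```r
# Career totals copied from rows shown in this notebook
career <- data.frame(
    playerID = c("aaronha01", "abbotje01"),
    AB       = c(12364, 596),
    H        = c(3771, 157),
    BA_sum   = c(6.927, 1.236)   # sum of per-season BA, as produced by aggregate(..., sum)
)

# Career batting average = total hits / total at-bats
career$BA_career <- round(career$H / career$AB, 3)
career
```

Whether the summed or the recomputed value predicts better is an empirical question; the summed value partly acts as a proxy for the number of seasons played.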

### Inducted into the Hall of Fame?  -- from the HallOfFame table
http://lahman.r-forge.r-project.org/doc/HallOfFame.html

In [8]:
data(HallOfFame)
head(HallOfFame)

InductedIntoHallOfFame = subset(HallOfFame, inducted == 'Y')[ , 1:2]

head(InductedIntoHallOfFame)
nrow(InductedIntoHallOfFame)

Unnamed: 0,playerID,yearID,votedBy,ballots,needed,votes,inducted,category,needed_note
1,cobbty01,1936,BBWAA,226,170,222,Y,Player,
2,ruthba01,1936,BBWAA,226,170,215,Y,Player,
3,wagneho01,1936,BBWAA,226,170,215,Y,Player,
4,mathech01,1936,BBWAA,226,170,205,Y,Player,
5,johnswa01,1936,BBWAA,226,170,189,Y,Player,
6,lajoina01,1936,BBWAA,226,170,146,N,Player,


Unnamed: 0,playerID,yearID
1,cobbty01,1936
2,ruthba01,1936
3,wagneho01,1936
4,mathech01,1936
5,johnswa01,1936
111,lajoina01,1937


### Include HallOfFame information in the Players table

In [9]:
PlayersWithHallOfFame = transform( merge( Players, InductedIntoHallOfFame, all.x=TRUE, by="playerID"),
                                        HallOfFame = ifelse( is.na(yearID), 0, 1 ),
                                        yearID = ifelse( is.na(yearID), 0, yearID )
                                        )
colnames(PlayersWithHallOfFame) = gsub( "yearID", "HallOfFameYear", colnames(PlayersWithHallOfFame) )
head(PlayersWithHallOfFame, 20)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0
2,aaronha01,Hank,Aaron,1934,180,72,R,R,1982,1
3,aaronto01,Tommie,Aaron,1939,190,75,R,R,0,0
4,aasedo01,Don,Aase,1954,190,75,R,R,0,0
5,abadan01,Andy,Abad,1972,184,73,L,L,0,0
6,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0
7,abadijo01,John,Abadie,1854,192,72,R,R,0,0
8,abbated01,Ed,Abbaticchio,1877,170,71,R,R,0,0
9,abbeybe01,Bert,Abbey,1869,175,71,R,R,0,0
10,abbeych01,Charlie,Abbey,1866,169,68,L,L,0,0
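The merge above is a left join followed by converting missing matches into a 0/1 flag. A minimal sketch of the same pattern on toy tables (made-up player IDs) may make the mechanics easier to follow:

```r
# Toy tables (made-up IDs), standing in for Players and InductedIntoHallOfFame
players  <- data.frame(playerID = c("p1", "p2", "p3"), stringsAsFactors = FALSE)
inducted <- data.frame(playerID = "p2", yearID = 1982, stringsAsFactors = FALSE)

# all.x = TRUE keeps every player; players without a match get yearID = NA
joined <- merge(players, inducted, by = "playerID", all.x = TRUE)

# Turn the NA pattern into a 0/1 flag, then replace NA years with 0.
# transform() evaluates both expressions against the original columns,
# so HallOfFame still sees the NA values.
joined <- transform(joined,
                    HallOfFame = ifelse(is.na(yearID), 0, 1),
                    yearID     = ifelse(is.na(yearID), 0, yearID))
joined
```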


In [10]:
nrow(PlayersWithHallOfFame)
nrow(subset(PlayersWithHallOfFame, HallOfFame == 1))

In [11]:
PlayersAndStats = merge( PlayersWithHallOfFame, AggregateBattingStats )

nrow(PlayersAndStats)
nrow(subset(PlayersAndStats, HallOfFame == 1))

# write.csv(PlayersAndStats, file="PlayersAndStats.csv", quote=FALSE, row.names=FALSE)

# Join Information for your Baseball Salary model into one Table

### Merge Aggregate Batting Statistics and Maximum Salary into the Relevant Information table

In [12]:
PlayersAndStatsAndSalary = transform(
    merge( merge( merge( PlayersAndStats, PlayerMaxSalary ), PlayerStartYear), PlayerEndYear ),
    totalYears = endYear - startYear + 1
    )
head(PlayersAndStatsAndSalary)
nrow(PlayersAndStatsAndSalary)

# write.csv(PlayersAndStatsAndSalary, file="PlayersAndStatsAndSalary.csv", quote=FALSE, row.names=FALSE)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB,SlugPct,OBP,OPS,BABIP,max_salary,startYear,endYear,totalYears
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0,3,0,0,0,0,0,0,0,0,0,0.0,4,0,0.0,0.0,0.0,0.0,4500000,2004,2012,9
2,aasedo01,Don,Aase,1954,190,75,R,R,0,0,5,0,0,0,0,0,0,0,0,0,0.0,5,0,0.0,0.0,0.0,0.0,675000,1986,1989,4
3,abadan01,Andy,Abad,1972,184,73,L,L,0,0,21,1,2,0,0,0,0,0,1,4,0.118,25,2,0.118,0.4,0.4,0.167,327000,2006,2006,1
4,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0,8,0,1,0,0,0,0,0,0,0,0.143,8,1,0.143,0.143,0.286,0.25,525900,2011,2014,4
5,abbotje01,Jeff,Abbott,1972,190,74,R,L,0,0,596,82,157,33,2,18,83,6,5,38,1.236,649,248,0.492,0.343,0.79,0.32,300000,1998,2001,4
6,abbotji01,Jim,Abbott,1967,200,75,L,L,0,0,21,0,2,0,0,0,3,0,0,0,0.095,24,2,0.095,0.095,0.19,0.182,2775000,1989,1999,11


# Problem 1: construct a model like this Baseline Salary Model

### For this salary model, consider only those players whose salary records start in 2000 or later:

In [13]:
RecentPlayersAndStatsAndSalary = subset( PlayersAndStatsAndSalary, startYear >= 2000 )
nrow(RecentPlayersAndStatsAndSalary)

In [14]:
summary(PlayersAndStatsAndSalary)

   playerID          nameFirst           nameLast           birthYear   
 Length:4090        Length:4090        Length:4090        Min.   :1925  
 Class :character   Class :character   Class :character   1st Qu.:1964  
 Mode  :character   Mode  :character   Mode  :character   Median :1972  
                                                          Mean   :1972  
                                                          3rd Qu.:1980  
                                                          Max.   :1993  
     weight          height     bats     throws   HallOfFameYear   
 Min.   :140.0   Min.   :66.0   B: 397   L: 830   Min.   :   0.00  
 1st Qu.:182.2   1st Qu.:72.0   L:1158   R:3260   1st Qu.:   0.00  
 Median :195.0   Median :73.0   R:2535            Median :   0.00  
 Mean   :197.7   Mean   :73.4                     Mean   :  18.62  
 3rd Qu.:210.0   3rd Qu.:75.0                     3rd Qu.:   0.00  
 Max.   :295.0   Max.   :83.0                     Max.   :2015.00  
   HallOfFame

In [15]:
BaselineSalaryModel = lm( log10(max_salary) ~
                         AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP + startYear + totalYears,
                         data = RecentPlayersAndStatsAndSalary)
summary(BaselineSalaryModel)


Call:
lm(formula = log10(max_salary) ~ AB + R + H + X2B + X3B + HR + 
    RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + startYear + 
    totalYears, data = RecentPlayersAndStatsAndSalary)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.11262 -0.16561 -0.03839  0.14415  1.53555 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.371e+01  4.032e+00 -15.803  < 2e-16 ***
AB          -3.093e-03  5.457e-04  -5.668 1.69e-08 ***
R           -1.767e-03  5.068e-04  -3.487 0.000500 ***
H            8.609e-04  3.506e-04   2.456 0.014160 *  
X2B          1.408e-03  7.283e-04   1.933 0.053370 .  
X3B          3.012e-03  1.658e-03   1.817 0.069368 .  
HR           2.309e-03  1.020e-03   2.263 0.023764 *  
RBI          9.654e-05  4.981e-04   0.194 0.846337    
SB           8.782e-04  5.399e-04   1.627 0.103968    
CS          -7.303e-04  1.786e-03  -0.409 0.682640    
BB          -2.381e-03  5.068e-04  -4.698 2.84e-06 ***
BA          -1.403e-01 

In [16]:
#  Create your model here ...
library(caret)

MySalaryModel <- train(
    log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP
                        + startYear + totalYears + birthYear + height + weight + endYear,
    data = RecentPlayersAndStatsAndSalary,
    method = "cubist",
    preProc = c("center", "scale"),
    trControl = trainControl(method = "repeatedcv", repeats = 5)
)

MySalaryModel

Loading required package: lattice
Loading required package: ggplot2
Loading required package: Cubist


Cubist 

1720 samples
  30 predictor

Pre-processing: centered (21), scaled (21) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 1549, 1548, 1548, 1548, 1547, 1549, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE       Rsquared   RMSE SD     Rsquared SD
   1          0          0.2573236  0.7922028  0.02975086  0.04417651 
   1          5          0.2643321  0.7800510  0.02954271  0.04493142 
   1          9          0.2582840  0.7893002  0.03010258  0.04532795 
  10          0          0.2301070  0.8310504  0.02427305  0.03383936 
  10          5          0.2406527  0.8147543  0.02297689  0.03334637 
  10          9          0.2328361  0.8260252  0.02406843  0.03449360 
  20          0          0.2286208  0.8332726  0.02367124  0.03265563 
  20          5          0.2393460  0.8168718  0.02211373  0.03199302 
  20          9          0.2314904  0.8281358  0.02342924  0.03339142 

RMSE was used to select the optimal mo

In [17]:
RecentPlayersAndStatsAndSalary_test = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_Baseball/HW4_Baseball_Salary_test.csv"), header=TRUE )

RecentPlayersAndStatsAndSalary_test 

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB,SlugPct,OBP,OPS,BABIP,startYear,endYear,totalYears
1,fornean01,Andrew,Forney,1982,234,72,R,R,0,0,32,3,5,0,0,1,2,0,0,0,0.263,35,11,0.437,0.261,0.696,0.33,2007,2013,6
2,mokasta01,Tanaya,Mokashi,1991,226,76,L,R,0,0,1309,211,357,61,13,56,149,30,13,154,0.813,1490,608,0.486,0.371,0.854,0.351,2012,2013,3
3,shahma01,Mauli,Shah,1987,204,74,R,R,0,0,958,90,270,57,3,7,106,7,3,43,1.346,1021,360,0.421,0.341,0.762,0.341,2013,2013,4
4,vahabar01,Arash,Vahabpour,1983,226,75,L,L,0,0,389,42,95,13,7,9,40,0,2,21,1.786,448,146,0.642,0.4,1.029,0.425,2002,2011,9
5,wenigbr01,Brooke,Wenig,1983,194,73,L,L,0,0,3408,430,932,197,12,146,520,7,5,274,2.544,3727,1590,0.597,0.412,1.015,0.431,2008,2015,6
6,wangha01,Haoying,Wang,1975,191,72,L,L,0,0,1674,194,469,93,10,34,222,11,8,106,2.602,1818,684,0.484,0.378,0.856,0.351,2005,2011,9
7,zhangzi01,Zihao,Zhang,1988,216,74,R,R,0,0,1262,123,295,48,1,42,153,0,3,66,0.984,1357,472,0.48,0.31,0.787,0.301,2014,2014,1
8,lyuha01,Hanci,Lyu,1979,186,72,R,R,0,0,387,52,101,13,4,13,52,4,2,42,1.044,439,160,0.577,0.39,0.976,0.393,2001,2006,7
9,shettpu01,Punit,Shetty,1988,229,73,L,R,0,0,1268,118,342,76,2,26,152,16,1,121,1.33,1410,501,0.547,0.398,0.945,0.385,2013,2013,2
10,harrikr01,Kristina,Harris,1978,194,74,R,R,0,0,2692,369,677,147,9,114,434,18,18,189,2.261,2923,1186,0.49,0.337,0.818,0.322,2003,2008,5


In [56]:
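# RSS on the original dollar scale: the models predict log10(max_salary),
# so predictions are transformed back with 10^ before comparing.
regression_RSS = function( model, test.data, test.solutions ) {
     predictions = predict( model, test.data )
     RSS = sum( (10^predictions - test.solutions)^2 )
     return(RSS)
 }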

BaselineSalaryModel_RSS = regression_RSS(BaselineSalaryModel,RecentPlayersAndStatsAndSalary,RecentPlayersAndStatsAndSalary$max_salary)
BaselineSalaryModel_RSS

MySalaryModel_RSS = regression_RSS(MySalaryModel,RecentPlayersAndStatsAndSalary,RecentPlayersAndStatsAndSalary$max_salary)
MySalaryModel_RSS
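The two RSS values above are computed on `RecentPlayersAndStatsAndSalary`, the same rows the models were fitted on, so they are likely to be optimistic compared with the RSS on the test file. A rough sketch of evaluating on held-out rows instead is given here; the data are synthetic (a made-up relationship between `AB` and salary), and the 25% hold-out fraction is an arbitrary choice.

```r
set.seed(1)

# Synthetic stand-in for the salary table (made-up relationship)
n  <- 200
AB <- runif(n, 0, 5000)
toy <- data.frame(AB = AB,
                  max_salary = 10^(5.5 + 0.0002 * AB + rnorm(n, sd = 0.2)))

# Hold out 25% of the rows for evaluation
test_idx <- sample(n, size = n / 4)
train <- toy[-test_idx, ]
test  <- toy[ test_idx, ]

fit <- lm(log10(max_salary) ~ AB, data = train)

# RSS in dollars: back-transform the log10 predictions before comparing
rss <- function(model, data) sum((10^predict(model, data) - data$max_salary)^2)

c(train_RSS_per_row = rss(fit, train) / nrow(train),
  test_RSS_per_row  = rss(fit, test)  / nrow(test))
```

Dividing by the number of rows makes the two numbers comparable, since the raw RSS grows with the size of the data set.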



In [57]:
# R-squared on the log10 scale: 1 - RSS/TSS
regression_accuracy = function( model, test.data, test.solutions ) {
    predictions = predict( model, test.data )
    log.solutions = log10(test.solutions)
    RSS = sum( (predictions - log.solutions)^2 )
    TSS = sum( (log.solutions - mean(log.solutions))^2 )
    return( 1 - RSS/TSS )
 }

In [58]:
BaselineSalaryModel_RSquared = regression_accuracy(BaselineSalaryModel,RecentPlayersAndStatsAndSalary,RecentPlayersAndStatsAndSalary$max_salary)
BaselineSalaryModel_RSquared

MySalaryModel_RSquared = regression_accuracy(MySalaryModel,RecentPlayersAndStatsAndSalary,RecentPlayersAndStatsAndSalary$max_salary)
MySalaryModel_RSquared

In [24]:
testSalaryData = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_Baseball/HW4_Baseball_Salary_test.csv"), header=TRUE )

In [25]:
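predictedLogSalary = predict(MySalaryModel, testSalaryData)
predictedSalary = round(10^predictedLogSalary)   # salary predictions must be integers

# scientific=FALSE keeps values such as 100000 from being written as 1e+05;
# append=FALSE so that re-running the cell overwrites the file instead of adding duplicate rows
write.table(format(predictedSalary, scientific=FALSE, trim=TRUE), file="/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_Baseball_Salary_predictions.csv",sep=",",quote=FALSE,append=FALSE,row.names=FALSE,col.names=FALSE)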

## Upload the predictions of your model on HW4_Baseball_Salary_test.csv


Generate predictions from your model for the players in the file <tt>HW4_Baseball_Salary_test.csv</tt>.
Your predictions will be evaluated by their RSS (residual sum of squares).


# Problem 2: construct a model with better performance  (higher accuracy) than this Baseline Hall of Fame Model

###  Hall of Fame election rules:


A. A baseball player must have been active as a player in the Major Leagues at some time during a period beginning fifteen (15) years before and ending five (5) years prior to election.

B. Player must have played in each of ten (10) Major League championship seasons, some part of which must have been within the period described in 3(A).

C. Player shall have ceased to be an active player in the Major Leagues at least five (5) calendar years preceding the election but may be otherwise connected with baseball.

### Consequently:   only consider players born before 1970
(They must start around 20 years of age, play at least 10 years, have stopped playing at least 5 years earlier, and take perhaps 10 years to win the ballot -- so born at least 45 years ago.)

In [26]:
HallOfFameContenders = subset( PlayersAndStats, birthYear < 1970 )
head(HallOfFameContenders)
nrow(HallOfFameContenders)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB,SlugPct,OBP,OPS,BABIP
2,aaronha01,Hank,Aaron,1934,180,72,R,R,1982,1,12364,2174,3771,624,98,755,2297,240,73,1402,6.927,13940,6856,0.669,0.41,1.079,0.338
3,aaronto01,Tommie,Aaron,1939,190,75,R,R,0,0,944,102,216,42,6,13,94,9,8,86,1.545,1045,309,0.374,0.318,0.686,0.276
4,aasedo01,Don,Aase,1954,190,75,R,R,0,0,5,0,0,0,0,0,0,0,0,0,0.0,5,0,0.0,0.0,0.0,0.0
7,abadijo01,John,Abadie,1854,192,72,R,R,0,0,49,4,11,0,0,0,5,1,0,0,0.472,49,11,0.25,0.25,0.5,0.25
9,abbotji01,Jim,Abbott,1967,200,75,L,L,0,0,21,0,2,0,0,0,3,0,0,0,0.095,24,2,0.095,0.095,0.19,0.182
10,abbotku01,Kurt,Abbott,1969,180,71,R,R,0,0,2044,273,523,109,23,62,242,22,11,133,2.511,2227,864,0.465,0.326,0.77,0.354


In [27]:
BaselineHallOfFameModel = glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP,
                         data = HallOfFameContenders, family=binomial)

summary(BaselineHallOfFameModel)


Call:
glm(formula = HallOfFame ~ AB + R + H + X2B + X3B + HR + RBI + 
    SB + CS + BB + BA + PA + SlugPct + OBP + BABIP, family = binomial, 
    data = HallOfFameContenders)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0236  -0.1609  -0.1354  -0.1225   3.2096  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.177921   0.245464 -21.094  < 2e-16 ***
AB          -0.015193   0.002452  -6.195 5.82e-10 ***
R            0.004717   0.001993   2.366  0.01797 *  
H            0.004591   0.001633   2.812  0.00493 ** 
X2B         -0.017811   0.003509  -5.076 3.86e-07 ***
X3B          0.021059   0.006559   3.210  0.00133 ** 
HR          -0.007845   0.003394  -2.312  0.02080 *  
RBI          0.006136   0.001578   3.890  0.00010 ***
SB           0.005979   0.002021   2.958  0.00309 ** 
CS          -0.034386   0.007499  -4.586 4.53e-06 ***
BB          -0.013913   0.002313  -6.015 1.80e-09 ***
BA           0.065975   0.135318   0.488  0.625

In [28]:
confusionMatrix = table( round(predict(BaselineHallOfFameModel, type="response")), HallOfFameContenders$HallOfFame )
confusionMatrix
# poor sensitivity:  only 38 of the 193 Hall-of-Fame players were identified correctly:

   
       0    1
  0 7899  155
  1   19   38

##  Warning!  This dataset is severely imbalanced.  Read Ch.16 of [APM]

Only about 2.4% of these players (193 of 8111) are inducted into the Hall of Fame:

In [29]:
( FameTally = table( HallOfFameContenders$HallOfFame ) )


   0    1 
7918  193 

In [30]:
data.frame( percentageOfHallOfFamers = FameTally[2] / sum(FameTally) )

Unnamed: 0,percentageOfHallOfFamers
1,0.02379485


## Finally:  Upload the predictions of your model on HW4_Baseball_HallOfFame_test.csv

Use your model to generate predictions for players in the file <tt>HW4_Baseball_HallOfFame_test.csv</tt>,
and upload them to CCLE.


##  Important:  The test dataset has as many Hall-of-Fame players as non-Hall-of-Fame players


Even though classifying everybody as a NON-Hall-of-Fame player is correct
for about 98% of all players, Hall-of-Fame players carry the same weight as non-Hall-of-Fame players
in this test data.
A model that ignores Hall-of-Fame players will get a very low score on this assignment.

Specifically, your predictions will be evaluated by their <b>Accuracy-Rate</b>:
<blockquote>
This rate is a weighted percentage of correct predictions
for players in the Hall of Fame.  Because half of the test data are in the Hall of Fame,
<u>correct prediction for players in the Hall of Fame is weighted heavily.</u>
</blockquote>
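One common way to make a probabilistic classifier pay more attention to the rare class is to lower the probability cutoff, for example to the positive rate observed in the training data. The sketch below illustrates this on synthetic data (one made-up predictor, roughly 5% positives); the cutoff rule is an assumption for illustration, not a prescribed method for this assignment.

```r
set.seed(42)

# Synthetic imbalanced data: one made-up predictor, roughly 5% positives
n <- 4000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-4 + 1.5 * x))
toy <- data.frame(x = x, y = y)

fit  <- glm(y ~ x, data = toy, family = binomial)
prob <- predict(fit, type = "response")

# Per-class recall at a given probability cutoff
class_rates <- function(cutoff) {
    pred <- as.integer(prob > cutoff)
    c(sensitivity = mean(pred[toy$y == 1] == 1),
      specificity = mean(pred[toy$y == 0] == 0))
}

default_cut    <- class_rates(0.5)
prevalence_cut <- class_rates(mean(toy$y))   # cutoff at the observed positive rate

# Balanced accuracy = mean of the two recalls = accuracy on a 50/50 test set
rbind(default    = c(default_cut,    balanced = mean(default_cut)),
      prevalence = c(prevalence_cut, balanced = mean(prevalence_cut)))
```

Lowering the cutoff can only raise sensitivity and lower specificity; whether the balanced accuracy improves depends on the data and should be checked on held-out rows.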


In [31]:
not.installed <- function(pkg) !is.element(pkg, installed.packages()[,1])

if (not.installed("MASS"))  install.packages("MASS")  # we need the MASS package

library(MASS)  #  load the MASS package

#install.packages('kernlab', dependencies=TRUE, repos='http://cran.rstudio.com/')
library(kernlab)
#install.packages('e1071', dependencies=TRUE, repos='http://cran.rstudio.com/')
library(e1071)

#if (not.installed("caret"))  install.packages("caret")  # we need the caret package

library(caret)  #  load the caret package





Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha



In [32]:
upSampledData = upSample(x= HallOfFameContenders, y = as.factor(HallOfFameContenders$HallOfFame), yname = "HallOfFame")
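For reference, up-sampling means resampling the minority class with replacement until both classes have the same number of rows. The base-R sketch below shows the idea on a toy table (made-up values); `upsample_rows` is a hypothetical helper written for this illustration, not a caret function. Two points to keep in mind with the call above: it passes a table that already contains a `HallOfFame` column as `x` while also naming the added class column `HallOfFame`, which appears to leave two columns with the same name, and up-sampling is normally applied to training rows only, not to rows used for evaluation.

```r
set.seed(7)

# Toy imbalanced table (made-up values): 8 negatives, 2 positives
toy <- data.frame(H = c(10, 25, 3, 40, 12, 7, 30, 18, 2900, 3100),
                  HallOfFame = c(rep(0, 8), rep(1, 2)))

# Resample each class with replacement up to the size of the largest class
upsample_rows <- function(data, label) {
    groups <- split(seq_len(nrow(data)), data[[label]])
    target <- max(lengths(groups))
    idx <- unlist(lapply(groups, function(g) {
        extra <- target - length(g)
        if (extra > 0) c(g, g[sample.int(length(g), extra, replace = TRUE)]) else g
    }))
    data[idx, , drop = FALSE]
}

balanced <- upsample_rows(toy, "HallOfFame")
table(balanced$HallOfFame)
```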

In [33]:
HallOfFameModel <- train(
    HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP+HallOfFameYear,
    data = upSampledData,
    method = "cubist",
    preProc = c("center", "scale"),
    trControl = trainControl(method = "repeatedcv", repeats = 5)
)

summary(HallOfFameModel)

In train.default(x, y, weights = w, ...): You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.


Call:
cubist.default(x = x, y = y, committees = param$committees)


Cubist [Release 2.07 GPL Edition]  Wed May 25 20:33:58 2016
---------------------------------

    Target attribute `outcome'

Read 15836 cases (17 attributes) from undefined.data

Model:

  Rule 1: [7918 cases, mean 0.0, range 0 to 0, est err 0.0]

    if
	HallOfFameYear <= -0.9998397
    then
	outcome = 0

  Rule 2: [7918 cases, mean 1.0, range 1 to 1, est err 0.0]

    if
	HallOfFameYear > -0.9998397
    then
	outcome = 1


Evaluation on training data (15836 cases):

    Average  |error|                0.0
    Relative |error|               0.00
    Correlation coefficient        1.00


	Attribute usage:
	  Conds  Model

	  100%           HallOfFameYear


Time: 1.0 secs
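The Cubist summary above uses a single attribute, `HallOfFameYear`, and reaches zero training error. That column is 0 for every non-inducted player and the induction year otherwise, so it encodes the label itself (target leakage), and the perfect fit says little about how the batting statistics predict induction. A small sketch of how such a leak can be detected and the term dropped from a formula, on toy rows with made-up values:

```r
# Toy rows (made-up values) mimicking the relevant columns
toy <- data.frame(H              = c(3771, 216, 0, 2930, 523),
                  HallOfFameYear = c(1982, 0, 0, 1999, 0),
                  HallOfFame     = c(1, 0, 0, 1, 0))

# A predictor that reproduces the label exactly is a sign of target leakage
leaks <- all((toy$HallOfFameYear > 0) == (toy$HallOfFame == 1))
leaks

# Drop the leaking term from the model formula
full_formula  <- HallOfFame ~ H + HallOfFameYear
clean_formula <- update(full_formula, . ~ . - HallOfFameYear)
clean_formula
```

If the Hall of Fame test file does not carry true induction years, a model relying on this column would have nothing useful to work with, so a formula without it is the safer choice.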


In [34]:
Hall_of_Fame_Accuracy_Rate = function( model, test.data, test.solutions ) {

    predictions = predict( model, test.data )

    # A prediction counts as "Hall of Fame" when it is > 0.
    # For a glm, predict() returns log-odds by default, so > 0 means probability > 0.5.
    predicted.HallOfFame = as.numeric(as.character(predictions)) > 0

    # fraction of players whose predicted class matches the 0/1 solution
    accuracy = mean( predicted.HallOfFame == (test.solutions == 1) )

    return(accuracy)
}

In [35]:
BaselineHallOfFameModel_Accuracy = Hall_of_Fame_Accuracy_Rate(BaselineHallOfFameModel, HallOfFameContenders,  HallOfFameContenders$HallOfFame)
BaselineHallOfFameModel_Accuracy

HallOfFameModel_Accuracy = Hall_of_Fame_Accuracy_Rate(HallOfFameModel, HallOfFameContenders, HallOfFameContenders$HallOfFame)
HallOfFameModel_Accuracy

In [41]:
testData = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_Baseball/HW4_Baseball_HallOfFame_test.csv"), header=TRUE )
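predictedHallOfFame <- as.integer( predict(HallOfFameModel, testData) > 0.5 )  # predictions must be 0 or 1

# append=FALSE so that re-running the cell overwrites the file instead of adding duplicate rows
write.table(predictedHallOfFame, file="/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_Baseball_HallOfFame_predictions.csv",sep=",",append=FALSE,row.names=FALSE,col.names=FALSE)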

In [61]:
models = rbind(
    cbind(MySalaryModel_RSS, "train(log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP + startYear + totalYears + birthYear + height + weight + endYear, data = RecentPlayersAndStatsAndSalary, method='cubist', preProc = c('center', 'scale'), trControl = trainControl(method = 'repeatedcv', repeats = 5))"),
    cbind(HallOfFameModel_Accuracy, "train(HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP+HallOfFameYear, data=upSampledData, method='cubist', preProc = c('center', 'scale'), trControl = trainControl(method = 'repeatedcv', repeats = 5))")
)

# write both lines in one call; append=FALSE so that re-running the cell does not duplicate lines
write.table(models, file="/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_Baseball_Models.csv",sep=",",append=FALSE,row.names=FALSE,col.names=FALSE)

