### Predicting MLB Player Salaries: A Batting Performance Analysis

Author: Hector Guerrero

---


## Data Preprocessing

In this notebook, we will simplify the dataframe, drop columns that add no value to our project, reduce the time span, and perform further cleaning and wrangling.

By the end, we expect to have a dataframe that is ready for analysis and modeling.

---

## Table of Contents
1. [Data Exploration](#data_exploration)
    - [1.1. Data Dictionary](#data_dict)
    - [1.2. Description and Overview of the Data](#overview)
2. [Absence of Key Features](#missing_features)
    - [2.1. Adding Key Features](#adding_features)
    - [2.2. Adding Cumulative Features](#adding_cumulative_features)
    - [2.3. Additional Key Features](#additional_key_features)
3. [Data Cleaning](#data_cleaning)
    - [3.1. Setting a Time Span](#time_span)
    - [3.2. Dropping columns (features) that are not relevant to the project](#drop_cols)
    - [3.3. Dropping repeated or similar columns](#drop_repeated_cols)
    - [3.4. Filtering and removing players with no relevance to the project (removing rows)](#drop_rows)
    - [3.5. Dealing with Null Values](#null_values)
4. [Salary Adjustments](#salary_adjustments)
5. [Next Steps](#next_steps)

In [70]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


---

## 1. Data Exploration <a id='data_exploration'></a>

To begin the analysis, we will explore the dataset to understand its structure and content.


In [71]:
# Load Dataset
raw_df = pd.read_csv("pre_datasets/merged.csv")



In [72]:
# Shape of the dataset
print('Shape of the dataset: ', raw_df.shape)
print('Number of rows (Players): ', raw_df.shape[0])
print('Number of columns (Features): ', raw_df.shape[1])

Shape of the dataset:  (103554, 104)
Number of rows (Players):  103554
Number of columns (Features):  104


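The dataset has a shape of **(103554, 104)**: __103,554 rows and 104 columns (features)__. Each row is one player-season record rather than one player, so the same player appears on many rows.

Here is a random sample of five rows from the dataset:

In [73]:
# Random sample of 5 rows of the dataset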
pd.set_option('display.max_columns', None)

raw_df.sample(5)

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,WAR,team_ID,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y,POS,BirthYear
71190,Homer Bush,24.0,1997,bushho01,N,111792.0,1,11.0,10,39.7,0.0,-0.07,NYY,bushho01,1972.0,USA,180.0,70.0,R,R,1997-08-16,2004-06-08,bushh001,bushho01,1997.0,10.0,11.0,2.0,4.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NYA,AL,NYY,E,2.0,162.0,80.0,96.0,66.0,N,N,N,891.0,5710.0,1636.0,325.0,23.0,161.0,676.0,954.0,99.0,58.0,37.0,70.0,688.0,626.0,3.84,11.0,10.0,51.0,4403.0,1463.0,144.0,532.0,1165.0,104.0,156.0,0.983,New York Yankees,Yankee Stadium II,2580325.0,100.0,98.0,NYY,NYA,NYA,homer,bush,756.0,111792.0,1997.0,150000.0,New York Yankees,24.0,0.0,0.0,1336609.0,150000.0,1972-11-12,homer bush,2B,1972.0
39565,Sherman Jones,26.0,1961,jonessh02,Y,116728.0,1,12.0,24,55.0,0.0,0.08,CIN,jonessh02,1935.0,USA,205.0,76.0,L,R,1960-08-02,1962-09-09,jones102,jonessh02,1961.0,24.0,11.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,CIN,NL,CIN,,1.0,154.0,77.0,93.0,61.0,,Y,N,710.0,5243.0,1414.0,247.0,35.0,158.0,423.0,761.0,70.0,33.0,,,653.0,575.0,3.78,46.0,12.0,40.0,4110.0,1300.0,147.0,500.0,829.0,134.0,124.0,0.977,Cincinnati Reds,Crosley Field,1117603.0,102.0,101.0,CIN,CIN,CIN,,,,,,,,,,,,,,,,1935.0
34189,Hal Hudson,25.0,1952,hudsoha01,Y,232524.0,2,1.0,5,0.0,6000.0,-0.01,SLB,hudsoha01,1927.0,USA,175.0,70.0,L,L,1952-04-20,1953-09-23,hudsh101,hudsoha01,1952.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,SLA,AL,BAL,,7.0,155.0,78.0,64.0,90.0,,N,N,604.0,5353.0,1340.0,225.0,46.0,82.0,540.0,720.0,30.0,34.0,,,733.0,640.0,4.12,48.0,6.0,18.0,4197.0,1388.0,111.0,598.0,581.0,155.0,176.0,0.974,St. Louis Browns,Sportsman's Park IV,518796.0,103.0,106.0,SLB,SLA,SLA,,,,,,,,,,,,,,,,1927.0
54286,Graig Nettles,35.0,1980,nettlgr01,N,119720.0,1,369.0,89,769.7,0.0,1.36,NYY,nettlgr01,1944.0,USA,180.0,72.0,L,R,1967-09-06,1988-10-01,nettg001,nettlgr01,1980.0,89.0,324.0,52.0,79.0,14.0,0.0,16.0,45.0,0.0,0.0,42.0,42.0,5.0,1.0,0.0,2.0,8.0,NYA,AL,NYY,E,1.0,162.0,81.0,103.0,59.0,Y,N,N,820.0,5553.0,1484.0,239.0,34.0,189.0,643.0,739.0,86.0,36.0,28.0,54.0,662.0,583.0,3.58,29.0,15.0,50.0,4393.0,1433.0,102.0,463.0,845.0,138.0,160.0,0.978,New York Yankees,Yankee Stadium II,2627417.0,98.0,97.0,NYY,NYA,NYA,,,,,,,,,,,,,,,3B,1944.0
51525,Frank Duffy,30.0,1977,duffyfr01,N,113597.0,1,369.0,122,930.3,55000.0,0.1,CLE,duffyfr01,1946.0,USA,180.0,73.0,R,R,1970-09-04,1979-05-11,dufff101,duffyfr01,1977.0,122.0,334.0,30.0,67.0,13.0,2.0,4.0,31.0,8.0,3.0,21.0,47.0,0.0,0.0,13.0,1.0,8.0,CLE,AL,CLE,E,5.0,161.0,81.0,71.0,90.0,N,N,N,676.0,5491.0,1476.0,221.0,46.0,100.0,531.0,688.0,87.0,87.0,34.0,54.0,739.0,661.0,4.1,45.0,8.0,30.0,4357.0,1441.0,136.0,550.0,876.0,130.0,145.0,0.979,Cleveland Indians,Cleveland Stadium,900365.0,96.0,97.0,CLE,CLE,CLE,,,,,,,,,,,,,,,,1946.0


In [74]:
pd.set_option('display.max_rows', 200)
# Print all column names and their data types
raw_df.dtypes

name_common        object
age_x             float64
year_ID             int64
player_ID          object
pitcher            object
mlb_ID            float64
stint_ID            int64
PA                float64
G_x                 int64
Inn               float64
salary_x          float64
WAR               float64
team_ID            object
playerID           object
birthYear         float64
birthCountry       object
weight            float64
height            float64
bats               object
throws             object
debut              object
finalGame          object
retroID            object
bbrefID            object
yearID            float64
G_y               float64
AB_x              float64
R_x               float64
H_x               float64
2B_x              float64
3B_x              float64
HR_x              float64
RBI               float64
SB_x              float64
CS_x              float64
BB_x              float64
SO_x              float64
IBB               float64
HBP_x       

### 1.1 Data Dictionary <a id='data_dict'></a>

Now let's examine the data dictionary to understand the meaning of each column:

| Column Name | Description | 
| --- | --- |
| name_common | Player name |
| age_x | Player age |
| year_ID | Year (Season played) |
| team_ID | Team played |
| player_ID | Player ID |
| pitcher | Whether the player is a pitcher |
| mlb_ID | MLB ID |
| stint_ID | The stint of the player in that year (a player can have more than one stint in a year if they moved teams). |
| PA | Plate appearances |
| G_x | Games played |
| Inn | Innings played |
| salary_x | Player salary |
| WAR | Wins Above Replacement - a measure of how many wins a player adds to a team compared to a replacement-level player. |
| teamID | ID of the team they played for the season |
| playerID | Player ID |
| birthYear | Year of birth |
| birthMonth | Month of birth |
| birthDay | Day of birth |
| birthCountry | Country of birth |
| nameFirst | First name |
| nameLast | Last name |
| weight | Weight in pounds |
| height | Height in inches |
| bats | Batting hand (left, right, or both) |
| throws | Throwing hand (left or right) |
| debut | Date of MLB debut |
| finalGame | Date of final MLB game |
| retroID | Retro ID |
| bbrefID | Baseball Reference ID |
| yearID | Year (Season played) |
| lgID | The league the player played in |
| G | Games played |
| AB | At bats |
| R | Runs |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR | Home runs |
| RBI | Runs batted in |
| SB | Stolen bases |
| CS | Caught stealing |
| BB | Walks |
| SO | Strikeouts |
| IBB | Intentional walks |
| HBP | Hit by pitch |
| SH | Sacrifice hits |
| SF | Sacrifice flies |
| GIDP | Grounded into double plays |
| teamID | ID of the team they played for the season |
| lgID | The league the player played in |
| franchID | Franchise ID |
| divID | Division ID |
| Rank | Team rank at the end of the season |
| G | Games played by the team |
| Ghome | Games played at home |
| W | Wins |
| L | Losses |
| DivWin | Division Winner (Y or N) |
| WCWin | Wild Card Winner (Y or N) |
| LgWin | League Champion (Y or N) |
| WSWin | World Series Winner (Y or N) |
| R | Total number of runs scored by the team in the season |
| AB | Total number of at bats by the team in the season |
| H | Total number of hits by the team in the season |
| 2B | Total number of doubles by the team in the season |
| 3B | Total number of triples by the team in the season |
| HR | Total number of home runs by the team in the season |
| BB | Total number of walks by the team in the season |
| SO | Total number of strikeouts by the team in the season |
| SB | Total number of stolen bases by the team in the season |
| CS | Total number of times caught stealing by the team in the season |
| HBP | Total number of times hit by pitch by the team in the season |
| SF | Total number of sacrifice flies by the team in the season |
| RA | Total number of runs allowed by the team in the season |
| ER | Total number of earned runs allowed by the team in the season |
| ERA | Earned run average |
| CG | Total number of complete games pitched by the team in the season |
| SHO | Total number of shutouts pitched by the team in the season |
| SV | Total number of saves by the team in the season |
| IPouts | Total number of outs pitched by the team in the season |
| HA | Total number of hits allowed by the team in the season |
| HRA | Total number of home runs allowed by the team in the season |
| BBA | Total number of walks allowed by the team in the season |
| SOA | Total number of strikeouts recorded by the team's pitchers in the season |
| E | Total number of errors by the team in the season |
| DP | Total number of double plays turned by the team in the season |
| FP | Fielding percentage |
| name_x | Team name |
| park | Team park |
| attendance | Total attendance for the season |
| BPF | Three-year park factor for batters |
| PPF | Three-year park factor for pitchers |
| teamIDBR | Team ID used by Baseball Reference website |
| teamIDlahman45 | Team ID used in Lahman database version 4.5 |
| teamIDretro | Team ID used by Retrosheet |
| firstname | The player's First name |
| lastname | The player's Last name |
| playerid | A unique identifier for each player |
| mlbid | A unique identifier for each player |
| year | Year of the salary |
| salary_y | Player salary for the year |
| TeamName | Team name |
| age_y | The player's age during the year of the salary |
| leaguerank | The player's rank in the league based on salary |
| teamrank | The player's rank in the team based on salary |
| averagesalary | The average salary in the league for the year |
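| leagueminimum | The minimum salary in the league for the year |
| borndate | The player's birth date |
| name_y | Player name in lowercase |
| POS | The player's fielding position |
| BirthYear | Year of birth |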
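
Many columns in the dataframe carry an `_x` or `_y` suffix that the dictionary above omits. These are the default suffixes pandas appends when two merged tables share a column name that is not a join key. The sketch below reproduces the effect with two small, hypothetical tables (`batting` and `teams` are illustrative names, not the files used to build `merged.csv`). In this notebook the later code treats `H_x` and `AB_x` as player-level values and `H_y` and `AB_y` as team-level values.

```python
import pandas as pd

# Hypothetical player-level table
batting = pd.DataFrame({
    "playerID": ["p1", "p2"],
    "teamID": ["NYA", "NYA"],
    "yearID": [1997, 1997],
    "H": [4, 150],
    "AB": [11, 500],
})

# Hypothetical team-level table that reuses the column names H and AB
teams = pd.DataFrame({
    "teamID": ["NYA"],
    "yearID": [1997],
    "H": [1636],
    "AB": [5710],
})

# Overlapping non-key columns get the default suffixes '_x' (left) and '_y' (right)
merged = batting.merge(teams, on=["teamID", "yearID"], how="left")
print(merged.columns.tolist())
```

Passing an explicit `suffixes=("_player", "_team")` argument to `merge` would give more descriptive names, at the cost of renaming every downstream reference.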


### 1.2 Description and Overview of the Data <a id='overview'></a>

In [75]:
# Unique players (Total number of players)
print('Number of unique players: ', raw_df['playerID'].nunique())

# Unique seasons (Total number of seasons)
print('Number of unique seasons: ', raw_df['yearID'].nunique())

# First season
print('First season: ', raw_df['yearID'].min())

# Last season
print('Last season: ', raw_df['yearID'].max())

Number of unique players:  20016
Number of unique seasons:  152
First season:  1871.0
Last season:  2022.0


#### Overview of the Data

Our dataset is a rich compilation of baseball statistics that contains data for __20,016__ unique players across __152__ seasons. The earliest MLB season in the dataset is __1871__, and the latest season is __2022__.

Each row in the dataset represents a single player's performance for a given season and stint, with a variety of metrics reflecting different aspects of their performance. These are divided into the following categories:
- __Identifiers:__ These include various IDs for players, teams, and leagues, as well as the year and the player's stint with the team in that season.
- __Basic Batting and Fielding Statistics:__ These include familiar metrics such as games played, at bats, runs, hits, and home runs, among others, for both players and teams.
- __Advanced Statistics:__ These include wins above replacement (WAR) for each player.
- __Personal Information:__ These include the player's name, birth information, weight, height, and handedness, among others.
- __Team Information:__ These include the team's name, division, and league, as well as their record, attendance, and park factors, among others.
- __Salary Information:__ These include the player's salary, along with the league average and league minimum salary for the year.

This dataset provides a comprehensive view of baseball performance, allowing us to examine the game from both individual player and team perspectives.


In [76]:
# Copy of the df for preprocessing
df = raw_df.copy()

## 2. Absence of Key Features (Pre-Feature Engineering) <a id='missing_features'></a>

Our current dataset lacks several features that are integral to understanding a player's performance in baseball. These absent features include:
- __Batting Average (BA)__: This is calculated by dividing a player's number of hits by their number of at bats. It provides a measure of a player's offensive capabilities.
- __On-Base Percentage (OBP)__: This metric represents how often a player reaches base. It's calculated by dividing the total times a player reaches base (hits, walks, and hit by pitches) by the sum of their at bats, walks, hit by pitches, and sacrifice flies.
- __Singles (1B)__: This is the number of hits a player has that resulted in the batter reaching first base safely. It's calculated by subtracting doubles, triples, and home runs from total hits.
- __Total Bases (TB)__: The number of bases a player has gained with hits. It is a weighted sum for a batter's collection of hits including singles, doubles, triples and home runs.
- __Slugging Percentage (SLG)__: The total number of bases divided by the number of at bats.
- __On-Base Plus Slugging (OPS)__: This is the sum of a player's on-base percentage and slugging percentage. It's a more comprehensive statistic that measures a player's ability to get on base, along with their ability to hit for power.
- __Batting Average on Balls In Play (BABIP)__: This measures how often a ball in play goes for a hit. A ball is "in play" when the plate appearance ends in something other than a strikeout, walk, hit batter, catcher's interference, sacrifice bunt, or home run. In other words, the batter puts the ball in play and it doesn't clear the outfield fence. This can be an indicator of a player's luck, skill at placing hits where fielders aren't, or both.
- __Win Percentage (W%)__: This is the team's number of wins divided by its total of wins and losses. It provides a measure of a team's success over a season.
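
As a quick worked example of the definitions above, the sketch below applies them by hand to one player-season. The counting stats are those of Randy Bush in 1989, copied from the sample rows printed later in this notebook; the variable names are illustrative only.

```python
# Counting stats for one player-season (Randy Bush, 1989), taken from the sample output
H, AB, BB, HBP, SF, SO = 103, 391, 48, 3, 2, 73
doubles, triples, HR = 17, 4, 14
team_W, team_L = 80, 82

# Singles: hits that were not doubles, triples, or home runs
singles = H - doubles - triples - HR

# Total bases: each hit weighted by the number of bases it is worth
TB = singles + 2 * doubles + 3 * triples + 4 * HR

BA = round(H / AB, 3)
OBP = round((H + BB + HBP) / (AB + BB + HBP + SF), 3)
SLG = round(TB / AB, 3)
OPS = round(OBP + SLG, 3)
BABIP = round((H - HR) / (AB - SO - HR + SF), 3)
W_pct = round(team_W / (team_W + team_L), 3)

print(singles, TB, BA, OBP, SLG, OPS, BABIP, W_pct)
```

The results (68 singles, 170 total bases, .263/.347/.435, OPS .782, BABIP .291, W% .494) agree with the values stored for that row in the dataframe.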

### 2.1 Adding key features <a id='adding_features'></a>

In [77]:
# Add Batting Average (BA) column, round to 3 decimal places
df['BA'] = round(df['H_x'] / df['AB_x'], 3)

# Add On Base Percentage (OBP) column
df['OBP'] = round((df['H_x'] + df['BB_x'] + df['HBP_x']) / (df['AB_x'] + df['BB_x'] + df['HBP_x'] + df['SF_x']), 3)

# Add Singles (1B) column
df['1B'] = df['H_x'] - df['2B_x'] - df['3B_x'] - df['HR_x']

# Add Total Bases (TB) column
df['TB'] = df['1B'] + 2*df['2B_x'] + 3*df['3B_x'] + 4*df['HR_x']

# Add Slugging Percentage (SLG) column
df['SLG'] = round(df['TB'] / df['AB_x'], 3)

# Add On Base Plus Slugging (OPS) column
df['OPS'] = round(df['OBP'] + df['SLG'], 3)

# Add Batting Average on Balls in Play (BABIP) column
df['BABIP'] = round((df['H_x'] - df['HR_x']) / (df['AB_x'] - df['SO_x'] - df['HR_x'] + df['SF_x']), 3)

# Add Win Percentage (W%) column
df['W%'] = round(df['W'] / (df['W'] + df['L']), 3)


##### Sanity check
Let's compare the newly created features with the published statistics of a well-known player, Derek Jeter, to make sure we are calculating them correctly.
 
https://www.baseball-reference.com/players/j/jeterde01.shtml

In [78]:
# Sanity check, Derek Jeter's new columns
df[df['playerID'] == 'jeterde01'][['playerID', 'yearID', 'BA', 'OBP', '1B', 'SLG', 'OPS', 'BABIP', 'TB']]

Unnamed: 0,playerID,yearID,BA,OBP,1B,SLG,OPS,BABIP,TB
68809,jeterde01,1995.0,0.25,0.294,7.0,0.375,0.669,0.324,18.0
69943,jeterde01,1996.0,0.314,0.37,142.0,0.43,0.8,0.361,250.0
71064,jeterde01,1997.0,0.291,0.37,142.0,0.405,0.775,0.345,265.0
72210,jeterde01,1998.0,0.324,0.384,151.0,0.481,0.865,0.375,301.0
73397,jeterde01,1999.0,0.349,0.438,149.0,0.552,0.99,0.396,346.0
74600,jeterde01,2000.0,0.339,0.416,151.0,0.481,0.897,0.386,285.0
75843,jeterde01,2001.0,0.311,0.377,132.0,0.48,0.857,0.343,295.0
77027,jeterde01,2002.0,0.297,0.373,147.0,0.421,0.794,0.336,271.0
78232,jeterde01,2003.0,0.324,0.393,118.0,0.45,0.843,0.379,217.0
79464,jeterde01,2004.0,0.292,0.352,120.0,0.471,0.823,0.315,303.0


All the features match the ones on the website, so we can proceed.

### 2.2 Adding cumulative features <a id='adding_cumulative_features'></a>
It is essential to consider the player's career statistics up to the point of each season. This is because a player's salary is often determined not just by their performance in the current season, but by their cumulative performance throughout their career.

To this end, we have added cumulative features to our dataset. These features represent the cumulative sum of various statistics for each player up to and including each season. For example, `career_G_x` represents the cumulative number of games played by the player up to the current season, `career_H_x` represents the cumulative number of hits, and so on.

In addition to these cumulative sum features, we have also added career rate features for certain statistics. These are computed from the cumulative totals rather than by averaging the season rates. For example, `career_BA` is the player's cumulative hits divided by their cumulative at bats up to the current season.

Later in the preprocessing stage, we will be narrowing down the time span of our dataset. To ensure we do not lose valuable data, especially for players who debuted before our selected time span, it's crucial to perform this step now. This approach ensures that we retain all relevant information for our analysis and model training.
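
The sketch below illustrates the idea on a small, made-up table (the numbers and player IDs are hypothetical). It assumes the rows are sorted chronologically within each player, since `cumsum` follows row order. It also contrasts the career rate built from cumulative totals with a plain running average of season rates, which would give a short season the same weight as a full one.

```python
import pandas as pd

# Hypothetical seasons for two players
toy = pd.DataFrame({
    "playerID": ["a", "a", "a", "b", "b"],
    "yearID": [2001, 2002, 2003, 2001, 2002],
    "H_x": [10, 150, 90, 0, 40],
    "AB_x": [50, 500, 300, 10, 160],
})

# cumsum follows row order, so sort chronologically within each player first
toy = toy.sort_values(["playerID", "yearID"]).reset_index(drop=True)

# Cumulative totals up to and including each season
toy["career_H_x"] = toy.groupby("playerID")["H_x"].cumsum()
toy["career_AB_x"] = toy.groupby("playerID")["AB_x"].cumsum()

# Career rate from cumulative totals (the approach used in this notebook)
toy["career_BA"] = (toy["career_H_x"] / toy["career_AB_x"]).round(3)

# For comparison only: running mean of the season rates
toy["BA"] = (toy["H_x"] / toy["AB_x"]).round(3)
toy["mean_season_BA"] = (
    toy.groupby("playerID")["BA"]
    .transform(lambda s: s.expanding().mean())
    .round(3)
)

print(toy[["playerID", "yearID", "career_BA", "mean_season_BA"]])
```

For player `a`, the 50 at-bat debut season pulls the running mean down to .250 after two seasons, while the totals-based career average is .291.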


In [79]:
# Add cumulative columns

# Empty lists to store the names of the cumulative columns
cumulative_sum_cols = []
cumulative_mean_cols = []

# Function to add cumulative columns
def add_cumulative_sum(df, list_of_cols):
    for col in list_of_cols:
        new_col_name = 'career_' + col
        df[new_col_name] = df.groupby(['playerID'])[col].cumsum()
        if col != 'WAR':
            df[new_col_name] = df[new_col_name].fillna(0).astype(int)
        cumulative_sum_cols.append(new_col_name)
        
# Add cumulative sum columns
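# HBP_x, SH, SF_x, and GIDP are included because their career totals appear in the sanity check below
columns_to_sum = ['G_x', 'PA', 'AB_x', 'R_x', 'H_x', '1B', '2B_x', '3B_x', 'HR_x', 'RBI', 'SB_x', 'CS_x', 'BB_x', 'SO_x', 'IBB', 'HBP_x', 'SH', 'SF_x', 'GIDP', 'TB', 'WAR']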
add_cumulative_sum(df, columns_to_sum)

# Add cumulative mean columns (Batting Average, On Base Percentage, Slugging Percentage, On Base Plus Slugging, BABIP)
# Batting Average (BA)
df['career_BA'] = round(df.groupby(['playerID'])['H_x'].cumsum() / df.groupby(['playerID'])['AB_x'].cumsum(), 3)
cumulative_mean_cols.append('career_BA')

# On Base Percentage (OBP)
df['career_OBP'] = round((df.groupby(['playerID'])['H_x'].cumsum() + df.groupby(['playerID'])['BB_x'].cumsum() + df.groupby(['playerID'])['HBP_x'].cumsum()) / (df.groupby(['playerID'])['AB_x'].cumsum() + df.groupby(['playerID'])['BB_x'].cumsum() + df.groupby(['playerID'])['HBP_x'].cumsum() + df.groupby(['playerID'])['SF_x'].cumsum()), 3)
cumulative_mean_cols.append('career_OBP')

# Slugging Percentage (SLG)
df['career_SLG'] = round((df.groupby(['playerID'])['1B'].cumsum() + 2*df.groupby(['playerID'])['2B_x'].cumsum() + 3*df.groupby(['playerID'])['3B_x'].cumsum() + 4*df.groupby(['playerID'])['HR_x'].cumsum()) / df.groupby(['playerID'])['AB_x'].cumsum(), 3)
cumulative_mean_cols.append('career_SLG')

# On Base Plus Slugging (OPS)
df['career_OPS'] = round(df['career_OBP'] + df['career_SLG'], 3)
cumulative_mean_cols.append('career_OPS')

# Batting Average on Balls in Play (BABIP)
df['career_BABIP'] = round((df.groupby(['playerID'])['H_x'].cumsum() - df.groupby(['playerID'])['HR_x'].cumsum()) / (df.groupby(['playerID'])['AB_x'].cumsum() - df.groupby(['playerID'])['SO_x'].cumsum() - df.groupby(['playerID'])['HR_x'].cumsum() + df.groupby(['playerID'])['SF_x'].cumsum()), 3)
cumulative_mean_cols.append('career_BABIP')

##### Sanity check

https://www.baseball-reference.com/players/j/jeterde01.shtml

In [80]:
# Sanity check, Derek Jeter's new cumulative sum columns
df[df['playerID'] == 'jeterde01'][['playerID', 'yearID', 'career_G_x', 'career_PA', 'career_AB_x', 'career_R_x', 'career_H_x', 'career_1B', 'career_2B_x', 'career_3B_x', 'career_HR_x', 'career_RBI', 'career_SB_x', 'career_CS_x', 'career_BB_x', 'career_SO_x', 'career_IBB', 'career_HBP_x', 'career_SH', 'career_SF_x', 'career_GIDP', 'career_TB', 'career_WAR']]


Unnamed: 0,playerID,yearID,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR
68809,jeterde01,1995.0,15,51,48,5,12,7,4,1,0,7,0,0,3,11,0,0,0,0,0,18,-0.34
69943,jeterde01,1996.0,172,705,630,109,195,149,29,7,10,85,14,7,51,113,1,9,6,9,13,268,2.95
71064,jeterde01,1997.0,331,1453,1284,225,385,291,60,14,20,155,37,19,125,238,1,19,14,11,27,533,7.91
72210,jeterde01,1998.0,480,2147,1910,352,588,442,85,22,39,239,67,25,182,357,2,24,17,14,40,834,15.44
73397,jeterde01,1999.0,638,2886,2537,486,807,591,122,31,63,341,86,33,273,473,7,36,20,20,52,1180,23.44
74600,jeterde01,2000.0,786,3565,3130,605,1008,742,153,35,78,414,108,37,341,572,11,48,23,23,66,1465,28.01
75843,jeterde01,2001.0,936,4251,3744,715,1199,874,188,38,99,488,135,40,397,671,14,58,28,24,79,1760,33.2
77027,jeterde01,2002.0,1093,4981,4388,839,1390,1021,214,38,117,563,167,43,470,785,16,65,31,27,93,2031,36.87
78232,jeterde01,2003.0,1212,5523,4870,926,1546,1139,239,41,127,615,178,48,513,873,18,78,34,28,103,2248,40.44
79464,jeterde01,2004.0,1366,6244,5513,1037,1734,1259,283,42,150,693,201,52,559,972,19,92,50,30,122,2551,44.68


In [81]:
# Derek Jeter's career average stats
df[df['name_common'] == 'Derek Jeter'][['playerID', 'yearID', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS', 'career_BABIP']]

Unnamed: 0,playerID,yearID,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
68809,jeterde01,1995.0,0.25,0.294,0.375,0.669,0.324
69943,jeterde01,1996.0,0.31,0.365,0.425,0.79,0.359
71064,jeterde01,1997.0,0.3,0.368,0.415,0.783,0.352
72210,jeterde01,1998.0,0.308,0.373,0.437,0.81,0.359
73397,jeterde01,1999.0,0.318,0.389,0.465,0.854,0.368
74600,jeterde01,2000.0,0.322,0.394,0.468,0.862,0.372
75843,jeterde01,2001.0,0.32,0.392,0.47,0.862,0.367
77027,jeterde01,2002.0,0.317,0.389,0.463,0.852,0.362
78232,jeterde01,2003.0,0.317,0.389,0.462,0.851,0.364
79464,jeterde01,2004.0,0.315,0.385,0.463,0.848,0.358


All the features match the ones on the website, so we can proceed.

In [82]:
df.shape

(103554, 138)

### 2.3 Additional Key Features <a id='additional_key_features'></a>

##### Qualified Hitter
In baseball, a qualified hitter is a player who has accumulated enough plate appearances over the season to be eligible for batting titles. The qualification threshold is typically set at 3.1 plate appearances per team game, which equals about 502 plate appearances over a 162-game season.

The reason for this qualification threshold is to ensure that batting titles and other performance metrics are awarded based on a substantial body of work, not just a few successful games. This helps to level the playing field and ensure that the players who are leading in these categories have demonstrated consistent performance over many games.

Adding a "Qualified Hitter" feature to our dataset could be useful for several reasons:

- Context: It provides context for the batting statistics. A high rate statistic, such as batting average, is more credible when the player has sustained it over many plate appearances, not just a handful.
- Fairness: It ensures that players who have played a substantial number of games are not unfairly compared to players who have played fewer games.
- Performance: It could help us better predict player salaries. If teams value consistent performance, they might be willing to pay more for players who are qualified hitters.
- Interpretability: It makes our analysis easier to interpret. By focusing on qualified hitters, we're comparing players who have had a similar level of involvement in the game.
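
To make the threshold concrete, the sketch below applies the 3.1 plate appearances per team game rule to two player-seasons taken from the sample rows printed later in this notebook. Because the threshold scales with the team's games, a shortened season such as 1994 has a lower bar than a full 162-game season. The `threshold` column is added here only for illustration.

```python
import numpy as np
import pandas as pd

# PA (player plate appearances) and G (team games) copied from the sample output
toy = pd.DataFrame({
    "name": ["Randy Bush (1989)", "Vince Coleman (1994)"],
    "PA": [444, 477],
    "G": [162, 115],
})

# Qualification threshold: 3.1 plate appearances per team game
toy["threshold"] = 3.1 * toy["G"]

# 1 if the player reached the threshold, 0 otherwise
toy["qualified"] = np.where(toy["PA"] >= toy["threshold"], 1, 0)

print(toy)
```

Bush's 444 plate appearances fall short of the roughly 502 needed over 162 games, while Coleman's 477 clear the roughly 357 needed over Kansas City's 115 games in 1994.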

In [83]:
# Qualified players, binary column
df['qualified'] = np.where(df['PA'] >= (3.1 * df['G']), 1, 0)

In [84]:
# # Save a copy of the df
# df.to_csv('baseball_pre_cumulative.csv', index=False)

## 3. Data Cleaning <a id='data_cleaning'></a>
The next crucial step in our analysis is data cleaning. This process involves preparing our data for analysis and modeling by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

### 3.1 Setting a Time Span <a id='time_span'></a>

The dataset contains data from 1871 to 2022. As part of our data cleaning process, we limit the dataset to the seasons from 1985 through 2022.

There were several reasons behind this decision:
- __Statistical Consistency:__ The game of baseball has evolved significantly over the years, with changes in rules, equipment, and player training and conditioning. By focusing on the past four decades, we ensure a higher degree of consistency in the playing conditions, thus making our statistical analysis and predictions more reliable.
- __Impact of Free Agency:__ The introduction of free agency in 1976 has had a significant impact on the game. By starting our analysis from 1985, we focus on an era when player movement between teams became more common, which adds an interesting dynamic to player performance and team composition.
- __Data Availability:__ The data from 1985 to the present is more complete and accurate, which will help us avoid potential issues with missing or incorrect data.
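
The order of operations matters here: the career totals from Section 2.2 were computed on the full history before this filter is applied. The sketch below uses a small, hypothetical table to show what would be lost if the filter came first. A veteran who debuted before 1985 keeps their earlier hits only when the cumulative sum is taken before filtering.

```python
import pandas as pd

# Hypothetical data: 'vet' debuted before 1985, 'rook' debuted in 1985
toy = pd.DataFrame({
    "playerID": ["vet", "vet", "vet", "rook"],
    "yearID": [1983, 1984, 1985, 1985],
    "H_x": [100, 120, 90, 50],
})
toy = toy.sort_values(["playerID", "yearID"]).reset_index(drop=True)

# Order used in this notebook: cumulative totals first, then the year filter
toy["career_H_x"] = toy.groupby("playerID")["H_x"].cumsum()
kept = toy[toy["yearID"] >= 1985].copy()

# Reverse order, for comparison: filter first, then accumulate
late = toy[toy["yearID"] >= 1985].copy()
late["career_H_x"] = late.groupby("playerID")["H_x"].cumsum()

print(kept.set_index("playerID")["career_H_x"].to_dict())
print(late.set_index("playerID")["career_H_x"].to_dict())
```

With the notebook's order the veteran enters 1985 with 310 career hits; filtering first would credit them with only the 90 hits from 1985.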

In [85]:
# New dataframe restricted to seasons from 1985 onward
df_1985 = df[df['yearID'] >= 1985].copy()

# Save a copy of df_1985
# df_1985.to_csv('baseball_pre_cumulative_1985.csv', index=False)

In [86]:
# Shape 
print('Shape of the dataset: ', df_1985.shape)

Shape of the dataset:  (44993, 139)


In [87]:
# Sample
df_1985.sample(5)

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,WAR,team_ID,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y,POS,BirthYear,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,qualified
63138,Randy Bush,30.0,1989,bushra01,N,111795.0,1,444.0,141,944.4,550000.0,-0.22,MIN,bushra01,1958.0,USA,186.0,73.0,L,L,1982-05-01,1993-06-23,bushr001,bushra01,1989.0,141.0,391.0,60.0,103.0,17.0,4.0,14.0,54.0,5.0,8.0,48.0,73.0,6.0,3.0,0.0,2.0,16.0,MIN,AL,MIN,W,5.0,162.0,81.0,80.0,82.0,N,N,N,740.0,5581.0,1542.0,278.0,35.0,117.0,478.0,743.0,111.0,53.0,39.0,58.0,738.0,680.0,4.28,19.0,8.0,38.0,4288.0,1495.0,139.0,500.0,851.0,107.0,141.0,0.982,Minnesota Twins,Hubert H Humphrey Metrodome,2277438.0,107.0,107.0,MIN,MIN,MIN,randy,bush,9474.0,111795.0,1989.0,550000.0,Minnesota Twins,30.0,233.0,12.0,497254.0,68000.0,1958-10-05,randy bush,OF,1958.0,0.263,0.347,68.0,170.0,0.435,0.782,0.291,0.494,918,2828,2472,335,623,391,126,24,82,343,32,23,285,403,42,38,6,27,49,1043,2.01,0.252,0.335,0.422,0.757,0.269,0
92630,Matt Albers,31.0,2014,alberma01,Y,458006.0,1,0.0,0,10.0,2250000.0,0.0,HOU,alberma01,1983.0,USA,225.0,73.0,L,R,2006-07-25,2019-09-28,albem001,alberma01,2014.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,HOU,AL,HOU,W,4.0,162.0,81.0,70.0,92.0,N,N,N,629.0,5447.0,1317.0,240.0,19.0,163.0,495.0,1442.0,122.0,37.0,55.0,36.0,723.0,657.0,4.11,7.0,3.0,31.0,4316.0,1437.0,139.0,484.0,1137.0,106.0,151.0,0.983,Houston Astros,Minute Maid Park,1751829.0,101.0,102.0,HOU,HOU,HOU,matt,albers,21317.0,458006.0,2014.0,2250000.0,Houston Astros,31.0,355.0,5.0,3818923.0,500000.0,1983-01-20,matt albers,P,1983.0,,,0.0,0.0,,,,0.432,81,37,34,0,2,2,0,0,0,0,0,0,0,21,0,0,3,0,0,2,-0.38,0.059,0.059,0.059,0.118,0.154,0
75501,AJ Burnett,24.0,2001,burnea.01,Y,150359.0,1,59.0,25,173.3,250000.0,-0.34,FLA,burneaj01,1977.0,USA,230.0,76.0,R,R,1999-08-17,2015-10-03,burna001,burnea.01,2001.0,27.0,50.0,0.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,27.0,0.0,1.0,7.0,0.0,0.0,FLO,NL,FLA,E,4.0,162.0,80.0,76.0,86.0,N,N,N,742.0,5542.0,1461.0,325.0,30.0,166.0,470.0,1145.0,89.0,40.0,67.0,45.0,744.0,691.0,4.32,5.0,11.0,32.0,4314.0,1397.0,151.0,617.0,1119.0,103.0,174.0,0.983,Florida Marlins,Pro Player Stadium,1261226.0,97.0,97.0,FLA,FLO,FLO,aj,burnett,912.0,150359.0,2001.0,250000.0,Florida Marlins,24.0,644.0,24.0,2138896.0,200000.0,1977-01-03,aj burnett,P,1977.0,0.08,0.115,3.0,5.0,0.1,0.215,0.174,0.469,45,106,92,4,13,9,2,1,1,3,0,0,4,47,0,1,9,0,0,20,0.01,0.141,0.186,0.217,0.403,0.273,0
68511,Vince Coleman,32.0,1994,colemvi01,N,112487.0,1,477.0,104,916.3,3312500.0,-1.55,KCR,colemvi01,1961.0,USA,170.0,72.0,B,R,1985-04-18,1997-04-14,colev001,colemvi01,1994.0,104.0,438.0,61.0,105.0,14.0,12.0,2.0,33.0,50.0,8.0,29.0,72.0,0.0,1.0,4.0,5.0,2.0,KCA,AL,KCR,C,3.0,115.0,59.0,64.0,51.0,,,,574.0,3911.0,1051.0,211.0,38.0,100.0,376.0,698.0,140.0,62.0,33.0,38.0,532.0,485.0,4.23,5.0,6.0,38.0,3095.0,1018.0,95.0,392.0,717.0,80.0,102.0,0.982,Kansas City Royals,Kauffman Stadium,1400494.0,104.0,104.0,KCR,KCA,KCA,vince,coleman,10044.0,112487.0,1994.0,3312500.0,Kansas City Royals,33.0,91.0,4.0,1168263.0,109000.0,1960-09-22,vince coleman,OF,1961.0,0.24,0.285,77.0,149.0,0.34,0.625,0.279,0.557,1217,5361,4853,773,1280,1024,152,82,22,313,698,159,430,846,10,13,42,23,36,1662,12.69,0.264,0.324,0.342,0.666,0.314,1
77862,Trey Hodges,24.0,2002,hodgetr01,Y,408074.0,1,3.0,4,11.7,0.0,-0.05,ATL,hodgetr01,1978.0,USA,187.0,75.0,R,R,2002-09-10,2003-09-27,hodgt001,hodgetr01,2002.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,ATL,NL,ATL,E,1.0,161.0,81.0,101.0,59.0,Y,N,N,708.0,5495.0,1428.0,280.0,25.0,164.0,558.0,1028.0,76.0,39.0,54.0,49.0,565.0,511.0,3.13,3.0,15.0,57.0,4402.0,1302.0,123.0,554.0,1058.0,114.0,170.0,0.982,Atlanta Braves,Turner Field,2603484.0,102.0,101.0,ATL,ATL,ATL,trey,hodges,5693.0,408074.0,2002.0,200000.0,Atlanta Braves,24.0,0.0,0.0,2295649.0,200000.0,1978-06-29,trey hodges,P,1978.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.631,4,3,3,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,-0.05,0.0,0.0,0.0,0.0,0.0,0


### 3.2 Dropping columns (Features) that are not relevant to the project <a id='drop_cols'></a>


In [88]:
# Empty list to store columns to be dropped
drop_cols = []


#### Personal Columns

Some columns are not relevant to our project and will be dropped. These columns are commonly related to personal information about the player, such as birth date, birth place, and death date. 

Personal columns are not directly related to the project's analysis of batting performance. By removing them from the dataset, we can eliminate unnecessary personal information and narrow the scope of the project to the pertinent variables.


In [89]:
# List of personal columns that are not relevant
personal_columns = ['firstname', 'lastname', 'borndate', 'birthCountry']

# Append personal columns to drop_cols
drop_cols.extend(personal_columns)


#### Pitching Stats Columns

Because this project is about batting performance, we will drop the pitching-related columns. Removing them lets us concentrate solely on the batting statistics, which aligns with the project's objective and keeps the exploration concise.

In [90]:
# Pitching columns
pitching_columns = ['ERA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'PPF']

# Append pitching columns to drop_cols
drop_cols.extend(pitching_columns)

#### Team Stats Columns

Team statistics, while important in a team context, may not be as relevant when predicting individual salaries because they reflect the collective performance of a team rather than the performance of an individual player.


In [91]:
# Team stats columns to drop (team totals plus attendance, league and division IDs)
team_stats_columns = ['R_y', 'AB_y', 'H_y', '2B_y', '3B_y', 'HR_y', 'BB_y', 
                      'SO_y', 'SB_y', 'CS_y', 'SF_y', 'E', 'DP', 'FP', 
                      'BPF', 'attendance', 'lgID', 'divID']

drop_cols.extend(team_stats_columns)

#### Other Stats

- `HBP_x`, `HBP_y` - Hit by pitch. This is generally considered a relatively minor event in the course of a game, and it does not contribute significantly to a player's overall performance.
- `stint_ID` - A player's salary is typically determined by their overall performance and potential, not by the order in which they played for different teams in a particular year.
- `SF` (sacrifice flies), `SH` (sacrifice hits) and `GIDP` (grounded into double play) - Compared to other performance measures like home runs, hits, and batting average, `SF`, `SH` and `GIDP` are less frequently referenced as key performance indicators. They are situational statistics that may not contribute much to a player's overall perceived value.

In [92]:
drop_cols.extend(['HBP_x', 'HBP_y', 'stint_ID', 'SF_x', 'SF_y', 'GIDP', 'SH'])

### 3.3 Dropping repeated or similar columns <a id='drop_repeated_cols'></a>

Some columns provide the same information as other columns, but under different names. We will go through each group of columns we believe are repeated, compare them, and decide which ones to drop.
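One caveat when comparing columns element-wise is that a missing value never equals another missing value, so rows that are null in both columns are counted as "different". The sketch below illustrates this on toy data (the frame and its values are hypothetical, not taken from the dataset) and shows one possible null-aware comparison.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the columns agree except for one real difference
# and one row where both values are missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 5.0],
})

# Plain element-wise ==: the NaN/NaN row is reported as different
naive_equal = toy["a"] == toy["b"]

# Null-aware version: two missing values are treated as a match
null_aware_equal = naive_equal | (toy["a"].isna() & toy["b"].isna())

n_naive_diff = int((~naive_equal).sum())
n_real_diff = int((~null_aware_equal).sum())
print("Differences with ==:", n_naive_diff)
print("Differences ignoring shared nulls:", n_real_diff)
```

This is why the comparison helper in this section also reports null counts: when the number of differences is close to the number of nulls, the columns probably agree wherever both have data.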

In [93]:
def compare_columns(dataframe, column1, column2, column3=None):
    '''
    Compare two or three columns in a DataFrame and print the number of equal and different values.
    Additionally, print the count of null values for each column to assist in identifying columns for potential dropping.
    NaN never equals NaN, so if the number of different values is close to or equals the number of null values,
    the differences observed are primarily due to null values.
    :param dataframe: pandas DataFrame
    :param column1: string
    :param column2: string
    :param column3: string (optional)
    '''
    columns = [column1, column2] if column3 is None else [column1, column2, column3]

    # True where every compared column matches column1
    comparison = dataframe[column1] == dataframe[column2]
    if column3 is not None:
        comparison = comparison & (dataframe[column1] == dataframe[column3])

    n_different = int((~comparison).sum())
    if n_different == 0:
        print("All values are equal")
        return

    # Summing the boolean mask avoids a KeyError when no rows are equal
    print('Number of different values: ', n_different)
    print('Number of equal values: ', int(comparison.sum()))
    # Null values for comparison
    for column in columns:
        print(f'Number of null values for {column}: ', dataframe[column].isnull().sum())


##### Year columns (Season)

__`yearID`, `year_ID` and `year`.__

In [94]:
# Columns that start with 'year'
year_columns = [col for col in df_1985.columns if col.startswith('year')]
year_columns

['year_ID', 'yearID', 'year']

In [95]:
# yearID and year_ID comparison
compare_columns(df_1985, 'yearID', 'year_ID')

All values are equal


`yearID` and `year_ID` are identical, so we will keep `yearID` and drop `year_ID`, along with the `year` column.

In [96]:
drop_cols.extend(['year_ID', 'year'])

##### Player identifiers

__`playerID` and `player_ID`.__

In [97]:
# Columns that could be for player identification
id_columns = [col for col in df_1985.columns if 'ID' in col or 'id' in col or 'player' in col or 'name' in col] 
id_columns

['name_common',
 'year_ID',
 'player_ID',
 'mlb_ID',
 'stint_ID',
 'team_ID',
 'playerID',
 'retroID',
 'bbrefID',
 'yearID',
 'GIDP',
 'teamID',
 'lgID',
 'franchID',
 'divID',
 'name_x',
 'teamIDBR',
 'teamIDlahman45',
 'teamIDretro',
 'firstname',
 'lastname',
 'playerid',
 'mlbid',
 'name_y',
 'career_GIDP']

In [98]:
# playerID and player_ID comparison
compare_columns(df_1985, 'playerID', 'player_ID')

Number of different values:  554
Number of equal values:  44439
Number of null values for playerID:  0
Number of null values for player_ID:  0


Let's select the rows where `playerID` and `player_ID` are different and see what the difference is.

In [99]:
pd.set_option('display.max_rows', 400)
df_1985[df_1985['playerID'] != df_1985['player_ID']][['playerID', 'player_ID']]

Unnamed: 0,playerID,player_ID
58710,obriech01,o'brich01
58951,oconnja02,o'conja02
59196,oberrmi01,o'bermi01
59245,oneilpa01,o'neipa01
59251,obriepe03,o'bripe03
...,...,...
102943,montafr01,montafr02
103026,beeksja01,beeksja02
103073,rodrijo04,rodrijo06
103248,mannima01,mannima02


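Many of the differences come from special characters in the `player_ID` column (for example `o'brich01` vs `obriech01`), while others differ only in the numeric suffix (for example `montafr02` vs `montafr01`). We can drop `player_ID` and keep `playerID`, since it is cleaner.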
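To see how the mismatches split between these two causes, one option is to flag IDs that contain anything other than lowercase letters and digits. This is a minimal sketch on toy data: the first three ID pairs are copied from the output above, and the last pair (`examppl01`) is a made-up matching row.

```python
import pandas as pd

# Toy data: three mismatching pairs from the output above plus one
# hypothetical matching pair
ids = pd.DataFrame({
    "playerID":  ["obriech01", "oneilpa01", "montafr01", "examppl01"],
    "player_ID": ["o'brich01", "o'neipa01", "montafr02", "examppl01"],
})

differs = ids["playerID"] != ids["player_ID"]

# True where player_ID contains a character other than a-z or 0-9
has_special_char = ids["player_ID"].str.contains(r"[^a-z0-9]", regex=True)

# Split the mismatches by likely cause
n_special = int((differs & has_special_char).sum())
n_other = int((differs & ~has_special_char).sum())
print("Mismatches with special characters:", n_special)
print("Other mismatches:", n_other)
```

Run against the real frame, the same two masks would show how many of the 554 differences are explained by special characters alone.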

In [100]:
# Append player_ID to drop_cols
drop_cols.append('player_ID')

##### More player identifiers

In [101]:
# Player identifiers
player_ids_columns = ['playerID', 'mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid']

df_1985[player_ids_columns].value_counts().sample(10)

playerID   mlb_ID    retroID   bbrefID    mlbid     playerid
monahsh01  136265.0  monas001  monahsh01  136265.0  15531.0      2
valaich01  453362.0  valac001  valaich01  453362.0  50528.0      4
dawsoan01  113151.0  dawsa001  dawsoan01  113151.0  10576.0     12
estrajo01  283051.0  estrj001  estrajo01  283051.0  1119.0       8
balleja01  110517.0  ballj001  balleja01  110517.0  8431.0       5
breale01   276506.0  breal001  breale01   276506.0  4172.0       2
witasja01  124482.0  witaj001  witasja01  124482.0  225.0        8
butlebi03  913428.0  butlb003  butlebi03  456714.0  45958.0      1
roberet01  681799.0  robee001  roberet01  681799.0  201728.0     1
ozunapa01  150367.0  ozunp001  ozunapa01  150367.0  628.0        6
dtype: int64

We are interested in Baseball Reference IDs, as it is one of our main sources of information. `playerID` and `bbrefID` are the same, and both of them are from Baseball Reference. We can drop `bbrefID` and keep `playerID`. We will also drop the rest of the player identifiers.

In [102]:
drop_cols.extend(['mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid'])

`name_common` and `name_y`

In [103]:
# name_common and name_y sample
df_1985[df_1985['name_common'] != df_1985['name_y']][['name_common', 'name_y']].sample(10)

Unnamed: 0,name_common,name_y
79768,Jon Adkins,jonathan adkins
86008,Jhonny Nunez,jhonny nunez
85391,Adam Everett,adam everett
94464,Andre Ethier,andre ethier
98206,Ryan Carpenter,ryan carpenter
68552,Alex Arias,alex arias
89110,Vin Mazzaro,vincent mazzaro
102017,Kyle Finnegan,kyle finnegan
67371,Scott Aldred,scott aldred
92083,Danny Espinosa,danny espinosa


They appear to hold the same information, but `name_y` is in lower case and sometimes uses the full first name (for example `jonathan adkins` vs `Jon Adkins`). We will keep `name_common` and drop `name_y`.

In [104]:
# drop name_y
drop_cols.append('name_y')

##### Age columns

In [105]:
compare_columns(df_1985, 'age_x', 'age_y')

Number of different values:  975
Number of equal values:  44018
Number of null values for age_x:  0
Number of null values for age_y:  85


We will drop `age_y` as it has more missing values.

In [106]:
drop_cols.append('age_y')

# Change name of age_x to age
df_1985.rename(columns={'age_x': 'age'}, inplace=True)

##### Team identifiers

There are several columns that serve as identifiers for the team:
- `teamID`
- `team_ID`
- `teamIDBR`
- `teamIDlahman45`
- `teamIDretro`
- `franchID`
- `name_x`
- `TeamName`

Let's look at a dataframe including these columns:

In [107]:
# Team identifiers
team_ids_columns = ['teamID', 'team_ID', 'teamIDBR', 'teamIDlahman45', 'teamIDretro', 'franchID', 'name_x', 'TeamName']



We will rely on the identifiers provided by Baseball-Reference (https://www.baseball-reference.com/about/team_IDs.shtml), specifically `teamIDBR` and `franchID`. Baseball-Reference is one of our primary data sources, and using its identifiers keeps our analysis consistent. We will drop the other team identifiers except for `name_x` (renamed to `teamName` below), which will be useful for visualizations and analysis.

In [108]:
# null values for the team identifiers columns
df_1985[team_ids_columns].isnull().sum()

teamID             0
team_ID            0
teamIDBR           0
teamIDlahman45     0
teamIDretro        0
franchID           0
name_x             0
TeamName          85
dtype: int64

In [109]:
# Append teamID, team_ID, teamIDlahman45, teamIDretro and TeamName to drop_cols
drop_cols.extend(['teamID', 'team_ID', 'teamIDlahman45', 'teamIDretro', 'TeamName'])

In [110]:
# Change name_x to teamName
df_1985.rename(columns={'name_x': 'teamName'}, inplace=True)

##### Salary columns

`salary_x` and `salary_y`

In [111]:
compare_columns(df_1985, 'salary_x', 'salary_y')

Number of different values:  17084
Number of equal values:  27909
Number of null values for salary_x:  0
Number of null values for salary_y:  85


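We will drop `salary_x` and keep `salary_y`. Note that `salary_y` is the column with missing values (85), not `salary_x`; those nulls are revisited in the null-values section.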
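With more than 17,000 differing rows, it can help to break the mismatches down by cause before choosing a column. The sketch below does this on toy data; the frame is hypothetical and the categories (null in one column, zero in the other) are assumptions about what might explain the differences, not findings from the dataset.

```python
import numpy as np
import pandas as pd

# Toy salaries (illustrative values only)
sal = pd.DataFrame({
    "salary_x": [250000.0, 0.0, 3312500.0, 0.0],
    "salary_y": [250000.0, 200000.0, 3312500.0, np.nan],
})

differs = sal["salary_x"] != sal["salary_y"]
y_null = sal["salary_y"].isna()
x_zero = sal["salary_x"].eq(0)

# Mismatches where salary_y is missing
n_y_null = int((differs & y_null).sum())
# Mismatches where salary_x is 0 but salary_y holds a value
n_x_zero = int((differs & x_zero & ~y_null).sum())
# Mismatches where both columns hold different non-zero values
n_both_set = int((differs & ~x_zero & ~y_null).sum())

print(n_y_null, n_x_zero, n_both_set)
```

If most mismatches fell in the second group, that would suggest one column stores unknown salaries as 0, which matters because a zero is much harder to spot later than a null.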

In [112]:
drop_cols.append('salary_x')

In [113]:
more_cols = ['']

##### Summary of columns to drop

In [114]:
# Columns to drop
print('Columns to drop: ', drop_cols)
print('Number of columns to drop: ', len(drop_cols))

Columns to drop:  ['firstname', 'lastname', 'borndate', 'birthCountry', 'ERA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'PPF', 'R_y', 'AB_y', 'H_y', '2B_y', '3B_y', 'HR_y', 'BB_y', 'SO_y', 'SB_y', 'CS_y', 'SF_y', 'E', 'DP', 'FP', 'BPF', 'attendance', 'lgID', 'divID', 'HBP_x', 'HBP_y', 'stint_ID', 'SF_x', 'SF_y', 'GIDP', 'SH', 'year_ID', 'year', 'player_ID', 'mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid', 'name_y', 'age_y', 'teamID', 'team_ID', 'teamIDlahman45', 'teamIDretro', 'TeamName', 'salary_x']
Number of columns to drop:  57
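The list printed above contains `'ERA'` and `'SF_y'` twice, so its 57 entries name 55 distinct columns. A minimal sketch of how the list could be de-duplicated and checked before dropping, using a toy frame and a hypothetical `to_drop` list:

```python
import pandas as pd

# Toy frame and drop list (hypothetical); 'ERA' is listed twice on purpose
toy = pd.DataFrame({"H": [1, 2], "ERA": [3.1, 4.2], "SV": [0, 1]})
to_drop = ["ERA", "SV", "ERA"]

# dict.fromkeys keeps the first occurrence of each name and preserves order
unique_to_drop = list(dict.fromkeys(to_drop))

# Names that are not in the frame would make drop() raise a KeyError
missing = [col for col in unique_to_drop if col not in toy.columns]

trimmed = toy.drop(columns=unique_to_drop)
print("Entries:", len(to_drop), "| distinct:", len(unique_to_drop))
print("Missing from frame:", missing)
print("Remaining columns:", list(trimmed.columns))
```

Counting distinct names makes the number of dropped columns easier to reconcile with the shape of the dataframe afterwards.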


##### Dropping columns

In [115]:
# Drop columns
df_1985.drop(columns=drop_cols, inplace=True)

In [116]:
# Shape after dropping columns
print('Shape of the dataset: ', df_1985.shape)

Shape of the dataset:  (44993, 84)


In [117]:
# Reset the list to store more columns to be dropped if needed
drop_cols = []

In [118]:
df_1985.sample(5)

Unnamed: 0,name_common,age,pitcher,PA,G_x,Inn,WAR,playerID,birthYear,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,franchID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,RA,teamName,park,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,POS,BirthYear,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,qualified
89838,Jon Jay,27.0,N,502.0,117,993.3,3.3,jayjo02,1985.0,200.0,71.0,L,L,2010-04-26,2021-05-12,2012.0,117.0,443.0,70.0,135.0,22.0,4.0,4.0,40.0,19.0,7.0,34.0,71.0,3.0,STL,2.0,162.0,81.0,88.0,74.0,N,N,N,648.0,St. Louis Cardinals,Busch Stadium III,STL,504000.0,579.0,17.0,3213479.0,480000.0,OF,1985.0,0.305,0.373,105.0,177.0,0.4,0.773,0.355,0.543,381,1328,1185,173,356,265,65,8,18,104,27,18,86,202,4,25,26,6,25,491,6.94,0.3,0.359,0.414,0.773,0.348,0
81787,Chad Harville,29.0,Y,0.0,3,41.0,0.0,harvich01,1976.0,180.0,69.0,R,R,1999-06-23,2006-08-12,2006.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,TBD,5.0,162.0,81.0,61.0,101.0,N,N,N,856.0,Tampa Bay Devil Rays,Tropicana Field,TBD,327000.0,0.0,0.0,2699292.0,327000.0,P,1976.0,,,0.0,0.0,,,,0.377,92,2,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,-0.04,0.0,0.0,0.0,0.0,,0
65210,Rick Mahler,37.0,Y,18.0,23,66.0,-0.05,mahleri01,1953.0,195.0,73.0,R,R,1979-04-20,1991-08-06,1991.0,23.0,14.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,WSN,6.0,161.0,68.0,71.0,90.0,N,N,N,655.0,Montreal Expos,Stade Olympique,MON,100000.0,0.0,0.0,851492.0,100000.0,P,1953.0,0.143,0.143,1.0,3.0,0.214,0.357,0.2,0.441,394,671,581,40,104,86,17,0,1,37,0,1,20,117,0,2,66,2,10,124,0.58,0.179,0.208,0.213,0.421,0.222,0
61045,Mark Carreon,23.0,N,13.0,9,21.0,-0.11,carrema01,1963.0,170.0,72.0,R,L,1987-09-08,1996-08-23,1987.0,9.0,12.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,NYM,2.0,162.0,81.0,92.0,70.0,N,N,N,698.0,New York Mets,Shea Stadium,NYM,60000.0,513.0,21.0,412454.0,62500.0,OF,1963.0,0.25,0.308,3.0,3.0,0.25,0.558,0.273,0.568,9,13,12,0,3,3,0,0,0,1,0,1,1,1,0,0,0,0,0,3,-0.11,0.25,0.308,0.25,0.558,0.273,0
102746,Bobby Bradley,26.0,N,17.0,8,41.0,-0.39,bradlbo01,1996.0,225.0,73.0,L,R,2019-06-23,2022-04-27,2022.0,8.0,17.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,CLE,1.0,162.0,81.0,92.0,70.0,Y,N,N,634.0,Cleveland Guardians,Progressive Field,CLE,700000.0,0.0,0.0,4317736.0,700000.0,1B,1996.0,0.118,0.118,2.0,2.0,0.118,0.236,0.25,0.568,97,345,307,41,61,29,15,0,17,45,0,0,29,128,1,6,0,3,5,127,-0.31,0.199,0.278,0.414,0.692,0.267,0


### 3.4 Filtering and removing players with no relevance to the project (Removing rows) <a id='drop_rows'></a>

#### Pitchers

As we mentioned earlier in the project, we are going to focus on batting metrics. We will remove pitchers from the dataset.

In [119]:
# Drop rows with pitcher = Y; copy so later in-place edits act on an independent frame
df_1985 = df_1985[df_1985['pitcher'] != 'Y'].copy()

In [120]:
# Drop pitcher column
df_1985.drop('pitcher', axis=1, inplace=True)

# Shape
df_1985.shape

(22429, 83)

---

#### Update Team Names

Since 1985, several teams have changed their names. We'll update these to reflect their current names, resulting in 30 unique team names. This includes the Montreal Expos, the only team in our time span to have relocated, whose rows are renamed to the Washington Nationals.

- https://www.mlb.com/team

In [121]:
# Team names
df_1985['teamName'].value_counts()

San Francisco Giants             823
Cincinnati Reds                  802
Kansas City Royals               800
Seattle Mariners                 799
Texas Rangers                    798
Los Angeles Dodgers              794
Baltimore Orioles                793
New York Mets                    790
Oakland Athletics                790
Boston Red Sox                   787
San Diego Padres                 783
New York Yankees                 782
Cleveland Indians                779
Pittsburgh Pirates               777
Detroit Tigers                   767
Minnesota Twins                  765
Chicago Cubs                     757
Toronto Blue Jays                755
Philadelphia Phillies            752
Houston Astros                   749
St. Louis Cardinals              741
Chicago White Sox                733
Milwaukee Brewers                733
Atlanta Braves                   729
Colorado Rockies                 606
Arizona Diamondbacks             524
Montreal Expos                   404
L

In [122]:
# Map historical team names to their current names
team_name_updates = {
    'Cleveland Indians': 'Cleveland Guardians',
    'Anaheim Angels': 'Los Angeles Angels',
    'California Angels': 'Los Angeles Angels',
    'Los Angeles Angels of Anaheim': 'Los Angeles Angels',
    'Tampa Bay Devil Rays': 'Tampa Bay Rays',
    'Florida Marlins': 'Miami Marlins',
    'Montreal Expos': 'Washington Nationals',
}

df_1985['teamName'] = df_1985['teamName'].replace(team_name_updates)

In [123]:
# Unique team names
df_1985['teamName'].nunique()

30

---

### 3.5 Dealing with Null Values <a id='null_values'></a>

In [124]:
pd.set_option('display.max_rows', 200)
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

DivWin           513
LgWin            513
WSWin            513
salary_y          10
leaguerank        10
teamrank          10
averagesalary     10
leagueminimum     10
POS               44
BA                47
OBP               36
SLG               47
OPS               47
BABIP            139
career_BA         21
career_OBP        15
career_SLG        21
career_OPS        21
career_BABIP      70
dtype: int64

The columns `WSWin`, `DivWin`, and `LgWin` all have the same number of null values (513), which suggests the nulls share a common cause. Let's look at the rows with null values in these columns.

In [125]:
pd.set_option('display.max_rows', None)
# Rows with DivWin, LgWin and WSWin null values, sort 
df_1985[df_1985['DivWin'].isnull() & df_1985['LgWin'].isnull() & df_1985['WSWin'].isnull()].nunique().sort_values()

LgWin              0
DivWin             0
WSWin              0
yearID             1
leagueminimum      1
averagesalary      1
throws             2
qualified          2
bats               3
Rank               5
G                  5
POS                6
3B_x              12
Ghome             13
height            14
W                 15
L                 16
CS_x              16
IBB               20
BirthYear         21
age               21
birthYear         23
W%                24
RA                26
park              28
teamIDBR          28
teamName          28
franchID          28
teamrank          34
weight            37
HR_x              38
2B_x              38
SB_x              39
career_HBP_x      57
career_3B_x       67
career_SH         69
career_SF_x       73
BB_x              75
career_IBB        88
RBI               92
R_x               93
career_CS_x       96
SO_x             101
1B               102
G_y              112
G_x              112
career_GIDP      126
career_BA    

Notably, `yearID` contains only one unique value, so all the records in this subset belong to a single season.

In [126]:
df_1985[df_1985['DivWin'].isnull() & df_1985['LgWin'].isnull() & df_1985['WSWin'].isnull()][['yearID']].value_counts()

yearID
1994.0    513
dtype: int64

The null values present in the `DivWin`, `LgWin`, and `WSWin` columns correspond to the year 1994. This particular year holds significance in MLB history as the season came to an abrupt end due to a labor strike. As a result, no teams were able to compete in the playoffs, and the World Series was canceled.

Therefore, the presence of null values in `DivWin` (division winner), `LgWin` (league winner), and `WSWin` (World Series winner) for the year 1994 is expected. We will fill these null values with "N" to indicate that no team won the division, league, or World Series that year.

In [127]:
# Fill WSWin, LgWin and DivWin null values with 'N'
# (assign the result back rather than calling fillna in place on a selected column)
win_columns = ['WSWin', 'LgWin', 'DivWin']
df_1985[win_columns] = df_1985[win_columns].fillna('N')

In [128]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0].sort_values(ascending=False)

BABIP            139
career_BABIP      70
BA                47
SLG               47
OPS               47
POS               44
OBP               36
career_BA         21
career_SLG        21
career_OPS        21
career_OBP        15
salary_y          10
leaguerank        10
teamrank          10
averagesalary     10
leagueminimum     10
dtype: int64

In [129]:
df_1985.shape

(22429, 83)

##### `BABIP` Nulls

In [130]:
# Rows with BABIP null values
df_1985[df_1985['BABIP'].isnull()].sample()

Unnamed: 0,name_common,age,PA,G_x,Inn,WAR,playerID,birthYear,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,franchID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,RA,teamName,park,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,POS,BirthYear,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,qualified
79383,Corey Hart,22.0,1.0,1,0.0,0.0,hartco01,1982.0,240.0,78.0,R,R,2004-05-25,2015-06-21,2004.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,MIL,6.0,161.0,81.0,67.0,94.0,N,N,N,757.0,Milwaukee Brewers,Miller Park,MIL,300000.0,0.0,0.0,2313535.0,300000.0,OF,1982.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.416,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,,0


The calculation for BABIP (Batting Average on Balls In Play) is:

BABIP = (H - HR) / (AB - SO - HR + SF)

NaN values in `BABIP` arise when both the numerator and the denominator of the formula are zero. The denominator is zero if the player has no at-bats (AB), or if every at-bat ended in a strikeout (SO) or a home run (HR) and the player hit no sacrifice flies (SF).

In pandas this does not raise an error: dividing zero by zero simply yields NaN. In the sampled row above, for example, the player's only at-bat was a strikeout.

For this reason we will fill the `BABIP` NaN values with zero.
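As a quick illustration of how these NaN values come about, the sketch below applies the formula to three toy batting lines (hypothetical numbers, not rows from the dataset): a regular hitter, a player whose only at-bat was a strikeout, and a player with no at-bats.

```python
import pandas as pd

# Toy batting lines (hypothetical)
bat = pd.DataFrame({
    "H":  [150, 0, 0],
    "HR": [20, 0, 0],
    "AB": [500, 1, 0],
    "SO": [100, 1, 0],
    "SF": [5, 0, 0],
})

# BABIP = (H - HR) / (AB - SO - HR + SF)
numerator = bat["H"] - bat["HR"]
denominator = bat["AB"] - bat["SO"] - bat["HR"] + bat["SF"]

# pandas returns NaN for 0 / 0 instead of raising an error
babip = numerator / denominator

# Same treatment as in this notebook: replace undefined values with 0
babip_filled = babip.fillna(0)
print(pd.DataFrame({"BABIP": babip, "BABIP_filled": babip_filled}))
```

Note that filling with zero makes an undefined BABIP indistinguishable from a true .000 on balls in play; that is a modeling choice rather than a property of the statistic.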

In [131]:
# Fill BABIP null values with 0 (assign back instead of an in-place call on a selected column)
df_1985['BABIP'] = df_1985['BABIP'].fillna(0)

In [132]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

salary_y         10
leaguerank       10
teamrank         10
averagesalary    10
leagueminimum    10
POS              44
BA               47
OBP              36
SLG              47
OPS              47
career_BA        21
career_OBP       15
career_SLG       21
career_OPS       21
career_BABIP     70
dtype: int64

##### `BA`, `OBP`, `SLG`, and `OPS` Nulls

The values for Batting Average (`BA`), On-Base Percentage (`OBP`), Slugging Percentage (`SLG`), and On-Base Plus Slugging (`OPS`) are ratios calculated from a player's hitting statistics. If the denominator is zero, for example when a player has no at-bats (`BA`, `SLG`) or no plate appearances that count toward `OBP`, the value is undefined and appears as NaN in our dataset.

For this reason we will fill the `BA`, `OBP`, `SLG`, and `OPS` NaN values with zero.
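For reference, these are the standard definitions of the four rates, written with the abbreviations used in this notebook (TB is total bases). That the dataset was computed exactly this way is an assumption; the notebook does not show the calculation.

```latex
\begin{aligned}
\mathrm{BA}  &= \frac{H}{AB} \\
\mathrm{OBP} &= \frac{H + BB + HBP}{AB + BB + HBP + SF} \\
\mathrm{SLG} &= \frac{TB}{AB} \\
\mathrm{OPS} &= \mathrm{OBP} + \mathrm{SLG}
\end{aligned}
```

Under these definitions `BA` and `SLG` are undefined when AB = 0, while `OBP` is undefined only when AB + BB + HBP + SF = 0. That would be consistent with `OBP` having fewer nulls (36) than `BA` and `SLG` (47) in the output above.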

In [133]:
# Fill BA, OBP, SLG, OPS null values with 0
# (assign the result back rather than calling fillna in place on a selected column)
rate_columns = ['BA', 'OBP', 'SLG', 'OPS']
df_1985[rate_columns] = df_1985[rate_columns].fillna(0)

In [134]:
# null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

salary_y         10
leaguerank       10
teamrank         10
averagesalary    10
leagueminimum    10
POS              44
career_BA        21
career_OBP       15
career_SLG       21
career_OPS       21
career_BABIP     70
dtype: int64

##### Career Stats Nulls

In [135]:
# Rows with career_BABIP null values
df_1985[df_1985['career_BABIP'].isnull()].sample(5)

Unnamed: 0,name_common,age,PA,G_x,Inn,WAR,playerID,birthYear,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,franchID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,RA,teamName,park,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,POS,BirthYear,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,qualified
64668,Doug Lindsey,23.0,3.0,1,9.0,-0.17,lindsdo01,1967.0,200.0,74.0,R,R,1991-10-06,1993-10-02,1991.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,PHI,3.0,162.0,83.0,78.0,84.0,N,N,N,680.0,Philadelphia Phillies,Veterans Stadium,PHI,100000.0,0.0,0.0,851492.0,100000.0,C,1967.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.481,1,3,3,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,-0.17,0.0,0.0,0.0,0.0,,0
102241,Otto Lopez,22.0,1.0,1,0.0,0.0,lopezot01,1998.0,185.0,70.0,R,R,2021-08-17,2022-10-05,2021.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,TOR,4.0,162.0,80.0,91.0,71.0,N,N,N,663.0,Toronto Blue Jays,Rogers Centre,TOR,570500.0,0.0,0.0,4170000.0,570500.0,SS,1998.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.562,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,,0
88765,Martín Maldonado,24.0,1.0,3,3.0,-0.03,maldoma01,1986.0,230.0,72.0,R,R,2011-09-03,2023-06-23,2011.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,MIL,1.0,162.0,81.0,96.0,66.0,Y,N,N,638.0,Milwaukee Brewers,Miller Park,MIL,414000.0,0.0,0.0,3095183.0,414000.0,C,1986.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.593,3,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,-0.03,0.0,0.0,0.0,0.0,,0
91226,Kevin Kiermaier,23.0,0.0,1,1.0,0.0,kiermke01,1990.0,210.0,73.0,L,R,2013-09-30,2023-06-24,2013.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,TBD,2.0,163.0,81.0,92.0,71.0,N,N,N,646.0,Tampa Bay Rays,Tropicana Field,TBR,490000.0,0.0,0.0,3386212.0,490000.0,OF,1990.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.564,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,,,,,,0
75219,Raul Gonzalez,26.0,2.0,3,3.0,-0.08,gonzara01,1973.0,190.0,68.0,R,R,2000-05-25,2004-06-21,2000.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,CHC,6.0,162.0,81.0,65.0,97.0,N,N,N,904.0,Chicago Cubs,Wrigley Field,CHC,175000.0,759.0,27.0,1895630.0,200000.0,OF,1973.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.401,3,2,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,-0.08,0.0,0.0,0.0,0.0,,0


NaN values for the career statistics (`career_BABIP`, `career_OPS`, `career_SLG`, `career_OBP`, and `career_BA`) appear when the corresponding career denominator is still zero: the player has no at-bats up to that point or, for `career_BABIP`, every at-bat so far ended in a strikeout or a home run.
This is particularly common for players in their first season, as they have not yet had the opportunity to accumulate the at-bats, hits, or walks that feed these metrics. As a result, the denominators in the formulas for these statistics are zero, leading to undefined values.

As before, we will fill these career stats NaN values with zero.
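The sketch below shows how a running career rate ends up undefined in a debut season without at-bats. It uses toy seasons for two hypothetical players and assumes career totals are cumulative sums up to and including each season; the notebook's own career columns were built in an earlier section and may differ in detail.

```python
import pandas as pd

# Toy seasons (hypothetical): player "a" has no at-bats in the debut season
seasons = pd.DataFrame({
    "playerID": ["a", "a", "b", "b"],
    "yearID":   [2001, 2002, 2001, 2002],
    "H":        [0, 30, 10, 20],
    "AB":       [0, 100, 40, 60],
}).sort_values(["playerID", "yearID"])

# Running career totals per player, including the current season
career = seasons.groupby("playerID")[["H", "AB"]].cumsum()

# 0 / 0 in a debut season without at-bats gives NaN
seasons["career_BA"] = career["H"] / career["AB"]

# Same treatment as in this notebook: replace undefined values with 0
seasons["career_BA_filled"] = seasons["career_BA"].fillna(0)
print(seasons)
```

Only the first row is undefined; once the player records an at-bat, the cumulative denominator is positive and the rate is well defined.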

In [136]:
# Fill career_BABIP, career_BA, career_OBP, career_SLG, career_OPS null values with 0
# (assign the result back rather than calling fillna in place on a selected column)
career_rate_columns = ['career_BABIP', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS']
df_1985[career_rate_columns] = df_1985[career_rate_columns].fillna(0)

##### Position null values
We have 44 null values for the position of the player, spread across 36 players. That is few enough to look up the positions manually in Baseball Reference. We first fill the nulls with 'OF' as a default and then define a helper to set the position of individual players.

In [137]:
# Unique POS
df_1985['POS'].unique()

array(['OF', nan, '2B', 'C', 'SS', '1B', '3B', 'P'], dtype=object)

In [138]:
# playerID for null POS
df_1985[df_1985['POS'].isnull()]['playerID'].value_counts()


thornan01    3
mcraeha01    3
ortajo01     3
stanist01    2
greenad01    2
chambal01    1
godwity01    1
sassero01    1
casimca01    1
ortegbi01    1
wrighro02    1
decasyu01    1
lukacro01    1
melilke01    1
saloman01    1
jimenlu01    1
belnovi01    1
bormajo01    1
wilsoco01    1
mullise01    1
banisje01    1
byrdji01     1
dalenpe01    1
freemla01    1
allenro02    1
davidan01    1
woodsal01    1
aikenwi01    1
gamblos01    1
squirmi01    1
burroje01    1
johnsja01    1
rodried02    1
kuipedu01    1
fordda01     1
dawsoro01    1
Name: playerID, dtype: int64

In [139]:
# Fill null POS with 'OF' as a default
df_1985['POS'] = df_1985['POS'].fillna('OF')

In [140]:
def fill_positions(player_id, position):
    '''
    Set the POS column to the given position for every row of a player.
    Note that this overwrites any existing POS value for that player, not only nulls.
    :param player_id: string
    :param position: string
    '''
    df_1985.loc[df_1985['playerID'] == player_id, 'POS'] = position
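An alternative to filling a default first and then correcting players one by one is to fill only the missing positions from a lookup table. This is a sketch under stated assumptions: the player IDs, the positions, and the `position_lookup` dictionary below are all made up for illustration and are not real Baseball Reference lookups.

```python
import numpy as np
import pandas as pd

# Toy data with made-up player IDs and positions
players = pd.DataFrame({
    "playerID": ["aaaaaa01", "aaaaaa01", "bbbbbb01", "cccccc01"],
    "POS":      [np.nan, np.nan, "C", np.nan],
})

# Hypothetical lookup table, filled in by hand from an external source
position_lookup = {"aaaaaa01": "1B", "cccccc01": "SS"}

# Fill only the missing positions; existing values are left untouched
looked_up = players["playerID"].map(position_lookup)
players["POS"] = players["POS"].fillna(looked_up)
print(players)
```

Players missing from the lookup would stay null with this approach, which makes any gaps visible instead of silently labelling them as outfielders.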

##### Remaining Nulls

In [141]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

salary_y         10
leaguerank       10
teamrank         10
averagesalary    10
leagueminimum    10
dtype: int64

We only have 10 rows (players) remaining with null values, all in the salary-related columns. We will take a look at them and decide what to do.

In [142]:
# Rows with null values in the salary-related columns (filtered on teamrank)
df_1985[df_1985['teamrank'].isnull()]

Unnamed: 0,name_common,age,PA,G_x,Inn,WAR,playerID,birthYear,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,franchID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,RA,teamName,park,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,POS,BirthYear,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,qualified
60462,Alexis Infante,25.0,0.0,1,0.0,0.0,infanal01,1961.0,175.0,70.0,R,R,1987-09-27,1990-07-15,1987.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,TOR,2.0,162.0,81.0,96.0,66.0,N,N,N,655.0,Toronto Blue Jays,Exhibition Stadium,TOR,,,,,,3B,1961.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.593,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
61095,Mike Fischlin,31.0,0.0,1,0.0,0.0,fischmi01,1955.0,165.0,73.0,R,R,1977-09-03,1987-05-11,1987.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ATL,5.0,161.0,81.0,69.0,92.0,N,N,N,829.0,Atlanta Braves,Atlanta-Fulton County Stadium,ATL,,,,,,SS,1955.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.429,517,1082,941,109,207,169,29,6,3,68,24,13,92,142,1,5,39,5,11,257,-1.34,0.22,0.291,0.273,0.564,0.255,0
65814,Jack Voigt,26.0,0.0,1,0.0,0.0,voigtja01,1966.0,170.0,73.0,R,R,1992-08-03,1998-07-27,1992.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BAL,3.0,162.0,81.0,89.0,73.0,N,N,N,656.0,Baltimore Orioles,Oriole Park at Camden Yards,BAL,,,,,,OF,1966.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.549,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
66405,Tommy Shields,27.0,0.0,2,0.0,0.0,shielto01,1964.0,180.0,72.0,L,R,1992-07-25,1993-10-03,1992.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BAL,3.0,162.0,81.0,89.0,73.0,N,N,N,656.0,Baltimore Orioles,Oriole Park at Camden Yards,BAL,,,,,,3B,1964.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.549,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
71628,Randy Velarde,34.0,0.0,1,0.0,0.0,velarra01,1962.0,185.0,72.0,R,R,1987-08-20,2002-09-29,1997.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ANA,2.0,162.0,82.0,84.0,78.0,N,N,N,794.0,Los Angeles Angels,Edison International Field,ANA,,,,,,2B,1962.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.519,795,2788,2465,345,662,466,126,13,57,262,29,22,256,500,3,26,26,15,64,985,13.48,0.269,0.342,0.4,0.742,0.315,0
71866,Torii Hunter,21.0,0.0,1,0.0,0.0,hunteto01,1975.0,220.0,74.0,R,R,1997-08-22,2015-10-03,1997.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,MIN,4.0,162.0,81.0,68.0,94.0,N,N,N,861.0,Minnesota Twins,Hubert H Humphrey Metrodome,MIN,,,,,,OF,1975.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.42,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
77957,Anderson Machado,22.0,0.0,1,0.0,0.0,machaan01,1981.0,189.0,72.0,B,R,2003-09-27,2005-07-28,2003.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,PHI,3.0,162.0,81.0,86.0,76.0,N,N,N,697.0,Philadelphia Phillies,Veterans Stadium,PHI,,,,,,SS,1981.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.531,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
83408,Jerry Gil,24.0,0.0,1,0.0,0.0,gilje01,1982.0,215.0,75.0,R,R,2004-08-22,2007-09-27,2007.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CIN,5.0,162.0,81.0,72.0,90.0,N,N,N,853.0,Cincinnati Reds,Great American Ball Park,CIN,,,,,,SS,1982.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.444,30,88,86,3,15,12,2,1,0,8,2,0,0,33,0,1,0,1,2,19,-0.91,0.174,0.182,0.221,0.403,0.278,0
85955,Jason Pridie,25.0,0.0,1,0.0,0.0,pridija01,1983.0,205.0,73.0,L,R,2008-09-03,2015-09-29,2009.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,MIN,1.0,163.0,82.0,87.0,76.0,Y,N,N,765.0,Minnesota Twins,Hubert H Humphrey Metrodome,MIN,,,,,,OF,1983.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.534,11,6,4,3,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,-0.2,0.0,0.2,0.0,0.2,0.0,0
86621,Yadier Molina,26.0,544.0,140,1186.7,3.15,molinya01,1982.0,225.0,71.0,R,R,2004-06-03,2022-10-05,2009.0,140.0,481.0,45.0,141.0,23.0,1.0,6.0,54.0,9.0,3.0,50.0,39.0,2.0,STL,1.0,162.0,81.0,91.0,71.0,Y,N,N,640.0,St. Louis Cardinals,Busch Stadium III,STL,,,,,,C,1982.0,0.293,0.366,111.0,184.0,0.383,0.749,0.309,0.562,669,2458,2215,189,596,456,103,2,35,263,13,12,178,202,19,20,29,16,95,808,8.22,0.269,0.327,0.365,0.692,0.281,1


We were not able to find salary information for these players, with one exception: Yadier Molina's 2009 season. We will fill his null values with the information we found about him in the sources below.

- https://www.baseball-reference.com/players/m/molinya01.shtml
- https://www.baseball-reference.com/bullpen/Minimum_salary
- https://www.baseball-reference.com/teams/STL/2009.shtml


In [143]:
# Fill Yadier Molina's salary_y for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'salary_y'] = 3312500

# Fill Yadier Molina's teamrank for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'teamrank'] = 9

# Fill Yadier Molina's average salary for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'averagesalary'] = 2996106

# Fill Yadier Molina's leagueminimum for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'leagueminimum'] = 400000   

# Fill Yadier Molina's leaguerank for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'leaguerank'] = 100
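
The five assignments above all use the same row filter. As a minimal sketch on a toy frame (the names `toy`, `fills`, and `mask` are hypothetical; the fill values are copied from the cell above), the same update could be written with one mask and a dictionary:

```python
import numpy as np
import pandas as pd

salary_cols = ["salary_y", "teamrank", "averagesalary", "leagueminimum", "leaguerank"]

# Toy stand-in for df_1985: two rows whose salary fields are all missing
toy = pd.DataFrame({"playerID": ["molinya01", "hunteto01"],
                    "yearID": [2009.0, 1997.0]})
for col in salary_cols:
    toy[col] = np.nan

# Values copied from the cell above
fills = {"salary_y": 3312500, "teamrank": 9, "averagesalary": 2996106,
         "leagueminimum": 400000, "leaguerank": 100}

# Build the row filter once and reuse it for every column
mask = (toy["playerID"] == "molinya01") & (toy["yearID"] == 2009)
for col, value in fills.items():
    toy.loc[mask, col] = value

# Only the row that was not filled should still report nulls
print(toy[salary_cols].isnull().sum())
```

Building the mask once keeps the player and season in a single place, which makes a mistyped ID or year easier to spot.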



In [144]:
# null values
df_1985.isnull().sum().sort_values(ascending=False)

salary_y         9
leaguerank       9
averagesalary    9
leagueminimum    9
teamrank         9
career_OBP       0
career_BA        0
career_AB_x      0
career_PA        0
career_G_x       0
W%               0
BABIP            0
OPS              0
SLG              0
TB               0
1B               0
OBP              0
BA               0
BirthYear        0
POS              0
career_OPS       0
career_BABIP     0
career_R_x       0
career_H_x       0
career_1B        0
career_IBB       0
career_WAR       0
career_TB        0
career_GIDP      0
career_SF_x      0
career_SH        0
career_HBP_x     0
career_SO_x      0
career_2B_x      0
career_BB_x      0
career_CS_x      0
career_SB_x      0
career_RBI       0
career_HR_x      0
career_SLG       0
career_3B_x      0
name_common      0
age              0
3B_x             0
H_x              0
R_x              0
AB_x             0
G_y              0
yearID           0
finalGame        0
debut            0
throws           0
bats        

In [145]:
# Drop the 9 remaining rows, which have no salary information
df_1985.dropna(inplace=True)
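
Called without arguments, `dropna` removes every row that has a null in any column. The sketch below uses a toy frame with a hypothetical `note` column to show how the `subset` parameter restricts the check to chosen columns, which may be safer if other columns could also contain nulls:

```python
import numpy as np
import pandas as pd

# Toy frame: one missing salary and one unrelated missing value
toy = pd.DataFrame({
    "playerID": ["a01", "b01", "c01"],
    "salary_y": [500000.0, np.nan, 750000.0],
    "note": [None, "x", "y"],  # hypothetical column, not in the real data
})

# Drop rows only when the salary itself is missing
only_salary = toy.dropna(subset=["salary_y"])

# Drop rows with a null in any column (what the cell above does)
any_null = toy.dropna()

print(len(only_salary), len(any_null))
```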

## 4. Salary Adjustments <a id='salary_adjustments'></a>
Salaries from different years are not directly comparable, because inflation changes the value of money over time. To avoid misleading comparisons, we will adjust each salary for inflation by multiplying it by a buying-power factor for its season, based on the calculator linked below.
 - https://www.bls.gov/data/inflation_calculator.htm

In [146]:
buying_power = {1985: 2.89,
                   1986: 2.78,
                   1987: 2.74,
                   1988: 2.64,
                   1989: 2.52,
                   1990: 2.39,
                   1991: 2.27,
                   1992: 2.21,
                   1993: 2.14,
                   1994: 2.09,
                   1995: 2.03,
                   1996: 1.98,
                   1997: 1.92,
                   1998: 1.89,
                   1999: 1.86,
                   2000: 1.81,
                   2001: 1.76,
                   2002: 1.72,
                   2003: 1.68,
                   2004: 1.65,
                   2005: 1.61,
                   2006: 1.54,
                   2007: 1.51,
                   2008: 1.45,
                   2009: 1.45,
                   2010: 1.41,
                   2011: 1.39,
                   2012: 1.35,
                   2013: 1.32,
                   2014: 1.30,
                   2015: 1.31,
                   2016: 1.29,
                   2017: 1.26,
                   2018: 1.23,
                   2019: 1.21,
                   2020: 1.18,
                   2021: 1.17,
                   2022: 1.09,}

In [147]:
# Adjusted salary column
df_1985['adjusted_salary'] = round(df_1985['salary_y'] * df_1985['yearID'].map(buying_power), 0)
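
The adjustment is a row-wise multiplication: each salary times the factor for its season. The sketch below is a toy example, not part of the original pipeline. It reuses three factors from `buying_power` and adds a year that is deliberately missing from the dictionary, because `Series.map` returns `NaN` for keys it cannot find. The names `buying_power_subset`, `toy`, and `unmapped_years` are hypothetical.

```python
import pandas as pd

# Three multipliers copied from the buying_power dictionary above
buying_power_subset = {1987: 2.74, 1996: 1.98, 2011: 1.39}

toy = pd.DataFrame({
    "yearID": [1987.0, 1996.0, 2011.0, 1984.0],  # 1984 is absent on purpose
    "salary_y": [60000.0, 300000.0, 414000.0, 100000.0],
})

# Look up the factor for each season, then scale the salary
factor = toy["yearID"].map(buying_power_subset)
toy["adjusted_salary"] = (toy["salary_y"] * factor).round(0)

# Seasons without a factor produce NaN and are worth listing explicitly
unmapped_years = toy.loc[factor.isna(), "yearID"].unique()

print(toy)
print("Years without a factor:", unmapped_years)
```

For example, a 1987 salary of 60,000 becomes 60,000 × 2.74 = 164,400.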

In [148]:
# Sanity check
df_1985[['yearID', 'salary_y', 'adjusted_salary']].sample(10)

Unnamed: 0,yearID,salary_y,adjusted_salary
88866,2011.0,3650000.0,5073500.0
85213,2008.0,390000.0,565500.0
73049,1998.0,170000.0,321300.0
95154,2016.0,520200.0,671058.0
88539,2011.0,414000.0,575460.0
70398,1996.0,300000.0,594000.0
91600,2013.0,490000.0,646800.0
61281,1987.0,60000.0,164400.0
75459,2000.0,270000.0,488700.0
89459,2012.0,480000.0,648000.0


In [149]:
# Save a copy of the df_1985
# df_1985.to_csv('baseballsalaries_1985_preclean.csv', index=False)

Lastly, we will rename and reorder the columns to make the dataframe more readable.

In [150]:
# Copy
df_1985_EDA = df_1985.copy()


##### More cleaning

In [151]:
# drop salary_y
df_1985.drop('salary_y', axis=1, inplace=True)

In [152]:
# Drop birthYear
df_1985.drop('birthYear', axis=1, inplace=True)

In [153]:
# debut and finalGame to datetime
df_1985_EDA['debut'] = pd.to_datetime(df_1985_EDA['debut'])
df_1985_EDA['finalGame'] = pd.to_datetime(df_1985_EDA['finalGame'])

# Extract year from debut and finalGame
df_1985_EDA['debut_year'] = df_1985_EDA['debut'].dt.year
df_1985_EDA['finalGame_year'] = df_1985_EDA['finalGame'].dt.year

# Drop debut and finalGame
df_1985_EDA.drop(['debut', 'finalGame'], axis=1, inplace=True)


In [154]:
# Change DivWin, LgWin and WSWin to binary columns
df_1985_EDA['DivWin'] = np.where(df_1985_EDA['DivWin'] == 'Y', 1, 0)
df_1985_EDA['LgWin'] = np.where(df_1985_EDA['LgWin'] == 'Y', 1, 0)
df_1985_EDA['WSWin'] = np.where(df_1985_EDA['WSWin'] == 'Y', 1, 0)
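
Note that `np.where(col == 'Y', 1, 0)` maps every value other than `'Y'` to 0, including nulls and unexpected spellings. As a hedged sketch on a toy series (the values are illustrative), the inputs can be checked first:

```python
import numpy as np
import pandas as pd

# Toy flag column with a null and a lowercase variant (illustrative values)
flags = pd.Series(["Y", "N", None, "y"], name="DivWin")

# Same encoding as the cell above: only an exact 'Y' becomes 1
binary = np.where(flags == "Y", 1, 0)

# Values that are neither 'Y' nor 'N' are silently encoded as 0
unexpected = flags[~flags.isin(["Y", "N"])]

print(binary)
print(unexpected)
```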

In [155]:
column_order = [
    'playerID', 'name_common', 'BirthYear', 'debut_year', 'finalGame_year',
    'age', 'weight', 'height', 'bats', 'throws', 'POS',
    'yearID', 'G_y', 'PA', 'AB_x', 'qualified', 'R_x', 'H_x', '1B', '2B_x', '3B_x', 'HR_x', 'RBI', 'SB_x', 'CS_x', 'BB_x', 'SO_x', 'IBB', 'BA', 'OBP', 'SLG', 'OPS', 'BABIP', 'TB', 'Inn', 'WAR',
    'career_G_x', 'career_PA', 'career_AB_x', 'career_R_x', 'career_H_x', 'career_1B', 'career_2B_x', 'career_3B_x', 'career_HR_x', 'career_RBI', 'career_SB_x', 'career_CS_x', 'career_BB_x', 'career_SO_x', 'career_IBB', 'career_TB', 'career_WAR', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS', 'career_BABIP',
    'teamName', 'Rank', 'G', 'W', 'L', 'W%', 'DivWin', 'LgWin', 'WSWin', 
    'adjusted_salary'
]

In [156]:
# Sanity check: column_order keeps only a subset of the columns, so the lengths differ
len(column_order) == len(df_1985.columns)

False
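
A length comparison only shows that the two column sets differ, not how. The sketch below uses a toy frame with hypothetical names (`wanted`, `missing`, `left_out`) to list which columns the reorder will drop, and whether any name in the list is absent from the frame. An absent name would raise a `KeyError` during the reorder.

```python
import pandas as pd

# Toy frame with a few of the real column names, and a shorter target order
toy = pd.DataFrame(columns=["playerID", "yearID", "franchID", "park", "adjusted_salary"])
wanted = ["playerID", "yearID", "adjusted_salary"]

# Names in the list that the frame lacks (typos would show up here)
missing = [c for c in wanted if c not in toy.columns]

# Columns in the frame that the reorder will drop
left_out = [c for c in toy.columns if c not in wanted]

print("Missing from frame:", missing)
print("Dropped by reorder:", left_out)
```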

In [157]:
# Change column order
df_1985_EDA = df_1985_EDA[column_order]

In [158]:

# Change column names
df_1985_EDA.rename(columns={'yearID': 'Year',
            'G_y': 'GamesPlayer',
            'AB_x': 'AB_p',
            'R_x': 'R_p',
            'H_x': 'H_p',
            '2B_x': '2B_p',
            '3B_x': '3B_p',
            'HR_x': 'HR_p',
            'SB_x': 'SB_p',
            'CS_x': 'CS_p',
            'BB_x': 'BB_p',
            'SO_x': 'SO_p',
            'HBP_x': 'HBP_p',
            'career_G_x': 'CareerGames',
            'career_AB_x': 'CareerAB',
            'career_R_x': 'Career_R',
            'career_H_x': 'Career_H',
            'career_2B_x': 'Career_2B',
            'career_3B_x': 'Career_3B',
            'career_HR_x': 'Career_HR',
            'career_SB_x': 'Career_SB',
            'career_CS_x': 'Career_CS',
            'career_BB_x': 'Career_BB',
            'career_SO_x': 'Career_SO',
            'career_IBB': 'Career_IBB',
            'R_y': 'R_t',
            'H_y': 'H_t',
            '2B_y': '2B_t',
            '3B_y': '3B_t',
            'HR_y': 'HR_t',
            'BB_y': 'BB_t',
            'SO_y': 'SO_t',
            'SB_y': 'SB_t',
            'CS_y': 'CS_t',
            'SF_y': 'SF_t',
            'salary_y': 'Salary'}, inplace=True)
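
Several keys in the mapping above, such as `HBP_x` and the `_y` team columns, are no longer in the dataframe after the reorder. By default `rename` ignores such keys without warning. The sketch below shows that behavior on a toy frame, along with the stricter `errors="raise"` option:

```python
import pandas as pd

toy = pd.DataFrame({"yearID": [2009], "G_y": [140]})
mapping = {"yearID": "Year", "G_y": "GamesPlayer", "HBP_x": "HBP_p"}  # HBP_x is not in toy

# Default behavior: keys that match no column are silently skipped
renamed = toy.rename(columns=mapping)

# Keys in the mapping that matched nothing
unused = [key for key in mapping if key not in toy.columns]

# Strict mode: a missing key raises KeyError instead
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True

print(list(renamed.columns), unused, raised)
```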

In [159]:
# Drop 2022 rows
df_1985_EDA = df_1985_EDA[df_1985_EDA['Year'] != 2022]

In [160]:
# sample
df_1985_EDA.sample(5)

Unnamed: 0,playerID,name_common,BirthYear,debut_year,finalGame_year,age,weight,height,bats,throws,POS,Year,GamesPlayer,PA,AB_p,qualified,R_p,H_p,1B,2B_p,3B_p,HR_p,RBI,SB_p,CS_p,BB_p,SO_p,IBB,BA,OBP,SLG,OPS,BABIP,TB,Inn,WAR,CareerGames,career_PA,CareerAB,Career_R,Career_H,career_1B,Career_2B,Career_3B,Career_HR,career_RBI,Career_SB,Career_CS,Career_BB,Career_SO,Career_IBB,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,teamName,Rank,G,W,L,W%,DivWin,LgWin,WSWin,adjusted_salary
80084,wilsopr01,Preston Wilson,1974.0,1998,2007,29.0,193.0,74.0,R,R,OF,2004.0,58.0,222.0,202.0,0,24.0,50.0,33.0,11.0,0.0,6.0,29.0,2.0,1.0,17.0,49.0,2.0,0.248,0.315,0.391,0.706,0.299,79.0,436.0,-1.09,809,3255,2918,436,774,452,164,12,146,501,104,46,272,799,12,1400,7.68,0.265,0.334,0.48,0.814,0.315,Colorado Rockies,4.0,162.0,68.0,94.0,0.42,0,0,0,14850000.0
76527,minorry01,Ryan Minor,1974.0,1998,2001,27.0,225.0,79.0,R,R,3B,2001.0,55.0,107.0,95.0,0,10.0,15.0,11.0,2.0,0.0,2.0,13.0,0.0,1.0,9.0,31.0,0.0,0.158,0.234,0.242,0.476,0.203,23.0,197.3,-1.41,142,342,317,30,56,40,11,0,5,27,1,1,20,97,0,82,-2.48,0.177,0.228,0.259,0.487,0.234,Washington Nationals,5.0,162.0,68.0,94.0,0.42,0,0,0,352000.0
90471,deazaal01,Alejandro De Aza,1984.0,2007,2017,29.0,195.0,72.0,L,L,OF,2013.0,153.0,675.0,607.0,1,84.0,160.0,112.0,27.0,4.0,17.0,62.0,20.0,8.0,50.0,147.0,1.0,0.264,0.323,0.405,0.728,0.318,246.0,1324.3,0.55,424,1648,1477,221,404,280,79,15,30,148,62,26,126,336,6,603,5.81,0.274,0.336,0.408,0.744,0.334,Chicago White Sox,5.0,162.0,63.0,99.0,0.389,0,0,0,2739000.0
98689,tilsoch01,Charlie Tilson,1992.0,2016,2019,26.0,190.0,71.0,L,L,OF,2019.0,54.0,157.0,144.0,0,16.0,33.0,27.0,5.0,0.0,1.0,12.0,4.0,0.0,10.0,38.0,1.0,0.229,0.293,0.285,0.578,0.305,41.0,358.3,-0.67,96,280,252,23,62,54,6,1,1,23,6,3,20,58,1,73,-1.57,0.246,0.31,0.29,0.6,0.314,Chicago White Sox,3.0,161.0,72.0,89.0,0.447,0,0,0,671550.0
60211,bushra01,Randy Bush,1958.0,1982,1993,27.0,186.0,73.0,L,L,OF,1986.0,130.0,402.0,357.0,0,50.0,96.0,63.0,19.0,7.0,7.0,45.0,5.0,3.0,39.0,63.0,2.0,0.269,0.347,0.42,0.767,0.309,150.0,780.0,0.7,519,1569,1394,178,343,206,79,15,43,192,9,6,136,232,17,581,1.21,0.246,0.32,0.417,0.737,0.265,Minnesota Twins,6.0,162.0,71.0,91.0,0.438,0,0,0,611600.0


In [161]:
# Shohei Ohtani
df_1985_EDA[df_1985_EDA['playerID'] == 'ohtansh01']

Unnamed: 0,playerID,name_common,BirthYear,debut_year,finalGame_year,age,weight,height,bats,throws,POS,Year,GamesPlayer,PA,AB_p,qualified,R_p,H_p,1B,2B_p,3B_p,HR_p,RBI,SB_p,CS_p,BB_p,SO_p,IBB,BA,OBP,SLG,OPS,BABIP,TB,Inn,WAR,CareerGames,career_PA,CareerAB,Career_R,Career_H,career_1B,Career_2B,Career_3B,Career_HR,career_RBI,Career_SB,Career_CS,Career_BB,Career_SO,Career_IBB,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,teamName,Rank,G,W,L,W%,DivWin,LgWin,WSWin,adjusted_salary
98268,ohtansh01,Shohei Ohtani,1994.0,2018,2023,23.0,210.0,76.0,L,R,P,2018.0,114.0,367.0,326.0,0,59.0,93.0,48.0,21.0,2.0,22.0,61.0,10.0,4.0,37.0,102.0,2.0,0.285,0.361,0.564,0.925,0.35,184.0,789.7,2.73,104,367,326,59,93,48,21,2,22,61,10,4,37,102,2,184,2.73,0.285,0.361,0.564,0.925,0.35,Los Angeles Angels,4.0,162.0,80.0,82.0,0.494,0,0,0,670350.0
99654,ohtansh01,Shohei Ohtani,1994.0,2018,2023,24.0,210.0,76.0,L,R,P,2019.0,106.0,425.0,384.0,0,51.0,110.0,67.0,20.0,5.0,18.0,62.0,12.0,3.0,33.0,110.0,1.0,0.286,0.343,0.505,0.848,0.354,194.0,828.0,2.45,210,792,710,110,203,115,41,7,40,123,22,7,70,212,3,378,5.18,0.286,0.351,0.532,0.883,0.352,Los Angeles Angels,4.0,162.0,72.0,90.0,0.444,0,0,0,786500.0
100965,ohtansh01,Shohei Ohtani,1994.0,2018,2023,25.0,210.0,76.0,L,R,P,2020.0,46.0,175.0,153.0,0,23.0,29.0,16.0,6.0,0.0,7.0,24.0,7.0,1.0,22.0,50.0,0.0,0.19,0.291,0.366,0.657,0.229,56.0,361.7,0.04,254,967,863,133,232,131,47,7,47,147,29,8,92,262,3,434,5.22,0.269,0.34,0.503,0.843,0.331,Los Angeles Angels,4.0,60.0,26.0,34.0,0.433,0,0,0,826000.0
102430,ohtansh01,Shohei Ohtani,1994.0,2018,2023,26.0,210.0,76.0,L,R,P,2021.0,158.0,639.0,537.0,1,103.0,138.0,58.0,26.0,8.0,46.0,100.0,26.0,10.0,96.0,189.0,20.0,0.257,0.372,0.592,0.964,0.303,318.0,1272.6,4.86,409,1606,1400,236,370,189,73,15,93,247,55,18,188,451,23,752,10.08,0.264,0.353,0.537,0.89,0.321,Los Angeles Angels,4.0,162.0,77.0,85.0,0.475,0,0,0,3510000.0


In [162]:
#shape
df_1985_EDA.shape

(22010, 68)

In [163]:
# Save a copy of the df_1985
# df_1985_EDA.to_csv('baseballsalaries_1985.csv', index=False)

df_1985_EDA.to_csv('baseballsalaries.csv', index=False)

## 5. Next Steps <a id='next_steps'></a>

Now that we have a clean dataset, we are ready to move on to the next steps of our project:

- __Exploratory Data Analysis (EDA):__ Perform an in-depth exploratory data analysis to uncover insights, patterns, and relationships within the preprocessed data. Utilize visualizations, statistical analysis, and other techniques to understand the distribution, correlations, and trends present in the data. This stage will provide valuable insights that can guide further analysis and modeling decisions.

- __Feature Engineering:__ Engage in feature engineering to enhance the dataset for modeling purposes. This includes selecting relevant features, transforming existing features, and potentially creating new features based on domain knowledge and insights gained from the EDA. Iteratively refine the feature set to improve model performance and align it with the project's objectives.

Please note that the current state of the preprocessed data is not the final form. More features can be added or removed during the feature engineering phase to further optimize our models and increase their predictive power.



In [164]:
# Jim Dwyer
df_1985[df_1985['playerID'] == 'dwyerji01']

Unnamed: 0,name_common,age,PA,G_x,Inn,WAR,playerID,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,franchID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,RA,teamName,park,teamIDBR,leaguerank,teamrank,averagesalary,leagueminimum,POS,BirthYear,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,qualified,adjusted_salary
58995,Jim Dwyer,35.0,274.0,101,525.0,1.08,dwyerji01,165.0,70.0,L,L,1973-06-10,1990-06-21,1985.0,101.0,233.0,35.0,58.0,8.0,3.0,7.0,36.0,0.0,3.0,37.0,31.0,2.0,BAL,4.0,161.0,81.0,83.0,78.0,N,N,N,764.0,Baltimore Orioles,Memorial Stadium,BAL,275.0,18.0,371571.0,60000.0,OF,1950.0,0.249,0.353,40.0,93.0,0.399,0.752,0.26,0.516,949,2302,1968,286,504,359,82,15,48,237,20,12,276,264,23,8,25,25,34,760,3.21,0.256,0.346,0.386,0.732,0.271,0,1083750.0
59937,Jim Dwyer,36.0,189.0,94,346.7,0.65,dwyerji01,165.0,70.0,L,L,1973-06-10,1990-06-21,1986.0,94.0,160.0,18.0,39.0,13.0,1.0,8.0,31.0,0.0,2.0,23.0,31.0,1.0,BAL,7.0,162.0,79.0,73.0,89.0,N,N,N,760.0,Baltimore Orioles,Memorial Stadium,BAL,285.0,14.0,412520.0,60000.0,OF,1950.0,0.244,0.339,17.0,78.0,0.488,0.827,0.248,0.451,1043,2491,2128,304,543,376,95,16,56,268,20,14,299,295,24,10,25,29,36,838,3.86,0.255,0.345,0.394,0.739,0.27,0,1112000.0
60878,Jim Dwyer,37.0,281.0,92,606.3,1.49,dwyerji01,165.0,70.0,L,L,1973-06-10,1990-06-21,1987.0,92.0,241.0,54.0,66.0,7.0,1.0,15.0,33.0,4.0,1.0,37.0,57.0,4.0,BAL,6.0,162.0,82.0,67.0,95.0,N,N,N,880.0,Baltimore Orioles,Memorial Stadium,BAL,273.0,14.0,412454.0,62500.0,OF,1950.0,0.274,0.371,43.0,120.0,0.498,0.869,0.3,0.414,1135,2772,2369,358,609,419,102,17,71,301,24,15,336,352,28,11,26,30,40,958,5.35,0.257,0.348,0.404,0.752,0.272,0,842550.0
61841,Jim Dwyer,38.0,122.0,55,275.0,0.29,dwyerji01,165.0,70.0,L,L,1973-06-10,1990-06-21,1988.0,55.0,94.0,9.0,24.0,1.0,0.0,2.0,18.0,0.0,0.0,25.0,19.0,4.0,BAL,7.0,161.0,80.0,54.0,107.0,N,N,N,789.0,Baltimore Orioles,Memorial Stadium,BAL,316.0,14.0,438729.0,62500.0,OF,1950.0,0.255,0.41,21.0,31.0,0.33,0.74,0.293,0.335,1190,2894,2463,367,633,440,103,17,73,319,24,15,361,371,32,12,26,32,41,989,5.64,0.257,0.351,0.402,0.753,0.273,0,792000.0
62816,Jim Dwyer,39.0,265.0,101,658.0,0.83,dwyerji01,165.0,70.0,L,L,1973-06-10,1990-06-21,1989.0,101.0,235.0,35.0,74.0,12.0,0.0,3.0,25.0,2.0,0.0,29.0,24.0,1.0,MIN,5.0,162.0,81.0,80.0,82.0,N,N,N,738.0,Minnesota Twins,Hubert H Humphrey Metrodome,MIN,400.0,14.0,497254.0,68000.0,OF,1950.0,0.315,0.389,59.0,95.0,0.404,0.793,0.34,0.494,1291,3159,2698,402,707,499,115,17,76,344,26,15,390,395,33,12,26,33,47,1084,6.47,0.262,0.354,0.402,0.756,0.279,0,504000.0
63811,Jim Dwyer,40.0,75.0,37,201.0,-0.44,dwyerji01,165.0,70.0,L,L,1973-06-10,1990-06-21,1990.0,37.0,63.0,7.0,12.0,0.0,0.0,1.0,5.0,0.0,0.0,12.0,7.0,1.0,MIN,7.0,162.0,81.0,74.0,88.0,N,N,N,729.0,Minnesota Twins,Hubert H Humphrey Metrodome,MIN,431.0,18.0,597537.0,100000.0,OF,1950.0,0.19,0.32,11.0,15.0,0.238,0.558,0.2,0.457,1328,3234,2761,409,719,510,115,17,77,349,26,15,402,402,34,12,26,33,49,1099,6.03,0.26,0.353,0.398,0.751,0.277,0,573600.0
