## Predicting MLB Player Salaries: A Batting Performance Analysis
# Data Preprocessing

In this notebook, we will perform data processing tasks to simplify the dataframe, drop columns with no value to our project, reduce the time span, and perform further cleaning and wrangling.

By the end, I expect to have a dataframe that is ready for analysis and modeling.

## Table of Contents
1. [Data Exploration](#data_exploration)
    - [1.1. Data Dictionary](#data_dict)
    - [1.2. Description and overview of the data](#overview)
2. [Absence of Key Features](#absence_key_features)
    - [2.1. Adding Key Features](#adding_key_features)
    - [2.2. Adding Cumulative Features](#adding_cumulative_features)
3. [Data Cleaning](#data_cleaning)
    - [3.1. Setting a time span](#time_span)
    - [3.2. Droping columns (Features) that are no relevant to the project](#drop_cols)
    - [3.3. Droping repeated or similar columns](#drop_repeated_cols)
    - 2.4. [Filtering and removing players with no relevance to the project (Removing rows)](#drop_rows)
    - 2.5. [Dealing with Null Values](#null_values)
4. [Salary Adjustments](#salary_adjustments)
4. [Next Steps](#next_steps)

In [353]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## 1. Data Exploration <a id='data_exploration'></a>

To begin the analysis, we will explore the dataset to understand its structure and content.


In [354]:
# Load Dataset
raw_df = pd.read_csv("pre_datasets/war_batting_people_teams_salaries_pre.csv")

  raw_df = pd.read_csv("pre_datasets/war_batting_people_teams_salaries_pre.csv")


In [355]:
# Shape of the dataset
print('Shape of the dataset: ', raw_df.shape)
print('Number of rows (Players): ', raw_df.shape[0])
print('Number of columns (Features): ', raw_df.shape[1])

Shape of the dataset:  (102624, 105)
Number of rows (Players):  102624
Number of columns (Features):  105


The dataset has a shape of **(112149, 105)**, indicating __112,149 rows (Players) and 109 columns (Features)__. 

Here are the first five rows of the dataset:

In [356]:
# First 5 rows of the dataset
pd.set_option('display.max_columns', None)

raw_df.sample(5)

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y
98669,Ryan Garton,29.0,2019,gartory01,Y,623439.0,1,0.0,0,3.0,0.0,0.0,0.0,0.0,0.0,SEA,gartory01,1989.0,USA,190.0,70.0,R,R,2016-05-26,2019-05-20,gartr001,gartory01,2019.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,SEA,AL,SEA,W,5.0,162.0,81.0,68.0,94.0,N,N,N,758.0,5500.0,1305.0,254.0,28.0,239.0,588.0,1581.0,115.0,47.0,58.0,37.0,893.0,798.0,4.99,3.0,4.0,34.0,4318.0,1484.0,260.0,505.0,1239.0,132.0,145.0,0.978,Seattle Mariners,T-Mobile Park,1791863.0,93.0,94.0,SEA,SEA,SEA,ryan,garton,150535.0,623439.0,2019.0,555000.0,Seattle Mariners,29.0,0.0,0.0,4509524.0,555000.0,1989-12-05,ryan garton
25713,Syl Johnson,36.0,1937,johnssy01,Y,116630.0,1,50.0,32,0.0,0.0,-17.952571,-0.33,-0.01,-0.33,PHI,johnssy01,1900.0,USA,180.0,71.0,R,R,1922-04-24,1940-09-14,johns101,johnssy01,1937.0,32.0,48.0,1.0,7.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,15.0,0.0,0.0,2.0,0.0,2.0,PHI,NL,PHI,,7.0,155.0,74.0,61.0,92.0,,N,N,724.0,5424.0,1482.0,258.0,37.0,103.0,478.0,640.0,66.0,,,,869.0,770.0,5.05,59.0,6.0,15.0,4119.0,1629.0,116.0,501.0,529.0,184.0,157.0,0.97,Philadelphia Phillies,Baker Bowl,212790.0,109.0,112.0,PHI,PHI,PHI,,,,,,,,,,,,,,
3155,Fred ONeill,22.0,1887,o'neifr01,N,120019.0,1,28.0,6,0.0,0.0,120.538697,-0.04,-0.17,0.1,NYP,oneilfr01,1865.0,CAN,142.0,67.0,R,,1887-05-03,1887-05-22,oneif101,o'neifr01,1887.0,6.0,26.0,4.0,8.0,1.0,1.0,0.0,3.0,3.0,0.0,1.0,4.0,0.0,1.0,0.0,0.0,0.0,NY4,AA,NYP,,7.0,138.0,,44.0,89.0,,N,N,754.0,4821.0,1197.0,193.0,66.0,21.0,439.0,456.0,305.0,,50.0,,1093.0,693.0,5.28,132.0,1.0,0.0,3541.0,1545.0,39.0,406.0,316.0,632.0,102.0,0.894,New York Metropolitans,St. George Cricket Grounds,,96.0,100.0,NYP,NY4,NY4,,,,,,,,,,,,,,
35079,Willy Miranda,28.0,1954,miranwi01,N,119184.0,1,129.0,92,394.4,0.0,79.17793,-0.13,-0.06,0.22,NYY,miranwi01,1926.0,Cuba,150.0,69.0,B,R,1951-05-06,1959-09-07,miraw101,miranwi01,1954.0,92.0,116.0,12.0,29.0,4.0,2.0,1.0,12.0,0.0,3.0,10.0,10.0,0.0,0.0,1.0,4.0,2.0,NYA,AL,NYY,,2.0,155.0,78.0,103.0,51.0,,N,N,805.0,5226.0,1400.0,215.0,59.0,133.0,650.0,632.0,34.0,41.0,,,563.0,500.0,3.26,51.0,16.0,37.0,4137.0,1284.0,86.0,552.0,655.0,127.0,198.0,0.979,New York Yankees,Yankee Stadium I,1475171.0,97.0,93.0,NYY,NYA,NYA,,,,,,,,,,,,,,
58731,Albert Hall,28.0,1986,hallal02,N,115334.0,1,57.0,16,114.7,0.0,60.948177,-0.11,-0.08,-0.07,ATL,hallal02,1958.0,USA,155.0,71.0,B,R,1981-09-12,1989-10-01,halla001,hallal02,1986.0,16.0,50.0,6.0,12.0,2.0,0.0,0.0,1.0,8.0,3.0,5.0,6.0,0.0,0.0,2.0,0.0,0.0,ATL,NL,ATL,W,6.0,161.0,81.0,72.0,89.0,N,N,N,615.0,5384.0,1348.0,241.0,24.0,138.0,538.0,904.0,93.0,76.0,24.0,42.0,719.0,629.0,3.97,17.0,5.0,39.0,4274.0,1443.0,117.0,576.0,932.0,141.0,181.0,0.978,Atlanta Braves,Atlanta-Fulton County Stadium,1387181.0,105.0,106.0,ATL,ATL,ATL,albert,hall,12324.0,115334.0,1986.0,60000.0,Atlanta Braves,27.0,0.0,0.0,412520.0,60000.0,1959-03-07,albert hall


In [357]:
pd.set_option('display.max_rows', 200)
# Print all column names and their data types
raw_df.dtypes

name_common        object
age_x             float64
year_ID             int64
player_ID          object
pitcher            object
mlb_ID            float64
stint_ID            int64
PA                float64
G_x                 int64
Inn               float64
salary_x          float64
OPS_plus          float64
WAR               float64
WAR_def           float64
WAR_off           float64
team_ID            object
playerID           object
birthYear         float64
birthCountry       object
weight            float64
height            float64
bats               object
throws             object
debut              object
finalGame          object
retroID            object
bbrefID            object
yearID            float64
G_y               float64
AB_x              float64
R_x               float64
H_x               float64
2B_x              float64
3B_x              float64
HR_x              float64
RBI               float64
SB_x              float64
CS_x              float64
BB_x        

### 1.1 Data Dictionary <a id='data_dict'></a>

Now let's examine the data dictionary to understand the meaning of each column:

| Column Name | Description | 
| --- | --- |
| name_common | Player name |
| age_x | Player age |
| year_ID | Year (Season played) |
| team_ID | Team played |
| player_ID | Player ID |
| pitcher | Whether the player is a pitcher |
| mlb_ID | MLB ID |
| stint_ID | The stint of the player in that year (a player can have more than one stint in a year if they moved teams). |
| PA | Plate appearances |
| G_x | Games played |
| Inn | Innings played |
| salary_x | Player salary |
| OPS_plus | On-base Plus Slugging Plus - an adjusted version of the On-base Plus Slugging (OPS) statistic. |
| WAR | Wins Above Replacement - a measure of how many wins a player adds to a team compared to a replacement-level player. |
| WAR_def | Wins above replacement as fielder |
| WAR_off | Wins above replacement as batter |
| teamID | ID of the team they played for the season |
| playerID | Player ID |
| birthYear | Year of birth |
| birthMonth | Month of birth |
| birthDay | Day of birth |
| birthCountry | Country of birth |
| nameFirst | First name |
| nameLast | Last name |
| weight | Weight in pounds |
| height | Height in inches |
| bats | Batting hand (left, right, or both) |
| throws | Throwing hand (left or right) |
| debut | Date of MLB debut |
| finalGame | Date of final MLB game |
| retroID | Retro ID |
| bbrefID | Baseball Reference ID |
| yearID | Year (Season played) |
| lgID | The league the player played in |
| G | Games played |
| AB | At bats |
| R | Runs |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR | Homeruns |
| RBI | Runs batted in |
| SB | Stolen bases |
| CS | Caught stealing |
| BB | Walks |
| SO | Strikeouts |
| IBB | Intentional walks |
| HBP | Hit by pitch |
| SH | Sacrifice hits |
| SF | Sacrifice flies |
| GIDP | Grounded into double plays |
| teamID | ID of the team they played for the season |
| lgID | The league the player played in |
| franchID | Franchise ID |
| divID | Division ID |
| Rank | Team rank at the end of the season |
| G | Games played |
| Ghome | Games played at home |
| W | Wins |
| L | Losses |
| DivWin | Division Winner (Y or N) |
| WCWin | Wild Card Winner (Y or N) |
| LgWin | League Champion(Y or N) |
| WSWin | World Series Winner (Y or N) |
| R | Total number of runs scored by the team in the season |
| AB | Total number of at bats by the team in the season |
| H | Total number of hits by the team in the season |
| 2B | Total number of doubles by the team in the season |
| 3B | Total number of triples by the team in the season |
| HR | Total number of home runs by the team in the season |
| BB | Total number of walks by the team in the season |
| SO | Total number of strikeouts by the team in the season |
| SB | Total number of stolen bases by the team in the season |
| CS | Total number of times caught stealing by the team in the season |
| HBP | Total number of times hit by pitch by the team in the season |
| SF | Total number of sacrifice flies by the team in the season |
| RA | Total number of runs allowed by the team in the season |
| ER | Total number of earned runs allowed by the team in the season |
| ERA | Earned run average |
| CG | Total number of complete games pitched by the team in the season |
| SHO | Total number of shutouts pitched by the team in the season |
| SV | Total number of saves by the team in the season |
| IPouts | Total number of outs pitched by the team in the season |
| HA | Total number of hits allowed by the team in the season |
| HRA | Total number of home runs allowed by the team in the season |
| BBA | Total number of walks allowed by the team in the season |
| SOA | Total number of strikeouts by the team in the season |
| E | Total number of errors by the team in the season |
| DP | Total number of double plays turned by the team in the season |
| FP | Fielding percentage |
| name_x | Team name |
| park | Team park |
| attendance | Total attendance for the season |
| BPF | Three-year park factor for batters |
| PPF | Three-year park factor for pitchers |
| teamIDBR | Team ID used by Baseball Reference website |
| teamIDlahman45 | Team ID used in Lahman database version 4.5 |
| teamIDretro | Team ID used by Retrosheet |
| firstname | The player's First name |
| lastname | The player's Last name |
| playerid | A unique identifier for each player |
| mlbid | A unique identifier for each player |
| year | Year of the salary |
| salary_y | Player salary for the year |
| teamname | Team name |
| age | The player's age during the year of the salary |
| leaguerank | The player's rank in the league based on salary |
| teamrank | The player's rank in the team based on salary |
| averagesalary | The average salary in the league for the year |
| leagueminimun | The minimum salary in the league for the year |
| borndate | The player's birth date |






















### 1.2 Description and Overview of the Data <a id='overview'></a>

In [358]:
# Unique players (Total number of players)
print('Number of unique players: ', raw_df['playerID'].nunique())

# Unique seasons (Total number of seasons)
print('Number of unique seasons: ', raw_df['yearID'].nunique())

# First season
print('First season: ', raw_df['yearID'].min())

# Last season
print('Last season: ', raw_df['yearID'].max())

Number of unique players:  19829
Number of unique seasons:  152
First season:  1871.0
Last season:  2022.0


#### Overview of the Data

Our dataset is a rich compilation of baseball statistics that contains data for __19,942__ unique players across __152__ seasons. The earliest MLB season in the dataset is __1871__, and the latest season is __2022__.

Each row in the dataset represents a single player's performance for a given season, with a variety of metrics reflecting different aspects of their performance. These are divided into the following categories:
- __Identifiers:__ These include various IDs for players, teams, and leagues, as well as the year and player's stint with the team in that season.
- __Basic Batting and Fielding Statistics:__ These include familiar metrics such as games played, at bats, runs, hits, and home runs, among others, for both players and teams.
- __Advanced Batting and Fielding Statistics:__ These include more advanced metrics such as wins above replacement (WAR), wins above average (WAA), and runs above average (RAA), among others, for both players and teams.
- __Personal Information:__ These include the player's name, birth and death information, weight, height, and handedness, among others.
- __Team Information:__ These include the team's name, division, and league, as well as their record, attendance, and park factors, among others.
- __Salary Information:__ These include the player's salary.

This dataset provides a comprehensive view of baseball performance, allowing us to examine the game from both individual player and team perspectives.


In [359]:
# Copy of the df for preprocessing
df = raw_df.copy()

## 2. Absence of Key Features (Pre-Feature Engineering) <a id='missing_features'></a>

Our current dataset lacks several features that are integral to understanding a player's performance in baseball. These absent features include:
- __Batting Average (BA)__: This is calculated by dividing a player's number of hits by their number of at bats. It provides a measure of a player's offensive capabilities.
- __On-Base Percentage (OBP)__: This metric represents the frequency at which a player reaches base per plate appearance. It's calculated by dividing the total times a player reaches base (hits, walks, and hit by pitches) by their total number of eligible at bats.
- __Singles (1B)__: This is the number of hits a player has that resulted in the batter reaching first base safely. It's calculated by subtracting doubles, triples, and home runs from total hits.
- __Total Bases (TB)__: The number of bases a player has gained with hits. It is a weighted sum for a batter's collection of hits including singles, doubles, triples and home runs.
- __Slugging Percentage (SLG)__: The total number of bases divided by the number of at bats.
- __On-Base Plus Slugging (OPS)__: This is the sum of a player's on-base percentage and slugging percentage. It's a more comprehensive statistic that measures a player's ability to get on base, along with their ability to hit for power.
- __Batting Average on Balls In Play (BABIP)__: This measures how often a ball in play goes for a hit. A ball is "in play" when the plate appearance ends in something other than a strikeout, walk, hit batter, catcher's interference, sacrifice bunt, or home run. In other words, the batter puts the ball in play and it doesn't clear the outfield fence. This can be an indicator of a player's luck, skill at placing hits where fielders aren't, or both.
- __Win Percentage (W%)__: This is the number of wins divided by the number of games played. It provides a measure of a team's success over a period of time.

### 2.1 Adding key features <a id='adding_features'></a>

In [360]:
# Add Battiing Average (BA) column, round to 3 decimal places
df['BA'] = round(df['H_x'] / df['AB_x'], 3)

# Add On Base Percentage (OBP) column
df['OBP'] = round((df['H_x'] + df['BB_x'] + df['HBP_x']) / (df['AB_x'] + df['BB_x'] + df['HBP_x'] + df['SF_x']), 3)

# Add Singles (1B) column
df['1B'] = df['H_x'] - df['2B_x'] - df['3B_x'] - df['HR_x']

# Add Total Bases (TB) column
df['TB'] = df['1B'] + 2*df['2B_x'] + 3*df['3B_x'] + 4*df['HR_x']

# Add Slugging Percentage (SLG) column
df['SLG'] = round(df['TB'] / df['AB_x'], 3)

# Add On Base Plus Slugging (OPS) column
df['OPS'] = round(df['OBP'] + df['SLG'], 3)

# Add Batting Average on Balls in Play (BABIP) column
df['BABIP'] = round((df['H_x'] - df['HR_x']) / (df['AB_x'] - df['SO_x'] - df['HR_x'] + df['SF_x']), 3)

# Add Win Percentage (W%) column
df['W%'] = round(df['W'] / (df['W'] + df['L']), 3)


##### Sanity check
Let's compare newly created features with the ones of a well known player, Derek Jeter, to make sure we are calculating them correctly.
 
https://www.baseball-reference.com/players/j/jeterde01.shtml

In [361]:
# Sanity check, Derek Jeter's new columns
df[df['playerID'] == 'jeterde01'][['playerID', 'yearID', 'BA', 'OBP', '1B', 'SLG', 'OPS', 'BABIP', 'TB']]

Unnamed: 0,playerID,yearID,BA,OBP,1B,SLG,OPS,BABIP,TB
67993,jeterde01,1995.0,0.25,0.294,7.0,0.375,0.669,0.324,18.0
69126,jeterde01,1996.0,0.314,0.37,142.0,0.43,0.8,0.361,250.0
70241,jeterde01,1997.0,0.291,0.37,142.0,0.405,0.775,0.345,265.0
71384,jeterde01,1998.0,0.324,0.384,151.0,0.481,0.865,0.375,301.0
72567,jeterde01,1999.0,0.349,0.438,149.0,0.552,0.99,0.396,346.0
73766,jeterde01,2000.0,0.339,0.416,151.0,0.481,0.897,0.386,285.0
75004,jeterde01,2001.0,0.311,0.377,132.0,0.48,0.857,0.343,295.0
76182,jeterde01,2002.0,0.297,0.373,147.0,0.421,0.794,0.336,271.0
77383,jeterde01,2003.0,0.324,0.393,118.0,0.45,0.843,0.379,217.0
78610,jeterde01,2004.0,0.292,0.352,120.0,0.471,0.823,0.315,303.0


All the features match the ones in the website, so we can proceed.

### 2.2 Adding cumulative features
It is essential to consider the player's career statistics up to the point of each season. This is because a player's salary is often determined not just by their performance in the current season, but by their cumulative performance throughout their career.

To this end, we have added cumulative features to our dataset. These features represent the cumulative sum of various statistics for each player up to each season. These features represent the cumulative sum of various statistics for each player up to each season. For example, `career_G_x` represents the cumulative number of games played by the player up to the current season, `career_H_x` represents the cumulative number of hits, and so on.

In addition to these cumulative sum features, we have also added cumulative mean features for certain statistics. These features represent the cumulative average of these statistics for each player up to each season. For example, `career_BA` represents the cumulative batting average of the player up to the current season.

Later in the preprocessing stage, we will be narrowing down the time span of our dataset. To ensure we do not lose valuable data, especially for players who debuted before our selected time span, it's crucial to perform this step now. This approach ensures that we retain all relevant information for our analysis and model training.


In [362]:
# Add cumulative columns

# Empty list to store the cumulative columns names
cumulative_sum_cols = []
cumulative_mean_cols = []

# Function to add cumulative columns
def add_cumulative_sum(df, list_of_cols):
    for col in list_of_cols:
        new_col_name = 'career_' + col
        df[new_col_name] = df.groupby(['playerID'])[col].cumsum()
        if col != 'WAR':
            df[new_col_name] = df[new_col_name].fillna(0).astype(int)
        cumulative_sum_cols.append(new_col_name)
        
# Add cumulative sum columns
columns_to_sum = ['G_x', 'PA', 'AB_x', 'R_x', 'H_x', '1B', '2B_x', '3B_x', 'HR_x', 'RBI', 'SB_x', 'CS_x', 'BB_x', 'SO_x', 'IBB', 'HBP_x', 'SH', 'SF_x', 'GIDP', 'TB', 'WAR']
add_cumulative_sum(df, columns_to_sum)

# Add cumulative mean columns (Batting Average, On Base Percentage, Slugging Percentage, On Base Plus Slugging, BABIP)
# Batting Average (BA)
df['career_BA'] = round(df.groupby(['playerID'])['H_x'].cumsum() / df.groupby(['playerID'])['AB_x'].cumsum(), 3)
cumulative_mean_cols.append('career_BA')

# On Base Percentage (OBP)
df['career_OBP'] = round((df.groupby(['playerID'])['H_x'].cumsum() + df.groupby(['playerID'])['BB_x'].cumsum() + df.groupby(['playerID'])['HBP_x'].cumsum()) / (df.groupby(['playerID'])['AB_x'].cumsum() + df.groupby(['playerID'])['BB_x'].cumsum() + df.groupby(['playerID'])['HBP_x'].cumsum() + df.groupby(['playerID'])['SF_x'].cumsum()), 3)
cumulative_mean_cols.append('career_OBP')

# Slugging Percentage (SLG)
df['career_SLG'] = round((df.groupby(['playerID'])['1B'].cumsum() + 2*df.groupby(['playerID'])['2B_x'].cumsum() + 3*df.groupby(['playerID'])['3B_x'].cumsum() + 4*df.groupby(['playerID'])['HR_x'].cumsum()) / df.groupby(['playerID'])['AB_x'].cumsum(), 3)
cumulative_mean_cols.append('career_SLG')

# On Base Plus Slugging (OPS)
df['career_OPS'] = round(df['career_OBP'] + df['career_SLG'], 3)
cumulative_mean_cols.append('career_OPS')

# Batting Average on Balls in Play (BABIP)
df['career_BABIP'] = round((df.groupby(['playerID'])['H_x'].cumsum() - df.groupby(['playerID'])['HR_x'].cumsum()) / (df.groupby(['playerID'])['AB_x'].cumsum() - df.groupby(['playerID'])['SO_x'].cumsum() - df.groupby(['playerID'])['HR_x'].cumsum() + df.groupby(['playerID'])['SF_x'].cumsum()), 3)
cumulative_mean_cols.append('career_BABIP')

##### Sanity check

https://www.baseball-reference.com/players/j/jeterde01.shtml

In [363]:
# Sanity check, Derek Jeter's new cumulative sum columns
df[df['playerID'] == 'jeterde01'][['playerID', 'yearID', 'career_G_x', 'career_PA', 'career_AB_x', 'career_R_x', 'career_H_x', 'career_1B', 'career_2B_x', 'career_3B_x', 'career_HR_x', 'career_RBI', 'career_SB_x', 'career_CS_x', 'career_BB_x', 'career_SO_x', 'career_IBB', 'career_HBP_x', 'career_SH', 'career_SF_x', 'career_GIDP', 'career_TB', 'career_WAR']]


Unnamed: 0,playerID,yearID,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR
67993,jeterde01,1995.0,15,51,48,5,12,7,4,1,0,7,0,0,3,11,0,0,0,0,0,18,-0.34
69126,jeterde01,1996.0,172,705,630,109,195,149,29,7,10,85,14,7,51,113,1,9,6,9,13,268,2.95
70241,jeterde01,1997.0,331,1453,1284,225,385,291,60,14,20,155,37,19,125,238,1,19,14,11,27,533,7.91
71384,jeterde01,1998.0,480,2147,1910,352,588,442,85,22,39,239,67,25,182,357,2,24,17,14,40,834,15.44
72567,jeterde01,1999.0,638,2886,2537,486,807,591,122,31,63,341,86,33,273,473,7,36,20,20,52,1180,23.44
73766,jeterde01,2000.0,786,3565,3130,605,1008,742,153,35,78,414,108,37,341,572,11,48,23,23,66,1465,28.01
75004,jeterde01,2001.0,936,4251,3744,715,1199,874,188,38,99,488,135,40,397,671,14,58,28,24,79,1760,33.2
76182,jeterde01,2002.0,1093,4981,4388,839,1390,1021,214,38,117,563,167,43,470,785,16,65,31,27,93,2031,36.87
77383,jeterde01,2003.0,1212,5523,4870,926,1546,1139,239,41,127,615,178,48,513,873,18,78,34,28,103,2248,40.44
78610,jeterde01,2004.0,1366,6244,5513,1037,1734,1259,283,42,150,693,201,52,559,972,19,92,50,30,122,2551,44.68


In [364]:
# Derek Jeter's career average stats
df[df['name_common'] == 'Derek Jeter'][['playerID', 'yearID', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS', 'career_BABIP']]

Unnamed: 0,playerID,yearID,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
67993,jeterde01,1995.0,0.25,0.294,0.375,0.669,0.324
69126,jeterde01,1996.0,0.31,0.365,0.425,0.79,0.359
70241,jeterde01,1997.0,0.3,0.368,0.415,0.783,0.352
71384,jeterde01,1998.0,0.308,0.373,0.437,0.81,0.359
72567,jeterde01,1999.0,0.318,0.389,0.465,0.854,0.368
73766,jeterde01,2000.0,0.322,0.394,0.468,0.862,0.372
75004,jeterde01,2001.0,0.32,0.392,0.47,0.862,0.367
76182,jeterde01,2002.0,0.317,0.389,0.463,0.852,0.362
77383,jeterde01,2003.0,0.317,0.389,0.462,0.851,0.364
78610,jeterde01,2004.0,0.315,0.385,0.463,0.848,0.358


All the features match the ones in the website, so we can proceed.

In [365]:
df.shape

(102624, 139)

In [366]:
# Save a copy of the df
df.to_csv('baseball_pre_cumulative.csv', index=False)

## 3. Data Cleaning <a id='data_cleaning'></a>
The next crucial step in our analysis is data cleaning. This process involves preparing our data for analysis and modeling by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

### 3.1 Setting a Time Span <a id='time_span'></a>

The dataset contains data from 1871 to 2022. However, as part of our data cleaning process, we made a strategic decision to limit our dataset to encompass the years from 1985 to the present. 

There were several reasons behind this decision:
- __Statistical Consistency:__ The game of baseball has evolved significantly over the years, with changes in rules, equipment, and player training and conditioning. By focusing on the past four decades, we ensure a higher degree of consistency in the playing conditions, thus making our statistical analysis and predictions more reliable.
- __Impact of Free Agency:__ The introduction of free agency in 1976 has had a significant impact on the game. By starting our analysis from 1985, we focus on an era when player movement between teams became more common, which adds an interesting dynamic to player performance and team composition.
- __Data Availability:__ The data from 1985 to the present is more complete and accurate, which will help us avoid potential issues with missing or incorrect data.

In [367]:
# New dataframe starting from 1985
df_1985 = df.copy()
df_1985 = df_1985[df_1985['yearID'] >= 1985]

# Save a copy of the df_1980
df_1985.to_csv('baseball_pre_cumulative_1985.csv', index=False)

In [368]:
# Shape 
print('Shape of the dataset: ', df_1985.shape)

Shape of the dataset:  (44838, 139)


### 3.2 Droping columns (Features) that are no relevant to the project <a id='drop_cols'></a>


In [369]:
# Empty list to store columns to be dropped
drop_cols = []


#### Personal Columns

Some columns are not relevant to our project and will be dropped. These columns are commonly related to personal information about the player, such as birth date, birth place, and death date. 

Personal columns are not directly related to the project's analysis of batting performance. By removing them from the dataset, we can eliminate unnecessary personal information and narrow the scope of the project to the pertinent variables.


In [370]:
# List of personal columns that are not relevant
personal_columns = ['firstname', 'lastname', 'borndate']

# Append personal columns to drop_cols
drop_cols.extend(personal_columns)


#### Pitching Stats Columns

To focus on the batting-oriented nature of your project and streamline the analysis, it is recommended to drop the pitching-related columns. By removing these columns, we can concentrate solely on the batting statistics, which aligns with the project's objective. Dropping the pitching columns will enable a more concise and meaningful exploration of the batting performance in your dataset.

In [371]:
# Pitching columns
pitching_columns = ['ERA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'PPF']

# Append pitching columns to drop_cols
drop_cols.extend(pitching_columns)

### 3.3 Droping repeated or similar columns <a id='drop_repeated_cols'></a>

Some columns provide the same information as other columns, but with different names. We are going to iterate over each pair of columns we belive are repeated and evaluate which will be the one to drop. 

In [372]:
def compare_columns(dataframe, column1, column2, column3=None):
    '''
    Compare up to three columns in a DataFrame and return the number of equal and different values.
    Additionally, provide the count of null values for each column to assist in identifying columns for potential dropping.
    If the number of different values is close to or equals the number of null values, it suggests that the differences 
    observed are primarily due to null values.
    :param column1: string
    :param column2: string
    :param column3: string
    '''
    if column3 is None:
        # Return bool if column1 and column2 are equal and count the number of False values
        comparison = dataframe[column1] == dataframe[column2]
        if False in comparison.value_counts():
            print('Number of different values: ', comparison.value_counts()[False])
            print('Number of equal values: ', comparison.value_counts()[True])
            # Null values for comparison
            print(f'Number of null values for {column1}: ', dataframe[column1].isnull().sum())
            print(f'Number of null values for {column2}: ', dataframe[column2].isnull().sum())
        else:
            print("All values are equal")    
    else:
        # Return bool if column1, column2 and column3 are equal and count the number of False values
        comparison = (dataframe[column1] == dataframe[column2]) & (dataframe[column1] == dataframe[column3])
        if False in comparison.value_counts():
            print('Number of different values: ', comparison.value_counts()[False])
            print('Number of equal values: ', comparison.value_counts()[True])
            # Null values for comparison
            print(f'Number of null values for {column1}: ', dataframe[column1].isnull().sum())
            print(f'Number of null values for {column2}: ', dataframe[column2].isnull().sum())
            print(f'Number of null values for {column3}: ', dataframe[column3].isnull().sum())
        else:
            print("All values are equal")
        


##### Year columns (Season)

__`yearID` and `year`.__

In [373]:
# Columns that start with 'year'
year_columns = [col for col in df_1985.columns if col.startswith('year')]
year_columns

['year_ID', 'yearID', 'year']

In [374]:
# yearID and year_ID comparison
compare_columns(df_1985, 'yearID', 'year_ID')

All values are equal


Both columns are the same, we can drop any of them plus `year` column.

In [375]:
drop_cols.extend(['year_ID', 'year'])

##### Player identifiers

__`playerID` and `player_ID`.__

In [376]:
# Columns that could be for player identification
id_columns = [col for col in df_1985.columns if 'ID' in col or 'id' in col or 'player' in col or 'name' in col] 
id_columns

['name_common',
 'year_ID',
 'player_ID',
 'mlb_ID',
 'stint_ID',
 'team_ID',
 'playerID',
 'retroID',
 'bbrefID',
 'yearID',
 'GIDP',
 'teamID',
 'lgID',
 'franchID',
 'divID',
 'name_x',
 'teamIDBR',
 'teamIDlahman45',
 'teamIDretro',
 'firstname',
 'lastname',
 'playerid',
 'mlbid',
 'name_y',
 'career_GIDP']

In [377]:
# playerID and player_ID comparison
compare_columns(df_1985, 'playerID', 'player_ID')

Number of different values:  553
Number of equal values:  44285
Number of null values for playerID:  0
Number of null values for player_ID:  0


Let's select the rows where `playerID` and `player_ID` are different and see what the difference is.

In [378]:
pd.set_option('display.max_rows', 400)
df_1985[df_1985['playerID'] != df_1985['player_ID']][['playerID', 'player_ID']]

Unnamed: 0,playerID,player_ID
57934,obriech01,o'brich01
58173,oconnja02,o'conja02
58417,oberrmi01,o'bermi01
58465,oneilpa01,o'neipa01
58471,obriepe03,o'bripe03
...,...,...
102014,montafr01,montafr02
102097,beeksja01,beeksja02
102144,rodrijo04,rodrijo06
102319,mannima01,mannima02


It looks like the differences are special characters in the `player_ID` column. We can drop `player_ID` and keep `playerID`, since it is cleaner.

In [379]:
# Append player_ID to drop_cols
drop_cols.append('player_ID')

##### More player identifiers

In [380]:
# Player identifiers
player_ids_columns = ['playerID', 'mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid']

df_1985[player_ids_columns].value_counts().sample(10)

playerID   mlb_ID     retroID   bbrefID    mlbid     playerid
parkecl01  120230.0   parkc001  parkecl01  120230.0  16336.0      3
nixja01    1303872.0  nix-j001  nixja01    434624.0  6443.0       1
tellero01  642133.0   tellr001  tellero01  642133.0  180759.0     4
fryja01    605240.0   fry-j001  fryja01    605240.0  165484.0     5
delcama01  434668.0   delcm001  delcama01  434668.0  5098.0       5
furbuch01  1037406.0  furbc001  furbuch01  518703.0  128990.0     1
hitchst01  115982.0   hitcs001  hitchst01  115982.0  961.0       11
escobyu01  488862.0   escoy001  escobyu01  488862.0  54457.0     10
morenom01  238722.0   moreo001  morenom01  119361.0  15641.0      1
jakubch01  499856.0   jakuc001  jakubch01  499856.0  30326.0      3
dtype: int64

We are interested in Baseball Reference IDs, as it is one of our main sources of information. `playerID` and `bbrefID` are the same, and both of them are from Baseball Reference. We can drop `bbrefID` and keep `playerID`. We will also drop the rest of the player identifiers.

In [381]:
drop_cols.extend(['mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid'])

`name_common` and `name_y`

In [382]:
# name_common and name_y sample
df_1985[df_1985['name_common'] != df_1985['name_y']][['name_common', 'name_y']].sample(10)

Unnamed: 0,name_common,name_y
84429,Tony Pena,tony pena
69233,Greg Vaughn,greg vaughn
71636,John Halama,john halama
65140,Juan Samuel,juan samuel
58050,Don Baylor,don baylor
61156,Ken Phelps,ken phelps
97308,Ryan Zimmerman,ryan zimmerman
102116,Jeff Hoffman,jeff hoffman
99584,Jose Altuve,jose altuve
60931,Fred McGriff,fred mcgriff


It looks that they are the same but `name_y` is in lower case. We will keep `name_common` and drop `name_y`.

In [383]:
# drop name_y
drop_cols.append('name_y')

##### Age columns

In [384]:
compare_columns(df_1985, 'age_x', 'age_y')

Number of different values:  962
Number of equal values:  43876
Number of null values for age_x:  0
Number of null values for age_y:  76


We will drop `age_y` as it has more missing values.

In [385]:
drop_cols.append('age_y')

# Change name of age_x to age
df_1985.rename(columns={'age_x': 'age'}, inplace=True)

##### Team identifiers

There are several columns that serve as identifiers for the team:
- `teamID`
- `team_ID`
- `teamIDBR`
- `teamIDlahman45`
- `teamIDretro`
- `franchID`
- `name_x`
- `TeamName`

Let's look at a dataframe including these columns:

In [386]:
# Team identifiers
team_ids_columns = ['teamID', 'team_ID', 'teamIDBR', 'teamIDlahman45', 'teamIDretro', 'franchID', 'name_x', 'TeamName']



We will rely on the identifiers provided by Baseball-Reference (https://www.baseball-reference.com/about/team_IDs.shtml), specifically `teamIDBR` and `franchID`. Baseball-Reference is one of our primary data sources, and utilizing these identifiers will ensure consistency and accuracy in our analysis. We will drop the other team identifiers except for `name`, which will be useful for visualizations and analysis.

In [387]:
# null values for the team identifiers columns
df_1985[team_ids_columns].isnull().sum()

teamID             0
team_ID            0
teamIDBR           0
teamIDlahman45     0
teamIDretro        0
franchID           0
name_x             0
TeamName          76
dtype: int64

In [388]:
# Append teamID, team_ID, teamIDlahman45, teamIDretro to drop_cols
drop_cols.extend(['teamID', 'team_ID', 'teamIDlahman45', 'teamIDretro', 'TeamName'])

In [389]:
# Change name_x to teamName
df_1985.rename(columns={'name_x': 'teamName'}, inplace=True)

##### Salary columns

`salary_x` and `salary_y`

In [390]:
compare_columns(df_1985, 'salary_x', 'salary_y')

Number of different values:  16965
Number of equal values:  27873
Number of null values for salary_x:  0
Number of null values for salary_y:  76


We will drop `salary_x` as it has more missing values.

In [391]:
drop_cols.append('salary_x')

##### Summary of columns to drop

In [392]:
# Columns to drop
print('Columns to drop: ', drop_cols)
print('Number of columns to drop: ', len(drop_cols))

Columns to drop:  ['firstname', 'lastname', 'borndate', 'ERA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'PPF', 'year_ID', 'year', 'player_ID', 'mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid', 'name_y', 'age_y', 'teamID', 'team_ID', 'teamIDlahman45', 'teamIDretro', 'TeamName', 'salary_x']
Number of columns to drop:  31


##### Droping columns

In [393]:
# Drop columns
df_1985.drop(columns=drop_cols, inplace=True)

In [394]:
# Shape after dropping columns
print('Shape of the dataset: ', df_1985.shape)

Shape of the dataset:  (44838, 109)


In [395]:
# Re empty list to store more columns to be dropped if needed
drop_cols = []

In [396]:
df_1985.sample(5)

Unnamed: 0,name_common,age,pitcher,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
59666,Adam Peterson,21.0,Y,1,0.0,0,4.0,0.0,0.0,0.0,0.0,peterad01,1965.0,USA,190.0,75.0,R,R,1987-09-19,1991-08-06,1987.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AL,CHW,W,5.0,162.0,81.0,77.0,85.0,N,N,N,748.0,5538.0,1427.0,283.0,36.0,173.0,487.0,971.0,138.0,52.0,33.0,52.0,746.0,116.0,174.0,0.981,Chicago White Sox,Comiskey Park,1208060.0,103.0,CHW,60000.0,513.0,8.0,412454.0,62500.0,,,0.0,0.0,,,,0.475,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,,,,,
102162,Jonathan Schoop,30.0,N,1,510.0,131,1125.0,60.248091,0.12,1.26,-0.69,schoojo01,1991.0,Curacao,247.0,73.0,R,R,2013-09-25,2023-06-23,2022.0,131.0,481.0,48.0,97.0,23.0,1.0,11.0,38.0,5.0,0.0,19.0,107.0,1.0,6.0,0.0,4.0,15.0,AL,DET,C,4.0,162.0,82.0,66.0,96.0,N,N,N,557.0,5378.0,1240.0,235.0,27.0,110.0,380.0,1413.0,47.0,24.0,58.0,44.0,713.0,94.0,137.0,0.984,Detroit Tigers,Comerica Park,1575544.0,96.0,DET,7500000.0,164.0,5.0,4317736.0,700000.0,0.202,0.239,62.0,155.0,0.322,0.561,0.234,0.407,1133,4465,4183,542,1066,675,210,7,174,537,15,4,182,992,4,61,7,31,121,1812,20.04,0.255,0.294,0.433,0.727,0.293
77613,Jermaine Clark,26.0,N,2,57.0,25,155.7,-75.621459,-0.59,-0.06,-0.59,clarkje02,1976.0,USA,175.0,70.0,L,R,2001-04-03,2005-05-24,2003.0,25.0,48.0,2.0,8.0,2.0,0.0,0.0,7.0,2.0,2.0,6.0,5.0,0.0,0.0,1.0,2.0,1.0,AL,TEX,W,4.0,162.0,81.0,71.0,91.0,N,N,N,826.0,5664.0,1506.0,274.0,36.0,239.0,488.0,1052.0,65.0,25.0,75.0,42.0,969.0,94.0,168.0,0.985,Texas Rangers,The Ballpark at Arlington,2094394.0,111.0,TEX,300000.0,0.0,0.0,2372189.0,300000.0,0.167,0.25,6.0,10.0,0.208,0.458,0.178,0.438,28,57,48,3,8,6,2,0,0,7,2,2,6,5,0,0,1,2,1,10,-0.62,0.167,0.25,0.208,0.458,0.178
94652,Stephen Vogt,31.0,N,1,532.0,137,1130.3,93.767331,2.05,0.53,2.14,vogtst01,1984.0,USA,216.0,72.0,L,R,2012-04-06,2022-10-05,2016.0,137.0,490.0,54.0,123.0,30.0,2.0,14.0,56.0,0.0,0.0,35.0,83.0,3.0,4.0,0.0,3.0,6.0,AL,OAK,W,5.0,162.0,81.0,69.0,93.0,N,N,N,653.0,5500.0,1352.0,270.0,21.0,169.0,442.0,1145.0,50.0,23.0,33.0,34.0,761.0,97.0,152.0,0.984,Oakland Athletics,O.co Coliseum,1521506.0,90.0,OAK,527500.0,583.0,14.0,3966020.0,507500.0,0.251,0.305,77.0,199.0,0.406,0.711,0.275,0.426,422,1505,1364,156,348,228,67,8,45,178,1,3,118,249,12,7,2,14,19,566,6.98,0.255,0.315,0.415,0.73,0.28
79674,Cal Eldred,37.0,Y,1,2.0,30,37.0,-100.0,-0.03,-0.01,-0.03,eldreca01,1967.0,USA,215.0,76.0,R,R,1991-09-24,2005-10-01,2005.0,31.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,NL,STL,C,1.0,162.0,81.0,100.0,62.0,Y,N,N,805.0,5538.0,1494.0,287.0,26.0,170.0,534.0,947.0,83.0,36.0,62.0,35.0,634.0,100.0,196.0,0.984,St. Louis Cardinals,Busch Stadium II,3538988.0,101.0,STL,600000.0,463.0,18.0,2476589.0,316000.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.617,192,89,72,7,8,6,2,0,0,4,0,1,6,39,0,0,11,0,2,10,-0.41,0.111,0.179,0.139,0.318,0.242


### 2.4 Filtering and removing players with no relevance to the project (Removing rows) <a id='drop_rows'></a>

#### Pitchers

As we mentioned earlier in the project, we are going to focus on batting metrics. We will remove pitchers from the dataset.

In [397]:
# Drop rows with pitcher = Y
df_1985 = df_1985[df_1985['pitcher'] != 'Y']

In [398]:
# Drop pitcher column
df_1985.drop('pitcher', axis=1, inplace=True)

# Shape
df_1985.shape

(22274, 108)

---

#### Update Team Names

Post 1985, several teams have undergone name changes. We'll update these to reflect their current names, resulting in 30 unique team names. Additionally, we'll include the Montreal Expos, the only team to have relocated since 1985, bringing our total to 31.

- https://www.mlb.com/team

In [430]:
# Team names
df_1985['teamName'].value_counts()

San Francisco Giants     814
Los Angeles Angels       805
Kansas City Royals       799
Seattle Mariners         798
Texas Rangers            797
Cincinnati Reds          795
Cleveland Guardians      793
Baltimore Orioles        788
Oakland Athletics        788
Los Angeles Dodgers      787
Boston Red Sox           784
New York Mets            782
New York Yankees         780
San Diego Padres         769
Detroit Tigers           767
Pittsburgh Pirates       765
Minnesota Twins          763
Toronto Blue Jays        751
Chicago Cubs             750
Philadelphia Phillies    740
Houston Astros           738
St. Louis Cardinals      735
Chicago White Sox        731
Milwaukee Brewers        725
Atlanta Braves           720
Colorado Rockies         604
Miami Marlins            599
Arizona Diamondbacks     520
Tampa Bay Rays           517
Montreal Expos           398
Washington Nationals     372
Name: teamName, dtype: int64

In [429]:
# Cleveland Indians to Cleveland Guardians
df_1985['teamName'] = df_1985['teamName'].replace('Cleveland Indians', 'Cleveland Guardians')

# Anaheim Angels to Los Angeles Angels
df_1985['teamName'] = df_1985['teamName'].replace('Anaheim Angels', 'Los Angeles Angels')

# Tampa Bay Devil Rays to Tampa Bay Rays
df_1985['teamName'] = df_1985['teamName'].replace('Tampa Bay Devil Rays', 'Tampa Bay Rays')

# Florida Marlins to Miami Marlins
df_1985['teamName'] = df_1985['teamName'].replace('Florida Marlins', 'Miami Marlins')

# California Angels to Los Angeles Angels
df_1985['teamName'] = df_1985['teamName'].replace('California Angels', 'Los Angeles Angels')

# Los Angeles Angels of Anaheim to Los Angeles Angels
df_1985['teamName'] = df_1985['teamName'].replace('Los Angeles Angels of Anaheim', 'Los Angeles Angels')

In [432]:
# Unique team names
df_1985['teamName'].nunique()

31

---

### 2.5 Dealing with Null Values <a id='null_values'></a>

In [433]:
pd.set_option('display.max_rows', 200)
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

Series([], dtype: int64)

The curious pattern of having the same number of null values (668) in the columns `WSWin`, `DivWin`, and `LgWin` suggests a potential relationship among these variables. Let's look at the rows with null values in these columns.

In [434]:
pd.set_option('display.max_rows', None)
# Rows with DivWin, LgWin and WSWin null values, sort 
df_1985[df_1985['DivWin'].isnull() & df_1985['LgWin'].isnull() & df_1985['WSWin'].isnull()].nunique().sort_values()

playerID           0
Rank               0
divID              0
franchID           0
lgID               0
teamIDBR           0
attendance         0
park               0
teamName           0
career_BABIP       0
career_OPS         0
career_SLG         0
career_OBP         0
career_BA          0
career_WAR         0
career_TB          0
career_GIDP        0
career_SF_x        0
career_SH          0
career_HBP_x       0
career_IBB         0
career_SO_x        0
career_BB_x        0
career_CS_x        0
G                  0
Ghome              0
W                  0
L                  0
averagesalary      0
teamrank           0
leaguerank         0
salary_y           0
BPF                0
FP                 0
DP                 0
E                  0
RA                 0
SF_y               0
HBP_y              0
career_SB_x        0
CS_y               0
SO_y               0
BB_y               0
HR_y               0
3B_y               0
2B_y               0
H_y                0
AB_y         

 It's interesting that `yearID` only contains one unique value. This suggests that all the records in this subset belong to a single season. 

In [435]:
df_1985[df_1985['DivWin'].isnull() & df_1985['LgWin'].isnull() & df_1985['WSWin'].isnull()][['yearID']].value_counts()

Series([], dtype: int64)

The null values present in the `DivWin`, `LgWin`, and `WSWin` columns correspond to the year 1994. This particular year holds significance in MLB history as the season came to an abrupt end due to a labor strike. As a result, no teams were able to compete in the playoffs, and the World Series was canceled.

Therefore, the presence of null values in DivWin (division winner), LgWin (league winner), and WSWin (World Series winner) for the year 1994 is expected. We will fill these null values with "N" to indicate that no team won the division, league, or World Series that year.

In [436]:
# Fill WSwWin, LgWin and DivWin null values with 'N'
df_1985['WSWin'].fillna('N', inplace=True)
df_1985['LgWin'].fillna('N', inplace=True)
df_1985['DivWin'].fillna('N', inplace=True)

In [437]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0].sort_values(ascending=False)

Series([], dtype: int64)

In [438]:
df_1985.shape

(22274, 109)

##### `BABIP` Nulls

In [440]:
# Rows with BABIP null values
df_1985[df_1985['BABIP'].isnull()].sample()

ValueError: a must be greater than 0 unless no samples are taken

The calculation for BABIP (Batting Average on Balls In Play) is:

BABIP = (H - HR) / (AB - SO - HR + SF)

BABIP NaN values, are likely due to division by zero in the denominator of the formula. This can occur if the player has no at-bats (AB), or if the number of strikeouts (SO) and home runs (HR) equals or exceeds the number of at-bats.

In other words, if a player has never been at bat, or if every time they've been at bat they've either struck out or hit a home run, then the denominator of the BABIP formula will be zero, leading to a division by zero error and a resulting NaN value.

For this reason we will fill the BABIP NaN values with zero.

In [441]:
# Fill BABIP null values with 0
df_1985['BABIP'].fillna(0, inplace=True)

In [442]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

Series([], dtype: int64)

##### `BA`, `OBP`, `SLG`, and `OPS` Nulls

The values for Batting Average (`BA`), On-Base Percentage (`OBP`), Slugging Percentage (`SLG`), and On-Base Plus Slugging (`OPS`) are calculated based on a player's hitting statistics. If a player has not had any at-bats or has not reached base in any way, these values will be undefined and will appear as NaN in your dataset.

For this reason we will fill the `BA`, `OBP`, `SLG`, and `OPS` NaN values with zero.

In [443]:
# Fill BA, OBP, SLG, OPS null values with 0
df_1985['BA'].fillna(0, inplace=True)
df_1985['OBP'].fillna(0, inplace=True)
df_1985['SLG'].fillna(0, inplace=True)
df_1985['OPS'].fillna(0, inplace=True)

In [444]:
# null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

Series([], dtype: int64)

##### Carreer Stats Nulls

In [445]:
# Rows with career_BABIP null values
df_1985[df_1985['career_BABIP'].isnull()].sample(5)

ValueError: a must be greater than 0 unless no samples are taken

NaN values for career statistics such as Batting Average on Balls in Play (`BABIP`), On-Base Plus Slugging (`OPS`), Slugging Percentage (`SLG`), On-Base Percentage (`OBP`), and Batting Average (`BA`) likely appear because the player had no at-bats or did not reach base in any way during their career up to that point.
This is particularly common for players in their first season, as they have not yet had the opportunity to accumulate any hits, walks, or other statistics that contribute to these metrics. As a result, the denominators in the formulas for these statistics are zero, leading to undefined values.

Same as before we will fill this career stats NaN values with zero.

In [446]:
# Fill career_BABIP, career_BA, career_OBP, career_SLG, career_OPS null values with 0
df_1985['career_BABIP'].fillna(0, inplace=True)
df_1985['career_BA'].fillna(0, inplace=True)
df_1985['career_OBP'].fillna(0, inplace=True)
df_1985['career_SLG'].fillna(0, inplace=True)
df_1985['career_OPS'].fillna(0, inplace=True)

##### Remaining Nulls

In [447]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

Series([], dtype: int64)

We only have one row (player) remaining with null values. We will take a look at it and decide what to do.

In [448]:
# rows with salary_y null values
df_1985[df_1985['teamrank'].isnull()]

Unnamed: 0,playerID,name_common,birthYear,birthCountry,debut,finalGame,age,weight,height,bats,throws,yearID,stint_ID,G_x,G_y,PA,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,1B,BA,OBP,SLG,OPS,BABIP,TB,W%,Inn,OPS_plus,WAR,WAR_def,WAR_off,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,teamName,park,attendance,teamIDBR,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,BPF,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,adjusted_salary


Remaning null is from a player called "Yadier Molina" for the 2009 season. We will try to find his stats in Baseball Reference or another trusted source and fill the null values with them.

- https://www.baseball-reference.com/players/m/molinya01.shtml
- https://www.baseball-reference.com/bullpen/Minimum_salary
- https://www.baseball-reference.com/teams/STL/2009.shtml


In [449]:
# Fil Yadier Molina's salary_y for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'salary_y'] = 3312500

# Fill Yadier Molina's teamrank for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'teamrank'] = 9

# Fill Yadier Molina's average salary for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'averagesalary'] = 2996106

# Fill Yadier Molina's leagueminimum for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'leagueminimum'] = 400000   

# Fill Yadier Molina's leaguerank for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'leaguerank'] = 100



In [450]:
# null values
df_1985.isnull().sum().sum()

0

## 4. Salary Adjustments <a id='salary_adjustments'></a>
We will adjust the salaries of the players to account for inflation. The value of money changes over time due to inflation. Therefore, comparing salaries from different years without adjusting for inflation can lead to misleading results.
 - https://www.bls.gov/data/inflation_calculator.htm

In [451]:
buying_power = {1985: 2.89,
                   1986: 2.78,
                   1987: 2.74,
                   1988: 2.64,
                   1989: 2.52,
                   1990: 2.39,
                   1991: 2.27,
                   1992: 2.21,
                   1993: 2.14,
                   1994: 2.09,
                   1995: 2.03,
                   1996: 1.98,
                   1997: 1.92,
                   1998: 1.89,
                   1999: 1.86,
                   2000: 1.81,
                   2001: 1.76,
                   2002: 1.72,
                   2003: 1.68,
                   2004: 1.65,
                   2005: 1.61,
                   2006: 1.54,
                   2007: 1.51,
                   2008: 1.45,
                   2009: 1.45,
                   2010: 1.41,
                   2011: 1.39,
                   2012: 1.35,
                   2013: 1.32,
                   2014: 1.30,
                   2015: 1.31,
                   2016: 1.29,
                   2017: 1.26,
                   2018: 1.23,
                   2019: 1.21,
                   2020: 1.18,
                   2021: 1.17,
                   2022: 1.09,}

In [452]:
# Adjusted salary column
df_1985['adjusted_salary'] = round(df_1985['salary_y'] * df_1985['yearID'].map(buying_power), 0)

In [453]:
# Sanity check
df_1985[['yearID', 'salary_y', 'adjusted_salary']].sample(10)

Unnamed: 0,yearID,salary_y,adjusted_salary
84474,2008.0,2000000.0,2900000.0
92408,2015.0,12000000.0,15720000.0
82144,2007.0,3750000.0,5662500.0
75908,2002.0,700000.0,1204000.0
99377,2020.0,566000.0,667880.0
93463,2015.0,10500000.0,13755000.0
98306,2019.0,19700000.0,23837000.0
69778,1996.0,1600000.0,3168000.0
62026,1989.0,900000.0,2268000.0
93916,2016.0,2150000.0,2773500.0


In [454]:
# Save a copy of the df_1985
df_1985.to_csv('baseballsalaries_1985_preclean.csv', index=False)

Lastly, we will change the column order to make it more readable.

In [455]:
column_order = [
    'playerID', 'name_common', 'birthYear', 'birthCountry', 'debut', 'finalGame',
    'age', 'weight', 'height', 'bats', 'throws',
    'yearID', 'stint_ID', 'G_x', 'G_y', 'PA', 'AB_x', 'R_x', 'H_x', '2B_x', '3B_x', 'HR_x', 'RBI', 'SB_x', 'CS_x', 'BB_x', 'SO_x', 'IBB', 'HBP_x', 'SH', 'SF_x', 'GIDP', '1B', 'BA', 'OBP', 'SLG', 'OPS', 'BABIP', 'TB', 'W%', 'Inn', 'OPS_plus', 'WAR', 'WAR_def', 'WAR_off',
    'career_G_x', 'career_PA', 'career_AB_x', 'career_R_x', 'career_H_x', 'career_1B', 'career_2B_x', 'career_3B_x', 'career_HR_x', 'career_RBI', 'career_SB_x', 'career_CS_x', 'career_BB_x', 'career_SO_x', 'career_IBB', 'career_HBP_x', 'career_SH', 'career_SF_x', 'career_GIDP', 'career_TB', 'career_WAR', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS', 'career_BABIP',
    'teamName', 'park', 'attendance', 'teamIDBR', 'lgID', 'franchID', 'divID', 'Rank', 'G', 'Ghome', 'W', 'L', 'DivWin', 'LgWin', 'WSWin', 'R_y', 'AB_y', 'H_y', '2B_y', '3B_y', 'HR_y', 'BB_y', 'SO_y', 'SB_y', 'CS_y', 'HBP_y', 'SF_y', 'RA', 'E', 'DP', 'FP', 'BPF',
    'salary_y', 'leaguerank', 'teamrank', 'averagesalary', 'leagueminimum', 'adjusted_salary'
]

In [456]:
# Sanity check
len(column_order) == len(df_1985.columns)

True

In [457]:
# Change column order
df_1985 = df_1985[column_order]

In [458]:
# sample
df_1985.sample(5)

Unnamed: 0,playerID,name_common,birthYear,birthCountry,debut,finalGame,age,weight,height,bats,throws,yearID,stint_ID,G_x,G_y,PA,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,1B,BA,OBP,SLG,OPS,BABIP,TB,W%,Inn,OPS_plus,WAR,WAR_def,WAR_off,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,teamName,park,attendance,teamIDBR,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,BPF,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,adjusted_salary
97436,barnhtu01,Tucker Barnhart,1991.0,USA,2014-04-03,2023-06-20,27.0,192.0,71.0,L,R,2018.0,1,138,138.0,522.0,460.0,50.0,114.0,21.0,3.0,10.0,46.0,0.0,4.0,54.0,96.0,2.0,2.0,3.0,3.0,13.0,80.0,0.248,0.328,0.372,0.7,0.291,171.0,0.414,1046.0,87.771438,0.82,0.75,0.83,476,1699,1503,136,382,271,77,6,28,160,5,5,161,291,27,9,14,12,47,555,5.6,0.254,0.328,0.369,0.697,0.296,Cincinnati Reds,Great American Ball Park,1629356.0,CIN,NL,CIN,C,5.0,162.0,81.0,67.0,95.0,N,N,N,696.0,5532.0,1404.0,251.0,25.0,172.0,559.0,1376.0,77.0,33.0,65.0,35.0,819.0,95.0,144.0,0.984,103.0,4000000.0,269.0,7.0,4095686.0,545000.0,4920000.0
76201,mohrdu01,Dustan Mohr,1976.0,USA,2001-08-29,2007-07-08,26.0,210.0,72.0,R,R,2002.0,1,120,120.0,417.0,383.0,55.0,103.0,23.0,2.0,12.0,45.0,6.0,3.0,31.0,86.0,3.0,1.0,2.0,0.0,5.0,66.0,0.269,0.325,0.433,0.758,0.319,166.0,0.584,932.7,100.057944,2.12,0.34,1.32,140,474,434,61,115,76,25,2,12,49,7,4,36,103,3,1,2,1,5,180,2.18,0.265,0.322,0.415,0.737,0.322,Minnesota Twins,Hubert H Humphrey Metrodome,1924473.0,MIN,AL,MIN,C,1.0,161.0,81.0,94.0,67.0,Y,N,N,768.0,5582.0,1518.0,348.0,36.0,167.0,472.0,1089.0,79.0,62.0,56.0,52.0,712.0,74.0,124.0,0.987,100.0,200000.0,0.0,0.0,2295649.0,200000.0,344000.0
71696,johnske03,Keith Johns,1800.0,USA,1998-05-23,1998-05-26,26.0,175.0,73.0,R,R,1998.0,1,2,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.568,11.0,0.0,0.01,-0.01,0.01,2,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0.01,0.0,1.0,0.0,0.0,0.0,Boston Red Sox,Fenway Park II,2314704.0,BOS,AL,BOS,E,2.0,162.0,81.0,92.0,70.0,N,N,N,876.0,5601.0,1568.0,338.0,35.0,205.0,541.0,1049.0,72.0,39.0,70.0,52.0,729.0,105.0,128.0,0.983,102.0,170000.0,0.0,0.0,1398831.0,170000.0,321300.0
58065,flynndo01,Doug Flynn,1951.0,USA,1975-04-09,1985-10-05,34.0,165.0,71.0,R,R,1985.0,2,41,41.0,61.0,57.0,2.0,14.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,1.0,1.0,11.0,0.246,0.241,0.316,0.557,0.255,18.0,0.522,167.3,55.169553,-0.35,-0.17,-0.05,1309,4085,3853,288,918,757,115,39,7,284,20,20,151,320,58,1,60,20,95,1132,-6.89,0.238,0.266,0.294,0.56,0.257,Montreal Expos,Stade Olympique,1502494.0,MON,NL,WSN,E,3.0,161.0,81.0,84.0,77.0,N,N,N,633.0,5429.0,1342.0,242.0,49.0,118.0,492.0,880.0,169.0,77.0,26.0,45.0,636.0,121.0,152.0,0.981,95.0,350000.0,295.0,13.0,371571.0,60000.0,1011500.0
63292,thompmi02,Milt Thompson,1959.0,USA,1984-09-04,1996-07-28,31.0,170.0,71.0,L,R,1990.0,1,135,135.0,463.0,418.0,42.0,91.0,14.0,7.0,6.0,30.0,25.0,5.0,39.0,60.0,5.0,5.0,1.0,0.0,4.0,64.0,0.218,0.292,0.328,0.62,0.241,137.0,0.432,933.0,70.944204,0.72,0.39,-0.2,756,2688,2448,312,677,522,99,29,27,207,157,42,203,406,20,14,12,11,35,915,12.94,0.277,0.334,0.374,0.708,0.321,St. Louis Cardinals,Busch Stadium II,2573225.0,STL,NL,STL,E,6.0,162.0,81.0,70.0,92.0,N,N,N,599.0,5462.0,1398.0,255.0,41.0,73.0,517.0,844.0,221.0,74.0,21.0,50.0,698.0,130.0,114.0,0.979,100.0,866667.0,173.0,10.0,597537.0,100000.0,2071334.0


In [462]:
# min salary
df_1985['adjusted_salary'].min()

72500.0

In [460]:
# Save a copy of the df_1985
df_1985.to_csv('baseballsalaries_1985.csv', index=False)

## 5. Next Steps <a id='next_steps'></a>

Now that we have a clean dataset, we are ready to move on to the next step of our project: 

- __Exploratory Data Analysis (EDA):__ Perform an in-depth exploratory data analysis to uncover insights, patterns, and relationships within the preprocessed data. Utilize visualizations, statistical analysis, and other techniques to understand the distribution, correlations, and trends present in the data. This stage will provide valuable insights that can guide further analysis and modeling decisions.

- __Feature Engineering:__ Engage in feature engineering to enhance the dataset for modeling purposes. This includes selecting relevant features, transforming existing features, and potentially creating new features based on domain knowledge and insights gained from the EDA. Iteratively refine the feature set to improve model performance and align it with the project's objectives.

Please note that the current state of the preprocessed data is not the final form. More features can be added or removed during the feature engineering phase to further optimize our models and increase their predictive power.

