## Predicting MLB Player Salaries: A Batting Performance Analysis
# Data Preprocessing

In this notebook, we preprocess the data: we simplify the dataframe, drop columns that add no value to the project, restrict the time span, and carry out further cleaning and wrangling.

By the end, we expect to have a dataframe that is ready for analysis and modeling.

## Table of Contents
1. [Data Exploration](#data_exploration)
    - [1.1. Data Dictionary](#data_dict)
    - [1.2. Description and overview of the data](#overview)
2. [Absence of Key Features](#missing_features)
    - [2.1. Adding Key Features](#adding_features)
    - [2.2. Adding Cumulative Features](#adding_cumulative_features)
3. [Data Cleaning](#data_cleaning)
    - [3.1. Setting a time span](#time_span)
    - [3.2. Dropping columns (features) that are not relevant to the project](#drop_cols)
    - [3.3. Dropping repeated or similar columns](#drop_repeated_cols)
    - [3.4. Filtering and removing players with no relevance to the project (removing rows)](#drop_rows)
    - [3.5. Dealing with null values](#null_values)
4. [Salary Adjustments](#salary_adjustments)
5. [Next Steps](#next_steps)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## 1. Data Exploration <a id='data_exploration'></a>

To begin the analysis, we will explore the dataset to understand its structure and content.


In [73]:
# Load Dataset
raw_df = pd.read_csv("pre_datasets/war_batting_people_teams_salaries_pre.csv")



In [74]:
# Shape of the dataset
print('Shape of the dataset: ', raw_df.shape)
print('Number of rows (Players): ', raw_df.shape[0])
print('Number of columns (Features): ', raw_df.shape[1])

Shape of the dataset:  (112149, 109)
Number of rows (Players):  112149
Number of columns (Features):  109


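The dataset has a shape of **(112149, 109)**, indicating __112,149 rows and 109 columns (features)__. Each row is one player's stint with a team in a given season, so the same player appears in many rows.

Here are five randomly sampled rows of the dataset: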

In [78]:
# Five randomly sampled rows of the dataset
pd.set_option('display.max_columns', None)

raw_df.sample(5)

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y
51142,Nelson Briles,33.0,1977,brilene01,Y,222936.0,2,0.0,0,112.3,0.0,0.0,0.0,0.0,0.0,TEX,brilene01,1943.0,8.0,5.0,USA,Nelson,Briles,195.0,71.0,R,R,1965-04-19,1978-09-13,briln101,brilene01,1977.0,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,TEX,AL,TEX,W,2.0,162.0,81.0,94.0,68.0,N,N,N,767.0,5541.0,1497.0,265.0,39.0,135.0,596.0,904.0,154.0,85.0,39.0,50.0,657.0,583.0,3.56,49.0,17.0,31.0,4417.0,1412.0,134.0,471.0,864.0,117.0,156.0,0.982,Texas Rangers,Arlington Stadium,1250722.0,101.0,101.0,TEX,TEX,TEX,,,,,,,,,,,,,,
76077,Kent Mercker,31.0,1999,merckke01,Y,237934.0,2,34.0,25,129.4,2500000.0,14.28956,0.22,0.0,0.22,STL,merckke01,1968.0,2.0,1.0,USA,Kent,Mercker,175.0,73.0,L,L,1989-09-22,2008-05-30,merck001,merckke01,1999.0,31.0,28.0,5.0,5.0,1.0,0.0,0.0,2.0,0.0,0.0,2.0,10.0,0.0,0.0,4.0,0.0,0.0,SLN,NL,STL,C,4.0,161.0,80.0,75.0,86.0,N,N,N,809.0,5570.0,1461.0,274.0,27.0,194.0,613.0,1202.0,134.0,48.0,51.0,44.0,838.0,761.0,4.74,5.0,3.0,38.0,4336.0,1519.0,161.0,667.0,1025.0,132.0,163.0,0.978,St. Louis Cardinals,Busch Stadium II,3225334.0,101.0,101.0,STL,SLN,SLN,Kent,Mercker,15299.0,118967.0,1999.0,2500000.0,St. Louis Cardinals,31.0,193.0,6.0,1611166.0,200000.0,1968-02-01,Kent Mercker
23517,Herman Bell,35.0,1933,bellhi01,Y,110825.0,1,32.0,38,0.0,0.0,-3.665825,-0.1,-0.02,-0.1,NYG,bellhi01,1897.0,7.0,16.0,USA,Herman,Bell,185.0,72.0,R,R,1924-04-16,1934-08-23,bellh101,bellhi01,1933.0,38.0,29.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,8.0,0.0,0.0,1.0,0.0,1.0,NY1,NL,SFG,,1.0,156.0,77.0,91.0,61.0,,Y,Y,636.0,5461.0,1437.0,204.0,41.0,82.0,377.0,477.0,31.0,,,,515.0,424.0,2.71,75.0,23.0,15.0,4224.0,1280.0,61.0,400.0,555.0,178.0,156.0,0.973,New York Giants,Polo Grounds IV,604471.0,99.0,97.0,NYG,NY1,NY1,,,,,,,,,,,,,,
32351,Richie Ashburn,22.0,1949,ashburi01,N,110349.0,1,728.0,154,0.0,10000.0,88.198955,2.55,0.52,1.97,PHI,ashburi01,1927.0,3.0,19.0,USA,Richie,Ashburn,170.0,70.0,L,R,1948-04-20,1962-09-30,ashbr101,ashburi01,1949.0,154.0,662.0,84.0,188.0,18.0,11.0,1.0,37.0,9.0,0.0,58.0,38.0,0.0,1.0,7.0,0.0,7.0,PHI,NL,PHI,,3.0,154.0,77.0,81.0,73.0,,N,N,662.0,5307.0,1349.0,232.0,55.0,122.0,528.0,670.0,27.0,,,,668.0,601.0,3.89,58.0,12.0,15.0,4173.0,1389.0,104.0,502.0,495.0,158.0,141.0,0.974,Philadelphia Phillies,Shibe Park,819698.0,97.0,97.0,PHI,PHI,PHI,,,,,,,,,,,,,,
104145,Tuffy Gosewisch,33.0,2017,gosewtu01,N,488912.0,1,31.0,11,77.0,635000.0,-50.460379,-0.56,-0.04,-0.46,SEA,gosewtu01,1983.0,8.0,17.0,USA,Tuffy,Gosewisch,185.0,71.0,R,R,2013-08-01,2017-05-21,goset001,gosewtu01,2017.0,11.0,28.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,14.0,0.0,0.0,2.0,0.0,2.0,SEA,AL,SEA,W,3.0,162.0,81.0,78.0,84.0,N,N,N,750.0,5551.0,1436.0,281.0,17.0,200.0,487.0,1267.0,89.0,35.0,78.0,35.0,772.0,713.0,4.46,1.0,9.0,39.0,4321.0,1399.0,237.0,490.0,1244.0,103.0,147.0,0.982,Seattle Mariners,Safeco Field,2135445.0,97.0,97.0,SEA,SEA,SEA,Tuffy,Gosewisch,29078.0,488912.0,2017.0,635000.0,Seattle Mariners,33.0,541.0,21.0,4097122.0,535000.0,1983-08-17,Tuffy Gosewisch


In [161]:
pd.set_option('display.max_rows', 200)
# Print all column names and their data types
raw_df.dtypes

playerID           object
birthYear           int64
birthMonth          int64
birthDay            int64
birthCountry       object
birthState         object
birthCity          object
deathYear         float64
deathMonth        float64
deathDay          float64
deathCountry       object
deathState         object
deathCity          object
nameFirst          object
nameLast           object
nameGiven          object
weight            float64
height            float64
bats               object
throws             object
debut              object
finalGame          object
retroID            object
bbrefID            object
name_common        object
age_x             float64
year_ID             int64
player_ID          object
pitcher            object
mlb_ID            float64
stint_ID            int64
PA                float64
G_x                 int64
Inn               float64
salary_x          float64
OPS_plus          float64
WAR               float64
WAR_def           float64
WAR_off     

### 1.1 Data Dictionary <a id='data_dict'></a>

Now let's examine the data dictionary to understand the meaning of each column:

| Column Name | Description | 
| --- | --- |
| name_common | Player name |
| age_x | Player age |
| year_ID | Year (Season played) |
| team_ID | Team played |
| player_ID | Player ID |
| pitcher | Whether the player is a pitcher |
| mlb_ID | MLB ID |
| stint_ID | The stint of the player in that year (a player can have more than one stint in a year if they moved teams). |
| PA | Plate appearances |
| G_x | Games played |
| Inn | Innings played |
| salary_x | Player salary |
| OPS_plus | On-base Plus Slugging Plus - an adjusted version of the On-base Plus Slugging (OPS) statistic. |
| WAR | Wins Above Replacement - a measure of how many wins a player adds to a team compared to a replacement-level player. |
| WAR_def | Wins above replacement as fielder |
| WAR_off | Wins above replacement as batter |
| teamID | ID of the team they played for the season |
| playerID | Player ID |
| birthYear | Year of birth |
| birthMonth | Month of birth |
| birthDay | Day of birth |
| birthCountry | Country of birth |
| nameFirst | First name |
| nameLast | Last name |
| weight | Weight in pounds |
| height | Height in inches |
| bats | Batting hand (left, right, or both) |
| throws | Throwing hand (left or right) |
| debut | Date of MLB debut |
| finalGame | Date of final MLB game |
| retroID | Retro ID |
| bbrefID | Baseball Reference ID |
| yearID | Year (Season played) |
| lgID | The league the player played in |
| G | Games played |
| AB | At bats |
| R | Runs |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR | Home runs |
| RBI | Runs batted in |
| SB | Stolen bases |
| CS | Caught stealing |
| BB | Walks |
| SO | Strikeouts |
| IBB | Intentional walks |
| HBP | Hit by pitch |
| SH | Sacrifice hits |
| SF | Sacrifice flies |
| GIDP | Grounded into double plays |
| teamID | ID of the team they played for the season |
| lgID | The league the player played in |
| franchID | Franchise ID |
| divID | Division ID |
| Rank | Team rank at the end of the season |
| G | Games played |
| Ghome | Games played at home |
| W | Wins |
| L | Losses |
| DivWin | Division Winner (Y or N) |
| WCWin | Wild Card Winner (Y or N) |
| LgWin | League Champion(Y or N) |
| WSWin | World Series Winner (Y or N) |
| R | Total number of runs scored by the team in the season |
| AB | Total number of at bats by the team in the season |
| H | Total number of hits by the team in the season |
| 2B | Total number of doubles by the team in the season |
| 3B | Total number of triples by the team in the season |
| HR | Total number of home runs by the team in the season |
| BB | Total number of walks by the team in the season |
| SO | Total number of strikeouts by the team in the season |
| SB | Total number of stolen bases by the team in the season |
| CS | Total number of times caught stealing by the team in the season |
| HBP | Total number of times hit by pitch by the team in the season |
| SF | Total number of sacrifice flies by the team in the season |
| RA | Total number of runs allowed by the team in the season |
| ER | Total number of earned runs allowed by the team in the season |
| ERA | Earned run average |
| CG | Total number of complete games pitched by the team in the season |
| SHO | Total number of shutouts pitched by the team in the season |
| SV | Total number of saves by the team in the season |
| IPouts | Total number of outs pitched by the team in the season |
| HA | Total number of hits allowed by the team in the season |
| HRA | Total number of home runs allowed by the team in the season |
| BBA | Total number of walks allowed by the team in the season |
| SOA | Total number of strikeouts recorded by the team's pitchers in the season |
| E | Total number of errors by the team in the season |
| DP | Total number of double plays turned by the team in the season |
| FP | Fielding percentage |
| name_x | Team name |
| park | Team park |
| attendance | Total attendance for the season |
| BPF | Three-year park factor for batters |
| PPF | Three-year park factor for pitchers |
| teamIDBR | Team ID used by Baseball Reference website |
| teamIDlahman45 | Team ID used in Lahman database version 4.5 |
| teamIDretro | Team ID used by Retrosheet |
| firstname | The player's First name |
| lastname | The player's Last name |
| playerid | A unique identifier for each player |
| mlbid | A unique identifier for each player |
| year | Year of the salary |
| salary_y | Player salary for the year |
| TeamName | Team name |
| age_y | The player's age during the year of the salary |
| leaguerank | The player's rank in the league based on salary |
| teamrank | The player's rank in the team based on salary |
| averagesalary | The average salary in the league for the year |
| leagueminimum | The minimum salary in the league for the year |
| borndate | The player's birth date |


### 1.2 Description and Overview of the Data <a id='overview'></a>

In [79]:
# Unique players (Total number of players)
print('Number of unique players: ', raw_df['playerID'].nunique())

# Unique seasons (Total number of seasons)
print('Number of unique seasons: ', raw_df['yearID'].nunique())

# First season
print('First season: ', raw_df['yearID'].min())

# Last season
print('Last season: ', raw_df['yearID'].max())

Number of unique players:  19942
Number of unique seasons:  152
First season:  1871.0
Last season:  2022.0


#### Overview of the Data

Our dataset is a rich compilation of baseball statistics that contains data for __19,942__ unique players across __152__ seasons. The earliest MLB season in the dataset is __1871__, and the latest season is __2022__.

Each row in the dataset represents a single player's performance for one stint with a team in a given season, with a variety of metrics reflecting different aspects of their performance. These are divided into the following categories:
- __Identifiers:__ These include various IDs for players, teams, and leagues, as well as the year and player's stint with the team in that season.
- __Basic Batting and Fielding Statistics:__ These include familiar metrics such as games played, at bats, runs, hits, and home runs, among others, for both players and teams.
- __Advanced Batting and Fielding Statistics:__ These include more advanced metrics such as wins above replacement (WAR), its offensive and defensive components (`WAR_off`, `WAR_def`), and OPS+ (`OPS_plus`).
- __Personal Information:__ These include the player's name, birth and death information, weight, height, and handedness, among others.
- __Team Information:__ These include the team's name, division, and league, as well as their record, attendance, and park factors, among others.
- __Salary Information:__ These include the player's salary.

This dataset provides a comprehensive view of baseball performance, allowing us to examine the game from both individual player and team perspectives.
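
Because a player can have more than one stint in a season, it may be worth confirming which columns uniquely identify a row. The sketch below assumes that `playerID`, `yearID`, and `stint_ID` together form the key; the toy frame and its player names are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy frame: one player with two stints in the same season (illustrative values)
toy = pd.DataFrame({
    'playerID': ['playerA', 'playerA', 'playerB'],
    'yearID':   [1999, 1999, 1999],
    'stint_ID': [1, 2, 1],
    'G_x':      [5, 25, 154],
})

key = ['playerID', 'yearID', 'stint_ID']

# True if no two rows share the same (player, season, stint) combination
key_is_unique = not toy.duplicated(subset=key).any()

# Rows per player-season; values above 1 indicate multiple stints
rows_per_season = toy.groupby(['playerID', 'yearID']).size()

print(key_is_unique)
print(rows_per_season)
```

Running the same check on `raw_df` would show whether the assumed key holds for the full data.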


In [80]:
# Copy of the df for preprocessing
df = raw_df.copy()

## 2. Absence of Key Features (Pre-Feature Engineering) <a id='missing_features'></a>

Our current dataset lacks several features that are integral to understanding a player's performance in baseball. These absent features include:
- __Batting Average (BA)__: This is calculated by dividing a player's number of hits by their number of at bats. It provides a measure of a player's offensive capabilities.
- __On-Base Percentage (OBP)__: This metric represents the frequency at which a player reaches base per plate appearance. It's calculated by dividing the total times a player reaches base (hits, walks, and hit by pitches) by the sum of at bats, walks, hit by pitches, and sacrifice flies.
- __Singles (1B)__: This is the number of hits a player has that resulted in the batter reaching first base safely. It's calculated by subtracting doubles, triples, and home runs from total hits.
- __Total Bases (TB)__: The number of bases a player has gained with hits. It is a weighted sum for a batter's collection of hits including singles, doubles, triples and home runs.
- __Slugging Percentage (SLG)__: The total number of bases divided by the number of at bats.
- __On-Base Plus Slugging (OPS)__: This is the sum of a player's on-base percentage and slugging percentage. It's a more comprehensive statistic that measures a player's ability to get on base, along with their ability to hit for power.
- __Batting Average on Balls In Play (BABIP)__: This measures how often a ball in play goes for a hit. A ball is "in play" when the plate appearance ends in something other than a strikeout, walk, hit batter, catcher's interference, sacrifice bunt, or home run. In other words, the batter puts the ball in play and it doesn't clear the outfield fence. This can be an indicator of a player's luck, skill at placing hits where fielders aren't, or both.
- __Win Percentage (W%)__: This is the number of wins divided by the number of decided games (wins plus losses). It provides a measure of a team's success over a period of time.
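
As a worked example of the player-level formulas above, the sketch below plugs in Derek Jeter's 1995 totals as they appear in the cumulative sanity-check output later in this notebook (his first season, so the career totals equal the season totals). The variable names are chosen for illustration only.

```python
# Derek Jeter, 1995 (values taken from this notebook's cumulative output)
AB, H, doubles, triples, HR = 48, 12, 4, 1, 0
BB, HBP, SF, SO = 3, 0, 0, 11

singles = H - doubles - triples - HR                    # 1B
TB = singles + 2 * doubles + 3 * triples + 4 * HR       # total bases
BA = round(H / AB, 3)
OBP = round((H + BB + HBP) / (AB + BB + HBP + SF), 3)
SLG = round(TB / AB, 3)
OPS = round(OBP + SLG, 3)
BABIP = round((H - HR) / (AB - SO - HR + SF), 3)

print(singles, TB, BA, OBP, SLG, OPS, BABIP)
# 7 18 0.25 0.294 0.375 0.669 0.324
```

These values agree with the 1995 row of the per-season sanity check.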

### 2.1 Adding key features <a id='adding_features'></a>

In [81]:
# Add Batting Average (BA) column, round to 3 decimal places
df['BA'] = round(df['H_x'] / df['AB_x'], 3)

# Add On Base Percentage (OBP) column
df['OBP'] = round((df['H_x'] + df['BB_x'] + df['HBP_x']) / (df['AB_x'] + df['BB_x'] + df['HBP_x'] + df['SF_x']), 3)

# Add Singles (1B) column
df['1B'] = df['H_x'] - df['2B_x'] - df['3B_x'] - df['HR_x']

# Add Total Bases (TB) column
df['TB'] = df['1B'] + 2*df['2B_x'] + 3*df['3B_x'] + 4*df['HR_x']

# Add Slugging Percentage (SLG) column
df['SLG'] = round(df['TB'] / df['AB_x'], 3)

# Add On Base Plus Slugging (OPS) column
df['OPS'] = round(df['OBP'] + df['SLG'], 3)

# Add Batting Average on Balls in Play (BABIP) column
df['BABIP'] = round((df['H_x'] - df['HR_x']) / (df['AB_x'] - df['SO_x'] - df['HR_x'] + df['SF_x']), 3)

# Add Win Percentage (W%) column
df['W%'] = round(df['W'] / (df['W'] + df['L']), 3)


##### Sanity check
Let's compare newly created features with the ones of a well known player, Derek Jeter, to make sure we are calculating them correctly.
 
https://www.baseball-reference.com/players/j/jeterde01.shtml

In [82]:
# Sanity check, Derek Jeter's new columns
df[df['playerID'] == 'jeterde01'][['playerID', 'yearID', 'BA', 'OBP', '1B', 'SLG', 'OPS', 'BABIP', 'TB']]

Unnamed: 0,playerID,yearID,BA,OBP,1B,SLG,OPS,BABIP,TB
70090,jeterde01,1995.0,0.25,0.294,7.0,0.375,0.669,0.324,18.0
71506,jeterde01,1996.0,0.314,0.37,142.0,0.43,0.8,0.361,250.0
72829,jeterde01,1997.0,0.291,0.37,142.0,0.405,0.775,0.345,265.0
74215,jeterde01,1998.0,0.324,0.384,151.0,0.481,0.865,0.375,301.0
75652,jeterde01,1999.0,0.349,0.438,149.0,0.552,0.99,0.396,346.0
77092,jeterde01,2000.0,0.339,0.416,151.0,0.481,0.897,0.386,285.0
78585,jeterde01,2001.0,0.311,0.377,132.0,0.48,0.857,0.343,295.0
80004,jeterde01,2002.0,0.297,0.373,147.0,0.421,0.794,0.336,271.0
81488,jeterde01,2003.0,0.324,0.393,118.0,0.45,0.843,0.379,217.0
83009,jeterde01,2004.0,0.292,0.352,120.0,0.471,0.823,0.315,303.0


All the features match the ones on the website, so we can proceed.

### 2.2 Adding cumulative features <a id='adding_cumulative_features'></a>
It is essential to consider the player's career statistics up to the point of each season. This is because a player's salary is often determined not just by their performance in the current season, but by their cumulative performance throughout their career.

To this end, we have added cumulative features to our dataset. These features represent the cumulative sum of various statistics for each player up to each season. For example, `career_G_x` represents the cumulative number of games played by the player up to the current season, `career_H_x` represents the cumulative number of hits, and so on.

In addition to these cumulative sum features, we have also added career rate features for certain statistics. Each rate is recomputed from the player's cumulative totals up to each season rather than averaged across seasons. For example, `career_BA` is the player's cumulative hits divided by cumulative at bats up to the current season.
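
To illustrate why a career rate is built from cumulative totals, the sketch below compares that approach with a plain running mean of the per-season averages. The two-season career is hypothetical, and `season_BA` and `mean_of_season_BA` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical career: a short first season followed by a full one
toy = pd.DataFrame({
    'playerID': ['playerA', 'playerA'],
    'yearID':   [2000, 2001],
    'H_x':      [5, 150],
    'AB_x':     [10, 500],
})

# Career BA from cumulative totals (hits and at bats summed first, then divided)
totals = toy.groupby('playerID')[['H_x', 'AB_x']].cumsum()
toy['career_BA'] = (totals['H_x'] / totals['AB_x']).round(3)

# Naive alternative: running mean of the per-season averages
toy['season_BA'] = toy['H_x'] / toy['AB_x']
toy['mean_of_season_BA'] = (
    toy.groupby('playerID')['season_BA']
       .expanding().mean()
       .reset_index(level=0, drop=True)   # drop the group level to realign with toy
       .round(3)
)

print(toy[['yearID', 'career_BA', 'mean_of_season_BA']])
```

The running mean gives the ten-at-bat season the same weight as the full season, which is why the totals-based rate is the one that matches published career figures.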

Later in the preprocessing stage, we will be narrowing down the time span of our dataset. To ensure we do not lose valuable data, especially for players who debuted before our selected time span, it's crucial to perform this step now. This approach ensures that we retain all relevant information for our analysis and model training.
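
The effect of this ordering can be seen on a hypothetical player whose career starts before the cutoff. The sketch also sorts the rows chronologically first, because `cumsum` accumulates in row order; whether the source file is already sorted this way is an assumption worth verifying.

```python
import pandas as pd

# Hypothetical player with seasons on both sides of a 1985 cutoff
toy = pd.DataFrame({
    'playerID': ['playerA'] * 3,
    'yearID':   [1985, 1983, 1984],   # deliberately out of order
    'H_x':      [120, 100, 110],
})

# cumsum follows row order, so sort chronologically within each player
toy = toy.sort_values(['playerID', 'yearID']).reset_index(drop=True)

# Option 1: cumulate on the full history, then filter
full = toy.copy()
full['career_H_x'] = full.groupby('playerID')['H_x'].cumsum()
cumulate_then_filter = full[full['yearID'] >= 1985]

# Option 2: filter first, then cumulate (pre-1985 hits are lost)
late = toy[toy['yearID'] >= 1985].copy()
late['career_H_x'] = late.groupby('playerID')['H_x'].cumsum()

print(cumulate_then_filter['career_H_x'].tolist())  # [330]
print(late['career_H_x'].tolist())                  # [120]
```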


In [83]:
# Add cumulative columns

# Empty list to store the cumulative columns names
cumulative_sum_cols = []
cumulative_mean_cols = []

# Function to add cumulative columns
def add_cumulative_sum(df, list_of_cols):
    for col in list_of_cols:
        new_col_name = 'career_' + col
        df[new_col_name] = df.groupby(['playerID'])[col].cumsum()
        if col != 'WAR':
            df[new_col_name] = df[new_col_name].fillna(0).astype(int)
        cumulative_sum_cols.append(new_col_name)
        
# Add cumulative sum columns
columns_to_sum = ['G_x', 'PA', 'AB_x', 'R_x', 'H_x', '1B', '2B_x', '3B_x', 'HR_x', 'RBI', 'SB_x', 'CS_x', 'BB_x', 'SO_x', 'IBB', 'HBP_x', 'SH', 'SF_x', 'GIDP', 'TB', 'WAR']
add_cumulative_sum(df, columns_to_sum)

# Add career rate columns (Batting Average, On Base Percentage, Slugging Percentage, On Base Plus Slugging, BABIP)
# Compute each player's running totals once and reuse them for every rate
rate_inputs = ['H_x', 'AB_x', 'BB_x', 'HBP_x', 'SF_x', 'SO_x', '1B', '2B_x', '3B_x', 'HR_x']
career_totals = df.groupby('playerID')[rate_inputs].cumsum()

# Batting Average (BA)
df['career_BA'] = round(career_totals['H_x'] / career_totals['AB_x'], 3)
cumulative_mean_cols.append('career_BA')

# On Base Percentage (OBP)
career_on_base = career_totals['H_x'] + career_totals['BB_x'] + career_totals['HBP_x']
df['career_OBP'] = round(career_on_base / (career_totals['AB_x'] + career_totals['BB_x'] + career_totals['HBP_x'] + career_totals['SF_x']), 3)
cumulative_mean_cols.append('career_OBP')

# Slugging Percentage (SLG)
career_total_bases = career_totals['1B'] + 2*career_totals['2B_x'] + 3*career_totals['3B_x'] + 4*career_totals['HR_x']
df['career_SLG'] = round(career_total_bases / career_totals['AB_x'], 3)
cumulative_mean_cols.append('career_SLG')

# On Base Plus Slugging (OPS)
df['career_OPS'] = round(df['career_OBP'] + df['career_SLG'], 3)
cumulative_mean_cols.append('career_OPS')

# Batting Average on Balls in Play (BABIP)
df['career_BABIP'] = round((career_totals['H_x'] - career_totals['HR_x']) / (career_totals['AB_x'] - career_totals['SO_x'] - career_totals['HR_x'] + career_totals['SF_x']), 3)
cumulative_mean_cols.append('career_BABIP')

##### Sanity check

https://www.baseball-reference.com/players/j/jeterde01.shtml

In [84]:
# Sanity check, Derek Jeter's new cumulative sum columns
df[df['playerID'] == 'jeterde01'][['playerID', 'yearID', 'career_G_x', 'career_PA', 'career_AB_x', 'career_R_x', 'career_H_x', 'career_1B', 'career_2B_x', 'career_3B_x', 'career_HR_x', 'career_RBI', 'career_SB_x', 'career_CS_x', 'career_BB_x', 'career_SO_x', 'career_IBB', 'career_HBP_x', 'career_SH', 'career_SF_x', 'career_GIDP', 'career_TB', 'career_WAR']]


Unnamed: 0,playerID,yearID,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR
70090,jeterde01,1995.0,15,51,48,5,12,7,4,1,0,7,0,0,3,11,0,0,0,0,0,18,-0.34
71506,jeterde01,1996.0,172,705,630,109,195,149,29,7,10,85,14,7,51,113,1,9,6,9,13,268,2.95
72829,jeterde01,1997.0,331,1453,1284,225,385,291,60,14,20,155,37,19,125,238,1,19,14,11,27,533,7.91
74215,jeterde01,1998.0,480,2147,1910,352,588,442,85,22,39,239,67,25,182,357,2,24,17,14,40,834,15.44
75652,jeterde01,1999.0,638,2886,2537,486,807,591,122,31,63,341,86,33,273,473,7,36,20,20,52,1180,23.44
77092,jeterde01,2000.0,786,3565,3130,605,1008,742,153,35,78,414,108,37,341,572,11,48,23,23,66,1465,28.01
78585,jeterde01,2001.0,936,4251,3744,715,1199,874,188,38,99,488,135,40,397,671,14,58,28,24,79,1760,33.2
80004,jeterde01,2002.0,1093,4981,4388,839,1390,1021,214,38,117,563,167,43,470,785,16,65,31,27,93,2031,36.87
81488,jeterde01,2003.0,1212,5523,4870,926,1546,1139,239,41,127,615,178,48,513,873,18,78,34,28,103,2248,40.44
83009,jeterde01,2004.0,1366,6244,5513,1037,1734,1259,283,42,150,693,201,52,559,972,19,92,50,30,122,2551,44.68


In [85]:
# Derek Jeter's career average stats
df[df['name_common'] == 'Derek Jeter'][['playerID', 'yearID', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS', 'career_BABIP']]

Unnamed: 0,playerID,yearID,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
70090,jeterde01,1995.0,0.25,0.294,0.375,0.669,0.324
71506,jeterde01,1996.0,0.31,0.365,0.425,0.79,0.359
72829,jeterde01,1997.0,0.3,0.368,0.415,0.783,0.352
74215,jeterde01,1998.0,0.308,0.373,0.437,0.81,0.359
75652,jeterde01,1999.0,0.318,0.389,0.465,0.854,0.368
77092,jeterde01,2000.0,0.322,0.394,0.468,0.862,0.372
78585,jeterde01,2001.0,0.32,0.392,0.47,0.862,0.367
80004,jeterde01,2002.0,0.317,0.389,0.463,0.852,0.362
81488,jeterde01,2003.0,0.317,0.389,0.462,0.851,0.364
83009,jeterde01,2004.0,0.315,0.385,0.463,0.848,0.358


All the features match the ones on the website, so we can proceed.

In [86]:
df.shape

(112149, 143)

In [87]:
# Save a copy of the df
df.to_csv('baseball_pre_cumulative.csv', index=False)

## 3. Data Cleaning <a id='data_cleaning'></a>
The next crucial step in our analysis is data cleaning. This process involves preparing our data for analysis and modeling by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

### 3.1 Setting a Time Span <a id='time_span'></a>

The dataset contains data from 1871 to 2022. However, as part of our data cleaning process, we made a strategic decision to limit our dataset to the seasons from 1985 through 2022, the latest season available.

There were several reasons behind this decision:
- __Statistical Consistency:__ The game of baseball has evolved significantly over the years, with changes in rules, equipment, and player training and conditioning. By focusing on the past four decades, we ensure a higher degree of consistency in the playing conditions, thus making our statistical analysis and predictions more reliable.
- __Impact of Free Agency:__ The introduction of free agency in 1976 has had a significant impact on the game. By starting our analysis from 1985, we focus on an era when player movement between teams became more common, which adds an interesting dynamic to player performance and team composition.
- __Data Availability:__ The data from 1985 to the present is more complete and accurate, which will help us avoid potential issues with missing or incorrect data.

In [88]:
# New dataframe starting from 1985
df_1985 = df.copy()
df_1985 = df_1985[df_1985['yearID'] >= 1985]

# Save a copy of df_1985
df_1985.to_csv('baseball_pre_cumulative_1985.csv', index=False)

In [89]:
# Shape 
print('Shape of the dataset: ', df_1985.shape)

Shape of the dataset:  (54352, 143)


### 3.2 Dropping columns (features) that are not relevant to the project <a id='drop_cols'></a>


In [90]:
# Empty list to store columns to be dropped
drop_cols = []


#### Personal Columns

Some columns are not relevant to our project and will be dropped. These columns are commonly related to personal information about the player, such as birth date, birth place, and death date. 

Personal columns are not directly related to the project's analysis of batting performance. By removing them from the dataset, we can eliminate unnecessary personal information and narrow the scope of the project to the pertinent variables.


In [91]:
# List of personal columns that are not relevant
personal_columns = ['birthMonth', 'birthDay', 'firstname', 'lastname', 'borndate']

# Append personal columns to drop_cols
drop_cols.extend(personal_columns)


#### Pitching Stats Columns

To keep the analysis focused on batting, we drop the pitching-related columns. Removing them lets us concentrate on the batting statistics, in line with the project's objective, and makes the exploration of batting performance more concise.

In [92]:
# Pitching columns
pitching_columns = ['ERA', 'ER', 'CG', 'SHO', 'SV', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'PPF']

# Append pitching columns to drop_cols
drop_cols.extend(pitching_columns)
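
Since `drop_cols` is built up across several cells, it can end up with repeated names or names that are not in the dataframe. The sketch below shows one way the accumulated list could eventually be applied. The toy frame stands in for `df_1985`, and `not_a_column` is a made-up label used only to show the behavior.

```python
import pandas as pd

# Toy frame standing in for df_1985 (values are illustrative)
toy = pd.DataFrame({'yearID': [1999], 'ERA': [4.74], 'ER': [761], 'HR_x': [0]})

# A drop list that picked up a duplicate and a name absent from the frame
drop_cols = ['ERA', 'ER', 'ERA', 'not_a_column']

# Deduplicate while preserving order, then report anything not present
unique_drop_cols = list(dict.fromkeys(drop_cols))
missing = [col for col in unique_drop_cols if col not in toy.columns]

# errors='ignore' skips labels that do not exist instead of raising KeyError
cleaned = toy.drop(columns=unique_drop_cols, errors='ignore')

print(missing)                   # ['not_a_column']
print(cleaned.columns.tolist())  # ['yearID', 'HR_x']
```

Printing the missing names before dropping makes typos in the list visible rather than silently ignored.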

### 3.3 Dropping repeated or similar columns <a id='drop_repeated_cols'></a>

Some columns provide the same information as other columns, but with different names. We go through each pair of columns we believe are repeated and decide which one to drop.

In [31]:
def compare_columns(dataframe, column1, column2, column3=None):
    '''
    Compare up to three columns in a DataFrame and print the number of equal and different values.
    Additionally, print the count of null values for each column to assist in identifying columns for potential dropping.
    Null values never compare as equal, so if the number of different values is close to or equals the number of
    null values, the differences observed are primarily due to null values.
    :param dataframe: pandas DataFrame
    :param column1: string
    :param column2: string
    :param column3: string (optional)
    '''
    columns = [column1, column2] if column3 is None else [column1, column2, column3]

    # True for rows where every compared column equals column1
    comparison = dataframe[columns[1:]].eq(dataframe[column1], axis=0).all(axis=1)
    n_different = (~comparison).sum()

    if n_different == 0:
        print("All values are equal")
        return

    print('Number of different values: ', n_different)
    # Summing the boolean mask avoids a KeyError when no values are equal
    print('Number of equal values: ', comparison.sum())
    # Null values for comparison
    for column in columns:
        print(f'Number of null values for {column}: ', dataframe[column].isnull().sum())
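
The docstring's remark about null values follows from the fact that `NaN` never compares equal to `NaN`. The sketch below, on illustrative data, contrasts a plain comparison with a null-aware one in which rows missing in both columns count as equal. The null-aware variant is an alternative shown for comparison, not what `compare_columns` does.

```python
import numpy as np
import pandas as pd

# Illustrative columns: one real mismatch, one row missing in both, one missing in b only
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [1.0, 9.0, np.nan, np.nan],
})

# Plain inequality: any row involving NaN counts as different
plain_diff = (toy['a'] != toy['b']).sum()

# Null-aware: rows where both sides are missing count as equal
both_null = toy['a'].isna() & toy['b'].isna()
null_aware_diff = (~((toy['a'] == toy['b']) | both_null)).sum()

print(plain_diff, null_aware_diff)  # 3 2
```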
        


##### Year columns (Season)

__`yearID`, `year_ID`, and `year`.__

In [94]:
# Columns that start with 'year'
year_columns = [col for col in df_1985.columns if col.startswith('year')]
year_columns

['year_ID', 'yearID', 'year']

In [35]:
# yearID and year_ID comparison
compare_columns(df_1985, 'yearID', 'year_ID')

All values are equal


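Both columns are identical, so we keep `yearID` and drop `year_ID`. We also drop `year`, the season attached to the salary record, since `yearID` already identifies the season.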

In [97]:
drop_cols.extend(['year_ID', 'year'])

##### Player identifiers

__`playerID` and `player_ID`.__

In [98]:
# Columns that could be for player identification
id_columns = [col for col in df_1985.columns if 'ID' in col or 'id' in col or 'player' in col or 'name' in col] 
id_columns

['name_common',
 'year_ID',
 'player_ID',
 'mlb_ID',
 'stint_ID',
 'team_ID',
 'playerID',
 'nameFirst',
 'nameLast',
 'retroID',
 'bbrefID',
 'yearID',
 'GIDP',
 'teamID',
 'lgID',
 'franchID',
 'divID',
 'name_x',
 'teamIDBR',
 'teamIDlahman45',
 'teamIDretro',
 'firstname',
 'lastname',
 'playerid',
 'mlbid',
 'name_y',
 'career_GIDP']

In [100]:
# playerID and player_ID comparison
compare_columns(df_1985, 'playerID', 'player_ID')

Number of different values:  709
Number of equal values:  53643
Number of null values for playerID:  0
Number of null values for player_ID:  0


Let's select the rows where `playerID` and `player_ID` are different and see what the difference is.

In [39]:
pd.set_option('display.max_rows', 400)
df_1985[df_1985['playerID'] != df_1985['player_ID']][['playerID', 'player_ID']]

Unnamed: 0,playerID,player_ID
57969,obriech01,o'brich01
58255,oconnja02,o'conja02
58539,oberrmi01,o'bermi01
58593,oneilpa01,o'neipa01
58599,obriepe03,o'bripe03
...,...,...
111507,rodrijo04,rodrijo06
111788,mannima01,mannima02
111789,mannima01,mannima02
111790,mannima01,mannima02


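The differences come partly from apostrophes in the `player_ID` column (for example, `o'brich01` versus `obriech01`) and partly from a different numeric suffix (for example, `rodrijo06` versus `rodrijo04`). We can drop `player_ID` and keep `playerID`, since it is cleaner.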
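
To see how the mismatches split between the two patterns, one option is to flag the rows whose `player_ID` contains an apostrophe. The sketch below uses four ID pairs copied from the output above; on the real data, `pairs` would be replaced by `df_1985`.

```python
import pandas as pd

# ID pairs copied from the mismatch output above
pairs = pd.DataFrame({
    'playerID':  ['obriech01', 'oneilpa01', 'rodrijo04', 'mannima01'],
    'player_ID': ["o'brich01", "o'neipa01", 'rodrijo06', 'mannima02'],
})

mismatch = pairs['playerID'] != pairs['player_ID']

# Literal match on the apostrophe (regex=False avoids pattern interpretation)
has_apostrophe = pairs['player_ID'].str.contains("'", regex=False)

summary = pd.Series({
    'apostrophe': (mismatch & has_apostrophe).sum(),
    'other': (mismatch & ~has_apostrophe).sum(),
})
print(summary)
```

Note that simply stripping the apostrophe would not reconcile the two columns: `o'brich01` becomes `obrich01`, which still differs from `obriech01`.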

In [101]:
# Append player_ID to drop_cols
drop_cols.append('player_ID')

##### More player identifiers

In [109]:
# Player identifiers
player_ids_columns = ['playerID', 'mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid']

df_1985[player_ids_columns].value_counts().sample(10)

playerID   mlb_ID    retroID   bbrefID    mlbid   
stricsc01  465988.0  stris001  stricsc02  232994.0     1
eckener01  425166.0  eckee001  eckener01  425166.0     2
coreyma02  407579.0  corem001  coreyma02  407579.0     3
frareca01  621350.0  frarc001  frareca01  621350.0     2
scherma01  453286.0  schem001  scherma01  453286.0    14
worleva01  474699.0  worlv001  worleva01  474699.0     8
gamboed01  543195.0  gambe001  gamboed01  543195.0     1
wardda01   132880.0  wardd002  wardda01   132880.0     9
sheffga01  122111.0  shefg001  sheffga01  122111.0    20
fonvich01  228582.0  fonvc001  fonvich01  114354.0     1
dtype: int64

We are interested in Baseball Reference IDs, since Baseball Reference is one of our main sources of information. `playerID` and `bbrefID` both follow the Baseball Reference format and match in almost every row of the sample above (one row differs only in its numeric suffix), so we will keep `playerID` and drop `bbrefID`. We will also drop the rest of the player identifiers.

In [110]:
drop_cols.extend(['mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid'])

`name_common` and `name_y` comparison

In [111]:
compare_columns(df_1985, 'name_common', 'name_y')

Number of different values:  11897
Number of equal values:  42455
Number of null values for name_common:  0
Number of null values for name_y:  58


We will drop `name_y` as it has more missing values.

In [112]:
drop_cols.append('name_y')

##### Age columns

In [113]:
compare_columns(df_1985, 'age_x', 'age_y')

Number of different values:  8918
Number of equal values:  45434
Number of null values for age_x:  0
Number of null values for age_y:  58


We will drop `age_y` as it has more missing values.

In [114]:
drop_cols.append('age_y')

# Change name of age_x to age
df_1985.rename(columns={'age_x': 'age'}, inplace=True)

##### Team identifiers

There are several columns that serve as identifiers for the team:
- `teamID`
- `team_ID`
- `teamIDBR`
- `teamIDlahman45`
- `teamIDretro`
- `franchID`
- `name_x`
- `TeamName`

Let's look at a dataframe including these columns:

In [119]:
# Team identifiers
team_ids_columns = ['teamID', 'team_ID', 'teamIDBR', 'teamIDlahman45', 'teamIDretro', 'franchID', 'name_x', 'TeamName']



We will rely on the identifiers provided by Baseball-Reference (https://www.baseball-reference.com/about/team_IDs.shtml), specifically `teamIDBR` and `franchID`. Baseball-Reference is one of our primary data sources, and using its identifiers keeps the analysis consistent. We will drop the other team identifiers except for `name_x`, the full team name, which will be useful for visualizations and analysis.
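
One way to inspect how the retained identifiers relate to each other is to list their unique combinations. The sketch below is illustrative only: the small frame mimics three team-seasons from the sample rows shown earlier (plus one repeated row), and on the real data the same calls would be applied to `df_1985[team_ids_columns]`.

```python
import pandas as pd

# Toy frame modelled on team values that appear in the sample rows above
teams = pd.DataFrame({
    "teamIDBR": ["ANA", "ANA", "LAA", "PIT"],
    "franchID": ["ANA", "ANA", "ANA", "PIT"],
    "name_x": [
        "Anaheim Angels",
        "Anaheim Angels",
        "Los Angeles Angels of Anaheim",
        "Pittsburgh Pirates",
    ],
})

# One row per distinct combination of identifiers
unique_teams = (
    teams.drop_duplicates()
    .sort_values(["franchID", "teamIDBR"])
    .reset_index(drop=True)
)

# How many distinct team names each franchise has carried
names_per_franchise = teams.groupby("franchID")["name_x"].nunique()

print(unique_teams)
print(names_per_franchise)
```

In this toy data, `franchID` stays constant for a franchise while `teamIDBR` and the team name change with a rename, which is one reason to keep both identifiers.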

In [120]:
# null values for the team identifiers columns
df_1985[team_ids_columns].isnull().sum()

teamID             0
team_ID            0
teamIDBR           0
teamIDlahman45     0
teamIDretro        0
franchID           0
name_x             0
TeamName          58
dtype: int64

In [121]:
# Append teamID, team_ID, teamIDlahman45, teamIDretro and TeamName to drop_cols
drop_cols.extend(['teamID', 'team_ID', 'teamIDlahman45', 'teamIDretro', 'TeamName'])

In [122]:
# Change name_x to teamName
df_1985.rename(columns={'name_x': 'teamName'}, inplace=True)

##### Salary columns

`salary_x` and `salary_y`

In [129]:
compare_columns(df_1985, 'salary_x', 'salary_y')

Number of different values:  26021
Number of equal values:  28331
Number of null values for salary_x:  0
Number of null values for salary_y:  58


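Note that `salary_y`, not `salary_x`, is the column with missing values (58 versus 0). We will nevertheless keep `salary_y`, which is the salary column used in the rest of this notebook, and drop `salary_x`.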

In [130]:
drop_cols.append('salary_x')

##### Summary of columns to drop

In [131]:
# Columns to drop
print('Columns to drop: ', drop_cols)
print('Number of columns to drop: ', len(drop_cols))

Columns to drop:  ['salary_x']
Number of columns to drop:  1


##### Dropping columns

In [160]:
# Drop columns
df_1985.drop(columns=drop_cols, inplace=True)
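
Because notebook cells are often re-run out of order, a drop list can contain duplicates or columns that are already gone. A defensive variant of the drop step, sketched here on a toy frame with made-up values, removes duplicate entries and skips missing columns.

```python
import pandas as pd

# Toy frame and drop list (values are illustrative, not from the dataset)
frame = pd.DataFrame({"playerID": ["a01"], "year": [2009], "salary_x": [1.0]})
to_drop = ["year", "salary_x", "year"]  # "year" is listed twice on purpose

# dict.fromkeys removes duplicates while preserving the original order
to_drop = list(dict.fromkeys(to_drop))

frame = frame.drop(columns=to_drop)

# Re-running the drop: the columns no longer exist, errors="ignore" skips them
frame = frame.drop(columns=to_drop, errors="ignore")

print(frame.columns.tolist())
```

Without `errors="ignore"`, the second `drop` call would raise a `KeyError`.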

In [161]:
# Shape after dropping columns
print('Shape of the dataset: ', df_1985.shape)

Shape of the dataset:  (27400, 110)


In [134]:
# Reset the list to store more columns to be dropped if needed
drop_cols = []

In [135]:
df_1985.sample(5)

Unnamed: 0,name_common,age,pitcher,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,playerid,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
91636,Brandon Wood,25.0,N,1,243.0,81,616.0,5.987794,-1.51,0.46,-1.72,woodbr01,1985.0,USA,Brandon,Wood,205.0,75.0,R,R,2007-04-26,2011-09-25,2010.0,81.0,226.0,20.0,33.0,2.0,0.0,4.0,14.0,1.0,0.0,6.0,71.0,0.0,2.0,8.0,1.0,3.0,AL,ANA,W,3.0,162.0,81.0,80.0,82.0,N,N,N,681.0,5488.0,1363.0,276.0,19.0,155.0,466.0,1070.0,104.0,52.0,52.0,37.0,702.0,113.0,116.0,0.981,Los Angeles Angels of Anaheim,Angel Stadium,3250816.0,98.0,LAA,37632.0,410000.0,691.0,24.0,3014572.0,400000.0,0.146,0.174,27.0,47.0,0.208,0.382,0.191,0.494,253,715,674,58,119,87,14,0,18,52,9,0,20,219,0,6,12,3,15,187,-4.27,0.177,0.206,0.277,0.483,0.23
81542,Eric DuBose,27.0,Y,1,0.0,0,73.7,0.0,0.0,0.0,0.0,duboser01,1976.0,USA,Eric,DuBose,230.0,75.0,L,L,2002-09-19,2006-04-07,2003.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AL,BAL,E,4.0,163.0,81.0,71.0,91.0,N,N,N,743.0,5665.0,1516.0,277.0,24.0,152.0,431.0,902.0,89.0,36.0,54.0,40.0,820.0,105.0,164.0,0.983,Baltimore Orioles,Oriole Park at Camden Yards,2454523.0,99.0,BAL,20125.0,300000.0,0.0,0.0,2372189.0,300000.0,,,0.0,0.0,,,,0.438,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,,,,,
78961,José Molina,26.0,N,1,42.0,15,103.0,116.044902,0.54,0.28,0.34,molinjo01,1975.0,P.R.,Jose,Molina,250.0,72.0,R,R,1999-09-06,2014-09-28,2001.0,15.0,37.0,8.0,10.0,3.0,0.0,2.0,4.0,0.0,0.0,3.0,8.0,0.0,0.0,2.0,0.0,2.0,AL,ANA,W,3.0,162.0,81.0,75.0,87.0,N,N,N,691.0,5551.0,1447.0,275.0,26.0,158.0,494.0,1001.0,116.0,52.0,77.0,53.0,730.0,103.0,142.0,0.983,Anaheim Angels,Edison International Field,2000919.0,101.0,ANA,686.0,200000.0,0.0,0.0,2138896.0,200000.0,0.27,0.325,5.0,19.0,0.514,0.839,0.296,0.463,25,63,56,11,15,9,4,0,2,5,0,0,5,12,1,0,2,0,2,25,0.56,0.268,0.328,0.446,0.774,0.31
107989,Erick Mejia,25.0,N,1,16.0,8,33.0,-44.038127,-0.43,-0.19,-0.24,mejiaer01,1994.0,D.R.,Erick,Mejia,195.0,71.0,B,R,2019-09-05,2020-09-20,2020.0,8.0,14.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,7.0,0.0,0.0,1.0,0.0,0.0,AL,KCR,C,4.0,60.0,30.0,26.0,34.0,N,N,N,248.0,1988.0,485.0,97.0,7.0,68.0,172.0,527.0,49.0,20.0,18.0,10.0,272.0,31.0,62.0,0.985,Kansas City Royals,Kauffman Stadium,0.0,102.0,KCR,181889.0,563500.0,0.0,0.0,4724815.0,563500.0,0.071,0.071,0.0,2.0,0.143,0.214,0.143,0.433,17,43,36,4,6,4,2,0,0,4,1,0,4,14,0,0,1,1,1,8,-0.45,0.167,0.244,0.222,0.466,0.261
99638,Brent Morel,28.0,N,1,7.0,3,14.0,93.537592,0.02,0.0,0.02,morelbr01,1987.0,USA,Brent,Morel,230.0,74.0,R,R,2010-09-07,2015-07-24,2015.0,3.0,7.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,NL,PIT,C,2.0,162.0,81.0,98.0,64.0,N,N,N,697.0,5631.0,1462.0,292.0,27.0,140.0,461.0,1322.0,98.0,45.0,89.0,41.0,596.0,122.0,177.0,0.981,Pittsburgh Pirates,PNC Park,2498596.0,99.0,PIT,20865.0,6500000.0,193.0,3.0,3952252.0,507500.0,0.286,0.286,1.0,3.0,0.429,0.715,0.5,0.605,733,2354,2170,235,484,359,83,3,39,188,41,20,134,454,4,9,35,6,50,690,1.32,0.223,0.27,0.318,0.588,0.264


### 2.4 Filtering and removing players with no relevance to the project (Removing rows) <a id='drop_rows'></a>

#### Pitchers

As we mentioned earlier in the project, we are going to focus on batting metrics. We will remove pitchers from the dataset.

In [136]:
# Drop rows with pitcher = Y; .copy() makes the result an independent DataFrame,
# which avoids pandas' SettingWithCopyWarning on the in-place drop below
df_1985 = df_1985[df_1985['pitcher'] != 'Y'].copy()

In [137]:
# Drop pitcher column
df_1985.drop('pitcher', axis=1, inplace=True)

# Shape
df_1985.shape

(27400, 111)

### 2.5 Dealing with Null Values <a id='null_values'></a>

In [139]:
pd.set_option('display.max_rows', 200)
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

DivWin           668
LgWin            668
WSWin            668
playerid           1
salary_y           1
leaguerank         1
teamrank           1
averagesalary      1
leagueminimum      1
BA                40
OBP               31
SLG               40
OPS               40
BABIP            122
career_BA         12
career_OBP        10
career_SLG        12
career_OPS        12
career_BABIP      58
dtype: int64

The curious pattern of having the same number of null values (668) in the columns `WSWin`, `DivWin`, and `LgWin` suggests a potential relationship among these variables. Let's look at the rows with null values in these columns.

In [140]:
pd.set_option('display.max_rows', None)
# Unique values per column for rows where DivWin, LgWin and WSWin are all null, sorted ascending
df_1985[df_1985['DivWin'].isnull() & df_1985['LgWin'].isnull() & df_1985['WSWin'].isnull()].nunique().sort_values()

WSWin              0
LgWin              0
DivWin             0
yearID             1
averagesalary      1
leagueminimum      1
stint_ID           2
lgID               2
throws             2
divID              3
bats               3
G                  5
Rank               5
FP                11
3B_x              12
BPF               12
Ghome             13
SF_x              13
height            14
HBP_x             14
birthCountry      15
W                 15
SH                15
L                 16
CS_x              16
3B_y              18
HBP_y             18
SF_y              18
E                 19
GIDP              20
IBB               20
HR_y              21
age               21
DP                21
CS_y              22
2B_y              22
birthYear         22
SB_y              23
W%                24
BB_y              25
RA                26
AB_y              26
SO_y              27
H_y               27
park              28
teamName          28
teamIDBR          28
R_y          

It's interesting that `yearID` only contains one unique value. This suggests that all the records in this subset belong to a single season.

In [197]:
df_1980[df_1980['DivWin'].isnull() & df_1980['LgWin'].isnull() & df_1980['WSWin'].isnull()][['yearID']].value_counts()

yearID
1994      506
dtype: int64

The null values present in the `DivWin`, `LgWin`, and `WSWin` columns correspond to the year 1994. This particular year holds significance in MLB history as the season came to an abrupt end due to a labor strike. As a result, no teams were able to compete in the playoffs, and the World Series was canceled.

Therefore, the presence of null values in `DivWin` (division winner), `LgWin` (league winner), and `WSWin` (World Series winner) for the year 1994 is expected. We will fill these null values with "N" to indicate that no team won the division, league, or World Series that year.
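
Before filling, it can be worth asserting that 1994 really is the only season with missing flags, so that an unexpected gap in another year is not silently overwritten. The following is a minimal sketch on toy data; the frame and its values are made up for illustration.

```python
import pandas as pd

# Toy team-season rows; only 1994 has missing playoff flags
teams = pd.DataFrame({
    "yearID": [1993, 1994, 1994, 1995],
    "DivWin": ["Y", None, None, "N"],
    "LgWin": ["N", None, None, "N"],
    "WSWin": ["N", None, None, "N"],
})
flags = ["DivWin", "LgWin", "WSWin"]

# Seasons in which at least one of the three flags is missing
null_years = set(teams.loc[teams[flags].isna().any(axis=1), "yearID"])
assert null_years == {1994}, f"unexpected seasons with missing flags: {null_years}"

# Safe to fill: every missing flag belongs to the strike-shortened season
teams[flags] = teams[flags].fillna("N")

print(teams)
```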

In [141]:
# Fill WSWin, LgWin and DivWin null values with 'N'
# (assigning the result back avoids in-place fillna on a column selection)
playoff_cols = ['WSWin', 'LgWin', 'DivWin']
df_1985[playoff_cols] = df_1985[playoff_cols].fillna('N')

In [142]:
# Null values
df_1980.isnull().sum()[df_1980.isnull().sum() > 0].sort_values(ascending=False)

BABIP            122
career_BABIP      58
OPS               40
SLG               40
BA                40
OBP               31
career_OPS        12
career_SLG        12
career_BA         12
career_OBP        10
b_day              1
new_id_y           1
lastname_2         1
firstname_2        1
b_year_3           1
firstname          1
lastname           1
b_year             1
name_y             1
borndate           1
leagueminimum      1
averagesalary      1
teamrank           1
leaguerank         1
age_y              1
TeamName           1
salary_y           1
year               1
mlbid              1
playerid           1
b_month            1
dtype: int64

In [143]:
df_1985.shape

(27400, 111)

##### `BABIP` Nulls

In [145]:
# Rows with BABIP null values
df_1985[df_1985['BABIP'].isnull()].sample(5)

Unnamed: 0,name_common,age,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,playerid,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
100614,Michael Choice,25.0,1,1.0,1,3.0,-100.0,-0.13,-0.1,-0.03,choicmi01,1989.0,USA,Michael,Choice,230.0,72.0,R,R,2013-09-02,2015-06-24,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,AL,TEX,W,1.0,162.0,81.0,88.0,74.0,Y,N,N,751.0,5511.0,1419.0,279.0,32.0,172.0,503.0,1233.0,101.0,39.0,76.0,54.0,733.0,119.0,169.0,0.981,Texas Rangers,Globe Life Park in Arlington,2491875.0,102.0,TEX,146339.0,507500.0,0.0,0.0,3952252.0,507500.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.543,96,300,272,22,51,34,7,1,9,36,1,0,22,76,0,3,0,3,11,87,-1.95,0.188,0.253,0.32,0.573,0.221
92090,JC Boscan,30.0,1,1.0,1,1.0,0.0,0.04,0.0,0.04,boscajc01,1979.0,Venezuela,JC,Boscan,215.0,74.0,R,R,2010-10-01,2013-09-29,2010.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,NL,ATL,E,2.0,162.0,81.0,91.0,71.0,N,N,N,738.0,5463.0,1411.0,312.0,25.0,139.0,634.0,1140.0,63.0,29.0,51.0,35.0,629.0,126.0,166.0,0.98,Atlanta Braves,Turner Field,2510119.0,98.0,ATL,4628.0,400000.0,0.0,0.0,3014572.0,400000.0,,1.0,0.0,0.0,,,,0.562,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0.04,,1.0,,,
89098,Jai Miller,23.0,1,1.0,1,2.0,-100.0,-0.03,0.0,-0.03,milleja04,1985.0,USA,Jai,Miller,200.0,75.0,R,R,2008-06-22,2011-09-28,2008.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,NL,FLA,E,3.0,161.0,81.0,84.0,77.0,N,N,N,770.0,5499.0,1397.0,302.0,28.0,208.0,543.0,1371.0,76.0,28.0,69.0,46.0,767.0,117.0,122.0,0.98,Florida Marlins,Dolphin Stadium,1335076.0,101.0,FLA,37287.0,390000.0,0.0,0.0,2925679.0,390000.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.522,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,-0.03,0.0,0.0,0.0,0.0,
96817,Freddy Guzman,32.0,1,0.0,1,9.0,0.0,0.0,-0.01,0.0,guzmafr01,1981.0,D.R.,Freddy,Guzman,165.0,70.0,B,R,2004-08-17,2013-09-18,2013.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AL,TBD,E,2.0,163.0,81.0,92.0,71.0,N,N,N,700.0,5538.0,1421.0,296.0,23.0,165.0,589.0,1171.0,73.0,38.0,36.0,55.0,646.0,59.0,147.0,0.99,Tampa Bay Rays,Tropicana Field,1510300.0,96.0,TBR,5544.0,7312500.0,137.0,2.0,3386212.0,490000.0,,,0.0,0.0,,,,0.564,75,124,114,19,24,19,3,0,2,9,14,6,5,21,0,3,0,2,0,33,0.58,0.211,0.258,0.289,0.547,0.237
77838,Pedro Swann,29.0,1,2.0,4,5.0,-100.0,-0.07,-0.02,-0.05,swannpe01,1970.0,USA,Pedro,Swann,195.0,72.0,L,R,2000-09-09,2003-09-27,2000.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,NL,ATL,E,1.0,162.0,81.0,95.0,67.0,Y,N,N,810.0,5489.0,1490.0,274.0,26.0,179.0,595.0,1010.0,148.0,56.0,59.0,45.0,714.0,129.0,138.0,0.979,Atlanta Braves,Turner Field,3234304.0,101.0,ATL,18648.0,175000.0,759.0,26.0,1895630.0,200000.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.586,4,2,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,-0.07,0.0,0.0,0.0,0.0,


The calculation for BABIP (Batting Average on Balls In Play) is:

BABIP = (H - HR) / (AB - SO - HR + SF)

The NaN values in `BABIP` are most likely the result of a zero denominator. This happens when a player has no at-bats (AB), or when every at-bat ended in a strikeout (SO) or a home run (HR) and there were no sacrifice flies (SF).

In both cases the numerator is zero as well, since every hit was a home run, so the expression becomes 0 / 0 and evaluates to NaN.

For this reason we will fill the BABIP NaN values with zero.
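
As a worked example of the formula, the sketch below recomputes BABIP for two stat lines copied from sample rows in this notebook: Chris Duffy's 2006 season and Luis Campusano's 2020 season. The frame layout and the use of `replace(0, np.nan)` are choices of this sketch, not necessarily how the column was originally built.

```python
import numpy as np
import pandas as pd

# Counting stats taken from two sample rows shown in this notebook
batting = pd.DataFrame(
    {
        "H": [80.0, 1.0],
        "HR": [2.0, 1.0],
        "AB": [314.0, 3.0],
        "SO": [71.0, 2.0],
        "SF": [1.0, 0.0],
    },
    index=["duffy_2006", "campusano_2020"],
)

numerator = batting["H"] - batting["HR"]
denominator = batting["AB"] - batting["SO"] - batting["HR"] + batting["SF"]

# Replace a zero denominator with NaN so the result is NaN rather than inf
babip = (numerator / denominator.replace(0, np.nan)).round(3)

# Same convention as the notebook: undefined BABIP becomes 0
babip_filled = babip.fillna(0)

print(babip)
print(babip_filled)
```

For Duffy the result is 78 / 242, which rounds to the 0.322 stored in the dataset. For Campusano, all three at-bats ended in a strikeout or home run, so the denominator is zero and the value is undefined before filling.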

In [147]:
# Fill BABIP null values with 0 (assign back instead of in-place fillna on a column)
df_1985['BABIP'] = df_1985['BABIP'].fillna(0)

In [148]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

playerid          1
salary_y          1
leaguerank        1
teamrank          1
averagesalary     1
leagueminimum     1
BA               40
OBP              31
SLG              40
OPS              40
career_BA        12
career_OBP       10
career_SLG       12
career_OPS       12
career_BABIP     58
dtype: int64

##### `BA`, `OBP`, `SLG`, and `OPS` Nulls

Batting Average (`BA`), On-Base Percentage (`OBP`), Slugging Percentage (`SLG`), and On-Base Plus Slugging (`OPS`) are ratios computed from a player's hitting statistics. `BA` and `SLG` divide by at-bats, and `OBP` divides by at-bats plus walks, hit-by-pitches, and sacrifice flies. A player with no at-bats (or, for `OBP`, none of those events) has a zero denominator, so the value is undefined and appears as NaN in the dataset. This is consistent with the counts above: 40 nulls for `BA`, `SLG`, and `OPS`, but only 31 for `OBP`.

For this reason we will fill the `BA`, `OBP`, `SLG`, and `OPS` NaN values with zero.
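
To make the zero-denominator cases concrete, the sketch below recomputes the four rates from the standard formulas for two stat lines copied from sample rows in this notebook: Chris Duffy's 2006 season, and JC Boscan's 2010 season (one walk and no at-bats). The helper `safe_div` is a hypothetical name introduced for this example.

```python
import numpy as np
import pandas as pd

# Counting stats taken from two sample rows shown in this notebook
rows = pd.DataFrame(
    {
        "AB": [314.0, 0.0],
        "H": [80.0, 0.0],
        "BB": [19.0, 1.0],
        "HBP": [10.0, 0.0],
        "SF": [1.0, 0.0],
        "TB": [106.0, 0.0],
    },
    index=["duffy_2006", "boscan_2010"],
)


def safe_div(num: pd.Series, den: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return num / den.replace(0, np.nan)


rates = pd.DataFrame({
    "BA": safe_div(rows["H"], rows["AB"]),
    "OBP": safe_div(
        rows["H"] + rows["BB"] + rows["HBP"],
        rows["AB"] + rows["BB"] + rows["HBP"] + rows["SF"],
    ),
    "SLG": safe_div(rows["TB"], rows["AB"]),
}).round(3)
rates["OPS"] = (rates["OBP"] + rates["SLG"]).round(3)

print(rates)
```

Boscan's line shows why `OBP` has fewer nulls than `BA`: with no at-bats his `BA` and `SLG` are undefined, but the walk gives `OBP` a non-zero denominator.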

In [149]:
# Fill BA, OBP, SLG, OPS null values with 0
# (assigning the result back avoids in-place fillna on a column selection)
rate_cols = ['BA', 'OBP', 'SLG', 'OPS']
df_1985[rate_cols] = df_1985[rate_cols].fillna(0)

In [150]:
# null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

playerid          1
salary_y          1
leaguerank        1
teamrank          1
averagesalary     1
leagueminimum     1
career_BA        12
career_OBP       10
career_SLG       12
career_OPS       12
career_BABIP     58
dtype: int64

##### Career Stats Nulls

In [151]:
# Rows with career_BABIP null values
df_1985[df_1985['career_BABIP'].isnull()].sample(5)

Unnamed: 0,name_common,age,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,playerid,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
103155,Dustin Fowler,22.0,1,0.0,1,0.7,0.0,0.0,0.0,0.0,fowledu01,1994.0,USA,Dustin,Fowler,195.0,72.0,L,L,2017-06-29,2021-04-21,2017.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AL,NYY,E,2.0,162.0,81.0,91.0,71.0,N,N,N,858.0,5594.0,1463.0,266.0,23.0,241.0,616.0,1386.0,90.0,22.0,64.0,56.0,660.0,95.0,102.0,0.984,New York Yankees,Yankee Stadium III,3146966.0,104.0,NYY,181161.0,535000.0,0.0,0.0,4097122.0,535000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.562,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,,,,,
85019,Kevin Reese,27.0,1,2.0,2,9.0,52.439024,0.03,0.0,0.03,reeseke01,1978.0,USA,Kevin,Reese,195.0,71.0,L,L,2005-06-26,2006-07-04,2005.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,AL,NYY,E,1.0,162.0,81.0,95.0,67.0,Y,N,N,886.0,5624.0,1552.0,259.0,16.0,229.0,637.0,989.0,84.0,27.0,73.0,43.0,789.0,95.0,151.0,0.984,New York Yankees,Yankee Stadium II,4090696.0,98.0,NYY,21019.0,300000.0,752.0,27.0,2476589.0,316000.0,0.0,0.5,0.0,0.0,0.0,0.5,0.0,0.586,2,2,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0.03,0.0,0.5,0.0,0.5,
108502,Luis Campusano,21.0,1,4.0,1,9.0,385.417723,0.1,-0.01,0.1,campulu01,1998.0,USA,Luis,Campusano,232.0,71.0,R,R,2020-09-04,2023-04-14,2020.0,1.0,3.0,2.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,0.0,NL,SDP,W,2.0,60.0,32.0,37.0,23.0,N,N,N,325.0,1972.0,506.0,103.0,12.0,95.0,204.0,479.0,55.0,13.0,28.0,14.0,241.0,32.0,46.0,0.985,San Diego Padres,Petco Park,0.0,94.0,SDP,183260.0,663500.0,541.0,19.0,4724815.0,563500.0,0.333,0.5,0.0,4.0,1.333,1.833,0.0,0.617,2,8,6,4,2,0,0,0,2,2,0,0,0,4,0,2,0,0,0,8,0.2,0.333,0.5,1.333,1.833,
78773,Jason Smith,23.0,1,1.0,2,1.0,-100.0,-0.04,-0.01,-0.03,smithja05,1977.0,USA,Jason,Smith,195.0,75.0,L,R,2001-06-17,2009-05-21,2001.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,NL,CHC,C,3.0,162.0,81.0,88.0,74.0,N,N,N,777.0,5406.0,1409.0,268.0,32.0,194.0,577.0,1077.0,67.0,36.0,66.0,53.0,701.0,109.0,113.0,0.982,Chicago Cubs,Wrigley Field,2779465.0,95.0,CHC,7615.0,200000.0,0.0,0.0,2138896.0,200000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.543,2,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,-0.04,0.0,0.0,0.0,0.0,
98448,Francisco Peña,24.0,1,0.0,1,1.0,0.0,0.12,0.12,0.0,penafr01,1989.0,D.R.,Francisco,Pena,230.0,74.0,R,R,2014-05-20,2018-09-30,2014.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AL,KCR,C,2.0,162.0,81.0,89.0,73.0,N,Y,N,651.0,5545.0,1456.0,286.0,29.0,95.0,380.0,985.0,153.0,36.0,53.0,47.0,624.0,104.0,122.0,0.983,Kansas City Royals,Kauffman Stadium,1956482.0,105.0,KCR,141758.0,500000.0,0.0,0.0,3818923.0,500000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.549,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.12,,,,,


NaN values for career statistics such as Batting Average on Balls in Play (`career_BABIP`), On-Base Plus Slugging (`career_OPS`), Slugging Percentage (`career_SLG`), On-Base Percentage (`career_OBP`), and Batting Average (`career_BA`) appear when the player had no career at-bats up to that point. For `career_BABIP`, they also appear when every career at-bat ended in a strikeout or a home run.
This is particularly common for players in their first season, who have had little opportunity to accumulate statistics. In these cases the denominators in the formulas are zero, leading to undefined values.

As before, we will fill these career-stat NaN values with zero.

In [152]:
# Fill career_BABIP, career_BA, career_OBP, career_SLG, career_OPS null values with 0
# (assigning the result back avoids in-place fillna on a column selection)
career_rate_cols = ['career_BABIP', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS']
df_1985[career_rate_cols] = df_1985[career_rate_cols].fillna(0)

##### Remaining Nulls

In [162]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

leaguerank       1
teamrank         1
averagesalary    1
leagueminimum    1
dtype: int64

We only have one row (player) remaining with null values. We will take a look at it and decide what to do.

In [163]:
# Rows with teamrank null values
df_1985[df_1985['teamrank'].isnull()]

Unnamed: 0,name_common,age,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
91460,Yadier Molina,26.0,1,544.0,140,1186.7,100.364778,3.15,1.88,2.1,molinya01,1982.0,P.R.,Yadier,Molina,225.0,71.0,R,R,2004-06-03,2022-10-05,2009.0,140.0,481.0,45.0,141.0,23.0,1.0,6.0,54.0,9.0,3.0,50.0,39.0,2.0,6.0,6.0,1.0,27.0,NL,STL,C,1.0,162.0,81.0,91.0,71.0,Y,N,N,730.0,5465.0,1436.0,294.0,29.0,160.0,528.0,1041.0,75.0,31.0,61.0,43.0,640.0,96.0,167.0,0.985,St. Louis Cardinals,Busch Stadium III,3343252.0,98.0,STL,3312500.0,,,,,0.293,0.366,111.0,184.0,0.383,0.749,0.309,0.562,669,2458,2215,189,596,456,103,2,35,263,13,12,178,202,19,20,29,16,95,808,8.22,0.269,0.327,0.365,0.692,0.281


The remaining nulls belong to a single player, Yadier Molina, in the 2009 season. We will look up his salary information in Baseball Reference or another trusted source and fill the null values with it.

- https://www.baseball-reference.com/players/m/molinya01.shtml
- https://www.baseball-reference.com/bullpen/Minimum_salary
- https://www.baseball-reference.com/teams/STL/2009.shtml


In [164]:
# Row selector for Yadier Molina's 2009 season
molina_2009 = (df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009)

# Fill Yadier Molina's salary_y for 2009
df_1985.loc[molina_2009, 'salary_y'] = 3312500

# Fill Yadier Molina's teamrank for 2009
df_1985.loc[molina_2009, 'teamrank'] = 9

# Fill Yadier Molina's average salary for 2009
df_1985.loc[molina_2009, 'averagesalary'] = 2996106

# Fill Yadier Molina's leagueminimum for 2009
df_1985.loc[molina_2009, 'leagueminimum'] = 400000

# Fill Yadier Molina's leaguerank for 2009
df_1985.loc[molina_2009, 'leaguerank'] = 100



In [167]:
# null values
df_1985.isnull().sum().sum()

0

## 4. Salary Adjustments <a id='salary_adjustments'></a>
We will adjust player salaries for inflation. Because the value of money changes over time, comparing salaries from different years without adjustment can be misleading. Each salary is multiplied by the buying-power factor for its season, stored in the `buying_power` dictionary below.


In [176]:
buying_power = {1985: 2.89,
                   1986: 2.78,
                   1987: 2.74,
                   1988: 2.64,
                   1989: 2.52,
                   1990: 2.39,
                   1991: 2.27,
                   1992: 2.21,
                   1993: 2.14,
                   1994: 2.09,
                   1995: 2.03,
                   1996: 1.98,
                   1997: 1.92,
                   1998: 1.89,
                   1999: 1.86,
                   2000: 1.81,
                   2001: 1.76,
                   2002: 1.72,
                   2003: 1.68,
                   2004: 1.65,
                   2005: 1.61,
                   2006: 1.54,
                   2007: 1.51,
                   2008: 1.45,
                   2009: 1.45,
                   2010: 1.41,
                   2011: 1.39,
                   2012: 1.35,
                   2013: 1.32,
                   2014: 1.30,
                   2015: 1.31,
                   2016: 1.29,
                   2017: 1.26,
                   2018: 1.23,
                   2019: 1.21,
                   2020: 1.18,
                   2021: 1.17,
                   2022: 1.09,}

In [177]:
# Adjusted salary column: salary multiplied by the buying-power factor for its season
df_1985['adjusted_salary'] = (df_1985['salary_y'] * df_1985['yearID'].map(buying_power)).round(0)

In [180]:
# Sanity check
df_1985[['yearID', 'salary_y', 'adjusted_salary']].sample(10)

Unnamed: 0,yearID,salary_y,adjusted_salary
64718,1991.0,300000.0,681000.0
106951,2019.0,2675000.0,3236750.0
81789,2003.0,11666667.0,19600001.0
96042,2012.0,525000.0,708750.0
94984,2012.0,480000.0,648000.0
101478,2016.0,507500.0,654675.0
79112,2001.0,2000000.0,3520000.0
100421,2015.0,1500000.0,1965000.0
94796,2012.0,480000.0,648000.0
101969,2016.0,507500.0,654675.0
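
The adjustment is a per-row multiplication, so it can be spot-checked by hand: for the 1991 row above, 300,000 × 2.27 = 681,000. The sketch below reproduces three rows of the sanity check and adds one hypothetical row for a season outside the factor table (1984, with a made-up salary) to show that `map` leaves such rows as NaN rather than raising an error.

```python
import pandas as pd

# Subset of the buying_power factors defined above
multipliers = {1991: 2.27, 2003: 1.68, 2019: 1.21}

salaries = pd.DataFrame({
    "yearID": [1991.0, 2003.0, 2019.0, 1984.0],  # 1984 is a hypothetical unmapped season
    "salary_y": [300000.0, 11666667.0, 2675000.0, 100000.0],
})

# Look up the factor for each season; seasons missing from the dict become NaN
factor = salaries["yearID"].map(multipliers)
unmapped_years = sorted(salaries.loc[factor.isna(), "yearID"].unique())

salaries["adjusted_salary"] = (salaries["salary_y"] * factor).round(0)

print(salaries)
print("Seasons without a factor:", unmapped_years)
```

Checking `unmapped_years` on the real data would confirm that every season in `df_1985` has a factor, which the final null count of the saved dataset depends on.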


In [181]:
# Save a copy of the df_1985
df_1985.to_csv('baseballsalaries_1985_preclean.csv', index=False)

Lastly, we will change the column order to make it more readable.

In [191]:
column_order = [
    'playerID', 'name_common', 'nameFirst', 'nameLast', 'birthYear', 'birthCountry', 'debut', 'finalGame',
    'age', 'weight', 'height', 'bats', 'throws',
    'yearID', 'stint_ID', 'G_x', 'G_y', 'PA', 'AB_x', 'R_x', 'H_x', '2B_x', '3B_x', 'HR_x', 'RBI', 'SB_x', 'CS_x', 'BB_x', 'SO_x', 'IBB', 'HBP_x', 'SH', 'SF_x', 'GIDP', '1B', 'BA', 'OBP', 'SLG', 'OPS', 'BABIP', 'TB', 'W%', 'Inn', 'OPS_plus', 'WAR', 'WAR_def', 'WAR_off',
    'career_G_x', 'career_PA', 'career_AB_x', 'career_R_x', 'career_H_x', 'career_1B', 'career_2B_x', 'career_3B_x', 'career_HR_x', 'career_RBI', 'career_SB_x', 'career_CS_x', 'career_BB_x', 'career_SO_x', 'career_IBB', 'career_HBP_x', 'career_SH', 'career_SF_x', 'career_GIDP', 'career_TB', 'career_WAR', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS', 'career_BABIP',
    'teamName', 'park', 'attendance', 'teamIDBR', 'lgID', 'franchID', 'divID', 'Rank', 'G', 'Ghome', 'W', 'L', 'DivWin', 'LgWin', 'WSWin', 'R_y', 'AB_y', 'H_y', '2B_y', '3B_y', 'HR_y', 'BB_y', 'SO_y', 'SB_y', 'CS_y', 'HBP_y', 'SF_y', 'RA', 'E', 'DP', 'FP', 'BPF',
    'salary_y', 'leaguerank', 'teamrank', 'averagesalary', 'leagueminimum', 'adjusted_salary'
]

In [192]:
# Sanity check
len(column_order) == len(df_1985.columns)

True
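
Comparing lengths confirms the count but would not catch a duplicated or misspelled name that happens to keep the length equal. A stricter check, sketched below with a hypothetical helper `check_column_order` and toy column names, reports duplicated, missing, and unknown columns separately.

```python
import pandas as pd


def check_column_order(df: pd.DataFrame, order: list) -> dict:
    """Hypothetical helper: compare a proposed column order with df's columns."""
    return {
        # Names listed more than once in the proposed order
        "duplicated": sorted({c for c in order if order.count(c) > 1}),
        # Columns of df that the proposed order leaves out
        "missing": [c for c in df.columns if c not in order],
        # Names in the proposed order that df does not have
        "unknown": [c for c in order if c not in df.columns],
    }


frame = pd.DataFrame(columns=["playerID", "yearID", "salary_y"])

ok = check_column_order(frame, ["playerID", "yearID", "salary_y"])
# Same length as frame.columns, so a length-only check would pass
bad = check_column_order(frame, ["playerID", "playerID", "salary"])

print(ok)
print(bad)
```

An order is safe to apply when all three lists are empty.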

In [195]:
# Change column order
df_1985 = df_1985[column_order]

In [196]:
# sample
df_1985.sample(5)

Unnamed: 0,playerID,name_common,nameFirst,nameLast,birthYear,birthCountry,debut,finalGame,age,weight,height,bats,throws,yearID,stint_ID,G_x,G_y,PA,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,1B,BA,OBP,SLG,OPS,BABIP,TB,W%,Inn,OPS_plus,WAR,WAR_def,WAR_off,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,teamName,park,attendance,teamIDBR,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,BPF,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,adjusted_salary
111183,mccorch01,Chas McCormick,Chas,McCormick,1995.0,USA,2021-04-01,2023-06-23,27.0,208.0,72.0,R,L,2022.0,1,119,119.0,407.0,359.0,47.0,88.0,12.0,2.0,14.0,44.0,4.0,3.0,46.0,106.0,0.0,1.0,0.0,1.0,5.0,60.0,0.245,0.332,0.407,0.739,0.308,146.0,0.654,923.7,109.226877,1.29,-0.43,1.55,227,727,643,94,161,107,24,2,28,94,8,5,71,210,0,5,0,8,10,273,3.62,0.25,0.326,0.425,0.751,0.322,Houston Astros,Minute Maid Park,2688998.0,HOU,AL,HOU,W,1.0,162.0,81.0,106.0,56.0,Y,Y,Y,737.0,5409.0,1341.0,284.0,13.0,214.0,528.0,1179.0,83.0,22.0,60.0,42.0,518.0,72.0,122.0,0.987,101.0,700000.0,0.0,0.0,4317736.0,700000.0,763000.0
85850,duffych01,Chris Duffy,Chris,Duffy,1980.0,USA,2005-04-07,2009-05-14,26.0,185.0,70.0,L,L,2006.0,1,84,84.0,348.0,314.0,46.0,80.0,14.0,3.0,2.0,18.0,26.0,1.0,19.0,71.0,1.0,10.0,4.0,1.0,1.0,61.0,0.255,0.317,0.338,0.655,0.322,106.0,0.414,671.7,70.149939,0.38,-0.15,0.7,246,968,880,136,246,194,36,10,6,54,56,6,52,186,2,24,10,2,4,320,3.48,0.28,0.336,0.364,0.7,0.348,Pittsburgh Pirates,PNC Park,1861549.0,PIT,NL,PIT,C,5.0,162.0,81.0,67.0,95.0,N,N,N,691.0,5558.0,1462.0,286.0,17.0,141.0,459.0,1200.0,68.0,23.0,89.0,49.0,797.0,104.0,168.0,0.983,98.0,327000.0,0.0,0.0,2699292.0,327000.0,503580.0
98123,ruizca01,Carlos Ruiz,Carlos,Ruiz,1979.0,Panama,2006-05-06,2017-09-30,35.0,215.0,70.0,R,R,2014.0,1,110,110.0,445.0,381.0,43.0,96.0,25.0,1.0,6.0,31.0,4.0,2.0,46.0,60.0,1.0,12.0,1.0,5.0,11.0,64.0,0.252,0.347,0.37,0.717,0.281,141.0,0.451,960.0,101.099861,3.13,1.52,2.41,935,3371,2929,347,795,532,194,6,63,367,20,8,334,385,59,65,23,20,85,1190,20.85,0.271,0.357,0.406,0.763,0.293,Philadelphia Phillies,Citizens Bank Park,2423852.0,PHI,NL,PHI,E,5.0,162.0,81.0,73.0,89.0,N,N,N,619.0,5603.0,1356.0,251.0,27.0,125.0,443.0,1306.0,109.0,26.0,55.0,37.0,687.0,83.0,133.0,0.987,100.0,8500000.0,130.0,8.0,3818923.0,500000.0,11050000.0
83212,duboija01,Jason Dubois,Jason,Dubois,1979.0,USA,2004-05-19,2005-08-19,25.0,220.0,77.0,R,R,2004.0,1,20,20.0,25.0,23.0,2.0,5.0,0.0,1.0,1.0,5.0,0.0,0.0,1.0,7.0,0.0,0.0,0.0,1.0,0.0,3.0,0.217,0.24,0.435,0.675,0.25,10.0,0.549,23.3,68.42569,-0.06,-0.13,0.05,20,25,23,2,5,3,0,1,1,5,0,0,1,7,0,0,0,1,0,10,-0.06,0.217,0.24,0.435,0.675,0.25,Chicago Cubs,Wrigley Field,3170154.0,CHC,NL,CHC,C,3.0,162.0,82.0,89.0,73.0,N,N,N,789.0,5628.0,1508.0,308.0,29.0,235.0,489.0,1080.0,66.0,28.0,38.0,48.0,665.0,86.0,126.0,0.986,103.0,300000.0,0.0,0.0,2313535.0,300000.0,495000.0
78626,brownem01,Emil Brown,Emil,Brown,1974.0,USA,1997-04-03,2009-06-06,26.0,195.0,74.0,R,R,2001.0,2,74,74.0,155.0,137.0,21.0,26.0,4.0,1.0,3.0,13.0,12.0,4.0,16.0,49.0,1.0,2.0,0.0,0.0,2.0,18.0,0.19,0.284,0.299,0.583,0.271,41.0,0.383,323.7,17.942136,0.03,0.14,-0.1,209,457,404,52,81,58,13,2,8,38,20,6,38,129,2,13,1,1,6,122,-1.18,0.2,0.289,0.302,0.591,0.272,Pittsburgh Pirates,PNC Park,2464870.0,PIT,NL,PIT,C,6.0,162.0,81.0,62.0,100.0,N,N,N,657.0,5398.0,1333.0,256.0,25.0,161.0,467.0,1106.0,93.0,73.0,67.0,35.0,858.0,133.0,168.0,0.978,103.0,265000.0,620.0,24.0,2138896.0,200000.0,466400.0


In [197]:
# Save a copy of the df_1985
df_1985.to_csv('baseballsalaries_1985.csv', index=False)

## 5. Next Steps <a id='next_steps'></a>

Now that we have a clean dataset, we are ready to move on to the next step of our project: 

- __Exploratory Data Analysis (EDA):__ Perform an in-depth exploratory data analysis to uncover insights, patterns, and relationships within the preprocessed data. Utilize visualizations, statistical analysis, and other techniques to understand the distribution, correlations, and trends present in the data. This stage will provide valuable insights that can guide further analysis and modeling decisions.

- __Feature Engineering:__ Engage in feature engineering to enhance the dataset for modeling purposes. This includes selecting relevant features, transforming existing features, and potentially creating new features based on domain knowledge and insights gained from the EDA. Iteratively refine the feature set to improve model performance and align it with the project's objectives.

Please note that the current state of the preprocessed data is not the final form. More features can be added or removed during the feature engineering phase to further optimize our models and increase their predictive power.

