## Predicting MLB Player Salaries: A Batting Performance Analysis
---


# Data Preprocessing

In this notebook, we will perform data processing tasks to simplify the dataframe, drop columns with no value to our project, reduce the time span, and perform further cleaning and wrangling.

By the end, I expect to have a dataframe that is ready for analysis and modeling.

---

## Table of Contents
1. [Data Exploration](#data_exploration)
    - [1.1. Data Dictionary](#data_dict)
    - [1.2. Description and overview of the data](#overview)
2. [Absence of Key Features](#absence_key_features)
    - [2.1. Adding Key Features](#adding_key_features)
    - [2.2. Adding Cumulative Features](#adding_cumulative_features)
3. [Data Cleaning](#data_cleaning)
    - [3.1. Setting a time span](#time_span)
    - [3.2. Droping columns (Features) that are no relevant to the project](#drop_cols)
    - [3.3. Droping repeated or similar columns](#drop_repeated_cols)
    - 2.4. [Filtering and removing players with no relevance to the project (Removing rows)](#drop_rows)
    - 2.5. [Dealing with Null Values](#null_values)
4. [Salary Adjustments](#salary_adjustments)
4. [Next Steps](#next_steps)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


---

## 1. Data Exploration <a id='data_exploration'></a>

To begin the analysis, we will explore the dataset to understand its structure and content.


In [2]:
# Load Dataset
raw_df = pd.read_csv("pre_datasets/war_batting_people_teams_salaries_pre.csv")

  raw_df = pd.read_csv("pre_datasets/war_batting_people_teams_salaries_pre.csv")


In [3]:
# Shape of the dataset
print('Shape of the dataset: ', raw_df.shape)
print('Number of rows (Players): ', raw_df.shape[0])
print('Number of columns (Features): ', raw_df.shape[1])

Shape of the dataset:  (102624, 105)
Number of rows (Players):  102624
Number of columns (Features):  105


The dataset has a shape of **(112149, 105)**, indicating __112,149 rows (Players) and 109 columns (Features)__. 

Here are the first five rows of the dataset:

In [4]:
# First 5 rows of the dataset
pd.set_option('display.max_columns', None)

raw_df.sample(5)

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y
48559,Lowell Palmer,26.0,1974,palmelo01,Y,120197.0,1,23.0,23,73.0,0.0,-26.523827,-0.12,-0.01,-0.12,SDP,palmelo01,1947.0,USA,190.0,73.0,R,R,1969-06-21,1974-09-16,palml101,palmelo01,1974.0,23.0,23.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,SDN,NL,SDP,W,6.0,162.0,81.0,60.0,102.0,N,N,N,541.0,5415.0,1239.0,196.0,27.0,99.0,564.0,900.0,85.0,45.0,20.0,35.0,830.0,736.0,4.58,25.0,7.0,19.0,4337.0,1536.0,124.0,715.0,855.0,170.0,126.0,0.973,San Diego Padres,Jack Murphy Stadium,1075399.0,95.0,98.0,SDP,SDN,SDN,,,,,,,,,,,,,,
6527,Clark Griffith,28.0,1898,griffcl01,Y,115150.0,1,137.0,38,0.0,0.0,35.809057,-0.17,0.0,-0.15,CHC,griffcl01,1869.0,USA,156.0,66.0,R,R,1891-04-11,1914-10-07,grifc101,griffcl01,1898.0,38.0,122.0,15.0,20.0,2.0,3.0,0.0,15.0,1.0,0.0,13.0,6.0,0.0,0.0,2.0,0.0,0.0,CHN,NL,CHC,,4.0,152.0,,85.0,65.0,,N,,828.0,5219.0,1431.0,175.0,84.0,18.0,476.0,387.0,220.0,,71.0,,679.0,422.0,2.83,137.0,13.0,0.0,4028.0,1357.0,17.0,364.0,323.0,412.0,149.0,0.936,Chicago Orphans,West Side Park II,424352.0,100.0,99.0,CHC,CHN,CHN,,,,,,,,,,,,,,
42720,Don Demeter,32.0,1967,demetdo01,N,226510.0,2,176.0,71,337.3,31500.0,203.689945,0.23,-0.07,0.12,BOS,demetdo01,1935.0,USA,190.0,76.0,R,R,1956-09-18,1967-08-28,demed101,demetdo01,1967.0,71.0,164.0,22.0,37.0,9.0,0.0,6.0,16.0,0.0,0.0,9.0,27.0,1.0,2.0,2.0,0.0,4.0,BOS,AL,BOS,,1.0,162.0,81.0,92.0,70.0,,Y,N,722.0,5471.0,1394.0,216.0,39.0,158.0,522.0,1020.0,68.0,59.0,,,614.0,545.0,3.36,41.0,9.0,44.0,4377.0,1307.0,142.0,477.0,1010.0,142.0,142.0,0.977,Boston Red Sox,Fenway Park II,1727832.0,109.0,109.0,BOS,BOS,BOS,,,,,,,,,,,,,,
74678,Adam Kennedy,25.0,2001,kennead01,N,150456.0,1,532.0,137,1147.7,280000.0,81.232375,1.55,1.08,0.84,ANA,kennead01,1976.0,USA,195.0,71.0,L,R,1999-08-21,2012-09-07,kenna001,kennead01,2001.0,137.0,478.0,48.0,129.0,25.0,3.0,6.0,40.0,12.0,7.0,27.0,71.0,3.0,11.0,7.0,9.0,7.0,ANA,AL,ANA,W,3.0,162.0,81.0,75.0,87.0,N,N,N,691.0,5551.0,1447.0,275.0,26.0,158.0,494.0,1001.0,116.0,52.0,77.0,53.0,730.0,671.0,4.2,6.0,1.0,43.0,4313.0,1452.0,168.0,525.0,947.0,103.0,142.0,0.983,Anaheim Angels,Edison International Field,2000919.0,101.0,101.0,ANA,ANA,ANA,adam,kennedy,1127.0,150456.0,2001.0,280000.0,Anaheim Angels,25.0,598.0,17.0,2138896.0,200000.0,1976-01-10,adam kennedy
62442,Shawn Abner,23.0,1989,abnersh01,N,110025.0,1,108.0,57,244.7,77500.0,38.311744,-1.23,-0.51,-0.85,SDP,abnersh01,1966.0,USA,190.0,73.0,R,R,1987-09-08,1992-10-03,abnes001,abnersh01,1989.0,57.0,102.0,13.0,18.0,4.0,0.0,2.0,14.0,1.0,0.0,5.0,20.0,2.0,0.0,0.0,1.0,1.0,SDN,NL,SDP,W,2.0,162.0,81.0,89.0,73.0,N,N,N,642.0,5422.0,1360.0,215.0,32.0,120.0,552.0,1013.0,136.0,67.0,9.0,41.0,626.0,547.0,3.38,21.0,11.0,52.0,4372.0,1359.0,133.0,481.0,933.0,154.0,147.0,0.976,San Diego Padres,Jack Murphy Stadium,2009031.0,101.0,101.0,SDP,SDN,SDN,shawn,abner,8016.0,110025.0,1989.0,77500.0,San Diego Padres,23.0,567.0,21.0,497254.0,68000.0,1966-06-17,shawn abner


In [5]:
pd.set_option('display.max_rows', 200)
# Print all column names and their data types
raw_df.dtypes

name_common        object
age_x             float64
year_ID             int64
player_ID          object
pitcher            object
mlb_ID            float64
stint_ID            int64
PA                float64
G_x                 int64
Inn               float64
salary_x          float64
OPS_plus          float64
WAR               float64
WAR_def           float64
WAR_off           float64
team_ID            object
playerID           object
birthYear         float64
birthCountry       object
weight            float64
height            float64
bats               object
throws             object
debut              object
finalGame          object
retroID            object
bbrefID            object
yearID            float64
G_y               float64
AB_x              float64
R_x               float64
H_x               float64
2B_x              float64
3B_x              float64
HR_x              float64
RBI               float64
SB_x              float64
CS_x              float64
BB_x        

### 1.1 Data Dictionary <a id='data_dict'></a>

Now let's examine the data dictionary to understand the meaning of each column:

| Column Name | Description | 
| --- | --- |
| name_common | Player name |
| age_x | Player age |
| year_ID | Year (Season played) |
| team_ID | Team played |
| player_ID | Player ID |
| pitcher | Whether the player is a pitcher |
| mlb_ID | MLB ID |
| stint_ID | The stint of the player in that year (a player can have more than one stint in a year if they moved teams). |
| PA | Plate appearances |
| G_x | Games played |
| Inn | Innings played |
| salary_x | Player salary |
| OPS_plus | On-base Plus Slugging Plus - an adjusted version of the On-base Plus Slugging (OPS) statistic. |
| WAR | Wins Above Replacement - a measure of how many wins a player adds to a team compared to a replacement-level player. |
| WAR_def | Wins above replacement as fielder |
| WAR_off | Wins above replacement as batter |
| teamID | ID of the team they played for the season |
| playerID | Player ID |
| birthYear | Year of birth |
| birthMonth | Month of birth |
| birthDay | Day of birth |
| birthCountry | Country of birth |
| nameFirst | First name |
| nameLast | Last name |
| weight | Weight in pounds |
| height | Height in inches |
| bats | Batting hand (left, right, or both) |
| throws | Throwing hand (left or right) |
| debut | Date of MLB debut |
| finalGame | Date of final MLB game |
| retroID | Retro ID |
| bbrefID | Baseball Reference ID |
| yearID | Year (Season played) |
| lgID | The league the player played in |
| G | Games played |
| AB | At bats |
| R | Runs |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR | Homeruns |
| RBI | Runs batted in |
| SB | Stolen bases |
| CS | Caught stealing |
| BB | Walks |
| SO | Strikeouts |
| IBB | Intentional walks |
| HBP | Hit by pitch |
| SH | Sacrifice hits |
| SF | Sacrifice flies |
| GIDP | Grounded into double plays |
| teamID | ID of the team they played for the season |
| lgID | The league the player played in |
| franchID | Franchise ID |
| divID | Division ID |
| Rank | Team rank at the end of the season |
| G | Games played |
| Ghome | Games played at home |
| W | Wins |
| L | Losses |
| DivWin | Division Winner (Y or N) |
| WCWin | Wild Card Winner (Y or N) |
| LgWin | League Champion(Y or N) |
| WSWin | World Series Winner (Y or N) |
| R | Total number of runs scored by the team in the season |
| AB | Total number of at bats by the team in the season |
| H | Total number of hits by the team in the season |
| 2B | Total number of doubles by the team in the season |
| 3B | Total number of triples by the team in the season |
| HR | Total number of home runs by the team in the season |
| BB | Total number of walks by the team in the season |
| SO | Total number of strikeouts by the team in the season |
| SB | Total number of stolen bases by the team in the season |
| CS | Total number of times caught stealing by the team in the season |
| HBP | Total number of times hit by pitch by the team in the season |
| SF | Total number of sacrifice flies by the team in the season |
| RA | Total number of runs allowed by the team in the season |
| ER | Total number of earned runs allowed by the team in the season |
| ERA | Earned run average |
| CG | Total number of complete games pitched by the team in the season |
| SHO | Total number of shutouts pitched by the team in the season |
| SV | Total number of saves by the team in the season |
| IPouts | Total number of outs pitched by the team in the season |
| HA | Total number of hits allowed by the team in the season |
| HRA | Total number of home runs allowed by the team in the season |
| BBA | Total number of walks allowed by the team in the season |
| SOA | Total number of strikeouts by the team in the season |
| E | Total number of errors by the team in the season |
| DP | Total number of double plays turned by the team in the season |
| FP | Fielding percentage |
| name_x | Team name |
| park | Team park |
| attendance | Total attendance for the season |
| BPF | Three-year park factor for batters |
| PPF | Three-year park factor for pitchers |
| teamIDBR | Team ID used by Baseball Reference website |
| teamIDlahman45 | Team ID used in Lahman database version 4.5 |
| teamIDretro | Team ID used by Retrosheet |
| firstname | The player's First name |
| lastname | The player's Last name |
| playerid | A unique identifier for each player |
| mlbid | A unique identifier for each player |
| year | Year of the salary |
| salary_y | Player salary for the year |
| teamname | Team name |
| age | The player's age during the year of the salary |
| leaguerank | The player's rank in the league based on salary |
| teamrank | The player's rank in the team based on salary |
| averagesalary | The average salary in the league for the year |
| leagueminimun | The minimum salary in the league for the year |
| borndate | The player's birth date |






















### 1.2 Description and Overview of the Data <a id='overview'></a>

In [6]:
# Unique players (Total number of players)
print('Number of unique players: ', raw_df['playerID'].nunique())

# Unique seasons (Total number of seasons)
print('Number of unique seasons: ', raw_df['yearID'].nunique())

# First season
print('First season: ', raw_df['yearID'].min())

# Last season
print('Last season: ', raw_df['yearID'].max())

Number of unique players:  19829
Number of unique seasons:  152
First season:  1871.0
Last season:  2022.0


#### Overview of the Data

Our dataset is a rich compilation of baseball statistics that contains data for __19,942__ unique players across __152__ seasons. The earliest MLB season in the dataset is __1871__, and the latest season is __2022__.

Each row in the dataset represents a single player's performance for a given season, with a variety of metrics reflecting different aspects of their performance. These are divided into the following categories:
- __Identifiers:__ These include various IDs for players, teams, and leagues, as well as the year and player's stint with the team in that season.
- __Basic Batting and Fielding Statistics:__ These include familiar metrics such as games played, at bats, runs, hits, and home runs, among others, for both players and teams.
- __Advanced Batting and Fielding Statistics:__ These include more advanced metrics such as wins above replacement (WAR), wins above average (WAA), and runs above average (RAA), among others, for both players and teams.
- __Personal Information:__ These include the player's name, birth and death information, weight, height, and handedness, among others.
- __Team Information:__ These include the team's name, division, and league, as well as their record, attendance, and park factors, among others.
- __Salary Information:__ These include the player's salary.

This dataset provides a comprehensive view of baseball performance, allowing us to examine the game from both individual player and team perspectives.


In [7]:
# Copy of the df for preprocessing
df = raw_df.copy()

## 2. Absence of Key Features (Pre-Feature Engineering) <a id='missing_features'></a>

Our current dataset lacks several features that are integral to understanding a player's performance in baseball. These absent features include:
- __Batting Average (BA)__: This is calculated by dividing a player's number of hits by their number of at bats. It provides a measure of a player's offensive capabilities.
- __On-Base Percentage (OBP)__: This metric represents the frequency at which a player reaches base per plate appearance. It's calculated by dividing the total times a player reaches base (hits, walks, and hit by pitches) by their total number of eligible at bats.
- __Singles (1B)__: This is the number of hits a player has that resulted in the batter reaching first base safely. It's calculated by subtracting doubles, triples, and home runs from total hits.
- __Total Bases (TB)__: The number of bases a player has gained with hits. It is a weighted sum for a batter's collection of hits including singles, doubles, triples and home runs.
- __Slugging Percentage (SLG)__: The total number of bases divided by the number of at bats.
- __On-Base Plus Slugging (OPS)__: This is the sum of a player's on-base percentage and slugging percentage. It's a more comprehensive statistic that measures a player's ability to get on base, along with their ability to hit for power.
- __Batting Average on Balls In Play (BABIP)__: This measures how often a ball in play goes for a hit. A ball is "in play" when the plate appearance ends in something other than a strikeout, walk, hit batter, catcher's interference, sacrifice bunt, or home run. In other words, the batter puts the ball in play and it doesn't clear the outfield fence. This can be an indicator of a player's luck, skill at placing hits where fielders aren't, or both.
- __Win Percentage (W%)__: This is the number of wins divided by the number of games played. It provides a measure of a team's success over a period of time.

### 2.1 Adding key features <a id='adding_features'></a>

In [8]:
# Add Battiing Average (BA) column, round to 3 decimal places
df['BA'] = round(df['H_x'] / df['AB_x'], 3)

# Add On Base Percentage (OBP) column
df['OBP'] = round((df['H_x'] + df['BB_x'] + df['HBP_x']) / (df['AB_x'] + df['BB_x'] + df['HBP_x'] + df['SF_x']), 3)

# Add Singles (1B) column
df['1B'] = df['H_x'] - df['2B_x'] - df['3B_x'] - df['HR_x']

# Add Total Bases (TB) column
df['TB'] = df['1B'] + 2*df['2B_x'] + 3*df['3B_x'] + 4*df['HR_x']

# Add Slugging Percentage (SLG) column
df['SLG'] = round(df['TB'] / df['AB_x'], 3)

# Add On Base Plus Slugging (OPS) column
df['OPS'] = round(df['OBP'] + df['SLG'], 3)

# Add Batting Average on Balls in Play (BABIP) column
df['BABIP'] = round((df['H_x'] - df['HR_x']) / (df['AB_x'] - df['SO_x'] - df['HR_x'] + df['SF_x']), 3)

# Add Win Percentage (W%) column
df['W%'] = round(df['W'] / (df['W'] + df['L']), 3)


##### Sanity check
Let's compare newly created features with the ones of a well known player, Derek Jeter, to make sure we are calculating them correctly.
 
https://www.baseball-reference.com/players/j/jeterde01.shtml

In [9]:
# Sanity check, Derek Jeter's new columns
df[df['playerID'] == 'jeterde01'][['playerID', 'yearID', 'BA', 'OBP', '1B', 'SLG', 'OPS', 'BABIP', 'TB']]

Unnamed: 0,playerID,yearID,BA,OBP,1B,SLG,OPS,BABIP,TB
67993,jeterde01,1995.0,0.25,0.294,7.0,0.375,0.669,0.324,18.0
69126,jeterde01,1996.0,0.314,0.37,142.0,0.43,0.8,0.361,250.0
70241,jeterde01,1997.0,0.291,0.37,142.0,0.405,0.775,0.345,265.0
71384,jeterde01,1998.0,0.324,0.384,151.0,0.481,0.865,0.375,301.0
72567,jeterde01,1999.0,0.349,0.438,149.0,0.552,0.99,0.396,346.0
73766,jeterde01,2000.0,0.339,0.416,151.0,0.481,0.897,0.386,285.0
75004,jeterde01,2001.0,0.311,0.377,132.0,0.48,0.857,0.343,295.0
76182,jeterde01,2002.0,0.297,0.373,147.0,0.421,0.794,0.336,271.0
77383,jeterde01,2003.0,0.324,0.393,118.0,0.45,0.843,0.379,217.0
78610,jeterde01,2004.0,0.292,0.352,120.0,0.471,0.823,0.315,303.0


All the features match the ones in the website, so we can proceed.

### 2.2 Adding cumulative features
It is essential to consider the player's career statistics up to the point of each season. This is because a player's salary is often determined not just by their performance in the current season, but by their cumulative performance throughout their career.

To this end, we have added cumulative features to our dataset. These features represent the cumulative sum of various statistics for each player up to each season. These features represent the cumulative sum of various statistics for each player up to each season. For example, `career_G_x` represents the cumulative number of games played by the player up to the current season, `career_H_x` represents the cumulative number of hits, and so on.

In addition to these cumulative sum features, we have also added cumulative mean features for certain statistics. These features represent the cumulative average of these statistics for each player up to each season. For example, `career_BA` represents the cumulative batting average of the player up to the current season.

Later in the preprocessing stage, we will be narrowing down the time span of our dataset. To ensure we do not lose valuable data, especially for players who debuted before our selected time span, it's crucial to perform this step now. This approach ensures that we retain all relevant information for our analysis and model training.


In [10]:
# Add cumulative columns

# Empty list to store the cumulative columns names
cumulative_sum_cols = []
cumulative_mean_cols = []

# Function to add cumulative columns
def add_cumulative_sum(df, list_of_cols):
    for col in list_of_cols:
        new_col_name = 'career_' + col
        df[new_col_name] = df.groupby(['playerID'])[col].cumsum()
        if col != 'WAR':
            df[new_col_name] = df[new_col_name].fillna(0).astype(int)
        cumulative_sum_cols.append(new_col_name)
        
# Add cumulative sum columns
columns_to_sum = ['G_x', 'PA', 'AB_x', 'R_x', 'H_x', '1B', '2B_x', '3B_x', 'HR_x', 'RBI', 'SB_x', 'CS_x', 'BB_x', 'SO_x', 'IBB', 'HBP_x', 'SH', 'SF_x', 'GIDP', 'TB', 'WAR']
add_cumulative_sum(df, columns_to_sum)

# Add cumulative mean columns (Batting Average, On Base Percentage, Slugging Percentage, On Base Plus Slugging, BABIP)
# Batting Average (BA)
df['career_BA'] = round(df.groupby(['playerID'])['H_x'].cumsum() / df.groupby(['playerID'])['AB_x'].cumsum(), 3)
cumulative_mean_cols.append('career_BA')

# On Base Percentage (OBP)
df['career_OBP'] = round((df.groupby(['playerID'])['H_x'].cumsum() + df.groupby(['playerID'])['BB_x'].cumsum() + df.groupby(['playerID'])['HBP_x'].cumsum()) / (df.groupby(['playerID'])['AB_x'].cumsum() + df.groupby(['playerID'])['BB_x'].cumsum() + df.groupby(['playerID'])['HBP_x'].cumsum() + df.groupby(['playerID'])['SF_x'].cumsum()), 3)
cumulative_mean_cols.append('career_OBP')

# Slugging Percentage (SLG)
df['career_SLG'] = round((df.groupby(['playerID'])['1B'].cumsum() + 2*df.groupby(['playerID'])['2B_x'].cumsum() + 3*df.groupby(['playerID'])['3B_x'].cumsum() + 4*df.groupby(['playerID'])['HR_x'].cumsum()) / df.groupby(['playerID'])['AB_x'].cumsum(), 3)
cumulative_mean_cols.append('career_SLG')

# On Base Plus Slugging (OPS)
df['career_OPS'] = round(df['career_OBP'] + df['career_SLG'], 3)
cumulative_mean_cols.append('career_OPS')

# Batting Average on Balls in Play (BABIP)
df['career_BABIP'] = round((df.groupby(['playerID'])['H_x'].cumsum() - df.groupby(['playerID'])['HR_x'].cumsum()) / (df.groupby(['playerID'])['AB_x'].cumsum() - df.groupby(['playerID'])['SO_x'].cumsum() - df.groupby(['playerID'])['HR_x'].cumsum() + df.groupby(['playerID'])['SF_x'].cumsum()), 3)
cumulative_mean_cols.append('career_BABIP')

##### Sanity check

https://www.baseball-reference.com/players/j/jeterde01.shtml

In [11]:
# Sanity check, Derek Jeter's new cumulative sum columns
df[df['playerID'] == 'jeterde01'][['playerID', 'yearID', 'career_G_x', 'career_PA', 'career_AB_x', 'career_R_x', 'career_H_x', 'career_1B', 'career_2B_x', 'career_3B_x', 'career_HR_x', 'career_RBI', 'career_SB_x', 'career_CS_x', 'career_BB_x', 'career_SO_x', 'career_IBB', 'career_HBP_x', 'career_SH', 'career_SF_x', 'career_GIDP', 'career_TB', 'career_WAR']]


Unnamed: 0,playerID,yearID,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR
67993,jeterde01,1995.0,15,51,48,5,12,7,4,1,0,7,0,0,3,11,0,0,0,0,0,18,-0.34
69126,jeterde01,1996.0,172,705,630,109,195,149,29,7,10,85,14,7,51,113,1,9,6,9,13,268,2.95
70241,jeterde01,1997.0,331,1453,1284,225,385,291,60,14,20,155,37,19,125,238,1,19,14,11,27,533,7.91
71384,jeterde01,1998.0,480,2147,1910,352,588,442,85,22,39,239,67,25,182,357,2,24,17,14,40,834,15.44
72567,jeterde01,1999.0,638,2886,2537,486,807,591,122,31,63,341,86,33,273,473,7,36,20,20,52,1180,23.44
73766,jeterde01,2000.0,786,3565,3130,605,1008,742,153,35,78,414,108,37,341,572,11,48,23,23,66,1465,28.01
75004,jeterde01,2001.0,936,4251,3744,715,1199,874,188,38,99,488,135,40,397,671,14,58,28,24,79,1760,33.2
76182,jeterde01,2002.0,1093,4981,4388,839,1390,1021,214,38,117,563,167,43,470,785,16,65,31,27,93,2031,36.87
77383,jeterde01,2003.0,1212,5523,4870,926,1546,1139,239,41,127,615,178,48,513,873,18,78,34,28,103,2248,40.44
78610,jeterde01,2004.0,1366,6244,5513,1037,1734,1259,283,42,150,693,201,52,559,972,19,92,50,30,122,2551,44.68


In [12]:
# Derek Jeter's career average stats
df[df['name_common'] == 'Derek Jeter'][['playerID', 'yearID', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS', 'career_BABIP']]

Unnamed: 0,playerID,yearID,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
67993,jeterde01,1995.0,0.25,0.294,0.375,0.669,0.324
69126,jeterde01,1996.0,0.31,0.365,0.425,0.79,0.359
70241,jeterde01,1997.0,0.3,0.368,0.415,0.783,0.352
71384,jeterde01,1998.0,0.308,0.373,0.437,0.81,0.359
72567,jeterde01,1999.0,0.318,0.389,0.465,0.854,0.368
73766,jeterde01,2000.0,0.322,0.394,0.468,0.862,0.372
75004,jeterde01,2001.0,0.32,0.392,0.47,0.862,0.367
76182,jeterde01,2002.0,0.317,0.389,0.463,0.852,0.362
77383,jeterde01,2003.0,0.317,0.389,0.462,0.851,0.364
78610,jeterde01,2004.0,0.315,0.385,0.463,0.848,0.358


All the features match the ones in the website, so we can proceed.

In [13]:
df.shape

(102624, 139)

In [14]:
# Save a copy of the df
df.to_csv('baseball_pre_cumulative.csv', index=False)

## 3. Data Cleaning <a id='data_cleaning'></a>
The next crucial step in our analysis is data cleaning. This process involves preparing our data for analysis and modeling by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

### 3.1 Setting a Time Span <a id='time_span'></a>

The dataset contains data from 1871 to 2022. However, as part of our data cleaning process, we made a strategic decision to limit our dataset to encompass the years from 1985 to the present. 

There were several reasons behind this decision:
- __Statistical Consistency:__ The game of baseball has evolved significantly over the years, with changes in rules, equipment, and player training and conditioning. By focusing on the past four decades, we ensure a higher degree of consistency in the playing conditions, thus making our statistical analysis and predictions more reliable.
- __Impact of Free Agency:__ The introduction of free agency in 1976 has had a significant impact on the game. By starting our analysis from 1985, we focus on an era when player movement between teams became more common, which adds an interesting dynamic to player performance and team composition.
- __Data Availability:__ The data from 1985 to the present is more complete and accurate, which will help us avoid potential issues with missing or incorrect data.

In [15]:
# New dataframe starting from 1985
df_1985 = df.copy()
df_1985 = df_1985[df_1985['yearID'] >= 1985]

# Save a copy of the df_1980
df_1985.to_csv('baseball_pre_cumulative_1985.csv', index=False)

In [16]:
# Shape 
print('Shape of the dataset: ', df_1985.shape)

Shape of the dataset:  (44838, 139)


### 3.2 Droping columns (Features) that are no relevant to the project <a id='drop_cols'></a>


In [17]:
# Empty list to store columns to be dropped
drop_cols = []


#### Personal Columns

Some columns are not relevant to our project and will be dropped. These columns are commonly related to personal information about the player, such as birth date, birth place, and death date. 

Personal columns are not directly related to the project's analysis of batting performance. By removing them from the dataset, we can eliminate unnecessary personal information and narrow the scope of the project to the pertinent variables.


In [18]:
# List of personal columns that are not relevant
personal_columns = ['firstname', 'lastname', 'borndate']

# Append personal columns to drop_cols
drop_cols.extend(personal_columns)


#### Pitching Stats Columns

To focus on the batting-oriented nature of your project and streamline the analysis, it is recommended to drop the pitching-related columns. By removing these columns, we can concentrate solely on the batting statistics, which aligns with the project's objective. Dropping the pitching columns will enable a more concise and meaningful exploration of the batting performance in your dataset.

In [19]:
# Pitching columns
pitching_columns = ['ERA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'PPF']

# Append pitching columns to drop_cols
drop_cols.extend(pitching_columns)

### 3.3 Droping repeated or similar columns <a id='drop_repeated_cols'></a>

Some columns provide the same information as other columns, but with different names. We are going to iterate over each pair of columns we belive are repeated and evaluate which will be the one to drop. 

In [20]:
def compare_columns(dataframe, column1, column2, column3=None):
    '''
    Compare up to three columns in a DataFrame and return the number of equal and different values.
    Additionally, provide the count of null values for each column to assist in identifying columns for potential dropping.
    If the number of different values is close to or equals the number of null values, it suggests that the differences 
    observed are primarily due to null values.
    :param column1: string
    :param column2: string
    :param column3: string
    '''
    if column3 is None:
        # Return bool if column1 and column2 are equal and count the number of False values
        comparison = dataframe[column1] == dataframe[column2]
        if False in comparison.value_counts():
            print('Number of different values: ', comparison.value_counts()[False])
            print('Number of equal values: ', comparison.value_counts()[True])
            # Null values for comparison
            print(f'Number of null values for {column1}: ', dataframe[column1].isnull().sum())
            print(f'Number of null values for {column2}: ', dataframe[column2].isnull().sum())
        else:
            print("All values are equal")    
    else:
        # Return bool if column1, column2 and column3 are equal and count the number of False values
        comparison = (dataframe[column1] == dataframe[column2]) & (dataframe[column1] == dataframe[column3])
        if False in comparison.value_counts():
            print('Number of different values: ', comparison.value_counts()[False])
            print('Number of equal values: ', comparison.value_counts()[True])
            # Null values for comparison
            print(f'Number of null values for {column1}: ', dataframe[column1].isnull().sum())
            print(f'Number of null values for {column2}: ', dataframe[column2].isnull().sum())
            print(f'Number of null values for {column3}: ', dataframe[column3].isnull().sum())
        else:
            print("All values are equal")
        


##### Year columns (Season)

__`yearID` and `year`.__

In [21]:
# Columns that start with 'year'
year_columns = [col for col in df_1985.columns if col.startswith('year')]
year_columns

['year_ID', 'yearID', 'year']

In [22]:
# yearID and year_ID comparison
compare_columns(df_1985, 'yearID', 'year_ID')

All values are equal


Both columns are the same, we can drop any of them plus `year` column.

In [23]:
drop_cols.extend(['year_ID', 'year'])

##### Player identifiers

__`playerID` and `player_ID`.__

In [24]:
# Columns that could be for player identification
id_columns = [col for col in df_1985.columns if 'ID' in col or 'id' in col or 'player' in col or 'name' in col] 
id_columns

['name_common',
 'year_ID',
 'player_ID',
 'mlb_ID',
 'stint_ID',
 'team_ID',
 'playerID',
 'retroID',
 'bbrefID',
 'yearID',
 'GIDP',
 'teamID',
 'lgID',
 'franchID',
 'divID',
 'name_x',
 'teamIDBR',
 'teamIDlahman45',
 'teamIDretro',
 'firstname',
 'lastname',
 'playerid',
 'mlbid',
 'name_y',
 'career_GIDP']

In [25]:
# playerID and player_ID comparison
compare_columns(df_1985, 'playerID', 'player_ID')

Number of different values:  553
Number of equal values:  44285
Number of null values for playerID:  0
Number of null values for player_ID:  0


Let's select the rows where `playerID` and `player_ID` are different and see what the difference is.

In [26]:
pd.set_option('display.max_rows', 400)
df_1985[df_1985['playerID'] != df_1985['player_ID']][['playerID', 'player_ID']]

Unnamed: 0,playerID,player_ID
57934,obriech01,o'brich01
58173,oconnja02,o'conja02
58417,oberrmi01,o'bermi01
58465,oneilpa01,o'neipa01
58471,obriepe03,o'bripe03
...,...,...
102014,montafr01,montafr02
102097,beeksja01,beeksja02
102144,rodrijo04,rodrijo06
102319,mannima01,mannima02


It looks like the differences are special characters in the `player_ID` column. We can drop `player_ID` and keep `playerID`, since it is cleaner.

In [27]:
# Append player_ID to drop_cols
drop_cols.append('player_ID')

##### More player identifiers

In [28]:
# Player identifiers
player_ids_columns = ['playerID', 'mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid']

df_1985[player_ids_columns].value_counts().sample(10)

playerID   mlb_ID    retroID   bbrefID    mlbid     playerid
machaju01  118078.0  machj001  machaju01  118078.0  14594.0     2
tapiara01  606132.0  tapir001  tapiara01  606132.0  181564.0    7
mccutan01  915410.0  mccua001  mccutan01  457705.0  54136.0     1
doteloc01  410202.0  doteo001  doteloc01  136734.0  1023.0      1
lopezaq01  425658.0  lopea003  lopezaq01  425658.0  6022.0      4
machaju01  236156.0  machj001  machaju01  118078.0  14594.0     1
bolsimi01  502211.0  bolsm001  bolsimi01  502211.0  123167.0    4
neaglde01  239346.0  neagd001  neaglde01  119673.0  662.0       2
guldebr01  115238.0  guldb001  guldebr01  115238.0  12248.0     1
heinesc01  595981.0  heins001  heinesc01  595981.0  169819.0    3
dtype: int64

We are interested in Baseball Reference IDs, as it is one of our main sources of information. `playerID` and `bbrefID` are the same, and both of them are from Baseball Reference. We can drop `bbrefID` and keep `playerID`. We will also drop the rest of the player identifiers.

In [29]:
drop_cols.extend(['mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid'])

`name_common` and `name_y`

In [30]:
# name_common and name_y sample
df_1985[df_1985['name_common'] != df_1985['name_y']][['name_common', 'name_y']].sample(10)

Unnamed: 0,name_common,name_y
95319,Geovany Soto,geovany soto
70761,Pat Listach,pat listach
62026,Jim Gantner,jim gantner
76518,Julio Lugo,julio lugo
95068,Charlie Blackmon,charlie blackmon
94260,Kaleb Cowart,kaleb cowart
100796,JT Chargois,jt chargois
92587,Eddie Butler,eddie butler
70897,Ryan Karp,ryan karp
82372,Garret Anderson,garret anderson


It looks that they are the same but `name_y` is in lower case. We will keep `name_common` and drop `name_y`.

In [31]:
# drop name_y
drop_cols.append('name_y')

##### Age columns

In [32]:
compare_columns(df_1985, 'age_x', 'age_y')

Number of different values:  962
Number of equal values:  43876
Number of null values for age_x:  0
Number of null values for age_y:  76


We will drop `age_y` as it has more missing values.

In [33]:
drop_cols.append('age_y')

# Change name of age_x to age
df_1985.rename(columns={'age_x': 'age'}, inplace=True)

##### Team identifiers

There are several columns that serve as identifiers for the team:
- `teamID`
- `team_ID`
- `teamIDBR`
- `teamIDlahman45`
- `teamIDretro`
- `franchID`
- `name_x`
- `TeamName`

Let's look at a dataframe including these columns:

In [34]:
# Team identifiers
team_ids_columns = ['teamID', 'team_ID', 'teamIDBR', 'teamIDlahman45', 'teamIDretro', 'franchID', 'name_x', 'TeamName']



We will rely on the identifiers provided by Baseball-Reference (https://www.baseball-reference.com/about/team_IDs.shtml), specifically `teamIDBR` and `franchID`. Baseball-Reference is one of our primary data sources, and utilizing these identifiers will ensure consistency and accuracy in our analysis. We will drop the other team identifiers except for `name`, which will be useful for visualizations and analysis.

In [35]:
# null values for the team identifiers columns
df_1985[team_ids_columns].isnull().sum()

teamID             0
team_ID            0
teamIDBR           0
teamIDlahman45     0
teamIDretro        0
franchID           0
name_x             0
TeamName          76
dtype: int64

In [36]:
# Append teamID, team_ID, teamIDlahman45, teamIDretro to drop_cols
drop_cols.extend(['teamID', 'team_ID', 'teamIDlahman45', 'teamIDretro', 'TeamName'])

In [37]:
# Change name_x to teamName
df_1985.rename(columns={'name_x': 'teamName'}, inplace=True)

##### Salary columns

`salary_x` and `salary_y`

In [38]:
compare_columns(df_1985, 'salary_x', 'salary_y')

Number of different values:  16965
Number of equal values:  27873
Number of null values for salary_x:  0
Number of null values for salary_y:  76


We will drop `salary_x` as it has more missing values.

In [39]:
drop_cols.append('salary_x')

In [None]:
more_cols = ['']

##### Summary of columns to drop

In [40]:
# Columns to drop
print('Columns to drop: ', drop_cols)
print('Number of columns to drop: ', len(drop_cols))

Columns to drop:  ['firstname', 'lastname', 'borndate', 'ERA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'PPF', 'year_ID', 'year', 'player_ID', 'mlb_ID', 'retroID', 'bbrefID', 'mlbid', 'playerid', 'name_y', 'age_y', 'teamID', 'team_ID', 'teamIDlahman45', 'teamIDretro', 'TeamName', 'salary_x']
Number of columns to drop:  31


##### Droping columns

In [41]:
# Drop columns
df_1985.drop(columns=drop_cols, inplace=True)

In [42]:
# Shape after dropping columns
print('Shape of the dataset: ', df_1985.shape)

Shape of the dataset:  (44838, 109)


In [43]:
# Re empty list to store more columns to be dropped if needed
drop_cols = []

In [44]:
df_1985.sample(5)

Unnamed: 0,name_common,age,pitcher,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
74960,Darrin Fletcher,34.0,N,1,453.0,134,1030.0,63.099807,-0.01,1.27,-0.51,fletcda01,1966.0,USA,195.0,74.0,L,R,1989-09-10,2002-07-16,2001.0,134.0,416.0,36.0,94.0,20.0,0.0,11.0,56.0,0.0,1.0,24.0,43.0,4.0,6.0,1.0,6.0,18.0,AL,TOR,E,3.0,162.0,82.0,80.0,82.0,N,N,N,767.0,5663.0,1489.0,287.0,36.0,195.0,470.0,1094.0,156.0,55.0,74.0,43.0,753.0,97.0,184.0,0.985,Toronto Blue Jays,Skydome,1915438.0,102.0,TOR,3525000.0,184.0,6.0,2138896.0,200000.0,0.226,0.274,63.0,147.0,0.353,0.627,0.226,0.494,1200,4136,3775,369,1020,683,208,8,121,561,2,6,251,386,31,49,12,48,118,1607,7.96,0.27,0.32,0.426,0.746,0.271
68039,Edwin Hurtado,25.0,Y,1,0.0,0,77.7,0.0,0.0,0.0,0.0,hurtaed01,1970.0,Venezuela,215.0,75.0,R,R,1995-05-22,1997-07-30,1995.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AL,TOR,E,5.0,144.0,72.0,56.0,88.0,N,N,N,642.0,5036.0,1309.0,275.0,27.0,140.0,492.0,906.0,75.0,16.0,44.0,45.0,777.0,97.0,131.0,0.982,Toronto Blue Jays,Skydome,2826483.0,99.0,TOR,109000.0,0.0,0.0,1110766.0,109000.0,,,0.0,0.0,,,,0.389,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,,,,,
89302,Rick Porcello,23.0,Y,1,3.0,2,176.3,-100.0,-0.04,0.0,-0.04,porceri01,1988.0,USA,205.0,77.0,R,R,2009-04-09,2020-09-26,2012.0,31.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,AL,DET,C,1.0,162.0,81.0,88.0,74.0,Y,Y,N,726.0,5476.0,1467.0,279.0,39.0,163.0,511.0,1103.0,59.0,23.0,57.0,39.0,670.0,99.0,127.0,0.983,Detroit Tigers,Comerica Park,3028033.0,104.0,DET,3100000.0,271.0,10.0,3213479.0,480000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.543,8,16,14,0,3,3,0,0,0,2,0,0,0,6,0,0,2,0,0,3,-0.01,0.214,0.214,0.214,0.428,0.375
77459,Frank Menechino,32.0,N,1,109.0,43,248.3,71.706318,-0.08,-0.18,0.21,menecfr01,1971.0,USA,175.0,69.0,R,R,1999-09-06,2005-10-02,2003.0,43.0,83.0,10.0,16.0,0.0,0.0,2.0,9.0,0.0,0.0,19.0,16.0,1.0,4.0,2.0,1.0,2.0,AL,OAK,W,1.0,162.0,81.0,96.0,66.0,Y,N,N,768.0,5497.0,1398.0,317.0,24.0,176.0,556.0,898.0,48.0,14.0,59.0,53.0,643.0,107.0,145.0,0.983,Oakland Athletics,Oakland Coliseum,2216596.0,99.0,OAK,334500.0,565.0,19.0,2372189.0,300000.0,0.193,0.364,14.0,22.0,0.265,0.629,0.212,0.593,295,1019,840,145,196,132,38,3,23,110,3,7,138,194,1,25,6,10,20,309,3.78,0.233,0.354,0.368,0.722,0.273
102582,Whit Merrifield,33.0,N,2,550.0,139,1121.6,201.798382,0.0,-0.8,0.92,merriwh01,1989.0,USA,195.0,73.0,R,R,2016-05-18,2023-06-24,2022.0,139.0,504.0,70.0,126.0,28.0,1.0,11.0,58.0,16.0,5.0,38.0,85.0,0.0,0.0,0.0,8.0,11.0,AL,KCR,C,5.0,162.0,81.0,65.0,97.0,N,N,N,640.0,5437.0,1327.0,247.0,38.0,138.0,460.0,1287.0,104.0,34.0,48.0,44.0,810.0,82.0,153.0,0.986,Kansas City Royals,Kauffman Stadium,1277686.0,103.0,KCR,7000000.0,179.0,4.0,4317736.0,700000.0,0.25,0.298,86.0,189.0,0.375,0.673,0.276,0.401,907,3939,3627,522,1035,710,220,26,79,403,175,43,244,621,9,25,4,39,60,1544,17.29,0.285,0.331,0.426,0.757,0.322


### 2.4 Filtering and removing players with no relevance to the project (Removing rows) <a id='drop_rows'></a>

#### Pitchers

As we mentioned earlier in the project, we are going to focus on batting metrics. We will remove pitchers from the dataset.

In [45]:
# Drop rows with pitcher = Y
df_1985 = df_1985[df_1985['pitcher'] != 'Y']

In [46]:
# Drop pitcher column
df_1985.drop('pitcher', axis=1, inplace=True)

# Shape
df_1985.shape

(22274, 108)

---

#### Update Team Names

Post 1985, several teams have undergone name changes. We'll update these to reflect their current names, resulting in 30 unique team names. Additionally, we'll include the Montreal Expos, the only team to have relocated since 1985, bringing our total to 31.

- https://www.mlb.com/team

In [47]:
# Team names
df_1985['teamName'].value_counts()

San Francisco Giants             814
Kansas City Royals               799
Seattle Mariners                 798
Texas Rangers                    797
Cincinnati Reds                  795
Baltimore Orioles                788
Oakland Athletics                788
Los Angeles Dodgers              787
Boston Red Sox                   784
New York Mets                    782
New York Yankees                 780
Cleveland Indians                778
San Diego Padres                 769
Detroit Tigers                   767
Pittsburgh Pirates               765
Minnesota Twins                  763
Toronto Blue Jays                751
Chicago Cubs                     750
Philadelphia Phillies            740
Houston Astros                   738
St. Louis Cardinals              735
Chicago White Sox                731
Milwaukee Brewers                725
Atlanta Braves                   720
Colorado Rockies                 604
Arizona Diamondbacks             520
Montreal Expos                   398
L

In [48]:
# Cleveland Indians to Cleveland Guardians
df_1985['teamName'] = df_1985['teamName'].replace('Cleveland Indians', 'Cleveland Guardians')

# Anaheim Angels to Los Angeles Angels
df_1985['teamName'] = df_1985['teamName'].replace('Anaheim Angels', 'Los Angeles Angels')

# Tampa Bay Devil Rays to Tampa Bay Rays
df_1985['teamName'] = df_1985['teamName'].replace('Tampa Bay Devil Rays', 'Tampa Bay Rays')

# Florida Marlins to Miami Marlins
df_1985['teamName'] = df_1985['teamName'].replace('Florida Marlins', 'Miami Marlins')

# California Angels to Los Angeles Angels
df_1985['teamName'] = df_1985['teamName'].replace('California Angels', 'Los Angeles Angels')

# Los Angeles Angels of Anaheim to Los Angeles Angels
df_1985['teamName'] = df_1985['teamName'].replace('Los Angeles Angels of Anaheim', 'Los Angeles Angels')

In [49]:
# Unique team names
df_1985['teamName'].nunique()

31

---

### 2.5 Dealing with Null Values <a id='null_values'></a>

In [50]:
pd.set_option('display.max_rows', 200)
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

DivWin           512
LgWin            512
WSWin            512
salary_y           1
leaguerank         1
teamrank           1
averagesalary      1
leagueminimum      1
BA                28
OBP               22
SLG               28
OPS               28
BABIP             92
career_BA         11
career_OBP         9
career_SLG        11
career_OPS        11
career_BABIP      46
dtype: int64

The curious pattern of having the same number of null values (668) in the columns `WSWin`, `DivWin`, and `LgWin` suggests a potential relationship among these variables. Let's look at the rows with null values in these columns.

In [51]:
pd.set_option('display.max_rows', None)
# Rows with DivWin, LgWin and WSWin null values, sort 
df_1985[df_1985['DivWin'].isnull() & df_1985['LgWin'].isnull() & df_1985['WSWin'].isnull()].nunique().sort_values()

DivWin             0
LgWin              0
WSWin              0
leagueminimum      1
averagesalary      1
yearID             1
throws             2
stint_ID           2
lgID               2
bats               3
divID              3
G                  5
Rank               5
FP                11
3B_x              12
BPF               12
Ghome             13
SF_x              13
HBP_x             14
height            14
W                 15
birthCountry      15
SH                15
CS_x              16
L                 16
3B_y              18
HBP_y             18
SF_y              18
E                 19
GIDP              20
IBB               20
DP                21
HR_y              21
age               21
2B_y              22
CS_y              22
SB_y              23
birthYear         23
W%                24
BB_y              25
RA                26
AB_y              26
H_y               27
SO_y              27
attendance        28
teamName          28
teamIDBR          28
franchID     

 It's interesting that `yearID` only contains one unique value. This suggests that all the records in this subset belong to a single season. 

In [52]:
df_1985[df_1985['DivWin'].isnull() & df_1985['LgWin'].isnull() & df_1985['WSWin'].isnull()][['yearID']].value_counts()

yearID
1994.0    512
dtype: int64

The null values present in the `DivWin`, `LgWin`, and `WSWin` columns correspond to the year 1994. This particular year holds significance in MLB history as the season came to an abrupt end due to a labor strike. As a result, no teams were able to compete in the playoffs, and the World Series was canceled.

Therefore, the presence of null values in DivWin (division winner), LgWin (league winner), and WSWin (World Series winner) for the year 1994 is expected. We will fill these null values with "N" to indicate that no team won the division, league, or World Series that year.

In [53]:
# Fill WSwWin, LgWin and DivWin null values with 'N'
df_1985['WSWin'].fillna('N', inplace=True)
df_1985['LgWin'].fillna('N', inplace=True)
df_1985['DivWin'].fillna('N', inplace=True)

In [54]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0].sort_values(ascending=False)

BABIP            92
career_BABIP     46
BA               28
SLG              28
OPS              28
OBP              22
career_BA        11
career_SLG       11
career_OPS       11
career_OBP        9
salary_y          1
leaguerank        1
teamrank          1
averagesalary     1
leagueminimum     1
dtype: int64

In [55]:
df_1985.shape

(22274, 108)

##### `BABIP` Nulls

In [56]:
# Rows with BABIP null values
df_1985[df_1985['BABIP'].isnull()].sample()

Unnamed: 0,name_common,age,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
62718,Chuck Carr,22.0,1,2.0,4,1.0,-100.0,-0.03,0.0,-0.03,carrch02,1967.0,USA,155.0,70.0,B,R,1990-04-28,1997-09-27,1990.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,NL,NYM,E,2.0,162.0,81.0,91.0,71.0,N,N,N,775.0,5504.0,1410.0,278.0,21.0,172.0,536.0,851.0,110.0,33.0,32.0,56.0,613.0,132.0,107.0,0.978,New York Mets,Shea Stadium,2732745.0,100.0,NYM,100000.0,0.0,0.0,597537.0,100000.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.562,4,2,2,0,0,0,0,0,0,0,1,0,0,2,0,0,0,0,0,0,-0.03,0.0,0.0,0.0,0.0,


The calculation for BABIP (Batting Average on Balls In Play) is:

BABIP = (H - HR) / (AB - SO - HR + SF)

BABIP NaN values, are likely due to division by zero in the denominator of the formula. This can occur if the player has no at-bats (AB), or if the number of strikeouts (SO) and home runs (HR) equals or exceeds the number of at-bats.

In other words, if a player has never been at bat, or if every time they've been at bat they've either struck out or hit a home run, then the denominator of the BABIP formula will be zero, leading to a division by zero error and a resulting NaN value.

For this reason we will fill the BABIP NaN values with zero.

In [57]:
# Fill BABIP null values with 0
df_1985['BABIP'].fillna(0, inplace=True)

In [58]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

salary_y          1
leaguerank        1
teamrank          1
averagesalary     1
leagueminimum     1
BA               28
OBP              22
SLG              28
OPS              28
career_BA        11
career_OBP        9
career_SLG       11
career_OPS       11
career_BABIP     46
dtype: int64

##### `BA`, `OBP`, `SLG`, and `OPS` Nulls

The values for Batting Average (`BA`), On-Base Percentage (`OBP`), Slugging Percentage (`SLG`), and On-Base Plus Slugging (`OPS`) are calculated based on a player's hitting statistics. If a player has not had any at-bats or has not reached base in any way, these values will be undefined and will appear as NaN in your dataset.

For this reason we will fill the `BA`, `OBP`, `SLG`, and `OPS` NaN values with zero.

In [59]:
# Fill BA, OBP, SLG, OPS null values with 0
df_1985['BA'].fillna(0, inplace=True)
df_1985['OBP'].fillna(0, inplace=True)
df_1985['SLG'].fillna(0, inplace=True)
df_1985['OPS'].fillna(0, inplace=True)

In [60]:
# null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

salary_y          1
leaguerank        1
teamrank          1
averagesalary     1
leagueminimum     1
career_BA        11
career_OBP        9
career_SLG       11
career_OPS       11
career_BABIP     46
dtype: int64

##### Carreer Stats Nulls

In [61]:
# Rows with career_BABIP null values
df_1985[df_1985['career_BABIP'].isnull()].sample(5)

Unnamed: 0,name_common,age,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
83482,Clay Timpner,25.0,1,2.0,2,2.0,-100.0,-0.05,0.0,-0.05,timpncl01,1983.0,USA,195.0,74.0,L,L,2008-04-08,2008-04-11,2008.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,NL,SFG,W,4.0,162.0,81.0,72.0,90.0,N,N,N,640.0,5543.0,1452.0,311.0,37.0,94.0,452.0,1044.0,108.0,46.0,48.0,44.0,759.0,96.0,129.0,0.983,San Francisco Giants,AT&T Park,2863837.0,102.0,SFG,390000.0,0.0,0.0,2925679.0,390000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.444,2,2,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,-0.05,0.0,0.0,0.0,0.0,
68644,Ruben Rivera,21.0,1,1.0,5,5.0,-100.0,-0.01,0.03,-0.04,riverru01,1973.0,Panama,195.0,72.0,R,R,1995-09-03,2003-05-28,1995.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,AL,NYY,E,2.0,145.0,73.0,79.0,65.0,N,N,N,749.0,4947.0,1365.0,280.0,34.0,122.0,625.0,851.0,50.0,30.0,39.0,68.0,688.0,74.0,121.0,0.986,New York Yankees,Yankee Stadium II,1705263.0,99.0,NYY,109000.0,0.0,0.0,1110766.0,109000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.549,5,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,-0.01,0.0,0.0,0.0,0.0,
89472,Tyler Graham,28.0,1,2.0,10,2.0,-100.0,-0.08,0.0,-0.08,grahaty01,1984.0,USA,185.0,72.0,R,R,2012-09-07,2012-10-02,2012.0,10.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,NL,ARI,W,3.0,162.0,81.0,81.0,81.0,N,N,N,734.0,5462.0,1416.0,307.0,33.0,165.0,539.0,1266.0,93.0,51.0,41.0,45.0,688.0,90.0,146.0,0.985,Arizona Diamondbacks,Chase Field,2177617.0,105.0,ARI,480000.0,0.0,0.0,3213479.0,480000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,10,2,2,1,0,0,0,0,0,0,0,1,0,2,0,0,0,0,0,0,-0.08,0.0,0.0,0.0,0.0,
75170,Jason Smith,23.0,1,1.0,2,1.0,-100.0,-0.04,-0.01,-0.03,smithja05,1977.0,USA,195.0,75.0,L,R,2001-06-17,2009-05-21,2001.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,NL,CHC,C,3.0,162.0,81.0,88.0,74.0,N,N,N,777.0,5406.0,1409.0,268.0,32.0,194.0,577.0,1077.0,67.0,36.0,66.0,53.0,701.0,109.0,113.0,0.982,Chicago Cubs,Wrigley Field,2779465.0,95.0,CHC,200000.0,0.0,0.0,2138896.0,200000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.543,2,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,-0.04,0.0,0.0,0.0,0.0,
86263,JC Boscan,30.0,1,1.0,1,1.0,0.0,0.04,0.0,0.04,boscajc01,1979.0,Venezuela,215.0,74.0,R,R,2010-10-01,2013-09-29,2010.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,NL,ATL,E,2.0,162.0,81.0,91.0,71.0,N,N,N,738.0,5463.0,1411.0,312.0,25.0,139.0,634.0,1140.0,63.0,29.0,51.0,35.0,629.0,126.0,166.0,0.98,Atlanta Braves,Turner Field,2510119.0,98.0,ATL,400000.0,0.0,0.0,3014572.0,400000.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.562,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0.04,,1.0,,,


NaN values for career statistics such as Batting Average on Balls in Play (`BABIP`), On-Base Plus Slugging (`OPS`), Slugging Percentage (`SLG`), On-Base Percentage (`OBP`), and Batting Average (`BA`) likely appear because the player had no at-bats or did not reach base in any way during their career up to that point.
This is particularly common for players in their first season, as they have not yet had the opportunity to accumulate any hits, walks, or other statistics that contribute to these metrics. As a result, the denominators in the formulas for these statistics are zero, leading to undefined values.

Same as before we will fill this career stats NaN values with zero.

In [62]:
# Fill career_BABIP, career_BA, career_OBP, career_SLG, career_OPS null values with 0
df_1985['career_BABIP'].fillna(0, inplace=True)
df_1985['career_BA'].fillna(0, inplace=True)
df_1985['career_OBP'].fillna(0, inplace=True)
df_1985['career_SLG'].fillna(0, inplace=True)
df_1985['career_OPS'].fillna(0, inplace=True)

##### Remaining Nulls

In [63]:
# Null values
df_1985.isnull().sum()[df_1985.isnull().sum() > 0]

salary_y         1
leaguerank       1
teamrank         1
averagesalary    1
leagueminimum    1
dtype: int64

We only have one row (player) remaining with null values. We will take a look at it and decide what to do.

In [64]:
# rows with salary_y null values
df_1985[df_1985['teamrank'].isnull()]

Unnamed: 0,name_common,age,stint_ID,PA,G_x,Inn,OPS_plus,WAR,WAR_def,WAR_off,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,teamName,park,attendance,BPF,teamIDBR,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,BA,OBP,1B,TB,SLG,OPS,BABIP,W%,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP
85733,Yadier Molina,26.0,1,544.0,140,1186.7,100.364778,3.15,1.88,2.1,molinya01,1982.0,P.R.,225.0,71.0,R,R,2004-06-03,2022-10-05,2009.0,140.0,481.0,45.0,141.0,23.0,1.0,6.0,54.0,9.0,3.0,50.0,39.0,2.0,6.0,6.0,1.0,27.0,NL,STL,C,1.0,162.0,81.0,91.0,71.0,Y,N,N,730.0,5465.0,1436.0,294.0,29.0,160.0,528.0,1041.0,75.0,31.0,61.0,43.0,640.0,96.0,167.0,0.985,St. Louis Cardinals,Busch Stadium III,3343252.0,98.0,STL,,,,,,0.293,0.366,111.0,184.0,0.383,0.749,0.309,0.562,669,2458,2215,189,596,456,103,2,35,263,13,12,178,202,19,20,29,16,95,808,8.22,0.269,0.327,0.365,0.692,0.281


Remaning null is from a player called "Yadier Molina" for the 2009 season. We will try to find his stats in Baseball Reference or another trusted source and fill the null values with them.

- https://www.baseball-reference.com/players/m/molinya01.shtml
- https://www.baseball-reference.com/bullpen/Minimum_salary
- https://www.baseball-reference.com/teams/STL/2009.shtml


In [65]:
# Fil Yadier Molina's salary_y for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'salary_y'] = 3312500

# Fill Yadier Molina's teamrank for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'teamrank'] = 9

# Fill Yadier Molina's average salary for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'averagesalary'] = 2996106

# Fill Yadier Molina's leagueminimum for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'leagueminimum'] = 400000   

# Fill Yadier Molina's leaguerank for 2009
df_1985.loc[(df_1985['playerID'] == 'molinya01') & (df_1985['yearID'] == 2009), 'leaguerank'] = 100



In [66]:
# null values
df_1985.isnull().sum().sum()

0

## 4. Salary Adjustments <a id='salary_adjustments'></a>
We will adjust the salaries of the players to account for inflation. The value of money changes over time due to inflation. Therefore, comparing salaries from different years without adjusting for inflation can lead to misleading results.
 - https://www.bls.gov/data/inflation_calculator.htm

In [67]:
buying_power = {1985: 2.89,
                   1986: 2.78,
                   1987: 2.74,
                   1988: 2.64,
                   1989: 2.52,
                   1990: 2.39,
                   1991: 2.27,
                   1992: 2.21,
                   1993: 2.14,
                   1994: 2.09,
                   1995: 2.03,
                   1996: 1.98,
                   1997: 1.92,
                   1998: 1.89,
                   1999: 1.86,
                   2000: 1.81,
                   2001: 1.76,
                   2002: 1.72,
                   2003: 1.68,
                   2004: 1.65,
                   2005: 1.61,
                   2006: 1.54,
                   2007: 1.51,
                   2008: 1.45,
                   2009: 1.45,
                   2010: 1.41,
                   2011: 1.39,
                   2012: 1.35,
                   2013: 1.32,
                   2014: 1.30,
                   2015: 1.31,
                   2016: 1.29,
                   2017: 1.26,
                   2018: 1.23,
                   2019: 1.21,
                   2020: 1.18,
                   2021: 1.17,
                   2022: 1.09,}

In [68]:
# Adjusted salary column
df_1985['adjusted_salary'] = round(df_1985['salary_y'] * df_1985['yearID'].map(buying_power), 0)

In [69]:
# Sanity check
df_1985[['yearID', 'salary_y', 'adjusted_salary']].sample(10)

Unnamed: 0,yearID,salary_y,adjusted_salary
96647,2018.0,545000.0,670350.0
67379,1994.0,1575000.0,3291750.0
76389,2002.0,6000000.0,10320000.0
81419,2006.0,4950000.0,7623000.0
65671,1993.0,5750000.0,12305000.0
74452,2000.0,230000.0,416300.0
60507,1987.0,200000.0,548000.0
65669,1993.0,4516666.0,9665665.0
63293,1990.0,600000.0,1434000.0
87813,2011.0,500000.0,695000.0


In [70]:
# Save a copy of the df_1985
df_1985.to_csv('baseballsalaries_1985_preclean.csv', index=False)

Lastly, we will change the column order to make it more readable.

In [71]:
column_order = [
    'playerID', 'name_common', 'birthYear', 'birthCountry', 'debut', 'finalGame',
    'age', 'weight', 'height', 'bats', 'throws',
    'yearID', 'stint_ID', 'G_x', 'G_y', 'PA', 'AB_x', 'R_x', 'H_x', '2B_x', '3B_x', 'HR_x', 'RBI', 'SB_x', 'CS_x', 'BB_x', 'SO_x', 'IBB', 'HBP_x', 'SH', 'SF_x', 'GIDP', '1B', 'BA', 'OBP', 'SLG', 'OPS', 'BABIP', 'TB', 'W%', 'Inn', 'OPS_plus', 'WAR', 'WAR_def', 'WAR_off',
    'career_G_x', 'career_PA', 'career_AB_x', 'career_R_x', 'career_H_x', 'career_1B', 'career_2B_x', 'career_3B_x', 'career_HR_x', 'career_RBI', 'career_SB_x', 'career_CS_x', 'career_BB_x', 'career_SO_x', 'career_IBB', 'career_HBP_x', 'career_SH', 'career_SF_x', 'career_GIDP', 'career_TB', 'career_WAR', 'career_BA', 'career_OBP', 'career_SLG', 'career_OPS', 'career_BABIP',
    'teamName', 'park', 'attendance', 'teamIDBR', 'lgID', 'franchID', 'divID', 'Rank', 'G', 'Ghome', 'W', 'L', 'DivWin', 'LgWin', 'WSWin', 'R_y', 'AB_y', 'H_y', '2B_y', '3B_y', 'HR_y', 'BB_y', 'SO_y', 'SB_y', 'CS_y', 'HBP_y', 'SF_y', 'RA', 'E', 'DP', 'FP', 'BPF',
    'salary_y', 'leaguerank', 'teamrank', 'averagesalary', 'leagueminimum', 'adjusted_salary'
]

In [72]:
# Sanity check
len(column_order) == len(df_1985.columns)

True

In [73]:
# Change column order
df_1985 = df_1985[column_order]

In [74]:
# sample
df_1985.sample(5)

Unnamed: 0,playerID,name_common,birthYear,birthCountry,debut,finalGame,age,weight,height,bats,throws,yearID,stint_ID,G_x,G_y,PA,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,1B,BA,OBP,SLG,OPS,BABIP,TB,W%,Inn,OPS_plus,WAR,WAR_def,WAR_off,career_G_x,career_PA,career_AB_x,career_R_x,career_H_x,career_1B,career_2B_x,career_3B_x,career_HR_x,career_RBI,career_SB_x,career_CS_x,career_BB_x,career_SO_x,career_IBB,career_HBP_x,career_SH,career_SF_x,career_GIDP,career_TB,career_WAR,career_BA,career_OBP,career_SLG,career_OPS,career_BABIP,teamName,park,attendance,teamIDBR,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,E,DP,FP,BPF,salary_y,leaguerank,teamrank,averagesalary,leagueminimum,adjusted_salary
95952,liriary01,Rymer Liriano,1991.0,D.R.,2014-08-11,2017-10-01,26.0,230.0,72.0,R,R,2017.0,1,21,21.0,46.0,41.0,4.0,9.0,2.0,0.0,1.0,6.0,1.0,0.0,5.0,14.0,0.0,0.0,0.0,0.0,0.0,6.0,0.22,0.304,0.341,0.645,0.308,14.0,0.414,118.0,75.475618,0.15,0.11,-0.05,59,167,150,17,33,27,4,0,2,12,5,1,14,53,1,2,0,1,6,43,-0.42,0.22,0.293,0.287,0.58,0.323,Chicago White Sox,Guaranteed Rate Field,1629470.0,CHW,AL,CHW,C,4.0,162.0,81.0,67.0,95.0,N,N,N,706.0,5513.0,1412.0,256.0,37.0,186.0,401.0,1397.0,71.0,31.0,76.0,33.0,820.0,114.0,157.0,0.981,98.0,535000.0,0.0,0.0,4097122.0,535000.0,674100.0
86338,larisje01,Jeff Larish,1982.0,USA,2008-05-30,2010-10-03,27.0,200.0,74.0,L,R,2010.0,2,27,27.0,75.0,67.0,5.0,12.0,3.0,0.0,2.0,9.0,1.0,0.0,7.0,24.0,0.0,1.0,0.0,0.0,2.0,7.0,0.179,0.267,0.313,0.58,0.244,21.0,0.5,171.0,75.964963,-0.29,-0.12,-0.29,101,276,245,30,55,34,12,1,8,32,3,3,29,83,0,1,0,1,8,93,-0.59,0.224,0.308,0.38,0.688,0.303,Detroit Tigers,Comerica Park,2461237.0,DET,AL,DET,C,3.0,162.0,81.0,81.0,81.0,N,N,N,751.0,5643.0,1515.0,308.0,32.0,152.0,546.0,1147.0,69.0,30.0,41.0,41.0,743.0,109.0,171.0,0.982,101.0,400000.0,0.0,0.0,3014572.0,400000.0,564000.0
67682,penato01,Tony Peña,1957.0,D.R.,1980-09-01,1997-09-28,37.0,175.0,72.0,R,R,1994.0,1,40,40.0,126.0,112.0,18.0,33.0,8.0,1.0,2.0,10.0,0.0,1.0,9.0,11.0,0.0,0.0,3.0,2.0,6.0,22.0,0.295,0.341,0.438,0.779,0.307,49.0,0.584,300.0,101.512359,0.52,0.11,0.62,1790,6501,5966,622,1569,1166,275,27,101,643,79,62,416,761,71,22,64,33,214,2201,26.16,0.263,0.312,0.369,0.681,0.286,Cleveland Guardians,Jacobs Field,1995174.0,CLE,AL,CLE,C,2.0,113.0,51.0,66.0,47.0,N,N,N,679.0,4022.0,1165.0,240.0,20.0,167.0,382.0,629.0,131.0,48.0,18.0,38.0,562.0,90.0,119.0,0.98,99.0,400000.0,397.0,18.0,1168263.0,109000.0,836000.0
65555,mcintti01,Tim McIntosh,1965.0,USA,1990-09-03,1996-06-12,27.0,195.0,71.0,R,R,1992.0,1,35,35.0,84.0,77.0,7.0,14.0,3.0,0.0,0.0,6.0,1.0,3.0,3.0,9.0,0.0,2.0,1.0,1.0,1.0,11.0,0.182,0.229,0.221,0.45,0.203,17.0,0.568,202.3,28.141051,-0.68,-0.14,-0.54,47,100,93,10,19,13,4,0,2,8,1,3,3,15,0,2,1,1,1,29,-0.67,0.204,0.242,0.312,0.554,0.221,Milwaukee Brewers,County Stadium,1857351.0,MIL,AL,MIL,E,2.0,162.0,81.0,92.0,70.0,N,N,N,740.0,5504.0,1477.0,272.0,35.0,82.0,511.0,779.0,256.0,115.0,33.0,72.0,604.0,89.0,146.0,0.986,99.0,109000.0,0.0,0.0,1028667.0,109000.0,240890.0
92198,deazaal01,Alejandro De Aza,1984.0,D.R.,2007-04-02,2017-10-01,31.0,195.0,72.0,L,L,2015.0,3,114,114.0,365.0,325.0,51.0,85.0,17.0,7.0,7.0,35.0,7.0,5.0,31.0,84.0,3.0,5.0,2.0,2.0,6.0,54.0,0.262,0.333,0.422,0.755,0.331,137.0,0.5,755.9,300.352677,0.72,-1.05,1.34,680,2541,2279,328,609,414,120,30,45,224,86,41,196,539,11,28,23,15,26,924,7.71,0.267,0.331,0.405,0.736,0.33,Baltimore Orioles,Oriole Park at Camden Yards,2281202.0,BAL,AL,BAL,E,3.0,162.0,78.0,81.0,81.0,N,N,N,713.0,5485.0,1370.0,246.0,20.0,217.0,418.0,1331.0,44.0,25.0,51.0,32.0,693.0,77.0,134.0,0.987,103.0,5000000.0,234.0,7.0,3952252.0,507500.0,6550000.0


In [75]:
# min salary
df_1985['adjusted_salary'].min()

72500.0

In [76]:
# Save a copy of the df_1985
df_1985.to_csv('baseballsalaries_1985.csv', index=False)

In [77]:
df_1985.dtypes

playerID            object
name_common         object
birthYear          float64
birthCountry        object
debut               object
finalGame           object
age                float64
weight             float64
height             float64
bats                object
throws              object
yearID             float64
stint_ID             int64
G_x                  int64
G_y                float64
PA                 float64
AB_x               float64
R_x                float64
H_x                float64
2B_x               float64
3B_x               float64
HR_x               float64
RBI                float64
SB_x               float64
CS_x               float64
BB_x               float64
SO_x               float64
IBB                float64
HBP_x              float64
SH                 float64
SF_x               float64
GIDP               float64
1B                 float64
BA                 float64
OBP                float64
SLG                float64
OPS                float64
B

## 5. Next Steps <a id='next_steps'></a>

Now that we have a clean dataset, we are ready to move on to the next step of our project: 

- __Exploratory Data Analysis (EDA):__ Perform an in-depth exploratory data analysis to uncover insights, patterns, and relationships within the preprocessed data. Utilize visualizations, statistical analysis, and other techniques to understand the distribution, correlations, and trends present in the data. This stage will provide valuable insights that can guide further analysis and modeling decisions.

- __Feature Engineering:__ Engage in feature engineering to enhance the dataset for modeling purposes. This includes selecting relevant features, transforming existing features, and potentially creating new features based on domain knowledge and insights gained from the EDA. Iteratively refine the feature set to improve model performance and align it with the project's objectives.

Please note that the current state of the preprocessed data is not the final form. More features can be added or removed during the feature engineering phase to further optimize our models and increase their predictive power.

