# Predicting MLB Player Salaries: A Batting Performance Analysis
---
# Raw Datasets exploration, cleaning and joins


## Contents:
1. [Introduction](#introduction)
2. [Basic Data Exploration and cleaning](#basic-data-exploration)
    - 2.1 [WAR Dataset](#war-dataset)
    - 2.2 [Batting Dataset](#batting-dataset)
    - 2.3 [People Dataset](#people-dataset)
    - 2.4 [Salary Dataset](#salary-dataset)
    - 2.5 [Teams Dataset](#teams-dataset)
3. [Table Joins](#table-joins)
4. [Next Steps](#summary)

---

## 1. Introduction

This notebook provides a first exploration of several datasets related to baseball statistics. The datasets cover player performance, team performance, player biographical information, and salaries. The goal is to understand the structure and content of these datasets and to perform basic cleaning and preparation so the data is well suited for subsequent analysis. Finally, we merge the datasets into a single, comprehensive one for further exploration and analysis.


In [9066]:
# Import necessary libraries
import pandas as pd
import numpy as np

#### Dataset Loading <a class="anchor" id="dataset-loading"></a>

In [9067]:
# Load datasets
raw_war = pd.read_csv('Raw datasets/war_daily_bat.csv')
raw_teams = pd.read_csv('Raw datasets/Teams.csv')
raw_batting = pd.read_csv('Raw datasets/Batting.csv')
raw_fielding = pd.read_csv('Raw datasets/Fielding.csv')
raw_people = pd.read_csv('Raw datasets/People.csv')
raw_salaries = pd.read_csv('Raw datasets/salary_history.csv')

---
## 2. Basic Data Exploration and Cleaning  <a class="anchor" id="basic-data-exploration"></a>  
We'll start by examining the shape and features of each dataset, and then perform some basic cleaning and preparation to ensure the data is well-suited for subsequent analysis. This includes converting data types where necessary, and dropping irrelevant columns. We'll also address any inconsistencies or anomalies we come across during our exploration.



### 2.1 WAR (Wins Above Replacement) Dataset <a class="anchor" id="war-dataset"></a>

In [9068]:
pd.set_option('display.max_columns', None)
# Shape and first rows
print(f'raw_war shape: {raw_war.shape}')
print(f'Rows: {raw_war.shape[0]}')
print(f'Columns: {raw_war.shape[1]}')
raw_war.head()

raw_war shape: (121375, 48)
Rows: 121375
Columns: 48


Unnamed: 0,name_common,age,mlb_ID,player_ID,year_ID,team_ID,stint_ID,lg_ID,PA,G,Inn,runs_bat,runs_br,runs_dp,runs_field,runs_infield,runs_outfield,runs_catcher,runs_defense,runs_position,runs_position_p,runs_replacement,runs_above_rep,runs_above_avg,runs_above_avg_off,runs_above_avg_def,WAA,WAA_off,WAA_def,WAR,WAR_def,WAR_off,WAR_rep,salary,pitcher,teamRpG,oppRpG,oppRpPA_rep,oppRpG_rep,pyth_exponent,pyth_exponent_rep,waa_win_perc,waa_win_perc_off,waa_win_perc_def,waa_win_perc_rep,OPS_plus,TOB_lg,TB_lg
0,David Aardsma,22.0,430911.0,aardsda01,2004,SFG,1,NL,0.0,11,10.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.01,0.0,-0.01,0.0,0.0,300000.0,Y,4.67092,4.67092,0.08651,4.67092,1.89,1.89,0.5,0.5,0.5,0.5,,0.0,0.0
1,David Aardsma,24.0,430911.0,aardsda01,2006,CHC,1,NL,3.0,43,53.0,-0.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.46,0.0,-0.4,-0.4,-0.4,0.0,-0.04,-0.04,-0.01,-0.04,-0.01,-0.04,0.0,,Y,4.85675,4.86675,0.09085,4.86457,1.912,1.913,0.499,0.499,0.5,0.4998,-100.0,0.694,0.896
2,David Aardsma,25.0,430911.0,aardsda01,2007,CHW,1,AL,0.0,2,32.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,387500.0,Y,4.85895,4.85895,0.08422,4.85895,1.912,1.912,0.5,0.5,0.5,0.5,,0.0,0.0
3,David Aardsma,26.0,430911.0,aardsda01,2008,BOS,1,AL,1.0,5,48.7,-0.29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14,0.0,-0.2,-0.2,-0.2,0.0,-0.02,-0.02,0.0,-0.02,0.0,-0.02,0.0,403250.0,Y,4.674,4.704,0.08092,4.6965,1.893,1.894,0.497,0.497,0.5,0.4992,-100.0,0.345,0.434
4,David Aardsma,27.0,430911.0,aardsda01,2009,SEA,1,AL,0.0,3,71.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,419000.0,Y,4.79788,4.79788,0.08302,4.79788,1.905,1.905,0.5,0.5,0.5,0.5,,0.0,0.0


##### Dictionary

| Column Name | Description |  
| --- | --- |
| name_common | Player name |
| age | Player age |
| mlb_ID | MLB ID |
| player_ID | Player ID |
| year_ID | Year (Season played) |
| team_ID | ID of the team they played for the season |
| stint_ID | The stint of the player in that year (a player can have more than one stint in a year if they moved teams). |
| lg_ID | The league the player played in |
| PA | Plate appearances |
| G | Games played |
| Inn | Innings played |
| runs_bat | Runs from batting |
| runs_br | Runs from baserunning |
| runs_dp | Runs from avoiding double plays |
| runs_field | Runs from fielding |
| runs_infield | Runs from infield |
| runs_outfield | Runs from outfield |
| runs_catcher | Runs from catching |
| runs_defense | Runs from defense |
| runs_position | Runs from position |
| runs_position_p | Runs from position as pitcher |
| runs_replacement | Runs from replacement |
| runs_above_rep | Runs above replacement |
| runs_above_avg | Runs above average |
| runs_above_avg_off | Runs above average as batter |  
| runs_above_avg_def | Runs above average as fielder |
| WAA | Wins above average |
| WAA_off | Wins above average as batter |
| WAA_def | Wins above average as fielder |
| WAR | Wins above replacement |
| WAR_def | Wins above replacement as fielder |
| WAR_off | Wins above replacement as batter |
| WAR_rep | Wins above replacement as replacement |
| salary | Player salary |
| pitcher | Whether the player is a pitcher |
| teamRpG | Team runs per game |
| oppRpG | Opponent runs per game |
| oppRpPA_rep | Opponent runs per plate appearance as replacement |
| oppRpG_rep | Opponent runs per game as replacement |
| pyth_exponent | Pythagorean exponent |
| pyth_exponent_rep | Pythagorean exponent as replacement |
| waa_win_perc | Win percentage above average |
| waa_win_perc_off | Win percentage above average as batter |
| waa_win_perc_def | Win percentage above average as fielder |
| waa_win_perc_rep | Win percentage above average as replacement |
| OPS_plus | OPS+ |
| TOB_lg | Times on base in league |
| TB_lg | Total bases in league |

The WAR dataset provides a comprehensive view of player performance, capturing the value of a player in all facets of the game including batting, baserunning, fielding, and pitching. The dataset consists of 121,375 rows and 48 columns.

The dataset also includes a variety of other metrics related to player performance. The goal of the WAR metric, in particular, is to summarize a player's total contributions to their team in one statistic.


In [9069]:
# Data types
raw_war.dtypes

name_common            object
age                   float64
mlb_ID                float64
player_ID              object
year_ID                 int64
team_ID                object
stint_ID                int64
lg_ID                  object
PA                    float64
G                       int64
Inn                   float64
runs_bat              float64
runs_br               float64
runs_dp               float64
runs_field            float64
runs_infield          float64
runs_outfield         float64
runs_catcher          float64
runs_defense          float64
runs_position         float64
runs_position_p       float64
runs_replacement      float64
runs_above_rep        float64
runs_above_avg        float64
runs_above_avg_off    float64
runs_above_avg_def    float64
WAA                   float64
WAA_off               float64
WAA_def               float64
WAR                   float64
WAR_def               float64
WAR_off               float64
WAR_rep               float64
salary    

It looks like all the data types are correct, so we can move on to the next step.
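In the dtype listing above, `age` and `mlb_ID` are stored as `float64`. In pandas this usually means an otherwise integer column contains missing values. If integer semantics matter later (for example, when joining on `mlb_ID`), a nullable integer dtype is one option. A minimal sketch on a toy frame; the values are illustrative and the conversion is not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_war; the values are illustrative only.
toy = pd.DataFrame({
    "mlb_ID": [430911.0, np.nan, 608336.0],
    "age": [22.0, 24.0, np.nan],
})

# "Int64" (capital I) is pandas' nullable integer dtype: it keeps missing
# values as <NA> instead of forcing the whole column to float.
toy_int = toy.astype({"mlb_ID": "Int64", "age": "Int64"})

print(toy_int.dtypes)
print(toy_int)
```

The conversion only succeeds when every non-missing value is a whole number, which makes it a cheap consistency check as well.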


Most of these columns are complex, difficult to interpret, and do not add much value to our analysis. Therefore, we will drop a subset of them to simplify the dataset and make it easier to work with, keeping only the columns that are most relevant to our analysis.

In [9070]:
cols_to_keep = ['name_common',
 'age',
 'mlb_ID',
 'player_ID',
 'year_ID',
 'team_ID',
 'stint_ID',
 'lg_ID',
 'PA',
 'G',
 'Inn', 'salary', 'OPS_plus', 'WAR', 'WAR_def', 'WAR_off', 'pitcher']

The columns we are keeping include basic player information (like name, age, and team), performance metrics that are easy to interpret (like Plate Appearances (`PA`), Games played (`G`), and Innings played (`Inn`)), and metrics that are likely to be relevant to salary (like Wins Above Replacement (`WAR`), defensive and offensive WAR, and whether the player is a pitcher). We're also keeping the `OPS_plus` column, which is a more advanced metric but is widely used and understood in baseball analytics.

In [9071]:
# Drop columns
raw_war = raw_war[cols_to_keep]

In [9072]:
# Drop rows with year = 2023
raw_war = raw_war[raw_war['year_ID'] != 2023]

Some pre-cleaning

In [9073]:
# Remove special characters and punctuation from name_common
raw_war['name_common'] = raw_war['name_common'].str.replace(r"[\"\',.]", '', regex=True)

# # lower case
# raw_war['name_common'] = raw_war['name_common'].str.lower()
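From pandas 2.0 onward, `Series.str.replace` treats its pattern as a literal string unless `regex=True` is passed, so it helps to state the intent explicitly. A small sketch on sample names, using a hypothetical helper `strip_punctuation` that is not part of the notebook:

```python
import pandas as pd


def strip_punctuation(names: pd.Series) -> pd.Series:
    """Remove double quotes, apostrophes, commas and periods from each name."""
    # regex=True makes pandas interpret the pattern as a character class.
    return names.str.replace(r"[\"',.]", "", regex=True)


# Sample names for illustration; spellings may differ from the dataset.
sample = pd.Series(["J.D. Martinez", "Travis d'Arnaud", "Ken Griffey, Jr."])
cleaned = strip_punctuation(sample)
print(cleaned.tolist())
```

Applying the same helper to every dataset that carries player names would keep the cleaning rules consistent across tables.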



In [9074]:
raw_war.shape

(120757, 17)

##### `stint_ID` column
In our dataset, a player can have multiple entries for a single season due to having multiple stints. This can occur when a player is traded or moves teams during the season. Each stint is recorded as a separate row, which is not ideal for our analysis as we want a single row per player per season.

To resolve this, we'll aggregate the data at the player and season level. This means we'll combine the statistics for all stints a player had in a single season.

In [9075]:
# New dataframe to preserve the first team the player played for in a season
raw_war_teams = raw_war[['player_ID', 'year_ID', 'team_ID', 'stint_ID']]

# drop rows with stint_ID > 1
raw_war_teams = raw_war_teams[raw_war_teams['stint_ID'] == 1]

# Drop stint_ID column
raw_war_teams.drop('stint_ID', axis=1, inplace=True)

In [9076]:
# group raw_war 
raw_war_grouped = raw_war.groupby(['name_common', 'age', 'year_ID', 'player_ID', 'pitcher']).sum(numeric_only=True).reset_index()
raw_war_grouped.shape



(109262, 15)

In [9077]:
# Join raw_war_grouped with raw_war_teams
war_pre = raw_war_grouped.merge(raw_war_teams, on=['player_ID', 'year_ID'])

In [9078]:
# stint_ID was summed in the groupby (1, 3, 6, 10, 15); map it back to the number of stints (1 to 5)
war_pre['stint_ID'] = np.where(war_pre['stint_ID'] == 1, 1, 
                                        np.where(war_pre['stint_ID'] == 3, 2, 
                                            np.where(war_pre['stint_ID'] == 6, 3, 
                                                np.where(war_pre['stint_ID'] == 10, 4, 
                                                    np.where(war_pre['stint_ID'] == 15, 5, 0)))))
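The mapping above works because `stint_ID` was summed during the groupby: a player with stints 1, 2, ..., n ends up with the triangular number n(n+1)/2, that is 1, 3, 6, 10 or 15. Assuming stints are always numbered consecutively from 1, the same conversion could be sketched with a lookup table instead of nested `np.where` calls. The toy frame and the `n_stints` column name below are hypothetical:

```python
import pandas as pd

# Summed stint_ID -> number of stints, i.e. n*(n+1)/2 -> n for n = 1..5.
SUM_TO_STINTS = {n * (n + 1) // 2: n for n in range(1, 6)}

# Hypothetical summed values, including one (4) that is not triangular.
toy = pd.DataFrame({"stint_ID": [1, 3, 6, 10, 15, 4]})

# Sums missing from the table become 0, matching the fallback value
# used by the nested np.where expression.
toy["n_stints"] = toy["stint_ID"].map(SUM_TO_STINTS).fillna(0).astype(int)
print(toy)
```

A lookup table keeps the rule in one place and is easier to extend if a player ever has more than five stints in a season.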

In [9079]:
# Sanity check, check for players with more than one team in a year
# Joey Gallo
war_pre[war_pre['name_common'] == 'Joey Gallo']

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G,Inn,salary,OPS_plus,WAR,WAR_def,WAR_off,team_ID
56527,Joey Gallo,21.0,2015,gallojo01,N,608336.0,1,123.0,36,264.0,0.0,91.332137,0.32,0.07,0.21,TEX
56528,Joey Gallo,22.0,2016,gallojo01,N,608336.0,1,30.0,17,72.0,0.0,-2.86383,-0.52,-0.24,-0.32,TEX
56529,Joey Gallo,23.0,2017,gallojo01,N,608336.0,1,532.0,145,1208.0,537120.0,118.426407,2.92,-0.63,3.32,TEX
56530,Joey Gallo,24.0,2018,gallojo01,N,608336.0,1,577.0,148,1199.0,560000.0,108.778664,2.41,-0.01,1.86,TEX
56531,Joey Gallo,25.0,2019,gallojo01,N,608336.0,1,297.0,70,614.3,605500.0,144.925708,3.08,0.47,2.54,TEX
56532,Joey Gallo,26.0,2020,gallojo01,N,608336.0,1,226.0,57,475.7,4400000.0,86.573416,1.51,1.11,0.2,TEX
56533,Joey Gallo,27.0,2021,gallojo01,N,1216672.0,2,616.0,153,1304.4,6200000.0,230.520541,4.61,0.92,3.04,TEX
56534,Joey Gallo,28.0,2022,gallojo01,N,1216672.0,2,410.0,126,944.4,10275000.0,162.570976,0.23,-0.13,-0.18,NYY
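In the 2021 and 2022 rows above, `mlb_ID` is doubled (1216672 instead of 608336) and `OPS_plus` is far higher than in the single-stint seasons, because `.sum()` adds every numeric column across stints, including identifiers and rate statistics. One possible alternative, shown as a sketch on made-up numbers rather than as the notebook's method, is to pick an aggregation per column. Weighting `OPS_plus` by plate appearances is an approximation assumed here purely for illustration:

```python
import pandas as pd

# Two made-up stints for one player-season (values are illustrative).
toy = pd.DataFrame({
    "player_ID": ["playerx01", "playerx01"],
    "year_ID": [2021, 2021],
    "mlb_ID": [123456.0, 123456.0],
    "stint_ID": [1, 2],
    "PA": [400.0, 200.0],
    "WAR": [3.0, 1.0],
    "OPS_plus": [140.0, 110.0],
})

# Pre-compute the PA-weighted numerator so it can be summed per group.
toy["ops_x_pa"] = toy["OPS_plus"] * toy["PA"]

season = (
    toy.groupby(["player_ID", "year_ID"], as_index=False)
    .agg(
        mlb_ID=("mlb_ID", "first"),      # identifier: keep it, do not add it
        n_stints=("stint_ID", "count"),  # number of stints in the season
        PA=("PA", "sum"),                # counting stats are additive
        WAR=("WAR", "sum"),
        ops_x_pa=("ops_x_pa", "sum"),
    )
)

# Approximate the season OPS+ as a PA-weighted mean of the stint values.
season["OPS_plus"] = season["ops_x_pa"] / season["PA"]
season = season.drop(columns="ops_x_pa")
print(season)
```

With this approach the stint count comes directly from `count`, so no conversion from summed stint numbers is needed.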


In [9080]:
# Shape
print(f'raw_war shape: {raw_war.shape}')

raw_war shape: (120757, 17)


In [9081]:
# Save to csv
war_pre.to_csv('pre_datasets/war_pre.csv', index=False)

### 2.2 Batting Dataset <a class="anchor" id="batting-dataset"></a>



In [9082]:
# Shape and first rows
print(f'raw_batting shape: {raw_batting.shape}')
print(f'Rows: {raw_batting.shape[0]}')
print(f'Columns: {raw_batting.shape[1]}')
raw_batting.sample(5)

raw_batting shape: (112184, 22)
Rows: 112184
Columns: 22


Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
91972,grillja01,2009,2,TEX,AL,30,0,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
15109,smithre02,1914,1,BRO,NL,90,330,39,81,10,8,4,48.0,11.0,,30,26.0,,1.0,13.0,,
27773,gumpera01,1938,1,PHA,AL,4,4,1,1,0,0,0,1.0,0.0,0.0,1,1.0,,0.0,0.0,,
93930,rosalad01,2010,1,OAK,AL,80,255,31,69,8,2,7,31.0,2.0,2.0,19,65.0,0.0,1.0,2.0,2.0,1.0
17120,aldrivi01,1918,1,CHN,NL,3,3,1,1,1,0,0,1.0,0.0,,0,0.0,,0.0,1.0,,


##### Dictionary

| Column Name | Description |
| --- | --- |
| playerID | Player ID |
| yearID | Year (Season played) |
| stint | The stint of the player in that year (a player can have more than one stint in a year if they moved teams). |
| teamID | ID of the team they played for the season |
| lgID | The league the player played in |
| G | Games played |
| AB | At bats |
| R | Runs |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR | Homeruns |
| RBI | Runs batted in |
| SB | Stolen bases |
| CS | Caught stealing |
| BB | Walks |
| SO | Strikeouts |
| IBB | Intentional walks |
| HBP | Hit by pitch |
| SH | Sacrifice hits |
| SF | Sacrifice flies |
| GIDP | Grounded into double plays |



The batting dataset is a comprehensive collection of player batting statistics. It contains 112,184 rows and 22 columns, each row representing a player's performance during one stint of a particular season.

The dataset includes both pitchers and position players. Pitchers typically have fewer at-bats and different offensive statistics than position players, which is something to keep in mind during the analysis.

In [9083]:
# Data types
raw_batting.dtypes

playerID     object
yearID        int64
stint         int64
teamID       object
lgID         object
G             int64
AB            int64
R             int64
H             int64
2B            int64
3B            int64
HR            int64
RBI         float64
SB          float64
CS          float64
BB            int64
SO          float64
IBB         float64
HBP         float64
SH          float64
SF          float64
GIDP        float64
dtype: object

It looks like all the data types are correct, so we can move on to the next step.

For now, we will keep all the columns in the batting dataset. We can always drop unnecessary columns later once we understand the data better.

##### `stint` column
As in the WAR dataset, a player can have multiple entries for a single season due to having multiple stints. This can occur when a player is traded or moves teams during the season. Each stint is recorded as a separate row, which is not ideal for our analysis as we want a single row per player per season.

To resolve this, we'll follow the same process as we did for the WAR dataset.

In [9084]:
# New dataframe 
raw_batting_teams = raw_batting[['playerID', 'yearID', 'teamID', 'stint']]
raw_batting_teams.shape

(112184, 4)

In [9085]:
# drop rows with stint > 1
raw_batting_teams = raw_batting_teams[raw_batting_teams['stint'] == 1]

# drop stint column
raw_batting_teams.drop('stint', axis=1, inplace=True)

In [9086]:
# group raw_batting 
raw_batting_grouped = raw_batting.groupby(['playerID', 'yearID']).sum(numeric_only=True).reset_index()



In [9087]:
# Join raw_batting_grouped with raw_batting_teams 
batting_pre = raw_batting_grouped.merge(raw_batting_teams, on=['playerID', 'yearID'])
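Because this merge is expected to attach exactly one first-stint team to each player-season, `DataFrame.merge` can be asked to verify that assumption. A minimal sketch on toy frames (the IDs and numbers are invented), using the `validate` and `indicator` parameters:

```python
import pandas as pd

# Invented player-season totals and first-stint teams.
grouped = pd.DataFrame({
    "playerID": ["aaa01", "bbb01", "ccc01"],
    "yearID": [2021, 2021, 2021],
    "HR": [38, 5, 12],
})
teams = pd.DataFrame({
    "playerID": ["aaa01", "bbb01"],
    "yearID": [2021, 2021],
    "teamID": ["TEX", "NYA"],
})

# validate raises a MergeError if either side has duplicate keys;
# indicator adds a _merge column showing where each row came from.
checked = grouped.merge(
    teams,
    on=["playerID", "yearID"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Player-seasons without a matching first-stint team.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["playerID", "yearID"]])
```

The default inner merge would silently drop a row like `ccc01`; a left merge with `indicator=True` makes such rows visible before deciding what to do with them.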

In [9088]:
# Sanity check, check for players with more than one team in a year
# Joey Gallo
batting_pre[batting_pre['playerID'] == 'gallojo01']

Unnamed: 0,playerID,yearID,stint,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,teamID
31674,gallojo01,2015,1,36,108,16,22,3,1,6,14.0,3.0,0.0,15,57.0,3.0,0.0,0.0,0.0,0.0,TEX
31675,gallojo01,2016,1,17,25,2,1,0,0,1,1.0,1.0,0.0,5,19.0,0.0,0.0,0.0,0.0,0.0,TEX
31676,gallojo01,2017,1,145,449,85,94,18,3,41,80.0,7.0,2.0,75,196.0,1.0,8.0,0.0,0.0,3.0,TEX
31677,gallojo01,2018,1,148,500,82,103,24,1,40,92.0,3.0,4.0,74,207.0,4.0,3.0,0.0,0.0,3.0,TEX
31678,gallojo01,2019,1,70,241,54,61,15,1,22,49.0,4.0,2.0,52,114.0,4.0,2.0,1.0,1.0,0.0,TEX
31679,gallojo01,2020,1,57,193,23,35,8,0,10,26.0,2.0,0.0,29,79.0,2.0,4.0,0.0,0.0,0.0,TEX
31680,gallojo01,2021,3,153,498,90,99,13,1,38,77.0,6.0,0.0,111,213.0,5.0,6.0,0.0,1.0,6.0,TEX
31681,gallojo01,2022,3,126,350,48,56,8,2,19,47.0,3.0,0.0,56,163.0,0.0,3.0,0.0,1.0,0.0,NYA


In [9089]:
# drop stint column
batting_pre.drop('stint', axis=1, inplace=True)

In [9090]:
# Save to csv
batting_pre.to_csv('pre_datasets/batting_pre.csv', index=False)

### 2.3 People Dataset <a class="anchor" id="people-dataset"></a>



In [9091]:
# Shape and first rows
print(f'raw_people shape: {raw_people.shape}')
print(f'Rows: {raw_people.shape[0]}')
print(f'Columns: {raw_people.shape[1]}')
raw_people.head()


raw_people shape: (20811, 24)
Rows: 20811
Columns: 24


Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,2021.0,1.0,22.0,USA,GA,Atlanta,Hank,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


##### Dictionary

| Column Name | Description |
| --- | --- |
| playerID | Player ID |
| birthYear | Year of birth |
| birthMonth | Month of birth |
| birthDay | Day of birth |
| birthCountry | Country of birth |
| birthState | State of birth |
| birthCity | City of birth |
| deathYear | Year of death |
| deathMonth | Month of death |
| deathDay | Day of death |
| deathCountry | Country of death |
| deathState | State of death |
| deathCity | City of death |
| nameFirst | First name |
| nameLast | Last name |
| nameGiven | Full name |
| weight | Weight in pounds |
| height | Height in inches |
| bats | Batting hand (left, right, or both) |
| throws | Throwing hand (left or right) |
| debut | Date of MLB debut |
| finalGame | Date of final MLB game |
| retroID | Retro ID |
| bbrefID | Baseball Reference ID |


The people dataset contains biographical information about baseball players. This dataset has 20,811 rows and 24 columns. Each row represents a player, and the columns provide various details about the player, such as their birth and death details, physical attributes, and career details.

This dataset will be useful for adding context to our analysis, such as the player's age during each season, their physical attributes, and their career span.

In [9092]:
# Data types
raw_people.dtypes

playerID         object
birthYear       float64
birthMonth      float64
birthDay        float64
birthCountry     object
birthState       object
birthCity        object
deathYear       float64
deathMonth      float64
deathDay        float64
deathCountry     object
deathState       object
deathCity        object
nameFirst        object
nameLast         object
nameGiven        object
weight          float64
height          float64
bats             object
throws           object
debut            object
finalGame        object
retroID          object
bbrefID          object
dtype: object

It looks like all the data types are correct, so we can move on to the next step.

##### Personal columns

Some columns in this dataset are not relevant to our project and will be dropped. These are mostly personal details about the player, such as given name, birthplace (state and city), and the date and place of death.

Personal columns are not directly related to the project's analysis of batting performance. By removing them from the dataset, we can eliminate unnecessary personal information and narrow the scope of the project to the pertinent variables.

In [9093]:
# List of personal columns that are not relevant
personal_columns = ['nameGiven','birthState', 'birthCity', 'deathYear', 'deathMonth', 'deathDay', 'deathCountry', 'deathState', 'deathCity']

In [9094]:
# Drop personal columns
raw_people.drop(personal_columns, axis=1, inplace=True)

Some pre-cleaning

In [9095]:
# Remove special characters and punctuation from nameFirst and nameLast
raw_people['nameFirst'] = raw_people['nameFirst'].str.replace(r"[\"\',]", '', regex=True)
raw_people['nameLast'] = raw_people['nameLast'].str.replace(r"[\"\',]", '', regex=True)

# Remove periods (regex=False so '.' is treated as a literal character)
raw_people['nameFirst'] = raw_people['nameFirst'].str.replace('.', '', regex=False)
raw_people['nameLast'] = raw_people['nameLast'].str.replace('.', '', regex=False)

raw_people['nameFirst'] = raw_people['nameFirst'].str.replace(' ', '')

# Lower case
raw_people['nameFirst'] = raw_people['nameFirst'].str.lower()
raw_people['nameLast'] = raw_people['nameLast'].str.lower()



In [9096]:
# Shape
raw_people.shape

(20811, 15)

In [9097]:
# Create a copy and save it to csv
people_pre = raw_people.copy()
people_pre.to_csv('pre_datasets/people_pre.csv', index=False)

### 2.4 Salary Dataset <a class="anchor" id="salary-dataset"></a>

In [9098]:
# Shape and first rows
print(f'raw_salaries shape: {raw_salaries.shape}')
print(f'Rows: {raw_salaries.shape[0]}')
print(f'Columns: {raw_salaries.shape[1]}')
raw_salaries.head() 

raw_salaries shape: (46450, 19)
Rows: 46450
Columns: 19


Unnamed: 0,firstname,lastname,playerid,mlbid,year,salary,TeamName,age,leaguerank,teamrank,averagesalary,leagueminimum,serviceTime,borndate,name,first3,last3,first3last3,middle2
0,David,Aardsma,25001,430911,2004,300000,San Francisco Giants,22,0,0,2313535,300000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi
1,David,Aardsma,25001,430911,2006,327000,Chicago Cubs,24,0,0,2699292,327000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi
2,David,Aardsma,25001,430911,2007,387500,Chicago White Sox,25,675,23,2824751,380000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi
3,David,Aardsma,25001,430911,2008,403250,Boston Red Sox,26,632,27,2925679,390000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi
4,David,Aardsma,25001,430911,2009,419000,Seattle Mariners,27,635,20,2996106,400000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi


##### Dictionary

| Column Name | Description |
| --- | --- |
| firstname | The player's First name |
| lastname | The player's Last name |
| playerid | A unique identifier for each player |
| mlbid | A unique identifier for each player |
| year | Year of the salary |
| salary | Player salary for the year |
| age | The player's age during the year of the salary |
| leaguerank | The player's rank in the league based on salary |
| teamrank | The player's rank in the team based on salary |
| averagesalary | The average salary in the league for the year |
| leagueminimum | The minimum salary in the league for the year |
| serviceTime | The player's service time in the league |
| borndate | The player's birth date |

The salaries dataset provides detailed information about the salaries of baseball players. This dataset contains 46,450 rows and 19 columns. Each row represents a player's salary for a specific year, and the columns provide various details about the salary, the player, and the team they played for.

This dataset will be crucial for our analysis as it provides the target variable we want to predict - the player's salary. It also provides context about how the player's salary compares to others in the league and on their team.

Upon initial inspection of the salaries dataset, we've noticed that the `serviceTime` column contains several null values.

Additionally, this column doesn't seem to provide any crucial information for our analysis. Therefore, we'll drop this column from the dataset.

In [9099]:
# Null values
raw_salaries.isnull().sum()

firstname        0
lastname         0
playerid         0
mlbid            0
year             0
salary           0
TeamName         0
age              0
leaguerank       0
teamrank         0
averagesalary    0
leagueminimum    0
serviceTime      0
borndate         0
name             0
first3           0
last3            0
first3last3      0
middle2          2
dtype: int64

In [9100]:
# 'Null' in serviceTime column
raw_salaries[raw_salaries['serviceTime'] == 'Null'].shape

(35121, 19)

It looks like there are almost no true null values in the dataset (only two, in `middle2`); 'Null' in `serviceTime` is just a string, and it appears in 35,121 rows. We are going to drop the `serviceTime` column, since it is mostly empty and doesn't provide any crucial information for our analysis.
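Since `isnull()` only detects real missing values, a placeholder string such as 'Null' goes unnoticed. A small sketch on an invented CSV snippet showing two ways to surface it: counting the placeholder after loading, or declaring it through the `na_values` parameter of `pd.read_csv`:

```python
import io

import pandas as pd

# Invented CSV snippet that uses the string 'Null' as a placeholder.
csv_text = "name,serviceTime\nplayer a,Null\nplayer b,3.045\nplayer c,Null\n"

# Loaded as-is, the placeholder is an ordinary string, so isna() misses it.
raw = pd.read_csv(io.StringIO(csv_text))
n_detected = raw["serviceTime"].isna().sum()
n_placeholder = (raw["serviceTime"] == "Null").sum()

# Declaring the placeholder at load time turns it into a real missing value.
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["Null"])
n_missing = parsed["serviceTime"].isna().sum()

print(n_detected, n_placeholder, n_missing)
```

Declaring placeholders at load time means later null checks report them without any extra string comparison.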

In [9101]:
# Drop serviceTime column
raw_salaries.drop('serviceTime', axis=1, inplace=True)

In [9102]:
# Data types
raw_salaries.dtypes

firstname        object
lastname         object
playerid          int64
mlbid             int64
year              int64
salary            int64
TeamName         object
age               int64
leaguerank        int64
teamrank          int64
averagesalary     int64
leagueminimum     int64
borndate         object
name             object
first3           object
last3            object
first3last3      object
middle2          object
dtype: object

Some pre-cleaning

In [9103]:
# Remove special characters and punctuation from firstname and lastname
raw_salaries['firstname'] = raw_salaries['firstname'].str.replace(r"[\"\',]", '', regex=True)
raw_salaries['lastname'] = raw_salaries['lastname'].str.replace(r"[\"\',]", '', regex=True)

# Remove periods (regex=False so '.' is treated as a literal character)
raw_salaries['firstname'] = raw_salaries['firstname'].str.replace('.', '', regex=False)
raw_salaries['lastname'] = raw_salaries['lastname'].str.replace('.', '', regex=False)

# lower case
raw_salaries['firstname'] = raw_salaries['firstname'].str.lower()
raw_salaries['lastname'] = raw_salaries['lastname'].str.lower()





##### Full name column
In our current dataset, player names are split into two separate columns: `firstname` and `lastname`. While this format can be useful for certain types of analysis, it might be more convenient for us to have a single column that contains the full name of each player.

In [9104]:
# concat first and last name
raw_salaries['name'] = raw_salaries['firstname'] + ' ' + raw_salaries['lastname']

raw_salaries.head()

Unnamed: 0,firstname,lastname,playerid,mlbid,year,salary,TeamName,age,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name,first3,last3,first3last3,middle2
0,david,aardsma,25001,430911,2004,300000,San Francisco Giants,22,0,0,2313535,300000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi
1,david,aardsma,25001,430911,2006,327000,Chicago Cubs,24,0,0,2699292,327000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi
2,david,aardsma,25001,430911,2007,387500,Chicago White Sox,25,675,23,2824751,380000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi
3,david,aardsma,25001,430911,2008,403250,Boston Red Sox,26,632,27,2925679,390000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi
4,david,aardsma,25001,430911,2009,419000,Seattle Mariners,27,635,20,2996106,400000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi


##### `borndate` column
The `borndate` column contains the birth date of each player. Currently this column is in string format, which is not ideal for our analysis. We'll convert this column to a datetime format.

In [9105]:
# borndate to datetime
raw_salaries['borndate'] = pd.to_datetime(raw_salaries['borndate'])
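With `borndate` parsed, date arithmetic becomes possible. As a sketch that is not part of the original notebook, the snippet below parses ISO-formatted dates explicitly and derives an approximate season age as season year minus birth year. Age conventions in baseball data may rely on a cutoff date within the season, which this simplification ignores; the `approx_age` column name is hypothetical:

```python
import pandas as pd

# Toy rows: one valid ISO date and one deliberately malformed value.
toy = pd.DataFrame({
    "year": [2004, 2006],
    "borndate": ["1981-12-27", "not a date"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
toy["borndate"] = pd.to_datetime(
    toy["borndate"], format="%Y-%m-%d", errors="coerce"
)

# Approximate age: season year minus birth year (no cutoff date applied).
toy["approx_age"] = toy["year"] - toy["borndate"].dt.year
print(toy)
```

Rows that fail to parse end up with a missing `borndate`, so they can be counted and inspected instead of stopping the conversion.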

In [9106]:
# Drop rows with year = 2023
raw_salaries = raw_salaries[raw_salaries['year'] != 2023]

In [9107]:
drop_rows = ['ChJi197', 'ChTr196']

In [9108]:
# Create a copy and save it to csv
salaries_pre = raw_salaries.copy()
salaries_pre.to_csv('pre_datasets/salaries_pre.csv', index=False)

### 2.5 Teams Dataset <a class="anchor" id="teams-dataset"></a>

In [9109]:
# Shape and first rows
print(f'raw_teams shape: {raw_teams.shape}')
print(f'Rows: {raw_teams.shape[0]}')
print(f'Columns: {raw_teams.shape[1]}')
raw_teams.sample(5)

raw_teams shape: (3015, 47)
Rows: 3015
Columns: 47


Unnamed: 0,yearID,lgID,teamID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R,AB,H,2B,3B,HR,BB,SO,SB,CS,HBP,SF,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
445,1904,NL,SLN,STL,,5,154,76.0,75,79,,N,,602,5104,1292,175,66,24,343,514.0,199.0,,46.0,,595,401,2.64,146,7,2,4104,1286,23,319,529,307,83,0.952,St. Louis Cardinals,Robison Field,386750.0,96,99,STL,SLN,SLN
1816,1981,AL,CAL,ANA,W,5,110,54.0,51,59,N,N,N,476,3688,944,134,16,97,393,571.0,44.0,33.0,29.0,30.0,453,399,3.7,27,8,19,2914,958,81,323,426,101,120,0.977,California Angels,Anaheim Stadium,1441545.0,100,99,CAL,CAL,CAL
2183,1995,AL,BOS,BOS,E,1,144,72.0,86,58,Y,N,N,791,4997,1399,286,31,175,560,923.0,99.0,44.0,65.0,49.0,698,631,4.39,7,9,39,3878,1338,127,476,888,120,151,0.978,Boston Red Sox,Fenway Park II,2164410.0,103,103,BOS,BOS,BOS
196,1887,NL,IN3,IND,,8,127,,37,89,,N,N,628,4368,1080,162,70,33,300,379.0,334.0,,44.0,,965,634,5.24,118,4,1,3264,1289,60,431,245,479,105,0.912,Indianapolis Hoosiers,Athletic Park I,,96,102,IND,IN3,IN3
2821,2016,AL,MIN,MIN,C,5,162,81.0,59,103,N,N,N,722,5618,1409,288,35,200,513,1426.0,91.0,32.0,44.0,43.0,889,814,5.08,4,3,26,4329,1617,221,479,1191,126,172,0.979,Minnesota Twins,Target Field,1963912.0,96,98,MIN,MIN,MIN


##### Dictionary

| Column Name | Description |
| --- | --- |
| yearID | Year (Season played) |
| lgID | The league the team played in |
| teamID | Team ID |
| franchID | Franchise ID |
| divID | Division ID |
| Rank | Team rank at the end of the season |
| G | Games played |
| Ghome | Games played at home |
| W | Wins |
| L | Losses |
| DivWin | Division Winner (Y or N) |
| WCWin | Wild Card Winner (Y or N) |
| LgWin | League Champion(Y or N) |
| WSWin | World Series Winner (Y or N) |
| R | Total number of runs scored by the team in the season |
| AB | Total number of at bats by the team in the season |
| H | Total number of hits by the team in the season |
| 2B | Total number of doubles by the team in the season |
| 3B | Total number of triples by the team in the season |
| HR | Total number of home runs by the team in the season |
| BB | Total number of walks by the team in the season |
| SO | Total number of strikeouts by the team in the season |
| SB | Total number of stolen bases by the team in the season |
| CS | Total number of times caught stealing by the team in the season |
| HBP | Total number of times hit by pitch by the team in the season |
| SF | Total number of sacrifice flies by the team in the season |
| RA | Total number of runs allowed by the team in the season |
| ER | Total number of earned runs allowed by the team in the season |
| ERA | Earned run average |
| CG | Total number of complete games pitched by the team in the season |
| SHO | Total number of shutouts pitched by the team in the season |
| SV | Total number of saves by the team in the season |
| IPouts | Total number of outs pitched by the team in the season |
| HA | Total number of hits allowed by the team in the season |
| HRA | Total number of home runs allowed by the team in the season |
| BBA | Total number of walks allowed by the team in the season |
| SOA | Total number of strikeouts recorded by the team's pitchers in the season |
| E | Total number of errors by the team in the season |
| DP | Total number of double plays turned by the team in the season |
| FP | Fielding percentage |
| name | Team name |
| park | Team park |
| attendance | Total attendance for the season |
| BPF | Three-year park factor for batters |
| PPF | Three-year park factor for pitchers |
| teamIDBR | Team ID used by Baseball Reference website |
| teamIDlahman45 | Team ID used in Lahman database version 4.5 |
| teamIDretro | Team ID used by Retrosheet |

The `raw_teams` dataset is a comprehensive collection of team-based statistics for each season. It has 3015 rows and 47 columns, each row representing a team's performance in a particular season.

For the time being, we'll keep all the columns as they might provide useful insights for our analysis. We can always drop unnecessary columns later once we understand the data better.

In [9110]:
# Data types
raw_teams.dtypes

yearID              int64
lgID               object
teamID             object
franchID           object
divID              object
Rank                int64
G                   int64
Ghome             float64
W                   int64
L                   int64
DivWin             object
LgWin              object
WSWin              object
R                   int64
AB                  int64
H                   int64
2B                  int64
3B                  int64
HR                  int64
BB                  int64
SO                float64
SB                float64
CS                float64
HBP               float64
SF                float64
RA                  int64
ER                  int64
ERA               float64
CG                  int64
SHO                 int64
SV                  int64
IPouts              int64
HA                  int64
HRA                 int64
BBA                 int64
SOA                 int64
E                   int64
DP                  int64
FP          

It looks like all the data types are correct, so we can move on to the next step.
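One detail worth noting in the output above: count columns such as `SO`, `SB` and `CS` are stored as `float64` rather than `int64`. In pandas this typically happens when an integer column contains missing values, since the default `int64` dtype cannot hold `NaN`. The snippet below is a minimal sketch on a made-up frame (`toy_teams` and its values are illustrative, not taken from `raw_teams`) showing the effect and the nullable `Int64` alternative.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one season has no recorded strikeouts
toy_teams = pd.DataFrame({
    'yearID': [1887, 1995],
    'HR': [33, 175],
    'SO': [379, np.nan],
})

# A missing value forces the otherwise-integer column to float64
print(toy_teams.dtypes)

# The nullable integer dtype keeps whole numbers and represents gaps as <NA>
so_nullable = toy_teams['SO'].astype('Int64')
print(so_nullable)
```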

In [9111]:
# Create a copy and save it to csv
teams_pre = raw_teams.copy()
teams_pre.to_csv('pre_datasets/teams_pre.csv', index=False)

---

## 3. Table Joins <a class="anchor" id="table-joins"></a>

Now, let's join our dataframes to create a single dataset that contains player-level information for each season. The resulting dataset will be used for further analysis and modeling.

The goal here is to find a common key between the datasets that can be used to join them. This may require some trial and error, and possibly some data cleaning to ensure the keys match up correctly, so the process can be iterative.
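A quick way to evaluate a candidate key before merging is to measure how many values on one side also appear on the other. The helper below is a minimal sketch: `key_coverage` is a hypothetical name, and the two small frames are made up for illustration (the ID `xxxxx01` is deliberately fake).

```python
import pandas as pd

def key_coverage(left, left_key, right, right_key):
    '''Return the share of non-null left keys that also exist in right.'''
    left_values = left[left_key].dropna()
    if left_values.empty:
        return 0.0
    return float(left_values.isin(right[right_key]).mean())

# Toy frames (hypothetical): the same players identified by two ID columns
toy_war = pd.DataFrame({'player_ID': ['gallojo01', 'hessot01', 'judefr01', 'xxxxx01']})
toy_people = pd.DataFrame({'bbrefID': ['gallojo01', 'hessot01', 'judefr01']})

coverage = key_coverage(toy_war, 'player_ID', toy_people, 'bbrefID')
print(f'Share of player_ID values found in bbrefID: {coverage:.2%}')
```

A coverage close to 1 suggests the pair of columns is a reasonable join key; a low value suggests the two columns use different coding schemes.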

In [9112]:
pd.set_option('display.max_columns', None)

def show_player_info(dataset, player_name):
    '''Display player information for sanity check'''

    player_info = dataset[dataset['name_common'] == player_name]
    
    return player_info

##### WAR and People datasets

In [9113]:
print(war_pre.shape)
print(people_pre.shape)

(103491, 16)
(20811, 15)


In [9114]:
# Join people_pre with war_pre on bbrefID and player_ID
war_people = war_pre.merge(people_pre, how='left',left_on='player_ID', right_on='bbrefID')
war_people.sample(5)



Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G,Inn,salary,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
19200,Chris Resop,30.0,2013,resopch01,Y,434592.0,1,0.0,1,18.0,1350000.0,0.0,0.0,0.0,0.0,OAK,resopch01,1982.0,11.0,4.0,USA,chris,resop,225.0,75.0,R,R,2005-06-28,2013-05-14,resoc001,resopch01
92860,Steve Lomasney,21.0,1999,lomasst01,N,206565.0,1,2.0,1,5.0,0.0,-100.0,0.05,0.1,-0.05,BOS,lomasst01,1977.0,8.0,9.0,USA,steve,lomasney,185.0,72.0,R,R,1999-10-03,1999-10-03,lomas001,lomasst01
56240,Joe Thatcher,30.0,2012,thatcjo01,Y,491159.0,1,0.0,51,31.7,700000.0,0.0,0.0,0.01,0.0,SDP,thatcjo01,1981.0,10.0,4.0,USA,joe,thatcher,230.0,74.0,L,L,2007-07-26,2015-10-02,thatj001,thatcjo01
77955,Otto Hess,28.0,1907,hessot01,Y,115867.0,1,38.0,19,0.0,0.0,31.157677,0.02,-0.01,0.02,CLE,hessot01,1878.0,10.0,10.0,Switzerland,otto,hess,170.0,73.0,L,L,1902-08-03,1915-06-13,hesso101,hessot01
35909,Frank Jude,21.0,1906,judefr01,N,116784.0,1,333.0,80,0.0,0.0,59.924845,-1.65,-1.04,-1.09,CIN,judefr01,1884.0,11.0,11.0,USA,frank,jude,150.0,67.0,R,R,1906-07-09,1906-10-07,judef101,judefr01


In [9115]:
show_player_info(war_people, 'Joey Gallo')

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G,Inn,salary,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
56527,Joey Gallo,21.0,2015,gallojo01,N,608336.0,1,123.0,36,264.0,0.0,91.332137,0.32,0.07,0.21,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
56528,Joey Gallo,22.0,2016,gallojo01,N,608336.0,1,30.0,17,72.0,0.0,-2.86383,-0.52,-0.24,-0.32,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
56529,Joey Gallo,23.0,2017,gallojo01,N,608336.0,1,532.0,145,1208.0,537120.0,118.426407,2.92,-0.63,3.32,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
56530,Joey Gallo,24.0,2018,gallojo01,N,608336.0,1,577.0,148,1199.0,560000.0,108.778664,2.41,-0.01,1.86,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
56531,Joey Gallo,25.0,2019,gallojo01,N,608336.0,1,297.0,70,614.3,605500.0,144.925708,3.08,0.47,2.54,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
56532,Joey Gallo,26.0,2020,gallojo01,N,608336.0,1,226.0,57,475.7,4400000.0,86.573416,1.51,1.11,0.2,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
56533,Joey Gallo,27.0,2021,gallojo01,N,1216672.0,2,616.0,153,1304.4,6200000.0,230.520541,4.61,0.92,3.04,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
56534,Joey Gallo,28.0,2022,gallojo01,N,1216672.0,2,410.0,126,944.4,10275000.0,162.570976,0.23,-0.13,-0.18,NYY,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01


##### WAR/People and Batting datasets

In [9116]:
# shape
war_people.shape

(103491, 31)

In [9117]:
batting_pre.shape

(103693, 20)

In [9118]:
# Join war_people and batting_pre on playerID = playerID, year_ID = yearID
war_batting_people = war_people.merge(batting_pre, how='left', left_on=['playerID', 'year_ID'], right_on=['playerID', 'yearID'])
war_batting_people.sample(5)

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,teamID
73354,Mike Fitzgerald,24.0,1985,fitzgmi02,N,114197.0,1,341.0,108,836.3,0.0,68.54943,-0.47,0.0,0.14,MON,fitzgmi02,1960.0,7.0,13.0,USA,mike,fitzgerald,185.0,72.0,R,R,1983-09-13,1992-10-04,fitzm001,fitzgmi02,1985.0,108.0,295.0,25.0,61.0,7.0,1.0,5.0,34.0,5.0,3.0,38.0,55.0,12.0,2.0,1.0,5.0,8.0,MON
58767,Johnny Couch,31.0,1922,couchjo01,Y,112726.0,1,100.0,43,0.0,5200.0,-2.945818,-0.43,0.02,-0.44,CIN,couchjo01,1891.0,3.0,31.0,USA,johnny,couch,180.0,72.0,L,R,1917-04-11,1925-09-21,coucj101,couchjo01,1922.0,43.0,91.0,3.0,12.0,1.0,2.0,0.0,10.0,0.0,0.0,6.0,28.0,0.0,0.0,3.0,0.0,0.0,CIN
55200,Joe Hatten,35.0,1952,hattejo01,Y,115609.0,1,16.0,17,0.0,0.0,-63.093707,-0.18,0.0,-0.18,CHC,hattejo01,1916.0,11.0,7.0,USA,joe,hatten,176.0,72.0,R,L,1946-04-21,1952-07-04,hattj101,hattejo01,1952.0,17.0,15.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,1.0,0.0,0.0,CHN
100024,Vinny Castilla,28.0,1996,castivi02,N,112106.0,1,673.0,160,1374.0,1200000.0,112.068849,3.23,0.81,2.74,COL,castivi02,1967.0,7.0,4.0,Mexico,vinny,castilla,175.0,73.0,R,R,1991-09-01,2006-09-28,castv001,castivi02,1996.0,160.0,629.0,97.0,191.0,34.0,0.0,40.0,113.0,7.0,2.0,35.0,88.0,7.0,5.0,0.0,4.0,20.0,COL
69556,Mark Grudzielanek,27.0,1997,grudzma01,N,115210.0,1,688.0,156,1368.3,220000.0,80.690478,1.5,1.15,1.33,MON,grudzma01,1970.0,6.0,30.0,USA,mark,grudzielanek,185.0,73.0,R,R,1995-04-28,2010-06-06,grudm001,grudzma01,1997.0,156.0,649.0,76.0,177.0,54.0,3.0,4.0,51.0,25.0,9.0,23.0,76.0,0.0,10.0,3.0,3.0,13.0,MON


In [9119]:
show_player_info(war_batting_people, 'Joey Gallo')

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,teamID
56527,Joey Gallo,21.0,2015,gallojo01,N,608336.0,1,123.0,36,264.0,0.0,91.332137,0.32,0.07,0.21,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2015.0,36.0,108.0,16.0,22.0,3.0,1.0,6.0,14.0,3.0,0.0,15.0,57.0,3.0,0.0,0.0,0.0,0.0,TEX
56528,Joey Gallo,22.0,2016,gallojo01,N,608336.0,1,30.0,17,72.0,0.0,-2.86383,-0.52,-0.24,-0.32,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2016.0,17.0,25.0,2.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,5.0,19.0,0.0,0.0,0.0,0.0,0.0,TEX
56529,Joey Gallo,23.0,2017,gallojo01,N,608336.0,1,532.0,145,1208.0,537120.0,118.426407,2.92,-0.63,3.32,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2017.0,145.0,449.0,85.0,94.0,18.0,3.0,41.0,80.0,7.0,2.0,75.0,196.0,1.0,8.0,0.0,0.0,3.0,TEX
56530,Joey Gallo,24.0,2018,gallojo01,N,608336.0,1,577.0,148,1199.0,560000.0,108.778664,2.41,-0.01,1.86,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2018.0,148.0,500.0,82.0,103.0,24.0,1.0,40.0,92.0,3.0,4.0,74.0,207.0,4.0,3.0,0.0,0.0,3.0,TEX
56531,Joey Gallo,25.0,2019,gallojo01,N,608336.0,1,297.0,70,614.3,605500.0,144.925708,3.08,0.47,2.54,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2019.0,70.0,241.0,54.0,61.0,15.0,1.0,22.0,49.0,4.0,2.0,52.0,114.0,4.0,2.0,1.0,1.0,0.0,TEX
56532,Joey Gallo,26.0,2020,gallojo01,N,608336.0,1,226.0,57,475.7,4400000.0,86.573416,1.51,1.11,0.2,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2020.0,57.0,193.0,23.0,35.0,8.0,0.0,10.0,26.0,2.0,0.0,29.0,79.0,2.0,4.0,0.0,0.0,0.0,TEX
56533,Joey Gallo,27.0,2021,gallojo01,N,1216672.0,2,616.0,153,1304.4,6200000.0,230.520541,4.61,0.92,3.04,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2021.0,153.0,498.0,90.0,99.0,13.0,1.0,38.0,77.0,6.0,0.0,111.0,213.0,5.0,6.0,0.0,1.0,6.0,TEX
56534,Joey Gallo,28.0,2022,gallojo01,N,1216672.0,2,410.0,126,944.4,10275000.0,162.570976,0.23,-0.13,-0.18,NYY,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2022.0,126.0,350.0,48.0,56.0,8.0,2.0,19.0,47.0,3.0,0.0,56.0,163.0,0.0,3.0,0.0,1.0,0.0,NYA


In [9120]:
war_batting_people.shape

(103491, 50)
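The row count stays at 103,491 after the join, which is what a left join should produce when each key matches at most one row on the right side. As a sketch of how this could be enforced rather than checked by eye, `DataFrame.merge` accepts a `validate` argument that raises an error if the right-hand keys are duplicated. The frames below are toy data, not the notebook's tables.

```python
import pandas as pd

# Toy data (hypothetical subset; HR values follow the Joey Gallo rows above)
toy_left = pd.DataFrame({
    'playerID': ['gallojo01', 'gallojo01', 'hessot01'],
    'year_ID': [2015, 2016, 1907],
})
toy_right = pd.DataFrame({
    'playerID': ['gallojo01', 'gallojo01'],
    'yearID': [2015, 2016],
    'HR': [6, 1],
})

# validate='many_to_one' raises pandas.errors.MergeError if the right keys repeat
merged = toy_left.merge(
    toy_right,
    how='left',
    left_on=['playerID', 'year_ID'],
    right_on=['playerID', 'yearID'],
    validate='many_to_one',
)

# A left join against unique right keys keeps the left row count unchanged
assert len(merged) == len(toy_left)
print(merged)
```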

##### WAR/People/Batting and Team datasets

There are several columns for team identifiers, so let's decide which one to use.

In [9121]:
# Compare the teamIDBR column from teams_pre to the team_ID column from war_batting_people
print(f'teamIDBR (teams_pre) vs team_ID (war_batting_people):\n {teams_pre["teamIDBR"].isin(war_batting_people["team_ID"]).value_counts()}\n')

# Compare the teamID column from teams_pre to the teamID column from war_batting_people 
print(f'teamID (teams_pre) vs teamID (war_batting_people):\n {teams_pre["teamID"].isin(war_batting_people["teamID"]).value_counts()}\n')

# Compare the teamID column from teams_pre to the team_ID column from war_batting_people
print(f'teamID (teams_pre) vs team_ID (war_batting_people):\n {teams_pre["teamID"].isin(war_batting_people["team_ID"]).value_counts()}')

teamIDBR (teams_pre) vs team_ID (war_batting_people):
 True    3015
Name: teamIDBR, dtype: int64

teamID (teams_pre) vs teamID (war_batting_people):
 True    3015
Name: teamID, dtype: int64

teamID (teams_pre) vs team_ID (war_batting_people):
 True     1735
False    1280
Name: teamID, dtype: int64
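The third comparison is the informative one: `teamID` from `teams_pre` and `team_ID` from the WAR data use different codes for some teams (the Joey Gallo rows above show `NYY` in `team_ID` and `NYA` in `teamID` for the same 2022 season). A minimal sketch, on toy data, of listing which codes fail to match instead of only counting them:

```python
import pandas as pd

# Toy data (hypothetical subset of team codes)
toy_teams = pd.DataFrame({'teamID': ['NYA', 'TEX', 'SLA', 'BOS']})
toy_players = pd.DataFrame({'team_ID': ['NYY', 'TEX', 'SLB', 'BOS']})

# Codes present in toy_teams but absent from the players' team_ID column
matches = toy_teams['teamID'].isin(toy_players['team_ID'])
unmatched_codes = sorted(toy_teams.loc[~matches, 'teamID'].unique())
print(unmatched_codes)
```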


In [9122]:
# Unique values in war_batting_people teamID
war_batting_people['teamID'].nunique()

149

In [9123]:
# Unique values in teams_pre teamID 
teams_pre['teamID'].nunique()

149

In [9124]:
war_batting_people.shape

(103491, 50)

In [9125]:
# Join war_batting_people and teams_pre on teamID and yearID
war_batting_people_teams = war_batting_people.merge(teams_pre, how='left', left_on=['teamID', 'yearID'], right_on=['teamID', 'yearID'])

In [9126]:
war_batting_people_teams.sample(5)

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
71152,Matt Stairs,24.0,1992,stairma01,N,122644.0,1,38.0,13,65.0,0.0,58.462267,-0.27,-0.18,-0.13,MON,stairma01,1968.0,2.0,27.0,CAN,matt,stairs,200.0,69.0,L,R,1992-05-29,2011-07-22,staim001,stairma01,1992.0,13.0,30.0,2.0,5.0,2.0,0.0,0.0,5.0,0.0,0.0,7.0,7.0,0.0,0.0,0.0,1.0,0.0,MON,NL,WSN,E,2.0,162.0,81.0,87.0,75.0,N,N,N,648.0,5477.0,1381.0,263.0,37.0,102.0,463.0,976.0,196.0,63.0,43.0,55.0,581.0,530.0,3.25,11.0,14.0,49.0,4404.0,1296.0,92.0,525.0,1014.0,124.0,113.0,0.98,Montreal Expos,Stade Olympique,1669127.0,99.0,99.0,MON,MON,MON
31368,Ed Konetchy,33.0,1919,koneted01,N,117245.0,1,539.0,132,0.0,3500.0,118.11507,2.2,-0.62,2.21,BRO,koneted01,1885.0,9.0,3.0,USA,ed,konetchy,195.0,74.0,R,R,1907-06-29,1921-10-01,konee101,koneted01,1919.0,132.0,486.0,46.0,145.0,24.0,9.0,1.0,47.0,14.0,0.0,29.0,39.0,0.0,3.0,21.0,0.0,0.0,BRO,NL,LAD,,5.0,141.0,70.0,69.0,71.0,,N,N,525.0,4844.0,1272.0,167.0,66.0,25.0,258.0,405.0,112.0,,,,513.0,389.0,2.73,98.0,12.0,1.0,3843.0,1256.0,21.0,292.0,476.0,218.0,84.0,0.963,Brooklyn Robins,Ebbets Field,360721.0,103.0,103.0,BRO,BRO,BRO
26995,Deven Marrero,30.0,2021,marrede01,N,571918.0,1,19.0,10,42.6,0.0,86.803208,0.02,0.01,0.02,MIA,marrede01,1990.0,8.0,25.0,USA,deven,marrero,190.0,72.0,R,R,2015-06-28,2022-09-10,marrd001,marrede01,2021.0,10.0,16.0,4.0,3.0,0.0,0.0,1.0,1.0,1.0,0.0,3.0,6.0,0.0,0.0,0.0,0.0,2.0,MIA,NL,FLA,E,4.0,162.0,81.0,67.0,95.0,N,N,N,623.0,5348.0,1244.0,226.0,23.0,158.0,450.0,1553.0,106.0,29.0,65.0,30.0,701.0,622.0,3.96,1.0,8.0,33.0,4245.0,1282.0,162.0,529.0,1381.0,122.0,146.0,0.979,Miami Marlins,Marlins Park,642617.0,98.0,99.0,MIA,FLO,MIA
39453,George Grant,22.0,1925,grantge02,Y,115043.0,1,4.0,12,0.0,2400.0,24.588056,-0.01,0.0,-0.01,SLB,grantge02,1903.0,1.0,6.0,USA,george,grant,175.0,71.0,R,R,1923-09-17,1931-07-12,grang101,grantge02,1925.0,12.0,4.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,SLA,AL,BAL,,3.0,154.0,78.0,82.0,71.0,,N,N,900.0,5440.0,1620.0,304.0,68.0,110.0,498.0,375.0,85.0,78.0,,,906.0,754.0,4.92,67.0,7.0,10.0,4137.0,1588.0,99.0,675.0,419.0,219.0,164.0,0.964,St. Louis Browns,Sportsman's Park IV,462898.0,105.0,106.0,SLB,SLA,SLA
30479,Dwight Smith,32.0,1996,smithdw01,N,122416.0,1,172.0,101,184.0,455000.0,50.61703,-1.15,-0.36,-0.92,ATL,smithdw01,1963.0,11.0,8.0,USA,dwight,smith,175.0,71.0,L,R,1989-05-01,1996-09-29,smitd003,smithdw01,1996.0,101.0,153.0,16.0,31.0,5.0,0.0,3.0,16.0,1.0,3.0,17.0,42.0,1.0,1.0,0.0,1.0,2.0,ATL,NL,ATL,E,1.0,162.0,81.0,96.0,66.0,Y,Y,N,773.0,5614.0,1514.0,264.0,28.0,197.0,530.0,1032.0,83.0,43.0,27.0,50.0,648.0,575.0,3.52,14.0,9.0,46.0,4407.0,1372.0,120.0,451.0,1245.0,130.0,143.0,0.98,Atlanta Braves,Atlanta-Fulton County Stadium,2901242.0,106.0,104.0,ATL,ATL,ATL


In [9127]:
war_batting_people_teams.shape

(103491, 95)

##### WAR/People/Batting/Team and Salary datasets

In order to align our datasets for comparison, we'll first create a subset of data starting from 1985, as this is the earliest year available in the `salaries_pre` dataset. This will allow us to identify common links between the datasets and determine the most suitable one for joining. Once we've made this decision, we'll proceed to join the full datasets. Any null values resulting from years prior to 1985 will be addressed in the next stages of the project.

In [9128]:
# war_batting_people_teams starting in 1985 (explicit copy, so later column assignments do not act on a slice)
war_batting_people_teams_85 = war_batting_people_teams[war_batting_people_teams['year_ID'] >= 1985].copy()

`mlb_ID` and `mlbid` seem to be the only common link between the datasets. Let's see if we can use these columns to join the datasets.

In [9129]:
# Compare mlb_ID from war_batting_people_teams to mlbid from salaries_pre
print(f'mlb_ID (war_batting_people_teams_85) vs mlbid (salaries_pre):\n {war_batting_people_teams_85["mlb_ID"].isin(salaries_pre["mlbid"]).value_counts()}')

mlb_ID (war_batting_people_teams_85) vs mlbid (salaries_pre):
 True     41574
False     4113
Name: mlb_ID, dtype: int64


It appears that there are 41,574 rows in `war_batting_people_teams_85` whose `mlb_ID` also appears in the `mlbid` column of `salaries_pre`. However, there are also 4,113 rows whose `mlb_ID` has no match.

There could be inconsistencies in the way the IDs are recorded in the two datasets. Let's explore this further by looking at the IDs that don't match.
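One way to look at the non-matching IDs is an anti-join: a left merge with `indicator=True`, keeping only the rows flagged `left_only`. In the Joey Gallo rows displayed earlier, `mlb_ID` is 608336 through 2020 but 1216672, exactly twice that value, for the 2021 and 2022 seasons where `stint_ID` is 2. That may point to IDs being summed when multi-stint seasons were aggregated, though this is an inference from those rows rather than something verified here. The sketch below uses those two values; the salary-side frame is hypothetical.

```python
import pandas as pd

# Toy data: mlb_ID values copied from the Joey Gallo rows shown earlier;
# the salary-side frame is hypothetical
toy_war = pd.DataFrame({
    'name_common': ['Joey Gallo', 'Joey Gallo'],
    'year_ID': [2020, 2021],
    'mlb_ID': [608336.0, 1216672.0],
})
toy_salaries = pd.DataFrame({'mlbid': [608336.0], 'name': ['joey gallo']})

# Left merge with an indicator column, then keep rows with no match on the right
flagged = toy_war.merge(
    toy_salaries[['mlbid']].drop_duplicates(),
    how='left',
    left_on='mlb_ID',
    right_on='mlbid',
    indicator=True,
)
unmatched = flagged.loc[flagged['_merge'] == 'left_only', ['name_common', 'year_ID', 'mlb_ID']]
print(unmatched)
```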

In [9130]:
# # Extract a sample player with different mlb_ID
# difference_sample = war_batting_people_teams_85[~war_batting_people_teams_85["mlb_ID"].isin(salaries_pre["mlbid"])][['name_common', 'mlb_ID']].sample(1)
# name_d = difference_sample.values[0][0]
# mlbid_d = difference_sample.values[0][1]

# print('From war_batting_people_teams_85:')
# print(f'Name: {name_d}\nmlb_ID: {mlbid_d}\n')

# print('From salaries_pre:')
# print(salaries_pre[salaries_pre["name"] == name_d][["name", "mlbid"]].head(1).iloc[0])

It appears that there is a discrepancy between the `mlb_ID` in the war_batting_people_teams_85 dataset and the `mlbid` in the salaries_pre dataset. This could be due to a variety of reasons, such as data entry errors, different data sources, or changes in player IDs over time.

The best solution would be to find or create a common link between the datasets. We don't want to join on player name, as it is not a unique identifier: some players share the same name.

Let's explore this solution further.



##### - Creating a unique player ID for both datasets

Both datasets contain names and birth dates, so we can combine these columns to create a unique player ID for both datasets. This will be an iterative, trial-and-error process.
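As a rough sketch of the idea, a composite key can be assembled from lowercase name prefixes and part of the birth year. The function name `make_player_key` is hypothetical, and the default prefix lengths (two letters of the first name, five of the last name, three digits of the year) are parameters to tune by trial and error rather than fixed choices.

```python
import pandas as pd

def make_player_key(first, last, birth_year, n_first=2, n_last=5, n_year=3):
    '''Build a composite key from name prefixes and leading birth-year digits.'''
    first_part = first.str.lower().str[:n_first]
    last_part = last.str.lower().str[:n_last]
    # Nullable Int64 avoids a trailing ".0" when the year is stored as float
    year_part = birth_year.astype('Int64').astype(str).str[:n_year]
    return first_part + last_part + year_part

# Toy data (values taken from rows displayed earlier in this notebook)
toy_players = pd.DataFrame({
    'nameFirst': ['joey', 'vinny'],
    'nameLast': ['gallo', 'castilla'],
    'birthYear': [1993.0, 1967.0],
})

toy_players['new_id'] = make_player_key(
    toy_players['nameFirst'], toy_players['nameLast'], toy_players['birthYear']
)
print(toy_players['new_id'].tolist())
```

Shorter prefixes tolerate more spelling differences between sources but raise the chance that two different players end up with the same key.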

__Unique ID for war_batting_people_teams__

In [9131]:
# Some of these columns had been dropped earlier, so I had to go back and add them again

# Birth dates to nullable int (1985+ subset)
war_batting_people_teams_85['birthYear'] = war_batting_people_teams_85['birthYear'].astype('Int64')
war_batting_people_teams_85['birthMonth'] = war_batting_people_teams_85['birthMonth'].astype('Int64')
war_batting_people_teams_85['birthDay'] = war_batting_people_teams_85['birthDay'].astype('Int64')

# Birth dates to nullable int (full dataset), plus the first three digits of the birth year
war_batting_people_teams['birthYear'] = war_batting_people_teams['birthYear'].astype('Int64')
war_batting_people_teams['birthMonth'] = war_batting_people_teams['birthMonth'].astype('Int64')
war_batting_people_teams['birthDay'] = war_batting_people_teams['birthDay'].astype('Int64')
war_batting_people_teams['birthYear_3'] = war_batting_people_teams['birthYear'].astype(str).str[:3]

# Concat nameFirst and nameLast
# war_batting_people_teams['name'] = war_batting_people_teams['nameFirst'] + war_batting_people_teams_85['nameLast']

# First two letters of nameFirst and nameLast (1985+ subset)
war_batting_people_teams_85['nameFirst_2'] = war_batting_people_teams_85['nameFirst'].str[:2]
war_batting_people_teams_85['nameLast_2'] = war_batting_people_teams_85['nameLast'].str[:2]

# First two letters of nameFirst and first five letters of nameLast (full dataset)
# Note: the column is still called nameLast_2 even though it holds five letters
war_batting_people_teams['nameFirst_2'] = war_batting_people_teams['nameFirst'].str[:2]
war_batting_people_teams['nameLast_2'] = war_batting_people_teams['nameLast'].str[:5]

# New ID for the 1985+ subset: name prefixes + full birth year
war_batting_people_teams_85['new_id_x'] = war_batting_people_teams_85['nameFirst_2'] + war_batting_people_teams_85['nameLast_2'] + war_batting_people_teams_85['birthYear'].astype(str) #+ war_batting_people_teams_85['birthMonth'].astype(str) + war_batting_people_teams_85['birthDay'].astype(str) # + war_batting_people_teams_85['birthYear'].astype(str)

# New ID for the full dataset: name prefixes + first three digits of the birth year
war_batting_people_teams['new_id_x'] = war_batting_people_teams['nameFirst_2'] + war_batting_people_teams['nameLast_2'] + war_batting_people_teams['birthYear_3'] 


In [9132]:
# Sanity check
show_player_info(war_batting_people_teams, 'Joey Gallo')[['name_common', 'year_ID', 'team_ID', 'birthYear', 'birthMonth', 'birthDay', 'new_id_x']]

Unnamed: 0,name_common,year_ID,team_ID,birthYear,birthMonth,birthDay,new_id_x
56527,Joey Gallo,2015,TEX,1993,11,19,jogallo199
56528,Joey Gallo,2016,TEX,1993,11,19,jogallo199
56529,Joey Gallo,2017,TEX,1993,11,19,jogallo199
56530,Joey Gallo,2018,TEX,1993,11,19,jogallo199
56531,Joey Gallo,2019,TEX,1993,11,19,jogallo199
56532,Joey Gallo,2020,TEX,1993,11,19,jogallo199
56533,Joey Gallo,2021,TEX,1993,11,19,jogallo199
56534,Joey Gallo,2022,NYY,1993,11,19,jogallo199


__Unique ID for salaries_pre__

In [9133]:
# Extract year, month and day from borndate
salaries_pre['b_year'] = salaries_pre['borndate'].dt.year.astype('Int64')
salaries_pre['b_month'] = salaries_pre['borndate'].dt.month.astype('Int64')
salaries_pre['b_day'] = salaries_pre['borndate'].dt.day.astype('Int64')
salaries_pre['b_year_3'] = salaries_pre['b_year'].astype(str).str[:3]

# First two letters of firstname and first five letters of lastname
salaries_pre['firstname_2'] = salaries_pre['firstname'].str[:2]
salaries_pre['lastname_2'] = salaries_pre['lastname'].str[:5]

# New ID: name prefixes + first three digits of the birth year
salaries_pre['new_id_y'] = salaries_pre['firstname_2'] + salaries_pre['lastname_2'] + salaries_pre['b_year_3'] 

In [9134]:
# Sanity check
# Joey Gallo
salaries_pre[salaries_pre['name'] == 'joey gallo'][['name', 'firstname_2', 'lastname_2', 'borndate', 'b_year', 'b_month', 'b_day', 'new_id_y']]

Unnamed: 0,name,firstname_2,lastname_2,borndate,b_year,b_month,b_day,new_id_y
14141,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14142,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14143,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14144,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14145,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14146,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14147,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14148,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199


In [9135]:
war_batting_people_teams.shape

(103491, 99)
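Before joining on a constructed key, it can help to check whether the key is actually unique per player, since two different players may share the same name prefixes and birth decade. A minimal sketch with toy data (the player IDs `brownmi01` and `brownmi02` are hypothetical):

```python
import pandas as pd

# Toy data: two hypothetical players collapse to the same constructed key
toy = pd.DataFrame({
    'playerID': ['brownmi01', 'brownmi02', 'gallojo01', 'gallojo01'],
    'new_id_x': ['mibrown195', 'mibrown195', 'jogallo199', 'jogallo199'],
})

# Count distinct players behind each key; more than one means a collision
players_per_key = toy.groupby('new_id_x')['playerID'].nunique()
ambiguous_keys = players_per_key[players_per_key > 1].index.tolist()
print(ambiguous_keys)
```

Keys flagged this way would attach one player's salary to another player's rows, so they need either a longer key or manual handling.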

##### - Joining the datasets

In [9136]:
# Join war_batting_people_teams (full dataset) and salaries_pre on new_id_x = new_id_y and year_ID = year
war_batting_people_teams_salaries = war_batting_people_teams.merge(salaries_pre, how='left', left_on=['new_id_x', 'year_ID'], right_on=['new_id_y', 'year'])

##### Dealing with null values after joining

Most of the null values will be dealt with in the preprocessing stage, once we define the time period of our analysis. However, there are some null values that we can deal with now.

In [9137]:
pd.set_option('display.max_rows', None)
# null values 
war_batting_people_teams_salaries.isnull().sum().sort_values(ascending=False)

middle2           58508
new_id_y          58506
borndate          58506
mlbid             58506
year              58506
firstname         58506
salary_y          58506
TeamName          58506
age_y             58506
leaguerank        58506
teamrank          58506
lastname          58506
leagueminimum     58506
averagesalary     58506
name_y            58506
first3            58506
last3             58506
first3last3       58506
b_year            58506
b_month           58506
b_day             58506
b_year_3          58506
firstname_2       58506
lastname_2        58506
playerid          58506
DivWin            44880
SF_y              44715
divID             43882
HBP_y             35788
CS_y              22314
Ghome              8489
WSWin              8247
attendance         5744
SB_y               2906
bats               2393
throws             1996
LgWin              1815
weight             1689
height             1618
park               1540
lgID               1458
birthDay        

We will remove certain rows with null values that don't impact our analysis. These rows correspond to players with insignificant stats, and their removal won't significantly affect our dataset size.

In [9138]:
# Pre-cleaning: playerIDs found during this step whose rows contain null values. Dropping them does not affect the analysis, as these players have no relevant stats.
rows_to_drop = ['sternad01', 
                'hegmabo01', 
                'jimerch01', 
                'tremich01',
                'firovda01',
                'singldu01',
                'lunarfe01',
                'matosfr01',
                'reyesgi01',
                'manrifr01',
                'krugeja01',
                'pankoji01',
                'hietpjo01',
                'morgake01',
                'dalesma01',
                'mooreke02',
                'davidma01',
                'davisma02',
                'calzana01',
                'santape01',
                'carabra01',
                'mckeewa01',
                'esposbr01',
                'sanchan02',
                'gonzaal02',
                'gonzaal01',
                'anderbr05',
                'rojasjo02',
                'roberda09',
                'lopezlu03',
                'lopezlu02',
                'snydebr02',
                'braunry01',
                'willima07',
                'reyesjo02',
                ]

In [9139]:
# Drop rows
war_batting_people_teams_salaries = war_batting_people_teams_salaries[~war_batting_people_teams_salaries['playerID'].isin(rows_to_drop)]

In [9140]:
special_drop_dict = {'stewach01': [2010],
                     'freesda01': [2009],
                     'disarga01': [1989],
                     'kingsge01': [1996],
                     'snowjt01': [2008],
                     'larueja01': [2009],
                     'posadjo01': [1995],
                     'thornlo01': [1990],
                     'zuvelpa01': [1991],
                     'murphse01': [2019],
                     'gorete01': [2020],
                     'lintz01': [2021],
                     'wilsova01': [1999],
                     'bartobr01': [2009],
                     'burkeja02': [2010],
                     'haltesh01': [1999],
                     }

In [9141]:
# Drop rows in special_drop_dict
for key, value in special_drop_dict.items():
    war_batting_people_teams_salaries = war_batting_people_teams_salaries[~((war_batting_people_teams_salaries['playerID'] == key) & (war_batting_people_teams_salaries['yearID'].isin(value)))]
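The loop above filters the frame once per player. As an alternative sketch (not the notebook's method), the same player-season pairs can be removed in a single pass by building a `MultiIndex` from the two key columns and testing membership against the pairs to drop. The frame and `toy_drop_dict` below are toy data.

```python
import pandas as pd

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    'playerID': ['stewach01', 'stewach01', 'freesda01', 'gallojo01'],
    'yearID': [2009, 2010, 2009, 2015],
})
toy_drop_dict = {'stewach01': [2010], 'freesda01': [2009]}

# Flatten the dict into (playerID, year) tuples
pairs_to_drop = [(pid, yr) for pid, years in toy_drop_dict.items() for yr in years]

# Build a MultiIndex from the two key columns and flag rows whose pair is listed
row_keys = pd.MultiIndex.from_frame(toy[['playerID', 'yearID']])
toy_kept = toy[~row_keys.isin(pairs_to_drop)]
print(toy_kept)
```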

In [9142]:
drop_rows = ['ChJi197', 
             'ChTr196', 
             'DuSi197', 
             'FeLu197', 
             'FrMa196', 
             'GiRe196', 
             'JiPa195', 
             'KeMo197', 
             'MaDa196', 
             'NaCa198',
             'RaCa196',
             'WaMc197',
             ]

# Drop rows with new_id_y in drop_rows from salaries_pre
salaries_pre = salaries_pre[~salaries_pre['new_id_y'].isin(drop_rows)]

In [9143]:
special_drop_dict_salaries = {'JaLa197': [2010],
                     'LoTh196': [1985],
                     }

In [9144]:
for key, value in special_drop_dict_salaries.items():
    salaries_pre  = salaries_pre[~((salaries_pre['new_id_y'] == key) & (salaries_pre['year'].isin(value)))]

In [9145]:
drop_x = ['mibrown195', 'brsmith195', 'kemille180', 'kemille196',
       'mifitzg196', 'chhowar196', 'brhunte197',
       'brhunte196', 'mavalde197', 'majohns197', 'majohns196',
       'chpeter197', 'frgarci197', 'caherna197', 'caherna196',
       'abnunez197', 'japhill197', 'jajones197',
       'mawatso197', 'lugonza197', 'racastr197',
       'jonelso197', 'rybraun198', 'brsnyde198',
       'chcarte198', 'jovalde198', 'daalvar198',
       'maduffy198', 'dabarne198', 'betaylo199', 'brrodge199',
       'jodavis199', 'jomarti198', 'magonza198', 'brzimme199',
       'heperez199', 'rogarci199', 'aldiaz199',
       'dicasti199']

# Drop rows with new_id_x in drop_x from war_batting_people_teams_salaries
war_batting_people_teams_salaries = war_batting_people_teams_salaries[~war_batting_people_teams_salaries['new_id_x'].isin(drop_x)]

In [9146]:
# Drop 2022 rows that found no salary match (mlbid is null after the join)
war_batting_people_teams_salaries = war_batting_people_teams_salaries[~((war_batting_people_teams_salaries['year_ID'] == 2022) & (war_batting_people_teams_salaries['mlbid'].isnull()))]


In [9147]:
# null values
war_batting_people_teams_salaries.isnull().sum().sort_values(ascending=False)

middle2           57864
new_id_y          57862
borndate          57862
mlbid             57862
year              57862
firstname         57862
salary_y          57862
TeamName          57862
age_y             57862
leaguerank        57862
teamrank          57862
lastname          57862
leagueminimum     57862
averagesalary     57862
name_y            57862
first3            57862
last3             57862
first3last3       57862
b_year            57862
b_month           57862
b_day             57862
b_year_3          57862
firstname_2       57862
lastname_2        57862
playerid          57862
DivWin            44867
SF_y              44715
divID             43882
HBP_y             35788
CS_y              22314
Ghome              8489
WSWin              8234
attendance         5744
SB_y               2906
bats               2393
throws             1996
LgWin              1802
weight             1689
height             1618
park               1540
lgID               1458
birthDay        
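
Many columns in the output above share the same null count, which suggests they are missing on the same rows (most likely rows with no salary match, though that is an assumption). A minimal sketch of how this could be verified, using an invented toy frame rather than the real table:

```python
import numpy as np
import pandas as pd

# Toy frame; column names echo the real table, values are invented.
toy = pd.DataFrame({
    "name_common": ["A", "B", "C", "D"],
    "salary_y": [np.nan, 100.0, np.nan, 200.0],
    "TeamName": [np.nan, "X", np.nan, "Y"],
    "divID": [np.nan, np.nan, np.nan, "E"],
})

null_counts = toy.isnull().sum()
nonzero = null_counts[null_counts > 0]

# Group column names by their null count.
columns_by_count = {
    int(count): list(cols.index) for count, cols in nonzero.groupby(nonzero)
}
print(columns_by_count)

# Equal counts alone do not prove the nulls fall on the same rows,
# so compare the null masks directly.
same_rows = bool((toy["salary_y"].isnull() == toy["TeamName"].isnull()).all())
print(same_rows)
```

If the masks match, the columns can be treated as one block when deciding how to handle the missing values.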

##### Dropping newly created columns

Now that we have successfully joined the datasets, we can drop the helper columns (new ids) that we created only for the joins, to keep the table clean.

In [9148]:
new_ids = ['birthYear_3', 'nameFirst_2', 'nameLast_2', 'new_id_x', 'firstname_2', 'lastname_2', 'new_id_y', 'b_year_3', 'b_year', 'b_month', 'b_day', 'birthMonth', 'birthDay', 'nameFirst', 'nameLast', 'first3', 'last3', 'first3last3', 'middle2']

# Drop new_ids columns
war_batting_people_teams_salaries.drop(new_ids, axis=1, inplace=True)
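
Because the drop above happens in place, running the cell a second time raises a `KeyError` once the columns are gone. One possible way to make the step re-runnable is sketched below on an invented toy frame; `helper_cols` is a hypothetical shortened stand-in for `new_ids`.

```python
import pandas as pd

# Toy frame with two of the helper columns; values are invented.
toy = pd.DataFrame({
    "name_common": ["A"],
    "new_id_x": ["aaaaaa111"],
    "first3": ["aaa"],
})

helper_cols = ["new_id_x", "first3", "middle2"]  # "middle2" is absent here

# Report helper columns that are already gone instead of failing on them.
missing = [c for c in helper_cols if c not in toy.columns]
print("already absent:", missing)

# errors="ignore" skips labels that do not exist in the frame.
cleaned = toy.drop(columns=helper_cols, errors="ignore")
print(list(cleaned.columns))
```

Returning a new frame instead of using `inplace=True` also keeps the original available for comparison.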

In [9149]:
# sort by year_ID, then name_common
war_batting_people_teams_salaries = war_batting_people_teams_salaries.sort_values(by=['year_ID', 'name_common'])

In [9150]:
war_batting_people_teams_salaries.head()

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,OPS_plus,WAR,WAR_def,WAR_off,team_ID,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y
989,Al Barker,32.0,1871,barkeal01,N,110565.0,1,5.0,1,0.0,0.0,92.724026,0.03,0.0,0.03,ROK,barkeal01,1839,USA,162.0,72.0,,,1871-06-01,1871-06-01,barka101,barkeal01,1871.0,1.0,4.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,RC1,,ROK,,9.0,25.0,,4.0,21.0,,N,,231.0,1036.0,274.0,44.0,25.0,3.0,38.0,30.0,53.0,10.0,,,287.0,108.0,4.3,23.0,1.0,0.0,678.0,315.0,3.0,34.0,16.0,220.0,14.0,0.821,Rockford Forest Citys,Agricultural Society Fair Grounds,,97.0,99.0,ROK,RC1,RC1,,,,,,,,,,,,,NaT,
1586,Al Pratt,23.0,1871,prattal01,Y,120742.0,1,131.0,29,0.0,0.0,97.798322,0.01,-0.09,0.08,CLE,prattal01,1847,USA,140.0,67.0,,R,1871-05-04,1872-08-19,prata101,prattal01,1871.0,29.0,130.0,31.0,34.0,6.0,8.0,0.0,20.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,CL1,,CFC,,8.0,29.0,,10.0,19.0,,N,,249.0,1186.0,328.0,35.0,40.0,7.0,26.0,25.0,18.0,8.0,,,341.0,116.0,4.11,23.0,0.0,0.0,762.0,346.0,13.0,53.0,34.0,234.0,15.0,0.818,Cleveland Forest Citys,National Association Grounds,,96.0,100.0,CLE,CL1,CL1,,,,,,,,,,,,,NaT,
1589,Al Reach,31.0,1871,reachal01,N,120965.0,1,138.0,26,0.0,0.0,145.954197,0.98,0.02,0.99,ATH,reachal01,1840,United Kingdom,155.0,66.0,L,L,1871-05-20,1875-05-21,reaca101,reachal01,1871.0,26.0,133.0,43.0,47.0,7.0,6.0,0.0,34.0,2.0,0.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,PH1,,PNA,,1.0,28.0,,21.0,7.0,,Y,,376.0,1281.0,410.0,66.0,27.0,9.0,46.0,23.0,56.0,12.0,,,266.0,137.0,4.95,27.0,0.0,0.0,747.0,329.0,3.0,53.0,16.0,194.0,13.0,0.845,Philadelphia Athletics,Jefferson Street Grounds,,102.0,98.0,ATH,PH1,PH1,,,,,,,,,,,,,NaT,
1704,Al Spalding,20.0,1871,spaldal01,Y,122558.0,1,152.0,31,0.0,1500.0,90.566468,-0.08,-0.15,0.05,BOS,spaldal01,1850,USA,170.0,73.0,R,R,1871-05-05,1878-08-31,spala101,spaldal01,1871.0,31.0,144.0,43.0,39.0,10.0,1.0,1.0,31.0,2.0,0.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,BS1,,BNA,,3.0,31.0,,20.0,10.0,,N,,401.0,1372.0,426.0,70.0,37.0,3.0,60.0,19.0,73.0,16.0,,,303.0,109.0,3.55,22.0,1.0,3.0,828.0,367.0,2.0,42.0,23.0,243.0,24.0,0.834,Boston Red Stockings,South End Grounds I,,103.0,98.0,BOS,BS1,BS1,,,,,,,,,,,,,NaT,
3603,Andy Leonard,25.0,1871,leonaan01,N,117686.0,1,151.0,31,0.0,0.0,99.733132,0.49,-0.08,0.56,OLY,leonaan01,1846,Ireland,168.0,67.0,R,R,1871-05-05,1880-07-06,leona101,leonaan01,1871.0,31.0,148.0,33.0,43.0,8.0,3.0,0.0,30.0,14.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,2.0,WS3,,OLY,,4.0,32.0,,15.0,15.0,,N,,310.0,1353.0,375.0,54.0,26.0,6.0,48.0,13.0,48.0,13.0,,,303.0,137.0,4.37,32.0,0.0,0.0,846.0,371.0,4.0,45.0,13.0,218.0,20.0,0.85,Washington Olympics,Olympics Grounds,,94.0,98.0,OLY,WS3,WS3,,,,,,,,,,,,,NaT,


In [9151]:
# shape
war_batting_people_teams_salaries.shape

(102624, 105)

In [9152]:
# Copy and save to csv
war_batting_people_teams_salaries_pre = war_batting_people_teams_salaries.copy()
war_batting_people_teams_salaries_pre.to_csv('pre_datasets/war_batting_people_teams_salaries_pre.csv', index=False)
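
CSV files do not store dtypes, so datetime columns such as `borndate` come back as plain text unless they are parsed again on load. A small round-trip sketch, using an in-memory buffer and invented values instead of the real file:

```python
import io

import pandas as pd

# Toy frame; the date value is invented for illustration.
toy = pd.DataFrame({
    "name_common": ["A", "B"],
    "year_ID": [1871, 1872],
    "borndate": pd.to_datetime(["1990-01-15", None]),
})

# Write to an in-memory buffer instead of disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without hints loses the datetime dtype.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores it; the empty cell becomes NaT again.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["borndate"])

print(plain["borndate"].dtype, parsed["borndate"].dtype)
```

The same `parse_dates` argument could be passed when the saved file is loaded in the next notebook.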

---

## 4. Next Steps <a class="anchor" id="summary"></a>
- Data wrangling on joined dataset to clean and prepare data for modeling
- EDA on joined dataset to identify potential features for modeling


-----