### Predicting MLB Player Salaries: A Batting Performance Analysis

Author: Hector Guerrero

---
## Raw Datasets Exploration, Basic Cleaning and Merges


### Table of Contents:
1. [Introduction](#introduction)
2. [Basic Data Exploration and cleaning](#basic-data-exploration)
    - 2.1 [WAR Dataset](#war-dataset)
    - 2.2 [Batting Dataset](#batting-dataset)
    - 2.3 [People Dataset](#people-dataset)
    - 2.4 [Salary Dataset](#salary-dataset)
    - 2.5 [Teams Dataset](#teams-dataset)
3. [Table Merges](#table-joins)
4. [Next Steps](#summary)

---

## 1. Introduction

This notebook provides a first exploration of several datasets related to baseball statistics. The datasets cover a wide range of information, including player performance, team performance, player biographical information and salaries. The goal is to understand the structure and content of these datasets and to perform some basic cleaning and preparation so that the data is well suited for subsequent analysis. Finally, we merge the cleaned datasets into a single, comprehensive one for further exploration and analysis.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

#### Dataset Loading <a class="anchor" id="dataset-loading"></a>

In [2]:
# Load datasets
raw_war = pd.read_csv('Raw datasets/war_daily_bat.csv')
raw_teams = pd.read_csv('Raw datasets/Teams.csv')
raw_batting = pd.read_csv('Raw datasets/Batting.csv')
raw_fielding = pd.read_csv('Raw datasets/Fielding.csv')
raw_people = pd.read_csv('Raw datasets/People_x.csv')
raw_salaries = pd.read_csv('Raw datasets/salary_history.csv')
positions = pd.read_csv('Raw datasets/positions.csv')
byears = pd.read_csv('Raw datasets/byears.csv')

---
## 2. Basic Data Exploration and Cleaning  <a class="anchor" id="basic-data-exploration"></a>  
We'll start by examining the shape and features of each dataset, and then perform some basic cleaning and preparation to ensure the data is well-suited for subsequent analysis. This includes converting data types where necessary, and dropping irrelevant columns. We'll also address any inconsistencies or anomalies we come across during our exploration.
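
The per-dataset checks in this section (shape, data types, missing values) follow the same pattern each time. A minimal sketch of a reusable helper is shown below; `summarize` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })

# Toy frame standing in for one of the raw datasets (values are illustrative)
toy = pd.DataFrame({
    "player_ID": ["playera01", "playera01", "playerb01"],
    "year_ID": [2004, 2006, 1973],
    "salary": [300000.0, None, None],
})

summary = summarize(toy)
print(f"shape: {toy.shape}")
print(summary)
```

Calling `summarize(raw_war)` or `summarize(raw_batting)` would give the same overview for the real tables.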



### 2.1 WAR (Wins Above Replacement) Dataset <a class="anchor" id="war-dataset"></a>

In [3]:
pd.set_option('display.max_columns', None)
# Shape and first rows
print(f'raw_war shape: {raw_war.shape}')
print(f'Rows: {raw_war.shape[0]}')
print(f'Columns: {raw_war.shape[1]}')
raw_war.head()

raw_war shape: (121375, 48)
Rows: 121375
Columns: 48


Unnamed: 0,name_common,age,mlb_ID,player_ID,year_ID,team_ID,stint_ID,lg_ID,PA,G,Inn,runs_bat,runs_br,runs_dp,runs_field,runs_infield,runs_outfield,runs_catcher,runs_defense,runs_position,runs_position_p,runs_replacement,runs_above_rep,runs_above_avg,runs_above_avg_off,runs_above_avg_def,WAA,WAA_off,WAA_def,WAR,WAR_def,WAR_off,WAR_rep,salary,pitcher,teamRpG,oppRpG,oppRpPA_rep,oppRpG_rep,pyth_exponent,pyth_exponent_rep,waa_win_perc,waa_win_perc_off,waa_win_perc_def,waa_win_perc_rep,OPS_plus,TOB_lg,TB_lg
0,David Aardsma,22.0,430911.0,aardsda01,2004,SFG,1,NL,0.0,11,10.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.01,0.0,-0.01,0.0,0.0,300000.0,Y,4.67092,4.67092,0.08651,4.67092,1.89,1.89,0.5,0.5,0.5,0.5,,0.0,0.0
1,David Aardsma,24.0,430911.0,aardsda01,2006,CHC,1,NL,3.0,43,53.0,-0.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.46,0.0,-0.4,-0.4,-0.4,0.0,-0.04,-0.04,-0.01,-0.04,-0.01,-0.04,0.0,,Y,4.85675,4.86675,0.09085,4.86457,1.912,1.913,0.499,0.499,0.5,0.4998,-100.0,0.694,0.896
2,David Aardsma,25.0,430911.0,aardsda01,2007,CHW,1,AL,0.0,2,32.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,387500.0,Y,4.85895,4.85895,0.08422,4.85895,1.912,1.912,0.5,0.5,0.5,0.5,,0.0,0.0
3,David Aardsma,26.0,430911.0,aardsda01,2008,BOS,1,AL,1.0,5,48.7,-0.29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14,0.0,-0.2,-0.2,-0.2,0.0,-0.02,-0.02,0.0,-0.02,0.0,-0.02,0.0,403250.0,Y,4.674,4.704,0.08092,4.6965,1.893,1.894,0.497,0.497,0.5,0.4992,-100.0,0.345,0.434
4,David Aardsma,27.0,430911.0,aardsda01,2009,SEA,1,AL,0.0,3,71.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,419000.0,Y,4.79788,4.79788,0.08302,4.79788,1.905,1.905,0.5,0.5,0.5,0.5,,0.0,0.0


In [4]:
# Fill pitcher NaN with N
raw_war['pitcher'] = raw_war['pitcher'].fillna('N')

##### Dictionary

| Column Name | Description |  
| --- | --- |
| name_common | Player name |
| age | Player age |
| mlb_ID | MLB ID |
| player_ID | Player ID |
| year_ID | Year (Season played) |
| team_ID | ID of the team they played for the season |
| stint_ID | The stint of the player in that year (a player can have more than one stint in a year if they moved teams). |
| lg_ID | The league the player played in |
| PA | Plate appearances |
| G | Games played |
| Inn | Innings played |
| runs_bat | Runs from batting |
| runs_br | Runs from baserunning |
| runs_dp | Runs from avoiding double plays |
| runs_field | Runs from fielding |
| runs_infield | Runs from infield |
| runs_outfield | Runs from outfield |
| runs_catcher | Runs from catching |
| runs_defense | Runs from defense |
| runs_position | Runs from position |
| runs_position_p | Runs from position as pitcher |
| runs_replacement | Runs from replacement |
| runs_above_rep | Runs above replacement |
| runs_above_avg | Runs above average |
| runs_above_avg_off | Runs above average as batter |  
| runs_above_avg_def | Runs above average as fielder |
| WAA | Wins above average |
| WAA_off | Wins above average as batter |
| WAA_def | Wins above average as fielder |
| WAR | Wins above replacement |
| WAR_def | Wins above replacement as fielder |
| WAR_off | Wins above replacement as batter |
| WAR_rep | Wins above replacement as replacement |
| salary | Player salary |
| pitcher | Whether the player is a pitcher |
| teamRpG | Team runs per game |
| oppRpG | Opponent runs per game |
| oppRpPA_rep | Opponent runs per plate appearance as replacement |
| oppRpG_rep | Opponent runs per game as replacement |
| pyth_exponent | Pythagorean exponent |
| pyth_exponent_rep | Pythagorean exponent as replacement |
| waa_win_perc | Win percentage above average |
| waa_win_perc_off | Win percentage above average as batter |
| waa_win_perc_def | Win percentage above average as fielder |
| waa_win_perc_rep | Win percentage above average as replacement |
| OPS_plus | OPS+ |
| TOB_lg | Times on base in league |
| TB_lg | Total bases in league |

The WAR dataset provides a comprehensive view of player performance, capturing the value of a player in all facets of the game including batting, baserunning, fielding, and pitching. The dataset consists of 121,375 rows and 48 columns.

The dataset also includes a variety of other metrics related to player performance. The goal of the WAR metric, in particular, is to summarize a player's total contributions to their team in one statistic.


In [5]:
# Data types
raw_war.dtypes

name_common            object
age                   float64
mlb_ID                float64
player_ID              object
year_ID                 int64
team_ID                object
stint_ID                int64
lg_ID                  object
PA                    float64
G                       int64
Inn                   float64
runs_bat              float64
runs_br               float64
runs_dp               float64
runs_field            float64
runs_infield          float64
runs_outfield         float64
runs_catcher          float64
runs_defense          float64
runs_position         float64
runs_position_p       float64
runs_replacement      float64
runs_above_rep        float64
runs_above_avg        float64
runs_above_avg_off    float64
runs_above_avg_def    float64
WAA                   float64
WAA_off               float64
WAA_def               float64
WAR                   float64
WAR_def               float64
WAR_off               float64
WAR_rep               float64
salary    

It looks like all the data types are correct, so we can move on to the next step.
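
Identifier-like columns such as `mlb_ID` are stored as `float64`, which is what pandas does when an integer column contains missing values (assumed here to be the reason). If whole-number display is preferred, an optional conversion to the nullable `Int64` dtype can be sketched as follows; the toy frame is illustrative only.

```python
import pandas as pd

# Toy frame: integer-like values stored as floats because of missing entries
toy = pd.DataFrame({
    "mlb_ID": [430911.0, None, 113673.0],
    "age": [22.0, 24.0, None],
})

# Confirm every non-missing value is a whole number before converting
is_whole = {c: bool((toy[c].dropna() % 1 == 0).all()) for c in toy.columns}

# Nullable integer dtype keeps missing values as <NA> instead of forcing floats
converted = toy.astype({"mlb_ID": "Int64", "age": "Int64"})
print(is_whole)
print(converted.dtypes)
```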


Most of these columns are difficult to interpret and do not add much value to our analysis. Therefore, we will drop them to simplify the dataset and keep only the columns that are most relevant.

In [6]:
cols_to_keep = ['name_common',
 'age',
 'mlb_ID',
 'player_ID',
 'year_ID',
 'team_ID',
 'stint_ID',
 'lg_ID',
 'PA',
 'G',
 'Inn', 'salary', 'WAR', 'pitcher']

The columns we are keeping include basic player information (like name, age, and team), playing-time metrics that are easy to interpret (Plate Appearances (`PA`), Games played (`G`), and Innings played (`Inn`)), and variables that are likely to be relevant to salary (Wins Above Replacement (`WAR`), the recorded `salary`, and whether the player is a pitcher). The component metrics, such as offensive and defensive WAR and `OPS_plus`, are dropped along with the rest.

In [7]:
# Drop columns
raw_war = raw_war[cols_to_keep]

In [8]:
# Drop rows with year = 2023
raw_war = raw_war[raw_war['year_ID'] != 2023]

Some pre-cleaning:

In [9]:
# Remove special characters and punctuation from name_common
raw_war['name_common'] = raw_war['name_common'].str.replace(r"[\"\',.]", '', regex=True)

# # lower case
# raw_war['name_common'] = raw_war['name_common'].str.lower()


In [10]:
raw_war.shape

(120757, 14)

##### `stint_ID` column
In our dataset, a player can have multiple entries for a single season due to having multiple stints. This can occur when a player is traded or moves teams during the season. Each stint is recorded as a separate row, which is not ideal for our analysis as we want a single row per player per season.

To resolve this, we'll aggregate the data at the player and season level. This means we'll combine the statistics for all stints a player had in a single season.
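
One way to express this aggregation is with named aggregations, which let each column get its own rule: counting stats are summed, while identifiers and the first team are kept as they are. The sketch below uses made-up player IDs and values, and is an alternative illustration rather than the exact code used in the following cells.

```python
import pandas as pd

# Toy stint-level data: playera01 has two stints in 1975 (values are made up)
toy = pd.DataFrame({
    "player_ID": ["playera01", "playera01", "playerb01"],
    "year_ID":   [1975, 1975, 2004],
    "stint_ID":  [1, 2, 1],
    "team_ID":   ["AAA", "BBB", "CCC"],
    "mlb_ID":    [111111.0, 111111.0, 222222.0],
    "PA":        [40.0, 207.0, 0.0],
    "G":         [21, 60, 11],
})

per_season = (
    toy.sort_values("stint_ID")
       .groupby(["player_ID", "year_ID"], as_index=False)
       .agg(
           mlb_ID=("mlb_ID", "first"),       # identifier: keep, do not sum
           first_team=("team_ID", "first"),  # team of the first stint
           n_stints=("stint_ID", "count"),   # number of stints that season
           PA=("PA", "sum"),                 # counting stats are summed
           G=("G", "sum"),
       )
)
print(per_season)
```

With this approach identifier columns are never added together, and the stint count is available directly.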

In [11]:
# New dataframe to preserve the first team the player played for in a season
raw_war_teams = raw_war[['player_ID', 'year_ID', 'team_ID', 'stint_ID']]

# drop rows with stint_ID > 1
raw_war_teams = raw_war_teams[raw_war_teams['stint_ID'] == 1]

# Drop stint_ID column
raw_war_teams.drop('stint_ID', axis=1, inplace=True)

In [12]:
# group raw_war 
raw_war_grouped = raw_war.groupby(['name_common', 'age', 'year_ID', 'player_ID', 'pitcher']).sum(numeric_only=True).reset_index()
raw_war_grouped.shape


(110194, 12)

In [13]:
# Join raw_war_grouped with raw_war_teams
war_pre = raw_war_grouped.merge(raw_war_teams, on=['player_ID', 'year_ID'])
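
Because the goal is exactly one row per player per season, it can help to let pandas verify the join. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on toy frames; the frame names, the player IDs and the choice of a left join are assumptions made for illustration.

```python
import pandas as pd

keys = ["player_ID", "year_ID"]
grouped = pd.DataFrame({"player_ID": ["a01", "b01"], "year_ID": [2004, 2004], "PA": [247.0, 10.0]})
teams = pd.DataFrame({"player_ID": ["a01", "b01"], "year_ID": [2004, 2004], "team_ID": ["AAA", "BBB"]})

# validate raises if either side has duplicate keys; indicator flags unmatched rows
merged = grouped.merge(teams, on=keys, how="left", validate="one_to_one", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
n_dupes = int(merged.duplicated(keys).sum())

# A duplicated key on the right side is caught instead of silently adding rows
bad_teams = pd.concat([teams, teams.iloc[[0]]], ignore_index=True)
try:
    grouped.merge(bad_teams, on=keys, validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(unmatched), n_dupes, raised)
```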

In [14]:
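# Convert the summed stint_ID back into a number of stints:
# stints numbered 1..n sum to n(n+1)/2, i.e. 1, 3, 6, 10, 15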
war_pre['stint_ID'] = np.where(war_pre['stint_ID'] == 1, 1, 
                                        np.where(war_pre['stint_ID'] == 3, 2, 
                                            np.where(war_pre['stint_ID'] == 6, 3, 
                                                np.where(war_pre['stint_ID'] == 10, 4, 
                                                    np.where(war_pre['stint_ID'] == 15, 5, 0)))))
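
The nested `np.where` above works because stints numbered 1 to n add up to n(n+1)/2, so each summed value identifies a stint count. A more compact equivalent, sketched on a toy series, builds that lookup explicitly; `sum_to_count` is a hypothetical name.

```python
import pandas as pd

# Stints numbered 1..n sum to n*(n+1)/2, so the sum identifies the count
sum_to_count = {n * (n + 1) // 2: n for n in range(1, 6)}  # {1: 1, 3: 2, 6: 3, 10: 4, 15: 5}

summed = pd.Series([1, 3, 6, 10, 15, 2])

# Any value that is not one of the expected sums falls back to 0, as in the cell above
n_stints = summed.map(sum_to_count).fillna(0).astype(int)
print(n_stints.tolist())
```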

In [15]:
# Sanity check: inspect a player with more than one team in a year
# Jim Dwyer
war_pre[war_pre['name_common'] == 'Jim Dwyer']

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G,Inn,salary,WAR,team_ID
52713,Jim Dwyer,23.0,1973,dwyerji01,N,113673.0,1,58.0,28,127.0,0.0,-0.78,STL
52714,Jim Dwyer,24.0,1974,dwyerji01,N,113673.0,1,100.0,74,122.7,0.0,0.37,STL
52715,Jim Dwyer,25.0,1975,dwyerji01,N,227346.0,2,247.0,81,469.0,0.0,0.99,STL
52716,Jim Dwyer,26.0,1976,dwyerji01,N,227346.0,2,119.0,61,166.0,0.0,-0.64,MON
52717,Jim Dwyer,27.0,1977,dwyerji01,N,113673.0,1,37.0,13,79.0,0.0,-0.24,STL
52718,Jim Dwyer,28.0,1978,dwyerji01,N,227346.0,2,283.0,107,531.4,0.0,-0.27,STL
52719,Jim Dwyer,29.0,1979,dwyerji01,N,113673.0,1,133.0,76,285.7,0.0,-0.25,BOS
52720,Jim Dwyer,30.0,1980,dwyerji01,N,113673.0,1,292.0,93,626.7,0.0,0.56,BOS
52721,Jim Dwyer,31.0,1981,dwyerji01,N,113673.0,1,157.0,68,347.3,0.0,-0.06,BAL
52722,Jim Dwyer,32.0,1982,dwyerji01,N,113673.0,1,178.0,71,343.7,0.0,1.16,BAL


In [16]:
# Shape
print(f'raw_war shape: {raw_war.shape}')

raw_war shape: (120757, 14)


In [17]:
# Save to csv
war_pre.to_csv('pre_datasets/war_pre.csv', index=False)

### 2.2 Batting Dataset <a class="anchor" id="batting-dataset"></a>



In [18]:
# Shape and first rows
print(f'raw_batting shape: {raw_batting.shape}')
print(f'Rows: {raw_batting.shape[0]}')
print(f'Columns: {raw_batting.shape[1]}')
raw_batting.sample(5)

raw_batting shape: (112184, 22)
Rows: 112184
Columns: 22


Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
99481,quentca01,2014,1,SDN,NL,50,130,9,23,6,0,4,18.0,0.0,0.0,17,33.0,0.0,4.0,0.0,4.0,5.0
58024,ottenji01,1980,1,SLN,NL,31,5,0,1,0,0,0,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0
17256,finneha01,1918,2,NYA,AL,23,39,4,9,1,0,0,1.0,0.0,,1,7.0,,2.0,1.0,,
15174,wallabo01,1914,1,SLA,AL,26,73,3,16,2,1,0,5.0,1.0,1.0,5,13.0,,0.0,4.0,,
93772,ortizda01,2010,1,BOS,AL,145,518,86,140,36,1,32,102.0,0.0,1.0,82,145.0,14.0,2.0,0.0,4.0,12.0


##### Dictionary

| Column Name | Description |
| --- | --- |
| playerID | Player ID |
| yearID | Year (Season played) |
| stint | The stint of the player in that year (a player can have more than one stint in a year if they moved teams). |
| teamID | ID of the team they played for the season |
| lgID | The league the player played in |
| G | Games played |
| AB | At bats |
| R | Runs |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR | Home runs |
| RBI | Runs batted in |
| SB | Stolen bases |
| CS | Caught stealing |
| BB | Walks |
| SO | Strikeouts |
| IBB | Intentional walks |
| HBP | Hit by pitch |
| SH | Sacrifice hits |
| SF | Sacrifice flies |
| GIDP | Grounded into double plays |



The batting dataset is a comprehensive collection of player batting statistics. It contains 112,184 rows and 22 columns, each row representing a player's batting statistics for one stint with one team in a particular season.

The dataset includes both pitchers and position players. Pitchers typically have fewer at-bats and different offensive statistics than position players, which is something to keep in mind during the analysis.

In [19]:
# Data types
raw_batting.dtypes

playerID     object
yearID        int64
stint         int64
teamID       object
lgID         object
G             int64
AB            int64
R             int64
H             int64
2B            int64
3B            int64
HR            int64
RBI         float64
SB          float64
CS          float64
BB            int64
SO          float64
IBB         float64
HBP         float64
SH          float64
SF          float64
GIDP        float64
dtype: object

It looks like all the data types are correct, so we can move on to the next step.

For now, we will keep all the columns in the batting dataset. We can always drop unnecessary columns later once we understand the data better.

##### `stint` column
As in the WAR dataset, a player can have multiple entries for a single season due to having multiple stints. This can occur when a player is traded or moves teams during the season. Each stint is recorded as a separate row, which is not ideal for our analysis as we want a single row per player per season.

To resolve this, we'll follow the same process as we did for the WAR dataset.
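
After collapsing stints, two properties are worth checking: each player-season appears once, and the column totals are unchanged. The sketch below shows both checks on toy data with made-up values; `stat_cols` is a hypothetical subset of the batting columns.

```python
import pandas as pd

# Toy stint-level batting data: player a01 changed teams during 2021
stints = pd.DataFrame({
    "playerID": ["a01", "a01", "b01"],
    "yearID":   [2021, 2021, 2021],
    "stint":    [1, 2, 1],
    "teamID":   ["AAA", "BBB", "CCC"],
    "HR":       [25, 13, 4],
    "AB":       [310, 188, 50],
})
stat_cols = ["HR", "AB"]

# Sum only the counting stats per player-season
seasons = stints.groupby(["playerID", "yearID"], as_index=False)[stat_cols].sum()

# Check 1: one row per player-season
has_dupes = bool(seasons.duplicated(["playerID", "yearID"]).any())
# Check 2: aggregation neither lost nor double-counted anything
totals_match = bool((seasons[stat_cols].sum() == stints[stat_cols].sum()).all())

print(seasons)
print(has_dupes, totals_match)
```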

In [20]:
# New dataframe 
raw_batting_teams = raw_batting[['playerID', 'yearID', 'teamID', 'stint']]
raw_batting_teams.shape

(112184, 4)

In [21]:
# drop rows with stint > 1
raw_batting_teams = raw_batting_teams[raw_batting_teams['stint'] == 1]

# drop stint column
raw_batting_teams.drop('stint', axis=1, inplace=True)

In [22]:
# group raw_batting 
raw_batting_grouped = raw_batting.groupby(['playerID', 'yearID']).sum(numeric_only=True).reset_index()


In [23]:
# Join raw_batting_grouped with raw_batting_teams 
batting_pre = raw_batting_grouped.merge(raw_batting_teams, on=['playerID', 'yearID'])

In [24]:
# Sanity check, check for players with more than one team in a year
# Joey Gallo
batting_pre[batting_pre['playerID'] == 'gallojo01']

Unnamed: 0,playerID,yearID,stint,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,teamID
31674,gallojo01,2015,1,36,108,16,22,3,1,6,14.0,3.0,0.0,15,57.0,3.0,0.0,0.0,0.0,0.0,TEX
31675,gallojo01,2016,1,17,25,2,1,0,0,1,1.0,1.0,0.0,5,19.0,0.0,0.0,0.0,0.0,0.0,TEX
31676,gallojo01,2017,1,145,449,85,94,18,3,41,80.0,7.0,2.0,75,196.0,1.0,8.0,0.0,0.0,3.0,TEX
31677,gallojo01,2018,1,148,500,82,103,24,1,40,92.0,3.0,4.0,74,207.0,4.0,3.0,0.0,0.0,3.0,TEX
31678,gallojo01,2019,1,70,241,54,61,15,1,22,49.0,4.0,2.0,52,114.0,4.0,2.0,1.0,1.0,0.0,TEX
31679,gallojo01,2020,1,57,193,23,35,8,0,10,26.0,2.0,0.0,29,79.0,2.0,4.0,0.0,0.0,0.0,TEX
31680,gallojo01,2021,3,153,498,90,99,13,1,38,77.0,6.0,0.0,111,213.0,5.0,6.0,0.0,1.0,6.0,TEX
31681,gallojo01,2022,3,126,350,48,56,8,2,19,47.0,3.0,0.0,56,163.0,0.0,3.0,0.0,1.0,0.0,NYA


In [25]:
# drop stint column
batting_pre.drop('stint', axis=1, inplace=True)

In [26]:
# Save to csv
batting_pre.to_csv('pre_datasets/batting_pre.csv', index=False)

### 2.3 People Dataset <a class="anchor" id="people-dataset"></a>



In [27]:
# Shape and first rows
print(f'raw_people shape: {raw_people.shape}')
print(f'Rows: {raw_people.shape[0]}')
print(f'Columns: {raw_people.shape[1]}')
raw_people.head()


raw_people shape: (20811, 24)
Rows: 20811
Columns: 24


Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,2021.0,1.0,22.0,USA,GA,Atlanta,Hank,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


##### Dictionary

| Column Name | Description |
| --- | --- |
| playerID | Player ID |
| birthYear | Year of birth |
| birthMonth | Month of birth |
| birthDay | Day of birth |
| birthCountry | Country of birth |
| birthState | State of birth |
| birthCity | City of birth |
| deathYear | Year of death |
| deathMonth | Month of death |
| deathDay | Day of death |
| deathCountry | Country of death |
| deathState | State of death |
| deathCity | City of death |
| nameFirst | First name |
| nameLast | Last name |
| nameGiven | Given names (first and middle) |
| weight | Weight in pounds |
| height | Height in inches |
| bats | Batting hand (left, right, or both) |
| throws | Throwing hand (left or right) |
| debut | Date of MLB debut |
| finalGame | Date of final MLB game |
| retroID | Retro ID |
| bbrefID | Baseball Reference ID |


The people dataset contains biographical information about baseball players. This dataset has 20,811 rows and 24 columns. Each row represents a player, and the columns provide various details about the player, such as their birth and death details, physical attributes, and career details.

This dataset will be useful for adding context to our analysis, such as the player's age during each season, their physical attributes, and their career span.
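
As an illustration of deriving a player's age for a season from the birth columns, the sketch below assumes the convention of measuring age on June 30 of the season (the convention is an assumption here, though it reproduces the ages shown in the WAR dataset for the example rows). `season_age` is a hypothetical helper.

```python
import pandas as pd

def season_age(birth_year, birth_month, season, cutoff_month=6):
    """Age on June 30 of `season`: players born in July or later
    have not yet had their birthday by the cutoff."""
    had_birthday = birth_month <= cutoff_month
    return season - birth_year - (~had_birthday).astype(int)

# Birth data taken from the first two rows of the people table shown above
toy = pd.DataFrame({
    "playerID":   ["aardsda01", "aaronha01"],
    "birthYear":  [1981.0, 1934.0],
    "birthMonth": [12.0, 2.0],
    "season":     [2004, 1954],
})
toy["age"] = season_age(toy["birthYear"], toy["birthMonth"], toy["season"])
print(toy)
```

Rows with a missing birth year produce a missing age, so they would need separate handling.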

In [28]:
# Data types
raw_people.dtypes

playerID         object
birthYear       float64
birthMonth      float64
birthDay        float64
birthCountry     object
birthState       object
birthCity        object
deathYear       float64
deathMonth      float64
deathDay        float64
deathCountry     object
deathState       object
deathCity        object
nameFirst        object
nameLast         object
nameGiven        object
weight          float64
height          float64
bats             object
throws           object
debut            object
finalGame        object
retroID          object
bbrefID          object
dtype: object

It looks like all the data types are correct, so we can move on to the next step.

##### Personal columns

Some columns in this dataset are not relevant to our project and will be dropped. These are personal details about the player: the given name, the state and city of birth, and the date and place of death. The birth year, month, day and country are kept.

Personal columns are not directly related to the project's analysis of batting performance. By removing them from the dataset, we can eliminate unnecessary personal information and narrow the scope of the project to the pertinent variables.

In [29]:
# List of personal columns that are not relevant
personal_columns = ['nameGiven','birthState', 'birthCity', 'deathYear', 'deathMonth', 'deathDay', 'deathCountry', 'deathState', 'deathCity']

In [30]:
# Drop personal columns
raw_people.drop(personal_columns, axis=1, inplace=True)

Some pre-cleaning:

In [31]:
# Remove special characters and punctuation from nameFirst and nameLast
raw_people['nameFirst'] = raw_people['nameFirst'].str.replace(r"[\"\',]", '', regex=True)
raw_people['nameLast'] = raw_people['nameLast'].str.replace(r"[\"\',]", '', regex=True)

# Remove periods; regex=False treats '.' as a literal character
raw_people['nameFirst'] = raw_people['nameFirst'].str.replace('.', '', regex=False)
raw_people['nameLast'] = raw_people['nameLast'].str.replace('.', '', regex=False)

raw_people['nameFirst'] = raw_people['nameFirst'].str.replace(' ', '')

# Lower case
raw_people['nameFirst'] = raw_people['nameFirst'].str.lower()
raw_people['nameLast'] = raw_people['nameLast'].str.lower()
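
The same name clean-up is applied to several tables in this notebook, so it could be wrapped in one function to keep the rules identical everywhere. The sketch below is one possible shape for such a helper; `clean_name` is hypothetical, and the explicit `regex=True` makes the character class behave the same across pandas versions.

```python
import pandas as pd

def clean_name(names: pd.Series) -> pd.Series:
    """Strip quotes, commas and periods, trim whitespace, and lower-case."""
    return (
        names.str.replace(r"[\"\',.]", "", regex=True)
             .str.strip()
             .str.lower()
    )

# Toy examples, including a missing value
toy = pd.Series(["J.D.", "O'Neill", "Smith, Jr.", None])
cleaned = clean_name(toy)
print(cleaned.tolist())
```

Missing names stay missing, so the helper can be applied before any null handling.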



In [32]:
# Shape
raw_people.shape

(20811, 15)

In [33]:
# Create a copy and save it to csv
people_pre = raw_people.copy()
people_pre.to_csv('pre_datasets/people_pre.csv', index=False)

### 2.4 Salary Dataset <a class="anchor" id="salary-dataset"></a>

In [34]:
# Shape and first rows
print(f'raw_salaries shape: {raw_salaries.shape}')
print(f'Rows: {raw_salaries.shape[0]}')
print(f'Columns: {raw_salaries.shape[1]}')
raw_salaries.head() 

raw_salaries shape: (46450, 19)
Rows: 46450
Columns: 19


Unnamed: 0,firstname,lastname,playerid,mlbid,year,salary,TeamName,age,leaguerank,teamrank,averagesalary,leagueminimum,serviceTime,borndate,name,first3,last3,first3last3,middle2
0,David,Aardsma,25001,430911,2004,300000,San Francisco Giants,22,0,0,2313535,300000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi
1,David,Aardsma,25001,430911,2006,327000,Chicago Cubs,24,0,0,2699292,327000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi
2,David,Aardsma,25001,430911,2007,387500,Chicago White Sox,25,675,23,2824751,380000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi
3,David,Aardsma,25001,430911,2008,403250,Boston Red Sox,26,632,27,2925679,390000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi
4,David,Aardsma,25001,430911,2009,419000,Seattle Mariners,27,635,20,2996106,400000,Null,1981-12-27,David Aardsma,Dav,Aar,DavAar,avi


##### Dictionary

| Column Name | Description |
| --- | --- |
| firstname | The player's First name |
| lastname | The player's Last name |
| playerid | A unique identifier for each player |
| mlbid | MLB ID (the same identifier as `mlb_ID` in the WAR dataset) |
| year | Year of the salary |
| salary | Player salary for the year |
| TeamName | Name of the team the player was on that year |
| age | The player's age during the year of the salary |
| leaguerank | The player's rank in the league based on salary |
| teamrank | The player's rank in the team based on salary |
| averagesalary | The average salary in the league for the year |
| leagueminimum | The minimum salary in the league for the year |
| serviceTime | The player's service time in the league |
| borndate | The player's birth date |

The salaries dataset provides detailed information about the salaries of baseball players. This dataset contains 46,450 rows and 19 columns. Each row represents a player's salary for a specific year, and the columns provide various details about the salary, the player, and the team they played for.

This dataset will be crucial for our analysis as it provides the target variable we want to predict - the player's salary. It also provides context about how the player's salary compares to others in the league and on their team.

Upon initial inspection of the salaries dataset, we noticed that the `serviceTime` column shows 'Null' in every one of the first rows.

Additionally, this column doesn't seem to provide any crucial information for our analysis. Therefore, we'll drop this column from the dataset.

In [35]:
# Null values
raw_salaries.isnull().sum()

firstname        0
lastname         0
playerid         0
mlbid            0
year             0
salary           0
TeamName         0
age              0
leaguerank       0
teamrank         0
averagesalary    0
leagueminimum    0
serviceTime      0
borndate         0
name             0
first3           0
last3            0
first3last3      0
middle2          2
dtype: int64

In [36]:
# 'Null' in serviceTime column
raw_salaries[raw_salaries['serviceTime'] == 'Null'].shape

(35121, 19)

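The `serviceTime` column has no true null values: the 'Null' entries are plain strings, so `isnull()` does not count them. There are 35,121 of them, about 76% of the 46,450 rows. The only genuinely missing values in the dataset are two in `middle2`. We are going to drop the `serviceTime` column, as it is mostly empty and doesn't provide any crucial information for our analysis.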
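
The difference between a missing value and the string 'Null' can be reproduced on a small in-memory CSV. The sketch below also shows the `na_values` option of `pd.read_csv`, which would flag such placeholders at load time; the CSV content, including the service-time value, is made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the salary file; 'Null' is a literal string in the data
csv_text = (
    "playerid,year,serviceTime\n"
    "25001,2004,Null\n"
    "25001,2006,Null\n"
    "25002,2006,3.045\n"
)

as_loaded = pd.read_csv(io.StringIO(csv_text))
with_na = pd.read_csv(io.StringIO(csv_text), na_values=["Null"])

n_default = int(as_loaded["serviceTime"].isna().sum())       # 'Null' is kept as text
n_strings = int((as_loaded["serviceTime"] == "Null").sum())  # count of placeholder strings
n_flagged = int(with_na["serviceTime"].isna().sum())         # placeholders read as missing

print(n_default, n_strings, n_flagged)
```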

In [37]:
# Drop serviceTime column
raw_salaries.drop('serviceTime', axis=1, inplace=True)

In [38]:
# Data types
raw_salaries.dtypes

firstname        object
lastname         object
playerid          int64
mlbid             int64
year              int64
salary            int64
TeamName         object
age               int64
leaguerank        int64
teamrank          int64
averagesalary     int64
leagueminimum     int64
borndate         object
name             object
first3           object
last3            object
first3last3      object
middle2          object
dtype: object

Some pre-cleaning:

In [39]:
# Remove special characters and punctuation from firstname and lastname
raw_salaries['firstname'] = raw_salaries['firstname'].str.replace(r"[\"\',]", '', regex=True)
raw_salaries['lastname'] = raw_salaries['lastname'].str.replace(r"[\"\',]", '', regex=True)

# Remove periods; regex=False treats '.' as a literal character
raw_salaries['firstname'] = raw_salaries['firstname'].str.replace('.', '', regex=False)
raw_salaries['lastname'] = raw_salaries['lastname'].str.replace('.', '', regex=False)

# lower case
raw_salaries['firstname'] = raw_salaries['firstname'].str.lower()
raw_salaries['lastname'] = raw_salaries['lastname'].str.lower()





##### Full name column
In our current dataset, player names are split into two separate columns, `firstname` and `lastname`, which we have just cleaned. The existing `name` column still holds the original, uncleaned spelling, so we rebuild it from the cleaned columns to get a single full-name column in a consistent format.

In [40]:
# concat first and last name
raw_salaries['name'] = raw_salaries['firstname'] + ' ' + raw_salaries['lastname']

raw_salaries.head()

Unnamed: 0,firstname,lastname,playerid,mlbid,year,salary,TeamName,age,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name,first3,last3,first3last3,middle2
0,david,aardsma,25001,430911,2004,300000,San Francisco Giants,22,0,0,2313535,300000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi
1,david,aardsma,25001,430911,2006,327000,Chicago Cubs,24,0,0,2699292,327000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi
2,david,aardsma,25001,430911,2007,387500,Chicago White Sox,25,675,23,2824751,380000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi
3,david,aardsma,25001,430911,2008,403250,Boston Red Sox,26,632,27,2925679,390000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi
4,david,aardsma,25001,430911,2009,419000,Seattle Mariners,27,635,20,2996106,400000,1981-12-27,david aardsma,Dav,Aar,DavAar,avi


##### `borndate` column
The `borndate` column contains the birth date of each player. Currently this column is in string format, which is not ideal for our analysis. We'll convert this column to a datetime format.

In [41]:
# borndate to datetime
raw_salaries['borndate'] = pd.to_datetime(raw_salaries['borndate'])
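
If some birth dates were malformed, the conversion above would raise an error. A more defensive variant, sketched here on a toy series, passes an explicit format and `errors="coerce"` so that unparsable entries become `NaT` and can be counted. Whether the real column contains such entries is not known; the bad value below is invented for the example.

```python
import pandas as pd

# Toy birth dates; the last entry is deliberately invalid
toy = pd.Series(["1981-12-27", "1934-02-05", "not a date"])

# Unparsable values become NaT instead of raising
parsed = pd.to_datetime(toy, format="%Y-%m-%d", errors="coerce")
n_bad = int(parsed.isna().sum())

# Datetime columns expose components such as the year directly
birth_year = parsed.dt.year
print(n_bad)
print(birth_year.tolist())
```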

In [42]:
# Drop rows with year = 2023
raw_salaries = raw_salaries[raw_salaries['year'] != 2023]

In [43]:
drop_rows = ['ChJi197', 'ChTr196']

In [44]:
# Create a copy and save it to csv
salaries_pre = raw_salaries.copy()
salaries_pre.to_csv('pre_datasets/salaries_pre.csv', index=False)

### 2.5 Teams Dataset <a class="anchor" id="teams-dataset"></a>

In [45]:
# Shape and first rows
print(f'raw_teams shape: {raw_teams.shape}')
print(f'Rows: {raw_teams.shape[0]}')
print(f'Columns: {raw_teams.shape[1]}')
raw_teams.sample(5)

raw_teams shape: (3015, 47)
Rows: 3015
Columns: 47


Unnamed: 0,yearID,lgID,teamID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R,AB,H,2B,3B,HR,BB,SO,SB,CS,HBP,SF,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
2484,2005,AL,DET,DET,C,4,162,81.0,71,91,N,N,N,723,5602,1521,283,45,168,384,1038.0,66.0,28.0,53.0,52.0,787,719,4.51,7,2,37,4307,1504,193,461,907,110,171,0.982,Detroit Tigers,Comerica Park,2024431.0,98,98,DET,DET,DET
1322,1958,NL,PHI,PHI,,8,154,77.0,69,85,,N,N,664,5363,1424,238,56,124,573,871.0,51.0,33.0,,,762,671,4.32,51,6,15,4191,1480,148,446,778,129,136,0.978,Philadelphia Phillies,Connie Mack Stadium,931110.0,99,100,PHI,PHI,PHI
2282,1998,NL,MON,WSN,E,4,162,81.0,65,97,N,N,N,644,5418,1348,280,32,147,439,1058.0,91.0,46.0,60.0,37.0,783,695,4.38,4,5,39,4281,1448,156,533,1017,155,127,0.975,Montreal Expos,Stade Olympique,914909.0,98,99,MON,MON,MON
1139,1947,NL,CHN,CHC,,6,155,79.0,69,85,,N,N,567,5305,1373,231,48,71,471,578.0,22.0,,,,722,614,4.04,46,8,15,4101,1449,106,618,571,150,159,0.975,Chicago Cubs,Wrigley Field,1364039.0,96,97,CHC,CHN,CHN
2189,1995,NL,COL,COL,W,2,144,72.0,77,67,N,N,N,785,4994,1406,259,43,200,484,943.0,125.0,59.0,56.0,31.0,783,711,4.97,1,1,43,3865,1443,160,512,891,107,146,0.981,Colorado Rockies,Coors Field,3390037.0,129,129,COL,COL,COL


##### Dictionary

| Column Name | Description |
| --- | --- |
| yearID | Year (Season played) |
| lgID | The league the team played in |
| teamID | Team ID |
| franchID | Franchise ID |
| divID | Division ID |
| Rank | Team rank at the end of the season |
| G | Games played |
| Ghome | Games played at home |
| W | Wins |
| L | Losses |
| DivWin | Division Winner (Y or N) |
| WCWin | Wild Card Winner (Y or N); not present in the columns loaded above |
| LgWin | League Champion(Y or N) |
| WSWin | World Series Winner (Y or N) |
| R | Total number of runs scored by the team in the season |
| AB | Total number of at bats by the team in the season |
| H | Total number of hits by the team in the season |
| 2B | Total number of doubles by the team in the season |
| 3B | Total number of triples by the team in the season |
| HR | Total number of home runs by the team in the season |
| BB | Total number of walks by the team in the season |
| SO | Total number of strikeouts by the team in the season |
| SB | Total number of stolen bases by the team in the season |
| CS | Total number of times caught stealing by the team in the season |
| HBP | Total number of times hit by pitch by the team in the season |
| SF | Total number of sacrifice flies by the team in the season |
| RA | Total number of runs allowed by the team in the season |
| ER | Total number of earned runs allowed by the team in the season |
| ERA | Earned run average |
| CG | Total number of complete games pitched by the team in the season |
| SHO | Total number of shutouts pitched by the team in the season |
| SV | Total number of saves by the team in the season |
| IPouts | Total number of outs pitched by the team in the season |
| HA | Total number of hits allowed by the team in the season |
| HRA | Total number of home runs allowed by the team in the season |
| BBA | Total number of walks allowed by the team in the season |
| SOA | Total number of strikeouts recorded by the team's pitchers in the season |
| E | Total number of errors by the team in the season |
| DP | Total number of double plays turned by the team in the season |
| FP | Fielding percentage |
| name | Team name |
| park | Team park |
| attendance | Total attendance for the season |
| BPF | Three-year park factor for batters |
| PPF | Three-year park factor for pitchers |
| teamIDBR | Team ID used by Baseball Reference website |
| teamIDlahman45 | Team ID used in Lahman database version 4.5 |
| teamIDretro | Team ID used by Retrosheet |

The `raw_teams` dataset is a comprehensive collection of team-based statistics for each season. It has 3015 rows and 47 columns, each row representing a team's performance in a particular season.

For the time being, we'll keep all the columns as they might provide useful insights for our analysis. We can always drop unnecessary columns later once we understand the data better.

In [46]:
# Data types
raw_teams.dtypes

yearID              int64
lgID               object
teamID             object
franchID           object
divID              object
Rank                int64
G                   int64
Ghome             float64
W                   int64
L                   int64
DivWin             object
LgWin              object
WSWin              object
R                   int64
AB                  int64
H                   int64
2B                  int64
3B                  int64
HR                  int64
BB                  int64
SO                float64
SB                float64
CS                float64
HBP               float64
SF                float64
RA                  int64
ER                  int64
ERA               float64
CG                  int64
SHO                 int64
SV                  int64
IPouts              int64
HA                  int64
HRA                 int64
BBA                 int64
SOA                 int64
E                   int64
DP                  int64
FP          

It looks like all the data types are correct, so we can move on to the next step.

In [47]:
# Create a copy and save it to csv
teams_pre = raw_teams.copy()
teams_pre.to_csv('pre_datasets/teams_pre.csv', index=False)

---

## 3. Table Merges <a class="anchor" id="table-joins"></a>

Now, let's join our dataframes to create a single dataset that contains player-level information for each season. The resulting dataset will be used for further analysis and modeling.

The goal here is to find a common key between the datasets that can be used to join them. This might require some trial and error, and possibly some data cleaning to ensure the keys match up correctly, so the process can be iterative.
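One way to measure how well two key columns line up before committing to a join is the `indicator=True` option of `merge`, which adds a `_merge` column recording whether each row found a partner. The sketch below uses small made-up frames (the IDs and values are illustrative, not taken from the datasets) and only shows the pattern.

```python
import pandas as pd

# Toy frames; IDs and values are made up for illustration
left = pd.DataFrame({"player_ID": ["a01", "b01", "c01"], "WAR": [1.2, 0.3, -0.1]})
right = pd.DataFrame({"bbrefID": ["a01", "b01"], "birthYear": [1990, 1985]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = left.merge(
    right, how="left", left_on="player_ID", right_on="bbrefID", indicator=True
)

# Count matched vs. unmatched rows on the left side
counts = merged["_merge"].value_counts()
print(counts)
```

Rows marked `left_only` are the ones that would end up with nulls in the columns coming from the right-hand table.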

In [48]:
pd.set_option('display.max_columns', None)

def show_player_info(dataset, player_name):
    '''Display player information for sanity check'''

    player_info = dataset[dataset['name_common'] == player_name]
    
    return player_info

##### WAR and People datasets

In [49]:
print(war_pre.shape)
print(people_pre.shape)

(104423, 13)
(20811, 15)


In [50]:
# Join people_pre with war_pre on bbrefID and player_ID
war_people = war_pre.merge(people_pre, how='left',left_on='player_ID', right_on='bbrefID')
war_people.sample(5)

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G,Inn,salary,WAR,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
103610,Woody English,27.0,1933,engliwo01,N,113855.0,1,457.0,105,0.0,0.0,2.26,CHC,engliwo01,1906.0,3.0,2.0,USA,woody,english,155.0,70.0,R,R,1927-04-26,1938-07-01,englw101,engliwo01
69169,Luther Roy,22.0,1925,roylu01,Y,121517.0,1,2.0,6,0.0,0.0,-0.04,CLE,roylu01,1902.0,7.0,29.0,USA,luther,roy,161.0,70.0,R,R,1924-06-12,1929-10-05,roy-l101,roylu01
54793,Jimmy Wood,28.0,1872,woodji01,N,249068.0,2,150.0,32,0.0,0.0,1.04,TRO,woodji01,1843.0,12.0,1.0,USA,jimmy,wood,150.0,68.0,,R,1871-05-08,1873-11-01,woodj106,woodji01
104311,Zack Wheat,34.0,1922,wheatza01,N,124134.0,1,660.0,152,0.0,8300.0,4.39,BRO,wheatza01,1888.0,5.0,23.0,USA,zack,wheat,170.0,70.0,L,R,1909-09-11,1927-09-21,wheaz101,wheatza01
88016,Roy Hartzell,32.0,1914,hartzro01,N,115581.0,1,581.0,137,0.0,0.0,1.52,NYY,hartzro01,1881.0,7.0,6.0,USA,roy,hartzell,155.0,68.0,L,R,1906-04-17,1916-07-25,hartr102,hartzro01


In [51]:
show_player_info(war_people, 'Joey Gallo')

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G,Inn,salary,WAR,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
57059,Joey Gallo,21.0,2015,gallojo01,N,608336.0,1,123.0,36,264.0,0.0,0.32,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
57060,Joey Gallo,22.0,2016,gallojo01,N,608336.0,1,30.0,17,72.0,0.0,-0.52,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
57061,Joey Gallo,23.0,2017,gallojo01,N,608336.0,1,532.0,145,1208.0,537120.0,2.92,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
57062,Joey Gallo,24.0,2018,gallojo01,N,608336.0,1,577.0,148,1199.0,560000.0,2.41,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
57063,Joey Gallo,25.0,2019,gallojo01,N,608336.0,1,297.0,70,614.3,605500.0,3.08,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
57064,Joey Gallo,26.0,2020,gallojo01,N,608336.0,1,226.0,57,475.7,4400000.0,1.51,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
57065,Joey Gallo,27.0,2021,gallojo01,N,1216672.0,2,616.0,153,1304.4,6200000.0,4.61,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01
57066,Joey Gallo,28.0,2022,gallojo01,N,1216672.0,2,410.0,126,944.4,10275000.0,0.23,NYY,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01


##### WAR/People and Batting datasets

In [52]:
# shape
war_people.shape

(104423, 28)

In [53]:
batting_pre.shape

(103693, 20)

In [54]:
# Join war_people and batting_pre on playerID and year_ID = yearID
war_batting_people = war_people.merge(batting_pre, how='left', left_on=['playerID', 'year_ID'], right_on=['playerID', 'yearID'])
war_batting_people.sample(5)

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary,WAR,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,teamID
72879,Michael Wuertz,32.0,2011,wuertmi01,Y,430900.0,1,0.0,4,33.7,2800000.0,0.0,OAK,wuertmi01,1978.0,12.0,15.0,USA,michael,wuertz,225.0,74.0,R,R,2004-04-05,2011-09-20,wuerm001,wuertmi01,2011.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,OAK
25533,Dave Valle,33.0,1994,valleda01,N,247242.0,2,134.0,46,317.7,1200000.0,-0.08,BOS,valleda01,1960.0,10.0,30.0,USA,dave,valle,200.0,74.0,R,R,1984-09-07,1996-09-29,valld001,valleda01,1994.0,46.0,112.0,14.0,26.0,8.0,1.0,2.0,10.0,0.0,2.0,18.0,22.0,2.0,2.0,2.0,0.0,3.0,BOS
23406,Danny Ardoin,31.0,2006,ardoida01,N,300832.0,2,135.0,40,323.3,335000.0,-0.69,COL,ardoida01,1974.0,7.0,8.0,USA,danny,ardoin,205.0,72.0,R,R,2000-08-02,2008-09-26,ardod001,ardoida01,2006.0,40.0,122.0,14.0,22.0,5.0,1.0,0.0,3.0,0.0,0.0,9.0,33.0,2.0,3.0,1.0,0.0,3.0,COL
37019,Fred Beck,28.0,1915,beckfr02,N,110761.0,1,407.0,121,0.0,0.0,-0.99,CHI,beckfr02,1886.0,11.0,17.0,USA,fred,beck,180.0,73.0,L,L,1909-04-14,1915-10-03,beckf101,beckfr02,1915.0,121.0,373.0,35.0,83.0,9.0,3.0,5.0,38.0,4.0,0.0,24.0,38.0,0.0,4.0,8.0,0.0,0.0,CHF
78994,Pat Borders,36.0,1999,bordepa01,N,222464.0,2,35.0,12,84.0,270000.0,-0.3,CLE,bordepa01,1963.0,5.0,14.0,USA,pat,borders,190.0,74.0,R,R,1988-04-06,2005-07-27,bordp001,bordepa01,1999.0,12.0,34.0,3.0,9.0,0.0,1.0,1.0,6.0,0.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,CLE


In [55]:
show_player_info(war_batting_people, 'Joey Gallo')

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary,WAR,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,teamID
57059,Joey Gallo,21.0,2015,gallojo01,N,608336.0,1,123.0,36,264.0,0.0,0.32,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2015.0,36.0,108.0,16.0,22.0,3.0,1.0,6.0,14.0,3.0,0.0,15.0,57.0,3.0,0.0,0.0,0.0,0.0,TEX
57060,Joey Gallo,22.0,2016,gallojo01,N,608336.0,1,30.0,17,72.0,0.0,-0.52,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2016.0,17.0,25.0,2.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,5.0,19.0,0.0,0.0,0.0,0.0,0.0,TEX
57061,Joey Gallo,23.0,2017,gallojo01,N,608336.0,1,532.0,145,1208.0,537120.0,2.92,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2017.0,145.0,449.0,85.0,94.0,18.0,3.0,41.0,80.0,7.0,2.0,75.0,196.0,1.0,8.0,0.0,0.0,3.0,TEX
57062,Joey Gallo,24.0,2018,gallojo01,N,608336.0,1,577.0,148,1199.0,560000.0,2.41,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2018.0,148.0,500.0,82.0,103.0,24.0,1.0,40.0,92.0,3.0,4.0,74.0,207.0,4.0,3.0,0.0,0.0,3.0,TEX
57063,Joey Gallo,25.0,2019,gallojo01,N,608336.0,1,297.0,70,614.3,605500.0,3.08,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2019.0,70.0,241.0,54.0,61.0,15.0,1.0,22.0,49.0,4.0,2.0,52.0,114.0,4.0,2.0,1.0,1.0,0.0,TEX
57064,Joey Gallo,26.0,2020,gallojo01,N,608336.0,1,226.0,57,475.7,4400000.0,1.51,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2020.0,57.0,193.0,23.0,35.0,8.0,0.0,10.0,26.0,2.0,0.0,29.0,79.0,2.0,4.0,0.0,0.0,0.0,TEX
57065,Joey Gallo,27.0,2021,gallojo01,N,1216672.0,2,616.0,153,1304.4,6200000.0,4.61,TEX,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2021.0,153.0,498.0,90.0,99.0,13.0,1.0,38.0,77.0,6.0,0.0,111.0,213.0,5.0,6.0,0.0,1.0,6.0,TEX
57066,Joey Gallo,28.0,2022,gallojo01,N,1216672.0,2,410.0,126,944.4,10275000.0,0.23,NYY,gallojo01,1993.0,11.0,19.0,USA,joey,gallo,250.0,77.0,L,R,2015-06-02,2023-06-24,gallj002,gallojo01,2022.0,126.0,350.0,48.0,56.0,8.0,2.0,19.0,47.0,3.0,0.0,56.0,163.0,0.0,3.0,0.0,1.0,0.0,NYA


In [56]:
war_batting_people.shape

(104423, 47)

##### WAR/People/Batting and Team datasets

There are several columns for team identifiers. Let's decide which one to use.

In [57]:
# Compare the teamIDBR column from teams_pre to the team_ID column from war_batting_people
print(f'teamIDBR (teams_pre) vs team_ID (war_batting_people):\n {teams_pre["teamIDBR"].isin(war_batting_people["team_ID"]).value_counts()}\n')

# Compare the teamID column from teams_pre to the teamID column from war_batting_people 
print(f'teamID (teams_pre) vs teamID (war_batting_people):\n {teams_pre["teamID"].isin(war_batting_people["teamID"]).value_counts()}\n')

# Compare the teamID column from teams_pre to the team_ID column from war_batting_people
print(f'teamID (teams_pre) vs team_ID (war_batting_people):\n {teams_pre["teamID"].isin(war_batting_people["team_ID"]).value_counts()}')

teamIDBR (teams_pre) vs team_ID (war_batting_people):
 True    3015
Name: teamIDBR, dtype: int64

teamID (teams_pre) vs teamID (war_batting_people):
 True    3015
Name: teamID, dtype: int64

teamID (teams_pre) vs team_ID (war_batting_people):
 True     1735
False    1280
Name: teamID, dtype: int64
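
Each `isin` check above counts rows of `teams_pre` whose key appears in the player table. It says nothing about player-side keys that are missing from `teams_pre`, which is the direction that matters for a left merge starting from the player table. A minimal sketch of a two-way comparison on unique keys, using toy frames and a hypothetical helper named `key_coverage`:

```python
import pandas as pd

# Toy frames; the team codes are only examples
teams = pd.DataFrame({"teamID": ["NYA", "TEX", "SLN"]})
players = pd.DataFrame({"teamID": ["NYA", "TEX", "CHF"]})

def key_coverage(left, right):
    """Compare the unique, non-null keys of two Series in both directions."""
    left_keys, right_keys = set(left.dropna()), set(right.dropna())
    return {
        "shared": len(left_keys & right_keys),
        "only_left": sorted(left_keys - right_keys),
        "only_right": sorted(right_keys - left_keys),
    }

cov = key_coverage(players["teamID"], teams["teamID"])
print(cov)
```

An empty `only_left` list would mean every player-side key has a counterpart in the team table.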


In [58]:
# Unique values in war_batting_people teamID
war_batting_people['teamID'].nunique()

149

In [59]:
# Unique values in teams_pre teamID 
teams_pre['teamID'].nunique()

149

In [60]:
war_batting_people.shape

(104423, 47)

In [61]:
# Join war_batting_people and teams_pre on teamID and yearID
war_batting_people_teams = war_batting_people.merge(teams_pre, how='left', left_on=['teamID', 'yearID'], right_on=['teamID', 'yearID'])

In [62]:
war_batting_people_teams.sample(5)

Unnamed: 0,name_common,age,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary,WAR,team_ID,playerID,birthYear,birthMonth,birthDay,birthCountry,nameFirst,nameLast,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
4568,Art Nehf,24.0,1917,nehfar01,Y,119687.0,1,87.0,38,0.0,3600.0,0.61,BSN,nehfar01,1892.0,7.0,31.0,USA,art,nehf,176.0,69.0,L,L,1915-08-13,1929-10-03,nehfa101,nehfar01,1917.0,38.0,70.0,8.0,12.0,3.0,2.0,0.0,2.0,1.0,0.0,13.0,13.0,0.0,1.0,3.0,0.0,0.0,BSN,NL,ATL,,6.0,157.0,77.0,72.0,81.0,,N,N,536.0,5201.0,1280.0,169.0,75.0,22.0,427.0,587.0,155.0,,,,552.0,438.0,2.77,105.0,22.0,3.0,4272.0,1309.0,19.0,371.0,593.0,224.0,122.0,0.966,Boston Braves,Braves Field,174253.0,94.0,94.0,BSN,BSN,BSN
67881,Logan OHoppe,22.0,2022,ohopplo01,N,681351.0,1,16.0,5,39.3,0.0,-0.06,LAA,ohopplo01,2000.0,2.0,9.0,USA,logan,ohoppe,185.0,74.0,R,R,2022-09-28,2023-04-20,ohopl001,ohopplo01,2022.0,5.0,14.0,1.0,4.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,LAA,AL,ANA,W,3.0,162.0,81.0,73.0,89.0,N,N,N,623.0,5423.0,1265.0,219.0,31.0,190.0,449.0,1539.0,77.0,27.0,54.0,25.0,668.0,601.0,3.77,2.0,17.0,38.0,4307.0,1241.0,168.0,540.0,1383.0,84.0,134.0,0.985,Los Angeles Angels of Anaheim,Angel Stadium of Anaheim,2457461.0,102.0,103.0,LAA,ANA,ANA
39754,George Frazier,23.0,1978,frazige01,Y,114393.0,1,3.0,14,22.0,0.0,-0.01,STL,frazige01,1954.0,10.0,13.0,USA,george,frazier,205.0,77.0,R,R,1978-05-25,1987-10-04,frazg001,frazige01,1978.0,14.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,SLN,NL,STL,E,5.0,162.0,81.0,69.0,93.0,N,N,N,600.0,5415.0,1351.0,263.0,44.0,79.0,420.0,713.0,97.0,42.0,22.0,53.0,657.0,572.0,3.58,32.0,13.0,22.0,4313.0,1300.0,94.0,600.0,859.0,136.0,155.0,0.978,St. Louis Cardinals,Busch Stadium II,1278215.0,99.0,99.0,STL,SLN,SLN
5740,Ben Stephens,26.0,1894,stephbe01,Y,122746.0,1,4.0,3,0.0,0.0,-0.01,WHS,stephbe01,1867.0,9.0,28.0,USA,ben,stephens,170.0,70.0,,R,1892-08-05,1894-05-04,stepb101,stephbe01,1894.0,3.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,WAS,NL,WAS,,11.0,132.0,,45.0,87.0,,N,,882.0,4581.0,1317.0,218.0,118.0,59.0,617.0,375.0,249.0,,69.0,,1122.0,678.0,5.51,102.0,0.0,4.0,3321.0,1573.0,59.0,446.0,190.0,499.0,81.0,0.908,Washington Senators,Boundary Field,125000.0,97.0,99.0,WHS,WSN,WSN
31166,Ed Barnowski,21.0,1965,barnoed01,Y,110600.0,1,0.0,4,4.3,0.0,0.0,BAL,barnoed01,1943.0,8.0,23.0,USA,ed,barnowski,195.0,74.0,R,R,1965-09-08,1966-09-16,barne102,barnoed01,1965.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BAL,AL,BAL,,3.0,162.0,79.0,94.0,68.0,,N,N,641.0,5450.0,1299.0,227.0,38.0,125.0,529.0,907.0,67.0,31.0,,,578.0,489.0,2.98,32.0,15.0,41.0,4431.0,1268.0,120.0,510.0,939.0,126.0,152.0,0.98,Baltimore Orioles,Memorial Stadium,781649.0,102.0,101.0,BAL,BAL,BAL


In [63]:
war_batting_people_teams.shape

(104423, 92)

##### WAR/People/Batting/Team and Salary datasets

In order to align our datasets for comparison, we'll first create a subset of data starting from 1985, as this is the earliest year available in the `salaries_pre` dataset. This will allow us to identify common links between the datasets and determine the most suitable one for joining. Once we've made this decision, we'll proceed to join the full datasets. Any null values resulting from years prior to 1985 will be addressed in the next stages of the project.

In [64]:
# war_batting_people_teams starting in 1985
# Filter first, then copy, so later column assignments do not act on a slice of the original
war_batting_people_teams_85 = war_batting_people_teams[war_batting_people_teams['year_ID'] >= 1985].copy()

`mlb_ID` and `mlbid` seem to be the only common link between the datasets. Let's see if we can use these columns to join the datasets.

In [65]:
# Compare mlb_ID from war_batting_people_teams to mlbid from salaries_pre
print(f'mlb_ID (war_batting_people_teams_85) vs mlbid (salaries_pre):\n {war_batting_people_teams_85["mlb_ID"].isin(salaries_pre["mlbid"]).value_counts()}')

mlb_ID (war_batting_people_teams_85) vs mlbid (salaries_pre):
 True     41714
False     4130
Name: mlb_ID, dtype: int64


There are 41,714 rows where the `mlb_ID` from the war_batting_people_teams_85 dataset matches an `mlbid` in the salaries_pre dataset, and 4,130 rows where it does not.

There could be inconsistencies in the way the IDs are recorded in the two datasets. Let's explore this further by looking at the IDs that don't match.

In [66]:
# # Extract a sample player with different mlb_ID
# difference_sample = war_batting_people_teams_85[~war_batting_people_teams_85["mlb_ID"].isin(salaries_pre["mlbid"])][['name_common', 'mlb_ID']].sample(1)
# name_d = difference_sample.values[0][0]
# mlbid_d = difference_sample.values[0][1]

# print('From war_batting_people_teams_85:')
# print(f'Name: {name_d}\nmlb_ID: {mlbid_d}\n')

# print('From salaries_pre:')
# print(salaries_pre[salaries_pre["name"] == name_d][["name", "mlbid"]].head(1).iloc[0])

It appears that there is a discrepancy between the `mlb_ID` in the war_batting_people_teams_85 dataset and the `mlbid` in the salaries_pre dataset. This could be due to a variety of reasons, such as data entry errors, different data sources, or changes in player IDs over time.
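
One candidate cause can be read off the Joey Gallo output shown earlier: `mlb_ID` is 608336 in seasons where `stint_ID` is 1 and 1216672 (exactly double) in seasons where `stint_ID` is 2. That pattern would be consistent with the ID having been summed when stints were aggregated, but this is an unverified assumption. The sketch below shows how the hypothesis could be tested on toy rows; the salary-side ID list and the column name `mlb_ID_fixed` are hypothetical.

```python
import pandas as pd

# mlb_ID / stint_ID pairs as they appear in the Joey Gallo output
war = pd.DataFrame({
    "name_common": ["Joey Gallo"] * 3,
    "year_ID": [2020, 2021, 2022],
    "mlb_ID": [608336.0, 1216672.0, 1216672.0],
    "stint_ID": [1, 2, 2],
})

# Hypothetical salary-side IDs, for illustration only
salary_ids = [608336]

# Assumption (unverified): mlb_ID was summed over stints, so divide it back
war["mlb_ID_fixed"] = (war["mlb_ID"] / war["stint_ID"]).round().astype("int64")

# Number of rows whose ID is found on the salary side, before and after
before = war["mlb_ID"].isin(salary_ids).sum()
after = war["mlb_ID_fixed"].isin(salary_ids).sum()
print(before, after)
```

If the match rate on the real data did not improve after such a correction, the hypothesis would be ruled out and a constructed key would remain the way forward.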

The best solution would be to find or create a common key between the datasets. We don't want to join on the player's name, because it is not a unique identifier: several players can share the same name.

Let's explore this solution further.



##### - Creating a unique player ID for both datasets

Both datasets contain birth dates and names, so we can combine these columns to build a player ID that is constructed the same way in both datasets. Expect some trial and error here.

__Unique ID for war_batting_people_teams__

In [67]:
# Some of these columns had been dropped earlier, so I had to go back and add them again

# Birth dates to int
war_batting_people_teams_85['birthYear'] = war_batting_people_teams_85['birthYear'].astype('Int64')
war_batting_people_teams_85['birthMonth'] = war_batting_people_teams_85['birthMonth'].astype('Int64')
war_batting_people_teams_85['birthDay'] = war_batting_people_teams_85['birthDay'].astype('Int64')

# Birth dates to int full dataset
war_batting_people_teams['birthYear'] = war_batting_people_teams['birthYear'].astype('Int64')
war_batting_people_teams['birthMonth'] = war_batting_people_teams['birthMonth'].astype('Int64')
war_batting_people_teams['birthDay'] = war_batting_people_teams['birthDay'].astype('Int64')
war_batting_people_teams['birthYear_3'] = war_batting_people_teams['birthYear'].astype(str).str[:3]

# Concat nameFirst and nameLast
# war_batting_people_teams['name'] = war_batting_people_teams['nameFirst'] + war_batting_people_teams_85['nameLast']

# First two letters of nameFirst and nameLast (1985+ subset)
war_batting_people_teams_85['nameFirst_2'] = war_batting_people_teams_85['nameFirst'].str[:2]
war_batting_people_teams_85['nameLast_2'] = war_batting_people_teams_85['nameLast'].str[:2]

# Full dataset: first two letters of nameFirst, first five letters of nameLast (stored in nameLast_2)
war_batting_people_teams['nameFirst_2'] = war_batting_people_teams['nameFirst'].str[:2]
war_batting_people_teams['nameLast_2'] = war_batting_people_teams['nameLast'].str[:5]

# Concatenate the name prefixes and the full birth year for the new ID (1985+ subset)
war_batting_people_teams_85['new_id_x'] = war_batting_people_teams_85['nameFirst_2'] + war_batting_people_teams_85['nameLast_2'] + war_batting_people_teams_85['birthYear'].astype(str) #+ war_batting_people_teams_85['birthMonth'].astype(str) + war_batting_people_teams_85['birthDay'].astype(str) # + war_batting_people_teams_85['birthYear'].astype(str)

# Full dataset: concatenate the name prefixes and the first three digits of the birth year
war_batting_people_teams['new_id_x'] = war_batting_people_teams['nameFirst_2'] + war_batting_people_teams['nameLast_2'] + war_batting_people_teams['birthYear_3'] 


In [68]:
# Sanity check
show_player_info(war_batting_people_teams, 'Joey Gallo')[['name_common', 'year_ID', 'team_ID', 'birthYear', 'birthMonth', 'birthDay', 'new_id_x']]

Unnamed: 0,name_common,year_ID,team_ID,birthYear,birthMonth,birthDay,new_id_x
57059,Joey Gallo,2015,TEX,1993,11,19,jogallo199
57060,Joey Gallo,2016,TEX,1993,11,19,jogallo199
57061,Joey Gallo,2017,TEX,1993,11,19,jogallo199
57062,Joey Gallo,2018,TEX,1993,11,19,jogallo199
57063,Joey Gallo,2019,TEX,1993,11,19,jogallo199
57064,Joey Gallo,2020,TEX,1993,11,19,jogallo199
57065,Joey Gallo,2021,TEX,1993,11,19,jogallo199
57066,Joey Gallo,2022,NYY,1993,11,19,jogallo199


__Unique ID for salaries_pre__

In [69]:
# Extract year, month and day from borndate
salaries_pre['b_year'] = salaries_pre['borndate'].dt.year.astype('Int64')
salaries_pre['b_month'] = salaries_pre['borndate'].dt.month.astype('Int64')
salaries_pre['b_day'] = salaries_pre['borndate'].dt.day.astype('Int64')
salaries_pre['b_year_3'] = salaries_pre['b_year'].astype(str).str[:3]

# Extract letters of firstname and lastname
salaries_pre['firstname_2'] = salaries_pre['firstname'].str[:2]
salaries_pre['lastname_2'] = salaries_pre['lastname'].str[:5]

# Create new ID
salaries_pre['new_id_y'] = salaries_pre['firstname_2'] + salaries_pre['lastname_2'] + salaries_pre['b_year_3'] 

In [70]:
# Sanity check
# Joey Gallo
salaries_pre[salaries_pre['name'] == 'joey gallo'][['name', 'firstname_2', 'lastname_2', 'borndate', 'b_year', 'b_month', 'b_day', 'new_id_y']]

Unnamed: 0,name,firstname_2,lastname_2,borndate,b_year,b_month,b_day,new_id_y
14141,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14142,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14143,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14144,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14145,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14146,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14147,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
14148,joey gallo,jo,gallo,1993-11-19,1993,11,19,jogallo199
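
A key built from two letters of the first name, five letters of the last name and the first three digits of the birth year is compact, but different players can end up with the same value. A minimal sketch of how such collisions could be counted; the `playerID` values and names in the toy frame are hypothetical, and `make_key` is an illustrative helper, not a function from this notebook.

```python
import pandas as pd

# Toy data; playerIDs and names are made up for illustration
people = pd.DataFrame({
    "playerID": ["gallojo01", "johnsma01", "johnsma02", "johnsma02"],
    "nameFirst": ["joey", "mark", "matt", "matt"],
    "nameLast": ["gallo", "johnson", "johnson", "johnson"],
    "birthYear": [1993, 1975, 1978, 1978],
})

def make_key(first, last, year):
    """2 letters of first name + 5 letters of last name + first 3 digits of birth year."""
    return first.str[:2] + last.str[:5] + year.astype(str).str[:3]

people["new_id"] = make_key(people["nameFirst"], people["nameLast"], people["birthYear"])

# A key is ambiguous when it maps to more than one distinct player
players_per_key = people.groupby("new_id")["playerID"].nunique()
collisions = players_per_key[players_per_key > 1]
print(collisions)
```

Keys that show up in `collisions` would need either a longer key or manual handling before the merge.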


In [71]:
war_batting_people_teams.shape

(104423, 96)

##### - Merging the datasets

In [72]:
# Join war_batting_people_teams and salaries_pre on new_id_x = new_id_y and year_ID = year
war_batting_people_teams_salaries = war_batting_people_teams.merge(salaries_pre, how='left', left_on=['new_id_x', 'year_ID'], right_on=['new_id_y', 'year'])
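
A left merge keeps every left-hand row, but it also duplicates a left-hand row whenever the right-hand table holds more than one row for the same key. The sketch below shows two ways this could be checked, on toy frames with partly made-up salary figures: comparing row counts, and passing `validate="many_to_one"` so that pandas raises `MergeError` if the right-hand keys are not unique.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy frames; the second and third salary rows are made up to force a duplicate key
left = pd.DataFrame({"new_id_x": ["jogallo199", "majohns197"], "year_ID": [2021, 2001]})
right = pd.DataFrame({
    "new_id_y": ["jogallo199", "majohns197", "majohns197"],
    "year": [2021, 2001, 2001],
    "salary": [6200000, 300000, 450000],
})

keys = dict(left_on=["new_id_x", "year_ID"], right_on=["new_id_y", "year"])

# Check 1: a left merge should not add rows if right-hand keys are unique
merged = left.merge(right, how="left", **keys)
extra_rows = len(merged) - len(left)

# Check 2: let pandas enforce uniqueness of the right-hand keys
try:
    left.merge(right, how="left", validate="many_to_one", **keys)
    right_keys_unique = True
except MergeError:
    right_keys_unique = False

print(extra_rows, right_keys_unique)
```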

In [73]:
# positions head
positions.head()

Unnamed: 0,playerID,POS
0,aardsda01,P
1,aasedo01,P
2,abadan01,1B
3,abadfe01,P
4,abbotco01,P


##### Dealing with null values after joining

Most of the null values will be dealt with in the preprocessing stage, once we define the time period of our analysis. However, there are some null values that we can deal with now.
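
Because the salary data only starts in 1985, many nulls in the salary columns are expected for earlier seasons. One way to separate those from genuine gaps is to count nulls by period. A minimal sketch on a toy frame (values are illustrative; the column names mirror the merged dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with a few seasons before and after 1985
df = pd.DataFrame({
    "year_ID": [1922, 1980, 1990, 2015, 2021],
    "salary_y": [np.nan, np.nan, 250000.0, np.nan, 6200000.0],
    "bats": ["L", "R", None, "L", "L"],
})

in_range = df["year_ID"] >= 1985

# Null counts per column, split by period
null_counts = pd.DataFrame({
    "before_1985": df[~in_range].isnull().sum(),
    "from_1985": df[in_range].isnull().sum(),
})
print(null_counts)
```

Nulls that remain in the `from_1985` column are the ones that point to unmatched keys or missing source data rather than to the coverage of the salary table.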

In [74]:
pd.set_option('display.max_rows', None)
# null values 
war_batting_people_teams_salaries.isnull().sum().sort_values(ascending=False)

middle2           59292
new_id_y          59290
year              59290
TeamName          59290
age_y             59290
mlbid             59290
playerid          59290
leaguerank        59290
teamrank          59290
averagesalary     59290
lastname          59290
firstname         59290
leagueminimum     59290
borndate          59290
name_y            59290
first3            59290
last3             59290
first3last3       59290
b_year            59290
b_month           59290
b_day             59290
b_year_3          59290
firstname_2       59290
lastname_2        59290
salary_y          59290
DivWin            45538
SF_y              45388
divID             44539
HBP_y             36429
CS_y              22569
Ghome              8494
WSWin              8257
attendance         5749
SB_y               2906
bats               2397
throws             1999
LgWin              1816
weight             1695
height             1620
park               1544
lgID               1458
SO_y            

We will remove certain rows with null values that don't impact our analysis. These rows correspond to players with insignificant stats, and their removal won't significantly affect our dataset size.

In [75]:
# Pre-cleaning: players whose rows contain null values after the join. They have no relevant stats, so dropping them does not affect the analysis.
rows_to_drop = ['sternad01', 
                'hegmabo01', 
                'jimerch01', 
                'tremich01',
                'firovda01',
                'singldu01',
                'lunarfe01',
                'matosfr01',
                'reyesgi01',
                'manrifr01',
                'krugeja01',
                'pankoji01',
                'hietpjo01',
                'morgake01',
                'dalesma01',
                'mooreke02',
                'davidma01',
                'davisma02',
                'calzana01',
                'santape01',
                'carabra01',
                'mckeewa01',
                'esposbr01',
                'sanchan02',
                'gonzaal02',
                'gonzaal01',
                'anderbr05',
                'rojasjo02',
                'roberda09',
                'lopezlu03',
                'lopezlu02',
                'snydebr02',
                'braunry01',
                'willima07',
                'reyesjo02',
                ]

In [76]:
# Drop rows
war_batting_people_teams_salaries = war_batting_people_teams_salaries[~war_batting_people_teams_salaries['playerID'].isin(rows_to_drop)]

In [77]:
special_drop_dict = {'stewach01': [2010],
                     'freesda01': [2009],
                     'disarga01': [1989],
                     'kingsge01': [1996],
                     'snowjt01': [2008],
                     'larueja01': [2009],
                     'posadjo01': [1995],
                     'thornlo01': [1990],
                     'zuvelpa01': [1991],
                     'murphse01': [2019],
                     'gorete01': [2020],
                     'lintz01': [2021],
                     'wilsova01': [1999],
                     'bartobr01': [2009],
                     'burkeja02': [2010],
                     'haltesh01': [1999],
                     }

In [78]:
# Drop rows in special_drop_dict
for key, value in special_drop_dict.items():
    war_batting_people_teams_salaries = war_batting_people_teams_salaries[~((war_batting_people_teams_salaries['playerID'] == key) & (war_batting_people_teams_salaries['yearID'].isin(value)))]
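
The loop above re-filters the whole DataFrame once per dictionary entry. As an alternative sketch, the dictionary can be flattened into `(playerID, yearID)` pairs and removed in a single pass with `MultiIndex.isin`. The toy frame below uses integer years; this has not been checked against the float `yearID` column of the merged dataset.

```python
import pandas as pd

# Toy frame with integer years
df = pd.DataFrame({
    "playerID": ["stewach01", "stewach01", "snowjt01", "gallojo01"],
    "yearID": [2009, 2010, 2008, 2021],
})
special_drop = {"stewach01": [2010], "snowjt01": [2008]}

# Flatten the dict into (playerID, yearID) pairs
pairs = [(pid, yr) for pid, years in special_drop.items() for yr in years]

# Build a two-level key per row and keep rows whose key is not in the drop list
row_keys = pd.MultiIndex.from_frame(df[["playerID", "yearID"]])
kept = df[~row_keys.isin(pairs)]
print(kept)
```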

In [79]:
drop_rows = ['ChJi197', 
             'ChTr196', 
             'DuSi197', 
             'FeLu197', 
             'FrMa196', 
             'GiRe196', 
             'JiPa195', 
             'KeMo197', 
             'MaDa196', 
             'NaCa198',
             'RaCa196',
             'WaMc197',
             ]

# Drop rows with new_id_y in drop_rows from salaries_pre
salaries_pre = salaries_pre[~salaries_pre['new_id_y'].isin(drop_rows)]

In [80]:
special_drop_dict_salaries = {'JaLa197': [2010],
                     'LoTh196': [1985],
                     }

In [81]:
for key, value in special_drop_dict_salaries.items():
    salaries_pre  = salaries_pre[~((salaries_pre['new_id_y'] == key) & (salaries_pre['year'].isin(value)))]

In [82]:
drop_x = ['mibrown195', 'brsmith195', 'kemille180', 'kemille196',
       'mifitzg196', 'chhowar196', 'brhunte197',
       'brhunte196', 'mavalde197', 'majohns197', 'majohns196',
       'chpeter197', 'frgarci197', 'caherna197', 'caherna196',
       'abnunez197', 'japhill197', 'jajones197',
       'mawatso197', 'lugonza197', 'racastr197',
       'jonelso197', 'rybraun198', 'brsnyde198',
       'chcarte198', 'jovalde198', 'daalvar198',
       'maduffy198', 'dabarne198', 'betaylo199', 'brrodge199',
       'jodavis199', 'jomarti198', 'magonza198', 'brzimme199',
       'heperez199', 'rogarci199', 'aldiaz199',
       'dicasti199']

# Drop rows with new_id_x in drop_x from war_batting_people_teams_salaries
war_batting_people_teams_salaries = war_batting_people_teams_salaries[~war_batting_people_teams_salaries['new_id_x'].isin(drop_x)]

In [83]:
# Drop 2022 rows that found no match in salaries_pre (mlbid is null after the join)
war_batting_people_teams_salaries = war_batting_people_teams_salaries[~((war_batting_people_teams_salaries['year_ID'] == 2022) & (war_batting_people_teams_salaries['mlbid'].isnull()))]


In [84]:
# null values
war_batting_people_teams_salaries.isnull().sum().sort_values(ascending=False)

middle2           58648
new_id_y          58646
year              58646
TeamName          58646
age_y             58646
mlbid             58646
playerid          58646
leaguerank        58646
teamrank          58646
averagesalary     58646
lastname          58646
firstname         58646
leagueminimum     58646
borndate          58646
name_y            58646
first3            58646
last3             58646
first3last3       58646
b_year            58646
b_month           58646
b_day             58646
b_year_3          58646
firstname_2       58646
lastname_2        58646
salary_y          58646
DivWin            45525
SF_y              45388
divID             44539
HBP_y             36429
CS_y              22569
Ghome              8494
WSWin              8244
attendance         5749
SB_y               2906
bats               2397
throws             1999
LgWin              1803
weight             1695
height             1620
park               1544
lgID               1458
SO_y            

##### Dropping newly created columns

Now that we have successfully joined the datasets, we can drop the helper columns (new IDs) we created for the join, to keep the dataset clean.

In [85]:
new_ids = ['birthYear_3', 'nameFirst_2', 'nameLast_2', 'new_id_x', 'firstname_2', 'lastname_2', 'new_id_y', 'b_year_3', 'b_year', 'b_month', 'b_day', 'birthMonth', 'birthDay', 'nameFirst', 'nameLast', 'first3', 'last3', 'first3last3', 'middle2']

# Drop new_ids columns
war_batting_people_teams_salaries.drop(new_ids, axis=1, inplace=True)

In [86]:
# Sort by year_ID, then by name_common
war_batting_people_teams_salaries = war_batting_people_teams_salaries.sort_values(by=['year_ID', 'name_common'])

In [87]:
war_batting_people_teams_salaries.head()

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,WAR,team_ID,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y
995,Al Barker,32.0,1871,barkeal01,N,110565.0,1,5.0,1,0.0,0.0,0.03,ROK,barkeal01,1839,USA,162.0,72.0,,,1871-06-01,1871-06-01,barka101,barkeal01,1871.0,1.0,4.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,RC1,,ROK,,9.0,25.0,,4.0,21.0,,N,,231.0,1036.0,274.0,44.0,25.0,3.0,38.0,30.0,53.0,10.0,,,287.0,108.0,4.3,23.0,1.0,0.0,678.0,315.0,3.0,34.0,16.0,220.0,14.0,0.821,Rockford Forest Citys,Agricultural Society Fair Grounds,,97.0,99.0,ROK,RC1,RC1,,,,,,,,,,,,,NaT,
1596,Al Pratt,23.0,1871,prattal01,Y,120742.0,1,131.0,29,0.0,0.0,0.01,CLE,prattal01,1847,USA,140.0,67.0,,R,1871-05-04,1872-08-19,prata101,prattal01,1871.0,29.0,130.0,31.0,34.0,6.0,8.0,0.0,20.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,CL1,,CFC,,8.0,29.0,,10.0,19.0,,N,,249.0,1186.0,328.0,35.0,40.0,7.0,26.0,25.0,18.0,8.0,,,341.0,116.0,4.11,23.0,0.0,0.0,762.0,346.0,13.0,53.0,34.0,234.0,15.0,0.818,Cleveland Forest Citys,National Association Grounds,,96.0,100.0,CLE,CL1,CL1,,,,,,,,,,,,,NaT,
1599,Al Reach,31.0,1871,reachal01,N,120965.0,1,138.0,26,0.0,0.0,0.98,ATH,reachal01,1840,United Kingdom,155.0,66.0,L,L,1871-05-20,1875-05-21,reaca101,reachal01,1871.0,26.0,133.0,43.0,47.0,7.0,6.0,0.0,34.0,2.0,0.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,PH1,,PNA,,1.0,28.0,,21.0,7.0,,Y,,376.0,1281.0,410.0,66.0,27.0,9.0,46.0,23.0,56.0,12.0,,,266.0,137.0,4.95,27.0,0.0,0.0,747.0,329.0,3.0,53.0,16.0,194.0,13.0,0.845,Philadelphia Athletics,Jefferson Street Grounds,,102.0,98.0,ATH,PH1,PH1,,,,,,,,,,,,,NaT,
1715,Al Spalding,20.0,1871,spaldal01,Y,122558.0,1,152.0,31,0.0,1500.0,-0.08,BOS,spaldal01,1850,USA,170.0,73.0,R,R,1871-05-05,1878-08-31,spala101,spaldal01,1871.0,31.0,144.0,43.0,39.0,10.0,1.0,1.0,31.0,2.0,0.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,BS1,,BNA,,3.0,31.0,,20.0,10.0,,N,,401.0,1372.0,426.0,70.0,37.0,3.0,60.0,19.0,73.0,16.0,,,303.0,109.0,3.55,22.0,1.0,3.0,828.0,367.0,2.0,42.0,23.0,243.0,24.0,0.834,Boston Red Stockings,South End Grounds I,,103.0,98.0,BOS,BS1,BS1,,,,,,,,,,,,,NaT,
3631,Andy Leonard,25.0,1871,leonaan01,N,117686.0,1,151.0,31,0.0,0.0,0.49,OLY,leonaan01,1846,Ireland,168.0,67.0,R,R,1871-05-05,1880-07-06,leona101,leonaan01,1871.0,31.0,148.0,33.0,43.0,8.0,3.0,0.0,30.0,14.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,2.0,WS3,,OLY,,4.0,32.0,,15.0,15.0,,N,,310.0,1353.0,375.0,54.0,26.0,6.0,48.0,13.0,48.0,13.0,,,303.0,137.0,4.37,32.0,0.0,0.0,846.0,371.0,4.0,45.0,13.0,218.0,20.0,0.85,Washington Olympics,Olympics Grounds,,94.0,98.0,OLY,WS3,WS3,,,,,,,,,,,,,NaT,


In [88]:
# shape
war_batting_people_teams_salaries.shape

(103554, 102)

In [89]:
# unique players
war_batting_people_teams_salaries['playerID'].nunique()

20016

##### Other merges
Merge in the datasets that were created separately after this notebook was finalized (player positions and birth years).

In [90]:
# Left-join war_batting_people_teams_salaries with positions on playerID
war_batting_people_teams_salaries_positions = war_batting_people_teams_salaries.merge(positions, how='left', on='playerID')

In [91]:
war_batting_people_teams_salaries_positions.sample(5)

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,WAR,team_ID,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y,POS
85682,DJ Carrasco,32.0,2009,carradj01,Y,425647.0,1,0.0,3,93.3,440000.0,0.0,CHW,carradj01,1977,USA,215.0,76.0,R,R,2003-04-02,2012-05-16,carrd001,carradj01,2009.0,49.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CHA,AL,CHW,C,3.0,162.0,81.0,79.0,83.0,N,N,N,724.0,5463.0,1410.0,246.0,20.0,184.0,534.0,1022.0,113.0,49.0,62.0,39.0,732.0,663.0,4.14,4.0,11.0,36.0,4319.0,1438.0,169.0,507.0,1119.0,113.0,158.0,0.981,Chicago White Sox,U.S. Cellular Field,2284163.0,105.0,105.0,CHW,CHA,CHA,dj,carrasco,4810.0,425647.0,2009.0,440000.0,Chicago White Sox,32.0,562.0,20.0,2996106.0,400000.0,1977-04-12,dj carrasco,P
15105,Tim Hendryx,25.0,1916,hendrti01,N,115761.0,1,72.0,15,0.0,1800.0,0.53,NYY,hendrti01,1891,USA,170.0,69.0,R,R,1911-09-04,1921-07-26,hendt101,hendrti01,1916.0,15.0,62.0,10.0,18.0,7.0,1.0,0.0,5.0,4.0,0.0,8.0,6.0,0.0,1.0,1.0,0.0,0.0,NYA,AL,NYY,,4.0,156.0,79.0,80.0,74.0,,N,N,577.0,5198.0,1277.0,194.0,59.0,35.0,516.0,632.0,179.0,,,,561.0,440.0,2.77,84.0,12.0,17.0,4284.0,1249.0,37.0,476.0,616.0,219.0,119.0,0.967,New York Yankees,Polo Grounds IV,469211.0,102.0,102.0,NYY,NYA,NYA,,,,,,,,,,,,,NaT,,
71402,Kirk Rueter,26.0,1997,rueteki01,Y,121541.0,1,75.0,31,190.7,260000.0,0.0,SFG,rueteki01,1970,USA,190.0,75.0,L,L,1993-07-07,2005-07-29,ruetk001,rueteki01,1997.0,32.0,65.0,5.0,9.0,0.0,0.0,0.0,5.0,0.0,0.0,3.0,14.0,0.0,0.0,7.0,0.0,1.0,SFN,NL,SFG,W,1.0,162.0,81.0,90.0,72.0,Y,N,N,784.0,5485.0,1415.0,266.0,37.0,172.0,642.0,1120.0,121.0,49.0,46.0,59.0,793.0,706.0,4.39,5.0,9.0,45.0,4338.0,1494.0,160.0,578.0,1044.0,125.0,157.0,0.98,San Francisco Giants,3Com Park,1690869.0,98.0,98.0,SFG,SFN,SFN,kirk,rueter,527.0,121541.0,1997.0,260000.0,San Francisco Giants,26.0,501.0,20.0,1336609.0,150000.0,1970-12-01,kirk rueter,P
59855,Gerald Perry,25.0,1986,perryge01,N,120439.0,1,80.0,29,143.3,0.0,-0.53,ATL,perryge01,1960,USA,180.0,71.0,L,R,1983-08-11,1995-08-24,perrg001,perryge01,1986.0,29.0,70.0,6.0,19.0,2.0,0.0,2.0,11.0,0.0,1.0,8.0,4.0,1.0,0.0,1.0,1.0,4.0,ATL,NL,ATL,W,6.0,161.0,81.0,72.0,89.0,N,N,N,615.0,5384.0,1348.0,241.0,24.0,138.0,538.0,904.0,93.0,76.0,24.0,42.0,719.0,629.0,3.97,17.0,5.0,39.0,4274.0,1443.0,117.0,576.0,932.0,141.0,181.0,0.978,Atlanta Braves,Atlanta-Fulton County Stadium,1387181.0,105.0,106.0,ATL,ATL,ATL,gerald,perry,16521.0,120439.0,1986.0,60000.0,Atlanta Braves,25.0,0.0,0.0,412520.0,60000.0,1960-10-30,gerald perry,1B
97311,Carlos Rodón,25.0,2018,rodonca01,Y,607074.0,1,0.0,0,120.7,2300000.0,0.0,CHW,rodonca01,1992,USA,255.0,74.0,L,L,2015-04-21,2022-09-29,rodoc001,rodonca01,2018.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CHA,AL,CHW,C,4.0,162.0,81.0,62.0,100.0,N,N,N,656.0,5523.0,1332.0,259.0,40.0,182.0,425.0,1594.0,98.0,41.0,66.0,32.0,848.0,771.0,4.83,0.0,8.0,34.0,4311.0,1404.0,196.0,653.0,1259.0,114.0,135.0,0.981,Chicago White Sox,Guaranteed Rate Field,1608817.0,97.0,98.0,CHW,CHA,CHA,carlos,rodon,165585.0,607074.0,2018.0,2300000.0,Chicago White Sox,25.0,363.0,7.0,4095686.0,545000.0,1992-12-10,carlos rodon,P


In [92]:
war_batting_people_teams_salaries_positions.shape

(103554, 103)

In [93]:
# Left-join war_batting_people_teams_salaries_positions with byears on playerID
war_batting_people_teams_salaries_positions = war_batting_people_teams_salaries_positions.merge(byears, how='left', on='playerID')
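These left merges only keep the row count unchanged if the lookup table holds at most one row per `playerID`. A duplicate key on the right side would silently duplicate player-seasons. The sketch below shows one way to check that assumption, using the hypothetical toy frames `left` and `lookup` rather than the notebook's data:

```python
import pandas as pd

# Hypothetical player-season rows (left side of the merge)
left = pd.DataFrame({
    "playerID": ["dwyerji01", "dwyerji01", "rueteki01", "nobody01"],
    "year_ID": [1973, 1974, 1997, 1900],
})

# Hypothetical one-row-per-player lookup table (right side of the merge)
lookup = pd.DataFrame({
    "playerID": ["dwyerji01", "rueteki01"],
    "POS": ["OF", "P"],
})

merged = left.merge(
    lookup,
    how="left",
    on="playerID",
    validate="many_to_one",  # raises MergeError if lookup repeats a playerID
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A left merge against a unique key must not add or remove rows
assert len(merged) == len(left)

# Count the left rows that found no partner in the lookup table
unmatched = (merged["_merge"] == "left_only").sum()
print(f"{unmatched} of {len(merged)} rows have no match")

# The indicator column is only needed for the check
merged = merged.drop(columns="_merge")
```

With `validate` in place, a duplicated key fails loudly at merge time instead of surfacing later as an unexpected change in `.shape`.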

In [94]:
# Copy and save to csv
war_batting_people_teams_salaries_positions_pre = war_batting_people_teams_salaries_positions.copy()
war_batting_people_teams_salaries_positions_pre.to_csv('pre_datasets/merged.csv', index=False)
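Two details matter when saving to CSV. `to_csv` does not create missing directories. A CSV also does not store dtypes, so a datetime column such as `borndate` comes back as plain strings unless it is parsed again on load. The sketch below shows a round trip that handles both, using a temporary directory and a hypothetical two-row frame in place of the real data:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame with a datetime column that contains a missing value
df = pd.DataFrame({
    "playerID": ["rueteki01", "dwyerji01"],
    "borndate": pd.to_datetime(["1970-12-01", None]),  # second value is NaT
})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder first; to_csv fails if it does not exist
    out_dir = Path(tmp) / "pre_datasets"
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "merged.csv"
    df.to_csv(out_path, index=False)

    # parse_dates restores the datetime dtype that the CSV text format lost
    reloaded = pd.read_csv(out_path, parse_dates=["borndate"])

print(reloaded.dtypes)
```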

In [95]:
# Spot-check a single player (Jim Dwyer) in the merged dataset
war_batting_people_teams_salaries_positions_pre[war_batting_people_teams_salaries_positions_pre['name_common'] == 'Jim Dwyer']

Unnamed: 0,name_common,age_x,year_ID,player_ID,pitcher,mlb_ID,stint_ID,PA,G_x,Inn,salary_x,WAR,team_ID,playerID,birthYear,birthCountry,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,G_y,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,teamID,lgID,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name_x,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,firstname,lastname,playerid,mlbid,year,salary_y,TeamName,age_y,leaguerank,teamrank,averagesalary,leagueminimum,borndate,name_y,POS,BirthYear
48291,Jim Dwyer,23.0,1973,dwyerji01,N,113673.0,1,58.0,28,127.0,0.0,-0.78,STL,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1973.0,28.0,57.0,7.0,11.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,5.0,0.0,0.0,0.0,0.0,4.0,SLN,NL,STL,E,2.0,162.0,81.0,81.0,81.0,N,N,N,643.0,5478.0,1418.0,240.0,35.0,75.0,531.0,796.0,100.0,46.0,29.0,50.0,603.0,528.0,3.25,42.0,14.0,36.0,4382.0,1366.0,105.0,486.0,867.0,159.0,149.0,0.975,St. Louis Cardinals,Busch Stadium II,1574046.0,100.0,100.0,STL,SLN,SLN,,,,,,,,,,,,,NaT,,OF,1950.0
49115,Jim Dwyer,24.0,1974,dwyerji01,N,113673.0,1,100.0,74,122.7,0.0,0.37,STL,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1974.0,74.0,86.0,13.0,24.0,1.0,0.0,2.0,11.0,0.0,0.0,11.0,16.0,2.0,1.0,0.0,2.0,1.0,SLN,NL,STL,E,2.0,161.0,81.0,86.0,75.0,N,N,N,677.0,5620.0,1492.0,216.0,46.0,83.0,531.0,752.0,172.0,62.0,44.0,55.0,643.0,570.0,3.48,37.0,13.0,20.0,4420.0,1399.0,97.0,616.0,794.0,147.0,192.0,0.977,St. Louis Cardinals,Busch Stadium II,1838413.0,100.0,100.0,STL,SLN,SLN,,,,,,,,,,,,,NaT,,OF,1950.0
49971,Jim Dwyer,25.0,1975,dwyerji01,N,227346.0,2,247.0,81,469.0,0.0,0.99,STL,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1975.0,81.0,206.0,26.0,56.0,8.0,1.0,3.0,21.0,4.0,1.0,27.0,36.0,0.0,0.0,12.0,2.0,1.0,SLN,NL,STL,E,3.0,163.0,82.0,82.0,80.0,N,N,N,662.0,5597.0,1527.0,239.0,46.0,81.0,444.0,649.0,116.0,49.0,29.0,45.0,689.0,577.0,3.57,33.0,13.0,36.0,4364.0,1452.0,98.0,571.0,824.0,171.0,140.0,0.973,St. Louis Cardinals,Busch Stadium II,1695270.0,104.0,105.0,STL,SLN,SLN,,,,,,,,,,,,,NaT,,OF,1950.0
50794,Jim Dwyer,26.0,1976,dwyerji01,N,227346.0,2,119.0,61,166.0,0.0,-0.64,MON,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1976.0,61.0,105.0,9.0,19.0,3.0,1.0,0.0,5.0,0.0,0.0,13.0,11.0,2.0,0.0,0.0,1.0,3.0,MON,NL,WSN,E,6.0,162.0,80.0,55.0,107.0,N,N,N,531.0,5428.0,1275.0,224.0,32.0,94.0,433.0,841.0,86.0,44.0,16.0,40.0,734.0,639.0,3.99,26.0,10.0,21.0,4320.0,1442.0,89.0,659.0,783.0,155.0,179.0,0.976,Montreal Expos,Jarry Park,646704.0,105.0,107.0,MON,MON,MON,,,,,,,,,,,,,NaT,,OF,1950.0
51635,Jim Dwyer,27.0,1977,dwyerji01,N,113673.0,1,37.0,13,79.0,0.0,-0.24,STL,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1977.0,13.0,31.0,3.0,7.0,1.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,0.0,2.0,0.0,0.0,0.0,SLN,NL,STL,E,3.0,162.0,83.0,83.0,79.0,N,N,N,737.0,5527.0,1490.0,252.0,56.0,96.0,489.0,823.0,134.0,112.0,25.0,29.0,688.0,612.0,3.81,26.0,10.0,31.0,4338.0,1420.0,139.0,532.0,768.0,139.0,174.0,0.978,St. Louis Cardinals,Busch Stadium II,1659287.0,99.0,99.0,STL,SLN,SLN,,,,,,,,,,,,,NaT,,OF,1950.0
52544,Jim Dwyer,28.0,1978,dwyerji01,N,227346.0,2,283.0,107,531.4,0.0,-0.27,STL,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1978.0,107.0,238.0,30.0,53.0,12.0,2.0,6.0,26.0,7.0,0.0,37.0,32.0,4.0,1.0,3.0,4.0,2.0,SLN,NL,STL,E,5.0,162.0,81.0,69.0,93.0,N,N,N,600.0,5415.0,1351.0,263.0,44.0,79.0,420.0,713.0,97.0,42.0,22.0,53.0,657.0,572.0,3.58,32.0,13.0,22.0,4313.0,1300.0,94.0,600.0,859.0,136.0,155.0,0.978,St. Louis Cardinals,Busch Stadium II,1278215.0,99.0,99.0,STL,SLN,SLN,,,,,,,,,,,,,NaT,,OF,1950.0
53444,Jim Dwyer,29.0,1979,dwyerji01,N,113673.0,1,133.0,76,285.7,0.0,-0.25,BOS,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1979.0,76.0,113.0,19.0,30.0,7.0,0.0,2.0,14.0,3.0,1.0,17.0,9.0,1.0,1.0,0.0,2.0,7.0,BOS,AL,BOS,E,3.0,160.0,80.0,91.0,69.0,N,N,N,841.0,5538.0,1567.0,310.0,34.0,194.0,512.0,708.0,60.0,43.0,33.0,59.0,711.0,641.0,4.03,47.0,11.0,29.0,4294.0,1487.0,133.0,463.0,731.0,142.0,166.0,0.977,Boston Red Sox,Fenway Park II,2353114.0,106.0,106.0,BOS,BOS,BOS,,,,,,,,,,,,,NaT,,OF,1950.0
54352,Jim Dwyer,30.0,1980,dwyerji01,N,113673.0,1,292.0,93,626.7,0.0,0.56,BOS,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1980.0,93.0,260.0,41.0,74.0,11.0,1.0,9.0,38.0,3.0,2.0,28.0,23.0,5.0,2.0,1.0,1.0,4.0,BOS,AL,BOS,E,5.0,160.0,81.0,83.0,77.0,N,N,N,757.0,5603.0,1588.0,297.0,36.0,162.0,475.0,720.0,79.0,48.0,32.0,50.0,767.0,701.0,4.38,30.0,8.0,43.0,4324.0,1557.0,129.0,481.0,696.0,149.0,206.0,0.977,Boston Red Sox,Fenway Park II,1956092.0,106.0,105.0,BOS,BOS,BOS,,,,,,,,,,,,,NaT,,OF,1950.0
55271,Jim Dwyer,31.0,1981,dwyerji01,N,113673.0,1,157.0,68,347.3,0.0,-0.06,BAL,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1981.0,68.0,134.0,16.0,30.0,0.0,1.0,3.0,10.0,0.0,2.0,20.0,19.0,0.0,0.0,0.0,3.0,2.0,BAL,AL,BAL,E,2.0,105.0,55.0,59.0,46.0,N,N,N,429.0,3516.0,883.0,165.0,11.0,88.0,404.0,454.0,41.0,34.0,15.0,24.0,437.0,386.0,3.7,25.0,10.0,23.0,2820.0,923.0,83.0,347.0,489.0,68.0,114.0,0.983,Baltimore Orioles,Memorial Stadium,1024247.0,100.0,99.0,BAL,BAL,BAL,,,,,,,,,,,,,NaT,,OF,1950.0
56174,Jim Dwyer,32.0,1982,dwyerji01,N,113673.0,1,178.0,71,343.7,0.0,1.16,BAL,dwyerji01,1950,USA,165.0,70.0,L,L,1973-06-10,1990-06-21,dwyej001,dwyerji01,1982.0,71.0,148.0,28.0,45.0,4.0,3.0,6.0,15.0,2.0,0.0,27.0,24.0,4.0,0.0,1.0,2.0,0.0,BAL,AL,BAL,E,2.0,163.0,82.0,94.0,68.0,N,N,N,774.0,5557.0,1478.0,259.0,27.0,179.0,634.0,796.0,49.0,38.0,25.0,52.0,687.0,648.0,3.99,38.0,8.0,34.0,4387.0,1436.0,147.0,488.0,719.0,101.0,140.0,0.984,Baltimore Orioles,Memorial Stadium,1613031.0,100.0,99.0,BAL,BAL,BAL,,,,,,,,,,,,,NaT,,OF,1950.0


---

## 4. Next Steps <a class="anchor" id="summary"></a>
- Data wrangling and preprocessing on the merged dataset to clean and prepare it for modeling
- EDA on the merged dataset to identify potential features for modeling

