# Data Wrangling
Data cleaning and preprocessing: transforming raw data into a format that can be easily analyzed.
## Contents:
- Data exploration for every dataset
    - Shape of Datasets and first look at rows
    - Description for every dataset
- Data cleaning
- Data Joining/Merging

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Load datasets
raw_war = pd.read_csv('Raw datasets/war_daily_bat.csv')
raw_teams = pd.read_csv('Raw datasets/Teams.csv')
raw_batting = pd.read_csv('Raw datasets/Batting.csv')
raw_fielding = pd.read_csv('Raw datasets/Fielding.csv')
raw_people = pd.read_csv('Raw datasets/People.csv')
raw_salaries = pd.read_csv('Raw datasets/Salaries.csv')

### Dataset exploration
#### Shape of Datasets and first look at rows

In [5]:
# Check the shape of the datasets and view the first few rows
datasets = [raw_war, raw_teams, raw_batting, raw_fielding, raw_people, raw_salaries]
dataset_names = ['raw_war', 'raw_teams', 'raw_batting', 'raw_fielding', 'raw_people', 'raw_salaries']

for dataset, name in zip(datasets, dataset_names):
    print(f'\n{name}:')
    print(f'Shape: {dataset.shape}')
    print(dataset.head())


raw_war:
Shape: (121375, 49)
     name_common   age    mlb_ID  player_ID  year_ID team_ID  stint_ID lg_ID  \
0  David Aardsma  22.0  430911.0  aardsda01     2004     SFG         1    NL   
1  David Aardsma  24.0  430911.0  aardsda01     2006     CHC         1    NL   
2  David Aardsma  25.0  430911.0  aardsda01     2007     CHW         1    AL   
3  David Aardsma  26.0  430911.0  aardsda01     2008     BOS         1    AL   
4  David Aardsma  27.0  430911.0  aardsda01     2009     SEA         1    AL   

    PA   G  ...  oppRpG_rep  pyth_exponent  pyth_exponent_rep  waa_win_perc  \
0  0.0  11  ...     4.67092          1.890              1.890         0.500   
1  3.0  43  ...     4.86457          1.912              1.913         0.499   
2  0.0   2  ...     4.85895          1.912              1.912         0.500   
3  1.0   5  ...     4.69650          1.893              1.894         0.497   
4  0.0   3  ...     4.79788          1.905              1.905         0.500   

   waa_win_per

#### Datasets description
- **raw_war:** Comprehensive collection of baseball statistics, with a focus on player performance metrics.
- **raw_teams:** Comprehensive collection of team-level baseball statistics.
- **raw_batting:** Player-level batting statistics for each season.
- **raw_fielding:** Player-level fielding statistics for each season.
- **raw_people:** Contains personal and biographical information about baseball players.   
- **raw_salaries:** Player-level salary data for each season.


#### Datasets Dictionaries

##### raw_war
| Column Name | Description |  
| --- | --- |
| name_common | Player name |
| age | Player age |
| mlb_ID | MLB ID code |
| player_ID | Player ID code |
| year_ID | Year |
| team_ID | Team |
| stint_ID | Player's stint (order of appearances within a season) |
| lg_ID | League |
| PA | Plate appearances when batting |
| G | Games |
| Inn | Innings played in the field |
| runs_bat | Runs from batting |
| runs_br | Runs from baserunning |
| runs_dp | Runs from avoiding double plays |
| runs_field | Runs from fielding |
| runs_infield | Runs from infield defense |
| runs_outfield | Runs from outfield defense |
| runs_catcher | Runs from catcher defense |
| runs_good_plays | Runs from good fielding plays |
| runs_defense | Runs from all defensive plays |
| runs_position | Runs from positional scarcity |
| runs_position_p | Runs from positional scarcity, pitcher |
| runs_replacement | Runs from replacement level |
| runs_above_rep | Runs above replacement level |
| runs_above_avg | Runs above average |
| runs_above_avg_off | Runs above average, offense |
| runs_above_avg_def | Runs above average, defense |
| WAA | Wins above average |
| WAA_off | Wins above average, offense |
| WAA_def | Wins above average, defense |
| WAR | Wins above replacement |
| WAR_def | Wins above replacement, defense |
| WAR_off | Wins above replacement, offense |
| WAR_rep | Wins above replacement, replacement level |
| salary | Salary |
| pitcher | Pitcher indicator |
| teamRpG | Team runs per game |
| oppRpG | Opponent runs per game |
| oppRpPA_rep | Opponent runs per plate appearance, replacement level |
| oppRpG_rep | Opponent runs per game, replacement level |
| pyth_exponent | Pythagorean win percentage exponent |
| pyth_exponent_rep | Pythagorean win percentage exponent, replacement level |
| waa_win_perc | Win percentage, based on WAA |
| waa_win_perc_off | Win percentage, based on WAA, offense |
| waa_win_perc_def | Win percentage, based on WAA, defense |
| waa_win_perc_rep | Win percentage, based on WAA, replacement level |
| OPS_plus | OPS+, relative to league |
| TOB_lg | Times on base, relative to league |
| TB_lg | Total bases, relative to league |

##### raw_teams
| Column Name | Description |
| --- | --- |
| yearID | Year |
| lgID | League |
| teamID | Team |
| franchID | Franchise (links to TeamsFranchise table) |
| divID | Team's division |
| Rank | Position in final standings |
| G | Games played |
| Ghome | Games played at home |
| W | Wins |
| L | Losses |
| DivWin | Division Winner (Y or N) |
| WCWin | Wild Card Winner (Y or N) |
| LgWin | League Champion (Y or N) |
| WSWin | World Series Winner (Y or N) |
| R | Runs scored |
| AB | At bats |
| H | Hits by batters |
| 2B | Doubles |
| 3B | Triples |
| HR | Homeruns by batters |
| BB | Walks by batters |
| SO | Strikeouts by batters |
| SB | Stolen bases |
| CS | Caught stealing |
| HBP | Batters hit by pitch |
| SF | Sacrifice flies |
| RA | Opponents runs scored |
| ER | Earned runs allowed |
| ERA | Earned run average |
| CG | Complete games |
| SHO | Shutouts |
| SV | Saves |
| IPOuts | Outs Pitched (innings pitched x 3) |
| HA | Hits allowed |
| HRA | Homeruns allowed |
| BBA | Walks allowed |
| SOA | Strikeouts by pitchers |
| E | Errors |
| DP | Double Plays |
| FP | Fielding percentage |
| name | Team's full name |
| park | Name of team's home ballpark |
| attendance | Home attendance total |
| BPF | Three-year park factor for batters |
| PPF | Three-year park factor for pitchers |
| teamIDBR | Team ID used by Baseball Reference website |
| teamIDlahman45 | Team ID used in Lahman database version 4.5 |
| teamIDretro | Team ID used by Retrosheet |



##### raw_batting
| Column Name | Description |
| --- | --- |
| playerID | Player ID code |
| yearID | Year |
| stint | Player's stint (order of appearances within a season) |
| teamID | Team |
| lgID | League |
| G | Games |
| AB | At Bats |
| R | Runs |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR | Homeruns |
| RBI | Runs Batted In |
| SB | Stolen Bases |
| CS | Caught Stealing |
| BB | Base on Balls |
| SO | Strikeouts |
| IBB | Intentional walks |
| HBP | Hit by pitch |
| SH | Sacrifice hits |
| SF | Sacrifice flies |
| GIDP | Grounded into double plays |
| G_old | Old version of games (deprecated) |


##### raw_fielding
| Column Name | Description |
| --- | --- |
| playerID | Player ID code |
| yearID | Year |
| stint | Player's stint (order of appearances within a season) |
| teamID | Team |
| lgID | League |
| POS | Position |
| G | Games |
| GS | Games Started |
| InnOuts | Time played in the field expressed as outs |
| PO | Putouts |
| A | Assists |
| E | Errors |
| DP | Double Plays |
| PB | Passed Balls (by catchers) |
| WP | Wild Pitches (by catchers) |
| SB | Stolen bases allowed (by catchers) |
| CS | Caught Stealing (by catchers) |
| ZR | Zone Rating |

##### raw_people
| Column Name | Description |
| --- | --- |
| playerID | Player ID code |
| birthYear | Year player was born |
| birthMonth | Month player was born |
| birthDay | Day player was born |
| birthCountry | Country where player was born |
| birthState | State where player was born |
| birthCity | City where player was born |
| deathYear | Year player died |
| deathMonth | Month player died |
| deathDay | Day player died |
| deathCountry | Country where player died |
| deathState | State where player died |
| deathCity | City where player died |
| nameFirst | Player's first name |
| nameLast | Player's last name |
| nameGiven | Player's given name (typically first and middle) |
| weight | Player's weight in pounds |
| height | Player's height in inches |
| bats | Player's batting hand (left, right, or both) |
| throws | Player's throwing hand (left or right) |
| debut | Date that player made first major league appearance |
| finalGame | Date that player made last major league appearance (blank if still active) |
| retroID | ID used by retrosheet |
| bbrefID | ID used by Baseball Reference website |

##### raw_salaries
| Column Name | Description |
| --- | --- |
| yearID | Year |
| teamID | Team |
| lgID | League |
| playerID | Player ID code |
| salary | Salary |


In [183]:
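# Assumption: df is a working copy of raw_war (the cell that defined it is not shown)
df = raw_war.copy()

total_players = df['player_ID'].nunique()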
total_teams = df['team_ID'].nunique()
total_seasons = df['year_ID'].nunique()
first_season = df['year_ID'].min()
last_season = df['year_ID'].max()

print(f'Number of players: {total_players}')
print(f'Number of teams: {total_teams}')
print(f'Number of seasons: {total_seasons}')
print(f'First Season: {first_season}')
print(f'Last Season: {last_season}')

Number of players: 22903
Number of teams: 180
Number of seasons: 153
First Season: 1871
Last Season: 2023


### Data Cleaning

#### Drop mlb_ID column
Both `mlb_ID` and `player_ID` identify the same player, so one of them is redundant. I keep `player_ID` because it is more widely used and recognized in other baseball datasets I might want to integrate in the future.

In [184]:
# Drop mlb_ID column
df = df.drop(columns='mlb_ID')
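
Before dropping an identifier, it may be worth confirming that the two columns really map one-to-one. The sketch below shows one way to check, using a toy frame with made-up IDs. The variable names are hypothetical and the check is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up IDs); the last player has no mlb_ID at all
toy = pd.DataFrame({
    'player_ID': ['aaa01', 'aaa01', 'bbb01', 'ccc01'],
    'mlb_ID': [430911.0, 430911.0, 111111.0, np.nan],
})

# Distinct mlb_ID values per player_ID (nunique ignores NaN by default)
ids_per_player = toy.groupby('player_ID')['mlb_ID'].nunique()

# Distinct player_ID values per mlb_ID, looking only at rows that have one
players_per_id = (
    toy.dropna(subset=['mlb_ID'])
       .groupby('mlb_ID')['player_ID']
       .nunique()
)

# mlb_ID is redundant if no player has two IDs and no ID covers two players
redundant = bool((ids_per_player <= 1).all() and (players_per_id == 1).all())
missing_mlb_id = int(toy['mlb_ID'].isna().sum())
print(f'Redundant: {redundant}, rows without mlb_ID: {missing_mlb_id}')
```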

In [199]:
df.isnull().sum()

name_common               0
age                    1388
player_ID                 0
year_ID                   0
team_ID                   0
stint_ID                  0
lg_ID                   736
PA                      802
G                         0
Inn                   44908
runs_bat                  0
runs_br                   0
runs_dp                   0
runs_field                0
runs_infield          44908
runs_outfield         44908
runs_catcher          44908
runs_good_plays       91900
runs_defense              0
runs_position           988
runs_position_p           1
runs_replacement        802
runs_above_rep          988
runs_above_avg          988
runs_above_avg_off      988
runs_above_avg_def      988
WAA                    9669
WAA_off                9669
WAA_def                9669
WAR                   10812
WAR_def                9669
WAR_off               10812
WAR_rep               10626
salary                73429
pitcher                1143
teamRpG             

In [200]:
df.isnull().sum().sum()

521883
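
A raw total is hard to interpret without the table size. As a rough sketch, the per-column counts can be put next to the share of rows they represent. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    'WAR': [1.0, np.nan, 2.0, np.nan],
    'salary': [np.nan, np.nan, np.nan, 5e5],
    'G': [10, 20, 30, 40],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'pct_missing': toy.isna().mean().mul(100).round(1),
})

# Keep only columns that have gaps, worst first
missing = missing[missing['n_missing'] > 0].sort_values('pct_missing', ascending=False)
print(missing)
```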

#### Selecting the Appropriate Time Frame for Analysis

By the 1970s, many modern aspects of the game were in place: the designated hitter rule was adopted by the American League in 1973, free agency began in the mid-1970s, and by the 1980s training, nutrition, and medical treatment for players had evolved significantly. Data quality and completeness also tend to be better from the 1970s onward, so the analysis starts with the 1970 season.

In [201]:
# Keep seasons from 1970 onward (.copy() avoids chained-assignment warnings later)
df_modern = df[df['year_ID'] >= 1970].copy()
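
One way to test the claim that data completeness improves over time is to compute the share of missing values per decade for a column such as `Inn`. The sketch below uses made-up rows and only illustrates the technique. It does not reproduce any result from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) spanning several decades
toy = pd.DataFrame({
    'year_ID': [1905, 1912, 1975, 1988, 1999],
    'Inn': [np.nan, np.nan, 100.0, 250.0, np.nan],
})

# Map each season to the first year of its decade, e.g. 1988 -> 1980
decade = (toy['year_ID'] // 10) * 10

# Share of rows with missing Inn in each decade, as a percentage
pct_missing_by_decade = toy['Inn'].isna().groupby(decade).mean().mul(100)
print(pct_missing_by_decade)
```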

In [204]:
df_modern.isnull().sum()

name_common               0
age                       0
player_ID                 0
year_ID                   0
team_ID                   0
stint_ID                  0
lg_ID                     0
PA                      796
G                         0
Inn                       0
runs_bat                  0
runs_br                   0
runs_dp                   0
runs_field                0
runs_infield              0
runs_outfield             0
runs_catcher              0
runs_good_plays       35112
runs_defense              0
runs_position           796
runs_position_p           0
runs_replacement        796
runs_above_rep          796
runs_above_avg          796
runs_above_avg_off      796
runs_above_avg_def      796
WAA                    9477
WAA_off                9477
WAA_def                9477
WAR                    9799
WAR_def                9477
WAR_off                9799
WAR_rep                9799
salary                31020
pitcher                 322
teamRpG             

In [205]:
df_modern.isnull().sum().sum()

232471

In [203]:
df_modern.shape

(64587, 48)

In [206]:
# Drop pitchers (rows where pitcher is missing are kept, because NaN != 'Y' is True)
df_modern = df_modern[df_modern['pitcher'] != 'Y']
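
The filter above keeps rows whose `pitcher` flag is missing, which is why the `pitcher` column still shows nulls in the next output. A small sketch on toy data shows how `!= 'Y'` and `== 'N'` differ. Which behavior is wanted is a judgment call, and the sketch does not claim one is correct.

```python
import numpy as np
import pandas as pd

# Toy frame: one pitcher, one position player, one row with a missing flag
toy = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'pitcher': ['Y', 'N', np.nan],
})

# NaN != 'Y' evaluates to True, so the missing-flag row survives
kept_not_y = toy[toy['pitcher'] != 'Y']

# NaN == 'N' evaluates to False, so the missing-flag row is dropped
kept_is_n = toy[toy['pitcher'] == 'N']

print(len(kept_not_y), len(kept_is_n))
```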

In [208]:
df_modern.isnull().sum()

name_common               0
age                       0
player_ID                 0
year_ID                   0
team_ID                   0
stint_ID                  0
lg_ID                     0
PA                        0
G                         0
Inn                       0
runs_bat                  0
runs_br                   0
runs_dp                   0
runs_field                0
runs_infield              0
runs_outfield             0
runs_catcher              0
runs_good_plays       19535
runs_defense              0
runs_position             0
runs_position_p           0
runs_replacement          0
runs_above_rep            0
runs_above_avg            0
runs_above_avg_off        0
runs_above_avg_def        0
WAA                       1
WAA_off                   1
WAA_def                   1
WAR                     323
WAR_def                   1
WAR_off                 323
WAR_rep                 323
salary                15713
pitcher                 322
teamRpG             

In [209]:
df_modern.isnull().sum().sum()

36710

In [211]:
# Drop 2023 season
df_modern = df_modern[df_modern['year_ID'] != 2023]
df_modern.shape

(33308, 48)

In [213]:
df_modern.isnull().sum()

name_common               0
age                       0
player_ID                 0
year_ID                   0
team_ID                   0
stint_ID                  0
lg_ID                     0
PA                        0
G                         0
Inn                       0
runs_bat                  0
runs_br                   0
runs_dp                   0
runs_field                0
runs_infield              0
runs_outfield             0
runs_catcher              0
runs_good_plays       19535
runs_defense              0
runs_position             0
runs_position_p           0
runs_replacement          0
runs_above_rep            0
runs_above_avg            0
runs_above_avg_off        0
runs_above_avg_def        0
WAA                       1
WAA_off                   1
WAA_def                   1
WAR                     320
WAR_def                   1
WAR_off                 320
WAR_rep                 320
salary                15558
pitcher                 319
teamRpG             

In [215]:
df_modern.isnull().sum().sum()

36538

In [217]:
# Drop rows with zero games played (G == 0)
df_modern = df_modern[df_modern['G'] != 0]

In [219]:
df_modern.isnull().sum()

name_common               0
age                       0
player_ID                 0
year_ID                   0
team_ID                   0
stint_ID                  0
lg_ID                     0
PA                        0
G                         0
Inn                       0
runs_bat                  0
runs_br                   0
runs_dp                   0
runs_field                0
runs_infield              0
runs_outfield             0
runs_catcher              0
runs_good_plays       19535
runs_defense              0
runs_position             0
runs_position_p           0
runs_replacement          0
runs_above_rep            0
runs_above_avg            0
runs_above_avg_off        0
runs_above_avg_def        0
WAA                       0
WAA_off                   0
WAA_def                   0
WAR                     319
WAR_def                   0
WAR_off                 319
WAR_rep                 319
salary                15557
pitcher                 319
teamRpG             

In [220]:
df_modern.isnull().sum().sum()

36521

In [222]:
# Drop rows with zero plate appearances (PA == 0)
df_modern = df_modern[df_modern['PA'] != 0]

In [224]:
df_modern.isnull().sum()

name_common               0
age                       0
player_ID                 0
year_ID                   0
team_ID                   0
stint_ID                  0
lg_ID                     0
PA                        0
G                         0
Inn                       0
runs_bat                  0
runs_br                   0
runs_dp                   0
runs_field                0
runs_infield              0
runs_outfield             0
runs_catcher              0
runs_good_plays       19446
runs_defense              0
runs_position             0
runs_position_p           0
runs_replacement          0
runs_above_rep            0
runs_above_avg            0
runs_above_avg_off        0
runs_above_avg_def        0
WAA                       0
WAA_off                   0
WAA_def                   0
WAR                     290
WAR_def                   0
WAR_off                 290
WAR_rep                 290
salary                15446
pitcher                 290
teamRpG             

In [225]:
df_modern.isnull().sum().sum()

36070
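
The filter-then-count pattern repeated above could be condensed into a small helper that applies each filter and logs the remaining rows and nulls. This is a sketch only: `apply_steps` is a hypothetical name, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def apply_steps(frame, steps):
    """Apply (label, mask_function) pairs in order and log rows/nulls after each."""
    log = []
    for label, keep in steps:
        frame = frame[keep(frame)]
        log.append({
            'step': label,
            'rows': len(frame),
            'nulls': int(frame.isna().sum().sum()),
        })
    return frame, pd.DataFrame(log)

# Toy rows (made-up values)
toy = pd.DataFrame({
    'year_ID': [1965, 1980, 1990, 2023, 2000],
    'G': [10, 0, 50, 20, 30],
    'PA': [40.0, 0.0, 200.0, 80.0, 0.0],
    'WAR': [np.nan, np.nan, 2.0, 1.0, np.nan],
})

steps = [
    ('year_ID >= 1970', lambda d: d['year_ID'] >= 1970),
    ('year_ID != 2023', lambda d: d['year_ID'] != 2023),
    ('G != 0', lambda d: d['G'] != 0),
    ('PA != 0', lambda d: d['PA'] != 0),
]

cleaned, log = apply_steps(toy, steps)
print(log)
```

The resulting log gives one row per step, so the effect of each filter is visible in one table rather than in separate outputs.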

In [194]:
# Summarize PA for rows where WAR is null
df_modern[df_modern['WAR'].isnull()]['PA'].describe()

count    1154.000000
mean        3.758232
std         5.866951
min         1.000000
25%         1.000000
50%         2.000000
75%         4.000000
max        76.000000
Name: PA, dtype: float64

In [226]:
# Count missing values across all columns for rows with fewer than 100 PA
df_modern[df_modern['PA'] < 100].isnull().sum().sum()

16344

In [227]:
# Drop rows with fewer than 100 PA
df_modern = df_modern[df_modern['PA'] >= 100]

In [230]:
df_modern.shape

(21842, 48)
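
The 100 PA cutoff trades sample size for completeness. In the `describe()` output above, the rows with null WAR top out at 76 PA, so any cutoff above that removes them. A sketch of how to compare candidate cutoffs follows, using made-up rows. The thresholds are illustrative, not recommendations.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): low-PA rows are the ones missing WAR
toy = pd.DataFrame({
    'PA': [1, 4, 76, 99, 100, 350, 600],
    'WAR': [np.nan, np.nan, np.nan, 0.1, 0.2, 2.0, 5.0],
})

thresholds = [1, 50, 100, 300]

# For each candidate minimum PA, count surviving rows and surviving null WARs
summary = pd.DataFrame({
    'min_PA': thresholds,
    'rows_kept': [int((toy['PA'] >= t).sum()) for t in thresholds],
    'null_WAR_kept': [
        int(toy.loc[toy['PA'] >= t, 'WAR'].isna().sum()) for t in thresholds
    ],
})
print(summary)
```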

In [228]:
df_modern.isnull().sum()

name_common               0
age                       0
player_ID                 0
year_ID                   0
team_ID                   0
stint_ID                  0
lg_ID                     0
PA                        0
G                         0
Inn                       0
runs_bat                  0
runs_br                   0
runs_dp                   0
runs_field                0
runs_infield              0
runs_outfield             0
runs_catcher              0
runs_good_plays       12815
runs_defense              0
runs_position             0
runs_position_p           0
runs_replacement          0
runs_above_rep            0
runs_above_avg            0
runs_above_avg_off        0
runs_above_avg_def        0
WAA                       0
WAA_off                   0
WAA_def                   0
WAR                       0
WAR_def                   0
WAR_off                   0
WAR_rep                   0
salary                 6911
pitcher                   0
teamRpG             

In [229]:
df_modern.isnull().sum().sum()

19726
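
The two non-zero counts visible above, `runs_good_plays` (12815) and `salary` (6911), sum to 19726, which is the total. So all remaining nulls sit in those two columns. To see where the salary gaps fall, one option is to compute coverage per season, sketched here on made-up rows.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values)
toy = pd.DataFrame({
    'year_ID': [1975, 1975, 1995, 1995, 1995],
    'salary': [np.nan, np.nan, 1e5, np.nan, 2e5],
})

# 'size' counts all rows per season; 'count' counts only non-null salaries
coverage = toy.groupby('year_ID')['salary'].agg(rows='size', with_salary='count')
coverage['pct_with_salary'] = (
    coverage['with_salary'] / coverage['rows'] * 100
).round(1)
print(coverage)
```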