# MLB Predictions Data Preprocessing

In this notebook, we will perform data processing tasks to simplify the dataframe, drop columns with no value to our project, reduce the time span, and perform further cleaning and wrangling.

By the end, we expect to have a dataframe that is ready for analysis and modeling.

## Table of Contents
1. [Data Exploration](#data_exploration)
    - [1.1. Data Dictionary](#data_dict)
    - [1.2. Description and overview of the data](#overview)
2. [Data Cleaning](#data_cleaning)
    - [2.1. Setting a time span](#time_span)
    - [2.2. Dropping columns (features) that are not relevant to the project](#drop_cols)
    - [2.3. Dropping repeated or similar columns](#drop_repeated_cols)
    - [2.4. Filtering and removing players with no relevance to the project (removing rows)](#drop_rows)
    - [2.5. Dealing with null values](#null_values)
3. [Next Steps](#next_steps)

In [103]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## 1. Data Exploration <a id='data_exploration'></a>

To begin the analysis, we will explore the dataset to understand its structure and content.


In [104]:
# Load Dataset
raw_df = pd.read_csv('Raw datasets/batting_war_people_teams.csv')



In [105]:
# Shape of the dataset
print('Shape of the dataset: ', raw_df.shape)
print('Number of rows (Players): ', raw_df.shape[0])
print('Number of columns (Features): ', raw_df.shape[1])

Shape of the dataset:  (112184, 138)
Number of rows (Players):  112184
Number of columns (Features):  138


The dataset has a shape of **(112184, 138)**, indicating __112,184 rows (Players) and 138 columns (Features)__.

Here are the first five rows of the dataset:

In [106]:
# First 5 rows of the dataset
pd.set_option('display.max_columns', None)

raw_df.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID_x,G_x,AB_x,R_x,H_x,2B_x,3B_x,HR_x,RBI,SB_x,CS_x,BB_x,SO_x,IBB,HBP_x,SH,SF_x,GIDP,name_common,age,mlb_ID,player_ID,year_ID,team_ID,stint_ID,lg_ID,PA,G_y,Inn,runs_bat,runs_br,runs_dp,runs_field,runs_infield,runs_outfield,runs_catcher,runs_defense,runs_position,runs_position_p,runs_replacement,runs_above_rep,runs_above_avg,runs_above_avg_off,runs_above_avg_def,WAA,WAA_off,WAA_def,WAR,WAR_def,WAR_off,WAR_rep,salary,pitcher,teamRpG,oppRpG,oppRpPA_rep,oppRpG_rep,pyth_exponent,pyth_exponent_rep,waa_win_perc,waa_win_perc_off,waa_win_perc_def,waa_win_perc_rep,OPS_plus,TOB_lg,TB_lg,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,lgID_y,franchID,divID,Rank,G,Ghome,W,L,DivWin,LgWin,WSWin,R_y,AB_y,H_y,2B_y,3B_y,HR_y,BB_y,SO_y,SB_y,CS_y,HBP_y,SF_y,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
0,abercda01,1871,1,TRO,,1,4,0,0,0,0,0,0.0,0.0,0.0,0,0.0,,,,,0.0,Frank Abercrombie,21.0,110018.0,abercda01,1871.0,TRO,1.0,,4.0,1.0,,-1.37,0.0,0.0,0.0,,,,0.0,0.07,0.0,0.14,-1.2,-1.3,-1.3,0.1,-0.08,-0.08,0.0,-0.07,0.0,-0.07,0.01,,N,9.06538,10.36538,0.22078,10.2287,2.329,2.368,0.4226,0.4226,0.504,0.4921,-100.0,1.27,1.572,1850.0,1.0,2.0,USA,OK,Fort Towson,1939.0,11.0,11.0,USA,PA,Philadelphia,Frank,Abercrombie,Francis Patterson,,,,,1871-10-21,1871-10-21,aberd101,abercda01,,TRO,,6,29,,13,15,,N,,351,1248,384,51,34,6,49,19.0,62.0,24.0,,,362,153,5.51,28,0,0,750,431,4,75,12,198,22,0.845,Troy Haymakers,Haymakers' Grounds,,101,100,TRO,TRO,TRO
1,addybo01,1871,1,RC1,,25,118,30,32,6,0,0,13.0,8.0,1.0,4,0.0,,,,,0.0,Bob Addy,29.0,110074.0,addybo01,1871.0,ROK,1.0,,122.0,25.0,,-2.96,0.94,0.0,-1.0,,,,-1.0,0.63,0.0,4.17,1.8,-2.4,-1.4,-0.4,-0.14,-0.07,-0.03,0.24,-0.03,0.31,0.38,,N,10.30978,10.36538,0.22078,10.19863,2.371,2.367,0.4945,0.4968,0.4992,0.4904,77.806252,38.101,45.607,1842.0,2.0,,CAN,ON,Port Hope,1910.0,4.0,9.0,USA,ID,Pocatello,Bob,Addy,Robert Edward,160.0,68.0,L,L,1871-05-06,1877-10-06,addyb101,addybo01,,ROK,,9,25,,4,21,,N,,231,1036,274,44,25,3,38,30.0,53.0,10.0,,,287,108,4.3,23,1,0,678,315,3,34,16,220,14,0.821,Rockford Forest Citys,Agricultural Society Fair Grounds,,97,99,ROK,RC1,RC1
2,allisar01,1871,1,CL1,,29,137,28,40,4,5,0,19.0,3.0,1.0,2,5.0,,,,,1.0,Art Allison,22.0,110170.0,allisar01,1871.0,CLE,1.0,,139.0,29.0,,-0.83,-0.07,0.0,1.0,,,,1.0,-1.52,0.0,4.75,3.3,-1.4,-2.4,-0.5,-0.08,-0.13,-0.04,0.35,-0.04,0.3,0.43,,N,10.28193,10.36538,0.22078,10.2016,2.37,2.367,0.4972,0.4952,0.499,0.4906,99.5321,43.243,52.731,1849.0,1.0,29.0,USA,PA,Philadelphia,1916.0,2.0,25.0,USA,DC,Washington,Art,Allison,Arthur Algernon,150.0,68.0,,,1871-05-04,1876-10-05,allia101,allisar01,,CFC,,8,29,,10,19,,N,,249,1186,328,35,40,7,26,25.0,18.0,8.0,,,341,116,4.11,23,0,0,762,346,13,53,34,234,15,0.818,Cleveland Forest Citys,National Association Grounds,,96,100,CLE,CL1,CL1
3,allisdo01,1871,1,WS3,,27,133,28,44,10,2,2,27.0,1.0,1.0,0,2.0,,,,,0.0,Doug Allison,24.0,110172.0,allisdo01,1871.0,OLY,1.0,,133.0,27.0,,6.49,-0.48,0.0,1.0,,,,1.0,1.79,0.0,4.54,13.3,8.8,7.8,2.8,0.5,0.45,0.15,0.92,0.15,0.87,0.42,,N,10.65427,10.36538,0.22078,10.19706,2.382,2.367,0.5185,0.5164,0.5059,0.4903,133.405823,41.017,50.74,1846.0,7.0,12.0,USA,PA,Philadelphia,1916.0,12.0,19.0,USA,DC,Washington,Doug,Allison,Douglas L.,160.0,70.0,R,R,1871-05-05,1883-07-13,allid101,allisdo01,,OLY,,4,32,,15,15,,N,,310,1353,375,54,26,6,48,13.0,48.0,13.0,,,303,137,4.37,32,0,0,846,371,4,45,13,218,20,0.85,Washington Olympics,Olympics Grounds,,94,98,OLY,WS3,WS3
4,ansonca01,1871,1,RC1,,25,120,29,39,11,3,0,16.0,6.0,2.0,2,1.0,,,,,0.0,Cap Anson,19.0,110284.0,ansonca01,1871.0,ROK,1.0,,122.0,25.0,,5.37,-0.31,0.0,2.0,,,,2.0,0.96,0.0,4.17,12.2,8.0,6.0,3.0,0.46,0.35,0.16,0.84,0.16,0.73,0.38,,N,10.60618,10.36538,0.22078,10.19863,2.38,2.367,0.5183,0.5137,0.5068,0.4904,128.350423,38.101,46.38,1852.0,4.0,17.0,USA,IA,Marshalltown,1922.0,4.0,14.0,USA,IL,Chicago,Cap,Anson,Adrian Constantine,227.0,72.0,R,R,1871-05-06,1897-10-03,ansoc101,ansonca01,,ROK,,9,25,,4,21,,N,,231,1036,274,44,25,3,38,30.0,53.0,10.0,,,287,108,4.3,23,1,0,678,315,3,34,16,220,14,0.821,Rockford Forest Citys,Agricultural Society Fair Grounds,,97,99,ROK,RC1,RC1


In [107]:
pd.set_option('display.max_columns', 20)
# Print all column names and their data types
raw_df.dtypes

playerID               object
yearID                  int64
stint                   int64
teamID                 object
lgID_x                 object
G_x                     int64
AB_x                    int64
R_x                     int64
H_x                     int64
2B_x                    int64
3B_x                    int64
HR_x                    int64
RBI                   float64
SB_x                  float64
CS_x                  float64
BB_x                    int64
SO_x                  float64
IBB                   float64
HBP_x                 float64
SH                    float64
SF_x                  float64
GIDP                  float64
name_common            object
age                   float64
mlb_ID                float64
player_ID              object
year_ID               float64
team_ID                object
stint_ID              float64
lg_ID                  object
PA                    float64
G_y                   float64
Inn                   float64
runs_bat  

### 1.1 Data Dictionary <a id='data_dict'></a>

Now let's examine the data dictionary to understand the meaning of each column:

| Column Name | Description | 
| --- | --- |
| playerID | The unique identifier for the player. |
| yearID | The year that the data is relevant to. |
| stint | The term stint usually represents a player's period of participation with a team in a given season. A player might have multiple stints if they played for multiple teams in the same season. |
| teamID | The unique identifier for the team. |
| lgID_x and lgID_y | The unique identifier for the league. (_x: Player, _y: Team) |
| G_x and G_y | Number of games played by the player or team in the season. (_x: Player, _y: Team) |
| AB_x and AB_y | Number of at bats. (_x: Player, _y: Team) |
| R_x and R_y | Number of runs. (_x: Player, _y: Team) |
| H_x and H_y | Number of hits. (_x: Player, _y: Team) |
| 2B_x and 2B_y | Number of doubles. (_x: Player, _y: Team) |
| 3B_x and 3B_y | Number of triples. (_x: Player, _y: Team) |
| HR_x and HR_y | Number of home runs. (_x: Player, _y: Team) |
| RBI | Number of runs batted in. |
| SB_x and SB_y | Number of stolen bases. (_x: Player, _y: Team) |
| CS_x and CS_y | Number of times caught stealing. (_x: Player, _y: Team) |
| BB_x and BB_y | Number of bases on balls (walks). (_x: Player, _y: Team) |
| SO_x and SO_y | Number of strikeouts. (_x: Player, _y: Team) |
| IBB | Number of intentional bases on balls. |
| HBP_x and HBP_y | Number of times hit by pitch. (_x: Player, _y: Team) |
| SH | Number of sacrifice hits. |
| SF_x and SF_y | Number of sacrifice flies. (_x: Player, _y: Team) |
| GIDP | Number of grounded into double plays. |
| name_common | Common name of the player. |
| age | Age of the player. |
| mlb_ID | Major League Baseball identifier for the player. |
| player_ID | Another player identifier, potentially from another source. |
| team_ID | Another team identifier, potentially from another source. |
| lg_ID | Another league identifier, potentially from another source. |
| PA | Plate appearances. |
| Inn | Innings played. |
| runs_bat | Runs contributed by batting. |
| runs_br | Base running runs. |
| runs_dp | Double play runs. |
| runs_field | Fielding runs. |
| runs_infield | Infielding runs. |
| runs_outfield | Outfielding runs. |
| runs_catcher | Catching runs. |
| runs_defense | Defensive runs. |
| runs_position | Positional runs. |
| runs_position_p | Positional adjustment runs for pitchers. |
| runs_replacement | Replacement level runs. |
| runs_above_rep | Runs above replacement. |
| runs_above_avg | Runs above average. |
| runs_above_avg_off | Offensive runs above average. |
| runs_above_avg_def | Defensive runs above average. |
| WAA | Wins Above Average. |
| WAA_off | Offensive Wins Above Average. |
| WAA_def | Defensive Wins Above Average. |
| WAR | Wins Above Replacement. |
| WAR_def | Defensive Wins Above Replacement. |
| WAR_off | Offensive Wins Above Replacement. |
| WAR_rep | Replacement level Wins Above Replacement. |
| salary | Salary of the player. |
| pitcher | Indicates if the player is a pitcher. |
| teamRpG | Team runs per game. |
| oppRpG | Opponent's runs per game. |
| oppRpPA_rep | Opponent's runs per plate appearance at replacement level. |
| oppRpG_rep | Opponent's runs per game at replacement level. |
| pyth_exponent | Pythagorean exponent, used to calculate expected win percentage. |
| pyth_exponent_rep | Pythagorean exponent at replacement level. |
| waa_win_perc | Wins Above Average win percentage. |
| waa_win_perc_off | Offensive Wins Above Average win percentage. |
| waa_win_perc_def | Defensive Wins Above Average win percentage. |
| waa_win_perc_rep | Wins Above Average win percentage at replacement level. |
| OPS_plus | On-base Plus Slugging Plus, a normalized version of OPS. |
| TOB_lg | Times on base in the league. |
| TB_lg | Total bases in the league. |
| birthYear, birthMonth, birthDay, birthCountry, birthState, birthCity | These fields represent the birth information of the player. |
| deathYear, deathMonth, deathDay, deathCountry, deathState, deathCity | These fields represent the death information of the player (if applicable). |
| nameFirst, nameLast, nameGiven | These fields represent the player's first, last, and given names. |
| weight, height | These fields represent the player's weight and height. |
| bats, throws | These fields represent the player's batting and throwing hands (right, left, or both). |
| debut, finalGame | These fields represent the player's debut and final game dates. |
| retroID, bbrefID | These fields represent unique identifiers for the player from different sources. |
| franchID | The unique identifier for the franchise. |
| divID | The unique identifier for the division. |
| Rank | The team's rank in their division for the season. |
| Ghome | Number of home games. |
| W, L | Number of wins and losses. |
| DivWin, LgWin, WSWin | These fields represent whether the team won the division, league, or World Series. |
| RA | Runs allowed by the team. |
| ER | Earned runs allowed by the team. |
| ERA | Earned run average of the team. |
| CG | Complete games by the team's pitchers. |
| SHO | Shutouts by the team's pitchers. |
| SV | Saves by the team's pitchers. |
| IPouts | Innings pitched (outs) by the team's pitchers. |
| HA | Hits allowed by the team. |
| HRA | Home runs allowed by the team. |
| BBA | Bases on balls (walks) allowed by the team. |
| SOA | Strikeouts by the team's pitchers. |
| E | Errors by the team. |
| DP | Double plays by the team. |
| FP | Fielding percentage of the team. |
| name | Name of the team. |
| park | The park or stadium where the team plays their home games. |
| attendance | Total attendance at the team's home games. |
| BPF, PPF | Batting and pitching park factors, which indicate the level of favorability for hitters or pitchers. |
| teamIDBR, teamIDlahman45, teamIDretro | These fields represent unique identifiers for the team from different sources. |
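
The `_x` and `_y` suffixes in the dictionary match the defaults that pandas' `DataFrame.merge` appends to overlapping column names. The sketch below illustrates that behavior on two toy tables. The table contents and the `players` and `teams` names are hypothetical, and we are only assuming that the real dataset was assembled with a similar merge.

```python
import pandas as pd

# Toy player-level and team-level tables (hypothetical values, not the real data)
players = pd.DataFrame({
    'teamID': ['BOS', 'NYA'],
    'yearID': [1990, 1990],
    'playerID': ['player_a', 'player_b'],
    'HR': [20, 15],
})
teams = pd.DataFrame({
    'teamID': ['BOS', 'NYA'],
    'yearID': [1990, 1990],
    'HR': [106, 147],
})

# Non-key columns present in both tables receive the default suffixes '_x' and '_y'
merged = players.merge(teams, on=['teamID', 'yearID'], how='left')
print(merged.columns.tolist())
```

Here `HR_x` carries the player value and `HR_y` the team value, mirroring the (_x: Player, _y: Team) convention used in the table above.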














### 1.2 Description and Overview of the Data <a id='overview'></a>

In [108]:
# Unique players (Total number of players)
print('Number of unique players: ', raw_df['playerID'].nunique())

# Unique seasons (Total number of seasons)
print('Number of unique seasons: ', raw_df['yearID'].nunique())

# First season
print('First season: ', raw_df['yearID'].min())

# Last season
print('Last season: ', raw_df['yearID'].max())

Number of unique players:  20469
Number of unique seasons:  152
First season:  1871
Last season:  2022


#### Overview of the Data

Our dataset is a rich compilation of baseball statistics that contains data for __20,469__ unique players across __152__ seasons. The earliest season in the dataset is __1871__, and the latest season is __2022__.

Each row in the dataset represents a single player's performance for a given season, with a variety of metrics reflecting different aspects of their performance. These are divided into the following categories:
- __Identifiers:__ These include various IDs for players, teams, and leagues, as well as the year and player's stint with the team in that season.
- __Basic Batting and Fielding Statistics:__ These include familiar metrics such as games played, at bats, runs, hits, and home runs, among others, for both players and teams.
- __Advanced Batting and Fielding Statistics:__ These include more advanced metrics such as wins above replacement (WAR), wins above average (WAA), and runs above average (RAA), among others, for both players and teams.
- __Personal Information:__ These include the player's name, birth and death information, weight, height, and handedness, among others.
- __Team Information:__ These include the team's name, division, and league, as well as their record, attendance, and park factors, among others.
- __Salary Information:__ These include the player's salary.

This dataset provides a comprehensive view of baseball performance, allowing us to examine the game from both individual player and team perspectives.


## 2. Data Cleaning <a id='data_cleaning'></a>

### 2.1 Setting a Time Span <a id='time_span'></a>

The dataset contains data from 1871 to 2022. As part of our data cleaning process, we limit the dataset to the seasons from 1980 through 2022.

There were several reasons behind this decision:
- __Statistical Consistency:__ The game of baseball has evolved significantly over the years, with changes in rules, equipment, and player training and conditioning. By focusing on the past four decades, we ensure a higher degree of consistency in the playing conditions, thus making our statistical analysis and predictions more reliable.
- __Impact of Free Agency:__ The introduction of free agency in 1976 has had a significant impact on the game. By starting our analysis from 1980, we focus on an era when player movement between teams became more common, which adds an interesting dynamic to player performance and team composition.
- __Data Availability:__ The data from 1980 to the present is more complete and accurate, which will help us avoid potential issues with missing or incorrect data.
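
The data availability argument can be checked by comparing the share of null values per column before and after the cut-off. This is a minimal sketch on a toy frame with made-up values. The `era` labels are our own naming, and with the real data `raw_df` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_df (hypothetical values)
toy = pd.DataFrame({
    'yearID': [1900, 1950, 1985, 2010],
    'salary': [np.nan, np.nan, 500000.0, np.nan],
    'IBB': [np.nan, np.nan, 3.0, 5.0],
})

# Label each row by era, then compute the fraction of nulls per column and era
era = np.where(toy['yearID'] >= 1980, '1980 onward', 'before 1980')
null_share = toy.drop(columns='yearID').isnull().groupby(era).mean()
print(null_share)
```

A clearly lower null share in the `1980 onward` row would support the claim. If the two rows looked similar, the cut-off would have to rest on the other two reasons.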

In [109]:
# Setting a time span starting from 1980
raw_df_1980 = raw_df[raw_df['yearID'] >= 1980].copy()

# Save the new dataset to a csv file
raw_df_1980.to_csv('Raw datasets/batting_war_people_teams_1980.csv', index=False)

### 2.2 Dropping columns (features) that are not relevant to the project <a id='drop_cols'></a>

#### Personal Columns

Some columns are not relevant to our project and will be dropped. Most of them hold personal information about the player, such as birth date, birthplace, and death date.

Personal columns are not directly related to the project's analysis of batting performance. Removing them eliminates unnecessary personal information and narrows the dataset to the pertinent variables.


Before dropping anything, we record the total number of null values in the dataset. Comparing it with the total after the columns are dropped shows how much of the missing data those columns accounted for.
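
A before-and-after null count could be sketched as follows. The `total_nulls` helper and the toy frame are hypothetical and only illustrate the bookkeeping. With the real data, `raw_df_1980` would take the place of `toy_df`.

```python
import numpy as np
import pandas as pd

def total_nulls(dataframe):
    '''Return the total number of null cells in a DataFrame.'''
    return int(dataframe.isnull().sum().sum())

# Toy frame standing in for raw_df_1980 (hypothetical values)
toy_df = pd.DataFrame({
    'playerID': ['a', 'b', 'c'],
    'deathYear': [np.nan, np.nan, 2001.0],
    'HR': [10, np.nan, 3],
})

nulls_before = total_nulls(toy_df)           # count before dropping anything
toy_df = toy_df.drop(columns=['deathYear'])  # drop a mostly-null personal column
nulls_after = total_nulls(toy_df)            # count after the drop

print('Null values before: ', nulls_before)
print('Null values after: ', nulls_after)
print('Null values removed: ', nulls_before - nulls_after)
```

Storing the initial count in a variable avoids having to hard-code it later.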

In [110]:
# List of personal columns that are not relevant
personal_columns = ['nameFirst', 'nameLast', 'nameGiven', 'birthYear', 'birthDay','birthState', 'birthCity', 'birthMonth', 'deathYear', 'deathMonth', 'deathDay', 'deathCountry', 'deathState', 'deathCity']

# Drop personal columns
raw_df_1980.drop(personal_columns, axis=1, inplace=True)


#### Pitching Columns

Because our project is batting-oriented, we drop the pitching-related team columns. Removing them lets us concentrate on the batting statistics, in line with the project's objective, and keeps the exploration of batting performance concise.

In [111]:
# Pitching columns
pitching_columns = ['ERA', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'SV', 'CG', 'SHO']

# Drop pitching columns
raw_df_1980.drop(pitching_columns, axis=1, inplace=True)

In [112]:
raw_df_1980.shape

(54805, 114)

In [113]:
raw_df_1980.isnull().sum().sum()

206066

#### Summary of remaining columns

In [114]:
def columns_dropped():
    '''Print the number of columns and null values dropped so far, as a reference and sanity check.'''
    # Null count hard-coded from an earlier run, presumably taken before any columns were dropped
    initial_nulls = 568294
    initial_columns = raw_df.shape[1]
    columns_remaining = raw_df_1980.shape[1]
    nulls_remaining = raw_df_1980.isnull().sum().sum()

    print('Number of columns dropped: ', initial_columns - columns_remaining)
    print('Number of null values dropped: ', initial_nulls - nulls_remaining)
    print('Number of columns remaining: ', columns_remaining)
    print('Number of null values remaining: ', nulls_remaining)


In [115]:
# Number of columns dropped
columns_dropped()

Number of columns dropped:  24
Number of null values dropped:  362228
Number of columns remaining:  114
Number of null values remaining:  206066


In [116]:
# Columns
for column in raw_df_1980.columns:
    print(column)

playerID
yearID
stint
teamID
lgID_x
G_x
AB_x
R_x
H_x
2B_x
3B_x
HR_x
RBI
SB_x
CS_x
BB_x
SO_x
IBB
HBP_x
SH
SF_x
GIDP
name_common
age
mlb_ID
player_ID
year_ID
team_ID
stint_ID
lg_ID
PA
G_y
Inn
runs_bat
runs_br
runs_dp
runs_field
runs_infield
runs_outfield
runs_catcher
runs_defense
runs_position
runs_position_p
runs_replacement
runs_above_rep
runs_above_avg
runs_above_avg_off
runs_above_avg_def
WAA
WAA_off
WAA_def
WAR
WAR_def
WAR_off
WAR_rep
salary
pitcher
teamRpG
oppRpG
oppRpPA_rep
oppRpG_rep
pyth_exponent
pyth_exponent_rep
waa_win_perc
waa_win_perc_off
waa_win_perc_def
waa_win_perc_rep
OPS_plus
TOB_lg
TB_lg
birthCountry
weight
height
bats
throws
debut
finalGame
retroID
bbrefID
lgID_y
franchID
divID
Rank
G
Ghome
W
L
DivWin
LgWin
WSWin
R_y
AB_y
H_y
2B_y
3B_y
HR_y
BB_y
SO_y
SB_y
CS_y
HBP_y
SF_y
RA
ER
DP
FP
name
park
attendance
BPF
PPF
teamIDBR
teamIDlahman45
teamIDretro


In [117]:
pd.set_option('display.max_rows', None)
raw_df.isnull().sum().sort_values(ascending=False)

salary                65229
deathState            63767
deathCity             63524
deathCountry          63493
deathDay              63471
deathMonth            63469
deathYear             63468
DivWin                48282
SF_y                  48184
divID                 47252
HBP_y                 38373
runs_infield          37128
runs_outfield         37128
runs_catcher          37128
Inn                   37128
IBB                   36651
SF_x                  36104
GIDP                  25442
CS_y                  23605
CS_x                  23542
OPS_plus              20145
WAR                   11793
WAR_off               11793
WAR_rep               11793
oppRpG_rep            10661
pyth_exponent         10661
WAA_def               10661
WAR_def               10661
teamRpG               10661
WAA_off               10661
pyth_exponent_rep     10661
waa_win_perc          10661
waa_win_perc_off      10661
waa_win_perc_def      10661
waa_win_perc_rep      10661
WAA                 

### 2.3 Dropping repeated or similar columns <a id='drop_repeated_cols'></a>

Some columns provide the same information as other columns under different names. We will go through each group of columns we believe are repeated and decide which ones to drop.

In [118]:
def compare_columns(dataframe, column1, column2, column3=None):
    '''
    Compare two or three columns in a DataFrame and print the number of equal and different values.
    Also print the count of null values for each column to help identify columns that can be dropped.
    If the number of different values is close to or equals the number of null values, the differences
    observed are primarily due to null values (a null never compares as equal).
    :param dataframe: pandas DataFrame
    :param column1: string
    :param column2: string
    :param column3: string (optional)
    '''
    columns = [column1, column2] if column3 is None else [column1, column2, column3]

    # A row counts as equal only if every other column matches column1
    comparison = dataframe[column1] == dataframe[column2]
    if column3 is not None:
        comparison &= dataframe[column1] == dataframe[column3]

    # Summing booleans avoids the KeyError that value_counts()[False] raises when no rows differ
    print('Number of different values: ', (~comparison).sum())
    print('Number of equal values: ', comparison.sum())

    # Null values for comparison
    for column in columns:
        print(f'Number of null values for {column}: ', dataframe[column].isnull().sum())


__`yearID` and `year_ID`.__

In [119]:
# yearID and year_ID comparison
compare_columns(raw_df_1980, 'yearID', 'year_ID')

Number of different values:  695
Number of equal values:  54110
Number of null values for yearID:  0
Number of null values for year_ID:  695


`year_ID` has 695 null values, while `yearID` has none. Because the number of different values equals the number of null values, the differences observed are due to null values. We will drop `year_ID` and keep `yearID`.
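
The reasoning relies on the fact that an element-wise `==` in pandas never treats a null as equal to anything. The sketch below shows this on a toy frame with made-up values and separates null-driven differences from genuine mismatches. The `true_mismatch` name is our own.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row has a null in year_ID (hypothetical values)
toy = pd.DataFrame({
    'yearID': [1980, 1981, 1982],
    'year_ID': [1980.0, np.nan, 1982.0],
})

equal = toy['yearID'] == toy['year_ID']

# Rows that differ even though both values are present
true_mismatch = (~equal) & toy['yearID'].notna() & toy['year_ID'].notna()

print('Different values: ', (~equal).sum())
print('Different values not explained by nulls: ', true_mismatch.sum())
```

When the second count is zero, every difference comes from a null and the column with nulls carries no extra information.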

In [120]:
# Drop year_ID column
raw_df_1980.drop('year_ID', axis=1, inplace=True)

In [121]:
# Shape
raw_df_1980.shape

(54805, 113)

__`playerID` and `player_ID`.__

In [122]:
# playerID and player_ID comparison
compare_columns(raw_df_1980, 'playerID', 'player_ID')

Number of different values:  695
Number of equal values:  54110
Number of null values for playerID:  0
Number of null values for player_ID:  695


`player_ID` has 695 null values, while `playerID` has none. Because the number of different values equals the number of null values, the differences observed are due to null values. We will drop `player_ID` and keep `playerID`.

In [123]:
# Drop player_ID column
raw_df_1980.drop('player_ID', axis=1, inplace=True)

In [124]:
# Shape
raw_df_1980.shape

(54805, 112)

__`lgID_x`, `lgID_y` and `lg_ID`.__

In [125]:
# lgID_x, lgID_y and lg_ID comparison
compare_columns(raw_df_1980, 'lgID_x', 'lgID_y', 'lg_ID')


Number of different values:  697
Number of equal values:  54108
Number of null values for lgID_x:  0
Number of null values for lgID_y:  0
Number of null values for lg_ID:  695


As in the previous cases, the differences among `lgID_x`, `lgID_y` and `lg_ID` come almost entirely from the null values in `lg_ID`. The two remaining mismatches (697 different values - 695 null values) are negligible for a dataset of this size.

We will drop `lg_ID` and `lgID_y` and keep `lgID_x`.
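
If we wanted to inspect the two rows whose values genuinely disagree, one option is to count the distinct non-null values per row. This is a minimal sketch on toy data with made-up values. The `conflicts` name is our own, and with the real data `raw_df_1980` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 differs only because of a null, row 3 is a genuine conflict
toy = pd.DataFrame({
    'lgID_x': ['AL', 'NL', 'AL', 'NL'],
    'lgID_y': ['AL', 'NL', 'AL', 'AL'],
    'lg_ID':  ['AL', np.nan, 'AL', 'NL'],
})
league_columns = ['lgID_x', 'lgID_y', 'lg_ID']

# nunique ignores nulls by default, so more than one distinct value means a real disagreement
distinct_per_row = toy[league_columns].nunique(axis=1)
conflicts = toy[distinct_per_row > 1]
print(conflicts)
```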

In [126]:
# Drop lgID_y and lg_ID columns
raw_df_1980.drop(['lgID_y', 'lg_ID'], axis=1, inplace=True)

In [127]:
# Shape
raw_df_1980.shape

(54805, 110)

__`stint` and `stint_ID`.__

In [128]:
# stint and stint_ID comparison
compare_columns(raw_df_1980, 'stint', 'stint_ID')

Number of different values:  695
Number of equal values:  54110
Number of null values for stint:  0
Number of null values for stint_ID:  695


`stint_ID` has 695 null values, while `stint` has none. Because the number of different values equals the number of null values, the differences observed are due to null values. We will drop `stint_ID` and keep `stint`.

In [129]:
# Drop stint_ID column
raw_df_1980.drop('stint_ID', axis=1, inplace=True)

##### Team identifiers

There are several columns that serve as identifiers for the team:
- `teamID`
- `team_ID`
- `teamIDBR`
- `teamIDlahman45`
- `teamIDretro`
- `franchID`
- `name`

Let's look at a dataframe including these columns:

In [130]:
# Team identifiers
team_ids_columns = ['teamID', 'team_ID', 'teamIDBR', 'teamIDlahman45', 'teamIDretro', 'franchID', 'name']

raw_df_1980[team_ids_columns].value_counts().sort_index()

teamID  team_ID  teamIDBR  teamIDlahman45  teamIDretro  franchID  name                         
ANA     ANA      ANA       ANA             ANA          ANA       Anaheim Angels                    327
ARI     ARI      ARI       ARI             ARI          ARI       Arizona Diamondbacks             1173
ATL     ATL      ATL       ATL             ATL          ATL       Atlanta Braves                   1847
        LAA      ATL       ATL             ATL          ATL       Atlanta Braves                      1
BAL     BAL      BAL       BAL             BAL          BAL       Baltimore Orioles                1901
BOS     BOS      BOS       BOS             BOS          BOS       Boston Red Sox                   1912
CAL     CAL      CAL       CAL             CAL          ANA       California Angels                 687
CHA     CHW      CHW       CHA             CHA          CHW       Chicago White Sox                1778
CHN     CHC      CHC       CHN             CHN          CHC       Chicag


We will rely on the identifiers provided by Baseball-Reference (https://www.baseball-reference.com/about/team_IDs.shtml), specifically `teamIDBR` and `franchID`. Baseball-Reference is one of our primary data sources, and utilizing these identifiers will ensure consistency and accuracy in our analysis. We will drop the other team identifiers except for `name`, which will be useful for visualizations and analysis.
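
Before dropping the other identifiers, it can be worth confirming that each `teamIDBR` maps to exactly one `franchID`. The sketch below does this on a toy frame whose rows follow the Angels entries shown in the output above, plus one Boston row. The variable names are our own.

```python
import pandas as pd

# Toy frame based on rows visible in the value_counts output
toy = pd.DataFrame({
    'teamIDBR': ['ANA', 'CAL', 'CAL', 'BOS'],
    'franchID': ['ANA', 'ANA', 'ANA', 'BOS'],
    'name': ['Anaheim Angels', 'California Angels', 'California Angels', 'Boston Red Sox'],
})

# Each team code should belong to a single franchise
franchises_per_team = toy.groupby('teamIDBR')['franchID'].nunique()
ambiguous = franchises_per_team[franchises_per_team > 1]
print('Team codes mapped to more than one franchise: ', ambiguous.index.tolist())

# A franchise, on the other hand, may have used several team codes over time
teams_per_franchise = toy.groupby('franchID')['teamIDBR'].unique()
print(teams_per_franchise)
```

This keeps `teamIDBR` for the season-level identity (for example `CAL` versus `ANA`) and `franchID` for tracking the same club across renames.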

In [131]:
# Drop teamID, team_ID, teamIDlahman45, teamIDretro
raw_df_1980.drop(['teamID', 'team_ID', 'teamIDlahman45', 'teamIDretro'], axis=1, inplace=True)

##### Other player identifiers

We already compared `playerID` and `player_ID` and decided to keep `playerID`, but there are other player identifiers that we need to compare:
- `mlb_ID`
- `retroID`
- `bbrefID`

We will follow the same process that we used for the team identifiers.


In [132]:
# Player identifiers
player_ids_columns = ['playerID', 'mlb_ID', 'retroID', 'bbrefID']

raw_df_1980[player_ids_columns].value_counts().head(10)

playerID   mlb_ID    retroID   bbrefID  
henderi01  115749.0  hendr001  henderi01    28
moyerja01  119469.0  moyej001  moyerja01    27
baineha01  110456.0  bainh001  baineha01    27
weathda01  124000.0  weatd001  weathda01    26
mulhote01  119488.0  mulht001  mulhote01    26
francju01  212297.0  franj002  francju01    25
sierrru01  122218.0  sierr001  sierrru01    25
maddugr01  118120.0  maddg002  maddugr01    25
oroscje01  120051.0  orosj001  oroscje01    25
thomeji01  123272.0  thomj002  thomeji01    25
dtype: int64

As with the team identifiers, we lean toward the identifier provided by Baseball-Reference, `bbrefID`. From the previous table, `playerID` appears to be derived from `bbrefID`.

Let's take a deeper look at them to decide which one to keep:

In [133]:
# Compare playerID, bbrefID
compare_columns(raw_df_1980, 'playerID', 'bbrefID')


Number of different values:  671
Number of equal values:  54134
Number of null values for playerID:  0
Number of null values for bbrefID:  0


In [134]:
# Show rows where playerID and bbrefID are not equal to compare
raw_df_1980[raw_df_1980['playerID'] != raw_df_1980['bbrefID']][player_ids_columns].sample(20)

Unnamed: 0,playerID,mlb_ID,retroID,bbrefID
103421,harriwi02,,harrw002,harriwi10
104438,beeksja01,,beekj001,beeksja02
87053,ryanbj01,,ryanb001,ryanb.01
66213,stclara01,,stclr001,st.clra01
80116,morriji02,,morrj004,morrija03
83143,surhobj01,,surhb001,surhob.01
60924,omallto01,,omalt001,o'malto01
64124,stclara01,,stclr001,st.clra01
62897,oconnja02,,oconj001,o'conja02
73901,santafp01,,santf001,santaf.01


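The sample shows two kinds of differences between `playerID` and `bbrefID`: punctuation in `bbrefID` (for example `o'malto01` and `st.clra01`) and differing letters or numeric suffixes (for example `harriwi10` versus `harriwi02`). `playerID` contains no punctuation, so it is cleaner and easier to read.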

We will keep `playerID` and drop `bbrefID`, along with `mlb_ID` and `retroID`.
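
To quantify how many of the mismatches involve punctuation, we could flag the `bbrefID` values that contain anything other than lowercase letters and digits. This is a sketch on a toy frame built from IDs shown in the sample and in the earlier table. The variable names are our own.

```python
import pandas as pd

# Toy frame using IDs that appear in the outputs above
toy = pd.DataFrame({
    'playerID': ['omallto01', 'stclara01', 'harriwi02', 'henderi01'],
    'bbrefID':  ["o'malto01", 'st.clra01', 'harriwi10', 'henderi01'],
})

differs = toy['playerID'] != toy['bbrefID']

# True where bbrefID contains any character other than a-z or 0-9
has_punctuation = toy['bbrefID'].str.contains(r'[^a-z0-9]', regex=True)

punctuation_related = differs & has_punctuation
other_differences = differs & ~has_punctuation

print('Mismatches involving punctuation: ', punctuation_related.sum())
print('Other mismatches: ', other_differences.sum())
```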

In [135]:
# Drop bbrefID, mlb_ID, retroID
raw_df_1980.drop(['bbrefID', 'mlb_ID', 'retroID'], axis=1, inplace=True)

In [136]:
# Shape
raw_df_1980.shape

(54805, 102)

In [137]:
# Columns dropped
columns_dropped()

Number of columns dropped:  36
Number of null values dropped:  366398
Number of columns remaining:  102
Number of null values remaining:  201896


In [138]:
# Null values
raw_df_1980.isnull().sum().sort_values(ascending=False)

salary                23972
OPS_plus              16143
WAR                    9223
WAR_off                9223
WAR_rep                9223
pyth_exponent_rep      8991
WAR_def                8991
WAA_def                8991
WAA_off                8991
WAA                    8991
oppRpG_rep             8991
pyth_exponent          8991
teamRpG                8991
waa_win_perc_off       8991
waa_win_perc_def       8991
waa_win_perc_rep       8991
waa_win_perc           8991
PA                     1484
runs_above_avg_def     1484
runs_above_avg_off     1484
runs_above_avg         1484
runs_above_rep         1484
runs_replacement       1484
runs_position          1484
WSWin                  1030
LgWin                  1030
DivWin                 1030
pitcher                 927
runs_defense            695
oppRpPA_rep             695
TOB_lg                  695
TB_lg                   695
oppRpG                  695
runs_position_p         695
runs_catcher            695
runs_br             

### 2.4 Filtering and removing players with no relevance to the project (Removing rows) <a id='drop_rows'></a>

#### Pitchers

As mentioned earlier, this project focuses on batting metrics, so we remove pitchers from the dataset.

In [139]:
# Keep only rows where pitcher != 'Y'
# .copy() avoids a SettingWithCopyWarning on the in-place drop that follows
raw_df_1980 = raw_df_1980[raw_df_1980['pitcher'] != 'Y'].copy()

In [140]:
# Drop pitcher column
raw_df_1980.drop('pitcher', axis=1, inplace=True)

# Shape
raw_df_1980.shape

(28295, 101)

#### Plate appearances

We are going to remove player seasons with fewer than 100 plate appearances (`PA`).

Removing rows with few plate appearances focuses the analysis on players with a substantial presence at the plate, so that our findings rest on a sufficient amount of data per player. Rate statistics computed from a handful of plate appearances vary widely from one player to the next, which can bias summary measures and add noise to the models.


In [141]:
# Drop rows where PA < 100
raw_df_1980 = raw_df_1980[raw_df_1980['PA'] >= 100]
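To illustrate why low-`PA` rows are noisy, here is a minimal sketch that treats each plate appearance as an independent trial and computes the binomial standard error of an observed rate. The function name `rate_standard_error` and the rate `p = 0.320` are assumptions chosen only for illustration; they do not come from the dataset.

```python
import math


def rate_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an observed rate after n independent trials."""
    return math.sqrt(p * (1 - p) / n)


p = 0.320  # hypothetical true on-base rate, for illustration only
for pa in (25, 100, 400, 600):
    se = rate_standard_error(p, pa)
    print(f"PA={pa:>3}  standard error={se:.3f}")
```

Because the standard error shrinks with the square root of `PA`, quadrupling the number of plate appearances only halves the noise. The cutoff of 100 is therefore a judgment call rather than a sharp boundary.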

#### Shortened seasons

To keep seasons comparable, we drop 1981, 1994, 1995, and 2020 from the dataset. These seasons were shortened by player strikes (1981, 1994, 1995) and by the COVID-19 pandemic (2020). Statistics accumulated over fewer games are not directly comparable with those from full-length seasons, so removing them avoids a potential source of bias.

To confirm which seasons were shortened, we check the average number of games (`G`) in each season. We expect the shortened seasons to show clearly lower values.


In [142]:
# Average games per season
raw_df_1980.groupby('yearID')['G'].mean()

yearID
1980    161.905512
1981    107.257053
1982    162.078804
1983    162.216000
1984    161.926893
1985    161.763158
1986    161.750636
1987    161.916010
1988    161.537234
1989    161.966495
1990    161.918635
1991    161.855330
1992    162.000000
1993    162.066667
1994    114.332418
1995    144.059553
1996    161.912114
1997    161.857143
1998    162.139013
1999    161.871332
2000    161.924406
2001    161.943567
2002    161.734694
2003    162.002222
2004    161.871332
2005    162.062937
2006    161.933185
2007    162.062791
2008    161.869469
2009    161.989059
2010    162.000000
2011    161.935897
2012    162.000000
2013    162.061002
2014    162.000000
2015    161.934498
2016    161.869369
2017    162.000000
2018    162.058190
2019    161.923077
2020     59.868852
2021    161.940329
2022    162.000000
Name: G, dtype: float64

As the table above shows, the average number of games in 1981, 1994, 1995, and 2020 is clearly lower than in the other seasons, which all average close to 162. We will drop these four shortened seasons.
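The shortened seasons can also be flagged programmatically instead of being read off the table. The sketch below uses a subset of the per-season averages from the table above, rounded to one decimal. The cutoff `FULL_SEASON_CUTOFF = 155` is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Rounded per-season averages copied from the table above (subset only)
games_per_season = pd.Series(
    {1980: 161.9, 1981: 107.3, 1993: 162.1, 1994: 114.3,
     1995: 144.1, 1996: 161.9, 2019: 161.9, 2020: 59.9, 2021: 161.9},
    name="G",
)

# Hypothetical cutoff: flag seasons averaging fewer than 155 games
FULL_SEASON_CUTOFF = 155
shortened = games_per_season[games_per_season < FULL_SEASON_CUTOFF].index.tolist()
print(shortened)  # [1981, 1994, 1995, 2020]
```

In the notebook itself, `games_per_season` would be the result of the `groupby('yearID')['G'].mean()` call rather than a hand-typed series.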

In [143]:
# Drop the shortened 1981, 1994, 1995 and 2020 seasons
shortened_seasons = [1981, 1994, 1995, 2020]
raw_df_1980 = raw_df_1980[~raw_df_1980['yearID'].isin(shortened_seasons)]

In [144]:
# Shape
raw_df_1980.shape

(16742, 101)

### 2.5 Dealing with Null Values <a id='null_values'></a>

In [145]:
# Null values
raw_df_1980.isnull().sum().sort_values(ascending=False)

salary                3941
playerID                 0
weight                   0
Ghome                    0
G                        0
Rank                     0
divID                    0
franchID                 0
finalGame                0
debut                    0
throws                   0
bats                     0
height                   0
birthCountry             0
L                        0
TB_lg                    0
TOB_lg                   0
OPS_plus                 0
waa_win_perc_rep         0
waa_win_perc_def         0
waa_win_perc_off         0
waa_win_perc             0
pyth_exponent_rep        0
pyth_exponent            0
W                        0
DivWin                   0
oppRpPA_rep              0
HBP_y                    0
PPF                      0
BPF                      0
attendance               0
park                     0
name                     0
FP                       0
DP                       0
ER                       0
RA                       0
S

As the table above shows, the previous steps removed almost all of the null values in the dataset.

The only null values left are in the `salary` column. For now, we are going to create two tables: one that drops the `salary` column entirely, and one that keeps `salary` but drops the rows where it is null. We will decide later which one to keep.

In [146]:
# Save a dataset without salary
clean_batting = raw_df_1980.copy()
clean_batting.drop('salary', axis=1, inplace=True)
clean_batting.to_csv('clean_batting.csv', index=False)

# Copy a dataset with salary
clean_batting_salary = raw_df_1980.copy()
# Drop rows with null values in salary
clean_batting_salary.dropna(subset=['salary'], inplace=True)
# Save to csv
clean_batting_salary.to_csv('clean_batting_salary.csv', index=False)

In [147]:
clean_batting_salary.shape

(12801, 101)
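Before choosing between the two tables, it may help to know whether the missing salaries are spread evenly or concentrated in particular seasons. The following is a sketch on a toy frame with hypothetical values; in the notebook the same expression would be applied to `raw_df_1980`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the real notebook would use raw_df_1980
toy = pd.DataFrame({
    "yearID": [1985, 1985, 2016, 2016, 2021, 2021],
    "salary": [300000.0, 450000.0, 5000000.0, np.nan, np.nan, np.nan],
})

# Share of rows with a missing salary in each season
missing_share = toy["salary"].isna().groupby(toy["yearID"]).mean()
print(missing_share)
```

If the missing values cluster in a few seasons, dropping those rows would remove whole years from the salary table, which is worth knowing before modeling with it.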

In [148]:
clean_batting_salary.isnull().sum().sort_values(ascending=False)

playerID              0
height                0
W                     0
Ghome                 0
G                     0
Rank                  0
divID                 0
franchID              0
finalGame             0
debut                 0
throws                0
bats                  0
weight                0
oppRpPA_rep           0
birthCountry          0
TB_lg                 0
TOB_lg                0
OPS_plus              0
waa_win_perc_rep      0
waa_win_perc_def      0
waa_win_perc_off      0
waa_win_perc          0
pyth_exponent_rep     0
pyth_exponent         0
L                     0
DivWin                0
LgWin                 0
WSWin                 0
PPF                   0
BPF                   0
attendance            0
park                  0
name                  0
FP                    0
DP                    0
ER                    0
RA                    0
SF_y                  0
HBP_y                 0
CS_y                  0
SB_y                  0
SO_y            

In [149]:
clean_batting.shape

(16742, 100)

In [150]:
clean_batting.isnull().sum().sort_values(ascending=False)

playerID              0
height                0
W                     0
Ghome                 0
G                     0
Rank                  0
divID                 0
franchID              0
finalGame             0
debut                 0
throws                0
bats                  0
weight                0
yearID                0
birthCountry          0
TB_lg                 0
TOB_lg                0
OPS_plus              0
waa_win_perc_rep      0
waa_win_perc_def      0
waa_win_perc_off      0
waa_win_perc          0
pyth_exponent_rep     0
pyth_exponent         0
L                     0
DivWin                0
LgWin                 0
WSWin                 0
PPF                   0
BPF                   0
attendance            0
park                  0
name                  0
FP                    0
DP                    0
ER                    0
RA                    0
SF_y                  0
HBP_y                 0
CS_y                  0
SB_y                  0
SO_y            

## 3. Next Steps <a id='next_steps'></a>

Now that we have a clean dataset, we are ready to move on to the next steps of our project:

- __Exploratory Data Analysis (EDA):__ Perform an in-depth exploratory data analysis to uncover insights, patterns, and relationships within the preprocessed data. Utilize visualizations, statistical analysis, and other techniques to understand the distribution, correlations, and trends present in the data. This stage will provide valuable insights that can guide further analysis and modeling decisions.

- __Feature Engineering:__ Engage in feature engineering to enhance the dataset for modeling purposes. This includes selecting relevant features, transforming existing features, and potentially creating new features based on domain knowledge and insights gained from the EDA. Iteratively refine the feature set to improve model performance and align it with the project's objectives.

Please note that the current state of the preprocessed data is not the final form. More features can be added or removed during the feature engineering phase to further optimize our models and increase their predictive power.

