# Introduction to Phillies Batting Dataset (2015-2023)

## Brief Overview of Dataset

This dataset covers 31 variables and 460 rows of batting data spanning the 2015 to 2023 seasons. It includes pitchers as well, since National League pitchers batted regularly through the 2021 season (the 2020 season used a universal designated hitter, which became permanent in 2022).

## Interest

This specific dataset was of interest to me because it includes pitchers' hitting data as well. I thought that was unique and that the data would be interesting to manipulate. I also play baseball and have my whole life, so I enjoy looking through baseball data.

## Source

I found this dataset on Kaggle. The link is below:

https://www.kaggle.com/datasets/mattop/philadelphia-phillies-batting-and-pitching-data

Note: The original dataset had about 5000 rows of data, but I shortened it to 460 rows to make it easier to work with.

## Research Questions

1. How does pitchers' hitting data affect the overall dataset?

2. Are there trends in the overall stats that show what makes the team win or lose more?

## Codebook

| Variable Name | Description | Data Type | Possible Values / Notes |
|---|---|---|---|
| Name | Player's full name | String | Ex: "Bryce Harper" |
| Age | Player's age during the season | Integer | Ex: 27 |
| Games | Number of games played | Integer | ≥ 0 |
| Plate_Appearances | Number of times the player came to bat (includes walks, HBP, etc.) | Integer | ≥ 0 |
| At_Bats | Number of official at-bats (excludes walks, HBP, sacrifices) | Integer | ≥ 0 |
| Runs | Number of runs scored by the player | Integer | ≥ 0 |
| Hits | Total hits (singles + doubles + triples + home runs) | Integer | ≥ 0 |
| Doubles | Number of hits that were doubles | Integer | ≥ 0 |
| Triples | Number of hits that were triples | Integer | ≥ 0 |
| Home_Runs | Number of home runs | Integer | ≥ 0 |
| Runs_Batted_In | Total runs batted in (RBIs) | Integer | ≥ 0 |
| Stolen_Bases | Number of bases stolen successfully | Integer | ≥ 0 |
| Caught_Stealing | Number of times caught stealing a base | Integer | ≥ 0 |
| Base_On_Balls | Number of walks (BB) received | Integer | ≥ 0 |
| Strikeouts | Number of times the player struck out | Integer | ≥ 0 |
| Batting_Average | Hits divided by at-bats (AVG) | Float | 0.000 - 1.000 |
| On_Base_Percentage | OBP: how often a player reaches base | Float | 0.000 - 1.000 |
| Slugging_Percentage | SLG: total bases per at-bat | Float | 0.000 - 4.000 (typically < 1.000) |
| On_Base_Plus_Slugging_Percentage | OPS = OBP + SLG | Float | 0.000 - 5.000 (typically < 2.000) |
| On_Base_Plus_Slugging_Percentage_Plus | OPS+: league- and park-adjusted OPS | Integer | 100 = league average, > 100 = better; can be negative (the minimum in this dataset is -100) |
| Total_Bases | Total number of bases from hits (1×1B + 2×2B + 3×3B + 4×HR) | Integer | ≥ 0 |
| Double_Plays_Grounded_Into | Number of double plays the player grounded into (GIDP) | Integer | ≥ 0 |
| Times_Hit_By_Pitch | Number of times hit by a pitch (HBP) | Integer | ≥ 0 |
| Sacrifice_Hits | Number of successful sacrifice bunts (SH) | Integer | ≥ 0 |
| Sacrifice_Flies | Number of sacrifice flies (SF) | Integer | ≥ 0 |
| Intentional_Bases_on_Balls | Number of intentional walks (IBB) received | Integer | ≥ 0 |
| Dominant_Hand | Player's dominant throwing hand | Categorical (String) | "Right", "Left", "Ambidextrous" |
| Switch_Hitter | Whether the player is a switch hitter | Boolean or Categorical | "Yes"/"No" or True/False |
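
The codebook definitions can be spot-checked against the data. The sketch below is only an illustration: it rebuilds three of the 2023 rows shown in the `head()` output further down by hand (rather than reading the CSV) and checks that Batting_Average matches Hits / At_Bats and that OPS is close to OBP + SLG. The tolerance of 0.0015 is my own assumption, chosen to allow for the published rates being rounded to three decimals.

```python
import pandas as pd

# Three rows copied by hand from the head() output of this notebook
sample = pd.DataFrame({
    "Name": ["J.T. Realmuto", "Kody Clemens", "Bryson Stott"],
    "At_Bats": [489, 139, 585],
    "Hits": [123, 32, 164],
    "Batting_Average": [0.252, 0.23, 0.28],
    "On_Base_Percentage": [0.31, 0.277, 0.329],
    "Slugging_Percentage": [0.452, 0.367, 0.419],
    "On_Base_Plus_Slugging_Percentage": [0.762, 0.644, 0.747],
})

# AVG should equal Hits / At_Bats once rounded to three decimals
avg_diff = (sample["Hits"] / sample["At_Bats"]).round(3) - sample["Batting_Average"]
avg_ok = avg_diff.abs() < 1e-9

# OPS should equal OBP + SLG up to rounding (tolerance is an assumption)
ops_diff = (
    sample["On_Base_Percentage"]
    + sample["Slugging_Percentage"]
    - sample["On_Base_Plus_Slugging_Percentage"]
)
ops_ok = ops_diff.abs() <= 0.0015

print(sample.assign(avg_ok=avg_ok, ops_ok=ops_ok)[["Name", "avg_ok", "ops_ok"]])
```
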

## Reading in Data

In [70]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

In [71]:
phillies_data = pd.read_csv('PhilliesBattingData2(2015-2023).csv')

## Exploration

In [72]:
phillies_data.head()

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
0,1,2023,C,J.T. Realmuto,32,135,540,489,70,123,28.0,5,20,63,16.0,5,35,138.0,0.252,0.31,0.452,0.762,106,221,10.0,9,0,5,3,Right,No
1,2,2023,1B,Kody Clemens,27,47,148,139,15,32,7.0,0,4,13,0.0,0,8,40.0,0.23,0.277,0.367,0.644,75,51,1.0,1,0,0,0,Left,No
2,3,2023,2B,Bryson Stott,25,151,640,585,78,164,32.0,2,15,62,31.0,3,39,100.0,0.28,0.329,0.419,0.747,104,245,10.0,7,1,8,1,Left,No
3,4,2023,SS,Trea Turner,30,155,691,639,102,170,35.0,5,26,76,30.0,0,45,150.0,0.266,0.32,0.459,0.778,111,293,12.0,6,0,1,2,Right,No
4,5,2023,3B,Alec Bohm,26,145,611,558,74,153,31.0,0,20,97,4.0,1,42,94.0,0.274,0.327,0.437,0.765,108,244,23.0,5,0,6,1,Right,No


In [73]:
phillies_data.shape

(460, 31)

In [74]:
phillies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460 entries, 0 to 459
Data columns (total 31 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Rank                                   460 non-null    int64  
 1   Year                                   460 non-null    int64  
 2   Position                               457 non-null    object 
 3   Name                                   460 non-null    object 
 4   Age                                    460 non-null    int64  
 5   Games                                  460 non-null    int64  
 6   Plate_Appearances                      460 non-null    int64  
 7   At_Bats                                460 non-null    int64  
 8   Runs                                   460 non-null    int64  
 9   Hits                                   460 non-null    int64  
 10  Doubles                                457 non-null    float64
 11  Triple

In [75]:
phillies_data.describe()

Unnamed: 0,Rank,Year,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls
count,460.0,460.0,460.0,460.0,460.0,460.0,460.0,460.0,457.0,460.0,460.0,460.0,455.0,460.0,460.0,457.0,460.0,460.0,455.0,460.0,460.0,460.0,455.0,460.0,460.0,460.0,460.0
mean,25.806522,2019.0,27.736957,39.771739,113.776087,101.876087,13.291304,25.106522,4.857768,0.56087,3.532609,12.73913,1.595604,0.502174,9.528261,25.899344,0.118307,0.158291,0.186393,0.34197,26.471739,41.81087,2.032967,1.095652,0.571739,0.678261,0.602174
std,14.886922,2.552285,3.5777,45.457947,187.854936,168.114534,23.712924,43.931264,8.898767,1.296296,7.326276,23.232401,4.206035,1.322668,18.792563,41.508649,0.121277,0.164371,0.199003,0.349152,68.696494,74.103128,4.029764,2.204072,1.397658,1.562345,1.991375
min,1.0,2015.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-100.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,13.0,2017.0,25.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,25.5,2019.0,27.0,23.0,16.0,14.0,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.091,0.144,0.125,0.286,0.0,2.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,2021.0,30.0,57.25,138.25,127.0,15.0,30.0,6.0,0.25,3.0,14.0,1.0,0.0,11.0,35.0,0.2355,0.30125,0.3745,0.69,85.0,46.0,2.0,1.0,0.0,0.0,0.0
max,56.0,2023.0,39.0,162.0,720.0,639.0,108.0,171.0,42.0,11.0,47.0,114.0,31.0,13.0,126.0,215.0,0.5,1.0,0.875,1.431,279.0,300.0,23.0,14.0,9.0,10.0,19.0


In [76]:
phillies_data['Age'].value_counts()

Age
26    53
27    52
24    50
25    47
30    39
28    38
29    32
31    31
23    28
32    25
22    16
33    16
34    11
36     8
35     6
37     5
21     1
38     1
39     1
Name: count, dtype: int64

In [77]:
phillies_data['Position'].value_counts()

Position
P     259
UT     26
C      25
CF     22
OF     21
LF     17
RF     16
1B     15
2B     13
SS     13
3B     12
IF     10
DH      6
CI      1
MI      1
Name: count, dtype: int64

## Screening

In [78]:
phillies_data.isnull().sum()

Rank                                     0
Year                                     0
Position                                 3
Name                                     0
Age                                      0
Games                                    0
Plate_Appearances                        0
At_Bats                                  0
Runs                                     0
Hits                                     0
Doubles                                  3
Triples                                  0
Home_Runs                                0
Runs_Batted_In                           0
Stolen_Bases                             5
Caught_Stealing                          0
Base_On_Balls                            0
Strikeouts                               3
Batting_Average                          0
On_Base_Percentage                       0
Slugging_Percentage                      5
On_Base_Plus_Slugging_Percentage         0
On_Base_Plus_Slugging_Percentage_Plus    0
Total_Bases

The Position, Doubles, Stolen_Bases, Strikeouts, Slugging_Percentage, Double_Plays_Grounded_Into, and Dominant_Hand columns have missing data.
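
With 31 columns, the full `isnull().sum()` listing is long. One way to see only the affected columns is to filter the counts, sketched here on a small toy frame (the `demo` frame and its values are made up for illustration and are not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for phillies_data; the values are illustrative only
demo = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Doubles": [28.0, np.nan, 7.0],
    "Triples": [5, 0, 2],
    "Strikeouts": [138.0, 40.0, np.nan],
})

# Count missing values per column, then keep only columns with at least one
missing_counts = demo.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0]
print(cols_with_missing)
```

Running the same two lines on `phillies_data` should list just the columns named above.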

In [79]:
phillies_data.duplicated().sum()

np.int64(5)

There are 5 duplicated rows.

In [80]:
phillies_data_copy = phillies_data.copy()

In [81]:
phillies_data_copy = np.transpose(phillies_data_copy)

In [82]:
phillies_data_copy.duplicated().sum()

np.int64(0)

There are no duplicate columns.

## Cleaning

In [83]:
phillies_data.shape

(460, 31)

In [84]:
phillies_data[phillies_data['Position'].isnull()]

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
225,23,2019,,Mitch Walding,26,2,2,2,0,0,0.0,0,0,0,0.0,0,0,2.0,0.0,0.0,0.0,0.0,-100,0,0.0,0,0,0,0,Left,No
226,24,2019,,Dylan Cozens,25,1,1,1,0,0,0.0,0,0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,-100,0,0.0,0,0,0,0,,No
227,25,2019,,Rob Brantly,29,1,1,1,0,0,0.0,0,0,0,0.0,0,0,1.0,0.0,0.0,0.0,0.0,-100,0,0.0,0,0,0,0,Left,No


In [85]:
phillies_data = phillies_data.dropna(subset = ['Position'])
phillies_data['Position'].isnull().sum()

np.int64(0)

Rows with a missing Position are dropped because there is no record of these players' positions, so we have no way of filling in the value.

In [86]:
phillies_data[phillies_data['Doubles'].isnull()]

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
106,6,2021,LF,Andrew McCutchen,34,144,574,482,78,107,,1,27,80,6.0,1,81,132.0,0.222,0.334,0.444,0.778,109,214,10.0,4,0,7,2,Right,No
209,7,2019,CF,Scott Kingery,25,126,500,458,64,118,,4,19,55,15.0,4,34,147.0,0.258,0.315,0.474,0.788,101,217,3.0,5,1,2,1,Right,No
365,6,2016,LF,Cody Asche,26,71,218,197,22,42,,0,4,18,3.0,1,18,54.0,0.213,0.284,0.35,0.635,69,69,1.0,2,0,1,0,Left,No


In [87]:
phillies_data = phillies_data.dropna(subset = ['Doubles'])
phillies_data['Doubles'].isnull().sum()

np.int64(0)

Rows with a missing Doubles value are removed rather than filled with the mean. A mean would not be accurate, because players differ in at-bats from season to season, may have been injured, and so on.
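
As a side note, the notebook drops these rows, but a missing Doubles value could in principle be recovered from the other counting stats. This is an alternative sketch, not what the analysis above does, and it assumes Hits, Triples, Home_Runs, and Total_Bases are correct for the row. Since TB = 1B + 2×2B + 3×3B + 4×HR and H = 1B + 2B + 3B + HR, subtracting gives 2B = TB − H − 2×3B − 3×HR.

```python
import numpy as np
import pandas as pd

# The three rows with missing Doubles, copied by hand from the output above
rows = pd.DataFrame({
    "Name": ["Andrew McCutchen", "Scott Kingery", "Cody Asche"],
    "Hits": [107, 118, 42],
    "Doubles": [np.nan, np.nan, np.nan],
    "Triples": [1, 4, 0],
    "Home_Runs": [27, 19, 4],
    "Total_Bases": [214, 217, 69],
})

# 2B = TB - H - 2*3B - 3*HR (derived from the Total_Bases definition)
derived_doubles = (
    rows["Total_Bases"]
    - rows["Hits"]
    - 2 * rows["Triples"]
    - 3 * rows["Home_Runs"]
)

# Fill only the missing entries; existing values would be left untouched
rows["Doubles"] = rows["Doubles"].fillna(derived_doubles)
print(rows[["Name", "Doubles"]])
```
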

In [88]:
phillies_data[phillies_data['Stolen_Bases'].isnull()]

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
8,9,2023,DH,Bryce Harper,30,126,546,457,84,134,29.0,1,21,72,,3,80,119.0,0.293,0.401,0.499,0.9,146,228,,5,0,4,8,Left,No
104,4,2021,SS,Didi Gregorius,31,103,408,368,35,77,16.0,2,13,54,,0,25,67.0,0.209,0.27,0.37,0.639,71,136,8.0,8,0,7,1,Left,No
127,27,2021,P,Kyle Gibson,33,12,28,27,3,4,0.0,0,1,2,,0,0,9.0,0.148,0.148,0.259,0.407,8,7,0.0,0,1,0,0,Right,No
210,8,2019,RF,Bryce Harper,26,157,682,573,98,149,36.0,1,35,114,,3,99,178.0,0.26,0.372,0.51,0.882,126,292,10.0,6,0,4,11,Left,No
370,11,2016,RF,Aaron Altherr,25,57,227,198,23,39,6.0,0,4,22,,2,23,,0.197,0.3,0.288,0.587,59,57,4.0,6,0,0,2,Right,No


In [89]:
phillies_data = phillies_data.dropna(subset = ['Stolen_Bases'])
phillies_data['Stolen_Bases'].isnull().sum()

np.int64(0)

Rows with a missing Stolen_Bases value are removed for the same reason as Doubles: a mean would not be consistent or meaningful from player to player.

In [90]:
phillies_data[phillies_data['Strikeouts'].isnull()]

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
110,10,2021,IF,Ronald Torreyes,28,111,344,318,30,77,10.0,1,7,41,2.0,1,19,,0.242,0.286,0.346,0.632,71,110,7.0,1,5,1,2,Right,No
212,10,2019,C,Andrew Knapp,27,74,160,136,12,29,9.0,0,2,8,0.0,0,18,,0.213,0.318,0.324,0.642,68,44,2.0,3,3,0,2,Right,Yes


In [91]:
phillies_data = phillies_data.dropna(subset = ['Strikeouts'])
phillies_data['Strikeouts'].isnull().sum()

np.int64(0)

In [92]:
missing = phillies_data[phillies_data['Slugging_Percentage'].isnull()]
missing

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
17,18,2023,CI,Drew Ellis,27,12,29,23,4,5,0.0,0,2,4,0.0,1,6,7.0,0.217,0.379,,0.858,134,11,0.0,0,0,0,0,Right,No
159,4,2020,SS,Didi Gregorius,30,60,237,215,34,61,10.0,2,10,40,3.0,2,15,28.0,0.284,0.339,,0.827,120,105,4.0,4,1,2,3,Left,No
268,10,2018,C,Andrew Knapp,26,84,215,187,19,37,6.0,2,4,15,1.0,0,24,75.0,0.198,0.294,,0.61,64,59,2.0,2,1,1,1,Right,Yes
366,7,2016,CF,Odúbel Herrera,24,159,656,583,87,167,21.0,6,15,49,25.0,7,63,134.0,0.286,0.361,,0.781,109,245,6.0,6,2,2,7,Left,No
397,37,2016,P,Colton Murray,26,22,1,1,0,0,0.0,0,0,0,0.0,0,0,1.0,0.0,0.0,,0.0,-100,0,0.0,0,0,0,0,Right,No


In [93]:
# SLG = total bases / at-bats, computed without assigning into the filtered copy
calc_slug_per = missing['Total_Bases'] / missing['At_Bats']
calc_slug_per


17     0.478261
159    0.488372
268    0.315508
366    0.420240
397    0.000000
dtype: float64

In [94]:
phillies_data.iloc[16, 20] = 0.478

In [95]:
phillies_data.iloc[154, 20] = 0.488

In [96]:
phillies_data.iloc[257, 20] = 0.316

In [97]:
phillies_data.iloc[354, 20] = 0.420

In [98]:
phillies_data.iloc[384, 20] = 0.0

In [99]:
phillies_data['Slugging_Percentage'].isnull().sum()

np.int64(0)

The missing slugging percentages can be calculated: filter the rows with missing data, then divide each player's total bases by his at-bats. After that, I used the .iloc method to index each player's row and the Slugging_Percentage column and enter the calculated value.
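
The `.iloc` positions above depend on how many rows were dropped earlier, so they would shift if the cleaning order changed. A label-based version avoids hard-coded positions. The following is a minimal sketch on a hand-built frame (two of the rows with missing values from the output above, plus one complete row); the `At_Bats > 0` guard is my own addition to avoid dividing by zero.

```python
import numpy as np
import pandas as pd

# Hand-copied rows; the index labels match the original row labels
demo = pd.DataFrame(
    {
        "Name": ["Drew Ellis", "Didi Gregorius", "J.T. Realmuto"],
        "At_Bats": [23, 215, 489],
        "Total_Bases": [11, 105, 221],
        "Slugging_Percentage": [np.nan, np.nan, 0.452],
    },
    index=[17, 159, 0],
)

# Select rows where SLG is missing and the division is defined
mask = demo["Slugging_Percentage"].isnull() & (demo["At_Bats"] > 0)

# Fill those rows in place with TB / AB, rounded like the other values
demo.loc[mask, "Slugging_Percentage"] = (
    demo.loc[mask, "Total_Bases"] / demo.loc[mask, "At_Bats"]
).round(3)
print(demo)
```

Because the assignment goes through `.loc` on the original frame, it also avoids the copy-of-a-slice warning that assigning into a filtered frame can trigger.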

In [100]:
phillies_data[phillies_data['Double_Plays_Grounded_Into'].isnull()]

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
154,54,2021,P,Seranthony Domínguez,26,1,0,0,0,0,0.0,0,0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,,0,0,0,0,Right,No
205,4,2019,SS,Jean Segura,29,144,618,576,79,161,37.0,4,12,60,10.0,2,30,73.0,0.28,0.323,0.42,0.743,91,242,,8,1,3,1,Right,No
320,12,2017,C,Andrew Knapp,25,56,204,171,26,44,8.0,1,3,13,1.0,0,31,56.0,0.257,0.368,0.368,0.736,96,63,,0,0,2,4,Right,Yes
420,11,2015,UT,Darin Ruf,28,106,297,268,30,63,12.0,0,12,39,1.0,0,21,69.0,0.235,0.3,0.414,0.714,96,111,,5,0,3,0,Right,No


In [101]:
phillies_data = phillies_data.dropna(subset = ['Double_Plays_Grounded_Into'])
phillies_data['Double_Plays_Grounded_Into'].isnull().sum()

np.int64(0)

Rows with a missing Double_Plays_Grounded_Into value are removed because we have no way of knowing how many double plays those players grounded into.

In [102]:
phillies_data[phillies_data['Dominant_Hand'].isnull()]

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
5,6,2023,LF,Kyle Schwarber,30,160,720,585,108,115,19.0,1,47,104,0.0,2,126,215.0,0.197,0.343,0.474,0.817,122,277,4.0,6,0,3,5,,No
343,35,2017,P,Jesen Therrien,24,14,2,2,0,0,0.0,0,0,0,0.0,0,0,2.0,0.0,0.0,0.0,0.0,-100,0,0.0,0,0,0,0,,No


In [103]:
phillies_data[phillies_data['Name'] == 'Kyle Schwarber']

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
5,6,2023,LF,Kyle Schwarber,30,160,720,585,108,115,19.0,1,47,104,0.0,2,126,215.0,0.197,0.343,0.474,0.817,122,277,4.0,6,0,3,5,,No
49,6,2022,LF,Kyle Schwarber,29,155,669,577,100,126,21.0,3,46,94,10.0,1,86,200.0,0.218,0.323,0.504,0.827,131,291,10.0,4,0,2,3,Left,No


In [104]:
phillies_data[phillies_data['Name'] == 'Jesen Therrien']

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
343,35,2017,P,Jesen Therrien,24,14,2,2,0,0,0.0,0,0,0,0.0,0,0,2.0,0.0,0.0,0.0,0.0,-100,0,0.0,0,0,0,0,,No


In [105]:
phillies_data.iloc[5, 29] = 'Left'

In [106]:
phillies_data = phillies_data.dropna(subset = ['Dominant_Hand'])
phillies_data['Dominant_Hand'].isnull().sum()

np.int64(0)

Since Kyle Schwarber has another row of data from a different year, we can see his dominant hand there and enter it in the row where it is missing. Jesen Therrien has only one season with the Phillies, so there is no way to find his dominant hand, and we drop that row.
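
The same idea can be applied to every player at once instead of looking each one up by hand. This is a sketch under the assumption that a player's Dominant_Hand is the same in every season; it uses only the Name, Year, and Dominant_Hand values shown in the outputs above.

```python
import numpy as np
import pandas as pd

# Hand-copied Name / Year / Dominant_Hand values from the outputs above
demo = pd.DataFrame({
    "Name": ["Kyle Schwarber", "Kyle Schwarber", "Jesen Therrien"],
    "Year": [2023, 2022, 2017],
    "Dominant_Hand": [np.nan, "Left", np.nan],
})

# For each player, take the first non-missing hand recorded in any season
known_hand = demo.groupby("Name")["Dominant_Hand"].transform("first")

# Fill gaps from the player's other seasons, then drop what is still unknown
demo["Dominant_Hand"] = demo["Dominant_Hand"].fillna(known_hand)
demo = demo.dropna(subset=["Dominant_Hand"])
print(demo)
```
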

In [108]:
phillies_data.shape

(442, 31)

After removing or filling the NAs, we have gotten rid of 18 rows of data.

In [109]:
phillies_data.isnull().sum()

Rank                                     0
Year                                     0
Position                                 0
Name                                     0
Age                                      0
Games                                    0
Plate_Appearances                        0
At_Bats                                  0
Runs                                     0
Hits                                     0
Doubles                                  0
Triples                                  0
Home_Runs                                0
Runs_Batted_In                           0
Stolen_Bases                             0
Caught_Stealing                          0
Base_On_Balls                            0
Strikeouts                               0
Batting_Average                          0
On_Base_Percentage                       0
Slugging_Percentage                      0
On_Base_Plus_Slugging_Percentage         0
On_Base_Plus_Slugging_Percentage_Plus    0
Total_Bases

In [111]:
phillies_duplicated_rows = phillies_data[phillies_data.duplicated()]
phillies_duplicated_rows

Unnamed: 0,Rank,Year,Position,Name,Age,Games,Plate_Appearances,At_Bats,Runs,Hits,Doubles,Triples,Home_Runs,Runs_Batted_In,Stolen_Bases,Caught_Stealing,Base_On_Balls,Strikeouts,Batting_Average,On_Base_Percentage,Slugging_Percentage,On_Base_Plus_Slugging_Percentage,On_Base_Plus_Slugging_Percentage_Plus,Total_Bases,Double_Plays_Grounded_Into,Times_Hit_By_Pitch,Sacrifice_Hits,Sacrifice_Flies,Intentional_Bases_on_Balls,Dominant_Hand,Switch_Hitter
53,9,2022,DH,Bryce Harper,29,99,426,370,63,106,28.0,1,18,65,11.0,4,46,87.0,0.286,0.364,0.514,0.877,146,190,13.0,3,0,7,9,Left,No
207,5,2019,3B,Maikel Franco,26,123,428,389,48,91,17.0,0,17,56,0.0,0,36,61.0,0.234,0.297,0.409,0.705,81,159,14.0,0,0,3,19,Right,No
282,23,2018,P,Jake Arrieta,32,28,50,45,3,6,0.0,0,1,3,0.0,0,3,27.0,0.133,0.188,0.2,0.388,4,9,1.0,0,2,0,0,Right,No
319,11,2017,UT,Rhys Hoskins,24,50,212,170,37,44,7.0,0,18,48,2.0,0,37,46.0,0.259,0.396,0.618,1.014,162,105,2.0,3,0,2,1,Right,No
378,18,2016,UT,Emmanuel Burriss,31,39,50,45,3,5,1.0,1,0,0,1.0,0,2,10.0,0.111,0.184,0.178,0.361,-2,8,1.0,2,1,0,0,Right,Yes


In [112]:
phillies_data = phillies_data.drop_duplicates()
phillies_data.shape

(437, 31)

There were 5 duplicate rows. After dropping them, we have finished cleaning the dataset and have removed 23 rows in total (460 down to 437).
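
To keep track of how many rows each cleaning step removes, the steps can be run in a loop that records the row count before and after. The sketch below uses a small made-up frame and only three of the steps; the `steps` list and `removed` dictionary are hypothetical names for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one missing Position, one missing Doubles, one exact duplicate
demo = pd.DataFrame({
    "Name": ["A", "B", "B", "C", "D"],
    "Position": ["P", "C", "C", None, "SS"],
    "Doubles": [0.0, 6.0, 6.0, 1.0, np.nan],
})

# Each step is a label plus a function that returns the cleaned frame
steps = [
    ("drop missing Position", lambda d: d.dropna(subset=["Position"])),
    ("drop missing Doubles", lambda d: d.dropna(subset=["Doubles"])),
    ("drop duplicates", lambda d: d.drop_duplicates()),
]

removed = {}
cleaned = demo
for label, step in steps:
    before = len(cleaned)
    cleaned = step(cleaned)
    removed[label] = before - len(cleaned)

print(removed)
print("total removed:", sum(removed.values()))
```
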

## Principles of Data Management

The 2 principles of data management I want to discuss are Data Accessibility and Data Lifecycle Management.

Data Accessibility is important in my opinion because having access to the data you need at a given moment can be crucial for accurate and efficient decision-making. Being able to get data or information when it is needed saves time and makes you more efficient. One important part of data accessibility is making sure the data is shared only with the people who should have it. The data has to be protected and secure so that no one who shouldn't have it can access it. Lastly, it is crucial that the data is in a clean, usable format for the people who need access to it.

Data Lifecycle Management is something I never really thought of as being this important until I learned about it. Data Lifecycle Management is handling data from its creation to its deletion. I had always thought that when data is collected, it just stays stored somewhere and is eventually forgotten as it becomes less important. After learning about this principle, I see that there is a lot more to it and that it is much more important than I previously thought. The data is first created, then stored somewhere secure and protected, where only the people who need it are authorized to access it. After it is used, it is stored away securely in a way that allows it to be used again if needed in the future. Finally, if it is no longer needed or storage needs to be freed up, it is deleted. Managing this whole process takes a lot of energy, and it ensures that the data is used for what it is needed for and does not get leaked or used by unauthorized users, all the way until it is deleted.

## Principles of File Management

The 2 principles of file management I want to discuss are keeping the original versions of program and data files, and placing comments and notes in the data files.

Keeping original files and programs is very important in my opinion. Especially with data, it is so easy to lose or delete something by accident. I also feel like there is not much forgiveness when you delete or lose something, as it is almost always impossible to recover something you delete. It is important to make copies or backups of the data and files you are working with, because it is so easy to permanently change something you didn't want to change or delete something you cannot get back. Having backups makes it possible to recover lost data.
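
In practice, this can be as simple as copying the raw file before cleaning and saving the cleaned data under a new name. The sketch below is one possible way to do it, not the workflow used in this notebook: it works in a temporary folder with a tiny made-up CSV, and the file names `raw_batting.csv` and `cleaned_batting.csv` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd

# Temporary folder standing in for the project directory
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "raw_batting.csv"  # hypothetical file name

# Tiny made-up raw file with one duplicate row
pd.DataFrame({"Name": ["A", "A", "B"], "Hits": [1, 1, 2]}).to_csv(raw_path, index=False)

# 1. Keep an untouched copy of the original before any cleaning
backup_dir = workdir / "backup"
backup_dir.mkdir(exist_ok=True)
backup_path = Path(shutil.copy2(raw_path, backup_dir / raw_path.name))

# 2. Clean in memory and save under a new name, never over the original
cleaned = pd.read_csv(raw_path).drop_duplicates()
cleaned_path = workdir / "cleaned_batting.csv"
cleaned.to_csv(cleaned_path, index=False)

print(sorted(p.name for p in workdir.rglob("*.csv")))
```
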

Placing comments and notes in files is important because it makes the code readable and understandable. If you have a file with a lot of code, it would probably take someone a very long time to understand it all, and they would most likely forget it if they had to come back to it another day. Having comments throughout your code or data files makes it so anyone can understand what your code is doing and what it all means.

## Discussion of LLM Usage

LLMs definitely have their benefits. They can be used as a tool to help with errors or mistakes you are stuck on, to help describe things you are working on, to answer questions you have, or to find topics to work on. On the other hand, they have their cons. People can abuse LLMs, since they can load data and answer or solve almost any problem you give them. People can lean on them too much and not retain or understand what they are doing. LLM usage has definitely increased productivity, but the question is whether it is actually being used to help or being used to do all the work. I think LLMs are a great thing, but they can be abused and lead to problems down the road.