# Per game average players dataframes

In this script we build a clean dataset segmented into hitters and pitchers. It is organized into the following sections:

- **Viewing the contents of the datasets.**
- **Cleaning the datasets and exporting them.**
- **Building variables for the estimations.**
- **Merging the datasets into new cross-sectional datasets.**
- **Splitting the datasets according to free agency.**
- **Building the panel data according to the specifications.**
- **Generating the variables for the dynamic model.**

Let us import the required modules and set the desired configuration.

In [13]:
import pandas as pd
import numpy as np
import math
import os
import warnings
import statsmodels.api as sm
from matplotlib.colors import ListedColormap
from termcolor import colored
print('Modules imported')

Modules imported


In [14]:
# Settings
warnings.filterwarnings('ignore')

In [15]:
# Working directory
print("Previous working directory: " + str(os.getcwd()))
# Change it
os.chdir('/home/usuario/Documentos/Github/Proyectos/MLB_HN/')

Previous working directory: /home/usuario/Documentos/Github/Proyectos/MLB_HN


In [16]:
# Check the current working directory
print(os.getcwd())
# The directory above is already the right one; if it were not, we would do the following:
path = '/home/usuario/Documentos/Github/Proyectos/MLB_HN'
# os.chdir returns None, so change the directory first and then print os.getcwd()
os.chdir(path)
print("New working directory: " + os.getcwd())

/home/usuario/Documentos/Github/Proyectos/MLB_HN
New working directory: /home/usuario/Documentos/Github/Proyectos/MLB_HN


## Viewing the datasets

### Teams by state

In [17]:
states = 'Data/Teams/team_states.csv'
df_states = pd.read_csv(states)

In [18]:
df_states.head()

Unnamed: 0,Estado,Cantidad de equipos
0,Alabama,0
1,Alaska,0
2,Arizona,1
3,Arkansas,0
4,California,5


### Acronyms

This table will serve as an intermediate key to unify the team datasets.

In [19]:
acronym = 'Data/Teams/team_acronym.csv'
df_acronym = pd.read_csv(acronym)

In [20]:
df_acronym.head()

Unnamed: 0,Equipo,Acronimo,Estado
0,Arizona Diamondbacks,ARI,Arizona
1,Atlanta Braves,ATL,Georgia
2,Baltimore Orioles,BAL,Maryland
3,Boston Red Sox,BOS,Massachusetts
4,Chicago Cubs,CHC,Illinois


Let us merge this dataframe with the teams-by-state dataframe.

In [21]:
acronym_state = pd.merge(df_states, df_acronym, on = 'Estado')

In [22]:
acronym_state.head()

Unnamed: 0,Estado,Cantidad de equipos,Equipo,Acronimo
0,Arizona,1,Arizona Diamondbacks,ARI
1,California,5,Los Angeles Angels,LAA
2,California,5,Los Angeles Dodgers,LAD
3,California,5,Oakland Athletics,OAK
4,California,5,San Diego Padres,SD


In this case, the variable names are self-explanatory.
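
The merge above uses the default inner join, so states without a matching team in the acronym table do not appear in the result (the output goes straight from Arizona to California). A minimal sketch on toy data, with illustrative rows rather than the real files, shows this behavior and how `indicator=True` can help inspect unmatched keys:

```python
import pandas as pd

# Toy stand-ins for df_states and df_acronym (illustrative values only)
toy_states = pd.DataFrame({'Estado': ['Alabama', 'Arizona', 'California'],
                           'Cantidad de equipos': [0, 1, 2]})
toy_acronym = pd.DataFrame({'Equipo': ['Arizona Diamondbacks', 'Los Angeles Angels',
                                       'Los Angeles Dodgers'],
                            'Acronimo': ['ARI', 'LAA', 'LAD'],
                            'Estado': ['Arizona', 'California', 'California']})

# Default inner join: Alabama has no team, so it is dropped
inner = pd.merge(toy_states, toy_acronym, on='Estado')

# Outer join with an indicator column to see which keys were unmatched
outer = pd.merge(toy_states, toy_acronym, on='Estado', how='outer', indicator=True)
unmatched = outer.loc[outer['_merge'] != 'both', 'Estado'].tolist()
```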

## Algorithm for building the datasets

Next, the code is optimized so that *dataframes* like the ones above can be obtained for a set of consecutive years, as is our case.

In [23]:
# Helpers:
free_agents = 'Data/Free_Agents/free_agents_'
hitting = 'Data/Statistics/Per_Game/Hitting/hitting_'
pitching = 'Data/Statistics/Per_Game/Pitching/pitching_'
salary = 'Data/Salary/salary_'
teams = 'ETL_Data/Transversal/Teams/free_agents_team_'
csv = '.csv'
period = 12
# Originals:
df_free_agents = [None]*period
df_hitting = [None]*period
df_pitching = [None]*period
df_salary = [None]*period
df_teams = [None]*period
# Copies:
df_free_agents_copy = [None]*period
df_hitting_copy = [None]*period
df_pitching_copy = [None]*period
df_salary_copy = [None]*period
df_teams_copy = [None]*period
# Final product:
df_pitchers = [None]*period
df_hitters = [None]*period
df_pitchers_free_agents = [None]*period
df_hitters_free_agents = [None]*period
df_pitchers_no_free_agents = [None]*period
df_hitters_no_free_agents = [None]*period
df_panel_hitters = [None]*period
df_panel_pitchers = [None]*period

Let us read all the files and create the copies.

In [24]:
for year in range(0,period):    
    df_free_agents[year] = pd.read_csv(free_agents + str(2011 + year) + csv)
    df_hitting[year] = pd.read_csv(hitting + str(2011 + year) + csv)
    df_pitching[year] = pd.read_csv(pitching + str(2011 + year) + csv)
    df_salary[year] = pd.read_csv(salary + str(2011 + year) + csv)
    df_teams[year] = pd.read_csv(teams + str(2011 + year) + csv)
    
    df_free_agents_copy[year] = df_free_agents[year].copy()
    df_hitting_copy[year] = df_hitting[year].copy()
    df_pitching_copy[year] = df_pitching[year].copy()
    df_salary_copy[year] = df_salary[year].copy()
    df_teams_copy[year] = df_teams[year].copy()
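
The loop above indexes each list by position, so position 0 stands for the 2011 season. As an alternative sketch, the same files could be loaded into a dictionary keyed by the season year. `read_yearly_csvs` is a hypothetical helper that is not part of the notebook, and the demonstration uses small temporary files rather than the real data:

```python
import os
import tempfile
import pandas as pd

def read_yearly_csvs(prefix, first_year, n_years, suffix='.csv'):
    """Read prefix + year + suffix for consecutive years into a dict keyed by year."""
    return {year: pd.read_csv(prefix + str(year) + suffix)
            for year in range(first_year, first_year + n_years)}

# Demonstration with two small temporary files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    demo_prefix = os.path.join(tmp, 'salary_')
    for year in (2011, 2012):
        pd.DataFrame({'Player': ['A', 'B'], 'Year': [year, year]}).to_csv(
            demo_prefix + str(year) + '.csv', index=False)
    demo = read_yearly_csvs(demo_prefix, 2011, 2)
```

With this layout, `demo[2012]` reads more directly than a positional index such as `df_salary[1]`.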

Let us process the datasets separately. From all of them, however, we will remove the rank column and *Cash2023*.

Since we do not want the season-year column to be repeated, we drop the *Year* column from the free agents dataset. The contract years also appear in the salary dataset; because that dataset is more general than the free agents one, the column is kept there and removed from the free agents dataset.

The team a free agent moves to is given by the team column in the salary and sports statistics datasets, so *Team From To* is dropped from the free agents dataset.

Since salaries are what matter for this analysis, we remove the team column, as well as the position column, from the sports statistics datasets of all players.

In [25]:
for year in range(0,period):
    # Drop columns; errors = 'ignore' skips any label that is missing instead of raising KeyError
    df_free_agents_copy[year].drop(columns = ['Rank', 'Year', 'Pos', 'Team From To'],
                                   errors = 'ignore', inplace = True)
    df_salary_copy[year].drop(columns = ['Rank'], errors = 'ignore', inplace = True)
    df_hitting_copy[year].drop(columns = ['Rank', 'Cash2023', 'Team', 'Pos'],
                               errors = 'ignore', inplace = True)
    df_pitching_copy[year].drop(columns = ['Rank', 'Cash2023', 'Team', 'Pos'],
                                errors = 'ignore', inplace = True)
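
When a dataset may lack some of the listed columns, `DataFrame.drop` accepts `errors='ignore'`, which skips missing labels instead of raising `KeyError`. A small sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Rank': [1, 2], 'Player': ['A', 'B'], 'Team': ['ARI', 'ATL']})

# 'Pos' and 'Cash2023' are not present in this toy frame; errors='ignore' skips them
trimmed = toy.drop(columns=['Rank', 'Cash2023', 'Team', 'Pos'], errors='ignore')
```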

Because some columns have names starting with *Unnamed*, we need a general method to drop them, shown below:

In [26]:
for year in range(0,period):
    # Free agents, salaries, hitters and pitchers: drop every column whose name contains 'Unnamed'
    for frame in (df_free_agents_copy[year], df_salary_copy[year],
                  df_hitting_copy[year], df_pitching_copy[year]):
        frame.drop(frame.columns[frame.columns.str.contains('Unnamed', case = False)], axis = 1, inplace = True)
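
Columns named `Unnamed: 0` typically appear when a CSV was written together with its index. The sketch below, on in-memory toy data, shows two ways of handling them: filtering with a boolean mask, or telling `read_csv` that the first column is the index. Whether the second option suits the real files is an assumption, since it depends on every file having that leading column.

```python
import io
import pandas as pd

# A CSV written with its index has an empty first header, read back as 'Unnamed: 0'
csv_text = ",Player,GP\n0,A,10\n1,B,12\n"

raw = pd.read_csv(io.StringIO(csv_text))
# Option 1: keep only the columns whose name does not contain 'Unnamed'
filtered = raw.loc[:, ~raw.columns.str.contains('Unnamed', case=False)]
# Option 2: tell read_csv that the first column is the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
```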

Let us verify that those unwanted columns are gone.

In [27]:
df_free_agents_copy[9].columns

Index(['Player', 'Status', 'Team From', 'YRS', 'Value', 'AAV'], dtype='object')

In [28]:
df_salary_copy[11].columns

Index(['Player', 'Year', 'Pos', 'Team', 'BaseSalary', 'SigningBonus',
       'Payroll Salary', 'Adj Salary', 'Salary%', 'Cash', 'AAV', 'CONT YR',
       'CONT VALUE', 'Earnings', 'FA Year', 'Sign Age', 'Age', 'Weight',
       'Height'],
      dtype='object')

In [29]:
df_hitting_copy[2].columns

Index(['Player', 'GP', 'GP%', 'GS', 'GS%', 'AB', 'H', '2B', '3B', 'HR', 'RBI',
       'AVG', 'OBP', 'SLG', 'OPS', 'TVS'],
      dtype='object')

In [30]:
df_pitching_copy[5].columns

Index(['Player', 'GP', 'GS', 'IP', 'H', 'R', 'ER', 'BB', 'SO', 'W', 'L', 'SV',
       'WHIP', 'ERA', 'WAR', 'TVS'],
      dtype='object')

#### Free agents

The team that signs the free agent is not kept here, since that information is also contained in the dataset that makes the _ETL_ processing easier.

In [31]:
for year in range(0,period):
    df_free_agents_copy[year] = df_free_agents_copy[year].rename(columns = {'Player':'Jugador',
                                'Status':'Status_agente_libre', 'Team From':'Equipo_anterior',
                                'Value':'Valor_contrato', 'AAV':'Valor_promedio_contrato',
                                'YRS':'Anios_de_contrato'})
    
    # Strip '$' and ',' as literal characters (regex = False) and convert the monetary columns to numbers
    for col in ['Valor_contrato', 'Valor_promedio_contrato']:
        cleaned = df_free_agents_copy[year][col].str.replace("$", "", regex = False).str.replace(",", "", regex = False)
        df_free_agents_copy[year][col] = pd.to_numeric(cleaned)
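
A note on the cleaning step: in a regular expression `$` is the end-of-string anchor, and the default value of the `regex` argument of `Series.str.replace` has changed between pandas versions. Passing `regex=False` makes the intent explicit. The sketch below uses a hypothetical helper, `currency_to_number`, on illustrative values:

```python
import pandas as pd

def currency_to_number(series):
    """Strip '$' and ',' as literal characters and convert to a numeric dtype."""
    cleaned = series.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
    return pd.to_numeric(cleaned)

toy_values = pd.Series(['$1,500,000', '$750,000'])
converted = currency_to_number(toy_values)
```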

For reference, let us look at the dimensions of the datasets.

In [32]:
for year in range(0,period):
    print(df_free_agents_copy[year].shape)

(1, 6)
(108, 6)
(213, 6)
(208, 6)
(221, 6)
(241, 6)
(100, 6)
(98, 6)
(105, 6)
(118, 6)
(141, 6)
(137, 6)


And also the data type of each column.

In [33]:
df_free_agents_copy[6].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Jugador                  100 non-null    object
 1   Status_agente_libre      100 non-null    object
 2   Equipo_anterior          100 non-null    object
 3   Anios_de_contrato        100 non-null    int64 
 4   Valor_contrato           100 non-null    int64 
 5   Valor_promedio_contrato  100 non-null    int64 
dtypes: int64(3), object(3)
memory usage: 4.8+ KB


#### Salaries

Since the salaries will be joined with the _hitters_ and _pitchers_ datasets, their _ETL_ process is carried out first.

In [34]:
for year in range(0,period):
    # Rename columns
    df_salary_copy[year] = df_salary_copy[year].rename(columns = {'Player':'Jugador',
                            'BaseSalary':'Sueldo_base', 'SigningBonus':'Bono_por_firma',
                            'Payroll Salary':'Sueldo_regular', 'Adj Salary':'Sueldo_ajustado',
                            'CONT YR':'Anios_de_contrato', 'CONT VALUE':'Valor_del_contrato',
                            'Earnings':'Ganancias', 'FA Year':'Anio_de_agente_libre',
                            'Sign Age':'Edad_al_firmar', 'Age':'Edad', 'Weight':'Peso',
                            'Height':'Altura', 'Year':'Anio', 'Pos':'Posicion',
                            'Salary%':'Sueldo_porcentual', 'Cash':'Pago_efectivo',
                            'AAV':'Valor_contrato_promedio', 'Team':'Acronimo'})
    
    # Convert to the appropriate data type: strip '$' and ',' as literal characters and parse as numbers
    money_columns = ['Sueldo_base', 'Sueldo_regular', 'Sueldo_ajustado', 'Valor_del_contrato',
                     'Bono_por_firma', 'Ganancias', 'Pago_efectivo', 'Valor_contrato_promedio']
    for col in money_columns:
        cleaned = df_salary_copy[year][col].str.replace("$", "", regex = False).str.replace(",", "", regex = False)
        df_salary_copy[year][col] = pd.to_numeric(cleaned)
    
    # Height: remove the inch and foot marks, then divide by 10
    height = df_salary_copy[year]['Altura'].str.replace("\"", "", regex = False).str.replace("'", "", regex = False)
    df_salary_copy[year]['Altura'] = pd.to_numeric(height)/10
    # Replace the zeros with the column mean
    height_mean = df_salary_copy[year]['Altura'].mean(skipna=True)
    df_salary_copy[year]['Altura'] = df_salary_copy[year].Altura.mask(df_salary_copy[year].Altura == 0, height_mean)
    
    df_salary_copy[year]['Anio_de_agente_libre'] = pd.to_numeric(df_salary_copy[year]['Anio_de_agente_libre'])
    df_salary_copy[year]['Anios_de_contrato'] = pd.to_numeric(df_salary_copy[year]['Anios_de_contrato'])
    df_salary_copy[year]['Edad'] = pd.to_numeric(df_salary_copy[year]['Edad'])

Because of some peculiarities of the dataset, the column containing the age at signing is handled separately, taking advantage of the fact that most of the incorrect entries are longer than two characters.

In [35]:
def estimate_sign_age(salary, row, year):
    # Estimate the age at signing from the season age and the contract dates;
    # fall back to 18 when the age column does not hold a coherent value
    if salary['Edad'].iloc[row] > 0:
        if salary['Anio_de_agente_libre'].iloc[row] == 0:
            ag_year = year + 2011 + 1
        else:
            ag_year = salary['Anio_de_agente_libre'].iloc[row]
        # First year of the contract
        ini_year = ag_year - salary['Anios_de_contrato'].iloc[row]
        # Years elapsed since the first year of the contract
        dif_years = year + 2011 - ini_year
        # Age at signing
        return pd.to_numeric(salary['Edad'].iloc[row] - dif_years)
    return pd.to_numeric(18)

for year in range(0,period):
    salary = df_salary_copy[year]
    salary['Edad_al_firmar'] = salary['Edad_al_firmar'].map(str)
    col = salary.columns.get_loc('Edad_al_firmar')

    for edad in range(0,salary.shape[0]):
        # A two-character string is taken as a valid age
        if len(salary['Edad_al_firmar'].iloc[edad]) == 2:
            salary.iloc[edad, col] = pd.to_numeric(salary['Edad_al_firmar'].iloc[edad])
        # Any other length is treated as an incorrect entry and re-estimated
        else:
            salary.iloc[edad, col] = estimate_sign_age(salary, edad, year)

        # A negative value is also re-estimated
        if salary['Edad_al_firmar'].iloc[edad] < 0:
            salary.iloc[edad, col] = estimate_sign_age(salary, edad, year)

    # Convert the column to a numeric type
    salary['Edad_al_firmar'] = pd.to_numeric(salary['Edad_al_firmar'])
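
The arithmetic behind the estimate can be checked on one row at a time. The function below is a hypothetical standalone restatement of the rule used in the cell above, and the numbers are illustrative: a 30-year-old in the 2015 season, on a 5-year contract that leads to free agency in 2018, signed in 2013 and was therefore 28 at signing.

```python
def sign_age_from_contract(season, age, fa_year, contract_years):
    """Age at signing implied by the season age and the contract dates."""
    # When no free-agency year is recorded (0), use the season after the current one
    ag_year = season + 1 if fa_year == 0 else fa_year
    # First year of the contract
    ini_year = ag_year - contract_years
    # Years elapsed between the first contract year and the current season
    dif_years = season - ini_year
    return age - dif_years

example_age = sign_age_from_contract(2015, 30, 2018, 5)
# One-year contract with no recorded free-agency year: the signing age equals the season age
same_year_age = sign_age_from_contract(2015, 25, 0, 1)
```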

We can verify that the cells of the age-at-signing column were cleaned properly by looking for entries that are not integers and by checking whether the whole column could be converted to an integer type.

In [36]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year]['Edad_al_firmar'].shape[0]):
        if type(df_salary_copy[year]['Edad_al_firmar'].iloc[edad]) != np.int64:
            print(type(df_salary_copy[year]['Edad_al_firmar'].iloc[edad]))

In [37]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year]['Edad_al_firmar'].shape[0]):
        if df_salary_copy[year]['Edad_al_firmar'].iloc[edad] < 0:
            print(df_salary_copy[year]['Edad_al_firmar'].iloc[edad])
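
The two row-by-row checks above can also be expressed on the whole column at once. This is a sketch on a toy series; in the notebook the same calls would be applied to each `df_salary_copy[year]['Edad_al_firmar']`:

```python
import pandas as pd

toy_ages = pd.Series([24, 31, 18], name='Edad_al_firmar')

# True when the whole column has an integer dtype
is_integer_column = pd.api.types.is_integer_dtype(toy_ages)
# Number of negative entries in the column
n_negative = int((toy_ages < 0).sum())
```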

In [38]:
#for year in range(0,period):
#    print(type(df_salary_copy[year][['Edad_al_firmar']].info()))

In [39]:
#for year in range(0,period):
#    print(year)
#    for edad in range(0,df_salary_copy[year]['Edad_al_firmar'].shape[0]):
#        print(str(df_salary_copy[year]['Edad_al_firmar'].iloc[edad]) + ' ' + str(edad))

On the other hand, we still need to correct the entries of the age column that have values below zero. This is done using the remaining columns.

In [40]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year]['Edad'].shape[0]):
        if df_salary_copy[year]['Edad'].iloc[edad] < 0:
            print(year)
            print(df_salary_copy[year]['Edad'].iloc[edad])

0
-9
1
-9
2
-7


In [41]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year].shape[0]):
        # Condition for imputing:
        if df_salary_copy[year]['Edad'].iloc[edad] <= 0:
            # If no free-agency year is recorded:
            if df_salary_copy[year]['Anio_de_agente_libre'].iloc[edad] == 0:
                ag_year = year + 2011 + 1
            # If a free-agency year is recorded:
            else:
                ag_year = df_salary_copy[year]['Anio_de_agente_libre'].iloc[edad]
            # First year of the contract
            ini_year = ag_year - df_salary_copy[year]['Anios_de_contrato'].iloc[edad]
            # Years elapsed since the first year of the contract
            dif_years = year + 2011 - ini_year
            # Age during the season:
            season_age = df_salary_copy[year]['Edad_al_firmar'].iloc[edad] + dif_years
            # Assign through a single .iloc call to avoid chained assignment
            df_salary_copy[year].iloc[edad, df_salary_copy[year].columns.get_loc('Edad')] = season_age
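
The same imputation rule can be sketched without an explicit loop over rows. The frame and the season below are illustrative assumptions; only the column names come from the notebook:

```python
import numpy as np
import pandas as pd

season = 2013  # illustrative season
toy = pd.DataFrame({'Edad': [29, -7, 0],
                    'Edad_al_firmar': [27, 22, 30],
                    'Anio_de_agente_libre': [2016, 2015, 0],
                    'Anios_de_contrato': [5, 4, 1]})

# Free-agency year, defaulting to the next season when it is recorded as 0
ag_year = np.where(toy['Anio_de_agente_libre'] == 0, season + 1, toy['Anio_de_agente_libre'])
# Years elapsed since the first year of the contract
dif_years = season - (ag_year - toy['Anios_de_contrato'])
# Keep the positive ages and replace the others with the implied season age
toy['Edad'] = toy['Edad'].where(toy['Edad'] > 0, toy['Edad_al_firmar'] + dif_years)
```

In this toy frame the second row becomes 22 + 2 = 24 and the third 30 + 0 = 30, while the first row keeps its original age.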

Let us check that there are no negative ages left.

In [42]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year]['Edad'].shape[0]):
        if df_salary_copy[year]['Edad'].iloc[edad] < 0:
            print(year)
            print(str(df_salary_copy[year]['Edad'].iloc[edad]) + ' ' + str(edad))

With the data imputed, we can now create the column containing the free agent's seniority under the contract.

In [43]:
for year in range(0,period):
    df_salary_copy[year]['Antiguedad'] = df_salary_copy[year]['Edad'] - df_salary_copy[year]['Edad_al_firmar']

Finally, let us convert the year column to string so that it is treated as a category rather than a numeric variable.

In [44]:
for year in range(0,period):
    df_salary_copy[year]['Anio'] = df_salary_copy[year]['Anio'].map(str)

In [45]:
df_salary_copy[5].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1045 entries, 0 to 1044
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Jugador                  1045 non-null   object 
 1   Anio                     1045 non-null   object 
 2   Posicion                 1045 non-null   object 
 3   Acronimo                 1045 non-null   object 
 4   Sueldo_base              1045 non-null   int64  
 5   Bono_por_firma           1045 non-null   int64  
 6   Sueldo_regular           1045 non-null   int64  
 7   Sueldo_ajustado          1045 non-null   int64  
 8   Sueldo_porcentual        1045 non-null   float64
 9   Pago_efectivo            1045 non-null   int64  
 10  Valor_contrato_promedio  1045 non-null   int64  
 11  Anios_de_contrato        1045 non-null   int64  
 12  Valor_del_contrato       1045 non-null   int64  
 13  Ganancias                1045 non-null   int64  
 14  Anio_de_agente_libre    

#### Hitters

In [46]:
for year in range(0,period):
    # Rename columns
    df_hitting_copy[year] = df_hitting_copy[year].rename(columns = {'Player':'Jugador',
                            'GP':'Juegos', 'GP%':'Porcentaje_juegos',
                            'AB':'At-bats', 'H':'Bateos', 'GS':'Juegos_iniciados',
                            'GS%':'Porcentaje_juegos_iniciados', 'RBI':'Runs-batted-in',
                            'HR':'Home-runs', 'AVG':'Bateos_promedio',
                            '2B':'Dobles', '3B':'Triples', 'OPS':'Porcentaje_On-base-plus-slugging',
                            'SLG':'Porcentaje_slugging', 'OBP':'Porcentaje_on-base'})

In [47]:
for year in range(0,period):
    print(df_hitting_copy[year].shape)

(182, 16)
(196, 16)
(277, 16)
(308, 16)
(472, 16)
(512, 16)
(236, 16)
(262, 16)
(280, 16)
(265, 16)
(463, 16)
(36, 16)


In [48]:
df_hitting_copy[5].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 512 entries, 0 to 511
Data columns (total 16 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Jugador                           512 non-null    object 
 1   Juegos                            512 non-null    float64
 2   Porcentaje_juegos                 512 non-null    float64
 3   Juegos_iniciados                  512 non-null    float64
 4   Porcentaje_juegos_iniciados       512 non-null    float64
 5   At-bats                           512 non-null    float64
 6   Bateos                            512 non-null    float64
 7   Dobles                            512 non-null    float64
 8   Triples                           512 non-null    float64
 9   Home-runs                         512 non-null    float64
 10  Runs-batted-in                    512 non-null    float64
 11  Bateos_promedio                   512 non-null    float64
 12  Porcenta

In [49]:
df_hitting_copy[5].columns

Index(['Jugador', 'Juegos', 'Porcentaje_juegos', 'Juegos_iniciados',
       'Porcentaje_juegos_iniciados', 'At-bats', 'Bateos', 'Dobles', 'Triples',
       'Home-runs', 'Runs-batted-in', 'Bateos_promedio', 'Porcentaje_on-base',
       'Porcentaje_slugging', 'Porcentaje_On-base-plus-slugging', 'TVS'],
      dtype='object')

#### Pitchers

In [50]:
for year in range(0,period):
    # Rename columns
    df_pitching_copy[year] = df_pitching_copy[year].rename(columns = {'Player':'Jugador',
                             'GP':'Juegos', 'GS':'Juegos_iniciados', 'IP':'Inning_pitched',
                             'H':'Bateos', 'R':'Carreras', 'ER':'Carreras_ganadas',
                             'BB':'Walks', 'SO':'Strike-outs', 'W':'Wins', 'L':'Losses',
                             'SV':'Saves'})

In [51]:
for year in range(0,period):
    print(df_pitching_copy[year].shape)

(66, 16)
(84, 16)
(103, 16)
(106, 16)
(106, 16)
(80, 16)
(89, 16)
(99, 16)
(94, 16)
(37, 16)
(91, 16)
(100, 16)


In [52]:
df_pitching_copy[5].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Jugador           80 non-null     object 
 1   Juegos            80 non-null     float64
 2   Juegos_iniciados  80 non-null     float64
 3   Inning_pitched    80 non-null     float64
 4   Bateos            80 non-null     float64
 5   Carreras          80 non-null     float64
 6   Carreras_ganadas  80 non-null     float64
 7   Walks             80 non-null     float64
 8   Strike-outs       80 non-null     float64
 9   Wins              80 non-null     float64
 10  Losses            80 non-null     float64
 11  Saves             80 non-null     float64
 12  WHIP              80 non-null     float64
 13  ERA               80 non-null     float64
 14  WAR               75 non-null     float64
 15  TVS               80 non-null     float64
dtypes: float64(15), object(1)
memory usage: 10.1+ 

## Adding variables suggested by articles

The first variables we add are the squares of all the sports statistics, as well as the following variables:

- DOMINANCE = $\text{Strike-outs}/\text{Innings pitched}$
- CONTROL = $\text{Walks}/\text{Innings pitched}$
- COMMAND = $\text{Strike-outs}/\text{Walks}$

In [53]:
for year in range(0,period):
    df_pitching_copy[year]['Dominio'] = df_pitching_copy[year]['Strike-outs']/(df_pitching_copy[year]['Inning_pitched'])
    df_pitching_copy[year]['Control'] = df_pitching_copy[year]['Walks']/(df_pitching_copy[year]['Inning_pitched'])
    df_pitching_copy[year]['Comando'] = df_pitching_copy[year]['Strike-outs']/df_pitching_copy[year]['Walks']
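
These ratios are undefined when the denominator is zero. In pandas, floating-point division does not raise an error in that case: 0/0 gives `NaN` and a positive number divided by 0 gives `inf`. A sketch on toy values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Strike-outs': [9.0, 0.0, 3.0],
                    'Walks': [3.0, 0.0, 0.0],
                    'Inning_pitched': [9.0, 0.0, 0.0]})

# Row 0 is a regular ratio, row 1 is 0/0 and row 2 is 3/0
dominance = toy['Strike-outs'] / toy['Inning_pitched']
command = toy['Strike-outs'] / toy['Walks']

n_missing = int(dominance.isna().sum())
n_infinite = int(np.isinf(dominance).sum())
```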

In [54]:
for year in range(0,period):
    print(df_pitching_copy[year].shape)

(66, 19)
(84, 19)
(103, 19)
(106, 19)
(106, 19)
(80, 19)
(89, 19)
(99, 19)
(94, 19)
(37, 19)
(91, 19)
(100, 19)


In [55]:
df_pitching_copy[2].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Jugador           103 non-null    object 
 1   Juegos            103 non-null    float64
 2   Juegos_iniciados  103 non-null    float64
 3   Inning_pitched    103 non-null    float64
 4   Bateos            103 non-null    float64
 5   Carreras          103 non-null    float64
 6   Carreras_ganadas  103 non-null    float64
 7   Walks             103 non-null    float64
 8   Strike-outs       103 non-null    float64
 9   Wins              103 non-null    float64
 10  Losses            103 non-null    float64
 11  Saves             103 non-null    float64
 12  WHIP              103 non-null    float64
 13  ERA               103 non-null    float64
 14  WAR               53 non-null     float64
 15  TVS               103 non-null    float64
 16  Dominio           101 non-null    float64
 1

We can check which entries in the dataset hold infinite values.

In [56]:
"""
for year in range(0,period):
    print(str(2011 + year))
    for name in df_pitching_copy[year].columns:
        print(name)
        if type(name) != str:
            for element in range(0,len(df_pitching_copy[year][name])):
                if math.isinf(df_pitching_copy[year][name].iloc[element]) == True:
                    print(str(element) +  '  ' + str(df_pitching_copy[year][name].iloc[element]))
    print("")
"""

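
The disabled loop above tests `type(name) != str` on the column names, which are strings, so its inner loop would never run. A vectorized alternative, sketched here on a toy frame with illustrative values, is to restrict the check to numeric columns and apply `np.isinf` to all of them at once:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Jugador': ['A', 'B', 'C'],
                    'Dominio': [1.0, np.inf, 0.5],
                    'Control': [0.3, 0.2, np.nan]})

# Restrict the check to numeric columns, since np.isinf is not defined for strings
numeric = toy.select_dtypes(include='number')
inf_mask = np.isinf(numeric)
# Number of infinite entries per column
inf_counts = inf_mask.sum()
# Rows that contain at least one infinite value
rows_with_inf = inf_mask.any(axis=1)
```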

Following the suggestion of some articles, let us take the logarithm of the salaries.

In [57]:
for year in range(0,period):
    df_salary_copy[year]['ln_Sueldo_base'] = np.log(df_salary_copy[year]['Sueldo_base'])
    df_salary_copy[year]['ln_Sueldo_ajustado'] = np.log(df_salary_copy[year]['Sueldo_ajustado'])
    df_salary_copy[year]['ln_Sueldo_regular'] = np.log(df_salary_copy[year]['Sueldo_regular'])
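
`np.log` returns `-inf` for 0 and `NaN` for negative inputs. Whether the salary columns contain such values is not shown here; assuming they might, one option is to mark non-positive salaries as missing before taking the logarithm. The values below are illustrative:

```python
import numpy as np
import pandas as pd

toy_salary = pd.Series([500000.0, 0.0, 1200000.0])

# Mark non-positive salaries as missing before taking the logarithm,
# so the result holds NaN instead of -inf
ln_salary = np.log(toy_salary.where(toy_salary > 0))
```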

In [58]:
df_salary_copy[2].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1013 entries, 0 to 1012
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Jugador                  1013 non-null   object 
 1   Anio                     1013 non-null   object 
 2   Posicion                 1013 non-null   object 
 3   Acronimo                 1013 non-null   object 
 4   Sueldo_base              1013 non-null   int64  
 5   Bono_por_firma           1013 non-null   int64  
 6   Sueldo_regular           1013 non-null   int64  
 7   Sueldo_ajustado          1013 non-null   int64  
 8   Sueldo_porcentual        1013 non-null   float64
 9   Pago_efectivo            1013 non-null   int64  
 10  Valor_contrato_promedio  1013 non-null   int64  
 11  Anios_de_contrato        1013 non-null   int64  
 12  Valor_del_contrato       1013 non-null   int64  
 13  Ganancias                1013 non-null   int64  
 14  Anio_de_agente_libre    

Because some columns contain _NaN_ or _NULL_ values, we choose to impute them.

The infinite values generated by the new variables, in turn, are replaced depending on the case:

- 0/0: 0
- num/0: the maximum of the corresponding column

In [60]:
for year in range(0,period):
    # Salaries
    mean_hgt = df_salary_copy[year].loc[df_salary_copy[year]['Altura'] > 4.9].Altura.mean()
    mean_wgh = df_salary_copy[year].loc[df_salary_copy[year]['Peso'] > 0].Peso.mean()
    df_salary_copy[year]['Altura'].fillna(value = mean_hgt, inplace = True)
    df_salary_copy[year]['Altura'].mask(df_salary_copy[year]['Altura'] <= 4.9, mean_hgt, inplace = True)
    df_salary_copy[year]['Peso'].fillna(value = mean_wgh, inplace = True)
    df_salary_copy[year]['Peso'].mask(df_salary_copy[year]['Peso'] <= 0, mean_wgh, inplace = True)
    
    # Pitchers
    mean_war = df_pitching_copy[year].loc[df_pitching_copy[year]['WAR'] > 0].WAR.mean()
    mean_dom = df_pitching_copy[year].loc[df_pitching_copy[year]['Dominio'] > 0].Dominio.mean()
    mean_con = df_pitching_copy[year].loc[df_pitching_copy[year]['Control'] > 0].Control.mean()
    mean_com = df_pitching_copy[year].loc[df_pitching_copy[year]['Comando'] > 0].Comando.mean()
    df_pitching_copy[year]['WAR'].fillna(value = mean_war, inplace = True)
    df_pitching_copy[year]['WAR'].mask(df_pitching_copy[year]['WAR'] <= 0, mean_war, inplace = True)
    df_pitching_copy[year]['Dominio'].fillna(value = mean_dom, inplace = True)
    df_pitching_copy[year]['Dominio'].mask(df_pitching_copy[year]['Dominio'] <= 0, mean_dom, inplace = True)
    df_pitching_copy[year]['Control'].fillna(value = mean_con, inplace = True)
    df_pitching_copy[year]['Control'].mask(df_pitching_copy[year]['Control'] <= 0, mean_con, inplace = True)
    df_pitching_copy[year]['Comando'].fillna(value = mean_com, inplace = True)
    df_pitching_copy[year]['Comando'].mask(df_pitching_copy[year]['Comando'] <= 0, mean_com, inplace = True)
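
The `fillna` plus `mask` pattern above repeats for every column, so it could be factored into a small helper. The following is a minimal sketch under stated assumptions: `impute_with_valid_mean` and the toy `demo` dataframe are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

def impute_with_valid_mean(df, column, lower_bound):
    """Replace NaN and values <= lower_bound with the mean of the valid values."""
    valid = df[column] > lower_bound            # NaN compares as False, so it is invalid too
    mean_valid = df.loc[valid, column].mean()   # mean of the valid values only
    df[column] = df[column].where(valid, mean_valid)
    return df

# Toy data (hypothetical): one zero weight and one missing weight
demo = pd.DataFrame({'Peso': [200.0, 0.0, np.nan, 220.0]})
demo = impute_with_valid_mean(demo, 'Peso', lower_bound=0)
print(demo['Peso'].tolist())
```

Note that if a column has no valid values at all, the mean is itself NaN and nothing is imputed.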

In [61]:
for year in range(0,period):
    # Conditions
    con_dom_1 = df_pitching_copy[year]['Strike-outs'] == 0
    con_con_1 = df_pitching_copy[year]['Walks'] == 0
    con_com_1 = df_pitching_copy[year]['Strike-outs'] == 0

    # Imputation for the 0/0 case
    df_pitching_copy[year].loc[con_dom_1, "Dominio"] = 0
    df_pitching_copy[year].loc[con_con_1, "Control"] = 0
    df_pitching_copy[year].loc[con_com_1, "Comando"] = 0

In [62]:
for year in range(0,period):
    # Maxima
    max_dom = df_pitching_copy[year]['Strike-outs'].max()/9
    max_con = df_pitching_copy[year]['Walks'].max()/9
    max_com = df_pitching_copy[year]['Strike-outs'].max()

    # Replace infinities with NaN
    df_pitching_copy[year]["Dominio"].replace([np.inf, -np.inf], np.nan, inplace = True)
    df_pitching_copy[year]["Control"].replace([np.inf, -np.inf], np.nan, inplace = True)
    df_pitching_copy[year]["Comando"].replace([np.inf, -np.inf], np.nan, inplace = True)

    # Imputation
    df_pitching_copy[year]['Dominio'].fillna(value = max_dom, inplace = True)
    df_pitching_copy[year]['Control'].fillna(value = max_con, inplace = True)
    df_pitching_copy[year]['Comando'].fillna(value = max_com, inplace = True)
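
To make the two cases concrete, here is a small sketch on toy data. It assumes a hypothetical strikeout-to-walk ratio, which is not necessarily how `Comando` is defined in the notebook, and it uses the largest finite value of the ratio as the replacement, whereas the cells above derive their maxima from `Strike-outs` and `Walks`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
demo = pd.DataFrame({'Strike-outs': [10.0, 0.0, 4.0, 6.0],
                     'Walks':       [5.0, 0.0, 0.0, 2.0]})

# Division by zero yields NaN for 0/0 and inf for num/0
ratio = demo['Strike-outs'] / demo['Walks']

# Case 0/0: numerator and denominator are both zero
zero_over_zero = (demo['Strike-outs'] == 0) & (demo['Walks'] == 0)
# Case num/0: use the largest finite value observed in the column
finite_max = ratio[np.isfinite(ratio)].max()

ratio = ratio.mask(zero_over_zero, 0.0)
ratio = ratio.replace([np.inf, -np.inf], finite_max)
print(ratio.tolist())
```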

Let's verify that there are no remaining problems with infinite values.

In [63]:
"""
for year in range(0,period):
    print(str(2011 + year))
    for name in df_pitching_copy[year].columns:
        print(name)
        if type(name) != str:
            for element in range(0,len(df_pitching_copy[year][name])):
                if math.isinf(df_pitching_copy[year][name].iloc[element]) == True:
                    print(str(element) +  '  ' + str(df_pitching_copy[year][name].iloc[element]))
    print("")
"""

'\nfor year in range(0,period):\n    print(str(2011 + year))\n    for name in df_pitching_copy[year].columns:\n        print(name)\n        if type(name) != str:\n            for element in range(0,len(df_pitching_copy[year][name])):\n                if math.isinf(df_pitching_copy[year][name].iloc[element]) == True:\n                    print(str(element) +  \'  \' + str(df_pitching_copy[year][name].iloc[element]))\n    print("")\n'
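
The commented-out loop tests `type(name) != str`, which is never true when the column names are strings, so the element-wise check would not run. A vectorized alternative could look like the sketch below. `count_infinite` and the toy dataframe are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def count_infinite(df):
    """Count +inf and -inf entries in each numeric column."""
    numeric = df.select_dtypes(include='number')
    is_inf = np.isinf(numeric.to_numpy(dtype=float))
    return pd.Series(is_inf.sum(axis=0), index=numeric.columns)

# Toy data (hypothetical): one infinite value in two of the three numeric columns
demo = pd.DataFrame({'Jugador': ['A', 'B', 'C'],
                     'Dominio': [1.0, np.inf, 2.0],
                     'Control': [0.5, 0.2, -np.inf],
                     'ERA':     [3.0, 4.0, 5.0]})
counts = count_infinite(demo)
print(counts)
```

Applied to each yearly dataframe, a result of all zeros would confirm that no infinite values remain.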

Likewise, let's count the *NaN* values that remain.

In [64]:
for year in range(0,period):
    print('Año: ' + str(2011 + year))
    print('Hitters:')
    df_hitting_copy[year].isna().sum()
    print('Pitchers:')
    df_pitching_copy[year].isna().sum()
    print('Free agents:')
    df_free_agents_copy[year].isna().sum()
    print('Salaries:')
    df_salary_copy[year].isna().sum()
    print("")

Año: 2011
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2012
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2013
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2014
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2015
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2016
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2017
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2018
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2019
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2020
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2021
Hitters:
Pitchers:
Free agents:
Salaries:

Año: 2022
Hitters:
Pitchers:
Free agents:
Salaries:
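
Inside a loop, a bare expression such as `df.isna().sum()` is evaluated but not displayed, which is why only the labels appear above. A sketch that prints the totals explicitly is shown next. The `frames` dictionary is a hypothetical stand-in for the notebook's yearly lists of dataframes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the yearly lists of dataframes
frames = {
    'Hitters':  [pd.DataFrame({'TVS': [1.0, np.nan]}), pd.DataFrame({'TVS': [2.0, 3.0]})],
    'Pitchers': [pd.DataFrame({'WAR': [np.nan, np.nan]}), pd.DataFrame({'WAR': [0.4, np.nan]})],
}

nan_totals = {}
for label, yearly in frames.items():
    for offset, df in enumerate(yearly):
        total = int(df.isna().sum().sum())      # NaN count over the whole dataframe
        nan_totals[(label, 2011 + offset)] = total
        print(f'{2011 + offset} {label}: {total} NaN')
```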



Now, let's repeat this process for the salary data.

In [65]:
salary_names = ['ln_Sueldo_ajustado', 'ln_Sueldo_base', 'ln_Sueldo_regular']

In [66]:
for name in salary_names:
    print(name)
    
    for year in range(0,period):
        print(str(2011 + year))
        for element in range(0,len(df_salary_copy[year][name])):
            if df_salary_copy[year][name].iloc[element] <= 0:
                print(str(element) +  '  ' + str(df_salary_copy[year][name].iloc[element]))
        print("")

ln_Sueldo_ajustado
2011

2012

2013

2014

2015

2016

2017

2018
72  -inf
188  -inf

2019

2020

2021
6  -inf
55  -inf
166  -inf
274  -inf

2022
193  -inf

ln_Sueldo_base
2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

2021

2022

ln_Sueldo_regular
2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

2021

2022



On inspecting the errors, we see that only the adjusted salaries are unknown and that they were set to $0$, so their logarithm is $-\infty$. We will substitute the natural logarithm of the regular salary, `ln_Sueldo_regular`, for those values.

In [67]:
for year in range(0,period):
    df_salary_copy[year]['ln_Sueldo_ajustado'].mask(df_salary_copy[year]['ln_Sueldo_ajustado'] < 0,
                                                    df_salary_copy[year]['ln_Sueldo_regular'],
                                                    inplace = True)
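
The sketch below reproduces the situation on toy data: a salary of 0 produces a logarithm of `-inf`, which is then replaced by the logarithm of the regular salary. It assumes the `ln_` columns are computed with `np.log`, and the figures are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical salaries); the second adjusted salary is unknown and stored as 0
demo = pd.DataFrame({'Sueldo_ajustado': [500000, 0, 1200000],
                     'Sueldo_regular':  [480000, 750000, 1200000]})

with np.errstate(divide='ignore'):      # log(0) = -inf; silence the warning
    demo['ln_Sueldo_ajustado'] = np.log(demo['Sueldo_ajustado'])
demo['ln_Sueldo_regular'] = np.log(demo['Sueldo_regular'])

# Replace the infinite entries with the log of the regular salary
demo['ln_Sueldo_ajustado'] = demo['ln_Sueldo_ajustado'].mask(
    np.isinf(demo['ln_Sueldo_ajustado']), demo['ln_Sueldo_regular'])
print(demo[['ln_Sueldo_ajustado', 'ln_Sueldo_regular']])
```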

In [68]:
for year in range(0,period):
    print(str(2011 + year))
    for element in range(0,len(df_salary_copy[year]['ln_Sueldo_ajustado'])):
        if df_salary_copy[year]['ln_Sueldo_ajustado'].iloc[element] <= 0:
            print(str(element) +  '  ' + str(df_salary_copy[year]['ln_Sueldo_ajustado'].iloc[element]))
    print("")

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

2021

2022



In [69]:
for year in range(0,period):
    print("Ajustado: " + str(df_salary_copy[year]['ln_Sueldo_ajustado'].mean())
          + '\n'
          + 'Regular: ' + str(df_salary_copy[year]['ln_Sueldo_regular'].mean()))

Ajustado: 13.785784770849762
Regular: 13.888874386883515
Ajustado: 13.81181498978195
Regular: 13.90039782504837
Ajustado: 13.54819591748778
Regular: 13.978975725734408
Ajustado: 13.66411021920866
Regular: 14.080738990191257
Ajustado: 13.49415866539551
Regular: 13.99344595993313
Ajustado: 13.478992231977141
Regular: 14.0261405412466
Ajustado: 13.575363447521962
Regular: 14.079742661574734
Ajustado: 13.562311948859815
Regular: 14.080756222584538
Ajustado: 13.521364643693584
Regular: 14.066559580453122
Ajustado: 13.029458522392627
Regular: 14.216780831796356
Ajustado: 13.70981799580226
Regular: 14.198804602895482
Ajustado: 13.924314685157658
Regular: 14.351734783233747


Indeed, there are no longer any _NaN_ or _infinite_ values.

To make creating the squared variables more efficient, we will do it by extracting the positional indices of the columns of interest.

In [70]:
df_hitting_copy[0].columns

Index(['Jugador', 'Juegos', 'Porcentaje_juegos', 'Juegos_iniciados',
       'Porcentaje_juegos_iniciados', 'At-bats', 'Bateos', 'Dobles', 'Triples',
       'Home-runs', 'Runs-batted-in', 'Bateos_promedio', 'Porcentaje_on-base',
       'Porcentaje_slugging', 'Porcentaje_On-base-plus-slugging', 'TVS'],
      dtype='object')

In [71]:
df_pitching_copy[1].columns

Index(['Jugador', 'Juegos', 'Juegos_iniciados', 'Inning_pitched', 'Bateos',
       'Carreras', 'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins',
       'Losses', 'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Dominio', 'Control',
       'Comando'],
      dtype='object')

In [72]:
def get_col_indices(df, names):
    return df.columns.get_indexer(names)
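
As a quick illustration of what this helper returns, the sketch below calls it on a toy dataframe. `Index.get_indexer` returns `-1` for any name that is not a column, and `-1` used as a position would silently select the last column, so it is worth checking for it. The toy column list is hypothetical.

```python
import pandas as pd

def get_col_indices(df, names):
    return df.columns.get_indexer(names)

# Toy dataframe with four columns and no rows (hypothetical)
demo = pd.DataFrame(columns=['Jugador', 'Juegos', 'Walks', 'WAR'])

found = get_col_indices(demo, ['Walks', 'WAR'])
missing = get_col_indices(demo, ['Walks', 'Bateos_en_contra'])
print(found)
print(missing)

# Guard: every requested name should exist before the positions are used
all_found = bool((found >= 0).all())
```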

In [74]:
hitting_names = ['Juegos_iniciados', 'Porcentaje_juegos_iniciados', 'At-bats', 'Bateos',
                  'Dobles', 'Triples', 'Home-runs', 'Runs-batted-in', 'Bateos_promedio',
                  'Porcentaje_on-base', 'Porcentaje_slugging', 'TVS',
                  'Porcentaje_On-base-plus-slugging']	
pitching_names = ['Inning_pitched', 'Bateos', 'Carreras',
                  'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins', 'Losses',
                  'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Dominio', 'Control',
                  'Comando']

To simplify the code, let's check whether the column indices are the same in every dataframe.

In [75]:
print('Hitters:')
for year in range(0,period):
    print(get_col_indices(df_hitting_copy[year], hitting_names))
    
print('Pitchers:')
for year in range(0,period):
    print(get_col_indices(df_pitching_copy[year], pitching_names))

Hitters:
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
[ 3  4  5  6  7  8  9 10 11 12 13 15 14]
Pitchers:
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
[ 3  4  5  6  7  8  9 10 11 12 13 14 15

In [76]:
hitting_indexes = list(get_col_indices(df_hitting_copy[0], hitting_names))
pitching_indexes = list(get_col_indices(df_pitching_copy[0], pitching_names))

In [77]:
for year in range(0,period):
    # Hitters: square each selected column and store it with the suffix '_2'
    for idx in hitting_indexes:
        col = df_hitting_copy[year].columns[idx]
        df_hitting_copy[year][col + '_2'] = np.power(df_hitting_copy[year][col], 2)
    # Pitchers: same transformation
    for idx in pitching_indexes:
        col = df_pitching_copy[year].columns[idx]
        df_pitching_copy[year][col + '_2'] = np.power(df_pitching_copy[year][col], 2)
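
An alternative that works directly with column names, without positional indices, is sketched below. `add_squares` and the toy data are hypothetical, and this is only one possible way to obtain the same `_2` columns.

```python
import pandas as pd

def add_squares(df, names):
    """Return df with a squared copy of each column in names, suffixed with '_2'."""
    squared = (df[names] ** 2).add_suffix('_2')
    return pd.concat([df, squared], axis=1)

# Toy data (hypothetical)
demo = pd.DataFrame({'Jugador': ['A', 'B'], 'Walks': [2.0, 3.0], 'ERA': [1.5, 4.0]})
demo = add_squares(demo, ['Walks', 'ERA'])
print(demo.columns.tolist())
```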

Let's look at the final result.

In [78]:
df_hitting_copy[2].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 29 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Jugador                             277 non-null    object 
 1   Juegos                              277 non-null    float64
 2   Porcentaje_juegos                   277 non-null    float64
 3   Juegos_iniciados                    277 non-null    float64
 4   Porcentaje_juegos_iniciados         277 non-null    float64
 5   At-bats                             277 non-null    float64
 6   Bateos                              277 non-null    float64
 7   Dobles                              277 non-null    float64
 8   Triples                             277 non-null    float64
 9   Home-runs                           277 non-null    float64
 10  Runs-batted-in                      277 non-null    float64
 11  Bateos_promedio                     277 non-n

In [79]:
df_pitching_copy[2].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Jugador             103 non-null    object 
 1   Juegos              103 non-null    float64
 2   Juegos_iniciados    103 non-null    float64
 3   Inning_pitched      103 non-null    float64
 4   Bateos              103 non-null    float64
 5   Carreras            103 non-null    float64
 6   Carreras_ganadas    103 non-null    float64
 7   Walks               103 non-null    float64
 8   Strike-outs         103 non-null    float64
 9   Wins                103 non-null    float64
 10  Losses              103 non-null    float64
 11  Saves               103 non-null    float64
 12  WHIP                103 non-null    float64
 13  ERA                 103 non-null    float64
 14  WAR                 0 non-null      float64
 15  TVS                 103 non-null    float64
 16  Dominio 

## Merging the databases
### Team-level aggregate data

All that remains is to add the relevant data on the team each player belongs to, using the table with the number of teams per state.

In [80]:
df_teams_copy[7].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Equipo                   28 non-null     object 
 1   Cantidad_agentes_libres  28 non-null     int64  
 2   Valor_contrato_total     28 non-null     int64  
 3   Acronimo                 28 non-null     object 
 4   Victorias                28 non-null     int64  
 5   Juegos totales           28 non-null     int64  
 6   Playoffs                 28 non-null     int64  
 7   Pennants won             28 non-null     int64  
 8   WS ganadas               28 non-null     int64  
 9   Promedio_victorias       28 non-null     float64
dtypes: float64(1), int64(7), object(2)
memory usage: 2.3+ KB


In [81]:
acronym_state.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Estado               30 non-null     object
 1   Cantidad de equipos  30 non-null     int64 
 2   Equipo               30 non-null     object
 3   Acronimo             30 non-null     object
dtypes: int64(1), object(3)
memory usage: 1.2+ KB


In [82]:
for year in range(0,period):
    df_teams_copy[year] = pd.merge(df_teams_copy[year], acronym_state, on = ['Equipo','Acronimo'])

In [83]:
df_teams_copy[7].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 0 to 27
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Equipo                   28 non-null     object 
 1   Cantidad_agentes_libres  28 non-null     int64  
 2   Valor_contrato_total     28 non-null     int64  
 3   Acronimo                 28 non-null     object 
 4   Victorias                28 non-null     int64  
 5   Juegos totales           28 non-null     int64  
 6   Playoffs                 28 non-null     int64  
 7   Pennants won             28 non-null     int64  
 8   WS ganadas               28 non-null     int64  
 9   Promedio_victorias       28 non-null     float64
 10  Estado                   28 non-null     object 
 11  Cantidad de equipos      28 non-null     int64  
dtypes: float64(1), int64(8), object(3)
memory usage: 2.8+ KB


In [84]:
df_salary_copy[7].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1036 entries, 0 to 1035
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Jugador                  1036 non-null   object 
 1   Anio                     1036 non-null   object 
 2   Posicion                 1036 non-null   object 
 3   Acronimo                 1036 non-null   object 
 4   Sueldo_base              1036 non-null   int64  
 5   Bono_por_firma           1036 non-null   int64  
 6   Sueldo_regular           1036 non-null   int64  
 7   Sueldo_ajustado          1036 non-null   int64  
 8   Sueldo_porcentual        1036 non-null   float64
 9   Pago_efectivo            1036 non-null   int64  
 10  Valor_contrato_promedio  1036 non-null   int64  
 11  Anios_de_contrato        1036 non-null   int64  
 12  Valor_del_contrato       1036 non-null   int64  
 13  Ganancias                1036 non-null   int64  
 14  Anio_de_agente_libre    

Now, let's merge the team data into the salary data.

In [85]:
for year in range(0,period):
    df_salary_copy[year] = pd.merge(df_teams_copy[year], df_salary_copy[year], on = 'Acronimo')
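
`pd.merge` performs an inner join by default, so salary rows whose `Acronimo` has no match in the team data are dropped. One way to see which keys fail to match is an outer merge with `indicator=True`, sketched here on toy data. The team codes and values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): player C belongs to a team code missing from `teams`
teams = pd.DataFrame({'Acronimo': ['NYY', 'BOS'], 'Victorias': [97, 90]})
salaries = pd.DataFrame({'Jugador': ['A', 'B', 'C'],
                         'Acronimo': ['NYY', 'BOS', 'XXX']})

# Default inner join: the unmatched row disappears
inner = pd.merge(teams, salaries, on='Acronimo')

# Outer join with an indicator column to inspect what did not match
checked = pd.merge(teams, salaries, on='Acronimo', how='outer', indicator=True)
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['Jugador', 'Acronimo', '_merge']])
```

Comparing the row counts before and after the merge gives a quick first check of how many players were dropped.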

In [86]:
df_salary_copy[0].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 0 to 39
Data columns (total 34 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Equipo                   40 non-null     object 
 1   Cantidad_agentes_libres  40 non-null     int64  
 2   Valor_contrato_total     40 non-null     int64  
 3   Acronimo                 40 non-null     object 
 4   Victorias                40 non-null     int64  
 5   Juegos totales           40 non-null     int64  
 6   Playoffs                 40 non-null     int64  
 7   Pennants won             40 non-null     int64  
 8   WS ganadas               40 non-null     int64  
 9   Promedio_victorias       40 non-null     float64
 10  Estado                   40 non-null     object 
 11  Cantidad de equipos      40 non-null     int64  
 12  Jugador                  40 non-null     object 
 13  Anio                     40 non-null     object 
 14  Posicion                 40 

Because most players play both offense and defense, we have to remove the duplicates in the position column.

In [87]:
for year in range(0,period):
    df_hitting_copy[year] = pd.merge(df_hitting_copy[year], df_salary_copy[year], on = 'Jugador')
    df_pitching_copy[year] = pd.merge(df_pitching_copy[year], df_salary_copy[year], on = 'Jugador')

In [88]:
for year in range(0,period):
    df_pitching_copy[year]['Porcentaje_juegos'] = df_pitching_copy[year]['Juegos']/df_pitching_copy[year]['Juegos totales']

In [89]:
df_hitting_copy[3].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 287 entries, 0 to 286
Data columns (total 62 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Jugador                             287 non-null    object 
 1   Juegos                              287 non-null    float64
 2   Porcentaje_juegos                   287 non-null    float64
 3   Juegos_iniciados                    287 non-null    float64
 4   Porcentaje_juegos_iniciados         287 non-null    float64
 5   At-bats                             287 non-null    float64
 6   Bateos                              287 non-null    float64
 7   Dobles                              287 non-null    float64
 8   Triples                             287 non-null    float64
 9   Home-runs                           287 non-null    float64
 10  Runs-batted-in                      287 non-null    float64
 11  Bateos_promedio                     287 non-n

In [90]:
df_pitching_copy[3].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92 entries, 0 to 91
Data columns (total 69 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Jugador                  92 non-null     object 
 1   Juegos                   92 non-null     float64
 2   Juegos_iniciados         92 non-null     float64
 3   Inning_pitched           92 non-null     float64
 4   Bateos                   92 non-null     float64
 5   Carreras                 92 non-null     float64
 6   Carreras_ganadas         92 non-null     float64
 7   Walks                    92 non-null     float64
 8   Strike-outs              92 non-null     float64
 9   Wins                     92 non-null     float64
 10  Losses                   92 non-null     float64
 11  Saves                    92 non-null     float64
 12  WHIP                     92 non-null     float64
 13  ERA                      92 non-null     float64
 14  WAR                      0 n

To make the transformations easier to inspect, let's sort the columns of each dataframe alphabetically by name.

In [91]:
for year in range(0,period):
    # Sort the columns alphabetically
    df_salary_copy[year].sort_index(axis = 1, inplace = True)
    df_hitting_copy[year].sort_index(axis = 1, inplace = True)
    df_pitching_copy[year].sort_index(axis = 1, inplace = True)
    df_free_agents_copy[year].sort_index(axis = 1, inplace = True)
    
    # Reset the row indices
    df_salary_copy[year].reset_index(drop = True, inplace = True)
    df_hitting_copy[year].reset_index(drop = True, inplace = True)
    df_pitching_copy[year].reset_index(drop = True, inplace = True)
    df_free_agents_copy[year].reset_index(drop = True, inplace = True)

## Variables for period t-1

We will *merge* the dataframes of year $t$ with those of year $t-1$ on the players. The reason is that we are only interested in players who have been free agents for more than one year.

Since the first dataframe is for 2011, we have to start in 2012. Let's create the dataframes that will hold the data for the model. So that the periods do not overwrite one another, we will create auxiliary dataframes to store the new data.

In [92]:
hitting_merge = ['Juegos_iniciados', 'Porcentaje_juegos_iniciados', 'At-bats', 'Bateos',
                 'Dobles', 'Triples', 'Home-runs', 'Runs-batted-in', 'Bateos_promedio',
                 'Porcentaje_on-base', 'Porcentaje_slugging', 'TVS',
                 'Porcentaje_On-base-plus-slugging',
                 'Juegos_iniciados_2', 'Porcentaje_juegos_iniciados_2', 'At-bats_2', 'Bateos_2',
                 'Dobles_2', 'Triples_2', 'Home-runs_2', 'Runs-batted-in_2', 'Bateos_promedio_2',
                 'Porcentaje_on-base_2', 'Porcentaje_slugging_2', 'TVS_2',
                 'Porcentaje_On-base-plus-slugging_2']	
pitching_merge = ['Inning_pitched', 'Bateos_en_contra', 'Carreras_en_contra',
                  'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins', 'Losses',
                  'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Dominio', 'Control',
                  'Comando',
                  'Inning_pitched_2', 'Bateos_2', 'Carreras_2',
                  'Carreras_ganadas_2', 'Walks_2', 'Strike-outs_2', 'Wins_2', 'Losses_2',
                  'Saves_2', 'WHIP_2', 'ERA_2', 'WAR_2', 'TVS_2', 'Dominio_2', 'Control_2',
                  'Comando_2']

In [93]:
df_hitters_copy = [None]*period
df_pitchers_copy = [None]*period

In [94]:
for year in range(0,period):
    df_hitters_copy[year] = df_hitting_copy[year].copy()
    df_pitchers_copy[year] = df_pitching_copy[year].copy()

In [95]:
for year in range(1,period):    
    df_hitting_copy[year] = pd.merge(df_hitters_copy[year], df_hitters_copy[year-1], on = 'Jugador')
    df_pitching_copy[year] = pd.merge(df_pitchers_copy[year], df_pitchers_copy[year-1], on = 'Jugador')
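
The effect of this merge can be seen on a toy example. An inner merge on `Jugador` keeps only the players present in both years, and overlapping columns receive the default suffixes `_x` and `_y`. As a sketch, the `suffixes` parameter of `pd.merge` can name the two periods directly. The data below are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): only player B appears in both years
year_t = pd.DataFrame({'Jugador': ['A', 'B'], 'Walks': [30, 12], 'ERA': [3.1, 4.2]})
year_t_1 = pd.DataFrame({'Jugador': ['B', 'C'], 'Walks': [15, 40], 'ERA': [3.9, 2.8]})

# Left frame is year t, right frame is year t-1
panel = pd.merge(year_t, year_t_1, on='Jugador', suffixes=('_t', '_t_1'))
print(panel)
```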

Next, we check that the number of columns is the same in every period except the first.

In [96]:
"""for name in df_pitching_copy[11].columns:
    print(name)"""

'for name in df_pitching_copy[11].columns:\n    print(name)'

In [97]:
for year in range(0,period):
    print(df_hitting_copy[year].columns.shape)
    
for year in range(0,period):    
    print(df_pitching_copy[year].columns.shape)

(62,)
(123,)
(123,)
(123,)
(123,)
(123,)
(123,)
(123,)
(123,)
(123,)
(123,)
(123,)
(69,)
(137,)
(137,)
(137,)
(137,)
(137,)
(137,)
(137,)
(137,)
(137,)
(137,)
(137,)


In [98]:
for year in range(1,period):       
    df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace('_x', '_t')
    df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace('_y', '_t_1')
    df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace('-', '_')
    df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace(' ', '_')
    df_pitching_copy[year].drop(['ln_Sueldo_base_t_1', 'ln_Sueldo_ajustado_t_1', 'ln_Sueldo_regular_t_1'],
                           axis = 1, inplace = True)
    df_pitching_copy[year] = df_pitching_copy[year].sort_values(by = 'Jugador', ascending = True)
    df_pitching_copy[year].reset_index(drop = True, inplace = True)
    
    df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace('_x', '_t')
    df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace('_y', '_t_1')
    df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace('-', '_')
    df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace(' ', '_')
    df_hitting_copy[year].drop(['ln_Sueldo_base_t_1', 'ln_Sueldo_ajustado_t_1', 'ln_Sueldo_regular_t_1'],
                          axis = 1, inplace = True)
    df_hitting_copy[year] = df_hitting_copy[year].sort_values(by = 'Jugador', ascending = True)
    df_hitting_copy[year].reset_index(drop = True, inplace = True)
    
    # Reorder the columns alphabetically
    df_hitting_copy[year].sort_index(axis = 1, inplace = True)
    df_pitching_copy[year].sort_index(axis = 1, inplace = True)

In [99]:
for name in df_pitching_copy[11].columns:
    print(name)

Acronimo_t
Acronimo_t_1
Altura_t
Altura_t_1
Anio_de_agente_libre_t
Anio_de_agente_libre_t_1
Anio_t
Anio_t_1
Anios_de_contrato_t
Anios_de_contrato_t_1
Antiguedad_t
Antiguedad_t_1
Bateos_2_t
Bateos_2_t_1
Bateos_t
Bateos_t_1
Bono_por_firma_t
Bono_por_firma_t_1
Cantidad_agentes_libres_t
Cantidad_agentes_libres_t_1
Cantidad_de_equipos_t
Cantidad_de_equipos_t_1
Carreras_2_t
Carreras_2_t_1
Carreras_ganadas_2_t
Carreras_ganadas_2_t_1
Carreras_ganadas_t
Carreras_ganadas_t_1
Carreras_t
Carreras_t_1
Comando_2_t
Comando_2_t_1
Comando_t
Comando_t_1
Control_2_t
Control_2_t_1
Control_t
Control_t_1
Dominio_2_t
Dominio_2_t_1
Dominio_t
Dominio_t_1
ERA_2_t
ERA_2_t_1
ERA_t
ERA_t_1
Edad_al_firmar_t
Edad_al_firmar_t_1
Edad_t
Edad_t_1
Equipo_t
Equipo_t_1
Estado_t
Estado_t_1
Ganancias_t
Ganancias_t_1
Inning_pitched_2_t
Inning_pitched_2_t_1
Inning_pitched_t
Inning_pitched_t_1
Juegos_iniciados_t
Juegos_iniciados_t_1
Juegos_t
Juegos_t_1
Juegos_totales_t
Juegos_totales_t_1
Jugador
Losses_2_t
Losses_2_t_1
Losses_t

Because many of the variables from period $t-1$ can serve as more realistic controls, we will keep them, except for the column holding the year that the period $t-1$ dataframe refers to, that is, the column *Anio_t_1*. This is done for both *pitchers* and *hitters*. For analogous reasons, the column indicating the number of teams in a given state will also be omitted, since it does not vary over the analysis period. In the cell below, the columns actually dropped are *Anio_t_1*, *Estado_t_1* and *Edad_t_1*.

To simplify the code, we will treat the column *Anio* as the column *Anio_t*.

In [100]:
for year in range(1,period):
    df_pitching_copy[year].drop(['Anio_t_1', 'Estado_t_1', 'Edad_t_1'],
                           axis = 1, inplace = True)
    
    df_hitting_copy[year].drop(['Anio_t_1', 'Estado_t_1', 'Edad_t_1'],
                           axis = 1, inplace = True)
    
    # Reorder the columns alphabetically
    df_hitting_copy[year].sort_index(axis = 1, inplace = True)
    df_pitching_copy[year].sort_index(axis = 1, inplace = True)
    
    # Reset the row index
    df_hitting_copy[year].reset_index(drop = True, inplace = True)
    df_pitching_copy[year].reset_index(drop = True, inplace = True)

Let's add the period suffix to the column names of the 2011 dataframes.

In [101]:
year = 0
# Add the suffix '_t' to every column name
df_hitting_copy[year] = df_hitting_copy[year].add_suffix('_t')
df_pitching_copy[year] = df_pitching_copy[year].add_suffix('_t')
# Restore the name of the player column
df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace('Jugador_t', 'Jugador')
df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace('Jugador_t', 'Jugador')

In [102]:
print("Salarios")
print(df_salary_copy[year].columns)
print("\n")
print("Hitters")
print(df_hitting_copy[year].columns)
print("\n")
print("Pitchers")
print(df_pitching_copy[year].columns)
print("\n")
print("Free agents")
print(df_free_agents_copy[year].columns)
print("\n")

Salarios
Index(['Acronimo', 'Altura', 'Anio', 'Anio_de_agente_libre',
       'Anios_de_contrato', 'Antiguedad', 'Bono_por_firma',
       'Cantidad de equipos', 'Cantidad_agentes_libres', 'Edad',
       'Edad_al_firmar', 'Equipo', 'Estado', 'Ganancias', 'Juegos totales',
       'Jugador', 'Pago_efectivo', 'Pennants won', 'Peso', 'Playoffs',
       'Posicion', 'Promedio_victorias', 'Sueldo_ajustado', 'Sueldo_base',
       'Sueldo_porcentual', 'Sueldo_regular', 'Valor_contrato_promedio',
       'Valor_contrato_total', 'Valor_del_contrato', 'Victorias', 'WS ganadas',
       'ln_Sueldo_ajustado', 'ln_Sueldo_base', 'ln_Sueldo_regular'],
      dtype='object')


Hitters
Index(['Acronimo_t', 'Altura_t', 'Anio_t', 'Anio_de_agente_libre_t',
       'Anios_de_contrato_t', 'Antiguedad_t', 'At-bats_t', 'At-bats_2_t',
       'Bateos_t', 'Bateos_2_t', 'Bateos_promedio_t', 'Bateos_promedio_2_t',
       'Bono_por_firma_t', 'Cantidad de equipos_t',
       'Cantidad_agentes_libres_t', 'Dobles_t', 'Dobles_2

## Segmentation by free agents

We will split the pitchers and hitters into two groups:

- Free agents.
- Non-free agents.

In [103]:
for year in range(0,period):
    # Keep the free agents
    df_hitters_free_agents[year] = pd.merge(df_free_agents_copy[year],
                                            df_hitting_copy[year], on = 'Jugador')
    df_pitchers_free_agents[year] = pd.merge(df_free_agents_copy[year],
                                             df_pitching_copy[year], on = 'Jugador')
    # Keep the players who are not free agents
    df_hitters_no_free_agents[year] = df_hitting_copy[year][~df_hitting_copy[year].Jugador.isin(df_hitters_free_agents[year].Jugador)]
    df_pitchers_no_free_agents[year] = df_pitching_copy[year][~df_pitching_copy[year].Jugador.isin(df_pitchers_free_agents[year].Jugador)]
    
    # Sort the columns alphabetically
    df_hitters_free_agents[year] = df_hitters_free_agents[year].reindex(sorted(df_hitters_free_agents[year].columns), axis=1)
    df_pitchers_free_agents[year] = df_pitchers_free_agents[year].reindex(sorted(df_pitchers_free_agents[year].columns), axis=1)
    df_hitters_no_free_agents[year] = df_hitters_no_free_agents[year].reindex(sorted(df_hitters_no_free_agents[year].columns), axis=1)
    df_pitchers_no_free_agents[year] = df_pitchers_no_free_agents[year].reindex(sorted(df_pitchers_no_free_agents[year].columns), axis=1)    
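
The split can also be expressed with a single boolean mask, as in the sketch below on toy data. Unlike the merge used above for the free-agent group, a mask does not bring in the extra columns of the free-agent table, so both halves keep the same columns. The data are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): D is a free agent with no player statistics
players = pd.DataFrame({'Jugador': ['A', 'B', 'C'], 'Walks_t': [30, 12, 25]})
free_agents = pd.DataFrame({'Jugador': ['B', 'D'], 'Equipo': ['NYY', 'BOS']})

# True where the player appears in the free-agent table
is_free_agent = players['Jugador'].isin(free_agents['Jugador'])

fa = players[is_free_agent].reset_index(drop=True)
no_fa = players[~is_free_agent].reset_index(drop=True)
print(fa)
print(no_fa)
```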

Let's look at the contents of the new dataframes.

In [104]:
print("FA - Hitters:")
df_hitters_free_agents[9].info()
print("\n FA - Pitchers:")
df_pitchers_free_agents[9].info()
print("\n No FA - Hitters:")
df_hitters_no_free_agents[9].info()
print("\n No FA - Pitchers:")
df_pitchers_no_free_agents[9].info()

FA - Hitters:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Columns: 122 entries, Acronimo_t to ln_Sueldo_regular_t
dtypes: float64(66), int64(45), object(11)
memory usage: 984.0+ bytes

 FA - Pitchers:
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Columns: 136 entries, Acronimo_t to ln_Sueldo_regular_t
dtypes: float64(80), int64(45), object(11)
memory usage: 0.0+ bytes

 No FA - Hitters:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Columns: 117 entries, Acronimo_t to ln_Sueldo_regular_t
dtypes: float64(66), int64(42), object(9)
memory usage: 3.7+ KB

 No FA - Pitchers:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Columns: 131 entries, Acronimo_t to ln_Sueldo_regular_t
dtypes: float64(80), int64(42), object(9)
memory usage: 0.0+ bytes


In [105]:
print("FA - Hitters:")
for year in range(0,period):
    print(df_hitters_free_agents[year].shape)
print("\n FA - Pitchers:")
for year in range(0,period):
    print(df_pitchers_free_agents[year].shape)

FA - Hitters:
(0, 67)
(0, 122)
(22, 122)
(17, 122)
(23, 122)
(42, 122)
(9, 122)
(8, 122)
(10, 122)
(1, 122)
(5, 122)
(2, 122)

 FA - Pitchers:
(0, 74)
(0, 136)
(4, 136)
(3, 136)
(7, 136)
(3, 136)
(1, 136)
(0, 136)
(0, 136)
(0, 136)
(0, 136)
(1, 136)


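Finally, to ease later use, the next cell is meant to convert all column names to lowercase. Note that `rename` returns a new DataFrame, so as written the calls leave the names unchanged, and the later cells keep using the original capitalized names (e.g. `Jugador`, `Anio_t`).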

In [106]:
for year in range(0,period):
    df_hitters_free_agents[year].rename(columns = str.lower)
    df_pitchers_free_agents[year].rename(columns = str.lower)
    df_hitters_no_free_agents[year].rename(columns = str.lower)
    df_pitchers_no_free_agents[year].rename(columns = str.lower)
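
A minimal sketch, on a hypothetical toy frame (`toy` and its columns are illustrative, not part of the notebook's data), of how `DataFrame.rename` behaves with and without assignment. It only illustrates the pandas semantics; whether the notebook's frames should actually be lowercased depends on the later cells that reference the capitalized names.

```python
import pandas as pd

# Hypothetical toy frame; the column names only mimic the notebook's style
toy = pd.DataFrame({"Jugador": ["A", "B"], "Anio_t": [2014, 2015]})

# rename returns a new DataFrame; without assignment `toy` is unchanged
toy.rename(columns=str.lower)
unchanged_columns = list(toy.columns)

# Keeping the result in a separate variable preserves the original names
toy_lower = toy.rename(columns=str.lower)
lowered_columns = list(toy_lower.columns)

print(unchanged_columns)
print(lowered_columns)
```

Keeping the lowercase copy in a separate variable would allow exporting lowercase headers while the in-memory frames keep the names used downstream.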

In [107]:
start_year = 2011
end_year = 2022
general_trans_path = 'ETL_Data/Transversal/Per_Game'

In [109]:
for year in range(0,period):    
    # Export each DataFrame separately
    df_hitters_free_agents[year].to_csv(general_trans_path + '/Hitters/Free_Agent/hitters_' + str(start_year + year) + '.csv',
                                        index = False)
    df_pitchers_free_agents[year].to_csv(general_trans_path + '/Fielders/Free_Agent/fielders_' + str(start_year + year) + '.csv',
                                         index = False)
    df_hitters_no_free_agents[year].to_csv(general_trans_path + '/Hitters/No_Free_Agent/hitters_' + str(start_year + year) + '.csv',
                                           index = False)
    df_pitchers_no_free_agents[year].to_csv(general_trans_path + '/Fielders/No_Free_Agent/fielders_' + str(start_year + year) + '.csv',
                                            index = False)

### Labels for free agents

We create a label indicating whether each pitcher or hitter is a free agent.

In [110]:
for year in range(0,period):
    # Conditions
    condicion_hitter = [df_hitting_copy[year].Jugador.isin(df_free_agents_copy[year].Jugador)]
    condicion_pitcher = [df_pitching_copy[year].Jugador.isin(df_free_agents_copy[year].Jugador)]
    
    # Labels
    etiquetas = ['Si']
    
    df_hitting_copy[year]['Agente_libre'] = np.select(condicion_hitter, etiquetas, default = 'No')
    df_pitching_copy[year]['Agente_libre'] = np.select(condicion_pitcher, etiquetas, default = 'No')
    
    df_hitting_copy[year] = df_hitting_copy[year].reindex(sorted(df_hitting_copy[year].columns), axis=1)
    df_pitching_copy[year] = df_pitching_copy[year].reindex(sorted(df_pitching_copy[year].columns), axis=1)
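
The labeling step above can be sketched on toy data as follows. This is an illustrative alternative, not the notebook's code: it uses `np.where` instead of `np.select`, which is equivalent when there is a single condition and two labels. The frames `players` and `free_agents` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: all players and the subset who are free agents
players = pd.DataFrame({"Jugador": ["Ana", "Beto", "Caro"]})
free_agents = pd.DataFrame({"Jugador": ["Beto"]})

# isin gives a boolean mask; np.where maps it to the two labels
is_free_agent = players["Jugador"].isin(free_agents["Jugador"])
players["Agente_libre"] = np.where(is_free_agent, "Si", "No")

print(players)
```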

In [111]:
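# NOTE: rename returns a new DataFrame; the result is not assigned, so the column names stay unchanged
for year in range(0,period):
    df_hitting_copy[year].rename(columns = str.lower)
    df_pitching_copy[year].rename(columns = str.lower)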

In [112]:
for year in range(0,period):
    # Export the DataFrames
    df_hitting_copy[year].to_csv(general_trans_path + '/Hitters/All/hitters_' + str(start_year + year) + '.csv',
                                        index = False)
    df_pitching_copy[year].to_csv(general_trans_path + '/Fielders/All/fielders_' + str(start_year + year) + '.csv',
                                        index = False)

## Panel Data

To obtain a dataset with a panel structure, we stack the yearly datasets.

In [113]:
# Initialize the panel
df_panel_all_hitter = df_hitting_copy[0]
df_panel_all_pitcher = df_pitching_copy[0]

for year in range(1,period):
    # Hitter
    df_panel_all_hitter = pd.concat([df_panel_all_hitter, df_hitting_copy[year]])
    
    # Pitcher
    df_panel_all_pitcher = pd.concat([df_panel_all_pitcher, df_pitching_copy[year]])
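
As a hedged sketch of the same stacking idea, `pd.concat` also accepts a list of frames in a single call. The list `yearly` below is a hypothetical stand-in for the yearly cross-sections. Unlike the loop above, this sketch passes `ignore_index=True`, so the resulting index runs from 0 to n-1 instead of repeating each year's own index.

```python
import pandas as pd

# Hypothetical yearly cross-sections standing in for df_hitting_copy[year]
yearly = [
    pd.DataFrame({"Jugador": ["Ana", "Beto"], "Anio_t": [2011, 2011]}),
    pd.DataFrame({"Jugador": ["Ana"], "Anio_t": [2012]}),
    pd.DataFrame({"Jugador": ["Beto", "Caro"], "Anio_t": [2013, 2013]}),
]

# One concat call stacks every year; ignore_index gives a clean 0..n-1 index
panel = pd.concat(yearly, ignore_index=True)

print(panel.shape)
```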

Let's look at the descriptive statistics of the panels.

In [114]:
df_panel_all_hitter[['ln_Sueldo_ajustado_t']].describe()

Unnamed: 0,ln_Sueldo_ajustado_t
count,1041.0
mean,14.051295
std,1.644047
min,8.640472
25%,13.003918
50%,13.345507
75%,15.50191
max,17.358538


In [115]:
df_panel_all_pitcher.describe()

Unnamed: 0,Altura_t,Anio_de_agente_libre_t,Anios_de_contrato_t,Antiguedad_t,Bateos_2_t,Bateos_t,Bono_por_firma_t,Cantidad de equipos_t,Cantidad_agentes_libres_t,Carreras_2_t,...,WAR_2_t_1,WAR_t_1,WHIP_2_t_1,WHIP_t_1,WS_ganadas_t,WS_ganadas_t_1,Walks_2_t_1,Walks_t_1,Wins_2_t_1,Wins_t_1
count,134.0,134.0,134.0,134.0,134.0,134.0,134.0,0.0,134.0,134.0,...,0.0,0.0,134.0,134.0,134.0,134.0,134.0,134.0,134.0,134.0
mean,6.230361,798.067164,1.134328,0.522388,6.492937,1.863582,82089.55,,6.30597,1.945349,...,,,0.197174,0.126194,2.962687,3.238806,0.978404,0.701194,0.018419,0.087612
std,0.204331,990.311846,0.487298,0.679633,11.89513,1.744335,865361.0,,3.440641,3.675205,...,,,1.442495,0.427331,4.527953,5.32716,2.601043,0.700279,0.041155,0.10404
min,5.567014,0.0,1.0,0.0,0.0,0.0,0.0,,1.0,0.0,...,,,0.0001,0.01,0.0,0.0,0.0,0.0,0.0,0.0
25%,6.125,0.0,1.0,0.0,0.7744,0.88,0.0,,3.0,0.204775,...,,,0.0004,0.02,0.0,0.0,0.085575,0.2925,0.0009,0.03
50%,6.233202,0.0,1.0,0.0,1.2321,1.11,0.0,,6.0,0.3249,...,,,0.0025,0.05,2.0,2.0,0.2025,0.45,0.0025,0.05
75%,6.3,2015.0,1.0,1.0,2.9268,1.71,0.0,,8.0,0.936075,...,,,0.0064,0.08,3.0,3.0,0.4903,0.7,0.011575,0.1075
max,6.7,2027.0,4.0,4.0,52.8529,7.27,10000000.0,,18.0,16.0,...,,,12.96,3.6,27.0,27.0,25.0,5.0,0.1936,0.44


In [116]:
df_panel_all_hitter.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1041 entries, 0 to 3
Columns: 132 entries, Acronimo_t to WS_ganadas_t_1
dtypes: float64(105), int64(17), object(10)
memory usage: 1.1+ MB


In [117]:
df_panel_all_pitcher.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134 entries, 0 to 9
Columns: 138 entries, Acronimo_t to Wins_t_1
dtypes: float64(111), int64(17), object(10)
memory usage: 145.5+ KB


Let's verify that there are no problems with *NaN* or *infinite* values.

*NaN* values:

In [118]:
"""for name in df_panel_all_hitter.columns:
    print(name)
    if type(name) != str:
        for element in range(0,len(df_panel_all_hitter[name])):
            if pd.isna(df_panel_all_hitter[name].iloc[element]) == True:
                print(str(element) +  '  ' + str(df_panel_all_hitter[name].iloc[element]))"""

"for name in df_panel_all_hitter.columns:\n    print(name)\n    if type(name) != str:\n        for element in range(0,len(df_panel_all_hitter[name])):\n            if pd.isna(df_panel_all_hitter[name].iloc[element]) == True:\n                print(str(element) +  '  ' + str(df_panel_all_hitter[name].iloc[element]))"

In [119]:
"""for name in df_panel_all_pitcher.columns:
    print(name)
    if type(name) != str:
        for element in range(0,len(df_panel_all_pitcher[name])):
            if pd.isna(df_panel_all_pitcher[name].iloc[element]) == True:
                print(str(element) +  '  ' + str(df_panel_all_pitcher[name].iloc[element]))"""

"for name in df_panel_all_pitcher.columns:\n    print(name)\n    if type(name) != str:\n        for element in range(0,len(df_panel_all_pitcher[name])):\n            if pd.isna(df_panel_all_pitcher[name].iloc[element]) == True:\n                print(str(element) +  '  ' + str(df_panel_all_pitcher[name].iloc[element]))"

*Infinite* values:

In [120]:
for name in df_panel_all_hitter.columns:
    print(name)
    # Column names are always strings, so test the column's dtype to find numeric columns
    if pd.api.types.is_numeric_dtype(df_panel_all_hitter[name]):
        for element in range(0,len(df_panel_all_hitter[name])):
            if math.isinf(df_panel_all_hitter[name].iloc[element]) == True:
                print(str(element) +  '  ' + str(df_panel_all_hitter[name].iloc[element]))

Acronimo_t
Agente_libre
Altura_t
Anio_de_agente_libre_t
Anio_t
Anios_de_contrato_t
Antiguedad_t
At-bats_2_t
At-bats_t
Bateos_2_t
Bateos_promedio_2_t
Bateos_promedio_t
Bateos_t
Bono_por_firma_t
Cantidad de equipos_t
Cantidad_agentes_libres_t
Dobles_2_t
Dobles_t
Edad_al_firmar_t
Edad_t
Equipo_t
Estado_t
Ganancias_t
Home-runs_2_t
Home-runs_t
Juegos totales_t
Juegos_iniciados_2_t
Juegos_iniciados_t
Juegos_t
Jugador
Pago_efectivo_t
Pennants won_t
Peso_t
Playoffs_t
Porcentaje_On-base-plus-slugging_2_t
Porcentaje_On-base-plus-slugging_t
Porcentaje_juegos_iniciados_2_t
Porcentaje_juegos_iniciados_t
Porcentaje_juegos_t
Porcentaje_on-base_2_t
Porcentaje_on-base_t
Porcentaje_slugging_2_t
Porcentaje_slugging_t
Posicion_t
Promedio_victorias_t
Runs-batted-in_2_t
Runs-batted-in_t
Sueldo_ajustado_t
Sueldo_base_t
Sueldo_porcentual_t
Sueldo_regular_t
TVS_2_t
TVS_t
Triples_2_t
Triples_t
Valor_contrato_promedio_t
Valor_contrato_total_t
Valor_del_contrato_t
Victorias_t
WS ganadas_t
ln_Sueldo_ajustado_t
ln_

In [121]:
for name in df_panel_all_pitcher.columns:
    print(name)
    # Column names are always strings, so test the column's dtype to find numeric columns
    if pd.api.types.is_numeric_dtype(df_panel_all_pitcher[name]):
        for element in range(0,len(df_panel_all_pitcher[name])):
            if math.isinf(df_panel_all_pitcher[name].iloc[element]) == True:
                print(str(element) +  '  ' + str(df_panel_all_pitcher[name].iloc[element]))

Acronimo_t
Agente_libre
Altura_t
Anio_de_agente_libre_t
Anio_t
Anios_de_contrato_t
Antiguedad_t
Bateos_2_t
Bateos_t
Bono_por_firma_t
Cantidad de equipos_t
Cantidad_agentes_libres_t
Carreras_2_t
Carreras_ganadas_2_t
Carreras_ganadas_t
Carreras_t
Comando_2_t
Comando_t
Control_2_t
Control_t
Dominio_2_t
Dominio_t
ERA_2_t
ERA_t
Edad_al_firmar_t
Edad_t
Equipo_t
Estado_t
Ganancias_t
Inning_pitched_2_t
Inning_pitched_t
Juegos totales_t
Juegos_iniciados_t
Juegos_t
Jugador
Losses_2_t
Losses_t
Pago_efectivo_t
Pennants won_t
Peso_t
Playoffs_t
Porcentaje_juegos_t
Posicion_t
Promedio_victorias_t
Saves_2_t
Saves_t
Strike-outs_2_t
Strike-outs_t
Sueldo_ajustado_t
Sueldo_base_t
Sueldo_porcentual_t
Sueldo_regular_t
TVS_2_t
TVS_t
Valor_contrato_promedio_t
Valor_contrato_total_t
Valor_del_contrato_t
Victorias_t
WAR_2_t
WAR_t
WHIP_2_t
WHIP_t
WS ganadas_t
Walks_2_t
Walks_t
Wins_2_t
Wins_t
ln_Sueldo_ajustado_t
ln_Sueldo_base_t
ln_Sueldo_regular_t
Acronimo_t_1
Altura_t_1
Anio_de_agente_libre_t_1
Anios_de_contr
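
The element-by-element checks above can also be written in vectorized form. The following is a minimal sketch on a hypothetical toy frame (`toy` and its columns are illustrative): it counts *NaN* and infinite values per numeric column rather than printing every column name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel with one NaN and one infinite value
toy = pd.DataFrame({
    "Jugador": ["Ana", "Beto", "Caro"],
    "Sueldo": [1.0, np.nan, 3.0],
    "ln_Sueldo": [0.0, 1.0, -np.inf],
})

# Restrict the check to numeric columns; strings cannot be infinite
numeric = toy.select_dtypes(include="number")

# Per-column counts of NaN and of +/- infinity
nan_counts = numeric.isna().sum()
inf_counts = np.isinf(numeric).sum()

# Show only the columns that actually contain problem values
print(nan_counts[nan_counts > 0])
print(inf_counts[inf_counts > 0])
```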

In [122]:
df_panel_all_hitter.sort_index(axis = 1,
                               inplace = True)
df_panel_all_pitcher.sort_index(axis = 1,
                                inplace = True)

We repeat the procedure, but only for the players who are free agents.

In [123]:
# Initialize the panel
df_panel_fa_hitter = df_hitters_free_agents[0]
df_panel_fa_pitcher = df_pitchers_free_agents[0]

for year in range(1,period):
    # Hitter
    df_panel_fa_hitter = pd.concat([df_panel_fa_hitter, df_hitters_free_agents[year]])
    
    # Pitcher
    df_panel_fa_pitcher = pd.concat([df_panel_fa_pitcher, df_pitchers_free_agents[year]])

In [124]:
df_panel_fa_hitter.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 139 entries, 0 to 1
Columns: 136 entries, Acronimo_t to WS_ganadas_t_1
dtypes: float64(105), int64(20), object(11)
memory usage: 148.8+ KB


In [125]:
df_panel_fa_pitcher.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19 entries, 0 to 0
Columns: 142 entries, Acronimo_t to Wins_t_1
dtypes: float64(111), int64(20), object(11)
memory usage: 21.2+ KB


In [126]:
df_panel_fa_hitter.drop('Anios_de_contrato',
                        axis = 1, inplace = True)
df_panel_fa_pitcher.drop('Anios_de_contrato',
                         axis = 1, inplace = True)

In [127]:
df_panel_fa_hitter.sort_index(axis = 1, inplace = True)
df_panel_fa_pitcher.sort_index(axis = 1, inplace = True)

# Variables of the Empirical Model

In [128]:
empiric_panel_hitter = df_panel_fa_hitter.copy()
empiric_panel_pitcher = df_panel_fa_pitcher.copy()

Let's look at some statistics and information contained in the datasets.

In [129]:
print(empiric_panel_hitter.shape)

(139, 135)


In [130]:
print(empiric_panel_pitcher.shape)

(19, 141)


The positions present in each dataset:

In [131]:
empiric_panel_hitter['Posicion_t'].unique()

array(['SP', 'RP', 'RF', '2B', 'LF', '1B', 'C', '3B', 'CF', 'SS', 'DH'],
      dtype=object)

In [132]:
empiric_panel_pitcher['Posicion_t'].unique()

array(['SP', 'RP'], dtype=object)

Let's sort the datasets by player name and year.

In [133]:
# Hitter
empiric_panel_hitter = empiric_panel_hitter.sort_values(by = ['Jugador','Anio_t'], ascending=True)
empiric_panel_hitter.reset_index(drop = True, inplace = True)

# Pitcher
empiric_panel_pitcher = empiric_panel_pitcher.sort_values(by = ['Jugador','Anio_t'], ascending=True)
empiric_panel_pitcher.reset_index(drop = True, inplace = True)

In [134]:
empiric_panel_hitter[['Jugador','Anio_t']].head()

Unnamed: 0,Jugador,Anio_t
0,A.J. Burnett,2014
1,A.J. Burnett,2015
2,Aaron Harang,2015
3,Albert Pujols,2021
4,Alex Cobb,2018


In [135]:
empiric_panel_pitcher[['Jugador','Anio_t']].head()

Unnamed: 0,Jugador,Anio_t
0,Casey Janssen,2015
1,Chad Kuhl,2022
2,Chad Qualls,2016
3,Daisuke Matsuzaka,2014
4,Dan Haren,2013


In [136]:
"""hitter_names = empiric_panel_hitter.columns
for index in range(0,len(hitter_names)):
    print("Name: " + str(hitter_names[index]))
    print("index: " + str(index))"""

'hitter_names = empiric_panel_hitter.columns\nfor index in range(0,len(hitter_names)):\n    print("Name: " + str(hitter_names[index]))\n    print("index: " + str(index))'

Let's define the names of the columns of interest.

In [137]:
hitter_regular_stats = ['At_bats_2_t_1', 'At_bats_t_1',
                        'Bateos_2_t_1', 'Bateos_t_1',
                        'Bateos_promedio_2_t_1', 'Bateos_promedio_t_1',
                        'Dobles_2_t_1', 'Dobles_t_1',
                        'Home_runs_2_t_1', 'Home_runs_t_1',
                        'Juegos_iniciados_2_t_1', 'Juegos_iniciados_t_1', 
                        'Porcentaje_On_base_plus_slugging_2_t_1', 'Porcentaje_On_base_plus_slugging_t_1',
                        'Porcentaje_on_base_2_t_1', 'Porcentaje_on_base_t_1',
                        'Porcentaje_slugging_2_t_1', 'Porcentaje_slugging_t_1',
                        'Runs_batted_in_2_t_1', 'Runs_batted_in_t_1',
                        'Triples_2_t_1', 'Triples_t_1']
hitter_regular_stats = sorted(hitter_regular_stats)

In [138]:
# Hitter
for stat_name in hitter_regular_stats:
    # Auxiliary names for the per-team maximum (_H) and minimum (_L)
    max_stat_name = stat_name + '_H'
    min_stat_name = stat_name + '_L'
    
    # Group the statistic by team once and reuse the grouping
    team_stat = empiric_panel_hitter.groupby(by = "Acronimo_t")[stat_name]
    # Maximum per team
    max_stat = team_stat.max().rename(max_stat_name).reset_index()
    # Minimum per team
    min_stat = team_stat.min().rename(min_stat_name).reset_index()
    empiric_panel_hitter = empiric_panel_hitter.merge(max_stat, on = "Acronimo_t",
                                                      how = "left")
    empiric_panel_hitter = empiric_panel_hitter.merge(min_stat, on = "Acronimo_t",
                                                      how = "left")
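
A hedged alternative to building per-team tables and merging them back is `groupby(...).transform`, which broadcasts each team's maximum or minimum onto that team's rows. The sketch below uses a hypothetical toy frame with a single statistic; the `_H` and `_L` suffixes follow the naming used above.

```python
import pandas as pd

# Hypothetical toy panel: one lagged statistic for players on two teams
toy = pd.DataFrame({
    "Acronimo_t": ["NYY", "NYY", "BOS"],
    "Bateos_t_1": [0.5, 1.5, 0.8],
})

# transform broadcasts each team's max/min back onto that team's rows
by_team = toy.groupby("Acronimo_t")["Bateos_t_1"]
toy["Bateos_t_1_H"] = by_team.transform("max")
toy["Bateos_t_1_L"] = by_team.transform("min")

print(toy)
```

This avoids two merges per statistic, at the cost of not producing the intermediate per-team tables.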

In [139]:
"""hitter_names = empiric_panel_hitter.columns
for index in range(0,len(hitter_names)):
    print("Name: " + str(hitter_names[index]))
    print("index: " + str(index))"""

'hitter_names = empiric_panel_hitter.columns\nfor index in range(0,len(hitter_names)):\n    print("Name: " + str(hitter_names[index]))\n    print("index: " + str(index))'

In [141]:
start = empiric_panel_hitter.columns.get_loc('At_bats_2_t_1_H')
end = empiric_panel_hitter.columns.get_loc('Triples_t_1_L') + 1
empiric_panel_hitter.iloc[:,start:end]

Unnamed: 0,At_bats_2_t_1_H,At_bats_2_t_1_L,At_bats_t_1_H,At_bats_t_1_L,Bateos_2_t_1_H,Bateos_2_t_1_L,Bateos_promedio_2_t_1_H,Bateos_promedio_2_t_1_L,Bateos_promedio_t_1_H,Bateos_promedio_t_1_L,...,Porcentaje_slugging_t_1_H,Porcentaje_slugging_t_1_L,Runs_batted_in_2_t_1_H,Runs_batted_in_2_t_1_L,Runs_batted_in_t_1_H,Runs_batted_in_t_1_L,Triples_2_t_1_H,Triples_2_t_1_L,Triples_t_1_H,Triples_t_1_L
0,3.8809,0.0004,1.97,0.02,0.0289,0.0004,1.000000,0.004624,1.000,0.068,...,1.000,0.068,0.0049,0.0000,0.07,0.00,0.0,0.0,0.0,0.0
1,12.2500,0.0004,3.50,0.02,0.8100,0.0000,0.066049,0.000000,0.257,0.000,...,0.420,0.000,0.2116,0.0000,0.46,0.00,0.0,0.0,0.0,0.0
2,3.8809,0.0004,1.97,0.02,0.0289,0.0004,1.000000,0.004624,1.000,0.068,...,1.000,0.068,0.0049,0.0000,0.07,0.00,0.0,0.0,0.0,0.0
3,15.2100,0.0169,3.90,0.13,0.7569,0.0009,0.062500,0.010609,0.250,0.103,...,0.395,0.138,0.4096,0.0009,0.64,0.03,0.0,0.0,0.0,0.0
4,12.8164,0.0144,3.58,0.12,0.8836,0.0000,0.250000,0.000000,0.500,0.000,...,0.750,0.000,0.5329,0.0000,0.73,0.00,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
134,2.6896,0.0036,1.64,0.06,0.0841,0.0000,0.085264,0.000000,0.292,0.000,...,0.292,0.000,0.0064,0.0000,0.08,0.00,0.0,0.0,0.0,0.0
135,0.0256,0.0004,0.16,0.02,0.0009,0.0000,0.250000,0.000000,0.500,0.000,...,0.500,0.000,0.0000,0.0000,0.00,0.00,0.0,0.0,0.0,0.0
136,0.0256,0.0004,0.16,0.02,0.0009,0.0000,0.250000,0.000000,0.500,0.000,...,0.500,0.000,0.0000,0.0000,0.00,0.00,0.0,0.0,0.0,0.0
137,0.5625,0.0064,0.75,0.08,0.0289,0.0000,1.000000,0.000000,1.000,0.000,...,1.000,0.000,0.0064,0.0000,0.08,0.00,0.0,0.0,0.0,0.0


Let's repeat the same process for the pitchers.

In [142]:
"""pitcher_names = empiric_panel_pitcher.columns
for index in range(0,len(pitcher_names)):
    print("Name: " + str(pitcher_names[index]))
    print("index: " + str(index))"""

'pitcher_names = empiric_panel_pitcher.columns\nfor index in range(0,len(pitcher_names)):\n    print("Name: " + str(pitcher_names[index]))\n    print("index: " + str(index))'

As with the hitters, we start with the list of columns of interest for the fielders (pitchers).

In [143]:
pitcher_regular_stats = ['Bateos_2_t_1', 'Bateos_t_1',
                        'Carreras_2_t_1', 'Carreras_t_1',
                        'Carreras_ganadas_2_t_1', 'Carreras_ganadas_t_1',
                        'Comando_2_t_1', 'Comando_t_1',
                        'Control_2_t_1', 'Control_t_1',
                        'Dominio_2_t_1', 'Dominio_t_1', 
                        'ERA_2_t_1', 'ERA_t_1',
                        'Inning_pitched_2_t_1', 'Inning_pitched_t_1',
                        'Losses_2_t_1', 'Losses_t_1',
                        'Saves_2_t_1', 'Saves_t_1',
                        'Strike_outs_2_t_1', 'Strike_outs_t_1',
                        'WAR_2_t_1', 'WAR_t_1',
                        'WHIP_2_t_1', 'WHIP_t_1',
                        'Walks_2_t_1', 'Walks_t_1',
                        'Wins_2_t_1', 'Wins_t_1']
pitcher_regular_stats = sorted(pitcher_regular_stats)

In [144]:
# Pitcher
for stat_name in pitcher_regular_stats:
    # Auxiliary names for the per-team maximum (_H) and minimum (_L)
    max_stat_name = stat_name + '_H'
    min_stat_name = stat_name + '_L'
    
    # Group the statistic by team once and reuse the grouping
    team_stat = empiric_panel_pitcher.groupby(by = "Acronimo_t")[stat_name]
    # Maximum per team
    max_stat = team_stat.max().rename(max_stat_name).reset_index()
    # Minimum per team
    min_stat = team_stat.min().rename(min_stat_name).reset_index()
    empiric_panel_pitcher = empiric_panel_pitcher.merge(max_stat, on = "Acronimo_t",
                                                        how = "left")
    empiric_panel_pitcher = empiric_panel_pitcher.merge(min_stat, on = "Acronimo_t",
                                                        how = "left")

In [145]:
"""pitcher_names = empiric_panel_pitcher.columns
for index in range(0,len(pitcher_names)):
    print("Name: " + str(pitcher_names[index]))
    print("index: " + str(index))"""

'pitcher_names = empiric_panel_pitcher.columns\nfor index in range(0,len(pitcher_names)):\n    print("Name: " + str(pitcher_names[index]))\n    print("index: " + str(index))'

In [146]:
start = empiric_panel_pitcher.columns.get_loc('Bateos_2_t_1_H')
end = empiric_panel_pitcher.columns.get_loc('Wins_t_1_L') + 1
empiric_panel_pitcher.iloc[:,start:end]

Unnamed: 0,Bateos_2_t_1_H,Bateos_2_t_1_L,Bateos_t_1_H,Bateos_t_1_L,Carreras_2_t_1_H,Carreras_2_t_1_L,Carreras_ganadas_2_t_1_H,Carreras_ganadas_2_t_1_L,Carreras_ganadas_t_1_H,Carreras_ganadas_t_1_L,...,WHIP_t_1_H,WHIP_t_1_L,Walks_2_t_1_H,Walks_2_t_1_L,Walks_t_1_H,Walks_t_1_L,Wins_2_t_1_H,Wins_2_t_1_L,Wins_t_1_H,Wins_t_1_L
0,37.5769,0.8836,6.13,0.94,9.3636,0.1764,7.5076,0.1369,2.74,0.37,...,0.04,0.02,1.5129,0.0196,1.23,0.14,0.1521,0.0036,0.39,0.06
1,6.8121,0.5625,2.61,0.75,3.2041,0.1369,2.3716,0.1225,1.54,0.35,...,0.05,0.02,2.25,0.0225,1.5,0.15,0.0324,0.0025,0.18,0.05
2,6.8121,0.5625,2.61,0.75,3.2041,0.1369,2.3716,0.1225,1.54,0.35,...,0.05,0.02,2.25,0.0225,1.5,0.15,0.0324,0.0025,0.18,0.05
3,20.8849,2.3104,4.57,1.52,9.0,0.3249,7.3441,0.3249,2.71,0.57,...,0.18,0.03,5.2441,0.2704,2.29,0.52,0.1849,0.0196,0.43,0.14
4,37.5769,0.8836,6.13,0.94,9.3636,0.1764,7.5076,0.1369,2.74,0.37,...,0.04,0.02,1.5129,0.0196,1.23,0.14,0.1521,0.0036,0.39,0.06
5,37.5769,0.8836,6.13,0.94,9.3636,0.1764,7.5076,0.1369,2.74,0.37,...,0.04,0.02,1.5129,0.0196,1.23,0.14,0.1521,0.0036,0.39,0.06
6,0.8836,0.8836,0.94,0.94,0.1764,0.1764,0.1369,0.1369,0.37,0.37,...,0.02,0.02,0.0625,0.0625,0.25,0.25,0.0081,0.0081,0.09,0.09
7,37.5769,0.8836,6.13,0.94,9.3636,0.1764,7.5076,0.1369,2.74,0.37,...,0.04,0.02,1.5129,0.0196,1.23,0.14,0.1521,0.0036,0.39,0.06
8,0.8836,0.8836,0.94,0.94,0.1764,0.1764,0.1369,0.1369,0.37,0.37,...,0.02,0.02,0.0625,0.0625,0.25,0.25,0.0081,0.0081,0.09,0.09
9,37.5769,0.8836,6.13,0.94,9.3636,0.1764,7.5076,0.1369,2.74,0.37,...,0.04,0.02,1.5129,0.0196,1.23,0.14,0.1521,0.0036,0.39,0.06


In [147]:
empiric_panel_hitter.shape

(139, 179)

In [148]:
empiric_panel_pitcher.shape

(19, 201)

### Filtering players with more than two years of observations in our panel

In [149]:
# Hitters:
# count the number of observations for each player
counts_hitters = empiric_panel_hitter.groupby('Jugador').size()
# keep players with more than two observations
filtered_hitters = counts_hitters[counts_hitters > 2].index.tolist()

# Fielders:
# count the number of observations for each player
counts_fielders = empiric_panel_pitcher.groupby('Jugador').size()
# keep players with more than two observations
filtered_fielders = counts_fielders[counts_fielders > 2].index.tolist()

In [150]:
# Filter observations according to the previous list
# Hitter
empiric_panel_hitter = empiric_panel_hitter[empiric_panel_hitter['Jugador'].isin(filtered_hitters)]
empiric_panel_hitter.reset_index(drop = True,
                                 inplace = True)

# Pitcher
empiric_panel_pitcher = empiric_panel_pitcher[empiric_panel_pitcher['Jugador'].isin(filtered_fielders)]
empiric_panel_pitcher.reset_index(drop = True,
                                  inplace = True)
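
The count-then-filter step can be sketched in one pass with `transform("size")`, which attaches each player's number of rows to every row of that player. The frame `toy` is hypothetical; the threshold of more than two observations mirrors the cells above.

```python
import pandas as pd

# Hypothetical toy panel: Ana has 3 rows, Beto 2, Caro 1
toy = pd.DataFrame({
    "Jugador": ["Ana", "Ana", "Ana", "Beto", "Beto", "Caro"],
    "Anio_t": [2013, 2014, 2015, 2014, 2015, 2016],
})

# Number of rows per player, aligned with the original rows
n_obs = toy.groupby("Jugador")["Jugador"].transform("size")

# Keep only players with more than two observations
filtered = toy[n_obs > 2].reset_index(drop=True)

print(filtered)
```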

In [151]:
empiric_panel_hitter

Unnamed: 0,Acronimo_t,Acronimo_t_1,Altura_t,Altura_t_1,Anio_de_agente_libre_t,Anio_de_agente_libre_t_1,Anio_t,Anios_de_contrato_t,Anios_de_contrato_t_1,Antiguedad_t,...,Porcentaje_slugging_t_1_H,Porcentaje_slugging_t_1_L,Runs_batted_in_2_t_1_H,Runs_batted_in_2_t_1_L,Runs_batted_in_t_1_H,Runs_batted_in_t_1_L,Triples_2_t_1_H,Triples_2_t_1_L,Triples_t_1_H,Triples_t_1_L
0,NYY,NYY,6.2,6.2,2016,2015.0,2015,1,1.0,0,...,0.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,BOS,NYY,6.2,6.2,2018,2016.0,2016,2,1.0,0,...,0.5,0.0,0.1681,0.0,0.41,0.0,0.0,0.0,0.0,0.0
2,BOS,NYY,6.2,6.2,2018,2016.0,2016,2,1.0,0,...,0.5,0.0,0.1681,0.0,0.41,0.0,0.0,0.0,0.0,0.0
3,BOS,NYY,6.2,6.2,2018,2016.0,2016,2,1.0,0,...,0.5,0.0,0.1681,0.0,0.41,0.0,0.0,0.0,0.0,0.0
4,BOS,NYY,6.2,6.2,2018,2016.0,2016,2,1.0,0,...,0.5,0.0,0.1681,0.0,0.41,0.0,0.0,0.0,0.0,0.0
5,WSH,LAA,6.5,6.5,2014,2013.0,2013,1,4.0,0,...,0.292,0.0,0.0064,0.0,0.08,0.0,0.0,0.0,0.0,0.0
6,WSH,LAA,6.5,6.5,2014,2013.0,2013,1,4.0,0,...,0.292,0.0,0.0064,0.0,0.08,0.0,0.0,0.0,0.0,0.0
7,LAD,WSH,6.5,6.5,2016,2014.0,2014,1,1.0,0,...,0.395,0.138,0.4096,0.0009,0.64,0.03,0.0,0.0,0.0,0.0
8,NYY,NYY,6.233202,6.2415,2014,2013.0,2013,1,1.0,1,...,0.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,NYY,LAD,6.233202,6.2415,2014,2012.0,2013,1,1.0,1,...,0.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [152]:
empiric_panel_pitcher

Unnamed: 0,Acronimo_t,Acronimo_t_1,Altura_t,Altura_t_1,Anio_de_agente_libre_t,Anio_de_agente_libre_t_1,Anio_t,Anios_de_contrato_t,Anios_de_contrato_t_1,Antiguedad_t,...,WHIP_t_1_H,WHIP_t_1_L,Walks_2_t_1_H,Walks_2_t_1_L,Walks_t_1_H,Walks_t_1_L,Wins_2_t_1_H,Wins_2_t_1_L,Wins_t_1_H,Wins_t_1_L
0,ATL,ATL,5.698559,6.3,0,0.0,2015,1,1.0,0,...,0.02,0.02,0.0625,0.0625,0.25,0.25,0.0081,0.0081,0.09,0.09
1,WSH,ATL,6.3,6.3,0,0.0,2015,1,1.0,0,...,0.04,0.02,1.5129,0.0196,1.23,0.14,0.1521,0.0036,0.39,0.06
2,ATL,ATL,5.698559,6.3,0,0.0,2015,1,1.0,0,...,0.02,0.02,0.0625,0.0625,0.25,0.25,0.0081,0.0081,0.09,0.09
3,WSH,ATL,6.3,6.3,0,0.0,2015,1,1.0,0,...,0.04,0.02,1.5129,0.0196,1.23,0.14,0.1521,0.0036,0.39,0.06


For consistency, we impute the columns that hold *string* data with the word *No*, which indicates that the player had no team, no position, and so on.

In [153]:
empiric_panel_hitter.select_dtypes(include =['object'],
                                   exclude = ['int64','float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Acronimo_t           20 non-null     object
 1   Acronimo_t_1         20 non-null     object
 2   Anio_t               20 non-null     object
 3   Equipo_anterior      20 non-null     object
 4   Equipo_t             20 non-null     object
 5   Equipo_t_1           20 non-null     object
 6   Estado_t             20 non-null     object
 7   Jugador              20 non-null     object
 8   Posicion_t           20 non-null     object
 9   Posicion_t_1         20 non-null     object
 10  Status_agente_libre  20 non-null     object
dtypes: object(11)
memory usage: 1.8+ KB


In [154]:
empiric_panel_hitter[['Acronimo_t',
                      'Equipo_anterior',
                      'Equipo_t',
                      'Estado_t',
                      'Posicion_t',
                      'Status_agente_libre',
                      'Acronimo_t_1',
                      'Equipo_t_1',
                      'Posicion_t_1']] = \
empiric_panel_hitter[['Acronimo_t',
                      'Equipo_anterior',
                      'Equipo_t',
                      'Estado_t',
                      'Posicion_t',
                      'Status_agente_libre',
                      'Acronimo_t_1',
                      'Equipo_t_1',
                      'Posicion_t_1']].fillna('No')
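
A minimal sketch of the same imputation where the string columns are selected by dtype instead of being listed by hand. The toy frame and its values are hypothetical. Note that, unlike the explicit list above, selecting by dtype would also pick up columns such as `Jugador` or `Anio_t` if they are stored as `object`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in string and numeric columns
toy = pd.DataFrame({
    "Equipo_t_1": pd.Series(["NYY", None], dtype=object),
    "Posicion_t_1": pd.Series([None, "SP"], dtype=object),
    "Bateos_t_1": [0.5, np.nan],
})

# Pick the string (object) columns programmatically instead of listing them
string_columns = toy.select_dtypes(include=["object"]).columns

# Fill only those columns; numeric NaN values are left for a later step
toy[string_columns] = toy[string_columns].fillna("No")

print(toy)
```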

Let's check whether the imputation worked.

In [155]:
empiric_panel_hitter.select_dtypes(include =['object'],
                                   exclude = ['int64','float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Acronimo_t           20 non-null     object
 1   Acronimo_t_1         20 non-null     object
 2   Anio_t               20 non-null     object
 3   Equipo_anterior      20 non-null     object
 4   Equipo_t             20 non-null     object
 5   Equipo_t_1           20 non-null     object
 6   Estado_t             20 non-null     object
 7   Jugador              20 non-null     object
 8   Posicion_t           20 non-null     object
 9   Posicion_t_1         20 non-null     object
 10  Status_agente_libre  20 non-null     object
dtypes: object(11)
memory usage: 1.8+ KB


Now for the pitchers.

In [156]:
empiric_panel_pitcher.select_dtypes(include =['object'],
                                    exclude = ['int64','float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Acronimo_t           4 non-null      object
 1   Acronimo_t_1         4 non-null      object
 2   Anio_t               4 non-null      object
 3   Equipo_anterior      4 non-null      object
 4   Equipo_t             4 non-null      object
 5   Equipo_t_1           4 non-null      object
 6   Estado_t             4 non-null      object
 7   Jugador              4 non-null      object
 8   Posicion_t           4 non-null      object
 9   Posicion_t_1         4 non-null      object
 10  Status_agente_libre  4 non-null      object
dtypes: object(11)
memory usage: 480.0+ bytes


In [157]:
empiric_panel_pitcher[['Acronimo_t',
                       'Equipo_anterior',
                       'Equipo_t',
                       'Estado_t',
                       'Posicion_t',
                       'Status_agente_libre',
                       'Acronimo_t_1',
                       'Equipo_t_1',
                       'Posicion_t_1']] = \
empiric_panel_pitcher[['Acronimo_t',
                       'Equipo_anterior',
                       'Equipo_t',
                       'Estado_t',
                       'Posicion_t',
                       'Status_agente_libre',
                       'Acronimo_t_1',
                       'Equipo_t_1',
                       'Posicion_t_1']].fillna('No')

In [158]:
empiric_panel_pitcher.select_dtypes(include =['object'],
                                    exclude = ['int64','float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Acronimo_t           4 non-null      object
 1   Acronimo_t_1         4 non-null      object
 2   Anio_t               4 non-null      object
 3   Equipo_anterior      4 non-null      object
 4   Equipo_t             4 non-null      object
 5   Equipo_t_1           4 non-null      object
 6   Estado_t             4 non-null      object
 7   Jugador              4 non-null      object
 8   Posicion_t           4 non-null      object
 9   Posicion_t_1         4 non-null      object
 10  Status_agente_libre  4 non-null      object
dtypes: object(11)
memory usage: 480.0+ bytes


In both cases the imputation succeeded. It does not matter that *No* was also written for contract periods, since this avoids problems with possible instruments built from dummies.

Next we do the same for the numeric columns, imputing 0 because it reflects the absence of performance. However, the columns holding the highest value of each performance measure are imputed separately, because imputing 0 there will cause problems once the transformation for the model is applied.

In [159]:
# Add string to every element in a list
# Hitter
hitter_high_stat = [stat + '_H' for stat in hitter_regular_stats]
hitter_low_stat = [stat + '_L' for stat in hitter_regular_stats]
# Fielder
pitcher_high_stat = [stat + '_H' for stat in pitcher_regular_stats]
pitcher_low_stat = [stat + '_L' for stat in pitcher_regular_stats]

In [160]:
empiric_panel_hitter[hitter_high_stat] = empiric_panel_hitter[hitter_high_stat].ffill()
empiric_panel_pitcher[pitcher_high_stat] = empiric_panel_pitcher[pitcher_high_stat].ffill()

In [161]:
empiric_panel_hitter[hitter_high_stat].head(20)

Unnamed: 0,At_bats_2_t_1_H,At_bats_t_1_H,Bateos_2_t_1_H,Bateos_promedio_2_t_1_H,Bateos_promedio_t_1_H,Bateos_t_1_H,Dobles_2_t_1_H,Dobles_t_1_H,Home_runs_2_t_1_H,Home_runs_t_1_H,...,Porcentaje_On_base_plus_slugging_2_t_1_H,Porcentaje_On_base_plus_slugging_t_1_H,Porcentaje_on_base_2_t_1_H,Porcentaje_on_base_t_1_H,Porcentaje_slugging_2_t_1_H,Porcentaje_slugging_t_1_H,Runs_batted_in_2_t_1_H,Runs_batted_in_t_1_H,Triples_2_t_1_H,Triples_t_1_H
0,0.0361,0.19,0.0036,0.110889,0.333,0.06,0.0,0.0,0.0,0.0,...,0.580644,0.762,0.184041,0.429,0.110889,0.333,0.0,0.0,0.0,0.0
1,10.4976,3.24,0.3721,0.25,0.5,0.61,0.01,0.1,0.0361,0.19,...,1.0,1.0,0.25,0.5,0.25,0.5,0.1681,0.41,0.0,0.0
2,10.4976,3.24,0.3721,0.25,0.5,0.61,0.01,0.1,0.0361,0.19,...,1.0,1.0,0.25,0.5,0.25,0.5,0.1681,0.41,0.0,0.0
3,10.4976,3.24,0.3721,0.25,0.5,0.61,0.01,0.1,0.0361,0.19,...,1.0,1.0,0.25,0.5,0.25,0.5,0.1681,0.41,0.0,0.0
4,10.4976,3.24,0.3721,0.25,0.5,0.61,0.01,0.1,0.0361,0.19,...,1.0,1.0,0.25,0.5,0.25,0.5,0.1681,0.41,0.0,0.0
5,2.6896,1.64,0.0841,0.085264,0.292,0.29,0.0016,0.04,0.0,0.0,...,0.339889,0.583,0.094864,0.308,0.085264,0.292,0.0064,0.08,0.0,0.0
6,2.6896,1.64,0.0841,0.085264,0.292,0.29,0.0016,0.04,0.0,0.0,...,0.339889,0.583,0.094864,0.308,0.085264,0.292,0.0064,0.08,0.0,0.0
7,15.21,3.9,0.7569,0.0625,0.25,0.87,0.0625,0.25,0.0225,0.15,...,0.442225,0.665,0.0729,0.27,0.156025,0.395,0.4096,0.64,0.0,0.0
8,0.0361,0.19,0.0036,0.110889,0.333,0.06,0.0,0.0,0.0,0.0,...,0.580644,0.762,0.184041,0.429,0.110889,0.333,0.0,0.0,0.0,0.0
9,0.0361,0.19,0.0036,0.110889,0.333,0.06,0.0,0.0,0.0,0.0,...,0.580644,0.762,0.184041,0.429,0.110889,0.333,0.0,0.0,0.0,0.0


In [162]:
empiric_panel_hitter.fillna(0, inplace = True)
empiric_panel_pitcher.fillna(0, inplace = True)
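
As a minimal sketch of the two-step imputation above, consider a toy panel. The frame `toy`, its values and the list `high_cols` are hypothetical and only illustrate the order of the steps: the "_H" columns are forward-filled first, and only then is every remaining gap set to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel; names and values are for illustration only
toy = pd.DataFrame({
    "Home_runs_t_1":   [0.19, np.nan, 0.05],
    "Home_runs_t_1_H": [0.19, np.nan, np.nan],
})
high_cols = ["Home_runs_t_1_H"]

# Step 1: carry the last observed maximum forward in the "_H" columns
toy[high_cols] = toy[high_cols].ffill()
# Step 2: every remaining gap is treated as absence of performance
toy = toy.fillna(0)
print(toy)
```

Note that a plain forward fill does not know about player boundaries: if rows of different players are adjacent, a value can be carried from one player to the next unless the fill is done per group.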

Let us check whether any column still contains a *NaN* entry:

In [163]:
# Hitter
hitter_nan = empiric_panel_hitter.isna().any()
hitter_name = empiric_panel_hitter.columns
for con in range(0, len(hitter_nan)):
    if hitter_nan.iloc[con]:
        print("Name: " + str(hitter_name[con]))

In [164]:
# Pitcher
pitcher_nan = empiric_panel_pitcher.isna().any()
pitcher_name = empiric_panel_pitcher.columns
for con in range(0, len(pitcher_nan)):
    if pitcher_nan.iloc[con]:
        print("Name: " + str(pitcher_name[con]))

Let us obtain, for each team, the maximum of each performance measure of the *t_1* periods reached over all the seasons.

To avoid problems with the type of the *ID*, let us create a numeric one so that we do not have to use the players' names.

In [165]:
# Hitter
empiric_panel_hitter['id'] =  empiric_panel_hitter.groupby(['Jugador']).ngroup()
empiric_panel_hitter.reset_index(drop = True, inplace = True)
# Pitcher
empiric_panel_pitcher['id'] =  empiric_panel_pitcher.groupby(['Jugador']).ngroup()
empiric_panel_pitcher.reset_index(drop = True, inplace = True)
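
A small sketch of how `ngroup` assigns the numeric identifier. The player names below are hypothetical. With the default settings the groups are numbered following the sorted order of the key, not the order of appearance in the frame.

```python
import pandas as pd

# Hypothetical players, deliberately not in alphabetical order
toy = pd.DataFrame({"Jugador": ["Player B", "Player A", "Player B"]})

# Each distinct name receives one integer, numbered by sorted key
toy["id"] = toy.groupby(["Jugador"]).ngroup()
print(toy)
```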

Let us build the $Y$ of the empirical model from the differences between the square roots of consecutive salary observations.

In [166]:
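# Differences between consecutive square roots
def sqrt_dif(X):
    S = []
    for i in range(0, len(X)-1):
        d = np.sqrt(X[i+1])-np.sqrt(X[i])
        S.append(d)
    # Pad so that len(S) == len(X): repeat the last difference,
    # or use 0 when there is a single observation
    S.append(S[-1] if S else 0)
    return S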
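
The same transformation can be sketched without an explicit loop. The function name `sqrt_dif_vectorized` is hypothetical and not part of the notebook. It relies on `np.diff` and keeps the padding rule of `sqrt_dif`: the last difference is repeated, and a single observation yields 0.

```python
import numpy as np

def sqrt_dif_vectorized(X):
    """Hypothetical vectorized counterpart of sqrt_dif."""
    X = np.asarray(X, dtype=float)
    if X.size < 2:
        # No successor to compare with: pad with 0, as sqrt_dif does
        return [0.0]
    d = np.diff(np.sqrt(X))              # sqrt(X[i+1]) - sqrt(X[i])
    return np.append(d, d[-1]).tolist()  # repeat the last difference

example = sqrt_dif_vectorized([4.0, 9.0, 16.0])
print(example)
```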

In [167]:
Y_hitter = []
for p in empiric_panel_hitter["id"].unique():
    # Select the (log) salaries of each player separately
    X = empiric_panel_hitter[empiric_panel_hitter["id"] == p]["ln_Sueldo_ajustado_t"].values
    # Apply the function
    S = sqrt_dif(X)
    # Append the results in order
    Y_hitter = np.concatenate((Y_hitter, S))
# Add the column:
empiric_panel_hitter["Y"] = Y_hitter
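
The loop above concatenates the results player by player, so the new column only lines up with the frame when the rows of each player are contiguous. A possible alternative, sketched here on a hypothetical toy frame, is `groupby(...).transform`, which returns values aligned with the original row order. This is an assumption about how one might guard against misalignment, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

def sqrt_dif(X):
    # Same logic as the function defined earlier
    S = [np.sqrt(X[i + 1]) - np.sqrt(X[i]) for i in range(len(X) - 1)]
    S.append(S[-1] if S else 0)
    return S

# Hypothetical toy panel whose rows are NOT grouped by player
toy = pd.DataFrame({
    "id": [0, 1, 0, 1],
    "ln_Sueldo_ajustado_t": [4.0, 1.0, 9.0, 16.0],
})

# transform keeps the result aligned with the original index
toy["Y"] = toy.groupby("id")["ln_Sueldo_ajustado_t"].transform(
    lambda s: pd.Series(sqrt_dif(s.to_numpy()), index=s.index)
)
print(toy)
```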

In [168]:
empiric_panel_hitter["Y"]

0     0.122518
1     0.000000
2     0.000000
3     0.000000
4     0.000000
5     0.000000
6    -0.020691
7    -0.020691
8     0.000000
9     0.007931
10    0.007931
11    0.146246
12   -0.074131
13   -0.074131
14   -0.028512
15   -0.027716
16   -0.027716
17    0.000000
18    0.329486
19    0.329486
Name: Y, dtype: float64

In [169]:
Y_pitcher = []
for p in empiric_panel_pitcher["id"].unique():
    # Select the (log) salaries of each player separately
    X = empiric_panel_pitcher[empiric_panel_pitcher["id"] == p]["ln_Sueldo_ajustado_t"].values
    # Apply the function
    S = sqrt_dif(X)
    # Append the results in order
    Y_pitcher = np.concatenate((Y_pitcher, S))
# Add the column:
empiric_panel_pitcher["Y"] = Y_pitcher

In [170]:
empiric_panel_pitcher["Y"]

0    0.198092
1   -0.198092
2    0.198092
3    0.198092
Name: Y, dtype: float64

Let us build the dummy variables *I* of the empirical model.

In [171]:
# Number of stats (the high and low lists have the same length)
end_hitter_name = int((len(hitter_high_stat) + len(hitter_low_stat))/2)

In [172]:
hitter_regular_stats

['At_bats_2_t_1',
 'At_bats_t_1',
 'Bateos_2_t_1',
 'Bateos_promedio_2_t_1',
 'Bateos_promedio_t_1',
 'Bateos_t_1',
 'Dobles_2_t_1',
 'Dobles_t_1',
 'Home_runs_2_t_1',
 'Home_runs_t_1',
 'Juegos_iniciados_2_t_1',
 'Juegos_iniciados_t_1',
 'Porcentaje_On_base_plus_slugging_2_t_1',
 'Porcentaje_On_base_plus_slugging_t_1',
 'Porcentaje_on_base_2_t_1',
 'Porcentaje_on_base_t_1',
 'Porcentaje_slugging_2_t_1',
 'Porcentaje_slugging_t_1',
 'Runs_batted_in_2_t_1',
 'Runs_batted_in_t_1',
 'Triples_2_t_1',
 'Triples_t_1']

In [173]:
# Hitter
for sport_stat in range(0,end_hitter_name):
    I_hitter = []
    for y,max_stat,min_stat in zip(empiric_panel_hitter[hitter_regular_stats[sport_stat]],
                                   empiric_panel_hitter[hitter_high_stat[sport_stat]],
                                   empiric_panel_hitter[hitter_low_stat[sport_stat]]):
        # Dummy condition
        if y > (max_stat + min_stat)/2:
            I_hitter.append(0)
        else: 
            I_hitter.append(1)
    
    I_name = hitter_regular_stats[sport_stat] + '_I'
    empiric_panel_hitter[I_name] = I_hitter
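
The dummy condition can also be written in vectorized form. The sketch below uses a hypothetical toy frame and `np.where`, and applies the same rule as the loop: the dummy is 0 when the stat exceeds the midpoint between its "_H" and "_L" values, and 1 otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so that the midpoint is exactly 2.0
toy = pd.DataFrame({
    "At_bats_t_1":   [3.0, 1.0, 2.0],
    "At_bats_t_1_H": [4.0, 4.0, 4.0],
    "At_bats_t_1_L": [0.0, 0.0, 0.0],
})

midpoint = (toy["At_bats_t_1_H"] + toy["At_bats_t_1_L"]) / 2
# 0 above the midpoint, 1 otherwise (ties fall into the 1 branch)
toy["At_bats_t_1_I"] = np.where(toy["At_bats_t_1"] > midpoint, 0, 1)
print(toy)
```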

Let us look at the results.

In [174]:
"""hitter_names = empiric_panel_hitter.columns
for index in range(0,len(hitter_names)):
    print("Name: " + str(hitter_names[index]))
    print("index: " + str(index))"""

'hitter_names = empiric_panel_hitter.columns\nfor index in range(0,len(hitter_names)):\n    print("Name: " + str(hitter_names[index]))\n    print("index: " + str(index))'

In [175]:
# Hitter
hitter_nan = empiric_panel_hitter.isna().any()
hitter_name = empiric_panel_hitter.columns
for con in range(0, len(hitter_nan)):
    if hitter_nan.iloc[con]:
        print("Name: " + str(hitter_name[con]))

In [176]:
start = empiric_panel_hitter.columns.get_loc('At_bats_2_t_1_I')
end = empiric_panel_hitter.columns.get_loc('Triples_t_1_I') + 1
empiric_panel_hitter.iloc[:,start:end]

Unnamed: 0,At_bats_2_t_1_I,At_bats_t_1_I,Bateos_2_t_1_I,Bateos_promedio_2_t_1_I,Bateos_promedio_t_1_I,Bateos_t_1_I,Dobles_2_t_1_I,Dobles_t_1_I,Home_runs_2_t_1_I,Home_runs_t_1_I,...,Porcentaje_On_base_plus_slugging_2_t_1_I,Porcentaje_On_base_plus_slugging_t_1_I,Porcentaje_on_base_2_t_1_I,Porcentaje_on_base_t_1_I,Porcentaje_slugging_2_t_1_I,Porcentaje_slugging_t_1_I,Runs_batted_in_2_t_1_I,Runs_batted_in_t_1_I,Triples_2_t_1_I,Triples_t_1_I
0,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,1,1,1,0,0,1,1,1,1,1,...,0,0,0,0,0,0,1,1,1,1
2,1,1,1,0,0,1,1,1,1,1,...,0,0,0,0,0,0,1,1,1,1
3,1,1,1,0,0,1,1,1,1,1,...,0,0,0,0,0,0,1,1,1,1
4,1,1,1,0,0,1,1,1,1,1,...,0,0,0,0,0,0,1,1,1,1
5,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
6,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
7,1,1,1,1,1,1,1,1,1,1,...,1,1,1,0,1,1,1,1,1,1
8,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
9,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


Let us repeat the same process for the pitchers.

In [177]:
# Number of stats (the high and low lists have the same length)
end_pitcher_name = int((len(pitcher_high_stat) + len(pitcher_low_stat))/2)

In [178]:
# Pitcher
for sport_stat in range(0,end_pitcher_name):
    I_pitcher = []
    for y,max_stat,min_stat in zip(empiric_panel_pitcher[pitcher_regular_stats[sport_stat]],
                                   empiric_panel_pitcher[pitcher_high_stat[sport_stat]],
                                   empiric_panel_pitcher[pitcher_low_stat[sport_stat]]):
        if y > (max_stat + min_stat)/2:
            I_pitcher.append(0)
        else: 
            I_pitcher.append(1)
    
    I_name = pitcher_regular_stats[sport_stat] + '_I'
    empiric_panel_pitcher[I_name] = I_pitcher

In [179]:
"""pitcher_names = empiric_panel_pitcher.columns
for index in range(0,len(pitcher_names)):
    print("Name: " + str(pitcher_names[index]))
    print("index: " + str(index))"""

'pitcher_names = empiric_panel_pitcher.columns\nfor index in range(0,len(pitcher_names)):\n    print("Name: " + str(pitcher_names[index]))\n    print("index: " + str(index))'

In [180]:
# Pitcher
pitcher_nan = empiric_panel_pitcher.isna().any()
pitcher_name = empiric_panel_pitcher.columns
for con in range(0, len(pitcher_nan)):
    if pitcher_nan.iloc[con]:
        print("Name: " + str(pitcher_name[con]))

In [181]:
start = empiric_panel_pitcher.columns.get_loc('Bateos_2_t_1_I')
end = empiric_panel_pitcher.columns.get_loc('Wins_t_1_I') + 1
empiric_panel_pitcher.iloc[:,start:end]

Unnamed: 0,Bateos_2_t_1_I,Bateos_t_1_I,Carreras_2_t_1_I,Carreras_ganadas_2_t_1_I,Carreras_ganadas_t_1_I,Carreras_t_1_I,Comando_2_t_1_I,Comando_t_1_I,Control_2_t_1_I,Control_t_1_I,...,Strike_outs_2_t_1_I,Strike_outs_t_1_I,WAR_2_t_1_I,WAR_t_1_I,WHIP_2_t_1_I,WHIP_t_1_I,Walks_2_t_1_I,Walks_t_1_I,Wins_2_t_1_I,Wins_t_1_I
0,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,1,1,1,1,1,1,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
2,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1


Let us obtain the auxiliary variables.

In [182]:
# Good practice
hitter_high_stat = sorted(hitter_high_stat)
# Dummy
I_hitter = sorted([stat + '_I' for stat in hitter_regular_stats])

Let us verify that they have the same length.

In [183]:
len(I_hitter) == len(hitter_high_stat) == len(hitter_regular_stats)

True

In [184]:
# Hitter
for stat in range(0,len(hitter_regular_stats)):
    # Sign given by the dummy I, and stat scaled by the square root of its maximum
    i = (-1)**(empiric_panel_hitter[I_hitter[stat]])
    x = empiric_panel_hitter[hitter_regular_stats[stat]]/np.sqrt(empiric_panel_hitter[hitter_high_stat[stat]])
    # Auxiliary variable
    X_hitter = i*x
    
    # X name
    name = 'X_' + hitter_regular_stats[stat]
    empiric_panel_hitter[name] = X_hitter

In [185]:
# Hitter
hitter_nan = empiric_panel_hitter.isna().any()
hitter_name = empiric_panel_hitter.columns
for con in range(0, len(hitter_nan)):
    if hitter_nan.iloc[con]:
        print("Name: " + str(hitter_name[con]))

Name: X_Dobles_2_t_1
Name: X_Dobles_t_1
Name: X_Home_runs_2_t_1
Name: X_Home_runs_t_1
Name: X_Runs_batted_in_2_t_1
Name: X_Runs_batted_in_t_1
Name: X_Triples_2_t_1
Name: X_Triples_t_1
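
A likely source of these *NaN* entries is the division in the auxiliary variable: when both the stat and its "_H" maximum are 0, the quotient is 0/0. The sketch below reproduces this on a hypothetical toy frame. It is an illustration of the arithmetic under that assumption, not a confirmed diagnosis of the panel.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first one has a zero stat and a zero maximum
toy = pd.DataFrame({
    "Triples_t_1":   [0.0, 0.1],
    "Triples_t_1_H": [0.0, 0.4],
    "Triples_t_1_I": [1, 0],
})

sign = (-1) ** toy["Triples_t_1_I"]
X = sign * toy["Triples_t_1"] / np.sqrt(toy["Triples_t_1_H"])

# Rows where the maximum is 0 are the candidates for 0/0
zero_max = toy["Triples_t_1_H"] == 0
print(X)
print(zero_max)
```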


In [186]:
# define a list of suffixes to drop
drop_suffix = ['_I', '_L', '_H']

# Use a list comprehension to filter the columns to drop
hitter_cols_to_drop = [col for col in empiric_panel_hitter.columns if col.endswith(tuple(drop_suffix))]

# Drop the selected columns
empiric_panel_hitter = empiric_panel_hitter.drop(hitter_cols_to_drop,
                                                 axis=1)

In [187]:
empiric_panel_hitter.shape

(20, 159)

In [188]:
empiric_panel_hitter['Posicion_t'].value_counts()

SP    12
RF     5
RP     3
Name: Posicion_t, dtype: int64

Let us look at all the variables of the panel.

In [189]:
panel_columns = empiric_panel_hitter.columns

for name in panel_columns:
    print(name)

Acronimo_t
Acronimo_t_1
Altura_t
Altura_t_1
Anio_de_agente_libre_t
Anio_de_agente_libre_t_1
Anio_t
Anios_de_contrato_t
Anios_de_contrato_t_1
Antiguedad_t
Antiguedad_t_1
At-bats_2_t
At-bats_t
At_bats_2_t
At_bats_2_t_1
At_bats_t
At_bats_t_1
Bateos_2_t
Bateos_2_t_1
Bateos_promedio_2_t
Bateos_promedio_2_t_1
Bateos_promedio_t
Bateos_promedio_t_1
Bateos_t
Bateos_t_1
Bono_por_firma_t
Bono_por_firma_t_1
Cantidad de equipos_t
Cantidad_agentes_libres_t
Cantidad_agentes_libres_t_1
Cantidad_de_equipos_t
Cantidad_de_equipos_t_1
Dobles_2_t
Dobles_2_t_1
Dobles_t
Dobles_t_1
Edad_al_firmar_t
Edad_al_firmar_t_1
Edad_t
Equipo_anterior
Equipo_t
Equipo_t_1
Estado_t
Ganancias_t
Ganancias_t_1
Home-runs_2_t
Home-runs_t
Home_runs_2_t
Home_runs_2_t_1
Home_runs_t
Home_runs_t_1
Juegos totales_t
Juegos_iniciados_2_t
Juegos_iniciados_2_t_1
Juegos_iniciados_t
Juegos_iniciados_t_1
Juegos_t
Juegos_t_1
Juegos_totales_t
Juegos_totales_t_1
Jugador
Pago_efectivo_t
Pago_efectivo_t_1
Pennants won_t
Pennants_won_t
Pennants_w

In [192]:
general_dynamic_path = 'ETL_Data/Panel/Per_Game/Dynamic_model/'

In [193]:
empiric_panel_hitter.to_csv(general_dynamic_path + 'panel_hitters_pg_t_1' + '.csv',
                            index = False)

Now, let us filter the columns that contain the performance measures of the **t_1** periods. To do so, let us first find the indices of those variables.

In [194]:
# Hitter
for stat in range(0,len(hitter_regular_stats)):
    # Stat
    print(colored(hitter_regular_stats[stat], "cyan"))
    
    # X name
    name = 'X_' + hitter_regular_stats[stat]
    
    # OLS variables
    Y = empiric_panel_hitter['Y'].tolist()
    X = empiric_panel_hitter[name].tolist()
    X = sm.add_constant(X)
    
    # Model
    model = sm.OLS(Y, X,
                   missing = 'drop').fit()
    print(model.summary())
    print("\n")

At_bats_2_t_1
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.036
Model:                            OLS   Adj. R-squared:                 -0.018
Method:                 Least Squares   F-statistic:                    0.6634
Date:                Mon, 20 Mar 2023   Prob (F-statistic):              0.426
Time:                        12:56:34   Log-Likelihood:                 15.989
No. Observations:                  20   AIC:                            -27.98
Df Residuals:                      18   BIC:                            -25.99
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0395      0.

ValueError: zero-size array to reduction operation maximum which has no identity

Let us repeat the same procedure for the fielders (the pitcher panel).

In [195]:
# Good practice
pitcher_high_stat = sorted(pitcher_high_stat)
# Dummy
I_pitcher = sorted([stat + '_I' for stat in pitcher_regular_stats])

Let us verify that they have the same length.

In [196]:
len(I_pitcher) == len(pitcher_high_stat) == len(pitcher_regular_stats)

True

In [197]:
# Pitcher
for stat in range(0,len(pitcher_regular_stats)):
    # Sign given by the dummy I, and stat scaled by the square root of its maximum
    i = (-1)**(empiric_panel_pitcher[I_pitcher[stat]])
    x = empiric_panel_pitcher[pitcher_regular_stats[stat]]/np.sqrt(empiric_panel_pitcher[pitcher_high_stat[stat]])
    # Auxiliary variable
    X_pitcher = i*x
    
    # X name
    name = 'X_' + pitcher_regular_stats[stat]
    empiric_panel_pitcher[name] = X_pitcher

In [198]:
# Pitcher
pitcher_nan = empiric_panel_pitcher.isna().any()
pitcher_name = empiric_panel_pitcher.columns
for con in range(0, len(pitcher_nan)):
    if pitcher_nan.iloc[con]:
        print("Name: " + str(pitcher_name[con]))

Name: X_WAR_2_t_1
Name: X_WAR_t_1


We need to drop the columns that end in '_L', '_H' or '_I'.

In [199]:
# define a list of suffixes to drop
drop_suffix = ['_I', '_L', '_H']

# Use a list comprehension to filter the columns to drop
fielder_cols_to_drop = [col for col in empiric_panel_pitcher.columns if col.endswith(tuple(drop_suffix))]

# Drop the selected columns
empiric_panel_pitcher = empiric_panel_pitcher.drop(fielder_cols_to_drop,
                                                   axis=1)

In [200]:
empiric_panel_pitcher.shape

(4, 173)

In [201]:
empiric_panel_pitcher['Posicion_t'].value_counts()

RP    4
Name: Posicion_t, dtype: int64

In [202]:
panel_columns = empiric_panel_pitcher.columns

for name in panel_columns:
    print(name)

Acronimo_t
Acronimo_t_1
Altura_t
Altura_t_1
Anio_de_agente_libre_t
Anio_de_agente_libre_t_1
Anio_t
Anios_de_contrato_t
Anios_de_contrato_t_1
Antiguedad_t
Antiguedad_t_1
Bateos_2_t
Bateos_2_t_1
Bateos_t
Bateos_t_1
Bono_por_firma_t
Bono_por_firma_t_1
Cantidad de equipos_t
Cantidad_agentes_libres_t
Cantidad_agentes_libres_t_1
Cantidad_de_equipos_t
Cantidad_de_equipos_t_1
Carreras_2_t
Carreras_2_t_1
Carreras_ganadas_2_t
Carreras_ganadas_2_t_1
Carreras_ganadas_t
Carreras_ganadas_t_1
Carreras_t
Carreras_t_1
Comando_2_t
Comando_2_t_1
Comando_t
Comando_t_1
Control_2_t
Control_2_t_1
Control_t
Control_t_1
Dominio_2_t
Dominio_2_t_1
Dominio_t
Dominio_t_1
ERA_2_t
ERA_2_t_1
ERA_t
ERA_t_1
Edad_al_firmar_t
Edad_al_firmar_t_1
Edad_t
Equipo_anterior
Equipo_t
Equipo_t_1
Estado_t
Ganancias_t
Ganancias_t_1
Inning_pitched_2_t
Inning_pitched_2_t_1
Inning_pitched_t
Inning_pitched_t_1
Juegos totales_t
Juegos_iniciados_t
Juegos_iniciados_t_1
Juegos_t
Juegos_t_1
Juegos_totales_t
Juegos_totales_t_1
Jugador
Losses

In [203]:
empiric_panel_pitcher.to_csv(general_dynamic_path + 'panel_fielders_pg_t_1' + '.csv',
                             index = False)

In [204]:
# Pitcher
for stat in range(0,len(pitcher_regular_stats)):
    # Stat
    print(colored(pitcher_regular_stats[stat], "cyan"))
    
    # X name
    name = 'X_' + pitcher_regular_stats[stat]
    
    # OLS variables
    Y = empiric_panel_pitcher['Y'].tolist()
    X = empiric_panel_pitcher[name].tolist()
    X = sm.add_constant(X)
    
    # Model
    model = sm.OLS(Y, X,
                   missing = 'drop').fit()
    print(model.summary())
    print("\n")

Bateos_2_t_1
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.333
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.000
Date:                Mon, 20 Mar 2023   Prob (F-statistic):              0.423
Time:                        12:59:43   Log-Likelihood:                 2.1866
No. Observations:                   4   AIC:                           -0.3733
Df Residuals:                       2   BIC:                            -1.601
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0359      0.1

ValueError: zero-size array to reduction operation maximum which has no identity