# Players dataframes with t-1 variables

This script builds a clean database split into hitters and pitchers. Because it is entirely analogous to the script for variables of the same period $t$, only the explanations of the reused code are omitted.

- **Viewing the contents of the databases.**
- **Cleaning and exporting the database.**
- **Creating an indicator of whether the player is a free agent.**

Let us import the required modules and set the desired configuration.

In [None]:
import pandas as pd
import numpy as np
import math
import os
import warnings
import statsmodels.api as sm
from matplotlib.colors import ListedColormap
from termcolor import colored
print('Modules imported')

In [None]:
# Settings
warnings.filterwarnings('ignore')

In [None]:
# Working directory
print("Previous working directory: " + str(os.getcwd()))
# Let us change it
os.chdir('/home/usuario/Documentos/Github/Proyectos/MLB_HN/')

In [None]:
# Check the current working directory
print(os.getcwd())
# The directory above is the correct one; if it were not, we would do the following:
path = '/home/usuario/Documentos/Github/Proyectos/MLB_HN'
os.chdir(path)  # os.chdir returns None, so the directory is printed afterwards
print("New working directory: " + os.getcwd())

## Viewing the databases

### Teams by state

In [None]:
states = 'Data/Teams/team_states.csv'
df_states = pd.read_csv(states)

In [None]:
df_states.head()

### Acronyms

This table will serve as an intermediate key to unify the team databases.

In [None]:
acronym = 'Data/Teams/team_acronym.csv'
df_acronym = pd.read_csv(acronym)

In [None]:
df_acronym.head()

Let us merge this dataframe with the teams-by-state dataframe.

In [None]:
acronym_state = pd.merge(df_states, df_acronym, on = 'Estado')

In [None]:
acronym_state.head()
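
Since `pd.merge` performs an inner join by default, states that appear in only one of the two tables are silently dropped. A minimal sketch of how this could be checked, using toy stand-ins for `df_states` and `df_acronym` (the values and the `Cantidad_de_equipos` column are illustrative assumptions, not the real data):

```python
import pandas as pd

# Toy stand-ins for df_states and df_acronym (illustrative values only)
toy_states = pd.DataFrame({'Estado': ['California', 'Texas', 'Ohio'],
                           'Cantidad_de_equipos': [5, 2, 2]})
toy_acronym = pd.DataFrame({'Estado': ['California', 'Texas', 'Florida'],
                            'Acronimo': ['LAD', 'HOU', 'MIA']})

# how='outer' keeps unmatched rows; indicator=True adds a '_merge' column
check = pd.merge(toy_states, toy_acronym, on='Estado', how='outer', indicator=True)

# Rows present in only one of the two tables
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Estado', '_merge']])
```

An empty `unmatched` table would indicate that the inner join loses no state.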

In this case, the variable names are self-explanatory.

## Algorithm for building the databases

Next, the code is generalized so that the previous *dataframes* can be obtained for a set of consecutive years, as in our case.

In [None]:
# Auxiliary paths and constants:
free_agents = 'Data/Free_Agents/free_agents_'
hitting = 'Data/Not_All_Variables/Statistics/Hitting/hitting_'
pitching = 'Data/Not_All_Variables/Statistics/Pitching/pitching_'
salary = 'Data/Not_All_Variables/Salary/salary_'
teams = 'ETL_Data/Agent/Teams/free_agents_team_'
csv = '.csv'
period = 12
# Originals:
df_free_agents = [None]*period
df_hitting = [None]*period
df_pitching = [None]*period
df_salary = [None]*period
df_teams = [None]*period
# Copies:
df_free_agents_copy = [None]*period
df_hitting_copy = [None]*period
df_pitching_copy = [None]*period
df_salary_copy = [None]*period
df_teams_copy = [None]*period
# Final products:
df_pitchers = [None]*period
df_hitters = [None]*period
df_pitchers_free_agents = [None]*period
df_hitters_free_agents = [None]*period
df_pitchers_no_free_agents = [None]*period
df_hitters_no_free_agents = [None]*period
df_panel_hitters = [None]*period
df_panel_pitchers = [None]*period

Let us read all the files and create the copies.

In [None]:
for year in range(0,period):    
    df_free_agents[year] = pd.read_csv(free_agents + str(2011 + year) + csv)
    df_hitting[year] = pd.read_csv(hitting + str(2011 + year) + csv)
    df_pitching[year] = pd.read_csv(pitching + str(2011 + year) + csv)
    df_salary[year] = pd.read_csv(salary + str(2011 + year) + csv)
    df_teams[year] = pd.read_csv(teams + str(2011 + year) + csv)
    
    df_free_agents_copy[year] = df_free_agents[year].copy()
    df_hitting_copy[year] = df_hitting[year].copy()
    df_pitching_copy[year] = df_pitching[year].copy()
    df_salary_copy[year] = df_salary[year].copy()
    df_teams_copy[year] = df_teams[year].copy()
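
Throughout the notebook, list position `year` corresponds to season `2011 + year`, so position 0 is 2011 and position 11 is 2022. A small sketch of how the file names are assembled (the helper `yearly_path` is hypothetical and only illustrates the string concatenation used in the loop above):

```python
def yearly_path(prefix, year_index, first_year=2011, extension='.csv'):
    """Build the path of the file for list position `year_index`."""
    return prefix + str(first_year + year_index) + extension

# Paths of the twelve free-agent files, one per season
paths = [yearly_path('Data/Free_Agents/free_agents_', i) for i in range(12)]
print(paths[0])
print(paths[-1])
```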

Let us process each database separately. Wherever they appear, however, we drop the rank column and *Cash2023*.

Since we do not want the season-year column to be repeated, we drop the *Year* column from the free-agent database. Because the contract years also appear in the salary database, we prefer to keep that column there, since the salary database is more general than the free-agent one; this is why it is removed from the latter.

The team a free agent moves to is already given by the team column in the salary and sports-statistics databases, so *Team From To* is dropped from the free-agent database.

Because salaries are what matter for this analysis, we drop the team column, as well as the position column, from the sports-statistics databases of all players.

In [None]:
for year in range(0,period):
    # Drop columns; errors='ignore' skips any column that is absent,
    # so the cell no longer raises a KeyError when only some of them exist
    df_free_agents_copy[year].drop(columns = ['Rank', 'Year', 'Pos', 'Team From To'],
                                   errors = 'ignore', inplace = True)
    df_salary_copy[year].drop(columns = ['Rank'], errors = 'ignore', inplace = True)
    df_hitting_copy[year].drop(columns = ['Rank', 'Cash2023', 'Team', 'Pos'],
                               errors = 'ignore', inplace = True)
    df_pitching_copy[year].drop(columns = ['Rank', 'Cash2023', 'Team', 'Pos'],
                                errors = 'ignore', inplace = True)

Because some columns have names starting with *Unnamed*, we remove them with a general method, shown below:

In [None]:
for year in range(0,period):
    # Free agents, salaries, hitters and pitchers: drop every column
    # whose name contains 'Unnamed' (case-insensitive)
    for frames in (df_free_agents_copy, df_salary_copy, df_hitting_copy, df_pitching_copy):
        unnamed = frames[year].columns[frames[year].columns.str.contains('Unnamed', case = False)]
        frames[year].drop(unnamed, axis = 1, inplace = True)
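
To make the filter concrete, here is a minimal sketch on a toy dataframe (the column names are made up for illustration). Columns like `Unnamed: 0` typically appear when a CSV was saved together with its index.

```python
import pandas as pd

# Toy dataframe with two leftover columns (illustrative names)
toy = pd.DataFrame({'Unnamed: 0': [0, 1],
                    'Player': ['A', 'B'],
                    'unnamed: 7': [None, None]})

# Boolean mask over the column names; case=False also matches 'unnamed'
is_unnamed = toy.columns.str.contains('Unnamed', case=False)
cleaned = toy.drop(columns=toy.columns[is_unnamed])
print(list(cleaned.columns))
```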

Let us verify that those unwanted columns are gone.

In [None]:
df_free_agents_copy[9].columns

In [None]:
df_salary_copy[11].columns

In [None]:
df_hitting_copy[2].columns

In [None]:
df_pitching_copy[5].columns

#### Free agents

The team that signs the free agent is not kept, since this information is also contained in the database that makes the _ETL_ processing easier.

In [None]:
for year in range(0,period):
    df_free_agents_copy[year] = df_free_agents_copy[year].rename(columns = {'Player':'Jugador',
                                'Status':'Status_agente_libre', 'Team From':'Equipo_anterior',
                                'Value':'Valor_contrato', 'AAV':'Valor_promedio_contrato',
                                'YRS':'Anios_de_contrato'})
    
    # Strip '$' and ',' and convert to numbers; regex=False makes '$' a literal
    # character instead of the regular-expression end-of-string anchor
    for column in ['Valor_contrato', 'Valor_promedio_contrato']:
        cleaned = df_free_agents_copy[year][column].str.replace("$", "", regex = False).str.replace(",", "", regex = False)
        df_free_agents_copy[year][column] = pd.to_numeric(cleaned)
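
The same currency cleaning can be sketched on a toy series. The helper name `money_to_number` and the sample values are assumptions for illustration; the point is that `regex=False` guarantees `$` is treated as a plain character, whatever the pandas version.

```python
import pandas as pd

def money_to_number(series):
    """Turn strings such as '$1,500,000' into numbers (missing values stay NaN)."""
    stripped = series.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
    return pd.to_numeric(stripped)

toy_values = pd.Series(['$1,500,000', '$750,000', None])
numbers = money_to_number(toy_values)
print(numbers)
```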

Let us look at the dimensions of the databases for reference.

In [None]:
for year in range(0,period):
    print(df_free_agents_copy[year].shape)

And also at the data type of each column.

In [None]:
df_free_agents_copy[6].info()

Let us add the free agents to every season in which their contract is in force, instead of only having observations in the year they signed. To see how many rows are added, recall the initial sizes of the databases printed above.

In [None]:
period_t = period - 1
df_contracts = [None]*(period_t)

In [None]:
for year in range(1,period_t):
    
    max_year_contract = max(df_free_agents_copy[year]['Anios_de_contrato'])
    years = max_year_contract - 1
    df_contracts[year] = [None]*years
    
    for incremento in range(0,years):
        diff_t = 1 + incremento
        real_year = 2011 + year + diff_t
        year_bound = 2022

        if real_year <= year_bound:
            df_contracts[year][incremento] = df_free_agents_copy[year][df_free_agents_copy[year]['Anios_de_contrato'] > diff_t]

In [None]:
for year in range(1,period_t):
    years = len(df_contracts[year])
    
    for incremento in range(0,years):
        if incremento < 2:
            diff_t = 1 + incremento
            real_year = 2011 + year + diff_t
            year_bound = 2022

            if real_year <= year_bound:
                frames = [df_free_agents_copy[year + diff_t], df_contracts[year][incremento]]

                df_free_agents_copy[year + diff_t] = pd.concat(frames)

                df_free_agents_copy[year + diff_t].reset_index(drop = True, inplace = True)
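
The idea of the two cells above can be sketched on toy data: a player who signs for $n$ years in season $s$ is copied into seasons $s+1, \dots, s+n-1$, as long as those seasons exist. This is a simplified illustration under assumed data; unlike the cells above, it starts at the first season and does not cap the number of seasons carried forward.

```python
import pandas as pd

# Toy signings per season index (0 -> first season); illustrative values
signings = {
    0: pd.DataFrame({'Jugador': ['A', 'B'], 'Anios_de_contrato': [3, 1]}),
    1: pd.DataFrame({'Jugador': ['C'], 'Anios_de_contrato': [2]}),
    2: pd.DataFrame({'Jugador': ['D'], 'Anios_de_contrato': [1]}),
}
last_index = max(signings)

# Start from the original signings and append contracts still in force
expanded = {i: df.copy() for i, df in signings.items()}
for i, df in signings.items():
    longest = int(df['Anios_de_contrato'].max())
    for diff_t in range(1, longest):
        if i + diff_t > last_index:
            break  # no data for seasons beyond the last one
        # Contracts longer than diff_t years are still active diff_t seasons later
        still_active = df[df['Anios_de_contrato'] > diff_t]
        expanded[i + diff_t] = pd.concat([expanded[i + diff_t], still_active],
                                         ignore_index=True)

for i in expanded:
    print(i, list(expanded[i]['Jugador']))
```

Player `A` (three years, signed in season 0) therefore appears in seasons 0, 1 and 2, while `B` (one year) appears only in season 0.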

Let us look at the results.

In [None]:
for year in range(0,period):
    print(df_free_agents_copy[year].shape)

#### Salaries

Since the salaries will be attached to the _hitters_ and _pitchers_ databases, their _ETL_ process is carried out first.

In [None]:
for year in range(0,period):
    # Rename columns
    df_salary_copy[year] = df_salary_copy[year].rename(columns = {'Player':'Jugador',
                            'BaseSalary':'Sueldo_base', 'SigningBonus':'Bono_por_firma',
                            'Payroll Salary':'Sueldo_regular', 'Adj Salary':'Sueldo_ajustado',
                            'CONT YR':'Anios_de_contrato', 'CONT VALUE':'Valor_del_contrato',
                            'Earnings':'Ganancias', 'FA Year':'Anio_de_agente_libre',
                            'Sign Age':'Edad_al_firmar', 'Age':'Edad', 'Weight':'Peso',
                            'Height':'Altura', 'Year':'Anio', 'Pos':'Posicion',
                            'Salary%':'Sueldo_porcentual', 'Cash':'Pago_efectivo',
                            'AAV':'Valor_contrato_promedio', 'Team':'Acronimo'})
    
    # Convert to the appropriate data type: strip '$' and ',' from the money columns
    # (regex=False makes '$' a literal character rather than a regex anchor)
    money_columns = ['Sueldo_base', 'Sueldo_regular', 'Sueldo_ajustado', 'Valor_del_contrato',
                     'Bono_por_firma', 'Ganancias', 'Pago_efectivo', 'Valor_contrato_promedio']
    for column in money_columns:
        cleaned = df_salary_copy[year][column].str.replace("$", "", regex = False).str.replace(",", "", regex = False)
        df_salary_copy[year][column] = pd.to_numeric(cleaned)
    
    # Height: strip the foot and inch marks, convert to a number and divide by 10
    height = df_salary_copy[year]['Altura'].str.replace("\"", "", regex = False).str.replace("'", "", regex = False)
    df_salary_copy[year]['Altura'] = pd.to_numeric(height)/10
    # Replace the zeros with the column mean
    height_mean = df_salary_copy[year]['Altura'].mean(skipna=True)
    df_salary_copy[year]['Altura'] = df_salary_copy[year].Altura.mask(df_salary_copy[year].Altura == 0, height_mean)
    
    df_salary_copy[year]['Anio_de_agente_libre'] = pd.to_numeric(df_salary_copy[year]['Anio_de_agente_libre'])
    df_salary_copy[year]['Anios_de_contrato'] = pd.to_numeric(df_salary_copy[year]['Anios_de_contrato'])
    df_salary_copy[year]['Edad'] = pd.to_numeric(df_salary_copy[year]['Edad'])

Because of some peculiarities of the database, the column with the age at signing is handled separately, taking advantage of the fact that most of the incorrect entries are strings longer than two characters.

In [None]:
def impute_sign_age(df, row, year):
    """Reconstruct the age at signing of row `row` from the remaining columns."""
    # If the age column holds a valid value
    if df['Edad'].iloc[row] > 0:
        # If no free-agent year is recorded, assume it is the following season
        if df['Anio_de_agente_libre'].iloc[row] == 0:
            ag_year = year + 2011 + 1
        else:
            ag_year = df['Anio_de_agente_libre'].iloc[row]
        # First year of the contract
        ini_year = ag_year - df['Anios_de_contrato'].iloc[row]
        # Years elapsed since the first year
        dif_years = year + 2011 - ini_year
        # Age at signing
        return df['Edad'].iloc[row] - dif_years
    # If the age column does not hold a coherent value, fall back to 18
    return 18

for year in range(0,period):
    df_salary_copy[year]['Edad_al_firmar'] = df_salary_copy[year]['Edad_al_firmar'].map(str)
    col = df_salary_copy[year].columns.get_loc('Edad_al_firmar')

    for edad in range(0,df_salary_copy[year].shape[0]):
        # The string has exactly two characters (a plausible age): convert it
        if len(df_salary_copy[year]['Edad_al_firmar'].iloc[edad]) == 2:
            df_salary_copy[year].iloc[edad, col] = pd.to_numeric(df_salary_copy[year]['Edad_al_firmar'].iloc[edad])
        # Any other length is treated as an incorrect entry: impute it
        else:
            df_salary_copy[year].iloc[edad, col] = impute_sign_age(df_salary_copy[year], edad, year)

        # Negative number: impute it as well
        if df_salary_copy[year]['Edad_al_firmar'].iloc[edad] < 0:
            df_salary_copy[year].iloc[edad, col] = impute_sign_age(df_salary_copy[year], edad, year)

    # Convert the whole column to a numeric type
    df_salary_copy[year]['Edad_al_firmar'] = pd.to_numeric(df_salary_copy[year]['Edad_al_firmar'])
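
The arithmetic behind the imputation may be easier to follow with a worked example. The function below is a standalone sketch with hypothetical argument names; it reproduces the same steps on plain numbers.

```python
def signing_age(age, season, contract_years, free_agent_year=0):
    """Age at signing, derived from the age in `season` and the contract length."""
    # When no free-agent year is recorded (0), assume it is the following season
    ag_year = season + 1 if free_agent_year == 0 else free_agent_year
    # First year of the contract
    ini_year = ag_year - contract_years
    # Seasons elapsed since the contract started
    dif_years = season - ini_year
    return age - dif_years

# A 30-year-old in 2015 with a 4-year contract ending in free agency in 2018:
# the contract started in 2014, one season earlier
example = signing_age(age=30, season=2015, contract_years=4, free_agent_year=2018)
print(example)
```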

We can verify whether the cells of the age-at-signing column were cleaned properly by filtering the entries that are not integers and by checking whether the whole column could be converted to an integer type.

In [None]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year]['Edad_al_firmar'].shape[0]):
        if type(df_salary_copy[year]['Edad_al_firmar'].iloc[edad]) != np.int64:
            print(type(df_salary_copy[year]['Edad_al_firmar'].iloc[edad]))

In [None]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year]['Edad_al_firmar'].shape[0]):
        if df_salary_copy[year]['Edad_al_firmar'].iloc[edad] < 0:
            print(df_salary_copy[year]['Edad_al_firmar'].iloc[edad])

In [None]:
#for year in range(0,period):
#    print(type(df_salary_copy[year][['Edad_al_firmar']].info()))

In [None]:
#for year in range(0,period):
#    print(year)
#    for edad in range(0,df_salary_copy[year]['Edad_al_firmar'].shape[0]):
#        print(str(df_salary_copy[year]['Edad_al_firmar'].iloc[edad]) + ' ' + str(edad))

We still need to correct the entries of the age column whose values are below zero. This is done using the remaining columns.

In [None]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year]['Edad'].shape[0]):
        if df_salary_copy[year]['Edad'].iloc[edad] < 0:
            print(year)
            print(df_salary_copy[year]['Edad'].iloc[edad])

In [None]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year].shape[0]):
        # Condition for imputing:
        if df_salary_copy[year]['Edad'].iloc[edad] <= 0:
            # If no free-agent year is recorded:
            if df_salary_copy[year]['Anio_de_agente_libre'].iloc[edad] == 0:
                ag_year = year + 2011 + 1
            # If a free-agent year is recorded:
            else:
                ag_year = df_salary_copy[year]['Anio_de_agente_libre'].iloc[edad]
            # First year of the contract
            ini_year = ag_year - df_salary_copy[year]['Anios_de_contrato'].iloc[edad]
            # Years elapsed since the first year
            dif_years = year + 2011 - ini_year
            # Age during the season:
            seasson_age = df_salary_copy[year]['Edad_al_firmar'].iloc[edad] + dif_years
            # Assignment (a single .iloc call avoids chained assignment)
            df_salary_copy[year].iloc[edad, df_salary_copy[year].columns.get_loc('Edad')] = seasson_age

Let us check that no negative age remains.

In [None]:
for year in range(0,period):
    for edad in range(0,df_salary_copy[year]['Edad'].shape[0]):
        if df_salary_copy[year]['Edad'].iloc[edad] < 0:
            print(year)
            print(str(df_salary_copy[year]['Edad'].iloc[edad]) + ' ' + str(edad))

With the data imputed, we can now create the column holding the player's seniority under the contract.

In [None]:
for year in range(0,period):
    df_salary_copy[year]['Antiguedad'] = df_salary_copy[year]['Edad'] - df_salary_copy[year]['Edad_al_firmar']

Finally, let us convert the year column to string so that it is treated as a category rather than a numeric variable.

In [None]:
for year in range(0,period):
    df_salary_copy[year]['Anio'] = df_salary_copy[year]['Anio'].map(str)

In [None]:
df_salary_copy[5].info()

#### Hitters

In [None]:
for year in range(0,period):
    # Rename columns
    df_hitting_copy[year] = df_hitting_copy[year].rename(columns = {'Player':'Jugador',
                            'GP':'Juegos', 'GP%':'Porcentaje_juegos',
                            'AB':'At-bats', 'H':'Bateos', 'GS':'Juegos_iniciados',
                            'GS%':'Porcentaje_juegos_iniciados', 'RBI':'Runs-batted-in',
                            'HR':'Home-runs', 'AVG':'Bateos_promedio',
                            '2B':'Dobles', '3B':'Triples', 'OPS':'Porcentaje_On-base-plus-slugging',
                            'SLG':'Porcentaje_slugging', 'OBP':'Porcentaje_on-base'})

In [None]:
for year in range(0,period):
    print(df_hitting_copy[year].shape)

In [None]:
df_hitting_copy[5].info()

In [None]:
df_hitting_copy[5].columns

#### Pitchers

In [None]:
for year in range(0,period):
    # Rename columns
    df_pitching_copy[year] = df_pitching_copy[year].rename(columns = {'Player':'Jugador',
                             'GP':'Juegos', 'GS':'Juegos_iniciados', 'IP':'Inning_pitched',
                             'H':'Bateos', 'R':'Carreras', 'ER':'Carreras_ganadas',
                             'BB':'Walks', 'SO':'Strike-outs', 'W':'Wins', 'L':'Losses',
                             'SV':'Saves'})

In [None]:
for year in range(0,period):
    print(df_pitching_copy[year].shape)

In [None]:
df_pitching_copy[5].info()

## Adding variables suggested by articles

The first variables we add are the squares of all the sports statistics, together with the following variables:

- DOMINANCE = $\text{Strike-outs}/(9 \cdot \text{Innings pitched})$
- CONTROL = $\text{Walks}/(9 \cdot \text{Innings pitched})$
- COMMAND = $\text{Strike-outs}/\text{Walks}$

In [None]:
for year in range(0,period):
    df_pitching_copy[year]['Dominio'] = df_pitching_copy[year]['Strike-outs']/(9*df_pitching_copy[year]['Inning_pitched'])
    df_pitching_copy[year]['Control'] = df_pitching_copy[year]['Walks']/(9*df_pitching_copy[year]['Inning_pitched'])
    df_pitching_copy[year]['Comando'] = df_pitching_copy[year]['Strike-outs']/df_pitching_copy[year]['Walks']
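
These ratios are undefined whenever the denominator is zero. A toy example (illustrative values, same column names as above) shows what pandas produces in each case: `NaN` for 0/0 and `inf` for a nonzero numerator over zero.

```python
import numpy as np
import pandas as pd

# Three toy pitchers: a regular one, one with no activity, one with 0 innings
toy = pd.DataFrame({'Strike-outs': [90, 0, 5],
                    'Walks': [30, 0, 0],
                    'Inning_pitched': [100.0, 0.0, 0.0]})

toy['Dominio'] = toy['Strike-outs']/(9*toy['Inning_pitched'])
toy['Control'] = toy['Walks']/(9*toy['Inning_pitched'])
toy['Comando'] = toy['Strike-outs']/toy['Walks']
print(toy[['Dominio', 'Control', 'Comando']])
```

This is why the notebook later imputes both the `NaN` and the infinite entries of these three columns.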

In [None]:
for year in range(0,period):
    print(df_pitching_copy[year].shape)

In [None]:
df_pitching_copy[2].info()

We can check which entries of the database hold infinite values.

In [None]:
"""
for year in range(0,period):
    print(str(2011 + year))
    for name in df_pitching_copy[year].columns:
        print(name)
        if type(name) != str:
            for element in range(0,len(df_pitching_copy[year][name])):
                if math.isinf(df_pitching_copy[year][name].iloc[element]) == True:
                    print(str(element) +  '  ' + str(df_pitching_copy[year][name].iloc[element]))
    print("")
"""

Following the suggestion of some articles, let us compute the logarithm of the salaries.

In [None]:
for year in range(0,period):
    df_salary_copy[year]['ln_Sueldo_base'] = np.log(df_salary_copy[year]['Sueldo_base'])
    df_salary_copy[year]['ln_Sueldo_ajustado'] = np.log(df_salary_copy[year]['Sueldo_ajustado'])
    df_salary_copy[year]['ln_Sueldo_regular'] = np.log(df_salary_copy[year]['Sueldo_regular'])
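
Note that `np.log` is the natural logarithm and that a salary recorded as 0 maps to `-inf`, while a missing salary stays `NaN`. A short sketch with illustrative values:

```python
import numpy as np
import pandas as pd

toy_salary = pd.Series([500000.0, 0.0, np.nan])

# errstate only silences the divide-by-zero warning raised by log(0)
with np.errstate(divide='ignore'):
    ln_salary = np.log(toy_salary)
print(ln_salary)
```

These `-inf` entries are the ones located and replaced further below.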

In [None]:
df_salary_copy[2].info()

Since some columns contain _NaN_ or _NULL_ values, we choose to impute them.

The undefined and infinite values generated by the new variables are replaced according to the case:

- 0/0: 0
- num/0: the maximum of the corresponding column

In [None]:
def impute_with_mean(df, column, threshold):
    """Replace NaN and values <= threshold with the mean of the values above it."""
    mean = df.loc[df[column] > threshold, column].mean()
    # Assign the result back instead of using inplace=True on a column selection
    df[column] = df[column].fillna(mean)
    df[column] = df[column].mask(df[column] <= threshold, mean)

for year in range(0,period):
    # Salaries
    impute_with_mean(df_salary_copy[year], 'Altura', 4.9)
    impute_with_mean(df_salary_copy[year], 'Peso', 0)
    
    # Pitchers
    for column in ['WAR', 'Dominio', 'Control', 'Comando']:
        impute_with_mean(df_pitching_copy[year], column, 0)
    
    # Hitters
    impute_with_mean(df_hitting_copy[year], 'WAR', 0)

In [None]:
for year in range(0,period):   
    # Conditions
    con_dom_1 = df_pitching_copy[year]['Strike-outs'] == 0
    con_con_1 = df_pitching_copy[year]['Walks'] == 0
    con_com_1 = df_pitching_copy[year]['Strike-outs'] == 0
                 
    # Imputation for the 0/0 case
    df_pitching_copy[year].loc[con_dom_1, "Dominio"] = 0
    df_pitching_copy[year].loc[con_con_1, "Control"] = 0
    df_pitching_copy[year].loc[con_com_1, "Comando"] = 0

In [None]:
for year in range(0,period):   
    # Maxima
    max_dom = df_pitching_copy[year]['Strike-outs'].max()/9
    max_con = df_pitching_copy[year]['Walks'].max()/9
    max_com = df_pitching_copy[year]['Strike-outs'].max()
    
    # Turn infinities into NaNs (assigned back instead of inplace on a column selection)
    for column in ['Dominio', 'Control', 'Comando']:
        df_pitching_copy[year][column] = df_pitching_copy[year][column].replace([np.inf, -np.inf], np.nan)
    
    # Imputation
    df_pitching_copy[year]['Dominio'] = df_pitching_copy[year]['Dominio'].fillna(max_dom)
    df_pitching_copy[year]['Control'] = df_pitching_copy[year]['Control'].fillna(max_con)
    df_pitching_copy[year]['Comando'] = df_pitching_copy[year]['Comando'].fillna(max_com)
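
The two-case rule can be traced on a toy example for `Dominio` alone (illustrative values): the 0/0 entry becomes 0, and the num/0 entry becomes the largest strike-out count divided by 9, the bound used in the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Strike-outs': [90, 0, 5],
                    'Inning_pitched': [100.0, 0.0, 0.0]})
toy['Dominio'] = toy['Strike-outs']/(9*toy['Inning_pitched'])  # 0.1, NaN, inf

# Case 0/0: no strike-outs, so the ratio is set to 0
toy.loc[toy['Strike-outs'] == 0, 'Dominio'] = 0

# Case num/0: turn infinities into NaN, then fill with the bound
max_dom = toy['Strike-outs'].max()/9
toy['Dominio'] = toy['Dominio'].replace([np.inf, -np.inf], np.nan).fillna(max_dom)
print(toy['Dominio'].tolist())
```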

Let us verify that there are no more problems with infinite values.

In [None]:
"""
for year in range(0,period):
    print(str(2011 + year))
    for name in df_pitching_copy[year].columns:
        print(name)
        if type(name) != str:
            for element in range(0,len(df_pitching_copy[year][name])):
                if math.isinf(df_pitching_copy[year][name].iloc[element]) == True:
                    print(str(element) +  '  ' + str(df_pitching_copy[year][name].iloc[element]))
    print("")
"""

Likewise, let us count the *NaN* values that remain.

In [None]:
for year in range(0,period):
    print('Year: ' + str(2011 + year))
    # Inside a loop the counts must be printed explicitly to be displayed
    print('Hitters:')
    print(df_hitting_copy[year].isna().sum())
    print('Pitchers:')
    print(df_pitching_copy[year].isna().sum())
    print('Free agents:')
    print(df_free_agents_copy[year].isna().sum())
    print('Salaries:')
    print(df_salary_copy[year].isna().sum())
    print("")

Now, let us repeat this process for the salary database.

In [None]:
salary_names = ['ln_Sueldo_ajustado', 'ln_Sueldo_base', 'ln_Sueldo_regular']

In [None]:
for name in salary_names:
    print(name)
    
    for year in range(0,period):
        print(str(2011 + year))
        for element in range(0,len(df_salary_copy[year][name])):
            if df_salary_copy[year][name].iloc[element] <= 0:
                print(str(element) +  '  ' + str(df_salary_copy[year][name].iloc[element]))
        print("")

On inspecting the errors, we notice that only the adjusted salaries are unknown and that they were set to $0$. We use the natural logarithm of the regular salary (the column computed above with `np.log`) to replace that value.

In [None]:
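for year in range(0,period):
    # Replace negative values (including -inf from log(0)) with the log of the regular salary;
    # the result is assigned back instead of using inplace=True on a column selection
    df_salary_copy[year]['ln_Sueldo_ajustado'] = df_salary_copy[year]['ln_Sueldo_ajustado'].mask(
        df_salary_copy[year]['ln_Sueldo_ajustado'] < 0,
        df_salary_copy[year]['ln_Sueldo_regular'])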

In [None]:
for year in range(0,period):
    print(str(2011 + year))
    for element in range(0,len(df_salary_copy[year]['ln_Sueldo_ajustado'])):
        if df_salary_copy[year]['ln_Sueldo_ajustado'].iloc[element] <= 0:
            print(str(element) +  '  ' + str(df_salary_copy[year]['ln_Sueldo_ajustado'].iloc[element]))
    print("")

In [None]:
for year in range(0,period):
    print("Adjusted: " + str(df_salary_copy[year]['ln_Sueldo_ajustado'].mean())
          + '\n'
          + 'Regular: ' + str(df_salary_copy[year]['ln_Sueldo_regular'].mean()))

Indeed, there are no _NaN_ or _infinite_ values left.

To make the creation of the squared variables more efficient, we do it by extracting the indices of the columns of interest.

In [None]:
df_hitting_copy[0].columns

In [None]:
df_pitching_copy[1].columns

In [None]:
def get_col_indices(df, names):
    return df.columns.get_indexer(names)

In [None]:
hitting_names = ['Juegos_iniciados', 'Porcentaje_juegos_iniciados', 'At-bats', 'Bateos',
                  'Dobles', 'Triples', 'Home-runs', 'Runs-batted-in', 'Bateos_promedio',
                  'Porcentaje_on-base', 'Porcentaje_slugging', 'TVS',
                  'Porcentaje_On-base-plus-slugging', 'WAR']	
pitching_names = ['Inning_pitched', 'Bateos', 'Carreras',
                  'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins', 'Losses',
                  'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Dominio', 'Control',
                  'Comando']

To simplify the code, let us check whether the indices are the same in every database.

In [None]:
print('Hitters:')
for year in range(0,period):
    print(get_col_indices(df_hitting_copy[year], hitting_names))
    
print('Pitchers:')
for year in range(0,period):
    print(get_col_indices(df_pitching_copy[year], pitching_names))

In [None]:
hitting_indexes = list(get_col_indices(df_hitting_copy[0], hitting_names))
pitching_indexes = list(get_col_indices(df_pitching_copy[0], pitching_names))

In [None]:
for year in range(0,period):
    # Hitters:
    for hitter_name in hitting_indexes:
        df_hitting_copy[year][df_hitting_copy[year].columns[hitter_name] + '_2'] = np.power(df_hitting_copy[year][df_hitting_copy[year].columns[hitter_name]], 2)
    # Pitchers:
    for pitcher_name in pitching_indexes:
        df_pitching_copy[year][df_pitching_copy[year].columns[pitcher_name] + '_2'] = np.power(df_pitching_copy[year][df_pitching_copy[year].columns[pitcher_name]], 2)
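
The same squared variables can also be built by column name, without going through positions. The following is only a sketch on a toy dataframe (the frame and `stat_names` are assumptions for the example), not a replacement for the loop above.

```python
import pandas as pd

toy = pd.DataFrame({'Jugador': ['A', 'B'], 'WAR': [2.0, -1.5], 'Dobles': [10, 3]})
stat_names = ['WAR', 'Dobles']

# Square the selected columns and give the results the '_2' suffix
squared = toy[stat_names].pow(2).add_suffix('_2')

# Attach the squared columns next to the original ones
toy = pd.concat([toy, squared], axis=1)
print(toy.columns.tolist())
```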

Let us look at the final result.

In [None]:
df_hitting_copy[2].info()

In [None]:
df_pitching_copy[2].info()

## Merging the databases
### Aggregated team data

All that remains is to add the relevant data for the team each player belongs to, using the database with the number of teams per state.

In [None]:
df_teams_copy[7].info()

In [None]:
acronym_state.info()

In [None]:
for year in range(0,period):
    df_teams_copy[year] = pd.merge(df_teams_copy[year], acronym_state, on = ['Equipo','Acronimo'])

In [None]:
df_teams_copy[7].info()

In [None]:
df_salary_copy[7].info()

Now let us merge the team databases into the salary databases.

In [None]:
for year in range(0,period):
    df_salary_copy[year] = pd.merge(df_teams_copy[year], df_salary_copy[year], on = 'Acronimo')

In [None]:
df_salary_copy[0].info()

Since most players play both offense and defense, we have to remove the duplicates in the position column.

In [None]:
for year in range(0,period):
    df_hitting_copy[year] = pd.merge(df_hitting_copy[year], df_salary_copy[year], on = 'Jugador')
    df_pitching_copy[year] = pd.merge(df_pitching_copy[year], df_salary_copy[year], on = 'Jugador')
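
To see why duplicated records matter in a merge on `Jugador`, consider the toy case below. The frames and the duplicated salary row are hypothetical; the sketch only shows that a repeated key multiplies rows and that dropping duplicates on the key first avoids it.

```python
import pandas as pd

stats = pd.DataFrame({'Jugador': ['A', 'B'], 'WAR': [1.0, 2.0]})
# Player 'A' appears twice in the salary table (hypothetical duplicate record)
salary = pd.DataFrame({'Jugador': ['A', 'A', 'B'], 'Sueldo': [5, 5, 7]})

# The duplicated key produces two rows for player 'A'
naive = pd.merge(stats, salary, on='Jugador')

# Dropping duplicates on the key first leaves one row per player
clean = pd.merge(stats, salary.drop_duplicates(subset='Jugador'), on='Jugador')
print(len(naive), len(clean))
```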

In [None]:
for year in range(0,period):
    df_pitching_copy[year]['Porcentaje_juegos'] = df_pitching_copy[year]['Juegos']/df_pitching_copy[year]['Juegos totales']

In [None]:
df_hitting_copy[3].info()

In [None]:
df_pitching_copy[3].info()

To make the transformations easier to inspect, let us sort the columns of each database alphabetically by name.

In [None]:
for year in range(0,period):
    # Sort the columns alphabetically
    df_salary_copy[year].sort_index(axis = 1, inplace = True)
    df_hitting_copy[year].sort_index(axis = 1, inplace = True)
    df_pitching_copy[year].sort_index(axis = 1, inplace = True)
    df_free_agents_copy[year].sort_index(axis = 1, inplace = True)
    
    # Reset the row index
    df_salary_copy[year].reset_index(drop = True, inplace = True)
    df_hitting_copy[year].reset_index(drop = True, inplace = True)
    df_pitching_copy[year].reset_index(drop = True, inplace = True)
    df_free_agents_copy[year].reset_index(drop = True, inplace = True)

## Variables for period t-1

We will *merge* the databases of year $t$ with those of year $t-1$ on the players. The reason is that we are only interested in players who have been free agents for more than one year.

If the first database is from 2011, we have to start in 2012. Let us create the dataframes that will hold the data for the model. So that the periods do not overwrite one another, we create auxiliary dataframes to store the new data.

In [None]:
hitting_merge = ['Juegos_iniciados', 'Porcentaje_juegos_iniciados', 'At-bats', 'Bateos',
                 'Dobles', 'Triples', 'Home-runs', 'Runs-batted-in', 'Bateos_promedio',
                 'Porcentaje_on-base', 'Porcentaje_slugging', 'TVS',
                 'Porcentaje_On-base-plus-slugging', 'WAR',
                 'Juegos_iniciados_2', 'Porcentaje_juegos_iniciados_2', 'At-bats_2', 'Bateos_2',
                 'Dobles_2', 'Triples_2', 'Home-runs_2', 'Runs-batted-in_2', 'Bateos_promedio_2',
                 'Porcentaje_on-base_2', 'Porcentaje_slugging_2', 'TVS_2',
                 'Porcentaje_On-base-plus-slugging_2', 'WAR_2']	
pitching_merge = ['Inning_pitched', 'Bateos_en_contra', 'Carreras_en_contra',
                  'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins', 'Losses',
                  'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Dominio', 'Control',
                  'Comando',
                  'Inning_pitched_2', 'Bateos_2', 'Carreras_2',
                  'Carreras_ganadas_2', 'Walks_2', 'Strike-outs_2', 'Wins_2', 'Losses_2',
                  'Saves_2', 'WHIP_2', 'ERA_2', 'WAR_2', 'TVS_2', 'Dominio_2', 'Control_2',
                  'Comando_2']

In [None]:
df_hitters_copy = [None]*period
df_pitchers_copy = [None]*period

In [None]:
for year in range(0,period):
    df_hitters_copy[year] = df_hitting_copy[year].copy()
    df_pitchers_copy[year] = df_pitching_copy[year].copy()

In [None]:
for year in range(1,period):    
    df_hitting_copy[year] = pd.merge(df_hitters_copy[year], df_hitters_copy[year-1], on = 'Jugador')
    df_pitching_copy[year] = pd.merge(df_pitchers_copy[year], df_pitchers_copy[year-1], on = 'Jugador')
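
The effect of this merge can be seen on a toy example. The two frames below are assumptions for illustration: only players present in both years survive the default inner merge, and overlapping columns receive the `_x` (year $t$) and `_y` (year $t-1$) suffixes. The `suffixes` argument of `pd.merge` is one way to obtain the final names directly.

```python
import pandas as pd

year_t = pd.DataFrame({'Jugador': ['A', 'B'], 'WAR': [3.0, 1.0]})
year_t_1 = pd.DataFrame({'Jugador': ['A', 'C'], 'WAR': [2.5, 0.5]})

# Default suffixes: WAR_x comes from year t, WAR_y from year t-1
default = pd.merge(year_t, year_t_1, on='Jugador')

# Explicit suffixes name the columns in one step
named = pd.merge(year_t, year_t_1, on='Jugador', suffixes=('_t', '_t_1'))
print(named)
```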

Next we check that the number of columns is the same in every period except the first.

In [None]:
for name in df_pitching_copy[11].columns:
    print(name)

In [None]:
for year in range(0,period):
    print(df_hitting_copy[year].columns.shape)
    
for year in range(0,period):    
    print(df_pitching_copy[year].columns.shape)

In [None]:
for year in range(1,period):       
    df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace('_x', '_t')
    df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace('_y', '_t_1')
    df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace('-', '_')
    df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace(' ', '_')
    df_pitching_copy[year].drop(['ln_Sueldo_base_t_1', 'ln_Sueldo_ajustado_t_1', 'ln_Sueldo_regular_t_1'],
                           axis = 1, inplace = True)
    df_pitching_copy[year] = df_pitching_copy[year].sort_values(by = 'Jugador', ascending = True)
    df_pitching_copy[year].reset_index(drop = True, inplace = True)
    
    df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace('_x', '_t')
    df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace('_y', '_t_1')
    df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace('-', '_')
    df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace(' ', '_')
    df_hitting_copy[year].drop(['ln_Sueldo_base_t_1', 'ln_Sueldo_ajustado_t_1', 'ln_Sueldo_regular_t_1'],
                          axis = 1, inplace = True)
    df_hitting_copy[year] = df_hitting_copy[year].sort_values(by = 'Jugador', ascending = True)
    df_hitting_copy[year].reset_index(drop = True, inplace = True)
    
    # Sort the columns
    df_hitting_copy[year].sort_index(axis = 1, inplace = True)
    df_pitching_copy[year].sort_index(axis = 1, inplace = True)
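
One caveat with plain replacements of `_x` and `_y` is that they rewrite every occurrence in a column name, not only the merge suffix at the end. A hedged alternative, shown here on made-up column names, anchors the pattern at the end of the string with a regular expression.

```python
import pandas as pd

# Made-up names; 'Porcentaje_x_base_x' contains '_x' in the middle and at the end
cols = pd.Index(['WAR_x', 'WAR_y', 'Porcentaje_x_base_x', 'At-bats_y'])

renamed = (cols.str.replace(r'_x$', '_t', regex=True)      # only a trailing _x
               .str.replace(r'_y$', '_t_1', regex=True)    # only a trailing _y
               .str.replace('-', '_', regex=False))         # literal hyphen
print(renamed.tolist())
```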

In [None]:
for name in df_pitching_copy[11].columns:
    print(name)

Because many of the period $t-1$ variables can serve as more realistic controls, we keep them, except for the column holding the year that the period $t-1$ dataframe refers to, namely *Anio_t_1*. This is done for both *pitchers* and *hitters*. For analogous reasons we also drop *Estado_t_1*, since the number of teams per state does not change over the analysis period, and *Edad_t_1*.

To simplify the code, we treat the column *Anio* as the column *Anio_t*.

In [None]:
for year in range(1,period):
    df_pitching_copy[year].drop(['Anio_t_1', 'Estado_t_1', 'Edad_t_1'],
                           axis = 1, inplace = True)
    
    df_hitting_copy[year].drop(['Anio_t_1', 'Estado_t_1', 'Edad_t_1'],
                           axis = 1, inplace = True)
    
    # Sort the columns
    df_hitting_copy[year].sort_index(axis = 1, inplace = True)
    df_pitching_copy[year].sort_index(axis = 1, inplace = True)
    
    # Reset the row index
    df_hitting_copy[year].reset_index(drop = True, inplace = True)
    df_pitching_copy[year].reset_index(drop = True, inplace = True)

Let us add the suffix to the databases for the year 2011.

In [None]:
year = 0
# Add the _t suffix to every column
df_hitting_copy[year] = df_hitting_copy[year].add_suffix('_t')
df_pitching_copy[year] = df_pitching_copy[year].add_suffix('_t')
# Restore the name of the player column
df_hitting_copy[year].columns = df_hitting_copy[year].columns.str.replace('Jugador_t', 'Jugador')
df_pitching_copy[year].columns = df_pitching_copy[year].columns.str.replace('Jugador_t', 'Jugador')

In [None]:
print("Salarios")
print(df_salary_copy[year].columns)
print("\n")
print("Hitters")
print(df_hitting_copy[year].columns)
print("\n")
print("Pitchers")
print(df_pitching_copy[year].columns)
print("\n")
print("Free agents")
print(df_free_agents_copy[year].columns)
print("\n")

## Segmentation by free agents

We will split pitchers and hitters into two groups:

- Free agents.
- Non-free agents.

In [None]:
for year in range(0,period):
    # Keep the free agents
    df_hitters_free_agents[year] = pd.merge(df_free_agents_copy[year],
                                            df_hitting_copy[year], on = 'Jugador')
    df_pitchers_free_agents[year] = pd.merge(df_free_agents_copy[year],
                                             df_pitching_copy[year], on = 'Jugador')
    # Keep the players who are not free agents
    df_hitters_no_free_agents[year] = df_hitting_copy[year][~df_hitting_copy[year].Jugador.isin(df_hitters_free_agents[year].Jugador)]
    df_pitchers_no_free_agents[year] = df_pitching_copy[year][~df_pitching_copy[year].Jugador.isin(df_pitchers_free_agents[year].Jugador)]
    
    # Sort the columns alphabetically
    df_hitters_free_agents[year] = df_hitters_free_agents[year].reindex(sorted(df_hitters_free_agents[year].columns), axis=1)
    df_pitchers_free_agents[year] = df_pitchers_free_agents[year].reindex(sorted(df_pitchers_free_agents[year].columns), axis=1)
    df_hitters_no_free_agents[year] = df_hitters_no_free_agents[year].reindex(sorted(df_hitters_no_free_agents[year].columns), axis=1)
    df_pitchers_no_free_agents[year] = df_pitchers_no_free_agents[year].reindex(sorted(df_pitchers_no_free_agents[year].columns), axis=1)    
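
The split into the two groups can be checked on a toy case. In the sketch below the frames are assumptions; the boolean mask built with `isin` selects the free agents, and its negation selects everyone else, so the two groups partition the original dataframe.

```python
import pandas as pd

players = pd.DataFrame({'Jugador': ['A', 'B', 'C'], 'WAR': [1.0, 2.0, 3.0]})
free_agents = pd.DataFrame({'Jugador': ['B', 'D']})

# True for players that appear in the free-agent list
is_fa = players['Jugador'].isin(free_agents['Jugador'])
fa = players[is_fa]
no_fa = players[~is_fa]

# Every player lands in exactly one of the two groups
print(len(fa), len(no_fa))
```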

Let us look at the contents of the new databases.

In [None]:
print("FA - Hitters:")
df_hitters_free_agents[9].info()
print("\n FA - Pitchers:")
df_pitchers_free_agents[9].info()
print("\n No FA - Hitters:")
df_hitters_no_free_agents[9].info()
print("\n No FA - Pitchers:")
df_pitchers_no_free_agents[9].info()

In [None]:
print("FA - Hitters:")
for year in range(0,period):
    print(df_hitters_free_agents[year].shape)
print("\n FA - Pitchers:")
for year in range(0,period):
    print(df_pitchers_free_agents[year].shape)

Finally, to simplify future applications, let us convert all column names to lowercase.

In [None]:
for year in range(0,period):
    # rename returns a new DataFrame, so the lower-cased copies are exported directly;
    # the in-memory dataframes keep their original column names, which later cells rely on
    df_hitters_free_agents[year].rename(columns = str.lower).to_csv('ETL_Data/Agent/First_Two_Years_Contract/Period_t_1/Free_Agent/Hitters/free_agents_batters_' + str(2011 + year) + '.csv', index = False)
    df_pitchers_free_agents[year].rename(columns = str.lower).to_csv('ETL_Data/Agent/First_Two_Years_Contract/Period_t_1/Free_Agent/Pitchers/free_agents_pitchers_' + str(2011 + year) + '.csv', index = False)
    df_hitters_no_free_agents[year].rename(columns = str.lower).to_csv('ETL_Data/Agent/First_Two_Years_Contract/Period_t_1/No_Free_Agent/Hitters/no_free_agents_batters_' + str(2011 + year) + '.csv', index = False)
    df_pitchers_no_free_agents[year].rename(columns = str.lower).to_csv('ETL_Data/Agent/First_Two_Years_Contract/Period_t_1/No_Free_Agent/Pitchers/no_free_agents_pitchers_' + str(2011 + year) + '.csv', index = False)
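
A detail worth keeping in mind when lower-casing column names is that `DataFrame.rename` returns a new dataframe and leaves the original untouched unless the result is kept. The toy frame below is an assumption used only to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({'Jugador': ['A'], 'WAR_t': [1.0]})

# The returned frame is discarded here, so toy keeps its original names
toy.rename(columns=str.lower)
unchanged = toy.columns.tolist()

# Keeping the returned frame is what actually yields lower-case names
lowered = toy.rename(columns=str.lower)
print(unchanged, lowered.columns.tolist())
```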

### Labels for free agents

We will create a label indicating whether the pitcher or hitter is a free agent or not.

## Panel Data

To obtain a database with a panel structure, we stack the yearly databases.

In [None]:
# Build each panel by stacking all yearly dataframes in a single call
df_panel_all_hitter = pd.concat([df_hitting_copy[year] for year in range(0,period)])
df_panel_all_pitcher = pd.concat([df_pitching_copy[year] for year in range(0,period)])
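
When yearly dataframes with different columns are stacked, `pd.concat` keeps the union of the columns and fills the gaps with *NaN*. This is why the rows of the first year, which has no $t-1$ columns, carry missing values in the panel. A minimal sketch with made-up frames:

```python
import pandas as pd

# The first year has no t-1 column; later years do
first = pd.DataFrame({'Jugador': ['A'], 'WAR_t': [1.0]})
later = pd.DataFrame({'Jugador': ['A'], 'WAR_t': [2.0], 'WAR_t_1': [1.0]})

# The result has the union of the columns; missing cells become NaN
panel = pd.concat([first, later], ignore_index=True)
print(panel)
```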

Let us look at the descriptive statistics of the panels.

In [None]:
df_panel_all_hitter[['ln_Sueldo_ajustado_t']].describe()

In [None]:
df_panel_all_pitcher.describe()

In [None]:
df_panel_all_hitter.info()

In [None]:
df_panel_all_pitcher.info()

Let us verify that there are no problems with *NaN* or *infinite* values.

*NaN* values:

In [None]:
for name in df_panel_all_hitter.columns:
    print(name)
    # Column names are always strings, so test the dtype of the column instead
    if pd.api.types.is_numeric_dtype(df_panel_all_hitter[name]):
        for element in range(0,len(df_panel_all_hitter[name])):
            if pd.isna(df_panel_all_hitter[name].iloc[element]):
                print(str(element) +  '  ' + str(df_panel_all_hitter[name].iloc[element]))

In [None]:
for name in df_panel_all_pitcher.columns:
    print(name)
    # Column names are always strings, so test the dtype of the column instead
    if pd.api.types.is_numeric_dtype(df_panel_all_pitcher[name]):
        for element in range(0,len(df_panel_all_pitcher[name])):
            if pd.isna(df_panel_all_pitcher[name].iloc[element]):
                print(str(element) +  '  ' + str(df_panel_all_pitcher[name].iloc[element]))

*Infinite* values:

In [None]:
for name in df_panel_all_hitter.columns:
    print(name)
    # Column names are always strings, so test the dtype of the column instead
    if pd.api.types.is_numeric_dtype(df_panel_all_hitter[name]):
        for element in range(0,len(df_panel_all_hitter[name])):
            if math.isinf(df_panel_all_hitter[name].iloc[element]):
                print(str(element) +  '  ' + str(df_panel_all_hitter[name].iloc[element]))

In [None]:
for name in df_panel_all_pitcher.columns:
    print(name)
    # Column names are always strings, so test the dtype of the column instead
    if pd.api.types.is_numeric_dtype(df_panel_all_pitcher[name]):
        for element in range(0,len(df_panel_all_pitcher[name])):
            if math.isinf(df_panel_all_pitcher[name].iloc[element]):
                print(str(element) +  '  ' + str(df_panel_all_pitcher[name].iloc[element]))
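
For large panels, an element-by-element scan can be slow and verbose. A vectorized count per column is a possible alternative; the sketch below assumes a toy dataframe and reports only the numeric columns that contain *NaN* or infinite values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Jugador': ['A', 'B', 'C'],
                    'ln_Sueldo': [13.1, np.nan, -np.inf],
                    'WAR': [1.0, 2.0, 3.0]})

# Restrict the check to numeric columns
numeric = toy.select_dtypes(include='number')

# Count the problematic values in each column
nan_counts = numeric.isna().sum()
inf_counts = np.isinf(numeric).sum()

# Show only the columns where something was found
print(nan_counts[nan_counts > 0])
print(inf_counts[inf_counts > 0])
```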

In [None]:
df_panel_all_hitter.sort_index(axis = 1, inplace = True)
df_panel_all_pitcher.sort_index(axis = 1, inplace = True)

Let us export the panels.

In [None]:
df_panel_all_hitter.to_csv('ETL_Data/Agent/First_Two_Years_Contract/Period_t_1/Hitters/All_Hitters/panel_hitters' + '.csv', index = False)
df_panel_all_pitcher.to_csv('ETL_Data/Agent/First_Two_Years_Contract/Period_t_1/Pitchers/All_Pitchers/panel_pitchers' + '.csv', index = False)

We repeat the procedure, but only for the free agents.

In [None]:
# Build each panel by stacking all yearly dataframes in a single call
df_panel_fa_hitter = pd.concat([df_hitters_free_agents[year] for year in range(0,period)])
df_panel_fa_pitcher = pd.concat([df_pitchers_free_agents[year] for year in range(0,period)])

In [None]:
df_panel_fa_hitter.info()

In [None]:
df_panel_fa_pitcher.info()

In [None]:
df_panel_fa_hitter.drop('Anios_de_contrato',
                        axis = 1, inplace = True)
df_panel_fa_pitcher.drop('Anios_de_contrato',
                         axis = 1, inplace = True)

In [None]:
df_panel_fa_hitter.sort_index(axis = 1, inplace = True)
df_panel_fa_pitcher.sort_index(axis = 1, inplace = True)

In [None]:
df_panel_fa_hitter.to_csv('ETL_Data/Agent/First_Two_Years_Contract/Period_t_1/Free_Agent/Hitters/panel_hitters' + '.csv', index = False)
df_panel_fa_pitcher.to_csv('ETL_Data/Agent/First_Two_Years_Contract/Period_t_1/Free_Agent/Pitchers/panel_pitchers' + '.csv', index = False)

# Variables of the Empirical Model

In [None]:
empiric_panel_hitter = df_panel_fa_hitter.copy()
empiric_panel_pitcher = df_panel_fa_pitcher.copy()

Let us look at some statistics and information contained in the databases.

In [None]:
print(empiric_panel_hitter.shape)

In [None]:
print(empiric_panel_pitcher.shape)

The positions present in each database:

In [None]:
empiric_panel_hitter['Posicion_t'].unique()

In [None]:
empiric_panel_pitcher['Posicion_t'].unique()

Let us sort the databases by name and year.

In [None]:
# Hitter
empiric_panel_hitter = empiric_panel_hitter.sort_values(by = ['Jugador','Anio_t'], ascending=True)
empiric_panel_hitter.reset_index(drop = True, inplace = True)

# Pitcher
empiric_panel_pitcher = empiric_panel_pitcher.sort_values(by = ['Jugador','Anio_t'], ascending=True)
empiric_panel_pitcher.reset_index(drop = True, inplace = True)

In [None]:
empiric_panel_hitter[['Jugador','Anio_t']].head()

In [None]:
empiric_panel_pitcher[['Jugador','Anio_t']].head()

For each team, let us obtain the maximum and minimum of each period *t_1* performance measure across all seasons.

In [123]:
hitter_names = empiric_panel_hitter.columns
for index in range(0,len(hitter_names)):
    print("Name: " + str(hitter_names[index]))
    print("index: " + str(index))

Name: Acronimo_t
index: 0
Name: Acronimo_t_1
index: 1
Name: Altura_t
index: 2
Name: Altura_t_1
index: 3
Name: Anio_de_agente_libre_t
index: 4
Name: Anio_de_agente_libre_t_1
index: 5
Name: Anio_t
index: 6
Name: Anios_de_contrato_t
index: 7
Name: Anios_de_contrato_t_1
index: 8
Name: Antiguedad_t
index: 9
Name: Antiguedad_t_1
index: 10
Name: At-bats_2_t
index: 11
Name: At-bats_t
index: 12
Name: At_bats_2_t
index: 13
Name: At_bats_2_t_1
index: 14
Name: At_bats_t
index: 15
Name: At_bats_t_1
index: 16
Name: Bateos_2_t
index: 17
Name: Bateos_2_t_1
index: 18
Name: Bateos_promedio_2_t
index: 19
Name: Bateos_promedio_2_t_1
index: 20
Name: Bateos_promedio_t
index: 21
Name: Bateos_promedio_t_1
index: 22
Name: Bateos_t
index: 23
Name: Bateos_t_1
index: 24
Name: Bono_por_firma_t
index: 25
Name: Bono_por_firma_t_1
index: 26
Name: Cantidad de equipos_t
index: 27
Name: Cantidad_agentes_libres_t
index: 28
Name: Cantidad_agentes_libres_t_1
index: 29
Name: Cantidad_de_equipos_t
index: 30
Name: Cantidad_de

Let us collect the indices of the columns of interest.

In [124]:
sport_st_hitter = [14,16, 18,24, 20,22,
                   33,35, 48,50, 53,55,
                   73,75, 77,79, 85,87,
                   89,91, 99,101, 116,118,
                   130,132]

In [125]:
# Hitter
for sport_stat in range(0,len(sport_st_hitter)):
    # Auxiliary variables
    stat = hitter_names[sport_st_hitter[sport_stat]]
    max_stat_name = hitter_names[sport_st_hitter[sport_stat]] + '_H'
    min_stat_name = hitter_names[sport_st_hitter[sport_stat]] + '_L'
    
    # Maximum per team
    max_stat = pd.DataFrame({"Acronimo_t":empiric_panel_hitter.groupby(by = "Acronimo_t")[stat].max().index,
                             max_stat_name: empiric_panel_hitter.groupby(by = "Acronimo_t")[stat].max().values})
    # Minimum per team
    min_stat = pd.DataFrame({"Acronimo_t":empiric_panel_hitter.groupby(by = "Acronimo_t")[stat].min().index,
                             min_stat_name: empiric_panel_hitter.groupby(by = "Acronimo_t")[stat].min().values})
    empiric_panel_hitter = empiric_panel_hitter.merge(max_stat, on = "Acronimo_t",
                                                      how = "left")
    empiric_panel_hitter = empiric_panel_hitter.merge(min_stat, on = "Acronimo_t",
                                                      how = "left")
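
The team-level maximum and minimum can also be attached without building auxiliary dataframes and merging them back. The sketch below uses `groupby(...).transform` on a toy frame (the team acronyms and values are made up); it returns one value per original row, aligned with the index.

```python
import pandas as pd

toy = pd.DataFrame({'Acronimo_t': ['NYY', 'NYY', 'BOS'],
                    'WAR_t_1': [1.0, 4.0, 2.5]})

# transform broadcasts the group statistic back to every row of the group
toy['WAR_t_1_H'] = toy.groupby('Acronimo_t')['WAR_t_1'].transform('max')
toy['WAR_t_1_L'] = toy.groupby('Acronimo_t')['WAR_t_1'].transform('min')
print(toy)
```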

In [126]:
hitter_names = empiric_panel_hitter.columns
for index in range(0,len(hitter_names)):
    print("Name: " + str(hitter_names[index]))
    print("index: " + str(index))

In [127]:
# Columns from position 139 onward are the new team maxima (_H) and minima (_L)
empiric_panel_hitter.iloc[:,139:]

Unnamed: 0,At_bats_2_t_1_H,At_bats_2_t_1_L,At_bats_t_1_H,At_bats_t_1_L,Bateos_2_t_1_H,Bateos_2_t_1_L,Bateos_t_1_H,Bateos_t_1_L,Bateos_promedio_2_t_1_H,Bateos_promedio_2_t_1_L,...,Runs_batted_in_t_1_H,Runs_batted_in_t_1_L,Triples_2_t_1_H,Triples_2_t_1_L,Triples_t_1_H,Triples_t_1_L,WAR_2_t_1_H,WAR_2_t_1_L,WAR_t_1_H,WAR_t_1_L
0,344569.0,1.0,587.0,1.0,26896.0,1.0,164.0,1.0,1.000000,0.004624,...,100.0,0.0,25.0,0.0,5.0,0.0,14.8996,0.0196,3.86,0.14
1,207025.0,1.0,455.0,1.0,14884.0,0.0,122.0,0.0,0.082369,0.000000,...,61.0,0.0,9.0,0.0,3.0,0.0,8.1225,0.0025,2.85,0.05
2,379456.0,16.0,616.0,4.0,31329.0,0.0,177.0,0.0,0.092416,0.000000,...,86.0,0.0,4.0,0.0,2.0,0.0,8.4100,0.0961,2.90,0.31
3,425104.0,1.0,652.0,1.0,29929.0,0.0,173.0,0.0,0.250000,0.000000,...,102.0,0.0,16.0,0.0,4.0,0.0,52.5625,0.0144,7.25,0.12
4,335241.0,100.0,579.0,10.0,25281.0,0.0,159.0,0.0,0.091809,0.000000,...,97.0,0.0,81.0,0.0,9.0,0.0,33.1776,0.0169,5.76,0.13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
737,360000.0,16.0,600.0,4.0,32400.0,0.0,180.0,0.0,0.107584,0.000000,...,83.0,0.0,25.0,0.0,5.0,0.0,21.6225,0.0001,4.65,0.01
738,360000.0,16.0,600.0,4.0,32400.0,0.0,180.0,0.0,0.107584,0.000000,...,83.0,0.0,25.0,0.0,5.0,0.0,21.6225,0.0001,4.65,0.01
739,336400.0,4.0,580.0,2.0,26569.0,0.0,163.0,0.0,0.078961,0.000000,...,63.0,0.0,9.0,0.0,3.0,0.0,78.8544,0.0144,8.88,0.12
740,336400.0,4.0,580.0,2.0,26569.0,0.0,163.0,0.0,0.078961,0.000000,...,63.0,0.0,9.0,0.0,3.0,0.0,78.8544,0.0144,8.88,0.12


Let us repeat the same process for the pitchers.

In [128]:
pitcher_names = empiric_panel_pitcher.columns
for index in range(0,len(pitcher_names)):
    print("Name: " + str(pitcher_names[index]))
    print("index: " + str(index))

Name: Acronimo_t
index: 0
Name: Acronimo_t_1
index: 1
Name: Altura_t
index: 2
Name: Altura_t_1
index: 3
Name: Anio_de_agente_libre_t
index: 4
Name: Anio_de_agente_libre_t_1
index: 5
Name: Anio_t
index: 6
Name: Anios_de_contrato_t
index: 7
Name: Anios_de_contrato_t_1
index: 8
Name: Antiguedad_t
index: 9
Name: Antiguedad_t_1
index: 10
Name: Bateos_2_t
index: 11
Name: Bateos_2_t_1
index: 12
Name: Bateos_t
index: 13
Name: Bateos_t_1
index: 14
Name: Bono_por_firma_t
index: 15
Name: Bono_por_firma_t_1
index: 16
Name: Cantidad de equipos_t
index: 17
Name: Cantidad_agentes_libres_t
index: 18
Name: Cantidad_agentes_libres_t_1
index: 19
Name: Cantidad_de_equipos_t
index: 20
Name: Cantidad_de_equipos_t_1
index: 21
Name: Carreras_2_t
index: 22
Name: Carreras_2_t_1
index: 23
Name: Carreras_ganadas_2_t
index: 24
Name: Carreras_ganadas_2_t_1
index: 25
Name: Carreras_ganadas_t
index: 26
Name: Carreras_ganadas_t_1
index: 27
Name: Carreras_t
index: 28
Name: Carreras_t_1
index: 29
Name: Comando_2_t
index

In [129]:
sport_st_pitcher = [12,14, 23,29, 25,27,
                    31,33, 35,37, 39,41,
                    43,45, 56,58, 68,70,
                    87,89, 94,96, 120,122,
                    124,126, 131,133, 135,137]

In [130]:
# Pitcher
for sport_stat in range(0,len(sport_st_pitcher)):
    # Auxiliary variables
    stat = pitcher_names[sport_st_pitcher[sport_stat]]
    max_stat_name = pitcher_names[sport_st_pitcher[sport_stat]] + '_H'
    min_stat_name = pitcher_names[sport_st_pitcher[sport_stat]] + '_L'
    
    # Maximum per team
    max_stat = pd.DataFrame({"Acronimo_t":empiric_panel_pitcher.groupby(by = "Acronimo_t")[stat].max().index,
                             max_stat_name: empiric_panel_pitcher.groupby(by = "Acronimo_t")[stat].max().values})
    # Minimum per team
    min_stat = pd.DataFrame({"Acronimo_t":empiric_panel_pitcher.groupby(by = "Acronimo_t")[stat].min().index,
                             min_stat_name: empiric_panel_pitcher.groupby(by = "Acronimo_t")[stat].min().values})
    empiric_panel_pitcher = empiric_panel_pitcher.merge(max_stat, on = "Acronimo_t",
                                                        how = "left")
    empiric_panel_pitcher = empiric_panel_pitcher.merge(min_stat, on = "Acronimo_t",
                                                        how = "left")

In [131]:
pitcher_names = empiric_panel_pitcher.columns
for index in range(0,len(pitcher_names)):
    print("Name: " + str(pitcher_names[index]))
    print("index: " + str(index))

In [132]:
# Columns from position 141 onward are the new team maxima (_H) and minima (_L)
empiric_panel_pitcher.iloc[:,141:].head(10)

Unnamed: 0,Bateos_2_t_1_H,Bateos_2_t_1_L,Bateos_t_1_H,Bateos_t_1_L,Carreras_2_t_1_H,Carreras_2_t_1_L,Carreras_t_1_H,Carreras_t_1_L,Carreras_ganadas_2_t_1_H,Carreras_ganadas_2_t_1_L,...,WHIP_t_1_H,WHIP_t_1_L,Walks_2_t_1_H,Walks_2_t_1_L,Walks_t_1_H,Walks_t_1_L,Wins_2_t_1_H,Wins_2_t_1_L,Wins_t_1_H,Wins_t_1_L
0,46225.0,4.0,215.0,2.0,8649.0,0.0,93.0,0.0,7396.0,0.0,...,1.53,0.38,5041.0,0.0,71.0,0.0,196.0,0.0,14.0,0.0
1,42025.0,256.0,205.0,16.0,14884.0,100.0,122.0,10.0,11881.0,81.0,...,1.59,1.14,9216.0,196.0,96.0,14.0,144.0,1.0,12.0,1.0
2,39204.0,0.0,198.0,0.0,12996.0,0.0,114.0,0.0,11449.0,0.0,...,1.73,0.0,7225.0,0.0,85.0,0.0,324.0,0.0,18.0,0.0
3,46225.0,4.0,215.0,2.0,8649.0,0.0,93.0,0.0,7396.0,0.0,...,1.53,0.38,5041.0,0.0,71.0,0.0,196.0,0.0,14.0,0.0
4,42025.0,0.0,205.0,0.0,7744.0,0.0,88.0,0.0,7744.0,0.0,...,1.39,0.8,6084.0,1.0,78.0,1.0,400.0,0.0,20.0,0.0
5,42025.0,0.0,205.0,0.0,7744.0,0.0,88.0,0.0,7744.0,0.0,...,1.39,0.8,6084.0,1.0,78.0,1.0,400.0,0.0,20.0,0.0
6,51529.0,49.0,227.0,7.0,11236.0,4.0,106.0,2.0,10404.0,4.0,...,2.44,0.64,5041.0,1.0,71.0,1.0,324.0,0.0,18.0,0.0
7,39204.0,0.0,198.0,0.0,12996.0,0.0,114.0,0.0,11449.0,0.0,...,1.73,0.0,7225.0,0.0,85.0,0.0,324.0,0.0,18.0,0.0
8,32761.0,676.0,181.0,26.0,6889.0,81.0,83.0,9.0,6400.0,81.0,...,1.44,1.02,4096.0,64.0,64.0,8.0,196.0,0.0,14.0,0.0
9,32761.0,676.0,181.0,26.0,6889.0,81.0,83.0,9.0,6400.0,81.0,...,1.44,1.02,4096.0,64.0,64.0,8.0,196.0,0.0,14.0,0.0


We now add every year (12) for each agent in the database and fill the resulting *NaN* values with 0, since in this case they represent the absence of performance.

In [133]:
# Hitter
empiric_panel_hitter = empiric_panel_hitter["Anio_t"].drop_duplicates().to_frame().merge(empiric_panel_hitter["Jugador"].drop_duplicates(),
                                                                                         how = "cross").merge(empiric_panel_hitter,
                                                                                                              how = "left")
empiric_panel_hitter.reset_index(drop = True, inplace = True)

# Pitcher
empiric_panel_pitcher = empiric_panel_pitcher["Anio_t"].drop_duplicates().to_frame().merge(empiric_panel_pitcher["Jugador"].drop_duplicates(),
                                                                                           how = "cross").merge(empiric_panel_pitcher,
                                                                                                                how = "left")
empiric_panel_pitcher.reset_index(drop = True, inplace = True)
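
The balancing step can be followed on a small example. The sketch below assumes a toy panel and a pandas version that supports `how='cross'` (1.2 or later): the cross join builds every (year, player) pair, and the left merge attaches the observed rows, leaving *NaN* where a player has no record for that year.

```python
import pandas as pd

panel = pd.DataFrame({'Jugador': ['A', 'A', 'B'],
                      'Anio_t': [2012, 2013, 2013],
                      'WAR_t': [1.0, 2.0, 3.0]})

years = panel[['Anio_t']].drop_duplicates()
players = panel[['Jugador']].drop_duplicates()

# Every (year, player) pair, then the observed rows attached where they exist
balanced = years.merge(players, how='cross').merge(panel, how='left',
                                                   on=['Anio_t', 'Jugador'])
print(balanced)
```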

Let us look at the dimensions.

In [134]:
empiric_panel_hitter.shape

(3467, 191)

In [135]:
empiric_panel_pitcher.shape

(3374, 201)

To be consistent, we impute the columns containing *string* data with the word *No*, which indicates that the player had no team, no position, and so on.

In [136]:
empiric_panel_hitter.select_dtypes(include =['object'],
                                   exclude = ['int64','float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3467 entries, 0 to 3466
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Anio_t               3467 non-null   object
 1   Jugador              3467 non-null   object
 2   Acronimo_t           742 non-null    object
 3   Acronimo_t_1         742 non-null    object
 4   Equipo_anterior      742 non-null    object
 5   Equipo_t             742 non-null    object
 6   Equipo_t_1           742 non-null    object
 7   Estado_t             742 non-null    object
 8   Posicion_t           742 non-null    object
 9   Posicion_t_1         742 non-null    object
 10  Status_agente_libre  742 non-null    object
dtypes: object(11)
memory usage: 298.1+ KB


In [137]:
# Text columns to impute, listed once and reused on both sides of the assignment
string_cols = ['Acronimo_t',
               'Equipo_anterior',
               'Equipo_t',
               'Estado_t',
               'Posicion_t',
               'Status_agente_libre',
               'Acronimo_t_1',
               'Equipo_t_1',
               'Posicion_t_1']
empiric_panel_hitter[string_cols] = empiric_panel_hitter[string_cols].fillna('No')
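
To make the scope of this imputation explicit, the toy example below (frame and column list are assumptions) fills only the listed text columns with *No*; numeric columns keep their *NaN* values and have to be handled separately.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Jugador': ['A', 'B'],
                    'Equipo_t': ['NYY', np.nan],
                    'Posicion_t': [np.nan, 'SP'],
                    'WAR_t': [1.0, np.nan]})

# Only the listed text columns are imputed
text_cols = ['Equipo_t', 'Posicion_t']
toy[text_cols] = toy[text_cols].fillna('No')
print(toy)
```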

Let us check whether the imputation worked.

In [138]:
empiric_panel_hitter.select_dtypes(include =['object'],
                                   exclude = ['int64','float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3467 entries, 0 to 3466
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Anio_t               3467 non-null   object
 1   Jugador              3467 non-null   object
 2   Acronimo_t           3467 non-null   object
 3   Acronimo_t_1         3467 non-null   object
 4   Equipo_anterior      3467 non-null   object
 5   Equipo_t             3467 non-null   object
 6   Equipo_t_1           3467 non-null   object
 7   Estado_t             3467 non-null   object
 8   Posicion_t           3467 non-null   object
 9   Posicion_t_1         3467 non-null   object
 10  Status_agente_libre  3467 non-null   object
dtypes: object(11)
memory usage: 298.1+ KB


Now for the pitchers.

In [139]:
empiric_panel_pitcher.select_dtypes(include =['object'],
                                    exclude = ['int64','float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3374 entries, 0 to 3373
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Anio_t               3374 non-null   object
 1   Jugador              3374 non-null   object
 2   Acronimo_t           725 non-null    object
 3   Acronimo_t_1         725 non-null    object
 4   Equipo_anterior      725 non-null    object
 5   Equipo_t             725 non-null    object
 6   Equipo_t_1           725 non-null    object
 7   Estado_t             725 non-null    object
 8   Posicion_t           725 non-null    object
 9   Posicion_t_1         725 non-null    object
 10  Status_agente_libre  725 non-null    object
dtypes: object(11)
memory usage: 290.1+ KB


In [140]:
empiric_panel_pitcher[['Acronimo_t',
                       'Equipo_anterior',
                       'Equipo_t',
                       'Estado_t',
                       'Posicion_t',
                       'Status_agente_libre',
                       'Acronimo_t_1',
                       'Equipo_t_1',
                       'Posicion_t_1']] = \
empiric_panel_pitcher[['Acronimo_t',
                       'Equipo_anterior',
                       'Equipo_t',
                       'Estado_t',
                       'Posicion_t',
                       'Status_agente_libre',
                       'Acronimo_t_1',
                       'Equipo_t_1',
                       'Posicion_t_1']].fillna('No')

In [141]:
empiric_panel_pitcher.select_dtypes(include =['object'],
                                    exclude = ['int64','float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3374 entries, 0 to 3373
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Anio_t               3374 non-null   object
 1   Jugador              3374 non-null   object
 2   Acronimo_t           3374 non-null   object
 3   Acronimo_t_1         3374 non-null   object
 4   Equipo_anterior      3374 non-null   object
 5   Equipo_t             3374 non-null   object
 6   Equipo_t_1           3374 non-null   object
 7   Estado_t             3374 non-null   object
 8   Posicion_t           3374 non-null   object
 9   Posicion_t_1         3374 non-null   object
 10  Status_agente_libre  3374 non-null   object
dtypes: object(11)
memory usage: 290.1+ KB


In both cases the imputation succeeded. It does not matter that *No* was also written into the contract-period columns, since this avoids problems with possible instruments built from dummies.

Next, let us repeat the same for the numeric columns; we impute 0, since it reflects the absence of performance.

In [142]:
empiric_panel_hitter.fillna(0, inplace = True)
empiric_panel_pitcher.fillna(0, inplace = True)
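
The two imputation rules described above (the label *No* for text columns, 0 for numeric columns) can be illustrated on a toy frame. This is only a sketch: the frame `toy` and its values are hypothetical, and selecting columns by dtype is an assumption rather than the notebook's own approach, which lists the text columns explicitly.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical values) with gaps in a text and a numeric column
toy = pd.DataFrame({
    "Jugador": ["A", "B", "C"],
    "Equipo_t": ["NYY", None, "BOS"],
    "WAR_t": [1.5, np.nan, 0.3],
})

# Split the columns by type: everything non-numeric is treated as text
num_cols = toy.select_dtypes(include=["number"]).columns
text_cols = toy.select_dtypes(exclude=["number"]).columns

# Text columns get the label 'No'; numeric columns get 0
toy[text_cols] = toy[text_cols].fillna("No")
toy[num_cols] = toy[num_cols].fillna(0)

# Names of the columns that still contain NaN (empty if the imputation worked)
remaining = toy.columns[toy.isna().any()].tolist()
print(remaining)
```

The last two lines also give a compact alternative to looping over the columns when checking for leftover *NaN* entries.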

Let us verify whether any column still contains a *NaN* entry:

In [143]:
# Hitter
hitter_nan = empiric_panel_hitter.isna().any()
hitter_name = empiric_panel_hitter.columns
for con in range(0, len(hitter_nan)):
    if hitter_nan.iloc[con]:
        print("Name: " + str(hitter_name[con]))

In [144]:
# Pitcher
pitcher_nan = empiric_panel_pitcher.isna().any()
pitcher_name = empiric_panel_pitcher.columns
for con in range(0, len(pitcher_nan)):
    if pitcher_nan.iloc[con]:
        print("Name: " + str(pitcher_name[con]))

To avoid problems with the type of the *ID*, let us create a numeric one so that we do not have to use the players' names.

In [145]:
# Hitter
empiric_panel_hitter['id'] =  empiric_panel_hitter.groupby(['Jugador']).ngroup()
empiric_panel_hitter.reset_index(drop = True, inplace = True)
# Pitcher
empiric_panel_pitcher['id'] =  empiric_panel_pitcher.groupby(['Jugador']).ngroup()
empiric_panel_pitcher.reset_index(drop = True, inplace = True)
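
As a small illustration of what `groupby(...).ngroup()` returns, the sketch below uses a hypothetical frame with made-up player names. With the default `sort=True`, the group numbers follow the sorted order of the names, and every row of the same player receives the same integer.

```python
import pandas as pd

# Hypothetical player names; each player may appear in several rows
toy = pd.DataFrame({"Jugador": ["Zed", "Ana", "Zed", "Bob", "Ana"]})

# One integer per distinct player, numbered in sorted order of the names
toy["id"] = toy.groupby("Jugador").ngroup()
print(toy["id"].tolist())  # [2, 0, 2, 1, 0]
```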

Let us compute the transformation that yields the $Y$ of the empirical model from the differences between consecutive square roots of those salaries.

In [146]:
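# Differences of consecutive square roots
def sqrt_dif(X):
    # S[i] = sqrt(X[i+1]) - sqrt(X[i]) for every consecutive pair
    S = list(np.diff(np.sqrt(X)))
    # Pad so that len(S) == len(X): repeat the last difference,
    # or use 0 when the player has a single observation
    S.append(S[-1] if S else 0)
    return S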

In [147]:
Y_hitter = []
for p in empiric_panel_hitter["id"].unique():
    # Select each player's (log) salaries separately
    X = empiric_panel_hitter[empiric_panel_hitter["id"] == p]["ln_Sueldo_ajustado_t"].values
    # Apply the function
    S = sqrt_dif(X)
    # Append the results in order
    Y_hitter = np.concatenate((Y_hitter, S))
# Add the column (positional: assumes each player's rows are contiguous)
empiric_panel_hitter["Y"] = Y_hitter

In [148]:
empiric_panel_hitter["Y"]

0      -0.034934
1      -3.994443
2       0.000000
3       0.000000
4       0.000000
          ...   
3462    0.000000
3463    0.000000
3464    0.000000
3465    4.164357
3466    4.164357
Name: Y, Length: 3467, dtype: float64

In [149]:
Y_pitcher = []
for p in empiric_panel_pitcher["id"].unique():
    # Select each player's (log) salaries separately
    X = empiric_panel_pitcher[empiric_panel_pitcher["id"] == p]["ln_Sueldo_ajustado_t"].values
    # Apply the function
    S = sqrt_dif(X)
    # Append the results in order
    Y_pitcher = np.concatenate((Y_pitcher, S))
# Add the column (positional: assumes each player's rows are contiguous)
empiric_panel_pitcher["Y"] = Y_pitcher

In [150]:
empiric_panel_pitcher["Y"]

0      -0.034934
1      -3.994443
2       0.000000
3       0.000000
4       0.000000
          ...   
3369    0.000000
3370    0.000000
3371    0.000000
3372    0.000000
3373    0.000000
Name: Y, Length: 3374, dtype: float64
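
The per-player loop used above writes its results back by position, so it relies on the rows of each player being contiguous. A possible alternative, shown here only as a sketch, is `groupby(...).transform`, which keeps every value aligned with its original row. The helper `sqrt_dif_series`, the frame `toy`, and the column `salary` are hypothetical, and the sketch assumes the salary column has no missing values.

```python
import numpy as np
import pandas as pd

def sqrt_dif_series(s):
    """Forward difference of square roots within one player's series."""
    root = np.sqrt(s)
    d = root.shift(-1) - root      # sqrt(x[i+1]) - sqrt(x[i]); last entry is NaN
    # Repeat the last difference on the final row; a single-row player gets 0
    return d.ffill().fillna(0)

# Rows of the three hypothetical players are deliberately interleaved
toy = pd.DataFrame({
    "id":     [0, 1, 0, 2, 0, 2],
    "salary": [1.0, 25.0, 4.0, 9.0, 16.0, 4.0],
})
toy["Y"] = toy.groupby("id")["salary"].transform(sqrt_dif_series)
print(toy)
```

For player 0 the square roots are 1, 2, 4, so the differences are 1 and 2, and the last one is repeated, matching the padding rule of `sqrt_dif`.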

Let us build the dummy variables *I* of the empirical model.

In [151]:
start = empiric_panel_hitter.columns.get_loc('At_bats_2_t_1_H')
end = empiric_panel_hitter.columns.get_loc('WAR_t_1_L') + 1
sport_st_hitter_names = empiric_panel_hitter.iloc[:,start:end].columns
end_hitter_name = int(len(sport_st_hitter_names)/2)

In [152]:
# Hitter
for sport_stat in range(0,end_hitter_name):
    I_hitter = []
    for y,max_stat,min_stat in zip(empiric_panel_hitter[hitter_names[sport_st_hitter[sport_stat]]],
                                   empiric_panel_hitter[sport_st_hitter_names[2*sport_stat]],
                                   empiric_panel_hitter[sport_st_hitter_names[2*sport_stat + 1]]):
        if y > (max_stat + min_stat)/2:
            I_hitter.append(0)
        else: 
            I_hitter.append(1)
    
    I_name = "I_" + hitter_names[sport_st_hitter[sport_stat]]
    empiric_panel_hitter[I_name] = I_hitter

Let us look at the results.

In [153]:
hitter_names = empiric_panel_hitter.columns
for index in range(0,len(hitter_names)):
    print("Name: " + str(hitter_names[index]))
    print("index: " + str(index))

Name: Anio_t
index: 0
Name: Jugador
index: 1
Name: Acronimo_t
index: 2
Name: Acronimo_t_1
index: 3
Name: Altura_t
index: 4
Name: Altura_t_1
index: 5
Name: Anio_de_agente_libre_t
index: 6
Name: Anio_de_agente_libre_t_1
index: 7
Name: Anios_de_contrato_t
index: 8
Name: Anios_de_contrato_t_1
index: 9
Name: Antiguedad_t
index: 10
Name: Antiguedad_t_1
index: 11
Name: At-bats_2_t
index: 12
Name: At-bats_t
index: 13
Name: At_bats_2_t
index: 14
Name: At_bats_2_t_1
index: 15
Name: At_bats_t
index: 16
Name: At_bats_t_1
index: 17
Name: Bateos_2_t
index: 18
Name: Bateos_2_t_1
index: 19
Name: Bateos_promedio_2_t
index: 20
Name: Bateos_promedio_2_t_1
index: 21
Name: Bateos_promedio_t
index: 22
Name: Bateos_promedio_t_1
index: 23
Name: Bateos_t
index: 24
Name: Bateos_t_1
index: 25
Name: Bono_por_firma_t
index: 26
Name: Bono_por_firma_t_1
index: 27
Name: Cantidad de equipos_t
index: 28
Name: Cantidad_agentes_libres_t
index: 29
Name: Cantidad_agentes_libres_t_1
index: 30
Name: Cantidad_de_equipos_t
ind

In [154]:
# Hitter
hitter_nan = empiric_panel_hitter.isna().any()
hitter_name = empiric_panel_hitter.columns
for con in range(0, len(hitter_nan)):
    if hitter_nan.iloc[con]:
        print("Name: " + str(hitter_name[con]))

Let us repeat the same process for the pitchers.

In [155]:
start = empiric_panel_pitcher.columns.get_loc('Bateos_2_t_1_H')
end = empiric_panel_pitcher.columns.get_loc('Wins_t_1_L') + 1
sport_st_pitcher_names = empiric_panel_pitcher.iloc[:,start:end].columns
end_pitcher_name = int(len(sport_st_pitcher_names)/2)

In [156]:
# Pitcher
for sport_stat in range(0,end_pitcher_name):
    I_pitcher = []
    for y,max_stat,min_stat in zip(empiric_panel_pitcher[pitcher_names[sport_st_pitcher[sport_stat]]],
                                   empiric_panel_pitcher[sport_st_pitcher_names[2*sport_stat]],
                                   empiric_panel_pitcher[sport_st_pitcher_names[2*sport_stat + 1]]):
        if y > (max_stat + min_stat)/2:
            I_pitcher.append(0)
        else: 
            I_pitcher.append(1)
    
    I_name = "I_" + pitcher_names[sport_st_pitcher[sport_stat]]
    empiric_panel_pitcher[I_name] = I_pitcher

In [157]:
pitcher_names = empiric_panel_pitcher.columns
for index in range(0,len(pitcher_names)):
    print("Name: " + str(pitcher_names[index]))
    print("index: " + str(index))

Name: Anio_t
index: 0
Name: Jugador
index: 1
Name: Acronimo_t
index: 2
Name: Acronimo_t_1
index: 3
Name: Altura_t
index: 4
Name: Altura_t_1
index: 5
Name: Anio_de_agente_libre_t
index: 6
Name: Anio_de_agente_libre_t_1
index: 7
Name: Anios_de_contrato_t
index: 8
Name: Anios_de_contrato_t_1
index: 9
Name: Antiguedad_t
index: 10
Name: Antiguedad_t_1
index: 11
Name: Bateos_2_t
index: 12
Name: Bateos_2_t_1
index: 13
Name: Bateos_t
index: 14
Name: Bateos_t_1
index: 15
Name: Bono_por_firma_t
index: 16
Name: Bono_por_firma_t_1
index: 17
Name: Cantidad de equipos_t
index: 18
Name: Cantidad_agentes_libres_t
index: 19
Name: Cantidad_agentes_libres_t_1
index: 20
Name: Cantidad_de_equipos_t
index: 21
Name: Cantidad_de_equipos_t_1
index: 22
Name: Carreras_2_t
index: 23
Name: Carreras_2_t_1
index: 24
Name: Carreras_ganadas_2_t
index: 25
Name: Carreras_ganadas_2_t_1
index: 26
Name: Carreras_ganadas_t
index: 27
Name: Carreras_ganadas_t_1
index: 28
Name: Carreras_t
index: 29
Name: Carreras_t_1
index: 30

In [158]:
# Pitcher
pitcher_nan = empiric_panel_pitcher.isna().any()
pitcher_name = empiric_panel_pitcher.columns
for con in range(0, len(pitcher_nan)):
    if pitcher_nan.iloc[con]:
        print("Name: " + str(pitcher_name[con]))
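
The indicator construction used for hitters and pitchers compares each statistic with the midpoint of two reference columns and assigns 0 above the midpoint and 1 otherwise. A vectorized sketch of the same rule with `np.where` follows; the frame `toy` and its values are hypothetical, and reading the `_H` and `_L` columns as upper and lower reference values is an assumption based on the variable names `max_stat` and `min_stat`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "WAR_t_1":   [3.0, 0.5, 3.0],
    "WAR_t_1_H": [4.0, 4.0, 4.0],   # assumed upper reference value
    "WAR_t_1_L": [0.0, 0.0, 2.0],   # assumed lower reference value
})

# Midpoint between the two reference columns, row by row
midpoint = (toy["WAR_t_1_H"] + toy["WAR_t_1_L"]) / 2

# 0 when the statistic is strictly above the midpoint, 1 otherwise
toy["I_WAR_t_1"] = np.where(toy["WAR_t_1"] > midpoint, 0, 1)
print(toy["I_WAR_t_1"].tolist())  # [0, 1, 1]
```

Rows where the comparison involves a missing value fall into the `1` branch, which is also what the explicit `if`/`else` loop does.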

Let us obtain the auxiliary variables.

In [159]:
# Output
pitcher_stat = pitcher_names[sport_st_pitcher]
# Max output
pitcher_max = [col for col in empiric_panel_pitcher if col.endswith('H')]
# Dummy columns
start = empiric_panel_pitcher.columns.get_loc('I_Bateos_2_t_1')
end = empiric_panel_pitcher.columns.get_loc('I_Wins_t_1') + 1
I_pitcher = empiric_panel_pitcher.iloc[:,start:end].columns

In [160]:
empiric_panel_pitcher[I_pitcher]

Unnamed: 0,I_Bateos_2_t_1,I_Bateos_t_1,I_Carreras_2_t_1,I_Carreras_t_1,I_Carreras_ganadas_2_t_1,I_Carreras_ganadas_t_1,I_Comando_2_t_1,I_Comando_t_1,I_Control_2_t_1,I_Control_t_1,...,I_Strike_outs_2_t_1,I_Strike_outs_t_1,I_WAR_2_t_1,I_WAR_t_1,I_WHIP_2_t_1,I_WHIP_t_1,I_Walks_2_t_1,I_Walks_t_1,I_Wins_2_t_1,I_Wins_t_1
0,0,0,0,0,0,0,1,1,0,0,...,0,0,1,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
4,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3369,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3370,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3371,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3372,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


Let us verify that they have the same length.

In [161]:
len(I_pitcher) == len(pitcher_max) == len(pitcher_stat)

True

In [162]:
# Pitcher
for stat in range(0,len(pitcher_stat)):
    # print("Max")
    # print(np.sqrt(empiric_panel_pitcher[pitcher_max[stat]]).max())
    
    print((-1)**(empiric_panel_pitcher[I_pitcher[stat]]).min())
    print("Aux_2")
    print((empiric_panel_pitcher[pitcher_stat[stat]]/np.sqrt(empiric_panel_pitcher[pitcher_max[stat]])).max())

1
Aux_2
265.14285714285717
1
Aux_2
16.283207827171438
1
Aux_2
160.04301075268816
1
Aux_2
12.650810675711188
1
Aux_2
144.95294117647057
1
Aux_2
12.039640408935417
1
Aux_2
16074.370967741934
1
Aux_2
126.78474264572189
1
Aux_2
0.203442226234249
1
Aux_2
0.4510457030437703
1
Aux_2
1.044973544973545
1
Aux_2
1.0222394753547455
1
Aux_2
27.0
1
Aux_2
5.196152422706632
1
Aux_2
233.60946773433815
1
Aux_2
15.284288263911348
1
Aux_2
19.0
1
Aux_2
4.358898943540673
1
Aux_2
50.0
1
Aux_2
7.071067811865475
1
Aux_2
326.0
1
Aux_2
18.05547008526779
1
Aux_2
8.88
1
Aux_2
2.979932885150268
1
Aux_2
3.0
1
Aux_2
1.7320508075688774
1
Aux_2
96.0
1
Aux_2
9.797958971132713
1
Aux_2
20.0
1
Aux_2
4.47213595499958


In [166]:
# Pitcher
for stat in range(0,len(pitcher_stat)):
    # Sign: +1 when the dummy is 0, -1 when it is 1
    i = (-1)**(empiric_panel_pitcher[I_pitcher[stat]])
    # Statistic scaled by the square root of its maximum-output column
    x = empiric_panel_pitcher[pitcher_stat[stat]]/np.sqrt(empiric_panel_pitcher[pitcher_max[stat]])
    X_pitcher = i*x
    
    # X name
    name = 'X_' + pitcher_stat[stat]
    empiric_panel_pitcher[name] = X_pitcher

In [167]:
empiric_panel_pitcher.iloc[:,234:]

Unnamed: 0,X_Bateos_t,X_Carreras_2_t,X_Carreras_t,X_Carreras_ganadas_2_t,X_Carreras_ganadas_t,X_Comando_2_t,X_Comando_t,X_Control_2_t,X_Control_t,X_Dominio_2_t,...,X_Strike_outs_2_t_1,X_Strike_outs_t_1,X_WAR_2_t_1,X_WAR_t_1,X_WHIP_2_t_1,X_WHIP_t_1,X_Walks_2_t_1,X_Walks_t_1,X_Wins_2_t_1,X_Wins_t_1
0,13.980884,160.043011,12.650811,138.151163,11.753772,-0.012016,-0.109616,0.045241,0.2127,0.066829,...,209.0,14.456832,-1.179941,1.086251,0.97281,0.986312,63.225352,7.951437,7.142857,2.672612
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3369,,,,,,,,,,,...,,,,,,,,,,
3370,,,,,,,,,,,...,,,,,,,,,,
3371,,,,,,,,,,,...,,,,,,,,,,
3372,,,,,,,,,,,...,,,,,,,,,,


In [208]:
# Pitcher
pitcher_nan = empiric_panel_pitcher.isna().any()
pitcher_name = empiric_panel_pitcher.columns
for con in range(0, len(pitcher_nan)):
    if pitcher_nan.iloc[con]:
        print("Name: " + str(pitcher_name[con]))

Name: X_Bateos_2_t
Name: X_Bateos_t
Name: X_Carreras_2_t
Name: X_Carreras_t
Name: X_Carreras_ganadas_2_t
Name: X_Carreras_ganadas_t
Name: X_Comando_2_t
Name: X_Comando_t
Name: X_Control_2_t
Name: X_Control_t
Name: X_Dominio_2_t
Name: X_Dominio_t
Name: X_ERA_2_t
Name: X_ERA_t
Name: X_Inning_pitched_2_t
Name: X_Inning_pitched_t
Name: X_Losses_2_t_1
Name: X_Losses_t_1
Name: X_Saves_2_t_1
Name: X_Saves_t_1
Name: X_Strike_outs_2_t_1
Name: X_Strike_outs_t_1
Name: X_WAR_2_t_1
Name: X_WAR_t_1
Name: X_WHIP_2_t_1
Name: X_WHIP_t_1
Name: X_Walks_2_t_1
Name: X_Walks_t_1
Name: X_Wins_2_t_1
Name: X_Wins_t_1
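
One plausible source of the *NaN* entries listed above is the division inside the transformation: after imputing 0, a row whose statistic and maximum-output column are both 0 produces 0/0. The sketch below reproduces this on hypothetical data and shows one possible way to handle it; whether such rows should be set to 0 is an assumption, not something the notebook states.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the second row mimics a player with imputed zeros
toy = pd.DataFrame({
    "Wins_t_1":   [9.0, 0.0, 4.0],
    "Wins_t_1_H": [16.0, 0.0, 4.0],
    "I_Wins_t_1": [0, 1, 1],
})

sign = (-1.0) ** toy["I_Wins_t_1"]          # +1 if the dummy is 0, -1 if it is 1
scale = np.sqrt(toy["Wins_t_1_H"])          # square root of the reference column
X = sign * toy["Wins_t_1"] / scale          # second row is 0/0, i.e. NaN

# Optional clean-up (assumption): treat undefined ratios as zero performance
X_filled = X.replace([np.inf, -np.inf], np.nan).fillna(0)
print(X_filled.tolist())
```

Note that `sm.OLS(..., missing='drop')` discards such rows anyway, so the clean-up only matters if those observations should remain in the regression sample.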


Now, let us filter the columns that contain the performance measures for period **t_1**. To do so, let us first find the indices of those variables.

In [174]:
# Pitcher
for stat in range(0,len(pitcher_stat)):
    # Stat
    print(colored(pitcher_stat[stat], "cyan"))
    
    # X name
    name = 'X_' + pitcher_stat[stat]
    
    # OLS variables
    Y = empiric_panel_pitcher['Y'].tolist()
    X = empiric_panel_pitcher[name].tolist()
    X = sm.add_constant(X)
    
    # Model
    model = sm.OLS(Y, X,
                   missing = 'drop').fit()
    print(model.summary())
    print("\n")

Bateos_2_t
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.5607
Date:                Thu, 26 Jan 2023   Prob (F-statistic):              0.454
Time:                        08:54:13   Log-Likelihood:                -1507.7
No. Observations:                 725   AIC:                             3019.
Df Residuals:                     723   BIC:                             3029.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1710      0.072

$$
\frac{x - x_{min}}{x_{max} - x_{min}}
$$
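
The expression above is the usual min-max rescaling, which maps a variable onto the interval $[0, 1]$. A minimal sketch of it follows; the helper name `min_max` and the sample values are hypothetical, and returning zeros for a constant column is a design choice made here to avoid dividing by zero.

```python
import pandas as pd

def min_max(s):
    """Rescale a numeric Series to [0, 1] using its own minimum and maximum."""
    span = s.max() - s.min()
    if span == 0:
        # Constant column: the ratio is undefined, so return zeros instead
        return pd.Series(0.0, index=s.index)
    return (s - s.min()) / span

x = pd.Series([2.0, 4.0, 10.0])
print(min_max(x).tolist())  # [0.0, 0.25, 1.0]
```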