# Players dataframes for three periods

In this script we build a clean dataset segmented into position players and pitchers. The resulting datasets are exported in three variants: players who are free agents, players who are not, and all players. The sections of the script are:

- **Inspecting the contents of the datasets.**
- **Cleaning the datasets and exporting them.**
- **Creating an indicator of whether the player is a free agent.**

Let's import the required modules and set the desired configuration.

In [1]:
import pandas as pd
import numpy as np
import collections
import math
import os
import warnings
print('Modules imported')

Modules imported


In [2]:
# Settings
warnings.filterwarnings('ignore')
# Limit the number of rows displayed
pd.options.display.max_rows = 100

In [3]:
# Working directory
print("Previous working directory: " + str(os.getcwd()))
# Change it
os.chdir('/home/usuario/Documentos/Github/Proyectos/MLB_HN/')

Previous working directory: /home/usuario/Documentos/Github/Proyectos/MLB_HN/ETL_Scripts/Whole_Contract


In [4]:
# Check the current working directory
print(os.getcwd())
# The directory above is already correct; if it were not, we would change it as follows:
path = '/home/usuario/Documentos/Github/Proyectos/MLB_HN'
os.chdir(path)  # os.chdir returns None, so report os.getcwd() afterwards
print("New working directory: " + os.getcwd())

/home/usuario/Documentos/Github/Proyectos/MLB_HN
New working directory: /home/usuario/Documentos/Github/Proyectos/MLB_HN


## Inspecting the datasets

Looking at one year's datasets is enough to see which variables they contain. We pick 2012.

Below we show the contents of the datasets on *hitters*, *pitchers*, *free-agent salaries* and *salaries of all players*, in order to decide on the cleaning process to follow.

In [5]:
# File paths for 2012
free_agents_2012 = 'Data/Free_Agents/Free_Agents_2012.csv'
hitting_2012 = 'Data/Statistics/Hitting/hitting_2012.csv'
pitching_2012 = 'Data/Statistics/Pitching/pitching_2012.csv'
salary_2012 = 'Data/Salary/Salary_2012.csv'
teams_etl_2012 = 'ETL_data/Period_t_2/Teams/free_agents_team_2012.csv'

# Load the dataframes
df_free_agent_auxiliar_2012 = pd.read_csv(free_agents_2012)
df_hitting_auxiliar_2012 = pd.read_csv(hitting_2012)
df_pitching_auxiliar_2012 = pd.read_csv(pitching_2012)
df_salary_auxiliar_2012 = pd.read_csv(salary_2012)
df_teams_etl_2012 = pd.read_csv(teams_etl_2012)

FileNotFoundError: [Errno 2] No such file or directory: 'Data/Free_Agents/Free_Agents_2012.csv'

### Free agents

Let's first look at the dataframe.

In [None]:
df_free_agent_auxiliar_2012.head()

### Hitters

Let's look at the dataframe.

In [None]:
df_hitting_auxiliar_2012.head()

In [None]:
df_hitting_auxiliar_2012.columns

The terms in the dataset are kept in their original English to avoid misunderstandings in translation.

- **Pos**: Player position.
- **Team**: Team acronym.
- **GP**: Games played.
- **GP%**: Games played %.
- **AB**: At bats.
- **H**: Hits.
- **HR**: Home runs.
- **RBI**: Runs batted in.
- **AVG**: Batting average.
- **OPS**: On-base plus slugging.

The *Cash2022* column will be omitted: the player's current value is not relevant to this work, since some free agents have retired in later years.

### Pitchers

In [None]:
df_pitching_auxiliar_2012.head()

In [None]:
df_pitching_auxiliar_2012.columns

#### Notation

Here is what some of the terms mean:

- **Pos**: Player position.
- **Team**: Team acronym.
- **GP**: Games played.
- **GS**: Games started.
- **IP**: Innings pitched.
- **H**: Hits.
- **R**: Runs.
- **ER**: Earned runs.
- **BB**: Walks.
- **SO**: Strikeouts.
- **W**: Wins.
- **L**: Losses.
- **SV**: Saves.
- **WHIP**: Walks plus hits per inning pitched.
- **ERA**: Earned run average.

For the same reasons, the *Cash2022* column will be dropped.

### Salaries
In this case there are far fewer variables than in the previous datasets.

In [None]:
df_salary_auxiliar_2012.head()

- **BaseSalary**: A base salary is the minimum amount you can expect to earn in exchange for your time or services. This is the amount earned before benefits, bonuses, or compensation is added.
- **Payroll Salary**: Payroll is the compensation a business must pay to its employees for a set period and on a given date.
- **Adj Salary**: Adjusted Salary means the regular salary, wages and commissions, if any, payable to a Participant by the Company for the Participant's service, excluding any bonuses or other compensation.

### Teams ETL

This team dataset is the output of the ETL process.

In [None]:
df_teams_etl_2012.head()

### Teams by state

In [None]:
states = 'Data/Teams/team_states.csv'
df_states = pd.read_csv(states)

In [None]:
df_states.head()

### Acronyms

This will serve as an intermediate key for joining the team datasets.

In [None]:
acronym = 'Data/Teams/team_acronym.csv'
df_acronym = pd.read_csv(acronym)

In [None]:
df_acronym.head()

Let's join this dataframe with the teams-by-state one.

In [None]:
acronym_state = pd.merge(df_states, df_acronym, on = 'Estado')

In [None]:
acronym_state.head()

In this case, the variable names are self-explanatory.

## Algorithm for building the datasets

Next, the code is generalized so that the *dataframes* above can be built for a set of consecutive years, as in our case.

In [None]:
# Helpers:
free_agents = 'Data/Free_Agents/Free_Agents_'
hitting = 'Data/Statistics/Hitting/hitting_'
pitching = 'Data/Statistics/Pitching/pitching_'
salary = 'Data/Salary/Salary_'
teams = 'ETL_data/Period_t_2/Teams/free_agents_team_'
csv = '.csv'
period = 11
# Originals:
df_free_agents = [None]*period
df_hitting = [None]*period
df_pitching = [None]*period
df_salary = [None]*period
df_teams = [None]*period
# Copies:
df_free_agents_copy = [None]*period
df_hitting_copy = [None]*period
df_pitching_copy = [None]*period
df_salary_copy = [None]*period
df_teams_copy = [None]*period
# Final product:
df_pitchers = [None]*period
df_hitters = [None]*period
df_pitchers_free_agents = [None]*period
df_hitters_free_agents = [None]*period
df_pitchers_no_free_agents = [None]*period
df_hitters_no_free_agents = [None]*period

Let's read all the files and create the copies.

In [None]:
for i in range(0,period):    
    df_free_agents[i] = pd.read_csv(free_agents + str(2011 + i) + csv)
    df_hitting[i] = pd.read_csv(hitting + str(2011 + i) + csv)
    df_pitching[i] = pd.read_csv(pitching + str(2011 + i) + csv)
    df_salary[i] = pd.read_csv(salary + str(2011 + i) + csv)
    df_teams[i] = pd.read_csv(teams + str(2011 + i) + csv)
    
    df_free_agents_copy[i] = df_free_agents[i].copy()
    df_hitting_copy[i] = df_hitting[i].copy()
    df_pitching_copy[i] = df_pitching[i].copy()
    df_salary_copy[i] = df_salary[i].copy()
    df_teams_copy[i] = df_teams[i].copy()
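
The file names in the loop above follow a prefix + year + `.csv` pattern. As a small sketch of the same idea, the paths could be keyed by calendar year instead of list position; the names `first_year` and `hitting_paths` are hypothetical and only the hitting prefix is shown.

```python
first_year = 2011
period = 11
hitting = 'Data/Statistics/Hitting/hitting_'

# One path per season, keyed by calendar year instead of list position
hitting_paths = {year: f"{hitting}{year}.csv"
                 for year in range(first_year, first_year + period)}

print(hitting_paths[2012])
```

Each path could then be passed to `pd.read_csv`, exactly as the loop does.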

Let's process each dataset separately.

#### Free agents

The team that signs the free agent is not kept, since that information is also contained in another dataset that is easier to work with during the _ETL_ process.

In [None]:
for i in range(0,period):
    # .copy() avoids modifying a view of the original dataframe
    df_free_agents_copy[i]  = df_free_agents_copy[i][['Player', 'Year', 'Status', 'Team From',
                                                      'YRS', 'Value', 'AAV']].copy()
    df_free_agents_names  = ['Jugador', 'Anio', 'Status', 'Equipo_anterior',
                             'Anios_contrato', 'Valor_contrato', 'Valor_promedio_contrato']
    df_free_agents_copy[i].columns = df_free_agents_names

    # Strip the dollar sign and thousands separators, then convert to numbers.
    # regex=False treats "$" as a literal character in every pandas version.
    for col in ['Valor_contrato', 'Valor_promedio_contrato']:
        cleaned = (df_free_agents_copy[i][col]
                   .str.replace("$", "", regex=False)
                   .str.replace(",", "", regex=False))
        df_free_agents_copy[i][col] = pd.to_numeric(cleaned)
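
The currency clean-up in the cell above can be sketched as a small reusable helper. This is a minimal illustration, assuming the values look like `$1,250,000`; the function name `money_to_number` and the sample values are hypothetical and not part of the original data.

```python
import pandas as pd

def money_to_number(series: pd.Series) -> pd.Series:
    """Turn strings such as '$1,250,000' into numbers (hypothetical helper)."""
    cleaned = (series.astype(str)
                     .str.replace("$", "", regex=False)   # literal dollar sign
                     .str.replace(",", "", regex=False))  # thousands separator
    # errors='coerce' turns anything unparseable (e.g. '-') into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")

sample = pd.Series(["$1,250,000", "$775,000", "-"])
print(money_to_number(sample).tolist())
```

Whether unparseable entries should become `NaN` or raise an error is a design choice; the notebook's own cell raises.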

Let's add the free agents to every season in which their contract is in force, instead of only having observations for the year they signed. To see how many rows will be added, let's first look at the initial size of the datasets.

In [None]:
period_t = period - 1
df_contracts = [None]*(period_t)

In [None]:
for year in range(1,period_t):
    
    max_year_contract = max(df_free_agents_copy[year]['Anios_contrato'])
    years = max_year_contract - 1
    df_contracts[year] = [None]*years
    
    for incremento in range(0,years):
        diff_t = 1 + incremento
        real_year = 2011 + year + diff_t
        year_bound = 2022

        if real_year <= year_bound:
            df_contracts[year][incremento] = df_free_agents_copy[year][df_free_agents_copy[year]['Anios_contrato'] > diff_t]

In [None]:
for year in range(1,period_t):
    years = len(df_contracts[year])
    
    for incremento in range(0,years):
        diff_t = 1 + incremento
        real_year = 2011 + year + diff_t
        year_bound = 2022

        if real_year <= year_bound:
            frames = [df_free_agents_copy[year + diff_t], df_contracts[year][incremento]]
            
            df_free_agents_copy[year + diff_t] = pd.concat(frames)
            
            df_free_agents_copy[year + diff_t].reset_index(drop = True, inplace = True)
            df_free_agents_copy[year + diff_t] = df_free_agents_copy[year + diff_t].sort_values(by = 'Jugador', ascending = True)
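
The idea behind the two loops above, carrying a signing forward into every season the contract covers, can be sketched on toy data. This is an illustrative alternative, not the notebook's implementation: the players, the values and the column name `Temporada` are made up, and everything is kept in a single long dataframe rather than one dataframe per year.

```python
import pandas as pd

# Toy signing data; names and values are made up for illustration
signings = pd.DataFrame({
    "Jugador": ["A", "B"],
    "Anio": [2012, 2013],
    "Anios_contrato": [3, 1],
})

last_year = 2022  # same upper bound used in the loops above

# Repeat each row once per contract year, then offset the season accordingly
expanded = signings.loc[signings.index.repeat(signings["Anios_contrato"])].copy()
offset = expanded.groupby(level=0).cumcount().to_numpy()
expanded["Temporada"] = expanded["Anio"].to_numpy() + offset
expanded = expanded[expanded["Temporada"] <= last_year].reset_index(drop=True)

print(expanded)
```

Player A, who signed a three-year contract in 2012, ends up with one row for each of 2012, 2013 and 2014.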

#### Salaries

Since the salaries will be joined to the _hitters_ and _pitchers_ datasets, their _ETL_ process is done first.

In [None]:
for i in range(0,period):
    # .copy() avoids modifying a view of the original dataframe
    df_salary_copy[i] = df_salary_copy[i][['Player', 'Team', 'BaseSalary',
                                           'Payroll Salary', 'Adj Salary']].copy()
    df_salary_names = ['Jugador', 'Equipo', 'Sueldo_base', 'Sueldo', 'Sueldo_regular']
    df_salary_copy[i].columns = df_salary_names

    # Strip the dollar sign and thousands separators, then convert to numbers.
    # regex=False treats "$" as a literal character in every pandas version.
    for col in ['Sueldo_base', 'Sueldo', 'Sueldo_regular']:
        cleaned = (df_salary_copy[i][col]
                   .str.replace("$", "", regex=False)
                   .str.replace(",", "", regex=False))
        df_salary_copy[i][col] = pd.to_numeric(cleaned)

#### Hitters

In [None]:
for i in range(0,period):    
    df_hitting_copy[i] = df_hitting_copy[i][['Player', 'Pos', 'GP', 'GP%', 'AB', 'H',
                                             'HR', 'RBI', 'AVG', 'OPS', 'WAR', 'TVS',
                                             'Age', 'Weight', 'Height']]
    df_hitting_names = ['Jugador', 'Posicion', 'Juegos', 'Porcetnaje_juegos', 'At-bats',
                        'Bateos', 'Home-runs', 'RBI', 'Porcentaje_bateo', 'OPS',
                        'WAR', 'TVS', 'Edad', 'Peso', 'Altura']
    df_hitting_copy[i].columns = df_hitting_names
    
    hitting_aux_1 = df_hitting_copy[i]['Altura'].str.replace("\"","")
    hitting_aux_2 = hitting_aux_1.str.replace("'","")
    df_hitting_copy[i]['Altura'] = hitting_aux_2
    df_hitting_copy[i]['Altura'] = pd.to_numeric(df_hitting_copy[i]['Altura'])/10
    
    df_hitters[i] = pd.merge(df_hitting_copy[i], df_salary_copy[i], on = 'Jugador')

    df_hitters[i] = df_hitters[i].rename(columns = {'Equipo':'Acronimo'})
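
The cell above removes the quote characters from `Altura` and divides by 10, so a height written as `6'2"` is stored as `6.2`. As a hedged alternative, assuming heights are stored as feet'inches" strings, they could be converted to total inches, which also handles two-digit inch values such as `6'10"`. The helper name `height_to_inches` and the sample values are hypothetical.

```python
import pandas as pd

def height_to_inches(series: pd.Series) -> pd.Series:
    """Convert strings like 6'2" to total inches (hypothetical helper)."""
    parts = series.astype(str).str.extract(r"(?P<feet>\d+)'\s*(?P<inches>\d+)")
    feet = pd.to_numeric(parts["feet"], errors="coerce")
    inches = pd.to_numeric(parts["inches"], errors="coerce")
    return feet * 12 + inches

sample = pd.Series(["6'2\"", "6'10\"", "5'11\""])
print(height_to_inches(sample).tolist())
```

Values that do not match the pattern come out as `NaN` rather than `0.0`, which keeps missing heights out of summary statistics.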

#### Pitchers

In [None]:
for i in range(0,period):    
    df_pitching_copy[i] = df_pitching_copy[i][['Player', 'Pos', 'GP', 'GS', 'IP', 'H', 
                                               'R', 'ER', 'BB', 'SO', 'W', 'L', 'SV', 
                                               'WHIP', 'ERA', 'WAR', 'TVS', 'Age',
                                               'Weight', 'Height']]
    df_pitching_names = ['Jugador', 'Posicion', 'Juegos', 'Juegos_iniciados', 'Inning_pitched', 'Bateos_pitcher',
                         'Carreras', 'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins', 'Losses',
                         'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Edad', 'Peso', 'Altura']
    df_pitching_copy[i].columns = df_pitching_names    
    
    pitching_aux_1 = df_pitching_copy[i]['Altura'].str.replace("\"","")
    pitching_aux_2 = pitching_aux_1.str.replace("'","")
    df_pitching_copy[i]['Altura'] = pitching_aux_2
    df_pitching_copy[i]['Altura'] = pd.to_numeric(df_pitching_copy[i]['Altura'])/10

    df_pitchers[i] = pd.merge(df_pitching_copy[i], df_salary_copy[i], on = 'Jugador')
    
    df_pitchers[i] = df_pitchers[i].rename(columns = {'Equipo':'Acronimo'})

## Adding variables suggested in the literature

The first variables we add are the squares of all the performance statistics, together with the following variables:

- DOMINANCE = $\text{Strike-outs}/(9 \cdot \text{Innings Pitched})$
- CONTROL = $\text{Walks}/(9 \cdot \text{Innings Pitched})$
- COMMAND = $\text{Strike-outs}/\text{Walks}$
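
The three ratios above can be checked on a toy example. The numbers are made up, the column names follow the notebook, and replacing zero walks with `NaN` is an assumption of this sketch rather than something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy pitching lines; the numbers are made up for illustration
pitchers = pd.DataFrame({
    "Strike-outs": [180, 45, 10],
    "Walks": [60, 15, 0],
    "Inning_pitched": [200.0, 50.0, 9.0],
})

pitchers["Dominio"] = pitchers["Strike-outs"] / (9 * pitchers["Inning_pitched"])
pitchers["Control"] = pitchers["Walks"] / (9 * pitchers["Inning_pitched"])
# A pitcher with zero walks would produce a division by zero (inf); use NaN instead
pitchers["Comando"] = pitchers["Strike-outs"] / pitchers["Walks"].replace(0, np.nan)

print(pitchers)
```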

In [None]:
df_hitters[2].head()

In [None]:
df_hitters[2].columns

In [None]:
df_pitchers[2].head()

In [None]:
for i in range(0,period):
    df_pitchers[i]['Dominio'] = df_pitchers[i]['Strike-outs']/(9*df_pitchers[i]['Inning_pitched'])
    df_pitchers[i]['Control'] = df_pitchers[i]['Walks']/(9*df_pitchers[i]['Inning_pitched'])
    df_pitchers[i]['Comando'] = df_pitchers[i]['Strike-outs']/df_pitchers[i]['Walks']

In [None]:
df_pitchers[2].head()

In [None]:
df_pitchers[2].columns

To make creating the squared variables more efficient, we select the columns by index.

In [None]:
# Specify the columns to use by their index
square_pitchers_index = list(range(2,17)) + [24,25,26]
square_hitters_index = list(range(2,12))

In [None]:
for i in range(0,period):
    for j in square_pitchers_index:
        df_pitchers[i][df_pitchers[i].columns[j] + '_2'] = np.power(df_pitchers[i][df_pitchers[i].columns[j]], 2)
    
    for k in square_hitters_index:
        df_hitters[i][df_hitters[i].columns[k] + '_2'] = np.power(df_hitters[i][df_hitters[i].columns[k]], 2)
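
Selecting by position depends on the column order staying fixed. A possible variant, shown here on a toy dataframe, selects the columns by name and builds all the squares in one step; the list `square_columns` and the sample values are hypothetical.

```python
import pandas as pd

# Toy dataframe; only the column names mirror the notebook
hitters = pd.DataFrame({
    "Jugador": ["A", "B"],
    "Juegos": [150, 90],
    "WAR": [2.5, -0.5],
})

square_columns = ["Juegos", "WAR"]  # hypothetical selection by name

# Build all squared columns at once and attach them with the same '_2' suffix
squared = hitters[square_columns].pow(2).add_suffix("_2")
hitters = pd.concat([hitters, squared], axis=1)

print(hitters.columns.tolist())
```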

Let's look at the final result.

In [None]:
df_pitchers[2].head()

In [None]:
df_pitchers[2].columns

In [None]:
df_hitters[7].head()

In [None]:
df_hitters[7].columns

Following the suggestion of some papers, let's take the logarithm of the salaries.

In [None]:
for year in range(0,period):
    df_hitters[year]['ln_Sueldo_base'] = np.log(df_hitters[year]['Sueldo_base'])
    df_hitters[year]['ln_Sueldo'] = np.log(df_hitters[year]['Sueldo'])
    df_hitters[year]['ln_Sueldo_regular'] = np.log(df_hitters[year]['Sueldo_regular'])
    df_hitters[year]['Anio'] = year + 1
    
    df_pitchers[year]['ln_Sueldo_base'] = np.log(df_pitchers[year]['Sueldo_base'])
    df_pitchers[year]['ln_Sueldo'] = np.log(df_pitchers[year]['Sueldo'])
    df_pitchers[year]['ln_Sueldo_regular'] = np.log(df_pitchers[year]['Sueldo_regular'])
    df_pitchers[year]['Anio'] = year + 1
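
One detail worth keeping in mind with the log transform: `np.log(0)` evaluates to `-inf`. Whether zero salaries actually occur in these files is not known from the notebook, so the following is only a defensive sketch on made-up values.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([775000, 0, 12000000], name="Sueldo")  # made-up values

# Mask non-positive salaries so they become NaN instead of -inf
ln_salaries = np.log(salaries.where(salaries > 0))

print(ln_salaries)
```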

### Aggregated team data

All that remains is to add the relevant data of the team each player belongs to, using the dataset on the number of teams per state.

In [None]:
df_teams_copy[2].head()

In [None]:
for i in range(0,period):
    df_hitters[i] = pd.merge(df_teams_copy[i], df_hitters[i], on = 'Acronimo')
    df_pitchers[i] = pd.merge(df_teams_copy[i], df_pitchers[i], on = 'Acronimo')
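
The merge above attaches one row of team data to every player of that team. As a hedged sketch, the `validate` argument of `pd.merge` can assert that each acronym appears only once on the team side; the toy dataframes and their values are made up.

```python
import pandas as pd

teams = pd.DataFrame({"Acronimo": ["MIL", "NYY"], "Victorias": [96, 97]})
players = pd.DataFrame({"Jugador": ["A", "B", "C"],
                        "Acronimo": ["MIL", "MIL", "NYY"]})

# validate='one_to_many' raises MergeError if an acronym is duplicated in `teams`
merged = pd.merge(teams, players, on="Acronimo", validate="one_to_many")

print(merged)
```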

In [None]:
df_hitters[7].head()

## Variables from periods t-1 and t-2

We will *merge* the datasets of year $t$ with those of years $t-1$ and $t-2$ on the players. The reason is that we are only interested in players who have been free agents for more than one year.

Since the first dataset is from 2011 and two lagged periods are needed, the merged data start in 2013. Let's create the dataframes that will hold the data for the model. So that the periods do not overwrite one another, we create auxiliary dataframes to store the new data.

In [None]:
df_hitters_copy = [None]*period
df_pitchers_copy = [None]*period

In [None]:
for i in range(0,period):
    df_hitters_copy[i] = df_hitters[i].copy()
    df_pitchers_copy[i] = df_pitchers[i].copy()

In [None]:
for year in range(2,period):
    df_aux_merge_pitchers = pd.merge(df_pitchers_copy[year], df_pitchers_copy[year-1],
                                     how = 'inner', on = 'Jugador')
    df_pitchers[year] = pd.merge(df_aux_merge_pitchers, df_pitchers_copy[year-2],
                                 how = 'inner', on = 'Jugador')
    
    df_aux_merge_hitters = pd.merge(df_hitters_copy[year], df_hitters_copy[year-1],
                                    how = 'inner', on = 'Jugador')
    df_hitters[year] = pd.merge(df_aux_merge_hitters, df_hitters_copy[year-2],
                                how = 'inner', on = 'Jugador')
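
By default `pd.merge` appends `_x` to overlapping columns from the left dataframe and `_y` to those from the right one, so in the cell above the year-`t` columns receive `_x` and the year-`t-1` columns receive `_y`. An alternative sketch, not the notebook's approach, tags the lagged dataframes explicitly before merging; the toy data and the names `lag_1` and `lag_2` are hypothetical.

```python
import pandas as pd

year_t = pd.DataFrame({"Jugador": ["A", "B"], "WAR": [2.0, 1.0]})
year_t_1 = pd.DataFrame({"Jugador": ["A", "B"], "WAR": [1.5, 0.5]})
year_t_2 = pd.DataFrame({"Jugador": ["A"], "WAR": [1.0]})

# Tag the lagged columns up front so no '_x'/'_y' suffixes need renaming later
lag_1 = year_t_1.set_index("Jugador").add_suffix("_t_1").reset_index()
lag_2 = year_t_2.set_index("Jugador").add_suffix("_t_2").reset_index()

merged = (year_t.merge(lag_1, on="Jugador", how="inner")
                .merge(lag_2, on="Jugador", how="inner"))

print(merged)
```

Only player A survives the inner joins, because B has no record in the `t-2` dataframe.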

Next we verify that the number of columns is the same in every period, except for the first two, which have no lagged data.

In [None]:
for year in range(0,period):
    print(df_pitchers[year].columns.shape)

In [None]:
for year in range(0,period):
    print(df_hitters[year].columns.shape)

In [49]:
for year in range(2,period):       
    df_pitchers[year].columns = df_pitchers[year].columns.str.replace('_x', '_t_2')
    df_pitchers[year].columns = df_pitchers[year].columns.str.replace('_y', '_t_1')
    df_pitchers[year].columns = df_pitchers[year].columns.str.replace('-', '_')
    df_pitchers[year].columns = df_pitchers[year].columns.str.replace(' ', '_')
    df_pitchers[year].drop(['ln_Sueldo_base_t_1', 'ln_Sueldo_t_1', 'ln_Sueldo_regular_t_1',
                            'ln_Sueldo_base_t_2', 'ln_Sueldo_t_2', 'ln_Sueldo_regular_t_2'],
                            axis = 1, inplace = True)
    df_pitchers[year] = df_pitchers[year].sort_values(by = 'Jugador', ascending = True)
    df_pitchers[year].reset_index(drop = True, inplace = True)
    
    df_hitters[year].columns = df_hitters[year].columns.str.replace('_x', '_t_2')
    df_hitters[year].columns = df_hitters[year].columns.str.replace('_y', '_t_1')
    df_hitters[year].columns = df_hitters[year].columns.str.replace('-', '_')
    df_hitters[year].columns = df_hitters[year].columns.str.replace(' ', '_')
    df_hitters[year].drop(['ln_Sueldo_base_t_1', 'ln_Sueldo_t_1', 'ln_Sueldo_regular_t_1',
                           'ln_Sueldo_base_t_2', 'ln_Sueldo_t_2', 'ln_Sueldo_regular_t_2'],
                           axis = 1, inplace = True)
    df_hitters[year] = df_hitters[year].sort_values(by = 'Jugador', ascending = True)
    df_hitters[year].reset_index(drop = True, inplace = True)

## Segmentation by free agents

We split the pitchers and hitters into two groups:

- Free agents.
- Non-free agents.

In [50]:
for i in range(0,period):
    df_hitters_free_agents[i] = pd.merge(df_free_agents_copy[i], df_hitters[i], on = 'Jugador')
    
    df_pitchers_free_agents[i] = pd.merge(df_free_agents_copy[i], df_pitchers[i], on = 'Jugador')
    
    df_hitters_no_free_agents[i] = df_hitters[i][~df_hitters[i].Jugador.isin(df_hitters_free_agents[i].Jugador)]
    df_pitchers_no_free_agents[i] = df_pitchers[i][~df_pitchers[i].Jugador.isin(df_pitchers_free_agents[i].Jugador)] 
    
    df_hitters_free_agents[i] = df_hitters_free_agents[i].reindex(sorted(df_hitters_free_agents[i].columns), axis=1)
    df_pitchers_free_agents[i] = df_pitchers_free_agents[i].reindex(sorted(df_pitchers_free_agents[i].columns), axis=1)
    df_hitters_no_free_agents[i] = df_hitters_no_free_agents[i].reindex(sorted(df_hitters_no_free_agents[i].columns), axis=1)
    df_pitchers_no_free_agents[i] = df_pitchers_no_free_agents[i].reindex(sorted(df_pitchers_no_free_agents[i].columns), axis=1) 
    
    # Export the dataframes separately
    df_hitters_free_agents[i].to_csv('ETL_data/Period_t_2/Free_Agent/Hitters/free_agents_batters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers_free_agents[i].to_csv('ETL_data/Period_t_2/Free_Agent/Pitchers/free_agents_pitchers_' + str(2011 + i) + '.csv', index = False)
    df_hitters_no_free_agents[i].to_csv('ETL_data/Period_t_2/No_Free_Agent/Hitters/no_free_agents_batters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers_no_free_agents[i].to_csv('ETL_data/Period_t_2/No_Free_Agent/Pitchers/no_free_agents_pitchers_' + str(2011 + i) + '.csv', index = False)
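
The split above uses an inner merge followed by an `isin` filter. A possible single-merge variant, sketched on made-up data, relies on the `indicator` argument of `merge`, which records where each row was found.

```python
import pandas as pd

players = pd.DataFrame({"Jugador": ["A", "B", "C"], "WAR": [2.0, 1.0, 0.5]})
free_agents = pd.DataFrame({"Jugador": ["B"], "Valor_contrato": [5000000]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
tagged = players.merge(free_agents, on="Jugador", how="left", indicator=True)

is_free_agent = tagged["_merge"] == "both"
free = tagged[is_free_agent].drop(columns="_merge")
not_free = tagged[~is_free_agent].drop(columns=["_merge", "Valor_contrato"])

print(free)
print(not_free)
```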

In [51]:
# Some examples
df_pitchers_no_free_agents[0].head()

Unnamed: 0,Acronimo,Altura,Anio,Bateos_pitcher,Bateos_pitcher_2,Cantidad_agentes_libres,Carreras,Carreras_2,Carreras_ganadas,Carreras_ganadas_2,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,MIL,0.0,1,214,45796,1,95,9025,87,7569,...,1.32,1.7424,0,66,4356,13,169,16.066802,16.066802,16.066802
1,MIL,6.2,1,193,37249,1,92,8464,81,6561,...,1.22,1.4884,0,59,3481,17,289,15.068274,14.994166,15.068274
2,MIL,6.0,1,175,30625,1,84,7056,79,6241,...,1.16,1.3456,0,57,3249,13,169,15.189226,15.189226,15.189226
3,MIL,6.2,1,161,25921,1,82,6724,73,5329,...,1.2,1.44,0,45,2025,16,256,16.4182,16.4182,16.4182
4,MIL,6.3,1,160,25600,1,82,6724,80,6400,...,1.39,1.9321,0,65,4225,11,121,13.003918,13.003918,13.003918


In [52]:
df_hitters_no_free_agents[0].head()

Unnamed: 0,Acronimo,Altura,Anio,At-bats,At-bats_2,Bateos,Bateos_2,Cantidad_agentes_libres,Edad,Equipo,...,TVS,TVS_2,Valor_contrato_total,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,MIL,0.0,1,563,316969,187,34969,1,27,Milwaukee Brewers,...,96.43,9298.7449,775000,96,7.74,59.9076,0,15.424948,15.201805,15.424948
1,MIL,6.6,1,492,242064,140,19600,1,29,Milwaukee Brewers,...,68.72,4722.4384,775000,96,3.18,10.1124,0,15.737323,15.687313,15.737323
2,MIL,6.1,1,546,298116,122,14884,1,28,Milwaukee Brewers,...,8.77,76.9129,775000,96,-0.74,0.5476,0,13.056224,13.056224,13.056224
3,MIL,0.0,1,378,142884,115,13225,1,30,Milwaukee Brewers,...,58.06,3370.9636,775000,96,2.95,8.7025,0,13.017003,13.017003,13.017003
4,MIL,6.0,1,430,184900,114,12996,1,25,Milwaukee Brewers,...,21.82,476.1124,775000,96,1.02,1.0404,0,12.957489,12.957489,12.957489


In [53]:
df_pitchers_free_agents[10].head()

Unnamed: 0,Acronimo,Acronimo_t_1,Acronimo_t_2,Altura,Altura_t_1,Altura_t_2,Anio_t_1,Anio_t_2,Anio_x,Anio_y,...,Walks_t_2,Wins,Wins_2,Wins_2_t_1,Wins_2_t_2,Wins_t_1,Wins_t_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,NYY,NYY,BOS,6.5,6.5,6.5,10,11,2019,9,...,44,6,36,4,81,2,9,16.012735,15.894952,16.012735
1,STL,STL,STL,6.7,6.7,6.7,10,11,2021,9,...,65,14,196,25,484,5,22,16.118096,14.508658,16.118096
2,BAL,BAL,LAA,6.3,6.3,6.3,10,11,2018,9,...,51,0,0,4,100,2,10,16.454568,16.454568,16.066802
3,CHW,CHW,MIN,6.1,6.1,6.1,10,11,2021,9,...,31,4,16,4,36,2,6,15.806804,15.806804,15.806804
4,CIN,LAD,SF,6.4,6.4,6.4,10,11,2021,9,...,45,1,1,0,100,0,10,16.082468,16.082468,16.082468


In [54]:
df_hitters_free_agents[8].head()

Unnamed: 0,Acronimo,Acronimo_t_1,Acronimo_t_2,Altura,Altura_t_1,Altura_t_2,Anio_t_1,Anio_t_2,Anio_x,Anio_y,...,WAR_2_t_1,WAR_2_t_2,WAR_t_1,WAR_t_2,WS_ganadas,WS_ganadas_t_1,WS_ganadas_t_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,ARI,ARI,LAD,6.1,6.1,6.1,8,9,2019,7,...,6.2001,0.0441,2.49,0.21,1,1,6,15.725053,15.725053,15.725053
1,BAL,BAL,ARI,6.2,6.2,6.2,8,9,2019,7,...,0.0225,0.1296,0.15,-0.36,3,3,1,16.608719,16.588099,16.608719
2,TB,NYY,ATL,0.0,0.0,0.0,8,9,2019,7,...,0.7396,0.0361,0.86,-0.19,0,27,3,15.285686,15.285686,14.661147
3,LAA,LAA,LAA,6.3,6.3,6.3,8,9,2012,7,...,0.2601,0.1849,0.51,0.43,1,1,1,17.073607,17.073607,17.073607
4,KC,KC,KC,6.1,6.1,6.1,8,9,2016,7,...,5.6644,1.8225,2.38,1.35,2,2,2,16.588099,16.588099,16.588099


### Labels for free agents

We create a label indicating whether the pitcher or hitter is a free agent.

In [55]:
for i in range(0,period):
    # Conditions
    condicion_hitter = [df_hitters[i].Jugador.isin(df_hitters_free_agents[i].Jugador)]
    condicion_pitcher = [df_pitchers[i].Jugador.isin(df_pitchers_free_agents[i].Jugador)]
    # Labels
    etiquetas = ['Si']
    
    df_hitters[i]['Agente libre'] = np.select(condicion_hitter, etiquetas, default = 'No')
    df_pitchers[i]['Agente libre'] = np.select(condicion_pitcher, etiquetas, default = 'No')
    
    df_hitters[i] = df_hitters[i].reindex(sorted(df_hitters[i].columns), axis=1)
    df_pitchers[i] = df_pitchers[i].reindex(sorted(df_pitchers[i].columns), axis=1)
    
    # Export the dataframes
    df_hitters[i].to_csv('ETL_data/Period_t_2/Hitters/All_Hitters/hitters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers[i].to_csv('ETL_data/Period_t_2/Pitchers/All_Pitchers/pitchers_' + str(2011 + i) + '.csv', index = False)
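
With a single condition, `np.where` gives the same `'Si'`/`'No'` labels as `np.select`. A minimal sketch on made-up data, where `free_agent_names` is a hypothetical stand-in for the `Jugador` column of the free-agent dataframe:

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({"Jugador": ["A", "B", "C"]})
free_agent_names = pd.Series(["B"])  # made-up list of free agents

# 'Si' where the player appears among the free agents, 'No' otherwise
players["Agente libre"] = np.where(
    players["Jugador"].isin(free_agent_names), "Si", "No"
)

print(players)
```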

In [56]:
df_hitters[10].head()

Unnamed: 0,Acronimo,Acronimo_t_1,Acronimo_t_2,Agente libre,Altura,Altura_t_1,Altura_t_2,Anio,Anio_t_1,Anio_t_2,...,WAR_2_t_1,WAR_2_t_2,WAR_t_1,WAR_t_2,WS_ganadas,WS_ganadas_t_1,WS_ganadas_t_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,LAD,LAD,LAD,Si,6.1,6.1,6.1,9,10,11,...,0.3844,13.5424,0.62,3.68,6,7,7,15.201805,13.815511,15.201805
1,NYY,NYY,NYY,No,6.1,6.1,6.1,9,10,11,...,0.2401,0.04,0.49,0.2,27,27,27,15.65379,15.60727,15.65379
2,NYY,NYY,NYY,No,6.7,6.7,6.7,9,10,11,...,0.9801,47.8864,0.99,6.92,27,27,27,13.436152,13.436152,13.436152
3,HOU,HOU,SEA,No,6.0,6.0,6.0,9,10,11,...,0.4761,0.0064,-0.69,0.08,1,1,0,13.226723,13.226723,11.638606
4,KC,KC,KC,No,6.1,6.1,6.1,9,10,11,...,0.7396,1.5625,0.86,1.25,2,2,2,13.263863,13.263863,13.263863


In [57]:
df_pitchers[9].head()

Unnamed: 0,Acronimo,Acronimo_t_1,Acronimo_t_2,Agente libre,Altura,Altura_t_1,Altura_t_2,Anio,Anio_t_1,Anio_t_2,...,Walks_t_2,Wins,Wins_2,Wins_2_t_1,Wins_2_t_2,Wins_t_1,Wins_t_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,ATL,ATL,ATL,No,6.0,6.0,6.0,8,9,10,...,9,4,16,9,1,3,1,13.226723,13.226723,13.226723
1,CHW,CHW,CHW,No,6.3,6.3,6.3,8,9,10,...,5,0,0,0,1,0,1,13.215854,13.215854,12.587928
2,TB,LAD,LAD,No,6.3,6.3,6.3,8,9,10,...,4,1,1,36,9,6,3,13.208541,13.208541,12.326563
3,PHI,PHI,PHI,No,6.1,6.1,6.1,8,9,10,...,6,0,0,9,0,3,0,13.235692,13.235692,13.235692
4,COL,NYY,NYY,Si,6.5,6.5,6.5,8,9,10,...,9,6,36,36,4,6,2,15.761421,15.761421,15.761421


In [58]:
df_hitters[0].describe()

Unnamed: 0,Altura,Anio,At-bats,At-bats_2,Bateos,Bateos_2,Cantidad_agentes_libres,Edad,Home-runs,Home-runs_2,...,TVS,TVS_2,Valor_contrato_total,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
count,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,...,17.0,17.0,17.0,17.0,16.0,16.0,17.0,17.0,17.0,17.0
mean,5.070588,1.0,182.647059,74700.058824,48.529412,5874.176471,1.0,27.588235,6.176471,128.647059,...,29.276471,1807.240753,775000.0,96.0,1.73,6.795263,0.0,14.188448,14.164705,14.094196
std,2.425326,0.0,209.580218,114767.153659,61.147483,9980.445007,0.0,2.550951,9.805836,297.470995,...,31.77282,2615.43929,0.0,0.0,2.013915,14.604046,0.0,1.414654,1.391819,1.313965
min,0.0,1.0,1.0,1.0,0.0,0.0,1.0,24.0,0.0,0.0,...,0.0,0.0,775000.0,96.0,-0.74,0.0009,0.0,12.933621,12.933621,12.933621
25%,6.0,1.0,49.0,2401.0,7.0,49.0,1.0,25.0,0.0,0.0,...,0.0,0.0,775000.0,96.0,0.3225,0.262425,0.0,12.957489,12.957489,12.957489
50%,6.0,1.0,61.0,3721.0,13.0,169.0,1.0,28.0,1.0,1.0,...,15.65,244.9225,775000.0,96.0,1.21,1.5002,0.0,13.056224,13.056224,13.056224
75%,6.2,1.0,378.0,142884.0,114.0,12996.0,1.0,29.0,8.0,64.0,...,58.06,3370.9636,775000.0,96.0,2.6075,6.833725,0.0,15.424948,15.201805,15.189226
max,6.6,1.0,563.0,316969.0,187.0,34969.0,1.0,34.0,33.0,1089.0,...,96.43,9298.7449,775000.0,96.0,7.74,59.9076,0.0,16.4182,16.4182,16.4182


In [59]:
df_pitchers[0].describe()

Unnamed: 0,Altura,Anio,Bateos_pitcher,Bateos_pitcher_2,Cantidad_agentes_libres,Carreras,Carreras_2,Carreras_ganadas,Carreras_ganadas_2,Comando,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,...,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,5.59,1.0,116.4,18255.4,1.0,53.6,4108.8,49.2,3473.0,2.769545,...,1.368,2.01898,0.0,38.5,1958.7,8.5,106.9,14.61905,14.599941,14.458822
std,1.972843,0.0,72.314437,16677.633093,0.0,37.056114,3785.863495,34.194866,3187.262636,1.037008,...,0.404909,1.498929,0.0,23.008453,1725.599867,6.204837,109.642297,1.492873,1.480655,1.371793
min,0.0,1.0,2.0,4.0,1.0,0.0,0.0,0.0,0.0,0.666667,...,1.14,1.2996,0.0,3.0,9.0,0.0,0.0,12.933621,12.933621,12.933621
25%,6.0,1.0,61.0,3733.0,1.0,19.75,391.75,17.25,302.25,2.21571,...,1.2025,1.446025,0.0,25.25,637.75,3.25,10.75,13.001126,13.001126,13.001126
50%,6.2,1.0,121.5,16244.5,1.0,63.5,4374.5,57.5,3546.5,2.917241,...,1.23,1.513,0.0,37.0,1433.0,8.5,78.5,15.12875,15.091696,14.890103
75%,6.275,1.0,171.5,29449.0,1.0,83.5,6973.0,79.75,6360.25,3.339615,...,1.315,1.7293,0.0,58.5,3423.0,13.0,169.0,15.865709,15.850553,15.244129
max,6.5,1.0,214.0,45796.0,1.0,95.0,9025.0,87.0,7569.0,4.466667,...,2.5,6.25,0.0,66.0,4356.0,17.0,289.0,16.4182,16.4182,16.4182
