# Players dataframes for the same period t

In this script we build a clean database segmented by hitters and pitchers. The resulting datasets are exported in three versions: players who are free agents, players who are not, and all players. The script is organized into the following sections:

- **Viewing the contents of the databases.**
- **Cleaning and exporting the database.**
- **Creating an indicator of whether the player is a free agent.**

Let us import the required modules and set the desired configuration.

In [1]:
import pandas as pd
import numpy as np
import math
import os
import warnings
print('Modules imported')

Modules imported


In [2]:
# Settings
warnings.filterwarnings('ignore')
# Limit the number of rows displayed
pd.options.display.max_rows = 5

In [3]:
# Working directory
print("Previous working directory: " + str(os.getcwd()))
# Change it
os.chdir('/home/usuario/Documentos/Github/Proyectos/MLB_HN/')

Previous working directory: /home/usuario/Documentos/Github/Proyectos/MLB_HN/ETL_Scripts/First_Year_Contract


In [4]:
# Check the current working directory
print(os.getcwd())
# The directory above is already the right one; if it were not, we would change it as follows:
path = '/home/usuario/Documentos/Github/Proyectos/MLB_HN'
os.chdir(path)  # os.chdir returns None, so the directory is printed afterwards
print("New working directory: " + os.getcwd())

/home/usuario/Documentos/Github/Proyectos/MLB_HN
New working directory: /home/usuario/Documentos/Github/Proyectos/MLB_HN


## Viewing the databases

Looking at a single year of data is enough to see which variables the databases contain. We pick 2012.

Below we display the contents of the databases on *hitters*, *pitchers*, *free agent salaries* and *salaries of all players*, in order to decide how the cleaning process should proceed.

In [5]:
# Paths to the 2012 files
free_agents_2012 = 'Data/Free_Agents/free_agents_2012.csv'
hitting_2012 = 'Data/Statistics/Hitting/hitting_2012.csv'
pitching_2012 = 'Data/Statistics/Pitching/pitching_2012.csv'
salary_2012 = 'Data/Salary/salary_2012.csv'
teams_etl_2012 = 'ETL_Data/Agent/Teams/free_agents_team_2012.csv'

# Load the dataframes
df_free_agent_auxiliar_2012 = pd.read_csv(free_agents_2012)
df_hitting_auxiliar_2012 = pd.read_csv(hitting_2012)
df_pitching_auxiliar_2012 = pd.read_csv(pitching_2012)
df_salary_auxiliar_2012 = pd.read_csv(salary_2012)
df_teams_etl_2012 = pd.read_csv(teams_etl_2012)

### Free agents

Let us look at the dataframe first.

In [6]:
df_free_agent_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Year,Pos,Status,Team From,Team From To,YRS,Value,AAV
0,1,Albert Pujols,2012,DH,UFA,STL,LAA,10,"$240,000,000","$24,000,000"
1,2,Prince Fielder,2012,DH,UFA,MIL,DET,9,"$214,000,000","$23,777,778"
2,3,Jose Reyes,2012,SS,UFA,NYM,MIA,6,"$106,000,000","$17,666,667"
3,4,C.J. Wilson,2012,SP,UFA,TEX,LAA,5,"$77,500,000","$15,500,000"
4,5,Mark Buehrle,2012,SP,UFA,CHW,MIA,4,"$58,000,000","$14,500,000"


### Hitters

Let us look at the dataframe.

In [7]:
df_hitting_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Pos,Team,GP,GP%,GS,GS%,AB,H,...,WAR,TVS,Payroll Salary2022,Cash2022,AAV2022,Earnings2022,Age,Weight,Height,Unnamed: 27
0,,Derek Jeter,SS,NYY,159,0.982,158,0.975,683,216,...,2.16,23.61,$0,$0,$0,$0,38,0,"0'0""",
1,,Miguel Cabrera,1B,DET,161,0.994,161,0.994,622,205,...,7.14,96.42,"$32,000,000","$32,000,000","$31,000,000","$353,023,111",29,249,"6'4""",
2,,Robinson Cano,2B,NYY,161,0.994,159,0.982,627,196,...,8.44,98.76,$0,$0,$0,$0,29,210,"6'0""",
3,,Billy Butler,DH,KC,161,0.994,158,0.975,614,192,...,3.03,76.02,$0,$0,$0,$0,26,260,"6'0""",
4,,Ryan Braun,LF,MIL,154,0.951,153,0.944,598,191,...,6.87,97.04,$0,$0,$0,$0,28,205,"0'0""",


In [8]:
df_hitting_auxiliar_2012.columns

Index(['Rank', 'Player', 'Pos', 'Team', 'GP', 'GP%', 'GS', 'GS%', 'AB', 'H',
       '2B', '3B', 'HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS', 'WAR', 'TVS',
       'Payroll Salary2022', 'Cash2022', 'AAV2022', 'Earnings2022', 'Age',
       'Weight', 'Height', 'Unnamed: 27'],
      dtype='object')

The terms used in the database are left untranslated to avoid introducing translation errors.

- **Pos**: Player position.
- **Team**: Team acronym.
- **GP**: Games played.
- **GP%**: Games played %.
- **AB**: At bats.
- **H**: Hits.
- **HR**: Home runs.
- **RBI**: Runs batted in.
- **AVG**: Batting average.
- **OPS**: On-base plus slugging.

The *Cash2022* column will be dropped: the player's present-day value is not relevant to this work, since some of the free agents have retired in later years.

### Pitchers

In [9]:
df_pitching_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Pos,Team,GP,GS,IP,H,R,ER,...,WAR,TVS,Payroll Salary2022,Cash2022,AAV2022,Earnings2022,Age,Weight,Height,Unnamed: 26
0,,R.A. Dickey,SP,NYM,34,33,233.7,192,78,71,...,5.68,97.27,$0,$0,$0,$0,37,215,"6'3""",
1,,Felix Hernandez,SP,SEA,33,33,232.0,209,84,79,...,5.26,85.2,$0,$0,$0,$0,26,225,"6'3""",
2,,James Shields,SP,TB,33,33,227.7,209,103,89,...,2.55,79.41,$0,$0,$0,$0,30,215,"6'3""",
3,,Clayton Kershaw,SP,LAD,34,33,227.7,170,70,64,...,6.43,95.71,"$17,000,000","$17,000,000","$17,000,000","$268,342,641",24,225,"6'4""",
4,,Hiroki Kuroda,SP,NYY,33,33,219.7,205,86,81,...,5.28,81.85,$0,$0,$0,$0,37,0,"0'0""",


In [10]:
df_pitching_auxiliar_2012.columns

Index(['Rank', 'Player', 'Pos', 'Team', 'GP', 'GS', 'IP', 'H', 'R', 'ER', 'BB',
       'SO', 'W', 'L', 'SV', 'WHIP', 'ERA', 'WAR', 'TVS', 'Payroll Salary2022',
       'Cash2022', 'AAV2022', 'Earnings2022', 'Age', 'Weight', 'Height',
       'Unnamed: 26'],
      dtype='object')

#### Notation.

Let us see what some of the terms refer to.

- **Pos**: Player position.
- **Team**: Team acronym.
- **GP**: Games played.
- **GS**: Games started.
- **IP**: Innings pitched.
- **H**: Hits.
- **R**: Runs.
- **ER**: Earned runs.
- **BB**: Walks.
- **SO**: Strikeouts.
- **W**: Wins.
- **L**: Losses.
- **SV**: Saves.
- **WHIP**: Walks plus hits per inning pitched.
- **ERA**: Earned run average.

For the same reasons, the *Cash2022* column will be dropped.

### Salaries
In this case there are far fewer variables than in the previous databases.

In [11]:
df_salary_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Year,Pos,Team,BaseSalary,SigningBonus,Payroll Salary,Adj Salary,CONT YR,CONT VALUE,Earnings,FA Year,Sign Year,Sign Age,Age,Weight,Height,Unnamed: 18
0,,Alex Rodriguez,2012,DH,NYY,"$29,000,000","$1,000,000","$30,000,000","$30,000,000",10,"$275,000,000","$321,290,700",2018,2008,32,37,230,"0'0""",
1,,C.C. Sabathia,2012,SP,NYY,"$23,000,000","$1,285,714","$24,285,714","$24,285,714",7,"$161,000,000","$127,285,714",2016,2009,28,31,300,"6'6""",
2,,Vernon Wells,2012,LF,LAA,"$21,000,000","$3,187,500","$24,187,500","$24,187,500",7,"$126,000,000","$102,521,000",2015,2008,29,33,0,"0'0""",
3,,Johan Santana,2012,SP,NYM,"$24,000,000",$0,"$24,000,000","$24,000,000",6,"$137,500,000","$148,560,000",2014,2008,28,33,155,"6'0""",
4,,Mark Teixeira,2012,1B,NYY,"$22,500,000","$625,000","$23,125,000","$23,125,000",8,"$180,000,000","$127,650,000",2017,2009,28,32,225,"6'3""",


In [12]:
df_salary_auxiliar_2012.columns

Index(['Rank', 'Player', 'Year', 'Pos', 'Team', 'BaseSalary', 'SigningBonus',
       'Payroll Salary', 'Adj Salary', 'CONT YR', 'CONT VALUE', 'Earnings',
       'FA Year', 'Sign Year', 'Sign Age', 'Age', 'Weight', 'Height',
       'Unnamed: 18'],
      dtype='object')

- **BaseSalary**: A base salary is the minimum amount you can expect to earn in exchange for your time or services. This is the amount earned before benefits, bonuses, or compensation is added.
- **Payroll Salary**: The compensation paid to the player for the period, including the signing bonus. In the rows shown above it equals *BaseSalary* plus *SigningBonus*.
- **Adj Salary**: Adjusted Salary means the regular salary, wages and commissions, if any, payable to a Participant by the Company for the Participant's service, excluding any bonuses or other compensation.

### Teams ETL

This team database is the result of an earlier ETL process.

In [13]:
df_teams_etl_2012.head()

Unnamed: 0,Equipo,Cantidad_agentes_libres,Valor_contrato_total,Acronimo,Victorias,Juegos totales,Playoffs,Pennants won,WS ganadas,Promedio_victorias
0,Los Angeles Angels,4,321150000,LAA,89,162,10,1,1,0.549383
1,Detroit Tigers,3,221000000,DET,88,162,14,11,4,0.54321
2,Miami Marlins,6,203300000,MIA,69,162,2,2,2,0.425926
3,Philadelphia Phillies,7,57650000,PHI,81,162,14,7,2,0.5
4,Los Angeles Dodgers,9,44651311,LAD,86,162,26,25,6,0.530864


### Teams by state

In [14]:
states = 'Data/Teams/team_states.csv'
df_states = pd.read_csv(states)

In [15]:
df_states.head()

Unnamed: 0,Estado,Cantidad de equipos
0,Alabama,0
1,Alaska,0
2,Arizona,1
3,Arkansas,0
4,California,5


### Acronyms

This table serves as an intermediate key for joining the team databases.

In [16]:
acronym = 'Data/Teams/team_acronym.csv'
df_acronym = pd.read_csv(acronym)

In [17]:
df_acronym.head()

Unnamed: 0,Equipo,Acronimo,Estado
0,Arizona Diamondbacks,ARI,Arizona
1,Atlanta Braves,ATL,Georgia
2,Baltimore Orioles,BAL,Maryland
3,Boston Red Sox,BOS,Massachusetts
4,Chicago Cubs,CHC,Illinois


Let us merge this dataframe with the teams-by-state dataframe.

In [18]:
acronym_state = pd.merge(df_states, df_acronym, on = 'Estado')

In [19]:
acronym_state.head()

Unnamed: 0,Estado,Cantidad de equipos,Equipo,Acronimo
0,Arizona,1,Arizona Diamondbacks,ARI
1,California,5,Los Angeles Angels,LAA
2,California,5,Los Angeles Dodgers,LAD
3,California,5,Oakland Athletics,OAK
4,California,5,San Diego Padres,SD


In this case the variable names are self-explanatory.
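
Note that `pd.merge` performs an inner join by default, which is why states without a team (Alabama, Alaska, ...) no longer appear in `acronym_state`. The following is a minimal sketch with toy data; the `*_demo` frames are illustrative stand-ins, not the project's files.

```python
import pandas as pd

# Toy stand-ins for df_states and df_acronym (illustrative only)
states_demo = pd.DataFrame({"Estado": ["Alabama", "Arizona"],
                            "Cantidad de equipos": [0, 1]})
acronym_demo = pd.DataFrame({"Equipo": ["Arizona Diamondbacks"],
                             "Acronimo": ["ARI"],
                             "Estado": ["Arizona"]})

# Default how='inner': only states present in both frames survive
inner_demo = pd.merge(states_demo, acronym_demo, on="Estado")

# how='left' keeps every state; the missing team fields become NaN
left_demo = pd.merge(states_demo, acronym_demo, on="Estado", how="left")

print(inner_demo["Estado"].tolist())
print(left_demo["Estado"].tolist())
```

For this notebook the inner join is probably the desired behaviour, since only states that host a team matter later on.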

## Algorithm for building the databases

Next, the code is generalized so that the *dataframes* above can be built for a set of consecutive years, as in our case.

In [20]:
# Auxiliary values:
free_agents = 'Data/Free_Agents/free_agents_'
hitting = 'Data/Statistics/Hitting/hitting_'
pitching = 'Data/Statistics/Pitching/pitching_'
salary = 'Data/Salary/salary_'
teams = 'ETL_Data/Agent/Teams/free_agents_team_'
csv = '.csv'
period = 12
# Originals:
df_free_agents = [None]*period
df_hitting = [None]*period
df_pitching = [None]*period
df_salary = [None]*period
df_teams = [None]*period
# Copies:
df_free_agents_copy = [None]*period
df_hitting_copy = [None]*period
df_pitching_copy = [None]*period
df_salary_copy = [None]*period
df_teams_copy = [None]*period
# Final product:
df_pitchers = [None]*period
df_hitters = [None]*period
df_pitchers_free_agents = [None]*period
df_hitters_free_agents = [None]*period
df_pitchers_no_free_agents = [None]*period
df_hitters_no_free_agents = [None]*period

Let us read all the files and create the copies.

In [21]:
for i in range(0,period):    
    df_free_agents[i] = pd.read_csv(free_agents + str(2011 + i) + csv)
    df_hitting[i] = pd.read_csv(hitting + str(2011 + i) + csv)
    df_pitching[i] = pd.read_csv(pitching + str(2011 + i) + csv)
    df_salary[i] = pd.read_csv(salary + str(2011 + i) + csv)
    df_teams[i] = pd.read_csv(teams + str(2011 + i) + csv)
    
    df_free_agents_copy[i] = df_free_agents[i].copy()
    df_hitting_copy[i] = df_hitting[i].copy()
    df_pitching_copy[i] = df_pitching[i].copy()
    df_salary_copy[i] = df_salary[i].copy()
    df_teams_copy[i] = df_teams[i].copy()
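
The file names in the loop above are built by concatenating a prefix, the year and the extension. A small sketch of the same idea with the years made explicit is given below; `yearly_path`, `first_year` and `years` are hypothetical names introduced only for illustration.

```python
first_year = 2011  # first season read by the loop above
period = 12        # number of consecutive seasons

def yearly_path(prefix, year, extension=".csv"):
    """Build a file path such as 'Data/Salary/salary_2012.csv'."""
    return f"{prefix}{year}{extension}"

years = list(range(first_year, first_year + period))
salary_paths = [yearly_path("Data/Salary/salary_", y) for y in years]
print(salary_paths[0], salary_paths[-1])
```

Keeping the list of years in one place makes it easier to see that index `i` corresponds to season `2011 + i`.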

We process each database separately. First, however, we drop the rank column from all of them, together with the empty trailing `Unnamed` columns.

In [22]:
for i in range(0,period):
    # Drop 'Rank' columns:
    if 'Rank' in df_free_agents_copy[i].columns:
        df_free_agents_copy[i].drop('Rank', axis = 1, inplace = True)
        
    if 'Rank' in df_hitting_copy[i].columns:
        df_hitting_copy[i].drop('Rank', axis = 1, inplace = True)
    
    if 'Rank' in df_pitching_copy[i].columns:
        df_pitching_copy[i].drop('Rank', axis = 1, inplace = True)
    
    if 'Rank' in df_salary_copy[i].columns:
        df_salary_copy[i].drop('Rank', axis = 1, inplace = True)
        
    # Drop 'Unnamed' columns:
    if 'Unnamed: 10' in df_free_agents_copy[i].columns:
        df_free_agents_copy[i].drop('Unnamed: 10', axis = 1, inplace = True)
        
    if 'Unnamed: 27' in df_hitting_copy[i].columns:
        df_hitting_copy[i].drop('Unnamed: 27', axis = 1, inplace = True)
    
    if 'Unnamed: 26' in df_pitching_copy[i].columns:
        df_pitching_copy[i].drop('Unnamed: 26', axis = 1, inplace = True)
    
    if 'Unnamed: 18' in df_salary_copy[i].columns:
        df_salary_copy[i].drop('Unnamed: 18', axis = 1, inplace = True)
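
The membership checks above can also be expressed with the `errors='ignore'` option of `DataFrame.drop`, which silently skips labels that are absent. A minimal sketch on a toy frame (the data is illustrative):

```python
import pandas as pd

# Toy frame with the two kinds of unwanted columns (illustrative data)
demo = pd.DataFrame({"Rank": [1, 2],
                     "Player": ["A", "B"],
                     "Unnamed: 10": [None, None]})

# errors='ignore' skips missing labels, so no 'if' check is needed
unwanted = ["Rank", "Unnamed: 10", "Unnamed: 27"]
cleaned = demo.drop(columns=unwanted, errors="ignore")
print(cleaned.columns.tolist())
```

This returns a new dataframe rather than modifying `demo` in place.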

Let us verify that the unwanted columns are gone.

In [23]:
df_free_agents_copy[7].columns

Index(['Player', 'Year', 'Pos', 'Status', 'Team From', 'Team From To', 'YRS',
       'Value', 'AAV'],
      dtype='object')

In [24]:
df_hitting_copy[2].columns

Index(['Player', 'Pos', 'Team', 'GP', 'GP%', 'GS', 'GS%', 'AB', 'H', '2B',
       '3B', 'HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS', 'WAR', 'TVS',
       'Payroll Salary2022', 'Cash2022', 'AAV2022', 'Earnings2022', 'Age',
       'Weight', 'Height'],
      dtype='object')

In [25]:
df_pitching_copy[7].columns

Index(['Player', 'Pos', 'Team', 'GP', 'GS', 'IP', 'H', 'R', 'ER', 'BB', 'SO',
       'W', 'L', 'SV', 'WHIP', 'ERA', 'WAR', 'TVS', 'Payroll Salary2022',
       'Cash2022', 'AAV2022', 'Earnings2022', 'Age', 'Weight', 'Height'],
      dtype='object')

In [26]:
df_salary_copy[11].columns

Index(['Player', 'Year', 'Pos', 'Team', 'BaseSalary', 'SigningBonus',
       'Payroll Salary', 'Adj Salary', 'CONT YR', 'CONT VALUE', 'Earnings',
       'FA Year', 'Sign Year', 'Sign Age', 'Age', 'Weight', 'Height'],
      dtype='object')

#### Free agents

The team that signs the free agent is not kept, since that information is also available in the team database, which is easier to handle in the _ETL_ process.

In [27]:
for i in range(0,period):
    # Keep only the columns of interest; .copy() avoids chained-assignment warnings
    df_free_agents_copy[i]  = df_free_agents_copy[i][['Player', 'Year', 'Status', 'Team From',
                                                      'YRS', 'Value', 'AAV']].copy()
    df_free_agents_names  = ['Jugador', 'Anio', 'Status', 'Equipo_anterior',
                             'Anios_contrato', 'Valor_contrato', 'Valor_promedio_contrato']
    df_free_agents_copy[i].columns = df_free_agents_names

    # Strip the literal '$' and ',' characters (regex=False) and convert to numeric
    for column in ['Valor_contrato', 'Valor_promedio_contrato']:
        cleaned = (df_free_agents_copy[i][column]
                   .str.replace("$", "", regex=False)
                   .str.replace(",", "", regex=False))
        df_free_agents_copy[i][column] = pd.to_numeric(cleaned)
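
The same currency clean-up recurs for every monetary column in this notebook, so it can be factored into a helper. The sketch below uses a hypothetical function, `money_to_number`; unlike the cell above it passes `errors="coerce"`, which is an assumption on our part: unparseable entries become `NaN` instead of raising an error.

```python
import pandas as pd

def money_to_number(series):
    """Convert strings like '$24,000,000' to numbers (hypothetical helper)."""
    cleaned = (series.astype(str)
                     .str.replace("$", "", regex=False)   # literal dollar sign
                     .str.replace(",", "", regex=False))  # thousands separator
    # errors='coerce' turns anything that is not a number into NaN
    return pd.to_numeric(cleaned, errors="coerce")

demo_values = pd.Series(["$240,000,000", "$0", "-"])
demo_numbers = money_to_number(demo_values)
print(demo_numbers)
```

Passing `regex=False` matters because `$` is a special character in regular expressions.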

#### Salaries

Since the salaries are merged into the _hitters_ and _pitchers_ databases, their _ETL_ process is carried out first.

In [28]:
for i in range(0,period):
    # Keep the salary columns of interest; .copy() avoids chained-assignment warnings
    df_salary_copy[i] = df_salary_copy[i][['Player', 'Pos', 'Team', 'BaseSalary', 'SigningBonus',
                                           'Payroll Salary', 'Adj Salary', 'CONT YR', 'CONT VALUE', 'Earnings',
                                           'FA Year', 'Sign Year', 'Sign Age']].copy()
    df_salary_names = ['Jugador', 'Posicion', 'Equipo', 'Sueldo_base', 'Bono_por_firma',
                       'Sueldo_regular', 'Sueldo_ajustado', 'Anios_de_contrato', 'Valor_del_contrato', 'Ganancias',
                       'Anio_de_agente_libre', 'Anio_inicio_de_contrato', 'Edad_al_firmar']
    df_salary_copy[i].columns = df_salary_names

    # Monetary columns: strip the literal '$' and ',' characters, then convert to numeric
    money_columns = ['Sueldo_base', 'Sueldo_regular', 'Sueldo_ajustado',
                     'Valor_del_contrato', 'Bono_por_firma', 'Ganancias']
    for column in money_columns:
        cleaned = (df_salary_copy[i][column]
                   .str.replace("$", "", regex=False)
                   .str.replace(",", "", regex=False))
        df_salary_copy[i][column] = pd.to_numeric(cleaned)

#### Hitters

In [29]:
for i in range(0,period):
    df_hitting_copy[i] = df_hitting_copy[i][['Player', 'Pos', 'GP', 'GP%', 'AB', 'H',
                                             'HR', 'RBI', 'AVG', 'OPS', 'WAR', 'TVS',
                                             'Age', 'Weight', 'Height']].copy()
    df_hitting_names = ['Jugador', 'Posicion', 'Juegos', 'Porcetnaje_juegos', 'At-bats',
                        'Bateos', 'Home-runs', 'RBI', 'Porcentaje_bateo', 'OPS',
                        'WAR', 'TVS', 'Edad', 'Peso', 'Altura']
    df_hitting_copy[i].columns = df_hitting_names

    # Remove the quote characters: 6'4" becomes "64", which divided by 10 gives 6.4.
    # Caveat: the result is feet.inches notation, not decimal feet, and a
    # two-digit inch value such as 6'10" becomes 61.0.
    hitting_aux_1 = df_hitting_copy[i]['Altura'].str.replace("\"","", regex=False)
    hitting_aux_2 = hitting_aux_1.str.replace("'","", regex=False)
    df_hitting_copy[i]['Altura'] = pd.to_numeric(hitting_aux_2)/10
    
    df_hitters[i] = pd.merge(df_hitting_copy[i], df_salary_copy[i], on = 'Jugador')

    df_hitters[i] = df_hitters[i].rename(columns = {'Equipo':'Acronimo'})
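
The height conversion above removes the quote characters and divides by 10, so `6'4"` becomes `6.4`. If a value in consistent units is preferred, one option is to parse feet and inches separately. The sketch below is an alternative, not the method used in this notebook; `height_to_inches` is a hypothetical helper, and treating `0'0"` as missing is an assumption based on how those rows look in the tables above.

```python
import pandas as pd

def height_to_inches(series):
    """Parse strings like 6'4" into total inches (hypothetical helper)."""
    parts = series.str.extract(r"(?P<feet>\d+)'(?P<inches>\d+)")
    total = pd.to_numeric(parts["feet"]) * 12 + pd.to_numeric(parts["inches"])
    # Assumption: 0'0" marks a missing height, so it is mapped to NaN
    return total.where(total > 0)

demo_heights = pd.Series(["6'4\"", "6'10\"", "0'0\""])
demo_inches = height_to_inches(demo_heights)
print(demo_inches)
```

With this parsing, two-digit inch values such as `6'10"` are handled the same way as single-digit ones.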

#### Pitchers

In [30]:
for i in range(0,period):    
    df_pitching_copy[i] = df_pitching_copy[i][['Player', 'Pos', 'GP', 'GS', 'IP', 'H', 
                                               'R', 'ER', 'BB', 'SO', 'W', 'L', 'SV', 
                                               'WHIP', 'ERA', 'WAR', 'TVS', 'Age',
                                               'Weight', 'Height']].copy()
    df_pitching_names = ['Jugador', 'Posicion', 'Juegos', 'Juegos_iniciados', 'Inning_pitched', 'Bateos_pitcher',
                         'Carreras', 'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins', 'Losses',
                         'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Edad', 'Peso', 'Altura']
    df_pitching_copy[i].columns = df_pitching_names    

    # Same height handling as for the hitters: 6'3" becomes 6.3 (feet.inches notation)
    pitching_aux_1 = df_pitching_copy[i]['Altura'].str.replace("\"","", regex=False)
    pitching_aux_2 = pitching_aux_1.str.replace("'","", regex=False)
    df_pitching_copy[i]['Altura'] = pd.to_numeric(pitching_aux_2)/10

    df_pitchers[i] = pd.merge(df_pitching_copy[i], df_salary_copy[i], on = 'Jugador')
    
    df_pitchers[i] = df_pitchers[i].rename(columns = {'Equipo':'Acronimo'})

Because the position column is present in both the statistics and the salary dataframes, the merge produces duplicated columns (`Posicion_x` and `Posicion_y`). We therefore drop the duplicate and rename the one we keep.

In [31]:
for i in range(0,period):
    # Drop 'Posicion_y' columns:
    if 'Posicion_y' in df_hitters[i].columns:
        df_hitters[i].drop('Posicion_y', axis = 1, inplace = True)
    
    if 'Posicion_y' in df_pitchers[i].columns:
        df_pitchers[i].drop('Posicion_y', axis = 1, inplace = True)
        
    # Rename 'Posicion_x' back to 'Posicion':
    if 'Posicion_x' in df_hitters[i].columns:
        df_hitters[i] = df_hitters[i].rename(columns = {'Posicion_x':'Posicion'})
    
    if 'Posicion_x' in df_pitchers[i].columns:
        df_pitchers[i] = df_pitchers[i].rename(columns = {'Posicion_x':'Posicion'})
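
The `_x`/`_y` suffixes appear whenever both sides of a merge share a column that is not part of the key. An alternative to cleaning up afterwards is to drop the overlapping column from one side before merging. The sketch below uses illustrative toy frames.

```python
import pandas as pd

stats_demo = pd.DataFrame({"Jugador": ["A", "B"],
                           "Posicion": ["SP", "1B"],
                           "WAR": [5.0, 3.0]})
salary_demo = pd.DataFrame({"Jugador": ["A", "B"],
                            "Posicion": ["SP", "1B"],
                            "Sueldo_base": [10, 20]})

# Default suffixes produce Posicion_x and Posicion_y
with_suffixes = pd.merge(stats_demo, salary_demo, on="Jugador")

# Dropping the overlapping column first keeps a single 'Posicion'
without_suffixes = pd.merge(stats_demo,
                            salary_demo.drop(columns="Posicion"),
                            on="Jugador")
print(with_suffixes.columns.tolist())
print(without_suffixes.columns.tolist())
```

Either approach gives the same final columns; the second simply avoids the intermediate rename.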

## Adding variables suggested in the literature

The first variables we add are the squares of all the sports statistics, together with the following variables:

- DOMINANCE = $\text{Strike-outs} / (9 \cdot \text{Innings pitched})$
- CONTROL = $\text{Walks} / (9 \cdot \text{Innings pitched})$
- COMMAND = $\text{Strike-outs} / \text{Walks}$

In [32]:
df_hitters[2].head()

Unnamed: 0,Jugador,Posicion,Juegos,Porcetnaje_juegos,At-bats,Bateos,Home-runs,RBI,Porcentaje_bateo,OPS,...,Sueldo_base,Bono_por_firma,Sueldo_regular,Sueldo_ajustado,Anios_de_contrato,Valor_del_contrato,Ganancias,Anio_de_agente_libre,Anio_inicio_de_contrato,Edad_al_firmar
0,Dustin Pedroia,2B,160,0.988,641,193,9,84,0.301,0.787,...,10000000,250000,10250000,10250000,6,40500000,31315984,0,2009,25
1,Miguel Cabrera,1B,148,0.914,555,193,44,137,0.348,1.078,...,21000000,0,21000000,21000000,8,152300000,119927573,0,2008,24
2,Robinson Cano,2B,160,0.988,605,190,27,107,0.314,0.899,...,15000000,0,15000000,15000000,4,30000000,58021800,2014,2008,25
3,Nick Markakis,RF,160,0.988,632,172,10,59,0.272,0.687,...,15000000,350000,15350000,15350000,6,66100000,52207000,2016,2009,25
4,Alex Rios,RF,156,0.957,616,171,18,81,0.278,0.756,...,12500000,500000,13000000,2551912,7,68835000,60400000,2015,2008,26


In [33]:
df_hitters[2].columns

Index(['Jugador', 'Posicion', 'Juegos', 'Porcetnaje_juegos', 'At-bats',
       'Bateos', 'Home-runs', 'RBI', 'Porcentaje_bateo', 'OPS', 'WAR', 'TVS',
       'Edad', 'Peso', 'Altura', 'Acronimo', 'Sueldo_base', 'Bono_por_firma',
       'Sueldo_regular', 'Sueldo_ajustado', 'Anios_de_contrato',
       'Valor_del_contrato', 'Ganancias', 'Anio_de_agente_libre',
       'Anio_inicio_de_contrato', 'Edad_al_firmar'],
      dtype='object')

In [34]:
df_pitchers[2].head()

Unnamed: 0,Jugador,Posicion,Juegos,Juegos_iniciados,Inning_pitched,Bateos_pitcher,Carreras,Carreras_ganadas,Walks,Strike-outs,...,Sueldo_base,Bono_por_firma,Sueldo_regular,Sueldo_ajustado,Anios_de_contrato,Valor_del_contrato,Ganancias,Anio_de_agente_libre,Anio_inicio_de_contrato,Edad_al_firmar
0,Adam Wainwright,SP,35,34,241.7,223,83,79,35,219,...,12000000,0,12150000,12150000,4,15000000,39003000,2014,2008,26
1,James Shields,SP,34,34,228.7,215,82,80,68,196,...,10250000,0,10250000,10250000,4,11250000,27594900,2015,2008,26
2,Jon Lester,SP,33,33,213.3,209,94,89,67,177,...,11625000,0,11625000,11625000,5,30000000,31852403,2015,2009,24
3,C.C. Sabathia,SP,32,32,211.0,224,122,112,65,175,...,23000000,1285714,24285714,24285714,7,161000000,150285714,2016,2009,28
4,Ervin Santana,SP,32,32,211.0,190,85,76,51,161,...,13000000,0,13000000,13000000,4,30000000,44236000,2014,2009,26


In [35]:
for i in range(0,period):
    df_pitchers[i]['Dominio'] = df_pitchers[i]['Strike-outs']/(9*df_pitchers[i]['Inning_pitched'])
    df_pitchers[i]['Control'] = df_pitchers[i]['Walks']/(9*df_pitchers[i]['Inning_pitched'])
    df_pitchers[i]['Comando'] = df_pitchers[i]['Strike-outs']/df_pitchers[i]['Walks']
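
One detail worth checking is the `Comando` ratio when a pitcher has zero walks: pandas returns `inf` rather than raising an error. The sketch below reproduces the three formulas on a toy frame and replaces infinite values with `NaN`. The first row uses the Adam Wainwright figures from the table above; the second row is made up to trigger the zero-walk case, and the `NaN` replacement is an assumption, not part of the original cell.

```python
import numpy as np
import pandas as pd

# Row 0: values from the table above; row 1: made-up pitcher with zero walks
demo = pd.DataFrame({"Strike-outs": [219, 10],
                     "Walks": [35, 0],
                     "Inning_pitched": [241.7, 12.0]})

demo["Dominio"] = demo["Strike-outs"] / (9 * demo["Inning_pitched"])
demo["Control"] = demo["Walks"] / (9 * demo["Inning_pitched"])
# Dividing by zero walks yields inf; replace it with NaN so it can be filtered later
demo["Comando"] = (demo["Strike-outs"] / demo["Walks"]).replace([np.inf, -np.inf], np.nan)
print(demo)
```

The first row reproduces the `Dominio`, `Control` and `Comando` values displayed in the next output for the same player.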

In [36]:
df_pitchers[2].head()

Unnamed: 0,Jugador,Posicion,Juegos,Juegos_iniciados,Inning_pitched,Bateos_pitcher,Carreras,Carreras_ganadas,Walks,Strike-outs,...,Sueldo_ajustado,Anios_de_contrato,Valor_del_contrato,Ganancias,Anio_de_agente_libre,Anio_inicio_de_contrato,Edad_al_firmar,Dominio,Control,Comando
0,Adam Wainwright,SP,35,34,241.7,223,83,79,35,219,...,12150000,4,15000000,39003000,2014,2008,26,0.100676,0.01609,6.257143
1,James Shields,SP,34,34,228.7,215,82,80,68,196,...,10250000,4,11250000,27594900,2015,2008,26,0.095224,0.033037,2.882353
2,Jon Lester,SP,33,33,213.3,209,94,89,67,177,...,11625000,5,30000000,31852403,2015,2009,24,0.092202,0.034901,2.641791
3,C.C. Sabathia,SP,32,32,211.0,224,122,112,65,175,...,24285714,7,161000000,150285714,2016,2009,28,0.092154,0.034229,2.692308
4,Ervin Santana,SP,32,32,211.0,190,85,76,51,161,...,13000000,4,30000000,44236000,2014,2009,26,0.084781,0.026856,3.156863


In [37]:
df_pitchers[2].columns

Index(['Jugador', 'Posicion', 'Juegos', 'Juegos_iniciados', 'Inning_pitched',
       'Bateos_pitcher', 'Carreras', 'Carreras_ganadas', 'Walks',
       'Strike-outs', 'Wins', 'Losses', 'Saves', 'WHIP', 'ERA', 'WAR', 'TVS',
       'Edad', 'Peso', 'Altura', 'Acronimo', 'Sueldo_base', 'Bono_por_firma',
       'Sueldo_regular', 'Sueldo_ajustado', 'Anios_de_contrato',
       'Valor_del_contrato', 'Ganancias', 'Anio_de_agente_libre',
       'Anio_inicio_de_contrato', 'Edad_al_firmar', 'Dominio', 'Control',
       'Comando'],
      dtype='object')

To make the creation of the squared variables more efficient, we select the columns by index.

In [38]:
# Select the columns to square by their positional index
square_pitchers_index = list(range(2,17)) + [31,32,33]
square_hitters_index = list(range(2,12))

In [39]:
for i in range(0,period):
    for j in square_pitchers_index:
        df_pitchers[i][df_pitchers[i].columns[j] + '_2'] = np.power(df_pitchers[i][df_pitchers[i].columns[j]], 2)
    
    for k in square_hitters_index:
        df_hitters[i][df_hitters[i].columns[k] + '_2'] = np.power(df_hitters[i][df_hitters[i].columns[k]], 2)
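
Positional indices depend on the column order staying exactly as shown above. A sketch of a name-based alternative follows; it is not the approach used in this notebook, and the toy frame and the `columns_to_square` list are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"Jugador": ["A", "B"],
                     "Juegos": [35, 34],
                     "WHIP": [1.07, 1.24]})

# Selecting by name keeps working even if the column order changes
columns_to_square = ["Juegos", "WHIP"]
squared = demo[columns_to_square].pow(2).add_suffix("_2")
demo = pd.concat([demo, squared], axis=1)
print(demo.columns.tolist())
```

The `_2` suffix matches the naming convention produced by the loop above.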

Let us look at the final result.

In [40]:
df_pitchers[2].head()

Unnamed: 0,Jugador,Posicion,Juegos,Juegos_iniciados,Inning_pitched,Bateos_pitcher,Carreras,Carreras_ganadas,Walks,Strike-outs,...,Wins_2,Losses_2,Saves_2,WHIP_2,ERA_2,WAR_2,TVS_2,Dominio_2,Control_2,Comando_2
0,Adam Wainwright,SP,35,34,241.7,223,83,79,35,219,...,361,81,0,1.1449,8.6436,38.3161,9672.7225,0.010136,0.000259,39.151837
1,James Shields,SP,34,34,228.7,215,82,80,68,196,...,169,81,0,1.5376,9.9225,18.6624,6794.7049,0.009068,0.001091,8.307958
2,Jon Lester,SP,33,33,213.3,209,94,89,67,177,...,225,64,0,1.6641,14.1376,9.1204,4142.2096,0.008501,0.001218,6.97906
3,C.C. Sabathia,SP,32,32,211.0,224,122,112,65,175,...,196,169,0,1.8769,22.8484,0.0001,18.2329,0.008492,0.001172,7.248521
4,Ervin Santana,SP,32,32,211.0,190,85,76,51,161,...,81,100,0,1.2996,10.4976,8.4681,2818.5481,0.007188,0.000721,9.965782


In [41]:
df_pitchers[2].columns

Index(['Jugador', 'Posicion', 'Juegos', 'Juegos_iniciados', 'Inning_pitched',
       'Bateos_pitcher', 'Carreras', 'Carreras_ganadas', 'Walks',
       'Strike-outs', 'Wins', 'Losses', 'Saves', 'WHIP', 'ERA', 'WAR', 'TVS',
       'Edad', 'Peso', 'Altura', 'Acronimo', 'Sueldo_base', 'Bono_por_firma',
       'Sueldo_regular', 'Sueldo_ajustado', 'Anios_de_contrato',
       'Valor_del_contrato', 'Ganancias', 'Anio_de_agente_libre',
       'Anio_inicio_de_contrato', 'Edad_al_firmar', 'Dominio', 'Control',
       'Comando', 'Juegos_2', 'Juegos_iniciados_2', 'Inning_pitched_2',
       'Bateos_pitcher_2', 'Carreras_2', 'Carreras_ganadas_2', 'Walks_2',
       'Strike-outs_2', 'Wins_2', 'Losses_2', 'Saves_2', 'WHIP_2', 'ERA_2',
       'WAR_2', 'TVS_2', 'Dominio_2', 'Control_2', 'Comando_2'],
      dtype='object')

In [42]:
df_hitters[7].head()

Unnamed: 0,Jugador,Posicion,Juegos,Porcetnaje_juegos,At-bats,Bateos,Home-runs,RBI,Porcentaje_bateo,OPS,...,Juegos_2,Porcetnaje_juegos_2,At-bats_2,Bateos_2,Home-runs_2,RBI_2,Porcentaje_bateo_2,OPS_2,WAR_2,TVS_2
0,Whit Merrifield,2B,158,0.975,632,192,12,60,0.304,0.806,...,24964,0.950625,399424,36864,144,3600,0.092416,0.649636,30.6916,9164.2329
1,Freddie Freeman,1B,162,1.0,618,191,23,98,0.309,0.892,...,26244,1.0,381924,36481,529,9604,0.095481,0.795664,29.7025,7492.6336
2,J.D. Martinez,DH,150,0.926,569,188,43,130,0.33,1.031,...,22500,0.857476,323761,35344,1849,16900,0.1089,1.062961,40.5769,9348.9561
3,Manny Machado,3B,162,0.994,632,188,37,107,0.298,0.905,...,26244,0.988036,399424,35344,1369,11449,0.088804,0.819025,8.4681,5665.5729
4,Christian Yelich,LF,147,0.902,574,187,36,110,0.326,1.0,...,21609,0.813604,329476,34969,1296,12100,0.106276,1.0,58.0644,9970.0225


In [43]:
df_hitters[7].columns

Index(['Jugador', 'Posicion', 'Juegos', 'Porcetnaje_juegos', 'At-bats',
       'Bateos', 'Home-runs', 'RBI', 'Porcentaje_bateo', 'OPS', 'WAR', 'TVS',
       'Edad', 'Peso', 'Altura', 'Acronimo', 'Sueldo_base', 'Bono_por_firma',
       'Sueldo_regular', 'Sueldo_ajustado', 'Anios_de_contrato',
       'Valor_del_contrato', 'Ganancias', 'Anio_de_agente_libre',
       'Anio_inicio_de_contrato', 'Edad_al_firmar', 'Juegos_2',
       'Porcetnaje_juegos_2', 'At-bats_2', 'Bateos_2', 'Home-runs_2', 'RBI_2',
       'Porcentaje_bateo_2', 'OPS_2', 'WAR_2', 'TVS_2'],
      dtype='object')

Following the suggestion of some articles, we compute the logarithm of the salaries.

In [44]:
for year in range(0,period):
    df_hitters[year]['ln_Sueldo_base'] = np.log(df_hitters[year]['Sueldo_base'])
    df_hitters[year]['ln_Sueldo_ajustado'] = np.log(df_hitters[year]['Sueldo_ajustado'])
    df_hitters[year]['ln_Sueldo_regular'] = np.log(df_hitters[year]['Sueldo_regular'])
    df_hitters[year]['Anio'] = year + 1
    
    df_pitchers[year]['ln_Sueldo_base'] = np.log(df_pitchers[year]['Sueldo_base'])
    df_pitchers[year]['ln_Sueldo_ajustado'] = np.log(df_pitchers[year]['Sueldo_ajustado'])
    df_pitchers[year]['ln_Sueldo_regular'] = np.log(df_pitchers[year]['Sueldo_regular'])
    df_pitchers[year]['Anio'] = year + 1
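
If any salary were zero, `np.log` would return `-inf` for that row. Whether such rows exist in these files is not checked here, so the sketch below is only a precaution. It masks non-positive values before taking the logarithm, using illustrative data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Sueldo_base": [10_000_000, 0]})

# np.log(0) is -inf; masking non-positive salaries yields NaN instead
positive = demo["Sueldo_base"].where(demo["Sueldo_base"] > 0)
demo["ln_Sueldo_base"] = np.log(positive)
print(demo)
```

Rows with `NaN` can then be dropped or handled explicitly before fitting a model.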

### Aggregated data by team

All that remains is to add the relevant data on each player's team, including the number of teams per state.

In [45]:
df_teams_copy[2].head()

Unnamed: 0,Equipo,Cantidad_agentes_libres,Valor_contrato_total,Acronimo,Victorias,Juegos totales,Playoffs,Pennants won,WS ganadas,Promedio_victorias
0,Los Angeles Angels,8,153500000,LAA,78,162,10,1,1,0.481481
1,Los Angeles Dodgers,7,150850000,LAD,92,162,27,25,6,0.567901
2,Boston Red Sox,10,130700000,BOS,97,162,21,13,8,0.598765
3,Detroit Tigers,4,107775000,DET,93,162,15,11,4,0.574074
4,San Francisco Giants,6,80750000,SF,76,162,23,23,7,0.469136


In [46]:
for i in range(0,period):
    df_teams_copy[i] = pd.merge(df_teams_copy[i], acronym_state, on = ['Equipo','Acronimo'])
    df_hitters[i] = pd.merge(df_teams_copy[i], df_hitters[i], on = 'Acronimo')
    df_pitchers[i] = pd.merge(df_teams_copy[i], df_pitchers[i], on = 'Acronimo')
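
Because these merges are inner joins, players whose team acronym is missing from the team table are silently dropped. The sketch below shows one way to inspect such cases with the `indicator` and `validate` options of `pd.merge`. The frames are toy data, and the acronym `XXX` is made up.

```python
import pandas as pd

teams_demo = pd.DataFrame({"Acronimo": ["LAA", "DET"], "Victorias": [89, 88]})
players_demo = pd.DataFrame({"Jugador": ["A", "B", "C"],
                             "Acronimo": ["LAA", "DET", "XXX"]})

# validate='one_to_many' raises if a team appears twice on the left side;
# indicator=True adds a '_merge' column showing where each row came from
checked = pd.merge(teams_demo, players_demo, on="Acronimo",
                   how="outer", validate="one_to_many", indicator=True)
unmatched_players = checked.loc[checked["_merge"] == "right_only", "Jugador"].tolist()
print(unmatched_players)
```

An empty `unmatched_players` list would confirm that the inner join loses no players.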

## Segmentation by free agents

We split the pitchers and hitters into two groups:

- Free agents.
- Non-free agents.
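
The split can be sketched on toy data with a boolean mask built by `isin`: players whose name appears in the free agents table form one group, and the complement forms the other. The frames below are illustrative.

```python
import pandas as pd

players_demo = pd.DataFrame({"Jugador": ["A", "B", "C"], "WAR": [5.0, 3.0, 1.0]})
free_agents_demo = pd.DataFrame({"Jugador": ["B"], "Status": ["UFA"]})

# True for players listed in the free agents table
is_free_agent = players_demo["Jugador"].isin(free_agents_demo["Jugador"])
free_agents_part = players_demo[is_free_agent]
others_part = players_demo[~is_free_agent]
print(free_agents_part["Jugador"].tolist(), others_part["Jugador"].tolist())
```

Matching on the player name alone assumes that names are unique within a season.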

In [47]:
for i in range(0,period):
    # Drop 'Posicion_y' columns:
    if 'Posicion_y' in df_hitters[i].columns:
        df_hitters[i].drop('Posicion_y', axis = 1, inplace = True)
    
    if 'Posicion_y' in df_pitchers[i].columns:
        df_pitchers[i].drop('Posicion_y', axis = 1, inplace = True)
        
    # Rename 'Posicion_x' back to 'Posicion':
    if 'Posicion_x' in df_hitters[i].columns:
        df_hitters[i] = df_hitters[i].rename(columns = {'Posicion_x':'Posicion'})
    
    if 'Posicion_x' in df_pitchers[i].columns:
        df_pitchers[i] = df_pitchers[i].rename(columns = {'Posicion_x':'Posicion'})

In [48]:
for i in range(0,period):    
    df_hitters_free_agents[i] = pd.merge(df_free_agents_copy[i], df_hitters[i], on = 'Jugador')
    df_pitchers_free_agents[i] = pd.merge(df_free_agents_copy[i], df_pitchers[i], on = 'Jugador')
    
    df_hitters_no_free_agents[i] = df_hitters[i][~df_hitters[i].Jugador.isin(df_hitters_free_agents[i].Jugador)]
    df_pitchers_no_free_agents[i] = df_pitchers[i][~df_pitchers[i].Jugador.isin(df_pitchers_free_agents[i].Jugador)]
    
    df_hitters_free_agents[i] = df_hitters_free_agents[i].reindex(sorted(df_hitters_free_agents[i].columns), axis=1)
    df_pitchers_free_agents[i] = df_pitchers_free_agents[i].reindex(sorted(df_pitchers_free_agents[i].columns), axis=1)
    df_hitters_no_free_agents[i] = df_hitters_no_free_agents[i].reindex(sorted(df_hitters_no_free_agents[i].columns), axis=1)
    df_pitchers_no_free_agents[i] = df_pitchers_no_free_agents[i].reindex(sorted(df_pitchers_no_free_agents[i].columns), axis=1)  
    
    # Drop the duplicate year column ('Anio_x' for free agents, 'Anio_y' for non-free agents):
    if 'Anio_x' in df_hitters_free_agents[i].columns:
        df_hitters_free_agents[i].drop('Anio_x', axis = 1, inplace = True)
    
    if 'Anio_x' in df_pitchers_free_agents[i].columns:
        df_pitchers_free_agents[i].drop('Anio_x', axis = 1, inplace = True)
        
    if 'Anio_y' in df_hitters_no_free_agents[i].columns:
        df_hitters_no_free_agents[i].drop('Anio_y', axis = 1, inplace = True)
        
    if 'Anio_y' in df_pitchers_no_free_agents[i].columns:
        df_pitchers_no_free_agents[i].drop('Anio_y', axis = 1, inplace = True)
        
    # Rename the remaining year column to 'Anio':
    if 'Anio_y' in df_hitters_free_agents[i].columns:
        df_hitters_free_agents[i] = df_hitters_free_agents[i].rename(columns = {'Anio_y':'Anio'})
    
    if 'Anio_y' in df_pitchers_free_agents[i].columns:
        df_pitchers_free_agents[i] = df_pitchers_free_agents[i].rename(columns = {'Anio_y':'Anio'})
    
    if 'Anio_x' in df_hitters_no_free_agents[i].columns:
        df_hitters_no_free_agents[i] = df_hitters_no_free_agents[i].rename(columns = {'Anio_x':'Anio'})
    
    if 'Anio_x' in df_pitchers_no_free_agents[i].columns:
        df_pitchers_no_free_agents[i] = df_pitchers_no_free_agents[i].rename(columns = {'Anio_x':'Anio'})
    
    # Drop 'Anios_contrato' columns:
    if 'Anios_contrato' in df_hitters_free_agents[i].columns:
        df_hitters_free_agents[i].drop('Anios_contrato', axis = 1, inplace = True)
    
    if 'Anios_contrato' in df_pitchers_free_agents[i].columns:
        df_pitchers_free_agents[i].drop('Anios_contrato', axis = 1, inplace = True)
    
    # Transformation: add 2010 to 'Anio' to get the calendar year, then store it as a string
    df_hitters_free_agents[i]['Anio'] = df_hitters_free_agents[i]['Anio'] + 2010
    df_hitters_free_agents[i]['Anio'] = df_hitters_free_agents[i]['Anio'].map(str)
    df_pitchers_free_agents[i]['Anio'] = df_pitchers_free_agents[i]['Anio'] + 2010
    df_pitchers_free_agents[i]['Anio'] = df_pitchers_free_agents[i]['Anio'].map(str)
    df_hitters_no_free_agents[i]['Anio'] = df_hitters_no_free_agents[i]['Anio'] + 2010
    df_hitters_no_free_agents[i]['Anio'] = df_hitters_no_free_agents[i]['Anio'].map(str)
    df_pitchers_no_free_agents[i]['Anio'] = df_pitchers_no_free_agents[i]['Anio'] + 2010
    df_pitchers_no_free_agents[i]['Anio'] = df_pitchers_no_free_agents[i]['Anio'].map(str)
    
    # Export each dataframe separately
    df_hitters_free_agents[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/Free_Agent/Hitters/free_agents_batters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers_free_agents[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/Free_Agent/Pitchers/free_agents_pitchers_' + str(2011 + i) + '.csv', index = False)
    df_hitters_no_free_agents[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/No_Free_Agent/Hitters/no_free_agents_batters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers_no_free_agents[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/No_Free_Agent/Pitchers/no_free_agents_pitchers_' + str(2011 + i) + '.csv', index = False)

In [49]:
# Some examples
df_pitchers_no_free_agents[6].head()

Unnamed: 0,Acronimo,Altura,Anio,Anio_de_agente_libre,Anio_inicio_de_contrato,Anios_de_contrato,Bateos_pitcher,Bateos_pitcher_2,Bono_por_firma,Cantidad de equipos,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo_ajustado,ln_Sueldo_base,ln_Sueldo_regular
0,LAD,6.5,2017,2018,2012,6,159,25281,0,5,...,1.16,1.3456,6,58,3364,10,100,15.14705,16.213406,16.213406
1,LAD,6.4,2017,2021,2014,7,136,18496,2571428,5,...,0.95,0.9025,6,30,900,18,324,17.387053,17.312018,17.387053
2,LAD,6.4,2017,2020,2017,1,123,15129,0,5,...,1.06,1.1236,6,38,1444,16,256,14.84513,14.84513,14.84513
4,LAD,6.1,2017,2024,2016,8,121,14641,125000,5,...,1.15,1.3225,6,34,1156,13,169,15.813606,14.914123,15.813606
5,LAD,6.3,2017,2019,2013,6,128,16384,833333,5,...,1.37,1.8769,6,45,2025,5,25,15.873899,15.761421,15.873899


In [50]:
df_hitters_no_free_agents[0].head()

Unnamed: 0,Acronimo,Altura,Anio,Anio_de_agente_libre,Anio_inicio_de_contrato,Anios_de_contrato,At-bats,At-bats_2,Bateos,Bateos_2,...,TVS_2,Valor_contrato_total,Valor_del_contrato,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo_ajustado,ln_Sueldo_base,ln_Sueldo_regular
0,MIL,6.2,2011,2016,2010,5,68,4624,15,225,...,3632.4729,775000,30100000,96,2.25,5.0625,0,15.068274,14.994166,15.068274
1,MIL,0.0,2011,2013,2010,3,61,3721,10,100,...,1900.96,775000,29750000,96,2.5,6.25,0,16.066802,16.066802,16.066802
2,MIL,6.2,2011,2013,2009,4,49,2401,7,49,...,2072.0704,775000,38000000,96,1.4,1.96,0,16.4182,16.4182,16.4182
3,MIL,6.0,2011,0,2009,3,2,4,1,1,...,86.1184,775000,37000000,96,0.94,0.8836,0,14.711933,16.257858,16.314211


In [59]:
df_pitchers_free_agents[9].head()

Unnamed: 0,Acronimo,Altura,Anio_de_agente_libre,Anio_inicio_de_contrato,Anio,Anios_de_contrato,Bateos_pitcher,Bateos_pitcher_2,Bono_por_firma,Cantidad de equipos,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo_ajustado,ln_Sueldo_base,ln_Sueldo_regular
0,NYY,6.4,2029,2020,2020,9,53,2809,0,2,...,0.96,0.9216,27,17,289,7,49,16.405778,17.399029,17.399029
1,WSH,6.5,2027,2020,2020,7,8,64,0,1,...,1.8,3.24,1,1,1,0,0,15.982294,17.370859,17.370859
2,PHI,6.4,2025,2020,2020,5,67,4489,0,2,...,1.17,1.3689,2,16,256,4,16,15.890312,16.883563,16.883563
3,ARI,6.4,2025,2020,2020,5,47,2209,0,1,...,1.44,2.0736,1,13,169,1,1,14.614018,15.60727,15.60727
4,CHW,6.2,2024,2020,2020,3,52,2704,0,2,...,1.09,1.1881,3,17,289,6,36,15.712631,16.705882,16.705882


In [52]:
df_hitters_free_agents[8].head()

Unnamed: 0,Acronimo,Altura,Anio_de_agente_libre,Anio_inicio_de_contrato,Anio,Anios_de_contrato,At-bats,At-bats_2,Bateos,Bateos_2,...,Valor_contrato_total,Valor_del_contrato,Valor_promedio_contrato,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo_ajustado,ln_Sueldo_base,ln_Sueldo_regular
0,SD,6.3,2029,2019,2019,10,587,344569,150,22500,...,326000000,300000000,30000000,70,3.08,9.4864,0,16.300417,16.118096,16.300417
1,WSH,6.4,2025,2019,2019,6,65,4225,6,36,...,184550000,140000000,23333333,93,5.66,32.0356,1,16.374029,16.341239,16.374029
2,BOS,6.2,2023,2019,2019,4,2,4,0,0,...,68000000,68000000,17000000,83,0.12,0.0144,9,16.648724,16.648724,16.648724
3,LAD,6.1,2023,2019,2019,5,308,94864,82,6724,...,85000000,60000000,12000000,105,0.21,0.0441,6,15.201805,13.815511,15.201805
4,NYY,6.5,2021,2019,2019,2,3,9,0,0,...,124555000,34000000,17000000,103,1.21,1.4641,27,16.648724,16.648724,16.648724


### Labels for free agents

We create a label indicating whether each pitcher or hitter is a free agent.
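
As a minimal sketch of the labeling step, the toy example below uses `np.where`, which is a compact alternative to `np.select` when there is a single condition. The frame `players` and the series `free_agent_names` are hypothetical. The column name `Agente libre` and the values `Si` / `No` match those used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
players = pd.DataFrame({'Jugador': ['A', 'B', 'C']})
free_agent_names = pd.Series(['B'])

# 'Si' where the player appears among the free agents, 'No' otherwise.
players['Agente libre'] = np.where(
    players['Jugador'].isin(free_agent_names), 'Si', 'No'
)

print(players['Agente libre'].tolist())
```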

In [53]:
for i in range(0,period):
    # Conditions: True where the player appears among the free agents
    condicion_hitter = [df_hitters[i].Jugador.isin(df_hitters_free_agents[i].Jugador)]
    condicion_pitcher = [df_pitchers[i].Jugador.isin(df_pitchers_free_agents[i].Jugador)]
    
    # Labels
    etiquetas = ['Si']
    
    df_hitters[i]['Agente libre'] = np.select(condicion_hitter, etiquetas, default = 'No')
    df_pitchers[i]['Agente libre'] = np.select(condicion_pitcher, etiquetas, default = 'No')
    
    df_hitters[i] = df_hitters[i].reindex(sorted(df_hitters[i].columns), axis=1)
    df_pitchers[i] = df_pitchers[i].reindex(sorted(df_pitchers[i].columns), axis=1)
    
    
    # Transformation: add 2010 to 'Anio' to get the calendar year, then store it as a string
    df_hitters[i]['Anio'] = df_hitters[i]['Anio'] + 2010
    df_hitters[i]['Anio'] = df_hitters[i]['Anio'].map(str)
    df_pitchers[i]['Anio'] = df_pitchers[i]['Anio'] + 2010
    df_pitchers[i]['Anio'] = df_pitchers[i]['Anio'].map(str)
    
    # Export the dataframes
    df_hitters[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/Hitters/All_Hitters/hitters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/Pitchers/All_Pitchers/pitchers_' + str(2011 + i) + '.csv', index = False)

In [54]:
df_hitters[10].head()

Unnamed: 0,Acronimo,Agente libre,Altura,Anio,Anio_de_agente_libre,Anio_inicio_de_contrato,Anios_de_contrato,At-bats,At-bats_2,Bateos,...,TVS_2,Valor_contrato_total,Valor_del_contrato,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo_ajustado,ln_Sueldo_base,ln_Sueldo_regular
0,TOR,No,6.4,2021,2025,2017,7,708,501264,202,...,19743.0601,186250000,22000000,92,3.81,14.5161,2,15.183786,15.068274,15.183786
1,TOR,No,6.2,2021,2024,2019,5,727,528529,182,...,3723.4404,186250000,52000000,92,0.9,0.81,2,16.150885,16.049103,16.150885
2,TOR,No,6.1,2021,2022,2020,2,530,280900,141,...,1393.5289,186250000,17500000,92,0.12,0.0144,2,14.626441,16.066802,16.066802
3,TOR,No,6.3,2021,2024,2020,4,4,16,0,...,944.9476,186250000,80000000,92,1.71,2.9241,2,16.811243,16.811243,16.811243
4,LAD,No,6.1,2021,2022,2020,2,692,478864,179,...,22554.0324,160320500,13400000,106,4.55,20.7025,7,15.869634,15.869634,15.869634


In [55]:
df_pitchers[9].head()

Unnamed: 0,Acronimo,Agente libre,Altura,Anio,Anio_de_agente_libre,Anio_inicio_de_contrato,Anios_de_contrato,Bateos_pitcher,Bateos_pitcher_2,Bono_por_firma,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo_ajustado,ln_Sueldo_base,ln_Sueldo_regular
0,NYY,Si,6.4,2020,2029,2020,9,53,2809,0,...,0.96,0.9216,27,17,289,7,49,16.405778,17.399029,17.399029
1,NYY,No,6.5,2020,2021,2019,2,37,1369,0,...,1.05,1.1025,27,15,225,2,4,15.655472,16.648724,16.648724
2,NYY,No,0.0,2020,2021,2014,7,48,2304,0,...,1.17,1.3689,27,8,64,3,9,15.957753,16.951005,16.951005
3,NYY,No,6.6,2020,2024,2020,1,48,2304,0,...,1.29,1.6641,27,9,81,2,4,12.461102,13.598598,13.598598
4,NYY,No,5.9,2020,0,2020,1,35,1225,0,...,1.19,1.4161,27,6,36,3,9,11.088507,13.241923,13.241923


In [56]:
df_hitters[0].describe()

Unnamed: 0,Altura,Anio_de_agente_libre,Anio_inicio_de_contrato,Anios_de_contrato,At-bats,At-bats_2,Bateos,Bateos_2,Bono_por_firma,Cantidad de equipos,...,TVS_2,Valor_contrato_total,Valor_del_contrato,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo_ajustado,ln_Sueldo_base,ln_Sueldo_regular
count,4.0,4.00,4.0,4.00,4.00,4.00,4.00,4.00,4.00,4.0,...,4.000000,4.0,4.0,4.0,4.0000,4.000000,4.0,4.000000,4.000000,4.000000
mean,4.6,1510.50,2009.5,3.75,45.00,2687.50,8.25,93.75,229166.75,1.0,...,1922.905425,775000.0,33712500.0,96.0,1.7725,3.539025,0.0,15.566302,15.934256,15.966872
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75%,6.2,2013.75,2010.0,4.25,62.75,3946.75,11.25,131.25,354166.75,1.0,...,2462.171025,775000.0,37250000.0,96.0,2.3125,5.359375,0.0,16.154652,16.297943,16.340208
max,6.2,2016.00,2010.0,5.00,68.00,4624.00,15.00,225.00,666667.00,1.0,...,3632.472900,775000.0,38000000.0,96.0,2.5000,6.250000,0.0,16.418200,16.418200,16.418200


In [57]:
df_pitchers[0].describe()

Unnamed: 0,Altura,Anio_de_agente_libre,Anio_inicio_de_contrato,Anios_de_contrato,Bateos_pitcher,Bateos_pitcher_2,Bono_por_firma,Cantidad de equipos,Cantidad_agentes_libres,Carreras,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo_ajustado,ln_Sueldo_base,ln_Sueldo_regular
count,5.00,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.000,5.00000,5.0,5.0,5.0,5.0,5.0,5.000000,5.000000,5.000000
mean,4.98,1610.8,2009.6,3.4,137.0,23191.0,233333.4,1.0,1.0,61.2,...,1.256,1.57968,0.0,41.2,2127.6,11.0,151.8,15.505528,15.787766,15.825983
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75%,6.20,2013.0,2010.0,4.0,193.0,37249.0,250000.0,1.0,1.0,92.0,...,1.300,1.69000,0.0,59.0,3481.0,16.0,256.0,16.066802,16.257858,16.314211
max,6.50,2016.0,2010.0,5.0,214.0,45796.0,666667.0,1.0,1.0,95.0,...,1.320,1.74240,0.0,66.0,4356.0,17.0,289.0,16.418200,16.418200,16.418200
