# Players dataframes for the same period t

In this script we build a clean dataset, split into position players and pitchers. For each group we export three datasets: players who are free agents, players who are not, and all players. The script has the following sections:

- **Viewing the contents of the datasets.**
- **Cleaning the datasets and exporting them.**
- **Creating an indicator for whether a player is a free agent.**

Let's import the required modules and set the desired configuration.

In [1]:
import pandas as pd
import numpy as np
import math
import os
import warnings
print('Modules imported')

Modules imported


In [2]:
# Settings
warnings.filterwarnings('ignore')
# Limit the number of rows displayed
pd.options.display.max_rows = 5

In [3]:
# Working directory
print("Previous working directory: " + str(os.getcwd()))
# Change it
os.chdir('/home/usuario/Documentos/Github/Proyectos/MLB_HN/')

Previous working directory: /home/usuario/Documentos/Github/Proyectos/MLB_HN/ETL_scripts/First_Year_Contract


In [4]:
# Check the current working directory
print(os.getcwd())
# The directory above is already the right one; if it were not, we would do the following:
path = '/home/usuario/Documentos/Github/Proyectos/MLB_HN'
# os.chdir returns None, so report the new directory with os.getcwd()
os.chdir(path)
print("New working directory: " + os.getcwd())

/home/usuario/Documentos/Github/Proyectos/MLB_HN
New working directory: /home/usuario/Documentos/Github/Proyectos/MLB_HN
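
As an alternative to changing the working directory, the data paths could be resolved against a fixed project root. The following is a minimal sketch under that assumption; `PROJECT_ROOT` and `data_path` are hypothetical names that do not appear in the notebook.

```python
from pathlib import Path

# Hypothetical project root; replace with the location of your own checkout.
PROJECT_ROOT = Path('/home/usuario/Documentos/Github/Proyectos/MLB_HN')

def data_path(*parts):
    """Build a path under the project root from relative parts."""
    return PROJECT_ROOT.joinpath(*parts)

# Example: the 2012 salary file used later in the notebook
salary_2012_path = data_path('Data', 'Salary', 'salary_2012.csv')
print(salary_2012_path)
```

With this approach the notebook gives the same paths regardless of the directory it is launched from.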


## Viewing the datasets

Looking at one year of data is enough to see which variables the datasets contain. We pick 2012.

Below we show the contents of the datasets on *hitters*, *pitchers*, *free-agent salaries* and *salaries of all players*, in order to decide which cleaning steps to apply.

In [6]:
# Paths to the 2012 files
free_agents_2012 = 'Data/Free_Agents/Free_Agents_2012.csv'
hitting_2012 = 'Data/Statistics/Hitting/hitting_2012.csv'
pitching_2012 = 'Data/Statistics/Pitching/pitching_2012.csv'
salary_2012 = 'Data/Salary/salary_2012.csv'
teams_etl_2012 = 'ETL_data/Period_t/Teams/free_agents_team_2012.csv'

# Load the dataframes
df_free_agent_auxiliar_2012 = pd.read_csv(free_agents_2012)
df_hitting_auxiliar_2012 = pd.read_csv(hitting_2012)
df_pitching_auxiliar_2012 = pd.read_csv(pitching_2012)
df_salary_auxiliar_2012 = pd.read_csv(salary_2012)
df_teams_etl_2012 = pd.read_csv(teams_etl_2012)

FileNotFoundError: [Errno 2] No such file or directory: 'ETL_data/Period_t/Teams/free_agents_team_2012.csv'

### Free agents

Let's look at the dataframe first.

In [6]:
df_free_agent_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Year,Pos,Status,Team From,Team From To,YRS,Value,AAV
0,1,Albert Pujols,2012,DH,UFA,STL,LAA,10,"$240,000,000","$24,000,000"
1,2,Prince Fielder,2012,DH,UFA,MIL,DET,9,"$214,000,000","$23,777,778"
2,3,Jose Reyes,2012,SS,UFA,NYM,MIA,6,"$106,000,000","$17,666,667"
3,4,C.J. Wilson,2012,SP,UFA,TEX,LAA,5,"$77,500,000","$15,500,000"
4,5,Mark Buehrle,2012,SP,UFA,CHW,MIA,4,"$58,000,000","$14,500,000"


### Hitters

Let's look at the dataframe.

In [7]:
df_hitting_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Pos,Team,GP,GP%,GS,GS%,AB,H,...,WAR,TVS,Payroll Salary2022,Cash2022,AAV2022,Earnings2022,Age,Weight,Height,Unnamed: 27
0,,Derek Jeter,SS,NYY,159,0.982,158,0.975,683,216,...,2.16,23.61,$0,$0,$0,$0,38,0,"0'0""",
1,,Miguel Cabrera,1B,DET,161,0.994,161,0.994,622,205,...,7.14,96.42,"$32,000,000","$32,000,000","$31,000,000","$353,023,111",29,249,"6'4""",
2,,Robinson Cano,2B,NYY,161,0.994,159,0.982,627,196,...,8.44,98.76,$0,$0,$0,$0,29,210,"6'0""",
3,,Billy Butler,DH,KC,161,0.994,158,0.975,614,192,...,3.03,76.02,$0,$0,$0,$0,26,260,"6'0""",
4,,Ryan Braun,LF,MIL,154,0.951,153,0.944,598,191,...,6.87,97.04,$0,$0,$0,$0,28,205,"0'0""",


In [8]:
df_hitting_auxiliar_2012.columns

Index(['Rank', 'Player', 'Pos', 'Team', 'GP', 'GP%', 'GS', 'GS%', 'AB', 'H',
       '2B', '3B', 'HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS', 'WAR', 'TVS',
       'Payroll Salary2022', 'Cash2022', 'AAV2022', 'Earnings2022', 'Age',
       'Weight', 'Height', 'Unnamed: 27'],
      dtype='object')

The dataset's terms are kept in their original English to avoid translation ambiguities.

- **Pos**: Player position.
- **Team**: Team acronym.
- **GP**: Games played.
- **GP%**: Percentage of games played.
- **AB**: At bats.
- **H**: Hits.
- **HR**: Home runs.
- **RBI**: Runs batted in.
- **AVG**: Batting average.
- **OPS**: On-base plus slugging.

The *Cash2022* column is dropped: the player's current value is not relevant to this work, since some of these free agents have retired in the years since.

### Pitchers

In [9]:
df_pitching_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Pos,Team,GP,GS,IP,H,R,ER,...,WAR,TVS,Payroll Salary2022,Cash2022,AAV2022,Earnings2022,Age,Weight,Height,Unnamed: 26
0,,R.A. Dickey,SP,NYM,34,33,233.7,192,78,71,...,5.68,97.27,$0,$0,$0,$0,37,215,"6'3""",
1,,Felix Hernandez,SP,SEA,33,33,232.0,209,84,79,...,5.26,85.2,$0,$0,$0,$0,26,225,"6'3""",
2,,James Shields,SP,TB,33,33,227.7,209,103,89,...,2.55,79.41,$0,$0,$0,$0,30,215,"6'3""",
3,,Clayton Kershaw,SP,LAD,34,33,227.7,170,70,64,...,6.43,95.71,"$17,000,000","$17,000,000","$17,000,000","$268,342,641",24,225,"6'4""",
4,,Hiroki Kuroda,SP,NYY,33,33,219.7,205,86,81,...,5.28,81.85,$0,$0,$0,$0,37,0,"0'0""",


In [10]:
df_pitching_auxiliar_2012.columns

Index(['Rank', 'Player', 'Pos', 'Team', 'GP', 'GS', 'IP', 'H', 'R', 'ER', 'BB',
       'SO', 'W', 'L', 'SV', 'WHIP', 'ERA', 'WAR', 'TVS', 'Payroll Salary2022',
       'Cash2022', 'AAV2022', 'Earnings2022', 'Age', 'Weight', 'Height',
       'Unnamed: 26'],
      dtype='object')

#### Notation

Some of the terms mean the following:

- **Pos**: Player position.
- **Team**: Team acronym.
- **GP**: Games played.
- **GS**: Games started.
- **IP**: Innings pitched.
- **H**: Hits.
- **R**: Runs.
- **ER**: Earned runs.
- **BB**: Walks.
- **SO**: Strikeouts.
- **W**: Wins.
- **L**: Losses.
- **SV**: Saves.
- **WHIP**: Walks plus hits per inning pitched.
- **ERA**: Earned run average.

For the same reason, the *Cash2022* column is dropped.

### Salaries
This dataset has far fewer variables than the previous ones.

In [10]:
df_salary_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Year,Pos,Team,BaseSalary,Payroll Salary,Adj Salary,Unnamed: 8
0,,Alex Rodriguez,2012,DH,NYY,"$29,000,000","$30,000,000","$30,000,000",
1,,C.C. Sabathia,2012,SP,NYY,"$23,000,000","$24,285,714","$24,285,714",
2,,Vernon Wells,2012,LF,LAA,"$21,000,000","$24,187,500","$24,187,500",
3,,Johan Santana,2012,SP,NYM,"$24,000,000","$24,000,000","$24,000,000",
4,,Prince Fielder,2012,DH,DET,"$23,000,000","$23,150,000","$23,150,000",


- **BaseSalary**: A base salary is the minimum amount you can expect to earn in exchange for your time or services. This is the amount earned before benefits, bonuses, or compensation is added.
- **Payroll Salary**: Payroll is the compensation a business must pay to its employees for a set period and on a given date.
- **Adj Salary**: Adjusted Salary means the regular salary, wages and commissions, if any, payable to a Participant by the Company for the Participant's service, excluding any bonuses or other compensation.

### Teams ETL

This team dataset is the output of a previous ETL process.

In [11]:
df_teams_etl_2012.head()

Unnamed: 0,Equipo,Cantidad_agentes_libres,Valor_contrato_total,Acronimo,Victorias,Juegos totales,Playoffs,Pennants won,WS ganadas,Promedio_victorias
0,Los Angeles Angels,4,321150000,LAA,89,162,10,1,1,0.549383
1,Detroit Tigers,3,221000000,DET,88,162,14,11,4,0.54321
2,Miami Marlins,6,203300000,MIA,69,162,2,2,2,0.425926
3,Philadelphia Phillies,7,57650000,PHI,81,162,14,7,2,0.5
4,Los Angeles Dodgers,9,44651311,LAD,86,162,26,25,6,0.530864


### Teams by state

In [12]:
states = 'Data/Teams/team_states.csv'
df_states = pd.read_csv(states)

In [13]:
df_states.head()

Unnamed: 0,Estado,Cantidad de equipos
0,Alabama,0
1,Alaska,0
2,Arizona,1
3,Arkansas,0
4,California,5


### Acronyms

The acronym serves as an intermediate key for joining the team datasets.

In [14]:
acronym = 'Data/Teams/team_acronym.csv'
df_acronym = pd.read_csv(acronym)

In [15]:
df_acronym.head()

Unnamed: 0,Equipo,Acronimo,Estado
0,Arizona Diamondbacks,ARI,Arizona
1,Atlanta Braves,ATL,Georgia
2,Baltimore Orioles,BAL,Maryland
3,Boston Red Sox,BOS,Massachusetts
4,Chicago Cubs,CHC,Illinois


Let's merge this dataframe with the teams-by-state dataframe.

In [16]:
acronym_state = pd.merge(df_states, df_acronym, on = 'Estado')

In [17]:
acronym_state.head()

Unnamed: 0,Estado,Cantidad de equipos,Equipo,Acronimo
0,Arizona,1,Arizona Diamondbacks,ARI
1,California,5,Los Angeles Angels,LAA
2,California,5,Los Angeles Dodgers,LAD
3,California,5,Oakland Athletics,OAK
4,California,5,San Diego Padres,SD


Here the variable names are self-explanatory.
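
The merge above uses the default inner join, which is why states without a team (Alabama, Alaska, Arkansas) no longer appear in `acronym_state`. The sketch below, built on small stand-in dataframes with values copied from the tables above, shows one way to list the rows an inner join would drop. The names `states_demo`, `acronym_demo`, `checked` and `unmatched_states` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for df_states and df_acronym (values taken from the tables above)
states_demo = pd.DataFrame({'Estado': ['Alabama', 'Arizona', 'California'],
                            'Cantidad de equipos': [0, 1, 5]})
acronym_demo = pd.DataFrame({'Equipo': ['Arizona Diamondbacks', 'Los Angeles Angels'],
                             'Acronimo': ['ARI', 'LAA'],
                             'Estado': ['Arizona', 'California']})

# how='left' keeps every state; indicator=True adds a '_merge' column
# that records whether each row found a match on the right side
checked = pd.merge(states_demo, acronym_demo, on='Estado', how='left', indicator=True)
unmatched_states = checked.loc[checked['_merge'] == 'left_only', 'Estado'].tolist()
print(unmatched_states)
```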

## Algorithm for building the datasets

Next, the code is generalized so that the *dataframes* above can be built for a set of consecutive years, as in our case.

In [18]:
# Auxiliary path prefixes and number of seasons (2011 to 2021):
free_agents = 'Data/Free_Agents/Free_Agents_'
hitting = 'Data/Statistics/Hitting/hitting_'
pitching = 'Data/Statistics/Pitching/pitching_'
salary = 'Data/Salary/Salary_'
teams = 'ETL_data/Period_t/Teams/free_agents_team_'
csv = '.csv'
period = 11
# Original dataframes:
df_free_agents = [None]*period
df_hitting = [None]*period
df_pitching = [None]*period
df_salary = [None]*period
df_teams = [None]*period
# Working copies:
df_free_agents_copy = [None]*period
df_hitting_copy = [None]*period
df_pitching_copy = [None]*period
df_salary_copy = [None]*period
df_teams_copy = [None]*period
# Final outputs:
df_pitchers = [None]*period
df_hitters = [None]*period
df_pitchers_free_agents = [None]*period
df_hitters_free_agents = [None]*period
df_pitchers_no_free_agents = [None]*period
df_hitters_no_free_agents = [None]*period

Let's read all the files and create the copies.

In [19]:
for i in range(0,period):    
    df_free_agents[i] = pd.read_csv(free_agents + str(2011 + i) + csv)
    df_hitting[i] = pd.read_csv(hitting + str(2011 + i) + csv)
    df_pitching[i] = pd.read_csv(pitching + str(2011 + i) + csv)
    df_salary[i] = pd.read_csv(salary + str(2011 + i) + csv)
    df_teams[i] = pd.read_csv(teams + str(2011 + i) + csv)
    
    df_free_agents_copy[i] = df_free_agents[i].copy()
    df_hitting_copy[i] = df_hitting[i].copy()
    df_pitching_copy[i] = df_pitching[i].copy()
    df_salary_copy[i] = df_salary[i].copy()
    # Copy the dataframe already in memory instead of reading the file a second time
    df_teams_copy[i] = df_teams[i].copy()
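
The file names in the loop above are built from a prefix, a year and the `.csv` suffix. A possible way to make that mapping explicit is to key the paths by season, as in this sketch. The helper `yearly_paths` is hypothetical and only builds the strings; each path could then be passed to `pd.read_csv`.

```python
def yearly_paths(prefix, first_year=2011, n_years=11, suffix='.csv'):
    """Map each season to its file path, e.g. 2012 -> '<prefix>2012.csv'."""
    return {year: f'{prefix}{year}{suffix}'
            for year in range(first_year, first_year + n_years)}

# Prefix copied from the notebook's auxiliary variables
hitting_paths = yearly_paths('Data/Statistics/Hitting/hitting_')
print(hitting_paths[2012])
```

Keying by year avoids having to remember that index `2` corresponds to the 2013 season.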

Let's process each dataset separately.

#### Free agents

The team that signs the free agent is not kept, because that information is also in the teams dataset, which is easier to handle in the _ETL_ process.

In [20]:
for i in range(0,period):
    # Keep only the columns of interest and rename them
    df_free_agents_copy[i] = df_free_agents_copy[i][['Player', 'Year', 'Status', 'Team From',
                                                     'YRS', 'Value', 'AAV']].copy()
    df_free_agents_names = ['Jugador', 'Anio', 'Status', 'Equipo_anterior',
                            'Anios_contrato', 'Valor_contrato', 'Valor_promedio_contrato']
    df_free_agents_copy[i].columns = df_free_agents_names

    # Strip "$" and "," (literal match, not regex) and convert to numeric
    for col in ['Valor_contrato', 'Valor_promedio_contrato']:
        cleaned = df_free_agents_copy[i][col].str.replace("$", "", regex=False)
        cleaned = cleaned.str.replace(",", "", regex=False)
        df_free_agents_copy[i][col] = pd.to_numeric(cleaned)

#### Salaries

Since the salaries are merged into the _hitters_ and _pitchers_ datasets, their _ETL_ step is done first.

In [25]:
for i in range(0,period):
    # Keep only the columns of interest and rename them
    df_salary_copy[i] = df_salary_copy[i][['Player', 'Team', 'BaseSalary',
                                           'Payroll Salary', 'Adj Salary']].copy()
    df_salary_names = ['Jugador', 'Equipo', 'Sueldo_base', 'Sueldo', 'Sueldo_regular']
    df_salary_copy[i].columns = df_salary_names

    # Strip "$" and "," (literal match, not regex) and convert to numeric
    for col in ['Sueldo_base', 'Sueldo', 'Sueldo_regular']:
        cleaned = df_salary_copy[i][col].str.replace("$", "", regex=False)
        cleaned = cleaned.str.replace(",", "", regex=False)
        df_salary_copy[i][col] = pd.to_numeric(cleaned)
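
The free-agent and salary cells apply the same currency cleaning to several columns. One way to factor it out is a small helper, sketched below. `money_to_number` is a hypothetical name, and the use of `errors='coerce'` (turning unparseable entries into NaN) is an assumption that the notebook itself does not make.

```python
import pandas as pd

def money_to_number(series):
    """Convert strings such as '$24,000,000' to numbers."""
    cleaned = (series.astype(str)
                     .str.replace('$', '', regex=False)   # literal dollar sign
                     .str.replace(',', '', regex=False))  # thousands separator
    # Assumption: values that still cannot be parsed become NaN
    return pd.to_numeric(cleaned, errors='coerce')

# Sample values in the format shown in the free-agent table
aav = pd.Series(['$24,000,000', '$23,777,778', '$0'])
aav_numeric = money_to_number(aav)
print(aav_numeric.tolist())
```

Passing `regex=False` makes the intent explicit, since `$` is a special character in regular expressions.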

#### Hitters

In [26]:
for i in range(0,period):    
    df_hitting_copy[i] = df_hitting_copy[i][['Player', 'Pos', 'GP', 'GP%', 'AB', 'H',
                                             'HR', 'RBI', 'AVG', 'OPS', 'WAR', 'TVS',
                                             'Age', 'Weight', 'Height']]
    df_hitting_names = ['Jugador', 'Posicion', 'Juegos', 'Porcetnaje_juegos', 'At-bats',
                        'Bateos', 'Home-runs', 'RBI', 'Porcentaje_bateo', 'OPS',
                        'WAR', 'TVS', 'Edad', 'Peso', 'Altura']
    df_hitting_copy[i].columns = df_hitting_names
    
    hitting_aux_1 = df_hitting_copy[i]['Altura'].str.replace("\"","")
    hitting_aux_2 = hitting_aux_1.str.replace("'","")
    df_hitting_copy[i]['Altura'] = hitting_aux_2
    df_hitting_copy[i]['Altura'] = pd.to_numeric(df_hitting_copy[i]['Altura'])/10
    
    df_hitters[i] = pd.merge(df_hitting_copy[i], df_salary_copy[i], on = 'Jugador')

    df_hitters[i] = df_hitters[i].rename(columns = {'Equipo':'Acronimo'})
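
The height conversion above removes the quote characters and divides by 10, so `6'4"` becomes `6.4`. The same steps would turn a height such as `5'11"` into `51.1`, and `0'0"` (apparently a missing value) into `0.0`. A possible alternative, sketched here and not used in the rest of the notebook, is to convert heights to total inches. The helper `height_to_inches` is hypothetical, and treating `0'0"` as missing is an assumption.

```python
import numpy as np
import pandas as pd

def height_to_inches(series):
    """Parse heights like 6'4" into total inches; 0'0" is treated as missing."""
    # Capture feet and inches; the closing double quote is optional
    parts = series.str.extract(r"^\s*(\d+)'\s*(\d+)\"?\s*$")
    feet = pd.to_numeric(parts[0], errors='coerce')
    inches = pd.to_numeric(parts[1], errors='coerce')
    total = feet * 12 + inches
    # Assumption: a height of zero marks a missing value
    return total.where(total > 0, np.nan)

heights = pd.Series(["6'4\"", "5'11\"", "0'0\""])
heights_in = height_to_inches(heights)
print(heights_in.tolist())
```

Adopting this would change the `Altura` values shown in the outputs below, which were produced with the original conversion.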

#### Pitchers

In [27]:
for i in range(0,period):    
    df_pitching_copy[i] = df_pitching_copy[i][['Player', 'Pos', 'GP', 'GS', 'IP', 'H', 
                                               'R', 'ER', 'BB', 'SO', 'W', 'L', 'SV', 
                                               'WHIP', 'ERA', 'WAR', 'TVS', 'Age',
                                               'Weight', 'Height']]
    df_pitching_names = ['Jugador', 'Posicion', 'Juegos', 'Juegos_iniciados', 'Inning_pitched', 'Bateos_pitcher',
                         'Carreras', 'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins', 'Losses',
                         'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Edad', 'Peso', 'Altura']
    df_pitching_copy[i].columns = df_pitching_names    
    
    pitching_aux_1 = df_pitching_copy[i]['Altura'].str.replace("\"","")
    pitching_aux_2 = pitching_aux_1.str.replace("'","")
    df_pitching_copy[i]['Altura'] = pitching_aux_2
    df_pitching_copy[i]['Altura'] = pd.to_numeric(df_pitching_copy[i]['Altura'])/10

    df_pitchers[i] = pd.merge(df_pitching_copy[i], df_salary_copy[i], on = 'Jugador')
    
    df_pitchers[i] = df_pitchers[i].rename(columns = {'Equipo':'Acronimo'})
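
Merging on `Jugador` alone assumes that a name identifies exactly one row on each side. When a name appears twice in the salary data, the statistics row is duplicated, as happens with Miguel Gonzalez (BAL and CHW) in the 2013 pitchers output further down. The sketch below uses toy dataframes to show how such cases could be detected before merging; the variable names are illustrative.

```python
import pandas as pd

stats_demo = pd.DataFrame({'Jugador': ['Miguel Gonzalez', 'Adam Wainwright'],
                           'Juegos': [60, 35]})
salary_demo = pd.DataFrame({'Jugador': ['Miguel Gonzalez', 'Miguel Gonzalez', 'Adam Wainwright'],
                            'Equipo': ['BAL', 'CHW', 'STL']})

# Names that appear more than once on the salary side
duplicated_names = (salary_demo.loc[salary_demo['Jugador'].duplicated(), 'Jugador']
                    .unique().tolist())

# validate='one_to_one' makes pandas raise MergeError if a key is repeated
try:
    pd.merge(stats_demo, salary_demo, on='Jugador', validate='one_to_one')
    merge_is_one_to_one = True
except pd.errors.MergeError:
    merge_is_one_to_one = False

print(duplicated_names, merge_is_one_to_one)
```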

## Adding variables suggested by articles

The first variables we add are the squares of all the sports statistics, together with the following variables:

- DOMINANCE = $\text{Strike-outs} / (9 \cdot \text{Innings pitched})$
- CONTROL = $\text{Walks} / (9 \cdot \text{Innings pitched})$
- COMMAND = $\text{Strike-outs} / \text{Walks}$
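
As a quick arithmetic check of the three definitions, the snippet below applies them to Adam Wainwright's 2013 figures, taken from the pitchers table shown further down (219 strike-outs, 35 walks, 241.7 innings pitched). The variable names are illustrative.

```python
# Values copied from Adam Wainwright's 2013 row in the pitchers table
strike_outs, walks, innings_pitched = 219, 35, 241.7

dominio = strike_outs / (9 * innings_pitched)   # DOMINANCE
control = walks / (9 * innings_pitched)         # CONTROL
comando = strike_outs / walks                   # COMMAND

print(round(dominio, 6), round(control, 6), round(comando, 6))
```

The rounded results agree with the `Dominio`, `Control` and `Comando` columns that the notebook computes for the same row.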

In [28]:
df_hitters[2].head()

Unnamed: 0,Jugador,Posicion,Juegos,Porcetnaje_juegos,At-bats,Bateos,Home-runs,RBI,Porcentaje_bateo,OPS,WAR,TVS,Edad,Peso,Altura,Acronimo,Sueldo_base,Sueldo,Sueldo_regular
0,Matt Carpenter,3B,157,0.969,626,199,11,78,0.318,0.873,6.6,98.43,27,210,6.4,STL,504000,514000,514000
1,Dustin Pedroia,2B,160,0.988,641,193,9,84,0.301,0.787,6.32,88.44,29,175,5.9,BOS,10000000,10250000,10250000
2,Miguel Cabrera,1B,148,0.914,555,193,44,137,0.348,1.078,7.49,99.44,30,249,6.4,DET,21000000,21000000,21000000
3,Robinson Cano,2B,160,0.988,605,190,27,107,0.314,0.899,6.63,97.4,30,210,6.0,NYY,15000000,15000000,15000000
4,Daniel Murphy,2B,161,0.994,658,188,13,78,0.286,0.733,1.83,60.12,28,221,0.0,NYM,2925000,2925000,2925000


In [29]:
df_hitters[2].columns

Index(['Jugador', 'Posicion', 'Juegos', 'Porcetnaje_juegos', 'At-bats',
       'Bateos', 'Home-runs', 'RBI', 'Porcentaje_bateo', 'OPS', 'WAR', 'TVS',
       'Edad', 'Peso', 'Altura', 'Acronimo', 'Sueldo_base', 'Sueldo',
       'Sueldo_regular'],
      dtype='object')

In [30]:
df_pitchers[2].head()

Unnamed: 0,Jugador,Posicion,Juegos,Juegos_iniciados,Inning_pitched,Bateos_pitcher,Carreras,Carreras_ganadas,Walks,Strike-outs,...,ERA,WAR,TVS,Edad,Peso,Altura,Acronimo,Sueldo_base,Sueldo,Sueldo_regular
0,Miguel Gonzalez,SP,60,56,342.7,314,162,144,106,240,...,7.56,1.76,44.91,29,170,6.1,BAL,502000,502000,502000
1,Miguel Gonzalez,SP,60,56,342.7,314,162,144,106,240,...,7.56,1.76,44.91,29,170,6.1,CHW,490000,490000,74972
2,Adam Wainwright,SP,35,34,241.7,223,83,79,35,219,...,2.94,6.19,98.35,31,230,6.7,STL,12000000,12150000,12150000
3,Clayton Kershaw,SP,35,33,236.0,164,55,48,52,232,...,1.83,8.0,98.86,25,225,6.4,LAD,11000000,11250000,11250000
4,James Shields,SP,34,34,228.7,215,82,80,68,196,...,3.15,4.32,82.43,31,215,6.3,KC,10250000,10250000,10250000


In [31]:
for i in range(0,period):
    df_pitchers[i]['Dominio'] = df_pitchers[i]['Strike-outs']/(9*df_pitchers[i]['Inning_pitched'])
    df_pitchers[i]['Control'] = df_pitchers[i]['Walks']/(9*df_pitchers[i]['Inning_pitched'])
    df_pitchers[i]['Comando'] = df_pitchers[i]['Strike-outs']/df_pitchers[i]['Walks']
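
These ratios divide by `Walks` or `Inning_pitched`, so a pitcher with zero walks or zero innings would get an infinite value. Whether such rows exist in the data is not shown here. If they do, one option is to replace infinities with NaN, as in this sketch on a toy dataframe with made-up numbers.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first pitcher has no walks
demo = pd.DataFrame({'Strike-outs': [10, 5],
                     'Walks': [0, 2],
                     'Inning_pitched': [12.0, 6.0]})

# Division by zero yields inf in pandas; turn it into NaN so that
# later statistics and regressions are not distorted
ratio = demo['Strike-outs'] / demo['Walks']
demo['Comando'] = ratio.replace([np.inf, -np.inf], np.nan)
print(demo['Comando'].tolist())
```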

In [32]:
df_pitchers[2].head()

Unnamed: 0,Jugador,Posicion,Juegos,Juegos_iniciados,Inning_pitched,Bateos_pitcher,Carreras,Carreras_ganadas,Walks,Strike-outs,...,Edad,Peso,Altura,Acronimo,Sueldo_base,Sueldo,Sueldo_regular,Dominio,Control,Comando
0,Miguel Gonzalez,SP,60,56,342.7,314,162,144,106,240,...,29,170,6.1,BAL,502000,502000,502000,0.077813,0.034368,2.264151
1,Miguel Gonzalez,SP,60,56,342.7,314,162,144,106,240,...,29,170,6.1,CHW,490000,490000,74972,0.077813,0.034368,2.264151
2,Adam Wainwright,SP,35,34,241.7,223,83,79,35,219,...,31,230,6.7,STL,12000000,12150000,12150000,0.100676,0.01609,6.257143
3,Clayton Kershaw,SP,35,33,236.0,164,55,48,52,232,...,25,225,6.4,LAD,11000000,11250000,11250000,0.109228,0.024482,4.461538
4,James Shields,SP,34,34,228.7,215,82,80,68,196,...,31,215,6.3,KC,10250000,10250000,10250000,0.095224,0.033037,2.882353


In [33]:
df_pitchers[2].columns

Index(['Jugador', 'Posicion', 'Juegos', 'Juegos_iniciados', 'Inning_pitched',
       'Bateos_pitcher', 'Carreras', 'Carreras_ganadas', 'Walks',
       'Strike-outs', 'Wins', 'Losses', 'Saves', 'WHIP', 'ERA', 'WAR', 'TVS',
       'Edad', 'Peso', 'Altura', 'Acronimo', 'Sueldo_base', 'Sueldo',
       'Sueldo_regular', 'Dominio', 'Control', 'Comando'],
      dtype='object')

To make creating the squared variables more efficient, we select the columns by index.

In [34]:
# Select the columns to square by their positional index
square_pitchers_index = list(range(2,17)) + [24,25,26]
square_hitters_index = list(range(2,12))

In [35]:
for i in range(0,period):
    for j in square_pitchers_index:
        df_pitchers[i][df_pitchers[i].columns[j] + '_2'] = np.power(df_pitchers[i][df_pitchers[i].columns[j]], 2)
    
    for k in square_hitters_index:
        df_hitters[i][df_hitters[i].columns[k] + '_2'] = np.power(df_hitters[i][df_hitters[i].columns[k]], 2)
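
Positional indices depend on the column order staying fixed. A variant that selects the columns by name is sketched below; it is an alternative under that concern, not the notebook's method. `add_squares` is a hypothetical helper, and the sample values come from Whit Merrifield's row in the hitters output further down.

```python
import pandas as pd

def add_squares(df, columns, suffix='_2'):
    """Return a copy of df with a squared version of each listed column."""
    out = df.copy()
    for col in columns:
        out[col + suffix] = out[col] ** 2
    return out

hitters_demo = pd.DataFrame({'Jugador': ['Whit Merrifield'],
                             'Juegos': [158],
                             'Home-runs': [12]})
squared = add_squares(hitters_demo, ['Juegos', 'Home-runs'])
print(squared[['Juegos_2', 'Home-runs_2']].iloc[0].tolist())
```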

Let's look at the final result.

In [36]:
df_pitchers[2].head()

Unnamed: 0,Jugador,Posicion,Juegos,Juegos_iniciados,Inning_pitched,Bateos_pitcher,Carreras,Carreras_ganadas,Walks,Strike-outs,...,Wins_2,Losses_2,Saves_2,WHIP_2,ERA_2,WAR_2,TVS_2,Dominio_2,Control_2,Comando_2
0,Miguel Gonzalez,SP,60,56,342.7,314,162,144,106,240,...,484,256,0,6.0025,57.1536,3.0976,2016.9081,0.006055,0.001181,5.126379
1,Miguel Gonzalez,SP,60,56,342.7,314,162,144,106,240,...,484,256,0,6.0025,57.1536,3.0976,2016.9081,0.006055,0.001181,5.126379
2,Adam Wainwright,SP,35,34,241.7,223,83,79,35,219,...,361,81,0,1.1449,8.6436,38.3161,9672.7225,0.010136,0.000259,39.151837
3,Clayton Kershaw,SP,35,33,236.0,164,55,48,52,232,...,256,81,0,0.8464,3.3489,64.0,9773.2996,0.011931,0.000599,19.905325
4,James Shields,SP,34,34,228.7,215,82,80,68,196,...,169,81,0,1.5376,9.9225,18.6624,6794.7049,0.009068,0.001091,8.307958


In [37]:
df_pitchers[2].columns

Index(['Jugador', 'Posicion', 'Juegos', 'Juegos_iniciados', 'Inning_pitched',
       'Bateos_pitcher', 'Carreras', 'Carreras_ganadas', 'Walks',
       'Strike-outs', 'Wins', 'Losses', 'Saves', 'WHIP', 'ERA', 'WAR', 'TVS',
       'Edad', 'Peso', 'Altura', 'Acronimo', 'Sueldo_base', 'Sueldo',
       'Sueldo_regular', 'Dominio', 'Control', 'Comando', 'Juegos_2',
       'Juegos_iniciados_2', 'Inning_pitched_2', 'Bateos_pitcher_2',
       'Carreras_2', 'Carreras_ganadas_2', 'Walks_2', 'Strike-outs_2',
       'Wins_2', 'Losses_2', 'Saves_2', 'WHIP_2', 'ERA_2', 'WAR_2', 'TVS_2',
       'Dominio_2', 'Control_2', 'Comando_2'],
      dtype='object')

In [38]:
df_hitters[7].head()

Unnamed: 0,Jugador,Posicion,Juegos,Porcetnaje_juegos,At-bats,Bateos,Home-runs,RBI,Porcentaje_bateo,OPS,...,Juegos_2,Porcetnaje_juegos_2,At-bats_2,Bateos_2,Home-runs_2,RBI_2,Porcentaje_bateo_2,OPS_2,WAR_2,TVS_2
0,Whit Merrifield,2B,158,0.975,632,192,12,60,0.304,0.806,...,24964,0.950625,399424,36864,144,3600,0.092416,0.649636,30.6916,9164.2329
1,Freddie Freeman,1B,162,1.0,618,191,23,98,0.309,0.892,...,26244,1.0,381924,36481,529,9604,0.095481,0.795664,29.7025,7492.6336
2,J.D. Martinez,DH,150,0.926,569,188,43,130,0.33,1.031,...,22500,0.857476,323761,35344,1849,16900,0.1089,1.062961,40.5769,9348.9561
3,Manny Machado,3B,162,0.994,632,188,37,107,0.298,0.905,...,26244,0.988036,399424,35344,1369,11449,0.088804,0.819025,8.4681,5665.5729
4,Christian Yelich,LF,147,0.902,574,187,36,110,0.326,1.0,...,21609,0.813604,329476,34969,1296,12100,0.106276,1.0,58.0644,9970.0225


In [39]:
df_hitters[7].columns

Index(['Jugador', 'Posicion', 'Juegos', 'Porcetnaje_juegos', 'At-bats',
       'Bateos', 'Home-runs', 'RBI', 'Porcentaje_bateo', 'OPS', 'WAR', 'TVS',
       'Edad', 'Peso', 'Altura', 'Acronimo', 'Sueldo_base', 'Sueldo',
       'Sueldo_regular', 'Juegos_2', 'Porcetnaje_juegos_2', 'At-bats_2',
       'Bateos_2', 'Home-runs_2', 'RBI_2', 'Porcentaje_bateo_2', 'OPS_2',
       'WAR_2', 'TVS_2'],
      dtype='object')

Following the suggestion of some articles, let's take the logarithm of the salaries.

In [40]:
for year in range(0,period):
    df_hitters[year]['ln_Sueldo_base'] = np.log(df_hitters[year]['Sueldo_base'])
    df_hitters[year]['ln_Sueldo'] = np.log(df_hitters[year]['Sueldo'])
    df_hitters[year]['ln_Sueldo_regular'] = np.log(df_hitters[year]['Sueldo_regular'])
    df_hitters[year]['Anio'] = year + 1
    
    df_pitchers[year]['ln_Sueldo_base'] = np.log(df_pitchers[year]['Sueldo_base'])
    df_pitchers[year]['ln_Sueldo'] = np.log(df_pitchers[year]['Sueldo'])
    df_pitchers[year]['ln_Sueldo_regular'] = np.log(df_pitchers[year]['Sueldo_regular'])
    df_pitchers[year]['Anio'] = year + 1
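
`np.log` returns negative infinity for a salary of zero. If such values were present, one possible safeguard is to mask non-positive salaries before taking the logarithm. This is a sketch with made-up inputs; the notebook applies `np.log` directly.

```python
import numpy as np
import pandas as pd

# Made-up salaries, including a zero to illustrate the edge case
sueldo = pd.Series([504000, 0, 10000000], dtype='float64')

# where() keeps positive values and sets the rest to NaN,
# so the logarithm is never evaluated at zero
ln_sueldo = np.log(sueldo.where(sueldo > 0))
print(ln_sueldo.isna().tolist())
```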

### Aggregated data by team

All that remains is to add the relevant data for each player's team, taking into account the dataset with the number of teams per state.

In [41]:
df_teams_copy[2].head()

Unnamed: 0,Equipo,Cantidad_agentes_libres,Valor_contrato_total,Acronimo,Victorias,Juegos totales,Playoffs,Pennants won,WS ganadas,Promedio_victorias
0,Los Angeles Angels,8,153500000,LAA,78,162,10,1,1,0.481481
1,Los Angeles Dodgers,7,150850000,LAD,92,162,27,25,6,0.567901
2,Boston Red Sox,10,130700000,BOS,97,162,21,13,8,0.598765
3,Detroit Tigers,4,107775000,DET,93,162,15,11,4,0.574074
4,San Francisco Giants,6,80750000,SF,76,162,23,23,7,0.469136


In [42]:
for i in range(0,period):
    df_teams_copy[i] = pd.merge(df_teams_copy[i], acronym_state, on = ['Equipo','Acronimo'])
    df_hitters[i] = pd.merge(df_teams_copy[i], df_hitters[i], on = 'Acronimo')
    df_pitchers[i] = pd.merge(df_teams_copy[i], df_pitchers[i], on = 'Acronimo')
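
These are inner joins, so any player whose team acronym is missing from the teams dataframe is dropped. The `describe()` output for 2011 further down reports only 17 hitters, all with the same `Victorias` value, which may indicate that this is happening. The sketch below shows one way to count and list dropped rows on toy data; the player names `A`, `B`, `C` and the variable names are placeholders.

```python
import pandas as pd

# Toy data: only one team is present in the teams table
teams_demo = pd.DataFrame({'Acronimo': ['MIL'], 'Victorias': [96]})
players_demo = pd.DataFrame({'Jugador': ['A', 'B', 'C'],
                             'Acronimo': ['MIL', 'NYY', 'MIL']})

merged = pd.merge(teams_demo, players_demo, on='Acronimo')

# Players whose acronym has no match in the teams table
dropped = players_demo[~players_demo['Acronimo'].isin(teams_demo['Acronimo'])]
rows_lost = len(players_demo) - len(merged)
print(rows_lost, dropped['Jugador'].tolist())
```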

## Segmentation by free agents

We split pitchers and hitters into two groups:

- Free agents.
- Non-free agents.

In [43]:
for i in range(0,period):    
    df_hitters_free_agents[i] = pd.merge(df_free_agents_copy[i], df_hitters[i], on = 'Jugador')
    
    df_pitchers_free_agents[i] = pd.merge(df_free_agents_copy[i], df_pitchers[i], on = 'Jugador')
    
    df_hitters_no_free_agents[i] = df_hitters[i][~df_hitters[i].Jugador.isin(df_hitters_free_agents[i].Jugador)]
    df_pitchers_no_free_agents[i] = df_pitchers[i][~df_pitchers[i].Jugador.isin(df_pitchers_free_agents[i].Jugador)]
    
    df_hitters_free_agents[i] = df_hitters_free_agents[i].reindex(sorted(df_hitters_free_agents[i].columns), axis=1)
    df_pitchers_free_agents[i] = df_pitchers_free_agents[i].reindex(sorted(df_pitchers_free_agents[i].columns), axis=1)
    df_hitters_no_free_agents[i] = df_hitters_no_free_agents[i].reindex(sorted(df_hitters_no_free_agents[i].columns), axis=1)
    df_pitchers_no_free_agents[i] = df_pitchers_no_free_agents[i].reindex(sorted(df_pitchers_no_free_agents[i].columns), axis=1)  
    
    # Export each dataframe separately
    df_hitters_free_agents[i].to_csv('ETL_data/Period_t/Free_Agent/Hitters/free_agents_batters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers_free_agents[i].to_csv('ETL_data/Period_t/Free_Agent/Pitchers/free_agents_pitchers_' + str(2011 + i) + '.csv', index = False)
    df_hitters_no_free_agents[i].to_csv('ETL_data/Period_t/No_Free_Agent/Hitters/no_free_agents_batters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers_no_free_agents[i].to_csv('ETL_data/Period_t/No_Free_Agent/Pitchers/no_free_agents_pitchers_' + str(2011 + i) + '.csv', index = False)
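
Both sides of the free-agent merge contain an `Anio` column (the contract year on one side, the period index on the other), so pandas renames them `Anio_x` and `Anio_y`, as the outputs below show. The `suffixes` argument of `pd.merge` can give them more descriptive names. This is a sketch on toy data, and the suffixes `_contrato` and `_periodo` are illustrative choices.

```python
import pandas as pd

free_agents_demo = pd.DataFrame({'Jugador': ['Albert Pujols'], 'Anio': [2012]})
hitters_demo = pd.DataFrame({'Jugador': ['Albert Pujols', 'Derek Jeter'],
                             'Anio': [2, 2]})

# Overlapping non-key columns receive the given suffixes instead of _x / _y
merged = pd.merge(free_agents_demo, hitters_demo, on='Jugador',
                  suffixes=('_contrato', '_periodo'))
print(list(merged.columns))
```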

In [44]:
# Some examples
df_pitchers_no_free_agents[0].head()

Unnamed: 0,Acronimo,Altura,Anio,Bateos_pitcher,Bateos_pitcher_2,Cantidad de equipos,Cantidad_agentes_libres,Carreras,Carreras_2,Carreras_ganadas,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,MIL,0.0,1,214,45796,1,1,95,9025,87,...,1.32,1.7424,0,66,4356,13,169,16.066802,16.066802,16.066802
1,MIL,6.2,1,193,37249,1,1,92,8464,81,...,1.22,1.4884,0,59,3481,17,289,15.068274,14.994166,15.068274
2,MIL,6.0,1,175,30625,1,1,84,7056,79,...,1.16,1.3456,0,57,3249,13,169,15.189226,15.189226,15.189226
3,MIL,6.2,1,161,25921,1,1,82,6724,73,...,1.2,1.44,0,45,2025,16,256,16.4182,16.4182,16.4182
4,MIL,6.3,1,160,25600,1,1,82,6724,80,...,1.39,1.9321,0,65,4225,11,121,13.003918,13.003918,13.003918


In [45]:
df_hitters_no_free_agents[0].head()

Unnamed: 0,Acronimo,Altura,Anio,At-bats,At-bats_2,Bateos,Bateos_2,Cantidad de equipos,Cantidad_agentes_libres,Edad,...,TVS,TVS_2,Valor_contrato_total,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,MIL,0.0,1,563,316969,187,34969,1,1,27,...,96.43,9298.7449,775000,96,7.74,59.9076,0,15.424948,15.201805,15.424948
1,MIL,6.6,1,492,242064,140,19600,1,1,29,...,68.72,4722.4384,775000,96,3.18,10.1124,0,15.737323,15.687313,15.737323
2,MIL,6.1,1,546,298116,122,14884,1,1,28,...,8.77,76.9129,775000,96,-0.74,0.5476,0,13.056224,13.056224,13.056224
3,MIL,0.0,1,378,142884,115,13225,1,1,30,...,58.06,3370.9636,775000,96,2.95,8.7025,0,13.017003,13.017003,13.017003
4,MIL,6.0,1,430,184900,114,12996,1,1,25,...,21.82,476.1124,775000,96,1.02,1.0404,0,12.957489,12.957489,12.957489


In [46]:
df_pitchers_free_agents[10]

Unnamed: 0,Acronimo,Altura,Anio_x,Anio_y,Anios_contrato,Bateos_pitcher,Bateos_pitcher_2,Cantidad de equipos,Cantidad_agentes_libres,Carreras,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,BOS,6.5,2019,11,3,75,5625,1,6,43,...,3.03,9.1809,9,44,1936,9,81,15.793064,15.782623,15.793064
1,STL,6.7,2021,11,1,222,49284,2,2,97,...,2.11,4.4521,11,65,4225,22,484,15.894952,15.894952,15.894952
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,NYY,6.1,2019,11,3,29,841,2,4,20,...,2.69,7.2361,27,21,441,1,1,16.380460,16.380460,16.380460
97,HOU,6.2,2016,11,6,231,53361,2,5,112,...,2.30,5.2900,1,45,2025,14,196,17.370859,17.281246,17.020963


In [47]:
df_hitters_free_agents[8].head()

Unnamed: 0,Acronimo,Altura,Anio_x,Anio_y,Anios_contrato,At-bats,At-bats_2,Bateos,Bateos_2,Cantidad de equipos,...,Valor_contrato,Valor_contrato_total,Valor_promedio_contrato,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,LAD,6.1,2019,9,5,308,94864,82,6724,5,...,60000000,85000000,12000000,105,0.21,0.0441,6,15.201805,13.815511,15.201805
1,ARI,6.2,2019,9,1,485,235225,126,15876,1,...,3000000,11600000,3000000,85,-0.36,0.1296,1,15.319588,14.914123,15.319588
2,ATL,0.0,2019,9,0,203,41209,49,2401,1,...,0,52210000,0,97,-0.19,0.0361,3,13.815511,13.815511,12.396424
3,LAA,6.3,2012,9,10,491,241081,120,14400,5,...,240000000,31000000,24000000,72,0.43,0.1849,1,17.147715,17.147715,17.147715
4,KC,6.1,2016,9,4,556,309136,148,21904,2,...,72000000,15700000,18000000,58,1.35,1.8225,2,16.811243,16.811243,16.811243


### Labels for free agents

We create a label indicating whether each pitcher or hitter is a free agent.

In [48]:
for i in range(0,period):
    # Conditions
    condicion_hitter = [df_hitters[i].Jugador.isin(df_hitters_free_agents[i].Jugador)]
    condicion_pitcher = [df_pitchers[i].Jugador.isin(df_pitchers_free_agents[i].Jugador)]
    # Labels
    etiquetas = ['Si']
    
    df_hitters[i]['Agente libre'] = np.select(condicion_hitter, etiquetas, default = 'No')
    df_pitchers[i]['Agente libre'] = np.select(condicion_pitcher, etiquetas, default = 'No')
    
    df_hitters[i] = df_hitters[i].reindex(sorted(df_hitters[i].columns), axis=1)
    df_pitchers[i] = df_pitchers[i].reindex(sorted(df_pitchers[i].columns), axis=1)
    
    # Export the dataframes
    df_hitters[i].to_csv('ETL_data/Period_t/Hitters/All_Hitters/hitters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers[i].to_csv('ETL_data/Period_t/Pitchers/All_Pitchers/pitchers_' + str(2011 + i) + '.csv', index = False)
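
With a single condition, `np.select` behaves like `np.where`, which may read more directly. The sketch below builds the same kind of `Si`/`No` label on toy data; the player names `A`, `B`, `C` are placeholders.

```python
import numpy as np
import pandas as pd

all_hitters = pd.DataFrame({'Jugador': ['A', 'B', 'C']})
free_agent_names = pd.Series(['B'])

# 'Si' where the player appears among the free agents, 'No' otherwise
all_hitters['Agente libre'] = np.where(
    all_hitters['Jugador'].isin(free_agent_names), 'Si', 'No')
print(all_hitters['Agente libre'].tolist())
```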

In [49]:
df_hitters[10].head()

Unnamed: 0,Acronimo,Agente libre,Altura,Anio,At-bats,At-bats_2,Bateos,Bateos_2,Cantidad de equipos,Cantidad_agentes_libres,...,TVS,TVS_2,Valor_contrato_total,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,TOR,No,6.2,11,825,680625,246,60516,1,6,...,165.28,27317.4784,186250000,92,7.1,50.41,2,13.313645,13.313645,13.313645
1,TOR,No,6.0,11,763,582169,228,51984,1,6,...,97.05,9418.7025,186250000,92,6.93,48.0249,2,13.284142,13.284142,13.284142
2,TOR,Si,6.0,11,863,744769,220,48400,1,6,...,122.45,14994.0025,186250000,92,7.41,54.9081,2,16.705882,16.705882,16.705882
3,TOR,No,6.2,11,740,547600,218,47524,1,6,...,186.19,34666.7161,186250000,92,5.25,27.5625,2,15.279923,15.279923,15.279923
4,TOR,No,6.4,11,708,501264,202,40804,1,6,...,140.51,19743.0601,186250000,92,3.81,14.5161,2,15.183786,15.068274,15.183786


In [50]:
df_pitchers[9].head()

Unnamed: 0,Acronimo,Agente libre,Altura,Anio,Bateos_pitcher,Bateos_pitcher_2,Cantidad de equipos,Cantidad_agentes_libres,Carreras,Carreras_2,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
0,NYY,Si,6.4,10,53,2809,2,2,27,729,...,0.96,0.9216,27,17,289,7,49,17.399029,17.399029,16.405778
1,NYY,Si,6.5,10,37,1369,2,2,19,361,...,1.05,1.1025,27,15,225,2,4,16.648724,16.648724,15.655472
2,NYY,No,0.0,10,48,2304,2,2,25,625,...,1.17,1.3689,27,8,64,3,9,16.951005,16.951005,15.957753
3,NYY,No,6.6,10,48,2304,2,2,27,729,...,1.29,1.6641,27,9,81,2,4,13.598598,13.598598,12.461102
4,NYY,No,5.9,10,35,1225,2,2,20,400,...,1.19,1.4161,27,6,36,3,9,13.241923,13.241923,11.088507


In [51]:
df_hitters[0].describe()

Unnamed: 0,Altura,Anio,At-bats,At-bats_2,Bateos,Bateos_2,Cantidad de equipos,Cantidad_agentes_libres,Edad,Home-runs,...,TVS,TVS_2,Valor_contrato_total,Victorias,WAR,WAR_2,WS ganadas,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
count,17.000000,17.0,17.000000,17.000000,17.000000,17.000000,17.0,17.0,17.000000,17.000000,...,17.000000,17.000000,17.0,17.0,16.0000,16.000000,17.0,17.000000,17.000000,17.000000
mean,5.070588,1.0,182.647059,74700.058824,48.529412,5874.176471,1.0,1.0,27.588235,6.176471,...,29.276471,1807.240753,775000.0,96.0,1.7300,6.795263,0.0,14.188448,14.164705,14.094196
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75%,6.200000,1.0,378.000000,142884.000000,114.000000,12996.000000,1.0,1.0,29.000000,8.000000,...,58.060000,3370.963600,775000.0,96.0,2.6075,6.833725,0.0,15.424948,15.201805,15.189226
max,6.600000,1.0,563.000000,316969.000000,187.000000,34969.000000,1.0,1.0,34.000000,33.000000,...,96.430000,9298.744900,775000.0,96.0,7.7400,59.907600,0.0,16.418200,16.418200,16.418200


In [52]:
df_pitchers[0].describe()

Unnamed: 0,Altura,Anio,Bateos_pitcher,Bateos_pitcher_2,Cantidad de equipos,Cantidad_agentes_libres,Carreras,Carreras_2,Carreras_ganadas,Carreras_ganadas_2,...,WHIP,WHIP_2,WS ganadas,Walks,Walks_2,Wins,Wins_2,ln_Sueldo,ln_Sueldo_base,ln_Sueldo_regular
count,10.000,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.00,10.00,...,10.000,10.00000,10.0,10.0,10.0,10.0,10.0,10.000000,10.000000,10.000000
mean,5.590,1.0,116.4,18255.4,1.0,1.0,53.6,4108.8,49.20,3473.00,...,1.368,2.01898,0.0,38.5,1958.7,8.5,106.9,14.619050,14.599941,14.458822
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75%,6.275,1.0,171.5,29449.0,1.0,1.0,83.5,6973.0,79.75,6360.25,...,1.315,1.72930,0.0,58.5,3423.0,13.0,169.0,15.865709,15.850553,15.244129
max,6.500,1.0,214.0,45796.0,1.0,1.0,95.0,9025.0,87.00,7569.00,...,2.500,6.25000,0.0,66.0,4356.0,17.0,289.0,16.418200,16.418200,16.418200
