# Players dataframes for the same period t

En este script nos dedicaremos a crear una base de datos limpia segmentada por hitters y pitchers. Se exportarán dichas bases de datos contemplando respectivamente a los jugadores que son agentes libres, a los que no son y a todos los jugadores. Las secciones dentro del script son:

- **Visualización del contenido de las bases de datos.**
- **Limpieza de la base de datos y exportación.**
- **Creación de indicador de si el jugador es agente libre.**

Importemos los modulos necesarios así como especificar la configuración deseada.

In [1]:
import pandas as pd
import numpy as np
import math
import os
import warnings
print('Modulos importados')

Modulos importados


In [2]:
# Configuraciones
warnings.filterwarnings('ignore')
# Reduzcamos el número de línea a leer
pd.options.display.max_rows = 5

In [3]:
# Directorio de trabajo
print("Directorio de trabajo previo: " + str(os.getcwd()))
# Cambiemoslo
os.chdir('/home/usuario/Documentos/Github/Proyectos/MLB_HN/')

Directorio de trabajo previo: /home/usuario/Documentos/Github/Proyectos/MLB_HN/ETL_Scripts/First_Year_Contract


In [4]:
# Veamos el directorio actual de trabajo
print(os.getcwd())
# El directorio anterior es el correcto, pero si no lo fuese, hacemos lo sigueinte:
path = '/home/usuario/Documentos/Github/Proyectos/MLB_HN'
print("Nuevo directorio de trabajo: " + str(os.chdir(path)))

/home/usuario/Documentos/Github/Proyectos/MLB_HN
Nuevo directorio de trabajo: None


## Visualización de las bases de datos

Basta con ver el contenido de las base de datos de un año para observar qué variables contienen. Escojamos el año 2012.

A continuación, se mostrará el contenido de las distintas bases de datos sobre los *bateadores*, *pitchers*, *salarios de los agentes libres* y *salarios de los todos los jugadores*. Esto para determinar el proceso de limpieza que se llevará a cabo.

In [5]:
# Rutas de los archivos del año 2012
free_agents_2012 = 'Data/Free_Agents/free_agents_2012.csv'
hitting_2012 = 'Data/Not_All_Variables/Statistics/Hitting/hitting_2012.csv'
pitching_2012 = 'Data/Not_All_Variables/Statistics/Pitching/pitching_2012.csv'
salary_2012 = 'Data/Not_All_Variables/Salary/salary_2012.csv'
teams_etl_2012 = 'ETL_Data/Agent/Teams/free_agents_team_2012.csv'

# Importando los dataframes
df_free_agent_auxiliar_2012 = pd.read_csv(free_agents_2012)
df_hitting_auxiliar_2012 = pd.read_csv(hitting_2012)
df_pitching_auxiliar_2012 = pd.read_csv(pitching_2012)
df_salary_auxiliar_2012 = pd.read_csv(salary_2012)
df_teams_etl_2012 = pd.read_csv(teams_etl_2012)

### Agentes libres

Veamos primero el dataframe

In [6]:
df_free_agent_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Year,Pos,Status,Team From,Team From To,YRS,Value,AAV
0,1,Albert Pujols,2012,DH,UFA,STL,LAA,10,"$240,000,000","$24,000,000"
1,2,Prince Fielder,2012,DH,UFA,MIL,DET,9,"$214,000,000","$23,777,778"
2,3,Jose Reyes,2012,SS,UFA,NYM,MIA,6,"$106,000,000","$17,666,667"
3,4,C.J. Wilson,2012,SP,UFA,TEX,LAA,5,"$77,500,000","$15,500,000"
4,5,Mark Buehrle,2012,SP,UFA,CHW,MIA,4,"$58,000,000","$14,500,000"


### Hitters

Veamos el dataframe

In [7]:
df_hitting_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Pos,Team,GP,GP%,GS,GS%,AB,H,...,HR,RBI,AVG,OBP,SLG,OPS,WAR,TVS,Cash2023,Unnamed: 21
0,,Derek Jeter,SS,NYY,159,0.982,158,0.975,683,216,...,15,58,0.316,0.362,0.429,0.791,2.16,23.61,$0,
1,,Miguel Cabrera,1B,DET,161,0.994,161,0.994,622,205,...,44,139,0.33,0.393,0.606,0.999,7.14,96.42,"$32,000,000",
2,,Robinson Cano,2B,NYY,161,0.994,159,0.982,627,196,...,33,94,0.313,0.379,0.55,0.929,8.44,98.76,$0,
3,,Everth Cabrera,SS,SD,230,0.71,218,0.673,796,196,...,4,48,0.246,0.324,0.324,0.648,3.56,82.66,$0,
4,,Adrian Beltre,3B,TEX,156,0.963,152,0.938,604,194,...,36,102,0.321,0.359,0.561,0.921,7.24,88.47,$0,


In [8]:
df_hitting_auxiliar_2012.columns

Index(['Rank', 'Player', 'Pos', 'Team', 'GP', 'GP%', 'GS', 'GS%', 'AB', 'H',
       '2B', '3B', 'HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS', 'WAR', 'TVS',
       'Cash2023', 'Unnamed: 21'],
      dtype='object')

Los términos en la base de datos no se traducirán para evitar malentendidos en la traducción.

- **Pos**: Player position.
- **Team**: Team acronym.
- **GP**: Games played.
- **GP%**: Games played %.
- **AB**: At bats.
- **H**: Hitting.
- **HR**: Home runs.
- **RBI**: Runs batted in.
- **AVG**: Batting average.
- **OPS**: Onebase plus slugging%.

Se omitirá la columna *Cash2022* puesto que no es de interés para el trabajo el valor del jugador en la actualidad puesto que hay agentes libres que ya se han retirado en años posteriores.

## Pitchers

In [9]:
df_pitching_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Pos,Team,GP,GS,IP,H,R,ER,...,SO,W,L,SV,WHIP,ERA,WAR,TVS,Cash2023,Unnamed: 20
0,,R.A. Dickey,SP,NYM,34,33,233.7,192,78,71,...,230,20,6,0,1.05,2.74,5.69,97.27,$0,
1,,Felix Hernandez,SP,SEA,33,33,232.0,209,84,79,...,223,13,9,0,1.14,3.07,5.26,85.2,$0,
2,,James Shields,SP,TB,33,33,227.7,209,103,89,...,223,15,10,0,1.17,3.52,2.54,79.41,$0,
3,,Clayton Kershaw,SP,LAD,34,33,227.7,170,70,64,...,229,14,9,0,1.02,2.53,6.43,95.71,"$20,000,000",
4,,Hiroki Kuroda,SP,NYY,33,33,219.7,205,86,81,...,167,16,11,0,1.16,3.32,5.27,81.85,$0,


In [10]:
df_pitching_auxiliar_2012.columns

Index(['Rank', 'Player', 'Pos', 'Team', 'GP', 'GS', 'IP', 'H', 'R', 'ER', 'BB',
       'SO', 'W', 'L', 'SV', 'WHIP', 'ERA', 'WAR', 'TVS', 'Cash2023',
       'Unnamed: 20'],
      dtype='object')

#### Notación.

Veamos a qué se refieren algunos términos

- **Pos**: Player position.
- **Team**: Team acronym.
- **GP**: Games played.
- **GS**: Games started.
- **IP**: Inning pitched.
- **H**: Hits.
- **R**: Runs.
- **ER**: Earned runs.
- **BB**: Walks.
- **SO**: Strikeouts.
- **W**: Wins.
- **L**: Losses-
- **SV**: Saves.
- **WHIP**: WHIP.
- **ERA**: Earned runs average.

Por razones análogas, se descartará la columna *Cash2022*.

### Salarios
En este caso, hay muchas menos variables que en las anteriores bases de datos

In [11]:
df_salary_auxiliar_2012.head()

Unnamed: 0,Rank,Player,Year,Pos,Team,BaseSalary,SigningBonus,Payroll Salary,Adj Salary,Salary%,...,AAV,CONT YR,CONT VALUE,Earnings,FA Year,Sign Age,Age,Weight,Height,Unnamed: 20
0,,Alex Rodriguez,2012,DH,NYY,"$29,000,000","$1,000,000","$30,000,000","$30,000,000",0.132,...,"$27,500,000",10,"$275,000,000","$321,290,700",2018,32,37,230,"0'0""",
1,,C.C. Sabathia,2012,SP,NYY,"$23,000,000","$1,285,714","$24,285,714","$24,285,714",0.107,...,"$23,000,000",7,"$161,000,000","$127,285,714",2016,28,31,300,"6'6""",
2,,Vernon Wells,2012,LF,LAA,"$21,000,000","$3,187,500","$24,187,500","$24,187,500",0.154,...,"$18,000,000",7,"$126,000,000","$102,521,000",2015,29,33,0,"0'0""",
3,,Johan Santana,2012,SP,NYM,"$24,000,000",$0,"$24,000,000","$24,000,000",0.228,...,"$22,916,667",6,"$137,500,000","$148,560,000",2014,28,33,155,"6'0""",
4,,Mark Teixeira,2012,1B,NYY,"$22,500,000","$625,000","$23,125,000","$23,125,000",0.101,...,"$22,500,000",8,"$180,000,000","$127,650,000",2017,28,32,225,"6'3""",


In [12]:
df_salary_auxiliar_2012.columns

Index(['Rank', 'Player', 'Year', 'Pos', 'Team', 'BaseSalary', 'SigningBonus',
       'Payroll Salary', 'Adj Salary', 'Salary%', 'Cash', 'AAV', 'CONT YR',
       'CONT VALUE', 'Earnings', 'FA Year', 'Sign Age', 'Age', 'Weight',
       'Height', 'Unnamed: 20'],
      dtype='object')

- **BaseSalary**: A base salary is the minimum amount you can expect to earn in exchange for your time or services. This is the amount earned before benefits, bonuses, or compensation is added.
- **Payroll Salary**: Payroll is the compensation a business must pay to its employees for a set period and on a given date plus signing bonus
- **Adj Salary**: Adjusted Salary means the regular salary, wages and commissions, if any, payable to a Participant by the Company for the Participant's service, excluding any bonuses or other compensation.

### Teams ETL

Esta base de datos sobre los equipos es bajo el proceso ETL

In [13]:
df_teams_etl_2012.head()

Unnamed: 0,Equipo,Cantidad_agentes_libres,Valor_contrato_total,Acronimo,Victorias,Juegos totales,Playoffs,Pennants won,WS ganadas,Promedio_victorias
0,Los Angeles Angels,4,321150000,LAA,89,162,10,1,1,0.549383
1,Detroit Tigers,3,221000000,DET,88,162,14,11,4,0.54321
2,Miami Marlins,6,203300000,MIA,69,162,2,2,2,0.425926
3,Philadelphia Phillies,7,57650000,PHI,81,162,14,7,2,0.5
4,Los Angeles Dodgers,9,44651311,LAD,86,162,26,25,6,0.530864


### Equipos por estado

In [14]:
states = 'Data/Teams/team_states.csv'
df_states = pd.read_csv(states)

In [15]:
df_states.head()

Unnamed: 0,Estado,Cantidad de equipos
0,Alabama,0
1,Alaska,0
2,Arizona,1
3,Arkansas,0
4,California,5


### Acrónimos

Nos servirá como llave intermedia para unificar las bases de datos de los equipos

In [16]:
acronym = 'Data/Teams/team_acronym.csv'
df_acronym = pd.read_csv(acronym)

In [17]:
df_acronym.head()

Unnamed: 0,Equipo,Acronimo,Estado
0,Arizona Diamondbacks,ARI,Arizona
1,Atlanta Braves,ATL,Georgia
2,Baltimore Orioles,BAL,Maryland
3,Boston Red Sox,BOS,Massachusetts
4,Chicago Cubs,CHC,Illinois


Unamos esta dataframe con el de los equipos por estado

In [18]:
acronym_state = pd.merge(df_states, df_acronym, on = 'Estado')

In [19]:
acronym_state.head()

Unnamed: 0,Estado,Cantidad de equipos,Equipo,Acronimo
0,Arizona,1,Arizona Diamondbacks,ARI
1,California,5,Los Angeles Angels,LAA
2,California,5,Los Angeles Dodgers,LAD
3,California,5,Oakland Athletics,OAK
4,California,5,San Diego Padres,SD


En este caso, el nombre de las variables es claro

## Algoritmo para la creación de las bases de datos

A continuaicón, se optimizará el código para que se puedan obtener los *dataframes* anteriores para un cojuntos de datos de años secuenciales, como es nuestro caso

In [20]:
# Auxiliares:
free_agents = 'Data/Free_Agents/free_agents_'
hitting = 'Data/Not_All_Variables/Statistics/Hitting/hitting_'
pitching = 'Data/Not_All_Variables/Statistics/Pitching/pitching_'
salary = 'Data/Not_All_Variables/Salary/salary_'
teams = 'ETL_Data/Agent/Teams/free_agents_team_'
csv = '.csv'
period = 12
# Originales:
df_free_agents = [None]*period
df_hitting = [None]*period
df_pitching = [None]*period
df_salary = [None]*period
df_teams = [None]*period
# Copias:
df_free_agents_copy = [None]*period
df_hitting_copy = [None]*period
df_pitching_copy = [None]*period
df_salary_copy = [None]*period
df_teams_copy = [None]*period
# Producto final:
df_pitchers = [None]*period
df_hitters = [None]*period
df_pitchers_free_agents = [None]*period
df_hitters_free_agents = [None]*period
df_pitchers_no_free_agents = [None]*period
df_hitters_no_free_agents = [None]*period
df_panel_hitters = [None]*period
df_panel_pitchers = [None]*period

Leamos todos los archivos y creemos las copias

In [21]:
for year in range(0,period):    
    df_free_agents[year] = pd.read_csv(free_agents + str(2011 + year) + csv)
    df_hitting[year] = pd.read_csv(hitting + str(2011 + year) + csv)
    df_pitching[year] = pd.read_csv(pitching + str(2011 + year) + csv)
    df_salary[year] = pd.read_csv(salary + str(2011 + year) + csv)
    df_teams[year] = pd.read_csv(teams + str(2011 + year) + csv)
    
    df_free_agents_copy[year] = df_free_agents[year].copy()
    df_hitting_copy[year] = df_hitting[year].copy()
    df_pitching_copy[year] = df_pitching[year].copy()
    df_salary_copy[year] = df_salary[year].copy()
    df_teams_copy[year] = pd.read_csv(teams + str(2011 + year) + csv)

Tratemos las bases de datos por separado. Sin embargo, a todas les quitaremos la columna de rango y *Cash2023*.

Como no queremos que se repita la columna del año de la temporada de la base de datos, borremos la columna de *Year* de la base  de datos de los agentes libres. Como los años del contrato aparecen en la base de datos sobre los salarios, se prefiere conservar dicha columna en la base de datos de salarios puesto que esta base de datos es más general que la de los agentes libres, razón por la que se borrará de esta última base de datos.

In [22]:
for year in range(0,period):
    # Drop columns:
    if any(name in df_free_agents_copy[year].columns for name in ['Rank','Year','YRS']):
        df_free_agents_copy[year].drop('Rank', axis = 1, inplace = True)
        df_free_agents_copy[year].drop('Year', axis = 1, inplace = True)
        df_free_agents_copy[year].drop('YRS', axis = 1, inplace = True)
    if 'Rank' in df_salary_copy[year].columns:
        df_salary_copy[year].drop('Rank', axis = 1, inplace = True)
    if any(name in df_hitting_copy[year].columns for name in ['Rank','Cash2023']):
        df_hitting_copy[year].drop('Rank', axis = 1, inplace = True)
        df_hitting_copy[year].drop('Cash2023', axis = 1, inplace = True)
    if any(name in df_pitching_copy[year].columns for name in ['Rank','Cash2023']):
        df_pitching_copy[year].drop('Rank', axis = 1, inplace = True)
        df_pitching_copy[year].drop('Cash2023', axis = 1, inplace = True)

Debido a que aparecen columnas que inician con el  nombre *Unnamed*, tendremos que borrarlas con algún método general, el cual se muestra a continuación:

In [23]:
for year in range(0,period):
    # Base de datos de agentes libres:
    df_free_agents_copy[year].drop(df_free_agents_copy[year].columns[df_free_agents_copy[year].columns.str.contains('Unnamed',case = False)],axis = 1, inplace = True)
    
    # Base de datos de los salarios:
    df_salary_copy[year].drop(df_salary_copy[year].columns[df_salary_copy[year].columns.str.contains('Unnamed',case = False)],axis = 1, inplace = True)
    
    # Base de datos de los hitters:
    df_hitting_copy[year].drop(df_hitting_copy[year].columns[df_hitting_copy[year].columns.str.contains('Unnamed',case = False)],axis = 1, inplace = True)
    
    # Base de datos de los pitchers:
    df_pitching_copy[year].drop(df_pitching_copy[year].columns[df_pitching_copy[year].columns.str.contains('Unnamed',case = False)],axis = 1, inplace = True)

Verifiquemos que ya no se encuentran dichas columnas molestas

In [24]:
df_free_agents_copy[9].columns

Index(['Player', 'Pos', 'Status', 'Team From', 'Team From To', 'Value', 'AAV'], dtype='object')

In [25]:
df_salary_copy[11].columns

Index(['Player', 'Year', 'Pos', 'Team', 'BaseSalary', 'SigningBonus',
       'Payroll Salary', 'Adj Salary', 'Salary%', 'Cash', 'AAV', 'CONT YR',
       'CONT VALUE', 'Earnings', 'FA Year', 'Sign Age', 'Age', 'Weight',
       'Height'],
      dtype='object')

In [26]:
df_hitting_copy[2].columns

Index(['Player', 'Pos', 'Team', 'GP', 'GP%', 'GS', 'GS%', 'AB', 'H', '2B',
       '3B', 'HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS', 'WAR', 'TVS'],
      dtype='object')

In [27]:
df_pitching_copy[5].columns

Index(['Player', 'Pos', 'Team', 'GP', 'GS', 'IP', 'H', 'R', 'ER', 'BB', 'SO',
       'W', 'L', 'SV', 'WHIP', 'ERA', 'WAR', 'TVS'],
      dtype='object')

#### Agentes libres

No se conservará el equipo al que es contratado el agente libre puesto que esta información también la contiene la base de datos que facilita más el tratamiento _ETL_.

In [28]:
for year in range(0,period):
    df_free_agents_copy[year] = df_free_agents_copy[year].rename(columns = {'Player':'Jugador',
                            'Status':'Status_agente_libre', 'Team From':'Equipo_anterior',
                            'Value':'Valor_contrato', 'AAV':'Valor_promedio_contrato'})
    
    free_agents_aux_1 = df_free_agents_copy[year]['Valor_contrato'].str.replace("$","")
    free_agents_aux_2 = free_agents_aux_1.str.replace(",","")
    free_agents_aux_3 = df_free_agents_copy[year]['Valor_promedio_contrato'].str.replace("$","")
    free_agents_aux_4 = free_agents_aux_3.str.replace(",","")
    df_free_agents_copy[year]['Valor_contrato'] = free_agents_aux_2
    df_free_agents_copy[year]['Valor_promedio_contrato'] = free_agents_aux_4
    
    df_free_agents_copy[year]['Valor_contrato'] = pd.to_numeric(df_free_agents_copy[year]['Valor_contrato'])
    df_free_agents_copy[year]['Valor_promedio_contrato'] = pd.to_numeric(df_free_agents_copy[year]['Valor_promedio_contrato'])

Observemos las dimensiones de las bases de datos como referencia

In [34]:
for year in range(0,period):
    print(df_free_agents_copy[year].shape)

(1, 7)
(108, 7)
(213, 7)
(208, 7)
(221, 7)
(241, 7)
(100, 7)
(98, 7)
(105, 7)
(118, 7)
(141, 7)
(137, 7)


#### Salarios

Como los salarios irán con las bases de datos de los _hitters_ y _pitchers_ es que se hará su proceso _ETL_ antes.

In [29]:
for year in range(0,period):
    df_salary_copy[i] = df_salary_copy[i][['Player', 'Pos', 'Team', 'BaseSalary', 'SigningBonus',
                                           'Payroll Salary', 'Adj Salary', 'CONT YR', 'CONT VALUE', 'Earnings',
                                           'FA Year', 'Sign Age']]
    df_salary_names = ['Jugador', 'Posicion', 'Equipo', 'Sueldo_base', 'Bono_por_firma',
                       'Sueldo_regular', 'Sueldo_ajustado', 'Anios_de_contrato', 'Valor_del_contrato', 'Ganancias',
                       'Anio_de_agente_libre', 'Edad_al_firmar']
    df_salary_copy[i].columns = df_salary_names
    
    salary_aux_1 = df_salary_copy[year]['Sueldo_base'].str.replace("$","")
    salary_aux_2 = salary_aux_1.str.replace(",","")
    df_salary_copy[year]['Sueldo_base'] = salary_aux_2
    df_salary_copy[year]['Sueldo_base'] = pd.to_numeric(df_salary_copy[year]['Sueldo_base'])
    
    salary_aux_3 = df_salary_copy[i]['Sueldo_regular'].str.replace("$","")
    salary_aux_4 = salary_aux_3.str.replace(",","")
    df_salary_copy[year]['Sueldo_regular'] = salary_aux_4
    df_salary_copy[year]['Sueldo_regular'] = pd.to_numeric(df_salary_copy[year]['Sueldo_regular'])
    
    salary_aux_5 = df_salary_copy[year]['Sueldo_ajustado'].str.replace("$","")
    salary_aux_6 = salary_aux_5.str.replace(",","")
    df_salary_copy[year]['Sueldo_ajustado'] = salary_aux_6
    df_salary_copy[year]['Sueldo_ajustado'] = pd.to_numeric(df_salary_copy[year]['Sueldo_ajustado'])
    
    salary_aux_7 = df_salary_copy[year]['Valor_del_contrato'].str.replace("$","")
    salary_aux_8 = salary_aux_7.str.replace(",","")
    df_salary_copy[year]['Valor_del_contrato'] = salary_aux_8
    df_salary_copy[year]['Valor_del_contrato'] = pd.to_numeric(df_salary_copy[year]['Valor_del_contrato'])
    
    salary_aux_9 = df_salary_copy[year]['Bono_por_firma'].str.replace("$","")
    salary_aux_10 = salary_aux_9.str.replace(",","")
    df_salary_copy[year]['Bono_por_firma'] = salary_aux_10
    df_salary_copy[year]['Bono_por_firma'] = pd.to_numeric(df_salary_copy[year]['Bono_por_firma'])
    
    salary_aux_11 = df_salary_copy[year]['Ganancias'].str.replace("$","")
    salary_aux_12 = salary_aux_11.str.replace(",","")
    df_salary_copy[year]['Ganancias'] = salary_aux_12
    df_salary_copy[year]['Ganancias'] = pd.to_numeric(df_salary_copy[year]['Ganancias'])

KeyError: "['Sign Year'] not in index"

#### Hitters

In [None]:
for i in range(0,period):
    df_hitting_copy[i] = df_hitting_copy[i][['Player', 'Pos', 'GP', 'GP%', 'AB', 'H',
                                             'HR', 'RBI', 'AVG', 'OPS', 'WAR', 'TVS',
                                             'Age', 'Weight', 'Height']]
    df_hitting_names = ['Jugador', 'Posicion', 'Juegos', 'Porcetnaje_juegos', 'At-bats',
                        'Bateos', 'Home-runs', 'RBI', 'Porcentaje_bateo', 'OPS',
                        'WAR', 'TVS', 'Edad', 'Peso', 'Altura']
    df_hitting_copy[i].columns = df_hitting_names
    
    hitting_aux_1 = df_hitting_copy[i]['Altura'].str.replace("\"","")
    hitting_aux_2 = hitting_aux_1.str.replace("'","")
    df_hitting_copy[i]['Altura'] = hitting_aux_2
    df_hitting_copy[i]['Altura'] = pd.to_numeric(df_hitting_copy[i]['Altura'])/10
    
    df_hitters[i] = pd.merge(df_hitting_copy[i], df_salary_copy[i], on = 'Jugador')

    df_hitters[i] = df_hitters[i].rename(columns = {'Equipo':'Acronimo'})

#### Pitchers

In [None]:
for i in range(0,period):    
    df_pitching_copy[i] = df_pitching_copy[i][['Player', 'Pos', 'GP', 'GS', 'IP', 'H', 
                                               'R', 'ER', 'BB', 'SO', 'W', 'L', 'SV', 
                                               'WHIP', 'ERA', 'WAR', 'TVS', 'Age',
                                               'Weight', 'Height']]
    df_pitching_names = ['Jugador', 'Posicion', 'Juegos', 'Juegos_iniciados', 'Inning_pitched', 'Bateos_pitcher',
                         'Carreras', 'Carreras_ganadas', 'Walks', 'Strike-outs', 'Wins', 'Losses',
                         'Saves', 'WHIP', 'ERA', 'WAR', 'TVS', 'Edad', 'Peso', 'Altura']
    df_pitching_copy[i].columns = df_pitching_names    
    
    pitching_aux_1 = df_pitching_copy[i]['Altura'].str.replace("\"","")
    pitching_aux_2 = pitching_aux_1.str.replace("'","")
    df_pitching_copy[i]['Altura'] = pitching_aux_2
    df_pitching_copy[i]['Altura'] = pd.to_numeric(df_pitching_copy[i]['Altura'])/10

    df_pitchers[i] = pd.merge(df_pitching_copy[i], df_salary_copy[i], on = 'Jugador')
    
    df_pitchers[i] = df_pitchers[i].rename(columns = {'Equipo':'Acronimo'})

Debido a que la mayoría de los jugadores juega tanto en la ofensiva como la defensiva es que tenemos que borrar los duplicados de la columna de la posición.

In [None]:
for i in range(0,period):
    # Drop 'Posición_y' columns:
    if 'Posicion_y' in df_hitters[i].columns:
        df_hitters[i].drop('Posicion_y', axis = 1, inplace = True)
    
    if 'Posicion_y' in df_pitchers[i].columns:
        df_pitchers[i].drop('Posicion_y', axis = 1, inplace = True)
        
    # Cambiando nombre de 'Posicion_x':
    if 'Posicion_x' in df_hitters[i].columns:
        df_hitters[i] = df_hitters[i].rename(columns = {'Posicion_x':'Posicion'})
    
    if 'Posicion_x' in df_pitchers[i].columns:
        df_pitchers[i] = df_pitchers[i].rename(columns = {'Posicion_x':'Posicion'})

## Agregación de variables sugeridas por artículos

Las primeras variables que agregaremos son el cuadrado de todas las estadísticas deportivas, así como las siguientes variables:

- DOMINANCE = $Strike-outs/(9*Inning \; Pitched)$
- CONTROL = $Walks/(9*Inning \; Pitched)$
- COMMAND = $Strike-outs/Walks$

In [None]:
df_hitters[2].head()

In [None]:
df_hitters[2].columns

In [None]:
df_pitchers[2].head()

In [None]:
for i in range(0,period):
    df_pitchers[i]['Dominio'] = df_pitchers[i]['Strike-outs']/(9*df_pitchers[i]['Inning_pitched'])
    df_pitchers[i]['Control'] = df_pitchers[i]['Walks']/(9*df_pitchers[i]['Inning_pitched'])
    df_pitchers[i]['Comando'] = df_pitchers[i]['Strike-outs']/df_pitchers[i]['Walks']

In [None]:
df_pitchers[2].head()

In [None]:
df_pitchers[2].columns

Con el objetivo de hacer más eficiente la creación de las variables al cuadrado, lo haremos por índice

In [None]:
# Indiquemos las columnas que se usarán por medio de su índice
square_pitchers_index = list(range(2,17)) + [31,32,33]
square_hitters_index = list(range(2,12))

In [None]:
for i in range(0,period):
    for j in square_pitchers_index:
        df_pitchers[i][df_pitchers[i].columns[j] + '_2'] = np.power(df_pitchers[i][df_pitchers[i].columns[j]], 2)
    
    for k in square_hitters_index:
        df_hitters[i][df_hitters[i].columns[k] + '_2'] = np.power(df_hitters[i][df_hitters[i].columns[k]], 2)

Apreciemos el resultado final

In [None]:
df_pitchers[2].head()

In [None]:
df_pitchers[2].columns

In [None]:
df_hitters[7].head()

In [None]:
df_hitters[7].columns

Siguiendo la sugerencia de algunos artículos, obtengamos el logaritmo de los salarios

In [None]:
for year in range(0,period):
    df_hitters[year]['ln_Sueldo_base'] = np.log(df_hitters[year]['Sueldo_base'])
    df_hitters[year]['ln_Sueldo_ajustado'] = np.log(df_hitters[year]['Sueldo_ajustado'])
    df_hitters[year]['ln_Sueldo_regular'] = np.log(df_hitters[year]['Sueldo_regular'])
    df_hitters[year]['Anio'] = year + 1
    
    df_pitchers[year]['ln_Sueldo_base'] = np.log(df_pitchers[year]['Sueldo_base'])
    df_pitchers[year]['ln_Sueldo_ajustado'] = np.log(df_pitchers[year]['Sueldo_ajustado'])
    df_pitchers[year]['ln_Sueldo_regular'] = np.log(df_pitchers[year]['Sueldo_regular'])
    df_pitchers[year]['Anio'] = year + 1

### Datos agregados por equipo

Solo resta añadir los datos relevantes al equipo al que pertenece cada jugador considerando la base de datos de la cantidad de equipos por estado

In [None]:
df_teams_copy[2].head()

In [None]:
for i in range(0,period):
    df_teams_copy[i] = pd.merge(df_teams_copy[i], acronym_state, on = ['Equipo','Acronimo'])
    df_hitters[i] = pd.merge(df_teams_copy[i], df_hitters[i], on = 'Acronimo')
    df_pitchers[i] = pd.merge(df_teams_copy[i], df_pitchers[i], on = 'Acronimo')

## Segmentación por Agentes libres

Separaremos los pitchers y hitters en dos grupos:

- Agentes libres.
- No agentes libres.

In [None]:
for i in range(0,period):
    # Drop 'Posición_y' columns:
    if 'Posicion_y' in df_hitters[i].columns:
        df_hitters[i].drop('Posicion_y', axis = 1, inplace = True)
    
    if 'Posicion_y' in df_pitchers[i].columns:
        df_pitchers[i].drop('Posicion_y', axis = 1, inplace = True)
        
    # Cambiando nombre de 'Posicion_x':
    if 'Posicion_x' in df_hitters[i].columns:
        df_hitters[i] = df_hitters[i].rename(columns = {'Posicion_x':'Posicion'})
    
    if 'Posicion_x' in df_pitchers[i].columns:
        df_pitchers[i] = df_pitchers[i].rename(columns = {'Posicion_x':'Posicion'})

In [None]:
for i in range(0,period):    
    df_hitters_free_agents[i] = pd.merge(df_free_agents_copy[i], df_hitters[i], on = 'Jugador')
    df_pitchers_free_agents[i] = pd.merge(df_free_agents_copy[i], df_pitchers[i], on = 'Jugador')
    
    df_hitters_no_free_agents[i] = df_hitters[i][~df_hitters[i].Jugador.isin(df_hitters_free_agents[i].Jugador)]
    df_pitchers_no_free_agents[i] = df_pitchers[i][~df_pitchers[i].Jugador.isin(df_pitchers_free_agents[i].Jugador)]
    
    df_hitters_free_agents[i] = df_hitters_free_agents[i].reindex(sorted(df_hitters_free_agents[i].columns), axis=1)
    df_pitchers_free_agents[i] = df_pitchers_free_agents[i].reindex(sorted(df_pitchers_free_agents[i].columns), axis=1)
    df_hitters_no_free_agents[i] = df_hitters_no_free_agents[i].reindex(sorted(df_hitters_no_free_agents[i].columns), axis=1)
    df_pitchers_no_free_agents[i] = df_pitchers_no_free_agents[i].reindex(sorted(df_pitchers_no_free_agents[i].columns), axis=1)  
    
    # Drop 'Anio_y' columns:
    if 'Anio_x' in df_hitters_free_agents[i].columns:
        df_hitters_free_agents[i].drop('Anio_x', axis = 1, inplace = True)
    
    if 'Anio_x' in df_pitchers_free_agents[i].columns:
        df_pitchers_free_agents[i].drop('Anio_x', axis = 1, inplace = True)
        
    if 'Anio_y' in df_hitters_no_free_agents[i].columns:
        df_hitters_no_free_agents[i].drop('Anio_y', axis = 1, inplace = True)
        
    if 'Anio_y' in df_pitchers_no_free_agents[i].columns:
        df_pitchers_no_free_agents[i].drop('Anio_y', axis = 1, inplace = True)
        
    # Cambiando nombre de 'Anio_x':
    if 'Anio_y' in df_hitters_free_agents[i].columns:
        df_hitters_free_agents[i] = df_hitters_free_agents[i].rename(columns = {'Anio_y':'Anio'})
    
    if 'Anio_y' in df_pitchers_free_agents[i].columns:
        df_pitchers_free_agents[i] = df_pitchers_free_agents[i].rename(columns = {'Anio_y':'Anio'})
    
    if 'Anio_x' in df_hitters_no_free_agents[i].columns:
        df_hitters_no_free_agents[i] = df_hitters_no_free_agents[i].rename(columns = {'Anio_x':'Anio'})
    
    if 'Anio_x' in df_pitchers_no_free_agents[i].columns:
        df_pitchers_no_free_agents[i] = df_pitchers_no_free_agents[i].rename(columns = {'Anio_x':'Anio'})
    
    # Drop 'Anios_contrato' columns:
    if 'Anios_contrato' in df_hitters_free_agents[i].columns:
        df_hitters_free_agents[i].drop('Anios_contrato', axis = 1, inplace = True)
    
    if 'Anios_contrato' in df_pitchers_free_agents[i].columns:
        df_pitchers_free_agents[i].drop('Anios_contrato', axis = 1, inplace = True)
    
    # Transformación
    df_hitters_free_agents[i]['Anio'] = df_hitters_free_agents[i]['Anio'] + 2010
    df_hitters_free_agents[i]['Anio'] = df_hitters_free_agents[i]['Anio'].map(str)
    df_pitchers_free_agents[i]['Anio'] = df_pitchers_free_agents[i]['Anio'] + 2010
    df_pitchers_free_agents[i]['Anio'] = df_pitchers_free_agents[i]['Anio'].map(str)
    df_hitters_no_free_agents[i]['Anio'] = df_hitters_no_free_agents[i]['Anio'] + 2010
    df_hitters_no_free_agents[i]['Anio'] = df_hitters_no_free_agents[i]['Anio'].map(str)
    df_pitchers_no_free_agents[i]['Anio'] = df_pitchers_no_free_agents[i]['Anio'] + 2010
    df_pitchers_no_free_agents[i]['Anio'] = df_pitchers_no_free_agents[i]['Anio'].map(str)
    
    # Exportemos los dataframes por separado
    df_hitters_free_agents[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/Free_Agent/Hitters/free_agents_batters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers_free_agents[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/Free_Agent/Pitchers/free_agents_pitchers_' + str(2011 + i) + '.csv', index = False)
    df_hitters_no_free_agents[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/No_Free_Agent/Hitters/no_free_agents_batters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers_no_free_agents[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/No_Free_Agent/Pitchers/no_free_agents_pitchers_' + str(2011 + i) + '.csv', index = False)

In [None]:
# Algunos ejemplos
df_pitchers_no_free_agents[6].head()

In [None]:
df_hitters_no_free_agents[0].head()

In [None]:
df_pitchers_free_agents[9].head()

In [None]:
df_hitters_free_agents[8].head()

### Etiquetas para los agentes libres

Crearemos un etiqueta para indicar si el pitcher o hitter es  un agente libre o no.

In [None]:
for i in range(0,period):
    # Condiciones
    condicion_hitter = [df_hitters[i].Jugador.isin(df_hitters_free_agents[i].Jugador)]
    condicion_pitcher = [df_pitchers[i].Jugador.isin(df_pitchers_free_agents[i].Jugador)]
    
    # Etiquetas
    etiquetas = ['Si']
    
    df_hitters[i]['Agente libre'] = np.select(condicion_hitter, etiquetas, default = 'No')
    df_pitchers[i]['Agente libre'] = np.select(condicion_pitcher, etiquetas, default = 'No')
    
    df_hitters[i] = df_hitters[i].reindex(sorted(df_hitters[i].columns), axis=1)
    df_pitchers[i] = df_pitchers[i].reindex(sorted(df_pitchers[i].columns), axis=1)
    
    
    # Transformación
    df_hitters[i]['Anio'] = df_hitters[i]['Anio'] + 2010
    df_hitters[i]['Anio'] = df_hitters[i]['Anio'].map(str)
    df_pitchers[i]['Anio'] = df_pitchers[i]['Anio'] + 2010
    df_pitchers[i]['Anio'] = df_pitchers[i]['Anio'].map(str)
    
    # Exportemos los dataframes
    df_hitters[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/Hitters/All_Hitters/hitters_' + str(2011 + i) + '.csv', index = False)
    df_pitchers[i].to_csv('ETL_Data/Agent/First_Year_Contract/Period_t/Pitchers/All_Pitchers/pitchers_' + str(2011 + i) + '.csv', index = False)

In [None]:
df_hitters[10].head()

In [None]:
df_pitchers[9].head()

In [None]:
df_hitters[0].describe()

In [None]:
df_pitchers[0].describe()