In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from sklearn import linear_model
import statsmodels.formula.api as smf
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

## Import Data

In [2]:
data = pd.read_csv('Hitters.csv')
data.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
1,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475.0,N
2,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480.0,A
3,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500.0,N
4,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N


## Variable Descriptions

### **Player Performance (1986 Season):**
1. **`AtBat`**: Number of times the player was at bat (faced a pitcher) in 1986.  
2. **`Hits`**: Total hits (successful contact with the ball resulting in reaching base) in 1986.  
3. **`HmRun`**: Number of home runs (hits allowing the batter to circle all bases) in 1986.  
4. **`Runs`**: Total runs scored (times crossing home plate) in 1986.  
5. **`RBI`**: Runs Batted In (number of runs a player caused by their hits, including home runs) in 1986.  
6. **`Walks`**: Number of times the player was walked (awarded first base due to four balls) in 1986.  

---

### **Career Statistics (Cumulative Until 1986):**
7. **`CAtBat`**: Total times at bat during the player’s entire career.  
8. **`CHits`**: Total career hits.  
9. **`CHmRun`**: Total career home runs.  
10. **`CRuns`**: Total runs scored in the player’s career.  
11. **`CRBI`**: Total career Runs Batted In.  
12. **`CWalks`**: Total career walks.  

---

### **Player Demographics:**
13. **`Years`**: Number of years the player has been in the major leagues (experience).  

---

### **League/Division Affiliation:**
14. **`League`**: Player’s league at the **end of 1986** (`A` = American League, `N` = National League).  
15. **`Division`**: Player’s division at the **end of 1986** (`E` = East, `W` = West).  

---

### **Defensive Performance (1986 Season):**
16. **`PutOuts`**: Number of defensive plays where the player directly caused an out (e.g., catching a fly ball).  
17. **`Assists`**: Number of times the player contributed to an out made by another fielder.  
18. **`Errors`**: Number of fielding mistakes that benefited the opposing team.  

---

### **Salary:**
19. **`Salary`**: The player’s **1987 annual salary** (on opening day), measured in **thousands of dollars** (e.g., `500` = \$500,000). Missing values are marked as `NA`.  

---

### **Post-1986 League:**
20. **`NewLeague`**: Player’s league at the **beginning of 1987** (`A` or `N`). Indicates potential league changes after 1986.  

---

### Notes:
- **Career variables** (e.g., `CAtBat`, `CHits`) are cumulative totals up to and including 1986.  
- **Salary** is the target variable for prediction in many analyses.  
- **Factor variables** (`League`, `Division`, `NewLeague`) have categorical levels (e.g., `A/N`, `E/W`).  
- Missing values (`NA`, displayed as `NaN` by pandas) appear only in `Salary`: 59 of the 322 rows have no recorded salary.
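
The notes on factor variables and on `NewLeague` can be checked directly. The following is a minimal sketch, assuming a small frame (`sample`, a hypothetical name) rebuilt by hand from the five rows shown by `data.head()` above, with only a few columns kept. It lists the levels of each factor variable and flags players whose league changed between the end of 1986 and the start of 1987.

```python
import numpy as np
import pandas as pd

# Toy frame copied from the five rows displayed by data.head() above
# (a subset of columns; `sample` is an illustrative name, not part of the notebook).
sample = pd.DataFrame({
    "AtBat": [293, 315, 479, 496, 321],
    "League": ["A", "N", "A", "N", "N"],
    "Division": ["E", "W", "W", "E", "E"],
    "NewLeague": ["A", "N", "A", "N", "N"],
    "Salary": [np.nan, 475.0, 480.0, 500.0, 91.5],
})

# Levels observed for each factor variable
factor_cols = ["League", "Division", "NewLeague"]
levels = {col: sorted(sample[col].unique().tolist()) for col in factor_cols}
print(levels)

# A player changed league when League and NewLeague differ
changed = sample["League"] != sample["NewLeague"]
print("League changes in the sample:", int(changed.sum()))
```

On the full dataset the same comparison would be run on `data` before the dummy encoding, since that step replaces the original columns.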

## Data Analysis

There are three categorical variables: `League`, `Division`, and `NewLeague`. To use them in the models, we convert them to dummy variables.

In [3]:
data = pd.get_dummies(data, columns=['League','Division','NewLeague'])
target = 'Salary'
data.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,...,PutOuts,Assists,Errors,Salary,League_A,League_N,Division_E,Division_W,NewLeague_A,NewLeague_N
0,293,66,1,30,29,14,1,293,66,1,...,446,33,20,,True,False,True,False,True,False
1,315,81,7,24,38,39,14,3449,835,69,...,632,43,10,475.0,False,True,False,True,False,True
2,479,130,18,66,72,76,3,1624,457,63,...,880,82,14,480.0,True,False,False,True,True,False
3,496,141,20,65,78,37,11,5628,1575,225,...,200,11,3,500.0,False,True,True,False,False,True
4,321,87,10,39,42,30,2,396,101,12,...,805,40,4,91.5,False,True,True,False,False,True
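
The encoding above keeps both levels of every factor (for example `League_A` and `League_N`), so each pair of columns always sums to 1. For an unregularized linear regression with an intercept, that redundancy makes the columns perfectly collinear. One common option, sketched below on a toy frame (`toy` is a hypothetical name; the values are the factor columns of the five rows shown earlier), is to pass `drop_first=True` so that one level per factor serves as the baseline. This is an alternative to consider, not what the notebook does.

```python
import pandas as pd

# Factor columns of the five rows shown by data.head(); illustrative only.
toy = pd.DataFrame({
    "League": ["A", "N", "A", "N", "N"],
    "Division": ["E", "W", "W", "E", "E"],
    "NewLeague": ["A", "N", "A", "N", "N"],
})
cols = ["League", "Division", "NewLeague"]

# Encoding used in the notebook: one column per level (6 columns)
full = pd.get_dummies(toy, columns=cols)

# Alternative: drop the first level of each factor (3 columns)
reduced = pd.get_dummies(toy, columns=cols, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding, the two columns of a factor always sum to 1
pair_sum = full["League_A"].astype(int) + full["League_N"].astype(int)
print((pair_sum == 1).all())
```

With `drop_first=True`, a coefficient on `League_N` is read relative to the American League baseline.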


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 23 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   AtBat        322 non-null    int64  
 1   Hits         322 non-null    int64  
 2   HmRun        322 non-null    int64  
 3   Runs         322 non-null    int64  
 4   RBI          322 non-null    int64  
 5   Walks        322 non-null    int64  
 6   Years        322 non-null    int64  
 7   CAtBat       322 non-null    int64  
 8   CHits        322 non-null    int64  
 9   CHmRun       322 non-null    int64  
 10  CRuns        322 non-null    int64  
 11  CRBI         322 non-null    int64  
 12  CWalks       322 non-null    int64  
 13  PutOuts      322 non-null    int64  
 14  Assists      322 non-null    int64  
 15  Errors       322 non-null    int64  
 16  Salary       263 non-null    float64
 17  League_A     322 non-null    bool   
 18  League_N     322 non-null    bool   
 19  Division

The salary, which is our target variable, contains null values, so we count how many values are missing in each column.

In [5]:
pd.DataFrame(data.isnull().sum()).T

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,...,PutOuts,Assists,Errors,Salary,League_A,League_N,Division_E,Division_W,NewLeague_A,NewLeague_N
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,59,0,0,0,0,0,0
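
When a frame has many columns, the wide one-row table is hard to scan. A minimal sketch, assuming a toy frame (`toy`, a hypothetical name) with three of the columns from the first rows shown above, filters the counts so that only columns with at least one missing value remain.

```python
import numpy as np
import pandas as pd

# Three columns from the first rows of the dataset; illustrative only.
toy = pd.DataFrame({
    "AtBat": [293, 315, 479],
    "Hits": [66, 81, 130],
    "Salary": [np.nan, 475.0, 480.0],
})

# Count missing values per column, then keep only the non-zero counts
missing = toy.isna().sum()
only_missing = missing[missing > 0]
print(only_missing)
```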


The salary variable has 59 missing values. Because salary is the target, these rows cannot be used to fit or evaluate the models, so we remove them from the data.

In [7]:
# Keep only the rows where Salary is observed
data = data[data['Salary'].notnull()]
data.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,...,PutOuts,Assists,Errors,Salary,League_A,League_N,Division_E,Division_W,NewLeague_A,NewLeague_N
1,315,81,7,24,38,39,14,3449,835,69,...,632,43,10,475.0,False,True,False,True,False,True
2,479,130,18,66,72,76,3,1624,457,63,...,880,82,14,480.0,True,False,False,True,True,False
3,496,141,20,65,78,37,11,5628,1575,225,...,200,11,3,500.0,False,True,True,False,False,True
4,321,87,10,39,42,30,2,396,101,12,...,805,40,4,91.5,False,True,True,False,False,True
5,594,169,4,74,51,35,11,4408,1133,19,...,282,421,25,750.0,True,False,False,True,True,False
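
Note that the filtered frame keeps its original index labels, which is why the output above starts at 1 instead of 0. The sketch below, on a toy frame (`toy`, a hypothetical name) built from two columns of the first five rows, shows that `dropna(subset=...)` gives the same result as the boolean mask, and that `reset_index(drop=True)` renumbers the rows if a contiguous index is preferred. Resetting the index is optional and is not done in the notebook.

```python
import numpy as np
import pandas as pd

# Two columns from the first five rows of the dataset; illustrative only.
toy = pd.DataFrame({
    "AtBat": [293, 315, 479, 496, 321],
    "Salary": [np.nan, 475.0, 480.0, 500.0, 91.5],
})

# Boolean mask, as used in the notebook
masked = toy[toy["Salary"].notnull()]

# Equivalent call with dropna restricted to the Salary column
dropped = toy.dropna(subset=["Salary"])
print(masked.equals(dropped))

# Optional: renumber the rows from 0
clean = dropped.reset_index(drop=True)
print(clean)
```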
