---

## 🌐 Phase 1: Web Scraping

### 1.1 Baseball Reference Scraper (`baseball_scraper.py`)

**Source:** [Baseball Reference](https://www.baseball-reference.com)

**Technology Stack:**
- **Selenium WebDriver** - For browser automation (handles JavaScript-rendered content)
- **BeautifulSoup** - For HTML parsing
- **pandas** - For DataFrame creation and CSV export
- **webdriver-manager** - Automatic ChromeDriver management

**Key Features:**
- Headless browser mode for faster scraping
- Handles hidden tables (Baseball Reference stores some tables in HTML comments)
- Timeout management to prevent hanging on slow pages
- Anti-detection measures (custom user agent, disabled automation flags)
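
The hidden-table handling can be illustrated with a small, self-contained sketch. It is not the code in `baseball_scraper.py`: it uses only the standard library so it runs without the project's dependencies, and the sample markup, `extract_hidden_tables`, and `TableRowParser` are hypothetical names made up for illustration. The idea is to pull the body of each HTML comment out of the page source and parse any table found inside it.

```python
import re
from html.parser import HTMLParser

# Matches the body of every HTML comment, including multi-line ones.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


class TableRowParser(HTMLParser):
    """Collects the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_hidden_tables(page_html):
    """Return one list of rows per commented-out block that contains a table."""
    tables = []
    for comment in COMMENT_RE.findall(page_html):
        if "<table" not in comment:
            continue  # ordinary comment, nothing to parse
        parser = TableRowParser()
        parser.feed(comment)
        parser.close()
        tables.append(parser.rows)
    return tables


# Hypothetical page fragment; the real markup on the site will differ.
sample_html = """
<div id="all_fielding">
<!--
<table id="fielding">
  <tr><th>Tm</th><th>E</th></tr>
  <tr><td>Atlanta Braves</td><td>111</td></tr>
</table>
-->
</div>
"""

hidden_tables = extract_hidden_tables(sample_html)
print(hidden_tables)
```

With BeautifulSoup, which the scraper already depends on, the same step is usually done by filtering text nodes that are instances of `bs4.Comment` and re-parsing their contents.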

**Data Collected:**

| File | Description | Key Statistics |
|------|-------------|----------------|
| `Batting_YEAR.csv` | Team batting stats | BA, OBP, SLG, OPS, HR, RBI, R, etc. |
| `Pitching_YEAR.csv` | Team pitching stats | ERA, WHIP, SO, BB, W-L%, etc. |
| `Fielding_YEAR.csv` | Team fielding stats | Fld%, E, DP, DefEff, etc. |
| `Postseason_YEAR.csv` | Playoff results | Series winners/losers, scores |
| `WAA_Positions_YEAR.csv` | Wins Above Average by position | Position-specific WAA rankings |

### 1.2 Salary Scraper (`salary_scraper.py`)

**Source:** [SteveTheUmp.com](https://www.stevetheump.com/Payrolls.htm)

**Purpose:** Scrape historical team payroll data to analyze the relationship between team spending and performance.

**Key Features:**
- Parses multiple tables from a single page (one per year)
- Uses regex to identify year sections from headers
- Extracts team names and payroll amounts
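
As a rough illustration of the header-matching step, the sketch below pulls a four-digit season out of a heading string. The header texts and the `year_from_header` helper are assumptions for the example; the actual headings on the source page and the pattern used in `salary_scraper.py` may differ.

```python
import re

# Assumed header shape: a four-digit season somewhere in the heading text.
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")


def year_from_header(header_text):
    """Return the first four-digit year in a header, or None if there is none."""
    match = YEAR_RE.search(header_text)
    return int(match.group(1)) if match else None


# Hypothetical headers, for illustration only.
headers = ["1999 MLB Team Payrolls", "Payrolls for 2005 Season", "About this page"]
header_years = [year_from_header(h) for h in headers]
print(header_years)  # [1999, 2005, None]
```

Headers that return `None` can be skipped, so only tables that follow a recognizable year heading are written out as `Salaries_YEAR.csv`.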

**Data Collected:**

| File | Description | Key Statistics |
|------|-------------|----------------|
| `Salaries_YEAR.csv` | Team payroll data | Team name, Total payroll ($) |

---

## 🧹 Phase 2: Data Cleaning

### 2.1 Team Name Standardization (`data_cleaning.py`)

**Problem:** Team names and abbreviations vary across years and data sources due to:
- Team relocations (Montreal Expos → Washington Nationals)
- Team renamings (Cleveland Indians → Cleveland Guardians)
- Inconsistent abbreviations (CWS vs CHW for White Sox)

**Solution:** All team names are standardized to **2025 conventions**.

#### Historical Name Changes Handled:

| Old Name | New Name (2025 Standard) | Year Changed |
|----------|--------------------------|---------------|
| Montreal Expos | Washington Nationals | 2005 |
| Florida Marlins | Miami Marlins | 2012 |
| Cleveland Indians | Cleveland Guardians | 2022 |
| Oakland Athletics | Athletics | 2025 |
| Tampa Bay Devil Rays | Tampa Bay Rays | 2008 |
| Anaheim Angels | Los Angeles Angels | Various |
| California Angels | Los Angeles Angels | Various |

#### Abbreviation Mappings:

| Old Abbreviation | New (2025 Standard) |
|------------------|---------------------|
| OAK | ATH |
| CWS | CHW |
| FLA | MIA |
| MON | WSN |
| ANA/CAL | LAA |
| TBD | TBR |
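
One simple way to apply the two tables above is a pair of lookup dictionaries. The sketch below copies the mappings from the tables; the `standardize_team` helper and the dictionary names are hypothetical, and `data_cleaning.py` may organize this differently.

```python
# Full-name mappings, copied from the "Historical Name Changes" table.
TEAM_NAME_MAP = {
    "Montreal Expos": "Washington Nationals",
    "Florida Marlins": "Miami Marlins",
    "Cleveland Indians": "Cleveland Guardians",
    "Oakland Athletics": "Athletics",
    "Tampa Bay Devil Rays": "Tampa Bay Rays",
    "Anaheim Angels": "Los Angeles Angels",
    "California Angels": "Los Angeles Angels",
}

# Abbreviation mappings, copied from the "Abbreviation Mappings" table.
ABBREVIATION_MAP = {
    "OAK": "ATH",
    "CWS": "CHW",
    "FLA": "MIA",
    "MON": "WSN",
    "ANA": "LAA",
    "CAL": "LAA",
    "TBD": "TBR",
}


def standardize_team(label):
    """Map an old team name or abbreviation to its 2025 form.

    Labels that appear in neither table are returned unchanged.
    """
    label = label.strip()
    if label in TEAM_NAME_MAP:
        return TEAM_NAME_MAP[label]
    return ABBREVIATION_MAP.get(label, label)


print(standardize_team("Cleveland Indians"))  # Cleveland Guardians
print(standardize_team("TBD"))                # TBR
print(standardize_team("Boston Red Sox"))     # Boston Red Sox
```

Returning unknown labels unchanged keeps current names intact, at the cost of silently passing through any spelling variant that is missing from the tables.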

### 2.2 Salary Data Cleaning (`salary_cleaning.py`)

**Tasks Performed:**
1. **Identify correct payroll column** - Some source tables have multiple numeric columns
2. **Remove aggregate rows** - Filter out average/median salary rows
3. **Clean currency formatting** - Remove `$`, commas, and `M` suffixes
4. **Standardize team names** - Apply same mappings as main data cleaning

**Before Cleaning:**
```
Oakland Athletics, $45,500,000
```

**After Cleaning:**
```
Athletics, 45500000
```
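
Tasks 2 and 3 can be sketched as follows. This is a minimal illustration, not the code in `salary_cleaning.py`: the helper names are hypothetical, the `M` suffix is assumed to mean millions, and aggregate rows are assumed to be recognizable by the words "average" or "median" in the team column. Team name standardization (task 4) is left out here.

```python
from decimal import Decimal

# Assumption: aggregate rows are labeled with one of these words.
AGGREGATE_LABELS = ("average", "median")


def parse_payroll(raw):
    """Convert strings like '$45,500,000' or '$45.5M' to whole dollars."""
    text = raw.strip().replace("$", "").replace(",", "")
    multiplier = 1
    if text.upper().endswith("M"):
        text = text[:-1]
        multiplier = 1_000_000  # assumed meaning of the M suffix
    # Decimal avoids floating-point rounding surprises on values like 48.1
    return int(Decimal(text) * multiplier)


def is_aggregate_row(team_label):
    """True for summary rows such as 'Average' or 'Median salary'."""
    lowered = team_label.lower()
    return any(word in lowered for word in AGGREGATE_LABELS)


# The first row is the example above; the other two are made up.
raw_rows = [
    ("Oakland Athletics", "$45,500,000"),
    ("Average", "$48.1M"),
    ("Texas Rangers", "$81.5M"),
]

cleaned = [
    (team, parse_payroll(payroll))
    for team, payroll in raw_rows
    if not is_aggregate_row(team)
]
print(cleaned)
```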

---

## 📊 Data Overview

Let's examine the structure and sample data from each file type.

In [2]:
import pandas as pd
import os
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Base data path
DATA_PATH = Path('Data')

# Example year to display
SAMPLE_YEAR = 1999

### Batting Statistics

In [3]:
# Load batting data
batting = pd.read_csv(DATA_PATH / str(SAMPLE_YEAR) / f'Batting_{SAMPLE_YEAR}.csv')
print(f"Batting Statistics ({SAMPLE_YEAR})")
print(f"Shape: {batting.shape[0]} teams × {batting.shape[1]} statistics\n")
print("Columns:", list(batting.columns))
batting.head()

Batting Statistics (1999)
Shape: 31 teams × 29 statistics

Columns: ['Tm', '#Bat', 'BatAge', 'R/G', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB', 'LOB']


Unnamed: 0,Tm,#Bat,BatAge,R/G,G,PA,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,BA,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB,LOB
0,Anaheim Angels,45,28.6,4.39,162,6132,5494,711,1404,248,22,158,673,71,45,511,1022,0.256,0.322,0.395,0.716,83,2170,135,43,41,42,24,1097
1,Arizona Diamondbacks,43,30.0,5.6,162,6415,5658,908,1566,289,46,216,865,137,39,588,1045,0.277,0.347,0.459,0.805,101,2595,94,48,61,60,52,1169
2,Atlanta Braves,44,29.6,5.19,162,6351,5569,840,1481,309,23,197,791,148,66,608,962,0.266,0.341,0.436,0.777,96,2427,120,53,74,47,62,1155
3,Baltimore Orioles,43,32.5,5.25,162,6409,5637,851,1572,299,21,203,804,107,46,615,890,0.279,0.353,0.447,0.8,108,2522,146,61,41,55,34,1241
4,Boston Red Sox,48,28.8,5.16,162,6321,5579,836,1551,334,42,176,808,67,39,597,928,0.278,0.35,0.448,0.798,99,2497,131,55,34,56,27,1213


### Pitching Statistics

In [4]:
# Load pitching data
pitching = pd.read_csv(DATA_PATH / str(SAMPLE_YEAR) / f'Pitching_{SAMPLE_YEAR}.csv')
print(f"Pitching Statistics ({SAMPLE_YEAR})")
print(f"Shape: {pitching.shape[0]} teams × {pitching.shape[1]} statistics\n")
print("Columns:", list(pitching.columns))
pitching.head()

Pitching Statistics (1999)
Shape: 31 teams × 36 statistics

Columns: ['Tm', '#P', 'PAge', 'RA/G', 'W', 'L', 'W-L%', 'ERA', 'G', 'GS', 'GF', 'CG', 'tSho', 'cSho', 'SV', 'IP', 'H', 'R', 'ER', 'HR', 'BB', 'IBB', 'SO', 'HBP', 'BK', 'WP', 'BF', 'ERA+', 'FIP', 'WHIP', 'H9', 'HR9', 'BB9', 'SO9', 'SO/W', 'LOB']


Unnamed: 0,Tm,#P,PAge,RA/G,W,L,W-L%,ERA,G,GS,GF,CG,tSho,cSho,SV,IP,H,R,ER,HR,BB,IBB,SO,HBP,BK,WP,BF,ERA+,FIP,WHIP,H9,HR9,BB9,SO9,SO/W,LOB
0,Anaheim Angels,20,31.6,5.1,70,92,0.432,4.79,162,162,158,4,7,0,37,1431.1,1472,826,762,177,624,17,877,56,5,65,6258,101,4.94,1.464,9.3,1.1,3.9,5.5,1.41,1138
1,Arizona Diamondbacks,20,30.6,4.17,100,62,0.617,3.77,162,162,146,16,9,4,42,1467.1,1387,676,615,176,543,48,1198,49,10,39,6233,122,4.27,1.315,8.5,1.1,3.3,7.3,2.21,1155
2,Atlanta Braves,22,28.6,4.08,103,59,0.636,3.63,162,162,153,9,9,1,45,1471.0,1398,661,593,142,507,55,1197,26,3,34,6218,123,3.85,1.295,8.6,0.9,3.1,7.3,2.36,1144
3,Baltimore Orioles,21,30.1,5.03,78,84,0.481,4.77,162,162,145,17,11,4,33,1435.0,1468,815,760,198,647,34,982,49,6,55,6259,97,5.01,1.474,9.2,1.2,4.1,6.2,1.52,1139
4,Boston Red Sox,25,30.1,4.43,94,68,0.58,4.0,162,162,156,6,12,1,50,1436.2,1396,718,638,160,469,25,1131,55,0,28,6120,126,4.1,1.298,8.7,1.0,2.9,7.1,2.41,1092


### Fielding Statistics

In [5]:
# Load fielding data
fielding = pd.read_csv(DATA_PATH / str(SAMPLE_YEAR) / f'Fielding_{SAMPLE_YEAR}.csv')
print(f"Fielding Statistics ({SAMPLE_YEAR})")
print(f"Shape: {fielding.shape[0]} teams × {fielding.shape[1]} statistics\n")
print("Columns:", list(fielding.columns))
fielding.head()

Fielding Statistics (1999)
Shape: 31 teams × 16 statistics

Columns: ['Tm', '#Fld', 'RA/G', 'DefEff', 'G', 'GS', 'CG', 'Inn', 'Ch', 'PO', 'A', 'E', 'DP', 'Fld%', 'Rtot', 'Rtot/yr']


Unnamed: 0,Tm,#Fld,RA/G,DefEff,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr
0,Anaheim Angels,45,5.1,0.699,162,1458,1155,12882.0,6123,4294,1723,106,156,0.983,72,7
1,Arizona Diamondbacks,42,4.17,0.701,162,1458,1135,13206.0,6096,4402,1590,104,132,0.983,42,4
2,Atlanta Braves,44,4.08,0.694,162,1458,1021,13239.0,6182,4413,1658,111,127,0.982,43,4
3,Baltimore Orioles,42,5.03,0.697,162,1458,1128,12915.0,6175,4305,1781,89,191,0.986,39,4
4,Boston Red Sox,48,4.43,0.693,162,1458,1188,12930.0,5985,4310,1548,127,132,0.979,48,4


### Salary Data

In [6]:
# Load salary data
salaries = pd.read_csv(DATA_PATH / str(SAMPLE_YEAR) / f'Salaries_{SAMPLE_YEAR}.csv')
print(f"Team Salaries ({SAMPLE_YEAR})")
print(f"Shape: {salaries.shape[0]} teams\n")
salaries.head(10)

Team Salaries (1999)
Shape: 30 teams



Unnamed: 0,Tm,Payroll
0,New York Yankees,88180712
1,Texas Rangers,81576598
2,Atlanta Braves,74890000
3,Cleveland Guardians,73278458
4,Baltimore Orioles,72198363
5,Boston Red Sox,71725000
6,New York Mets,71506427
7,Los Angeles Dodgers,71115786
8,Arizona Diamondbacks,70496000
9,Chicago Cubs,55443500


### Postseason Results

In [7]:
# Load postseason data
postseason = pd.read_csv(DATA_PATH / str(SAMPLE_YEAR) / f'Postseason_{SAMPLE_YEAR}.csv')
print(f"Postseason Results ({SAMPLE_YEAR})")
postseason

Postseason Results (1999)


Unnamed: 0,0,1,2
0,World Series,4-0,New York Yankees over Atlanta Braves
1,ALCS,4-1,New York Yankees over Boston Red Sox
2,NLCS,4-2,Atlanta Braves over New York Mets
3,AL Division Series,3-2,Boston Red Sox over Cleveland Indians
4,AL Division Series,3-0,New York Yankees over Texas Rangers
5,NL Division Series,3-1,Atlanta Braves over Houston Astros
6,NL Division Series,3-1,New York Mets over Arizona Diamondbacks
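
Each row above packs the winner and loser into a single string, so counting playoff wins per team needs a small parsing step. The sketch below is one possible approach; `parse_series` is a hypothetical helper, and it assumes the score column always lists the winner's total first, as it does in the 1999 rows shown.

```python
import re

# "<winner> over <loser>", as in the third column of the output above.
RESULT_RE = re.compile(r"^(?P<winner>.+?) over (?P<loser>.+)$")


def parse_series(round_name, score, result):
    """Split one postseason row into winner, loser, and games won by each."""
    match = RESULT_RE.match(result.strip())
    if match is None:
        raise ValueError(f"Unrecognized result format: {result!r}")
    # Assumption: the score is written winner-first, e.g. "4-1".
    winner_wins, loser_wins = (int(part) for part in score.split("-"))
    return {
        "round": round_name,
        "winner": match.group("winner"),
        "loser": match.group("loser"),
        "winner_wins": winner_wins,
        "loser_wins": loser_wins,
    }


# First three rows of the 1999 output above.
sample_rows = [
    ("World Series", "4-0", "New York Yankees over Atlanta Braves"),
    ("ALCS", "4-1", "New York Yankees over Boston Red Sox"),
    ("NLCS", "4-2", "Atlanta Braves over New York Mets"),
]

playoff_wins = {}
for row in sample_rows:
    series = parse_series(*row)
    for team, wins in (
        (series["winner"], series["winner_wins"]),
        (series["loser"], series["loser_wins"]),
    ):
        playoff_wins[team] = playoff_wins.get(team, 0) + wins

print(playoff_wins)
```

Team names in this file still use the historical forms (for example "Cleveland Indians"), so the name standardization from Phase 2 would need to run on the parsed names before joining with the other tables.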


### WAA by Position

In [8]:
# Load WAA positions data
waa = pd.read_csv(DATA_PATH / str(SAMPLE_YEAR) / f'WAA_Positions_{SAMPLE_YEAR}.csv')
print(f"Wins Above Average by Position ({SAMPLE_YEAR})")
print(f"Shape: {waa.shape[0]} teams × {waa.shape[1]} positions\n")
print("Columns:", list(waa.columns))
waa.head()

Wins Above Average by Position (1999)
Shape: 30 teams × 17 positions

Columns: ['Rk', 'Total', 'All P', 'SP', 'RP', 'Non-P', 'C', '1B', '2B', '3B', 'SS', 'LF', 'CF', 'RF', 'OF (All)', 'DH', 'PH']


Unnamed: 0,Rk,Total,All P,SP,RP,Non-P,C,1B,2B,3B,SS,LF,CF,RF,OF (All),DH,PH
0,1,Arizona Diamondbacks19.9,ATL12.2,ARI9.1,TEX4.5,CLE12.7,TEX4.2,HOU5.1,CLE4.3,ATL4.8,NYY5.5,ARI4.3,ATL5.0,CLE5.0,CLE8.1,SEA2.5,BAL0.1
1,2,Atlanta Braves17.7,ARI10.6,BOS8.8,NYM4.0,CIN11.9,NYM2.5,PIT3.4,NYM3.9,NYM4.4,BOS4.2,MIL3.1,HOU3.7,PHI3.9,KCR7.7,OAK2.5,ARI0.1
2,3,New York Yankees15.3,NYY10.2,ATL8.2,NYY3.6,BAL10.7,DET1.4,OAK3.3,HOU2.8,PHI2.8,CLE3.3,KCR2.9,CIN3.7,TOR3.8,PHI6.3,TEX2.4,SFG0.0
3,4,New York Mets14.7,HOU10.0,HOU7.9,COL3.5,NYM9.6,TBD0.7,NYM3.2,ARI2.7,MIL2.5,TOR3.2,BAL2.6,CLE3.2,CHW3.3,ATL6.2,BAL1.5,CIN-0.2
4,5,Houston Astros14.6,BOS9.9,SEA7.1,ATL3.0,ARI9.3,SFG0.7,STL2.8,SFG1.9,LAD2.0,CIN3.1,CIN1.8,NYY2.9,CHC2.4,CIN6.2,TBD0.7,OAK-0.2
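
In the output above, every cell after `Rk` fuses a team label with its WAA value (for example `ARI9.1` or `CIN-0.2`). A minimal sketch for separating the two is shown below; `split_waa_cell` is a hypothetical helper, and it assumes the value is always the trailing number in the cell and that team labels never end in a digit.

```python
import re

# Team label (shortest possible match), then a trailing signed number.
CELL_RE = re.compile(r"^(?P<team>.*?)(?P<waa>-?\d+(?:\.\d+)?)$")


def split_waa_cell(cell):
    """Split a fused cell such as 'ARI9.1' into ('ARI', 9.1)."""
    match = CELL_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"Unrecognized WAA cell: {cell!r}")
    return match.group("team").strip(), float(match.group("waa"))


# Cells taken from the 1999 output above.
for cell in ["Arizona Diamondbacks19.9", "ARI9.1", "CIN-0.2", "SFG0.0"]:
    print(split_waa_cell(cell))
```

The `Total` column uses full team names while the position columns use abbreviations, so the two would need to be mapped to a common key before any comparison.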


---

## 📈 Data Coverage Summary

In [9]:
# Check data availability across all years
years = range(1998, 2026)
file_types = ['Batting', 'Pitching', 'Fielding', 'Postseason', 'Salaries', 'WAA_Positions']

coverage = []
for year in years:
    year_path = DATA_PATH / str(year)
    if year_path.exists():
        row = {'Year': year}
        for ft in file_types:
            file_path = year_path / f'{ft}_{year}.csv'
            row[ft] = '✓' if file_path.exists() else '✗'
        coverage.append(row)

coverage_df = pd.DataFrame(coverage)
print("Data Coverage by Year and File Type:")
print(f"Years covered: {coverage_df['Year'].min()} - {coverage_df['Year'].max()}")
print(f"Total years: {len(coverage_df)}\n")
coverage_df

Data Coverage by Year and File Type:
Years covered: 1998 - 2025
Total years: 28



Unnamed: 0,Year,Batting,Pitching,Fielding,Postseason,Salaries,WAA_Positions
0,1998,✓,✓,✓,✓,✓,✓
1,1999,✓,✓,✓,✓,✓,✓
2,2000,✓,✓,✓,✓,✓,✓
3,2001,✓,✓,✓,✓,✓,✓
4,2002,✓,✓,✓,✓,✓,✓
5,2003,✓,✓,✓,✓,✓,✓
6,2004,✓,✓,✓,✓,✓,✓
7,2005,✓,✓,✓,✓,✓,✓
8,2006,✓,✓,✓,✓,✓,✓
9,2007,✓,✓,✓,✓,✓,✓


---

## 🔧 Dependencies

The project uses the following Python packages:

```
selenium           # Browser automation for web scraping
webdriver-manager  # Automatic ChromeDriver installation
beautifulsoup4     # HTML parsing
pandas             # Data manipulation and CSV handling
lxml               # Fast XML/HTML parsing
```

Install with:
```bash
pip install -r requirements.txt
```

---

## ✅ What's Been Completed

| Phase | Task | Status |
|-------|------|--------|
| **Scraping** | Baseball Reference scraper | ✅ Complete |
| **Scraping** | Salary data scraper | ✅ Complete |
| **Cleaning** | Team name standardization | ✅ Complete |
| **Cleaning** | Abbreviation standardization | ✅ Complete |
| **Cleaning** | Salary data cleaning | ✅ Complete |
| **Storage** | Organized CSV structure (1998-2025) | ✅ Complete |

---

## 🚀 Next Steps (Future Work)

1. **Parse Postseason Results** - Extract team names and calculate playoff wins per team
2. **Aggregate Team Stats** - Rank teams 1-30 for each statistic per year
3. **Build Point System** - Assign points based on rankings (1st = 30 pts, etc.)
4. **Correlation Analysis** - Determine which stats correlate with playoff success
5. **Visualization** - Create charts showing top-correlated statistics
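
Steps 2 and 3 could look roughly like the sketch below. Since this work is not yet implemented, everything here is an assumption: the helper names are hypothetical, ties are given the same (best) rank, and points are computed as `n_teams + 1 - rank`, which yields 30 points for 1st place in a 30-team league. The sample uses the five 1999 teams shown in the Data Overview, so `n_teams` is set to 5.

```python
def rank_teams(values, higher_is_better=True):
    """Rank teams 1..N on one statistic; tied teams share the best rank."""
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=higher_is_better)
    ranks = {}
    previous_value = None
    previous_rank = 0
    for position, (team, value) in enumerate(ordered, start=1):
        rank = previous_rank if value == previous_value else position
        ranks[team] = rank
        previous_value, previous_rank = value, rank
    return ranks


def points_from_ranks(ranks, n_teams=30):
    """Assumed point scheme: 1st place gets n_teams points, last gets 1."""
    return {team: n_teams + 1 - rank for team, rank in ranks.items()}


# OPS and ERA for the five 1999 teams shown in the Data Overview.
ops = {"ANA": 0.716, "ARI": 0.805, "ATL": 0.777, "BAL": 0.800, "BOS": 0.798}
era = {"ANA": 4.79, "ARI": 3.77, "ATL": 3.63, "BAL": 4.77, "BOS": 4.00}

ops_points = points_from_ranks(rank_teams(ops), n_teams=5)
# A lower ERA is better, so the sort direction is flipped.
era_points = points_from_ranks(rank_teams(era, higher_is_better=False), n_teams=5)

total_points = {team: ops_points[team] + era_points[team] for team in ops}
print(total_points)
```

Each statistic needs an explicit direction flag, because "higher is better" holds for OPS but not for ERA, WHIP, or errors.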