In [3]:
import pandas as pd

# ✅ Load your uploaded CSV
df = pd.read_csv('/content/WOMENS_2025_TEAM.csv')

# ✅ Show the first few rows to verify it loaded correctly
df.head()


Unnamed: 0,#,Player,GP-GS,G,A,PTS,SH,SH%,SOG,SOG%,GWG,FPG,FPS,GB,TO,CT,DC,FOULS,RC-YC-GC
0,44.0,"Ward, Emma",19-19,30,46,76,77,0.39,55,0.714,1,3,9,6,41,2,0,9,0-1-0
1,24.0,"Trinkaus, Caroline",19-18,32,11,43,72,0.444,57,0.792,4,9,11,6,16,5,8,6,0-3-7
2,5.0,"Muchnick, Emma",19-18,34,7,41,71,0.479,55,0.775,2,12,24,27,31,9,13,8,0-1-1
3,19.0,"Britton, Gracie",19-14,20,10,30,41,0.488,33,0.805,0,3,7,8,16,0,1,2,0-0-1
4,11.0,"Vogelman, Alexa",19-10,21,6,27,46,0.457,35,0.761,0,9,14,25,27,13,31,26,0-3-0


### Dataset Description

This project uses the official 2025 season statistics for the **Syracuse University Women’s Lacrosse team**.  
The dataset includes **individual-level performance metrics** such as:

- `G`, `A`, `PTS`: Goals, Assists, and Total Points
- `SH`, `SH%`: Total shots and shooting percentage
- `SOG`, `SOG%`: Shots on goal and the share of shots that were on goal
- `TO`, `CT`: Turnovers and caused turnovers
- `GB`, `DC`: Ground balls and draw controls
- `FOULS`, `RC-YC-GC`: Fouls and card counts (red, yellow, green)
- `GP-GS`: Games Played and Games Started

These stats allow us to evaluate individual and team performance across offensive and defensive contributions.


In [4]:
# ✅ Filter rows where GP-GS is like "number-number".
# .copy() returns an independent DataFrame, so the column assignments
# below do not trigger SettingWithCopyWarning; na=False drops missing values.
df = df[df['GP-GS'].str.match(r'^\d+-\d+$', na=False)].copy()

# ✅ Extract GP and GS
df[['GP', 'GS']] = df['GP-GS'].str.split('-', expand=True).astype(int)

# ✅ Convert key columns to numeric (handle SH%, SOG%, etc.)
numeric_cols = ['G', 'A', 'PTS', 'SH', 'SH%', 'SOG', 'SOG%', 'GB', 'TO', 'CT', 'DC']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')


### Feature Engineering

To prepare the data for analysis:
- We first removed rows with malformed `GP-GS` entries.
- Then, we split the `GP-GS` column into two numeric fields: `GP` (Games Played) and `GS` (Games Started).
- We also converted key performance columns (`G`, `A`, `SH%`, `TO`, etc.) to numeric format for proper calculations.

This step ensures the dataset is clean and analysis-ready.
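
The three steps above can be sketched on a small frame. The first two rows are copied from the `df.head()` output; the `Total` row is a hypothetical malformed entry added only for illustration, and the names `sample` and `clean` are not part of the notebook.

```python
import pandas as pd

# First two rows come from the df.head() output above; the "Total" row is a
# hypothetical malformed entry, added only to show the filter at work.
sample = pd.DataFrame({
    "Player": ["Ward, Emma", "Trinkaus, Caroline", "Total"],
    "GP-GS": ["19-19", "19-18", "19"],
    "G": ["30", "32", "n/a"],
})

# Keep only rows shaped like "number-number". .copy() makes the result an
# independent DataFrame, so the assignments below do not raise a warning.
mask = sample["GP-GS"].str.match(r"^\d+-\d+$", na=False)
clean = sample[mask].copy()

# Split "GP-GS" into two integer columns.
clean[["GP", "GS"]] = clean["GP-GS"].str.split("-", expand=True).astype(int)

# Coerce the stat column to numbers; unparseable values would become NaN.
clean["G"] = pd.to_numeric(clean["G"], errors="coerce")

print(clean[["Player", "GP", "GS", "G"]])
```

Filtering before splitting matters here: a value such as `"19"` has no second part, so the split would yield a missing value and the integer conversion would fail.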


In [6]:
# Q1: Total games played
total_games = df['GP'].max()
print(f"1️⃣ Total games played: {total_games}")

1️⃣ Total games played: 19


This question establishes the **length of the season**.  
We assume the total number of games played equals the **maximum value in the `GP` column** (Games Played).  
Each row is a player, and some players did not appear in every game, so the maximum gives the full season length as long as at least one player appeared in all of them.

In [7]:
# Q2: Average goals per game
avg_goals = df['G'].sum() / total_games
print(f"2️⃣ Average goals per game: {avg_goals:.2f}")

2️⃣ Average goals per game: 10.84


We calculate this by summing all player goals (`G`) and dividing by the total number of games.  
This gives us a basic sense of the team’s offensive output per match.

In [8]:
# Q3: Average assists per game
avg_assists = df['A'].sum() / total_games
print(f"3️⃣ Average assists per game: {avg_assists:.2f}")

3️⃣ Average assists per game: 5.42


Similar to the goals calculation, we total all assists (`A`) and divide by the number of games played.  
This helps assess how much of the offense came from coordinated passing rather than individual scoring.
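
As a rough illustration of the per-game calculation, the sketch below uses only the five players visible in the `df.head()` output, so its rates are lower than the full-team figures printed above. It also checks that `PTS` equals `G + A` on every row, a quick way to confirm the columns were parsed correctly. The name `sample` is illustrative.

```python
import pandas as pd

# Goals, assists and points for the five players shown in df.head() above.
sample = pd.DataFrame({
    "Player": ["Ward, Emma", "Trinkaus, Caroline", "Muchnick, Emma",
               "Britton, Gracie", "Vogelman, Alexa"],
    "GP": [19, 19, 19, 19, 19],
    "G": [30, 32, 34, 20, 21],
    "A": [46, 11, 7, 10, 6],
    "PTS": [76, 43, 41, 30, 27],
})

# Season length is taken as the largest games-played value.
total_games = sample["GP"].max()

# Sanity check: points should equal goals plus assists on every row.
points_consistent = (sample["G"] + sample["A"] == sample["PTS"]).all()

# Per-game rates for this five-player subset only.
goals_per_game = sample["G"].sum() / total_games
assists_per_game = sample["A"].sum() / total_games

print(f"PTS == G + A on every row: {points_consistent}")
print(f"Goals per game (subset): {goals_per_game:.2f}")
print(f"Assists per game (subset): {assists_per_game:.2f}")
```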

In [9]:
# Q4: Highest shot accuracy (min 10 shots to be fair)
df_shooters = df[df['SH'] >= 10]
most_accurate = df_shooters.sort_values(by='SH%', ascending=False).iloc[0]
print(f"4️⃣ Highest shot accuracy: {most_accurate['Player']} ({most_accurate['SH%']:.3f})")

4️⃣ Highest shot accuracy: DeVito, Sam (0.667)


We use the `SH%` column (shooting percentage: goals divided by shots) to find the most accurate shooter.  
To keep it fair, we filter to players who took at least 10 shots.  
This avoids cases where a player who scored on her only shot tops the ranking at 100%.
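
The effect of the minimum-shots filter can be sketched as follows. The five named players come from the `df.head()` output; the sixth row is a hypothetical 1-for-1 shooter added to show what the filter guards against. `SH%` is recomputed here as `G / SH` rather than read from the file, and `MIN_SHOTS` and `sample` are illustrative names.

```python
import pandas as pd

# Five real rows from df.head() plus one hypothetical low-volume shooter.
sample = pd.DataFrame({
    "Player": ["Ward, Emma", "Trinkaus, Caroline", "Muchnick, Emma",
               "Britton, Gracie", "Vogelman, Alexa", "Hypothetical, Player"],
    "G": [30, 32, 34, 20, 21, 1],
    "SH": [77, 72, 71, 41, 46, 1],
})

MIN_SHOTS = 10  # same cut-off as the cell above

# Shooting percentage recomputed as goals divided by shots.
sample["SH%"] = sample["G"] / sample["SH"]

# Without the filter, the 1-for-1 shooter tops the ranking at 100%.
unfiltered_leader = sample.loc[sample["SH%"].idxmax(), "Player"]

# With the filter, only players with enough attempts are compared.
qualified = sample[sample["SH"] >= MIN_SHOTS]
filtered_leader = qualified.loc[qualified["SH%"].idxmax(), "Player"]

print(f"Leader without filter: {unfiltered_leader}")
print(f"Leader with filter:    {filtered_leader}")
```

The threshold of 10 is a judgment call; raising it trades coverage of bench players for more stable percentages.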

In [10]:
# Q5: Most turnovers
most_turnovers = df.loc[df['TO'].idxmax()]
print(f"5️⃣ Most turnovers: {most_turnovers['Player']} ({int(most_turnovers['TO'])} TOs)")

5️⃣ Most turnovers: Ward, Emma (41 TOs)


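This highlights which players **lose possession most often**.  
Turnovers (`TO`) represent lost possession and matter for evaluating both offensive efficiency and decision-making. Raw totals also rise with usage: Ward's 41 turnovers come alongside 46 assists and 76 points, so a high count is not by itself a sign of careless play.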

In [11]:
# Q6: Most ground balls
most_gb = df.loc[df['GB'].idxmax()]
print(f"6️⃣ Most ground balls: {most_gb['Player']} ({int(most_gb['GB'])} GB)")

6️⃣ Most ground balls: Benoit, Kaci (34 GB)


Ground balls (`GB`) reflect **hustle and defensive contribution**.  
Players who recover the most ground balls are often key defenders or midfielders who keep possession alive.

In [12]:
# Q7: Top scorer
top_scorer = df.loc[df['G'].idxmax()]
print(f"7️⃣ Top scorer: {top_scorer['Player']} ({int(top_scorer['G'])} goals)")

7️⃣ Top scorer: Muchnick, Emma (34 goals)


A simple but important metric: we identify the player with the highest number of goals (`G`).  
This often reflects who the team relied on most for offensive production.

In [13]:
#  Q8: Proxy "Most Improved" via custom impact score
df['impact_score'] = (df['G'] + df['A'] + df['GB'] + df['DC'] - df['TO']) / df['GP']
most_improved = df.sort_values(by='impact_score', ascending=False).iloc[0]
print(f"8️⃣ Proxy Most Improved: {most_improved['Player']} (Impact Score: {most_improved['impact_score']:.2f})")

8️⃣ Proxy Most Improved: Rode, Meghan (Impact Score: 4.29)


Since we don't have per-game or multi-season data, we use a **proxy metric**:

`impact_score = (Goals + Assists + Ground Balls + Draw Controls - Turnovers) / Games Played`

This reflects each player's **positive contribution per game**, while penalizing lost possessions.  
The player with the highest score is considered the most impactful per game played, which we use as a stand-in for “most improved”.
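
To make the formula concrete, the sketch below applies it to the five players shown in the `df.head()` output. It covers only that subset, so its leader differs from the full-data result above; `sample`, `positive`, and `ranked` are illustrative names.

```python
import pandas as pd

# Stat lines for the five players shown in df.head() above.
sample = pd.DataFrame({
    "Player": ["Ward, Emma", "Trinkaus, Caroline", "Muchnick, Emma",
               "Britton, Gracie", "Vogelman, Alexa"],
    "GP": [19, 19, 19, 19, 19],
    "G": [30, 32, 34, 20, 21],
    "A": [46, 11, 7, 10, 6],
    "GB": [6, 6, 27, 8, 25],
    "DC": [0, 8, 13, 1, 31],
    "TO": [41, 16, 31, 16, 27],
})

# Positive contributions: goals, assists, ground balls, draw controls.
positive = sample[["G", "A", "GB", "DC"]].sum(axis=1)

# Subtract turnovers, then normalize by games played.
sample["impact_score"] = (positive - sample["TO"]) / sample["GP"]

# Rank from highest to lowest score.
ranked = sample.sort_values("impact_score", ascending=False)
print(ranked[["Player", "impact_score"]].round(2))
```

Within this subset, ground balls and draw controls lift Vogelman above the higher-scoring attackers, which shows how heavily the formula weights possession stats.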