# DS701: Tools for Data Science - Midterm Challenge -- Sports

In this challenge, you will build a systematic analysis of LALIGA shot data from seasons 2015-16 through 2024-25.

### Objectives:
1. **Setup and Load** the LALIGA shots dataset
2. **Exploratory Data Analysis (EDA)** - understand patterns, distributions, and relationships
3. **Clustering Analysis** - identify player archetypes based on shooting patterns
4. **Predictive Modeling** - build models to predict goal probability (excluding xG feature)
5. **Model Comparison** - compare our predictions with the existing xG metric
6. **Kaggle Submission** - prepare a submission for the Kaggle competition
7. **Summary and Conclusions** - summarize your findings and conclusions

You can use as many markdown and code cells as you want in your solutions.


---
## 1. Setup and Data Loading

In [2]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


In [3]:
# Load the dataset
df = pd.read_csv('laliga_shots_train.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print("\nFirst few rows:")
df.head()

Dataset loaded successfully!
Shape: 72,116 rows × 19 columns

First few rows:


Unnamed: 0,id,minute,result,X,Y,player,h_a,player_id,situation,season,shotType,match_id,h_team,a_team,h_goals,a_goals,date,player_assisted,lastAction
0,297163,44,Goal,0.896,0.627,Borja Bastón,a,1701,SetPiece,2018,RightFoot,10301,Athletic Club,Alaves,1,1,2019-04-27 12:00:00,,Rebound
1,606888,49,MissedShots,0.861,0.723,Aimar Oroz,a,8424,OpenPlay,2024,RightFoot,27150,Espanyol,Osasuna,0,0,2024-12-14 13:00:00,Ante Budimir,Pass
2,140398,29,SavedShot,0.925,0.324,Gerard Moreno,h,2120,OpenPlay,2016,LeftFoot,3991,Espanyol,Las Palmas,4,3,2017-03-10 19:45:00,Felipe Caicedo,Throughball
3,336585,42,SavedShot,0.44,0.53,Youssef En-Nesyri,h,5169,OpenPlay,2019,LeftFoot,12177,Leganes,Celta Vigo,3,2,2019-12-08 17:30:00,Rubén Pérez,BallRecovery
4,575017,30,BlockedShot,0.893,0.457,Sergio Camello,h,7528,OpenPlay,2023,RightFoot,23004,Rayo Vallecano,Osasuna,2,1,2024-04-20 14:15:00,Jorge De Frutos,Cross


In [4]:
# Dataset overview
print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)

df.info()

DATASET OVERVIEW
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72116 entries, 0 to 72115
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               72116 non-null  int64  
 1   minute           72116 non-null  int64  
 2   result           72116 non-null  object 
 3   X                72116 non-null  float64
 4   Y                72116 non-null  float64
 5   player           72116 non-null  object 
 6   h_a              72116 non-null  object 
 7   player_id        72116 non-null  int64  
 8   situation        72116 non-null  object 
 9   season           72116 non-null  int64  
 10  shotType         72116 non-null  object 
 11  match_id         72116 non-null  int64  
 12  h_team           72116 non-null  object 
 13  a_team           72116 non-null  object 
 14  h_goals          72116 non-null  int64  
 15  a_goals          72116 non-null  int64  
 16  date             72116 non-null  object 


In [5]:
# Check for missing values in every column
missing_count = df.isnull().sum()
missing_df = pd.DataFrame({
    'Column': missing_count.index,
    'Missing_Count': missing_count.values,
    'Missing_Percentage': (missing_count.values / len(df) * 100).round(2)
})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

print("\nMissing Values:")
if missing_df.empty:
    # Avoid printing an empty table when nothing is missing
    print("No missing values in any column!")
else:
    print(missing_df.to_string(index=False))


Missing Values:
         Column  Missing_Count  Missing_Percentage
player_assisted          19310             26.7800
     lastAction           8643             11.9800


## Explanation of the columns


| Column | Description |
|--------|-------------|
| **id** | Unique shot identifier |
| **minute** | Minute of the match when the shot occurred. A regular match has two 45-minute halves. |
| **result** | Shot outcome (Goal, SavedShot, MissedShots, BlockedShot) |
| **X** | X-coordinate of shot location on pitch, where X = 0 is the defensive end (own goal) and X = 1 is the attacking end (opponent goal) |
| **Y** | Y-coordinate of shot location on pitch |
| **player** | Name of player taking the shot |
| **h_a** | Home ('h') or Away ('a') team indicator |
| **player_id** | Unique player identifier |
| **situation** | Match situation (OpenPlay, FromCorner, SetPiece, etc.) |
| **season** | Season start year (e.g., 2018 for the 2018-19 season) |
| **shotType** | Type of shot (RightFoot, LeftFoot, Head, OtherBodyPart) |
| **match_id** | Match identifier |
| **h_team** | Home team name |
| **a_team** | Away team name |
| **h_goals** | Home team goals at time of shot |
| **a_goals** | Away team goals at time of shot |
| **date** | Match date |
| **player_assisted** | Player who assisted |
| **lastAction** | Last action before shot |


---

#### Load the Test Set

We'll also need to load the test set so we can predict the results for it and
submit it for evaluation.

In [6]:
# Load the test set
df_test = pd.read_csv('laliga_shots_test_no_result.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df_test.shape[0]:,} rows × {df_test.shape[1]} columns")
print("\nFirst few rows:")
df_test.head()

Dataset loaded successfully!
Shape: 18,029 rows × 18 columns

First few rows:


Unnamed: 0,id,minute,X,Y,player,h_a,player_id,situation,season,shotType,match_id,h_team,a_team,h_goals,a_goals,date,player_assisted,lastAction
0,444146,63,0.98,0.543,Isi Palazón,a,9812,FromCorner,2021,LeftFoot,17284,Valencia,Rayo Vallecano,1,1,2021-11-27 15:15:00,,Rebound
1,281129,31,0.916,0.371,Giannelli Imbula,a,861,OpenPlay,2018,LeftFoot,10183,Espanyol,Rayo Vallecano,2,1,2019-02-09 17:30:00,,
2,455491,26,0.856,0.546,Gerard Piqué,a,2092,FromCorner,2021,Head,17346,Alaves,Barcelona,0,1,2022-01-23 20:00:00,Pedri,Aerial
3,213542,36,0.864,0.385,Daniel Torres,a,5048,OpenPlay,2017,RightFoot,8232,Malaga,Alaves,0,3,2018-05-06 11:00:00,,TakeOn
4,195341,37,0.88,0.416,Simone Zaza,a,1642,SetPiece,2017,LeftFoot,8098,Atletico Madrid,Valencia,1,0,2018-02-04 19:45:00,José Gayá,Cross


In [7]:
# Test set overview
print("=" * 80)
print("TEST SET OVERVIEW")
print("=" * 80)

df_test.info()

TEST SET OVERVIEW
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18029 entries, 0 to 18028
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               18029 non-null  int64  
 1   minute           18029 non-null  int64  
 2   X                18029 non-null  float64
 3   Y                18029 non-null  float64
 4   player           18029 non-null  object 
 5   h_a              18029 non-null  object 
 6   player_id        18029 non-null  int64  
 7   situation        18029 non-null  object 
 8   season           18029 non-null  int64  
 9   shotType         18029 non-null  object 
 10  match_id         18029 non-null  int64  
 11  h_team           18029 non-null  object 
 12  a_team           18029 non-null  object 
 13  h_goals          18029 non-null  int64  
 14  a_goals          18029 non-null  int64  
 15  date             18029 non-null  object 
 16  player_assisted  13150 non-null  object 

---
## 2. Exploratory Data Analysis (EDA) (35 total points)

### 2.1 Target Variable Analysis - Shot Results (5 points)

Analyze the distribution of the shot 'result' column:

1. Print it as a table with columns 'Result', 'Count', and 'Percentage'. (5 points)
2. Show a bar plot of the distribution of 'result' categories. (5 points)
3. Show a pie chart of the distribution of 'result' categories. (5 points)
4. Calculate the goal conversion rate. (5 points)
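
One possible way to build the table and the conversion rate is sketched below. It runs on a tiny made-up frame named `toy` (an assumption for illustration, not the real data); with the actual training set you would use `df` instead.

```python
import pandas as pd

# Toy stand-in for the training frame (made-up rows, not real data).
toy = pd.DataFrame({
    "result": ["Goal", "SavedShot", "MissedShots", "BlockedShot",
               "MissedShots", "SavedShot", "Goal", "MissedShots"]
})

# Count each outcome and express it as a share of all shots.
counts = toy["result"].value_counts()
result_table = pd.DataFrame({
    "Result": counts.index,
    "Count": counts.values,
    "Percentage": (counts.values / len(toy) * 100).round(2),
})
print(result_table.to_string(index=False))

# Conversion rate = share of shots whose result is "Goal".
conversion_rate = (toy["result"] == "Goal").mean() * 100
print(f"Goal conversion rate: {conversion_rate:.2f}%")
```

The same `counts` series can feed both the bar plot and the pie chart, so the three views stay consistent.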

### 2.2 Numerical Features Analysis (5 points)

For the numerical features, calculate the following:
1. Statistical summary of the numerical features. (5 points)
2. Distribution plots for the key numerical features. (5 points)


### 2.3 Shot Location Analysis (X, Y Coordinates) (5 points)

Create a scatter plot of shot locations colored by result. (10 points)

Calculate the average X and Y coordinates for goals. (5 points)

### 2.4 Categorical Features Analysis (5 points)

Show the bar plots for each of the categorical features. (10 points)

### 2.5 Goal Conversion by Categorical Features (5 points)

Calculate the goal conversion rate by each of the categorical features, 'situation', 'shotType', and 'h_a'. (10 points)
* Display the results in a table with columns '<Category>', 'Total Shots', 'Goal Rate (%)'.

Display the results in bar plots. (5 points)
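
A minimal sketch of the per-category conversion table follows. The helper name `conversion_table`, the `is_goal` column, and the toy rows are all assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the training frame (made-up rows, not real data).
toy = pd.DataFrame({
    "situation": ["OpenPlay", "OpenPlay", "SetPiece",
                  "FromCorner", "OpenPlay", "SetPiece"],
    "result": ["Goal", "MissedShots", "Goal",
               "SavedShot", "BlockedShot", "MissedShots"],
})
# Binary target: 1 if the shot was a goal, else 0.
toy["is_goal"] = (toy["result"] == "Goal").astype(int)


def conversion_table(frame, column):
    """Return total shots and goal rate (%) for each category of `column`."""
    table = frame.groupby(column)["is_goal"].agg(["count", "mean"]).reset_index()
    table.columns = [column, "Total Shots", "Goal Rate (%)"]
    table["Goal Rate (%)"] = (table["Goal Rate (%)"] * 100).round(2)
    return table.sort_values("Goal Rate (%)", ascending=False).reset_index(drop=True)


situation_table = conversion_table(toy, "situation")
print(situation_table.to_string(index=False))
```

Calling the helper once per column ('situation', 'shotType', 'h_a') keeps the three tables in the same layout. Categories with very few shots can show extreme rates, so reporting 'Total Shots' next to the rate helps with interpretation.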

### 2.6 Player Analysis (10 points)

Analyze the players' performance by calculating the following:
1. How many total matches are present in the dataset? (5 points)
2. How many total matches were played in each season? (5 points)
3. Which 3 players played the most matches in each season? (5 points)
4. How many matches did each player play? Show the top 10 players. (5 points)
5. How many goals did each player score? Show the top 10 players. (5 points)
6. Calculate the average number of goals per game for each player and show the top 10 players who played at least 300 matches. (5 points)
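
Because the dataset records shots only, a player's "matches played" can only be approximated by the number of distinct `match_id` values in which that player took at least one shot. The sketch below follows that reading on made-up rows; the `toy` frame, the `is_goal` column, and the lowered `min_matches` threshold are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the training frame (made-up rows, not real data).
toy = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "C"],
    "match_id": [1, 1, 2, 1, 3, 2],
    "season": [2018, 2018, 2018, 2018, 2019, 2018],
    "result": ["Goal", "MissedShots", "Goal", "SavedShot", "Goal", "BlockedShot"],
})
toy["is_goal"] = (toy["result"] == "Goal").astype(int)

# Questions 1 and 2: distinct matches overall and per season.
total_matches = toy["match_id"].nunique()
matches_per_season = toy.groupby("season")["match_id"].nunique()

# Question 3: top 3 players by distinct matches within each season.
per_season = (
    toy.groupby(["season", "player"])["match_id"].nunique()
    .rename("matches")
    .reset_index()
)
top3_per_season = (
    per_season.sort_values(["season", "matches"], ascending=[True, False])
    .groupby("season")
    .head(3)
)

# Questions 4 to 6: matches, goals, and goals per match for each player.
player_stats = toy.groupby("player").agg(
    matches=("match_id", "nunique"),
    goals=("is_goal", "sum"),
)
player_stats["goals_per_match"] = player_stats["goals"] / player_stats["matches"]

min_matches = 2  # lowered for the toy data; the assignment specifies its own cutoff
top_scorers = (
    player_stats[player_stats["matches"] >= min_matches]
    .sort_values("goals_per_match", ascending=False)
    .head(10)
)
print(top_scorers)
```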


### 2.7 Feature Engineering for Analysis (5 points)

You may want to perform some feature engineering to create new features that will be useful for modeling.

Create derived features that will be useful for modeling. (10 points)
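
Two commonly used derived features are shot distance and the angle subtended by the goal mouth. The sketch below is one way to compute them and rests on several assumptions that are not stated in the dataset description: a 105 m × 68 m pitch, the goal centre at (X = 1, Y = 0.5), and a 7.32 m wide goal. The helper name `add_shot_geometry` and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd

PITCH_LENGTH_M = 105.0  # assumption: pitch length used to rescale X
PITCH_WIDTH_M = 68.0    # assumption: pitch width used to rescale Y
GOAL_WIDTH_M = 7.32     # distance between the goal posts


def add_shot_geometry(frame):
    """Return a copy of `frame` with distance, angle, and header features."""
    out = frame.copy()
    # Offsets from the assumed goal centre at (X=1, Y=0.5), in metres.
    dx = (1.0 - out["X"]) * PITCH_LENGTH_M
    dy = (0.5 - out["Y"]) * PITCH_WIDTH_M
    out["distance_m"] = np.hypot(dx, dy)
    # Angle between the lines from the shot location to the two posts.
    out["angle_rad"] = np.arctan2(
        GOAL_WIDTH_M * dx,
        dx**2 + dy**2 - (GOAL_WIDTH_M / 2) ** 2,
    )
    out["is_header"] = (out["shotType"] == "Head").astype(int)
    return out


# Made-up shots: one from 11 m straight in front of goal, one wide header.
toy = pd.DataFrame({
    "X": [1 - 11 / 105, 0.75],
    "Y": [0.5, 0.30],
    "shotType": ["RightFoot", "Head"],
})
toy = add_shot_geometry(toy)
print(toy[["distance_m", "angle_rad", "is_header"]])
```

Whatever features you create, the same function should be applied to the test set so that training and prediction see identical columns.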


### 2.8 Correlation Analysis (10 points)

Calculate the correlation matrix of the numerical features and display it as a heatmap. (10 points)
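
A small sketch of the matrix computation, run on randomly generated stand-in columns (an assumption for illustration, not the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in with three numeric columns and one text column.
toy = pd.DataFrame({
    "X": rng.uniform(0.6, 1.0, 50),
    "Y": rng.uniform(0.2, 0.8, 50),
    "minute": rng.integers(1, 95, 50),
    "player": ["p"] * 50,
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting frame can be passed to `sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)` to draw the heatmap. Identifier columns such as `id`, `player_id`, and `match_id` are numeric too, so you may want to drop them before computing correlations.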

---

## 3. Clustering Analysis: Player Shooting Archetypes (40 total points)

Identify different types of goal scorers based on their shooting patterns.

Perform K-Means clustering on the player-level statistics. (15 points)
 * Use Silhouette Score to determine the optimal number of clusters.
 * Print a table of the cluster centers.
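
The silhouette-based choice of the number of clusters could be sketched as follows. The player-level features (`avg_distance`, `header_share`) and the three synthetic groups are assumptions made for illustration; your own aggregation will produce different columns.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic player-level features: three made-up groups of 30 players each.
group_centers = np.array([[10.0, 0.60], [8.0, 0.05], [25.0, 0.05]])
features = pd.DataFrame(
    np.vstack([c + rng.normal(scale=[1.0, 0.02], size=(30, 2)) for c in group_centers]),
    columns=["avg_distance", "header_share"],
)
features["header_share"] = features["header_share"].clip(0, 1)

# Scale first so that no single feature dominates the distance computation.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# Fit K-Means for several k and keep the silhouette score of each fit.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    scores[k] = silhouette_score(scaled, model.labels_)
best_k = max(scores, key=scores.get)

# Refit with the best k and report centers in the original feature units.
best_model = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(scaled)
centers_table = pd.DataFrame(
    scaler.inverse_transform(best_model.cluster_centers_),
    columns=features.columns,
).rename_axis("Cluster")
print(f"Best k by silhouette score: {best_k}")
print(centers_table.round(2))
```

Reporting the centers in original units (via `inverse_transform`) makes them easier to interpret when naming the clusters.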

Based on the cluster centers, assign interpretable names to the clusters. (5 points)
  * These could be for example:
    * Penalty Specialists
    * Aerial Specialists
    * Poachers (Close-range)
    * Long-range Shooters
    * Mixed Style
  * Justify your choices.

Display the results in a table with columns 'Cluster', 'Archetype', 'Total Players', and 'Players'. (10 points)

Visualize the clusters using PCA. (10 points)
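
A sketch of the projection step, assuming a scaled player-level feature matrix and cluster labels already exist. Here both are random placeholders, so the output has no real meaning beyond showing the shapes involved.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = rng.normal(size=(60, 4))   # placeholder player-level feature matrix
labels = rng.integers(0, 3, size=60)  # placeholder cluster labels

# Project the standardized features onto the first two principal components.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(scaled)

explained = pca.explained_variance_ratio_
print(f"Variance explained by PC1 and PC2: {explained.sum():.1%}")
```

The two columns of `coords` can then be drawn with `plt.scatter(coords[:, 0], coords[:, 1], c=labels)`. Quoting the explained variance in the axis labels tells the reader how much structure the 2D view retains.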



---
## 4. Predictive Modeling: Goal Probability Prediction (50 total points + up to 15 bonus points)

Now we'll build models to predict the probability of scoring a goal.
We'll then compare our model predictions with the actual xG values.

Define and train the following models:
1. Logistic Regression (25 points)
2. Random Forest (25 points)
3. Bonus Models (feel free to use any model you want) (**Bonus:** 15 points)


Evaluate each model using the following metrics and display the results in a table:
* Accuracy
* Precision
* Recall
* F1-Score
* ROC-AUC
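
The training and evaluation loop might look like the sketch below. It uses a synthetic dataset with two made-up features (`distance`, `angle`) and an invented generating process, so the printed numbers say nothing about the real data. The hyperparameters and the 0.5 decision threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000
# Synthetic shots: closer, wider-angle shots score more often (made-up rule).
distance = rng.uniform(2, 35, n)
angle = rng.uniform(0.05, 1.2, n)
logit = 1.5 - 0.2 * distance + 1.5 * angle
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X = pd.DataFrame({"distance": distance, "angle": angle})

# Stratify so both splits keep the same goal rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(
        n_estimators=200, max_depth=6, random_state=42
    ),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]  # probability of the goal class
    pred = (prob >= 0.5).astype(int)          # illustrative threshold
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, pred),
        "Precision": precision_score(y_test, pred, zero_division=0),
        "Recall": recall_score(y_test, pred, zero_division=0),
        "F1-Score": f1_score(y_test, pred, zero_division=0),
        "ROC-AUC": roc_auc_score(y_test, prob),
    })

results = pd.DataFrame(rows).set_index("Model")
print(results.round(3))
```

Goals are a minority class, so accuracy alone can look strong even for a model that rarely predicts a goal. Reading it alongside recall, F1-Score, and ROC-AUC gives a fairer picture.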




### 4.2 Model 1: Logistic Regression (Baseline)

### 4.3 Model 2: Random Forest Classifier

### 4.4 Model 3: Bonus Models

---

## 5. Model Comparison and Analysis (30 total points)

Create a grouped bar plot to compare each of those metrics across all models. (10 points)

Plot the ROC curves for all models. (10 points)

Plot the confusion matrices for all models. (10 points)
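
The quantities behind the ROC curves and confusion matrices can be computed as in this sketch. The labels, the two model names, and their predicted probabilities are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Hypothetical held-out labels and predicted goal probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
model_probs = {
    "Model A": np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70]),
    "Model B": np.array([0.30, 0.60, 0.40, 0.90, 0.10, 0.55]),
}

summary = {}
for name, prob in model_probs.items():
    # ROC curve points: one (fpr, tpr) pair per candidate threshold.
    fpr, tpr, _ = roc_curve(y_true, prob)
    # Confusion matrix at a 0.5 threshold: rows are true classes, columns predicted.
    cm = confusion_matrix(y_true, (prob >= 0.5).astype(int))
    summary[name] = {
        "fpr": fpr,
        "tpr": tpr,
        "auc": roc_auc_score(y_true, prob),
        "cm": cm,
    }
    print(f"{name}: AUC = {summary[name]['auc']:.3f}")
    print(cm)
```

Each `fpr`/`tpr` pair can be drawn with `plt.plot(fpr, tpr, label=name)` on shared axes, and each `cm` array can be passed to `sns.heatmap(cm, annot=True, fmt="d")`.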


### 5.1 Grouped Bar Plot Comparison

### 5.2 ROC Curves Comparison

### 5.3 Confusion Matrices Visualization

---

## 6. Kaggle Submission (15 total points + up to 10 bonus points)

### Prepare a submission for the Kaggle competition.

1. Load the test dataset, laliga_shots_test_no_result.csv
2. Preprocess the test dataset in the same way you did for the training dataset.
3. Make predictions for the test dataset using your best model.
4. Create a submission file and upload it to Kaggle.

The submission file should be a CSV file with the following columns:

| is_goal | goal_prob |
|---------|-----------|
| 0       | 0.234     |
| 1       | 0.634     |
| 0       | 0.142     |

We will evaluate using F1-score.
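
Since the leaderboard metric is F1, the probability threshold used to derive `is_goal` matters. The sketch below picks a threshold on hypothetical validation data and then builds the two required columns. The validation arrays and test probabilities are made up, and whether the competition also expects an identifier column is not stated here, so check the Kaggle page.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical validation labels and predicted probabilities.
y_val = np.array([0, 0, 1, 0, 1, 0, 0, 1])
p_val = np.array([0.12, 0.31, 0.47, 0.22, 0.81, 0.42, 0.06, 0.37])

# Scan candidate thresholds and keep the one with the highest validation F1.
thresholds = np.linspace(0.05, 0.95, 19)
f1_values = [
    f1_score(y_val, (p_val >= t).astype(int), zero_division=0)
    for t in thresholds
]
best_threshold = thresholds[int(np.argmax(f1_values))]

# Hypothetical test-set probabilities from the chosen model.
test_prob = np.array([0.234, 0.634, 0.142])
submission = pd.DataFrame({
    "is_goal": (test_prob >= best_threshold).astype(int),
    "goal_prob": test_prob,
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a filename, for example `submission.to_csv("submission.csv", index=False)`, writes the file to disk. A threshold tuned on a small validation set can overfit, so cross-validated tuning is a safer choice on the real data.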


The top 10 finishers receive an additional 10 bonus points.

Finishers ranked 11-20 receive an additional 5 bonus points.




---
## 7. Summary and Conclusions (10 points)

Summarize your findings and conclusions. (10 points)

Among other things, answer the following questions:
* What did you learn from this analysis?
* Were there any unexpected findings?
* What would you do differently if you had more time?


---