# <center> Data Preprocessing With Inspiration from xG Paper </center>

### Features from the xG paper: (try with LSTM before resorting to XGBoost)

At each time-step (row), the dataset will contain:

- Relevant ball data (positions, velocities)
    - use the ABSOLUTE VALUE of the positions and velocities, since which half the shot was taken on is not necessarily important

- Distance of the ball to the center of the goal (3D center)

- Angle of the ball between the goal posts (coordinates above)

- Number of opposing players between the ball and the goal (number of players in the "shot triangle")

- Make a "team state" variable for each team (attacking and defending) that encodes the team's overall structure/dynamics via a Spectral Embedding, keeping the `k` most important eigenvectors (test how different values of `k` affect model performance)
    - Use the ChatGPT prompt as a guide

The finalized datasets will be saved in the `datasets` file directory.

### Important Notes:

- Goal dimensions: 2000 units wide x 456 units tall. Blue goal at y = -5200, x $\in$ [-1000,1000]; Orange goal at y = 5200, x $\in$ [-1000,1000]

- Blue team half is when $\text{pos}_y < 0$. Orange team half is when $\text{pos}_y > 0$

---

## Preprocessing Steps:

### Step 1:

Separate relevant ball data into its own dataframe (this will be the principal dataframe post-separation)

### Step 2: 

Using the new principal dataframe, for each time-step compute the 3D distance of the ball to the center of the goal and the angle of the ball between the goal posts

### Step 3:

Using the new angle variable, use the player positions of the defending team to determine (at each time-step) how many of them are between the ball and the goal

### Step 4:

Find the number of players within 400 units of the ball

### Step 5:

Implement a Spectral Embedding on the attacking and defending teams, respectively, to get a team-state variable

### Step 6:

Once all the new features are created, reformat the ball data (absolute value of the `y` direction), fill in NA values, and standardize/scale the features appropriately for use with a model. Add the label into the final dataset

### Step 7:

Turn the dataframes into separate torch tensors, merge into a torch dataset, and save the dataset

---

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
DATA_DIR = '/Users/marcomaluf/Desktop/Unfinished Projects/New RL/raw CSVs'
raw_csvs = os.listdir(DATA_DIR)

In [3]:
# initial preprocessing testing used raw_csvs[4]

sample_df = pd.read_csv(f'{DATA_DIR}/{raw_csvs[4]}').drop(columns='Unnamed: 0')
sample_df

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z,ang_vel_x,ang_vel_y,ang_vel_z,hit_team_no,...,throttle.5,steer.5,dodge_active.5,double_jump_active.5,jump_active.5,boost_active.5,team.5,dist_to_ball.5,label,matchId
0,-2246.02,-3926.62,1718.94,6517.4,-10724.5,-5849.6,-3712.1,-1762.9,951.6,1.0,...,128.0,255.0,16,8,24,False,1,6314.254848,1,C09E0B52498F83A541DB1D9181542022
1,-2224.30,-3962.34,1699.00,6510.6,-10713.3,-6060.0,-3712.1,-1762.9,951.6,1.0,...,128.0,255.0,16,8,24,False,1,6245.645240,1,C09E0B52498F83A541DB1D9181542022
2,-2208.04,-3989.11,1683.59,6505.5,-10704.9,-6217.8,-3712.1,-1762.9,951.6,1.0,...,128.0,255.0,16,8,24,False,1,6266.047076,1,C09E0B52498F83A541DB1D9181542022
3,-2175.54,-4042.59,1651.58,6495.3,-10688.1,-6532.8,-3712.1,-1762.9,951.6,1.0,...,128.0,255.0,16,8,24,False,1,6306.869337,1,C09E0B52498F83A541DB1D9181542022
4,-2148.49,-4087.09,1623.70,6486.8,-10674.1,-6795.0,-3712.1,-1762.9,951.6,1.0,...,0.0,254.0,16,8,24,False,1,6252.404876,1,C09E0B52498F83A541DB1D9181542022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1019,436.58,-5031.17,568.86,8401.2,-29599.1,-5841.7,1394.1,-1415.2,4551.2,1.0,...,,83.0,30,8,38,False,1,810.625318,1,C09E0B52498F83A541DB1D9181542022
1020,450.58,-5080.48,558.99,8396.8,-29583.9,-5946.9,1394.1,-1415.2,4551.2,1.0,...,,83.0,30,8,38,False,1,683.569817,1,C09E0B52498F83A541DB1D9181542022
1021,485.90,-5177.71,482.15,6793.1,-17521.4,-16925.0,-5462.3,-2257.9,1031.8,1.0,...,,83.0,30,8,38,False,1,800.460257,1,C09E0B52498F83A541DB1D9181542022
1022,497.22,-5206.90,453.81,,,,,,,1.0,...,,128.0,30,8,38,False,1,664.211151,1,C09E0B52498F83A541DB1D9181542022


In [4]:
sample_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024 entries, 0 to 1023
Data columns (total 114 columns):
 #    Column                Dtype  
---   ------                -----  
 0    pos_x                 float64
 1    pos_y                 float64
 2    pos_z                 float64
 3    vel_x                 float64
 4    vel_y                 float64
 5    vel_z                 float64
 6    ang_vel_x             float64
 7    ang_vel_y             float64
 8    ang_vel_z             float64
 9    hit_team_no           float64
 10   pos_x.1               float64
 11   pos_y.1               float64
 12   pos_z.1               float64
 13   vel_x.1               float64
 14   vel_y.1               float64
 15   vel_z.1               float64
 16   ang_vel_x.1           float64
 17   ang_vel_y.1           float64
 18   ang_vel_z.1           float64
 19   throttle              float64
 20   steer                 float64
 21   dodge_active          object 
 22   double_jump_active    

In [5]:
print(sample_df.isna().sum().to_frame().to_markdown())

|                      |   0 |
|:---------------------|----:|
| pos_x                |   0 |
| pos_y                |   0 |
| pos_z                |   0 |
| vel_x                |   4 |
| vel_y                |   4 |
| vel_z                |   4 |
| ang_vel_x            |   4 |
| ang_vel_y            |   4 |
| ang_vel_z            |   4 |
| hit_team_no          |   0 |
| pos_x.1              |   0 |
| pos_y.1              |   0 |
| pos_z.1              |   0 |
| vel_x.1              |   0 |
| vel_y.1              |   0 |
| vel_z.1              |   0 |
| ang_vel_x.1          |   0 |
| ang_vel_y.1          |   0 |
| ang_vel_z.1          |   0 |
| throttle             |   3 |
| steer                |  15 |
| dodge_active         |   0 |
| double_jump_active   |   0 |
| jump_active          |   0 |
| boost_active         |   0 |
| team                 |   0 |
| dist_to_ball         |   0 |
| pos_x.2              |   0 |
| pos_y.2              |   0 |
| pos_z.2              |   0 |
| vel_x.

In [6]:
sample_df = sample_df.fillna(0)

## Step 1:

In [7]:
ball_data = sample_df[['pos_x', 'pos_y', 'pos_z', 'vel_x', 'vel_y', 'vel_z']].copy(deep=True)
ball_data

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z
0,-2246.02,-3926.62,1718.94,6517.4,-10724.5,-5849.6
1,-2224.30,-3962.34,1699.00,6510.6,-10713.3,-6060.0
2,-2208.04,-3989.11,1683.59,6505.5,-10704.9,-6217.8
3,-2175.54,-4042.59,1651.58,6495.3,-10688.1,-6532.8
4,-2148.49,-4087.09,1623.70,6486.8,-10674.1,-6795.0
...,...,...,...,...,...,...
1019,436.58,-5031.17,568.86,8401.2,-29599.1,-5841.7
1020,450.58,-5080.48,558.99,8396.8,-29583.9,-5946.9
1021,485.90,-5177.71,482.15,6793.1,-17521.4,-16925.0
1022,497.22,-5206.90,453.81,0.0,0.0,0.0


## Step 2:

In [8]:
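# the ball's final y-position in each 64-step sequence decides which half the shot was taken on
final_value = []
for i in range(64, ball_data.shape[0]+1, 64):
    final_value.append(ball_data['pos_y'].iloc[i-1])

# 0 := blue half (pos_y < 0), 1 := orange half
half = np.where(np.array(final_value) < 0, 0, 1)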

In [9]:
blue_goal_center = np.array([0, -5200, 273])
orange_goal_center = np.array([0, 5200, 273])

def distance_3d(v1, v2):
    d = np.power(v1-v2, 2)
    d = np.sum(d, axis=1)

    distance = np.sqrt(d)
    
    return distance

# .copy() so that adding a column to a segment later does not write to a view of ball_data
segments = [ball_data[['pos_x', 'pos_y', 'pos_z']].iloc[i:i+64].copy() for i in range(0, ball_data.shape[0], 64)]

In [10]:
for seg, h in zip(segments, half):
    # measure against the goal on the half where the sequence ends
    goal_center = blue_goal_center if h == 0 else orange_goal_center
    seg['distance_to_goal'] = distance_3d(seg, goal_center)

In [11]:
dist_to_goal = pd.concat(segments, ignore_index=True)
dist_to_goal = dist_to_goal['distance_to_goal']
ball_data.loc[:, 'dist_to_goal'] = dist_to_goal.values
ball_data

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z,dist_to_goal
0,-2246.02,-3926.62,1718.94,6517.4,-10724.5,-5849.6,2959.196673
1,-2224.30,-3962.34,1699.00,6510.6,-10713.3,-6060.0,2917.668378
2,-2208.04,-3989.11,1683.59,6505.5,-10704.9,-6217.8,2886.426750
3,-2175.54,-4042.59,1651.58,6495.3,-10688.1,-6532.8,2823.659862
4,-2148.49,-4087.09,1623.70,6486.8,-10674.1,-6795.0,2771.095169
...,...,...,...,...,...,...,...
1019,436.58,-5031.17,568.86,8401.2,-29599.1,-5841.7,553.749767
1020,450.58,-5080.48,558.99,8396.8,-29583.9,-5946.9,546.898205
1021,485.90,-5177.71,482.15,6793.1,-17521.4,-16925.0,529.470846
1022,497.22,-5206.90,453.81,0.0,0.0,0.0,529.119641
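
The segment loop above can also be written without splitting the dataframe. The following is a minimal sketch, not the notebook's implementation: `distance_to_target_goal` and the upper-case constants are hypothetical names, and it assumes the same convention as above (the last frame of each 64-step sequence decides the half, goal centers at z = 273).

```python
import numpy as np

# Goal centers as used in the notebook (assumed unchanged)
BLUE_GOAL_CENTER = np.array([0.0, -5200.0, 273.0])
ORANGE_GOAL_CENTER = np.array([0.0, 5200.0, 273.0])

def distance_to_target_goal(ball_xyz, seq_length=64):
    """3D distance of every row to the goal targeted by its sequence."""
    ball_xyz = np.asarray(ball_xyz, dtype=float)
    n_seqs = len(ball_xyz) // seq_length
    ball_xyz = ball_xyz[: n_seqs * seq_length]

    # y-coordinate of the last frame of each sequence decides the half
    final_y = ball_xyz[seq_length - 1 :: seq_length, 1]
    half = np.where(final_y < 0, 0, 1)

    # broadcast the per-sequence half to every row, then pick the goal per row
    half_per_row = np.repeat(half, seq_length)
    goals = np.where(half_per_row[:, None] == 0, BLUE_GOAL_CENTER, ORANGE_GOAL_CENTER)

    return np.linalg.norm(ball_xyz - goals, axis=1)
```

With the real data this would be called as `distance_to_target_goal(ball_data[['pos_x', 'pos_y', 'pos_z']].values)`.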


In [None]:
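def ball_goal_angle(a, b = np.array([-1000, 5200]), c = np.array([1000, 5200])):
    # a := ball 2d position, b := left post, c := right post
    # pos_y is mirrored onto the orange half so one pair of posts serves both goals
    a = a.copy()
    a['pos_y'] = np.abs(a['pos_y'])
    a = a[['pos_x', 'pos_y']].values

    # vectors from the ball to each post
    ab = b - a
    ac = c - a
    dot = np.einsum('ij,ij->i', ab, ac)  # row-wise dot product

    mag_ab = np.linalg.norm(ab, axis=1)
    mag_ac = np.linalg.norm(ac, axis=1)

    # clip guards against round-off pushing the cosine slightly outside [-1, 1]
    theta = np.degrees(np.arccos(np.clip(dot / (mag_ab * mag_ac), -1.0, 1.0)))

    return theta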

In [13]:
angle_to_goal = ball_goal_angle(a=ball_data[['pos_x', 'pos_y']])
ball_data.loc[:, 'angle_to_goal'] = angle_to_goal
ball_data

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z,dist_to_goal,angle_to_goal
0,-2246.02,-3926.62,1718.94,6517.4,-10724.5,-5849.6,2959.196673,24.202630
1,-2224.30,-3962.34,1699.00,6510.6,-10713.3,-6060.0,2917.668378,24.311425
2,-2208.04,-3989.11,1683.59,6505.5,-10704.9,-6217.8,2886.426750,24.388253
3,-2175.54,-4042.59,1651.58,6495.3,-10688.1,-6532.8,2823.659862,24.529133
4,-2148.49,-4087.09,1623.70,6486.8,-10674.1,-6795.0,2771.095169,24.631417
...,...,...,...,...,...,...,...,...
1019,436.58,-5031.17,568.86,8401.2,-29599.1,-5841.7,553.749767,156.616265
1020,450.58,-5080.48,558.99,8396.8,-29583.9,-5946.9,546.898205,163.016949
1021,485.90,-5177.71,482.15,6793.1,-17521.4,-16925.0,529.470846,176.657933
1022,497.22,-5206.90,453.81,0.0,0.0,0.0,529.119641,178.949691
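
As a cross-check on the angle column, the same quantity can be computed with `arctan2` on the cross and dot products of the two ball-to-post vectors, which stays well-conditioned when the angle approaches 0° or 180°. This is a sketch under the same assumptions as the cell above (posts at x = ±1000, y = 5200, `pos_y` mirrored with an absolute value); `ball_goal_angle_atan2` is a hypothetical name.

```python
import numpy as np

def ball_goal_angle_atan2(ball_xy, left_post=(-1000.0, 5200.0), right_post=(1000.0, 5200.0)):
    """Angle (degrees) subtended by the goal posts at the ball, row by row."""
    ball_xy = np.asarray(ball_xy, dtype=float).reshape(-1, 2).copy()
    ball_xy[:, 1] = np.abs(ball_xy[:, 1])  # mirror onto the orange half

    ab = np.asarray(left_post) - ball_xy   # ball -> left post
    ac = np.asarray(right_post) - ball_xy  # ball -> right post

    dot = np.einsum('ij,ij->i', ab, ac)
    cross = ab[:, 0] * ac[:, 1] - ab[:, 1] * ac[:, 0]  # z-component of the 2D cross product

    # atan2(|cross|, dot) gives the unsigned angle in [0, 180] degrees
    return np.degrees(np.arctan2(np.abs(cross), dot))
```

For the first row of the sample (`pos_x = -2246.02`, `pos_y = -3926.62`) this reproduces the 24.2026° shown in the table.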


## Step 3:

(find the number of defending players in-between the ball and the goal [in the 2D-triangle and with z <= 646, halfway between the floor and the ceiling])

### Algorithm for Step 3:

1. determine which half the play is on (use the `half` array from **Step 2**)
2. For each time_step in the segment:
    - record the ball's xy-position and each defending player's xyz-position
    - if the defending player's `pos_z` <= 646, check whether they are inside the triangle with the function below
    - if yes, count += 1
    - return the count for that time-step
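
For a single time-step, the algorithm above can be sketched as follows. This is an illustrative, self-contained version rather than the notebook's function: `count_defenders_in_triangle` is a hypothetical name, the goal posts are assumed to sit at x = ±1000 on the line y = `goal_y`, and the height cutoff defaults to the 646 mentioned above.

```python
import numpy as np

def count_defenders_in_triangle(ball_xy, defenders_xyz, goal_y, z_max=646.0):
    """Count defenders inside the ball/left-post/right-post triangle with z <= z_max."""
    ball_xy = np.asarray(ball_xy, dtype=float)
    defenders = np.asarray(defenders_xyz, dtype=float).reshape(-1, 3)
    left = np.array([-1000.0, goal_y])
    right = np.array([1000.0, goal_y])

    def cross(p, a, b):
        # z-component of (a - p) x (b - p) for every row of p
        return (a[0] - p[:, 0]) * (b[1] - p[:, 1]) - (a[1] - p[:, 1]) * (b[0] - p[:, 0])

    xy = defenders[:, :2]
    s1 = cross(xy, left, right)
    s2 = cross(xy, right, ball_xy)
    s3 = cross(xy, ball_xy, left)

    # a point is inside (or on the edge of) the triangle when all three signs agree
    inside = ((s1 >= 0) & (s2 >= 0) & (s3 >= 0)) | ((s1 <= 0) & (s2 <= 0) & (s3 <= 0))
    low_enough = defenders[:, 2] <= z_max

    return int(np.sum(inside & low_enough))
```

For example, with the ball at midfield and the blue goal at `goal_y=-5200`, a grounded defender at (0, -2600) is counted, while one far out wide or one above the height cutoff is not.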

In [14]:
def is_player_inside_ball_goal_triangle(player_pos, ball_position, half, max_z=646):
    """Per-timestep boolean: is the player inside the 2D ball/goal-post triangle and at or below max_z?"""

    def sign(p1, p2, p3):
        """Computes the cross product sign for all timesteps."""
        return (p1[:, 0] - p3[:, 0]) * (p2[:, 1] - p3[:, 1]) - (p2[:, 0] - p3[:, 0]) * (p1[:, 1] - p3[:, 1])

    if half == 0:
        p_left = np.array([-1000, -5200, 646])  # Left goalpost
        p_right = np.array([1000, -5200, 646])  # Right goalpost
    else:
        p_left = np.array([-1000, 5200, 646])
        p_right = np.array([1000, 5200, 646])

    # Ensure inputs are NumPy arrays
    player_pos = np.asarray(player_pos)  # Shape (N, 3)
    ball_position = np.asarray(ball_position)  # Shape (N, 3) or wider; only x and y are used

    # Expand goalpost positions to match shape (N, 3)
    p_left = np.tile(p_left, (player_pos.shape[0], 1))   # (N, 3)
    p_right = np.tile(p_right, (player_pos.shape[0], 1)) # (N, 3)

    # Compute the cross-product signs for each timestep
    s1 = sign(player_pos, p_left, p_right)
    s2 = sign(player_pos, p_right, ball_position)
    s3 = sign(player_pos, ball_position, p_left)

    # Player is inside if all signs are the same (either all positive or all negative)
    in_triangle = ((s1 >= 0) & (s2 >= 0) & (s3 >= 0)) | ((s1 <= 0) & (s2 <= 0) & (s3 <= 0))

    # Only count players at or below max_z, as described in the algorithm above
    return in_triangle & (player_pos[:, 2] <= max_z)


In [15]:
player_positions_indeces = [10,11,12,25, 27,28,29,42, 44,45,46,59, 61,62,63,76, 78,79,80,93, 95,96,97,110]
player_segments = [sample_df.iloc[i:i+64, player_positions_indeces].copy(deep=True) for i in range(0, sample_df.shape[0], 64)]

In [16]:
player_segments[0].iloc[:, 12:24]

Unnamed: 0,pos_x.4,pos_y.4,pos_z.4,team.3,pos_x.5,pos_y.5,pos_z.5,team.4,pos_x.6,pos_y.6,pos_z.6,team.5
0,138.77,-1702.21,17.01,1,-2625.34,-3930.20,1565.71,1,-1828.23,2148.32,48.50,1
1,139.54,-1778.34,17.01,1,-2530.10,-4050.02,1520.59,1,-1744.38,2036.30,27.45,1
2,139.54,-1778.34,17.01,1,-2458.94,-4136.93,1483.30,1,-1744.38,2036.30,27.45,1
3,140.73,-1856.45,17.01,1,-2458.94,-4136.93,1483.30,1,-1744.38,2036.30,27.45,1
4,140.73,-1856.45,17.01,1,-2458.94,-4136.93,1483.30,1,-1637.70,1933.30,15.42,1
...,...,...,...,...,...,...,...,...,...,...,...,...
59,327.29,-4934.40,101.56,1,367.97,-5013.22,733.87,1,305.49,399.17,62.61,1
60,327.29,-4934.40,101.56,1,367.97,-5013.22,733.87,1,305.49,399.17,62.61,1
61,327.29,-4934.40,101.56,1,367.97,-5013.22,733.87,1,350.92,276.65,59.39,1
62,279.97,-5056.95,100.13,1,500.74,-5009.29,651.47,1,350.92,276.65,59.39,1


In [17]:
players_btwn_ball_and_goal = []

for ball, players, h in zip(segments, player_segments, half):
    temp = []
    
    # Identify player indices (assuming 4 columns per player: x, y, z, team)
    num_players = players.shape[1] // 4  
    selected_columns = []  

    for i in range(num_players):
        team_col = i * 4 + 3  # The "team" column index for player i
        
        # Keep players whose team is NOT on the ball's half
        if not (players.iloc[:, team_col] == h).all():
            selected_columns.extend(players.columns[i * 4: i * 4 + 4])

    # Filter the DataFrame
    players = players[selected_columns]

    # Iterate over each player and check if they are inside the ball-goal triangle
    for i in range(0, players.shape[1] - 3, 4):  # Step by 4 to get x, y, z
        player = players.iloc[:, i:i+3]  # Select x, y, z only
        temp.append(is_player_inside_ball_goal_triangle(player, ball, h))

    temp = np.sum(temp, axis=0)
    players_btwn_ball_and_goal.append(temp)

In [18]:
ball_data.loc[:, 'players_btwn_ball_and_goal'] = np.hstack(players_btwn_ball_and_goal)
ball_data

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z,dist_to_goal,angle_to_goal,players_btwn_ball_and_goal
0,-2246.02,-3926.62,1718.94,6517.4,-10724.5,-5849.6,2959.196673,24.202630,0
1,-2224.30,-3962.34,1699.00,6510.6,-10713.3,-6060.0,2917.668378,24.311425,0
2,-2208.04,-3989.11,1683.59,6505.5,-10704.9,-6217.8,2886.426750,24.388253,0
3,-2175.54,-4042.59,1651.58,6495.3,-10688.1,-6532.8,2823.659862,24.529133,0
4,-2148.49,-4087.09,1623.70,6486.8,-10674.1,-6795.0,2771.095169,24.631417,0
...,...,...,...,...,...,...,...,...,...
1019,436.58,-5031.17,568.86,8401.2,-29599.1,-5841.7,553.749767,156.616265,0
1020,450.58,-5080.48,558.99,8396.8,-29583.9,-5946.9,546.898205,163.016949,0
1021,485.90,-5177.71,482.15,6793.1,-17521.4,-16925.0,529.470846,176.657933,0
1022,497.22,-5206.90,453.81,0.0,0.0,0.0,529.119641,178.949691,0


## Step 4:

(number of players within 400 units of the ball)

In [19]:
col_indeces = [26, 43, 60, 77, 94, 111]  # dist_to_ball columns, one per player
dist_to_ball = sample_df.iloc[:, col_indeces]

# per time-step, count how many players are within 400 units of the ball
players_within_400 = (dist_to_ball <= 400).sum(axis=1).to_numpy()
players_within_400

array([0, 1, 1, ..., 0, 0, 0])

In [20]:
ball_data.loc[:, 'players_within_400_of_ball'] = players_within_400
ball_data

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z,dist_to_goal,angle_to_goal,players_btwn_ball_and_goal,players_within_400_of_ball
0,-2246.02,-3926.62,1718.94,6517.4,-10724.5,-5849.6,2959.196673,24.202630,0,0
1,-2224.30,-3962.34,1699.00,6510.6,-10713.3,-6060.0,2917.668378,24.311425,0,1
2,-2208.04,-3989.11,1683.59,6505.5,-10704.9,-6217.8,2886.426750,24.388253,0,1
3,-2175.54,-4042.59,1651.58,6495.3,-10688.1,-6532.8,2823.659862,24.529133,0,1
4,-2148.49,-4087.09,1623.70,6486.8,-10674.1,-6795.0,2771.095169,24.631417,0,1
...,...,...,...,...,...,...,...,...,...,...
1019,436.58,-5031.17,568.86,8401.2,-29599.1,-5841.7,553.749767,156.616265,0,0
1020,450.58,-5080.48,558.99,8396.8,-29583.9,-5946.9,546.898205,163.016949,0,0
1021,485.90,-5177.71,482.15,6793.1,-17521.4,-16925.0,529.470846,176.657933,0,0
1022,497.22,-5206.90,453.81,0.0,0.0,0.0,529.119641,178.949691,0,0


## Step 5:

(spectral embedding for the game)

Given a graph whose nodes are the players and the ball, build a similarity matrix whose edge weights are derived from the pairwise distances between nodes. A Gaussian (RBF) kernel converts distance into similarity:

$$
S(i,j) = \exp{(-\frac{d(i,j)^2}{2\sigma^2})}
$$

where $\sigma$ controls the neighborhood size (set to a percentile of the distance between pairs of points).

If players are not on the same team, their similarity is 0.
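
A minimal sketch of this kernel for one time-step, with $\sigma$ chosen as a percentile of the pairwise distances. The function name `rbf_similarity`, the `team_ids` convention (ball marked `-1`, players `0` or `1`), and the default of the 50th percentile are assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np

def rbf_similarity(positions, team_ids, percentile=50.0):
    """Similarity matrix S(i, j) = exp(-d(i, j)^2 / (2 sigma^2)) with cross-team entries zeroed."""
    positions = np.asarray(positions, dtype=float)  # (n, 3), ball first
    team_ids = np.asarray(team_ids)                 # length n, ball marked -1

    # pairwise Euclidean distances, shape (n, n)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # sigma := chosen percentile of the distances between distinct pairs
    off_diag = dist[np.triu_indices(len(positions), k=1)]
    sigma = np.percentile(off_diag, percentile)

    sim = np.exp(-dist**2 / (2.0 * sigma**2))

    # players on different teams get similarity 0; the ball stays connected to everyone
    is_player = team_ids >= 0
    cross_team = (team_ids[:, None] != team_ids[None, :]) & is_player[:, None] & is_player[None, :]
    sim[cross_team] = 0.0

    return sim, sigma
```

A percentile-based $\sigma$ adapts to how spread out the players are in a given frame; a single fixed $\sigma$ for the whole dataset is the simpler alternative and keeps similarities comparable across time-steps.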

### Steps:

(IMPORTANT: Need to generalize the algorithm for different team sizes, so if there's only one player per team [like in 1v1s], it's just the player and the ball)

1. Find the team size given the input DataFrame

2. Find the distance between each pair of teammates, and each player's distance to the ball

3. Pass each distance (distance to teammates, distance to ball) into the RBF Kernel to compute similarity

4. Make the Similarity Matrix

5. Take the bottom non-trivial eigenvector (the one with the second-smallest eigenvalue) of the graph Laplacian at each time-step, and store it as a row vector, e.g. [[1, 1, 1], [2, 2, 2], etc.]

6. Average the eigenvector entries of the players on each team (first step of Lloyd's algorithm) to get the center of each team cluster
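
The last two steps can be sketched for a single similarity matrix as below. The names `fiedler_embedding` and `team_state` are hypothetical. One addition that is not part of the steps above: an eigenvector is only defined up to sign, so the sketch flips it so that the ball's entry is non-negative, which keeps the feature from flipping sign arbitrarily between consecutive time-steps.

```python
import numpy as np

def fiedler_embedding(sim):
    """Eigenvector of the second-smallest eigenvalue of the unnormalized Laplacian L = D - S."""
    sim = np.asarray(sim, dtype=float)
    laplacian = np.diag(sim.sum(axis=1)) - sim

    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    v = eigenvectors[:, 1].copy()

    # sign convention (an assumption of this sketch): ball entry is non-negative
    if v[0] < 0:
        v = -v
    return v

def team_state(v, num_players):
    """Split the embedding into the ball entry and the mean entry of each team."""
    half = num_players // 2
    ball = v[0]
    team1 = v[1 : 1 + half].mean()
    team2 = v[1 + half : 1 + num_players].mean()
    return ball, team1, team2
```

Because the ball is connected to every player, the graph is connected and this eigenvector is the first non-trivial one; its entries tend to take opposite signs for the two teams.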

In [21]:
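# need to find a "good enough" sigma for the dataset
# manual search given the data (matplotlib skills)

def RBF_kernel(distance, sigma):
    # Gaussian (RBF) similarity.
    # NOTE: this evaluates exp(-(distance**2 / 2) * sigma**2), i.e. `sigma` multiplies
    # the squared distance, so it acts as an INVERSE length scale: the σ in the
    # formula above corresponds to 1/sigma (sigma=0.0003 -> σ ≈ 3333 units).
    s = np.exp(-1 * (distance**2 / 2) * sigma**2)

    return s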

def compute_distance_matrices(df, num_players):
    num_entities = num_players + 1  # players + 1 ball (7 entities for a 3v3)

    # Reshape the dataframe into (timesteps, entities, 3)
    positions = df.values.reshape(len(df), num_entities, 3)  # (T, E, 3)

    # Compute pairwise distances: Euclidean norm over the last axis after broadcasting
    dist_matrices = np.linalg.norm(positions[:, :, np.newaxis, :] - positions[:, np.newaxis, :, :], axis=-1)  # (T, E, E)

    return dist_matrices

def find_player_count(df):
    
    if df.iloc[0, 44] == -9999999:
        return 2
    elif df.iloc[0, 78] == -9999999:
        return 4
    else:
        return 6
    
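def fill_matrix_with_zeros(matrix):
    # zero the cross-team similarities in place; row/column 0 is the ball,
    # followed by team 1's players and then team 2's players
    num_players = matrix.shape[0] - 1  # derived from the matrix instead of a global
    half = num_players // 2

    for i in range(1, matrix.shape[0]):
        if i <= half:
            matrix[i][half+1:num_players+1] = 0
        else:
            matrix[i][1:half+1] = 0

    return matrix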

def spectral_embedding_1d(L):
    # Compute eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eigh(L)

    # Get the second smallest eigenvector (first non-trivial one)
    X = eigenvectors[:, np.argsort(eigenvalues)[1]]  # 1D vector

    return X

In [22]:
# getting ball and players positions
num_players = find_player_count(sample_df)
pos_cols = [0,1,2, 10,11,12, 27,28,29, 44,45,46, 61,62,63, 78,79,80, 95,96,97]
ball_player_positions = sample_df.iloc[:, pos_cols[0:9+(3*(num_players-2))]].copy(deep=False)

# computing similarity matrices for each time-step
matrices = compute_distance_matrices(ball_player_positions, num_players)
sim_matrices = RBF_kernel(matrices, sigma=0.0003)

# setting similarities for players on different teams to 0 (each matrix is modified in place)
for s in sim_matrices:
    fill_matrix_with_zeros(s)

# computing laplacian matrices
laplacians = []
for s in sim_matrices:
    d = np.diag(s.sum(axis=1))
    l = d - s
    laplacians.append(l)

# finding the bottom 1 eigenvector for each time-step (i.e. for each matrix)
eigvecs = []
for l in laplacians:
    v = spectral_embedding_1d(l)
    eigvecs.append(v)
        
# storing the vectors in a dataframe object to concat to the main dataframe (each value in the vector is in a separate column)
eigvec_df = pd.DataFrame(eigvecs)
eigvec_df


# separating ball_state and each team's clusters 
ball_state = eigvec_df.iloc[:,0].copy(deep=False)
team1_cluster = np.sum(eigvec_df.iloc[:,1:int(num_players/2+1)], axis=1)/(num_players/2)
team2_cluster = np.sum(eigvec_df.iloc[:,int(num_players/2+1):num_players+1], axis=1)/(num_players/2)

# adding columns
ball_data.loc[:,'ball_state'] = ball_state
ball_data.loc[:,'team1_cluster'] = team1_cluster
ball_data.loc[:,'team2_cluster'] = team2_cluster
ball_data

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z,dist_to_goal,angle_to_goal,players_btwn_ball_and_goal,players_within_400_of_ball,ball_state,team1_cluster,team2_cluster
0,-2246.02,-3926.62,1718.94,6517.4,-10724.5,-5849.6,2959.196673,24.202630,0,0,-0.113231,0.259431,-0.221688
1,-2224.30,-3962.34,1699.00,6510.6,-10713.3,-6060.0,2917.668378,24.311425,0,1,-0.119079,0.251285,-0.211592
2,-2208.04,-3989.11,1683.59,6505.5,-10704.9,-6217.8,2886.426750,24.388253,0,1,0.119030,-0.251590,0.211913
3,-2175.54,-4042.59,1651.58,6495.3,-10688.1,-6532.8,2823.659862,24.529133,0,1,0.119802,-0.250493,0.210559
4,-2148.49,-4087.09,1623.70,6486.8,-10674.1,-6795.0,2771.095169,24.631417,0,1,0.124701,-0.244811,0.203244
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1019,436.58,-5031.17,568.86,8401.2,-29599.1,-5841.7,553.749767,156.616265,0,0,-0.094945,-0.342050,0.373699
1020,450.58,-5080.48,558.99,8396.8,-29583.9,-5946.9,546.898205,163.016949,0,0,-0.096355,-0.338166,0.370284
1021,485.90,-5177.71,482.15,6793.1,-17521.4,-16925.0,529.470846,176.657933,0,0,-0.098603,-0.335281,0.368149
1022,497.22,-5206.90,453.81,0.0,0.0,0.0,529.119641,178.949691,0,0,-0.101894,-0.328104,0.362069


## Step 6:

(standardize and scale certain variables, add label)

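Standardize:
- positions, velocities

Min-Max Scale: 
- distance, angle, and the two player counts

Leave unchanged:
- the spectral-embedding features (`ball_state`, `team1_cluster`, `team2_cluster`)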
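
For reference, the two transformations amount to the following column-wise arithmetic. This is a plain NumPy sketch of what the scalers compute, with hypothetical helper names `standardize` and `min_max`; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and leaves constant columns undivided to avoid a division by zero.

```python
import numpy as np

def standardize(x):
    """Column-wise z-score: (x - mean) / std, with population std (ddof=0)."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # constant columns are only centered
    return (x - x.mean(axis=0)) / std

def min_max(x):
    """Column-wise rescaling to [0, 1]: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns map to 0
    return (x - lo) / span
```

For example, a player count of 1 in a column whose values range from 0 to 3 becomes 1/3, matching the 0.333333 entries in the scaled table. If the model is evaluated on held-out matches, the means, standard deviations, minima, and maxima would normally be computed on the training data only and reused for the rest.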

In [23]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer

standardize_cols = ['pos_x','pos_y','pos_z','vel_x','vel_y','vel_z']
scale_cols = ['dist_to_goal','angle_to_goal','players_btwn_ball_and_goal','players_within_400_of_ball']
nothing = ['ball_state','team1_cluster','team2_cluster']

preprocessor = ColumnTransformer([
    ('standard', StandardScaler(), standardize_cols),
    ('minmax', MinMaxScaler(), scale_cols),
], remainder='passthrough')

transformed_values = preprocessor.fit_transform(ball_data)

ball_data[standardize_cols + scale_cols] = transformed_values[:, :len(standardize_cols) + len(scale_cols)]
ball_data

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z,dist_to_goal,angle_to_goal,players_btwn_ball_and_goal,players_within_400_of_ball,ball_state,team1_cluster,team2_cluster
0,-1.835837,-1.357269,1.900797,0.816533,-0.971154,-0.742783,0.437958,0.111929,0.0,0.000000,-0.113231,0.259431,-0.221688
1,-1.822936,-1.367518,1.855929,0.815956,-0.970346,-0.783202,0.431645,0.112551,0.0,0.333333,-0.119079,0.251285,-0.211592
2,-1.813278,-1.375199,1.821255,0.815524,-0.969739,-0.813515,0.426896,0.112990,0.0,0.333333,0.119030,-0.251590,0.211913
3,-1.793973,-1.390544,1.749228,0.814658,-0.968526,-0.874027,0.417354,0.113795,0.0,0.333333,0.119802,-0.250493,0.210559
4,-1.777906,-1.403313,1.686495,0.813937,-0.967515,-0.924396,0.409363,0.114380,0.0,0.333333,0.124701,-0.244811,0.203244
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1019,-0.242408,-1.674194,-0.687031,0.976340,-2.333927,-0.741266,0.072278,0.868742,0.0,0.000000,-0.094945,-0.342050,0.373699
1020,-0.234092,-1.688342,-0.709240,0.975967,-2.332830,-0.761475,0.071236,0.905325,0.0,0.000000,-0.096355,-0.338166,0.370284
1021,-0.213112,-1.716240,-0.882140,0.839921,-1.461900,-2.870388,0.068587,0.983290,0.0,0.000000,-0.098603,-0.335281,0.368149
1022,-0.206388,-1.724616,-0.945909,0.263647,-0.196830,0.380936,0.068533,0.996389,0.0,0.000000,-0.101894,-0.328104,0.362069


In [24]:
ball_data.loc[:, 'label'] = sample_df.label
ball_data 

Unnamed: 0,pos_x,pos_y,pos_z,vel_x,vel_y,vel_z,dist_to_goal,angle_to_goal,players_btwn_ball_and_goal,players_within_400_of_ball,ball_state,team1_cluster,team2_cluster,label
0,-1.835837,-1.357269,1.900797,0.816533,-0.971154,-0.742783,0.437958,0.111929,0.0,0.000000,-0.113231,0.259431,-0.221688,1
1,-1.822936,-1.367518,1.855929,0.815956,-0.970346,-0.783202,0.431645,0.112551,0.0,0.333333,-0.119079,0.251285,-0.211592,1
2,-1.813278,-1.375199,1.821255,0.815524,-0.969739,-0.813515,0.426896,0.112990,0.0,0.333333,0.119030,-0.251590,0.211913,1
3,-1.793973,-1.390544,1.749228,0.814658,-0.968526,-0.874027,0.417354,0.113795,0.0,0.333333,0.119802,-0.250493,0.210559,1
4,-1.777906,-1.403313,1.686495,0.813937,-0.967515,-0.924396,0.409363,0.114380,0.0,0.333333,0.124701,-0.244811,0.203244,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1019,-0.242408,-1.674194,-0.687031,0.976340,-2.333927,-0.741266,0.072278,0.868742,0.0,0.000000,-0.094945,-0.342050,0.373699,1
1020,-0.234092,-1.688342,-0.709240,0.975967,-2.332830,-0.761475,0.071236,0.905325,0.0,0.000000,-0.096355,-0.338166,0.370284,1
1021,-0.213112,-1.716240,-0.882140,0.839921,-1.461900,-2.870388,0.068587,0.983290,0.0,0.000000,-0.098603,-0.335281,0.368149,1
1022,-0.206388,-1.724616,-0.945909,0.263647,-0.196830,0.380936,0.068533,0.996389,0.0,0.000000,-0.101894,-0.328104,0.362069,1


## Step 7:

(make and save torch dataset)
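
Since every sequence has the same length, the slicing can also be done with a single reshape. The sketch below uses NumPy only and a hypothetical helper name `to_sequences`; it assumes rows are ordered so that each consecutive block of `seq_length` rows is one sequence, and that the label is constant within a sequence.

```python
import numpy as np

def to_sequences(features, labels, seq_length=64):
    """Reshape (T, F) features into (T // seq_length, seq_length, F) with one label per sequence."""
    features = np.asarray(features, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.float32)

    n_seqs = len(features) // seq_length
    usable = n_seqs * seq_length  # drop any incomplete trailing sequence

    x = features[:usable].reshape(n_seqs, seq_length, features.shape[1])
    y = labels[:usable:seq_length]  # label of the first row of each sequence

    return x, y
```

The resulting arrays can be converted with `torch.from_numpy` and written to the `datasets` directory with `torch.save`; the file name and layout are left open here.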

In [34]:
import torch
from torch.utils.data import Dataset

class TimeSeriesDataset(Dataset):
    def __init__(self, dataframes, labels):
        """
        Args:
            dataframes (list of pd.DataFrame): List of DataFrames (each is a sequence).
            labels (list of int or float): List of labels (one per DataFrame).
        """
        self.data = [torch.tensor(df.values, dtype=torch.float32) for df in dataframes]
        self.labels = torch.tensor(labels, dtype=torch.float32)  # Use float32 for regression, long for classification

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

In [42]:
SEQ_LENGTH = 64
NUM_SEQS = ball_data.shape[0] // SEQ_LENGTH

features = []
labels = []
for i in range(NUM_SEQS):
    start, stop = SEQ_LENGTH * i, SEQ_LENGTH * (i + 1)
    data = ball_data.iloc[start:stop, 0:13].copy(deep=False)  # the 13 feature columns
    label = ball_data.iloc[start:stop].label.values[0]  # one label per sequence

    features.append(data)
    labels.append(label)

torch_dataset = TimeSeriesDataset(features, labels)