### Reading and understanding the datasets

Tracking data can be combined with event data by matching on the columns timeelapsed and current_phase, which appear in both datasets.

#### The tracking data
The tracking data contains the following columns:

+ 'current_phase': the current period
+ 'timeelapsed': time elapsed in seconds since the start of the current period
+ 'team_id_opta': Opta team id
+ 'player_id': Opta player id
+ 'jersey_no': jersey number of the player
+ 'pos_x': x-coordinate on the pitch; pitch coordinates in [-52.5, 52.5]
+ 'pos_y': y-coordinate on the pitch; pitch coordinates in [-34, 34]
+ 'frame_count': unique identifier for each frame
+ 'team_id': indicates home (=1) or away (=2); team_id 4 is the ball
+ 'speed': speed
+ 'acc': acceleration
+ 'speed_x': speed component along the x-axis
+ 'speed_y': speed component along the y-axis
+ 'ball_x': x location of the ball
+ 'ball_y': y location of the ball
+ 'ball_speed': ball speed
+ 'ball_acc': ball acceleration
+ 'dop': direction of play of the team ('L' --> Left-to-Right; 'R' --> Right-to-Left)
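
Because 'dop' records which way a team is attacking, one common preprocessing step is to mirror the coordinates so that every team plays left to right. The sketch below is only an illustration on made-up rows: it assumes that mirroring through the pitch centre (0, 0) is the desired normalisation, and the names `pos_x_norm` and `pos_y_norm` are hypothetical, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names pos_x, pos_y and dop come from the list above.
df = pd.DataFrame({
    "pos_x": [10.0, 10.0],
    "pos_y": [-5.0, -5.0],
    "dop": ["L", "R"],
})

# Assumption: rows played right-to-left ('R') are mirrored through the
# pitch centre, which is (0, 0) given the coordinate ranges listed above.
sign = np.where(df["dop"] == "R", -1.0, 1.0)
df["pos_x_norm"] = df["pos_x"] * sign  # hypothetical column name
df["pos_y_norm"] = df["pos_y"] * sign  # hypothetical column name

print(df)
```

If this convention is adopted, the same sign flip would also need to be applied to 'ball_x', 'ball_y', 'speed_x' and 'speed_y'.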

#### The event data
This event data is the Opta event data and contains the following columns:
+ 'event_type_id': the Opta event type identifier; see 'event_description' for an explanation
+ 'contestantId': id of the team
+ 'playerId': id of the player
+ 'current_phase': the current period
+ 'timeelapsed': time elapsed in seconds since the start of the current period
+ 'period_minute': the match minute in which the event occurs
+ 'period_second': the second within that minute in which the event occurs
+ 'outcome': outcome of the event, 1=successful, 0=otherwise
+ 'event_description': descriptions of 'event_type_id' (see below)
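
Given the two column lists above, the join between events and tracking frames can be sketched as follows. The frames here are toy data with made-up values; only the column names are taken from the descriptions. Rounding the key to the millisecond is an assumption added to guard against floating-point noise in an equality join.

```python
import pandas as pd

# Toy tracking frames (made-up values).
df_tracking = pd.DataFrame({
    "current_phase": [1, 1, 1, 2],
    "timeelapsed": [0.00, 0.04, 0.08, 0.00],
    "player_id": [10, 10, 10, 10],
    "pos_x": [0.0, 0.5, 1.0, -3.0],
    "pos_y": [0.0, 0.1, 0.2, 2.0],
})

# Toy events (made-up values).
df_events = pd.DataFrame({
    "current_phase": [1, 2],
    "timeelapsed": [0.04, 0.00],
    "event_type_id": [1, 32],
})

# Assumption: round both keys to 3 decimals so tiny float differences
# do not prevent otherwise identical times from matching.
for df in (df_tracking, df_events):
    df["timeelapsed"] = df["timeelapsed"].round(3)

# A left join keeps every event and attaches the tracking rows of the same frame.
merged = df_events.merge(
    df_tracking, on=["current_phase", "timeelapsed"], how="left"
)
print(merged)
```

On the real data each event would match one tracking row per player (plus the ball) in that frame, so the merged frame is expected to be much longer than the event frame.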

In [2]:
import pandas as pd
import json
import os
import numpy as np

pd.options.display.max_columns = 999

We read the tracking dataset and the table that maps each event type id to its name.

In [5]:
# load tracking data
current_directory = os.getcwd()
path_tracking = os.path.join(os.path.dirname(current_directory), 'data', 'tracking_set_0')
print(path_tracking)
game_id = 1

df_tracking = pd.read_parquet(f'{path_tracking}/{game_id}_tracking.parquet')

#           ------------------------------------------------------------        

# load events names
path_event_csv = os.path.join(os.path.dirname(current_directory),'data')
df_event_names = pd.read_csv(os.path.join(path_event_csv,'event_names.csv'))
dict_event_names = df_event_names.set_index('event_type_id').to_dict()['event_description']


c:\Users\Gabriel\OneDrive\Escritorio\SportsAnalyticsCourse\OptaForum\OptaChallenge_Clustering_Player_Styles\data\tracking_set_0


We read the event dataset, map it to the dictionary of event names, and add the timeelapsed column, which is the one that links events to the tracking data.
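
The rounding used in the loader below snaps each event time to a multiple of 40 ms. A minimal sketch on made-up timestamps shows the effect; it assumes the 40 ms step corresponds to tracking sampled at 25 frames per second, which is not stated in the source.

```python
import pandas as pd

# Made-up timestamps; the first one plays the role of the period start.
ts = pd.to_datetime(pd.Series([
    "2023-01-01T15:00:00.000Z",
    "2023-01-01T15:00:01.234Z",
    "2023-01-01T15:00:02.990Z",
]))

# Whole milliseconds elapsed since the first timestamp.
ms = (ts - ts.iloc[0]) // pd.Timedelta(milliseconds=1)

# Snap to the nearest multiple of 40 ms, then convert back to seconds.
timeelapsed = (ms / 40).round() * 40 / 1000

print(pd.DataFrame({"ms": ms, "timeelapsed": timeelapsed}))
```

After this step every event time lies on the same 40 ms grid, which is what makes an exact match on timeelapsed possible.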

In [8]:
# load event data
def load_event_data(file_name, base_path):
    # read in event file
    # the with block closes the file on exit, so no explicit close is needed
    with open(f'{base_path}/{file_name}') as f:
        data = json.load(f)
    
    # transform data into pandas dataframe
    df_events = pd.json_normalize(data['liveData']['event'])
    
    # preprocess event data and keep relevant information only

    # add timeelapsed to each event
    df_events['timestamp'] = pd.to_datetime(df_events.timeStamp).apply(lambda x: x.timestamp())

    # keep first and second half only; copy so later column assignments do not act on a view
    df_events = df_events.query('periodId in [1,2]').copy()

    def add_timeelapsed_to_events(df):
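        # the first typeId 32 event of the period is taken as time zero
        start_time = df.query('typeId==32')['timestamp'].iloc[0]
        # milliseconds since the start of the period, truncated to an integer
        df['timestamp_new'] = ((df['timestamp'] - start_time) * 1000).astype('int64')

        # snap to the nearest multiple of 40 ms and convert back to seconds
        df['timeelapsed'] = df['timestamp_new'].apply(lambda x: (40 * round(x/40))/1000)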

        return df

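    # group_keys=False keeps the original index, as recommended by the pandas warning for transform-like apply
    df_events = df_events.groupby('periodId', group_keys=False).apply(add_timeelapsed_to_events)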

    df_events = df_events.drop(columns=['timeStamp','timestamp','timestamp_new'])
    
    # rename some columns
    df_events = df_events.rename(columns=
        {
            'periodId':'current_phase',
            'typeId':'event_type_id',
            'timeMin':'period_minute',
            'timeSec':'period_second'
        }
    )
    
    return df_events

path_events = os.path.join(os.path.dirname(current_directory), 'data', 'first_10_events')
print(path_events)

event_file = f'{game_id}.json'

df_events = load_event_data(
    base_path=path_events,
    file_name=event_file
)

# add event descriptions
df_events['event_description'] = df_events['event_type_id'].map(dict_event_names)

c:\Users\Gabriel\OneDrive\Escritorio\SportsAnalyticsCourse\OptaForum\OptaChallenge_Clustering_Player_Styles\data\first_10_events




The event dataset has a column named qualifier. This column holds nested dictionaries; merging it with qualifier_names.csv gives more detailed information about each event.

To do:
- Pick an event and check what information its qualifiers provide. Try this with several different events.
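
One way to start on this is sketched below on toy data. It assumes that each entry of qualifier is a list of dictionaries with the keys `qualifierId` and, optionally, `value`. The columns `event_id`, `qualifier_id` and `qualifier_description` are hypothetical stand-ins, since the actual layout of qualifier_names.csv is not shown here, and the descriptions are illustrative.

```python
import pandas as pd

# Toy events; the 'qualifier' column mimics a list of dicts per event (assumed layout).
df_events = pd.DataFrame({
    "event_id": [1, 2],  # hypothetical identifier column
    "event_type_id": [1, 16],
    "qualifier": [
        [{"qualifierId": 140, "value": "55.2"}, {"qualifierId": 141, "value": "40.1"}],
        [{"qualifierId": 22}],
    ],
})

# Hypothetical stand-in for qualifier_names.csv; the real column names may differ.
df_qualifier_names = pd.DataFrame({
    "qualifier_id": [22, 140, 141],
    "qualifier_description": ["Regular play", "Pass End X", "Pass End Y"],
})

# One row per (event, qualifier) pair.
exploded = df_events.explode("qualifier").reset_index(drop=True)

# Expand each qualifier dict into its own columns.
details = pd.json_normalize(exploded["qualifier"].tolist())
long = pd.concat([exploded.drop(columns="qualifier"), details], axis=1)

# Attach the readable qualifier name.
long = long.merge(
    df_qualifier_names, left_on="qualifierId", right_on="qualifier_id", how="left"
)
print(long)
```

Events with an empty or missing qualifier list would need to be filtered out before the `json_normalize` step, because exploding them produces missing values rather than dictionaries.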