# GPS exploratory analysis

In this notebook, we first clean the GPS data and then merge it with the games data to get the date and final score of each game. We also compute the maximum speed and acceleration of each player in each game as a possible measure of that player's performance in that game. (Note that we first average the values over one second (10 frames) and then take the maximum.)
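
The "average over one second, then take the max" step can be sketched on toy data as follows. The frame values below are made up for illustration, and the column names simply mirror the ones used later in this notebook; this is a minimal sketch, not the exact pipeline.

```python
import pandas as pd

# Hypothetical toy data: one player, two seconds sampled at 10 frames per second.
frames = pd.DataFrame({
    "PlayerID": [1] * 20,
    "GameClockInSeconds": [0] * 10 + [1] * 10,
    "Speed": [2.0] * 9 + [9.0] + [4.0] * 10,  # a one-frame spike in second 0
})

# Step 1: average the frames inside each one-second window.
per_second = frames.groupby(
    ["PlayerID", "GameClockInSeconds"], as_index=False
)["Speed"].mean()

# Step 2: take the maximum of the per-second averages for each player.
max_speed = per_second.groupby("PlayerID")["Speed"].max()

print(per_second)
print(max_speed)
```

Averaging first damps single-frame spikes (likely sensor noise): the raw maximum in this toy data is 9.0, while the maximum of the one-second averages is 4.0.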

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt

games_df= pd.read_csv('https://www.dropbox.com/s/tk13ad7sca5nkwy/games.csv?dl=1')
gps_df= pd.read_csv('https://www.dropbox.com/s/n7pvlxy60qwyy91/gps.csv?dl=1')

# Fill in the missing longitude data for each player based on the previous frame (it
# seems that in some cases when the longitude stays the same it is recorded as NaN in the dataset)
gps_df['Longitude'] = gps_df.groupby('PlayerID')['Longitude'].ffill()
print(games_df.columns)
print(gps_df.columns)

In [None]:
# Count the rows where the longitude was NaN for a player and could not
# be filled from the previous frames, then drop them
# (note: dropna() without `subset` drops rows with a NaN in any column)
print(len(gps_df) - gps_df['Longitude'].count())
gps_df.dropna(inplace=True)
print(len(gps_df))

In [None]:
# Converting game clock to seconds to group by seconds later on
gps_df['GameClock'] = pd.to_datetime(gps_df.GameClock, format = '%H:%M:%S')
gps_df['GameClockInSeconds'] = gps_df.GameClock.dt.hour * 3600 + gps_df.GameClock.dt.minute * 60 + gps_df.GameClock.dt.second
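
As an alternative sketch of the same conversion, `pd.to_timedelta` can parse `HH:MM:SS` strings directly and return the total number of seconds. The sample clock values below are made up, and this assumes the raw `GameClock` column holds strings in that format.

```python
import pandas as pd

# Hypothetical clock strings in HH:MM:SS format.
clock = pd.Series(["00:00:00", "00:01:05", "01:02:03"])

# Parse as durations and convert to whole seconds.
seconds = pd.to_timedelta(clock).dt.total_seconds().astype(int)

print(seconds.tolist())
```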

Merging the GPS data with the games data to get the dates. We keep the team points and game outcome as well.

In [None]:
# Removing players with player ID greater than 17
gps_df = gps_df[ (gps_df['PlayerID'] < 18) ]

gps_with_date = gps_df.merge(games_df[['GameID', 'Date', 'TeamPoints','Outcome']],
                    how='inner', on='GameID').drop_duplicates()
gps_with_date.head()

In [None]:
# Negative AccelX shows when the player has stopped and we don't want that 
# to be considered as a player's best performance during a game
gps_with_date = gps_with_date[(gps_with_date['AccelX'] >= 0)]

accel_data = gps_with_date.copy()
accel_data[['MaxAccImpulse', 'MaxAccLoad']] = gps_with_date.groupby(['GameID','PlayerID'])[['AccelImpulse', 'AccelLoad']].transform('max')
accel_data.head()

In [None]:
second_data = gps_with_date.groupby(['Date','GameID','Outcome','TeamPoints',
                                     'PlayerID','Half','GameClockInSeconds'],
                                    as_index = False)[['Speed','AccelImpulse',
                                                       'AccelLoad', 'AccelX']].mean()

second_data.head()

In [None]:
second_data['MaxSpeedInGame'] = second_data.groupby(['GameID','PlayerID'])['Speed'].transform('max')
second_data['MaxAccelImpulseInGame'] = second_data.groupby(['GameID','PlayerID'])['AccelImpulse'].transform('max')
second_data['MaxAccelLoadInGame'] = second_data.groupby(['GameID','PlayerID'])['AccelLoad'].transform('max')
second_data['MaxAccelXInGame'] = second_data.groupby(['GameID','PlayerID'])['AccelX'].transform('max')
second_data.head()

In [None]:
idx = second_data['MaxSpeedInGame'] == second_data['Speed']
second_data[idx]['Half'].value_counts()

We can see that a player's max speed is almost equally likely to occur in the first half or in the second half.

In [None]:
second_data[idx].groupby('PlayerID')['Half'].value_counts()

In [None]:
second_data[idx].groupby('PlayerID')['GameID'].unique()

In [None]:
second_data[idx].groupby('PlayerID')['GameID'].nunique()

In general, the maximum speed can occur in either the first half or the second half. For some players, the max speed occurs in the second half more often than in the first half across all games, while other players show the reverse pattern.<br>
Also, after printing out the game IDs that each player participated in, we can see that the regular players, who played in almost all games, are players 2, 4, 7, 8, 10, 11 and 13.
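
One way to pick out regular players programmatically, instead of reading the printed lists, is to compare each player's number of distinct games with the total number of games. The toy data and the 75% threshold below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical appearances: one row per (player, game) observation.
appearances = pd.DataFrame({
    "PlayerID": [2, 2, 2, 2, 5, 5, 7, 7, 7],
    "GameID":   [1, 2, 3, 4, 1, 1, 1, 2, 4],
})

# Number of distinct games each player appeared in.
games_per_player = appearances.groupby("PlayerID")["GameID"].nunique()
total_games = appearances["GameID"].nunique()

# Assumed cut-off: a "regular" player appears in at least 75% of all games.
min_share = 0.75
regular_players = games_per_player[
    games_per_player >= min_share * total_games
].index.tolist()

print(regular_players)
```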

In [None]:
speed_data = second_data[['Date','GameID','Outcome','TeamPoints','PlayerID','MaxSpeedInGame','MaxAccelImpulseInGame']].drop_duplicates()
speed_data.to_csv('./processed_data/processed_gps.csv')
speed_data = speed_data.sort_values(by=['PlayerID', 'Date'])


In [None]:
def box_plot_and_ttest(group1_name, group2_name, column1, column2, ylabel, title):
    # Draw side-by-side boxplots of the two groups, then run an independent two-sample t-test
    plt.figure()
    sns.set(context='notebook', style='whitegrid')
    ax = sns.boxplot(data=[column1, column2])
    # sns.utils.axlabel is no longer available in recent seaborn versions; label the axes directly
    ax.set(xlabel="group", ylabel=ylabel)
    plt.xticks(plt.xticks()[0], [group1_name, group2_name])
    plt.title(title)
    plt.show()
    t, p = ttest_ind(column1, column2)
    print("p-value is: %.4f, t-statistic is: %.2f" % (p, t))
    
    
print(speed_data.columns)
avg_speed_per_game = speed_data[['GameID', 'Outcome', 'TeamPoints', 'MaxSpeedInGame', 'MaxAccelImpulseInGame']].groupby(['GameID','Outcome','TeamPoints'],as_index=False).mean()
winners = avg_speed_per_game[avg_speed_per_game['Outcome'] == 'W']
losers = avg_speed_per_game[avg_speed_per_game['Outcome'] == 'L']

print('Max speed average')
print(winners['MaxSpeedInGame'].mean())
print(losers['MaxSpeedInGame'].mean())
print('max accel average')
print(winners['MaxAccelImpulseInGame'].mean())
print(losers['MaxAccelImpulseInGame'].mean())

box_plot_and_ttest('Winners', 'Losers', winners['MaxSpeedInGame'], losers['MaxSpeedInGame'], 'MaxSpeedInGame', 'Max Speed In Game For Winners and Losers')
box_plot_and_ttest('Winners', 'Losers', winners['MaxAccelImpulseInGame'], losers['MaxAccelImpulseInGame'], 'MaxAccelImpulseInGame', 'Max Acceleration Impulse In Game For Winners and Losers')

From both the boxplots and the t-tests, we can see that the average max speed of players in the games they won is higher than in the games they lost, and the difference is statistically significant. However, that is not the case for max acceleration (the p-value is large). This suggests that max speed might be a better performance metric for players than max acceleration.
It is also worth noting that if a player's speed was 0 in one frame and 0.4 m/s in the next, the acceleration would be 4 m/s² (0.4 m/s over 0.1 s at 10 frames per second), which is a relatively large value. Conversely, if a player maintains a high speed, the acceleration is close to zero and does not reflect the player's true performance. So using max speed as a measure of performance seems more reasonable than using acceleration.
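
The arithmetic behind this argument can be checked with a small sketch that approximates acceleration as the frame-to-frame change in speed divided by the frame interval. The speed values are made up, and the 10 frames per second rate is taken from the note at the top of this notebook.

```python
import numpy as np

FRAME_RATE_HZ = 10  # 10 frames per second, so one frame lasts 0.1 s

# Hypothetical case 1: a player starts moving slowly from rest.
slow_start = np.array([0.0, 0.4, 0.4, 0.4])  # speed in m/s per frame
# Hypothetical case 2: a player sustains a high speed.
steady_sprint = np.array([8.0, 8.0, 8.0, 8.0])

# Finite-difference acceleration: change in speed per frame times frames per second.
accel_slow_start = np.diff(slow_start) * FRAME_RATE_HZ
accel_steady_sprint = np.diff(steady_sprint) * FRAME_RATE_HZ

print(accel_slow_start.max(), slow_start.max())
print(accel_steady_sprint.max(), steady_sprint.max())
```

The slow start produces a peak acceleration of about 4 m/s² at a top speed of only 0.4 m/s, while the sustained sprint at 8 m/s produces zero acceleration.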