The three main components currently in this dataset are:

1. The individual players' current performance stats.
2. The individual players' past performance stats (the amount of historical data varies by player).
3. A list of future match fixtures.

All the data was taken from the official Fantasy Premier League website.

N.B. Much of the data was assembled from the output of publicly accessible JSON endpoints, so it contains many duplicates (the fixture data was originally organised from the perspective of the individual players). Because much of this data exists to drive the UI of a web application, it also contains many redundant fields, all of which could do with being cleaned up.

Inspiration

Many of my friends are massively into all aspects of the Premier League (fantasy or otherwise), so my main motivation in putting this dataset together was to see whether I could gain a competitive advantage over my very knowledgeable friends with little to no domain knowledge myself.

The most obvious questions this data could answer concern predicting the future performance of players from historical metrics.

In [352]:
# Importing libraries
import pandas as pd
import numpy as np
import plotly as ply
import plotly.graph_objs as go

ply.offline.init_notebook_mode(connected = True)
pd.options.display.max_columns = None
pd.options.mode.chained_assignment = None  # default='warn'

In [353]:
# loading data
ply_perf = pd.read_csv('C:/Datasets/Football/fantasy-premier-league-201718/historical-performance.csv')
curr_stat = pd.read_csv('C:/Datasets/Football/fantasy-premier-league-201718/player-info.csv')

In [354]:
# Remove unwanted columns: 'id' is unique for every row, and 'season' is redundant given the season name
ply_perf.drop(['season' , 'id'], axis = 1 , inplace = True)

### Analysis Roadmap

1. Group data around various performance metrics by each player
2. Exploratory visualization of performance
3. Try to model with points scored as the label
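
A minimal sketch of what step 3 could look like, assuming ordinary least squares via `np.linalg.lstsq` as a stand-in for whatever model is eventually chosen. The `toy` frame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per player-season, values invented for illustration
toy = pd.DataFrame({
    'player_id':    [1, 1, 2, 2, 3, 3],
    'goals_scored': [10, 12, 2, 3, 0, 1],
    'assists':      [5, 7, 8, 6, 1, 0],
    'total_points': [180, 210, 130, 125, 40, 55],
})

# Feature matrix with an intercept column; total_points is the label
features = toy[['goals_scored', 'assists']].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(features)), features])
y = toy['total_points'].to_numpy(dtype=float)

# Ordinary least squares fit (assumed baseline, not necessarily the final model)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef

print(dict(zip(['intercept', 'goals_scored', 'assists'], coef.round(2))))
```

A linear baseline like this is mainly useful as a reference point: any later model should at least beat it on held-out seasons.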

In [355]:
# Aggregate key performance indicators per player.
# The columns are passed as a list: recent pandas versions reject a bare comma-separated selection.
kpi_cols = ['assists' , 'bonus' , 'bps' , 'clean_sheets', 'goals_conceded' , 'goals_scored' ,
            'own_goals' , 'penalties_missed' , 'penalties_saved' , 'red_cards' , 'saves' ,
            'yellow_cards', 'total_points', 'minutes']
ply_dat = ply_perf.groupby(by = 'player_id')[kpi_cols].sum()

In [356]:
# Get player names, team names and positions from the curr_stat table.
# The 'id' merge key is kept for now; it is excluded later when the numeric columns are selected.
info_cols = ['id' , 'first_name', 'second_name', 'element_type_singular_name_short', 'team_short_name']
ply_dat = ply_dat.merge(curr_stat[info_cols], how = 'left', left_index = True , right_on = 'id')
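
Because this is a left join, any player in the historical table with no row in `curr_stat` ends up with missing name, position and team. A quick way to check for that is the `indicator` argument of `merge`; the sketch below uses made-up frames (`totals`, `info`) rather than the real data.

```python
import pandas as pd

# Hypothetical toy frames: player 3 has no row in the lookup table
totals = pd.DataFrame({'total_points': [390, 255, 95]},
                      index=pd.Index([1, 2, 3], name='player_id'))
info = pd.DataFrame({'id': [1, 2],
                     'first_name': ['Ann', 'Bob'],
                     'second_name': ['Lee', 'Ray']})

# indicator=True adds a '_merge' column recording where each row came from
merged = totals.merge(info, how='left', left_index=True, right_on='id', indicator=True)

# Rows flagged 'left_only' found no match in the lookup table
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched), 'player(s) without current info')
```

If the count is non-zero on the real data, those players will show up with a missing `ply_name` in the plots.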

In [357]:
# Concatenate first name and second name, drop the two source columns, and shorten the position/team column names
ply_dat['ply_name'] = ply_dat['first_name'] + ' ' + ply_dat['second_name']
ply_dat.drop(['first_name' , 'second_name'] , axis = 1 , inplace = True)
ply_dat.rename(columns = {'element_type_singular_name_short': 'pos' , 'team_short_name' : 'team'}, inplace = True)

In [358]:
# Select the integer columns, excluding 'id' (a key) and 'minutes' (the denominator used below)
num_col = ply_dat.select_dtypes(['int64']).drop(['id', 'minutes'], axis = 1)

In [359]:
# Calculate the per-90-minute stats (stat / minutes played * 90)
for i in num_col:
    temp = i + '_p90m'
    ply_dat[temp] = ply_dat[i]/ply_dat['minutes'] * 90
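
One caveat with per-90 figures: players with zero minutes produce `inf` or `NaN`, and players with very few minutes produce inflated rates that sort to the top of any ranking. A minimal sketch of one way to guard against this, on made-up data and with a hypothetical `min_minutes` threshold (900 is an arbitrary choice, not something taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering the awkward cases
toy = pd.DataFrame({
    'ply_name':     ['A', 'B', 'C', 'D'],
    'goals_scored': [20, 1, 0, 1],
    'minutes':      [3000, 45, 0, 0],
})
min_minutes = 900  # assumed cut-off for a "reliable" sample

toy['goals_scored_p90m'] = toy['goals_scored'] / toy['minutes'] * 90

# x/0 gives inf and 0/0 gives NaN; treat both as missing
toy['goals_scored_p90m'] = toy['goals_scored_p90m'].replace([np.inf, -np.inf], np.nan)

# Keep only players with enough minutes for the rate to mean something
reliable = toy[toy['minutes'] >= min_minutes]
print(reliable[['ply_name', 'goals_scored_p90m']])
```

Player B here scores once in 45 minutes and would otherwise rank above the 20-goal player A, which is the kind of distortion the threshold is meant to remove.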

In [375]:
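# Plot the top 50 players for each per-90 metric, with minutes played on a secondary axis
plot_cols = ['assists_p90m', 'bonus_p90m', 'bps_p90m', 'clean_sheets_p90m', 'goals_conceded_p90m', 'goals_scored_p90m',
             'own_goals_p90m', 'penalties_missed_p90m', 'penalties_saved_p90m', 'red_cards_p90m', 'saves_p90m',
             'yellow_cards_p90m', 'total_points_p90m']
for i in plot_cols:
    # Top 50 players by the current metric
    temp = ply_dat.sort_values(by = i, ascending = False).head(n = 50)
    # Metric on the left axis
    trace = go.Scatter(x = temp['ply_name'],
                       y = temp[i],
                       name = i)
    # Minutes played on the right axis, to show how large each player's sample is
    trace1 = go.Scatter(x = temp['ply_name'],
                        y = temp['minutes'],
                        name = 'minutes',
                        yaxis = 'y2' )
    data = [trace, trace1]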
    lay = {'title' : i, 
           'xaxis' : {'tickangle' : 45, 'showgrid' : False}, 
           'yaxis2' : {'side' : 'right', 'overlaying' : 'y', 'showgrid' : False}, 
           'yaxis' : {'showgrid' : False}
          }
    fig = {'data' : data, 'layout' : lay }
    ply.offline.iplot(fig)