This notebook performs an exploratory analysis of the khl_results_2008_2019.csv file.

In [1]:
# Importing standard packages for data analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Reading in original data for the 2008-2019 KHL game results

games = pd.read_csv('source_files/khl_results_2008_2019.csv')

The columns in the original data are described below:


DATE - date

DAY - day

MONTH - month

YEAR - year

SEASON - season

HOMETEAM - home team

AWAYTEAM - away team

WINNER - winner, including all overtimes (OT) and shoot outs (SO)

HG - home goals, including all overtimes (OT) and shoot outs (SO)

AG - away goals, including all overtimes (OT) and shoot outs (SO)

ADD - if there is overtime (OT) or shoot out (SO)

HG1 - home goals 1 period

AG1 - away goals 1 period

TOTAL1 - total goals 1 period

HG2 - home goals 2 period

AG2 - away goals 2 period

TOTAL2 - total goals 2 period

HG3 - home goals 3 period

AG3 - away goals 3 period

TOTAL3 - total goals 3 period

HGOT - home goals in OT

AGOT - away goals in OT

HGSO - home goals in SO

AGSO - away goals in SO

TOTALOT - total goals OT

TOTALMAIN - Total goals 3 periods only

TOTALFULL - total goals full game (including OT and SO - plus 1 goal to winning team)

TOTALHMAIN - total goals home team 3 periods

TOTALAWAYMAIN - total goals away team 3 periods

GAP3PERIODS - absolute difference between the home and away teams' goals scored over the 3 periods.
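
Several of the columns above are described as sums or differences of the per-period columns, so the descriptions can be checked against the data. The following is a minimal sketch on made-up rows, assuming the third-period home goals column is named HG3 and that the derived columns are defined exactly as described; `toy` and `check_period_totals` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy rows using the column names described above (values are made up).
toy = pd.DataFrame({
    'HG1': [1, 0], 'AG1': [0, 1],
    'HG2': [1, 2], 'AG2': [1, 0],
    'HG3': [0, 1], 'AG3': [1, 1],
    'TOTALHMAIN': [2, 3], 'TOTALAWAYMAIN': [2, 2],
    'TOTALMAIN': [4, 5], 'GAP3PERIODS': [0, 1],
})


def check_period_totals(df):
    """Return a boolean frame: True where a derived column matches its description."""
    home = df[['HG1', 'HG2', 'HG3']].sum(axis=1)
    away = df[['AG1', 'AG2', 'AG3']].sum(axis=1)
    return pd.DataFrame({
        'TOTALHMAIN': df['TOTALHMAIN'].eq(home),
        'TOTALAWAYMAIN': df['TOTALAWAYMAIN'].eq(away),
        'TOTALMAIN': df['TOTALMAIN'].eq(home + away),
        'GAP3PERIODS': df['GAP3PERIODS'].eq((home - away).abs()),
    })


checks = check_period_totals(toy)
print(checks.all())  # one True/False per derived column
```

Running the same function on the full `games` frame would show which descriptions hold for every row, provided the real column names match the ones assumed here.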

In [49]:
# The original file includes a number of columns with data on specific periods of the match.
# While this may be helpful for some analysis, we are going to look at data on just the match level for now.

# .copy() gives an independent dataframe, so adding columns below does not raise SettingWithCopyWarning.
summary = games[['DATE', 'DAY', 'MONTH', 'YEAR', 'SEASON', 'HOMETEAM', 'AWAYTEAM',
                 'WINNER', 'HG', 'AG', 'TOTALFULL']].copy()

# The absolute difference in goals scored is an interesting metric to add to our data.
# However, GAP3PERIODS does not include the difference coming from OT or SO.
# To stay consistent with our decision to look at the match level only, we calculate the metric ourselves.

summary['GAP'] = (summary['HG'] - summary['AG']).abs()

# Finally, let us save the resulting dataframe in case we need to reference it later in another notebook.

summary.to_csv('game_results.csv', sep='\t', encoding='utf8', index=False)
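
Selecting a subset of columns and then assigning a new column to the result is a pattern for which pandas may emit a SettingWithCopyWarning, since it is ambiguous whether the assignment is meant to affect the original frame. A minimal sketch of taking an explicit copy first, on a made-up frame (`toy_games` and `toy_summary` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Made-up stand-in for the full games table.
toy_games = pd.DataFrame({
    'HOMETEAM': ['A', 'B'], 'AWAYTEAM': ['B', 'A'],
    'HG': [2, 5], 'AG': [3, 1], 'HG1': [1, 2],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and leaves toy_games untouched.
toy_summary = toy_games[['HOMETEAM', 'AWAYTEAM', 'HG', 'AG']].copy()
toy_summary['GAP'] = (toy_summary['HG'] - toy_summary['AG']).abs()

print(toy_summary)
print('GAP' in toy_games.columns)  # the original frame has no GAP column
```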


In [4]:
summary.head()

,DATE,DAY,MONTH,YEAR,SEASON,HOMETEAM,AWAYTEAM,WINNER,HG,AG,TOTALFULL,GAP
0,12/9/2019,9,12,2019,1920,Avangard Omsk,Barys Nur-Sultan,Barys Nur-Sultan,2,3,5,1
1,12/9/2019,9,12,2019,1920,Dyn. Moscow,SKA St. Petersburg,SKA St. Petersburg,2,3,5,1
2,12/9/2019,9,12,2019,1920,Jokerit,Sp. Moscow,Jokerit,3,2,5,1
3,12/9/2019,9,12,2019,1920,Podolsk,Nizhny Novgorod,Podolsk,3,2,5,1
4,12/9/2019,9,12,2019,1920,Metallurg Magnitogorsk,Dinamo Riga,Metallurg Magnitogorsk,5,1,6,4


In [5]:
summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9282 entries, 0 to 9281
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   DATE       9282 non-null   object
 1   DAY        9282 non-null   int64 
 2   MONTH      9282 non-null   int64 
 3   YEAR       9282 non-null   int64 
 4   SEASON     9282 non-null   int64 
 5   HOMETEAM   9282 non-null   object
 6   AWAYTEAM   9282 non-null   object
 7   WINNER     9282 non-null   object
 8   HG         9282 non-null   int64 
 9   AG         9282 non-null   int64 
 10  TOTALFULL  9282 non-null   int64 
 11  GAP        9282 non-null   int64 
dtypes: int64(8), object(4)
memory usage: 870.3+ KB
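
The info() output lists DATE as an object (string) column. If date arithmetic or sorting is needed later, it can be parsed into a datetime column. The sketch below assumes the month/day/year layout suggested by the sample rows (for example 12/9/2019 alongside DAY 9 and MONTH 12) and uses a small made-up frame named `toy`; it also cross-checks the parsed dates against the DAY, MONTH and YEAR columns.

```python
import pandas as pd

# A few dates in the same layout as the DATE column, with matching parts.
toy = pd.DataFrame({
    'DATE': ['12/9/2019', '12/30/2018', '1/3/2018'],
    'DAY': [9, 30, 3],
    'MONTH': [12, 12, 1],
    'YEAR': [2019, 2018, 2018],
})

# Explicit format: month/day/year (an assumption based on the sample rows).
parsed = pd.to_datetime(toy['DATE'], format='%m/%d/%Y')

# True where the parsed date agrees with the separate DAY/MONTH/YEAR columns.
consistent = (parsed.dt.day.eq(toy['DAY'])
              & parsed.dt.month.eq(toy['MONTH'])
              & parsed.dt.year.eq(toy['YEAR']))

print(parsed.dtype)
print(consistent.all())
```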


In [6]:
summary.describe()

,DAY,MONTH,YEAR,SEASON,HG,AG,TOTALFULL,GAP
count,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0
mean,16.005495,7.456367,2014.011312,1371.986318,2.726891,2.391187,5.117539,2.000431
std,8.834632,4.155381,3.353693,329.79159,1.662074,1.55787,2.207108,1.272232
min,1.0,1.0,2008.0,809.0,0.0,0.0,1.0,0.0
25%,8.0,2.0,2011.0,1112.0,2.0,1.0,3.0,1.0
50%,16.0,9.0,2014.0,1415.0,3.0,2.0,5.0,2.0
75%,24.0,11.0,2018.0,1617.0,4.0,3.0,7.0,3.0
max,31.0,12.0,2019.0,1920.0,12.0,11.0,17.0,12.0


In [17]:
summary.groupby(['SEASON', 'YEAR']).size()

SEASON  YEAR
809     2008    443
        2009    259
910     2009    438
        2010    271
1011    2010    415
        2011    263
1112    2011    394
        2012    283
1213    2012    480
        2013    282
1314    2013    530
        2014    314
1415    2014    573
        2015    347
1516    2015    624
        2016    296
1617    2016    644
        2018    300
1718    2018    834
1819    2018    546
        2019    309
1920    2019    437
dtype: int64

The values in the SEASON column encode the two years a season spans (for example, 809 stands for 2008-2009), which is the format the KHL website uses to store season data. Since this is not easy to read, let us change it to indicate more clearly which season a given match was played in.

In addition, we can see that the matches of seasons 1617 and 1718 are not split across two consecutive years like those of all other seasons. Season 1617 matches took place in 2016 and 2018, and season 1718 matches took place only in 2018. One possible explanation is that all 2017 matches were somehow mislabeled as 2018, so let us investigate.

Season 1920 matches also took place only in 2019, but judging by the number of games, that is because the data was collected in the middle of the season and is therefore incomplete for season 1920.
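
One way to look for such rows programmatically is to decode each season code into its two years and flag matches whose YEAR is neither of them. This is a minimal sketch on made-up rows, assuming every code is the concatenation of two two-digit years in the 2000s (809 for 2008 and 2009, 1617 for 2016 and 2017); `toy` and `YEAR_OUTSIDE_SEASON` are illustrative names.

```python
import pandas as pd

# Made-up rows mirroring the season/year combinations seen in the counts.
toy = pd.DataFrame({
    'SEASON': [809, 1617, 1617, 1718, 1920],
    'YEAR': [2008, 2016, 2018, 2018, 2019],
})

# Decode the code: the leading digits are the start year, the last two the end year.
start = 2000 + toy['SEASON'] // 100
end = 2000 + toy['SEASON'] % 100

# Flag rows whose YEAR matches neither year of the season.
toy['YEAR_OUTSIDE_SEASON'] = ~(toy['YEAR'].eq(start) | toy['YEAR'].eq(end))

print(toy[toy['YEAR_OUTSIDE_SEASON']])
```

Note that this check only catches years that fall outside the season. A season such as 1718 whose matches all carry its second year would not be flagged, so the per-season counts are still needed to spot a missing year.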

In [36]:
summary.groupby('SEASON')['YEAR'].unique()

SEASON
809     [2009, 2008]
910     [2010, 2009]
1011    [2011, 2010]
1112    [2012, 2011]
1213    [2013, 2012]
1314    [2014, 2013]
1415    [2015, 2014]
1516    [2016, 2015]
1617    [2018, 2016]
1718          [2018]
1819    [2019, 2018]
1920          [2019]
Name: YEAR, dtype: object

In [41]:
dict_seasons = {1920: '2019-2020',
                1819: '2018-2019',
                1718: '2017-2018',
                1617: '2016-2017',
                1516: '2015-2016',
                1415: '2014-2015',
                1314: '2013-2014',
                1213: '2012-2013',
                1112: '2011-2012',
                1011: '2010-2011',
                910: '2009-2010',
                809: '2008-2009'}

# assign() returns a new dataframe, so no SettingWithCopyWarning is raised.
summary = summary.assign(SEASON=summary['SEASON'].map(dict_seasons))
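
The mapping could also be generated from the codes themselves rather than typed by hand. A minimal sketch, assuming every code is two two-digit years in the 2000s joined together; `season_label` and `generated` are hypothetical names introduced only for illustration.

```python
def season_label(code):
    """Turn a season code such as 809 or 1920 into a label like '2008-2009'."""
    start, end = divmod(code, 100)  # 809 -> (8, 9), 1920 -> (19, 20)
    return f"{2000 + start}-{2000 + end}"


codes = [809, 910, 1011, 1112, 1213, 1314, 1415, 1516, 1617, 1718, 1819, 1920]
generated = {code: season_label(code) for code in codes}

print(generated[809], generated[1920])
```

The resulting dictionary can be passed to `map` in the same way as a hand-written one.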


In [45]:
summary[summary['YEAR'] == 2018]

,DATE,DAY,MONTH,YEAR,SEASON,HOMETEAM,AWAYTEAM,WINNER,HG,AG,TOTALFULL,GAP
746,12/30/2018,30,12,2018,2018-2019,Slovan Bratislava,Jokerit,Jokerit,1,3,4,2
747,12/30/2018,30,12,2018,2018-2019,Dinamo Riga,Dyn. Moscow,Dinamo Riga,5,4,9,1
748,12/30/2018,30,12,2018,2018-2019,Avangard Omsk,Barys Nur-Sultan,Avangard Omsk,4,3,7,1
749,12/30/2018,30,12,2018,2018-2019,CSKA Moscow,Nizhny Novgorod,CSKA Moscow,2,1,3,1
750,12/30/2018,30,12,2018,2018-2019,Lokomotiv Yaroslavl,Din. Minsk,Din. Minsk,0,3,3,3
...,...,...,...,...,...,...,...,...,...,...,...,...
2421,1/3/2018,3,1,2018,2016-2017,Niznekamsk,Salavat Ufa,Niznekamsk,1,0,1,1
2422,1/3/2018,3,1,2018,2016-2017,Lada,Metallurg Magnitogorsk,Metallurg Magnitogorsk,2,4,6,2
2423,1/3/2018,3,1,2018,2016-2017,Avangard Omsk,Dyn. Moscow,Dyn. Moscow,2,3,5,1
2424,1/3/2018,3,1,2018,2016-2017,Sibir Novosibirsk,Lokomotiv Yaroslavl,Sibir Novosibirsk,3,2,5,1
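
Because the filtered output is truncated, a season-by-month count of the rows labeled 2018 may be easier to read than the raw rows. The sketch below uses made-up rows (`toy` is an illustrative frame, not the real data) to show the shape of such a table; on the real data it would be applied to `summary`.

```python
import pandas as pd

# Made-up rows: all labeled 2018, spread over three seasons.
toy = pd.DataFrame({
    'SEASON': ['2018-2019', '2018-2019', '2017-2018', '2016-2017', '2016-2017'],
    'YEAR': [2018, 2018, 2018, 2018, 2018],
    'MONTH': [12, 12, 9, 1, 1],
})

rows_2018 = toy[toy['YEAR'] == 2018]

# Rows are seasons, columns are months, cells are match counts.
counts = pd.crosstab(rows_2018['SEASON'], rows_2018['MONTH'])

print(counts)
```

Months that appear under more than one season for the same year would point to where the labeling goes wrong.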
