Introduction
--
In our ever changing and evolving field of technology one of the foremost topics is data analytics. The growth in the amount of digital data that is being collected across many different fields is massive and . That is, taking data in whatever raw form it exists and using technology to transform it into information that has value and context. In most cases data analysis is performed in order to provide class descriptions of data, highlight behaviors, trends, associations in the data or predictive information that prove useful or even vital to key decision-makers. 


In [1]:
import numpy as np
import pandas as pd
import requests
import json
import matplotlib.pyplot as plt

Data Collection
--
Using the NHL API following the documentation found at https://gitlab.com/dword4/nhlapi/tree/master

For each player in the specified year range (years must be consecutive) collect all avalible stats 

In [2]:
def get_csv_skaters(y1, y2):
    team_rosters = requests.get('https://statsapi.web.nhl.com/api/v1/teams?expand=team.roster&season=' + y1 + y2)
    team_rosters = team_rosters.json()
    players= []
    for i in range(0, len(team_rosters['teams'])):
        for j in range(0, len(team_rosters['teams'][i]['roster']['roster'])):
            player = [team_rosters['teams'][i]['roster']['roster'][j]['person']['id'], 
                      team_rosters['teams'][i]['roster']['roster'][j]['person']['fullName'],
                     team_rosters['teams'][i]['name']]
            if (team_rosters['teams'][i]['roster']['roster'][j]['position']['code'] != 'G'):
                players.append(player)
    players_stats = []
    labels = requests.get('https://statsapi.web.nhl.com/api/v1/people/' 
                           + str(players[i][0]) 
                           + '/stats?stats=statsSingleSeason&season=' + y1 + y2).json()
    labels = labels['stats'][0]['splits'][0]['stat']
    header = ['id', 'fullName', 'teamName']
    for label in labels:
        header.append(label)
    for i in range(0, len(players)): 
        stats = requests.get('https://statsapi.web.nhl.com/api/v1/people/' 
                           + str(players[i][0]) 
                           + '/stats?stats=statsSingleSeason&season=' + y1 + y2).json()
        if(stats['stats'][0]['splits'] == []):
            players_stats.append([0] * len(labels))
            continue
        stats = stats['stats'][0]['splits'][0]['stat']
        stats_array = []
        for label in labels:
            if label in stats:
                stats_array.append(stats[label])
            else:
                stats_array.append(0)
        players_stats.append(stats_array)
        
    skaters = []
    skaters.append(header)
    for i in range(0, len(players)):
        skaters.append(players[i] + players_stats[i])
    return skaters

Returns the skaters data for a year range and saves the result as a csv file.
Years must be in the range [1917, 2019], note that the 2004-2005 season is skipped as this was a lockout year.

In [3]:
def get_skaters_data(start, end):
    for i in range(start, end):
        if(i == 2004):
            continue
        print("Getting skaters data for " + str(i) + "-" + str(i+1) + " season.")
        skaters = get_csv_skaters(str(i), str(i+1))
        np.savetxt('data/skaters_' + str(i) + '_' + str(i+1) + '.csv', skaters, fmt='%s', delimiter=',')

In [None]:
get_skaters_data(1917,1918)

For each team in the specified year range (years must be consecutive) collect all avalible stats 

In [10]:
def get_csv_team(y1, y2):
    teams = requests.get('https://statsapi.web.nhl.com/api/v1/teams?season=' + str(y1) + str(y2))
    teams = teams.json()
    team_id_name = []
    for i in range(0, len(teams['teams'])):
        team_arr = [teams['teams'][i]['id'], teams['teams'][i]['name']]
        team_id_name.append(team_arr)

    labels = requests.get('https://statsapi.web.nhl.com/api/v1/teams/' 
                           + str(team_id_name[0][0])
                           + '/stats?stats=statsSingleSeason&season=' + str(y1) + str(y2)).json()
    labels = labels['stats'][0]['splits'][0]['stat']
    
    header = ['id', 'teamName']
    for label in labels:
        header.append(label)
    header.append('PDO')
    team_stats = []
    for i in range(0, len(team_id_name)):
        stats = requests.get('https://statsapi.web.nhl.com/api/v1/teams/' 
                             + str(team_id_name[i][0]) 
                             + '/stats?stats=statsSingleSeason&season=' + str(y1) + str(y2)).json()
        if(stats['stats'][0]['splits'] == []):
            team_stats.append([0] * len(labels))
            continue
        stats = stats['stats'][0]['splits'][0]['stat']
        stats_array = []
        for label in labels:
            if label in stats:
                stats_array.append(stats[label])
            else:
                stats_array.append(0)
        team_stats.append(stats_array)
        stats_array.append((stats['shootingPctg']/100) + stats['savePctg'])
    
    teams_stats_final = []
    teams_stats_final.append(header)
    for i in range(0, len(team_id_name)):
        teams_stats_final.append(team_id_name[i] + team_stats[i]) 
    return teams_stats_final

Gets the team data for a year range and saves the result as a csv file.
Years must be in the range [1917, 2019], note that the 2004-2005 season is skipped as this was a lockout year.

In [11]:
def get_team_data(start, end):
    for i in range(start, end):
        if(i == 2004):
            continue
        print("Getting team data for " + str(i) + "-" + str(i+1) + " season.")
        data = get_csv_team(str(i), str(i+1))
        np.savetxt('team_data/teams_' + str(i) + '_' + str(i+1) + '.csv', data, fmt='%s', delimiter=',')

In [13]:
get_team_data(2000, 2018)

Getting team data for 2000-2001 season.
Getting team data for 2001-2002 season.
Getting team data for 2002-2003 season.
Getting team data for 2003-2004 season.
Getting team data for 2005-2006 season.
Getting team data for 2006-2007 season.
Getting team data for 2007-2008 season.
Getting team data for 2008-2009 season.
Getting team data for 2009-2010 season.
Getting team data for 2010-2011 season.
Getting team data for 2011-2012 season.
Getting team data for 2012-2013 season.
Getting team data for 2013-2014 season.
Getting team data for 2014-2015 season.
Getting team data for 2015-2016 season.
Getting team data for 2016-2017 season.
Getting team data for 2017-2018 season.


Returns the index of the team stats 

In [14]:
def get_team_stats(id, team_data):
    for i in range(0, len(team_data)):
        if team_data[i][0] == id:
            return i

For each game in the specified year range (years must be consecutive) return the winner, away team ID, home team ID, and the away and home team stats for that season

In [15]:
def get_csv_game(y1, y2):
    team_data = get_csv_team(y1, y2)
    header = team_data[0][3:]
    away_header = []
    home_header = []
    for head in header:
        away_header.append('away_'+head)
        home_header.append('home_'+head)
    games_data = [['winner', 'awayID', 'homeID'] + away_header + home_header]
    games = requests.get('https://statsapi.web.nhl.com/api/v1/schedule?startDate=' 
                         + str(y1) + '-10-01&endDate=' + str(y2) + '-06-30')
    games = games.json()
    for date in games['dates']:
        for game in date['games']:
            away_ID = game['teams']['away']['team']['id']
            home_ID = game['teams']['home']['team']['id']
            if away_ID > 80 or home_ID > 80:
                continue
            
            away_score = game['teams']['away']['score']
            home_score = game['teams']['home']['score']
            winner = 0
            away_stats = team_data[get_team_stats(away_ID, team_data)][3:]
            home_stats = team_data[get_team_stats(home_ID, team_data)][3:]
            if home_score > away_score:
                winner = 1
            games_data.append([winner,
                          away_ID, 
                          home_ID] +
                          away_stats +
                          home_stats)
    return games_data

Gets the game data for a year range and saves the result as a csv file.
Years must be in the range [1917, 2019], note that the 2004-2005 season is skipped as this was a lockout year.

In [17]:
def get_game_data(start, end):
    for i in range(start, end):
        if(i == 2004):
            continue
        print("Getting game data for " + str(i) + "-" + str(i+1) + " season.")
        data = get_csv_game(i, i+1)
        np.savetxt('game_data/game_data_' + str(i) + '_' + str(i+1) + '.csv', data, fmt='%s', delimiter=',')

In [18]:
get_game_data(2000,2018)

Getting game data for 2000-2001 season.
Getting game data for 2001-2002 season.
Getting game data for 2002-2003 season.
Getting game data for 2003-2004 season.
Getting game data for 2005-2006 season.
Getting game data for 2006-2007 season.
Getting game data for 2007-2008 season.
Getting game data for 2008-2009 season.
Getting game data for 2009-2010 season.
Getting game data for 2010-2011 season.
Getting game data for 2011-2012 season.
Getting game data for 2012-2013 season.
Getting game data for 2013-2014 season.
Getting game data for 2014-2015 season.
Getting game data for 2015-2016 season.
Getting game data for 2016-2017 season.
Getting game data for 2017-2018 season.


Setup master csv and normalized master csv file of all the games data for later use

In [26]:
data_2000_2001 = pd.read_csv('game_data/game_data_2000_2001.csv', header=0)
data_2001_2002 = pd.read_csv('game_data/game_data_2001_2002.csv', header=0)
data_2002_2003 = pd.read_csv('game_data/game_data_2002_2003.csv', header=0)
data_2003_2004 = pd.read_csv('game_data/game_data_2003_2004.csv', header=0)
data_2005_2006 = pd.read_csv('game_data/game_data_2005_2006.csv', header=0)
data_2006_2007 = pd.read_csv('game_data/game_data_2006_2007.csv', header=0)
data_2007_2008 = pd.read_csv('game_data/game_data_2007_2008.csv', header=0)
data_2008_2009 = pd.read_csv('game_data/game_data_2008_2009.csv', header=0)
data_2009_2010 = pd.read_csv('game_data/game_data_2009_2010.csv', header=0)
data_2010_2011 = pd.read_csv('game_data/game_data_2010_2011.csv', header=0)
data_2011_2012 = pd.read_csv('game_data/game_data_2011_2012.csv', header=0)
data_2012_2013 = pd.read_csv('game_data/game_data_2012_2013.csv', header=0)
data_2013_2014 = pd.read_csv('game_data/game_data_2013_2014.csv', header=0)
data_2014_2015 = pd.read_csv('game_data/game_data_2014_2015.csv', header=0)
data_2015_2016 = pd.read_csv('game_data/game_data_2015_2016.csv', header=0)
data_2016_2017 = pd.read_csv('game_data/game_data_2016_2017.csv', header=0)
data_2017_2018 = pd.read_csv('game_data/game_data_2017_2018.csv', header=0)

frames = [data_2000_2001, data_2001_2002, data_2002_2003, data_2003_2004, data_2005_2006, 
          data_2006_2007, data_2007_2008, data_2008_2009, data_2009_2010, data_2010_2011, 
          data_2011_2012, data_2012_2013, data_2013_2014, data_2014_2015, data_2015_2016, 
          data_2016_2017, data_2017_2018]

data = pd.concat(frames)

data = data.drop(['away_wins', 'away_losses', 'away_ot',
       'away_pts', 'away_ptPctg', 'away_powerPlayGoals',
       'away_powerPlayGoalsAgainst', 'away_powerPlayOpportunities','away_shotsPerGame', 'away_shotsAllowed',
       'away_winScoreFirst', 'away_winOppScoreFirst', 'away_winLeadFirstPer',
       'away_winLeadSecondPer', 'away_winOutshootOpp', 'away_winOutshotByOpp',
       'away_faceOffsTaken', 'away_faceOffsWon', 'away_faceOffsLost',
       'away_faceOffWinPercentage',
       'home_wins', 'home_losses', 'home_ot',
       'home_pts', 'home_ptPctg', 'home_powerPlayGoals',
       'home_powerPlayGoalsAgainst', 'home_powerPlayOpportunities','home_shotsPerGame', 'home_shotsAllowed',
       'home_winScoreFirst', 'home_winOppScoreFirst', 'home_winLeadFirstPer',
       'home_winLeadSecondPer', 'home_winOutshootOpp', 'home_winOutshotByOpp',
       'home_faceOffsTaken', 'home_faceOffsWon', 'home_faceOffsLost',
       'home_faceOffWinPercentage'], axis=1)
header = ['winner', 'awayID', 'homeID', 'away_goalsPerGame',
       'away_goalsAgainstPerGame', 'away_evGGARatio',
       'away_powerPlayPercentage', 'away_penaltyKillPercentage',
       'away_shootingPctg', 'away_savePctg', 'away_PDO', 'home_goalsPerGame',
       'home_goalsAgainstPerGame', 'home_evGGARatio',
       'home_powerPlayPercentage', 'home_penaltyKillPercentage',
       'home_shootingPctg', 'home_savePctg', 'home_PDO']
header = ','.join(header)

np.savetxt('master.csv', data, fmt='%s', delimiter=',', header = header)

In [27]:
def normalize(data):
    max_data = np.max(data, axis=0)
    min_data = np.min(data, axis=0)
    stats = ['away_wins', 'away_losses', 'away_ot',
             'away_pts', 'away_ptPctg', 'away_goalsPerGame',
             'away_goalsAgainstPerGame', 'away_evGGARatio',
             'away_powerPlayPercentage', 'away_powerPlayGoals',
             'away_powerPlayGoalsAgainst', 'away_powerPlayOpportunities',
             'away_penaltyKillPercentage', 'away_shotsPerGame', 'away_shotsAllowed',
             'away_winScoreFirst', 'away_winOppScoreFirst', 'away_winLeadFirstPer',
             'away_winLeadSecondPer', 'away_winOutshootOpp', 'away_winOutshotByOpp',
             'away_faceOffsTaken', 'away_faceOffsWon', 'away_faceOffsLost',
             'away_faceOffWinPercentage', 'away_shootingPctg', 'away_savePctg',
             'home_wins', 'home_losses', 'home_ot', 'home_pts', 'home_ptPctg',
             'home_goalsPerGame', 'home_goalsAgainstPerGame', 'home_evGGARatio',
             'home_powerPlayPercentage', 'home_powerPlayGoals',
             'home_powerPlayGoalsAgainst', 'home_powerPlayOpportunities',
             'home_penaltyKillPercentage', 'home_shotsPerGame', 'home_shotsAllowed',
             'home_winScoreFirst', 'home_winOppScoreFirst', 'home_winLeadFirstPer',
             'home_winLeadSecondPer', 'home_winOutshootOpp', 'home_winOutshotByOpp',
             'home_faceOffsTaken', 'home_faceOffsWon', 'home_faceOffsLost',
             'home_faceOffWinPercentage', 'home_shootingPctg', 'home_savePctg']
    for stat in stats:
        data[stat] = (data[stat] - min_data[stat])/(max_data[stat] - min_data[stat])
    return data

In [28]:
data_2000_2001 = pd.read_csv('game_data/game_data_2000_2001.csv', header=0)
data_2001_2002 = pd.read_csv('game_data/game_data_2001_2002.csv', header=0)
data_2002_2003 = pd.read_csv('game_data/game_data_2002_2003.csv', header=0)
data_2003_2004 = pd.read_csv('game_data/game_data_2003_2004.csv', header=0)
data_2005_2006 = pd.read_csv('game_data/game_data_2005_2006.csv', header=0)
data_2006_2007 = pd.read_csv('game_data/game_data_2006_2007.csv', header=0)
data_2007_2008 = pd.read_csv('game_data/game_data_2007_2008.csv', header=0)
data_2008_2009 = pd.read_csv('game_data/game_data_2008_2009.csv', header=0)
data_2009_2010 = pd.read_csv('game_data/game_data_2009_2010.csv', header=0)
data_2010_2011 = pd.read_csv('game_data/game_data_2010_2011.csv', header=0)
data_2011_2012 = pd.read_csv('game_data/game_data_2011_2012.csv', header=0)
data_2012_2013 = pd.read_csv('game_data/game_data_2012_2013.csv', header=0)
data_2013_2014 = pd.read_csv('game_data/game_data_2013_2014.csv', header=0)
data_2014_2015 = pd.read_csv('game_data/game_data_2014_2015.csv', header=0)
data_2015_2016 = pd.read_csv('game_data/game_data_2015_2016.csv', header=0)
data_2016_2017 = pd.read_csv('game_data/game_data_2016_2017.csv', header=0)
data_2017_2018 = pd.read_csv('game_data/game_data_2017_2018.csv', header=0)

data_2000_2001 = normalize(data_2000_2001)
data_2001_2002 = normalize(data_2001_2002)
data_2002_2003 = normalize(data_2002_2003)
data_2003_2004 = normalize(data_2003_2004)
data_2005_2006 = normalize(data_2005_2006)
data_2006_2007 = normalize(data_2006_2007)
data_2007_2008 = normalize(data_2007_2008)
data_2008_2009 = normalize(data_2008_2009)
data_2009_2010 = normalize(data_2009_2010)
data_2010_2011 = normalize(data_2010_2011)
data_2011_2012 = normalize(data_2011_2012)
data_2012_2013 = normalize(data_2012_2013)
data_2013_2014 = normalize(data_2013_2014)
data_2014_2015 = normalize(data_2014_2015)
data_2016_2017 = normalize(data_2016_2017)
data_2017_2018 = normalize(data_2017_2018)

frames = [data_2000_2001, data_2001_2002, data_2002_2003, data_2003_2004, data_2005_2006, 
          data_2006_2007, data_2007_2008, data_2008_2009, data_2009_2010, data_2010_2011, 
          data_2011_2012, data_2012_2013, data_2013_2014, data_2014_2015, data_2015_2016, 
          data_2016_2017, data_2017_2018]

data = pd.concat(frames)

data = data.drop(['away_wins', 'away_losses', 'away_ot',
       'away_pts', 'away_ptPctg', 'away_powerPlayGoals',
       'away_powerPlayGoalsAgainst', 'away_powerPlayOpportunities','away_shotsPerGame', 'away_shotsAllowed',
       'away_winScoreFirst', 'away_winOppScoreFirst', 'away_winLeadFirstPer',
       'away_winLeadSecondPer', 'away_winOutshootOpp', 'away_winOutshotByOpp',
       'away_faceOffsTaken', 'away_faceOffsWon', 'away_faceOffsLost',
       'away_faceOffWinPercentage',
       'home_wins', 'home_losses', 'home_ot',
       'home_pts', 'home_ptPctg', 'home_powerPlayGoals',
       'home_powerPlayGoalsAgainst', 'home_powerPlayOpportunities','home_shotsPerGame', 'home_shotsAllowed',
       'home_winScoreFirst', 'home_winOppScoreFirst', 'home_winLeadFirstPer',
       'home_winLeadSecondPer', 'home_winOutshootOpp', 'home_winOutshotByOpp',
       'home_faceOffsTaken', 'home_faceOffsWon', 'home_faceOffsLost',
       'home_faceOffWinPercentage'], axis=1)
header = ['winner', 'awayID', 'homeID', 'away_goalsPerGame',
       'away_goalsAgainstPerGame', 'away_evGGARatio',
       'away_powerPlayPercentage', 'away_penaltyKillPercentage',
       'away_shootingPctg', 'away_savePctg', 'away_PDO', 'home_goalsPerGame',
       'home_goalsAgainstPerGame', 'home_evGGARatio',
       'home_powerPlayPercentage', 'home_penaltyKillPercentage',
       'home_shootingPctg', 'home_savePctg', 'home_PDO']
header = ','.join(header)

np.savetxt('master_normalized.csv', data, fmt='%s', delimiter=',', header = header)

Aanaylsis
--

In [1]:
import pandas as pd
import numpy as np
from sklearn import svm, datasets
from sklearn.metrics import classification_report, confusion_matrix   

In [2]:
def prepare(data):
    X = data.iloc[:,3:].values
    # we insert an all-ones column at index 0
    X = np.insert(X, 0, 1, axis=1)
    # get the first column of the data
    y = data.iloc[:,0:1].values
    return X,y

def split_train_test(X,y,pct=80):
    n = X.shape[0]
    s = round(n * pct / 100)
    
    indices = np.random.permutation(n)
    train_idx, test_idx = indices[:s], indices[s:]
    
    X_train, X_test = X[train_idx,:], X[test_idx,:]
    y_train, y_test = y[train_idx,:], y[test_idx,:]
    
    return X_train, y_train, X_test, y_test

def accuracy(pred, labels):
    count = 0
    for i in range(0, len(pred)):
        if(pred[i] == labels[i]):
            count += 1
    return count/len(pred)

SVM using various kernals
--

In [3]:
data = pd.read_csv('master_normalized.csv', header=0)

In [4]:
X,y = prepare(data)

X,Y,X_test,Y_test = split_train_test(X,y,pct=80)
Y = np.concatenate(Y, axis=0 )

In [5]:
data.head()

Unnamed: 0,winner,awayID,homeID,away_goalsPerGame,away_goalsAgainstPerGame,away_evGGARatio,away_powerPlayPercentage,away_penaltyKillPercentage,away_shootingPctg,away_savePctg,away_PDO,home_goalsPerGame,home_goalsAgainstPerGame,home_evGGARatio,home_powerPlayPercentage,home_penaltyKillPercentage,home_shootingPctg,home_savePctg,home_PDO
0,0,21,25,0.803099,0.075019,0.789802,0.932331,0.475248,0.885714,0.7,1.02,0.574564,0.027842,0.620057,0.75188,0.821782,0.942857,0.733333,1.023
1,0,9,6,0.834087,0.197989,0.734781,0.676692,0.841584,0.828571,0.733333,1.019,0.46417,0.613302,0.795786,0.511278,0.475248,0.4,0.0,0.982
2,1,16,7,0.330536,0.584687,0.664542,0.24812,0.594059,0.428571,0.066667,0.985,0.393802,0.0,0.502732,0.488722,1.0,0.485714,1.0,1.015
3,1,23,4,0.55907,0.508894,0.7673,0.56391,0.287129,0.457143,0.066667,0.986,0.566817,0.216551,1.0,0.458647,0.445545,0.457143,0.566667,1.001
4,0,17,20,0.668819,0.169374,0.849896,0.93985,0.772277,0.371429,0.766667,1.004,0.227889,0.490333,0.444459,0.398496,0.188119,0.171429,0.266667,0.982


Linear SVC

In [6]:
linear_svc = svm.SVC(kernel='linear')

In [7]:
linear_svc.fit(X, Y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [8]:
linear_Y_pred = linear_svc.predict(X_test)

In [9]:
linear_Y_pred

array([1, 0, 1, ..., 0, 1, 0])

In [10]:
lin_acc = accuracy(linear_Y_pred, Y_test)

In [11]:
print(lin_acc)

0.590867992766727


In [12]:
print(confusion_matrix(Y_test, linear_Y_pred))  
print(classification_report(Y_test, linear_Y_pred)) 

[[ 930 1173]
 [ 637 1684]]
             precision    recall  f1-score   support

          0       0.59      0.44      0.51      2103
          1       0.59      0.73      0.65      2321

avg / total       0.59      0.59      0.58      4424



Polynomial SVC - degree 2

In [13]:
poly_svc = svm.SVC(kernel='poly', degree=2)

In [14]:
poly_svc.fit(X, Y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=2, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
poly_Y_pred = poly_svc.predict(X_test)  

In [16]:
poly_acc = accuracy(poly_Y_pred, Y_test)

In [17]:
print(poly_acc)

0.5924502712477396


In [18]:
print(confusion_matrix(Y_test, poly_Y_pred))  
print(classification_report(Y_test, poly_Y_pred)) 

[[ 695 1408]
 [ 395 1926]]
             precision    recall  f1-score   support

          0       0.64      0.33      0.44      2103
          1       0.58      0.83      0.68      2321

avg / total       0.61      0.59      0.56      4424



Gaussian SVC 

In [19]:
gaussian_svc = svm.SVC(kernel='rbf')

In [20]:
gaussian_svc.fit(X, Y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [21]:
gaussian_Y_pred = gaussian_svc.predict(X_test)  

In [22]:
gaussian_acc = accuracy(gaussian_Y_pred, Y_test)

In [23]:
print(gaussian_acc)

0.5881555153707052


In [24]:
print(confusion_matrix(Y_test, poly_Y_pred))  
print(classification_report(Y_test, poly_Y_pred)) 

[[ 695 1408]
 [ 395 1926]]
             precision    recall  f1-score   support

          0       0.64      0.33      0.44      2103
          1       0.58      0.83      0.68      2321

avg / total       0.61      0.59      0.56      4424



Sigmoid SVC

In [25]:
sigmoid_svc = svm.SVC(kernel='sigmoid')

In [26]:
sigmoid_svc.fit(X, Y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [27]:
sigmoid_Y_pred = sigmoid_svc.predict(X_test)  

In [28]:
sigmoid_acc = accuracy(sigmoid_Y_pred, Y_test)

In [29]:
print(sigmoid_acc)

0.5339059674502713


In [30]:
print(confusion_matrix(Y_test, sigmoid_Y_pred))  
print(classification_report(Y_test, sigmoid_Y_pred)) 

[[1022 1081]
 [ 981 1340]]
             precision    recall  f1-score   support

          0       0.51      0.49      0.50      2103
          1       0.55      0.58      0.57      2321

avg / total       0.53      0.53      0.53      4424



Ensamble Prediction 

In [31]:
def ensamble(pred1, pred2, pred3, pred4):
    prediction = []
    for i in range(0, len(pred1)):
        p = (pred1[i] + pred2[i] + pred3[i] + pred4[i])
        p = p/4
        if(p < 0.5):
            prediction.append(0)
        else:
            prediction.append(1)
    return prediction

In [32]:
ensamble_pred = ensamble(linear_Y_pred, poly_Y_pred, gaussian_Y_pred, sigmoid_Y_pred)

In [33]:
ensamble_acc = accuracy(ensamble_pred, Y_test)

print(ensamble_acc)

0.594258589511754


Logistic regression in TensorFlow
--

In [52]:
import tensorflow as tf

In [53]:
data = pd.read_csv('master_normalized.csv', header=0)

In [54]:
X,y = prepare(data)

X,Y,X_test,Y_test = split_train_test(X,y,pct=80)
Y = np.concatenate(Y, axis=0 )

In [55]:
data.head()

Unnamed: 0,winner,awayID,homeID,away_goalsPerGame,away_goalsAgainstPerGame,away_evGGARatio,away_powerPlayPercentage,away_penaltyKillPercentage,away_shootingPctg,away_savePctg,away_PDO,home_goalsPerGame,home_goalsAgainstPerGame,home_evGGARatio,home_powerPlayPercentage,home_penaltyKillPercentage,home_shootingPctg,home_savePctg,home_PDO
0,0,21,25,0.803099,0.075019,0.789802,0.932331,0.475248,0.885714,0.7,1.02,0.574564,0.027842,0.620057,0.75188,0.821782,0.942857,0.733333,1.023
1,0,9,6,0.834087,0.197989,0.734781,0.676692,0.841584,0.828571,0.733333,1.019,0.46417,0.613302,0.795786,0.511278,0.475248,0.4,0.0,0.982
2,1,16,7,0.330536,0.584687,0.664542,0.24812,0.594059,0.428571,0.066667,0.985,0.393802,0.0,0.502732,0.488722,1.0,0.485714,1.0,1.015
3,1,23,4,0.55907,0.508894,0.7673,0.56391,0.287129,0.457143,0.066667,0.986,0.566817,0.216551,1.0,0.458647,0.445545,0.457143,0.566667,1.001
4,0,17,20,0.668819,0.169374,0.849896,0.93985,0.772277,0.371429,0.766667,1.004,0.227889,0.490333,0.444459,0.398496,0.188119,0.171429,0.266667,0.982


In [57]:
Y = Y.reshape((Y.shape[0],1))
Y_test = Y_test.reshape((Y_test.shape[0],1))

print("Train dataset shape", X.shape, Y.shape)
print("Test dataset shape", X_test.shape, Y_test.shape)

m   = X.shape[0] 
n_x = X.shape[1]

Train dataset shape (17698, 17) (17698, 1)
Test dataset shape (4424, 17) (4424, 1)


In [58]:
def accuracy(A, Y):
    P = A>.5      #prediction
    num_agreements = np.sum(P==Y)
    return num_agreements / Y.shape[0]

In [59]:
# Input data.
# Load the training and test data into constants
tf_X = tf.constant(X.astype(np.float32))
tf_Y = tf.constant(Y.astype(np.float32))
tf_X_test = tf.constant(X_test.astype(np.float32))
tf_Y_test = tf.constant(Y_test.astype(np.float32))

# Variables.
# These are the parameters that we are going to be training.
tf_w = tf.Variable(tf.zeros((n_x, 1)))
tf_b = tf.Variable(tf.zeros((1,1)))

# Training computation.
# We multiply the inputs with the weight matrix, and add biases. We compute
# the sigmoid and cross-entropy (it's one operation in TensorFlow, because
# it's very common, and it can be optimized). We take the average of this
# cross-entropy across all training examples: that's our cost.
tf_Z = tf.matmul(tf_X, tf_w) + tf_b
tf_J = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(labels=tf_Y, logits=tf_Z) )

# Optimizer.
# We are going to find the minimum of this loss using gradient descent.
# We pass alpha=0.1 as input parameter.
optimizer = tf.train.GradientDescentOptimizer(0.1).minimize(tf_J)

# Predictions for the train and test data.
# These are not part of training, but merely here so that we can report
# accuracy figures as we train.
tf_A = tf.nn.sigmoid(tf_Z)
tf_A_test = tf.nn.sigmoid(tf.matmul(tf_X_test, tf_w) + tf_b)

In [60]:
session = tf.InteractiveSession()

# This is a one-time operation which ensures the parameters get initialized as
# we described in the graph: random weights for the matrix, zeros for the biases. 
tf.global_variables_initializer().run()
print("Initialized")

for iter in range(1000):
    # Run the computations. We tell .run() that we want to run the optimizer,
    # and get the cost value and the training predictions returned as numpy arrays.
    _, J, A = session.run([optimizer, tf_J, tf_A])
    
    print(iter, J)

Initialized
0 0.693131
1 0.746797
2 1.75275
3 1.03357
4 1.51833
5 1.22138
6 1.30908
7 1.39474
8 1.11621
9 1.55757
10 0.934253
11 1.71249
12 0.761854
13 1.78962
14 0.693884
15 1.14816
16 1.37988
17 1.33583
18 1.17094
19 1.50904
20 0.97843
21 1.67161
22 0.797491
23 1.80393
24 0.686385
25 0.806689
26 1.75353
27 1.01832
28 1.52315
29 1.20415
30 1.31649
31 1.37612
32 1.1253
33 1.53802
34 0.944447
35 1.69247
36 0.771973
37 1.79606
38 0.68476
39 0.936593
40 1.60741
41 1.13357
42 1.38709
43 1.31379
44 1.18701
45 1.48165
46 1.00025
47 1.64055
48 0.822701
49 1.78389
50 0.686224
51 1.10835
52 1.40818
53 1.29663
54 1.19878
55 1.47037
56 1.00595
57 1.63348
58 0.82436
59 1.78033
60 0.685096
61 1.10113
62 1.4121
63 1.29087
64 1.201
65 1.46571
66 1.00705
67 1.62961
68 0.82471
69 1.77726
70 0.684441
71 1.11229
72 1.39601
73 1.30159
74 1.18549
75 1.47612
76 0.991925
77 1.6398
78 0.809989
79 1.78255
80 0.680875
81 0.9855
82 1.53624
83 1.18109
84 1.31787
85 1.36039
86 1.11907
87 1.52763
88 0.933195
89 1.6

697 1.54214
698 0.876496
699 1.70353
700 0.711839
701 1.44666
702 0.980889
703 1.6198
704 0.790596
705 1.72784
706 0.694905
707 1.22906
708 1.22411
709 1.41448
710 1.01835
711 1.58676
712 0.827955
713 1.73049
714 0.694166
715 1.22101
716 1.23376
717 1.40534
718 1.02928
719 1.57684
720 0.839575
721 1.72549
722 0.697383
723 1.27453
724 1.17384
725 1.45564
726 0.973136
727 1.62453
728 0.787472
729 1.72474
730 0.697434
731 1.27199
732 1.1763
733 1.45396
734 0.974599
735 1.6235
736 0.788233
737 1.7251
738 0.697125
739 1.26569
740 1.18325
741 1.44816
742 0.980963
743 1.61812
744 0.793915
745 1.72839
746 0.695271
747 1.23502
748 1.21773
749 1.41891
750 1.01376
751 1.59006
752 0.824624
753 1.73069
754 0.694293
755 1.21925
756 1.23568
757 1.40334
758 1.03147
759 1.57469
760 0.841906
761 1.72388
762 0.698501
763 1.28702
764 1.1597
765 1.46737
766 0.959858
767 1.63559
768 0.775621
769 1.71146
770 0.705964
771 1.3795
772 1.05549
773 1.55633
774 0.859824
775 1.71464
776 0.703553
777 1.34968
778 1.0

In [43]:
# Calling .eval() is basically like calling run(), but
# just to get that one numpy array. 
# Note that it recomputes all its computation graph dependencies.
A = tf_A.eval()
A_test = tf_A_test.eval()

print("Accuracy on the train set is ", accuracy(A,Y))
print("Accuracy on the test set is ", accuracy(A_test,Y_test))

Accuracy on the train set is  0.584755339586
Accuracy on the test set is  0.593806509946


Neural network in TensorFlow
--

In [88]:
data = pd.read_csv('master_normalized.csv', header=0)

In [89]:
X,y = prepare(data)

X,Y,X_test,Y_test = split_train_test(X,y,pct=80)

In [90]:
data.head()

Unnamed: 0,winner,awayID,homeID,away_goalsPerGame,away_goalsAgainstPerGame,away_evGGARatio,away_powerPlayPercentage,away_penaltyKillPercentage,away_shootingPctg,away_savePctg,away_PDO,home_goalsPerGame,home_goalsAgainstPerGame,home_evGGARatio,home_powerPlayPercentage,home_penaltyKillPercentage,home_shootingPctg,home_savePctg,home_PDO
0,0,21,25,0.803099,0.075019,0.789802,0.932331,0.475248,0.885714,0.7,1.02,0.574564,0.027842,0.620057,0.75188,0.821782,0.942857,0.733333,1.023
1,0,9,6,0.834087,0.197989,0.734781,0.676692,0.841584,0.828571,0.733333,1.019,0.46417,0.613302,0.795786,0.511278,0.475248,0.4,0.0,0.982
2,1,16,7,0.330536,0.584687,0.664542,0.24812,0.594059,0.428571,0.066667,0.985,0.393802,0.0,0.502732,0.488722,1.0,0.485714,1.0,1.015
3,1,23,4,0.55907,0.508894,0.7673,0.56391,0.287129,0.457143,0.066667,0.986,0.566817,0.216551,1.0,0.458647,0.445545,0.457143,0.566667,1.001
4,0,17,20,0.668819,0.169374,0.849896,0.93985,0.772277,0.371429,0.766667,1.004,0.227889,0.490333,0.444459,0.398496,0.188119,0.171429,0.266667,0.982


In [91]:
# Input data.
n_x = X.shape[1]

num_hidden_nodes = 15

learning_rate = 0.01

C = 1

# Load the training and test data into constants
tf_X = tf.constant(X.astype(np.float32))
tf_Y = tf.constant(Y.astype(np.float32))
tf_X_test = tf.constant(X_test.astype(np.float32))
tf_Y_test = tf.constant(Y_test.astype(np.float32))

# Variables.
# These are the parameters that we are going to be training.
tf_w1 = tf.Variable(tf.truncated_normal((n_x, num_hidden_nodes)))
tf_b1 = tf.Variable(tf.zeros((1, num_hidden_nodes)))
tf_w2 = tf.Variable(tf.truncated_normal([num_hidden_nodes, C]))
tf_b2 = tf.Variable(tf.zeros((1, C)))



tf_Z1 = tf.matmul(tf_X, tf_w1) + tf_b1
tf_A1 = tf.nn.relu(tf_Z1)    #tf.nn.relu(tf_Z1)
tf_Z2 = tf.matmul(tf_A1, tf_w2) + tf_b2
tf_A2 = tf.nn.relu(tf_Z2)

# Training computation.
# We multiply the inputs with the weight matrix, and add biases. We compute
# the sigmoid and cross-entropy (it's one operation in TensorFlow, because
# it's very common, and it can be optimized). We take the average of this
# cross-entropy across all training examples: that's our cost.
tf_J = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(labels=tf_Y, logits=tf_Z2) )

# Optimizer.
# optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(tf_J)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(tf_J)


# Predictions for the test data.
tf_Z1_test = tf.matmul(tf_X_test, tf_w1) + tf_b1
tf_A1_test = tf.nn.relu(tf_Z1_test)
tf_Z2_test = tf.matmul(tf_A1_test, tf_w2) + tf_b2
tf_A2_test = tf.nn.relu(tf_Z2_test)

In [92]:
session = tf.InteractiveSession()

# This is a one-time operation which ensures the parameters get initialized as
# we described in the graph: random weights for the matrix, zeros for the biases. 
tf.global_variables_initializer().run()
print("Initialized")


# Replace None with your code.

for iter in range(800):
    # Run the computations. We tell .run() that we want to run the optimizer,
    # and get the cost value and the training predictions returned as numpy arrays.
    # Print out the iteration number and cost every 50 iterations.
    _, J, A = session.run([optimizer, tf_J, tf_A2])
    
    if iter%50 ==0:
        print(iter, J)

Initialized
0 2.59669
50 0.776328
100 0.723498
150 0.700101
200 0.685143
250 0.67828
300 0.67527
350 0.675978
400 0.667006
450 0.679443
500 0.671259
550 0.665545
600 0.665922
650 0.665756
700 0.66534
750 0.665179


In [93]:
def accuracy(A, Y):
    P = A>.5      #prediction
    num_agreements = np.sum(P==Y)
    return num_agreements / Y.shape[0]

In [94]:
# Print out the accuracy for the training set and test set.
A = tf_A2.eval()
A_test = tf_A2_test.eval()

print("Accuracy on the train set is ", accuracy(A,Y))
print("Accuracy on the test set is ", accuracy(A_test,Y_test))
# Put your code here.

# Call .eval() on tf_A2 and tf_A2_test.

Accuracy on the train set is  0.561758390779
Accuracy on the test set is  0.55424954792


Stochastic Gradient Descent
--

In [77]:
import tensorflow as tf
import numpy as np
import pandas as pd

In [78]:
data = pd.read_csv('master_normalized.csv', header=0)

In [79]:
X,y = prepare(data)

X,Y,X_test,Y_test = split_train_test(X,y,pct=80)

n_x = X.shape[1]

In [80]:
# Input data.
# Let's use placeholders for the training data. 
# This is so that we can suply batches of tranining examples each iteration.
tf_X = tf.placeholder(tf.float32)
tf_Y = tf.placeholder(tf.float32)

tf_X_test = tf.constant(X_test.astype(np.float32))
tf_Y_test = tf.constant(Y_test.astype(np.float32))

# Variables.
# These are the parameters that we are going to be training.
tf_w = tf.Variable( tf.zeros((n_x, 1)) )
tf_b = tf.Variable(tf.zeros((1,1)))

# Training computation.
# We multiply the inputs with the weight matrix, and add biases. We compute
# the sigmoid and cross-entropy (it's one operation in TensorFlow, because
# it's very common, and it can be optimized). We take the average of this
# cross-entropy across all training examples: that's our cost.
tf_Z = tf.matmul(tf_X, tf_w) + tf_b
tf_J = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(labels=tf_Y, logits=tf_Z) )

# Optimizer.
# We are going to find the minimum of this loss using gradient descent.
# We pass alpha=0.1 as input parameter.
#optimizer = tf.train.GradientDescentOptimizer(0.1).minimize(tf_J)
learning_rate = 0.01
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(tf_J)


# Predictions for the train and test data.
# These are not part of training, but merely here so that we can report
# accuracy figures as we train.
tf_A = tf.nn.sigmoid(tf_Z)
tf_A_test = tf.nn.sigmoid(tf.matmul(tf_X_test, tf_w) + tf_b)

In [82]:
num_steps = 1000
batch_size = 100

session = tf.InteractiveSession()
tf.global_variables_initializer().run()
print("Initialized")

for step in range(num_steps):
    # Pick an offset within the training data.
    offset = (step * batch_size) % (X.shape[0] - batch_size)
    
    # Generate a minibatch.
    X_batch = X[offset:(offset + batch_size), :]
    Y_batch = Y[offset:(offset + batch_size), :]
    
    _, J, A = session.run([optimizer, tf_J, tf_A], feed_dict={tf_X : X_batch, tf_Y : Y_batch})
    
    if (step % 500 == 0):
        print("Minibatch loss at step ", (step, J))
        print("Minibatch accuracy: ", accuracy(A, Y_batch))
        A_test = tf_A_test.eval()
        print("Test accuracy: ", accuracy(A_test,Y_test))

Initialized
Minibatch loss at step  (0, 0.69314742)
Minibatch accuracy:  0.42
Test accuracy:  0.530741410488
Minibatch loss at step  (500, 0.67694199)
Minibatch accuracy:  0.57
Test accuracy:  0.603752260398
