# Hockey Team Visualizations

The following code classifies each player into one of six distinct player types (Playmaker, Sniper, Enforcer, Power, Grinder, Defensive) and determines whether a team's composition of these player types has an effect on its overall success in the regular season.  

The notebook has the following elements:
1. Downloading (from NHL.com) or importing (from a local CSV) individual player and team statistics for the years 1918 to 2016.
2. Editing the columns to have the proper names and a user-friendly look.
3. A function to filter player information by year, position (forward or not), whether they have played at least 20 games, and team.  
4. A player sort function that creates an additional column holding the player type of each player (only applicable from 1998 onward, as it uses the average time on ice statistic).
5. A new DataFrame with the team name, their points, the year, the number of players, and the composition of the team (Player Type / Total # of players).



In [1]:
## Importing a database and analytics tool (pandas), a web
##  scraper (requests), and allowing graphs to be produced 
##  below the notebook
%matplotlib inline
import pandas as pd
import requests as rq
import numpy as np
import re
from IPython.display import display
## setting the database location to store and 
##  retrieve the NHL data and set the inline
##  column view to 25 to view all data
player_database_loc = '/Users/HudsonAccount/Google Drive/Professional Development/Ashtag Consulting/Hockey Team Composition/Full NHL Stat DB.csv'
team_database_loc = '/Users/HudsonAccount/Google Drive/Professional Development/Ashtag Consulting/Hockey Team Composition/TeamStats.csv'
pd.set_option('display.max_columns', 25)

# DO NOT RUN
The purpose of the code below is to locate the online database of NHL statistics from _NHL.com._  The code is in markdown now because all the data has been saved to a CSV file for faster importing.  

**hockey_data_url(years)**:  
> Takes a year and returns the URL of the NHL.com database of all player stats for that season.

**all_season_stats()**:  

> Scrapes the data from NHL.com and stores it in a comma-separated value file at the location given by the database_loc variable.

hockey_data_url(years) consumes the year in which a season ends and returns a URL containing all the NHL data for that season  
hockey_data_url: Int -> Str  
requires: 1917 <= years <= 2015  


    def hockey_data_url (years): 
        first_year = str(years - 1)
        second_year = str(years)
        proper_year = first_year + second_year
        url = 'http://www.nhl.com/stats/rest/grouped/skaters/season/skatersummary?cayenneExp=seasonId=' + proper_year + '%20and%20gameTypeId=2' 
        return url
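
As a quick check of how the season ID in the URL is assembled, the sketch below repeats the same string logic on its own. `season_id` is a hypothetical helper name that is not part of the notebook, and it assumes `years` is the calendar year in which the season ends.

```python
def season_id(years):
    """Build the eight-digit season ID, e.g. 2016 -> '20152016'."""
    # years is assumed to be the year in which the season ends
    return str(years - 1) + str(years)


print(season_id(2016))  # 20152016
```

The same ID is what both URL helpers append after `seasonId=`.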


Similar to above, except for team statistics

    def team_data_url(years):
        first_year = str(years - 1)
        second_year = str(years)
        proper_year = first_year + second_year
        url = 'http://www.nhl.com/stats/rest/grouped/teams/season/teamsummary?cayenneExp=seasonId=' + proper_year + '%20and%20gameTypeId=2' 
        return url

all_season_stats() takes all the data from the available years and combines them into a DataFrame  
all_season_stats: None -> DataFrame
    
    def all_season_stats():
        full_hockey_dataframe = {}

        for current_year in range(1917, 2017):
            response = rq.get(hockey_data_url(current_year))
            print(response)
            ## skip any season whose request failed
            if response.ok:
                json_hockey_data = pd.DataFrame(response.json())
                season_stats = json_hockey_data['data'].apply(pd.Series)
                full_hockey_dataframe[current_year] = season_stats

        print('Done!')
        return pd.concat(full_hockey_dataframe)

    all_stats = all_season_stats()

Similar to above, except with team statistics 
   
    def all_season_stats():
        full_hockey_dataframe = {}

        for current_year in range(1917, 2017):
            response = rq.get(team_data_url(current_year))
            print(response)
            ## skip any season whose request failed
            if response.ok:
                json_hockey_data = pd.DataFrame(response.json())
                season_stats = json_hockey_data['data'].apply(pd.Series)
                full_hockey_dataframe[current_year] = season_stats

        print('Done!')
        return pd.concat(full_hockey_dataframe)

    team_db = all_season_stats()
    ## Editing and saving the DataFrame to a CSV file
    team_db.index.names = ('Year', 'None')
    ## reset_index returns a new DataFrame, so keep the result
    team_db = team_db.reset_index().set_index(['Year', 'teamAbbrev'])
    team_db.to_csv('TeamStats.csv')
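
To illustrate how the per-season tables end up in one DataFrame, here is a minimal sketch using two tiny made-up seasons. The team codes `AAA` and `BBB`, the point totals, and the level name `Row` are placeholders for illustration only; the real frames come from the NHL.com JSON.

```python
import pandas as pd

# Two tiny stand-in seasons with made-up teams and points
seasons = {
    2015: pd.DataFrame({'teamAbbrev': ['AAA', 'BBB'], 'points': [98, 102]}),
    2016: pd.DataFrame({'teamAbbrev': ['AAA', 'BBB'], 'points': [104, 103]}),
}

# Concatenating a dict uses its keys as the outer index level
combined = pd.concat(seasons)
combined.index.names = ['Year', 'Row']

# Move the row counter out and index by (Year, teamAbbrev) instead
by_team = (combined.reset_index()
                   .set_index(['Year', 'teamAbbrev'])
                   .drop(columns='Row'))
print(by_team.loc[(2016, 'AAA'), 'points'])  # 104
```

Indexing by `(Year, teamAbbrev)` is what later lets a team's points be looked up with a single `.loc` call.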

### Editing the Database 
The following code deletes unnecessary columns in the NHL database and adds the following columns:
* **goals_pm**: the ratio of goals scored to total minutes played
* **assists_pm**: the ratio of assists to total minutes played
* **shanded_pts_pm**: the ratio of total short-handed points to total minutes played
* **pm_pm**: the ratio of penalty minutes to total minutes played
* **ppp_pm**: the ratio of power-play points to total minutes played
* **Scoring, Assist, Shorthanded, Powerplay, Penalty, and Accuracy Scores**: six score columns, filled in for each player when a season is filtered

The original database, NHL_db, is then updated to remove deleted columns and add the new columns 
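
The per-minute columns can be sketched on a toy table as follows. The two rows are made up, and the sketch assumes `timeOnIcePerGame` is recorded in seconds, which is what the division by 60 in `add_columns` suggests.

```python
import pandas as pd

# Made-up rows; timeOnIcePerGame is assumed to be in seconds
toy = pd.DataFrame({
    'goals': [30, 10],
    'assists': [45, 5],
    'gamesPlayed': [80, 40],
    'timeOnIcePerGame': [1200.0, 600.0],
}, index=['Skater A', 'Skater B'])

# Total minutes = seconds per game * games played / 60
toy = toy.assign(
    total_mins_played=lambda x: x['timeOnIcePerGame'] * x['gamesPlayed'] / 60)

# Per-minute rates divide each counting stat by total minutes
toy = toy.assign(
    goals_pm=lambda x: x['goals'] / x['total_mins_played'],
    assists_pm=lambda x: x['assists'] / x['total_mins_played'],
)
print(toy[['total_mins_played', 'goals_pm', 'assists_pm']])
```

Dividing by minutes rather than games keeps players with limited ice time comparable to those who play heavy minutes.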

In [2]:
## Reading and adjusting the CSV file to set for year and 
##  player ID for the player database
NHL_db = pd.read_csv(player_database_loc)
NHL_db.set_index(['Unnamed: 0', 'playerName'], inplace = True)
columns_to_delete = ['0', 'Unnamed: 1', 'faceoffWinPctg',
                     'gameWinningGoals', 'otGoals', 'playerFirstName',
                     'playerLastName', 'plusMinus', 'seasonId', 
                     'playerId', 'ppGoals', 'shiftsPerGame', 'shots',
                     'points', 'pointsPerGame']
for column in columns_to_delete:
    del NHL_db[column]

## add_columns(db) takes a DataFrame and returns a DataFrame
##  with a column for total minutes played plus per-minute
##  columns for goals, assists, short-handed points, penalty
##  minutes, and power-play points
def add_columns(db):
    update0 = db.assign(total_mins_played = lambda x: (x['timeOnIcePerGame'] * x['gamesPlayed']) / 60)
    update = update0.assign(goals_pm = lambda x: x['goals'] / x['total_mins_played'])
    update1 = update.assign(assists_pm = lambda x: x['assists'] / x['total_mins_played'])
    update2 = update1.assign(shanded_pts_pm = lambda x: x['shPoints'] / x['total_mins_played'])
    update3 = update2.assign(pm_pm= lambda x: x['penaltyMinutes'] / x['total_mins_played'])
    update4 = update3.assign(ppp_pm = lambda x: x['ppPoints'] / x['total_mins_played'])
    return update4

## add necessary columns to the dataframe

NHL_db = add_columns(NHL_db)
NHL_db.index.names = ['Year','Player Name']


## Reading and adjusting the team statistic DataFrame
teams_db = pd.read_csv(team_database_loc)
teams_db = teams_db.set_index(['Year', 'teamAbbrev'])
teams_and_pts = teams_db['points']



### Classifying the Players
Below, each player is classified into a player type:
* Sniper for the high scorers
* Playmaker for the players who set the play up
* Defensive for players who excel in the defensive end
* Grinder for players who get their noses dirty and do a bit of everything
* Power for the aggressive scorers
* Enforcer for the fighters

A player is placed into a category based on their scoring, powerplay, assist, shorthanded, accuracy, and penalty scores. Each score is the number of standard deviations the player's statistic lies from that year's mean.  

A player's six scores are sent through the player sort, which assigns a player type.
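
A minimal sketch of the two steps, assuming made-up numbers throughout: first a score is computed as a z-score within one season, then the highest weighted category wins. The weights mirror those in the notebook's `player_sort` function; the dictionary layout is only for illustration.

```python
import pandas as pd

# Step 1: made-up per-minute goal rates for one season
goals_pm = pd.Series([0.010, 0.020, 0.030],
                     index=['Skater A', 'Skater B', 'Skater C'])

# Score = (value - season mean) / season standard deviation
scoring_score = (goals_pm - goals_pm.mean()) / goals_pm.std()
print(scoring_score.round(2))

# Step 2: one player's six made-up scores
scores = {'scoring': 1.5, 'assist': 2.0, 'shorthanded': 0.2,
          'powerplay': 2.0, 'penalty': -0.1, 'accuracy': 0.7}

# Category scores, in the same order player_sort checks them
category = {
    'Sniper': scores['scoring'],
    'Defensive': 0.8 * scores['shorthanded'],
    'Playmaker': scores['assist'],
    'Grinder': -0.5 * sum(scores.values()),
    'Power': 0.9 * scores['scoring'] + 0.9 * scores['penalty'],
    'Enforcer': scores['penalty'],
}
print(max(category, key=category.get))  # Playmaker
```

Because the Grinder score is the negated half-sum of all six scores, it is only the largest when a player is below average nearly everywhere.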

### Database selection by year, position, games played, team, and player sort
The function below filters the DataFrame so that a single year can be selected, a team may be selected, and players who have played fewer than 20 games can be filtered out in order to preserve accuracy in the ratios.  It also separates forwards from defencemen.
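
The filtering steps can be sketched on a toy table that mirrors the `(Year, Player Name)` index of `NHL_db`. All rows, names, and team codes here are made up for illustration.

```python
import pandas as pd

# Made-up rows indexed by (Year, Player Name)
toy = pd.DataFrame(
    {
        'gamesPlayed': [82, 12, 70, 65],
        'playerPositionCode': ['C', 'L', 'D', 'R'],
        'playerTeamsPlayedFor': ['AAA', 'AAA', 'BBB', 'AAA, BBB'],
    },
    index=pd.MultiIndex.from_tuples(
        [(2010, 'Skater A'), (2010, 'Skater B'),
         (2010, 'Skater C'), (2011, 'Skater D')],
        names=['Year', 'Player Name'],
    ),
)

# Keep one season, then apply the games-played and position filters
season = toy[toy.index.get_level_values('Year') == 2010]
regulars = season[season['gamesPlayed'] >= 20]
forwards = regulars[regulars['playerPositionCode'].isin(['C', 'L', 'R'])]
print(forwards.index.get_level_values('Player Name').tolist())  # ['Skater A']
```

Using `isin` here is a design choice of the sketch: it matches whole position codes only, whereas a substring test such as `x in 'CLR'` would also accept an empty string.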

In [3]:
## player_sort(scoring, assist, shorthanded, powerplay, penalty, accuracy)
##  takes all of a player's scores and determines the player
##  category they fall under.  It is used in an apply function
##  for a DataFrame
def player_sort(scoring, assist, shorthanded, powerplay, penalty, accuracy):
    sniper_score = scoring
    playmaker_score = assist
    defensive_score = .8*shorthanded
    grinder_score = -.5 * (scoring + assist + shorthanded + powerplay + penalty + accuracy)
    power_score = (.9 *scoring) + (.9 * penalty)
    enforcer_score = penalty
    score_lst = (sniper_score, defensive_score, playmaker_score, 
                 grinder_score, power_score, enforcer_score)
    top = score_lst.index(max(score_lst))
    if top == 0:
        return 'Sniper'
    if top == 1:
        return 'Defensive'
    if top == 2: 
        return 'Playmaker'
    if top == 3:
        return 'Grinder'
    if top == 4:
        return 'Power'
    if top == 5:
        return 'Enforcer'

    
def stds(col_val, mean, std):
    pass
    
## dictionary of score column -> source statistic, iterated in df_filter
player_score_dct = {'Scoring Score': 'goals_pm',
                    'Assist Score': 'assists_pm',
                    'Shorthanded Score': 'shanded_pts_pm',
                    'Powerplay Score': 'ppp_pm',
                    'Penalty Score': 'pm_pm',
                    'Accuracy Score': 'shootingPctg'}

def df_filter(year, db, position = 'forward', team = '', 
              over_twenty = True, player_scores = True):
    
    if year == 2005:
        return "Sorry, this year was a lockout.  Please re-enter another year"
    
    year_db = db.groupby(level = 'Year').get_group(year)
    
    if (team != ''):
        year_db = year_db[year_db['playerTeamsPlayedFor'].apply(lambda x: team in x)]
    
    if over_twenty == True:
        year_db =  year_db[year_db['gamesPlayed'] >= 20]
    
    if position == 'forward':
        year_db = year_db[year_db['playerPositionCode'].apply(lambda x: x in 'CLR')]
    
    if position == 'defence':
        year_db = year_db[year_db['playerPositionCode'].apply(lambda x: x == 'D')]
    
    if player_scores == True:
        ## work on a copy so the new columns are not written to a slice of db
        year_db = year_db.copy()
        for player_score, col_name in player_score_dct.items():
            mean = year_db[col_name].mean()
            std = year_db[col_name].std()
            ## score = number of standard deviations from the year's mean
            year_db[player_score] = (year_db[col_name] - mean) / std
        player_col = year_db.apply(lambda x: player_sort(x['Scoring Score'], x['Assist Score'], x['Shorthanded Score'],
                                                         x['Powerplay Score'], x['Penalty Score'], x['Accuracy Score']),
                                                         axis = 1)
        year_db['Player Type'] = player_col
       
    
    
    return year_db

### Team Composition
The goal of the code below is to create a new DataFrame containing a team's number of players, the number of Grinders, Playmakers, Snipers, Power Forwards, Defensive Forwards, and Enforcers, the total points of the team in a single year, and the composition of each player type.  
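
As a rough sketch of how one team's row is assembled, the example below turns a made-up list of player types into ratios. The column naming (`<Type> Ratio`, `Total Players`) follows the notebook; the data is invented.

```python
import pandas as pd

# Made-up player types for one team's forwards in one season
types = pd.Series(['Sniper', 'Grinder', 'Grinder', 'Playmaker'])

# normalize=True divides each count by the total number of players
ratios = types.value_counts(normalize=True)

player_types = ['Defensive', 'Enforcer', 'Grinder',
                'Playmaker', 'Power', 'Sniper']
# Types the team lacks get a ratio of 0 rather than a missing value
row = {t + ' Ratio': float(ratios.get(t, 0)) for t in player_types}
row['Total Players'] = len(types)
print(row)
```

Filling absent types with 0 keeps every team's row the same shape, so the rows can be stacked into a single DataFrame.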



In [4]:
## team_composition(year, db, team_acc) takes a year, a database,
##  and a team acronym, then creates a dictionary of the number of
##  players of each player type, the ratio of each type, the team
##  acronym, the team's points, and the total number of players
## Int, DataFrame, Str -> Dict
def team_composition(year, db, team_acc):
    
    team_db = df_filter(year, db)
    team_db = team_db[team_db['playerTeamsPlayedFor'].apply(lambda x: team_acc in x)]

    team_dct = dict(team_db['Player Type'].value_counts())
    team_dct['Team'] = team_acc
    team_dct['Total Players'] = len(team_db)
    player_types_lst = ('Defensive', 'Enforcer', 'Grinder', 
                        'Playmaker', 'Power', 'Sniper')
    for player_type in player_types_lst:
        if player_type not in team_dct:
            team_dct[player_type + ' Ratio'] = 0
        else: team_dct[player_type + ' Ratio'] = team_dct[player_type] / team_dct['Total Players']
    teams_pts = teams_and_pts.groupby(level = 'Year').get_group(year)
    team_dct['Points'] = teams_pts.loc[year, team_acc]
    return team_dct
    

## team_acronym_lst(year, db) takes a year and a database and produces
##  a list of all the unique teams in the league that year
##  Notes: only players listed with a single three-letter team are counted

def team_acronym_lst(year, db):
    year_db = db.groupby(level = "Year").get_group(year)
    teams_db = year_db[year_db['playerTeamsPlayedFor'].apply(lambda x: len(x) == 3)]
    return teams_db['playerTeamsPlayedFor'].value_counts().index.values.tolist()
    

## year_composition(year, db) takes a year and a database and uses the above
##  two functions to create a new DataFrame for the year
## Int, DataFrame -> DataFrame
## Only applicable for years 1998 and beyond. 
def year_composition(year, db):
    lst_to_df = []
    teams_lst = team_acronym_lst(year, db)
    for team in teams_lst:
        lst_to_df.append(team_composition(year, db, team))
    
    return pd.DataFrame(lst_to_df).fillna(0).set_index('Team')



### User Input
The functions below allow a user to look up the player types of individual players and the correlation between a team's regular-season success and the ratio of each player type to the team's total players.
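
The correlation reported to the user is a Pearson correlation between two columns of the yearly composition table. A minimal sketch on made-up team rows (team codes and values are placeholders):

```python
import pandas as pd

# Made-up team rows: season points and the share of Snipers among forwards
toy = pd.DataFrame({
    'Points': [80, 90, 100, 110],
    'Sniper Ratio': [0.10, 0.15, 0.20, 0.25],
}, index=['AAA', 'BBB', 'CCC', 'DDD'])

# Pearson correlation between points and the ratio
r = toy['Points'].corr(toy['Sniper Ratio'])
print(round(r, 3))
```

The toy data is perfectly linear, so the result is as high as the statistic can go; a single real season has only about 30 teams, so the yearly values should be read with caution.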

In [9]:
## player_type_creator(user) takes a user input and displays a DataFrame,
##  statistic, or scatterplot.

def player_type_creator(user):
    if user == 'more':
        print("If you would like to see statistics of a certain year, enter a year between 1998 and 2016")
        print("If you would like to see the overall relationship between points and player type for all years,")
        user = input("enter the player type (Sniper, Defensive, Grinder, Playmaker, Enforcer, Power) \n \n")
        if len(user) <= 4:
            try:
                int(user)
            except ValueError:
                user = input("Sorry, it doesn't look like this date is valid. Please re-enter 'more' \n \n")
                player_type_creator(user)
            else:
                if int(user) == 2005:
                    print("Sorry, lockout year!")
                    user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
                    player_type_creator(user)

                elif (int(user) >= 1998) and (int(user) <= 2016): 
                    player_type = input("Which player type would you like to see? (Grinder, Playmaker, Defensive, Sniper, Enforcer, Power) \n \n")
                    plotting_df = year_composition(int(user), NHL_db)
                    plotting_df.plot.scatter("Points", player_type+" Ratio", figsize = (12,8))
                    print('Correlation of {0}'.format(plotting_df.corr()['Points'][player_type+" Ratio"]))
                    user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
                    player_type_creator(user)
                else: 
                    print("Sorry, the date does not appear to be within range")
                    user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
                    player_type_creator(user)
        else: 
            if user in ['Sniper', 'Defensive', 'Grinder', 'Playmaker', 'Enforcer', 'Power']:
                year = 1998
                correlation = 0 
                years_counted = 0
                while year <= 2016:
                    if year == 2005:
                        year += 1 
                    print(year)
                    year_corr = year_composition(year, NHL_db).corr()['Points'][user+' Ratio']
                    correlation = year_corr + correlation
                    years_counted += 1
                    year += 1 
                ## average over the seasons actually used (2005 is skipped)
                print("The average yearly correlation across almost 540 data points between Points and {0} is {1}".format(user,
                                                                                                      correlation / years_counted))
                user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
                player_type_creator(user)
                
            else: 
                print("Sorry, it does not look like a proper date or player type.")
                user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
                player_type_creator(user)
        
        
    elif user == 'exit':
        return print("Thank you, goodbye.")
    else: 
        player_lst = []
        year = 1998
        player_count = 0 
        
        while year <= 2016:
            if year == 2005:
                year += 1
            year_db = df_filter(year, NHL_db)
            if user in year_db.index.get_level_values(1):
                player_lst.append(year_db.loc[year,user])
                player_count += 1
                year += 1
            else: year += 1 
                        
        if player_count == 0:
            print("\n")
            print("Sorry, it doesn't look like {0} is in the database.  If it's not spelled correctly,".format(user))
            print("then the program won't be able to find it. (Only players active from 1998 onward can be found")
            print("due to a lack of earlier statistics, and full names must be used) \n")
            user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
            player_type_creator(user)
        else:
            player_df = pd.DataFrame(player_lst)
            value_count = player_df['Player Type'].value_counts()
            if value_count.size == 1: 
                print('In his {0} seasons of over 20 games, {1} is primarily a {2}.'.format(player_count,
                                                                                            user, 
                                                                                            value_count.index[0]))
                                                                                                
                print("In addition, below is a summary of {0}'s scores, stats, and player type per year".format(user))
                display(pd.DataFrame(player_lst))
                user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
                player_type_creator(user)
            elif value_count.values[0] == value_count.values[1]:
                print('In his {0} seasons of over 20 games, {1} is a combination of a {2} and a {3}.'.format(player_count,
                                                                                                user, 
                                                                                                value_count.index[0],
                                                                                                value_count.index[1]))
                print("In addition, below is a summary of {0}'s scores, stats, and player type per year".format(user))
                display(pd.DataFrame(player_lst))
                user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
                player_type_creator(user)
            else: 
                print('In his {0} seasons of over 20 games, {1} is primarily a {2}.'.format(player_count,
                                                                                            user, 
                                                                                            value_count.index[0]))                                                                                
                print("In addition, below is a summary of {0}'s scores, stats, and player type per year".format(user))
                display(pd.DataFrame(player_lst))
                user = input("Please type another player name, 'more' for further statistics, and 'exit' to leave \n \n")
                player_type_creator(user)
                
## run_user_input() runs the player_type_creator
def run_user_input():
    print("Hello, Welcome to the NHL Player Classification System")
    print("The system classifies players into six categories: \n")
    print("Sniper: High scoring and high accuracy")
    print("Grinder: Does a little bit of everything, from fights to scoring")
    print("Defensive: A forward who excels at the penalty kill")
    print("Enforcer: A fighter who takes a lot of penalties")
    print("Power: A forward who can score builds up penalty minutes")
    print("Playmaker: A leader in assists and setting up goals \n")
    print("There are a few things the program can do. It can show a player's type over the career,")
    print("determine a correlation between a team's points and their composition of a player type in a year,")
    print("and show the relationship between points and a player type over 18 years")
    user = input("Please type a player's name to see their player type or type 'more' for a team's success vs the player types: \n \n")
    player_type_creator(user)


In [8]:
run_user_input()

Hello, Welcome to the NHL Player Classification System
The system classifies players into six categories: 

Sniper: High scoring and high accuracy
Grinder: Does a little bit of everything, from fights to scoring
Defensive: A forward who excels at the penalty kill
Enforcer: A fighter who takes a lot of penalties
Power: A forward who can score builds up penalty minutes
Playmaker: A leader in assists and setting up goals 

There are a few things the program can do. It can show a player's type over the career,
determine a correlation between a team's points and their composition of a player type in a year,
and show the relationship between points and a player type over 18 years
Please type a player's name to see their player type or type 'more' for a team's success vs the player types: 
 
Sidney Crosby
In his 11 seasons of over 20 games, Sidney Crosby is primarily a Playmaker.
In addition, below is a summary of Sidney Crosby's scores, stats, and player type per year


Unnamed: 0,assists,gamesPlayed,goals,penaltyMinutes,playerPositionCode,playerTeamsPlayedFor,ppPoints,shGoals,shPoints,shootingPctg,timeOnIcePerGame,total_mins_played,goals_pm,assists_pm,shanded_pts_pm,pm_pm,ppp_pm,Penalty Score,Shorthanded Score,Accuracy Score,Assist Score,Powerplay Score,Scoring Score,Player Type
"(2006, Sidney Crosby)",63,81,39,110,C,PIT,47,0,2,0.1402,1207.6049,1630.266615,0.023922,0.038644,0.001227,0.067474,0.02883,-0.063208,0.161723,0.722897,2.043767,1.998292,1.517667,Playmaker
"(2007, Sidney Crosby)",84,79,36,60,C,PIT,61,0,0,0.144,1245.5443,1639.966662,0.021952,0.051221,0.0,0.036586,0.037196,-0.330225,-0.705299,0.954806,3.572405,3.586182,1.410665,Playmaker
"(2008, Sidney Crosby)",48,53,24,39,C,PIT,27,0,1,0.1387,1250.8113,1104.883315,0.021722,0.043444,0.000905,0.035298,0.024437,-0.342643,0.066786,0.947239,3.208664,2.386518,1.626945,Playmaker
"(2009, Sidney Crosby)",70,77,33,76,C,PIT,40,0,1,0.1386,1316.5324,1689.549913,0.019532,0.041431,0.000592,0.044982,0.023675,-0.25774,-0.15228,0.847856,2.920331,2.261174,1.117477,Playmaker
"(2010, Sidney Crosby)",58,81,51,71,C,PIT,34,2,3,0.1711,1317.2716,1778.31666,0.028679,0.032615,0.001687,0.039925,0.019119,-0.265499,1.118047,1.595092,1.961121,1.997638,2.704534,Sniper
"(2011, Sidney Crosby)",34,41,32,31,C,PIT,19,1,1,0.1987,1315.0487,898.616612,0.03561,0.037836,0.001113,0.034497,0.021144,-0.253328,0.393432,2.399272,2.690114,2.468649,3.935128,Sniper
"(2012, Sidney Crosby)",29,22,8,14,C,PIT,11,0,0,0.1066,1108.3181,406.383303,0.019686,0.071361,0.0,0.03445,0.027068,-0.275708,-0.574315,0.242978,6.77278,3.935716,1.379646,Playmaker
"(2013, Sidney Crosby)",41,36,15,16,C,PIT,17,0,0,0.1209,1266.25,759.75,0.019743,0.053965,0.0,0.02106,0.022376,-0.428737,-0.486532,0.417501,4.18039,2.591216,1.319973,Playmaker
"(2014, Sidney Crosby)",68,80,36,46,C,PIT,38,0,0,0.1389,1318.3375,1757.783333,0.02048,0.038685,0.0,0.026169,0.021618,-0.328748,-0.608159,0.951372,2.893336,2.870667,1.484943,Playmaker
"(2015, Sidney Crosby)",56,77,28,47,C,PIT,31,0,0,0.1181,1198.2467,1537.749932,0.018208,0.036417,0.0,0.030564,0.020159,-0.203287,-0.588361,0.495791,2.69381,2.701656,1.14131,Playmaker


Please type another player name, 'more' for further statistics, and 'exit' to leave 
 
exit
Thank you, goodbye.


## Issues with the Forward Sort Function
**_Generations of the NHL_**  
The NHL has gone through several different generations.  Thus, the categorization of player types becomes skewed by the higher-scoring era of the 1980s and the mid-2000s 'lockdown' era. 

**_Input Data_**  
There is potential to get better estimates based on additional stats, such as blocked shots, power-play time, hits, etc.  The more data points, the better the potential classification. 
 