# Exploratory Data Analysis (EDA)

This notebook explores tabular data using pandas DataFrames. It only works if you've already downloaded the data, since the cells below read CSV files from `../Data/`.

**Note:** If you aren't using [Anaconda](https://anaconda.com), you may need to run `%pip install` for any missing libraries.

## DataFrames
Now that we have our data, let's load a table into a DataFrame.

In [113]:
import pandas as pd

df_teams = pd.read_csv('../Data/team_info.csv')
df_teams

Unnamed: 0,team_id,franchiseId,shortName,teamName,abbreviation,link
0,1,23,New Jersey,Devils,NJD,/api/v1/teams/1
1,4,16,Philadelphia,Flyers,PHI,/api/v1/teams/4
2,26,14,Los Angeles,Kings,LAK,/api/v1/teams/26
3,14,31,Tampa Bay,Lightning,TBL,/api/v1/teams/14
4,6,6,Boston,Bruins,BOS,/api/v1/teams/6
5,3,10,NY Rangers,Rangers,NYR,/api/v1/teams/3
6,5,17,Pittsburgh,Penguins,PIT,/api/v1/teams/5
7,17,12,Detroit,Red Wings,DET,/api/v1/teams/17
8,28,29,San Jose,Sharks,SJS,/api/v1/teams/28
9,18,34,Nashville,Predators,NSH,/api/v1/teams/18


That's a lot. Let's look at just the first 5 rows.

In [114]:
df_teams.head()

Unnamed: 0,team_id,franchiseId,shortName,teamName,abbreviation,link
0,1,23,New Jersey,Devils,NJD,/api/v1/teams/1
1,4,16,Philadelphia,Flyers,PHI,/api/v1/teams/4
2,26,14,Los Angeles,Kings,LAK,/api/v1/teams/26
3,14,31,Tampa Bay,Lightning,TBL,/api/v1/teams/14
4,6,6,Boston,Bruins,BOS,/api/v1/teams/6


Interesting, but not immediately useful on its own. Let's load more data from another table, the per-game team statistics.

![Other Tables](../Data/table_relationships.jpeg)

In [115]:
df_team_stats = pd.read_csv('../Data/game_teams_stats.csv')
df_team_stats.head()

Unnamed: 0,game_id,team_id,HoA,won,settled_in,head_coach,goals,shots,hits,pim,powerPlayOpportunities,powerPlayGoals,faceOffWinPercentage,giveaways,takeaways,blocked,startRinkSide
0,2016020045,4,away,False,REG,Dave Hakstol,4.0,27.0,30.0,6.0,4.0,2.0,50.9,12.0,9.0,11.0,left
1,2016020045,16,home,True,REG,Joel Quenneville,7.0,28.0,20.0,8.0,3.0,2.0,49.1,16.0,8.0,9.0,left
2,2017020812,24,away,True,OT,Randy Carlyle,4.0,34.0,16.0,6.0,3.0,1.0,43.8,7.0,4.0,14.0,right
3,2017020812,7,home,False,OT,Phil Housley,3.0,33.0,17.0,8.0,2.0,1.0,56.2,5.0,6.0,14.0,right
4,2015020314,21,away,True,REG,Patrick Roy,4.0,29.0,17.0,9.0,3.0,1.0,45.7,13.0,5.0,20.0,left


## Getting Column Values & Unique Values

In [116]:
df_team_stats["settled_in"]

0        REG
1        REG
2         OT
3         OT
4        REG
        ... 
52605    REG
52606    REG
52607    REG
52608    REG
52609    REG
Name: settled_in, Length: 52610, dtype: object

In [117]:
df_team_stats["settled_in"].unique()

array(['REG', 'OT', 'tbc'], dtype=object)
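
`unique()` lists the distinct values but not how often each one occurs. As a minimal sketch, `value_counts()` gives that breakdown. The Series below is a made-up stand-in for `df_team_stats["settled_in"]`, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df_team_stats["settled_in"]; the values are made up.
settled_in = pd.Series(["REG", "REG", "OT", "REG", "tbc", "OT"], name="settled_in")

# Number of rows per distinct value, most frequent first.
counts = settled_in.value_counts()
print(counts)

# Same breakdown expressed as a share of all rows.
shares = settled_in.value_counts(normalize=True)
print(shares)
```

On the real column, this would show how rare the `'tbc'` value is compared with `'REG'` and `'OT'`.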

## Descriptive Statistics for Columns and DataFrames

In [118]:
df_team_stats["hits"].describe()

count    47682.000000
mean        21.127449
std          9.237332
min          0.000000
25%         15.000000
50%         20.000000
75%         27.000000
max         80.000000
Name: hits, dtype: float64

In [119]:
df_team_stats.describe()

Unnamed: 0,game_id,team_id,goals,shots,hits,pim,powerPlayOpportunities,powerPlayGoals,faceOffWinPercentage,giveaways,takeaways,blocked
count,52610.0,52610.0,52602.0,52602.0,47682.0,52602.0,52602.0,52602.0,30462.0,47682.0,47682.0,47682.0
mean,2010765000.0,16.880403,2.781282,29.930744,21.127449,11.754838,3.693567,0.667674,49.967179,8.832977,6.74504,13.317352
std,6073510.0,11.195171,1.657257,6.896107,9.237332,9.029566,1.870606,0.813093,7.326322,5.478274,4.144502,5.581261
min,2000020000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2006020000.0,8.0,2.0,25.0,15.0,6.0,2.0,0.0,45.2,5.0,4.0,10.0
50%,2011021000.0,16.0,3.0,30.0,20.0,10.0,3.0,0.0,50.0,8.0,6.0,13.0
75%,2016030000.0,24.0,4.0,34.0,27.0,15.0,5.0,1.0,54.8,12.0,9.0,17.0
max,2019041000.0,90.0,12.0,88.0,80.0,213.0,16.0,7.0,79.2,52.0,40.0,62.0
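
The `count` row above differs between columns (47682 for `hits` versus 52610 for `game_id`, for example), which suggests that some columns contain missing values, since `describe()` counts only non-missing entries. A small sketch of counting them directly follows. The frame is toy data, not the real `game_teams_stats.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values; the numbers are made up.
df = pd.DataFrame({
    "goals": [4.0, 7.0, 3.0, np.nan],
    "hits": [30.0, np.nan, np.nan, 17.0],
    "shots": [27.0, 28.0, 33.0, 29.0],
})

# Number of missing entries per column.
missing = df.isna().sum()
print(missing)

# Fraction of rows that are missing per column.
missing_share = df.isna().mean()
print(missing_share)

# describe() reports the non-missing count, so it drops as NaNs appear.
hits_count = df["hits"].describe()["count"]
print(hits_count)
```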


## Adding a Column

In [120]:
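# Derive boolean flag columns from existing ones
df_team_stats['is_home'] = df_team_stats['HoA'].eq('home')
# Rows where hits is missing (NaN) compare as False here
df_team_stats["has_many_hits"] = df_team_stats["hits"] > 30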
df_team_stats.sample(10)

Unnamed: 0,game_id,team_id,HoA,won,settled_in,head_coach,goals,shots,hits,pim,powerPlayOpportunities,powerPlayGoals,faceOffWinPercentage,giveaways,takeaways,blocked,startRinkSide,is_home,has_many_hits
2448,2016020757,26,home,True,REG,Darryl Sutter,5.0,40.0,26.0,13.0,3.0,1.0,49.2,7.0,1.0,16.0,left,True,False
36800,2000020451,24,home,True,OT,Craig Hartsburg,5.0,30.0,,10.0,5.0,2.0,,,,,right,True,False
30779,2002020921,7,away,True,REG,Lindy Ruff,4.0,25.0,16.0,19.0,2.0,1.0,,10.0,6.0,7.0,right,False,False
6338,2015020215,53,away,True,OT,Dave Tippett,4.0,28.0,20.0,8.0,5.0,1.0,61.1,13.0,6.0,12.0,right,False,False
17858,2010020917,29,home,True,REG,Scott Arniel,5.0,34.0,25.0,17.0,9.0,3.0,47.4,1.0,4.0,7.0,right,True,False
1121,2015020742,18,away,True,REG,Peter Laviolette,2.0,24.0,9.0,8.0,4.0,1.0,45.5,6.0,7.0,12.0,left,False,False
38122,2000020631,25,away,True,REG,Ken Hitchcock,3.0,21.0,,19.0,5.0,2.0,,,,,,False,False
19292,2009020356,4,home,False,REG,John Stevens,2.0,33.0,17.0,30.0,5.0,0.0,,6.0,7.0,8.0,left,True,False
8371,2014020836,15,away,True,REG,Barry Trotz,5.0,28.0,35.0,8.0,5.0,1.0,55.4,10.0,7.0,18.0,right,False,True
193,2017020164,30,home,True,REG,Bruce Boudreau,2.0,29.0,12.0,6.0,4.0,0.0,44.6,6.0,9.0,18.0,left,True,False


In [121]:
# Numeric helper columns so wins and games can be summed and averaged when grouping
df_team_stats['wins'] = df_team_stats['won'].apply(lambda x: 1 if x else 0)
df_team_stats['games'] = 1
df_team_stats

Unnamed: 0,game_id,team_id,HoA,won,settled_in,head_coach,goals,shots,hits,pim,...,powerPlayGoals,faceOffWinPercentage,giveaways,takeaways,blocked,startRinkSide,is_home,has_many_hits,wins,games
0,2016020045,4,away,False,REG,Dave Hakstol,4.0,27.0,30.0,6.0,...,2.0,50.9,12.0,9.0,11.0,left,False,False,0,1
1,2016020045,16,home,True,REG,Joel Quenneville,7.0,28.0,20.0,8.0,...,2.0,49.1,16.0,8.0,9.0,left,True,False,1,1
2,2017020812,24,away,True,OT,Randy Carlyle,4.0,34.0,16.0,6.0,...,1.0,43.8,7.0,4.0,14.0,right,False,False,1,1
3,2017020812,7,home,False,OT,Phil Housley,3.0,33.0,17.0,8.0,...,1.0,56.2,5.0,6.0,14.0,right,True,False,0,1
4,2015020314,21,away,True,REG,Patrick Roy,4.0,29.0,17.0,9.0,...,1.0,45.7,13.0,5.0,20.0,left,False,False,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52605,2018030416,19,home,False,REG,Craig Berube,1.0,29.0,29.0,20.0,...,0.0,58.7,12.0,11.0,9.0,right,True,False,0,1
52606,2018030417,19,away,True,REG,Craig Berube,4.0,20.0,36.0,2.0,...,0.0,49.0,7.0,8.0,21.0,right,False,True,1,1
52607,2018030417,6,home,False,REG,Bruce Cassidy,1.0,33.0,28.0,0.0,...,0.0,51.0,13.0,6.0,7.0,right,True,False,0,1
52608,2018030417,19,away,True,REG,Craig Berube,4.0,20.0,36.0,2.0,...,0.0,49.0,7.0,8.0,21.0,right,False,True,1,1
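
When a column holds only `True`/`False` values, `astype(int)` is a vectorized alternative to the `apply` call above. The sketch below compares the two on toy data. It assumes a clean boolean column with no missing values, which has not been checked for the real `won` column.

```python
import pandas as pd

# Toy frame; assumes "won" is strictly boolean with no missing values.
toy = pd.DataFrame({"won": [False, True, True, False]})

# Row-by-row conversion, as in the cell above.
toy["wins_apply"] = toy["won"].apply(lambda x: 1 if x else 0)

# Vectorized conversion: True -> 1, False -> 0.
toy["wins_astype"] = toy["won"].astype(int)

# The mean of a boolean column is the fraction of True values (the win rate).
win_rate = toy["won"].mean()
print(toy)
print(win_rate)
```

Note that the two differ on missing values: `astype(int)` raises on NaN, while the lambda treats NaN as truthy and returns 1.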


## Merging DataFrames

In [122]:
df_teams_merged = pd.merge(df_team_stats, df_teams, on='team_id')
df_teams_merged.sample(5)

Unnamed: 0,game_id,team_id,HoA,won,settled_in,head_coach,goals,shots,hits,pim,...,startRinkSide,is_home,has_many_hits,wins,games,franchiseId,shortName,teamName,abbreviation,link
34190,2019020456,14,home,True,REG,Jon Cooper,7.0,38.0,18.0,8.0,...,right,True,False,1,1,31,Tampa Bay,Lightning,TBL,/api/v1/teams/14
19727,2019020848,26,away,False,REG,Todd McLellan,0.0,37.0,18.0,8.0,...,right,False,False,0,1,14,Los Angeles,Kings,LAK,/api/v1/teams/26
16863,2012020544,10,home,False,REG,Randy Carlyle,3.0,28.0,35.0,13.0,...,left,True,True,0,1,5,Toronto,Maple Leafs,TOR,/api/v1/teams/10
44499,2000020560,13,home,False,REG,Duane Sutter,1.0,25.0,,20.0,...,left,True,False,0,1,33,Florida,Panthers,FLA,/api/v1/teams/13
1400,2010030127,4,home,True,REG,Peter Laviolette,5.0,36.0,25.0,32.0,...,left,True,False,1,1,16,Philadelphia,Flyers,PHI,/api/v1/teams/4
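
`pd.merge` defaults to an inner join, so any stats row whose `team_id` has no match in the teams table is silently dropped. The sketch below shows one way to check for that using `how="left"`, `indicator=True`, and `validate`. Both frames are toy data, and `team_id` 99 is a made-up unmatched value.

```python
import pandas as pd

# Toy stand-ins for df_team_stats and df_teams; team_id 99 is hypothetical.
stats = pd.DataFrame({
    "game_id": [1, 1, 2, 2],
    "team_id": [4, 1, 4, 99],
    "goals": [4, 7, 3, 2],
})
teams = pd.DataFrame({
    "team_id": [4, 1],
    "teamName": ["Flyers", "Devils"],
})

# Default inner join: the row with team_id 99 disappears.
inner = pd.merge(stats, teams, on="team_id")

# Left join keeps every stats row; validate checks that team_id is unique in teams;
# indicator adds a "_merge" column saying where each row came from.
left = pd.merge(
    stats, teams, on="team_id",
    how="left", validate="many_to_one", indicator=True,
)

# Stats rows that found no matching team.
unmatched = left[left["_merge"] == "left_only"]
print(len(inner), len(left))
print(unmatched)
```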


## Filtering DataFrames

In [123]:
df_rangers = df_teams_merged[df_teams_merged['teamName'].eq('Rangers')]
df_rangers.head()

Unnamed: 0,game_id,team_id,HoA,won,settled_in,head_coach,goals,shots,hits,pim,...,startRinkSide,is_home,has_many_hits,wins,games,franchiseId,shortName,teamName,abbreviation,link
44975,2015020646,3,away,False,REG,Alain Vigneault,1.0,35.0,27.0,8.0,...,right,False,False,0,1,10,NY Rangers,Rangers,NYR,/api/v1/teams/3
44976,2017021170,3,home,False,REG,Alain Vigneault,2.0,30.0,17.0,6.0,...,left,True,False,0,1,10,NY Rangers,Rangers,NYR,/api/v1/teams/3
44977,2017021257,3,away,False,REG,Alain Vigneault,0.0,17.0,25.0,6.0,...,left,False,False,0,1,10,NY Rangers,Rangers,NYR,/api/v1/teams/3
44978,2017020059,3,away,False,REG,Alain Vigneault,1.0,38.0,13.0,8.0,...,right,False,False,0,1,10,NY Rangers,Rangers,NYR,/api/v1/teams/3
44979,2015021069,3,away,False,REG,Alain Vigneault,1.0,26.0,19.0,8.0,...,right,False,False,0,1,10,NY Rangers,Rangers,NYR,/api/v1/teams/3


In [124]:
df_teams_merged[df_teams_merged['giveaways'].gt(30)].sample(5)

Unnamed: 0,game_id,team_id,HoA,won,settled_in,head_coach,goals,shots,hits,pim,...,startRinkSide,is_home,has_many_hits,wins,games,franchiseId,shortName,teamName,abbreviation,link
27888,2003020284,25,home,True,REG,Dave Tippett,3.0,17.0,28.0,8.0,...,left,True,False,1,1,15,Dallas,Stars,DAL,/api/v1/teams/25
47811,2003020224,17,away,True,REG,Dave Lewis,6.0,27.0,9.0,20.0,...,left,False,False,1,1,12,Detroit,Red Wings,DET,/api/v1/teams/17
49494,2003020726,19,away,False,REG,Joel Quenneville,0.0,33.0,19.0,10.0,...,left,False,False,0,1,18,St Louis,Blues,STL,/api/v1/teams/19
32087,2003020335,1,away,False,REG,Pat Burns,0.0,21.0,10.0,14.0,...,left,False,False,0,1,23,New Jersey,Devils,NJD,/api/v1/teams/1
27956,2003020085,25,home,False,REG,Dave Tippett,1.0,16.0,27.0,15.0,...,left,True,False,0,1,15,Dallas,Stars,DAL,/api/v1/teams/25
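
The two filters above each use a single condition. As a sketch, conditions can also be combined with `&` (and), `|` (or), and `~` (not), and `isin()` matches against several values at once. The frame below is toy data with made-up values.

```python
import pandas as pd

# Toy frame reusing column names from the merged table; values are made up.
games = pd.DataFrame({
    "teamName": ["Rangers", "Rangers", "Stars", "Devils"],
    "is_home": [True, False, True, False],
    "giveaways": [31.0, 12.0, 35.0, 8.0],
})

# Both conditions must hold: home games with more than 30 giveaways.
home_sloppy = games[games["is_home"] & games["giveaways"].gt(30)]

# Match any of several team names.
selected = games[games["teamName"].isin(["Rangers", "Devils"])]

# Negate a condition with ~.
not_rangers = games[~games["teamName"].eq("Rangers")]

print(home_sloppy)
print(selected)
print(not_rangers)
```

When writing comparisons with operators instead of `.eq()`/`.gt()`, wrap each one in parentheses, for example `(games["giveaways"] > 30) & games["is_home"]`, because `&` binds more tightly than `>`.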


## Column Listing and Manipulation

In [125]:
df_rangers.columns

Index(['game_id', 'team_id', 'HoA', 'won', 'settled_in', 'head_coach', 'goals',
       'shots', 'hits', 'pim', 'powerPlayOpportunities', 'powerPlayGoals',
       'faceOffWinPercentage', 'giveaways', 'takeaways', 'blocked',
       'startRinkSide', 'is_home', 'has_many_hits', 'wins', 'games',
       'franchiseId', 'shortName', 'teamName', 'abbreviation', 'link'],
      dtype='object')

In [126]:
df_rangers = df_rangers.drop(columns=["HoA", "settled_in", "startRinkSide"])

In [127]:
df_subset = df_rangers[["wins", 'games', "head_coach", "goals", "shots", "hits", "pim", "powerPlayOpportunities", "powerPlayGoals", "faceOffWinPercentage", "giveaways", "takeaways", "blocked"]]

## Grouping

In [128]:
df_subset.groupby('head_coach').agg(['mean', 'sum']).sort_values(('wins','mean'), ascending=False)

head_coach,wins (mean),wins (sum),games (mean),games (sum),goals (mean),goals (sum),shots (mean),shots (sum),hits (mean),hits (sum),...,powerPlayGoals (mean),powerPlayGoals (sum),faceOffWinPercentage (mean),faceOffWinPercentage (sum),giveaways (mean),giveaways (sum),takeaways (mean),takeaways (sum),blocked (mean),blocked (sum)
Terry O'Reilly,1.0,1,1.0,1,2.0,2.0,30.0,30.0,5.0,5.0,...,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Alain Vigneault,0.545648,257,1.0,471,2.821656,1329.0,30.751592,14484.0,24.420382,11502.0,...,0.547771,258.0,48.318047,22757.8,10.511677,4951.0,7.698514,3626.0,15.063694,7095.0
John Tortorella,0.53125,187,1.0,352,2.630682,926.0,29.806818,10492.0,29.78125,10483.0,...,0.576705,203.0,49.009639,12203.4,6.03125,2123.0,6.903409,2430.0,15.934659,5609.0
Tom Renney,0.501529,164,1.0,327,2.678899,876.0,31.04893,10153.0,21.675841,7088.0,...,0.813456,266.0,,0.0,8.7737,2869.0,7.730887,2528.0,13.844037,4527.0
David Quinn,0.446602,138,1.0,309,2.957929,914.0,30.05178,9286.0,24.278317,7502.0,...,0.627832,194.0,46.823948,14468.6,12.524272,3870.0,8.106796,2505.0,14.883495,4599.0
Ron Low,0.420732,69,1.0,164,2.908537,477.0,28.884146,4737.0,,0.0,...,0.689024,113.0,,0.0,,0.0,,0.0,,0.0
Bryan Trottier,0.377358,20,1.0,53,2.528302,134.0,29.018868,1538.0,3.735849,198.0,...,0.679245,36.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Glen Sather,0.366667,33,1.0,90,2.6,234.0,30.4,2736.0,17.155556,1544.0,...,0.588889,53.0,,0.0,9.833333,885.0,6.544444,589.0,11.633333,1047.0
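
Passing a list to `agg` produces two-level column names, and the top row above belongs to a coach with a single game. A possible alternative is named aggregation, which gives flat column names, plus a minimum-games filter before ranking. The coach names, numbers, and the threshold of 3 games are all made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_subset; coaches and values are hypothetical.
toy = pd.DataFrame({
    "head_coach": ["Coach A", "Coach A", "Coach B", "Coach B", "Coach B"],
    "wins": [1, 0, 1, 1, 0],
    "games": [1, 1, 1, 1, 1],
    "goals": [3.0, 1.0, 4.0, 2.0, 0.0],
})

# Named aggregation: output_column=(input_column, function).
summary = (
    toy.groupby("head_coach")
    .agg(
        games=("games", "sum"),
        wins=("wins", "sum"),
        win_rate=("wins", "mean"),
        goals_per_game=("goals", "mean"),
    )
    .sort_values("win_rate", ascending=False)
)
print(summary)

# Keep only coaches with enough games for the rate to be meaningful.
# The threshold of 3 is an arbitrary choice for this toy data.
enough_games = summary[summary["games"] >= 3]
print(enough_games)
```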
