# Pandas DataFrames 101
[Real Python Link](https://realpython.com/courses/pandas-dataframes-101/)


## Video 1: Importing CSV Data Into a Pandas DataFrame
This video uses data downloaded from basketball-reference.com.

In [58]:
import pandas as pd
import numpy as np
import vincent
from pandas import DataFrame, Series
vincent.core.initialize_notebook() # enables vincent in our notebook
pd.set_option('display.max_columns', None) # allows us to render wide tables

In [24]:
data = pd.read_csv('Data/kevin.csv')
data.head(5)

Unnamed: 0,Rk,G,Date,Age,Tm,Unnamed: 5,Opp,Unnamed: 7,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
0,1,1.0,2012-11-01,24-033,OKC,@,SAS,L (-2),1,40:39,9,18,0.5,1,2,0.5,4,5,0.8,2,12,14,5,2,0,4,1,23,19.7,2
1,2,2.0,2012-11-02,24-034,OKC,,POR,W (+14),1,42:41,7,14,0.5,1,3,0.333,8,12,0.667,2,15,17,7,0,2,6,1,23,20.2,13
2,3,3.0,2012-11-04,24-036,OKC,,ATL,L (-9),1,42:05,7,17,0.412,1,4,0.25,7,8,0.875,1,11,12,8,3,2,6,3,22,19.3,-8
3,4,4.0,2012-11-06,24-038,OKC,,TOR,W (+20),1,29:09,4,11,0.364,1,4,0.25,6,6,1.0,0,6,6,3,2,0,4,3,15,9.6,23
4,5,5.0,2012-11-08,24-040,OKC,@,CHI,W (+6),1,38:23,11,19,0.579,0,2,0.0,2,2,1.0,1,3,4,1,3,3,6,1,24,16.1,-3
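To try the same call without the original file, here is a minimal sketch that reads a small in-memory CSV. The two sample rows and the reduced column set are made up for illustration and are not the real game log.

```python
import io

import pandas as pd

# Hypothetical two-row sample shaped like the game log (not the real file).
sample_csv = io.StringIO(
    "Rk,G,Date,Tm,Unnamed: 5,Opp,MP,FG,FGA\n"
    "1,1,2012-11-01,OKC,@,SAS,40:39,9,18\n"
    "2,2,2012-11-02,OKC,,POR,42:41,7,14\n"
)

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(sample_csv)

print(sample.head(5))
print(sample.shape)
```

The empty field in the second row is read as NaN, which is why home games show a blank in the table above.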


### Renaming some columns

In [25]:
data.rename(columns={"Unnamed: 5":'Home/Away', "Unnamed: 7":'Win/Loss'}, inplace=True)
data.head()

Unnamed: 0,Rk,G,Date,Age,Tm,Home/Away,Opp,Win/Loss,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
0,1,1.0,2012-11-01,24-033,OKC,@,SAS,L (-2),1,40:39,9,18,0.5,1,2,0.5,4,5,0.8,2,12,14,5,2,0,4,1,23,19.7,2
1,2,2.0,2012-11-02,24-034,OKC,,POR,W (+14),1,42:41,7,14,0.5,1,3,0.333,8,12,0.667,2,15,17,7,0,2,6,1,23,20.2,13
2,3,3.0,2012-11-04,24-036,OKC,,ATL,L (-9),1,42:05,7,17,0.412,1,4,0.25,7,8,0.875,1,11,12,8,3,2,6,3,22,19.3,-8
3,4,4.0,2012-11-06,24-038,OKC,,TOR,W (+20),1,29:09,4,11,0.364,1,4,0.25,6,6,1.0,0,6,6,3,2,0,4,3,15,9.6,23
4,5,5.0,2012-11-08,24-040,OKC,@,CHI,W (+6),1,38:23,11,19,0.579,0,2,0.0,2,2,1.0,1,3,4,1,3,3,6,1,24,16.1,-3


In [26]:
data.columns

Index(['Rk', 'G', 'Date', 'Age', 'Tm', 'Home/Away', 'Opp', 'Win/Loss', 'GS',
       'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'GmSc', '+/-'],
      dtype='object')
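The cell above renames in place. As a sketch of the alternative, rename() can also return a new DataFrame and leave the original untouched; the tiny `games` frame below is a stand-in built for illustration.

```python
import pandas as pd

# Stand-in frame with the two auto-generated column names.
games = pd.DataFrame(
    {"Unnamed: 5": ["@", None], "Unnamed: 7": ["L (-2)", "W (+14)"]}
)

# Without inplace=True, rename() returns a renamed copy.
renamed = games.rename(
    columns={"Unnamed: 5": "Home/Away", "Unnamed: 7": "Win/Loss"}
)

print(list(renamed.columns))
print(list(games.columns))  # the original keeps its old names
```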

## Video 2: Slicing and Dicing Pandas DataFrame
### Deleting Columns
Columns can be removed with the drop() method, or with the del keyword followed by the DataFrame and column name.

In [27]:
data.drop(columns=['Rk', 'Home/Away', 'Tm', 'GS'], inplace=True)
del data['Win/Loss']
data.head()

Unnamed: 0,G,Date,Age,Opp,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
0,1.0,2012-11-01,24-033,SAS,40:39,9,18,0.5,1,2,0.5,4,5,0.8,2,12,14,5,2,0,4,1,23,19.7,2
1,2.0,2012-11-02,24-034,POR,42:41,7,14,0.5,1,3,0.333,8,12,0.667,2,15,17,7,0,2,6,1,23,20.2,13
2,3.0,2012-11-04,24-036,ATL,42:05,7,17,0.412,1,4,0.25,7,8,0.875,1,11,12,8,3,2,6,3,22,19.3,-8
3,4.0,2012-11-06,24-038,TOR,29:09,4,11,0.364,1,4,0.25,6,6,1.0,0,6,6,3,2,0,4,3,15,9.6,23
4,5.0,2012-11-08,24-040,CHI,38:23,11,19,0.579,0,2,0.0,2,2,1.0,1,3,4,1,3,3,6,1,24,16.1,-3
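The two deletion styles behave a little differently, which the following sketch tries to show on a made-up four-column frame: drop() returns a new DataFrame unless inplace=True is passed, while del always modifies the DataFrame it is applied to.

```python
import pandas as pd

# Made-up frame for illustration only.
games = pd.DataFrame(
    {"Rk": [1, 2], "Tm": ["OKC", "OKC"], "Opp": ["SAS", "POR"], "PTS": [23, 23]}
)

# drop() can remove several columns at once and returns a new frame.
trimmed = games.drop(columns=["Rk", "Tm"])

# del removes one column from the existing frame.
del games["Rk"]

print(list(trimmed.columns))
print(list(games.columns))
```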


## Video 3: Mapping and Analyzing a Data Set in Pandas

### Field Goals Made / Field Goals Attempted Per Minute

First we need to subset the data that we're going to use. We'll also check the data types to make sure the columns are in a usable format.

In [28]:
# Notice that none of the fields are numeric
temp = data[['MP', 'FG', 'FGA']][data['MP']!='Inactive']
data = data[data['MP']!='Inactive']
temp.dtypes

MP     object
FG     object
FGA    object
dtype: object

We need to take these columns, manipulate them, and put them in a format that meets our needs. First we need to write a function that converts our MP column to total minutes. We'll also import some basic time packages.

In [29]:
import time
import datetime

def str_to_minutes(minutes):
    minutes = str(minutes)
    minutes = time.strptime(minutes, '%M:%S')
    return datetime.timedelta(minutes=minutes.tm_min, seconds=minutes.tm_sec).total_seconds()/60

In [30]:
str_to_minutes('10:30') #10.5

10.5
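One caveat with the strptime approach is that the %M directive only accepts values from 0 to 59, so a string such as '60:00' would raise a ValueError. That does not matter for this data set, but as a sketch of an alternative, the hypothetical helper below (mp_to_minutes is not part of the original notebook) splits the string directly.

```python
def mp_to_minutes(value):
    """Convert an 'MM:SS' string to minutes as a float."""
    minutes, seconds = str(value).split(":")
    return int(minutes) + int(seconds) / 60


print(mp_to_minutes("10:30"))  # 10.5
print(mp_to_minutes("60:00"))  # 60.0
```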

Now we need to convert the MP field to minutes using our str_to_minutes() function. This is done with the map() method, which takes a function and applies it to each element of the column.

Notice that the data type of the MP field is now float64.

In [31]:
temp['MP'] = temp['MP'].map(str_to_minutes)
data['MP'] = data['MP'].map(str_to_minutes)
temp.info

<bound method DataFrame.info of            MP  FG FGA
0   40.650000   9  18
1   42.683333   7  14
2   42.083333   7  17
3   29.150000   4  11
4   38.383333  11  19
..        ...  ..  ..
76  45.216667   7  17
77  37.383333   6  10
78  34.450000  10  16
79  29.700000   7  11
80  37.333333  10  16

[81 rows x 3 columns]>

In [32]:
temp.dtypes

MP     float64
FG      object
FGA     object
dtype: object

In [33]:
temp[['FG', 'FGA']] = temp[['FG', 'FGA']].astype('float')
data[['FG', 'FGA']] = data[['FG', 'FGA']].astype('float')

temp

Unnamed: 0,MP,FG,FGA
0,40.650000,9.0,18.0
1,42.683333,7.0,14.0
2,42.083333,7.0,17.0
3,29.150000,4.0,11.0
4,38.383333,11.0,19.0
...,...,...,...
76,45.216667,7.0,17.0
77,37.383333,6.0,10.0
78,34.450000,10.0,16.0
79,29.700000,7.0,11.0


Now we're going to create two new fields, FGA/M and FG/M: field goals attempted per minute and field goals made per minute.

In [34]:
temp['FGA/M'] = temp['FGA'] / temp['MP']
temp['FG/M'] = temp['FG'] / temp['MP']
temp.head()

Unnamed: 0,MP,FG,FGA,FGA/M,FG/M
0,40.65,9.0,18.0,0.442804,0.221402
1,42.683333,7.0,14.0,0.327997,0.163998
2,42.083333,7.0,17.0,0.40396,0.166337
3,29.15,4.0,11.0,0.377358,0.137221
4,38.383333,11.0,19.0,0.495007,0.286583


We can get basic summary statistics using the describe() method.

In [35]:
temp.describe()

Unnamed: 0,MP,FG,FGA,FGA/M,FG/M
count,81.0,81.0,81.0,81.0,81.0
mean,38.504527,9.024691,17.691358,0.455967,0.234632
std,5.771923,2.554287,5.001605,0.094907,0.057639
min,23.766667,4.0,8.0,0.267499,0.09539
25%,35.766667,7.0,14.0,0.387812,0.202703
50%,39.1,9.0,17.0,0.447344,0.232288
75%,42.083333,10.0,21.0,0.513619,0.267857
max,49.666667,16.0,31.0,0.699029,0.411487


## Video 4: Working with groupby() in Pandas

We can use the .groupby() method to collect rows into groups based on a field in the DataFrame. In the example below, we group the data by opponent. This will allow us to see which teams Kevin Durant performed best and worst against.

In [36]:
group_by_opp = data.groupby('Opp')

In [37]:
# How many times did KD play against each team?
group_by_opp.size()

Opp
ATL    2
BOS    2
BRK    2
CHA    2
CHI    2
CLE    2
DAL    4
DEN    4
DET    2
GSW    4
HOU    3
IND    2
LAC    3
LAL    4
MEM    3
MIA    2
MIL    1
MIN    4
NOH    4
NYK    2
ORL    2
PHI    2
PHO    4
POR    4
SAC    3
SAS    4
TOR    2
UTA    4
WAS    2
dtype: int64
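To see what groupby() and size() do on a scale that is easy to check by hand, here is a small sketch with four made-up rows.

```python
import pandas as pd

# Four made-up games against three opponents.
games = pd.DataFrame(
    {"Opp": ["ATL", "SAS", "ATL", "POR"], "PTS": [22, 23, 41, 23]}
)

# size() counts the rows in each group; group keys come back sorted.
counts = games.groupby("Opp").size()

print(counts)
```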

If we change the method to .sum() we can get the totals for the counting statistics. Columns that are still stored as strings, such as Date and Age, are concatenated rather than added.

In [38]:
group_by_opp.sum()

Opp,G,Date,Age,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
ATL,28.0,2012-11-042012-12-19,24-03624-081,82.95,21.0,40.0,.412.609,14,48,.250.500,79,810,.875.900,11,1112,1213,83,30,22,66,31,2241,19.331.5,-8+2
BOS,76.0,2012-11-232013-03-10,24-05524-162,76.133333,15.0,36.0,.450.375,30,74,.429.000,811,811,1.0001.000,0,311,311,24,11,10,33,33,2923,18.417.1,+4+4
BRK,50.0,2012-12-042013-01-02,24-06624-095,83.416667,20.0,33.0,.563.647,22,25,1.000.400,123,124,1.000.750,0,55,55,65,12,10,15,43,3227,29.219.9,+12-11
CHA,77.0,2012-11-262013-03-08,24-05824-160,50.233333,12.0,20.0,.750.500,21,43,.500.333,46,47,1.000.857,10,56,66,47,0,41,22,20,1819,19.818.0,+44+16
CHI,61.0,2012-11-082013-02-24,24-04024-148,70.083333,17.0,38.0,.579.316,1,24,.000.250,26,26,1.0001.000,12,314,416,16,30,32,64,10,2419,16.115.3,-3+22
CLE,54.0,2012-11-112013-02-02,24-04324-126,72.466667,17.0,37.0,.563.381,13,27,.500.429,713,817,.875.765,2,89,811,23,1,10,23,12,2632,20.122.3,+160
DAL,183.0,2012-12-272013-01-182013-02-042013-03-17,24-08924-11124-12824-169,166.283333,42.0,89.0,.464.419.545.526,4522,8925,.500.5561.000.400,102159,102169,1.0001.000.8331.000,2000,69109,89109,5141,1222,3012,4436,3211,40521931,30.236.118.422.1,+6-1+17+4
DEN,206.0,2013-01-162013-01-202013-03-012013-03-19,24-10924-11324-15324-171,159.983333,33.0,75.0,.583.350.450.435,2310,51046,.400.300.250.000,420614,521716,.800.952.857.875,11,27136,27147,4835,5221,32,3546,2404,20372534,18.628.521.021.9,+13-1+10-3
DET,14.0,2012-11-092012-11-12,24-04124-044,81.816667,17.0,38.0,.563.364,20,22,1.000.000,510,510,1.0001.000,10,129,139,23,10,4,32,11,2526,20.719.0,+7+8
GSW,182.0,2012-11-182013-01-232013-02-062013-04-11,24-05024-11624-13024-194,143.516667,38.0,67.0,.500.588.556.625,3222,5333,.600.667.667.667,61139,712410,.857.917.750.900,11,13569,135710,10948,2123,321,2654,311,25332531,27.528.419.331.7,+25-14+2+25
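Only MP, FG and FGA were converted to numbers, so the other columns in this output are strings joined end to end. One way to keep string columns out of the totals, sketched here on a made-up three-row frame, is to pass numeric_only=True to sum().

```python
import pandas as pd

# Made-up frame: Date is text, FG and FGA are numeric.
games = pd.DataFrame(
    {
        "Opp": ["ATL", "ATL", "SAS"],
        "Date": ["2012-11-04", "2012-12-19", "2012-11-01"],
        "FG": [7.0, 14.0, 9.0],
        "FGA": [17.0, 23.0, 18.0],
    }
)

# numeric_only=True leaves the text column out of the result.
totals = games.groupby("Opp").sum(numeric_only=True)

print(totals)
```

Selecting the columns before summing, as in games.groupby("Opp")[["FG", "FGA"]].sum(), gives the same result for these two columns.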


In [39]:
data[data.Opp == 'ATL']

Unnamed: 0,G,Date,Age,Opp,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
2,3.0,2012-11-04,24-036,ATL,42.083333,7.0,17.0,0.412,1,4,0.25,7,8,0.875,1,11,12,8,3,2,6,3,22,19.3,-8
24,25.0,2012-12-19,24-081,ATL,40.866667,14.0,23.0,0.609,4,8,0.5,9,10,0.9,1,12,13,3,0,2,6,1,41,31.5,2


Now we're going to analyze FG and FGA. First we need to slice the data so that it contains only the needed columns.

In [40]:
field_goal_per_team = group_by_opp.sum()[['FG', 'FGA']]
field_goal_per_team

Opp,FG,FGA
ATL,21.0,40.0
BOS,15.0,36.0
BRK,20.0,33.0
CHA,12.0,20.0
CHI,17.0,38.0
CLE,17.0,37.0
DAL,42.0,89.0
DEN,33.0,75.0
DET,17.0,38.0
GSW,38.0,67.0


In [None]:
data.info()

## Video 5: Plotting a DataFrame

Next we'll use vincent to create a stacked bar graph.

In [46]:
field_goal_per_team
stacked = vincent.StackedBar(field_goal_per_team)
#stacked.legend(title='Field Goals')
#stacked.scales['x'].padding = 0.1
#stacked.display()


AttributeError: 'Series' object has no attribute 'iteritems'
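This error most likely means that vincent is calling Series.iteritems(), which was removed in pandas 2.0; the full traceback is not shown, so that is an assumption. The sketch below shows the replacement method, items(), on a two-value Series built for illustration.

```python
import pandas as pd

# Two made-up totals, indexed by opponent.
fg = pd.Series({"ATL": 21.0, "BOS": 15.0})

# items() yields (index, value) pairs, as iteritems() used to.
pairs = list(fg.items())
print(pairs)

# False on pandas 2.0 and later, where iteritems() no longer exists.
print(hasattr(fg, "iteritems"))
```

If vincent cannot be made to work with the installed pandas version, pandas' own DataFrame.plot.bar(stacked=True), which requires matplotlib, is another way to draw a stacked bar chart.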

In [60]:
field_goal_per_team['index'] = np.arange(field_goal_per_team.shape[0])
field_goal_per_team

Opp,FG,FGA,index
ATL,21.0,40.0,0
BOS,15.0,36.0,1
BRK,20.0,33.0,2
CHA,12.0,20.0,3
CHI,17.0,38.0,4
CLE,17.0,37.0,5
DAL,42.0,89.0,6
DEN,33.0,75.0,7
DET,17.0,38.0,8
GSW,38.0,67.0,9


In [61]:
field_goal_per_team.set_index('index')

index,FG,FGA
0,21.0,40.0
1,15.0,36.0
2,20.0,33.0
3,12.0,20.0
4,17.0,38.0
5,17.0,37.0
6,42.0,89.0
7,33.0,75.0
8,17.0,38.0
9,38.0,67.0
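Note that set_index() returns a new DataFrame, so field_goal_per_team itself is unchanged unless the result is assigned back. The sketch below illustrates this on a made-up two-row frame and also shows reset_index(), which turns the Opp index back into a regular column.

```python
import numpy as np
import pandas as pd

# Made-up totals indexed by opponent, like the grouped result.
totals = pd.DataFrame(
    {"FG": [21.0, 15.0], "FGA": [40.0, 36.0]},
    index=pd.Index(["ATL", "BOS"], name="Opp"),
)
totals["index"] = np.arange(totals.shape[0])

# set_index() returns a copy; totals keeps its Opp index.
reindexed = totals.set_index("index")
print(totals.index.name)
print(list(reindexed.index))

# reset_index() moves Opp into a column and numbers the rows from 0.
flat = totals.drop(columns="index").reset_index()
print(list(flat.columns))
```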
