# Lecture 3

The documentation covers:  
- Reshaping and Pivot Tables

## Reshaping and Pivot Tables

In [9]:
import numpy as np
import pandas as pd

home = pd.read_csv('data_processed/home.csv')
so = pd.read_csv('data_input/stackoverflow_qa.csv')

In [83]:
# adding a year and month column
so['questionyear'] = pd.DatetimeIndex(so['creationdate']).year
so['questionmonth'] = pd.DatetimeIndex(so['creationdate']).month

In [27]:
# top 20 answerers, ranked by summed ans_rep, then by summed answercount
top20 = so.groupby('ans_name').aggregate(np.sum).sort_values(by=['ans_rep','answercount'], ascending=False).head(20)
top20.head()

Unnamed: 0_level_0,id,score,viewcount,answercount,commentcount,favoritecount,quest_rep,ans_rep,questionyear,questionmonth
ans_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
jezrael,228737021995,7860,1785324,7988,5177,856.0,6209834.0,1015956000.0,10962131,36140
unutbu,29959502937,4003,3511034,1462,926,1612.0,3359192.0,454985400.0,1972607,6278
EdChum,61904922571,4812,3580513,2692,3123,898.0,3354967.0,231571400.0,3752567,11192
piRSquared,81847238788,3762,542888,3192,1978,537.0,3896510.0,199053900.0,3910186,12830
MaxU,65214217669,2348,495154,2404,2121,360.0,2862882.0,131291800.0,3161929,9881


In [28]:
criteria = so['ans_name'].isin(top20.index)
so_selected = so[criteria]
so_selected.head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep,questionyear,questionmonth
4,7577546,2011-09-28 01:58:38,9,2488,"Using pandas, how do I subsample a large DataF...",1,0,5.0,Uri Laserson,958.0,HYRY,54137.0,2011,9
121,10844493,2012-06-01 04:40:16,5,4397,DataFrame.apply in python pandas alters both o...,1,0,1.0,MikeGruz,28.0,BrenBarn,136870.0,2012,6
127,10943478,2012-06-08 05:24:57,12,11115,pandas reindex DataFrame with datetime objects,1,0,4.0,BFTM,895.0,BrenBarn,136870.0,2012,6
130,10972410,2012-06-10 21:12:43,19,46428,pandas: combine two columns in a DataFrame,5,0,5.0,BFTM,895.0,BrenBarn,136870.0,2012,6
145,11067027,2012-06-16 21:05:01,115,85762,Python Pandas - Re-ordering columns in a dataf...,11,2,28.0,pythOnometrist,1068.0,BrenBarn,136870.0,2012,6


### Pivot

In [90]:
# subset only one user with selected columns only
hy = so.loc[so['ans_name'] == 'HYRY', ['title', 'questionyear', 'viewcount', 'commentcount', 'quest_name']]
hy.head()

Unnamed: 0,title,questionyear,viewcount,commentcount,quest_name
4,"Using pandas, how do I subsample a large DataF...",2011,2488,0,Uri Laserson
216,Pandas xaxis auto-format issue,2012,613,0,joelhoro
367,Grouping data by multiple dates in pandas,2012,273,0,user1074057
722,Change Categorical Variable levels to What I p...,2012,1382,0,Tom Bennett
932,Plot key count per unique value count in pandas,2013,9115,0,monkut


In [91]:
hy.head(3)

Unnamed: 0,title,questionyear,viewcount,commentcount,quest_name
4,"Using pandas, how do I subsample a large DataF...",2011,2488,0,Uri Laserson
216,Pandas xaxis auto-format issue,2012,613,0,joelhoro
367,Grouping data by multiple dates in pandas,2012,273,0,user1074057


In [95]:
# each (index, columns) pair has to be unique, otherwise pivot() raises a ValueError
hy.pivot(index='title', columns='questionyear', values='viewcount').head()

questionyear,2011,2012,2013,2014,2015,2016,2017
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"3D animation with matplotlib, connect points to create moving stick figure",,,,2970.0,,,
Add multi-index to pandas dataframe and keep current index,,,4723.0,,,,
Adding means to a pandas dataframe,,,,,53.0,,
Adding values for missing data combinations in Pandas,,,,,209.0,,
"Aggregating overlapping ""all-previous-events"" features from time series data - in Python",,,,181.0,,,

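The uniqueness requirement can be checked on a small made-up frame. The sketch below uses hypothetical data (`toy`, with columns `title`, `year` and `views`) rather than the Stack Overflow set: `pivot` raises a `ValueError` when an index/column pair repeats, while `pivot_table` resolves the repeat by aggregating.

```python
import pandas as pd

# Hypothetical toy data: the pair ("a", 2012) appears twice.
toy = pd.DataFrame({
    "title": ["a", "a", "b"],
    "year": [2012, 2012, 2013],
    "views": [10, 20, 30],
})

try:
    toy.pivot(index="title", columns="year", values="views")
    pivot_failed = False
except ValueError as err:
    # pivot cannot decide which of the two values belongs in the cell
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table aggregates the duplicates instead (here: sum)
summed = toy.pivot_table(index="title", columns="year",
                         values="views", aggfunc="sum")
print(summed)
```

When duplicates are expected, `pivot_table` with an explicit `aggfunc` is usually the safer choice.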

If the `values` argument is omitted and the DataFrame has more than one column of values not used as index or columns, then the result will have hierarchical columns:

In [102]:
pivoted = hy.head().pivot(index='title', columns='questionyear')
pivoted

Unnamed: 0_level_0,viewcount,viewcount,viewcount,commentcount,commentcount,commentcount,quest_name,quest_name,quest_name
questionyear,2011,2012,2013,2011,2012,2013,2011,2012,2013
title,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Change Categorical Variable levels to What I provide/Combine levels two categorical variables,,1382.0,,,0.0,,,Tom Bennett,
Grouping data by multiple dates in pandas,,273.0,,,0.0,,,user1074057,
Pandas xaxis auto-format issue,,613.0,,,0.0,,,joelhoro,
Plot key count per unique value count in pandas,,,9115.0,,,0.0,,,monkut
"Using pandas, how do I subsample a large DataFrame by group in an efficient manner?",2488.0,,,0.0,,,Uri Laserson,,


In [104]:
# we can then subset from the pivoted dataframe
pivoted[['viewcount', 'quest_name']]

Unnamed: 0_level_0,viewcount,viewcount,viewcount,quest_name,quest_name,quest_name
questionyear,2011,2012,2013,2011,2012,2013
title,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Change Categorical Variable levels to What I provide/Combine levels two categorical variables,,1382.0,,,Tom Bennett,
Grouping data by multiple dates in pandas,,273.0,,,user1074057,
Pandas xaxis auto-format issue,,613.0,,,joelhoro,
Plot key count per unique value count in pandas,,,9115.0,,,monkut
"Using pandas, how do I subsample a large DataFrame by group in an efficient manner?",2488.0,,,Uri Laserson,,

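Hierarchical columns can be addressed at several granularities. The following is a minimal sketch on hypothetical toy data (the names `toy`, `views` and `comments` are made up for illustration) showing selection by top-level label, by a full `(value, year)` tuple, and by a cross-section on the inner level.

```python
import pandas as pd

# Hypothetical toy data with two value columns
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "year": [2011, 2012, 2012],
    "views": [10, 20, 30],
    "comments": [0, 1, 2],
})

# values omitted: both remaining columns are pivoted
wide = toy.pivot(index="title", columns="year")

top = wide["views"]                            # every year for one value column
one = wide[("views", 2012)]                    # a single column, returned as a Series
by_year = wide.xs(2012, axis=1, level="year")  # one year across all value columns

print(wide.columns.nlevels)
print(by_year)
```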

### Stacking and Unstacking

- `stack`: "pivot" a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels
- `unstack`: "pivot" a level of the (possibly hierarchical) row index to the column axis, producing a DataFrame with a new inner-most level of column labels

In [111]:
hy[['questionyear', 'quest_name', 'viewcount']].head()

Unnamed: 0,questionyear,quest_name,viewcount
4,2011,Uri Laserson,2488
216,2012,joelhoro,613
367,2012,user1074057,273
722,2012,Tom Bennett,1382
932,2013,monkut,9115


In [122]:
# stack() compresses a level in the DataFrame's columns
stacked = hy[['questionyear', 'quest_name', 'viewcount']].stack()
stacked.head(10)

4    questionyear            2011
     quest_name      Uri Laserson
     viewcount               2488
216  questionyear            2012
     quest_name          joelhoro
     viewcount                613
367  questionyear            2012
     quest_name       user1074057
     viewcount                273
722  questionyear            2012
dtype: object

If our DataFrame or Series has a `MultiIndex`, we can choose which level to stack or unstack. The `stacked` Series above has a two-level index, so we can pass the level as an argument to our `stack` or `unstack` call. The default is to unstack the last level:

In [120]:
stacked.unstack(0)

Unnamed: 0,4,216,367,722,932,937,962,1065,1197,1219,...,27824,27926,27950,27952,27980,28395,28807,29357,29464,32894
questionyear,2011,2012,2012,2012,2013,2013,2013,2013,2013,2013,...,2016,2016,2016,2016,2016,2016,2016,2016,2016,2017
quest_name,Uri Laserson,joelhoro,user1074057,Tom Bennett,monkut,zzzeek,Kyle Brandt,Einar,John Salvatier,John,...,marie,Radical Edward,Lin Ma,Lin Ma,user1934212,user3177938,ShanZhengYang,pizzacat,vera,Marat
viewcount,2488,613,273,1382,9115,9456,11317,163,1148,268,...,680,112,254,204,34,120,367,84,92,99


In [131]:
# same as stacked.unstack().head()
stacked.unstack(1).head()

Unnamed: 0,questionyear,quest_name,viewcount
4,2011,Uri Laserson,2488
216,2012,joelhoro,613
367,2012,user1074057,273
722,2012,Tom Bennett,1382
932,2013,monkut,9115


Notice that `stack` and `unstack` implicitly sort the index levels involved. Hence a call to `stack` and then `unstack`, or vice versa, will result in a **sorted copy of the original** DataFrame or Series. 

In [134]:
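# caution: the built-in all() iterates over a DataFrame's column labels, so this
# is True for any non-empty labels; (a == b).all().all() compares element-wise
all(hy.stack().unstack() == hy.sort_index())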

True
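
A note on the check above: Python's built-in `all` iterates over a DataFrame's column labels, not its cells, so it cannot detect a mismatch. The sketch below, on two hypothetical frames `left` and `right` that differ in one cell, contrasts it with an element-wise reduction and with `DataFrame.equals`.

```python
import pandas as pd

# Two hypothetical frames that differ in exactly one cell
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"a": [1, 99], "b": [3, 4]})

# Built-in all() walks the column labels "a" and "b", which are truthy strings
labels_only = all(left == right)

# Reduce over rows, then over columns, to test every cell
elementwise = bool((left == right).all().all())

# equals() also compares element-wise and treats NaNs in the same position as equal
same = left.equals(right)

print(labels_only, elementwise, same)
```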

In [140]:
hybren = so.loc[so['ans_name'].isin(['HYRY','BrenBarn']),
                ['title', 'questionyear', 'viewcount', 'commentcount', 'quest_name','ans_name']]
hybren.head()

Unnamed: 0,title,questionyear,viewcount,commentcount,quest_name,ans_name
4,"Using pandas, how do I subsample a large DataF...",2011,2488,0,Uri Laserson,HYRY
121,DataFrame.apply in python pandas alters both o...,2012,4397,0,MikeGruz,BrenBarn
127,pandas reindex DataFrame with datetime objects,2012,11115,0,BFTM,BrenBarn
130,pandas: combine two columns in a DataFrame,2012,46428,0,BFTM,BrenBarn
145,Python Pandas - Re-ordering columns in a dataf...,2012,85762,2,pythOnometrist,BrenBarn


In [151]:
x = hybren.groupby(['ans_name', 'questionyear']).sum()
x

Unnamed: 0_level_0,Unnamed: 1_level_0,viewcount,commentcount
ans_name,questionyear,Unnamed: 2_level_1,Unnamed: 3_level_1
BrenBarn,2012,635522,11
BrenBarn,2013,127605,26
BrenBarn,2014,226402,68
BrenBarn,2015,62340,69
BrenBarn,2016,14126,76
BrenBarn,2017,1066,18
HYRY,2011,2488,0
HYRY,2012,2268,0
HYRY,2013,215018,51
HYRY,2014,145200,86


We can also stack or unstack more than one level at a time by passing a list of levels:

In [152]:
x.unstack(level=['ans_name']).head()

Unnamed: 0_level_0,viewcount,viewcount,commentcount,commentcount
ans_name,BrenBarn,HYRY,BrenBarn,HYRY
questionyear,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2011,,2488.0,,0.0
2012,635522.0,2268.0,11.0,0.0
2013,127605.0,215018.0,26.0,51.0
2014,226402.0,145200.0,68.0,86.0
2015,62340.0,8457.0,69.0,22.0


In [153]:
x.unstack(level=['ans_name', 'questionyear']).head()

           ans_name  questionyear
viewcount  BrenBarn  2012            635522
                     2013            127605
                     2014            226402
                     2015             62340
                     2016             14126
dtype: int64

`unstack` takes an optional `fill_value` argument to replace NaN with a specified value:

In [154]:
x.unstack(level=['ans_name'], fill_value=0).head()

Unnamed: 0_level_0,viewcount,viewcount,commentcount,commentcount
ans_name,BrenBarn,HYRY,BrenBarn,HYRY
questionyear,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2011,0,2488,0,0
2012,635522,2268,11,0
2013,127605,215018,26,51
2014,226402,145200,68,86
2015,62340,8457,69,22
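
To see level selection and `fill_value` in isolation, here is a small sketch on a hypothetical three-row Series (the numbers are made up). It unstacks the default last level, then the first level by position and by name, and finally fills the missing combination with 0.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("BrenBarn", 2012), ("HYRY", 2011), ("HYRY", 2012)],
    names=["ans_name", "questionyear"],
)
s = pd.Series([5, 1, 2], index=idx)  # hypothetical counts

last = s.unstack()                    # default: last level (questionyear) moves to columns
first = s.unstack(level=0)            # pick the level by position
named = s.unstack(level="ans_name")   # or by name; same result as level=0
filled = s.unstack(level="ans_name", fill_value=0)  # no NaN for (2011, BrenBarn)

print(last)
print(filled)
```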


### Reshaping with Melt

In [156]:
hybren.head()

Unnamed: 0,title,questionyear,viewcount,commentcount,quest_name,ans_name
4,"Using pandas, how do I subsample a large DataF...",2011,2488,0,Uri Laserson,HYRY
121,DataFrame.apply in python pandas alters both o...,2012,4397,0,MikeGruz,BrenBarn
127,pandas reindex DataFrame with datetime objects,2012,11115,0,BFTM,BrenBarn
130,pandas: combine two columns in a DataFrame,2012,46428,0,BFTM,BrenBarn
145,Python Pandas - Re-ordering columns in a dataf...,2012,85762,2,pythOnometrist,BrenBarn


`melt` reshapes the DataFrame from a wide format into a long one: the identifier columns (`id_vars`) are retained, and every other column is unpivoted into two new columns, `variable` (the original column name) and `value` (the corresponding cell value).

In [160]:
hybren.head(2).melt(id_vars=['title','quest_name'])

Unnamed: 0,title,quest_name,variable,value
0,"Using pandas, how do I subsample a large DataF...",Uri Laserson,questionyear,2011
1,DataFrame.apply in python pandas alters both o...,MikeGruz,questionyear,2012
2,"Using pandas, how do I subsample a large DataF...",Uri Laserson,viewcount,2488
3,DataFrame.apply in python pandas alters both o...,MikeGruz,viewcount,4397
4,"Using pandas, how do I subsample a large DataF...",Uri Laserson,commentcount,0
5,DataFrame.apply in python pandas alters both o...,MikeGruz,commentcount,0
6,"Using pandas, how do I subsample a large DataF...",Uri Laserson,ans_name,HYRY
7,DataFrame.apply in python pandas alters both o...,MikeGruz,ans_name,BrenBarn


The `variable` column can be renamed with `var_name` (and the `value` column, likewise, with `value_name`):

In [161]:
hybren.head(2).melt(id_vars=['title', 'quest_name', 'ans_name'], var_name='measurement')

Unnamed: 0,title,quest_name,ans_name,measurement,value
0,"Using pandas, how do I subsample a large DataF...",Uri Laserson,HYRY,questionyear,2011
1,DataFrame.apply in python pandas alters both o...,MikeGruz,BrenBarn,questionyear,2012
2,"Using pandas, how do I subsample a large DataF...",Uri Laserson,HYRY,viewcount,2488
3,DataFrame.apply in python pandas alters both o...,MikeGruz,BrenBarn,viewcount,4397
4,"Using pandas, how do I subsample a large DataF...",Uri Laserson,HYRY,commentcount,0
5,DataFrame.apply in python pandas alters both o...,MikeGruz,BrenBarn,commentcount,0
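
As a compact illustration, the sketch below melts a hypothetical two-row frame, restricting the unpivoted columns with `value_vars` and renaming both generated columns, then pivots the long result back to the wide layout. The frame and the names `measurement` and `count` are made up for the example.

```python
import pandas as pd

# Hypothetical wide data: one row per question
wide = pd.DataFrame({
    "title": ["q1", "q2"],
    "viewcount": [10, 20],
    "commentcount": [1, 0],
})

# Wide to long: one row per (title, measurement) pair
long = wide.melt(
    id_vars="title",
    value_vars=["viewcount", "commentcount"],
    var_name="measurement",
    value_name="count",
)

# Long back to wide: pivot is the inverse when each pair is unique
back = long.pivot(index="title", columns="measurement", values="count")

print(long)
print(back)
```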


### Combining with stats and GroupBy

In [163]:
hybren.head()

Unnamed: 0,title,questionyear,viewcount,commentcount,quest_name,ans_name
4,"Using pandas, how do I subsample a large DataF...",2011,2488,0,Uri Laserson,HYRY
121,DataFrame.apply in python pandas alters both o...,2012,4397,0,MikeGruz,BrenBarn
127,pandas reindex DataFrame with datetime objects,2012,11115,0,BFTM,BrenBarn
130,pandas: combine two columns in a DataFrame,2012,46428,0,BFTM,BrenBarn
145,Python Pandas - Re-ordering columns in a dataf...,2012,85762,2,pythOnometrist,BrenBarn


In [254]:
x = hybren.groupby(['ans_name', 'questionyear']).sum()
x.head(8)

Unnamed: 0_level_0,Unnamed: 1_level_0,viewcount,commentcount
ans_name,questionyear,Unnamed: 2_level_1,Unnamed: 3_level_1
BrenBarn,2012,635522,11
BrenBarn,2013,127605,26
BrenBarn,2014,226402,68
BrenBarn,2015,62340,69
BrenBarn,2016,14126,76
BrenBarn,2017,1066,18
HYRY,2011,2488,0
HYRY,2012,2268,0


In [255]:
dat = x.unstack(level=['ans_name'], fill_value=0)
dat.columns.names=['measurement', 'ans_name']
dat.head()

measurement,viewcount,viewcount,commentcount,commentcount
ans_name,BrenBarn,HYRY,BrenBarn,HYRY
questionyear,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2011,0,2488,0,0
2012,635522,2268,11,0
2013,127605,215018,26,51
2014,226402,145200,68,86
2015,62340,8457,69,22


In [205]:
dat.stack().head()

Unnamed: 0_level_0,measurement,viewcount,commentcount
questionyear,ans_name,Unnamed: 2_level_1,Unnamed: 3_level_1
2011,BrenBarn,0,0
2011,HYRY,2488,0
2012,BrenBarn,635522,11
2012,HYRY,2268,0
2013,BrenBarn,127605,26


We can call `stack` and then sum along the required axis, optionally preserving an index level:

In [258]:
# with the default level=None, sum() collapses every index level
dat.stack().sum()

measurement
viewcount       1452270
commentcount        472
dtype: int64

In [259]:
# preserve first level of index (axis=0, level=0)
# note: the `level` argument was removed in pandas 2.0; there, use .groupby(level=0).sum()
dat.stack().sum(axis=0, level=0).head()

measurement,viewcount,commentcount
questionyear,Unnamed: 1_level_1,Unnamed: 2_level_1
2011,2488,0
2012,637790,11
2013,342623,77
2014,371602,154
2015,70797,91


In [260]:
# preserve second level of index
dat.stack().sum(axis=0, level=1)

measurement,viewcount,commentcount
ans_name,Unnamed: 1_level_1,Unnamed: 2_level_1
BrenBarn,1067061,268
HYRY,385209,204


In [249]:
# again, by default level=None so no column level is preserved
dat.stack().sum(axis=1).head()

questionyear  ans_name
2011          BrenBarn         0
              HYRY          2488
2012          BrenBarn    635533
              HYRY          2268
2013          BrenBarn    127631
dtype: int64

In [250]:
# preserve first level of columns
dat.stack().sum(axis=1, level=0).head()

Unnamed: 0_level_0,measurement,viewcount,commentcount
questionyear,ans_name,Unnamed: 2_level_1,Unnamed: 3_level_1
2011,BrenBarn,0,0
2011,HYRY,2488,0
2012,BrenBarn,635522,11
2012,HYRY,2268,0
2013,BrenBarn,127605,26
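
The `level` argument of `sum` used in the cells above was deprecated in pandas 1.3 and removed in pandas 2.0. A minimal sketch of the equivalent per-level sums with `groupby(level=...)`, on a hypothetical toy frame with the same index names:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[2011, 2012], ["BrenBarn", "HYRY"]],
    names=["questionyear", "ans_name"],
)
# Hypothetical numbers, only meant to make the sums easy to follow
toy = pd.DataFrame(
    {"viewcount": [0, 5, 10, 20], "commentcount": [0, 1, 2, 3]},
    index=idx,
)

total = toy.sum()                                   # collapse every index level
by_year = toy.groupby(level="questionyear").sum()   # keep the first level
by_person = toy.groupby(level="ans_name").sum()     # keep the second level

print(by_year)
print(by_person)
```

Grouping by level name rather than position keeps the code readable when the index order changes.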


### Pivot tables

While `pandas` lets us use `pivot()` for general-purpose pivoting on various data types (strings, numerics, etc.), it also provides `pivot_table()` for pivoting with aggregation of numeric data. Its default aggregation function is the mean:

In [289]:
so['questionquarter'] = pd.DatetimeIndex(so['creationdate']).quarter
hybren = so.loc[so['ans_name'].isin(['HYRY','BrenBarn']),
                ['questionyear', 'viewcount', 'commentcount', 'quest_name','ans_name', 'questionmonth', 'questionquarter']]
hybren.head()

Unnamed: 0,questionyear,viewcount,commentcount,quest_name,ans_name,questionmonth,questionquarter
4,2011,2488,0,Uri Laserson,HYRY,9,3
121,2012,4397,0,MikeGruz,BrenBarn,6,2
127,2012,11115,0,BFTM,BrenBarn,6,2
130,2012,46428,0,BFTM,BrenBarn,6,2
145,2012,85762,2,pythOnometrist,BrenBarn,6,2


In [292]:
hybren.pivot_table(index=['ans_name', 'questionquarter'], columns='questionyear', values='viewcount').head()

Unnamed: 0_level_0,questionyear,2011,2012,2013,2014,2015,2016,2017
ans_name,questionquarter,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
BrenBarn,1,,,14894.5,6118.7,1308.35,207.941176,164.0
BrenBarn,2,,34442.2,3452.909091,9357.615385,1170.375,598.909091,
BrenBarn,3,,61660.2,,1553.642857,964.272727,289.8,126.666667
BrenBarn,4,,17223.333333,6009.0,1678.076923,1350.25,319.25,30.0
HYRY,1,,,4312.894737,3319.405405,567.75,588.25,99.0


Notice that if we pass several columns to `values`, the pivot table aggregates each of them and adds an extra level of hierarchy to the columns, one group per value column:

In [300]:
hybren.pivot_table(index=['ans_name','questionquarter'], 
                   columns='questionyear', 
                   values=['viewcount','commentcount'], 
                   aggfunc=np.sum, 
                   fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,commentcount,commentcount,commentcount,commentcount,commentcount,commentcount,commentcount,viewcount,viewcount,viewcount,viewcount,viewcount,viewcount,viewcount
Unnamed: 0_level_1,questionyear,2011,2012,2013,2014,2015,2016,2017,2011,2012,2013,2014,2015,2016,2017
ans_name,questionquarter,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
BrenBarn,1,0,0,4,17,15,20,16,0,0,59578,61187,26167,3535,656
BrenBarn,2,0,2,5,15,26,15,0,0,172211,37982,121649,9363,6588,0
BrenBarn,3,0,1,0,27,12,11,2,0,308301,0,21751,10607,1449,380
BrenBarn,4,0,8,17,9,16,30,0,0,155010,30045,21815,16203,2554,30
HYRY,1,0,0,4,31,1,11,7,0,0,81945,122818,2271,7059,99
HYRY,2,0,0,4,38,5,4,0,0,0,12095,13531,3457,1963,0
HYRY,3,0,0,9,3,15,7,0,2488,886,17444,4134,1747,1843,0
HYRY,4,0,0,34,14,1,16,0,0,1382,103534,4717,982,814,0


If we would rather leave missing values blank instead of filling them with 0, one option is to use `to_string(na_rep='')`, which renders the table as plain text:

In [309]:
table = hybren.pivot_table(index=['ans_name','questionquarter'], 
                           columns='questionyear', 
                           values='commentcount', 
                           aggfunc=np.sum)
print(table.to_string(na_rep=''))

questionyear              2011  2012  2013  2014  2015  2016  2017
ans_name questionquarter                                          
BrenBarn 1                             4.0  17.0  15.0  20.0  16.0
         2                       2.0   5.0  15.0  26.0  15.0      
         3                       1.0        27.0  12.0  11.0   2.0
         4                       8.0  17.0   9.0  16.0  30.0   0.0
HYRY     1                             4.0  31.0   1.0  11.0   7.0
         2                             4.0  38.0   5.0   4.0      
         3                 0.0   0.0   9.0   3.0  15.0   7.0      
         4                       0.0  34.0  14.0   1.0  16.0      


If we specify `margins=True` to `pivot_table`, a special row and column named `All` will be added with partial group aggregates across all the categories on our rows and columns:

In [317]:
hybren.pivot_table(index='questionquarter', 
                   columns='questionyear', 
                   values='viewcount',
                   margins=True).head()

questionyear,2011,2012,2013,2014,2015,2016,2017,All
questionquarter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,,,6153.173913,3915.0,1184.916667,365.310345,151.0,2854.023438
2,,34442.2,3129.8125,4827.857143,854.666667,503.0,,4677.024691
3,2488.0,44169.571429,2492.0,1125.434783,617.7,193.647059,126.666667,4756.794872
4,,15639.2,3339.475,1020.461538,1227.5,280.666667,30.0,3272.679612
All,2488.0,28990.454545,3983.988372,2996.790323,969.821918,344.066667,129.444444,3723.769231
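
The effect of `aggfunc` and `margins` is easier to verify on a few rows. The sketch below uses a hypothetical four-row frame (the column names `quarter`, `year` and `views` are made up) and compares the default mean with a summed table that includes the `All` margins.

```python
import pandas as pd

# Hypothetical toy data: two questions fall in (quarter 1, 2012)
toy = pd.DataFrame({
    "quarter": [1, 1, 2, 2],
    "year": [2012, 2012, 2012, 2013],
    "views": [10, 30, 5, 7],
})

# Default aggregation is the mean of each (quarter, year) group
mean_tbl = toy.pivot_table(index="quarter", columns="year", values="views")

# Sum instead, and add the "All" row and column of marginal totals
with_margins = toy.pivot_table(index="quarter", columns="year",
                               values="views", aggfunc="sum", margins=True)

print(mean_tbl)
print(with_margins)
```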


### Cross Tabulations

We can use `crosstab()` to compute a cross-tabulation of two or more factors. By default it computes a frequency table unless an array of values and an aggregation function are passed:

In [318]:
hybren.head()

Unnamed: 0,questionyear,viewcount,commentcount,quest_name,ans_name,questionmonth,questionquarter
4,2011,2488,0,Uri Laserson,HYRY,9,3
121,2012,4397,0,MikeGruz,BrenBarn,6,2
127,2012,11115,0,BFTM,BrenBarn,6,2
130,2012,46428,0,BFTM,BrenBarn,6,2
145,2012,85762,2,pythOnometrist,BrenBarn,6,2


In [322]:
pd.crosstab(hybren.questionquarter, hybren.ans_name)

ans_name,BrenBarn,HYRY
questionquarter,Unnamed: 1_level_1,Unnamed: 2_level_1
1,55,73
2,48,33
3,38,40
4,48,55


In [329]:
# frequency tables can also be normalized to show proportions of the grand total rather than counts
pd.crosstab(hybren.questionquarter, hybren.ans_name, normalize=True)

ans_name,BrenBarn,HYRY
questionquarter,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.141026,0.187179
2,0.123077,0.084615
3,0.097436,0.102564
4,0.123077,0.141026


In [324]:
# normalize within each row or each column by passing normalize='index' or normalize='columns'
pd.crosstab(hybren.questionquarter, hybren.questionyear, normalize='columns')

questionyear,2011,2012,2013,2014,2015,2016,2017
questionquarter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,0.0,0.0,0.267442,0.379032,0.328767,0.386667,0.555556
2,0.0,0.227273,0.186047,0.225806,0.205479,0.226667,0.0
3,1.0,0.318182,0.081395,0.185484,0.273973,0.226667,0.333333
4,0.0,0.454545,0.465116,0.209677,0.191781,0.16,0.111111


In [333]:
# apply an aggregation function with the specified values
pd.crosstab(hybren.questionquarter, hybren.questionyear, values=hybren.commentcount, aggfunc=np.sum)

questionyear,2011,2012,2013,2014,2015,2016,2017
questionquarter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,,,8.0,48.0,16.0,31.0,23.0
2,,2.0,9.0,53.0,31.0,19.0,
3,0.0,1.0,9.0,30.0,27.0,18.0,2.0
4,,8.0,51.0,23.0,17.0,46.0,0.0


In [334]:
# add margins to crosstab
pd.crosstab(index=hybren.questionyear, 
            columns=hybren.questionquarter, 
            values=hybren.commentcount,
            aggfunc=np.sum,
            normalize=True,
            margins=True)

questionquarter,1,2,3,4,All
questionyear,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2011,0.0,0.0,0.0,0.0,0.0
2012,0.0,0.004237,0.002119,0.016949,0.023305
2013,0.016949,0.019068,0.019068,0.108051,0.163136
2014,0.101695,0.112288,0.063559,0.048729,0.326271
2015,0.033898,0.065678,0.057203,0.036017,0.192797
2016,0.065678,0.040254,0.038136,0.097458,0.241525
2017,0.048729,0.0,0.004237,0.0,0.052966
All,0.266949,0.241525,0.184322,0.307203,1.0
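
For a quick check of what each `normalize` option divides by, here is a sketch on a hypothetical five-row frame (columns `quarter` and `who` are made up): plain counts, shares of the grand total, and shares within each row.

```python
import pandas as pd

# Hypothetical toy data: who answered in which quarter
toy = pd.DataFrame({
    "quarter": [1, 1, 2, 2, 2],
    "who": ["A", "B", "A", "A", "B"],
})

counts = pd.crosstab(toy["quarter"], toy["who"])                        # frequencies
overall = pd.crosstab(toy["quarter"], toy["who"], normalize=True)       # cells sum to 1
row_share = pd.crosstab(toy["quarter"], toy["who"], normalize="index")  # each row sums to 1

print(counts)
print(row_share)
```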


### Tiling

In [345]:
hybren['viewgroup'] = pd.cut(hybren.viewcount, bins=4)
hybren.head(10)

Unnamed: 0,questionyear,viewcount,commentcount,quest_name,ans_name,questionmonth,questionquarter,viewgroup
4,2011,2488,0,Uri Laserson,HYRY,9,3,"(-123.023, 37030.75]"
121,2012,4397,0,MikeGruz,BrenBarn,6,2,"(-123.023, 37030.75]"
127,2012,11115,0,BFTM,BrenBarn,6,2,"(-123.023, 37030.75]"
130,2012,46428,0,BFTM,BrenBarn,6,2,"(37030.75, 74036.5]"
145,2012,85762,2,pythOnometrist,BrenBarn,6,2,"(74036.5, 111042.25]"
146,2012,24509,0,serguei,BrenBarn,6,2,"(-123.023, 37030.75]"
216,2012,613,0,joelhoro,HYRY,7,3,"(-123.023, 37030.75]"
246,2012,20051,0,turtle,BrenBarn,8,3,"(-123.023, 37030.75]"
258,2012,4951,0,DanB,BrenBarn,8,3,"(-123.023, 37030.75]"
287,2012,148048,0,bigbug,BrenBarn,8,3,"(111042.25, 148048.0]"


In [351]:
# specify custom bin edges and our own labels
hybren['viewgroup'] = pd.cut(hybren.viewcount, 
                             bins=[25, 132, 445, 1420, 148048], 
                             include_lowest=True, 
                             labels=['25_per', '50_per', '75_per', 'above_75'])
hybren.tail(6)

Unnamed: 0,questionyear,viewcount,commentcount,quest_name,ans_name,questionmonth,questionquarter,viewgroup
35753,2017,133,6,Cleb,BrenBarn,3,1,50_per
36709,2017,328,2,Goofy Gert,BrenBarn,2,1,50_per
46190,2017,227,0,Franck Dernoncourt,BrenBarn,7,3,50_per
46601,2017,102,2,GhostRider,BrenBarn,8,3,25_per
49864,2017,51,0,Leigh Tsai,BrenBarn,9,3,25_per
53778,2017,30,0,K. Lou,BrenBarn,10,4,25_per


In [357]:
hybren.viewgroup.value_counts()

25_per      99
above_75    98
75_per      97
50_per      96
Name: viewgroup, dtype: int64
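
Two details of binning are worth a small experiment. The sketch below, on a handful of made-up view counts, shows that `pd.cut` intervals are open on the left unless `include_lowest=True` is set, and that `pd.qcut` (a related pandas function, not used in the cells above) derives quantile-based edges from the data instead of taking them by hand. The labels are hypothetical.

```python
import pandas as pd

views = pd.Series([30, 51, 102, 133, 227, 328, 2488, 85762])  # made-up values
bins = [30, 132, 445, 85762]
labels = ["low", "mid", "high"]

# Intervals are (30, 132], (132, 445], (445, 85762]: the minimum falls outside
open_left = pd.cut(views, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left
closed_left = pd.cut(views, bins=bins, labels=labels, include_lowest=True)

# qcut picks the edges so that each bin holds roughly the same number of rows
quartiles = pd.qcut(views, q=4, labels=["q1", "q2", "q3", "q4"])

print(open_left.isna().sum())
print(quartiles.value_counts())
```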

### Dummy Variables

In [366]:
# notice that questionquarter is treated as an integer and only the categorical variables are encoded as dummy variables
pd.get_dummies(hybren[['questionquarter','viewgroup', 'ans_name']]).head()

Unnamed: 0,questionquarter,viewgroup_25_per,viewgroup_50_per,viewgroup_75_per,viewgroup_above_75,ans_name_BrenBarn,ans_name_HYRY
4,3,0,0,0,1,0,1
121,2,0,0,0,1,1,0
127,2,0,0,0,1,1,0
130,2,0,0,0,1,1,0
145,2,0,0,0,1,1,0


In [368]:
# we can also prefix the column names
pd.get_dummies(hybren['ans_name'], prefix='person').join(hybren.quest_name).head()

Unnamed: 0,person_BrenBarn,person_HYRY,quest_name
4,0,1,Uri Laserson
121,1,0,MikeGruz
127,1,0,BFTM
130,1,0,BFTM
145,1,0,pythOnometrist


The `get_dummies()` function is often used along with discretization functions like `cut`:

In [372]:
bins = [25, 132, 445, 1420, 148048]
dum = pd.get_dummies(pd.cut(hybren['viewcount'], bins))
dum.head()

Unnamed: 0,"(25, 132]","(132, 445]","(445, 1420]","(1420, 148048]"
4,0,0,0,1
121,0,0,0,1
127,0,0,0,1
130,0,0,0,1
145,0,0,0,1


`get_dummies()` also accepts a DataFrame. By default all categorical variables (categorical in the statistical sense, those with object or categorical dtype) are encoded as dummy variables. All non-object columns are included untouched in the output. You can control the columns that are encoded into dummy variables with the `columns` keyword.

Additionally, we can also pass `prefix` or `prefix_sep` as additional arguments. The default uses the column name as the prefix and `_` as the prefix separator.

If we want only _k-1_ levels of a categorical variable (say, to avoid collinearity), we can specify `drop_first=True` as an additional argument.

In the pandas version used here, new columns have `np.uint8` dtype by default (pandas 2.0 changed the default to `bool`). To choose another type, use the `dtype` argument:

In [401]:
dat = pd.get_dummies(hybren, 
                     columns=['ans_name', 'viewgroup'], 
                     prefix=['answerby', 'vg'], 
                     prefix_sep='',
                     drop_first=True,
                     dtype=bool)
dat.head()

Unnamed: 0,questionyear,viewcount,commentcount,quest_name,questionmonth,questionquarter,answerbyHYRY,vg50_per,vg75_per,vgabove_75
4,2011,2488,0,Uri Laserson,9,3,True,False,False,True
121,2012,4397,0,MikeGruz,6,2,False,False,False,True
127,2012,11115,0,BFTM,6,2,False,False,False,True
130,2012,46428,0,BFTM,6,2,False,False,False,True
145,2012,85762,2,pythOnometrist,6,2,False,False,False,True


In [402]:
dat.dtypes

questionyear        int64
viewcount           int64
commentcount        int64
quest_name         object
questionmonth       int64
questionquarter     int64
answerbyHYRY         bool
vg50_per             bool
vg75_per             bool
vgabove_75           bool
dtype: object
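
The effect of `drop_first` is easiest to see side by side. This is a minimal sketch on a hypothetical three-row frame (columns `who` and `n` are made up), with `dtype=int` set explicitly so the output does not depend on the pandas version's default.

```python
import pandas as pd

toy = pd.DataFrame({"who": ["A", "B", "A"], "n": [1, 2, 3]})  # hypothetical data

# k dummy columns: one per category
full = pd.get_dummies(toy, columns=["who"], dtype=int)

# k-1 dummy columns: the first category ("A") becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["who"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a row of all zeros in the dummy columns stands for the dropped category.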

### Categorical Values

In [403]:
dat['quest_name'] = pd.Categorical(dat['quest_name'])
# equivalent: dat['quest_name'] = dat['quest_name'].astype("category")
dat.dtypes

questionyear          int64
viewcount             int64
commentcount          int64
quest_name         category
questionmonth         int64
questionquarter       int64
answerbyHYRY           bool
vg50_per               bool
vg75_per               bool
vgabove_75             bool
dtype: object
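
To look inside a categorical column, the sketch below converts a short hypothetical Series of names and inspects its categories and integer codes through the `.cat` accessor.

```python
import pandas as pd

names = pd.Series(["BFTM", "MikeGruz", "BFTM"])  # hypothetical values
cat = names.astype("category")

# Each distinct value is stored once; rows hold small integer codes into that list
categories = list(cat.cat.categories)
codes = cat.cat.codes.tolist()

print(cat.dtype)
print(categories, codes)
```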