### This Notebook covers:
* split-apply-combine in detail
    - manually then using groupby
* groupby mechanics and lazy evaluation
* aggregation functions
* grouping by multiple keys
* groupby + transform(), filter() and apply()

### Revision:
* split-apply-combine method
    - for aggregate results.
    - sales.loc[sales.Platform == 'PS3'].sum(numeric_only=True, axis=0)
    - has to be repeated separately for each category.
* groupby():
    - sales.groupby('category') gives a DataFrameGroupBy object, which is evaluated lazily: nothing is computed until an aggregation is called on it.
    - sales.groupby('category').sum() - gives the category-wise sum of all numeric columns
    - sales.set_index('Platform').groupby(platforms).sum()  - custom groups; platforms is a dict mapping index labels to group names
    - works on a Series as well, giving a SeriesGroupBy object.
* Iterating over groupby object:
    - yields tuples of (subgroup name, subgroup DataFrame)
* Handpicking subgroups:
    - sales.groupby('Platform')['JP_Sales'] - gives a SeriesGroupBy object
    - dict(iter(sales.groupby('Platform')))['PS3']
    - sales.groupby('Platform').get_group('PS3')
* MultiIndex grouping:
    - studios.groupby(['Genre','Publisher']).sum()
    - studios.groupby(['Genre','Publisher']).sum().index
* Aggregates with agg():
    - apply multiple functions at once
    - studios.groupby(['Genre', 'Publisher']).agg(['sum', 'mean', 'std', 'count'])
    - named aggregates:
    - studios.groupby(['Genre','Publisher']).agg(total_revenue=('Global_Sales','sum'), 
                                           num_games=('Global_Sales', 'count'),
                                           average_revenue=('Global_Sales',np.mean),
                                           deviation=('Global_Sales', 'std')
                                          ).sort_values(by='total_revenue', ascending=False)
    - games.groupby(['Genre','Publisher']).agg(
                                            {
                                                'Global_Sales': 'sum',
                                                'EU_Sales': 'mean'
                                            }
                                          )
* filter() + groupby():
    - games.groupby(['Genre','Publisher']).filter(lambda sg: sg['NA_Sales'].sum() > sg['EU_Sales'].sum())
* transform() + groupby():
    - games_relative.set_index(['Name','Platform']).groupby('Genre').transform(lambda x: (x-x.mean())/x.std())
* apply() + groupby():
    - games.groupby('Genre').apply(lambda x: 'solid' if x.EU_Sales.sum() > 50 else 'weak')
    - games.groupby('Genre').apply(sales_detail)   -  sales_detail is a user-defined function that receives one subgroup and returns a summary for it.

In [1]:
import pandas as pd
import numpy as np
pd.__version__

'1.4.2'

In [2]:
# new data
games_url = 'https://andybek.com/pandas-games'
games = pd.read_csv(games_url)
games.head()

,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


In [3]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3143 entries, 0 to 3142
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          3143 non-null   object 
 1   Platform      3143 non-null   object 
 2   Year          3088 non-null   float64
 3   Genre         3143 non-null   object 
 4   Publisher     3136 non-null   object 
 5   NA_Sales      3143 non-null   float64
 6   EU_Sales      3143 non-null   float64
 7   JP_Sales      3143 non-null   float64
 8   Other_Sales   3143 non-null   float64
 9   Global_Sales  3143 non-null   float64
dtypes: float64(6), object(4)
memory usage: 245.7+ KB


In [10]:
# what are the total sales across all regions?
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].sum()  # max, min, mean, std, var

NA_Sales        1173.30
EU_Sales         793.64
JP_Sales         107.06
Other_Sales      282.75
Global_Sales    2356.96
dtype: float64

In [11]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].sum(axis=1)

0       43.64
1       42.79
2       32.76
3       29.52
4       29.28
        ...  
3138     0.02
3139     0.02
3140     0.02
3141     0.02
3142     0.02
Length: 3143, dtype: float64

In [12]:
games.Platform.unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

In [16]:
# find the total sales per region for ps3 and xone
games[games.Platform == 'XOne'].loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].sum()

NA_Sales         83.19
EU_Sales         45.65
JP_Sales          0.34
Other_Sales      11.92
Global_Sales    141.06
dtype: float64

In [17]:
games[games.Platform == 'PS3'].loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].sum()

NA_Sales        392.26
EU_Sales        343.71
JP_Sales         79.99
Other_Sales     141.93
Global_Sales    957.84
dtype: float64

In [20]:
sales = games.loc[:, ['Platform','NA_Sales', 'EU_Sales', 
                      'JP_Sales', 'Other_Sales', 'Global_Sales']]
sales.loc[sales.Platform=='XOne'].sum(numeric_only=True, axis=0)

NA_Sales         83.19
EU_Sales         45.65
JP_Sales          0.34
Other_Sales      11.92
Global_Sales    141.06
dtype: float64

In [21]:
sales = games.loc[:, ['Platform','NA_Sales', 'EU_Sales', 
                      'JP_Sales', 'Other_Sales', 'Global_Sales']]
sales.loc[sales.Platform=='PS3'].sum(numeric_only=True, axis=0)

NA_Sales        392.26
EU_Sales        343.71
JP_Sales         79.99
Other_Sales     141.93
Global_Sales    957.84
dtype: float64

In [22]:
# The approach above (filter one platform, aggregate it, repeat for the next platform) is known as split-apply-combine.
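The manual steps can be written out once and compared with what `groupby()` returns. This is a minimal sketch on toy data: the frame `toy` and its values are made up for illustration and are not taken from the games dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Platform': ['PS3', 'PS3', 'XOne', 'XOne'],
    'NA_Sales': [1.0, 2.0, 3.0, 4.0],
    'EU_Sales': [0.5, 0.5, 1.0, 2.0],
})

# Split: one sub-frame per platform. Apply: sum each one.
pieces = {
    name: toy.loc[toy.Platform == name].sum(numeric_only=True)
    for name in toy.Platform.unique()
}
# Combine: stack the per-platform results into one table.
manual = pd.DataFrame(pieces).T.rename_axis('Platform')

# groupby() performs the same three steps in one call.
grouped = toy.groupby('Platform').sum()

print(manual)
print(grouped)
```

Both tables should hold the same per-platform totals; the `groupby()` version simply avoids repeating the filter for every platform.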

### groupby()

In [23]:
sales.groupby('Platform').sum()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
PS3,392.26,343.71,79.99,141.93,957.84
PS4,96.8,123.7,14.3,43.36,278.1
X360,601.05,280.58,12.43,85.54,979.96
XOne,83.19,45.65,0.34,11.92,141.06


In [24]:
# how much does each platform sell across regions on average?
sales.groupby('Platform').mean()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
PS3,0.295154,0.258623,0.060188,0.106795,0.720722
PS4,0.288095,0.368155,0.04256,0.129048,0.827679
X360,0.475138,0.221802,0.009826,0.067621,0.774672
XOne,0.390563,0.214319,0.001596,0.055962,0.662254


In [25]:
sales.groupby('Platform') 

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EBAF006F70>

In [26]:
# Lazy evaluation: groupby() only records how the rows are to be grouped.
# No aggregation is computed until a method such as sum() or mean() is called.
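A small sketch of what the lazy object exposes before any aggregation runs. The frame `toy` is hypothetical; `ngroups` and `groups` are attributes of the pandas GroupBy object.

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Platform': ['PS3', 'PS4', 'PS3', 'XOne'],
    'Global_Sales': [2.0, 1.0, 3.0, 4.0],
})

gb = toy.groupby('Platform')   # no sums or means are computed here
print(type(gb).__name__)       # the class name of the lazy object
print(gb.ngroups)              # number of distinct keys
print(gb.groups)               # mapping of key -> row labels

totals = gb.sum()              # the aggregation runs only now
means = gb.mean()              # the same object can be reused
```

Because the grouping is stored once, the same `gb` object can feed several different aggregations.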

In [30]:
# sales.groupby(sales['Platform']).sum() is equivalent to sales.groupby('Platform').sum()

### Customizing Index to Group Mappings

In [31]:
platforms = {
    'PS3': 'PlayStation',
    'PS4': 'PlayStation',
    'X360': 'XBOX',
    'XOne': 'XBOX'
}

In [32]:
sales.set_index('Platform').groupby(platforms).sum()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
PlayStation,489.06,467.41,94.29,185.29,1235.94
XBOX,684.24,326.23,12.77,97.46,1121.02
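A dict passed to `groupby()` is looked up against the index labels, which is why the column is moved into the index first. The sketch below shows that and one possible alternative using `Series.map`. The frame `toy` and the dict name `families` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Platform': ['PS3', 'PS4', 'X360', 'XOne'],
    'Global_Sales': [1.0, 2.0, 3.0, 4.0],
})
families = {'PS3': 'PlayStation', 'PS4': 'PlayStation',
            'X360': 'XBOX', 'XOne': 'XBOX'}

# Dict keys are matched against index labels, so Platform must be the index.
via_index = toy.set_index('Platform').groupby(families).sum()

# Alternative that leaves the index alone: map the column, group by the result.
via_map = toy.groupby(toy['Platform'].map(families))['Global_Sales'].sum()

print(via_index)
print(via_map)
```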


### Series groupby()

In [37]:
ser = games.loc[:,['Genre','Global_Sales']].set_index('Genre').squeeze()
ser

Genre
Misc            21.82
Action          21.40
Action          16.38
Shooter         14.76
Shooter         14.64
                ...  
Role-Playing     0.01
Platform         0.01
Shooter          0.01
Simulation       0.01
Sports           0.01
Name: Global_Sales, Length: 3143, dtype: float64

In [38]:
ser.groupby('Genre').mean().sort_values(ascending=False)

Genre
Shooter         1.412019
Action          0.751007
Role-Playing    0.715804
Racing          0.687854
Sports          0.681094
Platform        0.651842
Fighting        0.604182
Misc            0.550250
Simulation      0.336076
Adventure       0.298289
Strategy        0.264333
Puzzle          0.133636
Name: Global_Sales, dtype: float64

In [39]:
ser.groupby('Genre')

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001EBAFD1B610>
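For a Series, the string passed to `groupby()` refers to the name of an index level, since there are no columns to group on. A minimal sketch with a made-up Series follows; `level=` is the explicit spelling of the same grouping.

```python
import pandas as pd

# Toy Series (hypothetical values) indexed by a level named 'Genre'.
ser = pd.Series(
    [3.0, 1.0, 2.0],
    index=pd.Index(['Action', 'Misc', 'Action'], name='Genre'),
    name='Global_Sales',
)

by_name = ser.groupby('Genre').mean()          # key resolved as an index level name
by_level = ser.groupby(level='Genre').mean()   # explicit form of the same grouping

print(by_name)
```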

### Challenge

In [40]:
# 1. Create a smaller dataframe from games, selecting only the publisher, genre, platform, and NA_Sales columns. name it publishers.
publishers = games.loc[:, ['Publisher', 'Genre', 'Platform', 'NA_Sales']]
publishers

,Publisher,Genre,Platform,NA_Sales
0,Microsoft Game Studios,Misc,X360,14.97
1,Take-Two Interactive,Action,PS3,7.01
2,Take-Two Interactive,Action,X360,9.63
3,Activision,Shooter,X360,9.03
4,Activision,Shooter,X360,9.67
...,...,...,...,...
3138,,Role-Playing,X360,0.00
3139,Deep Silver,Platform,XOne,0.01
3140,Capcom,Shooter,XOne,0.01
3141,UIG Entertainment,Simulation,PS4,0.00


In [48]:
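# 2. From the publishers dataframe, find the top 10 game publishers in North America by total sales.
# numeric_only=True keeps the text columns (Genre, Platform) out of the sum on every pandas version.
publishers.groupby('Publisher').sum(numeric_only=True).sort_values(by='NA_Sales',ascending=False).head(10)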

Publisher,NA_Sales
Electronic Arts,213.38
Activision,193.16
Take-Two Interactive,120.99
Microsoft Game Studios,116.77
Ubisoft,98.65
Sony Computer Entertainment,76.35
Warner Bros. Interactive Entertainment,45.24
THQ,36.44
Bethesda Softworks,33.88
Capcom,24.74


In [49]:
# 3. Which gaming platform has attracted the most NA sales?
publishers.groupby('Platform').sum(numeric_only=True).sort_values(by='NA_Sales', ascending=False).iloc[0]

NA_Sales    601.05
Name: X360, dtype: float64

### Iterating through groups

In [50]:
sales.Platform.unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

In [51]:
for i in sales.groupby('Platform'):
    print(i)

('PS3',      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
1         PS3      7.01      9.27      0.97         4.14         21.40
6         PS3      4.99      5.88      0.65         2.52         14.03
9         PS3      5.54      5.82      0.49         1.62         13.46
10        PS3      5.98      4.44      0.48         1.83         12.73
14        PS3      2.96      4.88      0.81         2.12         10.77
...       ...       ...       ...       ...          ...           ...
3124      PS3      0.00      0.01      0.00         0.00          0.01
3125      PS3      0.00      0.00      0.01         0.00          0.01
3129      PS3      0.00      0.00      0.01         0.00          0.01
3132      PS3      0.00      0.00      0.01         0.00          0.01
3136      PS3      0.00      0.00      0.01         0.00          0.01

[1329 rows x 6 columns])
('PS4',      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
5         PS4      5.77      5.81  

In [52]:
for name,df in sales.groupby('Platform'):
    print('--------------')
    print(name)
    print('--------------')
    print(df)


--------------
PS3
--------------
     Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
1         PS3      7.01      9.27      0.97         4.14         21.40
6         PS3      4.99      5.88      0.65         2.52         14.03
9         PS3      5.54      5.82      0.49         1.62         13.46
10        PS3      5.98      4.44      0.48         1.83         12.73
14        PS3      2.96      4.88      0.81         2.12         10.77
...       ...       ...       ...       ...          ...           ...
3124      PS3      0.00      0.01      0.00         0.00          0.01
3125      PS3      0.00      0.00      0.01         0.00          0.01
3129      PS3      0.00      0.00      0.01         0.00          0.01
3132      PS3      0.00      0.00      0.01         0.00          0.01
3136      PS3      0.00      0.00      0.01         0.00          0.01

[1329 rows x 6 columns]
--------------
PS4
--------------
     Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sale

### Handpicking subgroups

In [55]:
sales.groupby('Platform')['JP_Sales']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001EBB13F7C70>

In [62]:
dict(iter(sales.groupby('Platform')))['PS3']

,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,PS3,7.01,9.27,0.97,4.14,21.40
6,PS3,4.99,5.88,0.65,2.52,14.03
9,PS3,5.54,5.82,0.49,1.62,13.46
10,PS3,5.98,4.44,0.48,1.83,12.73
14,PS3,2.96,4.88,0.81,2.12,10.77
...,...,...,...,...,...,...
3124,PS3,0.00,0.01,0.00,0.00,0.01
3125,PS3,0.00,0.00,0.01,0.00,0.01
3129,PS3,0.00,0.00,0.01,0.00,0.01
3132,PS3,0.00,0.00,0.01,0.00,0.01


In [63]:
sales.groupby('Platform').get_group('PS3')

,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,PS3,7.01,9.27,0.97,4.14,21.40
6,PS3,4.99,5.88,0.65,2.52,14.03
9,PS3,5.54,5.82,0.49,1.62,13.46
10,PS3,5.98,4.44,0.48,1.83,12.73
14,PS3,2.96,4.88,0.81,2.12,10.77
...,...,...,...,...,...,...
3124,PS3,0.00,0.01,0.00,0.00,0.01
3125,PS3,0.00,0.00,0.01,0.00,0.01
3129,PS3,0.00,0.00,0.01,0.00,0.01
3132,PS3,0.00,0.00,0.01,0.00,0.01
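The iteration and selection patterns from the last two sections, condensed into one sketch. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Platform': ['PS3', 'PS4', 'PS3'],
    'JP_Sales': [1.0, 0.25, 0.5],
})
gb = toy.groupby('Platform')

# Iterating yields (key, sub-frame) pairs, in sorted key order by default.
sizes = {name: len(frame) for name, frame in gb}

# get_group() returns the sub-frame for a single key.
ps3 = gb.get_group('PS3')

# Selecting a column first gives a SeriesGroupBy over that column only.
jp_totals = gb['JP_Sales'].sum()

print(sizes)
print(ps3)
print(jp_totals)
```

`get_group()` is usually the more direct choice when only one subgroup is needed, since `dict(iter(...))` builds every sub-frame first.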


### MultiIndex Grouping

In [64]:
studios = games.loc[:, ['Genre', 'Publisher', 'Global_Sales']]

In [65]:
studios

,Genre,Publisher,Global_Sales
0,Misc,Microsoft Game Studios,21.82
1,Action,Take-Two Interactive,21.40
2,Action,Take-Two Interactive,16.38
3,Shooter,Activision,14.76
4,Shooter,Activision,14.64
...,...,...,...
3138,Role-Playing,,0.01
3139,Platform,Deep Silver,0.01
3140,Shooter,Capcom,0.01
3141,Simulation,UIG Entertainment,0.01


In [67]:
# Q. Which are the top publishers by Global_Sales?
studios.groupby('Publisher').sum(numeric_only=True).sort_values(by='Global_Sales', ascending=False)

Publisher,Global_Sales
Electronic Arts,434.41
Activision,349.22
Take-Two Interactive,218.08
Ubisoft,201.98
Microsoft Game Studios,190.56
...,...
UIG Entertainment,0.01
ChunSoft,0.01
Kaga Create,0.01
Epic Games,0.01


In [69]:
# Q. which are the top publishers within each genre by global sales? 
studios.groupby(['Genre','Publisher']).sum().sort_values(by='Global_Sales', ascending=False)

Genre,Publisher,Global_Sales
Shooter,Activision,245.46
Sports,Electronic Arts,203.50
Action,Take-Two Interactive,106.04
Action,Ubisoft,96.44
Shooter,Electronic Arts,92.58
...,...,...
Adventure,Cave,0.01
Role-Playing,TopWare Interactive,0.01
Sports,"Interworks Unlimited, Inc.",0.01
Strategy,Ackkstudios,0.01


In [71]:
studios.groupby(['Genre','Publisher']).sum().index

MultiIndex([(  'Action',                   '505 Games'),
            (  'Action',                    'Abylight'),
            (  'Action',                 'Ackkstudios'),
            (  'Action',                     'Acquire'),
            (  'Action',                  'Activision'),
            (  'Action',            'Activision Value'),
            (  'Action',            'Arc System Works'),
            (  'Action',                       'Atari'),
            (  'Action',                   'Avanquest'),
            (  'Action',          'Bethesda Softworks'),
            ...
            ('Strategy',        'Nippon Ichi Software'),
            ('Strategy',                'PopCap Games'),
            ('Strategy',                        'Sega'),
            ('Strategy',         'Slitherine Software'),
            ('Strategy', 'Sony Computer Entertainment'),
            ('Strategy',                 'Square Enix'),
            ('Strategy',                 'Takara Tomy'),
            ('S
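Grouping by a list of keys produces a result indexed by a MultiIndex. The sketch below, on a made-up frame `toy`, shows a few ways to work with that index, and `as_index=False` as one way to avoid it.

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Shooter', 'Shooter'],
    'Publisher': ['Ubisoft', 'Ubisoft', 'Activision', 'Ubisoft'],
    'Global_Sales': [1.0, 2.0, 5.0, 0.5],
})

by_both = toy.groupby(['Genre', 'Publisher']).sum()
print(list(by_both.index.names))

# Look up one (Genre, Publisher) pair with a tuple.
one_cell = by_both.loc[('Shooter', 'Activision'), 'Global_Sales']

# Take every publisher within one genre with a cross-section.
shooter = by_both.xs('Shooter', level='Genre')

# as_index=False keeps the keys as ordinary columns instead of a MultiIndex.
flat = toy.groupby(['Genre', 'Publisher'], as_index=False).sum()
print(flat)
```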

### Fine-tuned aggregates with agg()

In [72]:
# Equivalents:
# studios.groupby(['Genre','Publisher']).sum() 
# studios.groupby(['Genre','Publisher']).agg('sum')
# studios.groupby(['Genre','Publisher']).agg(np.sum)

In [73]:
# The benefit of agg() is that multiple functions can be applied at once.

In [75]:
# Summarize the sum, average, and std dev of sales as well as the number of games published by each publisher within each genre.
studios.groupby(['Genre','Publisher']).agg(['sum', 'mean', 'std', 'count'])

,,Global_Sales,Global_Sales,Global_Sales,Global_Sales
Genre,Publisher,sum,mean,std,count
Action,505 Games,2.25,0.281250,0.266482,8
Action,Abylight,0.08,0.080000,,1
Action,Ackkstudios,0.33,0.330000,,1
Action,Acquire,0.11,0.110000,,1
Action,Activision,42.84,0.450947,0.559717,95
...,...,...,...,...,...
Strategy,Square Enix,0.35,0.350000,,1
Strategy,Takara Tomy,0.09,0.090000,,1
Strategy,Take-Two Interactive,2.92,0.486667,0.364289,6
Strategy,Tecmo Koei,0.58,0.096667,0.055015,6


In [77]:
# The result's columns form a MultiIndex, so the sort key is a (column, function) tuple.
studios.groupby(['Genre','Publisher']).agg(['sum', 'mean', 'std', 'count']) \
                        .sort_values(by=('Global_Sales','sum'), ascending=False)

,,Global_Sales,Global_Sales,Global_Sales,Global_Sales
Genre,Publisher,sum,mean,std,count
Shooter,Activision,245.46,3.409167,4.621920,72
Sports,Electronic Arts,203.50,1.197059,1.404108,170
Action,Take-Two Interactive,106.04,4.610435,5.843768,23
Action,Ubisoft,96.44,1.439403,1.636460,67
Shooter,Electronic Arts,92.58,1.851600,1.794404,50
...,...,...,...,...,...
Adventure,Cave,0.01,0.010000,,1
Role-Playing,TopWare Interactive,0.01,0.010000,,1
Sports,"Interworks Unlimited, Inc.",0.01,0.010000,,1
Strategy,Ackkstudios,0.01,0.010000,,1
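Passing a list of functions to `agg()` produces two-level columns, which affects how the result is sorted and selected. A minimal sketch on a made-up frame `toy`:

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Shooter'],
    'Global_Sales': [1.0, 3.0, 5.0],
})

stats = toy.groupby('Genre').agg(['sum', 'mean', 'count'])
print(stats.columns.tolist())

# Columns are (original column, function) pairs, so the sort key is a tuple.
ranked = stats.sort_values(by=('Global_Sales', 'sum'), ascending=False)

# Selecting the top level leaves a frame with single-level columns.
flat = stats['Global_Sales']
print(flat)
```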


### Named Aggregations

In [79]:
studios.groupby(['Genre','Publisher']).agg(['sum', 'mean', 'std', 'count']) \
                                .rename(mapper={'sum': 'total_revenue', 'mean': 'average_revenue',
                                                'std': 'deviation', 'count': 'num_games'}, axis=1)

,,Global_Sales,Global_Sales,Global_Sales,Global_Sales
Genre,Publisher,total_revenue,average_revenue,deviation,num_games
Action,505 Games,2.25,0.281250,0.266482,8
Action,Abylight,0.08,0.080000,,1
Action,Ackkstudios,0.33,0.330000,,1
Action,Acquire,0.11,0.110000,,1
Action,Activision,42.84,0.450947,0.559717,95
...,...,...,...,...,...
Strategy,Square Enix,0.35,0.350000,,1
Strategy,Takara Tomy,0.09,0.090000,,1
Strategy,Take-Two Interactive,2.92,0.486667,0.364289,6
Strategy,Tecmo Koei,0.58,0.096667,0.055015,6


In [84]:
studios.groupby(['Genre','Publisher']).agg(total_revenue=('Global_Sales','sum'), 
                                           num_games=('Global_Sales', 'count'),
                                           average_revenue=('Global_Sales', 'mean'),
                                           deviation=('Global_Sales', 'std')
                                          ).sort_values(by='total_revenue', ascending=False)

Genre,Publisher,total_revenue,num_games,average_revenue,deviation
Shooter,Activision,245.46,72,3.409167,4.621920
Sports,Electronic Arts,203.50,170,1.197059,1.404108
Action,Take-Two Interactive,106.04,23,4.610435,5.843768
Action,Ubisoft,96.44,67,1.439403,1.636460
Shooter,Electronic Arts,92.58,50,1.851600,1.794404
...,...,...,...,...,...
Adventure,Cave,0.01,1,0.010000,
Role-Playing,TopWare Interactive,0.01,1,0.010000,
Sports,"Interworks Unlimited, Inc.",0.01,1,0.010000,
Strategy,Ackkstudios,0.01,1,0.010000,
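Named aggregation takes `new_name=(column, function)` pairs and returns flat, readable column names in one step. The sketch below uses a made-up frame `toy` and string function names throughout.

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Shooter'],
    'Global_Sales': [1.0, 3.0, 5.0],
    'EU_Sales': [0.5, 1.5, 2.0],
})

# Each keyword becomes an output column: name=(source column, function).
summary = toy.groupby('Genre').agg(
    total_revenue=('Global_Sales', 'sum'),
    num_games=('Global_Sales', 'count'),
    average_EU_revenue=('EU_Sales', 'mean'),
)
print(summary)
```

Unlike the `rename()` approach, this also lets different output columns draw on different source columns.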


In [86]:
games.groupby(['Genre','Publisher']).agg(
    total_global_revenue=('Global_Sales', 'sum'),
    average_EU_revenue=('EU_Sales', 'mean')
    )

Genre,Publisher,total_global_revenue,average_EU_revenue
Action,505 Games,2.25,0.131250
Action,Abylight,0.08,0.000000
Action,Ackkstudios,0.33,0.000000
Action,Acquire,0.11,0.000000
Action,Activision,42.84,0.143053
...,...,...,...
Strategy,Square Enix,0.35,0.100000
Strategy,Takara Tomy,0.09,0.000000
Strategy,Take-Two Interactive,2.92,0.145000
Strategy,Tecmo Koei,0.58,0.000000


In [87]:
games.groupby(['Genre','Publisher']).agg(
    {
        'Global_Sales': 'sum',
        'EU_Sales': 'mean'
    }
)

Genre,Publisher,Global_Sales,EU_Sales
Action,505 Games,2.25,0.131250
Action,Abylight,0.08,0.000000
Action,Ackkstudios,0.33,0.000000
Action,Acquire,0.11,0.000000
Action,Activision,42.84,0.143053
...,...,...,...
Strategy,Square Enix,0.35,0.100000
Strategy,Takara Tomy,0.09,0.000000
Strategy,Take-Two Interactive,2.92,0.145000
Strategy,Tecmo Koei,0.58,0.000000


### filter() method

In [104]:
# Find all the games whose publisher has sold more than 50m in North America within the game's genre.
# Manual approach: aggregate, read off the qualifying (Genre, Publisher) pairs, then hard-code them in a mask.
games_grouped = games.groupby(['Genre', 'Publisher']).sum(numeric_only=True)
games_grouped.loc[games_grouped['NA_Sales'] > 50]
games.loc[((games['Genre'] == 'Shooter') & games['Publisher'].isin(['Activision', 'Microsoft Game Studios'])) |
          ((games['Genre'] == 'Sports') & (games['Publisher'] == 'Electronic Arts'))]

,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64
5,Call of Duty: Black Ops 3,PS4,2015.0,Shooter,Activision,5.77,5.81,0.35,2.31,14.24
6,Call of Duty: Black Ops II,PS3,2012.0,Shooter,Activision,4.99,5.88,0.65,2.52,14.03
7,Call of Duty: Black Ops II,X360,2012.0,Shooter,Activision,8.25,4.30,0.07,1.12,13.73
...,...,...,...,...,...,...,...,...,...,...
2908,Cabela's Big Game Hunter: Pro Hunts,X360,2014.0,Shooter,Activision,0.02,0.00,0.00,0.00,0.03
3012,Call of Duty: Modern Warfare Trilogy,PS3,2016.0,Shooter,Activision,0.00,0.01,0.00,0.00,0.02
3033,NHL 16,X360,2015.0,Sports,Electronic Arts,0.00,0.02,0.00,0.00,0.02
3035,Call of Duty: Modern Warfare Trilogy,X360,2016.0,Shooter,Activision,0.01,0.01,0.00,0.00,0.02


In [105]:
games.groupby(['Genre','Publisher']).filter(lambda ag: ag['NA_Sales'].sum() > 50)

,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64
5,Call of Duty: Black Ops 3,PS4,2015.0,Shooter,Activision,5.77,5.81,0.35,2.31,14.24
6,Call of Duty: Black Ops II,PS3,2012.0,Shooter,Activision,4.99,5.88,0.65,2.52,14.03
7,Call of Duty: Black Ops II,X360,2012.0,Shooter,Activision,8.25,4.30,0.07,1.12,13.73
...,...,...,...,...,...,...,...,...,...,...
2908,Cabela's Big Game Hunter: Pro Hunts,X360,2014.0,Shooter,Activision,0.02,0.00,0.00,0.00,0.03
3012,Call of Duty: Modern Warfare Trilogy,PS3,2016.0,Shooter,Activision,0.00,0.01,0.00,0.00,0.02
3033,NHL 16,X360,2015.0,Sports,Electronic Arts,0.00,0.02,0.00,0.00,0.02
3035,Call of Duty: Modern Warfare Trilogy,X360,2016.0,Shooter,Activision,0.01,0.01,0.00,0.00,0.02
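`filter()` keeps or drops whole groups: the function receives one sub-frame and returns a single boolean, and the result contains the original rows of every group that passed. A minimal sketch on a made-up frame `toy`, with an equivalent mask built from `transform()` for comparison:

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Genre': ['Shooter', 'Shooter', 'Puzzle', 'Puzzle'],
    'Name': ['A', 'B', 'C', 'D'],
    'NA_Sales': [40.0, 20.0, 1.0, 2.0],
})

# The lambda sees one sub-frame at a time and returns one bool per group.
kept = toy.groupby('Genre').filter(lambda sg: sg['NA_Sales'].sum() > 50)

# Same selection as a boolean mask: broadcast each group's total to its rows.
mask = toy.groupby('Genre')['NA_Sales'].transform('sum') > 50
same = toy[mask]

print(kept)
```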


### groupby() Transformations

In [111]:
# Q. Convert raw Global_Sales to within-genre standard scores (z-scores).
games_relative = games.loc[:, ['Name','Platform','Genre','Global_Sales']]
games_relative

,Name,Platform,Genre,Global_Sales
0,Kinect Adventures!,X360,Misc,21.82
1,Grand Theft Auto V,PS3,Action,21.40
2,Grand Theft Auto V,X360,Action,16.38
3,Call of Duty: Modern Warfare 3,X360,Shooter,14.76
4,Call of Duty: Black Ops,X360,Shooter,14.64
...,...,...,...,...
3138,Bound By Flame,X360,Role-Playing,0.01
3139,Mighty No. 9,XOne,Platform,0.01
3140,Resident Evil 4 HD,XOne,Shooter,0.01
3141,Farming 2017 - The Simulation,PS4,Simulation,0.01


In [115]:
games_relative.set_index(['Name','Platform']).groupby('Genre').transform(lambda x: (x-x.mean())/x.std()) \
.sort_values(by='Global_Sales', ascending=False)

Name,Platform,Global_Sales
Grand Theft Auto V,PS3,13.831175
Kinect Adventures!,X360,13.814162
Grand Theft Auto V,X360,10.468663
Gran Turismo 5,PS3,9.159261
Grand Theft Auto V,PS4,7.521441
...,...,...
Dragon Ball Z for Kinect,X360,-0.872762
Nitroplus Blasterz: Heroines Infinite Duel,PS3,-0.872762
Battle Fantasia,PS3,-0.872762
"Sakigake!! Otokojuku - Nihon yo, Kore ga Otoko Dearu!",PS3,-0.872762
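`transform()` returns one value per input row, so the result lines up with the original frame and can be stored as a new column. The sketch below computes within-genre z-scores on a made-up frame `toy` in two ways: with a lambda, and with the built-in `'mean'` and `'std'` transforms.

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Action', 'Misc', 'Misc'],
    'Global_Sales': [1.0, 2.0, 3.0, 10.0, 20.0],
})

g = toy.groupby('Genre')['Global_Sales']

# Lambda form: x is the Global_Sales Series of one genre.
toy['z'] = g.transform(lambda x: (x - x.mean()) / x.std())

# Built-in form: broadcast each genre's mean and std back to its rows.
toy['z_builtin'] = (toy['Global_Sales'] - g.transform('mean')) / g.transform('std')

print(toy)
```

The output has the same length and index as the input, which is what distinguishes `transform()` from an aggregation.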


### apply() with groupby()

In [117]:
ps3 = games.loc[games.Platform=='PS3', ['Name', 'Genre', 'EU_Sales', 'Global_Sales']]

In [118]:
ps3

,Name,Genre,EU_Sales,Global_Sales
1,Grand Theft Auto V,Action,9.27,21.40
6,Call of Duty: Black Ops II,Shooter,5.88,14.03
9,Call of Duty: Modern Warfare 3,Shooter,5.82,13.46
10,Call of Duty: Black Ops,Shooter,4.44,12.73
14,Gran Turismo 5,Racing,4.88,10.77
...,...,...,...,...
3124,Hyperdimension Neptunia mk2,Action,0.01,0.01
3125,Shin Koihime Musou: Otome Taisen * Sangokushi ...,Adventure,0.00,0.01
3129,Muv-Luv Alternative,Simulation,0.00,0.01
3132,Akatsuki no Goei Trinity,Adventure,0.00,0.01


In [119]:
# apply(): each subgroup is passed to the function, and the per-group outputs are combined.

In [120]:
ps3.groupby('Genre').apply(lambda ag: 'solid' if ag.EU_Sales.sum()>50 else 'weak')

Genre
Action          solid
Adventure        weak
Fighting         weak
Misc             weak
Platform         weak
Puzzle           weak
Racing           weak
Role-Playing     weak
Shooter         solid
Simulation       weak
Sports           weak
Strategy         weak
dtype: object

In [121]:
def sales_detail(sg):
    level = 'solid' if sg.EU_Sales.sum() > 50 else 'weak'
    variability = 'volatile' if sg.EU_Sales.std()/sg.EU_Sales.mean() > 2 else 'steady'
    return (variability, level + ' sales')

In [123]:
ps3.groupby('Genre').apply(sales_detail)

Genre
Action          (volatile, solid sales)
Adventure        (volatile, weak sales)
Fighting           (steady, weak sales)
Misc               (steady, weak sales)
Platform           (steady, weak sales)
Puzzle             (steady, weak sales)
Racing             (steady, weak sales)
Role-Playing     (volatile, weak sales)
Shooter           (steady, solid sales)
Simulation         (steady, weak sales)
Sports           (volatile, weak sales)
Strategy           (steady, weak sales)
dtype: object
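A sketch of the same labelling idea with the column selected before `apply()`, so the function receives a Series per group instead of a whole sub-frame. The frame `toy`, the helper `sales_level`, and its `threshold` parameter are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not the games dataset).
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Puzzle', 'Puzzle'],
    'EU_Sales': [40.0, 20.0, 1.0, 1.0],
})

def sales_level(eu_sales, threshold=50):
    """Label one genre's EU sales (a Series) as 'solid' or 'weak'."""
    return 'solid' if eu_sales.sum() > threshold else 'weak'

# Selecting the column first means the function gets a Series per group.
labels = toy.groupby('Genre')['EU_Sales'].apply(sales_level)

# Extra keyword arguments are forwarded to the function.
strict = toy.groupby('Genre')['EU_Sales'].apply(sales_level, threshold=100)

print(labels)
print(strict)
```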

In [124]:
ps3.groupby('Genre').apply(lambda sg: sg.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 380 entries, 1 to 3124
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          380 non-null    object 
 1   Genre         380 non-null    object 
 2   EU_Sales      380 non-null    float64
 3   Global_Sales  380 non-null    float64
dtypes: float64(2), object(2)
memory usage: 14.8+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 74 entries, 70 to 3132
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          74 non-null     object 
 1   Genre         74 non-null     object 
 2   EU_Sales      74 non-null     float64
 3   Global_Sales  74 non-null     float64
dtypes: float64(2), object(2)
memory usage: 2.9+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 76 entries, 90 to 3136
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --

### Challenge

In [131]:
# 1. Starting with the games df, calculate the total global sales for each year across all records.
# What are the top 3 years by aggregate global sales?
games.groupby('Year').sum(numeric_only=True).sort_values(by='Global_Sales', ascending=False)

Year,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
2010.0,168.13,99.57,11.98,35.65,315.47
2011.0,151.5,101.96,15.88,35.11,304.49
2008.0,139.62,78.35,7.72,29.77,255.45
2009.0,136.6,76.34,10.98,29.35,253.19
2013.0,116.33,89.5,13.5,31.02,250.36
2014.0,100.71,96.2,9.37,32.39,238.57
2012.0,98.12,73.72,13.0,25.39,210.37
2015.0,86.92,80.61,10.03,26.58,204.23
2007.0,95.1,49.17,5.74,19.65,169.65
2006.0,43.99,18.59,2.27,8.15,72.95


In [138]:
# 2. In the games df, which genre, in which year, on which platform sold the most in EU_Sales?
games.groupby(['Genre','Year','Platform']).sum(numeric_only=True).nlargest(1, columns='EU_Sales')

Genre,Year,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Action,2013.0,PS3,18.47,21.72,4.95,9.34,54.44


In [135]:
# 3. Find all the games in the games df whose genre, on its respective platform, sold more in JP_Sales than in EU_Sales.
games.groupby(['Genre','Platform']).filter(lambda sg: sg['JP_Sales'].sum() > sg['EU_Sales'].sum())

,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1246,Katamari Forever,PS3,2009.0,Puzzle,Namco Bandai Games,0.26,0.05,0.06,0.04,0.42
1440,Beautiful Katamari,X360,2007.0,Puzzle,Namco Bandai Games,0.14,0.02,0.15,0.02,0.32
2117,Bejeweled 3,PS3,,Puzzle,Unknown,0.13,0.0,0.0,0.01,0.14
2132,Bejeweled 3,X360,,Puzzle,Unknown,0.13,0.0,0.0,0.01,0.14
2214,Are You Smarter than a 5th Grader? Game Time,X360,2009.0,Puzzle,THQ,0.12,0.0,0.0,0.01,0.12
2318,Tetris Evolution,X360,2007.0,Puzzle,THQ,0.08,0.02,0.0,0.01,0.11
2497,Qubed,X360,2009.0,Puzzle,Atari,0.07,0.0,0.0,0.01,0.08
2744,Puyo Puyo Tetris,PS3,2014.0,Puzzle,Sega,0.0,0.0,0.04,0.0,0.04
2767,PopCap Arcade Vol 1,X360,2007.0,Puzzle,PopCap Games,0.04,0.0,0.0,0.0,0.04
2787,Bomberman: Act Zero,X360,2006.0,Puzzle,Konami Digital Entertainment,0.04,0.0,0.0,0.0,0.04
