> **Copyright (c) 2020 Skymind Holdings Berhad**<br><br>
> **Copyright (c) 2021 Skymind Education Group Sdn. Bhd.**<br>
<br>
Licensed under the Apache License, Version 2.0 (the "License");
<br>you may not use this file except in compliance with the License.
<br>You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0/
<br>
<br>Unless required by applicable law or agreed to in writing, software
<br>distributed under the License is distributed on an "AS IS" BASIS,
<br>WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
<br>See the License for the specific language governing permissions and
<br>limitations under the License.
<br>
<br>
**SPDX-License-Identifier: Apache-2.0**
<br>

# Introduction

In the last tutorial, we learned how to select relevant data out of a DataFrame or Series. Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

However, the data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand. This tutorial will cover different operations we can apply to our data to get the input "just right".

# Notebook Content

* [Summary Functions](#Summary-functions)


* [Maps](#Maps)

In [1]:
import pandas as pd
import numpy as np

games = pd.read_csv("../../../resources/day_01/twitch_game_data.csv", index_col=0)

In [2]:
games

Rank,Game,Month,Year,Hours_watched,Hours_Streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,Avg_channels,Avg_viewer_ratio
1,League of Legends,1,2016,94377226,1362044 hours,530270,2903,129172,127021,1833,69.29
2,Counter-Strike: Global Offensive,1,2016,47832863,830105 hours,372654,2197,120849,64378,1117,57.62
3,Dota 2,1,2016,45185893,433397 hours,315083,1100,44074,60815,583,104.26
4,Hearthstone,1,2016,39936159,235903 hours,131357,517,36170,53749,317,169.29
5,Call of Duty: Black Ops III,1,2016,16153057,1151578 hours,71639,3620,214054,21740,1549,14.03
...,...,...,...,...,...,...,...,...,...,...,...
196,War Thunder,6,2021,704459,73613 hours,8812,223,7035,979,102,9.57
197,Muck,6,2021,701456,31741 hours,60091,112,8591,975,44,22.10
198,Trials Rising,6,2021,698899,4626 hours,217333,26,581,972,6,151.08
199,Little Nightmares II,6,2021,695130,27581 hours,43518,105,6128,966,38,25.20


# Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the `describe()` method:

In [3]:
games.Streamers.describe()

count    1.320000e+04
mean     1.643152e+04
std      5.400997e+04
min      0.000000e+00
25%      1.375750e+03
50%      3.867500e+03
75%      1.017375e+04
max      1.013029e+06
Name: Streamers, dtype: float64

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [4]:
games.Game.describe()

count                  13199
unique                  1676
top       Dungeons & Dragons
freq                      67
Name: Game, dtype: object
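
To see the type-aware behaviour in isolation, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are assumptions for illustration only and are not taken from the Twitch dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate describe().
toy = pd.DataFrame({
    "Game": ["Chess", "Go", "Chess", "Poker"],
    "Streamers": [10, 20, 30, 40],
})

# Numerical column: count, mean, std, min, quartiles and max.
numeric_summary = toy.Streamers.describe()

# String column: count, number of unique values, most frequent value and its frequency.
string_summary = toy.Game.describe()

print(numeric_summary)
print(string_summary)
```

The same method call produces two differently shaped summaries, depending only on the data type of the column.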

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen. 

For example, to see the mean of the hours watched, we can use the `mean()` function:

In [5]:
games.Hours_watched.mean()

4427318.414848485

To see a list of unique values we can use the `unique()` function:

In [6]:
games.Game.unique()

array(['League of Legends', 'Counter-Strike: Global Offensive', 'Dota 2',
       ..., 'Phantasy Star Online 2 New Genesis', 'Phantom Abyss', 'Muck'],
      dtype=object)

To see a list of unique values _and_ how often they occur in the dataset, we can use the `value_counts()` method:

In [7]:
games.Game.value_counts()

Dungeons & Dragons           67
League of Legends            66
Dark Souls                   66
Music                        66
Path of Exile                66
                             ..
Space Hulk: Deathwing         1
Transport Tycoon Deluxe       1
CardLife                      1
S.T.A.L.K.E.R.: Clear Sky     1
Muck                          1
Name: Game, Length: 1676, dtype: int64
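
The difference between `unique()` and `value_counts()` is easier to see on a handful of values. The sketch below uses a made-up Series; `game_col` and its contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy Series, used only to compare unique() and value_counts().
game_col = pd.Series(["Chess", "Go", "Chess", "Poker", "Chess", "Go"], name="Game")

# unique() returns each distinct value once, in order of first appearance.
unique_games = game_col.unique()

# value_counts() returns the distinct values with their frequencies,
# sorted from most to least frequent by default.
counts = game_col.value_counts()

print(unique_games)
print(counts)
```

Note that `unique()` returns an array, while `value_counts()` returns a Series indexed by the distinct values.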

# Maps

A **map** is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often. 

[`map()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) is the first, and slightly simpler one. For example, suppose that we wanted to re-center the average viewer counts so that their mean is 0. We can do this as follows:

In [8]:
viewer_mean = games.Avg_viewers.mean()
print("Viewer Mean:", viewer_mean)

games.Avg_viewers.map(lambda v: v - viewer_mean)

Viewer Mean: 6075.140378787879


Rank
1      120945.859621
2       58302.859621
3       54739.859621
4       47673.859621
5       15664.859621
           ...      
196     -5096.140379
197     -5100.140379
198     -5103.140379
199     -5109.140379
200     -5118.140379
Name: Avg_viewers, Length: 13200, dtype: float64

The function you pass to `map()` should expect a single value from the Series (an average viewer count, in the above example), and return a transformed version of that value. `map()` returns a new Series where all the values have been transformed by your function.
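
As a small self-contained sketch of this contract, the following uses a made-up three-element Series (the values and the name `toy_mean` are assumptions) so that the result can be checked by hand.

```python
import pandas as pd

# Hypothetical toy Series; the index plays the role of Rank.
avg_viewers = pd.Series([100, 200, 600], index=[1, 2, 3], name="Avg_viewers")
toy_mean = avg_viewers.mean()  # (100 + 200 + 600) / 3 = 300.0

# The lambda receives one value at a time and returns its transformed version.
centered = avg_viewers.map(lambda v: v - toy_mean)

print(centered)
print(avg_viewers)  # the original Series is left unchanged
```

The returned Series keeps the original index, so each transformed value stays aligned with the value it came from.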

[`apply()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row.

In [9]:
def remean_viewers(row):
    row.Avg_viewers = row.Avg_viewers - viewer_mean
    return row

games.apply(remean_viewers, axis='columns')

Rank,Game,Month,Year,Hours_watched,Hours_Streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,Avg_channels,Avg_viewer_ratio
1,League of Legends,1,2016,94377226,1362044 hours,530270,2903,129172,120945.859621,1833,69.29
2,Counter-Strike: Global Offensive,1,2016,47832863,830105 hours,372654,2197,120849,58302.859621,1117,57.62
3,Dota 2,1,2016,45185893,433397 hours,315083,1100,44074,54739.859621,583,104.26
4,Hearthstone,1,2016,39936159,235903 hours,131357,517,36170,47673.859621,317,169.29
5,Call of Duty: Black Ops III,1,2016,16153057,1151578 hours,71639,3620,214054,15664.859621,1549,14.03
...,...,...,...,...,...,...,...,...,...,...,...
196,War Thunder,6,2021,704459,73613 hours,8812,223,7035,-5096.140379,102,9.57
197,Muck,6,2021,701456,31741 hours,60091,112,8591,-5100.140379,44,22.10
198,Trials Rising,6,2021,698899,4626 hours,217333,26,581,-5103.140379,6,151.08
199,Little Nightmares II,6,2021,695130,27581 hours,43518,105,6128,-5109.140379,38,25.20


If we had called `games.apply()` with `axis='index'`, then instead of passing a function to transform each row, we would need to give a function to transform each *column*.
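
The two `axis` settings can be compared side by side on a small numeric DataFrame. This is a sketch under stated assumptions: `toy`, `center_column` and the values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({
    "Avg_viewers": [100, 200, 600],
    "Avg_channels": [1, 2, 6],
})

def center_column(col):
    # With axis="index", `col` is an entire column (a Series).
    return col - col.mean()

# One call per column: every column is centered on its own mean.
centered = toy.apply(center_column, axis="index")

# One call per row: `row` is a Series holding that row's values.
row_totals = toy.apply(lambda row: row.Avg_viewers + row.Avg_channels, axis="columns")

print(centered)
print(row_totals)
```

With `axis="index"` the function is called twice here (once per column), whereas with `axis="columns"` it is called three times (once per row).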

Note that `map()` and `apply()` return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of `games`, we can see that it still has its original `Avg_viewers` value.

In [10]:
games.head(1)

Rank,Game,Month,Year,Hours_watched,Hours_Streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,Avg_channels,Avg_viewer_ratio
1,League of Legends,1,2016,94377226,1362044 hours,530270,2903,129172,127021,1833,69.29


Pandas provides many common mapping operations as built-ins. For example, here's a faster way of remeaning our `Avg_viewers` column:

In [11]:
viewer_mean = games.Avg_viewers.mean()
games.Avg_viewers - viewer_mean

Rank
1      120945.859621
2       58302.859621
3       54739.859621
4       47673.859621
5       15664.859621
           ...      
196     -5096.140379
197     -5100.140379
198     -5103.140379
199     -5109.140379
200     -5118.140379
Name: Avg_viewers, Length: 13200, dtype: float64

In [12]:
games

Rank,Game,Month,Year,Hours_watched,Hours_Streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,Avg_channels,Avg_viewer_ratio
1,League of Legends,1,2016,94377226,1362044 hours,530270,2903,129172,127021,1833,69.29
2,Counter-Strike: Global Offensive,1,2016,47832863,830105 hours,372654,2197,120849,64378,1117,57.62
3,Dota 2,1,2016,45185893,433397 hours,315083,1100,44074,60815,583,104.26
4,Hearthstone,1,2016,39936159,235903 hours,131357,517,36170,53749,317,169.29
5,Call of Duty: Black Ops III,1,2016,16153057,1151578 hours,71639,3620,214054,21740,1549,14.03
...,...,...,...,...,...,...,...,...,...,...,...
196,War Thunder,6,2021,704459,73613 hours,8812,223,7035,979,102,9.57
197,Muck,6,2021,701456,31741 hours,60091,112,8591,975,44,22.10
198,Trials Rising,6,2021,698899,4626 hours,217333,26,581,972,6,151.08
199,Little Nightmares II,6,2021,695130,27581 hours,43518,105,6128,966,38,25.20


In the expression `games.Avg_viewers - viewer_mean`, we are performing an operation between many values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the Series.
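
A quick way to convince yourself that the operator form and the `map()` form compute the same thing is to run both on a tiny Series. The data and the names `toy_mean`, `vectorized` and `mapped` below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy Series.
avg_viewers = pd.Series([100, 200, 600], name="Avg_viewers")
toy_mean = avg_viewers.mean()

# Operator form: the single value is applied to every element of the Series.
vectorized = avg_viewers - toy_mean

# Equivalent element-by-element form using map().
mapped = avg_viewers.map(lambda v: v - toy_mean)

print(vectorized.equals(mapped))
```

Both forms give the same values; the operator form simply lets pandas do the looping internally.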

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining the game and hours-streamed information in the dataset would be to do the following:

In [13]:
games.Game + " : " + games.Hours_Streamed

Rank
1                    League of Legends : 1362044 hours
2      Counter-Strike: Global Offensive : 830105 hours
3                                Dota 2 : 433397 hours
4                           Hearthstone : 235903 hours
5          Call of Duty: Black Ops III : 1151578 hours
                            ...                       
196                          War Thunder : 73613 hours
197                                 Muck : 31741 hours
198                         Trials Rising : 4626 hours
199                 Little Nightmares II : 27581 hours
200                        Tabletop RPGs : 23602 hours
Length: 13200, dtype: object

These operators are faster than `map()` or `apply()` because they use speed-ups built into pandas. All of the standard Python operators (`>`, `<`, `==`, and so on) work in this manner.

However, they are not as flexible as `map()` or `apply()`, which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
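
As one possible illustration of such conditional logic, the sketch below parses strings shaped like the `Hours_Streamed` values (for example `"1200 hours"`) and labels each one. The toy data, the function `streaming_tier` and the threshold of 1000 hours are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data mimicking the "<number> hours" format of Hours_Streamed.
toy = pd.DataFrame({
    "Game": ["Chess", "Go", "Poker"],
    "Hours_Streamed": ["1200 hours", "300 hours", "4500 hours"],
})

def streaming_tier(text):
    # Take the leading number from the string, then branch on its size.
    hours = int(text.split()[0])
    if hours >= 1000:  # arbitrary threshold chosen for this example
        return "high"
    return "low"

# map() calls streaming_tier once per value and collects the results.
tiers = toy.Hours_Streamed.map(streaming_tier)

print(tiers)
```

The parsing and the `if` branch both live inside an ordinary Python function, which is what makes `map()` convenient here.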

# Contributors

**Author**
<br>Chee Lam

# References

1. [Learning Pandas](https://www.kaggle.com/learn/pandas)
2. [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html)