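# About this code
A sample that uses the [polars](https://github.com/pola-rs/polars) library to aggregate a user list, a user action list, and a reaction list.

Working through several examples, the goal is to produce a ranking of per-user activity (actions.point + reactions.reward).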

## Data relationships
```
< users > (parent)
| --- < actions > (child)
  | --- < reactions > (grandchild)
```

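## Data structure
- users.csv: user list
  - id: user ID. Always a unique string
  - name: user name
  - email: the user's email address
  - type: kind of user. personal is an individual; enterprise is a for-profit organization
  - leaved: whether the user has left the service
- actions.csv: user action list
  - id: action ID. Always a unique string
  - user_id: ID of the user who performed the action
  - date: date and time of the action ※
  - message: message associated with the action
  - point: points earned by the action. An integer or a missing value
- reactions.csv: list of reactions to user actions
  - id: reaction ID
  - action_id: ID of the action being reacted to
  - user_id: ID of the user who made the reaction
  - date: date and time of the reaction ※
  - type: kind of reaction
  - reward: points received through the reaction

※ All dates are ISO 8601 formatted strings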
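
As a small illustration of the date format, the sketch below parses one timestamp copied from actions.csv using only the Python standard library. It is not part of the notebook, and the variable names are chosen for illustration.

```python
from datetime import datetime, timedelta

# One timestamp copied from actions.csv (ISO 8601 with a +09:00 UTC offset).
raw = "2021-08-15T10:12:34+09:00"

# datetime.fromisoformat accepts this form and keeps the UTC offset.
parsed = datetime.fromisoformat(raw)

print(parsed.utcoffset() == timedelta(hours=9))  # True: the offset is preserved
print(parsed.isoformat() == raw)                 # True: round-trips unchanged
```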

In [1]:
import numpy as np
import polars as pl

# Analyzing user data

In [2]:
df_users = pl.read_csv("users.csv")
df_users

id,name,email,type,leaved
str,str,str,str,bool
"""BcHgeZkTsc""","""アリス""","""hoge@hoge.example.com""","""personal""",False
"""KHPiabVr3o""","""ボブ""","""fuga@fuga.org""","""personal""",False
"""AQA7LkexXv""","""チャーリー""","""foo@foo.net""","""personal""",True
"""C82ZQKSQk7""","""ダービー""","""baz@baz.biz""","""enterprise""",False
"""HzZow64HGH""","""エリザベス""","""bar@bar.info""","""enterprise""",False


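## Extracting id and name
Either specify the required columns as a list, or slice, as shown below.

- With a list, the columns do not need to be adjacent
- Unlike pandas, polars has no index, so there is no polars.DataFrame.loc. Slicing is the equivalent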

In [3]:
df_users[["id", "name"]]

id,name
str,str
"""BcHgeZkTsc""","""アリス"""
"""KHPiabVr3o""","""ボブ"""
"""AQA7LkexXv""","""チャーリー"""
"""C82ZQKSQk7""","""ダービー"""
"""HzZow64HGH""","""エリザベス"""


In [4]:
df_users[:, :"name"]    # same as df_users.loc[:, :"name"] of pd.DataFrame

id,name
str,str
"""BcHgeZkTsc""","""アリス"""
"""KHPiabVr3o""","""ボブ"""
"""AQA7LkexXv""","""チャーリー"""
"""C82ZQKSQk7""","""ダービー"""
"""HzZow64HGH""","""エリザベス"""


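## Extracting enterprise users or users who have left: pl.DataFrame.filter and others
When combining multiple conditions, join them with "&" for AND and "|" for OR.

Each condition must be wrapped in (): "&" and "|" bind more tightly than "==", so without the parentheses the expression is not evaluated in the intended order and a `TypeError` is raised.

※ Unlike pandas, there is no query function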
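
To see why the parentheses matter, the sketch below asks Python's own parser how it reads an unparenthesized condition. It uses only the standard `ast` module; `col_type` and `col_leaved` are hypothetical stand-ins for the polars column expressions and are never evaluated.

```python
import ast

# The condition without parentheses, as plain source text.
source = 'col_type == "enterprise" | col_leaved == True'
tree = ast.parse(source, mode="eval").body

# Python parses this as a chained comparison:
#   col_type == ("enterprise" | col_leaved) == True
print(type(tree).__name__)                    # Compare
print(len(tree.ops))                          # 2
print(type(tree.comparators[0]).__name__)     # BinOp
print(type(tree.comparators[0].op).__name__)  # BitOr
```

The `|` is applied to `"enterprise"` and the second column first, and the two `==` then form a chain, which is not the intended "condition OR condition".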

In [5]:
df_users.filter((pl.col("type") == "enterprise") | (pl.col("leaved") == True))

id,name,email,type,leaved
str,str,str,str,bool
"""AQA7LkexXv""","""チャーリー""","""foo@foo.net""","""personal""",True
"""C82ZQKSQk7""","""ダービー""","""baz@baz.biz""","""enterprise""",False
"""HzZow64HGH""","""エリザベス""","""bar@bar.info""","""enterprise""",False


In [6]:
# df_users.leaved == True does not work: the right-hand side stays the scalar True and is not expanded into a list
df_users[
    (df_users.type == "enterprise")
    | (df_users.leaved)
]



id,name,email,type,leaved
str,str,str,str,bool
"""AQA7LkexXv""","""チャーリー""","""foo@foo.net""","""personal""",True
"""C82ZQKSQk7""","""ダービー""","""baz@baz.biz""","""enterprise""",False
"""HzZow64HGH""","""エリザベス""","""bar@bar.info""","""enterprise""",False


## Counting users

### Total number of users

In [7]:
len(df_users)

5

### Number of users whose leaved is False (current members)

In [8]:
# df_users.leaved == False does not work: the right-hand side stays the scalar False and is not expanded into a list
len(df_users[~df_users.leaved])

4

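### Extracting users whose email address starts with "b": polars.Series.apply

In this case only the email column of df_users is needed, so calling apply on df_users.email is preferable (higher throughput, and only the required data is touched). In polars, the return type must be specified through the second argument (return_dtype).

apply is also available on polars.DataFrame. Unlike pandas, however, the callable receives each row as a tuple.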
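
The difference between the two callable shapes can be sketched in plain Python, without polars. The rows below are copied from users.csv. The list comprehensions stand in for the element-wise and row-wise apply calls and are only an illustration of what each callable receives.

```python
# Rows copied from users.csv: (id, name, email, type, leaved).
rows = [
    ("BcHgeZkTsc", "アリス", "hoge@hoge.example.com", "personal", False),
    ("KHPiabVr3o", "ボブ", "fuga@fuga.org", "personal", False),
    ("AQA7LkexXv", "チャーリー", "foo@foo.net", "personal", True),
    ("C82ZQKSQk7", "ダービー", "baz@baz.biz", "enterprise", False),
    ("HzZow64HGH", "エリザベス", "bar@bar.info", "enterprise", False),
]

# Column-style: the callable receives one email string at a time.
emails = [row[2] for row in rows]
mask_from_column = [email.startswith("b") for email in emails]

# Row-style: the callable receives the whole row as a tuple,
# so the email has to be picked out by position.
mask_from_rows = [row[2].startswith("b") for row in rows]

print(sum(mask_from_column))  # 2
```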


In [9]:
serial_email_b_start = df_users.email.apply(lambda e: e.startswith("b"), bool)
print(len(df_users[serial_email_b_start]))

2


## Listing the unique user types and counting them: polars.DataFrame.distinct, polars.Series.to_list

- pandas.DataFrame.drop_duplicates ≒ polars.DataFrame.distinct
- pandas.Series.to_list = polars.Series.to_list
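
For comparison, the same "unique values, then count per value" steps can be sketched with the Python standard library. The list below is the type column of users.csv written out by hand; this is an illustration, not the polars implementation.

```python
from collections import Counter

# The type column of users.csv, in file order.
types = ["personal", "personal", "personal", "enterprise", "enterprise"]

# Unique values in order of first appearance.
unique_types = list(dict.fromkeys(types))
print(unique_types)  # ['personal', 'enterprise']

# Number of users per type.
counts = dict(Counter(types))
print(counts)  # {'personal': 3, 'enterprise': 2}
```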

In [10]:
user_types = df_users.distinct(subset="type").type.to_list()
user_types

['personal', 'enterprise']

In [11]:
{
    user_type: len(df_users[df_users.type == user_type])
    for user_type in user_types
} 

{'personal': 3, 'enterprise': 2}

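# Analyzing user action data

For id="OoXNK4b3px", point is empty in actions.csv, so it becomes null.

Because the remaining values are integers, point is typed i64 (in pandas, NaN is a float, so implicit type conversion would make point a float column).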

In [12]:
df_actions = pl.read_csv("actions.csv")
df_actions

id,user_id,date,message,point
str,str,str,str,i64
"""7GsubTX9n6""","""BcHgeZkTsc""","""2021-08-15T10:12:34+09:00""","""Hoge hoge""",1.0
"""D76FJVQ2j2""","""KHPiabVr3o""","""2021-08-15T10:23:45+09:00""","""Lorem Ipsum""",21.0
"""6znyhCukd6""","""HzZow64HGH""","""2021-08-15T11:54:32+09:00""","""テスト　テスト""",3.0
"""hSQszmDjlU""","""BcHgeZkTsc""","""2021-08-15T12:34:56+09:00""","""テスト テスト２""",42.0
"""CVqQD0xH2Y""","""HzZow64HGH""","""2021-08-15T14:36:52+09:00""","""👑👑💢""",49.0
"""OoXNK4b3px""","""BcHgeZkTsc""","""2021-08-15T14:41:03+09:00""","""🔡""",
"""veDQHBOXnG""","""BcHgeZkTsc""","""2021-08-15T14:52:12+09:00""","""foo""",4.0


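## Sorting actions by user_id ascending, then point descending: sort

The items of the lists passed as arguments correspond by position (in the example below, user_id gets reverse=False and point gets reverse=True).

pandas.DataFrame.sort_values corresponds to polars.DataFrame.sort. The ascending parameter corresponds to reverse, but with the opposite sense: ascending=True sorts in ascending order, while reverse=True sorts in descending order.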
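
The intended ordering can be sketched in plain Python with two stable sorts. The rows are copied from actions.csv. Placing the missing point last within its user is an assumption made to mirror the notebook output, not a statement about how polars orders nulls in general.

```python
# (id, user_id, point) copied from actions.csv; None marks the missing point.
actions = [
    ("7GsubTX9n6", "BcHgeZkTsc", 1),
    ("D76FJVQ2j2", "KHPiabVr3o", 21),
    ("6znyhCukd6", "HzZow64HGH", 3),
    ("hSQszmDjlU", "BcHgeZkTsc", 42),
    ("CVqQD0xH2Y", "HzZow64HGH", 49),
    ("OoXNK4b3px", "BcHgeZkTsc", None),
    ("veDQHBOXnG", "BcHgeZkTsc", 4),
]

# Secondary key first: point descending, with None placed last.
by_point_desc = sorted(actions, key=lambda r: (r[2] is None, -(r[2] or 0)))

# Primary key second: user_id ascending. sorted() is stable, so the
# point order is kept within each user_id.
result = sorted(by_point_desc, key=lambda r: r[1])

for row in result:
    print(row)
```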


In [13]:
sort_orders = [False, True]
df_actions.sort(["user_id", "point"], reverse=sort_orders)

id,user_id,date,message,point
str,str,str,str,i64
"""hSQszmDjlU""","""BcHgeZkTsc""","""2021-08-15T12:34:56+09:00""","""テスト テスト２""",42.0
"""veDQHBOXnG""","""BcHgeZkTsc""","""2021-08-15T14:52:12+09:00""","""foo""",4.0
"""7GsubTX9n6""","""BcHgeZkTsc""","""2021-08-15T10:12:34+09:00""","""Hoge hoge""",1.0
"""OoXNK4b3px""","""BcHgeZkTsc""","""2021-08-15T14:41:03+09:00""","""🔡""",
"""CVqQD0xH2Y""","""HzZow64HGH""","""2021-08-15T14:36:52+09:00""","""👑👑💢""",49.0
"""6znyhCukd6""","""HzZow64HGH""","""2021-08-15T11:54:32+09:00""","""テスト　テスト""",3.0
"""D76FJVQ2j2""","""KHPiabVr3o""","""2021-08-15T10:23:45+09:00""","""Lorem Ipsum""",21.0


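## Replacing missing (null) point values: fill_null

The strategy parameter of fill_null accepts the following values.

- "forward": replace with the preceding value
- "backward": replace with the following value
- "mean": replace with the mean
- "max": replace with the maximum
- "min": replace with the minimum
- "one": replace with 1
- "zero": replace with 0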
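
What each strategy does can be sketched on a plain Python list. The `fill_null` function below is a hypothetical helper written for this illustration, not the polars method, and details such as the result type for "mean" may differ in polars.

```python
def fill_null(values, strategy):
    """Return a copy of `values` with each None replaced per `strategy`."""
    if strategy in ("forward", "backward"):
        # Walk in the fill direction, carrying the last seen value along.
        ordered = values if strategy == "forward" else values[::-1]
        filled, last = [], None
        for v in ordered:
            last = last if v is None else v
            filled.append(last)
        return filled if strategy == "forward" else filled[::-1]
    known = [v for v in values if v is not None]  # assumes at least one value
    constants = {
        "mean": sum(known) / len(known),
        "max": max(known),
        "min": min(known),
        "one": 1,
        "zero": 0,
    }
    return [constants[strategy] if v is None else v for v in values]


points = [1, 21, 3, 42, 49, None, 4]  # point column from actions.csv
print(fill_null(points, "zero"))      # [1, 21, 3, 42, 49, 0, 4]
print(fill_null(points, "forward"))   # [1, 21, 3, 42, 49, 49, 4]
print(fill_null(points, "backward"))  # [1, 21, 3, 42, 49, 4, 4]
```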

In [14]:
df_actions["point"] = df_actions.point.fill_null("zero")
df_actions[["id", "point"]]



id,point
str,i64
"""7GsubTX9n6""",1
"""D76FJVQ2j2""",21
"""6znyhCukd6""",3
"""hSQszmDjlU""",42
"""CVqQD0xH2Y""",49
"""OoXNK4b3px""",0
"""veDQHBOXnG""",4


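## Summing point per user: groupby

The "user_id" passed to groupby remains an ordinary column of the result. polars has no index, so it is not converted to one as it would be in pandas.

agg specifies which column is aggregated with which function (columns that are not specified are not output). Unlike pandas, you do not pass a dict mapping each column to a function. Instead you pass a list of expressions, each combining a column selection with an operation.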
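
The grouping itself can be sketched in plain Python. The pairs below are the user_id and point columns of actions.csv after the missing point has been replaced with 0. This illustrates what the aggregation computes, not how polars executes it.

```python
from collections import defaultdict

# (user_id, point) pairs from actions.csv, with the missing point filled as 0.
rows = [
    ("BcHgeZkTsc", 1),
    ("KHPiabVr3o", 21),
    ("HzZow64HGH", 3),
    ("BcHgeZkTsc", 42),
    ("HzZow64HGH", 49),
    ("BcHgeZkTsc", 0),
    ("BcHgeZkTsc", 4),
]

# Group by user_id and sum point within each group.
totals = defaultdict(int)
for user_id, point in rows:
    totals[user_id] += point

print(dict(totals))  # {'BcHgeZkTsc': 47, 'KHPiabVr3o': 21, 'HzZow64HGH': 52}
```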

In [15]:
df_actions \
    .groupby(["user_id"]) \
    .agg([
         pl.col("point").sum()
    ])

user_id,point
str,i64
"""KHPiabVr3o""",21
"""BcHgeZkTsc""",47
"""HzZow64HGH""",52


## Per-user statistics of point: describe
pandas outputs

- count, mean, std, min, 25%, 50%, 75%, max

whereas polars outputs

- mean, std, min, median (= 50%), max

Statistics corresponding to count, 25%, and 75% are not produced.
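
The five statistics can be cross-checked with the standard `statistics` module. The point lists are taken from actions.csv with the missing value filled as 0. The `describe` function is a hypothetical helper for this sketch, and returning None for the standard deviation of a single value is a choice made to mirror the empty std cell in the notebook output.

```python
import statistics

# point values per user, taken from actions.csv (missing value filled as 0).
points_by_user = {
    "BcHgeZkTsc": [1, 42, 0, 4],
    "KHPiabVr3o": [21],
    "HzZow64HGH": [3, 49],
}


def describe(points):
    """Return mean, sample std, min, median, and max of a list of numbers."""
    return {
        "mean": statistics.mean(points),
        # The sample standard deviation needs at least two values.
        "std": statistics.stdev(points) if len(points) > 1 else None,
        "min": min(points),
        "median": statistics.median(points),
        "max": max(points),
    }


summary = {user_id: describe(p) for user_id, p in points_by_user.items()}
for user_id, stats in summary.items():
    print(user_id, stats)
```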

In [16]:
dfs = []
for user_id in df_actions["user_id"]:
    df = df_actions[df_actions["user_id"] == user_id][["point"]].describe()
    df["user_id"] = np.repeat(user_id, len(df))
    dfs.append(df)
pl.concat(dfs).pivot(values="point", index="user_id", columns="describe")

user_id,max,mean,median,min,std
str,f64,f64,f64,f64,f64
"""BcHgeZkTsc""",42.0,11.75,2.5,0.0,20.238165
"""KHPiabVr3o""",21.0,21.0,21.0,21.0,
"""HzZow64HGH""",49.0,26.0,26.0,3.0,32.526912


## Extracting each user's first action: groupby

In [17]:
df_actions \
    .groupby(["user_id"]) \
    .agg([
        pl.col("date").first(),
        pl.col("message").first(),
        pl.col("point").first(),
    ])

user_id,date,message,point
str,str,str,i64
"""HzZow64HGH""","""2021-08-15T11:54:32+09:00""","""テスト　テスト""",3
"""BcHgeZkTsc""","""2021-08-15T10:12:34+09:00""","""Hoge hoge""",1
"""KHPiabVr3o""","""2021-08-15T10:23:45+09:00""","""Lorem Ipsum""",21


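## Subtotals of point per 10-minute window: groupby_dynamic
groupby_dynamic + sum takes the subtotal of point over each 10-minute window.

The offset of -5m is intended to start the windows at 10:05 (= 10:10 - 0:05), so that the first window contains the first action at 2021-08-15T10:12:34+09:00. In the output below, however, the windows still start at 10:10, so the offset does not appear to take effect here (see the FIXME in the code).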
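
The intended window arithmetic can be sketched with the standard library. `window_start` and `EPOCH` are hypothetical names for this illustration, and aligning windows to the Unix epoch is an assumption. This is not how polars implements groupby_dynamic.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment point for the windows


def window_start(t, every, offset=timedelta(0)):
    """Return the start of the `every`-wide window, shifted by `offset`, that contains t."""
    # Undo the shift, floor to a multiple of `every`, then re-apply the shift.
    n = (t - offset - EPOCH) // every
    return EPOCH + n * every + offset


first_action = datetime(2021, 8, 15, 10, 12, 34)  # local time, offset dropped
every = timedelta(minutes=10)

print(window_start(first_action, every))                         # 2021-08-15 10:10:00
print(window_start(first_action, every, timedelta(minutes=-5)))  # 2021-08-15 10:05:00
```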




In [18]:
df_actions_date = df_actions.clone()
df_actions_date["date"] = df_actions_date.date.str.strptime(
    pl.Datetime, "%Y-%m-%dT%H:%M:%S%z"
)
df_actions_date[["date", "point"]] \
    .groupby_dynamic(
        "date", 
        every="10m", 
        offset="-5m"    # FIXME: offset may not work?
    ) \
    .agg([
        pl.col("point").sum(),
    ])


date,point
datetime,i64
2021-08-15 10:10:00,1
2021-08-15 10:20:00,21
2021-08-15 11:50:00,3
2021-08-15 12:30:00,42
2021-08-15 14:30:00,49
2021-08-15 14:50:00,4


## Analyzing reaction data for actions

In [19]:
df_reactions_1 = pl.read_csv("reactions_1.csv")
df_reactions_1

id,action_id,user_id,date,type,reward
str,str,str,str,str,i64
"""ycV0zbqrL5""","""7GsubTX9n6""","""AQA7LkexXv""","""2021-08-15T11:52:12+09:00""","""like""",1
"""NXXk7iEsMA""","""OoXNK4b3px""","""AQA7LkexXv""","""2021-08-15T13:09:31+09:00""","""like""",1
"""Oq9i6DBTBp""","""OoXNK4b3px""","""AQA7LkexXv""","""2021-08-15T13:21:47+09:00""","""like""",1


In [20]:
df_reactions_2 = pl.read_csv("reactions_2.csv")
df_reactions_2

id,action_id,user_id,date,type,reward
str,str,str,str,str,i64
"""jwUcUCuVEq""","""D76FJVQ2j2""","""BcHgeZkTsc""","""2021-08-15T16:21:48+09:00""","""comment""",3


## Combining two datasets into one DataFrame: concat

In [21]:
df_reactions = pl.concat([df_reactions_1, df_reactions_2])
df_reactions

id,action_id,user_id,date,type,reward
str,str,str,str,str,i64
"""ycV0zbqrL5""","""7GsubTX9n6""","""AQA7LkexXv""","""2021-08-15T11:52:12+09:00""","""like""",1
"""NXXk7iEsMA""","""OoXNK4b3px""","""AQA7LkexXv""","""2021-08-15T13:09:31+09:00""","""like""",1
"""Oq9i6DBTBp""","""OoXNK4b3px""","""AQA7LkexXv""","""2021-08-15T13:21:47+09:00""","""like""",1
"""jwUcUCuVEq""","""D76FJVQ2j2""","""BcHgeZkTsc""","""2021-08-15T16:21:48+09:00""","""comment""",3


# Merging multiple datasets for analysis

## Left-joining action and reaction data on the action id: join, (rename)
polars.DataFrame.merge does not exist. Use polars.DataFrame.join instead.

Because a DataFrame has no index, polars.DataFrame.rename does not need columns to be specified.
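
What the left join produces can be sketched in plain Python: every action is kept, an action with several reactions appears once per reaction, and an action without reactions gets None on the right-hand side. The ids are copied from the CSV data. The code illustrates left-join semantics only, not the polars implementation.

```python
from collections import defaultdict

# Action ids from actions.csv, in file order.
action_ids = [
    "7GsubTX9n6", "D76FJVQ2j2", "6znyhCukd6", "hSQszmDjlU",
    "CVqQD0xH2Y", "OoXNK4b3px", "veDQHBOXnG",
]

# (reaction_id, action_id, reward) from the two reactions files.
reactions = [
    ("ycV0zbqrL5", "7GsubTX9n6", 1),
    ("NXXk7iEsMA", "OoXNK4b3px", 1),
    ("Oq9i6DBTBp", "OoXNK4b3px", 1),
    ("jwUcUCuVEq", "D76FJVQ2j2", 3),
]

# Index the right-hand side by the join key.
by_action = defaultdict(list)
for reaction_id, action_id, reward in reactions:
    by_action[action_id].append((reaction_id, reward))

# Left join: keep every action, and pad with None when nothing matches.
joined = []
for action_id in action_ids:
    matches = by_action.get(action_id) or [(None, None)]
    for reaction_id, reward in matches:
        joined.append((action_id, reaction_id, reward))

print(len(joined))  # 8
```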


In [22]:
df_reactions_renamed = df_reactions.rename({
    "id": "reaction_id", 
    "user_id": "reaction_user_id",
    "date": "reaction_date",
})
df_actions_renamed = df_actions.rename({"id": "action_id"})

df_actions_merged = df_actions_renamed.join(
    df_reactions_renamed,
    on=["action_id"],
    how="left"
)

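# NOTE: the output below still shows nulls in reward, so this attribute
# assignment does not appear to replace the column (In [14] uses
# df_actions["point"] = ... instead). In [24] applies fill_null(0) again.
df_actions_merged.reward = \
    df_actions_merged.reward.fill_null("zero")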
df_actions_merged

action_id,user_id,date,message,point,reaction_id,reaction_user_id,reaction_date,type,reward
str,str,str,str,i64,str,str,str,str,i64
"""7GsubTX9n6""","""BcHgeZkTsc""","""2021-08-15T10:12:34+09:00""","""Hoge hoge""",1,"""ycV0zbqrL5""","""AQA7LkexXv""","""2021-08-15T11:52:12+09:00""","""like""",1.0
"""D76FJVQ2j2""","""KHPiabVr3o""","""2021-08-15T10:23:45+09:00""","""Lorem Ipsum""",21,"""jwUcUCuVEq""","""BcHgeZkTsc""","""2021-08-15T16:21:48+09:00""","""comment""",3.0
"""6znyhCukd6""","""HzZow64HGH""","""2021-08-15T11:54:32+09:00""","""テスト　テスト""",3,,,,,
"""hSQszmDjlU""","""BcHgeZkTsc""","""2021-08-15T12:34:56+09:00""","""テスト テスト２""",42,,,,,
"""CVqQD0xH2Y""","""HzZow64HGH""","""2021-08-15T14:36:52+09:00""","""👑👑💢""",49,,,,,
"""OoXNK4b3px""","""BcHgeZkTsc""","""2021-08-15T14:41:03+09:00""","""🔡""",0,"""NXXk7iEsMA""","""AQA7LkexXv""","""2021-08-15T13:09:31+09:00""","""like""",1.0
"""OoXNK4b3px""","""BcHgeZkTsc""","""2021-08-15T14:41:03+09:00""","""🔡""",0,"""Oq9i6DBTBp""","""AQA7LkexXv""","""2021-08-15T13:21:47+09:00""","""like""",1.0
"""veDQHBOXnG""","""BcHgeZkTsc""","""2021-08-15T14:52:12+09:00""","""foo""",4,,,,,


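## Adding user names to the data above: join
The email, type, and leaved columns of df_users are not used from here on.

They are therefore dropped, keeping only the "id" and "name" columns (the first line of the code below).

Only the user_id values present in df_actions_merged should be merged, so a left join is used (merging with df_actions_merged.user_id as the base).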

In [23]:
df_users_dropped = df_users[["id", "name"]]
df_merged = df_actions_merged.join(
    df_users_dropped.rename({"id": "user_id"}),
    left_on=["user_id"],
    right_on=["user_id"],
    how="left")
df_merged

action_id,user_id,date,message,point,reaction_id,reaction_user_id,reaction_date,type,reward,name
str,str,str,str,i64,str,str,str,str,i64,str
"""7GsubTX9n6""","""BcHgeZkTsc""","""2021-08-15T10:12:34+09:00""","""Hoge hoge""",1,"""ycV0zbqrL5""","""AQA7LkexXv""","""2021-08-15T11:52:12+09:00""","""like""",1.0,"""アリス"""
"""D76FJVQ2j2""","""KHPiabVr3o""","""2021-08-15T10:23:45+09:00""","""Lorem Ipsum""",21,"""jwUcUCuVEq""","""BcHgeZkTsc""","""2021-08-15T16:21:48+09:00""","""comment""",3.0,"""ボブ"""
"""6znyhCukd6""","""HzZow64HGH""","""2021-08-15T11:54:32+09:00""","""テスト　テスト""",3,,,,,,"""エリザベス"""
"""hSQszmDjlU""","""BcHgeZkTsc""","""2021-08-15T12:34:56+09:00""","""テスト テスト２""",42,,,,,,"""アリス"""
"""CVqQD0xH2Y""","""HzZow64HGH""","""2021-08-15T14:36:52+09:00""","""👑👑💢""",49,,,,,,"""エリザベス"""
"""OoXNK4b3px""","""BcHgeZkTsc""","""2021-08-15T14:41:03+09:00""","""🔡""",0,"""NXXk7iEsMA""","""AQA7LkexXv""","""2021-08-15T13:09:31+09:00""","""like""",1.0,"""アリス"""
"""OoXNK4b3px""","""BcHgeZkTsc""","""2021-08-15T14:41:03+09:00""","""🔡""",0,"""Oq9i6DBTBp""","""AQA7LkexXv""","""2021-08-15T13:21:47+09:00""","""like""",1.0,"""アリス"""
"""veDQHBOXnG""","""BcHgeZkTsc""","""2021-08-15T14:52:12+09:00""","""foo""",4,,,,,,"""アリス"""


## Computing per-user activity (point + reward)
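
In the joined data, an action with several reactions appears once per reaction (OoXNK4b3px appears twice). Its point is 0 here, so summing over the joined rows is unaffected, but a non-zero point would be counted once per reaction. The plain-Python sketch below sums points and rewards separately as a cross-check. The variable names are illustrative, and crediting each reward to the owner of the action follows the joined table above.

```python
from collections import defaultdict

# (action_id, user_id, point) from actions.csv, missing point filled as 0.
actions = [
    ("7GsubTX9n6", "BcHgeZkTsc", 1),
    ("D76FJVQ2j2", "KHPiabVr3o", 21),
    ("6znyhCukd6", "HzZow64HGH", 3),
    ("hSQszmDjlU", "BcHgeZkTsc", 42),
    ("CVqQD0xH2Y", "HzZow64HGH", 49),
    ("OoXNK4b3px", "BcHgeZkTsc", 0),
    ("veDQHBOXnG", "BcHgeZkTsc", 4),
]

# (action_id, reward) from the reactions data.
reactions = [
    ("7GsubTX9n6", 1),
    ("OoXNK4b3px", 1),
    ("OoXNK4b3px", 1),
    ("D76FJVQ2j2", 3),
]

owner = {action_id: user_id for action_id, user_id, _ in actions}

activity = defaultdict(int)
for _, user_id, point in actions:
    activity[user_id] += point          # each action counted exactly once
for action_id, reward in reactions:
    activity[owner[action_id]] += reward  # reward goes to the action's owner

print(dict(activity))  # {'BcHgeZkTsc': 50, 'KHPiabVr3o': 24, 'HzZow64HGH': 52}
```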

In [24]:
df_sum = df_merged \
    .groupby("user_id") \
    .agg([
        pl.col("point").fill_null(0).sum() + pl.col("reward").fill_null(0).sum(),
    ])
df_sum = df_sum.rename({"point": "activity"})
df_sum

user_id,activity
str,i64
"""BcHgeZkTsc""",50
"""KHPiabVr3o""",24
"""HzZow64HGH""",52


## Ranking users by activity

In [25]:
df_ranking = df_sum \
    .join(
        df_users_dropped.rename({"id": "user_id"}),
        on="user_id",
        how="left"
    ) \
    .sort(["activity"], reverse=True)
df_ranking

user_id,activity,name
str,i64,str
"""HzZow64HGH""",52,"""エリザベス"""
"""BcHgeZkTsc""",50,"""アリス"""
"""KHPiabVr3o""",24,"""ボブ"""
