# Data Science 100 Knocks (Structured Data Processing) - Python

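## Introduction
- Run the cell below first
- It imports the required libraries and loads the data (the original exercise setup reads from a PostgreSQL database; the cell below reads the CSV files under `./data` instead)
- Libraries that are expected to be used, such as pandas, are imported in the cell below
- If you want to use any other library, install it as needed (you can also install with "!pip install library-name")
- You may split the processing across multiple steps
- Names, addresses, and similar fields are dummy data and do not refer to real people or places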

In [1]:
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler # conda install -c conda-forge imbalanced-learn

df_customer = pd.read_csv("./data/customer.csv")
df_category = pd.read_csv("./data/category.csv")
df_product = pd.read_csv("./data/product.csv")
df_receipt = pd.read_csv("./data/receipt.csv")
df_store = pd.read_csv("./data/store.csv")
df_geocode = pd.read_csv("./data/geocode.csv")



# Exercises

---
> P-026: For the receipt details DataFrame (df_receipt), find the newest and the oldest sales date (sales_ymd) for each customer ID (customer_id), and display 10 rows where the two differ.

In [18]:
df_receipt.head()

Unnamed: 0,sales_ymd,sales_epoch,store_cd,receipt_no,receipt_sub_no,customer_id,product_cd,quantity,amount
0,20181103,1257206400,S14006,112,1,CS006214000001,P070305012,1,158
1,20181118,1258502400,S13008,1132,2,CS008415000097,P070701017,1,81
2,20170712,1215820800,S14028,1102,1,CS028414000014,P060101005,1,170
3,20190205,1265328000,S14042,1132,1,ZZ000000000000,P050301001,1,25
4,20180821,1250812800,S14025,1102,2,CS025415000050,P060102007,1,90


In [19]:
df_receipt2 = df_receipt.groupby('customer_id').agg({'sales_ymd':['max','min']})
df_receipt2.head()

Unnamed: 0_level_0,sales_ymd,sales_ymd
Unnamed: 0_level_1,max,min
customer_id,Unnamed: 1_level_2,Unnamed: 2_level_2
CS001113000004,20190308,20190308
CS001114000005,20190731,20180503
CS001115000010,20190405,20171228
CS001205000004,20190625,20170914
CS001205000006,20190224,20180207


In [20]:
df_receipt3 = df_receipt2.query('max != min')
df_receipt3.head()
# This raised an error. The columns produced by groupby + agg have to be renamed before query() can refer to them.

UndefinedVariableError: name 'max' is not defined
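The error makes more sense once the column labels are printed: passing a list of functions to `agg()` produces two-level (MultiIndex) columns, so there is no column literally named `max`. The sketch below uses a tiny made-up frame (`toy` is a stand-in, not the real `df_receipt`) and shows that tuple keys can be used for filtering without renaming.

```python
import pandas as pd

# Tiny stand-in for df_receipt (values are made up for illustration).
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "sales_ymd": [20190101, 20190305, 20180720],
})

agg = toy.groupby("customer_id").agg({"sales_ymd": ["max", "min"]})

# The columns form a two-level MultiIndex, so each label is a tuple.
print(agg.columns.tolist())  # [('sales_ymd', 'max'), ('sales_ymd', 'min')]

# Boolean indexing with tuple keys works without renaming anything.
changed = agg[agg[("sales_ymd", "max")] != agg[("sales_ymd", "min")]]
print(changed.index.tolist())  # ['A']
```

Flattening the column names, as done in a later cell, is still the more convenient route when `query()` is preferred.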

In [21]:
# Reference: https://qiita.com/fuppi/items/e6657860f22beae84a03

df_receipt3 = df_receipt2.agg({"sales_ymd": {"sales_ymd_max": "max","sales_ymd_min": "min"}})
df_receipt3.head()
# The errors keep coming, so I gave up and looked at the sample answer.

SpecificationError: nested renamer is not supported
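The nested-dict renaming style from the referenced article is rejected by recent pandas versions, which is what this `SpecificationError` reports. One alternative, as far as I understand, is named aggregation (keyword arguments of the form `new_name=(column, function)`), which names the output columns in the same step as the aggregation. A minimal sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Tiny stand-in for df_receipt (values are made up for illustration).
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "sales_ymd": [20190101, 20190305, 20180720],
})

result = (
    toy.groupby("customer_id")
    .agg(
        sales_ymd_max=("sales_ymd", "max"),  # output name = (input column, function)
        sales_ymd_min=("sales_ymd", "min"),
    )
    .reset_index()
    .query("sales_ymd_max != sales_ymd_min")
)
print(result)
```

Because the columns come out flat, `query()` can refer to them directly and no separate renaming step is needed.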

In [22]:
df_receipt2.columns = ["_".join(pair) for pair in df_receipt2.columns]
df_receipt2.head()

Unnamed: 0_level_0,sales_ymd_max,sales_ymd_min
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1
CS001113000004,20190308,20190308
CS001114000005,20190731,20180503
CS001115000010,20190405,20171228
CS001205000004,20190625,20170914
CS001205000006,20190224,20180207


In [24]:
# I ended up looking at the sample answer.
df_receipt3 = df_receipt2.query('sales_ymd_max != sales_ymd_min')
df_receipt3.head(10)

# The column-name row and the index-name row are not aligned, but it worked.

Unnamed: 0_level_0,sales_ymd_max,sales_ymd_min
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1
CS001114000005,20190731,20180503
CS001115000010,20190405,20171228
CS001205000004,20190625,20170914
CS001205000006,20190224,20180207
CS001214000009,20190902,20170306
CS001214000017,20191006,20180828
CS001214000048,20190929,20171109
CS001214000052,20190617,20180208
CS001215000005,20181021,20170206
CS001215000040,20171022,20170214


In [29]:
# After working through problem 27, I redid problem 26 using the reset_index() method.
df_receipt2 = df_receipt.groupby('customer_id').agg({'sales_ymd':['max','min']}).reset_index()
df_receipt2.columns = ["_".join(pair) for pair in df_receipt2.columns]
df_receipt3 = df_receipt2.query('sales_ymd_max != sales_ymd_min')
df_receipt3.head(10)

Unnamed: 0,customer_id_,sales_ymd_max,sales_ymd_min
1,CS001114000005,20190731,20180503
2,CS001115000010,20190405,20171228
3,CS001205000004,20190625,20170914
4,CS001205000006,20190224,20180207
13,CS001214000009,20190902,20170306
14,CS001214000017,20191006,20180828
16,CS001214000048,20190929,20171109
17,CS001214000052,20190617,20180208
20,CS001215000005,20181021,20170206
21,CS001215000040,20171022,20170214
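The `customer_id_` header in the output above comes from the join step: after `reset_index()`, the former index becomes a column labelled `('customer_id', '')`, and `"_".join(...)` on that pair leaves a trailing underscore. A small sketch of one way to avoid it, joining only the non-empty levels (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Tiny stand-in for df_receipt (values are made up for illustration).
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "sales_ymd": [20190101, 20190305, 20180720],
})

agg = toy.groupby("customer_id").agg({"sales_ymd": ["max", "min"]}).reset_index()

before = agg.columns.tolist()
print(before)  # [('customer_id', ''), ('sales_ymd', 'max'), ('sales_ymd', 'min')]

# Join only the non-empty levels so 'customer_id' keeps its plain name.
agg.columns = ["_".join(level for level in pair if level) for pair in agg.columns]

after = agg.columns.tolist()
print(after)  # ['customer_id', 'sales_ymd_max', 'sales_ymd_min']
```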


In [None]:
# Done!

---
> P-027: For the receipt details DataFrame (df_receipt), compute the mean sales amount (amount) for each store code (store_cd), and display the top 5 in descending order.

In [25]:
df_receipt2 = df_receipt.groupby('store_cd').agg({'amount':'mean'})
df_receipt2.head()

Unnamed: 0_level_0,amount
store_cd,Unnamed: 1_level_1
S12007,307.688343
S12013,330.19413
S12014,310.830261
S12029,315.623908
S12030,288.533727


In [27]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
df_receipt3 = df_receipt2.sort_values('amount', ascending=False)
df_receipt3.head()

Unnamed: 0_level_0,amount
store_cd,Unnamed: 1_level_1
S13052,402.86747
S13015,351.11196
S13003,350.915519
S14010,348.791262
S13001,348.470386


In [28]:
# The column names still do not line up. Redoing it with the reset_index() method from the sample answer.
df_receipt2 = df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index()
df_receipt3 = df_receipt2.sort_values('amount', ascending=False)
df_receipt3.head()

Unnamed: 0,store_cd,amount
28,S13052,402.86747
12,S13015,351.11196
7,S13003,350.915519
30,S14010,348.791262
5,S13001,348.470386
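As a side note, the same "group, average, take the top N" pattern can also be written with `as_index=False` and `nlargest()`, which keeps `store_cd` as an ordinary column and skips the explicit sort. This is only an alternative sketch on made-up data (`toy` is hypothetical), not the sample answer's approach:

```python
import pandas as pd

# Tiny stand-in for df_receipt (values are made up for illustration).
toy = pd.DataFrame({
    "store_cd": ["S1", "S1", "S2", "S3"],
    "amount": [100, 300, 50, 400],
})

top = (
    toy.groupby("store_cd", as_index=False)["amount"]
    .mean()                  # one row per store, store_cd stays a column
    .nlargest(2, "amount")   # the 2 stores with the highest mean, descending
)
print(top)
```

On the real data, `nlargest(5, "amount")` would play the role of `sort_values(..., ascending=False).head()`.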


In [None]:
# That worked properly. I have now learned the reset_index() method.

---
> P-028: For the receipt details DataFrame (df_receipt), compute the median sales amount (amount) for each store code (store_cd), and display the top 5 in descending order.

In [30]:
# I have the hang of it now, so here it is as concise code
df_receipt2 = df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False)
df_receipt2.head()

Unnamed: 0,store_cd,amount
28,S13052,190
30,S14010,188
51,S14050,185
44,S14040,180
7,S13003,180


---
> P-029: For the receipt details DataFrame (df_receipt), find the mode of the product code (product_cd) for each store code (store_cd).

In [32]:
# Check the structure of df_receipt
df_receipt.head()

Unnamed: 0,sales_ymd,sales_epoch,store_cd,receipt_no,receipt_sub_no,customer_id,product_cd,quantity,amount
0,20181103,1257206400,S14006,112,1,CS006214000001,P070305012,1,158
1,20181118,1258502400,S13008,1132,2,CS008415000097,P070701017,1,81
2,20170712,1215820800,S14028,1102,1,CS028414000014,P060101005,1,170
3,20190205,1265328000,S14042,1132,1,ZZ000000000000,P050301001,1,25
4,20180821,1250812800,S14025,1102,2,CS025415000050,P060102007,1,90


In [38]:
# Reference: about lambda https://yuru-d.com/series-apply-lambda/
# Reference: about mode() https://note.nkmk.me/python-statistics-mean-median-mode-var-stdev/
df_receipt2 = df_receipt.groupby('store_cd')
df_receipt3 = df_receipt2.product_cd.apply(lambda x: x.mode())
df_receipt3

store_cd   
S12007    0    P060303001
S12013    0    P060303001
S12014    0    P060303001
S12029    0    P060303001
S12030    0    P060303001
S13001    0    P060303001
S13002    0    P060303001
S13003    0    P071401001
S13004    0    P060303001
S13005    0    P040503001
S13008    0    P060303001
S13009    0    P060303001
S13015    0    P071401001
S13016    0    P071102001
S13017    0    P060101002
S13018    0    P071401001
S13019    0    P071401001
S13020    0    P071401001
S13031    0    P060303001
S13032    0    P060303001
S13035    0    P040503001
S13037    0    P060303001
S13038    0    P060303001
S13039    0    P071401001
S13041    0    P071401001
S13043    0    P060303001
S13044    0    P060303001
S13051    0    P050102001
          1    P071003001
          2    P080804001
S13052    0    P050101001
S14006    0    P060303001
S14010    0    P060303001
S14011    0    P060101001
S14012    0    P060303001
S14021    0    P060101001
S14022    0    P060303001
S14023    0    P071401001
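The extra index level (0, 1, 2) in the output above exists because `Series.mode()` returns every tied value, not just one; store S13051 has three products tied for most frequent. A small sketch on made-up data (`toy` is hypothetical) showing the tie behaviour, plus one way to collect the tied values into a single row per store:

```python
import pandas as pd

# Tiny stand-in for df_receipt: S1 has a clear mode, S2 has a two-way tie.
toy = pd.DataFrame({
    "store_cd": ["S1", "S1", "S1", "S2", "S2"],
    "product_cd": ["P1", "P1", "P2", "P3", "P4"],
})

# One output row per tied value, indexed by (store_cd, position).
modes = toy.groupby("store_cd")["product_cd"].apply(lambda s: s.mode())
print(modes)

# One output row per store, with all tied values gathered into a list.
per_store = toy.groupby("store_cd")["product_cd"].agg(lambda s: s.mode().tolist())
print(per_store)
```

Which shape is preferable depends on what comes next: the long form is easier to join against other tables, while the list form guarantees exactly one row per store.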


In [39]:
# Without reset_index(), the result stays a Series rather than a DataFrame.
df_receipt2 = df_receipt.groupby('store_cd')
df_receipt3 = df_receipt2.product_cd.apply(lambda x: x.mode()).reset_index()
df_receipt3

Unnamed: 0,store_cd,level_1,product_cd
0,S12007,0,P060303001
1,S12013,0,P060303001
2,S12014,0,P060303001
3,S12029,0,P060303001
4,S12030,0,P060303001
5,S13001,0,P060303001
6,S13002,0,P060303001
7,S13003,0,P071401001
8,S13004,0,P060303001
9,S13005,0,P040503001


---
> P-030: For the receipt details DataFrame (df_receipt), compute the sample variance of the sales amount (amount) for each store code (store_cd), and display the top 5 in descending order.

In [40]:
# Reference: computing the sample variance with pandas (ddof=0 divides by n) https://it-engineer-lab.com/archives/1065
df_receipt2 = df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index()
df_receipt2.head()

Unnamed: 0,store_cd,amount
0,S12007,199878.572908
1,S12013,221059.615563
2,S12014,200946.440113
3,S12029,194078.594456
4,S12030,185542.898104


In [41]:
df_receipt3 = df_receipt2.sort_values('amount', ascending=False)
df_receipt3.head()

Unnamed: 0,store_cd,amount
28,S13052,440088.701311
31,S14011,306314.558164
42,S14034,296920.081011
5,S13001,295431.993329
12,S13015,295294.361116
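The `ddof=0` argument matters here because pandas defaults to `ddof=1`, which divides by n - 1 (the unbiased variance) rather than by n. A minimal sketch of the difference on a made-up Series (`amounts` is hypothetical):

```python
import pandas as pd

amounts = pd.Series([100, 200, 300, 400])

# ddof=0 divides the sum of squared deviations by n.
variance_n = amounts.var(ddof=0)

# The pandas default is ddof=1, which divides by n - 1.
variance_n_minus_1 = amounts.var()

print(variance_n)          # 12500.0
print(variance_n_minus_1)  # about 16666.67
```

With tens of thousands of receipt rows per store the two values are nearly identical, but on small groups the gap is noticeable, so it is worth stating which definition is intended.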
