# Data Science 100 Knocks (Structured Data Processing) - Python

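## Introduction
- Run the cell below first
- It imports the required libraries and loads the data. The exercise text assumes a PostgreSQL database; the cell below reads the CSV files under `./data` instead
- Libraries you are likely to need, such as pandas, are imported in the cell below
- If you want other libraries, install them as needed (`!pip install <library name>` also works)
- You may split the processing across multiple steps
- Names, addresses, and similar fields are dummy data and do not refer to real people or places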

In [1]:
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler # conda install -c conda-forge imbalanced-learn

df_customer = pd.read_csv("./data/customer.csv")
df_category = pd.read_csv("./data/category.csv")
df_product = pd.read_csv("./data/product.csv")
df_receipt = pd.read_csv("./data/receipt.csv")
df_store = pd.read_csv("./data/store.csv")
df_geocode = pd.read_csv("./data/geocode.csv")



# Exercises

---
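> P-041: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) by date (sales_ymd), and compute the change in sales amount from the previous day. Displaying 10 rows is sufficient.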

In [2]:
df_receipt

Unnamed: 0,sales_ymd,sales_epoch,store_cd,receipt_no,receipt_sub_no,customer_id,product_cd,quantity,amount
0,20181103,1257206400,S14006,112,1,CS006214000001,P070305012,1,158
1,20181118,1258502400,S13008,1132,2,CS008415000097,P070701017,1,81
2,20170712,1215820800,S14028,1102,1,CS028414000014,P060101005,1,170
3,20190205,1265328000,S14042,1132,1,ZZ000000000000,P050301001,1,25
4,20180821,1250812800,S14025,1102,2,CS025415000050,P060102007,1,90
...,...,...,...,...,...,...,...,...,...
104676,20180221,1235174400,S13043,1132,2,ZZ000000000000,P050101001,1,40
104677,20190911,1284163200,S14047,1132,2,ZZ000000000000,P071006005,1,218
104678,20170311,1205193600,S14040,1122,1,CS040513000195,P050405003,1,168
104679,20170331,1206921600,S13002,1142,1,CS002513000049,P060303001,1,148


In [3]:
# groupby() on sales_ymd
df2 = df_receipt[["sales_ymd","amount"]].groupby("sales_ymd")
df2.head() # Why is nearly every row shown? GroupBy.head() returns the first 5 rows of each group, not of the whole frame

Unnamed: 0,sales_ymd,amount
0,20181103,158
1,20181118,81
2,20170712,170
3,20190205,25
4,20180821,90
...,...,...
15413,20170115,198
15680,20170920,138
15776,20170920,60
17132,20170325,328
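
The behaviour seen above can be reproduced on a small frame. This is a minimal sketch with made-up values (the `toy` frame is hypothetical, not taken from `df_receipt`): calling `head()` on a GroupBy object keeps the first rows of every group, while an aggregation such as `sum()` is what collapses each group to one row.

```python
import pandas as pd

# Toy stand-in for df_receipt (values are invented for illustration)
toy = pd.DataFrame({
    "sales_ymd": [20170101, 20170101, 20170102, 20170102, 20170102],
    "amount": [100, 200, 50, 60, 70],
})

grouped = toy.groupby("sales_ymd")

# head(1) on a GroupBy keeps the first row of each group,
# preserving the original row labels
first_per_group = grouped.head(1)

# An aggregation is what reduces each group to a single row
total_per_day = grouped["amount"].sum()

print(first_per_group)
print(total_per_day)
```

With roughly a thousand distinct dates and the default of 5 rows per group, `head()` on the grouped receipts returns thousands of rows, which is why the output looks like the full table.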


In [4]:
# Sum the amounts that share the same sales_ymd
df2 = df2.sum()
df2.head()

Unnamed: 0_level_0,amount
sales_ymd,Unnamed: 1_level_1
20170101,33723
20170102,24165
20170103,27503
20170104,36165
20170105,37830


In [5]:
df2 = df2.reset_index()
df2.head()

Unnamed: 0,sales_ymd,amount
0,20170101,33723
1,20170102,24165
2,20170103,27503
3,20170104,36165
4,20170105,37830


In [6]:
# Shift by one row with shift()
# https://note.nkmk.me/python-pandas-shift/
df2_shift = df2.shift()
df2_shift.head()

Unnamed: 0,sales_ymd,amount
0,,
1,20170101.0,33723.0
2,20170102.0,24165.0
3,20170103.0,27503.0
4,20170104.0,36165.0


In [7]:
# The last row is dropped
# By default shift() moves the data down by one row. The row count stays the same, so the last row's data is discarded.
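
A minimal sketch of that behaviour on a three-row frame (the `toy` values are invented): the shifted result has the same length, the first row becomes NaN, and the last original value no longer appears.

```python
import pandas as pd

toy = pd.DataFrame({"amount": [10, 20, 30]})

# periods defaults to 1, so every value moves down one row
shifted = toy.shift()

# Same number of rows; the first row is NaN and the value 30 is gone.
# The column becomes float because NaN cannot be stored in an int column.
print(shifted)
```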

In [8]:
# Rename the columns of df2_shift
# https://note.nkmk.me/python-pandas-dataframe-rename/
new = {"sales_ymd": "sales_ymd_shift", "amount": "amount_shift"}
df2_shift = df2_shift.rename(columns=new)
df2_shift.head()

Unnamed: 0,sales_ymd_shift,amount_shift
0,,
1,20170101.0,33723.0
2,20170102.0,24165.0
3,20170103.0,27503.0
4,20170104.0,36165.0


In [9]:
df3 = pd.concat([df2, df2_shift], axis=1)
df3.head()

Unnamed: 0,sales_ymd,amount,sales_ymd_shift,amount_shift
0,20170101,33723,,
1,20170102,24165,20170101.0,33723.0
2,20170103,27503,20170102.0,24165.0
3,20170104,36165,20170103.0,27503.0
4,20170105,37830,20170104.0,36165.0


In [10]:
df3["amount_delta"] = df3["amount"] - df3["amount_shift"]
df3.head(10)

Unnamed: 0,sales_ymd,amount,sales_ymd_shift,amount_shift,amount_delta
0,20170101,33723,,,
1,20170102,24165,20170101.0,33723.0,-9558.0
2,20170103,27503,20170102.0,24165.0,3338.0
3,20170104,36165,20170103.0,27503.0,8662.0
4,20170105,37830,20170104.0,36165.0,1665.0
5,20170106,32387,20170105.0,37830.0,-5443.0
6,20170107,23415,20170106.0,32387.0,-8972.0
7,20170108,24737,20170107.0,23415.0,1322.0
8,20170109,26718,20170108.0,24737.0,1981.0
9,20170110,20143,20170109.0,26718.0,-6575.0


In [11]:
# Rewrite as more concise code
df2 = df_receipt[["sales_ymd","amount"]].groupby("sales_ymd").sum().reset_index()
df2_shift = df2.shift().rename(columns={"sales_ymd": "sales_ymd_shift", "amount": "amount_shift"})
df3 = pd.concat([df2, df2_shift], axis=1)
df3["amount_delta"] = df3["amount"] - df3["amount_shift"]
df3.head(10)
# Done! That was hard; this problem took me 30 minutes.

Unnamed: 0,sales_ymd,amount,sales_ymd_shift,amount_shift,amount_delta
0,20170101,33723,,,
1,20170102,24165,20170101.0,33723.0,-9558.0
2,20170103,27503,20170102.0,24165.0,3338.0
3,20170104,36165,20170103.0,27503.0,8662.0
4,20170105,37830,20170104.0,36165.0,1665.0
5,20170106,32387,20170105.0,37830.0,-5443.0
6,20170107,23415,20170106.0,32387.0,-8972.0
7,20170108,24737,20170107.0,23415.0,1322.0
8,20170109,26718,20170108.0,24737.0,1981.0
9,20170110,20143,20170109.0,26718.0,-6575.0
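
As a possible shortcut, `Series.diff()` computes the difference from the previous row directly, so the shifted copy and the `concat` are not strictly needed when only the delta is wanted. The sketch below uses the first three daily totals from the output above; the frame name `daily` is an assumption for illustration.

```python
import pandas as pd

# First three daily totals, copied from the output above
daily = pd.DataFrame({
    "sales_ymd": [20170101, 20170102, 20170103],
    "amount": [33723, 24165, 27503],
})

# Previous row's amount, kept only to show the equivalence
daily["amount_shift"] = daily["amount"].shift()

# diff() is equivalent to amount - amount.shift()
daily["amount_delta"] = daily["amount"].diff()

print(daily)
```

Both approaches work on rows, so "previous day" really means "previous row" after sorting by `sales_ymd`.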


---
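> P-042: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) by date (sales_ymd), and join to each date's row the data from 1 day, 2 days, and 3 days earlier. Displaying 10 rows is sufficient.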

In [12]:
df2 = df_receipt[["sales_ymd","amount"]].groupby("sales_ymd").sum().reset_index()
df2_shift1 = df2.shift().rename(columns={"sales_ymd": "sales_ymd_shift1", "amount": "amount_shift1"})
df3 = pd.concat([df2, df2_shift1], axis=1)
df2_shift2 = df2.shift(2).rename(columns={"sales_ymd": "sales_ymd_shift2", "amount": "amount_shift2"})
df4 = pd.concat([df3, df2_shift2], axis=1)
df2_shift3 = df2.shift(3).rename(columns={"sales_ymd": "sales_ymd_shift3", "amount": "amount_shift3"})
df5 = pd.concat([df4, df2_shift3], axis=1)
df5.head(10)

Unnamed: 0,sales_ymd,amount,sales_ymd_shift1,amount_shift1,sales_ymd_shift2,amount_shift2,sales_ymd_shift3,amount_shift3
0,20170101,33723,,,,,,
1,20170102,24165,20170101.0,33723.0,,,,
2,20170103,27503,20170102.0,24165.0,20170101.0,33723.0,,
3,20170104,36165,20170103.0,27503.0,20170102.0,24165.0,20170101.0,33723.0
4,20170105,37830,20170104.0,36165.0,20170103.0,27503.0,20170102.0,24165.0
5,20170106,32387,20170105.0,37830.0,20170104.0,36165.0,20170103.0,27503.0
6,20170107,23415,20170106.0,32387.0,20170105.0,37830.0,20170104.0,36165.0
7,20170108,24737,20170107.0,23415.0,20170106.0,32387.0,20170105.0,37830.0
8,20170109,26718,20170108.0,24737.0,20170107.0,23415.0,20170106.0,32387.0
9,20170110,20143,20170109.0,26718.0,20170108.0,24737.0,20170107.0,23415.0


In [13]:
# Official solution: the repetition is handled with a for loop
# Code example 2: wide-format case
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()
for i in range(1, 4):
    if i == 1:
        df_lag = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift(i)],axis=1)
    else:
        df_lag = pd.concat([df_lag, df_sales_amount_by_date.shift(i)],axis=1)
df_lag.columns = ['sales_ymd', 'amount', 'lag_ymd_1', 'lag_amount_1', 'lag_ymd_2', 'lag_amount_2', 'lag_ymd_3', 'lag_amount_3']
df_lag.dropna().sort_values(['sales_ymd']).head(10)

Unnamed: 0,sales_ymd,amount,lag_ymd_1,lag_amount_1,lag_ymd_2,lag_amount_2,lag_ymd_3,lag_amount_3
3,20170104,36165,20170103.0,27503.0,20170102.0,24165.0,20170101.0,33723.0
4,20170105,37830,20170104.0,36165.0,20170103.0,27503.0,20170102.0,24165.0
5,20170106,32387,20170105.0,37830.0,20170104.0,36165.0,20170103.0,27503.0
6,20170107,23415,20170106.0,32387.0,20170105.0,37830.0,20170104.0,36165.0
7,20170108,24737,20170107.0,23415.0,20170106.0,32387.0,20170105.0,37830.0
8,20170109,26718,20170108.0,24737.0,20170107.0,23415.0,20170106.0,32387.0
9,20170110,20143,20170109.0,26718.0,20170108.0,24737.0,20170107.0,23415.0
10,20170111,24287,20170110.0,20143.0,20170109.0,26718.0,20170108.0,24737.0
11,20170112,23526,20170111.0,24287.0,20170110.0,20143.0,20170109.0,26718.0
12,20170113,28004,20170112.0,23526.0,20170111.0,24287.0,20170110.0,20143.0
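
One way to avoid both the `if i == 1` branch and the hand-written column list is to name each lagged copy as it is created. This is a sketch rather than the official solution; the `lag{i}_` prefix and the `daily` frame are assumptions, and the values are the first five daily totals from the output above.

```python
import pandas as pd

# First five daily totals, copied from the output above
daily = pd.DataFrame({
    "sales_ymd": [20170101, 20170102, 20170103, 20170104, 20170105],
    "amount": [33723, 24165, 27503, 36165, 37830],
})

# Build the original frame plus three lagged copies,
# prefixing each copy's columns with its lag
frames = [daily] + [daily.shift(i).add_prefix(f"lag{i}_") for i in range(1, 4)]
df_lag = pd.concat(frames, axis=1)

print(df_lag.dropna())
```

Note that `shift(i)` moves by rows, not by calendar days, so the lag columns only mean "i days earlier" when every date is present in the aggregated frame.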


---
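> P-043: Join the receipt detail data frame (df_receipt) with the customer data frame (df_customer), and create a sales summary data frame (df_sales_summary) that totals the sales amount (amount) by gender (gender) and age bracket (computed from age). Gender code 0 means male, 1 means female, and 9 means unknown.
>
> The result must have four columns: age bracket, female sales amount, male sales amount, and unknown-gender sales amount (a cross tabulation with age bracket down the rows and gender across the columns). Age brackets must be 10-year classes.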

In [14]:
df_receipt.head()

Unnamed: 0,sales_ymd,sales_epoch,store_cd,receipt_no,receipt_sub_no,customer_id,product_cd,quantity,amount
0,20181103,1257206400,S14006,112,1,CS006214000001,P070305012,1,158
1,20181118,1258502400,S13008,1132,2,CS008415000097,P070701017,1,81
2,20170712,1215820800,S14028,1102,1,CS028414000014,P060101005,1,170
3,20190205,1265328000,S14042,1132,1,ZZ000000000000,P050301001,1,25
4,20180821,1250812800,S14025,1102,2,CS025415000050,P060102007,1,90


In [15]:
# First look at the structure of df_customer --> customer_id is shared, so use it as the join key
df_customer.head()

Unnamed: 0,customer_id,customer_name,gender_cd,gender,birth_day,age,postal_cd,address,application_store_cd,application_date,status_cd
0,CS021313000114,大野 あや子,1,女性,1981-04-29,37,259-1113,神奈川県伊勢原市粟窪**********,S14021,20150905,0-00000000-0
1,CS037613000071,六角 雅彦,9,不明,1952-04-01,66,136-0076,東京都江東区南砂**********,S13037,20150414,0-00000000-0
2,CS031415000172,宇多田 貴美子,1,女性,1976-10-04,42,151-0053,東京都渋谷区代々木**********,S13031,20150529,D-20100325-C
3,CS028811000001,堀井 かおり,1,女性,1933-03-27,86,245-0016,神奈川県横浜市泉区和泉町**********,S14028,20160115,0-00000000-0
4,CS001215000145,田崎 美紀,1,女性,1995-03-29,24,144-0055,東京都大田区仲六郷**********,S13001,20170605,6-20090929-2


In [16]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
df2 = pd.merge(df_receipt, df_customer, on="customer_id", how="inner")
df2.head()

Unnamed: 0,sales_ymd,sales_epoch,store_cd,receipt_no,receipt_sub_no,customer_id,product_cd,quantity,amount,customer_name,gender_cd,gender,birth_day,age,postal_cd,address,application_store_cd,application_date,status_cd
0,20181103,1257206400,S14006,112,1,CS006214000001,P070305012,1,158,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F
1,20170509,1210291200,S14006,112,1,CS006214000001,P071401004,1,1100,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F
2,20170608,1212883200,S14006,112,1,CS006214000001,P060104021,1,120,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F
3,20170608,1212883200,S14006,112,2,CS006214000001,P080403001,1,175,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F
4,20181028,1256688000,S14006,112,2,CS006214000001,P050102004,1,188,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F


In [17]:
# I could probably have worked out the age bracket myself, but I peeked at the official solution.
# So a lambda makes it this concise.
df2['age_brackets'] = df2['age'].apply(lambda x: math.floor(x / 10) * 10)
df2.head()

Unnamed: 0,sales_ymd,sales_epoch,store_cd,receipt_no,receipt_sub_no,customer_id,product_cd,quantity,amount,customer_name,gender_cd,gender,birth_day,age,postal_cd,address,application_store_cd,application_date,status_cd,age_brackets
0,20181103,1257206400,S14006,112,1,CS006214000001,P070305012,1,158,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F,20
1,20170509,1210291200,S14006,112,1,CS006214000001,P071401004,1,1100,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F,20
2,20170608,1212883200,S14006,112,1,CS006214000001,P060104021,1,120,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F,20
3,20170608,1212883200,S14006,112,2,CS006214000001,P080403001,1,175,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F,20
4,20181028,1256688000,S14006,112,2,CS006214000001,P050102004,1,188,志水 佳乃,1,女性,1996-12-08,22,224-0057,神奈川県横浜市都筑区川和町**********,S14006,20150201,E-20100908-F,20
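
For integer ages, floor division gives the same bracket without `apply` or `math`. The sketch below compares the two on a few made-up ages; the names `ages`, `via_apply`, and `via_floor_div` are illustrative only.

```python
import math
import pandas as pd

ages = pd.Series([22, 37, 66, 86, 9])

# Approach used above: apply a Python function to each element
via_apply = ages.apply(lambda x: math.floor(x / 10) * 10)

# Vectorized alternative: integer floor division on the whole Series
via_floor_div = ages // 10 * 10

print(via_floor_div.tolist())
```

The vectorized form is usually faster on large frames, since it avoids a Python-level function call per row.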


In [18]:
df_sales_summary = pd.pivot_table(df2, index='age_brackets', columns='gender_cd', values='amount', aggfunc='sum').reset_index()
df_sales_summary.head()

gender_cd,age_brackets,0,1,9
0,10,1591.0,149836.0,4317.0
1,20,72940.0,1363724.0,44328.0
2,30,177322.0,693047.0,50441.0
3,40,19355.0,9320791.0,483512.0
4,50,54320.0,6685192.0,342923.0


In [19]:
df_sales_summary = df_sales_summary.rename(columns={0:"male", 1:"female", 9:"unknown"})
df_sales_summary

gender_cd,age_brackets,male,female,unknown
0,10,1591.0,149836.0,4317.0
1,20,72940.0,1363724.0,44328.0
2,30,177322.0,693047.0,50441.0
3,40,19355.0,9320791.0,483512.0
4,50,54320.0,6685192.0,342923.0
5,60,272469.0,987741.0,71418.0
6,70,13435.0,29764.0,2427.0
7,80,46360.0,262923.0,5111.0
8,90,,6260.0,


In [20]:
# The column names can also be replaced like this (quoted from the official solution)
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
df_sales_summary

Unnamed: 0,era,male,female,unknown
0,10,1591.0,149836.0,4317.0
1,20,72940.0,1363724.0,44328.0
2,30,177322.0,693047.0,50441.0
3,40,19355.0,9320791.0,483512.0
4,50,54320.0,6685192.0,342923.0
5,60,272469.0,987741.0,71418.0
6,70,13435.0,29764.0,2427.0
7,80,46360.0,262923.0,5111.0
8,90,,6260.0,
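
The NaN cells in the 90s row come from bracket and gender combinations that have no receipts at all. A small sketch on invented data (the `toy` frame is hypothetical) shows the same effect, along with the `rename` step used above.

```python
import pandas as pd

# Invented rows: no male sales in the 30s, no unknown-gender sales in the 20s
toy = pd.DataFrame({
    "age_brackets": [20, 20, 20, 30, 30],
    "gender_cd": [0, 1, 1, 1, 9],
    "amount": [100, 200, 300, 400, 500],
})

summary = pd.pivot_table(
    toy, index="age_brackets", columns="gender_cd",
    values="amount", aggfunc="sum",
).reset_index()

# Map the numeric gender codes to readable column names
summary = summary.rename(columns={0: "male", 1: "female", 9: "unknown"})

print(summary)
```

If zeros are preferred over NaN, `pivot_table` accepts a `fill_value` argument.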


---
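> P-044: The sales summary data frame (df_sales_summary) created in the previous question holds sales by gender in wide format. Convert it to long format on gender, producing three columns: age bracket, gender code, and sales amount. The gender code must be '00' for male, '01' for female, and '99' for unknown.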

In [21]:
# Total surrender. I had no idea where to start, so I peeked at the official solution.
df_sales_summary2 = df_sales_summary.set_index('era'). \
        stack().reset_index().replace({'female':'01',
                                        'male':'00',
                                        'unknown':'99'}).rename(columns={'level_1':'gender_cd', 0: 'amount'})

In [22]:
df_sales_summary2

Unnamed: 0,era,gender_cd,amount
0,10,00,1591.0
1,10,01,149836.0
2,10,99,4317.0
3,20,00,72940.0
4,20,01,1363724.0
5,20,99,44328.0
6,30,00,177322.0
7,30,01,693047.0
8,30,99,50441.0
9,40,00,19355.0


In [23]:
# Start deciphering it step by step.
df_sales_summary2 = df_sales_summary.set_index('era')
df_sales_summary2.head()

Unnamed: 0_level_0,male,female,unknown
era,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10,1591.0,149836.0,4317.0
20,72940.0,1363724.0,44328.0
30,177322.0,693047.0,50441.0
40,19355.0,9320791.0,483512.0
50,54320.0,6685192.0,342923.0


In [24]:
# stack() pivots the column labels into rows (wide to long); unstack() does the reverse
# https://deepage.net/features/pandas-stack-unstack.html
df_sales_summary2 = df_sales_summary2.stack().reset_index()
df_sales_summary2.head()

Unnamed: 0,era,level_1,0
0,10,male,1591.0
1,10,female,149836.0
2,10,unknown,4317.0
3,20,male,72940.0
4,20,female,1363724.0


In [25]:
# replace() substitutes matching values anywhere in the DataFrame, not just in one column
df_sales_summary2 = df_sales_summary2.replace({'female':'01','male':'00','unknown':'99'}).\
                                        rename(columns={'level_1':'gender_cd', 0: 'amount'})
df_sales_summary2

Unnamed: 0,era,gender_cd,amount
0,10,00,1591.0
1,10,01,149836.0
2,10,99,4317.0
3,20,00,72940.0
4,20,01,1363724.0
5,20,99,44328.0
6,30,00,177322.0
7,30,01,693047.0
8,30,99,50441.0
9,40,00,19355.0
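
An alternative that some readers find easier to follow than `set_index().stack().reset_index()` is `DataFrame.melt`, which names the new columns in one call. This is a sketch, not the official solution; the `wide` frame holds the first two rows of the summary above, and mapping only the `gender_cd` column avoids touching values elsewhere in the frame.

```python
import pandas as pd

# First two rows of the wide summary, copied from the output above
wide = pd.DataFrame({
    "era": [10, 20],
    "male": [1591.0, 72940.0],
    "female": [149836.0, 1363724.0],
    "unknown": [4317.0, 44328.0],
})

# Wide to long: one row per (era, gender) pair
long = wide.melt(id_vars="era", var_name="gender_cd", value_name="amount")

# Convert the labels to the required string codes, in this column only
long["gender_cd"] = long["gender_cd"].map({"male": "00", "female": "01", "unknown": "99"})

long = long.sort_values(["era", "gender_cd"]).reset_index(drop=True)
print(long)
```

`melt` keeps rows whose amount is NaN, so add `dropna()` if those rows should be removed.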


---
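> P-045: The date of birth (birth_day) in the customer data frame (df_customer) is stored as a date type (Date). Convert it to a string in YYYYMMDD format and extract it together with the customer ID (customer_id). Extracting 10 rows is sufficient.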

In [26]:
df2 = df_customer[["customer_id", "birth_day"]]
df2.head()

Unnamed: 0,customer_id,birth_day
0,CS021313000114,1981-04-29
1,CS037613000071,1952-04-01
2,CS031415000172,1976-10-04
3,CS028811000001,1933-03-27
4,CS001215000145,1995-03-29


In [27]:
birth_day = pd.to_datetime(df_customer['birth_day']).dt.strftime('%Y%m%d')
birth_day

0        19810429
1        19520401
2        19761004
3        19330327
4        19950329
           ...   
21966    19591012
21967    19701019
21968    19721216
21969    19640605
21970    19960816
Name: birth_day, Length: 21971, dtype: object

In [28]:
df2["birth_day"] = birth_day
df2.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,customer_id,birth_day
0,CS021313000114,19810429
1,CS037613000071,19520401
2,CS031415000172,19761004
3,CS028811000001,19330327
4,CS001215000145,19950329
5,CS020401000016,19740915
6,CS015414000103,19770809
7,CS029403000008,19730817
8,CS015804000004,19310502
9,CS033513000180,19620711
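
The warning above appears because `df2` was created by selecting columns from `df_customer`, and pandas cannot tell whether the assignment was meant to affect the original. A minimal sketch of one common way to avoid it is to take an explicit copy before assigning. The `customers` frame below is a two-row stand-in built from values shown earlier.

```python
import pandas as pd

# Two-row stand-in for df_customer, using values shown earlier
customers = pd.DataFrame({
    "customer_id": ["CS021313000114", "CS037613000071"],
    "birth_day": ["1981-04-29", "1952-04-01"],
    "age": [37, 66],
})

# .copy() makes the selection an independent frame, so assigning to it is unambiguous
result = customers[["customer_id", "birth_day"]].copy()
result["birth_day"] = pd.to_datetime(result["birth_day"]).dt.strftime("%Y%m%d")

print(result)
```

The original `customers` frame keeps its `YYYY-MM-DD` strings, since only the copy is modified.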
