# Data loader

## Target
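- Credit card transaction records are provided for roughly 2,000 stores in Korea.
- The data was sampled by card_id, and amount is not denominated in KRW.
- For the test file, predict **the total sales of each store from the day after its last recorded sale through 100 days later**.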
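
To make the target concrete, here is a minimal sketch that computes the prediction window for each store on a tiny made-up frame (not the real test.csv). The helper name `forecast_window` is hypothetical, and the sketch assumes the window covers the 100 calendar days from `last_sale + 1` through `last_sale + 100`; the task statement could also be read differently.

```python
import pandas as pd

def forecast_window(df: pd.DataFrame) -> pd.DataFrame:
    """Return the per-store start and end dates of the prediction window (hypothetical helper)."""
    last_sale = pd.to_datetime(df["date"]).groupby(df["store_id"]).max()
    return pd.DataFrame({
        "window_start": last_sale + pd.Timedelta(days=1),
        # Assumption: the window ends 100 days after the last sale date.
        "window_end": last_sale + pd.Timedelta(days=100),
    })

# Toy data standing in for test.csv (values are made up).
toy = pd.DataFrame({
    "store_id": [0, 0, 1],
    "date": ["2018-03-01", "2018-03-20", "2018-02-10"],
})
windows = forecast_window(toy)
print(windows)
```

Because the last sale date differs per store, each store ends up with its own window, so the target cannot be expressed as a single fixed date range.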

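## Data information
### train.csv - card sales records, 2016-08-01 ~ 2018-07-31
### test.csv - same format as train; a store with the same store_id as in train is not the same store.
- store_id ; unique store number within each file (some stores opened or closed during the period)
- date ; transaction date
- time ; transaction time
- card_id ; hash of the card number
- amount ; sales amount; a negative value is a cancelled transaction
- installments ; number of installment months; empty string for a lump-sum payment
- days_of_week ; day of the week, Monday is 0 and Sunday is 6
- holyday ; 1 for a public holiday, 0 otherwise (the column is spelled `holyday` in the data)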
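
The days_of_week encoding (Monday = 0, Sunday = 6) is the same convention pandas uses for `Series.dt.dayofweek`, so the column can be cross-checked against the date. A small sketch, using two (date, days_of_week) pairs taken from the first rows of train.csv as displayed in this notebook:

```python
import pandas as pd

# Two (date, days_of_week) pairs from the head of train.csv.
toy = pd.DataFrame({
    "date": ["2016-12-14", "2016-12-19"],
    "days_of_week": [2, 0],
})

# Series.dt.dayofweek uses Monday=0 ... Sunday=6, like days_of_week.
derived = pd.to_datetime(toy["date"]).dt.dayofweek
matches = derived.eq(toy["days_of_week"]).all()
print(matches)
```

If the check holds on the full data, days_of_week carries no information beyond the date itself and could be re-derived whenever new rows are generated.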

## submission sample
- store_id ; unique store number within each file (same as in test.csv)
- total_sales ; total sales over the prediction period

In [2]:
import pandas as pd
import numpy as np

Raw_train = pd.read_csv("train.csv")
Raw_test = pd.read_csv("test.csv")
submission = pd.read_csv("submission.csv")

In [3]:
pd.set_option('display.max_columns', 100)

In [5]:
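# Note: without parentheses this shows the bound method's repr (the frame itself);
# call Raw_train.info() to get the column/dtype summary instead.
Raw_train.info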

<bound method DataFrame.info of          store_id        date      time     card_id  amount  installments  \
0               0  2016-12-14  18:05:31  d297bba73f       5           NaN   
1               0  2016-12-14  18:05:54  d297bba73f      -5           NaN   
2               0  2016-12-19  12:42:31  0880849c05     144           NaN   
3               0  2016-12-19  12:48:08  8b4f9e0e95      66           NaN   
4               0  2016-12-19  13:31:08  7ad237eed0      24           NaN   
...           ...         ...       ...         ...     ...           ...   
3362791      1799  2018-07-19  17:58:31  e254bf70d9     600           NaN   
3362792      1799  2018-07-19  18:54:34  8f41c89891     275           NaN   
3362793      1799  2018-07-22  14:46:57  aeb64fe1fb     350           NaN   
3362794      1799  2018-07-25  18:09:13  57932602d6     300           NaN   
3362795      1799  2018-07-30  10:58:53  42d354807a     325           NaN   

         days_of_week  holyday  
0         

In [6]:
Raw_train.head()

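| | store_id | date | time | card_id | amount | installments | days_of_week | holyday |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-12-14 | 18:05:31 | d297bba73f | 5 | NaN | 2 | 0 |
| 1 | 0 | 2016-12-14 | 18:05:54 | d297bba73f | -5 | NaN | 2 | 0 |
| 2 | 0 | 2016-12-19 | 12:42:31 | 0880849c05 | 144 | NaN | 0 | 0 |
| 3 | 0 | 2016-12-19 | 12:48:08 | 8b4f9e0e95 | 66 | NaN | 0 | 0 |
| 4 | 0 | 2016-12-19 | 13:31:08 | 7ad237eed0 | 24 | NaN | 0 | 0 |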


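### Data preprocessing

#### Questions
- How should negative amount values be handled?
> Option 1) Keep them as they are; since they are negative, we expect them to be excluded from total sales naturally during training. <br>
> Option 2) Compare card_id values and delete the positive transaction that has the same card_id as each negative-amount transaction.
- How should installments (number of installment months) be handled?
> Option 1) Split the transaction amount across the number of installment months, which adds rows to the data.
- Is the date feature necessary?
> Customers probably choose to visit a restaurant based on the day of the week (days_of_week) and public holidays (holyday) rather than on the calendar date itself.
- How should store_id be handled?
> The store ids in the test data and the train data do not refer to the same stores.<br>
> **Train a separate model per store_id and produce the results that way.**
- Should we sum each day's sales and use only the date feature, without the time feature?
> Probably not a good idea, since the model should also learn at what time of day sales occur.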
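
Option 2 for negative amounts is not implemented in this notebook. The following is only a rough sketch of one way it could work, on made-up rows. The helper name `drop_cancelled` is hypothetical, and the matching rule is an assumption: a refund cancels the most recent earlier purchase with the same store_id, card_id, and absolute amount, and refunds with no matching purchase are left untouched.

```python
import pandas as pd

def drop_cancelled(df: pd.DataFrame) -> pd.DataFrame:
    """Remove each refund row together with the purchase it cancels (hypothetical helper).

    Assumes rows are already in chronological order.
    """
    to_drop = set()
    open_purchases = {}  # (store_id, card_id, amount) -> stack of row labels
    for label, row in df.iterrows():
        if row["amount"] >= 0:
            key = (row["store_id"], row["card_id"], row["amount"])
            open_purchases.setdefault(key, []).append(label)
        else:
            key = (row["store_id"], row["card_id"], -row["amount"])
            stack = open_purchases.get(key)
            if stack:
                # Pair the refund with the latest unmatched purchase.
                to_drop.update({stack.pop(), label})
    return df.drop(index=list(to_drop))

# Toy rows (made up for illustration): one cancelled purchase, one normal
# purchase, and one refund without a matching purchase.
toy = pd.DataFrame({
    "store_id": [0, 0, 0, 0],
    "card_id": ["d297bba73f", "d297bba73f", "0880849c05", "8b4f9e0e95"],
    "amount": [5, -5, 144, -66],
})
cleaned = drop_cancelled(toy)
print(cleaned)
```

Matching on card_id alone, as the note above literally says, would also delete unrelated purchases by the same card, which is why this sketch additionally requires the amounts to cancel out.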

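#### Preprocessing plan
1. Split the transaction amount across the number of installment months, adding rows to the data.
> The time, card_id, days_of_week, and holyday features are carried over unchanged, and <br> installments is set to NaN.
2. Apply one-hot encoding to days_of_week.
3. Combine the date and time features into a datetime value and use it as the index to build time-series data.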
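
Step 1 can also be written without a Python-level loop. This is a minimal sketch on a made-up two-row frame, under two assumptions: lump-sum payments are stored as 0.0 (NaN already filled), and one month is approximated as 4 weeks (28 days).

```python
import pandas as pd

toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-01-10 12:00:00", "2018-01-11 09:30:00"]),
    "amount": [300.0, 50.0],
    "installments": [3.0, 0.0],  # 0.0 stands for a lump-sum payment
})

# Number of payment rows per transaction: 1 for lump-sum, else the installment count.
n_parts = toy["installments"].clip(lower=1).astype(int)

# Repeat each row n_parts times and number the copies 0, 1, 2, ...
expanded = toy.loc[toy.index.repeat(n_parts.to_numpy())].copy()
part = expanded.groupby(level=0).cumcount()

# Split the amount evenly across the copies.
expanded["amount"] = expanded["amount"] / n_parts.loc[expanded.index].to_numpy()
# Assumption: shift the k-th copy by k * 28 days (one "month" = 4 weeks).
expanded["datetime"] = expanded["datetime"] + pd.to_timedelta(part.to_numpy() * 28, unit="D")
expanded["installments"] = 0.0
expanded = expanded.reset_index(drop=True)
print(expanded)
```

The total amount is preserved by construction. Note that days_of_week stays valid for the shifted rows because the shift is a whole number of weeks, while holyday would need to be recomputed for the new dates.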

In [7]:
Raw_train['datetime'] = np.nan  # placeholder column (the np.NaN alias was removed in NumPy 2.0)

In [10]:
# Vectorized string concatenation: same result as looping over every row, but far faster
tmp = Raw_train['date'] + ", " + Raw_train['time']

In [12]:
Raw_train['datetime'] = tmp

Raw_train['datetime'] = pd.to_datetime(Raw_train['datetime'], format="%Y-%m-%d, %H:%M:%S")

Raw_train['datetime']

0         2016-12-14 18:05:31
1         2016-12-14 18:05:54
2         2016-12-19 12:42:31
3         2016-12-19 12:48:08
4         2016-12-19 13:31:08
                  ...        
3362791   2018-07-19 17:58:31
3362792   2018-07-19 18:54:34
3362793   2018-07-22 14:46:57
3362794   2018-07-25 18:09:13
3362795   2018-07-30 10:58:53
Name: datetime, Length: 3362796, dtype: datetime64[ns]

In [13]:
Raw_train = Raw_train.drop(['date', 'time', 'card_id'], axis = 1)

Raw_train = Raw_train.fillna(0)

Raw_train

| | store_id | amount | installments | days_of_week | holyday | datetime |
|---|---|---|---|---|---|---|
| 0 | 0 | 5 | 0.0 | 2 | 0 | 2016-12-14 18:05:31 |
| 1 | 0 | -5 | 0.0 | 2 | 0 | 2016-12-14 18:05:54 |
| 2 | 0 | 144 | 0.0 | 0 | 0 | 2016-12-19 12:42:31 |
| 3 | 0 | 66 | 0.0 | 0 | 0 | 2016-12-19 12:48:08 |
| 4 | 0 | 24 | 0.0 | 0 | 0 | 2016-12-19 13:31:08 |
| ... | ... | ... | ... | ... | ... | ... |
| 3362791 | 1799 | 600 | 0.0 | 3 | 0 | 2018-07-19 17:58:31 |
| 3362792 | 1799 | 275 | 0.0 | 3 | 0 | 2018-07-19 18:54:34 |
| 3362793 | 1799 | 350 | 0.0 | 6 | 0 | 2018-07-22 14:46:57 |
| 3362794 | 1799 | 300 | 0.0 | 2 | 0 | 2018-07-25 18:09:13 |


In [24]:
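Raw_copy = Raw_train.copy()
Raw_copy['amount'] = Raw_copy['amount'].astype(float)  # divided amounts are not integers

from datetime import timedelta

new_rows = []  # later installment payments, appended in one go after the loop

for i in Raw_copy.index[Raw_copy['installments'] != 0.0] :
    n = int(Raw_copy.at[i, 'installments'])
    # Split the amount evenly and mark the row as already processed
    Raw_copy.at[i, 'amount'] /= n
    Raw_copy.at[i, 'installments'] = 0.0

    for j in range(n - 1) :
        tmp = Raw_copy.loc[i].copy()
        tmp['datetime'] += timedelta(weeks = 4*(j+1))  # one month approximated as 4 weeks
        new_rows.append(tmp)

# DataFrame.append was removed in pandas 2.0, so concatenate the new rows once
if new_rows :
    Raw_copy = pd.concat([Raw_copy, pd.DataFrame(new_rows)], ignore_index=True)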

In [25]:
Raw_train = Raw_train.set_index('datetime')

Raw_train

| datetime | store_id | amount | installments | days_of_week | holyday |
|---|---|---|---|---|---|
| 2016-12-14 18:05:31 | 0 | 5 | 0.0 | 2 | 0 |
| 2016-12-14 18:05:54 | 0 | -5 | 0.0 | 2 | 0 |
| 2016-12-19 12:42:31 | 0 | 144 | 0.0 | 0 | 0 |
| 2016-12-19 12:48:08 | 0 | 66 | 0.0 | 0 | 0 |
| 2016-12-19 13:31:08 | 0 | 24 | 0.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 2018-07-19 17:58:31 | 1799 | 600 | 0.0 | 3 | 0 |
| 2018-07-19 18:54:34 | 1799 | 275 | 0.0 | 3 | 0 |
| 2018-07-22 14:46:57 | 1799 | 350 | 0.0 | 6 | 0 |
| 2018-07-25 18:09:13 | 1799 | 300 | 0.0 | 2 | 0 |


In [26]:
Raw_train = pd.get_dummies(Raw_train, columns=['days_of_week'])

Raw_train

| datetime | store_id | amount | installments | holyday | days_of_week_0 | days_of_week_1 | days_of_week_2 | days_of_week_3 | days_of_week_4 | days_of_week_5 | days_of_week_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-12-14 18:05:31 | 0 | 5 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2016-12-14 18:05:54 | 0 | -5 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2016-12-19 12:42:31 | 0 | 144 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2016-12-19 12:48:08 | 0 | 66 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2016-12-19 13:31:08 | 0 | 24 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2018-07-19 17:58:31 | 1799 | 600 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-07-19 18:54:34 | 1799 | 275 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-07-22 14:46:57 | 1799 | 350 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2018-07-25 18:09:13 | 1799 | 300 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


In [28]:
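def splitById(store_id) :
    # Keep only the rows of the given store (parameter renamed so it no longer shadows the built-in id)
    is_store_id = Raw_train['store_id'] == store_id
    splited = Raw_train[is_store_id]
    return splited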

splited_train = splitById(0)
splited_train

| datetime | store_id | amount | installments | holyday | days_of_week_0 | days_of_week_1 | days_of_week_2 | days_of_week_3 | days_of_week_4 | days_of_week_5 | days_of_week_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-12-14 18:05:31 | 0 | 5 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2016-12-14 18:05:54 | 0 | -5 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2016-12-19 12:42:31 | 0 | 144 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2016-12-19 12:48:08 | 0 | 66 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2016-12-19 13:31:08 | 0 | 24 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2018-07-31 23:12:44 | 0 | 74 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2018-07-31 23:16:50 | 0 | 97 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2018-07-31 23:40:24 | 0 | 49 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2018-07-31 23:55:10 | 0 | 29 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
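
Since the plan is to train one model per store, it may be convenient to build all per-store frames in a single pass rather than filtering with a boolean mask once per store. A minimal sketch with `DataFrame.groupby`, on a made-up three-row frame; the `per_store` dictionary is a hypothetical name, not something used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {"store_id": [0, 0, 1], "amount": [5, 144, 600]},
    index=pd.to_datetime([
        "2016-12-14 18:05:31",
        "2016-12-19 12:42:31",
        "2018-07-19 17:58:31",
    ]),
)

# Iterating over a groupby yields (key, sub-frame) pairs, one per store_id.
per_store = {store_id: frame for store_id, frame in toy.groupby("store_id")}
print(per_store[0])
```

Each value keeps the datetime index of the original frame, so it can be fed to the same per-store modelling code as the output of splitById.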
