Jupyter threw an error the moment I launched it. I had just reinstalled pip, and jupyter was not installed under the new pip.  
Don't panic: `pip install jupyter` fixes it.
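
Before reinstalling, it can help to confirm which interpreter is running and whether it can see the package at all. A minimal sketch, assuming a hypothetical helper `is_installed` and assuming `jupyter_core` as the module name to probe (neither is from the original notebook):

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the current interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter whose site-packages is being searched.
print(sys.executable)
# "jupyter_core" is an assumed module name used only for illustration.
print(is_installed("jupyter_core"))
```

If this prints `False`, running `python -m pip install jupyter` with that same interpreter installs the package where it will be found.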

In [1]:
# tqdm for progress bars
from tqdm import tqdm

# pandas for DataFrame analysis
import pandas as pd

# train_test_split to hold out a validation set
from sklearn.model_selection import train_test_split

# sklearn.metrics for analysis and evaluation
from sklearn.metrics import confusion_matrix, accuracy_score

# matplotlib and seaborn for visualization
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import seaborn

## 1. Read Datasets

In [6]:
# Read the dataset
train_df = pd.read_csv()

In [2]:
## The full file is too large to work with comfortably, so I split it into 10 parts and use only train_df_sample
train_df_sample = pd.read_csv("./train_df_sample.csv")

In [3]:
# With only 10% of the data, it is at least somewhat manageable.
len(train_df_sample)

2860539
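
The notebook does not show how `train_df_sample.csv` was produced. One way to build a roughly 10% sample without loading the whole file is to read it in chunks and sample each chunk. This is a minimal sketch, not the method actually used here; the function name `sample_csv_in_chunks`, the chunk size, and the seed are assumptions.

```python
import pandas as pd


def sample_csv_in_chunks(src_path, dst_path, frac=0.1, chunksize=100_000, seed=42):
    """Write a random `frac` of the rows of src_path to dst_path, chunk by chunk.

    Only one chunk is held in memory at a time. Returns the number of rows written.
    """
    n_written = 0
    for i, chunk in enumerate(pd.read_csv(src_path, chunksize=chunksize)):
        # A different seed per chunk keeps the run reproducible without
        # picking the same row positions in every chunk.
        part = chunk.sample(frac=frac, random_state=seed + i)
        first = i == 0
        part.to_csv(dst_path, mode="w" if first else "a", header=first, index=False)
        n_written += len(part)
    return n_written
```

Writing with `index=False` avoids the extra `Unnamed: 0` column that shows up when a saved index is read back with `pd.read_csv`.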

In [4]:
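# The safest general-purpose choice among ML methods usable through scikit-learn => boosting. Among boosting methods, let's try GBM and LightGBM (LightGBM is a separate package with a scikit-learn-compatible API).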
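
To make the plan concrete, here is a minimal boosting baseline with a validation split, using the same `train_test_split`, `accuracy_score`, and `confusion_matrix` imported earlier. It is only a sketch under stated assumptions: the data is synthetic and numeric, and scikit-learn's `GradientBoostingClassifier` stands in for GBM. The real features would first need their object columns encoded.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary target driven by the first two features (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Hold out 20% for validation, keeping the class ratio the same in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

pred = model.predict(X_val)
acc = accuracy_score(y_val, pred)
cm = confusion_matrix(y_val, pred)  # rows: true class, columns: predicted class
print(acc)
print(cm)
```

LightGBM's `LGBMClassifier` follows the same `fit` / `predict` pattern, so swapping it in later should only change the model line.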

## 2. PreProcessing

- EDA
  - null check (isna, fillna)
  - data type (info)
  - outlier check(describe, boxplot)
  - feature selection (it would be nice to get this far, but for now I'm going vanilla)
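
The first three checks in the list can be sketched in one small helper. This is an illustration on a toy frame, not the notebook's actual procedure; the name `eda_summary` and the 1.5 * IQR rule (the same rule a boxplot uses for its whiskers) are assumptions.

```python
import numpy as np
import pandas as pd


def eda_summary(df):
    """Per-column dtype, null ratio, and share of rows outside the 1.5*IQR fences.

    The outlier ratio is computed for numeric columns only (NaN elsewhere),
    and missing values never count as outliers.
    """
    summary = pd.DataFrame(
        {"dtype": df.dtypes.astype(str), "null_ratio": df.isna().mean()}
    )
    num = df.select_dtypes(include="number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # Compare each column against its own lower and upper fence.
    outside = num.lt(q1 - 1.5 * iqr, axis=1) | num.gt(q3 + 1.5 * iqr, axis=1)
    summary["outlier_ratio"] = outside.mean()
    return summary


# Toy data shaped like the real columns: one categorical, one numeric.
demo = pd.DataFrame(
    {"F01": ["A", None, "B", "A", "C"], "F04": [1.0, 2.0, np.nan, 3.0, 100.0]}
)
summary = eda_summary(demo)
print(summary)
```

On the real data the same call would be `eda_summary(train_df_sample)`, and `fillna` would only come in once the null ratios are known.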



```python

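"""
Dataset Info.

train.csv [file]
Web ad click logs covering 7 days, in chronological order
ID: unique ID of each train data sample
Click: whether the ad was clicked (the prediction target)
0: not clicked, 1: clicked
F01 ~ F39: features associated with each click log
Details are de-identified to protect personal information

"""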
```

In [9]:
train_df_sample.head()

Unnamed: 0,ID,Click,F01,F02,F03,F04,F05,F06,F07,F08,...,F30,F31,F32,F33,F34,F35,F36,F37,F38,F39
27047686,TRAIN_27047686,0,CYOAMVC,DJDKEYH,,,XCAJWBW,13,VNXTVLH,VAWXMCR,...,TFZIBRI,GTISJWW,9787.0,0.0,NPEGYAH,IRUDRFB,15.0,JSOMQYE,0.0,KOFRDGL
7223697,TRAIN_07223697,0,JCDXFYU,PILDDJU,IAGJDOH,1.0,LFPUEOV,68,MHBXSWB,LPYPUNA,...,REPSWLB,KHZNEZF,15063.0,0.0,QMOULXS,IRUDRFB,,XBKBHCW,0.0,LNUQAZZ
21459435,TRAIN_21459435,0,VNOHLIR,PKLDGLX,IAGJDOH,5.0,EVTUBMN,6,JNIVDXP,LPYPUNA,...,SLXYBBG,GTISJWW,601.0,0.0,SHMKPOR,IRUDRFB,6.0,FXWZZCX,0.0,XHYNPHU
6264302,TRAIN_06264302,0,UCFAVXY,DJDKEYH,,3.0,WNMKDBA,1,YCWUGFD,VAWXMCR,...,MFPUCBU,ENBEWZP,4210.0,0.0,DVESEGJ,IRUDRFB,3.0,VDQUXYS,0.0,JDVFUQP
22182054,TRAIN_22182054,0,JCDXFYU,PILDDJU,IAGJDOH,11.0,LFPUEOV,81,PIFJCMX,FTPHMPQ,...,NZGEZLW,KHZNEZF,1.0,39.0,QMOULXS,IRUDRFB,0.0,AQGCGGG,1.0,QHKAWMA


In [6]:
train_df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2860539 entries, 0 to 2860538
Data columns (total 42 columns):
 #   Column      Dtype  
---  ------      -----  
 0   Unnamed: 0  int64  
 1   ID          object 
 2   Click       int64  
 3   F01         object 
 4   F02         object 
 5   F03         object 
 6   F04         float64
 7   F05         object 
 8   F06         int64  
 9   F07         object 
 10  F08         object 
 11  F09         object 
 12  F10         object 
 13  F11         float64
 14  F12         object 
 15  F13         object 
 16  F14         int64  
 17  F15         object 
 18  F16         object 
 19  F17         object 
 20  F18         float64
 21  F19         float64
 22  F20         object 
 23  F21         object 
 24  F22         object 
 25  F23         object 
 26  F24         float64
 27  F25         object 
 28  F26         object 
 29  F27         float64
 30  F28         object 
 31  F29         float64
 32  F30         object 
 33  F31    

In [8]:
train_df_sample.columns

Index(['Unnamed: 0', 'ID', 'Click', 'F01', 'F02', 'F03', 'F04', 'F05', 'F06',
       'F07', 'F08', 'F09', 'F10', 'F11', 'F12', 'F13', 'F14', 'F15', 'F16',
       'F17', 'F18', 'F19', 'F20', 'F21', 'F22', 'F23', 'F24', 'F25', 'F26',
       'F27', 'F28', 'F29', 'F30', 'F31', 'F32', 'F33', 'F34', 'F35', 'F36',
       'F37', 'F38', 'F39'],
      dtype='object')

In [21]:
x_col = train_df_sample.columns[3:]
print(x_col)

Index(['F01', 'F02', 'F03', 'F04', 'F05', 'F06', 'F07', 'F08', 'F09', 'F10',
       'F11', 'F12', 'F13', 'F14', 'F15', 'F16', 'F17', 'F18', 'F19', 'F20',
       'F21', 'F22', 'F23', 'F24', 'F25', 'F26', 'F27', 'F28', 'F29', 'F30',
       'F31', 'F32', 'F33', 'F34', 'F35', 'F36', 'F37', 'F38', 'F39'],
      dtype='object')


In [22]:
train_y = train_df_sample['Click']
train_x = train_df_sample[x_col]

print(train_x.shape)

(2860539, 39)


In [31]:
# Drop unnecessary columns.
if 'ID' in train_df_sample.columns:
    train_df_sample.drop(['Unnamed: 0', 'ID'], axis=1, inplace=True)
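
The guard above checks only for `ID` but drops both columns, so it relies on the two always being present together. A minimal sketch of a version that is safe to re-run, assuming a hypothetical helper `drop_if_present` and a toy frame:

```python
import pandas as pd


def drop_if_present(df, cols):
    """Return a copy of df without the given columns; absent names are ignored."""
    return df.drop(columns=list(cols), errors="ignore")


demo = pd.DataFrame(
    {"Unnamed: 0": [10, 11], "ID": ["TRAIN_10", "TRAIN_11"], "Click": [0, 1]}
)
once = drop_if_present(demo, ["Unnamed: 0", "ID"])
twice = drop_if_present(once, ["Unnamed: 0", "ID"])  # second call changes nothing
print(list(twice.columns))
```

Because this returns a new frame instead of using `inplace=True`, the original stays intact. Note that positional selections such as `columns[3:]` point at different columns once these two are gone.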

In [30]:
train_df_sample.columns

Index(['Click', 'F01', 'F02', 'F03', 'F04', 'F05', 'F06', 'F07', 'F08', 'F09',
       'F10', 'F11', 'F12', 'F13', 'F14', 'F15', 'F16', 'F17', 'F18', 'F19',
       'F20', 'F21', 'F22', 'F23', 'F24', 'F25', 'F26', 'F27', 'F28', 'F29',
       'F30', 'F31', 'F32', 'F33', 'F34', 'F35', 'F36', 'F37', 'F38', 'F39'],
      dtype='object')

In [32]:
train_object_col = train_df_sample.select_dtypes(['object'])
train_numeric_col = train_df_sample.select_dtypes(exclude=['object'])

print(train_object_col.columns)
print('----'*10) 
print(train_numeric_col.columns)

Index(['F01', 'F02', 'F03', 'F05', 'F07', 'F08', 'F09', 'F10', 'F12', 'F13',
       'F15', 'F16', 'F17', 'F20', 'F21', 'F22', 'F23', 'F25', 'F26', 'F28',
       'F30', 'F31', 'F34', 'F35', 'F37', 'F39'],
      dtype='object')
----------------------------------------
Index(['Click', 'F04', 'F06', 'F11', 'F14', 'F18', 'F19', 'F24', 'F27', 'F29',
       'F32', 'F33', 'F36', 'F38'],
      dtype='object')


In [33]:
train_numeric_col.describe()

statistic,Click,F04,F06,F11,F14,F18,F19,F24,F27,F29,F32,F33,F36,F38
count,2860539.0,2286398.0,2860539.0,2564728.0,2860539.0,2127584.0,2601387.0,1962540.0,1754219.0,1754219.0,2835559.0,2601387.0,2127584.0,2780468.0
mean,0.1943284,31.21925,117.5535,406.3136,9.613142,6.838256,0.3433157,113.9983,25.38591,4.529029,19434.06,2.021605,8.156234,0.1854519
std,0.3956829,453.4102,403.9554,663.4069,14.69758,8.967519,0.6112556,588.2291,89.71299,7.833788,72021.22,32.76068,18.73038,1.999053
min,0.0,1.0,-1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,0.0,3.0,1.0,57.0,1.0,2.0,0.0,7.0,2.0,1.0,354.0,0.0,2.0,0.0
50%,0.0,7.0,5.0,183.0,5.0,4.0,0.0,27.0,6.0,2.0,2711.0,0.0,4.0,0.0
75%,0.0,21.0,50.0,485.0,14.0,9.0,1.0,82.0,18.0,4.0,10221.0,1.0,10.0,0.0
max,1.0,65535.0,26871.0,8000.0,7382.0,927.0,10.0,322022.0,7546.0,220.0,2626120.0,51324.0,12112.0,626.0
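
Two things can be read straight off this table: how much of each column is missing (from `count`) and how far `max` sits above the boxplot whisker (from the quartiles). The sketch below works through that arithmetic for three columns, with values copied from the output above. The helper names and the 1.5 * IQR rule are assumptions.

```python
# Total rows, from len(train_df_sample).
N_ROWS = 2860539

# count, 25%, 75%, and max copied from the describe() output above.
stats = {
    "F04": {"count": 2286398, "q1": 3.0, "q3": 21.0, "max": 65535.0},
    "F24": {"count": 1962540, "q1": 7.0, "q3": 82.0, "max": 322022.0},
    "F32": {"count": 2835559, "q1": 354.0, "q3": 10221.0, "max": 2626120.0},
}


def missing_ratio(count, total):
    """Share of rows with no value, given the non-null count."""
    return 1 - count / total


def upper_fence(q1, q3, k=1.5):
    """Upper boxplot whisker limit: Q3 + k * IQR."""
    return q3 + k * (q3 - q1)


report = {}
for name, s in stats.items():
    fence = upper_fence(s["q1"], s["q3"])
    report[name] = {
        "missing_ratio": missing_ratio(s["count"], N_ROWS),
        "upper_fence": fence,
        "max_above_fence": s["max"] > fence,
    }

for name, row in report.items():
    print(name, round(row["missing_ratio"], 4), row["upper_fence"], row["max_above_fence"])
```

For all three columns the maximum lies far above the fence, which matches the means sitting well above the medians and suggests heavily right-skewed distributions.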
