## E-commerce Shipping Data

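### Building a model to predict whether a product was delivered on time
Use the training data (X_train, y_train) to build a delivery prediction model, apply it to the evaluation data (X_test), and write the resulting predictions to a CSV file in the format shown below. The submitted model is scored on the ROC-AUC metric.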

![](./extrafiles/exam02.png)

[Building the exam dataset] 
The code builds X_train, y_train, and X_test in the same form as the sample exam problem.

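(Notes)

Building a high-performing prediction model requires appropriate data preprocessing, feature engineering,   
a suitable classification algorithm, hyperparameter tuning, model ensembling, and so on.   
Submit code that produces a CSV file named after your examinee number (수험번호.csv).   
The submitted model's performance is evaluated by ROC-AUC.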
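
Because ROC-AUC is computed from ranking scores, the submission should normally contain the predicted probability of the positive class rather than hard 0/1 labels. The following is a minimal sketch on synthetic data. The column names `ID` and `pred` and the file name `submission_demo.csv` are assumptions for illustration, since the required format is only shown in the image above.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exam data (assumption: the real data comes from the exam files)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of class 1; ROC-AUC needs scores, not hard labels
proba = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print("hold-out ROC-AUC:", auc)

# Hypothetical column names and file name; adapt them to the required format
submission = pd.DataFrame({"ID": np.arange(len(proba)), "pred": proba})
out_path = os.path.join(tempfile.gettempdir(), "submission_demo.csv")
submission.to_csv(out_path, index=False)
```

In the real exam there are no labels for X_test, so the ROC-AUC check would be done on a validation split carved out of the training data.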

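## Classification Models
Probabilistic models
- Probabilistic generative models: LDA, QDA, naive Bayes
- Probabilistic discriminative models: logistic regression, decision trees

Discriminant function models
- Artificial neural networks
- **Kernel SVM**
- Perceptron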
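
One practical way to see this split in scikit-learn is to check which estimators expose `predict_proba` (a class probability) and which expose only `decision_function` (a signed score from a discriminant function). The sketch below uses synthetic data; the grouping labels in the dictionary keys follow the list above and are not part of the scikit-learn API.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Small synthetic binary classification problem
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

models = {
    "LDA (generative)": LinearDiscriminantAnalysis(),
    "GaussianNB (generative)": GaussianNB(),
    "LogisticRegression (discriminative)": LogisticRegression(max_iter=1000),
    "Kernel SVM (discriminant function)": SVC(kernel="rbf"),
    "Perceptron (discriminant function)": Perceptron(random_state=0),
}

has_proba = {}
for name, est in models.items():
    est.fit(X_demo, y_demo)
    # Probabilistic models expose predict_proba; pure discriminant-function models do not
    has_proba[name] = hasattr(est, "predict_proba")
    print(name, "| predict_proba:", has_proba[name],
          "| decision_function:", hasattr(est, "decision_function"))
```

For ROC-AUC either kind of output works, because the metric only needs a score that ranks the samples.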


In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 20)
pd.set_option("display.max_rows", 20)
pd.set_option("display.width", 2000)

# Load the data
df = pd.read_csv('./extrafiles/Train.csv', engine='python')
df

Unnamed: 0,ID,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms,Reached.on.Time_Y.N
0,1,D,Flight,4,2,177,3,low,F,44,1233,1
1,2,F,Flight,4,5,216,2,low,M,59,3088,1
2,3,A,Flight,2,2,183,4,low,M,48,3374,1
3,4,B,Flight,3,3,176,4,medium,M,10,1177,1
4,5,C,Flight,2,2,184,3,medium,F,46,2484,1
...,...,...,...,...,...,...,...,...,...,...,...,...
10994,10995,A,Ship,4,1,252,5,medium,F,1,1538,1
10995,10996,B,Ship,4,1,232,5,medium,F,6,1247,0
10996,10997,C,Ship,5,4,242,5,low,F,4,1155,0
10997,10998,F,Ship,5,2,223,6,medium,M,2,1210,0


In [2]:
# Check for missing values
df.isna().sum()

ID                     0
Warehouse_block        0
Mode_of_Shipment       0
Customer_care_calls    0
Customer_rating        0
Cost_of_the_Product    0
Prior_purchases        0
Product_importance     0
Gender                 0
Discount_offered       0
Weight_in_gms          0
Reached.on.Time_Y.N    0
dtype: int64

In [6]:
# Check for outliers: only Discount_offered looks somewhat unusual
print(df.columns)
df.describe()

Index(['ID', 'Warehouse_block', 'Mode_of_Shipment', 'Customer_care_calls', 'Customer_rating', 'Cost_of_the_Product', 'Prior_purchases', 'Product_importance', 'Gender', 'Discount_offered', 'Weight_in_gms', 'Reached.on.Time_Y.N'], dtype='object')


Unnamed: 0,ID,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Discount_offered,Weight_in_gms,Reached.on.Time_Y.N
count,10999.0,10999.0,10999.0,10999.0,10999.0,10999.0,10999.0,10999.0
mean,5500.0,4.054459,2.990545,210.196836,3.567597,8.590963,3634.016729,0.596691
std,3175.28214,1.14149,1.413603,48.063272,1.52286,6.095461,1635.377251,0.490584
min,1.0,2.0,1.0,96.0,2.0,1.0,1001.0,0.0
25%,2750.5,3.0,2.0,169.0,3.0,4.0,1839.5,0.0
50%,5500.0,4.0,3.0,214.0,3.0,7.0,4149.0,1.0
75%,8249.5,5.0,4.0,251.0,4.0,10.0,5050.0,1.0
max,10999.0,7.0,5.0,310.0,10.0,19.0,7846.0,1.0


In [5]:
# Cap Discount_offered at the 1.5 * IQR fences instead of dropping rows
quan25 = df['Discount_offered'].quantile(0.25)
quan75 = df['Discount_offered'].quantile(0.75)
quanDiff = (quan75 - quan25)*1.5  # 1.5 * IQR
mask_min = df['Discount_offered'] < quan25 - quanDiff
mask_max = df['Discount_offered'] > quan75 + quanDiff
df.loc[mask_min, 'Discount_offered'] = quan25 - quanDiff
df.loc[mask_max, 'Discount_offered'] = quan75 + quanDiff
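
The same capping can be written more compactly with `Series.clip`. This is a sketch with a hypothetical helper `clip_iqr` and a toy series, not the notebook's own code. The toy values are chosen so that the quartiles are 4 and 10, as in the `describe()` output for `Discount_offered`, which gives an upper fence of 10 + 1.5 * 6 = 19 and matches the reported max.

```python
import pandas as pd

def clip_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the fence values."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data: Q1 = 4, Q3 = 10, so the fences are -5 and 19
demo = pd.Series([1, 4, 7, 10, 65])
clipped = clip_iqr(demo)
print(clipped.tolist())
```

Applied to the notebook's data this would read `df['Discount_offered'] = clip_iqr(df['Discount_offered'])`.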

In [7]:
# Build the feature matrix and the target
X = df[['Warehouse_block', 'Mode_of_Shipment', 'Customer_care_calls', 'Customer_rating', 'Cost_of_the_Product', 'Prior_purchases', 'Product_importance', 'Gender', 'Discount_offered', 'Weight_in_gms']]
y = df['Reached.on.Time_Y.N']

X_num = df[['Customer_care_calls', 'Customer_rating', 'Cost_of_the_Product', 'Prior_purchases', 'Discount_offered', 'Weight_in_gms']]
X_cat = df[['Warehouse_block', 'Mode_of_Shipment', 'Product_importance', 'Gender']]

X_cat_dummy = pd.get_dummies(X_cat)


from sklearn.preprocessing import MinMaxScaler
X_num_scaled = MinMaxScaler().fit_transform(X_num)
X_num_scaled = pd.DataFrame(X_num_scaled, index=X_num.index, columns=X_num.columns)

X = pd.concat([X_num_scaled, X_cat_dummy], axis=1)

# train_test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3333, stratify=y)
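
Note that the cell above fits `MinMaxScaler` and `get_dummies` on the full data before splitting, so the scaling ranges also see the test rows. One way to avoid this is to put the preprocessing in a `Pipeline` so that it is fitted on the training rows only. The sketch below uses a small synthetic frame that borrows a few column names from the dataset; the values are randomly generated, not taken from Train.csv.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (assumption: random values, only the column names mirror the notebook)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Cost_of_the_Product": rng.integers(96, 311, size=n),
    "Weight_in_gms": rng.integers(1001, 7847, size=n),
    "Mode_of_Shipment": rng.choice(["Flight", "Ship"], size=n),
    "Gender": rng.choice(["F", "M"], size=n),
})
target = pd.Series(rng.integers(0, 2, size=n), name="Reached.on.Time_Y.N")

num_cols = ["Cost_of_the_Product", "Weight_in_gms"]
cat_cols = ["Mode_of_Shipment", "Gender"]

# Scale numeric columns and one-hot encode categorical columns inside the pipeline
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Split first, then fit: the scaler and encoder only see the training rows
X_tr, X_te, y_tr, y_te = train_test_split(demo, target, random_state=3333, stratify=target)
pipe.fit(X_tr, y_tr)
proba = pipe.predict_proba(X_te)[:, 1]
```

The same pipeline object can be passed directly to `cross_val_score`, which then refits the preprocessing inside every fold.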

In [22]:
# Fit the models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
model_lr = LogisticRegression().fit(X_train, y_train)
model_nb = GaussianNB().fit(X_train, y_train)


# Cross-validation score (the default scoring for classifiers is accuracy, not ROC-AUC)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=1234)
score = cross_val_score(LogisticRegression(), X_train, y_train, cv=kfold)
print("Cross-validation scores : >>", score)
print("Cross-validation mean : >>", score.mean())
print()

score = cross_val_score(GaussianNB(), X_train, y_train, cv=kfold)
print("Cross-validation scores : >>", score)
print("Cross-validation mean : >>", score.mean())

# GaussianNB has a slightly higher cross-validated accuracy than logistic regression (0.648 vs 0.639).
print(model_nb.score(X_test, y_test))

Cross-validation scores : >> [0.63575758 0.63818182 0.64060606 0.61818182 0.66040024]
Cross-validation mean : >> 0.6386255030597056

Cross-validation scores : >> [0.63272727 0.64848485 0.65030303 0.63575758 0.67070952]
Cross-validation mean : >> 0.6475964496388996
0.6556363636363637
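
The scores above are accuracies, because `cross_val_score` and `.score()` default to accuracy for classifiers, while the task is graded on ROC-AUC. A sketch of the same comparison on the graded metric is shown below, using `scoring="roc_auc"` and a stratified split. It runs on synthetic data, so its numbers say nothing about the shipping dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

auc_scores = {}
for name, est in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("GaussianNB", GaussianNB())]:
    # scoring="roc_auc" evaluates predicted scores instead of hard labels
    auc_scores[name] = cross_val_score(est, X_demo, y_demo, cv=cv, scoring="roc_auc")
    print(name, "mean ROC-AUC:", auc_scores[name].mean())
```

With the notebook's variables this would be `cross_val_score(GaussianNB(), X_train, y_train, cv=cv, scoring="roc_auc")`.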
