In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Dataset overview
This dataset records how users browse a website and whether the session ends in a purchase. It has 18 columns: 10 numerical attributes and 8 categorical attributes. A rough grouping follows:
```
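Groups:
1. Number of pages visited, by page type:
   Administrative, Informational, ProductRelated

2. Total time spent, by page type:
   Administrative_Duration, Informational_Duration, ProductRelated_Duration

3. Page-importance metrics:
   BounceRates, ExitRates, PageValues

4. Closeness of the visit to a special day (e.g. Valentine's Day, Christmas):
   SpecialDay

5. Categorical columns:
   Month: month of the visit
   OperatingSystems: operating system used
   Browser: browser used
   Region: region the visitor browsed from
   TrafficType: not clearly described in the source; treated as an opaque category
   VisitorType: whether the visitor is returning or new
   Weekend: whether the visit fell on a weekend

6. Target column:
   Revenue: the column to predict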
```
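
The grouping above can be written down as plain Python lists, which also makes the 10 numerical / 8 categorical split easy to check. This is only a sketch: the names `numerical_cols` and `categorical_cols` are introduced here for illustration, and the column spellings follow the header in the `shop.head()` output.

```python
# Column groups as described above; the list names are illustrative.
numerical_cols = [
    "Administrative", "Administrative_Duration",
    "Informational", "Informational_Duration",
    "ProductRelated", "ProductRelated_Duration",
    "BounceRates", "ExitRates", "PageValues", "SpecialDay",
]
categorical_cols = [
    "Month", "OperatingSystems", "Browser", "Region",
    "TrafficType", "VisitorType", "Weekend", "Revenue",
]

# 10 numerical + 8 categorical = 18 columns in total.
print(len(numerical_cols), len(categorical_cols))
```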

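# Problem definition
The dataset description states explicitly that "Revenue" can serve as the class label, so I use the remaining columns to predict the class of Revenue.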

# Data preprocessing

In [3]:
shop = pd.read_csv('online_shoppers_intention.csv')

In [4]:
shop.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


```
Inspecting the Month column shows only ten distinct months, so I number them 1-10 in calendar order.
```

In [6]:
shop['Month'].unique()

array(['Feb', 'Mar', 'May', 'Oct', 'June', 'Jul', 'Aug', 'Nov', 'Sep',
       'Dec'], dtype=object)

In [12]:
def mapMonth(month):
    return {
        'Feb': 1,
        'Mar': 2,
        'May': 3,
        'June': 4,
        'Jul': 5,
        'Aug': 6,
        'Sep': 7,
        'Oct': 8,
        'Nov': 9,
        'Dec': 10
    }[month]

In [16]:
shop['Month'] = shop['Month'].apply(mapMonth)
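
A possible alternative, not what this notebook does: map each label to its calendar month number, so the gaps left by the missing January and April stay visible to the model. The sketch below uses a small hypothetical Series rather than the real data, and `calendar_month` is a name introduced for illustration. Using `Series.map` also turns any unexpected label into NaN instead of raising a `KeyError`.

```python
import pandas as pd

# Calendar positions of the ten labels found in the Month column.
calendar_month = {
    'Feb': 2, 'Mar': 3, 'May': 5, 'June': 6, 'Jul': 7,
    'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12,
}

# Hypothetical sample, not the real dataset.
months = pd.Series(['Feb', 'May', 'June', 'Dec'])
encoded = months.map(calendar_month)

# Labels missing from the dict would become NaN, so check for them.
assert encoded.notna().all()
print(encoded.tolist())
```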

In [18]:
shop.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,1,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,1,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,1,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,1,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,1,3,3,1,4,Returning_Visitor,True,False


```
Next, handle the VisitorType column.
It turns out to have three categories, so I encode them with a map.
```

In [19]:
shop['VisitorType'].unique()

array(['Returning_Visitor', 'New_Visitor', 'Other'], dtype=object)

In [24]:
visitor = {'Returning_Visitor': 1, 'New_Visitor': 2, 'Other': 3}

In [26]:
shop['VisitorType'] = shop['VisitorType'].map(visitor)
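
Because the three visitor types have no natural order, the integer codes 1, 2, 3 impose one that a linear model will take literally. One common alternative, shown here only as a sketch on hypothetical sample data, is one-hot encoding with `pd.get_dummies`; the notebook itself keeps the integer codes.

```python
import pandas as pd

# Hypothetical sample with the three categories seen in the data.
sample = pd.DataFrame({
    'VisitorType': ['Returning_Visitor', 'New_Visitor', 'Other', 'New_Visitor']
})

# One indicator column per category, so no ordering is implied.
one_hot = pd.get_dummies(sample, columns=['VisitorType'], dtype=int)
print(one_hot.columns.tolist())
```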

```
Then handle the Weekend and Revenue columns.
```

In [30]:
true_false = {True: 1, False: 0}

In [33]:
shop['Weekend'] = shop['Weekend'].map(true_false)
shop['Revenue'] = shop['Revenue'].map(true_false)
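
For boolean columns the dictionary is optional: casting with `astype` gives the same 0/1 result. A minimal sketch on hypothetical data, assuming the column was read as a true `bool` dtype:

```python
import pandas as pd

# Hypothetical sample of a boolean column.
weekend = pd.Series([False, True, False])

# True becomes 1 and False becomes 0.
as_int = weekend.astype(int)
print(as_int.tolist())
```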

```
Reduce the DataFrame's memory footprint a little.
```

In [34]:
shop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
Administrative             12330 non-null int64
Administrative_Duration    12330 non-null float64
Informational              12330 non-null int64
Informational_Duration     12330 non-null float64
ProductRelated             12330 non-null int64
ProductRelated_Duration    12330 non-null float64
BounceRates                12330 non-null float64
ExitRates                  12330 non-null float64
PageValues                 12330 non-null float64
SpecialDay                 12330 non-null float64
Month                      12330 non-null int64
OperatingSystems           12330 non-null int64
Browser                    12330 non-null int64
Region                     12330 non-null int64
TrafficType                12330 non-null int64
VisitorType                12330 non-null int64
Weekend                    12330 non-null int64
Revenue                    12330 non-null int64
dtypes: float

In [35]:
# Downcast the integer columns to int8 to save memory.
# Caution: int8 holds only -128..127 and larger values wrap around silently,
# so check each column's max() first (page counts such as ProductRelated may exceed 127).
shop['Administrative'] = shop['Administrative'].astype(np.int8)
shop['Informational'] = shop['Informational'].astype(np.int8)
shop['ProductRelated'] = shop['ProductRelated'].astype(np.int8)
shop['Month'] = shop['Month'].astype(np.int8)
shop['OperatingSystems'] = shop['OperatingSystems'].astype(np.int8)
shop['Browser'] = shop['Browser'].astype(np.int8)
shop['Region'] = shop['Region'].astype(np.int8)
shop['TrafficType'] = shop['TrafficType'].astype(np.int8)
shop['VisitorType'] = shop['VisitorType'].astype(np.int8)
shop['Weekend'] = shop['Weekend'].astype(np.int8)
shop['Revenue'] = shop['Revenue'].astype(np.int8)
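
A more defensive way to shrink integer columns, sketched here on hypothetical sample data, is to let `pd.to_numeric(..., downcast='integer')` pick the smallest integer type that actually fits each column. The value 705 below is made up to stand for a page count larger than the `int8` maximum.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one column fits in int8, the other does not.
sample = pd.DataFrame({
    'Month': [1, 5, 10],
    'ProductRelated': [1, 10, 705],
})

# The representable range of int8.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)

# Downcast each integer column to the smallest type that holds its values.
for col in sample.columns:
    sample[col] = pd.to_numeric(sample[col], downcast='integer')

print(sample.dtypes)
```

With this approach a column is only narrowed as far as its values allow, so no value is altered by the conversion.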

In [36]:
shop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
Administrative             12330 non-null int8
Administrative_Duration    12330 non-null float64
Informational              12330 non-null int8
Informational_Duration     12330 non-null float64
ProductRelated             12330 non-null int8
ProductRelated_Duration    12330 non-null float64
BounceRates                12330 non-null float64
ExitRates                  12330 non-null float64
PageValues                 12330 non-null float64
SpecialDay                 12330 non-null float64
Month                      12330 non-null int8
OperatingSystems           12330 non-null int8
Browser                    12330 non-null int8
Region                     12330 non-null int8
TrafficType                12330 non-null int8
VisitorType                12330 non-null int8
Weekend                    12330 non-null int8
Revenue                    12330 non-null int8
dtypes: float64(7), int8

```
Save the processed data first.
```

In [37]:
shop.to_csv('Shop.csv')
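
Two things are worth knowing about this save. By default `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column on reload, and CSV stores no dtype information, so the downcast columns have to be narrowed again after loading. The sketch below shows both on a hypothetical two-row frame, writing to an in-memory buffer instead of `Shop.csv`:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with a downcast column.
small = pd.DataFrame({'Month': np.array([1, 2], dtype=np.int8)})

# Default: the index is written too and reappears as 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(small.to_csv()))
print(with_index.columns.tolist())

# index=False keeps only the real columns; dtypes must be restored by hand.
buffer = io.StringIO(small.to_csv(index=False))
reloaded = pd.read_csv(buffer, dtype={'Month': np.int8})
print(reloaded.dtypes)
```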

# A first look at the baseline

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [42]:
df_ = shop.copy()
X_train, X_test, y_train, y_test = train_test_split(df_.drop(['Revenue'],axis=1), 
                                                    df_['Revenue'], test_size=0.30)
logmodel = LogisticRegression(random_state=0, class_weight='balanced')
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
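
The split above has no fixed seed, so every run draws a different test set and the scores shift from run to run. A hedged sketch of a more repeatable baseline follows. It runs on synthetic data from `make_classification` rather than the shop data, and the choices of `random_state=0`, `stratify`, and a `StandardScaler` step are assumptions of this sketch, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and labels.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

# Fixed seed for repeatability; stratify keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)

# Scaling first helps the logistic-regression solver converge.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, class_weight='balanced'),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)
```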

In [40]:
from sklearn.metrics import classification_report
from sklearn import metrics

In [41]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.89      0.98      0.93      3112
           1       0.76      0.36      0.49       587

   micro avg       0.88      0.88      0.88      3699
   macro avg       0.83      0.67      0.71      3699
weighted avg       0.87      0.88      0.86      3699



In [43]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.95      0.89      0.92      3138
           1       0.53      0.71      0.61       561

   micro avg       0.86      0.86      0.86      3699
   macro avg       0.74      0.80      0.76      3699
weighted avg       0.88      0.86      0.87      3699
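
Each f1-score is the harmonic mean of the precision and recall on the same row. A quick check with the class-1 figures from the two reports above (the helper name `f1` is introduced here for illustration):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 in the first report: precision 0.76, recall 0.36.
print(round(f1(0.76, 0.36), 2))

# Class 1 in the second report: precision 0.53, recall 0.71.
print(round(f1(0.53, 0.71), 2))
```

Both values agree with the reported 0.49 and 0.61. The second report gives up precision on the purchasing class in exchange for a much higher recall.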

