## Dataset. model_dataset
<br>
<img src="./images/model_dataset.png"/>

Partial explanation of the columns:  
1. label_group (object): 4 groups of response to offers
    - 'none_offer'
    - 'no_care'
    - 'tried'
    - 'effective_offer'
2. label_seg (int): 12 segments based on age and gender
    - values: 1 ... 12  <br>  
  
(More details in <u>2_heuristic_exploration.ipynb</u>)

###  <u>10 Kinds</u> of offer_id
| offer_id #| type | duration | requirement | reward |
|:-| :-| :-:|:-:|:-:|
| 0 | bogo | 7 | 10 | 10 |
| 1 | bogo | 5 | 10 | 10 |
| 2 | infomational | 4 | - | - |
| 3 | bogo | 7 | 5 | 5 |
| 4 | discount | 10 | 20 | 5 |
| 5 | discount | 7 | 7 | 3 |
| 6 | discount | 10 | 10 | 2 |
| 7 | informational | 3 | - | - |
| 8 | bogo | 5 | 5 | 5 |
| 9 | discount | 7 | 10 | 2 |

### <u>12 Segments</u> based on 'age' and 'gender'
<br>
    
|Segment #| Age Group (edges inclusive)<br> (experiment in 2018) | Gender | 
|---| --- | --- | 
|1| Millennials(-21 & 22-37) | M | 
|2| Millennials(-21 & 22-37) | F | 
|3| Millennials(-21 & 22-37) | O | 
|4| Gen X(38-53) | M|
|5| Gen X(38-53) | F|
|6| Gen X(38-53) | O|
|7| Baby Boomer(54-72) | M|
|8| Baby Boomer(54-72) | F|
|9| Baby Boomer(54-72) | O|
|10| Silent(73-90 & 91+) | M|
|11| Silent(73-90 & 91+) | F|
|12| Silent(73-90 & 91+) | O|
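
The numbering in the table above can be reproduced with a small helper. This is a minimal sketch, not the code used to build `label_seg` (that lives in the exploration notebook); `segment_of`, `AGE_UPPER_EDGES` and `GENDERS` are hypothetical names, and ages are assumed to be measured in 2018 as the table header states.

```python
# Hypothetical helper that reproduces the segment numbering of the table above.
AGE_UPPER_EDGES = [37, 53, 72]   # Millennials <= 37, Gen X <= 53, Baby Boomer <= 72, else Silent
GENDERS = ['M', 'F', 'O']        # order of the gender codes within each age group


def segment_of(age, gender):
    """Return the segment number (1..12) for an age and a gender code."""
    # count how many upper edges the age exceeds: 0 = Millennials ... 3 = Silent
    age_group = sum(age > edge for edge in AGE_UPPER_EDGES)
    return 3 * age_group + GENDERS.index(gender) + 1


print(segment_of(30, 'M'))   # Millennials, M
print(segment_of(54, 'F'))   # Baby Boomer, F
print(segment_of(95, 'O'))   # Silent, O
```

Because the edges are inclusive, an age of exactly 37 still falls in the Millennials group and 38 is the first Gen X age.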

### <u>4 Groups</u> of possible responsiveness to offer
<br>

|Group| received | viewed |valid completed | transaction amount |Scenario |
| :-| :-: | :-:| :-: | :-: | :- |
|1.none_offer| 0 | 0 | 0 | |haven't received the offer |
|2.no_care | 1 | 0 | - | |received but not viewed.<br> regarded as no_care|
|| 1 | 1 | 0 | =0.0 | received, viewed but no transaction |
|| 1 | 1 | 1<br>viewed after completed |  | received, but completed unintentionally |
|3.tried| 1 | 1 | 0 | >0.0|received, viewed, have transaction |
|4.effctive_offer | 1 | 1 | 1<br>viewed before completed | | viewed before completed, effective offer|
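
The rules in the table can be written as one decision function. This is a sketch of the table only, under the assumption that each row is described by the five values below; `label_group_of` and its parameter names are hypothetical, and the actual labelling is done in the exploration notebook. The label `'effctive_offer'` is spelled as it appears in the dataset.

```python
def label_group_of(received, viewed, completed, viewed_before_completed, amount):
    """Assign one of the four groups following the rules in the table above."""
    if not received:
        return 'none_offer'
    if not viewed:
        return 'no_care'                       # received but never viewed
    if completed:
        # a completion only counts when the offer was viewed first
        return 'effctive_offer' if viewed_before_completed else 'no_care'
    # viewed but not completed: any transaction means the person at least tried
    return 'tried' if amount > 0.0 else 'no_care'


print(label_group_of(1, 1, 0, False, 12.5))   # viewed, spent money, not completed
print(label_group_of(1, 1, 1, False, 12.5))   # completed before viewing
```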

# <a class="anchor" id="Start">Table of Contents</a>

I. [Feature Engineering](#1)<br>
II. [Build model Pipeline](#2)<br>
III. [Explore interesting Questions](#3)
    - Q3.1 If an offer is sent to a person, is it effective?
    - Q3.2 Given a person, which offer would be more effective?
IV. [Build Neural Network for Regression](#4)<br>
[References](#References)

In [1]:
import pandas as pd
import numpy as np
import math
import json

from time import time
from datetime import date
from collections import defaultdict

import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

model_dataset_raw = pd.read_csv('./data_generated/model_raw_dataset.csv', dtype={'offer_id': str})

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier


from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

In [96]:
from sklearn.pipeline import Pipeline
import pickle

## <a class="anchor" id="1">[I. Feature Engineering](#Start)</a>

### 1. Add features
- Total transaction amount per person `'amount_total'`
- Number of offers received per person `'offer_received_cnt'`

In [4]:
# Load in transactions dataset

# wrangled transcript with updated information of offer
transcript_offer = pd.read_csv('./data_generated/wrangled_transcript_offer.csv', dtype={'person': int})
# restore the original index, which was saved as the first (unnamed) column
transcript_offer.index = transcript_offer.iloc[:, 0].values
del transcript_offer['Unnamed: 0']

In [5]:
transcript_amount = transcript_offer.groupby('person').sum()['amount']
offer_received_cnt = model_dataset_raw.groupby(['person']).count()['offer_id']
persons = transcript_amount.index.tolist()

for person in persons:
    is_person = (model_dataset_raw.person == person)
    model_dataset_raw.loc[is_person,'amount_total'] = transcript_amount.loc[person]
    model_dataset_raw.loc[is_person,'offer_received_cnt'] = offer_received_cnt.loc[person]
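
The per-person loop above builds one boolean mask per person, which gets slow as the number of persons grows. A possible vectorized alternative is `Series.map`, shown here on made-up toy frames (`transcript` and `offers` are stand-ins, not the real data). One difference to keep in mind: persons without any transaction get `NaN` in `amount_total` but still receive their offer count.

```python
import pandas as pd

# Toy stand-ins for transcript_offer and model_dataset_raw (values are made up)
transcript = pd.DataFrame({'person': [1, 1, 2], 'amount': [5.0, 2.5, 10.0]})
offers = pd.DataFrame({'person': [1, 2, 2], 'offer_id': ['0', '3', '7']})

# aggregate once per person; the result is a Series indexed by person
amount_total = transcript.groupby('person')['amount'].sum()
received_cnt = offers.groupby('person')['offer_id'].count()

# Series.map looks up each row's person in the aggregated Series
offers['amount_total'] = offers['person'].map(amount_total)
offers['offer_received_cnt'] = offers['person'].map(received_cnt)
print(offers)
```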

In [49]:
model_dataset = model_dataset_raw.copy()

In [50]:
model_dataset.groupby('label_group').count()

| label_group | person | offer_id | time_received | time_viewed | time_transaction | time_completed | amount_with_offer | label_effective_offer | reward | difficulty | ... | mobile | social | web | gender | age | income | member_days | label_seg | amount_total | offer_received_cnt |
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| effctive_offer | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | ... | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 | 26826 |
| no_care | 31613 | 31613 | 31613 | 14972 | 22859 | 13581 | 31613 | 31613 | 31613 | 31613 | ... | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 |
| none_offer | 5 | 5 | 0 | 0 | 5 | 0 | 5 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| tried | 8062 | 8062 | 8062 | 8062 | 8062 | 0 | 8062 | 8062 | 8062 | 8062 | ... | 8062 | 8062 | 8062 | 8062 | 8062 | 8062 | 8062 | 8062 | 8062 | 8062 |


**Finding:**
1. The 5 persons in group `none_offer` are dropped so that no NaNs remain in the target columns of `model_dataset`.

In [51]:
is_dataset_keep = (model_dataset.label_group != 'none_offer')
model_dataset = model_dataset[is_dataset_keep]

### 2. One-hot encode the object columns
- gender
- label_group
- offer_id

In [52]:
gender_onehot = pd.get_dummies(model_dataset['gender'], prefix='gender')
label_group_onehot = pd.get_dummies(model_dataset['label_group'], prefix='group')
offer_id_onehot =  pd.get_dummies(model_dataset['offer_id'], prefix='offer')

In [53]:
model_dataset = pd.concat([model_dataset, gender_onehot, label_group_onehot, offer_id_onehot], axis=1)

### 3. Features of time
1. Time object
    - 'time_received'
    - 'time_viewed'
    - 'time_transaction'
    - 'time_completed'
2. Transform the time_transaction to transaction_cnt
3. Fill the NaNs with 0

In [54]:
model_dataset[(model_dataset.time_transaction.isin(['-1']))].offer_id.unique()  # the -1 label only occurs for offer_id in ['2', '7']

array(['2', '7'], dtype=object)

In [55]:
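def transform_transaction_cnt(dataset):
    # Three kinds of rows are handled (the dataset is modified in place):
    # - informational offers (offer_id '2' or '7')
    # - rows with time_transaction == '-1'
    # - rows whose time_transaction is a string containing ','
    
    # count the transactions as the number of commas in the string ('-1' and NaN give 0)
    dataset['time_transaction'] = dataset['time_transaction'].apply(lambda x: len(str(x).split(','))-1)
    
    # informational offers labelled as effective count as one transaction
    is_group_info = (dataset.offer_id.isin(['2', '7']) & (dataset.label_effective_offer==1))
    dataset.loc[is_group_info, 'time_transaction'] = 1
    
    return dataset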

model_dataset = transform_transaction_cnt(model_dataset)
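
To see what the counting expression does, it helps to run it on a few sample values. The raw format of `time_transaction` is not shown in this section, so the strings below are assumptions chosen only to illustrate the behaviour; `transaction_cnt` and `samples` are hypothetical names.

```python
def transaction_cnt(x):
    # same expression as in transform_transaction_cnt: number of commas in str(x)
    return len(str(x).split(',')) - 1


# assumed example values; the real strings come from the wrangling notebook
samples = ['-1', '132', ',132', ',132,168', float('nan')]

for raw in samples:
    print(repr(raw), '->', transaction_cnt(raw))
```

The expression counts separators, not values: a string without any comma yields 0, so the result equals the number of transactions only if every time in the string is preceded (or followed) by a comma.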

In [58]:
model_dataset.rename(columns={'time_transaction': 'transaction_cnt'}, inplace=True)

# drop the useless columns for modeling
model_dataset.drop(['label_effective_offer'], axis=1, inplace=True)

values = {'time_viewed': 0.0, 'time_completed': 0.0} #time_viewed: 49860 non-null, time_completed: 40407 non-null
model_dataset.fillna(value=values, inplace=True)

model_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66501 entries, 0 to 66505
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   person                66501 non-null  int64  
 1   offer_id              66501 non-null  object 
 2   time_received         66501 non-null  float64
 3   time_viewed           66501 non-null  float64
 4   transaction_cnt       66501 non-null  int64  
 5   time_completed        66501 non-null  float64
 6   amount_with_offer     66501 non-null  float64
 7   reward                66501 non-null  float64
 8   difficulty            66501 non-null  float64
 9   duration              66501 non-null  float64
 10  offer_type            66501 non-null  object 
 11  email                 66501 non-null  float64
 12  mobile                66501 non-null  float64
 13  social                66501 non-null  float64
 14  web                   66501 non-null  float64
 15  gender             

## <a class="anchor" id="2">[II. Build model Pipeline](#Start)</a>

In [59]:
# keep a copy so that the cells below can be re-run from this state
model_dataset_test = model_dataset.copy()

In [None]:
# restore from the backup; copy() keeps the backup untouched by later in-place changes
model_dataset = model_dataset_test.copy()

### 1. Select features and target 
[References[1]](https://github.com/syuenloh/UdacityDataScientistCapstone/blob/master/Starbucks%20Capstone%20Challenge%20-%20Using%20Starbucks%20app%20user%20data%20to%20predict%20effective%20offers.ipynb)

In [60]:
# Target: label_group
model_dataset['label_group'] = model_dataset['label_group'].replace(['no_care','tried', 'effctive_offer'],['0','1','1'])
model_dataset = model_dataset.astype({'label_group': int})

model_dataset.groupby('label_group').count()  
# 31613 vs 34888: the two target classes are roughly balanced

| label_group | person | offer_id | time_received | time_viewed | transaction_cnt | time_completed | amount_with_offer | reward | difficulty | duration | ... | offer_0 | offer_1 | offer_2 | offer_3 | offer_4 | offer_5 | offer_6 | offer_7 | offer_8 | offer_9 |
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | ... | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 | 31613 |
| 1 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | ... | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 | 34888 |
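
The balance claim can be checked with a few lines of arithmetic on the two counts printed above. The snippet is only a worked calculation; `counts` and `shares` are illustrative names.

```python
# class counts taken from the groupby output above
counts = {0: 31613, 1: 34888}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(label, round(share, 3))
```

The total matches the 66501 rows reported by `model_dataset.info()`, and neither class falls far from one half, so plain accuracy remains a usable metric here.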


In [120]:
def select_features_target(df, drop_cols, target_cols='label_group'):
    '''
    INPUT:
    - df(DataFrame): dataset including all candidate features and the target
    - drop_cols(list): names of the columns to exclude from the features
    - target_cols(str or list): name(s) of the target column(s)
    
    OUTPUT:
    - features(DataFrame): df without drop_cols
    - target(ndarray): values of the target column(s)
    '''
    # df['col'] is a Series (1-D array); df[['col']] is a DataFrame (2-D array)
    target = np.array(df[target_cols])
    features = df.drop(drop_cols, axis=1)
    
    return features, target

### 2. prepare model pipeline
[References[1]](https://github.com/syuenloh/UdacityDataScientistCapstone/blob/master/Starbucks%20Capstone%20Challenge%20-%20Using%20Starbucks%20app%20user%20data%20to%20predict%20effective%20offers.ipynb)

- print (see results)
    - train time
    - test time
    - train score
    - test score
    
- return
    - model clf (with params)
    - metrics report (@ETL): recall, precision, f1-score

In [105]:
def select_clf(pickle_path, clf_ls, features, target, test_size=0.20, random_state=9):
    '''
    Fit every classifier inside a scaling pipeline and compare time and accuracy.

    INPUT:
    - pickle_path(str): file in which the fitted pipelines are pickled
    - clf_ls(list): unfitted scikit-learn classifiers
    - features(DataFrame), target(array): output of select_features_target
    - test_size(float), random_state(int): passed to train_test_split

    OUTPUT:
    - visual_results(DataFrame): one row per classifier with
      'model', 'train_time', 'test_time', 'train_score', 'test_score'
    '''
    # split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(features, target, 
                                                        test_size=test_size, 
                                                        random_state=random_state)
    
    columns = ['model', 'train_time', 'test_time', 'train_score', 'test_score']
    records = []  # one dict of results per classifier
    models = {}   # fitted pipelines, keyed by classifier name
    
    for classifier in clf_ls:
        name = classifier.__class__.__name__
        pipe = Pipeline(steps=[('preprocessor', StandardScaler()),
                               ('clf', classifier)])
                           
        start_train = time()
        model = pipe.fit(X_train, y_train)
        end_train = time()
        
        # predict on the training set
        pred_train = model.predict(X_train)
        
        # predict on the test set and measure the time
        start_test = time()
        pred_test = model.predict(X_test)
        end_test = time()
    
        # for classifiers, score() returns the mean accuracy
        results = {'model': name,
                   'train_time': end_train-start_train,
                   'test_time': end_test-start_test,
                   'train_score': model.score(X_train,y_train),
                   'test_score': model.score(X_test,y_test)}
        
        print("{} trained on {} samples.".format(name, len(y_train)))
        print("Train time: {}s".format(results['train_time']))
        print("Test time: {}s".format(results['test_time']))
        print("MSE_train: %.4f" % mean_squared_error(y_train,pred_train))
        print("MSE_test: %.4f" % mean_squared_error(y_test,pred_test))
        print("Training accuracy: %.4f" % results['train_score'])
        print("Test accuracy: %.4f" % results['test_score'])
        
        # output the report
        report = classification_report(y_test, pred_test,digits=4) #output_dict=True
        print(report)
        
        # collect the row; DataFrame.append was removed in pandas 2.0
        records.append(results)
        models[name] = model
        
    # pickle all fitted pipelines once; dumping inside the loop kept only the last one
    with open(pickle_path, "wb") as f:
        pickle.dump(models, f)
        
    visual_results = pd.DataFrame(records, columns=columns)
    return visual_results
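
A note on the `MSE_train` / `MSE_test` lines printed by `select_clf`: for 0/1 labels a squared error is 1 for a wrong prediction and 0 for a correct one, so the MSE is simply the error rate, i.e. 1 minus the accuracy (for example MSE_test 0.1389 next to test accuracy 0.8611). The toy labels below are made up to show the identity.

```python
# made-up binary labels and predictions
y_true = [0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

n = len(y_true)
# squared error is 1 for a misclassified sample and 0 otherwise
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print(mse, accuracy, mse + accuracy)
```

The MSE therefore adds no information beyond the accuracy for this binary target; the precision and recall in the classification report are the more informative extras.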

## <a class="anchor" id="3">[III. Explore interesting Questions](#Start)</a>

### Q3.1 If an offer is sent to a person, is it effective?
1. Target
2. Features

In [98]:
# features:
# the one-hot offer_* columns stand in for offer_id and its attributes
# (reward, difficulty, duration, email, mobile, social, web), which are dropped
drop_cols = ['person', 'offer_id', 'reward', 'difficulty', 'duration','offer_type','email',
            'mobile', 'social', 'web', 'gender', 'label_group', 'label_seg',
            'group_effctive_offer', 'group_no_care', 'group_tried']
            
features, target = select_features_target(model_dataset, target_cols='label_group', drop_cols=drop_cols)

classifiers = [
    KNeighborsClassifier(3),
    #SVC(kernel="rbf", C=0.025, probability=True),
    #NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

# test for ideal with group infos
pickle_path = './models_with_transaction.pckl'
visual_results = select_clf(pickle_path, classifiers, features, target, test_size=0.20, random_state=9)

KNeighborsClassifier trained on 53200 samples.
MSE_train: 0.0796
MSE_test: 0.1389
Training accuracy: 0.9204
Test accuracy: 0.8611
              precision    recall  f1-score   support

           0     0.8870    0.8111    0.8473      6320
           1     0.8413    0.9065    0.8726      6981

    accuracy                         0.8611     13301
   macro avg     0.8641    0.8588    0.8600     13301
weighted avg     0.8630    0.8611    0.8606     13301

DecisionTreeClassifier trained on 53200 samples.
MSE_train: 0.0000
MSE_test: 0.0095
Training accuracy: 1.0000
Test accuracy: 0.9905
              precision    recall  f1-score   support

           0     0.9908    0.9891    0.9899      6320
           1     0.9901    0.9917    0.9909      6981

    accuracy                         0.9905     13301
   macro avg     0.9905    0.9904    0.9904     13301
weighted avg     0.9905    0.9905    0.9905     13301

RandomForestClassifier trained on 53200 samples.
MSE_train: 0.0000
MSE_test: 0.0542


In [103]:
visual_results

| | model | train_time | test_time | train_score | test_score |
|:-|:-|:-:|:-:|:-:|:-:|
| 0 | KNeighborsClassifier | 1.403196 | 30.19572 | 0.920376 | 0.861138 |
| 1 | DecisionTreeClassifier | 0.90848 | 0.02099 | 1.0 | 0.990452 |
| 2 | RandomForestClassifier | 15.748988 | 0.336807 | 1.0 | 0.945794 |
| 3 | AdaBoostClassifier | 6.374353 | 0.199887 | 0.904023 | 0.90294 |
| 4 | GradientBoostingClassifier | 17.145186 | 0.075957 | 0.940526 | 0.939704 |


In [104]:
# 'group_effctive_offer', 'group_no_care', 'group_tried', 'transaction_cnt' and 'time_completed'
# carry direct information about the target classes, so they are dropped
drop_cols = ['person', 'offer_id', 'reward', 'difficulty', 'duration','offer_type','email',
            'mobile', 'social', 'web', 'gender', 'label_group', 'label_seg',
            'group_effctive_offer', 'group_no_care', 'group_tried', 
            'transaction_cnt', 'time_completed']

features, target = select_features_target(model_dataset, target_cols='label_group', drop_cols=drop_cols)

classifiers = [
    KNeighborsClassifier(3),
    #SVC(kernel="rbf", C=0.025, probability=True),
    #NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

# test for ideal with group infos
pickle_path = './models_with_transaction.pckl'
visual_results2 = select_clf(pickle_path, classifiers, features, target, test_size=0.20, random_state=9)

KNeighborsClassifier trained on 53200 samples.
MSE_train: 0.1164
MSE_test: 0.2107
Training accuracy: 0.8836
Test accuracy: 0.7893
              precision    recall  f1-score   support

           0     0.8151    0.7198    0.7645      6320
           1     0.7706    0.8522    0.8093      6981

    accuracy                         0.7893     13301
   macro avg     0.7928    0.7860    0.7869     13301
weighted avg     0.7917    0.7893    0.7880     13301

DecisionTreeClassifier trained on 53200 samples.
MSE_train: 0.0000
MSE_test: 0.1196
Training accuracy: 1.0000
Test accuracy: 0.8804
              precision    recall  f1-score   support

           0     0.8672    0.8835    0.8753      6320
           1     0.8927    0.8775    0.8851      6981

    accuracy                         0.8804     13301
   macro avg     0.8800    0.8805    0.8802     13301
weighted avg     0.8806    0.8804    0.8804     13301

RandomForestClassifier trained on 53200 samples.
MSE_train: 0.0000
MSE_test: 0.0874


In [106]:
visual_results2

| | model | train_time | test_time | train_score | test_score |
|:-|:-|:-:|:-:|:-:|:-:|
| 0 | KNeighborsClassifier | 1.301253 | 23.104777 | 0.883553 | 0.789264 |
| 1 | DecisionTreeClassifier | 0.929468 | 0.025983 | 1.0 | 0.880385 |
| 2 | RandomForestClassifier | 9.909329 | 0.385781 | 1.0 | 0.912563 |
| 3 | AdaBoostClassifier | 3.703881 | 0.22387 | 0.896109 | 0.895497 |
| 4 | GradientBoostingClassifier | 11.375487 | 0.034981 | 0.9125 | 0.912864 |


In [107]:
# drop the columns that carry direct information about the target classes:
# 'group_*', 'transaction_cnt', 'time_completed', and in this run also 'time_viewed'
drop_cols = ['person', 'offer_id', 'reward', 'difficulty', 'duration','offer_type','email',
            'mobile', 'social', 'web', 'gender', 'label_group', 'label_seg',
            'group_effctive_offer', 'group_no_care', 'group_tried', 
            'transaction_cnt', 'time_completed', 'time_viewed']

features, target = select_features_target(model_dataset, target_cols='label_group', drop_cols=drop_cols)

classifiers = [
    KNeighborsClassifier(3),
    #SVC(kernel="rbf", C=0.025, probability=True),
    #NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

# test for ideal with group infos
pickle_path = './models_with_transaction_3.pckl'
visual_results_3 = select_clf(pickle_path, classifiers, features, target, test_size=0.20, random_state=9)

KNeighborsClassifier trained on 53200 samples.
Train time: 3.8258111476898193s
Test time: 19.92059850692749s
MSE_train: 0.1628
MSE_test: 0.2909
Training accuracy: 0.8372
Test accuracy: 0.7091
              precision    recall  f1-score   support

           0     0.7025    0.6726    0.6873      6320
           1     0.7146    0.7422    0.7281      6981

    accuracy                         0.7091     13301
   macro avg     0.7086    0.7074    0.7077     13301
weighted avg     0.7089    0.7091    0.7087     13301

DecisionTreeClassifier trained on 53200 samples.
Train time: 0.9194741249084473s
Test time: 0.021988630294799805s
MSE_train: 0.0000
MSE_test: 0.2501
Training accuracy: 1.0000
Test accuracy: 0.7499
              precision    recall  f1-score   support

           0     0.7312    0.7492    0.7401      6320
           1     0.7678    0.7506    0.7591      6981

    accuracy                         0.7499     13301
   macro avg     0.7495    0.7499    0.7496     13301
weighted avg

In [108]:
visual_results_3

| | model | train_time | test_time | train_score | test_score |
|:-|:-|:-:|:-:|:-:|:-:|
| 0 | KNeighborsClassifier | 3.825811 | 19.920599 | 0.83718 | 0.70912 |
| 1 | DecisionTreeClassifier | 0.919474 | 0.021989 | 1.0 | 0.749944 |
| 2 | RandomForestClassifier | 9.940312 | 0.493717 | 1.0 | 0.820014 |
| 3 | AdaBoostClassifier | 3.894773 | 0.192892 | 0.806579 | 0.808887 |
| 4 | GradientBoostingClassifier | 11.050676 | 0.042976 | 0.818308 | 0.819111 |


In [109]:
# drop the columns that carry direct information about the target classes:
# 'group_*', 'transaction_cnt', 'time_completed', 'time_viewed', and in this run also 'amount_with_offer'
drop_cols = ['person', 'offer_id', 'reward', 'difficulty', 'duration','offer_type','email',
            'mobile', 'social', 'web', 'gender', 'label_group', 'label_seg',
            'group_effctive_offer', 'group_no_care', 'group_tried', 
            'transaction_cnt', 'time_completed', 'time_viewed','amount_with_offer']

features, target = select_features_target(model_dataset, target_cols='label_group', drop_cols=drop_cols)

classifiers = [
    KNeighborsClassifier(3),
    #SVC(kernel="rbf", C=0.025, probability=True),
    #NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

# test for ideal with group infos
pickle_path = './models_with_transaction_4.pckl'
visual_results_4 = select_clf(pickle_path, classifiers, features, target, test_size=0.20, random_state=9)

KNeighborsClassifier trained on 53200 samples.
Train time: 2.5915162563323975s
Test time: 11.598362445831299s
MSE_train: 0.1846
MSE_test: 0.3239
Training accuracy: 0.8154
Test accuracy: 0.6761
              precision    recall  f1-score   support

           0     0.6620    0.6505    0.6562      6320
           1     0.6885    0.6993    0.6939      6981

    accuracy                         0.6761     13301
   macro avg     0.6752    0.6749    0.6750     13301
weighted avg     0.6759    0.6761    0.6760     13301

DecisionTreeClassifier trained on 53200 samples.
Train time: 0.8335225582122803s
Test time: 0.014992713928222656s
MSE_train: 0.0000
MSE_test: 0.3550
Training accuracy: 1.0000
Test accuracy: 0.6450
              precision    recall  f1-score   support

           0     0.6249    0.6324    0.6287      6320
           1     0.6636    0.6564    0.6599      6981

    accuracy                         0.6450     13301
   macro avg     0.6442    0.6444    0.6443     13301
weighted av

In [110]:
# drop 'group_*', 'time_completed' and 'time_viewed';
# 'transaction_cnt' and 'amount_with_offer' are kept as features in this run
drop_cols = ['person', 'offer_id', 'reward', 'difficulty', 'duration','offer_type','email',
            'mobile', 'social', 'web', 'gender', 'label_group', 'label_seg',
            'group_effctive_offer', 'group_no_care', 'group_tried', 
            'time_completed', 'time_viewed']

features, target = select_features_target(model_dataset, target_cols='label_group', drop_cols=drop_cols)

classifiers = [
    KNeighborsClassifier(3),
    #SVC(kernel="rbf", C=0.025, probability=True),
    #NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

# test for ideal with group infos
pickle_path = './models_with_transaction_5.pckl'
visual_results_5 = select_clf(pickle_path, classifiers, features, target, test_size=0.20, random_state=9)

KNeighborsClassifier trained on 53200 samples.
Train time: 1.2412879467010498s
Test time: 25.400697469711304s
MSE_train: 0.1302
MSE_test: 0.2210
Training accuracy: 0.8698
Test accuracy: 0.7790
              precision    recall  f1-score   support

           0     0.7798    0.7453    0.7621      6320
           1     0.7783    0.8095    0.7936      6981

    accuracy                         0.7790     13301
   macro avg     0.7790    0.7774    0.7779     13301
weighted avg     0.7790    0.7790    0.7786     13301

DecisionTreeClassifier trained on 53200 samples.
Train time: 1.447699785232544s
Test time: 0.01699066162109375s
MSE_train: 0.0000
MSE_test: 0.2506
Training accuracy: 1.0000
Test accuracy: 0.7494
              precision    recall  f1-score   support

           0     0.7319    0.7457    0.7388      6320
           1     0.7658    0.7528    0.7592      6981

    accuracy                         0.7494     13301
   macro avg     0.7489    0.7492    0.7490     13301
weighted avg 

In [None]:
def model_select_param(classifier, param_grid, features, target, test_size=0.20, random_state=9):
    # split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(features, target, 
                                                        test_size=test_size, 
                                                        random_state=random_state)
    
    pipe = Pipeline(steps=[('preprocessor', StandardScaler()),
                        ('clf', classifier)])
    CV = GridSearchCV(pipe, param_grid, n_jobs= 1)
    
    results = defaultdict()
    
    start = time()
    CV.fit(X_train, y_train) 
    end = time()
    
    # Attribute: best_estimator_  best_params_  best_score_
    results['model'] = CV
    results['train_time'] = end - start
    
    # predict in train set
    pred_train = CV.predict(X_train)

    # predict in test set and Calculate the time
    start_test = time()
    pred_test = CV.predict(X_test)
    end_test = time()
    results['test_time'] = end_test-start_test

    # add training accuracy to results
    # (with the default scoring, score() is the mean accuracy for classifiers)
    results['train_score']=CV.score(X_train,y_train)

    #add testing accuracy to results
    results['test_score']=CV.score(X_test,y_test)

    print("{} trained on {} samples.".format(CV.best_estimator_, len(y_train)))
    print("MSE_train: %.4f" % mean_squared_error(y_train, pred_train))
    print("MSE_test: %.4f" % mean_squared_error(y_test, pred_test))
    print("Training accuracy: %.4f" % results['train_score'])
    print("Test accuracy: %.4f" % results['test_score'])
    print(classification_report(y_test, pred_test,digits=4))

    return results
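
`model_select_param` expects a `param_grid`, but none is shown in this notebook. Because the search runs over a `Pipeline`, each key has to be prefixed with the step name (`clf__` for the step called `'clf'`). The sketch below uses synthetic data and arbitrary grid values, both assumptions made only so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic data, only to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=8, random_state=9)

pipe = Pipeline(steps=[('preprocessor', StandardScaler()),
                       ('clf', DecisionTreeClassifier(random_state=9))])

# keys follow '<step name>__<parameter name>'; the values are illustrative
param_grid = {'clf__max_depth': [2, 4, 8],
              'clf__min_samples_leaf': [1, 5]}

search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=1)
search.fit(X, y)
print(search.best_params_)
```

The same `param_grid` dictionary could be passed to `model_select_param` together with an unfitted `DecisionTreeClassifier()`, since that function builds a pipeline with the same step names.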

### Q3.2 Given a person, which offer would be more effective?
1. Target
2. Features

In [None]:
# 'group_effctive_offer', 'group_no_care', 'group_tried','transaction_cnt', 'time_completed' has direct information of target classes
target_cols = ['']
drop_cols = ['person', 'offer_id', 'reward', 'difficulty', 'duration','offer_type','email',
            'mobile', 'social', 'web', 'gender', 'label_group', 'label_seg',
            'group_effctive_offer', 'group_no_care', 'group_tried', 
            'time_completed', 'time_viewed']

features, target = select_features_target(model_dataset, drop_cols)

classifiers = [
    KNeighborsClassifier(3),
    #SVC(kernel="rbf", C=0.025, probability=True),
    #NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

# test for ideal with group infos
pickle_path = './models_with_transaction_5.pckl'
visual_results_5 = select_clf(pickle_path, classifiers, features, target, test_size=0.20, random_state=9)

## <a class="anchor" id="4">[IV. Build Neural Network for Regression](#Start)</a>

## <a class="anchor" id="References">[References](#Start)</a>
[[1]Starbucks Capstone Challenge: Using Starbucks app user data to predict effective offers](https://github.com/syuenloh/UdacityDataScientistCapstone/blob/master/Starbucks%20Capstone%20Challenge%20-%20Using%20Starbucks%20app%20user%20data%20to%20predict%20effective%20offers.ipynb)<br>


In [None]:
model_dataset.offer_received_cnt.hist() # a Series can be plotted directly