## Portfolio Exercise: Starbucks
<br>

<img src="https://opj.ca/wp-content/uploads/2018/02/New-Starbucks-Logo-1200x969.jpg" width="200" height="200">
<br>
<br>
 
#### Background Information

The dataset provided in this portfolio exercise was originally used as a take-home assignment that Starbucks gave to its job candidates. It consists of about 120,000 data points split in a 2:1 ratio between training and test files. In the experiment simulated by the data, an advertising promotion was tested to see whether it would bring more customers to purchase a specific product priced at \$10. Since it costs the company \$0.15 to send out each promotion, it is best to limit the promotion to the individuals who are most receptive to it. Each data point includes one column indicating whether or not an individual was sent a promotion for the product, and one column indicating whether or not that individual eventually purchased that product. Each individual also has seven additional features associated with them, which are provided abstractly as V1-V7.

#### Optimization Strategy

Your task is to use the training data to understand which patterns in V1-V7 indicate that a promotion should be provided to a user. Specifically, your goal is to maximize the following metrics:

* **Incremental Response Rate (IRR)** 

IRR depicts how many more customers purchased the product with the promotion, as compared to if they didn't receive the promotion. Mathematically, it's the ratio of the number of purchasers in the promotion group to the total number of customers in the promotion group (_treatment_) minus the ratio of the number of purchasers in the non-promotional group to the total number of customers in the non-promotional group (_control_).

$$ IRR = \frac{purch_{treat}}{cust_{treat}} - \frac{purch_{ctrl}}{cust_{ctrl}} $$


* **Net Incremental Revenue (NIR)**

NIR depicts how much is made (or lost) by sending out the promotion. Mathematically, this is 10 times the total number of purchasers that received the promotion minus 0.15 times the number of promotions sent out, minus 10 times the number of purchasers who were not given the promotion.

$$ NIR = (10\cdot purch_{treat} - 0.15 \cdot cust_{treat}) - 10 \cdot purch_{ctrl}$$
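
As a quick sanity check on the two formulas, here is a minimal sketch that evaluates them for the raw treatment and control counts that the exploration further down reports for the training file (721 purchasers among 42,364 promoted customers, 319 purchasers among 42,170 non-promoted customers). The function and argument names are illustrative and are not part of the Starbucks material.

```python
def incremental_response_rate(purch_treat, cust_treat, purch_ctrl, cust_ctrl):
    """IRR: purchase rate in the treatment group minus purchase rate in the control group."""
    return purch_treat / cust_treat - purch_ctrl / cust_ctrl


def net_incremental_revenue(purch_treat, cust_treat, purch_ctrl,
                            price=10.0, cost_per_promotion=0.15):
    """NIR: treatment revenue net of promotion cost, minus control revenue."""
    return (price * purch_treat - cost_per_promotion * cust_treat) - price * purch_ctrl


# Counts taken from the training data explored later in this notebook
irr = incremental_response_rate(721, 42364, 319, 42170)
nir = net_incremental_revenue(721, 42364, 319)

print(f"IRR = {irr:.4%}")
print(f"NIR = {nir:.2f}")
```

With these counts the IRR comes out a little under one percentage point and the NIR is negative, which is the baseline any targeting strategy has to beat.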

For a full description of what Starbucks provides to candidates see the [instructions available here](https://drive.google.com/open?id=18klca9Sef1Rs6q8DW4l7o349r8B70qXM).

Below you can find the training data provided.  Explore the data and different optimization strategies.

#### How To Test Your Strategy?

When you feel like you have an optimization strategy, complete the `promotion_strategy` function to pass to the `test_results` function.  
From past data, we know there are four possible outcomes:

Table of actual promotion vs. predicted promotion customers:  

<table>
<tr><th></th><th colspan = '2'>Actual</th></tr>
<tr><th>Predicted</th><th>Yes</th><th>No</th></tr>
<tr><th>Yes</th><td>I</td><td>II</td></tr>
<tr><th>No</th><td>III</td><td>IV</td></tr>
</table>

The metrics are only being compared for the individuals we predict should obtain the promotion – that is, quadrants I and II.  Since the first set of individuals that receive the promotion (in the training set) receive it randomly, we can expect that quadrants I and II will have approximately equivalent participants.  

Comparing quadrant I to II then gives an idea of how well your promotion strategy will work in the future. 
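
The quadrant comparison can be sketched in a few lines. This is only an illustration of the idea described above, under the assumption that scoring keeps the rows a strategy marks `'Yes'` and then compares the promoted and non-promoted customers among them; it is not the actual `test_results` implementation, and `score_strategy` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def score_strategy(df, promotion, price=10.0, cost_per_promotion=0.15):
    """Return (IRR, NIR) over the rows the strategy marks 'Yes' (quadrants I and II)."""
    # keep only the individuals the strategy would promote
    selected = df[np.asarray(promotion) == 'Yes']
    treat = selected[selected['Promotion'] == 'Yes']   # actually received the promotion
    ctrl = selected[selected['Promotion'] == 'No']     # did not receive it
    purch_treat = treat['purchase'].sum()
    purch_ctrl = ctrl['purchase'].sum()
    irr = purch_treat / len(treat) - purch_ctrl / len(ctrl)
    nir = (price * purch_treat - cost_per_promotion * len(treat)) - price * purch_ctrl
    return irr, nir


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    'Promotion': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
    'purchase':  [1, 1, 0, 1, 0, 0],
})
strategy_output = np.array(['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No'])

irr, nir = score_strategy(toy, strategy_output)
print(irr, nir)
```

The sketch does not guard against a strategy that selects no treated or no control customers; in that case the rates are undefined.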

Get started by reading in the data below.  See how each variable or combination of variables along with a promotion influences the chance of purchasing.  When you feel like you have a strategy for who should receive a promotion, test your strategy against the test dataset used in the final `test_results` function.

In [1]:
# load in packages
import pickle
from itertools import combinations

from test_results import test_results, score
import numpy as np
import pandas as pd
import scipy as sp
import sklearn as sk
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression


import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

# load in the data
train_data = pd.read_csv('./training.csv')
train_data.head()



Unnamed: 0,ID,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7
0,1,No,0,2,30.443518,-1.165083,1,1,3,2
1,3,No,0,3,32.15935,-0.645617,2,3,2,2
2,4,No,0,2,30.431659,0.133583,1,1,4,2
3,5,No,0,0,26.588914,-0.212728,2,1,4,2
4,8,Yes,0,3,28.044332,-0.385883,1,1,2,2


In [2]:
# Cells for you to work and document as necessary - 
# definitely feel free to add more cells as you need
train_data['purchase'].unique()

array([0, 1], dtype=int64)

In [3]:
#Total number of users
train_data.shape[0]

84534

In [4]:
#Experiment set: number of users who received the promotion
train_data[train_data["Promotion"]=="Yes"].shape[0]

42364

In [5]:
#Control set: number of users who did not receive the promotion
train_data[train_data["Promotion"]=="No"].shape[0]

42170

In [6]:
#number of purchasers who received the promotion
train_data[(train_data["Promotion"]=="Yes")&(train_data["purchase"]==1)].shape[0]

721

In [7]:
#number of purchasers in the non-promotional group
train_data[(train_data["Promotion"]=="No")&(train_data["purchase"]==1)].shape[0]

319

## Function to find IRR
$$IRR = \frac{purch_{treat}}{cust_{treat}} - \frac{purch_{ctrl}}{cust_{ctrl}}$$


In [8]:
#Incremental Response Rate (IRR)
def getirr(train_data):
    """Print the treatment and control counts and the resulting IRR."""
    # treatment group: customers who received the promotion
    T_purch=train_data[(train_data["Promotion"]=="Yes")&(train_data["purchase"]==1)].shape[0]
    print("Treated purchase :  ", T_purch)
    T_cust=train_data[(train_data["Promotion"]=="Yes")].shape[0]
    print("Treated Customer :  ", T_cust)
    # control group: customers who did not receive the promotion
    C_purch=train_data[(train_data["Promotion"]=="No")&(train_data["purchase"]==1)].shape[0]
    print("Control purchase :  ", C_purch)
    C_cust=train_data[(train_data["Promotion"]=="No")].shape[0]
    print("Control Customer :  ", C_cust)
    # purchase rate with the promotion minus purchase rate without it
    irr=(T_purch/T_cust)-(C_purch/C_cust)
    print("Incremental Response Rate (IRR) :",irr)


In [9]:
getirr(train_data)

Treated purchase :   721
Treated Customer :   42364
Control purchase :   319
Control Customer :   42170
Incremental Response Rate (IRR) : 0.009454547819772702


In [10]:
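# What-if: keep the purchase counts, but assume a hypothetical smaller treated
# group (32170) and a larger control group (52170)
irr=(721/32170)-(319/52170)
irr
# holding purchases fixed, shrinking the treated group raises the incremental response rate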

0.016297560002214134

## Function to find Net Incremental Revenue (NIR)
$$NIR = (10\cdot purch_{treat} - 0.15 \cdot cust_{treat}) - 10 \cdot purch_{ctrl}$$


In [11]:
#function to find NIR
def getnir(train_data,priceofproduct,costperpromotion):
    """Print the treatment and control counts and the resulting NIR."""
    T_purch=train_data[(train_data["Promotion"]=="Yes")&(train_data["purchase"]==1)].shape[0]
    print("Treated purchase :  ", T_purch)
    T_cust=train_data[(train_data["Promotion"]=="Yes")].shape[0]
    print("Treated Customer :  ", T_cust)
    C_purch=train_data[(train_data["Promotion"]=="No")&(train_data["purchase"]==1)].shape[0]
    print("Control purchase :  ", C_purch)
    # treatment revenue net of promotion cost, minus control revenue
    NIR=((priceofproduct*T_purch)-(costperpromotion*T_cust))-(priceofproduct*C_purch)
    print("price of product: ",priceofproduct)
    print("cost per promotion: ",costperpromotion)
    print("Net Incremental Revenue (NIR) :", NIR)
    

In [12]:
getnir(train_data,10,0.15)

Treated purchase :   721
Treated Customer :   42364
Control purchase :   319
price of product:  10
cost per promotion:  0.15
Net Incremental Revenue (NIR) : -2334.5999999999995


In [13]:
# What-if: keep the purchase counts, but assume only 22364 promotions were sent
((721*10)-(0.15*22364))-(10*319)
# holding purchases fixed, shrinking the treated customer base raises the net incremental revenue

665.4000000000001
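
The what-if above can be turned around: for fixed purchase counts, how many promotions could be sent before NIR drops to zero? Setting NIR to zero in the formula gives $cust_{treat} = 10 \cdot (purch_{treat} - purch_{ctrl}) / 0.15$. The sketch below works this out under the (unrealistic) assumption that the purchase counts stay the same regardless of how many promotions are sent; the variable names are illustrative.

```python
PRICE = 10.0
COST_PER_PROMOTION = 0.15

# purchaser counts from the training data
purch_treat, purch_ctrl = 721, 319

# extra revenue attributable to the promotion
incremental_revenue = PRICE * (purch_treat - purch_ctrl)

# number of promotions at which the promotion cost eats up that revenue
break_even_promotions = incremental_revenue / COST_PER_PROMOTION

print(incremental_revenue)
print(round(break_even_promotions))
```

This is consistent with the two cells above: 42,364 promotions is above the break-even count of roughly 26,800 and gives a negative NIR, while the hypothetical 22,364 is below it and gives a positive one.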

In [14]:
train_data.columns

Index(['ID', 'Promotion', 'purchase', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6',
       'V7'],
      dtype='object')

In [15]:
def build_model():
    """
    Build a grid-searched pipeline: feature scaling followed by logistic regression.
    """
    # imported here so this cell runs on its own
    from sklearn.preprocessing import StandardScaler

    pipeline = Pipeline([
        # V1-V7 are numeric, so they are scaled rather than passed through a text vectorizer
        ('scale', StandardScaler()),
        ('clf', LogisticRegression())
    ])

    # adding more parameters gives better exploring power but
    # increases processing time in a combinatorial way
    parameters = {
        'clf__max_iter': [1000],
        'clf__C': [0.00001, 0.000001],
        'clf__penalty': ['l2'],
    }

    # grid search over the logistic regression pipeline
    cv = GridSearchCV(pipeline, param_grid=parameters)

    return cv
    

In [16]:
X=train_data[['V1', 'V2', 'V3', 'V4', 'V5', 'V6','V7']]
Y=train_data['purchase']

In [17]:
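def save_model(model, model_filepath):
    """Pickle the fitted model to model_filepath."""
    import pickle  # imported here so this cell runs on its own

    # the with-block closes the file, so no explicit close() is needed
    with open(model_filepath, 'wb') as pkl_file:
        pickle.dump(model, pkl_file)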

In [18]:
def evaluate_model(model, X_test, Y_test):
    """Print precision, recall and F1 per class for the model on the test split."""
    Y_pred = model.predict(X_test)
    print(classification_report(Y_test, Y_pred, digits=2))
            

In [19]:
model_filepath = "C:/Users/inamu/Downloads/Udacity Datascientist nanodegree/Starbucks"


In [20]:
X

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7
0,2,30.443518,-1.165083,1,1,3,2
1,3,32.159350,-0.645617,2,3,2,2
2,2,30.431659,0.133583,1,1,4,2
3,0,26.588914,-0.212728,2,1,4,2
4,3,28.044332,-0.385883,1,1,2,2
...,...,...,...,...,...,...,...
84529,1,30.084876,1.345672,1,1,3,1
84530,3,33.501485,-0.299306,1,1,4,1
84531,1,31.492019,1.085939,2,3,2,2
84532,1,37.766106,0.999361,2,2,1,2


In [21]:
def main():
    # uses the X and Y defined at notebook level
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)

    print('Building model...')
    # build_model() is the grid-search alternative
    model = LogisticRegression(solver='liblinear', random_state=0)

    print('Training model...')
    # train the model and show the results
    model.fit(X_train, Y_train)

    print('Evaluating model...')
    evaluate_model(model, X_test, Y_test)
    print("The Score of the model is : ", model.score(X_test, Y_test))

    # Saving is disabled; to enable it, uncomment the lines below:
    # print('Saving model...\n    MODEL: {}'.format(model_filepath))
    # save_model(model, model_filepath)
    # print('Trained model saved!')

    return model

In [22]:
len(train_data[train_data["purchase"]==0])

83494

# Training the model, with a balanced dataset


In [23]:
len(Y_train[Y_train==1])

NameError: name 'Y_train' is not defined

In [24]:
len(Y_train[Y_train==0])

NameError: name 'Y_train' is not defined

In [25]:
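# Downsample the non-purchasers to 200 rows and keep every purchaser.
# With 1040 purchasers in the training data this gives 200 vs 1040 rows,
# so the classes are still imbalanced, now in favour of purchase == 1.
notpurchased=(train_data[train_data['purchase']==0]).sample(n=200)
purchased= train_data[train_data['purchase']==1]
df=pd.concat([notpurchased, purchased], axis=0)
df=df.sample(frac = 1)  # shuffle the rows
X=df[['V1', 'V2', 'V3', 'V4', 'V5', 'V6','V7']]
Y=df['purchase']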
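
For comparison, a strictly balanced sample would draw as many rows from the majority class as there are rows in the minority class. The following is a minimal sketch of that idea on synthetic data; `downsample_majority` is a hypothetical helper, and the toy frame only mimics the shape of the real data (a rare `purchase == 1` class).

```python
import numpy as np
import pandas as pd


def downsample_majority(df, label_col="purchase", random_state=0):
    """Sample every class down to the size of the smallest class, then shuffle."""
    n_minority = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Synthetic stand-in for the training data: roughly 5% purchasers
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "V1": rng.integers(0, 4, size=1000),
    "purchase": (rng.random(1000) < 0.05).astype(int),
})

balanced = downsample_majority(toy)
print(balanced["purchase"].value_counts())
```

Downsampling throws away most of the majority class; passing `class_weight='balanced'` to `LogisticRegression` is another option that keeps all rows.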

In [26]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4)
print('Building model...')
model_balance =  LogisticRegression(solver='liblinear', random_state=0)
print('Training model...')
#train the model and show the results
model_balance.fit(X_train, Y_train)
print('Evaluating model...')
evaluate_model(model_balance, X_test, Y_test)
print("The Score of the model is : ",model_balance.score(X_test,Y_test))

Building model...
Training model...
Evaluating model...
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        78
          1       0.84      1.00      0.91       418

avg / total       0.71      0.84      0.77       496

The Score of the model is :  0.842741935483871




In [27]:
model=main()


Building model...
Training model...
Evaluating model...
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       106
          1       0.83      1.00      0.91       514

avg / total       0.69      0.83      0.75       620

The Score of the model is :  0.8290322580645161


In [28]:
model.score(X_test,Y_test)

0.842741935483871

In [29]:
model.predict_proba(X_test)[0][0]

0.13695307382597321

# Improving the model based on techniques learned from the YouTube video

In [49]:
# Training my model

log_reg = LogisticRegression(random_state=10, solver = 'lbfgs')

log_reg.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=10, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [51]:
# Methods we can use in Logistic

# predict - Predict class labels for samples in X
y_pred = log_reg.predict(X_test)

# predict_proba - Probability estimates
pred_proba = log_reg.predict_proba(X_test)

# score - Returns the mean accuracy on the given test data and labels (used below)

# coef_ - Coefficient of the features in the decision function
log_reg.coef_

array([[ 0.13886515, -0.01068873, -0.02520097,  0.68318732,  0.15353304,
        -0.02771652,  0.01762488]])
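
To make the coefficient array easier to read, the values can be paired with the feature names and ordered by magnitude. The sketch below copies the coefficients from the output above; note that the features were not standardized before fitting, so the magnitudes are not directly comparable across features and the ranking is only a rough guide.

```python
import pandas as pd

features = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']
# coefficients copied from the log_reg.coef_ output above
coefficients = [0.13886515, -0.01068873, -0.02520097, 0.68318732,
                0.15353304, -0.02771652, 0.01762488]

coef_by_feature = pd.Series(coefficients, index=features)

# order the features by absolute coefficient size, largest first
order = coef_by_feature.abs().sort_values(ascending=False).index
ranked = coef_by_feature.reindex(order)
print(ranked)
```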

## 9. Evaluating the Model

In [52]:
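# Accuracy on Train
# Note: as run, this line scores the test split, so it repeats the testing accuracy;
# pass X_train, Y_train instead to measure the true training accuracy.
print("The Training Accuracy is: ", log_reg.score(X_test, Y_test))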

# Accuracy on Test
print("The Testing Accuracy is: ", log_reg.score(X_test, Y_test))


# Classification Report
print(classification_report(Y_test, y_pred))

The Training Accuracy is:  0.842741935483871
The Testing Accuracy is:  0.842741935483871
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        78
          1       0.84      1.00      0.91       418

avg / total       0.71      0.84      0.77       496





In [48]:
test_data = pd.read_csv('./Test.csv')
test_data.head()

Unnamed: 0,ID,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7
0,2,No,0,1,41.37639,1.172517,1,1,2,2
1,6,Yes,0,1,25.163598,0.65305,2,2,2,2
2,7,Yes,0,1,26.553778,-1.597972,2,3,4,2
3,10,No,0,2,28.529691,-1.078506,2,3,2,2
4,12,No,0,2,32.378538,0.479895,2,2,1,2


In [31]:
getirr(test_data)

Treated purchase :   339
Treated Customer :   20748
Control purchase :   141
Control Customer :   20902
Incremental Response Rate (IRR) : 0.009593158278250108


In [32]:
getnir(test_data,10,0.15)

Treated purchase :   339
Treated Customer :   20748
Control purchase :   141
price of product:  10
cost per promotion:  0.15
Net Incremental Revenue (NIR) : -1132.1999999999998


### We need to increase the IRR and NIR using the model we just created on the test data 

In [33]:
test_data.columns

Index(['ID', 'Promotion', 'purchase', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6',
       'V7'],
      dtype='object')

In [34]:
len(testdata_Y[testdata_Y==1])

NameError: name 'testdata_Y' is not defined

In [35]:
testdata_X=test_data[['V1', 'V2', 'V3', 'V4', 'V5', 'V6','V7']]
testdata_Y=test_data["purchase"]
print("The Score of the model is : ",model_balance.score(testdata_X,testdata_Y))
a=model_balance.predict(testdata_X)
len(a[a==1])

The Score of the model is :  0.011524609843937574


41650
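
The score above is what a model that predicts 1 for every row would obtain: all 41,650 predictions are 1, so the accuracy equals the share of purchasers in the test file. A short check using the counts printed by `getirr(test_data)`:

```python
# counts from the getirr(test_data) output
treated_purchasers, control_purchasers = 339, 141
treated_customers, control_customers = 20748, 20902

purchasers = treated_purchasers + control_purchasers
rows = treated_customers + control_customers

# accuracy of a classifier that labels every row as a purchaser
always_one_accuracy = purchasers / rows
print(rows, purchasers, always_one_accuracy)
```

The result is about 0.0115, matching the reported score, which suggests the model learned nothing beyond "predict 1".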

In [36]:
testinput=test_data[['V1', 'V2', 'V3', 'V4', 'V5', 'V6','V7']]
yes_promotions=test_data[test_data["Promotion"]=="Yes"]
yes_promotions2=yes_promotions[['V1', 'V2', 'V3', 'V4', 'V5', 'V6','V7']]

In [37]:
testpredictions=model.predict(testinput)
testpredictions[testpredictions==1]

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [38]:
predictions_probability=model.predict_proba(yes_promotions2)
predictions=model.predict(yes_promotions2)

In [39]:
predictions[predictions==0]

array([], dtype=int64)

### Highly doubt the capability of the model: it predicts 1 for every row

In [40]:
predictions_df=pd.DataFrame(predictions)
predictions_df

Unnamed: 0,0
0,1
1,1
2,1
3,1
4,1
...,...
20743,1
20744,1
20745,1
20746,1


In [41]:
output=pd.concat([yes_promotions['ID'].reset_index(), predictions_df], axis=1)

output  # the prediction output alongside the ID column

Unnamed: 0,index,ID,0
0,1,6,1
1,2,7,1
2,5,13,1
3,8,29,1
4,14,43,1
...,...,...,...
20743,41644,126165,1
20744,41646,126174,1
20745,41647,126176,1
20746,41648,126177,1


In [42]:
output=output.rename(columns={0: '0', 1: '1'})
output

Unnamed: 0,index,ID,0
0,1,6,1
1,2,7,1
2,5,13,1
3,8,29,1
4,14,43,1
...,...,...,...
20743,41644,126165,1
20744,41646,126174,1
20745,41647,126176,1
20746,41648,126177,1


In [43]:
# pass the boolean mask directly; wrapping it in an extra list raises a ValueError
output[output["0"]>0.9]

In [44]:
a=output[output["0"]>0.99]
a=a['index']  # keep only the original row positions
#a=a.to_frame()

In [45]:
test_data

Unnamed: 0,ID,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7
0,2,No,0,1,41.376390,1.172517,1,1,2,2
1,6,Yes,0,1,25.163598,0.653050,2,2,2,2
2,7,Yes,0,1,26.553778,-1.597972,2,3,4,2
3,10,No,0,2,28.529691,-1.078506,2,3,2,2
4,12,No,0,2,32.378538,0.479895,2,2,1,2
...,...,...,...,...,...,...,...,...,...,...
41645,126166,No,0,1,33.073275,-0.126150,2,2,2,2
41646,126174,Yes,0,0,32.065720,0.999361,2,3,2,1
41647,126176,Yes,1,3,27.691221,1.172517,2,2,1,2
41648,126177,Yes,0,2,32.306427,0.566472,1,2,4,1


In [46]:
def promotion_strategy(df):
    '''
    INPUT 
    df - a dataframe with *only* the columns V1 - V7 (same as train_data)

    OUTPUT
    promotion_df - np.array with the values
                   'Yes' or 'No' related to whether or not an 
                   individual should receive a promotion 
                   should be the length of df.shape[0]
                
    Ex:
    INPUT: df
    
    V1	V2	  V3	V4	V5	V6	V7
    2	30	-1.1	1	1	3	2
    3	32	-0.6	2	3	2	2
    2	30	0.13	1	1	4	2
    
    OUTPUT: promotion
    
    array(['Yes', 'Yes', 'No'])
    indicating the first two users would receive the promotion and 
    the last should not.
    '''
    
    
    
    
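    # TODO: assign `promotion` here (an np.array of 'Yes'/'No', one entry per row of df);
    # until that is done, calling this function raises a NameError
    return promotion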

In [47]:
# This will test your results, and provide you back some information 
# on how well your promotion_strategy will work in practice
test_results(promotion_strategy)

NameError: name 'promotion' is not defined
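
The error comes from the empty function body: `promotion` is never assigned. One possible way to fill it in is to wrap a fitted classifier and threshold its predicted purchase probability. The sketch below is only an illustration under several assumptions: `make_promotion_strategy` and `StubModel` are hypothetical names, the stub stands in for a real fitted model, and column 1 of `predict_proba` is assumed to correspond to `purchase == 1`. Thresholding purchase probability is also just one possible rule; it does not by itself measure how much the promotion changes behaviour.

```python
import numpy as np
import pandas as pd

FEATURES = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']


def make_promotion_strategy(model, threshold=0.5):
    """Wrap a fitted classifier into a promotion_strategy(df) function."""
    def promotion_strategy(df):
        # probability of purchase == 1, assumed to be the second predict_proba column
        proba = model.predict_proba(df[FEATURES])[:, 1]
        return np.where(proba >= threshold, 'Yes', 'No')
    return promotion_strategy


class StubModel:
    """Hypothetical stand-in for a fitted classifier; not a real model."""
    def predict_proba(self, X):
        # arbitrary rule for the demo: high probability when V4 == 2
        p = np.where(X['V4'].to_numpy() == 2, 0.8, 0.2)
        return np.column_stack([1 - p, p])


# the three example rows from the docstring above
example = pd.DataFrame(
    [[2, 30, -1.1, 1, 1, 3, 2],
     [3, 32, -0.6, 2, 3, 2, 2],
     [2, 30, 0.13, 1, 1, 4, 2]],
    columns=FEATURES,
)

promotion_strategy = make_promotion_strategy(StubModel(), threshold=0.5)
promotion = promotion_strategy(example)
print(promotion)
```

With a real model, the threshold would need tuning against IRR and NIR on held-out data rather than being left at 0.5.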