#  AWS Machine Learning Engineer Nanodegree Scholarship – Udacity 
# Optimizing Starbucks rewards using Machine Learning 
## [Capstone Project]

## Data Preprocessing for Visualization Only
Data provided by Starbucks is preprocessed for visualization and converted into one table:

| id | gender | age | income | difficulty | reward | web | mobile | social | bogo | disc | info | amount_spend |
|----|--------|-----|--------|------------|--------|-----|--------|--------|------|------|------|--------------|

**This is necessary for:**
* convenient data exploration
* visualization.

**Fields:**
* **id, gender, age, and income** are copied from ``profile.json``
* **difficulty, reward, web, mobile, social** are from ``portfolio.json``
* Finally, **bogo, disc, info**, and **amount_spend** are calculated from ``transcript.json``.

**Values in bogo, disc, and info encode the customer's response to each offer type as an integer from 0 to 3:**
+ **0**: the offer was not given to the customer.  
+ **1**: the customer received the offer but never viewed it.  
+ **2**: the offer was viewed.  
+ **3**: the offer was viewed and completed (for informational offers, this means the offer was viewed and the customer made a transaction within the “influence” period specified by Starbucks).
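
As a minimal sketch of how these four outcomes can be read back from the table, the snippet below assumes integer codes 0 to 3 in the order of the list above, which is how `parse_offers` and `transform` later in this notebook assign them. The `RESPONSE_LABELS` name and the toy rows are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical label mapping; the codes follow the order of the list above.
RESPONSE_LABELS = {
    0: 'not offered',
    1: 'received, not viewed',
    2: 'viewed',
    3: 'viewed and completed',
}

# Toy rows in the shape of the final table (response columns only).
toy = pd.DataFrame({'bogo': [0, 3], 'disc': [2, 0], 'info': [1, 1]})

# Translate the numeric codes into readable labels, column by column.
labelled = toy.apply(lambda col: col.map(RESPONSE_LABELS))
print(labelled)
```

Keeping the codes numeric in the stored table and mapping them to labels only when plotting preserves their ordering, which the later `max()` aggregation relies on.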


In [1]:
import os
import time
import pandas as pd

**Function:** ```def parse_offers(offers, transactions, offer_name)``` 

* Extracts a single customer's responses to the offers of one type.
* param offers: dict. All offers of a single type (bogo, disc, or info), keyed by offer id.
* param transactions: DataFrame. All transcript events recorded for a single customer.
* param offer_name: str. Name of the offer type (bogo, disc, or info).
* return: list of dicts, one per offer the customer received, each holding the response code and the offer's difficulty, reward, and channels.
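
The core of the response logic is a time-window check. The sketch below isolates it for a single receipt of a single offer, assuming that transcript times are in hours and offer durations are in days (which is what the `* 24` factor in `parse_offers` implies). The function name `classify_receipt` is hypothetical, and the sketch deliberately leaves out the extra rule for informational offers and the loop over repeated receipts.

```python
def classify_receipt(receive_time, duration_days, view_times, completion_times):
    """Return 1 (received only), 2 (viewed) or 3 (viewed and completed)."""
    # The offer is valid from the receipt until duration_days later (times in hours).
    deadline = receive_time + duration_days * 24

    # Only views that happen while the offer is valid count.
    views = [t for t in view_times if receive_time <= t <= deadline]
    if not views:
        return 1

    # A completion counts only if it happens after the earliest valid view
    # and before the offer expires.
    first_view = min(views)
    if any(first_view <= t <= deadline for t in completion_times):
        return 3
    return 2


# Offer received at hour 168 with a 7-day duration, so it expires at hour 336.
print(classify_receipt(168, 7, [], []))        # 1
print(classify_receipt(168, 7, [170], []))     # 2
print(classify_receipt(168, 7, [170], [200]))  # 3
print(classify_receipt(168, 7, [170], [400]))  # 2
```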

In [2]:
def parse_offers(offers, transactions, offer_name):
    
    data = []

    for _id, info in offers.items():
        # get all the transactions associated with the offer (by id)
        history = transactions[(transactions['value'] == {'offer id': _id}) |
                               (transactions['value'] == {'offer_id': _id, 'reward': info['reward']})]
        if not history.empty:  # if the offer was received
            receive_time = history[history['event'] == 'offer received']['time'].values
            viewed = history[history['event'] == 'offer viewed']
            completed = history[history['event'] == 'offer completed']
            trx_time = transactions[transactions['event'] == 'transaction']['time'].values

            response = 1  # default: the offer was received but not viewed
            # check whether the offer was viewed and completed within the influence period specified by Starbucks
            for rt in receive_time:
                view_time = viewed[(viewed['time'] >= rt) &
                                   (viewed['time'] <= rt + info['duration'] * 24)]['time'].to_list()
                if view_time:
                    # if viewed and responded during the influence period
                    if not completed[(completed['time'] >= view_time[0]) &
                                     (completed['time'] <= rt + info['duration'] * 24)].empty or \
                            (offer_name == 'info' and any(
                                view_time[0] <= tt <= rt + info['duration'] * 24 for tt in trx_time)):
                        response = 3
                        break   # only the best response is recorded
                    # if only viewed during the influence period
                    else:
                        response = 2

            data.append({
                'response': response,
                'difficulty': info['difficulty'],
                'reward': info['reward'],
                'channels': info['channels']
            })

    return data

**Function:** ```def transform(profile, transcript, offers)```
* Transforms the input files provided by Starbucks into one table suitable for further data analysis.
* The new table has the following columns:
  ```
  | id | gender | age | income | difficulty | reward | web | mobile | social | bogo | disc | info | amount_spend |
  |----|--------|-----|--------|------------|--------|-----|--------|--------|------|------|------|--------------|
  ```

* param profile: DataFrame. profile.json converted to Pandas DataFrame
* param transcript: DataFrame. transcript.json converted to Pandas DataFrame
* param offers: dict. Maps each offer type name (bogo, disc, info) to the dictionary of offers of that type.
* return: list of dictionaries, each of which corresponds to a row in the final table.
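
Two of the derived fields can be illustrated on toy data: the total amount spent, which is summed from the dict-valued `value` column of the transcript, and the 0/1 channel flags. This is only a sketch under the assumption that transaction events carry a dict of the form `{'amount': ...}`, as the `transform` code below expects. The `events` frame and the `channels` list are made up for the example.

```python
import pandas as pd

# Toy transcript for one customer; 'value' holds a dict, as in transcript.json.
events = pd.DataFrame({
    'event': ['offer received', 'transaction', 'transaction'],
    'value': [{'offer id': 'abc'}, {'amount': 4.5}, {'amount': 10.0}],
    'time': [0, 6, 30],
})

# Total spend: pull 'amount' out of every transaction dict and add them up.
spent = events.loc[events['event'] == 'transaction', 'value']
amount_spend = sum(v['amount'] for v in spent)

# Channel flags: 1 if the offer was sent through the channel, else 0.
channels = ['email', 'mobile', 'social']
flags = {ch: int(ch in channels) for ch in ('web', 'mobile', 'social')}

print(amount_spend)  # 14.5
print(flags)         # {'web': 0, 'mobile': 1, 'social': 1}
```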



In [3]:
def transform(profile, transcript, offers):
   
    data = []

    # iterate over customers and build one row per (customer, received offer) pair,
    # so a customer appears several times if they received several offers
    for index, person in profile.iterrows():
        transactions = transcript[transcript['person'] == person['id']]
        # sum all the transactions made by a customer
        amount_spend = transactions[transactions['event'] == 'transaction']['value'].to_list()
        amount_spend = sum([amount['amount'] for amount in amount_spend])

        for name, offer in offers.items():
            parsed = parse_offers(offer, transactions, name)
            for po in parsed:
                data.append({
                    'id': person['id'],
                    'gender': person['gender'],
                    'age': person['age'],
                    'income': person['income'],
                    'difficulty': po['difficulty'],
                    'reward': po['reward'],
                    'web': 1 if 'web' in po['channels'] else 0,          # email column is not included (base case)
                    'mobile': 1 if 'mobile' in po['channels'] else 0,
                    'social': 1 if 'social' in po['channels'] else 0,
                    'bogo': po['response'] if name == 'bogo' else 0,
                    'disc': po['response'] if name == 'disc' else 0,
                    'info': po['response'] if name == 'info' else 0,
                    'amount_spend': amount_spend
                })

        if not (index + 1) % 1000:
            print(f'Processed {index + 1:d} entries')

    return data


**Function:** ```def preprocess(portfolio, profile, transcript, save_dir='data')```
* Preprocesses and cleans the input data, then merges the input sets into one table.
* The output table is saved as a pickle file.
* The saved table (Data_cleaned_EDA.pkl) has one row per customer id, with no duplicates, and is intended for visualization.
* param portfolio: DataFrame. portfolio.json converted to Pandas DataFrame.
* param profile: DataFrame. profile.json converted to Pandas DataFrame.
* param transcript: DataFrame. transcript.json converted to Pandas DataFrame.
* param save_dir: str. Output files saving location.
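
The "one row per customer" property comes from grouping by `id` and taking the column-wise maximum, so each customer keeps their best response per offer type. The toy frame below is hypothetical and only sketches that step. It assumes, as in the real table, that the per-customer columns such as `gender` and `amount_spend` are identical across a customer's rows.

```python
import pandas as pd

# Customer 'a' received two offers, customer 'b' one.
rows = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'gender': ['F', 'F', 'M'],
    'bogo': [3, 0, 0],
    'disc': [0, 1, 2],
    'info': [0, 0, 0],
    'amount_spend': [20.0, 20.0, 5.5],
})

# The column-wise maximum per customer collapses the rows into one per id.
collapsed = rows.groupby(['id']).max().reset_index()
print(collapsed)
```

Because the response codes are ordered from 0 (not offered) to 3 (viewed and completed), the maximum is the strongest response the customer gave to any offer of that type.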


In [5]:
def preprocess(portfolio, profile, transcript, save_dir='data'):
    
    # read offers into dictionaries of the form:
    # {'offer id': {'duration': int, 'reward': int, 'difficulty': int, 'channels': list}}
    bogo = portfolio[portfolio['offer_type'] ==
                     'bogo'][['id', 'duration', 'reward', 'difficulty', 'channels']]. \
                     set_index('id').to_dict(orient='index')
    disc = portfolio[portfolio['offer_type'] ==
                     'discount'][['id', 'duration', 'reward', 'difficulty', 'channels']]. \
                     set_index('id').to_dict(orient='index')
    info = portfolio[portfolio['offer_type'] ==
                     'informational'][['id', 'duration', 'reward', 'difficulty', 'channels']]. \
                     set_index('id').to_dict(orient='index')

    # Drop rows from the customer table with no information but a customer's id
    profile = profile.drop(profile[(profile['gender'].isna()) &
                                   (profile['age'] == 118) &
                                   (profile['income'].isna())].index).reset_index(drop=True)

    # apply data transformation
    data = pd.DataFrame(transform(profile, transcript, {'bogo': bogo, 'disc': disc, 'info': info}))


    # keep one row per customer (best response per offer type) for easier visualization, then save
    Data_EDA = data.drop(columns=['difficulty', 'reward'])
    Data_EDA = Data_EDA.groupby(['id']).max().reset_index()
    os.makedirs(save_dir, exist_ok=True)  # make sure the output directory exists
    Data_EDA.to_pickle(os.path.join(save_dir, 'Data_cleaned_EDA.pkl'))
    #Data_EDA.to_csv('Data_cleaned_EDA.csv')

    print('Saving Data_cleaned_EDA.pkl')


In [6]:
if __name__ == "__main__":
    start = time.time()
    portfolio = pd.read_json('portfolio.json', orient='records', lines=True)
    profile = pd.read_json('profile.json', orient='records', lines=True)
    transcript = pd.read_json('transcript.json', orient='records', lines=True)

    preprocess(portfolio, profile, transcript)
    print(f'Preprocessing time {(time.time() - start):.2f} s')

Processed 1000 entries
Processed 2000 entries
Processed 3000 entries
Processed 4000 entries
Processed 5000 entries
Processed 6000 entries
Processed 7000 entries
Processed 8000 entries
Processed 9000 entries
Processed 10000 entries
Processed 11000 entries
Processed 12000 entries
Processed 13000 entries
Processed 14000 entries
Saving Data_cleaned_EDA.pkl
Preprocessing time 550.54 s
