# Starbucks Capstone Challenge

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a discount offer, "buy 10 dollars, get 2 dollars off", on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars, get 2 dollars off" offer but never open it during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
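The caveat above can be sketched in code. The following is a minimal illustration on a hypothetical event log (the column names `customer_id`, `offer_id`, `event`, and `time` are assumptions chosen for this sketch), and it simplifies by assuming each customer receives a given offer only once. A completion is treated as influenced only when a view was recorded at or before the completion.

```python
import pandas as pd

# Hypothetical event log (time in hours): customer 1 completes without viewing,
# customer 2 views the offer before completing it.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "offer_id": [7, 7, 7, 7, 7],
    "event": ["offer received", "offer completed",
              "offer received", "offer viewed", "offer completed"],
    "time": [0, 48, 0, 12, 60],
})

# Earliest timestamp of each event type per (customer, offer) pair.
first_times = events.pivot_table(
    index=["customer_id", "offer_id"],
    columns="event",
    values="time",
    aggfunc="min",
)

# A completion counts as influenced only if a view happened at or before it.
first_times["influenced"] = (
    first_times["offer viewed"].notna()
    & (first_times["offer viewed"] <= first_times["offer completed"])
)
print(first_times)
```

In the real data a customer may receive the same offer more than once, so each receipt would need its own validity window before applying a rule like this.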

### Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

### Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Alternatively, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (e.g., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
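As a rough illustration of the heuristic approach, response rates can be compared per demographic group and offer. The data below is invented for the sketch, and the `responded` column is an assumed 0/1 label that would have to be derived from the transcript first.

```python
import pandas as pd

# Hypothetical outcomes: one row per (customer, offer) with an assumed 0/1 label.
outcomes = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M"],
    "offer": ["A", "A", "B", "B", "A", "B"],
    "responded": [1, 1, 0, 1, 0, 1],
})

# Response rate for each (group, offer) pair, with offers as columns.
rates = outcomes.groupby(["gender", "offer"])["responded"].mean().unstack()

# Heuristic: for each group, pick the offer with the highest response rate.
best_offer = rates.idxmax(axis=1)
print(rates)
print(best_offer)
```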

# Business Questions 

Target marketing is getting more and more popular nowadays due to its advantages compared to some traditional ways of marketing. By breaking the market into different segments, target marketing saves companies unnecessary effort and enables them to focus on the key segments consisting of customers who match the products and services the best. 

To focus on key customer groups and adjust our marketing strategies to each of them, we first need to identify the features of these groups. Based on that, we can apply different marketing/advertising strategies per group and thus achieve the best ratio of profit to marketing cost.

In this project, I will use the Starbucks data to identify different user groups and answer these two questions:

1. Which groups of people are most responsive to each of the three offer types: discount, buy one get one free (bogo), and informational?

2. How can we best present each type of offer to the users, i.e., through which channel (email, mobile, web, or social)?

# Data Understanding

The data is contained in three files:

* portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer, i.e., BOGO, discount, or informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings) - channels through which the offer is sent (web, email, mobile, social)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (i.e., transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value (dict of strings) - either an offer id or transaction amount depending on the record
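Note that the units differ between files: `duration` is given in days, while `time` is given in hours. A minimal sketch of computing when an offer expires, using small made-up frames (the `expires_at` column name is an assumption for illustration):

```python
import pandas as pd

HOURS_PER_DAY = 24

# Hypothetical receipts (time in hours) and offer metadata (duration in days).
received = pd.DataFrame({"offer_id": ["a", "b"], "time": [0, 168]})
offers = pd.DataFrame({"offer_id": ["a", "b"], "duration": [7, 3]})

# Attach each offer's duration to its receipt and convert days to hours.
window = received.merge(offers, on="offer_id")
window["expires_at"] = window["time"] + window["duration"] * HOURS_PER_DAY
print(window)
```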

# Data Wrangling and Exploration 

With the business questions in mind, I first take an overall look at the datasets, build up some intuition about how the data can answer my questions, and prepare the data for further analysis.

In this part, I clean and prepare the data by identifying abnormal data points (such as unlikely values), handling missing data, transforming data types where necessary, and combining data from the different datasets for later analysis.

First of all, I import the necessary libraries and read in the original data files.

In [134]:
# Import necessary libraries
import pandas as pd
import numpy as np
import math
import json
%matplotlib inline

from datetime import datetime 

In [560]:
# Read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

### Identifying abnormal data

After reading in the data, I look at the data shapes and some basic statistics such as count, mean, max, and min. This gives an overview of the data and may reveal abnormalities if anything stands out.

In [331]:
# Take a look at the shapes of the datasets
portfolio.shape, profile.shape, transcript.shape

((10, 6), (17000, 5), (306534, 4))

In [332]:
# Take a look at the data file one by one
portfolio

| | channels | difficulty | duration | id | offer_type | reward |
|---|---|---|---|---|---|---|
| 0 | [email, mobile, social] | 10 | 7 | ae264e3637204a6fb9bb56bc8210ddfd | bogo | 10 |
| 1 | [web, email, mobile, social] | 10 | 5 | 4d5c57ea9a6940dd891ad53e9dbe8da0 | bogo | 10 |
| 2 | [web, email, mobile] | 0 | 4 | 3f207df678b143eea3cee63160fa8bed | informational | 0 |
| 3 | [web, email, mobile] | 5 | 7 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | bogo | 5 |
| 4 | [web, email] | 20 | 10 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | discount | 5 |
| 5 | [web, email, mobile, social] | 7 | 7 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | discount | 3 |
| 6 | [web, email, mobile, social] | 10 | 10 | fafdcd668e3743c1bb461111dcafc2a4 | discount | 2 |
| 7 | [email, mobile, social] | 0 | 3 | 5a8bc65990b245e5a138643cd4eb9837 | informational | 0 |
| 8 | [web, email, mobile, social] | 5 | 5 | f19421c1d4aa40978ebb69ca19b0e20d | bogo | 5 |
| 9 | [web, email, mobile] | 10 | 7 | 2906b810c7d4411798c6938adc9daaa5 | discount | 2 |


In [333]:
profile.head()

| | age | became_member_on | gender | id | income |
|---|---|---|---|---|---|
| 0 | 118 | 20170212 | | 68be06ca386d4c31939f3a4f0e3dd783 | |
| 1 | 55 | 20170715 | F | 0610b486422d4921ae7d2bf64640c50b | 112000.0 |
| 2 | 118 | 20180712 | | 38fe809add3b4fcf9315a9694bb96ff5 | |
| 3 | 75 | 20170509 | F | 78afa995795e4d85b5d9ceeca43f5fef | 100000.0 |
| 4 | 118 | 20170804 | | a03223e636434f42ac4c3df47e8bac43 | |


In [334]:
transcript.head()

| | event | person | time | value |
|---|---|---|---|---|
| 0 | offer received | 78afa995795e4d85b5d9ceeca43f5fef | 0 | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} |
| 1 | offer received | a03223e636434f42ac4c3df47e8bac43 | 0 | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} |
| 2 | offer received | e2127556f4f64592b11af22de27a7932 | 0 | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} |
| 3 | offer received | 8ec6ce2a7e7949b1bf142def7d0e0586 | 0 | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} |
| 4 | offer received | 68617ca6246f4fbc85e91a2a49552598 | 0 | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} |


In [335]:
# Take a look at some basic statistics of the data
portfolio.describe(), profile.describe(), transcript.describe()

(       difficulty   duration     reward
 count   10.000000  10.000000  10.000000
 mean     7.700000   6.500000   4.200000
 std      5.831905   2.321398   3.583915
 min      0.000000   3.000000   0.000000
 25%      5.000000   5.000000   2.000000
 50%      8.500000   7.000000   4.000000
 75%     10.000000   7.000000   5.000000
 max     20.000000  10.000000  10.000000,
                 age  became_member_on         income
 count  17000.000000      1.700000e+04   14825.000000
 mean      62.531412      2.016703e+07   65404.991568
 std       26.738580      1.167750e+04   21598.299410
 min       18.000000      2.013073e+07   30000.000000
 25%       45.000000      2.016053e+07   49000.000000
 50%       58.000000      2.017080e+07   64000.000000
 75%       73.000000      2.017123e+07   80000.000000
 max      118.000000      2.018073e+07  120000.000000,
                 time
 count  306534.000000
 mean      366.382940
 std       200.326314
 min         0.000000
 25%       186.000000
 50%       

One thing in the profile data caught my attention: the maximum age is 118 years. I decided to look into this feature more closely to see whether there is anything special about these individuals. After all, people may give fake information when registering in the app, which would make their demographic information less meaningful for our analysis.

To identify these users with this unusually high age, I will first create a subset with only users who have an age of 118 and then check some basic statistics of this sub-dataset.

In [336]:
# Create a subset from profile which only contains users with an age of 118
age_118 = profile[profile['age'] == 118]

# Look up some basic information of the columns in this sub-dataset 
age_118.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 16994
Data columns (total 5 columns):
age                 2175 non-null int64
became_member_on    2175 non-null int64
gender              0 non-null object
id                  2175 non-null object
income              0 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 102.0+ KB


In [337]:
# Take a look at some basic statistics of the sub-dataset
age_118.describe()

| | age | became_member_on | income |
|---|---|---|---|
| count | 2175.0 | 2175.0 | 0.0 |
| mean | 118.0 | 20168040.0 | |
| std | 0.0 | 10091.05 | |
| min | 118.0 | 20130800.0 | |
| 25% | 118.0 | 20160700.0 | |
| 50% | 118.0 | 20170730.0 | |
| 75% | 118.0 | 20171230.0 | |
| max | 118.0 | 20180730.0 | |


It turns out there are 2175 users with an age of 118. More interestingly, there is no recorded gender or income data for these users in the dataset at all. These users are therefore candidates for removal, as they provide no meaningful demographic information for the analysis that follows. Note that the filter in the next cell is commented out, so these rows are still present in the outputs below.

In [338]:
# Exclude the 118-year-old users discussed above (currently commented out)
#profile = profile[profile['age'] < 118]

# Take another look at the new profile file
#profile.describe()
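For illustration, here are two common ways to handle a placeholder value such as 118, sketched on a small stand-in frame rather than the real `profile` data. The frame `profile_demo` and the decision to treat 118 as a sentinel are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

PLACEHOLDER_AGE = 118  # assumed sentinel for "age not provided"

# Small stand-in for the profile data (values mirror the first rows shown above).
profile_demo = pd.DataFrame({
    "age": [118, 55, 118, 75],
    "gender": [None, "F", None, "F"],
    "income": [np.nan, 112000.0, np.nan, 100000.0],
})

# Option 1: drop the placeholder rows entirely.
dropped = profile_demo[profile_demo["age"] != PLACEHOLDER_AGE]

# Option 2: keep the rows but mark the age as missing.
masked = profile_demo.assign(
    age=profile_demo["age"].where(profile_demo["age"] != PLACEHOLDER_AGE)
)
print(dropped)
print(masked)
```

Dropping is simpler, but masking keeps these customers' transactions available for analyses that do not need demographics.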

### Handling missing data

Having looked at the abnormal age data in the profile dataset, I now check for missing data in the profile and transcript files. Note that the portfolio file contains only 10 rows, none with missing data.

In [339]:
profile.isnull().mean(), transcript.isnull().mean()

(age                 0.000000
 became_member_on    0.000000
 gender              0.127941
 id                  0.000000
 income              0.127941
 dtype: float64, event     0.0
 person    0.0
 time      0.0
 value     0.0
 dtype: float64)

The 'gender' and 'income' columns in the profile data each have almost 13% missing values (0.127941, which equals 2175 / 17000, the share of users with an age of 118), whereas the transcript data has none. For now I will move on with the data cleaning and come back to the missing profile data later.
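A quick way to test whether the missing demographics and the age of 118 fall on the same rows is sketched below. The toy frame `profile_demo` is a stand-in; on the real data the same comparison would be run on `profile`.

```python
import numpy as np
import pandas as pd

# Share of profiles with age 118, using the counts reported above.
share_age_118 = 2175 / 17000

# Stand-in frame with the same pattern as the first profile rows.
profile_demo = pd.DataFrame({
    "age": [118, 55, 118, 75],
    "gender": [None, "F", None, "F"],
    "income": [np.nan, 112000.0, np.nan, 100000.0],
})

placeholder_age = profile_demo["age"] == 118
missing_demographics = (
    profile_demo["gender"].isna() & profile_demo["income"].isna()
)

# True when the two conditions select exactly the same rows.
same_rows = bool((placeholder_age == missing_demographics).all())
print(round(share_age_118, 6), same_rows)
```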

### Data transforming and reshaping 

Next, I transform and reshape the data. The steps are:

1. Transforming the 'became_member_on' column in the profile dataset to only contain the year;

2. Cleaning any columns that hold multiple values, such as the 'value' column in the transcript dataset;

3. Mapping the id columns in the datasets to more easily readable values;

4. Creating dummy variables where appropriate;

5. Merging the datasets into one dataset containing the features needed for the following machine learning process;

6. Finally, dropping any duplicated rows in the dataset.

**1. Transform the 'became_member_on' column in the profile data**

Since the day and month don't provide much interpretable information, I keep only the year in the became_member_on column of the profile dataset. Later I may use the year to investigate whether there is a correlation between membership length and customer behaviour.

In [561]:
# Parse became_member_on as a date and extract only the year
profile['became_member_on'] = profile['became_member_on'].apply(lambda x: datetime.strptime(str(x), '%Y%m%d').year)

# Sanity check of the transformation
profile.head()

| | age | became_member_on | gender | id | income |
|---|---|---|---|---|---|
| 0 | 118 | 2017 | | 68be06ca386d4c31939f3a4f0e3dd783 | |
| 1 | 55 | 2017 | F | 0610b486422d4921ae7d2bf64640c50b | 112000.0 |
| 2 | 118 | 2018 | | 38fe809add3b4fcf9315a9694bb96ff5 | |
| 3 | 75 | 2017 | F | 78afa995795e4d85b5d9ceeca43f5fef | 100000.0 |
| 4 | 118 | 2017 | | a03223e636434f42ac4c3df47e8bac43 | |
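As an alternative to the row-wise `apply`, the same conversion can be done in a vectorized way with `pd.to_datetime`. This is a sketch on a few sample values. The reference date used for the tenure calculation is an arbitrary assumption, included only to show that keeping the full date allows finer-grained features than the year alone.

```python
import pandas as pd

# Sample integer dates in YYYYMMDD form, as in the became_member_on column.
dates = pd.Series([20170212, 20170715, 20180712])

# Vectorized parsing, then extraction of the year.
parsed = pd.to_datetime(dates.astype(str), format="%Y%m%d")
years = parsed.dt.year

# Optional: membership tenure in days relative to a hypothetical reference date.
reference = pd.Timestamp("2018-12-31")
tenure_days = (reference - parsed).dt.days
print(years.tolist(), tenure_days.tolist())
```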


**2. Clean the 'value' column in the transcript dataset**

The 'value' column in the transcript data holds dictionaries, which are hard to analyze directly, especially in the machine learning part that comes later. Therefore, it is necessary to transform this column into a more usable format.

Since every entry in this column is a dictionary, I first convert the content of each dictionary to a string and then separate the keys and the values into two different columns.

In [562]:
# Transform the 'value' column into a string and put it into a 'keys_values' column
# (note: a dictionary with more than one key is concatenated without a separator)
transcript['keys_values'] = transcript['value'].apply(lambda x: ''.join('{}:{}'.format(key, val) for key, val in x.items()))

# Extract the keys and values from the 'keys_values' column and separate them into two columns
transcript['keys'] = transcript['keys_values'].apply(lambda x: x.split(':')[0])
transcript['values'] = transcript['keys_values'].apply(lambda x: x.split(':')[1])

# Drop the unnecessary columns 
transcript = transcript.drop(['value', 'keys_values'], axis=1)

# Sanity check
transcript.head()

| | event | person | time | keys | values |
|---|---|---|---|---|---|
| 0 | offer received | 78afa995795e4d85b5d9ceeca43f5fef | 0 | offer id | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 1 | offer received | a03223e636434f42ac4c3df47e8bac43 | 0 | offer id | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 2 | offer received | e2127556f4f64592b11af22de27a7932 | 0 | offer id | 2906b810c7d4411798c6938adc9daaa5 |
| 3 | offer received | 8ec6ce2a7e7949b1bf142def7d0e0586 | 0 | offer id | fafdcd668e3743c1bb461111dcafc2a4 |
| 4 | offer received | 68617ca6246f4fbc85e91a2a49552598 | 0 | offer id | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
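An alternative that avoids the string round-trip is to expand the dictionaries directly into columns. The sketch below uses three made-up records. The shape of the third one (an `offer_id` together with a `reward` key) and the amount value are assumptions for illustration, not rows taken from the data.

```python
import pandas as pd

# Hypothetical 'value' entries covering the three key spellings seen in the data.
values = pd.Series([
    {"offer id": "9b98b8c7a33c4b65b9aebfe6a799e6d9"},
    {"amount": 0.83},
    {"offer_id": "2906b810c7d4411798c6938adc9daaa5", "reward": 2},  # assumed shape
])

# One column per dictionary key; missing keys become NaN.
flat = pd.DataFrame(values.tolist(), index=values.index)

# Merge the two spellings of the offer id into a single column.
flat["offer_id"] = flat["offer_id"].fillna(flat["offer id"])
flat = flat.drop(columns="offer id")
print(flat)
```

With this layout the amount stays numeric and each key keeps its own column, so no string slicing is needed afterwards.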


In [563]:
# Take a look at the unique values the 'keys' and 'values' columns contain
transcript['keys'].unique(), transcript['values'].unique()

(array(['offer id', 'amount', 'offer_id'], dtype=object),
 array(['9b98b8c7a33c4b65b9aebfe6a799e6d9',
        '0b1e1539f2cc45b7b9fa7c272da2e1d7',
        '2906b810c7d4411798c6938adc9daaa5', ..., '685.07', '405.04',
        '476.33'], dtype=object))

The 'keys' column contains three unique values: 'offer id', 'amount', and 'offer_id'. In the following mapping step, I map 'offer id' to 'offer_id' to make the data consistent.

In [564]:
# Transform 'offer id' into 'offer_id' in the 'keys' column
transcript['keys'] = transcript['keys'].map({'offer id': 'offer_id', 'offer_id': 'offer_id', 'amount': 'amount'})

# Sanity check
transcript['keys'].unique()

array(['offer_id', 'amount'], dtype=object)

The 'values' column, on the other hand, contains offer ids as well as numeric data, matching the 'offer_id' and 'amount' entries in the 'keys' column. I take a closer look at this column for the keys 'offer_id' and 'amount' separately.

I expect the 'values' column to contain only offer ids where the key is 'offer_id', and only numbers where the key is 'amount'.

In [565]:
# Unique values in the 'values' column when the 'keys' column has 'amount'
transcript[transcript['keys'] == 'amount']['values'].unique()

array(['0.8300000000000001', '34.56', '13.23', ..., '685.07', '405.04',
       '476.33'], dtype=object)

In [566]:
# Unique values in the 'values' column when the 'keys' column has 'offer_id'
transcript[transcript['keys'] == 'offer_id']['values'].unique()

array(['9b98b8c7a33c4b65b9aebfe6a799e6d9',
       '0b1e1539f2cc45b7b9fa7c272da2e1d7',
       '2906b810c7d4411798c6938adc9daaa5',
       'fafdcd668e3743c1bb461111dcafc2a4',
       '4d5c57ea9a6940dd891ad53e9dbe8da0',
       'f19421c1d4aa40978ebb69ca19b0e20d',
       '2298d6c36e964ae4a3e7e9706d1fb8c2',
       '3f207df678b143eea3cee63160fa8bed',
       'ae264e3637204a6fb9bb56bc8210ddfd',
       '5a8bc65990b245e5a138643cd4eb9837',
       '2906b810c7d4411798c6938adc9daaa5reward',
       'fafdcd668e3743c1bb461111dcafc2a4reward',
       '9b98b8c7a33c4b65b9aebfe6a799e6d9reward',
       'ae264e3637204a6fb9bb56bc8210ddfdreward',
       '4d5c57ea9a6940dd891ad53e9dbe8da0reward',
       '2298d6c36e964ae4a3e7e9706d1fb8c2reward',
       'f19421c1d4aa40978ebb69ca19b0e20dreward',
       '0b1e1539f2cc45b7b9fa7c272da2e1d7reward'], dtype=object)

The data in the 'values' columns seems to be as expected when the key is 'amount', containing only numbers. 

However, some of the offer ids have 'reward' attached to the end. This is most likely an artifact of the extraction step above rather than a property of the offers: if a record's dictionary holds both an offer id and a reward entry, the key-value pairs are joined without a separator, so splitting on ':' leaves the next key name ('reward') attached to the offer id. Only the eight non-informational offers appear with the suffix, which is consistent with this reading, since informational offers have no reward to record. The reward itself is already available in the portfolio data, so I will drop the suffix from the 'values' column in the transcript data.

As the offer ids always contain 32 characters, and numbers in the 'values' column are unlikely to be longer than 32 characters, I drop the suffix by simply slicing the strings in this column.

In [567]:
# Slice the data in the 'values' column and keep only the first 32 characters
transcript['values'] = transcript['values'].apply(lambda x: x[:32])

# Sanity check
transcript[transcript['keys'] == 'offer_id']['values'].unique()

array(['9b98b8c7a33c4b65b9aebfe6a799e6d9',
       '0b1e1539f2cc45b7b9fa7c272da2e1d7',
       '2906b810c7d4411798c6938adc9daaa5',
       'fafdcd668e3743c1bb461111dcafc2a4',
       '4d5c57ea9a6940dd891ad53e9dbe8da0',
       'f19421c1d4aa40978ebb69ca19b0e20d',
       '2298d6c36e964ae4a3e7e9706d1fb8c2',
       '3f207df678b143eea3cee63160fa8bed',
       'ae264e3637204a6fb9bb56bc8210ddfd',
       '5a8bc65990b245e5a138643cd4eb9837'], dtype=object)
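One plausible explanation for the 'reward' suffix can be reproduced in a few lines. The sketch assumes that some records carry both an `offer_id` and a `reward` key; that record shape is an assumption, not something shown in the outputs above. It applies the same join-and-split logic to a single dictionary.

```python
# Hypothetical record with two keys (assumed shape of a completion record).
record = {"offer_id": "2906b810c7d4411798c6938adc9daaa5", "reward": 2}

# Same logic as the extraction step: join 'key:value' pairs with no separator.
joined = "".join("{}:{}".format(key, val) for key, val in record.items())

# Splitting on ':' leaves the next key name glued to the offer id.
second_field = joined.split(":")[1]

# Keeping the first 32 characters recovers the original offer id.
cleaned = second_field[:32]
print(joined)
print(second_field)
print(cleaned)
```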

**3. Map the id columns**

After cleaning the 'value' column, I move on to mapping the id columns in the three datasets to more easily readable values, as the original ids are very long and hard for humans to read.

Specifically, I will map the 'id' column in the portfolio data to an 'offer_id' column, the 'id' column in the profile data and the 'person' column in the transcript data to a 'customer_id' column, respectively.

Since both the profile data and the transcript data have customer ids, I will first check whether these two datasets contain exactly the same customers. If so, I will map the customer ids only once. This way, I avoid mapping the same customer to different customer_id values in the two datasets, which would make it problematic to merge them later on the customer_id key.

In [556]:
# Check whether the profile data and the transcript data contain exactly the same customers
set(profile.id.tolist()) == set(transcript.person.tolist()) 

True
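Had the check returned False, set differences would show which ids are missing on each side. A small sketch with made-up ids:

```python
# Hypothetical id sets that do not match exactly.
profile_ids = {"a", "b", "c"}
transcript_ids = {"a", "b", "d"}

# Ids present in one dataset but not the other.
only_in_profile = profile_ids - transcript_ids
only_in_transcript = transcript_ids - profile_ids
print(only_in_profile, only_in_transcript)
```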

Now that we know the two datasets have the same customers, we can just map the id/person data to customer_id once. And then we only need to put the customer_id column into the two datasets.

For the transcript data, both the 'person' column and the 'values' column need to be transformed. Furthermore, the 'values' column contains offer ids which should be mapped as well as amount of money spent by the customers which shouldn't be transformed. Therefore, I will create two subsets from the transcript data with one containing only offer ids in the 'values' column and one containing only 'amount of money' in the 'values' column. After the mapping, I will concatenate the two subsets back together to one transcript dataset.

In [578]:
# Write a function for mapping the ids in the three datasets
def id_mapper(col):
    '''
    INPUT - a column in the dataset containing the id information that needs to be mapped
    
    OUTPUT - a dictionary with the keys as the original id data and the values as easy-to-read values, i.e. integers
    '''
    coded_dict = dict()
    cter = 1
    
    for val in col:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter += 1

    return coded_dict

# Call the id_mapper function to create dicts containing the mapping information
offer_id_map = id_mapper(portfolio['id'])
customer_id_map = id_mapper(profile['id'])

# Use the map method to map the id columns in the datasets
portfolio['offer_id'] = portfolio['id'].map(offer_id_map)
profile['customer_id'] = profile['id'].map(customer_id_map)
transcript['customer_id'] = transcript['person'].map(customer_id_map)

# Drop the original id columns
portfolio = portfolio.drop('id', axis=1)
profile = profile.drop('id', axis=1)
transcript = transcript.drop('person', axis=1)

# Further transform the 'values' column in the transcript data
# by splitting it into a subset with offer ids and a subset with amounts

offer_id_subset = transcript[transcript['keys'] == 'offer_id'].copy() 
amount_subset = transcript[transcript['keys'] == 'amount'].copy()  

# Map the offer ids in the 'values' column in the offer_id_subset 
offer_id_subset['values'] = offer_id_subset['values'].map(offer_id_map) 

# Concatenate the two subsets back to one dataset
transcript = pd.concat([offer_id_subset, amount_subset], axis=0)
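For comparison, pandas offers `pd.factorize`, which assigns integer codes in order of first appearance and could replace a hand-written mapper like `id_mapper`. The sketch below uses shortened, made-up ids. Adding 1 to the codes reproduces the 1-based numbering used above.

```python
import pandas as pd

# Shortened, hypothetical ids with one repeat.
ids = pd.Series(["ae26", "4d5c", "ae26", "3f20"])

# Codes start at 0 and follow the order of first appearance.
codes, uniques = pd.factorize(ids)

# Shift to 1-based ids and build the equivalent lookup dictionary.
readable = pd.Series(codes + 1, index=ids.index)
id_map = {original: code + 1 for code, original in enumerate(uniques)}
print(readable.tolist(), id_map)
```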

**4. Reshape the data**

From the previous steps I learnt that the transcript data has different 'events', which correspond to different types of data in the 'keys' and 'values' columns. Therefore, I would create sub-datasets based on different 'events'.

In [620]:
# Find out all unique events in the transcript data
transcript['event'].unique()

array(['offer received', 'offer viewed', 'offer completed', 'transaction'],
      dtype=object)

In [621]:
# Create subsets containing only one of these events
offer_received = transcript[transcript['event'] == 'offer received'].copy()
offer_viewed = transcript[transcript['event'] == 'offer viewed'].copy()
offer_completed = transcript[transcript['event'] == 'offer completed'].copy()
transaction = transcript[transcript['event'] == 'transaction'].copy()

# Create a new column (either 'offer_id' or 'amount') based on the 'keys' and the 'values' columns
# and drop the 'keys' and 'values' columns
offer_received['offer_id'] = offer_received['values'].copy()
offer_received = offer_received.drop(['keys', 'values'], axis=1)

offer_viewed['offer_id'] = offer_viewed['values'].copy()
offer_viewed = offer_viewed.drop(['keys', 'values'], axis=1)

offer_completed['offer_id'] = offer_completed['values'].copy()
offer_completed = offer_completed.drop(['keys', 'values'], axis=1)

# Amounts were extracted as strings, so convert them to floats
transaction['amount'] = transaction['values'].astype(float)
transaction = transaction.drop(['keys', 'values'], axis=1)

In [623]:
offer_received.shape, offer_viewed.shape, offer_completed.shape, transaction.shape

((76277, 4), (57725, 4), (33579, 4), (138953, 4))
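A quick sanity check after a split like this is that the subsets partition the original rows. The sketch below does the split with `groupby` on a toy frame, which is an alternative to the four explicit filters, and also verifies that the four row counts reported above add up to the transcript's 306534 rows.

```python
import pandas as pd

# Toy stand-in for the transcript data.
toy = pd.DataFrame({
    "event": ["offer received", "transaction", "offer viewed",
              "transaction", "offer completed"],
    "time": [0, 6, 12, 18, 24],
})

# One sub-DataFrame per event type.
subsets = {name: group.copy() for name, group in toy.groupby("event")}

# Every row should land in exactly one subset.
total_rows = sum(len(group) for group in subsets.values())

# Same check on the row counts reported for the real subsets.
reported_total = 76277 + 57725 + 33579 + 138953
print(total_rows == len(toy), reported_total)
```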

In [626]:
offer_viewed.head(20)

| | event | time | customer_id | offer_id |
|---|---|---|---|---|
| 12650 | offer viewed | 0 | 9 | 9 |
| 12651 | offer viewed | 0 | 32 | 8 |
| 12652 | offer viewed | 0 | 36 | 2 |
| 12653 | offer viewed | 0 | 42 | 1 |
| 12655 | offer viewed | 0 | 49 | 8 |
| 12656 | offer viewed | 0 | 56 | 10 |
| 12660 | offer viewed | 0 | 72 | 6 |
| 12661 | offer viewed | 0 | 78 | 6 |
| 12662 | offer viewed | 0 | 84 | 9 |
| 12663 | offer viewed | 0 | 97 | 7 |


In [583]:
portfolio

| | channels | difficulty | duration | offer_type | reward | offer_id |
|---|---|---|---|---|---|---|
| 0 | [email, mobile, social] | 10 | 7 | bogo | 10 | 1 |
| 1 | [web, email, mobile, social] | 10 | 5 | bogo | 10 | 2 |
| 2 | [web, email, mobile] | 0 | 4 | informational | 0 | 3 |
| 3 | [web, email, mobile] | 5 | 7 | bogo | 5 | 4 |
| 4 | [web, email] | 20 | 10 | discount | 5 | 5 |
| 5 | [web, email, mobile, social] | 7 | 7 | discount | 3 | 6 |
| 6 | [web, email, mobile, social] | 10 | 10 | discount | 2 | 7 |
| 7 | [email, mobile, social] | 0 | 3 | informational | 0 | 8 |
| 8 | [web, email, mobile, social] | 5 | 5 | bogo | 5 | 9 |
| 9 | [web, email, mobile] | 10 | 7 | discount | 2 | 10 |


In [580]:
profile.head()

| | age | became_member_on | gender | income | customer_id |
|---|---|---|---|---|---|
| 0 | 118 | 2017 | | | 1 |
| 1 | 55 | 2017 | F | 112000.0 | 2 |
| 2 | 118 | 2018 | | | 3 |
| 3 | 75 | 2017 | F | 100000.0 | 4 |
| 4 | 118 | 2017 | | | 5 |


In [585]:
transcript['event'].unique()

array(['offer received', 'offer viewed', 'offer completed', 'transaction'],
      dtype=object)

In [398]:
f = pd.get_dummies(sub['event'], drop_first=False)

In [402]:
f.head()

| | offer completed | offer received | offer viewed | transaction |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |


In [401]:
f.sum(axis=0).unique()

array([ 33579,  76277,  57725, 138950], dtype=int64)

In [388]:
sub.key.dtypes

dtype('O')

In [405]:
list({'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'})

['offer id']

In [364]:
profile.shape, transcript.shape

((17000, 5), (306534, 4))