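# Starbucks Capstone Project

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the app. An offer can be merely an advertisement for a drink, or an actual offer such as a discount or BOGO (buy one, get one free). Some customers might not receive any offer for several weeks in a row.

Not all customers receive the same offer, and that is the challenge of this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond to which type of offer. This data set is a simplified version of the real Starbucks app data, because the underlying simulator has only one product, whereas Starbucks actually sells dozens.

Every offer has a validity period. For example, a BOGO offer might be valid for only 5 days. You will see in the data set that even informational offers have a validity period, although they are merely advertisements for a drink. If an informational offer is valid for 7 days, you can assume that the customer may be influenced by the offer during those 7 days.

The data set also contains records of transactions paid for through the app, including the time of purchase and the amount spent. The records also show which offers each customer received, how many, and when the customer viewed them. A record is also generated whenever a customer makes a purchase.

Keep in mind as well that a customer might make a purchase without having received or viewed an offer.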

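### Example

To give an example, a customer receives a discount offer on Monday: spend 10 dollars, get 2 dollars off. The offer is valid for 10 days from the day it is received. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer has met the requirement of the offer.

However, there are a few things to watch out for in this data set. Offers take effect automatically: once a customer has received an offer, the reward is applied as soon as the condition is met, even if the customer never viewed the offer. For example, a customer receives the "spend 10 dollars, get 2 dollars off" offer but never opens it during the 10-day validity period. The customer spends 15 dollars in total within those 10 days. The data set will record that the customer completed the offer, yet the customer was not influenced by it, because the customer did not know it existed.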
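The distinction between a completed offer and an offer that actually influenced the customer can be sketched in code. The snippet below is a minimal illustration under stated assumptions, not the method used later in this notebook: the `events` frame and the `was_influenced` helper are hypothetical, and an offer counts as influential only if it was viewed after it was received and no later than it was completed.

```python
import pandas as pd

# Hypothetical event log for one customer and one offer (times in days).
events = pd.DataFrame({
    "event": ["offer received", "offer completed", "offer viewed"],
    "time": [0.0, 4.0, 6.0],
})

def was_influenced(events: pd.DataFrame) -> bool:
    """Return True only if the offer was viewed after receipt and no later than completion."""
    received = events.loc[events["event"] == "offer received", "time"].min()
    viewed = events.loc[events["event"] == "offer viewed", "time"].min()
    completed = events.loc[events["event"] == "offer completed", "time"].min()
    # A missing event yields NaN, and every comparison with NaN is False.
    return bool(received <= viewed <= completed)

print(was_influenced(events))  # False: the offer was viewed after it had been completed
```

In this toy log the offer is recorded as completed, but the view happens two days later, so the completion cannot be attributed to the offer.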

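### Cleaning

Cleaning the data is very important and takes considerable skill.

You also need to account for groups of customers who will make a purchase even if they do not receive an offer. From a business perspective, if a customer is going to spend 10 dollars whether or not an offer arrives, you do not want to send that customer a "spend 10 dollars, get 2 dollars off" offer. You may therefore need to analyze what a given group buys when it receives no offers at all.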
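One way to approximate spending without offers is to keep only the transactions that fall outside every offer window of the same customer. The sketch below assumes two hypothetical frames, `transactions` and `offer_windows`, with times in days; the column names and the `baseline_transactions` helper are illustrative and do not come from the data files.

```python
import pandas as pd

# Hypothetical data: three transactions and one offer window for person 1.
transactions = pd.DataFrame({
    "person": [1, 1, 2],
    "time": [1.0, 12.0, 3.0],
    "amount": [10.0, 8.0, 5.0],
})
offer_windows = pd.DataFrame({
    "person": [1],
    "start": [0.0],
    "end": [5.0],
})

def baseline_transactions(transactions, offer_windows):
    """Keep only the transactions that fall outside every offer window of the same person."""
    tx = transactions.reset_index().rename(columns={"index": "tx_id"})
    merged = tx.merge(offer_windows, on="person", how="left")
    # Persons without any window get NaN bounds, so the comparison is False for them.
    in_window = (merged["time"] >= merged["start"]) & (merged["time"] <= merged["end"])
    covered_ids = merged.loc[in_window, "tx_id"].unique()
    return transactions[~transactions.index.isin(covered_ids)]

baseline = baseline_transactions(transactions, offer_windows)
print(baseline.groupby("person")["amount"].mean())
```

The per-person mean of the remaining transactions gives a rough baseline that offer-driven spending can be compared against.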

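### Final Advice

Because this is a capstone project, you are free to analyze the data in any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether a customer will respond to an offer. Or you need not build a machine learning model at all: you could develop a set of heuristics that determine which offer to send to each customer (for example, if 75% of 35-year-old women respond to offer A but only 40% respond to offer B, send them offer A).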
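The heuristic in the last sentence can be written down in a few lines. This is a minimal sketch on made-up data: the `history` frame, its column names and the group labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical labeled history: one row per (customer group, offer) exposure.
history = pd.DataFrame({
    "group": ["F_35"] * 4 + ["M_50"] * 4,
    "offer": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "responded": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Response rate of every group for every offer (rows: group, columns: offer).
rates = history.groupby(["group", "offer"])["responded"].mean().unstack()

# Heuristic: send each group the offer with the highest observed response rate.
best_offer = rates.idxmax(axis=1)
print(best_offer.to_dict())  # {'F_35': 'A', 'M_50': 'B'}
```

A rule like this is easy to explain, but it ignores sample size; with few observations per group the observed rates can be misleading.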


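# Data Sets

The data is contained in three files:

* portfolio.json – offer ids and metadata about each offer (duration, type, etc.)
* profile.json – demographic data for each customer
* transcript.json – records of transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) – offer id
* offer_type (string) – type of offer, i.e. BOGO, discount, informational
* difficulty (int) – minimum spend required to complete the offer
* reward (int) – reward given for completing the offer
* duration (int) – time the offer is open, in days
* channels (list of strings)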
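Because `channels` holds a list per offer, it is often easier to work with one indicator column per channel. The sketch below uses a hypothetical `portfolio_demo` frame with made-up ids; only the column name `channels` and the channel names are taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for portfolio.json with two offers.
portfolio_demo = pd.DataFrame({
    "id": ["offer_a", "offer_b"],
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})

# One row per (offer, channel), then 0/1 indicators, then collapse back to one row per offer.
channel_flags = (
    portfolio_demo["channels"]
    .explode()
    .str.get_dummies()
    .groupby(level=0)
    .max()
)
portfolio_flat = portfolio_demo.drop(columns="channels").join(channel_flags)
print(portfolio_flat)
```

The result has one 0/1 column each for `email`, `mobile`, `social` and `web`, which is easier to merge and to feed into a model than a list column.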

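**profile.json**
* age (int) – age of the customer
* became_member_on (int) – date when the customer first registered on the app
* gender (str) – gender of the customer (note that besides M for male and F for female, there is also O for other)
* id (str) – customer id
* income (float) – customer's income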
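Since `became_member_on` is stored as an integer, it usually needs to be parsed before it can be used as a date. The sketch below assumes the integer encodes the date as YYYYMMDD; that format is an assumption and should be checked against the actual file. The `profile_demo` frame and the derived column names are hypothetical.

```python
import pandas as pd

# Hypothetical values, assuming a YYYYMMDD integer encoding.
profile_demo = pd.DataFrame({"became_member_on": [20170212, 20180426]})

# Parse the integers as dates and derive the membership year.
profile_demo["member_since"] = pd.to_datetime(
    profile_demo["became_member_on"].astype(str), format="%Y%m%d"
)
profile_demo["member_year"] = profile_demo["member_since"].dt.year
print(profile_demo)
```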

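**transcript.json**
* event (str) – record description (i.e. transaction, offer received, offer viewed, etc.)
* person (str) – customer id
* time (int) – time in hours since the start of the test. The data begins at time t=0
* value - (dict of strings) – either an offer id or a transaction amount, depending on the record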
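**Note:** If you are using the Workspace, you need to open a terminal and run the command `conda update pandas` before reading in the files. The version of pandas in the Workspace cannot read transcript.json correctly, so it has to be updated to the latest version. You can open the terminal by clicking the orange jupyter icon in the top left of the notebook.

The two images below show how to open the terminal and how to install the update. First open the terminal:
<img src="pic1.png"/>

Then run the command above:
<img src="pic2.png"/>

Finally, return to this notebook (click the orange jupyter icon again) and run the cell below again; it should no longer raise an error.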

# Table of Contents

I. [Data Discovering & Visualization](#Data-Discovering)<br>
II. [Rank-Based Recommendations](#Rank)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content-Based Recommendations (optional)](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras & Conclusions](#conclusions)<br>
[References](#references)

In [1]:
import pandas as pd
import numpy as np
import math
import json

from collections import defaultdict
from time import time

import matplotlib.pyplot as plt
%matplotlib inline

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)


# <a class="anchor" id="Data-Discovering">I. Data Discovering & Visualization</a>

## 1. Basic information
- .head()
    - Index, Columns(Features)
- .info()
    - NaN values, data size
- .describe()
    - Statistical information
- .unique()
- .value_counts()

In [None]:
portfolio.head(3)

In [None]:
portfolio.info() 

In [None]:
portfolio.describe()

In [None]:
profile.head(3)

In [None]:
profile.info()

In [None]:
profile.describe()

In [None]:
transcript.head(3)

In [None]:
transcript.info()

In [None]:
transcript.describe()

# Summary
1. profile_cleaned (14825, 5)
    - id
    - gender
    - age
    - became_member_on
    - income
2. portfolio_cleaned (10, 9)
    - id
    - email, mobile, social, web
    - reward, difficulty, offer_type
    - duration

3. transcript_cleaned (306534, 6) -> (272762, 5) valid records
    - person
    - offer_id
    - reward (unused)
    - amount
    - event
    - time

In [None]:
profile_cleaned.to_csv('./profile_cleaned.csv')
portfolio_cleaned.to_csv('./portfolio_cleaned.csv')
transcript_cleaned.to_csv('./transcript_cleaned.csv')

In [21]:
profile_cleaned = pd.read_csv('./profile_cleaned.csv')
portfolio_cleaned= pd.read_csv('./portfolio_cleaned.csv')
transcript_cleaned= pd.read_csv('./transcript_cleaned.csv')

In [22]:
profile_cleaned.index = profile_cleaned.iloc[:, 0].values
del profile_cleaned['Unnamed: 0']

portfolio_cleaned.index = portfolio_cleaned.iloc[:, 0].values
del portfolio_cleaned['Unnamed: 0']

transcript_cleaned.index = transcript_cleaned.iloc[:, 0].values
del transcript_cleaned['Unnamed: 0']

In [25]:
portfolio_cleaned.rename({'id': 'offer_id'},axis=1, inplace=True)
transcript_offer = transcript_cleaned.merge(portfolio_cleaned[['duration','offer_id','offer_type']], how='left', on='offer_id')
transcript_offer = transcript_offer.astype({'offer_id': 'str'})

In [27]:
transcript_offer.offer_id.unique()

array(['3', '9', '8', '2', '4', '0', '6', '1', '5', '7', '-1'],
      dtype=object)

In [28]:
target_dataset = pd.DataFrame(columns=['person', 'offer_id', 'time_received', 'time_viewed', 'time_transaction','time_completed','amount_with_offer','label_effective_offer'])

# Functions

In [46]:
def get_dateset_from_unique_offer_id(groupby_offer_id, offer_id_list, transactions):
    '''
    DESCRIPTION:
    For every offer_id that one person has events for, split the events into
    units and pass them to the handler that matches the offer type.

    The handlers label the related transactions with the offer_id in the
    original dataset transcript_offer, and add one row per unit to
    target_dataset - (DataFrame), the structure of target_dataset is
        ['person', 'offer_id', 'time_received', 'time_viewed', 'time_transaction','time_completed','amount_with_offer','label_effective_offer']

    INPUT:
        groupby_offer_id - (pandas groupby object) events of one person, grouped by offer_id
        offer_id_list - (array of str) unique offer_id values of that person
        transactions - (DataFrame) all transactions of that person

    OUTPUT: None
    '''
    start = time()
    
    valid_offer_id = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    for offer_id in offer_id_list:
        # offer_id has 10 valid values, except -1 represent nan values of offer_id
        if offer_id not in valid_offer_id:
            continue
        unique_offer_id = groupby_offer_id.get_group(offer_id)
        df_units, units_count = cut_unique_offer_id_2_units(unique_offer_id)

        # informational offer_type without 'offer completed'
        if unique_offer_id.offer_type.unique()[0] == 'informational':
            get_dateset_from_informational_offer(df_units, units_count, transactions)

        else:
        #unique_offer_id.offer_type.unique()[0] in ['bogo', 'discount']:
            get_dateset_from_other_offer(df_units, units_count, transactions)
        print("data for unique offer_id wrangled, time is {}" .format(time()-start))
    print("data for unique person wrangled, time is {}" .format(time()-start))

In [31]:
def fill_offer_id_4_transaction(original_id, updated_id):
    '''Fill in the offer_id of a transaction that is related to an offer.
    The result is a comma-separated string of offer ids, e.g. '3' or '3,6'.

    INPUT:
        original_id - (str) current offer_id of the transaction: '-1' if it is not yet
            related to any offer, otherwise a comma-separated string of offer ids
            (one transaction can be related to more than one offer)
        updated_id - (str) the related offer_id to add

    OUTPUT:
        new_value - (str) updated offer_id information of the transaction
    '''
    if original_id == '-1':  # offer_id has dtype object after astype('str'), so compare with the str '-1'
        new_value = updated_id
    else:
        # the transaction is already related to at least one offer
        new_value = original_id+','+updated_id
    return new_value
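Because a transaction that falls into several offer windows ends up with a comma-separated `offer_id` such as `'3,6'`, later analysis usually needs one row per (transaction, offer) pair again. The following is a minimal sketch of that expansion on a hypothetical `labeled` frame; it is not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical labeled transactions: '-1' means "not related to any offer".
labeled = pd.DataFrame({
    "amount": [5.0, 7.5, 3.0],
    "offer_id": ["-1", "3", "3,6"],
})

# Split the comma-separated string into a list, then create one row per offer id.
expanded = labeled.assign(offer_id=labeled["offer_id"].str.split(",")).explode("offer_id")
print(expanded)
```

Note that the amount of a transaction is repeated on every expanded row, so sums over `amount` should be taken before the expansion or per offer only.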

In [32]:
def get_dateset_from_informational_offer(df_units, units_count, transactions):
    # target_dataset and transcript_offer are global variables.
    '''
    DESCRIPTION:
    For the offer_id == 2 | 7 (informational_offer)

    Update transactions with related offer_id in the original dataset transcript_offer.

    Update target_dataset - (DataFrame), the structure of target_dataset is
        ['person', 'offer_id', 'time_received', 'time_viewed', 'time_transaction','time_completed','amount_with_offer','label_effective_offer']

    INPUT:
        df_units - (list of DataFrame) transaction units from a unique_offer_id of unique_person
        units_count - (int) number of transaction units in df_units
        transactions - (DataFrame) 'event' is 'transaction' for a unique_person

    OUTPUT: None
    '''
    #!!!REMEMBER: it's already units, with 'offer received' or not
    #!!!REMEMBER: units_count can't be 0, since there is no nan in 'event'
    person   = df_units[0].person.unique()[0]  #.unique() returns a numpy.array
    offer_id = df_units[0].offer_id.unique()[0]
    # different offer_id has a different valid duration
    duration = df_units[0].duration.unique()[0]

    for i in range(units_count):
        df_unit = df_units[i]

        time_received = df_unit[df_unit.event=='offer received'].time.min()
        time_viewed = df_unit[df_unit.event=='offer viewed'].time.min()
        time_completed = df_unit[df_unit.event=='offer completed'].time.min()

        # init the transaction time
        # (after a valid transaction, the offer is finished, so there will be at most one transaction time)
        time_transaction = -1
        # init the amount related to an offer
        amount_with_offer = 0
        # init the label of effective_offer
        label_effective_offer = -1

        # FLAG of 'offer received'
        is_received = (df_unit[df_unit.event=='offer received'].shape[0]!=0)

        if is_received:
            # at least one transaction exist
            if transactions.shape[0] != 0:
                transaction_time = np.array(transactions.time)
                time_begin = time_received
                time_end = time_received + duration

                is_valid_duration = (transaction_time >= time_begin) & (transaction_time <= time_end)
                valid_transactions = transactions[is_valid_duration]

                # update the 1st transaction, get the label_effective_offer
                if valid_transactions.shape[0] != 0:
                    # the 1st transaction in the valid duration is the one related with the offer
                    # (assigning offer_id to valid_transactions.head(1) would only modify a
                    # temporary copy; the labeling is done on transcript_offer below)
                    time_transaction = valid_transactions.head(1).time.min()

                    # get the data in original dataset transcript_offer, to update the offer_id of transaction with the related offer_id
                    global transcript_offer
                    valid_transactions_2b_labeled = transcript_offer.loc[valid_transactions.index]

                    # update the offer_id of transaction in transcript_offer
                    valid_transactions_2b_labeled['offer_id'] = valid_transactions_2b_labeled['offer_id'].apply(fill_offer_id_4_transaction, args=(offer_id,))
                    transcript_offer.update(valid_transactions_2b_labeled)

                    label_effective_offer = 1
                    amount_with_offer = valid_transactions.head(1).amount.sum()

            # received but without transaction
            else:
                label_effective_offer = 0

        # update the target_dataset
        global target_dataset
        # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate it
        new_row = pd.DataFrame([{
                    "person":   person,
                    "offer_id": offer_id,
                    "time_received": time_received,
                    "time_viewed": time_viewed,
                    "time_transaction": time_transaction,
                    "time_completed": time_completed,
                    "amount_with_offer": amount_with_offer,
                    "label_effective_offer": label_effective_offer
                    }])
        target_dataset = pd.concat([target_dataset, new_row], ignore_index=True)


In [33]:
def get_dateset_from_other_offer(df_units, units_count, transactions):
    '''
    DESCRIPTION:
    For the offer_id != (2 & 7)
    (REMEMBER to exclude offer_id == -1)

    Update transactions with related offer_id in the original dataset transcript_offer.

    Update target_dataset - (DataFrame), the structure of target_dataset is
        ['person', 'offer_id', 'time_received', 'time_viewed', 'time_transaction','time_completed','amount_with_offer','label_effective_offer']

    INPUT:
        df_units - (list of DataFrame) transaction units from a unique_offer_id of unique_person
        units_count - (int) number of transaction units in df_units
        transactions - (DataFrame) 'event' is 'transaction' for a unique_person

    OUTPUT: None
    '''
    #!!!REMEMBER: it's already units, with 'offer received' or not
    #!!!REMEMBER: units_count can't be 0, since there is no nan in 'event'
    person   = df_units[0].person.unique()[0]  #.unique() returns a numpy.array
    offer_id = df_units[0].offer_id.unique()[0]
    # different offer_id has a different valid duration
    duration = df_units[0].duration.unique()[0]

    for i in range(units_count):
        df_unit = df_units[i]

        time_received = df_unit[df_unit.event=='offer received'].time.min()
        time_viewed = df_unit[df_unit.event=='offer viewed'].time.min()
        time_completed = df_unit[df_unit.event=='offer completed'].time.min()

        # init the transaction time with an empty string
        # (there may be more than one valid transaction time, joined with commas)
        time_transaction = ""
        # init the amount related to an offer
        amount_with_offer = 0
        # init the label of effective_offer
        label_effective_offer = -1

        # FLAG of 'offer received'
        is_received = (df_unit[df_unit.event=='offer received'].shape[0]!=0)
        # FLAG of 'offer completed'
        is_completed = (df_unit[df_unit.event=='offer completed'].shape[0]!=0)

        if is_received:
            if is_completed:
                #REMEMBER: to be completed, there must be transaction(s)
                transaction_time = np.array(transactions.time)

                #valid transaction(s) exist between 'offer received' and 'offer completed'
                is_valid_duration = (transaction_time >= time_received) & (transaction_time <= time_completed)
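                # take a copy, so that the offer_id assignment below does not write to a slice of transactions
                valid_transactions = transactions[is_valid_duration].copy()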

                valid_transactions.loc[:, 'offer_id'] = offer_id

                # get the index of valid transactions, to update offer_id of transactions in the original dataset transcript_offer
                global transcript_offer

                valid_transactions_2b_labeled = transcript_offer.loc[valid_transactions.index]
                valid_transactions_2b_labeled['offer_id']=valid_transactions_2b_labeled['offer_id'].apply(fill_offer_id_4_transaction, args=(offer_id,))
                transcript_offer.update(valid_transactions_2b_labeled)

                # update the label of effective_offer
                label_effective_offer = 1
                amount_with_offer = valid_transactions.amount.sum()
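                # there may be more than one valid transaction
                # (the loop variable is named t so that it does not shadow the imported time();
                # the resulting string starts with a comma, e.g. ",5.5")
                for t in valid_transactions.time.values.tolist():
                    time_transaction = time_transaction+','+str(t)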


            else:
                # without 'offer completed'
                transaction_time = np.array(transactions.time)
                time_begin = time_received
                time_end = time_received + duration
                # transaction(s) in valid duration should be regarded as 'tried transaction(s)'
                is_valid_duration = (transaction_time >= time_begin) & (transaction_time <= time_end)
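                # take a copy, so that the offer_id assignment below does not write to a slice of transactions
                valid_transactions = transactions[is_valid_duration].copy()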

                # transaction(s) in valid duration should be updated with the related offer_id in the dataset transcript_offer

                valid_transactions.loc[:, 'offer_id'] = offer_id
                valid_transactions_2b_labeled = transcript_offer.loc[valid_transactions.index]
                valid_transactions_2b_labeled['offer_id']=valid_transactions_2b_labeled['offer_id'].apply(fill_offer_id_4_transaction, args=(offer_id,))
                transcript_offer.update(valid_transactions_2b_labeled)

                label_effective_offer = 0 # tried but not completed
                amount_with_offer = valid_transactions.amount.sum()
                # the loop variable is named t so that it does not shadow the imported time()
                for t in valid_transactions.time.values.tolist():
                    time_transaction = time_transaction+','+str(t)

        # update the target_dataset
        global target_dataset
        # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate it
        new_row = pd.DataFrame([{
                    "person":   person,
                    "offer_id": offer_id,
                    "time_received": time_received,
                    "time_viewed": time_viewed,
                    "time_transaction": time_transaction,
                    "time_completed": time_completed,
                    "amount_with_offer": amount_with_offer,
                    "label_effective_offer": label_effective_offer
                    }])
        target_dataset = pd.concat([target_dataset, new_row], ignore_index=True)

In [34]:
def cut_unique_offer_id_2_units(unique_offer_id):
    '''
    DESCRIPTION:
        The raw data is transcript of unique_offer_id in one unique_person. Since there may be more than one offer for this unique_offer_id. That's why we cut the transcript to independent pieces, and call it 'units'.

    INPUT:
        unique_offer_id - (DataFrame) transcript of unique_offer_id in one unique_person

    OUTPUT:
        df_units - (list of DataFrame) transaction units from a unique_offer_id of unique_person
        units_count - (int) number of transaction units in df_units
    '''
    events = unique_offer_id['event']
    index = events.index.values
    index_received = events[events=='offer received'].index.values

    # no 'offer received' event: the whole transcript forms a single unit
    if len(index_received) == 0:
        return [unique_offer_id], 1

    df_units = []
    # events logged before the first 'offer received' form a unit of their own
    if index_received[0] != index.min():
        df_units.append(unique_offer_id[index < index_received[0]])

    # every other unit starts at an 'offer received' event and ends right
    # before the next one; the last unit runs to the end of the transcript
    for i, begin in enumerate(index_received):
        if i + 1 < len(index_received):
            is_in_unit = (index >= begin) & (index < index_received[i+1])
        else:
            is_in_unit = index >= begin
        df_units.append(unique_offer_id[is_in_unit])

    return df_units, len(df_units)
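The same splitting idea can also be expressed with a cumulative sum: every 'offer received' event starts a new unit, so counting those events row by row yields a unit id. The sketch below is an alternative illustration on a hypothetical `offer_events` frame, not the function used in this notebook, and it assumes the rows are already in chronological order.

```python
import pandas as pd

# Hypothetical events of one person for one offer_id, in chronological order.
offer_events = pd.DataFrame(
    {"event": ["offer received", "offer viewed", "offer received", "offer completed"]},
    index=[10, 14, 30, 41],
)

# Each 'offer received' increments the counter, so rows share an id until the next receipt.
# Rows that precede the first receipt get id 0 and form their own unit.
unit_id = (offer_events["event"] == "offer received").cumsum()
units = [unit for _, unit in offer_events.groupby(unit_id)]
print(len(units))  # 2
```

This avoids explicit index arithmetic, at the cost of relying on row order instead of index values.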

# Testing
1. target_dataset
2. transcript_offer
    - offer_id (object of str)

In [65]:
####test init
target_dataset = pd.DataFrame(columns=['person', 'offer_id', 'time_received', 'time_viewed', 'time_transaction','time_completed','amount_with_offer','label_effective_offer'])

transcript_offer = transcript_cleaned.merge(portfolio_cleaned[['duration','offer_id','offer_type']], how='left', on='offer_id')
transcript_offer = transcript_offer.astype({'offer_id': 'str'})

####CHECK init
print(transcript_offer.offer_id.unique().tolist())
transcript_offer.info()

['3', '9', '8', '2', '4', '0', '6', '1', '5', '7', '-1']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 272762 entries, 0 to 272761
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   person      272762 non-null  int64  
 1   event       272762 non-null  object 
 2   time        272762 non-null  float64
 3   amount      272762 non-null  float64
 4   offer_id    272762 non-null  object 
 5   duration    148805 non-null  float64
 6   offer_type  148805 non-null  object 
dtypes: float64(3), int64(1), object(3)
memory usage: 16.6+ MB


In [58]:
#transcript_offer.offer_id.unique()[-1].split(',')


['3', '6']

## test for CLASS

In [73]:
target_dataset

Unnamed: 0,person,offer_id,time_received,time_viewed,time_transaction,time_completed,amount_with_offer,label_effective_offer


In [75]:
transcript_offer.offer_id.unique()

array(['3', '9', '8', '2', '4', '0', '6', '1', '5', '7', '-1'],
      dtype=object)

In [78]:
# import class
from data_preprocessing_class import PreprocessingData
data_process = PreprocessingData(target_dataset, transcript_offer)

ImportError: cannot import name 'PreprocessingData' from 'data_preprocessing_class' (D:\D\test_file\jupyter\Git\Udacity_project\4_opimize_promotion_Starbucks\Udacity_DSND_proj4_StarbucksPromotion\data_preprocessing_class.py)

In [72]:
person_ids = transcript_offer.person.unique()
for person_id in person_ids:
    unique_person = transcript_offer[transcript_offer['person']==person_id]
    transactions = unique_person[unique_person.event=='transaction']

    groupby_offer_id = unique_person.groupby(['offer_id'])
    offer_id_list = unique_person.offer_id.unique()

    data_process.get_dateset_from_unique_offer_id(groupby_offer_id, offer_id_list, transactions)

NameError: name 'np' is not defined

In [None]:
data_process.target_dataset

## test for FUNCTIONS

In [None]:
person_ids = transcript_offer.person.unique()
for person_id in person_ids:
    unique_person = transcript_offer[transcript_offer['person']==person_id]
    transactions = unique_person[unique_person.event=='transaction']

    groupby_offer_id = unique_person.groupby(['offer_id'])
    offer_id_list = unique_person.offer_id.unique()

    get_dateset_from_unique_offer_id(groupby_offer_id, offer_id_list, transactions)
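Filtering the full frame once per person scans all rows for each of the 14825 persons. Iterating over `groupby("person")` is a common alternative that partitions the frame a single time. The sketch below only shows the iteration pattern on a hypothetical `log` frame; the per-person processing is reduced to counting transactions.

```python
import pandas as pd

# Hypothetical miniature of transcript_offer.
log = pd.DataFrame({
    "person": [1, 1, 2],
    "event": ["offer received", "transaction", "transaction"],
    "offer_id": ["3", "-1", "-1"],
})

transaction_counts = {}
# groupby yields (person_id, sub-frame) pairs, one per person.
for person_id, unique_person in log.groupby("person"):
    transactions = unique_person[unique_person["event"] == "transaction"]
    transaction_counts[person_id] = len(transactions)

print(transaction_counts)
```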

In [49]:
###test for funcs

# unique_person & transactions 
person_ids = transcript_offer.person.unique()
index = 1
person_id = person_ids[index]
unique_person = transcript_offer[transcript_offer['person']==person_id]
transactions = unique_person[unique_person.event=='transaction']

In [50]:
groupby_offer_id = unique_person.groupby(['offer_id'])
offer_id_list = unique_person.offer_id.unique()

offer_id_list

array(['9', '-1', '2', '3', '6'], dtype=object)

In [42]:
# unique offer_id
offer_id = '7'
unique_offer_id = groupby_offer_id.get_group(offer_id)
df_units, units_count = cut_unique_offer_id_2_units(unique_offer_id)
print(units_count)
df_units

1


[       person           event  time  amount offer_id  duration     offer_type
 47385       1  offer received   7.0     0.0        7       3.0  informational
 75802       1    offer viewed   9.0     0.0        7       3.0  informational]

In [43]:
# offer_id == '2' | '7'
get_dateset_from_informational_offer(df_units, units_count, transactions)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [None]:
# offer_id not in ('2', '7', '-1'), i.e. a bogo or discount offer
get_dateset_from_other_offer(df_units, units_count, transactions)

In [48]:
# target_dataset
# transcript_offer
target_dataset

Unnamed: 0,person,offer_id,time_received,time_viewed,time_transaction,time_completed,amount_with_offer,label_effective_offer
0,1,7,7.0,9.0,9.25,,19.67,1
1,1,3,0.0,0.25,",5.5",5.5,19.89,1
2,1,7,7.0,9.0,9.25,,19.67,1
3,1,0,17.0,17.0,",21.25",21.25,21.72,1
4,1,8,21.0,24.25,",21.25",21.25,21.72,1
5,1,3,0.0,0.25,",5.5",5.5,19.89,1
6,1,7,7.0,9.0,9.25,,19.67,1
7,1,0,17.0,17.0,",21.25",21.25,21.72,1
8,1,8,21.0,24.25,",21.25",21.25,21.72,1


In [60]:
transcript_offer.person.nunique()

14825

In [51]:
get_dateset_from_unique_offer_id(groupby_offer_id, offer_id_list, transactions)

data for unique offer_id wrangled, time is 0.17090296745300293
data for unique offer_id wrangled, time is 0.19688916206359863


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


data for unique offer_id wrangled, time is 0.38378095626831055
data for unique offer_id wrangled, time is 0.6016559600830078
data for unique person wrangled, time is 0.6036555767059326
