# Starbucks Capstone Project

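### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the app. An offer can be merely an advertisement for a drink, or an actual offer such as a discount or BOGO (buy one, get one free). Some customers might not receive any offer for weeks at a time.

Not all customers receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic, and offer data to determine which demographic groups respond to which type of offer. The data set is a simplified version of real Starbucks app data: the underlying simulator produces only one product, whereas Starbucks actually sells dozens.

Every offer has a validity period. For example, a BOGO offer might be valid for only 5 days. You will see that even informational offers have a validity period, although they are merely advertisements for a drink. For example, if an informational offer has a validity period of 7 days, you can assume the customer may be influenced by it during those 7 days.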
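
The validity period can be turned into an explicit time window. The following is a minimal sketch, assuming that event times are measured in hours and offer durations in days (as the data dictionary for transcript.json and portfolio.json states). The helper names `offer_window` and `is_within_window` are hypothetical, and treating the end of the window as inclusive is an assumption.

```python
from typing import Tuple

HOURS_PER_DAY = 24

def offer_window(received_hour: int, duration_days: int) -> Tuple[int, int]:
    """Return the (start, end) hours between which an offer is valid."""
    return received_hour, received_hour + duration_days * HOURS_PER_DAY

def is_within_window(event_hour: int, received_hour: int, duration_days: int) -> bool:
    """True if an event falls inside the offer's validity window (end inclusive, by assumption)."""
    start, end = offer_window(received_hour, duration_days)
    return start <= event_hour <= end

# A 7-day informational offer received at hour 168 stays valid until hour 336.
print(offer_window(168, 7))           # (168, 336)
print(is_within_window(300, 168, 7))  # True
print(is_within_window(400, 168, 7))  # False
```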

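The data set also contains transaction data for purchases made on the app, including the time of each purchase and the amount paid. The transaction data also includes the type and number of offers each customer received and the time at which each offer was viewed. A record is also created whenever a customer makes a purchase.

Keep in mind as well that a customer may make a purchase without having received or viewed an offer.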

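### Example

For example, a customer receives a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from the day of receipt. If the customer's purchases accumulate to at least 10 dollars during the validity period, the customer completes the offer.

There are a few things to watch out for in this data set, however. Offers take effect automatically: once a customer has received an offer, the reward applies as soon as the conditions are met, even if the customer never viewed the offer. For example, a customer receives the "spend 10 dollars, get 2 dollars off" offer but never opens it during the 10-day validity period, and spends 15 dollars in total during those 10 days. The data set records the offer as completed, yet the customer was not influenced by it, because the customer never knew it existed.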
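
The distinction between a completed offer and an influential one can be sketched as follows. This is only an illustration on a made-up event log with one offer per customer. The column layout is an assumption, and the rule used (the offer must be viewed after receipt and no later than completion) is one possible reading of the example above, not the only one.

```python
import pandas as pd

# Toy event log: customer "a" completes the offer without viewing it,
# customer "b" views it before completing it. Times are hours since the test began.
events = pd.DataFrame({
    "person": ["a", "a", "b", "b", "b"],
    "event": ["offer received", "offer completed",
              "offer received", "offer viewed", "offer completed"],
    "time": [0, 120, 0, 24, 96],
})

# Earliest time of each event type per customer (one column per event type).
first_times = events.pivot_table(index="person", columns="event",
                                 values="time", aggfunc="min")

# Count an offer as influential only if it was viewed after it was received
# and no later than the moment it was completed.
viewed = first_times["offer viewed"]
influenced = (
    viewed.notna()
    & (viewed >= first_times["offer received"])
    & (viewed <= first_times["offer completed"])
)
print(influenced.tolist())  # [False, True] for customers "a" and "b"
```

In the real data a customer can receive the same offer several times, so each receipt would need its own window before a rule like this is applied.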

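### Cleaning

Cleaning this data is both especially important and tricky.

You also need to account for demographic groups that make purchases even without receiving an offer. From a business perspective, if a customer is going to spend 10 dollars regardless of any offer, you would not want to send that customer a "spend 10 dollars, get 2 dollars off" offer. You may therefore want to assess what a given demographic group buys when it receives no offers at all.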
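
One way to estimate such a baseline is to average only the transactions made while no offer was active. The sketch below uses made-up data. The boolean column `under_offer` is hypothetical and is assumed to have been derived beforehand from the offers' validity windows.

```python
import pandas as pd

# Toy transactions; `under_offer` is a hypothetical flag marking purchases
# made while the customer had an active offer.
transactions = pd.DataFrame({
    "person": [0, 0, 1, 2, 2],
    "amount": [4.0, 6.0, 12.0, 3.0, 5.0],
    "under_offer": [False, True, False, False, False],
})
demographics = pd.DataFrame({"person": [0, 1, 2], "gender": ["F", "M", "F"]})

# Baseline: mean purchase amount per group, using offer-free transactions only.
baseline = (
    transactions[~transactions["under_offer"]]
    .merge(demographics, on="person")
    .groupby("gender")["amount"]
    .mean()
)
print(baseline)
```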

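### Final Advice

Because this is a capstone project, you are free to analyze the data in any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether a customer will respond to an offer. Or you need not build a machine learning model at all: you could develop a set of heuristics that determine which offer to send to each customer (for example, if 75% of 35-year-old women respond to offer A but only 40% respond to offer B, send them offer A).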
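
The heuristic approach can be sketched as a simple lookup of observed response rates. The data below is invented for illustration, and the `segment`, `offer`, and `responded` columns are assumptions about how responses might be tabulated.

```python
import pandas as pd

# Hypothetical response log: one row per (customer, offer) with a 0/1 outcome.
responses = pd.DataFrame({
    "segment": ["F_35", "F_35", "F_35", "F_35", "M_50", "M_50", "M_50", "M_50"],
    "offer": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "responded": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Response rate for every segment/offer pair (segments as rows, offers as columns).
rates = responses.groupby(["segment", "offer"])["responded"].mean().unstack("offer")

# Heuristic: send each segment the offer with the highest observed response rate.
best_offer = rates.idxmax(axis=1)
print(best_offer.to_dict())  # {'F_35': 'A', 'M_50': 'B'}
```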


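# Data Sets

The data is contained in three files:

* portfolio.json – offer ids and metadata about each offer (duration, type, etc.)
* profile.json – demographic data for each customer
* transcript.json – records of transactions, offers received, offers viewed, and offers completed

Here is the type and an explanation of each variable in the files: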

**portfolio.json**
* id (string) – offer id
* offer_type (string) – type of offer, e.g. BOGO, discount, informational
* difficulty (int) – minimum spend required to complete the offer
* reward (int) – reward given for completing the offer
* duration (int) – time the offer stays open, in days
* channels (list of strings)

**profile.json**
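* age (int) – age of the customer
* became_member_on (int) – date when the customer first registered on the app
* gender (str) – gender of the customer (note that besides M for male and F for female, some entries contain O for other)
* id (str) – customer id
* income (float) – customer's income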

**transcript.json**
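* event (str) – record description (e.g. transaction, offer received, offer viewed)
* person (str) – customer id
* time (int) – time in hours since the start of the test; the data begins at time t=0
* value (dict of strings) – either an offer id or a transaction amount, depending on the record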

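**Note:** If you are using the Workspace, open a terminal and run `conda update pandas` before reading in the files. The version of pandas in the Workspace cannot read the transcript.json file correctly, so pandas must be updated to the latest version. You can open a terminal by clicking the orange Jupyter icon in the top left corner of the notebook.

The two images below show how to open the terminal and how to install the update. First, open the terminal:
<img src="pic1.png"/>

Then run the command above:
<img src="pic2.png"/>

Finally, return to this notebook (again by clicking the orange Jupyter icon) and re-run the cell below; it should now run without errors.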

# Table of Contents

I. [Data Discovering & Visualization](#Data-Discovering)<br>
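II. [Rank-Based Recommendations](#Rank)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content-Based Recommendations (Optional)](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras and Conclusions](#conclusions)<br>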
[References](#references)

In [1]:
import pandas as pd
import numpy as np
import math
import json

import matplotlib.pyplot as plt
%matplotlib inline

from collections import defaultdict

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

# <a class="anchor" id="Data-Discovering">I. Data Discovering & Visualization</a>

## 1. Basic information
- .head()
    - Index, columns (features)
- .info()
    - NaN values, data size
- .describe()
    - Summary statistics
- .unique()
- .value_counts()

In [None]:
portfolio.head(3)

In [None]:
portfolio.info() 

In [None]:
portfolio.describe()

In [None]:
profile.head(3)

In [None]:
profile.info()

In [None]:
profile.describe()

In [None]:
transcript.head(3)

In [None]:
transcript.info()

In [None]:
transcript.describe()

## 2. Data Wrangling
### 2.1. DataFrame `profile`

#### `2.1.1.` NaNs
1. The missing values in 'gender' and 'income' and the invalid age (118) always occur in the same rows.
    - These 2175 rows have no usable gender, income, or age, so it is fine to drop them all.

In [None]:
########
### check
########
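# .isnull() matches both None and NaN
nan_in_income = np.where(profile.income.isnull())[0]
none_in_gender = np.where(profile.gender.isnull())[0]
illegal_age = np.where(profile.age == 118)[0]  # [0] takes the index array out of the tuple

# Both lines print True when the three conditions select exactly the same rows
print(np.array_equal(nan_in_income, none_in_gender))
print(np.array_equal(nan_in_income, illegal_age))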

In [2]:
# drop all 2175 rows that contain NaNs
profile_cleaned = profile.dropna(axis=0)
profile_cleaned.shape

(14825, 5)

#### `2.1.2.` Column: 'became_member_on'
1. Convert the 'int' values to 'date'
2. Calculate the time delta (in days) relative to the earliest date `20130729`

In [3]:
from datetime import date

def int2date(argdate: int) -> date:
    """
    If you have date as an integer, use this method to obtain a datetime.date object.

    Parameters
    ----------
    argdate : int
      Date as a regular integer value (example: 20160618)

    Returns
    -------
    datetime.date
      A date object which corresponds to the given value `argdate`.
    """
    year = int(argdate / 10000)
    month = int((argdate % 10000) / 100)
    day = int(argdate % 100)

    return date(year, month, day)


print(int2date(20160618))

#https://stackoverflow.com/questions/9750330/how-to-convert-integer-into-date-object-python

2016-06-18


In [4]:
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
profile_cleaned = profile_cleaned.assign(became_member_on=profile_cleaned['became_member_on'].apply(int2date))
profile_cleaned


Unnamed: 0,gender,age,id,became_member_on,income
1,F,55,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0
5,M,68,e2127556f4f64592b11af22de27a7932,2018-04-26,70000.0
8,M,65,389bc3fa690240e798340f5a15918d5c,2018-02-09,53000.0
12,M,58,2eeac8d8feae4a8cad5a6af0499a211d,2017-11-11,51000.0
...,...,...,...,...,...
16995,F,45,6d5f3a774f3d4714ab0c092238f3a1d7,2018-06-04,54000.0
16996,M,61,2cb4f97358b841b9a9773a7aa05a9d77,2018-07-13,72000.0
16997,M,49,01d26f638c274aa0b965d24cefe3183f,2017-01-26,73000.0
16998,F,83,9dc1421481194dcd9400aec7c9ae6366,2016-03-07,50000.0
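
Step 2 of the list above, the offset in days from a reference date, could be computed as in the following sketch. It works on a made-up series of integer dates and uses `pd.to_datetime` rather than the `int2date` helper. Taking the reference date from the minimum of the column is an assumption, not something this notebook does.

```python
import pandas as pd

# Toy integer dates in the same YYYYMMDD layout as 'became_member_on'.
became_member_on = pd.Series([20170715, 20130729, 20180426])

# Parse the integers as dates, then measure each one against the earliest date.
joined = pd.to_datetime(became_member_on.astype(str), format="%Y%m%d")
days_since_first = (joined - joined.min()).dt.days
print(days_since_first.tolist())  # [1447, 0, 1732]
```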


### 2.2. DataFrame `portfolio`

#### `2.2.1.` Column: 'channels'
1. Expand the list values into separate indicator (dummy) columns

In [5]:
# Join each list into one '*'-separated string, then one-hot encode the tokens
channels_df = portfolio['channels'].str.join(sep='*').str.get_dummies(sep='*')
# https://intellipaat.com/community/32880/create-dummies-from-a-column-with-multiple-values-in-pandas

In [8]:
portfolio_cleaned = pd.concat([portfolio, channels_df], axis=1)
portfolio_cleaned.drop(labels='channels', axis=1, inplace=True)

portfolio_cleaned

Unnamed: 0,reward,difficulty,duration,offer_type,id,email,mobile,social,web
0,10,10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd,1,1,1,0
1,10,10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0,1,1,1,1
2,0,0,4,informational,3f207df678b143eea3cee63160fa8bed,1,1,0,1
3,5,5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9,1,1,0,1
4,5,20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7,1,0,0,1
5,3,7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2,1,1,1,1
6,2,10,10,discount,fafdcd668e3743c1bb461111dcafc2a4,1,1,1,1
7,0,0,3,informational,5a8bc65990b245e5a138643cd4eb9837,1,1,1,0
8,5,5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d,1,1,1,1
9,2,10,7,discount,2906b810c7d4411798c6938adc9daaa5,1,1,0,1
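
To see what the join-then-`get_dummies` step does, here is a minimal sketch on a made-up three-row frame. The toy values and the `|` separator are illustrative choices; any separator that does not occur in the channel names would work.

```python
import pandas as pd

# Toy frame with a list-valued column, like portfolio['channels'].
offers = pd.DataFrame({"channels": [["email", "mobile"], ["web", "email"], ["email"]]})

# Series.str.join turns each list into a single string ("email|mobile", ...);
# Series.str.get_dummies then creates one 0/1 column per distinct token.
dummies = offers["channels"].str.join("|").str.get_dummies(sep="|")
print(dummies.columns.tolist())  # ['email', 'mobile', 'web']
print(dummies["web"].tolist())
```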


### 2.3. DataFrame `transcript`

#### `2.3.1.` Column: 'value'
1. Expand the dict values into separate columns, one per dict key

In [7]:
value_cleaned = transcript.value.apply(pd.Series)

In [9]:
# Merge the two key spellings: where 'offer id' is present, copy it into 'offer_id'
is_value_2_update = ~value_cleaned['offer id'].isnull()
value_cleaned['offer_id'] = np.where(is_value_2_update, value_cleaned['offer id'], value_cleaned['offer_id'])
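
The same expansion and merge can be illustrated on a tiny made-up series. This sketch builds the frame with `pd.DataFrame(list_of_dicts)` and merges the two spellings with `combine_first`. Both are alternatives to the `apply(pd.Series)` and `np.where` calls used in the cells above, not the method this notebook uses.

```python
import pandas as pd

# Toy 'value' column: the offer key is spelled both 'offer id' and 'offer_id'.
value = pd.Series([
    {"offer id": "abc"},
    {"amount": 9.5},
    {"offer_id": "abc", "reward": 5},
])

# One column per key; keys missing from a dict become NaN.
expanded = pd.DataFrame(value.tolist(), index=value.index)

# Keep 'offer_id' where present, otherwise fall back to 'offer id'.
expanded["offer_id"] = expanded["offer_id"].combine_first(expanded["offer id"])
expanded = expanded.drop(columns="offer id")
print(expanded)
```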


In [None]:
########
### check
########
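# Compare the rewards recorded in the transcript with those in portfolio_cleaned.
# 'reward' is still in value_cleaned here; transcript_cleaned is built in a later cell.
is_reward = ~value_cleaned['reward'].isnull()
test_df = value_cleaned.loc[is_reward, ['offer_id', 'reward']]

# If max and min agree for every offer, each offer always pays the same reward
print(test_df.groupby('offer_id')['reward'].max())
print(test_df.groupby('offer_id')['reward'].min())
# The transcript's reward values duplicate portfolio_cleaned['reward'], so the column can be dropped
portfolio_cleaned[['id', 'reward']]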

In [None]:
########
### check
########
# Count the non-NaN values in each column
print(value_cleaned['offer id'].count())
print(value_cleaned['reward'].count())
print(value_cleaned['offer_id'].count())
print(value_cleaned['amount'].count())  # non-NaN values
print(len(value_cleaned['amount']))  # total number of rows

In [10]:
value_cleaned.drop(labels='offer id', axis=1, inplace=True)
transcript_cleaned = pd.concat([transcript, value_cleaned], axis=1).drop(labels=['value','reward'], axis=1)

In [None]:
########
### check
########
is_legal_transcript = transcript_cleaned.person.isin(profile_cleaned.id)
transcript_cleaned = transcript_cleaned[is_legal_transcript]
transcript_cleaned

......
### go deep into Data Wrangling......
......

In [11]:
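# Map each person id to an integer label (0, 1, ...) in profile_cleaned and transcript_cleaned.
# Caution: defaultdict(int) returns 0 for any id that is missing from profile_cleaned,
# which is also the label of the first customer.

id_2_int = defaultdict(int)
for num, each_id in enumerate(profile_cleaned['id'].unique()):
    id_2_int[each_id] = num

def person_id_mapping(df_2_map, col, mapping_dict=id_2_int):
    """Replace the ids in df_2_map[col] with integer labels (modifies df_2_map in place)."""
    df_2_map[col] = df_2_map[col].apply(lambda x: mapping_dict[x])
    return df_2_map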
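
Because the mapping is a `defaultdict(int)`, an id that is not in `profile_cleaned` is silently labelled 0. The sketch below uses made-up ids (`id_a`, `id_b`, `id_unknown`) to show this effect. It also shows one possible alternative, `Series.map` with a plain dict, which leaves unknown ids as NaN so that they can be filtered out.

```python
from collections import defaultdict

import pandas as pd

known_ids = ["id_a", "id_b"]  # hypothetical ids present in the profile table
persons = pd.Series(["id_b", "id_unknown", "id_a"])

# defaultdict(int): the unknown id silently becomes 0, the same label as "id_a".
demo_mapping = defaultdict(int, {pid: n for n, pid in enumerate(known_ids)})
via_defaultdict = persons.apply(lambda x: demo_mapping[x])
print(via_defaultdict.tolist())  # [1, 0, 0]

# Plain dict with Series.map: the unknown id becomes NaN instead.
plain_mapping = {pid: n for n, pid in enumerate(known_ids)}
via_map = persons.map(plain_mapping)
print(via_map.isna().tolist())  # [False, True, False]
```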

In [12]:
transcript_cleaned = person_id_mapping(transcript_cleaned, 'person')
transcript_cleaned

Unnamed: 0,person,event,time,amount,offer_id
0,1,offer received,0,,9b98b8c7a33c4b65b9aebfe6a799e6d9
1,0,offer received,0,,0b1e1539f2cc45b7b9fa7c272da2e1d7
2,2,offer received,0,,2906b810c7d4411798c6938adc9daaa5
3,0,offer received,0,,fafdcd668e3743c1bb461111dcafc2a4
4,0,offer received,0,,4d5c57ea9a6940dd891ad53e9dbe8da0
...,...,...,...,...,...
306529,14791,transaction,714,1.59,
306530,14796,transaction,714,9.53,
306531,14809,transaction,714,3.61,
306532,14815,transaction,714,3.53,


In [13]:
# Map on an explicit copy to avoid pandas' SettingWithCopyWarning
profile_cleaned = person_id_mapping(profile_cleaned.copy(), 'id')
profile_cleaned


Unnamed: 0,gender,age,id,became_member_on,income
1,F,55,0,2017-07-15,112000.0
3,F,75,1,2017-05-09,100000.0
5,M,68,2,2018-04-26,70000.0
8,M,65,3,2018-02-09,53000.0
12,M,58,4,2017-11-11,51000.0
...,...,...,...,...,...
16995,F,45,14820,2018-06-04,54000.0
16996,M,61,14821,2018-07-13,72000.0
16997,M,49,14822,2017-01-26,73000.0
16998,F,83,14823,2016-03-07,50000.0


In [14]:
# Map each offer id to an integer label 0-9
offer_id_2_int = {offer_id: label for label, offer_id in enumerate(portfolio_cleaned['id'])}

# Sentinel for rows without an offer: NaN in transcript_cleaned is filled with -1 (int)
offer_id_2_int[-1] = -1

In [15]:
portfolio_cleaned['id'] = portfolio_cleaned['id'].apply(lambda x: offer_id_2_int[x])
portfolio_cleaned

Unnamed: 0,reward,difficulty,duration,offer_type,id,email,mobile,social,web
0,10,10,7,bogo,0,1,1,1,0
1,10,10,5,bogo,1,1,1,1,1
2,0,0,4,informational,2,1,1,0,1
3,5,5,7,bogo,3,1,1,0,1
4,5,20,10,discount,4,1,0,0,1
5,3,7,7,discount,5,1,1,1,1
6,2,10,10,discount,6,1,1,1,1
7,0,0,3,informational,7,1,1,1,0
8,5,5,5,bogo,8,1,1,1,1
9,2,10,7,discount,9,1,1,0,1


In [16]:
transcript_cleaned.amount.unique()  # Filling NaN with 0 is safe only if no real amount is 0; otherwise the two cases become indistinguishable

array([   nan,   0.83,  34.56, ..., 685.07, 405.04, 476.33])

In [17]:
transcript_cleaned.fillna(value={'amount': 0, 'offer_id': -1}, axis =0, inplace=True)
transcript_cleaned['offer_id'] = transcript_cleaned['offer_id'].apply(lambda x: offer_id_2_int[x])

transcript_cleaned['time'] = transcript_cleaned['time'] / 24.0  # hours -> days
transcript_cleaned

Unnamed: 0,person,event,time,amount,offer_id
0,1,offer received,0.00,0.00,3
1,0,offer received,0.00,0.00,4
2,2,offer received,0.00,0.00,9
3,0,offer received,0.00,0.00,6
4,0,offer received,0.00,0.00,1
...,...,...,...,...,...
306529,14791,transaction,29.75,1.59,-1
306530,14796,transaction,29.75,9.53,-1
306531,14809,transaction,29.75,3.61,-1
306532,14815,transaction,29.75,3.53,-1


In [18]:
profile_cleaned.to_csv('./profile_cleaned.csv')
portfolio_cleaned.to_csv('./portfolio_cleaned.csv')
transcript_cleaned.to_csv('./transcript_cleaned.csv')
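
Note that `to_csv` writes the DataFrame index by default, and reading such a file back with `pd.read_csv` yields an extra `Unnamed: 0` column. The sketch below demonstrates this on a made-up frame, writing to an in-memory buffer instead of a file. Whether to pass `index=False` here is a design choice that depends on whether the original row labels are needed later.

```python
import io

import pandas as pd

df = pd.DataFrame({"person": [0, 1], "amount": [1.5, 2.0]}, index=[3, 7])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'person', 'amount']

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['person', 'amount']
```

Alternatively, `pd.read_csv(..., index_col=0)` restores the saved index when the file is read back.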

In [None]:
'''
profile_cleaned.rename(columns={"id": "person"}, inplace=True)
portfolio_cleaned.rename(columns={"id": "offer_id"}, inplace=True)
'''

# Summary
1. profile_cleaned (14825, 5)
    - id: `int 0-...`
    - gender
    - age
    - became_member_on
    - income
2. portfolio_cleaned (10, 9)
    - id: `int 0-9`
    - email, mobile, social, web
    - reward, difficulty, offer_type
    - duration

3. transcript_cleaned (306534, 6) → (272762, 5) legal rows (persons present in profile_cleaned), with `reward` dropped
    - person: `int 0-...`
    - offer_id: `int 0-9`
    - ~~reward (useless)~~
    - amount
    - event: `get labels(def transformation)`
    - time: hours / 24 (days)

### 2.4. Get labels 

#### `2.4.1.` Promotion was viewed

```python

```