# Starbucks Capstone Challenge

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a "buy 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars, get 2 dollars off" offer, but the user never opens the offer during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
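The rule described above can be sketched on a few made-up rows. This is only an illustration under stated assumptions: the table layout (one row per offer sent, with hypothetical `received`, `viewed` and `completed` columns in hours) and the column names are not part of the data set, and the notebook may handle this differently.

```python
import pandas as pd

# Hypothetical toy data: one row per offer sent, times in hours since the start of the test.
offers = pd.DataFrame({
    'person': ['p1', 'p2', 'p3'],
    'duration_days': [10, 10, 10],
    'received': [0, 0, 0],
    'viewed': [12.0, None, 200.0],   # p2 never viewed the offer
    'completed': [48.0, 72.0, 150.0],
})

# The offer expires `duration_days` after receipt (time columns are in hours).
expires = offers['received'] + offers['duration_days'] * 24

# A completion counts as influenced only if the offer was viewed first
# and the completion happened inside the validity window.
offers['influenced'] = (
    offers['viewed'].notna()
    & (offers['viewed'] <= offers['completed'])
    & (offers['completed'] <= expires)
)

print(offers[['person', 'influenced']])
```

Only `p1` is flagged here: `p2` completed without viewing, and `p3` viewed the offer after completing it.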

### Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

### Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (e.g., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
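As a minimal sketch of the heuristic approach, the snippet below picks the offer with the highest response rate for each demographic group. The group labels and rates are invented for illustration and do not come from the data.

```python
import pandas as pd

# Hypothetical response rates per demographic group and offer (made-up numbers).
rates = pd.DataFrame({
    'group': ['F_35', 'F_35', 'M_35', 'M_35'],
    'offer': ['A', 'B', 'A', 'B'],
    'response_rate': [0.75, 0.40, 0.30, 0.55],
})

# For each group, keep the row with the highest response rate.
best = rates.loc[rates.groupby('group')['response_rate'].idxmax(), ['group', 'offer']]
best_offer = dict(zip(best['group'], best['offer']))
print(best_offer)
```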

# Data Sets

The data is contained in three files:

* portfolio.json - containing offer ids and metadata about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer, i.e., BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (e.g., transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value (dict of strings) - either an offer id or transaction amount depending on the record

--- 

# Importing Libraries and Loading Data

Import the necessary Python libraries and load the data from the JSON files into pandas DataFrames.

### Libraries Imported
- `pandas` (`pd`): Used for data manipulation and analysis.
- `numpy` (`np`): Provides support for numerical operations.
- `math`: Standard Python library for mathematical functions.
- `json`: Enables working with JSON data.
- `matplotlib.pyplot` (`plt`): Used for data visualization.
- `seaborn` (`sns`): Enhances visualization with statistical plotting.

### Loading Data
The `pd.read_json()` function is used to read JSON files:
- `portfolio_raw`: Contains data from `portfolio.json`, the promotional offers and their attributes.
- `profile_raw`: Contains data from `profile.json`, the customer demographic information.
- `transcript_raw`: Contains data from `transcript.json`, the record of offer events and transactions.

Each file is read with `orient='records'` and `lines=True`, ensuring that each JSON object in the file is interpreted as a separate record (suitable for line-delimited JSON files).
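To illustrate what `lines=True` does, here is a small self-contained sketch that reads two made-up records from an in-memory buffer instead of a file. The field names and values are invented for the example.

```python
import io
import pandas as pd

# Two made-up records, one JSON object per line (line-delimited JSON).
raw = io.StringIO(
    '{"id": "a1", "reward": 10}\n'
    '{"id": "b2", "reward": 5}\n'
)

# Each line becomes one row of the resulting DataFrame.
df = pd.read_json(raw, orient='records', lines=True)
print(df)
```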


In [1]:
# importing libraries
import pandas as pd
import numpy as np

# read in the json files
portfolio_raw = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile_raw = pd.read_json('data/profile.json', orient='records', lines=True)
transcript_raw = pd.read_json('data/transcript.json', orient='records', lines=True)


# Data understanding

--- 
Objectives: 

* Examination of each individual table and its corresponding columns.
* Exploratory data analysis (EDA) with some statistics.
---


**portfolio.json:**

Ten offers and their attributes.

--> The data is clean and ready to be used.

In [2]:
portfolio_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 612.0+ bytes


In [3]:
# showing the entire table
portfolio_raw

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7
5,3,"[web, email, mobile, social]",7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2
6,2,"[web, email, mobile, social]",10,10,discount,fafdcd668e3743c1bb461111dcafc2a4
7,0,"[email, mobile, social]",0,3,informational,5a8bc65990b245e5a138643cd4eb9837
8,5,"[web, email, mobile, social]",5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d
9,2,"[web, email, mobile]",10,7,discount,2906b810c7d4411798c6938adc9daaa5


**profile.json**

Customer demographic data.

--> There are 17000 customers in the dataset.

--> There are 2175 (~12.8 %) missing values in `gender` and `income`, and the `age` value is 118 for these rows.

--> ~50 % are 'M' and ~36 % are 'F'. ~1.2 % have gender 'O'.

--> The column `became_member_on` holds dates as unformatted integers (e.g. `20170212`).
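The link between the placeholder age and the missing demographics can be checked directly. The sketch below uses a made-up miniature of the profile table; treating 118 as missing before computing age statistics is one possible option, not necessarily what the rest of the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the profile table.
toy_profile = pd.DataFrame({
    'gender': ['F', None, 'M', None],
    'age': [55, 118, 40, 118],
    'income': [112000.0, np.nan, 70000.0, np.nan],
})

missing_demo = toy_profile['gender'].isna() & toy_profile['income'].isna()
is_placeholder_age = toy_profile['age'] == 118

# True when the placeholder age appears exactly on the rows with missing demographics.
print((missing_demo == is_placeholder_age).all())

# One option: treat the placeholder as missing before computing age statistics.
toy_profile['age_clean'] = toy_profile['age'].where(~is_placeholder_age)
print(toy_profile['age_clean'].mean())
```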

In [4]:
profile_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB


In [5]:
profile_raw['gender'].value_counts(dropna=False, normalize=True)

gender
M       0.499059
F       0.360529
None    0.127941
O       0.012471
Name: proportion, dtype: float64

In [6]:
profile_raw['age'].value_counts(dropna=False, normalize=True)

age
118    0.127941
58     0.024000
53     0.021882
51     0.021353
59     0.021118
         ...   
100    0.000706
96     0.000471
98     0.000294
101    0.000294
99     0.000294
Name: proportion, Length: 85, dtype: float64

In [7]:
profile_raw.describe(include='all') 

Unnamed: 0,gender,age,id,became_member_on,income
count,14825,17000.0,17000,17000.0,14825.0
unique,3,,17000,,
top,M,,e4052622e5ba45a8b96b59aba68cf068,,
freq,8484,,1,,
mean,,62.531412,,20167030.0,65404.991568
std,,26.73858,,11677.5,21598.29941
min,,18.0,,20130730.0,30000.0
25%,,45.0,,20160530.0,49000.0
50%,,58.0,,20170800.0,64000.0
75%,,73.0,,20171230.0,80000.0


**transcript.json**

A timeline of the events that took place during the simulation.

--> The dictionary keys in the `value` column are inconsistent (`offer id` vs `offer_id`).

--> The category names in the `event` column contain spaces (e.g. `offer received`) and can be normalized, for example by replacing `' '` with `'_'`.

--> There are no missing values.

In [8]:
transcript_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB


In [9]:
{[*x][0] for x in transcript_raw['value']}

{'amount', 'offer id', 'offer_id'}

In [10]:
transcript_raw['event'].value_counts()

event
transaction        138953
offer received      76277
offer viewed        57725
offer completed     33579
Name: count, dtype: int64

# Data Preparation
---
- Objective:

Creation of `Analytical Tables` datasets for analysis, visuals, recommendations and machine learning applications.

- Strategy:
1. Loading raw data from the original tables.
2. Restructuring it using groupby/unstack and creating features.
3. Selecting relevant variables and features.

This process ensures that the data tables and **features** are properly formatted, aggregated, and cleaned for further analysis. 

---

#### **Portfolio Dataset**
- Add a column `ofr_id_short` with short labels that make the offer `id` values easier to read.

In [11]:
#creating a copy from the original dataframe
portfolio = portfolio_raw.copy() 

# dictionary mapping each offer id to a short, readable label
port_id = {
    'ae264e3637204a6fb9bb56bc8210ddfd': 'ofr_A',
    '4d5c57ea9a6940dd891ad53e9dbe8da0': 'ofr_B',
    '3f207df678b143eea3cee63160fa8bed': 'ofr_C',
    '9b98b8c7a33c4b65b9aebfe6a799e6d9': 'ofr_D',
    '0b1e1539f2cc45b7b9fa7c272da2e1d7': 'ofr_E',
    '2298d6c36e964ae4a3e7e9706d1fb8c2': 'ofr_F',
    'fafdcd668e3743c1bb461111dcafc2a4': 'ofr_G',
    '5a8bc65990b245e5a138643cd4eb9837': 'ofr_H',
    'f19421c1d4aa40978ebb69ca19b0e20d': 'ofr_I',
    '2906b810c7d4411798c6938adc9daaa5': 'ofr_J'
}

# mapping the id column
portfolio['ofr_id_short'] = portfolio['id'].map(port_id)

# persist a csv file to the bronze folder
portfolio.to_csv('medalion_data_store/bronze/portfolio.csv', index=False)

portfolio

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id,ofr_id_short
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd,ofr_A
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0,ofr_B
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed,ofr_C
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9,ofr_D
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7,ofr_E
5,3,"[web, email, mobile, social]",7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2,ofr_F
6,2,"[web, email, mobile, social]",10,10,discount,fafdcd668e3743c1bb461111dcafc2a4,ofr_G
7,0,"[email, mobile, social]",0,3,informational,5a8bc65990b245e5a138643cd4eb9837,ofr_H
8,5,"[web, email, mobile, social]",5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d,ofr_I
9,2,"[web, email, mobile]",10,7,discount,2906b810c7d4411798c6938adc9daaa5,ofr_J


#### **Profile Dataset**
- Convert the `became_member_on` column to a standardized **datetime** format for consistency and easier analysis.
- Create a new column with only the year and month of the date.
- Fill missing `gender` values with `gen_ukn` and bin `age` into four age groups.
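A small sketch of the date parsing and the age binning on two made-up rows. The bin edges and labels are the ones used in the cell below; the sample values are invented.

```python
import pandas as pd

toy = pd.DataFrame({'became_member_on': [20170212, 20180726], 'age': [25, 66]})

# Integers such as 20170212 are parsed as year-month-day.
toy['became_member_on'] = pd.to_datetime(toy['became_member_on'].astype(str), format='%Y%m%d')
toy['year_month'] = toy['became_member_on'].dt.strftime('%Y-%m')

# pd.cut uses right-closed intervals by default, so age 25 falls in (-1, 25] -> 'Young'.
toy['age_group'] = pd.cut(
    toy['age'],
    bins=[-1, 25, 45, 65, 118],
    labels=['Young', 'Adult', 'Middle', 'Senior'],
)
print(toy)
```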

In [67]:
# copy the raw data into a new dataframe
profile = profile_raw.copy(deep=True)

# Convert the 'became_member_on' column to a datetime format
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')

# Create a new column with only the year and month of the membership
profile['bec_memb_year_month'] = profile['became_member_on'].dt.strftime('%Y-%m')

# Flag missing gender values with an explicit category
profile['gender'] = profile['gender'].fillna('gen_ukn')

# Applying age categorization (the placeholder age 118 falls into 'Senior')
profile['age_group'] = pd.cut(
    profile['age'],
    bins=[-1, 25, 45, 65, 118],
    labels=['Young', 'Adult', 'Middle', 'Senior'],
    include_lowest=True
)

# Persist a csv file to the data store
profile.to_csv('medalion_data_store/bronze/profile.csv', index=False)

profile.head()

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
0,gen_ukn,118,68be06ca386d4c31939f3a4f0e3dd783,2017-02-12,,2017-02,Senior
1,F,55,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0,2017-07,Middle
2,gen_ukn,118,38fe809add3b4fcf9315a9694bb96ff5,2018-07-12,,2018-07,Senior
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0,2017-05,Senior
4,gen_ukn,118,a03223e636434f42ac4c3df47e8bac43,2017-08-04,,2017-08,Senior


#### **Transcript dataset**
- Clean, normalize and transform column values from the original transcript table.
- Create a new table `transcript_b` using the `json_normalize()` and `concat()` functions.

**Strategy:**

1. Copy the data.
2. Normalize the `value` column dictionary keys: `offer id` --> `offer_id`.
3. Create the `transcript_b` table using the `pd.json_normalize()` and `pd.concat()` functions.
4. Normalize the `event` column values by dropping the `offer ` prefix (`offer received` --> `received`).
5. Create the `ofr_id_short` column with a more readable id and drop the `offer_id` column.
6. Fill NaN with appropriate values (`tran` for transaction rows, which have no offer id).
7. Create a `tag` column to number the person-event-offer interactions one by one (as a person can interact with the same offer more than once).
8. Persist the table to a CSV file in the data store.
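The steps above can be sketched on a four-row toy table. The data is made up, and the key fix is written inline as a dictionary comprehension for brevity; the notebook uses its own helper function for that step.

```python
import pandas as pd

toy = pd.DataFrame({
    'person': ['p1', 'p1', 'p1', 'p2'],
    'event': ['offer received', 'transaction', 'offer received', 'offer received'],
    'value': [{'offer id': 'x'}, {'amount': 3.5}, {'offer_id': 'x'}, {'offer id': 'y'}],
})

# Rename the inconsistent key without mutating the original dictionaries.
fixed = toy['value'].apply(
    lambda d: {('offer_id' if k == 'offer id' else k): v for k, v in d.items()}
)

# One column per dictionary key; keys absent from a record become NaN.
flat = pd.concat([toy.drop(columns='value'), pd.json_normalize(fixed.tolist())], axis=1)

# Transactions carry no offer id, so give them an explicit label before grouping.
flat['offer_id'] = flat['offer_id'].fillna('tran')

# Number repeated person-offer-event combinations 0, 1, 2, ...
flat['tag'] = flat.groupby(['person', 'offer_id', 'event']).cumcount()
print(flat)
```

The second `offer received` row for `p1` and offer `x` gets `tag` 1, which is how repeated interactions with the same offer are told apart.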

In [13]:
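def fix_offer_id(value):
    """
    Fixes the 'offer id' key in a dictionary by renaming it to 'offer_id'.

    Parameters:
    value (dict): A dictionary that may contain the 'offer id' key.

    Returns:
    dict: A new dictionary with 'offer id' replaced by 'offer_id'
          (the input is returned unchanged if the key is absent).
    """
    if isinstance(value, dict) and 'offer id' in value:
        # build a new dict: DataFrame.copy(deep=True) does not copy the dict
        # objects themselves, so an in-place rename would also alter transcript_raw
        return {('offer_id' if key == 'offer id' else key): val for key, val in value.items()}
    return value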

In [14]:
# copy the raw data into a new dataframe
transcript = transcript_raw.copy(deep=True)


# applying the fix_offer_id function
transcript['value'] = transcript['value'].apply(fix_offer_id)

# Normalize the 'value' column with json_normalize method
value_df = pd.json_normalize(transcript['value']) 
transcript_b = pd.concat([transcript, value_df], axis=1).drop('value', axis=1)

# Normalizing the event names: drop the 'offer ' prefix ('offer received' -> 'received')
transcript_b['event'] = transcript_b['event'].str.replace('offer ', '', regex=False)

# mapping the offer_id to the offer_id_short and Dropping the offer_id column
transcript_b['ofr_id_short'] = transcript_b['offer_id'].map(port_id).fillna('tran')
transcript_b = transcript_b.drop(columns = ['offer_id'])

# creating a tag column to identify the order of the events for each person-offer-event fact
transcript_b['tag'] = (
    transcript_b.groupby(['person', 'ofr_id_short', 'event'], observed=True)
    .cumcount()
)

# persist a csv file to the bronze folder
transcript_b.to_csv('medalion_data_store/bronze/transcript_b.csv', index=False)

transcript_b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   person        306534 non-null  object 
 1   event         306534 non-null  object 
 2   time          306534 non-null  int64  
 3   amount        138953 non-null  float64
 4   reward        33579 non-null   float64
 5   ofr_id_short  306534 non-null  object 
 6   tag           306534 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 16.4+ MB


### Separating *transactions* events in transcript_b dataset from the others

* **Source**: 

    `transcript_b`.

* **Created tables:**

    `events` and `transactions`: separated data from `transcript_b`


- Create the `events` and `transactions` dataframes.
- Drop the columns that are not needed.
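The separation cells also add a `time_diff` column, the time elapsed since the same person's previous record. A minimal sketch of that pattern on made-up transactions:

```python
import pandas as pd

toy_tx = pd.DataFrame({
    'person': ['p1', 'p2', 'p1', 'p1'],
    'time': [228, 10, 414, 528],
})

# Sort first so that diff() compares consecutive records of the same person.
toy_tx = toy_tx.sort_values(['person', 'time'])

# The first record of each person has no predecessor, so its diff is NaN.
toy_tx['time_diff'] = toy_tx.groupby('person')['time'].diff()
print(toy_tx)
```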

In [15]:
transcript_b

Unnamed: 0,person,event,time,amount,reward,ofr_id_short,tag
0,78afa995795e4d85b5d9ceeca43f5fef,received,0,,,ofr_D,0
1,a03223e636434f42ac4c3df47e8bac43,received,0,,,ofr_E,0
2,e2127556f4f64592b11af22de27a7932,received,0,,,ofr_J,0
3,8ec6ce2a7e7949b1bf142def7d0e0586,received,0,,,ofr_G,0
4,68617ca6246f4fbc85e91a2a49552598,received,0,,,ofr_B,0
...,...,...,...,...,...,...,...
306529,b3a1272bc9904337b331bf348c3e8c17,transaction,714,1.59,,tran,13
306530,68213b08d99a4ae1b0dcb72aebd9aa35,transaction,714,9.53,,tran,1
306531,a00058cf10334a308c68e7631c529907,transaction,714,3.61,,tran,19
306532,76ddbd6576844afe811f1a3c0fbb5bec,transaction,714,3.53,,tran,12


In [16]:
# filtering the data: offer events / transactions
events = transcript_b[transcript_b['event'] != 'transaction'].copy()
transactions = transcript_b[transcript_b['event'] == 'transaction'].copy()

# drop the 'amount' column as it is NaN for all offer events.
events = events.drop(columns=['amount']) 

# drop the reward (all NaN) and ofr_id_short (always 'tran') columns as they are not relevant in this dataset. 
transactions = transactions.drop(columns=['reward', 'ofr_id_short']) 

# Sorting table by time
events = events.sort_values(by=['person', 'time'])
transactions = transactions.sort_values(by=['person', 'time'])

events['time_diff'] = events.groupby(['person', 'event'])['time'].diff()
transactions['time_diff'] = transactions.groupby(['person'])['time'].diff()

# persisting the data in the bronze layer
events.to_csv('medalion_data_store/bronze/events.csv', index=False)
transactions.to_csv('medalion_data_store/bronze/transactions.csv', index=False)

In [17]:
events.info()

<class 'pandas.core.frame.DataFrame'>
Index: 167581 entries, 55972 to 262475
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   person        167581 non-null  object 
 1   event         167581 non-null  object 
 2   time          167581 non-null  int64  
 3   reward        33579 non-null   float64
 4   ofr_id_short  167581 non-null  object 
 5   tag           167581 non-null  int64  
 6   time_diff     120979 non-null  float64
dtypes: float64(2), int64(2), object(3)
memory usage: 10.2+ MB


In [18]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 138953 entries, 89291 to 289924
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   person     138953 non-null  object 
 1   event      138953 non-null  object 
 2   time       138953 non-null  int64  
 3   amount     138953 non-null  float64
 4   tag        138953 non-null  int64  
 5   time_diff  122375 non-null  float64
dtypes: float64(2), int64(2), object(2)
memory usage: 7.4+ MB


# Tracking customers

Explore the tables that track customers to see 'where they are':
  - There are 16578 (97.5%) customers in the transactions table and 16994 (>99.9%) in the events table.
  - 97.5% made transactions.
  - 6 (less than 0.1%) customers did not receive offers (not in the events table) but made transactions.
  
  Potential customers:
  - 422 (2.5%) customers made no transactions (not in the transactions table) but received offers.
  - Most common attributes among these 422 customers, each counted separately: gender `M` (189, ~44.8%), member since `2018-03` (42, ~10%), age group `Senior` (188, ~44.5%, which includes the placeholder age 118).
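The same overlap counts can be obtained in a single pass with an outer merge. This is an alternative sketch on made-up ids, not the approach used in the cells below, which rely on set differences.

```python
import pandas as pd

events_people = pd.Series(['p1', 'p2', 'p3'], name='person')
tx_people = pd.Series(['p2', 'p3', 'p4'], name='person')

# An outer merge with indicator=True labels where each id was found:
# 'left_only' = offers but no transactions, 'right_only' = transactions but no offers.
overlap = pd.merge(
    events_people.drop_duplicates().to_frame(),
    tx_people.drop_duplicates().to_frame(),
    on='person', how='outer', indicator=True,
)
counts = overlap['_merge'].value_counts()
print(counts)
```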


In [19]:
transactions['person'].nunique()

16578

In [20]:
events['person'].nunique()

16994

In [21]:
len((set(transactions['person']) - set(events['person'])))

6

In [22]:
profile.loc[profile['id'].isin(list((set(transactions['person']) - set(events['person'])))), :]

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
872,F,72,c6e579c6821c41d1a7a6a9cf936e91bb,2017-10-14,35000.0,2017-10,Senior
5425,,118,da7a7c0dcfcb41a8acc7864a53cf60fb,2017-08-01,,2017-08,Senior
5639,F,66,eb540099db834cf59001f83a4561aef3,2017-09-29,34000.0,2017-09,Senior
6789,F,55,3a4874d8f0ef42b9a1b72294902afea9,2016-08-16,88000.0,2016-08,Middle
14763,F,54,ae8111e7e8cd4b60a8d35c42c1110555,2017-01-06,72000.0,2017-01,Middle
15391,M,91,12ede229379747bd8d74ccdc20097ca3,2015-10-05,70000.0,2015-10,Senior


In [23]:
len((set(events['person']) - set(transactions['person'])))

422

In [24]:
profile.loc[profile['id'].isin(list((set(events['person']) - set(transactions['person'])))), :].describe(include='all')

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
count,333,422.0,422,422,333.0,422,422
unique,3,,422,,,33,4
top,M,,b6f74fc8e1664cfb9b44834dd9f7cf48,,,2018-03,Senior
freq,189,,1,,,42,188
mean,,70.810427,,2017-12-11 19:44:04.549763072,73537.537538,,
min,,18.0,,2013-11-23 00:00:00,31000.0,,
25%,,51.0,,2017-10-06 06:00:00,58000.0,,
50%,,62.5,,2018-01-21 00:00:00,73000.0,,
75%,,82.75,,2018-04-23 18:00:00,89000.0,,
max,,118.0,,2018-07-26 00:00:00,119000.0,,


In [25]:
profile.loc[profile['id'].isin(list((set(events['person']) - set(transactions['person'])))), :].isna().sum()

gender                 89
age                     0
id                      0
became_member_on        0
income                 89
bec_memb_year_month     0
age_group               0
dtype: int64

In [26]:
profile.isna().sum()

gender                 2175
age                       0
id                        0
became_member_on          0
income                 2175
bec_memb_year_month       0
age_group                 0
dtype: int64

In [28]:
profile

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
0,,118,68be06ca386d4c31939f3a4f0e3dd783,2017-02-12,,2017-02,Senior
1,F,55,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0,2017-07,Middle
2,,118,38fe809add3b4fcf9315a9694bb96ff5,2018-07-12,,2018-07,Senior
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0,2017-05,Senior
4,,118,a03223e636434f42ac4c3df47e8bac43,2017-08-04,,2017-08,Senior
...,...,...,...,...,...,...,...
16995,F,45,6d5f3a774f3d4714ab0c092238f3a1d7,2018-06-04,54000.0,2018-06,Adult
16996,M,61,2cb4f97358b841b9a9773a7aa05a9d77,2018-07-13,72000.0,2018-07,Middle
16997,M,49,01d26f638c274aa0b965d24cefe3183f,2017-01-26,73000.0,2017-01,Middle
16998,F,83,9dc1421481194dcd9400aec7c9ae6366,2016-03-07,50000.0,2016-03,Senior


In [29]:
portfolio

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id,ofr_id_short
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd,ofr_A
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0,ofr_B
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed,ofr_C
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9,ofr_D
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7,ofr_E
5,3,"[web, email, mobile, social]",7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2,ofr_F
6,2,"[web, email, mobile, social]",10,10,discount,fafdcd668e3743c1bb461111dcafc2a4,ofr_G
7,0,"[email, mobile, social]",0,3,informational,5a8bc65990b245e5a138643cd4eb9837,ofr_H
8,5,"[web, email, mobile, social]",5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d,ofr_I
9,2,"[web, email, mobile]",10,7,discount,2906b810c7d4411798c6938adc9daaa5,ofr_J


In [31]:
events

Unnamed: 0,person,event,time,reward,ofr_id_short,tag,time_diff
55972,0009655768c64bdeb2e877511632db8f,received,168,,ofr_H,0,
77705,0009655768c64bdeb2e877511632db8f,viewed,192,,ofr_H,0,
113605,0009655768c64bdeb2e877511632db8f,received,336,,ofr_C,0,168.0
139992,0009655768c64bdeb2e877511632db8f,viewed,372,,ofr_C,0,180.0
153401,0009655768c64bdeb2e877511632db8f,received,408,,ofr_I,0,72.0
...,...,...,...,...,...,...,...
214717,ffff82501cea40309d5fdd7edcca4a07,completed,504,5.0,ofr_D,0,90.0
230690,ffff82501cea40309d5fdd7edcca4a07,viewed,534,,ofr_D,0,120.0
246495,ffff82501cea40309d5fdd7edcca4a07,received,576,,ofr_J,2,72.0
258362,ffff82501cea40309d5fdd7edcca4a07,completed,576,2.0,ofr_J,2,72.0


In [30]:
transactions

Unnamed: 0,person,event,time,amount,tag,time_diff
89291,0009655768c64bdeb2e877511632db8f,transaction,228,22.16,0,
168412,0009655768c64bdeb2e877511632db8f,transaction,414,8.57,1,186.0
228422,0009655768c64bdeb2e877511632db8f,transaction,528,14.11,2,114.0
237784,0009655768c64bdeb2e877511632db8f,transaction,552,13.56,3,24.0
258883,0009655768c64bdeb2e877511632db8f,transaction,576,10.27,4,24.0
...,...,...,...,...,...,...
200255,ffff82501cea40309d5fdd7edcca4a07,transaction,498,13.17,10,84.0
214716,ffff82501cea40309d5fdd7edcca4a07,transaction,504,7.79,11,6.0
258361,ffff82501cea40309d5fdd7edcca4a07,transaction,576,14.23,12,72.0
274809,ffff82501cea40309d5fdd7edcca4a07,transaction,606,10.12,13,30.0


# Grouping profiles

In [94]:
df = events.merge(profile.dropna(), how='left', left_on=['person'], right_on=['id']).drop(columns=['id'])
df = df.merge(portfolio, on=['ofr_id_short'], how='left').drop(columns=['id'])
df

Unnamed: 0,person,event,time,reward_x,ofr_id_short,tag,time_diff,gender,age,became_member_on,income,bec_memb_year_month,age_group,reward_y,channels,difficulty,duration,offer_type
0,0009655768c64bdeb2e877511632db8f,received,168,,ofr_H,0,,M,33.0,2017-04-21,72000.0,2017-04,Adult,0,"[email, mobile, social]",0,3,informational
1,0009655768c64bdeb2e877511632db8f,viewed,192,,ofr_H,0,,M,33.0,2017-04-21,72000.0,2017-04,Adult,0,"[email, mobile, social]",0,3,informational
2,0009655768c64bdeb2e877511632db8f,received,336,,ofr_C,0,168.0,M,33.0,2017-04-21,72000.0,2017-04,Adult,0,"[web, email, mobile]",0,4,informational
3,0009655768c64bdeb2e877511632db8f,viewed,372,,ofr_C,0,180.0,M,33.0,2017-04-21,72000.0,2017-04,Adult,0,"[web, email, mobile]",0,4,informational
4,0009655768c64bdeb2e877511632db8f,received,408,,ofr_I,0,72.0,M,33.0,2017-04-21,72000.0,2017-04,Adult,5,"[web, email, mobile, social]",5,5,bogo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
167576,ffff82501cea40309d5fdd7edcca4a07,completed,504,5.0,ofr_D,0,90.0,F,45.0,2016-11-25,62000.0,2016-11,Adult,5,"[web, email, mobile]",5,7,bogo
167577,ffff82501cea40309d5fdd7edcca4a07,viewed,534,,ofr_D,0,120.0,F,45.0,2016-11-25,62000.0,2016-11,Adult,5,"[web, email, mobile]",5,7,bogo
167578,ffff82501cea40309d5fdd7edcca4a07,received,576,,ofr_J,2,72.0,F,45.0,2016-11-25,62000.0,2016-11,Adult,2,"[web, email, mobile]",10,7,discount
167579,ffff82501cea40309d5fdd7edcca4a07,completed,576,2.0,ofr_J,2,72.0,F,45.0,2016-11-25,62000.0,2016-11,Adult,2,"[web, email, mobile]",10,7,discount


In [135]:
# count each event type per offer, offer type and gender
person_group = (
    df.groupby(['ofr_id_short', 'offer_type', 'gender', 'event'])
    .agg(cnt=('ofr_id_short', 'count'))
    .unstack(level='event')
    .reset_index()
)

# flatten the multi-level column names, e.g. ('cnt', 'viewed') -> 'cnt_viewed'
person_group.columns = ['_'.join(col).strip('_') for col in person_group.columns.to_flat_index()]

# completion and view ratios relative to the number of offers received
person_group['comp_ratio'] = person_group['cnt_completed'] / person_group['cnt_received']
person_group['view_ratio'] = person_group['cnt_viewed'] / person_group['cnt_received']

# offers with no completions (informational) get 0 instead of NaN
person_group = person_group.fillna(0).round(2)


person_group.to_csv('medalion_data_store/silver/person_group.csv', index=False)

person_group


Unnamed: 0,ofr_id_short,offer_type,gender,cnt_completed,cnt_received,cnt_viewed,comp_ratio,view_ratio
0,ofr_A,bogo,F,1857.0,2750.0,2364.0,0.68,0.86
1,ofr_A,bogo,M,1741.0,3840.0,3454.0,0.45,0.9
2,ofr_A,bogo,O,59.0,93.0,83.0,0.63,0.89
3,ofr_B,bogo,F,1746.0,2737.0,2623.0,0.64,0.96
4,ofr_B,bogo,M,1519.0,3784.0,3635.0,0.4,0.96
5,ofr_B,bogo,O,45.0,72.0,71.0,0.62,0.99
6,ofr_C,informational,F,0.0,2749.0,1515.0,0.0,0.55
7,ofr_C,informational,M,0.0,3812.0,1902.0,0.0,0.5
8,ofr_C,informational,O,0.0,96.0,70.0,0.0,0.73
9,ofr_D,bogo,F,1999.0,2767.0,1544.0,0.72,0.56


# **Feature Engineering**

- Process the datasets to create useful tables and extract features from the data.

## Sorted offers by person-event

A summary table of the offers each person received, viewed and completed, in the order in which the events occur.
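A minimal sketch of how such an ordered summary can be built with `pivot_table` and a string-joining aggregation. The rows are made up; the column names mirror the `events` table.

```python
import pandas as pd

toy_events = pd.DataFrame({
    'person': ['p1', 'p1', 'p1', 'p2'],
    'event': ['received', 'received', 'completed', 'received'],
    'time': [0, 168, 200, 0],
    'ofr_id_short': ['ofr_A', 'ofr_B', 'ofr_B', 'ofr_C'],
})

# Sorting by time first makes the joined string follow the order of occurrence.
paths = (
    toy_events.sort_values(['person', 'time'])
    .pivot_table(index='person', columns='event', values='ofr_id_short',
                 aggfunc=lambda s: ' > '.join(s))
)
print(paths)
```

A person with no event of a given type (here `p2` and `completed`) ends up with NaN in that column, which is why the notebook fills those cells with placeholder labels afterwards.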

In [None]:
sorted_offers = events.pivot_table(index=['person'], columns=['event'], values=['ofr_id_short'], aggfunc= lambda x: ' > '.join(x)).reset_index()

sorted_offers.columns = ['_'.join(col).strip('_') for col in sorted_offers.columns.to_flat_index()]

sorted_offers['first_completed'] = [x.split(' > ')[0] if isinstance(x, str) else x for x in sorted_offers['ofr_id_short_completed']]
sorted_offers['last_completed'] = [x.split(' > ')[-1] if isinstance(x, str) else x for x in sorted_offers['ofr_id_short_completed']]


sorted_offers['ofr_id_short_completed'] = sorted_offers['ofr_id_short_completed'].fillna('no_ofr_comp')
sorted_offers['ofr_id_short_received'] = sorted_offers['ofr_id_short_received'].fillna('no_ofr_rec')
sorted_offers['ofr_id_short_viewed'] = sorted_offers['ofr_id_short_viewed'].fillna('no_ofr_view')

sorted_offers = sorted_offers.fillna('no_ofr_comp').reset_index()

# pd.get_dummies(sorted_offers, columns=['ofr_id_short_completed'], drop_first=True, prefix_sep= ' > ')

sorted_offers.to_csv('medalion_data_store/silver/sorted_offers.csv', index=False)

sorted_offers

Unnamed: 0,index,person,ofr_id_short_completed,ofr_id_short_received,ofr_id_short_viewed,first_completed,last_completed
0,0,0009655768c64bdeb2e877511632db8f,ofr_I > ofr_G > ofr_J,ofr_H > ofr_C > ofr_I > ofr_G > ofr_J,ofr_H > ofr_C > ofr_I > ofr_G,ofr_I,ofr_J
1,1,00116118485d4dfda04fdbaba9a87b5c,no_ofr_comp,ofr_I > ofr_I,ofr_I > ofr_I,no_ofr_comp,no_ofr_comp
2,2,0011e0d4e6b944f998e987f904e8c1e5,ofr_F > ofr_E > ofr_D,ofr_C > ofr_F > ofr_H > ofr_E > ofr_D,ofr_C > ofr_F > ofr_H > ofr_E > ofr_D,ofr_F,ofr_D
3,3,0020c2b971eb4e9188eac86d93036a77,ofr_G > ofr_G > ofr_B,ofr_G > ofr_A > ofr_G > ofr_B > ofr_H,ofr_G > ofr_B > ofr_H,ofr_G,ofr_B
4,4,0020ccbbb6d84e358d3414a3ff76cffd,ofr_F > ofr_I > ofr_D,ofr_F > ofr_I > ofr_H > ofr_D,ofr_F > ofr_I > ofr_H > ofr_D,ofr_F,ofr_D
...,...,...,...,...,...,...,...
16989,16989,fff3ba4757bd42088c044ca26d73817a,ofr_G > ofr_D > ofr_J,ofr_G > ofr_D > ofr_H > ofr_J > ofr_H > ofr_J,ofr_G > ofr_D > ofr_H,ofr_G,ofr_J
16990,16990,fff7576017104bcc8677a8d63322b5e1,ofr_G > ofr_G > ofr_D,ofr_G > ofr_B > ofr_A > ofr_G > ofr_D,ofr_G > ofr_B > ofr_A > ofr_G,ofr_G,ofr_D
16991,16991,fff8957ea8b240a6b5e634b6ee8eafcf,no_ofr_comp,ofr_G > ofr_C > ofr_B,ofr_G > ofr_B,no_ofr_comp,no_ofr_comp
16992,16992,fffad4f4828548d1b5583907f2e9906b,ofr_I > ofr_I > ofr_D,ofr_I > ofr_H > ofr_I > ofr_D,ofr_I > ofr_H > ofr_I > ofr_D,ofr_I,ofr_D


In [None]:
sorted_offers.isna().sum()

person                    0
ofr_id_short_completed    0
ofr_id_short_received     0
ofr_id_short_viewed       0
first_completed           0
last_completed            0
dtype: int64
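
The `first_completed` and `last_completed` columns hold the first and last token of each ` > `-joined sequence. As a rough sketch, the same extraction can be done with pandas string methods on a toy series; the sample values below are illustrative and only the column names and the `no_ofr_comp` placeholder come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy sequences in the same ' > '-joined format; NaN marks "no completed offer".
seq = pd.Series(["ofr_I > ofr_G > ofr_J", np.nan, "ofr_F"], name="ofr_id_short_completed")

tokens = seq.str.split(" > ")      # NaN stays NaN, strings become lists of offer ids
result = pd.DataFrame({
    "first_completed": tokens.str[0].fillna("no_ofr_comp"),   # first offer in the sequence
    "last_completed": tokens.str[-1].fillna("no_ofr_comp"),   # last offer in the sequence
})
print(result)
```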

## Item-Event Ranking

The `events` table is used to rank the offers by the number of events (`event`) per offer (`ofr_id_short`) and by summary statistics of their `time` values.

In [None]:
# Grouping the events DataFrame by 'ofr_id_short' and 'event' to count the occurrences
item_event = events.groupby(['ofr_id_short', 'event']).agg(
    itm_cnt=('ofr_id_short', 'count'), # count offer type by event type.
    itm_mean_t=('time', 'mean'),
    itm_max_t=('time', 'max'),
    itm_min_t=('time', 'min'),
    itm_dif_mean=('time_diff', 'mean'),
    itm_dif_max=('time_diff', 'max'),
    itm_dif_std=('time_diff', 'std')

).unstack(level=[1]).reset_index().fillna(0).round(2)

# Flattening multi-level column names and joining them with an underscore
item_event.columns = ['_'.join(col).strip('_') for col in item_event.columns.to_flat_index()]

# Calculating the completion and viewed rates based on event counts
item_event['itm_completion_rate'] = item_event['itm_cnt_completed'] / item_event['itm_cnt_received']
item_event['itm_viewed_rate'] = item_event['itm_cnt_viewed'] / item_event['itm_cnt_received']
# Relative spread of completions: std of `time_diff` divided by the mean completion `time`
item_event['itm_RSD_copleted'] = item_event['itm_dif_std_completed'] / item_event['itm_mean_t_completed']

# Drop unnecessary columns (every column ending in 'received')
item_event = item_event.loc[:,  [col for col in item_event.columns if not col.endswith('received')]]

# Drop unnecessary columns (all equal values)
item_event = item_event.drop(columns=['itm_max_t_viewed', 'itm_max_t_completed'])

# Fill NaN from 0/0 divisions with 0 and round to two decimals
item_event = item_event.fillna(0).round(2)

# Saving the resulting DataFrame to a CSV file
item_event.to_csv('medalion_data_store/silver/item_event.csv', index=False)

# Returning the item_event DataFrame
item_event


Unnamed: 0,ofr_id_short,itm_cnt_completed,itm_cnt_viewed,itm_mean_t_completed,itm_mean_t_viewed,itm_min_t_completed,itm_min_t_viewed,itm_dif_mean_completed,itm_dif_mean_viewed,itm_dif_max_completed,itm_dif_max_viewed,itm_dif_std_completed,itm_dif_std_viewed,itm_completion_rate,itm_viewed_rate,itm_RSD_copleted
0,ofr_A,3688.0,6716.0,394.77,352.62,0.0,0.0,154.61,172.85,678.0,684.0,124.26,107.2,0.48,0.88,0.31
1,ofr_B,3331.0,7298.0,385.72,353.12,0.0,0.0,146.0,166.64,642.0,672.0,120.31,105.9,0.44,0.96,0.31
2,ofr_C,0.0,4144.0,0.0,358.64,0.0,0.0,0.0,172.73,0.0,666.0,0.0,103.96,0.0,0.54,0.0
3,ofr_D,4354.0,4171.0,407.05,361.98,0.0,0.0,161.66,173.16,660.0,666.0,130.99,105.99,0.57,0.54,0.32
4,ofr_E,3420.0,2663.0,431.55,366.75,0.0,0.0,178.05,172.78,678.0,630.0,137.45,92.32,0.45,0.35,0.32
5,ofr_F,5156.0,7337.0,400.32,354.75,0.0,0.0,160.15,163.98,678.0,690.0,131.47,104.11,0.67,0.96,0.33
6,ofr_G,5317.0,7327.0,399.12,348.87,0.0,0.0,168.4,166.1,696.0,660.0,130.62,105.82,0.7,0.96,0.33
7,ofr_H,0.0,6687.0,0.0,353.93,0.0,0.0,0.0,169.64,0.0,690.0,0.0,103.49,0.0,0.88,0.0
8,ofr_I,4296.0,7264.0,382.94,349.8,0.0,0.0,152.18,165.0,660.0,648.0,125.44,104.77,0.57,0.96,0.33
9,ofr_J,4017.0,4118.0,409.95,356.2,0.0,0.0,163.78,173.34,690.0,708.0,131.2,104.26,0.53,0.54,0.32
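
The table above is keyed by offer but not sorted. A minimal sketch of the ranking step, assuming completion rate is the ranking criterion (the notebook does not fix one), could look like this. The rates are copied from the output above; the `rank` column is a hypothetical addition.

```python
import pandas as pd

# Completion and view rates copied from the item_event output above.
rates = pd.DataFrame({
    "ofr_id_short": ["ofr_A", "ofr_B", "ofr_C", "ofr_D", "ofr_E",
                     "ofr_F", "ofr_G", "ofr_H", "ofr_I", "ofr_J"],
    "itm_completion_rate": [0.48, 0.44, 0.0, 0.57, 0.45, 0.67, 0.70, 0.0, 0.57, 0.53],
    "itm_viewed_rate": [0.88, 0.96, 0.54, 0.54, 0.35, 0.96, 0.96, 0.88, 0.96, 0.54],
})

# Sort from highest to lowest completion rate.
ranked = rates.sort_values("itm_completion_rate", ascending=False).reset_index(drop=True)

# Ties share the best (smallest) rank, e.g. two offers at 0.57 get the same rank.
ranked["rank"] = ranked["itm_completion_rate"].rank(method="min", ascending=False).astype(int)
print(ranked)
```

Sorting by `itm_viewed_rate` instead gives a different order, so the choice of criterion matters for any downstream recommendation.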


## User-Item-Event matrix:

A table that tracks interactions using the `person`, `ofr_id_short` and `time` columns. The aggregation computes several statistics on the **event** counts and **time** values for each user-offer interaction.

---

**Source table**: 

  `events` table: provides the event type, timestamp and reward for each person's interactions with offers.

**Created table:**
  
  `user_item_event`: table grouped by `person` and `ofr_id_short`, with count and statistical features.

**Columns Created:**

  1. **cnt_eve**: Total number of events per (person, ofr_id_short) pair.
  2. **sum_rew_eve**: Sum of rewards.
  3. **mean_rew_eve**: Mean of rewards.
  4. **median_rew_eve**: Median of rewards.
  5. **std_rew_eve**: Standard deviation of rewards. **(*removed*)**
  6. **min_rew_eve**: Minimum reward value.
  7. **max_rew_eve**: Maximum reward value.
  8. **range_rew_eve**: Range of rewards (max - min).
  9. **mean_t_eve**: Mean event time.
  10. **median_t_eve**: Median event time.
  11. **std_t_eve**: Standard deviation of event times. **(*removed*)**
  12. **min_t_eve**: Minimum event time.
  13. **max_t_eve**: Maximum event time.
  14. **range_t_eve**: Range of event times (max - min).
  15. **freq_eve**: Frequency of events per (person, ofr_id_short). If max time equals min time, frequency is set to 0 to avoid division by zero.
  16. **last_to_end_eve**: last event occurrence before the end of offer program (end=714h).

---
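The frequency and time-to-end columns in the list above can be sketched on a toy table as follows. This is an illustration under stated assumptions, not the notebook's implementation: `events_toy` and its values are hypothetical, and only the column names and the 714 h end of program come from the text.

```python
import numpy as np
import pandas as pd

END_OF_PROGRAM_H = 714  # end of the offer program in hours, as stated above

# Hypothetical toy events; column names follow the notebook's `events` table.
events_toy = pd.DataFrame({
    "person": ["u1", "u1", "u1", "u2"],
    "ofr_id_short": ["ofr_A", "ofr_A", "ofr_A", "ofr_B"],
    "time": [0, 6, 30, 168],
})

grp = events_toy.groupby(["person", "ofr_id_short"])["time"]
feats = grp.agg(cnt_eve="count", min_t_eve="min", max_t_eve="max").reset_index()
feats["range_t_eve"] = feats["max_t_eve"] - feats["min_t_eve"]

# Events per hour; pairs whose time range is zero get 0 instead of dividing by zero.
feats["freq_eve"] = np.where(
    feats["range_t_eve"] > 0,
    feats["cnt_eve"] / feats["range_t_eve"].replace(0, np.nan),
    0.0,
)

# Hours between the last event and the end of the program.
feats["last_to_end_eve"] = END_OF_PROGRAM_H - feats["max_t_eve"]
print(feats)
```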

In [None]:
user_item_event_count = events.groupby(['person', 'ofr_id_short', 'event']).agg(
    cnt=('ofr_id_short', 'count'),
    # us_ofr_mean_t=('time', 'mean'),
    # us_ofr_max_t=('time', 'max'),
    # us_ofr_min_t=('time', 'min'),
    # us_ofr_dif_mean = ('time_diff', 'mean'),
    # us_ofr_dif_max = ('time_diff', 'max'),
    # us_ofr_dif_std = ('time_diff', 'std')

).unstack(level=[1,2]).reset_index().fillna(0) #.round(2) #

user_item_event_count.columns = ['_'.join(col).strip('_') for col in user_item_event_count.columns.to_flat_index()]

user_item_event_count.to_csv('medalion_data_store/silver/user_item_event_count.csv', index=False)


user_item_event_count

Unnamed: 0,person,cnt_ofr_C_received,cnt_ofr_C_viewed,cnt_ofr_G_completed,cnt_ofr_G_received,cnt_ofr_G_viewed,cnt_ofr_H_received,cnt_ofr_H_viewed,cnt_ofr_I_completed,cnt_ofr_I_received,...,cnt_ofr_F_completed,cnt_ofr_F_received,cnt_ofr_F_viewed,cnt_ofr_A_received,cnt_ofr_B_completed,cnt_ofr_B_received,cnt_ofr_B_viewed,cnt_ofr_A_completed,cnt_ofr_A_viewed,cnt_ofr_J_viewed
0,0009655768c64bdeb2e877511632db8f,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,00116118485d4dfda04fdbaba9a87b5c,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0011e0d4e6b944f998e987f904e8c1e5,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0020c2b971eb4e9188eac86d93036a77,0.0,0.0,2.0,2.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
4,0020ccbbb6d84e358d3414a3ff76cffd,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16989,fff3ba4757bd42088c044ca26d73817a,0.0,0.0,1.0,1.0,1.0,2.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16990,fff7576017104bcc8677a8d63322b5e1,0.0,0.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
16991,fff8957ea8b240a6b5e634b6ee8eafcf,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
16992,fffad4f4828548d1b5583907f2e9906b,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
user_item_event = events.groupby(['person', 'ofr_id_short', 'event']).agg(
    us_ofr_cnt=('ofr_id_short', 'count'),
    us_ofr_mean_t=('time', 'mean'),
    us_ofr_max_t=('time', 'max'),
    us_ofr_min_t=('time', 'min'),
    us_ofr_dif_mean = ('time_diff', 'mean'),
    us_ofr_dif_max = ('time_diff', 'max'),
    us_ofr_dif_std = ('time_diff', 'std')

).unstack(level=[2]).reset_index().round(2) #.fillna(0)

user_item_event.columns = ['_'.join(col).strip('_') for col in user_item_event.columns.to_flat_index()]

# Computing the stage of the fact: 1 - just received, 2 - received-viewed or received-completed, 3 - all occurrences
user_item_event['stage'] = user_item_event[['us_ofr_cnt_completed', 'us_ofr_cnt_received', 'us_ofr_cnt_viewed']].sum(axis=1)

# Replacing NaN values with infinity to facilitate time calculations
user_item_event['us_ofr_mean_t_completed'] = user_item_event['us_ofr_mean_t_completed'].fillna(np.inf)
user_item_event['us_ofr_mean_t_received'] = user_item_event['us_ofr_mean_t_received'].fillna(np.inf)
user_item_event['us_ofr_mean_t_viewed'] = user_item_event['us_ofr_mean_t_viewed'].fillna(np.inf)

# Computing time differences for various stages of the offer lifecycle
user_item_event['to_vr'] = user_item_event['us_ofr_mean_t_viewed'] - user_item_event['us_ofr_mean_t_received']
user_item_event['to_cv'] = user_item_event['us_ofr_mean_t_completed'] - user_item_event['us_ofr_mean_t_viewed']
user_item_event['to_cr'] = user_item_event['us_ofr_mean_t_completed'] - user_item_event['us_ofr_mean_t_received']

user_item_event['to_cv'] = user_item_event['to_cv'].fillna(0)

# Calculating curiosity, eagerness, and overall responsiveness scores
user_item_event['curiosity_vr'] = (2 / (np.exp(-user_item_event['to_vr']*0.3) + 1))
user_item_event['eagerness_cv'] = (2 / (np.exp(-user_item_event['to_cv']*0.3) + 1))
# Here the 0.3 factor multiplies exp(-to_cr) instead of scaling the exponent as in the two scores above
user_item_event['overall_cr'] = (2 / (np.exp(-user_item_event['to_cr'])*0.3 + 1))


# Defining influence metrics based on time conditions
user_item_event['influence'] = (
    (~pd.isna(user_item_event['us_ofr_mean_t_completed'])) & 
    (user_item_event['us_ofr_mean_t_completed'] != np.inf) &
    (user_item_event['us_ofr_mean_t_completed'] >= user_item_event['us_ofr_mean_t_viewed'])
).astype(int)

# Extreme influence is when the offer is viewed and completed at the same time.
user_item_event['ext_influence'] = (
    (~pd.isna(user_item_event['us_ofr_mean_t_viewed'])) & 
    (user_item_event['us_ofr_mean_t_viewed'] != np.inf) & 
    (user_item_event['us_ofr_mean_t_completed'] == user_item_event['us_ofr_mean_t_viewed'])
).astype(int)

# Events per hour between the first and last occurrence; the time span is clipped to at least 1 to avoid division by zero
user_item_event['us_ofr_freq_copleted'] = user_item_event['us_ofr_cnt_completed'] / (user_item_event['us_ofr_max_t_completed'] - user_item_event['us_ofr_min_t_completed']).clip(1, np.inf)
user_item_event['us_ofr_freq_view'] = user_item_event['us_ofr_cnt_viewed'] / (user_item_event['us_ofr_max_t_viewed'] - user_item_event['us_ofr_min_t_viewed']).clip(1, np.inf)

# Dropping intermediate time calculation columns and others, rounding final values.
user_item_event = user_item_event.drop(
    columns=['us_ofr_mean_t_completed', 'us_ofr_mean_t_received', 'us_ofr_mean_t_viewed', 'to_vr', 'to_cv', 'to_cr', 'us_ofr_cnt_received']
).round(1)

user_item_event = user_item_event.fillna(0)


user_item_event.to_csv('medalion_data_store/silver/user_item_event.csv', index=False)


user_item_event

Unnamed: 0,person,ofr_id_short,us_ofr_cnt_completed,us_ofr_cnt_viewed,us_ofr_max_t_completed,us_ofr_max_t_received,us_ofr_max_t_viewed,us_ofr_min_t_completed,us_ofr_min_t_received,us_ofr_min_t_viewed,...,us_ofr_dif_std_received,us_ofr_dif_std_viewed,stage,curiosity_vr,eagerness_cv,overall_cr,influence,ext_influence,us_ofr_freq_copleted,us_ofr_freq_view
0,0009655768c64bdeb2e877511632db8f,ofr_C,0.0,1.0,0.0,336.0,372.0,0.0,336.0,372.0,...,0.0,0.0,2.0,2.0,2.0,2.0,0,0,0.0,1.0
1,0009655768c64bdeb2e877511632db8f,ofr_G,1.0,1.0,528.0,504.0,540.0,528.0,504.0,540.0,...,0.0,0.0,3.0,2.0,0.1,2.0,0,0,1.0,1.0
2,0009655768c64bdeb2e877511632db8f,ofr_H,0.0,1.0,0.0,168.0,192.0,0.0,168.0,192.0,...,0.0,0.0,2.0,2.0,2.0,2.0,0,0,0.0,1.0
3,0009655768c64bdeb2e877511632db8f,ofr_I,1.0,1.0,414.0,408.0,456.0,414.0,408.0,456.0,...,0.0,0.0,3.0,2.0,0.0,2.0,0,0,1.0,1.0
4,0009655768c64bdeb2e877511632db8f,ofr_J,1.0,0.0,576.0,576.0,0.0,576.0,576.0,0.0,...,0.0,0.0,2.0,2.0,0.0,1.5,0,0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63283,fffad4f4828548d1b5583907f2e9906b,ofr_I,2.0,2.0,516.0,408.0,510.0,36.0,0.0,6.0,...,0.0,0.0,6.0,2.0,2.0,2.0,1,0,0.0,0.0
63284,ffff82501cea40309d5fdd7edcca4a07,ofr_D,1.0,1.0,504.0,504.0,534.0,504.0,504.0,534.0,...,0.0,0.0,3.0,2.0,0.0,1.5,0,0,1.0,1.0
63285,ffff82501cea40309d5fdd7edcca4a07,ofr_E,1.0,1.0,198.0,168.0,174.0,198.0,168.0,174.0,...,0.0,0.0,3.0,1.7,2.0,2.0,1,0,1.0,1.0
63286,ffff82501cea40309d5fdd7edcca4a07,ofr_G,1.0,1.0,60.0,0.0,6.0,60.0,0.0,6.0,...,0.0,0.0,3.0,1.7,2.0,2.0,1,0,1.0,1.0


In [None]:
user_item_event.isna().sum()

person                       0
ofr_id_short                 0
us_ofr_cnt_completed         0
us_ofr_cnt_viewed            0
us_ofr_max_t_completed       0
us_ofr_max_t_received        0
us_ofr_max_t_viewed          0
us_ofr_min_t_completed       0
us_ofr_min_t_received        0
us_ofr_min_t_viewed          0
us_ofr_dif_mean_completed    0
us_ofr_dif_mean_received     0
us_ofr_dif_mean_viewed       0
us_ofr_dif_max_completed     0
us_ofr_dif_max_received      0
us_ofr_dif_max_viewed        0
us_ofr_dif_std_completed     0
us_ofr_dif_std_received      0
us_ofr_dif_std_viewed        0
stage                        0
curiosity_vr                 0
eagerness_cv                 0
overall_cr                   0
influence                    0
ext_influence                0
us_ofr_freq_copleted         0
us_ofr_freq_view             0
dtype: int64

## User-Item-U-Event matrix:

A table that tracks **unique** person-offer-event interactions over time. It provides features for analysing each person's engagement and responsiveness in every individual offer interaction.

---

* **Source table**: 

    `events` table: event type and time for each person's interactions with offers.

* **Created table:**
    
   `user_item_u_event`: table of *facts* of unique person-offer-event interactions.

### Fact Definition:

- A **Fact** represents a unique sequence of occurrences in which a person engages with a single offer, tracked from the moment the offer is received through viewing and completion. Interactions with the same offer that occur more than once are each tracked as a separate fact.
- The **`tag`** column ensures the uniqueness of each fact and *prevents aggregation when a person interacts with the same offer type more than once*. Because each group defined with the tag column holds a single value per event, aggregating with 'max', 'min' or 'mean' returns the same value for a fact.

### Engagement Metrics Feature Engineering Strategy: Key features to quantify user responsiveness.

- Each *person-event-offer-tag* fact is grouped and unstacked into one column per `event`, with the `time` and `reward` values placed accordingly.
- Time-event columns are created, and events that did not occur are left as `NaN`.


  - **Handling Missing Values:**
    - Before the features are calculated, missing values in the unstacked `time_completed`, `time_received`, and `time_viewed` columns are filled with `np.inf` (the event never happened), which avoids imputing *artificial* values.

  - **Helper time-difference features** (not included in the final dataset because they contain `np.inf` values):
    - `to_vr`: Time from receiving to viewing, i.e. the delay in viewing - {0 to inf}
    - `to_cv`: Time from viewing to completion, i.e. the delay in completing after viewing - {-inf to inf}
    - `to_cr`: Time from receiving to completion, whether or not the offer was viewed - {0 to inf}

  - **Logistic Time-Based Scores (`2 / (exp(-0.3 * x) + 1)`):** calculated from the previous time differences, where x is the time difference in hours. A score of 1 corresponds to x = 0, and the score approaches 2 as the delay grows or when the later event never happened.
    - `curiosity_vr`: `{1 to 2}` Delay of viewing after receiving.
    - `eagerness_cv`: `{0 to 2}` Delay of completion after viewing (values below 1 indicate completion before viewing).
    - `overall_cr`: `{about 1.54 to 2}` Overall delay from receiving to completion; the code computes it as `2 / (0.3 * exp(-x) + 1)`.

  - **Influence Metrics:**
    - **`influence`**: Binary flag (1/0) indicating if an offer was completed after viewing (responsiveness).
    - **`ext_influence`**: Binary flag capturing simultaneous viewing and completion time (extreme responsiveness).

  - **Counts:** (per fact, using the `tag` column to avoid aggregating repeated interactions with the same offer type per person).
    - `us_ofr_cnt_completed`: Count of completed offers.
    - `us_ofr_cnt_received`: Count of received offers (used for `stage`, then dropped from the final table).
    - `us_ofr_cnt_viewed`: Count of viewed offers.

---
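The score and influence logic used in the code cell below can be traced on three hypothetical facts. This is a sketch only: the times are invented, and `logistic_score` is a helper name introduced here for illustration; the formula, the `np.inf` fill and the `>=` comparison mirror the notebook's code.

```python
import numpy as np
import pandas as pd

def logistic_score(delta_h, rate=0.3):
    """Map a time difference in hours to the interval (0, 2]."""
    return 2 / (np.exp(-rate * delta_h) + 1)

# Hypothetical facts: times in hours, NaN where the event never happened.
facts = pd.DataFrame({
    "time_received":  [0.0, 504.0, 168.0],
    "time_viewed":    [6.0, 540.0, np.nan],
    "time_completed": [60.0, 528.0, np.nan],
}).fillna(np.inf)  # "never happened" becomes an infinite time

to_vr = facts["time_viewed"] - facts["time_received"]
to_cv = (facts["time_completed"] - facts["time_viewed"]).fillna(0)  # inf - inf is NaN

facts["curiosity_vr"] = logistic_score(to_vr)
facts["eagerness_cv"] = logistic_score(to_cv)

# Influence: the offer was completed, and not before it was viewed.
facts["influence"] = (
    np.isfinite(facts["time_completed"])
    & (facts["time_completed"] >= facts["time_viewed"])
).astype(int)
print(facts.round(2))
```

The second fact is completed 12 hours before it is viewed, so its `eagerness_cv` falls well below 1 and `influence` is 0. The third fact is never viewed, so `curiosity_vr` saturates at 2.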

> NOTE: This table has features that can be combined to determine which demographic groups respond best to which offer type.

In [None]:
user_item_u_event = events.groupby(['person', 'ofr_id_short', 'tag', 'event']).agg(
    us_ofr_cnt=('ofr_id_short', 'count'),
    us_ofr_mean_t=('time', 'mean'),
    us_ofr_max_t=('time', 'max'),
    us_ofr_min_t=('time', 'min'),
    us_ofr_dif_mean = ('time_diff', 'mean'),
    us_ofr_dif_max = ('time_diff', 'max'),
    us_ofr_dif_std = ('time_diff', 'std')

).unstack(level=[3]).reset_index().round(2) #.fillna(0)

user_item_u_event.columns = ['_'.join(col).strip('_') for col in user_item_u_event.columns.to_flat_index()]

# Computing the stage of the fact: 1 - just received, 2 - received-viewed or received-completed, 3 - all occurrences
user_item_u_event['stage'] = user_item_u_event[['us_ofr_cnt_completed', 'us_ofr_cnt_received', 'us_ofr_cnt_viewed']].sum(axis=1)

# Replacing NaN values with infinity to facilitate time calculations
user_item_u_event['us_ofr_mean_t_completed'] = user_item_u_event['us_ofr_mean_t_completed'].fillna(np.inf)
user_item_u_event['us_ofr_mean_t_received'] = user_item_u_event['us_ofr_mean_t_received'].fillna(np.inf)
user_item_u_event['us_ofr_mean_t_viewed'] = user_item_u_event['us_ofr_mean_t_viewed'].fillna(np.inf)

# Computing time differences for various stages of the offer lifecycle
user_item_u_event['to_vr'] = user_item_u_event['us_ofr_mean_t_viewed'] - user_item_u_event['us_ofr_mean_t_received']
user_item_u_event['to_cv'] = user_item_u_event['us_ofr_mean_t_completed'] - user_item_u_event['us_ofr_mean_t_viewed']
user_item_u_event['to_cr'] = user_item_u_event['us_ofr_mean_t_completed'] - user_item_u_event['us_ofr_mean_t_received']

user_item_u_event['to_cv'] = user_item_u_event['to_cv'].fillna(0)

# Calculating curiosity, eagerness, and overall responsiveness scores
user_item_u_event['curiosity_vr'] = (2 / (np.exp(-user_item_u_event['to_vr']*0.3) + 1))
user_item_u_event['eagerness_cv'] = (2 / (np.exp(-user_item_u_event['to_cv']*0.3) + 1))
# Here the 0.3 factor multiplies exp(-to_cr) instead of scaling the exponent as in the two scores above
user_item_u_event['overall_cr'] = (2 / (np.exp(-user_item_u_event['to_cr'])*0.3 + 1))


# Defining influence metrics based on time conditions
user_item_u_event['influence'] = (
    (~pd.isna(user_item_u_event['us_ofr_mean_t_completed'])) & 
    (user_item_u_event['us_ofr_mean_t_completed'] != np.inf) &
    (user_item_u_event['us_ofr_mean_t_completed'] >= user_item_u_event['us_ofr_mean_t_viewed'])
).astype(int)

# Extreme influence is when the offer is viewed and completed at the same time.
user_item_u_event['ext_influence'] = (
    (~pd.isna(user_item_u_event['us_ofr_mean_t_viewed'])) & 
    (user_item_u_event['us_ofr_mean_t_viewed'] != np.inf) & 
    (user_item_u_event['us_ofr_mean_t_completed'] == user_item_u_event['us_ofr_mean_t_viewed'])
).astype(int)

# Events per hour between the first and last occurrence; the time span is clipped to at least 1 to avoid division by zero
user_item_u_event['us_ofr_freq_copleted'] = user_item_u_event['us_ofr_cnt_completed'] / (user_item_u_event['us_ofr_max_t_completed'] - user_item_u_event['us_ofr_min_t_completed']).clip(1, np.inf)
user_item_u_event['us_ofr_freq_view'] = user_item_u_event['us_ofr_cnt_viewed'] / (user_item_u_event['us_ofr_max_t_viewed'] - user_item_u_event['us_ofr_min_t_viewed']).clip(1, np.inf)

# Dropping intermediate time calculation columns and others, rounding final values.
user_item_u_event = user_item_u_event.drop(
    columns=['us_ofr_mean_t_completed', 'us_ofr_mean_t_received', 'us_ofr_mean_t_viewed', 'to_vr', 'to_cv', 'to_cr', 'us_ofr_cnt_received']
).round(1)

user_item_u_event = user_item_u_event.fillna(0)


user_item_u_event.to_csv('medalion_data_store/silver/user_item_u_event.csv', index=False)


user_item_u_event

Unnamed: 0,person,ofr_id_short,tag,us_ofr_cnt_completed,us_ofr_cnt_viewed,us_ofr_max_t_completed,us_ofr_max_t_received,us_ofr_max_t_viewed,us_ofr_min_t_completed,us_ofr_min_t_received,...,us_ofr_dif_std_received,us_ofr_dif_std_viewed,stage,curiosity_vr,eagerness_cv,overall_cr,influence,ext_influence,us_ofr_freq_copleted,us_ofr_freq_view
0,0009655768c64bdeb2e877511632db8f,ofr_C,0,0.0,1.0,0.0,336.0,372.0,0.0,336.0,...,0.0,0.0,2.0,2.0,2.0,2.0,0,0,0.0,1.0
1,0009655768c64bdeb2e877511632db8f,ofr_G,0,1.0,1.0,528.0,504.0,540.0,528.0,504.0,...,0.0,0.0,3.0,2.0,0.1,2.0,0,0,1.0,1.0
2,0009655768c64bdeb2e877511632db8f,ofr_H,0,0.0,1.0,0.0,168.0,192.0,0.0,168.0,...,0.0,0.0,2.0,2.0,2.0,2.0,0,0,0.0,1.0
3,0009655768c64bdeb2e877511632db8f,ofr_I,0,1.0,1.0,414.0,408.0,456.0,414.0,408.0,...,0.0,0.0,3.0,2.0,0.0,2.0,0,0,1.0,1.0
4,0009655768c64bdeb2e877511632db8f,ofr_J,0,1.0,0.0,576.0,576.0,0.0,576.0,576.0,...,0.0,0.0,2.0,2.0,0.0,1.5,0,0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76272,ffff82501cea40309d5fdd7edcca4a07,ofr_E,0,1.0,1.0,198.0,168.0,174.0,198.0,168.0,...,0.0,0.0,3.0,1.7,2.0,2.0,1,0,1.0,1.0
76273,ffff82501cea40309d5fdd7edcca4a07,ofr_G,0,1.0,1.0,60.0,0.0,6.0,60.0,0.0,...,0.0,0.0,3.0,1.7,2.0,2.0,1,0,1.0,1.0
76274,ffff82501cea40309d5fdd7edcca4a07,ofr_J,0,1.0,1.0,384.0,336.0,354.0,384.0,336.0,...,0.0,0.0,3.0,2.0,2.0,2.0,1,0,1.0,1.0
76275,ffff82501cea40309d5fdd7edcca4a07,ofr_J,1,1.0,1.0,414.0,408.0,414.0,414.0,408.0,...,0.0,0.0,3.0,1.7,1.0,2.0,1,1,1.0,1.0


In [None]:
user_item_u_event.isna().sum()

person                       0
ofr_id_short                 0
tag                          0
us_ofr_cnt_completed         0
us_ofr_cnt_viewed            0
us_ofr_max_t_completed       0
us_ofr_max_t_received        0
us_ofr_max_t_viewed          0
us_ofr_min_t_completed       0
us_ofr_min_t_received        0
us_ofr_min_t_viewed          0
us_ofr_dif_mean_completed    0
us_ofr_dif_mean_received     0
us_ofr_dif_mean_viewed       0
us_ofr_dif_max_completed     0
us_ofr_dif_max_received      0
us_ofr_dif_max_viewed        0
us_ofr_dif_std_completed     0
us_ofr_dif_std_received      0
us_ofr_dif_std_viewed        0
stage                        0
curiosity_vr                 0
eagerness_cv                 0
overall_cr                   0
influence                    0
ext_influence                0
us_ofr_freq_copleted         0
us_ofr_freq_view             0
dtype: int64

## User-Transactions matrix:

A table that tracks transactions per **person**. The aggregation computes several statistics on the transaction **amount** and **time** for each user.

---

**Source table**: 

  `transactions` table: provides the amount spent and the timestamp of each person's transactions.

**Created table:**

  `user_transactions`: table grouped by `person`, with count and statistical features.

**Columns Created:**

1. **cnt_tran**: Total number of transactions per person.
2. **sum_am_tran**: Sum of transaction amounts.
3. **mean_am_tran**: Mean of transaction amounts.
4. **median_am_tran**: Median of transaction amounts.
5. **std_am_tran**: Standard deviation of transaction amounts. **(*removed*)**
6. **min_am_tran**: Minimum transaction amount.
7. **max_am_tran**: Maximum transaction amount.
8. **range_amount_tran**: Range of transaction amounts (max - min).
9. **mean_t_tran**: Mean transaction time.
10. **median_t_tran**: Median transaction time.
11. **std_t_tran**: Standard deviation of transaction times. **(*removed*)**
12. **min_t_tran**: Minimum transaction time.
13. **max_t_tran**: Maximum transaction time.
14. **range_t_tran**: Range of transaction times (max - min).
15. **freq_tran**: Frequency of transactions per person. If max time equals min time, frequency is set to 0 to avoid division by zero.
16. **last_to_end_tran**: last transaction occurrence before the end of offer program (end=714h).

**Handling Missing Values:**
  - fill NaN with zero.

In [None]:
# Group transaction data by person and aggregate amount and time statistics
user_transactions = transactions.groupby(['person']).agg(
    cnt_tran=('event', 'count'),        
    
    sum_am_tran=('amount', 'sum'),    
    mean_am_tran=('amount', 'mean'),   
    median_am_tran=('amount', 'median'), 
    min_am_tran=('amount', 'min'),         
    max_am_tran=('amount', 'max'),         
    small_tran_count = ('amount', lambda x: (x < x.quantile(0.25)).sum()), # number of small transactions (below the person's 25th percentile)
    big_tran_count = ('amount', lambda x: (x > x.quantile(0.75)).sum()), # number of large transactions (above the person's 75th percentile)
    range_amount_tran=('amount', lambda x: x.max() - x.min()),

    mean_t_tran=('time', 'mean'),
    median_t_tran=('time', 'median'),
    min_t_tran=('time', 'min'),        
    max_t_tran=('time', 'max'),
    dif_t_tran_mean = ('time_diff', 'mean'), 
    dif_t_tran_min = ('time_diff', 'min'),
    dif_t_tran_max = ('time_diff', 'max'), 
    std_diff_t_tran = ('time_diff', 'std'),
    range_t_tran = ('time', lambda x: x.max() - x.min()),  
    freq_tran = ('time', lambda x: (len(x) / (x.max() - x.min()))*100 if x.max() != x.min() else 0),
    recency_tran = ('time', lambda x: (714 - x.max()/714)), # evaluates as 714 - (x.max() / 714): division binds tighter than subtraction

).round(2).reset_index() #.fillna(0)

user_transactions['max_to_sum_am_tran'] = user_transactions['max_am_tran'] / user_transactions['sum_am_tran'] # presence of a high-value transaction
user_transactions['median_to_mean_am_tran'] = user_transactions['median_am_tran'] / user_transactions['mean_am_tran'] # skew of the amounts

# `//` and `*` have equal precedence and evaluate left to right, so this is (time // 24) * 7: a day index scaled by 7
transactions['week'] = transactions['time'] // 24*7
user_transactions['weekly_tran_mean'] = transactions.groupby(['person', 'week']).size().groupby('person').mean().values # mean purchases per week
user_transactions['weekly_tran_min'] = transactions.groupby(['person', 'week']).size().groupby('person').min().values # minimum purchases per week
user_transactions['weekly_tran_max'] = transactions.groupby(['person', 'week']).size().groupby('person').max().values # maximum purchases per week

user_transactions['dif_t_tran_mean'] = user_transactions['dif_t_tran_mean'].fillna(0)
user_transactions['dif_t_tran_min'] = user_transactions['dif_t_tran_min'].fillna(0)
user_transactions['dif_t_tran_max'] = user_transactions['dif_t_tran_max'].fillna(0)
user_transactions['std_diff_t_tran'] = user_transactions['std_diff_t_tran'].fillna(0)

# save the table in the data store
user_transactions.to_csv('medalion_data_store/silver/user_transactions.csv', index=False)

# Output final user-item-transactions matrix
user_transactions

Unnamed: 0,person,cnt_tran,sum_am_tran,mean_am_tran,median_am_tran,min_am_tran,max_am_tran,small_tran_count,big_tran_count,range_amount_tran,...,dif_t_tran_max,std_diff_t_tran,range_t_tran,freq_tran,recency_tran,max_to_sum_am_tran,median_to_mean_am_tran,weekly_tran_mean,weekly_tran_min,weekly_tran_max
0,0009655768c64bdeb2e877511632db8f,8,127.60,15.95,13.84,8.57,28.16,2,2,19.59,...,186.0,65.12,468,1.71,713.03,0.220690,0.867712,1.000000,1,1
1,00116118485d4dfda04fdbaba9a87b5c,3,4.09,1.36,0.70,0.20,3.19,1,1,2.99,...,162.0,101.82,180,1.67,713.34,0.779951,0.514706,1.500000,1,2
2,0011e0d4e6b944f998e987f904e8c1e5,5,79.46,15.89,13.49,8.96,23.03,1,1,14.07,...,324.0,136.33,522,0.96,713.08,0.289831,0.848962,1.000000,1,1
3,0020c2b971eb4e9188eac86d93036a77,8,196.86,24.61,24.35,17.24,33.86,2,2,16.62,...,366.0,131.40,654,1.22,713.01,0.172000,0.989435,1.333333,1,2
4,0020ccbbb6d84e358d3414a3ff76cffd,12,154.05,12.84,12.76,6.81,20.08,3,3,13.27,...,180.0,56.24,630,1.90,713.06,0.130347,0.993769,1.090909,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16573,fff3ba4757bd42088c044ca26d73817a,11,580.98,52.82,20.98,10.99,388.22,3,3,377.23,...,210.0,60.89,546,2.01,713.23,0.668216,0.397198,1.100000,1,2
16574,fff7576017104bcc8677a8d63322b5e1,6,29.94,4.99,5.03,2.08,8.01,2,2,5.93,...,282.0,94.20,660,0.91,713.03,0.267535,1.008016,1.000000,1,1
16575,fff8957ea8b240a6b5e634b6ee8eafcf,5,12.15,2.43,0.89,0.64,6.39,1,1,5.75,...,360.0,150.47,558,0.90,713.19,0.525926,0.366255,1.000000,1,1
16576,fffad4f4828548d1b5583907f2e9906b,12,88.83,7.40,7.52,2.05,12.18,3,3,10.13,...,114.0,37.57,642,1.87,713.05,0.137116,1.016216,1.090909,1,2


In [None]:
user_transactions.isna().sum()

person                    0
cnt_tran                  0
sum_am_tran               0
mean_am_tran              0
median_am_tran            0
min_am_tran               0
max_am_tran               0
small_tran_count          0
big_tran_count            0
range_amount_tran         0
mean_t_tran               0
median_t_tran             0
min_t_tran                0
max_t_tran                0
dif_t_tran_mean           0
dif_t_tran_min            0
dif_t_tran_max            0
std_diff_t_tran           0
range_t_tran              0
freq_tran                 0
recency_tran              0
max_to_sum_am_tran        0
median_to_mean_am_tran    0
weekly_tran_mean          0
weekly_tran_min           0
weekly_tran_max           0
dtype: int64
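
Two expressions in the cell above depend on operator precedence: the `week` bucket and `recency_tran`. The sketch below compares each expression as written with a parenthesized variant on hypothetical times. The parenthesized forms are one possible reading of the intent, not something the notebook states.

```python
import pandas as pd

# Hypothetical transaction times in hours since the start of the test.
time_h = pd.Series([0, 30, 170, 340, 700], name="time")

as_written = time_h // 24 * 7     # parsed as (time // 24) * 7: day index scaled by 7
weekly = time_h // (24 * 7)       # 7-day buckets: 0 for hours 0-167, 1 for 168-335, ...

buckets = pd.DataFrame({"time": time_h, "as_written": as_written, "weekly": weekly})
print(buckets)

# Recency for a last transaction at hour 696 in a 714-hour program.
last_t = 696
recency_as_written = 714 - last_t / 714       # parsed as 714 - (last_t / 714)
recency_normalized = (714 - last_t) / 714     # share of the program left after the last purchase
print(round(recency_as_written, 2), round(recency_normalized, 4))
```

With the expression as written, every distinct day gets its own bucket, which is consistent with the `weekly_tran_*` values close to 1 in the output above.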

## User-Transactions-Time matrix

Constructing a **Transaction-Time** based matrix from the `transactions` table. The time frame was divided into 20 periods. The first period contains all times. The second starts 36 hours later than the first, and so on. The number of transactions per period was computed.

---
* **Source**: 

    `transactions` table.

* **Created tables:**
    
    `user_transactions_time` with related features.

    
### Feature Engineering strategy:
- **periods:** Each row tracks how a person's transaction count changes from period to period.
- **churn:** 1 if no transactions were found in the last three periods, else 0.
- **recency:** The time from the last transaction to the end of the offer program (714 - time).
- **churn2:** 1 if recency is more than 96 h, else 0.

### Data Processing
- The `.unstack()` function pivots the `event` or the `ofr_id_short` column, converting the different event/offer types into separate columns and filling in their values as described above.

- No further calculations are performed on the data columns, as the summary statistics are computed directly from the groupby/unstack operations.

- The resulting multi-level column names are flattened using list comprehension to create more readable column names.


---
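The period, recency and churn features described above can be sketched on a toy table. Assumptions are labeled: the data is hypothetical, the windows are defined by start hour rather than by column position as in the notebook, and only four windows are used, with churn checked over the last two.

```python
import pandas as pd

END_H = 714  # end of the offer program in hours

# Hypothetical transactions; times are in hours.
tx = pd.DataFrame({
    "person": ["u1", "u1", "u1", "u2", "u2"],
    "time":   [6, 360, 708, 12, 300],
})

# One row per person, one column per observed time, transaction counts as values.
counts = tx.groupby(["person", "time"]).size().unstack(level=1).fillna(0)

# Cumulative tail windows: each period counts transactions at or after its start hour.
starts = [0, 180, 360, 540]  # toy start hours
periods = pd.DataFrame({
    f"period_{k + 1}": counts.loc[:, counts.columns >= s].sum(axis=1)
    for k, s in enumerate(starts)
})

recency = END_H - tx.groupby("person")["time"].max()
features = periods.assign(
    recency=recency,
    churn=(periods.iloc[:, -2:].sum(axis=1) == 0).astype(int),  # no activity in the last two windows
    churn2=(recency > 96).astype(int),                          # last purchase more than 96 h before the end
)
print(features)
```

Because the windows are cumulative, the counts in each row never increase from left to right, which matches the shape of the `period_*` columns in the notebook output.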

In [None]:
# Aggregate transaction counts per user over time
user_transactions_time = (
    transactions.groupby(['person', 'time'])
    .size()
    .unstack(level=1)
    #.reset_index()
)

# Rename columns to indicate transaction counts over time
user_transactions_time.columns = [
    f"time_tran_{col}" for col in user_transactions_time.columns.to_flat_index()
]

# Compute transaction counts over 20 cumulative periods (each starts 6 time columns, ~36 hours, later than the previous)
frequences_period = pd.DataFrame({
    f'period_{i//6 + 1}': user_transactions_time.iloc[:, i+1:].sum(axis=1)
    for i in range(0, 119, 6)
}).reset_index()

# Define churn as 1 if no transactions occurred in the last three periods
churn = pd.Series((frequences_period.iloc[:, -3:].sum(axis=1) == 0).astype(int), name='churn')

# Compute recency: Time elapsed since last transaction (714 - max transaction time per person)
recency = (714 - transactions.groupby('person')['time'].max()).reset_index(drop=True)
recency.name = 'recency'

# Alternative churn definition: 1 if recency is greater than 96 hours
churn2 = pd.Series((recency > 96).astype(int), name='churn2')

# Combine all features into a final churn prediction table
user_transactions_time = pd.concat([
        frequences_period, recency, churn, churn2
], axis=1)

# Save churn data to a CSV file
user_transactions_time.to_csv('medalion_data_store/silver/user_transactions_time.csv', index=False)

# Output the churn table
user_transactions_time


Unnamed: 0,person,period_1,period_2,period_3,period_4,period_5,period_6,period_7,period_8,period_9,...,period_14,period_15,period_16,period_17,period_18,period_19,period_20,recency,churn,churn2
0,0009655768c64bdeb2e877511632db8f,8.0,8.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,...,6.0,6.0,5.0,3.0,3.0,3.0,2.0,18,0,0
1,00116118485d4dfda04fdbaba9a87b5c,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,240,1,1
2,0011e0d4e6b944f998e987f904e8c1e5,5.0,5.0,5.0,5.0,4.0,4.0,4.0,3.0,3.0,...,3.0,3.0,3.0,2.0,2.0,1.0,0.0,60,0,0
3,0020c2b971eb4e9188eac86d93036a77,8.0,8.0,5.0,5.0,4.0,4.0,4.0,4.0,4.0,...,4.0,4.0,2.0,2.0,2.0,2.0,2.0,6,0,0
4,0020ccbbb6d84e358d3414a3ff76cffd,12.0,12.0,11.0,11.0,11.0,11.0,11.0,9.0,8.0,...,2.0,2.0,2.0,2.0,1.0,1.0,0.0,42,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16573,fff3ba4757bd42088c044ca26d73817a,11.0,9.0,8.0,7.0,7.0,6.0,5.0,4.0,4.0,...,3.0,3.0,1.0,0.0,0.0,0.0,0.0,162,1,1
16574,fff7576017104bcc8677a8d63322b5e1,6.0,5.0,5.0,5.0,5.0,5.0,4.0,4.0,3.0,...,3.0,3.0,3.0,2.0,1.0,1.0,1.0,18,0,0
16575,fff8957ea8b240a6b5e634b6ee8eafcf,5.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,2.0,1.0,1.0,0.0,0.0,0.0,0.0,138,1,1
16576,fffad4f4828548d1b5583907f2e9906b,12.0,11.0,10.0,9.0,8.0,8.0,7.0,7.0,7.0,...,4.0,4.0,3.0,2.0,1.0,1.0,0.0,36,0,0
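
The recency-based churn flag above can be checked on a small example. The sketch below uses hypothetical toy data and assumes that `time` is measured in hours and that 714 marks the end of the observation window. It keeps `person` as the index, which is one way to make later joins align on the user rather than on row order.

```python
import pandas as pd

# Hypothetical toy data: transaction times (in hours) for three users.
transactions_toy = pd.DataFrame({
    "person": ["a", "a", "b", "c", "c"],
    "time":   [0, 700, 300, 600, 650],
})

END_OF_WINDOW = 714      # assumed last hour of the observation window
RECENCY_THRESHOLD = 96   # hours, matching the churn2 definition

# Last transaction per person; 'person' stays as the index.
last_seen = transactions_toy.groupby("person")["time"].max()

# Hours elapsed between the last transaction and the end of the window.
recency_toy = (END_OF_WINDOW - last_seen).rename("recency")

# Flag users whose last transaction is older than the threshold.
churn2_toy = (recency_toy > RECENCY_THRESHOLD).astype(int).rename("churn2")

toy = pd.concat([recency_toy, churn2_toy], axis=1)
print(toy)
```

Here user `b` is the only one flagged, because their last transaction happened 414 hours before the end of the window.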


In [None]:
user_transactions_time.isna().sum()

person       0
period_1     0
period_2     0
period_3     0
period_4     0
period_5     0
period_6     0
period_7     0
period_8     0
period_9     0
period_10    0
period_11    0
period_12    0
period_13    0
period_14    0
period_15    0
period_16    0
period_17    0
period_18    0
period_19    0
period_20    0
recency      0
churn        0
churn2       0
dtype: int64

# Analytical table

A final table that merges selected tables created in the previous steps.

The goal is to produce datasets that can be used in the analysis, modeling and recommendation steps.

In [None]:
# Merge profile data with the unique event features on the user id ('id' == 'person')
# how='right' keeps every record from user_item_u_event
analytical_table = profile.merge(user_item_u_event, left_on='id', right_on='person', how='right').drop(columns=['person'])

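# The left joins from here on keep every row already in analytical_table
# NOTE: user_event must be created in an earlier cell; the NameError recorded below is raised when it is missing
analytical_table = analytical_table.merge(user_event, left_on='id', right_on='person', how='left').drop(columns=['person'])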

analytical_table = analytical_table.merge(user_transactions, left_on='id', right_on='person', how='left').drop(columns=['person'])

analytical_table = analytical_table.merge(user_transactions_time, left_on='id', right_on='person', how='left').drop(columns=['person'])

# Save the final dataset to a CSV file
analytical_table.to_csv('medalion_data_store/gold/analytical_table.csv', index=False)


analytical_table

NameError: name 'user_event' is not defined

In [None]:
analytical_table.isna().sum()

gender              9776
age                    0
id                     0
became_member_on       0
income              9776
                    ... 
period_19           1881
period_20           1881
recency             1881
churn               1881
churn2              1881
Length: 101, dtype: int64
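
The identical count of 1881 missing values across the `period_*`, `recency`, `churn` and `churn2` columns suggests that those rows found no partner in the left join, presumably users without any transactions. A minimal sketch of how this could be verified with the `indicator` option of `DataFrame.merge`, using hypothetical stand-in tables rather than the notebook's data:

```python
import pandas as pd

# Hypothetical stand-ins for analytical_table and user_transactions_time.
left = pd.DataFrame({
    "id": ["u1", "u1", "u2", "u3"],
    "ofr_id_short": ["ofr_A", "ofr_B", "ofr_A", "ofr_C"],
})
right = pd.DataFrame({"person": ["u1", "u3"], "recency": [18, 240]})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for rows that found no partner in the right table.
merged = left.merge(
    right, left_on="id", right_on="person", how="left", indicator=True
)

unmatched_ids = merged.loc[merged["_merge"] == "left_only", "id"].unique()
print(merged["_merge"].value_counts())
print(unmatched_ids)
```

If the unmatched ids turn out to be users with no transactions, filling the missing period counts with 0 may be more appropriate than dropping the rows; that choice depends on the downstream analysis.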

In [None]:
analytical_table.dropna()

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group,ofr_id_short,tag,us_ofr_cnt_completed,...,period_14,period_15,period_16,period_17,period_18,period_19,period_20,recency,churn,churn2
0,M,33,0009655768c64bdeb2e877511632db8f,2017-04-21,72000.0,2017-04,Adult,ofr_C,0,0.0,...,6.0,6.0,5.0,3.0,3.0,3.0,2.0,18.0,0.0,0.0
1,M,33,0009655768c64bdeb2e877511632db8f,2017-04-21,72000.0,2017-04,Adult,ofr_G,0,1.0,...,6.0,6.0,5.0,3.0,3.0,3.0,2.0,18.0,0.0,0.0
2,M,33,0009655768c64bdeb2e877511632db8f,2017-04-21,72000.0,2017-04,Adult,ofr_H,0,0.0,...,6.0,6.0,5.0,3.0,3.0,3.0,2.0,18.0,0.0,0.0
3,M,33,0009655768c64bdeb2e877511632db8f,2017-04-21,72000.0,2017-04,Adult,ofr_I,0,1.0,...,6.0,6.0,5.0,3.0,3.0,3.0,2.0,18.0,0.0,0.0
4,M,33,0009655768c64bdeb2e877511632db8f,2017-04-21,72000.0,2017-04,Adult,ofr_J,0,1.0,...,6.0,6.0,5.0,3.0,3.0,3.0,2.0,18.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76272,F,45,ffff82501cea40309d5fdd7edcca4a07,2016-11-25,62000.0,2016-11,Adult,ofr_E,0,1.0,...,5.0,3.0,3.0,2.0,1.0,0.0,0.0,66.0,0.0,0.0
76273,F,45,ffff82501cea40309d5fdd7edcca4a07,2016-11-25,62000.0,2016-11,Adult,ofr_G,0,1.0,...,5.0,3.0,3.0,2.0,1.0,0.0,0.0,66.0,0.0,0.0
76274,F,45,ffff82501cea40309d5fdd7edcca4a07,2016-11-25,62000.0,2016-11,Adult,ofr_J,0,1.0,...,5.0,3.0,3.0,2.0,1.0,0.0,0.0,66.0,0.0,0.0
76275,F,45,ffff82501cea40309d5fdd7edcca4a07,2016-11-25,62000.0,2016-11,Adult,ofr_J,1,1.0,...,5.0,3.0,3.0,2.0,1.0,0.0,0.0,66.0,0.0,0.0
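
Note that `dropna()` returns a new DataFrame and leaves `analytical_table` unchanged unless the result is assigned. It also removes a row if any column is missing. The sketch below, on hypothetical toy data, contrasts that with restricting the check to chosen columns through `subset`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with gaps in demographic and transaction columns.
toy_table = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "gender": ["M", None, "F"],
    "income": [72000.0, np.nan, 62000.0],
    "recency": [18.0, 66.0, np.nan],
})

# Drops every row with at least one missing value (u2 and u3).
all_complete = toy_table.dropna()

# Drops only rows missing demographic fields (u2); u3 is kept.
demo_complete = toy_table.dropna(subset=["gender", "income"])

print(len(toy_table), len(all_complete), len(demo_complete))
```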
