# Starbucks Capstone Challenge

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a discount offer, "buy 10 dollars get 2 dollars off", on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer, but the user never opens the offer during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.

### Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

### Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).

# Data Sets

The data is contained in three files:

* portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer, i.e. BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (e.g. transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value - (dict of strings) - either an offer id or transaction amount depending on the record

--- 

# Importing Libraries and Loading Data

This section imports the necessary Python libraries and loads the data from the JSON files into pandas DataFrames.

### Libraries Imported
- `pandas` (`pd`): Used for data manipulation and analysis.
- `numpy` (`np`): Provides support for numerical operations.
- `math`: Standard Python library for mathematical functions.
- `json`: Enables working with JSON data.
- `matplotlib.pyplot` (`plt`): Used for data visualization.
- `seaborn` (`sns`): Enhances visualization with statistical plotting.

### Loading Data
The `pd.read_json()` function is used to read JSON files:
- `portfolio_raw`: Contains data from `portfolio.json`, likely representing promotional offers.
- `profile_raw`: Contains data from `profile.json`, likely storing user demographic information.
- `transcript_raw`: Contains data from `transcript.json`, likely recording user interactions or transactions.

Each file is read with `orient='records'` and `lines=True`, ensuring that each JSON object in the file is interpreted as a separate record (suitable for line-delimited JSON files).
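
To illustrate what `lines=True` does, here is a minimal sketch that reads two invented records from an in-memory buffer instead of a file. The sample records and the `sample` name are assumptions made for illustration only; they are not taken from the data set.

```python
import io
import pandas as pd

# Two hypothetical records in line-delimited JSON (one object per line).
sample = io.StringIO(
    '{"id": "a1", "reward": 5}\n'
    '{"id": "b2", "reward": 0}\n'
)

# Each line is parsed as a separate record (row) of the DataFrame.
df = pd.read_json(sample, orient='records', lines=True)
print(df.shape)
```

Each line of the input becomes one row, which is why this layout suits large event logs such as `transcript.json`.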


In [1]:
# importing libraries
import pandas as pd
import numpy as np

# read in the json files
portfolio_raw = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile_raw = pd.read_json('data/profile.json', orient='records', lines=True)
transcript_raw = pd.read_json('data/transcript.json', orient='records', lines=True)


# Data understanding

--- 
Objectives: 

* Examination of each individual table and its corresponding columns.
* Exploratory data analysis (EDA) with some statistics.
---


**portfolio.json:**

Ten offer types and their attributes.

--> The data is clean and ready to use.

In [2]:
portfolio_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 612.0+ bytes


In [3]:
# showing the entire table
portfolio_raw

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7
5,3,"[web, email, mobile, social]",7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2
6,2,"[web, email, mobile, social]",10,10,discount,fafdcd668e3743c1bb461111dcafc2a4
7,0,"[email, mobile, social]",0,3,informational,5a8bc65990b245e5a138643cd4eb9837
8,5,"[web, email, mobile, social]",5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d
9,2,"[web, email, mobile]",10,7,discount,2906b810c7d4411798c6938adc9daaa5


**profile.json**

Customer demographic data.

--> There are 17000 customers in the dataset.

--> There are 2175 (~12.8 %) missing (`None`/`NaN`) values in `gender` and `income`, and the `age` value is 118 for these rows.

--> ~50 % are 'M', ~36 % are 'F', and ~1.2 % are 'O'.

--> The column `became_member_on` holds dates stored as unformatted integers (e.g. `20170212`).
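
The claim that the rows with missing `gender` and `income` are the same rows that carry `age` 118 can be checked with a boolean comparison. The following is a minimal sketch on a hypothetical miniature table; `toy_profile` and its values are invented for illustration, and the same expressions could be applied to `profile_raw`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the profile table (values invented for illustration).
toy_profile = pd.DataFrame({
    'gender': ['F', None, 'M', None],
    'age': [55, 118, 40, 118],
    'income': [112000.0, np.nan, 70000.0, np.nan],
})

# Rows where both demographic fields are missing.
missing_demo = toy_profile['gender'].isna() & toy_profile['income'].isna()
# Rows that carry the placeholder age.
placeholder_age = toy_profile['age'] == 118

# True when the placeholder age marks exactly the rows without demographics.
same_rows = bool((missing_demo == placeholder_age).all())
print(same_rows)
```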

In [4]:
profile_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB


In [5]:
profile_raw['gender'].value_counts(dropna=False, normalize=True)

gender
M       0.499059
F       0.360529
None    0.127941
O       0.012471
Name: proportion, dtype: float64

In [6]:
profile_raw['age'].value_counts(dropna=False, normalize=True)

age
118    0.127941
58     0.024000
53     0.021882
51     0.021353
59     0.021118
         ...   
100    0.000706
96     0.000471
98     0.000294
101    0.000294
99     0.000294
Name: proportion, Length: 85, dtype: float64

In [7]:
profile_raw.describe(include='all') 

Unnamed: 0,gender,age,id,became_member_on,income
count,14825,17000.0,17000,17000.0,14825.0
unique,3,,17000,,
top,M,,e4052622e5ba45a8b96b59aba68cf068,,
freq,8484,,1,,
mean,,62.531412,,20167030.0,65404.991568
std,,26.73858,,11677.5,21598.29941
min,,18.0,,20130730.0,30000.0
25%,,45.0,,20160530.0,49000.0
50%,,58.0,,20170800.0,64000.0
75%,,73.0,,20171230.0,80000.0


**transcript.json**

A timeline of the events that took place during the simulation.

--> There are inconsistent dictionary keys in the `value` column.

--> The category names in the `event` column contain spaces (`' '`) and can be normalized.

--> There are no missing values.

In [8]:
transcript_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB


In [9]:
{[*x][0] for x in transcript_raw['value']}

{'amount', 'offer id', 'offer_id'}

In [10]:
transcript_raw['event'].value_counts()

event
transaction        138953
offer received      76277
offer viewed        57725
offer completed     33579
Name: count, dtype: int64

# Data Preparation
---
- Objective:

Creation of `Analytical Tables` datasets for analysis, visuals, recommendations and machine learning applications.

- Strategy:
1. Loading raw data from the original tables.
2. Restructuring it using groupby/unstack and creating features.
3. Selecting relevant variables and features.

This process ensures that the data tables and **features** are properly formatted, aggregated, and cleaned for further analysis. 

---

#### **Portfolio Dataset**
- Add a column `ofr_id_short` that maps each offer `id` to a shorter label that is easier to read.

In [11]:
#creating a copy from the original dataframe
portfolio = portfolio_raw.copy() 

# dictionary mapping each long offer id to a short, readable label
port_id = {
    'ae264e3637204a6fb9bb56bc8210ddfd': 'ofr_A',
    '4d5c57ea9a6940dd891ad53e9dbe8da0': 'ofr_B',
    '3f207df678b143eea3cee63160fa8bed': 'ofr_C',
    '9b98b8c7a33c4b65b9aebfe6a799e6d9': 'ofr_D',
    '0b1e1539f2cc45b7b9fa7c272da2e1d7': 'ofr_E',
    '2298d6c36e964ae4a3e7e9706d1fb8c2': 'ofr_F',
    'fafdcd668e3743c1bb461111dcafc2a4': 'ofr_G',
    '5a8bc65990b245e5a138643cd4eb9837': 'ofr_H',
    'f19421c1d4aa40978ebb69ca19b0e20d': 'ofr_I',
    '2906b810c7d4411798c6938adc9daaa5': 'ofr_J'
}

# mapping the id column
portfolio['ofr_id_short'] = portfolio['id'].map(port_id)

# persist a csv file to the bronze folder
portfolio.to_csv('medalion_data_store/bronze/portfolio.csv', index=False)

portfolio

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id,ofr_id_short
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd,ofr_A
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0,ofr_B
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed,ofr_C
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9,ofr_D
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7,ofr_E
5,3,"[web, email, mobile, social]",7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2,ofr_F
6,2,"[web, email, mobile, social]",10,10,discount,fafdcd668e3743c1bb461111dcafc2a4,ofr_G
7,0,"[email, mobile, social]",0,3,informational,5a8bc65990b245e5a138643cd4eb9837,ofr_H
8,5,"[web, email, mobile, social]",5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d,ofr_I
9,2,"[web, email, mobile]",10,7,discount,2906b810c7d4411798c6938adc9daaa5,ofr_J


#### **Profile Dataset**
- Convert the `became_member_on` column to a standardized **datetime** format for consistency and easier analysis.
- Create a new column with only the year and month of the date.
- Create an `age_group` column by binning `age` into 'Young', 'Adult', 'Middle' and 'Senior'.

In [12]:
def categorize_age(age):
    """
    Categorizes the input age into a predefined group.

    Parameters:
    age (int): The age to categorize.

    Returns:
    str: The category the age falls into. One of 'Young', 'Adult', 
         'Middle', or 'Senior'.
    """
    if age < 25:
        return 'Young'
    elif 25 <= age < 40:
        return 'Adult'
    elif 40 <= age < 60:
        return 'Middle'
    else:
        return 'Senior'


In [13]:
# copy the raw data into a new dataframe
profile = profile_raw.copy(deep=True)

# Convert the 'became_member_on' column to a datetime format
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')

# Create a new column with only the year and month of the membership
# ('became_member_on' is already a datetime column at this point)
profile['bec_memb_year_month'] = profile['became_member_on'].dt.strftime('%Y-%m')

# applying the age categorization function
profile['age_group'] = profile['age'].apply(categorize_age).astype('category')

# Persist a csv file to the data store
profile.to_csv('medalion_data_store/bronze/profile.csv')

profile.head()

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
0,,118,68be06ca386d4c31939f3a4f0e3dd783,2017-02-12,,2017-02,Senior
1,F,55,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0,2017-07,Middle
2,,118,38fe809add3b4fcf9315a9694bb96ff5,2018-07-12,,2018-07,Senior
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0,2017-05,Senior
4,,118,a03223e636434f42ac4c3df47e8bac43,2017-08-04,,2017-08,Senior
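
Note that rows with the placeholder age 118 are labelled 'Senior' by `categorize_age`. One possible alternative, sketched below under the assumption that 118 marks a missing age, masks the placeholder and bins the remaining ages with `pd.cut`, using the same boundaries as the function above. The 'Unknown' label and the `ages` sample are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented sample ages; 118 is assumed to be a placeholder for "missing".
ages = pd.Series([22, 30, 55, 75, 118])

# Replace the placeholder with NaN so it is not binned as a real age.
ages_clean = ages.where(ages != 118)

age_group = pd.cut(
    ages_clean,
    bins=[0, 25, 40, 60, np.inf],
    right=False,  # intervals are [0, 25), [25, 40), [40, 60), [60, inf)
    labels=['Young', 'Adult', 'Middle', 'Senior'],
)

# Give the masked rows an explicit (hypothetical) 'Unknown' category.
age_group = age_group.cat.add_categories('Unknown').fillna('Unknown')
print(age_group.tolist())
```

Whether the placeholder rows should be kept as 'Unknown', dropped, or left as they are depends on how the downstream analysis treats customers without demographic data.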


#### **Transcript dataset**
- Cleaning, normalizing and transforming column values from the original transcript table.
- Creating a new table `transcript_b` using the `json_normalize()` and `concat()` functions.

**Strategy:**

1. Copy the data.
2. Normalize the `value` column dictionary keys: `offer id` --> `offer_id`.
3. Create the `transcript_b` table using the `pd.json_normalize()` and `pd.concat()` functions.
4. Normalize the `event` column values by keeping only the second word of two-word names (e.g. `offer received` --> `received`).
5. Create the `ofr_id_short` column with a more readable id and drop the `offer_id` column.
6. Fill NaN with appropriate values (`'tran'` for records without an offer id).
7. Create a `tag` column to identify the person-event-offer interactions one by one (as a person can interact with the same offer more than once).
8. Persist the table in a csv file and save it in the data store.
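
The key normalization and the expansion of the `value` dictionaries can be sketched on a few invented records. The `toy` table below is hypothetical, and the key renaming uses a dictionary comprehension (an assumption of this sketch, chosen so the original dictionaries are left untouched).

```python
import pandas as pd

# Invented records that reproduce the two spellings of the offer key.
toy = pd.DataFrame({
    'person': ['p1', 'p1', 'p2'],
    'event': ['offer received', 'transaction', 'offer viewed'],
    'value': [{'offer id': 'x'}, {'amount': 3.5}, {'offer_id': 'x'}],
})

# Rename the key without mutating the original dictionaries.
fixed = toy['value'].apply(
    lambda d: {('offer_id' if k == 'offer id' else k): v for k, v in d.items()}
)

# One column per dictionary key; keys absent from a record become NaN.
expanded = pd.json_normalize(fixed.tolist())
flat = pd.concat([toy.drop(columns='value'), expanded], axis=1)
print(sorted(flat.columns))
```

Because both frames share the same default integer index, `pd.concat(..., axis=1)` lines the expanded columns up with their source rows.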

In [14]:
def fix_offer_id(value):
    """
    Fixes the 'offer id' key in a dictionary by renaming it to 'offer_id'.

    Parameters:
    value (dict): A dictionary that may contain the 'offer id' key.

    Returns:
    dict: The updated dictionary with 'offer id' replaced by 'offer_id'.
    """
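    if isinstance(value, dict) and 'offer id' in value:
        # work on a shallow copy: DataFrame.copy(deep=True) does not copy the
        # dictionaries themselves, so an in-place edit would also alter the raw table
        value = dict(value)
        value['offer_id'] = value.pop('offer id')
    return value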

In [15]:
# copy the raw data into a new dataframe
transcript = transcript_raw.copy(deep=True)


# applying the fix_offer_id function
transcript['value'] = transcript['value'].apply(fix_offer_id)

# Normalize the 'value' column with json_normalize method
value_df = pd.json_normalize(transcript['value']) 
transcript_b = pd.concat([transcript, value_df], axis=1).drop('value', axis=1)

# Normalizing the event category names: keep the second word ('offer received' -> 'received')
transcript_b['event'] = [x.split(' ')[1] if len(x.split(' ')) > 1 else x for x in transcript_b['event']  ]

# mapping the offer_id to the offer_id_short and Dropping the offer_id column
transcript_b['ofr_id_short'] = transcript_b['offer_id'].map(port_id).fillna('tran')
transcript_b = transcript_b.drop(columns = ['offer_id'])

# creating a tag column to identify the order of the events for each person-offer-event fact
transcript_b['tag'] = (
    transcript_b.groupby(['person', 'ofr_id_short', 'event'], observed=True)
    .cumcount()
)

# persist a csv file to the bronze folder
transcript_b.to_csv('medalion_data_store/bronze/transcript_b.csv', index=False)

transcript_b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   person        306534 non-null  object 
 1   event         306534 non-null  object 
 2   time          306534 non-null  int64  
 3   amount        138953 non-null  float64
 4   reward        33579 non-null   float64
 5   ofr_id_short  306534 non-null  object 
 6   tag           306534 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 16.4+ MB


### Separating *transactions* events in transcript_b dataset from the others

* **Source**: 

    `transcript_b`.

* **Created tables:**

    `events` and `transactions`: separated data from `transcript_b`


- creating `events` and `transactions` dataframes
- Dropping columns that are not needed

In [16]:
# keep every event that is not a transaction
events = transcript_b[transcript_b['event'] != 'transaction'].copy()

# drop the 'amount' column as it is NaN for all non-transaction rows
events = events.drop(columns=['amount'])

# keep only the transaction rows
transactions = transcript_b[transcript_b['event'] == 'transaction'].copy()

# drop 'reward' (NaN for all transactions) and 'ofr_id_short' (always 'tran'); neither is informative here
transactions = transactions.drop(columns=['reward', 'ofr_id_short'])

# persisting the data in the bronze folder of the data store
events.to_csv('medalion_data_store/bronze/events.csv', index=False)
transactions.to_csv('medalion_data_store/bronze/transactions.csv', index=False)

In [17]:
events.info()

<class 'pandas.core.frame.DataFrame'>
Index: 167581 entries, 0 to 306527
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   person        167581 non-null  object 
 1   event         167581 non-null  object 
 2   time          167581 non-null  int64  
 3   reward        33579 non-null   float64
 4   ofr_id_short  167581 non-null  object 
 5   tag           167581 non-null  int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 8.9+ MB


In [18]:
transcript.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB


# **Feature Engineering**

- Process the datasets to create useful tables and extract features from the data.

## Ranking the Offers

The `events` table is used to rank the offers by the counts of each `event` category and to create rate features.

In [19]:
# Grouping the events DataFrame by 'ofr_id_short' and 'event' to count the occurrences
offer_rank = events.groupby(['ofr_id_short', 'event'])[['event']].count().unstack(level=[1]).reset_index()

# Flattening multi-level column names and joining them with an underscore
offer_rank.columns = ['_'.join(col).strip('_') for col in offer_rank.columns.to_flat_index()]

# Calculating the completion and viewed rates based on event counts
offer_rank['completion_rate'] = offer_rank['event_completed'] / offer_rank['event_received']
offer_rank['viewed_rate'] = offer_rank['event_viewed'] / offer_rank['event_received']

# Saving the resulting DataFrame to a CSV file
offer_rank.to_csv('medalion_data_store/gold/offer_rank.csv', index=False)

# Returning the offer_rank DataFrame
offer_rank


Unnamed: 0,ofr_id_short,event_completed,event_received,event_viewed,completion_rate,viewed_rate
0,ofr_A,3688.0,7658.0,6716.0,0.481588,0.876991
1,ofr_B,3331.0,7593.0,7298.0,0.438694,0.961148
2,ofr_C,,7617.0,4144.0,,0.544046
3,ofr_D,4354.0,7677.0,4171.0,0.567149,0.543311
4,ofr_E,3420.0,7668.0,2663.0,0.446009,0.347287
5,ofr_F,5156.0,7646.0,7337.0,0.67434,0.959587
6,ofr_G,5317.0,7597.0,7327.0,0.699882,0.96446
7,ofr_H,,7618.0,6687.0,,0.877789
8,ofr_I,4296.0,7571.0,7264.0,0.567428,0.959451
9,ofr_J,4017.0,7632.0,4118.0,0.526336,0.53957


## User-Item-Event matrix:

A table that tracks unique interactions by person-offer-event over time to create features that analyse user engagement and responsiveness for each unique offer interaction, person by person.

---

* **Source table**: 

    `events` table: event type and time per person interactions with offers.

* **Created table:**
    
   `user_item_event`: table of *facts* of unique person-offer-event interactions.

### Fact Definition:

-  A **Fact** represents a unique sequence of occurrences in which a person engages with a single offer, tracked from the moment the offer is received through viewing and completion. Each repeated interaction with the same offer is tracked as its own fact.
- The **'tag'** column ensures the uniqueness of each fact and *prevents aggregation when a person interacts with the same offer more than once*. Because grouping on the tag column leaves a single value per group, aggregating with 'max', 'min' or 'mean' returns the same value for a fact.
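
How the tag separates repeated interactions can be seen in a small sketch. The `toy_events` rows are invented: one person receives and views the same offer twice.

```python
import pandas as pd

# Invented events: the same person gets the same offer twice.
toy_events = pd.DataFrame({
    'person': ['p1', 'p1', 'p1', 'p1'],
    'ofr_id_short': ['ofr_J', 'ofr_J', 'ofr_J', 'ofr_J'],
    'event': ['received', 'viewed', 'received', 'viewed'],
    'time': [0, 6, 168, 174],
})

# Number each occurrence within its person-offer-event group: 0, 1, 2, ...
toy_events['tag'] = (
    toy_events.groupby(['person', 'ofr_id_short', 'event']).cumcount()
)
print(toy_events['tag'].tolist())  # [0, 0, 1, 1]
```

The pairing relies on the assumption that the k-th view (or completion) of an offer belongs to its k-th receipt; this holds in the toy data but is not guaranteed when, for example, an earlier receipt is never viewed.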

### Engagement Metrics Feature Engineering Strategy: Key features to quantify user responsiveness.

- Each *person-event-offer-tag* fact is grouped and stacked into `event` columns, placing `time` and `reward` values accordingly.
- Time-event columns are created, and missing values are recorded as `NaN`.


  - **Handling Missing Values:**
    - Before the feature calculations, missing values in the stacked columns `time_completed`, `time_received`, and `time_viewed` are filled with `np.inf` (an event that never happened), avoiding *artificial* imputed values.

  - **Time-Differences-help-features** - not included in the final dataset as it contains np.inf values
    - `to_vr`: Time from receiving to viewing - delays in viewing - {0 to inf}
    - `to_cv`: Time from viewing to completion - delays in completions after viewing - {-inf to inf}
    - `to_cr`: Time from receiving to completion - delays in completion, even if viewed or not - {0 to inf}

  - **Inverse Time-Based Scores (`1/(x+1) * 100`):** calculated using the previous time differences where x is the time difference.
    - `curiosity_vr`: `{0 to 100}` Measures _speed_ of viewing after receiving.
    - `eagerness_cv`: `{0 to 100} or -1` Measures _speed_ of completion after viewing (`-1` indicates completion before viewing, completion without viewing, or an offer that was neither viewed nor completed).
    - `overall_cr`: `{0 to 100}` Overall responsiveness from receiving to completion.

  - **Influence Metrics:**
    - **`influence`**: Binary flag (1/0) indicating if an offer was completed after viewing (responsiveness).
    - **`ext_influence`**: Binary flag capturing simultaneous viewing and completion time (extreme responsiveness).

  - **Counts:** (by fact using tag column to avoid aggregation on the same offer type per person interactions).
    - `count_offer_completed`: Count of completed offers.
    - `count_offer_received`: Count of received offers. 
    - `count_offer_viewed`: Count of viewed offers.
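
As a worked example of the inverse time-based score `1/(x+1) * 100`, the sketch below evaluates it for a few invented delays. The helper name `inverse_time_score` is hypothetical and is not part of the notebook.

```python
import numpy as np

def inverse_time_score(delta_hours):
    """Map a non-negative delay in hours to a 0-100 score (0 h -> 100, inf -> 0)."""
    return 100.0 / (np.asarray(delta_hours, dtype=float) + 1.0)

# Invented delays: immediate, 6 hours, 24 hours, and "never happened" (np.inf).
scores = inverse_time_score([0, 6, 24, np.inf])
print(scores.round(1))
```

An immediate reaction scores 100, a 6-hour delay about 14.3, a 24-hour delay 4.0, and an event that never happened (`np.inf`) scores 0, which is why filling missing times with `np.inf` needs no special-casing for these scores.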

---

> NOTE: This table has features that can be combined to determine which demographic groups respond best to which offer type.

In [20]:
# Grouping events by 'person', 'ofr_id_short', 'event', and 'tag', then aggregating the maximum 'time' and 'reward' per group and restructuring the DataFrame.
user_item_event = (
    events.groupby(['person', 'event', 'ofr_id_short', 'tag'], dropna=False)
    .agg(
        time=('time', 'max'),
        reward=('reward', 'max'),
        cnt = ('ofr_id_short', 'count')
    )
    .unstack(level=1)  # Unstacking by 'event' to widen the DataFrame
    .reset_index()
)

# Flattening the multi-level column names
user_item_event.columns = ['_'.join(col).strip('_') for col in user_item_event.columns.to_flat_index()]

# Dropping unnecessary reward columns (NaN for all values: rewards are recorded only on completion)
user_item_event = user_item_event.drop(columns=['reward_received', 'reward_viewed'])

# Replacing NaN values with infinity to facilitate time calculations
user_item_event['time_completed'] = user_item_event['time_completed'].fillna(np.inf)
user_item_event['time_received'] = user_item_event['time_received'].fillna(np.inf)
user_item_event['time_viewed'] = user_item_event['time_viewed'].fillna(np.inf)

# Computing time differences for various stages of the offer lifecycle
user_item_event['to_vr'] = user_item_event['time_viewed'] - user_item_event['time_received']
user_item_event['to_cv'] = user_item_event['time_completed'] - user_item_event['time_viewed']
user_item_event['to_cr'] = user_item_event['time_completed'] - user_item_event['time_received']

# Calculating curiosity, eagerness, and overall responsiveness scores
user_item_event['curiosity_vr'] = (1 / (user_item_event['to_vr'] + 1)) * 100
user_item_event['eagerness_cv'] = [(1 / (x + 1))*100 if ((not np.isnan(x)) & (x >=0)) else -1 for x in user_item_event['to_cv']]
user_item_event['overall_cr'] = (1 / (user_item_event['to_cr'] + 1)) * 100

# Defining influence metrics based on time conditions
user_item_event['influence'] = (
    (~pd.isna(user_item_event['time_completed'])) & 
    (user_item_event['time_completed'] != np.inf) &
    (user_item_event['time_completed'] >= user_item_event['time_viewed'])
).astype(int)

# Extreme influence is when the offer is viewed and completed at the same time.
user_item_event['ext_influence'] = (
    (~pd.isna(user_item_event['time_viewed'])) & 
    (user_item_event['time_viewed'] != np.inf) & 
    (user_item_event['time_completed'] == user_item_event['time_viewed'])
).astype(int)

# Dropping intermediate time calculation columns and others, rounding final values.
user_item_event = user_item_event.drop(
    columns=['time_completed', 'time_received', 'time_viewed', 'to_vr', 'to_cv', 'to_cr']
).round(1)


# Persisting the data in the gold layer
user_item_event.to_csv('medalion_data_store/gold/user_item_event.csv', index=False)



# Displaying the final DataFrame
user_item_event

Unnamed: 0,person,ofr_id_short,tag,reward_completed,cnt_completed,cnt_received,cnt_viewed,curiosity_vr,eagerness_cv,overall_cr,influence,ext_influence
0,0009655768c64bdeb2e877511632db8f,ofr_C,0,,,1.0,1.0,2.7,0.0,0.0,0,0
1,0009655768c64bdeb2e877511632db8f,ofr_G,0,2.0,1.0,1.0,1.0,2.7,-1.0,4.0,0,0
2,0009655768c64bdeb2e877511632db8f,ofr_H,0,,,1.0,1.0,4.0,0.0,0.0,0,0
3,0009655768c64bdeb2e877511632db8f,ofr_I,0,5.0,1.0,1.0,1.0,2.0,-1.0,14.3,0,0
4,0009655768c64bdeb2e877511632db8f,ofr_J,0,2.0,1.0,1.0,,0.0,-1.0,100.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
76272,ffff82501cea40309d5fdd7edcca4a07,ofr_E,0,5.0,1.0,1.0,1.0,14.3,4.0,3.2,1,0
76273,ffff82501cea40309d5fdd7edcca4a07,ofr_G,0,2.0,1.0,1.0,1.0,14.3,1.8,1.6,1,0
76274,ffff82501cea40309d5fdd7edcca4a07,ofr_J,0,2.0,1.0,1.0,1.0,5.3,3.2,2.0,1,0
76275,ffff82501cea40309d5fdd7edcca4a07,ofr_J,1,2.0,1.0,1.0,1.0,14.3,100.0,14.3,1,1


## User-Item-Transactions Matrix

Constructing a **User-Item-transactions Matrix** from `events` counts and `transactions` statistics.

> NOTE: The transaction reflects overall characteristics of the person. It does not relate to specific offers.

---
* **Source**: 

    `events` and `transactions` tables.

* **Created tables:**
   
   `user_item_transactions` table


### Feature Engineering strategy:

1. **Event Data Processing:**
   - Groups event records by user (`person`), `event` type, and offer ID (`ofr_id_short`).
   - Counts occurrences of each event per user per offer.
   - Unstacks the grouped data to create a structured table with event types as columns.
   - Fills NaN values with zero, as the table is generated by counting.

2. **Transaction Data Processing:**
   - Groups transactions by user (`person`) and event type.
   - Counts transactions and sums transaction amounts for each user-event pair.
   - Unstacks the grouped data.
   - Fills NaN values with zero, as the table is generated by counting and summing.

3. **Merging Data:**
   - Merges the event and transaction grouped tables on `person` to create a consolidated **user-item-transaction dataset**.
   - The table can then be merged with the user profile data, using `id` as the key, to add demographics.
   - Saves the table in the data store.
---
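
A merge with the profile data could look like the following sketch. The two toy tables are invented, and matching `person` against the profile's `id` column is an assumption based on the schema described earlier.

```python
import pandas as pd

# Invented stand-ins for the user-item-transactions matrix and the profile table.
toy_matrix = pd.DataFrame({'person': ['p1', 'p2'], 'cnt_transaction': [8.0, 3.0]})
toy_profile = pd.DataFrame({'id': ['p1', 'p2', 'p3'], 'gender': ['F', 'M', None]})

# Left merge keeps every row of the matrix and adds the demographic columns.
merged = toy_matrix.merge(
    toy_profile, left_on='person', right_on='id', how='left'
).drop(columns='id')
print(merged.columns.tolist())
```

A left merge keeps only the customers present in the matrix; profile rows without any event (such as `p3` here) are dropped.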

In [21]:
# Group event data by person, event, and offer ID, then count occurrences
user_item_events = events.groupby(['person', 'event', 'ofr_id_short']).agg(
    cnt=('event', 'count'),  # Count the occurrences of each event type per person per offer
).unstack(level=[1, 2]).reset_index().fillna(0)

# Flatten multi-index column names
user_item_events.columns = ['_'.join(col).strip('_') for col in user_item_events.columns.to_flat_index()]

# Group transaction data by person and event, aggregating count and sum of amounts
user_transactions = transactions.groupby(['person', 'event']).agg(
    cnt=('amount', 'count'),  # Count the number of transactions per person per event
    sum=('amount', 'sum')     # Sum of transaction amounts per person per event
).unstack(level=[1]).reset_index().fillna(0)

# Flatten multi-index column names
user_transactions.columns = ['_'.join(col).strip('_') for col in user_transactions.columns.to_flat_index()]

# Merge event and transaction data on 'person' to form a unified user-item-transaction matrix
user_item_transactions = user_item_events.merge(user_transactions, on='person', how='left')

# save the table in the data store
user_item_transactions.to_csv('medalion_data_store/gold/user_item_transactions.csv', index=False)

# Output final user-item-transactions matrix
user_item_transactions

Unnamed: 0,person,cnt_completed_ofr_G,cnt_completed_ofr_I,cnt_completed_ofr_J,cnt_received_ofr_C,cnt_received_ofr_G,cnt_received_ofr_H,cnt_received_ofr_I,cnt_received_ofr_J,cnt_viewed_ofr_C,...,cnt_viewed_ofr_F,cnt_completed_ofr_B,cnt_received_ofr_A,cnt_received_ofr_B,cnt_viewed_ofr_B,cnt_completed_ofr_A,cnt_viewed_ofr_A,cnt_viewed_ofr_J,cnt_transaction,sum_transaction
0,0009655768c64bdeb2e877511632db8f,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,127.60
1,00116118485d4dfda04fdbaba9a87b5c,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,4.09
2,0011e0d4e6b944f998e987f904e8c1e5,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,79.46
3,0020c2b971eb4e9188eac86d93036a77,2.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,...,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,8.0,196.86
4,0020ccbbb6d84e358d3414a3ff76cffd,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,154.05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16989,fff3ba4757bd42088c044ca26d73817a,1.0,0.0,1.0,0.0,1.0,2.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,580.98
16990,fff7576017104bcc8677a8d63322b5e1,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,6.0,29.94
16991,fff8957ea8b240a6b5e634b6ee8eafcf,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,5.0,12.15
16992,fffad4f4828548d1b5583907f2e9906b,0.0,2.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,88.83


## User-Item-Time matrix

This section builds a **User-Item-Time matrix** from counts and time statistics computed on `transcript_b`.

---
* **Source**: 

    `transcript_b` table.

* **Created tables:**
    
    `event_person_summary` and `offer_person_summary` with related features.

    An intermediate table `transcript_features` is created by merging `event_person_summary` and `offer_person_summary`, and
    a final table `transcript_features_profile` is created by merging `transcript_features` with `profile`.

### Feature Engineering strategy:
The code groups data by the `person`-`event` columns for the *event-person* summary and by the `person`-`ofr_id_short` columns for the *offer-person* summary, then computes the following statistics:

- **Event or Offer type Counts (`cnt`)**: Number of occurrences of each event/offer type per person.
- **Average Time (`avg_time`)**: Mean timestamp of each event/offer type per person.
- **Maximum Time (`max_time`)**: Latest occurrence timestamp of each event/offer type per person.
- **Minimum Time (`min_time`)**: Earliest occurrence timestamp of each event/offer type per person.

### Data Processing
- The `.unstack()` function pivots the `event` or `ofr_id_short` column, turning each event/offer type into separate columns that hold the statistics listed above.

- No further calculations are performed on the data columns, as the summary statistics are computed directly from the groupby/unstack operations.

- The resulting multi-level column names are flattened using list comprehension to create more readable column names.
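
The group, unstack and flatten steps described above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual data: the `toy` table and its values are made up, and only two of the four statistics are computed to keep the output small.

```python
import pandas as pd

# Toy stand-in for `transcript_b` (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "person": ["p1", "p1", "p1", "p2"],
    "event":  ["received", "viewed", "received", "received"],
    "time":   [0, 6, 168, 24],
})

summary = (
    toy.groupby(["person", "event"])
       .agg(cnt=("event", "count"), max_time=("time", "max"))
       .unstack(level=1)   # pivot the `event` values into columns
       .reset_index()
)

# Flatten ('cnt', 'received') -> 'cnt_received' and ('person', '') -> 'person'
summary.columns = ["_".join(col).strip("_") for col in summary.columns.to_flat_index()]

print(summary.columns.tolist())
```

After `unstack`, a person with no event of a given type gets `NaN` in the corresponding columns, which is why the notebook fills the count columns with 0 afterwards.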


---

In [22]:
# operating counts and time-statistics with focus on *events*
user_item_time = transcript_b.groupby(['person', 'event']).agg( 
    cnt_eve = ('event', 'count'), # number of total events per person 
    avg_time = ('time', 'mean'),
    max_time = ('time', 'max'),
    min_time = ('time', 'min')
    
    ).unstack(level=1).reset_index()  

# Flatten the multi-level column names.
user_item_time.columns = ['_'.join(col).strip('_') for col in user_item_time.columns.to_flat_index()]
# Fill missing counts with 0; missing time statistics are left as NaN.
user_item_time.iloc[:,1:5] = user_item_time.iloc[:,1:5].fillna(0)

# operating counts and time-statistics with focus on *offers*
person_offer_time = transcript_b.groupby(['person', 'ofr_id_short']).agg(
    cnt = ('ofr_id_short', 'count'), # number of events per person per offer
    avg_time = ('time', 'mean'),
    max_time = ('time', 'max'),
    min_time = ('time', 'min')
    
    ).unstack(1).reset_index()  

# Flatten the multi-level column names.
person_offer_time.columns = ['_'.join(col).strip('_') for col in person_offer_time.columns.to_flat_index()]
# Fill missing counts with 0; missing time statistics are left as NaN.
person_offer_time.iloc[:,1:12] = person_offer_time.iloc[:,1:12].fillna(0)

# merging the two dataframes
transcript_features = person_offer_time.merge(user_item_time, on='person', how='left').reset_index(drop=True)

# dropping duplicated columns: ['ofr_id_short'] == tran and ['event'] == transaction contain the same information
#transcript_features = transcript_features.drop(columns=['min_time_tran', 'avg_time_tran', 'max_time_tran']) 


user_item_time.to_csv('medalion_data_store/silver/user_item_time.csv', index=False)

user_item_time

Unnamed: 0,person,cnt_eve_completed,cnt_eve_received,cnt_eve_transaction,cnt_eve_viewed,avg_time_completed,avg_time_received,avg_time_transaction,avg_time_viewed,max_time_completed,max_time_received,max_time_transaction,max_time_viewed,min_time_completed,min_time_received,min_time_transaction,min_time_viewed
0,0009655768c64bdeb2e877511632db8f,3.0,5.0,8.0,4.0,506.0,398.4,543.00,390.0,576.0,576.0,696.0,540.0,414.0,168.0,228.0,192.0
1,00116118485d4dfda04fdbaba9a87b5c,0.0,2.0,3.0,2.0,,372.0,408.00,423.0,,576.0,474.0,630.0,,168.0,294.0,216.0
2,0011e0d4e6b944f998e987f904e8c1e5,3.0,5.0,5.0,5.0,468.0,283.2,451.20,298.8,576.0,504.0,654.0,516.0,252.0,0.0,132.0,6.0
3,0020c2b971eb4e9188eac86d93036a77,3.0,5.0,8.0,3.0,358.0,283.2,348.75,366.0,510.0,504.0,708.0,660.0,54.0,0.0,54.0,12.0
4,0020ccbbb6d84e358d3414a3ff76cffd,3.0,4.0,12.0,4.0,400.0,354.0,375.00,376.5,600.0,504.0,672.0,582.0,222.0,168.0,42.0,168.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16995,fff3ba4757bd42088c044ca26d73817a,3.0,6.0,11.0,3.0,234.0,332.0,246.00,252.0,528.0,576.0,552.0,540.0,6.0,0.0,6.0,6.0
16996,fff7576017104bcc8677a8d63322b5e1,3.0,5.0,6.0,4.0,460.0,331.2,392.00,285.0,594.0,576.0,696.0,522.0,192.0,0.0,36.0,24.0
16997,fff8957ea8b240a6b5e634b6ee8eafcf,0.0,3.0,5.0,2.0,,496.0,379.20,546.0,,576.0,576.0,660.0,,408.0,18.0,432.0
16998,fffad4f4828548d1b5583907f2e9906b,3.0,4.0,12.0,4.0,380.0,288.0,323.50,337.5,588.0,576.0,678.0,666.0,36.0,0.0,36.0,6.0


---
## Offer-Events Summary Features
Tracks overall interactions with each offer in the `events` table, independent of person: a table of general count and time statistics per offer.

* **Source table**: 

    `events`: event type and time per person interactions with offers.

* **Created table:**
    
   `offer_event_features`: counting and time-based features for each offer.

### Feature Engineering Strategy:

- Rows are grouped by offer and `event`, and the `event` level is unstacked into columns, so each offer gets one count per event type. Time statistics (mean, max, min) are computed the same way.
- The result has one count column and three time columns per event type.
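
The output table below shows offers with no `completed` events (count 0, empty time statistics). A minimal sketch of how that arises, assuming a made-up `toy_events` frame in which `ofr_C` has no `completed` rows. Selecting the count columns by name rather than by position is a swapped-in variant of the notebook's `iloc` slice.

```python
import pandas as pd

# Toy stand-in for `events`; ofr_C has no 'completed' rows (hypothetical data)
toy_events = pd.DataFrame({
    "ofr_id_short": ["ofr_A", "ofr_A", "ofr_A", "ofr_C", "ofr_C"],
    "event": ["received", "viewed", "completed", "received", "viewed"],
    "time": [0, 6, 30, 0, 12],
})

offer_summary = (
    toy_events.groupby(["ofr_id_short", "event"])
    .agg(cnt=("event", "count"), mean_time=("time", "mean"))
    .unstack(level=1)
)
offer_summary.columns = ["_".join(col) for col in offer_summary.columns.to_flat_index()]
offer_summary = offer_summary.reset_index()

# Fill only the count columns; a missing time statistic stays NaN
count_cols = offer_summary.filter(like="cnt_").columns
offer_summary[count_cols] = offer_summary[count_cols].fillna(0)

print(offer_summary[["ofr_id_short", "cnt_completed", "mean_time_completed"]])
```

Leaving the time statistics as `NaN` keeps "never happened" distinct from "happened at hour 0", which a blanket `fillna(0)` would blur.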

---

In [23]:
# grouping the facts for counting events  (not using the tag column).
offer_event_summary = events.groupby(['ofr_id_short', 'event'], dropna=False).agg(
    cnt=('ofr_id_short', 'count'),
    mean_time = ('time', 'mean'),
    max_time = ('time', 'max'),
    min_time = ('time', 'min')
).unstack(level=1).round(1).reset_index()

# Fill missing counts with 0; missing time statistics are left as NaN.
offer_event_summary.iloc[:,1:4] = offer_event_summary.iloc[:,1:4].fillna(0)

# Flattening the multi-level column names
offer_event_summary.columns = ['_'.join(col).strip('_') for col in offer_event_summary.columns.to_flat_index()]

offer_event_summary.to_csv('medalion_data_store/silver/offer_event_summary.csv', index=False)

offer_event_summary

Unnamed: 0,ofr_id_short,cnt_completed,cnt_received,cnt_viewed,mean_time_completed,mean_time_received,mean_time_viewed,max_time_completed,max_time_received,max_time_viewed,min_time_completed,min_time_received,min_time_viewed
0,ofr_A,3688.0,7658.0,6716.0,394.8,329.8,352.6,714.0,576.0,714.0,0.0,0.0,0.0
1,ofr_B,3331.0,7593.0,7298.0,385.7,335.2,353.1,696.0,576.0,714.0,0.0,0.0,0.0
2,ofr_C,0.0,7617.0,4144.0,,331.9,358.6,,576.0,714.0,,0.0,0.0
3,ofr_D,4354.0,7677.0,4171.0,407.1,334.1,362.0,714.0,576.0,714.0,0.0,0.0,0.0
4,ofr_E,3420.0,7668.0,2663.0,431.5,331.3,366.7,714.0,576.0,714.0,0.0,0.0,0.0
5,ofr_F,5156.0,7646.0,7337.0,400.3,336.4,354.7,714.0,576.0,714.0,0.0,0.0,0.0
6,ofr_G,5317.0,7597.0,7327.0,399.1,330.5,348.9,714.0,576.0,714.0,0.0,0.0,0.0
7,ofr_H,0.0,7618.0,6687.0,,332.5,353.9,,576.0,714.0,,0.0,0.0
8,ofr_I,4296.0,7571.0,7264.0,382.9,332.2,349.8,696.0,576.0,714.0,0.0,0.0,0.0
9,ofr_J,4017.0,7632.0,4118.0,410.0,332.0,356.2,714.0,576.0,714.0,0.0,0.0,0.0


## Person-Transactions Summary Features

This section of the code aggregates transaction data at the `person` level, creating summary statistics that describe each individual's spending behavior. The goal is to generate features that help analyze transaction patterns.

* **Source table**: 

    `transactions`: transaction events with the `amount` spent per person.

* **Created table:**
    
    `person_transaction_summary`: intermediate table with summary statistics for each person.
    
    `transac_amount_sumary_profile`: final table merged with `profile` table.

**Strategy:**

The code groups transactions by `person` and computes various summary statistics on the `amount` spent:

- **Total Transaction Amount (`tran_amoun_tot`)**: Sum of all transaction amounts for each person.
- **Average Transaction Amount (`tran_amoun_mean`)**: Mean transaction amount per person, indicating typical spending behavior.
- **Maximum Transaction Amount (`tran_amoun_max`)**: Highest transaction amount recorded for each person.
- **Minimum Transaction Amount (`tran_amoun_min`)**: Lowest transaction amount recorded for each person.

### Additional Considerations
- `tran_amount_std` (standard deviation of transaction amounts) is commented out.
- The `.round(1)` function rounds all numerical values to one decimal place for cleaner representation.
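
A small sketch of the same named aggregation with the standard deviation switched on, assuming a made-up `toy_tx` frame. It illustrates one thing to keep in mind if `tran_amount_std` is ever enabled: pandas uses the sample standard deviation by default, so a person with a single transaction gets `NaN`.

```python
import pandas as pd

# Toy stand-in for `transactions` (hypothetical values)
toy_tx = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "amount": [10.0, 14.0, 3.2],
})

stats = toy_tx.groupby("person").agg(
    tran_count=("amount", "count"),
    tran_amoun_tot=("amount", "sum"),
    tran_amoun_mean=("amount", "mean"),
    tran_amount_std=("amount", "std"),   # sample std (ddof=1)
).round(1).reset_index()

print(stats)
```

Here `p1` has two transactions and a defined standard deviation, while `p2` has one transaction and a missing value in `tran_amount_std`.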

In [47]:
# grouping  by person
person_transaction_summary = transactions.groupby(['person'], dropna=False).agg(
    tran_count=('amount', 'count'),
    tran_amoun_tot =('amount', 'sum'),
    tran_amoun_mean=('amount', 'mean'),
    tran_amoun_max=('amount', 'max'),
    tran_amoun_min=('amount', 'min'),
    # tran_amount_std = ('amount', 'std')
).round(1).reset_index()

person_transaction_summary.to_csv('medalion_data_store/silver/person_transaction_summary.csv', index=False)

person_transaction_summary

Unnamed: 0,person,tran_count,tran_amoun_tot,tran_amoun_mean,tran_amoun_max,tran_amoun_min
0,0009655768c64bdeb2e877511632db8f,8,127.6,16.0,28.2,8.6
1,00116118485d4dfda04fdbaba9a87b5c,3,4.1,1.4,3.2,0.2
2,0011e0d4e6b944f998e987f904e8c1e5,5,79.5,15.9,23.0,9.0
3,0020c2b971eb4e9188eac86d93036a77,8,196.9,24.6,33.9,17.2
4,0020ccbbb6d84e358d3414a3ff76cffd,12,154.0,12.8,20.1,6.8
...,...,...,...,...,...,...
16573,fff3ba4757bd42088c044ca26d73817a,11,581.0,52.8,388.2,11.0
16574,fff7576017104bcc8677a8d63322b5e1,6,29.9,5.0,8.0,2.1
16575,fff8957ea8b240a6b5e634b6ee8eafcf,5,12.2,2.4,6.4,0.6
16576,fffad4f4828548d1b5583907f2e9906b,12,88.8,7.4,12.2,2.0


# Analytical table 

A final table merging selected tables created in the previous steps.

The goal is to obtain different types of datasets for use in analysis, modeling and recommendations.
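
Chained left joins like the ones in the next cell keep the row count of the left table only if every right-hand table has at most one row per key. A minimal sketch of how that can be checked, assuming made-up toy tables (`user_offer`, `profile_toy`, `spend_toy` are hypothetical names) and using the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical toy tables: several rows per person on the left, one row per person on the right
user_offer = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "ofr_id_short": ["ofr_A", "ofr_B", "ofr_A"],
})
profile_toy = pd.DataFrame({"id": ["p1", "p2"], "age": [55, 75]})
spend_toy = pd.DataFrame({"person": ["p1"], "tran_amoun_tot": [127.6]})

merged = (
    user_offer
    # validate raises MergeError if the right-hand key is not unique
    .merge(profile_toy, left_on="person", right_on="id", how="left", validate="many_to_one")
    .drop(columns=["id"])
    .merge(spend_toy, on="person", how="left", validate="many_to_one")
)

# A left join against unique keys never adds or removes rows
assert len(merged) == len(user_offer)
print(merged)
```

Persons missing from a right-hand table (here `p2` in `spend_toy`) keep their row and get `NaN` in the joined columns, which matches the non-null counts that differ across columns in the `info()` output.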

In [56]:
# Attach profile attributes to the user-item event table on 'person' == 'id'
# Left joins keep all records from user_item_event
analytical_user_item = user_item_event.merge(profile, left_on='person', right_on='id', how='left').drop(columns=['id'])

analytical_user_item = analytical_user_item.merge(user_item_transactions, on='person', how='left')

analytical_user_item = analytical_user_item.merge(user_item_time, on='person', how='left')

analytical_user_item = analytical_user_item.merge(person_transaction_summary, on='person', how='left')

analytical_user_item = analytical_user_item.drop(columns=['tran_count'])



# # Merge transaction amount summary with the existing dataset based on 'person'
# # Fill missing values with 0 after the merge
# person_activity = person_activity.merge(transac_amount_sumary, on='person', how='left').fillna(0)

# # Merge person activity data with profile data based on 'person' and 'id'
# # Drop the redundant 'id' column after the merge
# person_activity_profile = person_activity.merge(profile, left_on='person', right_on='id', how='left').drop(columns=['id'])

# # Merge with portfolio data using 'ofr_id_short' as the key
# # Drop the redundant 'id' column after the merge
# person_activity_profile = person_activity_profile.merge(portfolio, on='ofr_id_short', how='left').drop(columns=['id'])

# # Save the final dataset to a CSV file
# person_activity_profile.to_csv('medalion_data_store/gold/person_activity_profile.csv')

# # Display the final DataFrame
# person_activity_profile


In [57]:
analytical_user_item

Unnamed: 0,person,ofr_id_short,tag,reward_completed,cnt_completed,cnt_received,cnt_viewed,curiosity_vr,eagerness_cv,overall_cr,...,max_time_transaction,max_time_viewed,min_time_completed,min_time_received,min_time_transaction,min_time_viewed,tran_amoun_tot,tran_amoun_mean,tran_amoun_max,tran_amoun_min
0,0009655768c64bdeb2e877511632db8f,ofr_C,0,,,1.0,1.0,2.7,0.0,0.0,...,696.0,540.0,414.0,168.0,228.0,192.0,127.6,16.0,28.2,8.6
1,0009655768c64bdeb2e877511632db8f,ofr_G,0,2.0,1.0,1.0,1.0,2.7,-1.0,4.0,...,696.0,540.0,414.0,168.0,228.0,192.0,127.6,16.0,28.2,8.6
2,0009655768c64bdeb2e877511632db8f,ofr_H,0,,,1.0,1.0,4.0,0.0,0.0,...,696.0,540.0,414.0,168.0,228.0,192.0,127.6,16.0,28.2,8.6
3,0009655768c64bdeb2e877511632db8f,ofr_I,0,5.0,1.0,1.0,1.0,2.0,-1.0,14.3,...,696.0,540.0,414.0,168.0,228.0,192.0,127.6,16.0,28.2,8.6
4,0009655768c64bdeb2e877511632db8f,ofr_J,0,2.0,1.0,1.0,,0.0,-1.0,100.0,...,696.0,540.0,414.0,168.0,228.0,192.0,127.6,16.0,28.2,8.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76272,ffff82501cea40309d5fdd7edcca4a07,ofr_E,0,5.0,1.0,1.0,1.0,14.3,4.0,3.2,...,648.0,582.0,60.0,0.0,60.0,6.0,226.1,15.1,23.3,7.2
76273,ffff82501cea40309d5fdd7edcca4a07,ofr_G,0,2.0,1.0,1.0,1.0,14.3,1.8,1.6,...,648.0,582.0,60.0,0.0,60.0,6.0,226.1,15.1,23.3,7.2
76274,ffff82501cea40309d5fdd7edcca4a07,ofr_J,0,2.0,1.0,1.0,1.0,5.3,3.2,2.0,...,648.0,582.0,60.0,0.0,60.0,6.0,226.1,15.1,23.3,7.2
76275,ffff82501cea40309d5fdd7edcca4a07,ofr_J,1,2.0,1.0,1.0,1.0,14.3,100.0,14.3,...,648.0,582.0,60.0,0.0,60.0,6.0,226.1,15.1,23.3,7.2


In [58]:
analytical_user_item.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76277 entries, 0 to 76276
Data columns (total 68 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   person                76277 non-null  object        
 1   ofr_id_short          76277 non-null  object        
 2   tag                   76277 non-null  int64         
 3   reward_completed      33579 non-null  float64       
 4   cnt_completed         33579 non-null  float64       
 5   cnt_received          76277 non-null  float64       
 6   cnt_viewed            57725 non-null  float64       
 7   curiosity_vr          76277 non-null  float64       
 8   eagerness_cv          76277 non-null  float64       
 9   overall_cr            76277 non-null  float64       
 10  influence             76277 non-null  int64         
 11  ext_influence         76277 non-null  int64         
 12  gender                66501 non-null  object        
 13  age             

In [59]:
analytical_user_item[analytical_user_item['person'] == '68be06ca386d4c31939f3a4f0e3dd783'] #['cnt_completed']

Unnamed: 0,person,ofr_id_short,tag,reward_completed,cnt_completed,cnt_received,cnt_viewed,curiosity_vr,eagerness_cv,overall_cr,...,max_time_transaction,max_time_viewed,min_time_completed,min_time_received,min_time_transaction,min_time_viewed,tran_amoun_tot,tran_amoun_mean,tran_amoun_max,tran_amoun_min
31212,68be06ca386d4c31939f3a4f0e3dd783,ofr_E,0,,,1.0,1.0,7.7,0.0,0.0,...,696.0,582.0,552.0,168.0,360.0,216.0,20.4,2.3,5.2,0.1
31213,68be06ca386d4c31939f3a4f0e3dd783,ofr_F,0,3.0,1.0,1.0,1.0,100.0,2.0,2.0,...,696.0,582.0,552.0,168.0,360.0,216.0,20.4,2.3,5.2,0.1
31214,68be06ca386d4c31939f3a4f0e3dd783,ofr_G,0,2.0,1.0,1.0,1.0,100.0,0.7,0.7,...,696.0,582.0,552.0,168.0,360.0,216.0,20.4,2.3,5.2,0.1
31215,68be06ca386d4c31939f3a4f0e3dd783,ofr_G,1,,,1.0,1.0,14.3,0.0,0.0,...,696.0,582.0,552.0,168.0,360.0,216.0,20.4,2.3,5.2,0.1
31216,68be06ca386d4c31939f3a4f0e3dd783,ofr_J,0,,,1.0,1.0,2.0,0.0,0.0,...,696.0,582.0,552.0,168.0,360.0,216.0,20.4,2.3,5.2,0.1


In [61]:
analytical_user_item.duplicated().sum()

np.int64(0)

In [62]:
analytical_user_item.columns[analytical_user_item.T.duplicated(keep=False)] #.sum()

Index([], dtype='object')

# Churn Prediction

This section builds a **Churn Prediction Table** from transaction data. It captures key features such as **recency, frequency, and value** to determine customer churn.

## Features

1. **Recency:**
   - Measures the time since the last recorded transaction.
   - Computed as `714 - max(transaction time per person)`.
   - A higher value indicates that a user has not interacted recently.

2. **Frequency:**
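   - Tracks the number of transactions made by a customer in 20 cumulative time windows: the first window covers the full observation time (714h) and each following window starts ~36h later.
   - Each `period_k` column is the count of transactions from the start of that window to the end of the observation time.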

3. **Value:**
   - Represents the **average** value of transactions per customer.
   - Useful for understanding the spending behavior of each user.

4. **Churn Definition:**
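   - **Primary Churn (`churn`)**: A user is considered churned (`1`) if no transactions occurred in the last three periods. Because the periods are cumulative, this means no transactions from the start of `period_18` onward.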
   - **Alternative Churn (`churn2`)**: A user is marked as churned (`1`) if the recency is greater than 96 hours.
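
The recency, cumulative-period and churn definitions above can be traced on a toy example. This is a simplified sketch under stated assumptions: the observation window ends at hour 24 instead of 714, each period starts one 6-hour column later instead of six columns later, and the `churn2` threshold is 12 hours instead of 96. All names and values are illustrative.

```python
import pandas as pd

END_TIME = 24   # toy end of the observation window (the notebook uses 714)
STEP = 6        # hours between time columns
times = list(range(0, END_TIME + STEP, STEP))   # [0, 6, 12, 18, 24]

toy_tx = pd.DataFrame({
    "person": ["p1", "p1", "p2", "p2"],
    "time":   [0, 24, 0, 6],
})

# One column per time step holding the number of transactions at that step
counts = (
    toy_tx.groupby(["person", "time"]).size()
    .unstack(level=1, fill_value=0)
    .reindex(columns=times, fill_value=0)   # make sure every time step is present
)

# Cumulative counts: transactions from each start time to the end of the window
periods = pd.DataFrame({
    f"from_{t}h": counts.loc[:, t:].sum(axis=1) for t in times
})

recency = END_TIME - toy_tx.groupby("person")["time"].max()
churn = (periods.iloc[:, -3:].sum(axis=1) == 0).astype(int)   # nothing in the last 3 periods
churn2 = (recency > 12).astype(int)                            # toy threshold

print(periods.assign(recency=recency, churn=churn, churn2=churn2))
```

`p1` buys at hours 0 and 24, so every cumulative period is non-zero and both churn flags are 0. `p2` stops at hour 6, so the last three periods are 0, recency is 18, and both flags are 1.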

## Application

- **Customer Retention Strategies**: Identifies users at risk of churning, enabling proactive engagement.
- **Marketing Campaigns**: Targets high-value customers with tailored offers.
- **Revenue Optimization**: Helps predict potential loss due to customer inactivity.

This structured dataset can be utilized for machine learning models to forecast churn and enhance business decision-making.


In [74]:
# Aggregate transaction counts per user over time
transactions_time = (
    transactions.groupby(['person', 'time'])
    .size()
    .unstack(level=1, fill_value=0)
    .reset_index()
)

# Rename columns to indicate transaction counts over time
transactions_time.columns = [
    f"time_tran_{col}" for col in transactions_time.columns.to_flat_index()
]

# Compute transaction frequencies over cumulative periods (each period starts 6 columns, ~36 hours, later)
frequences_period = pd.DataFrame({
    f'period_{i//6 + 1}': transactions_time.iloc[:, i+1:].sum(axis=1)
    for i in range(0, 119, 6)
})

# Compute recency: Time elapsed since last transaction (714 - max transaction time per person)
recency = (714 - transactions.groupby('person')['time'].max()).reset_index(drop=True)
recency.name = 'recency'
# Compute average transaction value per user
value = transactions.groupby('person')['amount'].mean().reset_index(drop=True)

# Define churn as 1 if no transactions occurred in the last three periods
churn = pd.Series((frequences_period.iloc[:, -3:].sum(axis=1) == 0).astype(int), name='churn')

# Alternative churn definition: 1 if recency is greater than 96 hours
churn2 = pd.Series((recency > 96).astype(int), name='churn2')

# Combine all features into a final churn prediction table
churn_table = pd.concat([
    transactions_time['time_tran_person'], frequences_period, recency, value, churn, churn2
], axis=1)

churn_table = profile.merge(churn_table, left_on='id', right_on='time_tran_person', how='left').drop(columns=['time_tran_person'])

# Save churn data to a CSV file
churn_table.to_csv('medalion_data_store/gold/churn_table.csv', index=False)

# Output the churn table
churn_table


Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group,period_1,period_2,period_3,...,period_15,period_16,period_17,period_18,period_19,period_20,recency,amount,churn,churn2
0,,118,68be06ca386d4c31939f3a4f0e3dd783,2017-02-12,,2017-02,Senior,9.0,9.0,9.0,...,6.0,4.0,3.0,2.0,1.0,1.0,18.0,2.266667,0.0,0.0
1,F,55,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0,2017-07,Middle,3.0,2.0,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,186.0,25.670000,1.0,1.0
2,,118,38fe809add3b4fcf9315a9694bb96ff5,2018-07-12,,2018-07,Senior,6.0,6.0,6.0,...,2.0,2.0,2.0,2.0,1.0,1.0,18.0,2.383333,0.0,0.0
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0,2017-05,Senior,7.0,7.0,7.0,...,2.0,0.0,0.0,0.0,0.0,0.0,180.0,22.752857,1.0,1.0
4,,118,a03223e636434f42ac4c3df47e8bac43,2017-08-04,,2017-08,Senior,3.0,3.0,3.0,...,1.0,1.0,1.0,1.0,0.0,0.0,102.0,1.550000,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16995,F,45,6d5f3a774f3d4714ab0c092238f3a1d7,2018-06-04,54000.0,2018-06,Middle,7.0,7.0,6.0,...,2.0,2.0,2.0,2.0,1.0,1.0,24.0,2.861429,0.0,0.0
16996,M,61,2cb4f97358b841b9a9773a7aa05a9d77,2018-07-13,72000.0,2018-07,Senior,7.0,7.0,7.0,...,1.0,1.0,1.0,1.0,1.0,0.0,60.0,3.710000,0.0,0.0
16997,M,49,01d26f638c274aa0b965d24cefe3183f,2017-01-26,73000.0,2017-01,Middle,8.0,8.0,8.0,...,2.0,2.0,2.0,1.0,1.0,0.0,42.0,4.967500,0.0,0.0
16998,F,83,9dc1421481194dcd9400aec7c9ae6366,2016-03-07,50000.0,2016-03,Senior,14.0,13.0,13.0,...,6.0,4.0,4.0,3.0,2.0,1.0,30.0,13.547857,0.0,0.0


In [71]:
transactions_time

Unnamed: 0,time_tran_person,time_tran_0,time_tran_6,time_tran_12,time_tran_18,time_tran_24,time_tran_30,time_tran_36,time_tran_42,time_tran_48,...,time_tran_660,time_tran_666,time_tran_672,time_tran_678,time_tran_684,time_tran_690,time_tran_696,time_tran_702,time_tran_708,time_tran_714
0,0009655768c64bdeb2e877511632db8f,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,1,0,0,0
1,00116118485d4dfda04fdbaba9a87b5c,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0011e0d4e6b944f998e987f904e8c1e5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0020c2b971eb4e9188eac86d93036a77,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,0020ccbbb6d84e358d3414a3ff76cffd,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16573,fff3ba4757bd42088c044ca26d73817a,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16574,fff7576017104bcc8677a8d63322b5e1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
16575,fff8957ea8b240a6b5e634b6ee8eafcf,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16576,fffad4f4828548d1b5583907f2e9906b,0,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0


In [66]:
churn_table[churn_table['time_tran_person'] == '68be06ca386d4c31939f3a4f0e3dd783'] 

Unnamed: 0,time_tran_person,time_tran_0,time_tran_6,time_tran_12,time_tran_18,time_tran_24,time_tran_30,time_tran_36,time_tran_42,time_tran_48,...,period_15,period_16,period_17,period_18,period_19,period_20,time,amount,churn,churn2
6799,68be06ca386d4c31939f3a4f0e3dd783,0,0,0,0,0,0,0,0,0,...,6,4,3,2,1,1,18,2.266667,0,0
